Pass pandas sub dataframe to master dataframe

I have a dataframe which I am doing some work on:
import pandas as pd
d = {'x': [2, 8, 4, -5, 4, 5, -3, 5], 'y': [-.12, .35, .3, .15, .4, -.5, .6, .57]}
df = pd.DataFrame(d)
df['x_even'] = df['x'] % 2 == 0
In a sub-dataframe, I take all rows where x is negative, then square x and multiply y by 100:
subdf = df[df.x < 0]
subdf['x'] = subdf.x ** 2
subdf['y'] = subdf.y * 100
subdf's work is completed, but I am not sure how to incorporate these changes back into the master dataframe (df).

It looks like your current code should give you a SettingWithCopyWarning, since subdf is a slice of df and assignments to it may not propagate back to the original.
To avoid this you could do the following:
df.loc[df.x < 0, 'y'] = df.loc[df.x < 0, 'y'] * 100
df.loc[df.x < 0, 'x'] = df.loc[df.x < 0, 'x'] ** 2
This changes df in place, without raising a warning, and there is nothing to merge back. Note the order: y is updated first, while df.x < 0 still picks out the original negative rows; squaring x first would make those rows positive, and the second mask would match nothing.
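For completeness, here is a minimal end-to-end sketch using the sample data from the question. Computing the boolean mask once up front means the two assignments can go in either order:
import pandas as pd

d = {'x': [2, 8, 4, -5, 4, 5, -3, 5], 'y': [-.12, .35, .3, .15, .4, -.5, .6, .57]}
df = pd.DataFrame(d)
df['x_even'] = df['x'] % 2 == 0

neg = df.x < 0                              # compute the mask once, before x changes
df.loc[neg, 'y'] = df.loc[neg, 'y'] * 100   # scale y on the negative-x rows
df.loc[neg, 'x'] = df.loc[neg, 'x'] ** 2    # square x on the same rows
print(df)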

pd.merge(subdf, df, how='outer')
This does what I was asking for. Thanks for the tip, Primer.

Related

Pandas MultiIndex manipulation

I'm not very adept at Python, but I have a "bandaid" solution to a problem and I'm trying to find out if there is a better way to do things. I have a dataframe of stocks I download with pandas_datareader. This gives me a MultiIndex df, and I'm trying to extract just the attributes that I want.
The initial df from pandas_datareader has a two-level column structure, with the attributes (Open, High, Close, ...) at the outer level and the ticker symbols at the inner level.
I'm interested in getting just the "High" and "Close" prices in this structure. To achieve this, I have done the following:
df.loc[:, ['High', 'Close']]
This gives me just those two attributes, but grouped by attribute rather than by stock. To group by stock, I tried swapping the levels and then specifying the columns I want:
newdf = df.swaplevel(axis='columns')
newdf.loc[:, [('BHP.AX', 'High'), ('BHP.AX', 'Close'), ('S32.AX', 'Close'), ('S32.AX', 'High')]]
This gives me the desired result, but it seems a very "hardcoded" and inefficient way of doing it.
Is there a more generalized way to do this? I want to be able to just specify the attributes (e.g. Close, High, etc.) and get the result for all stocks, grouped by stock rather than by attribute. This MultiIndex is not making it easy for me, so any help you can offer is appreciated.
You can use pd.IndexSlice to get this easily. Swap in your own symbols for 'ACN' and 'IT', as I tested on different stocks.
Reference: MultiIndex / advanced indexing
idx = pd.IndexSlice
data = data.loc[:, idx[('High', 'Close'), ('ACN', 'IT')]]  # edit in your own symbols
data = data.swaplevel(axis='columns')
data.sort_index(level=0, axis=1, inplace=True)
data.head()
                   ACN                      IT
                 Close        High       Close        High
Date
2020-03-31  163.259995  169.880005   99.570000  109.160004
2020-04-01  154.679993  160.820007   93.290001   96.209999
2020-04-02  156.270004  160.500000   94.099998   94.919998
2020-04-03  152.149994  158.720001   91.820000   94.290001
2020-04-06  166.050003  166.750000   99.860001  100.940002
Found a rather simple solution.
newdf = rawout.loc[:,['Close','High', 'Open']].swaplevel(axis='columns')
Using this, there is no need to specify each stock individually. I swap the levels in the code above, but that may not be required in every case.
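To make the pattern concrete, here is a self-contained sketch with a hand-built MultiIndex frame (the tickers 'AAA' and 'BBB' are placeholders, not real symbols), mimicking the pandas_datareader layout with attributes at the outer level and symbols at the inner level:
import numpy as np
import pandas as pd

# Hypothetical two-ticker frame shaped like pandas_datareader output.
cols = pd.MultiIndex.from_product([['Open', 'High', 'Close'], ['AAA', 'BBB']],
                                  names=['Attributes', 'Symbols'])
raw = pd.DataFrame(np.random.rand(3, 6), columns=cols)

# Pick the attributes once at the outer level; every symbol comes along automatically.
out = raw.loc[:, ['High', 'Close']].swaplevel(axis='columns').sort_index(axis=1)
print(out)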

Pandas dataframe being treated as a series object after using groupby

I am conducting an analysis of a dataset. To find my results, I use this line of code:
new_df = df_ncis.groupby(['state', 'year'])['totals'].mean()
The object returned by this statement is a Series, when I expected a dataframe. I don't understand why this happens or how to fix it. Also, one of the columns of the new object is missing its name. Here is the GitHub link for the project: https://github.com/louishrm/gundataUS.
Any help would be great.
You are selecting a single column with ['totals'], which returns a Series.
Try this instead:
new_df = df_ncis[['state', 'year', 'totals']].groupby(['state', 'year']).mean()
which will give you a dataframe with your 3 columns (state and year become the index).
Or, if you want it as a dataframe of one column (note the double brackets):
new_df = df_ncis.groupby(['state', 'year'])[['totals']].mean()
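A quick way to see the difference, using toy data in place of df_ncis (column names assumed from the question):
import pandas as pd

df_ncis = pd.DataFrame({'state': ['AL', 'AL', 'AK'],
                        'year': [2019, 2019, 2019],
                        'totals': [10, 20, 30]})

s = df_ncis.groupby(['state', 'year'])['totals'].mean()    # single brackets -> Series
d = df_ncis.groupby(['state', 'year'])[['totals']].mean()  # double brackets -> DataFrame
print(type(s).__name__, type(d).__name__)                  # Series DataFrame
print(d.reset_index())  # reset_index() turns the group keys back into columns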

Pandas str slice in combination with Pandas str index

I have a DataFrame containing a single column of file names. I want to find all rows in the DataFrame whose value starts with one of a set of known prefixes.
I know I could run a simple for loop, but I want to do it within the DataFrame to check speeds and run benchmarks - it's also a nice exercise.
My idea is to combine str.slice with str.index, but I can't get it to work. This is what I have in mind:
import pandas as pd
file_prefixes = {...}
file_df = pd.DataFrame({'file': list_of_file_names})
file_df.loc[file_df.file.str.slice(start=0, stop=file_df.file.str.index('/') - 1).isin(file_prefixes), :]  # this doesn't work, as str.index returns a Series rather than a scalar
My hope is that this code will return all rows whose value starts with a file prefix from the set above.
In summary, I would like help with 2 things:
Combining slice and index
Thoughts about better ways to achieve this
Thanks
I would use startswith:
file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
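A minimal runnable version, with made-up file names and prefixes standing in for the question's data:
import pandas as pd

# Hypothetical file list and prefixes, just to make the snippet runnable.
file_df = pd.DataFrame({'file': ['logs/a.txt', 'data/b.csv', 'tmp/c.bin']})
file_prefixes = {'logs/', 'data/'}

# str.startswith accepts a tuple of prefixes and returns a boolean mask.
mask = file_df.file.str.startswith(tuple(file_prefixes))
print(file_df.loc[mask, :])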

Remove NaN values from pandas dataframes inside a list

I have a number of dataframes inside a list. I am trying to remove NaN values. I tried to do it in a for loop:
for i in list_of_dataframes:
    i.dropna()
it didn't work, but Python didn't raise an error either. If I apply the code
list_of_dataframes[0] = list_of_dataframes[0].dropna()
to each dataframe individually it works, but I have too many of them. There must be a way which I just can't figure out. What are the possible solutions?
Thanks a lot
You didn't assign the new DataFrames returned by dropna() to anything, so the drop had no effect.
Try this:
for i in range(len(list_of_dataframes)):
    list_of_dataframes[i] = list_of_dataframes[i].dropna()
Or, more conveniently:
for df in list_of_dataframes:
    df.dropna(inplace=True)
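If you prefer not to mutate in place, a list comprehension that rebuilds the list is equivalent:
list_of_dataframes = [df.dropna() for df in list_of_dataframes]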

Why pandas df.drop() doesn't drop all indexes unless inplace is used

I started writing this question in another form, but found a solution in the meantime. I have a dataframe shaped 55k x 4. For a couple of hours I couldn't understand why I couldn't drop the rows I needed to drop. I had something like this:
print(df.shape)
indexes_to_drop = list()
for row in df.itertuples(index=True):
    if some_complex_function(row[1]):
        indexes_to_drop.append(row[0])
print(len(indexes_to_drop))
df = df.drop(index=indexes_to_drop)
print(df.shape)
My output was like:
55000 x 4
2500
52500 x 4
However, once I displayed some rows from my df, I was still able to find rows I thought were deleted. Of course, my first thought was to check some_complex_function. But I logged everything it did, and it was just fine.
So, I tried a couple of other ways of deleting rows by index, for example:
df = df.drop(df.index[indexes_to_drop])
Still, the shape was OK, but not the rows.
Then I tried with iterrows() instead of itertuples. Same thing.
I thought maybe there was something wrong with the indexing, you know: index number vs index label. I tested my code on a small dataframe and everything worked like a charm.
Then I realized I do some stuff with my df before I run the above code. So I reset the index first:
df.reset_index(inplace=True, drop=True)
The indexes changed and started counting from 0, but my results were still wrong.
Then I tried this:
df.drop(index=indexes_to_drop, inplace=True)
And BOOM, it worked.
Right now I'm not looking for the solution, as I apparently found one. I'd like to know WHY dropping rows without inplace didn't work. I don't get that.
Cheers!