I have a pandas DataFrame df. I would like to select the columns whose standard deviation is greater than 1. Here is what I tried:
df2 = df[df.std() >1]
df2 = df.loc[df.std() >1]
Both generated errors. What am I doing wrong?
Use df.loc[:, df.std() > 1] and it will fix it.
The first part, :, selects all rows, and the second part, df.std() > 1, is a boolean mask that selects the columns.
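To see why the column position matters: df.std() returns a Series indexed by column name, so passing it as the only key makes pandas try to use it as a row mask. A minimal sketch with made-up data (the column names and values are only for illustration):

```python
import pandas as pd

# Hypothetical data: two low-variance columns and two high-variance ones.
df = pd.DataFrame({
    "a": [1.0, 1.1, 0.9, 1.0],    # std about 0.08
    "b": [0.0, 5.0, 10.0, 15.0],  # std about 6.45
    "c": [2.0, 2.0, 2.0, 2.0],    # std 0
    "d": [-3.0, 0.0, 3.0, 6.0],   # std about 3.87
})

mask = df.std() > 1      # boolean Series indexed by column name
df2 = df.loc[:, mask]    # keep every row, only the columns where mask is True

print(list(df2.columns))  # ['b', 'd']
```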
I am getting the daily counts of rows from a dataframe using
df = df.groupby(by=df['startDate'].dt.date).count()
How can I modify this so I can also group by another column 'unitName'?
Thank you
Pass a list of keys to groupby and use GroupBy.size:
df = df.groupby([df['startDate'].dt.date, 'unitName']).size()
If you need to count non-missing values, e.g. in column col, use DataFrameGroupBy.count:
df = df.groupby([df['startDate'].dt.date, 'unitName'])['col'].count()
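The difference between the two only shows up when the counted column has missing values. A small sketch, assuming a toy frame with one NaN in col (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows for unit A on the first day, one of them with a missing col.
df = pd.DataFrame({
    "startDate": pd.to_datetime([
        "2021-01-01 08:00", "2021-01-01 09:30",
        "2021-01-01 10:00", "2021-01-02 11:00",
    ]),
    "unitName": ["A", "A", "B", "A"],
    "col": [1.0, np.nan, 3.0, 4.0],
})

keys = [df["startDate"].dt.date, "unitName"]

sizes = df.groupby(keys).size()           # rows per group, NaN included
counts = df.groupby(keys)["col"].count()  # non-missing values of col per group

print(sizes.tolist())   # [2, 1, 1]
print(counts.tolist())  # [1, 1, 1]
```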
I'm getting the following warning while executing this line
new_df = df1[df2['pin'].isin(df1['vpin'])]
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df1 and df2 have only one similar column, and they do not have the same number of rows.
I want to filter df1 based on the column in df2. If df2.pin is in df1.vpin I want those rows.
There are multiple rows in df1 for the same df2.pin and I want to retrieve them all.
df2:
pin  count
1    10
2    20

df1:
vpin  Column B
1     Cell 2
1     Cell 4
The command is working. I'm trying to overcome the warning.
It doesn't really make sense to use df2['pin'].isin(df1['vpin']) as a boolean mask to index df1, because this mask carries the index of df2, hence the reindexing performed by pandas.
Use instead:
new_df = df1[df1['vpin'].isin(df2['pin'])]
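A quick check with the tables from the question; the third df1 row (vpin 3) is a made-up extra so that there is something to filter out:

```python
import pandas as pd

df2 = pd.DataFrame({"pin": [1, 2], "count": [10, 20]})
df1 = pd.DataFrame({
    "vpin": [1, 1, 3],                            # 3 is a hypothetical non-matching value
    "Column B": ["Cell 2", "Cell 4", "Cell 6"],   # "Cell 6" is hypothetical too
})

# The mask is built from df1, so it shares df1's index and no reindexing is needed.
mask = df1["vpin"].isin(df2["pin"])
new_df = df1[mask]

print(new_df["Column B"].tolist())  # ['Cell 2', 'Cell 4']
```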
I have two dataframes.
df1 has an index list made of strings like (row1,row2,..,rown) and a column list made of strings like (col1,col2,..,colm) while df2 has k rows and 3 columns (char_1,char_2,value). char_1 contains strings like df1 indexes while char_2 contains strings like df1 columns. I only want to assign the df2 value to df1 in the right position. For example if the first row of df2 reads ['row3','col1','value2'] I want to assign value2 to df1 in the position ([2,0]) (third row and first column).
I tried to use two functions to iterate over the rows and columns of df1:
from functools import reduce

def func1(val):
    # first I convert the series to dataframe
    val = val.to_frame()
    val = val.reset_index()
    val = val.set_index('index')  # I set the index so that it's the right column

    def func2(val2):
        try:  # maybe the combination doesn't exist
            idx1 = list(cou.index[df2[char_2] == (val2.name)])  # val2.name reads col name of df1
            idx2 = list(cou.index[df2[char_1] == val2.index.values[0]])  # val2.index.values[0] reads index name of df1
            idx = list(reduce(set.intersection, map(set, [idx1, idx2])))
            idx = int(idx[0])  # final index of df2 where I need to take value to assign to df1
            check = 1
        except:
            check = 0
        if check == 1:  # if index exists
            val2[0] = df2['value'][idx]  # assign value to df1
        return val2

    val = val.apply(func2, axis=1)  # apply the function for columns
    val = val.squeeze()  # convert again to series
    return val

df1 = df1.apply(func1, axis=1)  # apply the function for rows
I made the conversion inside func1 because without this step I wasn't able to work with series keeping index and column names so I wasn't able to find the index idx in func2.
The problem is that it takes forever. df1 has shape (3'600 x 20'000) and df2 has shape (500 x 3), so it's not that much data. I really don't understand the problem. I ran the code for the first row and column to check the result; it is correct and takes about 1 second, but I've now been waiting for hours for the entire process and it still hasn't finished.
Is there a way to optimize it? As I wrote in the title I only need to run a function that keeps column and index names and works sliding the entire dataframe. Thanks in advance!
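One possible way to avoid the nested apply is to loop over nothing at all: translate the 500 (char_1, char_2) label pairs of df2 into integer positions once and assign all values in a single step. This is only a sketch on toy data, and it assumes that df1's index and columns are unique and that df1 holds a single numeric dtype; the small frames below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real df1 (3600 x 20000) and df2 (500 x 3).
df1 = pd.DataFrame(
    np.zeros((4, 3)),
    index=["row1", "row2", "row3", "row4"],
    columns=["col1", "col2", "col3"],
)
df2 = pd.DataFrame({
    "char_1": ["row3", "row1", "row9"],  # row9 does not exist in df1
    "char_2": ["col1", "col3", "col2"],
    "value": [7.0, 8.0, 9.0],
})

# Translate labels to integer positions; get_indexer returns -1 for missing labels.
row_pos = df1.index.get_indexer(df2["char_1"])
col_pos = df1.columns.get_indexer(df2["char_2"])
valid = (row_pos >= 0) & (col_pos >= 0)  # skip combinations that don't exist in df1

# Assign every valid value at once on a NumPy copy, then rebuild the frame.
values = df1.to_numpy(copy=True)
values[row_pos[valid], col_pos[valid]] = df2.loc[valid, "value"].to_numpy()
df1 = pd.DataFrame(values, index=df1.index, columns=df1.columns)

print(df1.iloc[2, 0])  # 7.0, i.e. third row and first column
```

The work here scales with the number of rows in df2 rather than with the 72 million cells of df1, which is where the original approach spends its time.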
Say I have n dataframes, df1, df2...dfn.
Finding rows that contain "bad" values in a row in a given dataframe is done by e.g.,
index1 = df1[df1.isin([np.nan, np.inf, -np.inf])]
index2 = df2[df2.isin([np.nan, np.inf, -np.inf])]
Now, dropping these bad rows from the offending dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function that expects the columns of the two (n) dataframes to be of the same length may give an error if there is bad data in one df but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two dataframe case, if in df1 date index 2009-10-09 contains a "bad" value, that same row in df2 will be dropped.
[Possible "ugly"? solution?]
I suspect that one way to do it is to merge the two (n) dataframes on date and then apply the cleanup function, so that dropping "bad" values is automatic since the entire row gets dropped? But what happens if a date is missing from one dataframe and not the other? [and they still happen to be the same length?]
First do your replace:
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
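Then concatenate with an inner join, so that only index values present in both frames are kept, and drop the rows containing NaN: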
newdf=pd.concat([df1,df2],axis=1,keys=[1,2], join='inner').dropna()
Then split it back into two dfs; here we use combine_first with dropna of the original df:
df1,df2=[s[1].loc[:,s[0]].combine_first(x.dropna()) for x,s in zip([df1,df2],newdf.groupby(level=0,axis=1))]
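If the goal is simply that every dataframe keeps only the dates that are clean in all of them, a different and simpler technique than the concat/groupby approach above is to clean each frame on its own and then intersect the surviving indexes. A minimal sketch, assuming unique date indexes and invented toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: df1 has a bad value on 2009-10-09, df2 is clean.
dates = pd.to_datetime(["2009-10-08", "2009-10-09", "2009-10-12"])
df1 = pd.DataFrame({"x": [1.0, np.inf, 3.0]}, index=dates)
df2 = pd.DataFrame({"y": [4.0, 5.0, 6.0]}, index=dates)
dfs = [df1, df2]  # works the same way for n dataframes

# Clean every frame independently.
cleaned = [d.replace([np.inf, -np.inf], np.nan).dropna() for d in dfs]

# Keep only the index values that survived in every frame.
common = cleaned[0].index
for d in cleaned[1:]:
    common = common.intersection(d.index)

df1, df2 = [d.loc[common] for d in cleaned]

print(df2["y"].tolist())  # [4.0, 6.0]
```

Because the intersection is taken on index labels, a date that is missing from one dataframe is dropped from the others as well, even if the frames happen to have the same length.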
I can't get what is possibly wrong in the way I use df.corr() function.
For a DF with 2 columns it returns only a 1x1 resulting DF.
In:
merged_df[['Citable per Capita','Citations']].corr()
Out:
(a 1x1 resulting DF)
What can be the problem here? I expected to see as many rows and columns as there were columns in the original DF.
I found the problem: the values in the first column had the wrong dtype.
To convert all the columns to numeric types, use:
df=df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
Note that apply returns a new DataFrame rather than modifying df in place. That is why the reassignment is necessary here.
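A small reproduction of the fix, with invented values. Note that it swaps errors='ignore' for errors='coerce' (unparseable values become NaN), since as far as I know errors='ignore' is deprecated in recent pandas releases; treat that substitution as my assumption rather than part of the original answer.

```python
import pandas as pd

# Hypothetical data: the first column holds numbers stored as strings (object dtype).
merged_df = pd.DataFrame({
    "Citable per Capita": ["0.1", "0.2", "0.4", "0.3"],
    "Citations": [10, 25, 40, 28],
})
cols = ["Citable per Capita", "Citations"]

# Convert each column to a numeric dtype; extra keyword arguments are passed to pd.to_numeric.
converted = merged_df[cols].apply(pd.to_numeric, errors="coerce")

corr = converted.corr()
print(corr.shape)  # (2, 2)
```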