I am performing a groupby operation on a DataFrame. On each of the groups I have to rename two columns and drop one, so that each group will have the following form:
index(timestamp) | column-x | column-y
... | .... | .....
The index is a timestamp and it will be common to each group. 'column-x' and 'column-y' instead will be different for each group. My goal is then to join all groups on the index so that I have a single DataFrame such as:
index(timestamp) | column-x1 | column-y1 | column-x2 | column-y2 | ...
... | ..... | ...... | ....... | ....... | ...
The function I apply to each group is (can I make in-place edits to the group while iterating?):
def process_ssp(df_ssp):
    sensor_name = df_ssp.iloc[0]['subsystem-sensor-parameter']  # to be used as column name
    df_ssp.rename(columns={
        'value_raw': '%s_raw' % sensor_name,
        'value_hrf': '%s_hrf' % sensor_name,
    }, inplace=True)
    df_ssp.drop('subsystem-sensor-parameter', axis='columns', inplace=True)  # since this is the column I am grouping on, I guess this isn't the right thing to do?
    return df_ssp
Then I call:
res = df_node.groupby('subsystem-sensor-parameter', as_index=False).apply(process_ssp)
Which produces the error:
ValueError: cannot reindex from a duplicate axis
EDIT:
Dataset sample https://drive.google.com/file/d/1RvPE1t3BmjeaqCNkVqGwmokCFQQp77n8/view?usp=sharing
You can first add the column subsystem-sensor-parameter to the index to create a MultiIndex, reshape by unstack, sort the MultiIndex in columns by the second level, and swap the levels' positions. Last, flatten the column MultiIndex with map and join:
res = (df_node.set_index('subsystem-sensor-parameter', append=True)
.unstack()
.sort_index(axis=1, level=1)
.swaplevel(0,1, axis=1))
res.columns = res.columns.map('_'.join)
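A minimal sketch of this approach on toy data (hypothetical sensor names and values; the column names follow the question):
import pandas as pd

df_node = pd.DataFrame({
    'timestamp': ['t1', 't1', 't2', 't2'],
    'subsystem-sensor-parameter': ['s1', 's2', 's1', 's2'],
    'value_raw': [1, 2, 3, 4],
    'value_hrf': [10, 20, 30, 40],
}).set_index('timestamp')

res = (df_node.set_index('subsystem-sensor-parameter', append=True)
              .unstack()
              .sort_index(axis=1, level=1)
              .swaplevel(0, 1, axis=1))
res.columns = res.columns.map('_'.join)
# columns are now: s1_value_hrf, s1_value_raw, s2_value_hrf, s2_value_raw
Note that unstack requires each (timestamp, subsystem-sensor-parameter) pair to be unique.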
I'm able to successfully apply your code and produce the output you want by iterating over the groups rather than using apply:
import pandas as pd
df = pd.read_csv('/Users/jeffmayse/Downloads/sample.csv')
df.set_index('timestamp', inplace=True)
def process_ssp(df_ssp):
    sensor_name = df_ssp.iloc[0]['subsystem-sensor-parameter']  # to be used as column name
    df_ssp.rename(columns={
        'value_raw': '%s_raw' % sensor_name,
        'value_hrf': '%s_hrf' % sensor_name,
    }, inplace=True)
    df_ssp.drop('subsystem-sensor-parameter', axis='columns', inplace=True)  # since this is the column I am grouping on, I guess this isn't the right thing to do?
    return df_ssp
groups = df.groupby('subsystem-sensor-parameter')
out = []
for name, group in groups:
    try:
        out.append(process_ssp(group))
    except:
        print(name)
pd.concat(out).shape
Out[7]: (16131, 114)
And in fact, the issue is in the apply method, as your function is not needed to produce the error:
df.groupby('subsystem-sensor-parameter', as_index=False).apply(lambda x: x)
raises ValueError: cannot reindex from a duplicate axis as well.
However, this statement evaluates as we'd expect:
df.reset_index(inplace=True)
df.groupby('subsystem-sensor-parameter', as_index=False).apply(process_ssp)
Out[22]:
nc-devices-alphasense_hrf ... wagman-uptime-uptime_raw
0 0 ... NaN
1 NaN ... NaN
2 NaN ... NaN
3 NaN ... NaN
...
The issue is that you have a DatetimeIndex with duplicate values. .apply is attempting to combine the result sets back together, but is not sure how to combine an index with duplicate values. At least, I believe that's it. Reset your index and try again.
Edit: to expand, you commonly see this error when trying to reindex a DatetimeIndex, e.g. you have an hourly index and want to convert it to a second-resolution index, or to fill in missing hours. You use reindex, but it will fail if your index has duplicate values. I'd guess that is what is happening here: the DataFrames produced by the applied function have duplicate index values, and the error comes from trying to produce the output by calling reindex on a DatetimeIndex with duplicates. Resetting the index works because the index is then all unique, and the timestamp column is not important to this operation.
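To illustrate that failure mode in isolation, here is a minimal sketch independent of the question's data:
import pandas as pd

idx = pd.DatetimeIndex(['2020-01-01 00:00', '2020-01-01 00:00', '2020-01-01 01:00'])
s = pd.Series([1, 2, 3], index=idx)  # note the duplicate timestamp

s.reindex(pd.date_range('2020-01-01', periods=3, freq='H'))
# raises ValueError: cannot reindex from a duplicate axis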
I am trying to import a .csv of EMG data as a DataFrame and filter each column of data using a list comprehension. Below is a dummy dataframe.
import numpy as np
import pandas as pd
from scipy.signal import butter, filtfilt

test_array = pd.DataFrame(np.random.normal(0, 2, size=(1000, 6)),
                          columns=['time', 'RF', 'VM', 'TA', 'GM', 'BF'])
b, a = butter(4, [0.05, 0.9], 'bandpass', analog=False)
columns = ['RF', 'VM', 'TA', 'GM', 'BF']
filtered_df = pd.DataFrame([filtfilt(b, a, test_array.loc[:, i]) for i in test_array[columns]])
The code above gives a version of the expected output, but instead of returning filtered_df as a (1000,5) dataframe, it is returning a (5,1000) dataframe.
I've tried using df.transpose() on the back end to fix the orientation, but it seems like there should be a more straightforward way to prevent the transposing in the first place. Is there a way to get the desired output?
This issue is related to how you are building the new DataFrame. Just passing in the list from:
[filtfilt(b,a,test_array.loc[:,i]) for i in test_array[columns]]
pandas will read that in as a DataFrame with five rows (one per filtered signal) and column names representing the indices of the NumPy arrays. If you build your dataframe using a dictionary mapped to each column name like:
results = [filtfilt(b,a,test_array.loc[:,i]) for i in test_array[columns]]
filtered_df = pd.DataFrame(data = dict(zip(columns, results)))
you get your desired result:
RF VM TA GM BF
0 -0.072520 0.025846 0.111571 0.043277 0.024290
1 -2.674829 3.139997 0.285869 -0.162487 3.759851
2 -0.521439 3.481993 0.427854 -1.411966 5.422871
3 -2.719175 5.162347 2.195120 -0.535819 -1.721818
4 0.451544 1.730292 0.930652 -2.017700 -0.926594
.. ... ... ... ... ...
995 -5.240183 -0.625118 2.176452 2.065998 1.561615
996 -3.084039 -0.017626 -0.377022 -1.996366 2.041706
997 -5.122489 1.476979 -3.219335 1.609466 -3.707151
998 -2.072177 -0.870773 0.546386 0.031297 0.247766
999 0.141538 -0.048204 -0.601213 0.499631 0.246530
[1000 rows x 5 columns]
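As an alternative, a minimal sketch (same b, a and test_array as above): apply the filter column-wise with DataFrame.apply, which preserves both the (1000, 5) orientation and the column names:
filtered_df = test_array[columns].apply(lambda col: filtfilt(b, a, col))
print(filtered_df.shape)  # (1000, 5)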
I'm getting the following warning while executing this line
new_df = df1[df2['pin'].isin(df1['vpin'])]
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df1 and df2 have only one similar column, and they do not have the same number of rows.
I want to filter df1 based on the column in df2. If df2.pin is in df1.vpin I want those rows.
There are multiple rows in df1 for same df2.pin and I want to retrieve them all.
df2:
pin  count
1    10
2    20
df1:
vpin  Column B
1     Cell 2
1     Cell 4
The command is working. I'm trying to overcome the warning.
It doesn't really make sense to use df2['pin'].isin(df1['vpin']) as a boolean mask to index df1, as this mask will have the indices of df2, hence the reindexing performed by pandas.
Use instead:
new_df = df1[df1['vpin'].isin(df2['pin'])]
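A minimal sketch with sample frames matching the tables above (hypothetical data):
import pandas as pd

df2 = pd.DataFrame({'pin': [1, 2], 'count': [10, 20]})
df1 = pd.DataFrame({'vpin': [1, 1], 'Column B': ['Cell 2', 'Cell 4']})

new_df = df1[df1['vpin'].isin(df2['pin'])]  # keeps both rows with vpin == 1, no warning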
I can't sort the pivot table in ascending order based on the columns passed in the index attribute. When the df is printed, 'Deepthy' comes first for the column Name; I need 'aarathy' to come first.
df = pd.DataFrame({'Name': ['aarathy', 'Deepthy', 'aarathy', 'aarathy'],
                   'Ship': ['everest', 'Oasis of the Seas', 'everest', 'everest'],
                   'Tracking': ['TESTTRACK003', 'TESTTRACK008', 'TESTTRACK009', 'TESTTRACK005'],
                   'Bag': ['123', '127', '129', '121']})
df=pd.pivot_table(df,index=["Name","Ship","Tracking","Bag"]).sort_index(axis=1,ascending=True)
I tried passing sort_values and sort_index(axis=1, ascending=True), but it doesn't work.
You need to convert the values to lowercase, and for the first level of sorting use the key parameter:
# helper column so your pivot_table call has values to aggregate
df['new'] = 1
df=(pd.pivot_table(df,index=["Name","Ship","Tracking","Bag"])
.sort_index(level=0,ascending=True, key=lambda x: x.str.lower()))
print (df)
new
Name Ship Tracking Bag
aarathy everest TESTTRACK003 123 1
TESTTRACK005 121 1
TESTTRACK009 129 1
Deepthy Oasis of the Seas TESTTRACK008 127 1
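For context, the key parameter of sort_index (added in pandas 1.1.0) is applied to the index values before sorting. A minimal standalone sketch:
import pandas as pd

s = pd.Series([1, 1, 1], index=['Deepthy', 'aarathy', 'banana'])
print(s.sort_index(key=lambda x: x.str.lower()))
# index order: aarathy, banana, Deepthy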
I have two data frames which look like df1 and df2 below and I want to create df3 as shown.
I could do this using a left join to get all the rows in one dataframe and then use numpy.where to see whether they match or not.
That gets me what I want, but I feel there should be a more elegant way of doing this which eliminates renaming columns, reshuffling columns in the dataframe, and then using np.where.
Is there a better way to do this?
code to reproduce dataframes:
import pandas as pd
df1=pd.DataFrame({'product':['apples','bananas','oranges','pineapples'],'price':[1,2,3,7],'quantity':[5,7,11,4]})
df2=pd.DataFrame({'product':['apples','bananas','oranges'],'price':[2,2,4],'quantity':[5,7,13]})
df3=pd.DataFrame({'product':['apples','bananas','oranges'],'price_df1':[1,2,3],'price_df2':[2,2,4],'price_match':['No','Yes','No'],'quantity_df1':[5,7,11],'quantity_df2':[5,7,13],'quantity_match':['Yes','Yes','No']})
An elegant way to do your task is to:
generate "partial" DataFrames from each source column,
and then concatenate them.
The first step is to define a function to join 2 source columns and append a "match" column:
import numpy as np

def myJoin(s1, s2):
    rv = s1.to_frame().join(s2.to_frame(), how='inner',
                            lsuffix='_df1', rsuffix='_df2')
    rv[s1.name + '_match'] = np.where(rv.iloc[:, 0] == rv.iloc[:, 1], 'Yes', 'No')
    return rv
Then, from df1 and df2, generate 2 auxiliary DataFrames setting product as the index:
wrk1 = df1.set_index('product')
wrk2 = df2.set_index('product')
And the final step is:
result = pd.concat([myJoin(wrk1[col], wrk2[col]) for col in wrk1.columns], axis=1)\
         .reset_index()
Details:
for col in wrk1.columns - generates names of columns to join.
myJoin(wrk1[col], wrk2[col]) - generates the partial result for this column from
both source DataFrames.
[…] - a list comprehension, collecting the above partial results in a list.
pd.concat(…) - concatenates these partial results into the final result.
reset_index() - converts the index (product names) into a regular column.
For your source data, the result is:
product price_df1 price_df2 price_match quantity_df1 quantity_df2 quantity_match
0 apples 1 2 No 5 5 Yes
1 bananas 2 2 Yes 7 7 Yes
2 oranges 3 4 No 11 13 No
I want to run a frequency table on each variable in my df.
def frequency_table(x):
    return pd.crosstab(index=x, columns="count")

for column in df:
    return frequency_table(column)
I got the error 'ValueError: If using all scalar values, you must pass an index'.
How can I fix this?
Thank you!
You aren't passing any data. You are just passing a column name.
for column in df:
    print(column)  # will print column names as strings
Try:
ctabs = {}
for column in df:
    ctabs[column] = frequency_table(df[column])
Then you can look at each crosstab by using the column names as keys in the ctabs dictionary.
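For example, a minimal sketch assuming df and frequency_table are defined as above:
for column, table in ctabs.items():
    print(column)
    print(table)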
for column in df:
    print(df[column].value_counts())
For example:
import pandas as pd
my_series = pd.DataFrame(pd.Series([1,2,2,3,3,3, "fred", 1.8, 1.8]))
my_series[0].value_counts()
will generate output like below:
3 3
1.8 2
2 2
fred 1
1 1
Name: 0, dtype: int64