pandas: sort grouped dataframe by frequency of group members

I am interested in sorting a grouped dataframe by the number of entries for each group. As far as I can see, I can either sort by the group labels or not at all. Say I have 10 entries that belong to three groups. Group A has 6 members, group B has 3 members, and group C has 1 member. Now when I e.g. do a grouped.describe(), I would like the output to be ordered so that the group with the most entries is shown first.

I would unstack the statistics from describe(); then you can simply sort with sort_values():
import pandas as pd
from io import StringIO

incsv = StringIO("""Group,Value
B,1
B,2
B,3
C,8
A,5
A,10
A,15
A,25
A,35
A,40""")
df = pd.read_csv(incsv)
groups = df.groupby('Group').describe().unstack()
Value
count mean std min 25% 50% 75% max
Group
A 6 21.666667 14.023789 5 11.25 20 32.5 40
B 3 2.000000 1.000000 1 1.50 2 2.5 3
C 1 8.000000 NaN 8 8.00 8 8.0 8
dfstats.xs('Value', axis=1).sort_values('count', ascending=True)
count mean std min 25% 50% 75% max
Group
C 1 8.000000 NaN 8 8.00 8 8.0 8
B 3 2.000000 1.000000 1 1.50 2 2.5 3
A 6 21.666667 14.023789 5 11.25 20 32.5 40
I reversed the sort just for illustration because it was already sorted by default, but you can sort any way you want, of course.
Bonus for anyone who can sort by count without dropping or stacking the 'Value' level. :)
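For the bonus: in recent pandas, describe() on a groupby already returns the statistics as MultiIndex columns (no unstack needed), and sort() has been replaced by sort_values(), which accepts a column tuple. That lets you sort by the count while keeping the 'Value' level intact. A sketch:

```python
import pandas as pd
from io import StringIO

incsv = StringIO("""Group,Value
B,1
B,2
B,3
C,8
A,5
A,10
A,15
A,25
A,35
A,40""")
df = pd.read_csv(incsv)

# In recent pandas, describe() on a groupby yields MultiIndex columns directly
dfstats = df.groupby('Group').describe()

# Sort by the ('Value', 'count') column without dropping the 'Value' level
result = dfstats.sort_values(('Value', 'count'), ascending=False)
```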

Related

How to add Multilevel Columns and create new column?

I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
Room 1 Room 2 Room 3
on off on off on off
0 1 4 3 6 5 15
1 3 2 1 5 1 7
For each room, I want to create a 'total' column and then an 'on%' column.
I have tried the following; however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
Room 1 Room 2 Room 3
off on total on_pct off on total on_pct off on total on_pct
0 4.0 1.0 5.0 0.2 6.0 3.0 9.0 0.333333 15.0 5.0 20.0 0.250
1 2.0 3.0 5.0 0.6 5.0 1.0 6.0 0.166667 7.0 1.0 8.0 0.125
Oof, this was a rough one, but you can do it like this if you want to avoid loops. Worth noting that it redefines your df twice because I need the total columns first. Sorry about that, but it's the best I could do. Also, if you have any questions, just comment.
import numpy as np

df = pd.concat([y.assign(**{'Total {0}'.format(x + 1): y.iloc[:, 0] + y.iloc[:, 1]})
                for x, y in df.groupby(np.arange(df.shape[1]) // 2, axis=1)], axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x + 1): (y.iloc[:, 0] / y.iloc[:, 2]) * 100})
                for x, y in df.groupby(np.arange(df.shape[1]) // 3, axis=1)], axis=1)
print(df)
This groups by the columns' first level (the rooms) and then loops through each group to add the total and percent on. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
    df[(room, 'total')] = group.sum(axis=1)
    df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
Room 1 Room 2 Room 3
on off total pct_on on off total pct_on on off total pct_on
0 1 4 5 0.2 3 6 9 0.333333 5 15 20 0.250
1 3 2 5 0.6 1 5 6 0.166667 1 7 8 0.125
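The per-room sums can also be computed in one shot by grouping the transposed frame on its first index level; only the small loop that attaches the new columns remains. A sketch (the sort_index call at the end just regroups the new columns under each room):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['Room 1', 'Room 2', 'Room 3'], ['on', 'off']])
df = pd.DataFrame([[1, 4, 3, 6, 5, 15], [3, 2, 1, 5, 1, 7]], columns=idx)

# Sum each room's 'on'/'off' pair by grouping the transposed frame on level 0
totals = df.T.groupby(level=0).sum().T          # columns: Room 1, Room 2, Room 3
pct_on = df.xs('on', axis=1, level=1) / totals  # fraction switched on

for room in totals.columns:
    df[(room, 'total')] = totals[room]
    df[(room, 'pct_on')] = pct_on[room]

# Regroup columns so total/pct_on sit next to each room's on/off
df = df.sort_index(axis=1, level=0, sort_remaining=False)
```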

pandas - how to vectorize group by calculations instead of iteration

Here is a code snippet to simulate the problem I am facing. I am using iteration on large datasets:
import numpy as np
import pandas as pd

df = pd.DataFrame({'grp': np.random.choice([1, 2, 3, 4, 5], 500),
                   'col1': np.arange(0, 500),
                   'col2': np.random.randint(0, 10, 500),
                   'col3': np.nan})
for index, row in df.iterrows():
    # based on group label, get last 3 values to calculate mean
    d = df.iloc[0:index].groupby('grp')
    try:
        dgrp_sum = d.get_group(row.grp).col2.tail(3).mean()
    except KeyError:
        dgrp_sum = 999
    # after getting the last 3 values of the group relative to the current row, multiply by the other columns
    df.at[index, 'col3'] = dgrp_sum * row.col1 * row.col2
If I want to speed it up with vectors, how do I convert this code?
You basically calculate a moving average over every group.
That means you can group the dataframe by "grp" and calculate a rolling mean.
At the end you multiply the columns in each row, because that part does not depend on the group.
df["col3"] = df.groupby("grp").col2.rolling(3, min_periods=1).mean().reset_index(0,drop=True)
df["col3"] = df[["col1", "col2", "col3"]].product(axis=1)
Note: In your code, each calculated mean is placed in the next row; that's probably why you have the try block.
# Skipping last product gives only mean
# np.random.seed(1234)
# print(df[df["grp"] == 2])
grp col1 col2 iter mask
4 2 4 6 999.000000 6.000000
5 2 5 0 6.000000 3.000000
6 2 6 9 3.000000 5.000000
17 2 17 1 5.000000 3.333333
27 2 27 9 3.333333 6.333333
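To reproduce the iterative result exactly, including the 999 fill for the first row of each group, the rolling mean has to be shifted by one within each group, since the loop only looks at rows before the current one. A sketch on a small made-up frame, using transform to keep the original index:

```python
import pandas as pd

df = pd.DataFrame({'grp': [1, 2, 1, 2, 1],
                   'col1': [10, 20, 30, 40, 50],
                   'col2': [1, 2, 3, 4, 5]})

# Rolling mean of the previous (up to) 3 values within each group;
# shift() excludes the current row, matching df.iloc[0:index] in the loop
prev_mean = df.groupby('grp')['col2'].transform(
    lambda s: s.rolling(3, min_periods=1).mean().shift())

# The first row of each group has no previous values -> 999, as in the except branch
df['col3'] = prev_mean.fillna(999) * df['col1'] * df['col2']
```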

How to count the distance in cells (e.g. in indices) between two repeating values in one column in Pandas dataframe?

I have the following dataset. It lists the words that were presented to a participant in the psycholinguistic experiment (I set the order of the presentation of each word as an index):
import pandas as pd

data = {'Stimulus': ['sword', 'apple', 'tap', 'stick', 'elephant', 'boots',
                     'berry', 'apple', 'pear', 'apple', 'stick'],
        'Order': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data, columns=['Stimulus', 'Order'])
df.set_index('Order', inplace=True)
Stimulus
Order
1 sword
2 apple
3 tap
4 stick
5 elephant
6 boots
7 berry
8 apple
9 pear
10 apple
11 stick
Some values in this dataset are repeated (e.g. apple), some are not. The problem is that I need to calculate the distance in cells based on the order column between each occurrence of repeated values and store it in a separate column, like this:
Stimulus Distance
Order
1 sword NA
2 apple NA
3 tap NA
4 stick NA
5 elephant NA
6 boots NA
7 berry NA
8 apple 6
9 pear NA
10 apple 2
11 stick 7
It shouldn't be hard to implement, but I've gotten stuck. Initially, I made a dictionary of duplicates where I store the items as keys and their indices as values:
{'apple': [2,8,10],'stick': [4, 11]}
And then I failed to find a solution to put those values into a column. If there is a simpler way to do it in a loop without using dictionaries, please let me know. I will appreciate any advice.
Use df.groupby on Stimulus, then transform the Order column using pd.Series.diff:
df = df.reset_index()
df['Distance'] = df.groupby('Stimulus').transform(pd.Series.diff)
df = df.set_index('Order')
# print(df)
Stimulus Distance
Order
1 sword NaN
2 apple NaN
3 tap NaN
4 stick NaN
5 elephant NaN
6 boots NaN
7 berry NaN
8 apple 6.0
9 pear NaN
10 apple 2.0
11 stick 7.0
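Since Order is already the index, the same diff can be done without resetting it, by turning the index into a Series and grouping it by the Stimulus column. A sketch:

```python
import pandas as pd

data = {'Stimulus': ['sword', 'apple', 'tap', 'stick', 'elephant', 'boots',
                     'berry', 'apple', 'pear', 'apple', 'stick'],
        'Order': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data).set_index('Order')

# Diff the Order index within each Stimulus group, no reset_index needed
df['Distance'] = df.index.to_series().groupby(df['Stimulus']).diff()
```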

plot dataframe column on one axis and other columns as separate lines on the same plot (in different color)

I have following dataframe.
precision recall F1 cutoff
cutoff
0 0.690148 1.000000 0.814610 0
1 0.727498 1.000000 0.839943 1
2 0.769298 0.916667 0.834051 2
3 0.813232 0.916667 0.859741 3
4 0.838062 0.833333 0.833659 4
5 0.881454 0.833333 0.854946 5
6 0.925455 0.750000 0.827202 6
7 0.961111 0.666667 0.786459 7
8 0.971786 0.500000 0.659684 8
9 0.970000 0.166667 0.284000 9
10 0.955000 0.083333 0.152857 10
I want to plot the cutoff column on the x-axis and the precision, recall and F1 values as separate lines on the same plot (in different colors). How can I do it?
When I try to plot the dataframe, it also uses the cutoff column for plotting.
Thanks
Remove the column before plotting:
df.drop('cutoff', axis=1).plot()
But perhaps the real problem is how the index was created; it may help to change:
df = df.set_index(df['cutoff'])
df.drop('cutoff', axis=1).plot()
to:
df = df.set_index('cutoff')
df.plot()
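Putting it together on a small made-up frame (the column values here are just illustrative): with cutoff as the index, plot() draws one colored line per remaining column and labels each line with its column name.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so this runs as a script
import pandas as pd

df = pd.DataFrame({'precision': [0.69, 0.73, 0.77],
                   'recall': [1.00, 1.00, 0.92],
                   'F1': [0.81, 0.84, 0.83],
                   'cutoff': [0, 1, 2]})

# cutoff becomes the x-axis; each remaining column becomes its own line
ax = df.set_index('cutoff').plot()
```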

How to unite the describe() results of several dataframe columns into one dataframe?

I am applying describe() to several columns of my dataframe, for example:
raw_data.groupby("user_id").size().describe()
raw_data.groupby("business_id").size().describe()
And several more, because I want to find out how many data points there are per user on average/median/etc.
My question is: each of those calls returns what seems to be unstructured output. Is there an easy way to combine them all into a single new dataframe whose columns are [count, mean, std, min, 25%, 50%, 75%, max] and whose index is the various columns described?
Thanks!
I might simply build a new DataFrame manually. If you have
>>> raw_data
user_id business_id data
0 10 1 5
1 20 10 6
2 20 100 7
3 30 100 8
Then the results of groupby(smth).size().describe() are just another Series:
>>> raw_data.groupby("user_id").size().describe()
count 3.000000
mean 1.333333
std 0.577350
min 1.000000
25% 1.000000
50% 1.000000
75% 1.500000
max 2.000000
dtype: float64
>>> type(_)
<class 'pandas.core.series.Series'>
and so:
>>> descrs = ((col, raw_data.groupby(col).size().describe()) for col in raw_data)
>>> pd.DataFrame.from_items(descrs).T
count mean std min 25% 50% 75% max
user_id 3 1.333333 0.57735 1 1 1 1.5 2
business_id 3 1.333333 0.57735 1 1 1 1.5 2
data 4 1.000000 0.00000 1 1 1 1.0 1
Instead of from_items I could have passed a dictionary, e.g.
pd.DataFrame({col: raw_data.groupby(col).size().describe() for col in raw_data}).T, but this way the column order is preserved without having to think about it.
If you don't want all the columns, instead of for col in raw_data, you could define columns_to_describe = ["user_id", "business_id"] etc and use for col in columns_to_describe, or use for col in raw_data if col.endswith("_id"), or whatever you like.
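Note that from_items was deprecated and later removed; on Python 3.7+ a plain dict preserves insertion order, so the dictionary form now keeps the column order too. A sketch of the same idea in recent pandas, on the small example frame above:

```python
import pandas as pd

raw_data = pd.DataFrame({'user_id': [10, 20, 20, 30],
                         'business_id': [1, 10, 100, 100],
                         'data': [5, 6, 7, 8]})

# One describe() Series per column; dict insertion order sets the row order
summary = pd.DataFrame(
    {col: raw_data.groupby(col).size().describe() for col in raw_data}).T
```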