pandas groupby by the dictionary - pandas

I've run into a problem:

import pandas
df = pandas.DataFrame({'code': ['a', 'a', 'b', 'c', 'd'],
                       'data': [3, 4, 3, 6, 7]})
mat = {'group1': ['a', 'b'], 'group2': ['a', 'c'], 'group3': ['a', 'b', 'c', 'd']}
df looks like this:

  code  data
0    a     3
1    a     4
2    b     3
3    c     6
4    d     7
I want the mean for each of group1, group2 and group3. In this example the key group1 matches the values 'a' and 'b', so I find the rows where code equals 'a' or 'b' in df; the mean of group1 is (3+4+3)/3.
group2 -> 'a','c' -> (3+4+6)/3
group3 -> 'a','b','c','d' -> (3+4+3+6+7)/5
I tried to use groupby, but it doesn't work.
Thanks!

IIUC you can do something like the following:

In [133]: rules = {
     ...:     'grp1': ['a','b'],
     ...:     'grp2': ['a','c'],
     ...:     'grp3': list('abcd')
     ...: }
     ...:
     ...: r = pd.DataFrame(
     ...:     [{r: df.loc[df.code.isin(rules[r]), 'data'].mean()}
     ...:      for r in rules]
     ...: ).stack()
     ...:
In [134]: r
Out[134]:
0 grp1 3.333333
1 grp2 4.333333
2 grp3 4.600000
dtype: float64
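If the DataFrame-plus-stack construction feels heavy, a plain dict comprehension over the same isin lookups gives the means directly as a Series (a minimal sketch of the same idea, using the question's mat with group3 written as a list):

```python
import pandas as pd

df = pd.DataFrame({'code': ['a', 'a', 'b', 'c', 'd'],
                   'data': [3, 4, 3, 6, 7]})
mat = {'group1': ['a', 'b'], 'group2': ['a', 'c'], 'group3': ['a', 'b', 'c', 'd']}

# One mean per group: select rows whose code is in the group, average their data
means = pd.Series({g: df.loc[df['code'].isin(codes), 'data'].mean()
                   for g, codes in mat.items()})
```

This produces the same three means with the group names as the index, without the intermediate NaN-filled frame.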


Calculate mean of 3rd quintile for each groupby

I have a df and I want to calculate the mean of the 3rd quintile for each group. My approach is to write a user-defined function and apply it to each group, but there are some issues. The code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': pd.Series(np.array(range(20))),
                   'B': ['a'] * 10 + ['b'] * 10})
def func_mean_quintile(df):
    # Make sure data is in a DataFrame
    df = pd.DataFrame(df)
    # Label each value with its quintile (1-5)
    df['pct'] = pd.to_numeric(pd.cut(df.iloc[:, 0], 5, labels=np.r_[1:6]))
    avg = df[df['pct'] == 3].iloc[:, 0].mean()
    return np.full((len(df)), avg)

df['C'] = df.groupby('B')['A'].apply(func_mean_quintile)
The result is NaN for all of column C.
Where is it going wrong?
Also, if you know how to make the user-defined function perform better, please help.
Thank you.
Proposed solution without a function
You do not need a function; this should do the calculation:

q_lo = 0.4   # start of 3rd quintile
q_hi = 0.6   # end of 3rd quintile

(df.groupby('B')
   .apply(lambda g: g.assign(C=g.loc[(g['A'] >= g['A'].quantile(q_lo)) &
                                     (g['A'] < g['A'].quantile(q_hi)), 'A'].mean()))
   .reset_index(drop=True)
)
output:
A B C
0 0 a 4.5
1 1 a 4.5
2 2 a 4.5
3 3 a 4.5
4 4 a 4.5
5 5 a 4.5
6 6 a 4.5
7 7 a 4.5
8 8 a 4.5
9 9 a 4.5
10 10 b 14.5
11 11 b 14.5
12 12 b 14.5
13 13 b 14.5
14 14 b 14.5
15 15 b 14.5
16 16 b 14.5
17 17 b 14.5
18 18 b 14.5
19 19 b 14.5
Your original solution
This also works if you replace the line df['C'] = ... with
df['C'] = df.groupby('B')['A'].transform(func_mean_quintile)
The reason apply gives NaN is that it returns one array per group, indexed by the group labels 'a' and 'b'; that index does not align with df's 0-19 index, so the assignment produces NaN everywhere. transform broadcasts each group's result back onto the original index.
Do it like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': pd.Series(np.array(range(20))),
                   'B': ['a'] * 10 + ['b'] * 10})

def func_mean_quintile(df):
    # Make sure data is in a DataFrame
    df = pd.DataFrame(df)
    df['pct'] = pd.to_numeric(pd.cut(df.iloc[:, 0], 5, labels=np.r_[1:6]))
    avg = df[df['pct'] == 3].iloc[:, 0].mean()
    return np.full((len(df)), avg)

means = df.groupby('B').apply(func_mean_quintile)
df.loc[df['B'] == 'a', 'C'] = means['a']
df.loc[df['B'] == 'b', 'C'] = means['b']

This will give you the required output. (Using .loc here avoids the chained-assignment problem that df['C'][df['B'] == 'a'] = ... would cause.)
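For completeness, the transform variant mentioned above can be run end to end as follows (a self-contained sketch; only the last line differs from the question's code):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': pd.Series(np.array(range(20))),
                   'B': ['a'] * 10 + ['b'] * 10})

def func_mean_quintile(s):
    # Make sure data is in a DataFrame
    d = pd.DataFrame(s)
    # Label each value with its quintile (1-5), then average quintile 3
    d['pct'] = pd.to_numeric(pd.cut(d.iloc[:, 0], 5, labels=np.r_[1:6]))
    avg = d[d['pct'] == 3].iloc[:, 0].mean()
    return np.full(len(d), avg)

# transform aligns the returned per-group arrays with the original index
df['C'] = df.groupby('B')['A'].transform(func_mean_quintile)
```

Group 'a' holds 0-9, so its 3rd quintile is {4, 5} with mean 4.5; group 'b' holds 10-19, giving 14.5.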
I think it's easier if you split it into two steps: first label each data point with the bin it falls in, then aggregate per bin. (Note that pd.cut with bins=4 gives quartiles; use bins=5 and five labels for quintiles.)

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "a": pd.Series(np.array(range(20))),
        "b": ["a"] * 10 + ["b"] * 10,
    }
)
df["a_quantile"] = pd.cut(df.a, bins=4, labels=["q1", "q2", "q3", "q4"])
df_agg = df.groupby("a_quantile").agg({"a": ["mean"]})
df_agg.head()
With the aggregation results shown below:
Out[9]:
a
mean
a_quantile
q1 2
q2 7
q3 12
q4 17

How to replace pd.NamedAgg to a code compliant with pandas 0.24.2?

Hello, I am obliged to downgrade my pandas version to '0.24.2'.
As a result, the function pd.NamedAgg is not available anymore.

import pandas as pd
import numpy as np

agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()

Can you please help me change my code to make it compliant with version 0.24.2?
Thank you a lot.
Sample:

df = pd.DataFrame({
    'A': list('a') * 6,
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7] * 6,
    'Foo': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
print (agg_df)

   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1
Because there is only one column, Foo, to process, select column Foo after the groupby and pass tuples of new column names with aggregate functions:

agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
print (agg_df)

   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1
Another idea is to pass a dictionary of lists of aggregate functions (note the generated names keep the original column's capitalisation):

agg_df = df.groupby(agg_cols).agg({'Foo': ['max', 'min']})
agg_df.columns = [f'{b}_{a}' for a, b in agg_df.columns]
agg_df = agg_df.reset_index()
print (agg_df)

   A  B  C  max_Foo  min_Foo
0  a  4  7        5        0
1  a  5  7        7        1
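Putting the first idea together, here is a runnable sketch of the 0.24.2-compatible tuple form on the sample data (with 'max'/'min' strings substituted for np.max/np.min, an equivalent choice that also avoids deprecation warnings on recent pandas):

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('a') * 6,
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7] * 6,
    'Foo': [1, 3, 5, 7, 1, 0],
})
agg_cols = ['A', 'B', 'C']

# List of (new_name, aggfunc) tuples on a selected column: valid in 0.24.2
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', 'max'), ('min_foo', 'min')]
).reset_index()
```

The group (a, 4, 7) collects Foo values [1, 5, 0] and (a, 5, 7) collects [3, 7, 1], reproducing the table above.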

Dataframe Index value

I have a dataframe:

df = pd.DataFrame({'Event1': ['Music', 'Poetry', 'Theatre', 'Comedy'],
                   'Event2': ['Poetry', 'Music', 'Dance', 'Theater']})

I need to create a new column called 'Val' that holds the index at which the element from Event2 occurs in Event1. For example, Val would be
'Val': [1, 0, NaN, 2].
Here's a way you can do it:
Solution 1 (assumes every value in Event2 also appears in Event1; if one doesn't, np.where finds no match and [0][0] raises an IndexError, so the output below is shown for such a dataframe)

import numpy as np

df['val'] = df['Event2'].apply(lambda x: np.where(x == df['Event1'])[0][0])
print(df)

    Event1   Event2  val
0    Music   Poetry    1
1   Poetry    Music    0
2  Theatre   Comedy    3
3   Comedy  Theatre    2
Solution 2 (handles values missing from Event1, such as 'Dance', by coercing them to NaN)

df = pd.DataFrame({'Event1': ['Music', 'Poetry', 'Theater', 'Comedy'],
                   'Event2': ['Poetry', 'Music', 'Dance', 'Theater']})
df['val'] = (df['Event2']
             .apply(lambda x: np.argwhere(x == df['Event1']))
             .apply(lambda x: x[0][0] if len(x) > 0 else x)
             )
df['val'] = pd.to_numeric(df['val'], errors='coerce')
print(df)

    Event1   Event2  val
0    Music   Poetry  1.0
1   Poetry    Music  0.0
2  Theater    Dance  NaN
3   Comedy  Theater  2.0
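For larger frames, a vectorized alternative (a sketch of my own, not from the answers above) is to build a lookup Series mapping each Event1 value to its positional index and pass Event2 through Series.map; unmatched values become NaN automatically:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Event1': ['Music', 'Poetry', 'Theater', 'Comedy'],
                   'Event2': ['Poetry', 'Music', 'Dance', 'Theater']})

# Lookup table: Event1 value -> its positional index
lookup = pd.Series(np.arange(len(df)), index=df['Event1'])

# Unmatched Event2 values ('Dance' here) map to NaN
df['val'] = df['Event2'].map(lookup)
```

This avoids the per-row apply entirely; the only requirement is that Event1 values are unique, since they become the lookup index.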

Lookup into dataframe from another with iloc

I have a dataframe with animals and a positional lookup column, as follows:

import pandas as pd
import numpy as np

i = ['dog', 'cat', 'bird', 'donkey'] * 100000
df1 = pd.DataFrame(np.random.randint(1, high=380, size=len(i)),
                   ['cat', 'bird', 'donkey', 'dog'] * 100000).reset_index()
df1.columns = ['animal', 'locn']
df1.head()

The dataframe to be looked up is as follows:

df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
                  columns=list('AB')).rename_axis('animal').sort_index(0).reset_index()
df

I'm looking for a faster way to assign a column with the value of B for every record in df1:

df1.assign(val=[df[df.animal == a].iloc[b].B for a, b in zip(df1.animal, df1['locn'])])

...is pretty slow.
Use GroupBy.cumcount to build a per-animal position counter, which makes a left-join merge possible:

df['locn'] = df.groupby('animal').cumcount()
df1['new'] = df1.merge(df.reset_index(), on=['animal', 'locn'], how='left')['B']
Verify on a smaller dataframe:

np.random.seed(2019)

i = ['dog', 'cat', 'bird', 'donkey'] * 100
df1 = pd.DataFrame(np.random.randint(1, high=10, size=len(i)),
                   ['cat', 'bird', 'donkey', 'dog'] * 100).reset_index()
df1.columns = ['animal', 'locn']
print (df1)

df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
                  columns=list('AB')).rename_axis('animal').sort_index(0).reset_index()

# original slow solution, for comparison
df1 = df1.assign(val=[df[df.animal == a].iloc[b].B for a, b in zip(df1.animal, df1['locn'])])

# merge solution, modifying df in place
df['locn'] = df.groupby('animal').cumcount()
df1['new'] = df1.merge(df.reset_index(), on=['animal', 'locn'], how='left')['B']

# merge solution that leaves df unchanged
locn = df.groupby('animal').cumcount()
df1 = df1.assign(new1=df1.merge(df.reset_index().assign(locn=locn),
                                on=['animal', 'locn'], how='left')['B'])
print (df1.head(10))
animal locn val new new1
0 cat 9 -0.535465 -0.535465 -0.535465
1 bird 3 0.296240 0.296240 0.296240
2 donkey 6 0.222638 0.222638 0.222638
3 dog 9 1.115175 1.115175 1.115175
4 cat 7 0.608889 0.608889 0.608889
5 bird 9 -0.025648 -0.025648 -0.025648
6 donkey 1 0.324736 0.324736 0.324736
7 dog 1 0.533579 0.533579 0.533579
8 cat 8 -1.818238 -1.818238 -1.818238
9 bird 9 -0.025648 -0.025648 -0.025648
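The same cumcount idea can also be expressed without merge, as a reindex against a MultiIndexed lookup Series (a sketch of my own, not from the answer above); this avoids materialising the merged frame:

```python
import numpy as np
import pandas as pd

np.random.seed(2019)
i = ['dog', 'cat', 'bird', 'donkey'] * 100
df1 = pd.DataFrame(np.random.randint(1, high=10, size=len(i)),
                   ['cat', 'bird', 'donkey', 'dog'] * 100).reset_index()
df1.columns = ['animal', 'locn']
df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
                  columns=list('AB')).rename_axis('animal').sort_index().reset_index()

# B values keyed by (animal, position-within-animal)
pos = df.groupby('animal').cumcount()
lookup = pd.Series(df['B'].to_numpy(),
                   index=pd.MultiIndex.from_arrays([df['animal'], pos]))

# Align each (animal, locn) pair from df1 against the lookup
df1['new'] = lookup.reindex(
    pd.MultiIndex.from_arrays([df1['animal'], df1['locn']])).to_numpy()
```

The (animal, position) pairs are unique on the lookup side, which is all reindex requires; duplicates on the df1 side are fine.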

Multi-column calculation in pandas

I've got this long algebra formula that I need to apply to a dataframe:

def experience_mod(A, B, C, D, T, W):
    E = (T - A)
    F = (C - D)
    xmod = (A + B + (E*W) + ((1-W)*F)) / (D + B + (F*W) + ((1-W)*F))
    return xmod

A = loss['actual_primary_losses']
B = loss['ballast']
C = loss['ExpectedLosses']
D = loss['ExpectedPrimaryLosses']
T = loss['ActualIncurred']
W = loss['weight']

How would I write this to calculate experience_mod() for every row? Something like this?

loss['ExperienceRating'] = loss.apply(experience_mod(A, B, C, D, T, W), axis=0)
pandas and the underlying library it uses, numpy, support vectorized operations, so given two Series (or dataframes) A and B, operations like A + B, A - B etc. are valid.
Your function therefore already works on whole columns; you don't need apply at all. Just call it on the columns directly and assign the result back to the new column ExperienceRating.
Here's a working example:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(6,6), columns=list('ABCDTW'))
In [4]: df
Out[4]:
A B C D T W
0 0.049617 0.082861 2.289549 -0.783082 -0.691990 -0.071152
1 0.722605 0.209683 -0.347372 0.254951 0.468615 -0.132794
2 -0.301469 -1.849026 -0.334381 -0.365116 -0.238384 -1.999025
3 -0.554925 -0.859044 -0.637079 -1.040336 0.627027 -0.955889
4 -2.024621 -0.539384 0.006734 0.117628 -0.215070 -0.661466
5 1.942926 -0.433067 -1.034814 -0.292179 0.744039 0.233953
In [5]: def experience_mod(A, B, C, D, T, W):
   ...:     E = (T-A)
   ...:     F = (C-D)
   ...:
   ...:     xmod = (A + B + (E*W) + ((1-W)*F))/(D + B + (F*W) + ((1-W)*F))
   ...:
   ...:     return xmod
   ...:
In [6]: experience_mod(df["A"], df["B"], df["C"], df["D"], df["T"], df["W"])
Out[6]:
0 1.465387
1 -2.060483
2 1.000469
3 1.173070
4 7.406756
5 -0.449957
dtype: float64
In [7]: df['ExperienceRating'] = experience_mod(df["A"], df["B"], df["C"], df["D"], df["T"], df["W"])
In [8]: df
Out[8]:
A B C D T W ExperienceRating
0 0.049617 0.082861 2.289549 -0.783082 -0.691990 -0.071152 1.465387
1 0.722605 0.209683 -0.347372 0.254951 0.468615 -0.132794 -2.060483
2 -0.301469 -1.849026 -0.334381 -0.365116 -0.238384 -1.999025 1.000469
3 -0.554925 -0.859044 -0.637079 -1.040336 0.627027 -0.955889 1.173070
4 -2.024621 -0.539384 0.006734 0.117628 -0.215070 -0.661466 7.406756
5 1.942926 -0.433067 -1.034814 -0.292179 0.744039 0.233953 -0.449957
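To convince yourself that the column-wise call really computes the same thing as a row-by-row evaluation, you can compare it against apply with axis=1 (far slower, but a useful sanity check; the random data below stands in for the loss dataframe):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(6, 6), columns=list('ABCDTW'))

def experience_mod(A, B, C, D, T, W):
    E = (T - A)
    F = (C - D)
    return (A + B + (E*W) + ((1-W)*F)) / (D + B + (F*W) + ((1-W)*F))

# Vectorized: one call over whole columns
vectorized = experience_mod(df['A'], df['B'], df['C'], df['D'], df['T'], df['W'])

# Row-wise: same function called once per row (slow, for verification only)
rowwise = df.apply(lambda r: experience_mod(r['A'], r['B'], r['C'],
                                            r['D'], r['T'], r['W']), axis=1)
```

Both paths run the identical arithmetic, so the two results agree element for element; the vectorized call is the one to use in practice.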