Why does the assign method in pandas method chaining behave differently when applied after groupby? - pandas

I am trying to chain some methods in pandas, but it seems like the order of methods is restrictive in pandas. Let me explain this with the mpg data.
In the two options below, I have changed the position of the assign method. In Option 1 it comes before the groupby and works as expected, while in Option 2 it comes after the groupby and produces garbage output. In R/tidyverse I could simply ungroup() and use mutate() either before or after the group_by and it would still produce the same output.
import pandas as pd
import seaborn as sns
df = sns.load_dataset("mpg")
Option 1
(
    df
    .assign(origin=df.origin.map({'europe': 'Europe'}).fillna(df.origin))
    .query("origin=='Europe' & model_year==80")
    .groupby(['origin', 'cylinders'], dropna=False)
    .mpg
    .sum()
    .reset_index()
)
Option 2
(
    df
    .query("origin=='europe' & model_year==80")
    .groupby(['origin', 'cylinders'], dropna=False)
    .mpg
    .sum()
    .reset_index()
    .assign(origin=df.origin.map({'europe': 'Europe'}).fillna(df.origin))
)
The whole thing can also be done quite neatly in pandas without method chaining, but I am trying to see if I can make method chaining work for myself.
How can I ensure the assign method in the two options above produces the same output regardless of where it sits in the chain?

The key thing here is actually the .reset_index(). In Option 2, assign still refers to the original df, and pandas aligns df.origin with the transformed frame by index. After reset_index() the transformed frame has index 0 and 1, and in the original data the rows at index 0 and 1 have "usa" as their origin, so those values get applied to the transformed data.
To illustrate, we can join (on the index):
tra = (
    df
    .query("origin=='europe' & model_year==80")
    .groupby(['origin', 'cylinders'], dropna=False)
    ['mpg'].sum()
    .reset_index()
)
tra.join(df['origin'], rsuffix='_2')
   origin  cylinders    mpg origin_2
0  europe          4  299.2      usa
1  europe          5   36.4      usa
To fix it, you could use a lambda to make use of the transformed data (as sammywemmy wrote in a comment):
tra.assign(origin=lambda df_:
    df_['origin'].map({'europe': 'Europe'}).fillna(df_['origin'])
)
   origin  cylinders    mpg
0  Europe          4  299.2
1  Europe          5   36.4
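Putting it together, Option 2 can be rewritten so that assign operates on the chained result instead of the original df (a sketch of the same chain with the lambda):
(
    df
    .query("origin=='europe' & model_year==80")
    .groupby(['origin', 'cylinders'], dropna=False)
    ['mpg'].sum()
    .reset_index()
    .assign(origin=lambda df_: df_['origin'].map({'europe': 'Europe'}).fillna(df_['origin']))
)
Because the lambda receives the frame produced by the previous step, the mapping is applied to the 'europe' rows of the aggregated result rather than to whatever sits at index 0 and 1 of the original df.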

Related

How to calculate pearsonr (and correlation significance) with pandas groupby?

I would like to do a groupby correlation using pandas and pearsonr.
Currently I have:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 4)), columns=list('ABCD'))
df.groupby(['A', 'B'])[['C', 'D']].corr().unstack().iloc[:, 1]
However I would like to calculate the correlation significance using pearsonr (scipy package) like this:
from scipy.stats import pearsonr
corr,pval= pearsonr(df['C'],df['D'])
How do I combine the groupby with the pearsonr, something like this:
corr,val=df.groupby(['A','B']).agg(pearsonr(['C','D']))
If I understand correctly, you need to perform Pearson's test between C and D for every combination of A and B.
To carry out this task you need to groupby(['A','B']), as you have already done. Your grouped dataframe is then a "set" of dataframes (one dataframe for each A,B combination), so you can apply stats.pearsonr to each of these dataframes through the apply method. To get two distinct columns, one for the test statistic (r, the correlation coefficient) and one for the p-value, you can also wrap the output of pearsonr in a pd.Series.
from scipy import stats
df.groupby(['A','B']).apply(lambda d:pd.Series(stats.pearsonr(d.C, d.D), index=["corr", "pval"]))
The output is:
          corr      pval
A B
0 0  -0.318048  0.404239
  1   0.750380  0.007804
  2  -0.536679  0.109723
  3  -0.160420  0.567917
  4  -0.479591  0.229140
..         ...       ...
9 5   0.218743  0.602752
  6  -0.114155  0.662654
  7   0.053370  0.883586
  8  -0.436360  0.091069
  9  -0.047767  0.882804

[100 rows x 2 columns]
One more piece of advice: adjust the p-values to avoid false positives, since you are repeating the test many times (corr_df below is the result of the groupby/apply above):
corr_df["qval"] = p_adjust_bh(corr_df.pval)
I used the p_adjust_bh function from here (answer from @Eric Talevich).
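In case the linked helper isn't to hand, here is a minimal sketch of a Benjamini-Hochberg adjustment (my own reconstruction; it is an assumption that the linked p_adjust_bh behaves the same way):
import numpy as np

def p_adjust_bh(p):
    # Benjamini-Hochberg FDR correction: sort p-values descending,
    # scale by n/rank, take a running minimum, then restore the original order
    p = np.asarray(p, dtype=float)
    n = len(p)
    by_descend = p.argsort()[::-1]
    by_orig = by_descend.argsort()
    steps = n / np.arange(n, 0, -1)
    q = np.minimum(1, np.minimum.accumulate(steps * p[by_descend]))
    return q[by_orig]

corr_df = df.groupby(['A', 'B']).apply(
    lambda d: pd.Series(stats.pearsonr(d.C, d.D), index=["corr", "pval"])
)
corr_df["qval"] = p_adjust_bh(corr_df.pval)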

How to get the groupby column within the apply context when using df.groupby(column).apply()?

I want to get the groupby column, i.e. the column that is supplied to df.groupby as the by argument (df.groupby(by=column)), from within the apply context that follows the groupby (df.groupby(by=column).apply(here)).
For example,
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df.groupby(['Animal']).apply(Here I want to know that groupby column is 'Animal')
df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
Of course, I could add one more line of code, or simply supply the groupby column to the apply context separately (e.g. .apply(lambda df_: some_function(df_, s='Animal'))), but I am curious whether this can be done in a single line, possibly using a pandas function built for this purpose.
I just figured out a one-liner solution:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df.groupby(['Animal']).apply(
    lambda df_: df_.apply(lambda x: all(x == df_.name)).loc[lambda x: x].index.tolist()
)
This returns the groupby column within each groupby.apply context:
Animal
Falcon [Animal]
Parrot [Animal]
Since it is quite a long one-liner (it uses 3 lambdas!), it is better to wrap it in a separate function, as shown below:
def get_groupby_column(df_):
    return df_.apply(lambda x: all(x == df_.name)).loc[lambda x: x].index.tolist()

df.groupby(['Animal']).apply(get_groupby_column)
Note of caution: this solution won't work if other columns of the dataframe also contain values from the groupby column, e.g. if the Max Speed column contained any of the values from Animal, the results would be inaccurate.
You could use grouper.names:
>>> df.groupby('Animal').grouper.names
['Animal']
With apply:
grouped = df.groupby('Animal')
grouped.apply(lambda x: grouped.grouper.names)
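If you need the names inside a more realistic apply, one option is to capture the grouped object and pass grouper.names through as an extra argument. A small sketch (summarize is a hypothetical function, not from the question):
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

def summarize(group, by):
    # `by` holds the groupby column names; `group.name` holds this group's key
    return pd.Series({'group_cols': by,
                      'key': group.name,
                      'mean_speed': group['Max Speed'].mean()})

grouped = df.groupby('Animal')
result = grouped.apply(summarize, by=grouped.grouper.names)
Since the by-list is usually known at the point where you call groupby, simply keeping it in a variable (by = ['Animal']; df.groupby(by).apply(summarize, by=by)) avoids reaching into grouper at all.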

pandas groupby returns multiindex with two more aggregates

When grouping by a single column, and using as_index=False, the behavior is expected in pandas. However, when I use .agg, as_index no longer appears to behave as expected. In short, it doesn't appear to matter.
# imports
import pandas as pd
import numpy as np

# set the seed
np.random.seed(834)

df = pd.DataFrame(np.random.rand(10, 1), columns=['a'])
df['letter'] = np.random.choice(['a', 'b'], size=10)

summary = df.groupby('letter', as_index=False).agg([np.count_nonzero, np.mean])
summary
returns:
                   a
       count_nonzero      mean
letter
a                6.0  0.539313
b                4.0  0.456702
Whereas I would have expected the index to be 0, 1 with letter as a column in the dataframe.
In summary, I want to be able to group by one or more columns, summarize a single column with multiple aggregates, and return a dataframe that has neither the groupby columns as the index nor a MultiIndex in the columns.
The comment from @Trenton did the trick.
summary = df.groupby('letter')['a'].agg([np.count_nonzero, np.mean]).reset_index()
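If you are on a pandas version with named aggregation (0.25+), another sketch that avoids both the MultiIndex columns and the reset_index is:
summary = df.groupby('letter', as_index=False).agg(
    count_nonzero=('a', np.count_nonzero),  # each keyword becomes a flat output column
    mean=('a', 'mean'),
)
Here as_index=False keeps letter as a regular column, so no reset_index is needed.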

pandas groupby keeping other columns

This question is similar to this one, but in my case I need to apply a function that returns a Series rather than a single value for each group — that question is about aggregating with sum, but I need to use rank (so the difference is like that between agg and transform).
I have data on firms over time. This generates some dummy data that looks like my use case:
import numpy as np
import pandas as pd
dates = pd.date_range('1926', '2020', freq='M')
ndates = len(dates)
nfirms = 5000
cols = list('ABCDE')
df = pd.DataFrame(np.random.randn(nfirms*ndates, len(cols)),
                  index=np.tile(dates, nfirms),
                  columns=cols)
df.insert(0, 'id', np.repeat(np.arange(nfirms), ndates))
I need to calculate ranks of column E within each date (the index), but keeping column id.
If I just use groupby and .rank I get this:
df.groupby(level=0)['E'].rank()
1926-01-31 3226.0
1926-02-28 1042.0
1926-03-31 1611.0
1926-04-30 2591.0
1926-05-31 30.0
...
2019-08-31 1973.0
2019-09-30 227.0
2019-10-31 4381.0
2019-11-30 1654.0
2019-12-31 1572.0
Name: E, Length: 5640000, dtype: float64
This has the same dimension as df but I'm not sure it's safe to merge on the index — I really need to join on the id column also. Can I assume that the order remains the same?
If the order in the output is the same as in the input, I think I can do this:
df['ranks'] = df.groupby(level=0)['E'].rank()
But something about this seems strange, and I assume there is a way to include additional columns in the groupby output.
(I'm also not clear if calling .rank() is equivalent to .transform('rank').)
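One way to check (a quick sanity-check sketch; rank_E is just an arbitrary new column name) relies on the fact that a groupby rank returns a Series aligned to the original index in the original row order, so assigning it back is safe:
ranks = df.groupby(level=0)['E'].rank()
assert ranks.index.equals(df.index)   # same index, same order as df
df['rank_E'] = ranks                  # .transform('rank') gives the same result with default arguments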

How to avoid temporary variables when creating new column via groupby.apply

I would like to create a new column newcol in a dataframe df as the result of
df.groupby('keycol').apply(somefunc)
The obvious:
df['newcol'] = df.groupby('keycol').apply(somefunc)
does not work: either df['newcol'] ends up containing all nan's (which is certainly not what the RHS evaluates to), OR some exception is raised (the details of the exception vary wildly depending on what somefunc returns).
I have tried many variations of the above, including stuff like
import pandas as pd
df['newcol'] = pd.Series(df.groupby('keycol').apply(somefunc), index=df.index)
They all fail.
The only thing that has worked requires defining an intermediate variable:
import pandas as pd
tmp = df.groupby('keycol').apply(lambda x: pd.Series(somefunc(x)))
tmp.index = df.index
df['rank'] = tmp
Is there a way to achieve this without having to create an intermediate variable?
(The documentation for GroupBy.apply is almost content-free.)
Let's build up an example and I think I can illustrate why your first attempts are failing:
Example data:
import numpy as np
import pandas as pd

n = 25
df = pd.DataFrame({'expenditure': np.random.choice(['foo', 'bar'], n),
                   'groupid': np.random.choice(['one', 'two'], n),
                   'coef': np.random.randn(n)})
print(df.head(10))
results in:
       coef expenditure groupid
0  0.874076         bar     one
1 -0.972586         foo     two
2 -0.003457         bar     one
3 -0.893106         bar     one
4 -0.387922         bar     two
5 -0.109405         bar     two
6  1.275657         foo     two
7 -0.318801         foo     two
8 -1.134889         bar     two
9  1.812964         foo     two
So if we apply a simple function, mean, to the grouped data we get the following:
df2 = df.groupby('groupid').apply(lambda g: g[['coef']].mean())
print(df2)
Which is:
             coef
groupid
one     -0.215539
two      0.149459
So the dataframe above is indexed by groupid and has one column, coef.
What you tried to do first was, effectively, the following:
df['newcol'] = df2
That gives all NaNs for newcol: the assignment aligns df2 on df's index, and since df2 is indexed by groupid ('one'/'two') rather than by 0..24, nothing matches. I think what you really want to do is merge df2 back into df.
To merge df and df2 we need to remove the index from df2, rename the new column, then merge:
df2 = df.groupby('groupid').apply(lambda g: g[['coef']].mean())
df2.reset_index(inplace=True)
df2.columns = ['groupid', 'newcol']
df.merge(df2)
which I think is what you were after.
This is such a common idiom that Pandas includes the transform method which wraps all this up into a much simpler syntax:
df['newcol'] = df.groupby('groupid')['coef'].transform('mean')
print(df.head())
results:
       coef expenditure groupid    newcol
0  1.705825         foo     one -0.025112
1 -0.608750         bar     one -0.025112
2 -1.215015         bar     one -0.025112
3 -0.831478         foo     two -0.073560
4  2.174040         bar     one -0.025112
Better documentation is here.