pandas groupby keeping other columns - pandas

This question is similar to this one, but in my case I need to apply a function that returns a Series rather than a single value for each group — that question is about aggregating with sum, but I need to use rank (so the difference is like that between agg and transform).
I have data on firms over time. This generates some dummy data that looks like my use case:
import numpy as np
import pandas as pd
dates = pd.date_range('1926', '2020', freq='M')
ndates = len(dates)
nfirms = 5000
cols = list('ABCDE')
df = pd.DataFrame(np.random.randn(nfirms*ndates,len(cols)),
index=np.tile(dates,nfirms),
columns=cols)
df.insert(0, 'id', np.repeat(np.arange(nfirms), ndates))
I need to calculate ranks of column E within each date (the index), but keeping column id.
If I just use groupby and .rank I get this:
df.groupby(level=0)['E'].rank()
1926-01-31 3226.0
1926-02-28 1042.0
1926-03-31 1611.0
1926-04-30 2591.0
1926-05-31 30.0
...
2019-08-31 1973.0
2019-09-30 227.0
2019-10-31 4381.0
2019-11-30 1654.0
2019-12-31 1572.0
Name: E, Length: 5640000, dtype: float64
This has the same dimension as df but I'm not sure it's safe to merge on the index — I really need to join on the id column also. Can I assume that the order remains the same?
If the order in the output is the same as in the input, I think I can do this:
df['ranks'] = df.groupby(level=0)['E'].rank()
But something about this seems strange, and I assume there is a way to include additional columns in the groupby output.
(I'm also not clear if calling .rank() is equivalent to .transform('rank').)
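Both open points can be checked on a toy frame. The following is only a sketch, and the names small, ranks and via_transform are made up for illustration. It relies on the fact that groupby(...).rank() is a transformation: the result keeps the input's index and row order, so direct assignment lines up row by row, and it matches .transform('rank').

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=3, freq="D")
nfirms = 4

# Same layout as the question: dates tiled in the index, one block per firm id.
small = pd.DataFrame(
    {
        "id": np.repeat(np.arange(nfirms), len(dates)),
        "E": rng.standard_normal(nfirms * len(dates)),
    },
    index=np.tile(dates, nfirms),
)

ranks = small.groupby(level=0)["E"].rank()
via_transform = small.groupby(level=0)["E"].transform("rank")

# The result carries the input's index in the input's row order ...
same_index = ranks.index.equals(small.index)
# ... and .rank() agrees with .transform("rank").
same_values = ranks.equals(via_transform)

# So the assignment from the question pairs each rank with the right id.
small["ranks"] = ranks
print(small.sort_index().head(nfirms))
```

Because the ranks come back in the original row order, no merge on id is needed in this setup.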

Related

Why assign method in pandas method chaining behave differently if it is applied in chain after group by?

I am trying to chain some methods in pandas, but it seems the order of the methods is restrictive. Let me explain this with the mpg data.
In two of the below options, I have changed the order of the assign method. In option 1, it is before group by and it works as expected. While in option 2, it is after group by and it produces garbage output. In R/tidyverse I could simply do ungroup() and use mutate() either before or after group by and it would still produce the same output.
import pandas as pd
import seaborn as sns
df = sns.load_dataset("mpg")
Option 1
(
df
.assign(origin=df.origin.map({'europe':'Europe'}).fillna(df.origin))
.query(("origin=='Europe' & model_year==80"))
.groupby(['origin','cylinders'],dropna=False)
.mpg
.sum()
.reset_index()
)
Option 2
(
df
.query(("origin=='europe' & model_year==80"))
.groupby(['origin','cylinders'],dropna=False)
.mpg
.sum()
.reset_index()
.assign(origin=df.origin.map({'europe':'Europe'}).fillna(df.origin))
)
The whole thing can also be done quite neatly without method chaining in Pandas but I am trying to see if I can make method chaining work for myself.
How can I ensure assign method in above two options produce same output regardless of where it is in the chain of methods?
The key thing here is actually the .reset_index(). In the original data, the first two rows have "usa" as their origin, so those get applied to the transformed data.
To illustrate, we can join (on the index):
tra = (
df
.query("origin=='europe' & model_year==80")
.groupby(['origin', 'cylinders'], dropna=False)
['mpg'].sum()
.reset_index()
)
tra.join(df['origin'], rsuffix='_2')
origin cylinders mpg origin_2
0 europe 4 299.2 usa
1 europe 5 36.4 usa
To fix it, you could use a lambda to make use of the transformed data (as sammywemmy wrote in a comment):
tra.assign(origin=lambda df_:
df_['origin'].map({'europe':'Europe'}).fillna(df_['origin'])
)
origin cylinders mpg
0 Europe 4 299.2
1 Europe 5 36.4
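The same pitfall can be reproduced without seaborn. This is a minimal sketch on a made-up four-row frame (cars, summary and relabel are illustrative names, not from the question): a Series taken from the original frame is aligned on the index labels of the aggregated frame, while a lambda is evaluated on the frame that assign is actually called on.

```python
import pandas as pd

cars = pd.DataFrame({
    "origin": ["usa", "usa", "europe", "europe"],
    "mpg": [18.0, 15.0, 30.0, 36.0],
})

summary = (
    cars
    .query("origin == 'europe'")
    .groupby("origin")["mpg"]
    .sum()
    .reset_index()          # fresh 0..n-1 index, unrelated to the rows of `cars`
)

relabel = {"europe": "Europe"}

# Series built from `cars`: aligned by index label, so row 0 of `summary`
# receives the value from row 0 of `cars`, which is a "usa" row.
misaligned = summary.assign(
    origin=cars["origin"].map(relabel).fillna(cars["origin"])
)

# Lambda: receives `summary` itself, so position in the chain no longer matters.
aligned = summary.assign(
    origin=lambda d: d["origin"].map(relabel).fillna(d["origin"])
)

print(misaligned)
print(aligned)
```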

How to calculate pearsonr (and correlation significance) with pandas groupby?

I would like to do a groupby correlation using pandas and pearsonr.
Currently I have:
df = pd.DataFrame(np.random.randint(0,10,size=(1000, 4)), columns=list('ABCD'))
df.groupby(['A','B'])[['C','D']].corr().unstack().iloc[:,1]
However I would like to calculate the correlation significance using pearsonr (scipy package) like this:
from scipy.stats import pearsonr
corr,pval= pearsonr(df['C'],df['D'])
How do I combine the groupby with the pearsonr, something like this:
corr,val=df.groupby(['A','B']).agg(pearsonr(['C','D']))
If I understand, you need to perform the Pearson's test between C and D for any combination of A and B.
To carry out this task you need to groupby(['A','B']), as you have already done. Your grouped dataframe is now a "set" of dataframes (one dataframe for each A,B combination), so you can apply stats.pearsonr to each of these dataframes through the apply method. To get two distinct columns for the test statistic (r, the correlation coefficient) and for the p-value, wrap the output of pearsonr in a pd.Series.
from scipy import stats
df.groupby(['A','B']).apply(lambda d:pd.Series(stats.pearsonr(d.C, d.D), index=["corr", "pval"]))
The output is:
corr pval
A B
0 0 -0.318048 0.404239
1 0.750380 0.007804
2 -0.536679 0.109723
3 -0.160420 0.567917
4 -0.479591 0.229140
.. ... ...
9 5 0.218743 0.602752
6 -0.114155 0.662654
7 0.053370 0.883586
8 -0.436360 0.091069
9 -0.047767 0.882804
[100 rows x 2 columns]
One more piece of advice: adjust the p-values to limit false positives, since you are repeating the test many times (here corr_df is the result of the groupby/apply above):
corr_df["qval"] = p_adjust_bh(corr_df.pval)
I used the p_adjust_bh function from here (answer from @Eric Talevich)
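For readers without the linked answer at hand, here is a rough sketch of a Benjamini-Hochberg adjustment in NumPy. The name bh_adjust is hypothetical and this is not the linked p_adjust_bh, only a stand-in that follows the standard step-up procedure.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (hypothetical stand-in helper)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)[::-1]            # positions from largest to smallest p
    ranks = np.arange(n, 0, -1)            # matching ranks n, n-1, ..., 1
    scaled = p[order] * n / ranks          # p * n / rank
    # Enforce monotonicity going from the largest p down, and cap at 1.
    adjusted = np.minimum(np.minimum.accumulate(scaled), 1.0)
    out = np.empty(n)
    out[order] = adjusted                  # restore the original order
    return out

example = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(example)
```

With a helper like this, the line above would read corr_df["qval"] = bh_adjust(corr_df.pval).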

pandas groupby returns multiindex with two more aggregates

When grouping by a single column, and using as_index=False, the behavior is expected in pandas. However, when I use .agg, as_index no longer appears to behave as expected. In short, it doesn't appear to matter.
# imports
import pandas as pd
import numpy as np
# set the seed
np.random.seed(834)
df = pd.DataFrame(np.random.rand(10, 1), columns=['a'])
df['letter'] = np.random.choice(['a','b'], size=10)
summary = df.groupby('letter', as_index=False).agg([np.count_nonzero, np.mean])
summary
returns:
a
count_nonzero mean
letter
a 6.0 0.539313
b 4.0 0.456702
I would have expected the index to be 0, 1, with letter as a regular column in the dataframe.
In summary, I want to be able to group by one or more columns, summarize a single column with multiple aggregates, and return a dataframe that does not have the group by columns as the index, nor a Multi Index in the column.
The comment from @Trenton did the trick.
summary = df.groupby('letter')['a'].agg([np.count_nonzero, np.mean]).reset_index()
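An alternative that may be worth trying is named aggregation, which yields flat column names and works together with as_index=False. This is a sketch on made-up data. Note that it swaps np.count_nonzero for the built-in "count" aggregate, which counts non-null values rather than non-zero ones.

```python
import pandas as pd

df = pd.DataFrame({
    "letter": ["a", "b", "a", "b"],
    "a": [1.0, 2.0, 3.0, 4.0],
})

# Each keyword becomes an output column: name=(input column, aggregate).
summary = df.groupby("letter", as_index=False).agg(
    n=("a", "count"),
    mean=("a", "mean"),
)
print(summary)
```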

Manipulating duplicate rows across a subset of columns in dataframe pandas

Suppose I have a dataframe as follows:
df = pd.DataFrame({"user":[11,11,11,21,21,21,21,21,32,32],
"event":[0,0,1,0,0,1,1,1,0,0],
"datetime":['05:29:54','05:32:04','05:32:08',
'15:35:26','15:36:07','15:36:16','15:36:50','15:36:54',
'09:29:12', '09:29:25'] })
I would like to handle the repetitive lines across the first column (user) to reach the following.
In this case, we replace the 'event' column with the maximum value related in the 'user' column (for example for user=11, the maximum value for event is 1). And the third column is replaced by the average of the datetime.
P.S. It has been already discussed about dropping the repetitive rows here, however, I do not want to drop rows blindly. Especially when I am dealing with a dataframe with a lot of attributes.
You want to groupby and aggregate
df.groupby('user').agg({'event': 'max',
'datetime': lambda s: pd.to_timedelta(s).mean()})
If you want, you can also just change your datetime column first to timedelta using pd.to_timedelta and just take the mean in the agg
You can use str to represent the way you intend
df.groupby('user').agg({'event': 'max',
'datetime': lambda s: str(pd.to_timedelta(s).mean().to_pytimedelta())})
You can convert the datetimes to native integers, aggregate with mean, then convert back and use strftime to get HH:MM:SS strings:
df['datetime'] = pd.to_datetime(df['datetime']).astype(np.int64)
df1 = df.groupby('user', as_index=False).agg({'event':'max', 'datetime':'mean'})
df1['datetime'] = pd.to_datetime(df1['datetime']).dt.strftime('%H:%M:%S')
print (df1)
user event datetime
0 11 1 05:31:22
1 21 1 15:36:18
2 32 0 09:29:18
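The timedelta route can also produce HH:MM:SS strings directly. Below is a sketch using the question's data. The helper mean_clock_time is a made-up name, and flooring the mean to whole seconds is an assumption chosen to mirror the truncation that strftime does above.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [11, 11, 11, 21, 21, 21, 21, 21, 32, 32],
    "event": [0, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    "datetime": ['05:29:54', '05:32:04', '05:32:08',
                 '15:35:26', '15:36:07', '15:36:16', '15:36:50', '15:36:54',
                 '09:29:12', '09:29:25'],
})

def mean_clock_time(times):
    # Average the times as durations since midnight, drop fractional seconds,
    # and keep only the HH:MM:SS part of "0 days HH:MM:SS".
    mean = pd.to_timedelta(times).mean().floor("s")
    return str(mean).split()[-1]

out = df.groupby("user", as_index=False).agg(
    event=("event", "max"),
    datetime=("datetime", mean_clock_time),
)
print(out)
```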

pandas resample when cumulative function returns data frame

I would like to use resampling function from pandas but applying my own custom function. The problem I'm facing is that the custom function returns a pandas Data Frame instead of a single array.
The following example illustrate my problem:
>>> import pandas as pd
>>> import numpy as np
>>> def f(data):
...     return ((1+data).cumprod(axis=0)-1)
...
>>> data = np.random.randn(1000,3)
>>> index = pd.date_range("20170101", periods = 1000, freq="B")
>>> df = pd.DataFrame(data= data, index =index)
Now suppose I want to resample the business days to business end month frequency:
>>> resampler = df.resample("BM")
If I now apply my function f, I don't get the desired result. I would like to get the last row of the output of f.
>>> resampler.apply(f)
This is because the cumprod in my function f returns a pandas DataFrame. I could write f so that it returns just the last row. However, I would like to use this function in other places as well, where it should return the whole DataFrame. This could be solved by introducing a flag like "last_row" in f that controls whether the complete frame or just the last row is returned, but this solution seems rather nasty.
Just define your function f with a last_row parameter. You can default it to False so that it returns the entire dataframe. When True it returns the last row
def f(data, last_row=False):
    df = ((1+data).cumprod(axis=0)-1)
    if last_row:
        return df.iloc[-1]
    return df
Get the last row
df.resample('BM').apply(f, last_row=True)
0 1 2
2017-01-31 0.185662 -0.580058 -1.004879
2017-02-28 -1.004035 -0.999878 17.059846
2017-03-31 -0.995280 -1.000001 -1.000507
2017-04-28 -1.000656 -240.369487 -1.002645
2017-05-31 47.646827 -72.042190 -1.000016
....
Return all the rows as you already did.
df.resample('BM').apply(f)
I think you could refactor in the following way, which will be much faster for larger dataframes:
(1+df).resample('BM').prod() - 1
0 1 2
2017-01-31 -0.999436 -1.259078 -1.000215
2017-02-28 -1.221404 0.342863 9.841939
2017-03-31 -0.820196 -1.002598 -0.450662
2017-04-28 -1.000299 2.739184 -1.035557
2017-05-31 -0.999986 -0.920445 -2.103289
That gives the same answer as @TedPetrou's approach, although you can't tell here because we used different random seeds; you can easily test this yourself. I'm still sorting out why prod() gives the same answer as cumprod(), so this is a mix of intuition and reverse engineering, and I will update as I double-check things...
For this relatively small dataframe with 1,000 rows, this way is only around twice as fast, but if you increase the rows you'll find this way scales much better (about 250x faster at 10,000 rows).
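On why prod() matches: the last row of a cumulative product over a block is the product of the whole block, so taking the last row of f per month and compounding with prod() per month should agree. Below is a small check, as a sketch. It groups by calendar month with to_period("M") as a stand-in for resample('BM'), and the names returns, months, last_rows and compounded are illustrative.

```python
import numpy as np
import pandas as pd

def f(data):
    return (1 + data).cumprod(axis=0) - 1

rng = np.random.default_rng(1)
index = pd.date_range("2017-01-02", periods=60, freq="B")
returns = pd.DataFrame(rng.normal(scale=0.01, size=(60, 3)), index=index)

# One label per row; stand-in for the business-month-end bins of resample.
months = returns.index.to_period("M")

# Last row of the cumulative compounded return within each month.
last_rows = pd.DataFrame(
    {month: f(group).iloc[-1] for month, group in returns.groupby(months)}
).T

# Direct compounding of each month in a single product.
compounded = (1 + returns).groupby(months).prod() - 1

matches = np.allclose(last_rows.to_numpy(), compounded.to_numpy())
print(matches)
```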
Alternative approaches: These give different answers from the above (and from each other) but I wonder if they might be closer to what you are looking for?
(1+df).resample('BM').mean().expanding().apply( lambda x: x.prod() - 1)
(1+df).expanding().apply( lambda x: x.prod() - 1).resample('BM').mean()