How to calculate pearsonr (and correlation significance) with pandas groupby?

I would like to do a groupby correlation using pandas and pearsonr.
Currently I have:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 4)), columns=list('ABCD'))
df.groupby(['A','B'])[['C','D']].corr().unstack().iloc[:,1]
However, I would like to calculate the correlation significance using pearsonr (from the scipy package), like this:
from scipy.stats import pearsonr
corr, pval = pearsonr(df['C'], df['D'])
How do I combine the groupby with pearsonr? Something like this:
corr,val=df.groupby(['A','B']).agg(pearsonr(['C','D']))

If I understand correctly, you need to perform Pearson's test between C and D for each combination of A and B.
To carry out this task you need to groupby(['A','B']) as you have already done. The grouped object is then a "set" of dataframes (one for each A,B combination), so you can run stats.pearsonr on each of them through the apply method. To get two distinct columns, one for the test statistic (r, the correlation coefficient) and one for the p-value, wrap the output of pearsonr in a pd.Series:
from scipy import stats
df.groupby(['A','B']).apply(lambda d: pd.Series(stats.pearsonr(d.C, d.D), index=["corr", "pval"]))
The output is:
         corr      pval
A B
0 0 -0.318048  0.404239
  1  0.750380  0.007804
  2 -0.536679  0.109723
  3 -0.160420  0.567917
  4 -0.479591  0.229140
...       ...       ...
9 5  0.218743  0.602752
  6 -0.114155  0.662654
  7  0.053370  0.883586
  8 -0.436360  0.091069
  9 -0.047767  0.882804

[100 rows x 2 columns]
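One caveat (my addition, not part of the original answer): scipy's pearsonr needs at least two observations per group, so if some A,B combinations are very small you may want to guard against that. A minimal sketch, where the wrapper name safe_pearsonr is an illustrative choice:

import numpy as np
import pandas as pd
from scipy import stats

def safe_pearsonr(d):
    # Groups with fewer than 2 rows cannot yield a correlation.
    if len(d) < 2:
        return pd.Series([np.nan, np.nan], index=["corr", "pval"])
    return pd.Series(stats.pearsonr(d.C, d.D), index=["corr", "pval"])

corr_df = df.groupby(['A', 'B']).apply(safe_pearsonr)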
One more piece of advice: adjust the p-values to guard against false positives, since you are repeating the test many times:
corr_df["qval"] = p_adjust_bh(corr_df.pval)
I used the p_adjust_bh function from another answer (by @Eric Talevich).
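Since p_adjust_bh is not defined in this answer, here is a sketch of a Benjamini-Hochberg adjustment along the lines of that linked answer (verify against the original before relying on it):

import numpy as np

def p_adjust_bh(p):
    # Benjamini-Hochberg FDR correction for a 1-D collection of p-values.
    p = np.asarray(p, dtype=float)
    by_descend = p.argsort()[::-1]    # indices sorting p from largest to smallest
    by_orig = by_descend.argsort()    # inverse permutation, to undo the sort
    steps = float(len(p)) / np.arange(len(p), 0, -1)   # n/rank factors
    q = np.minimum(1, np.minimum.accumulate(steps * p[by_descend]))
    return q[by_orig]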

Related

Is there any method to merge multiple dataframes of different templates?

There are a total of 4 dataframes (df1 / df2 / df3 / df4).
Each dataframe has a different template, but they all contain the same key columns.
I want to merge the rows of each dataframe based on the shared column, but which function should I use? 'merge' or 'join' doesn't seem to work, and deleting the leftover columns after collecting the frames into a list seems too messy.
I want to produce the result shown in the attached image.
One option: concatenate the dataframes and then drop the unneeded columns from the combined dataframe.
df_total = pd.concat([df1, df2, df3, df4], axis=0)
df_total = df_total.drop(['Value2', 'Value3'], axis=1)  # drop returns a new frame, so assign it back
You can use reduce to get it done too.
from functools import reduce
reduce(lambda left, right: pd.merge(left, right, on=['ID','value1'], how='outer'), [df1, df2, df3, df4])[['ID','value1']]
  ID  value1
0  a       1
1  b       4
2  c       5
3  f       1
4  g       5
5  h       6
6  i       1
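Because the original frames were only shown in an image, here is a self-contained sketch with made-up data (the column names ID, value1, Value2, Value3 are assumptions carried over from the answer above) showing the concat-then-drop idea end to end:

import pandas as pd

# Hypothetical stand-ins for the frames in the question's image.
df1 = pd.DataFrame({'ID': ['a', 'b'], 'value1': [1, 4], 'Value2': [9, 9]})
df2 = pd.DataFrame({'ID': ['c', 'f'], 'value1': [5, 1], 'Value3': [7, 7]})

# Stack the rows, then keep only the shared columns.
df_total = pd.concat([df1, df2], axis=0, ignore_index=True)
df_total = df_total[['ID', 'value1']]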

pandas groupby returns multiindex with two more aggregates

When grouping by a single column with as_index=False, pandas behaves as expected. However, when I use .agg, as_index no longer appears to have any effect.
# imports
import pandas as pd
import numpy as np
# set the seed
np.random.seed(834)
df = pd.DataFrame(np.random.rand(10, 1), columns=['a'])
df['letter'] = np.random.choice(['a','b'], size=10)
summary = df.groupby('letter', as_index=False).agg([np.count_nonzero, np.mean])
summary
returns:
                    a
        count_nonzero      mean
letter
a                 6.0  0.539313
b                 4.0  0.456702
whereas I would have expected the index to be 0, 1, with letter as a column in the dataframe.
In summary, I want to group by one or more columns, summarize a single column with multiple aggregates, and return a dataframe that has neither the group-by columns as the index nor a MultiIndex in the columns.
The comment from @Trenton did the trick.
summary = df.groupby('letter')['a'].agg([np.count_nonzero, np.mean]).reset_index()
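As a possible alternative (assuming pandas >= 0.25), named aggregation gives flat column names directly, so neither reset_index nor a column MultiIndex ever appears:

summary = df.groupby('letter', as_index=False).agg(
    count_nonzero=('a', np.count_nonzero),  # output column: count_nonzero
    mean=('a', 'mean'),                     # output column: mean
)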

pandas operations over multiple axes

How can I do operations over multiple columns in one go in pandas?
For example, I would like to calculate df[['a','b']].mean(level=0) or df[['a','b']].kurtosis(level=0) (I need the level=0 as it's a multi-indexed dataframe).
But I would like to get one single number, doing the calculation over multiple axes in one go, as if A and B were merged into one single column (or series).
In numpy I believe this is possible with axis=(0, 1), but I'm unsure how it can be achieved in pandas.
Speed is very important, so apply or iterating is not a solution.
The expected result would be as follows:
np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.rand(10, 2),
    pd.MultiIndex.from_product([list('ab'), range(5)]),
    list('AB')
)
df
Out[76]:
            A         B
a 0  0.444939  0.407554
  1  0.460148  0.465239
  2  0.462691  0.016545
  3  0.850445  0.817744
  4  0.777962  0.757983
b 0  0.934829  0.831104
  1  0.879891  0.926879
  2  0.721535  0.117642
  3  0.145906  0.199844
  4  0.437564  0.100702
expected result:
df.groupby(level=0).agg(['mean']).mean(axis=1)
Out[78]:
a 0.546125
b 0.529589
dtype: float64
But it needs to be achieved in one single calculation, not as a mean of means: that may happen to work for the mean, but for other statistics the aggregate of the per-column aggregates is generally not the same as computing the statistic in one go (for example, I'm not sure the kurtosis of the kurtoses equals the kurtosis computed in one pass).
Consider the sample dataframe df
np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.rand(10, 2),
    pd.MultiIndex.from_product([list('ab'), range(5)]),
    list('AB')
)
df
            A         B
a 0  0.444939  0.407554
  1  0.460148  0.465239
  2  0.462691  0.016545
  3  0.850445  0.817744
  4  0.777962  0.757983
b 0  0.934829  0.831104
  1  0.879891  0.926879
  2  0.721535  0.117642
  3  0.145906  0.199844
  4  0.437564  0.100702
Typical Solution
Use groupby and agg
df.groupby(level=0).agg(['mean', pd.Series.kurt])
          A                   B
       mean      kurt      mean      kurt
a  0.599237 -2.885262  0.493013  0.018225
b  0.623945 -0.900488  0.435234 -3.105328
Solve Different
pd.concat([
    df.mean(level=0),
    df.kurt(level=0)
], axis=1, keys=['Mean', 'Kurt']).swaplevel(1, 0, axis=1).sort_index(axis=1)
          A                   B
       Kurt      Mean      Kurt      Mean
a -2.885262  0.599237  0.018225  0.493013
b -0.900488  0.623945 -3.105328  0.435234
This seems to work:
df.stack().mean(level=0)
Out[146]:
a 0.546125
b 0.529589
dtype: float64
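Note that the level= argument to reductions like mean() has since been removed in newer pandas; a hedged equivalent (assuming pandas >= 2.0) routes the same single-pass idea through groupby:

stacked = df.stack()                              # fold columns A and B into one long Series
stacked.groupby(level=0).mean()                   # pooled mean per outer level
stacked.groupby(level=0).apply(pd.Series.kurt)    # pooled kurtosis, computed in one go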

Aggregate over an index in pandas?

How can I aggregate (sum) over an index which I intend to map to new values? Basically I have a groupby result over two variables, and I want to regroup one of them into larger classes. The following code does this operation on s by mapping the first by-variable, but it seems too complicated:
import pandas as pd
mapping = {1: 1, 2: 1, 3: 3}
s = pd.Series([1]*6, index=pd.MultiIndex.from_arrays([[1,1,2,2,3,3],[1,2,1,2,1,2]]))
x = s.reset_index()
x["level_0"] = x.level_0.map(mapping)
result = x.groupby(["level_0", "level_1"])[0].sum()
Is there a way to write this more concisely?
There is a level= option for Series.sum(); I guess you can use that for a quite concise way to do it.
In [69]:
s.index = pd.MultiIndex.from_tuples(map(lambda x: (mapping.get(x[0]), x[1]), s.index.values))
s.sum(level=(0,1))
Out[69]:
1  1    2
   2    2
3  1    1
   2    1
dtype: int64
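A hedged alternative that skips rebuilding the index entirely: group directly on the mapped level values (this relies on Index.map accepting a dict, which it does in modern pandas):

result = s.groupby(
    [s.index.get_level_values(0).map(mapping), s.index.get_level_values(1)]
).sum()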

Calculate Arbitrary Percentile on Pandas GroupBy

Currently there is a median method on the Pandas's GroupBy objects.
Is there is a way to calculate an arbitrary percentile (see: http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html) on the groupings?
Median would be the calculation of the percentile with q=50.
You want the quantile method:
In [47]: df
Out[47]:
           A         B    C
0   0.719391  0.091693  one
1   0.951499  0.837160  one
2   0.975212  0.224855  one
3   0.807620  0.031284  one
4   0.633190  0.342889  one
5   0.075102  0.899291  one
6   0.502843  0.773424  one
7   0.032285  0.242476  one
8   0.794938  0.607745  one
9   0.620387  0.574222  one
10  0.446639  0.549749  two
11  0.664324  0.134041  two
12  0.622217  0.505057  two
13  0.670338  0.990870  two
14  0.281431  0.016245  two
15  0.675756  0.185967  two
16  0.145147  0.045686  two
17  0.404413  0.191482  two
18  0.949130  0.943509  two
19  0.164642  0.157013  two
In [48]: df.groupby('C').quantile(.95)
Out[48]:
            A         B
C
one  0.964541  0.871332
two  0.826112  0.969558
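As a side note, quantile also accepts a list of quantiles, returning one row per group/quantile pair, in case several percentiles are needed at once:

df.groupby('C').quantile([0.5, 0.95])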
I found another useful solution here.
If I have to use groupby, another approach can be:
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_
Using the following call, I am able to achieve the same result as the solution given by @TomAugspurger:
df.groupby('C').agg([percentile(50), percentile(95)])
With pandas >= 0.25.0 you can also use named aggregation. An example:
import numpy
import pandas as pd

df = pd.DataFrame({'A': numpy.random.randint(1, 3, size=100),
                   'C': numpy.random.randn(100)})
df.groupby('A').agg(min_val=('C', 'min'),
                    percentile_80=('C', lambda x: x.quantile(0.8)))
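For what it's worth, the percentile closure from above also slots into named aggregation, giving flat, self-describing column names (p50 and p95 are illustrative names):

df.groupby('A').agg(p50=('C', percentile(50)),
                    p95=('C', percentile(95)))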