Aggregate over an index in pandas?

How can I aggregate (sum) over an index whose values I intend to map to new values? Basically, I have a groupby result over two variables, and I want to regroup the first variable into larger classes. The following code does this on s by mapping the first by-variable, but it seems overly complicated:
import pandas as pd
mapping={1:1, 2:1, 3:3}
s=pd.Series([1]*6, index=pd.MultiIndex.from_arrays([[1,1,2,2,3,3],[1,2,1,2,1,2]]))
x=s.reset_index()
x["level_0"]=x.level_0.map(mapping)
result=x.groupby(["level_0", "level_1"])[0].sum()
Is there a way to write this more concisely?

There is a level= option for Series.sum(); you can use that for a fairly concise way to do it.
In [69]:
s.index = pd.MultiIndex.from_tuples(map(lambda x: (mapping.get(x[0]), x[1]), s.index.values))
s.sum(level=(0,1))
Out[69]:
1  1    2
   2    2
3  1    1
   2    1
dtype: int64
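Note that on newer pandas versions the level= argument of Series.sum() has been deprecated and later removed; a roughly equivalent sketch (same s and mapping as above) groups on the remapped level values directly, without rebuilding the index:
s.groupby([s.index.get_level_values(0).map(mapping),
           s.index.get_level_values(1)]).sum()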

Related

Is there any method to merge multiple dataframes of different templates?

There are a total of 4 dataframes (df1 / df2 / df3 / df4).
Each dataframe has a different template, but they all share the same key columns.
I want to merge the rows of each dataframe based on those shared columns, but which function should I use? 'merge' or 'join' doesn't seem to work, and deleting the leftover columns after collecting the frames into a list seems too messy.
I want to produce the result shown in the attached image.
One option is to concatenate the dataframes and then drop the unneeded columns from the combined dataframe:
df_total = pd.concat([df1, df2, df3, df4], axis=0)
df_total = df_total.drop(['Value2', 'Value3'], axis=1)
You can use reduce to get it done too.
from functools import reduce
reduce(lambda left, right: pd.merge(left, right, on=['ID', 'value1'], how='outer'),
       [df1, df2, df3, df4])[['ID', 'value1']]
  ID  value1
0  a       1
1  b       4
2  c       5
3  f       1
4  g       5
5  h       6
6  i       1
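For clarity, reduce just folds the list pairwise, so the call above is equivalent to nesting the merges by hand:
pd.merge(
    pd.merge(
        pd.merge(df1, df2, on=['ID', 'value1'], how='outer'),
        df3, on=['ID', 'value1'], how='outer'),
    df4, on=['ID', 'value1'], how='outer')[['ID', 'value1']]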

How to calculate pearsonr (and correlation significance) with pandas groupby?

I would like to do a groupby correlation using pandas and pearsonr.
Currently I have:
df = pd.DataFrame(np.random.randint(0,10,size=(1000, 4)), columns=list('ABCD'))
df.groupby(['A','B'])[['C','D']].corr().unstack().iloc[:,1]
However I would like to calculate the correlation significance using pearsonr (scipy package) like this:
from scipy.stats import pearsonr
corr,pval= pearsonr(df['C'],df['D'])
How do I combine the groupby with the pearsonr, something like this:
corr,val=df.groupby(['A','B']).agg(pearsonr(['C','D']))
If I understand correctly, you need to perform Pearson's test between C and D for every combination of A and B.
To carry out this task, groupby(['A','B']) as you have already done. Your grouped dataframe is then a "set" of dataframes (one dataframe for each A,B combination), so you can apply stats.pearsonr to each of these dataframes through the apply method. To get two distinct columns, one for the test statistic (r, the correlation coefficient) and one for the p-value, you can wrap the output of pearsonr in a pd.Series.
from scipy import stats
df.groupby(['A','B']).apply(lambda d:pd.Series(stats.pearsonr(d.C, d.D), index=["corr", "pval"]))
The output is:
         corr      pval
A B
0 0 -0.318048  0.404239
  1  0.750380  0.007804
  2 -0.536679  0.109723
  3 -0.160420  0.567917
  4 -0.479591  0.229140
..        ...       ...
9 5  0.218743  0.602752
  6 -0.114155  0.662654
  7  0.053370  0.883586
  8 -0.436360  0.091069
  9 -0.047767  0.882804
[100 rows x 2 columns]
One more piece of advice: adjust the p-values to limit false positives, since you are repeating the test many times:
corr_df["qval"] = p_adjust_bh(corr_df.pval)
(Here corr_df is the result of the groupby/apply above.) I used the p_adjust_bh function from here (the answer from Eric Talevich).
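For reference, a sketch of that p_adjust_bh function, following the standard Benjamini-Hochberg FDR procedure from the linked answer (treat it as an illustration rather than the exact original code):
import numpy as np

def p_adjust_bh(p):
    # Benjamini-Hochberg p-value correction for multiple hypothesis testing.
    p = np.asarray(p, dtype=float)
    by_descend = p.argsort()[::-1]        # indices that sort p descending
    by_orig = by_descend.argsort()        # inverse permutation back to input order
    steps = float(len(p)) / np.arange(len(p), 0, -1)  # n / rank for each position
    q = np.minimum(1, np.minimum.accumulate(steps * p[by_descend]))
    return q[by_orig]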

pandas operations over multiple axes

How can I do operations over multiple columns in one go in pandas?
For example, I would like to calculate df[['A','B']].mean(level=0) or df[['A','B']].kurtosis(level=0) (I need the level=0 because it's a multi-indexed dataframe).
But I would like to get one single number, doing the calculation over multiple axes in one go, with A and B merged into one single column (or series).
In numpy this is possible, I believe, with axis=(0,1), but I'm unsure how to achieve it in pandas.
Speed is very important, so apply or iterating is not a solution.
The expected result would be as follows:
np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.rand(10, 2),
    pd.MultiIndex.from_product([list('ab'), range(5)]),
    list('AB')
)
df
Out[76]:
            A         B
a 0  0.444939  0.407554
  1  0.460148  0.465239
  2  0.462691  0.016545
  3  0.850445  0.817744
  4  0.777962  0.757983
b 0  0.934829  0.831104
  1  0.879891  0.926879
  2  0.721535  0.117642
  3  0.145906  0.199844
  4  0.437564  0.100702
expected result:
df.groupby(level=0).agg(['mean']).mean(axis=1)
Out[78]:
a    0.546125
b    0.529589
dtype: float64
But it needs to be achieved in a single calculation, not as a mean of means: that happens to work for the mean, but for other statistics it may not produce the same result as computing them in one go (for example, I'm not sure the kurtosis of the kurtosis equals the kurtosis computed in one pass).
Consider the sample dataframe df
np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.rand(10, 2),
    pd.MultiIndex.from_product([list('ab'), range(5)]),
    list('AB')
)
df
            A         B
a 0  0.444939  0.407554
  1  0.460148  0.465239
  2  0.462691  0.016545
  3  0.850445  0.817744
  4  0.777962  0.757983
b 0  0.934829  0.831104
  1  0.879891  0.926879
  2  0.721535  0.117642
  3  0.145906  0.199844
  4  0.437564  0.100702
Typical Solution
Use groupby and agg
df.groupby(level=0).agg(['mean', pd.Series.kurt])
          A                   B
       mean      kurt      mean      kurt
a  0.599237 -2.885262  0.493013  0.018225
b  0.623945 -0.900488  0.435234 -3.105328
Solve Different
pd.concat([
    df.mean(level=0),
    df.kurt(level=0)
], axis=1, keys=['Mean', 'Kurt']).swaplevel(1, 0, 1).sort_index(1)
          A                   B
       Kurt      Mean      Kurt      Mean
a -2.885262  0.599237  0.018225  0.493013
b -0.900488  0.623945 -3.105328  0.435234
This seems to work:
df.stack().mean(level=0)
Out[146]:
a    0.546125
b    0.529589
dtype: float64
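(On newer pandas versions, where the level= argument has been removed, the same idea reads df.stack().groupby(level=0).mean().)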

Equivalent of R's which in pandas

How do I get the column of the min in the example below, not the actual number?
In R I would do:
which(min(abs(_quantiles - mean(_quantiles))))
In pandas I tried (did not work):
_quantiles.which(min(abs(_quantiles - mean(_quantiles))))
You could do it this way: call np.min on the df values as a numpy array, use the result to build a boolean mask, and drop the columns that don't have at least one non-NaN value:
In [2]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5)})
df
Out[2]:
          a         b
0 -0.860548 -2.427571
1  0.136942  1.020901
2 -1.262078 -1.122940
3 -1.290127 -1.031050
4  1.227465  1.027870
In [15]:
df[df==np.min(df.values)].dropna(axis=1, thresh=1).columns
Out[15]:
Index(['b'], dtype='object')
idxmin and idxmax exist, but there is no general which as far as I can see. For the example above you would write:
(_quantiles - _quantiles.mean()).abs().idxmin()
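A quick sketch with a made-up Series (the name _quantiles is just a placeholder):
_quantiles = pd.Series([1.0, 4.0, 6.0, 10.0])
(_quantiles - _quantiles.mean()).abs().idxmin()  # -> 2: the value 6.0 is closest to the mean 5.25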

selecting data from pandas panel with MultiIndex

I have a DataFrame with MultiIndex, for example:
In [1]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [2]: df = DataFrame(randn(6,2), index=MultiIndex.from_tuples(list(zip(*arrays))), columns=['A','B'])
In [3]: df
Out[3]:
              A         B
one 1 -2.028736 -0.466668
    2 -1.877478  0.179211
    3  0.886038  0.679528
two 1  1.101735  0.169177
    2  0.756676 -1.043739
    3  1.189944  1.342415
Now I want to compute the means of elements 2 and 3 (index level 1) for each group (index level 0) and each column. So I need a DataFrame which would look like:
          A                              B
one 1     mean(df['A'].ix['one'][1:3])   mean(df['B'].ix['one'][1:3])
two 1     mean(df['A'].ix['two'][1:3])   mean(df['B'].ix['two'][1:3])
How do I do that without using loops over rows (index level 0) of the original data frame? What if I want to do the same for a Panel? There must be a simple solution with groupby, but I'm still learning it and can't think of an answer.
You can use the xs function to select on levels.
Starting with:
              A         B
one 1 -2.712137 -0.131805
    2 -0.390227 -1.333230
    3  0.047128  0.438284
two 1  0.055254 -1.434262
    2  2.392265 -1.474072
    3 -1.058256 -0.572943
You can then create a new dataframe using:
DataFrame({'one': df.xs('one', level=0)[1:3].apply(np.mean),
           'two': df.xs('two', level=0)[1:3].apply(np.mean)}).transpose()
which gives the result:
            A         B
one -0.171549 -0.447473
two  0.667005 -1.023508
To do the same without specifying the items in the level, you can use groupby:
grouped = df.groupby(level=0)
d = {}
for g in grouped:
    d[g[0]] = g[1][1:3].apply(np.mean)
DataFrame(d).transpose()
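A more idiomatic one-liner in the same spirit (a sketch; iloc[1:3] takes the second and third row of each group positionally, matching the [1:3] slice in the loop):
df.groupby(level=0).apply(lambda g: g.iloc[1:3].mean())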
I'm not sure about panels - it's not as well documented, but something similar should be possible
I know this is an old question, but for reference for anyone who searches and finds this page: the easier solution, I think, is the level keyword of mean:
In [4]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [5]: df = pd.DataFrame(np.random.randn(6,2), index=pd.MultiIndex.from_tuples(list(zip(*arrays))), columns=['A','B'])
In [6]: df
Out[6]:
              A         B
one 1 -0.472890  2.297778
    2 -2.002773 -0.114489
    3 -1.337794 -1.464213
two 1  1.964838 -0.623666
    2  0.838388  0.229361
    3  1.735198  0.170260
In [7]: df.mean(level=0)
Out[7]:
            A         B
one -1.271152  0.239692
two  1.512808 -0.074682
In this case it means that level 0 is kept while aggregating over axis 0 (the rows, the default axis for mean).
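On newer pandas versions, where the level keyword was removed from mean, the equivalent is an explicit groupby:
df.groupby(level=0).mean()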
Do the following:
# Specify the indices you want to work with.
idxs = [("one", elem) for elem in [2, 3]] + [("two", elem) for elem in [2, 3]]
# Compute the grouped mean over only those indices
# (.loc replaces the long-removed .ix).
df.loc[idxs].mean(level=0)
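Alternatively, a sketch using pd.IndexSlice to select level-1 labels 2 and 3 without spelling out the tuples (this assumes the MultiIndex is sorted, as in the example):
idx = pd.IndexSlice
df.loc[idx[:, 2:3], :].mean(level=0)  # on newer pandas: .groupby(level=0).mean()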