Groupby with conditions - pandas

df = pd.DataFrame({'Category': ['A','B','B','B','C','C'],
                   'Subcategory': ['X','X','Y','Y','Z','Z'],
                   'Values': [1,2,3,4,5,6]})
which I summarize with groupby:
`df.groupby('Category')['Values'].agg([np.size, np.mean, np.median])`
          size  mean  median
Category
A            1   1.0     1.0
B            3   3.0     3.0
C            2   5.5     5.5
Objective: in addition to the above, add the same statistics computed only for Subcategory 'X', to create the output below:
          ALL Subcategory       Only Subcategory 'X'
          size  mean  median    size  mean  median
Category
A            1   1.0     1.0       1     1       1
B            3   3.0     3.0       1     2       2
C            2   5.5     5.5       0     0       0
My current solution is to create two groupby results, to_frame() them, and then pd.merge them. Is there a better way? Thanks!
df.groupby('Category')['Values'].agg([np.size, np.mean, np.median])
df[df['Subcategory']=='X'].groupby('Category')['Values'].agg([np.size, np.mean, np.median])
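For reference, one way to stitch the two aggregations above together without the to_frame()/merge step might be pd.concat with keys (just a sketch; the key labels and the fill_value=0 for the missing 'C' group are assumptions chosen to match the target layout):

stats = ['size', 'mean', 'median']
all_sub = df.groupby('Category')['Values'].agg(stats)
only_x = (df[df['Subcategory'] == 'X']
          .groupby('Category')['Values'].agg(stats)
          .reindex(all_sub.index, fill_value=0))   # categories with no 'X' rows become 0

# the keys become the top level of the resulting column MultiIndex
out = pd.concat([all_sub, only_x], axis=1,
                keys=['ALL Subcategory', "Only Subcategory 'X'"])
print(out)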

How to add Multilevel Columns and create a new column?

I am trying to create a "total" column in my dataframe:
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
  Room 1     Room 2     Room 3
      on off     on off     on off
0      1   4      3   6      5  15
1      3   2      1   5      1   7
For each room, I want to create a total column and then an on% column.
I have tried the following, but it does not work:
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
  Room 1                   Room 2                      Room 3
     off   on total on_pct    off   on total    on_pct    off   on total on_pct
0    4.0  1.0   5.0    0.2    6.0  3.0   9.0  0.333333   15.0  5.0  20.0  0.250
1    2.0  3.0   5.0    0.6    5.0  1.0   6.0  0.166667    7.0  1.0   8.0  0.125
Oof, this was a roughie, but you can do it like this if you want to avoid loops. Worth noting it redefines your df twice, because I need the total columns first. Sorry about that, but it's the best I could do. Also, if you have any questions, just comment.
df = pd.concat([y.assign(**{'Total {0}'.format(x+1): y.iloc[:,0] + y.iloc[:,1]})
                for x, y in df.groupby(np.arange(df.shape[1])//2, axis=1)], axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x+1): (y.iloc[:,0] / y.iloc[:,2])*100})
                for x, y in df.groupby(np.arange(df.shape[1])//3, axis=1)], axis=1)
print(df)
This groups by the first column level (rooms) and then loops through each group to add the total and percent-on columns. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
    df[(room, 'total')] = group.sum(axis=1)
    df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
  Room 1                   Room 2                    Room 3
      on off total pct_on      on off total  pct_on      on off total pct_on
0      1   4     5    0.2       3   6     9 0.333333      5  15    20  0.250
1      3   2     5    0.6       1   5     6 0.166667      1   7     8  0.125

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values - more than 5 standard deviations from the column mean.
I want to replace, per column, each such value with the maximum of the remaining values in that column.
For example,
df =
     A    B
     1    2
     1    6
     2    8
     1  115
   191    1
Will become:
df =
   A  B
   1  2
   1  6
   2  8
   1  8
   2  1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where the condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with the max per column and re-sort the frame
     A    B
0  1.0  2.0
1  1.0  6.0
2  2.0  8.0
3  1.0  8.0
4  2.0  1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do:
q = 100
df.loc[df['A'] > q, 'A'] = df.loc[df['A'] < q, 'A'].max()
df
This fixes column A:
   A    B
0  1    2
1  1    6
2  2    8
3  1  115
4  2    1
Do the same for B.
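For example, the same pattern applied to column B:
df.loc[df['B'] > q, 'B'] = df.loc[df['B'] < q, 'B'].max()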
Calculate a column-wise z-score (if you consider something an outlier when it lies more than a given number of standard deviations from the column mean) and then build a boolean mask of values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that, it's up to you to fill the values marked by the boolean mask:
df[outlier_mask] = something
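For completeness, here is a sketch that combines such a mask with the "replace by the max of the remaining values" requirement from the question, without a loop over the columns. The 1.5 threshold is an assumption purely for illustration, since none of the toy values actually lie more than 5 column standard deviations from the mean:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 191], 'B': [2, 6, 8, 115, 1]})

zscores = df.apply(lambda col: (col - col.mean()) / col.std())
outlier_mask = zscores.abs() > 1.5      # illustrative threshold; use the one you settle on
masked = df.mask(outlier_mask)          # flagged values become NaN
result = masked.fillna(masked.max())    # per column, fill with the max of the remaining values
print(result)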

aggregate dataframe horizontally

I have the following data:
inputdata = [[1,'long',30.2,'Win'],[1,'long',-12.4,'Loss'],
             [2,'short',-12.3,'Loss'],[1,'long',3.2,'Win'],
             [3,'short',0.0,'B/E'],[3,'short',23.2,'Win'],
             [3,'long',3.2,'Win'],[4,'short',-4.2,'Loss']]
datadf = pd.DataFrame(columns=['AssetId','Direction','PnL','W_L'], data=inputdata)
datadf
   AssetId Direction   PnL   W_L
0        1      long  30.2   Win
1        1      long -12.4  Loss
2        2     short -12.3  Loss
3        1      long   3.2   Win
4        3     short   0.0   B/E
5        3     short  23.2   Win
6        3      long   3.2   Win
7        4     short  -4.2  Loss
Now I want to aggregate this further into a new dataframe that looks like this mock-up (a few sample rows shown; more stats to be added):
     Stat  Long  Short  Total
0  Trades     4      4      8
1     Won     3      1      4
2    Lost     1      2      3
(...)
I tried this:
datadf.groupby(['Direction'])['PnL'].count()
Direction
long     4
short    4
Name: PnL, dtype: int64
This produces the necessary data, but I would have to fill my aggregation dataframe field by field, which seems cumbersome, and I am not even sure how to get the exact value into each row/column. Based on this example, is there a better way to achieve this goal?
You can do crosstab:
pd.crosstab(datadf['W_L'], datadf['Direction'], margins=True, margins_name='Total')
Output:
Direction  long  short  Total
W_L
B/E           0      1      1
Loss          1      2      3
Win           3      1      4
Total         4      4      8
Use pivot_table:
res = pd.pivot_table(datadf.iloc[:,1:], index=["W_L"], columns=["Direction"], aggfunc="count").droplevel(0, 1)
res["total"] = res.sum(1)
print (res.append(res.sum().rename(index="Trades")))
Direction  long  short  total
W_L
B/E         NaN    1.0    1.0
Loss        1.0    2.0    3.0
Win         3.0    1.0    4.0
Trades      4.0    4.0    8.0
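Note that DataFrame.append is deprecated (and removed in pandas 2.0); on newer versions the same total row can be attached with pd.concat, roughly:

total_row = res.sum().rename('Trades').to_frame().T
print(pd.concat([res, total_row]))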

Pandas - calculate rolling average of group excluding current row

For an example:
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
        'Date': [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
        'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
This works to calculate the rolling average, inclusive of the current row:
df['avg'] = df.groupby(['Platoon'])['Casualties'].transform(lambda x: x.rolling(2, 1).mean())
Which gives:
  Platoon  Date  Casualties  Avg
        A     1           1  1.0
        A     2           4  2.5
        A     3           5  4.5
        A     4           7  6.0
......
What I want to get is:
  Platoon  Date  Casualties  Avg
        A     1           1  1.0
        A     2           4  1.0
        A     3           5  2.5
        A     4           7  4.5
......
I suspect I can use shift here but I can't figure it out!
You need shift with bfill:
df.groupby(['Platoon'])['Casualties'].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())
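To attach the result back to the frame as a column, a transform-based variant (just a sketch) could be:

df['avg'] = (df.groupby(['Platoon'])['Casualties']
               .transform(lambda x: x.rolling(2, 1).mean().shift().bfill()))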

Binary operation broadcasting across multiindex

Can anyone explain why broadcasting across a multi-indexed Series doesn't work? Might it be a bug in pandas (0.12.0)?
x = pd.DataFrame({'year':[1,1,1,1,2,2,2,2],
                  'country':['A','A','B','B','A','A','B','B'],
                  'prod':[1,2,1,2,1,2,1,2],
                  'val':[10,20,15,25,20,30,25,35]})
x = x.set_index(['year','country','prod']).squeeze()

y = pd.DataFrame({'year':[1,1,2,2],'prod':[1,2,1,2],
                  'mul':[10,0.1,20,0.2]})
y = y.set_index(['year','prod']).squeeze()
From the description of matching/broadcasting behavior from the pandas docs I would expect to be able to multiply x and y and have the values of y broadcast across each country, giving:
>>> x.mul(y, level=['year','prod'])
year  country  prod
1     A        1       100.0
               2         2.0
      B        1       150.0
               2         2.5
2     A        1       400.0
               2         6.0
      B        1       500.0
               2         7.0
But instead, I get:
Exception: Join on level between two MultiIndex objects is ambiguous
(Note that this is a variation on the theme of this question.)
As discussed by me and @jreback in the issue opened to deal with this, a nice workaround to the problem involves doing the following:
1. Move the non-matching index level(s) to columns using unstack.
2. Perform the multiplication/division.
3. Put the non-matching index level(s) back using stack.
4. Make sure the index levels are in the same order as they were before.
Here's how it works:
In [112]: x.unstack('country').mul(y, axis=0).stack('country').reorder_levels(x.index.names)
Out[112]:
year  country  prod
1     A        1       100.0
      B        1       150.0
      A        2         2.0
      B        2         2.5
2     A        1       400.0
      B        1       500.0
      A        2         6.0
      B        2         7.0
dtype: float64
I think that's rather good, and should be pretty efficient.
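Spelled out step by step, the same workaround (a sketch mirroring the one-liner above):

tmp = x.unstack('country')               # 1. move the non-matching level to columns
tmp = tmp.mul(y, axis=0)                 # 2. multiply, aligning on ('year', 'prod')
res = tmp.stack('country')               # 3. put 'country' back into the index
res = res.reorder_levels(x.index.names)  # 4. restore the original level order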