aggregate dataframe horizontally - pandas

I have the following data:
inputdata = [[1,'long',30.2,'Win'],[1,'long',-12.4,'Loss'],
[2,'short',-12.3,'Loss'],[1,'long',3.2,'Win'],
[3,'short',0.0,'B/E'],[3,'short',23.2,'Win'],
[3,'long',3.2,'Win'],[4,'short',-4.2,'Loss']]
datadf = DataFrame(columns=['AssetId','Direction','PnL','W_L'], data = inputdata)
datadf
AssetId Direction PnL W_L
0 1 long 30.2 Win
1 1 long -12.4 Loss
2 2 short -12.3 Loss
3 1 long 3.2 Win
4 3 short 0.0 B/E
5 3 short 23.2 Win
6 3 long 3.2 Win
7 4 short -4.2 Loss
Now I want to aggregate this further into a new dataframe that looks like this mock up (a few sample rows added, more stats to be added:
Stat Long Short Total
0 Trades 4 4 8
1 Won 3 1 4
2 Lost 1 2 3
(...)
I tried this:
datadf.groupby(['Direction'])['PnL'].count()
Direction
long 4
short 4
Name: PnL, dtype: int64
This produces the necessary data, but I would have to fill my aggregation data frame field by field, which seems cumbersome and I am not even sure how to get the exact value into each row/column. Based on this example, is there a better way to achieve this goal?

You can do crosstab:
pd.crosstab(df['W_L'], df['Direction'],margins=True, margins_name='Total')
Output:
Direction long short Total
W_L
B/E 0 1 1
Loss 1 2 3
Win 3 1 4
Total 4 4 8

Use pivot_table:
res = pd.pivot_table(df.iloc[:,1:], index=["W_L"], columns=["Direction"], aggfunc="count").droplevel(0, 1)
res["total"] = res.sum(1)
print (res.append(res.sum().rename(index="Trades")))
Direction long short total
W_L
B/E NaN 1.0 1.0
Loss 1.0 2.0 3.0
Win 3.0 1.0 4.0
Trades 4.0 4.0 8.0

Related

How to add Multilevel Columns and create new column?

I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
Room 1 Room 2 Room 3
on off on off on off
0 1 4 3 6 5 15
1 3 2 1 5 1 7
For each room, I want to create a total column and then a on% column.
I have tried the following, however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
Room 1 Room 2 Room 3
off on total on_pct off on total on_pct off on total on_pct
0 4.0 1.0 5.0 0.2 6.0 3.0 9.0 0.333333 15.0 5.0 20.0 0.250
1 2.0 3.0 5.0 0.6 5.0 1.0 6.0 0.166667 7.0 1.0 8.0 0.125
Oof this was a roughie, but you can do it like this if you want to avoid loops. Worth noting it redefines your df twice because i need the total columns. Sorry about that, but is the best i could do. Also if you have any questions just comment.
df = pd.concat([y.assign(**{'Total {0}'.format(x+1): y.iloc[:,0] + y.iloc[:,1]})for x , y in df.groupby(np.arange(df.shape[1])//2,axis=1)],axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x+1): (y.iloc[:,0] / y.iloc[:,2])*100})for x , y in df.groupby(np.arange(df.shape[1])//3,axis=1)],axis=1)
print(df)
This groups by the column's first index (rooms) and then loops through each group to add the total and percent on. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
df[(room, 'total')] = group.sum(axis=1)
df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
Room 1 Room 2 Room 3
on off total pct_on on off total pct_on on off total pct_on
0 1 4 5 0.2 3 6 9 0.333333 5 15 20 0.250
1 3 2 5 0.6 1 5 6 0.166667 1 7 8 0.125

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme value - more than 5 std.
I want to replace, per column, each value that is more than 5 std with the max other value.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s=df.mask((df-df.apply(lambda x: x.std() )).gt(5))#mask where condition applies
s=s.assign(A=s.A.fillna(s.A.max()),B=s.B.fillna(s.B.max())).sort_index(axis = 0)#fill with max per column and resort frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments you need to decide what your threshold is. say it is q=100, then you can do
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
do the same for B
Calculate a column-wise z-score (if you deem something an outlier if it lies outside a given number of standard deviations of the column) and then calculate a boolean mask of values outside your desired range
def calc_zscore(col):
return (col - col.mean()) / col.std()
zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something

Groupby with conditions

df = pd.DataFrame({'Category': ['A','B','B','B','C','C'],
'Subcategory': ['X','X','Y','Y','Z','Z'],
'Values': [1,2,3,4,5,6]})
which I use groupby to summarize -
`df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})`
size mean median
Category
A 1 1.0 1.0
B 3 3.0 3.0
C 2 5.5 5.5
Objective: In addition to the above, show additional groupby by subcategory 'X' to create below output:
ALL Subcategory Only Subcategory 'X'
size mean median size mean median
Category
A 1 1.0 1.0 1 1 1
B 3 3.0 3.0 1 2 2
C 2 5.5 5.5 0 0 0
My solution currently is to create two groupby, to_frame() then pd.merge them. Is there a better way? Thanks!
df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})
df[df['Subcategory']=='X'].groupby('Category')['Values'].agg({np.size, np.mean, np.median})

Resampling a dataframe based on depth column

I have two dataframe which the key is depth. One has > 2k values the other only 100, but the min and the max depth are the same. I would like to upsample the small dataframe (which has only one column) at the same size of the bigger one and repeat the same value of a column between two depths.
I've tried using concatenate and resampling but I'm stuck when I want to find the same depth since the two dataframes depths do not have exactly the same values
I have this:
df_small:
depth Litholog
0 38.076 2.0
1 39.546 2.0
2 41.034 4.0
3 55.133 3.0
4 69.928 2.0
and this:
df_big:
depth
0 21.3360
1 35.2044
2 37.6428
3 41.7576
4 41.9100
5 48.7680
6 53.1876
7 56.0832
8 58.3692
9 62.1792
I would like this:
df_result:
depth Litholog
0 21.3360 2
1 35.2044 2
2 37.6428 2
3 41.7576 4
4 41.9100 4
5 48.7680 4
6 53.1876 4
7 56.0832 3
8 58.3692 3
9 62.1792 2
I tried several approach but without success. Many thanks to all
If change sample data for same max and min value in both is possible use merge_asof:
#change sample data for same min,max by df_big
print (df_small)
depth Litholog
0 21.3360 2.0
1 39.5460 2.0
2 41.0340 4.0
3 55.1330 3.0
4 62.1792 2.0
df = pd.merge_asof(df_big, df_small, on='depth')
print (df)
depth Litholog
0 21.3360 2.0
1 35.2044 2.0
2 37.6428 2.0
3 41.7576 4.0
4 41.9100 4.0
5 48.7680 4.0
6 53.1876 4.0
7 56.0832 3.0
8 58.3692 3.0
9 62.1792 2.0

Pandas - calculate rolling average of group excluding current row

For an example:
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
'Date' : [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
This works to calculate the rolling average, inclusive of the current row:
df['avg'] = df.groupby(['Platoon'])['Casualties'].transform(lambda x: x.rolling(2, 1).mean())
Which gives:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 2.5
A 3 5 4.5
A 4 7 6.0
......
What I want to get is:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 1.0
A 3 5 2.5
A 4 7 4.5
......
I suspect I can use shift here but I can't figure it out!
You need shift with bfill
df.groupby(['Platoon'])['Casualties'].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())