Pandas - calculate rolling average of group excluding current row - pandas

For an example:
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
'Date' : [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
This works to calculate the rolling average, inclusive of the current row:
df['avg'] = df.groupby(['Platoon'])['Casualties'].transform(lambda x: x.rolling(2, 1).mean())
Which gives:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 2.5
A 3 5 4.5
A 4 7 6.0
......
What I want to get is:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 1.0
A 3 5 2.5
A 4 7 4.5
......
I suspect I can use shift here but I can't figure it out!

You need shift with bfill
df.groupby(['Platoon'])['Casualties'].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())

Related

How to clean a dataframe column with hours and minutes

I have a column of hours and minutes and I would like all values in the column to be in hours. So how do I divide only the columns values in minutes by 60 to get hours? I tried splitting the column by space to separate numbers and strings but I got stuck how to achieve the desire outcome.
Using a lambda with split.
df["content_duration"] = df["content_duration"].apply(
lambda x: round(int(x.split(" ")[0]) / 60, 2) if x.split(" ")[1] == "mins" else x.split(" ")[0]
)
print(df)
content_duration
0 1.5
1 1
2 1.5
3 1
4 0.62
5 0.73
Use the replace() function to replace the units with their respective conversions. Then apply the pandas eval function to each value to do the necessary conversions. Then round to the desired number of decimal places.
# Create the dataframe
df = pd.DataFrame({"content_duration": ['1.5 hours','1 hour','1.5 hours','1 hour', '37 mins','44 mins']})
# Convert the units to numeric datatype
df['content_duration'] = (df['content_duration'].replace({' mins?':'/60',' hours?':'*1'}, regex=True))\
.apply(pd.eval)\
.round(1)
# Print the dataframe
print(df)
OUTPUT:
content_duration
0 1.5
1 1.0
2 1.5
3 1.0
4 0.6
5 0.7
Pandas's to_timedelta is very good at converting this, you just need to remove the s from hours/mins:
df['hours'] = (pd.to_timedelta(df['content_duration']
.str.replace(r's\b', '', regex=True))
.dt.total_seconds().div(3600)
.round(2) # optional
)
Output:
content_duration hours
0 1.5 hours 1.50
1 1 hour 1.00
2 1.5 hours 1.50
3 1 hour 1.00
4 37 mins 0.62
5 44 mins 0.73
To have strings:
df['hours'] = (pd.to_timedelta(df['content_duration'].str.replace(r's\b', '', regex=True))
.dt.total_seconds().div(3600).round(2)
.astype(str).add(' hours')
)
output:
content_duration hours
0 1.5 hours 1.5 hours
1 1 hour 1.0 hours
2 1.5 hours 1.5 hours
3 1 hour 1.0 hours
4 37 mins 0.62 hours
5 44 mins 0.73 hours

How to add Multilevel Columns and create new column?

I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
Room 1 Room 2 Room 3
on off on off on off
0 1 4 3 6 5 15
1 3 2 1 5 1 7
For each room, I want to create a total column and then a on% column.
I have tried the following, however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
Room 1 Room 2 Room 3
off on total on_pct off on total on_pct off on total on_pct
0 4.0 1.0 5.0 0.2 6.0 3.0 9.0 0.333333 15.0 5.0 20.0 0.250
1 2.0 3.0 5.0 0.6 5.0 1.0 6.0 0.166667 7.0 1.0 8.0 0.125
Oof this was a roughie, but you can do it like this if you want to avoid loops. Worth noting it redefines your df twice because i need the total columns. Sorry about that, but is the best i could do. Also if you have any questions just comment.
df = pd.concat([y.assign(**{'Total {0}'.format(x+1): y.iloc[:,0] + y.iloc[:,1]})for x , y in df.groupby(np.arange(df.shape[1])//2,axis=1)],axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x+1): (y.iloc[:,0] / y.iloc[:,2])*100})for x , y in df.groupby(np.arange(df.shape[1])//3,axis=1)],axis=1)
print(df)
This groups by the column's first index (rooms) and then loops through each group to add the total and percent on. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
df[(room, 'total')] = group.sum(axis=1)
df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
Room 1 Room 2 Room 3
on off total pct_on on off total pct_on on off total pct_on
0 1 4 5 0.2 3 6 9 0.333333 5 15 20 0.250
1 3 2 5 0.6 1 5 6 0.166667 1 7 8 0.125

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme value - more than 5 std.
I want to replace, per column, each value that is more than 5 std with the max other value.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s=df.mask((df-df.apply(lambda x: x.std() )).gt(5))#mask where condition applies
s=s.assign(A=s.A.fillna(s.A.max()),B=s.B.fillna(s.B.max())).sort_index(axis = 0)#fill with max per column and resort frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments you need to decide what your threshold is. say it is q=100, then you can do
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
do the same for B
Calculate a column-wise z-score (if you deem something an outlier if it lies outside a given number of standard deviations of the column) and then calculate a boolean mask of values outside your desired range
def calc_zscore(col):
return (col - col.mean()) / col.std()
zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something

Groupby with conditions

df = pd.DataFrame({'Category': ['A','B','B','B','C','C'],
'Subcategory': ['X','X','Y','Y','Z','Z'],
'Values': [1,2,3,4,5,6]})
which I use groupby to summarize -
`df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})`
size mean median
Category
A 1 1.0 1.0
B 3 3.0 3.0
C 2 5.5 5.5
Objective: In addition to the above, show additional groupby by subcategory 'X' to create below output:
ALL Subcategory Only Subcategory 'X'
size mean median size mean median
Category
A 1 1.0 1.0 1 1 1
B 3 3.0 3.0 1 2 2
C 2 5.5 5.5 0 0 0
My solution currently is to create two groupby, to_frame() then pd.merge them. Is there a better way? Thanks!
df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})
df[df['Subcategory']=='X'].groupby('Category')['Values'].agg({np.size, np.mean, np.median})

aggregate dataframe horizontally

I have the following data:
inputdata = [[1,'long',30.2,'Win'],[1,'long',-12.4,'Loss'],
[2,'short',-12.3,'Loss'],[1,'long',3.2,'Win'],
[3,'short',0.0,'B/E'],[3,'short',23.2,'Win'],
[3,'long',3.2,'Win'],[4,'short',-4.2,'Loss']]
datadf = DataFrame(columns=['AssetId','Direction','PnL','W_L'], data = inputdata)
datadf
AssetId Direction PnL W_L
0 1 long 30.2 Win
1 1 long -12.4 Loss
2 2 short -12.3 Loss
3 1 long 3.2 Win
4 3 short 0.0 B/E
5 3 short 23.2 Win
6 3 long 3.2 Win
7 4 short -4.2 Loss
Now I want to aggregate this further into a new dataframe that looks like this mock up (a few sample rows added, more stats to be added:
Stat Long Short Total
0 Trades 4 4 8
1 Won 3 1 4
2 Lost 1 2 3
(...)
I tried this:
datadf.groupby(['Direction'])['PnL'].count()
Direction
long 4
short 4
Name: PnL, dtype: int64
This produces the necessary data, but I would have to fill my aggregation data frame field by field, which seems cumbersome and I am not even sure how to get the exact value into each row/column. Based on this example, is there a better way to achieve this goal?
You can do crosstab:
pd.crosstab(df['W_L'], df['Direction'],margins=True, margins_name='Total')
Output:
Direction long short Total
W_L
B/E 0 1 1
Loss 1 2 3
Win 3 1 4
Total 4 4 8
Use pivot_table:
res = pd.pivot_table(df.iloc[:,1:], index=["W_L"], columns=["Direction"], aggfunc="count").droplevel(0, 1)
res["total"] = res.sum(1)
print (res.append(res.sum().rename(index="Trades")))
Direction long short total
W_L
B/E NaN 1.0 1.0
Loss 1.0 2.0 3.0
Win 3.0 1.0 4.0
Trades 4.0 4.0 8.0