Pandas pivot table / groupby to calculate weighted average - pandas

I'm using pandas version 0.25.0 to calculate weighted averages of priced contracts.
Data:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Counterparty': {0: 'A',
1: 'B',
2: 'B',
3: 'A',
4: 'A',
5: 'C',
6: 'D',
7: 'E',
8: 'E',
9: 'C',
10: 'F',
11: 'C',
12: 'C',
13: 'G'},
'Contract': {0: 'A1',
1: 'B1',
2: 'B2',
3: 'A2',
4: 'A3',
5: 'C1',
6: 'D1',
7: 'E1',
8: 'E2',
9: 'C2',
10: 'F1',
11: 'C3',
12: 'C4',
13: 'G'},
'Delivery': {0: '1/8/2019',
1: '1/8/2019',
2: '1/8/2019',
3: '1/8/2019',
4: '1/8/2019',
5: '1/8/2019',
6: '1/8/2019',
7: '1/8/2019',
8: '1/8/2019',
9: '1/8/2019',
10: '1/8/2019',
11: '1/8/2019',
12: '1/8/2019',
13: '1/8/2019'},
'Price': {0: 134.0,
1: 151.0,
2: 149.0,
3: 134.0,
4: 132.14700000000002,
5: 150.0,
6: 134.566,
7: 153.0,
8: 151.0,
9: 135.0,
10: 149.0,
11: 135.0,
12: 147.0,
13: 151.0},
'Balance': {0: 200.0,
1: 54.87,
2: 200.0,
3: 133.44,
4: 500.0,
5: 500.0,
6: 1324.05,
7: 279.87,
8: 200.0,
9: 20.66,
10: 110.15,
11: 100.0,
12: 100.0,
13: 35.04}})
Method 1:
df.pivot_table(
    index=['Counterparty', 'Contract'],
    columns='Delivery',
    values=['Balance', 'Price'],
    aggfunc={
        'Balance': sum,
        'Price': np.mean
    },
    margins=True
).fillna('').swaplevel(0, 1, axis=1).sort_index(axis=1).round(3)
Result 1:
Is there any way in which I can use np.average in pandas pivot table?
Thinking along the lines of
aggfunc = {
    'Balance': sum,
    'Price': lambda x: np.average(x, weights='Balance')
}
Current result: 143.265, which is computed by np.mean.
Desired result: 140.424, which is the weighted average of Price by Balance.
Method 2:
df_grouped = df.groupby(['Counterparty', 'Contract', 'Delivery']).apply(lambda x: pd.Series(
    {
        'Balance': x['Balance'].sum(),
        'Price': np.average(x['Price'], weights=x['Balance']),
    }
)).round(3).unstack().swaplevel(0, 1, axis=1).sort_index(axis=1)
Result 2:
Using groupby, I would need to pd.concat and append a sum by level to get grand totals with aggfunc = {Balance: sum, Price: np.average}.
The expected grand totals are:
Balance: 3758.08 (using sum)
Price: 140.424 (using np.average)
These would be displayed in a Grand Total row beneath all the rows of data.

Just define a custom function to calculate the weighted average, and use it with aggfunc instead of np.mean in your code, as follows:
wa_func = lambda x: np.average(x, weights=df.loc[x.index, 'Balance'])

df1 = df.pivot_table(
    index=['Counterparty', 'Contract'],
    columns='Delivery',
    values=['Balance', 'Price'],
    aggfunc={
        'Balance': sum,
        'Price': wa_func
    },
    margins=True
).fillna('').swaplevel(0, 1, axis=1).sort_index(axis=1).round(3)
Out[35]:
Delivery 1/8/2019 All
Balance Price Balance Price
Counterparty Contract
A A1 200.00 134.000 200.00 134.000
A2 133.44 134.000 133.44 134.000
A3 500.00 132.147 500.00 132.147
B B1 54.87 151.000 54.87 151.000
B2 200.00 149.000 200.00 149.000
C C1 500.00 150.000 500.00 150.000
C2 20.66 135.000 20.66 135.000
C3 100.00 135.000 100.00 135.000
C4 100.00 147.000 100.00 147.000
D D1 1324.05 134.566 1324.05 134.566
E E1 279.87 153.000 279.87 153.000
E2 200.00 151.000 200.00 151.000
F F1 110.15 149.000 110.15 149.000
G G 35.04 151.000 35.04 151.000
All 3758.08 140.424 3758.08 140.424
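As a quick sanity check (using the df defined at the top), the figures in the All row can be reproduced directly over the whole frame:
# grand-total weighted average of Price by Balance -> ~140.424
np.average(df['Price'], weights=df['Balance'])
# grand-total Balance -> 3758.08
df['Balance'].sum()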

Related

multiple nested groupby in pandas

Here is my pandas dataframe:
df = pd.DataFrame({
    'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',
             5: '2016-10-12', 6: '2016-10-12', 7: '2016-10-12', 8: '2016-10-12', 9: '2016-10-12'},
    'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'},
    'Sector': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1},
    'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 3, 9: 3},
    'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6: 0, 7: 23, 8: 5, 9: 5}
})
Here is how it looks:
I want to add the following columns:
'Date_Range_Avg': average of 'Range' grouped by Date
'Date_Sector_Range_Avg': average of 'Range' grouped by Date and Sector
'Date_Segment_Range_Avg': average of 'Range' grouped by Date and Segment
This would be the output:
res = pd.DataFrame({
    'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',
             5: '2016-10-12', 6: '2016-10-12', 7: '2016-10-12', 8: '2016-10-12', 9: '2016-10-12'},
    'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'},
    'Sector': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1},
    'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 3, 9: 3},
    'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6: 0, 7: 23, 8: 5, 9: 5},
    'Date_Range_Avg': {0: 1.6, 1: 1.6, 2: 1.6, 3: 1.6, 4: 1.6, 5: 7.8, 6: 7.8, 7: 7.8, 8: 7.8, 9: 7.8},
    'Date_Sector_Range_Avg': {0: 2.5, 1: 2.5, 2: 1, 3: 1, 4: 1, 5: 9.67, 6: 9.67, 7: 9.67, 8: 9.67, 9: 9.67},
    'Date_Segment_Range_Avg': {0: 5, 1: 0.75, 2: 0.75, 3: 0.75, 4: 0.75, 5: 6, 6: 11.5, 7: 11.5, 8: 5, 9: 5}
})
This is how it looks:
Note I have rounded some of the values, but this rounding is not essential for the question I have (please feel free not to round).
I'm aware that I can do each of these groupings separately, but it strikes me as inefficient (my dataset contains millions of rows).
Essentially, I would like to first do a grouping by Date and then re-use it to do the two more fine-grained groupings by Date and Segment and by Date and Sector.
How to do this?
My initial hunch is to go like this:
day_groups = df.groupby("Date")
df['Date_Range_Avg'] = day_groups['Range'].transform('mean')
and then to re-use day_groups to do the 2 more fine-grained groupbys like this:
df['Date_Sector_Range_Avg'] = day_groups.groupby('Segment')[Range].transform('mean')
Which doesn't work as you get:
AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
groupby runs really fast when the aggregate function is vectorized. If you are worried about performance, try it out first to see if it's the real bottleneck in your program.
You can create temporary data frames holding the result of each groupby, then successively merge them with df:
group_bys = {
    "Date_Range_Avg": ["Date"],
    "Date_Sector_Range_Avg": ["Date", "Sector"],
    "Date_Segment_Range_Avg": ["Date", "Segment"]
}
tmp = [
    df.groupby(columns)["Range"].mean().to_frame(key)
    for key, columns in group_bys.items()
]
result = df
for t in tmp:
    result = result.merge(t, left_on=t.index.names, right_index=True)
Result:
Date Stock Sector Segment Range Date_Range_Avg Date_Sector_Range_Avg Date_Segment_Range_Avg
0 2016-10-11 A 0 0 5 1.6 2.500000 5.00
1 2016-10-11 B 0 1 0 1.6 2.500000 0.75
2 2016-10-11 C 1 1 1 1.6 1.000000 0.75
3 2016-10-11 D 1 1 0 1.6 1.000000 0.75
4 2016-10-11 E 1 1 2 1.6 1.000000 0.75
5 2016-10-12 F 0 1 6 7.8 9.666667 6.00
6 2016-10-12 G 0 2 0 7.8 9.666667 11.50
7 2016-10-12 H 0 2 23 7.8 9.666667 11.50
8 2016-10-12 I 1 3 5 7.8 5.000000 5.00
9 2016-10-12 J 1 3 5 7.8 5.000000 5.00
Another option is to use transform, and avoid the multiple merges:
# reusing your code
group_bys = {
    "Date_Range_Avg": ["Date"],
    "Date_Sector_Range_Avg": ["Date", "Sector"],
    "Date_Segment_Range_Avg": ["Date", "Segment"]
}
tmp = {key: df.groupby(columns)["Range"].transform('mean')
       for key, columns in group_bys.items()}
df.assign(**tmp)
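One small caveat, not in the original answer: assign returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep the new columns:
# keep the three new average columns on df
df = df.assign(**tmp)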

Pandas Dataframe: change columns, index and plot

Hi, I generated the table using Counter from collections to count the combinations of three values from a dataframe: Jessica, Mike, and Dog. I got each combination and its count.
Any help in making that table a bit prettier? I would like to rename the index entries as grp1, grp2, etc., and likewise rename the column to something other than 0.
Also, what would be the best plot to use for plotting the different groups?
Thanks for your help!!
I used this command to produce the table here:
import numpy as np
import pandas as pd
from collections import Counter

df = np.random.choice(["Mike", "Jessica", "Dog"], size=(20, 3))
Z = pd.DataFrame(df, columns=['a', 'b', 'c'])
LL = Z.apply(Counter, axis="columns").value_counts()
H = pd.DataFrame(LL)
print(H)
Quite an unusual technique... You can change the Counter-valued index to a MultiIndex, then plot() as barh so the labels make sense:
from collections import Counter

df = np.random.choice(["Mike", "Jessica", "Dog"], size=(20, 3))
Z = pd.DataFrame(df, columns=['a', 'b', 'c'])
LL = Z.apply(Counter, axis="columns").value_counts()
H = pd.DataFrame(LL)
I = pd.Series(H.index).apply(pd.Series)
H = H.set_index(pd.MultiIndex.from_arrays(I.T.values, names=I.columns))
H.plot(kind="barh")
H after setting the MultiIndex:
                   0
Mike Dog Jessica
2.0  1.0 NaN       5
     NaN 1.0       4
NaN  1.0 2.0       3
1.0  NaN 2.0       3
     1.0 1.0       2
NaN  NaN 3.0       1
     2.0 1.0       1
3.0  NaN NaN       1
Instead of using Counter, you can apply value_counts directly to each row:
import pandas as pd
from matplotlib import pyplot as plt
# Hard Coded For Reproducibility
df = pd.DataFrame({'a': {0: 'Dog', 1: 'Jessica', 2: 'Mike',
3: 'Dog', 4: 'Dog', 5: 'Dog',
6: 'Jessica', 7: 'Jessica',
8: 'Dog', 9: 'Dog', 10: 'Jessica',
11: 'Mike', 12: 'Dog',
13: 'Jessica', 14: 'Mike',
15: 'Mike',
16: 'Mike', 17: 'Dog',
18: 'Jessica', 19: 'Mike'},
'b': {0: 'Mike', 1: 'Mike', 2: 'Jessica',
3: 'Jessica', 4: 'Dog', 5: 'Jessica',
6: 'Mike', 7: 'Dog', 8: 'Mike',
9: 'Dog', 10: 'Dog', 11: 'Dog',
12: 'Dog', 13: 'Jessica',
14: 'Jessica', 15: 'Dog',
16: 'Dog', 17: 'Dog', 18: 'Jessica', 19: 'Jessica'},
'c': {0: 'Mike', 1: 'Dog', 2: 'Jessica',
3: 'Dog', 4: 'Dog', 5: 'Dog', 6: 'Dog',
7: 'Jessica', 8: 'Mike', 9: 'Dog',
10: 'Dog', 11: 'Mike', 12: 'Jessica',
13: 'Jessica', 14: 'Jessica',
15: 'Jessica', 16: 'Jessica',
17: 'Dog', 18: 'Mike', 19: 'Dog'}})
# Apply value_counts across each row
df = df.apply(pd.value_counts, axis=1) \
       .fillna(0)
# Group By All Columns and
# Get Duplicate Count From Group Size
df = pd.DataFrame(df
                  .groupby(df.columns.values.tolist())
                  .size()
                  .sort_values())
# Plot
plt.figure()
df.plot(kind="barh")
plt.tight_layout()
plt.show()
df after groupby, size, and sort:
0
Dog Jessica Mike
0.0 3.0 0.0 1
1.0 2.0 0.0 1
0.0 2.0 1.0 3
1.0 0.0 2.0 3
3.0 0.0 0.0 3
2.0 1.0 0.0 4
1.0 1.0 1.0 5
Plot: (horizontal bar chart image not shown)
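A further note, not from either answer above: apply with axis=1 runs a Python call per row, which can be slow on large frames. As a rough sketch, the row-wise df.apply(pd.value_counts, axis=1).fillna(0) step could be replaced with a vectorized version that stacks the name columns, one-hot encodes them, and sums the indicators back per row (df here means the original hard-coded frame of names, before it is overwritten; counts is a hypothetical name):
# vectorized per-row counting of Dog/Jessica/Mike
counts = pd.get_dummies(df.stack()).groupby(level=0).sum()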

Apply np.average in pandas pivot aggfunc

I am trying to calculate weighted average prices using pandas pivot table.
I have tried passing in a dictionary using aggfunc.
This does not work when passed into aggfunc, even though I expected it to calculate the correct weighted average:
'Price': lambda x: np.average(x, weights=df['Balance'])
I have also tried using a manual groupby:
df.groupby('Product').agg({
    'Balance': sum,
    'Price': lambda x: np.average(x, weights='Balance'),
    'Value': sum
})
This also yields the error:
TypeError: Axis must be specified when shapes of a and weights differ.
Here is sample data
import pandas as pd
import numpy as np
price_dict = {'Product': {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'B',
6: 'B',
7: 'B',
8: 'B',
9: 'B',
10: 'C',
11: 'C',
12: 'C',
13: 'C',
14: 'C'},
'Balance': {0: 10,
1: 20,
2: 30,
3: 40,
4: 50,
5: 60,
6: 70,
7: 80,
8: 90,
9: 100,
10: 110,
11: 120,
12: 130,
13: 140,
14: 150},
'Price': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 6,
6: 7,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15},
'Value': {0: 10,
1: 40,
2: 90,
3: 160,
4: 250,
5: 360,
6: 490,
7: 640,
8: 810,
9: 1000,
10: 1210,
11: 1440,
12: 1690,
13: 1960,
14: 2250}}
Trying to calculate the weighted average by passing a dict into aggfunc:
df = pd.DataFrame(price_dict)
df.pivot_table(
    index='Product',
    aggfunc={
        'Balance': sum,
        'Price': np.mean,
        'Value': sum
    }
)
Output:
Balance Price Value
Product
A 150 3 550
B 400 8 3300
C 650 13 8550
The expected outcome should be :
Balance Price Value
Product
A 150 3.66 550
B 400 8.25 3300
C 650 13.15 8550
Here is one way using apply
df.groupby('Product').apply(lambda x: pd.Series(
    {'Balance': x['Balance'].sum(),
     'Price': np.average(x['Price'], weights=x['Balance']),
     'Value': x['Value'].sum()}))
Out[57]:
Balance Price Value
Product
A 150.0 3.666667 550.0
B 400.0 8.250000 3300.0
C 650.0 13.153846 8550.0
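If you would rather stay with pivot_table, the index-alignment trick from the accepted answer at the top of this page should also work here (a sketch, assuming df is the frame built from price_dict above):
# weights are looked up from the full frame by the group's own index
wa = lambda x: np.average(x, weights=df.loc[x.index, 'Balance'])
df.pivot_table(
    index='Product',
    aggfunc={'Balance': sum, 'Price': wa, 'Value': sum}
)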

pivoting pandas df - turn column values into column names

I have a df:
df = pd.DataFrame({'time_period': {0: pd.Timestamp('2017-04-01 00:00:00'),
1: pd.Timestamp('2017-04-01 00:00:00'),
2: pd.Timestamp('2017-03-01 00:00:00'),
3: pd.Timestamp('2017-03-01 00:00:00')},
'cost1': {0: 142.62999999999994,
1: 131.97000000000003,
2: 142.62999999999994,
3: 131.97000000000003},
'revenue1': {0: 56,
1: 113.14999999999998,
2: 177,
3: 99},
'cost2': {0: 309.85000000000002,
1: 258.25,
2: 309.85000000000002,
3: 258.25},
'revenue2': {0: 4.5,
1: 299.63,2: 309.85,
3: 258.25},
'City': {0: 'Boston',
1: 'New York',2: 'Boston',
3: 'New York'}})
I want to restructure this df so that, for revenue and cost separately, it looks like this:
pd.DataFrame({'City': {0: 'Boston', 1: 'New York'},
'Apr-17 revenue1': {0: 56.0, 1: 113.15000000000001},
'Apr-17 revenue2': {0: 4.5, 1: 299.63},
'Mar-17 revenue1': {0: 177, 1: 99},
'Mar-17 revenue2': {0: 309.85000000000002, 1: 258.25}})
And a similar df for costs.
Basically, turn the time_period column values into column names like Apr-17, Mar-17 with revenue/cost string as appropriate and values of revenue1/revenue2 and cost1/cost2 respectively.
I've been playing around with pd.pivot_table with some success but I can't get exactly what I want.
Use set_index and unstack
import datetime as dt
df['time_period'] = df['time_period'].apply(lambda x: dt.datetime.strftime(x,'%b-%Y'))
df = df.set_index(['A', 'B', 'time_period'])[['revenue1', 'revenue2']].unstack().reset_index()
df.columns = df.columns.map(' '.join)
A B revenue1 Apr-2017 revenue1 Mar-2017 revenue2 Apr-2017 revenue2 Mar-2017
0 Boston Orlando 56.00 177.0 4.50 309.85
1 New York Dallas 113.15 99.0 299.63 258.25
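Note that the answer above indexes on A and B columns from the answerer's own sample rather than the question's City column. Adapted to the question's df, a sketch might look like this (rev is a hypothetical name; the same pattern with cost1/cost2 gives the cost table):
import datetime as dt

df['time_period'] = df['time_period'].apply(lambda x: dt.datetime.strftime(x, '%b-%y'))
rev = df.set_index(['City', 'time_period'])[['revenue1', 'revenue2']].unstack()
rev.columns = [f'{period} {col}' for col, period in rev.columns]  # e.g. 'Apr-17 revenue1'
rev = rev.reset_index()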

Pandas Calculating minimum timedelta by group

My input data frames are as below.
Input DataFrames:
Input1 = pd.DataFrame({'LOT': {0: 'A1', 1: 'A2', 2: 'A3', 3: 'A4', 4: 'A5'},
'OPERATION': {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0},
'TXN_DATE': {0: '12/6/2016',
1: '12/5/2016',
2: '11/30/2016',
3: '11/27/2016',
4: '11/22/2016'}})
Input2 = pd.DataFrame({'LOT': {0: 'B1', 1: 'B2', 2: 'B3', 3: 'B4', 4: 'B5', 5: 'B6'},
'OPERATION': {0: 500, 1: 500, 2: 500, 3: 500, 4: 500, 5: 500},
'TXN_DATE': {0: '12/7/2016',
1: '12/3/2016',
2: '11/17/2016',
3: '11/22/2016',
4: '12/4/2016',
5: '12/3/2016'}})
I am interested in finding, for each lot in Input1, a companion lot from Input2 based on the minimum TXN_DATE delta between them (the time delta is supposed to be minimal):
Final DataFrame:
Expected_out = pd.DataFrame({'COMPANION_LOT': {0: 'B5', 1: 'B5', 2: 'B4', 3: 'B4', 4: 'B4'},
'COMPANION_LOT TXN_DATE': {0: '12/4/2016',
1: '12/4/2016',
2: '11/22/2016',
3: '11/22/2016',
4: '11/22/2016'},
'LOT': {0: 'A1', 1: 'A2', 2: 'A3', 3: 'A4', 4: 'A5'},
'OPERATION': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100},
'TXN_DATE': {0: '12/6/2016',
1: '12/5/2016',
2: '11/30/2016',
3: '11/27/2016',
4: '11/22/2016'}})
Thank you
You can use mainly pandas.merge_asof and then add the new column with map:
Input1.TXN_DATE = pd.to_datetime(Input1.TXN_DATE)
Input2.TXN_DATE = pd.to_datetime(Input2.TXN_DATE)
Input1 = Input1.sort_values('TXN_DATE')
Input2 = Input2.sort_values('TXN_DATE')
df = pd.merge_asof(Input1, Input2, on='TXN_DATE', suffixes=('', '_COMPANION')) \
       .sort_values('LOT') \
       .drop('OPERATION_COMPANION', axis=1)
df['LOT_TXN_DATE'] = df.LOT_COMPANION.map(Input2.set_index('LOT')['TXN_DATE'])
print (df)
LOT OPERATION TXN_DATE LOT_COMPANION LOT_TXN_DATE
4 A1 100.0 2016-12-06 B5 2016-12-04
3 A2 100.0 2016-12-05 B5 2016-12-04
2 A3 100.0 2016-11-30 B4 2016-11-22
1 A4 100.0 2016-11-27 B4 2016-11-22
0 A5 100.0 2016-11-22 B4 2016-11-22
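A note on the matching direction (an observation about the technique, not part of the original answer): merge_asof matches backward by default, i.e. each lot in Input1 is paired with the nearest earlier-or-equal TXN_DATE in Input2, which is what the expected output shows. If the closest date in either direction were wanted instead, merge_asof accepts a direction argument:
# nearest match in either direction instead of the default 'backward'
df = pd.merge_asof(Input1, Input2, on='TXN_DATE',
                   suffixes=('', '_COMPANION'), direction='nearest')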