Calculate ratio using groupby - pandas

I'm new to Python.
I created this DataFrame:
import pandas as pd

d2 = {'id': ['x2', 'x2', 'x2', 'x2', 'x3', 'x3', 'x3'],
      'cod': [101001, 101001, 101001, 101001, 101002, 101002, 101002],
      'flag': ['IN', 'IN', 'IN', 'CMP', 'IN', 'OUT', 'CMP'],
      'col': [100, 100, 100, 300, 100, 300, 100]}
df2 = pd.DataFrame(data=d2)
I want to calculate a ratio, sum(IN) / sum(all), per group of id and cod.
The expected output should be:
d2 = {'id': ['x2', 'x2', 'x2', 'x2', 'x3', 'x3', 'x3'],
      'cod': [101001, 101001, 101001, 101001, 101002, 101002, 101002],
      'flag': ['IN', 'IN', 'IN', 'CMP', 'IN', 'OUT', 'CMP'],
      'col': [0.5, 0.5, 0.5, 0.5, 0.2, 0.2, 0.2]}
df2 = pd.DataFrame(data=d2)
Please tell me if I'm not clear. Thank you.

First replace non-matching values with 0 using DataFrame.where, aggregate the group sums with transform, and last divide the columns:
df3 = (df2.assign(new = df2['col'].where(df2['flag'].eq('IN'), 0))
          .groupby(['id', 'cod'])
          .transform('sum'))
df2['rat'] = df3['new'].div(df3['col'])
print(df2)
id cod flag col rat
0 x2 101001 IN 100 0.5
1 x2 101001 IN 100 0.5
2 x2 101001 IN 100 0.5
3 x2 101001 CMP 300 0.5
4 x3 101002 IN 100 0.2
5 x3 101002 OUT 300 0.2
6 x3 101002 CMP 100 0.2

You could create a temporary column (new) and use it, combined with groupby and transform, to get the ratio for each row:
import numpy as np

(df2
 .assign(new = np.where(df2.flag == "IN", df2.col, 0),
         ratio = lambda df: df.groupby(['id', 'cod'])
                              .pipe(lambda g: g['new'].transform('sum')
                                               .div(g['col'].transform('sum'))))
)
id cod flag col new ratio
0 x2 101001 IN 100 100 0.5
1 x2 101001 IN 100 100 0.5
2 x2 101001 IN 100 100 0.5
3 x2 101001 CMP 300 0 0.5
4 x3 101002 IN 100 100 0.2
5 x3 101002 OUT 300 0 0.2
6 x3 101002 CMP 100 0 0.2

df2["col"] = df2.groupby(["id", "cod"], as_index=False)["col"].transform(
lambda x: x[df2.iloc[x.index]["flag"] == "IN"].sum() / x.sum(),
)
print(df2)
Prints:
id cod flag col
0 x2 101001 IN 0.5
1 x2 101001 IN 0.5
2 x2 101001 IN 0.5
3 x2 101001 CMP 0.5
4 x3 101002 IN 0.2
5 x3 101002 OUT 0.2
6 x3 101002 CMP 0.2
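All three answers compute the same quantity. For reference, a compact sketch of the core idea, the grouped sum over the IN rows divided by the grouped total, reusing the question's df2 (the rat name is only illustrative):
# sum of col over the IN rows, broadcast back to every row of its (id, cod) group
in_sum = (df2['col'].where(df2['flag'].eq('IN'), 0)
                    .groupby([df2['id'], df2['cod']])
                    .transform('sum'))
# total col per (id, cod) group
total = df2.groupby(['id', 'cod'])['col'].transform('sum')

df2['rat'] = in_sum / total   # 0.5 for x2/101001, 0.2 for x3/101002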

Related

Product demand down calculation in pandas df without loop

I'm having trouble with shift and diff, and I feel it should be simple.
Assume I have customers with different product demands, and they get handled with priority top down. I'd like an efficient solution without looping....
df_situation = pd.DataFrame(
    {
        "cust": [1, 2, 3, 3, 4],
        "prod": [1, 1, 1, 2, 2],
        "available": [1000, np.nan, np.nan, 2000, np.nan],
        "needed": [200, 300, 1000, 1000, 1000],
    }
)
My objective is to get some additional columns like this, but the difference calculations and the shift operation seem to be in a chicken-and-egg situation.
Thanks in advance for any hint.
leftover_prod is the forward-filled available minus the cumulative demand per prod group (groupby + cumsum):
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
a - df_situation.groupby('prod')['demand'].cumsum()
)
0 800.0
1 500.0
2 -500.0
3 1000.0
4 0.0
Name: leftover_prod, dtype: float64
fulfilled_cust is either the full demand, if enough is left over, or whatever remains; it uses the grouped shift of leftover_prod plus np.where:
s = (df_situation.groupby('prod')['leftover_prod']
.shift()
.fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
s.ge(df_situation['demand']), df_situation['demand'], s
)
0 200.0
1 300.0
2 500.0
3 1000.0
4 1000.0
Name: fulfilled_cust, dtype: float64
missing_cust is the demand minus the fulfilled_cust:
df_situation['missing_cust'] = (
df_situation['demand'] - df_situation['fulfilled_cust']
)
0 0.0
1 0.0
2 500.0
3 0.0
4 0.0
Name: missing_cust, dtype: float64
Together:
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
a - df_situation.groupby('prod')['demand'].cumsum()
)
s = (df_situation.groupby('prod')['leftover_prod']
.shift()
.fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
s.ge(df_situation['demand']), df_situation['demand'], s
)
df_situation['missing_cust'] = (
df_situation['demand'] - df_situation['fulfilled_cust']
)
cust prod available demand leftover_prod fulfilled_cust missing_cust
0 1 1 1000.0 200 800.0 200.0 0.0
1 2 1 NaN 300 500.0 300.0 0.0
2 3 1 NaN 1000 -500.0 500.0 500.0
3 3 2 2000.0 1000 1000.0 1000.0 0.0
4 4 2 NaN 1000 0.0 1000.0 0.0
imports and DataFrame used:
import numpy as np
import pandas as pd
df_situation = pd.DataFrame({
"cust": [1, 2, 3, 3, 4],
"prod": [1, 1, 1, 2, 2],
"available": [1000, np.nan, np.nan, 2000, np.nan],
"demand": [200, 300, 1000, 1000, 1000],
})
(I changed "needed" to "demand", as it appears in the question's image.)
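For comparison, a sketch of the same three columns built with a cumulative shortfall (cumsum, diff and clip, since the question mentions diff). This is a variant of mine, not part of the answer above; it continues from the imports and DataFrame just shown and reproduces the same table:
avail = df_situation['available'].ffill()
cum_demand = df_situation.groupby('prod')['demand'].cumsum()

# cumulative shortfall: how far cumulative demand exceeds what is available
shortfall = (cum_demand - avail).clip(lower=0)

df_situation['leftover_prod'] = avail - cum_demand
# the part of the shortfall created by this row is this row's missing quantity
df_situation['missing_cust'] = (shortfall.groupby(df_situation['prod']).diff()
                                         .fillna(shortfall))
df_situation['fulfilled_cust'] = df_situation['demand'] - df_situation['missing_cust']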

pd.df find rows pairwise using groupby and change bogus values

My pd.DataFrame looks like this example but has about 10 million rows, hence I am looking for an efficient solution.
import pandas as pd
df = pd.DataFrame({'timestamp': ['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df.index = pd.to_datetime(df.index)
df.opt_expiry = pd.to_datetime(df.opt_expiry)
Out[2]:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.001 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
Here is what I am looking to achieve:
1) find the pairs with identical timestamp, opt_expiry and strike:
groups = df.groupby(['timestamp','opt_expiry','strike'])
2) for each group, check whether the sum of the absolute deltas equals 1. If it does not, find the maximum of the two sigma values and assign it to both rows as the new, correct sigma. Pseudo code:
for group in groups:
    # if the sum of absolute deltas != 1
    if (abs(group.delta[0]) + abs(group.delta[1])) != 1:
        correct_sigma = group.sigma.max()
        group.sigma = correct_sigma
Expected output:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.170 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
Revised answer. I believe there could be a shorter answer out there; maybe put it up as a bounty.
Data
df = pd.DataFrame({'timestamp': ['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df
Working
Absolute delta for each row:
df['absdelta'] = df['delta'].abs()
Absolute-delta sum and maximum sigma for each group, in a new DataFrame df2:
df2=df.groupby(['timestamp','opt_expiry','strike']).agg({'absdelta':'sum','sigma':'max'})#.reset_index().drop(columns=['timestamp','opt_expiry'])
df2
Merge df2 with df
df3=df.merge(df2, left_on='strike', right_on='strike',
suffixes=('', '_right'))
df3
Mask the groups whose absolute-delta sum is not equal to 1:
m=df3['absdelta_right']!=1
m
Using the mask, apply the maximum sigma to the rows of the masked groups:
df3.loc[m,'sigma']=df3.loc[m,'sigma_right']
Slice to return to the original DataFrame:
df3.iloc[:,:-4]
Output
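Since the answer invites a shorter variant, here is one possible sketch (not from the original answer) using a single grouped transform over the three keys. It reuses df from the Data block above; np.isclose guards against sums such as 0.7 + 0.3 failing an exact == 1 test due to floating-point noise:
import numpy as np

# 'timestamp' is resolved as an index level name, the other two as columns
g = df.groupby(['timestamp', 'opt_expiry', 'strike'])

abs_sum = g['delta'].transform(lambda d: d.abs().sum())   # per-group |delta| sum, broadcast to rows
fix = g['sigma'].transform('max')                         # per-group maximum sigma, broadcast to rows

# keep sigma where the deltas sum to 1, otherwise take the group maximum
df['sigma'] = np.where(np.isclose(abs_sum, 1), df['sigma'], fix)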

Pandas - Unexpected results when indexing a DataFrame containing missing entries

First, we create a large DataFrame with a MultiIndex whose first record contains missing values (np.NaN):
In [200]: data = []
     ...: val = 0
     ...: for ind_1 in range(3000):
     ...:     if ind_1 == 0:
     ...:         data.append({'ind_1': 0, 'ind_2': np.NaN, 'val': np.NaN})
     ...:     else:
     ...:         for ind_2 in range(3000):
     ...:             data.append({'ind_1': ind_1, 'ind_2': ind_2, 'val': val})
     ...:             val += 1
     ...: df = pd.DataFrame(data).set_index(['ind_1', 'ind_2'])
In [201]: df
Out[201]:
val
ind_1 ind_2
0 NaN NaN
1 0.0 0.0
1.0 1.0
2.0 2.0
3.0 3.0
... ...
2999 2995.0 8996995.0
2996.0 8996996.0
2997.0 8996997.0
2998.0 8996998.0
2999.0 8996999.0
[8997001 rows x 1 columns]
I want to select all rows where ind_1 < 3 and ind_2 < 3.
First I create a MultiIndex i1 where ind_1 < 3:
In [202]: i1 = df.loc[df.index.get_level_values('ind_1') < 3].index
In [203]: i1
Out[203]:
MultiIndex([(0, nan),
(1, 0.0),
(1, 1.0),
(1, 2.0),
(1, 3.0),
(1, 4.0),
(1, 5.0),
(1, 6.0),
(1, 7.0),
(1, 8.0),
...
(2, 2990.0),
(2, 2991.0),
(2, 2992.0),
(2, 2993.0),
(2, 2994.0),
(2, 2995.0),
(2, 2996.0),
(2, 2997.0),
(2, 2998.0),
(2, 2999.0)],
names=['ind_1', 'ind_2'], length=6001)
Then I create a MultiIndex i2 where ind_2 < 3:
In [204]: i2 = df.loc[~(df.index.get_level_values('ind_2') > 2)].index
In [205]: i2
Out[205]:
MultiIndex([( 0, nan),
( 1, 0.0),
( 1, 1.0),
( 1, 2.0),
( 2, 0.0),
( 2, 1.0),
( 2, 2.0),
( 3, 0.0),
( 3, 1.0),
( 3, 2.0),
...
(2996, 2.0),
(2997, 0.0),
(2997, 1.0),
(2997, 2.0),
(2998, 0.0),
(2998, 1.0),
(2998, 2.0),
(2999, 0.0),
(2999, 1.0),
(2999, 2.0)],
names=['ind_1', 'ind_2'], length=8998)
Logically, the solution should be the intersection of these two sets
In [206]: df.loc[i1 & i2]
Out[206]:
val
ind_1 ind_2
1 0.0 0.0
1.0 1.0
2.0 2.0
2 0.0 3000.0
1.0 3001.0
2.0 3002.0
Why is the first record (0, nan) filtered out?
Use boolean arrays i1, i2 instead of indexes. Intersecting the two MultiIndex objects with i1 & i2 drops the (0, nan) entry, most likely because NaN does not compare equal to NaN; boolean masks sidestep that comparison.
In [27]: i1 = df.index.get_level_values('ind_1') < 3
In [28]: i2 = ~(df.index.get_level_values('ind_2') > 2)
In [29]: i1
Out[29]: array([ True, True, True, ..., False, False, False])
In [30]: i2
Out[30]: array([ True, True, True, ..., False, False, False])
In [31]: df.loc[i1 & i2]
Out[31]:
val
ind_1 ind_2
0 NaN NaN
1 0.0 0.0
1.0 1.0
2.0 2.0
2 0.0 3000.0
1.0 3001.0
2.0 3002.0
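Equivalently, the two conditions can be combined into a single boolean mask without building the intermediate MultiIndexes:
mask = (df.index.get_level_values('ind_1') < 3) & ~(df.index.get_level_values('ind_2') > 2)
df.loc[mask]   # keeps the (0, nan) row as well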

How to change the orientation of a list in pandas

I have three lists as follows.
mylist = [["sensor9", [[0.5, 0.3, 0.8, 0.9, 0.8], [0.5, 0.6, 0.8, 0.9, 0.9]]],
["sensor12", [[10.6, 0.5, 0.9, 1.0, 0.9], [10.6, 0.9, 0.8, 0.8, 0.8]]]]
columns = ['score_1', 'score_2']
years = [2001, 2002, 2003, 2004, 2005]
I want to change the orientation of mylist, using columns as the headings and years for each element of mylist. More specifically, my final output should look as follows:
id, sensor, time, score_1, score_2
0, sensor9, 2001, 0.5, 0.5
0, sensor9, 2002, 0.3, 0.6
0, sensor9, 2003, 0.8, 0.8
0, sensor9, 2004, 0.9, 0.9
0, sensor9, 2005, 0.8, 0.9
1, sensor12, 2001, 0.6, 0.6
1, sensor12, 2002, 0.5, 0.9
1, sensor12, 2003, 0.9, 0.8
1, sensor12, 2004, 1.0, 0.8
1, sensor12, 2005, 0.9, 0.8
A DataFrame that maps each id of the above output to its sensor:
id, sensor
0, sensor9
1, sensor12
I was trying to do this with DataFrame.from_dict in pandas. However, I am not sure how to change the orientation of mylist and align it with the years. Is it possible to do this?
I am happy to provide more details if needed.
Use a list comprehension to generate a DataFrame from the second element of each list (the nested lists), transposing with DataFrame.T; then concat them together and finally create the new id column with Series.map, inserting it at the first position with DataFrame.insert:
df1 = pd.DataFrame({'id': [0, 1],
                    'sensor': ['sensor9', 'sensor12']})
mylist = [["sensor9", [[0.5, 0.3, 0.8, 0.9, 0.8], [0.5, 0.6, 0.8, 0.9, 0.9]]],
["sensor12", [[10.6, 0.5, 0.9, 1.0, 0.9], [10.6, 0.9, 0.8, 0.8, 0.8]]]]
columns = ['score_1', 'score_2']
years = [2001, 2002, 2003, 2004, 2005]
L = [pd.DataFrame(x[1], index=columns, columns=years).T for x in mylist]
df = pd.concat(L, keys=[x[0] for x in mylist]).rename_axis(('sensor','time')).reset_index()
df.insert(0, 'id', df['sensor'].map(df1.set_index('sensor')['id']))
print (df)
id sensor time score_1 score_2
0 0 sensor9 2001 0.5 0.5
1 0 sensor9 2002 0.3 0.6
2 0 sensor9 2003 0.8 0.8
3 0 sensor9 2004 0.9 0.9
4 0 sensor9 2005 0.8 0.9
5 1 sensor12 2001 10.6 10.6
6 1 sensor12 2002 0.5 0.9
7 1 sensor12 2003 0.9 0.8
8 1 sensor12 2004 1.0 0.8
9 1 sensor12 2005 0.9 0.8
EDIT:
mylist = [["sensor9", [[0.5, 0.3, 0.8, 0.9, 0.8], [0.5, 0.6, 0.8, 0.9, 0.9]]],
["sensor12", [[10.6, 0.5, 0.9, 1.0, 0.9], [10.6, 0.9, 0.8, 0.8, 0.8]]]]
columns = ['score_1', 'score_2']
years = [2001, 2002, 2003, 2004, 2005]
L = [pd.DataFrame(x[1], index=columns, columns=years).T for x in mylist]
df = pd.concat(L, keys=[x[0] for x in mylist]).rename_axis(('sensor','time')).reset_index()
df.insert(0, 'id', pd.factorize(df['sensor'])[0])
print (df)
id sensor time score_1 score_2
0 0 sensor9 2001 0.5 0.5
1 0 sensor9 2002 0.3 0.6
2 0 sensor9 2003 0.8 0.8
3 0 sensor9 2004 0.9 0.9
4 0 sensor9 2005 0.8 0.9
5 1 sensor12 2001 10.6 10.6
6 1 sensor12 2002 0.5 0.9
7 1 sensor12 2003 0.9 0.8
8 1 sensor12 2004 1.0 0.8
9 1 sensor12 2005 0.9 0.8
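For readers who prefer plain Python to the concat/transpose route, an equivalent sketch that builds the long table row by row from the same mylist, columns and years shown above (the id is just the position in mylist):
rows = []
for i, (sensor, scores) in enumerate(mylist):
    # scores[0] holds the score_1 values per year, scores[1] the score_2 values
    for j, year in enumerate(years):
        rows.append({'id': i, 'sensor': sensor, 'time': year,
                     'score_1': scores[0][j], 'score_2': scores[1][j]})
df = pd.DataFrame(rows)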

How to use if conditions in Pandas?

I am working with pandas and I have four columns:
Name Sensex_index Start_Date End_Date
AAA 0.5 20/08/2016 25/09/2016
AAA 0.8 26/08/2016 29/08/2016
AAA 0.4 30/08/2016 31/08/2016
AAA 0.9 01/09/2016 05/09/2016
AAA 0.5 12/09/2016 22/09/2016
AAA 0.3 24/09/2016 29/09/2016
ABC 0.9 01/01/2017 15/01/2017
ABC 0.5 23/01/2017 30/01/2017
ABC 0.7 02/02/2017 15/03/2017
Within each Name, whenever the Sensex_index drops and then rises again (a local minimum), that row terminates the block: its End_Date becomes the Termination_Date for every row of the block, and the block's first Start_Date becomes the Actual_Start. I am looking for the following output:
Name Sensex_index Actual_Start Termination_Date
AAA 0.5 20/08/2016 31/08/2016
AAA 0.8 20/08/2016 31/08/2016
AAA 0.4 20/08/2016 31/08/2016 [high to low; low to high,terminate]
AAA 0.9 01/09/2016 29/09/2016
AAA 0.5 01/09/2016 29/09/2016
AAA 0.3 01/09/2016 29/09/2016 [end of AAA]
ABC 0.9 01/01/2017 30/01/2017
ABC 0.5 01/01/2017 30/01/2017 [high to low; low to high,terminate]
ABC 0.7 02/02/2017 15/03/2017 [end of ABC]
#Setup
df = pd.DataFrame(data = [['AAA', 0.5, '20/08/2016', '25/09/2016'],
['AAA', 0.8, '26/08/2016', '29/08/2016'],
['AAA', 0.4, '30/08/2016', '31/08/2016'],
['AAA', 0.9, '01/09/2016', '05/09/2016'],
['AAA', 0.5, '12/09/2016', '22/09/2016'],
['AAA', 0.3, '24/09/2016', '29/09/2016'],
['ABC', 0.9, '01/01/2017', '15/01/2017'],
['ABC', 0.5, '23/01/2017', '30/01/2017'],
['ABC', 0.7, '02/02/2017', '15/03/2017']], columns = ['Name', 'Sensex_index', 'Start_Date', 'End_Date'])
#Find the rows where price change from high to low and then to high
df['change'] = df.groupby('Name')['Sensex_index'].apply(lambda x: x.rolling(3,center=True).apply(lambda y: True if (y[1]<y[0] and y[1]<y[2]) else False))
#Find the last row for each name
df.iloc[df.groupby('Name')['change'].tail(1).index, -1] = 1.0
#Set End_Date as Termination_Date for those changing points
df['Termination_Date'] = df.apply(lambda x: x.End_Date if x.change>0 else np.nan, axis=1)
#Set Actual_Start
df['Actual_Start'] = df.apply(lambda x: x.Start_Date if (x.name==0
or x.Name!= df.iloc[x.name-1]['Name']
or df.iloc[x.name-1]['change']>0)
else np.nan, axis=1)
#back fill the Termination_Date for other rows.
df.Termination_Date.fillna(method='bfill', inplace=True)
#forward fill the Actual_Start for other rows.
df.Actual_Start.fillna(method='ffill', inplace=True)
print(df)
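An alternative sketch of the same block logic, flagging local minima with grouped shift comparisons instead of rolling.apply. It starts again from the df created in #Setup above; the helper names is_end and block are only illustrative:
s = df.groupby('Name')['Sensex_index']

# a block ends at a local minimum of Sensex_index, or at the last row of a Name
is_min = (df['Sensex_index'] < s.shift()) & (df['Sensex_index'] < s.shift(-1))
is_end = is_min | df['Name'].ne(df['Name'].shift(-1))

# a new block starts on the row after a block end (the shift stays inside each Name)
block = is_end.groupby(df['Name']).shift(fill_value=False).astype(int).cumsum()

df['Actual_Start'] = df.groupby(['Name', block])['Start_Date'].transform('first')
df['Termination_Date'] = df.groupby(['Name', block])['End_Date'].transform('last')
print(df)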