Divide Dataframe Multiindex level 1 by the sum of level 0 - pandas

I have created a DataFrame like this:
df = pd.DataFrame(
{
'env': ['us', 'us', 'us', 'eu'],
'name': ['first', 'first', 'first', 'second'],
'default_version': ['2.0.1','2.0.1','2.0.1', '2.1.1'],
'version': ['2.2.1', '2.2.2.4', '2.3', '2.2.24'],
'count_events': [1, 8, 102, 244],
'unique_users': [1, 3, 72, 111]
}
)
df = df.pivot_table(index=['env', 'name', 'default_version'],
columns='version', values=['count_events', 'unique_users'], aggfunc='sum')
Next, I want the sum of all count_events and the sum of all unique_users over the level=1 (version) columns, so that I can compute the percentage of count_events and unique_users for each version.
I have generated the sum with the following code, but I don't know how to generate the %.
sums = df.sum(level=0, axis=1)
sums.columns = pd.MultiIndex.from_product([sums.columns, ['SUM']])
final_result = pd.concat([df, sums], axis=1)
It would not be a problem to change the sum code if necessary.

You can reindex your sums (taken before its columns are renamed to a MultiIndex) to match the shape of the original data using a combination of reindex and set_axis:
In [14]: fraction = (
...: df / (
...: sums
...: .reindex(df.columns.get_level_values(0), axis=1)
...: .set_axis(df.columns, axis=1)
...: )
...: ).fillna(0)
In [15]: fraction
Out[15]:
count_events unique_users
version 2.2.1 2.2.2.4 2.2.24 2.3 2.2.1 2.2.2.4 2.2.24 2.3
env name default_version
eu second 2.1.1 0.000000 0.000000 1.0 0.000000 0.000000 0.000000 1.0 0.000000
us first 2.0.1 0.009009 0.072072 0.0 0.918919 0.013158 0.039474 0.0 0.947368
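On recent pandas versions the `level=` argument of `DataFrame.sum` may no longer be available, so the sums may need to be computed another way. Here is a minimal sketch, assuming pandas 2.x: it rebuilds the pivot from the question and uses a transposed groupby with `transform`, so the totals already have the same shape as the pivoted frame. The names `wide`, `totals` and `fraction` are illustrative and do not come from the original posts.

```python
import pandas as pd

df = pd.DataFrame({
    'env': ['us', 'us', 'us', 'eu'],
    'name': ['first', 'first', 'first', 'second'],
    'default_version': ['2.0.1', '2.0.1', '2.0.1', '2.1.1'],
    'version': ['2.2.1', '2.2.2.4', '2.3', '2.2.24'],
    'count_events': [1, 8, 102, 244],
    'unique_users': [1, 3, 72, 111],
})

# Same pivot as in the question (string alias instead of np.sum).
wide = df.pivot_table(index=['env', 'name', 'default_version'],
                      columns='version',
                      values=['count_events', 'unique_users'],
                      aggfunc='sum')

# Transpose so the column levels become row levels, sum within each
# level-0 group (count_events / unique_users), and broadcast the sum
# back to every version with transform. Transposing back gives a frame
# with exactly the same columns as `wide`.
totals = wide.T.groupby(level=0).transform('sum').T

# Missing version/row combinations are NaN in `wide`; treat them as 0 %.
fraction = (wide / totals).fillna(0)
print(fraction)
```

Because `totals` is aligned with `wide` by construction, no reindex or set_axis step is needed in this variant.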

Related

Groupby and Divide One Group of Rows by Another Group

I have a dataframe:
df = pd.DataFrame({
'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Promo'],
'Product': ['AA', 'AA', 'BB', 'BB'],
'Risk': ['High', 'High','Low', 'Low'],
'202101': [ 200, 100, 400, 100],
'202102': [ 200, 100, 400, 100],
'202103': [ 200, 100, 400, 100]})
I wish to group by Product and Risk and divide the Total Assets rows by the Total Promo rows. I would like the output to look like this:
df = pd.DataFrame({
'Product': ['AA', 'BB'],
'Risk': ['High', 'Low',],
'202101': [ 2, 4],
'202102': [ 2, 4],
'202103': [ 2, 4]})
So far my approach has been to first melt the data into long form, but I can't seem to turn Total Assets and Total Promo into columns so that I can divide one by the other:
df = pd.melt(df, id_vars=['Metric', 'Product', 'Risk'],
value_vars = ["202101", "202102", "202103"],
var_name='Months', value_name='Balance')
Here's one way:
df1 = df.set_index(['Metric', 'Product', 'Risk']).stack().unstack(0)
df = (df1['Total Assets'] / df1['Total Promo']).unstack(-1).reset_index()
OUTPUT:
Product Risk 202101 202102 202103
0 AA High 2.0 2.0 2.0
1 BB Low 4.0 4.0 4.0
Since there are only two rows per grouping and they are ordered, a groupby with the relevant columns, combined with pipe should suffice:
(df.iloc[:, 1:]
.groupby(['Product', 'Risk'])
.pipe(lambda df: df.first()/df.last())
)
202101 202102 202103
Product Risk
AA High 2.0 2.0 2.0
BB Low 4.0 4.0 4.0
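If the row order within each group cannot be relied on, the two metrics can be selected by label instead of by position. This is a minimal sketch using the data from the question; the names `wide` and `ratio` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Promo'],
    'Product': ['AA', 'AA', 'BB', 'BB'],
    'Risk': ['High', 'High', 'Low', 'Low'],
    '202101': [200, 100, 400, 100],
    '202102': [200, 100, 400, 100],
    '202103': [200, 100, 400, 100],
})

# Move the identifying columns into the index so the months are the
# only remaining columns.
wide = df.set_index(['Metric', 'Product', 'Risk'])

# xs() selects one metric and drops the 'Metric' level, leaving frames
# indexed by (Product, Risk) that align automatically on division.
ratio = (wide.xs('Total Assets', level='Metric')
         / wide.xs('Total Promo', level='Metric')).reset_index()
print(ratio)
```

Selecting by label means the result does not change if the Total Promo row happens to come before the Total Assets row for some product.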

Product demand down calculation in pandas df without loop

I'm having trouble with shift and diff and I feel it is simple?
Assume I have customers with different product demands, and they get handled with priority top down. I'd like to have it efficient without looping....
df_situation = pd.DataFrame(
{
"cust": [1, 2, 3, 3,4],
"prod": [1, 1, 1, 2,2],
"available": [1000, np.nan, np.nan, 2000, np.nan],
"needed": [200, 300, 1000, 1000,1000],
}
)
My objective is to get some additional columns like this, but it looks like difference calculations and shift operation are in a "chicken and egg problem situation".
Thanks in advance for any hint
leftover_prod is the forward-filled available (ffill) minus the cumulative demand per product (groupby cumsum):
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
a - df_situation.groupby('prod')['demand'].cumsum()
)
0 800.0
1 500.0
2 -500.0
3 1000.0
4 0.0
Name: leftover_prod, dtype: float64
fulfilled_cust is the demand if the previous row's leftover_prod covers it, and otherwise whatever was left over (groupby shift + np.where):
s = (df_situation.groupby('prod')['leftover_prod']
.shift()
.fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
s.ge(df_situation['demand']), df_situation['demand'], s
)
0 200.0
1 300.0
2 500.0
3 1000.0
4 1000.0
Name: fulfilled_cust, dtype: float64
missing_cust is the demand minus the fulfilled_cust:
df_situation['missing_cust'] = (
df_situation['demand'] - df_situation['fulfilled_cust']
)
0 0.0
1 0.0
2 500.0
3 0.0
4 0.0
Name: missing_cust, dtype: float64
Together:
a = df_situation['available'].ffill()
df_situation['leftover_prod'] = (
a - df_situation.groupby('prod')['demand'].cumsum()
)
s = (df_situation.groupby('prod')['leftover_prod']
.shift()
.fillna(df_situation['available']))
df_situation['fulfilled_cust'] = np.where(
s.ge(df_situation['demand']), df_situation['demand'], s
)
df_situation['missing_cust'] = (
df_situation['demand'] - df_situation['fulfilled_cust']
)
cust prod available demand leftover_prod fulfilled_cust missing_cust
0 1 1 1000.0 200 800.0 200.0 0.0
1 2 1 NaN 300 500.0 300.0 0.0
2 3 1 NaN 1000 -500.0 500.0 500.0
3 3 2 2000.0 1000 1000.0 1000.0 0.0
4 4 2 NaN 1000 0.0 1000.0 0.0
imports and DataFrame used:
import numpy as np
import pandas as pd
df_situation = pd.DataFrame({
"cust": [1, 2, 3, 3, 4],
"prod": [1, 1, 1, 2, 2],
"available": [1000, np.nan, np.nan, 2000, np.nan],
"demand": [200, 300, 1000, 1000, 1000],
})
(changed "needed" to "demand" as it appears in image.)
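The same three steps can be wrapped in a small function. This is a sketch under assumptions rather than the answer's exact code: `allocate` is a hypothetical helper name, and the `clip(lower=0)` call is an addition that keeps the amount handed out from going negative if more customers follow after a product has run out.

```python
import numpy as np
import pandas as pd


def allocate(df):
    """Hypothetical helper: serve demand top-down within each product."""
    out = df.copy()
    # Stock is only given on the first row of each product; carry it down.
    available = out['available'].ffill()
    # Stock remaining after each customer (may go negative).
    out['leftover_prod'] = available - out.groupby('prod')['demand'].cumsum()
    # Stock on hand *before* each customer: the previous leftover, or the
    # initial stock for the first customer of a product.
    before = (out.groupby('prod')['leftover_prod']
                 .shift()
                 .fillna(out['available']))
    # Assumption: nothing can be handed out once the stock is exhausted.
    before = before.clip(lower=0)
    out['fulfilled_cust'] = np.minimum(out['demand'], before)
    out['missing_cust'] = out['demand'] - out['fulfilled_cust']
    return out


df_situation = pd.DataFrame({
    "cust": [1, 2, 3, 3, 4],
    "prod": [1, 1, 1, 2, 2],
    "available": [1000, np.nan, np.nan, 2000, np.nan],
    "demand": [200, 300, 1000, 1000, 1000],
})
result = allocate(df_situation)
print(result)
```

On the sample data this gives the same columns as the step-by-step version, since the stock on hand never drops below zero before a customer is served.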

How to use apply for multiple Pandas dataset columns?

I am trying hard to fill the NaN values in some columns, selected from a predefined list. The code always takes the else path and never makes the intended modifications...
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': [0.0, np.nan, np.nan, 100],
'C': [20, 0.0002, 10000, np.nan],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
num_cols = ['B', 'C']
fill_mean = lambda col: col.fillna(col.mean()) if col.name in num_cols else col
df2.apply(fill_mean, axis=1)
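You can do this much more simply using:
df1.fillna(df1.mean(numeric_only=True))
This fills the NaNs in the numeric columns with the column mean (numeric_only=True skips the string columns, which newer pandas versions no longer drop silently):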
A B C D
0 A0 0.0 20.000000 D0
1 A1 50.0 0.000200 D1
2 A2 50.0 10000.000000 D2
3 A3 100.0 3340.000067 D3
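As for why the original lambda always ends up in the else branch: with axis=1, apply passes each row as a Series, and the row's name is its index label (0, 1, 2, 3), which is never in num_cols. Here is a minimal sketch of two ways to restrict the fill to the listed columns; the names `filled_apply` and `filled_subset` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': [0.0, np.nan, np.nan, 100],
                    'C': [20, 0.0002, 10000, np.nan],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
num_cols = ['B', 'C']


def fill_mean(col):
    # With the default axis=0, `col` is a whole column and col.name is
    # the column label, so the membership test can succeed.
    return col.fillna(col.mean()) if col.name in num_cols else col


# Option 1: keep apply, but iterate over columns (axis=0 is the default).
filled_apply = df1.apply(fill_mean)

# Option 2: skip apply and fill only the selected columns directly.
filled_subset = df1.copy()
filled_subset[num_cols] = df1[num_cols].fillna(df1[num_cols].mean())

print(filled_apply)
```

Both options leave columns A and D untouched and only fill the columns named in num_cols.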
I am not sure whether your desired output is just the mean of all columns (a single row). If that is the case, the solution below may help.
df = df1.select_dtypes(include='float').mean().to_frame().T
df = pd.concat([df, df.reindex(columns = df1.select_dtypes(exclude='float').columns)], axis=1, sort=False)
print(df)
B C A D
0 50.0 3340.000067 NaN NaN

pd.df find rows pairwise using groupby and change bogus values

My pd.DataFrame looks like this example but has about 10 million rows, so I am looking for an efficient solution.
import pandas as pd
df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
'type': ['c', 'p', 'c', 'p', 'c', 'p'],
'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df.index = pd.to_datetime(df.index)
df.opt_expiry = pd.to_datetime(df.opt_expiry)
Out[2]:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.001 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
here is what I am looking to achieve:
1) find the pairs with identical timestamp, opt_expiry and strike:
groups = df.groupby(['timestamp','opt_expiry','strike'])
2) for each group, check whether the sum of the absolute deltas equals 1. If it does not, find the maximum of the two sigma values and assign it to both rows as the new, correct sigma. Pseudocode:
for group in groups:
# if sum of absolute deltas != 1
if (abs(group.delta[0]) + abs(group.delta[1])) != 1:
correct_sigma = group.sigma.max()
group.sigma = correct_sigma
Expected output:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.170 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
Revised answer. I believe there could be a shorter answer out there. Maybe put it up as bounty
Data
df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
'type': ['c', 'p', 'c', 'p', 'c', 'p'],
'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df
Working
Absolute delta for each row
df['absdelta']=df['delta'].abs()
Absolute delta sum and maximum sigma for each group in a new dataframe df2
df2=df.groupby(['timestamp','opt_expiry','strike']).agg({'absdelta':'sum','sigma':'max'})#.reset_index().drop(columns=['timestamp','opt_expiry'])
df2
Merge df2 with df
df3=df.merge(df2, left_on='strike', right_on='strike',
suffixes=('', '_right'))
df3
mask groups with sum absolute delta not equal to 1
m=df3['absdelta_right']!=1
m
Using mask, apply maximum sigma to entities in groups masked above
df3.loc[m,'sigma']=df3.loc[m,'sigma_right']
Slice to return to original dataframe
df3.iloc[:,:-4]
Output
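The merge can be avoided with groupby transform, which returns the group-level values already aligned to the original rows. This is a minimal sketch and not the answer's code: the `np.isclose` comparison is an assumption added to guard against floating-point rounding in the delta sums, and the names `keys`, `abs_sum`, `max_sigma` and `bad` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'timestamp': ['2004-09-06', '2004-09-06', '2004-09-06',
                  '2004-09-06', '2004-09-07', '2004-09-07'],
    'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16',
                   '2005-12-16', '2005-06-17', '2005-06-17'],
    'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
    'type': ['c', 'p', 'c', 'p', 'c', 'p'],
    'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
    'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4],
}).set_index('timestamp')

keys = ['timestamp', 'opt_expiry', 'strike']

# Per-group values, broadcast back to every row of the group.
abs_sum = (df.assign(absdelta=df['delta'].abs())
             .groupby(keys)['absdelta'].transform('sum'))
max_sigma = df.groupby(keys)['sigma'].transform('max')

# Rows whose group does not add up to 1 (within float tolerance).
bad = ~np.isclose(abs_sum.to_numpy(), 1.0)

# Overwrite sigma only for those rows; np.where works positionally, so
# the repeated timestamps in the index are not a problem.
df['sigma'] = np.where(bad, max_sigma.to_numpy(), df['sigma'].to_numpy())
print(df)
```

Grouping on all three keys also avoids matching rows that merely share a strike across different timestamps or expiries.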

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a DataFrame with a categorical column and a numerical column, with the index set to time data:
df = pd.DataFrame({
'date': [
'2013-03-01 ', '2013-03-02 ',
'2013-03-01 ', '2013-03-02',
'2013-03-01 ', '2013-03-02 '
],
'Kind': [
'A', 'B', 'A', 'B', 'B', 'B'
],
'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
Kind Values
date
2013-03-01 A 1.0
2013-03-02 B 1.5
2013-03-01 A 2.0
2013-03-02 B 3.0
2013-03-01 B 5.0
2013-03-02 B 3.0
My aim is to achieve the below data-frame:
A_count B_count A_Val max B_Val max
date
2013-03-01 2 1 2 5
2013-03-02 0 3 0 3
The result should also have the time as its index. I note that if we use
data = pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get:
Kind
date Kind
2013-03-01 A 2
B 1
2013-03-02 B 3
Use DataFrame.pivot_table, then flatten the MultiIndex in the columns with a list comprehension:
df = pd.DataFrame({
'date': [
'2013-03-01 ', '2013-03-02 ',
'2013-03-01 ', '2013-03-02',
'2013-03-01 ', '2013-03-02 '
],
'Kind': [
'A', 'B', 'A', 'B', 'B', 'B'
],
'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
# setting the index is optional here, so it can be omitted
#df = df.set_index('date')
df = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count','max'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
print (df)
A_count B_count A_max B_max
date
2013-03-01 2.0 1.0 2.0 5.0
2013-03-02 NaN 3.0 NaN 3.0
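The desired output in the question shows 0 rather than NaN for kinds that do not occur on a given day. Here is a minimal sketch of one way to get that, assuming 0 is an acceptable placeholder for both the count and the max; it passes `fill_value=0` to pivot_table and uses date strings without the trailing spaces. The name `out` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2013-03-01', '2013-03-02',
             '2013-03-01', '2013-03-02',
             '2013-03-01', '2013-03-02'],
    'Kind': ['A', 'B', 'A', 'B', 'B', 'B'],
    'Values': [1, 1.5, 2, 3, 5, 3],
})
df['date'] = pd.to_datetime(df['date'])

# fill_value replaces the missing (date, Kind) combinations in the result.
out = df.pivot_table(index='date', columns='Kind', values='Values',
                     aggfunc=['count', 'max'], fill_value=0)

# Column tuples are (aggfunc, Kind); flatten them to 'Kind_aggfunc'.
out.columns = [f'{kind}_{agg}' for agg, kind in out.columns]
print(out)
```

A max of 0 for an absent kind is only a placeholder, so keep NaN for the max columns if 0 could be mistaken for a real value.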
Another solution with Grouper for resample by days:
df = df.set_index('date')
df = df.groupby([pd.Grouper(freq='D'), 'Kind'])['Values'].agg(['count','max']).unstack()
df.columns = [f'{b}_{a}' for a, b in df.columns]