Groupby and Divide One Group of Rows by Another Group - pandas

I have a dataframe:
import pandas as pd

df = pd.DataFrame({
    'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Promo'],
    'Product': ['AA', 'AA', 'BB', 'BB'],
    'Risk': ['High', 'High', 'Low', 'Low'],
    '202101': [200, 100, 400, 100],
    '202102': [200, 100, 400, 100],
    '202103': [200, 100, 400, 100]})
I wish to group by Product and Risk and divide the Total Assets rows by the Total Promo rows. I would like the output to be like this:
df = pd.DataFrame({
    'Product': ['AA', 'BB'],
    'Risk': ['High', 'Low'],
    '202101': [2, 4],
    '202102': [2, 4],
    '202103': [2, 4]})
So far my approach has been to first melt into long form, but I can't seem to get Total Assets and Total Promo into columns so that I can divide one by the other:
df = pd.melt(df, id_vars=['Metric', 'Product', 'Risk'],
             value_vars=['202101', '202102', '202103'],
             var_name='Months', value_name='Balance')

Here's one way:
df1 = df.set_index(['Metric', 'Product', 'Risk']).stack().unstack(0)
df = (df1['Total Assets'] / df1['Total Promo']).unstack(-1).reset_index()
OUTPUT:
Product Risk 202101 202102 202103
0 AA High 2.0 2.0 2.0
1 BB Low 4.0 4.0 4.0

Since there are only two rows per grouping and they are ordered, a groupby on the relevant columns, combined with pipe, should suffice:
(df.iloc[:, 1:]
   .groupby(['Product', 'Risk'])
   .pipe(lambda g: g.first() / g.last())
)
202101 202102 202103
Product Risk
AA High 2.0 2.0 2.0
BB Low 4.0 4.0 4.0
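
For completeness, the melt approach from the question can also be finished: pivot Metric back out into columns and divide. This is just a sketch, not from either answer above; it needs pandas >= 1.1 for the list-valued pivot index, and the long/wide/out names are only for illustration:

long = pd.melt(df, id_vars=['Metric', 'Product', 'Risk'],
               value_vars=['202101', '202102', '202103'],
               var_name='Months', value_name='Balance')
wide = long.pivot(index=['Product', 'Risk', 'Months'],
                  columns='Metric', values='Balance')
# divide the two metric columns, then move Months back out to columns
out = (wide['Total Assets'] / wide['Total Promo']).unstack('Months').reset_index()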

Related

Pandas: How to Apply a Condition on Every Value of a Dataframe, Based on a Second Symmetrical Dataframe

I have a dictionary with 2 DataFrames: "quantity variation in %" and "prices". They are both symmetrical DFs (same shape, index, and columns).
Let's say I want to set the price = 0 wherever the quantity variation in percentage is greater than or equal to 100 %.
import numpy as np
import pandas as pd

d = {'qty_pct': pd.DataFrame({'2020': [200, 0.5, 0.4],
                              '2021': [0.9, 0.5, 500],
                              '2022': [0.9, 300, 0.4]}),
     'price': pd.DataFrame({'2020': [-6, -2, -9],
                            '2021': [ 2,  3,  4],
                            '2022': [ 4,  6,  8]})}
# I had something like that in mind ...
df = d['price'].applymap(lambda x: 0 if x[d['qty_pct']] >= 1 else x)
I want to obtain this DF:
price = pd.DataFrame({'2020': [ 0, -2, -9],
                      '2021': [ 2,  3,  0],
                      '2022': [ 4,  0,  8]})
P.S. If by any chance there is a way to do this on asymmetrical DFs, I would be curious to see how it's done.
Assuming price and qty_pct always have the same dimensions, you can just do:
d['price'][d['qty_pct'] >= 1] = 0
d['price']
2020 2021 2022
0 0 2 4
1 -2 3 0
2 -9 0 8
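
For the P.S. about asymmetrical DFs: one option is to align the mask onto price first, so only matching cells can be zeroed. A sketch using reindex_like and mask, assuming prices with no matching qty_pct cell should be left untouched:

# align qty_pct onto price's index/columns; unmatched cells become NaN
aligned = d['qty_pct'].reindex_like(d['price'])
# NaN >= 1 evaluates to False, so unmatched prices are kept as-is
d['price'] = d['price'].mask(aligned >= 1, 0)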

Divide Dataframe Multiindex level 1 by the sum of level 0

I have created a DataFrame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'env': ['us', 'us', 'us', 'eu'],
        'name': ['first', 'first', 'first', 'second'],
        'default_version': ['2.0.1', '2.0.1', '2.0.1', '2.1.1'],
        'version': ['2.2.1', '2.2.2.4', '2.3', '2.2.24'],
        'count_events': [1, 8, 102, 244],
        'unique_users': [1, 3, 72, 111]
    }
)
df = df.pivot_table(index=['env', 'name', 'default_version'],
                    columns='version', values=['count_events', 'unique_users'],
                    aggfunc=np.sum)
Next, I'm looking to find the sum of all count_events at level=1 and the sum of all unique_users at level=1, so I can find the percentage of count_events and unique_users in each version.
I have generated the sums with the following code, but I don't know how to generate the percentages.
sums = df.sum(level=0, axis=1)  # deprecated since pandas 1.3; on >= 2.0 use: df.T.groupby(level=0).sum().T
sums.columns = pd.MultiIndex.from_product([sums.columns, ['SUM']])
final_result = pd.concat([df, sums], axis=1)
It would not be a problem to change the sum code if necessary.
You can reindex your sums to match the shape of the original data using a combination of reindex and set_axis:
# note: this uses sums as first computed, i.e. with its original
# single-level columns, before the from_product renaming above
fraction = (
    df / (
        sums
        .reindex(df.columns.get_level_values(0), axis=1)
        .set_axis(df.columns, axis=1)
    )
).fillna(0)
print(fraction)
count_events unique_users
version 2.2.1 2.2.2.4 2.2.24 2.3 2.2.1 2.2.2.4 2.2.24 2.3
env name default_version
eu second 2.1.1 0.000000 0.000000 1.0 0.000000 0.000000 0.000000 1.0 0.000000
us first 2.0.1 0.009009 0.072072 0.0 0.918919 0.013158 0.039474 0.0 0.947368
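
If you'd rather not build sums at all, here's a sketch of an alternative (not from the original answer, and denom is just an illustrative name): transpose so the measure level sits on the rows, let a level-0 groupby transform('sum') broadcast the per-measure totals back to every column, then divide:

# per-measure totals broadcast to the shape of df, then elementwise divide
denom = df.T.groupby(level=0).transform('sum').T
fraction = (df / denom).fillna(0)

This should agree with the reindex/set_axis version above.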

Groupby sum and difference of rows in a pandas dataframe

I have a dataframe:
df = pd.DataFrame({
    'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Int'],
    'Product': ['AA', 'AA', 'BB', 'AA'],
    'Risk': ['High', 'High', 'Low', 'High'],
    '202101': [130, 200, 190, 210],
    '202102': [130, 200, 190, 210],
    '202103': [130, 200, 190, 210]})
I would like to group by Product and Risk, sum the entries in Total Assets and Total Promo, and subtract the Total Int entries from that sum. I could multiply all rows with Total Int by -1 and sum the result, but I wanted to know if there was a direct way to do so.
df.groupby(['Product', 'Risk']).sum()
The actual dataset is large, and multiplying certain rows by -1 would introduce complexity.
The output would look like:
df = pd.DataFrame({
    'Product': ['AA', 'BB'],
    'Risk': ['High', 'Low'],
    '202101': [120, 190],
    '202102': [120, 190],
    '202103': [120, 190]})
You can multiply your Total Int rows by -1:
df.loc[df['Metric'] == 'Total Int', df.select_dtypes('number').columns] *= -1
# OR
df.loc[df['Metric'] == 'Total Int', df.filter(regex=r'\d{6}').columns] *= -1
>>> df.groupby(['Product', 'Risk']).sum()
202101 202102 202103
Product Risk
AA High 120 120 120
BB Low 190 190 190
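
A variant that leaves df untouched (a sketch; the sign/num_cols/out names are only for illustration): map Metric to a ±1 sign, multiply it into the numeric columns row-wise, then aggregate:

# -1 for Total Int, +1 for every other metric
sign = df['Metric'].map({'Total Int': -1}).fillna(1)
num_cols = df.select_dtypes('number').columns
out = df[num_cols].mul(sign, axis=0).groupby([df['Product'], df['Risk']]).sum()

This gives the same 120/190 result without mutating the original frame.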
In your actual dataset, do you have any groups that only have one row? The following solution works if all groups have more than one row, so that diff() doesn't return NaN. This is why the second row of the output is missing here, but I imagine your groups have more than one row in your large dataset.
IIUC, create a series s that differentiates the two groups and take the diff after a groupby of the sum:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Int'],
    'Product': ['AA', 'AA', 'BB', 'AA'],
    'Risk': ['High', 'High', 'Low', 'High'],
    'Col1': [130, 200, 190, 210],
    'Col2': [130, 200, 190, 210],
    'Col3': [130, 200, 190, 210]})

s = np.where(df['Metric'].isin(['Total Assets', 'Total Promo']), 'B', 'A')
cols = ['Product', 'Risk']
(df.groupby(cols + [s]).sum(numeric_only=True)  # keep the Metric strings out of the sum
   .groupby(cols).diff()
   .dropna().reset_index().drop('level_2', axis=1))
Out[1]:
Product Risk Col1 Col2 Col3
0 AA High 120.0 120.0 120.0
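
If some groups can carry only one of the two labels (as BB/Low does here), a variant of the same idea that subtracts the two slices explicitly with a fill_value avoids dropping those rows. A sketch, reusing s and cols from above (g and out are illustrative names):

g = df.groupby(cols + [s]).sum(numeric_only=True)
# B = Total Assets + Total Promo, A = Total Int; missing slices count as 0
out = g.xs('B', level=2).sub(g.xs('A', level=2), fill_value=0)

With the question's data this also returns the BB/Low row (190 in each column).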
How about this as a solution?
(df
 .melt(['Metric', 'Product', 'Risk'])
 .pivot(index=['Product', 'Risk', 'variable'], columns='Metric', values='value')
 .assign(Total=lambda d: d['Total Assets'].fillna(0)
                         + d['Total Promo'].fillna(0)
                         - d['Total Int'].fillna(0))
 .drop(columns=['Total Assets', 'Total Promo', 'Total Int'])
 .reset_index()
 .pivot(index=['Product', 'Risk'], columns='variable', values='Total')
)
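
For the data in the question, that chain should end up with (dtypes aside, since the pivot introduces NaN and hence floats):

variable      202101  202102  202103
Product Risk
AA      High   120.0   120.0   120.0
BB      Low    190.0   190.0   190.0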

Transform and reshape a Data Frame from wide to long with additional column

I have a data frame that I want to transform from wide into a long format. But I do not want to use all columns.
In detail, I want to melt the following data frame
import pandas as pd
data = {'year': [2014, 2018, 2020, 2017],
        'model': [12, 14, 21, 8],
        'amount': [100, 120, 80, 210],
        'quality': ["low", "high", "medium", "high"]}
df = pd.DataFrame.from_dict(data)
print(df)
into this data frame:
data2 = {'year': [2014, 2014, 2018, 2018, 2020, 2020, 2017, 2017],
         'variable': ["model", "amount", "model", "amount", "model", "amount", "model", "amount"],
         'value': [12, 100, 14, 120, 21, 80, 8, 210],
         'quality': ["low", "low", "high", "high", "medium", "medium", "high", "high"]}
df2 = pd.DataFrame.from_dict(data2)
print(df2)
I tried pd.melt() with different combinations of the input parameters, and it works somehow if I do not take the quality column into consideration, but as the desired result shows, I cannot skip it. Furthermore, I tried df.pivot(), df.pivot_table(), and pd.wide_to_long(), all in several combinations, but somehow I do not get the desired result. Maybe pushing the year and quality columns into the data frame index would help before performing any pd.melt() operations?
Thank you very much for your help in advance!
Pass both year and quality as id_vars to pd.melt, so quality is carried along with every melted row; then reorder the columns and sort:
df3 = pd.melt(df, id_vars=['year', 'quality'], var_name='variable', value_name='value')
df3 = df3[['year', 'variable', 'value', 'quality']]
df3.sort_values('year', inplace=True)
print(df3)
Output (for df3):
year variable value quality
0 2014 model 12 low
4 2014 amount 100 low
3 2017 model 8 high
7 2017 amount 210 high
1 2018 model 14 high
5 2018 amount 120 high
2 2020 model 21 medium
6 2020 amount 80 medium
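
Your set_index idea works too; here's a sketch of the same result via set_index + stack (the rename_axis/rename steps are additions of this sketch to name the levels, not part of the answer above):

df3 = (df.set_index(['year', 'quality'])
         .stack()
         .rename_axis(['year', 'quality', 'variable'])
         .rename('value')
         .reset_index()
         [['year', 'variable', 'value', 'quality']]
         .sort_values('year'))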

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a data frame with a categorical column and a numerical one, with the index set to time data:
df = pd.DataFrame({
    'date': ['2013-03-01', '2013-03-02',
             '2013-03-01', '2013-03-02',
             '2013-03-01', '2013-03-02'],
    'Kind': ['A', 'B', 'A', 'B', 'B', 'B'],
    'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
Kind Values
date
2013-03-01 A 1.0
2013-03-02 B 1.5
2013-03-01 A 2.0
2013-03-02 B 3.0
2013-03-01 B 5.0
2013-03-02 B 3.0
My aim is to achieve the below data-frame:
A_count B_count A_Val max B_Val max
date
2013-03-01 2 1 2 5
2013-03-02 0 3 0 3
which also has the time as index. Here, I note that if we use
data = pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get:
Kind
date Kind
2013-03-01 A 2
B 1
2013-03-02 B 3
Use DataFrame.pivot_table, then flatten the MultiIndex columns with a list comprehension:
(The question's set_index('date') step can be skipped, leaving date as a regular column for pivot_table; if you already set the index, call df = df.reset_index() first.)
df = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count', 'max'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
print(df)
A_count B_count A_max B_max
date
2013-03-01 2.0 1.0 2.0 5.0
2013-03-02 NaN 3.0 NaN 3.0
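
If you want the zeros from your expected output instead of NaN, pivot_table also accepts a fill_value argument:

df = df.pivot_table(index='date', columns='Kind', values='Values',
                    aggfunc=['count', 'max'], fill_value=0)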
Another solution, starting again from the original df and using Grouper to resample by day:
df = df.set_index('date')
df = df.groupby([pd.Grouper(freq='d'), 'Kind'])['Values'].agg(['count','max']).unstack()
df.columns = [f'{b}_{a}' for a, b in df.columns]
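
The NaN cells here have the same origin as in the pivot_table version; unstack also takes a fill_value, so writing .unstack(fill_value=0) yields the zeros from the desired output. For what it's worth, the value_counts attempt from the question can likewise be salvaged for the counts half of the target by unstacking it; a sketch, starting from the date-indexed df from the question (counts is an illustrative name):

counts = df.resample('D')['Kind'].value_counts().unstack(fill_value=0)
counts.columns = [f'{c}_count' for c in counts.columns]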