Groupby sum and difference of rows in a pandas dataframe - pandas

I have a dataframe:
df = pd.DataFrame({
'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Int'],
'Product': ['AA', 'AA', 'BB', 'AA'],
'Risk': ['High', 'High','Low', 'High'],
'202101': [ 130, 200, 190, 210],
'202102': [ 130, 200, 190, 210],
'202103': [ 130, 200, 190, 210],})
I would like to group by Product and Risk, sum the entries in Total Assets and Total Promo, and subtract the entries in Total Int from that sum. I could multiply all Total Int rows by -1 and sum the result, but I wanted to know if there is a more direct way to do so.
df.groupby(['Product', 'Risk']).sum()
The actual dataset is large, and multiplying certain rows by -1 would introduce complexity.
The output would look like:
df = pd.DataFrame({
'Product': ['AA', 'BB'],
'Risk': ['High', 'Low'],
'202101': [ 120, 190],
'202102': [ 120, 190],
'202103': [ 120, 190],})

You can multiply your Total Int rows by -1 and then sum:
df.loc[df['Metric'] == 'Total Int', df.select_dtypes('number').columns] *= -1
# OR
df.loc[df['Metric'] == 'Total Int', df.filter(regex=r'\d{6}').columns] *= -1
>>> df.groupby(['Product', 'Risk']).sum(numeric_only=True)  # numeric_only skips the string Metric column
202101 202102 202103
Product Risk
AA High 120 120 120
BB Low 190 190 190
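If mutating df in place is undesirable, the same sign flip can be applied on the fly. This is a minimal sketch, assuming the value columns are exactly the three month columns from the question; the names value_cols, sign, and signed_totals are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Int'],
    'Product': ['AA', 'AA', 'BB', 'AA'],
    'Risk': ['High', 'High', 'Low', 'High'],
    '202101': [130, 200, 190, 210],
    '202102': [130, 200, 190, 210],
    '202103': [130, 200, 190, 210],
})

value_cols = ['202101', '202102', '202103']

# -1 for rows that should be subtracted, +1 for rows that should be added
sign = pd.Series(np.where(df['Metric'].eq('Total Int'), -1, 1), index=df.index)

signed_totals = (
    df[value_cols]
    .mul(sign, axis=0)                      # scale each row by its sign
    .groupby([df['Product'], df['Risk']])   # group by the original key columns
    .sum()
    .reset_index()
)
print(signed_totals)
```

Because the multiplication happens on a temporary copy, the Total Int values in df itself stay positive.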

In your actual dataset, do you have any groups that only have one row? The following solution works only if every (Product, Risk) group contains both rows to add and rows to subtract, so that diff() doesn't return NaN. This is why the BB/Low row is missing from the output below, but I imagine your groups have more than one row in your large dataset.
IIUC, create an array s that labels the rows to add ('B') and the rows to subtract ('A'), group by it together with Product and Risk, sum, and then take the diff within each (Product, Risk) group:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Int'],
'Product': ['AA', 'AA', 'BB', 'AA'],
'Risk': ['High', 'High','Low', 'High'],
'Col1': [ 130, 200, 190, 210],
'Col2': [ 130, 200, 190, 210],
'Col3': [ 130, 200, 190, 210],})
s = np.where(df['Metric'].isin(['Total Assets', 'Total Promo']), 'B', 'A')
cols = ['Product', 'Risk']
(df.groupby(cols + [s]).sum(numeric_only=True)  # numeric_only skips the string Metric column
.groupby(cols).diff()
.dropna().reset_index().drop('level_2', axis=1))
Out[1]:
Product Risk Col1 Col2 Col3
0 AA High 120.0 120.0 120.0
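If some groups have only one kind of row, as with BB/Low here, one alternative is to sum the two sides separately and subtract with a fill value. This is a sketch rather than the method above; is_plus, plus, minus, and result are hypothetical names, and every metric other than Total Assets and Total Promo is assumed to belong on the subtracting side:

```python
import pandas as pd

df = pd.DataFrame({
    'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Int'],
    'Product': ['AA', 'AA', 'BB', 'AA'],
    'Risk': ['High', 'High', 'Low', 'High'],
    'Col1': [130, 200, 190, 210],
    'Col2': [130, 200, 190, 210],
    'Col3': [130, 200, 190, 210],
})

cols = ['Product', 'Risk']
value_cols = ['Col1', 'Col2', 'Col3']

# True for rows that are added; all other rows are treated as subtracted
is_plus = df['Metric'].isin(['Total Assets', 'Total Promo'])

plus = df[is_plus].groupby(cols)[value_cols].sum()
minus = df[~is_plus].groupby(cols)[value_cols].sum()

# fill_value=0 keeps groups that appear on only one side
result = plus.sub(minus, fill_value=0).reset_index()
print(result)
```

The values come back as floats because aligning the two frames introduces missing entries before they are filled.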

How about this as a solution?
(df.
melt(['Metric', 'Product', 'Risk']).
pivot(index=['Product', 'Risk', 'variable'], columns= 'Metric', values = 'value').
assign(Total = lambda df: df['Total Assets'].fillna(0)+df['Total Promo'].fillna(0) - df['Total Int'].fillna(0)).
drop(columns = ['Total Assets', 'Total Promo', 'Total Int']).
reset_index().
pivot(index=['Product', 'Risk'], columns= 'variable', values = 'Total')
)
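A variant of the same reshape idea, sketched with pivot_table so that missing metrics are filled with 0 in one step and duplicate Metric rows within a group are summed. The names long, net, result, and the 'month' label are assumptions made for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Int'],
    'Product': ['AA', 'AA', 'BB', 'AA'],
    'Risk': ['High', 'High', 'Low', 'High'],
    '202101': [130, 200, 190, 210],
    '202102': [130, 200, 190, 210],
    '202103': [130, 200, 190, 210],
})

# Long form: one row per (Metric, Product, Risk, month)
long = (
    df.melt(id_vars=['Metric', 'Product', 'Risk'],
            var_name='month', value_name='value')
      .pivot_table(index=['Product', 'Risk', 'month'], columns='Metric',
                   values='value', aggfunc='sum', fill_value=0)
)

# One column per metric, so the arithmetic is explicit
net = long['Total Assets'] + long['Total Promo'] - long['Total Int']

# Back to one column per month
result = net.unstack('month').reset_index()
print(result)
```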

Related

How to groupby, but only consecutively

I have a dataframe like this:
data = {'costs': [150, 400, 300, 500, 350], 'month':[1, 2, 2, 1, 1]}
df = pd.DataFrame(data)
I want to use groupby(['month']).sum(), but the first row should not be combined with the fourth and fifth rows, so the resulting costs would be:
list(df['costs'])= [150, 700, 850]
Try:
x = (
df.groupby((df.month != df.month.shift(1)).cumsum())
.agg({"costs": "sum", "month": "first"})
.reset_index(drop=True)
)
print(x)
Prints:
costs month
0 150 1
1 700 2
2 850 1
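To see why this works, it can help to look at the grouping key on its own. The sketch below breaks the expression into steps; starts_new_run, run_id, and run_totals are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'costs': [150, 400, 300, 500, 350],
                   'month': [1, 2, 2, 1, 1]})

# True wherever the month differs from the previous row (the first row is always True)
starts_new_run = df['month'] != df['month'].shift()

# The running count of True values gives each consecutive run its own id
run_id = starts_new_run.cumsum()
print(run_id.tolist())  # [1, 2, 2, 3, 3]

# Grouping by the run id sums only neighbouring rows with the same month
run_totals = df.groupby(run_id)['costs'].sum().tolist()
print(run_totals)  # [150, 700, 850]
```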

Assign Random Number between two value conditionally

I have a dataframe:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44']})
I wish to replace the values in the AGE column with integers drawn from the corresponding range. So, for example, I wish to replace each value with age range '35-44' by a random integer between 35 and 44.
I tried:
df.loc[df["AGE"]== '35-44', 'AGE'] = random.randint(35, 44)
But it picks the same value for each row. I would like it to randomly pick a different value for each row.
I get:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['38', '20-34', '38', '38', '45-70', '38']})
But I would like to get something like the following. I don't much care how the values are distributed, as long as they are in the range that I assign:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['36', '20-34', '39', '38', '45-70', '45']})
The code
random.randint(35, 44)
Produces a single random value making the statement analogous to:
df.loc[df["AGE"]== '35-44', 'AGE'] = 38 # some constant
We need a collection of values that is the same length as the values to fill. We can use np.random.randint instead:
import numpy as np
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = np.random.randint(35, 44, m.sum())
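(Series.sum is used to "count" the number of True values in the Series, since True is 1 and False is 0. Note that np.random.randint excludes the upper bound, so this call draws from 35 to 43; pass 45 as the upper bound if 44 should be possible.)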
df:
Prod Line AGE
0 abc Revenues 40
1 qrt EBT 20-34
2 xyz Expenses 41
3 xam Revenues 35
4 asc EBT 45-70
5 yat Expenses 36
*Reproducible with np.random.seed(26)
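With NumPy's newer Generator API, the upper bound can be made inclusive explicitly, which matches the behaviour of random.randint. A minimal sketch, with an arbitrarily chosen seed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44'],
})

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

m = df['AGE'] == '35-44'
# endpoint=True makes 44 a possible value; one draw per matching row
df.loc[m, 'AGE'] = rng.integers(35, 44, size=m.sum(), endpoint=True)
print(df)
```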
Naturally, using the filter on both sides of the expression with apply would also work:
import random
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = df.loc[m, 'AGE'].apply(lambda _: random.randint(35, 44))
df:
Prod Line AGE
0 abc Revenues 36
1 qrt EBT 20-34
2 xyz Expenses 37
3 xam Revenues 43
4 asc EBT 45-70
5 yat Expenses 44
*Reproducible with random.seed(28)
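If every range should be replaced rather than just '35-44', the bounds can be parsed from the strings and sampled in one call. This is a sketch under the assumption that every AGE value has the form 'low-high'; the column name AGE_sampled is hypothetical and is used here only to keep the original column intact:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44'],
})

# Split 'low-high' into two integer columns
bounds = df['AGE'].str.split('-', expand=True).astype(int)
low, high = bounds[0].to_numpy(), bounds[1].to_numpy()

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# Array bounds broadcast, so each row gets a draw from its own inclusive range
df['AGE_sampled'] = rng.integers(low, high, endpoint=True)
print(df)
```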

Groupby and Divide One Group of Rows by Another Group

I have a dataframe:
df = pd.DataFrame({
'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Promo'],
'Product': ['AA', 'AA', 'BB', 'BB'],
'Risk': ['High', 'High','Low', 'Low'],
'202101': [ 200, 100, 400, 100],
'202102': [ 200, 100, 400, 100],
'202103': [ 200, 100, 400, 100]})
I wish to group by Product and Risk and divide the Total Assets rows by the Total Promo rows. I would like the output to be like this:
df = pd.DataFrame({
'Product': ['AA', 'BB'],
'Risk': ['High', 'Low',],
'202101': [ 2, 4],
'202102': [ 2, 4],
'202103': [ 2, 4]})
So far my approach has been to first melt into long form, but I can't seem to get Total Assets and Total Promo into separate columns so that I can divide one by the other:
df = pd.melt(df, id_vars=['Metric', 'Product', 'Risk'],
value_vars = ["202101", "202102", "202103"],
var_name='Months', value_name='Balance')
Here's one way:
df1 = df.set_index(['Metric', 'Product', 'Risk']).stack().unstack(0)
df = (df1['Total Assets'] / df1['Total Promo']).unstack(-1).reset_index()
OUTPUT:
Product Risk 202101 202102 202103
0 AA High 2.0 2.0 2.0
1 BB Low 4.0 4.0 4.0
Since there are only two rows per grouping and they are ordered, a groupby with the relevant columns, combined with pipe should suffice:
(df.iloc[:, 1:]
.groupby(['Product', 'Risk'])
.pipe(lambda df: df.first()/df.last())
)
202101 202102 202103
Product Risk
AA High 2.0 2.0 2.0
BB Low 4.0 4.0 4.0
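The first()/last() approach depends on the row order within each group. A sketch that selects each metric by name instead, so the order does not matter; indexed, assets, promo, and ratio are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Promo'],
    'Product': ['AA', 'AA', 'BB', 'BB'],
    'Risk': ['High', 'High', 'Low', 'Low'],
    '202101': [200, 100, 400, 100],
    '202102': [200, 100, 400, 100],
    '202103': [200, 100, 400, 100],
})

indexed = df.set_index(['Metric', 'Product', 'Risk'])

# Each cross-section is indexed by (Product, Risk) only
assets = indexed.xs('Total Assets', level='Metric')
promo = indexed.xs('Total Promo', level='Metric')

# Division aligns on the shared (Product, Risk) index
ratio = (assets / promo).reset_index()
print(ratio)
```

This assumes each (Product, Risk) pair has at most one row per metric; with duplicates, the rows would need to be aggregated first.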

Transform and reshape a Data Frame from wide to long with additional column

I have a data frame that I want to transform from wide into a long format. But I do not want to use all columns.
In detail, I want to melt the following data frame
import pandas as pd
data = {'year': [2014, 2018,2020,2017],
'model':[12, 14,21,8],
'amount': [100, 120,80,210],
'quality': ["low", "high","medium","high"]
}
# pass column names in the columns parameter
df = pd.DataFrame.from_dict(data)
print(df)
into this data frame:
data2 = {'year': [2014, 2014, 2018, 2018, 2020, 2020, 2017, 2017],
'variable': ["model", "amount", "model", "amount", "model", "amount", "model", "amount"],
'value':[12, 100, 14, 120, 21, 80, 8, 210],
'quality': ["low", "low", "high", "high", "medium", "medium", "high", "high"]
}
# pass column names in the columns parameter
df2 = pd.DataFrame.from_dict(data2)
print(df2)
I tried pd.melt() with different combinations of the input parameters, and it more or less works if I leave the quality column out. But the desired result requires the quality column, so I cannot skip it. Furthermore, I tried df.pivot(), df.pivot_table(), and pd.wide_to_long(), all in several combinations, but I still do not get the desired result. Maybe pushing the columns year and quality into the data frame index would help before performing any pd.melt() operations?
Thank you very much for your help in advance!
df3 = pd.melt(df, id_vars=['year', 'quality'], var_name='variable', value_name='value')
df3 = df3[['year', 'variable', 'value', 'quality']]
df3.sort_values('year', inplace=True)
print(df3)
Output (for df3):
year variable value quality
0 2014 model 12 low
4 2014 amount 100 low
3 2017 model 8 high
7 2017 amount 210 high
1 2018 model 14 high
5 2018 amount 120 high
2 2020 model 21 medium
6 2020 amount 80 medium
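Sorting by year changes the row order relative to the desired df2 (2014, 2018, 2020, 2017). If the original order matters, one option is to keep the original index during the melt and sort on it with a stable algorithm. A sketch, assuming a pandas version in which melt accepts ignore_index:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2014, 2018, 2020, 2017],
    'model': [12, 14, 21, 8],
    'amount': [100, 120, 80, 210],
    'quality': ['low', 'high', 'medium', 'high'],
})

df3 = (
    df.melt(id_vars=['year', 'quality'], var_name='variable',
            value_name='value', ignore_index=False)  # keep the original row labels
      .sort_index(kind='mergesort')                  # stable: model stays before amount
      .reset_index(drop=True)
      [['year', 'variable', 'value', 'quality']]
)
print(df3)
```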

How can rows of a pandas DataFrame all be plotted together as lines?

Let's say we have the following DataFrame:
import pandas as pd
df = pd.DataFrame(
[
['Norway' , 'beta', 30.0 , 31.0, 32.0, 32.4, 32.5, 32.1],
['Denmark' , 'beta', 75.7 , 49.1, 51.0, 52.3, 50.0, 47.9],
['Switzerland', 'beta', 46.9 , 44.0, 43.5, 42.3, 41.8, 43.4],
['Finland' , 'beta', 29.00, 29.8, 27.0, 26.0, 25.3, 24.8],
['Netherlands', 'beta', 30.2 , 30.1, 28.5, 28.2, 28.0, 28.0],
],
columns = [
'country',
'run_type',
'score A',
'score B',
'score C',
'score D',
'score E',
'score F'
]
)
df
How could the score values be plotted as lines, where each line corresponds to a country?
Since you tagged matplotlib, here is a solution using plt.plot(). The idea is to plot the lines row-wise using iloc:
import matplotlib.pyplot as plt
# define DataFrame here
df1 = df.filter(like='score')
for i in range(len(df1)):
plt.plot(df1.iloc[i], label=df['country'][i])
plt.legend()
plt.show()
Try to plot the transpose of the dataframe:
# the score columns, modify if needed
score_cols = df.columns[df.columns.str.contains('score')]
df.set_index('country')[score_cols].T.plot()
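DataFrame.plot draws one line per column and uses the index for the x-axis, which is why transposing helps here. The sketch below only builds and inspects the transposed table (the name lines is illustrative); calling lines.plot() on it would then draw one line per country:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['Norway', 'beta', 30.0, 31.0, 32.0, 32.4, 32.5, 32.1],
        ['Denmark', 'beta', 75.7, 49.1, 51.0, 52.3, 50.0, 47.9],
        ['Switzerland', 'beta', 46.9, 44.0, 43.5, 42.3, 41.8, 43.4],
        ['Finland', 'beta', 29.0, 29.8, 27.0, 26.0, 25.3, 24.8],
        ['Netherlands', 'beta', 30.2, 30.1, 28.5, 28.2, 28.0, 28.0],
    ],
    columns=['country', 'run_type', 'score A', 'score B', 'score C',
             'score D', 'score E', 'score F'],
)

# Countries become columns (one line each); score labels become the x-axis
lines = df.set_index('country').filter(like='score').T

print(lines.columns.tolist())
print(lines.index.tolist())
```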