Grouping by multiple columns to find the max value of another column and representing it columnwise - pandas

Existing dataframe:
id  sub-id  amount
01  A       100
01  A       50
01  B       10
01  B       5
01  B       50
01  C       10
02  A       25
02  A       10

Expected dataframe:
id  sub-id  A_max  B_max  C_max
01  A       100    50     10
02  A       25     0      0
Please note there can be a total of 5 sub-ids, which may or may not be present for every id.
With this code, df.loc[df.groupby(['id','sub-id'])['amount'].idxmax()].reset_index(drop=True), I can get the answer, but I need it in the expected format above.
Please guide.
@jezrael:
Adding on further: if we have the table below:
sub-id  A_max  B_max  C_max
A       100    50     10
A       25     0      0
B       250    10     50
B       100    50     70
and the expected output we need is:
sub-id  A_max  B_max  C_max
A       100    50     10
B       250    50     70
how do we get that?

Use DataFrame.pivot_table with aggfunc='max' and then join the result back to the sub-id values. If you need the first sub-id value per id, use:
df2 = df.pivot_table(index='id',
                     columns='sub-id',
                     values='amount',
                     aggfunc='max', fill_value=0).add_suffix('_max')
df = df.groupby('id', as_index=False)['sub-id'].first().join(df2, on='id')
print (df)
   id sub-id  A_max  B_max  C_max
0   1      A    100     50     10
1   2      A     25      0      0
Or, if you need the sub-id from the row with the maximal amount per id, use DataFrameGroupBy.idxmax:
df2 = df.pivot_table(index='id',
                     columns='sub-id',
                     values='amount',
                     aggfunc='max', fill_value=0).add_suffix('_max')
df = df.loc[df.groupby('id')['amount'].idxmax(), ['id','sub-id']].join(df2, on='id')
print (df)
   id sub-id  A_max  B_max  C_max
0   1      A    100     50     10
6   2      A     25      0      0
EDIT: If you need the maximal values per sub-id, use:
df1 = df.groupby('sub-id', as_index=False).max()
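For the follow-up table, that one-liner already yields the expected output. A minimal runnable sketch (column names assumed as shown above):
import pandas as pd

df = pd.DataFrame({'sub-id': ['A', 'A', 'B', 'B'],
                   'A_max': [100, 25, 250, 100],
                   'B_max': [50, 0, 10, 50],
                   'C_max': [10, 0, 50, 70]})

df1 = df.groupby('sub-id', as_index=False).max()
print(df1)
#   sub-id  A_max  B_max  C_max
# 0      A    100     50     10
# 1      B    250     50     70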

Related

Pandas DataFrame subtract values

I'm new to Python.
I have a data frame (df) with the following structure:
ID  rate  Sequential number
a   150   1
a   150   1
a   50    2
b   250   1
c   25    1
d   25    1
d   40    2
d   30    3
The IDs are customers, rate is the monthly rate, and Sequential number always increases by 1 when the customer changes the monthly rate.
I want to do the following: for every ID, find the maximum value in the column Sequential number and take the associated value in the column rate, then find the minimum value in the column Sequential number and take its associated rate, and subtract the two rates.
At the end I want an additional column on my data frame with the difference of the rates. Maybe a loop could do the following (pseudocode):
for id in df:
    find max() in column Sequential number and get its rate
    - min() in column Sequential number and get its rate
    return difference
The new df_new should be this:
ID  rate  Sequential number  rate_diff
a   150   1                  0
a   150   1                  0
a   50    2                  -100
b   250   1                  0
c   25    1                  0
d   25    1                  0
d   40    2                  0
d   30    3                  5
If an ID has only one entry, the rate_diff should be 0.
I already tried a lambda function:
df['diff_rate'] = df.groupby('ID')['rate'].transform(lambda x: x - x.min())
but this returns:
ID  rate  Sequential number  rate_diff
a   150   1                  100
a   150   1                  100
a   50    2                  0
b   250   1                  0
c   25    1                  0
d   25    1                  0
d   40    2                  15
d   30    3                  5
Maybe one of you has a small workaround for this! :-)
One approach with indexing:
g = df.groupby('ID')['Sequential number']
IMAX = g.idxmax()
IMIN = g.idxmin()

df['rate_diff'] = 0
df.loc[IMAX, 'rate_diff'] = (df.loc[IMAX, 'rate'].to_numpy()
                             - df.loc[IMIN, 'rate'].to_numpy())
The .to_numpy() calls strip the index so the two selections are subtracted positionally (both are ordered by ID) rather than aligned on row labels.
Another approach with groupby.transform + where:
g = df.sort_values(by=['ID', 'Sequential number']).groupby('ID')
m = g['Sequential number'].idxmax()
df['rate_diff'] = (g['rate'].transform(lambda x: x.iloc[-1] - x.iloc[0])
                   .where(df.index.isin(m), 0))
Output:
  ID  rate  Sequential number  rate_diff
0  a   150                  1          0
1  a   150                  1          0
2  a    50                  2       -100
3  b   250                  1          0
4  c    25                  1          0
5  d    25                  1          0
6  d    40                  2          0
7  d    30                  3          5
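For completeness, a minimal sketch constructing the sample frame (values transcribed from the question) so either approach can be run as-is:
import pandas as pd

df = pd.DataFrame({'ID': list('aaabcddd'),
                   'rate': [150, 150, 50, 250, 25, 25, 40, 30],
                   'Sequential number': [1, 1, 2, 1, 1, 1, 2, 3]})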

Adding n extra rows of a certain value at the end of a dataframe

I have a dataframe that currently has 22 rows:
index  value
0      23
1      22
2      19
...
21     20
To this dataframe, I want to add 72 rows to make it exactly 100 rows. So I need to fill loc[22:99] with a certain value, let's say 100.
I tried something like this:
uncon_dstn_2021['balance'].loc[22:99] = 100
but it did not work. Any idea?
You can use reindex; .loc label slicing only selects existing labels, so loc[22:99] on a 0..21 index matches nothing and cannot add rows:
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value = 100)
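If the index is the default 0..21 RangeIndex, as in the question, an equivalent shorthand should work (a sketch, not tested against the original frame):
out = df.reindex(range(100), fill_value=100)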
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
    balance
0         1
1        14
2        11
3        11
4        10
..      ...
96      100
97      100
98      100
99      100

[100 rows x 1 columns]

Pandas cumsum only if positive else zero

I am making a table where I want to show that if there's no income, no expense can happen; it's a cumulative sum table.
This is what I have:
Incoming  Outgoing  Total
0         150       -150
10        20        -160
100       30        -90
50        70        -110
Required output:
Incoming  Outgoing  Total
0         150       0
10        20        0
100       30        70
50        70        50
I've tried
df.clip(lower=0)
and
df['new_column'].apply(lambda x : df['outgoing']-df['incoming'] if df['incoming']>df['outgoing'])
That doesn't work either. Is there any other way?
Update:
A more straightforward approach inspired by your code, using clip and without numpy. The term diff.ge(0).cumsum().clip(0, 1) builds a 0/1 mask that switches on permanently at the first row where income covers the outgoing amount, so earlier rows contribute nothing to the running total:
diff = df['Incoming'].sub(df['Outgoing'])
df['Total'] = diff.mul(diff.ge(0).cumsum().clip(0, 1)).cumsum()
print(df)
# Output:
   Incoming  Outgoing  Total
0         0       150      0
1        10        20      0
2       100        30     70
3        50        70     50
Old answer:
Find the row where the balance is non-negative for the first time, then compute the cumulative sum from that point onward:
start = np.where(df['Incoming'] - df['Outgoing'] >= 0)[0][0]
df['Total'] = (df.iloc[start:]['Incoming'].sub(df.iloc[start:]['Outgoing'])
               .cumsum().reindex(df.index, fill_value=0))
Output:
>>> df
   Incoming  Outgoing  Total
0         0       150      0
1        10        20      0
2       100        30     70
3        50        70     50
IIUC, you can check when Incoming is greater than or equal to Outgoing using np.where and assign a helper column. Then you can check where this helper is not null using notnull(), calculate the difference, and apply cumsum() to the result:
df['t'] = np.where(df['Incoming'].ge(df['Outgoing']), 0, np.nan)
df['t'] = df['t'].ffill()
df['Total'] = np.where(df['t'].notnull(), df['Incoming'].sub(df['Outgoing']), df['t'])
df['Total'] = df['Total'].cumsum()
df = df.drop('t', axis=1)
This gives back:
   Incoming  Outgoing  Total
0         0       150    NaN
1        10        20    NaN
2       100        30   70.0
3        50        70   50.0
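Note the first two rows come back as NaN rather than the 0 shown in the required output; if 0 is needed there, a final fillna should reconcile the two (a small follow-up sketch on the same frame):
df['Total'] = df['Total'].fillna(0)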

Pandas column merging on condition

This is my pandas df:
Id  Protein  A_Egg  B_Meat  C_Milk  Category
A   10       10     20      0       egg
B   20       10     0       10      milk
C   20       10     10      10      meat
D   25       20     10      0       egg
I wish to merge the Protein column with the matching food column, based on Category.
My expected output is:
Id  Protein_final
A   20
B   30
C   30
D   45
Ideally, I would show how I am approaching this, but I am frankly clueless!!
EDIT: Also, how do I handle the case where Category is blank or does not match one of the columns? (In that case the final value should stay the same as the initial value in the Protein column.)
Use DataFrame.lookup with some preprocessing: strip the prefix before _ from the column names and lowercase them, then add the looked-up values to the Protein column:
arr = df.rename(columns=lambda x: x.split('_')[-1].lower()).lookup(df.index, df['Category'])
df['Protein'] += arr
print (df)
  Id  Protein  A_Egg  B_Meat  C_Milk Category
0  A       20     10      20       0      egg
1  B       30     10       0      10     milk
2  C       30     10      10      10     meat
3  D       45     20      10       0      egg
If you need only the 2 columns at the end:
df = df[['Id','Protein']]
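Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A rough NumPy-based replacement sketch using the same renaming idea (the cols and arr helper names are ours, not from the original answer):
import numpy as np

renamed = df.rename(columns=lambda x: x.split('_')[-1].lower())
# positional index of each row's Category among the renamed columns
cols = renamed.columns.get_indexer(df['Category'])
# pick one value per row by (row, column) position
arr = renamed.to_numpy()[np.arange(len(df)), cols]
df['Protein'] += arr.astype(df['Protein'].dtype)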
You can melt the dataframe, filter for the rows where Category equals the variable column, and compute the final column (the rows come back in melt order; chain .sort_values("Id") at the end if the original order matters):
(
    df
    .melt(["Id", "Protein", "Category"])
    .assign(variable=lambda x: x.variable.str[2:].str.lower(),
            Protein_final=lambda x: x.Protein + x.value)
    .query("Category == variable")
    .filter(["Id", "Protein_final"])
)
  Id  Protein_final
0  A             20
3  D             45
6  C             30
9  B             30

Apply function with arguments across Multiindex levels

I would like to apply a custom function to each level within a multiindex.
For example, I have the dataframe
df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  columns=pd.MultiIndex.from_product([['OP', 'PK'], ['PRICE', 'QTY']]))
to which I want to add, for each level 0 column, a column called "Value" that is the result of the following function:
def my_func(df, scale):
    return df['QTY'] * df['PRICE'] * scale
where the user supplies the "scale" value.
Even in setting up this example, I am not sure how to show the result I want, but I know I want the final dataframe's MultiIndex columns to be:
pd.DataFrame(columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY','Value']]))
And if that wasn't hard enough, I want to apply one "scale" value to the "OP" level 0 columns and a different "scale" value to the "PK" columns.
Use:
def my_func(df, scale):
    # select second level of columns
    df1 = df.xs('QTY', axis=1, level=1).values * df.xs('PRICE', axis=1, level=1) * scale
    # create MultiIndex in columns
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
    # join to original
    return pd.concat([df, df1], axis=1).sort_index(axis=1)

print (my_func(df, 10))
     OP               PK
  PRICE QTY   val PRICE QTY   val
0     0   1     0     2   3    60
1     4   5   200     6   7   420
2     8   9   720    10  11  1100
3    12  13  1560    14  15  2100
EDIT:
To multiply by a different scale value per level, pass a list of values (matched to the level 0 columns in order, so [10, 20] scales OP by 10 and PK by 20):
print (my_func(df, [10, 20]))
     OP               PK
  PRICE QTY   val PRICE QTY   val
0     0   1     0     2   3   120
1     4   5   200     6   7   840
2     8   9   720    10  11  2200
3    12  13  1560    14  15  4200
Use groupby + agg, and then concatenate the pieces together with pd.concat:
scale = 10
v = df.groupby(level=0, axis=1).agg(lambda x: x.values.prod(1) * scale)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)

     OP                 PK
  PRICE QTY value PRICE QTY value
0     0   1     0     2   3    60
1     4   5   200     6   7   420
2     8   9   720    10  11  1100
3    12  13  1560    14  15  2100
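Note: the axis=1 form of groupby is deprecated in recent pandas (2.x). An equivalent sketch, under the same setup, that groups on the transpose instead:
v = df.T.groupby(level=0).prod().T * scale
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)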