How to apply different aggregate functions to different columns in pandas?

I have a dataframe with many columns; some of them contain prices and the rest contain volumes, as below:
year_month  0_fx_price_gy  0_fx_volume_gy  1_fx_price_yuy  1_fx_volume_yuy
1990-01     2              10              3               30
1990-01     2              20              2               40
1990-02     2              30              3               50
I need to group by year_month and take the mean of the price columns and the sum of the volume columns.
Is there any quick way to do this in one statement, i.e. average if the column name contains price and sum if it contains volume?
df.groupby('year_month').?
Note: this is just sample data with fewer columns, but the format is similar.
Expected output:
year_month  0_fx_price_gy  0_fx_volume_gy  1_fx_price_yuy  1_fx_volume_yuy
1990-01     2              30              2.5             70
1990-02     2              30              3               50

Create a dictionary mapping the matched column names to aggregation functions and pass it to DataFrameGroupBy.agg; last, add reindex in case the order of the output columns has changed:
d1 = dict.fromkeys(df.columns[df.columns.str.contains('price')], 'mean')
d2 = dict.fromkeys(df.columns[df.columns.str.contains('volume')], 'sum')
# merge the dicts together
d = {**d1, **d2}
print (d)
{'0_fx_price_gy': 'mean', '1_fx_price_yuy': 'mean',
'0_fx_volume_gy': 'sum', '1_fx_volume_yuy': 'sum'}
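On Python 3.9+, the same merge can also be written with the dict union operator:
d = d1 | d2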
Another way to build the dictionary:
d = {}
for c in df.columns:
    if 'price' in c:
        d[c] = 'mean'
    elif 'volume' in c:
        d[c] = 'sum'
The solution can be simplified if, after the first column is excluded with df.columns[1:], only price and volume columns remain:
d = {x:'mean' if 'price' in x else 'sum' for x in df.columns[1:]}
df1 = df.groupby('year_month', as_index=False).agg(d).reindex(columns=df.columns)
print (df1)
  year_month  0_fx_price_gy  0_fx_volume_gy  1_fx_price_yuy  1_fx_volume_yuy
0    1990-01            2.0              30             2.5               70
1    1990-02            2.0              30             3.0               50
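For reference, a self-contained sketch that reproduces the whole example (same logic as above, sample data from the question):
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    'year_month': ['1990-01', '1990-01', '1990-02'],
    '0_fx_price_gy': [2, 2, 2],
    '0_fx_volume_gy': [10, 20, 30],
    '1_fx_price_yuy': [3, 2, 3],
    '1_fx_volume_yuy': [30, 40, 50],
})

# mean for price columns, sum for volume columns
d = {c: 'mean' if 'price' in c else 'sum' for c in df.columns[1:]}
df1 = df.groupby('year_month', as_index=False).agg(d)
print(df1)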

Related

How to get the pivot table to return the value together with the corresponding column

I have the following data frame:
Loc1  Loc2  Month  Trips
a     b     1      200
a     b     4      500
a     b     7      600
c     d     6      400
c     d     4      300
I need to find out, for every route (Loc1 to Loc2), which month has the most trips, together with the corresponding number of trips.
I ran some code, but the output I get is as follows. How do I get the Trips column to appear alongside it?
Loc1  Loc2  Month
a     b     7
c     d     6
The code I used is below:
df = pd.read_csv('data.csv')
df = df[['Loc1','Loc2','Month','Trips']]
df = df.pivot_table(index=['Loc1', 'Loc2'],
                    columns='Month',
                    values='Trips')
df = df.idxmax(axis=1)
df = df.reset_index()
print(f"Each route's busiest month : \n {df.to_string()}")
Try sorting by Trips in descending order and taking the first row per group:
df.sort_values(by='Trips', ascending=False).groupby(['Loc1', 'Loc2'], as_index=False).first()
Or:
df.sort_values(by='Trips').groupby(['Loc1', 'Loc2'], as_index=False).last()
NB. I couldn't run the code to test, but you get the general idea.
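As a sanity check, a self-contained sketch of the same idea on the sample data (expected result worked out by hand):
import pandas as pd

df = pd.DataFrame({'Loc1': list('aaacc'), 'Loc2': list('bbbdd'),
                   'Month': [1, 4, 7, 6, 4],
                   'Trips': [200, 500, 600, 400, 300]})

# sort descending so the busiest month is the first row of each group
out = (df.sort_values(by='Trips', ascending=False)
         .groupby(['Loc1', 'Loc2'], as_index=False)
         .first())
print(out)
#   Loc1 Loc2  Month  Trips
# 0    a    b      7    600
# 1    c    d      6    400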

Sorting df by column name of type timestamp

I have a dataframe df which consists of columns of countries and rows of dates. The index is a DatetimeIndex.
I would like to sort the df by each country's value at the last element in the series (e.g., the latest date) and then graph the "top N" countries by this latest value.
I thought that if I sorted the transpose of the df and then sliced it, I would have what I need. Hence, if N = 10, I would select df_T[0:10].
However, when I attempt to sort by the last column, I get a KeyError referencing the selected column:
KeyError: '2021-03-28 00:00:00'
I'm stumped...
df_T = df.transpose()
column_name = str(df_T.columns[-1])
df_T.sort_values(by = column_name, axis = 'columns', inplace = True)
# select the top 10 countries by latest value, e.g.
# plot df_T[0:10]
What I'm trying to do, example df:
             A   B  C  ....   X    Y  Z
2021-03-29  10  20  5  ....  50  100  7
2021-03-28   9  19  4  ....  45   90  6
2021-03-27   8  15  2  ....  40   80  4
...
2021-01-03   0   0  0  ....   0    0  0
I want to select the series representing the greatest N values as of the latest index value (e.g., the latest date).
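The KeyError is consistent with the axis argument: with axis='columns', sort_values expects by to be a row label, not a column label, and df_T has countries, not dates, as its index. A minimal sketch of one way to get the top N without transposing at all (an untested suggestion, assuming df is as shown with a DatetimeIndex and matplotlib available for .plot()):
latest = df.loc[df.index.max()]   # Series: each country's value at the latest date
top = latest.nlargest(10)         # top 10 countries by that value
df[top.index].plot()              # plot only those columns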

Pandas column merging on condition

This is my pandas df:
Id Protein A_Egg B_Meat C_Milk Category
A 10 10 20 0 egg
B 20 10 0 10 milk
C 20 10 10 10 meat
D 25 20 10 0 egg
I wish to merge the Protein column with one of the other columns, based on Category.
My desired output is:
Id Protein_final
A 20
B 30
C 30
D 45
Ideally, I would show how I am approaching this, but I am frankly clueless!!
EDIT: Also, how do I handle the case where Category is blank or does not match one of the columns? (In that case the final value should be the same as the initial value in the Protein column.)
Use DataFrame.lookup with some preprocessing: strip the prefix before _ in the column names and lowercase the rest, then add the looked-up values to the Protein column:
arr = df.rename(columns=lambda x: x.split('_')[-1].lower()).lookup(df.index, df['Category'])
df['Protein'] += arr
print (df)
Id Protein A_Egg B_Meat C_Milk Category
0 A 20 10 20 0 egg
1 B 30 10 0 10 milk
2 C 30 10 10 10 meat
3 D 45 20 10 0 egg
If need only 2 columns finally:
df = df[['Id','Protein']]
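Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A rough equivalent with NumPy indexing, reusing the same renaming trick (a sketch against the original df, before the += above):
import numpy as np

renamed = df.rename(columns=lambda x: x.split('_')[-1].lower())
rows = np.arange(len(renamed))
# get_indexer returns -1 for a Category with no matching column,
# so guard first if Category can be blank
cols = renamed.columns.get_indexer(df['Category'])
df['Protein'] += renamed.to_numpy()[rows, cols]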
You can melt the dataframe, filter for rows where Category equals the variable column, and compute the final column as the sum:
(
    df
    .melt(["Id", "Protein", "Category"])
    .assign(variable=lambda x: x.variable.str[2:].str.lower(),
            Protein_final=lambda x: x.Protein + x.value)
    .query("Category == variable")
    .filter(["Id", "Protein_final"])
)
Id Protein_final
0 A 20
3 D 45
6 C 30
9 B 30
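Per the EDIT (a blank Category, or one matching no column), one option is to left-join this result back onto the original frame and fall back to the initial Protein value; a sketch, assuming the chained expression above is assigned to res:
out = df[['Id', 'Protein']].merge(res, on='Id', how='left')
out['Protein_final'] = out['Protein_final'].fillna(out['Protein'])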

Sequence of numbers per category given first entry (Python, Pandas)

Suppose I have 5 categories {A, B, C, D, E} and several date entries of PURCHASES with distinct dates (for instance, A may range from 01/01/1900 to 31/01/1901 and B from 02/02/1930 to 03/03/1933).
I want to create a new column 'day of occurrence' containing a sequence of numbers 1...N, starting from the first date on which the number of purchases is >= 5.
I want this in order to compare how similar the categories are from the day they achieved 5 purchases (dates are irrelevant here, but product lifetime is).
Thanks!
Here is how you can label rows from 1 to N depending on a column value.
import pandas as pd
df = pd.DataFrame(data=[3, 6, 9, 3, 6], columns=['data'])
df['day of occurrence'] = 0
values_count = df.loc[df['data'] > 5].shape[0]
df.loc[df['data'] > 5, 'day of occurrence'] = range(1, values_count + 1)
The initial DataFrame:
data
0 3
1 6
2 9
3 3
4 6
Output DataFrame:
data day of occurrence
0 3 0
1 6 1
2 9 2
3 3 0
4 6 3
Your data should be sorted by date, for example, df = df.sort_values(by='your-datetime-column')
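The above is for a single column; for the per-category sequence the question asks about, a sketch along the same lines (the column names 'category', 'date' and 'purchases' are assumptions, not from the original data):
df = df.sort_values('date')
met = df['purchases'].ge(5).astype(int)
# 1 from the first qualifying row of each category onward, else 0
started = met.groupby(df['category']).cummax()
# running count once the category has started; stays 0 before that
df['day of occurrence'] = started.groupby(df['category']).cumsum()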

Pandas sort groupby groups by arbitrary condition on its contents

OK, this is getting ridiculous... I've spent way too much time on something that should be trivial.
I want to group a data frame by a column, then sort the groups (not within the groups) by some condition (in my case, the maximum over some column B in each group).
I expected something along these lines:
df.groupby('A').sort_index(lambda group_content: group_content.B.max())
I also tried:
groups = df.groupby('A')
maxx = groups['B'].max()
groups.sort_index(...)
But, of course, there is no sort_index on a GroupBy object.
EDIT:
I ended up using (almost) the solution suggested by @jezrael:
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)
I had to add ascending=True to sort_values, but more importantly sort=False to groupby, otherwise the groups would be sorted lexicographically (A contains strings).
I think you need GroupBy.transform with max to create a helper column (this also works if several groups share the same max), and then sort with DataFrame.sort_values:
df = pd.DataFrame({
    'A': list('aaabcc'),
    'B': [7, 8, 9, 100, 20, 30]
})
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max','A'])
print (df)
A B max
0 a 7 9
1 a 8 9
2 a 9 9
4 c 20 30
5 c 30 30
3 b 100 100
If the max values are always unique, use Series.argsort:
s = df.groupby('A')['B'].transform('max')
df = df.iloc[s.argsort()]
print (df)
A B
0 a 7
1 a 8
2 a 9
4 c 20
5 c 30
3 b 100
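One caveat: Series.argsort defaults to quicksort, which is not stable, so rows sharing the same max (always the case within a group) may lose their original relative order; kind='stable' (forwarded to NumPy) avoids that:
df = df.iloc[s.argsort(kind='stable')]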