Calculate the groupby (several columns) average in pandas [duplicate] - pandas

I have a dataframe as shown below.
Unit_ID Price Sector Contract_Date Rooms
1 20 SE1 16-10-2015 2
9 40 SE1 20-10-2015 2
2 40 SE1 16-10-2016 3
2 30 SE1 16-10-2015 3
3 20 SE1 16-10-2015 3
3 10 SE1 16-10-2016 3
4 60 SE1 16-10-2016 2
5 40 SE2 16-10-2015 2
8 80 SE1 20-10-2015 2
6 80 SE2 16-10-2016 3
6 60 SE2 16-10-2015 3
7 40 SE2 16-10-2015 3
7 20 SE2 16-10-2015 3
8 120 SE2 16-10-2016 2
From the above I would like to prepare a dataframe as shown below in pandas.
Expected Output:
Sector Rooms Year Average_Price
SE1 2 2015 30
SE1 2 2016 60
SE1 3 2015 25
SE1 3 2016 25
SE2 2 2015 60
SE2 2 2016 120
SE2 3 2015 50
SE2 3 2016 50
I think I should use pandas groupby.
I tried the following code:
df['Year'] = df.Contract_Date.dt.year
df1 = df.groupby(['Sector', 'Year', 'Rooms']).Price.mean()

Use:
(df.groupby(['Sector', 'Rooms', df['Contract_Date'].dt.year.rename('Year')])
   .Price
   .mean()
   .rename('Average_Price')
   .reset_index())
Sector Rooms Year Average_Price
0 SE1 2 2015 46.666667
1 SE1 2 2016 60.000000
2 SE1 3 2015 25.000000
3 SE1 3 2016 25.000000
4 SE2 2 2015 40.000000
5 SE2 2 2016 120.000000
6 SE2 3 2015 40.000000
7 SE2 3 2016 80.000000
or using groupby.agg:
( df.groupby(['Sector','Rooms',df['Contract_Date'].dt.year.rename('Year')])
.Price
.agg(Average_Price = 'mean')
.reset_index() )
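For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and assumes Contract_Date arrives as day-month-year strings, so it is parsed with pd.to_datetime first (the .dt accessor only works on a datetime column). The variable names rows and result are just for illustration.

```python
import pandas as pd

# Sample rows copied from the question.
rows = [
    (1, 20, "SE1", "16-10-2015", 2),
    (9, 40, "SE1", "20-10-2015", 2),
    (2, 40, "SE1", "16-10-2016", 3),
    (2, 30, "SE1", "16-10-2015", 3),
    (3, 20, "SE1", "16-10-2015", 3),
    (3, 10, "SE1", "16-10-2016", 3),
    (4, 60, "SE1", "16-10-2016", 2),
    (5, 40, "SE2", "16-10-2015", 2),
    (8, 80, "SE1", "20-10-2015", 2),
    (6, 80, "SE2", "16-10-2016", 3),
    (6, 60, "SE2", "16-10-2015", 3),
    (7, 40, "SE2", "16-10-2015", 3),
    (7, 20, "SE2", "16-10-2015", 3),
    (8, 120, "SE2", "16-10-2016", 2),
]
df = pd.DataFrame(rows, columns=["Unit_ID", "Price", "Sector", "Contract_Date", "Rooms"])

# Assumption: the dates are strings in day-month-year order.
df["Contract_Date"] = pd.to_datetime(df["Contract_Date"], format="%d-%m-%Y")

# Group by Sector, Rooms and the contract year, then average the price.
result = (
    df.groupby(["Sector", "Rooms", df["Contract_Date"].dt.year.rename("Year")])["Price"]
      .mean()
      .rename("Average_Price")
      .reset_index()
)
print(result)
```

Passing the renamed year Series directly as a grouping key avoids adding a helper column to df.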

Related

Filter rows of a table based on a condition that implies: 1) value of a field within a range 2) id of the business and 3) date?

I want to filter a TableA, taking into account only those rows whose "TotalInvoice" field is within the minimum and maximum values expressed in a ViewB, based on month and year values and RepairShopId (the sample data only has one RepairShopId, but all the data has multiple IDs).
In the view I have minimum and maximum values for each business and each month and year.
TableA
RepairOrderDataId  RepairShopId  LastUpdated              TotalInvoice
1                  10            2017-06-01 07:00:00.000  765
1                  10            2017-06-05 12:15:00.000  765
2                  10            2017-02-25 13:00:00.000  400
3                  10            2017-10-19 12:15:00.000  295679
4                  10            2016-11-29 11:00:00.000  133409.41
5                  10            2016-10-28 12:30:00.000  127769
6                  10            2016-11-25 16:15:00.000  122400
7                  10            2016-10-18 11:15:00.000  1950
8                  10            2016-11-07 16:45:00.000  79342.7
9                  10            2016-11-25 19:15:00.000  1950
10                 10            2016-12-09 14:00:00.000  111559
11                 10            2016-11-28 10:30:00.000  106333
12                 10            2016-12-13 18:00:00.000  23847.4
13                 10            2016-11-01 17:00:00.000  22782.9
14                 10            2016-10-07 15:30:00.000  NULL
15                 10            2017-01-06 15:30:00.000  138958
16                 10            2017-01-31 13:00:00.000  244484
17                 10            2016-12-05 09:30:00.000  180236
18                 10            2017-02-14 18:30:00.000  92752.6
19                 10            2016-10-05 08:30:00.000  161952
20                 10            2016-10-05 08:30:00.000  8713.08
ViewB
RepairShopId  Orders  Average       MinimumValue      MaximumValue      year  month  yearMonth
10            1       370343        370343            370343            2015  7      2015-7
10            1       109645        109645            109645            2015  10     2015-10
10            1       148487        148487            148487            2015  12     2015-12
10            1       133409.41     133409.41         133409.41         2016  3      2016-3
10            1       19261         19261             19261             2016  8      2016-8
10            4       10477.3575    2656.65644879821  18298.0585512018  2016  9      2016-9
10            69      15047.709565  10                90942.6052417394  2016  10     2016-10
10            98      22312.077244  10                147265.581935242  2016  11     2016-11
10            96      20068.147395  10                99974.1750708773  2016  12     2016-12
10            86      25334.053372  10                184186.985160105  2017  1      2017-1
10            69      21410.63855   10                153417.00126689   2017  2      2017-2
10            100     13009.797     10                59002.3589332934  2017  3      2017-3
10            101     11746.191287  10                71405.3391452842  2017  4      2017-4
10            123     11143.49756   10                55306.8202091131  2017  5      2017-5
10            197     15980.55406   10                204538.144334771  2017  6      2017-6
10            99      10852.496969  10                63283.9899761938  2017  7      2017-7
10            131     52601.981526  10                1314998.61355187  2017  8      2017-8
10            124     10983.221854  10                59444.0535811233  2017  9      2017-9
10            115     12467.148434  10                72996.6054527277  2017  10     2017-10
10            123     14843.379593  10                129673.931373139  2017  11     2017-11
10            111     8535.455945   10                50328.1495501884  2017  12     2017-12
I've tried:
SELECT *
FROM TableA
INNER JOIN ViewB ON TableA.RepairShopId = ViewB.RepairShopId
WHERE TotalInvoice > MinimumValue AND TotalInvoice < MaximumValue
AND TableA.RepairShopId = ViewB.RepairShopId
But I'm not sure how to compare the yearMonth field with the datetime field "LastUpdated".
Any help is very appreciated!
Here is how you can do it.
I assumed LastUpdated is the date column in TableA that should be matched against the view's year and month:
SELECT *
FROM TableA A
INNER JOIN ViewB B
    ON A.RepairShopId = B.RepairShopId
    AND A.TotalInvoice > B.MinimumValue
    AND A.TotalInvoice < B.MaximumValue
    AND YEAR(A.LastUpdated) = B.year
    AND MONTH(A.LastUpdated) = B.month
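Since the rest of this page is pandas, here is a rough pandas sketch of the same join logic, in case the data ends up in DataFrames. It is not the SQL answer itself, only an illustration: table_a and view_b are hypothetical names, and each holds just three rows copied from the sample data above.

```python
import pandas as pd

# Hypothetical three-row subset of TableA from the question.
table_a = pd.DataFrame({
    "RepairOrderDataId": [1, 3, 7],
    "RepairShopId": [10, 10, 10],
    "LastUpdated": pd.to_datetime(
        ["2017-06-01 07:00:00", "2017-10-19 12:15:00", "2016-10-18 11:15:00"]
    ),
    "TotalInvoice": [765.0, 295679.0, 1950.0],
})

# Matching subset of ViewB (one row per shop, year and month).
view_b = pd.DataFrame({
    "RepairShopId": [10, 10, 10],
    "MinimumValue": [10.0, 10.0, 10.0],
    "MaximumValue": [90942.6052417394, 204538.144334771, 72996.6054527277],
    "year": [2016, 2017, 2017],
    "month": [10, 6, 10],
})

# Derive year and month from the datetime column, as YEAR() and MONTH() do in SQL.
a = table_a.assign(
    year=table_a["LastUpdated"].dt.year,
    month=table_a["LastUpdated"].dt.month,
)

# Equality part of the join: same shop, same year, same month.
merged = a.merge(view_b, on=["RepairShopId", "year", "month"], how="inner")

# Range part of the join: invoice strictly between the monthly bounds.
in_range = (merged["TotalInvoice"] > merged["MinimumValue"]) & (
    merged["TotalInvoice"] < merged["MaximumValue"]
)
filtered = merged[in_range]
print(filtered[["RepairOrderDataId", "LastUpdated", "TotalInvoice"]])
```

Order 3 drops out here because its invoice is above the October 2017 maximum, while orders 1 and 7 stay.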

Groupby sum in years in pandas

I have a data frame as shown below. It contains sales data for two health care products, from December 2016 to November 2018.
product profit sale_date discount
A 50 2016-12-01 5
A 50 2017-01-03 4
B 200 2016-12-24 10
A 50 2017-01-18 3
B 200 2017-01-28 15
A 50 2017-01-18 6
B 200 2017-01-28 20
A 50 2017-04-18 6
B 200 2017-12-08 25
A 50 2017-11-18 6
B 200 2017-08-21 20
B 200 2017-12-28 30
A 50 2018-03-18 10
B 300 2018-06-08 45
B 300 2018-09-20 50
A 50 2018-11-18 8
B 300 2018-11-28 35
From the above I would like to prepare the dataframe below and plot it as a line plot.
Expected Output
bought_year total_profit
2016 250
2017 1250
2018 1000
X axis = bought_year
Y axis = profit
Use groupby with dt.year, and .agg to name your column.
df1 = df.groupby(df['sale_date'].dt.year).agg(total_profit=('profit','sum'))\
.reset_index().rename(columns={'sale_date': 'bought_year'})
print(df1)
bought_year total_profit
0 2016 250
1 2017 1250
2 2018 1000
df1.set_index('bought_year').plot(kind='bar')
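A self-contained version of the aggregation, as a sketch: it rebuilds the sample data from the question (the discount column is left out because it is not used) and assumes sale_date is parsed with pd.to_datetime. Here the grouping key is renamed up front instead of renaming the column afterwards. Plotting is only hinted at in a comment, since it needs matplotlib installed.

```python
import pandas as pd

# (product, profit, sale_date) rows copied from the question.
rows = [
    ("A", 50, "2016-12-01"), ("A", 50, "2017-01-03"), ("B", 200, "2016-12-24"),
    ("A", 50, "2017-01-18"), ("B", 200, "2017-01-28"), ("A", 50, "2017-01-18"),
    ("B", 200, "2017-01-28"), ("A", 50, "2017-04-18"), ("B", 200, "2017-12-08"),
    ("A", 50, "2017-11-18"), ("B", 200, "2017-08-21"), ("B", 200, "2017-12-28"),
    ("A", 50, "2018-03-18"), ("B", 300, "2018-06-08"), ("B", 300, "2018-09-20"),
    ("A", 50, "2018-11-18"), ("B", 300, "2018-11-28"),
]
df = pd.DataFrame(rows, columns=["product", "profit", "sale_date"])
df["sale_date"] = pd.to_datetime(df["sale_date"])

# Group on the year of sale_date and sum the profit per year.
df1 = (
    df.groupby(df["sale_date"].dt.year.rename("bought_year"))
      .agg(total_profit=("profit", "sum"))
      .reset_index()
)
print(df1)

# For the line plot the question asks for (requires matplotlib):
# df1.plot(x="bought_year", y="total_profit", kind="line")
```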

GroupBy aggregation based on condition and year wise sum using pandas

I have a data frame as shown below
ID Sector Plot Tenancy_Start_Date Rental
1 SE1 A 2018-08-14 100
1 SE1 A 2019-08-18 200
2 SE1 B 2017-08-12 150
3 SE1 A 2020-02-12 300
5 SE2 A 2017-08-13 400
5 SE2 A 2019-08-12 300
6 SE2 B 2019-08-11 150
5 SE2 A 2020-01-10 300
7 SE2 B 2019-08-11 500
From the above I would like to prepare the data frame below, aggregated at the Sector and Plot level.
Expected Output:
Sector Plot Total_Rental Rental_2017 Rental_2018 Rental_2019 Rental_2020
SE1 A 600 0 100 200 300
SE1 B 150 150 0 0 0
SE2 A 1000 400 0 300 300
SE2 B 650 0 0 650 0
I'd create a year column:
df['Year'] = df['Tenancy_Start_Date'].dt.year
then do your groupby:
df['Rent_by_cats'] = df.groupby(['Sector', 'Year', 'Plot'])['Rental'].transform('sum')
then lastly move it into separate columns:
yrs = df['Year'].unique().tolist()
for y in yrs:
    df['Rental_' + str(y)] = 0
    df.loc[df['Year'] == y, 'Rental_' + str(y)] = df['Rent_by_cats']
Output:
ID Sector Plot Tenancy_Start_Date Rental Year Rent_by_cats Rental_2018 Rental_2019 Rental_2017 Rental_2020
0 1 SE1 A 2018-08-14 100 2018 100 100 0 0 0
1 1 SE1 A 2019-08-18 200 2019 200 0 200 0 0
2 2 SE1 B 2017-08-12 150 2017 150 0 0 150 0
3 3 SE1 A 2020-02-12 300 2020 300 0 0 0 300
4 5 SE2 A 2017-08-13 400 2017 400 0 0 400 0
5 5 SE2 A 2019-08-12 300 2019 300 0 300 0 0
6 6 SE2 B 2019-08-11 150 2019 650 0 650 0 0
7 5 SE2 A 2020-01-10 300 2020 300 0 0 0 300
8 7 SE2 B 2019-08-11 500 2019 650 0 650 0 0
You can do (df being your input dataframe):
# in case it's not already a datetime:
df["Tenancy_Start_Date"] = pd.to_datetime(df["Tenancy_Start_Date"])
# one row per (Sector, Plot), one column per year, summing Rental
df2 = df.pivot_table(index=["Sector", "Plot"], columns=df["Tenancy_Start_Date"].dt.year, values="Rental", aggfunc="sum").fillna(0)
df2.columns = [f"Rental_{col}" for col in df2.columns]
df2["Total_Rental"] = df2.sum(axis=1)
df2 = df2.reset_index(drop=False)
Outputs:
Sector Plot ... Rental_2020 Total_Rental
0 SE1 A ... 300.0 600.0
1 SE1 B ... 0.0 150.0
2 SE2 A ... 300.0 1000.0
3 SE2 B ... 0.0 650.0
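The printed output above is truncated, so here is a runnable sketch of the pivot approach on the question's sample data that shows every column. It is a variant, not the exact code above: it uses an explicit Year column, fill_value=0 instead of fillna, casts back to integers, and inserts Total_Rental first to match the column order in the expected output. The name wide is only for illustration.

```python
import pandas as pd

# Sample rows copied from the question.
rows = [
    (1, "SE1", "A", "2018-08-14", 100),
    (1, "SE1", "A", "2019-08-18", 200),
    (2, "SE1", "B", "2017-08-12", 150),
    (3, "SE1", "A", "2020-02-12", 300),
    (5, "SE2", "A", "2017-08-13", 400),
    (5, "SE2", "A", "2019-08-12", 300),
    (6, "SE2", "B", "2019-08-11", 150),
    (5, "SE2", "A", "2020-01-10", 300),
    (7, "SE2", "B", "2019-08-11", 500),
]
df = pd.DataFrame(rows, columns=["ID", "Sector", "Plot", "Tenancy_Start_Date", "Rental"])
df["Tenancy_Start_Date"] = pd.to_datetime(df["Tenancy_Start_Date"])
df["Year"] = df["Tenancy_Start_Date"].dt.year

# One row per (Sector, Plot), one column per year, summing Rental.
wide = df.pivot_table(
    index=["Sector", "Plot"], columns="Year", values="Rental",
    aggfunc="sum", fill_value=0,
).astype(int)  # make sure missing years show as integer 0

# Rename the year columns and put the row total in front of them.
wide.columns = [f"Rental_{year}" for year in wide.columns]
wide.insert(0, "Total_Rental", wide.sum(axis=1))
wide = wide.reset_index()
print(wide.to_string(index=False))
```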

GroupBy unique aggregation and with specific condition in pandas

I have a dataframe as shown below
UnitID Sector Start_Date Status
1 SE1 2018-02-26 Closed
1 SE1 2019-03-27 Active
2 SE1 2017-02-26 Closed
2 SE1 2018-02-26 Closed
2 SE1 2019-02-26 Active
3 SE1 NaT Not_in_contract
4 SE1 NaT Not_in_contract
5 SE2 2017-02-26 Closed
5 SE2 2018-02-26 Closed
5 SE2 2019-02-26 Active
6 SE2 2018-02-26 Closed
6 SE2 2019-02-26 Active
7 SE2 2018-02-26 Closed
7 SE2 2018-07-15 Closed
8 SE2 NaT Not_in_contract
9 SE2 NaT Not_in_contract
10 SE2 2019-05-22 Active
11 SE2 2019-06-24 Active
From the above I would like to prepare below data frame
Sector Number_of_unique_units Number_of_Active_units
SE1 4 2
SE2 7 4
Use GroupBy.agg with DataFrameGroupBy.nunique for the unique units, and a custom lambda function that counts the Active rows by summing a boolean mask:
df1 = (df.groupby('Sector')
         .agg(Number_of_unique_units=('UnitID', 'nunique'),
              Number_of_Active_units=('Status', lambda x: x.eq('Active').sum()))
         .reset_index())
print (df1)
Sector Number_of_unique_units Number_of_Active_units
0 SE1 4 2
1 SE2 7 4
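If you would rather avoid the lambda, one possible variant is to precompute the boolean mask as a helper column and sum it. This is only a sketch on the question's sample data; the is_active column name is made up for the example, and Start_Date is omitted because the counts do not depend on it.

```python
import pandas as pd

# (UnitID, Sector, Status) rows copied from the question.
rows = [
    (1, "SE1", "Closed"), (1, "SE1", "Active"),
    (2, "SE1", "Closed"), (2, "SE1", "Closed"), (2, "SE1", "Active"),
    (3, "SE1", "Not_in_contract"), (4, "SE1", "Not_in_contract"),
    (5, "SE2", "Closed"), (5, "SE2", "Closed"), (5, "SE2", "Active"),
    (6, "SE2", "Closed"), (6, "SE2", "Active"),
    (7, "SE2", "Closed"), (7, "SE2", "Closed"),
    (8, "SE2", "Not_in_contract"), (9, "SE2", "Not_in_contract"),
    (10, "SE2", "Active"), (11, "SE2", "Active"),
]
df = pd.DataFrame(rows, columns=["UnitID", "Sector", "Status"])

# Summing a boolean column counts its True values per group.
df1 = (
    df.assign(is_active=df["Status"].eq("Active"))  # hypothetical helper column
      .groupby("Sector")
      .agg(
          Number_of_unique_units=("UnitID", "nunique"),
          Number_of_Active_units=("is_active", "sum"),
      )
      .reset_index()
)
print(df1)
```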

Groupby aggregate based on multiple condition pandas

I have a data frame as shown below
Sector Plot Year Amount Month
SE1 1 2017 10 Sep
SE1 1 2018 10 Oct
SE1 1 2019 10 Jun
SE1 1 2020 90 Feb
SE1 2 2018 50 Jan
SE1 2 2017 100 May
SE1 2 2018 30 Oct
SE2 2 2018 50 Mar
SE2 2 2019 100 Jan
From the above I would like to prepare below data frame
Sector Plot Number_of_Times Mean_Amount Recent_Amount Recent_year Recent_Month
SE1 1 4 30 90 2020 Feb
SE1 2 3 60 30 2018 Oct
SE2 2 2 75 100 2019 Jan
If the rows in the input data are already sorted chronologically, use GroupBy.agg with named aggregations:
df1 = (df.groupby(['Sector','Plot'])
         .agg(Number_of_Times=('Year','size'),
              Mean_Amount=('Amount','mean'),
              Recent_Amount=('Amount','last'),
              Recent_year=('Year','last'),
              Recent_Month=('Month','last'))
         .reset_index())
print (df1)
Sector Plot Number_of_Times Mean_Amount Recent_Amount Recent_year \
0 SE1 1 4 30 90 2020
1 SE1 2 3 60 30 2018
2 SE2 2 2 75 100 2019
Recent_Month
0 Feb
1 Oct
2 Jan
If sorting is necessary, convert Month to datetimes, add DataFrame.sort_values, apply the solution, and finally convert the months back to strings:
df['Month'] = pd.to_datetime(df['Month'], format='%b')
df1 = (df.sort_values(['Sector','Plot','Year','Month'])
.groupby(['Sector','Plot']).agg(Number_of_Times=('Year','size'),
Mean_Amount=('Amount','mean'),
Recent_Amount=('Amount','last'),
Recent_year=('Year','last'),
Recent_Month=('Month','last')).reset_index())
df1['Recent_Month'] = df1['Recent_Month'].dt.strftime('%b')
print (df1)
Sector Plot Number_of_Times Mean_Amount Recent_Amount Recent_year \
0 SE1 1 4 30 90 2020
1 SE1 2 3 60 30 2018
2 SE2 2 2 75 100 2019
Recent_Month
0 Feb
1 Oct
2 Jan
Another idea, using an ordered Categorical, which raises an error in pandas 0.25.1:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['Month'] = pd.Categorical(df['Month'] , ordered=True, categories=months)
df1 = (df.sort_values(['Sector','Plot','Year','Month'])
.groupby(['Sector','Plot']).agg(Number_of_Times=('Year','size'),
Mean_Amount=('Amount','mean'),
Recent_Amount=('Amount','last'),
Recent_year=('Year','last'),
Recent_Month=('Month','last')).reset_index())
print (df1)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
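One way to sidestep the Categorical error is to sort on a numeric month helper and leave Month as plain strings. This is a sketch, not part of the original answer: the _month_num column and the month_number dict are names made up for the example, and nothing is claimed here about how other pandas versions handle the Categorical approach.

```python
import pandas as pd

# Sample rows copied from the question.
rows = [
    ("SE1", 1, 2017, 10, "Sep"), ("SE1", 1, 2018, 10, "Oct"),
    ("SE1", 1, 2019, 10, "Jun"), ("SE1", 1, 2020, 90, "Feb"),
    ("SE1", 2, 2018, 50, "Jan"), ("SE1", 2, 2017, 100, "May"),
    ("SE1", 2, 2018, 30, "Oct"), ("SE2", 2, 2018, 50, "Mar"),
    ("SE2", 2, 2019, 100, "Jan"),
]
df = pd.DataFrame(rows, columns=["Sector", "Plot", "Year", "Amount", "Month"])

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_number = {name: i for i, name in enumerate(months, start=1)}

# Sort chronologically within each group using a numeric helper column,
# so that 'last' picks the most recent row while Month stays a string.
df1 = (
    df.assign(_month_num=df["Month"].map(month_number))
      .sort_values(["Sector", "Plot", "Year", "_month_num"])
      .groupby(["Sector", "Plot"])
      .agg(
          Number_of_Times=("Year", "size"),
          Mean_Amount=("Amount", "mean"),
          Recent_Amount=("Amount", "last"),
          Recent_year=("Year", "last"),
          Recent_Month=("Month", "last"),
      )
      .reset_index()
)
print(df1.to_string(index=False))
```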