GroupBy unique aggregation with a specific condition in pandas

I have a dataframe as shown below
UnitID Sector Start_Date Status
1 SE1 2018-02-26 Closed
1 SE1 2019-03-27 Active
2 SE1 2017-02-26 Closed
2 SE1 2018-02-26 Closed
2 SE1 2019-02-26 Active
3 SE1 NaT Not_in_contract
4 SE1 NaT Not_in_contract
5 SE2 2017-02-26 Closed
5 SE2 2018-02-26 Closed
5 SE2 2019-02-26 Active
6 SE2 2018-02-26 Closed
6 SE2 2019-02-26 Active
7 SE2 2018-02-26 Closed
7 SE2 2018-07-15 Closed
8 SE2 NaT Not_in_contract
9 SE2 NaT Not_in_contract
10 SE2 2019-05-22 Active
11 SE2 2019-06-24 Active
From the above I would like to prepare the data frame below:
Sector Number_of_unique_units Number_of_Active_units
SE1 4 2
SE2 7 4

Use GroupBy.agg with named aggregations: DataFrameGroupBy.nunique counts the unique units, and a custom lambda function counts the Active rows by summing a boolean mask:
df1=(df.groupby('Sector').agg(Number_of_unique_units=('UnitID','nunique'),
Number_of_Active_units=('Status',lambda x:x.eq('Active').sum()))
.reset_index())
print (df1)
Sector Number_of_unique_units Number_of_Active_units
0 SE1 4 2
1 SE2 7 4
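The lambda above counts Active rows, which equals the number of Active units only because no unit has more than one Active row in the sample. If a unit could appear as Active more than once, a variation like the following sketch counts distinct Active units instead. The frame below is rebuilt from the question's sample (Start_Date is left out because it is not used), and the name active_units is only illustrative.

```python
import pandas as pd

# Sample rows copied from the question (Start_Date omitted, it is not needed here)
df = pd.DataFrame({
    "UnitID": [1, 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11],
    "Sector": ["SE1"] * 7 + ["SE2"] * 11,
    "Status": ["Closed", "Active", "Closed", "Closed", "Active",
               "Not_in_contract", "Not_in_contract",
               "Closed", "Closed", "Active", "Closed", "Active",
               "Closed", "Closed", "Not_in_contract", "Not_in_contract",
               "Active", "Active"],
})

# Distinct units that have at least one Active row, per sector
active_units = (df[df["Status"].eq("Active")]
                .groupby("Sector")["UnitID"]
                .nunique())

df1 = (df.groupby("Sector")
         .agg(Number_of_unique_units=("UnitID", "nunique"))
         # assign aligns active_units on the Sector index
         .assign(Number_of_Active_units=active_units)
         # sectors without any Active unit would be NaN, so fill with 0
         .fillna({"Number_of_Active_units": 0})
         .astype({"Number_of_Active_units": int})
         .reset_index())
print(df1)
```

For the sample data this gives the same counts as the answer above (4 and 2 for SE1, 7 and 4 for SE2).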

Related

Filter rows of a table based on a condition that implies: 1) value of a field within a range 2) id of the business and 3) date?

I want to filter a TableA, taking into account only those rows whose "TotalInvoice" field is within the minimum and maximum values expressed in a ViewB, based on month and year values and RepairShopId (the sample data only has one RepairShopId, but all the data has multiple IDs).
In the view I have minimum and maximum values for each business and each month and year.
TableA
RepairOrderDataId  RepairShopId  LastUpdated              TotalInvoice
1                  10            2017-06-01 07:00:00.000  765
1                  10            2017-06-05 12:15:00.000  765
2                  10            2017-02-25 13:00:00.000  400
3                  10            2017-10-19 12:15:00.000  295679
4                  10            2016-11-29 11:00:00.000  133409.41
5                  10            2016-10-28 12:30:00.000  127769
6                  10            2016-11-25 16:15:00.000  122400
7                  10            2016-10-18 11:15:00.000  1950
8                  10            2016-11-07 16:45:00.000  79342.7
9                  10            2016-11-25 19:15:00.000  1950
10                 10            2016-12-09 14:00:00.000  111559
11                 10            2016-11-28 10:30:00.000  106333
12                 10            2016-12-13 18:00:00.000  23847.4
13                 10            2016-11-01 17:00:00.000  22782.9
14                 10            2016-10-07 15:30:00.000  NULL
15                 10            2017-01-06 15:30:00.000  138958
16                 10            2017-01-31 13:00:00.000  244484
17                 10            2016-12-05 09:30:00.000  180236
18                 10            2017-02-14 18:30:00.000  92752.6
19                 10            2016-10-05 08:30:00.000  161952
20                 10            2016-10-05 08:30:00.000  8713.08
ViewB
RepairShopId  Orders  Average       MinimumValue      MaximumValue      year  month  yearMonth
10            1       370343        370343            370343            2015  7      2015-7
10            1       109645        109645            109645            2015  10     2015-10
10            1       148487        148487            148487            2015  12     2015-12
10            1       133409.41     133409.41         133409.41         2016  3      2016-3
10            1       19261         19261             19261             2016  8      2016-8
10            4       10477.3575    2656.65644879821  18298.0585512018  2016  9      2016-9
10            69      15047.709565  10                90942.6052417394  2016  10     2016-10
10            98      22312.077244  10                147265.581935242  2016  11     2016-11
10            96      20068.147395  10                99974.1750708773  2016  12     2016-12
10            86      25334.053372  10                184186.985160105  2017  1      2017-1
10            69      21410.63855   10                153417.00126689   2017  2      2017-2
10            100     13009.797     10                59002.3589332934  2017  3      2017-3
10            101     11746.191287  10                71405.3391452842  2017  4      2017-4
10            123     11143.49756   10                55306.8202091131  2017  5      2017-5
10            197     15980.55406   10                204538.144334771  2017  6      2017-6
10            99      10852.496969  10                63283.9899761938  2017  7      2017-7
10            131     52601.981526  10                1314998.61355187  2017  8      2017-8
10            124     10983.221854  10                59444.0535811233  2017  9      2017-9
10            115     12467.148434  10                72996.6054527277  2017  10     2017-10
10            123     14843.379593  10                129673.931373139  2017  11     2017-11
10            111     8535.455945   10                50328.1495501884  2017  12     2017-12
I've tried:
SELECT *
FROM TableA
INNER JOIN ViewB ON TableA.RepairShopId = ViewB.RepairShopId
WHERE TotalInvoice > MinimumValue AND TotalInvoice < MaximumValue
AND TableA.RepairShopId = ViewB.RepairShopId
But I'm not sure how to compare the yearMonth field with the datetime field "LastUpdated".
Any help is very appreciated!
Here is how you can do it. I assumed that LastUpdated is the TableA column holding the date to compare with the view's year and month:
SELECT *
FROM TableA A
INNER JOIN ViewB B
ON A.RepairShopId = B.RepairShopId
AND A.TotalInvoice > B.MinimumValue
AND A.TotalInvoice < B.MaximumValue
AND YEAR(LastUpdated) = B.year
AND MONTH(LastUpdated) = B.month
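For readers who have the same data in pandas rather than in a database, the join can be sketched roughly as follows. This is only an analogue of the SQL above, not part of the original answer: the frames table_a and view_b are hypothetical names, and they hold just a few rows taken from the sample tables.

```python
import pandas as pd

# A few rows from TableA (None stands in for the NULL invoice)
table_a = pd.DataFrame({
    "RepairOrderDataId": [1, 2, 3, 14],
    "RepairShopId": [10, 10, 10, 10],
    "LastUpdated": pd.to_datetime(["2017-06-01 07:00", "2017-02-25 13:00",
                                   "2017-10-19 12:15", "2016-10-07 15:30"]),
    "TotalInvoice": [765, 400, 295679, None],
})

# Matching rows from ViewB (only the columns the filter needs)
view_b = pd.DataFrame({
    "RepairShopId": [10, 10, 10, 10],
    "MinimumValue": [10, 10, 10, 10],
    "MaximumValue": [204538.144334771, 153417.00126689,
                     72996.6054527277, 90942.6052417394],
    "year": [2017, 2017, 2017, 2016],
    "month": [6, 2, 10, 10],
})

# Derive year and month from LastUpdated, then join on shop, year and month
joined = (table_a
          .assign(year=table_a["LastUpdated"].dt.year,
                  month=table_a["LastUpdated"].dt.month)
          .merge(view_b, on=["RepairShopId", "year", "month"], how="inner"))

# Keep rows strictly between the monthly minimum and maximum
in_range = (joined["TotalInvoice"].gt(joined["MinimumValue"])
            & joined["TotalInvoice"].lt(joined["MaximumValue"]))
result = joined[in_range]
print(result[["RepairOrderDataId", "LastUpdated", "TotalInvoice"]])
```

As in the SQL version, a missing TotalInvoice never satisfies the comparison, so that row drops out.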

How to calculate monthly normals?

I have this df:
CODE TMAX TMIN PP
DATE
1991-01-01 000130 32.6 23.4 0.0
1991-01-02 000130 31.2 22.4 0.0
1991-01-03 000130 32.0 NaN 0.0
1991-01-04 000130 32.2 23.0 0.0
1991-01-05 000130 30.5 22.0 0.0
... ... ... ...
2020-12-27 158328 NaN NaN NaN
2020-12-28 158328 NaN NaN NaN
2020-12-29 158328 NaN NaN NaN
2020-12-30 158328 NaN NaN NaN
2020-12-31 158328 NaN NaN NaN
I have 30 years of data (1991-2020) for each CODE, and I want to calculate monthly normals of TMAX, TMIN and PP. For TMAX and TMIN I should first average the daily values within each month: if January has 31 days, I take the mean of those 31 values, which gives one value each for January 1991, January 1992, etc. That yields 30 Januaries (January 1991, January 1992, ..., January 2020), 30 Februaries, etc. After this I should average each group of months (Januaries with Januaries, Februaries with Februaries, etc.), which leaves 12 values, one for every month. Example:
(January1991 + January1992 + ..... + January 2020) /30
(February1991 + February1992 + ..... + February 2020) /30
.... same for every group of months.
So I'm using this code, but I don't know if it's OK.
from datetime import date
normalstemp=df[['CODE','TMAX','TMIN']].groupby([df.CODE, df.index.month]).mean().round(1)
For PP (precipitation) I should first sum the daily values within each month: if January has 31 days, I sum those 31 values, which gives one value each for January 1991, January 1992, etc. That yields 30 Januaries (January 1991, January 1992, ..., January 2020), 30 Februaries (February 1991, February 1992, ..., February 2020), etc. After this I should average each group of months (Januaries with Januaries, Februaries with Februaries, etc.), which leaves 12 values (one for every month, the same as for TMAX and TMIN).
Example:
(January1991 + January1992 + ..... + January 2020) /30
(February1991 + February1992 + ..... + February 2020) /30
.... same for every group of months.
So I'm using this code, but I know it isn't correct because I'm not getting the mean of the Januaries, Februaries, etc.
normalspp=df[['CODE','PP']].groupby([df.CODE, df.index.month]).sum().round(1)
I only have basic knowledge of Python, so I would appreciate any help.
Thanks in advance.
Ver 2: Average by Year-Month and by Month
import pandas as pd
import numpy as np
x = pd.date_range(start='1/1/1991', end='12/31/2020',freq='D')
df = pd.DataFrame({'Date':x.tolist()*2,
'Code':['000130']*10958 + ['158328']*10958,
'TMAX': np.random.randint(6,10, size=21916),
'TMIN': np.random.randint(1,5, size=21916)
})
# Create a Month column to get Average by Month for all years
df['Month'] = df.Date.dt.month
# Create a Year-Month column to get Average of each Month within the Year
df['Year_Mon'] = df.Date.dt.strftime('%Y-%m')
# Print the Average of each Month within each Year for each code
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean())
print (df.groupby(['Code','Year_Mon'])['TMIN'].mean())
# Print the Average of each Month irrespective of the year (for each code)
print (df.groupby(['Code','Month'])['TMAX'].mean())
print (df.groupby(['Code','Month'])['TMIN'].mean())
If you want to give a name for the TMAX Average value, you can add the reset_index and rename column. Here's code to do that.
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean().reset_index().rename(columns={'TMAX':'TMAX_Avg'}))
The output of this will be:
Average of TMAX for each Year-Month for each Code
Code Year_Mon
000130 1991-01 7.225806
1991-02 7.678571
1991-03 7.354839
1991-04 7.500000
1991-05 7.516129
...
158328 2020-08 7.387097
2020-09 7.300000
2020-10 7.516129
2020-11 7.500000
2020-12 7.451613
Name: TMAX, Length: 720, dtype: float64
Average of TMIN for each Year-Month for each Code
Code Year_Mon
000130 1991-01 2.419355
1991-02 2.571429
1991-03 2.193548
1991-04 2.366667
1991-05 2.451613
...
158328 2020-08 2.451613
2020-09 2.566667
2020-10 2.612903
2020-11 2.666667
2020-12 2.580645
Name: TMIN, Length: 720, dtype: float64
Average of TMAX for each Month for each Code (all years combined)
Code Month
000130 1 7.540860
2 7.536557
3 7.482796
4 7.486667
5 7.444086
6 7.570000
7 7.507527
8 7.529032
9 7.501111
10 7.401075
11 7.482222
12 7.517204
158328 1 7.532258
2 7.563679
3 7.490323
4 7.555556
5 7.500000
6 7.497778
7 7.545161
8 7.483871
9 7.526667
10 7.529032
11 7.547778
12 7.524731
Name: TMAX, dtype: float64
Ver 1: Average by Year and Month for each Code
Here is one way to do this.
You can create a Year_Mon column, then get the average of TMAX and TMIN for each month within the year by grouping on ('Code','Year_Mon').
See the code for more details.
import pandas as pd
import numpy as np
# create a range of dates from 1/1/2018 thru 12/31/2020 for each day
x = pd.date_range(start='1/1/2018', end='12/31/2020',freq='D')
# create a dataframe with the date ranges x 2 for two codes
# TMIN is a random integer from 1 thru 4 (randint excludes the upper bound) - you can put your actual data here
# TMAX is a random integer from 6 thru 9 (randint excludes the upper bound) - you can put your actual data here
df = pd.DataFrame({'Date':x.tolist()*2,
'Code':['000130']*1096 + ['158328']*1096,
'TMAX': np.random.randint(6,10, size=2192),
'TMIN': np.random.randint(1,5, size=2192)
})
# Create a Year-Month column using df.Date.dt.strftime
df['Year_Mon'] = df.Date.dt.strftime('%Y-%m')
# Calculate the Average of TMAX and TMIN using groupby Code and Year_Mon
df['TMAX_Avg'] = df.groupby(['Code','Year_Mon'])['TMAX'].transform('mean')
df['TMIN_Avg'] = df.groupby(['Code','Year_Mon'])['TMIN'].transform('mean')
The output of this will be:
Date Code TMAX TMIN Year_Mon TMAX_Avg TMIN_Avg
0 2018-01-01 000130 8 2 2018-01 7.451613 2.129032
1 2018-01-02 000130 7 4 2018-01 7.451613 2.129032
2 2018-01-03 000130 9 2 2018-01 7.451613 2.129032
3 2018-01-04 000130 6 1 2018-01 7.451613 2.129032
4 2018-01-05 000130 9 4 2018-01 7.451613 2.129032
5 2018-01-06 000130 6 1 2018-01 7.451613 2.129032
6 2018-01-07 000130 9 2 2018-01 7.451613 2.129032
7 2018-01-08 000130 9 2 2018-01 7.451613 2.129032
8 2018-01-09 000130 7 2 2018-01 7.451613 2.129032
9 2018-01-10 000130 8 2 2018-01 7.451613 2.129032
10 2018-01-11 000130 8 3 2018-01 7.451613 2.129032
11 2018-01-12 000130 7 2 2018-01 7.451613 2.129032
12 2018-01-13 000130 7 1 2018-01 7.451613 2.129032
13 2018-01-14 000130 8 1 2018-01 7.451613 2.129032
14 2018-01-15 000130 7 3 2018-01 7.451613 2.129032
15 2018-01-16 000130 6 1 2018-01 7.451613 2.129032
16 2018-01-17 000130 6 3 2018-01 7.451613 2.129032
17 2018-01-18 000130 9 3 2018-01 7.451613 2.129032
18 2018-01-19 000130 7 2 2018-01 7.451613 2.129032
19 2018-01-20 000130 8 1 2018-01 7.451613 2.129032
20 2018-01-21 000130 9 4 2018-01 7.451613 2.129032
21 2018-01-22 000130 6 2 2018-01 7.451613 2.129032
22 2018-01-23 000130 9 4 2018-01 7.451613 2.129032
23 2018-01-24 000130 6 2 2018-01 7.451613 2.129032
24 2018-01-25 000130 8 3 2018-01 7.451613 2.129032
25 2018-01-26 000130 6 2 2018-01 7.451613 2.129032
26 2018-01-27 000130 8 1 2018-01 7.451613 2.129032
27 2018-01-28 000130 8 3 2018-01 7.451613 2.129032
28 2018-01-29 000130 6 1 2018-01 7.451613 2.129032
29 2018-01-30 000130 6 1 2018-01 7.451613 2.129032
30 2018-01-31 000130 8 1 2018-01 7.451613 2.129032
31 2018-02-01 000130 7 1 2018-02 7.250000 2.428571
32 2018-02-02 000130 6 2 2018-02 7.250000 2.428571
33 2018-02-03 000130 6 4 2018-02 7.250000 2.428571
34 2018-02-04 000130 8 3 2018-02 7.250000 2.428571
35 2018-02-05 000130 8 2 2018-02 7.250000 2.428571
36 2018-02-06 000130 6 3 2018-02 7.250000 2.428571
37 2018-02-07 000130 6 3 2018-02 7.250000 2.428571
38 2018-02-08 000130 7 1 2018-02 7.250000 2.428571
39 2018-02-09 000130 9 4 2018-02 7.250000 2.428571
40 2018-02-10 000130 8 2 2018-02 7.250000 2.428571
41 2018-02-11 000130 7 4 2018-02 7.250000 2.428571
42 2018-02-12 000130 8 1 2018-02 7.250000 2.428571
43 2018-02-13 000130 6 4 2018-02 7.250000 2.428571
44 2018-02-14 000130 6 1 2018-02 7.250000 2.428571
45 2018-02-15 000130 6 4 2018-02 7.250000 2.428571
46 2018-02-16 000130 8 2 2018-02 7.250000 2.428571
47 2018-02-17 000130 7 3 2018-02 7.250000 2.428571
48 2018-02-18 000130 9 3 2018-02 7.250000 2.428571
49 2018-02-19 000130 8 2 2018-02 7.250000 2.428571
If you want only the Code, Year-Month, and TMIN and TMAX values, you can do:
TMAX average for each month within the year:
print (df.groupby(['Code','Year_Mon'])['TMAX'].mean())
Output will be:
Code Year_Mon
000130 2018-01 7.451613
2018-02 7.250000
2018-03 7.774194
2018-04 7.366667
2018-05 7.451613
...
158328 2020-08 7.935484
2020-09 7.666667
2020-10 7.548387
2020-11 7.333333
2020-12 7.580645
TMIN average for each month within the year:
print (df.groupby(['Code','Year_Mon'])['TMIN'].mean())
Output will be:
Code Year_Mon
000130 2018-01 2.129032
2018-02 2.428571
2018-03 2.451613
2018-04 2.500000
2018-05 2.677419
...
158328 2020-08 2.709677
2020-09 2.166667
2020-10 2.161290
2020-11 2.366667
2020-12 2.548387
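The two-step calculation described in the question (one value per month of each year first, then the average of those values per calendar month) can be sketched as below. This is a minimal sketch under assumptions: the data is synthetic, the frame is indexed by DATE as in the question, and the level names YEAR and MONTH are chosen here for illustration. PP is summed within each month before averaging, and min_count=1 is one possible choice so that a month with no observations stays NaN instead of becoming 0.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two station codes, daily rows for 1991-2020, indexed by DATE
dates = pd.date_range("1991-01-01", "2020-12-31", freq="D", name="DATE")
rng = np.random.default_rng(0)
frames = []
for code in ["000130", "158328"]:
    frames.append(pd.DataFrame({
        "CODE": code,
        "TMAX": rng.uniform(25, 35, len(dates)),
        "TMIN": rng.uniform(15, 25, len(dates)),
        "PP": rng.uniform(0, 5, len(dates)),
    }, index=dates))
df = pd.concat(frames)

year = df.index.year.rename("YEAR")
month = df.index.month.rename("MONTH")

# Step 1: one value per CODE, year and month
# (mean for temperatures, total for precipitation)
monthly = df.groupby(["CODE", year, month]).agg(
    TMAX=("TMAX", "mean"),
    TMIN=("TMIN", "mean"),
    PP=("PP", lambda s: s.sum(min_count=1)),
)

# Step 2: average the yearly values of each calendar month
normals = monthly.groupby(level=["CODE", "MONTH"]).mean().round(1)
print(normals)
```

The result has 12 rows per CODE. Note that averaging monthly means is not the same as averaging all daily values of a calendar month at once when months have missing days, which is why the two steps are kept separate here.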

GroupBy aggregation based on condition and year wise sum using pandas

I have a data frame as shown below
ID Sector Plot Tenancy_Start_Date Rental
1 SE1 A 2018-08-14 100
1 SE1 A 2019-08-18 200
2 SE1 B 2017-08-12 150
3 SE1 A 2020-02-12 300
5 SE2 A 2017-08-13 400
5 SE2 A 2019-08-12 300
6 SE2 B 2019-08-11 150
5 SE2 A 2020-01-10 300
7 SE2 B 2019-08-11 500
From the above I would like to prepare the data frame below, aggregated at Sector and Plot level.
Expected Output:
Sector Plot Total_Rental Rental_2017 Rental_2018 Rental_2019 Rental_2020
SE1 A 600 0 100 200 300
SE1 B 150 150 0 0 0
SE2 A 1000 400 0 300 300
SE2 B 650 0 0 650 0
I'd create a year column:
df['Year'] = df['Tenancy_Start_Date'].dt.year
then do your groupby
df['Rent_by_cats'] = df.groupby(['Sector', 'Year', 'Plot'])['Rental'].transform('sum')
then lastly move it into separate columns
yrs = df['Year'].unique().tolist()
for y in yrs:
    df['Rental_' + str(y)] = 0
    df.loc[df['Year'] == y, 'Rental_' + str(y)] = df['Rent_by_cats']
Output:
ID Sector Plot Tenancy_Start_Date Rental Year Rent_by_cats Rental_2018 Rental_2019 Rental_2017 Rental_2020
0 1 SE1 A 2018-08-14 100 2018 100 100 0 0 0
1 1 SE1 A 2019-08-18 200 2019 200 0 200 0 0
2 2 SE1 B 2017-08-12 150 2017 150 0 0 150 0
3 3 SE1 A 2020-02-12 300 2020 300 0 0 0 300
4 5 SE2 A 2017-08-13 400 2017 400 0 0 400 0
5 5 SE2 A 2019-08-12 300 2019 300 0 300 0 0
6 6 SE2 B 2019-08-11 150 2019 650 0 650 0 0
7 5 SE2 A 2020-01-10 300 2020 300 0 0 0 300
8 7 SE2 B 2019-08-11 500 2019 650 0 650 0 0
You can do (df being your input dataframe):
# in case it's not already a datetime:
df["Tenancy_Start_Date"]=pd.to_datetime(df["Tenancy_Start_Date"])
df2=df.pivot_table(index=["Sector", "Plot"], columns=df["Tenancy_Start_Date"].dt.year, values="Rental", aggfunc="sum").fillna(0)
df2.columns=[f"Rental_{col}" for col in df2.columns]
df2["Total_Rental"]=df2.sum(axis=1)
df2=df2.reset_index(drop=False)
Outputs:
Sector Plot ... Rental_2020 Total_Rental
0 SE1 A ... 300.0 600.0
1 SE1 B ... 0.0 150.0
2 SE2 A ... 300.0 1000.0
3 SE2 B ... 0.0 650.0
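If the integer values and the column order of the expected output matter, a variation along these lines may help. It is a sketch built on the question's sample data, not a replacement for the answers above. The helper column Year and the explicit astype(int) are choices made here for illustration.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "ID": [1, 1, 2, 3, 5, 5, 6, 5, 7],
    "Sector": ["SE1", "SE1", "SE1", "SE1", "SE2", "SE2", "SE2", "SE2", "SE2"],
    "Plot": ["A", "A", "B", "A", "A", "A", "B", "A", "B"],
    "Tenancy_Start_Date": pd.to_datetime([
        "2018-08-14", "2019-08-18", "2017-08-12", "2020-02-12", "2017-08-13",
        "2019-08-12", "2019-08-11", "2020-01-10", "2019-08-11"]),
    "Rental": [100, 200, 150, 300, 400, 300, 150, 300, 500],
})

out = (df.assign(Year=df["Tenancy_Start_Date"].dt.year)
         .pivot_table(index=["Sector", "Plot"], columns="Year",
                      values="Rental", aggfunc="sum", fill_value=0)
         .astype(int)              # no NaN left after fill_value, so this is safe
         .add_prefix("Rental_"))   # 2017 -> Rental_2017, etc.

# Put the row total first, as in the expected output
out.insert(0, "Total_Rental", out.sum(axis=1))
out = out.reset_index()
out.columns.name = None            # drop the leftover "Year" label
print(out)
```

Using fill_value=0 inside pivot_table avoids the separate fillna step, and inserting Total_Rental at position 0 keeps it ahead of the yearly columns.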

Groupby aggregate based on multiple condition pandas

I have a data frame as shown below
Sector Plot Year Amount Month
SE1 1 2017 10 Sep
SE1 1 2018 10 Oct
SE1 1 2019 10 Jun
SE1 1 2020 90 Feb
SE1 2 2018 50 Jan
SE1 2 2017 100 May
SE1 2 2018 30 Oct
SE2 2 2018 50 Mar
SE2 2 2019 100 Jan
From the above I would like to prepare the data frame below:
Sector Plot Number_of_Times Mean_Amount Recent_Amount Recent_year Recent_Month
SE1 1 4 30 90 2020 Feb
SE1 2 3 60 30 2018 Oct
SE2 2 2 75 100 2019 Jan
If the rows of each group are already in chronological order in the input data, use GroupBy.agg with named aggregations:
df1 = (df.groupby(['Sector','Plot']).agg(Number_of_Times=('Year','size'),
Mean_Amount=('Amount','mean'),
Recent_Amount=('Amount','last'),
Recent_year=('Year','last'),
Recent_Month=('Month','last')).reset_index())
print (df1)
Sector Plot Number_of_Times Mean_Amount Recent_Amount Recent_year \
0 SE1 1 4 30 90 2020
1 SE1 2 3 60 30 2018
2 SE2 2 2 75 100 2019
Recent_Month
0 Feb
1 Oct
2 Jan
If sorting is necessary, convert Month to datetimes, add DataFrame.sort_values, apply the same solution, and finally convert the months back to strings:
df['Month'] = pd.to_datetime(df['Month'], format='%b')
df1 = (df.sort_values(['Sector','Plot','Year','Month'])
.groupby(['Sector','Plot']).agg(Number_of_Times=('Year','size'),
Mean_Amount=('Amount','mean'),
Recent_Amount=('Amount','last'),
Recent_year=('Year','last'),
Recent_Month=('Month','last')).reset_index())
df1['Recent_Month'] = df1['Recent_Month'].dt.strftime('%b')
print (df1)
Sector Plot Number_of_Times Mean_Amount Recent_Amount Recent_year \
0 SE1 1 4 30 90 2020
1 SE1 2 3 60 30 2018
2 SE2 2 2 75 100 2019
Recent_Month
0 Feb
1 Oct
2 Jan
Another idea is to make Month an ordered categorical, but this is buggy in pandas 0.25.1:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['Month'] = pd.Categorical(df['Month'] , ordered=True, categories=months)
df1 = (df.sort_values(['Sector','Plot','Year','Month'])
.groupby(['Sector','Plot']).agg(Number_of_Times=('Year','size'),
Mean_Amount=('Amount','mean'),
Recent_Amount=('Amount','last'),
Recent_year=('Year','last'),
Recent_Month=('Month','last')).reset_index())
print (df1)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
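A third way to get a reliable sort order, sketched here as an assumption rather than taken from the answer, is to map the month abbreviations to numbers in a helper column. This avoids both the locale-dependent '%b' parsing and the categorical column that triggers the error above. The helper column name Month_Num is hypothetical.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Sector": ["SE1"] * 7 + ["SE2"] * 2,
    "Plot": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Year": [2017, 2018, 2019, 2020, 2018, 2017, 2018, 2018, 2019],
    "Amount": [10, 10, 10, 90, 50, 100, 30, 50, 100],
    "Month": ["Sep", "Oct", "Jun", "Feb", "Jan", "May", "Oct", "Mar", "Jan"],
})

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_num = df["Month"].map({m: i for i, m in enumerate(months, start=1)})

df1 = (df.assign(Month_Num=month_num)
         # chronological order inside each Sector/Plot group
         .sort_values(["Sector", "Plot", "Year", "Month_Num"])
         .groupby(["Sector", "Plot"])
         .agg(Number_of_Times=("Year", "size"),
              Mean_Amount=("Amount", "mean"),
              Recent_Amount=("Amount", "last"),
              Recent_year=("Year", "last"),
              Recent_Month=("Month", "last"))
         .reset_index())
print(df1)
```

Because Month itself stays a plain string column, 'last' returns the original abbreviation and no conversion back is needed.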

Calculate the groupby (several columns) average in pandas [duplicate]

This question already has answers here:
How to move pandas data from index to column after multiple groupby
(4 answers)
Closed 3 years ago.
I have a dataframe as shown below.
Unit_ID Price Sector Contract_Date Rooms
1 20 SE1 16-10-2015 2
9 40 SE1 20-10-2015 2
2 40 SE1 16-10-2016 3
2 30 SE1 16-10-2015 3
3 20 SE1 16-10-2015 3
3 10 SE1 16-10-2016 3
4 60 SE1 16-10-2016 2
5 40 SE2 16-10-2015 2
8 80 SE1 20-10-2015 2
6 80 SE2 16-10-2016 3
6 60 SE2 16-10-2015 3
7 40 SE2 16-10-2015 3
7 20 SE2 16-10-2015 3
8 120 SE2 16-10-2016 2
From the above I would like to prepare a dataframe as shown below in pandas.
Expected Output:
Sector Rooms Year Average_Price
SE1 2 2015 30
SE1 2 2016 60
SE1 3 2015 25
SE1 3 2016 25
SE2 2 2015 60
SE2 2 2016 120
SE2 3 2015 50
SE2 3 2016 50
I think I should use pandas groupby.
I tried the following code:
df['Year'] = df.Contract_Date.dt.year
df1 = df.groupby(['Sector', 'Year', 'Rooms']).Price.mean()
Use:
( df.groupby(['Sector','Rooms',df['Contract_Date'].dt.year.rename('Year')])
.Price
.mean()
.rename('Average_Price')
.reset_index() )
Sector Rooms Year Average_Price
0 SE1 2 2015 46.666667
1 SE1 2 2016 60.000000
2 SE1 3 2015 25.000000
3 SE1 3 2016 25.000000
4 SE2 2 2015 40.000000
5 SE2 2 2016 120.000000
6 SE2 3 2015 40.000000
7 SE2 3 2016 80.000000
or using groupby.agg:
( df.groupby(['Sector','Rooms',df['Contract_Date'].dt.year.rename('Year')])
.Price
.agg(Average_Price = 'mean')
.reset_index() )
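Both snippets rely on Contract_Date already being a datetime column. If it is still text in day-month-year form, as the sample suggests, it needs parsing first. The sketch below rebuilds the sample frame and assumes the format '%d-%m-%Y'; the variable name out is illustrative.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Unit_ID": [1, 9, 2, 2, 3, 3, 4, 5, 8, 6, 6, 7, 7, 8],
    "Price": [20, 40, 40, 30, 20, 10, 60, 40, 80, 80, 60, 40, 20, 120],
    "Sector": ["SE1", "SE1", "SE1", "SE1", "SE1", "SE1", "SE1",
               "SE2", "SE1", "SE2", "SE2", "SE2", "SE2", "SE2"],
    "Contract_Date": ["16-10-2015", "20-10-2015", "16-10-2016", "16-10-2015",
                      "16-10-2015", "16-10-2016", "16-10-2016", "16-10-2015",
                      "20-10-2015", "16-10-2016", "16-10-2015", "16-10-2015",
                      "16-10-2015", "16-10-2016"],
    "Rooms": [2, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2],
})

# Parse day-month-year strings explicitly so the day is never read as the month
df["Contract_Date"] = pd.to_datetime(df["Contract_Date"], format="%d-%m-%Y")

out = (df.groupby(["Sector", "Rooms",
                   df["Contract_Date"].dt.year.rename("Year")])
         .agg(Average_Price=("Price", "mean"))
         .reset_index())
print(out)
```

The averages match the output shown in the answer, for example 46.666667 for SE1 with 2 rooms in 2015, which comes from the three prices 20, 40 and 80.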