Groupby sum by year in pandas

I have a data frame as shown below. It holds sales data for two health care products from December 2016 to November 2018.
product profit sale_date discount
A 50 2016-12-01 5
A 50 2017-01-03 4
B 200 2016-12-24 10
A 50 2017-01-18 3
B 200 2017-01-28 15
A 50 2017-01-18 6
B 200 2017-01-28 20
A 50 2017-04-18 6
B 200 2017-12-08 25
A 50 2017-11-18 6
B 200 2017-08-21 20
B 200 2017-12-28 30
A 50 2018-03-18 10
B 300 2018-06-08 45
B 300 2018-09-20 50
A 50 2018-11-18 8
B 300 2018-11-28 35
From the above I would like to prepare the dataframe below and draw it as a line plot.
Expected Output
bought_year total_profit
2016 250
2017 1250
2018 1000
X axis = bought_year
Y axis = profit

Use groupby with dt.year to group by year, and .agg with a named aggregation to name your column.
df1 = df.groupby(df['sale_date'].dt.year).agg(total_profit=('profit','sum'))\
.reset_index().rename(columns={'sale_date': 'bought_year'})
print(df1)
bought_year total_profit
0 2016 250
1 2017 1250
2 2018 1000
df1.set_index('bought_year').plot(kind='bar')
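A self-contained sketch of the answer above, assuming sale_date may still be stored as strings (hence the pd.to_datetime call) and leaving out the discount column, which the aggregation does not use. Renaming the grouping key up front is a small variation on the answer's reset_index().rename(...) step, not a different technique.

```python
import pandas as pd

# Sample rows copied from the question (discount column omitted).
df = pd.DataFrame({
    "product": list("AABABABABABBABBAB"),
    "profit": [50, 50, 200, 50, 200, 50, 200, 50, 200,
               50, 200, 200, 50, 300, 300, 50, 300],
    "sale_date": ["2016-12-01", "2017-01-03", "2016-12-24", "2017-01-18",
                  "2017-01-28", "2017-01-18", "2017-01-28", "2017-04-18",
                  "2017-12-08", "2017-11-18", "2017-08-21", "2017-12-28",
                  "2018-03-18", "2018-06-08", "2018-09-20", "2018-11-18",
                  "2018-11-28"],
})

# .dt.year only works on datetime columns, so convert first.
df["sale_date"] = pd.to_datetime(df["sale_date"])

# Group by the year of sale_date; naming the key gives the output column its name.
df1 = (
    df.groupby(df["sale_date"].dt.year.rename("bought_year"))
      .agg(total_profit=("profit", "sum"))
      .reset_index()
)
print(df1)
```

Since the question asks for a line plot, df1.plot(x='bought_year', y='total_profit', kind='line') should draw one when matplotlib is installed; kind='bar' as in the answer gives bars instead.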


Presto Window Function - (sum over partition by)

I have the following two Presto tables, one storing budget information by client and day, and the other storing spend information by client and day.
select day, client_id, budget_id, budget_period, budget_amount
from budget_table
day client_id budget_id budget_period budget_amount
2021-02-27 1 1-1 daily 10
2021-02-28 1 1-1 daily 10
2021-03-01 1 1-1 daily 10
2021-03-02 1 1-1 daily 10
2021-03-03 1 1-1 daily 10
2021-03-04 1 1-2 monthly 500
2021-03-05 1 1-2 monthly 500
2021-03-06 1 1-2 monthly 500
2021-02-27 2 2-1 monthly 400
2021-02-28 2 2-1 monthly 400
2021-03-01 2 2-1 monthly 400
2021-03-02 2 2-1 monthly 400
2021-03-03 2 2-2 one_time 1000
2021-03-04 2 2-2 one_time 1000
2021-03-05 2 2-2 one_time 1000
2021-03-06 2 2-2 one_time 1000
select day, client_id, spend
from spend_table
day client_id spend
2021-02-27 1 8
2021-02-28 1 9
2021-03-01 1 10
2021-03-02 1 7
2021-03-03 1 6
2021-03-04 1 16
2021-03-05 1 19
2021-03-06 1 18
2021-02-27 2 13
2021-02-28 2 15
2021-03-01 2 14
2021-03-02 2 15
2021-03-03 2 20
2021-03-04 2 25
2021-03-05 2 18
2021-03-06 2 27
Below is desired output:
day client_id budget_id budget_period budget_amount spend spend_over_period
2021-02-27 1 1-1 daily 10 8 8
2021-02-28 1 1-1 daily 10 9 9
2021-03-01 1 1-1 daily 10 10 10
2021-03-02 1 1-1 daily 10 7 7
2021-03-03 1 1-1 daily 10 6 6
2021-03-04 1 1-2 monthly 500 16 16
2021-03-05 1 1-2 monthly 500 19 35
2021-03-06 1 1-2 monthly 500 18 53
2021-02-27 2 2-1 monthly 400 13 13
2021-02-28 2 2-1 monthly 400 15 28
2021-03-01 2 2-1 monthly 400 14 14
2021-03-02 2 2-1 monthly 400 15 29
2021-03-03 2 2-2 one_time 1000 20 20
2021-03-04 2 2-2 one_time 1000 25 45
2021-03-05 2 2-2 one_time 1000 18 63
2021-03-06 2 2-2 one_time 1000 27 90
I have tried
select s.day,
s.client_id,
b.budget_id,
b.budget_period,
b.budget_amount,
s.spend,
case when b.budget_period = 'daily' then s.spend
when b.budget_period = 'monthly' then sum(s.spend) over (partition by b.budget_id, month(date(s.day)))
when b.budget_period = 'one_time' then sum(s.spend) over (partition by b.budget_id)
end as budget_over_period
from spend_table as s
join budget_table as b
on s.day = b.day
and s.client_id = b.client_id
group by 1,2,3,4,5,6
But I get an 'EXPRESSION_NOT_AGGREGATE' error.
Does anybody know how to query to get the desired output in Presto?
You can remove the group by clause completely and give the window functions an ordering and a frame, order by date(s.day) with range between unbounded preceding and current row, so that each sum becomes a running total within its partition:
select s.day,
s.client_id,
b.budget_id,
b.budget_period,
b.budget_amount,
s.spend,
case
when b.budget_period = 'daily' then s.spend
when b.budget_period = 'monthly' then sum(s.spend) over (partition by b.budget_id, month(date(s.day)) order by date(s.day) range between unbounded preceding and current row)
when b.budget_period = 'one_time' then sum(s.spend) over (partition by b.budget_id order by date(s.day) range between unbounded preceding and current row)
end as spend_over_period
from spend_table as s
join budget_table as b
on s.day = b.day
and s.client_id = b.client_id
order by 2,3,1
Output:
day client_id budget_id budget_period budget_amount spend spend_over_period
2021-02-27 1 1-1 daily 10 8 8
2021-02-28 1 1-1 daily 10 9 9
2021-03-01 1 1-1 daily 10 10 10
2021-03-02 1 1-1 daily 10 7 7
2021-03-03 1 1-1 daily 10 6 6
2021-03-04 1 1-2 monthly 500 16 16
2021-03-05 1 1-2 monthly 500 19 35
2021-03-06 1 1-2 monthly 500 18 53
2021-02-27 2 2-1 monthly 400 13 13
2021-02-28 2 2-1 monthly 400 15 28
2021-03-01 2 2-1 monthly 400 14 14
2021-03-02 2 2-1 monthly 400 15 29
2021-03-03 2 2-2 one_time 1000 20 20
2021-03-04 2 2-2 one_time 1000 25 45
2021-03-05 2 2-2 one_time 1000 18 63
2021-03-06 2 2-2 one_time 1000 27 90
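As a cross-check outside Presto, the same running totals can be reproduced in pandas. This is only a sketch under the assumption that the two tables are loaded as DataFrames with the column names shown above. It swaps the SQL window functions for groupby(...).cumsum(), which plays the role of sum(...) over (partition by ... order by ...) once the rows are sorted by day.

```python
import numpy as np
import pandas as pd

days = list(pd.date_range("2021-02-27", "2021-03-06"))  # the 8 sample days

budget = pd.DataFrame({
    "day": days * 2,
    "client_id": [1] * 8 + [2] * 8,
    "budget_id": ["1-1"] * 5 + ["1-2"] * 3 + ["2-1"] * 4 + ["2-2"] * 4,
    "budget_period": ["daily"] * 5 + ["monthly"] * 7 + ["one_time"] * 4,
    "budget_amount": [10] * 5 + [500] * 3 + [400] * 4 + [1000] * 4,
})
spend = pd.DataFrame({
    "day": days * 2,
    "client_id": [1] * 8 + [2] * 8,
    "spend": [8, 9, 10, 7, 6, 16, 19, 18, 13, 15, 14, 15, 20, 25, 18, 27],
})

# Same join as the query, sorted so cumulative sums run in date order.
merged = (
    spend.merge(budget, on=["day", "client_id"])
         .sort_values(["client_id", "budget_id", "day"])
         .reset_index(drop=True)
)

# Running total per budget and calendar month (the 'monthly' branch).
monthly = merged.groupby(["budget_id", merged["day"].dt.month])["spend"].cumsum()
# Running total per budget over all days (the 'one_time' branch).
one_time = merged.groupby("budget_id")["spend"].cumsum()

# Pick the branch per row, mirroring the CASE expression.
period = merged["budget_period"]
merged["spend_over_period"] = np.select(
    [period == "daily", period == "monthly", period == "one_time"],
    [merged["spend"], monthly, one_time],
    default=0,
)
print(merged)
```

Like the query, the monthly branch groups on the month number alone, so data spanning several years would also need the year in the grouping key.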

Filter rows of a table based on a condition that implies: 1) value of a field within a range 2) id of the business and 3) date?

I want to filter TableA, keeping only those rows whose "TotalInvoice" field lies between the minimum and maximum values given in ViewB for the matching month, year and RepairShopId (the sample data has only one RepairShopId, but the full data has several).
In the view I have minimum and maximum values for each business and each month and year.
TableA
RepairOrderDataId RepairShopId LastUpdated TotalInvoice
1 10 2017-06-01 07:00:00.000 765
1 10 2017-06-05 12:15:00.000 765
2 10 2017-02-25 13:00:00.000 400
3 10 2017-10-19 12:15:00.000 295679
4 10 2016-11-29 11:00:00.000 133409.41
5 10 2016-10-28 12:30:00.000 127769
6 10 2016-11-25 16:15:00.000 122400
7 10 2016-10-18 11:15:00.000 1950
8 10 2016-11-07 16:45:00.000 79342.7
9 10 2016-11-25 19:15:00.000 1950
10 10 2016-12-09 14:00:00.000 111559
11 10 2016-11-28 10:30:00.000 106333
12 10 2016-12-13 18:00:00.000 23847.4
13 10 2016-11-01 17:00:00.000 22782.9
14 10 2016-10-07 15:30:00.000 NULL
15 10 2017-01-06 15:30:00.000 138958
16 10 2017-01-31 13:00:00.000 244484
17 10 2016-12-05 09:30:00.000 180236
18 10 2017-02-14 18:30:00.000 92752.6
19 10 2016-10-05 08:30:00.000 161952
20 10 2016-10-05 08:30:00.000 8713.08
ViewB
RepairShopId Orders Average MinimumValue MaximumValue year month yearMonth
10 1 370343 370343 370343 2015 7 2015-7
10 1 109645 109645 109645 2015 10 2015-10
10 1 148487 148487 148487 2015 12 2015-12
10 1 133409.41 133409.41 133409.41 2016 3 2016-3
10 1 19261 19261 19261 2016 8 2016-8
10 4 10477.3575 2656.65644879821 18298.0585512018 2016 9 2016-9
10 69 15047.709565 10 90942.6052417394 2016 10 2016-10
10 98 22312.077244 10 147265.581935242 2016 11 2016-11
10 96 20068.147395 10 99974.1750708773 2016 12 2016-12
10 86 25334.053372 10 184186.985160105 2017 1 2017-1
10 69 21410.63855 10 153417.00126689 2017 2 2017-2
10 100 13009.797 10 59002.3589332934 2017 3 2017-3
10 101 11746.191287 10 71405.3391452842 2017 4 2017-4
10 123 11143.49756 10 55306.8202091131 2017 5 2017-5
10 197 15980.55406 10 204538.144334771 2017 6 2017-6
10 99 10852.496969 10 63283.9899761938 2017 7 2017-7
10 131 52601.981526 10 1314998.61355187 2017 8 2017-8
10 124 10983.221854 10 59444.0535811233 2017 9 2017-9
10 115 12467.148434 10 72996.6054527277 2017 10 2017-10
10 123 14843.379593 10 129673.931373139 2017 11 2017-11
10 111 8535.455945 10 50328.1495501884 2017 12 2017-12
I've tried:
SELECT *
FROM TableA
INNER JOIN ViewB ON TableA.RepairShopId = ViewB.RepairShopId
WHERE TotalInvoice > MinimumValue AND TotalInvoice < MaximumValue
AND TableA.RepairShopId = ViewB.RepairShopId
But I'm not sure how to compare the yearMonth field with the datetime field "LastUpdated".
Any help is very appreciated!
Here is how you can do it.
I assumed that LastUpdated is the TableA column holding the date to compare with the view's year and month:
SELECT *
FROM TableA A
INNER JOIN ViewB B
ON A.RepairShopId = B.RepairShopId
AND A.TotalInvoice > B.MinimumValue
AND A.TotalInvoice < B.MaximumValue
AND YEAR(LastUpdated) = B.year
AND MONTH(LastUpdated) = B.month
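For readers who want to check the join logic on a few rows outside the database, here is a pandas sketch of the same idea. It assumes both objects are loaded as DataFrames (the names orders and bounds are hypothetical) and uses only a handful of the sample rows above. It derives year and month from LastUpdated, joins on them together with RepairShopId, then applies the strict range filter.

```python
import pandas as pd

# A few rows from TableA; None stands in for the NULL invoice.
orders = pd.DataFrame({
    "RepairOrderDataId": [1, 3, 5, 7, 14],
    "RepairShopId": [10, 10, 10, 10, 10],
    "LastUpdated": pd.to_datetime([
        "2017-06-01 07:00", "2017-10-19 12:15", "2016-10-28 12:30",
        "2016-10-18 11:15", "2016-10-07 15:30",
    ]),
    "TotalInvoice": [765.0, 295679.0, 127769.0, 1950.0, None],
})

# The matching rows from ViewB.
bounds = pd.DataFrame({
    "RepairShopId": [10, 10, 10],
    "year": [2016, 2017, 2017],
    "month": [10, 6, 10],
    "MinimumValue": [10.0, 10.0, 10.0],
    "MaximumValue": [90942.6052417394, 204538.144334771, 72996.6054527277],
})

# Equivalent of YEAR(LastUpdated) and MONTH(LastUpdated).
orders["year"] = orders["LastUpdated"].dt.year
orders["month"] = orders["LastUpdated"].dt.month

# Inner join on shop, year and month, then keep invoices strictly inside the range.
joined = orders.merge(bounds, on=["RepairShopId", "year", "month"])
kept = joined[
    (joined["TotalInvoice"] > joined["MinimumValue"])
    & (joined["TotalInvoice"] < joined["MaximumValue"])
]
print(kept[["RepairOrderDataId", "LastUpdated", "TotalInvoice"]])
```

As in SQL, the row with a missing TotalInvoice fails both comparisons and is dropped.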

How to select data for specific time intervals after using Pandas' resample function?

I used Pandas' resample function to calculate the sales of a list of products every 6 months.
I called resample with '6M' and then apply({"column-name": "sum"}).
Now I'd like to create a table with the sum of the sales for the first six months.
How can I extract the sum of the first 6 months, given that all products have records for more than 3 years, and none of them have the same start date?
Thanks in advance for any suggestions.
Here is an example of the data:
Product Date sales
Product 1 6/30/2017 20
12/31/2017 60
6/30/2018 50
12/31/2018 100
Product 2 1/31/2017 30
7/31/2017 150
1/31/2018 200
7/31/2018 300
1/31/2019 100
While waiting for your data, I worked on this. See if this is something that will be helpful for you.
import pandas as pd
df = pd.DataFrame({'Date':['2018-01-10','2018-02-15','2018-03-18',
'2018-07-10','2018-09-12','2018-10-14',
'2018-11-16','2018-12-20','2019-01-10',
'2019-04-15','2019-06-12','2019-10-18',
'2019-12-02','2020-01-05','2020-02-25',
'2020-03-15','2020-04-11','2020-07-22'],
'Sales':[200,300,100,250,150,350,150,200,250,
200,300,100,250,150,350,150,200,250]})
#first breakdown the data by Yearly Quarters
df['YQtr'] = pd.PeriodIndex(pd.to_datetime(df.Date), freq='Q')
#next create a column to identify Half Yearly - H1 for Jan-Jun & H2 for Jul-Dec
df.loc[df['YQtr'].astype(str).str[-2:].isin(['Q1','Q2']),'HYear'] = df['YQtr'].astype(str).str[:-2]+'H1'
df.loc[df['YQtr'].astype(str).str[-2:].isin(['Q3','Q4']),'HYear'] = df['YQtr'].astype(str).str[:-2]+'H2'
#Do a cumulative sum on Half Year to get sales by H1 & H2 for each year
df['HYear_cumsum'] = df.groupby('HYear')['Sales'].cumsum()
#Now filter out only the rows with the max value. That's the H1 & H2 sales figure
df1 = df[df.groupby('HYear')['HYear_cumsum'].transform('max')== df['HYear_cumsum']]
print (df)
print (df1)
The output of this will be:
Source Data + Half Year cumulative sum:
Date Sales YQtr HYear HYear_cumsum
0 2018-01-10 200 2018Q1 2018H1 200
1 2018-02-15 300 2018Q1 2018H1 500
2 2018-03-18 100 2018Q1 2018H1 600
3 2018-07-10 250 2018Q3 2018H2 250
4 2018-09-12 150 2018Q3 2018H2 400
5 2018-10-14 350 2018Q4 2018H2 750
6 2018-11-16 150 2018Q4 2018H2 900
7 2018-12-20 200 2018Q4 2018H2 1100
8 2019-01-10 250 2019Q1 2019H1 250
9 2019-04-15 200 2019Q2 2019H1 450
10 2019-06-12 300 2019Q2 2019H1 750
11 2019-10-18 100 2019Q4 2019H2 100
12 2019-12-02 250 2019Q4 2019H2 350
13 2020-01-05 150 2020Q1 2020H1 150
14 2020-02-25 350 2020Q1 2020H1 500
15 2020-03-15 150 2020Q1 2020H1 650
16 2020-04-11 200 2020Q2 2020H1 850
17 2020-07-22 250 2020Q3 2020H2 250
The half year cumulative sum for each half year.
Date Sales YQtr HYear HYear_cumsum
2 2018-03-18 100 2018Q1 2018H1 600
7 2018-12-20 200 2018Q4 2018H2 1100
10 2019-06-12 300 2019Q2 2019H1 750
12 2019-12-02 250 2019Q4 2019H2 350
16 2020-04-11 200 2020Q2 2020H1 850
17 2020-07-22 250 2020Q3 2020H2 250
I will look at your sample data and work on it later tonight.
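The answer above uses calendar half-years, while the question asks for the first six months counted from each product's own start date. A possible sketch of that variant follows, assuming raw (not yet resampled) records with Product, Date and sales columns. The frame raw and its values are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical raw sales records; products, dates and amounts are invented.
raw = pd.DataFrame({
    "Product": ["P1", "P1", "P1", "P1", "P2", "P2", "P2"],
    "Date": pd.to_datetime([
        "2017-01-15", "2017-03-10", "2017-07-14", "2017-09-01",
        "2017-02-20", "2017-08-19", "2017-08-25",
    ]),
    "sales": [20, 30, 40, 50, 10, 15, 25],
})

# Each product's first sale date, repeated on every row of that product.
start = raw.groupby("Product")["Date"].transform("min")

# True for rows that fall within six months of the product's own start date.
in_first_six = raw["Date"] < start + pd.DateOffset(months=6)

# Sum of sales per product over its first six months only.
first_six = raw[in_first_six].groupby("Product")["sales"].sum()
print(first_six)
```

If the data has already been resampled per product so that the first row of each product is its first six-month bucket, groupby('Product').first() on the resampled frame would be a shorter route.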

How to merge two dataframes on dates that differ by one day?

Input
df1
id A
2020-01-01 10
2020-02-07 20
2020-04-09 30
df2
id B
2019-12-31 50
2020-02-06 20
2020-02-07 70
2020-04-08 34
2020-04-09 44
Goal
df
id A B
2020-01-01 10 50
2020-02-07 20 20
2020-04-09 30 34
Details:
df1 is merged with df2 on id, adding the columns from df2.
The type of id is datetime.
Merge rule: each df1 row takes the df2 row dated one day earlier.
Could you simply add 1 day to df2's ID column before merging?
df1.merge(df2.assign(id=df2['id'] + pd.Timedelta(days=1)), on='id')
id A B
0 2020-01-01 10 50
1 2020-02-07 20 20
2 2020-04-09 30 34
Try pd.merge_asof (both frames must already be sorted by id):
df = pd.merge_asof(df1,df2,on='id',tolerance=pd.Timedelta('1 day'),allow_exact_matches=False)
id A B
0 2020-01-01 10 50
1 2020-02-07 20 20
2 2020-04-09 30 34
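Both suggestions can be run side by side on the sample data. This sketch assumes id is a datetime column holding dates without a time component, and sorts explicitly before merge_asof since that function requires sorted keys.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": pd.to_datetime(["2020-01-01", "2020-02-07", "2020-04-09"]),
    "A": [10, 20, 30],
})
df2 = pd.DataFrame({
    "id": pd.to_datetime(["2019-12-31", "2020-02-06", "2020-02-07",
                          "2020-04-08", "2020-04-09"]),
    "B": [50, 20, 70, 34, 44],
})

# Approach 1: move df2 forward by one day, then do an ordinary merge on id.
shifted = df1.merge(df2.assign(id=df2["id"] + pd.Timedelta(days=1)), on="id")

# Approach 2: for each df1 row take the latest earlier df2 row, at most 1 day back.
asof = pd.merge_asof(
    df1.sort_values("id"),
    df2.sort_values("id"),
    on="id",
    tolerance=pd.Timedelta("1 day"),
    allow_exact_matches=False,  # skip the df2 row with the same date
)
print(shifted)
print(asof)
```

The two differ when yesterday's row is missing from df2: the plain merge is an inner join and drops that df1 row, while merge_asof behaves like a left join and keeps it with NaN in B.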

GroupBy aggregation based on condition and year wise sum using pandas

I have a data frame as shown below
ID Sector Plot Tenancy_Start_Date Rental
1 SE1 A 2018-08-14 100
1 SE1 A 2019-08-18 200
2 SE1 B 2017-08-12 150
3 SE1 A 2020-02-12 300
5 SE2 A 2017-08-13 400
5 SE2 A 2019-08-12 300
6 SE2 B 2019-08-11 150
5 SE2 A 2020-01-10 300
7 SE2 B 2019-08-11 500
From the above I would like to prepare the data frame below, aggregated at the Sector and Plot level.
Expected Output:
Sector Plot Total_Rental Rental_2017 Rental_2018 Rental_2019 Rental_2020
SE1 A 600 0 100 200 300
SE1 B 150 150 0 0 0
SE2 A 1000 400 0 300 300
SE2 B 650 0 0 650 0
I'd create a year column:
df['Year'] = df['Tenancy_Start_Date'].dt.year
then do your groupby
df['Rent_by_cats'] = df.groupby(['Sector', 'Year', 'Plot'])['Rental'].transform('sum')
then lastly move it into separate columns
yrs = df['Year'].unique().tolist()
for y in yrs:
    # start every year column at 0, then fill in the rows that belong to that year
    df['Rental_' + str(y)] = 0
    df.loc[df['Year'] == y, 'Rental_' + str(y)] = df['Rent_by_cats']
Output:
ID Sector Plot Tenancy_Start_Date Rental Year Rent_by_cats Rental_2018 Rental_2019 Rental_2017 Rental_2020
0 1 SE1 A 2018-08-14 100 2018 100 100 0 0 0
1 1 SE1 A 2019-08-18 200 2019 200 0 200 0 0
2 2 SE1 B 2017-08-12 150 2017 150 0 0 150 0
3 3 SE1 A 2020-02-12 300 2020 300 0 0 0 300
4 5 SE2 A 2017-08-13 400 2017 400 0 0 400 0
5 5 SE2 A 2019-08-12 300 2019 300 0 300 0 0
6 6 SE2 B 2019-08-11 150 2019 650 0 650 0 0
7 5 SE2 A 2020-01-10 300 2020 300 0 0 0 300
8 7 SE2 B 2019-08-11 500 2019 650 0 650 0 0
You can do (df being your input dataframe):
#in case if it's not already a datetime:
df["Tenancy_Start_Date"]=pd.to_datetime(df["Tenancy_Start_Date"])
df2=df.pivot_table(index=["Sector", "Plot"], columns=df["Tenancy_Start_Date"].dt.year, values="Rental", aggfunc="sum").fillna(0)
df2.columns=[f"Rental_{col}" for col in df2.columns]
df2["Total_Rental"]=df2.sum(axis=1)
df2=df2.reset_index(drop=False)
Outputs:
Sector Plot ... Rental_2020 Total_Rental
0 SE1 A ... 300.0 600.0
1 SE1 B ... 0.0 150.0
2 SE2 A ... 300.0 1000.0
3 SE2 B ... 0.0 650.0
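A self-contained variation on the pivot_table answer, sketched here to match the column order of the expected output (Total_Rental first). It assumes the sample data from the question. The fill_value=0 argument replaces the separate fillna(0) step, and add_prefix builds the Rental_<year> names.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 3, 5, 5, 6, 5, 7],
    "Sector": ["SE1", "SE1", "SE1", "SE1", "SE2", "SE2", "SE2", "SE2", "SE2"],
    "Plot": ["A", "A", "B", "A", "A", "A", "B", "A", "B"],
    "Tenancy_Start_Date": pd.to_datetime([
        "2018-08-14", "2019-08-18", "2017-08-12", "2020-02-12", "2017-08-13",
        "2019-08-12", "2019-08-11", "2020-01-10", "2019-08-11",
    ]),
    "Rental": [100, 200, 150, 300, 400, 300, 150, 300, 500],
})

# One row per (Sector, Plot), one column per year, missing combinations set to 0.
by_year = (
    df.assign(Year=df["Tenancy_Start_Date"].dt.year)
      .pivot_table(index=["Sector", "Plot"], columns="Year",
                   values="Rental", aggfunc="sum", fill_value=0)
      .add_prefix("Rental_")
)

# Row total across the year columns, placed in front of them.
by_year.insert(0, "Total_Rental", by_year.sum(axis=1))

result = by_year.reset_index()
result.columns.name = None  # drop the leftover "Year" label on the column index
print(result)
```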