How to select data for specific time intervals after using Pandas’ resample function? - dataframe

I used Pandas’ resample function to calculate the sales of a list of products every 6 months.
I resampled with ‘6M’ and applied apply({"column-name": "sum"}).
Now I’d like to create a table with the sum of the sales for the first six months.
How can I extract the sum of the first 6 months, given that all products have records for more than 3 years, and none of them have the same start date?
Thanks in advance for any suggestions.
Here is an example of the data:
Product     Date        sales
Product 1   6/30/2017    20
            12/31/2017   60
            6/30/2018    50
            12/31/2018  100
Product 2   1/31/2017    30
            7/31/2017   150
            1/31/2018   200
            7/31/2018   300
            1/31/2019   100

While waiting for your data, I worked on this; see if it is helpful for you.
import pandas as pd

df = pd.DataFrame({'Date': ['2018-01-10','2018-02-15','2018-03-18',
                            '2018-07-10','2018-09-12','2018-10-14',
                            '2018-11-16','2018-12-20','2019-01-10',
                            '2019-04-15','2019-06-12','2019-10-18',
                            '2019-12-02','2020-01-05','2020-02-25',
                            '2020-03-15','2020-04-11','2020-07-22'],
                   'Sales': [200,300,100,250,150,350,150,200,250,
                             200,300,100,250,150,350,150,200,250]})

# First break the data down by calendar quarter
df['YQtr'] = pd.PeriodIndex(pd.to_datetime(df.Date), freq='Q')

# Next create a half-year column: H1 for Jan-Jun, H2 for Jul-Dec
df.loc[df['YQtr'].astype(str).str[-2:].isin(['Q1','Q2']), 'HYear'] = df['YQtr'].astype(str).str[:-2] + 'H1'
df.loc[df['YQtr'].astype(str).str[-2:].isin(['Q3','Q4']), 'HYear'] = df['YQtr'].astype(str).str[:-2] + 'H2'

# Cumulative sum within each half year gives the running H1/H2 sales
df['HYear_cumsum'] = df.groupby('HYear')['Sales'].cumsum()

# Keep only the rows holding the max cumulative sum: those are the H1/H2 totals
df1 = df[df.groupby('HYear')['HYear_cumsum'].transform('max') == df['HYear_cumsum']]
print(df)
print(df1)
The output of this will be:
Source Data + Half Year cumulative sum:
Date Sales YQtr HYear HYear_cumsum
0 2018-01-10 200 2018Q1 2018H1 200
1 2018-02-15 300 2018Q1 2018H1 500
2 2018-03-18 100 2018Q1 2018H1 600
3 2018-07-10 250 2018Q3 2018H2 250
4 2018-09-12 150 2018Q3 2018H2 400
5 2018-10-14 350 2018Q4 2018H2 750
6 2018-11-16 150 2018Q4 2018H2 900
7 2018-12-20 200 2018Q4 2018H2 1100
8 2019-01-10 250 2019Q1 2019H1 250
9 2019-04-15 200 2019Q2 2019H1 450
10 2019-06-12 300 2019Q2 2019H1 750
11 2019-10-18 100 2019Q4 2019H2 100
12 2019-12-02 250 2019Q4 2019H2 350
13 2020-01-05 150 2020Q1 2020H1 150
14 2020-02-25 350 2020Q1 2020H1 500
15 2020-03-15 150 2020Q1 2020H1 650
16 2020-04-11 200 2020Q2 2020H1 850
17 2020-07-22 250 2020Q3 2020H2 250
The half-year totals (the maximum cumulative-sum row for each half year):
Date Sales YQtr HYear HYear_cumsum
2 2018-03-18 100 2018Q1 2018H1 600
7 2018-12-20 200 2018Q4 2018H2 1100
10 2019-06-12 300 2019Q2 2019H1 750
12 2019-12-02 250 2019Q4 2019H2 350
16 2020-04-11 200 2020Q2 2020H1 850
17 2020-07-22 250 2020Q3 2020H2 250
I will look at your sample data and work on it later tonight.
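In the meantime, note that each row of the asker's resampled data already covers one 6-month bucket, so the first-six-month total per product is simply each product's earliest row. A minimal sketch (column names assumed from the sample above):

```python
import pandas as pd

# Rebuild the question's sample data (names assumed from it).
df = pd.DataFrame({
    "Product": ["Product 1"] * 4 + ["Product 2"] * 5,
    "Date": pd.to_datetime(["6/30/2017", "12/31/2017", "6/30/2018",
                            "12/31/2018", "1/31/2017", "7/31/2017",
                            "1/31/2018", "7/31/2018", "1/31/2019"]),
    "sales": [20, 60, 50, 100, 30, 150, 200, 300, 100],
})

# Each row is already a 6-month bucket, so the earliest bucket per
# product is that product's first-six-month total, regardless of the
# products' differing start dates.
first6 = df.sort_values("Date").groupby("Product", as_index=False).first()
print(first6)
```

This works even though no two products share a start date, because the minimum is taken per group.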

Related

Presto Window Function - (sum over partition by)

I have the following two Presto tables: one storing budget information by client & day, and the other storing spend information by client & day.
select day, client_id, budget_id, budget_period, budget_amount
from budget_table
day         client_id  budget_id  budget_period  budget_amount
2021-02-27  1          1-1        daily          10
2021-02-28  1          1-1        daily          10
2021-03-01  1          1-1        daily          10
2021-03-02  1          1-1        daily          10
2021-03-03  1          1-1        daily          10
2021-03-04  1          1-2        monthly        500
2021-03-05  1          1-2        monthly        500
2021-03-06  1          1-2        monthly        500
2021-02-27  2          2-1        monthly        400
2021-02-28  2          2-1        monthly        400
2021-03-01  2          2-1        monthly        400
2021-03-02  2          2-1        monthly        400
2021-03-03  2          2-2        one_time       1000
2021-03-04  2          2-2        one_time       1000
2021-03-05  2          2-2        one_time       1000
2021-03-06  2          2-2        one_time       1000
select day, client_id, spend
from spend_table
day         client_id  spend
2021-02-27  1          8
2021-02-28  1          9
2021-03-01  1          10
2021-03-02  1          7
2021-03-03  1          6
2021-03-04  1          16
2021-03-05  1          19
2021-03-06  1          18
2021-02-27  2          13
2021-02-28  2          15
2021-03-01  2          14
2021-03-02  2          15
2021-03-03  2          20
2021-03-04  2          25
2021-03-05  2          18
2021-03-06  2          27
Below is the desired output:
day         client_id  budget_id  budget_period  budget_amount  spend  spend_over_period
2021-02-27  1          1-1        daily          10             8      8
2021-02-28  1          1-1        daily          10             9      9
2021-03-01  1          1-1        daily          10             10     10
2021-03-02  1          1-1        daily          10             7      7
2021-03-03  1          1-1        daily          10             6      6
2021-03-04  1          1-2        monthly        500            16     16
2021-03-05  1          1-2        monthly        500            19     35
2021-03-06  1          1-2        monthly        500            18     53
2021-02-27  2          2-1        monthly        400            13     13
2021-02-28  2          2-1        monthly        400            15     28
2021-03-01  2          2-1        monthly        400            14     14
2021-03-02  2          2-1        monthly        400            15     29
2021-03-03  2          2-2        one_time       1000           20     20
2021-03-04  2          2-2        one_time       1000           25     45
2021-03-05  2          2-2        one_time       1000           18     63
2021-03-06  2          2-2        one_time       1000           27     90
I have tried:
select s.day,
       s.client_id,
       b.budget_id,
       b.budget_period,
       b.budget_amount,
       s.spend,
       case when b.budget_period = 'daily' then s.spend
            when b.budget_period = 'monthly' then sum(s.spend) over (partition by b.budget_id, month(date(s.day)))
            when b.budget_period = 'one_time' then sum(s.spend) over (partition by b.budget_id)
       end as spend_over_period
from spend_table as s
join budget_table as b
  on s.day = b.day
 and s.client_id = b.client_id
group by 1,2,3,4,5,6
But I get a u'EXPRESSION_NOT_AGGREGATE' error.
Does anybody know how to query to get the desired output in Presto?
You can remove the group by clause completely and instead give your window functions an ordering and a frame: order by date(s.day) with range between unbounded preceding and current row:
select s.day,
       s.client_id,
       b.budget_id,
       b.budget_period,
       b.budget_amount,
       s.spend,
       case
           when b.budget_period = 'daily' then s.spend
           when b.budget_period = 'monthly' then sum(s.spend) over (partition by b.budget_id, month(date(s.day)) order by date(s.day) range between unbounded preceding and current row)
           when b.budget_period = 'one_time' then sum(s.spend) over (partition by b.budget_id order by date(s.day) range between unbounded preceding and current row)
       end as spend_over_period
from spend_table as s
join budget_table as b
  on s.day = b.day
 and s.client_id = b.client_id
order by 2,3,1
Output:
day         client_id  budget_id  budget_period  budget_amount  spend  spend_over_period
2021-02-27  1          1-1        daily          10             8      8
2021-02-28  1          1-1        daily          10             9      9
2021-03-01  1          1-1        daily          10             10     10
2021-03-02  1          1-1        daily          10             7      7
2021-03-03  1          1-1        daily          10             6      6
2021-03-04  1          1-2        monthly        500            16     16
2021-03-05  1          1-2        monthly        500            19     35
2021-03-06  1          1-2        monthly        500            18     53
2021-02-27  2          2-1        monthly        400            13     13
2021-02-28  2          2-1        monthly        400            15     28
2021-03-01  2          2-1        monthly        400            14     14
2021-03-02  2          2-1        monthly        400            15     29
2021-03-03  2          2-2        one_time       1000           20     20
2021-03-04  2          2-2        one_time       1000           25     45
2021-03-05  2          2-2        one_time       1000           18     63
2021-03-06  2          2-2        one_time       1000           27     90
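For readers more at home in pandas, the same partition-plus-running-sum logic can be sketched there too. This is an illustrative analogue of the window frames above, not the asker's Presto; the frame below holds a few of the joined budget+spend rows:

```python
import pandas as pd

# A few joined budget+spend rows (subset of the tables above).
df = pd.DataFrame({
    "day": pd.to_datetime(["2021-03-04", "2021-03-05", "2021-03-06",
                           "2021-03-03", "2021-03-04"]),
    "budget_id": ["1-2", "1-2", "1-2", "2-2", "2-2"],
    "budget_period": ["monthly", "monthly", "monthly",
                      "one_time", "one_time"],
    "spend": [16, 19, 18, 20, 25],
})
df = df.sort_values("day")  # plays the role of "order by date(s.day)"

# monthly: running sum within (budget_id, calendar month);
# one_time: running sum within budget_id alone.
monthly = df.groupby(["budget_id", df["day"].dt.to_period("M")])["spend"].cumsum()
one_time = df.groupby("budget_id")["spend"].cumsum()
df["spend_over_period"] = monthly.where(df["budget_period"] == "monthly", one_time)
# 'daily' rows would simply keep spend as-is via another .where(...)
print(df)
```

A groupby cumsum after sorting is the pandas counterpart of `sum(...) over (partition by ... order by ... range between unbounded preceding and current row)`.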

Calculate number of days from date time column to a specific date - pandas

I have a df as shown below.
df:
ID open_date limit
1 2020-06-03 100
1 2020-06-23 500
1 2019-06-29 300
1 2018-06-29 400
From the above I would like to calculate a column named age_in_days.
age_in_days is the number of days from open_date to 2020-06-30.
Expected output
ID open_date limit age_in_days
1 2020-06-03 100 27
1 2020-06-23 500 7
1 2019-06-29 300 367
1 2018-06-29 400 732
Make sure open_date is of datetime dtype, then subtract it from 2020-06-30:
df['open_date'] = pd.to_datetime(df.open_date)
df['age_in_days'] = (pd.Timestamp('2020-06-30') - df.open_date).dt.days
Out[209]:
ID open_date limit age_in_days
0 1 2020-06-03 100 27
1 1 2020-06-23 500 7
2 1 2019-06-29 300 367
3 1 2018-06-29 400 732
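If the reference date should follow the data rather than being hard-coded, one variation (my assumption, not part of the question) is to measure against the column's own maximum:

```python
import pandas as pd

# The question's frame, with open_date already parsed as datetime.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1],
    "open_date": pd.to_datetime(["2020-06-03", "2020-06-23",
                                 "2019-06-29", "2018-06-29"]),
    "limit": [100, 500, 300, 400],
})

# Variation: age relative to the newest open_date in the data,
# instead of a hard-coded reference such as 2020-06-30.
ref = df["open_date"].max()
df["age_in_days"] = (ref - df["open_date"]).dt.days
print(df)
```

The newest row then gets age 0; swap `ref` for any `pd.Timestamp` to pin a different cutoff.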

Groupby sum in years in pandas

I have a data frame as shown below, containing sales data for two health care products from December 2016 to November 2018.
product profit sale_date discount
A 50 2016-12-01 5
A 50 2017-01-03 4
B 200 2016-12-24 10
A 50 2017-01-18 3
B 200 2017-01-28 15
A 50 2017-01-18 6
B 200 2017-01-28 20
A 50 2017-04-18 6
B 200 2017-12-08 25
A 50 2017-11-18 6
B 200 2017-08-21 20
B 200 2017-12-28 30
A 50 2018-03-18 10
B 300 2018-06-08 45
B 300 2018-09-20 50
A 50 2018-11-18 8
B 300 2018-11-28 35
From the above I would like to prepare the dataframe below and draw it as a line plot.
Expected Output
bought_year total_profit
2016 250
2017 1250
2018 1000
X axis = bought_year
Y axis = profit
Use groupby with dt.year and .agg to name your column:
df['sale_date'] = pd.to_datetime(df['sale_date'])
df1 = df.groupby(df['sale_date'].dt.year).agg(total_profit=('profit','sum'))\
        .reset_index().rename(columns={'sale_date': 'bought_year'})
print(df1)
bought_year total_profit
0 2016 250
1 2017 1250
2 2018 1000
df1.set_index('bought_year').plot(kind='bar')
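Since the question asked for a line plot, a compact variant that yields a Series ready to plot (the plot call is left commented so the snippet runs headless; only the first few rows of the question's data are used here):

```python
import pandas as pd

# Subset of the question's sales data.
df = pd.DataFrame({
    "profit": [50, 50, 200, 50, 200],
    "sale_date": pd.to_datetime(["2016-12-01", "2017-01-03", "2016-12-24",
                                 "2017-01-18", "2017-01-28"]),
})

# Year-wise profit sum as a Series indexed by year.
yearly = df.groupby(df["sale_date"].dt.year)["profit"].sum()
# yearly.plot(kind="line")  # draws bought_year on x, profit on y
print(yearly)
```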

Pandas Shift Date Time Columns Back One Hour

I have data in a DataFrame (df1) that starts and ends as shown below. I'm trying to shift the "0" and "1" columns back one hour, so that the date and time start at hour == 0 rather than hour == 1.
data starts (df1) -
0 1 2 3 4 5 6 7
0 20160101 100 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 200 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 300 9.515370 165931.0 20160101 300 8.019863 8100.0
data ends (df1) -
0 1 2 3 4 5 6 7
8780 20161231 2100 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20161231 2200 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20161231 2300 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20170101 0 3.284947 1815.0 20170101 0 0.899262 NaN
I need the date and time to start shifted back one hour so start time is hour begin not hour end -
0 1 2 3 4 5 6 7
0 20160101 000 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 100 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 200 9.515370 165931.0 20160101 300 8.019863 8100.0
and ends like this with the date and time below -
0 1 2 3 4 5 6 7
8780 20161231 2000 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20161231 2100 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20161231 2200 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20161231 2300 3.284947 1815.0 20170101 0 0.899262 NaN
I have no real idea how to accomplish this or how to research it. Thank you.
It would be better to create a proper datetime object, then subtract the hour, which handles any roll-back across days. We can then use dt.strftime to re-create your object (string) columns.
s = pd.to_datetime(
    df[0].astype(str) + df[1].astype(str).str.zfill(4), format="%Y%m%d%H%M"
)
0 2016-01-01 01:00:00
1 2016-01-01 02:00:00
2 2016-01-01 03:00:00
8780 2016-12-31 21:00:00
8781 2016-12-31 22:00:00
8782 2016-12-31 23:00:00
8783 2017-01-01 00:00:00
dtype: datetime64[ns]
df[1] = (s - pd.DateOffset(hours=1)).dt.strftime("%H%M").str.lstrip("0").str.zfill(3)
df[0] = (s - pd.DateOffset(hours=1)).dt.strftime("%Y%m%d")
print(df)
             0     1         2         3         4     5         6      7
0     20160101   000  7.977169  109404.0  20160101   100  4.028678  814.0
1     20160101   100  8.420204  128546.0  20160101   200  4.673662 2152.0
2     20160101   200  9.515370  165931.0  20160101   300  8.019863 8100.0
8780  20161231  2000  4.198906   11371.0  20161231  2100  0.995571  131.0
8781  20161231  2100  4.787433   19083.0  20161231  2200  1.029809    NaN
8782  20161231  2200  3.987506    9354.0  20161231  2300  0.900942    NaN
8783  20161231  2300  3.284947    1815.0  20170101     0  0.899262    NaN
Use DataFrame.shift to shift columns 0 and 1, use Series.bfill on column 0 of df2 to fill the missing value, use .fillna on column 1 of df2 to fill the NaN, and finally use DataFrame.join to join df2 back to the remaining columns of df1:
df2 = df1[['0', '1']].shift()
df2['0'] = df2['0'].bfill()
df2['1'] = df2['1'].fillna('000')
df2 = df2.join(df1.loc[:, '2':])
# print(df2)
0 1 2 3 4 5 6 7
0 20160101 000 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 100 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 200 9.515370 165931.0 20160101 300 8.019863 8100.0
...
8780 20160101 300 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20161231 2100 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20161231 2200 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20161231 2300 3.284947 1815.0 20170101 0 0.899262 NaN
You can do the subtraction directly in pandas (assuming the data in your dataframe is numeric rather than string type).
I will show you an example of how it can be done:
import pandas as pd

df = pd.DataFrame()
df['time'] = [0, 100, 500, 2100, 2300, 0]  # creating the dataframe
df['time'] = df['time'] - 100              # this is the subtraction you want
Now every time is reduced by 100.
One edge case: subtracting from 0 gives -100 as the time. To wrap those back to 2300 you can do:
for i in range(len(df['time'])):
    if df.loc[i, 'time'] == -100:
        df.loc[i, 'time'] = 2300
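The fix-up loop can also be collapsed into one vectorized step with a modulo, since the times wrap on a 2400 clock (a sketch under the same numeric-column assumption):

```python
import pandas as pd

df = pd.DataFrame({"time": [0, 100, 500, 2100, 2300, 0]})

# Subtract one hour and wrap 0 -> 2300 in a single vectorized step:
# (0 - 100) % 2400 == 2300 in Python's modulo arithmetic.
df["time"] = (df["time"] - 100) % 2400
print(df["time"].tolist())
```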

GroupBy aggregation based on condition and year wise sum using pandas

I have a data frame as shown below
ID Sector Plot Tenancy_Start_Date Rental
1 SE1 A 2018-08-14 100
1 SE1 A 2019-08-18 200
2 SE1 B 2017-08-12 150
3 SE1 A 2020-02-12 300
5 SE2 A 2017-08-13 400
5 SE2 A 2019-08-12 300
6 SE2 B 2019-08-11 150
5 SE2 A 2020-01-10 300
7 SE2 B 2019-08-11 500
From the above I would like to prepare below data frame as Sector and Plot aggregation level.
Expected Output:
Sector Plot Total_Rental Rental_2017 Rental_2018 Rental_2019 Rental_2020
SE1 A 600 0 100 200 300
SE1 B 150 150 0 0 0
SE2 A 1000 400 0 300 300
SE2 B 650 0 0 650 0
I'd first create a year column (converting to datetime if needed):
df['Tenancy_Start_Date'] = pd.to_datetime(df['Tenancy_Start_Date'])
df['Year'] = df['Tenancy_Start_Date'].dt.year
then do your groupby:
df['Rent_by_cats'] = df.groupby(['Sector', 'Year', 'Plot'])['Rental'].transform('sum')
then lastly move it into separate columns:
yrs = df['Year'].unique().tolist()
for y in yrs:
    df['Rental_' + str(y)] = 0
    df.loc[df['Year'] == y, 'Rental_' + str(y)] = df['Rent_by_cats']
Output:
ID Sector Plot Tenancy_Start_Date Rental Year Rent_by_cats Rental_2018 Rental_2019 Rental_2017 Rental_2020
0 1 SE1 A 2018-08-14 100 2018 100 100 0 0 0
1 1 SE1 A 2019-08-18 200 2019 200 0 200 0 0
2 2 SE1 B 2017-08-12 150 2017 150 0 0 150 0
3 3 SE1 A 2020-02-12 300 2020 300 0 0 0 300
4 5 SE2 A 2017-08-13 400 2017 400 0 0 400 0
5 5 SE2 A 2019-08-12 300 2019 300 0 300 0 0
6 6 SE2 B 2019-08-11 150 2019 650 0 650 0 0
7 5 SE2 A 2020-01-10 300 2020 300 0 0 0 300
8 7 SE2 B 2019-08-11 500 2019 650 0 650 0 0
You can do (df being your input dataframe):
# in case it's not already a datetime:
df["Tenancy_Start_Date"] = pd.to_datetime(df["Tenancy_Start_Date"])
df2 = df.pivot_table(index=["Sector", "Plot"],
                     columns=df["Tenancy_Start_Date"].dt.year,
                     values="Rental", aggfunc="sum").fillna(0)
df2.columns = [f"Rental_{col}" for col in df2.columns]
df2["Total_Rental"] = df2.sum(axis=1)
df2 = df2.reset_index(drop=False)
Outputs:
Sector Plot ... Rental_2020 Total_Rental
0 SE1 A ... 300.0 600.0
1 SE1 B ... 0.0 150.0
2 SE2 A ... 300.0 1000.0
3 SE2 B ... 0.0 650.0
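Putting the pivot answer together end-to-end on the question's data, with two small cosmetic tweaks of mine (an integer cast, and moving Total_Rental to the front to match the expected output's column order):

```python
import pandas as pd

# The question's data.
df = pd.DataFrame({
    "Sector": ["SE1", "SE1", "SE1", "SE1", "SE2", "SE2", "SE2", "SE2", "SE2"],
    "Plot":   ["A",   "A",   "B",   "A",   "A",   "A",   "B",   "A",   "B"],
    "Tenancy_Start_Date": pd.to_datetime(
        ["2018-08-14", "2019-08-18", "2017-08-12", "2020-02-12",
         "2017-08-13", "2019-08-12", "2019-08-11", "2020-01-10", "2019-08-11"]),
    "Rental": [100, 200, 150, 300, 400, 300, 150, 300, 500],
})

# Pivot to one Rental_<year> column per year, zero-filled and cast to int.
df2 = (df.pivot_table(index=["Sector", "Plot"],
                      columns=df["Tenancy_Start_Date"].dt.year,
                      values="Rental", aggfunc="sum")
         .fillna(0).astype(int))
df2.columns = [f"Rental_{c}" for c in df2.columns]
df2.insert(0, "Total_Rental", df2.sum(axis=1))  # match expected column order
df2 = df2.reset_index()
print(df2)
```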