Pandas - Filtering alternate Mondays

I have a DataFrame that has sales data by day. I would like to filter the sales data down to every alternate Monday. For example, if I select June 27, the next date I would like to keep would be July 11, then July 25, and so on.
My DataFrame is as below:
sale_date, count
2022-06-27, 100
2022-07-01, 150
2022-07-07, 100
2022-07-11, 150
2022-06-20, 100
2022-07-25, 150
I would expect the output to be
sale_date, count
2022-06-27, 100
2022-07-11, 150
2022-07-25, 150

You can use:
# convert to datetime
date = pd.to_datetime(df['sale_date'])
# is the day a Monday (0 = Monday)?
m1 = date.dt.weekday.eq(0)
# is the week an "even" week?
m2 = date.dt.isocalendar().week.mod(2).eq(0)
# if both conditions are True, keep the row
out = df[m1 & m2]
output:
sale_date count
0 2022-06-27 100
3 2022-07-11 150
5 2022-07-25 150
intermediates:
sale_date count weekday weekday.eq(0) week week.mod(2) week.mod(2).eq(0)
0 2022-06-27 100 0 True 26 0 True
1 2022-07-01 150 4 False 26 0 True
2 2022-07-07 100 3 False 27 1 False
3 2022-07-11 150 0 True 28 0 True
4 2022-06-20 100 0 True 25 1 False
5 2022-07-25 150 0 True 30 0 True
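A caveat on the parity test above (my own note): an ISO year can have 53 weeks, in which case the even/odd week alternation breaks at the year boundary. A minimal sketch of an alternative that counts whole 14-day steps from a chosen anchor Monday (the anchor 2022-06-27 is taken from the question; the rest is an assumption):
date = pd.to_datetime(df['sale_date'])
anchor = pd.Timestamp('2022-06-27')  # the starting Monday you want to keep
# keep rows that fall an exact multiple of 14 days on or after the anchor
out = df[(date - anchor).dt.days.mod(14).eq(0) & (date >= anchor)]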

# sale_date must already be datetime64; bin the rows into two-week windows anchored on Mondays and take the first count in each window
df11 = df1.resample("2W-MON", closed="left", on="sale_date")["count"].first().reset_index()
# the labels are the right edge of each window, so shift them back one week
df11.assign(sale_date=df11.sale_date - pd.Timedelta(days=7))
out:
sale_date count
0 2022-06-27 100
1 2022-07-11 100
2 2022-07-25 150

Related

Pandas: to get mean for each data category daily [duplicate]

I am a beginner programmer learning Python (plus pandas) and hope I can explain this well enough. I have a large time-series pandas DataFrame of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting the average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri), per station.
I have the following, with the datetime set as the index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue clearly. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to these groups.
Any help would be greatly appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour, to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
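As a quick sanity check, here is a self-contained run of the same idea on the toy dataset from the edited question (my own verification sketch, using the column names as posted):
import pandas as pd

toy = pd.DataFrame({
    'Date': ['12/12/2014']*5 + ['19/12/2014']*3 + ['26/12/2014'] + ['27/12/2014']*4 + ['04/01/2015'],
    'Id':   [1234]*14,
    'Dow':  [0]*9 + [1]*5,
    'Hour': [9]*8 + [10] + [11]*5,
    'Count': [1]*14,
})
# sum of counts per Id/date/dow/hour, then mean over the dates
daily = toy.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()
mean_count = daily.groupby(['Id', 'Dow', 'Hour'])['Count'].mean()
print(mean_count)
# Id    Dow  Hour
# 1234  0    9       4.0
#            10      1.0
#       1    11      2.5
This matches the expected means in the question.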
You can also group by the 'Id' column and then use the resample function; in current pandas the aggregation is written as .sum() on the resampler rather than the old how='sum' keyword.
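A rough sketch of that resample route (my reading of the one-line suggestion above, not tested against the real data). Note that resampling inserts hours with no tickets as 0, which will pull the means down compared with the date-based groupby:
hourly = (df.set_index('Start_date')
            .groupby('Id')
            .resample('H')['Count'].sum()
            .reset_index())
hourly['dow'] = hourly['Start_date'].dt.dayofweek
hourly['hour'] = hourly['Start_date'].dt.hour
mean_per_hour = hourly.groupby(['Id', 'dow', 'hour'])['Count'].mean()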

sql split yearly record into 12 monthly records

I am trying to use a common table expression to split a yearly record into 12 monthly records. I have to do this for the next 20 years of records, which means turning 20 rows into 600 rows (20*12 = 600 records).
What is the best way to do this? Can anyone help with an efficient way to do it?
I am using a single table as shown below. Year 0 means the current year, so it should split into only the remaining months of this year; year = 1 and onward means future years, each of which should split into 12 monthly records.
id year value
1 0 3155174.87
1 1 30423037.3
1 2 35339631.25
expected result should look like this:
Id Year Month Value Calendar year
1 0 5 150 2022
1 0 6 150 2022
1 0 7 150 2022
1 0 8 150 2022
1 0 9 150 2022
1 0 10 150 2022
1 0 11 150 2022
1 0 12 150 2022
1 0 1 150 2023
1 0 2 150 2023
1 0 3 150 2023
1 0 4 150 2023
1 1 5 100 2023
1 1 6 100 2023
1 1 7 100 2023
1 1 8 100 2023
1 1 9 100 2023
1 1 10 100 2023
1 1 11 100 2023
1 1 12 100 2023
1 1 1 100 2024
1 1 2 100 2024
1 1 3 100 2024
1 1 4 100 2024
You can simply join onto a list of months, and then use a bit of arithmetic to split the Value:
SELECT
t.Id,
t.Year,
v.Month,
Value = t.Value / CASE WHEN t.Year = 0 THEN 13 - MONTH(GETDATE()) ELSE 12 END
FROM YourTable t
JOIN (VALUES
(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12)
) v(Month) ON t.year > 0 OR v.Month >= MONTH(GETDATE());

running total starting from a date column

I'm trying to get a running total as of a date. This is the data I have
Date    transaction Amount    End of Week Balance
jan 1   5   100
jan 2   3   100
jan 3   4   100
jan 4   3   100
jan 5   1   100
jan 6   3   100
I would like to find out what the daily end balance is. My thought is to get a running total from each day to the end of the week and subtract it from the end of week balance, like below
Date    transaction Amount    Running total    End of Week Balance    Balance - Running total
jan 1   5   19   100   86
jan 2   3   14   100   89
jan 3   4   11   100   93
jan 4   3   7    100   96
jan 5   1   4    100   97
jan 6   3   3    100   100
I can use
SUM(transactionAmount) OVER (Order by Date)
to get a running total, is there a way to specify that I only want the total of transactions that have taken place after the date?
You can use sum() as a window function, but accumulate in reverse:
select t.*,
(end_of_week_balance -
sum(transactionAmount) over (order by date desc)
)
from t;
If you have this example:
1> select i, sum(i) over (order by i) S from integers where i<10;
2> go
i S
----------- -----------
1 1
2 3
3 6
4 10
5 15
6 21
7 28
8 36
9 45
you can also do:
1> select i, sum(case when i>3 then i else 0 end) over (order by i) S from integers where i<10;
2> go
i S
----------- -----------
1 0
2 0
3 0
4 4
5 9
6 15
7 22
8 30
9 39

Year wise aggregation on the given condition in pandas

I have a data frame as shown below, which is sales data for two health care products from December 2016 to November 2018.
product price sale_date discount
A 50 2016-12-01 5
A 50 2017-01-03 4
B 200 2016-12-24 10
A 50 2017-01-18 3
B 200 2017-01-28 15
A 50 2017-01-18 6
B 200 2017-01-28 20
A 50 2017-04-18 6
B 200 2017-12-08 25
A 50 2017-11-18 6
B 200 2017-08-21 20
B 200 2017-12-28 30
A 50 2018-03-18 10
B 300 2018-06-08 45
B 300 2018-09-20 50
A 50 2018-11-18 8
B 300 2018-11-28 35
From the above I would like to prepare the data frame below.
Expected Output:
product year number_of_months total_price total_discount number_of_sales
A 2016 1 50 5 1
B 2016 1 200 10 1
A 2017 12 250 25 5
B 2017 12 1000 110 5
A 2018 11 100 18 2
B 2018 11 900 130 3
Note: the data runs from Dec 2016 to Nov 2018, so the number of months in 2016 is 1, 2017 has full data so 12 months, and 2018 has 11 months.
First aggregate the sum by year and product, and then create a new column of month counts with DataFrame.insert and Series.map:
df1 =(df.groupby(['product',df['sale_date'].dt.year], sort=False).sum().add_prefix('total_')
.reset_index())
df1.insert(2,'number_of_months', df1['sale_date'].map({2016:1, 2017:12, 2018:11}))
print (df1)
product sale_date number_of_months total_price total_discount
0 A 2016 1 50 5
1 A 2017 12 250 25
2 B 2016 1 200 10
3 B 2017 12 1000 110
4 A 2018 11 100 18
5 B 2018 11 900 130
If you want a dynamic dictionary built from the minimal and maximal datetimes, use:
s = pd.date_range(df['sale_date'].min(), df['sale_date'].max(), freq='MS')
d = s.year.value_counts().to_dict()
print (d)
{2017: 12, 2018: 11, 2016: 1}
df1 = (df.groupby(['product',df['sale_date'].dt.year], sort=False).sum().add_prefix('total_')
.reset_index())
df1.insert(2,'number_of_months', df1['sale_date'].map(d))
print (df1)
product sale_date number_of_months total_price total_discount
0 A 2016 1 50 5
1 A 2017 12 250 25
2 B 2016 1 200 10
3 B 2017 12 1000 110
4 A 2018 11 100 18
5 B 2018 11 900 130
For plotting, use DataFrame.set_index with DataFrame.unstack:
df2 = (df1.set_index(['sale_date','product'])[['total_price','total_discount']]
.unstack(fill_value=0))
df2.columns = df2.columns.map('_'.join)
print (df2)
total_price_A total_price_B total_discount_A total_discount_B
sale_date
2016 50 200 5 10
2017 250 1000 25 110
2018 100 900 18 130
df2.plot()
EDIT:
df1 = (df.groupby(['product',df['sale_date'].dt.year], sort=False)
.agg( total_price=('price','sum'),
total_discount=('discount','sum'),
number_of_sales=('discount','size'))
.reset_index())
df1.insert(2,'number_of_months', df1['sale_date'].map({2016:1, 2017:12, 2018:11}))
print (df1)
product sale_date number_of_months total_price total_discount \
0 A 2016 NaN 50 5
1 A 2017 NaN 250 25
2 B 2016 NaN 200 10
3 B 2017 NaN 1000 110
4 A 2018 NaN 100 18
5 B 2018 NaN 900 130
number_of_sales
0 1
1 5
2 1
3 5
4 2
5 3
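Putting the pieces together (my own assembly of the snippets above, not a separate answer): the named aggregation plus the dynamic month dictionary, with the year grouping key renamed so the columns match the expected output. It assumes sale_date still needs converting to datetime:
df['sale_date'] = pd.to_datetime(df['sale_date'])
months_per_year = (pd.date_range(df['sale_date'].min(), df['sale_date'].max(), freq='MS')
                     .year.value_counts().to_dict())
df1 = (df.groupby(['product', df['sale_date'].dt.year.rename('year')], sort=False)
         .agg(total_price=('price', 'sum'),
              total_discount=('discount', 'sum'),
              number_of_sales=('discount', 'size'))
         .reset_index())
df1.insert(2, 'number_of_months', df1['year'].map(months_per_year))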

Difference between first row and current row, by group

I have a data set like this:
state,date,events_per_day
AM,2020-03-01,100
AM,2020-03-02,120
AM,2020-03-15,200
BA,2020-03-16,80
BA,2020-03-20,100
BA,2020-03-29,150
RS,2020-04-01,80
RS,2020-04-05,100
RS,2020-04-11,160
Now I need to compute the difference between the date in the first row of each group and the date in the current row.
i.e. the first row of each group:
for group "AM" the first date is 2020-03-01;
for group "BA" the first date is 2020-03-16;
for group "RS" it is 2020-04-01.
In the end, the result I want is:
state,date,events_per_day,days_after_first_event
AM,2020-03-01,100,0
AM,2020-03-02,120,1 <--- 2020-03-02 - 2020-03-01
AM,2020-03-15,200,14 <--- 2020-03-15 - 2020-03-01
BA,2020-03-16,80,0
BA,2020-03-20,100,4 <--- 2020-03-20 - 2020-03-16
BA,2020-03-29,150,13 <--- 2020-03-29 - 2020-03-16
RS,2020-04-01,80,0
RS,2020-04-05,100,4 <--- 2020-04-05 - 2020-04-01
RS,2020-04-11,160,10 <--- 2020-04-11 - 2020-04-01
I found How to calculate time difference by group using pandas? and it is almost what I want. However, diff() returns the difference between consecutive rows, and I need the difference between the current row and the first row.
How can I do this?
Option 3: groupby.transform
df['days_since_first'] = df['date'] - df.groupby('state')['date'].transform('first')
output
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
Preprocessing:
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# extract the first dates by states:
first_dates = df.groupby('state')['date'].first() #.min() works as well
Option 1: Index alignment
# set_index before substraction allows index alignment
df['days_since_first'] = (df.set_index('state')['date'] - first_dates).values
Option 2: map:
df['days_since_first'] = df['date'] - df['state'].map(first_dates)
Output:
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
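If you want plain integers like the days_after_first_event column in the question rather than Timedelta values, convert with .dt.days (a small follow-up sketch, same column names assumed):
df['days_after_first_event'] = (df['date'] - df.groupby('state')['date'].transform('first')).dt.days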