How to sum certain values using pandas datetime operations - pandas

Headline is not clear. Let me explain.
I have a dataframe like this:
Order Quantity  Date Accepted  Date Delivered
20              01-05-2010     01-02-2011
10              01-11-2010     01-03-2011
300             01-12-2010     01-04-2011
5               01-03-2011     01-03-2012
20              01-04-2012     01-11-2013
10              01-07-2013     01-12-2014
I want to basically create another column that contains the total undelivered items for each row.
Expected output:
Order Quantity  Date Accepted  Date Delivered  Pending Order
20              01-05-2010     01-02-2011      20
10              01-11-2010     01-03-2011      30
300             01-12-2010     01-04-2011      330
5               01-03-2011     01-03-2012      305
20              01-04-2012     01-11-2013      20
10              01-07-2013     01-12-2014      30

Here I have taken part of your dataframe (plus one extra row whose accepted and delivered dates are equal) and tried to get the result.
import pandas as pd
df = pd.DataFrame({'order': [20, 10, 300, 200],
                   'Date_aceepted': ['01-05-2010', '01-11-2010', '01-12-2010', '01-12-2010'],
                   'Date_delever': ['01-02-2011', '01-03-2011', '01-04-2011', '01-12-2010']})
order Date_aceepted Date_delever
0 20 01-05-2010 01-02-2011
1 10 01-11-2010 01-03-2011
2 300 01-12-2010 01-04-2011
3 200 01-12-2010 01-12-2010
Then I convert Date_aceepted and Date_delever to datetime with pd.to_datetime:
df['date1'] = pd.to_datetime(df['Date_aceepted'])
df['date2'] = pd.to_datetime(df['Date_delever'])
Then I build a new data frame that keeps only the rows where Date_aceepted and Date_delever differ. I assume those are the only rows you need in your final result. The .copy() avoids a SettingWithCopyWarning when a column is added later.
dff = df[df['date1'] != df['date2']].copy()
You can see that the last row, in which the accepted and delivered dates are the same, is now removed in dff.
order Date_aceepted Date_delever date1 date2
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04
Then I used pandas cumsum to get the pending orders:
dff['pending'] = dff['order'].cumsum()
and it gives
order Date_aceepted Date_delever date1 date2 pending
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02 20
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03 30
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04 330
The final data frame has two extra columns (date1 and date2) that can be dropped if you don't want them in your result.
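A plain cumsum never subtracts orders that have already been delivered, so it cannot reproduce the 305 in the fourth row of the expected output. A minimal sketch of one way to get there, assuming the dates are day-first (dd-mm-YYYY) and that an order counts as pending for a row when it was accepted on or before that row's accepted date and delivered strictly after it. The variable names (still_open, qty) are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "Order Quantity": [20, 10, 300, 5, 20, 10],
    "Date Accepted": ["01-05-2010", "01-11-2010", "01-12-2010",
                      "01-03-2011", "01-04-2012", "01-07-2013"],
    "Date Delivered": ["01-02-2011", "01-03-2011", "01-04-2011",
                       "01-03-2012", "01-11-2013", "01-12-2014"],
})

# Assumption: dates are day-first (dd-mm-YYYY).
accepted = pd.to_datetime(df["Date Accepted"], format="%d-%m-%Y").to_numpy()
delivered = pd.to_datetime(df["Date Delivered"], format="%d-%m-%Y").to_numpy()
qty = df["Order Quantity"].to_numpy()

# still_open[i, j] is True when order j was accepted on or before row i's
# accepted date and had not yet been delivered on that date.
still_open = (accepted[None, :] <= accepted[:, None]) & (delivered[None, :] > accepted[:, None])

# Sum the quantities of all orders still open at each row's accepted date.
df["Pending Order"] = (still_open * qty).sum(axis=1)
print(df)
```

The pairwise comparison builds an n-by-n boolean matrix, so this is only practical for moderately sized frames.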

Related

count number of records by month over the last five years where record date > select month

I need to show the number of valid inspectors we have by month over the last five years. Inspectors are considered valid when the expiration date on their certification has not yet passed, recorded as the month end date. The below SQL code is text of the query to count valid inspectors for January 2017:
SELECT Count(*) AS RecordCount
FROM dbo_Insp_Type
WHERE (dbo_Insp_Type.CERT_EXP_DTE)>=#2/1/2017#;
Rather than designing 60 queries, one for each month, and compiling the results in a final table (or, err, query) are there other methods I can use that call for less manual input?
From this sample:
Id  CERT_EXP_DTE
1   2022-01-15
2   2022-01-23
3   2022-02-01
4   2022-02-03
5   2022-05-01
6   2022-06-06
7   2022-06-07
8   2022-07-21
9   2022-02-20
10  2021-11-05
11  2021-12-01
12  2021-12-24
this single query:
SELECT
    Format([CERT_EXP_DTE],"yyyy/mm") AS YearMonth,
    Count(*) AS AllInspectors,
    Sum(Abs([CERT_EXP_DTE] >= DateSerial(Year([CERT_EXP_DTE]), Month([CERT_EXP_DTE]), 2))) AS ValidInspectors
FROM
    dbo_Insp_Type
GROUP BY
    Format([CERT_EXP_DTE],"yyyy/mm");
will return:
YearMonth  AllInspectors  ValidInspectors
2021-11    1              1
2021-12    2              1
2022-01    2              2
2022-02    3              2
2022-05    1              0
2022-06    2              2
2022-07    1              1
ID  Cert_Iss_Dte  Cert_Exp_Dte
1   1/15/2020     1/15/2022
2   1/23/2020     1/23/2022
3   2/1/2020      2/1/2022
4   2/3/2020      2/3/2022
5   5/1/2020      5/1/2022
6   6/6/2020      6/6/2022
7   6/7/2020      6/7/2022
8   7/21/2020     7/21/2022
9   2/20/2020     2/20/2022
10  11/5/2021     11/5/2023
11  12/1/2021     12/1/2023
12  12/24/2021    12/24/2023
A UNION query could calculate a record for each of 50 months but since you want 60, UNION is out.
Or a query with 60 calculated fields using IIf() and Count() referencing a textbox on form for start date:
SELECT Count(IIf(CERT_EXP_DTE>=Forms!formname!tbxDate,1,Null)) AS Dt1,
Count(IIf(CERT_EXP_DTE>=DateAdd("m",1,Forms!formname!tbxDate),1,Null)) AS Dt2,
...
FROM dbo_Insp_Type
Using the above data, following is output for Feb and Mar 2022. I did a test with Cert_Iss_Dte included in criteria and it did not make a difference for this sample data.
Dt1  Dt2
10   8
Or a report with 60 textboxes and each calls a DCount() expression with criteria same as used in query.
Or a VBA procedure that writes data to a 'temp' table.
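For readers who can pull the table into pandas instead of staying in Access, the per-month count can be sketched without one query per month. This is only an illustration under stated assumptions: it uses the first sample's expiry dates, treats an inspector as valid for a month when the expiry date is on or after the first day of that month, and the names (exp, month_starts, valid) are hypothetical:

```python
import pandas as pd

# Expiry dates from the first sample table above.
exp = pd.to_datetime(pd.Series([
    "2022-01-15", "2022-01-23", "2022-02-01", "2022-02-03",
    "2022-05-01", "2022-06-06", "2022-06-07", "2022-07-21",
    "2022-02-20", "2021-11-05", "2021-12-01", "2021-12-24",
]))

# One cutoff per month; use periods=60 for a five-year window.
month_starts = pd.date_range("2022-01-01", periods=7, freq="MS")

# Assumption: valid means the expiry date is on or after the month's first day.
valid = pd.Series(
    [int((exp >= start).sum()) for start in month_starts],
    index=month_starts.strftime("%Y-%m"),
    name="ValidInspectors",
)
print(valid)
```

Changing the cutoff rule (for example, comparing against the month end instead) only requires changing the comparison inside the list comprehension.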

Pandas - Filtering alternate Monday

I have a Dataframe that has sales data by day. I would like to be able to filter out sales data of every alternate Monday. For example, if I select June 27 the next date I would like to filter would be July 11 and the next date would be July 25 and so on.
I have my Dataframe as below
sale_date, count
2022-06-27, 100
2022-07-01, 150
2022-07-07, 100
2022-07-11, 150
2022-06-20, 100
2022-07-25, 150
I would expect the output to be
sale_date, count
2022-06-27, 100
2022-07-11, 150
2022-07-25, 150
You can use:
# convert to datetime
date = pd.to_datetime(df['sale_date'])
# is the day a Monday (0 = Monday)?
m1 = date.dt.weekday.eq(0)
# is the week an "even" week?
m2 = date.dt.isocalendar().week.mod(2).eq(0)
# if both conditions are True, keep the row
out = df[m1&m2]
output:
sale_date count
0 2022-06-27 100
3 2022-07-11 150
5 2022-07-25 150
intermediates:
sale_date count weekday weekday.eq(0) week week.mod(2) week.mod(2).eq(0)
0 2022-06-27 100 0 True 26 0 True
1 2022-07-01 150 4 False 26 0 True
2 2022-07-07 100 3 False 27 1 False
3 2022-07-11 150 0 True 28 0 True
4 2022-06-20 100 0 True 25 1 False
5 2022-07-25 150 0 True 30 0 True
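The week-parity test above depends on ISO week numbers, which restart every year, so the "even week" pattern can shift across a year boundary. A minimal sketch of an alternative that counts days from a chosen starting Monday instead. The anchor date 2022-06-27 comes from the question; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "sale_date": ["2022-06-27", "2022-07-01", "2022-07-07",
                  "2022-07-11", "2022-06-20", "2022-07-25"],
    "count": [100, 150, 100, 150, 100, 150],
})

date = pd.to_datetime(df["sale_date"])
anchor = pd.Timestamp("2022-06-27")  # the first Monday to keep

# Whole days elapsed since the anchor; every 14th day is an alternate Monday.
days = (date - anchor).dt.days
out = df[(days >= 0) & (days % 14 == 0)]
print(out)
```

Because the anchor is itself a Monday, any date that is a multiple of 14 days after it is also a Monday, so no separate weekday check is needed.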
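Another option is to resample into two-week bins anchored on Mondays and then shift the labels back by one week (this assumes df1['sale_date'] is already datetime):
df11 = df1.resample("2w-mon", closed="left", on="sale_date")["count"].first().reset_index()
df11.assign(sale_date=df11.sale_date - pd.Timedelta(days=7))
out (note that first() returns the first row of each two-week bin, so the count shown for 2022-07-11 comes from the 2022-07-07 row rather than from the Monday itself):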
sale_date count
0 2022-06-27 100
1 2022-07-11 100
2 2022-07-25 150

Merge when date is between two dates Pandas

I'm looking for a way to merge a table on multiple conditions, one of which is that a date falls between two dates in the other table.
Below are the two data sets:
DATA SET 1
Code 1  Code 2  Date      Number
001     192     02.02.22  10
002     192     05.03.22  12
002     192     09.05.22  8
003     193     14.06.22  14
003     193     16.08.22  18
DATA SET 2
Code 1  Code 2  Date Start  Date End
005     192     15.01.22    5.02.22
002     192     01.05.22    01.06.22
003     193     10.08.22    10.09.22
003     192     01.03.22    15.03.22
007     192     10.06.22    18.06.22
I basically need to end up with Data Set 2 but with the Number column attached - merged on Code 1, Code 2, and when the date in DS1 is between the two dates in DS 2.
In this example above, the outcome would look like this:
Code 1  Code 2  Date Start  Date End  Number
002     192     01.05.22    01.06.22  8
003     193     10.08.22    10.09.22  18
Thanks
Try:
# Convert to datetime
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date Start'] = pd.to_datetime(df2['Date Start'], dayfirst=True)
df2['Date End'] = pd.to_datetime(df2['Date End'], dayfirst=True)
# Merge on Code 1 and Code 2 then keep only rows where Start Date <= Date <= End Date
out = df2.merge(df1, how='left', on=['Code 1', 'Code 2']) \
         .query('Date.between(`Date Start`, `Date End`)')
Output:
Code 1  Code 2  Date Start           Date End             Date                 Number
2       192     2022-05-01 00:00:00  2022-06-01 00:00:00  2022-05-09 00:00:00  8
3       193     2022-08-10 00:00:00  2022-09-10 00:00:00  2022-08-16 00:00:00  18
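The same merge-then-filter idea can be written out end to end with the sample data. This is a sketch under a few assumptions: the codes are kept as strings so the leading zeros survive, the dates are parsed with the day-first format "%d.%m.%y", and a boolean mask is used in place of query:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Code 1": ["001", "002", "002", "003", "003"],
    "Code 2": ["192", "192", "192", "193", "193"],
    "Date": ["02.02.22", "05.03.22", "09.05.22", "14.06.22", "16.08.22"],
    "Number": [10, 12, 8, 14, 18],
})
df2 = pd.DataFrame({
    "Code 1": ["005", "002", "003", "003", "007"],
    "Code 2": ["192", "192", "193", "192", "192"],
    "Date Start": ["15.01.22", "01.05.22", "10.08.22", "01.03.22", "10.06.22"],
    "Date End": ["5.02.22", "01.06.22", "10.09.22", "15.03.22", "18.06.22"],
})

# Assumption: all dates are day.month.two-digit-year.
fmt = "%d.%m.%y"
df1["Date"] = pd.to_datetime(df1["Date"], format=fmt)
df2["Date Start"] = pd.to_datetime(df2["Date Start"], format=fmt)
df2["Date End"] = pd.to_datetime(df2["Date End"], format=fmt)

# Pair every df2 row with the df1 rows sharing both codes, then keep
# only the pairs whose Date lies inside [Date Start, Date End].
merged = df2.merge(df1, on=["Code 1", "Code 2"], how="inner")
in_range = merged["Date"].between(merged["Date Start"], merged["Date End"])
out = merged.loc[in_range, ["Code 1", "Code 2", "Date Start", "Date End", "Number"]]
out = out.reset_index(drop=True)
print(out)
```

Merging first and filtering afterwards materialises every key match, which is fine for small tables but can grow quickly when many rows share the same codes.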

Difference between first row and current row, by group

I have a data set like this:
state,date,events_per_day
AM,2020-03-01,100
AM,2020-03-02,120
AM,2020-03-15,200
BA,2020-03-16,80
BA,2020-03-20,100
BA,2020-03-29,150
RS,2020-04-01,80
RS,2020-04-05,100
RS,2020-04-11,160
Now I need to compute the difference between the date in the first row of each group and the date in the current row.
i.e. the first row of each group:
for group "AM" the first date is 2020-03-01;
for group "BA" the first date is 2020-03-16;
for group "RS" it is 2020-04-01.
In the end, the result I want is:
state,date,events_per_day,days_after_first_event
AM,2020-03-01,100,0
AM,2020-03-02,120,1 <--- 2020-03-02 - 2020-03-01
AM,2020-03-15,200,14 <--- 2020-03-15 - 2020-03-01
BA,2020-03-16,80,0
BA,2020-03-20,100,4 <--- 2020-03-20 - 2020-03-16
BA,2020-03-29,150,13 <--- 2020-03-29 - 2020-03-16
RS,2020-04-01,80,0
RS,2020-04-05,100,4 <--- 2020-04-05 - 2020-04-01
RS,2020-04-11,160,10 <--- 2020-04-11 - 2020-04-01
I found "How to calculate time difference by group using pandas?" and it is almost what I want. However, diff() returns the difference between consecutive rows, and I need the difference between the current row and the first row of its group.
How can I do this?
Option 3: groupby.transform
df['days_since_first'] = df['date'] - df.groupby('state')['date'].transform('first')
output
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
Preprocessing:
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# extract the first dates by states:
first_dates = df.groupby('state')['date'].first() #.min() works as well
Option 1: Index alignment
# set_index before subtraction allows index alignment
df['days_since_first'] = (df.set_index('state')['date'] - first_dates).values
Option 2: map:
df['days_since_first'] = df['date'] - df['state'].map(first_dates)
Output:
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
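The options above return Timedelta values ("14 days"), while the question asks for plain integers. A small sketch of the transform variant with the .dt.days accessor added, assuming the first event of each state is its earliest date (hence "min" rather than "first"):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AM", "AM", "AM", "BA", "BA", "BA", "RS", "RS", "RS"],
    "date": ["2020-03-01", "2020-03-02", "2020-03-15",
             "2020-03-16", "2020-03-20", "2020-03-29",
             "2020-04-01", "2020-04-05", "2020-04-11"],
    "events_per_day": [100, 120, 200, 80, 100, 150, 80, 100, 160],
})
df["date"] = pd.to_datetime(df["date"])

# Earliest date per state, broadcast back to every row of that state.
first_date = df.groupby("state")["date"].transform("min")

# .dt.days turns the Timedelta difference into whole days as integers.
df["days_after_first_event"] = (df["date"] - first_date).dt.days
print(df)
```

Using "min" makes the result independent of row order; "first" gives the same answer only when each group is already sorted by date.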

Pandas adding row to categorical index

I have a scenario where I would like to group my datasets by personally defined week indexes that are then averaged and aggregate the averages into a "Total" row. I am able to achieve the first half of my scenario, but when I try to append/insert a new "Total" row that sums these rows I am receiving error messages.
I attempted to create this row via two different methods:
Method 1:
week_index_avg_unit.loc['Total'] = week_index_avg_unit.sum()
TypeError: cannot append a non-category item to a CategoricalIndex
Method 2:
week_index_avg_unit.index.insert(['Total'], week_index_avg_unit.sum())
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I have used the first approach in this scenario multiple times, but this is the first time where I'm cutting the data into multiple categories and clearly see where the CategoricalIndex type is the problem.
Here is the format of my data:
date organic ppc oa other content_partnership total \
0 2018-01-01 379 251 197 51 0 878
1 2018-01-02 880 527 405 217 0 2029
2 2018-01-03 859 589 403 323 0 2174
3 2018-01-04 835 533 409 335 0 2112
4 2018-01-05 760 449 355 272 0 1836
year_month day weekday weekday_name week_index
0 2018-01 1 0 Monday Week 1
1 2018-01 2 1 Tuesday Week 1
2 2018-01 3 2 Wednesday Week 1
3 2018-01 4 3 Thursday Week 1
4 2018-01 5 4 Friday Week 1
Here is the code:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
historicals = pd.read_csv("2018-2019_plants.csv")
# Capture dates for additional date columns
date_col = pd.to_datetime(historicals['date'])
historicals['year_month'] = date_col.dt.strftime("%Y-%m")
historicals['day'] = date_col.dt.day
historicals['weekday'] = date_col.dt.dayofweek
historicals['weekday_name'] = date_col.dt.day_name()
# create week ranges segment (7 day range)
historicals['week_index'] = pd.cut(historicals['day'],[0,7,14,21,28,32], labels=['Week 1','Week 2','Week 3','Week 4','Week 5'])
# Week Index Average (Units)
week_index_avg_unit = historicals[df_monthly_average].groupby(['week_index']).mean().astype(int)
type(week_index_avg_unit.index)
pandas.core.indexes.category.CategoricalIndex
Here is the week_index_avg_unit table:
organic ppc oa other content_partnership total day weekday
week_index
Week 1 755 361 505 405 22 2027 4 3
Week 2 787 360 473 337 19 1959 11 3
Week 3 781 382 490 352 18 2006 18 3
...
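pd.CategoricalIndex is a special case: it only accepts labels that belong to its categories, so assigning a 'Total' row fails until 'Total' is one of them. To do the trick you may need something like pd.CategoricalIndex.add_categories (or set_categories) to register the new category first.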
See pandas docs: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.CategoricalIndex.html
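A minimal sketch of that idea, using a small frame that copies the organic and ppc columns from the table above rather than the original CSV (the frame name avg is illustrative). It assumes the totals should simply be the column sums of the weekly averages:

```python
import pandas as pd

# Stand-in for week_index_avg_unit with a CategoricalIndex.
avg = pd.DataFrame(
    {"organic": [755, 787, 781], "ppc": [361, 360, 382]},
    index=pd.CategoricalIndex(["Week 1", "Week 2", "Week 3"], name="week_index"),
)

# Compute the totals before the new row exists.
totals = avg.sum()

# Register "Total" as a valid category, then add the row by label.
avg.index = avg.index.add_categories("Total")
avg.loc["Total"] = totals
print(avg)
```

If the categorical ordering is not needed afterwards, converting the index to plain strings with avg.index.astype(str) before assigning the row is another option.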