Merge when date is between two dates Pandas - pandas

I'm looking for way in which I can merge a table on multiple conditions, one of which is when a date is between two dates in the other table
Below is the two data sets
DATA SET 1
Code 1
Code 2
Date
Number
001
192
02.02.22
10
002
192
05.03.22
12
002
192
09.05.22
8
003
193
14.06.22
14
003
193
16.08.22
18
DATA SET 2
Code 1
Code 2
Date Start
Date End
005
192
15.01.22
5.02.22
002
192
01.05.22
01.06.22
003
193
10.08.22
10.09.22
003
192
01.03.22
15.03.22
007
192
10.06.22
18.06.22
I basically need to end up with Data Set 2 but with the Number column attached - merged on Code 1, Code 2, and when the date in DS1 is between the two dates in DS 2.
In this example above, the outcome would look like this:
Code 1
Code 2
Date Start
Date End
Number
002
192
01.05.22
01.06.22
8
003
193
10.08.22
10.09.22
18
Thanks

Try:
# Convert to datetime
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date Start'] = pd.to_datetime(df2['Date Start'], dayfirst=True)
df2['Date End'] = pd.to_datetime(df2['Date End'], dayfirst=True)
# Merge on Code 1 and Code 2 then keep only rows where Start Date <= Date <= End Date
out = df2.merge(df1, how='left', on=['Code 1', 'Code 2']) \
.query('Date.between(`Date Start`, `Date End`)')
Output:
Code 1
Code 2
Date Start
Date End
Date
Number
2
192
2022-05-01 00:00:00
2022-06-01 00:00:00
2022-05-09 00:00:00
8
3
193
2022-08-10 00:00:00
2022-09-10 00:00:00
2022-08-16 00:00:00
18

Related

Pandas: to get mean for each data category daily [duplicate]

I am a somewhat beginner programmer and learning python (+pandas) and hope I can explain this well enough. I have a large time series pd dataframe of over 3 million rows and initially 12 columns spanning a number of years. This covers people taking a ticket from different locations denoted by Id numbers(350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I cant seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I looking for the mean per hour per day per Id as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be grateful and if possible an explanation of what I am doing wrong either code wise or my approach.
Thanks in advance.
I have edited this to try make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that i start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows, contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
You can use the groupby function using the 'Id' column and then use the resample function with how='sum'.

Match group of variables and values with the nearest datetime

I have a transaction table that looks like that:
transaction_start store_no item_no amount post_voided
2021-03-01 10:00:00 001 101 45 N
2021-03-01 10:00:00 001 105 25 N
2021-03-01 10:00:00 001 109 40 N
2021-03-01 10:05:00 002 103 35 N
2021-03-01 10:05:00 002 135 20 N
2021-03-01 10:08:00 001 140 2 N
2021-03-01 10:11:00 001 101 -45 Y
2021-03-01 10:11:00 001 105 -25 Y
2021-03-01 10:11:00 001 109 -40 Y
The table does not have an id column; the transaction_start for a given store_no will never be the same.
Whenever a transaction is post voided, the transaction is then repeated with the same store_no, item_no but with a negative/minus amount and an equal or higher transaction_start. Also, the column post_voided is then equal to 'Y'.
In the example above, the rows 1-3 have the same transaction_start and store_no, thus belonging to the same receipt, containing three different items (101, 105, 109). The same logic is applied to the other rows: rows 4-5 belong to a same receipt, and so on. In the example, 4 different receipts can be seen. The last receipt, given by the last three rows, is a post voided of the first receipt (rows 1-3).
What I want to do is to change the transaction_start for the post_voided = 'Y' transactions (in my example, only one receipt - represented by the last three rows - has it) to the next/closest datetime of a similar receipt that has the variables store_no, item_no and (negative) amount (but post_voided = 'N') (in my example, the similar ticket is given by the first three rows - store_no, all item_no and (positive) amount match). The transaction_start for the post voided receipt is always equal or higher than the "original" receipt.
Desired output:
transaction_start store_no item_no amount post_voided
2021-03-01 10:00:00 001 101 45 N
2021-03-01 10:00:00 001 105 25 N
2021-03-01 10:00:00 001 109 40 N
2021-03-01 10:05:00 002 103 35 N
2021-03-01 10:05:00 002 135 20 N
2021-03-01 10:08:00 001 140 2 N
2021-03-01 10:00:00 001 101 -45 Y
2021-03-01 10:00:00 001 105 -25 Y
2021-03-01 10:00:00 001 109 -40 Y
Here a link of the table: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=26142fa24e46acb4213b96c86f4eb94b
Thanks in advance!
Consider below
select a.* replace(ifnull(b.transaction_start, a.transaction_start) as transaction_start)
from `project.dataset.table` a
left join (
select * replace(-amount as amount)
from `project.dataset.table`
where post_voided = 'N'
) b
using (store_no, item_no)
if applied to sample data in your question - output is
Consider below for new / extended example (https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=91f9f180fd672e7c357aa48d18ced5fd)
select x.* replace(ifnull(y.original_transaction_start, x.transaction_start) as transaction_start)
from `project.dataset.table` x
left join (
select b.transaction_start, b.store_no, b.item_no, b.amount amount,
max(a.transaction_start) original_transaction_start
from `project.dataset.table` a
join `project.dataset.table` b
on a.store_no = b.store_no
and a.item_no = b.item_no
and a.amount = -b.amount
and a.post_voided = 'N'
and b.post_voided = 'Y'
and a.transaction_start < b.transaction_start
group by b.transaction_start, b.store_no, b.item_no, b.amount
) y
using (store_no, item_no, amount, transaction_start)
with output

Difference between first row and current row, by group

I have a data set like this:
state,date,events_per_day
AM,2020-03-01,100
AM,2020-03-02,120
AM,2020-03-15,200
BA,2020-03-16,80
BA,2020-03-20,100
BA,2020-03-29,150
RS,2020-04-01,80
RS,2020-04-05,100
RS,2020-04-11,160
Now I need to compute the difference between the date in the first row of each group and the date in the current row.
i.e. the first row of each group:
for group "AM" the first date is 2020-03-01;
for group "BA" the first date is 2020-03-16;
for group "RS" it is 2020-04-01.
In the end, the result I want is:
state,date,events_per_day,days_after_first_event
AM,2020-03-01,100,0
AM,2020-03-02,120,1 <--- 2020-03-02 - 2020-03-01
AM,2020-03-15,200,14 <--- 2020-03-14 - 2020-03-01
BA,2020-03-16,80,0
BA,2020-03-20,100,4 <--- 2020-03-20 - 2020-03-16
BA,2020-03-29,150,13 <--- 2020-03-29 - 2020-03-16
RS,2020-04-01,80,0
RS,2020-04-05,100,4 <--- 2020-04-05 - 2020-04-01
RS,2020-04-11,160,10 <--- 2020-04-11 - 2020-04-01
I found How to calculate time difference by group using pandas? and it is almost to what I want. However, diff() returns the difference between consecutive lines, and I need the difference between the current line and the first line.
How can I do this?
Option 3: groupby.transform
df['days_since_first'] = df['date'] - df.groupby('state')['date'].transform('first')
output
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
Prepossessing:
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# extract the first dates by states:
first_dates = df.groupby('state')['date'].first() #.min() works as well
Option 1: Index alignment
# set_index before substraction allows index alignment
df['days_since_first'] = (df.set_index('state')['date'] - first_dates).values
Option 2: map:
df['days_since_first'] = df['date'] - df['state'].map(first_dates)
Output:
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days

How to sum of certain values using pandas datetime operations

Headline is not clear. Let me explain.
I have a dataframe like this:
Order Quantity Date Accepted Date Delivered
20 01-05-2010 01-02-2011
10 01-11-2010 01-03-2011
300 01-12-2010 01-04-2011
5 01-03-2011 01-03-2012
20 01-04-2012 01-11-2013
10 01-07-2013 01-12-2014
I want to basically create another column that contains the total undelivered items for each row.
Expected output:
Order Quantity Date Accepted Date Delivered Pending Order
20 01-05-2010 01-02-2011 20
10 01-11-2010 01-03-2011 30
300 01-12-2010 01-04-2011 330
5 01-03-2011 01-03-2012 305
20 01-04-2012 01-11-2013 20
10 01-07-2013 01-12-2014 30
Here, I have taken a part of your dataframe and try to get the result.
df = pd.DataFrame({'order': [20, 10, 300, 200],
'Date_aceepted': ['01-05-2010', '01-11-2010', '01-12-2010', '01-12-2010'],
'Date_delever': ['01-02-2011', '01-03-2011', '01-04-2011', '01-12-2010']})
order Date_aceepted Date_delever
0 20 01-05-2010 01-02-2011
1 10 01-11-2010 01-03-2011
2 300 01-12-2010 01-04-2011
3 200 01-12-2010 01-12-2010
Then I will change the Date_accepted and Date_deliver to date time by using pandas data time module
df['date1'] = pd.to_datetime(df['Date_aceepted'])
df['date2'] = pd.to_datetime(df['Date_delever'])
Then I will make a new data frame in which the Date_accepted and Date_delever are not the same. I assume you just need that in your final result.
dff = df[df['date1'] != df['date2']]
You can see the last row in which both accepted and delever are same is now removed in dff.
order Date_aceepted Date_delever date1 date2
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04
Then I did use pandas cumsum of pending order
dff['pending'] = dff['order'].cumsum()
and it gives
order Date_aceepted Date_delever date1 date2 pending
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02 20
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03 30
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04 330
The final data frame has two extra columns that can be dropped if you don't want in your result.

Pandas - Find difference based on two subsequent rows of Dataframe

I have a Dataframe that captures date when ticket was raised by a customer that is captured in column labelled date. If the ref_column for the current cell is same as the following cell then I need to find difference of aging based on date column current cell and the following cell for the same cust_id. if the ref_column is to the same then I need to find difference of date and ref_date of the same row.
Given below is how my data is:
cust_id,date,ref_column,ref_date
101,15/01/19,abc,31/01/19
101,17/01/19,abc,31/01/19
101,19/01/19,xyz,31/01/19
102,15/01/19,abc,31/01/19
102,21/01/19,klm,31/01/19
102,25/01/19,xyz,31/01/19
103,15/01/19,xyz,31/01/19
Expected output:
cust_id,date,ref_column,ref_date,aging(in days)
101,15/01/19,abc,31/01/19,2
101,17/01/19,abc,31/01/19,14
101,19/01/19,xyz,31/01/19,0
102,15/01/19,abc,31/01/19,16
102,21/01/19,klm,31/01/19,10
102,25/01/19,xyz,31/01/19,0
103,15/01/19,xyz,31/01/19,0
Aging(in days) is 0 for the last entry for a given cust_id
Here's my approach:
# convert dates to datetime type
# ignore if already are
df['date'] = pd.to_datetime(df['date'])
df['ref_date'] = pd.to_datetime(df['ref_date'])
# customer group
groups = df.groupby('cust_id')
# where ref_column is the same with the next:
same_ = df['ref_column'].eq(groups['ref_column'].shift(-1))
# update these ones
df['aging'] = np.where(same_,
-groups['date'].diff(-1).dt.days, # same ref as next row
df['ref_date'].sub(df['date']).dt.days) # diff ref than next row
# update last elements in groups:
last_idx = groups['date'].idxmax()
df.loc[last_idx, 'aging'] = 0
Output:
cust_id date ref_column ref_date aging
0 101 2019-01-15 abc 2019-01-31 2.0
1 101 2019-01-17 abc 2019-01-31 14.0
2 101 2019-01-19 xyz 2019-01-31 0.0
3 102 2019-01-15 abc 2019-01-31 16.0
4 102 2019-01-21 klm 2019-01-31 10.0
5 102 2019-01-25 xyz 2019-01-31 0.0
6 103 2019-01-15 xyz 2019-01-31 0.0