Merge tables in R and update rows where dates overlap - sql

I hope this makes sense - it's my first post here so I'm sorry if the question is badly formed.
I have tables OldData and NewData:
OldData
ID DateFrom DateTo Priority
1 2018-11-01 2018-12-01* 5
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-04-05* 5
NewData
ID DateFrom DateTo Priority
1 2018-11-13 2018-12-01 6
2 2018-03-21 2018-05-01 6
I need to merge these tables as below. Where IDs match, dates overlap, and Priority is higher in NewData, I need to update the dates in OldData to reflect NewData.
ID DateFrom DateTo Priority
1 2018-11-01 2018-11-13 5
1 2018-11-13 2018-12-01 6
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-03-21 5
2 2018-03-21 2018-05-01 6
I first tried to run nested for loops through each table, matching criteria and making changes one at a time, but I'm sure there is a much better way, e.g. possibly using SQL in R?

In general, I interpret this to be an rbind operation with some cleanup: per-ID, if there is any overlap in the date ranges, then the lower-priority date range is truncated to match. Though not shown in the data, if you have situations where two higher-priority rows may completely negate a middle row, then you might need to add to the logic (it might then turn into an iterative process).
tidyverse
library(dplyr)
out_tidyverse <- bind_rows(OldData, NewData) %>%
  arrange(ID, DateFrom) %>%
  group_by(ID) %>%
  mutate(
    DateTo = if_else(row_number() < n() &
                       DateTo > lead(DateFrom) & Priority < lead(Priority),
                     lead(DateFrom), DateTo),
    DateFrom = if_else(row_number() > 1 &
                         DateFrom < lag(DateTo) & Priority < lag(Priority),
                       lag(DateTo), DateFrom)
  ) %>%
  ungroup()
out_tidyverse
# # A tibble: 6 x 4
# ID DateFrom DateTo Priority
# <int> <chr> <chr> <int>
# 1 1 2018-11-01 2018-11-13 5
# 2 1 2018-11-13 2018-12-01 6
# 3 1 2018-12-01 2019-02-01 5
# 4 2 2017-06-01 2018-03-01 5
# 5 2 2018-03-01 2018-03-21 5
# 6 2 2018-03-21 2018-05-01 6
### confirm it is the same as your expected output
all(mapply(`==`, FinData, out_tidyverse))
# [1] TRUE
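Side note: read.table left DateFrom/DateTo as character (the tibble shows <chr>), so the comparisons above are string comparisons. That happens to be correct for ISO yyyy-mm-dd strings, but if your real data uses another date format, convert to Date first. A minimal sketch, assuming dplyr >= 1.0 for across():
library(dplyr)
OldData <- mutate(OldData, across(c(DateFrom, DateTo), as.Date))
NewData <- mutate(NewData, across(c(DateFrom, DateTo), as.Date))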
data.table
I am using magrittr here in order to break the flow out in a readable fashion, but it is not required. If you're comfortable with data.table by itself, then translating from the magrittr::%>% piping to native data.table chaining should be straightforward.
Also, I am using as.data.table instead of the often-preferred, side-effect setDT, primarily so that you don't run it on your production frame and then not realize that many data.frame operations in R (on those two frames) now behave somewhat differently. If you're up for using data.table, then feel free to step around this precaution.
library(data.table)
library(magrittr)
OldDT <- as.data.table(OldData)
NewDT <- as.data.table(NewData)
out_DT <- rbind(OldDT, NewDT) %>%
  .[ order(ID, DateFrom), ] %>%
  .[, .i := seq_len(.N), by = .(ID) ] %>%
  .[, DateTo := fifelse(.i < .N &
                          DateTo > shift(DateFrom, type = "lead") &
                          Priority < shift(Priority, type = "lead"),
                        shift(DateFrom, type = "lead"), DateTo),
    by = .(ID) ] %>%
  .[, DateFrom := fifelse(.i > 1 &
                            DateFrom < shift(DateTo) &
                            Priority < shift(Priority),
                          shift(DateTo), DateFrom),
    by = .(ID) ] %>%
  .[, .i := NULL ]
out_DT[]
# ID DateFrom DateTo Priority
# 1: 1 2018-11-01 2018-11-13 5
# 2: 1 2018-11-13 2018-12-01 6
# 3: 1 2018-12-01 2019-02-01 5
# 4: 2 2017-06-01 2018-03-01 5
# 5: 2 2018-03-01 2018-03-21 5
# 6: 2 2018-03-21 2018-05-01 6
all(mapply(`==`, FinData, out_DT))
# [1] TRUE
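For reference, the same pipeline without magrittr, written with data.table's native bracket chaining (a direct transcription of the code above, not a different method):
out_DT <- rbind(OldDT, NewDT)[order(ID, DateFrom)
  ][, .i := seq_len(.N), by = .(ID)
  ][, DateTo := fifelse(.i < .N &
                          DateTo > shift(DateFrom, type = "lead") &
                          Priority < shift(Priority, type = "lead"),
                        shift(DateFrom, type = "lead"), DateTo),
    by = .(ID)
  ][, DateFrom := fifelse(.i > 1 &
                            DateFrom < shift(DateTo) &
                            Priority < shift(Priority),
                          shift(DateTo), DateFrom),
    by = .(ID)
  ][, .i := NULL]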
Data:
OldData <- read.table(header = TRUE, text="
ID DateFrom DateTo Priority
1 2018-11-01 2018-12-01 5
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-04-05 5")
NewData <- read.table(header = TRUE, text="
ID DateFrom DateTo Priority
1 2018-11-13 2018-12-01 6
2 2018-03-21 2018-05-01 6")
FinData <- read.table(header = TRUE, text="
ID DateFrom DateTo Priority
1 2018-11-01 2018-11-13 5
1 2018-11-13 2018-12-01 6
1 2018-12-01 2019-02-01 5
2 2017-06-01 2018-03-01 5
2 2018-03-01 2018-03-21 5
2 2018-03-21 2018-05-01 6")

To my understanding this should be helpful:
library(data.table)
df_new <- setDT(df_new)
df_old <- setDT(df_old)
df_all <- rbind(df_old, df_new)
df_all[, .SD[.N], by = .(ID, DateFrom, DateTo)]
You simply rbind both dataframes, then group the resulting df by ID, DateFrom and DateTo. Within each group you extract the last row (i.e. the latest). This results in a dataframe that is basically equal to df_old, except that in some cases the values will be 'updated' with the values from df_new. Should df_new have new groups (i.e. combinations of ID, DateFrom and DateTo), then those rows are included as well.
Edit: (after your comment)
df_all[, .SD[.N], by = .(ID, DateFrom)]

Related

Get a random sample from dataframe with grouped columns?

I have a dataframe of time series data, called dates_c that looks like this:
DATE_T Da HN NAR TJH
0 2014-01-01 00:11:25 2014-01-01 3520 11931 769.198
1 2014-01-01 00:11:25 2014-01-01 3560 11942 338.143
2 2014-01-01 00:11:25 2014-01-01 3542 11937 665.481
3 2014-01-01 00:11:25 2014-01-01 3563 11944 529.058
4 2014-01-01 00:11:25 2014-01-01 3535 11936 2883.945
I want to get 60 random rows per Da + NAR. This is what I did:
np.random.seed(987)
columns = ['DATE_T', 'HN', 'TJH']
new = dates_c.groupby(['Da', 'NAR'])[columns].apply(pd.Series.sample, n=60, replace=False).reset_index()
I keep getting this error:
ValueError: Key 2014-01-01 00:00:00 not in level Index([2014-01-01, 2014-01-02, 2014-01-03, 2014-01-04, 2014-01-05, 2014-01-06,
2014-01-07, 2014-01-08, 2014-01-09, 2014-01-10,
...
2014-12-22, 2014-12-23, 2014-12-24, 2014-12-25, 2014-12-26, 2014-12-27,
2014-12-28, 2014-12-29, 2014-12-30, 2014-12-31],
dtype='object', name='Date', length=320)
Here you need replace=True since some groups may not have enough data points for n=60:
out = dates_c.groupby(['Da', 'NAR']).apply(lambda x: x.sample(n=60, replace=True))
Try:
dates_c.groupby(['Da', 'NAR'])[columns].sample(60).reset_index()
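Note that GroupBy.sample is only available from pandas 1.1 onwards; like DataFrame.sample it accepts replace=True, which you will need if some groups have fewer than 60 rows:
dates_c.groupby(['Da', 'NAR'])[columns].sample(n=60, replace=True).reset_index()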

Difference between first row and current row, by group

I have a data set like this:
state,date,events_per_day
AM,2020-03-01,100
AM,2020-03-02,120
AM,2020-03-15,200
BA,2020-03-16,80
BA,2020-03-20,100
BA,2020-03-29,150
RS,2020-04-01,80
RS,2020-04-05,100
RS,2020-04-11,160
Now I need to compute the difference between the date in the first row of each group and the date in the current row.
i.e. the first row of each group:
for group "AM" the first date is 2020-03-01;
for group "BA" the first date is 2020-03-16;
for group "RS" it is 2020-04-01.
In the end, the result I want is:
state,date,events_per_day,days_after_first_event
AM,2020-03-01,100,0
AM,2020-03-02,120,1 <--- 2020-03-02 - 2020-03-01
AM,2020-03-15,200,14 <--- 2020-03-15 - 2020-03-01
BA,2020-03-16,80,0
BA,2020-03-20,100,4 <--- 2020-03-20 - 2020-03-16
BA,2020-03-29,150,13 <--- 2020-03-29 - 2020-03-16
RS,2020-04-01,80,0
RS,2020-04-05,100,4 <--- 2020-04-05 - 2020-04-01
RS,2020-04-11,160,10 <--- 2020-04-11 - 2020-04-01
I found How to calculate time difference by group using pandas? and it is almost what I want. However, diff() returns the difference between consecutive rows, and I need the difference between the current row and the first row.
How can I do this?
Option 3: groupby.transform
df['days_since_first'] = df['date'] - df.groupby('state')['date'].transform('first')
output
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
Preprocessing:
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# extract the first dates by states:
first_dates = df.groupby('state')['date'].first() #.min() works as well
Option 1: Index alignment
# set_index before subtraction allows index alignment
df['days_since_first'] = (df.set_index('state')['date'] - first_dates).values
Option 2: map:
df['days_since_first'] = df['date'] - df['state'].map(first_dates)
Output:
state date events_per_day days_since_first
0 AM 2020-03-01 100 0 days
1 AM 2020-03-02 120 1 days
2 AM 2020-03-15 200 14 days
3 BA 2020-03-16 80 0 days
4 BA 2020-03-20 100 4 days
5 BA 2020-03-29 150 13 days
6 RS 2020-04-01 80 0 days
7 RS 2020-04-05 100 4 days
8 RS 2020-04-11 160 10 days
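All of these options produce a Timedelta column (hence the "0 days" formatting). If you want plain integers like days_after_first_event in the expected output, append .dt.days, e.g.:
df['days_after_first_event'] = (df['date'] - df.groupby('state')['date'].transform('first')).dt.days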

Pandas - Mapping two Dataframe based on date ranges

I am trying to categorise users based on their lifecycle. The Pandas dataframe given below shows the number of times a customer raised a ticket, depending on how long they have used the product.
master dataframe
cust_id,start_date,end_date
101,02/01/2019,12/01/2019
101,14/02/2019,24/04/2019
101,27/04/2019,02/05/2019
102,25/01/2019,02/02/2019
103,02/01/2019,22/01/2019
Master lookup table
start_date,end_date,project_name
01/01/2019,13/01/2019,project_a
14/01/2019,13/02/2019,project_b
15/02/2019,13/03/2019,project_c
14/03/2019,13/06/2019,project_d
I am trying to map the above two data frames such that I am able to add project_name to the master dataframe.
Expected output:
cust_id,start_date,end_date,project_name
101,02/01/2019,12/01/2019,project_a
101,14/02/2019,24/04/2019,project_c
101,14/02/2019,24/04/2019,project_d
101,27/04/2019,02/05/2019,project_d
102,25/01/2019,02/02/2019,project_b
103,02/01/2019,22/01/2019,project_a
103,02/01/2019,22/01/2019,project_b
I do expect duplicate rows in the final output, as a single row in the master dataframe can fall under multiple rows of the master lookup table.
I think you need:
df = df1.assign(a=1).merge(df2.assign(a=1), on='a')
m1 = df['start_date_y'].between(df['start_date_x'], df['end_date_x'])
m2 = df['end_date_y'].between(df['start_date_x'], df['end_date_x'])
df = df[m1 | m2]
print (df)
cust_id start_date_x end_date_x a start_date_y end_date_y project_name
1 101 2019-02-01 2019-12-01 1 2019-01-14 2019-02-13 project_b
2 101 2019-02-01 2019-12-01 1 2019-02-15 2019-03-13 project_c
3 101 2019-02-01 2019-12-01 1 2019-03-14 2019-06-13 project_d
6 101 2019-02-14 2019-04-24 1 2019-02-15 2019-03-13 project_c
7 101 2019-02-14 2019-04-24 1 2019-03-14 2019-06-13 project_d
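Two caveats, assuming the frame names df1 (master) and df2 (lookup) used in the answer: between() compares whatever dtype it is given, so the dd/mm/yyyy strings from the question should be parsed with dayfirst=True before merging; and on pandas 1.2+ the assign(a=1) helper column can be replaced by a cross merge. A minimal sketch:
import pandas as pd

# parse the dd/mm/yyyy strings before comparing ranges
for col in ['start_date', 'end_date']:
    df1[col] = pd.to_datetime(df1[col], dayfirst=True)
    df2[col] = pd.to_datetime(df2[col], dayfirst=True)

# pandas >= 1.2: cross join without the helper column 'a'
df = df1.merge(df2, how='cross')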

Pandas - Find difference based on two subsequent rows of Dataframe

I have a Dataframe that captures the date when a ticket was raised by a customer, in the column labelled date. If the ref_column of the current row is the same as that of the following row, then I need to compute aging as the difference between the date of the current row and the date of the following row for the same cust_id. If the ref_column is not the same, then I need to compute the difference between date and ref_date of the same row.
Given below is how my data is:
cust_id,date,ref_column,ref_date
101,15/01/19,abc,31/01/19
101,17/01/19,abc,31/01/19
101,19/01/19,xyz,31/01/19
102,15/01/19,abc,31/01/19
102,21/01/19,klm,31/01/19
102,25/01/19,xyz,31/01/19
103,15/01/19,xyz,31/01/19
Expected output:
cust_id,date,ref_column,ref_date,aging(in days)
101,15/01/19,abc,31/01/19,2
101,17/01/19,abc,31/01/19,14
101,19/01/19,xyz,31/01/19,0
102,15/01/19,abc,31/01/19,16
102,21/01/19,klm,31/01/19,10
102,25/01/19,xyz,31/01/19,0
103,15/01/19,xyz,31/01/19,0
Aging (in days) is 0 for the last entry for a given cust_id.
Here's my approach:
# convert dates to datetime type
# ignore if already are
df['date'] = pd.to_datetime(df['date'])
df['ref_date'] = pd.to_datetime(df['ref_date'])
# customer group
groups = df.groupby('cust_id')
# where ref_column is the same with the next:
same_ = df['ref_column'].eq(groups['ref_column'].shift(-1))
# update these ones
df['aging'] = np.where(same_,
-groups['date'].diff(-1).dt.days, # same ref as next row
df['ref_date'].sub(df['date']).dt.days) # diff ref than next row
# update last elements in groups:
last_idx = groups['date'].idxmax()
df.loc[last_idx, 'aging'] = 0
Output:
cust_id date ref_column ref_date aging
0 101 2019-01-15 abc 2019-01-31 2.0
1 101 2019-01-17 abc 2019-01-31 14.0
2 101 2019-01-19 xyz 2019-01-31 0.0
3 102 2019-01-15 abc 2019-01-31 16.0
4 102 2019-01-21 klm 2019-01-31 10.0
5 102 2019-01-25 xyz 2019-01-31 0.0
6 103 2019-01-15 xyz 2019-01-31 0.0

Pandas - group by to return first occurrence and thereafter every third occurrence of a value

I am trying to filter records from a Dataframe based on their occurrence. I want to extract the first occurrence and then every third occurrence based on emp_id. Given below is how my Dataframe looks.
emp_id,date,value
101,2018-12-01,10001
101,2018-12-03,10002
101,2018-12-05,10003
101,2018-12-13,10004
In the above sample, expected output is :
emp_id,date,value
101,2018-12-01,10001
101,2018-12-13,10004
Given below is the code I have built this far:
df['emp_id'] = df.groupby('emp_id').cumcount()+1
df['emp_id'] = np.where((df['emp_id']%3)==0,1,0)
This, however, returns the 2nd occurrence and every third occurrence after that. How could I modify it so that it returns the first occurrence and then every third occurrence based on emp_id?
I think you need boolean indexing with a check against 0 or 1; assigning to a column is not necessary, and it is possible to create a helper Series s:
print (df)
emp_id date value
0 101 2018-12-01 10001
1 101 2018-12-03 10002
2 101 2018-12-05 10003
3 101 2018-12-13 10004
4 101 2018-12-01 10005
5 101 2018-12-03 10006
6 101 2018-12-05 10007
7 101 2018-12-13 10008
s = df.groupby('emp_id').cumcount()
df['check'] = (s % 3) == 0
Alternative:
s = df.groupby('emp_id').cumcount() + 1
df['check'] = (s % 3) == 1
print (df)
emp_id date value check
0 101 2018-12-01 10001 True
1 101 2018-12-03 10002 False
2 101 2018-12-05 10003 False
3 101 2018-12-13 10004 True
4 101 2018-12-01 10005 False
5 101 2018-12-03 10006 False
6 101 2018-12-05 10007 True
7 101 2018-12-13 10008 False
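To actually return only the first and then every third occurrence, filter on the check column (or on the helper Series directly), e.g.:
out = df[df['check']]
# equivalently, without creating the column:
out = df[df.groupby('emp_id').cumcount() % 3 == 0]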