Join overlapping date ranges - sql

I need to join table A and table B to create table C.
Table A and Table B store status flags for the IDs. The status flags (A_Flag and B_Flag) can change from time to time, so one ID can contain multiple rows, which represents the history of the ID's statuses. The flags for a particular ID can change independently of each other, which can result in one row in Table A belonging to multiple rows in Table B, and vice versa.
The resulting table (Table C) needs to be a list of unique date ranges covering every date within the ID's life (01/01/2008-18/08/2008), with the A_Flag and B_Flag values for each date range.
The actual tables contain hundreds of IDs, with each ID having a varying number of rows per table.
I have access to SQL and SAS tools to achieve the end result.
Source - Table A
ID Start End A_Flag
1 01/01/2008 23/03/2008 1
1 23/03/2008 15/06/2008 0
1 15/06/2008 18/08/2008 1
Source - Table B
ID Start End B_Flag
1 19/01/2008 17/02/2008 1
1 17/02/2008 15/06/2008 0
1 15/06/2008 18/08/2008 1
Result - Table C
ID Start End A_Flag B_Flag
1 01/01/2008 19/01/2008 1 0
1 19/01/2008 17/02/2008 1 1
1 17/02/2008 23/03/2008 1 0
1 23/03/2008 15/06/2008 0 0
1 15/06/2008 18/08/2008 1 1

I'm going to solve this in SQL, assuming you have the window function lead (SQL Server 2012+, Oracle, Postgres, DB2). You can get the same effect with a correlated subquery.
The idea is to get all the distinct boundary dates, pair each date with the next one to form the minimal time periods, and then join back to the original tables to get the flags.
Start by building a set called startends with a union (not union all) of the four date columns in one column: select a.start as thedate, union'ed with a.end, b.start, and b.end (carrying the id along). Then:
with driver as (
    select id, thedate as startdate,
           lead(thedate) over (partition by id order by thedate) as enddate
    from startends
)
select driver.id, driver.startdate, driver.enddate, a.a_flag, b.b_flag
from driver left outer join
     a
     on a.id = driver.id and
        a.start <= driver.startdate and driver.enddate <= a.end left outer join
     b
     on b.id = driver.id and
        b.start <= driver.startdate and driver.enddate <= b.end
where driver.enddate is not null
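The two-step idea (collect the boundary dates, pair adjacent dates into periods, then look up the flag whose source range covers each period) can be checked in plain Python before committing to SQL. This is an illustrative sketch over the question's example data for the single ID, with flags defaulting to 0 when no source range covers a period:

```python
# Boundary dates from both tables define the minimal periods.
# ISO date strings so lexicographic sort equals chronological sort.
a = [("2008-01-01", "2008-03-23", 1),
     ("2008-03-23", "2008-06-15", 0),
     ("2008-06-15", "2008-08-18", 1)]
b = [("2008-01-19", "2008-02-17", 1),
     ("2008-02-17", "2008-06-15", 0),
     ("2008-06-15", "2008-08-18", 1)]

bounds = sorted({d for s, e, _ in a + b for d in (s, e)})
periods = list(zip(bounds, bounds[1:]))   # adjacent pairs of bounds

def flag(table, s, e):
    # flag of the source range fully covering [s, e), else 0
    return next((f for ts, te, f in table if ts <= s and e <= te), 0)

table_c = [(s, e, flag(a, s, e), flag(b, s, e)) for s, e in periods]
for row in table_c:
    print(row)
```

The output reproduces Table C, including the first period where Table B has no coverage and B_Flag falls back to 0.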

The problem you posed can be solved in one SQL statement without nonstandard extensions.
The most important thing to recognize is that the dates in the begin-end pairs each represent a potential starting or ending point of a time span during which the flag pair holds. It actually doesn't matter that one date is a "begin" and another an "end"; any date is a time delimiter that does both: it ends a prior period and begins another. Construct a set of minimal time intervals, and join them to the tables to find the flags that obtained during each interval.
I added your example (and a solution) to my Canonical SQL page. See there for a detailed discussion. In fairness to SO, here's the query itself:
with D (ID, bound) as (
    select ID,
           case T when 's' then StartDate else EndDate end as bound
    from (
        select ID, StartDate, EndDate from so.A
        UNION
        select ID, StartDate, EndDate from so.B
    ) as U
    cross join (select 's' as T union select 'e') as T
)
select P.*, a.Flag as A_Flag, b.Flag as B_Flag
from (
    select s.ID, s.bound as StartDate, min(e.bound) as EndDate
    from D as s
    join D as e
      on s.ID = e.ID
     and s.bound < e.bound
    group by s.ID, s.bound
) as P
left join so.A as a
  on P.ID = a.ID
 and a.StartDate <= P.StartDate and P.EndDate <= a.EndDate
left join so.B as b
  on P.ID = b.ID
 and b.StartDate <= P.StartDate and P.EndDate <= b.EndDate
order by P.ID, P.StartDate, P.EndDate
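The minimal-interval construction is easy to exercise end to end in SQLite through Python's sqlite3 module. The sketch below uses illustrative table and column names rather than the so.A/so.B schema, and quotes "end" because it is a reserved word in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE a (id INT, start TEXT, "end" TEXT, a_flag INT);
CREATE TABLE b (id INT, start TEXT, "end" TEXT, b_flag INT);
INSERT INTO a VALUES
  (1, '2008-01-01', '2008-03-23', 1),
  (1, '2008-03-23', '2008-06-15', 0),
  (1, '2008-06-15', '2008-08-18', 1);
INSERT INTO b VALUES
  (1, '2008-01-19', '2008-02-17', 1),
  (1, '2008-02-17', '2008-06-15', 0),
  (1, '2008-06-15', '2008-08-18', 1);
""")
rows = cur.execute("""
WITH d AS (                -- every boundary date, deduplicated by UNION
  SELECT id, start AS bound FROM a
  UNION SELECT id, "end" FROM a
  UNION SELECT id, start FROM b
  UNION SELECT id, "end" FROM b
),
p AS (                     -- minimal intervals between adjacent bounds
  SELECT s.id, s.bound AS start, MIN(e.bound) AS "end"
  FROM d s JOIN d e ON s.id = e.id AND s.bound < e.bound
  GROUP BY s.id, s.bound
)
SELECT p.id, p.start, p."end",
       COALESCE(a.a_flag, 0), COALESCE(b.b_flag, 0)
FROM p
LEFT JOIN a ON a.id = p.id AND a.start <= p.start AND p."end" <= a."end"
LEFT JOIN b ON b.id = p.id AND b.start <= p.start AND p."end" <= b."end"
ORDER BY p.id, p.start
""").fetchall()
for row in rows:
    print(row)
```

The five rows printed match Table C from the question, with COALESCE filling in 0 where an interval predates a table's first range.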

One possible SAS solution is to perform a partial join and then create the necessary additional rows in a DATA step. This should work provided tableA has all possible records; if that's not the case (if tableB can start before tableA), some additional logic may be needed to handle that possibility (e.g. if first.id and start gt b_start). There may also be logic needed for issues not present in the example data; I didn't have a lot of time this morning and haven't debugged this for anything beyond the example cases, but the concept should be evident.
data tableA;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End A_Flag;
datalines;
1 01/01/2008 23/03/2008 1
1 23/03/2008 15/06/2008 0
1 15/06/2008 18/08/2008 1
;;;;
run;
data tableB;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End B_Flag;
datalines;
1 19/01/2008 17/02/2008 1
1 17/02/2008 15/06/2008 0
1 15/06/2008 18/08/2008 1
;;;;
run;
proc sql;
create table c_temp as
select * from tableA A
left join (select id, start as b_start, end as b_end, b_flag from tableB) B
on A.Id = B.id
where (A.start le B.b_start and A.end gt B.b_start) or (A.start lt B.b_end and A.end ge B.b_end)
order by A.ID, A.start, B.b_start;
quit;
data tableC;
    set c_temp;
    by id start;
    retain b_flag_ret;
    format start_fin end_fin DATE9.;
    if first.id then b_flag_ret=0;
    do until (start=end);
        if (start lt b_start) and first.start then do;
            start_fin=start;
            end_fin=b_start;
            a_flag_fin=a_flag;
            b_flag_fin=b_flag_ret;
            output;
            start=b_start;
        end;
        else do; *start=b_start;
            start_fin=ifn(start ge b_start, start, b_start);
            end_fin=ifn(b_end le end, b_end, end);
            a_flag_fin=a_flag;
            b_flag_fin=b_flag;
            output;
            start=end; *leave the loop as there will be a later row that matches;
        end;
    end;
run;

This type of sequential processing with shifts and offsets is one of the situations where the SAS DATA step shines. Not that this answer is simple, but it is simpler than using SQL, which can be done, but isn't designed with this sequential processing in mind.
Furthermore, solutions based on DATA step tend to be very efficient. This one runs in time O(n log n) in theory, but closer to O(n) in practice, and in constant space.
The first two DATA steps are just loading data, slightly modified from Joe's answer, to have multiple IDs (otherwise the syntax is MUCH easier) and to add some corner cases, i.e., an ID for which it is impossible to determine initial state.
data tableA;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End A_Flag;
datalines;
1 01/01/2008 23/03/2008 1
2 23/03/2008 15/06/2008 0
2 15/06/2008 18/08/2008 1
;;;;
run;
data tableB;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End B_Flag;
datalines;
1 19/01/2008 17/02/2008 1
2 17/02/2008 15/06/2008 0
4 15/06/2008 18/08/2008 1
;;;;
run;
The next data step finds the first modification for each id and flag and sets the initial value to the opposite of what it found.
/* Get initial state by inverting first change */
data firstA;
set tableA;
by id;
if first.id;
A_Flag = ~A_Flag;
run;
data firstB;
set tableB;
by id;
if first.id;
B_Flag = ~B_Flag;
run;
data first;
merge firstA firstB;
by id;
run;
The next data step merges the artificial "first" table with the other two, retaining the last state known and discarding the artificial initial row.
data tableAB (drop=lastA lastB);
    set first tableA tableB;
    by id start;
    retain lastA lastB;
    if A_flag = . and ~first.id then A_flag = lastA;
    else lastA = A_flag;
    if B_flag = . and ~first.id then B_flag = lastB;
    else lastB = B_flag;
    if ~first.id; /* drop artificial first row per id */
run;
The steps above do almost everything.
The only bug is that the end dates will be wrong, because they are copied from the original row.
To fix that, copy the next start to each row's end, unless it is a final row.
The easiest way is to sort each id by reverse start, look back one record, then sort ascending again at the end.
/* sort descending to ... */
proc sort data=tableAB;
by id descending start;
run;
/* ... copy next start to this row's "end" field if not final */
data tableAB(drop=nextStart);
set tableAB;
by id descending start;
nextStart=lag(start);
if ~first.id then end=nextStart;
run;
proc sort data=tableAB;
by id start;
run;
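For comparison, here is a rough Python transcription of the same sweep: treat each row as "this flag changed at start", derive the initial state by inverting each flag's first recorded change, forward-fill the other flag, and set each segment's end to the next segment's start. The data is the question's single-ID example; all names are illustrative:

```python
from collections import defaultdict

# illustrative rows: (id, start, end, flag); ISO dates so strings sort
a = [(1, "2008-01-01", "2008-03-23", 1),
     (1, "2008-03-23", "2008-06-15", 0),
     (1, "2008-06-15", "2008-08-18", 1)]
b = [(1, "2008-01-19", "2008-02-17", 1),
     (1, "2008-02-17", "2008-06-15", 0),
     (1, "2008-06-15", "2008-08-18", 1)]

def initial_state(table):
    # the "invert the first change" trick: before a flag's first
    # recorded change it is assumed to hold the opposite value
    init = {}
    for id_, start, end, f in sorted(table):
        init.setdefault(id_, 1 - f)
    return init

init = {"A": initial_state(a), "B": initial_state(b)}
events = defaultdict(dict)      # (id, start) -> {flag name: new value}
final_end = {}                  # id -> latest known end date
for name, table in (("A", a), ("B", b)):
    for id_, start, end, f in table:
        events[(id_, start)][name] = f
        final_end[id_] = max(final_end.get(id_, end), end)

rows, last = [], {"id": None}
for (id_, start), changed in sorted(events.items()):
    if last["id"] != id_:
        last = {"id": id_, "A": init["A"].get(id_), "B": init["B"].get(id_)}
    last.update(changed)        # forward-fill: unchanged flags persist
    rows.append([id_, start, final_end[id_], last["A"], last["B"]])
for cur, nxt in zip(rows, rows[1:]):
    if cur[0] == nxt[0]:        # each segment ends where the next starts
        cur[2] = nxt[1]
```

As in the SAS version, the final row per id keeps the real end date while every earlier row's end is shifted to the following start.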

Related

How can I replace the LAST() function in MS Access with proper ordering on a rather large table?

I have an MS Access database with two tables, Asset and Transaction. The schema looks like this:
Table ASSET
Key Date1 AType FieldB FieldC ...
A 2023.01.01 T1
B 2022.01.01 T1
C 2023.01.01 T2
.
.
TABLE TRANSACTION
Date2 Key TType1 TType2 TType3 FieldOfInterest ...
2022.05.31 A 1 1 1 10
2022.08.31 A 1 1 1 40
2022.08.31 A 1 2 1 41
2022.09.30 A 1 1 1 30
2022.07.31 A 1 1 1 30
2022.06.30 A 1 1 1 20
2022.10.31 A 1 1 1 45
2022.12.31 A 2 1 1 50
2022.11.30 A 1 2 1 47
2022.05.23 B 2 1 1 30
2022.05.01 B 1 1 1 10
2022.05.12 B 1 2 1 20
.
.
.
The ASSET table has a PK (Key).
The TRANSACTION table has a composite key that is (Key, Date2, Type1, Type2, Type3).
Given the above tables let's see an example:
Input1 = 2022.04.01
Input2 = 2022.08.31
Desired result:
Key FieldOfInterest
A 41
because if the Transactions in scope was to be ordered by Date2, TType1, TType2, TType3 all ascending then the record having FieldOfInterest = 41 would be the last one.
Note that Asset B is not in scope due to Asset.Date1 < Input1, neither is Asset C because AType != T1. Ultimately I am curious about the SUM(FieldOfInterest) of all the last transactions belonging to an Asset that is in scope determined by the input variables.
The following query has so far provided the right results, but after upgrading to a newer MS Access version, the LAST() operation no longer reliably returns the row that was the latest addition to the Transaction table.
I have several input values, but the most important ones are two dates; let's call them InputDate1 and InputDate2.
This is how it worked so far:
SELECT Asset.AType, Last(FieldOfInterest) AS CurrentValue ,Asset.Key
FROM Transaction
INNER JOIN Asset ON Transaction.Key = Asset.Key
WHERE Transaction.Date2 <= InputDate2 And Asset.Date1 >= InputDate1
GROUP BY Asset.Key, Asset.AType
HAVING Asset.AType='T1'
It is known that grouped records are not guaranteed to be in any order. Obviously it was a mistake to rely on the GROUP BY operation preserving the original table order, but let's ignore that for now.
I have been struggling to come up with the right way to do the following:
join the Asset and Transaction tables on Asset.Key = Transaction.Key
filter by Asset.Date1 >= InputDate1 AND Transaction.Date2 <= InputDate2
then I need to select one record for all Transaction.Key where Date2 and TType1 and TType2 and TType3 has the highest value. (this represents the actual last record for given Key)
As far as I know there is no way to order records within a group by clause which is unfortunate.
I have tried ranking, but the Transaction table is large (800k rows) and the performance was very slow; I need something faster. The following is an example of three saved queries that I wrote and chained together, but the performance is very disappointing, probably due to the ranking step.
-- Saved query step_1
SELECT Asset.*, Transaction.*
FROM Transaction
INNER JOIN Asset ON Transaction.Key = Asset.Key
WHERE Transaction.Date2 <= 44926
AND Asset.Date1 >= 44562
AND Asset.aType = 'T1'
-- Saved query step_2
SELECT tr.FieldOfInterest, (SELECT Count(*) FROM
(SELECT tr2.Transaction.Key, tr2.Date2, tr2.Transaction.tType1, tr2.tType2, tr2.tType3 FROM step_1 AS tr2) AS tr1
WHERE (tr1.Date2 > tr.Date2 OR
(tr1.Date2 = tr.Date2 AND tr1.tType1 > tr.Transaction.tType1) OR
(tr1.Date2 = tr.Date2 AND tr1.tType1 = tr.Transaction.tType1 AND tr1.tType2 > tr.tType2) OR
(tr1.Date2 = tr.Date2 AND tr1.tType1 = tr.Transaction.tType1 AND tr1.tType2 = tr.tType2 AND tr1.tType3 > tr.tType3))
AND tr1.Key = tr.Transaction.Key)+1 AS Rank
FROM step_1 AS tr
-- Saved query step_3
SELECT SUM(FieldOfInterest) FROM step_2
WHERE Rank = 1
I hope I am being clear enough so that I can get some useful recommendations. I've been stuck with this for weeks now and really don't know what to do about it. I am open for any suggestions.
Reading the following specification
then I need to select one record for all Transaction.Key where Date2 and TType1 and TType2 and TType3 has the highest value. (this represents the actual last record for given Key)
Consider a simple aggregation in step 2 to retrieve the max values, then in step 3 join all those fields back to the first query.
Step 1 (rewritten to avoid name collision and too many columns)
SELECT a.[Key] AS Asset_Key, a.Date1, a.AType,
t.[Key] AS Transaction_Key, t.Date2,
t.TType1, t.TType2, t.TType3, t.FieldOfInterest
FROM Transaction t
INNER JOIN Asset a ON t.[Key] = a.[Key]
WHERE t.Date2 <= 44926
AND a.Date1 >= 44562
AND a.AType = 'T1'
Step 2
SELECT Transaction_Key,
       MAX(Date2) AS Max_Date2,
       MAX(TType1) AS Max_TType1,
       MAX(TType2) AS Max_TType2,
       MAX(TType3) AS Max_TType3
FROM step_1
GROUP BY Transaction_Key
Step 3
SELECT s1.*
FROM step_1 s1
INNER JOIN step_2 s2
ON s1.Transaction_Key = s2.Transaction_Key
AND s1.Date2 = s2.Max_Date2
AND s1.TType1 = s2.Max_TType1
AND s1.TType2 = s2.Max_TType2
AND s1.TType3 = s2.Max_TType3
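Access has no window functions, but the "last row per key" pattern the steps are aiming at (a greatest-n-per-group query) is easy to express in an engine that does. A sketch with Python's sqlite3, using made-up table and column names and a cut-down version of the question's data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE tx (key TEXT, date2 TEXT, t1 INT, t2 INT, t3 INT, foi INT);
INSERT INTO tx VALUES
  ('A', '2022-05-31', 1, 1, 1, 10),
  ('A', '2022-08-31', 1, 1, 1, 40),
  ('A', '2022-08-31', 1, 2, 1, 41),
  ('A', '2022-07-31', 1, 1, 1, 30),
  ('B', '2022-05-23', 2, 1, 1, 30),
  ('B', '2022-05-01', 1, 1, 1, 10);
""")
# keep, per key, the single row that sorts last by the composite
# (date2, t1, t2, t3), then sum the field of interest over those rows
total = cur.execute("""
SELECT SUM(foi) FROM (
  SELECT foi,
         ROW_NUMBER() OVER (
           PARTITION BY key
           ORDER BY date2 DESC, t1 DESC, t2 DESC, t3 DESC) AS rn
  FROM tx
  WHERE date2 <= '2022-08-31'
)
WHERE rn = 1
""").fetchall()
print(total)
```

Here A's last transaction carries FieldOfInterest 41 and B's carries 30, so the sum is 71; in Access the same effect needs the max-then-join dance shown above because ROW_NUMBER() is unavailable.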

how to add sequence number to a group of specific consecutive records?

I do have a tricky one:
Here is my example data. It is sorted by a date variable (which is not included here). I want to calculate a new variable, called seq_no, which creates a grouping sequence number for each block of consecutive flagged records. The seq_no variable should look like in the example, and I want to calculate it with SAS or SQL.
ID flag seq_no
1 Y 1
1
1 Y 2
1 Y 2
2
2 Y 1
2 Y 1
2
2 Y 2
3 Y 1
3 Y 1
3
Thanks a lot in advance!
Stephan
Unlike SAS datasets, SQL tables represent unordered sets. The following assumes that you have a column that specifies the ordering.
You can count the number of empty records before each "Y" and use that to assign a unique value:
proc sql;
select t.*,
(select count(*)
from t t2
where t2.id = t.id and t2.flag is null and t2.ordcol <= t.ordcol
) grp_id
from t;
A "real" database would have more substantial functionality -- in particular window functions -- that would facilitate this effort.
As you said, "It is sorted by a date variable (which is not included here)", so this is an approach using windowed aggregates, which most DBMSes support:
with cte as
( select id, flag, datecol,
-- assign a new value whenever there's a NULL flag
sum(case when flag is null then 1 else 0 end)
over (partition by id
order by datecol
rows unbounded preceding) as grp
from tab
)
select id, flag, datecol, grp,
case when flag is not null
then dense_rank() -- assign a sequence to each group of 'Y'
over (partition by id
order by grp)
end
from cte
If you don't care if the sequence starts at 0 or 1 you can even simplify it to
select id, flag, datecol,
case when flag is not null
then sum(case when flag is null then 1 else 0 end)
over (partition by id
order by datecol
rows unbounded preceding)
end as grp2
from tab
SQL is not a good tool for this since it is designed for set operations, not sequential processing. (There are workarounds and advanced functions in other SQL implementations that can help, see the other answers.)
But in a data step it is simple with a retained variable.
data want;
set have;
by id ;
if first.id then seq_no=0;
seq_no+(flag='Y');
run;
One wrinkle in your request is that you don't want the count on the records where the flag is not Y. That can easily be handled by retaining the count in a separate variable from the one you use as your "seq_no" variable.
data want;
set have;
by id ;
if first.id then cumm_seq_no=0;
cumm_seq_no+(flag='Y');
if flag='Y' then seq_no=cumm_seq_no;
run;
Here is a base SAS method.
proc sort data=have;
by id date;
run;
data want (drop=prev_seq);
set have;
by id;
retain prev_seq;
if first.id then prev_seq = .;
if flag = 'Y' then do;
prev_seq + 1;
seq_no = prev_seq;
end;
run;
You want seq_no incremented at the start of each contiguous block of Y flag values. Use the NOTSORTED option of the BY statement to process contiguous blocks even when the blocks are disordered or gapped.
Example:
data want;
set have;
by id flag NOTSORTED;
if first.id then seq_num = 0;
if first.flag and flag = 'Y' then seq_num+1;
if flag='Y'
then seq_no = seq_num;
else call missing(seq_no);
drop seq_num;
run;
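All of these DATA-step variants implement the same single scan. Here is a pure-Python rendering of the block-numbering logic (bump the counter only at the start of a contiguous run of 'Y' within an id), using the question's example data with None for the blank flags:

```python
# (id, flag) rows in date order; None marks a blank flag
rows = [(1, 'Y'), (1, None), (1, 'Y'), (1, 'Y'),
        (2, None), (2, 'Y'), (2, 'Y'), (2, None), (2, 'Y'),
        (3, 'Y'), (3, 'Y'), (3, None)]

out, counter, prev = [], 0, (None, None)
for id_, flag in rows:
    if id_ != prev[0]:
        counter = 0                        # reset counter per id
    if flag == 'Y' and prev != (id_, 'Y'):
        counter += 1                       # start of a new 'Y' block
    out.append((id_, flag, counter if flag == 'Y' else None))
    prev = (id_, flag)

for row in out:
    print(row)
```

The third element of each output tuple reproduces the seq_no column from the question, with None on the unflagged rows.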

sas/sql logic needed

I have data with SSN and open date, and I have to determine whether a customer has opened 2 or more accounts within 120 days, based on the Open_date field. I know how to use the INTCK/INTNX functions, but they require 2 date fields; I am not sure how to apply the same logic to a single field for the same customer. Please suggest.
SSN account Open_date
xyz 000123 12/01/2015
xyz 112344 11/22/2015
xyz 893944 04/05/2016
abc 992343 01/10/2016
abc 999999 03/05/2016
123 111123 07/16/2015
123 445324 10/12/2015
You can use exists or join:
proc sql;
select distinct SSN
from t
where exists (select 1
              from t t2
              where t2.SSN = t.SSN and
                    t2.account <> t.account and
                    t2.open_date between t.open_date and t.open_date + 120
             );
I'd do it using JOIN :
proc sql;
create table want as
select *
from have
where SSN in
(select a.SSN
from have a
inner join have b
on a.SSN=b.SSN
where a.account ne b.account
and intck('day', a.Open_date, b.Open_Date) between 0 and 120)
;
quit;
Just a slightly different solution here - use the dif function which calculates the number of days between accounts being open.
proc sort data=have;
by ssn open_date;
run;
data want;
set have;
by ssn;
days_between_open = dif(open_date);
if first.ssn then days_between_open = .;
*if 0 < days_between_open < 120 then output;
run;
Then you can filter the table above as required. I've left it commented out at this point because you haven't specified how you want your output table.
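As a cross-check on the logic, the 120-day test can be stated directly in Python: sort each SSN's open dates and look for any adjacent pair no more than 120 days apart. The data is the question's, plus a made-up "qqq" customer to show someone who is not flagged:

```python
from datetime import date
from itertools import groupby

accounts = [("xyz", date(2015, 12, 1)), ("xyz", date(2015, 11, 22)),
            ("xyz", date(2016, 4, 5)),  ("abc", date(2016, 1, 10)),
            ("abc", date(2016, 3, 5)),  ("123", date(2015, 7, 16)),
            ("123", date(2015, 10, 12)),
            ("qqq", date(2015, 1, 1)),  ("qqq", date(2016, 6, 1))]

flagged = set()
for ssn, grp in groupby(sorted(accounts), key=lambda r: r[0]):
    dates = [d for _, d in grp]           # sorted within ssn
    # adjacent pairs suffice: if any two opens are within 120 days,
    # some adjacent pair in sorted order is too
    if any((b - a).days <= 120 for a, b in zip(dates, dates[1:])):
        flagged.add(ssn)
```

Checking only adjacent pairs after sorting is the same trick the DIF-based answer uses: the smallest gap for a customer is always between consecutive open dates.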

Select only Contiguous Records in DB2 SQL

So I have a table of readings (heavily simplified version below). Sometimes there is a break in the reading history (see the record I have flagged as N). The 'From Read' should always match a previous 'To Read', or the 'To Read' should always match a later 'From Read', BUT I want to select records only as far back as the first 'break' in the reads.
How would I write a query in DB2 SQL to return only the rows flagged with a 'Y'?
EDIT: The contiguous flag is something i have added manually to represent the records i would like to select, it does not exist on the table.
ID From To Contiguous
ABC 01/01/2014 30/06/2014 Y
ABC 01/06/2013 01/01/2014 Y
ABC 01/05/2013 01/06/2013 Y
ABC 01/01/2013 01/02/2013 N
ABC 01/10/2012 01/01/2013 N
Thanks in advance!
J
You will need a recursive select, something like this:
WITH contiguous_intervals (rstart, rend) AS (
    -- DB2 recursive CTEs use plain WITH (no RECURSIVE keyword);
    -- rstart/rend stand in for the reserved words START/END
    SELECT rstart, rend
    FROM intervals
    WHERE rend = (SELECT MAX(rend) FROM intervals)
    UNION ALL
    SELECT i.rstart, i.rend
    FROM contiguous_intervals m, intervals i
    WHERE i.rend = m.rstart
)
SELECT * FROM contiguous_intervals;
You can do this with lead(), lag(). I'm not sure what the exact logic is for your case, but I think it is something like:
select r.*,
(case when (prev_to = from or prev_to is null) and
(next_from = to or next_from is null)
then 'Y'
else 'N'
end) as Contiguous
from (select r.*, lead(from) over (partition by id order by from) as next_from,
lag(to) over (partition by id order by to) as prev_to
from readings r
) r;
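The recursion is essentially "start at the latest reading and walk backwards while each From matches an earlier To". A pure-Python sketch over the question's rows (ISO dates, names illustrative) makes the chain explicit:

```python
# (from, to) reading intervals for one id, in no particular order
intervals = [("2014-01-01", "2014-06-30"), ("2013-06-01", "2014-01-01"),
             ("2013-05-01", "2013-06-01"), ("2013-01-01", "2013-02-01"),
             ("2012-10-01", "2013-01-01")]

by_to = {to: (frm, to) for frm, to in intervals}   # look up by end date
latest = max(intervals, key=lambda r: r[1])        # most recent reading
chain = [latest]
while chain[-1][0] in by_to:     # a prior reading ends where this starts
    chain.append(by_to[chain[-1][0]])
```

The walk stops at 2013-05-01 because no reading ends on that date, so the chain contains exactly the three rows flagged Y and excludes everything before the break.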

Check previous records to update current record in Oracle

I'm having difficulty figuring out how to check previous records in order to see if the current record should be updated.
I don't want to use the LAG function because I won't have the information on how many records to go back.
I have a table that contains employee raise information. I want to put an X in the IND field if there has been a previous Merit increase PCT greater than the current Merit increase within the last 6 months. The current record is the 2012/05 record.
Emp Action Date Code proj PCT Ind
====================================================
123 raise 2012/01 COL acct 2
123 raise 2012/01 Merit soft 7
123 raise 2012/02 Merit Acct 4
123 Raise 2012/05 merit soft 6 ?
It's not particularly efficient, but you can use a brute-force approach:
UPDATE <<table_name>> a
SET ind = 'X'
WHERE <<date column>> = (SELECT MAX(<<date column>>)
                         FROM <<table name>> b
                         WHERE a.emp = b.emp)
  AND EXISTS (SELECT 1
              FROM <<table name>> c
              WHERE a.emp = c.emp
                AND c.code = 'Merit'
                AND c.action = 'raise'
                AND c.pct > a.pct
                AND c.<<date column>> > sysdate - interval '6' month
                AND c.rowid != a.rowid);
If you are just looking for a query rather than an update, this might work:
select emp,
       action,
       date,
       code,
       project,
       pct,
       case when max(case when code = 'Merit' and action = 'raise' then pct end)
                 over (partition by emp order by date
                       range between interval '6' month preceding and current row)
                 > pct
            then 'X'
       end as ind
from the_table
I don't have a database to test this against right now, so I'm not entirely sure this will work, but I think it should. Edit: I worked out how to get SQL Fiddle going on my iPad. This seems to work; however, it will put an X in all the rows that meet the condition, not just the most recent.
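A quick way to sanity-check the window logic is a brute-force pass in Python over the question's rows. The 183-day stand-in for "6 months" and the normalized code/action strings are illustrative; note the sketch reproduces the caveat that every qualifying row gets an X, not only the most recent:

```python
from datetime import date

# (emp, action, date, code, pct) from the question's example
raises = [("123", "raise", date(2012, 1, 1), "COL",   2),
          ("123", "raise", date(2012, 1, 1), "Merit", 7),
          ("123", "raise", date(2012, 2, 1), "Merit", 4),
          ("123", "raise", date(2012, 5, 1), "Merit", 6)]

flagged = []
for emp, action, d, code, pct in raises:
    # any strictly earlier Merit raise within ~6 months with higher pct?
    prior = [p for e2, a2, d2, c2, p in raises
             if e2 == emp and a2 == "raise" and c2 == "Merit"
             and d2 < d and (d - d2).days <= 183]
    flagged.append("X" if any(p > pct for p in prior) else None)
```

The 2012/05 Merit row is flagged (the 7% raise from 2012/01 is within 6 months and higher), but so is the 2012/02 row, matching the behavior observed on SQL Fiddle.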