Based on feedback, I am restructuring my question.
I am working with SQL on a Presto database.
My objective is to report on employees that take consecutive days of PTO or Sick Time since the beginning of 2018. My desired output would have the individual islands of time taken by employee with the start and end dates, along the lines of:
The main table I am using is d_employee_time_off
There are only two time_off_type_name: PTO and Sick Leave.
The ds is a datestamp and I use the latest ds (usually the current date)
I have access to a date table named d_date
I can join the tables on d_employee_time_off.time_off_date = d_date.full_date
I hope that I have structured this question in a fashion that is understandable.
I believe the need here is to join the time-off rows to a calendar table.
In the example solution below I generate the calendar "on the fly", but I think you already have your own solution for this. In my example I used the string 'Monday' and moved backward from it (or you could use 'Friday' and move forward). I'm not keen on language-dependent solutions, but as I'm not a Presto user I wasn't able to test anything on Presto. So the example below uses some of your own logic, but in SQL Server syntax, which I trust you can translate to Presto:
Query:
;WITH
Digits AS (
SELECT 0 AS digit UNION ALL
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL
SELECT 9
)
, cal AS (
SELECT
ca.number
, dateadd(day,ca.number,'20180101') as cal_date
, datename(weekday,dateadd(day,ca.number,'20180101')) weekday
FROM Digits [1s]
CROSS JOIN Digits [10s]
CROSS JOIN Digits [100s] /* add more like this as needed */
cross apply (
SELECT
[1s].digit
+ [10s].digit * 10
+ [100s].digit * 100 /* add more like this as needed */
AS number
) ca
)
, time_off AS (
select
*
from cal
inner join mytable t on (cal.cal_date = t.time_off_date and cal.weekday <> 'Monday')
or (cal.cal_date between dateadd(day,-2,t.time_off_date)
and t.time_off_date and datename(weekday,t.time_off_date) = 'Monday')
)
, starting_points AS (
SELECT
employee_id,
cal_date,
dense_rank() OVER(partition by employee_id
ORDER BY
time_off_date
) AS rownum
FROM
time_off A
WHERE
NOT EXISTS (
SELECT
*
FROM
time_off B
WHERE
B.employee_id = A.employee_id
AND B.cal_date = DATEADD(day, -1, A.cal_date)
)
)
, ending_points AS (
SELECT
employee_id,
cal_date,
dense_rank() OVER(partition by employee_id
ORDER BY
time_off_date
) AS rownum
FROM
time_off A
WHERE
NOT EXISTS (
SELECT
*
FROM
time_off B
WHERE
B.employee_id = A.employee_id
AND B.cal_date = DATEADD(day, 1, A.cal_date)
)
)
SELECT
S.employee_id,
S.cal_date AS start_range,
E.cal_date AS end_range
FROM
starting_points S
JOIN
ending_points E
ON E.employee_id = S.employee_id
AND E.rownum = S.rownum
order by employee_id
, start_range
Result:
employee_id start_range end_range
1 200035 02.01.2018 02.01.2018
2 200035 20.04.2018 27.04.2018
3 200037 27.01.2018 29.01.2018
4 200037 31.03.2018 02.04.2018
see: http://rextester.com/MISZ50793
CREATE TABLE mytable(
ID INT NOT NULL
,employee_id INTEGER NOT NULL
,type VARCHAR(3) NOT NULL
,time_off_date DATE NOT NULL
,time_off_in_days INT NOT NULL
);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (1,200035,'PTO','2018-01-02',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (2,200035,'PTO','2018-04-20',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (3,200035,'PTO','2018-04-23',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (4,200035,'PTO','2018-04-24',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (5,200035,'PTO','2018-04-25',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (6,200035,'PTO','2018-04-26',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (7,200035,'PTO','2018-04-27',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (8,200037,'PTO','2018-01-29',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (9,200037,'PTO','2018-04-02',1);
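The island logic itself is independent of the SQL dialect, so it may help to see it outside SQL before translating to Presto. The following is a minimal Python sketch, not the query above: it assumes that two days off belong to the same island when no working day (Monday to Friday) lies between them, and it reports the first and last actual day off rather than padding a Monday back over the weekend as the SQL result does. The function names are made up for illustration.

```python
from datetime import date, timedelta
from itertools import groupby


def next_business_day(d):
    """Return the first Monday-to-Friday date after d."""
    nxt = d + timedelta(days=1)
    while nxt.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        nxt += timedelta(days=1)
    return nxt


def time_off_islands(rows):
    """rows: iterable of (employee_id, time_off_date).

    Returns a list of (employee_id, start_range, end_range) tuples.
    """
    islands = []
    # Sort by employee, then date; set() drops duplicate rows.
    for emp, grp in groupby(sorted(set(rows)), key=lambda r: r[0]):
        days = [d for _, d in grp]
        start = prev = days[0]
        for d in days[1:]:
            if d > next_business_day(prev):  # a working day was skipped
                islands.append((emp, start, prev))
                start = d
            prev = d
        islands.append((emp, start, prev))
    return islands


# Same rows as the mytable sample data.
sample = [
    (200035, date(2018, 1, 2)),
    (200035, date(2018, 4, 20)),
    (200035, date(2018, 4, 23)),
    (200035, date(2018, 4, 24)),
    (200035, date(2018, 4, 25)),
    (200035, date(2018, 4, 26)),
    (200035, date(2018, 4, 27)),
    (200037, date(2018, 1, 29)),
    (200037, date(2018, 4, 2)),
]

islands = time_off_islands(sample)
for row in islands:
    print(row)
```

With the sample rows, Friday 20 April and Monday 23 April end up in the same island, which is the behaviour the 'Monday' look-back in the SQL is meant to achieve.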
I have two different tables, common column is truck_id.
I need to subtract two tables from each other to find the net amount.
The result I want:
truck_id   difference
35kd85     1500
35hh52     900
SELECT
(SELECT SUM(last_revenue) FROM (
SELECT DISTINCT last_revenue FROM
Expedition WHERE YEAR(departure_date) > 2020 AND truck_id = '31adc444'
UNION ALL SELECT last_revenue FROM
ChainingExpedition WHERE YEAR(departure_date) > 2020 AND truck_id = '31adc444'
)x
)-(SELECT SUM(price_dollar) FROM (
SELECT DISTINCT price_dollar FROM TruckMaintenanceExpense
WHERE YEAR(payment_date) > 2020 AND expense_type = 'çeker'
AND truck_id ='31adc444'
)x
) AS difference
SQL subtraction on two different tables
When I hard-code a truck_id in my query, I get the right result, but my goal is to get the result as a list for every truck.
Typically, I would create a list of all trucks that meet the criteria you are looking for. Then I'd get a list of all trucks with revenue and a separate list of trucks with expenses. Then you join those three lists together and do the math. Your query was very hard to follow without good indenting and structure. Next time you ask a question, be sure to include sample data: you should write all the CREATE TABLE and INSERT INTO statements that I include in the EXAMPLE DATA section in the fiddle below.
--*****EXAMPLE DATA*****
CREATE TABLE Expedition (
truck_id nvarchar(50)
, last_revenue decimal(19,2)
, departure_date datetime
);
CREATE TABLE TruckMaintenanceExpense (
truck_id nvarchar(50)
, price_dollar decimal(19,2)
, payment_date datetime
);
INSERT INTO Expedition (truck_id, last_revenue, departure_date)
VALUES ('35kd85', 2000.00, '2020-1-1')
, ('35kd85', 3500.00, '2020-1-1')
, ('35hh52', 300.00, '2020-1-1')
, ('35hh52', 258.98, '2020-1-1')
;
INSERT INTO TruckMaintenanceExpense (truck_id, price_dollar, payment_date)
VALUES ('35kd85', 9865.23, '2020-2-1')
, ('35kd85', 321.54, '2020-2-1')
, ('35hh52', 159.78, '2020-2-1')
, ('35hh52', 598.77, '2020-2-1')
;
--*****END EXAMPLE DATA*****
--Create a list of all truck_ids. It would be more helpful
--if you had a table that defined all the trucks (i.e. dbo.trucks).
WITH AllTrucks as (
SELECT truck_id
FROM TruckMaintenanceExpense
UNION ALL
SELECT truck_id
FROM Expedition
)
SELECT DISTINCT
a.truck_id
--Use ISNULL to make sure we have 0.00 if the truck is missing
--either revenue or expense.
, ISNULL(tRev.total_revenue,0) - ISNULL(tExp.total_expense,0) as difference
--Get the list of truck_ids. Use SELECT DISTINCT to eliminate duplicates.
FROM AllTrucks as a
--Get a list of all trucks with Expedition revenue.
LEFT OUTER JOIN (
SELECT
e.truck_id
, YEAR(e.departure_date) as [year]
, SUM(e.last_revenue) as total_revenue
FROM Expedition as e
WHERE e.departure_date >= '2020-1-1 00:00:00'
AND e.departure_date < '2021-1-1 00:00:00'
GROUP BY e.truck_id, YEAR(departure_date)
) as tRev
ON tRev.truck_id = a.truck_id
--Get a list of all trucks with maintenance expenses.
LEFT OUTER JOIN (
SELECT
tme.truck_id
, YEAR(tme.payment_date) as [year]
, SUM(tme.price_dollar) as total_expense
FROM TruckMaintenanceExpense as tme
WHERE tme.payment_date >= '2020-1-1 00:00:00'
AND tme.payment_date < '2021-1-1 00:00:00'
GROUP BY tme.truck_id, YEAR(payment_date)
) as tExp
ON tExp.truck_id = a.truck_id
;
truck_id   difference
35hh52     -199.57
35kd85     -4686.77
fiddle
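The "aggregate each side first, then combine" idea described above can also be checked outside the database. This is a minimal Python sketch under the assumption that each table has already been filtered to the wanted date range; the helper names `totals` and `net_per_truck` are invented for the example, and `Decimal` stands in for the `decimal(19,2)` columns.

```python
from collections import defaultdict
from decimal import Decimal

# Same rows as the EXAMPLE DATA section (dates omitted, all within 2020).
expedition = [
    ("35kd85", Decimal("2000.00")),
    ("35kd85", Decimal("3500.00")),
    ("35hh52", Decimal("300.00")),
    ("35hh52", Decimal("258.98")),
]
maintenance = [
    ("35kd85", Decimal("9865.23")),
    ("35kd85", Decimal("321.54")),
    ("35hh52", Decimal("159.78")),
    ("35hh52", Decimal("598.77")),
]


def totals(rows):
    """Sum the amounts per truck_id (the GROUP BY step)."""
    out = defaultdict(Decimal)  # Decimal() is 0
    for truck_id, amount in rows:
        out[truck_id] += amount
    return out


def net_per_truck(revenue_rows, expense_rows):
    """Revenue minus expense for every truck that appears on either side."""
    revenue = totals(revenue_rows)
    expense = totals(expense_rows)
    zero = Decimal("0")
    trucks = sorted(set(revenue) | set(expense))  # the AllTrucks list
    # .get(..., zero) plays the role of ISNULL(..., 0) after a LEFT JOIN.
    return {t: revenue.get(t, zero) - expense.get(t, zero) for t in trucks}


difference = net_per_truck(expedition, maintenance)
for truck_id, value in difference.items():
    print(truck_id, value)
```

Summing each table separately before subtracting matters: joining the raw rows first would multiply every revenue row by every expense row for the same truck and inflate both sums.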
I have the following table (Oracle database):
ID  valid_from  valid_to
1   01.01.22    28.02.22
1   01.03.22    30.06.22
1   01.07.22    31.12.22
1   01.01.23    null
2   01.01.22    31.03.22
2   01.04.22    null
How do I best extract now all date ranges without overlaps over both IDs? The final result set should look like:
valid_from  valid_to
01.01.22    28.02.22
01.03.22    31.03.22
01.04.22    30.06.22
01.07.22    31.12.22
01.01.23    null
Null stands for max_date (PL / SQL Oracle Max Date).
Moreover, I should only select the values valid for the current year (let's assume we are already in 2022).
Thanks for your help in advance!
You can use the following SELECT statement:
with
-- main table
t1 AS (SELECT w, q1, q2, to_date(q1,'dd.mm.yy') q1d, to_date(q2,'dd.mm.yy') q2d FROM www)
-- custom year in YYYY format
, t0 AS (SELECT '2022' y FROM dual)
-- join and order dates FROM - TO
, t2 AS (SELECT t1.q1, t1.q1d, s2.q2, s2.q2d
FROM t1
LEFT JOIN t1 s2 on t1.q1d <= s2.q2d
ORDER BY t1.q1d, s2.q2d)
-- mark the first each new row-pair by row_number()
, t3 AS (SELECT t2.*,
row_number() OVER (PARTITION BY t2.q1d ORDER BY t2.q1d ) r
FROM t2 )
-- join custom year value and select desired rows based on that value
SELECT q1, q2 FROM t3
JOIN t0 on 1=1
WHERE r = 1
-- for the custom year
AND t0.y <= to_char(q1d, 'yyyy')
ORDER BY q1d;
Demo
In my example table the dates are stored as varchar2 in dd.mm.yy format. If your columns already have the DATE datatype, you don't need to apply to_date() to those two fields.
Used table sample:
create table www (w integer, q1 varchar2(30), q2 varchar2(30));
insert into www values (1, '01.01.22', '28.02.22');
insert into www values (1, '01.03.22', '30.06.22');
insert into www values (1, '01.07.22', '31.12.22');
insert into www values (1, '01.01.23', '');
insert into www values (2, '01.01.22', '31.03.22');
insert into www values (2, '01.04.22', '');
If your table has more rows with a null valid_to whose valid_from dates do not fall within any other range, for example:
insert into www values (1, '01.01.24', '');
then the previous solution will produce additional rows with a null value at the end.
In this case you can use this more complex solution:
...
-- join custom year value and select desired rows based on that value
, t4 as (SELECT q1, q2, q1d FROM t3
JOIN t0 on 1=1
WHERE r = 1 AND
-- for the custom year
t0.y <= to_char(q1d, 'yyyy')
ORDER BY q1d)
-- filter non-nullable rows
, t5 as ( SELECT q1, q2 FROM t4 WHERE Q2 IS NOT NULL )
-- max date from rows where Q2 field has null value
, t6 as ( SELECT to_char(MAX(Q1D),'dd.mm.yy') q1, q2
FROM t4
WHERE Q2 IS NULL
GROUP BY q2)
-- append rows with max date
SELECT * FROM t5
UNION ALL
SELECT * FROM t6;
Demo
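For comparison, the requested result can also be described as "cut the timeline at every boundary date and keep the pieces that some row covers". The sketch below shows that idea in Python; it is not a translation of the queries above. It assumes valid_to is inclusive, uses `None` for the open-ended (null) valid_to, and leaves out the current-year filter. `split_ranges` is a name made up for the example.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def split_ranges(ranges):
    """ranges: (valid_from, valid_to) pairs; valid_to=None means open-ended.

    Returns non-overlapping (valid_from, valid_to) segments in date order.
    """
    # Every start date and every day after an end date is a cut point.
    points = {start for start, _ in ranges}
    points |= {end + ONE_DAY for _, end in ranges if end is not None}
    points = sorted(points)

    segments = []
    for i, seg_start in enumerate(points):
        # A segment runs up to the day before the next cut point;
        # after the last cut point it is open-ended.
        seg_end = points[i + 1] - ONE_DAY if i + 1 < len(points) else None
        covered = any(
            start <= seg_start
            and (end is None or (seg_end is not None and end >= seg_end))
            for start, end in ranges
        )
        if covered:
            segments.append((seg_start, seg_end))
    return segments


# Rows of both IDs from the question.
rows = [
    (date(2022, 1, 1), date(2022, 2, 28)),
    (date(2022, 3, 1), date(2022, 6, 30)),
    (date(2022, 7, 1), date(2022, 12, 31)),
    (date(2023, 1, 1), None),
    (date(2022, 1, 1), date(2022, 3, 31)),
    (date(2022, 4, 1), None),
]

segments = split_ranges(rows)
for valid_from, valid_to in segments:
    print(valid_from, valid_to)
```

In SQL the same cut points could come from a UNION of valid_from and valid_to + 1, paired with the next value via LEAD; whether that performs well on a large table would need testing.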
I have a problem identifying and fixing records with overlapping time intervals in an SCD type 2 dimension.
What I have is:
Bkey Uid startDate endDate
'John' 1 1990-01-01 (some time stamp) 2017-01-10 (some time stamp)
'John' 2 2016-11-03 (some time stamp) 2016-11-14 (some time stamp)
'John' 3 2016-11-14 (some time stamp) 2016-12-29 (some time stamp)
'John' 4 2016-12-29 (some time stamp) 2017-01-10 (some time stamp)
'John' 5 2017-01-10 (some time stamp) 2017-04-22 (some time stamp)
......
I want to find (first) all the Johns that have overlapping time periods, in a table with lots and lots of Johns, and then figure out a way to correct those overlapping periods. For the latter I know there are functions such as LAG and LEAD that can handle it, but it eludes me how to find the overlaps.
Any hints?
Regards,
[ 1 ] The following query returns overlapping time ranges:
SELECT *,
(
SELECT *
FROM #Dimension1 y
WHERE x.Bkey = y.Bkey
AND x.Uid <> y.Uid
AND NOT(x.startDate > y.endDate OR x.endDate < y.startDate)
FOR XML RAW, ROOT, TYPE
) OverlappingTimeRanges
FROM #Dimension1 x
Full script:
CREATE TABLE #Dimension1 (
Bkey VARCHAR(50) NOT NULL,
Uid INT NOT NULL,
startDate DATE NOT NULL,
endDate DATE NOT NULL,
CHECK(startDate < endDate)
);
INSERT #Dimension1
SELECT 'John', 1, '1990-01-01', '2017-01-10' UNION ALL
SELECT 'John', 2, '2016-11-03', '2016-11-14' UNION ALL
SELECT 'John', 3, '2016-11-14', '2016-12-29' UNION ALL
SELECT 'John', 4, '2016-12-29', '2017-01-10' UNION ALL
SELECT 'John', 5, '2017-01-11', '2017-04-22';
SELECT *,
(
SELECT *
FROM #Dimension1 y
WHERE x.Bkey = y.Bkey
AND x.Uid <> y.Uid
AND NOT(x.startDate > y.endDate OR x.endDate < y.startDate)
FOR XML RAW, ROOT, TYPE
) OverlappingTimeRanges
FROM #Dimension1 x
Demo here
[ 2 ] To find distinct groups of time ranges with overlapping original rows, I would use the following approach:
-- Edit 1
CREATE TABLE #Groups (
Bkey VARCHAR(50) NOT NULL,
Uid INT NOT NULL,
startDateNew DATE NOT NULL,
endDateNew DATE NOT NULL,
CHECK(startDateNew < endDateNew)
);
INSERT #Groups
SELECT x.Bkey, x.Uid, z.startDateNew, z.endDateNew
FROM #Dimension1 x
OUTER APPLY (
SELECT MIN(y.startDate) AS startDateNew, MAX(y.endDate) AS endDateNew
FROM #Dimension1 y
WHERE x.Bkey = y.Bkey
AND NOT(x.startDate > y.endDate OR x.endDate < y.startDate)
) z
-- End of Edit 1
-- This returns distinct groups identified by DistinctGroupId together with all overlapping Uid(s) from current group
SELECT *
FROM (
SELECT ROW_NUMBER() OVER(ORDER BY b.Bkey, b.startDateNew, b.endDateNew) AS DistinctGroupId, b.*
FROM (
SELECT DISTINCT a.Bkey, a.startDateNew, a.endDateNew
FROM #Groups a
) b
) c
OUTER APPLY (
SELECT d.Uid AS Overlapping_Uid
FROM #Groups d
WHERE c.Bkey = d.Bkey
AND c.startDateNew = d.startDateNew
AND c.endDateNew = d.endDateNew
) e
-- This returns distinct groups identified by DistinctGroupId together with an XML (XmlCol) which includes overlapping Uid(s)
SELECT *
FROM (
SELECT ROW_NUMBER() OVER(ORDER BY b.Bkey, b.startDateNew, b.endDateNew) AS DistinctGroupId, b.*
FROM (
SELECT DISTINCT a.Bkey, a.startDateNew, a.endDateNew
FROM #Groups a
) b
) c
OUTER APPLY (
SELECT (
SELECT d.Uid AS Overlapping_Uid
FROM #Groups d
WHERE c.Bkey = d.Bkey
AND c.startDateNew = d.startDateNew
AND c.endDateNew = d.endDateNew
FOR XML RAW, TYPE
) AS XmlCol
) e
Note: Last range used in my example is 'John', 5, '2017-01-11', '2017-04-22'; and not 'John', 5, '2017-01-10', '2017-04-22';. Also, data type used is DATE and not DATETIME[2][OFFSET].
I think the tricky part of your query is being able to articulate the logic for overlapping ranges. We can self join on the condition that a row on the left overlaps with any row on the right. All matching rows are those which overlap.
We can think of four possible overlap scenarios:
|---------|   |---------|    no overlap
|---------|
     |---------|             1st end and 2nd start overlap
     |---------|
|---------|                  1st start and 2nd end overlap
|---------|
   |---|                     2nd completely contained inside 1st
                             (could be 1st inside 2nd also)
SELECT DISTINCT
    t1.Uid
FROM yourTable t1
INNER JOIN yourTable t2
    ON t1.Uid <> t2.Uid AND            -- do not match a row with itself
       t1.startDate <= t2.endDate AND
       t2.startDate <= t1.endDate
WHERE
    t1.Bkey = 'John' AND t2.Bkey = 'John'
This will at least let you identify overlapping records. Updating and separating them in a meaningful way will probably end up being an ugly gaps and islands problem, perhaps meriting another question.
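The join condition can be tried out on the sample rows without a database. Below is a minimal Python sketch of the same predicate; the `inclusive` flag is an addition for illustration, since whether two ranges that only share an endpoint count as overlapping depends on how endDate is defined in your dimension. The rows use the variant where Uid 5 starts on 2017-01-11.

```python
from datetime import date


def overlaps(a_start, a_end, b_start, b_end, inclusive=True):
    """True if range a and range b share at least one point in time.

    inclusive=True mirrors the <= comparison in the join; inclusive=False
    treats ranges that merely touch at an endpoint as not overlapping.
    """
    if inclusive:
        return a_start <= b_end and b_start <= a_end
    return a_start < b_end and b_start < a_end


def overlapping_uids(rows):
    """rows: (Bkey, Uid, startDate, endDate). Returns Uids that overlap another row."""
    found = set()
    for bkey1, uid1, start1, end1 in rows:
        for bkey2, uid2, start2, end2 in rows:
            if bkey1 == bkey2 and uid1 != uid2 and overlaps(start1, end1, start2, end2):
                found.add(uid1)
    return sorted(found)


rows = [
    ("John", 1, date(1990, 1, 1), date(2017, 1, 10)),
    ("John", 2, date(2016, 11, 3), date(2016, 11, 14)),
    ("John", 3, date(2016, 11, 14), date(2016, 12, 29)),
    ("John", 4, date(2016, 12, 29), date(2017, 1, 10)),
    ("John", 5, date(2017, 1, 11), date(2017, 4, 22)),
]

result = overlapping_uids(rows)
print(result)
```

The single pair of comparisons covers all three overlapping scenarios in the diagram, which is why no separate case for "contained inside" is needed.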
We can achieve this with a self join of the emp table.
a.emp_id != b.emp_id ensures that a row is not joined with itself.
The remaining comparison clauses check whether a row's start date or end date falls within another row's date range.
create table emp(name varchar(20), emp_id numeric(10), start_date date, end_date date);
insert into emp values('John', 1, '1990-01-01', '2017-01-10');
insert into emp values( 'John', 2, '2016-11-03', '2016-11-14');
insert into emp values( 'John', 3, '2016-11-14', '2016-12-29');
insert into emp values( 'John', 4, '2016-12-29', '2017-01-10');
insert into emp values( 'John', 5, '2017-01-11', '2017-04-22');
commit;
with A as (select * from EMP),
B as (select * from EMP)
select A.* from A,B where A.EMP_ID != B.EMP_ID
and A.START_DATE < B.END_DATE and B.START_DATE < A.END_DATE
and (A.START_DATE between B.START_DATE and B.END_DATE
or A.END_DATE between B.START_DATE and B.END_DATE);
I have a dataset of hospitalisations ('spells') - 1 row per spell. I want to drop any spells recorded within a week after another (there could be multiple) - the rationale being that they're likely symptomatic of the same underlying cause. Here is some play data:
create table hif_user.rzb_recurse_src (
patid integer not null,
eventdate integer not null,
type smallint not null
);
insert into hif_user.rzb_recurse_src values (1,1,1);
insert into hif_user.rzb_recurse_src values (1,3,2);
insert into hif_user.rzb_recurse_src values (1,5,2);
insert into hif_user.rzb_recurse_src values (1,9,2);
insert into hif_user.rzb_recurse_src values (1,14,2);
insert into hif_user.rzb_recurse_src values (2,1,1);
insert into hif_user.rzb_recurse_src values (2,5,1);
insert into hif_user.rzb_recurse_src values (2,19,2);
Only spells of type 2 - within a week after any other - are to be dropped. Type 1 spells are to remain.
For patient 1, dates 1 & 9 should be kept. For patient 2, all rows should remain.
The issue is with patient 1. Spell date 9 is identified for dropping as it is close to spell date 5; however, as spell date 5 is close to spell date 1, it should be dropped, therefore allowing spell date 9 to live...
So, it seems a recursive problem. However, I've not used recursive programming in SQL before and I'm struggling to really picture how to do it. Can anyone help? I should add that I'm using Teradata which has more restrictions than most with recursive SQL (only UNION ALL sets allowed I believe).
This is cursor logic: you check one row after the other to see whether it fits your rules, so recursion is the easiest (maybe the only) way to solve your problem.
To get a decent performance you need a Volatile Table to facilitate this row-by-row processing:
CREATE VOLATILE TABLE vt (patid, eventdate, exac_type, rn) AS
(
SELECT r.*
,ROW_NUMBER() -- needed to facilitate the join
OVER (PARTITION BY patid ORDER BY eventdate) AS rn
FROM hif_user.rzb_recurse_src AS r
) WITH DATA ON COMMIT PRESERVE ROWS;
WITH RECURSIVE cte (patid, eventdate, exac_type, rn, startdate) AS
(
SELECT vt.*
,eventdate AS startdate
FROM vt
WHERE rn = 1 -- start with the first row
UNION ALL
SELECT vt.*
-- check if type = 1 or more than 7 days from the last eventdate
,CASE WHEN vt.eventdate > cte.startdate + 7
OR vt.exac_type = 1
THEN vt.eventdate -- new start date
ELSE cte.startdate -- keep old date
END
FROM vt JOIN cte
ON vt.patid = cte.patid
AND vt.rn = cte.rn + 1 -- proceed to next row
)
SELECT *
FROM cte
WHERE eventdate - startdate = 0 -- only new start days
order by patid, eventdate
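The row-by-row rule that the recursive query walks through can be written as a short loop, which makes it easier to check against the expected rows. This is a Python sketch of the same rule (a spell starts a new window if it is of type 1 or lies more than 7 days after the current window start, and only window starts are kept); `keep_spells` is a name invented for the example.

```python
from itertools import groupby


def keep_spells(rows, window=7):
    """rows: (patid, eventdate, type) tuples. Returns the rows to keep."""
    kept = []
    # Process each patient's spells in date order, like the rn join.
    for patid, grp in groupby(sorted(rows), key=lambda r: r[0]):
        start = None  # eventdate of the last kept spell for this patient
        for _, eventdate, spell_type in grp:
            if start is None or spell_type == 1 or eventdate > start + window:
                start = eventdate  # new start date: this spell is kept
                kept.append((patid, eventdate, spell_type))
            # otherwise keep the old start date and drop the spell
    return kept


# Same rows as hif_user.rzb_recurse_src.
spells = [
    (1, 1, 1), (1, 3, 2), (1, 5, 2), (1, 9, 2), (1, 14, 2),
    (2, 1, 1), (2, 5, 1), (2, 19, 2),
]

kept = keep_spells(spells)
for row in kept:
    print(row)
```

Because each decision depends on the previous kept row rather than on the previous row, a plain LAG over the table cannot express it, which is the reason for the recursion.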
I think the key to solving this is getting the first date more than 7 days from the current date and then doing a recursive subquery:
with rrs as (
select rrs.*,
(select min(rrs2.eventdate)
from hif_user.rzb_recurse_src rrs2
where rrs2.patid = rrs.patid and
rrs2.eventdate > rrs.eventdate + 7
) as eventdate7
from hif_user.rzb_recurse_src rrs
),
recursive cte as (
select patid, min(eventdate) as eventdate, min(eventdate7) as eventdate7
from rrs
group by patid
union all
select cte.patid, cte.eventdate7, rrs.eventdate7
from cte join
rrs
on rrs.patid = cte.patid and
rrs.eventdate = cte.eventdate7
)
select cte.patid, cte.eventdate
from cte;
If you want additional columns, then join in the original table at the last step.
I have an issue in Teradata where I am trying to build a historical contract table that lists a system, its corresponding contracts and the start and end dates of each contract. This table would then be queried for reporting as a point in time table. Here is some code to better explain.
CREATE TABLE TMP_WORK_DB.SOLD_SYSTEMS
(
SYSTEM_ID varchar(5),
CONTRACT_TYPE varchar(10),
CONTRACT_RANK int,
CONTRACT_STRT_DT date,
CONTRACT_END_DT date
);
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('AAA', 'BEST', 10, '2012-01-01', '2012-06-30');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('AAA', 'BEST', 9, '2012-01-01', '2012-06-30');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('AAA', 'OK', 1, '2012-08-01', '2012-12-30');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('BBB', 'BEST', 10, '2013-12-01', '2014-03-02');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('BBB', 'BETTER', 7, '2013-12-01', '2017-03-02');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('BBB', 'GOOD', 4, '2016-12-02', '2017-12-02');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('CCC', 'BEST', 10, '2009-10-13', '2014-10-14');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('CCC', 'BETTER', 7, '2009-10-13', '2016-10-14');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('CCC', 'OK', 2, '2008-10-13', '2017-10-14');
The required output would be:
SYSTEM_ID CONTRACT_TYPE CONTRACT_STRT_DT CONTRACT_END_DT CONTRACT_RANK
AAA BEST 01/01/2012 06/30/2012 10
AAA OK 08/01/2012 12/30/2012 1
BBB BEST 12/01/2013 03/02/2014 10
BBB BETTER 03/03/2014 03/02/2017 7
BBB GOOD 03/03/2017 12/02/2017 4
CCC OK 10/13/2008 10/12/2009 2
CCC BEST 10/13/2009 10/14/2014 10
CCC BETTER 10/15/2014 10/14/2016 7
CCC OK 10/15/2016 10/14/2017 2
I'm not necessarily looking to reduce rows but am looking to get the correct state of the system_id at any given point in time. Note that when a higher ranked contract ends and a lower ranked contract is still active the lower ranked picks up where the higher one left off.
We are using TD 14 and I have been able to get the easy records where the dates flow sequentially and are of higher rank but am having trouble with the overlaps where two different ranked contracts cover multiple date spans.
I found this blog post (Sharpening Stones) and got it working for the most part but am still having trouble setting the new start dates for the overlapping contracts.
Any help would be appreciated. Thanks.
UPDATE 04/04/2014
I came up with the following code, which gives me exactly what I want, but I'm not sure of the performance. It works on smaller data sets of a few hundred rows, but I haven't tested it on several million:
UPDATE 04/07/2014
Updated the date subquery due to spool issues. This query explodes all days where the contract is possibly active and then uses the ROW_NUMBER function to get the highest ranked CONTRACT_TYPE per day. The MIN/MAX functions are then partitioned over the system and contract type to pick up when the highest ranked contract type changes.
UPDATE - 2 - 04/07/2014
I cleaned up the query and it seems to perform a little better.
SELECT
SYSTEM_ID
, CONTRACT_TYPE
, MIN(CALENDAR_DATE) NEW_START_DATE
, MAX(CALENDAR_DATE) NEW_END_DATE
, CONTRACT_RANK
FROM (
SELECT
CALENDAR_DATE
, SYSTEM_ID
, CONTRACT_TYPE
, CONTRACT_RANK
, ROW_NUMBER() OVER (PARTITION BY SYSTEM_ID, CALENDAR_DATE ORDER BY CONTRACT_RANK DESC, CONTRACT_STRT_DT DESC, CONTRACT_END_DT DESC) AS RNK
FROM SOLD_SYSTEMS t1
JOIN (
SELECT CALENDAR_DATE
FROM FULL_CALENDAR_TABLE ia
WHERE CALENDAR_DATE > DATE'2013-01-01'
)dt
ON CALENDAR_DATE BETWEEN CONTRACT_STRT_DT AND CONTRACT_END_DT
QUALIFY RNK = 1
)z1
GROUP BY 1,2,5
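The "explode to days, keep the top-ranked contract per day, then collapse" idea can be checked against the required output with a small script. The following Python sketch makes two assumptions that differ from the query: ties on rank are resolved by whichever row is seen first (the query also orders by start and end date), and a run is closed whenever the days stop being consecutive, so a contract that wins twice for the same system yields two rows. `point_in_time` is a hypothetical helper name.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def point_in_time(contracts):
    """contracts: (system_id, contract_type, rank, start, end), end inclusive.

    Returns (system_id, contract_type, new_start, new_end, rank) rows.
    """
    best = {}  # (system_id, day) -> (rank, contract_type)
    for system_id, ctype, rank, start, end in contracts:
        day = start
        while day <= end:  # explode the contract into single days
            key = (system_id, day)
            if key not in best or rank > best[key][0]:
                best[key] = (rank, ctype)  # highest rank wins the day
            day += ONE_DAY

    rows = []
    for (system_id, day), (rank, ctype) in sorted(best.items()):
        last = rows[-1] if rows else None
        if (last and last[0] == system_id and last[1] == ctype
                and last[4] == rank and last[3] + ONE_DAY == day):
            rows[-1] = (system_id, ctype, last[2], day, rank)  # extend the run
        else:
            rows.append((system_id, ctype, day, day, rank))  # start a new run
    return rows


# Same rows as SOLD_SYSTEMS.
sold_systems = [
    ("AAA", "BEST", 10, date(2012, 1, 1), date(2012, 6, 30)),
    ("AAA", "BEST", 9, date(2012, 1, 1), date(2012, 6, 30)),
    ("AAA", "OK", 1, date(2012, 8, 1), date(2012, 12, 30)),
    ("BBB", "BEST", 10, date(2013, 12, 1), date(2014, 3, 2)),
    ("BBB", "BETTER", 7, date(2013, 12, 1), date(2017, 3, 2)),
    ("BBB", "GOOD", 4, date(2016, 12, 2), date(2017, 12, 2)),
    ("CCC", "BEST", 10, date(2009, 10, 13), date(2014, 10, 14)),
    ("CCC", "BETTER", 7, date(2009, 10, 13), date(2016, 10, 14)),
    ("CCC", "OK", 2, date(2008, 10, 13), date(2017, 10, 14)),
]

history = point_in_time(sold_systems)
for row in history:
    print(row)
```

The CCC / OK contract is the case to watch: it is the top-ranked contract in two separate stretches, so grouping only by system, type and rank with MIN/MAX would merge them into one range.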
The following approach uses the new PERIOD functions in TD13.10.
-- 1. TD_SEQUENCED_COUNT can't be used in joins, so create a Volatile Table
-- 2. TD_SEQUENCED_COUNT can't use additional columns (e.g. CONTRACT_RANK),
-- so simply create a new row whenever a period starts or ends without
-- considering CONTRACT_RANK
CREATE VOLATILE TABLE vt AS
(
WITH cte
(
SYSTEM_ID
,pd
)
AS
(
SELECT
SYSTEM_ID
-- PERIODs can easily be constructed on-the-fly, but the end date is not inclusive,
-- so I had to adjust to your implementation, CONTRACT_END_DT +/- 1:
,PERIOD(CONTRACT_STRT_DT, CONTRACT_END_DT + 1) AS pd
FROM SOLD_SYSTEMS
)
SELECT
SYSTEM_ID
,BEGIN(pd) AS CONTRACT_STRT_DT
,END(pd) - 1 AS CONTRACT_END_DT
FROM
TABLE (TD_SEQUENCED_COUNT
(NEW VARIANT_TYPE(cte.SYSTEM_ID)
,cte.pd)
RETURNS (SYSTEM_ID VARCHAR(5)
,Policy_Count INTEGER
,pd PERIOD(DATE))
HASH BY SYSTEM_ID
LOCAL ORDER BY SYSTEM_ID ,pd) AS dt
)
WITH DATA
PRIMARY INDEX (SYSTEM_ID)
ON COMMIT PRESERVE ROWS
;
-- Find the matching CONTRACT_RANK
SELECT
vt.SYSTEM_ID
,t.CONTRACT_TYPE
,vt.CONTRACT_STRT_DT
,vt.CONTRACT_END_DT
,t.CONTRACT_RANK
FROM vt
-- If both vt and SOLD_SYSTEMS have a NUPI on SYSTEM_ID this join should be
-- quite efficient
JOIN SOLD_SYSTEMS AS t
ON vt.SYSTEM_ID = t.SYSTEM_ID
AND ( t.CONTRACT_STRT_DT, t.CONTRACT_END_DT)
OVERLAPS (vt.CONTRACT_STRT_DT, vt.CONTRACT_END_DT)
QUALIFY
-- As multiple contracts for the same period are possible:
-- find the row with the highest rank
ROW_NUMBER()
OVER (PARTITION BY vt.SYSTEM_ID,vt.CONTRACT_STRT_DT
ORDER BY t.CONTRACT_RANK DESC, vt.CONTRACT_END_DT DESC) = 1
ORDER BY 1,3
;
-- Previous query might return consecutive rows with the same CONTRACT_RANK, e.g.
-- BBB BETTER 2014-03-03 2016-12-01 7
-- BBB BETTER 2016-12-02 2017-03-02 7
-- If you don't want that you have to normalize the data:
WITH cte
(
SYSTEM_ID
,CONTRACT_STRT_DT
,CONTRACT_END_DT
,CONTRACT_RANK
,CONTRACT_TYPE
,pd
)
AS
(
SELECT
vt.SYSTEM_ID
,vt.CONTRACT_STRT_DT
,vt.CONTRACT_END_DT
,t.CONTRACT_RANK
,t.CONTRACT_TYPE
,PERIOD(vt.CONTRACT_STRT_DT, vt.CONTRACT_END_DT + 1) AS pd
FROM vt
JOIN SOLD_SYSTEMS AS t
ON vt.SYSTEM_ID = t.SYSTEM_ID
AND ( t.CONTRACT_STRT_DT, t.CONTRACT_END_DT)
OVERLAPS (vt.CONTRACT_STRT_DT, vt.CONTRACT_END_DT)
QUALIFY
ROW_NUMBER()
OVER (PARTITION BY vt.SYSTEM_ID,vt.CONTRACT_STRT_DT
ORDER BY t.CONTRACT_RANK DESC, vt.CONTRACT_END_DT DESC) = 1
)
SELECT
SYSTEM_ID
,CONTRACT_TYPE
,BEGIN(pd) AS CONTRACT_STRT_DT
,END(pd) - 1 AS CONTRACT_END_DT
,CONTRACT_RANK
FROM
TABLE (TD_NORMALIZE_MEET
(NEW VARIANT_TYPE(cte.SYSTEM_ID
,cte.CONTRACT_RANK
,cte.CONTRACT_TYPE)
,cte.pd)
RETURNS (SYSTEM_ID VARCHAR(5)
,CONTRACT_RANK INT
,CONTRACT_TYPE VARCHAR(10)
,pd PERIOD(DATE))
HASH BY SYSTEM_ID
LOCAL ORDER BY SYSTEM_ID, CONTRACT_RANK, CONTRACT_TYPE, pd ) A
ORDER BY 1, 3;
Edit: This is another way to get the result of the 2nd query without Volatile Table and TD_SEQUENCED_COUNT:
SELECT
t.SYSTEM_ID
,t.CONTRACT_TYPE
,BEGIN(CONTRACT_PERIOD) AS CONTRACT_STRT_DT
,END(CONTRACT_PERIOD)- 1 AS CONTRACT_END_DT
,t.CONTRACT_RANK
,dt.p P_INTERSECT PERIOD(t.CONTRACT_STRT_DT,t.CONTRACT_END_DT + 1) AS CONTRACT_PERIOD
FROM
(
SELECT
dt.SYSTEM_ID
,PERIOD(d, MIN(d)
OVER (PARTITION BY dt.SYSTEM_ID
ORDER BY d
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)) AS p
FROM
(
SELECT
SYSTEM_ID
,CONTRACT_STRT_DT AS d
FROM SOLD_SYSTEMS
UNION
SELECT
SYSTEM_ID
,CONTRACT_END_DT + 1 AS d
FROM SOLD_SYSTEMS
) AS dt
QUALIFY p IS NOT NULL
) AS dt
JOIN SOLD_SYSTEMS AS t
ON dt.SYSTEM_ID = t.SYSTEM_ID
WHERE CONTRACT_PERIOD IS NOT NULL
QUALIFY
ROW_NUMBER()
OVER (PARTITION BY dt.SYSTEM_ID,p
ORDER BY t.CONTRACT_RANK DESC, t.CONTRACT_END_DT DESC) = 1
ORDER BY 1,3
And based on that you can also include the normalization in a single query:
WITH cte
(
SYSTEM_ID
,CONTRACT_TYPE
,CONTRACT_STRT_DT
,CONTRACT_END_DT
,CONTRACT_RANK
,pd
)
AS
(
SELECT
t.SYSTEM_ID
,t.CONTRACT_TYPE
,BEGIN(CONTRACT_PERIOD) AS CONTRACT_STRT_DT
,END(CONTRACT_PERIOD)- 1 AS CONTRACT_END_DT
,t.CONTRACT_RANK
,dt.p P_INTERSECT PERIOD(t.CONTRACT_STRT_DT,t.CONTRACT_END_DT + 1) AS CONTRACT_PERIOD
FROM
(
SELECT
dt.SYSTEM_ID
,PERIOD(d, MIN(d)
OVER (PARTITION BY dt.SYSTEM_ID
ORDER BY d
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)) AS p
FROM
(
SELECT
SYSTEM_ID
,CONTRACT_STRT_DT AS d
FROM SOLD_SYSTEMS
UNION
SELECT
SYSTEM_ID
,CONTRACT_END_DT + 1 AS d
FROM SOLD_SYSTEMS
) AS dt
QUALIFY p IS NOT NULL
) AS dt
JOIN SOLD_SYSTEMS AS t
ON dt.SYSTEM_ID = t.SYSTEM_ID
WHERE CONTRACT_PERIOD IS NOT NULL
QUALIFY
ROW_NUMBER()
OVER (PARTITION BY dt.SYSTEM_ID,p
ORDER BY t.CONTRACT_RANK DESC, t.CONTRACT_END_DT DESC) = 1
)
SELECT
SYSTEM_ID
,CONTRACT_TYPE
,BEGIN(pd) AS CONTRACT_STRT_DT
,END(pd) - 1 AS CONTRACT_END_DT
,CONTRACT_RANK
FROM
TABLE (TD_NORMALIZE_MEET
(NEW VARIANT_TYPE(cte.SYSTEM_ID
,cte.CONTRACT_RANK
,cte.CONTRACT_TYPE)
,cte.pd)
RETURNS (SYSTEM_ID VARCHAR(5)
,CONTRACT_RANK INT
,CONTRACT_TYPE VARCHAR(10)
,pd PERIOD(DATE))
HASH BY SYSTEM_ID
LOCAL ORDER BY SYSTEM_ID, CONTRACT_RANK, CONTRACT_TYPE, pd ) A
ORDER BY 1, 3;
SEL system_id,contract_type,MAX(contract_rank),
CASE WHEN contract_strt_dt<prev_end_dt THEN prev_end_dt+1
ELSE contract_strt_dt
END AS new_start ,contract_strt_dt,contract_end_dt,
MIN(contract_end_dt) OVER (PARTITION BY system_id
ORDER BY contract_strt_dt,contract_end_dt ROWS BETWEEN 1 PRECEDING
AND 1 PRECEDING) prev_end_dt
FROM sold_systems
GROUP BY system_id,contract_type,contract_strt_dt,contract_end_dt
ORDER BY contract_strt_dt,contract_end_dt,prev_end_dt
I think I've got it.
Try this:
select SYSTEM_ID, CONTRACT_TYPE,CONTRACT_RANK,
case
when CONTRACT_STRT_DT<NEW_START_DATE then NEW_START_DATE /* if NEW_START_DATE is later than the contract start date, use NEW_START_DATE */
else CONTRACT_STRT_DT
end as new_contract_str_dt,
CONTRACT_END_DT
from
(select t1.SYSTEM_ID,t1.CONTRACT_TYPE,t1.CONTRACT_RANK,t1.CONTRACT_STRT_DT,t1.CONTRACT_END_DT,
coalesce(max(t1.CONTRACT_END_DT) over (partition by t1.SYSTEM_ID order by t1.CONTRACT_RANK desc rows between UNBOUNDED PRECEDING and 1 preceding ), t1.CONTRACT_STRT_DT) NEW_START_DATE
from SOLD_SYSTEMS t1
) as temp1
/*you may remove fully overlapped contracts*/
where NEW_START_DATE<=CONTRACT_END_DT
It's simpler and has a good execution plan. You can work with large tables (don't forget to collect statistics).