Merge the records for overlapping dates - sql

I have data as below and want to merge the records with overlapping dates. The merged record should take the MIN start date and the MAX end date of the overlapping records.
Before merge:
Item Code Start_date End_date
============== =========== ===========
111 15-May-2004 20-Jun-2004
111 22-May-2004 07-Jun-2004
111 20-Jun-2004 13-Aug-2004
111 27-May-2004 30-Aug-2004
111 02-Sep-2004 23-Dec-2004
222 21-May-2004 19-Aug-2004
Required output:
Item Code Start_date End_date
============== =========== ===========
111 15-May-2004 30-Aug-2004
111 02-Sep-2004 23-Dec-2004
222 21-May-2004 19-Aug-2004
You can create the sample data using:
create table item(item_code number, start_date date, end_date date);
insert into item values (111,to_date('15-May-2004','DD-Mon-YYYY'),to_date('20-Jun-2004','DD-Mon-YYYY'));
insert into item values (111,to_date('22-May-2004','DD-Mon-YYYY'),to_date('07-Jun-2004','DD-Mon-YYYY'));
insert into item values (111,to_date('20-Jun-2004','DD-Mon-YYYY'),to_date('13-Aug-2004','DD-Mon-YYYY'));
insert into item values (111,to_date('27-May-2004','DD-Mon-YYYY'),to_date('30-Aug-2004','DD-Mon-YYYY'));
insert into item values (111,to_date('02-Sep-2004','DD-Mon-YYYY'),to_date('23-Dec-2004','DD-Mon-YYYY'));
insert into item values (222,to_date('21-May-2004','DD-Mon-YYYY'),to_date('19-Aug-2004','DD-Mon-YYYY'));
commit;

The code for this type of problem is rather tricky. Here is one approach that works pretty well:
with item (item_code, start_date, end_date) as (
select 111,to_date('15-05-2004','DD-MM-YYYY'),to_date('20-06-2004','DD-MM-YYYY') from dual union all
select 111,to_date('22-05-2004','DD-MM-YYYY'),to_date('07-06-2004','DD-MM-YYYY') from dual union all
select 111,to_date('20-06-2004','DD-MM-YYYY'),to_date('13-08-2004','DD-MM-YYYY') from dual union all
select 111,to_date('27-05-2004','DD-MM-YYYY'),to_date('30-08-2004','DD-MM-YYYY') from dual union all
select 111,to_date('02-09-2004','DD-MM-YYYY'),to_date('23-12-2004','DD-MM-YYYY') from dual union all
select 222,to_date('21-05-2004','DD-MM-YYYY'),to_date('19-08-2004','DD-MM-YYYY') from dual
),
id as (
select item_code, start_date as dte, count(*) as inc
from item
group by item_code, start_date
union all
select item_code, end_date, - count(*) as inc
from item
group by item_code, end_date
),
id2 as (
select id.*, sum(inc) over (partition by item_code order by dte) as running_inc
from id
),
id3 as (
select id2.*, sum(case when running_inc = 0 then 1 else 0 end) over (partition by item_code order by dte desc) as grp
from id2
)
select item_code, min(dte) as start_date, max(dte) as end_date
from id3
group by item_code, grp;
And a rextester to validate it.
What is this doing? Good question. The idea in these problems is to define the adjacent groups. This method does so by counting the number of "starts" and "ends" up to a given date. When the value is 0, a group ends.
The specific steps are as follows:
(1) Break out all the dates onto separate rows along with an indicator of whether the date is a start date or an end date. This indicator is key to defining the ranges: +1 to "enter" and -1 to "exit".
(2) Calculate the running total of the indicators. The 0s in this total are the ends of overlapping ranges.
(3) Do a reverse cumulative sum of the 0s to identify the groups.
(4) Aggregate to get the final results.
You can look at each of the CTEs to see what is happening in the data.
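To look at the intermediate CTEs without an Oracle instance, here is a minimal sketch that runs the same query in SQLite through Python's sqlite3 module. SQLite is only a stand-in, and the dates are stored as ISO strings so that they sort correctly; the names SAMPLE, TRACE_SQL and trace_groups are made up for this illustration.

```python
import sqlite3

# Sample rows from the question, with dates as ISO strings (an assumption for SQLite).
SAMPLE = [
    (111, "2004-05-15", "2004-06-20"),
    (111, "2004-05-22", "2004-06-07"),
    (111, "2004-06-20", "2004-08-13"),
    (111, "2004-05-27", "2004-08-30"),
    (111, "2004-09-02", "2004-12-23"),
    (222, "2004-05-21", "2004-08-19"),
]

# Same CTEs as the answer, but selecting the last CTE instead of aggregating it.
TRACE_SQL = """
with id as (
    select item_code, start_date as dte, count(*) as inc
    from item group by item_code, start_date
    union all
    select item_code, end_date, -count(*)
    from item group by item_code, end_date
),
id2 as (
    select id.*, sum(inc) over (partition by item_code order by dte) as running_inc
    from id
),
id3 as (
    select id2.*,
           sum(case when running_inc = 0 then 1 else 0 end)
               over (partition by item_code order by dte desc) as grp
    from id2
)
select item_code, dte, inc, running_inc, grp
from id3
order by item_code, dte, inc desc
"""

def trace_groups(rows):
    """Return (item_code, dte, inc, running_inc, grp) for every start/end date."""
    con = sqlite3.connect(":memory:")
    con.execute("create table item(item_code integer, start_date text, end_date text)")
    con.executemany("insert into item values (?, ?, ?)", rows)
    return con.execute(TRACE_SQL).fetchall()

for row in trace_groups(SAMPLE):
    print(row)
```

Every row where running_inc is 0 closes a merged range, and grp numbers the ranges from the latest one backwards. Rows that share a date (here 2004-06-20, which is both an end and a start) are summed together by the default window frame, which is why touching ranges end up in the same group.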

It's a variation of the gaps-and-islands problem. First calculate the maximum previous end date for each row. Then keep only the rows where the current row's start date is greater than that maximum: each such row is the start of a new group, and the group's end date is found in the next remaining row.
WITH max_dates AS
(
SELECT
item_code
,start_date
,Max(end_date) -- get the maximum previous end_date
Over (PARTITION BY item_code
ORDER BY start_date
ROWS BETWEEN Unbounded Preceding AND 1 Preceding) AS max_prev_date
,Max(end_date) -- get the maximum overall date (only needed for the last group)
Over (PARTITION BY item_code) AS max_date
FROM item
)
SELECT
item_code
,start_date
,Coalesce(Lead(max_prev_date) -- next row got the end date for the current row
Over (PARTITION BY item_code
ORDER BY start_date)
,max_date ) AS end_date -- no next row for the last row --> overall maximum end_date
FROM max_dates
WHERE max_prev_date < start_date -- maximum previous end date is less than current start date --> start of a new group
OR max_prev_date IS NULL -- first row
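As a quick check of this query against the sample data from the question, the sketch below runs it in SQLite through Python's sqlite3 module. SQLite and the ISO date strings are assumptions made so the example is self-contained; merged_ranges and SAMPLE are illustrative names.

```python
import sqlite3

# Sample rows from the question, with dates as ISO strings (an assumption for SQLite).
SAMPLE = [
    (111, "2004-05-15", "2004-06-20"),
    (111, "2004-05-22", "2004-06-07"),
    (111, "2004-06-20", "2004-08-13"),
    (111, "2004-05-27", "2004-08-30"),
    (111, "2004-09-02", "2004-12-23"),
    (222, "2004-05-21", "2004-08-19"),
]

def merged_ranges(rows):
    """Merge overlapping ranges using the maximum previous end date."""
    con = sqlite3.connect(":memory:")
    con.execute("create table item(item_code integer, start_date text, end_date text)")
    con.executemany("insert into item values (?, ?, ?)", rows)
    sql = """
    with max_dates as (
        select item_code, start_date,
               max(end_date) over (partition by item_code order by start_date
                                   rows between unbounded preceding and 1 preceding)
                   as max_prev_date,
               max(end_date) over (partition by item_code) as max_date
        from item
    )
    select item_code, start_date,
           -- the window function runs after the WHERE filter, so the "next row"
           -- is the next group start; its max_prev_date is this group's end
           coalesce(lead(max_prev_date) over (partition by item_code order by start_date),
                    max_date) as end_date
    from max_dates
    where max_prev_date < start_date or max_prev_date is null
    order by item_code, start_date
    """
    return con.execute(sql).fetchall()

for row in merged_ranges(SAMPLE):
    print(row)
```

The three rows it returns match the required output in the question.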

In SQL Server you can try this. It will give your desired output, but from a performance point of view the query might be slow when there is a large amount of data to check.
DECLARE @item TABLE(item_code int, start_date date, end_date date);
insert into @item values (111,'15-May-2004','20-Jun-2004');
insert into @item values (111,'22-May-2004','07-Jun-2004');
insert into @item values (111,'20-Jun-2004','13-Aug-2004');
insert into @item values (111,'27-May-2004','30-Aug-2004');
insert into @item values (111,'02-Sep-2004','23-Dec-2004');
insert into @item values (222,'21-May-2004','19-Aug-2004');
SELECT * FROM @item WHERE item_code IN (SELECT item_code FROM @item GROUP BY item_code) AND
(start_date IN (SELECT max(start_date) FROM @item GROUP BY item_code) or start_date In (SELECT min(start_date) FROM @item GROUP BY item_code))

With the help of the above answers, I was able to simplify this as below:
WITH max_dates AS
(
SELECT
item_code
,start_date
,end_date
,Max(end_date)
Over (PARTITION BY item_code
ORDER BY start_date
) AS max_date
FROM item
) ,
max_dates1 as
(
select max_dates.* , lag(max_date) over(partition by item_code order by start_date) as MPD from max_dates
)
select ITEM_CODE,start_date,end_date from max_dates1
WHERE MPD < start_date
OR MPD IS NULL

Related

Generate Rows Between Two Dates, Copying Down The Values in the Remaining Columns

I'm trying to write a script that will look at the issue date and termination date for each policy in a table. I want to be able to take those two dates, create a row for each year in between those two dates, and then fill in the values in the remaining columns.
I've been working with a recursive CTE approach in Redshift and I've got to the point where I can create the annual records. The part I'm stuck on is how to include the other columns in the table and fill each of the created rows with the same information as the row above.
For example, if I start with a record that looks something like
policy_number issue_date termination_date issue_state product plan_code
============= ========== ================ =========== ======= =========
001           1985-05-26 2005-03-02       CT          ROP     123456
I want to build a table that would look like this
policy_number issue_date termination_date issue_state product plan_code start_date
============= ========== ================ =========== ======= ========= ==========
001           1985-05-26 2005-03-02       CT          ROP     123456    1985-05-26
001           1985-05-26 2005-03-02       CT          ROP     123456    1986-05-26
001           1985-05-26 2005-03-02       CT          ROP     123456    1987-05-26
...           ...        ...              ...         ...     ...       ...
001           1985-05-26 2005-03-02       CT          ROP     123456    2004-05-26
001           1985-05-26 2005-03-02       CT          ROP     123456    2005-03-02
Here's the code I've got so far:
WITH RECURSIVE start_dt AS
(
SELECT MIN(issue_date) AS s_dt -- step 1: grab start date
FROM myTable
WHERE policy_number = '001'
GROUP BY policy_number
),
end_dt AS
(
SELECT MAX(effective_date) AS e_dt -- step 2: grab the termination date
FROM myTable
WHERE policy_number = '001'
GROUP BY policy_number
),
dates (dt) AS
(
-- start at the start date
SELECT s_dt dt -- selecting the start date from step 1
FROM start_dt
UNION ALL
-- recursive lines
SELECT dateadd(YEAR,1,dt)::DATE dt -- converted to date to avoid type mismatch -- adding annual records until the termination date
FROM dates
WHERE dt <= (SELECT e_dt FROM end_dt)
-- stop at the end date
)
SELECT *
FROM dates
which yields
dt
1985-05-26
1986-05-26
1987-05-26
...
How can I include the rest of the columns in my table? I'm also open to using a cross join if that would be a better approach. I'm expecting this to generate around 10,000,000 rows, so any optimization would be much appreciated.
If I understand correctly you have a table with begin/end dates and you have a process for generating all the needed dates to span the min / max of these. You want to apply this list of dates to the starting table to get all rows replicated between begin and end.
You have a good start - the list of dates. The usual process is to join the dates with the table using inequality conditions. (ON dt >= begin and dt <= end)
You will need to deal with some edge condition around the unique dates for each input row. If you need to maintain these unique dates you will need to fudge the join condition. All doable.
==============================================================
Back from biz trip and can give more concrete guidance.
There are two ways to do this. The first is the CTE approach you are pursuing, but it passes all the data through each iteration of the recursive CTE, which could be slow. It would look like this (including data setup):
create table mytable (
policy_number varchar(8),
issue_date timestamp,
termination_date timestamp,
issue_state varchar(4),
product varchar(16),
plan_code int);
insert into mytable values
('001', '1985-05-26', '2005-03-02', 'CT', 'ROP', 123456),
('002', '1988-07-25', '2005-08-07', 'CT', 'ROP', 654321)
;
with recursive pdata(policy_number, issue_date, termination_date,
issue_state, product, plan_code, start_date,
yr) as (
select policy_number, issue_date, termination_date, issue_state,
product, plan_code, issue_date as start_date, 1 as yr -- yr = years to add in the next step (0 here would repeat issue_date)
from mytable
union all
select policy_number, issue_date, termination_date, issue_state,
product, plan_code,
issue_date + yr * (interval '1 years') as start_date,
yr + 1 as yr
from pdata
where start_date < termination_date
)
select policy_number, issue_date, termination_date,
issue_state, product, plan_code,
case when start_date > termination_date
then termination_date
else start_date
end as start_date
from pdata
order by start_date, policy_number;
The other way to do this is to generate the length of years in the recursive CTE but apply the data expansion in a loop join. This has the benefit of not carrying all the data through the recursive calls but has the expense of the loop join. It should be faster with large amounts of data but you can decide which is right for you.
Since each input row has its own date I left things in year intervals as this is cleaner. This looks like:
create table mytable (
policy_number varchar(8),
issue_date timestamp,
termination_date timestamp,
issue_state varchar(4),
product varchar(16),
plan_code int);
insert into mytable values
('001', '1985-05-26', '2005-03-02', 'CT', 'ROP', 123456),
('002', '1988-07-25', '2005-08-07', 'CT', 'ROP', 654321)
;
with recursive nums(yr, maxnum) as (
select 0::int as yr,
date_part('year', max(termination_date)) -
date_part('year', min(issue_date)) as maxnum
from mytable
union all
select yr + 1 as yr, maxnum
from nums
where yr <= maxnum
)
select policy_number, issue_date, termination_date,
issue_state, product, plan_code,
case when issue_date + yr * interval '1 year' > termination_date
then termination_date
else issue_date + yr * interval '1 year'
end as start_date
from mytable p
left join nums n
on termination_date + interval '1 year'
> issue_date + yr * interval '1 year'
order by start_date, policy_number;
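To see what the second approach produces, here is a minimal sketch of the same idea in SQLite, run through Python's sqlite3 module. It is an adaptation, not the Redshift query itself: SQLite has no interval type, so the year arithmetic uses date() modifiers, dates are ISO strings, and only three columns are carried to keep it short. The function name expand_policies is made up for the example.

```python
import sqlite3

def expand_policies(rows):
    """Return one (policy_number, start_date) row per policy year, clamped at termination."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table mytable(policy_number text, issue_date text, termination_date text)"
    )
    con.executemany("insert into mytable values (?, ?, ?)", rows)
    sql = """
    with recursive nums(yr, maxnum) as (
        -- year offsets 0 .. maxnum + 1, shared by all policies
        select 0,
               cast(strftime('%Y', max(termination_date)) as integer)
             - cast(strftime('%Y', min(issue_date)) as integer)
        from mytable
        union all
        select yr + 1, maxnum from nums where yr <= maxnum
    )
    select p.policy_number,
           -- two-argument min() is the scalar minimum: clamp to the termination date
           min(date(p.issue_date, '+' || n.yr || ' years'), p.termination_date) as start_date
    from mytable p
    join nums n
      on date(p.termination_date, '+1 years') > date(p.issue_date, '+' || n.yr || ' years')
    order by p.policy_number, start_date
    """
    return con.execute(sql).fetchall()

result = expand_policies([("001", "1985-05-26", "2005-03-02")])
print(len(result), result[0], result[-1])
```

For policy 001 this yields the anniversaries from 1985-05-26 through 2004-05-26 plus a final row at the termination date, which is the shape asked for in the question. The remaining columns can be added to the select list unchanged, since they come straight from mytable.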

How to calculate total balance in sql using sum?

updated question --
I have a table that contains the following columns:
DROP TABLE TABLE_1;
CREATE TABLE TABLE_1(
TRANSACTION_ID number, USER_KEY number,AMOUNT number,CREATED_DATE DATE, UPDATE_DATE DATE
);
insert into TABLE_1
values ('001','1001',75,'2022-12-02','2022-12-03'),
('001','1001',-74.98,'2022-12-02','2022-12-03'),
('001','1001',74.98,'2022-12-03','2022-12-04'),
('001','1001',-75,'2022-12-03','2022-12-04')
I need to calculate the balance based on the update date. In some cases there can be the same update_date for two different records. When I have this, I want to grab the lower value of the balance.
This is the query I have so far:
select * from (
select TRANSACTION_ID,USER_KEY,AMOUNT,CREATED_DATE,UPDATE_DATE,
sum(AMOUNT) over(partition by USER_KEY order by UPDATE_DATE rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as TOTAL_BALANCE_AMOUNT
from TABLE_1
) qualify row_number() over (partition by USER_KEY order by UPDATE_DATE DESC, UPDATE_DATE DESC) = 1
In the query above, it is grabbing the 75 rather than the 0 when I try to grab only the LAST balance.
Is there a way to include in the qualify query to grab the last balance but if the dates are the same, to grab the lowest balance?
Why is the second query showing four different record balances?
That is the point of "running total". If the goal is to have a single value per entire window then order by should be skipped:
select USER_KEY,
sum(AMOUNT) over(partition by USER_KEY) as TOTAL_BALANCE_AMOUNT
from TABLE_1;
The partition by clause could be further expanded with the date to produce output per user_key/date:
select USER_KEY,
sum(AMOUNT) over(partition by USER_KEY, UPDATE_DATE) as TOTAL_BALANCE_AMOUNT
from TABLE_1;
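The difference between a running total and a whole-window total is easy to see side by side. The sketch below is an illustration only: it uses SQLite through Python's sqlite3 module instead of Snowflake, a hypothetical table t, made-up integer amounts, and transaction_id as a tie-breaker so that rows sharing a date have a defined order.

```python
import sqlite3

# Hypothetical transactions: (transaction_id, user_key, amount, update_date)
ROWS = [
    (1, 1001, 100, "2022-12-03"),
    (2, 1001, -30, "2022-12-03"),
    (3, 1001, 50, "2022-12-04"),
    (4, 1001, -20, "2022-12-04"),
]

def balances(rows):
    """Return (transaction_id, running_balance, total_balance) per row."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table t(transaction_id integer, user_key integer,"
        " amount integer, update_date text)"
    )
    con.executemany("insert into t values (?, ?, ?, ?)", rows)
    sql = """
    select transaction_id,
           -- with ORDER BY: a running total, a different value on every row
           sum(amount) over (partition by user_key
                             order by update_date, transaction_id) as running_balance,
           -- without ORDER BY: one total for the whole partition
           sum(amount) over (partition by user_key) as total_balance
    from t
    order by transaction_id
    """
    return con.execute(sql).fetchall()

for row in balances(ROWS):
    print(row)
```

The running column changes on every row (100, 70, 120, 100), while the column without ORDER BY repeats the single total of 100.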
I think you're looking for something like this: aggregate by USER_KEY and DATE, and then calculate a running sum. If neither this nor Lukasz Szozda's answer is what you're looking for, please edit the question to show the intended output.
create or replace table T1(USER_KEY int, AMOUNT number(38,2), "DATE" date);
insert into T1(USER_KEY, AMOUNT, "DATE") values
(1001, 75, '2022-12-02'),
(1001, -75, '2022-12-02'),
(1001, 75, '2022-12-03'),
(1001, -75, '2022-12-03');
-- Option 1, aggregate after window
select USER_KEY, "DATE", min(TOTAL_BALANCE_AMOUNT) as MINIMUM_BALANCE from
(
select USER_KEY, "DATE", sum(AMOUNT)
over(partition by USER_KEY order by DATE, AMOUNT desc rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as TOTAL_BALANCE_AMOUNT from
T1
)
group by USER_KEY, "DATE"
;
--Option 2, qualify by partitioning by user and day, reversing the order of transactions
select USER_KEY, "DATE", sum(AMOUNT)
over(partition by USER_KEY order by DATE, AMOUNT desc rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as TOTAL_BALANCE_AMOUNT
from
T1
qualify row_number() over (partition by USER_KEY, DATE order by DATE, AMOUNT asc) = 1
;
USER_KEY DATE                TOTAL_BALANCE_AMOUNT
1001     2022-12-02 00:00:00 0
1001     2022-12-03 00:00:00 0

How can I print the second row value in the first row column?

I am fetching data as follows:
start_date end_date amount
12/10/2020 - 1800000
12/18/2020 - 1200000
01/18/2021 - 1000000
I would like to put the start date of the next row into the end date of the current row, for however many rows I am fetching, so that the table becomes as follows:
start_date end_date amount
12/10/2020 12/18/2020 1800000
12/18/2020 01/18/2021 1200000
01/18/2021 - 1000000
Thanks everyone!
Use LEAD as follows:
select start_Date,
lead(start_date) over (order by start_Date) as end_Date,
amount
from your_Table t
Use LEAD. If you have no column to partition by, use only ORDER BY in the OVER() clause:
select start_date,
lead(start_date) over (partition by ... order by start_date) as end_date,
amount
...
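A small runnable sketch of the LEAD approach, using SQLite through Python's sqlite3 module as a stand-in for whatever database is actually in use. The dates are written as ISO strings here (an assumption) because text in MM/DD/YYYY form would not sort chronologically; with_end_dates and ROWS are illustrative names.

```python
import sqlite3

# The three rows from the question: (start_date, amount)
ROWS = [
    ("2020-12-10", 1800000),
    ("2020-12-18", 1200000),
    ("2021-01-18", 1000000),
]

def with_end_dates(rows):
    """Fill end_date with the start_date of the following row."""
    con = sqlite3.connect(":memory:")
    con.execute("create table your_table(start_date text, amount integer)")
    con.executemany("insert into your_table values (?, ?)", rows)
    sql = """
    select start_date,
           lead(start_date) over (order by start_date) as end_date,
           amount
    from your_table
    order by start_date
    """
    return con.execute(sql).fetchall()

for row in with_end_dates(ROWS):
    print(row)
```

The last row has no following row, so its end_date comes back as NULL (None in Python), matching the "-" in the desired output.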
I used a sample table and selected the data. Please change the SQL query as per your need.
DECLARE @Range TABLE (
start_date DATETIME
,end_date DATETIME
,amount INT
)
INSERT INTO @Range
SELECT '12/10/2020'
,NULL
,1800000
UNION ALL
SELECT '12/18/2020'
,NULL
,1200000
UNION ALL
SELECT '01/18/2021'
,NULL
,1000000
SELECT start_date
,LEAD(start_date, 1) OVER (
ORDER BY start_date
) AS end_date
,amount
FROM @Range;
Another option instead of LEAD is a CTE with ROW_NUMBER joined to itself:
DECLARE @T TABLE(start_date DATE, end_date DATE, amount INT)
INSERT INTO @T VALUES
('12/10/2020',NULL,1800000),
('12/18/2020',NULL,1200000),
('01/18/2021',NULL,1000000)
;WITH CTE AS(
-- number the rows by start_date (end_date is NULL in every row, so it cannot define the order)
SELECT rownum = ROW_NUMBER() OVER(ORDER BY start_date),amount,start_date,end_date
FROM @T
)
SELECT
CTE.start_date,CTE.amount,nex.start_date AS [End Date]
FROM CTE
LEFT JOIN CTE nex ON nex.rownum = CTE.rownum + 1

Finding lowest two minimum values and finding difference between the two in SQL Server?

I have a transaction table where I have to find the first and second date of transaction of every customer. Finding first date is very simple where I can use MIN() func to find the first date but the second and in particular finding the difference between the two is getting very challenging and somehow I am not able to find out any feasible way:
select a.customer_id, a.transaction_date, a.Row_Count2
from ( select
transaction_date as transaction_date,
reference_no as customer_id,
row_number() over (partition by reference_no
ORDER BY reference_no, transaction_date) AS Row_Count2
from transaction_detail
) a
where a.Row_Count2 < 3
ORDER BY a.customer_id, a.transaction_date, a.Row_Count2
Gives me this :
What I want is the following columns:
||CustomerID|| ||FirstDateofPurchase|| ||SecondDateofPuchase|| ||Diff. b/w Second & First Date ||
You can use the window functions LEAD/LAG to return the results you are looking for.
First find the next transaction date per reference number using LEAD, and generate a row number for each row using your original logic. You can then take the date difference on the row where the row number is 1.
Example (I'm not excluding same-day transactions; they are treated as separate rows, with the row number generated as in your query above. You can easily change the SQL below to treat them as one, so that the next distinct date becomes the second date):
declare @tbl table(reference_no int, transaction_date datetime)
insert into @tbl
select 1000, '2018-07-11'
UNION ALL
select 1001, '2018-07-12'
UNION ALL
select 1001, '2018-07-12'
UNION ALL
select 1001, '2018-07-13'
UNION ALL
select 1002, '2018-07-11'
UNION ALL
select 1002, '2018-07-15'
select customer_id, transaction_date as firstdate,
transaction_date_next seconddate,
datediff(day, transaction_date, transaction_date_next) diff_in_days
from
(
select reference_no as customer_id, transaction_date,
lead(transaction_date) over (partition by reference_no
order by transaction_date) transaction_date_next,
row_number() over (partition by reference_no ORDER BY transaction_date) AS Row_Count
from @tbl
) src
where Row_Count = 1
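For reference, here is a sketch of what this query returns for the sample rows, run in SQLite through Python's sqlite3 module. SQLite is only a stand-in for SQL Server: it has no DATEDIFF, so the day difference is computed with julianday(), and the table name tbl and the function name first_two_dates are made up for the example.

```python
import sqlite3

# Sample rows from the answer: (reference_no, transaction_date)
ROWS = [
    (1000, "2018-07-11"),
    (1001, "2018-07-12"),
    (1001, "2018-07-12"),
    (1001, "2018-07-13"),
    (1002, "2018-07-11"),
    (1002, "2018-07-15"),
]

def first_two_dates(rows):
    """Return (customer_id, firstdate, seconddate, diff_in_days) per customer."""
    con = sqlite3.connect(":memory:")
    con.execute("create table tbl(reference_no integer, transaction_date text)")
    con.executemany("insert into tbl values (?, ?)", rows)
    sql = """
    select customer_id, firstdate, seconddate,
           cast(julianday(seconddate) - julianday(firstdate) as integer) as diff_in_days
    from (
        select reference_no as customer_id,
               transaction_date as firstdate,
               lead(transaction_date) over (partition by reference_no
                                            order by transaction_date) as seconddate,
               row_number() over (partition by reference_no
                                  order by transaction_date) as row_count
        from tbl
    ) src
    where row_count = 1
    order by customer_id
    """
    return con.execute(sql).fetchall()

for row in first_two_dates(ROWS):
    print(row)
```

Customer 1000 has only one transaction, so the second date and the difference are NULL. Customer 1001 has two transactions on the same day, so the difference is 0 unless same-day rows are collapsed first.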
You can do this with CROSS APPLY.
SELECT td.reference_no AS customer_id,
MIN(ca.transaction_date) AS first_date,
MAX(ca.transaction_date) AS second_date,
DATEDIFF(day, MIN(ca.transaction_date), MAX(ca.transaction_date)) AS diff_in_days
FROM transaction_detail td
CROSS APPLY (SELECT TOP 2 *
FROM transaction_detail
WHERE reference_no = td.reference_no
ORDER BY transaction_date) ca
GROUP BY td.reference_no

Compare multiple date ranges

I am using iReport 3.0.0, PostgreSQL 9.1. For a report I need to compare date ranges from invoices with date ranges in filters and print for every invoice code if a filter range is covered, partially covered, etc. To complicate things, there can be multiple date ranges per invoice code.
Table Invoices
ID Code StartDate EndDate
1 111 1.5.2012 31.5.2012
2 111 1.7.2012 20.7.2012
3 111 25.7.2012 31.7.2012
4 222 1.4.2012 15.4.2012
5 222 18.4.2012 30.4.2012
Examples
Filter: 1.5.2012. - 5.6.2012.
Result that I need to get is:
code 111 - partially covered
code 222 - invoice missing
Filter: 1.5.2012. - 31.5.2012.
code 111 - fully covered
code 222 - invoice missing
Filter: 1.6.2012. - 30.6.2012.
code 111 - invoice missing
code 222 - invoice missing
After clarification in comment.
Your task as I understand it:
Check for all supplied individual date ranges (filter) whether they are covered by the combined date ranges of sets of codes in your table (invoice).
It can be done with plain SQL, but it is not a trivial task. The steps could be:
(1) Supply date ranges as filters.
(2) Combine date ranges in the invoice table per code. This can result in one or more ranges per code.
(3) Look for overlaps between filters and combined invoices.
(4) Classify: fully covered / partially covered. This can result in one full coverage, one or two partial coverages, or no coverage. Reduce to the maximum level of coverage.
(5) Display one row for every combination of (filter, code) with the resulting coverage, in a sensible sort order.
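The overlap and classification steps can be sketched outside SQL to make the rule explicit. This is a plain Python illustration, not part of the query: the function classify and the MERGED dictionary are hypothetical names, and MERGED holds the ranges from the question's table as they would look after the per-code combining step (none of them merge, because every gap is longer than one day).

```python
from datetime import date

# Invoice ranges per code after combining overlapping/adjacent ranges.
MERGED = {
    111: [
        (date(2012, 5, 1), date(2012, 5, 31)),
        (date(2012, 7, 1), date(2012, 7, 20)),
        (date(2012, 7, 25), date(2012, 7, 31)),
    ],
    222: [
        (date(2012, 4, 1), date(2012, 4, 15)),
        (date(2012, 4, 18), date(2012, 4, 30)),
    ],
}

def classify(filter_start, filter_end, ranges):
    """Return the best coverage level of one filter range by a list of merged ranges."""
    # Keep only the ranges that overlap the filter at all.
    overlapping = [(s, e) for s, e in ranges if e >= filter_start and s <= filter_end]
    if not overlapping:
        return "invoice missing"
    # Full coverage needs a single merged range that contains the whole filter.
    if any(s <= filter_start and filter_end <= e for s, e in overlapping):
        return "fully covered"
    return "partially covered"

for code, ranges in MERGED.items():
    print(code, classify(date(2012, 5, 1), date(2012, 6, 5), ranges))
```

Checking a single merged range for full coverage is only valid because the ranges were combined first; two touching invoices that together span a filter would otherwise be reported as partial.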
Ad hoc filter ranges
WITH filter(filter_id, startdate, enddate) AS (
VALUES
(1, '2012-05-01'::date, '2012-06-05'::date) -- list filters here.
,(2, '2012-05-01', '2012-05-31')
,(3, '2012-06-01', '2012-06-30')
)
SELECT * FROM filter;
Or put them in a (temporary) table and use the table instead.
Combine overlapping / adjacent date ranges per code
WITH a AS (
SELECT code, startdate, enddate
,max(enddate) OVER (PARTITION BY code ORDER BY startdate) AS max_end
-- Calculate the cumulative maximum end of the ranges sorted by start
FROM invoice
), b AS (
SELECT *
,CASE WHEN lag(max_end) OVER (PARTITION BY code
ORDER BY startdate) + 2 > startdate
-- Compare to the cumulative maximum end of the last row.
-- Only if there is a gap, start a new group. Therefore the + 2.
THEN 0 ELSE 1 END AS step
FROM a
), c AS (
SELECT code, startdate, enddate, max_end
,sum(step) OVER (PARTITION BY code ORDER BY startdate) AS grp
-- Members of the same date range end up in the same grp
-- If there is a gap, the grp number is incremented one step
FROM b
)
SELECT code, grp
,min(startdate) AS startdate
,max(enddate) AS enddate
FROM c
GROUP BY 1, 2
ORDER BY 1, 2
Alternative final SELECT (may be faster or not, you'll have to test):
SELECT DISTINCT code, grp
,first_value(startdate) OVER w AS startdate
,last_value(enddate) OVER w AS enddate
FROM c
WINDOW W AS (PARTITION BY code, grp ORDER BY startdate
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY 1, 2;
Combine to one query
WITH
-- supply one or more filter values
filter(filter_id, startdate, enddate) AS (
VALUES
(1, '2012-05-01'::date, '2012-06-05'::date) -- cast values in first row
,(2, '2012-05-01', '2012-05-31')
,(3, '2012-06-01', '2012-06-30')
)
-- combine date ranges per code
,a AS (
SELECT code, startdate, enddate
,max(enddate) OVER (PARTITION BY code ORDER BY startdate) AS max_end
FROM invoice
), b AS (
SELECT *
,CASE WHEN (lag(max_end) OVER (PARTITION BY code ORDER BY startdate)
+ 2) > startdate THEN 0 ELSE 1 END AS step
FROM a
), c AS (
SELECT code, startdate, enddate, max_end
,sum(step) OVER (PARTITION BY code ORDER BY startdate) AS grp
FROM b
), i AS ( -- substitutes original invoice table
SELECT code, grp
,min(startdate) AS startdate
,max(enddate) AS enddate
FROM c
GROUP BY 1, 2
)
-- match filters
, x AS (
SELECT f.filter_id, i.code
,bool_or(f.startdate >= i.startdate
AND f.enddate <= i.enddate) AS full_cover
FROM filter f
JOIN i ON i.enddate >= f.startdate
AND i.startdate <= f.enddate -- only overlapping
GROUP BY 1,2
)
SELECT f.*, i.code
,CASE x.full_cover
WHEN TRUE THEN 'fully covered'
WHEN FALSE THEN 'partially covered'
ELSE 'invoice missing'
END AS covered
FROM (SELECT DISTINCT code FROM i) i
CROSS JOIN filter f -- all combinations of filter and code
LEFT JOIN x USING (filter_id, code) -- join in overlapping
ORDER BY filter_id, code;
Tested and works for me on PostgreSQL 9.1.