start date end date combine rows - sql

In Redshift, I want to use SQL to consolidate monthly records into a single record as long as the gap between the end date of one record and the start date of the next is 32 days or less (<=32). The output start date should be the minimum STARTDT of the continuous run, and the output end date the maximum ENDDT of that run.
The input data below is the table's data, followed by the expected output. The input is listed ORDER BY ID, STARTDT, ENDDT ascending.
For example, consider ID 100 in the table below. The gap between the end of the first record and the start of the second is <=32 days, but the gap between the end of the second record and the start of the third is more than 32 days. Hence the first two records should be consolidated into one record, i.e. ID, MIN(STARTDT), MAX(ENDDT), which corresponds to the first record in the expected output. Similarly, the gap between records 3 and 4 of the input is within 32 days, so these two records are consolidated into a single record, which corresponds to the second record in the expected output.
INPUT DATA:
ID STARTDT ENDDT
100 2000-01-01 2000-01-31
100 2000-02-01 2000-02-29
100 2000-05-01 2000-05-31
100 2000-06-01 2000-06-30
100 2000-09-01 2000-09-30
100 2000-10-01 2000-10-31
101 2012-06-01 2012-06-30
101 2012-07-01 2012-07-31
102 2000-01-01 2000-01-31
103 2013-03-01 2013-03-31
103 2013-05-01 2013-05-31
EXPECTED OUTPUT:
ID MIN_STARTDT MAX_END_DT
100 2000-01-01 2000-02-29
100 2000-05-01 2000-06-30
100 2000-09-01 2000-10-31
101 2012-06-01 2012-07-31
102 2000-01-01 2000-01-31
103 2013-03-01 2013-03-31
103 2013-05-01 2013-05-31

You can do this in steps:
Use a join to identify where two adjacent records should be combined.
Then do a cumulative sum to assign all such adjacent records a grouping identifier.
Aggregate.
It looks like:
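select id, min(startdt), max(enddt)
from (select t.*,
             -- running total of rows that have no adjacent predecessor;
             -- sum (not count), so that the 0 branch does not increment
             sum(case when tprev.id is null then 1 else 0 end) over
                 (partition by t.id
                  order by t.startdt
                  rows between unbounded preceding and current row
                 ) as grp
      from t left join
           t tprev
           on t.id = tprev.id and
              -- adjacent here means: starts the day after the previous row ends
              t.startdt = tprev.enddt + interval '1 day'
     ) t
group by id, grp;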
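The three steps above (flag, cumulative sum, aggregate) can also be sketched outside the database. The following is a minimal Python illustration, not the answer's SQL: it uses the question's sample rows and applies the 32-day threshold from the question rather than the exact one-day adjacency of the join. The names merge_periods and max_gap_days are made up for this sketch.

```python
from datetime import date

# Sample rows from the question: (id, startdt, enddt).
rows = [
    (100, date(2000, 1, 1), date(2000, 1, 31)),
    (100, date(2000, 2, 1), date(2000, 2, 29)),
    (100, date(2000, 5, 1), date(2000, 5, 31)),
    (100, date(2000, 6, 1), date(2000, 6, 30)),
    (100, date(2000, 9, 1), date(2000, 9, 30)),
    (100, date(2000, 10, 1), date(2000, 10, 31)),
    (101, date(2012, 6, 1), date(2012, 6, 30)),
    (101, date(2012, 7, 1), date(2012, 7, 31)),
    (102, date(2000, 1, 1), date(2000, 1, 31)),
    (103, date(2013, 3, 1), date(2013, 3, 31)),
    (103, date(2013, 5, 1), date(2013, 5, 31)),
]


def merge_periods(rows, max_gap_days=32):
    """Flag group starts, number groups with a running sum, then aggregate."""
    groups = {}  # (id, grp) -> (min start, max end)
    prev_id = prev_end = None
    grp = 0
    for rid, start, end in sorted(rows):
        if rid != prev_id:
            grp = 0              # the running sum restarts per id (PARTITION BY id)
            starts_group = True
        else:
            # Step 1: flag a new group when the gap to the previous row is too wide.
            starts_group = (start - prev_end).days > max_gap_days
        grp += int(starts_group)  # Step 2: cumulative sum of the flag
        key = (rid, grp)
        if key in groups:         # Step 3: MIN(start), MAX(end) per group
            lo, hi = groups[key]
            groups[key] = (min(lo, start), max(hi, end))
        else:
            groups[key] = (start, end)
        prev_id, prev_end = rid, end
    return [(rid, lo, hi) for (rid, _), (lo, hi) in groups.items()]


merged = merge_periods(rows)
for rid, lo, hi in merged:
    print(rid, lo, hi)
```

With the 32-day rule, the two rows for id 103 end up in one group, because their gap is 31 days.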

The question is very similar to this one and my answer is also similar: Fetch rows based on condition
The gist of the idea is to use window functions to identify transitions between periods (a period being a run of events that are less than 33 days apart), then filter out the rows inside each period, and then use window functions again to carry each period's end date onto its start row.
Complete solution:
SELECT
id,
startdt AS period_start,
period_end
FROM (
SELECT
id,
startdt,
enddt,
lead(enddt, 1)
OVER (PARTITION BY id
ORDER BY enddt) AS period_end,
period_boundary
FROM (
SELECT
id,
startdt,
enddt,
CASE WHEN period_switch = 0 AND reverse_period_switch = 1
THEN 'start'
ELSE 'end' END AS period_boundary
FROM (
SELECT
id,
startdt,
enddt,
CASE WHEN datediff(days, enddt, lead(startdt, 1)
OVER (PARTITION BY id
ORDER BY enddt ASC)) > 32
THEN 1
ELSE 0 END AS period_switch,
CASE WHEN datediff(days, lead(enddt, 1)
OVER (PARTITION BY id
ORDER BY enddt DESC), startdt) > 32
THEN 1
ELSE 0 END AS reverse_period_switch
FROM date_test
)
AS sessioned
WHERE period_switch != 0 OR reverse_period_switch != 0
UNION
SELECT -- adding start rows without transition
id,
startdt,
enddt,
'start'
FROM (
SELECT
id,
startdt,
enddt,
row_number()
OVER (PARTITION BY id
ORDER BY enddt ASC) AS row_num
FROM date_test
) AS with_row_number
WHERE row_num = 1
UNION
SELECT -- adding end rows without transition
id,
startdt,
enddt,
'end'
FROM (
SELECT
id,
startdt,
enddt,
row_number()
OVER (PARTITION BY id
ORDER BY enddt desc) AS row_num
FROM date_test
) AS with_row_number
WHERE row_num = 1
) AS with_boundary -- data set containing start/end boundaries
) AS with_end -- data set where end date is propagated into the start row of the period
WHERE period_boundary = 'start'
ORDER BY id, startdt ASC;
Note that in your expected output, you had a row for 103 2013-05-01 2013-05-31, however its start date is 31 days apart from end date of the previous row, so this row should instead be merged with the previous row for id 103 according to your requirements.
So the output that I get looks like this:
id start end
100 2000-01-01 2000-02-29
100 2000-05-01 2000-06-30
100 2000-09-01 2000-10-31
101 2012-06-01 2012-07-31
102 2000-01-01 2000-01-31
103 2013-03-01 2013-05-31
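The 31-day gap mentioned for id 103 can be checked with plain date arithmetic. This is a small Python check, assuming the gap is counted as the start date minus the previous end date, which is what the datediff call in the query computes:

```python
from datetime import date

# Gap between the two rows for id 103 in the question's input.
gap_103 = (date(2013, 5, 1) - date(2013, 3, 31)).days

# For comparison: the gap between the second and third rows for id 100.
gap_100 = (date(2000, 5, 1) - date(2000, 2, 29)).days

print(gap_103, gap_103 <= 32)  # within the threshold, so the rows merge
print(gap_100, gap_100 <= 32)  # beyond the threshold, so a new period starts
```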

Related

How to merge rows startdate enddate based on column values using Lag Lead or window functions?

I have a table with 4 columns: ID, STARTDATE, ENDDATE and BADGE. I want to merge rows based on ID and BADGE values but make sure that only consecutive rows will get merged.
For example, If input is:
Output will be:
I have tried lag lead, unbounded, bounded precedings but unable to achieve the output:
SELECT ID,
STARTDATE,
MAX(ENDDATE),
NAME
FROM (SELECT USERID,
IFF(LAG(NAME) over(Partition by USERID Order by STARTDATE) = NAME,
LAG(STARTDATE) over(Partition by USERID Order by STARTDATE),
STARTDATE) AS STARTDATE,
ENDDATE,
NAME
from myTable )
GROUP BY USERID,
STARTDATE,
NAME
We have to make sure that we merge only consecutive rows having the same ID and Badge.
Help will be appreciated, Thanks.
You can split the problem into two steps:
creating the right partitions
aggregating on the partitions with direct aggregation functions (MIN and MAX)
You can approach the first step with a flag that is 1 when a row does not directly continue the previous one, i.e. when the previous row's ENDDATE + 1 day is not equal to the current row's STARTDATE. This value marks where a new partition should start, so a running sum over it gives you correctly numbered partitions.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL 1 DAY = STARTDATE , 0, 1) AS boolval
FROM tab
)
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
Then the second step can be summarized in the direct aggregation of "STARTDATE" and "ENDDATE" using the MIN and MAX function respectively, grouping on your ranking value. For syntax correctness, you need to add "ID" and "Badge" too in the GROUP BY clause, even though their range of action is already captured by the computed ranking value.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL 1 DAY = STARTDATE , 0, 1) AS boolval
FROM tab
), cte2 AS (
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
)
SELECT ID,
MIN(STARTDATE) AS STARTDATE,
MAX(ENDDATE) AS ENDDATE,
Badge
FROM cte2
GROUP BY ID,
Badge,
rn
In Snowflake, gaps-and-islands problems like this can be solved with the function conditional_true_event, as in the query below.
The first CTE creates a column that flags a change event (true or false) whenever the value of the badge column changes.
The next CTE (cte_1) passes this flag to conditional_true_event, which produces another column (incremented each time the flag is TRUE) to be used for grouping in the final query.
The final query is just a min/max with group by.
with cte as (
select
m.*,
case when badge <> lag(badge) over (partition by id order by null)
then true
else false end flag
from merge_tab m
), cte_1 as (
select c.*,
conditional_true_event(flag) over (partition by id order by null) cn
from cte c
)
select id,min(startdate) ms, max(enddate) me, badge
from cte_1
group by id,badge,cn
order by id desc, ms asc, me asc, badge asc;
Final output -
ID MS         ME         BADGE
51 1985-02-01 2019-04-28 1
51 2019-04-29 2020-08-16 2
51 2020-08-17 2021-04-03 3
51 2021-04-04 2021-04-05 1
51 2021-04-06 2022-08-20 2
51 2022-08-21 9999-12-31 3
10 2020-02-06 9999-12-31 3
With data -
select * from merge_tab;
ID STARTDATE  ENDDATE    BADGE
51 1985-02-01 2019-04-28 1
51 2019-04-29 2019-04-28 2
51 2019-09-16 2019-11-16 2
51 2019-11-17 2020-08-16 2
51 2020-08-17 2021-04-03 3
51 2021-04-04 2021-04-05 1
51 2021-04-06 2022-05-05 2
51 2022-05-06 2022-08-20 2
51 2022-08-21 9999-12-31 3
10 2020-02-06 2019-04-28 3
10 2021-03-21 9999-12-31 3
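The grouping that the change flag and conditional_true_event produce is "one group per run of consecutive rows with the same id and badge". Below is a minimal Python sketch of that idea using the merge_tab rows above. It assumes the rows arrive in the order shown, and the name merge_runs is made up for this illustration. Dates are kept as ISO strings, which compare in date order.

```python
from itertools import groupby

# Rows of merge_tab as listed above: (id, startdate, enddate, badge).
merge_tab = [
    (51, "1985-02-01", "2019-04-28", 1),
    (51, "2019-04-29", "2019-04-28", 2),
    (51, "2019-09-16", "2019-11-16", 2),
    (51, "2019-11-17", "2020-08-16", 2),
    (51, "2020-08-17", "2021-04-03", 3),
    (51, "2021-04-04", "2021-04-05", 1),
    (51, "2021-04-06", "2022-05-05", 2),
    (51, "2022-05-06", "2022-08-20", 2),
    (51, "2022-08-21", "9999-12-31", 3),
    (10, "2020-02-06", "2019-04-28", 3),
    (10, "2021-03-21", "9999-12-31", 3),
]


def merge_runs(rows):
    """Collapse each run of consecutive rows sharing (id, badge) into one row."""
    merged = []
    # groupby starts a new group whenever the key changes between neighbours,
    # which plays the role of the incrementing counter in the SQL.
    for (rid, badge), run in groupby(rows, key=lambda r: (r[0], r[3])):
        run = list(run)
        merged.append((rid, min(r[1] for r in run), max(r[2] for r in run), badge))
    return merged


merged_runs = merge_runs(merge_tab)
for row in merged_runs:
    print(*row)
```

Because groupby only compares neighbouring rows, badge 1 for id 51 yields two separate output rows, as in the final output above.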

How can I select records from the last value accumulated

I have the next data: TABLE_A
| RegisteredDate   | Quantity |
|------------------|----------|
| 2022-03-01 13:00 | 100      |
| 2022-03-01 13:10 | 20       |
| 2022-03-01 13:20 | -80      |
| 2022-03-01 13:30 | -40      |
| 2022-03-02 09:00 | 10       |
| 2022-03-02 22:00 | -5       |
| 2022-03-03 02:00 | -5       |
| 2022-03-03 03:00 | 25       |
| 2022-03-03 03:20 | -10      |
If I add a cumulative column:
select RegisteredDate, Quantity
, sum(Quantity) over ( order by RegisteredDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as Summary
from TABLE_A
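| RegisteredDate   | Quantity | Summary |
|------------------|----------|---------|
| 2022-03-01 13:00 | 100      | 100     |
| 2022-03-01 13:10 | 20       | 120     |
| 2022-03-01 13:20 | -80      | 40      |
| 2022-03-01 13:30 | -40      | 0       |
| 2022-03-02 09:00 | 10       | 10      |
| 2022-03-02 22:00 | -5       | 5       |
| 2022-03-03 02:00 | -5       | 0       |
| 2022-03-03 03:00 | 25       | 25      |
| 2022-03-03 03:20 | -10      | 15      |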
Is there a way to get the following result with a query?
| RegisteredDate   | Quantity | Summary |
|------------------|----------|---------|
| 2022-03-03 03:00 | 25       | 25      |
| 2022-03-03 03:20 | -10      | 15      |

This result is the set of records after the last zero.
EDIT:
What I really need to solve this problem is 2022-03-03 03:00, the first date of the records after the last zero.
You can use the SUM window function to compute a grp column that counts, for each row, how many zero Summary values occur at that row or later. The rows after the last zero are then the ones with grp = 0.
Query 1:
WITH cte AS
(
SELECT RegisteredDate,
Quantity,
sum(Quantity) over (order by RegisteredDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as Summary
FROM TABLE_A
), cte2 AS (
SELECT *,
SUM(CASE WHEN Summary = 0 THEN 1 ELSE 0 END) OVER(order by RegisteredDate desc) grp
FROM cte
)
SELECT RegisteredDate,
Quantity
FROM cte2
WHERE grp = 0
ORDER BY RegisteredDate
Results:
| RegisteredDate | Quantity |
|----------------------|----------|
| 2022-03-03T03:00:00Z | 25 |
| 2022-03-03T03:20:00Z | -10 |
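The same selection can be sketched in a few lines of Python: build the running total, find the position of the last zero, and keep everything after it. This is only an illustration using the question's data, and the name rows_after_last_zero is made up here.

```python
from itertools import accumulate

# TABLE_A from the question: (RegisteredDate, Quantity).
table_a = [
    ("2022-03-01 13:00", 100),
    ("2022-03-01 13:10", 20),
    ("2022-03-01 13:20", -80),
    ("2022-03-01 13:30", -40),
    ("2022-03-02 09:00", 10),
    ("2022-03-02 22:00", -5),
    ("2022-03-03 02:00", -5),
    ("2022-03-03 03:00", 25),
    ("2022-03-03 03:20", -10),
]


def rows_after_last_zero(rows):
    """Return (date, quantity, running total) for rows after the last zero total."""
    rows = sorted(rows)                              # ORDER BY RegisteredDate
    totals = list(accumulate(q for _, q in rows))    # the Summary column
    # Index of the last row whose running total is zero (-1 if there is none).
    last_zero = max((i for i, t in enumerate(totals) if t == 0), default=-1)
    tail = slice(last_zero + 1, None)
    return [(d, q, t) for (d, q), t in zip(rows[tail], totals[tail])]


result = rows_after_last_zero(table_a)
for row in result:
    print(*row)
```

The first element of the result carries the date the EDIT in the question asks for.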
Use a CTE that returns the summary column and NOT EXISTS to filter out the rows that you don't need:
WITH cte AS (SELECT *, SUM(Quantity) OVER (ORDER BY RegisteredDate) Summary FROM TABLE_A)
SELECT c1.*
FROM cte c1
WHERE NOT EXISTS (
SELECT 1
FROM cte c2 WHERE c2.RegisteredDate >= c1.RegisteredDate AND c2.Summary = 0
)
ORDER BY c1.RegisteredDate;
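There is no need for ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW in the OVER clause here: when ORDER BY is present the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which gives the same running total as long as the RegisteredDate values are unique.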
See the demo.
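One caveat about the default frame: in standard SQL the default with ORDER BY is a RANGE frame, not a ROWS frame, and the two differ only when the ordering key has ties. The Python sketch below emulates both frames on hypothetical data with a duplicated key. The function names and the sample rows are made up for this illustration.

```python
def running_sum_rows(rows):
    """ROWS frame: each row sees the rows physically up to and including itself."""
    total, out = 0, []
    for _, qty in rows:
        total += qty
        out.append(total)
    return out


def running_sum_range(rows):
    """RANGE frame: each row also sees every peer with the same ORDER BY key."""
    return [sum(q for k, q in rows if k <= key) for key, _ in rows]


# Hypothetical data, already sorted, with a tie on the ordering key "13:10".
tied = [("13:00", 100), ("13:10", 20), ("13:10", -80), ("13:20", -40)]

rows_frame = running_sum_rows(tied)
range_frame = running_sum_range(tied)
print(rows_frame)
print(range_frame)
```

With unique keys the two lists are identical, which is the situation in the question's data.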
Try this:
with u as
(select RegisteredDate,
Quantity,
sum(Quantity) over (order by RegisteredDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as Summary
from TABLE_A)
select * from u
where RegisteredDate >= all(select RegisteredDate from u where Summary = 0)
and Summary <> 0;
Fiddle
Basically what you want is for RegisteredDate to be >= all RegisteredDates where Summary = 0, and you want Summary <> 0.
When using window functions, it is necessary to take into account that RegisteredDate column is not unique in TABLE_A, so ordering only by RegisteredDate column is not enough to get a stable result on the same dataset.
With A As (
Select ROW_NUMBER() Over (Order by RegisteredDate, Quantity) As ID, RegisteredDate, Quantity
From TABLE_A),
B As (
Select A.*, SUM(Quantity) Over (Order by ID) As Summary
From A)
Select Top 1 *
From B
Where ID > (Select MAX(ID) From B Where Summary=0)
| ID | RegisteredDate   | Quantity | Summary |
|----|------------------|----------|---------|
| 8  | 2022-03-03 03:00 | 25       | 25      |

How to create a start and end date with no gaps from one date column and to sum a value within the dates

I am new to SQL and am working in SQL Developer.
I have a table that has 4 columns: Patient ID (ptid), service date (dt), insurance payment amount (insr_amt), out of pocket payment amount (op_amt). (see table 1 below)
What I would like to do is:
(1) Create two columns, "start_dt" and "end_dt", from the "dt" column. If a patient ID has no gaps in its service dates, populate the start and end date with the first and last date for that patient ID. If there is a gap in the service dates within a patient ID, create a separate start and end date row for each run of dates.
(2) Sum the two payment amounts by patient ID within each set of start and end dates (see table 2 below).
What would be the way to do this with SQL in SQL Developer?
Thank you!
Table 1:
Ptid dt       insr_amt op_amt
A    1/1/2021 30       20
A    1/2/2021 30       10
A    1/3/2021 30       10
A    1/4/2021 30       30
B    1/6/2021 10       10
B    1/7/2021 20       10
C    2/1/2021 15       30
C    2/2/2021 15       30
C    2/6/2021 60       30
Table 2:
Ptid start_dt end_dt   total_insr_amt total_op_amt
A    1/1/2021 1/4/2021 120            70
B    1/6/2021 1/7/2021 30             20
C    2/1/2021 2/2/2021 30             60
C    2/6/2021 2/6/2021 60             30
You didn't mention the specific database so this solution works in PostgreSQL. You can do:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select *,
sum(inc) over(partition by ptid order by dt) as grp
from (
select *,
case when dt - interval '1 day' = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
Result:
ptid start_dt end_dt total_insr_amt total_op_amt
----- ---------- ---------- -------------- -----------
A 2021-01-01 2021-01-04 120 70
B 2021-01-06 2021-01-07 30 20
C 2021-02-01 2021-02-02 30 60
C 2021-02-06 2021-02-06 60 30
See running example at DB Fiddle 1.
EDIT for Oracle
As requested, the modified query that works in Oracle is:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select x.*,
sum(inc) over(partition by ptid order by dt) as grp
from (
select t.*,
case when dt - 1 = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
See running example at db<>fiddle 2.
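For readers who want to trace the logic by hand, here is a minimal Python sketch of the same decision the inc flag makes: a row continues the current group only if it has the same ptid and its date is exactly one day after the previous row's date. It uses Table 1 from the question, reading the dates as month/day/year. The name summarize is made up for this illustration.

```python
from datetime import date, timedelta

# Table 1 from the question: (ptid, dt, insr_amt, op_amt).
visits = [
    ("A", date(2021, 1, 1), 30, 20),
    ("A", date(2021, 1, 2), 30, 10),
    ("A", date(2021, 1, 3), 30, 10),
    ("A", date(2021, 1, 4), 30, 30),
    ("B", date(2021, 1, 6), 10, 10),
    ("B", date(2021, 1, 7), 20, 10),
    ("C", date(2021, 2, 1), 15, 30),
    ("C", date(2021, 2, 2), 15, 30),
    ("C", date(2021, 2, 6), 60, 30),
]


def summarize(visits):
    """Collapse runs of consecutive service dates per patient and sum the amounts."""
    out = []
    prev_pt = prev_dt = None
    for ptid, dt, insr, op in sorted(visits):
        if ptid == prev_pt and dt - prev_dt == timedelta(days=1):
            # inc = 0: extend the current group.
            p, start, _, total_insr, total_op = out[-1]
            out[-1] = (p, start, dt, total_insr + insr, total_op + op)
        else:
            # inc = 1: a new patient or a gap starts a new group.
            out.append((ptid, dt, dt, insr, op))
        prev_pt, prev_dt = ptid, dt
    return out


summary = summarize(visits)
for row in summary:
    print(*row)
```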

SQL Dates Selection

I have an OPL_Dates table with start and end dates as below:
dbo.OPL_Dates
ID Start_date End_date
--------------------------------------
12345 1975-01-01 2001-12-31
12345 1989-01-01 2004-12-31
12345 2005-01-01 NULL
12345 2007-01-01 NULL
12377 2009-06-01 2009-12-31
12377 2013-02-07 NULL
12377 2010-01-01 2012-01-01
12489 2011-12-31 NULL
12489 2012-03-01 2012-04-01
The Output I am looking for is:
ID Start_date End_date
-------------------------------------
12345 1975-01-01 2004-12-31
12345 2005-01-01 NULL
12377 2009-06-01 2009-12-31
12377 2010-01-01 2012-01-01
12377 2013-02-07 NULL
12489 2011-12-31 NULL
Basically, I want to show the gaps between the OPL periods (if any); otherwise I need the min of Start_date and the max of End_date for a particular ID. NULL means an open-ended date, which can be converted to "9999-12-31".
The following pretty much does what you want:
with p as (
select v.*, sum(inc) over (partition by v.id order by v.dte) as running_inc
from t cross apply
(values (id, start_date, 1),
(id, coalesce(end_date, '2999-12-31'), -1)
) v(id, dte, inc)
)
select id, min(dte), max(dte)
from (select p.*, sum(case when running_inc = 0 then 1 else 0 end) over (partition by id order by dte desc) as grp
from p
) p
group by id, grp;
Note that it changes the "infinite" end date from NULL to 2999-12-31. This is a convenience, because NULL orders first in SQL Server ascending sorts.
Here is a SQL Fiddle.
What is this doing? It is unpivoting the dates into a single column, with a 1/-1 flag (inc) indicating whether the record is a start or end. The running sum of this flag then indicates the groups that should be combined. When the running sum is 0, then a group has ended. To include the end date in the right group, a reverse running sum is needed -- but that's a detail.
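The unpivot-and-running-sum idea can be traced in a short Python sketch. Each interval contributes a +1 event at its start and a -1 event at its end, and a group closes whenever the running count drops back to zero. This is an illustration on the question's data, not a translation of the query: it sweeps forward only, uses "9999-12-31" as the open-end sentinel (the value the question suggests), and breaks ties by processing starts before ends on the same date, which is an assumption of this sketch.

```python
OPEN_END = "9999-12-31"  # sentinel standing in for a NULL (open-ended) end date

# dbo.OPL_Dates from the question: (id, start_date, end_date or None).
opl_dates = [
    (12345, "1975-01-01", "2001-12-31"),
    (12345, "1989-01-01", "2004-12-31"),
    (12345, "2005-01-01", None),
    (12345, "2007-01-01", None),
    (12377, "2009-06-01", "2009-12-31"),
    (12377, "2013-02-07", None),
    (12377, "2010-01-01", "2012-01-01"),
    (12489, "2011-12-31", None),
    (12489, "2012-03-01", "2012-04-01"),
]


def merge_overlaps(rows):
    """Merge overlapping intervals per id with a +1/-1 event sweep."""
    events = []
    for rid, start, end in rows:
        events.append((rid, start, 0, +1))             # 0 sorts starts first on ties
        events.append((rid, end or OPEN_END, 1, -1))
    events.sort()

    merged, open_count, group_start = [], 0, None
    for rid, dte, _, inc in events:
        if open_count == 0:
            group_start = dte          # first start of a new group
        open_count += inc              # running sum of the inc flag
        if open_count == 0:            # every open interval has been closed
            merged.append((rid, group_start, None if dte == OPEN_END else dte))
    return merged


merged_opl = merge_overlaps(opl_dates)
for row in merged_opl:
    print(*row)
```

Intervals that merely touch on consecutive days, such as the first two rows of 12377, stay separate, which matches the output the question asks for.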

Find From/To Dates across multiple rows - SQL Postgres

I want to be able to "book" within range of dates, but you can't book across gaps of days. So booking across multiple rates is fine as long as they are contiguous.
I am happy to change data structure/index, if there are better ways of storing start/end ranges.
So far I have a "rates" table which contains Start/End Periods of time with a daily rate.
e.g. Rates Table.
ID Price From To
1 75.00 2015-04-12 2016-04-15
2 100.00 2016-04-16 2016-04-17
3 50.00 2016-04-18 2016-04-30
For the above data I would want to return:
From To
2015-04-12 2016-04-30
For simplicity's sake it is safe to assume that dates are safely consecutive. For contiguous dates, To is always 1 day before the next From.
For the case there is only 1 row, I would want it to return the From/To of that single row.
Also to clarify if I had the following data:
ID Price From To
1 75.00 2015-04-12 2016-04-15
2 100.00 2016-04-17 2016-04-18
3 50.00 2016-04-19 2016-04-30
4 50.00 2016-05-01 2016-05-21
Meaning where there is a gap >= 1 day it would count as a separate range.
In which case I would expect the following:
From To
2015-04-12 2016-04-15
2016-04-17 2016-05-21
Edit 1
After playing around I have come up with the following SQL which seems to work. Although I'm not sure if there are better ways/issues with it?
WITH grouped_rates AS
(SELECT
   from_date,
   to_date,
   -- running sum of the start flags numbers the groups;
   -- the alias is grp because GROUP is a reserved word
   SUM(grp_start) OVER (ORDER BY from_date, to_date) AS grp
 FROM (SELECT
         from_date,
         to_date,
         -- 0 when this row starts the day after the previous row ends
         CASE WHEN (from_date - INTERVAL '1 DAY') = lag(to_date)
                     OVER (ORDER BY from_date, to_date)
              THEN 0
              ELSE 1
         END AS grp_start
       FROM rates
       GROUP BY from_date, to_date) AS start_groups)
SELECT
  min(from_date) from_date,
  max(to_date) to_date
FROM grouped_rates
GROUP BY grp;
This is identifying contiguous overlapping groups in the data. One approach is to find where each group begins and then do a cumulative sum. The following query adds a flag indicating if a row starts a group:
select r.*,
(case when not exists (select 1
from rates r2
where r2.from < r.from and r2.to >= r.to or
(r2.from = r.from and r2.id < r.id)
)
then 1 else 0 end) as StartFlag
from rates r;
The or in the correlation condition is to handle the situation where intervals that define a group overlap on the start date for the interval.
You can then do a cumulative sum on this flag and aggregate by that sum:
with r as (
select r.*,
(case when not exists (select 1
from rates r2
where (r2.from < r.from and r2.to >= r.to) or
(r2.from = r.from and r2.id < r.id)
)
then 1 else 0 end) as StartFlag
from rates r
)
-- "from" and "to" are reserved words, so they are quoted here
select min(r."from"), max(r."to")
from (select r.*,
sum(r.StartFlag) over (order by r.from) as grp
from r
) r
group by grp;
CREATE TABLE prices( id INTEGER NOT NULL PRIMARY KEY
, price MONEY
, date_from DATE NOT NULL
, date_upto DATE NOT NULL
);
-- some data (upper limit is EXCLUSIVE)
INSERT INTO prices(id, price, date_from, date_upto) VALUES
( 1, 75.00, '2015-04-12', '2016-04-16' )
,( 2, 100.00, '2016-04-17', '2016-04-19' )
,( 3, 50.00, '2016-04-19', '2016-05-01' )
,( 4, 50.00, '2016-05-01', '2016-05-22' )
;
-- SELECT * FROM prices;
-- Recursive query to "connect the dots"
WITH RECURSIVE rrr AS (
SELECT date_from, date_upto
, 1 AS nperiod
FROM prices p0
WHERE NOT EXISTS (SELECT * FROM prices nx WHERE nx.date_upto = p0.date_from) -- no preceding segment
UNION ALL
SELECT r.date_from, p1.date_upto
, 1+r.nperiod AS nperiod
FROM prices p1
JOIN rrr r ON p1.date_from = r.date_upto
)
SELECT * FROM rrr r
WHERE NOT EXISTS (SELECT * FROM prices nx WHERE nx.date_from = r.date_upto) -- no following segment
;
Result:
date_from | date_upto | nperiod
------------+------------+---------
2015-04-12 | 2016-04-16 | 1
2016-04-17 | 2016-05-22 | 3
(2 rows)
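The recursion above can be traced in plain Python: start from every segment that has no predecessor, then keep following the segment whose date_from equals the current date_upto. This is a minimal sketch on the same prices rows, assuming that no two segments share a date_from and that date_from < date_upto for every segment. The name connect_segments is made up for this illustration.

```python
# (date_from, date_upto) pairs from the prices table; the upper limit is exclusive.
prices = [
    ("2015-04-12", "2016-04-16"),
    ("2016-04-17", "2016-04-19"),
    ("2016-04-19", "2016-05-01"),
    ("2016-05-01", "2016-05-22"),
]


def connect_segments(segments):
    """Chain segments where one's date_upto equals the next one's date_from."""
    next_upto = {start: upto for start, upto in segments}
    all_ends = {upto for _, upto in segments}
    chains = []
    for start, upto in sorted(segments):
        if start in all_ends:
            continue                   # has a preceding segment, so not a chain head
        nperiod = 1
        while upto in next_upto:       # follow the chain until no segment continues it
            upto = next_upto[upto]
            nperiod += 1
        chains.append((start, upto, nperiod))
    return chains


chains = connect_segments(prices)
for row in chains:
    print(*row)
```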