Group Records based on predefined date range in SQL (Oracle)

Is it possible to group records by a predefined date gap (e.g. 30 days), measured between the start_date of a row and the end_date of the previous row, when the dates are not consecutive? For each group I want the min(start_date) and max(end_date). I tried the lead and lag functions with partition by in Oracle but couldn't come up with a proper solution. A related but unanswered post can be found here.
E.g.
ROW_NUM PROJECT_ID START_DATE END_DATE
1 1 2016-01-14 2016-08-15
2 1 2016-08-16 2016-09-10 --- Date diff Row 1&2 = 1 Day
3 1 2016-11-15 2017-01-10 --- Date diff Row 2&3 = 66 Days
4 1 2017-01-17 2017-04-10 --- Date diff Row 3&4 = 7 Days
5 2 2018-04-28 2018-06-01 --- Other Project
6 2 2019-02-01 2019-04-05 --- Diff > 30 Days
7 2 2019-04-08 2019-07-28 --- Diff 3 Days
Expected Result:
ROW_NUM PROJECT_ID START_DATE END_DATE
1 1 2016-01-14 2016-09-10
3 1 2016-11-15 2017-04-10
5 2 2018-04-28 2018-06-01
6 2 2019-02-01 2019-07-28

Use lag() and a cumulative sum to define where the groups begin. Then aggregate:
select project_id, min(start_date), max(end_date)
from (select t.*,
             sum(case when prev_end_date > start_date - interval '30' day then 0 else 1 end) over
                 (partition by project_id order by start_date) as grp
      from (select t.*,
                   lag(end_date) over (partition by project_id order by start_date) as prev_end_date
            from t
           ) t
     ) t
group by project_id, grp;
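To see what the query computes, here is a minimal Python sketch of the same lag-plus-running-group idea, run against the sample rows from the question. The names `merge_ranges` and `max_gap_days` are made up for illustration, and row 4's start date is taken as 2017-01-17, which is what the stated 7-day difference implies. The comparison mirrors the query's `prev_end_date > start_date - 30 days` test, so a gap of exactly 30 days would start a new group.

```python
from datetime import date, timedelta
from itertools import groupby

def merge_ranges(rows, max_gap_days=30):
    """rows: (project_id, start_date, end_date) tuples. Hypothetical helper."""
    merged = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for project_id, group in groupby(ordered, key=lambda r: r[0]):
        prev_end = None  # plays the role of lag(end_date) within the partition
        for _, start, end in group:
            if prev_end is not None and prev_end > start - timedelta(days=max_gap_days):
                # Same group as the previous row: keep min(start), take max(end).
                pid, first_start, last_end = merged[-1]
                merged[-1] = (pid, first_start, max(last_end, end))
            else:
                # First row of the project, or the gap is too large: new group.
                merged.append((project_id, start, end))
            prev_end = end
    return merged

rows = [
    (1, date(2016, 1, 14), date(2016, 8, 15)),
    (1, date(2016, 8, 16), date(2016, 9, 10)),
    (1, date(2016, 11, 15), date(2017, 1, 10)),
    (1, date(2017, 1, 17), date(2017, 4, 10)),
    (2, date(2018, 4, 28), date(2018, 6, 1)),
    (2, date(2019, 2, 1), date(2019, 4, 5)),
    (2, date(2019, 4, 8), date(2019, 7, 28)),
]
merged = merge_ranges(rows)
for row in merged:
    print(row)
```

This yields the four rows of the expected result: two groups per project.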

Related

merge consecutive dates ranges using SQL

I want to find the end date of each run of consecutive date ranges. Some of the ranges are not consecutive; in that case the query should return the end of the single range.
Table Name: Sospensioni
ClientId Status StartDate EndDate
1 1 01/01/2022 02/01/2022
1 1 03/01/2022 04/01/2022
1 1 12/01/2022 15/01/2022
2 1 03/01/2022 03/01/2022
2 1 05/01/2022 06/01/2022
I want a SQL statement that merges consecutive ranges for each client (example of result):
ClientId Status StartDate EndDate
1 1 01/01/2022 04/01/2022
1 1 12/01/2022 15/01/2022
2 1 03/01/2022 03/01/2022
2 1 05/01/2022 06/01/2022
I want to solve the problem using SQL only.
thanks
This is a Gaps & Islands problem. You can use the typical solution using LAG(). For example:
select
  max(client_id) as client_id,
  max(status) as status,
  min(start_date) as start_date,
  max(end_date) as end_date
from (
  select *, sum(i) over(partition by client_id order by start_date) as g
  from (
    select *,
      case when dateadd(day, -1, start_date) <>
                lag(end_date) over(partition by client_id order by start_date)
           then 1 else 0 end as i
    from t
  ) x
) y
group by client_id, g
order by client_id, g
Result:
client_id status start_date end_date
---------- ------- ----------- ----------
1 1 2022-01-01 2022-01-04
1 1 2022-01-12 2022-01-15
2 1 2022-01-03 2022-01-03
2 1 2022-01-05 2022-01-06
See running example at db<>fiddle.

How to create a start and end date with no gaps from one date column and to sum a value within the dates

I am new to SQL and am working in SQL Developer.
I have a table with 4 columns: patient ID (ptid), service date (dt), insurance payment amount (insr_amt), and out-of-pocket payment amount (op_amt) (see Table 1 below).
What I would like to do is:
(1) create two columns, "start_dt" and "end_dt", from the "dt" column. If a patient's service dates have no gaps, the start and end dates are that patient's first and last dates; if there is a gap, each run of consecutive dates gets its own start and end date row for that patient.
(2) sum the two payment amounts per patient within each start/end date range (see Table 2 below).
How can I do this with SQL in SQL Developer?
Thank you!
Table 1:
Ptid  dt        insr_amt  op_amt
A     1/1/2021  30        20
A     1/2/2021  30        10
A     1/3/2021  30        10
A     1/4/2021  30        30
B     1/6/2021  10        10
B     1/7/2021  20        10
C     2/1/2021  15        30
C     2/2/2021  15        30
C     2/6/2021  60        30
Table 2:
Ptid  start_dt  end_dt    total_insr_amt  total_op_amt
A     1/1/2021  1/4/2021  120             70
B     1/6/2021  1/7/2021  30              20
C     2/1/2021  2/2/2021  30              60
C     2/6/2021  2/6/2021  60              30
You didn't mention the specific database, so this solution targets PostgreSQL. You can do:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select *,
sum(inc) over(partition by ptid order by dt) as grp
from (
select *,
case when dt - interval '1 day' = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
Result:
ptid start_dt end_dt total_insr_amt total_op_amt
----- ---------- ---------- -------------- -----------
A 2021-01-01 2021-01-04 120 70
B 2021-01-06 2021-01-07 30 20
C 2021-02-01 2021-02-02 30 60
C 2021-02-06 2021-02-06 60 30
See running example at DB Fiddle 1.
EDIT for Oracle
As requested, the modified query that works in Oracle is:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select x.*,
sum(inc) over(partition by ptid order by dt) as grp
from (
select t.*,
case when dt - 1 = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
See running example at db<>fiddle 2.
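For readers who want to trace the logic outside the database, the following is a rough Python sketch of what both queries do: a row continues the current island only when its date is exactly one day after the previous row's date for the same patient, and the totals accumulate per island. The function name `summarize_visits` is hypothetical; the data is Table 1 from the question.

```python
from datetime import date, timedelta

def summarize_visits(rows):
    """rows: (ptid, dt, insr_amt, op_amt) tuples.
    Returns lists of [ptid, start_dt, end_dt, total_insr_amt, total_op_amt]."""
    islands = []
    prev = None  # (ptid, dt) of the previous row, i.e. lag(dt) within the patient
    for ptid, dt, insr_amt, op_amt in sorted(rows):
        if prev == (ptid, dt - timedelta(days=1)):
            # inc = 0: same island, so extend the end date and add to the totals
            island = islands[-1]
            island[2] = dt
            island[3] += insr_amt
            island[4] += op_amt
        else:
            # inc = 1: first row of a patient, or a gap in the service dates
            islands.append([ptid, dt, dt, insr_amt, op_amt])
        prev = (ptid, dt)
    return islands

table1 = [
    ("A", date(2021, 1, 1), 30, 20), ("A", date(2021, 1, 2), 30, 10),
    ("A", date(2021, 1, 3), 30, 10), ("A", date(2021, 1, 4), 30, 30),
    ("B", date(2021, 1, 6), 10, 10), ("B", date(2021, 1, 7), 20, 10),
    ("C", date(2021, 2, 1), 15, 30), ("C", date(2021, 2, 2), 15, 30),
    ("C", date(2021, 2, 6), 60, 30),
]
table2 = summarize_visits(table1)
for row in table2:
    print(row)
```

As in the SQL, two rows with the same date for one patient would not be merged, because the second is not exactly one day after the first.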

Calculating moving average over irregular data

I am trying to calculate a moving average of several fields in a SQL Server database that involve irregularly spaced values over time. I realize that for regularly spaced data I can use a query like SELECT grp, AVG(count) OVER (PARTITION BY grp ... ROWS 7 PRECEDING) FROM t to create a moving average of the prior week's data. However, I have data organized as follows:
DATE GRP COUNT
2018-07-05 1 10
2018-07-08 1 4
2018-07-11 1 6
2018-07-12 1 6
2018-07-11 2 5
2018-07-15 2 10
2018-07-17 2 8
2018-07-20 2 10
...
Where for most groups there are no observations for some dates. The output I'm looking for is:
DATE GRP MOVING_AVG
2018-07-05 1 10
2018-07-08 1 7
2018-07-11 1 6.67
2018-07-12 1 5.33
2018-07-11 2 5
2018-07-15 2 7.5
2018-07-17 2 7.67
2018-07-20 2 9.33
Is there a way of specifying dates instead of rows in the PRECEDING clause, or do I have to create some sort of mask to average over?
EDITED FOR CLARIFICATION BASED ON COMMENTS
In SQL Server, I think this might be more simply achieved with a correlated subquery:
select
    date,
    grp,
    (
        select avg(count)
        from mytable t1
        where
            t1.grp = t.grp
            and t1.date >= dateadd(year, -1, t.date)  -- window width; use dateadd(day, -7, t.date) for a one-week window
            and t1.date <= t.date
    ) as cnt
from mytable t
If I understand correctly, you want the window to cover 7 (or however many) days before each date, rather than a fixed number of rows.
DATE GRP COUNT
2018-07-11 2 5
2018-07-15 2 10
2018-07-17 2 8
2018-07-20 2 10 <--- the AVG for this row must cover only the 7 days before it, so 2018-07-11 is not included
In that case :
select
date,
grp,
(
select avg(count)
from t t1
where
t1.grp = t.grp
and DATEDIFF(day, t1.date, t.date) <= 7 /*7 or whatever day you want*/
and t1.date <= t.date
) as MOVING_AVG
from t
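As a sanity check on the window width, here is a small Python sketch of the same date-window average, using the sample rows from the question. The name `moving_avg` and the `window_days` parameter are assumptions for illustration; the filter mirrors the two conditions in the subquery (same group, date between `window_days` before the row's date and the date itself).

```python
from datetime import date

def moving_avg(rows, window_days=7):
    """rows: (date, grp, count) tuples. Hypothetical helper."""
    result = []
    for d, grp, _ in rows:
        # Same grp, and 0 <= days between the two dates <= window_days
        values = [c for d1, g1, c in rows
                  if g1 == grp and 0 <= (d - d1).days <= window_days]
        result.append((d, grp, round(sum(values) / len(values), 2)))
    return result

rows = [
    (date(2018, 7, 5), 1, 10), (date(2018, 7, 8), 1, 4),
    (date(2018, 7, 11), 1, 6), (date(2018, 7, 12), 1, 6),
    (date(2018, 7, 11), 2, 5), (date(2018, 7, 15), 2, 10),
    (date(2018, 7, 17), 2, 8), (date(2018, 7, 20), 2, 10),
]
result = moving_avg(rows, window_days=6)
for row in result:
    print(row)
```

On this data, a difference of at most 6 days (a 7-day window counting the current day) reproduces the averages listed in the question: 10, 7, 6.67, 5.33, 5, 7.5, 7.67, 9.33. With `window_days=7` the row for 2018-07-12 also picks up 2018-07-05 and averages 6.5 instead of 5.33, so the bound is worth checking against the intended window. Note also that in SQL Server AVG over an integer column returns an integer, so the column may need a cast (e.g. multiplying by 1.0) to get fractional results.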

Using the earliest date of a partition to determine what other dates belong to that partition

Assume this is my table:
ID DATE
--------------
1 2018-11-12
2 2018-11-13
3 2018-11-14
4 2018-11-15
5 2018-11-16
6 2019-03-05
7 2019-05-07
8 2019-05-08
9 2019-05-08
I need partitions to be determined by the first date in the partition: any date that is within 2 days of the first date belongs in the same partition.
The table would end up looking like this if each partition was ranked
PARTITION ID DATE
------------------------
1 1 2018-11-12
1 2 2018-11-13
1 3 2018-11-14
2 4 2018-11-15
2 5 2018-11-16
3 6 2019-03-05
4 7 2019-05-07
4 8 2019-05-08
4 9 2019-05-08
I've tried using datediff with lag to compare to the previous date but that would allow a partition to be inappropriately sized based on spacing, for example all of these dates would be included in the same partition:
ID DATE
--------------
1 2018-11-12
2 2018-11-14
3 2018-11-16
4 2018-11-18
5 2018-11-20
6 2018-11-22
Previous flawed attempt:
Mark when a date is more than 2 days past the previous date:
(case when datediff(day, lag(event_time, 1) over (partition by user_id, stage order by event_time), event_time) > 2 then 1 else 0 end)
You need to use a recursive CTE for this, so the operation is expensive.
with numbered as (
      -- add an incrementing column with no gaps
      -- (named differently from the base table t, so the CTE does not refer to itself)
      select t.*, row_number() over (order by date) as seqnum
      from t
     ),
     cte as (
      select id, date, date as mindate, seqnum
      from numbered
      where seqnum = 1
      union all
      select n.id, n.date,
             (case when n.date <= dateadd(day, 2, cte.mindate)
                   then cte.mindate else n.date
              end) as mindate,
             n.seqnum
      from cte join
           numbered n
           on n.seqnum = cte.seqnum + 1
     )
select cte.*, dense_rank() over (order by mindate) as partition_num
from cte
option (maxrecursion 0);  -- SQL Server stops at 100 recursion levels by default
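The reason a plain lag() comparison is not enough is that each row's partition depends on the first date of the current partition, which is only known after processing all earlier rows. A short Python sketch makes that sequential dependency explicit (the names `assign_partitions` and `span_days` are invented for this example):

```python
from datetime import date, timedelta

def assign_partitions(dates, span_days=2):
    """Return (partition_number, date) pairs. A partition holds every date
    within span_days of the partition's first date."""
    assigned = []
    partition = 0
    first_date = None  # corresponds to mindate in the recursive CTE
    for d in sorted(dates):
        if first_date is None or d > first_date + timedelta(days=span_days):
            # Too far from the current partition's first date: start a new one.
            first_date = d
            partition += 1
        assigned.append((partition, d))
    return assigned

dates = [
    date(2018, 11, 12), date(2018, 11, 13), date(2018, 11, 14),
    date(2018, 11, 15), date(2018, 11, 16), date(2019, 3, 5),
    date(2019, 5, 7), date(2019, 5, 8), date(2019, 5, 8),
]
partitions = [p for p, _ in assign_partitions(dates)]
print(partitions)

# Evenly spaced dates from the question: every second day
spaced = [date(2018, 11, day) for day in (12, 14, 16, 18, 20, 22)]
spaced_partitions = [p for p, _ in assign_partitions(spaced)]
print(spaced_partitions)
```

On the first table this gives partitions 1, 1, 1, 2, 2, 3, 4, 4, 4, matching the desired output. The evenly spaced dates come out in three partitions of two dates each instead of one long chain.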

Oracle SQL - Select users between two date by month

I am learning SQL and I was wondering how to select active users by month, depending on their starting and ending date (both timestamp(6)). My table looks like this:
Cust_Num | Start_Date | End_Date
1 | 2018-01-01 | 2019-01-01
2 | 2018-01-01 | NULL
3 | 2019-01-01 | 2019-06-01
4 | 2017-01-01 | 2019-03-01
So, counting the active users by month, I should have an output like:
As of | Count
2018-06-01 | 3
...
2019-02-01 | 3
2019-07-01 | 1
So far, I do a manual operation by entering each month:
Select
  201906,
  count(distinct a.cust_num)
From
  active_users a
Where
  to_date('20190630', 'yyyymmdd') between a.start_date and nvl(a.end_date, date '9999-12-31')
union all
Select
  201905,
  count(distinct a.cust_num)
From
  active_users a
Where
  to_date('20190531', 'yyyymmdd') between a.start_date and nvl(a.end_date, date '9999-12-31')
union all
...
This is not very efficient or sustainable if I want to cover 10 years, i.e. 120 months.
Any help is welcome. Thanks a lot!
This query shows the active-user-count effective as-of the end of the month.
How it works:
Convert each input row (with StartDate and EndDate values) into two rows that represent the points in time when the active-user-count incremented (on StartDate) and decremented (on EndDate). We need to convert NULL to a far-off date value because NULL values are sorted before, rather than after, non-NULL values.
This makes your data look like this:
OnThisDate Change
2018-01-01 1
2019-01-01 -1
2018-01-01 1
9999-12-31 -1
2019-01-01 1
2019-06-01 -1
2017-01-01 1
2019-03-01 -1
Then we simply SUM OVER the Change values (after sorting) to get the active-user-count as of that specific date:
So first, sort by OnThisDate:
OnThisDate Change
2017-01-01 1
2018-01-01 1
2018-01-01 1
2019-01-01 1
2019-01-01 -1
2019-03-01 -1
2019-06-01 -1
9999-12-31 -1
Then SUM OVER:
OnThisDate ActiveCount
2017-01-01 1
2018-01-01 2
2018-01-01 3
2019-01-01 4
2019-01-01 3
2019-03-01 2
2019-06-01 1
9999-12-31 0
Then we PARTITION (not group!) the rows by month and sort them by their date so we can identify the last ActiveCount row for that month (this actually happens in the WHERE of the outermost query, using ROW_NUMBER() and COUNT() for each month PARTITION):
OnThisDate ActiveCount IsLastInMonth
2017-01-01 1 1
2018-01-01 2 0
2018-01-01 3 1
2019-01-01 4 0
2019-01-01 3 1
2019-03-01 2 1
2019-06-01 1 1
9999-12-31 0 1
Then filter on that where IsLastInMonth = 1 (actually, where ROW_NUMBER() = COUNT(*) inside each PARTITION) to give us the final output data:
At-end-of-month Active-count
2017-01 1
2018-01 3
2019-01 3
2019-03 2
2019-06 1
9999-12 0
This does result in "gaps" in the result-set because the At-end-of-month column only shows rows where the Active-count value actually changed rather than including all possible calendar months - but that's ideal (as far as I'm concerned) because it excludes redundant data. Filling in the gaps can be done inside your application code by simply repeating output rows for each additional month until it reaches the next At-end-of-month value.
Here's the query using T-SQL on SQL Server (I don't have access to Oracle right now). And here's the SQLFiddle I used to come to a solution: http://sqlfiddle.com/#!18/ad68b7/24
SELECT
OtdYear,
OtdMonth,
ActiveCount
FROM
(
-- This query adds columns to indicate which row is the last-row-in-month ( where RowInMonth == RowsInMonth )
SELECT
OnThisDate,
OtdYear,
OtdMonth,
ROW_NUMBER() OVER ( PARTITION BY OtdYear, OtdMonth ORDER BY OnThisDate ) AS RowInMonth,
COUNT(*) OVER ( PARTITION BY OtdYear, OtdMonth ) AS RowsInMonth,
ActiveCount
FROM
(
SELECT
OnThisDate,
YEAR( OnThisDate ) AS OtdYear,
MONTH( OnThisDate ) AS OtdMonth,
SUM( [Change] ) OVER ( ORDER BY OnThisDate ASC ) AS ActiveCount
FROM
(
SELECT
StartDate AS [OnThisDate],
1 AS [Change]
FROM
tbl
UNION ALL
SELECT
ISNULL( EndDate, DATEFROMPARTS( 9999, 12, 31 ) ) AS [OnThisDate],
-1 AS [Change]
FROM
tbl
) AS sq1
) AS sq2
) AS sq3
WHERE
RowInMonth = RowsInMonth
ORDER BY
OtdYear,
OtdMonth
This query can be flattened into fewer nested queries by using aggregate and window functions directly instead of using aliases (like OtdYear, ActiveCount, etc) but that would make the query much harder to understand.
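The steps described above can also be traced in a few lines of Python, which may help when checking the intermediate tables. This is only a sketch of the same +1/-1 event idea on the question's four rows; `active_at_month_end` and `FAR_FUTURE` are names invented here.

```python
from datetime import date

FAR_FUTURE = date(9999, 12, 31)  # stand-in for a NULL end date

def active_at_month_end(rows):
    """rows: (start_date, end_date or None) tuples.
    Returns (year, month, active_count) for each month that has an event."""
    events = []
    for start, end in rows:
        events.append((start, 1))                                    # user becomes active
        events.append((end if end is not None else FAR_FUTURE, -1))  # user leaves
    events.sort(key=lambda e: e[0])

    running = 0
    last_in_month = {}
    for on_date, change in events:
        running += change
        # Later events in the same month overwrite earlier ones,
        # so only the last running total of each month survives.
        last_in_month[(on_date.year, on_date.month)] = running
    return [(y, m, n) for (y, m), n in sorted(last_in_month.items())]

rows = [
    (date(2018, 1, 1), date(2019, 1, 1)),
    (date(2018, 1, 1), None),
    (date(2019, 1, 1), date(2019, 6, 1)),
    (date(2017, 1, 1), date(2019, 3, 1)),
]
summary = active_at_month_end(rows)
for row in summary:
    print(row)
```

The output has the same six rows as the final table above, including the placeholder row for 9999-12.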
I have created a query that returns a result for every month, from the minimum start date in the table to the maximum end date.
You can restrict the range by adding a condition to the WHERE clause.
-- table creation
CREATE TABLE ACTIVE_USERS (CUST_NUM NUMBER, START_DATE DATE, END_DATE DATE);
-- data creation
INSERT INTO ACTIVE_USERS
SELECT * FROM
(
  SELECT 1, DATE '2018-01-01', DATE '2019-01-01' FROM DUAL UNION ALL
  SELECT 2, DATE '2018-01-01', NULL FROM DUAL UNION ALL
  SELECT 3, DATE '2019-01-01', DATE '2019-06-01' FROM DUAL UNION ALL
  SELECT 4, DATE '2017-01-01', DATE '2019-03-01' FROM DUAL
);
-- data in the actual table
SELECT * FROM ACTIVE_USERS ORDER BY CUST_NUM;
CUST_NUM START_DATE END_DATE
---------- ---------- ----------
1 2018-01-01 2019-01-01
2 2018-01-01
3 2019-01-01 2019-06-01
4 2017-01-01 2019-03-01
Query to fetch desired result
WITH CTE ( START_DATE, END_DATE ) AS
(
SELECT
ADD_MONTHS( START_DATE, LEVEL - 1 ),
ADD_MONTHS( START_DATE, LEVEL ) - 1
FROM
(
SELECT
MIN( START_DATE ) AS START_DATE,
MAX( END_DATE ) AS END_DATE
FROM
ACTIVE_USERS
)
CONNECT BY LEVEL <= CEIL( MONTHS_BETWEEN( END_DATE, START_DATE ) ) + 1
)
--
--
SELECT
C.START_DATE,
COUNT(1) AS CNT
FROM
CTE C
JOIN ACTIVE_USERS D ON
(
C.END_DATE BETWEEN
D.START_DATE
AND
CASE
WHEN D.END_DATE IS NOT NULL THEN D.END_DATE
ELSE C.END_DATE
END
)
GROUP BY
C.START_DATE
ORDER BY
C.START_DATE;
-- output --
START_DATE CNT
---------- ----------
2017-01-01 1
2017-02-01 1
2017-03-01 1
2017-04-01 1
2017-05-01 1
2017-06-01 1
2017-07-01 1
2017-08-01 1
2017-09-01 1
2017-10-01 1
2017-11-01 1
2017-12-01 1
2018-01-01 3
2018-02-01 3
2018-03-01 3
2018-04-01 3
2018-05-01 3
2018-06-01 3
2018-07-01 3
2018-08-01 3
2018-09-01 3
2018-10-01 3
2018-11-01 3
2018-12-01 3
2019-01-01 3
2019-02-01 3
2019-03-01 2
2019-04-01 2
2019-05-01 2
2019-06-01 1
30 rows selected.
Cheers!!
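For comparison, the calendar-join approach from this last answer can be sketched in Python as well: generate one row per month between the earliest start and the latest end date, then count the users whose range contains that month's last day. The helper names (`month_starts`, `month_end`, `active_per_month`) are hypothetical. Like the inner join in the query, the sketch skips months with no active user.

```python
from datetime import date, timedelta

def month_starts(first, last):
    """Yield the first day of every month from first's month to last's month."""
    y, m = first.year, first.month
    while (y, m) <= (last.year, last.month):
        yield date(y, m, 1)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

def month_end(start):
    """Last day of the month that begins on start."""
    nxt = date(start.year + 1, 1, 1) if start.month == 12 else date(start.year, start.month + 1, 1)
    return nxt - timedelta(days=1)

def active_per_month(users):
    """users: (cust_num, start_date, end_date or None) tuples."""
    first = min(s for _, s, _ in users)
    last = max(e for _, _, e in users if e is not None)
    counts = []
    for ms in month_starts(first, last):
        me = month_end(ms)
        # An open-ended (None) end date counts as still active.
        n = sum(1 for _, s, e in users if s <= me and (e is None or me <= e))
        if n > 0:  # the SQL inner join drops months without any match
            counts.append((ms, n))
    return counts

users = [
    (1, date(2018, 1, 1), date(2019, 1, 1)),
    (2, date(2018, 1, 1), None),
    (3, date(2019, 1, 1), date(2019, 6, 1)),
    (4, date(2017, 1, 1), date(2019, 3, 1)),
]
counts = active_per_month(users)
for month, n in counts:
    print(month, n)
```

On the sample data this produces 30 rows, from 2017-01-01 through 2019-06-01, with the same counts as the query output.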