Flattening date intervals in SQL

I have a database table where there are three columns that are essential to this question:
A group ID, that groups rows together
A start date
An end date
I want to make a view from this table so that overlapping date intervals, that have the same grouping ID, are flattened.
Date intervals that are not overlapping shall not be flattened.
Example:
Group ID Start End
1 2016-01-01 2017-12-31
1 2016-06-01 2020-01-01
1 2022-08-31 2030-12-31
2 2010-03-01 2017-01-01
2 2012-01-01 2013-12-31
3 2001-01-01 9999-13-31
...becomes...
Group ID Start End
1 2016-01-01 2020-01-01
1 2022-08-31 2030-12-31
2 2010-03-01 2017-01-01
3 2001-01-01 9999-12-31
Intervals that overlap may do so in any way, completely enclosed by other intervals, or they may be staggered, or they may even have the same start and/or end dates.
There are few similar ids. Commonly (> 95%) there is only one row with a particular group ID. There are about a thousand IDs that show up in two rows; a handful of IDs that exist in three rows; none that are in four rows or more.
But I need to be prepared for group IDs that appear in four or more rows.
How can I write an SQL statement that creates a view that shows the table flattened this way?
Do note that every row also has a unique ID. This does not need to be preserved in any way, but in case it helps when writing the SQL, I am letting you know.

First, find the intervals that are not a continuation of an overlapping sequence (that is, the ones that start a new sequence):
select *
from dateclap d1
where not exists(
select *
from dateclap d2
where d2.group_id=d1.group_id and
d2.end_date >= d1.start_date and
(d2.start_date < d1.start_date or
(d1.start_date=d2.start_date and d2.r_id<d1.r_id)))
The last line distinguishes intervals that start at the same date/time by ordering them on the unique record ID (r_id).
Then, for each such record, a hierarchical query collects the records chained to it, with connect_by_root r_id identifying each group of overlapping records (it is the ID of the root record of the group). After that, all we need is the min/max per group:
select group_id, min(start_date) as start_date, max(end_date) as end_date
from dateclap d1
start with not exists(
select *
from dateclap d2
where d2.group_id=d1.group_id and
d2.end_date >= d1.start_date and
(d2.start_date < d1.start_date or
(d1.start_date=d2.start_date and d2.r_id<d1.r_id)))
connect by nocycle
prior group_id=group_id and
start_date between prior start_date and prior end_date
group by group_id, connect_by_root r_id
Note the nocycle here: it is a dirty trick to avoid exceptions, because the connect condition is weak and in fact tries to connect each record to itself. You can refine the condition after "connect by", similarly to the "exists" condition, to avoid using nocycle.
P.S. Table was created for tests like this:
CREATE TABLE "ANIKIN"."DATECLAP"
(
"R_ID" NUMBER,
"GROUP_ID" NUMBER,
"START_DATE" DATE,
"END_DATE" DATE
) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "ANIKIN" ;
A unique key (or, more likely, a primary key) on r_id and the corresponding sequence/trigger are not specific to these tests; just populate r_id with unique values.

select t1.group_id,
least(min(t1.start_date), min(t2.start_date)) as start_date,
greatest(max(t1.end_date), max(t2.end_date)) as end_date
from test_interval t1, test_interval t2
where (t1.start_date, t1.end_date) overlaps (t2.start_date, t2.end_date)
and t1.rowid <> t2.rowid
and t1.group_id = t2.group_id
group by t1.group_id;
This query gives me the list of overlapping intervals. OVERLAPS is an undocumented operator in Oracle. I only wonder whether it returns a wrong result when a group contains two pairs of intervals that overlap within each pair but not with each other.
Where I used rowid, you can use your own unique row identifier.
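The doubt raised above can be checked outside the database. The following is a minimal Python sketch that emulates the self-join followed by a single min/max per group; the function name `pairwise_bounds`, the use of plain integers as day numbers, and the closed-interval overlap test are all assumptions made for illustration (the exact boundary semantics of OVERLAPS are not documented).

```python
def pairwise_bounds(intervals):
    """Emulate: self-join one group's rows on overlap, then take one
    min(start) / max(end) over all matching pairs (GROUP BY group_id only).

    intervals: list of (start, end) day numbers belonging to ONE group.
    Returns a single (start, end) tuple, or None when no pair overlaps.
    """
    starts, ends = [], []
    for i, (s1, e1) in enumerate(intervals):
        for j, (s2, e2) in enumerate(intervals):
            # Assumed overlap test: closed intervals, different rows.
            if i != j and s1 <= e2 and s2 <= e1:
                starts += [s1, s2]
                ends += [e1, e2]
    return (min(starts), max(ends)) if starts else None


# Two separate clusters inside the same group: {1-5, 4-8} and {20-25, 24-30}.
two_clusters = pairwise_bounds([(1, 5), (4, 8), (20, 25), (24, 30)])
# → (1, 30): both clusters collapse into one row.

# A single interval with no overlapping partner produces no row at all.
lonely = pairwise_bounds([(1, 5)])
# → None
```

Under these assumptions, grouping by group_id alone merges unrelated clusters and drops intervals that overlap nothing, so the query would need an extra grouping key per cluster and a union with the non-overlapping rows.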

Create 2 functions that return the flattened start- and end-date for a specific element:
CREATE OR REPLACE FUNCTION getMinStartDate
(
p_group_id IN NUMBER,
p_start IN DATE
)
RETURN DATE AS
v_result DATE;
BEGIN
SELECT MIN(start_date)
INTO v_result
FROM my_data
WHERE group_id = p_group_id
AND start_date <= p_start
AND end_date >= p_start;
RETURN v_result;
END getMinStartDate;
CREATE OR REPLACE FUNCTION getMaxEndDate
(
p_group_id IN NUMBER,
p_end IN DATE
)
RETURN DATE AS
v_result DATE;
BEGIN
SELECT MAX(end_date)
INTO v_result
FROM my_data
WHERE group_id = p_group_id
AND start_date <= p_end
AND end_date >= p_end;
RETURN v_result;
END getMaxEndDate;
Your view should then return these flattened dates for each element.
Use DISTINCT, of course, since several elements may result in the same dates:
SELECT DISTINCT
group_id,
getMinStartDate(group_id, start_date) AS start_date,
getMaxEndDate(group_id, end_date) AS end_date
FROM my_data;
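It is worth checking how this one-step lookup behaves when three or more intervals form a chain. Below is a small Python sketch that mimics the two functions and the DISTINCT view for a single group; `flattened_rows` and the integer day numbers are hypothetical stand-ins, not part of the original answer.

```python
def flattened_rows(intervals):
    """Mimic getMinStartDate / getMaxEndDate plus SELECT DISTINCT
    for the rows of one group. intervals: list of (start, end)."""

    def min_start(p_start):
        # MIN(start_date) over rows whose interval contains p_start
        return min(s for s, e in intervals if s <= p_start <= e)

    def max_end(p_end):
        # MAX(end_date) over rows whose interval contains p_end
        return max(e for s, e in intervals if s <= p_end <= e)

    return sorted({(min_start(s), max_end(e)) for s, e in intervals})


pair = flattened_rows([(1, 5), (4, 8)])
# → [(1, 8)]

chain = flattened_rows([(1, 5), (4, 8), (7, 10)])
# → [(1, 8), (1, 10), (4, 10)]
```

In this sketch two overlapping intervals flatten correctly, but a chain of three yields three partially merged rows instead of the single interval 1 to 10, because each lookup only looks one step away. With the data described in the question (almost never more than three rows per group) this may rarely matter, but it is a limitation to keep in mind.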

The input data shows an end date of 9999-13-31 in the last row. That should be corrected.
With that said, it is best to choose a made-up end date that is not exactly 9999-12-31. In many problems one needs to add a day, or a couple of weeks, or whatever, to all the dates in a table; but if one tries to add to 9999-12-31, that will fail. I prefer 8999-12-31; one thousand years should be enough for most computations. {:-) In the test data I created for my query I used this convention. (The solution can be easily adapted for 9999-12-31 though.)
When working with datetime intervals, remember that a pure date means midnight at the beginning of a day. So the year 2016 has the "end date" 2017-01-01 (midnight at the beginning of the day) and the year 2017 has the "start date" 2017-01-01 also. So the table SHOULD have the same end-date and start-date for periods that immediately follow each other - and they should be fused together into a single interval. However, an interval ending on 2016-08-31 and one that begins on 2016-09-01 should NOT be fused together; they are separated by a full day (specifically the day of 2016-08-31 is NOT included in either interval).
The OP did not specify how the end-dates are meant to be interpreted here. I assume they are as described in the last paragraph; otherwise the solution can be easily adapted (but it will require adding 1 to end dates first, and then subtracting 1 at the end - this is exactly one of those cases when 9999-12-31 is not a good placeholder for "unknown".)
Solution:
with m as
(
select group_id, start_date,
max(end_date) over (partition by group_id order by start_date
rows between unbounded preceding and 1 preceding) as m_time
from inputs -- "inputs" is the name of the base table
union all
select group_id, NULL, max(end_date) from inputs group by group_id
),
n as
(
select group_id, start_date, m_time
from m
where start_date > m_time or start_date is null or m_time is null
),
f as
(
select group_id, start_date,
lead(m_time) over (partition by group_id order by start_date) as end_date
from n
)
select * from f where start_date is not null
;
Output (with the data provided):
GROUP_ID START_DATE END_DATE
---------- ---------- ----------
1 2016-01-01 2020-01-01
1 2022-08-31 2030-12-31
2 2010-03-01 2017-01-01
3 2001-01-01 8999-12-31
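For readers who want to follow the logic outside the database, here is a minimal Python sketch of the same idea: sort by start date within each group, carry the running maximum of the end dates, and start a new interval only when a start date is strictly later than that running maximum. The function name `flatten` is hypothetical, and the group 3 end date uses the 8999-12-31 convention from this answer.

```python
from datetime import date
from itertools import groupby


def flatten(rows):
    """rows: iterable of (group_id, start, end).
    Returns merged (group_id, start, end) tuples, ordered by group and start."""
    merged = []
    ordered = sorted(rows)  # by group_id, then start, then end
    for group_id, group_rows in groupby(ordered, key=lambda r: r[0]):
        cur_start = cur_end = None
        for _, start, end in group_rows:
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start > cur_end:
                # Strictly after everything seen so far: close the interval.
                merged.append((group_id, cur_start, cur_end))
                cur_start, cur_end = start, end
            else:
                # Overlapping or touching: extend the running maximum.
                cur_end = max(cur_end, end)
        merged.append((group_id, cur_start, cur_end))
    return merged


rows = [
    (1, date(2016, 1, 1), date(2017, 12, 31)),
    (1, date(2016, 6, 1), date(2020, 1, 1)),
    (1, date(2022, 8, 31), date(2030, 12, 31)),
    (2, date(2010, 3, 1), date(2017, 1, 1)),
    (2, date(2012, 1, 1), date(2013, 12, 31)),
    (3, date(2001, 1, 1), date(8999, 12, 31)),
]
result = flatten(rows)
```

The `start > cur_end` test corresponds to the `start_date > m_time` filter in the query, so an interval that starts exactly on the previous end date is fused rather than kept separate.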

Related

Find Intersection Between Date Ranges In PostgreSQL

I have records with a two dates check_in and check_out, I want to know the ranges when more than one person was checked in at the same time.
So if I have the following checkin / checkouts:
Person A: 1PM - 6PM
Person B: 3PM - 10PM
Person C: 9PM - 11PM
I would want to get 3PM - 6PM (Overlap of person A and B) and 9PM - 10PM (overlap of person B and C).
I can write an algorithm to do this in linear time with code. Is it possible to do this via a relational query in linear time with PostgreSQL as well?
It needs to have a minimal response, meaning no overlapping ranges. So if there were a result which gave the ranges 6PM - 9PM and 8PM - 10PM, it would be incorrect. It should instead return 6PM - 10PM.
Assumptions
The solution heavily depends on the exact table definition including all constraints. For lack of information in the question I'll assume this table:
CREATE TABLE booking (
booking_id serial PRIMARY KEY
, check_in timestamptz NOT NULL
, check_out timestamptz NOT NULL
, CONSTRAINT valid_range CHECK (check_out > check_in)
);
So, no NULL values, only valid ranges with inclusive lower and exclusive upper bound, and we don't really care who checks in.
Also assuming a current version of Postgres, at least 9.2.
Query
One way to do it with only SQL using a UNION ALL and window functions:
SELECT ts AS check_in, next_ts AS check_out
FROM (
SELECT *, lead(ts) OVER (ORDER BY ts) AS next_ts
FROM (
SELECT *, lag(people_ct, 1 , 0) OVER (ORDER BY ts) AS prev_ct
FROM (
SELECT ts, sum(sum(change)) OVER (ORDER BY ts)::int AS people_ct
FROM (
SELECT check_in AS ts, 1 AS change FROM booking
UNION ALL
SELECT check_out, -1 FROM booking
) sub1
GROUP BY 1
) sub2
) sub3
WHERE people_ct > 1 AND prev_ct < 2 OR -- start overlap
people_ct < 2 AND prev_ct > 1 -- end overlap
) sub4
WHERE people_ct > 1 AND prev_ct < 2;
Explanation
In subquery sub1 derive a table of check_in and check_out in one column. check_in adds one to the crowd, check_out subtracts one.
In sub2 sum all events for the same point in time and compute a running count with a window function: that's the window function sum() over an aggregate sum() - and cast to integer or we get numeric from this:
sum(sum(change)) OVER (ORDER BY ts)::int
In sub3 look at the count of the previous row
In sub4 only keep rows where overlapping time ranges start and end, and pull the end of the time range into the same row with lead().
Finally, only keep rows, where time ranges start.
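The steps above can be sketched procedurally as well. This is a minimal Python version of the same event-counting idea, using plain hour numbers instead of timestamps; `overlap_ranges` and its `min_people` parameter are names made up for this illustration.

```python
from collections import defaultdict


def overlap_ranges(bookings, min_people=2):
    """bookings: list of (check_in, check_out).
    Returns the ranges during which at least min_people are checked in."""
    # sub1 + the inner sum(): net change in head count per point in time
    changes = defaultdict(int)
    for check_in, check_out in bookings:
        changes[check_in] += 1
        changes[check_out] -= 1

    ranges = []
    count = 0
    start = None
    for ts in sorted(changes):
        prev_count = count          # plays the role of lag(people_ct)
        count += changes[ts]        # running total, like sum(...) OVER (ORDER BY ts)
        if count >= min_people and prev_count < min_people:
            start = ts              # an overlap begins here
        elif count < min_people and prev_count >= min_people:
            ranges.append((start, ts))  # the overlap ends here
            start = None
    return ranges


# Person A: 1PM - 6PM, Person B: 3PM - 10PM, Person C: 9PM - 11PM
overlaps = overlap_ranges([(13, 18), (15, 22), (21, 23)])
# → [(15, 18), (21, 22)]
```

Apart from the sort, this is a single pass over the events, which is the same shape as the plpgsql walk-through suggested for better performance.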
To optimize performance I would walk through the table once in a plpgsql function like demonstrated in this related answer on dba.SE:
Calculate Difference in Overlapping Time in PostgreSQL / SSRS
The idea is to divide time into periods and store each period as a bit string with a specified granularity.
0 - nobody is checked in during that grain
1 - somebody is checked in during that grain
Let's assume that the granularity is 1 hour and the period is 1 day.
000000000000000000000000 means nobody is checked in that day
000000000000000000000110 means somebody is checked in between 21 and 23
000000000000011111000000 means somebody is checked in between 13 and 18
000000000000000111111100 means somebody is checked in between 15 and 22
After that we do a binary OR over all the values in the range and we have our answer.
000000000000011111111110
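As a quick illustration of the encoding, here is a small Python sketch that builds the three bit strings above and ORs them together. The helper names `day_mask` and `or_masks` are hypothetical, and the sketch assumes the leftmost character stands for hour 0 and that the check-out hour itself is not occupied.

```python
def day_mask(check_in, check_out, hours=24):
    """Bit string with a '1' for every hour h where check_in <= h < check_out."""
    return "".join("1" if check_in <= h < check_out else "0" for h in range(hours))


def or_masks(masks):
    """Bitwise OR of equally long bit strings, returned as a bit string."""
    combined = 0
    for mask in masks:
        combined |= int(mask, 2)
    return format(combined, "0{}b".format(len(masks[0])))


masks = [day_mask(13, 18), day_mask(15, 22), day_mask(21, 23)]
occupied = or_masks(masks)
# → '000000000000011111111110'
```

Note that the OR gives the hours in which at least one person is checked in. Hours with two or more people would need pairwise ANDs of the masks instead.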
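It can be done in linear time. Here is an example for Oracle, but it can be transformed to PostgreSQL easily. Note that bin2dec and dec2bin are not built-in Oracle functions; they are assumed to be user-defined helpers.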
with rec (checkin, checkout)
as ( select 13, 18 from dual
union all
select 15, 22 from dual
union all
select 21, 23 from dual )
,spanempty ( empt)
as ( select '000000000000000000000000' from dual) ,
spanfull( full)
as ( select '111111111111111111111111' from dual)
, bookingbin( binbook) as ( select substr(empt, 1, checkin) ||
substr(full, checkin, checkout-checkin) ||
substr(empt, checkout, 24-checkout)
from rec
cross join spanempty
cross join spanfull ),
bookingInt (rn, intbook) as
( select rownum, bin2dec(binbook) from bookingbin),
bitAndSum (bitAndSumm) as (
select sum(bitand(b1.intbook, b2.intbook)) from bookingInt b1
join bookingInt b2
on b1.rn = b2.rn -1 ) ,
SumAll (sumall) as (
select sum(bin2dec(binbook)) from bookingBin )
select lpad(dec2bin(sumall - bitAndSumm), 24, '0')
from SumAll, bitAndSum
Result:
000000000000011111111110

Aggregate continuous ranges of dates

Let's say you have the following PostgreSQL sparse table listing reservation dates:
CREATE TABLE reserved_dates (
reserved_days_id SERIAL NOT NULL,
reserved_date DATE NOT NULL
);
INSERT INTO reserved_dates (reserved_date) VALUES
('2014-10-11'),
('2014-10-12'),
('2014-10-13'),
-- gap
('2014-10-15'),
('2014-10-16'),
-- gap
('2014-10-18'),
-- gap
('2014-10-20'),
('2014-10-21');
How do you aggregate those dates into continuous date ranges (ranges without gaps)? Such as:
start_date | end_date
------------+------------
2014-10-11 | 2014-10-13
2014-10-15 | 2014-10-16
2014-10-18 | 2014-10-18
2014-10-20 | 2014-10-21
This is what I came up with so far, but I can only get start_date this way:
WITH reserved_date_ranges AS (
SELECT reserved_date,
reserved_date
- LAG(reserved_date) OVER (ORDER BY reserved_date) AS difference
FROM reserved_dates
)
SELECT *
FROM reserved_date_ranges
WHERE difference > 1 OR difference IS NULL;
SELECT min(reserved_date) AS start_date
, max(reserved_date) AS end_date
FROM (
SELECT reserved_date
, reserved_date - row_number() OVER (ORDER BY reserved_date)::int AS grp
FROM reserved_dates
) sub
GROUP BY grp
ORDER BY grp;
Compute gap-less serial numbers in chronological order with the window function row_number(). Duplicate dates are not allowed. (I added a UNIQUE constraint in the fiddle.)
If your reserved_days_id happens to be gap-less and in chronological order, you can use that directly instead. But that's typically not the case.
Subtract that from reserved_date in each row (after converting to integer). Consecutive days end up with the same date value grp - which has no other purpose or meaning than to form groups.
Aggregate in the outer query. Voilà.
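The same trick is easy to try in a few lines of Python: number the sorted dates, subtract the number (as days) from each date, and group on the result. This is only a sketch of the technique; `date_ranges` is a name chosen for this example, and duplicates are removed up front to mirror the UNIQUE constraint.

```python
from datetime import date, timedelta
from itertools import groupby


def date_ranges(dates):
    """Collapse a set of dates into (start_date, end_date) runs of consecutive days."""
    ordered = sorted(set(dates))  # duplicates would break the numbering
    # date - row_number: consecutive days all map to the same anchor date
    keyed = [(d - timedelta(days=i), d) for i, d in enumerate(ordered, start=1)]
    ranges = []
    for _, run in groupby(keyed, key=lambda pair: pair[0]):
        days = [d for _, d in run]
        ranges.append((days[0], days[-1]))
    return ranges


reserved = [date(2014, 10, n) for n in (11, 12, 13, 15, 16, 18, 20, 21)]
ranges = date_ranges(reserved)
```

With the sample data this yields the four ranges from the question: 11 to 13, 15 to 16, 18 to 18, and 20 to 21 October 2014.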
Similar cases:
Rank based on sequence of dates
Group by repeating attribute

PL/SQL Finding Difference Between Start and End Dates in Different Rows

I am trying to find the difference between start and end dates in different rows of a result set, using PL/SQL. Here is an example:
ID TERM START_DATE END_DATE
423 201420 26-AUG-13 13-DEC-13
423 201430 21-JAN-14 09-MAY-14
423 201440 16-JUN-14 07-AUG-14
For any specific ID, I need to get the difference between the end date in the first record and the start date of the second record. Similarly, I need to get the difference between the end date in the second record and the start date of the third record, and so forth.
Eventually I will need to perform the same operation on a variety of IDs. I am assuming I have to use a cursor and loop.
I would appreciate any help or suggestions on accomplishing this. Thanks in advance.
The "lead" analytic function in Oracle can grab a value from the succeeding row as a value in the current row.
Given a series of rows returned from a query and a position of the cursor, LEAD provides access to a row at a given physical offset beyond that position.
Here, this SQL grabs start_date from the next row and subtracts end_date from the current row.
select id, term, start_date, end_date,
lead(start_date) over (partition by id order by term) - end_date diff_in_days
from your_table;
Sample output:
ID TERM START_DATE END_DATE DIFF_IN_DAYS
---------- ---------- -------------------- -------------------- ------------
423 201420 26-AUG-2013 00:00:00 13-DEC-2013 00:00:00 39
423 201430 21-JAN-2014 00:00:00 09-MAY-2014 00:00:00 36
423 201440 14-JUN-2014 00:00:00 07-AUG-2014 00:00:00
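To make the behaviour of LEAD concrete, here is a rough Python equivalent that pairs every row with the next row of the same ID, ordered by term. The function name `gaps_to_next` is hypothetical, and the dates are the ones shown in the sample output above (which lists 14-JUN-2014 as the third start date).

```python
from datetime import date
from itertools import groupby


def gaps_to_next(rows):
    """rows: (id, term, start_date, end_date).
    Appends to each row the days between its end_date and the next
    term's start_date for the same id (None for the last term)."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))  # partition by id, order by term
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        partition = list(partition)
        for current, following in zip(partition, partition[1:] + [None]):
            diff = (following[2] - current[3]).days if following is not None else None
            out.append(current + (diff,))
    return out


rows = [
    (423, 201420, date(2013, 8, 26), date(2013, 12, 13)),
    (423, 201430, date(2014, 1, 21), date(2014, 5, 9)),
    (423, 201440, date(2014, 6, 14), date(2014, 8, 7)),
]
diffs = [r[4] for r in gaps_to_next(rows)]
# → [39, 36, None]
```

No cursor or explicit loop over IDs is needed in SQL; the PARTITION BY clause does the per-ID grouping that `groupby` does here.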
I would suggest looking at using the LEAD and LAG analytic functions from Oracle. By the sounds of it they should suit your needs.
See the docs here: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions074.htm
Code:
SELECT [ID], [TERM], [START_DATE], [END_DATE],
CASE WHEN MIN([END_DATE]) OVER(PARTITION BY [ID] ORDER BY [TERM] ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)=[END_DATE] THEN NULL ELSE
MIN([END_DATE]) OVER(PARTITION BY [ID] ORDER BY [TERM] ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)-[START_DATE] END AS [DAYS_BETWEEN]
FROM [TABLE]
This seemed to work:
SELECT DISTINCT
ID,
TERM_CODE,
TERM_START_DATE,
TERM_END_DATE,
LEAD(TERM_START_DATE, 1) OVER (PARTITION BY ID ORDER BY TERM_CODE) - TERM_END_DATE AS DIFF_DAYS
FROM your_table -- placeholder name; TABLE itself is a reserved word

SQL -- computing end dates from a given start date with arbitrary breaks

I have a table of 'semesters' of variable lengths with variable breaks in between them with a constraint such that a 'start_date' is always greater than the previous 'end_date':
id start_date end_date
-----------------------------
1 2012-10-01 2012-12-20
2 2013-01-05 2013-03-28
3 2013-04-05 2013-06-29
4 2013-07-10 2013-09-20
And a table of students as follows, where a start date may occur at any time within a given semester:
id start_date n_weeks
-------------------------
1 2012-11-15 25
2 2013-02-12 8
3 2013-03-02 12
I am attempting to compute an 'end_date' by joining the 'students' on 'semesters' which takes into account the variable-length breaks in-between semesters.
I can draw in the previous semester's end date (ie from the previous row's end_date) and by subtraction find the number of days in-between semesters using the following:
SELECT start_date
, end_date
, lag(end_date) OVER (ORDER BY start_date) AS prev_end_date
, start_date - lag(end_date) OVER (ORDER BY start_date) AS days_break
FROM terms
ORDER BY start_date;
Clearly, if there were to be only two terms, it would simply be a matter of adding the 'break' in days (perhaps, cast to 'weeks') -- and thereby extend the 'end_date' by that same period of time.
But should 'n_weeks' for a given student span more than one term, how could such a query be structured ?
Been banging my head against a wall for the last couple of days and I'd be immensely grateful for any help anyone would be able to offer....
Many thanks.
Rather than just looking at the lengths of semesters or the gaps between them, you could generate a list of all the dates that are within a semester using generate_series(), like this:
SELECT
row_number() OVER (ORDER BY day) as day_number,
day
FROM
(
SELECT
generate_series(start_date, end_date, '1 day') as day
FROM
semesters
) as day_series
ORDER BY
day
This assigns each day that is during a semester an arbitrary but sequential "day number", skipping out all the gaps between semesters.
You can then use this as a sub-query/CTE JOINed to your table of students: first find the "day number" of their start date, then add 7 * n_weeks to find the "day number" of their end date, and finally join back to find the actual date for that "day number".
This assumes that there is no special handling needed for partial weeks - i.e. if n_weeks is 4, the student must be enrolled for 28 days which are within the duration of a semester. The approach could be adapted to measure weeks (pass '1 week' as the last argument to generate_series()), with the additional step of finding which week the student's start_date falls into.
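The day-number idea can be sketched in Python before committing to a query. The helpers below (`semester_days` and `end_date`, both hypothetical names) list every in-semester day in order and then step forward 7 * n_weeks positions, which skips the breaks automatically.

```python
from datetime import date, timedelta


def semester_days(semesters):
    """Every date that falls inside a semester, in chronological order.
    The list index plays the role of the sequential day number."""
    days = []
    for start, end in sorted(semesters):
        day = start
        while day <= end:
            days.append(day)
            day += timedelta(days=1)
    return days


def end_date(days, start, n_weeks):
    """Date reached after 7 * n_weeks in-semester days.
    Raises ValueError if start is not inside a semester and
    IndexError if the course would run past the last semester."""
    return days[days.index(start) + 7 * n_weeks]


semesters = [
    (date(2012, 10, 1), date(2012, 12, 20)),
    (date(2013, 1, 5), date(2013, 3, 28)),
    (date(2013, 4, 5), date(2013, 6, 29)),
    (date(2013, 7, 10), date(2013, 9, 20)),
]
days = semester_days(semesters)
student_ends = [
    end_date(days, date(2012, 11, 15), 25),
    end_date(days, date(2013, 2, 12), 8),
    end_date(days, date(2013, 3, 2), 12),
]
```

As in the SQL version, a start date that falls inside a break has no day number, so such a student would need separate handling (an error here, a missing row in the join).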
Here's a complete query:
WITH semester_days AS
(
SELECT
semester_id,
row_number() OVER (ORDER BY day_date) as day_number,
day_date::date
FROM
(
SELECT
id as semester_id,
generate_series(start_date, end_date, '1 day') as day_date
FROM
semesters
) as day_series
ORDER BY
day_date
)
SELECT
S.id as student_id,
S.start_date,
SD_start.semester_id as start_semester_id,
S.n_weeks,
SD_end.day_date as end_date,
SD_end.semester_id as end_semester_id
FROM
students as S
JOIN
semester_days as SD_start
On SD_start.day_date = S.start_date
JOIN
semester_days as SD_end
On SD_end.day_number = SD_start.day_number + (7 * S.n_weeks)
ORDER BY
S.start_date

Oracle SQL - Putting together potentially contradictory or overlapping date ranges

I have a table like this:
Id Begin_Date End_date
1 01-JAN-12 05-JAN-12
1 01-FEB-12 01-MAR-12
1 15-FEB-12 05-MAR-12
For a given Id, it gives a set of date ranges. Let's say that if a date is between the begin and end date for that Id, then that Id is "on". Otherwise, "off"
The problem here is these last two rows -- the date ranges overlap and contradict each other. The second row claims that 1 was "on" between 01-FEB-12 and 01-MAR-12, but the third row claims that 1 was off through 14-FEB-12. Similarly, the second row claims that 1 was off on 02-MAR-12, but row 3 claims it was on.
The reconciliation logic I'd like to apply is that, in cases of contradictions, pick the earliest possible begin date and the earliest possible end date after it. The result would therefore be:
Id Begin_Date End_date
1 01-JAN-12 05-JAN-12
1 01-FEB-12 01-MAR-12
I was able to pull this off with the lag analytical function, but I ran into difficulty with other use cases. Take this input data set.
Id Begin_Date End_date
1 01-JAN-12 10-JAN-12
1 5-JAN-12 8-JAN-12
1 12-JAN-12 15-JAN-12
1 1-JAN-12 14-JAN-12
What I expect here as output is:
Id Begin_Date End_date
1 01-JAN-12 8-JAN-12
1 12-JAN-12 14-JAN-12
...because the first row is the earliest begin date, and its end date is the earliest end date after that. The next row is the earliest begin date after the previous end date, and the end date of that row is the earliest end date after that. There are no begin dates after 14-JAN-12, so I'm done.
I'm having very little luck solving this problem. One approach I tried was getting the rank partitioned by id and compare it to the max rank. I then used the lag function to compare to previous ranks. However, this strategy totally fails for use cases above.
Any suggestions?
Well, the critical requirement rests on this:
The reconciliation logic I'd like to apply is that, in cases of
contradictions, pick the earliest possible begin date and the earliest
possible end date after it.
CREATE TABLE table1
(
id INT,
DateStart DATE,
DateEnd DATE
);
INSERT INTO table1
VALUES
(1, TO_DATE('20110101','YYYYMMdd'), TO_DATE('20110110','YYYYMMdd'));
INSERT INTO table1
VALUES
(2, TO_DATE('20110105','YYYYMMdd'), TO_DATE('20110108','YYYYMMdd'));
INSERT INTO table1
VALUES
(3, TO_DATE('20110112','YYYYMMdd'), TO_DATE('20110115','YYYYMMdd'));
INSERT INTO table1
VALUES
(4, TO_DATE('20110101','YYYYMMdd'), TO_DATE('20110114','YYYYMMdd'));
INSERT INTO table1
VALUES
(5, TO_DATE('20110206','YYYYMMdd'), TO_DATE('20110208','YYYYMMdd'));
INSERT INTO table1
VALUES
(6, TO_DATE('20110201','YYYYMMdd'), TO_DATE('20110207','YYYYMMdd'));
The select statement:
SELECT ID, DATESTART, DATEEND
FROM
(
SELECT ID, TYPE, DATES AS DATESTART,
LEAD(DATES) OVER (ORDER BY DATES) AS DATEEND
FROM
(
SELECT ID, TYPE,DATES,
LAG(ID) OVER (ORDER BY DATES) AS LASTID,
LAG(TYPE) OVER (ORDER BY DATES) AS LASTTYPE,
LAG(DATES) OVER (ORDER BY DATES) AS LASTDATES
FROM
(
SELECT ID,'START' AS TYPE,DATESTART AS DATES
FROM table1
UNION ALL
SELECT ID,'END',DATEEND
FROM table1
)
) H
WHERE TYPE != LASTTYPE OR LASTTYPE IS NULL
)
WHERE TYPE = 'START'
ORDER BY DATESTART
Here's a step by step for each subquery:
explode each row's start date and end date into one column
copy the previous row's values into the current row using LAG
filter out the rows that are in the middle (e.g. from 1,2,3,4 remove 2,3)
get the end date from the next row, because the remaining rows are either first or last rows
extract only the useful rows, i.e. those with TYPE = START
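The steps above can be traced with a short Python sketch. It is only an approximation of the query: `reconcile` is a hypothetical name, and the sketch sorts START before END when two events share a date, a tie-break the SQL leaves unspecified.

```python
from datetime import date


def reconcile(ranges):
    """ranges: list of (date_start, date_end). Returns the reconciled ranges."""
    # Step 1: explode every range into a START event and an END event.
    events = sorted(
        [(begin, 0, "START") for begin, _ in ranges]
        + [(end, 1, "END") for _, end in ranges]
    )
    # Steps 2-3: keep an event only when its type differs from the previous event's type.
    kept = []
    last_type = None
    for when, _, kind in events:
        if kind != last_type:
            kept.append((when, kind))
        last_type = kind
    # Steps 4-5: pair each surviving START with the event that follows it.
    return [
        (kept[i][0], kept[i + 1][0])
        for i in range(len(kept) - 1)
        if kept[i][1] == "START"
    ]


table1 = [
    (date(2011, 1, 1), date(2011, 1, 10)),
    (date(2011, 1, 5), date(2011, 1, 8)),
    (date(2011, 1, 12), date(2011, 1, 15)),
    (date(2011, 1, 1), date(2011, 1, 14)),
    (date(2011, 2, 6), date(2011, 2, 8)),
    (date(2011, 2, 1), date(2011, 2, 7)),
]
reconciled = reconcile(table1)
```

For the six test rows this gives 01-JAN to 08-JAN, 12-JAN to 14-JAN, and 01-FEB to 07-FEB, i.e. each range runs from the earliest begin date to the earliest end date after it.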
For the second data set:
Id Begin_Date End_date
1 01-JAN-12 10-JAN-12
1 5-JAN-12 8-JAN-12
1 12-JAN-12 15-JAN-12
1 1-JAN-12 14-JAN-12
After your reconciliation logic, the result would be:
Id Begin_Date End_date
1 01-JAN-12 8-JAN-12 (includes the rows 1,2 and 4 -> minimum begin_date is 1-JAN, minimum end_date is 8-JAN)
1 12-JAN-12 15-JAN-12 (includes row 3)