Aggregate continuous ranges of dates - sql

Let's say you have the following PostgreSQL sparse table listing reservation dates:
CREATE TABLE reserved_dates (
reserved_days_id SERIAL NOT NULL,
reserved_date DATE NOT NULL
);
INSERT INTO reserved_dates (reserved_date) VALUES
('2014-10-11'),
('2014-10-12'),
('2014-10-13'),
-- gap
('2014-10-15'),
('2014-10-16'),
-- gap
('2014-10-18'),
-- gap
('2014-10-20'),
('2014-10-21');
How do you aggregate those dates into continuous date ranges (ranges without gaps)? Such as:
start_date | end_date
------------+------------
2014-10-11 | 2014-10-13
2014-10-15 | 2014-10-16
2014-10-18 | 2014-10-18
2014-10-20 | 2014-10-21
This is what I came up with so far, but I can only get start_date this way:
WITH reserved_date_ranges AS (
SELECT reserved_date,
reserved_date
- LAG(reserved_date) OVER (ORDER BY reserved_date) AS difference
FROM reserved_dates
)
SELECT *
FROM reserved_date_ranges
WHERE difference > 1 OR difference IS NULL;

SELECT min(reserved_date) AS start_date
, max(reserved_date) AS end_date
FROM (
SELECT reserved_date
, reserved_date - row_number() OVER (ORDER BY reserved_date)::int AS grp
FROM reserved_dates
) sub
GROUP BY grp
ORDER BY grp;
Compute gap-less serial numbers in chronological order with the window function row_number(). Duplicate dates are not allowed. (I added a UNIQUE constraint in the fiddle.)
If your reserved_days_id happens to be gap-less and in chronological order, you can use that directly instead. But that's typically not the case.
Subtract that from reserved_date in each row (after converting to integer). Consecutive days end up with the same date value grp - which has no other purpose or meaning than to form groups.
Aggregate in the outer query. Voilà.
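The same "date minus row number" idea can be sketched outside SQL. This is a minimal Python illustration, not part of the original answer; the function name date_ranges is made up for the example, and it assumes the dates are distinct, as the answer requires.

```python
from datetime import date, timedelta
from itertools import groupby


def date_ranges(dates):
    """Collapse distinct dates into (start, end) runs of consecutive days."""
    ordered = sorted(dates)
    # Subtracting the 1-based position from each date gives the same anchor
    # value for every date in a run of consecutive days (the "grp" column).
    keyed = [(d - timedelta(days=i), d) for i, d in enumerate(ordered, start=1)]
    ranges = []
    for _, run in groupby(keyed, key=lambda pair: pair[0]):
        days = [d for _, d in run]
        ranges.append((days[0], days[-1]))
    return ranges


# Sample dates from the question.
reserved = [date(2014, 10, n) for n in (11, 12, 13, 15, 16, 18, 20, 21)]
for start, end in date_ranges(reserved):
    print(start, end)
```

Because the dates are distinct and sorted, the anchor value never decreases, so equal anchors are always adjacent and a single pass with groupby is enough.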
db<>fiddle here
Old sqlfiddle
Similar cases:
Rank based on sequence of dates
Group by repeating attribute

Related

SQL to find sum of total days in a window for a series of changes

Following is the table:
start_date | recorded_date | id
------------+---------------+----
2021-11-10 | 2021-11-01 | 1a
2021-11-08 | 2021-11-02 | 1a
2021-11-11 | 2021-11-03 | 1a
2021-11-10 | 2021-11-04 | 1a
2021-11-10 | 2021-11-05 | 1a
I need a query to find the total day changes in aggregate for a given id. In this case, it changed from 10th Nov to 8th Nov so 2 days, then again from 8th to 11th Nov so 3 days and again from 11th to 10th for a day, and finally from 10th to 10th, that is 0 days.
In total there is a change of 2+3+1+0 = 6 days for the id - '1a'.
Basically for each change there is a recorded_date, so we arrange that in ascending order and then calculate the aggregate change of days grouped by id. The final result should be like:
id | Agg_Change
----+------------
1a | 6
Is there a way to do this using SQL? I am using a Vertica database.
Thanks.
You can use the window function LEAD to get the difference between consecutive rows, and then group by id:
select id, sum(daydiff) Agg_Change
from (
select id, abs(datediff(day, start_Date, lead(start_date,1,start_date) over (partition by id order by recorded_date))) as daydiff
from tablename
) t group by id
Indeed, use LAG() in an analytic (OLAP) query to get the previous date, then let an outer query take the absolute date difference and sum it, grouping by id:
WITH
-- your input - don't use in real query ...
indata(start_date,recorded_date,id) AS (
SELECT DATE '2021-11-10',DATE '2021-11-01','1a'
UNION ALL SELECT DATE '2021-11-08',DATE '2021-11-02','1a'
UNION ALL SELECT DATE '2021-11-11',DATE '2021-11-03','1a'
UNION ALL SELECT DATE '2021-11-10',DATE '2021-11-04','1a'
UNION ALL SELECT DATE '2021-11-10',DATE '2021-11-05','1a'
)
-- real query starts here, replace following comma with "WITH" ...
,
w_lag AS (
SELECT
id
, start_date
, LAG(start_date) OVER w AS prevdt
FROM indata
WINDOW w AS (PARTITION BY id ORDER BY recorded_date)
)
SELECT
id
, SUM(ABS(DATEDIFF(DAY,start_date,prevdt))) AS dtdiff
FROM w_lag
GROUP BY id
-- out id | dtdiff
-- out ----+--------
-- out 1a | 6
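To check the arithmetic independently of the database, the LAG-and-sum logic can be mirrored in a few lines of Python. This is only an illustrative sketch; the function name aggregate_change is hypothetical, and the rows are the sample data from the question.

```python
from collections import defaultdict
from datetime import date


def aggregate_change(rows):
    """rows: iterable of (start_date, recorded_date, id) tuples.

    Returns {id: sum of absolute day differences between consecutive
    start_date values, ordered by recorded_date}.
    """
    by_id = defaultdict(list)
    for start_date, recorded_date, row_id in rows:
        by_id[row_id].append((recorded_date, start_date))
    totals = {}
    for row_id, entries in by_id.items():
        entries.sort()  # chronological order by recorded_date
        starts = [start for _, start in entries]
        # Pair each start_date with its predecessor, like LAG() does.
        totals[row_id] = sum(
            abs((cur - prev).days) for prev, cur in zip(starts, starts[1:])
        )
    return totals


indata = [
    (date(2021, 11, 10), date(2021, 11, 1), "1a"),
    (date(2021, 11, 8), date(2021, 11, 2), "1a"),
    (date(2021, 11, 11), date(2021, 11, 3), "1a"),
    (date(2021, 11, 10), date(2021, 11, 4), "1a"),
    (date(2021, 11, 10), date(2021, 11, 5), "1a"),
]
print(aggregate_change(indata))  # {'1a': 6}
```

The first row of each id has no predecessor and contributes nothing, which matches SUM() skipping the NULL that LAG() returns there.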
I thought the lag function would give me the answer, but it kept returning the wrong result because my logic was wrong in one place. This gives me the answer I need:
with cte as(
select id, start_date, recorded_date,
row_number() over(partition by id order by recorded_date asc) as idrank,
lag(start_date,1) over(partition by id order by recorded_date asc) as prev
from table_temp
)
select id, sum(abs(date(start_date) - date(prev))) as Agg_Change
from cte
group by 1
If someone has a better solution please let me know.

Flattening date intervals in SQL

I have a database table where there are three columns that are essential to this question:
A group ID, that groups rows together
A start date
An end date
I want to make a view from this table so that overlapping date intervals, that have the same grouping ID, are flattened.
Date intervals that are not overlapping shall not be flattened.
Example:
Group ID Start End
1 2016-01-01 2017-12-31
1 2016-06-01 2020-01-01
1 2022-08-31 2030-12-31
2 2010-03-01 2017-01-01
2 2012-01-01 2013-12-31
3 2001-01-01 9999-13-31
...becomes...
Group ID Start End
1 2016-01-01 2020-01-01
1 2022-08-31 2030-12-31
2 2010-03-01 2017-01-01
3 2001-01-01 9999-12-31
Intervals that overlap may do so in any way, completely enclosed by other intervals, or they may be staggered, or they may even have the same start and/or end dates.
There are few similar ids. Commonly (> 95%) there is only one row with a particular group ID. There are about a thousand IDs that show up in two rows; a handful of IDs that exist in three rows; none that are in four rows or more.
But I need to be prepared that there may show up group IDs that exist in four or more rows.
How can I write an SQL statement that creates a view that shows the table flattened this way?
Do note that every row also has a unique ID. This does not need to be preserved in any way, but in case it helps when writing the SQL, I am letting you know.
First, find the intervals that are not a continuation of an overlapping sequence:
select *
from dateclap d1
where not exists(
select *
from dateclap d2
where d2.group_id=d1.group_id and
d2.end_date >= d1.start_date and
(d2.start_date < d1.start_date or
(d1.start_date=d2.start_date and d2.r_id<d1.r_id)))
The last line distinguishes intervals that start at the same date/time by ordering them by the unique record id (r_id).
Then, for each such record, we can get a hierarchical selection of records, with connect_by_root r_id identifying the clamp groups. After that, all we need is the min/max for each clamp group (connect_by_root r_id is the id of the parent record in the group):
select group_id, min(start_date) as start_date, max(end_date) as end_date
from dateclap d1
start with not exists(
select *
from dateclap d2
where d2.group_id=d1.group_id and
d2.end_date >= d1.start_date and
(d2.start_date < d1.start_date or
(d1.start_date=d2.start_date and d2.r_id<d1.r_id)))
connect by nocycle
prior group_id=group_id and
start_date between prior start_date and prior end_date
group by group_id, connect_by_root r_id
Note nocycle here - it is a dirty trick to avoid exceptions because connection is weak and in fact tries to connect record to itself. You can refine condition after "connect by" similar to "exists" condition to avoid nocycle usage.
P.S. Table was created for tests like this:
CREATE TABLE "ANIKIN"."DATECLAP"
(
"R_ID" NUMBER,
"GROUP_ID" NUMBER,
"START_DATE" DATE,
"END_DATE" DATE
) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "ANIKIN" ;
A unique key (or probably a primary key) for r_id and the corresponding sequence/triggers are not specific to these tests; just populate r_id with unique values.
select t1.group_id,
least(min(t1.start_date), min(t2.start_date)),
greatest(max(t1.end_date), max(t2.end_date))
from test_interval t1, test_interval t2
where (t1.start_date, t1.end_date) overlaps (t2.start_date, t2.end_date)
and t1.rowid <> t2.rowid
and t1.group_id = t2.group_id
group by t1.group_id;
This query gives me the list of overlapping intervals. OVERLAPS is an undocumented operator. I do wonder whether it returns a wrong result when a group has two pairs of intervals that overlap within each pair but not with the other pair.
Where I used rowid you can use your unique row identifier
Create 2 functions that return the flattened start- and end-date for a specific element:
CREATE OR REPLACE FUNCTION getMinStartDate
(
p_group_id IN NUMBER,
p_start IN DATE
)
RETURN DATE AS
v_result DATE;
BEGIN
SELECT MIN(start_date)
INTO v_result
FROM my_data
WHERE group_id = p_group_id
AND start_date <= p_start
AND end_date >= p_start;
RETURN v_result;
END getMinStartDate;
CREATE OR REPLACE FUNCTION getMaxEndDate
(
p_group_id IN NUMBER,
p_end IN DATE
)
RETURN DATE AS
v_result DATE;
BEGIN
SELECT MAX(end_date)
INTO v_result
FROM my_data
WHERE group_id = p_group_id
AND start_date <= p_end
AND end_date >= p_end;
RETURN v_result;
END getMaxEndDate;
Your view should then return, for each element, these flattened dates.
Of course, DISTINCT since various elements may result in the same dates:
SELECT DISTINCT
group_id,
getMinStartDate(group_id, start_date) AS start_date,
getMaxEndDate(group_id, end_date) AS end_date
FROM my_data;
The input data shows an end date of 9999-13-31 in the last row. That should be corrected.
With that said, it is best to choose a made-up end date that is not exactly 9999-12-31. In many problems one needs to add a day, or a couple of weeks, or whatever, to all the dates in a table; but if one tries to add to 9999-12-31, that will fail. I prefer 8999-12-31; one thousand years should be enough for most computations. {:-) In the test data I created for my query I used this convention. (The solution can be easily adapted for 9999-12-31 though.)
When working with datetime intervals, remember that a pure date means midnight at the beginning of a day. So the year 2016 has the "end date" 2017-01-01 (midnight at the beginning of the day) and the year 2017 has the "start date" 2017-01-01 also. So the table SHOULD have the same end-date and start-date for periods that immediately follow each other - and they should be fused together into a single interval. However, an interval ending on 2016-08-31 and one that begins on 2016-09-01 should NOT be fused together; they are separated by a full day (specifically the day of 2016-08-31 is NOT included in either interval).
The OP did not specify how the end-dates are meant to be interpreted here. I assume they are as described in the last paragraph; otherwise the solution can be easily adapted (but it will require adding 1 to end dates first, and then subtracting 1 at the end - this is exactly one of those cases when 9999-12-31 is not a good placeholder for "unknown".)
Solution:
with m as
(
select group_id, start_date,
max(end_date) over (partition by group_id order by start_date
rows between unbounded preceding and 1 preceding) as m_time
from inputs -- "inputs" is the name of the base table
union all
select group_id, NULL, max(end_date) from inputs group by group_id
),
n as
(
select group_id, start_date, m_time
from m
where start_date > m_time or start_date is null or m_time is null
),
f as
(
select group_id, start_date,
lead(m_time) over (partition by group_id order by start_date) as end_date
from n
)
select * from f where start_date is not null
;
Output (with the data provided):
GROUP_ID START_DATE END_DATE
---------- ---------- ----------
1 2016-01-01 2020-01-01
1 2022-08-31 2030-12-31
2 2010-03-01 2017-01-01
3 2001-01-01 8999-12-31
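For comparison, the running-maximum idea behind this query can be sketched as a single sorted pass in Python. This is an illustration under the same convention as above (an interval that starts exactly at the running end date is fused), not a translation of the SQL; the name flatten is hypothetical, and 8999-12-31 stands in for the open end date as suggested earlier.

```python
from datetime import date
from itertools import groupby


def flatten(rows):
    """rows: iterable of (group_id, start, end). Returns merged intervals."""
    merged = []
    ordered = sorted(rows)  # by group_id, then start, then end
    for group_id, members in groupby(ordered, key=lambda r: r[0]):
        cur_start = cur_end = None
        for _, start, end in members:
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start > cur_end:
                # Gap: the new start lies past the running maximum end date.
                merged.append((group_id, cur_start, cur_end))
                cur_start, cur_end = start, end
            else:
                # Overlapping, enclosed, or touching: extend the running maximum.
                cur_end = max(cur_end, end)
        merged.append((group_id, cur_start, cur_end))
    return merged


inputs = [
    (1, date(2016, 1, 1), date(2017, 12, 31)),
    (1, date(2016, 6, 1), date(2020, 1, 1)),
    (1, date(2022, 8, 31), date(2030, 12, 31)),
    (2, date(2010, 3, 1), date(2017, 1, 1)),
    (2, date(2012, 1, 1), date(2013, 12, 31)),
    (3, date(2001, 1, 1), date(8999, 12, 31)),
]
for row in flatten(inputs):
    print(*row)
```

If end dates are meant to be inclusive whole days instead, the comparison would become start > cur_end + one day, which is the adjustment the answer describes.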

Smoothing out a result set by date

Using SQL I need to return a smooth set of results (i.e. one per day) from a dataset that contains 0-N records per day.
The result per day should be the most recent previous value even if that is not from the same day. For example:
Starting data:
Date: Time: Value
19/3/2014 10:01 5
19/3/2014 11:08 3
19/3/2014 17:19 6
20/3/2014 09:11 4
22/3/2014 14:01 5
Required output:
Date: Value
19/3/2014 6
20/3/2014 4
21/3/2014 4
22/3/2014 5
First you need to complete the date range and fill in the missing dates (21/3/2014 in your example). This can be done either by joining a calendar table, if you have one, or by using a recursive common table expression to generate the complete sequence on the fly.
Once you have the complete sequence of dates, finding the value for each date, or for the latest previous date that has rows, becomes easy. In this query I use a correlated subquery to do it.
with cte as (
select min(date) date, max(date) max_date from your_table
union all
select dateadd(day, 1, date) date, max_date
from cte
where date < max_date
)
select
c.date,
(
select top 1 max(value) from your_table
where date <= c.date group by date order by date desc
) value
from cte c
order by c.date;
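The two steps (complete the calendar, then carry the latest value forward) can also be sketched in Python. This is only an illustration with a hypothetical function name, daily_values; note that it takes the last reading by time on or before each day, which is what the question asks for, rather than the per-day maximum.

```python
from datetime import datetime, timedelta


def daily_values(readings):
    """readings: list of (datetime, value). Returns one (date, value) per day,
    carrying the most recent reading forward over days without data."""
    ordered = sorted(readings)
    if not ordered:
        return []
    day = ordered[0][0].date()
    last_day = ordered[-1][0].date()
    result = []
    idx = 0
    latest = None
    while day <= last_day:
        # Consume every reading taken on or before the current day.
        while idx < len(ordered) and ordered[idx][0].date() <= day:
            latest = ordered[idx][1]
            idx += 1
        result.append((day, latest))
        day += timedelta(days=1)
    return result


readings = [
    (datetime(2014, 3, 19, 10, 1), 5),
    (datetime(2014, 3, 19, 11, 8), 3),
    (datetime(2014, 3, 19, 17, 19), 6),
    (datetime(2014, 3, 20, 9, 11), 4),
    (datetime(2014, 3, 22, 14, 1), 5),
]
for day, value in daily_values(readings):
    print(day, value)
```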
Maybe this works; try it and let me know:
select date, value from test where (time,date) in (select max(time),date from test group by date);

PL/SQL Finding Difference Between Start and End Dates in Different Rows

I am trying to find the difference between start and end dates in different rows of a result set, using PL/SQL. Here is an example:
ID TERM START_DATE END_DATE
423 201420 26-AUG-13 13-DEC-13
423 201430 21-JAN-14 09-MAY-14
423 201440 16-JUN-14 07-AUG-14
For any specific ID, I need to get the difference between the end date in the first record and the start date of the second record. Similarly, I need to get the difference between the end date in the second record and the start date of the third record, and so forth.
Eventually I will need to perform the same operation on a variety of IDs. I am assuming I have to use a cursor and loop.
I would appreciate any help or suggestions on accomplishing this. Thanks in advance.
The "lead" analytic function in Oracle can grab a value from the succeeding row as a value in the current row.
Given a series of rows returned from a query and a position of the cursor, LEAD provides access to a row at a given physical offset beyond that position.
Here, this SQL grabs start_date from the next row and subtracts end_date from the current row.
select id, term, start_date, end_date,
lead(start_date) over (partition by id order by term) - end_date diff_in_days
from your_table;
Sample output:
ID TERM START_DATE END_DATE DIFF_IN_DAYS
---------- ---------- -------------------- -------------------- ------------
423 201420 26-AUG-2013 00:00:00 13-DEC-2013 00:00:00 39
423 201430 21-JAN-2014 00:00:00 09-MAY-2014 00:00:00 36
423 201440 14-JUN-2014 00:00:00 07-AUG-2014 00:00:00
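To make the row pairing explicit, here is a small Python sketch of what LEAD does with this partitioning and ordering. The function name gaps_to_next_term is hypothetical, and the rows are the three from the question (third start date 16-JUN-14), so the second gap here is 38 days.

```python
from datetime import date


def gaps_to_next_term(rows):
    """rows: list of (id, term, start_date, end_date).

    Returns (id, term, days until the next term starts), with None for the
    last term of each id, mirroring LEAD(...) OVER (PARTITION BY id ORDER BY term).
    """
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    out = []
    for i, (row_id, term, _start, end) in enumerate(ordered):
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if nxt is not None and nxt[0] == row_id:
            diff = (nxt[2] - end).days  # next start_date minus this end_date
        else:
            diff = None  # no following row within the same id
        out.append((row_id, term, diff))
    return out


terms = [
    (423, 201420, date(2013, 8, 26), date(2013, 12, 13)),
    (423, 201430, date(2014, 1, 21), date(2014, 5, 9)),
    (423, 201440, date(2014, 6, 16), date(2014, 8, 7)),
]
for row in gaps_to_next_term(terms):
    print(row)
```

No cursor or loop is needed on the database side; the window function performs this pairing for every id in one statement.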
I would suggest looking at using the LEAD and LAG analytic functions from Oracle. By the sounds of it they should suit your needs.
See the docs here: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions074.htm
Code:
SELECT [ID], [TERM], [START_DATE], [END_DATE],
CASE WHEN MIN([END_DATE]) OVER(PARTITION BY [ID] ORDER BY [TERM] ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)=[END_DATE] THEN NULL ELSE
MIN([END_DATE]) OVER(PARTITION BY [ID] ORDER BY [TERM] ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)-[START_DATE] END AS [DAYS_BETWEEN]
FROM [TABLE]
This seemed to work:
SELECT DISTINCT
ID,
TERM_CODE,
TERM_START_DATE,
TERM_END_DATE,
LEAD ( TERM_START_DATE, 1 ) OVER ( PARTITION BY ID ORDER BY TERM_CODE ) - TERM_END_DATE AS DIFF_DAYS
FROM TABLE

Oracle SQL - Putting together potentially contradictory or overlapping date ranges

I have a table like this:
Id Begin_Date End_date
1 01-JAN-12 05-JAN-12
1 01-FEB-12 01-MAR-12
1 15-FEB-12 05-MAR-12
For a given Id, it gives a set of date ranges. Let's say that if a date is between the begin and end date for that Id, then that Id is "on". Otherwise, "off"
The problem here is these last two rows -- the date ranges overlap and contradict each other. The second row claims that 1 was "on" between 01-FEB-12 and 01-MAR-12, but the third row claims that 1 was off through 14-FEB-12. Similarly, the second row claims that 1 was off on 02-MAR-12, but row 3 claims it was on.
The reconciliation logic I'd like to apply is that, in cases of contradictions, pick the earliest possible begin date and the earliest possible end date after it. The result would therefore be:
Id Begin_Date End_date
1 01-JAN-12 05-JAN-12
1 01-FEB-12 01-MAR-12
I was able to pull this off with the lag analytical function, but I ran into difficulty with other use cases. Take this input data set.
Id Begin_Date End_date
1 01-JAN-12 10-JAN-12
1 5-JAN-12 8-JAN-12
1 12-JAN-12 15-JAN-12
1 1-JAN-12 14-JAN-12
What I expect here as output is:
Id Begin_Date End_date
1 01-JAN-12 8-JAN-12
1 12-JAN-12 14-JAN-12
...because the first row is the earliest begin date, and its end date is the earliest end date after that. The next row is the earliest begin date after the previous end date, and the end date of that row is the earliest end date after that. There are no begin dates after 14-JAN-12, so I'm done.
I'm having very little luck solving this problem. One approach I tried was getting the rank partitioned by id and compare it to the max rank. I then used the lag function to compare to previous ranks. However, this strategy totally fails for use cases above.
Any suggestions?
Well, the critical requirement rests on this:
The reconciliation logic I'd like to apply is that, in cases of
contradictions, pick the earliest possible begin date and the earliest
possible end date after it.
sqlfiddle here
CREATE TABLE table1
(
id INT,
DateStart DATE,
DateEnd DATE
);
INSERT INTO table1
VALUES
(1, TO_DATE('20110101','YYYYMMdd'), TO_DATE('20110110','YYYYMMdd'));
INSERT INTO table1
VALUES
(2, TO_DATE('20110105','YYYYMMdd'), TO_DATE('20110108','YYYYMMdd'));
INSERT INTO table1
VALUES
(3, TO_DATE('20110112','YYYYMMdd'), TO_DATE('20110115','YYYYMMdd'));
INSERT INTO table1
VALUES
(4, TO_DATE('20110101','YYYYMMdd'), TO_DATE('20110114','YYYYMMdd'));
INSERT INTO table1
VALUES
(5, TO_DATE('20110206','YYYYMMdd'), TO_DATE('20110208','YYYYMMdd'));
INSERT INTO table1
VALUES
(6, TO_DATE('20110201','YYYYMMdd'), TO_DATE('20110207','YYYYMMdd'));
The select statement:
SELECT ID, DATESTART, DATEEND
FROM
(
SELECT ID, TYPE, DATES AS DATESTART,
LEAD(DATES) OVER (ORDER BY DATES) AS DATEEND
FROM
(
SELECT ID, TYPE,DATES,
LAG(ID) OVER (ORDER BY DATES) AS LASTID,
LAG(TYPE) OVER (ORDER BY DATES) AS LASTTYPE,
LAG(DATES) OVER (ORDER BY DATES) AS LASTDATES
FROM
(
SELECT ID,'START' AS TYPE,DATESTART AS DATES
FROM table1
UNION ALL
SELECT ID,'END',DATEEND
FROM table1
)
) H
WHERE TYPE != LASTTYPE OR LASTTYPE IS NULL
)
WHERE TYPE = 'START'
ORDER BY DATESTART
Here's a step-by-step description of each subquery, from the innermost outward:
explode each row's start date and end date into one column
copy values from the previous row into the current row using LAG
filter out the rows that are in the middle of a run (e.g. for 1,2,3,4 remove 2,3)
get the end date from the next row, because the remaining rows are either the first or the last of a run
extract only the useful rows, those that have TYPE = START
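The steps above can be traced in a short Python sketch. It is an illustration only; the function name reconcile is hypothetical, the rows are the table1 sample data, and it assumes that a START sorts before an END on the same date (the SQL leaves the order of such ties unspecified).

```python
from datetime import date


def reconcile(ranges):
    """ranges: list of (start, end). Returns the reconciled (start, end) pairs."""
    # Step 1: explode starts and ends into one list of dated events.
    events = [(s, "START") for s, _ in ranges] + [(e, "END") for _, e in ranges]
    events.sort(key=lambda ev: ev[0])  # stable sort: START stays ahead on ties
    # Steps 2-3: keep an event only if its type differs from the previous one.
    boundaries = []
    last_type = None
    for day, kind in events:
        if kind != last_type:
            boundaries.append((day, kind))
        last_type = kind
    # Steps 4-5: pair each remaining START with the date of the next boundary.
    result = []
    for (day, kind), nxt in zip(boundaries, boundaries[1:] + [None]):
        if kind == "START":
            result.append((day, nxt[0] if nxt else None))
    return result


table1 = [
    (date(2011, 1, 1), date(2011, 1, 10)),
    (date(2011, 1, 5), date(2011, 1, 8)),
    (date(2011, 1, 12), date(2011, 1, 15)),
    (date(2011, 1, 1), date(2011, 1, 14)),
    (date(2011, 2, 6), date(2011, 2, 8)),
    (date(2011, 2, 1), date(2011, 2, 7)),
]
for start, end in reconcile(table1):
    print(start, end)
```

Like the query, this treats all rows as one series; handling several ids would mean running it once per id.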
For the second data set:
Id Begin_Date End_date
1 01-JAN-12 10-JAN-12
1 5-JAN-12 8-JAN-12
1 12-JAN-12 15-JAN-12
1 1-JAN-12 14-JAN-12
After your reconciliation logic, the result would be:
Id Begin_Date End_date
1 01-JAN-12 8-JAN-12 (includes the rows 1,2 and 4 -> minimum begin_date is 1-JAN, minimum end_date is 8-JAN)
1 12-JAN-12 15-JAN-12 (includes row 3)