Merging date intervals in SQL Server - sql

I have the following data:
StartDate | EndDate
-------------------------
1982.03.02 | 1982.09.30
1982.10.01 | 1985.01.17
1985.06.26 | 1985.07.26
1985.07.30 | 1991.12.31
1992.01.01 | 1995.12.31
1996.01.01 | 2004.05.31
2004.06.05 | 2006.01.31
2006.02.01 | 2011.05.20
I need to merge any intervals that are adjacent. Both the start and the end date are included in an interval, so an interval ending on 2003.05.06 is adjacent to an interval starting on 2003.05.07. In this case, the resulting set should be:
StartDate | EndDate
-------------------------
1982.03.02 | 1985.01.17
1985.06.26 | 1985.07.26
1985.07.30 | 2004.05.31
2004.06.05 | 2011.05.20
For me, the obvious way to do this is to iterate the set with a cursor, and construct a result set row-by-row. However, this functionality will be within code that could potentially be called thousands of times in a day, on a server under heavy load, so I'd prefer not having any performance issues. Any data set is small (20 rows tops), and the date range is large, so any solution that generates all the dates in a range is infeasible.
Is there a better way I'm not seeing?
Initialization code (from Damien's answer):
CREATE TABLE Periods (
StartDate datetime NOT NULL CONSTRAINT PK_Periods PRIMARY KEY CLUSTERED,
EndDate datetime NOT NULL
)
INSERT INTO Periods(StartDate,EndDate)
SELECT '19820302', '19820930'
UNION ALL SELECT '19821001', '19850117'
UNION ALL SELECT '19850626', '19850726'
UNION ALL SELECT '19850730', '19911231'
UNION ALL SELECT '19920101', '19951231'
UNION ALL SELECT '19960101', '20040531'
UNION ALL SELECT '20040605', '20060131'
UNION ALL SELECT '20060201', '20110520'

It takes longer for me to set up the sample data than to write the query - it would be better if you posted questions that include CREATE TABLE and INSERT/SELECT statements. I don't know what your table is called, I've called mine Periods:
create table Periods (
StartDate date not null,
EndDate date not null
)
go
insert into Periods(StartDate,EndDate)
select '19820302','19820930' union all
select '19821001','19850117' union all
select '19850626','19850726' union all
select '19850730','19911231' union all
select '19920101','19951231' union all
select '19960101','20040531' union all
select '20040605','20060131' union all
select '20060201','20110520'
go
; with MergedPeriods as (
Select p1.StartDate, p1.EndDate
from
Periods p1
left join
Periods p2
on
p1.StartDate = DATEADD(day,1,p2.EndDate)
where
p2.StartDate is null
union all
select p1.StartDate,p2.EndDate
from
MergedPeriods p1
inner join
Periods p2
on
p1.EndDate = DATEADD(day,-1,p2.StartDate)
)
select StartDate,MAX(EndDate) as EndDate
from MergedPeriods group by StartDate
Result:
StartDate EndDate
1982-03-02 1985-01-17
1985-06-26 1985-07-26
1985-07-30 2004-05-31
2004-06-05 2011-05-20

Here's a query that performs best of all submissions so far, with only two table accesses in the execution plan (instead of three or more). All queries are of course helped by indexes. Please note that the execution plan rates this query as more expensive, but the actual Reads & CPU are significantly better. Estimated costs in execution plans are not the same as actual performance.
WITH Grps AS (
SELECT
(Row_Number() OVER (ORDER BY P1.StartDate) - 1) / 2 Grp,
P1.StartDate,
P1.EndDate
FROM
Periods P1
CROSS JOIN (SELECT -1 UNION ALL SELECT 1) D (Dir)
LEFT JOIN Periods P2 ON
DateAdd(Day, D.Dir, P1.StartDate) = P2.EndDate
OR DateAdd(Day, D.Dir, P1.EndDate) = P2.StartDate
WHERE
(Dir = -1 AND P2.EndDate IS NULL)
OR (Dir = 1 AND P2.StartDate IS NULL)
)
SELECT
Min(StartDate) StartDate,
Max(EndDate) EndDate
FROM Grps
GROUP BY Grp;
One more thing I think worth mentioning: querying your date period table would in most cases be simpler and perform better if you used exclusive end dates (aka "open" end dates) instead of closed ones:
StartDate | EndDate | EndDate
(Inclusive) | (Inclusive) | (Exclusive)
---------------------------------------
1982.03.02 | 1982.09.30 | 1982.10.01
1982.10.01 | 1985.01.17 | 1985.01.18
Using exclusive end dates is (in my opinion) best practice most of the time because it allows you to change the data type of the date column or to change the resolution of the date, without affecting any queries, code, or other logic. For example, if your dates needed to be to the nearest 12 hours instead of 24 hours, you'd have major work to get that accomplished, whereas if you used exclusive end dates not a single thing would have to change!
If you were using exclusive end dates, my query would look like this:
WITH Grps AS (
SELECT
(Row_Number() OVER (ORDER BY P1.StartDate) - 1) / 2 Grp,
P1.StartDate,
P1.EndDate
FROM
Periods P1
CROSS JOIN (SELECT 1 UNION ALL SELECT 2) X (Which)
LEFT JOIN Periods P2 ON
(X.Which = 1 AND P1.StartDate = P2.EndDate)
OR (X.Which = 2 AND P1.EndDate = P2.StartDate)
WHERE
P2.EndDate IS NULL
OR P2.StartDate IS NULL
)
SELECT
Min(StartDate) StartDate,
Max(EndDate) EndDate
FROM Grps
GROUP BY Grp;
Notice there's no DateAdd or DateDiff now, with hardcoded values of "1 Day" that would have to change if you for example switched to 12-hour periods.
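To make the point about exclusive end dates concrete outside of SQL, here is a minimal Python sketch (not part of the query above). The function names and the `step` parameter are made up for illustration. With closed intervals the adjacency test has to know the resolution of the data; with half-open intervals it is plain equality.

```python
from datetime import date, timedelta

def adjacent_inclusive(end_a, start_b, step=timedelta(days=1)):
    # Closed intervals [start, end]: the resolution (step) leaks into the test.
    # Switching to 12-hour periods would mean changing step everywhere.
    return end_a + step == start_b

def adjacent_exclusive(end_a, start_b):
    # Half-open intervals [start, end): adjacency is plain equality,
    # whatever the resolution of the values is.
    return end_a == start_b

# The first two rows of the sample data, in both representations.
print(adjacent_inclusive(date(1982, 9, 30), date(1982, 10, 1)))
print(adjacent_exclusive(date(1982, 10, 1), date(1982, 10, 1)))
```

Both calls report the first two sample rows as adjacent; only the first one depends on the hardcoded one-day step.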
Update
Here's an updated query that incorporates things I've learned in the last almost 5 years. This query now has no joins at all, and though it does have 3 sort operations in it which could be performance problems, I think this query will compete reasonably well, and in the absence of indexes will probably beat all others hands down.
WITH Groups AS (
SELECT Grp = Row_Number() OVER (ORDER BY StartDate) / 2, *
FROM
#Periods
CROSS JOIN (VALUES (0), (0)) X (Dup)
), Ranges AS (
SELECT StartDate = Max(StartDate), EndDate = Min(EndDate)
FROM Groups
GROUP BY Grp
HAVING Max(StartDate) <> DateAdd(day, 1, Min(EndDate))
), ReGroups AS (
SELECT
Grp = Row_Number() OVER (ORDER BY StartDate) / 2,
StartDate,
EndDate
FROM
Ranges
CROSS JOIN (VALUES (0), (0)) X (Dup)
)
SELECT
StartDate = Min(StartDate),
EndDate = Max(EndDate)
FROM ReGroups
GROUP BY Grp
HAVING Count(*) = 2
;
And here's yet another version using windowing functions (kind of what the previous query is simulating):
WITH LeadLag AS (
SELECT
PrevEndDate = Coalesce(Lag(EndDate) OVER (ORDER BY StartDate), '00010101'),
NextStartDate = Coalesce(Lead(StartDate) OVER (ORDER BY StartDate), '99991231'),
*
FROM #Periods
), Dates AS (
SELECT
X.*
FROM
LeadLag
CROSS APPLY (
SELECT
StartDate = CASE WHEN DateAdd(day, 1, PrevEndDate) <> StartDate THEN StartDate ELSE NULL END,
EndDate = CASE WHEN DateAdd(day, 1, EndDate) <> NextStartDate THEN EndDate ELSE NULL END
) X
WHERE
X.StartDate IS NOT NULL
OR X.EndDate IS NOT NULL
), Final AS (
SELECT
StartDate,
EndDate = Min(EndDate) OVER (ORDER BY EndDate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM Dates
)
SELECT *
FROM Final
WHERE StartDate IS NOT NULL
;
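For anyone who wants to check the expected result without a database, the Lag-style comparison (previous end date plus one day versus current start date) can be sketched in a few lines of Python. This is only an illustration of the logic, not a translation of the query; `merge_adjacent` and `PERIODS` are hypothetical names, and like the question it assumes the intervals do not overlap.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def merge_adjacent(periods):
    """Merge closed date intervals where previous end + 1 day == next start.

    Assumes the intervals do not overlap, as in the question.
    """
    merged = []
    for start, end in sorted(periods):
        if merged and merged[-1][1] + ONE_DAY == start:
            # Touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], end)
        else:
            # Gap before this interval: it starts a new merged period.
            merged.append((start, end))
    return merged

# Sample data from the question.
PERIODS = [
    (date(1982, 3, 2), date(1982, 9, 30)),
    (date(1982, 10, 1), date(1985, 1, 17)),
    (date(1985, 6, 26), date(1985, 7, 26)),
    (date(1985, 7, 30), date(1991, 12, 31)),
    (date(1992, 1, 1), date(1995, 12, 31)),
    (date(1996, 1, 1), date(2004, 5, 31)),
    (date(2004, 6, 5), date(2006, 1, 31)),
    (date(2006, 2, 1), date(2011, 5, 20)),
]

for start, end in merge_adjacent(PERIODS):
    print(start, end)
```

On the sample data this produces the four rows listed in the question.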

You could look up the heads: rows that start a period. Then search for the last end date before the next head in a subquery:
; with heads as
(
select StartDate
, EndDate
, row_number() over (order by StartDate) as rn
from #YourTable h
where not exists
(
select *
from #YourTable next
where next.EndDate = dateadd(day, -1, h.StartDate)
)
)
select heads.StartDate
, (
select top 1 EndDate
from #YourTable
where EndDate < COALESCE(
(
select StartDate
from heads h2
where heads.rn + 1 = h2.rn
), '9999-01-01')
order by
EndDate desc
) as EndDate
from heads
Example at ODATA.

Hmmm... I know you said
any solution that generates all the dates in a range is unfeasible.
But for some reason I just wanted to show how that would be done. I don't mean to waste your time.
First, create a numbers table if you don't already have one.
CREATE TABLE Numbers (
Num int NOT NULL CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED
)
INSERT Numbers VALUES (0)
WHILE @@RowCount < 65536
INSERT Numbers SELECT Num + (SELECT Max(Num) FROM Numbers) + 1 FROM Numbers
Then group some islands!
WITH Dts AS (
SELECT
DateAdd(Day, Num, StartDate) Dt,
DateAdd(
Day,
-DENSE_RANK() OVER (ORDER BY StartDate, Num),
DateAdd(Day, Num, StartDate)
) Grp
FROM
Periods P
INNER JOIN Numbers N ON DateDiff(Day, P.StartDate, P.EndDate) >= N.Num
)
SELECT Min(Dt) StartDate, Max(Dt) EndDate
FROM Dts
GROUP BY Grp
ORDER BY StartDate
If you are using SQL 2000 this won't work, so please let me know and I'll come up with another solution for you.
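The grouping trick in that query (date minus its rank is constant within a run of consecutive dates) is easier to see in a small Python sketch. This is an illustration under the assumption that generating every date is acceptable, which the question rules out for production; `merge_by_expansion` is a made-up name.

```python
from datetime import date, timedelta
from itertools import groupby

def merge_by_expansion(periods):
    """Expand each closed interval into its days, then group the islands."""
    # Every distinct day covered by any interval, in order.
    days = sorted({
        start + timedelta(days=n)
        for start, end in periods
        for n in range((end - start).days + 1)
    })
    result = []
    # Within a run of consecutive days, (day - position) stays constant,
    # so it can serve as the group key, like Grp in the query.
    keyed = groupby(enumerate(days), key=lambda p: p[1] - timedelta(days=p[0]))
    for _, run in keyed:
        run_days = [d for _, d in run]
        result.append((run_days[0], run_days[-1]))
    return result

sample = [
    (date(2020, 1, 1), date(2020, 1, 3)),
    (date(2020, 1, 4), date(2020, 1, 5)),
    (date(2020, 1, 8), date(2020, 1, 9)),
]
print(merge_by_expansion(sample))
```

The first two sample intervals touch, so they come out as one island; the third stays separate.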

Here's a very similar thread for PostgreSQL:
PostgreSQL matching interval between start and end time against timestamp
I'm only mildly familiar with T-SQL, so I'm not entirely sure the takeaway applies to you, but the general idea is to additionally store an indexable geometry type with a GIST (or R-tree) index, and to query against it. This will make the queries very fast.
(example segment code below is from peufeu's reply, and applies to date ranges too):
CREATE TABLE segments( start INTEGER NOT NULL, stop INTEGER NOT NULL, range_box BOX NOT NULL );
INSERT INTO segments SELECT n,n+1,BOX(POINT(n,-1),POINT(n+1,1)) FROM generate_series( 1, 1000000 ) n;
CREATE INDEX segments_box ON segments USING gist( range_box );
CREATE INDEX segments_start ON segments(start);
CREATE INDEX segments_stop ON segments(stop);
EXPLAIN ANALYZE SELECT * FROM segments WHERE 300000 BETWEEN start AND stop;
Index Scan using segments_start on segments (cost=0.00..12959.24 rows=209597 width=72) (actual time=91.990..91.990 rows=2 loops=1)
Index Cond: (300000 >= start)
Filter: (300000 <= stop)
Total runtime: 92.023 ms
EXPLAIN ANALYZE SELECT * FROM segments WHERE range_box && '(300000,0,300000,0)'::BOX;
Bitmap Heap Scan on segments (cost=283.49..9740.27 rows=5000 width=72) (actual time=0.036..0.037 rows=2 loops=1)
Recheck Cond: (range_box && '(300000,0),(300000,0)'::box)
-> Bitmap Index Scan on segments_box (cost=0.00..282.24 rows=5000 width=0) (actual time=0.032..0.032 rows=2 loops=1)
Index Cond: (range_box && '(300000,0),(300000,0)'::box)
Total runtime: 0.064 ms
Again, the above is PostgreSQL specific, but it might be worth looking if the needed types/operator/indexes in T-SQL exist as well.

Old thread, but if anyone is looking for an implementation of doing this in PostGIS, here is an example:
-- Create the data:
drop table if exists periods;
create temporary table periods as
select '19820302'::date as StartDate,'19820930'::date as EndDate union all
select '19821001'::date,'19850117'::date union all
select '19850626'::date,'19850726'::date union all
select '19850730'::date,'19911231'::date union all
select '19920101'::date,'19951231'::date union all
select '19960101'::date,'20040531'::date union all
select '20040605'::date,'20060131'::date union all
select '20060201'::date,'20110520'::date;
-- Run with PostGIS
-- Convert all intervals to lines, and then do point intersection.
select
'1970-01-01'::date+st_x(st_astext(st_pointn(line,1)))::int4 as start,
'1970-01-01'::date+st_x(st_astext(st_pointn(line,st_numpoints(line))))::int4-1 as end
from
(select (st_dump(st_linemerge(st_union(the_geom)))).geom as line from
(select st_makeline(st_makepoint(startdate-'1970-01-01',0),
st_makepoint(enddate-'1970-01-01'+1,0)) as the_geom from periods)t
)x;
-- Result
start | end
-------------------------
1982-03-02 | 1985-01-17
1985-06-26 | 1985-07-26
1985-07-30 | 2004-05-31
2004-06-05 | 2011-05-20

create table Periods (
StartDate date not null,
EndDate date not null
)
go
insert into Periods(StartDate,EndDate)
select '20130210','20130215' union all
select '20130216','20130228' union all
select '20130302','20130312' union all
select '20130317','20130325' union all
select '20130326','20130405' union all
select '20130406','20130411' union all
select '20130502','20130610'
go
; with MergedPeriods as (
Select p1.StartDate, p1.EndDate
from
[test].[dbo].[Periods] p1
left join
[test].[dbo].[Periods] p2
on
p1.StartDate = DATEADD(day,1,p2.EndDate)
where
p2.StartDate is null
union all
select p1.StartDate,p2.EndDate
from
MergedPeriods p1
inner join
[test].[dbo].[Periods] p2
on
p1.EndDate = DATEADD(day,-1,p2.StartDate)
)
select MIN(StartDate),MAX(EndDate) as EndDate
from MergedPeriods group by StartDate

Related

SQL how to write a query that return missing date ranges?

I am trying to figure out how to write a query that looks at certain records and finds missing date ranges between today and 9999-12-31.
My data looks like below:
ID |start_dt |end_dt |prc_or_disc_1
10412 |2018-07-17 00:00:00.000 |2018-07-20 00:00:00.000 |1050.000000
10413 |2018-07-23 00:00:00.000 |2018-07-26 00:00:00.000 |1040.000000
So for this data I would want my query to return:
2018-07-10 | 2018-07-16
2018-07-21 | 2018-07-22
2018-07-27 | 9999-12-31
I'm not really sure where to start. Is this possible?
You can do that using the lag() function in MS SQL (available starting with SQL Server 2012).
with myData as
(
select *,
lag(end_dt,1) over (order by start_dt) as lagEnd
from myTable),
myMax as
(
select Max(end_dt) as maxDate from myTable
)
select dateadd(d,1,lagEnd) as StartDate, dateadd(d, -1, start_dt) as EndDate
from myData
where lagEnd is not null and dateadd(d,1,lagEnd) < start_dt
union all
select dateAdd(d,1,maxDate) as StartDate, cast('99991231' as Datetime) as EndDate
from myMax
where maxDate < '99991231';
lag() is not available in SQL Server 2008; there you can mimic it with row_number() and a self-join.
select
CASE WHEN DATEDIFF(day, end_dt, ISNULL(LEAD(start_dt) over (order by ID), '99991231')) > 1 then end_dt +1 END as F1,
CASE WHEN DATEDIFF(day, end_dt, ISNULL(LEAD(start_dt) over (order by ID), '99991231')) > 1 then ISNULL(LEAD(start_dt) over (order by ID) - 1, '99991231') END as F2
from t
FOR 2008 VERSION
SELECT
X.end_dt + 1 as F1,
ISNULL(Y.start_dt-1, '99991231') as F2
FROM t X
LEFT JOIN (
SELECT
*
, (SELECT MAX(ID) FROM t WHERE ID < A.ID) as ID2
FROM t A) Y ON X.ID = Y.ID2
WHERE DATEDIFF(day, X.end_dt, ISNULL(Y.start_dt, '99991231')) > 1
This should work in 2008. It assumes that the ranges in your table do not overlap. It also eliminates rows where the end date of the current row is one day before the start date of the next row.
with dtRanges as (
select start_dt, end_dt, row_number() over (order by start_dt) as rownum
from table1
)
select t2.end_dt + 1, coalesce(start_dt_next -1,'99991231')
FROM
( select dr1.start_dt, dr1.end_dt,dr2.start_dt as start_dt_next
from dtRanges dr1
left join dtRanges dr2 on dr2.rownum = dr1.rownum + 1
) t2
where
t2.end_dt + 1 <> coalesce(start_dt_next,'99991231')
http://sqlfiddle.com/#!18/65238/1
SELECT
*
FROM
(
SELECT
end_dt+1 AS start_dt,
LEAD(start_dt-1, 1, '9999-12-31')
OVER (ORDER BY start_dt)
AS end_dt
FROM
yourTable
)
gaps
WHERE
gaps.end_dt >= gaps.start_dt
I would, however, strongly urge you to use end dates that are "exclusive". That is, the range is everything up to but excluding the end_dt.
That way, a range of one day becomes '2018-07-09', '2018-07-10'.
It's really clear that my range is one day long, if you subtract one from the other you get a day.
Also, if you ever change to needing hour granularity or minute granularity you don't need to change your data. It just works. Always. Reliably. Intuitively.
If you search the web you'll find plenty of documentation on why inclusive-start and exclusive-end is a very good idea from a software perspective. (Then, in the query above, you can remove the wonky +1 and -1.)
This solves your case, but provide some sample data if there will ever be overlaps, fringe cases, etc.
Take one day after your end date and 1 day before the next line's start date.
DECLARE @t TABLE (ID int, start_dt DATETIME, end_dt DATETIME, prc VARCHAR(100))
INSERT INTO @t (id, start_dt, end_dt, prc)
VALUES
(10410, '2018-07-09 00:00:00.00','2018-07-12 00:00:00.000','1025.000000'),
(10412, '2018-07-17 00:00:00.00','2018-07-20 00:00:00.000','1050.000000'),
(10413, '2018-07-23 00:00:00.00','2018-07-26 00:00:00.000','1040.000000')
SELECT DATEADD(DAY, 1, end_dt)
, DATEADD(DAY, -1, LEAD(start_dt, 1, '9999-12-31') OVER(ORDER BY id) )
FROM @t
You may want to take a look at this:
http://sqlfiddle.com/#!18/3a224/1
You just have to edit the begin range to today and the end range to 9999-12-31.
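The answers in this thread all compare each row's end date with the next row's start date. As a cross-check of the expected output, here is the same idea as a hedged Python sketch; `missing_ranges` and its parameters are hypothetical names, and the window start of 2018-07-10 is taken from the expected output in the question.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def missing_ranges(ranges, window_start, window_end=date(9999, 12, 31)):
    """Return the closed date ranges in [window_start, window_end]
    that are not covered by any of the given closed ranges."""
    gaps = []
    cursor = window_start  # first day not yet known to be covered
    for start, end in sorted(ranges):
        if start > window_end:
            break
        if start > cursor:
            gaps.append((cursor, start - ONE_DAY))
        if end >= window_end:
            # Covered up to the end of the window; also avoids adding a day
            # to date.max, which would overflow.
            return gaps
        cursor = max(cursor, end + ONE_DAY)
    gaps.append((cursor, window_end))
    return gaps

rows = [
    (date(2018, 7, 17), date(2018, 7, 20)),
    (date(2018, 7, 23), date(2018, 7, 26)),
]
for gap in missing_ranges(rows, window_start=date(2018, 7, 10)):
    print(gap[0], gap[1])
```

Taking the running maximum of the end dates means overlapping input rows do not produce bogus gaps, which the simple LEAD-based queries do not guard against.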

How to add an extra temp date column using the select query in SQL Server

I am trying to add an extra date column in the select statement using a certain where condition.
Below is my current table:
table
I want to add an extra Date column that lists every date with Date >= Start and Date <= End - 2.
output
Getting error with this query:
SELECT
*, temp_Date AS Date
FROM
Mytable
WHERE
Date >= Start AND Date <= End - 2
Thanks in advance.
Consider this statement as dummy data:
CREATE TABLE MyTable
(
id int not null,
startDate date not null,
endDate date not null,
val int not null
)
insert into MyTable
values
(10,'20171106','20171112',7),
(10,'20171106','20171112',6),
(10,'20171106','20171112',5),
(10,'20171106','20171112',0),
(10,'20171106','20171112',2)
Using recursive CTE you select each tuple as your starting date and increment that date until it reaches the enddate like this:
;WITH rc AS (
SELECT id, startDate, endDate, val
, startDate AS temp_date
FROM MyTable
UNION ALL
SELECT id, startDate, endDate, val
, DATEADD(DAY,1,temp_date)
FROM rc
WHERE DATEADD(DAY,1,temp_date) <= enddate
)
SELECT *
FROM rc
You should be aware that recursion in SQL Server is expensive and slow on larger data. Also remember to set the maximum recursion depth with a query hint, as the default limit is 100. Example:
OPTION (MAXRECURSION 0)
A value of 0 means unlimited recursion, with the risk of running infinitely.
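What the recursive CTE does for each row is simply "start at startDate and add one day until endDate is reached". As an illustration only, here is that loop as a Python generator; `expand` is a made-up name, and the dates are the ones from the dummy data above.

```python
from datetime import date, timedelta

def expand(start, end):
    """Yield every date from start to end, both inclusive.

    Note: an end of date.max would overflow on the final increment.
    """
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

days = list(expand(date(2017, 11, 6), date(2017, 11, 12)))
print(len(days), days[0], days[-1])
```

Each loop iteration corresponds to one level of recursion in the CTE, which is why long date ranges run into the recursion limit.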
As I read you are using a data warehouse and as such it should have a time or date dimension. In such case a simple join would do the work:
SELECT id, startDate, endDate, val
, date_sid AS temp_date
FROM MyTable AS m
INNER JOIN DimDate AS dd
ON dd.date_sid >= startDate
AND dd.date_sid <= endDate
Please consider not using reserved keywords for column names (like start, end or value).
Try this:
Select M.*
, Temp_date = Datediff(day, M.[start], dateadd(day,-2,M.[end]))
from MyTable M

Total Count of Active Employees by Date

I have in the past written queries that give me counts by date (hires, terminations, etc...) as follows:
SELECT per.date_start AS "Date",
COUNT(peo.EMPLOYEE_NUMBER) AS "Hires"
FROM hr.per_all_people_f peo,
hr.per_periods_of_service per
WHERE per.date_start BETWEEN peo.effective_start_date AND peo.EFFECTIVE_END_DATE
AND per.date_start BETWEEN :PerStart AND :PerEnd
AND per.person_id = peo.person_id
GROUP BY per.date_start
I was now looking to create a count of active employees by date, however I am not sure how I would date the query as I use a range to determine active as such:
SELECT COUNT(peo.EMPLOYEE_NUMBER) AS "CT"
FROM hr.per_all_people_f peo
WHERE peo.current_employee_flag = 'Y'
and TRUNC(sysdate) BETWEEN peo.effective_start_date AND peo.EFFECTIVE_END_DATE
Here is a simple way to get started. This works for all the effective and end dates in your data:
select thedate,
SUM(num) over (order by thedate) as numActives
from ((select effective_start_date as thedate, 1 as num from hr.per_periods_of_service) union all
(select effective_end_date as thedate, -1 as num from hr.per_periods_of_service)
) dates
It works by adding one person for each start and subtracting one for each end (via num) and doing a cumulative sum. This might have duplicates dates, so you might also do an aggregation to eliminate those duplicates:
select thedate, max(numActives)
from (select thedate,
SUM(num) over (order by thedate) as numActives
from ((select effective_start_date as thedate, 1 as num from hr.per_periods_of_service) union all
(select effective_end_date as thedate, -1 as num from hr.per_periods_of_service)
) dates
) t
group by thedate;
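The +1/-1 cumulative sum is a classic sweep over events. A minimal Python sketch of the same idea, with hypothetical names (`active_counts`) and made-up sample dates, might look like this. It nets the changes per date before accumulating, so duplicate dates are already collapsed, and it skips missing end dates (an assumption: `None` stands for "still active").

```python
from collections import Counter
from datetime import date

def active_counts(periods):
    """Return (date, number of active people) at every date where it changes.

    As in the query, a person is no longer counted on their end date.
    """
    delta = Counter()
    for start, end in periods:
        delta[start] += 1      # one more active person from this date on
        if end is not None:    # open-ended periods never subtract
            delta[end] -= 1
    counts, running = [], 0
    for day in sorted(delta):
        running += delta[day]
        counts.append((day, running))
    return counts

sample = [
    (date(2012, 1, 1), date(2012, 1, 10)),
    (date(2012, 1, 5), None),
    (date(2012, 1, 5), date(2012, 1, 10)),
]
print(active_counts(sample))
```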
If you really want all dates, then it is best to start with a calendar table, and use a simple variation on your original query:
select c.thedate, count(*) as NumActives
from calendar c left outer join
hr.per_periods_of_service pos
on c.thedate between pos.effective_start_date and pos.effective_end_date
group by c.thedate;
If you want to count all employees who were active during the entire input date range
SELECT COUNT(peo.EMPLOYEE_NUMBER) AS "CT"
FROM hr.per_all_people_f peo
WHERE peo.[EFFECTIVE_START_DATE] <= :StartDate
AND (peo.[EFFECTIVE_END_DATE] IS NULL OR peo.[EFFECTIVE_END_DATE] >= :EndDate)
Here is my example based on Gordon Linoff's answer,
with a small modification: in the subtracting branch of the union, every record contributed -1 in num even when its end date was NULL, so I filter those rows out.
use AdventureWorksDW2012 --using in MS SSMS for choosing DATABASE to work with
-- and may be not work in other platforms
select
t.thedate
,max(t.numActives) AS "Total Active Employees"
from (
select
dates.thedate
,SUM(dates.num) over (order by dates.thedate) as numActives
from
(
(
select
StartDate as thedate
,1 as num
from DimEmployee
)
union all
(
select
EndDate as thedate
,-1 as num
from DimEmployee
where EndDate IS NOT NULL
)
) AS dates
) AS t
group by thedate
ORDER BY thedate
worked for me, hope it will help somebody
I was able to get the results I was looking for with the following:
--Active Team Members by Date
SELECT "a_date",
COUNT(peo.EMPLOYEE_NUMBER) AS "CT"
FROM hr.per_all_people_f peo,
(SELECT DATE '2012-04-01'-1 + LEVEL AS "a_date"
FROM dual
CONNECT BY LEVEL <= DATE '2012-04-30'+2 - DATE '2012-04-01'-1
)
WHERE peo.current_employee_flag = 'Y'
AND "a_date" BETWEEN peo.effective_start_date AND peo.EFFECTIVE_END_DATE
GROUP BY "a_date"
ORDER BY "a_date"

Multiple Joins on Temporary Table OR Subquery in SQL 2008

I have the following SQL, which gets a season for each day in a range of dates, then groups each season by start and end date with the number of nights. What it does is not important; my question is which is better: the way I've done it below, or using the first select statement as a subquery each time @dateSeasons is used in the second query. Both ways seem to run the same, but this way looks neater.
DECLARE @dateSeasons TABLE ([date] date, seasonID int)
INSERT INTO @dateSeasons
SELECT D.[date], S.ID
FROM #dates AS D
CROSS APPLY (
SELECT TOP 1 ID
FROM dbo.Seasons
WHERE bookingID = @bookingID
AND D.[date] BETWEEN startDate AND endDate
ORDER BY ID DESC
) AS S
SELECT MIN([date]), endDate, DATEDIFF(DAY, MIN([date]), DATEADD(DAY, 1, endDate)), seasonID
FROM (
SELECT S1.seasonID, S1.[date], (
SELECT MAX([date])
FROM @dateSeasons S2
WHERE S2.seasonID = S1.seasonID
AND NOT EXISTS (
SELECT NULL
FROM @dateSeasons S3
WHERE S3.[date] < S2.[date]
AND S3.[date] > S1.[date]
AND S3.seasonID <> S1.seasonID
)
) AS endDate
FROM @dateSeasons S1
) AS results
GROUP BY endDate, seasonID
ORDER BY MIN([date])
Looking neater is irrelevant in writing SQL Code. What looks elegant is often the worst possible way to solve the problem from a performance standpoint.
The only way to know for sure which is best is to first make sure both versions you are testing return the same results, then performance test them and check the execution plans (or EXPLAIN in MySQL). Techniques that make a query faster are database specific as well. What works best for performance tuning in SQL Server might be the worst choice in Oracle.
Sometimes you can get better performance by using a common table expression (CTE):
WITH
dateSeasons ([date], [seasonID])
AS
(
SELECT D.[date], S.ID
FROM #dates AS D
CROSS APPLY (
SELECT TOP 1 ID
FROM dbo.Seasons
WHERE bookingID = @bookingID
AND D.[date] BETWEEN startDate AND endDate
ORDER BY ID DESC
) AS S
)
SELECT MIN([date]), endDate, DATEDIFF(DAY, MIN([date]), DATEADD(DAY, 1, endDate)), seasonID
FROM (
SELECT S1.seasonID, S1.[date], (
SELECT MAX([date])
FROM dateSeasons S2
WHERE S2.seasonID = S1.seasonID
AND NOT EXISTS (
SELECT NULL
FROM dateSeasons S3
WHERE S3.[date] < S2.[date]
AND S3.[date] > S1.[date]
AND S3.seasonID <> S1.seasonID
)
) AS endDate
FROM dateSeasons S1
) AS results
GROUP BY endDate, seasonID
ORDER BY MIN([date])

Merge overlapping date intervals

Is there a better way of merging overlapping date intervals?
The solution I came up with is so simple that now I wonder if someone else has a better idea of how this could be done.
/***** DATA EXAMPLE *****/
DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)
INSERT INTO @T (d1, d2)
SELECT '2010-01-01','2010-03-31' UNION SELECT '2010-04-01','2010-05-31'
UNION SELECT '2010-06-15','2010-06-25' UNION SELECT '2010-06-26','2010-07-10'
UNION SELECT '2010-08-01','2010-08-05' UNION SELECT '2010-08-01','2010-08-09'
UNION SELECT '2010-08-02','2010-08-07' UNION SELECT '2010-08-08','2010-08-08'
UNION SELECT '2010-08-09','2010-08-12' UNION SELECT '2010-07-04','2010-08-16'
UNION SELECT '2010-11-01','2010-12-31' UNION SELECT '2010-03-01','2010-06-13'
/***** INTERVAL ANALYSIS *****/
WHILE (1=1) BEGIN
UPDATE t1 SET t1.d2 = t2.d2
FROM @T AS t1 INNER JOIN @T AS t2 ON
DATEADD(day, 1, t1.d2) BETWEEN t2.d1 AND t2.d2
IF @@ROWCOUNT = 0 BREAK
END
/***** RESULT *****/
SELECT StartDate = MIN(d1) , EndDate = d2
FROM @T
GROUP BY d2
ORDER BY StartDate, EndDate
/***** OUTPUT *****/
/*****
StartDate EndDate
2010-01-01 2010-06-13
2010-06-15 2010-08-16
2010-11-01 2010-12-31
*****/
I was looking for the same solution and came across this post on Combine overlapping datetime to return single overlapping range record.
There is another thread on Packing Date Intervals.
I tested this with various date ranges, including the ones listed here, and it works correctly every time.
SELECT
s1.StartDate,
--t1.EndDate
MIN(t1.EndDate) AS EndDate
FROM #T s1
INNER JOIN #T t1 ON s1.StartDate <= t1.EndDate
AND NOT EXISTS(SELECT * FROM #T t2
WHERE t1.EndDate >= t2.StartDate AND t1.EndDate < t2.EndDate)
WHERE NOT EXISTS(SELECT * FROM #T s2
WHERE s1.StartDate > s2.StartDate AND s1.StartDate <= s2.EndDate)
GROUP BY s1.StartDate
ORDER BY s1.StartDate
The result is:
StartDate | EndDate
2010-01-01 | 2010-06-13
2010-06-15 | 2010-06-25
2010-06-26 | 2010-08-16
2010-11-01 | 2010-12-31
You asked this back in 2010 but don't specify any particular version.
An answer for people on SQL Server 2012+
WITH T1
AS (SELECT *,
MAX(d2) OVER (ORDER BY d1) AS max_d2_so_far
FROM @T),
T2
AS (SELECT *,
CASE
WHEN d1 <= DATEADD(DAY, 1, LAG(max_d2_so_far) OVER (ORDER BY d1))
THEN 0
ELSE 1
END AS range_start
FROM T1),
T3
AS (SELECT *,
SUM(range_start) OVER (ORDER BY d1) AS range_group
FROM T2)
SELECT range_group,
MIN(d1) AS d1,
MAX(d2) AS d2
FROM T3
GROUP BY range_group
Which returns
+-------------+------------+------------+
| range_group | d1 | d2 |
+-------------+------------+------------+
| 1 | 2010-01-01 | 2010-06-13 |
| 2 | 2010-06-15 | 2010-08-16 |
| 3 | 2010-11-01 | 2010-12-31 |
+-------------+------------+------------+
DATEADD(DAY, 1 is used because your desired results show you want a period ending on 2010-06-25 to be collapsed into one starting 2010-06-26. For other use cases this may need adjusting.
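The running-maximum idea (compare each start date with the largest end date seen so far, plus one day) can be checked against the sample data with a short Python sketch. This is not the query itself, just the same logic written as a loop; `merge_overlapping`, `gap` and `DATA` are names made up for the illustration.

```python
from datetime import date, timedelta

def merge_overlapping(intervals, gap=timedelta(days=1)):
    """Merge closed intervals that overlap or sit within `gap` of each other."""
    merged = []
    for d1, d2 in sorted(intervals):
        # merged[-1][1] plays the role of max_d2_so_far within the current group.
        if merged and d1 <= merged[-1][1] + gap:
            merged[-1][1] = max(merged[-1][1], d2)
        else:
            merged.append([d1, d2])  # a new range starts here
    return [tuple(m) for m in merged]

# Sample data from the question.
DATA = [
    (date(2010, 1, 1), date(2010, 3, 31)), (date(2010, 4, 1), date(2010, 5, 31)),
    (date(2010, 6, 15), date(2010, 6, 25)), (date(2010, 6, 26), date(2010, 7, 10)),
    (date(2010, 8, 1), date(2010, 8, 5)), (date(2010, 8, 1), date(2010, 8, 9)),
    (date(2010, 8, 2), date(2010, 8, 7)), (date(2010, 8, 8), date(2010, 8, 8)),
    (date(2010, 8, 9), date(2010, 8, 12)), (date(2010, 7, 4), date(2010, 8, 16)),
    (date(2010, 11, 1), date(2010, 12, 31)), (date(2010, 3, 1), date(2010, 6, 13)),
]

for d1, d2 in merge_overlapping(DATA):
    print(d1, d2)
```

Passing `gap=timedelta(0)` would merge only intervals that actually overlap or share an endpoint, which is the adjustment mentioned for other use cases.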
Here is a solution with just three simple scans. No CTEs, no recursion, no joins, no table updates in a loop, no "group by" — as a result, this solution should scale the best (I think).
I think number of scans can be reduced to two, if min and max dates are known in advance;
the logic itself just needs two scans — find gaps, applied twice.
declare @datefrom datetime, @datethru datetime
DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)
INSERT INTO @T (d1, d2)
SELECT '2010-01-01','2010-03-31'
UNION SELECT '2010-03-01','2010-06-13'
UNION SELECT '2010-04-01','2010-05-31'
UNION SELECT '2010-06-15','2010-06-25'
UNION SELECT '2010-06-26','2010-07-10'
UNION SELECT '2010-08-01','2010-08-05'
UNION SELECT '2010-08-01','2010-08-09'
UNION SELECT '2010-08-02','2010-08-07'
UNION SELECT '2010-08-08','2010-08-08'
UNION SELECT '2010-08-09','2010-08-12'
UNION SELECT '2010-07-04','2010-08-16'
UNION SELECT '2010-11-01','2010-12-31'
select @datefrom = min(d1) - 1, @datethru = max(d2) + 1 from @T
SELECT
StartDate, EndDate
FROM
(
SELECT
MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
LEAD(StartDate ) OVER (ORDER BY StartDate) - 1 EndDate
FROM
(
SELECT
StartDate, EndDate
FROM
(
SELECT
MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
LEAD(StartDate) OVER (ORDER BY StartDate) - 1 EndDate
FROM
(
SELECT d1 StartDate, d2 EndDate from @T
UNION ALL
SELECT @datefrom StartDate, @datefrom EndDate
UNION ALL
SELECT @datethru StartDate, @datethru EndDate
) T
) T
WHERE StartDate <= EndDate
UNION ALL
SELECT @datefrom StartDate, @datefrom EndDate
UNION ALL
SELECT @datethru StartDate, @datethru EndDate
) T
) T
WHERE StartDate <= EndDate
The result is:
StartDate EndDate
2010-01-01 2010-06-13
2010-06-15 2010-08-16
2010-11-01 2010-12-31
The idea is to simulate the scanning algorithm for merging intervals. My solution makes sure it works across a wide range of SQL implementations. I've tested it on MySQL, Postgres, SQL-Server 2017, SQLite and even Hive.
Assuming the table schema is the following.
CREATE TABLE t (
a DATETIME,
b DATETIME
);
We also assume the interval is half-open like [a,b).
When (a,i,j) is in the table, it shows that there are j intervals covering a, and there are i intervals covering the previous point.
CREATE VIEW r AS
SELECT a,
Sum(d) OVER (ORDER BY a ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS i,
Sum(d) OVER (ORDER BY a ROWS UNBOUNDED PRECEDING) AS j
FROM (SELECT a, Sum(d) AS d
FROM (SELECT a, 1 AS d FROM t
UNION ALL
SELECT b, -1 AS d FROM t) e
GROUP BY a) f;
We produce all the endpoints in the union of the intervals and pair up adjacent ones. Finally, we produce the set of intervals by only picking the odd-numbered rows.
SELECT a, b
FROM (SELECT a,
Lead(a) OVER (ORDER BY a) AS b,
Row_number() OVER (ORDER BY a) AS n
FROM r
WHERE j=0 OR i=0 OR i is null) e
WHERE n%2 = 1;
I've created a sample DB-fiddle and SQL-fiddle. I also wrote a blog post on union intervals in SQL.
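The scanning algorithm that the view simulates is short when written as a loop. Here is a hedged Python sketch of it for half-open intervals [a, b); `union_half_open` is a hypothetical name, and `before`/`depth` correspond to the i and j columns of the view. Integers are used as endpoints to keep the example small, but any sortable values (dates included) work.

```python
from collections import Counter

def union_half_open(intervals):
    """Union of half-open intervals [a, b), computed with a sweep."""
    events = Counter()
    for a, b in intervals:
        events[a] += 1   # an interval opens at a
        events[b] -= 1   # an interval closes at b
    result, depth, start = [], 0, None
    for point in sorted(events):
        before = depth            # intervals covering the previous point (i)
        depth += events[point]    # intervals covering this point (j)
        if before == 0 and depth > 0:
            start = point                   # the union starts a new piece
        elif before > 0 and depth == 0:
            result.append((start, point))   # the current piece ends here
    return result

print(union_half_open([(1, 3), (2, 5), (5, 7), (9, 10)]))
```

The point 5 closes one interval and opens another, so its net change is zero and it is skipped, just as the `j=0 OR i=0` filter drops it in the SQL.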
A Geometric Approach
Here and elsewhere I've noticed that answers to date packing questions don't offer a geometric approach to this problem. After all, any range, date ranges included, can be interpreted as a line. So why not convert them to a SQL geometry type and use geometry::UnionAggregate to merge the ranges?
Why?
This has the advantage of handling all types of overlaps, including fully nested ranges. It also works like any other aggregate query, so it's a little more intuitive in that respect. You also get the bonus of a visual representation of your results if you care to use it. Finally, it is the approach I use for simultaneous range packing (you work with rectangles instead of lines in that case, and there are many more considerations). I just couldn't get the existing approaches to work in that scenario.
This has the disadvantage of requiring more recent versions of SQL Server. It also requires a numbers table and it's annoying to extract the individually produced lines from the aggregated shape. But hopefully in the future Microsoft adds a TVF that allows you to do this easily without a numbers table (or you can just build one yourself). Also, geometrical objects work with floats, so you have conversion annoyances and precision concerns to keep in mind.
Performance-wise I don't know how it compares, but I've done a few things (not shown here) to make it work for me even with large datasets.
Code Description
In 'numbers':
I build a table representing a sequence
Swap it out with your favorite way to make a numbers table.
For a union operation, you won't ever need more rows than in
your original table, so I just use it as the base to build it.
In 'mergeLines':
I convert the dates to floats and use those floats
to create geometrical points.
In this problem, we're working in
'integer space,' meaning there are no time considerations, and so
a begin date in one range that is one day apart from an end date
in another should be merged with that other. In order to make
that merge happen, we need to convert to 'real space', so we
add 1 to the tail of all ranges (we undo this later).
I then connect these points via STUnion and STEnvelope.
Finally, I merge all these lines via UnionAggregate. The resulting
'lines' geometry object might contain multiple lines, but if they
overlap, they turn into one line.
In the outer query:
I use the numbers CTE to extract the individual lines inside 'lines'.
I take the envelope of each line, which here ensures that each line
is stored only as its two endpoints.
I read the endpoint x values and convert them back to their time
representations, making sure to put them back into 'integer space'.
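The 'integer space' to 'real space' step described above can be illustrated outside SQL. What follows is a minimal Python sketch, not the geometry query itself: the function name merge_inclusive is hypothetical, and it assumes each range is an inclusive (start, end) pair of dates. It adds one day to every end so that adjacent ranges touch, merges anything that touches or overlaps, and then takes the day off again.

```python
from datetime import date


def merge_inclusive(ranges):
    """Merge inclusive (start, end) date ranges that overlap or are adjacent.

    Hypothetical helper for illustration only; it mirrors the
    'add 1 to the tail, union, subtract 1' idea, not the geometry calls.
    """
    # Move to 'real space': treat each range as [start, end + 1 day).
    shifted = sorted((s.toordinal(), e.toordinal() + 1) for s, e in ranges)

    merged = []
    for lo, hi in shifted:
        if merged and lo <= merged[-1][1]:
            # Touches or overlaps the previous line: extend it.
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])

    # Back to 'integer space': undo the one-day shift on the end.
    return [(date.fromordinal(lo), date.fromordinal(hi - 1)) for lo, hi in merged]
```

With the data from the question, the eight input rows collapse to the four expected ranges, because an end of 1982-09-30 becomes 1982-10-01 after the shift and therefore touches the next start.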
The Code
with
numbers as (
select row_number() over (order by (select null)) i
from #t
),
mergeLines as (
select lines = geometry::UnionAggregate(line)
from #t
cross apply (select line =
geometry::Point(convert(float, d1), 0, 0).STUnion(
geometry::Point(convert(float, d2) + 1, 0, 0)
).STEnvelope()
) l
)
select ap.StartDate,
ap.EndDate
from mergeLines ml
join numbers n on n.i between 1 and ml.lines.STNumGeometries()
cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
cross apply (select
StartDate = convert(datetime,l.line.STPointN(1).STX),
EndDate = convert(datetime,l.line.STPointN(3).STX) - 1
) ap
order by ap.StartDate;
In this solution, I created a temporary Calendar table which stores a value for every day across a range. This type of table can be made static. In addition, I'm only storing 400 some odd dates starting with 2009-12-31. Obviously, if your dates span a larger range, you would need more values.
In addition, this solution will only work with SQL Server 2005+ in that I'm using a CTE.
With Calendar As
(
Select DateAdd(d, ROW_NUMBER() OVER ( ORDER BY s1.object_id ), '1900-01-01') As [Date]
From sys.columns as s1
Cross Join sys.columns as s2
)
, StopDates As
(
Select C.[Date]
From Calendar As C
Left Join #T As T
On C.[Date] Between T.d1 And T.d2
Where C.[Date] >= ( Select Min(T2.d1) From #T As T2 )
And C.[Date] <= ( Select Max(T2.d2) From #T As T2 )
And T.d1 Is Null
)
, StopDatesInUse As
(
Select D1.[Date]
From StopDates As D1
Left Join StopDates As D2
On D1.[Date] = DateAdd(d,1,D2.Date)
Where D2.[Date] Is Null
)
, DataWithEariestStopDate As
(
Select *
, (Select Min(SD2.[Date])
From StopDatesInUse As SD2
Where T.d2 < SD2.[Date] ) As StopDate
From #T As T
)
Select Min(d1), Max(d2)
From DataWithEariestStopDate
Group By StopDate
Order By Min(d1)
EDIT The problem with using dates in 2009 has nothing to do with the final query. The problem is that the Calendar table is not big enough. I started the Calendar table at 2009-12-31. I have revised it to start at 1900-01-01.
Try this
;WITH T1 AS
(
SELECT d1, d2, ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
FROM #T
), NUMS AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
FROM T1 A
CROSS JOIN T1 B
CROSS JOIN T1 C
), ONERANGE AS
(
SELECT DISTINCT DATEADD(DAY, ROW_NUMBER() OVER(PARTITION BY T1.R ORDER BY (SELECT 0)) - 1, T1.D1) AS ELEMENT
FROM T1
CROSS JOIN NUMS
WHERE NUMS.R <= DATEDIFF(DAY, d1, d2) + 1
), SEQUENCE AS
(
SELECT ELEMENT, DATEDIFF(DAY, '19000101', ELEMENT) - ROW_NUMBER() OVER(ORDER BY ELEMENT) AS rownum
FROM ONERANGE
)
SELECT MIN(ELEMENT) AS StartDate, MAX(ELEMENT) as EndDate
FROM SEQUENCE
GROUP BY rownum
The basic idea is to first unroll the existing data, so you get a separate row for each day. This is done in ONERANGE.
Then compare how the dates increment with how the row numbers increment. SEQUENCE does this by subtracting the row number from the day number.
The difference remains constant within an existing range/island. As soon as you get to a new data island, the difference between them increases because the date increments by more than 1, while the row number increments by 1.
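The "day number minus row number" idea can also be traced in a few lines of Python. This is only a sketch under assumptions: merge_by_islands is a hypothetical name, the input is a list of inclusive (start, end) date pairs, and, like ONERANGE, it unrolls every day, so the work grows with the number of days covered rather than the number of rows.

```python
from datetime import date, timedelta
from itertools import groupby


def merge_by_islands(ranges):
    """Group consecutive days into islands via (day number - row number)."""
    # Unroll each range into its individual days and de-duplicate,
    # which plays the role of SELECT DISTINCT in ONERANGE.
    days = sorted(
        {s + timedelta(days=k) for s, e in ranges for k in range((e - s).days + 1)}
    )

    # Within an island the day number and the row number both grow by 1,
    # so their difference stays constant; a gap makes the difference jump.
    keyed = [(d.toordinal() - i, d) for i, d in enumerate(days)]

    result = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        island = [d for _, d in group]
        result.append((island[0], island[-1]))
    return result
```

For example, the ranges 2010-01-01 to 2010-01-03 and 2010-01-04 to 2010-01-05 share one difference value and come back as a single island, while a range starting on 2010-01-10 gets a larger difference and forms its own island.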
This solution is similar to the first solution, with an additional deletion condition.
It merges the rows in the main table itself instead of using a different table to store the result.
DROP TABLE IF EXISTS #SampleTable;
CREATE TABLE #SampleTable (StartTime DATETIME NULL, EndTime DATETIME NULL);
INSERT INTO #SampleTable(StartTime, EndTime)
VALUES
(N'2010-01-01T00:00:00', N'2010-03-31T00:00:00'),
(N'2010-03-01T00:00:00', N'2010-06-13T00:00:00'),
(N'2010-04-01T00:00:00', N'2010-05-31T00:00:00'),
(N'2010-06-15T00:00:00', N'2010-06-25T00:00:00'),
(N'2010-06-26T00:00:00', N'2010-07-10T00:00:00'),
(N'2010-07-04T00:00:00', N'2010-08-16T00:00:00'),
(N'2010-08-01T00:00:00', N'2010-08-05T00:00:00'),
(N'2010-08-01T00:00:00', N'2010-08-09T00:00:00'),
(N'2010-08-02T00:00:00', N'2010-08-07T00:00:00'),
(N'2010-08-08T00:00:00', N'2010-08-08T00:00:00'),
(N'2010-08-09T00:00:00', N'2010-08-12T00:00:00'),
(N'2010-11-01T00:00:00', N'2010-12-31T00:00:00');
--
-- Counts the rows changed in one pass; the loop ends when a pass changes nothing.
DECLARE @RowCount INT=0;
WHILE(1=1)
BEGIN
SET @RowCount=0;
-- Extend any range whose next day falls inside another range.
UPDATE T1
SET T1.EndTime=T2.EndTime
FROM dbo.#SampleTable AS T1
INNER JOIN dbo.#SampleTable AS T2 ON DATEADD(DAY, 1, T1.EndTime) BETWEEN T2.StartTime AND T2.EndTime;
--
SET @RowCount=@RowCount+@@ROWCOUNT;
-- Remove ranges that now share an end with a range that starts earlier.
DELETE T1
FROM dbo.#SampleTable AS T1
INNER JOIN dbo.#SampleTable AS T2 ON T1.EndTime=T2.EndTime AND T1.StartTime>T2.StartTime;
--
SET @RowCount=@RowCount+@@ROWCOUNT;
--
IF @RowCount=0
BREAK;
END;
SELECT * FROM #SampleTable
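To see why the loop settles, here is a small in-memory Python sketch of the same extend-then-delete passes. It is an approximation, not a translation: merge_in_place is a hypothetical name, rows are mutable [start, end] lists, and where several ranges match in the extend step the sketch always takes the latest end, whereas a T-SQL UPDATE ... FROM with multiple matching rows picks one of them arbitrarily.

```python
from datetime import date, timedelta


def merge_in_place(rows):
    """Repeat 'extend' and 'delete' passes on rows until nothing changes."""
    one_day = timedelta(days=1)
    while True:
        changed = 0

        # Extend step: if the day after a range's end falls inside another
        # range, push the end out (assumption: use the latest matching end).
        snapshot = [tuple(r) for r in rows]
        for r in rows:
            next_day = r[1] + one_day
            ends = [e for s, e in snapshot if s <= next_day <= e]
            if ends:
                r[1] = max(ends)
                changed += 1

        # Delete step: drop a range when another range has the same end
        # and an earlier start, since it is now fully covered.
        snapshot = [tuple(r) for r in rows]
        kept = [r for r in rows if not any(e == r[1] and s < r[0] for s, e in snapshot)]
        changed += len(rows) - len(kept)
        rows[:] = kept

        if changed == 0:
            break
    return sorted(tuple(r) for r in rows)
```

Run against the twelve sample rows above, the sketch needs two changing passes and ends with three ranges; 2010-06-13 and 2010-06-15 stay separate because 2010-06-14 is not covered.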
I was inspired by the Geometric Approach given by pwilcox, but wanted to try a different approach. This uses Trino, but I hope the functions used can also be found in other SQL dialects.
WITH Geo AS (
SELECT
transform( -- 6) See Below~
ST_Geometries( -- 5) Extracts an array of individual lines from the union.
geometry_union( -- 4) Returns the union of aggregated lines, melding all lines together into a single geometric multi-line.
array_agg( -- 3) Aggregation function that joins all lines together.
ST_LineString( -- 2) Makes the pairs of geometric points into lines.
ARRAY[ST_Point(0, to_unixtime(d1)), ST_Point(0, to_unixtime(d2))] -- 1) Takes unix time start and end values and makes them into an array of geometric points.
)
)
)
)
, x -> (ST_YMin(x), ST_Length(x))) AS timestamp_duration -- 6) From the array of lines, the minimum value and length of each line are extracted.
FROM #T -- The minimum value is a timestamp, length is duration.
WHERE d1 <> d2 -- I had errors any time this was the case.
)
-- 7) Finally, I unnest the produced array and convert the values back into timestamps.
SELECT from_unixtime(timestamp) AS StartDate
, from_unixtime(timestamp + duration) AS EndDate
FROM Geo
CROSS JOIN UNNEST(timestamp_duration) AS t(timestamp, duration)
For reference, this took my company cluster about 2 minutes to make 400k start/end timestamps into 700 distinct start/end timestamps.
It also runs in just 2 stages.