Group by contiguous dates and Count - sql

I have a table which contains information about reports being accessed, along with the date. I need to group the report accesses by a date range and count them.
I'm using T-SQL
Table
EventId ReportId Date
60 4 11/24/2015
59 11 11/23/2015
58 6 11/22/2015
57 11 11/22/2015
56 9 11/21/2015
55 3 11/20/2015
54 5 11/20/2015
53 6 11/19/2015
52 5 11/19/2015
51 4 11/18/2015
50 3 11/17/2015
49 9 11/16/2015
If the difference in days is 3, then I need the result in this format:
StartDate EndDate ReportsAccessed
11/22/2015 11/24/2015 4
11/19/2015 11/21/2015 5
11/16/2015 11/18/2015 3
but the difference in days could change.

Assuming you have values for all the dates, then you can calculate the difference in days between each date and the maximum (or minimum) date. Then divide this by three and use that for aggregation:
select min(date), max(date), count(*) as ReportsAccessed
from (select t.*, max(date) over () as maxd
from table t
) t
group by (datediff(day, date, maxd) / 3)
order by min(date);
"3" is what I think you are referring to as the "difference in days".

These two blocks are simply for clarity about which parameters you would have to change:
DECLARE @t as TABLE(
id int identity(1,1),
reportId int,
dateAccess date)
DECLARE @NumberOfDays int=3;
And here comes the actual select
Select StartDate, EndDate, COUNT(reportId) from
(
select *,
DATEADD(day, DATEDIFF(DAY, dateAccess, maxdate.maxdate)%@NumberOfDays, dateAccess) as EndDate,
DATEADD(day, DATEDIFF(DAY, dateAccess, maxdate.maxdate)%@NumberOfDays-@NumberOfDays+1, dateAccess) as StartDate
from @t, (select MAX(dateAccess) maxdate from @t t2) maxdate
) results
GROUP BY StartDate, EndDate
ORDER BY StartDate desc
There are a few places where I'm unsure whether this is optimal (for instance, cross joining with SELECT MAX(date) instead of using a subquery), but it returns the exact result from your OP.
Basically, I simply split the entries into groups based on how far they are from MAX(date), and then use a COUNT. On that note, it might be more useful to use COUNT(DISTINCT ...): otherwise, if someone looks at document #9 three times, it will tell you that 3 documents were checked, when only 1 was truly looked at.
The upside of using MAX(date) over MIN(date) is that your first group will always have the maximum number of days. This proves very useful if you want to compare the last few periods to the average. The downside is that you don't have stable data: with every new entry (assuming it's a new day), your query will cycle and produce a new set of results. If you wanted to graph the data, you'd be better off anchoring to MIN(date), so that the first days won't change when you add a new one.
Depending on the usage, it could even be useful to extrapolate the number of accesses in the last period (in that case MIN(date) is also preferable).
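As a sketch of both of those remarks together, reusing the @t table variable and @NumberOfDays declared above (illustrative only, not part of the original answer): anchor the periods on MIN(date) and count each report only once per period.
-- Anchor the periods on the earliest date instead of the latest one,
-- and count each report only once per period.
SELECT MIN(dateAccess) AS StartDate,
       MAX(dateAccess) AS EndDate,
       COUNT(DISTINCT reportId) AS ReportsAccessed
FROM (SELECT *, MIN(dateAccess) OVER () AS mindate FROM @t) t
GROUP BY DATEDIFF(DAY, mindate, dateAccess) / @NumberOfDays
ORDER BY StartDate;
With MIN anchoring it is the last group, not the first, that may end up shorter than @NumberOfDays days.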
Here's an adaptation of Gordon's answer that's probably much more optimized (it's at the very least much more aesthetic):
SELECT DateADD(day, -datediff(day, dateAccess, maxdate)/3*3, maxdate) as EndDate,
DateADD(day, -datediff(day, dateAccess, maxdate)/3*3-2, maxdate) as StartDate,
count(reportId)
from (select *, MAX(dateAccess) over() as maxdate from @t) t
GROUP BY datediff(day, dateAccess, maxdate)/3, maxdate

I will insist that the most efficient way of doing this is to use a tally table. That way you get sargable predicates, with all the benefits of indexes on the date column:
declare @c int = 3
;with minmax as(select min(date) as mind, max(date) as maxd from t),
tally as(select @c * (-1 + row_number() over(order by(select null))) as rn
from master..spt_values),
intervals as(select dateadd(dd, rn, mind) as f, dateadd(dd, rn + @c - 1, mind) t
from tally t cross join minmax m where dateadd(dd, rn, mind) <= maxd)
select i.f as [from], i.t as [to], count(*) as reports
from intervals i
join t on t.date >= i.f and t.date <= i.t
group by i.f, i.t
Explanation: minmax selects the minimum and maximum dates from the table.
tally generates numbers from 0 to N (N depends on the system, but it is enough to calculate the intervals). intervals builds the resulting intervals. The last part is a simple join to the intervals to calculate the counts per interval.
Fiddle http://sqlfiddle.com/#!3/c61d1/5

Related

How do I get the month number with the maximum number of days from the date range?

I have a table with 10 million rows, with two columns that contain the start date and the end date of a range, for example 2019-09-25 and 2019-10-20. I want to extract the number of the month that contains the maximum number of days of the range; in this example it will be 10. Besides ranges that span a month boundary, there are also examples like 2019-07-01 and 2019-07-29 (within one month), as well as 2019-07-01 and 2019-09-05 (more than one month). How can I implement this?
Seems like you could do something like this:
SELECT CASE WHEN DATEDIFF(DAY, DATEFROMPARTS(YEAR(EndDate),MONTH(EndDate),1),EndDate) >= DATEDIFF(DAY, StartDate, EOMONTH(StartDate)) THEN DATEPART(MONTH,EndDate)
ELSE DATEPART(MONTH,StartDate)
END
FROM (VALUES('20190925','20191020'))V(StartDate,EndDate);
Does the following fit your requirements?
You can build a table of days-in-month (this would be permanent ideally)
and then join to it using the month numbers of your min and max dates.
declare @start date='20190925', @end date='20191020';
--declare @start date='20190701', @end date='20190729';
--declare @start date='20190701', @end date='20190905';
with dim as (
select m,DAY(DATEADD(DD,-1,DATEADD(mm, DATEDIFF(mm, 0, DateFromParts(Year(GetDate()),m,1) )+1, 0)))d
from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12))m(m)
)
select top(1) with ties m
from dim
where m between Month(@start) and Month(@end)
order by d desc
You don't state how to determine the winner when several months have the same number of days, so WITH TIES includes all qualifying months.
Edit
So I don't know if there is a requirement to span years - the sample data suggests not - however, with a permanent list of dates and corresponding days-in-month values (this is often part of a calendar table), a slight tweak will accommodate it.
with dim as (
select Year(@start)*100 + m m, Day(DATEADD(DD,-1,DATEADD(mm, DATEDIFF(mm, 0, DateFromParts(Year(@start),m,1) )+1, 0)))d
from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12))m(m)
union all
select Year(@end)*100 + m m, Day(DATEADD(DD,-1,DATEADD(mm, DATEDIFF(mm, 0, DateFromParts(Year(@end),m,1) )+1, 0)))d
from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12))m(m)
)
select top(1) with ties m
from dim
where m between Year(@start)*100 + Month(@start) and Year(@end)*100 + Month(@end)
order by d desc
You could try something like this
with
l0(n) as (
select 1 n
from (values (1),(1),(1),(1),(1),(1),(1),(1)) as v(n))
select top(1) with ties
vTable.*, calc.dt month_with_most_days
from (values ('20190925','20191020'),
('20190925','20191120')) vTable(startdate, enddate)
cross apply (values (datediff(month, vTable.startdate, vTable.enddate))) diff(mo_count)
cross apply (select top (diff.mo_count+1)
row_number() over (order by (select null)) n
from l0 l1, l0 l2, l0 l3, l0 l4) tally /* 8^4 months possible */
cross apply (values (cast(case when tally.n=1 then startdate
when tally.n=diff.mo_count+1 then enddate
else eomonth(dateadd(month, tally.n-1, startdate)) end as date))) calc(dt)
order by row_number() over (partition by startdate, enddate
order by day(calc.dt) desc);
startdate enddate month_with_most_days
20190925 20191020 2019-09-25
20190925 20191120 2019-10-31

SQL how to write a query that return missing date ranges?

I am trying to figure out how to write a query that looks at certain records and finds missing date ranges between today and 9999-12-31.
My data looks like below:
ID |start_dt |end_dt |prc_or_disc_1
10412 |2018-07-17 00:00:00.000 |2018-07-20 00:00:00.000 |1050.000000
10413 |2018-07-23 00:00:00.000 |2018-07-26 00:00:00.000 |1040.000000
So for this data I would want my query to return:
2018-07-10 | 2018-07-16
2018-07-21 | 2018-07-22
2018-07-27 | 9999-12-31
I'm not really sure where to start. Is this possible?
You can do that using the lag() function in MS SQL (available starting with SQL Server 2012).
with myData as
(
select *,
lag(end_dt,1) over (order by start_dt) as lagEnd
from myTable),
myMax as
(
select Max(end_dt) as maxDate from myTable
)
select dateadd(d,1,lagEnd) as StartDate, dateadd(d, -1, start_dt) as EndDate
from myData
where lagEnd is not null and dateadd(d,1,lagEnd) < start_dt
union all
select dateAdd(d,1,maxDate) as StartDate, cast('99991231' as Datetime) as EndDate
from myMax
where maxDate < '99991231';
lag() is not available in MS SQL 2008, but you can mimic it with row_number() and a self-join.
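A rough sketch of that workaround against the same myTable (illustrative only; it covers just the gaps between rows, so the final open-ended range up to 9999-12-31 would still come from the union in the query above):
with numbered as
(
    -- row_number() stands in for the ordering that lag() would use
    select *, row_number() over (order by start_dt) as rn
    from myTable
)
select dateadd(d, 1, prev.end_dt) as StartDate,
       dateadd(d, -1, cur.start_dt) as EndDate
from numbered cur
join numbered prev on prev.rn = cur.rn - 1
where dateadd(d, 1, prev.end_dt) < cur.start_dt;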
select
CASE WHEN DATEDIFF(day, end_dt, ISNULL(LEAD(start_dt) over (order by ID), '99991231')) > 1 then end_dt +1 END as F1,
CASE WHEN DATEDIFF(day, end_dt, ISNULL(LEAD(start_dt) over (order by ID), '99991231')) > 1 then ISNULL(LEAD(start_dt) over (order by ID) - 1, '99991231') END as F2
from t
Working SQLFiddle example is -> Here
FOR 2008 VERSION
SELECT
X.end_dt + 1 as F1,
ISNULL(Y.start_dt-1, '99991231') as F2
FROM t X
LEFT JOIN (
SELECT
*
, (SELECT MAX(ID) FROM t WHERE ID < A.ID) as ID2
FROM t A) Y ON X.ID = Y.ID2
WHERE DATEDIFF(day, X.end_dt, ISNULL(Y.start_dt, '99991231')) > 1
Working SQLFiddle example is -> Here
This should work in 2008; it assumes that ranges in your table do not overlap. It will also eliminate rows where the end_dt of the current row is a day before the start date of the next row.
with dtRanges as (
select start_dt, end_dt, row_number() over (order by start_dt) as rownum
from table1
)
select t2.end_dt + 1, coalesce(start_dt_next -1,'99991231')
FROM
( select dr1.start_dt, dr1.end_dt,dr2.start_dt as start_dt_next
from dtRanges dr1
left join dtRanges dr2 on dr2.rownum = dr1.rownum + 1
) t2
where
t2.end_dt + 1 <> coalesce(start_dt_next,'99991231')
http://sqlfiddle.com/#!18/65238/1
SELECT
*
FROM
(
SELECT
end_dt+1 AS start_dt,
LEAD(start_dt-1, 1, '9999-12-31')
OVER (ORDER BY start_dt)
AS end_dt
FROM
yourTable
)
gaps
WHERE
gaps.end_dt >= gaps.start_dt
I would, however, strongly urge you to use end dates that are "exclusive". That is, the range is everything up to but excluding the end_dt.
That way, a range of one day becomes '2018-07-09', '2018-07-10'.
It's really clear that my range is one day long: if you subtract one from the other, you get a day.
Also, if you ever change to needing hour granularity or minute granularity you don't need to change your data. It just works. Always. Reliably. Intuitively.
If you search the web you'll find plenty of documentation on why inclusive-start and exclusive-end is a very good idea from a software perspective. (Then, in the query above, you can remove the wonky +1 and -1.)
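Purely for illustration, if end_dt were stored as an exclusive bound, the gap query above would drop the +1/-1 adjustments and reduce to something like:
SELECT gap_start, gap_end
FROM
(
    SELECT
        end_dt AS gap_start,  -- first day not covered by this row
        LEAD(start_dt, 1, '9999-12-31') OVER (ORDER BY start_dt) AS gap_end
    FROM yourTable
) gaps
WHERE gaps.gap_end > gaps.gap_start;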
This solves your case, but provide some sample data if there will ever be overlaps, fringe cases, etc.
Take one day after your end date and 1 day before the next line's start date.
DECLARE @ TABLE (ID int, start_dt DATETIME, end_dt DATETIME, prc VARCHAR(100))
INSERT INTO @ (id, start_dt, end_dt, prc)
VALUES
(10410, '2018-07-09 00:00:00.00','2018-07-12 00:00:00.000','1025.000000'),
(10412, '2018-07-17 00:00:00.00','2018-07-20 00:00:00.000','1050.000000'),
(10413, '2018-07-23 00:00:00.00','2018-07-26 00:00:00.000','1040.000000')
SELECT DATEADD(DAY, 1, end_dt)
, DATEADD(DAY, -1, LEAD(start_dt, 1, '9999-12-31') OVER(ORDER BY id) )
FROM @
You may want to take a look at this:
http://sqlfiddle.com/#!18/3a224/1
You just have to edit the begin range to today and the end range to 9999-12-31.

SQL Server - Split year into 4 weekly periods

I would like to split up the year into 13 periods with 4 weeks in each
52 weeks a year / 4 = 13 even periods
I would like each period to start on a Saturday and end on a Friday.
It should look like the image below.
Obviously I could do this manually, but the dates would change each year, and I am looking for a way to automate this with SQL rather than doing it manually for each upcoming year.
Is there a way to produce this yearly split automatically?
In this previous answer I show an approach to create a numbers/date table. Such a table is very handy in many places.
With this approach you might try something like this:
CREATE TABLE dbo.RunningNumbers
(
    Number INT NOT NULL,
    CalendarDate DATE NOT NULL,
    CalendarYear INT NOT NULL,
    CalendarMonth INT NOT NULL,
    CalendarDay INT NOT NULL,
    CalendarWeek INT NOT NULL,
    CalendarYearDay INT NOT NULL,
    CalendarWeekDay INT NOT NULL
);
DECLARE @CountEntries INT = 100000;
DECLARE @StartNumber INT = 0;
WITH E1(N) AS(SELECT 1 FROM(VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))t(N)), --10 ^ 1
E2(N) AS(SELECT 1 FROM E1 a CROSS JOIN E1 b), -- 10 ^ 2 = 100 rows
E4(N) AS(SELECT 1 FROM E2 a CROSS JOIN E2 b), -- 10 ^ 4 = 10,000 rows
E8(N) AS(SELECT 1 FROM E4 a CROSS JOIN E4 b), -- 10 ^ 8 = 10,000,000 rows
CteTally AS
(
SELECT TOP(ISNULL(@CountEntries,1000000)) ROW_NUMBER() OVER(ORDER BY(SELECT NULL)) -1 + ISNULL(@StartNumber,0) As Nmbr
FROM E8
)
INSERT INTO dbo.RunningNumbers
SELECT CteTally.Nmbr,CalendarDate.d,CalendarExt.*
FROM CteTally
CROSS APPLY
(
SELECT DATEADD(DAY,CteTally.Nmbr,{ts'1900-01-01 00:00:00'})
) AS CalendarDate(d)
CROSS APPLY
(
SELECT YEAR(CalendarDate.d) AS CalendarYear
,MONTH(CalendarDate.d) AS CalendarMonth
,DAY(CalendarDate.d) AS CalendarDay
,DATEPART(WEEK,CalendarDate.d) AS CalendarWeek
,DATEPART(DAYOFYEAR,CalendarDate.d) AS CalendarYearDay
,DATEPART(WEEKDAY,CalendarDate.d) AS CalendarWeekDay
) AS CalendarExt;
GO
NTILE - SQL Server 2008+ will create (almost) even chunks.
This is the actual query:
SELECT *,NTILE(13) OVER(ORDER BY CalendarDate) AS Periode
FROM RunningNumbers
WHERE CalendarWeekDay=6
AND CalendarDate>={d'2017-01-01'} AND CalendarDate <= {d'2017-12-31'};
GO
--Careful with existing data!
--DROP TABLE dbo.RunningNumbers;
Hint 1: Place indexes!
Hint 2: Read the link about NTILE, especially the Remark-section.
I think this will fit for this case. You might think about using Prdp's approach with ROW_NUMBER() in connection with integer division. But - big advantage! - NTILE would allow PARTITION BY CalendarYear.
Hint 3: You might add a column to the table where you set the period's number as a fixed value. This will make future queries very easy and would allow manual correction of special cases (the 53rd week, for example).
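A rough sketch of that hint (the PeriodNumber column name is my own choice, and the year range is hard-coded the same way as in the query above):
-- Add a fixed period number column and fill it once per year,
-- so later queries can simply GROUP BY PeriodNumber.
ALTER TABLE dbo.RunningNumbers ADD PeriodNumber INT NULL;
GO
;WITH p AS
(
    SELECT CalendarDate,
           NTILE(13) OVER (ORDER BY CalendarDate) AS Periode
    FROM dbo.RunningNumbers
    WHERE CalendarWeekDay = 6
      AND CalendarDate >= {d'2017-01-01'} AND CalendarDate <= {d'2017-12-31'}
)
UPDATE rn
SET rn.PeriodNumber = p.Periode
FROM dbo.RunningNumbers AS rn
JOIN p ON p.CalendarDate = rn.CalendarDate;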
Here is one way, using a Calendar table:
DECLARE @start DATE = '2017-04-01',
#end_date DATE = '2017-12-31'
SET DATEFIRST 7;
WITH Calendar
AS (SELECT 1 AS id,
@start AS start_date,
Dateadd(dd, 6, @start) AS end_date
UNION ALL
SELECT id + 1,
Dateadd(week, 1, start_date),
Dateadd(week, 1, end_date)
FROM Calendar
WHERE end_date < @end_date)
SELECT id,
( Row_number()OVER(ORDER BY id) - 1 ) / 4 + 1 AS Period,
start_date,
end_date
FROM Calendar
OPTION (maxrecursion 0)
I have generated the dates using a recursive CTE, but it is better to create a physical calendar table and use it in queries like this.
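For example, a minimal permanent calendar table could be created once along these lines (table name, column, and date range are just placeholders):
-- One row per day; extend the range and add columns as needed.
CREATE TABLE dbo.Calendar
(
    CalendarDate DATE NOT NULL PRIMARY KEY
);

;WITH n AS
(
    SELECT TOP (3660)  -- roughly ten years of days
           CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS INT) AS i
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.Calendar (CalendarDate)
SELECT DATEADD(DAY, i, '2017-01-01')
FROM n;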
Firstly, you will never get exactly 52 even weeks in a year; there are overlapping weeks in most calendar standards, and you will occasionally get a week 53.
You can tell SQL to use Saturday as the first day of the week with datefirst, then running a datepart on today's date with getdate() will tell you the week of the year:
SET datefirst 6 -- 6 is Saturday
SELECT datepart(ww,getdate()) as currentWeek
You could then divide this by 4 with a CEILING command to get the 4-week split:
SET datefirst 6
SELECT DATEPART(ww,getdate()) as currentWeek,
CEILING(DATEPART(ww,getdate())/4.0) as four_week_split

trying to find the maximum number of occurrences over time T-SQL

I have data recording the StartDateTime and EndDateTime (both DATETIME2) of a process for all of the year 2013.
My task is to find the maximum number of processes that were running at any specific time throughout the year.
I have written some code to check, every minute/second, how many processes were running at that specific time, but this takes a very long time and would be impossible to run for the whole year.
Here is the code (in this case checking every minute for the date 25/10/2013):
CREATE TABLE dbo.#Hit
(
ID INT IDENTITY (1,1) PRIMARY KEY,
Moment DATETIME2,
COUNT INT
)
DECLARE @moment DATETIME2
SET @moment = '2013-10-24 00:00:00'
WHILE @moment < '2013-10-25'
BEGIN
INSERT INTO #Hit ( Moment, COUNT )
SELECT @moment, COUNT(*)
FROM dbo.tblProcessTimeLog
WHERE ProcessFK IN (25)
AND @moment BETWEEN StartDateTime AND EndDateTime
AND DelInd = 0
PRINT @moment
SET @moment = DATEADD(MINUTE, 1, @moment)
END
SELECT * FROM #Hit
ORDER BY COUNT DESC
Can anyone think how I could get a similar result (I just need the maximum number of processes running at any given time), but for the whole year?
Thanks
DECLARE @d DATETIME = '20130101'; -- the first day of the year you care about
;WITH m(m) AS
( -- all the minutes in a day
SELECT TOP (1440) ROW_NUMBER() OVER (ORDER BY number) - 1
FROM master..spt_values
),
d(d) AS
( -- all the days in *that* year (accounts for leap years vs. hard-coding 365)
SELECT TOP (DATEDIFF(DAY, @d, DATEADD(YEAR, 1, @d))) DATEADD(DAY, number, @d)
FROM master..spt_values WHERE type = N'P' ORDER BY number
),
x AS
( -- all the minutes in *that* year
SELECT moment = DATEADD(MINUTE, m.m, d.d) FROM m CROSS JOIN d
)
SELECT TOP (1) WITH TIES -- in case more than one at the top
x.moment, [COUNT] = COUNT(l.ProcessFK)
FROM x
INNER JOIN dbo.tblProcessTimeLog AS l
ON x.moment >= l.StartDateTime
AND x.moment <= l.EndDateTime
WHERE l.ProcessFK = 25 AND l.DelInd = 0
GROUP BY x.moment
ORDER BY [COUNT] DESC;
See this post for why I don't think you should use BETWEEN for range queries, even in cases where it does semantically do what you want.
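For reference, the open-ended pattern that post recommends looks like this (shown here only to illustrate the shape of the predicate, not as part of the answer above):
-- Inclusive start, exclusive end: no dependence on the time precision at the boundary.
SELECT COUNT(*)
FROM dbo.tblProcessTimeLog
WHERE StartDateTime >= '20131024'
  AND StartDateTime <  '20131025';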
Create a table T whose rows represent some time segments. This table could well be a temporary table (depending on your case). Say:
row 1 - [from=00:00:00, to=00:00:01)
row 2 - [from=00:00:01, to=00:00:02)
row 3 - [from=00:00:02, to=00:00:03)
and so on.
Then just join from your main table (tblProcessTimeLog, I think) to this table, based on the datetime values recorded in tblProcessTimeLog. A year has only about half a million minutes, so it is not that many rows to store in T.
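A sketch of that idea against the question's tblProcessTimeLog (the minute count and the overlap test are my own assumptions, one row per minute of 2013):
;WITH n AS
(
    SELECT TOP (525600)  -- 365 * 24 * 60 minutes in 2013
           CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS INT) AS i
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
),
T AS
(
    SELECT DATEADD(MINUTE, i, '20130101') AS [from],
           DATEADD(MINUTE, i + 1, '20130101') AS [to]
    FROM n
)
SELECT TOP (1) WITH TIES T.[from], COUNT(*) AS running
FROM T
JOIN dbo.tblProcessTimeLog AS l
  ON l.StartDateTime < T.[to]
 AND l.EndDateTime >= T.[from]
WHERE l.ProcessFK = 25 AND l.DelInd = 0
GROUP BY T.[from]
ORDER BY COUNT(*) DESC;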
I recently pulled some code from SO trying to solve the 'islands and gaps' problem, and the algorithm for that should help you solve your problem.
The idea is that you want to find the point in time that has the most started processes, much like figuring out the deepest nesting of parentheses in an expression:
( ( ( ) ( ( ( (deepest here, 6)))))
This SQL will produce that result for you (I included a temp table with sample data):
/*
CREATE TABLE #tblProcessTimeLog
(
StartDateTime DATETIME2,
EndDateTime DATETIME2
)
-- delete from #tblProcessTimeLog
INSERT INTO #tblProcessTimeLog (StartDateTime, EndDateTime)
Values ('1/1/2012', '1/6/2012'),
('1/2/2012', '1/6/2012'),
('1/3/2012', '1/6/2012'),
('1/4/2012', '1/6/2012'),
('1/5/2012', '1/7/2012'),
('1/6/2012', '1/8/2012'),
('1/6/2012', '1/10/2012'),
('1/6/2012', '1/11/2012'),
('1/10/2012', '1/12/2012'),
('1/15/2012', '1/16/2012')
;
*/
with cteProcessGroups (EventDate, GroupId) as
(
select EVENT_DATE, (E.START_ORDINAL - E.OVERALL_ORDINAL) GROUP_ID
FROM
(
select EVENT_DATE, EVENT_TYPE,
MAX(START_ORDINAL) OVER (ORDER BY EVENT_DATE, EVENT_TYPE ROWS UNBOUNDED PRECEDING) as START_ORDINAL,
ROW_NUMBER() OVER (ORDER BY EVENT_DATE, EVENT_TYPE) AS OVERALL_ORDINAL
from
(
Select StartDateTime AS EVENT_DATE, 1 as EVENT_TYPE, ROW_NUMBER() OVER (ORDER BY StartDateTime) as START_ORDINAL
from #tblProcessTimeLog
UNION ALL
select EndDateTime, 0 as EVENT_TYPE, NULL
FROM #tblProcessTimeLog
) RAWDATA
) E
)
select Max(EventDate) as EventDate, count(GroupId) as OpenProcesses
from cteProcessGroups
group by (GroupId)
order by COUNT(GroupId) desc
Results:
EventDate OpenProcesses
2012-01-05 00:00:00.0000000 5
2012-01-06 00:00:00.0000000 4
2012-01-15 00:00:00.0000000 2
2012-01-10 00:00:00.0000000 2
2012-01-08 00:00:00.0000000 1
2012-01-07 00:00:00.0000000 1
2012-01-11 00:00:00.0000000 1
2012-01-06 00:00:00.0000000 1
2012-01-06 00:00:00.0000000 1
2012-01-06 00:00:00.0000000 1
2012-01-16 00:00:00.0000000 1
Note that the 'in-between' rows don't give anything meaningful. Basically, this output is only tuned to tell you when the most activity was. Looking at the other rows in the output, there wasn't just 1 process running on 1/8 (there were actually 3). But the way this code works is that, by grouping the concurrent processes together, you can count the number of simultaneous processes. The date returned is when the maximum number of concurrent processes began. It doesn't tell you how long they were going on for, but you can solve that with an additional query: once you know the date when the most were occurring, you can find the specific process IDs by using a BETWEEN statement on the date.
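That follow-up could be as simple as this (run against the same #tblProcessTimeLog sample; the peak date is taken from the output above):
-- List every process that was running at the peak moment found above.
DECLARE @peak DATETIME2 = '2012-01-05';
SELECT *
FROM #tblProcessTimeLog
WHERE @peak BETWEEN StartDateTime AND EndDateTime;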
Hope this helps.

Merge overlapping date intervals

Is there a better way of merging overlapping date intervals?
The solution I came up with is so simple that now I wonder if someone else has a better idea of how this could be done.
/***** DATA EXAMPLE *****/
DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)
INSERT INTO @T (d1, d2)
SELECT '2010-01-01','2010-03-31' UNION SELECT '2010-04-01','2010-05-31'
UNION SELECT '2010-06-15','2010-06-25' UNION SELECT '2010-06-26','2010-07-10'
UNION SELECT '2010-08-01','2010-08-05' UNION SELECT '2010-08-01','2010-08-09'
UNION SELECT '2010-08-02','2010-08-07' UNION SELECT '2010-08-08','2010-08-08'
UNION SELECT '2010-08-09','2010-08-12' UNION SELECT '2010-07-04','2010-08-16'
UNION SELECT '2010-11-01','2010-12-31' UNION SELECT '2010-03-01','2010-06-13'
/***** INTERVAL ANALYSIS *****/
WHILE (1=1) BEGIN
UPDATE t1 SET t1.d2 = t2.d2
FROM @T AS t1 INNER JOIN @T AS t2 ON
DATEADD(day, 1, t1.d2) BETWEEN t2.d1 AND t2.d2
IF @@ROWCOUNT = 0 BREAK
END
/***** RESULT *****/
SELECT StartDate = MIN(d1) , EndDate = d2
FROM @T
GROUP BY d2
ORDER BY StartDate, EndDate
/***** OUTPUT *****/
/*****
StartDate EndDate
2010-01-01 2010-06-13
2010-06-15 2010-08-16
2010-11-01 2010-12-31
*****/
I was looking for the same solution and came across this post on Combine overlapping datetime to return single overlapping range record.
There is another thread on Packing Date Intervals.
I tested this with various date ranges, including the ones listed here, and it works correctly every time.
SELECT
s1.StartDate,
--t1.EndDate
MIN(t1.EndDate) AS EndDate
FROM @T s1
INNER JOIN @T t1 ON s1.StartDate <= t1.EndDate
AND NOT EXISTS(SELECT * FROM @T t2
WHERE t1.EndDate >= t2.StartDate AND t1.EndDate < t2.EndDate)
WHERE NOT EXISTS(SELECT * FROM @T s2
WHERE s1.StartDate > s2.StartDate AND s1.StartDate <= s2.EndDate)
GROUP BY s1.StartDate
ORDER BY s1.StartDate
The result is:
StartDate | EndDate
2010-01-01 | 2010-06-13
2010-06-15 | 2010-06-25
2010-06-26 | 2010-08-16
2010-11-01 | 2010-12-31
You asked this back in 2010 but don't specify any particular version.
An answer for people on SQL Server 2012+
WITH T1
AS (SELECT *,
MAX(d2) OVER (ORDER BY d1) AS max_d2_so_far
FROM @T),
T2
AS (SELECT *,
CASE
WHEN d1 <= DATEADD(DAY, 1, LAG(max_d2_so_far) OVER (ORDER BY d1))
THEN 0
ELSE 1
END AS range_start
FROM T1),
T3
AS (SELECT *,
SUM(range_start) OVER (ORDER BY d1) AS range_group
FROM T2)
SELECT range_group,
MIN(d1) AS d1,
MAX(d2) AS d2
FROM T3
GROUP BY range_group
Which returns
+-------------+------------+------------+
| range_group | d1 | d2 |
+-------------+------------+------------+
| 1 | 2010-01-01 | 2010-06-13 |
| 2 | 2010-06-15 | 2010-08-16 |
| 3 | 2010-11-01 | 2010-12-31 |
+-------------+------------+------------+
DATEADD(DAY, 1 is used because your desired results show you want a period ending on 2010-06-25 to be collapsed into one starting 2010-06-26. For other use cases this may need adjusting.
Here is a solution with just three simple scans. No CTEs, no recursion, no joins, no table updates in a loop, no "group by" — as a result, this solution should scale the best (I think).
I think number of scans can be reduced to two, if min and max dates are known in advance;
the logic itself just needs two scans — find gaps, applied twice.
declare @datefrom datetime, @datethru datetime
DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)
INSERT INTO @T (d1, d2)
SELECT '2010-01-01','2010-03-31'
UNION SELECT '2010-03-01','2010-06-13'
UNION SELECT '2010-04-01','2010-05-31'
UNION SELECT '2010-06-15','2010-06-25'
UNION SELECT '2010-06-26','2010-07-10'
UNION SELECT '2010-08-01','2010-08-05'
UNION SELECT '2010-08-01','2010-08-09'
UNION SELECT '2010-08-02','2010-08-07'
UNION SELECT '2010-08-08','2010-08-08'
UNION SELECT '2010-08-09','2010-08-12'
UNION SELECT '2010-07-04','2010-08-16'
UNION SELECT '2010-11-01','2010-12-31'
select @datefrom = min(d1) - 1, @datethru = max(d2) + 1 from @T
SELECT
StartDate, EndDate
FROM
(
SELECT
MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
LEAD(StartDate ) OVER (ORDER BY StartDate) - 1 EndDate
FROM
(
SELECT
StartDate, EndDate
FROM
(
SELECT
MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
LEAD(StartDate) OVER (ORDER BY StartDate) - 1 EndDate
FROM
(
SELECT d1 StartDate, d2 EndDate from @T
UNION ALL
SELECT @datefrom StartDate, @datefrom EndDate
UNION ALL
SELECT @datethru StartDate, @datethru EndDate
) T
) T
WHERE StartDate <= EndDate
UNION ALL
SELECT @datefrom StartDate, @datefrom EndDate
UNION ALL
SELECT @datethru StartDate, @datethru EndDate
) T
) T
WHERE StartDate <= EndDate
The result is:
StartDate EndDate
2010-01-01 2010-06-13
2010-06-15 2010-08-16
2010-11-01 2010-12-31
The idea is to simulate the scanning algorithm for merging intervals. My solution makes sure it works across a wide range of SQL implementations. I've tested it on MySQL, Postgres, SQL-Server 2017, SQLite and even Hive.
Assuming the table schema is the following.
CREATE TABLE t (
a DATETIME,
b DATETIME
);
We also assume the interval is half-open like [a,b).
When (a,i,j) is in the table, it shows that there are j intervals covering a, and there are i intervals covering the previous point.
CREATE VIEW r AS
SELECT a,
Sum(d) OVER (ORDER BY a ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS i,
Sum(d) OVER (ORDER BY a ROWS UNBOUNDED PRECEDING) AS j
FROM (SELECT a, Sum(d) AS d
FROM (SELECT a, 1 AS d FROM t
UNION ALL
SELECT b, -1 AS d FROM t) e
GROUP BY a) f;
We produce all the endpoints in the union of the intervals and pair up adjacent ones. Finally, we produce the set of intervals by only picking the odd-numbered rows.
SELECT a, b
FROM (SELECT a,
Lead(a) OVER (ORDER BY a) AS b,
Row_number() OVER (ORDER BY a) AS n
FROM r
WHERE j=0 OR i=0 OR i is null) e
WHERE n%2 = 1;
I've created a sample DB-fiddle and SQL-fiddle. I also wrote a blog post on union intervals in SQL.
A Geometric Approach
Here and elsewhere I've noticed that date packing questions don't provide a geometric approach to this problem. After all, any range, date ranges included, can be interpreted as a line. So why not convert them to a sql geometry type and utilize geometry::UnionAggregate to merge the ranges?
Why?
This has the advantage of handling all types of overlaps, including fully nested ranges. It also works like any other aggregate query, so it's a little more intuitive in that respect. You also get the bonus of a visual representation of your results if you care to use it. Finally, it is the approach I use for simultaneous range packing (you work with rectangles instead of lines in that case, and there are many more considerations). I just couldn't get the existing approaches to work in that scenario.
This has the disadvantage of requiring more recent versions of SQL Server. It also requires a numbers table and it's annoying to extract the individually produced lines from the aggregated shape. But hopefully in the future Microsoft adds a TVF that allows you to do this easily without a numbers table (or you can just build one yourself). Also, geometrical objects work with floats, so you have conversion annoyances and precision concerns to keep in mind.
Performance-wise I don't know how it compares, but I've done a few things (not shown here) to make it work for me even with large datasets.
Code Description
In 'numbers':
I build a table representing a sequence. Swap it out with your favorite way to make a numbers table. For a union operation you won't ever need more rows than in your original table, so I just use the original table as the base to build it.
In 'mergeLines':
I convert the dates to floats and use those floats to create geometrical points. In this problem we're working in 'integer space', meaning there are no time considerations, so a begin date in one range that is one day apart from an end date in another should be merged with that other. To make that merge happen we need to convert to 'real space', so we add 1 to the tail of all ranges (we undo this later). I then connect these points via STUnion and STEnvelope. Finally, I merge all these lines via UnionAggregate. The resulting 'lines' geometry object might contain multiple lines, but if they overlap, they turn into one line.
In the outer query:
I use the numbers CTE to extract the individual lines inside 'lines'. I envelope the lines, which ensures that each line is stored only as its two endpoints. I read the endpoint x values and convert them back to their time representations, ensuring to put them back into 'integer space'.
The Code
with
numbers as (
select row_number() over (order by (select null)) i
from @T
),
mergeLines as (
select lines = geometry::UnionAggregate(line)
from @T
cross apply (select line =
geometry::Point(convert(float, d1), 0, 0).STUnion(
geometry::Point(convert(float, d2) + 1, 0, 0)
).STEnvelope()
) l
)
select ap.StartDate,
ap.EndDate
from mergeLines ml
join numbers n on n.i between 1 and ml.lines.STNumGeometries()
cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
cross apply (select
StartDate = convert(datetime,l.line.STPointN(1).STX),
EndDate = convert(datetime,l.line.STPointN(3).STX) - 1
) ap
order by ap.StartDate;
In this solution, I created a temporary Calendar table which stores a value for every day across a range. This type of table can be made static. In addition, I'm only storing 400 some odd dates starting with 2009-12-31. Obviously, if your dates span a larger range, you would need more values.
In addition, this solution will only work with SQL Server 2005+ in that I'm using a CTE.
With Calendar As
(
Select DateAdd(d, ROW_NUMBER() OVER ( ORDER BY s1.object_id ), '1900-01-01') As [Date]
From sys.columns as s1
Cross Join sys.columns as s2
)
, StopDates As
(
Select C.[Date]
From Calendar As C
Left Join @T As T
On C.[Date] Between T.d1 And T.d2
Where C.[Date] >= ( Select Min(T2.d1) From @T As T2 )
And C.[Date] <= ( Select Max(T2.d2) From @T As T2 )
And T.d1 Is Null
)
, StopDatesInUse As
(
Select D1.[Date]
From StopDates As D1
Left Join StopDates As D2
On D1.[Date] = DateAdd(d,1,D2.Date)
Where D2.[Date] Is Null
)
, DataWithEariestStopDate As
(
Select *
, (Select Min(SD2.[Date])
From StopDatesInUse As SD2
Where T.d2 < SD2.[Date] ) As StopDate
From @T As T
)
Select Min(d1), Max(d2)
From DataWithEariestStopDate
Group By StopDate
Order By Min(d1)
EDIT The problem with using dates in 2009 has nothing to do with the final query. The problem is that the Calendar table is not big enough. I started the Calendar table at 2009-12-31. I have revised it to start at 1900-01-01.
Try this
;WITH T1 AS
(
SELECT d1, d2, ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
FROM @T
), NUMS AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
FROM T1 A
CROSS JOIN T1 B
CROSS JOIN T1 C
), ONERANGE AS
(
SELECT DISTINCT DATEADD(DAY, ROW_NUMBER() OVER(PARTITION BY T1.R ORDER BY (SELECT 0)) - 1, T1.D1) AS ELEMENT
FROM T1
CROSS JOIN NUMS
WHERE NUMS.R <= DATEDIFF(DAY, d1, d2) + 1
), SEQUENCE AS
(
SELECT ELEMENT, DATEDIFF(DAY, '19000101', ELEMENT) - ROW_NUMBER() OVER(ORDER BY ELEMENT) AS rownum
FROM ONERANGE
)
SELECT MIN(ELEMENT) AS StartDate, MAX(ELEMENT) as EndDate
FROM SEQUENCE
GROUP BY rownum
The basic idea is to first unroll the existing data, so you get a separate row for each day. This is done in ONERANGE.
Then, identify the relationship between how the dates increment and how the row numbers do.
The difference remains constant within an existing range/island. As soon as you get to a new island, the difference increases, because the date increments by more than 1 while the row number increments by 1.
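To see that invariant on a few of the unrolled dates from the sample data (purely illustrative):
-- Inside an island the (day offset - row number) difference stays constant;
-- crossing the gap after 2010-06-13 bumps it by one.
SELECT ELEMENT,
       DATEDIFF(DAY, '19000101', ELEMENT) AS day_offset,
       ROW_NUMBER() OVER (ORDER BY ELEMENT) AS rn,
       DATEDIFF(DAY, '19000101', ELEMENT) - ROW_NUMBER() OVER (ORDER BY ELEMENT) AS island_key
FROM (VALUES (CAST('2010-06-12' AS DATE)),
             (CAST('2010-06-13' AS DATE)),
             (CAST('2010-06-15' AS DATE)),
             (CAST('2010-06-16' AS DATE))) v(ELEMENT);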
This solution is similar to the first solution, with an additional deletion condition.
It works on the data in the main table itself instead of using a different table to store the result.
DROP TABLE IF EXISTS #SampleTable;
CREATE TABLE #SampleTable (StartTime DATETIME NULL, EndTime DATETIME NULL);
INSERT INTO #SampleTable(StartTime, EndTime)
VALUES
(N'2010-01-01T00:00:00', N'2010-03-31T00:00:00'),
(N'2010-03-01T00:00:00', N'2010-06-13T00:00:00'),
(N'2010-04-01T00:00:00', N'2010-05-31T00:00:00'),
(N'2010-06-15T00:00:00', N'2010-06-25T00:00:00'),
(N'2010-06-26T00:00:00', N'2010-07-10T00:00:00'),
(N'2010-07-04T00:00:00', N'2010-08-16T00:00:00'),
(N'2010-08-01T00:00:00', N'2010-08-05T00:00:00'),
(N'2010-08-01T00:00:00', N'2010-08-09T00:00:00'),
(N'2010-08-02T00:00:00', N'2010-08-07T00:00:00'),
(N'2010-08-08T00:00:00', N'2010-08-08T00:00:00'),
(N'2010-08-09T00:00:00', N'2010-08-12T00:00:00'),
(N'2010-11-01T00:00:00', N'2010-12-31T00:00:00');
--
DECLARE @RowCount INT=0;
WHILE(1=1) --
BEGIN
SET @RowCount=0;
--
UPDATE T1
SET T1.EndTime=T2.EndTime
FROM dbo.#SampleTable AS T1
INNER JOIN dbo.#SampleTable AS T2 ON DATEADD(DAY, 1, T1.EndTime) BETWEEN T2.StartTime AND T2.EndTime;
--
SET @RowCount=@RowCount+@@ROWCOUNT;
--
DELETE T1
FROM dbo.#SampleTable AS T1
INNER JOIN dbo.#SampleTable AS T2 ON T1.EndTime=T2.EndTime AND T1.StartTime>T2.StartTime;
--
SET @RowCount=@RowCount+@@ROWCOUNT;
--
IF @RowCount=0 --
BREAK;
END;
SELECT * FROM #SampleTable
I was inspired by the Geometric Approach given by pwilcox, but wanted to try a different approach. This is using Trino, but I hope the functions used can also be found in other versions of SQL.
WITH Geo AS (
SELECT
transform( -- 6) See Below~
ST_Geometries( -- 5) Extracts an array of individual lines from the union.
geometry_union( -- 4) Returns the union of aggregated lines, melding all lines together into a single geometric multi-line.
array_agg( -- 3) Aggregation function that joins all lines together.
ST_LineString( -- 2) Makes the pairs of geometric points into lines.
ARRAY[ST_Point(0, to_unixtime(d1)), ST_Point(0, to_unixtime(d2))] -- 1) Takes unix time start and end values and makes them into an array of geometric points.
)
)
)
)
, x -> (ST_YMin(x), ST_Length(x))) AS timestamp_duration -- 6) From the array of lines, the minimum value and length of each line are extracted.
FROM @T -- The minimum value is a timestamp, the length is a duration.
WHERE d1 <> d2 -- I had errors any time this was the case.
)
-- 7) Finally, I unnest the produced array and convert the values back into timestamps.
SELECT from_unixtime(timestamp) AS StartDate
, from_unixtime(timestamp + duration) AS EndDate
FROM Geo
CROSS JOIN UNNEST(timestamp_duration) AS t(timestamp, duration)
For reference, this took my company cluster about 2 minutes to make 400k start/end timestamps into 700 distinct start/end timestamps.
It also runs in just 2 stages.