Gap and Island problem - query not working for all periods - sql

I have to create a query to find the gaps and islands between dates. This seems to be a standard gaps-and-islands problem. To show my issue I will use a sample of data. The queries are executed in Snowflake.
CREATE TABLE TEST (StartDate date, EndDate date);
INSERT INTO TEST
SELECT '8/20/2017', '8/21/2017' UNION ALL
SELECT '8/22/2017', '9/22/2017' UNION ALL
SELECT '8/23/2017', '9/23/2017' UNION ALL
SELECT '8/24/2017', '8/26/2017' UNION ALL
SELECT '8/28/2017', '9/19/2017' UNION ALL
SELECT '9/23/2017', '9/27/2017' UNION ALL
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL
SELECT '10/25/2017','11/3/2017' UNION ALL
SELECT '11/3/2017', '11/15/2017';
This code creates the sample table.
Then I have the code to find gaps and islands:
SELECT
MIN(StartDate) AS IslandStartDate,
MAX(EndDate) AS IslandEndDate
FROM
(
SELECT
*,
CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
StartDate,
EndDate,
LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
TEST
) Groups
) Islands
GROUP BY
IslandId
ORDER BY
IslandStartDate
The results are:
As you can see, the problem is the period 8/28/2017 - 9/19/2017.
This period should not be a separate island, because it falls inside the period 8/23/2017 - 9/23/2017.
Do you have any idea how I can modify my query to get the correct results (so instead of 6 islands I should have 5, as 8/28/2017 - 9/19/2017 should not be an island)? This is just an example of the data, so I am looking for a universal solution, but so far I have not figured out the correct approach.

You can express the gaps-and-islands logic like this:
select min(startdate), max(enddate)
from (select t.*,
sum(case when prev_enddate >= startdate then 0 else 1 end) over (order by startdate) as grp
from (select t.*,
max(enddate) over (order by startdate rows between unbounded preceding and 1 preceding) as prev_enddate
from test t
) t
) t
group by grp
order by min(startdate);
Here is a db<>fiddle.
The idea is to look for the maximum enddate on all the "earlier" rows. This value is used to check if there is an overlap.
So, the innermost subquery calculates the previous enddate. The middle subquery does a cumulative sum of the beginnings of groups to assign a group identifier.
The outer query just aggregates by the group identifier.
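To make the running-maximum idea easier to follow, here is a minimal Python sketch of the same logic, not the Snowflake query itself. The function name `find_islands` is made up for illustration, and the rows are the sample data from the question. Rows are visited in start-date order, and a row opens a new island only when every earlier end date lies before its start.

```python
from datetime import date


def find_islands(periods):
    """Merge overlapping (start, end) periods into islands."""
    islands = []
    for start, end in sorted(periods):
        # islands[-1][1] is the largest end date seen so far, i.e. the
        # value the SQL computes as prev_enddate with max(...) over (...).
        if islands and islands[-1][1] >= start:
            islands[-1][1] = max(islands[-1][1], end)  # overlap: extend
        else:
            islands.append([start, end])  # gap: a new island begins
    return [tuple(island) for island in islands]


# Sample rows from the question.
periods = [
    (date(2017, 8, 20), date(2017, 8, 21)),
    (date(2017, 8, 22), date(2017, 9, 22)),
    (date(2017, 8, 23), date(2017, 9, 23)),
    (date(2017, 8, 24), date(2017, 8, 26)),
    (date(2017, 8, 28), date(2017, 9, 19)),
    (date(2017, 9, 23), date(2017, 9, 27)),
    (date(2017, 9, 25), date(2017, 10, 10)),
    (date(2017, 10, 17), date(2017, 10, 18)),
    (date(2017, 10, 25), date(2017, 11, 3)),
    (date(2017, 11, 3), date(2017, 11, 15)),
]

islands = find_islands(periods)
for island_start, island_end in islands:
    print(island_start, island_end)
```

Comparing against only the immediately preceding row, as LAG does, misses the case where a long earlier period still covers the current row. That is why 8/28/2017 - 9/19/2017 showed up as its own island in the original query.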

You could remove the overlapping records from the original set:
SELECT MinStart as StartDate, MaxEnd as EndDate
FROM Test data
CROSS APPLY (SELECT MIN(StartDate) MinStart, MAX(EndDate) MaxEnd FROM TEST lkp WHERE lkp.StartDate < data.EndDate AND lkp.EndDate > data.StartDate) bounds
GROUP BY MinStart, MaxEnd
StartDate EndDate
2017-08-20 2017-08-21
2017-08-22 2017-09-23
2017-08-23 2017-10-10
2017-10-17 2017-10-18
2017-10-25 2017-11-03
2017-11-03 2017-11-15
In this result set no additional duplicates have appeared, but in a larger record set there is more potential for a much longer chain of contiguous records, meaning you may need to execute this lookup recursively.
Putting that together you get:
SELECT
MIN(StartDate) AS IslandStartDate,
MAX(EndDate) AS IslandEndDate
FROM
(
SELECT
*,
CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
StartDate,
EndDate,
LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
(
SELECT MinStart as StartDate, MaxEnd as EndDate
FROM Test data
CROSS APPLY (SELECT MIN(StartDate) MinStart, MAX(EndDate) MaxEnd FROM TEST lkp WHERE lkp.StartDate < data.EndDate AND lkp.EndDate > data.StartDate) bounds
GROUP BY MinStart, MaxEnd
) Normalized
) Groups
) Islands
GROUP BY
IslandId
ORDER BY
IslandStartDate
This results in 4 islands, not the 5 you were originally expecting: your 2nd and 3rd input lines, together with the 6th and 7th lines, create a single island that spans 8/22 - 10/10.
SELECT '8/22/2017', '9/22/2017' UNION ALL
SELECT '8/23/2017', '9/23/2017' UNION ALL
...
SELECT '9/23/2017', '9/27/2017' UNION ALL
SELECT '9/25/2017', '10/10/2017' UNION ALL
IslandStartDate IslandEndDate
2017-08-20 2017-08-21
2017-08-22 2017-10-10
2017-10-17 2017-10-18
2017-10-25 2017-11-15

Related

Total contracts by month

I am trying to find the total number of contracts by month. The dates are stored in the columns (Start Date) and (End Date), and there are multiple rows of data for each month.
SELECT e.CustomerID, e.AgentID,
COUNT(*) engagementnumber
FROM Engagements e
GROUP BY EndDate
The first time I ran the code with
SELECT COUNT(*) engagementnumber
FROM Engagements
GROUP BY EndDate
I got a count but it wasn't grouped by month.
You can unpivot the dates and use aggregation:
select year(dte), month(dte),
sum(inc) as change_in_month,
sum(sum(inc)) over (order by min(dte)) as active_in_month
from ((select startdate as dte, 1 as inc from Engagements) union all
(select enddate, -1 as inc from Engagements)
) t
group by year(dte), month(dte)
order by year(dte), month(dte);
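As a rough illustration of what the unpivot-and-sum query computes, here is a small Python sketch. The three engagements are invented sample rows, not data from the question. Each start date counts as +1 and each end date as -1, and the running total of the monthly net change gives the number of engagements still open after that month's changes.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (startdate, enddate) rows standing in for Engagements.
engagements = [
    (date(2020, 1, 10), date(2020, 2, 15)),
    (date(2020, 1, 20), date(2020, 3, 5)),
    (date(2020, 2, 1), date(2020, 2, 20)),
]

# "Unpivot": every start contributes +1 and every end -1 to its month.
change_in_month = defaultdict(int)
for start, end in engagements:
    change_in_month[(start.year, start.month)] += 1
    change_in_month[(end.year, end.month)] -= 1

# Cumulative sum over months in calendar order, like sum(sum(inc)) over (...).
monthly = []
active = 0
for year_month in sorted(change_in_month):
    active += change_in_month[year_month]
    monthly.append((year_month, change_in_month[year_month], active))

for row in monthly:
    print(row)
```

Note that a contract which starts and ends in the same month nets out to zero for that month, so adjust the counting if you want it to show up as active there.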
You can try something like
SELECT MONTH(enddate), COUNT(*) OVER (PARTITION BY MONTH(enddate)) engagementnumber FROM engagements

Split multi-month records into individual months

I have data in a table in this format - where date range is multi-month:
SourceSink Class ShadowPrice Round Period StartDate EndDate
AEC Peak 447.038 3 WIN2020 2020-12-01 2021-02-28
I want to create a view/ insert into a new table - the above record broken by month as shown below:
SourceSink Class ShadowPrice Round Period StartDate EndDate
AEC Peak 447.038 3 WIN2020 2020-12-01 2020-12-31
AEC Peak 447.038 3 WIN2020 2021-01-01 2021-01-31
AEC Peak 447.038 3 WIN2020 2021-02-01 2021-02-28
Please advise.
One option is a recursive query. Assuming that periods always start on the first day of a month and end on the last day of a month, as shown in your sample data, that would be:
with cte as (
select t.*, startDate newStartDate, eomonth(startDate) newEndDate
from mytable t
union all
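select
sourceSink,
class,
shadowPrice,
round, -- needed so the column list matches t.* in the anchor member
period,
startDate,
endDate,
dateadd(month, 1, newStartDate),
eomonth(dateadd(month, 1, newStartDate))
from cte
-- recurse only while the next month still starts inside the period
where dateadd(month, 1, newStartDate) <= endDate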
)
select * from cte
If periods start and end on varying days of the month, then we need a little more logic:
with cte as (
select
t.*,
startDate newStartDate,
case when eomonth(startDate) <= endDate then eomonth(startDate) else endDate end newEndDate
from mytable t
union all
select
sourceSink,
class,
shadowPrice,
round, -- needed so the column list matches t.* in the anchor member
period,
startDate,
endDate,
dateadd(month, 1, datefromparts(year(newStartDate), month(newStartDate), 1)),
case when eomonth(dateadd(month, 1, datefromparts(year(newStartDate), month(newStartDate), 1))) <= endDate
then eomonth(dateadd(month, 1, datefromparts(year(newStartDate), month(newStartDate), 1)))
else endDate
end
from cte
-- recurse only while the first day of the next month is still inside the period
where dateadd(month, 1, datefromparts(year(newStartDate), month(newStartDate), 1)) <= endDate
)
select * from cte
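If it helps to see the month-splitting logic outside of SQL, here is a minimal Python sketch of the same idea. The function name `split_by_month` is made up for illustration. Each piece ends at the end of its month or at the period's end date, whichever comes first, and the next piece starts on the first day of the following month.

```python
import calendar
from datetime import date, timedelta


def split_by_month(start, end):
    """Split the inclusive range [start, end] into one piece per calendar month."""
    pieces = []
    current = start
    while current <= end:
        last_day = calendar.monthrange(current.year, current.month)[1]
        month_end = date(current.year, current.month, last_day)
        # Clamp to the period's end, like the CASE expression in the second query.
        pieces.append((current, min(month_end, end)))
        current = month_end + timedelta(days=1)  # first day of the next month
    return pieces


# The sample record from the question: 2020-12-01 to 2021-02-28.
pieces = split_by_month(date(2020, 12, 1), date(2021, 2, 28))
for piece_start, piece_end in pieces:
    print(piece_start, piece_end)
```

The loop stops as soon as the next month would start after the end date. That is the same stopping condition the recursive member of the CTE needs in order to avoid producing an extra month.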
Just another option using a CROSS APPLY and an ad-hoc tally table
Example
Select A.[SourceSink]
,A.[Class]
,A.[ShadowPrice]
,A.[Round]
,A.[Period]
,B.[StartDate]
,B.[EndDate]
From YourTable A
Cross Apply (
Select StartDate=min(D)
,EndDate =max(D)
From (
Select Top (DateDiff(DAY,[StartDate],[EndDate])+1)
D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),[StartDate])
From master..spt_values n1,master..spt_values n2
) B1
Group By Year(D),Month(D)
) B

SQL Query - Combine rows based on multiple columns

On the image above, I'd like to combine rows with the same value on consecutive days.
Combined rows will have the earliest date on From column and the latest date on To column.
Looking at the example, even if Rows 3 and 4 have the same value, they were not combined because of the date gap.
I've tried using LAG and LEAD functions but no luck.
You can try the approach below:
with c as
(
select *, datediff(dd,todate,laedval) as leaddiff,
datediff(dd,todate,lagval) as lagdiff
from
(
select *,lead(todate) over(partition by value order by todate) laedval,
lag(todate) over(partition by value order by todate) lagval
from t1
)A
)
select * from
(
select value,min(todate) as fromdate,max(todate) as todate from c
where coalesce(leaddiff,0)+coalesce(lagdiff,0) in (1,-1)
group by value
union all
select value,fromdate,todate from c
where coalesce(leaddiff,0)+coalesce(lagdiff,0)>1 or coalesce(leaddiff,0)+coalesce(lagdiff,0)<-1
)A order by value
OUTPUT:
value fromdate todate
1 16/07/2019 00:00:00 17/07/2019 00:00:00
3 21/07/2019 00:00:00 26/07/2019 00:00:00
2 18/07/2019 00:00:00 18/07/2019 00:00:00
2 20/07/2019 00:00:00 20/07/2019 00:00:00
I am going to recommend the following approach:
Find where each new group begins. You can do this by comparing the previous maximum todate with the fromdate in this row.
Do a cumulative sum of the starts to define a group.
Aggregate the results.
This can be handled using window functions and aggregation:
select value, min(fromdate) as fromdate, max(todate) as todate
from (select t.*,
sum(case when prev_todate >= dateadd(day, -1, fromdate)
then 0 -- overlap, so this does not begin a new group
else 1 -- no overlap, so this does begin a new group
end) over
(partition by value order by fromdate) as grp
from (select t.*,
max(todate) over (partition by value
order by fromdate
rows between unbounded preceding and 1 preceding
) as prev_todate
from t
) t
) t
group by value, grp
order by value, min(fromdate);
Here is a db<>fiddle.

SQL - Group rows by contiguous date

I have a table:
Value Date
100 01/01/2000
110 01/05/2002
100 01/10/2003
100 01/12/2004
I want to group the data in this way
Value StartDate EndDate
100 01/01/2000 30/04/2002
110 01/05/2002 30/09/2003
100 01/10/2003 NULL --> or value like '01/01/2099'
How can I accomplish this?
Can a CTE be useful and how?
For an RDBMS that supports window functions (example on an MS SQL database):
with Test(value, dt) as(
select 100, cast('2000-01-01' as date) union all
select 110, cast('2002-05-01' as date) union all
select 100, cast('2003-10-01' as date) union all
select 100, cast('2004-12-01' as date)
)
select max(value) value, min(dt) startDate, max(end_dt) endDate
from (
select a.*, sum(brk) over(order by dt) grp
from (
select t.*,
case when value!=lag(value) over(order by dt) then 1 else 0 end brk,
DATEADD(DAY,-1,lead(dt,1,cast('2099-01-02' as date)) over(order by dt)) end_dt
from Test t
) a
) b
group by grp
order by startDate
I think the difference of row numbers is simpler in this case:
select value, min(date) as startDate,
dateadd(day, -1, lead(min(date)) over (order by min(date))) as endDate
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by value order by date) as seqnum_v
from t
) t
group by value, (seqnum - seqnum_v);
The difference of the row numbers defines the groups you want. This is a bit hard to see at first . . . if you stare at the results of the subquery, you'll see how it works.
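For anyone who would rather trace it than stare at it, here is a small Python sketch of the row-number difference on the four sample rows from the question. It is only an illustration of the idea, not a translation of any particular database's behavior, and the variable names are mine.

```python
from datetime import date, timedelta

# Sample rows from the question: (value, date).
rows = [
    (100, date(2000, 1, 1)),
    (110, date(2002, 5, 1)),
    (100, date(2003, 10, 1)),
    (100, date(2004, 12, 1)),
]
rows.sort(key=lambda row: row[1])

seen_per_value = {}  # running count per value, i.e. seqnum_v
group_start = {}     # (value, seqnum - seqnum_v) -> earliest date in the group
for seqnum, (value, day) in enumerate(rows, start=1):
    seen_per_value[value] = seen_per_value.get(value, 0) + 1
    key = (value, seqnum - seen_per_value[value])
    group_start.setdefault(key, day)  # rows arrive in date order, so first is min

# Each group ends the day before the next group starts (None for the last one).
starts = sorted((day, value) for (value, _), day in group_start.items())
result = []
for i, (start, value) in enumerate(starts):
    next_start = starts[i + 1][0] if i + 1 < len(starts) else None
    end = next_start - timedelta(days=1) if next_start else None
    result.append((value, start, end))

for row in result:
    print(row)
```

The differences come out as 0, 1, 1, 1 for the four rows, so the last two rows with value 100 share a key while the first row with value 100 does not.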

SQL calculate date segments within calendar year

What I need is to calculate the missing time periods within the calendar year given a table such as this in SQL:
DatesTable
|ID|DateStart |DateEnd |
1 NULL NULL
2 2015-1-1 2015-12-31
3 2015-3-1 2015-12-31
4 2015-1-1 2015-9-30
5 2015-1-1 2015-3-31
5 2015-6-1 2015-12-31
6 2015-3-1 2015-6-30
6 2015-7-1 2015-10-31
Expected return would be:
1 2015-1-1 2015-12-31
3 2015-1-1 2015-2-28
4 2015-10-1 2015-12-31
5 2015-4-1 2015-5-31
6 2015-1-1 2015-2-28
6 2015-11-1 2015-12-31
It's essentially work blocks. What I need to show is the part of the calendar year which was NOT worked. So for ID = 3, he worked from 3/1 through the rest of the year. But he did not work from 1/1 till 2/28. That's what I'm looking for.
You can do it using LEAD, LAG window functions available from SQL Server 2012+:
;WITH CTE AS (
SELECT ID,
LAG(DateEnd) OVER (PARTITION BY ID ORDER BY DateEnd) AS PrevEnd,
DateStart,
DateEnd,
LEAD(DateStart) OVER (PARTITION BY ID ORDER BY DateEnd) AS NextStart
FROM DatesTable
)
SELECT ID, DateStart, DateEnd
FROM (
-- Get interval right before current [DateStart, DateEnd] interval
SELECT ID,
CASE
WHEN DateStart IS NULL THEN '20150101'
WHEN DateStart > start THEN start
ELSE NULL
END AS DateStart,
CASE
WHEN DateStart IS NULL THEN '20151231'
WHEN DateStart > start THEN DATEADD(d, -1, DateStart)
ELSE NULL
END AS DateEnd
FROM CTE
CROSS APPLY (SELECT COALESCE(DATEADD(d, 1, PrevEnd), '20150101')) x(start)
-- If there is no next interval then get interval right after current
-- [DateStart, DateEnd] interval (up-to end of year)
UNION ALL
SELECT ID, DATEADD(d, 1, DateEnd) AS DateStart, '20151231' AS DateEnd
FROM CTE
WHERE DateStart IS NOT NULL -- Do not re-examine [Null, Null] interval
AND NextStart IS NULL -- There is no next [DateStart, DateEnd] interval
AND DateEnd < '20151231' -- Current [DateStart, DateEnd] interval
-- does not terminate on 31/12/2015
) AS t
WHERE t.DateStart IS NOT NULL
ORDER BY ID, DateStart
The idea behind the above query is simple: for every [DateStart, DateEnd] interval get 'not worked' interval right before it. If there is no interval following the current interval, then also get successive 'not worked' interval (if any).
Also note that I assume that if DateStart is NULL then DateEnd is also NULL for the same ID.
Demo here
If your data is not too big, this approach will work. It expands all the days and ids and then re-groups them:
with d as (
select cast('2015-01-01' as date) as d -- anchor: first day of the year
union all
select dateadd(day, 1, d)
from d
where d < cast('2015-12-31' as date)
),
td as (
select *
from d cross join
(select distinct id from t) t
where not exists (select 1
from t t2
where t2.id = t.id and -- only look at this id's work blocks
d.d between t2.startdate and t2.enddate
)
)
select id, min(d) as startdate, max(d) as enddate
from (select td.*,
dateadd(day, - row_number() over (partition by id order by d), d) as grp
from td
) td
group by id, grp
order by id, grp
option (maxrecursion 0); -- a full year needs more than the default 100 recursion levels
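The expand-and-regroup idea can also be sketched in a few lines of Python, which may make the grouping trick easier to check. This is only an illustration on the sample data from the question, and the function name `missing_periods` is made up. Subtracting the row number from each remaining day gives the same key for every day in a run of consecutive days.

```python
from datetime import date, timedelta

# Sample rows from the question: (id, DateStart, DateEnd).
work_blocks = [
    (1, None, None),
    (2, date(2015, 1, 1), date(2015, 12, 31)),
    (3, date(2015, 3, 1), date(2015, 12, 31)),
    (4, date(2015, 1, 1), date(2015, 9, 30)),
    (5, date(2015, 1, 1), date(2015, 3, 31)),
    (5, date(2015, 6, 1), date(2015, 12, 31)),
    (6, date(2015, 3, 1), date(2015, 6, 30)),
    (6, date(2015, 7, 1), date(2015, 10, 31)),
]


def missing_periods(blocks, year_start, year_end):
    """Return (id, start, end) for every stretch of the year not covered by a block."""
    days = [year_start + timedelta(days=n)
            for n in range((year_end - year_start).days + 1)]
    result = []
    for block_id in sorted({b[0] for b in blocks}):
        spans = [(s, e) for i, s, e in blocks
                 if i == block_id and s is not None and e is not None]
        # Keep only the days that no work block of this id covers.
        free = [d for d in days if not any(s <= d <= e for s, e in spans)]
        groups = {}
        for row_number, d in enumerate(free, start=1):
            key = d - timedelta(days=row_number)  # constant within a consecutive run
            lo, hi = groups.get(key, (d, d))
            groups[key] = (min(lo, d), max(hi, d))
        result.extend((block_id, lo, hi) for lo, hi in sorted(groups.values()))
    return result


gaps = missing_periods(work_blocks, date(2015, 1, 1), date(2015, 12, 31))
for gap in gaps:
    print(gap)
```

Like the SQL version, this walks every day of the year for every id, so it is only sensible when the data is not too big.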
An alternative method relies on cumulative sums and similar functionality that is much easier to express in SQL Server 2012+.
Somewhat simpler approach I think.
Basically create a list of dates for all work block ranges (A). Then create a list of dates for the whole year for each ID (B). Then remove the A from B. Compile the remaining list of dates into date ranges for each ID.
DECLARE @startdate DATETIME, @enddate DATETIME
SET @startdate = '2015-01-01'
SET @enddate = '2015-12-31'
--Build date ranges from remaining date list
;WITH dateRange(ID, dates, Grouping)
AS
(
SELECT dt1.id, dt1.Dates, dt1.Dates + row_number() over (order by dt1.id asc, dt1.Dates desc) AS Grouping
FROM
(
--Remove (A) from (B)
SELECT distinct dt.ID, tmp.Dates FROM DatesTable dt
CROSS APPLY
(
--GET (B) here
SELECT DATEADD(DAY, number, @startdate) [Dates]
FROM master..spt_values
WHERE type = 'P' AND DATEADD(DAY, number, @startdate) <= @enddate
) tmp
left join
(
--GET (A) here
SELECT DISTINCT T.Id,
D.Dates
FROM DatesTable AS T
INNER JOIN master..spt_values as N on N.number between 0 and datediff(day, T.DateStart, T.DateEnd)
CROSS APPLY (select dateadd(day, N.number, T.DateStart)) as D(Dates)
WHERE N.type ='P'
) dr
ON dr.Id = dt.Id and dr.Dates = tmp.Dates
WHERE dr.id is null
) dt1
)
SELECT ID, CAST(MIN(Dates) AS DATE) DateStart, CAST(MAX(Dates) AS DATE) DateEnd
FROM dateRange
GROUP BY ID, Grouping
ORDER BY ID
Here's the code:
http://sqlfiddle.com/#!3/f3615/1
I hope this helps!