Split date into multiple date ranges in Hive or SQL Server - sql

I would like to split the dates below into separate date ranges. Here the employee belongs to the Chennai location from 2019-07-25 to 2099-02-14, but in between worked from DEL between 2020-02-15 and 2020-02-23.
So I would like to convert the dates above into the date ranges below.

I have tried below in SQL Server and it works:
Test Setup
CREATE TABLE Employee(EmployeeId INT, fromDate date, todate date, Placename VARCHAR(100), PlaceCode VARCHAR(100))
INSERT INTO Employee
VALUES(1111,'2019-07-25','2099-02-14','CHENNAI','MAA'),
(1111,'2020-02-15','2020-02-23','DELHI','DEL');
Query to Execute
;WITH CTE_Ranges AS
(
SELECT EmployeeId, fromdate, lag(fromdate,1) OVER(PARTITION BY EmployeeId ORDER BY fromdate) previousfromDate,todate
, lead(todate,1) OVER(PARTITION BY EmployeeId ORDER BY todate) nexttodate, placename, placecode from Employee
)
--Handle the Maximum and minimum dates
SELECT * FROM
(
SELECT EmployeeId, fromdate, DATEADD(day,-1,lead(fromdate) over(partition by EmployeeId ORDER BY fromDate)) as todate, PlaceName, PlaceCode
FROM CTE_Ranges
UNION ALL
SELECT EmployeeId, DATEADD(day,1,lag(todate) over(partition by EmployeeId ORDER BY todate)) as fromdate, todate,PlaceName, PlaceCode
FROM CTE_Ranges) AS t
WHERE fromdate is not null and todate is not null
UNION ALL
--Now handle normal date ranges
SELECT EmployeeId, fromDate,todate, placename, placecode
from cte_ranges
WHERE previousfromdate is not null and nexttodate is not null
order by fromdate
Resultset
+------------+------------+------------+-----------+-----------+
| EmployeeId | fromdate | todate | PlaceName | PlaceCode |
+------------+------------+------------+-----------+-----------+
| 1111 | 2019-07-25 | 2020-02-14 | CHENNAI | MAA |
| 1111 | 2020-02-15 | 2020-02-23 | DELHI | DEL |
| 1111 | 2020-02-24 | 2099-02-14 | CHENNAI | MAA |
+------------+------------+------------+-----------+-----------+

To handle multiple levels of nesting, you can unpivot the data and
with e as (
-- unpivot the dates
select employeeid, fromdate as dte,
placename, placecode
from t
union all
select employeeid, enddate, null, null
from t
),
e2 as (
-- impute the intermediate placenames
select e.*,
max(placename) over (partition by employeeid, grp) as imputed_placename
from (select e.*,
count(placename) over (partition by employeeid order by dte) as grp
from e
) e
)
select employeeid, dte as fromdate,
dateadd(day, -1, lead(dte) over (partition by employeeid order by dte)) as enddate,
placename, placecode
from e2;
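The idea of cutting the timeline at every boundary date can be checked outside the database. Below is a minimal Python sketch, not a translation of the query above: the function name `split_ranges` and the rule "the range that started most recently wins" are assumptions made for illustration, and the rows are the sample data from the question.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def split_ranges(rows):
    """rows: (from_date, to_date, place_code) for one employee, dates inclusive."""
    # Every start date, and the day after every end date, opens a new segment.
    cuts = sorted({r[0] for r in rows} | {r[1] + ONE_DAY for r in rows})
    out = []
    for start, nxt in zip(cuts, cuts[1:]):
        end = nxt - ONE_DAY
        covering = [r for r in rows if r[0] <= start and r[1] >= end]
        if not covering:
            continue  # nobody covers this segment: leave a gap
        # Assumption: the innermost (latest-starting) range decides the place.
        place = max(covering, key=lambda r: r[0])[2]
        if out and out[-1][2] == place and out[-1][1] + ONE_DAY == start:
            out[-1] = (out[-1][0], end, place)  # extend the previous segment
        else:
            out.append((start, end, place))
    return out

sample = [
    (date(2019, 7, 25), date(2099, 2, 14), "MAA"),
    (date(2020, 2, 15), date(2020, 2, 23), "DEL"),
]
for segment in split_ranges(sample):
    print(segment)
```

Because each segment looks at every covering range, the same sketch should cope with a range nested inside another nested range, which is the case the two-row sample does not exercise.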

Related

Oracle Pivot Help based on Data

I am trying to use the Oracle PIVOT function to display the data in the format below. I have tried examples I found on Stack Overflow, but I am unable to achieve what I am looking for.
With t as
(
select 1335 as emp_id, 'ADD Insurance New' as suuid, sysdate- 10 as startdate, null as enddate from dual
union all
select 1335 as emp_id, 'HS' as suuid, sysdate- 30 as startdate, null as enddate from dual
union all
select 1335 as emp_id, 'ADD Ins' as suuid, sysdate- 30 as startdate, Sysdate - 10 as enddate from dual
)
select * from t
output:
+--------+-------------------+-------------------+---------+-------------------+
| EMP_ID | SUUID_1 | SUUID_1_STARTDATE | SUUID_2 | SUUID_2_STARTDATE |
+--------+-------------------+-------------------+---------+-------------------+
| 1335 | ADD Insurance New | 10/5/2020 15:52 | HS | 9/15/2020 15:52 |
+--------+-------------------+-------------------+---------+-------------------+
Can anyone suggest how to use SQL PIVOT to get this format?
You can use conditional aggregation. There is more than one way to understand your question, but one approach that would work for your sample data is:
select emp_id,
max(case when rn = 1 then suuid end) suuid_1,
max(case when rn = 1 then startdate end) suid_1_startdate,
max(case when rn = 2 then suuid end) suuid_2,
max(case when rn = 2 then startdate end) suid_2_startdate
from (
select t.*, row_number() over(partition by emp_id order by startdate desc) rn
from t
where enddate is null
) t
group by emp_id
Demo on DB Fiddle:
EMP_ID | SUUID_1 | SUID_1_STARTDATE | SUUID_2 | SUID_2_STARTDATE
-----: | :---------------- | :--------------- | :------ | :---------------
1335 | ADD Insurance New | 05-OCT-20 | HS | 15-SEP-20
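For readers who want to see what the row_number plus conditional aggregation combination does step by step, here is a rough Python sketch of the same idea. The function name `pivot_open_rows` and the fixed dates (chosen to match the sample output rather than `sysdate`) are assumptions for illustration only.

```python
from datetime import date
from itertools import groupby

def pivot_open_rows(rows, slots=2):
    """Rank each employee's open rows by startdate (newest first) and spread them into columns."""
    open_rows = [r for r in rows if r["enddate"] is None]    # WHERE enddate IS NULL
    open_rows.sort(key=lambda r: r["startdate"], reverse=True)
    open_rows.sort(key=lambda r: r["emp_id"])                # stable sort keeps the date order per employee
    result = []
    for emp_id, group in groupby(open_rows, key=lambda r: r["emp_id"]):
        wide = {"emp_id": emp_id}
        for rn, r in enumerate(group, start=1):               # rn plays the role of row_number()
            if rn > slots:
                break
            wide[f"suuid_{rn}"] = r["suuid"]
            wide[f"suuid_{rn}_startdate"] = r["startdate"]
        result.append(wide)
    return result

sample = [
    {"emp_id": 1335, "suuid": "ADD Insurance New", "startdate": date(2020, 10, 5), "enddate": None},
    {"emp_id": 1335, "suuid": "HS", "startdate": date(2020, 9, 15), "enddate": None},
    {"emp_id": 1335, "suuid": "ADD Ins", "startdate": date(2020, 9, 15), "enddate": date(2020, 10, 5)},
]
print(pivot_open_rows(sample))
```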
You can do it with PIVOT:
With t ( emp_id, suuid, startdate, enddate ) as
(
select 1335, 'ADD Insurance New', sysdate- 10, null from dual union all
select 1335, 'HS', sysdate- 30, null from dual union all
select 1335, 'ADD Ins', sysdate- 30, Sysdate - 10 from dual
)
SELECT emp_id,
"1_SUUID" AS suuid1,
"1_STARTDATE" AS suuid_startdate1,
"2_SUUID" AS suuid2,
"2_STARTDATE" AS suuid_startdate2
FROM (
SELECT t.*,
ROW_NUMBER() OVER ( ORDER BY startdate DESC, enddate DESC NULLS FIRST )
AS rn
FROM t
)
PIVOT (
MAX( suuid ) AS suuid,
MAX( startdate ) AS startdate,
MAX( enddate ) AS enddate
FOR rn IN ( 1, 2 )
)
Outputs:
EMP_ID | SUUID1 | SUUID_STARTDATE1 | SUUID2 | SUUID_STARTDATE2
-----: | :---------------- | :--------------- | :----- | :---------------
1335 | ADD Insurance New | 05-OCT-20 | HS | 15-SEP-20

Grouping SSRS report on multiple fields

I have a report, which we will call ReportOne, whose data comes from a stored procedure. The stored procedure query returns two values, 'TravelDate' and 'Status'.
My report has four fields, 'BeginDate', 'EndDate', 'Status', and 'Days'.
My issue is this: I need to group the report by both 'Status' and consecutive days, where the consecutive days come from TravelDate.
'BeginDate' will be the first new date
'EndDate' will be the last consecutive date.
'Status' will be status.
'Days' will be the number of consecutive days.
Example,
TravelDate | Status
1/1/2001 | Leave
1/2/2001 | Leave
1/3/2001 | Leave
1/5/2001 | Leave
1/6/2001 | Travel
The report will then look as follows.
BeginDate | EndDate | Status | Days
1/1/2001 | 1/3/2001 | Leave | 3
1/5/2001 | 1/5/2001 | Leave | 1
1/6/2001 | 1/6/2001 | Travel | 1
Example
Declare @YourTable Table ([TravelDate] date,[Status] varchar(50))
Insert Into @YourTable Values
('1/1/2001','Leave')
,('1/2/2001','Leave')
,('1/3/2001','Leave')
,('1/5/2001','Leave')
,('1/6/2001','Travel')
Select BeginDate=min(TravelDate)
,EndDate =max(TravelDate)
,Status =max(Status)
,Days =datediff(DAY,min(TravelDate),max(TravelDate))+1
From (
Select *
-- day number minus row number is constant within a run of consecutive dates
,Grp = DateDiff(DAY,'1900-01-01',TravelDate) - row_number() over (partition by status order by TravelDate)
From @YourTable
) A
-- Grp values are only unique within a status, so group by both
Group By Status, Grp
Order By BeginDate
Returns
BeginDate EndDate Status Days
2001-01-01 2001-01-03 Leave 3
2001-01-05 2001-01-05 Leave 1
2001-01-06 2001-01-06 Travel 1
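Why does subtracting a row number from the date identify a run? Within one status, consecutive dates go up by one day per row and so does the row number, so the difference stays constant until a day is skipped. The Python sketch below replays that reasoning on the sample rows; the function name `consecutive_runs` is hypothetical, and keying the groups by status as well as by the difference is a choice made here so that equal differences from different statuses cannot merge.

```python
from collections import defaultdict
from datetime import date

def consecutive_runs(rows):
    """rows: iterable of (travel_date, status) -> sorted list of (begin, end, status, days)."""
    by_status = defaultdict(list)
    for travel_date, status in rows:
        by_status[status].append(travel_date)

    runs = defaultdict(list)
    for status, dates in by_status.items():
        for rn, d in enumerate(sorted(dates), start=1):
            # Day number minus row number is constant inside a consecutive run.
            grp = d.toordinal() - rn
            runs[(status, grp)].append(d)

    return sorted(
        (min(ds), max(ds), status, (max(ds) - min(ds)).days + 1)
        for (status, _), ds in runs.items()
    )

sample = [
    (date(2001, 1, 1), "Leave"),
    (date(2001, 1, 2), "Leave"),
    (date(2001, 1, 3), "Leave"),
    (date(2001, 1, 5), "Leave"),
    (date(2001, 1, 6), "Travel"),
]
for run in consecutive_runs(sample):
    print(run)
```

Like the SQL, this assumes at most one row per date within a status; duplicate dates would shift the row numbers and break the runs apart.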
EDIT - Capture from Stored Procedure (the @YourTable structure must match the stored procedure's result set)
Declare @YourTable Table ([TravelDate] date,[Status] varchar(50))
Insert Into @YourTable
Exec yourStoredProcedure
Select BeginDate=min(TravelDate)
,EndDate =max(TravelDate)
,Status =max(Status)
,Days =datediff(DAY,min(TravelDate),max(TravelDate))+1
From (
Select *
,Grp = DateDiff(DAY,'1900-01-01',TravelDate) - row_number() over (partition by status order by TravelDate)
From @YourTable
) A
-- Grp values are only unique within a status, so group by both
Group By Status, Grp
Order By BeginDate
EDIT - Nested Subquery
Select BeginDate=min(TravelDate)
,EndDate =max(TravelDate)
,Status =max(Status)
,Days =datediff(DAY,min(TravelDate),max(TravelDate))+1
From (
Select *
,Grp = DateDiff(DAY,'1900-01-01',TravelDate) - row_number() over (partition by status order by TravelDate)
From (
-- Your Query Here ---
) A
) A
Group By Status, Grp
Order By BeginDate
EDIT - Consumed from a TVF
Select BeginDate=min(TravelDate)
,EndDate =max(TravelDate)
,Status =max(Status)
,Days =datediff(DAY,min(TravelDate),max(TravelDate))+1
From (
Select *
,Grp = DateDiff(DAY,'1900-01-01',TravelDate) - row_number() over (partition by status order by TravelDate)
From [dbo].[YourTableValuedFunction](Param1,Param2) src
) A
Group By Status, Grp
Order By BeginDate

SQL Server Select the most recent past date if no future date available

I have a table structure as below,
CREATE TABLE #CustOrder ( CustId INT, OrderDate DATE )
INSERT #CustOrder ( CustId, OrderDate )
VALUES ( 1, '2016-11-01' ),
( 1, '2019-09-01' ),
( 2, '2019-07-01' ),
( 2, '2019-11-01' ),
( 3, '2017-01-01' ),
( 4, '2016-12-01' ),
( 4, '2017-01-01' )
I want to list the customer with their future order dates, if they do not have a future order I want to list their last or most recent order. I have the following query.
; WITH LastOrder AS
(
SELECT
CO.CustId,
CO.OrderDate,
ROW_NUMBER() OVER(PARTITION BY CO.CustId ORDER BY ABS(DATEDIFF(DAY, CO.OrderDate, GETUTCDATE()))) AS RowNum
FROM #CustOrder AS CO
)
SELECT LO.CustId, LO.OrderDate
FROM LastOrder AS LO
WHERE LO.RowNum = 1
This query gives me the result as,
CustId | OrderDate
--------+-------------
1 | 2016-11-01
2 | 2019-07-01
3 | 2017-01-01
4 | 2017-01-01
However, I need the result as,
CustId | OrderDate
--------+-------------
1 | 2019-09-01
2 | 2019-07-01
3 | 2017-01-01
4 | 2017-01-01
As
Customer 1 has a future order on 2019-09-01
Customer 2 has two future orders, but the first one is on 2019-07-01
Customer 3 has no more than 1 order, it should just return 2017-01-01
Customer 4 has two past orders but the most recent is 2017-01-01
rextester: http://rextester.com/PBKNA95127
CREATE TABLE #CustOrder ( CustId INT, OrderDate DATE )
INSERT #CustOrder ( CustId, OrderDate )
VALUES ( 1, '2016-11-01' ),
( 1, '2019-09-01' ),
( 2, '2019-07-01' ),
( 2, '2019-11-01' ),
( 3, '2017-01-01' ),
( 4, '2016-12-01' ),
( 4, '2017-01-01' )
; WITH LastOrder AS
(
SELECT
CO.CustId,
CO.OrderDate,
ROW_NUMBER() OVER(PARTITION BY CO.CustId
ORDER BY case when co.OrderDate > getdate() then 0 else 1 end
, abs(DATEDIFF(DAY, getdate(),CO.OrderDate)) asc
) AS RowNum
FROM #CustOrder AS CO
)
SELECT LO.CustId, LO.OrderDate
FROM LastOrder AS LO
WHERE LO.RowNum = 1
results:
+--------+------------+
| CustId | OrderDate |
+--------+------------+
| 1 | 2019-09-01 |
| 2 | 2019-07-01 |
| 3 | 2017-01-01 |
| 4 | 2017-01-01 |
+--------+------------+
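The two-part ORDER BY can be read as a sort key: first "is this date in the past?", then "how far is it from today?". A small Python sketch of that key follows, assuming a hypothetical helper `next_or_latest` and a fixed "today" of 2017-02-17 (the date used elsewhere in this thread) so that the result is reproducible.

```python
from datetime import date

def next_or_latest(order_dates, today):
    """Pick the nearest future date if one exists, otherwise the most recent past date."""
    return min(
        order_dates,
        # 0 for future dates, 1 for everything else, then distance from today.
        key=lambda d: (0 if d > today else 1, abs((d - today).days)),
    )

orders = {
    1: [date(2016, 11, 1), date(2019, 9, 1)],
    2: [date(2019, 7, 1), date(2019, 11, 1)],
    3: [date(2017, 1, 1)],
    4: [date(2016, 12, 1), date(2017, 1, 1)],
}
today = date(2017, 2, 17)  # assumed reference date
for cust_id, dates in orders.items():
    print(cust_id, next_or_latest(dates, today))
```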
You can use the MAX function to check if the latest date is in the future. If so, get the MIN date after today using MIN. Else get the latest date.
SELECT CUSTID,OrderDate
FROM (SELECT CustId,
OrderDate,
CASE WHEN MAX(orderdate) OVER(PARTITION BY CustId) > GETUTCDATE()
THEN MIN(case when orderdate >getutcdate() then orderdate end) OVER(PARTITION BY CustId)
ELSE MAX(orderdate) OVER(PARTITION BY CustId) end as latest_date
FROM #CustOrder) T
WHERE latest_date=orderDate
Min, Max, UNION approach
select custID, MIN(OrderDate)
from #CustOrder
where OrderDate > '2017-02-17'
group by custID
union all
select co1.custID, max(co1.OrderDate)
from #CustOrder co1
where not exists ( select 1
from #CustOrder co2
where co2.CustId = co1.CustId
and co2.OrderDate > '2017-02-17'
)
group by co1.custID
Start your ORDER BY with a CASE expression that prefers future over past, and then use the ABS DATEDIFF (like you have now) as the second condition in the ORDER BY.
Maybe create another column and use the LAG() window function to grab the last date function and then put a conditional/case statement within the select portion? https://msdn.microsoft.com/en-us/library/hh231256.aspx

Select distinct users group by time range

I have a table with the following info
|date | user_id | week_beg | month_beg|
SQL to create table with test values:
CREATE TABLE uniques
(
date DATE,
user_id INT,
week_beg DATE,
month_beg DATE
)
INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01')
INPUT TABLE:
| date | user_id | week_beg | month_beg |
| 2013-01-01 | 1 | 2012-12-30 | 2013-01-01 |
| 2013-01-03 | 3 | 2012-12-30 | 2013-01-01 |
| 2013-01-06 | 4 | 2013-01-06 | 2013-01-01 |
| 2013-01-07 | 4 | 2013-01-06 | 2013-01-01 |
OUTPUT TABLE:
| date | time_series | cnt |
| 2013-01-01 | D | 1 |
| 2013-01-01 | W | 1 |
| 2013-01-01 | M | 1 |
| 2013-01-03 | D | 1 |
| 2013-01-03 | W | 2 |
| 2013-01-03 | M | 2 |
| 2013-01-06 | D | 1 |
| 2013-01-06 | W | 1 |
| 2013-01-06 | M | 3 |
| 2013-01-07 | D | 1 |
| 2013-01-07 | W | 1 |
| 2013-01-07 | M | 3 |
I want to calculate the number of distinct user_id's for a date:
For that date
For that week up to that date (Week to date)
For the month up to that date (Month to date)
1 is easy to calculate.
For 2 and 3 I am trying to use such queries:
SELECT
date,
'W' AS "time_series",
(COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY week_beg) AS "cnt"
FROM user_subtitles
SELECT
date,
'M' AS "time_series",
(COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY month_beg) AS "cnt"
FROM user_subtitles
Postgres does not allow window functions for DISTINCT calculation, so this approach does not work.
I have also tried out a GROUP BY approach, but it does not work as it gives me numbers for whole week/months.
What's the best way to approach this problem?
Count all rows
SELECT date, '1_D' AS time_series, count(DISTINCT user_id) AS cnt
FROM uniques
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM uniques
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM uniques
ORDER BY 1, time_series
Your columns week_beg and month_beg are 100 % redundant and can easily be replaced by
date_trunc('week', date + 1) - 1 and date_trunc('month', date) respectively.
Your week seems to start on Sunday (off by one), therefore the + 1 .. - 1.
The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That's exactly what you need.
Use UNION ALL, not UNION.
Your unfortunate choice for time_series (D, W, M) does not sort well, I renamed to make the final ORDER BY easier.
This query can deal with multiple rows per day. Counts include all peers for a day.
More about DISTINCT ON:
Select first row in each GROUP BY group?
DISTINCT users per day
To count every user only once per day, use a CTE with DISTINCT ON:
WITH x AS (SELECT DISTINCT ON (1,2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM x
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W'
,count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1)
ORDER BY date)
FROM x
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M'
,count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM x
ORDER BY 1, 2
DISTINCT users over dynamic period of time
You can always resort to correlated subqueries. Tend to be slow with big tables!
Building on the previous queries:
WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1,2)
,d AS (
SELECT date
,(date_trunc('week', date + 1)::date - 1) AS week_beg
,date_trunc('month', date)::date AS month_beg
FROM uniques
GROUP BY 1
)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM du
GROUP BY 1
UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.week_beg AND d.date )
FROM d
GROUP BY date, week_beg
UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.month_beg AND d.date)
FROM d
GROUP BY date, month_beg
ORDER BY 1,2;
SQL Fiddle for all three solutions.
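To make the "distinct users from the start of the period up to this date" definition concrete, here is a brute-force Python sketch that mirrors the correlated-subquery version. The function names are hypothetical, and the week is assumed to start on Sunday, as in the question's week_beg column.

```python
from datetime import date, timedelta

def week_start_sunday(d):
    # date.weekday(): Monday=0 .. Sunday=6, so Sunday maps to an offset of 0.
    return d - timedelta(days=(d.weekday() + 1) % 7)

def to_date_counts(rows):
    """rows: list of (day, user_id) -> list of (day, series, distinct user count)."""
    out = []
    for day in sorted({d for d, _ in rows}):
        week_beg = week_start_sunday(day)
        month_beg = day.replace(day=1)
        daily = {u for d, u in rows if d == day}
        weekly = {u for d, u in rows if week_beg <= d <= day}
        monthly = {u for d, u in rows if month_beg <= d <= day}
        out += [(day, "D", len(daily)), (day, "W", len(weekly)), (day, "M", len(monthly))]
    return out

sample = [
    (date(2013, 1, 1), 1),
    (date(2013, 1, 3), 3),
    (date(2013, 1, 6), 4),
    (date(2013, 1, 7), 4),
]
for row in to_date_counts(sample):
    print(row)
```

This rescans all rows for every day, so it is only meant for checking expected numbers on small samples, not as a performance reference.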
Faster with dense_rank()
@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.
Building on a simplified and sanitized data model
- without the redundant columns
- day as column name instead of date
date is a reserved word in standard SQL and a basic type name in PostgreSQL and shouldn't be used as identifier.
CREATE TABLE uniques(
day date -- instead of "date"
,user_id int
);
Improved query:
WITH du AS (
SELECT DISTINCT ON (1, 2)
day, user_id
,date_trunc('week', day + 1)::date - 1 AS week_beg
,date_trunc('month', day)::date AS month_beg
FROM uniques
)
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM (
SELECT user_id, day
,dense_rank() OVER(PARTITION BY week_beg ORDER BY user_id) AS w
,dense_rank() OVER(PARTITION BY month_beg ORDER BY user_id) AS m
FROM du
) s
GROUP BY day
ORDER BY day;
SQL Fiddle demonstrating the performance of 4 faster variants. It depends on your data distribution which is fastest for you.
All of them are about 10x as fast as the correlated subqueries version (which isn't bad for correlated subqueries).
Without correlated subqueries. SQL Fiddle
with u as (
select
"date", user_id,
date_trunc('week', "date" + 1)::date - 1 week_beg,
date_trunc('month', "date")::date month_beg
from uniques
)
select
"date", count(distinct user_id) D,
max(week_dr) W, max(month_dr) M
from (
select
user_id, "date",
dense_rank() over(partition by week_beg order by user_id) week_dr,
dense_rank() over(partition by month_beg order by user_id) month_dr
from u
) s
group by "date"
order by "date"
Try
SELECT
*
FROM
(
SELECT dates, count(user_id), 'D' as timesereis FROM users_data GROUP BY dates
UNION
SELECT max(dates), count(user_id), 'W' FROM users_data GROUP BY date_part('year',dates), date_part('week',dates)
UNION
SELECT max(dates), count(user_id), 'M' FROM users_data GROUP BY date_part('year',dates), date_part('month',dates)
) tEMP order by dates, timesereis
Try queries like this
SELECT count(distinct user_id), date_format(date, '%Y-%m-%d') as date_period
FROM uniques
GROUP By date_period

Combine consecutive date ranges

Using SQL Server 2008 R2,
I'm trying to combine date ranges into the maximum date range given that one end date is next to the following start date.
The data is about different employments. Some employees may have ended their employment and have rejoined at a later time. Those should count as two different employments (example ID 5). Some people have different types of employment, running after each other (enddate and startdate neck-to-neck), in this case it should be considered as one employment in total (example ID 30).
An employment period that has not ended has an enddate that is null.
Some examples are probably enlightening:
declare @t as table (employmentid int, startdate datetime, enddate datetime)
insert into @t values
(5, '2007-12-03', '2011-08-26'),
(5, '2013-05-02', null),
(30, '2006-10-02', '2011-01-16'),
(30, '2011-01-17', '2012-08-12'),
(30, '2012-08-13', null),
(66, '2007-09-24', null)
-- expected outcome
EmploymentId StartDate EndDate
5 2007-12-03 2011-08-26
5 2013-05-02 NULL
30 2006-10-02 NULL
66 2007-09-24 NULL
I've been trying different "islands-and-gaps" techniques but haven't been able to crack this one.
The strange bit you see with my use of the date '32121231' is just a very large date to handle your "no-end-date" scenario. I have assumed you won't really have many date ranges per employee, so I've used a simple Recursive Common Table Expression to combine the ranges.
To make it run faster, the starting anchor query keeps only those dates that will not link up to a prior range (per employee). The rest is just tree-walking the date ranges and growing the range. The final GROUP BY keeps only the largest date range built up per starting ANCHOR (employmentid, startdate) combination.
MS SQL Server 2008 Schema Setup:
create table Tbl (
employmentid int,
startdate datetime,
enddate datetime);
insert Tbl values
(5, '2007-12-03', '2011-08-26'),
(5, '2013-05-02', null),
(30, '2006-10-02', '2011-01-16'),
(30, '2011-01-17', '2012-08-12'),
(30, '2012-08-13', null),
(66, '2007-09-24', null);
/*
-- expected outcome
EmploymentId StartDate EndDate
5 2007-12-03 2011-08-26
5 2013-05-02 NULL
30 2006-10-02 NULL
66 2007-09-24 NULL
*/
Query 1:
;with cte as (
select a.employmentid, a.startdate, a.enddate
from Tbl a
left join Tbl b on a.employmentid=b.employmentid and a.startdate-1=b.enddate
where b.employmentid is null
union all
select a.employmentid, a.startdate, b.enddate
from cte a
join Tbl b on a.employmentid=b.employmentid and b.startdate-1=a.enddate
)
select employmentid,
startdate,
nullif(max(isnull(enddate,'32121231')),'32121231') enddate
from cte
group by employmentid, startdate
order by employmentid
Results:
| EMPLOYMENTID | STARTDATE | ENDDATE |
-----------------------------------------------------------------------------------
| 5 | December, 03 2007 00:00:00+0000 | August, 26 2011 00:00:00+0000 |
| 5 | May, 02 2013 00:00:00+0000 | (null) |
| 30 | October, 02 2006 00:00:00+0000 | (null) |
| 66 | September, 24 2007 00:00:00+0000 | (null) |
SET NOCOUNT ON
DECLARE @T TABLE(ID INT,FromDate DATETIME, ToDate DATETIME)
INSERT INTO @T(ID,FromDate,ToDate)
SELECT 1,'20090801','20090803' UNION ALL
SELECT 2,'20090802','20090809' UNION ALL
SELECT 3,'20090805','20090806' UNION ALL
SELECT 4,'20090812','20090813' UNION ALL
SELECT 5,'20090811','20090812' UNION ALL
SELECT 6,'20090802','20090802'
SELECT ROW_NUMBER() OVER(ORDER BY s1.FromDate) AS ID,
s1.FromDate,
MIN(t1.ToDate) AS ToDate
FROM @T s1
INNER JOIN @T t1 ON s1.FromDate <= t1.ToDate
AND NOT EXISTS(SELECT * FROM @T t2
WHERE t1.ToDate >= t2.FromDate
AND t1.ToDate < t2.ToDate)
WHERE NOT EXISTS(SELECT * FROM @T s2
WHERE s1.FromDate > s2.FromDate
AND s1.FromDate <= s2.ToDate)
GROUP BY s1.FromDate
ORDER BY s1.FromDate
An alternative solution that uses window functions rather than recursive CTEs
SELECT
employmentid,
MIN(startdate) as startdate,
NULLIF(MAX(COALESCE(enddate,'9999-01-01')), '9999-01-01') as enddate
FROM (
SELECT
employmentid,
startdate,
enddate,
DATEADD(
DAY,
-COALESCE(
SUM(DATEDIFF(DAY, startdate, enddate)+1) OVER (PARTITION BY employmentid ORDER BY startdate ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
0
),
startdate
) as grp
FROM @t
) withGroup
GROUP BY employmentid, grp
ORDER BY employmentid, startdate
This works by calculating a grp value that is the same for all consecutive rows. This is achieved as follows:
Determine the total days each span occupies (+1 because the dates are inclusive)
SELECT *, DATEDIFF(DAY, startdate, enddate)+1 as daysSpanned FROM @t
Cumulatively sum the days spanned for each employment, ordered by startdate. This gives us the total days spanned by all the previous employment spans
We coalesce with 0 to ensure we don't have NULLs in our cumulative sum of days spanned
We do not include the current row in our cumulative sum, because we will use the value against startdate rather than enddate (we can't use it against enddate because of the NULLs)
SELECT *, COALESCE(
SUM(daysSpanned) OVER (
PARTITION BY employmentid
ORDER BY startdate
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
,0
) as cumulativeDaysSpanned
FROM (
SELECT *, DATEDIFF(DAY, startdate, enddate)+1 as daysSpanned FROM @t
) inner1
Subtract the cumulative days from the startdate to get our grp. This is the crux of the solution.
If the startdate increases at the same rate as the days spanned, then the spans are consecutive, and subtracting the two gives the same value.
If the startdate increases faster than the days spanned, then there is a gap and we get a new grp value greater than the previous one.
Although grp is a date, the date itself is meaningless; we are using it only as a grouping value
SELECT *, DATEADD(DAY, -cumulativeDaysSpanned, startdate) as grp
FROM (
SELECT *, COALESCE(
SUM(daysSpanned) OVER (
PARTITION BY employmentid
ORDER BY startdate
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
,0
) as cumulativeDaysSpanned
FROM (
SELECT *, DATEDIFF(DAY, startdate, enddate)+1 as daysSpanned FROM @t
) inner1
) inner2
With the results
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| employmentid | startdate | enddate | daysSpanned | cumulativeDaysSpanned | grp |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 5 | 2007-12-03 00:00:00.000 | 2011-08-26 00:00:00.000 | 1363 | 0 | 2007-12-03 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 5 | 2013-05-02 00:00:00.000 | NULL | NULL | 1363 | 2009-08-08 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 30 | 2006-10-02 00:00:00.000 | 2011-01-16 00:00:00.000 | 1568 | 0 | 2006-10-02 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 30 | 2011-01-17 00:00:00.000 | 2012-08-12 00:00:00.000 | 574 | 1568 | 2006-10-02 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 30 | 2012-08-13 00:00:00.000 | NULL | NULL | 2142 | 2006-10-02 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 66 | 2007-09-24 00:00:00.000 | NULL | NULL | 0 | 2007-09-24 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
Finally we can GROUP BY grp to collapse the consecutive spans.
Use MIN and MAX to get the new startdate and enddate
To handle the NULL enddates we give them a large value so they get picked up by MAX, then convert them back to NULL again
SELECT
employmentid,
MIN(startdate) as startdate,
NULLIF(MAX(COALESCE(enddate,'9999-01-01')), '9999-01-01') as enddate
FROM (
SELECT *, DATEADD(DAY, -cumulativeDaysSpanned, startdate) as grp
FROM (
SELECT *, COALESCE(
SUM(daysSpanned) OVER (
PARTITION BY employmentid
ORDER BY startdate
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
,0
) as cumulativeDaysSpanned
FROM (
SELECT *, DATEDIFF(DAY, startdate, enddate)+1 as daysSpanned FROM @t
) inner1
) inner2
) inner3
GROUP BY employmentid, grp
ORDER BY employmentid, startdate
To get the desired result
+--------------+-------------------------+-------------------------+
| employmentid | startdate | enddate |
+--------------+-------------------------+-------------------------+
| 5 | 2007-12-03 00:00:00.000 | 2011-08-26 00:00:00.000 |
+--------------+-------------------------+-------------------------+
| 5 | 2013-05-02 00:00:00.000 | NULL |
+--------------+-------------------------+-------------------------+
| 30 | 2006-10-02 00:00:00.000 | NULL |
+--------------+-------------------------+-------------------------+
| 66 | 2007-09-24 00:00:00.000 | NULL |
+--------------+-------------------------+-------------------------+
We can combine the inner queries to get the query at the start of this answer, which is shorter but harder to explain.
This approach requires that:
there are no overlapping startdate/enddate spans within an employment; overlaps could produce collisions in our grp.
startdate is not NULL. However, this could be overcome by replacing NULL start dates with small date values
Future developers can decipher the window black magic you performed
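The grp arithmetic described above can be replayed in a few lines of Python to check the reasoning. This is a sketch under the same assumptions as the SQL (no overlaps, startdate never NULL); the name `merge_adjacent` is hypothetical, and an open-ended span is represented by `None`.

```python
from datetime import date, timedelta

def merge_adjacent(spans):
    """spans: (employmentid, startdate, enddate or None), inclusive and non-overlapping."""
    groups = {}       # (employmentid, grp) -> list of (startdate, enddate)
    days_before = {}  # employmentid -> days spanned by all earlier rows
    for emp, start, end in sorted(spans, key=lambda s: (s[0], s[1])):
        offset = days_before.get(emp, 0)
        # startdate minus the cumulative days stays constant while spans are back to back.
        grp = start - timedelta(days=offset)
        groups.setdefault((emp, grp), []).append((start, end))
        if end is not None:  # like SUM(), skip the open-ended span
            days_before[emp] = offset + (end - start).days + 1
    merged = []
    for (emp, _), rows in groups.items():
        ends = [e for _, e in rows]
        merged.append((emp, min(s for s, _ in rows), None if None in ends else max(ends)))
    return merged

sample = [
    (5, date(2007, 12, 3), date(2011, 8, 26)),
    (5, date(2013, 5, 2), None),
    (30, date(2006, 10, 2), date(2011, 1, 16)),
    (30, date(2011, 1, 17), date(2012, 8, 12)),
    (30, date(2012, 8, 13), None),
    (66, date(2007, 9, 24), None),
]
for row in merge_adjacent(sample):
    print(row)
```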
A modified script for combining all overlapping periods. For example
01.01.2001-01.01.2010
05.05.2005-05.05.2015
will give one period:
01.01.2001-05.05.2015
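tbl.enddate must be filled in for every row (for example with 31.12.2099 for open-ended periods, which the final query maps back to NULL)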
;WITH cte
AS(
SELECT
a.employmentid
,a.startdate
,a.enddate
from tbl a
left join tbl c on a.employmentid=c.employmentid
and a.startdate > c.startdate
and a.startdate <= dateadd(day, 1, c.enddate)
WHERE c.employmentid IS NULL
UNION all
SELECT
a.employmentid
,a.startdate
,a.enddate
from cte a
inner join tbl c on a.startdate=c.startdate
and (c.startdate = dateadd(day, 1, a.enddate) or (c.enddate > a.enddate and c.startdate <= a.enddate))
)
select distinct employmentid,
startdate,
nullif(max(enddate),'31.12.2099') enddate
from cte
group by employmentid, startdate
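As a cross-check for the overlapping case, the same merge can be expressed as a single sorted pass. The sketch below is not the recursive CTE above but the usual sort-and-sweep technique, with a hypothetical function name `merge_overlapping`; it handles the periods of one employment, treats adjacent periods (end date plus one day) as mergeable, and assumes every end date is filled in.

```python
from datetime import date, timedelta

def merge_overlapping(periods):
    """periods: list of (startdate, enddate), inclusive, for a single employment."""
    merged = []
    for start, end in sorted(periods):
        # Overlapping or directly adjacent to the last merged period: extend it.
        if merged and start <= merged[-1][1] + timedelta(days=1):
            if end > merged[-1][1]:
                merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

# The example from the text: 01.01.2001-01.01.2010 and 05.05.2005-05.05.2015.
print(merge_overlapping([
    (date(2001, 1, 1), date(2010, 1, 1)),
    (date(2005, 5, 5), date(2015, 5, 5)),
]))
```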