I've been working on an issue for a few days now, and I can't seem to find the right fix. Does anybody have an idea?
Case
We want to create a new a new sequence number whenever an employee has resigned for more than 1 day. We have the delta of the current employment record and the previous, so we can check the sequence. We want to calculate the min(Start) and max(End) of each employment record which isn't separated more than 1 day apart.
Data
Employee
Contract
Unit
Start
End
Delta
John Doe
1
Unit A
2014-01-01
2017-12-31
NULL
John Doe
2
Unit A
2018-02-01
2018-12-31
31
John Doe
3
Unit B
2019-01-01
2020-05-31
1
John Doe
4
Unit A
2020-06-01
NULL
1
With the query it should give back:
Employee
Contract
Unit
Start
End
Delta
Sequence
John Doe
1
Unit A
2014-01-01
2017-12-31
NULL
1
John Doe
2
Unit A
2018-02-01
2018-12-31
31
2
John Doe
3
Unit B
2019-01-01
2020-05-31
1
2
John Doe
4
Unit A
2020-06-01
NULL
1
2
That is because sequence 1 end at 31-12-2017, and a new one starts in February of 2018, so there has been more than 1 day of separation between the records. The following all have a sequence of 2 because it is continuing.
Query
I've tried a few things already with lag() and lead(), but I keep working myself into a corner with the data sample that I have. When I run it on the full set it won't work.
SELECT
Employee,
Start,
End,
DeltaPrevious,
Delta,
DeltaNext,
case
when DeltaPrevious IS NULL AND Delta = 1 then 1
when DeltaPrevious = 1 AND Delta > 1 then min(Contract) OVER (PARTITION BY Employee ORDER BY Contract ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
when DeltaPrevious > 1 AND Delta = 1 then min(Contract) OVER (PARTITION BY Employee ORDER BY Contract ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
end as Sequence
FROM
Contracts
ORDER BY
Employee, Start ASC
Hope that someone has a great idea.
Thanks,
Basically, you want to use lag() to get the previous date and then do a cumulative sum. This looks like:
select c.*,
sum(case when prev_end >= dateadd(day, -1, start) then 0 else 1
end) over (partition by employee order by start) as ranking
from (select c.*,
lag(end) over (partition by employee order by start) as prev_end
from contracts c
) c;
You mention that you might want to recalculate the new start and end. You would just use the above as a subquery/CTE and aggregate on employee and ranking.
If I understood correctly from the definition of Sequence in your second table, you are more interested in the DeltaNext than in the Delta(Previous). Here an attempt, including the code to create a sample input date with two more employees:
CREATE TABLE #input_table (Employee VARCHAR(255), [Contract] INT, Unit VARCHAR(6), [Start] DATE, [End] DATE)
INSERT INTO #input_table
VALUES
('John Doe', 1, 'Unit A', '2014-01-01', '2017-12-31'),
('John Doe', 2, 'Unit A', '2018-02-01', '2018-12-31'),
('John Doe', 3, 'Unit B', '2019-01-01', '2020-05-31'),
('John Doe', 4, 'Unit A', '2020-06-01', NULL),
('Alice', 1, 'Unit A', '2020-01-01', NULL),
('Bob', 1, 'Unit C', '2020-01-01', '2020-02-20')
First we create the deltas:
SELECT *
, DeltaPrev = DATEDIFF(DAY, LAG([End], 1, NULL) OVER(PARTITION BY Employee
ORDER BY [Start]), [Start]) -- Not relevant (?)
, DeltaNext = DATEDIFF(DAY, [End], LEAD([Start], 1, NULL) OVER(PARTITION BY Employee ORDER BY [Start]))
INTO #cte_delta -- I'll create a CTE at the end
FROM #input_table
Then we define Sequence:
SELECT *
, [Sequence] = CASE WHEN DeltaNext > 1 THEN 1 ELSE 2 END
INTO #cte_sequence
FROM #cte_delta
We then group the same Sequences by assigning a unique ROW_NUMBER for each employee with consecutive/ same Sequences:
SELECT *
, GRP = ROW_NUMBER() OVER(PARTITION BY Employee ORDER BY [Start]) - ROW_NUMBER() OVER(PARTITION BY Employee, [Sequence] ORDER BY [Start])
INTO #cte_grp
FROM #cte_sequence
Finally we calculate the min and max of the contract duration:
SELECT *
, MIN([Start]) OVER(PARTITION BY Employee, GRP) AS ContractStart
, CASE WHEN COUNT(*) OVER(PARTITION BY Employee, GRP) = COUNT([End])
OVER(PARTITION BY Employee, GRP) THEN MAX([End]) OVER(PARTITION BY Employee, GRP) ELSE NULL END AS ContractEnd
FROM cte_grp
The COUNT(*) and COUNT([End]) comparison is necessary or else the ContractEnd would be the max non-NULL value, i.e. 2018-02-01.
The whole code with CTEs here:
WITH cte_delta AS (
SELECT *
, DeltaPrev = DATEDIFF(DAY, LAG([End], 1, NULL) OVER(PARTITION BY Employee ORDER BY [Start]), [Start]) -- Not relevant (?)
, DeltaNext = DATEDIFF(DAY, [End], LEAD([Start], 1, NULL) OVER(PARTITION BY Employee ORDER BY [Start]))
FROM #input_table
)
, cte_sequence AS (
SELECT *
, [Sequence] = CASE WHEN DeltaNext > 1 THEN 1 ELSE 2 END
FROM cte_delta
)
, cte_grp AS (
SELECT *
, GRP = ROW_NUMBER() OVER(PARTITION BY Employee ORDER BY [Start]) - ROW_NUMBER() OVER(PARTITION BY Employee, [Sequence] ORDER BY [Start])
FROM cte_sequence
)
SELECT *
, MIN([Start]) OVER(PARTITION BY Employee, GRP) AS ContractStart
, CASE WHEN COUNT(*) OVER(PARTITION BY Employee, GRP) = COUNT([End]) OVER(PARTITION BY Employee, GRP) THEN MAX([End]) OVER(PARTITION BY Employee, GRP) ELSE NULL END AS ContractEnd
FROM cte_grp
Here the output:
Employee
Contract
Unit
Start
End
DeltaPrev
DeltaNext
Sequence
GRP
ContractStart
ContractEnd
Alice
1
Unit A
2020-01-01
NULL
NULL
NULL
2
0
2020-01-01
NULL
Bob
1
Unit C
2020-01-01
2020-02-20
NULL
NULL
2
0
2020-01-01
2020-02-20
John Doe
1
Unit A
2014-01-01
2017-12-31
NULL
32
1
0
2014-01-01
2017-12-31
John Doe
2
Unit A
2018-02-01
2018-12-31
32
1
2
1
2018-02-01
NULL
John Doe
3
Unit B
2019-01-01
2020-05-31
1
1
2
1
2018-02-01
NULL
John Doe
4
Unit A
2020-06-01
NULL
1
NULL
2
1
2018-02-01
NULL
Feel free to select DISTINCT records according to your needs.
Related
i have to group data on basis of amount column but if the amount repeat after some interval then it should be treated as new group.e.g
CREATE TABLE [dbo].[TEST](
[ID] [INT] NULL,
[DLRCODE] [VARCHAR](20) NULL,
[AMN] [DECIMAL](21, 5) NULL,
[RATE] [DECIMAL](7, 5) NULL,
[DTE] [DATETIME] NULL
) ON [NFS_DATA]
-----this should be first group
1 123 10.00000 5.00000 2019-11-01 00:00:00.000
2 123 10.00000 5.00000 2019-11-02 00:00:00.000
3 123 10.00000 5.00000 2019-11-03 00:00:00.000
-----this should be second group
4 123 15.00000 5.00000 2019-11-04 00:00:00.000
-----this should be third group
5 123 10.00000 5.00000 2019-11-05 00:00:00.000
6 123 10.00000 5.00000 2019-11-06 00:00:00.000
-----this should be fourth group
7 123 20.00000 5.00000 2019-11-07 15:02:07.537
as you can check from above code and data, result should be group, every time amount change new group will be created.
result will like this
1 30 --- group of first three records
2 15 --- group of fourth records
3 20 --- group of fifth and sixth records
4 20 --- group of seven record
You can do this by using a combination of LAG and conditional aggregation:
WITH CTE AS
(
SELECT Id
, DLRCode
, Amn
, Rate
, DTE
, ISNULL(LAG(Amn) OVER(ORDER BY DTE), Amn) As PreviousAmount
FROM dbo.Test
)
SELECT Id
, DLRCode
, Amn
, Rate
, DTE
, SUM(IIF(Amn = PreviousAmount, 0, 1)) OVER(ORDER BY DTE) As Grp
FROM CTE
To get your result set, you only need lag(), taking both the date and the amount into account:
select t.*
from (select t.*,
lag(amn) over (partition by dlrcode, rate order by dte) as prev_amn,
lag(dte) over (partition by dlrcode, rate order by dte) as prev_dte
from test t
) t
where prev_amn is null or
prev_amn <> amn or
prev_dte < dateadd(day, -1, dte);
If you want to incorporate this into a group id and then summarize the groups -- with information from multiple rows -- then we'll add a group id as the cumulative sum of the group changes and aggregate:
select dlrcode, rate, amn, min(dte), max(dte),
count(*)
from (select t.*,
sum(case when prev_amn = amn and prev_dte >= dateadd(day, -1, dte)
then 0 else 1
end) over (partition by dlrcode, rate) as grp
from (select t.*,
lag(amn) over (partition by dlrcode, rate order by dte) as prev_amn,
lag(dte) over (partition by dlrcode, rate order by dte) as prev_dte
from test t
) t
) t
group by dlrcode, rate, amn, grp;
For multiple rows with identical features, I hope two add few marks/new columns in the original table.
The original table is as below:
ID Start_date End_Date Amount
1 2005-01-01 2010-01-01 5
1 2000-07-01 2009-06-01 10
1 2017-08-01 2018-03-01 30
I wish to keep one record with the earliest start date, latest end date, added amount and an indicator to tell me to use this record. For the others, just use the indicator to tell me not to use.
The updated table should be as below:
ID Start_date End_Date Amount Amount_new Usable Start End
1 2005-01-01 2010-01-01 5 45 0 2000-07-01 2018-03-01
1 2000-07-01 2009-06-01 10 1
1 2017-08-01 2018-03-01 30 1
It does not matter which row to keep, as long as there is one row with Usable=0, and Amount_new, Start and End are updated.
If not considering the end date, I was thinking of grouping by ID and Start_date, then update the column Usable and Amount_new of the first row. However I still have the problem of how to select the first row from the group by group. Considering the End_Date makes my mind even more messy!
Could anyone help to shed some light upon this issue?
You seem to want something like this:
alter table original
add amount_new int,
add usable bit,
add new_start,
add new_end;
Then, you can update it using window functions:
with toupdate as (
select o.*,
sum(amount) over (partition by id) as x_amount,
(case when row_number() over (partition by id order by start_date) as x_usable,
min(start_date) as x_start_date,
max(end_date) as x_end_date
from original o
)
update toupdate
set new_amount = x_amount,
usable = x_usable,
new_start = x_start_date,
new_end = x_end_date;
The following query should do what you want:
CREATE TABLE #temp (ID INT, [Start_date] DATE, End_Date DATE, Amount NUMERIC(28,0), Amount_new NUMERIC(28,0), Usable BIT, Start [Date], [End] [Date])
INSERT INTO #temp (ID, [Start_date], End_Date, Amount) VALUES
(1,'2005-01-01','2010-01-01',5),
(1,'2000-07-01','2009-06-01',10),
(1,'2017-08-01','2018-03-01',30),
(2,'2001-07-01','2009-06-01',5),
(2,'2017-08-01','2019-03-01',35)
UPDATE t1
SET Amount_new = t2.[Amount_new],
Usable = 1,
Start = t2.[Start],
[End] = t2.[End]
FROM (SELECT *,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY (SELECT 1)) AS RNO FROM #temp) t1
INNER JOIN
(
SELECT ID,[Start_date],[End_Date],[Amount]
,SUM(Amount) OVER(PARTITION BY ID) AS [Amount_new]
,MIN([Start_date]) OVER(PARTITION BY ID) AS [Start]
,MAX(End_Date) OVER(PARTITION BY ID) AS [End]
,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY (SELECT 1)) AS RNO
FROM #temp ) t2 ON t1.id = t2.id AND t2.rno = t1.RNO AND t2.RNO = 1
SELECT * FROM #temp
The result is as below,
ID Start_date End_Date Amount Amount_new Usable Start End
1 2005-01-01 2010-01-01 5 45 1 2000-07-01 2018-03-01
1 2000-07-01 2009-06-01 10 NULL NULL NULL NULL
1 2017-08-01 2018-03-01 30 NULL NULL NULL NULL
2 2001-07-01 2009-06-01 5 40 1 2001-07-01 2019-03-01
2 2017-08-01 2019-03-01 35 NULL NULL NULL NULL
I have a table with the following schema:
CREATE TABLE Codes
(
diagnosis_code CHAR,
visit_date DATE,
visit_id INT,
patient_id int
);
I would like to output the patient_ids where the patient is readmitted (so a different visit_id) with the same diagnosis_code within a certain time (say 15 days). For example, if I have the following entries in the table:
diagnosis_code visit_date visit_id patient_id
-------------- ---------- ----------- -----------
A 2018-01-01 1 1
B 2018-01-01 1 1
A 2018-01-07 2 1
C 2018-01-01 3 2
D 2018-01-01 4 3
D 2018-01-20 5 3
E 2018-01-01 6 4
E 2018-01-01 6 4
A 2018-01-07 7 1
The query would return only patient_id = 1, and the rationales are as follows:
1, because between visit_id 1 and 2, this patient shared diagnosis code A.
Not 2 because this patient was only admitted once.
Not 3 because this patient, although readmitted for the same diagnosis, was not readmitted within 15 days of their initial visit.
Not 4 because this patient has a duplicated diagnosis code in the same visit.
Notice that patient_id = 1 is readmitted for the same diagnosis during visit_id = 7, but he was already counted once before.
You could try a simple join, adding the conditions you described:
select
distinct c.patient_id
from codes c
join codes d on d.patient_id = c.patient_id
and d.visit_id <> c.visit_id
and d.diagnosis_code = c.diagnosis_code
and d.visit_date between c.visit_date
and dateadd(day, 15, c.visit_date)
i used lag.
declare #Codes table
(
diagnosis_code CHAR,
visit_date DATE,
visit_id INT,
patient_id int
);
insert into #Codes
values
('A', '2018-01-01' ,1, 1)
,('B' , '2018-01-01', 1, 1)
,('A' , '2018-01-07', 2, 1)
,('C' ,'2018-01-01', 3, 2)
/*
D 2018-01-01 4 3
D 2018-01-15 5 3
E 2018-01-01 6 4
E 2018-01-01 6 4
A 2018-01-07 7 1
*/
select *
from (
select *
--,rn=row_number() over (partition by patient_ID,diagnosis_code order by visit_date)
,DaysSince = datediff(day,lag(visit_date,1) over (partition by patient_ID,diagnosis_code order by visit_date),visit_date)
from #Codes
) a
where a.DaysSince<=15
You can also use inbuilt FIRST_VALUE and DATEADD functions to achieve this:
SELECT
DISTINCT patient_id,diagnosis_code
FROM
(SELECT
FIRST_VALUE(visit_date) OVER (PARTITION BY patient_id,diagnosis_code ORDER BY visit_id ASC) AS Initial_Visit,
DATEADD(DAY,15,first_value(visit_date) OVER (PARTITION BY patient_id,diagnosis_code ORDER BY visit_id ASC)) Window
,* FROM Codes
)m
WHERE
Initial_Visit <> visit_date
AND visit_date <= Window
I need to create a report and I am struggling with the SQL script.
The table I want to query is a company_status_history table which has entries like the following (the ones that I can't figure out)
Table company_status_history
Columns:
| id | company_id | status_id | effective_date |
Data:
| 1 | 10 | 1 | 2016-12-30 00:00:00.000 |
| 2 | 10 | 5 | 2017-02-04 00:00:00.000 |
| 3 | 11 | 5 | 2017-06-05 00:00:00.000 |
| 4 | 11 | 1 | 2018-04-30 00:00:00.000 |
I want to answer to the question "Get all companies that have been at least for some point in status 1 inside the time period 01/01/2017 - 31/12/2017"
Above are the cases that I don't know how to handle since I need to add some logic of type :
"If this row is status 1 and it's date is before the date range check the next row if it has a date inside the date range."
"If this row is status 1 and it's date is after the date range check the row before if it has a date inside the date range."
I think this can be handled as a gaps and islands problem. Consider the following input data: (same as sample data of OP plus two additional rows)
id company_id status_id effective_date
-------------------------------------------
1 10 1 2016-12-15
2 10 1 2016-12-30
3 10 5 2017-02-04
4 10 4 2017-02-08
5 11 5 2017-06-05
6 11 1 2018-04-30
You can use the following query:
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
ORDER BY company_id, effective_date
to get:
id company_id status_id effective_date grp
-----------------------------------------------
1 10 1 2016-12-15 0
2 10 1 2016-12-30 1
3 10 5 2017-02-04 2
4 10 4 2017-02-08 2
5 11 5 2017-06-05 0
6 11 1 2018-04-30 0
Now you can identify status = 1 islands using:
;WITH CTE AS
(
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
)
SELECT id, company_id, status_id, effective_date,
ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) -
cnt AS grp
FROM CTE
Output:
id company_id status_id effective_date grp
-----------------------------------------------
1 10 1 2016-12-15 1
2 10 1 2016-12-30 1
3 10 5 2017-02-04 1
4 10 4 2017-02-08 2
5 11 5 2017-06-05 1
6 11 1 2018-04-30 2
Calculated field grp will help us identify those islands:
;WITH CTE AS
(
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
), CTE2 AS
(
SELECT id, company_id, status_id, effective_date,
ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) -
cnt AS grp
FROM CTE
)
SELECT company_id,
MIN(effective_date) AS start_date,
CASE
WHEN COUNT(*) > 1 THEN DATEADD(DAY, -1, MAX(effective_date))
ELSE MIN(effective_date)
END AS end_date
FROM CTE2
GROUP BY company_id, grp
HAVING COUNT(CASE WHEN status_id = 1 THEN 1 END) > 0
Output:
company_id start_date end_date
-----------------------------------
10 2016-12-15 2017-02-03
11 2018-04-30 2018-04-30
All you want know is those records from above that overlap with the specified interval.
Demo here with somewhat more complicated use case.
Maybe this is what you are looking for? For these kind of questions, you need to join two instance of your table, in this case I am just joining with next record by Id, which probably is not totally correct. To do it better, you can create a new Id using a windowed function like row_number, ordering the table by your requirement criteria
If this row is status 1 and it's date is before the date range check
the next row if it has a date inside the date range
declare #range_st date = '2017-01-01'
declare #range_en date = '2017-12-31'
select
case
when csh1.status_id=1 and csh1.effective_date<#range_st
then
case
when csh2.effective_date between #range_st and #range_en then true
else false
end
else NULL
end
from company_status_history csh1
left join company_status_history csh2
on csh1.id=csh2.id+1
Implementing second criteria:
"If this row is status 1 and it's date is after the date range check
the row before if it has a date inside the date range."
declare #range_st date = '2017-01-01'
declare #range_en date = '2017-12-31'
select
case
when csh1.status_id=1 and csh1.effective_date<#range_st
then
case
when csh2.effective_date between #range_st and #range_en then true
else false
end
when csh1.status_id=1 and csh1.effective_date>#range_en
then
case
when csh3.effective_date between #range_st and #range_en then true
else false
end
else null -- ¿?
end
from company_status_history csh1
left join company_status_history csh2
on csh1.id=csh2.id+1
left join company_status_history csh3
on csh1.id=csh3.id-1
I would suggest the use of a cte and the window functions ROW_NUMBER. With this you can find the desired records. An example:
DECLARE #t TABLE(
id INT
,company_id INT
,status_id INT
,effective_date DATETIME
)
INSERT INTO #t VALUES
(1, 10, 1, '2016-12-30 00:00:00.000')
,(2, 10, 5, '2017-02-04 00:00:00.000')
,(3, 11, 5, '2017-06-05 00:00:00.000')
,(4, 11, 1, '2018-04-30 00:00:00.000')
DECLARE #StartDate DATETIME = '2017-01-01';
DECLARE #EndDate DATETIME = '2017-12-31';
WITH cte AS(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) AS rn
FROM #t
),
cteLeadLag AS(
SELECT c.*, ISNULL(c2.effective_date, c.effective_date) LagEffective, ISNULL(c3.effective_date, c.effective_date)LeadEffective
FROM cte c
LEFT JOIN cte c2 ON c2.company_id = c.company_id AND c2.rn = c.rn-1
LEFT JOIN cte c3 ON c3.company_id = c.company_id AND c3.rn = c.rn+1
)
SELECT 'Included' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date BETWEEN #StartDate AND #EndDate
UNION ALL
SELECT 'Following' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date > #EndDate
AND LagEffective BETWEEN #StartDate AND #EndDate
UNION ALL
SELECT 'Trailing' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date < #EndDate
AND LeadEffective BETWEEN #StartDate AND #EndDate
I first select all records with their leading and lagging Dates and then I perform your checks on the inclusion in the desired timespan.
Try with this, self-explanatory. Responds to this part of your question:
I want to answer to the question "Get all companies that have been at
least for some point in status 1 inside the time period 01/01/2017 -
31/12/2017"
Case that you want to find those id's that have been in any moment in status 1 and have records in the period requested:
SELECT *
FROM company_status_history
WHERE id IN
( SELECT Id
FROM company_status_history
WHERE status_id=1 )
AND effective_date BETWEEN '2017-01-01' AND '2017-12-31'
Case that you want to find id's in status 1 and inside the period:
SELECT *
FROM company_status_history
WHERE status_id=1
AND effective_date BETWEEN '2017-01-01' AND '2017-12-31'
I have a dataset with id ,Status and date range of employees.
The input dataset given below are the details of one employee.
The date ranges in the records are continuous(in exact order) such that startdate of second row will be the next date of enddate of first row.
If an employee takes leave continuously for different months, then the table is storing the info with date range as separated for different months.
For example: In the input set, the employee has taken Sick leave from '16-10-2016' to '31-12-2016' and joined back on '1-1-2017'.
So there are 3 records for this item but the dates are continuous.
In the output I need this as one record as shown in the expected output dataset.
INPUT
Id Status StartDate EndDate
1 Active 1-9-2007 15-10-2016
1 Sick 16-10-2016 31-10-2016
1 Sick 1-11-2016 30-11-2016
1 Sick 1-12-2016 31-12-2016
1 Active 1-1-2017 4-2-2017
1 Unpaid 5-2-2017 9-2-2017
1 Active 10-2-2017 11-2-2017
1 Unpaid 12-2-2017 28-2-2017
1 Unpaid 1-3-2017 31-3-2017
1 Unpaid 1-4-2017 30-4-2017
1 Active 1-5-2017 13-10-2017
1 Sick 14-10-2017 11-11-2017
1 Active 12-11-2017 NULL
EXPECTED OUTPUT
Id Status StartDate EndDate
1 Active 1-9-2007 15-10-2016
1 Sick 16-10-2016 31-12-2016
1 Active 1-1-2017 4-2-2017
1 Unpaid 5-2-2017 9-2-2017
1 Active 10-2-2017 11-2-2017
1 Unpaid 12-2-2017 30-4-2017
1 Active 1-5-2017 13-10-2017
1 Sick 14-10-2017 11-11-2017
1 Active 12-11-2017 NULL
I can't take min(startdate) and max(EndDate) group by id,status because if the same employee has taken another Sick leave then that end date ('11-11-2017' in the example) will come as the End date.
can anyone help me with the query in SQL server 2014?
It suddenly hit me that this is basically a gaps and islands problem - so I've completely changed my solution.
For this solution to work, the dates does not have to be consecutive.
First, create and populate sample table (Please save us this step in your future questions):
DECLARE #T AS TABLE
(
Id int,
Status varchar(10),
StartDate date,
EndDate date
);
SET DATEFORMAT DMY; -- This is needed because how you specified your dates.
INSERT INTO #T (Id, Status, StartDate, EndDate) VALUES
(1, 'Active', '1-9-2007', '15-10-2016'),
(1, 'Sick', '16-10-2016', '31-10-2016'),
(1, 'Sick', '1-11-2016', '30-11-2016'),
(1, 'Sick', '1-12-2016', '31-12-2016'),
(1, 'Active', '1-1-2017', '4-2-2017'),
(1, 'Unpaid', '5-2-2017', '9-2-2017'),
(1, 'Active', '10-2-2017', '11-2-2017'),
(1, 'Unpaid', '12-2-2017', '28-2-2017'),
(1, 'Unpaid', '1-3-2017', '31-3-2017'),
(1, 'Unpaid', '1-4-2017', '30-4-2017'),
(1, 'Active', '1-5-2017', '13-10-2017'),
(1, 'Sick', '14-10-2017', '11-11-2017'),
(1, 'Active', '12-11-2017', NULL);
The (new) common table expression:
;WITH CTE AS
(
SELECT Id,
Status,
StartDate,
EndDate,
ROW_NUMBER() OVER(PARTITION BY Id ORDER BY StartDate)
- ROW_NUMBER() OVER(PARTITION BY Id, Status ORDER BY StartDate) As IslandId,
ROW_NUMBER() OVER(PARTITION BY Id ORDER BY StartDate DESC)
- ROW_NUMBER() OVER(PARTITION BY Id, Status ORDER BY StartDate DESC) As ReverseIslandId
FROM #T
)
The (new) query:
SELECT DISTINCT Id,
Status,
MIN(StartDate) OVER(PARTITION BY IslandId, ReverseIslandId) As StartDate,
NULLIF(MAX(ISNULL(EndDate, '9999-12-31')) OVER(PARTITION BY IslandId, ReverseIslandId), '9999-12-31') As EndDate
FROM CTE
ORDER BY StartDate
(new) Results:
Id Status StartDate EndDate
1 Active 01.09.2007 15.10.2016
1 Sick 16.10.2016 31.12.2016
1 Active 01.01.2017 04.02.2017
1 Unpaid 05.02.2017 09.02.2017
1 Active 10.02.2017 11.02.2017
1 Unpaid 12.02.2017 30.04.2017
1 Active 01.05.2017 13.10.2017
1 Sick 14.10.2017 11.11.2017
1 Active 12.11.2017 NULL
You can see a live demo on rextester.
Please note that string representation of dates in SQL should be acccording to ISO 8601 - meaning either yyyy-MM-dd or yyyyMMdd as it's unambiguous and will always be interpreted correctly by SQL Server.
It's an example of GROUPING AND WINDOW.
First you set a reset point for each Status
Sum to set a group
Then get max/min dates of each group.
;with x as
(
select Id, Status, StartDate, EndDate,
iif (lag(Status) over (order by Id, StartDate) = Status, null, 1) rst
from emp
), y as
(
select Id, Status, StartDate, EndDate,
sum(rst) over (order by Id, StartDate) grp
from x
)
select Id,
MIN(Status) as Status,
MIN(StartDate) StartDate,
MAX(EndDate) EndDate
from y
group by Id, grp
order by Id, grp
GO
Id | Status | StartDate | EndDate
-: | :----- | :------------------ | :------------------
1 | Active | 01/09/2007 00:00:00 | 15/10/2016 00:00:00
1 | Sick | 16/10/2016 00:00:00 | 31/12/2016 00:00:00
1 | Active | 01/01/2017 00:00:00 | 04/02/2017 00:00:00
1 | Unpaid | 05/02/2017 00:00:00 | 09/02/2017 00:00:00
1 | Active | 10/02/2017 00:00:00 | 11/02/2017 00:00:00
1 | Unpaid | 12/02/2017 00:00:00 | 30/04/2017 00:00:00
1 | Active | 01/05/2017 00:00:00 | 13/10/2017 00:00:00
1 | Sick | 14/10/2017 00:00:00 | 11/11/2017 00:00:00
1 | Active | 12/11/2017 00:00:00 | null
dbfiddle here
Here's an alternative answer that doesn't use LAG.
First I need to take a copy of your test data:
DECLARE #table TABLE (Id INT, [Status] VARCHAR(50), StartDate DATE, EndDate DATE);
INSERT INTO #table SELECT 1, 'Active', '20070901', '20161015';
INSERT INTO #table SELECT 1, 'Sick', '20161016', '20161031';
INSERT INTO #table SELECT 1, 'Sick', '20161101', '20161130';
INSERT INTO #table SELECT 1, 'Sick', '20161201', '20161231';
INSERT INTO #table SELECT 1, 'Active', '20170101', '20170204';
INSERT INTO #table SELECT 1, 'Unpaid', '20170205', '20170209';
INSERT INTO #table SELECT 1, 'Active', '20170210', '20170211';
INSERT INTO #table SELECT 1, 'Unpaid', '20170212', '20170228';
INSERT INTO #table SELECT 1, 'Unpaid', '20170301', '20170331';
INSERT INTO #table SELECT 1, 'Unpaid', '20170401', '20170430';
INSERT INTO #table SELECT 1, 'Active', '20170501', '20171013';
INSERT INTO #table SELECT 1, 'Sick', '20171014', '20171111';
INSERT INTO #table SELECT 1, 'Active', '20171112', NULL;
Then the query is:
WITH add_order AS (
SELECT
*,
ROW_NUMBER() OVER (ORDER BY StartDate) AS order_id
FROM
#table),
links AS (
SELECT
a1.Id,
a1.[Status],
a1.order_id,
MIN(a1.order_id) AS start_order_id,
MAX(ISNULL(a2.order_id, a1.order_id)) AS end_order_id,
MIN(a1.StartDate) AS StartDate,
MAX(ISNULL(a2.EndDate, a1.EndDate)) AS EndDate
FROM
add_order a1
LEFT JOIN add_order a2 ON a2.Id = a1.Id AND a2.[Status] = a1.[Status] AND a2.order_id = a1.order_id + 1 AND a2.StartDate = DATEADD(DAY, 1, a1.EndDate)
GROUP BY
a1.Id,
a1.[Status],
a1.order_id),
merged AS (
SELECT
l1.Id,
l1.[Status],
l1.[StartDate],
ISNULL(l2.EndDate, l1.EndDate) AS EndDate,
ROW_NUMBER() OVER (PARTITION BY l1.Id, l1.[Status], ISNULL(l2.EndDate, l1.EndDate) ORDER BY l1.order_id) AS link_id
FROM
links l1
LEFT JOIN links l2 ON l2.order_id = l1.end_order_id)
SELECT
Id,
[Status],
StartDate,
EndDate
FROM
merged
WHERE
link_id = 1
ORDER BY
StartDate;
Results are:
Id Status StartDate EndDate
1 Active 2007-09-01 2016-10-15
1 Sick 2016-10-16 2016-12-31
1 Active 2017-01-01 2017-02-04
1 Unpaid 2017-02-05 2017-02-09
1 Active 2017-02-10 2017-02-11
1 Unpaid 2017-02-12 2017-04-30
1 Active 2017-05-01 2017-10-13
1 Sick 2017-10-14 2017-11-11
1 Active 2017-11-12 NULL
How does it work? First I add a sequence number, to assist with merging contiguous rows together. Then I determine the rows that can be merged together, add a number to identify the first row in each set that can be merged, and finally pick the first rows out of the final CTE. Note that I also have to handle rows that can't be merged, hence the LEFT JOINs and ISNULL statements.
Just for interest, this is what the output from the final CTE looks like, before I filter out all but the rows with a link_id of 1:
Id Status StartDate EndDate link_id
1 Active 2007-09-01 2016-10-15 1
1 Sick 2016-10-16 2016-12-31 1
1 Sick 2016-11-01 2016-12-31 2
1 Sick 2016-12-01 2016-12-31 3
1 Active 2017-01-01 2017-02-04 1
1 Unpaid 2017-02-05 2017-02-09 1
1 Active 2017-02-10 2017-02-11 1
1 Unpaid 2017-02-12 2017-04-30 1
1 Unpaid 2017-03-01 2017-04-30 2
1 Unpaid 2017-04-01 2017-04-30 3
1 Active 2017-05-01 2017-10-13 1
1 Sick 2017-10-14 2017-11-11 1
1 Active 2017-11-12 NULL 1
You could use lag() and lead() function together to check the previous and next status
WITH CTE AS
(
select *,
COALESCE(LEAD(status) OVER(ORDER BY (select 1)), '0') Nstatus,
COALESCE(LAG(status) OVER(ORDER BY (select 1)), '0') Pstatus
from table
)
SELECT * FROM CTE
WHERE (status <> Nstatus AND status <> Pstatus) OR
(status <> Pstatus)