Collapse consecutive similar records into a single record - sql

I have records of people from an old system that I'm trying to convert over to the new system. In the old system, a person might end up with several records for the same location. They could also go from location, to another, and then return to the previous location. Here's some example data:
PersonID | LocationID | StartDate | EndDate
1 | 1 | 1980-07-30 | 2007-07-16
1 | 1 | 2007-07-16 | 2008-01-30
1 | 2 | 2008-01-30 | 2009-03-02
1 | 2 | 2009-03-02 | 2009-11-06
1 | 3 | 2014-07-16 | 2015-01-16
1 | 1 | 2016-01-26 | 2999-12-31
I would like to collapse this data so that I get a date range for any consecutive LocationIDs. For the data above, this is what I would expect:
PersonID | LocationID | StartDate | EndDate
1 | 1 | 1980-07-30 | 2008-01-30
1 | 2 | 2008-01-30 | 2009-11-06
1 | 3 | 2014-07-16 | 2015-01-16
1 | 1 | 2016-01-26 | 2999-12-31
I'm unsure as to how to do this. I previously tried joining to the previous record, but that only works when there's two consecutive locations, not with 3 or more (there could be an undefined number of consecutive records).
select
a.PersonID,
a.LocationID,
a.StartDate,
a.EndDate,
case when a.LocationID = b.LocationID then a.PK_ID else b.PK_ID end as NewID
from employees a
left outer join employees b
on a.PersonID = b.PersonID
and a.PK_ID = b.PK_ID - 1
So, how can I write a query to get the results I need?
Note: we're treating '2999-12-31' are our 'NULL' date field

This is a classic Gaps-and-Islands (Edit- corrected for larger span 2999)
Select [PersonID]
,[LocationID]
,[StartDate] = min(D)
,[EndDate] = max(D)
From (
Select *
,Grp = Row_Number() over (Order By D) - Row_Number() over (Partition By [PersonID],[LocationID] Order By D)
from YourTable A
Cross Apply (
Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])
From master..spt_values n1,master..spt_values n2
) B
) G
Group By [PersonID],[LocationID],Grp
Order By [PersonID],min(D)
Returns
PersonID LocationID StartDate EndDate
1 1 1980-07-30 2008-01-30
1 2 2008-01-30 2009-11-06
1 3 2014-07-16 2015-01-16
1 1 2016-01-26 2999-12-31
Using your original query
Select [PersonID]
,[LocationID]
,[StartDate] = min(D)
,[EndDate] = max(D)
From (
Select *
,Grp = Row_Number() over (Order By D) - Row_Number() over (Partition By [PersonID],[LocationID] Order By D)
From (
-- Your Original Query
select
a.PersonID,
a.LocationID,
a.StartDate,
a.EndDate,
case when a.LocationID = b.LocationID then a.PK_ID else b.PK_ID end as NewID
from employees a
left outer join employees b
on a.PersonID = b.PersonID
and a.PK_ID = b.PK_ID - 1
) A
Cross Apply (
Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])
From master..spt_values n1,master..spt_values n2
) B
) G
Group By [PersonID],[LocationID],Grp
Order By [PersonID],min(D)
Requested Comments
Let's break it down to its components.
1) The CROSS APPLY Portion: This will expand a single record into N records. For example:
Declare #YourTable Table ([PersonID] int,[LocationID] int,[StartDate] date,[EndDate] date)
Insert Into #YourTable Values
(1,1,'1980-07-01','1980-07-03' )
,(1,1,'1980-07-02','1980-07-04' ) -- Notice the Overlap
,(1,2,'2008-01-30','2008-02-05')
Select *
from #YourTable A
Cross Apply (
Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])
From master..spt_values n1,master..spt_values n2
) B
The above query will generate
2) The Grp Portion: Perhaps easier if I provide a simple example:
Declare #YourTable Table ([PersonID] int,[LocationID] int,[StartDate] date,[EndDate] date)
Insert Into #YourTable Values
(1,1,'1980-07-01','1980-07-03' )
,(1,1,'1980-07-02','1980-07-04' ) -- Notice the Overlap
,(1,2,'2008-01-30','2008-02-05')
Select *
,Grp = Row_Number() over (Order By D) - Row_Number() over (Partition By [PersonID],[LocationID] Order By D)
,RN1 = Row_Number() over (Order By D)
,RN2 = Row_Number() over (Partition By [PersonID],[LocationID] Order By D)
from #YourTable A
Cross Apply (
Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])
From master..spt_values n1,master..spt_values n2
) B
The above query Generates:
RN1 and RN2 are breakouts of the GRP, just to illustrate the mechanic. Notice RN1 minus RN2 equals the GRP. Once we have the GRP, it becomes a simple matter of aggregation via a group by
3) Pulling it all Together:
Declare #YourTable Table ([PersonID] int,[LocationID] int,[StartDate] date,[EndDate] date)
Insert Into #YourTable Values
(1,1,'1980-07-01','1980-07-03' )
,(1,1,'1980-07-02','1980-07-04' ) -- Notice the Overlap
,(1,2,'2008-01-30','2008-02-05')
Select [PersonID]
,[LocationID]
,[StartDate] = min(D)
,[EndDate] = max(D)
From (
Select *
,Grp = Row_Number() over (Order By D) - Row_Number() over (Partition By [PersonID],[LocationID] Order By D)
from #YourTable A
Cross Apply (
Select Top (DateDiff(DAY,A.[StartDate],A.[EndDate])+1) D=DateAdd(DAY,-1+Row_Number() Over (Order By (Select Null)),A.[StartDate])
From master..spt_values n1,master..spt_values n2
) B
) G
Group By [PersonID],[LocationID],Grp
Order By [PersonID],min(D)
Returns

For your sample data, you can use the difference of row numbers approach:
select personid, locationid, min(startdate), max(enddate)
from (select e.*,
row_number() over (partition by personid order by startdate) as seqnum_p,
row_number() over (partition by personid, locationid order by startdate) as seqnum_pl
from employees e
) e
group by (seqnum_p - seqnum_pl), personid, locationid;
This assumes that the start and end dates are contiguous. That is, there is no gap for a given employee at the same location.

Related

Create episode for each value with new Begin and End Dates

This is in reference to below Question
Loop through each value to the seq num
But now Client want to see the data differently and started a new thread for this question.
below is the requirement.
This is the data .
ID seqNum DOS Service End Date
1 1 1/1/2017 1/15/2017
1 2 1/16/2017 1/16/2017
1 3 1/17/2017 1/21/2017
1 4 1/22/2017 2/13/2017
1 5 2/14/2017 3/21/2017
1 6 2/16/2017 3/21/2017
Expected outPut:
ID SeqNum DOSBeg DOSEnd
1 1 1/1/2017 1/30/2017
1 2 1/31/2017 3/1/2017
1 3 3/2/2017 3/31/2017
For each DOSBeg, add 29 and that is DOSEnd. then Add 1 to DOSEnd (1/31/2017) is new DOSBeg.
Now add 29 to (1/31/2017) and that is 3/1/2017 which is DOSEnd . Repeat this untill DOSend >=Max End Date i.e 3/21/2017.
Basically, we need episode of 29 days for each ID.
I tried with this code and it is giving me duplicates.
with cte as (
select ID, minDate as DOSBeg,dateadd(day,29,mindate) as DOSEnd
from #temp
union all
select ID,dateadd(day,1,DOSEnd) as DOSBeg,dateadd(day,29,dateadd(day,1,DOSEnd)) as DOSEnd
from cte
)
select ID,DOSBeg,DOSEnd
from cte
OPTION (MAXRECURSION 0)
Here mindate is Minimum DOS for this ID i.e. 1/1/2017
I came up with below logic and this is working fine for me. Is there any better way than this ?
declare #table table (id int, seqNum int identity(1,1), DOS date, ServiceEndDate date)
insert into #table
values
(1,'20170101','20170115'),
(1,'20170116','20170116'),
(1,'20170117','20170121'),
(1,'20170122','20170213'),
(1,'20170214','20170321'),
(1,'20170216','20170321'),
(2,'20170101','20170103'),
(2,'20170104','20170118')
select * into #temp from #table
--drop table #data
select distinct ID, cast(min(DOS) over (partition by ID) as date) as minDate
,row_Number() over (partition by ID order by ID, DOS) as SeqNum,
DOS,
max(ServiceEndDate) over (partition by ID)as maxDate
into #data
from #temp
--drop table #StartDateLogic
with cte as
(select ID,mindate as startdate,maxdate
from #data
union all
select ID,dateadd(day,30,startdate) as startdate,maxdate
from cte
where maxdate >= dateadd(day,30,startdate))
select distinct ID,startdate
into #StartDateLogic
from cte
OPTION (MAXRECURSION 0)
--final Result set
select ID
,ROW_NUMBER() over (Partition by ID order by ID,StartDate) as SeqNum
,StartDate
,dateadd(day,29,startdate) as EndDate
from #StartDateLogic
You were on the right track wit the recursive cte, but you forgot the anchor.
declare #table table (id int, seqNum int identity(1,1), DOS date, ServiceEndDate date)
insert into #table
values
(1,'20170101','20170115'),
(1,'20170116','20170116'),
(1,'20170117','20170121'),
(1,'20170122','20170213'),
(1,'20170214','20170321'),
(1,'20170216','20170321'),
(2,'20170101','20170103'),
(2,'20170104','20170118')
;with dates as(
select top 1 with ties id, seqnum, DOSBeg = DOS, DOSEnd = dateadd(day,29,DOS)
from #table
order by row_number() over (partition by id order by seqnum)
union all
select t.id, t.seqNum, DOSBeg = dateadd(day,1,d.DOSEnd), DOSEnd = dateadd(day,29,dateadd(day,1,d.DOSEnd))
from dates d
inner join #table t on
d.id = t.id and t.seqNum = d.seqNum + 1
)
select *
from dates d
where d.DOSEnd <= (select max(dateadd(month,1,ServiceEndDate)) from #table where id = d.id)
order by id, seqNum

First value in DATE minus 30 days SQL

I have bunch of data out of which I'm showing ID, max date and it's corresponding values (user id, type, ...). Then I need to take MAX date for each ID, substract 30 days and show first date and it's corresponding values within this date period.
Example:
ID Date Name
1 01.05.2018 AAA
1 21.04.2018 CCC
1 05.04.2018 BBB
1 28.03.2018 AAA
expected:
ID max_date max_name previous_date previous_name
1 01.05.2018 AAA 05.04.2018 BBB
I have working solution using subselects, but as I have quite huge WHERE part, refresh takes ages.
SUBSELECT looks like that:
(SELECT MIN(N.name)
FROM t1 N
WHERE N.ID = T.ID
AND (N.date < MAX(T.date) AND N.date >= (MAX(T.date)-30))
AND (...)) AS PreviousName
How'd you write the select?
I'm using TSQL
Thanks
I can do this with 2 CTEs to build up the dates and names.
SQL Fiddle
MS SQL Server 2017 Schema Setup:
CREATE TABLE t1 (ID int, theDate date, theName varchar(10)) ;
INSERT INTO t1 (ID, theDate, theName)
VALUES
( 1,'2018-05-01','AAA' )
, ( 1,'2018-04-21','CCC' )
, ( 1,'2018-04-05','BBB' )
, ( 1,'2018-03-27','AAA' )
, ( 2,'2018-05-02','AAA' )
, ( 2,'2018-05-21','CCC' )
, ( 2,'2018-03-03','BBB' )
, ( 2,'2018-01-20','AAA' )
;
Main Query:
;WITH cte1 AS (
SELECT t1.ID, t1.theDate, t1.theName
, DATEADD(day,-30,t1.theDate) AS dMinus30
, ROW_NUMBER() OVER (PARTITION BY t1.ID ORDER BY t1.theDate DESC) AS rn
FROM t1
)
, cte2 AS (
SELECT c2.ID, c2.theDate, c2.theName
, ROW_NUMBER() OVER (PARTITION BY c2.ID ORDER BY c2.theDate) AS rn
, COUNT(*) OVER (PARTITION BY c2.ID) AS theCount
FROM cte1
INNER JOIN cte1 c2 ON cte1.ID = c2.ID
AND c2.theDate >= cte1.dMinus30
WHERE cte1.rn = 1
GROUP BY c2.ID, c2.theDate, c2.theName
)
SELECT cte1.ID, cte1.theDate AS max_date, cte1.theName AS max_name
, cte2.theDate AS previous_date, cte2.theName AS previous_name
, cte2.theCount
FROM cte1
INNER JOIN cte2 ON cte1.ID = cte2.ID
AND cte2.rn=1
WHERE cte1.rn = 1
Results:
| ID | max_date | max_name | previous_date | previous_name |
|----|------------|----------|---------------|---------------|
| 1 | 2018-05-01 | AAA | 2018-04-05 | BBB |
| 2 | 2018-05-21 | CCC | 2018-05-02 | AAA |
cte1 builds the list of max_date and max_name grouped by the ID and then using a ROW_NUMBER() window function to sort the groups by the dates to get the most recent date. cte2 joins back to this list to get all dates within the last 30 days of cte1's max date. Then it does essentially the same thing to get the last date. Then the outer query joins those two results together to get the columns needed while only selecting the most and least recent rows from each respectively.
I'm not sure how well it will scale with your data, but using the CTEs should optimize pretty well.
EDIT: For the additional requirement, I just added in another COUNT() window function to cte2.
I would do:
select id,
max(case when seqnum = 1 then date end) as max_date,
max(case when seqnum = 1 then name end) as max_name,
max(case when seqnum = 2 then date end) as prev_date,
max(case when seqnum = 2 then name end) as prev_name,
from (select e.*, row_number() over (partition by id order by date desc) as seqnum
from example e
) e
group by id;

select top N records for each entity

I have a table like below -
ID | Reported Date | Device_ID
-------------------------------------------
1 | 2016-03-09 09:08:32.827 | 1
2 | 2016-03-08 09:08:32.827 | 1
3 | 2016-03-08 09:08:32.827 | 1
4 | 2016-03-10 09:08:32.827 | 2
5 | 2016-03-05 09:08:32.827 | 2
Now, i want a top 1 row based on date column for each device_ID
Expected Output
ID | Reported Date | Device_ID
-------------------------------------------
1 | 2016-03-09 09:08:32.827 | 1
4 | 2016-03-10 09:08:32.827 | 2
I am using SQL Server 2008 R2. i can go and write Stored Procedure to handle it but wanted do it with simple query.
****************EDIT**************************
Answer by 'Felix Pamittan' worked well but for 'N' just change it to
SELECT
Id, [Reported Date], Device_ID
FROM (
SELECT *,
Rn = ROW_NUMBER() OVER(PARTITION BY Device_ID ORDER BY [ReportedDate] DESC)
FROM tbl
)t
WHERE Rn >= N
He had mentioned this in comment thought to add it to questions so that no body miss it.
Use ROW_NUMBER:
SELECT
Id, [Reported Date], Device_ID
FROM (
SELECT *,
Rn = ROW_NUMBER() OVER(PARTITION BY Device_ID ORDER BY [ReportedDate] DESC)
FROM tbl
)t
WHERE Rn = 1
You can also try using CTE
With DeviceCTE AS
(SELECT *, ROW_NUMBER() OVER(PARTITION BY Device_ID ORDER BY [Reported Date] DESC) AS Num
FROM tblname)
SELECT Id, [Reported Date], Device_ID
From DeviceCTE
Where Num = 1
If you can't use an analytic function, e.g. because your application layer won't allow it, then you can try the following solution which uses a subquery to arrive at the answer:
SELECT t1.ID, t2.maxDate, t1.Device_ID
INNER JOIN
(
SELECT Device_ID, MAX([Reported Date]) AS maxDate
FROM yourTable
GROUP BY Device_ID
) t2
ON t1.Device_ID = t2.Device_ID AND t1.[Reported Date] = t2.maxDate
Select * from DEVICE_TABLE D
where [Reported Date] = (Select Max([Reported Date]) from DEVICE_TABLE where Device_ID = D.Device_ID)
should do the trick, assume that "top 1 row based on date column" means that you want to select the latest reported date of each Device_ID ?
As for your title, select top 5 rows of each Device_ID
Select * from DEVICE_TABLE D
where [Reported Date] in (Select top 5 [Reported Date] from DEVICE_TABLE D where Device_ID = D.Device_ID)
order by Device_ID, [Reported Date] desc
will give you the top 5 latest reports of each device id.
You may want to sort out the top 5 date if your data isn't in order...
Again with no analytic functions you can use CROSS APPLY :
DECLARE #tbl TABLE(Id INT,[Reported Date] DateTime , Device_ID INT)
INSERT INTO #tbl
VALUES
(1,'2016-03-09 09:08:32.827',1),
(2,'2016-03-08 09:08:32.827',1),
(3,'2016-03-08 09:08:32.827',1),
(4,'2016-03-10 09:08:32.827',2),
(5,'2016-03-05 09:08:32.827',2)
SELECT r.*
FROM ( SELECT DISTINCT Device_ID FROM #tbl ) d
CROSS APPLY ( SELECT TOP 1 *
FROM #tbl t
WHERE d.Device_ID = t.Device_ID ) r
Can be easily modified to support N records.
Credits go to wBob answering this question here

TSQL getting max and min date with a seperate but not unique record

example table:
test_date | test_result | unique_ID
12/25/15 | 100 | 50
12/01/15 | 150 | 75
10/01/15 | 135 | 75
09/22/14 | 99 | 50
04/10/13 | 125 | 50
I need to find the first and last test date as well as the test result to match said date by user. So, I can group by ID, but not test result.
SELECT MAX(test_date)[need matching test_result],
MIN(test_date) [need matching test_result],
unique_id
from [table]
group by unique_id
THANKS!
Create TABLE #t
(
test_date date ,
Test_results int,
Unique_id int
)
INSERT INTO #t
VALUES ( '12/25/15',100,50 ),
( '12/01/15',150,75 ),
( '10/01/15',135,75 ),
( '09/22/14',99,50 ),
( '04/10/13',125,50 )
select 'MinTestDate' as Type, a.test_date, a.Test_results, a.Unique_id
from #t a inner join (
select min(test_date) as test_datemin, max(test_date) as test_datemax, unique_id from #t
group by unique_ID) b
on a.test_date = b.test_datemin
union all
select 'MaxTestDate' as Type, a.test_date, a.Test_results, a.Unique_id from #t a
inner join (
select min(test_date) as test_datemin, max(test_date) as test_datemax, unique_id from #t
group by unique_ID) b
on a.test_date = b.test_datemax
I would recommend window functions. The following returns the information on 2 rows per id:
select t.*
from (select t.*,
row_number() over (partition by unique_id order by test_date) as seqnum_asc,
row_number() over (partition by unique_id order by test_date desc) as seqnum_desc
from table t
) t;
For one row, use conditional aggregation (or pivot if you prefer):
select unique_id,
min(test_date), max(case when seqnum_asc = 1 then test_result end),
max(test_date), max(case when seqnum_desc = 1 then test_result end)
from (select t.*,
row_number() over (partition by unique_id order by test_date) as seqnum_asc,
row_number() over (partition by unique_id order by test_date desc) as seqnum_desc
from table t
) t
group by unique_id;
Consider using a combination of self-joins and derived tables:
SELECT t1.unique_id, minTable.MinOftest_date, t1.test_result As Mintestdate_result,
maxTable.MaxOftest_date, t2.test_result As Maxtestdate_result
FROM TestTable AS t1
INNER JOIN
(
SELECT Min(TestTable.test_date) AS MinOftest_date,
TestTable.unique_ID
FROM TestTable
GROUP BY TestTable.unique_ID
) As minTable
ON (t1.test_date = minTable.MinOftest_date
AND t1.unique_id = minTable.unique_id)
INNER JOIN TestTable As t2
INNER JOIN
(
SELECT Max(TestTable.test_date) AS MaxOftest_date,
TestTable.unique_ID
FROM TestTable
GROUP BY TestTable.unique_ID
) AS maxTable
ON t2.test_date = maxTable.MaxOftest_date
AND t2.unique_ID = maxTable.unique_ID
ON minTable.unique_id = maxTable.unique_id;
OUTPUT
unique_id MinOftest_date Mintestdate_result MaxOftest_date Maxtestdate_result
50 4/10/2013 125 12/25/2015 100
75 10/1/2015 135 12/1/2015 150

TSQL - Get Data grouped by row number

I have a table with ID and Date field
ID |Date
1 |2013-5-22
1 |2013-5-23
1 |2013-5-25
1 |2013-5-26
2 |2013-5-26
2 |2013-5-27
1 |2013-5-27
1 |2013-5-28
With the Row_Number i can group all data by id and ghet the Min date and Max Date
;WITH q AS(
SELECT f.*,
grp = DATEDIFF(day, 0, f.Date) - ROW_NUMBER() OVER (PARTITION BY f.ID ORDER BY f.Date),
FROM myTable f
)
SELECT
MIN(q.ID) as ID,
MIN(q.Date) as StartDate,
MAX(q.Date) as EndDate
FROM q
GROUP BY q.grp, q.ID, Date
;
Result:
ID |StartDate |EndDate
1 |2013-5-22 |2013-5-23
2 |2013-5-26 |2013-5-27
1 |2013-5-25 |2013-5-28
Now i need to get the date step by <= 3
Example:
ID |StartDate |EndDate
1 |2013-5-22 |2013-5-23
2 |2013-5-26 |2013-5-27
1 |2013-5-25 |2013-5-27
1 |2013-5-28 |2013-5-28
Can someone, please, illuminate my way?
ty
EDIT
Sorry
;WITH q AS(
SELECT f.*,
grp = DATEDIFF(day, 0, f.Date) - ROW_NUMBER() OVER (PARTITION BY f.ID ORDER BY f.Date)
FROM MyTable f
)
SELECT
MIN(q.ID) as ID,
MIN(q.Date) as StartDate,
MAX(q.Date) as EndDate
FROM q
GROUP BY q.grp, q.ID
;
My first attempt had a bug, try this instead:
;WITH q AS(
SELECT ID, Date,
grp = DATEDIFF(day, 0, Date) - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Date)
FROM myTable
), r as
(
select id, date, grp,
(ROW_NUMBER() OVER (PARTITION BY grp ORDER BY Date)-1)/3 a from q
)
SELECT
MIN(ID) as ID,
MIN(Date) as StartDate,
MAX(Date) as EndDate
FROM r
GROUP BY grp, ID, a