Group Sequential SQL Records


Looking for a way to group sequential timeclock records into a single row.
The source system has an identity column, employee id, date and in/out flag (1 = in, 2 = out):
ID       EmployeeID  DATE                 InOut
1019374  5890        2008-08-19 14:07:14  1
1019495  5890        2008-08-19 18:17:08  2
1019504  5890        2008-08-19 18:50:40  1
1019601  5890        2008-08-19 22:06:18  2
I am looking for SQL that would give me the following result:
EmployeeID  ClockIn              BreakStart           BreakEnd             ClockOut
5890        2008-08-19 14:07:14  2008-08-19 18:17:08  2008-08-19 18:50:40  2008-08-19 22:06:18
Note that the ID in the source system is not always sequential because of timeclock edits. Date should be chronological. If only two punches exist, I will need to have the clock in and clock out dates populated with no breaks (or something consistent that I can extract with a case statement). No breaks example below:
EmployeeID  ClockIn              BreakStart  BreakEnd  ClockOut
5890        2008-08-19 14:07:14                        2008-08-19 22:06:18
The SQL Server version is 2008 R2.
Thanks in advance, I can't figure out how to get this to work consistently and your help is greatly appreciated.

You can do this with the ROW_NUMBER() function and a windowed COUNT() to handle the no break days:
-- dt is the punch timestamp (the DATE column in the question)
;with cte as (
    SELECT *
          ,ROW_NUMBER() OVER(PARTITION BY EmployeeID, CAST(dt AS DATE) ORDER BY dt) RN   -- punch order within the day
          ,COUNT(*) OVER(PARTITION BY EmployeeID, CAST(dt AS DATE)) Dt_CT                -- punches on that day
    FROM Table1
)
SELECT EmployeeID
      ,Dt = CAST(dt AS DATE)
      ,ClockIn = MAX(CASE WHEN RN = 1 THEN DT END)
      ,BreakStart = MAX(CASE WHEN Dt_CT = 4 AND RN = 2 THEN DT END)
      ,BreakEnd = MAX(CASE WHEN Dt_CT = 4 AND RN = 3 THEN DT END)
      ,ClockOut = MAX(CASE WHEN (Dt_CT = 2 AND RN = 2) OR RN = 4 THEN DT END)
FROM cte
GROUP BY EmployeeID
        ,CAST(dt AS DATE)
Demo: SQL Fiddle
This groups by calendar day, so a clock-out after midnight would not be handled, and an odd number of punches would also be a problem, but for simple data like your example this will do.
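To check the query's output against the stated rules outside the database, the same per-day pivot can be sketched in plain Python. This is only an illustration under assumptions: the function name `pivot_punches` and the tuple layout are hypothetical, and the branch logic mirrors the CASE expressions above (breaks are filled only on four-punch days).

```python
from collections import defaultdict
from datetime import datetime


def pivot_punches(rows):
    """rows: (employee_id, punch_time) tuples. Returns one dict per employee and day."""
    by_day = defaultdict(list)
    for employee_id, punch_time in rows:
        by_day[(employee_id, punch_time.date())].append(punch_time)

    result = []
    for (employee_id, day), punches in sorted(by_day.items()):
        punches.sort()    # chronological, like ORDER BY dt
        n = len(punches)  # plays the role of Dt_CT
        result.append({
            "EmployeeID": employee_id,
            "Dt": day,
            "ClockIn": punches[0],
            "BreakStart": punches[1] if n == 4 else None,
            "BreakEnd": punches[2] if n == 4 else None,
            # second punch on a two-punch day, otherwise the fourth punch if there is one
            "ClockOut": punches[1] if n == 2 else (punches[3] if n >= 4 else None),
        })
    return result


# Sample rows taken from the question.
fmt = "%Y-%m-%d %H:%M:%S"
sample = [(5890, datetime.strptime(s, fmt)) for s in (
    "2008-08-19 14:07:14", "2008-08-19 18:17:08",
    "2008-08-19 18:50:40", "2008-08-19 22:06:18",
)]
print(pivot_punches(sample))
```

As with the query, a day with three punches ends up with a ClockIn only, which makes odd punch counts easy to spot.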

Related

Using RANK OVER PARTITION to Compare a Previous Row Result

I'm working with a dataset that contains (among other columns) a userID and startDate. The goal is to have a new column "isRehire" that compares their startDate with previous startDates.
If the difference between startDates is within 1 year, isRehire = Y.
The difficulty and my issue comes in when there are more than 2 startDates for a user. If the difference between the 3rd and 1st startDate is over a year, the 3rd startDate would be the new "base date" for being a rehire.
userID  startDate  isRehire
123     07/24/19   N
123     02/04/20   Y
123     08/25/20   N
123     12/20/20   Y
123     06/15/21   Y
123     08/20/21   Y
123     08/30/21   N
In the above example you can see the issue visualized. The first startDate 07/24/19, the user is not a Rehire. The second startDate 02/04/20, they are a Rehire. The 3rd startDate 08/25/20 the user is not a rehire because it has been over 1 year since their initial startDate. This is the new "anchor" date.
The next 3 instances are all Y as they are within 1 year of the new "anchor" date of 08/25/20. The final startDate of 08/30/21 is over a year past 08/25/20, indicating a "N" and the "cycle" resets again with 08/30/21 as the new "anchor" date.
My goal is to utilize RANK OVER PARTITION to be able to complete this, as from my testing I believe there must be a way to assign ranks to the dates which can then be wrapped in a select statement for a CASE expression to be written. Although it's completely possible I'm barking up the wrong tree entirely.
Below you can see some of the code I've attempted to use to complete this, although without much success so far.
select TestRank,
       startDate,
       userID,
       CASE WHEN TestRank = TestRank THEN (TestRank - 1
       ) ELSE '' END AS TestRank2
from
(
    select userID,
           startDate,
           RANK() OVER (PARTITION BY userID
                        ORDER BY startDate desc)
           as TestRank
    from [MyTable] a
    WHERE a.userID = [int]
) b

This is complicated logic, and window functions are not sufficient. To solve this, you need iteration -- or in SQL-speak, a recursive CTE:
-- id stands for the question's userID column
with t as (
      select t.*, row_number() over (partition by id order by startdate) as seqnum
      from mytable t
     ),
     cte as (
      select t.id, t.startdate, t.seqnum, 'N' as isrehire, t.startdate as anchordate
      from t
      where seqnum = 1
      union all
      select t.id, t.startdate, t.seqnum,
             (case when t.startdate > dateadd(year, 1, cte.anchordate) then 'N' else 'Y' end),
             (case when t.startdate > dateadd(year, 1, cte.anchordate) then t.startdate else cte.anchordate end)
      from cte join
           t
           -- match on id as well, so each user only walks through its own rows
           on t.id = cte.id and t.seqnum = cte.seqnum + 1
     )
select *
from cte
order by id, startdate;
Here is a db<>fiddle.
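The anchor-date rule is easier to see as a plain loop, which is what the recursive CTE emulates. The following Python sketch is only a cross-check under assumptions: `flag_rehires` and `add_one_year` are hypothetical helper names, and the input is the start dates of a single user.

```python
from datetime import date


def add_one_year(d):
    """Same day next year; Feb 29 falls back to Feb 28, as DATEADD(year, 1, ...) does."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def flag_rehires(start_dates):
    """Returns (start_date, flag) pairs for one user, in chronological order."""
    flags = []
    anchor = None
    for d in sorted(start_dates):
        if anchor is None or d > add_one_year(anchor):
            flags.append((d, "N"))  # first hire, or more than a year past the anchor
            anchor = d              # this date becomes the new anchor
        else:
            flags.append((d, "Y"))
    return flags


# Start dates for user 123 from the question.
dates = [date(2019, 7, 24), date(2020, 2, 4), date(2020, 8, 25), date(2020, 12, 20),
         date(2021, 6, 15), date(2021, 8, 20), date(2021, 8, 30)]
for start_date, flag in flag_rehires(dates):
    print(start_date, flag)
```

Each row's flag depends on an anchor that was itself decided by earlier rows, which is why a single RANK() or LAG() over the partition cannot express it.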

SQL - Combine two rows if difference is below threshold

I have a table like this in SQL Server:
id start_time end_time
1 10:00:00 10:34:00
2 10:38:00 10:52:00
3 10:53:00 11:23:00
4 11:24:00 11:56:00
5 14:20:00 14:40:00
6 14:41:00 14:59:00
7 15:30:00 15:40:00
What I would like to have is a query that outputs consolidated records based on the time difference between two consecutive records (end_time of row n and start_time of row n+1). All records where the time difference is less than 2 minutes should be combined into one time entry, and the ID of the first record should be kept. This should also combine more than two records if multiple consecutive records have a time difference of less than 2 minutes.
This would be the expected output:
id start_time end_time
1 10:00:00 10:34:00
2 10:38:00 11:56:00
5 14:20:00 14:59:00
7 15:30:00 15:40:00
Thanks in advance for any tips how to build the query.
Edit:
I started with following code to calculate the lead_time and the time difference but do not know how to group and consolidate.
WITH rows AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Id) AS rn
FROM #temp
)
SELECT mc.id, mc.start_time, mc.end_time, mp.start_time lead_time, DATEDIFF(MINUTE, mc.[end_time], mp.[start_time]) as DiffToNewSession
FROM rows mc
LEFT JOIN rows mp
ON mc.rn = mp.rn - 1

Window functions in T-SQL can handle many calculations of this kind, for example:
create table #temp(id int identity(1,1), start_time time, end_time time)
insert into #temp(start_time, end_time)
values ('10:00:00', '10:34:00')
, ('10:38:00', '10:52:00')
, ('10:53:00', '11:23:00')
, ('11:24:00', '11:56:00')
, ('14:20:00', '14:40:00')
, ('14:41:00', '14:59:00')
, ('15:30:00', '15:40:00')
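;with c0 as(
    -- previous row's end_time; the first row gets the default '00:00:00'
    select *, LAG(end_time,1,'00:00:00') over (order by id) as lag_time
    from #temp
), c1 as(
    -- gflag = 1 when the row continues the previous session, 0 when it starts a new one
    select *, case when DATEDIFF(MI, lag_time, start_time) <= 2 then 1 else 0 end as gflag
    from c0
), c2 as(
    -- running count of session starts gives a group id
    select *, SUM(case when gflag=0 then 1 else 0 end) over(order by id) as gid
    from c1
)
select MIN(id) as id, MIN(start_time) as start_time, MAX(end_time) as end_time
from c2
group by gid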
To make the construction steps easier to follow, I used c0, c1, c2... for the successive levels; you can merge some levels and optimize.
If you can't use id as the sort key, change the ORDER BY parts of the statement above.
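The flag-then-running-sum idea above amounts to a single pass that either extends the current session or starts a new one. Here is a minimal Python sketch of that pass for checking expected output; `merge_sessions` and `hhmmss` are hypothetical names, the rows are assumed to be sorted already, and the gap is measured as elapsed time with `<=`, following the answer rather than DATEDIFF's boundary counting.

```python
from datetime import datetime, timedelta


def hhmmss(text):
    """Parse 'HH:MM:SS' into a datetime (the date part is irrelevant here)."""
    return datetime.strptime(text, "%H:%M:%S")


def merge_sessions(rows, max_gap=timedelta(minutes=2)):
    """rows: (id, start, end) tuples sorted by start. Returns merged (id, start, end) tuples."""
    merged = []
    for row_id, start, end in rows:
        if merged and start - merged[-1][2] <= max_gap:
            first_id, first_start, prev_end = merged[-1]
            # extend the current session, keeping the first id and start
            merged[-1] = (first_id, first_start, max(prev_end, end))
        else:
            merged.append((row_id, start, end))  # gap too large: new session
    return merged


# Sample rows from the question.
sample = [
    (1, hhmmss("10:00:00"), hhmmss("10:34:00")),
    (2, hhmmss("10:38:00"), hhmmss("10:52:00")),
    (3, hhmmss("10:53:00"), hhmmss("11:23:00")),
    (4, hhmmss("11:24:00"), hhmmss("11:56:00")),
    (5, hhmmss("14:20:00"), hhmmss("14:40:00")),
    (6, hhmmss("14:41:00"), hhmmss("14:59:00")),
    (7, hhmmss("15:30:00"), hhmmss("15:40:00")),
]
for row_id, start, end in merge_sessions(sample):
    print(row_id, start.time(), end.time())
```

If the requirement is strictly "less than 2 minutes", the comparison would need to be `<` instead; the sample data gives the same result either way.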

You can use a recursive CTE to get the result that you want. This method simply compares the current end_time with the next start_time. If the gap is within the 2-minute threshold, the row keeps the same start_time as grp_start. At the end, simply do a GROUP BY on grp_start:
with rcte as
(
    -- anchor member
    select *, grp_start = start_time
    from   tbl
    where  id = 1
    union all
    -- recursive member
    select t.id, t.start_time, t.end_time,
           grp_start = case when datediff(second, r.end_time, t.start_time) <= 120
                            then r.grp_start
                            else t.start_time
                       end
    from   tbl t
           inner join rcte r on t.id = r.id + 1
)
select id = min(id), grp_start as start_time, max(end_time) as end_time
from   rcte
group by grp_start
demo

I guess this should do the trick without recursion. Again, I used several CTEs to make the solution a bit easier to read; it can probably be reduced a little.
INSERT INTO T1 VALUES
(1,'10:00:00','10:34:00')
,(2,'10:38:00','10:52:00')
,(3,'10:53:00','11:23:00')
,(4,'11:24:00','11:56:00')
,(5,'14:20:00','14:40:00')
,(6,'14:41:00','14:59:00')
,(7,'15:30:00','15:40:00')
GO
WITH cte AS(
    SELECT *
          ,ROW_NUMBER() OVER (ORDER BY id) AS rn
          ,DATEDIFF(MINUTE, ISNULL(LAG(endtime) OVER (ORDER BY id), starttime), starttime) AS diffMin
          ,COUNT(*) OVER (PARTITION BY (SELECT 1)) as maxRn
    FROM T1
),
cteFirst AS(
    SELECT *
    FROM cte
    WHERE rn = 1 OR diffMin > 2
),
cteGrp AS(
    SELECT *
          ,ISNULL(LEAD(rn) OVER (ORDER BY id), maxRn+1) AS nextRn
    FROM cteFirst
)
SELECT f.id, f.starttime, MAX(ISNULL(n.endtime, f.endtime)) AS endtime
FROM cteGrp f
LEFT JOIN cte n ON n.rn >= f.rn AND n.rn < f.nextRn
GROUP BY f.id, f.starttime

SQL Date intelligence: filtering data by seconds ran from last known valid result

Help! We're trying to create a new column (Is Valid?) to reproduce the logic below.
It is a binary result:
It is 1 if it is the first known value of an ID.
It is 1 if it is 3 seconds or more after the previous "1" of that ID.
Note 1: this is not the difference in seconds from the previous record.
It is 0 if it is less than 3 seconds after the previous "1" of that ID.
Note 2: there are many IDs in the data set.
Note 3: the original dataset has ID and Date.
Attached a PoC of the data and the expected result.

You would have to do this using a recursive CTE, which is quite expensive:
with recursive tt as (
      select t.*, row_number() over (partition by id order by time) as seqnum
      from t
     ),
     cte as (
      -- anchor member: the first row of each id starts the first group
      select tt.*, tt.time as grp_start
      from tt
      where seqnum = 1
      union all
      -- recursive member: keep the group start while the row is within 3 seconds of it
      select tt.*,
             (case when tt.time < cte.grp_start + interval '3 second'
                   then cte.grp_start
                   else tt.time
              end)
      from cte join
           tt
           on tt.id = cte.id and tt.seqnum = cte.seqnum + 1
     )
select cte.*,
       (case when grp_start = lag(grp_start) over (partition by id order by time)
             then 0 else 1
        end) as isValid
from cte;
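Because the rule compares each row with the last row that was marked 1, not with the previous row, it reads naturally as a loop that carries the last valid timestamp per ID. A small Python sketch of that loop follows; `mark_valid` is a hypothetical name and the timestamps are made-up sample values, since the PoC data from the question is not available here.

```python
from datetime import datetime, timedelta


def mark_valid(rows, min_gap=timedelta(seconds=3)):
    """rows: (id, timestamp) tuples. Returns (id, timestamp, is_valid) sorted by id and timestamp."""
    last_valid = {}  # id -> timestamp of the most recent row marked 1
    out = []
    for row_id, ts in sorted(rows):
        prev = last_valid.get(row_id)
        if prev is None or ts - prev >= min_gap:
            last_valid[row_id] = ts  # this row becomes the new reference point
            out.append((row_id, ts, 1))
        else:
            out.append((row_id, ts, 0))
    return out


# Made-up sample: offsets in seconds from an arbitrary base time, for two IDs.
base = datetime(2024, 1, 1)
rows = [(1, base + timedelta(seconds=s)) for s in (0, 1, 2, 3, 4, 6, 7)]
rows += [(2, base + timedelta(seconds=s)) for s in (0, 2, 5)]
for row_id, ts, is_valid in mark_valid(rows):
    print(row_id, ts.time(), is_valid)
```

Note how the row at second 6 for ID 1 is valid even though it is only 2 seconds after the previous record: it is 3 seconds after the previous valid one.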

Case statement based in max min dates

I have columns Memnumber, activity type, activity date and activity ID. One member can have further activities a few days apart. I want to write a case statement that labels a row INITIAL if its activity date is the earliest, MR if it is the most recent, and BETWEEN if it falls between those two dates. The rows need to be grouped by Memnumber and treatment type.
I wrote query as :
--MR County Tree
SELECT T0.MEMBERNUMBER,
T0.ACTIVITYTYPE,
T1.MR_CY17,
T1.IN_CY17,
T0.ACTIVITY_DATE,
(T0.ACTIVITYID)
FROM DLA_EXTRACT_FINAL T0
INNER JOIN (
SELECT MEMBERNUMBER,
ACTIVITYTYPE,
MAX(ACTIVITY_DATE) MR_CY17,
MIN(ACTIVITY_DATE) IN_CY17
FROM DLA20_EXTRACT_FINAL
WHERE to_char(ACTIVITY_DATE, 'YYYYMMDD') >= 20170101
AND to_char(ACTIVITY_DATE, 'YYYYMMDD') <= 20171231
GROUP BY MEMBERNUMBER,
ACTIVITYTYPE
) T1 ON T0.MEMBERNUMBER = T1.MEMBERNUMBER
AND T0.ACTIVITYTYPE = T1.ACTIVITYTYPE
AND T0.ACTIVITY_DATE = T1.MR_CY17
--where T0.ACTIVITYTYPE='MT'
WHERE t0.MEMBERNUMBER = 'M500085268'
GROUP BY T0.MEMBERNUMBER,
T0.ACTIVITYTYPE,
T1.MR_CY17,
T1.IN_CY17,
T0.ACTIVITYID,
T0.ACTIVITY_DATE
ORDER BY T0.MEMBERNUMBER,
T0.ACTIVITYTYPE,
T1.MR_CY17,
T1.IN_CY17
Looking for a solution.

You want to use window functions. Something like:
SELECT T0.MEMBERNUMBER,
       T0.ACTIVITYTYPE,
       T0.ACTIVITY_DATE,
       T0.ACTIVITYID,
       case when row_number() over (partition by T0.MEMBERNUMBER, T0.ACTIVITYTYPE
                                    order by T0.ACTIVITY_DATE) = 1 then 1 else 0 end most_initial,
       case when row_number() over (partition by T0.MEMBERNUMBER, T0.ACTIVITYTYPE
                                    order by T0.ACTIVITY_DATE desc) = 1 then 1 else 0 end most_recent
FROM DLA_EXTRACT_FINAL T0
Then you can use case expressions to label a row as INITIAL if most_initial = 1, MR if most_recent = 1, or BETWEEN if both are 0.
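The labelling step described above can be sketched in Python to sanity-check the expected labels. Everything here is illustrative: `label_activities`, the member number and the dates are hypothetical. One difference from row_number() is that rows tied on the earliest or latest date all receive the label, whereas row_number() would pick a single row.

```python
from datetime import date


def label_activities(rows):
    """rows: (member, activity_type, activity_date) tuples. Returns rows with a label appended."""
    bounds = {}  # (member, activity_type) -> (earliest date, latest date)
    for member, activity_type, activity_date in rows:
        key = (member, activity_type)
        lo, hi = bounds.get(key, (activity_date, activity_date))
        bounds[key] = (min(lo, activity_date), max(hi, activity_date))

    labeled = []
    for member, activity_type, activity_date in rows:
        first, last = bounds[(member, activity_type)]
        if activity_date == first:
            label = "INITIAL"  # checked first, so a lone activity is INITIAL
        elif activity_date == last:
            label = "MR"
        else:
            label = "BETWEEN"
        labeled.append((member, activity_type, activity_date, label))
    return labeled


# Hypothetical sample rows.
rows = [
    ("M1", "MT", date(2017, 1, 5)),
    ("M1", "MT", date(2017, 3, 9)),
    ("M1", "MT", date(2017, 6, 1)),
    ("M1", "PT", date(2017, 2, 2)),
]
for row in label_activities(rows):
    print(row)
```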

SQL Server - Conditionally Increment a Counter

What I'm looking to do is create grouped sequences for continuous date ranges. Take the following sample data:
Person|BeginDate |EndDate
A |1/1/2015 |1/31/2015
A |2/1/2015 |2/28/2015
A |4/1/2015 |4/30/2015
A |5/1/2015 |5/31/2015
B |1/1/2015 |1/30/2015
B |8/1/2015 |8/30/2015
B |9/1/2015 |9/30/2015
If BeginDate in the current row is >1 day from the EndDate in the previous row then increment the counter by 1, otherwise assign the counter's current value. The sequencing would look like :
Person|BeginDate |EndDate |Sequence
A |1/1/2015 |1/31/2015|1
A |2/1/2015 |2/28/2015|1
A |4/1/2015 |4/30/2015|2
A |5/1/2015 |5/31/2015|2
B |1/1/2015 |1/30/2015|1
B |8/1/2015 |8/30/2015|2
B |9/1/2015 |9/30/2015|2
Partitioned and reset for each person.
For your testing :
CREATE TABLE ##SequencingTest(
Person char(1)
,BeginDate date
,EndDate date)
INSERT INTO ##SequencingTest
VALUES
('A','1/1/2015','1/31/2015')
,('A','2/1/2015','2/28/2015')
,('A','4/1/2015','4/30/2015')
,('A','5/1/2015','5/31/2015')
,('B','1/1/2015','1/30/2015')
,('B','8/15/2015','8/31/2015')
,('B','9/1/2015','9/30/2015')

You can do this with lag() and then a cumulative sum:
select t.*,
       sum(flag) over (partition by person order by begindate) as sequence
from (select t.*,
             (case when datediff(day, lag(endDate) over (partition by person order by begindate), begindate) < 2
                   then 0
                   else 1
              end) as flag
      from t
     ) t;
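The lag-plus-cumulative-sum pattern is equivalent to walking each person's rows in order and bumping a counter whenever the gap to the previous EndDate is 2 days or more. A short Python sketch of that walk, useful for checking the expected sequence values, is shown below; `assign_sequences` is a hypothetical name and the rows are the test data from the question's INSERT statement.

```python
from datetime import date


def assign_sequences(rows):
    """rows: (person, begin_date, end_date) tuples. Returns rows with a sequence number appended."""
    out = []
    prev_person, prev_end, seq = None, None, 0
    for person, begin, end in sorted(rows):
        if person != prev_person:
            seq = 1    # the counter resets for each person
        elif (begin - prev_end).days >= 2:
            seq += 1   # gap of 2 or more days: new sequence
        out.append((person, begin, end, seq))
        prev_person, prev_end = person, end
    return out


rows = [
    ("A", date(2015, 1, 1), date(2015, 1, 31)),
    ("A", date(2015, 2, 1), date(2015, 2, 28)),
    ("A", date(2015, 4, 1), date(2015, 4, 30)),
    ("A", date(2015, 5, 1), date(2015, 5, 31)),
    ("B", date(2015, 1, 1), date(2015, 1, 30)),
    ("B", date(2015, 8, 15), date(2015, 8, 31)),
    ("B", date(2015, 9, 1), date(2015, 9, 30)),
]
for row in assign_sequences(rows):
    print(row)
```

With the end date 8/30/2015 from the question's display table instead of 8/31/2015, the last row for B would start a third sequence, because 8/30 to 9/1 is a 2-day gap.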

If the continuous end dates are always 1 day before the next start date you could do something really primitive like this:
SELECT S1.Person, S1.BeginDate, S1.EndDate, SUM(S2.Cntr) AS Sequence
FROM Sequencing S1
INNER JOIN (SELECT Person, BeginDate,
                   CASE WHEN EXISTS (SELECT Person FROM Sequencing S2 WHERE S2.[EndDate] =
                        DATEADD(d, -1, S1.[BeginDate]) AND S2.Person = S1.Person) THEN 0 ELSE 1 END AS Cntr
            FROM [Sequencing] S1
           ) S2
    ON S1.Person = S2.Person
   AND S1.BeginDate >= S2.BeginDate
GROUP BY S1.Person, S1.BeginDate, S1.EndDate
ORDER BY S1.Person, S1.BeginDate, S1.EndDate
Note I think you meant to say '1/31/2015' and '8/31/2015' as end dates to work with your example.
Also, @GordonLinoff's answer is probably better. I simply do not have the version of SQL Server to test it with.
