SQL Server Query In and Out - sql

This data comes from a DTR device; I saved it in an MS SQL Server database:
ID | Employee_ID | Date | InOutMode
-------+-------------+---------------------+-----------
70821 | 104 | 2019-10-11 19:00:00 | 0
70850 | 104 | 2019-10-12 07:01:00 | 1
If I separate the IN and OUT, it is supposed to look like this:
ID | Employee_ID | IN | OUT
-------+-------------+---------------------+-----------
70821 | 104 | 2019-10-11 19:00:00 | 2019-10-12 07:01:00
I don't know whether my queries are wrong, but the TIME-OUT comes back dated 2019-10-11 (the same day as the TIME-IN) instead of 2019-10-12. It looks like this:
ID | Employee_ID | IN | OUT
-------+-------------+---------------------+-----------
70821 | 104 | 2019-10-11 19:00:00 | 2019-10-11 07:01:00

Try this,
-- Sample data. Note: this sample uses 1 = in and 0 = out,
-- the reverse of the InOutMode values shown in the question.
DECLARE @Temp_Table TABLE
(
Employee_id int,
[Date] datetime,
[InOutMode] bit
);
INSERT INTO @Temp_Table
(
Employee_id,[Date],[InOutMode]
)
SELECT 104,'20191011 09:30',1
UNION ALL
SELECT 104,'20191011 19:30',0
UNION ALL
SELECT 104,'20191012 09:30',1
UNION ALL
SELECT 104,'20191012 12:30',0
UNION ALL
SELECT 104,'20191012 19:00',0
UNION ALL
SELECT 104,'20191013 09:30',1
UNION ALL
SELECT 104,'20191013 07:30',0
UNION ALL
SELECT 104,'20191014 09:30',1;
-- One row per employee and calendar day: earliest in, latest out.
-- An out that is earlier than the in is reported as NULL.
SELECT Employee_id,[Date],[In],IIF([In]>[Out],null,[Out]) as [Out]
FROM
(
SELECT Employee_id,CAST([Date] AS DATE) AS [Date],
MIN(IIF(InOutMode=1,[Date],NULL)) AS [In] ,
MAX(IIF(InOutMode=0,[Date],NULL)) AS [Out]
FROM @Temp_Table
GROUP BY Employee_id,CAST([Date] AS DATE)
)A;

Try this:
;
WITH Ins as (
Select *
FROM HR_DTR_Device
WHERE InOutMode = 0
),
Outs as (
Select *
FROM HR_DTR_Device
WHERE InOutMode = 1
)
SELECT Ins.ID,
Ins.Employee_ID,
Ins.Date as [In],
(
SELECT Min(Outs.Date)
FROM Outs
WHERE Ins.Employee_ID = Outs.Employee_ID
AND Outs.Date > Ins.Date
) as [Out]
FROM Ins
WHERE Ins.Employee_ID = '104'
What this does:
Separates the Ins and the Outs, as if they were separate data sources. Using Common Table Expressions allows you, in effect, to pre-define subqueries and give them names.
For each record in the Ins, looks for the smallest date from the Outs that is still larger than the In date. (This assumes that your records are complete, and that you can't ever have two Ins in a row because someone forgot to clock out.)
Doesn't make any assumptions about when the Out date happens, just that it's later than the In date (by definition). That way, you don't have to worry about whether the employee left later the same day or early the next day (if you have employees working different shifts.)
Will also show any entries where the employee clocked in but has not yet clocked out.
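The pairing rule described above can be sketched outside the database as well. The following is a minimal Python illustration, not the T-SQL itself; the `punches` list and the function name `pair_ins_with_outs` are made up for the example, and it assumes 0 = in and 1 = out as in the question's sample rows.

```python
from datetime import datetime

# Hypothetical punch records: (employee_id, timestamp, mode), 0 = in, 1 = out.
punches = [
    (104, datetime(2019, 10, 11, 19, 0), 0),
    (104, datetime(2019, 10, 12, 7, 1), 1),
    (104, datetime(2019, 10, 12, 19, 0), 0),  # clocked in, not yet clocked out
]

def pair_ins_with_outs(rows):
    """For each in-punch, find the earliest later out-punch of the same employee."""
    ins = [r for r in rows if r[2] == 0]
    outs = [r for r in rows if r[2] == 1]
    pairs = []
    for emp, t_in, _ in ins:
        later = [t for e, t, _ in outs if e == emp and t > t_in]
        # None plays the role of SQL NULL when no later out-punch exists.
        pairs.append((emp, t_in, min(later) if later else None))
    return pairs

pairs = pair_ins_with_outs(punches)
```

The overnight shift is paired correctly because only the ordering of the timestamps matters, not the calendar date.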
I think your big error was here:
(SELECT MAX(Date) FROM HR_DTR_Device XX
WHERE InOutMode = 1
AND XX.Employee_ID = AA.Employee_ID
AND CAST(XX.Date AS DATE) = CAST(AA.Date AS DATE)) AS 'Out'
You are returning the largest date for that employee that is on the same calendar date (and is an Out). But, if the employee works until the next morning, the date will have changed!
You could fix this by changing your test to this:
CAST(DATEADD(d, -1, XX.Date) AS DATE) = CAST(AA.Date AS DATE)
... but then it will ONLY work for employees who worked overnight, whereas my solution simply finds the next time the employee clocked out after they clocked in, regardless of whether it's the same day, the next day, or the next week!
If you like this solution, please mark it as your accepted solution. Thank you.

Related

SQL procedure to show how many hours has worker worked

+-----------+-------------------------------+-------+
| Worker ID | Time(MM/DD/YYYY Hour:Min:Sec) | InOut |
+-----------+-------------------------------+-------+
| 1 | 12/04/2017 10:00:00 | In |
| 2 | 12/04/2017 10:00:00 | In |
| 2 | 12/04/2017 18:40:02 | Out |
| 3 | 12/04/2017 10:00:00 | In |
| 1 | 12/04/2017 12:01:00 | Out |
| 3 | 12/04/2017 19:40:05 | Out |
+-----------+-------------------------------+-------+
Hi! I have a problem with my project and thought some of you could help me. I have a table like the one above. It is a simple table that records workers entering and leaving the company. I need a procedure that takes a worker ID and a day as input parameters and shows how many hours and minutes that worker worked that day. Thanks for the help.
Yeah, I had to do a number of queries like this at my old job. Here's the approach I used, and it worked out pretty well:
For each "Out" record, get the MAX(TIME) on "In" records with a time earlier than the OUT record
Does that make sense? You're basically joining the table against itself, looking for the record that represents the "clock in" time for any particular "clock out" time.
So here's the backbone:
select
*
, (
SELECT MAX(tim) from #tempTable subQ
where subQ.id = main.id
and subQ.tim <= main.tim
and subQ.InOut = 'In'
) as correspondingInTime
from #tempTable main
where InOut = 'Out'
... from here, you can get the data you need. Either by manipulating the query above, or using it as a subquery itself (which is my favored way of doing it) - something like:
select id as workerID, sum(DATEDIFF(s, correspondingInTime, tim)) as totalSecondsWorked
from
(
select
*
, (
SELECT MAX(tim) from #tempTable subQ
where subQ.id = main.id
and subQ.tim <= main.tim
and subQ.InOut = 'In'
) correspondingInTime
from #tempTable main
where InOut = 'Out'
) mainQuery
group by id
EDIT: Remove the 'as' before correspondingInTime, because oracle doesn't allow 'as' in table aliasing.
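To make the "latest In before each Out" idea concrete, here is a small Python sketch of the same logic, using the sample rows from the question. The names `log` and `seconds_worked` are illustrative only, and the sketch assumes every Out has a matching earlier In.

```python
from datetime import datetime

# Sample rows from the question: (worker_id, timestamp, "In" or "Out").
log = [
    (1, datetime(2017, 12, 4, 10, 0, 0), "In"),
    (2, datetime(2017, 12, 4, 10, 0, 0), "In"),
    (2, datetime(2017, 12, 4, 18, 40, 2), "Out"),
    (3, datetime(2017, 12, 4, 10, 0, 0), "In"),
    (1, datetime(2017, 12, 4, 12, 1, 0), "Out"),
    (3, datetime(2017, 12, 4, 19, 40, 5), "Out"),
]

def seconds_worked(rows):
    """Sum, per worker, the time between each Out and the latest In before it."""
    totals = {}
    for worker, t_out, kind in rows:
        if kind != "Out":
            continue
        earlier_ins = [t for w, t, k in rows
                       if w == worker and k == "In" and t <= t_out]
        if earlier_ins:  # skip an Out that has no earlier In
            delta = t_out - max(earlier_ins)
            totals[worker] = totals.get(worker, 0) + int(delta.total_seconds())
    return totals

totals = seconds_worked(log)
```

Dividing the totals by 3600 and taking the remainder gives the hours and minutes asked for in the question.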
Maybe something similar to
-- If time1 is a DATE, the subtraction yields a number of days
select sum( time1 - prev_time1 ) from (
select InOut, time1,
lag(time1) over (partition by worker_id order by time1) prev_time1,
lag(InOut) over (partition by worker_id order by time1) prev_inOut
from MyTABLE
where time1 >= trunc(:date1) and time1 < trunc(:date1) + 1
and worker_id = :workerId
) t1
where InOut = 'Out' and prev_inOut = 'In'
would go.
:workerId and :date1 are variables to constrain to one date and one worker as required.
I'm fairly certain Oracle allows you to use CROSS APPLY these days.
SELECT [Worker ID], yt.Time - ca.Time
FROM YourTable yt
CROSS APPLY (SELECT MAX(Time) AS Time
FROM YourTable
WHERE [Worker ID] = yt.[Worker ID] AND Time < yt.Time AND InOut = 'In') ca
WHERE yt.InOut = 'Out'

SQL grouping by datetime with a maximum difference of x minutes

I have a problem with grouping my dataset in MS SQL Server.
My table looks like
# | CustomerID | SalesDate | Turnover
---| ---------- | ------------------- | ---------
1 | 1 | 2016-08-09 12:15:00 | 22.50
2 | 1 | 2016-08-09 12:17:00 | 10.00
3 | 1 | 2016-08-09 12:58:00 | 12.00
4 | 1 | 2016-08-09 13:01:00 | 55.00
5 | 1 | 2016-08-09 23:59:00 | 10.00
6 | 1 | 2016-08-10 00:02:00 | 5.00
Now I want to group the rows where the SalesDate difference to the next row is of a maximum of 5 minutes.
So that row 1 & 2, 3 & 4 and 5 & 6 are each one group.
My approach was getting the minutes with the DATEPART() function and divide the result by 5:
(DATEPART(MINUTE, SalesDate) / 5)
For row 1 and 2 the result would be 3 and grouping here would work perfectly.
But for the other rows where there is a change in the hour or even in the day part of the SalesDate, the result cannot be used for grouping.
So this is where I'm stuck. I would really appreciate, if someone could point me in the right direction.
You want to group adjacent transactions based on the timing between them. The idea is to assign some sort of grouping identifier, and then use that for aggregation.
Here is an approach:
Identify group starts using lag() and date arithmetic.
Do a cumulative sum of the group starts to identify each group.
Aggregate
The query looks like this:
select customerid, min(salesdate), max(salesdate), sum(turnover)
from (select t.*,
sum(case when salesdate > dateadd(minute, 5, prev_salesdate)
then 1 else 0
end) over (partition by customerid order by salesdate) as grp
from (select t.*,
lag(salesdate) over (partition by customerid order by salesdate) as prev_salesdate
from t
) t
) t
group by customerid, grp;
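The three steps (flag group starts, running count, aggregate) can be followed in a short Python sketch over the question's sample rows. This is only an illustration of the logic, not a replacement for the query; `sales` and `group_adjacent` are names invented for the example.

```python
from datetime import datetime, timedelta

# Sample rows from the question: (customer_id, sales_date, turnover).
sales = [
    (1, datetime(2016, 8, 9, 12, 15), 22.50),
    (1, datetime(2016, 8, 9, 12, 17), 10.00),
    (1, datetime(2016, 8, 9, 12, 58), 12.00),
    (1, datetime(2016, 8, 9, 13, 1), 55.00),
    (1, datetime(2016, 8, 9, 23, 59), 10.00),
    (1, datetime(2016, 8, 10, 0, 2), 5.00),
]

def group_adjacent(rows, max_gap=timedelta(minutes=5)):
    """Return (customer, first_date, last_date, total) per run of close-together rows."""
    groups = []
    prev = None  # (customer, sales_date) of the previous row
    for customer, when, turnover in sorted(rows, key=lambda r: (r[0], r[1])):
        # A new group starts on the first row, on a new customer,
        # or when the gap to the previous row exceeds max_gap.
        starts_group = prev is None or prev[0] != customer or when > prev[1] + max_gap
        if starts_group:
            groups.append([customer, when, when, turnover])
        else:
            groups[-1][2] = when
            groups[-1][3] += turnover
        prev = (customer, when)
    return [tuple(g) for g in groups]

groups = group_adjacent(sales)
```

Because the comparison uses full timestamps, the rows at 23:59 and 00:02 end up in the same group even though the calendar date changes.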
EDIT
Thanks to @JoeFarrell for pointing out I have answered the wrong question. The OP is looking for dynamic time differences between rows, but this approach creates fixed boundaries.
Original Answer
You could create a time table. This is a table that contains one record for each second of the day. Your table would have a second column that you can use to perform group bys on.
CREATE TABLE [Time]
(
TimeId TIME(0) PRIMARY KEY,
TimeGroup TIME
)
;
-- You could use a loop here instead.
INSERT INTO [Time]
(
TimeId,
TimeGroup
)
VALUES
('00:00:00', '00:00:00'), -- First group starts here.
('00:00:01', '00:00:00'),
('00:00:02', '00:00:00'),
('00:00:03', '00:00:00'),
...
('00:04:59', '00:00:00'),
('00:05:00', '00:05:00'), -- Second group starts here.
('00:05:01', '00:05:00')
;
The approach works best when:
You need to reuse your custom grouping in several different queries.
You have two or more custom groups you often use.
Once populated you can simply join to the table and output the desired result.
/* Using the time table.
*/
SELECT
t.TimeGroup,
SUM(Turnover) AS SumOfTurnover
FROM
Sales AS s
INNER JOIN [Time] AS t ON t.TimeId = CAST(s.SalesDate AS Time(0))
GROUP BY
t.TimeGroup
;

Generate more rows if there is a difference between two columns in SQL Server

I'm currently working on some reports from MS Project Server and found this oddity:
For some obscure reason, whenever you appoint to the same task with the same amount of time in consecutive days, instead of creating an entry for each appointment, the application updates the start date and the finish date fields on database, leaving only one entry for that task, but with a range between the dates.
If the amount of time appointed to the task in consecutive days are different, then there will be created one entry per appointment.
(Yes, I know, it's kind of confusing. I don't even know how to explain this better).
I want to know if it is somehow possible to generate more rows within SQL statement whenever there is a difference between the start and the finish date, one for each day in the range.
This is the query I have right now, I already can tell which rows have this date difference, but I don't know what I can do next.
select
r.WRES_ID, r.RES_NAME, PROJ_NAME, p.WPROJ_ID, TASK_NAME, WWORK_VALUE, WWORK_START, WWORK_FINISH,
datediff(d, WWORK_START, WWORK_FINISH) + 1 AS work_days
from MSP_WEB_RESOURCES r
join
MSP_WEB_ASSIGNMENTS a on a.WRES_ID = r.WRES_ID
join
MSP_WEB_PROJECTS p on p.WPROJ_ID = a.WPROJ_ID
join
MSP_WEB_WORK w on w.WASSN_ID = a.WASSN_ID
where RES_NAME = 'HenriqueBarcelos'
and WWORK_TYPE = 1
and WWORK_VALUE > 0
and WWORK_FINISH between '2014-01-27' and '2014-01-31'
order by WWORK_FINISH DESC
I know I could do this at the application level, but I was wondering if I could just do it within the database itself.
Thanks in advance.
Edit:
These are my current results:
WRES_ID | RES_NAME | TASK_NAME | WWORK_VALUE | WWORK_START | WWORK_FINISH | work_days
--------+------------------+-------------------------+---------------+---------------------+---------------------+----------
382 | HenriqueBarcelos | Outsourcing Initiatives | 60000.000000 | 2014-01-30 00:00:00 | 2014-01-30 00:00:00 | 1
382 | HenriqueBarcelos | Internal Training | 289800.000000 | 2014-01-29 00:00:00 | 2014-01-29 00:00:00 | 1
382 | HenriqueBarcelos | Outsourcing Initiatives | 120000.000000 | 2014-01-29 00:00:00 | 2014-01-29 00:00:00 | 1
382 | HenriqueBarcelos | Outsourcing Initiatives | 60000.000000 | 2014-01-27 00:00:00 | 2014-01-28 00:00:00 | 2
382 | HenriqueBarcelos | Infrastructure (TI) | 120000.000000 | 2014-01-27 00:00:00 | 2014-01-27 00:00:00 | 1
Notice that the second-to-last record has a range of 2 days. Indeed, there are 2 appointments, one on Jan 27th and the other on the 28th.
What I want to do is expand this and return one entry per day in this case.
It can be done, but it's not very elegant. First you need a function that will expand the date range into sequence of dates:
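CREATE FUNCTION ufn_Expand(@start DATE, @end DATE)
RETURNS TABLE
AS
RETURN
WITH cte AS
(
-- Anchor: the first day of the range
SELECT @start AS dt
UNION ALL
-- Recursive step: add one day until the end date is reached.
-- Recursion stops at 100 levels by default; longer ranges need
-- OPTION (MAXRECURSION n) on the calling query.
SELECT DATEADD(dd, 1, dt) FROM cte WHERE dt < @end
)
SELECT dt FROM cte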
Then use that in your query with CROSS APPLY:
SELECT /* your columns */, x.dt
FROM /* your joins */
CROSS APPLY ufn_Expand(WWORK_START, WWORK_FINISH) x
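As a rough picture of what the expansion produces, here is a small Python sketch that turns one row with a date range into one row per day. It only mirrors the idea; `expand` and `per_day` are names chosen for the example, and the row is taken from the question's results.

```python
from datetime import date, timedelta

def expand(start, end):
    """Return every date from start to end, both inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Row taken from the question's results: (task, start, finish).
row = ("Outsourcing Initiatives", date(2014, 1, 27), date(2014, 1, 28))

# One output row per day in the range, like CROSS APPLY over the date list.
per_day = [(row[0], day) for day in expand(row[1], row[2])]
```

A row whose start and finish are the same day yields exactly one output row, so single-day entries pass through unchanged.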
I'd use a numbers table (nice and set-based, yum!)
SELECT start_date
, end_date
, DateDiff(dd, start_date, end_date) + 1 As number_of_days --rows to display
FROM your_table
INNER
JOIN dbo.numbers
ON numbers.number BETWEEN 1 AND DateDiff(dd, start_date, end_date) + 1
Use your favourite search engine to find a numbers table script. Here's one I made earlier.
As an aside: if you remove the +1s you just modify the join to be between zero and the DateDiff() - I added the +1s as I thought it might be clearer!
You can look at this from another perspective. You don't really want a row for each worked day. What you really need is the number of worked days multiplied by the reported work time. Something like this:
(dbo.MSP_WEB_WORK.WWORK_VALUE / 60000) * (DATEDIFF(day, dbo.MSP_WEB_WORK.WWORK_START, dbo.MSP_WEB_WORK.WWORK_FINISH) + 1)
However, this creates an issue. Let's say you want to report on a given period. If you use the WWORK_START and WWORK_FINISH dates for your report, you need to be careful to include all work that has only some of its days inside the period. Something like this will do it:
DECLARE @InitDate DATETIME;
DECLARE @EndDate DATETIME;
SET @InitDate = '2016/06/01';
SET @EndDate = '2016/07/01';
--Full list of tasks
SELECT dbo.MSP_WEB_RESOURCES.RES_NAME AS Name, dbo.MSP_WEB_PROJECTS.PROJ_NAME AS Project,
dbo.MSP_WEB_WORK.WWORK_VALUE / 60000 AS ReportedWork,
CASE
WHEN WWORK_START < @InitDate THEN DATEDIFF(day, @InitDate, dbo.MSP_WEB_WORK.WWORK_FINISH) + 1 --If the task started before the start of the period
WHEN WWORK_FINISH > DATEADD(day,-1,@EndDate) THEN DATEDIFF(day, WWORK_START, DATEADD(day,-1,@EndDate)) + 1 --if the task ended after the end of the period
ELSE DATEDIFF(day, dbo.MSP_WEB_WORK.WWORK_START, dbo.MSP_WEB_WORK.WWORK_FINISH) + 1 --All tasks with start and end date inside the period
END AS RepeatedDays,
CASE
WHEN WWORK_START < @InitDate THEN (dbo.MSP_WEB_WORK.WWORK_VALUE / 60000) * (DATEDIFF(day, @InitDate, dbo.MSP_WEB_WORK.WWORK_FINISH) + 1)
WHEN WWORK_FINISH > DATEADD(day,-1,@EndDate) THEN (dbo.MSP_WEB_WORK.WWORK_VALUE / 60000) * (DATEDIFF(day, WWORK_START, DATEADD(day,-1,@EndDate)) + 1)
ELSE (dbo.MSP_WEB_WORK.WWORK_VALUE / 60000) * (DATEDIFF(day, dbo.MSP_WEB_WORK.WWORK_START, dbo.MSP_WEB_WORK.WWORK_FINISH) + 1)
END AS ActualWork,
dbo.MSP_WEB_WORK.WWORK_START,
dbo.MSP_WEB_WORK.WWORK_FINISH
FROM dbo.MSP_WEB_RESOURCES INNER JOIN
dbo.MSP_WEB_ASSIGNMENTS INNER JOIN
dbo.MSP_WEB_PROJECTS ON dbo.MSP_WEB_ASSIGNMENTS.WPROJ_ID = dbo.MSP_WEB_PROJECTS.WPROJ_ID INNER JOIN
dbo.MSP_WEB_WORK ON dbo.MSP_WEB_ASSIGNMENTS.WASSN_ID = dbo.MSP_WEB_WORK.WASSN_ID ON
dbo.MSP_WEB_RESOURCES.WRES_ID = dbo.MSP_WEB_ASSIGNMENTS.WRES_ID
WHERE (dbo.MSP_WEB_WORK.WWORK_TYPE = 1) AND
(
@InitDate BETWEEN dbo.MSP_WEB_WORK.WWORK_START and dbo.MSP_WEB_WORK.WWORK_FINISH OR
DATEADD(day,-1,@EndDate) BETWEEN dbo.MSP_WEB_WORK.WWORK_START and dbo.MSP_WEB_WORK.WWORK_FINISH OR
(dbo.MSP_WEB_WORK.WWORK_START >= @InitDate) AND
(dbo.MSP_WEB_WORK.WWORK_FINISH < @EndDate)
)
ORDER BY dbo.MSP_WEB_WORK.WWORK_START;

SQL Group by specific time period

Hello, I need the best way to group data by specific time periods. I need to group a month of data from 07:00:00 to 18:59:59 and then from 19:00:00 to 06:59:59 the next day. The database holds a lot of data, so a fast solution would be preferred.
It would also be great to include the shift letter in the query. There are 4 shifts (A, B, C, D), and I have a calendar table.
Table [Shiftcalendar]:
[ShiftDate] | [SHIFT] | [Nextshift]
2013-11-11 | N | A=B
2013-11-11 | D | C=A
2013-11-10 | N | D=C
.... | .... | ....
Column [Shift] represents day or night; column [Nextshift] represents the shift and the next shift. N means night and runs from 19:00:00 until 06:59:59 the next day; D means day and runs from 07:00:00 until 18:59:59.
Table [wrkSpeedInfo]:
[wrkActionDate] | [wrkSpeed] | [wrkGlueValue] | [x1]
2013-11-11 07:00:35 | 200 | 300 | 20
2013-11-11 07:00:55 | 97 | 255 | 13
2013-11-11 07:01:23 | 127 | 124 | 15
.... | .... | .... | ....
I need to SUM [wrkSpeed], [wrkGlueValue] and [x1].
Someones help would be really appreciated :)
PS.: Don't mind my English writing skills, I am still on verge of improving it.
EDIT:
So far I have been running many queries to fetch data for specific dates and shifts, but I would like to get all the data in one query.
WHERE [wrkActionDate] BETWEEN '2013-10-03 07:00:00' AND '2013-10-03 18:59:59'
I can post the full query, but it takes a lot of space and I would then need to explain much more about what I am trying to do.
EDIT:
OK, someone asked me to post the full query:
SELECT [wrkActionDate]
,[wrkCntrId]
,DATEDIFF(second, (SELECT TOP 1 t2.[wrkActionDate] FROM [DW].[dbo].[wrkSpeedInfo] as t2 WHERE [wrkCntrId] = 'S1' AND t2.[wrkActionDate] < t1.[wrkActionDate] ORDER BY t2.[wrkActionDate] DESC), [wrkActionDate])/60.0 AS MinPassed
,SUM([wrkSpeed])*DATEDIFF(second, (SELECT TOP 1 t2.[wrkActionDate] FROM [DW].[dbo].[wrkSpeedInfo] as t2 WHERE [wrkCntrId] = 'S1' AND t2.[wrkActionDate] < t1.[wrkActionDate] ORDER BY t2.[wrkActionDate] DESC), [wrkActionDate])/60.0 AS SumWrkSpeed
,SUM([wrkGlueValue])*DATEDIFF(second, (SELECT TOP 1 t2.[wrkActionDate] FROM [DW].[dbo].[wrkSpeedInfo] as t2 WHERE [wrkCntrId] = 'S1' AND t2.[wrkActionDate] < t1.[wrkActionDate] ORDER BY t2.[wrkActionDate] DESC), [wrkActionDate])/60.0 AS SumWrkGlueValue
,SUM([x1]) AS SumX1
FROM [DW].[dbo].[wrkSpeedInfo] as t1
WHERE [wrkActionDate] BETWEEN '2013-10-03 07:00:00' AND '2013-10-03 18:59:59' AND [wrkCntrId] = 'S1'
GROUP BY [wrkCntrId], [wrkActionDate]
So if I could get a whole month of data in one query, that would be great, because right now I only get data for one shift.
Would be great to get something like:
[ShiftDate] | [SHIFT] | [Nextshift] | SUM([wrkSpeed]) | SUM([wrkGlueValue]) | SUM([x1])
EDIT:
They are using MS SQL 2012. I can't change the structure or anything else; I can only select data from the DB.
If you wanted to sum all the values just based on the day you would simply have to
GROUP BY CAST(wrkActionDate AS DATE)
But you don't want to group by the date precisely, you want to group based on your shift pattern. So to do that you can create a field that calculates which shift a particular time falls into, and then group based on that field.
SELECT [Shift]
,SUM(wrkSpeed) AS wrkSpeed
,SUM(wrkGlueValue) AS wrkGlueValue
,SUM(x1) AS x1
FROM(
SELECT w.*,
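CASE WHEN (DATEPART(HOUR, wrkActionDate) >= 7 AND DATEPART(HOUR, wrkActionDate) < 19)
THEN LEFT(CAST(wrkActionDate AS DATE),10)+' D'
-- Night shift: move back 7 hours so that 19:00-23:59 keeps its own date and
-- 00:00-06:59 is assigned to the previous date, the day the shift started
ELSE LEFT(CAST(DATEADD(HOUR, -7, wrkActionDate) AS DATE),10)+' N'
END AS [Shift]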
FROM [DW].[dbo].wrkSpeedInfo w
) w
GROUP BY [Shift]
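The idea of computing a shift label per row and then grouping on it can be sketched in Python as follows. This is an illustration under one stated assumption: a night shift is keyed by the date on which it starts, so 00:00 to 06:59 belongs to the previous date. The function name `shift_key` and the two night-time readings are hypothetical; the first two readings come from the question.

```python
from collections import defaultdict
from datetime import datetime, date, timedelta

def shift_key(ts):
    """Map a timestamp to (shift_date, 'D' or 'N')."""
    if 7 <= ts.hour < 19:
        return (ts.date(), "D")
    # Night: moving back 7 hours keeps 19:00-23:59 on its own date
    # and places 00:00-06:59 on the previous date.
    return ((ts - timedelta(hours=7)).date(), "N")

# (wrkActionDate, wrkSpeed)
readings = [
    (datetime(2013, 11, 11, 7, 0, 35), 200),
    (datetime(2013, 11, 11, 7, 0, 55), 97),
    (datetime(2013, 11, 11, 23, 30, 0), 50),   # hypothetical night reading
    (datetime(2013, 11, 12, 6, 59, 59), 25),   # hypothetical, same night shift
]

# Equivalent of SUM(wrkSpeed) ... GROUP BY [Shift]
totals = defaultdict(int)
for ts, speed in readings:
    totals[shift_key(ts)] += speed
```

A key of this shape (date plus D or N) is also what would be needed to join against a calendar such as [Shiftcalendar].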

Can I use a SQL Server CTE to merge intersecting dates?

I'm writing an app that handles scheduling time off for some of our employees. As part of this, I need to calculate how many minutes throughout the day that they have requested off.
In the first version of this tool, we disallowed overlapping time off requests, because we wanted to be able to just add up the total of EndTime minus StartTime for all requests. Preventing overlaps makes this calculation very fast.
This has become problematic, because Managers now want to schedule team meetings but are unable to do so when someone has already asked for the day off.
So, in the new version of the tool, we have a requirement to allow overlapping requests.
Here is an example set of data like what we have:
UserId | StartDate | EndDate
----------------------------
1 | 2:00 | 4:00
1 | 3:00 | 5:00
1 | 3:45 | 9:00
2 | 6:00 | 9:00
2 | 7:00 | 8:00
3 | 2:00 | 3:00
3 | 4:00 | 5:00
4 | 1:00 | 7:00
The result that I need to get, as efficiently as possible, is this:
UserId | StartDate | EndDate
----------------------------
1 | 2:00 | 9:00
2 | 6:00 | 9:00
3 | 2:00 | 3:00
3 | 4:00 | 5:00
4 | 1:00 | 7:00
We can easily detect overlaps with this query:
select
*
from
requests r1
cross join
requests r2
where
r1.RequestId < r2.RequestId
and
r1.StartTime < r2.EndTime
and
r2.StartTime < r1.EndTime
This is, in fact, how we were detecting and preventing the problems originally.
Now, we are trying to merge the overlapping items, but I'm reaching the limits of my SQL ninja skills.
It wouldn't be too hard to come up with a method using temp tables, but we want to avoid this if at all possible.
Is there a set-based way to merge overlapping rows?
Edit:
It would also be acceptable for all of the rows to show up, as long as they were collapsed into just their time. For example, if someone wants off from three to five, and from four to six, it would be acceptable for them to have two rows: one from three to five and the next from five to six, OR one from three to four and the next from four to six.
Also, here is a little test bench:
DECLARE @requests TABLE
(
UserId int,
StartDate time,
EndDate time
)
INSERT INTO @requests (UserId, StartDate, EndDate) VALUES
(1, '2:00', '4:00'),
(1, '3:00', '5:00'),
(1, '3:45', '9:00'),
(2, '6:00', '9:00'),
(2, '7:00', '8:00'),
(3, '2:00', '3:00'),
(3, '4:00', '5:00'),
(4, '1:00', '7:00');
Complete Rewrite:
;WITH new_grp AS (
SELECT r1.UserId, r1.StartTime
FROM #requests r1
WHERE NOT EXISTS (
SELECT *
FROM #requests r2
WHERE r1.UserId = r2.UserId
AND r2.StartTime < r1.StartTime
AND r2.EndTime >= r1.StartTime)
GROUP BY r1.UserId, r1.StartTime -- there can be > 1
),r AS (
SELECT r.RequestId, r.UserId, r.StartTime, r.EndTime
,count(*) AS grp -- guaranteed to be 1+
FROM #requests r
JOIN new_grp n ON n.UserId = r.UserId AND n.StartTime <= r.StartTime
GROUP BY r.RequestId, r.UserId, r.StartTime, r.EndTime
)
SELECT min(RequestId) AS RequestId
,UserId
,min(StartTime) AS StartTime
,max(EndTime) AS EndTime
FROM r
GROUP BY UserId, grp
ORDER BY UserId, grp
Now produces the requested result and really covers all possible cases, including disjunct sub-groups and duplicates.
Have a look at the comments to the test data in the working demo at data.SE.
CTE 1
Find the (unique!) points in time where a new group of overlapping intervals starts.
CTE 2
Count the starts of new group up to (and including) every individual interval, thereby forming a unique group number per user.
Final SELECT
Merge the groups, taking the earliest start and latest end for each group.
I faced some difficulty, because T-SQL window functions max() or sum() did not accept an ORDER BY clause in a window at the time (SQL Server 2012 later added this). They could only compute one value per partition, which made it impossible to compute a running sum / count per partition. It would work in PostgreSQL or Oracle (but not in MySQL before 8.0, which had neither window functions nor CTEs).
The final solution uses one extra CTE and should be just as fast.
OK, it is possible to do this with CTEs. I did not know how to use them at the beginning of the night, but here are the results of my research:
A recursive CTE has 2 parts, the "anchor" statement and the "recursive" statements.
The crucial part about the recursive statement is that, each time it is evaluated, it sees only the rows produced by the previous step, not the rows that were already processed.
So, for example, if we wanted to use CTEs to get an all-inclusive list of times for these users, we could use something like this:
WITH
sorted_requests as (
SELECT
UserId, StartDate, EndDate,
ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY StartDate, EndDate DESC) Instance
FROM @requests
),
no_overlap(UserId, StartDate, EndDate, Instance) as (
SELECT *
FROM sorted_requests
WHERE Instance = 1
UNION ALL
SELECT s.*
FROM sorted_requests s
INNER JOIN no_overlap n
ON s.UserId = n.UserId
AND s.Instance = n.Instance + 1
)
SELECT *
FROM no_overlap
Here, the "anchor" statement is just the first instance for every user, WHERE Instance = 1.
The "recursive" statement joins each row to the next row in the set, using the s.UserId = n.UserId AND s.Instance = n.Instance + 1
Now, we can use the property of the data, when sorted by start date, that any overlapping row will have a start date that is less than the previous row's end date. If we continually propagate the row number of the first intersecting row, every subsequent overlapping row will share that row number.
Using this query:
WITH
sorted_requests as (
SELECT
UserId, StartDate, EndDate,
ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY StartDate, EndDate DESC) Instance
FROM
@requests
),
no_overlap(UserId, StartDate, EndDate, Instance, ConnectedGroup) as (
SELECT
UserId,
StartDate,
EndDate,
Instance,
Instance as ConnectedGroup
FROM sorted_requests
WHERE Instance = 1
UNION ALL
SELECT
s.UserId,
s.StartDate,
CASE WHEN n.EndDate >= s.EndDate
THEN n.EndDate
ELSE s.EndDate
END EndDate,
s.Instance,
CASE WHEN n.EndDate >= s.StartDate
THEN n.ConnectedGroup
ELSE s.Instance
END ConnectedGroup
FROM sorted_requests s
INNER JOIN no_overlap n
ON s.UserId = n.UserId AND s.Instance = n.Instance + 1
)
SELECT
UserId,
MIN(StartDate) StartDate,
MAX(EndDate) EndDate
FROM no_overlap
GROUP BY UserId, ConnectedGroup
ORDER BY UserId
We group by the aforementioned "first intersecting row" (called ConnectedGroup in this query) and find the minimum start time and maximum end time in that group.
The first intersecting row is propagated using this statement:
CASE WHEN n.EndDate >= s.StartDate
THEN n.ConnectedGroup
ELSE s.Instance
END ConnectedGroup
Which basically says, "if this row intersects with the previous row (based on us being sorted by start date), then consider this row to have the same 'row grouping' as the previous row. Otherwise, use this row's own row number as the 'row grouping' for itself."
This gives us exactly what we were looking for.
EDIT
When I had originally thought this up on my whiteboard, I knew that I would have to advance the EndDate of each row, to ensure that it would intersect with the next row, if any of the previous rows in the connected group would have intersected. I accidentally left that out. This has been corrected.
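For comparison, the same "sort by start, carry the running end forward" idea is easy to follow in a procedural sketch. The Python below is only an illustration of the logic using the question's test data; `merge_overlaps` is a name invented for the example, and like the `>=` comparison in the query it treats intervals that merely touch as connected.

```python
from datetime import time

# Test data from the question: (UserId, StartDate, EndDate).
requests = [
    (1, time(2, 0), time(4, 0)),
    (1, time(3, 0), time(5, 0)),
    (1, time(3, 45), time(9, 0)),
    (2, time(6, 0), time(9, 0)),
    (2, time(7, 0), time(8, 0)),
    (3, time(2, 0), time(3, 0)),
    (3, time(4, 0), time(5, 0)),
    (4, time(1, 0), time(7, 0)),
]

def merge_overlaps(rows):
    """Merge overlapping (or touching) intervals per user."""
    merged = []
    for user, start, end in sorted(rows):  # by user, then start, then end
        last = merged[-1] if merged else None
        if last and last[0] == user and start <= last[2]:
            # Overlaps the running interval: extend its end if needed.
            last[2] = max(last[2], end)
        else:
            merged.append([user, start, end])
    return [tuple(m) for m in merged]

merged = merge_overlaps(requests)
```

Keeping the maximum end seen so far is the step that the EDIT above describes: without it, a long early interval followed by a short one could hide a later overlap.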
This works for postgres. Microsoft might need some modifications.
SET search_path='tmp';
DROP TABLE tmp.schedule CASCADE;
CREATE TABLE tmp.schedule
( person_id INTEGER NOT NULL
, dt_from timestamp with time zone
, dt_to timestamp with time zone
);
INSERT INTO schedule( person_id, dt_from, dt_to) VALUES
( 1, '2011-12-03 02:00:00' , '2011-12-03 04:00:00' )
, ( 1, '2011-12-03 03:00:00' , '2011-12-03 05:00:00' )
, ( 1, '2011-12-03 03:45:00' , '2011-12-03 09:00:00' )
, ( 2, '2011-12-03 06:00:00' , '2011-12-03 09:00:00' )
, ( 2, '2011-12-03 07:00:00' , '2011-12-03 08:00:00' )
, ( 3, '2011-12-03 02:00:00' , '2011-12-03 03:00:00' )
, ( 3, '2011-12-03 04:00:00' , '2011-12-03 05:00:00' )
, ( 4, '2011-12-03 01:00:00' , '2011-12-03 07:00:00' );
ALTER TABLE schedule ADD PRIMARY KEY (person_id,dt_from)
;
CREATE UNIQUE INDEX ON schedule (person_id,dt_to);
SELECT * FROM schedule ORDER BY person_id, dt_from;
WITH RECURSIVE ztree AS (
-- Terminal part
SELECT p1.person_id AS person_id
, p1.dt_from AS dt_from
, p1.dt_to AS dt_to
FROM schedule p1
UNION
-- Recursive part
SELECT p2.person_id AS person_id
, LEAST(p2.dt_from, zzt.dt_from) AS dt_from
, GREATEST(p2.dt_to, zzt.dt_to) AS dt_to
FROM ztree AS zzt
, schedule AS p2
WHERE 1=1
AND p2.person_id = zzt.person_id
AND (p2.dt_from < zzt.dt_from AND p2.dt_to >= zzt.dt_from)
)
SELECT *
FROM ztree zt
WHERE NOT EXISTS (
SELECT * FROM ztree nx
WHERE nx.person_id = zt.person_id
-- the recursive query returns *all possible combinations of
-- touching or overlapping intervals
-- we'll have to filter, keeping only the biggest ones
-- (the ones for which there is no bigger overlapping interval)
AND ( (nx.dt_from <= zt.dt_from AND nx.dt_to > zt.dt_to)
OR (nx.dt_from < zt.dt_from AND nx.dt_to >= zt.dt_to)
)
)
ORDER BY zt.person_id,zt.dt_from
;
Result:
DROP TABLE
CREATE TABLE
INSERT 0 8
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "schedule_pkey" for table "schedule"
ALTER TABLE
CREATE INDEX
person_id | dt_from | dt_to
-----------+------------------------+------------------------
1 | 2011-12-03 02:00:00+01 | 2011-12-03 04:00:00+01
1 | 2011-12-03 03:00:00+01 | 2011-12-03 05:00:00+01
1 | 2011-12-03 03:45:00+01 | 2011-12-03 09:00:00+01
2 | 2011-12-03 06:00:00+01 | 2011-12-03 09:00:00+01
2 | 2011-12-03 07:00:00+01 | 2011-12-03 08:00:00+01
3 | 2011-12-03 02:00:00+01 | 2011-12-03 03:00:00+01
3 | 2011-12-03 04:00:00+01 | 2011-12-03 05:00:00+01
4 | 2011-12-03 01:00:00+01 | 2011-12-03 07:00:00+01
(8 rows)
person_id | dt_from | dt_to
-----------+------------------------+------------------------
1 | 2011-12-03 02:00:00+01 | 2011-12-03 09:00:00+01
2 | 2011-12-03 06:00:00+01 | 2011-12-03 09:00:00+01
3 | 2011-12-03 02:00:00+01 | 2011-12-03 03:00:00+01
3 | 2011-12-03 04:00:00+01 | 2011-12-03 05:00:00+01
4 | 2011-12-03 01:00:00+01 | 2011-12-03 07:00:00+01
(5 rows)