Can I use a SQL Server CTE to merge intersecting dates? - sql

I'm writing an app that handles scheduling time off for some of our employees. As part of this, I need to calculate how many minutes throughout the day that they have requested off.
In the first version of this tool, we disallowed overlapping time off requests, because we wanted to be able to just add up the total of EndTime minus StartTime for all requests. Preventing overlaps makes this calculation very fast.
This has become problematic, because Managers now want to schedule team meetings but are unable to do so when someone has already asked for the day off.
So, in the new version of the tool, we have a requirement to allow overlapping requests.
Here is an example set of data like what we have:
UserId | StartDate | EndDate
----------------------------
1 | 2:00 | 4:00
1 | 3:00 | 5:00
1 | 3:45 | 9:00
2 | 6:00 | 9:00
2 | 7:00 | 8:00
3 | 2:00 | 3:00
3 | 4:00 | 5:00
4 | 1:00 | 7:00
The result that I need to get, as efficiently as possible, is this:
UserId | StartDate | EndDate
----------------------------
1 | 2:00 | 9:00
2 | 6:00 | 9:00
3 | 2:00 | 3:00
3 | 4:00 | 5:00
4 | 1:00 | 7:00
We can easily detect overlaps with this query:
select
*
from
requests r1
cross join
requests r2
where
r1.RequestId < r2.RequestId
and
r1.StartTime < r2.EndTime
and
r2.StartTime < r1.EndTime
This is, in fact, how we were detecting and preventing the problems originally.
Now, we are trying to merge the overlapping items, but I'm reaching the limits of my SQL ninja skills.
It wouldn't be too hard to come up with a method using temp tables, but we want to avoid this if at all possible.
Is there a set-based way to merge overlapping rows?
Edit:
It would also be acceptable for all of the rows to show up, as long as they were collapsed so that their times don't overlap. For example, if someone wants off from three to five, and from four to six, it would be acceptable for them to have two rows: one from three to five and the next from five to six, OR one from three to four and the next from four to six.
Also, here is a little test bench:
CREATE TABLE #requests
(
UserId int,
StartDate time,
EndDate time
);
INSERT INTO #requests (UserId, StartDate, EndDate) VALUES
(1, '2:00', '4:00'),
(1, '3:00', '5:00'),
(1, '3:45', '9:00'),
(2, '6:00', '9:00'),
(2, '7:00', '8:00'),
(3, '2:00', '3:00'),
(3, '4:00', '5:00'),
(4, '1:00', '7:00');
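For reference, the merge being asked for is the classic overlapping-intervals sweep: sort each user's rows by start, extend the current range while the next start falls inside it, and emit a row at each gap. A minimal Python sketch (function names are mine, not from the question) that reproduces the expected result:

```python
from collections import defaultdict

def _minutes(t):
    """'3:45' -> 225 (minutes since midnight)."""
    h, m = t.split(":")
    return int(h) * 60 + int(m)

def merge_requests(rows):
    """Collapse overlapping (UserId, StartDate, EndDate) rows per user,
    which is exactly what the question asks the SQL to produce."""
    by_user = defaultdict(list)
    for user, start, end in rows:
        by_user[user].append((_minutes(start), _minutes(end)))
    merged = []
    for user in sorted(by_user):
        intervals = sorted(by_user[user])
        cur_s, cur_e = intervals[0]
        for s, e in intervals[1:]:
            if s <= cur_e:                 # overlap (or touch): extend the group
                cur_e = max(cur_e, e)
            else:                          # gap: emit the finished group
                merged.append((user, cur_s, cur_e))
                cur_s, cur_e = s, e
        merged.append((user, cur_s, cur_e))
    return merged
```

Running this over the test-bench data yields one row per merged range per user, matching the requested result.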

Complete Rewrite:
;WITH new_grp AS (
SELECT r1.UserId, r1.StartTime
FROM #requests r1
WHERE NOT EXISTS (
SELECT *
FROM #requests r2
WHERE r1.UserId = r2.UserId
AND r2.StartTime < r1.StartTime
AND r2.EndTime >= r1.StartTime)
GROUP BY r1.UserId, r1.StartTime -- there can be > 1
),r AS (
SELECT r.RequestId, r.UserId, r.StartTime, r.EndTime
,count(*) AS grp -- guaranteed to be 1+
FROM #requests r
JOIN new_grp n ON n.UserId = r.UserId AND n.StartTime <= r.StartTime
GROUP BY r.RequestId, r.UserId, r.StartTime, r.EndTime
)
SELECT min(RequestId) AS RequestId
,UserId
,min(StartTime) AS StartTime
,max(EndTime) AS EndTime
FROM r
GROUP BY UserId, grp
ORDER BY UserId, grp
Now produces the requested result and really covers all possible cases, including disjoint sub-groups and duplicates.
Have a look at the comments to the test data in the working demo at data.SE.
CTE 1
Find the (unique!) points in time where a new group of overlapping intervals starts.
CTE 2
Count the starts of new groups up to (and including) every individual interval, thereby forming a unique group number per user.
Final SELECT
Merge the groups, taking the earliest start and latest end for each group.
I faced some difficulty, because the T-SQL window functions max() and sum() do not accept an ORDER BY clause in a window. They can only compute one value per partition, which makes it impossible to compute a running sum / count per partition. This would work in PostgreSQL or Oracle (but not in MySQL, of course - it has neither window functions nor CTEs).
The final solution uses one extra CTE and should be just as fast.

Ok, it is possible to do with CTEs. I did not know how to use them at the beginning of the night, but here are the results of my research:
A recursive CTE has 2 parts, the "anchor" statement and the "recursive" statements.
The crucial part about the recursive statement is that, when it is evaluated, it only sees the rows produced by the previous iteration of the recursion, not rows that have already been evaluated.
So, for example, if we wanted to use CTEs to get an all-inclusive list of times for these users, we could use something like this:
WITH
sorted_requests as (
SELECT
UserId, StartDate, EndDate,
ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY StartDate, EndDate DESC) Instance
FROM #requests
),
no_overlap(UserId, StartDate, EndDate, Instance) as (
SELECT *
FROM sorted_requests
WHERE Instance = 1
UNION ALL
SELECT s.*
FROM sorted_requests s
INNER JOIN no_overlap n
ON s.UserId = n.UserId
AND s.Instance = n.Instance + 1
)
SELECT *
FROM no_overlap
Here, the "anchor" statement is just the first instance for every user, WHERE Instance = 1.
The "recursive" statement joins each row to the next row in the set, using the s.UserId = n.UserId AND s.Instance = n.Instance + 1
Now, we can use the property of the data, when sorted by start date, that any overlapping row will have a start date that is less than the previous row's end date. If we continually propagate the row number of the first intersecting row, every subsequent overlapping row will share that row number.
Using this query:
WITH
sorted_requests as (
SELECT
UserId, StartDate, EndDate,
ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY StartDate, EndDate DESC) Instance
FROM
#requests
),
no_overlap(UserId, StartDate, EndDate, Instance, ConnectedGroup) as (
SELECT
UserId,
StartDate,
EndDate,
Instance,
Instance as ConnectedGroup
FROM sorted_requests
WHERE Instance = 1
UNION ALL
SELECT
s.UserId,
s.StartDate,
CASE WHEN n.EndDate >= s.EndDate
THEN n.EndDate
ELSE s.EndDate
END EndDate,
s.Instance,
CASE WHEN n.EndDate >= s.StartDate
THEN n.ConnectedGroup
ELSE s.Instance
END ConnectedGroup
FROM sorted_requests s
INNER JOIN no_overlap n
ON s.UserId = n.UserId AND s.Instance = n.Instance + 1
)
SELECT
UserId,
MIN(StartDate) StartDate,
MAX(EndDate) EndDate
FROM no_overlap
GROUP BY UserId, ConnectedGroup
ORDER BY UserId
We group by the aforementioned "first intersecting row" (called ConnectedGroup in this query) and find the minimum start time and maximum end time in that group.
The first intersecting row is propagated using this statement:
CASE WHEN n.EndDate >= s.StartDate
THEN n.ConnectedGroup
ELSE s.Instance
END ConnectedGroup
Which basically says, "if this row intersects with the previous row (based on us being sorted by start date), then consider this row to have the same 'row grouping' as the previous row. Otherwise, use this row's own row number as the 'row grouping' for itself."
This gives us exactly what we were looking for.
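The ConnectedGroup propagation is easy to simulate outside SQL as well, which is a handy way to sanity-check the CTE: walk one user's rows in (StartDate, EndDate DESC) order, carry a running maximum EndDate, and give each row either the previous row's group or its own row number. A Python sketch of just that mechanic (function name is mine):

```python
def propagate_groups(intervals):
    """For one user's (start, end) rows, reproduce the ConnectedGroup
    column of the recursive CTE: sort by start (end desc as tiebreak),
    carry a running max EndDate, and label each row with either the
    previous row's group (if the running end reaches its start) or its
    own row number (if it starts a fresh group)."""
    ordered = sorted(intervals, key=lambda t: (t[0], -t[1]))
    labels, run_end, group = [], None, None
    for instance, (s, e) in enumerate(ordered, start=1):
        if instance == 1 or run_end < s:
            group = instance           # new group: use own row number
            run_end = e
        else:
            run_end = max(run_end, e)  # the "advance EndDate" correction
        labels.append(group)
    return labels
```

Grouping by these labels and taking min(start)/max(end) gives the merged rows.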
EDIT
When I had originally thought this up on my whiteboard, I knew that I would have to advance the EndDate of each row, to ensure that it would intersect with the next row if any of the previous rows in the connected group intersected it. I accidentally left that out. This has been corrected.

This works for postgres. Microsoft might need some modifications.
SET search_path='tmp';
DROP TABLE tmp.schedule CASCADE;
CREATE TABLE tmp.schedule
( person_id INTEGER NOT NULL
, dt_from timestamp with time zone
, dt_to timestamp with time zone
);
INSERT INTO schedule( person_id, dt_from, dt_to) VALUES
( 1, '2011-12-03 02:00:00' , '2011-12-03 04:00:00' )
, ( 1, '2011-12-03 03:00:00' , '2011-12-03 05:00:00' )
, ( 1, '2011-12-03 03:45:00' , '2011-12-03 09:00:00' )
, ( 2, '2011-12-03 06:00:00' , '2011-12-03 09:00:00' )
, ( 2, '2011-12-03 07:00:00' , '2011-12-03 08:00:00' )
, ( 3, '2011-12-03 02:00:00' , '2011-12-03 03:00:00' )
, ( 3, '2011-12-03 04:00:00' , '2011-12-03 05:00:00' )
, ( 4, '2011-12-03 01:00:00' , '2011-12-03 07:00:00' );
ALTER TABLE schedule ADD PRIMARY KEY (person_id,dt_from)
;
CREATE UNIQUE INDEX ON schedule (person_id,dt_to);
SELECT * FROM schedule ORDER BY person_id, dt_from;
WITH RECURSIVE ztree AS (
-- Terminal part
SELECT p1.person_id AS person_id
, p1.dt_from AS dt_from
, p1.dt_to AS dt_to
FROM schedule p1
UNION
-- Recursive part
SELECT p2.person_id AS person_id
, LEAST(p2.dt_from, zzt.dt_from) AS dt_from
, GREATEST(p2.dt_to, zzt.dt_to) AS dt_to
FROM ztree AS zzt
, schedule AS p2
WHERE 1=1
AND p2.person_id = zzt.person_id
AND (p2.dt_from < zzt.dt_from AND p2.dt_to >= zzt.dt_from)
)
SELECT *
FROM ztree zt
WHERE NOT EXISTS (
SELECT * FROM ztree nx
WHERE nx.person_id = zt.person_id
-- the recursive query returns *all possible combinations of
-- touching or overlapping intervals
-- we'll have to filter, keeping only the biggest ones
-- (the ones for which there is no bigger overlapping interval)
AND ( (nx.dt_from <= zt.dt_from AND nx.dt_to > zt.dt_to)
OR (nx.dt_from < zt.dt_from AND nx.dt_to >= zt.dt_to)
)
)
ORDER BY zt.person_id,zt.dt_from
;
Result:
DROP TABLE
CREATE TABLE
INSERT 0 8
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "schedule_pkey" for table "schedule"
ALTER TABLE
CREATE INDEX
person_id | dt_from | dt_to
-----------+------------------------+------------------------
1 | 2011-12-03 02:00:00+01 | 2011-12-03 04:00:00+01
1 | 2011-12-03 03:00:00+01 | 2011-12-03 05:00:00+01
1 | 2011-12-03 03:45:00+01 | 2011-12-03 09:00:00+01
2 | 2011-12-03 06:00:00+01 | 2011-12-03 09:00:00+01
2 | 2011-12-03 07:00:00+01 | 2011-12-03 08:00:00+01
3 | 2011-12-03 02:00:00+01 | 2011-12-03 03:00:00+01
3 | 2011-12-03 04:00:00+01 | 2011-12-03 05:00:00+01
4 | 2011-12-03 01:00:00+01 | 2011-12-03 07:00:00+01
(8 rows)
person_id | dt_from | dt_to
-----------+------------------------+------------------------
1 | 2011-12-03 02:00:00+01 | 2011-12-03 09:00:00+01
2 | 2011-12-03 06:00:00+01 | 2011-12-03 09:00:00+01
3 | 2011-12-03 02:00:00+01 | 2011-12-03 03:00:00+01
3 | 2011-12-03 04:00:00+01 | 2011-12-03 05:00:00+01
4 | 2011-12-03 01:00:00+01 | 2011-12-03 07:00:00+01
(5 rows)
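The recursive part enumerates every combination of touching or overlapping intervals, and the outer NOT EXISTS then keeps only the maximal ones. That two-phase idea can be sketched in Python (a rough illustration of the approach, not a translation of the query; times are minutes for brevity):

```python
def maximal_merged(intervals):
    """Phase 1: repeatedly merge overlapping interval pairs and add the
    union to the set (the recursive CTE's closure). Phase 2: keep only
    intervals not contained in a strictly larger one (the NOT EXISTS)."""
    closure = set(intervals)
    changed = True
    while changed:
        changed = False
        for a in list(closure):
            for b in list(closure):
                if a[0] < b[0] and a[1] >= b[0]:   # a overlaps b's start
                    u = (a[0], max(a[1], b[1]))    # LEAST/GREATEST union
                    if u not in closure:
                        closure.add(u)
                        changed = True
    return sorted(
        x for x in closure
        if not any((y[0] <= x[0] and y[1] > x[1]) or
                   (y[0] < x[0] and y[1] >= x[1])
                   for y in closure)
    )
```

Note the closure can grow quadratically, which is why this style is usually slower than the group-numbering approaches above.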

Related

Select rows that tightly bound input date range

I have a table where each row has a TIMESTAMP column. I have as input a start date time and an end date time. I need to select rows from this table that cover the input time range without selecting any "extra" rows.
The only thing I can think of is that I need to execute multiple queries to find the row where m_time < start date time ordered descending limit 1, and the row where m_time > end date time ordered ascending limit 1. Then query from the table for rows between those two m_time.
Is there a way to do it in one query?
Example data:
| time_data |
|---------------------------|
| rownum | m_time |
|--------|------------------|
| 1 | 2020-11-01T00:00 |
| 2 | 2020-11-01T01:00 |
| 3 | 2020-11-01T02:00 |
| 4 | 2020-11-01T03:00 |
| 5 | 2020-11-01T04:00 |
| 6 | 2020-11-01T05:00 |
m_time has a data type of TIMESTAMP
Given
start_date_time = 2020-11-01T01:58
end_date_time = 2020-11-01T03:02
Expected output would be rows 2-5
Oracle Version 19.8.0.0.0
You can use lag() and lead():
select t.*
from (
select t.*,
lag(m_time) over(order by m_time) lag_m_time,
lead(m_time) over(order by m_time) lead_m_time
from time_data
) t
where (m_time <= :start_time and lead_m_time > :start_time)
or (m_time >= :start_time and m_time <= :end_time)
or (m_time >= :end_time and lag_m_time < :end_time)
I think the where clause could be simplified:
where lead_m_time > :start_time and lag_m_time < :end_time
(note that lag_m_time and lead_m_time are NULL on the first and last rows, so those boundary rows would be dropped if the input range extends beyond them).
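Stated procedurally, the "tight bound" is: every row inside the range, plus one row outward on each side when a boundary falls between rows. A Python sketch of that selection with bisect, assuming the m_time values are sorted and comparable (function name is mine):

```python
import bisect

def bounding_rows(times, start, end):
    """Return the slice of sorted `times` that tightly covers
    [start, end]: all values inside the range, extended one row outward
    on each side when a boundary falls between rows."""
    lo = bisect.bisect_right(times, start)  # first index with time > start
    if lo > 0:
        lo -= 1                             # include the row at/before start
    hi = bisect.bisect_left(times, end)     # first index with time >= end
    if hi < len(times):
        hi += 1                             # include the row at/after end
    return times[lo:hi]
```

With the sample data and the given range this selects rows 2 through 5, as expected.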
In Oracle 12.1 and higher, match_recognize makes quick work of such row pattern matching problems.
Notes - I changed your rownum column name, since rownum is a reserved keyword in Oracle. I used string data type for your dates (since that's what you seem to have), but that is a very bad practice. Make sure you always use date data type for dates. (This issue is unrelated to your question, so I left it alone.)
I simulate your sample data in the WITH clause; that is not part of the solution, it's just there for a quick test.
with
time_data (rn, m_time) as (
select 1, '2020-11-01T00:00' from dual union all
select 2, '2020-11-01T01:00' from dual union all
select 3, '2020-11-01T02:00' from dual union all
select 4, '2020-11-01T03:00' from dual union all
select 5, '2020-11-01T04:00' from dual union all
select 6, '2020-11-01T05:00' from dual
)
select rn, m_time
from time_data
match_recognize(
order by m_time
all rows per match
omit empty matches
pattern (F{0,1} M* L{0,1})
define M as m_time between :start_time and :end_time,
F as next(m_time) >= :start_time,
L as prev(m_time) <= :end_time
);
RN M_TIME
-- ----------------
2 2020-11-01T01:00
3 2020-11-01T02:00
4 2020-11-01T03:00
5 2020-11-01T04:00

Optimize the query of weekday statistics between two dates

I have a table with two fields: start_date and end_date. Now I want to count the total number of overtime records. I have created a new calendar table to maintain the working-day status of each date.
table: workdays
id status
2020-01-01 4
2020-01-02 1
2020-01-03 1
2020-01-04 2
4: holiday, 1: weekday, 2: weekend
I created a function to calculate the weekdays between two dates (excluding weekends, holidays).
create or replace function get_workday_count (start_date in date, end_date in date)
return number is
day_count int;
begin
select count(0) into day_count from WORKDAYS
where TRUNC(ID) >= TRUNC(start_date)
and TRUNC(ID) <= TRUNC(end_date)
and status in (1, 3, 5);
return day_count;
end;
When I execute the following query statement, it takes about 5 minutes to display the results. The erp_sj table has about 200,000 rows of data.
select count(0) from ERP_SJ where GET_WORKDAY_COUNT(start_date, end_date) > 5;
The fields used in query statements are indexed.
How to optimize? Or is there a better solution?
First of all, optimizing your function:
1. Adding pragma udf (for faster execution in SQL)
2. Adding the deterministic clause (for caching)
3. Replacing count(0) with count(*) (to allow the CBO to optimize the count)
4. Replacing return number with return int
create or replace function get_workday_count (start_date in date, end_date in date)
return int deterministic is
pragma udf;
day_count int;
begin
select count(*) into day_count from WORKDAYS w
where w.ID >= TRUNC(start_date)
and w.ID <= TRUNC(end_date)
and status in (1, 3, 5);
return day_count;
end;
Then you don't need to call your function at all when (end_date - start_date) is less than the required number of days. Moreover, ideally you would use a scalar subquery instead of a function:
select count(*)
from ERP_SJ
where
case
when trunc(end_date) - trunc(start_date) > 5
then GET_WORKDAY_COUNT(trunc(start_date) , trunc(end_date))
else 0
end > 5
Or using subquery:
select count(*)
from ERP_SJ e
where
case
when trunc(end_date) - trunc(start_date) > 5
then (select count(*) from WORKDAYS w
where w.ID >= TRUNC(e.start_date)
and w.ID <= TRUNC(e.end_date)
and w.status in (1, 3, 5))
else 0
end > 5
WORKDAY_STATUSES table (just for completeness, not used below):
create table workday_statuses
( status number(1) constraint workday_statuses_pk primary key
, status_name varchar2(10) not null constraint workday_status_name_uk unique );
insert all
into workday_statuses values (1, 'Weekday')
into workday_statuses values (2, 'Weekend')
into workday_statuses values (3, 'Unknown 1')
into workday_statuses values (4, 'Holiday')
into workday_statuses values (5, 'Unknown 2')
select * from dual;
WORKDAYS table: one row for each day in 2020:
create table workdays
( id date constraint workdays_pk primary key
, status references workday_statuses not null )
organization index;
insert into workdays (id, status)
select date '2019-12-31' + rownum
, case
when to_char(date '2019-12-31' + rownum, 'Dy', 'nls_language = English') like 'S%' then 2
when date '2019-12-31' + rownum in
( date '2020-01-01', date '2020-04-10', date '2020-04-13'
, date '2020-05-08', date '2020-05-25', date '2020-08-31'
, date '2020-12-25', date '2020-12-26', date '2020-12-28' ) then 4
else 1
end
from xmltable('1 to 366')
where date '2019-12-31' + rownum < date '2021-01-01';
ERP_SJ table containing 30K rows with random data:
create table erp_sj
( id integer generated always as identity
, start_date date not null
, end_date date not null
, filler varchar2(100) );
insert into erp_sj (start_date, end_date, filler)
select dt, dt + dbms_random.value(0,7), dbms_random.string('x',100)
from ( select date '2019-12-31' + dbms_random.value(1,366) as dt
from xmltable('1 to 30000') );
commit;
get_workday_count() function:
create or replace function get_workday_count
( start_date in date, end_date in date )
return integer
deterministic -- Cache some results
parallel_enable -- In case you want to use it in parallel queries
as
pragma udf; -- Tell compiler to optimise for SQL
day_count integer;
begin
select count(*) into day_count
from workdays w
where w.id between trunc(start_date) and end_date
and w.status in (1, 3, 5);
return day_count;
end;
Notice that you should not truncate w.id, because all values have the time as 00:00:00 already. (I'm assuming that if end_date falls somewhere in the middle of a day, you want to count that day, so I have not truncated the end_date parameter.)
Test:
select count(*) from erp_sj
where get_workday_count(start_date, end_date) > 5;
COUNT(*)
--------
1302
Results returned in around 1.4 seconds.
Execution plan for the query within the function:
select count(*)
from workdays w
where w.id between trunc(sysdate) and sysdate +10
and w.status in (1, 3, 5);
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 1 |
|* 2 | FILTER | | 1 | | 7 |00:00:00.01 | 1 |
|* 3 | INDEX RANGE SCAN| WORKDAYS_PK | 1 | 7 | 7 |00:00:00.01 | 1 |
--------------------------------------------------------------------------------------------
Now try adding the function as a virtual column and indexing it:
alter table erp_sj add (workday_count as (get_workday_count(start_date, end_date)));
create index erp_sj_workday_count_ix on erp_sj(workday_count);
select count(*) from erp_sj
where workday_count > 5;
Same result in 0.035 seconds. Plan:
-------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 5 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 5 |
|* 2 | INDEX RANGE SCAN| ERP_SJ_WORKDAY_COUNT_IX | 1 | 1302 | 1302 |00:00:00.01 | 5 |
-------------------------------------------------------------------------------------------------------
Tested in 19.0.0.
Edit: As Sayan pointed out, the index on the virtual column won't be automatically updated if there are any changes in WORKDAYS, so there is a risk of wrong results with this approach. However, if performance is critical you could work around it by rebuilding the index on ERP_SJ every time you updated WORKDAYS. Maybe you could do this in a statement-level trigger on WORKDAYS, or just through scheduled IT maintenance processes if updates are very infrequent and ERP_SJ isn't so big that an index rebuild is impractical. If the index is partitioned, rebuilding affected partitions could be an option.
Or, don't have an index and live with the 1.4 seconds query execution time.
I understand that the columns ID and status have indexes on them (not a function-based index on TRUNC(ID)). So use this query
SELECT count(0)
INTO day_count
FROM WORKDAYS
WHERE ID BETWEEN TRUNC(start_date) AND TRUNC(end_date)
AND status in (1, 3, 5);
in order to be able to exploit the index on date column ID also.
Maybe try Scalar Subquery Caching
(in case there are plenty of erp_sj records with the same start_date and end_date):
select count(0) from ERP_SJ where
(select GET_WORKDAY_COUNT(start_date, end_date) from dual) > 5
You are dealing with a data warehouse query (not an OLTP query).
Some best practices say you should:
get rid of functions - avoid the context switch (this could be somewhat mitigated with the UDF pragma, but why use a function if you don't need it?)
get rid of indexes - quick for a few rows; slow for a large number of records
get rid of caching - caching is basically a workaround for repeating the same thing
So the data warehouse approach for the problem consists of two steps
Extend the Workday Table
The workday table can be extended, with a small query, with a new column MIN_END_DAY that defines, for each (start) day, the minimum threshold date needed to reach the limit of 5 working days.
The query uses the LEAD analytic function to get the 4th leading working day (check the PARTITION BY clause, which distinguishes between working days and other days).
For the non-working days you simply take the LAST_VALUE of the next working day.
Example
with wd as (
select ID, STATUS,
case when status in (1, 3, 5) then
lead(id,4) over (partition by case when status in (1, 3, 5) then 'workday' end order by id) /* 4 working days ahead */
end as min_stop_day
from workdays),
wd2 as (
select ID, STATUS,
last_value(MIN_STOP_DAY) ignore nulls over (order by id desc) MIN_END_DAY
from wd)
select ID, STATUS, MIN_END_DAY
from wd2
order by 1;
ID, STATUS, MIN_END_DAY
01.01.2020 00:00:00 4 08.01.2020 00:00:00
02.01.2020 00:00:00 1 08.01.2020 00:00:00
03.01.2020 00:00:00 1 09.01.2020 00:00:00
04.01.2020 00:00:00 2 10.01.2020 00:00:00
05.01.2020 00:00:00 2 10.01.2020 00:00:00
06.01.2020 00:00:00 1 10.01.2020 00:00:00
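If it helps to see the two window functions spelled out, here is a hypothetical Python sketch of the same two passes: lead(id, 4) within the working-day partition, then the descending ignore-nulls backfill (names are mine, days are plain integers for brevity):

```python
def min_end_day(days):
    """days: list of (day, status) sorted ascending; status 1/3/5 means
    a working day. Returns {day: min_end_day}, the earliest end date that
    makes the span from that day contain 5 working days."""
    WORKING = {1, 3, 5}
    workdays = [d for d, s in days if s in WORKING]
    # lead(id, 4) over the working-day partition
    stop = {d: (workdays[i + 4] if i + 4 < len(workdays) else None)
            for i, d in enumerate(workdays)}
    # last_value(...) ignore nulls over (order by id desc): backfill
    # each day with the nearest non-null value at or after it
    result, carry = {}, None
    for d, s in reversed(days):
        if stop.get(d) is not None:
            carry = stop[d]
        result[d] = carry
    return result
```

With January 2020 as in the sample (day 1 a holiday, days 4/5 and 11/12 weekends), this reproduces the MIN_END_DAY column shown above.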
Join to the Base Table
Now you can simply join your base table with the extended workday table on the start day and filter rows by comparing the end_date with the MIN_END_DAY.
Query
with wd as (
select ID, STATUS,
case when status in (1, 3, 5) then
lead(id,4) over (partition by case when status in (1, 3, 5) then 'workday' end order by id)
end as min_stop_day
from workdays),
wd2 as (
select ID, STATUS,
last_value(MIN_STOP_DAY) ignore nulls over (order by id desc) MIN_END_DAY
from wd)
select count(*) from erp_sj
join wd2
on trunc(erp_sj.start_date) = wd2.ID
where trunc(end_date) >= min_end_day
For large tables, this will lead to the expected HASH JOIN execution plan.
Note that I assume 1) the workday table is complete (otherwise you can't use inner join) and 2) contains enough future data (the last 5 rows are obviously not usable).

SQL Server Query In and Out

This is from a DTR device that I saved in an MS SQL database:
ID | Employee_ID | Date | InOutMode
-------+-------------+---------------------+-----------
70821 | 104 | 2019-10-11 19:00:00 | 0
70850 | 104 | 2019-10-12 07:01:00 | 1
If I'm going to separate the IN and OUT, it's supposed to look like this:
ID | Employee_ID | IN | OUT
-------+-------------+---------------------+-----------
70821 | 104 | 2019-10-11 19:00:00 | 2019-10-12 07:01:00
What happens is, I don't know if my queries were wrong, but the TIME-OUT is not 2019-10-12 but 2019-10-11, the same as the TIME-IN. It looks like this:
ID | Employee_ID | IN | OUT
-------+-------------+---------------------+-----------
70821 | 104 | 2019-10-11 19:00:00 | 2019-10-11 07:01:00
Try this,
CREATE TABLE #Temp_Table
(
Empoyee_id int,
[Date] datetime,
[InOutMode] bit
);
INSERT INTO #Temp_Table
(
Empoyee_id,[Date],[InOutMode]
)
SELECT 104,'20191011 09:30',1
UNION ALL
SELECT 104,'20191011 19:30',0
UNION ALL
SELECT 104,'20191012 09:30',1
UNION ALL
SELECT 104,'20191012 12:30',0
UNION ALL
SELECT 104,'20191012 19:00',0
UNION ALL
SELECT 104,'20191013 09:30',1
UNION ALL
SELECT 104,'20191013 07:30',0
UNION ALL
SELECT 104,'20191014 09:30',1
SELECT Empoyee_id,[Date],[In],IIF([In]>[Out],null,[Out]) as [Out]
FROM
(
SELECT Empoyee_id,CAST([Date] AS DATE) AS [Date],
MIN(IIF(InOutMode=1,[Date],NULL)) AS [In] ,
MAX(IIF(InOutMode=0,[Date],NULL)) AS [Out]
FROM #Temp_Table
GROUP BY Empoyee_id,CAST([Date] AS DATE)
)A
Try this:
;
WITH Ins as (
Select *
FROM HR_DTR_Device
WHERE InOutMode = 0
),
Outs as (
Select *
FROM HR_DTR_Device
WHERE InOutMode = 1
)
SELECT Ins.ID,
Ins.Employee_ID,
Ins.Date as [In],
(
SELECT Min(Outs.Date)
FROM Outs
WHERE Ins.Employee_ID = Outs.Employee_ID
AND Outs.Date > Ins.Date
) as [Out]
FROM Ins
WHERE Ins.Employee_ID = '104'
What this does:
Separates the Ins and the Outs, as if they were separate data sources. Using Common Table Expressions allows you, in effect, to pre-define subqueries and give them names.
For each record in the Ins, looks for the smallest date from the Outs that is still larger than the In date. (This assumes that your records are complete, and that you can't ever have two Ins in a row because someone forgot to clock out.)
Doesn't make any assumptions about when the Out date happens, just that it's later than the In date (by definition). That way, you don't have to worry about whether the employee left later the same day or early the next day (if you have employees working different shifts.)
Will also show any entries where the employee clocked in but has not yet clocked out.
I think your big error was here:
(SELECT MAX(Date) FROM HR_DTR_Device XX
WHERE InOutMode = 1
AND XX.Employee_ID = AA.Employee_ID
AND CAST(XX.Date AS DATE) = CAST(AA.Date AS DATE)) AS 'Out'
You are returning the largest date for that employee that is on the same calendar date (and is an Out). But, if the employee works until the next morning, the date will have changed!
You could fix this by changing your test to this:
CAST(DATEADD(d, -1, XX.Date) AS DATE) = CAST(AA.Date AS DATE)
... but then it will ONLY work for employees who worked overnight, whereas my solution simply finds the next time the employee clocked out after they clocked in, regardless of whether it's the same day, the next day, or the next week!
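The pairing rule of this answer (each In matched to the employee's earliest later Out) can be sketched outside SQL as well; mode 0 = In and 1 = Out as in the question's data, and the same "records are complete" assumption applies (function name is mine):

```python
def pair_in_out(events):
    """events: list of (employee_id, time, mode), mode 0 = In, 1 = Out,
    with comparable time values. For each In, attach the earliest Out for
    the same employee that is strictly later, or None if the employee has
    not yet clocked out."""
    outs = [(emp, t) for emp, t, m in events if m == 1]
    pairs = []
    for emp, t, m in sorted(events, key=lambda e: e[1]):
        if m != 0:
            continue
        later = [ot for oe, ot in outs if oe == emp and ot > t]
        pairs.append((emp, t, min(later) if later else None))
    return pairs
```

Because it only asks for "the next Out after this In", it works whether the employee left the same day, the next day, or a week later.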

SQL procedure to show how many hours has worker worked

+-----------+-------------------------------+-------+
| Worker ID | Time(MM/DD/YYYY Hour:Min:Sec) | InOut |
+-----------+-------------------------------+-------+
| 1 | 12/04/2017 10:00:00 | In |
| 2 | 12/04/2017 10:00:00 | In |
| 2 | 12/04/2017 18:40:02 | Out |
| 3 | 12/04/2017 10:00:00 | In |
| 1 | 12/04/2017 12:01:00 | Out |
| 3 | 12/04/2017 19:40:05 | Out |
+-----------+-------------------------------+-------+
Hi! I have a problem with my project and I thought some of you could help me. I have a table like the one above. It is a simple table that records workers clocking in and out of the company. I need to write a procedure that takes an ID and a day as IN parameters and shows how many hours and minutes that worker worked that day. Thanks for your help.
Yeah, I had to do a number of queries like this at my old job. Here's the approach I used, and it worked out pretty well:
For each "Out" record, get the MAX(TIME) on "In" records with a time earlier than the OUT record
Does that make sense? You're basically joining the table against itself, looking for the record that represents the "clock in" time for any particular "clock out" time.
So here's the backbone:
select
*
, (
SELECT MAX(tim) from #tempTable subQ
where subQ.id = main.id
and subQ.tim <= main.tim
and subQ.InOut = 'In'
) as correspondingInTime
from #tempTable main
where InOut = 'Out'
... from here, you can get the data you need. Either by manipulating the query above, or using it as a subquery itself (which is my favored way of doing it) - something like:
select id as workerID, sum(DATEDIFF(s, correspondingInTime, tim)) as totalSecondsWorked
from
(
select
*
, (
SELECT MAX(tim) from #tempTable subQ
where subQ.id = main.id
and subQ.tim <= main.tim
and subQ.InOut = 'In'
) correspondingInTime
from #tempTable main
where InOut = 'Out'
) mainQuery
group by id
EDIT: Remove the 'as' before correspondingInTime, because oracle doesn't allow 'as' in table aliasing.
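For what it's worth, the backbone query's logic (latest In at or before each Out, then sum the differences per worker) can be sketched in Python like this (function name is mine):

```python
from datetime import datetime, timedelta

def time_worked(events):
    """events: list of (worker_id, datetime, 'In'|'Out').
    For each 'Out' row, pick the worker's latest 'In' at or before it
    (the MAX(tim) correlated subquery), then sum Out - In per worker."""
    totals = {}
    for wid, t, kind in events:
        if kind != 'Out':
            continue
        clock_in = max(it for (w, it, k) in events
                       if w == wid and k == 'In' and it <= t)
        totals[wid] = totals.get(wid, timedelta(0)) + (t - clock_in)
    return totals
```

Like the SQL, it assumes the records are complete (no missing clock-outs between two Ins).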
Maybe something similar to
select sum( time1 - prev_time1 ) from (
select InOut, time1,
lag(time1) over (partition by worker_id order by time1) prev_time1,
lag(InOut) over (partition by worker_id order by time1) prev_inOut
from MyTABLE
where TimeColumn between trunc(:date1) and trunc( :date1 + 1 )
and workerId = :workerId
) t1
where InOut = 'Out' and prev_InOut = 'In'
would go.
:workerId and :date1 are variables to constrain to one date and one worker as required.
I'm fairly certain Oracle allows you to use CROSS APPLY these days.
SELECT [Worker ID], yt.Time - ca.Time
FROM YourTable yt
CROSS APPLY (SELECT MAX(Time) AS Time
FROM YourTable
WHERE [Worker ID] = yt.[Worker ID] AND Time < yt.Time AND InOut = 'In') ca
WHERE yt.InOut = 'Out'

SQL grouping by datetime with a maximum difference of x minutes

I have a problem with grouping my dataset in MS SQL Server.
My table looks like
# | CustomerID | SalesDate | Turnover
---| ---------- | ------------------- | ---------
1 | 1 | 2016-08-09 12:15:00 | 22.50
2 | 1 | 2016-08-09 12:17:00 | 10.00
3 | 1 | 2016-08-09 12:58:00 | 12.00
4 | 1 | 2016-08-09 13:01:00 | 55.00
5 | 1 | 2016-08-09 23:59:00 | 10.00
6 | 1 | 2016-08-10 00:02:00 | 5.00
Now I want to group the rows where the SalesDate difference to the next row is at most 5 minutes.
So that row 1 & 2, 3 & 4 and 5 & 6 are each one group.
My approach was getting the minutes with the DATEPART() function and divide the result by 5:
(DATEPART(MINUTE, SalesDate) / 5)
For row 1 and 2 the result would be 3 and grouping here would work perfectly.
But for the other rows where there is a change in the hour or even in the day part of the SalesDate, the result cannot be used for grouping.
So this is where I'm stuck. I would really appreciate, if someone could point me in the right direction.
You want to group adjacent transactions based on the timing between them. The idea is to assign some sort of grouping identifier, and then use that for aggregation.
Here is an approach:
Identify group starts using lag() and date arithmetic.
Do a cumulative sum of the group starts to identify each group.
Aggregate
The query looks like this:
select customerid, min(salesdate), max(salesdate), sum(turnover)
from (select t.*,
sum(case when salesdate > dateadd(minute, 5, prev_salesdate)
then 1 else 0
end) over (partition by customerid order by salesdate) as grp
from (select t.*,
lag(salesdate) over (partition by customerid order by salesdate) as prev_salesdate
from t
) t
) t
group by customerid, grp;
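This lag() + cumulative-sum pattern is the classic "gaps and islands" trick; procedurally it is just: start a new group whenever a row begins more than five minutes after the previous row. A Python sketch over one customer's rows (function name is mine):

```python
from datetime import datetime, timedelta

def sales_groups(rows, gap=timedelta(minutes=5)):
    """rows: list of (sales_datetime, turnover) for one customer.
    Returns [(min_date, max_date, total_turnover)] per group, starting a
    new group whenever the gap to the previous row exceeds `gap`."""
    groups, cur = [], None
    for when, turnover in sorted(rows):
        if cur is None or when - cur[1] > gap:   # new island
            if cur:
                groups.append(tuple(cur))
            cur = [when, when, turnover]
        else:                                    # same island: extend
            cur[1] = when
            cur[2] += turnover
    if cur:
        groups.append(tuple(cur))
    return groups
```

Note the gap is measured against the previous row, not the group's first row, which is what makes the midnight-spanning rows 5 and 6 land in one group.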
EDIT
Thanks to @JoeFarrell for pointing out I have answered the wrong question. The OP is looking for dynamic time differences between rows, but this approach creates fixed boundaries.
Original Answer
You could create a time table: a table that contains one record for each second of the day. The table would then have a TimeGroup column that you can use to perform GROUP BYs on.
CREATE TABLE [Time]
(
TimeId TIME(0) PRIMARY KEY,
TimeGroup TIME
)
;
-- You could use a loop here instead.
INSERT INTO [Time]
(
TimeId,
TimeGroup
)
VALUES
('00:00:00', '00:00:00'), -- First group starts here.
('00:00:01', '00:00:00'),
('00:00:02', '00:00:00'),
('00:00:03', '00:00:00'),
...
('00:04:59', '00:00:00'),
('00:05:00', '00:05:00'), -- Second group starts here.
('00:05:01', '00:05:00')
;
The approach works best when:
You need to reuse your custom grouping in several different queries.
You have two or more custom groups you often use.
Once populated you can simply join to the table and output the desired result.
/* Using the time table.
*/
SELECT
t.TimeGroup,
SUM(Turnover) AS SumOfTurnover
FROM
Sales AS s
INNER JOIN [Time] AS t ON t.TimeId = CAST(s.SalesDate AS Time(0))
GROUP BY
t.TimeGroup
;
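The [Time] table is essentially a materialized bucket function: every time of day maps to the start of its fixed five-minute bucket. A Python sketch of that mapping (which, as the edit above notes, produces fixed boundaries rather than gap-based groups):

```python
from datetime import time

def time_group(t, minutes=5):
    """Floor a time-of-day to the start of its fixed bucket, like the
    TimeGroup column of the [Time] table (seconds are dropped)."""
    total = t.hour * 60 + t.minute
    start = (total // minutes) * minutes
    return time(start // 60, start % 60)
```

Joining on this computed value instead of the lookup table gives the same grouping without maintaining 86,400 rows.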