Collapse multiple rows based on time values - sql

I'm trying to collapse rows with consecutive time ranges within the same day into one row, but I'm having an issue because of gaps in time. For example, my dataset looks like this:
Date StartTime EndTime ID
2017-12-1 09:00:00 11:00:00 12345
2017-12-1 11:00:00 13:00:00 12345
2018-09-08 09:00:00 10:00:00 78465
2018-09-08 10:00:00 12:00:00 78465
2018-09-08 15:00:00 16:00:00 78465
2018-09-08 16:00:00 18:00:00 78465
As you can see, the first two rows can simply be combined because there's no time gap within that day. However, for the entries on 2018-09-08, there is a gap between 12:00 and 15:00, so I'd like to merge these four records into two different rows like this:
Date StartTime EndTime ID
2017-12-1 09:00:00 13:00:00 12345
2018-09-08 09:00:00 12:00:00 78465
2018-09-08 15:00:00 18:00:00 78465
In other words, I want to collapse rows only when the time values are consecutive within the same day for the same ID.
Could anyone please help me with this? I tried to generate a unique group using the LAG and LEAD functions, but it didn't work.

You can use a recursive CTE. Assign rows to the same group when the EndTime equals the next row's StartTime, and then take the MIN() and MAX() of each group:
with cte as
(
select rn = row_number() over (partition by [ID], [Date] order by [StartTime]),
*
from tbl
),
rcte as
(
-- anchor member
select rn, [ID], [Date], [StartTime], [EndTime], grp = 1
from cte
where rn = 1
union all
-- recursive member
select c.rn, c.[ID], c.[Date], c.[StartTime], c.[EndTime],
grp = case when r.[EndTime] = c.[StartTime]
then r.grp
else r.grp + 1
end
from rcte r
inner join cte c on r.[ID] = c.[ID]
and r.[Date] = c.[Date]
and r.rn = c.rn - 1
)
select [ID], [Date],
min([StartTime]) as StartTime,
max([EndTime]) as EndTime
from rcte
group by [ID], [Date], grp
db<>fiddle demo
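The same logic can be checked end to end with Python's sqlite3 module — a stand-in for SQL Server here, and an assumption on my part (SQLite 3.25+ supports the same window functions and recursive CTEs; only the `WITH RECURSIVE` spelling and some bracketing differ):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tbl (Date TEXT, StartTime TEXT, EndTime TEXT, ID INTEGER);
INSERT INTO tbl VALUES
  ('2017-12-01', '09:00:00', '11:00:00', 12345),
  ('2017-12-01', '11:00:00', '13:00:00', 12345),
  ('2018-09-08', '09:00:00', '10:00:00', 78465),
  ('2018-09-08', '10:00:00', '12:00:00', 78465),
  ('2018-09-08', '15:00:00', '16:00:00', 78465),
  ('2018-09-08', '16:00:00', '18:00:00', 78465);
""")
rows = con.execute("""
WITH RECURSIVE cte AS (
  SELECT ROW_NUMBER() OVER (PARTITION BY ID, Date ORDER BY StartTime) AS rn, *
  FROM tbl
),
rcte AS (
  -- anchor member: the first interval of each (ID, Date) starts group 1
  SELECT rn, ID, Date, StartTime, EndTime, 1 AS grp FROM cte WHERE rn = 1
  UNION ALL
  -- recursive member: same group while EndTime touches the next StartTime
  SELECT c.rn, c.ID, c.Date, c.StartTime, c.EndTime,
         CASE WHEN r.EndTime = c.StartTime THEN r.grp ELSE r.grp + 1 END
  FROM rcte r
  JOIN cte c ON r.ID = c.ID AND r.Date = c.Date AND r.rn = c.rn - 1
)
SELECT ID, Date, MIN(StartTime), MAX(EndTime)
FROM rcte
GROUP BY ID, Date, grp
ORDER BY Date, MIN(StartTime)
""").fetchall()
for r in rows:
    print(r)
```

Note the dates are zero-padded ('2017-12-01') so that text comparison and ordering behave sensibly.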

If you have no objection to collapsing non-consecutive rows — that is, merging everything for the same ID and Date regardless of gaps — you can just use GROUP BY:
SELECT
Date,
StartTime = MIN(StartTime),
EndTime = MAX(EndTime),
ID
FROM table
GROUP BY ID, Date
Otherwise you can use a solution based on ROW_NUMBER:
SELECT
Date,
StartTime,
EndTime,
ID
FROM (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY Date, ID ORDER BY StartTime)
FROM table
) t
WHERE rn = 1

This is an example of a gaps-and-islands problem -- actually a pretty simple example. The idea is to assign an "island" grouping to each row, specifying which rows should be combined because they are contiguous. Then aggregate.
How do you assign the island? In this case, look at the previous endtime: if it is different from the current starttime, then the row starts a new island. Voila! A cumulative sum of the start flag identifies each island.
As SQL:
select id, date, min(starttime), max(endtime)
from (select t.*,
sum(case when prev_endtime = starttime then 0 else 1 end) over (partition by id, date order by starttime) as grp
from (select t.*,
lag(endtime) over (partition by id, date order by starttime) as prev_endtime
from t
) t
) t
group by id, date, grp;
Here is a db<>fiddle.
Note: This assumes that the time periods never span multiple days. The code can easily be modified to handle that, but with a caveat: the start and end times should be stored as datetime (or a related timestamp type) rather than split into separate date and time columns, because SQL Server doesn't support '24:00:00' as a valid time.
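As a sanity check, here is the same LAG-plus-cumulative-sum island assignment run against the question's sample data via Python's sqlite3 (a stand-in for SQL Server; the query is unchanged apart from lowercase identifiers):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INTEGER, date TEXT, starttime TEXT, endtime TEXT);
INSERT INTO t VALUES
  (12345, '2017-12-01', '09:00:00', '11:00:00'),
  (12345, '2017-12-01', '11:00:00', '13:00:00'),
  (78465, '2018-09-08', '09:00:00', '10:00:00'),
  (78465, '2018-09-08', '10:00:00', '12:00:00'),
  (78465, '2018-09-08', '15:00:00', '16:00:00'),
  (78465, '2018-09-08', '16:00:00', '18:00:00');
""")
rows = con.execute("""
SELECT id, date, MIN(starttime), MAX(endtime)
FROM (SELECT t.*,
             -- cumulative count of "starts a new island" flags = island id
             SUM(CASE WHEN prev_endtime = starttime THEN 0 ELSE 1 END)
               OVER (PARTITION BY id, date ORDER BY starttime) AS grp
      FROM (SELECT t.*,
                   LAG(endtime) OVER (PARTITION BY id, date ORDER BY starttime) AS prev_endtime
            FROM t) t) t
GROUP BY id, date, grp
ORDER BY date, MIN(starttime)
""").fetchall()
for r in rows:
    print(r)
```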

Related

T-SQL, list of DATETIME, create <from> - <to> from it

What would I need to do to achieve the following? Somehow I can't seem to find a good solution.
I have a few CTEs, and the last one is producing just a list of DATETIME values, with a row number column, and those are ordered by the DATETIME.
For example
rn datetime
---------------------------
1 2023-01-07 01:00:00.000
2 2023-01-08 05:30:00.000
3 2023-01-08 08:00:00.000
4 2023-01-09 21:30:00.000
How do I have to join this CTE with itself in order to get the following result:
from to
---------------------------------------------------
2023-01-07 01:00:00.000 2023-01-08 05:30:00.000
2023-01-08 08:00:00.000 2023-01-09 21:30:00.000
Doing a regular inner join (with t1.rn = t2.rn - 1) gives me one row too many (the one from 05:30 to 08:00). So basically each date can only be "used" once.
Hope that makes sense... thanks!
I tried inner joining the CTE with itself, but it didn't return the wanted result.
You can pivot the outcome of your CTE and distribute rows into pairs using arithmetic: integer division by 2 comes to mind.
Assuming that your CTE returns columns dt (a datetime field) and rn (an integer row number):
select min(dt) dt_from, max(dt) dt_to
from cte
group by ( rn - 1 ) / 2
In T-SQL, / is integer division when both operands are integers, so (rn - 1) / 2 yields 0 for rows 1 and 2, 1 for rows 3 and 4, and so on. (Modulo would not work here: (rn - 1) % 2 groups all odd rows together and all even rows together, rather than pairing neighbours.)
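The pairing arithmetic can be seen at work in a small sqlite3 sketch (SQLite also performs integer division on integer operands, so the grouping expression carries over unchanged):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cte (rn INTEGER, dt TEXT);
INSERT INTO cte VALUES
  (1, '2023-01-07 01:00:00.000'), (2, '2023-01-08 05:30:00.000'),
  (3, '2023-01-08 08:00:00.000'), (4, '2023-01-09 21:30:00.000');
""")
rows = con.execute("""
SELECT MIN(dt) AS dt_from, MAX(dt) AS dt_to
FROM cte
GROUP BY (rn - 1) / 2   -- 0, 0, 1, 1: rows 1+2 pair up, rows 3+4 pair up
ORDER BY dt_from
""").fetchall()
print(rows)
```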
You can avoid both the JOIN and the GROUP BY by using LAG to retrieve a previous column value in the result set. The server may be able to generate an execution plan that iterates over the data just once instead of joining or grouping:
with pairs as (
SELECT rn, lag(datetime) OVER(ORDER BY rn) as dt_from, datetime as dt_to
from another_cte
)
select dt_from, dt_to
from pairs
where rn % 2 = 0
ORDER BY rn
The row number itself can be calculated from datetime:
with pairs as (
SELECT
ROW_NUMBER() OVER(ORDER BY datetime) AS rn,
lag(datetime) OVER(ORDER BY datetime) as dt_from,
datetime as dt_to
from another_cte
)
select dt_from, dt_to
from pairs
where rn % 2 = 0
ORDER BY dt_to
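The LAG-based variant can again be checked in SQLite via sqlite3 (the source column is named dt here, and note the WHERE must precede the ORDER BY):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE another_cte (dt TEXT);
INSERT INTO another_cte VALUES
  ('2023-01-07 01:00:00.000'), ('2023-01-08 05:30:00.000'),
  ('2023-01-08 08:00:00.000'), ('2023-01-09 21:30:00.000');
""")
rows = con.execute("""
WITH pairs AS (
  SELECT ROW_NUMBER() OVER (ORDER BY dt) AS rn,
         LAG(dt) OVER (ORDER BY dt) AS dt_from,
         dt AS dt_to
  FROM another_cte
)
SELECT dt_from, dt_to
FROM pairs
WHERE rn % 2 = 0   -- keep every second row; it carries its predecessor via LAG
ORDER BY dt_to
""").fetchall()
print(rows)
```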

How to fill in rows based on available data

Using Snowflake SQL.
So my table has 2 columns: hour and customerID. Every customer will have 2 rows, one corresponding to hour that he/she came into the store, and one corresponding to hour that he/she left the store. With this data, I want to create a table that has every hour that a customer has been in the store. For example, a customer X entered the store at 1PM and left at 5PM, so there would be 5 rows (1 for each hour) like the screenshot below.
Here's my attempt so far:
select
hour
,first_value(customer_id) over (partition by customer_id order by hour rows between unbounded preceding and current row) as customer_id
FROM table
In Snowflake, you would typically use a table of numbers to solve this. You can use the table (generator ...) syntax to generate such a derived table, and then join it with an aggregate query that computes the hour boundaries of each client, using an inequality condition:
select t.customer_id, dateadd(hour, n.rn, t.min_hour) final_hour
from (
select t.customer_id, min(t.hour) min_hour, max(t.hour) max_hour
from mytable t
group by t.customer_id
) t
inner join (
select row_number() over(order by null) - 1 rn
from table (generator(rowcount => 24))
) n on dateadd(hour, n.rn, t.min_hour) <= t.max_hour
order by customer_id, final_hour
This would handle up to 24 hours of visit per customer. If you need more, then you can increase the parameter to the table generator.
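SQLite has no table(generator(...)), but the same numbers-table join can be sketched with a recursive CTE via Python's sqlite3 (an assumption for illustration; the customer entering at 13:00 and leaving at 17:00 mirrors the question's example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mytable (hour TEXT, customer_id TEXT);
INSERT INTO mytable VALUES
  ('2019-04-01 13:00:00', 'x'),   -- entered at 1PM
  ('2019-04-01 17:00:00', 'x');   -- left at 5PM
""")
rows = con.execute("""
WITH RECURSIVE n(rn) AS (          -- numbers 0..23, like generator(rowcount => 24)
  SELECT 0 UNION ALL SELECT rn + 1 FROM n WHERE rn < 23
),
bounds AS (
  SELECT customer_id, MIN(hour) AS min_hour, MAX(hour) AS max_hour
  FROM mytable GROUP BY customer_id
)
SELECT b.customer_id, datetime(b.min_hour, '+' || n.rn || ' hours') AS final_hour
FROM bounds b
JOIN n ON datetime(b.min_hour, '+' || n.rn || ' hours') <= b.max_hour
ORDER BY b.customer_id, final_hour
""").fetchall()
for r in rows:
    print(r)
```

SQLite's datetime(x, '+N hours') modifier plays the role of Snowflake's dateadd(hour, n, x).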
For the example case shown in the test data, where there is only one day's worth of data, GMB's solution works fine.
Once you get into many days (let's just pretend you cannot stay in the store overnight, so visits never span days), the per-customer MIN/MAX breaks down.
That can be fixed by also grouping on the date:
select t.hour::date, t.customer_id, min(t.hour) min_hour, max(t.hour) max_hour
from mytable t
group by 1,2
But multiple entries per day either require tagged data like:
with mytable as (
select * from values
('2019-04-01 09:00:00','x','in')
,('2019-04-01 15:00:00','x','out')
,('2019-04-02 12:00:00','x','in')
,('2019-04-02 14:00:00','x','out')
v(hour, customer_id, state)
)
or for it to be inferred:
with mytable as (
select * from values ('2019-04-01 09:00:00','x','in'),('2019-04-01 15:00:00','x','out')
,('2019-04-02 12:00:00','x','in'),('2019-04-02 14:00:00','x','out')
v(hour, customer_id, state)
)
select hour::date as day
,hour
,customer_id
,state
,BITAND(row_number() over(partition by day, customer_id order by hour), 1) = 1 AS in_dir
from mytable
order by 3,1,2;
giving:
DAY HOUR CUSTOMER_ID STATE IN_DIR
2019-04-01 2019-04-01 09:00:00 x in TRUE
2019-04-01 2019-04-01 15:00:00 x out FALSE
2019-04-02 2019-04-02 12:00:00 x in TRUE
2019-04-02 2019-04-02 14:00:00 x out FALSE
Now this can be used with LEAD and QUALIFY to get true ranges that can handle multiple entries:
select customer_id
,day
,hour
,lead(hour) over (partition by customer_id, day order by hour) as exit_time
from infer_direction
qualify in_dir = true
This works by getting the next time for all rows of each day/customer, and then (via the QUALIFY) keeping only the 'in' rows.
Then we can join to the hours of the day:
select dateadd('hour', row_number() over(order by null) - 1, '00:00:00'::time) as hour
from table (generator(rowcount => 24))
Thus, with it all woven together:
with mytable as (
select hour::timestamp as hour, customer_id, state
from values
('2019-04-01 09:00:00','x','in')
,('2019-04-01 12:00:00','x','out')
,('2019-04-02 13:00:00','x','in')
,('2019-04-02 14:00:00','x','out')
,('2019-04-02 9:00:00','x','in')
,('2019-04-02 10:00:00','x','out')
v(hour, customer_id, state)
), infer_direction AS (
select hour::date as day
,hour::time as hour
,customer_id
,state
,BITAND(row_number() over(partition by day, customer_id order by hour), 1) = 1 AS in_dir
from mytable
), visit_ranges as (
select customer_id
,day
,hour
,lead(hour) over (partition by customer_id, day order by hour) as exit_time
from infer_direction
qualify in_dir = true
), time_of_day AS (
select dateadd('hour', row_number() over(order by null) - 1, '00:00:00'::time) as hour
from table (generator(rowcount => 24))
)
select t.customer_id
,t.day
,h.hour
from visit_ranges as t
join time_of_day h on h.hour between t.hour and t.exit_time
order by 1,2,3;
we get:
CUSTOMER_ID DAY HOUR
x 2019-04-01 09:00:00
x 2019-04-01 10:00:00
x 2019-04-01 11:00:00
x 2019-04-01 12:00:00
x 2019-04-02 09:00:00
x 2019-04-02 10:00:00
x 2019-04-02 13:00:00
x 2019-04-02 14:00:00
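The same inference can be expressed in a few lines of plain Python, which makes the pairing logic easy to see: within each (customer, day) group, the odd-positioned events (ordered by time) are entries, the following event is the matching exit, and the hours in between are enumerated. The sample events mirror the answer's test data:

```python
from datetime import datetime, timedelta
from itertools import groupby

events = [
    ("2019-04-01 09:00:00", "x"), ("2019-04-01 12:00:00", "x"),
    ("2019-04-02 09:00:00", "x"), ("2019-04-02 10:00:00", "x"),
    ("2019-04-02 13:00:00", "x"), ("2019-04-02 14:00:00", "x"),
]

rows = []
key = lambda e: (e[1], e[0][:10])                 # group by (customer_id, day)
for (cust, day), grp in groupby(sorted(events, key=key), key=key):
    times = sorted(datetime.fromisoformat(t) for t, _ in grp)
    for enter, leave in zip(times[0::2], times[1::2]):   # pair in/out events
        h = enter
        while h <= leave:                          # inclusive, like BETWEEN
            rows.append((cust, day, h.strftime("%H:%M:%S")))
            h += timedelta(hours=1)

for r in rows:
    print(r)
```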

SQL Query - Combine rows based on multiple columns

On the image above, I'd like to combine rows with the same value on consecutive days.
Combined rows will have the earliest date on From column and the latest date on To column.
Looking at the example, even though Rows 3 and 4 have the same value, they were not combined because of the date gap.
I've tried using LAG and LEAD functions but no luck.
You can try the following approach:
DEMO
with c as
(
select *, datediff(dd,todate,leadval) as leaddiff,
datediff(dd,todate,lagval) as lagdiff
from
(
select *,lead(todate) over(partition by value order by todate) leadval,
lag(todate) over(partition by value order by todate) lagval
from t1
)A
)
select * from
(
select value,min(todate) as fromdate,max(todate) as todate from c
where coalesce(leaddiff,0)+coalesce(lagdiff,0) in (1,-1)
group by value
union all
select value,fromdate,todate from c
where coalesce(leaddiff,0)+coalesce(lagdiff,0)>1 or coalesce(leaddiff,0)+coalesce(lagdiff,0)<-1
)A order by value
OUTPUT:
value fromdate todate
1 16/07/2019 00:00:00 17/07/2019 00:00:00
3 21/07/2019 00:00:00 26/07/2019 00:00:00
2 18/07/2019 00:00:00 18/07/2019 00:00:00
2 20/07/2019 00:00:00 20/07/2019 00:00:00
I am going to recommend the following approach:
Find where each new group begins. You can do this by comparing the previous maximum todate with the fromdate in this row.
Do a cumulative sum of the starts to define a group.
Aggregate the results.
This can be handled using window functions and aggregation:
select value, min(fromdate) as fromdate, max(todate) as todate
from (select t.*,
sum(case when prev_todate >= dateadd(day, -1, fromdate)
then 0 -- overlap, so this does not begin a new group
else 1 -- no overlap, so this does begin a new group
end) over
(partition by value order by fromdate) as grp
from (select t.*,
max(todate) over (partition by value
order by fromdate
rows between unbounded preceding and 1 preceding
) as prev_todate
from t
) t
) t
group by value, grp
order by value, min(fromdate);
Here is a db<>fiddle.
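Here is the same query run in SQLite via Python's sqlite3 with hypothetical sample data shaped like the output above (dateadd(day, -1, fromdate) becomes SQLite's date(fromdate, '-1 day')):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (value INTEGER, fromdate TEXT, todate TEXT);
INSERT INTO t VALUES
  (1, '2019-07-16', '2019-07-16'), (1, '2019-07-17', '2019-07-17'),
  (2, '2019-07-18', '2019-07-18'), (2, '2019-07-20', '2019-07-20'),
  (3, '2019-07-21', '2019-07-23'), (3, '2019-07-24', '2019-07-26');
""")
rows = con.execute("""
SELECT value, MIN(fromdate) AS fromdate, MAX(todate) AS todate
FROM (SELECT t.*,
             -- cumulative sum of "new group" flags defines the group
             SUM(CASE WHEN prev_todate >= date(fromdate, '-1 day') THEN 0 ELSE 1 END)
               OVER (PARTITION BY value ORDER BY fromdate) AS grp
      FROM (SELECT t.*,
                   MAX(todate) OVER (PARTITION BY value ORDER BY fromdate
                                     ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_todate
            FROM t) t) t
GROUP BY value, grp
ORDER BY value, MIN(fromdate)
""").fetchall()
for r in rows:
    print(r)
```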

How can I reference column values from previous rows in BigQuery SQL, in order to perform operations or calculations?

I have sorted my data by start time, and I want to create a new field that rolls up rows whose start time overlaps the previous row's start and end time.
More specifically, I want to write logic that, for a given record X, if the start time is somewhere between the start and end time of the previous row, I want to give record X the same value for the new field as that previous row. If the start time happens after the end time of the previous row, it would get a new value for the new field.
Is something like this possible in BigQuery SQL? Was thinking maybe lag or window function, but not quite sure. Below are examples of what the base table looks like and what I want for the final table.
Any insight appreciated!
Below is for BigQuery Standard SQL
#standardSQL
SELECT recordID, startTime, endTime,
COUNTIF(newRange) OVER(ORDER BY startTime) AS newRecordID
FROM (
SELECT *,
startTime >= MAX(endTime) OVER(ORDER BY startTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS newRange
FROM `project.dataset.table`
)
You can test, play with above using sample data from your question as in example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 recordID, TIME '12:35:00' startTime, TIME '12:50:00' endTime UNION ALL
SELECT 2, '12:46:00', '12:59:00' UNION ALL
SELECT 3, '14:27:00', '16:05:00' UNION ALL
SELECT 4, '15:48:00', '16:35:00' UNION ALL
SELECT 5, '16:18:00', '17:04:00'
)
SELECT recordID, startTime, endTime,
COUNTIF(newRange) OVER(ORDER BY startTime) AS newRecordID
FROM (
SELECT *,
startTime >= MAX(endTime) OVER(ORDER BY startTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS newRange
FROM `project.dataset.table`
)
-- ORDER BY startTime
with result
Row recordID startTime endTime newRecordID
1 1 12:35:00 12:50:00 0
2 2 12:46:00 12:59:00 0
3 3 14:27:00 16:05:00 1
4 4 15:48:00 16:35:00 1
5 5 16:18:00 17:04:00 1
This is a gaps and islands problem. What you want to do is assign a group id to non-intersecting groups. You can calculate the non-intersections using window functions.
A record starts a new group if the cumulative maximum of the end time, ordered by start time and ending at the previous record, is less than the current start time. The rest is just a cumulative sum to assign a group id.
For your data:
select t.*,
sum(case when prev_endtime >= starttime then 0 else 1 end) over (order by starttime) as group_id
from (select t.*,
max(endtime) over (order by starttime rows between unbounded preceding and 1 preceding) as prev_endtime
from t
) t;
The only potential issue is if two records start at exactly the same time. If this can happen, the logic might need to be slightly more complex.
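The island assignment can be verified against the question's sample data using Python's sqlite3 (an assumption — the question targets BigQuery, but the window functions behave the same; COUNTIF(cond) becomes SUM over a 0/1 flag, with COALESCE handling the first row's NULL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (recordID INTEGER, startTime TEXT, endTime TEXT);
INSERT INTO t VALUES
  (1, '12:35:00', '12:50:00'), (2, '12:46:00', '12:59:00'),
  (3, '14:27:00', '16:05:00'), (4, '15:48:00', '16:35:00'),
  (5, '16:18:00', '17:04:00');
""")
rows = con.execute("""
SELECT recordID, startTime, endTime,
       SUM(newRange) OVER (ORDER BY startTime) AS newRecordID
FROM (SELECT t.*,
             -- new range when this row starts after every earlier row has ended;
             -- the first row has no preceding MAX, so COALESCE it to "not new"
             COALESCE(startTime >= MAX(endTime) OVER (ORDER BY startTime
                        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS newRange
      FROM t)
ORDER BY startTime
""").fetchall()
for r in rows:
    print(r)
```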

To club the rows for week days

I have data like below:
StartDate EndDate Duration
----------
41890 41892 3
41898 41900 3
41906 41907 2
41910 41910 1
StartDate and EndDate are the respective ID values for dates from a calendar table. I want to calculate the sum of duration for consecutive days, where days separated only by a weekend still count as consecutive. E.g., in the above data, let's say 41908 and 41909 are weekend days; then my required result set should look like below.
I already have another proc that can return the next working day, i.e. if I pass 41907 or 41908 or 41909 as DateID to that proc, it will return 41910 as the next working day. Basically, if the DateID returned by my proc for an EndDate is the same as the next row's StartDate, then both rows should be clubbed. Below is the data I want to get.
ID StartDate EndDate Duration
----------
278457 41890 41892 3
278457 41898 41900 3
278457 41906 41910 3
Please let me know in case the requirement is not clear, I can explain further.
My Date Table is like below:
DateId Date Day
----------
41906 09-04-2014 Thursday
41907 09-05-2014 Friday
41908 09-06-2014 Saturday
41909 09-07-2014 Sunday
41910 09-08-2014 Monday
Here is the SQL Code for setup:
CREATE TABLE Table1
(
StartDate INT,
EndDate INT,
LeaveDuration INT
)
INSERT INTO Table1
VALUES(41890, 41892, 3),
(41898, 41900, 3),
(41906, 41907, 2),
(41910, 41910, 1)
CREATE TABLE DateTable
(
DateID INT,
Date DATETIME,
Day VARCHAR(20)
)
INSERT INTO DateTable
VALUES(41907, '09-05-2014', 'Friday'),
(41908, '09-06-2014', 'Saturday'),
(41909, '09-07-2014', 'Sunday'),
(41910, '09-08-2014', 'Monday'),
(41911, '09-09-2014', 'Tuesday')
This is rather complicated. Here is an approach using window functions.
First, use the date table to enumerate the dates without weekends (you can also take out holidays if you want). Then, expand the periods into one day per row, by using a non-equijoin.
You can then use a trick to identify sequential days. This trick is to generate a sequential number for each id and subtract it from the sequential number for the dates. This is a constant for sequential days. The final step is simply an aggregation.
The resulting query is something like this:
with d as (
select d.*, row_number() over (order by date) as seqnum
from dates d
where day not in ('Saturday', 'Sunday')
)
select t.id, min(t.date) as startdate, max(t.date) as enddate, sum(duration)
from (select t.*, ds.seqnum, ds.date,
(ds.seqnum - row_number() over (partition by id order by ds.date) ) as grp
from table t join
d ds
on ds.date between t.startdate and t.enddate
) t
group by t.id, grp;
EDIT:
The following is the version on this SQL Fiddle:
with d as (
select d.*, row_number() over (order by date) as seqnum
from datetable d
where day not in ('Saturday', 'Sunday')
)
select t.id, min(t.date) as startdate, max(t.date) as enddate, sum(duration)
from (select t.*, ds.seqnum, ds.date,
(ds.seqnum - row_number() over (partition by id order by ds.date) ) as grp
from (select t.*, 'abc' as id from table1 t) t join
d ds
on ds.dateid between t.startdate and t.enddate
) t
group by grp;
I believe this is working, but the date table doesn't have all the dates in it.
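A runnable end-to-end check of the seqnum trick using Python's sqlite3 (an assumption — the answer targets SQL Server): the date table is generated from the fact that 41907 is a Friday, and the merged duration is computed as COUNT(*) of the expanded working days, since summing the stored duration column would over-count (each expanded day-row repeats its range's duration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE datetable (dateid INTEGER, day TEXT)")
names = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
# Anchor: the question states that DateID 41907 is a Friday (index 4).
con.executemany(
    "INSERT INTO datetable VALUES (?, ?)",
    [(d, names[(4 + d - 41907) % 7]) for d in range(41888, 41913)],
)
con.execute("CREATE TABLE table1 (startdate INTEGER, enddate INTEGER, duration INTEGER)")
con.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(41890, 41892, 3), (41898, 41900, 3), (41906, 41907, 2), (41910, 41910, 1)],
)
rows = con.execute("""
WITH d AS (
  SELECT dateid, ROW_NUMBER() OVER (ORDER BY dateid) AS seqnum
  FROM datetable
  WHERE day NOT IN ('Saturday', 'Sunday')   -- enumerate working days only
)
SELECT MIN(x.startdate), MAX(x.enddate), COUNT(*) AS duration
FROM (SELECT t.startdate, t.enddate,
             -- constant for runs of consecutive working days
             ds.seqnum - ROW_NUMBER() OVER (ORDER BY ds.dateid) AS grp
      FROM table1 t
      JOIN d ds ON ds.dateid BETWEEN t.startdate AND t.enddate) x
GROUP BY x.grp
ORDER BY MIN(x.startdate)
""").fetchall()
for r in rows:
    print(r)
```

The ranges 41906-41907 and 41910-41910 merge because the weekend days 41908 and 41909 are excluded from the working-day enumeration, making their seqnums consecutive.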