Group rows into sequences using a sliding window on a DateTime column - sql

I have a table that stores timestamped events. I want to group the events into 'sequences' using a 5-minute sliding window on the timestamp column, and write the 'sequence ID' (any ID that can distinguish sequences) and the 'order in sequence' into another table.
Input - event table:
+----+------+-----------+
| Id | Name | Timestamp |
+----+------+-----------+
|  1 | test | 00:00:00  |
|  2 | test | 00:06:00  |
|  3 | test | 00:10:00  |
|  4 | test | 00:14:00  |
+----+------+-----------+
Desired output - sequence table. Here SeqId is the ID of the starting event, but it doesn't have to be, just something to uniquely identify a sequence.
+---------+-------+----------+
| EventId | SeqId | SeqOrder |
+---------+-------+----------+
|       1 |     1 |        1 |
|       2 |     2 |        1 |
|       3 |     2 |        2 |
|       4 |     2 |        3 |
+---------+-------+----------+
What would be the best way to do it? This is MSSQL 2008, I can use SSAS and SSIS if they make things easier.

CREATE TABLE #Input (Id INT, Name VARCHAR(20), Time_stamp TIME)
INSERT INTO #Input
VALUES
( 1 ,'test','00:00:00' ),
( 2 ,'test','00:06:00' ),
( 3 ,'test','00:10:00' ),
( 4 ,'test','00:14:00' )
SELECT * FROM #Input;
WITH cte AS -- add a sequential number
(
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY Id) AS sort
    FROM #Input
), cte2 AS -- find the Ids with a difference of more than 5 min
(
    SELECT cte.*,
           CASE WHEN DATEDIFF(MI, cte_1.Time_stamp, cte.Time_stamp) < 5 THEN 0 ELSE 1 END AS GrpType
    FROM cte
    LEFT OUTER JOIN cte AS cte_1 ON cte.sort = cte_1.sort + 1
), cte3 AS -- assign a SeqId
(
    SELECT GrpType, Time_stamp, ROW_NUMBER() OVER (ORDER BY Time_stamp) AS SeqId
    FROM cte2
    WHERE GrpType = 1
), cte4 AS -- find the Time_stamp range per SeqId
(
    SELECT cte3.*, cte_2.Time_stamp AS TS_to
    FROM cte3
    LEFT OUTER JOIN cte3 AS cte_2 ON cte3.SeqId = cte_2.SeqId - 1
)
-- final query
SELECT
    t.Id,
    cte4.SeqId,
    ROW_NUMBER() OVER (PARTITION BY cte4.SeqId ORDER BY t.Time_stamp) AS SeqOrder
FROM cte4
INNER JOIN #Input t
    ON t.Time_stamp >= cte4.Time_stamp
   AND (t.Time_stamp < cte4.TS_to OR cte4.TS_to IS NULL);
This code is slightly more complex, but it returns the expected output (which Gordon Linoff's solution doesn't...) and it's even slightly faster.

You seem to want things grouped together when they are less than five minutes apart. You can assign the groups by getting the previous time stamp and marking the beginning of a group. You then need to do a cumulative sum to get the group id:
with e as (
select e.*,
(case when datediff(minute, prev_timestamp, timestamp) < 5 then 1 else 0 end) as flag
from (select e.*,
(select top 1 e2.timestamp
from events e2
where e2.timestamp < e.timestamp
order by e2.timestamp desc
) as prev_timestamp
from events e
) e
)
select e.eventId, e.seqId,
row_number() over (partition by seqId order by timestamp) as seqOrder
from (select e.*, (select sum(flag) from e e2 where e2.timestamp <= e.timestamp) as seqId
from e
) e;
By the way, this logic is easier to express in SQL Server 2012+ because the window functions are more powerful.
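The flag-and-cumulative-sum idea above can also be checked outside SQL; here is a minimal Python sketch under the same rule (a gap of 5 minutes or more starts a new sequence; the sample data and function name are illustrative, not from any real schema):

```python
from datetime import datetime, timedelta

# Each event is (id, timestamp) -- the question's four sample rows.
events = [
    (1, datetime(2000, 1, 1, 0, 0, 0)),
    (2, datetime(2000, 1, 1, 0, 6, 0)),
    (3, datetime(2000, 1, 1, 0, 10, 0)),
    (4, datetime(2000, 1, 1, 0, 14, 0)),
]

def assign_sequences(events, gap=timedelta(minutes=5)):
    rows, seq_id, seq_order, prev_ts = [], 0, 0, None
    for event_id, ts in sorted(events, key=lambda e: e[1]):
        if prev_ts is None or ts - prev_ts >= gap:
            seq_id += 1          # flag = 1: this row starts a new sequence
            seq_order = 1
        else:
            seq_order += 1       # flag = 0: continue the current sequence
        rows.append((event_id, seq_id, seq_order))
        prev_ts = ts
    return rows

print(assign_sequences(events))
# [(1, 1, 1), (2, 2, 1), (3, 2, 2), (4, 2, 3)]
```

The running `seq_id` plays the role of the cumulative SUM of flags, and `seq_order` of the per-sequence ROW_NUMBER.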

Related

How to remove duplicates while sorting by unique datetime

I am working with a fairly bad data source: the column that has the information I need is a delimited varchar(max). The data can be duplicated across multiple rows, so I am trying to remove these duplicates.
This can be done by trimming the column I am interested in, because when a repeat occurs the "ID" gets re-appended to the end of the column. I then take a DISTINCT of that and concatenate the results; it isn't pretty.
Example data and the query I currently use are in this SQL Fiddle.
Data Table
| id | callID | callDateTime | history |
|----|--------|-----------------------------|-------------------------------------|
| 1 | 1 | 2021-01-01 10:00:00.0000000 | Amount: 10, Ref:123, ID:123 |
| 2 | 1 | 2021-01-01 10:01:00.0000000 | Amount: 10, Ref:123, ID:123, ID:123 |
| 3 | 2 | 2021-01-01 11:00:00.0000000 | Amount:12.44, Ref:SIS, ID:124 |
| 4 | 2 | 2021-01-01 11:02:00.0000000 | Amount:11.22, Ref:Dad, ID:124 |
| 5 | 2 | 2021-01-01 11:01:00.0000000 | Amount:11.22, Ref:Mum, ID:124 |
| 6 | 3 | 2021-01-01 12:00:00.0000000 | Amount:11, ID:125 |
Query
select CallID, Concat([1],',', [2],',',[3])
from
(
select CallID, historyEdit, ROW_NUMBER() over (partition by callID order by callID) as rowNum
from
(
select distinct callID,
substring(history, 0, charindex(', ID:',history)) historyEdit
from test
) a
)b
PIVOT(max(historyEdit) for rowNum IN ([1],[2],[3])) piv
Result
| CallID | |
|--------|-------------------------------------------------------------------|
| 1 | Amount: 10, Ref:123,, |
| 2 | Amount:11.22, Ref:Dad,Amount:11.22, Ref:Mum,Amount:12.44, Ref:SIS |
| 3 | Amount:11,, |
The issue is that I need to ensure the concatenation happens in the order the events occurred. In the above you will see that CallID 2 is in the wrong order, as Information 3 comes before Information 2. I did try sorting the base table by callDateTime first and then running the query, but it yields somewhat random results: sometimes the order is correct, other times it isn't. I assume this is because I am not specifying any ORDER BY clause in the query.
Including the callDateTime in the results then causes the DISTINCT not to return unique data rows, as the callDateTime is still unique to each duplicated row of data.
I am using SQL Server v12
Desired Result
| CallID | |
|--------|-------------------------------------------------------------------|
| 1 | Amount: 10, Ref:123,, |
| 2 | Amount:12.44, Ref:SIS,Amount:11.22, Ref:Mum,Amount:11.22, Ref:Dad |
| 3 | Amount:11,, |
If I understand correctly, you want to break apart the history and recombine -- without duplicates -- for each callid. If so, you can use string_split() and string_agg():
select callid, string_agg(value, ', ')
from (select distinct t.callid, s.value
from test t cross apply
(select trim(s.value) as value
from string_split(t.history, ',') s
) s
) st
group by callid;
Here is a db<>fiddle.
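One caveat: string_split() does not guarantee output order (an ordinal only became available later, via enable_ordinal in SQL Server 2022), so time ordering can still be lost. As a language-neutral illustration, here is a rough Python sketch of the trim-dedupe-concatenate logic (rows copied from the question; the helper name is made up):

```python
# Hypothetical rows: (callID, callDateTime, history), as in the question.
rows = [
    (1, "2021-01-01 10:00:00", "Amount: 10, Ref:123, ID:123"),
    (1, "2021-01-01 10:01:00", "Amount: 10, Ref:123, ID:123, ID:123"),
    (2, "2021-01-01 11:00:00", "Amount:12.44, Ref:SIS, ID:124"),
    (2, "2021-01-01 11:02:00", "Amount:11.22, Ref:Dad, ID:124"),
    (2, "2021-01-01 11:01:00", "Amount:11.22, Ref:Mum, ID:124"),
    (3, "2021-01-01 12:00:00", "Amount:11, ID:125"),
]

def dedupe_histories(rows):
    """Trim the trailing ', ID:...' part, dedupe per call,
    and concatenate in callDateTime order."""
    per_call = {}
    for call_id, ts, history in sorted(rows, key=lambda r: (r[0], r[1])):
        trimmed = history.split(", ID:")[0]
        # dict keys preserve insertion order (Python 3.7+), so the
        # earliest occurrence wins and time order is kept
        per_call.setdefault(call_id, {})[trimmed] = None
    return {cid: ",".join(parts) for cid, parts in per_call.items()}

print(dedupe_histories(rows)[2])
# Amount:12.44, Ref:SIS,Amount:11.22, Ref:Mum,Amount:11.22, Ref:Dad
```

Unlike the PIVOT approach, this produces no trailing commas for unused slots, which matches the desired result up to that difference.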
You could use a TOP clause inside the SELECT to order the records before pivoting the results, if you are sure of the number of records, like below:
select callID, historyEdit
from
(
select distinct top 100000 callID, callDateTime,
substring(history, 0, charindex(', ID:',history)) historyEdit
from test
order by callDateTime
)t
Please see the results here.
One way to build the strings could be to CROSS APPLY an ordinal splitter to separate the 'history' column into components that can be enumerated. The result is very close to what was provided in the question. Maybe the provided expected results aren't accurately representative? Something like this:
Ordinal splitter described here
CREATE FUNCTION [dbo].[DelimitedSplit8K_LEAD]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
                -- for both a performance gain and prevention of accidental "overruns"
    SELECT 0 UNION ALL
    SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
    SELECT t.N+1
    FROM cteTally t
    WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
       Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1) OVER (ORDER BY s.N1) - 1),0)-s.N1,8000))
FROM cteStart s
;
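For reference, the splitter's output contract — each element paired with its 1-based ordinal — is easy to mimic; a minimal Python sketch of the same (ItemNumber, Item) shape:

```python
def delimited_split(s, delimiter=","):
    """Return (ItemNumber, Item) pairs like the T-SQL splitter:
    each element of the split string with its 1-based position."""
    return [(i, item) for i, item in enumerate(s.split(delimiter), start=1)]

print(delimited_split("Amount:11, ID:125"))
# [(1, 'Amount:11'), (2, ' ID:125')]
```

Note that, like the T-SQL version, elements keep any leading whitespace; trimming is left to the caller.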
query
with
unq_cte as (
select distinct callID
from #test),
exp_cte as (
select callID, callDateTime , dl.*,
row_number() over (partition by callID, dl.Item order by callDateTime) as rn
from #test t
cross apply dbo.DelimitedSplit8K_LEAD(t.history, ',') dl)
select t.callID,
stuff((select ',' + case when rn>1 then '' else Item end
from exp_cte tt
where t.callID = tt.callID
and ltrim(rtrim(Item)) not like 'ID%'
order by tt.callDateTime, tt.ItemNumber for xml path('')), 1, 1, '') [value1]
from unq_cte t
group by t.callID;
callID value1
1 Amount: 10, Ref:123,,
2 Amount:12.44, Ref:SIS,Amount:11.22, Ref:Mum,, Ref:Dad
3 Amount:11

SQL select a row X times and insert into new

I am trying to migrate a bunch of data from an old database to a new one. The old one used to just store the number of alarms that occurred on a single row; the new database inserts a new record for each alarm that occurs. Here is a basic version of how it might look. I want to select each row from Table 1 and insert one row into Table 2 for each unit of its Alarm Value.
Table 1:
| Alarm ID | Alarm Value |
|--------------|----------------|
| 1 | 3 |
| 2 | 2 |
Should go into the alarm table as the below values.
Table 2:
| Alarm New ID | Value |
|--------------|----------|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
I want to create a SELECT ... INSERT script that will do this, so the SELECT statement brings back as many rows as the "Value" column indicates.
A recursive CTE can be convenient for this:
with cte as (
select id, alarm, 1 as n
from t
union all
select id, alarm, n + 1
from cte
where n < alarm
)
select row_number() over (order by id) as alarm_id, id as value
from cte
order by 1
option (maxrecursion 0);
Note: If your values do not exceed 100, then you can remove OPTION (MAXRECURSION 0).
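The recursion above just emits one row per unit of the value and renumbers; as a sanity check, the same expansion in a short Python sketch (names follow the question's tables):

```python
def expand_alarms(table1):
    """For each (Alarm ID, Alarm Value) row, emit Alarm Value rows of
    (Alarm New ID, Value), where Value is the original Alarm ID."""
    out = []
    for alarm_id, alarm_value in sorted(table1):
        for _ in range(alarm_value):
            out.append((len(out) + 1, alarm_id))
    return out

print(expand_alarms([(1, 3), (2, 2)]))
# [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2)]
```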
Replicate values out with a CTE.
DECLARE #T TABLE(AlarmID INT, Value INT)
INSERT #T VALUES
(1,3),
(2,2)
;WITH ReplicateAmount AS
(
SELECT AlarmID, Value FROM #T
UNION ALL
SELECT R.AlarmID, Value=(R.Value - 1)
FROM ReplicateAmount R
INNER JOIN #T T ON R.AlarmID = T.AlarmID
WHERE R.Value > 1
)
SELECT
AlarmID = ROW_NUMBER() OVER( ORDER BY AlarmID),
Value = AlarmID --??
FROM
ReplicateAmount
ORDER BY
AlarmID
This answers your question. I would think the query below would be more useful; however, you did not include the usage context.
SELECT
AlarmID,
Value
FROM
ReplicateAmount
ORDER BY
AlarmID
Rather than using an rCTE, which is recursive (as the name suggests) and will fail past 100 rows unless you raise MAXRECURSION, you can use a Tally table, which tends to be far faster as well:
WITH N AS(
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL))N(N)),
Tally AS(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
FROM N N1, N N2, N N3)
SELECT ROW_NUMBER() OVER (ORDER BY V.AlarmID,T.I) AS AlarmNewID,
V.AlarmID
FROM (VALUES(1,3),(2,2))V(AlarmID,AlarmValue)
JOIN Tally T ON V.AlarmValue >= T.I;
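The difference from the rCTE is mechanical: instead of recursing, you join each row to a precomputed number list and keep the numbers up to the value. A rough Python rendering of that join (the 1000-row tally size is arbitrary):

```python
def expand_with_tally(values, max_n=1000):
    """Tally-table sketch: cross-join each (id, value) row with numbers
    1..max_n, keeping numbers <= value, then renumber the result."""
    tally = range(1, max_n + 1)
    joined = [(alarm_id, i)
              for alarm_id, value in sorted(values)
              for i in tally if i <= value]
    return [(new_id, alarm_id)
            for new_id, (alarm_id, _) in enumerate(joined, start=1)]

print(expand_with_tally([(1, 3), (2, 2)]))
# [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2)]
```

A real tally table would carry an index and no recursion depth limit, which is where the performance advantage comes from.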

Exclude first record associated with each parent record in Postgres

There are 2 tables, users and job_experiences.
I want to return a list of all job_experiences except the first associated with each user.
users
id
---
1
2
3
job_experiences
id | start_date | user_id
--------------------------
1 | 201001 | 1
2 | 201201 | 1
3 | 201506 | 1
4 | 200901 | 2
5 | 201005 | 2
Desired result
id | start_date | user_id
--------------------------
2 | 201201 | 1
3 | 201506 | 1
5 | 201005 | 2
Current query
select
*
from job_experiences
order by start_date asc
offset 1
But this doesn't work as it would need to apply the offset to each user individually.
You can do this with a lateral join:
select je.*
from users u cross join lateral
(select je.*
from job_experiences je
where u.id = je.user_id
order by id
offset 1 -- all except the first
) je;
For performance, an index on job_experiences(user_id, id) is recommended.
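The per-user OFFSET 1 logic can be mirrored procedurally; a small Python sketch with the question's rows (ordering by id, as the lateral subquery does):

```python
def all_but_first(job_experiences):
    """Return every (id, start_date, user_id) row except each user's
    first, where 'first' means the lowest id."""
    seen_first = set()
    result = []
    for row in sorted(job_experiences):          # sorted by id
        user_id = row[2]
        if user_id in seen_first:
            result.append(row)
        else:
            seen_first.add(user_id)              # drop the first per user
    return result

rows = [
    (1, "201001", 1), (2, "201201", 1), (3, "201506", 1),
    (4, "200901", 2), (5, "201005", 2),
]
print(all_but_first(rows))
# [(2, '201201', 1), (3, '201506', 1), (5, '201005', 2)]
```

Note that "first" here follows id order, as in the answer's ORDER BY; order by start_date instead if that is what defines the first job.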
Use the row_number() window function:
with cte as
(
select e.*,
row_number()over(partition by user_id order by start_date desc) rn,
count(*) over(partition by user_id) cnt
from users u join job_experiences e on u.id=e.user_id
)
, cte2 as
(
select * from cte
) select * from cte2 t1
where rn<=(select max(cnt)-1 from cte2 t2 where t1.user_id=t2.user_id)
You could use an intermediate CTE to get the first (MIN) jobs for each user, and then use that to determine which records to exclude:
WITH user_first_je("user_id", "job_id") AS
(
SELECT "user_id", MIN("id")
FROM job_experiences
GROUP BY "user_id"
)
SELECT job_experiences.*
FROM job_experiences
LEFT JOIN user_first_je ON
user_first_je.job_id = job_experiences.id
WHERE user_first_je.job_id IS NULL;

Count values checking if consecutive

This is my table:
Event Order Timestamp
delFailed 281475031393706 2018-07-24T15:48:08.000Z
reopen 281475031393706 2018-07-24T15:54:36.000Z
reopen 281475031393706 2018-07-24T15:54:51.000Z
I need to count the number of 'delFailed' and 'reopen' events to calculate #delFailed - #reopen.
The difficulty is that two identical consecutive events must only be counted once, so in this case the result should be "0", not "-1".
This is what I have achieved so far (which is wrong because it gives me -1 instead of 0, due to the two consecutive "reopen" events):
with
events as (
select
event as events,
orders,
"timestamp"
from main_source_execevent
where orders = '281475031393706'
and event in ('reopen', 'delFailed')
order by "timestamp"
),
count_events as (
select
count(events) as CEvents,
events,
orders
from events
group by orders, events
)
select (
(select cevents from count_events where events = 'delFailed') - (select cevents from count_events where events = 'reopen')
) as nAttempts,
orders
from count_events
group by orders
How can i count once if there are two same consecutive events?
It is a gaps-and-islands problem; you can use two row numbers to detect whether rows are identical consecutive events.
Explanation:
one row number is created over the whole set, in timestamp order;
another row number is created per Event value.
SELECT *
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
| event | Order | timestamp | grp | rn |
|-----------|-----------------|----------------------|-----|----|
| delFailed | 281475031393706 | 2018-07-24T15:48:08Z | 1 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:36Z | 2 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:51Z | 3 | 2 |
Once you create those two row numbers you get the result above; then grp - rn tells you whether rows are consecutive duplicates.
SELECT *,grp-rn
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
| event | Order | timestamp | grp | rn | grp-rn |
|-----------|-----------------|----------------------|-----|----|----------|
| delFailed | 281475031393706 | 2018-07-24T15:48:08Z | 1 | 1 | 0 |
| reopen | 281475031393706 | 2018-07-24T15:54:36Z | 2 | 1 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:51Z | 3 | 2 | 1 |
You can see that two identical consecutive events share the same grp-rn value, so we can group by the grp-rn column (together with Event) and count each run once.
Final query.
CREATE TABLE T(
Event VARCHAR(50),
"Order" VARCHAR(50),
Timestamp Timestamp
);
INSERT INTO T VALUES ('delFailed',281475031393706,'2018-07-24T15:48:08.000Z');
INSERT INTO T VALUES ('reopen',281475031393706,'2018-07-24T15:54:36.000Z');
INSERT INTO T VALUES ('reopen',281475031393706,'2018-07-24T15:54:51.000Z');
Query 1:
SELECT
SUM(CASE WHEN event = 'delFailed' THEN 1 END) -
SUM(CASE WHEN event = 'reopen' THEN 1 END) result
FROM (
SELECT Event,COUNT(distinct Event)
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
group by grp - rn,Event
)t1
Results:
| result |
|--------|
| 0 |
I would just use lag() to get the first event in any sequence of similar values. Then do the calculation:
select sum( (event = 'reopen')::int ) as num_reopens,
sum( (event = 'delFailed')::int ) as num_delFailed
from (select mse.*,
lag(event) over (partition by orders order by "timestamp") as prev_event
from main_source_execevent mse
where orders = '281475031393706' and
event in ('reopen', 'delFailed')
) e
where prev_event <> event or prev_event is null;
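The lag() trick — count an event only when it differs from the immediately preceding one — can be checked with a few lines of Python (sample events from the question):

```python
def count_delta(events):
    """Count delFailed minus reopen over (timestamp, event) pairs,
    ignoring an event that repeats the immediately preceding one."""
    counts = {"delFailed": 0, "reopen": 0}
    prev = None
    for _, event in sorted(events):      # order by timestamp
        if event != prev:                # lag(): keep only run starts
            counts[event] += 1
        prev = event
    return counts["delFailed"] - counts["reopen"]

events = [
    ("2018-07-24T15:48:08", "delFailed"),
    ("2018-07-24T15:54:36", "reopen"),
    ("2018-07-24T15:54:51", "reopen"),
]
print(count_delta(events))
# 0
```

The duplicated "reopen" is skipped because its predecessor is also "reopen", giving 1 - 1 = 0 as required.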

Grouping SQL Results based on order

I have table with data something like this:
ID | RowNumber | Data
------------------------------
1 | 1 | Data
2 | 2 | Data
3 | 3 | Data
4 | 1 | Data
5 | 2 | Data
6 | 1 | Data
7 | 2 | Data
8 | 3 | Data
9 | 4 | Data
I want to group each set of RowNumbers So that my result is something like this:
ID | RowNumber | Group | Data
--------------------------------------
1 | 1 | a | Data
2 | 2 | a | Data
3 | 3 | a | Data
4 | 1 | b | Data
5 | 2 | b | Data
6 | 1 | c | Data
7 | 2 | c | Data
8 | 3 | c | Data
9 | 4 | c | Data
The only way I know where each group starts and stops is when the RowNumber starts over. How can I accomplish this? It also needs to be fairly efficient, since the table I need to do this on has 52 million rows.
Additional Info
ID is truly sequential, but RowNumber may not be. I think RowNumber will always begin with 1 but for example the RowNumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6", etc.
For the clarified requirements in the comments
The rownumbers for group1 could be "1,1,2,2,3,4" and for group2 they
could be "1,2,4,6" ... a higher number followed by a lower would be a
new group.
A SQL Server 2012 solution could be as follows.
Use LAG to access the previous row and set a flag to 1 if that row is the start of a new group or 0 otherwise.
Calculate a running sum of these flags to use as the grouping value.
Code
WITH T1 AS
(
SELECT *,
LAG(RowNumber) OVER (ORDER BY ID) AS PrevRowNumber
FROM YourTable
), T2 AS
(
SELECT *,
IIF(PrevRowNumber IS NULL OR PrevRowNumber > RowNumber, 1, 0) AS NewGroup
FROM T1
)
SELECT ID,
RowNumber,
Data,
SUM(NewGroup) OVER (ORDER BY ID
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM T2
SQL Fiddle
Assuming ID is the clustered index the plan for this has one scan against YourTable and avoids any sort operations.
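The same LAG-plus-running-sum grouping, sketched in Python for clarity (per the clarified requirements, a new group starts when RowNumber drops below its predecessor; equal or higher values continue the group):

```python
def assign_groups(rows):
    """rows: (ID, RowNumber, Data), processed in ID order. Returns
    (ID, RowNumber, Group, Data) where Group is a running sum of
    new-group flags, mirroring LAG + SUM OVER."""
    out, grp, prev = [], 0, None
    for row_id, row_number, data in sorted(rows):
        if prev is None or row_number < prev:
            grp += 1                  # flag = 1: RowNumber restarted
        out.append((row_id, row_number, grp, data))
        prev = row_number
    return out

rows = [(i + 1, rn, "Data") for i, rn in enumerate([1, 2, 3, 1, 2, 1, 2, 3, 4])]
for r in assign_groups(rows):
    print(r)
```

On the question's data this yields groups 1, 2, 3 in place of a, b, c.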
If the ids are truly sequential, you can do:
select t.*,
(id - rowNumber) as grp
from t
You can also use a recursive CTE:
;WITH cte AS
(
SELECT ID, RowNumber, Data, 1 AS [Group]
FROM dbo.test1
WHERE ID = 1
UNION ALL
SELECT t.ID, t.RowNumber, t.Data,
CASE WHEN t.RowNumber != 1 THEN c.[Group] ELSE c.[Group] + 1 END
FROM dbo.test1 t JOIN cte c ON t.ID = c.ID + 1
)
SELECT *
FROM cte
Demo on SQLFiddle
How about:
select ID, RowNumber, Data, dense_rank() over (order by grp) as Grp
from (
select *, (select min(ID) from [Your Table] where ID > t.ID and RowNumber = 1) as grp
from [Your Table] t
) t
order by ID
This should work on SQL 2005. You could also use rank() instead if you don't care about consecutive numbers.