How to bucket data based on timestamps within a certain period or previous record? - sql

I have some data that I'm trying to bucket. Let's say the data has a user and a timestamp. I want to define a session as any rows that have a timestamp within 10 minutes of the previous timestamp for the same user.
How would I go about this in SQL?
Example
+------+---------------------+---------+
| user | timestamp | session |
+------+---------------------+---------+
| 1 | 2021-05-09 15:12:52 | 1 |
| 1 | 2021-05-09 15:18:52 | 1 | within 10 min of previous timestamp
| 1 | 2021-05-09 15:32:52 | 2 | over 10 min, new session
| 2 | 2021-05-09 16:00:00 | 1 | different user
| 1 | 2021-05-09 17:00:00 | 3 | new session
| 1 | 2021-05-09 17:02:00 | 3 |
+------+---------------------+---------+
This will give me records within 10 minutes but how would I bucket them like above?
with cte as (
select user,
timestamp,
lag(timestamp) over (partition by user order by timestamp) as last_timestamp
from table
)
select *
from cte
where datediff(minute, last_timestamp, timestamp) <= 10

Try this one. It's basically an edge-detection problem: flag each row that starts a new session, then take a running sum of the flags per user.
Working test case for SQL Server
The SQL:
with cte as (
select user1
, timestamp1
, session1 AS session_expected
, lag(timestamp1) over (partition by user1 order by timestamp1) as last_timestamp
, CASE WHEN datediff(n, lag(timestamp1) over (partition by user1 order by timestamp1), timestamp1) <= 10 THEN 0 ELSE 1 END AS edge
from table1
)
select *, SUM(edge) OVER (PARTITION BY user1 ORDER BY timestamp1) AS session_actual
from cte
ORDER BY timestamp1
;
Additional suggestion, see ROWS UNBOUNDED PRECEDING (thanks to @Charlieface):
with cte as (
select user1
, timestamp1
, session1 AS session_expected
, lag(timestamp1) over (partition by user1 order by timestamp1) as last_timestamp
, CASE WHEN datediff(n, lag(timestamp1) over (partition by user1 order by timestamp1), timestamp1) <= 10 THEN 0 ELSE 1 END AS edge
from table1
)
select *
, SUM(edge) OVER (PARTITION BY user1 ORDER BY timestamp1 ROWS UNBOUNDED PRECEDING) AS session_actual
from cte
ORDER BY timestamp1
;
Setup:
CREATE TABLE table1 (user1 int, timestamp1 datetime, session1 int);
INSERT INTO table1 VALUES
( 1 , '2021-05-09 15:12:52' , 1 )
, ( 1 , '2021-05-09 15:18:52' , 1 ) -- within 10 min of previous timestamp
, ( 1 , '2021-05-09 15:32:52' , 2 ) -- over 10 min, new session
, ( 2 , '2021-05-09 16:00:00' , 1 ) -- different user
, ( 1 , '2021-05-09 17:00:00' , 3 ) -- new session
, ( 1 , '2021-05-09 17:02:00' , 3 )
;
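If you want to try the edge-plus-running-sum idea without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). Table and column names follow the setup above. The gap is computed in seconds via strftime, which is an assumption of this sketch and not identical to DATEDIFF, since DATEDIFF counts minute boundaries crossed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (user1 INTEGER, timestamp1 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        (1, "2021-05-09 15:12:52"),
        (1, "2021-05-09 15:18:52"),  # within 10 min of previous timestamp
        (1, "2021-05-09 15:32:52"),  # over 10 min, new session
        (2, "2021-05-09 16:00:00"),  # different user
        (1, "2021-05-09 17:00:00"),  # new session
        (1, "2021-05-09 17:02:00"),
    ],
)

query = """
WITH cte AS (
    SELECT user1, timestamp1,
           -- edge = 1 on the first row of a user or when the gap exceeds 600 s;
           -- a NULL gap (no previous row) falls through to ELSE 1
           CASE WHEN CAST(strftime('%s', timestamp1) AS INTEGER)
                   - CAST(strftime('%s', LAG(timestamp1) OVER (
                         PARTITION BY user1 ORDER BY timestamp1)) AS INTEGER)
                   <= 600
                THEN 0 ELSE 1 END AS edge
    FROM table1
)
SELECT user1, timestamp1,
       -- running sum of edges = session number within each user
       SUM(edge) OVER (PARTITION BY user1 ORDER BY timestamp1
                       ROWS UNBOUNDED PRECEDING) AS session_actual
FROM cte
ORDER BY timestamp1
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The session_actual column should line up with the session column in the question's example.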

Related

SQL- Return rows after nth occurrence of event per user

I'm using PostgreSQL 8.0 and I have a table with user_id, timestamp, and event_id.
How can I return the rows (or row) after the 4th occurrence of event_id = someID per user?
|---------------------|--------------------|------------------|
| user_id | timestamp | event_id |
|---------------------|--------------------|------------------|
| 1 | 2020-04-02 12:00 | 11 |
|---------------------|--------------------|------------------|
| 2 | 2020-04-02 13:00 | 11 |
|---------------------|--------------------|------------------|
| 2 | 2020-04-02 14:00 | 99 |
|---------------------|--------------------|------------------|
| 2 | 2020-04-02 15:00 | 11 |
|---------------------|--------------------|------------------|
| 2 | 2020-04-02 16:00 | 11 |
|---------------------|--------------------|------------------|
| 2 | 2020-04-02 17:00 | 11 |
|---------------------|--------------------|------------------|
| 2 | 2020-04-02 17:00 | 11 |
|---------------------|--------------------|------------------|
I.e., if event_id = 11, I would only want the last row in the table above.
You can use window functions:
select *
from (
select t.*, row_number() over(partition by user_id, event_id order by timestamp) rn
from mytable t
) t
where rn > 4
Here is a little trick that removes the row number from the result:
select (t).*
from (
select t, row_number() over(partition by user_id, event_id order by timestamp) rn
from mytable t
) x
where rn > 4
You can use a cumulative count. This version includes the 4th occurrence:
select t.*
from (select t.*,
count(*) filter (where event_id = 11) over (partition by user_id order by timestamp) as event_11_cnt
from t
) t
where event_11_cnt >= 4;
The FILTER clause has been valid Postgres syntax since version 9.4; on older versions you can use this instead:
select t.*
from (select t.*,
sum( (event_id = 11)::int ) over (partition by user_id order by timestamp) as event_11_cnt
from t
) t
where event_11_cnt >= 4;
This version does not include the 4th occurrence:
where event_11_cnt > 4 or (event_11_cnt = 4 and event_id <> 11)
An alternative method:
select t.*
from t
where t.timestamp > (select t2.timestamp
from t t2
where t2.user_id = t.user_id and
t2.event_id = 11
order by t2.timestamp
limit 1 offset 3
);
Sorry to be asking about such an old version of Postgres. Here is an answer that worked:
WITH EventOrdered AS(
SELECT
EventTypeId
, UserId
, Timestamp
, ROW_NUMBER() OVER (PARTITION BY EventTypeId, UserId ORDER BY Timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ROW_NO
FROM Event),
FourthEvent AS (
SELECT DISTINCT
UserID
, FIRST_VALUE(TimeStamp) OVER (PARTITION BY UserId ORDER BY Timestamp) FirstFourthEventTimestamp
FROM EventOrdered
WHERE ROW_NO = 4)
SELECT e.*
FROM Event e
JOIN FourthEvent ffe
ON e.UserId = ffe.UserId
AND e.Timestamp > ffe.FirstFourthEventTimestamp
ORDER BY e.UserId, e.Timestamp
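To see the cumulative-count approach end to end, here is a small sketch run through Python's sqlite3 module (SQLite 3.25+ for window functions). The events table, the ts column name, and the sample rows are assumptions for illustration; the timestamps are made distinct on purpose, because with tied timestamps the ordering of the 4th and 5th occurrence is ambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT, event_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2020-04-02 12:00", 11),
        (2, "2020-04-02 13:00", 11),
        (2, "2020-04-02 14:00", 99),
        (2, "2020-04-02 15:00", 11),
        (2, "2020-04-02 16:00", 11),
        (2, "2020-04-02 17:00", 11),  # 4th occurrence for user 2
        (2, "2020-04-02 18:00", 11),
        (2, "2020-04-02 19:00", 99),
    ],
)

query = """
SELECT user_id, ts, event_id
FROM (
    SELECT t.*,
           -- running count of event 11 per user, up to and including this row
           SUM(CASE WHEN event_id = 11 THEN 1 ELSE 0 END)
               OVER (PARTITION BY user_id ORDER BY ts
                     ROWS UNBOUNDED PRECEDING) AS event_11_cnt
    FROM events t
) x
-- strictly after the 4th occurrence: count already past 4, or count is 4
-- but this row is not the 4th occurrence itself
WHERE event_11_cnt > 4 OR (event_11_cnt = 4 AND event_id <> 11)
ORDER BY user_id, ts
"""
rows_after_fourth = conn.execute(query).fetchall()
for row in rows_after_fourth:
    print(row)
```

User 1 never reaches four occurrences, so only rows for user 2 come back.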

Query for negative account balance period in bigquery

I am playing around with bigquery and hit an interesting use case. I have a collection of customers and account balances. The account balances collection records any account balance change.
Customers:
+---------+--------+
| ID | Name |
+---------+--------+
| 1 | Alice |
| 2 | Bob |
+---------+--------+
Accounts balances:
+---------+---------------+---------+------------+
| ID | customer_id | value | timestamp |
+---------+---------------+---------+------------+
| 1 | 1 | -500 | 2019-02-12 |
| 2 | 1 | -200 | 2019-02-10 |
| 3 | 2 | 200 | 2019-02-10 |
| 4 | 1 | 0 | 2019-02-09 |
+---------+---------------+---------+------------+
The goal is to find out for how long a customer has had a negative account balance. The resulting collection would look like this:
+---------+--------+---------------------------------+
| ID | Name | Negative account balance since |
+---------+--------+---------------------------------+
| 1 | Alice | 2 days |
+---------+--------+---------------------------------+
Bob is not in the collection, because his last account record shows a positive value.
I think the following steps are involved:
get last account balance per customer, see if it is negative
go through the account balance values until you hit a positive (or no more) value
compute datediff
Is something like this even possible in SQL? Do you have any ideas on how to create such a query? To get customers that currently have a negative account balance, I use this query:
SELECT customer_id FROM (
SELECT t.customer_id, t.account_balance, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY timestamp DESC) as seqnum FROM `account_balances` t
) t
WHERE seqnum = 1 AND account_balance<0
Below is for BigQuery Standard SQL
#standardSQL
SELECT customer_id, name,
SUM(IF(negative_positive < 0, days, 0)) negative_days,
SUM(IF(negative_positive = 0, days, 0)) zero_days,
SUM(IF(negative_positive > 0, days, 0)) positive_days
FROM (
SELECT customer_id, negative_positive, grp,
1 + DATE_DIFF(MAX(ts), MIN(ts), DAY) days
FROM (
SELECT customer_id, ts, SIGN(value) negative_positive,
COUNTIF(flag) OVER(PARTITION BY customer_id ORDER BY ts) grp
FROM (
SELECT *, SIGN(value) = IFNULL(LEAD(SIGN(value)) OVER(PARTITION BY customer_id ORDER BY ts), 0) flag
FROM `project.dataset.balances`
)
)
GROUP BY customer_id, negative_positive, grp
)
LEFT JOIN `project.dataset.customers`
ON id = customer_id
GROUP BY customer_id, name
You can test and play with the above using the sample data from your question, as in the example below
#standardSQL
WITH `project.dataset.balances` AS (
SELECT 1 customer_id, -500 value, DATE '2019-02-12' ts UNION ALL
SELECT 1, -200, '2019-02-10' UNION ALL
SELECT 2, 200, '2019-02-10' UNION ALL
SELECT 1, 0, '2019-02-09'
), `project.dataset.customers` AS (
SELECT 1 id, 'Alice' name UNION ALL
SELECT 2, 'Bob'
)
SELECT customer_id, name,
SUM(IF(negative_positive < 0, days, 0)) negative_days,
SUM(IF(negative_positive = 0, days, 0)) zero_days,
SUM(IF(negative_positive > 0, days, 0)) positive_days
FROM (
SELECT customer_id, negative_positive, grp,
1 + DATE_DIFF(MAX(ts), MIN(ts), DAY) days
FROM (
SELECT customer_id, ts, SIGN(value) negative_positive,
COUNTIF(flag) OVER(PARTITION BY customer_id ORDER BY ts) grp
FROM (
SELECT *, SIGN(value) = IFNULL(LEAD(SIGN(value)) OVER(PARTITION BY customer_id ORDER BY ts), 0) flag
FROM `project.dataset.balances`
)
)
GROUP BY customer_id, negative_positive, grp
)
LEFT JOIN `project.dataset.customers`
ON id = customer_id
GROUP BY customer_id, name
-- ORDER BY customer_id
with result
Row customer_id name negative_days zero_days positive_days
1 1 Alice 3 1 0
2 2 Bob 0 0 1
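Note that the query above totals the days spent in each sign, while the question asks for the length of the current negative streak. A rough sketch of that narrower reading, run in SQLite through Python rather than BigQuery (so julianday stands in for DATE_DIFF, and the table names are assumptions): find each customer's last non-negative record and measure from the first record after it to the latest record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, name TEXT);
CREATE TABLE balances (customer_id INTEGER, value INTEGER, ts TEXT);
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO balances VALUES
    (1, -500, '2019-02-12'),
    (1, -200, '2019-02-10'),
    (2,  200, '2019-02-10'),
    (1,    0, '2019-02-09');
""")

query = """
WITH last_non_negative AS (
    -- most recent record per customer where the balance was zero or positive
    SELECT customer_id, MAX(ts) AS ts
    FROM balances
    WHERE value >= 0
    GROUP BY customer_id
),
streak AS (
    -- every record after that point is negative by construction;
    -- customers whose latest record is non-negative have no such rows
    SELECT b.customer_id, MIN(b.ts) AS since, MAX(b.ts) AS latest
    FROM balances b
    LEFT JOIN last_non_negative p ON p.customer_id = b.customer_id
    WHERE p.ts IS NULL OR b.ts > p.ts
    GROUP BY b.customer_id
)
SELECT c.id, c.name,
       CAST(julianday(s.latest) - julianday(s.since) AS INTEGER) AS negative_days
FROM streak s
JOIN customers c ON c.id = s.customer_id
ORDER BY c.id
"""
negative_since = conn.execute(query).fetchall()
for row in negative_since:
    print(row)
```

This measures up to the latest record, as in the question's expected output; measuring up to today instead would mean swapping s.latest for the current date.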

SQL to find timespan between rows based on ID

I have the following table in a SQL db (HeartbeatHistory)
Timestamp | Comment | Id
------------------------
The comment can contain OK or ERR
The Id is the Id of the thing that has that comment.
I want to be able to query the table and find the durations that any given id was in an Error state.
Timestamp | Comment | Id
------------------------
12:00:00 | OK | 1
11:59:00 | ERR | 2
11:58:00 | OK | 4
11:57:00 | OK | 3
11:45:00 | ERR | 4
11:20:00 | OK | 2
11:00:00 | ERR | 3
11:30:00 | OK | 5
11:20:00 | ERR | 1
11:10:00 | OK | 1
11:00:00 | ERR | 1
10:30:00 | ERR | 5
So in the above table, if I queried for 11:00:00 to 13:00:00, I would want to see:
ErrorStart | ErrorEnd | Id
--------------------------
11:00:00 | 11:10:00 | 1
11:20:00 | 12:00:00 | 1
11:59:00 | 12:00:00 | 2
11:00:00 | 11:57:00 | 3
11:45:00 | 11:58:00 | 4
11:00:00 | 11:30:00 | 5
(notice 5 started error before query date!!)
Is this possible? Also an Id might change state multiple times during the queried period.
So far I have this, which works for a single Id, but I need to make it work for multiple Ids.
declare @startDate datetime = @from;
declare @endDate datetime = @to;
declare @kpiId int = 1;
select Foo.RowCreatedTimestamp, Foo.Comment, Foo.NextTimeStamp, Foo.NextComment, Foo.HeartBeatId, Foo.NextHeartBeatId
from (
select RowCreatedTimestamp, Comment,
lag(RowCreatedTimestamp, 1, 0) over (order by RowCreatedTimestamp desc) as NextTimeStamp,
lag(Comment, 1, 0) over (order by RowCreatedTimestamp desc) as NextComment,
HeartBeatId
from dbo.tblHeartbeatHistory
where RowCreatedTimestamp >= @startDate and RowCreatedTimestamp <= @endDate
and HeartbeatId in
(
select HeartbeatId
from dbo.tblKpiHeartBeats
where KpiId = @kpiId
)
) as Foo
where Foo.Comment like '%set to ERR%'
order by Foo.RowCreatedTimestamp desc;
So if the select HeartbeatId from dbo.tblKpiHeartBeats returns a single Id, this works. As soon as there are multiple Ids, it does not :(
To avoid confusion:
The table with the Timestamp, Comment and Id is HeartbeatHistory.
The other table referenced in my SQL is dbo.tblKpiHeartBeats.
This table looks like:
Kpi | HeartbeatId
-----------------
1 | 1
1 | 2
1 | 3
1 | 4
1 | 5
So I want all the error intervals for Kpi = 1; it would return the error intervals for HeartbeatId 1, 2, 3, 4 and 5.
Further note. The data may have multiple errors in a row before an OK comes in.
It may just be all ERR for the query period or all OK.
You can add a second CTE if you want a full join of ERR and OK rows (the code below is only for OK rows)
WITH History AS (
SELECT *
FROM HeartbeatHistory
WHERE Timestamp BETWEEN @DateStart AND @DateEnd
), Errors AS(
SELECT Id, MIN(Timestamp) AS ErrorStart
FROM History
WHERE Comment = 'ERR'
GROUP BY Id
)
SELECT
ErrorStart = E.ErrorStart ,
ErrorEnd = O.Timestamp,
Id = O.Id
FROM History O
LEFT JOIN Errors E ON E.Id = O.Id
WHERE O.Comment = 'OK'
Edit: You can add a prevOK timespan (or PK) column to the table (probably a persisted computed column) that links each row to the last good row. It will be used as the Id of the row in your report.
Try this index:
CREATE INDEX IDX_EXAMPLE ON HeartbeatHistory (Timestamp, Id, prevOK, Comment)
WITH History AS (
SELECT *
FROM HeartbeatHistory
WHERE Timestamp BETWEEN @DateStart AND @DateEnd
)
SELECT
ErrorStart = E.ErrorStart ,
ErrorEnd = O.Timestamp,
Id = O.Id
FROM History O
OUTER APPLY (
SELECT MIN(Timestamp) AS ErrorStart
FROM History H
WHERE H.Id = O.Id AND H.prevOK = O.prevOK
) E
WHERE O.Comment = 'OK'
The simplest method is to use lead(). If I assume that ERR does not occur twice in a row (as in your sample data):
select (case when timestamp >= '11:00:00' then timestamp else '11:00:00' end) as errorStart,
(case when next_timestamp <= '13:00:00' then next_timestamp else '13:00:00' end) as errorEnd,
id
from (select t.*,
lead(timestamp) over (partition by id order by timestamp) as next_timestamp
from t
) t
where comment = 'ERR' and
(timestamp <= '13:00:00' and
(next_timestamp >= '11:00:00' or next_timestamp is null)
);
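Since the question notes that several ERR rows may arrive in a row, here is a hedged sketch of one way to extend the lead() idea: first drop rows that repeat the previous state for the same id, then apply LEAD over the remaining transitions. It runs in SQLite via Python (3.25+), the table and column names are assumptions, and one consecutive ERR row has been added to the question's data to exercise the case. An error with no later OK is closed at the end of the query window, which is a choice of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heartbeat_history (ts TEXT, comment TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO heartbeat_history VALUES (?, ?, ?)",
    [
        ("12:00:00", "OK", 1), ("11:59:00", "ERR", 2), ("11:58:00", "OK", 4),
        ("11:57:00", "OK", 3), ("11:45:00", "ERR", 4), ("11:20:00", "OK", 2),
        ("11:00:00", "ERR", 3), ("11:30:00", "OK", 5), ("11:20:00", "ERR", 1),
        ("11:10:00", "OK", 1), ("11:00:00", "ERR", 1), ("10:30:00", "ERR", 5),
        ("11:30:00", "ERR", 3),  # extra row, not in the question: repeated ERR
    ],
)

query = """
WITH changes AS (
    SELECT id, ts, comment,
           LAG(comment) OVER (PARTITION BY id ORDER BY ts) AS prev_comment
    FROM heartbeat_history
),
transitions AS (
    -- keep only rows where the state differs from the previous row of that id;
    -- LEAD then sees the next state change, not the next repeated ERR
    SELECT id, ts, comment,
           LEAD(ts) OVER (PARTITION BY id ORDER BY ts) AS next_ts
    FROM changes
    WHERE prev_comment IS NULL OR prev_comment <> comment
)
SELECT MAX(ts, :win_start) AS error_start,                     -- clamp to window
       MIN(COALESCE(next_ts, :win_end), :win_end) AS error_end,
       id
FROM transitions
WHERE comment = 'ERR'
  AND ts <= :win_end
  AND (next_ts IS NULL OR next_ts >= :win_start)
ORDER BY id, error_start
"""
intervals = conn.execute(
    query, {"win_start": "11:00:00", "win_end": "13:00:00"}
).fetchall()
for row in intervals:
    print(row)
```

The times are compared as zero-padded HH:MM:SS strings, which only works because every row falls on the same day; real data would carry full datetimes.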
Try this:
DECLARE @table TABLE (Timestmp TIME(1), Comment NVARCHAR(5), Id INT) --your table
INSERT INTO @table VALUES
('12:00:00','OK ','1'),('11:59:00','ERR','2'),('11:58:00','OK ','4'),('11:57:00','OK ','3'),
('11:45:00','ERR','4'),('11:20:00','OK ','2'),('11:00:00','ERR','3'),('11:30:00','OK ','5'),
('11:20:00','ERR','1'),('11:10:00','OK ','1'),('11:00:00','ERR','1'),('10:30:00','ERR','5')
DECLARE @ROWER TABLE (id INT IDENTITY(1,1), Timestmp TIME(1))
INSERT INTO @ROWER SELECT Timestmp FROM @table WHERE Comment='OK' ORDER BY Timestmp
DECLARE @TIME TIME(1) = '11:00:00' --your condition
SELECT DISTINCT CASE WHEN A.Timestmp >=@TIME THEN A.Timestmp ELSE @TIME END ErrorStart,
CASE WHEN B.Timestmp > A.Timestmp THEN B.Timestmp ELSE '' END ErrorEnd,
A.Id FROM (
SELECT ROW_NUMBER() OVER (ORDER BY id,Timestmp) rowid,* FROM @table WHERE Comment = 'ERR'
) A LEFT JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY id,Timestmp) rowid,* FROM @table WHERE Comment = 'OK'
) B ON A.rowid = B.rowid
LEFT JOIN ( SELECT A.id,A.Timestmp t1,B.Timestmp t2 FROM @ROWER A
LEFT JOIN (SELECT id-1 id, Timestmp FROM @ROWER) B ON A.id=B.id
) C ON A.Timestmp BETWEEN C.t1 AND C.t2 ORDER BY A.Id
Hope it helps. :)

Count and pivot a table by date

I would like to identify the returning customers from an Oracle(11g) table like this:
CustID | Date
-------|----------
XC321 | 2016-04-28
AV626 | 2016-05-18
DX970 | 2016-06-23
XC321 | 2016-05-28
XC321 | 2016-06-02
So I can see which customers returned within various windows, for example within 10, 20, 30, 40 or 50 days. For example:
CustID | 10_day | 20_day | 30_day | 40_day | 50_day
-------|--------|--------|--------|--------|--------
XC321 | | | 1 | |
XC321 | | | | 1 |
I would even accept a result like this:
CustID | Date | days_from_last_visit
-------|------------|---------------------
XC321 | 2016-05-28 | 30
XC321 | 2016-06-02 | 5
I guess it would use a partition by windowing clause with unbounded following and preceding clauses... but I cannot find any suitable examples.
Any ideas...?
Thanks
No need for window functions here, you can simply do it with conditional aggregation using CASE EXPRESSION :
SELECT t.custID,
COUNT(CASE WHEN (t.date - t.last_visit) <= 10 THEN 1 END) as "10_day",
COUNT(CASE WHEN (t.date - t.last_visit) between 11 and 20 THEN 1 END) as "20_day",
COUNT(CASE WHEN (t.date - t.last_visit) between 21 and 30 THEN 1 END) as "30_day",
.....
FROM (SELECT s.custID,
s.date,
LEAD(s.date) OVER(PARTITION BY s.custID ORDER BY s.date DESC) as last_visit
FROM YourTable s) t
GROUP BY t.custID
Oracle Setup:
CREATE TABLE customers ( CustID, Activity_Date ) AS
SELECT 'XC321', DATE '2016-04-28' FROM DUAL UNION ALL
SELECT 'AV626', DATE '2016-05-18' FROM DUAL UNION ALL
SELECT 'DX970', DATE '2016-06-23' FROM DUAL UNION ALL
SELECT 'XC321', DATE '2016-05-28' FROM DUAL UNION ALL
SELECT 'XC321', DATE '2016-06-02' FROM DUAL;
Query:
SELECT *
FROM (
SELECT CustID,
Activity_Date AS First_Date,
COUNT(1) OVER ( PARTITION BY CustID
ORDER BY Activity_Date
RANGE BETWEEN CURRENT ROW AND INTERVAL '10' DAY FOLLOWING )
- 1 AS "10_Day",
COUNT(1) OVER ( PARTITION BY CustID
ORDER BY Activity_Date
RANGE BETWEEN CURRENT ROW AND INTERVAL '20' DAY FOLLOWING )
- 1 AS "20_Day",
COUNT(1) OVER ( PARTITION BY CustID
ORDER BY Activity_Date
RANGE BETWEEN CURRENT ROW AND INTERVAL '30' DAY FOLLOWING )
- 1 AS "30_Day",
COUNT(1) OVER ( PARTITION BY CustID
ORDER BY Activity_Date
RANGE BETWEEN CURRENT ROW AND INTERVAL '40' DAY FOLLOWING )
- 1 AS "40_Day",
COUNT(1) OVER ( PARTITION BY CustID
ORDER BY Activity_Date
RANGE BETWEEN CURRENT ROW AND INTERVAL '50' DAY FOLLOWING )
- 1 AS "50_Day",
ROW_NUMBER() OVER ( PARTITION BY CustID ORDER BY Activity_Date ) AS rn
FROM Customers
)
WHERE rn = 1;
Output
CUSTID FIRST_DATE 10_Day 20_Day 30_Day 40_Day 50_Day RN
------ ------------------- ---------- ---------- ---------- ---------- ---------- ----------
AV626 2016-05-18 00:00:00 0 0 0 0 0 1
DX970 2016-06-23 00:00:00 0 0 0 0 0 1
XC321 2016-04-28 00:00:00 0 0 1 2 2 1
Here is an answer that works for me, I have based it on your answers above, thanks for contributions from MT0 and Sagi:
SELECT CustID,
visit_date,
Prev_Visit ,
COUNT( CASE WHEN (Days_between_visits) <=10 THEN 1 END) AS "0-10_day" ,
COUNT( CASE WHEN (Days_between_visits) BETWEEN 11 AND 20 THEN 1 END) AS "11-20_day" ,
COUNT( CASE WHEN (Days_between_visits) BETWEEN 21 AND 30 THEN 1 END) AS "21-30_day" ,
COUNT( CASE WHEN (Days_between_visits) BETWEEN 31 AND 40 THEN 1 END) AS "31-40_day" ,
COUNT( CASE WHEN (Days_between_visits) BETWEEN 41 AND 50 THEN 1 END) AS "41-50_day" ,
COUNT( CASE WHEN (Days_between_visits) >50 THEN 1 END) AS "51+_day"
FROM
(SELECT CustID,
visit_date,
Lead(T1.visit_date) over (partition BY T1.CustID order by T1.visit_date DESC) AS Prev_visit,
visit_date - Lead(T1.visit_date) over (
partition BY T1.CustID order by T1.visit_date DESC) AS Days_between_visits
FROM T1
) T2
WHERE Days_between_visits >0
GROUP BY T2.CustID ,
T2.visit_date ,
T2.Prev_visit ,
T2.Days_between_visits;
This returns:
CUSTID | VISIT_DATE | PREV_VISIT | DAYS_BETWEEN_VISIT | 0-10_DAY | 11-20_DAY | 21-30_DAY | 31-40_DAY | 41-50_DAY | 51+DAY
XC321 | 2016-05-28 | 2016-04-28 | 30 | | | 1 | | |
XC321 | 2016-06-02 | 2016-05-28 | 5 | 1 | | | | |
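For the simpler days_from_last_visit shape mentioned in the question, LAG over the ascending order reads a bit more directly than LEAD over the descending order. A minimal sketch, run in SQLite through Python instead of Oracle (so julianday replaces Oracle's date subtraction, and the visits table name is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (cust_id TEXT, visit_date TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("XC321", "2016-04-28"),
        ("AV626", "2016-05-18"),
        ("DX970", "2016-06-23"),
        ("XC321", "2016-05-28"),
        ("XC321", "2016-06-02"),
    ],
)

query = """
SELECT cust_id, visit_date, days_from_last_visit
FROM (
    SELECT cust_id, visit_date,
           -- previous visit of the same customer; NULL for a first visit
           CAST(julianday(visit_date)
                - julianday(LAG(visit_date) OVER (
                      PARTITION BY cust_id ORDER BY visit_date)) AS INTEGER)
               AS days_from_last_visit
    FROM visits
) v
WHERE days_from_last_visit IS NOT NULL   -- keep returning visits only
ORDER BY cust_id, visit_date
"""
returning = conn.execute(query).fetchall()
for row in returning:
    print(row)
```

From here the 10/20/30-day columns are a matter of wrapping days_from_last_visit in CASE expressions, as in the answers above.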

group a set of records by date in teradata

Currently I have data in a table as shown below:
date id value
1-Jan-13 1 100
2-Jan-13 1 100
3-Jan-13 1 100
4-Jan-13 1 200
5-Jan-13 1 200
6-Jan-13 1 100
7-Jan-13 1 100
I am trying to group consecutive records with the same id and value, and version each group with a start date and an end date.
Desired output:
start date end date id value
1-Jan-13 3-Jan-13 1 100
4-Jan-13 5-Jan-13 1 200
6-Jan-13 7-Jan-13 1 100
I'm not an expert in Teradata, but since windowing functions are supported (specifically ROW_NUMBER), you should most likely be able to do something like this
SELECT MIN(date) start_date, MAX(date) end_date, id, value
FROM
(
SELECT date, id, value,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) -
ROW_NUMBER() OVER (PARTITION BY id, value ORDER BY date) island
FROM table1
) q
GROUP BY id, value, island
ORDER BY start_date, end_date
Sample output:
| START_DATE | END_DATE | ID | VALUE |
|------------|------------|----|-------|
| 2013-01-01 | 2013-01-03 | 1 | 100 |
| 2013-01-04 | 2013-01-05 | 1 | 200 |
| 2013-01-06 | 2013-01-07 | 1 | 100 |
Here is SQLFiddle demo (It's a SQL Server demo, but should work as expected in Teradata)
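In case the fiddle is unavailable, the same row-number-difference trick can be checked locally. This is a sketch in SQLite through Python (3.25+), with the columns renamed to dt and val as an assumption to stay clear of reserved words. Within one id, the overall row number and the per-value row number advance together while the value stays the same, so their difference is constant inside each run and changes when the value changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (dt TEXT, id INTEGER, val INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        ("2013-01-01", 1, 100),
        ("2013-01-02", 1, 100),
        ("2013-01-03", 1, 100),
        ("2013-01-04", 1, 200),
        ("2013-01-05", 1, 200),
        ("2013-01-06", 1, 100),
        ("2013-01-07", 1, 100),
    ],
)

query = """
SELECT MIN(dt) AS start_dt, MAX(dt) AS end_dt, id, val
FROM (
    SELECT dt, id, val,
           -- difference is 0, 3, 2 for the three runs in this sample
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt)
         - ROW_NUMBER() OVER (PARTITION BY id, val ORDER BY dt) AS island
    FROM table1
) q
GROUP BY id, val, island
ORDER BY start_dt
"""
islands = conn.execute(query).fetchall()
for row in islands:
    print(row)
```

The island number is only meaningful together with id and val, which is why all three appear in the GROUP BY.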
The ROW_NUMBER version can be further simplified: modified SQL Fiddle
For Teradata:
SELECT
id,val,MIN(dt),MAX(dt)
FROM
(
SELECT
id,val,dt,
dt - ROW_NUMBER() OVER (PARTITION BY id ORDER BY val, dt) AS dummy
FROM table1
) AS dt
GROUP BY 1,2,dummy
And there are some hardly known functions in TD13.10 for processing time series data:
WITH cte(id,val,pd) AS
(
SELECT id, val, PERIOD(dt, dt+1) AS pd
FROM table1
)
SELECT
id, val,
BEGIN(pd) AS start_dt,
LAST(pd) AS end_dt
FROM
TABLE (TD_NORMALIZE_MEET
(NEW VARIANT_TYPE(cte.id,cte.val)
,cte.pd)
RETURNS (id INTEGER
,val INTEGER
,pd PERIOD(DATE)
,Nrm_Count INTEGER)
HASH BY id
LOCAL ORDER BY id, val, pd
) A
ORDER BY start_dt, end_dt