SQL: Return rows after nth occurrence of event per user

I'm using PostgreSQL 8.0 and I have a table with the columns user_id, timestamp, and event_id.
How can I return the rows (or row) after the 4th occurrence of event_id = someID per user?
|---------|------------------|----------|
| user_id | timestamp        | event_id |
|---------|------------------|----------|
| 1       | 2020-04-02 12:00 | 11       |
| 2       | 2020-04-02 13:00 | 11       |
| 2       | 2020-04-02 14:00 | 99       |
| 2       | 2020-04-02 15:00 | 11       |
| 2       | 2020-04-02 16:00 | 11       |
| 2       | 2020-04-02 17:00 | 11       |
| 2       | 2020-04-02 17:00 | 11       |
|---------|------------------|----------|
I.e., if event_id = 11, I would want only the last row in the table above.

You can use window functions:
select *
from (
select t.*, row_number() over(partition by user_id, event_id order by timestamp) rn
from mytable t
) t
where rn > 4
Here is a little trick that removes the row number from the result:
select (t).*
from (
select t, row_number() over(partition by user_id, event_id order by timestamp) rn
from mytable t
) x
where rn > 4
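To see what the row-number approach returns, here is a minimal sketch that runs the same query shape against SQLite through Python's sqlite3 module (SQLite 3.25+ is assumed for window functions). The in-memory table and the column name `ts`, standing in for `timestamp`, are assumptions for illustration. Note that this query only ever returns rows of the same event_id, because the numbering is partitioned by event.

```python
import sqlite3

# Hypothetical in-memory copy of the question's table; `ts` stands in for `timestamp`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (user_id INTEGER, ts TEXT, event_id INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "2020-04-02 12:00", 11),
        (2, "2020-04-02 13:00", 11),
        (2, "2020-04-02 14:00", 99),
        (2, "2020-04-02 15:00", 11),
        (2, "2020-04-02 16:00", 11),
        (2, "2020-04-02 17:00", 11),
        (2, "2020-04-02 17:00", 11),
    ],
)

# Number each user's rows per event in time order, then keep everything past the 4th.
rows_after_fourth = conn.execute(
    """
    SELECT user_id, ts, event_id
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY user_id, event_id ORDER BY ts) AS rn
        FROM mytable t
    )
    WHERE rn > 4
    """
).fetchall()
print(rows_after_fourth)
```

On the sample data only user 2 has more than four rows for one event, so a single row comes back.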

You can use a cumulative count. This version includes the 4th occurrence:
select t.*
from (select t.*,
count(*) filter (where event_id = 11) over (partition by user_id order by timestamp) as event_11_cnt
from t
) t
where event_11_cnt >= 4;
FILTER has been valid Postgres syntax since version 9.4. On older versions you can instead use:
select t.*
from (select t.*,
sum( (event_id = 11)::int ) over (partition by user_id order by timestamp) as event_11_cnt
from t
) t
where event_11_cnt >= 4;
This version does not include the 4th occurrence; use this where clause instead:
where event_11_cnt > 4 or (event_11_cnt = 4 and event_id <> 11)
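The difference between the inclusive and exclusive filters is easier to see on a small example. The sketch below uses SQLite via Python's sqlite3 (SQLite 3.25+ assumed); the data, the column name `ts`, and the use of distinct timestamps are assumptions made for illustration. With tied timestamps the default window frame gives all tied rows the same running count, so results can differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (user_id INTEGER, ts TEXT, event_id INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2020-04-02 10:00", 11),
        (1, "2020-04-02 11:00", 11),
        (1, "2020-04-02 12:00", 99),
        (1, "2020-04-02 13:00", 11),
        (1, "2020-04-02 14:00", 11),  # 4th occurrence of event 11
        (1, "2020-04-02 15:00", 99),
        (1, "2020-04-02 16:00", 11),
    ],
)

# Running count of event 11 per user; only the WHERE condition varies.
BASE = """
    SELECT ts, event_id
    FROM (
        SELECT t.*,
               SUM(CASE WHEN event_id = 11 THEN 1 ELSE 0 END)
                   OVER (PARTITION BY user_id ORDER BY ts) AS event_11_cnt
        FROM t
    )
    WHERE {condition}
    ORDER BY ts
"""

inclusive = conn.execute(BASE.format(condition="event_11_cnt >= 4")).fetchall()
exclusive = conn.execute(
    BASE.format(condition="event_11_cnt > 4 OR (event_11_cnt = 4 AND event_id <> 11)")
).fetchall()

print(inclusive)  # starts at the 4th occurrence (14:00)
print(exclusive)  # starts at the first row after it (15:00)
```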
An alternative method:
select t.*
from t
where t.timestamp > (select t2.timestamp
from t t2
where t2.user_id = t.user_id and
t2.event_id = 11
order by t2.timestamp
limit 1 offset 3
);
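A quick way to check the correlated-subquery variant is to run it on SQLite through Python's sqlite3. The data and the column name `ts` are assumptions for illustration. Two behaviours are worth noting: a user with fewer than four occurrences gets a NULL from the subquery and therefore no rows, and because the comparison is a strict `>`, rows that share the 4th occurrence's timestamp are not returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (user_id INTEGER, ts TEXT, event_id INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2020-04-02 10:00", 11),
        (1, "2020-04-02 11:00", 11),
        (1, "2020-04-02 12:00", 99),
        (1, "2020-04-02 13:00", 11),
        (1, "2020-04-02 14:00", 11),  # 4th occurrence for user 1
        (1, "2020-04-02 15:00", 99),
        (1, "2020-04-02 16:00", 11),
        (2, "2020-04-02 10:00", 11),  # user 2 never reaches 4 occurrences
    ],
)

# OFFSET 3 skips the first three matches, so the subquery yields the 4th timestamp.
later_rows = conn.execute(
    """
    SELECT t.user_id, t.ts, t.event_id
    FROM t
    WHERE t.ts > (SELECT t2.ts
                  FROM t t2
                  WHERE t2.user_id = t.user_id AND t2.event_id = 11
                  ORDER BY t2.ts
                  LIMIT 1 OFFSET 3)
    ORDER BY t.user_id, t.ts
    """
).fetchall()
print(later_rows)
```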

Sorry to be asking about such an old version of Postgres. Here is an answer that worked:
WITH EventOrdered AS(
SELECT
EventTypeId
, UserId
, Timestamp
, ROW_NUMBER() OVER (PARTITION BY EventTypeId, UserId ORDER BY Timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ROW_NO
FROM Event),
FourthEvent AS (
SELECT DISTINCT
UserID
, FIRST_VALUE(TimeStamp) OVER (PARTITION BY UserId ORDER BY Timestamp) FirstFourthEventTimestamp
FROM EventOrdered
WHERE ROW_NO = 4)
SELECT e.*
FROM Event e
JOIN FourthEvent ffe
ON e.UserId = ffe.UserId
AND e.Timestamp > ffe.FirstFourthEventTimestamp
ORDER BY e.UserId, e.Timestamp

Related

How to bucket data based on timestamps within a certain period or previous record?

I have some data that I'm trying to bucket. Let's say the data has a user and a timestamp. I want to define a session as any rows that have a timestamp within 10 minutes of the previous timestamp for the same user.
How would I go about this in SQL?
Example
+------+---------------------+---------+
| user | timestamp | session |
+------+---------------------+---------+
| 1 | 2021-05-09 15:12:52 | 1 |
| 1 | 2021-05-09 15:18:52 | 1 | within 10 min of previous timestamp
| 1 | 2021-05-09 15:32:52 | 2 | over 10 min, new session
| 2 | 2021-05-09 16:00:00 | 1 | different user
| 1 | 2021-05-09 17:00:00 | 3 | new session
| 1 | 2021-05-09 17:02:00 | 3 |
+------+---------------------+---------+
This will give me records within 10 minutes but how would I bucket them like above?
with cte as (
select user,
timestamp,
lag(timestamp) over (partition by user order by timestamp) as last_timestamp
from table
)
select *
from cte
where datediff(minute, last_timestamp, timestamp) <= 10
Try this one. It's basically an edge problem.
Working test case for SQL Server
The SQL:
with cte as (
select user1
, timestamp1
, session1 AS session_expected
, lag(timestamp1) over (partition by user1 order by timestamp1) as last_timestamp
, CASE WHEN datediff(n, lag(timestamp1) over (partition by user1 order by timestamp1), timestamp1) <= 10 THEN 0 ELSE 1 END AS edge
from table1
)
select *, SUM(edge) OVER (PARTITION BY user1 ORDER BY timestamp1) AS session_actual
from cte
ORDER BY timestamp1
;
Additional suggestion: add ROWS UNBOUNDED PRECEDING to the running sum (thanks to @Charlieface):
with cte as (
select user1
, timestamp1
, session1 AS session_expected
, lag(timestamp1) over (partition by user1 order by timestamp1) as last_timestamp
, CASE WHEN datediff(n, lag(timestamp1) over (partition by user1 order by timestamp1), timestamp1) <= 10 THEN 0 ELSE 1 END AS edge
from table1
)
select *
, SUM(edge) OVER (PARTITION BY user1 ORDER BY timestamp1 ROWS UNBOUNDED PRECEDING) AS session_actual
from cte
ORDER BY timestamp1
;
Setup:
CREATE TABLE table1 (user1 int, timestamp1 datetime, session1 int);
INSERT INTO table1 VALUES
( 1 , '2021-05-09 15:12:52' , 1 )
, ( 1 , '2021-05-09 15:18:52' , 1 ) -- within 10 min of previous timestamp
, ( 1 , '2021-05-09 15:32:52' , 2 ) -- over 10 min, new session
, ( 2 , '2021-05-09 16:00:00' , 1 ) -- different user
, ( 1 , '2021-05-09 17:00:00' , 3 ) -- new session
, ( 1 , '2021-05-09 17:02:00' , 3 )
;
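For readers without SQL Server at hand, the edge-flag-plus-running-sum idea can be sketched on SQLite through Python's sqlite3 (SQLite 3.25+ assumed). This is an adaptation, not the answer's exact query: SQLite has no datediff, so the gap is computed in seconds with strftime('%s', ...) and compared against 600, which is an assumption about how "within 10 minutes" should be measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (user1 INTEGER, timestamp1 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        (1, "2021-05-09 15:12:52"),
        (1, "2021-05-09 15:18:52"),
        (1, "2021-05-09 15:32:52"),
        (2, "2021-05-09 16:00:00"),
        (1, "2021-05-09 17:00:00"),
        (1, "2021-05-09 17:02:00"),
    ],
)

# edge = 1 when a row starts a new session (first row, or gap over 600 seconds);
# the running sum of edge per user is then the session number.
sessions = conn.execute(
    """
    WITH cte AS (
        SELECT user1,
               timestamp1,
               CASE WHEN CAST(strftime('%s', timestamp1) AS INTEGER)
                       - CAST(strftime('%s', LAG(timestamp1) OVER (
                               PARTITION BY user1 ORDER BY timestamp1)) AS INTEGER) <= 600
                    THEN 0 ELSE 1 END AS edge
        FROM table1
    )
    SELECT user1,
           timestamp1,
           SUM(edge) OVER (PARTITION BY user1 ORDER BY timestamp1
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_actual
    FROM cte
    ORDER BY timestamp1
    """
).fetchall()

for row in sessions:
    print(row)
```

The session numbers produced this way match the `session` column in the question's example.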

SQL: Get date difference between rows in the same column [duplicate]

I am trying to create a report and this is my input data.
Stage Name Date
1 x 12/05/2019 10:00:03
1 x 12/05/2019 10:05:01
1 y 12/06/2019 12:00:07
2 x 12/06/2019 13:12:03
2 x 12/06/2019 13:23:00
1 y 12/08/2019 16:00:07
2 x 12/09/2019 09:17:59
This is my desired output.
Stage Name DateFrom DateTo DateDiff
1 x 12/05/2019 10:00:03 12/06/2019 12:00:07 1
1 y 12/06/2019 12:00:07 12/06/2019 13:12:03 0
2 x 12/06/2019 13:12:03 12/08/2019 16:00:07 2
1 y 12/08/2019 16:00:07 12/09/2019 09:17:59 1
I cannot use a group by clause over stage and name, since it will group the 3rd and 6th rows from my input. I tried joining the table to itself, but I am not getting the desired result. Is this even possible in SQL? Any ideas would be helpful. I am using Microsoft SQL Server.
This is a variation of the gaps-and-islands problem. You want to group together runs of adjacent rows (i.e. rows having the same stage and name), but use the start date of the next group as the end date of the current group.
Here is one way to do it:
select
stage,
name,
min(date) date_from,
lead(min(date)) over(order by min(date)) date_to,
datediff(day, min(date), lead(min(date)) over(order by min(date))) date_diff
from (
select
t.*,
row_number() over(order by date) rn1,
row_number() over(partition by stage, name order by date) rn2
from mytable t
) t
group by stage, name, rn1 - rn2
order by date_from
Demo on DB Fiddle:
stage | name | date_from | date_to | datediff
----: | :--- | :------------------ | :------------------ | -------:
1 | x | 12/05/2019 10:00:03 | 12/06/2019 12:00:07 | 1
1 | y | 12/06/2019 12:00:07 | 12/06/2019 13:12:03 | 0
2 | x | 12/06/2019 13:12:03 | 12/08/2019 16:00:07 | 2
1 | y | 12/08/2019 16:00:07 | 12/09/2019 09:17:59 | 1
2 | x | 12/09/2019 09:17:59 | null | null
Note that this does not produce exactly the result that you showed: there is an additional, pending record at the end of the resultset, that represents the "on-going" series of records. If needed, you can filter it out by nesting the query:
select *
from (
select
stage,
name,
min(date) date_from,
lead(min(date)) over(order by min(date)) date_to,
datediff(day, min(date), lead(min(date)) over(order by min(date))) date_diff
from (
select
t.*,
row_number() over(order by date) rn1,
row_number() over(partition by stage, name order by date) rn2
from mytable t
) t
group by stage, name, rn1 - rn2
) t
where date_to is not null
order by date_from
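The row-number difference trick can be tried out on SQLite through Python's sqlite3 (SQLite 3.25+ assumed). This is a sketch under assumptions: the column is renamed `event_date`, dates are stored as ISO text, the grouping and the lead() step are split into separate CTEs, and the day difference is computed from calendar dates with julianday() to mimic datediff(day, ...).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (stage INTEGER, name TEXT, event_date TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "x", "2019-12-05 10:00:03"),
        (1, "x", "2019-12-05 10:05:01"),
        (1, "y", "2019-12-06 12:00:07"),
        (2, "x", "2019-12-06 13:12:03"),
        (2, "x", "2019-12-06 13:23:00"),
        (1, "y", "2019-12-08 16:00:07"),
        (2, "x", "2019-12-09 09:17:59"),
    ],
)

# rn1 - rn2 is constant within each run of adjacent rows sharing (stage, name).
islands = conn.execute(
    """
    WITH numbered AS (
        SELECT stage, name, event_date,
               ROW_NUMBER() OVER (ORDER BY event_date) AS rn1,
               ROW_NUMBER() OVER (PARTITION BY stage, name ORDER BY event_date) AS rn2
        FROM mytable
    ),
    starts AS (
        SELECT stage, name, MIN(event_date) AS date_from
        FROM numbered
        GROUP BY stage, name, rn1 - rn2
    ),
    paired AS (
        SELECT stage, name, date_from,
               LEAD(date_from) OVER (ORDER BY date_from) AS date_to
        FROM starts
    )
    SELECT stage, name, date_from, date_to,
           CAST(julianday(date(date_to)) - julianday(date(date_from)) AS INTEGER) AS date_diff
    FROM paired
    WHERE date_to IS NOT NULL   -- drop the still-open last island
    ORDER BY date_from
    """
).fetchall()

for row in islands:
    print(row)
```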
This is a variation of the gaps-and-islands problem, but it has a pretty simple solution.
Just keep every row where the previous row has a different stage or name. Then use lead() to get the next date. Here is the basic idea:
select t.stage, t.name, t.date as datefrom,
lead(t.date) over (order by t.date) as dateto,
datediff(day, t.date, lead(t.date) over (order by t.date)) as diff
from (select t.*,
lag(date) over (partition by stage, name order by date) as prev_sn_date,
lag(date) over (order by date) as prev_date
from t
) t
where prev_sn_date <> prev_date or prev_sn_date is null;
If you really want to filter out the last row, you need one more step; I'm not sure if that is desirable.
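The lag()-based variant can be checked the same way. The sketch below runs it on SQLite through Python's sqlite3 (SQLite 3.25+ assumed); the column name `event_date` and the ISO text dates are assumptions for illustration. It relies on the dates being unique, since the comparison of the two lag() values is what detects a change of group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (stage INTEGER, name TEXT, event_date TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "x", "2019-12-05 10:00:03"),
        (1, "x", "2019-12-05 10:05:01"),
        (1, "y", "2019-12-06 12:00:07"),
        (2, "x", "2019-12-06 13:12:03"),
        (2, "x", "2019-12-06 13:23:00"),
        (1, "y", "2019-12-08 16:00:07"),
        (2, "x", "2019-12-09 09:17:59"),
    ],
)

# Keep only rows whose previous row (overall) is not the previous row of the
# same (stage, name); lead() then runs over the surviving group-start rows.
group_starts = conn.execute(
    """
    SELECT stage, name, event_date AS date_from,
           LEAD(event_date) OVER (ORDER BY event_date) AS date_to
    FROM (
        SELECT m.*,
               LAG(event_date) OVER (PARTITION BY stage, name ORDER BY event_date) AS prev_sn_date,
               LAG(event_date) OVER (ORDER BY event_date) AS prev_date
        FROM mytable m
    )
    WHERE prev_sn_date <> prev_date OR prev_sn_date IS NULL
    ORDER BY date_from
    """
).fetchall()

for row in group_starts:
    print(row)
```

The last row has a NULL date_to because no later group exists yet.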

time difference between transaction of user

Table: txn
customer_id | time_stamp
-------------------------
1 | 00:01:03
1 | 00:02:04
2 | 00:03:05
2 | 00:04:06
Looking to query the time difference between each first transaction and next transaction of customer_id
Results:
Customer ID | Time Diff
1 | 61
select customer_ID, ...
from txn
You want lead() . . . but date/time functions are notoriously database-specific. In SQL Server:
select t.*,
datediff(second,
time_stamp,
lead(time_stamp) over (partition by customer_id order by time_stamp)
) as diff_seconds
from t;
In BigQuery:
select t.*,
timestamp_diff(lead(time_stamp) over (partition by customer_id order by time_stamp),
time_stamp,
second
) as diff_seconds
from t;
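Since date arithmetic differs per database, here is one more dialect as a sketch: the same lead() pattern on SQLite, run through Python's sqlite3 (SQLite 3.25+ assumed). Storing time_stamp as 'HH:MM:SS' text and converting it with strftime('%s', ...) are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (customer_id INTEGER, time_stamp TEXT)")
conn.executemany(
    "INSERT INTO txn VALUES (?, ?)",
    [(1, "00:01:03"), (1, "00:02:04"), (2, "00:03:05"), (2, "00:04:06")],
)

# Seconds from each transaction to the same customer's next one (NULL for the last).
diffs = conn.execute(
    """
    SELECT customer_id,
           time_stamp,
           CAST(strftime('%s', LEAD(time_stamp) OVER (
                   PARTITION BY customer_id ORDER BY time_stamp)) AS INTEGER)
             - CAST(strftime('%s', time_stamp) AS INTEGER) AS diff_seconds
    FROM txn
    ORDER BY customer_id, time_stamp
    """
).fetchall()

for row in diffs:
    print(row)
```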

SQL: Filter rows by direction

I have a table with 2 columns: date (timestamp) and status (boolean).
I have a lot of values like:
| date | status |
|-------------------------- |-------- |
| 2018-11-05T19:04:21.125Z | true |
| 2018-11-05T19:04:22.125Z | true |
| 2018-11-05T19:04:23.125Z | true |
....
I need to get a result like this:
| date_from | date_to | status |
|-------------------------- |-------------------------- |-------- |
| 2018-11-05T19:04:21.125Z | 2018-11-05T19:04:27.125Z | true |
| 2018-11-05T19:04:27.125Z | 2018-11-05T19:04:47.125Z | false |
| 2018-11-05T19:04:47.125Z | 2018-11-05T19:04:57.125Z | true |
So, I need to collapse all consecutive rows with the same value and get back only the periods during which status is true or false.
I created a query like this:
SELECT max("current_date"), current_status, previous_status
FROM (SELECT date as "current_date",
status as current_status,
(lag(status, 1) OVER (ORDER BY msgtime))::boolean AS previous_status
FROM "table" as table
) as raw_data
group by current_status, previous_status
but the response contains no more than 4 rows.
This is a gaps-and-islands problem. A typical method uses the difference of row numbers:
select min(date), max(date), status
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by status order by date) as seqnum_s
from t
) t
group by status, (seqnum - seqnum_s);
Yes, you could use LAG, but then you also need a running counter that increments every time the status changes:
WITH cte1 AS (
SELECT date, status, CASE WHEN LAG(status) OVER (ORDER BY date) = status THEN 0 ELSE 1 END AS chg
FROM yourdata
), cte2 AS (
SELECT date, status, SUM(chg) OVER (ORDER BY date) AS grp
FROM cte1
)
SELECT MIN(date) AS date_from, MAX(date) AS date_to, status
FROM cte2
GROUP BY grp, status
ORDER BY date_from
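The change-flag and running-counter steps can be traced on a tiny data set. The sketch below uses SQLite via Python's sqlite3 (SQLite 3.25+ assumed); the table name `status_log`, the column name `ts`, the sample rows, and storing the boolean as 0/1 are all assumptions for illustration. Note that date_to here is the last timestamp inside each run, not the start of the next run as in the question's desired output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_log (ts TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO status_log VALUES (?, ?)",
    [
        ("2018-11-05T19:04:21.125Z", 1),
        ("2018-11-05T19:04:22.125Z", 1),
        ("2018-11-05T19:04:23.125Z", 1),
        ("2018-11-05T19:04:27.125Z", 0),
        ("2018-11-05T19:04:30.125Z", 0),
        ("2018-11-05T19:04:47.125Z", 1),
    ],
)

# chg marks the first row of every run; its running sum labels the runs.
periods = conn.execute(
    """
    WITH flagged AS (
        SELECT ts, status,
               CASE WHEN LAG(status) OVER (ORDER BY ts) = status THEN 0 ELSE 1 END AS chg
        FROM status_log
    ),
    grouped AS (
        SELECT ts, status, SUM(chg) OVER (ORDER BY ts) AS grp
        FROM flagged
    )
    SELECT MIN(ts) AS date_from, MAX(ts) AS date_to, status
    FROM grouped
    GROUP BY grp, status
    ORDER BY date_from
    """
).fetchall()

for row in periods:
    print(row)
```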

Count and pivot a table by date

I would like to identify the returning customers from an Oracle(11g) table like this:
CustID | Date
-------|----------
XC321 | 2016-04-28
AV626 | 2016-05-18
DX970 | 2016-06-23
XC321 | 2016-05-28
XC321 | 2016-06-02
So I can see which customers returned within various windows, for example within 10, 20, 30, 40 or 50 days. For example:
CustID | 10_day | 20_day | 30_day | 40_day | 50_day
-------|--------|--------|--------|--------|--------
XC321 | | | 1 | |
XC321 | | | | 1 |
I would even accept a result like this:
CustID | Date | days_from_last_visit
-------|------------|---------------------
XC321 | 2016-05-28 | 30
XC321 | 2016-06-02 | 5
I guess it would use a partition by windowing clause with unbounded following and preceding clauses... but I cannot find any suitable examples.
Any ideas...?
Thanks
No need for a windowing clause here; you can simply use conditional aggregation with CASE expressions:
-- DATE is a reserved word in Oracle, so the real column must be quoted or renamed
SELECT t.custID,
COUNT(CASE WHEN (t.date - t.last_visit) <= 10 THEN 1 END) as "10_day",
COUNT(CASE WHEN (t.date - t.last_visit) between 11 and 20 THEN 1 END) as "20_day",
COUNT(CASE WHEN (t.date - t.last_visit) between 21 and 30 THEN 1 END) as "30_day",
.....
FROM (SELECT s.custID,
s.date,
-- LEAD over a descending order returns the previous (earlier) visit
LEAD(s.date) OVER(PARTITION BY s.custID ORDER BY s.date DESC) as last_visit
FROM YourTable s) t
GROUP BY t.custID
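As a sanity check of the conditional-aggregation idea, here is a sketch on SQLite through Python's sqlite3 (SQLite 3.25+ assumed). It is an adaptation rather than Oracle code: the table name `visits`, the column names, the use of LAG() over ascending dates to fetch the previous visit, and julianday() for the day difference are all assumptions. Only the first three buckets are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (cust_id TEXT, visit_date TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("XC321", "2016-04-28"),
        ("AV626", "2016-05-18"),
        ("DX970", "2016-06-23"),
        ("XC321", "2016-05-28"),
        ("XC321", "2016-06-02"),
    ],
)

# days is NULL for a customer's first visit, so that row falls into no bucket.
buckets = conn.execute(
    """
    SELECT cust_id,
           COUNT(CASE WHEN days <= 10 THEN 1 END)             AS day_10,
           COUNT(CASE WHEN days BETWEEN 11 AND 20 THEN 1 END) AS day_20,
           COUNT(CASE WHEN days BETWEEN 21 AND 30 THEN 1 END) AS day_30
    FROM (
        SELECT cust_id,
               CAST(julianday(visit_date)
                    - julianday(LAG(visit_date) OVER (
                          PARTITION BY cust_id ORDER BY visit_date)) AS INTEGER) AS days
        FROM visits
    )
    GROUP BY cust_id
    ORDER BY cust_id
    """
).fetchall()

for row in buckets:
    print(row)
```

XC321 returned after 30 days and again after 5 days, so it gets one count in the 21-30 bucket and one in the 0-10 bucket; the other customers never returned.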
Oracle Setup:
CREATE TABLE customers ( CustID, Activity_Date ) AS
SELECT 'XC321', DATE '2016-04-28' FROM DUAL UNION ALL
SELECT 'AV626', DATE '2016-05-18' FROM DUAL UNION ALL
SELECT 'DX970', DATE '2016-06-23' FROM DUAL UNION ALL
SELECT 'XC321', DATE '2016-05-28' FROM DUAL UNION ALL
SELECT 'XC321', DATE '2016-06-02' FROM DUAL;
Query:
SELECT *
FROM (
SELECT CustID,
Activity_Date AS First_Date,
COUNT(1) OVER ( PARTITION BY CustID
ORDER BY Activity_Date
RANGE BETWEEN CURRENT ROW AND INTERVAL '10' DAY FOLLOWING )
- 1 AS "10_Day",
COUNT(1) OVER ( PARTITION BY CustID
ORDER BY Activity_Date
RANGE BETWEEN CURRENT ROW AND INTERVAL '20' DAY FOLLOWING )
- 1 AS "20_Day",
COUNT(1) OVER ( PARTITION BY CustID
ORDER BY Activity_Date
RANGE BETWEEN CURRENT ROW AND INTERVAL '30' DAY FOLLOWING )
- 1 AS "30_Day",
COUNT(1) OVER ( PARTITION BY CustID
ORDER BY Activity_Date
RANGE BETWEEN CURRENT ROW AND INTERVAL '40' DAY FOLLOWING )
- 1 AS "40_Day",
COUNT(1) OVER ( PARTITION BY CustID
ORDER BY Activity_Date
RANGE BETWEEN CURRENT ROW AND INTERVAL '50' DAY FOLLOWING )
- 1 AS "50_Day",
ROW_NUMBER() OVER ( PARTITION BY CustID ORDER BY Activity_Date ) AS rn
FROM Customers
)
WHERE rn = 1;
Output
CUSTID FIRST_DATE 10_Day 20_Day 30_Day 40_Day 50_Day RN
------ ------------------- ---------- ---------- ---------- ---------- ---------- ----------
AV626 2016-05-18 00:00:00 0 0 0 0 0 1
DX970 2016-06-23 00:00:00 0 0 0 0 0 1
XC321 2016-04-28 00:00:00 0 0 1 2 2 1
Here is an answer that works for me. I based it on the answers above; thanks to MT0 and Sagi for their contributions:
SELECT CustID,
visit_date,
Prev_Visit ,
COUNT( CASE WHEN (Days_between_visits) <=10 THEN 1 END) AS "0-10_day" ,
COUNT( CASE WHEN (Days_between_visits) BETWEEN 11 AND 20 THEN 1 END) AS "11-20_day" ,
COUNT( CASE WHEN (Days_between_visits) BETWEEN 21 AND 30 THEN 1 END) AS "21-30_day" ,
COUNT( CASE WHEN (Days_between_visits) BETWEEN 31 AND 40 THEN 1 END) AS "31-40_day" ,
COUNT( CASE WHEN (Days_between_visits) BETWEEN 41 AND 50 THEN 1 END) AS "41-50_day" ,
COUNT( CASE WHEN (Days_between_visits) >50 THEN 1 END) AS "51+_day"
FROM
(SELECT CustID,
visit_date,
Lead(T1.visit_date) over (partition BY T1.CustID order by T1.visit_date DESC) AS Prev_visit,
visit_date - Lead(T1.visit_date) over (
partition BY T1.CustID order by T1.visit_date DESC) AS Days_between_visits
FROM T1
) T2
WHERE Days_between_visits >0
GROUP BY T2.CustID ,
T2.visit_date ,
T2.Prev_visit ,
T2.Days_between_visits;
This returns:
CUSTID | VISIT_DATE | PREV_VISIT | DAYS_BETWEEN_VISIT | 0-10_DAY | 11-20_DAY | 21-30_DAY | 31-40_DAY | 41-50_DAY | 51+DAY
XC321 | 2016-05-28 | 2016-04-28 | 30 | | | 1 | | |
XC321 | 2016-06-02 | 2016-05-28 | 5 | 1 | | | | |