Calculate time between one event and the next one in PostgreSQL - sql

I have a table:

 id | user_id | created_at          | status
----+---------+---------------------+------------
  1 | 100     | 2022-12-13 00:12:12 | IN_TRANSIT
  2 | 104     | 2022-12-13 01:12:12 | IN_TRANSIT
  3 | 100     | 2022-12-13 02:12:12 | DONE
  4 | 100     | 2022-12-13 03:12:12 | IN_TRANSIT
  5 | 104     | 2022-12-13 04:12:12 | DONE
  6 | 100     | 2022-12-13 05:12:12 | DONE
  7 | 104     | 2022-12-13 06:12:12 | IN_TRANSIT
  7 | 104     | 2022-12-13 07:12:12 | REJECTED
I am trying to calculate the sum for each user of the idle time, so the time between status DONE and next IN_TRANSIT for that user.
The result should be:

 user_id | idle_time
---------+-----------
 100     | 01:00:00
 104     | 02:00:00

select user_id
,idle_time
from (
select user_id
,status
,created_at-lag(created_at) over(partition by user_id order by created_at) as idle_time
,lag(status) over(partition by user_id order by created_at) as pre_status
from t
) t
where status = 'IN_TRANSIT'
and pre_status = 'DONE'
 user_id | idle_time
---------+-----------
 100     | 01:00:00
 104     | 02:00:00
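The lag-based query can be sanity-checked outside PostgreSQL. Below is a sketch that runs the same window logic in SQLite (3.25+, which added window functions) through Python's standard sqlite3 module; SQLite has no interval type, so the idle time comes out in whole seconds via strftime('%s', ...) rather than timestamp subtraction. The duplicated id 7 is kept exactly as posted in the question.

```python
import sqlite3

# Load the question's sample data into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table t (id int, user_id int, created_at text, status text);
insert into t values
 (1, 100, '2022-12-13 00:12:12', 'IN_TRANSIT'),
 (2, 104, '2022-12-13 01:12:12', 'IN_TRANSIT'),
 (3, 100, '2022-12-13 02:12:12', 'DONE'),
 (4, 100, '2022-12-13 03:12:12', 'IN_TRANSIT'),
 (5, 104, '2022-12-13 04:12:12', 'DONE'),
 (6, 100, '2022-12-13 05:12:12', 'DONE'),
 (7, 104, '2022-12-13 06:12:12', 'IN_TRANSIT'),
 (7, 104, '2022-12-13 07:12:12', 'REJECTED');
""")

# Same shape as the PostgreSQL answer: lag() per user, keep rows where an
# IN_TRANSIT directly follows a DONE, then sum the gaps per user.
rows = conn.execute("""
select user_id, sum(idle_seconds)
from (
  select user_id, status,
         strftime('%s', created_at)
           - lag(strftime('%s', created_at))
               over (partition by user_id order by created_at) as idle_seconds,
         lag(status) over (partition by user_id order by created_at) as pre_status
  from t
)
where status = 'IN_TRANSIT' and pre_status = 'DONE'
group by user_id
order by user_id
""").fetchall()
print(rows)  # [(100, 3600), (104, 7200)] -- i.e. 01:00:00 and 02:00:00
```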

Try the following:
select user_id, min(case status when 'IN_TRANSIT' then created_at end) -
min(case status when 'DONE' then created_at end) idle_time
from
(
select user_id, created_at, status,
sum(case status when 'DONE' then 1 end) over (partition by user_id order by created_at) as grp
from table_name
) T
group by user_id, grp
having min(case status when 'IN_TRANSIT' then created_at end) -
min(case status when 'DONE' then created_at end) is not null
This finds the time difference between a 'DONE' status and the first following 'IN_TRANSIT' status. If there are multiple 'IN_TRANSIT' statuses after a 'DONE' and you want the difference to the last one, just change min(case status when 'IN_TRANSIT' then created_at end) to max.
Also, if there are multiple 'DONE'/'IN_TRANSIT' pairs for a user_id, they will show as separate rows in the result, but you can use this query as a subquery to find the sum of all differences grouped by user_id.
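The grouping idea (a running count of DONE rows splits each user's history into groups, and the first IN_TRANSIT inside a group closes that group's idle span) can also be stated in a few lines of Python. This sketch mirrors the grp/min logic with the per-user sum folded in:

```python
from datetime import datetime

rows = [  # (user_id, created_at, status), ordered by created_at
    (100, "2022-12-13 00:12:12", "IN_TRANSIT"),
    (104, "2022-12-13 01:12:12", "IN_TRANSIT"),
    (100, "2022-12-13 02:12:12", "DONE"),
    (100, "2022-12-13 03:12:12", "IN_TRANSIT"),
    (104, "2022-12-13 04:12:12", "DONE"),
    (100, "2022-12-13 05:12:12", "DONE"),
    (104, "2022-12-13 06:12:12", "IN_TRANSIT"),
    (104, "2022-12-13 07:12:12", "REJECTED"),
]

def idle_seconds(rows):
    ts = lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    totals, grp, done_at = {}, {}, {}
    for user, created, status in rows:
        if status == "DONE":
            grp[user] = grp.get(user, 0) + 1      # the "grp" column
            done_at[(user, grp[user])] = ts(created)
        elif status == "IN_TRANSIT" and (user, grp.get(user)) in done_at:
            # pop = only the FIRST IN_TRANSIT after a DONE counts (the min())
            start = done_at.pop((user, grp[user]))
            delta = ts(created) - start
            totals[user] = totals.get(user, 0) + int(delta.total_seconds())
    return totals

print(idle_seconds(rows))  # {100: 3600, 104: 7200}
```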

You need the lag function to find the difference between one row and the next. Note that the lag must be partitioned by user_id so that rows from different users are not compared:
select user_id, idle_time
from (
  select user_id, status,
         created_at - lag(created_at) over (partition by user_id order by created_at) as idle_time
  from calctime
) as drt
where idle_time > interval '10 sec' and status = 'IN_TRANSIT'

Related

How to find current_streaks in BQ SQL

Looking for the best way to find all current streaks as of today in BigQuery (so essentially the answer must be row_number() based, but otherwise any flavor of SQL should do).
created_at | user_id
-------------+---------
2022-02-10 | 1
2022-02-09 | 1
2022-02-08 | 1
2022-02-10 | 2
2022-01-20 | 3
Desired result only showing User_ID of the Streaker and their # of days Streaked
user_id | streak
----------+---------
1 | 3
2 | 1
User 3 is ignored because its streak did not make it to today.
You can add a condition outside the streak-identification code, which validates the existence of current_date() in the streak set and only display the valid streaks (i.e. ones which connect to today's date):
select user_id, array_length(array_agg(distinct created_at)) as streak from (
select
user_id,
created_at,
date_sub(created_at, interval rnk day) as grp from (
select
user_id,
date(created_at) as created_at,
dense_rank() over (partition by user_id order by created_at) as rnk
from table
)
)
group by user_id, grp
having current_date() in unnest( array_agg(distinct created_at))
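The date-minus-rank trick used above can be restated in plain Python: consecutive dates all map to the same anchor date, and only the groups that contain today survive. Dates are fixed here so the expected result is stable; in the query, current_date() plays the role of `today`.

```python
from datetime import date, timedelta

today = date(2022, 2, 10)  # stands in for current_date()
visits = {  # user_id -> activity dates, as in the question
    1: [date(2022, 2, 10), date(2022, 2, 9), date(2022, 2, 8)],
    2: [date(2022, 2, 10)],
    3: [date(2022, 1, 20)],
}

def current_streaks(visits, today):
    streaks = {}
    for user, days in visits.items():
        groups = {}
        # date - rank: all members of one consecutive run share an anchor
        for rank, d in enumerate(sorted(set(days)), start=1):
            groups.setdefault(d - timedelta(days=rank), []).append(d)
        for run in groups.values():
            if today in run:  # the HAVING ... in unnest(...) condition
                streaks[user] = len(run)
    return streaks

print(current_streaks(visits, today))  # {1: 3, 2: 1}
```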

Efficient way to get the average of past x events within d days per each row in SQL (big data)

I want to find the best and most efficient way to calculate the average of a score from the past 2 events within 7 days, and I need it per each row.
I already have a query that works on 60M rows, but on 100% of the data (~500M rows) it collapses (maybe it is inefficient, or maybe it is a lack of resources).
Can you help? If you think my solution is not the best way, please explain.
Thank you
I have this table:
user_id event_id start end score
---------------------------------------------------
1 7 30/01/2021 30/01/2021 45
1 6 24/01/2021 29/01/2021 25
1 5 22/01/2021 23/01/2021 13
1 4 18/01/2021 21/01/2021 15
1 3 17/01/2021 17/01/2021 52
1 2 08/01/2021 10/01/2021 8
1 1 01/01/2021 02/01/2021 36
I want per line (user id+event id): to get the average score of the past 2 events in the last 7 days.
Example: for this row:
user_id event_id start end score
---------------------------------------------------
1 6 24/01/2021 29/01/2021 25
user_id event_id start end score past_7_days_from_start event_num
--------------------------------------------------------------------------------------
1 6 24/01/2021 29/01/2021 25 null null
1 5 22/01/2021 23/01/2021 13 yes 1
1 4 18/01/2021 21/01/2021 15 yes 2
1 3 17/01/2021 17/01/2021 52 yes 3
1 2 08/01/2021 10/01/2021 8 no 4
1 1 01/01/2021 02/01/2021 36 no 5
so I would select only this rows for the group by and then avg(score):
user_id event_id start end score past_7_days_from_start event_num
--------------------------------------------------------------------------------------
1 5 22/01/2021 23/01/2021 13 yes 1
1 4 18/01/2021 21/01/2021 15 yes 2
Result:
user_id event_id start end score avg_score_of_past_2_events_within_7_days
--------------------------------------------------------------------------------------
1 6 24/01/2021 29/01/2021 25 14
My query:
SELECT user_id, event_id, AVG(score) as avg_score_of_past_2_events_within_7_days
FROM (
SELECT
B.user_id, B.event_id, A.score,
ROW_NUMBER() OVER (PARTITION BY B.user_id, B.event_id ORDER BY A.end desc) AS event_num,
FROM
"df" A
INNER JOIN
(SELECT user_id, event_id, start FROM "df") B
ON B.user_id = A.user_id
AND (A.end BETWEEN DATE_SUB(B.start, INTERVAL 7 DAY) AND B.start))
WHERE event_num >= 2
GROUP BY user_id, event_id
Any suggestion for a better way?
I don't believe there is a more efficient query in your case.
I can suggest the following:
Make sure your base table is partitioned by start and clustered by user_id.
Split the query into 3 parts, each creating a partitioned and clustered table:
first table: only the inner join O(n^2)
second table: add ROW_NUMBER O(n)
third table: the group by
If it is still a problem, I would suggest doing batch preprocessing and running the queries by dates.
I've tried to create a solution using LEAD functions, but I am not able to test whether it works on that large a dataset.
I create the two preceding rows as prev and ante using LEAD.
Then I have an IF for the 7-day window; if it matches I create scorePP and scoreAA, otherwise they are null.
with t as (
select 1 as user_id,7 as event_id,parse_date('%d/%m/%Y','30/01/2021') as start,parse_date('%d/%m/%Y','30/01/2021') as stop, 45 as score union all
select 1 as user_id,6 as event_id,parse_date('%d/%m/%Y','24/01/2021') as start,parse_date('%d/%m/%Y','29/01/2021') as stop, 25 as score union all
select 1 as user_id,5 as event_id,parse_date('%d/%m/%Y','22/01/2021') as start,parse_date('%d/%m/%Y','23/01/2021') as stop, 13 as score union all
select 1 as user_id,4 as event_id,parse_date('%d/%m/%Y','18/01/2021') as start,parse_date('%d/%m/%Y','21/01/2021') as stop, 15 as score union all
select 1 as user_id,3 as event_id,parse_date('%d/%m/%Y','17/01/2021') as start,parse_date('%d/%m/%Y','17/01/2021') as stop, 52 as score union all
select 1 as user_id,2 as event_id,parse_date('%d/%m/%Y','08/01/2021') as start,parse_date('%d/%m/%Y','10/01/2021') as stop, 8 as score union all
select 1 as user_id,1 as event_id,parse_date('%d/%m/%Y','01/01/2021') as start,parse_date('%d/%m/%Y','02/01/2021') as stop, 36 as score union all
select 2 as user_id,3 as event_id,parse_date('%d/%m/%Y','12/01/2021') as start,parse_date('%d/%m/%Y','17/01/2021') as stop, 52 as score union all
select 2 as user_id,2 as event_id,parse_date('%d/%m/%Y','08/01/2021') as start,parse_date('%d/%m/%Y','10/01/2021') as stop, 8 as score union all
select 2 as user_id,1 as event_id,parse_date('%d/%m/%Y','01/01/2021') as start,parse_date('%d/%m/%Y','02/01/2021') as stop, 36 as score
)
select *, (select avg(x) from unnest([scorePP,scoreAA]) as x) as avg_score_7_day from (
SELECT
t.*,
lead(start,1) over(partition by user_id order by event_id desc, t.stop desc) prev_start,
lead(stop,1) over(partition by user_id order by event_id desc, t.stop desc) prev_stop,
lead(score,1) over(partition by user_id order by event_id desc, t.stop desc) prev_score,
if(((lead(start,1) over(partition by user_id order by event_id desc, t.stop desc)) between date_sub(start, interval 7 day) and (lead(stop,1) over(partition by user_id order by event_id desc, t.stop desc))),lead(score,1) over(partition by user_id order by event_id desc, t.stop desc),null) as scorePP,
/**/
lead(start,2) over(partition by user_id order by event_id desc, t.stop desc) ante_start,
lead(stop,2) over(partition by user_id order by event_id desc, t.stop desc) ante_stop,
lead(score,2) over(partition by user_id order by event_id desc, t.stop desc) ante_score,
if(((lead(start,2) over(partition by user_id order by event_id desc, t.stop desc)) between date_sub(start, interval 7 day) and (lead(stop,2) over(partition by user_id order by event_id desc, t.stop desc))),lead(score,2) over(partition by user_id order by event_id desc, t.stop desc),null) as scoreAA,
from
t
)
where coalesce(scorePP,scoreAA) is not null
order by user_id,event_id desc
Consider below approach
select * except(candidates1, candidates2),
( select avg(score)
from (
select * from unnest(candidates1) union distinct
select * from unnest(candidates2)
order by event_id desc
limit 2
)
) as avg_score_of_past_2_events_within_7_days
from (
select *,
array_agg(struct(event_id, score)) over(order by unix_date(t.start) range between 7 preceding and 1 preceding) as candidates1,
array_agg(struct(event_id, score)) over(order by unix_date(t.end) range between 7 preceding and 1 preceding) as candidates2
from your_table t
)
If applied to the sample data in your question, this produces the expected output.
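As a correctness oracle for either approach above, a brute-force version of the definition is handy: per event, average the scores of the (at most) two most recent earlier events whose end date falls within 7 days before this event's start. This Python sketch is quadratic, so it is only for validating samples, not for the 500M-row table.

```python
from datetime import date

events = [  # (user_id, event_id, start, end, score) from the question
    (1, 7, date(2021, 1, 30), date(2021, 1, 30), 45),
    (1, 6, date(2021, 1, 24), date(2021, 1, 29), 25),
    (1, 5, date(2021, 1, 22), date(2021, 1, 23), 13),
    (1, 4, date(2021, 1, 18), date(2021, 1, 21), 15),
    (1, 3, date(2021, 1, 17), date(2021, 1, 17), 52),
    (1, 2, date(2021, 1, 8), date(2021, 1, 10), 8),
    (1, 1, date(2021, 1, 1), date(2021, 1, 2), 36),
]

def avg_past_2_within_7_days(events):
    out = {}
    for user, eid, start, _end, _score in events:
        # earlier events of the same user ending within the 7-day lookback
        past = sorted((ev for ev in events
                       if ev[0] == user and ev[3] < start
                       and (start - ev[3]).days <= 7),
                      key=lambda ev: ev[3], reverse=True)[:2]
        out[(user, eid)] = (sum(ev[4] for ev in past) / len(past)
                            if past else None)
    return out

print(avg_past_2_within_7_days(events)[(1, 6)])  # 14.0, as in the question
```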

SQL 30 day active user query

I have a table of users and how many events they fired on a given date:
 DATE       | USERID | EVENTS
------------+--------+--------
 2021-08-27 | 1      | 5
 2021-07-25 | 1      | 7
 2021-07-23 | 2      | 3
 2021-07-20 | 3      | 9
 2021-06-22 | 1      | 9
 2021-05-05 | 1      | 4
 2021-05-05 | 2      | 2
 2021-05-05 | 3      | 6
 2021-05-05 | 4      | 8
 2021-05-05 | 5      | 1
I want to create a table showing number of active users for each date with active user being defined as someone who has fired an event on the given date or in any of the preceding 30 days.
 DATE       | ACTIVE_USERS
------------+--------------
 2021-08-27 | 1
 2021-07-25 | 3
 2021-07-23 | 2
 2021-07-20 | 2
 2021-06-22 | 1
 2021-05-05 | 5
I tried the following query which returned only the users who were active on the specified date:
SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;
I also tried using a window function with ROWS BETWEEN, but I seem to end up with the same result:
SELECT
DATE,
SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
DATE,
CASE
WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
ELSE 0
END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1
I'm using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.
This is tricky to do with window functions, because count(distinct) is not permitted. You can use a self-join:
select t1.date, count(distinct t2.userid)
from table t1 join
     table t2
     on t2.date <= t1.date and
        t2.date > t1.date - interval '30 day'
group by t1.date;
However, that can be expensive. One solution is to "unpivot" the data. That is, do an incremental count per user of going "in" and "out" of active states and then do a cumulative sum:
with d as ( -- calculate the dates with "ins" and "outs"
select user, date, +1 as inc
from table
union all
select user, date + interval '30 day', -1 as inc
from table
),
d2 as ( -- accumulate to get the net actives per day
select date, user, sum(inc) as change_on_day,
sum(sum(inc)) over (partition by user order by date) as running_inc
from d
group by date, user
),
d3 as ( -- summarize into active periods
select user, min(date) as start_date, max(date) as end_date
from (select d2.*,
sum(case when running_inc = 0 then 1 else 0 end) over (partition by user order by date) as active_period
from d2
) d2
where running_inc > 0
group by user, active_period
)
select d.date, count(d3.user)
from (select distinct date from table) d left join
d3
on d.date >= start_date and d.date < end_date
group by d.date;
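Either query can be validated against a brute-force oracle that applies the definition directly: for each reporting date, count the distinct users with at least one event on that date or in the preceding 30 days. This Python sketch is O(dates × events), so it is only for checking results on samples:

```python
from datetime import date

events = [  # (date, user_id) from the question
    (date(2021, 8, 27), 1), (date(2021, 7, 25), 1), (date(2021, 7, 23), 2),
    (date(2021, 7, 20), 3), (date(2021, 6, 22), 1), (date(2021, 5, 5), 1),
    (date(2021, 5, 5), 2), (date(2021, 5, 5), 3), (date(2021, 5, 5), 4),
    (date(2021, 5, 5), 5),
]

def active_users_per_date(events):
    out = {}
    for d in sorted({d for d, _ in events}, reverse=True):
        # distinct users with an event on d or in the 30 days before it
        out[d] = len({u for ed, u in events
                      if ed <= d and (d - ed).days <= 30})
    return out

for d, n in active_users_per_date(events).items():
    print(d, n)
```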

Review sessions from timestamp column with start/stop event with duplicate start/stop records

I have the following case change records:

 id | case_id | state         | time_created
----+---------+---------------+-------------------------------
  1 | 100     | REVIEW_NEEDED | 2021-03-30 15:11:58.015907000
  2 | 100     | REVIEW_NEEDED | 2021-04-01 13:08:17.945926000
  3 | 100     | REVIEW        | 2021-04-07 06:20:48.873865000
  4 | 100     | WAITING       | 2021-04-07 06:32:47.159664000
  5 | 100     | REVIEW_NEEDED | 2021-04-09 06:32:51.132127000
  6 | 100     | REVIEW        | 2021-04-12 04:39:36.426467000
  7 | 100     | REVIEW        | 2021-04-12 04:40:36.000000000
  8 | 100     | CLOSED        | 2021-04-12 04:40:43.133736000
  9 | 101     | REVIEW_NEEDED | 2021-03-30 20:37:58.015907000
 10 | 101     | REVIEW        | 2021-04-04 13:08:17.945926000
 11 | 101     | CLOSED        | 2021-04-06 06:20:48.873865000
 12 | 101     | CLOSED        | 2021-04-06 06:20:50.000000000
I'd like to report sessions out of these like the following:

 open_id | close_id | case_id | waiting_time_start            | handling_time_start           | handling_time_end
---------+----------+---------+-------------------------------+-------------------------------+-------------------------------
 1       | 4        | 100     | 2021-03-30 15:11:58.015907000 | 2021-04-07 06:20:48.873865000 | 2021-04-07 06:32:47.159664000
 5       | 8        | 100     | 2021-04-09 06:32:51.132127000 | 2021-04-12 04:39:36.426467000 | 2021-04-12 04:40:43.133736000
 9       | 11       | 101     | 2021-03-30 20:37:58.015907000 | 2021-04-04 13:08:17.945926000 | 2021-04-06 06:20:48.873865000
Waiting_time_start: when state = REVIEW_NEEDED
Handling_time_start: when state = REVIEW
Handling_time_end: when state = WAITING or CLOSED
My current solution is to rank the waiting_time_start, handling_time_start and handling_time_end events for each case and then join them on rank, but this is not perfect because there are duplicate records, so the number of start/stop events can differ per case.
Thanks a lot for any ideas!
This is rather complicated. Start by adding a grouping based on the count of "waiting" and "closed" -- but only when they change values:
select t.*,
       sum(case when (state <> next_state or next_state is null) and
                     state in ('WAITING', 'CLOSED')
                then 1 else 0
           end) over (partition by case_id order by time_created desc) as grouping
from (select t.*,
             lead(state) over (partition by case_id order by time_created) as next_state
      from t
     ) t
Then, you can just aggregate:
with cte as (
      select t.*,
             sum(case when (state <> next_state or next_state is null) and
                           state in ('WAITING', 'CLOSED')
                      then 1 else 0
                 end) over (partition by case_id order by time_created desc) as grouping
      from (select t.*,
                   lead(state) over (partition by case_id order by time_created) as next_state
            from t
           ) t
     )
select case_id, min(id), max(id),
       min(case when state = 'REVIEW_NEEDED' then time_created end),
       min(case when state = 'REVIEW' then time_created end),
       max(time_created)
from cte
group by grouping, case_id;
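For reference, the intended session split can also be stated directly in Python. This sketch closes a session on the first of a run of duplicated WAITING/CLOSED rows, which is what the desired output implies for case 101 (close_id 11, not 12); fractional seconds are dropped for brevity.

```python
from itertools import groupby

records = [  # (id, case_id, state, time_created), ordered per case by time
    (1, 100, "REVIEW_NEEDED", "2021-03-30 15:11:58"),
    (2, 100, "REVIEW_NEEDED", "2021-04-01 13:08:17"),
    (3, 100, "REVIEW", "2021-04-07 06:20:48"),
    (4, 100, "WAITING", "2021-04-07 06:32:47"),
    (5, 100, "REVIEW_NEEDED", "2021-04-09 06:32:51"),
    (6, 100, "REVIEW", "2021-04-12 04:39:36"),
    (7, 100, "REVIEW", "2021-04-12 04:40:36"),
    (8, 100, "CLOSED", "2021-04-12 04:40:43"),
    (9, 101, "REVIEW_NEEDED", "2021-03-30 20:37:58"),
    (10, 101, "REVIEW", "2021-04-04 13:08:17"),
    (11, 101, "CLOSED", "2021-04-06 06:20:48"),
    (12, 101, "CLOSED", "2021-04-06 06:20:50"),
]

def sessions(records):
    out = []
    for case_id, recs in groupby(records, key=lambda r: r[1]):
        session, prev = [], None
        for rid, _, state, ts in recs:
            session.append((rid, state, ts))
            # close on the FIRST of a run of end states, so a duplicated
            # trailing WAITING/CLOSED row is absorbed and discarded
            if state in ("WAITING", "CLOSED") and state != prev:
                waiting = next((t for _, s, t in session if s == "REVIEW_NEEDED"), None)
                handling = next((t for _, s, t in session if s == "REVIEW"), None)
                out.append((session[0][0], rid, case_id, waiting, handling, ts))
                session = []
            prev = state
    return out

print(sessions(records)[2])
# (9, 11, 101, '2021-03-30 20:37:58', '2021-04-04 13:08:17', '2021-04-06 06:20:48')
```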

Google Big Query - Calculating monthly totals by status based on multiple date conditionals

I have table with the following data:
customer_id subscription_id plan status trial_start trial_end activated_at cancelled_at
1 jg1 basic cancelled 2020-06-26 2020-07-14 2020-07-14 2020-09-25
2 ab1 basic cancelled 2020-08-10 2020-08-24 2020-08-24 2021-02-15
3 cf8 basic cancelled 2020-08-25 2020-09-04 2020-09-04 2020-10-24
4 bc2 basic active 2020-10-12 2020-10-26 2020-10-26
5 hg4 basic active 2021-01-09 2021-02-08 2021-02-08
6 cd5 basic in_trial 2021-02-26
As you can see from the table, status = in_trial while a subscription is in trial. When a subscription converts from in_trial to active, activated_at is set. When an in_trial or active subscription is cancelled, status switches to cancelled and cancelled_at is set. The status column always shows only the most recent status of a subscription: a new row does not appear for every status change; instead, status is updated in place and the appropriate date columns are set to reflect when the status changed.
My goal is to calculate, month over month, how many subscriptions are in status = in_trial, how many in status = active, and how many in status = cancelled. Because the status column reflects only the most recent status, the query has to determine how many subscriptions were in_trial, active, and cancelled in each month based on the available date columns.
If a particular subscription had multiple statuses in a given month (for example, subscription_id = ab1 was in trial in Aug-2020 and also converted to active in Aug-2020), I want only the most recent status to be considered for that subscription. So, as example, for subscription_id = ab1 I want it to be counted as active subscription for the month of Aug-2020.
The output I am looking for is:
date in_trial active cancelled
2020-06-01 1 0 0
2020-07-01 0 1 0
2020-08-01 1 2 0
2020-09-01 0 2 1
2020-10-01 0 2 1
2020-11-01 0 2 0
2020-12-01 0 2 0
2021-01-01 1 2 0
2021-02-01 1 2 1
2021-03-01 1 2 0
Or, results can be displayed in a different format, as long as numbers are correct. Another example of output can be:
date status count
2020-06-01 in_trial 1
2020-06-01 active 0
2020-06-01 cancelled 0
2020-07-01 in_trial 0
2020-07-01 active 1
2020-07-01 cancelled 0
... ... ...
2021-03-01 in_trial 1
2021-03-01 active 2
2021-03-01 cancelled 0
Below is the query you can use to reproduce the example table provided in this question:
SELECT 1 AS customer_id, 'jg1' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-06-26' AS trial_start, '2020-07-14' AS trial_end, '2020-07-14' AS activated_at, '2020-09-25' AS cancelled_at UNION ALL
SELECT 2 AS customer_id, 'ab1' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-08-10' AS trial_start, '2020-08-24' AS trial_end, '2020-08-24' AS activated_at, '2021-02-15' AS cancelled_at UNION ALL
SELECT 3 AS customer_id, 'cf8' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-08-25' AS trial_start, '2020-09-04' AS trial_end, '2020-09-04' AS activated_at, '2020-10-24' AS cancelled_at UNION ALL
SELECT 4 AS customer_id, 'bc2' AS subscription_id, 'basic' AS plan, 'active' AS status, '2020-10-12' AS trial_start, '2020-10-26' AS trial_end, '2020-10-26' AS activated_at, '' AS cancelled_at UNION ALL
SELECT 5 AS customer_id, 'hg4' AS subscription_id, 'basic' AS plan, 'active' AS status, '2021-01-09' AS trial_start, '2021-02-08' AS trial_end, '2021-02-08' AS activated_at, '' AS cancelled_at UNION ALL
SELECT 6 AS customer_id, 'cd5' AS subscription_id, 'basic' AS plan, 'in_trial' AS status, '2021-02-26' AS trial_start, '' AS trial_end, '' AS activated_at, '' AS cancelled_at
I have been working on this problem since yesterday morning and continuing to figure out a way to do this efficiently. Thank you in advance for helping me solve this problem.
Below should work for you
select month,
count(distinct if(status = 0, customer_id, null)) in_trial,
count(distinct if(status = 1, customer_id, null)) active,
count(distinct if(status = 2, customer_id, null)) canceled
from (
select month, customer_id,
array_agg(status order by status desc limit 1)[offset(0)] status
from (
select distinct customer_id, 0 status, date_trunc(date, month) month
from `project.dataset.table`,
unnest(generate_date_array(date(trial_start), ifnull(date(trial_end), current_date()))) date
union all
select distinct customer_id, 1 status, date_trunc(date, month) month
from `project.dataset.table`,
unnest(generate_date_array(date(activated_at), ifnull(date(cancelled_at), current_date()))) date
union all
select distinct customer_id, 2 status, date_trunc(date(cancelled_at), month) month
from `project.dataset.table`
)
where not month is null
group by month, customer_id
)
group by month
# order by month
If applied to the sample data in your question, this produces the expected output.
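A brute-force cross-check of the month-by-month counts is straightforward in Python. The rules mirrored here are the question's: a subscription counts as cancelled only in its cancellation month, as active from its activation month up to (but excluding) the cancellation month, and as in_trial otherwise while its trial is running; active in the activation month wins over in_trial, like the array_agg(... order by status desc limit 1) step. The reporting window is fixed to 2020-06 through 2021-03 so the result is deterministic (the query instead runs open-ended up to current_date()).

```python
from datetime import date

LAST_MONTH = (2021, 3)  # the question's reporting window ends here

subs = [  # (customer_id, trial_start, activated_at, cancelled_at)
    (1, date(2020, 6, 26), date(2020, 7, 14), date(2020, 9, 25)),
    (2, date(2020, 8, 10), date(2020, 8, 24), date(2021, 2, 15)),
    (3, date(2020, 8, 25), date(2020, 9, 4), date(2020, 10, 24)),
    (4, date(2020, 10, 12), date(2020, 10, 26), None),
    (5, date(2021, 1, 9), date(2021, 2, 8), None),
    (6, date(2021, 2, 26), None, None),
]

def month_of(d):
    return (d.year, d.month)

def status_in_month(sub, m):
    _, trial_start, activated, cancelled = sub
    if cancelled and month_of(cancelled) == m:
        return "cancelled"        # cancelled counts only in its own month
    if activated and month_of(activated) <= m and (not cancelled or m < month_of(cancelled)):
        return "active"
    trial_end = month_of(activated) if activated else LAST_MONTH
    if month_of(trial_start) <= m <= trial_end:
        return "in_trial"         # active in the activation month wins above
    return None

months = [(2020, m) for m in range(6, 13)] + [(2021, m) for m in range(1, 4)]
counts = {m: {"in_trial": 0, "active": 0, "cancelled": 0} for m in months}
for m in months:
    for sub in subs:
        s = status_in_month(sub, m)
        if s:
            counts[m][s] += 1
print(counts[(2020, 9)])  # {'in_trial': 0, 'active': 2, 'cancelled': 1}
```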