How to skip gaps when using SQL GROUP BY to compute duration - sql

My table has ACTIONS with USER and date/time of each action. A group of Actions by the same user is a session. I want to measure the duration of each session. However if there is a gap larger than 5 minutes without any action, then I want to start a new session.
I am using this SQL
Select USER,
TRUNC((Max(ACTION_DATE) - Min(ACTION_DATE))*24*60,2) as Minutes
from USER_ACTION_DATA
Group by USER,TRUNC(ACTION_DATE,'J') ;
And it works pretty well if the user only has one session per day. But if the data is
USER ACTION_DATE
---------------------
john 2021-05-24 11:30:22
john 2021-05-24 11:32:12
john 2021-05-24 11:32:44
john 2021-05-24 11:36:08
john 2021-05-24 14:20:02
john 2021-05-24 14:23:52
it will show a single session with 173 minutes. But that is wrong. It should be two sessions with 6 and 3 minutes (because the gap between the 4th and 5th record is more than 5 minutes). Is this possible with SQL, or do I have to do the grouping with a real programming language?

If you're on a recent version of Oracle you can use match_recognize:
select user_name,
start_date,
end_date,
trunc((end_date - start_date) * 24 * 60, 2) as minutes
from user_action_data
match_recognize (
partition by user_name
order by action_date
measures
first(action_date) as start_date,
last(action_date) as end_date
pattern (A B*)
define B as action_date <= prev(action_date) + interval '5' minute
);
USER_NAME START_DATE          END_DATE            MINUTES
---------------------------------------------------------
john      2021-05-24 11:30:22 2021-05-24 11:36:08 5.76
john      2021-05-24 14:20:02 2021-05-24 14:23:52 3.83
USER is a reserved word (and a function) so I've changed the column name to USER_NAME to make it valid.
Because it's only looking at the interval it will allow a session to span midnight, rather than restricting to sessions within a day, as you are doing by truncating with J. I'm assuming that's a good thing, of course. If it isn't then you can change it to only look within the same day:
define B as trunc(action_date) = trunc(prev(action_date))
and action_date <= prev(action_date) + interval '5' minute
db<>fiddle with some additional sample data to go into the next day.
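For readers who want to check the pattern logic outside the database, here is a minimal Python sketch of what PATTERN (A B*) does for a single user: start a session, then keep extending it while each row is within 5 minutes of the previous one. The function names are hypothetical, and the sketch handles one user's timestamps only (the SQL does the per-user split with PARTITION BY).

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=5)  # session timeout taken from the question


def split_sessions(times, gap=GAP):
    """Greedy equivalent of PATTERN (A B*) for one user's timestamps."""
    sessions = []
    for t in sorted(times):
        if sessions and t <= sessions[-1][-1] + gap:
            sessions[-1].append(t)   # row matches B: within the gap of the previous row
        else:
            sessions.append([t])     # row matches A: start a new session
    return sessions


def minutes_truncated(start, end):
    """Duration in minutes, truncated (not rounded) to 2 decimals like TRUNC(x, 2)."""
    seconds = int((end - start).total_seconds())
    return (seconds * 100 // 60) / 100


john = [
    datetime(2021, 5, 24, 11, 30, 22),
    datetime(2021, 5, 24, 11, 32, 12),
    datetime(2021, 5, 24, 11, 32, 44),
    datetime(2021, 5, 24, 11, 36, 8),
    datetime(2021, 5, 24, 14, 20, 2),
    datetime(2021, 5, 24, 14, 23, 52),
]

result = [(s[0], s[-1], minutes_truncated(s[0], s[-1])) for s in split_sessions(john)]
for start, end, minutes in result:
    print(start, end, minutes)   # durations are 5.76 and 3.83
```

Because the comparison only looks at the previous row, a session in this sketch can also span midnight, just like the first MATCH_RECOGNIZE query.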

You can assign a session ID using window functions: look up the previous action with LAG(), flag a session start whenever the previous action is missing or at least 5 minutes old, and take a running sum of those flags:
-- user_name stands in for the USER column, because USER is a reserved word in Oracle
select uad.*,
sum(case when prev_action_date > action_date - interval '5' minute then 0 else 1 end) over
(partition by user_name order by action_date) as session_id
from (select uad.*,
lag(action_date) over (partition by user_name order by action_date) as prev_action_date
from USER_ACTION_DATA uad
) uad;
You can then aggregate this if you like:
-- user_name stands in for the USER column, because USER is a reserved word in Oracle
select user_name, min(action_date), max(action_date),
( max(action_date) - min(action_date) ) * 24*60*60 as seconds
from (select uad.*,
sum(case when prev_action_date > action_date - interval '5' minute then 0 else 1 end) over
(partition by user_name order by action_date) as session_id
from (select uad.*,
lag(action_date) over (partition by user_name order by action_date) as prev_action_date
from USER_ACTION_DATA uad
) uad
) uad
group by user_name, session_id;
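The lag-plus-running-sum idea can be traced by hand with a short Python sketch. It assumes a single user's timestamps (the SQL handles several users through PARTITION BY), and the function name is hypothetical. The strict ">" comparison mirrors the CASE expression above, so a gap of exactly 5 minutes starts a new session.

```python
from datetime import datetime, timedelta
from itertools import accumulate


def assign_session_ids(times, gap=timedelta(minutes=5)):
    """Return (timestamp, session_id) pairs for one user's actions."""
    times = sorted(times)
    prev = [None] + times[:-1]                      # LAG(action_date)
    flags = [
        0 if p is not None and p > t - gap else 1   # 1 marks the start of a session
        for p, t in zip(prev, times)
    ]
    return list(zip(times, accumulate(flags)))      # running SUM(...) OVER (ORDER BY ...)


john = [
    datetime(2021, 5, 24, 11, 30, 22),
    datetime(2021, 5, 24, 11, 32, 12),
    datetime(2021, 5, 24, 11, 32, 44),
    datetime(2021, 5, 24, 11, 36, 8),
    datetime(2021, 5, 24, 14, 20, 2),
    datetime(2021, 5, 24, 14, 23, 52),
]

ids = assign_session_ids(john)
for ts, session_id in ids:
    print(ts, session_id)   # the first four rows get 1, the last two get 2
```

Grouping the pairs by session_id and taking the minimum and maximum timestamp then gives the same result as the aggregate query.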

Related

SQL Server : count distinct every 30 minutes or more

We have an activity database that records user interaction to a website, storing a log that includes values such as Time1, session_id and customer_id e.g.
2022-05-12 08:00:00|11|1
2022-05-12 08:20:00|11|1
2022-05-12 08:30:01|11|1
2022-05-12 08:14:00|22|2
2022-05-12 08:18:00|22|2
2022-05-12 08:16:00|33|1
2022-05-12 08:50:00|33|1
I need to have two separate queries:
Query #1: I need to count sessions multiple times if they have a log of 30 minutes or more grouping them on sessions on daily basis.
For example: Initially count=0
For session_id = 11, it starts at 08:00 and the last time with the same session_id is 08:30 -- count=1
For session_id = 22 it starts at 08:14 and the last time with the same session is 08:14 -- still the count=1 since it was less than 30 min
I tried this query, but it didn't work
select
count(session_id)
from
table1
where
#datetime between Time1 and dateadd(minute, 30, Time1);
Expected result:
Query #2: it's an extension of the above query where I need the unique customers on daily basis whose sessions were 30 min or more.
For example: from the above table I will have two unique customers on May 8th
Expected result
The Time1 column is stored as a timestamp; in the output I will group it on a daily basis.
This is a two-level aggregation (GROUP BY) problem. You need to start with a subquery to get the first and last timestamp of each session.
SELECT MIN(Time1) start_time,
MAX(Time1) end_time,
session_id, customer_id
FROM table1
GROUP BY session_id, customer_id
Next you need to use the subquery like this:
SELECT COUNT(session_id),
COUNT(DISTINCT customer_id),
CAST(start_time AS DATE)
FROM (
SELECT MIN(Time1) start_time,
MAX(Time1) end_time,
session_id, customer_id
FROM table1
GROUP BY session_id, customer_id
) a
WHERE DATEDIFF(MINUTE, start_time, end_time) >= 30
GROUP BY CAST(start_time AS DATE);
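To see the two aggregation levels separately, here is a small Python sketch run against the sample rows from the question. The variable names are hypothetical. It compares elapsed time against 30 minutes, which is an assumption that differs slightly from DATEDIFF(MINUTE, ...) (DATEDIFF counts minute boundaries crossed), although both agree on this data.

```python
from collections import defaultdict
from datetime import datetime, timedelta

rows = [
    ("2022-05-12 08:00:00", 11, 1),
    ("2022-05-12 08:20:00", 11, 1),
    ("2022-05-12 08:30:01", 11, 1),
    ("2022-05-12 08:14:00", 22, 2),
    ("2022-05-12 08:18:00", 22, 2),
    ("2022-05-12 08:16:00", 33, 1),
    ("2022-05-12 08:50:00", 33, 1),
]

# Level 1: first and last timestamp per (session_id, customer_id).
bounds = {}
for ts_text, session_id, customer_id in rows:
    ts = datetime.strptime(ts_text, "%Y-%m-%d %H:%M:%S")
    key = (session_id, customer_id)
    start, end = bounds.get(key, (ts, ts))
    bounds[key] = (min(start, ts), max(end, ts))

# Level 2: keep sessions of 30 minutes or more, then aggregate per start day.
sessions_per_day = defaultdict(int)
customers_per_day = defaultdict(set)
for (session_id, customer_id), (start, end) in bounds.items():
    if end - start >= timedelta(minutes=30):
        sessions_per_day[start.date()] += 1
        customers_per_day[start.date()].add(customer_id)

summary = {
    day: (sessions_per_day[day], len(customers_per_day[day]))
    for day in sessions_per_day
}
print(summary)   # one day, with 2 qualifying sessions and 1 distinct customer
```

Sessions 11 and 33 qualify and both belong to customer 1, so the day has two long sessions but only one distinct customer.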

BigQuery SQL to keep the difference between two date values as 7 days

I have 2 date parameters in my BigQuery, one is for the first time open date(first_open) and other is the date when values are fetched(date). I need to find the set of users(id) who have opened on a specific date and their values for ONLY the next 7 days not more.
Eg.
1June20(first_open) user should be only in the date till 7June20(date)
2June20(first_open) user should be only till 8June20(date)
7June20(first_open) user should be only till 13June20(date)
SELECT
event_name,
COUNT(DISTINCT id) uniques,
COUNT(id) as total
FROM
`x-12.analytics_7.xyz`
WHERE
(first_open between "2020-06-01" and "2020-06-07")
AND (date BETWEEN "20200601" AND "20200613")
AND event_names in ("app_open","first_open")
AND platform = "ANDROID"
GROUP BY
event_names
As you can see from the query I'm using I'm restricting the users to 7 days of open but I can't restrict their values to only 7 days.
From what you describe, you can use COUNTIF():
SELECT event_name, COUNT(DISTINCT id) uniques, COUNT(id) as total,
COUNTIF(date < DATE_ADD(first_open, interval 7 day)) as total_first_7_days
FROM `x-12.analytics_7.xyz`
WHERE first_open between '2020-06-01' and '2020-06-07' and
date BETWEEN '2020-06-01' AND '2020-06-13' and
event_name in ('app_open', 'first_open') and
platform = 'ANDROID'
GROUP BY event_name;
Or, you can put the logic in the WHERE clause:
WHERE first_open between '2020-06-01' and '2020-06-07' and
date >= first_open and
date < date_add(first_open, interval 7 day) and
event_name in ('app_open', 'first_open') and
platform = 'ANDROID'
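The window condition in the WHERE clause (date >= first_open and date < first_open + 7 days) is a half-open interval, which is what makes 1 June map to 7 June and 7 June map to 13 June. A tiny Python sketch, with a hypothetical function name, makes the boundaries easy to check:

```python
from datetime import date, timedelta


def within_first_week(first_open, event_date, days=7):
    """True when event_date falls in [first_open, first_open + days)."""
    return first_open <= event_date < first_open + timedelta(days=days)


print(within_first_week(date(2020, 6, 1), date(2020, 6, 7)))    # True
print(within_first_week(date(2020, 6, 1), date(2020, 6, 8)))    # False
print(within_first_week(date(2020, 6, 7), date(2020, 6, 13)))   # True
```

Using "<=" with a 7-day offset instead would admit an eighth day, so the strict upper bound is the one that matches the examples in the question.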

Define user's sessions (sql)

I have an event table (user_id, timestamp). I need to write a query to define a user session (every user can have more than one session and every session can have >= 1 event). 30 minutes of inactivity for the user is a completed session.
The output table should have the following format: (user_id, start_session, end_session). I wrote part of the query, but I have no idea what to do next.
select
t.user_id,
t.ts start_session,
t.next_ts
from ( select
user_id,
ts,
DATEDIFF(SECOND, lag(ts, 1) OVER(partition by user_id order by ts), ts) next_ts
from
events_tabl ) t
You want a cumulative sum to identify the sessions and then aggregation:
select user_id, session_id, min(ts), max(ts)
from (select e.*,
sum(case when prev_ts > dateadd(minute, -30, ts)
then 0 else 1
end) over (partition by user_id order by ts) as session_id
from (select e.*,
lag(ts) over (partition by user_id order by ts) as prev_ts
from events_tabl e
) e
) e
group by user_id, session_id;
Note that I changed the date/time logic from using datediff() to a direct comparison of the times. datediff() counts the number of "boundaries" between two times. So, there is 1 hour between 12:59 a.m. and 1:01 a.m. -- but zero hours between 1:01 a.m. and 1:59 a.m.
Although handling the diffs at the second level produces similar results, you can run into cases where you are working with seconds or milliseconds but the time spans are too long to fit into an integer, which causes overflow errors. It is just easier to work directly with the date/time values.
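The boundary-counting behaviour is easy to reproduce in a few lines of Python. This is only an illustration of the idea, not SQL Server's implementation; the function name and the calendar date are arbitrary assumptions.

```python
from datetime import datetime


def hour_boundaries(start, end):
    """Count hour boundaries crossed, the way DATEDIFF(HOUR, start, end) counts them."""
    def floor_hour(t):
        return t.replace(minute=0, second=0, microsecond=0)
    return int((floor_hour(end) - floor_hour(start)).total_seconds() // 3600)


a, b = datetime(2024, 1, 1, 0, 59), datetime(2024, 1, 1, 1, 1)   # 12:59 a.m. to 1:01 a.m.
c, d = datetime(2024, 1, 1, 1, 1), datetime(2024, 1, 1, 1, 59)   # 1:01 a.m. to 1:59 a.m.

print(hour_boundaries(a, b), (b - a).total_seconds() / 60)   # 1 boundary, 2.0 minutes elapsed
print(hour_boundaries(c, d), (d - c).total_seconds() / 60)   # 0 boundaries, 58.0 minutes elapsed
```

Comparing the timestamps directly, as the query above does with dateadd(minute, -30, ts), avoids this mismatch between boundaries and elapsed time.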

How to use BigQuery Analytic Functions to calculate time between timestamped rows?

I have a data set that represents analytics events like:
Row timestamp account_id type
1 2018-11-14 21:05:40 UTC abc start
2 2018-11-14 21:05:40 UTC xyz another_type
3 2018-11-26 22:01:19 UTC xyz start
4 2018-11-26 22:01:23 UTC abc start
5 2018-11-26 22:01:29 UTC xyz some_other_type
11 2018-11-26 22:13:58 UTC xyz start
...
With some number of account_ids. I need to find the average time between start records per account_id.
I'm trying to use analytic functions as described here. My end goal would be a table like:
Row account_id avg_time_between_events_mins
1 xyz 53
2 abc 47
3 pqr 65
...
my best attempt--based on this post--looks like this:
WITH
events AS (
SELECT
COUNTIF(type = 'start' AND account_id='abc') OVER (ORDER BY timestamp) as diff,
timestamp
FROM
`myproject.dataset.events`
WHERE
account_id='abc')
SELECT
min(timestamp) AS start_time,
max(timestamp) AS next_start_time,
ABS(timestamp_diff(min(timestamp), max(timestamp), MINUTE)) AS minutes_between
FROM
events
GROUP BY
diff
This calculates the time between each start event and the last non-start event prior to the next start event for a specific account_id.
I tried to use PARTITION and a WINDOW FRAME CLAUSE like this:
WITH
events AS (
SELECT
COUNT(*) OVER (PARTITION BY account_id ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) as diff,
timestamp
FROM
`myproject.dataset.events`
WHERE
type = 'start')
SELECT
min(timestamp) AS start_time,
max(timestamp) AS next_start_time,
ABS(timestamp_diff(min(timestamp), max(timestamp), MINUTE)) AS minutes_between
FROM
events
GROUP BY
diff
But I got a nonsense result table. Can anyone walk me through how I would write and reason about a query like this?
You don't really need analytic functions for this:
select account_id,
timestamp_diff(max(timestamp), min(timestamp), MINUTE) / nullif(count(*) - 1, 0) as avg_time_between_events_mins
from `myproject.dataset.events`
where type = 'start'
group by account_id;
This is the timestamp of the most recent minus the oldest, divided by one less than the number of starts. That is the average between the starts.
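The reason this shortcut works is that the consecutive gaps telescope: their sum is just the last start minus the first start. A short Python check on made-up timestamps (the values are hypothetical, not from the question's data) shows both computations agreeing:

```python
from datetime import datetime

# Hypothetical start events for one account.
starts = [
    datetime(2018, 11, 26, 10, 0),
    datetime(2018, 11, 26, 10, 30),
    datetime(2018, 11, 26, 11, 40),
    datetime(2018, 11, 26, 12, 0),
]

# Long way: average of the gaps between consecutive starts (what LAG/LEAD would give).
gaps = [(b - a).total_seconds() / 60 for a, b in zip(starts, starts[1:])]
mean_of_gaps = sum(gaps) / len(gaps)

# Shortcut: (latest - earliest) / (number of starts - 1).
shortcut = (max(starts) - min(starts)).total_seconds() / 60 / (len(starts) - 1)

print(gaps)                      # [30.0, 70.0, 20.0]
print(mean_of_gaps, shortcut)    # 40.0 40.0
```

With a single start there are no gaps, which is the case the nullif(count(*) - 1, 0) guard covers in the query.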

Google Big Query SQL - Get most recent unique value by date

#EDIT - Following the comments, I rephrase my question
I have a BigQuery table that i want to use to get some KPI of my application.
In this table, I save each create or update as a new line in order to keep a better history.
So I have several times the same data with a different state.
Example of the table :
uuid |status |date
––––––|–––––––––––|––––––––––
3 |'inactive' |2018-05-12
1 |'active' |2018-05-10
1 |'inactive' |2018-05-08
2 |'active' |2018-05-08
3 |'active' |2018-05-04
2 |'inactive' |2018-04-22
3 |'inactive' |2018-04-18
We can see that we have multiple value of each data.
What I would like to get:
I would like to have the number of current 'active' entry (So there must be no 'inactive' entry with the same uuid after). And to complicate everything, I need this total per day.
So for each day, the amount of 'active' entries, including those from previous days.
So with this example I should have this result :
date | actives
____________|_________
2018-05-02 | 0
2018-05-03 | 0
2018-05-04 | 1
2018-05-05 | 1
2018-05-06 | 1
2018-05-07 | 1
2018-05-08 | 2
2018-05-09 | 2
2018-05-10 | 3
2018-05-11 | 3
2018-05-12 | 2
I have managed to get the correct number of actives for a single day, but my problem is getting the results for every day.
What I've tried:
I'm stuck with two solutions that each return a different error.
First solution :
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT COUNT(uuid)
FROM (
SELECT
uuid, status, date,
RANK() OVER(PARTITION BY uuid ORDER BY date DESC) rank
FROM users
WHERE
PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d",date)) <= i_date
)
WHERE
status = 'active'
and rank = 1
## rank is the condition which causes the error
) users
FROM
dates, UNNEST(arr_dates) i_date
ORDER BY i_date;
The SELECT with the RANK() OVER correctly returns the users with a rank column that allow me to know which entry is the last for each uuid.
But when I try this, I get the following error because of the rank = 1 condition:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Second solution :
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT
COUNT(t1.uuid)
FROM
users t1
WHERE
t1.date = (
SELECT MAX(t2.date)
FROM users t2
WHERE
t2.uuid = t1.uuid
## Here that's the i_date condition which causes problem
AND PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d", t2.date)) <= i_date
)
AND status='active' ) users
FROM
dates,
UNNEST(arr_dates) i_date
ORDER BY i_date;
Here, the second select also works and correctly returns the number of active users for a given day.
But the problem arises when I try to use i_date to retrieve data across multiple days.
Here I get a "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join." error.
Which solution is more likely to succeed? What should I change?
And if my way of storing the data isn't good, how should I proceed in order to keep a precise history?
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, COUNT(DISTINCT uuid) total_active
FROM `project.dataset.table`
WHERE status = 'active'
GROUP BY date
-- ORDER BY date
Update to address your "rephrased" question :o)
Below example is using dummy data from your question
#standardSQL
WITH `project.dataset.users` AS (
SELECT 3 uuid, 'inactive' status, DATE '2018-05-12' date UNION ALL
SELECT 1, 'active', '2018-05-10' UNION ALL
SELECT 1, 'inactive', '2018-05-08' UNION ALL
SELECT 2, 'active', '2018-05-08' UNION ALL
SELECT 3, 'active', '2018-05-04' UNION ALL
SELECT 2, 'inactive', '2018-04-22' UNION ALL
SELECT 3, 'inactive', '2018-04-18'
), dates AS (
SELECT day FROM UNNEST((
SELECT GENERATE_DATE_ARRAY(MIN(date), MAX(date))
FROM `project.dataset.users`
)) day
), active_users AS (
SELECT uuid, status, date first, DATE_SUB(next_status.date, INTERVAL 1 DAY) last FROM (
SELECT uuid, date, status, LEAD(STRUCT(status, date)) OVER(PARTITION BY uuid ORDER BY date ) next_status
FROM `project.dataset.users` u
)
WHERE status = 'active'
)
SELECT day, COUNT(DISTINCT uuid) actives
FROM dates d JOIN active_users u
ON day BETWEEN first AND IFNULL(last, day)
GROUP BY day
-- ORDER BY day
with result
Row day actives
1 2018-05-04 1
2 2018-05-05 1
3 2018-05-06 1
4 2018-05-07 1
5 2018-05-08 2
6 2018-05-09 2
7 2018-05-10 3
8 2018-05-11 3
9 2018-05-12 2
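The same interval logic can be followed step by step in Python: turn each 'active' row into a [first, last] interval that ends the day before the next status row for that uuid, then count the uuids whose interval covers each day. This is a sketch of the idea on the question's sample data with hypothetical variable names, not a translation of how BigQuery executes the query.

```python
from datetime import date, timedelta
from itertools import groupby

events = [
    (3, "inactive", date(2018, 5, 12)),
    (1, "active", date(2018, 5, 10)),
    (1, "inactive", date(2018, 5, 8)),
    (2, "active", date(2018, 5, 8)),
    (3, "active", date(2018, 5, 4)),
    (2, "inactive", date(2018, 4, 22)),
    (3, "inactive", date(2018, 4, 18)),
]

# Like LEAD(...) OVER (PARTITION BY uuid ORDER BY date): pair each row with the next one.
intervals = []
ordered = sorted(events, key=lambda e: (e[0], e[2]))
for uuid, group in groupby(ordered, key=lambda e: e[0]):
    rows = list(group)
    for row, nxt in zip(rows, rows[1:] + [None]):
        if row[1] == "active":
            last = nxt[2] - timedelta(days=1) if nxt is not None else None
            intervals.append((uuid, row[2], last))   # last=None means still active

# Join every day in the data's range against the intervals (inner join: empty days drop out).
actives = {}
day = min(e[2] for e in events)
last_day = max(e[2] for e in events)
while day <= last_day:
    uuids = {u for u, first, last in intervals
             if first <= day <= (last if last is not None else day)}
    if uuids:
        actives[day] = len(uuids)
    day += timedelta(days=1)

for day, count in actives.items():
    print(day, count)   # 2018-05-04 through 2018-05-12, matching the result above
```

Days before 2018-05-04 do not appear because no interval covers them, which is the same effect the inner JOIN has in the query.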
I think this -- or something similar -- will do what you want:
SELECT day,
coalesce(running_actives, 0) - coalesce(running_inactives, 0)
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2015-05-11'), DATE('2018-06-29'), INTERVAL 1 DAY)
) AS day left join
(select date, sum(countif(status = 'active')) over (order by date) as running_actives,
sum(countif(status = 'inactive')) over (order by date) as running_inactives
from t
group by date
) a
on a.date = day
order by day;
The exact solution depends on whether the "inactive" is inclusive of the day (as above) or takes effect the next day. Either is handled the same way, by using cumulative sums of actives and inactives and then taking the difference.
In order to get data on all days, this generates the days using arrays and unnest(). If you have data on all days, that step may be unnecessary.