Derive session duration when only timestamp is available in SQL

I want to calculate the session duration for the usage of an app. However, in the provided log, the only relevant information I can obtain is the timestamp. Below is a simplified log for a single user.
record_num, user_id, record_ts
-----------------------------
1, uid_1, 12:01am
2, uid_1, 12:02am
3, uid_1, 12:03am
4, uid_1, 12:22am
5, uid_1, 12:22am
6, uid_1, 12:25am
Assuming a session is concluded after 15 minutes of inactivity, the above log would consist of 2 sessions (records 1-3 and records 4-6). And now I would like to calculate the average duration of the two sessions.
I can derive the number of sessions by first calculating the time difference between consecutive records; whenever a difference exceeds 15 minutes, a new session is counted.
But to derive the duration, I would need to know the min(record_ts) and max(record_ts) for each session. However, without a session_id of some sort, I cannot group the records into their associated sessions.
Is there any SQL-based approach to solve this?

Assuming you have the date too (without it, you would have to work out whether a session that appears to end before it starts actually crossed midnight), something like this would work:
WITH CTE AS (
  SELECT 1 record_num, "uid_1" user_id, TIMESTAMP('2018-10-01 12:01:00') record_ts
  UNION ALL
  SELECT 2, "uid_1", TIMESTAMP('2018-10-01 12:02:00')
  UNION ALL
  SELECT 3, "uid_1", TIMESTAMP('2018-10-01 12:03:00')
  UNION ALL
  SELECT 4, "uid_1", TIMESTAMP('2018-10-01 12:22:00')
  UNION ALL
  SELECT 5, "uid_1", TIMESTAMP('2018-10-01 12:22:00')
  UNION ALL
  SELECT 6, "uid_1", TIMESTAMP('2018-10-01 12:25:00')
  UNION ALL
  SELECT 7, "uid_1", TIMESTAMP('2018-10-01 12:59:00')
),
sessions AS (
  SELECT
    IF(TIMESTAMP_DIFF(record_ts,
                      LAG(record_ts, 1) OVER (PARTITION BY user_id ORDER BY record_ts, record_num),
                      MINUTE) >= 15
       OR LAG(record_ts, 1) OVER (PARTITION BY user_id ORDER BY record_ts, record_num) IS NULL,
       1, 0) AS session,
    record_num, user_id, record_ts
  FROM CTE
)
SELECT
  SUM(session) OVER (PARTITION BY user_id ORDER BY record_ts, record_num) AS sessionNo,
  record_num, user_id, record_ts
FROM sessions
The key is the number of minutes you want between sessions - in the case above I've put it at 15 minutes (>= 15). Obviously it might be useful to concatenate the session number with the user_id and a session start time to create a unique session identifier.
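For example, a sketch of such an identifier, assuming the output of the query above has been saved as numbered_sessions (a hypothetical name):
SELECT
  -- hypothetical: numbered_sessions holds the sessionNo, record_num, user_id, record_ts output from above
  CONCAT(user_id, '_', CAST(sessionNo AS STRING), '_',
         FORMAT_TIMESTAMP('%Y%m%d%H%M%S',
                          MIN(record_ts) OVER (PARTITION BY user_id, sessionNo))) AS session_id,
  record_num, user_id, record_ts
FROM numbered_sessions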

I would do this in the following steps:
Use lag() and some logic to determine when a session begins.
Use cumulative sum to assign sessions.
Then aggregation to get averages.
So, to get information on each session:
select user_id, session, min(record_ts) as session_start, max(record_ts) as session_end,
       timestamp_diff(max(record_ts), min(record_ts), second) as dur_seconds
from (select l.*,
             countif( record_ts > timestamp_add(prev_record_ts, interval 15 minute) )
                 over (partition by user_id order by record_ts) as session
      from (select l.*,
                   lag(record_ts, 1, record_ts) over (partition by user_id order by record_ts) as prev_record_ts
            from log l
           ) l
     ) l
group by user_id, session;
The average is one further step:
with s as (
      select user_id, session, min(record_ts) as session_start, max(record_ts) as session_end,
             timestamp_diff(max(record_ts), min(record_ts), second) as dur_seconds
      from (select l.*,
                   countif( record_ts > timestamp_add(prev_record_ts, interval 15 minute) )
                       over (partition by user_id order by record_ts) as session
            from (select l.*,
                         lag(record_ts, 1, record_ts) over (partition by user_id order by record_ts) as prev_record_ts
                  from log l
                 ) l
           ) l
      group by user_id, session
     )
select user_id, avg(dur_seconds) as avg_dur_seconds
from s
group by user_id;
For the sample log above, the two sessions last 120 and 180 seconds, so the average comes out to 150 seconds.

Related

Grouping rows by ID and timestamp into sessions using BigQuery

I have a dataset like the one below and I'm looking to add the last column to this data.
The logic behind a session is that all rows for a user_id are grouped into one session if they are within 5 days of the first event in that session.
In the example below, the user's first event is 2023-01-01, which kicks off the first session. Note that although there are fewer than 5 days between 2023-01-04 and 2023-01-06, the latter starts a new session, because the 5-day counter resets once it is reached.
user_id timestamp session
user_1 2023-01-01 session_1
user_1 2023-01-01 session_1
user_1 2023-01-04 session_1
user_1 2023-01-06 session_2
user_1 2023-01-16 session_3
user_1 2023-01-16 session_3
user_1 2023-01-17 session_3
My data contains several users. How do I efficiently add this session column in BigQuery?
This seems to be a kind of capped cumulative sum problem. If I understood your requirements correctly, you might consider the approach below.
I've answered a similar problem here, with some explanation of the cumsumbin user-defined function used below.
CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
  // Walk the day gaps in order, keeping a running total of days since the
  // current session's first event; a null gap (first row) counts as 0.
  // Once the total would exceed 4 (i.e. the event is 5+ days from the
  // session's first event), open a new bin and reset the running total.
  let bin = 1;
  a.reduce((c, v) => {
    if (c + Number(v) > 4) { bin += 1; return 0; }
    else return c += Number(v);
  }, 0);
  return bin;
""";
WITH sample_table AS (
SELECT 'user_1' user_id, DATE '2023-01-01' timestamp UNION ALL
SELECT 'user_1' user_id, '2023-01-01' timestamp UNION ALL
SELECT 'user_1' user_id, '2023-01-04' timestamp UNION ALL
SELECT 'user_1' user_id, '2023-01-06' timestamp UNION ALL
SELECT 'user_1' user_id, '2023-01-16' timestamp UNION ALL
SELECT 'user_1' user_id, '2023-01-16' timestamp UNION ALL
SELECT 'user_1' user_id, '2023-01-17' timestamp
)
SELECT * EXCEPT(diff),
       'session_' || CAST(cumsumbin(ARRAY_AGG(diff) OVER w1) AS STRING) AS session
FROM (
  SELECT *,
         DATE_DIFF(timestamp, LAG(timestamp) OVER w0, DAY) AS diff
  FROM sample_table
  WINDOW w0 AS (PARTITION BY user_id ORDER BY timestamp)
)
WINDOW w1 AS (PARTITION BY user_id ORDER BY timestamp);
Try the following:
with mydata as
(
select 'user_1' as user_id ,cast('2023-01-01' as date) as timestamp_
union all
select 'user_1' ,cast('2023-01-01' as date)
union all
select 'user_1' ,cast('2023-01-04' as date)
union all
select 'user_1' ,cast('2023-01-06' as date)
union all
select 'user_1' ,cast('2023-01-16' as date)
union all
select 'user_1' ,cast('2023-01-16' as date)
union all
select 'user_1' ,cast('2023-01-17' as date)
)
select user_id, timestamp_,
       'session_' || cast(dense_rank() over (partition by user_id order by div(df, 5)) as string) as session
from
(
select *,
date_diff(timestamp_, min(timestamp_) over (partition by user_id), day) df
from mydata
) T
order by user_id, timestamp_
The logic here is to find the date difference between each date and the minimum date for each user, then perform an integer division by 5 on that date diff to create groups for the dates.
The use of dense_rank is to remove gaps that may occur from the grouping; if it's not important to have gap-free session numbers, you could drop it and use div(df, 5) directly, as sketched below.
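A minimal sketch of that simpler variant, reusing the names above (session numbers may then skip values):
select user_id, timestamp_,
       'session_' || cast(div(df, 5) as string) as session
from
(
  select *,
         date_diff(timestamp_, min(timestamp_) over (partition by user_id), day) df
  from mydata
) T
order by user_id, timestamp_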

Active customers for each day who were active in last 30 days

I have a BQ table, user_events, that looks like the following:
event_date | user_id | event_type
Data is for millions of users, across different event dates.
I want to write a query that will give me, for every day, the list of users who were active in the last 30 days.
The query below gives me total unique users on only that day; I can't get it to cover the last 30 days for each date. Help is appreciated.
SELECT
user_id,
event_date
FROM
[TableA]
WHERE
1=1
AND user_id IS NOT NULL
AND event_date >= DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY')
GROUP BY
1,
2
ORDER BY
2 DESC
Below is for BigQuery Standard SQL and makes a few assumptions about your case:
there is only one row per date per user
a user is considered active in the last 30 days if the user has at least 5 (it can be any number - even just 1) entries/rows within those 30 days
If the above makes sense - see below
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM `yourTable`
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date
If assumption #1 above does not hold - you can simply add pre-grouping as a sub-select
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM (
SELECT user_id, event_date
FROM `yourTable`
GROUP BY user_id, event_date
)
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date
UPDATE
From the comments: if a user has any event with event_type IN ('view', 'conversion', 'productDetail', 'search'), they will be considered active. That means any kind of event triggered within the app.
So, you can go with below, I think
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM (
SELECT user_id, event_date
FROM `yourTable`
WHERE event_type IN ('view', 'conversion', 'productDetail', 'search')
GROUP BY user_id, event_date
)
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date

How to calculate Session and Session duration in Firebase Analytics raw data?

How to calculate Session Duration in Firebase analytics raw data which is linked to BigQuery?
I have used the following blog post to calculate users, using the FLATTEN command for the events which are nested within each record, but I would like to know how to calculate sessions and session duration by country and time.
(I have many apps configured, but if you could help me with the SQL query for calculating sessions and session duration, it would be of immense help.)
Google blog on using Firebase and BigQuery
First you need to define a session - in the following query I'm going to break a session whenever a user is inactive for more than 20 minutes.
Now, to find all sessions with SQL you can use a trick described at https://blog.modeanalytics.com/finding-user-sessions-sql/.
The following query finds all sessions and their lengths:
#standardSQL
SELECT app_instance_id, sess_id, MIN(min_time) sess_start, MAX(max_time) sess_end, COUNT(*) records, MAX(sess_id) OVER(PARTITION BY app_instance_id) total_sessions,
(ROUND((MAX(max_time)-MIN(min_time))/(1000*1000),1)) sess_length_seconds
FROM (
SELECT *, SUM(session_start) OVER(PARTITION BY app_instance_id ORDER BY min_time) sess_id
FROM (
SELECT *, IF(
previous IS null
OR (min_time-previous)>(20*60*1000*1000), # sessions broken by this inactivity
1, 0) session_start
#https://blog.modeanalytics.com/finding-user-sessions-sql/
FROM (
SELECT *, LAG(max_time, 1) OVER(PARTITION BY app_instance_id ORDER BY max_time) previous
FROM (
SELECT user_dim.app_info.app_instance_id
, (SELECT MIN(timestamp_micros) FROM UNNEST(event_dim)) min_time
, (SELECT MAX(timestamp_micros) FROM UNNEST(event_dim)) max_time
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160601`
)
)
)
)
GROUP BY 1, 2
ORDER BY 1, 2
With the new schema of Firebase in BigQuery, I found that the answer by @Maziar did not work for me, but I am not sure why.
Instead I have used the following to calculate it, where a session is defined as a user engaging with your app for a minimum of 10 seconds and where the session stops if a user doesn't engage with the app for 30 minutes.
It provides total number of sessions and the session length in minutes, and it is based on this query: https://modeanalytics.com/modeanalytics/reports/5e7d902f82de/queries/2cf4af47dba4
SELECT COUNT(*) AS sessions,
AVG(length) AS average_session_length
FROM (
SELECT global_session_id,
(MAX(event_timestamp) - MIN(event_timestamp))/(60 * 1000 * 1000) AS length
FROM (
SELECT user_pseudo_id,
event_timestamp,
SUM(is_new_session) OVER (ORDER BY user_pseudo_id, event_timestamp) AS global_session_id,
SUM(is_new_session) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS user_session_id
FROM (
SELECT *,
CASE WHEN event_timestamp - last_event >= (30*60*1000*1000)
OR last_event IS NULL
THEN 1 ELSE 0 END AS is_new_session
FROM (
SELECT user_pseudo_id,
event_timestamp,
LAG(event_timestamp,1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS last_event
FROM `dataset.events_2019*`
) last
) final
) session
GROUP BY 1
) agg
WHERE length >= (10/60)  -- keep only sessions of at least 10 seconds (length is in minutes)
As you know, Google has changed the schema of BigQuery Firebase databases:
https://support.google.com/analytics/answer/7029846
Thanks to @Felipe's answer, here is an updated version for the new format:
SELECT SUM(total_sessions) AS Total_Sessions, AVG(sess_length_seconds) AS Average_Session_Duration
FROM (
SELECT user_pseudo_id, sess_id, MIN(min_time) sess_start, MAX(max_time) sess_end, COUNT(*) records,
MAX(sess_id) OVER(PARTITION BY user_pseudo_id) total_sessions,
(ROUND((MAX(max_time)-MIN(min_time))/(1000*1000),1)) sess_length_seconds
FROM (
SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY min_time) sess_id
FROM (
SELECT *, IF(previous IS null OR (min_time-previous) > (20*60*1000*1000), 1, 0) session_start
FROM (
SELECT *, LAG(max_time, 1) OVER(PARTITION BY user_pseudo_id ORDER BY max_time) previous
FROM (SELECT user_pseudo_id, MIN(event_timestamp) AS min_time, MAX(event_timestamp) AS max_time
FROM `dataset_name.table_name` GROUP BY user_pseudo_id)
)
)
)
GROUP BY 1, 2
ORDER BY 1, 2
)
Note: change dataset_name and table_name based on your project info
With the recent change in which we have ga_session_id with each event row in the BigQuery table, you can calculate the number of sessions and the average session length much more easily.
The value of ga_session_id remains the same for the whole session, so you don't need to derive the session yourself.
You take the min and max values of the event_timestamp column, grouping by user_pseudo_id, ga_session_id and event_date, so that you get the session duration of each session of any user on any given date.
WITH
UserSessions AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    event_date,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = "ga_session_id") AS session_id,
    event_name
  FROM `projectname.dataset_name.events_*`
),
SessionDuration AS (
  SELECT
    user_pseudo_id,
    session_id,
    COUNT(*) AS events,
    TIMESTAMP_DIFF(MAX(TIMESTAMP_MICROS(event_timestamp)), MIN(TIMESTAMP_MICROS(event_timestamp)), SECOND) AS session_duration,
    event_date
  FROM UserSessions
  WHERE session_id IS NOT NULL
  GROUP BY
    user_pseudo_id,
    session_id,
    event_date
)
SELECT COUNT(session_id) AS NumofSessions, AVG(session_duration) AS AverageSessionLength
FROM SessionDuration
Finally, you just count the session_id values to get the total number of sessions, and average the session durations to get the average session length.

Conditional incrementing in BigQuery

I have a data table like this:
user_id event_time
1 1456812346
1 1456812350
1 1456812446
1 1456812950
1 1456812960
Now, I am trying to define a 'session_id' for the user based on the event_time. If an event comes after a lag of more than 180 seconds, it is considered to be from a new session. So, I would like an output similar to:
user_id event_time session_id
1 1456812346 1
1 1456812350 1
1 1456812446 1
1 1456812950 2
1 1456812960 2
The session is incremented at the 4th row, as its time is 504 seconds after the 3rd row's and thus exceeds the threshold of 180 seconds.
In MySQL, I could just declare a variable and then increment it conditionally (see the sketch below). As variable creation is not supported in BigQuery, is there an alternate way to achieve this?
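For reference, the classic MySQL user-variable pattern alluded to looks roughly like this (a sketch, pre-8.0 style; the evaluation order of user variables within a SELECT is technically undefined):
-- initialize the session counter and the previous event time
SET @sid := 0, @last := NULL;
SELECT user_id, event_time,
       -- start a new session on the first row or after a gap of more than 180 seconds
       @sid := IF(@last IS NULL OR event_time - @last > 180, @sid + 1, @sid) AS session_id,
       @last := event_time AS prev_time
FROM YourTable
ORDER BY user_id, event_time;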
SELECT
user_id, event_time, session_id
FROM (
SELECT
user_id, event_time, event_time - last_time > 180 AS new_session,
SUM(IFNULL(new_session, 1))
OVER(PARTITION BY user_id ORDER BY event_time) AS session_id
FROM (
SELECT user_id, event_time,
LAG(event_time) OVER(PARTITION BY user_id ORDER BY event_time) AS last_time
FROM YourTable
)
)
ORDER BY event_time

Google BigQuery: Rolling Count Distinct

I have a table that is simply a list of dates and user IDs (not aggregated).
We define a metric called active users for a given date by counting the distinct number of IDs that appear in the previous 45 days.
I am trying to run a query in BigQuery that, for each day, returns the day plus the number of active users for that day (count distinct user from 45 days ago until today).
I have experimented with window functions, but can't figure out how to define a range based on the date values in a column. I believe the following query would work in a database like MySQL, but it does not work in BigQuery.
SELECT
day,
(SELECT
COUNT(DISTINCT visid)
FROM daily_users
WHERE day BETWEEN DATE_ADD(t.day, -45, "DAY") AND t.day
) AS active_users
FROM daily_users AS t
GROUP BY 1
This doesn't work in BigQuery: "Subselect not allowed in SELECT clause."
How to do this in BigQuery?
BigQuery documentation claims that count(distinct) works as a window function. However, that doesn't help you, because you are not looking for a traditional window frame.
One method adds a record for each of the 45 days following each visit:
select theday, count(distinct visid)
from (select date_add(u.day, n.n, "day") as theday, u.visid
from daily_users u cross join
(select 1 as n union all select 2 union all . . .
select 45
) n
) u
group by theday;
Note: there may be simpler ways to generate a series of 45 integers in BigQuery.
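One such way, in standard SQL, is GENERATE_ARRAY; a minimal sketch of the same approach (assuming day is a DATE column):
select theday, count(distinct visid) as active_users
from (
  -- n = 0 includes the visit day itself, matching the BETWEEN in the question
  select date_add(u.day, interval n day) as theday, u.visid
  from daily_users u
  cross join unnest(generate_array(0, 45)) as n
)
group by theday;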
Below should work with BigQuery
#legacySQL
SELECT day, active_users FROM (
SELECT
day,
COUNT(DISTINCT id)
OVER (ORDER BY ts RANGE BETWEEN 45*24*3600 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, TIMESTAMP_TO_SEC(TIMESTAMP(day)) AS ts
FROM daily_users
)
) GROUP BY 1, 2 ORDER BY 1
The above assumes that the day field is represented in '2016-01-10' format.
If that is not the case, you should adjust TIMESTAMP_TO_SEC(TIMESTAMP(day)) in the innermost select.
Also please take a look at COUNT(DISTINCT) specifics in BigQuery.
Update for BigQuery Standard SQL
#standardSQL
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 3888000 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1
You can test / play with it using the dummy sample below
#standardSQL
WITH daily_users AS (
SELECT 1 AS id, '2016-01-10' AS day UNION ALL
SELECT 2 AS id, '2016-01-10' AS day UNION ALL
SELECT 1 AS id, '2016-01-11' AS day UNION ALL
SELECT 3 AS id, '2016-01-11' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-13' AS day
)
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1