How to calculate Sessions and Session Duration in Firebase Analytics raw data? - sql

How to calculate Session Duration in Firebase analytics raw data which is linked to BigQuery?
I have used the following blog post to calculate users, flattening the events that are nested within each record, but I would like to know how to proceed with calculating sessions and session duration by country and time.
(I have many apps configured, but if you could help me with the SQL query for calculating sessions and session duration, it would be of immense help.)
Google Blog on using Firebase and BigQuery

First you need to define a session - in the following query I'm going to break a session whenever a user is inactive for more than 20 minutes.
Now, to find all sessions with SQL, you can use a trick described at https://blog.modeanalytics.com/finding-user-sessions-sql/.
The following query finds all sessions and their lengths:
#standardSQL
SELECT app_instance_id, sess_id, MIN(min_time) sess_start, MAX(max_time) sess_end, COUNT(*) records,
       MAX(sess_id) OVER(PARTITION BY app_instance_id) total_sessions,
       (ROUND((MAX(max_time)-MIN(min_time))/(1000*1000),1)) sess_length_seconds
FROM (
  SELECT *, SUM(session_start) OVER(PARTITION BY app_instance_id ORDER BY min_time) sess_id
  FROM (
    SELECT *, IF(
      previous IS NULL
      OR (min_time-previous) > (20*60*1000*1000),  # sessions are broken by this inactivity
      1, 0) session_start
    # https://blog.modeanalytics.com/finding-user-sessions-sql/
    FROM (
      SELECT *, LAG(max_time, 1) OVER(PARTITION BY app_instance_id ORDER BY max_time) previous
      FROM (
        SELECT user_dim.app_info.app_instance_id
          , (SELECT MIN(timestamp_micros) FROM UNNEST(event_dim)) min_time
          , (SELECT MAX(timestamp_micros) FROM UNNEST(event_dim)) max_time
        FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160601`
      )
    )
  )
)
GROUP BY 1, 2
ORDER BY 1, 2
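The question also asks for a breakdown by country. A minimal sketch of the innermost select, assuming the old-schema field user_dim.geo_info.country (check your export for the exact path), carries the country alongside the timestamps:
-- Sketch only: country from the old Firebase schema, next to the
-- per-record min/max timestamps used above.
SELECT user_dim.app_info.app_instance_id
  , user_dim.geo_info.country AS country  -- assumed field location
  , (SELECT MIN(timestamp_micros) FROM UNNEST(event_dim)) min_time
  , (SELECT MAX(timestamp_micros) FROM UNNEST(event_dim)) max_time
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160601`
country can then be propagated through the outer selects and added to the final GROUP BY; the same pattern works for a date column to break sessions down by time.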

With the new schema of Firebase in BigQuery, I found that the answer by @Maziar did not work for me, but I am not sure why.
Instead I have used the following to calculate it, where a session is defined as a user engaging with your app for a minimum of 10 seconds and where the session stops if a user doesn't engage with the app for 30 minutes.
It provides total number of sessions and the session length in minutes, and it is based on this query: https://modeanalytics.com/modeanalytics/reports/5e7d902f82de/queries/2cf4af47dba4
SELECT COUNT(*) AS sessions,
       AVG(length) AS average_session_length
FROM (
  SELECT global_session_id,
         (MAX(event_timestamp) - MIN(event_timestamp)) / (60 * 1000 * 1000) AS length
  FROM (
    SELECT user_pseudo_id,
           event_timestamp,
           SUM(is_new_session) OVER (ORDER BY user_pseudo_id, event_timestamp) AS global_session_id,
           SUM(is_new_session) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS user_session_id
    FROM (
      SELECT *,
             CASE WHEN event_timestamp - last_event >= (30*60*1000*1000)
                    OR last_event IS NULL
                  THEN 1 ELSE 0 END AS is_new_session
      FROM (
        SELECT user_pseudo_id,
               event_timestamp,
               LAG(event_timestamp,1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS last_event
        FROM `dataset.events_2019*`
      ) last
    ) final
  ) session
  GROUP BY 1
) agg
WHERE length >= (10/60)

As you know, Google has changed the schema of the Firebase export to BigQuery:
https://support.google.com/analytics/answer/7029846
Thanks to @Felipe's answer, the query adapted to the new format is as follows:
SELECT COUNT(*) AS Total_Sessions,  -- one row per session after the inner GROUP BY
       AVG(sess_length_seconds) AS Average_Session_Duration
FROM (
  SELECT user_pseudo_id, sess_id, MIN(min_time) sess_start, MAX(max_time) sess_end, COUNT(*) records,
         MAX(sess_id) OVER(PARTITION BY user_pseudo_id) total_sessions,
         (ROUND((MAX(max_time)-MIN(min_time))/(1000*1000),1)) sess_length_seconds
  FROM (
    SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY min_time) sess_id
    FROM (
      SELECT *, IF(previous IS NULL OR (min_time-previous) > (20*60*1000*1000), 1, 0) session_start
      FROM (
        SELECT *, LAG(max_time, 1) OVER(PARTITION BY user_pseudo_id ORDER BY max_time) previous
        FROM (
          -- each event is its own row, so the LAG above can detect the 20-minute gaps
          SELECT user_pseudo_id, event_timestamp AS min_time, event_timestamp AS max_time
          FROM `dataset_name.table_name`
        )
      )
    )
  )
  GROUP BY 1, 2
  ORDER BY 1, 2
)
Note: change dataset_name and table_name based on your project info
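To run the same calculation over a range of daily export tables rather than a single one, a small sketch of the innermost select using a wildcard table (names and dates are placeholders) could be:
-- Sketch only: wildcard over daily event tables, limited by _TABLE_SUFFIX.
SELECT user_pseudo_id, event_timestamp AS min_time, event_timestamp AS max_time
FROM `dataset_name.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20190101' AND '20190131'  -- placeholder date range
The rest of the query stays unchanged.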

With the recent change that puts a ga_session_id on each event row in the BigQuery export, you can calculate the number of sessions and the average session length much more easily.
The value of ga_session_id stays the same for the whole session, so you don't need to define the session yourself.
You take the MIN and MAX of the event_timestamp column, grouping by user_pseudo_id, ga_session_id, and event_date, which gives you the duration of each particular session for any user on any given date.
WITH
UserSessions AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    event_date,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = "ga_session_id") AS session_id,
    event_name
  FROM `projectname.dataset_name.events_*`
),
SessionDuration AS (
  SELECT
    user_pseudo_id,
    session_id,
    COUNT(*) AS events,
    TIMESTAMP_DIFF(MAX(TIMESTAMP_MICROS(event_timestamp)), MIN(TIMESTAMP_MICROS(event_timestamp)), SECOND) AS session_duration,
    event_date
  FROM UserSessions
  WHERE session_id IS NOT NULL
  GROUP BY
    user_pseudo_id,
    session_id,
    event_date
)
SELECT
  COUNT(session_id) AS NumofSessions,
  AVG(session_duration) AS AverageSessionLength
FROM SessionDuration
Finally, you count the session_id values to get the total number of sessions and average the session durations to get the average session length.


Difference between last and second last event in a table of events

I have the following table, which was created by
create table events (
event_type integer not null,
value integer not null,
time timestamp not null,
unique (event_type, time)
);
Given the data in the pic, I want to write a query that, for each event_type that has been registered more than once, returns the difference between the latest and the second-latest value.
Given the above data, the output should look like:
event_type | value
2          | -5
3          | 4
I solved it using the following:
CREATE VIEW [max_date] AS
SELECT event_type, max(time) as time, value
FROM events
GROUP BY event_type
HAVING count(event_type) > 1
ORDER BY time DESC;

SELECT event_type, value
FROM (
  SELECT event_type, value, max(time)
  FROM (
    SELECT E1.event_type, ([max_date].value - E1.value) as value, E1.time
    FROM events E1, [max_date]
    WHERE [max_date].event_type = E1.event_type
      AND [max_date].time > E1.time
  )
  GROUP BY event_type
)
but this seems like a very complicated query and I wonder if there is an easier way?
Use window functions:
select e.*,
(value - prev_value)
from (select e.*,
lag(value) over (partition by event_type order by time) as prev_value,
row_number() over (partition by event_type order by time desc) as seqnum
from events e
) e
where seqnum = 1 and prev_value is not null;
You could use lag() and row_number():
select event_type, val
from (
select
event_type,
value - lag(value) over(partition by event_type order by time desc) val,
row_number() over(partition by event_type order by time desc) rn
from events
) t
where rn = 1 and val is not null
The inner query ranks records having the same event_type by descending time, and computes the difference between each value and the previous one.
Then, the outer query just filters on the top record per group.
Here is a way to do this using a combination of analytic functions and aggregation. This approach is friendly in the event that your database does not support LEAD and LAG.
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY event_type ORDER BY time DESC) AS rn
FROM events
)
SELECT
event_type,
MAX(CASE WHEN rn = 1 THEN value END) - MAX(CASE WHEN rn = 2 THEN value END) AS value
FROM cte
GROUP BY
event_type
HAVING
COUNT(*) > 1;

Bigquery resources exceeded during query execution - optimization

I have got a problem with this query.
SELECT event_date, country, COUNT(*) AS sessions,
       AVG(length) AS average_session_length
FROM (
  SELECT country, event_date, global_session_id,
         (MAX(event_timestamp) - MIN(event_timestamp)) / (60 * 1000 * 1000) AS length
  FROM (
    SELECT user_pseudo_id,
           event_timestamp,
           country,
           event_date,
           SUM(is_new_session) OVER (ORDER BY user_pseudo_id, event_timestamp) AS global_session_id,
           SUM(is_new_session) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS user_session_id
    FROM (
      SELECT *,
             CASE WHEN event_timestamp - last_event >= (30*60*1000*1000)
                    OR last_event IS NULL
                  THEN 1 ELSE 0 END AS is_new_session
      FROM (
        SELECT user_pseudo_id,
               event_timestamp,
               geo.country,
               event_date,
               LAG(event_timestamp,1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS last_event
        FROM `xxx.events*`
      ) last
    ) final
  ) session
  GROUP BY global_session_id, country, event_date
) agg
WHERE length >= (10/60)
GROUP BY country, event_date
Google Cloud Console gives this error:
Resources exceeded during query execution: The query could not be executed in the allotted memory.
I know that it is probably a problem with the OVER clauses, but I have no idea how to edit the query and still get the same results.
I would be thankful for some help.
Thank you guys!
If I had to guess, it is this line:
SUM(is_new_session) OVER (ORDER BY user_pseudo_id, event_timestamp) AS global_session_id,
I would recommend changing the code so the "global" session id is really local to each user:
SUM(is_new_session) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS global_session_id,
If this adjusted query basically works, then the resource problem is fixed. The next step is to figure out how to get the global id that you want. The simplest solution is to use a local id for each user.
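If you still need a unique identifier per session after this change, a minimal sketch (here `session` stands for the subquery of that name in the query above, so this is illustrative rather than runnable on its own) is to combine the user id with the per-user counter instead of sorting the whole table in one window:
-- Sketch: a composite session key avoids the memory-heavy global ORDER BY.
SELECT country,
       event_date,
       CONCAT(user_pseudo_id, '-', CAST(global_session_id AS STRING)) AS session_key,
       (MAX(event_timestamp) - MIN(event_timestamp)) / (60 * 1000 * 1000) AS length
FROM session  -- placeholder for the subquery aliased `session` above
GROUP BY country, event_date, session_key
Grouping by session_key then replaces grouping by global_session_id in the outer aggregation.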

Where can I find Average Session Duration in Firebase Analytics, and how to extract this metric through BigQuery

Where can I find the Avg. Session Duration metric in Firebase Analytics?
How can I extract Avg. Session Duration data from BigQuery?
The Avg. Session Duration metric was previously available in the Firebase Analytics dashboard, but it is no longer there; now we only see "Engagement per User". Are Engagement per User and Avg. Session Duration the same? How can I extract Avg. Session Duration from Firebase Analytics, and how do I query BigQuery to get it?
Engagement per User is not the same as Avg. Session Duration. Engagement per User is all the time a user spends in the app in a day, not in a session.
You can find Avg. Session Duration in Firebase Analytics under Latest Release.
Here is a query for calculating avg. session length in BigQuery:
with timeline as
(
select
user_pseudo_id
, event_timestamp
, lag(event_timestamp, 1) over (partition by user_pseudo_id order by event_timestamp) as prev_event_timestamp
from
`YYYYY.analytics_XXXXX.events_*`
where
-- a sliding window: how many days in the past we are looking at
_table_suffix
between format_date("%Y%m%d", date_sub(current_date, interval 10 day))
and format_date("%Y%m%d", date_sub(current_date, interval 1 day))
)
, session_timeline as
(
select
user_pseudo_id
, event_timestamp
, case
when
-- half an hour of inactivity - the threshold for a new 'session'
event_timestamp - prev_event_timestamp >= (30*60*1000*1000)
or
prev_event_timestamp is null
then 1
else 0
end as is_new_session_flag
from
timeline
)
, marked_sessions as
(
select
user_pseudo_id
, event_timestamp
, sum(is_new_session_flag) over (partition by user_pseudo_id order by event_timestamp) AS user_session_id
from session_timeline
)
, measured_sessions as
(
select
user_pseudo_id
, user_session_id
-- session duration in seconds with 2 digits after the point
, round((max(event_timestamp) - min(event_timestamp))/ (1000 * 1000), 2) as session_duration
from
marked_sessions
group by
user_pseudo_id
, user_session_id
having
-- let's count only sessions longer than 10 seconds
session_duration >= 10
)
select
count(1) as number_of_sessions
, round(avg(session_duration), 2) as average_session_duration_in_sec
from
measured_sessions
For your additional question on how to get event_date and app_info.id, see the following query:
with timeline as
(
  select
    event_date, app_info.id, user_pseudo_id
    , event_timestamp
    , lag(event_timestamp, 1) over (partition by user_pseudo_id order by event_timestamp) as prev_event_timestamp
  from
    `<table>_*`
  where
    -- a sliding window: how many days in the past we are looking at
    _table_suffix
      between format_date("%Y%m%d", date_sub(current_date, interval 10 day))
      and format_date("%Y%m%d", date_sub(current_date, interval 1 day))
)
, session_timeline as
(
  select
    event_date, id, user_pseudo_id
    , event_timestamp
    , case
        when
          -- half an hour of inactivity - the threshold for a new 'session'
          event_timestamp - prev_event_timestamp >= (30*60*1000*1000)
          or prev_event_timestamp is null
        then 1
        else 0
      end as is_new_session_flag
  from
    timeline
)
, marked_sessions as
(
  select
    event_date, id, user_pseudo_id
    , event_timestamp
    , sum(is_new_session_flag) over (partition by user_pseudo_id order by event_timestamp) as user_session_id
  from session_timeline
)
, measured_sessions as
(
  select
    event_date, id, user_pseudo_id
    , user_session_id
    -- session duration in seconds with 2 digits after the point
    , round((max(event_timestamp) - min(event_timestamp)) / (1000 * 1000), 2) as session_duration
  from
    marked_sessions
  group by
    event_date, id, user_pseudo_id
    , user_session_id
  having
    -- let's count only sessions longer than 10 seconds
    session_duration >= 10
)
select
  event_date, id, count(1) as number_of_sessions
  , round(avg(session_duration), 2) as average_session_duration_in_sec
from
  measured_sessions
group by event_date, id
Every session (as defined here since December 2019: https://firebase.googleblog.com/2018/12/new-changes-sessions-user-engagement.html) has a session_id, besides other parameters. I think the safest and most robust way to calculate average session duration is to extract the data to BigQuery and then calculate the average difference between the first and last timestamp of each session. You need to flatten the array of event_params for this. For example, this is how it would be done in AWS Athena:
WITH arrays_flattened AS (
  SELECT params.key AS key,
         params.value.int_value AS id,
         event_timestamp,
         event_date
  FROM your_database
  CROSS JOIN UNNEST(event_params) AS t(params)
  WHERE params.key = 'ga_session_id'
),
duration AS (
  SELECT MAX(event_timestamp) - MIN(event_timestamp) AS duration
  FROM arrays_flattened
  WHERE key = 'ga_session_id'
  GROUP BY id
)
SELECT AVG(duration)
FROM duration
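Since the answer recommends extracting the data to BigQuery, a rough BigQuery Standard SQL equivalent of the Athena query above (a sketch; the table name is a placeholder) would be:
-- Sketch: same logic as the Athena query, using BigQuery's UNNEST.
WITH sessions AS (
  SELECT
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
    event_timestamp
  FROM `your_project.analytics_XXXXX.events_*`  -- placeholder table
),
duration AS (
  SELECT MAX(event_timestamp) - MIN(event_timestamp) AS duration_micros
  FROM sessions
  WHERE session_id IS NOT NULL
  GROUP BY session_id
)
SELECT AVG(duration_micros) / (1000 * 1000) AS avg_session_duration_seconds
FROM duration
Note that grouping by session_id alone assumes the ids do not collide across users; to be safe, you can group by user_pseudo_id as well.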

How can I find continuously events groups using BigQuery?

I'm using Firebase Analytics with BigQuery. Assume I need to give a voucher to users who share a service every day for at least 7 consecutive days. If someone shares continuously for 2 weeks, they get 2 vouchers, and so on.
How can I find the segments of consecutive share events logged in Firebase Analytics?
Here is the query that finds the individual days on which users shared, but I can't recognize the consecutive segments.
SELECT event.user_id, event.event_date,
MAX((SELECT p.value FROM UNNEST(user_properties) p WHERE p.key='name').string_value) as name,
MAX((SELECT p.value FROM UNNEST(user_properties) p WHERE p.key='email').string_value ) as email,
SUM((SELECT event_params.value.int_value from event.event_params where event_params.key = 'share_session_length')) as total_share_session_length
FROM `myProject.analytics_183565123.*` as event
where event_name like 'share_end'
group by user_id,event_date
having total_share_session_length >= 1
order by user_id desc
Below is an example for BigQuery Standard SQL - hopefully you can adapt the approach to your specific use case.
#standardSQL
SELECT id, ARRAY_AGG(STRUCT(first_day, days) ORDER BY grp) continuous_groups
FROM (
  SELECT id, grp, MIN(day) first_day, MAX(day) last_day, COUNT(1) days
  FROM (
    SELECT id, day,
      COUNTIF(gap != 1) OVER(PARTITION BY id ORDER BY day) grp
    FROM (
      SELECT id, day,
        DATE_DIFF(day, LAG(day) OVER(PARTITION BY id ORDER BY day), DAY) gap
      FROM (
        SELECT DISTINCT fullVisitorId id, PARSE_DATE('%Y%m%d', t.date) day
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` t
      )
    )
  )
  GROUP BY id, grp
  HAVING days >= 7
)
GROUP BY id
ORDER BY ARRAY_LENGTH(continuous_groups) DESC
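To turn these runs into voucher counts (one voucher per full 7-day stretch, as described in the question), a small sketch on top of the same aggregation could be:
-- Sketch: continuous_runs is a hypothetical CTE holding the id, grp, days
-- rows produced by the GROUP BY id, grp step of the query above.
SELECT id, SUM(DIV(days, 7)) AS vouchers
FROM continuous_runs
GROUP BY id
DIV(days, 7) yields 1 voucher for a 7-13 day run, 2 for a 14-20 day run, and so on.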

Active customers for each day who were active in last 30 days

I have a BQ table, user_events, that looks like the following:
event_date | user_id | event_type
Data covers millions of users, across different event dates.
I want to write a query that will give me, for every day, the list of users who were active in the last 30 days.
The query below gives me the unique users for that day only; I can't get it to cover the trailing 30 days for each date. Help is appreciated.
SELECT
user_id,
event_date
FROM
[TableA]
WHERE
1=1
AND user_id IS NOT NULL
AND event_date >= DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY')
GROUP BY
1,
2
ORDER BY
2 DESC
Below is for BigQuery Standard SQL and makes a few assumptions about your case:
there is only one row per date per user
a user is considered active in the last 30 days if they have at least 5 (this can be any number - even just 1) entries/rows within those 30 days
If the above makes sense - see below
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM `yourTable`
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date
If assumption #1 above is not correct, you can simply add pre-grouping as a sub-select:
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM (
SELECT user_id, event_date
FROM `yourTable`
GROUP BY user_id, event_date
)
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date
UPDATE
From the comments: if a user has any event with event_type IN ('view', 'conversion', 'productDetail', 'search'), they will be considered active - that means any kind of event triggered within the app.
So, you can go with the below, I think:
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM (
SELECT user_id, event_date
FROM `yourTable`
WHERE event_type IN ('view', 'conversion', 'productDetail', 'search')
GROUP BY user_id, event_date
)
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date
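If what you ultimately want per day is a count rather than a user list, a short sketch wrapping the query above (here given the hypothetical name active_users) would be:
-- Sketch: daily count of users active in the trailing 30 days.
SELECT event_date, COUNT(DISTINCT user_id) AS active_user_count
FROM active_users  -- placeholder for the full query above as a subquery or CTE
GROUP BY event_date
ORDER BY event_date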