SQL query to get session duration from Firebase events - sql

I need to know the duration of each of my users' sessions. To do that I use BigQuery, and with the query below I try to get the time. For context: the ga_session_id param propagates to every event in a session, so I want to subtract the timestamp of the session_start event (the start of the session) from the timestamp of the last event carrying that ga_session_id, for each ga_session_id.
WITH grps AS (
SELECT event_timestamp, event_name,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "ga_session_id") AS sessionid,
COUNTIF(event_name = 'session_start') OVER (ORDER BY event_timestamp) as grp
FROM `nodal-descent-XXXXX.analytics_XXXXXX.events_intraday_*`
)
SELECT min(event_timestamp), max(event_timestamp),
timestamp_diff(timestamp_micros(max(event_timestamp)),
timestamp_micros(min(event_timestamp)), second) as se
FROM grps
Can anyone help me complete the query so that it does this for each ga_session_id?

If I understand your question, you are looking to add the session id to your current query. If so, try the following:
select
  ep.value.int_value as ga_session_id
  , min(event_timestamp) as min_ses
  , max(event_timestamp) as max_ses
  , timestamp_diff(timestamp_micros(max(event_timestamp)), timestamp_micros(min(event_timestamp)), second) as se
from `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`,
  UNNEST(event_params) ep
where ep.key = 'ga_session_id'
group by ga_session_id
order by ga_session_id
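One caveat worth noting: ga_session_id is unique per user, not globally (it is derived from the time the session started), so two users can share the same value. If you need one row per actual session, group by the user as well - a minimal sketch of that variant:
select
  user_pseudo_id
  , ep.value.int_value as ga_session_id
  , timestamp_diff(timestamp_micros(max(event_timestamp)), timestamp_micros(min(event_timestamp)), second) as se
from `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`,
  UNNEST(event_params) ep
where ep.key = 'ga_session_id'
group by user_pseudo_id, ga_session_id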

Related

Double counting problem in Rolling weekly / monthly active endpoints

Here is my current code to calculate DAE, WAE, and MAE:
select event_timestamp as day, Section, users as DAE,
SUM(users) OVER (PARTITION BY Section
ORDER BY event_timestamp
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as WAE,
SUM(users) OVER (PARTITION BY Section
ORDER BY event_timestamp
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) as MAE
from (
  select count(distinct user_pseudo_id) as users, Section, event_timestamp
  from (
    select distinct *
    from (
      select *,
        CASE
          WHEN param_value = "Names" or param_value = "SingleName" THEN 'Names'
          ELSE param_value
        END AS Section
      from (
        select user_pseudo_id,
          DATE_TRUNC(EXTRACT(DATE from TIMESTAMP_MICROS(event_timestamp)), DAY) as event_timestamp,
          event_name,
          params.value.string_value as param_value
        from `rayn-deen-app.analytics_317927526.events_*`, unnest(event_params) as params
        where (event_name = 'screen_view' and params.key = 'firebase_screen' and (
          # Promises
          params.value.string_value = "Promises"
          # Favourites
          or params.value.string_value = "Favourites"
        ))
        group by user_pseudo_id, event_timestamp, event_name, param_value
        order by event_timestamp, user_pseudo_id
      ) raw
    ) base
    order by event_timestamp
  ) as events_table
  group by Section, event_timestamp
)
The problem is that the WAE and MAE counts repeat the same users. For example, if user A was a "daily active user" on 4 days in a week, the WAE count treats that as 4 users instead of one. I need to remove these repeat counts somehow.
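A common fix is to count distinct users over a trailing date range with a self-join, since BigQuery's analytic functions don't support COUNT(DISTINCT ...) over a moving window frame. A sketch, assuming the per-day, per-user rows above are first materialized as daily_users(event_day, Section, user_pseudo_id) (a hypothetical table name):
select d.event_day, d.Section,
  count(distinct u.user_pseudo_id) as WAE
from (select distinct event_day, Section from daily_users) d
join daily_users u
  on u.Section = d.Section
  and u.event_day between date_sub(d.event_day, interval 6 day) and d.event_day
group by d.event_day, d.Section
The same pattern with interval 29 day gives MAE.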

How to generate session_id in SQL?

My tracking system does not generate session IDs.
I have user_id & event_date_time.
I need a new session_id for each user's session, where a new session starts 30 minutes or more after the user's last event_date_time.
My final goal is to calculate the median session time.
I tried to assign session_id=1 and session_id=2 whenever event_date_time - next_event_time > 30 and guid = guid, but I'm stuck from here:
select a.*,
case when (a.next_event_date-a.event_date)*24*60<30 and userID=next_userID
then 1
when (a.next_event_date-a.event_date)*24*60>=30 and userID=next_userID then
2
end session_id
from
(select f.userID,
lead(f.userID) over (partition by f.guid order by f.event_date)
next_userID,
f.event_date,
lead(f.event_date) over (partition by f.guid order by f.event_date)
next_event_date
from event_table f
)a
where next_event_date is not null
If I understood correctly, you could generate IDs this way:
select id, guid, event_date,
sum(chg) over (partition by guid order by event_date) session_id
from (
select id, guid, event_date,
case when lag(guid) over (partition by guid order by event_date) = guid
and 24 * 60 * (event_date -lag(event_date)
over (partition by guid order by event_date) ) < 30
then 0 else 1
end chg
from event_table ) a
dbfiddle demo
Compare neighbouring rows: if the guids differ or the time difference is greater than 30 minutes, assign 1, otherwise 0. Then sum these flags analytically.
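Since the stated end goal is the median session time, once session_id exists the rest is a plain aggregate. A minimal sketch, assuming the query above is saved as a sessions view (MEDIAN is Oracle-specific; use percentile_cont(0.5) elsewhere):
select median(24 * 60 * (session_end - session_start)) as median_session_minutes
from (
  select guid, session_id,
    min(event_date) as session_start,
    max(event_date) as session_end
  from sessions
  group by guid, session_id
)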
I think you're on the right track using lead or lag. My recommendation would be to break this into steps and create a temp table to work against:
With the first query, assign every record its own unique ID, either a sequence number or GUID. You could also capture some of the lagged data in this step.
With a second query, find the overlaps (< 30 minutes) and make the overlapping records all the same -- either the same as the earliest or latest in that grouping, doesn't matter as long as it's consistent.
Something like this:
create table events_temp as (
select f.*,
row_number() over (partition by f.userID order by f.event_date) as user_row,
lag(f.userID) over (partition by f.userID order by f.event_date) as prev_userID,
lag(f.event_date) over (partition by f.userID order by f.event_date) as prev_event_date
from event_table f
order by f.userID, f.event_date
)
select a.*,
case when prev_userID = userID
and 24 * 60 * (event_date - prev_event_date) < 30
then lag(user_row) over (partition by userID order by user_row)
else user_row
end as session_id
from events_temp
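One caveat with the lag(user_row) branch above: lag only reaches back a single row, so in a session spanning three or more events the later rows won't all collapse to the session's first user_row. A hedged variant that instead carries the session-start row forward with a running max, against the same events_temp table:
select a.*,
  max(case when prev_event_date is null
        or 24 * 60 * (event_date - prev_event_date) >= 30
      then user_row end)
    over (partition by userID order by user_row) as session_id
from events_temp a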

Optimizing sum() over(order by...) clause throwing 'resources exceeded' error

I'm computing a sessions table from event data from our website in BigQuery. The events table has around 12 million events (pretty small). After I add in the logic to create sessions, I want to number all sessions and assign a global_session_id. I'm doing that using a sum() over (order by ...) clause, which is throwing a "resources exceeded" error. I know that the order by clause forces all the data to be processed on a single node, and that is causing the compute resources to be exceeded, but I'm not sure what changes I can make to my code to achieve the same result. Any workarounds, advice, or explanations are greatly appreciated.
with sessions_1 as ( /* Tie a visitor's last event and last campaign to current event. */
select visitor_id as session_user_id,
sent_at,
context_campaign_name,
event,
id,
LAG(sent_at,1) OVER (PARTITION BY visitor_id ORDER BY sent_at) as last_event,
LAG(context_campaign_name,1) OVER (PARTITION BY visitor_id ORDER BY sent_at) as last_event_campaign_name
from tracks_2
),
sessions_2 as ( /* Flag events that begin a new session. */
select *,
case
when context_campaign_name != last_event_campaign_name
or context_campaign_name is null and last_event_campaign_name is not null
or context_campaign_name is not null and last_event_campaign_name is null
then 1
when unix_seconds(sent_at)
- unix_seconds(last_event) >= (60 * 30)
or last_event is null
then 1
else 0
end as is_new_session
from sessions_1
),
sessions_3 as ( /* Assign events sessions numbers for total sessions and total user sessions. */
select id as event_id,
sum(is_new_session) over (order by session_user_id, sent_at) as global_session_id
#sum(is_new_session) over (partition by session_user_id order by sent_at) as user_session_id
from materialized_result_of_sessions_2_query
)
select * from sessions_3
It might help if you defined a CTE with just the sessions, rather than working at the event level. If this works:
select session_user_id, sent_at,
row_number() over (order by session_user_id, sent_at) as global_session_id
from materialized_result_of_sessions_2_query
where is_new_session = 1
group by session_user_id, sent_at;
You can join this back to the original event-level data and then use a max() window function to assign it to all events. Something like:
select e.*,
max(s.global_session_id) over (partition by e.session_user_id order by e.event_at) as global_session_id
from events e left join
(<above query>) s
on s.session_user_id = e.session_user_id and s.sent_at = e.event_at;
If that doesn't work, you can construct the global id:
select us.*, us.user_session_id + s.offset as global_session_id
from (select session_user_id, sent_at,
row_number() over (partition by session_user_id order by sent_at) as user_session_id
from materialized_result_of_sessions_2_query
where is_new_session = 1
) us join
(select session_user_id, count(*) as cnt,
sum(count(*)) over (order by session_user_id) - count(*) as offset
from materialized_result_of_sessions_2_query
where is_new_session = 1
group by session_user_id
) s
on us.session_user_id = s.session_user_id;
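To see the offset arithmetic: if user A has 3 sessions and user B has 2, the second subquery yields offset 0 for A and 3 for B, so A's sessions get global ids 1-3 and B's get 4-5.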
This might still fail if almost all users are unique and the sessions are short.

Google BigQuery - Firebase Analytics - Closed funnel for screen views (parameters)

I would like to get a closed funnel for my X screen views, which are parameters of the screen_view event.
I have found this very good tutorial - https://medium.com/firebase-developers/how-do-i-create-a-closed-funnel-in-google-analytics-for-firebase-using-bigquery-6eb2645917e1 - but it only covers a closed funnel built from events.
I would like to get this:
event_name   event_param    count_users
screen_view  screen_name_1  100
screen_view  screen_name_2  50
screen_view  screen_name_3  20
screen_view  screen_name_4  5
I tried changing the code provided in the tutorial to work on event params, but I got to the point where I have no idea what to do next.
SELECT *,
IF (value.string_value = "screen_name1", user_pseudo_id, NULL) as funnel_1,
IF (value.string_value = "screen_name1" AND next_event = "screen_name2", user_pseudo_id, NULL) AS funnel_2
FROM (
SELECT p.value.string_value, user_pseudo_id , event_timestamp,
LEAD(p.value.string_value, 1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS next_event
FROM `ProjectName.analytics_XX.events_20190119` as t1, UNNEST(event_params) as p
WHERE (p.value.string_value = "screen_name1" OR p.value.string_value = "screen_name2")
ORDER BY 2,3
LIMIT 100
)
Thanks for any help!
I have found the solution:
SELECT COUNT(DISTINCT funnel_1) as f1_users, COUNT(DISTINCT funnel_2) as f2_users FROM (
SELECT *,
IF (param.value.string_value = "screen_name1", user_pseudo_id, NULL) AS funnel_1,
IF (param.value.string_value = "screen_name1" AND next_screen = "screen_name2", user_pseudo_id, NULL) AS funnel_2
FROM (
SELECT TIMESTAMP_MICROS(event_timestamp), param, user_pseudo_id,
LEAD(param.value.string_value, 1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) as next_screen
FROM `ProjectName.analytics_XX.events_*`, unnest(event_params) as param
WHERE
event_name = "screen_view" and
param.value.string_value IN ("screen_name1", "screen_name2")
AND _TABLE_SUFFIX BETWEEN '20190205' AND '20190312'
)
)
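One caveat: the inner WHERE matches any string parameter whose value happens to equal one of the screen names, whatever its key. Filtering on the parameter key as well is safer - a sketch, assuming the screen name is carried under the firebase_screen key as in the first attempt above:
WHERE
event_name = "screen_view"
AND param.key = "firebase_screen"
AND param.value.string_value IN ("screen_name1", "screen_name2")
AND _TABLE_SUFFIX BETWEEN '20190205' AND '20190312'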

SQL rollup on sessions

I have an impression event table that has a bunch of timestamps and marked start/end boundaries. I am trying to roll it up to have a metric that says "this session contains at least 1 impression with feature x". I'm not sure how exactly to do this. Any help would be appreciated. Thanks.
I want to roll this up into something that looks like:
account  session_start            session_end              interacted_with_feature
3004514  2018-02-23 13:43:35.475  2018-02-23 13:43:47.377  FALSE
where it is simple for me to say if this session had any interactions with the feature or not.
Perhaps aggregation does what you want:
select account, min(timestamp), max(timestamp), max(interacted_with_feature)
from t
group by account;
I was able to solve this with conditional cumulative sums to generate a session group ID for each row.
with cte as (
select *
, sum(case when session_boundary = 'start' then 1 else 0 end)
over (partition by account order by timestamp rows unbounded preceding)
as session_num
from raw_sessions
)
select account
, session_num
, min(timestamp) as session_start
, max(timestamp) as session_end
, bool_or(interacted_with_feature) as interacted_with_feature
from cte
group by account, session_num
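Note that bool_or is PostgreSQL-specific; on engines without it, the same flag can be derived portably (returning 1/0 instead of true/false) by substituting this line in the outer select:
, max(case when interacted_with_feature then 1 else 0 end) as interacted_with_feature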