Google BigQuery - Firebase Analytics - closed funnel for screen views (parameters) - SQL

I would like to get a closed funnel for my X screen views, which are parameters of the screen_view event.
I have found this very good tutorial - https://medium.com/firebase-developers/how-do-i-create-a-closed-funnel-in-google-analytics-for-firebase-using-bigquery-6eb2645917e1 - but it only covers a closed funnel over events.
I would like to get this:
event_name    event_param      count_users
screen_view   screen_name_1    100
screen_view   screen_name_2    50
screen_view   screen_name_3    20
screen_view   screen_name_4    5
What I have tried is to adapt the code from the tutorial to event parameters, but I got to a point where I have no idea what to do next.
SELECT *,
  IF(value.string_value = "screen_name1", user_pseudo_id, NULL) AS funnel_1,
  IF(value.string_value = "screen_name1" AND next_event = "screen_name2", user_pseudo_id, NULL) AS funnel_2
FROM (
  SELECT p.value.string_value, user_pseudo_id, event_timestamp,
    LEAD(p.value.string_value, 1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS next_event
  FROM `ProjectName.analytics_XX.events_20190119` AS t1, UNNEST(event_params) AS p
  WHERE (p.value.string_value = "screen_name1" OR p.value.string_value = "screen_name2")
  ORDER BY 2, 3
  LIMIT 100
)
Thanks for any help!

I have found the solution:
SELECT COUNT(DISTINCT funnel_1) AS f1_users, COUNT(DISTINCT funnel_2) AS f2_users
FROM (
  SELECT *,
    IF(param.value.string_value = "screen_name1", user_pseudo_id, NULL) AS funnel_1,
    IF(param.value.string_value = "screen_name1" AND next_screen = "screen_name2", user_pseudo_id, NULL) AS funnel_2
  FROM (
    SELECT TIMESTAMP_MICROS(event_timestamp) AS event_time, param, user_pseudo_id,
      LEAD(param.value.string_value, 1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS next_screen
    -- note: the wildcard table (events_*) is required for the _TABLE_SUFFIX filter to work
    FROM `ProjectName.analytics_XX.events_*`, UNNEST(event_params) AS param
    WHERE event_name = "screen_view"
      AND param.value.string_value IN ("screen_name1", "screen_name2")
      AND _TABLE_SUFFIX BETWEEN '20190205' AND '20190312'
  )
)
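The same pattern extends to the four-step funnel from the desired output by taking LEAD at offsets 1 through 3. A minimal sketch under two assumptions not stated in the solution above: the screen name arrives in the firebase_screen parameter, and a step only counts when the screens are hit back-to-back (as in the two-step version):
-- Sketch: four-step closed funnel over consecutive screen_view events.
-- Assumes screen names live in the firebase_screen parameter; screen names
-- and the date range are placeholders.
SELECT
  COUNT(DISTINCT funnel_1) AS f1_users,
  COUNT(DISTINCT funnel_2) AS f2_users,
  COUNT(DISTINCT funnel_3) AS f3_users,
  COUNT(DISTINCT funnel_4) AS f4_users
FROM (
  SELECT
    IF(screen = "screen_name1", user_pseudo_id, NULL) AS funnel_1,
    IF(screen = "screen_name1" AND next_1 = "screen_name2", user_pseudo_id, NULL) AS funnel_2,
    IF(screen = "screen_name1" AND next_1 = "screen_name2" AND next_2 = "screen_name3", user_pseudo_id, NULL) AS funnel_3,
    IF(screen = "screen_name1" AND next_1 = "screen_name2" AND next_2 = "screen_name3" AND next_3 = "screen_name4", user_pseudo_id, NULL) AS funnel_4
  FROM (
    SELECT user_pseudo_id,
      param.value.string_value AS screen,
      LEAD(param.value.string_value, 1) OVER w AS next_1,
      LEAD(param.value.string_value, 2) OVER w AS next_2,
      LEAD(param.value.string_value, 3) OVER w AS next_3
    FROM `ProjectName.analytics_XX.events_*`, UNNEST(event_params) AS param
    WHERE event_name = "screen_view"
      AND param.key = "firebase_screen"
      AND param.value.string_value IN ("screen_name1", "screen_name2", "screen_name3", "screen_name4")
      AND _TABLE_SUFFIX BETWEEN '20190205' AND '20190312'
    WINDOW w AS (PARTITION BY user_pseudo_id ORDER BY event_timestamp)
  )
)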

Related

Double counting problem in Rolling weekly / monthly active endpoints

Here is my current code to calculate DAE, WAE, MAE:
SELECT event_timestamp AS day, Section, users AS DAE,
  SUM(users) OVER (PARTITION BY Section
    ORDER BY event_timestamp
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS WAE,
  SUM(users) OVER (PARTITION BY Section
    ORDER BY event_timestamp
    ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS MAE
FROM (
  SELECT COUNT(DISTINCT user_pseudo_id) AS users, Section, event_timestamp
  FROM (
    SELECT DISTINCT *
    FROM (
      SELECT *,
        CASE
          WHEN param_value = "Names" OR param_value = "SingleName" THEN 'Names'
          ELSE param_value
        END AS Section
      FROM (
        SELECT user_pseudo_id,
          DATE_TRUNC(EXTRACT(DATE FROM TIMESTAMP_MICROS(event_timestamp)), DAY) AS event_timestamp,
          event_name,
          params.value.string_value AS param_value
        FROM `rayn-deen-app.analytics_317927526.events_*`, UNNEST(event_params) AS params
        WHERE event_name = 'screen_view'
          AND params.key = 'firebase_screen'
          AND (params.value.string_value = "Promises"       # Promises
               OR params.value.string_value = "Favourites"  # Favourites
          )
        GROUP BY user_pseudo_id, event_timestamp, event_name, param_value
        ORDER BY event_timestamp, user_pseudo_id
      ) raw
    ) base
    ORDER BY event_timestamp
  ) AS events_table
  GROUP BY Section, event_timestamp
)
The problem is that the WAE and MAE numbers repeat-count the same users. For example, if user A was a "daily active user" on 4 days of a week, the WAE count treats that as 4 users instead of one. I need to remove these repeat counts somehow.
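Since BigQuery does not allow COUNT(DISTINCT ...) over an ordered window frame, one common fix is to deduplicate user-days first and then join each day to the user-days in its trailing window. A sketch of that approach, reusing the table and parameter names from the query above:
-- Sketch: distinct-user counts per trailing 7/30-day window.
-- Assumes one row per (day, Section, user) after deduplication.
WITH daily AS (
  SELECT DISTINCT
    EXTRACT(DATE FROM TIMESTAMP_MICROS(event_timestamp)) AS day,
    params.value.string_value AS Section,
    user_pseudo_id
  FROM `rayn-deen-app.analytics_317927526.events_*`, UNNEST(event_params) AS params
  WHERE event_name = 'screen_view'
    AND params.key = 'firebase_screen'
    AND params.value.string_value IN ("Promises", "Favourites")
)
SELECT d.day, d.Section,
  COUNT(DISTINCT IF(u.day = d.day, u.user_pseudo_id, NULL)) AS DAE,
  COUNT(DISTINCT IF(u.day > DATE_SUB(d.day, INTERVAL 7 DAY), u.user_pseudo_id, NULL)) AS WAE,
  COUNT(DISTINCT u.user_pseudo_id) AS MAE
FROM (SELECT DISTINCT day, Section FROM daily) d
JOIN daily u
  ON u.Section = d.Section
  AND u.day BETWEEN DATE_SUB(d.day, INTERVAL 29 DAY) AND d.day
GROUP BY d.day, d.Section
Each user is then counted once per window, no matter how many days they were active in it.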

SQL query to get session duration from Firebase events

I need to know the duration of my users' sessions, one by one. To do that I use BigQuery; in the query below I try to get the time. For context:
the ga_session_id param propagates to every event in a session, so I want to subtract the timestamp of session_start (the start of the session) from the timestamp of the last event with the same ga_session_id, for each ga_session_id.
WITH grps AS (
SELECT event_timestamp, event_name,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "ga_session_id") AS sessionid,
COUNTIF(event_name = 'session_start') OVER (ORDER BY event_timestamp) as grp
FROM `nodal-descent-XXXXX.analytics_XXXXXX.events_intraday_*`
)
SELECT min(event_timestamp), max(event_timestamp),
timestamp_diff(timestamp_micros(max(event_timestamp)),
timestamp_micros(min(event_timestamp)), second) as se
FROM grps
Can anyone help me complete the query so it does this for each ga_session_id?
If I understand your question, you are looking to add the session id to your current query. If so, try the following:
select
  ep.value.int_value as ga_session_id
  , min(event_timestamp) as min_ses
  , max(event_timestamp) as max_ses
  , timestamp_diff(timestamp_micros(max(event_timestamp)), timestamp_micros(min(event_timestamp)), second) as se
from `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`,
  unnest(event_params) ep
where ep.key = 'ga_session_id'
group by ga_session_id
order by ga_session_id
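One caveat, flagged as an assumption worth checking rather than a confirmed issue: ga_session_id is only guaranteed unique within a user, so if two users can produce the same id you would want to group by the user as well. A variant:
-- Variant keeping sessions separate per user, since ga_session_id alone
-- may repeat across different user_pseudo_id values.
select
  user_pseudo_id
  , ep.value.int_value as ga_session_id
  , timestamp_diff(timestamp_micros(max(event_timestamp)), timestamp_micros(min(event_timestamp)), second) as session_duration_sec
from `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`,
  unnest(event_params) ep
where ep.key = 'ga_session_id'
group by user_pseudo_id, ga_session_id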

Why is my BigQuery retention not lining up with Firebase?

I found this article which mentions how to get retention into a form that is easier to manipulate, but I can't even get it to line up with basic retention. I have messed with the dates all I can, but I don't understand why I am still getting anywhere from 2-12% off from Firebase. If you have any thoughts I will take them!
Here is the query:
WITH analytics_data AS (
SELECT user_pseudo_id, event_timestamp, event_name,
app_info.version AS app_version, -- This is new!
UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "-7:00")) AS start_day,
3600*1000*1000*24*7 AS one_week_micros
FROM `firebase-public-project.analytics_153293282.events_*`
WHERE _table_suffix BETWEEN '20180731' AND '20180829'
)
SELECT week_0_cohort / week_0_cohort AS week_0_pct,
week_1_cohort / week_0_cohort AS week_1_pct,
week_2_cohort / week_0_cohort AS week_2_pct,
week_3_cohort / week_0_cohort AS week_3_pct
FROM (
WITH week_3_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_timestamp BETWEEN start_day+(3*one_week_micros) AND start_day+(4*one_week_micros)
),
week_2_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_timestamp BETWEEN start_day+(2*one_week_micros) AND start_day+(3*one_week_micros)
),
week_1_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_timestamp BETWEEN start_day+(1*one_week_micros) AND start_day+(2*one_week_micros)
),
week_0_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_name = 'first_open'
AND app_version = "2.62" -- This bit is new, too!
AND event_timestamp BETWEEN start_day AND start_day+(1*one_week_micros)
)
SELECT
(SELECT count(*)
FROM week_0_users) AS week_0_cohort,
(SELECT count(*)
FROM week_1_users
JOIN week_0_users USING (user_pseudo_id)) AS week_1_cohort,
(SELECT count(*)
FROM week_2_users
JOIN week_0_users USING (user_pseudo_id)) AS week_2_cohort,
(SELECT count(*)
FROM week_3_users
JOIN week_0_users USING (user_pseudo_id)) AS week_3_cohort
)
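One detail in the query worth checking, offered as a possible partial explanation rather than a confirmed cause of the gap: BETWEEN is inclusive on both ends, so an event landing exactly on a week boundary is counted in two adjacent week buckets, and the Firebase console may also bucket users differently (for example, by calendar date in the property's reporting time zone). Half-open windows at least remove the overlap; for example, week_1_users becomes:
-- Sketch: half-open week window, so a boundary event falls in exactly one week.
week_1_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_timestamp >= start_day + (1 * one_week_micros)
    AND event_timestamp < start_day + (2 * one_week_micros)
)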

BigQuery SQL: filter on event sequence

I want to count, for each app_id, how many times the event_type store_app_view was followed by the event_type store_app_download for the same user ("followed" meaning the event_time_utc of store_app_view is older than the event_time_utc of store_app_download).
Sample data:
WITH
  `project.dataset.dummy_data_init` AS (
    SELECT event_id FROM UNNEST(GENERATE_ARRAY(1, 10000)) event_id
  ),
  `project.dataset.dummy_data_completed` AS (
    SELECT event_id,
      user_id[OFFSET(CAST(20 * RAND() - 0.5 AS INT64))] user_id,
      app_id[OFFSET(CAST(100 * RAND() - 0.5 AS INT64))] app_id,
      event_type[OFFSET(CAST(6 * RAND() - 0.5 AS INT64))] event_type,
      event_time_utc[OFFSET(CAST(26 * RAND() - 0.5 AS INT64))] event_time_utc
    FROM `project.dataset.dummy_data_init`,
      (SELECT GENERATE_ARRAY(1, 20) user_id),
      (SELECT GENERATE_ARRAY(1, 100) app_id),
      (SELECT ['store_app_view', 'store_app_view', 'store_app_download', 'store_app_install', 'store_app_update', 'store_fetch_manifest'] event_type),
      (SELECT GENERATE_TIMESTAMP_ARRAY('2020-01-01 00:00:00', '2020-01-26 00:00:00', INTERVAL 1 DAY) AS event_time_utc)
  )
SELECT * FROM `project.dataset.dummy_data_completed`
Thanks!
I want to count, for each app_id, how many times the event_type: store_app_view was followed by the event_type: store_app_download.
Your provided query seems to have almost no connection to this question, so I'll ignore it.
For each user/app pair, you can get the rows that match your conditions using GROUP BY:
select user_id, app_id
from t
group by user_id, app_id
having min(case when event_type = 'store_app_view' then event_time_utc end) <
       max(case when event_type = 'store_app_download' then event_time_utc end);
To get the total for each app, use a subquery or CTE:
select app_id, count(*)
from (select user_id, app_id
from t
group by user_id, app_id
having min(case when event_type = 'store_app_view' then event_time_utc end) <
       max(case when event_type = 'store_app_download' then event_time_utc end)
) ua
group by app_id;
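To sanity-check this against the generated sample above, one can replace the final SELECT * in the dummy-data snippet with the aggregation (a usage sketch; the generic t from the answer becomes the `project.dataset.dummy_data_completed` CTE):
-- Appended in place of the final SELECT * in the sample-data WITH block.
SELECT app_id, COUNT(*) AS users_with_view_then_download
FROM (
  SELECT user_id, app_id
  FROM `project.dataset.dummy_data_completed`
  GROUP BY user_id, app_id
  HAVING MIN(CASE WHEN event_type = 'store_app_view' THEN event_time_utc END) <
         MAX(CASE WHEN event_type = 'store_app_download' THEN event_time_utc END)
)
GROUP BY app_id
ORDER BY app_id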

Optimizing sum() over(order by...) clause throwing 'resources exceeded' error

I'm computing a sessions table from event data from our website in BigQuery. The events table has around 12 million events (pretty small). After I add the logic to create sessions, I want to take a running sum over all sessions and assign a global_session_id. I'm doing that using a sum() over (order by ...) clause, which is throwing a 'resources exceeded' error. I know that the order by clause causes all the data to be processed on a single node, which is why compute resources are exceeded, but I'm not sure what changes I can make to my code to achieve the same result. Any workarounds, advice, or explanations are greatly appreciated.
with sessions_1 as ( /* Tie a visitor's last event and last campaign to current event. */
select visitor_id as session_user_id,
sent_at,
context_campaign_name,
event,
id,
LAG(sent_at,1) OVER (PARTITION BY visitor_id ORDER BY sent_at) as last_event,
LAG(context_campaign_name,1) OVER (PARTITION BY visitor_id ORDER BY sent_at) as last_event_campaign_name
from tracks_2
),
sessions_2 as ( /* Flag events that begin a new session. */
select *,
case
when context_campaign_name != last_event_campaign_name
or context_campaign_name is null and last_event_campaign_name is not null
or context_campaign_name is not null and last_event_campaign_name is null
then 1
when unix_seconds(sent_at)
- unix_seconds(last_event) >= (60 * 30)
or last_event is null
then 1
else 0
end as is_new_session
from sessions_1
),
sessions_3 as ( /* Assign events sessions numbers for total sessions and total user sessions. */
select id as event_id,
sum(is_new_session) over (order by session_user_id, sent_at) as global_session_id
#sum(is_new_session) over (partition by session_user_id order by sent_at) as user_session_id
from materialized_result_of_sessions_2_query
)
select * from sessions_3
It might help if you defined a CTE with just the sessions, rather than working at the event level. See if this works:
select session_user_id, sent_at,
       row_number() over (order by session_user_id, sent_at) as global_session_id
from materialized_result_of_sessions_2_query
where is_new_session = 1
group by session_user_id, sent_at;
If that works, you can join it back to the original event-level data and use a max() window function to assign the id to all events. Something like:
select e.*,
       max(s.global_session_id) over (partition by e.session_user_id order by e.sent_at) as global_session_id
from events e left join
     (<above query>) s
     on s.session_user_id = e.session_user_id and s.sent_at = e.sent_at;
If not, you can construct the global id from per-user session numbers plus a per-user offset:
select us.*, us.user_session_id + s.offset as global_session_id
from (select session_user_id, sent_at,
             row_number() over (partition by session_user_id order by sent_at) as user_session_id
      from materialized_result_of_sessions_2_query
      where is_new_session = 1
     ) us join
     (select session_user_id, count(*) as cnt,
             sum(count(*)) over (order by session_user_id) - count(*) as offset
      from materialized_result_of_sessions_2_query
      where is_new_session = 1
      group by session_user_id
     ) s
     on us.session_user_id = s.session_user_id;
This might still fail if almost all users are unique and the sessions are short.