Double counting problem in rolling weekly / monthly active user counts - SQL

Here is my current code to calculate DAE, WAE, and MAE:
select
    event_timestamp as day,
    Section,
    users as DAE,
    SUM(users) OVER (PARTITION BY Section
                     ORDER BY event_timestamp
                     ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as WAE,
    SUM(users) OVER (PARTITION BY Section
                     ORDER BY event_timestamp
                     ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) as MAE
from (
    select count(distinct user_pseudo_id) as users, Section, event_timestamp
    from (
        select distinct *
        from (
            select *,
                CASE
                    WHEN param_value = "Names" or param_value = "SingleName" THEN 'Names'
                    ELSE param_value
                END AS Section
            from (
                select user_pseudo_id,
                    DATE_TRUNC(EXTRACT(DATE from TIMESTAMP_MICROS(event_timestamp)), DAY) as event_timestamp,
                    event_name,
                    params.value.string_value as param_value
                from `rayn-deen-app.analytics_317927526.events_*`, unnest(event_params) as params
                where (event_name = 'screen_view' and params.key = 'firebase_screen' and (
                    # Promises
                    params.value.string_value = "Promises"
                    # Favourites
                    or params.value.string_value = "Favourites"
                ))
                group by user_pseudo_id, event_timestamp, event_name, param_value
                order by event_timestamp, user_pseudo_id
            ) raw
        ) base
        order by event_timestamp
    ) as events_table
    group by Section, event_timestamp
)
The problem is that the WAE and MAE figures repeat-count the same users: a user who is active on several days within the window is counted once per day rather than once. For example, if user A was a daily active user on 4 days in a given week, the WAE count treats that as 4 users instead of 1. I need to remove these repeat counts somehow.
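A note on the usual fix (a sketch of the standard approach, not code from the question): summing per-day distinct counts can never yield a distinct count over the window, so the rolling windows have to be computed over user-level rows rather than over the pre-aggregated users column. Assuming a helper table daily_users(day, Section, user_pseudo_id) with one row per user, section, and active day (the deduplicated inner query above), a date-range self-join gives an exact WAE:

-- Sketch, assuming daily_users(day, Section, user_pseudo_id) is the
-- deduplicated per-day user list produced by the inner query above.
select
    d.day,
    d.Section,
    count(distinct u.user_pseudo_id) as WAE  -- each user counted once per 7-day window
from (select distinct day, Section from daily_users) d
join daily_users u
  on u.Section = d.Section
 and u.day between date_sub(d.day, interval 6 day) and d.day
group by d.day, d.Section
-- MAE is the same join with a 29-day lookback: date_sub(d.day, interval 29 day)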

Related

ClickHouse window function difficulties - how to work with date windows

I have web sessions with UTM tags (different traffic channels: cpc, smm, push). Some of them have tags, but some sessions are organic, without UTM tags. I want to overwrite the organic sessions with previous tags.
The rules I want to use (a plain-SQL sketch of these rules follows the query below):
the push channel remains only for the session in which it is registered;
all other non-empty channels are forwarded to all empty sessions for the current and next day;
channels are not overwritten - that is, if there was a cpc channel first, and then an smm channel on the same day, the cpc sessions go first, and then smm for the current and next day.
ClickHouse version 22.8.10.29
Main idea: use arrays, with UNION ALL for the push channel:
select install_id, session_id, date_uz, started_at, utm_medium, utm_medium_final
from (
    SELECT *,
        arrayFirst(x -> x != '', arrayReverse(utm_medium_array)) as utm_medium_new,
        maxIf(date_uz, utm_medium_new = utm_medium)
            OVER (PARTITION BY install_id ORDER BY started_at
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_date,
        if(date_uz - last_date < 2, utm_medium_new, '') utm_medium_final
        --any(utm_medium_new) OVER (PARTITION BY install_id ORDER BY started_at ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) as h,
    from (
        select install_id, session_id, utm_medium, date_uz, started_at,
            groupArray(utm_medium)
                OVER (PARTITION BY install_id ORDER BY started_at
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS utm_medium_array
        from marketing.sessions_with_attribution swa
        where date_uz >= today() - 50
          and utm_medium != 'push'
          and install_id in ('1cc69a1f-eb17-4be6-8bfc-a5dee2dd9c50', '57927c21-e862-4729-b38e-f663aa9d227d')
    )
    UNION ALL
    select install_id, session_id, utm_medium, date_uz, started_at,
        [] utm_medium_array, utm_medium, null, utm_medium
    from marketing.sessions_with_attribution swa
    where date_uz >= today() - 50
      and utm_medium = 'push'
      and install_id in ('1cc69a1f-eb17-4be6-8bfc-a5dee2dd9c50', '57927c21-e862-4729-b38e-f663aa9d227d')
)
order by install_id, started_at
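To make the intended rules concrete, here is a minimal sketch in plain SQL using a correlated subquery (my illustration, not the poster's code: the table sessions(install_id, session_id, date_uz, started_at, utm_medium) is assumed, and ClickHouse 22.8 would need this rewritten with arrays or window functions, as the query above attempts):

-- Sketch only: each untagged session borrows the channel of the most recent
-- tagged, non-push session for the same install from the current or previous day.
select s.install_id, s.session_id, s.date_uz, s.started_at,
       case
         when s.utm_medium != '' then s.utm_medium    -- tagged sessions (incl. push) keep their own channel
         else (select t.utm_medium
               from sessions t
               where t.install_id = s.install_id
                 and t.utm_medium not in ('', 'push')  -- push never forwards
                 and t.started_at <= s.started_at
                 and t.date_uz >= s.date_uz - 1        -- current or previous day only
               order by t.started_at desc              -- latest registered channel wins from here on
               limit 1)
       end as utm_medium_final
from sessions s;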

Why is my BigQuery retention not lining up with Firebase?

I found this article, which mentions how to get retention in a form that is easier to manipulate, but I can't even get it to line up with basic retention. I have adjusted the dates every way I can, but I am still getting anywhere from 2-12% off from Firebase. If you have any thoughts, I will take them!
Here is the query:
WITH analytics_data AS (
SELECT user_pseudo_id, event_timestamp, event_name,
app_info.version AS app_version, -- This is new!
UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "-7:00")) AS start_day,
3600*1000*1000*24*7 AS one_week_micros
FROM `firebase-public-project.analytics_153293282.events_*`
WHERE _table_suffix BETWEEN '20180731' AND '20180829'
)
SELECT week_0_cohort / week_0_cohort AS week_0_pct,
week_1_cohort / week_0_cohort AS week_1_pct,
week_2_cohort / week_0_cohort AS week_2_pct,
week_3_cohort / week_0_cohort AS week_3_pct
FROM (
WITH week_3_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_timestamp BETWEEN start_day+(3*one_week_micros) AND start_day+(4*one_week_micros)
),
week_2_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_timestamp BETWEEN start_day+(2*one_week_micros) AND start_day+(3*one_week_micros)
),
week_1_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_timestamp BETWEEN start_day+(1*one_week_micros) AND start_day+(2*one_week_micros)
),
week_0_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_name = 'first_open'
AND app_version = "2.62" -- This bit is new, too!
AND event_timestamp BETWEEN start_day AND start_day+(1*one_week_micros)
)
SELECT
(SELECT count(*)
FROM week_0_users) AS week_0_cohort,
(SELECT count(*)
FROM week_1_users
JOIN week_0_users USING (user_pseudo_id)) AS week_1_cohort,
(SELECT count(*)
FROM week_2_users
JOIN week_0_users USING (user_pseudo_id)) AS week_2_cohort,
(SELECT count(*)
FROM week_3_users
JOIN week_0_users USING (user_pseudo_id)) AS week_3_cohort
)
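One detail worth checking in the query above (an observation, not a confirmed cause of the whole 2-12% gap): BETWEEN is inclusive at both ends, so an event whose timestamp lands exactly on a week boundary is counted in two adjacent weekly windows, while Firebase computes dashboard retention from its own cohort definitions that this query only approximates. Half-open windows at least remove the boundary overlap, e.g. as a drop-in rewrite of the week_1_users CTE:

week_1_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_timestamp >= start_day + (1 * one_week_micros)  -- inclusive start
    AND event_timestamp <  start_day + (2 * one_week_micros)  -- exclusive end
),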

BigQuery SQL: filter on event sequence

I want to count, for each app_id, how many times the event_type store_app_view was followed by the event_type store_app_download for the same user ("followed" meaning the event_time_utc of store_app_view is earlier than the event_time_utc of store_app_download).
Sample data:
WITH
  `project.dataset.dummy_data_init` AS (
    SELECT event_id
    FROM UNNEST(GENERATE_ARRAY(1, 10000)) event_id
  ),
  `project.dataset.dummy_data_completed` AS (
    SELECT event_id,
      user_id[OFFSET(CAST(20 * RAND() - 0.5 AS INT64))] user_id,
      app_id[OFFSET(CAST(100 * RAND() - 0.5 AS INT64))] app_id,
      event_type[OFFSET(CAST(6 * RAND() - 0.5 AS INT64))] event_type,
      event_time_utc[OFFSET(CAST(26 * RAND() - 0.5 AS INT64))] event_time_utc
    FROM `project.dataset.dummy_data_init`,
      (SELECT GENERATE_ARRAY(1, 20) user_id),
      (SELECT GENERATE_ARRAY(1, 100) app_id),
      (SELECT ['store_app_view', 'store_app_view', 'store_app_download', 'store_app_install', 'store_app_update', 'store_fetch_manifest'] event_type),
      (SELECT GENERATE_TIMESTAMP_ARRAY('2020-01-01 00:00:00', '2020-01-26 00:00:00', INTERVAL 1 DAY) AS event_time_utc)
  )
SELECT * FROM `project.dataset.dummy_data_completed`
Thanks!
I want to count, for each app_id, how many times the event_type: store_app_view was followed by the event_type: store_app_download.
Your provided query seems to have almost no connection to this question, so I'll ignore it.
For each user/app pair, you can get the pairs that match your conditions using GROUP BY: if the earliest store_app_view precedes the latest store_app_download, then at least one view was followed by a download:
select user_id, app_id
from t
group by user_id, app_id
having min(case when event_type = 'store_app_view' then event_time_utc end) <
       max(case when event_type = 'store_app_download' then event_time_utc end);
To get the total for each app, use a subquery or CTE:
select app_id, count(*)
from (select user_id, app_id
      from t
      group by user_id, app_id
      having min(case when event_type = 'store_app_view' then event_time_utc end) <
             max(case when event_type = 'store_app_download' then event_time_utc end)
     ) ua
group by app_id;

How to return all the rows in the yellow census blocks?

The schema is like this: for the whole dataset, we should order by machine_id first, then by ss2k. After that, for each machine, we should find all the rows with at least 5 consecutive flag = 'census' values. In this dataset, the result should be all the yellow rows.
I cannot return the last 4 rows of the yellow blocks by using this:
drop table if exists qz_panel_census_228_rank;
create table qz_panel_census_228_rank as
select t.*
from (select t.*,
count(*) filter (where flag = 'census') over (partition by machine_id, date order by ss2k rows between current row and 4 following) as census_cnt5,
count(*) filter (where flag = 'census') over (partition by machine_id, date) as count_census,
row_number() over (partition by machine_id, date order by ss2k) as seqnum,
count(*) over (partition by machine_id, date) as cnt
from qz_panel_census_228 t
) t
where census_cnt5 = 5
group by 1,2,3,4,5,6,7,8,9,10,11
DISTRIBUTED BY (machine_id);
You were close, but you need to search in both directions:
select t.*
from (select t.*,
case when count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between 4 preceding and current row) = 5
or count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between current row and 4 following) = 5
then 1
else 0
end as flag
from qz_panel_census_228 t
) t
where flag = 1
Edit:
This approach will not work unless you add an extra count for each possible 5 row window, e.g. 3 preceding and 1 following, 2 preceding and 2 following, etc. This results in ugly code and is not very flexible.
The common way to solve this gaps & islands problem is to assign consecutive rows to a common group first:
select *
from
(
select t2.*,
count(*) over (partition by machine_id, date, grp) as cnt
from
(
select t1.*
from (select t.*,
-- keep the same number for 'census' rows
sum(case when flag = 'census' then 0 else 1 end)
over (partition by machine_id, date
order by ss2k
rows unbounded preceding) as grp
from qz_panel_census_228 t
) t1
where flag = 'census' -- only census rows
) as t2
) t3
where cnt >= 5 -- only groups of at least 5 census rows
Wow, there has to be a better way of doing this, but the only way I could figure out was to create blocks of consecutive 'census' values. This looks awful but might be a catalyst to a better idea.
with q1 as (
select
machine_id, recorded, ss2k, flag, date,
case
when flag = 'census' and
lag (flag) over (order by machine_id, ss2k) != 'census'
then 1
else 0
end as block
from foo
),
q2 as (
select
machine_id, recorded, ss2k, flag, date,
sum (block) over (order by machine_id, ss2k) as group_id,
case when flag = 'census' then 1 else 0 end as census
from q1
),
q3 as (
select
machine_id, recorded, ss2k, flag, date, group_id,
sum (census) over (partition by group_id order by ss2k) as max_count
from q2
),
groups as (
select group_id
from q3
group by group_id
having max (max_count) >= 5
)
select
q2.machine_id, q2.recorded, q2.ss2k, q2.flag, q2.date
from
q2
join groups g on q2.group_id = g.group_id
where
q2.flag = 'census'
If you run each query within the with clauses in isolation, I think you will see how this evolves.

Google BigQuery - Firebase Analytics - closed funnel for screen views (parameters)

I would like to get a closed funnel for my X screen views, which are parameters of the screen_view event.
I found this very good tutorial - https://medium.com/firebase-developers/how-do-i-create-a-closed-funnel-in-google-analytics-for-firebase-using-bigquery-6eb2645917e1 - but it only covers a closed funnel with events.
I would like to get this:
event_name   event_param    count_users
screen_view  screen_name_1  100
screen_view  screen_name_2  50
screen_view  screen_name_3  20
screen_view  screen_name_4  5
What I tried was to change the code provided in the tutorial to use event params, but I got to the point where I have no idea what to do next.
SELECT *,
  IF (string_value = "screen_name1", user_pseudo_id, NULL) AS funnel_1,
  IF (string_value = "screen_name1" AND next_event = "screen_name2", user_pseudo_id, NULL) AS funnel_2
FROM (
  SELECT p.value.string_value, user_pseudo_id, event_timestamp,
    LEAD(p.value.string_value, 1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS next_event
  FROM `ProjectName.analytics_XX.events_20190119` AS t1, UNNEST(event_params) AS p
  WHERE (p.value.string_value = "screen_name1" OR p.value.string_value = "screen_name2")
  ORDER BY 2, 3
  LIMIT 100
)
Thanks for any help!
I have found the solution:
SELECT COUNT(DISTINCT funnel_1) AS f1_users, COUNT(DISTINCT funnel_2) AS f2_users
FROM (
  SELECT *,
    IF (param.value.string_value = "screen_name1", user_pseudo_id, NULL) AS funnel_1,
    IF (param.value.string_value = "screen_name1" AND next_screen = "screen_name2", user_pseudo_id, NULL) AS funnel_2
  FROM (
    SELECT TIMESTAMP_MICROS(event_timestamp) AS event_time, param, user_pseudo_id,
      LEAD(param.value.string_value, 1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS next_screen
    FROM `ProjectName.analytics_XX.events_*`, UNNEST(event_params) AS param
    WHERE event_name = "screen_view"
      AND param.value.string_value IN ("screen_name1", "screen_name2")
      AND _TABLE_SUFFIX BETWEEN '20190205' AND '20190312'
  )
)
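As a design note, the LEAD over the screen_view params is what makes the funnel "closed": a user counts for step N only if step N immediately followed step N-1. Extending to deeper steps means one extra LEAD per step (a sketch; screen_name3 and the aliases below are placeholders, not from the original):

-- Hypothetical third step: add inside the inner SELECT ...
LEAD(param.value.string_value, 2) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS next_screen_2
-- ... and in the outer SELECT:
IF (param.value.string_value = "screen_name1"
    AND next_screen = "screen_name2"
    AND next_screen_2 = "screen_name3", user_pseudo_id, NULL) AS funnel_3
-- then count it as COUNT(DISTINCT funnel_3) AS f3_users in the outermost SELECT.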