Mismatch of retention results from Firebase and BigQuery

I calculated retention in BigQuery with the code below. The code was taken from here. But this code gives me a different retention than the retention already calculated in Firebase. The number of users calculated in BigQuery is always smaller.
What is the difference between these two approaches? Is there a way to get the same result in BigQuery as in Firebase?
#standardSQL
####################################################################
# PART 1: Cohort of New Users starting on SEPT 1
####################################################################
WITH
new_user_cohort AS (
SELECT DISTINCT user_pseudo_id as new_user_id
FROM
`projectId.analytics_YOUR_TABLE.events_*`
WHERE
event_name = 'first_open' AND
#geo.country = 'France' AND
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+8")) = '20180901' AND
_TABLE_SUFFIX BETWEEN '20180830' AND '20180902'),
num_new_users AS (
SELECT count(*) as num_users_in_cohort FROM new_user_cohort
),
####################################################################
# PART 2: Engaged users from Sept 1 cohort
####################################################################
engaged_user_by_day AS (
SELECT
FORMAT_TIMESTAMP('%Y%m%d', TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+8")) as event_day, COUNT (DISTINCT user_pseudo_id) as num_engaged_users
FROM
`projectId.analytics_YOUR_TABLE.events_*` INNER JOIN new_user_cohort on new_user_id = user_pseudo_id
WHERE
event_name = 'user_engagement' AND
_TABLE_SUFFIX BETWEEN '20180830' AND '20180907'
GROUP BY (event_day)
)
####################################################################
# PART 3: Daily Retention = [Engaged Users / Total Users]
####################################################################
SELECT event_day, num_engaged_users, num_users_in_cohort, ROUND((num_engaged_users / num_users_in_cohort), 3) as retention_rate
FROM engaged_user_by_day CROSS JOIN num_new_users
ORDER BY (event_day)

I found out that Analytics uses sampling, and in my case the report inside Analytics only uses 0.2% of the data.
In Firebase I noticed that they removed the retention tab (at least in my case), but I think sampling was used there as well.
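Besides sampling, another variable that can move the numbers is the timezone: as far as I know, Firebase reports use the project's reporting timezone, while the query above pins the day boundary to "Etc/GMT+8". Below is a quick sketch, using the same table and pattern as the query above, to see how sensitive the cohort is to that choice:
#standardSQL
SELECT
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+8")) AS day_gmt8,
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "UTC")) AS day_utc,
COUNT(DISTINCT user_pseudo_id) AS new_users
FROM
`projectId.analytics_YOUR_TABLE.events_*`
WHERE
event_name = 'first_open' AND
_TABLE_SUFFIX BETWEEN '20180830' AND '20180902'
GROUP BY day_gmt8, day_utc
ORDER BY day_gmt8, day_utc
If users land on different days in the two columns, part of the gap is day-boundary attribution rather than sampling.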

Related

Google Analytics Object in Google BigQuery

I'm trying to extract data from Google Analytics, but due to incompatibilities between dimensions and metrics, it was decided to use Google BigQuery instead to obtain the data related to GA4.
I'm struggling to find some metrics/dimensions in Google BigQuery, even after searching the documentation: https://support.google.com/analytics/answer/3437719?hl=en
Google Analytics Dimensions/Metrics:
These are the dimensions and metrics I've used in Google Analytics; the ones I can't find in Google BigQuery are:
Users
Sessions (I used totals.visits, but I get only NULLs and 1s, while in GA it shows larger numbers)
TransactionsPerSession
CountryIsoCode (in GA it is only the country code, for instance Spain --> ES, but in BigQuery it's the country's complete name. This can be solved, but it would be good to have the country code directly from the source)
avgSessionDuration
A great place to get this information is https://www.ga4bigquery.com/
I have copied one of my reports that will provide you with points 1, 2, 3 & 5. I don't use country, but it can be found in the link above; a rough sketch for the country code also follows the query below.
-- subquery to prepare the data
with prep_traffic as (
select
user_pseudo_id,
event_date as date,
count(distinct (ecommerce.transaction_id)) as Transactions,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_id,
max((select value.string_value from unnest(event_params) where key = 'session_engaged')) as session_engaged,
max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec')) as engagement_time_msec,
-- change event_name to the event(s) you want to count
countif(event_name = 'page_view') as event_count,
-- change event_name to the conversion event(s) you want to count
countif(event_name = 'add_payment_info') as conversions,
sum(ecommerce.purchase_revenue) as total_revenue
from
-- change this to your google analytics 4 bigquery export location
`bigquery****.events_*`
where
-- change the date range by using static and/or dynamic dates
_table_suffix between '20230129' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
group by
user_pseudo_id,
session_id,
event_date)
-- main query
select
count(distinct user_pseudo_id) as users,
-- session_id is an integer, so cast it to string before concatenating
count(distinct concat(user_pseudo_id, cast(session_id as string))) as sessions,
count(distinct case when session_engaged = '1' then concat(user_pseudo_id, cast(session_id as string)) end) as engaged_sessions,
ROUND(safe_divide(count(distinct case when session_engaged = '1' then concat(user_pseudo_id, cast(session_id as string)) end), count(distinct user_pseudo_id)), 2) as engaged_sessions_per_user,
ROUND(safe_divide(count(distinct case when session_engaged = '1' then concat(user_pseudo_id, cast(session_id as string)) end), count(distinct concat(user_pseudo_id, cast(session_id as string)))), 2) as engagement_rate,
sum(Transactions) as transactions,
safe_divide(sum(Transactions), count(distinct concat(user_pseudo_id, cast(session_id as string)))) as TransactionsPerSession,
-- engagement_time_msec is in milliseconds; divide by 1000 for seconds per engaged session
safe_divide(sum(engagement_time_msec), count(distinct case when session_engaged = '1' then concat(user_pseudo_id, cast(session_id as string)) end)) / 1000 as avgSessionDuration,
sum(event_count) as event_count,
sum(conversions) as conversions,
ifnull(sum(total_revenue),0) as total_revenue,
date
from
prep_traffic
group by
date
order by
date desc, users desc
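For the country ISO code (point 4), the GA4 export only carries the full country name in geo.country, so the code has to be derived. Below is a minimal sketch of one workaround, joining against a small name-to-code lookup; the country_codes CTE is made-up dummy data for illustration, not a real table:
-- country_codes is dummy data; in practice you would load a full ISO 3166 lookup
with country_codes as (
select 'Spain' as country, 'ES' as iso_code union all
select 'France', 'FR' union all
select 'Germany', 'DE')
select
e.geo.country,
c.iso_code,
count(distinct e.user_pseudo_id) as users
from
-- change this to your google analytics 4 bigquery export location
`bigquery****.events_*` as e
left join country_codes as c
on c.country = e.geo.country
where
e._table_suffix between '20230129' and '20230131'
group by 1, 2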

How to calculate weekly retention in BigQuery using SQL

I have the following table with the week number and the retention rate.
|creation_week |num_engaged_users |num_users_in_cohort |retention_rate|
|:------------:|:-----------------:|:------------------:|:------------:|
|37| 3114 |4604 |67.637|
|38| 1860 |4604 |40.4|
|39| 1233 |4604 |26.781|
|40| 668 |4604 |14.509|
|41| 450 |4604 |9.774|
|42| 463| 4604|10.056|
What I need is to make it look something like this
|week |week0 |week1 |week2|week3|week4|week5|week6|
|:---:|:----:|:----:|:---:|:---:|:---:|:---:|:---:|
|week37|100|ret.rate|ret.rate|ret.rate|ret.rate|ret.rate|ret.rate|
|week38|100|ret.rate|ret.rate|ret.rate|ret.rate|ret.rate|
|week39|100|ret.rate|ret.rate|ret.rate|ret.rate|
|week40|100|ret.rate|ret.rate|ret.rate|
|week41|100|ret.rate|ret.rate|
|week42|100|ret.rate|
How can I do that using BigQuery SQL?
For some reason Stack Overflow doesn't allow posting this question unless all the tables are marked as code, so here is the SQL I used:
WITH
new_user_cohort AS (
WITH
#table with cookie and user_ids for the later matching
table_1 AS (
SELECT
DISTINCT props.value.string_value AS cookie_id,
user_id
FROM
`stockduel.analytics.events`,
UNNEST(event_properties) AS props
WHERE
props.key = 'cookie_id'
AND user_id>0),
#second table which gives access to the sample with the users who performed the event
table_2 AS (
SELECT
DISTINCT props.value.string_value AS cookie_id,
EXTRACT(WEEK
FROM
creation_date) AS first_week
FROM
`stockduel.analytics.events`,
UNNEST(event_properties) AS props
WHERE
props.key = 'cookie_id'
AND event_type = 'launch_first_time'
#set the date from when starting cohort analysis
AND EXTRACT(WEEK
FROM
creation_date) = EXTRACT(WEEK
FROM
DATE '2021-09-15'))
#join user id with cookie_id and group the elements to remove the duplicates
SELECT
user_id,
first_week
FROM
table_2
JOIN
table_1
ON
table_1.cookie_id = table_2.cookie_id
#group the results to avoid duplicates
GROUP BY
user_id,
first_week ),
num_new_users AS (
SELECT
COUNT(*) AS num_users_in_cohort,
first_week
FROM
new_user_cohort
GROUP BY
first_week ),
engaged_users_by_day AS (
SELECT
COUNT(DISTINCT `stockduel.analytics.ws_raw_sessions_v2`.user_id) AS num_engaged_users,
EXTRACT(WEEK
FROM
started_at) AS creation_week,
FROM
`stockduel.analytics.ws_raw_sessions_v2`
JOIN
new_user_cohort
ON
new_user_cohort.user_id = `stockduel.analytics.ws_raw_sessions_v2`.user_id
WHERE
EXTRACT(WEEK
FROM
started_at) BETWEEN EXTRACT(WEEK
FROM
DATE '2021-09-15')
AND EXTRACT(WEEK
FROM
DATE '2021-10-22')
GROUP BY
creation_week )
SELECT
creation_week,
num_engaged_users,
num_users_in_cohort,
ROUND((100*(num_engaged_users / num_users_in_cohort)), 3) AS retention_rate
FROM
engaged_users_by_day
CROSS JOIN
num_new_users
ORDER BY
creation_week
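To get from that output to the triangular layout, one option is conditional aggregation. Below is a minimal sketch, assuming a single week-37 cohort and using the rows from the table above as dummy input; with several cohorts you would compute a week offset (engagement week minus cohort week) and pivot on that instead:
#standardSQL
WITH
retention AS (
SELECT 37 AS creation_week, 67.637 AS retention_rate UNION ALL
SELECT 38, 40.4 UNION ALL
SELECT 39, 26.781 UNION ALL
SELECT 40, 14.509 UNION ALL
SELECT 41, 9.774 UNION ALL
SELECT 42, 10.056 )
SELECT
'week37' AS week,
100 AS week0,
#each weekN column picks the retention rate N weeks after the cohort week
MAX(IF(creation_week = 38, retention_rate, NULL)) AS week1,
MAX(IF(creation_week = 39, retention_rate, NULL)) AS week2,
MAX(IF(creation_week = 40, retention_rate, NULL)) AS week3,
MAX(IF(creation_week = 41, retention_rate, NULL)) AS week4,
MAX(IF(creation_week = 42, retention_rate, NULL)) AS week5
FROM
retention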

Google BigQuery realtime does not match GA report

Hello, I would like to see real-time data using the Google BigQuery realtime table.
However, simple query statements do not match the GA reports. I created a query that shows the number of sessions per hour, but I get an error rate of 10 to 30%.
Is the accuracy of BigQuery realtime just not very good, or am I making a mistake?
WITH noDuplicateTable AS (
SELECT
ARRAY_AGG(t ORDER BY exportTimeUsec DESC LIMIT 1)[OFFSET(0)].*
FROM
`tablename_20*` AS t
WHERE
_TABLE_SUFFIX = FORMAT_DATE("%y%m%d", CURRENT_DATE('Asia/Seoul'))
GROUP BY
t.visitKey
),
session AS (
SELECT
ROW_NUMBER() OVER () AS sessionRow,
FORMAT_TIMESTAMP('%H', TIMESTAMP_SECONDS(time), 'Asia/Seoul') AS startTime,
SUM(session) AS session,
(SUM(session) - SUM(isVisit)) AS uniqueSession,
(SUM(isVisit) / SUM(session) * 100) AS bounce,
SUM(totalPageView) AS totalPageView
FROM (
SELECT
COUNT(visitId) AS session,
visitStartTime AS time,
SUM(IFNULL(totals.bounces, 0)) AS isVisit,
SUM(totals.pageviews) AS totalPageView
FROM
noDuplicateTable
GROUP BY
visitStartTime
)
GROUP BY startTime
)
SELECT * FROM session

Counting transactions on 2 different levels

I am using Google Analytics data in BigQuery; my desired output is:
|USERID|INTERACTIONS|TRANSACTIONS|SCORE|CHANNEL|
|:----:|:----------:|:----------:|:---:|:-----:|
|XXX|3|1|33.33|Paid|
Below is my query so far. I am getting duplicate transactions and I can't work out why. Unnesting my hits led to a high count of interactions as every line was being counted, so I added the AND hit.isentrance IS TRUE clause, which means I can't use COUNT(DISTINCT hit.transaction.transactionid) as the entry row will never contain an order ID; instead I have to use totals.transactions, which is where I think my issues could be coming from.
SELECT UserID, SUM(Campaign_Interactions) AS Interactions, SUM(Transactions) AS Transactions, ROUND(SUM(Transactions)/SUM(Campaign_Interactions), 2) AS Con_Score, MasterChannel FROM(
(SELECT customDimension.value AS UserID, visitid AS visitid1, trafficSource.campaign AS Campaign, COUNT(trafficSource.campaign) AS Campaign_Interactions, SUM (totals.transactions) AS Transactions, ROUND(MAX(totals.transactions)/COUNT(trafficSource.campaign), 2) AS Conversion_Score
FROM `xxx.ga_sessions_20*` AS m
CROSS JOIN UNNEST(m.customdimensions) AS customDimension
CROSS JOIN UNNEST (hits) AS hit
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 7 day) and
DATE_sub(current_date(), interval 1 day)
AND customDimension.index = 2
AND trafficSource.campaign IS NOT NULL
AND (customDimension.value NOT LIKE 'true' AND customDimension.value NOT LIKE 'undefined')
AND hit.isentrance IS TRUE
GROUP BY visitid1, Campaign, userID
ORDER BY Transactions DESC)
JOIN
(SELECT * FROM `xxx.7Days_VisitID_MasterChan`)
ON visitid1 = visitid)
GROUP BY UserID, MasterChannel
ORDER BY UserID
And a screenshot of results. Note that for the ID 00004180-16f5-46e4-9caa-c6b47e03d795 (near the bottom) there should be only 1 order, but we are seeing it on each row.
It's fine for the user to have interactions across multiple channels, this is expected. Multiple transactions across multiple channels is also fine, but I can see in our CRM that this UserID has made only one order in the last 7 days, so I'd only expect to see a single transaction against the ID here.
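One thing worth checking (an assumption on my part, since the contents of `xxx.7Days_VisitID_MasterChan` aren't shown): if that table holds more than one row per visitid, the join fans out and the same per-visit totals.transactions value is summed once per matching row, which would produce exactly this kind of duplication. A quick diagnostic sketch:
#list visits that appear more than once in the channel lookup table;
#any hits here would fan out the join above
SELECT
visitid,
COUNT(*) AS rows_per_visit
FROM
`xxx.7Days_VisitID_MasterChan`
GROUP BY
visitid
HAVING COUNT(*) > 1
ORDER BY
rows_per_visit DESC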

Cohort/Retention query in BigQuery using Google Analytics exported data

I need help formulating a cohort/retention query.
I am trying to build a query to look at visitors who performed ActionX on their first visit (in the time frame) and then how many days later they returned to perform ActionX again.
The output I (eventually) need looks like this...
The table I am dealing with is an export of Google Analytics to BigQuery.
Could anyone help me with this, or share a similar query that I could adapt?
Thanks
Just to give you a simple idea / direction.
Below is for BigQuery Standard SQL:
#standardSQL
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
You can test it with the dummy data below, taken from your question:
#standardSQL
WITH `OutputFromQuery` AS (
SELECT '01.07.17' AS Date_of_action_first_taken, 1000 AS Visits, 800 AS later_1_day, 400 AS later_2_days, 300 AS later_3_days UNION ALL
SELECT '02.07.17', 1000, 860, 780, 860 UNION ALL
SELECT '29.07.17', 1000, 780, 120, 0 UNION ALL
SELECT '30.07.17', 1000, 710, 0, 0
)
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
The OutputFromQuery data is as below:
|Date_of_action_first_taken|Visits|later_1_day|later_2_days|later_3_days|
|:------------------------:|:----:|:---------:|:----------:|:----------:|
|01.07.17|1000|800|400|300|
|02.07.17|1000|860|780|860|
|29.07.17|1000|780|120|0|
|30.07.17|1000|710|0|0|
and the final output is:
|Date_of_action_first_taken|later_1_day|later_2_days|later_3_days|
|:------------------------:|:---------:|:----------:|:----------:|
|01.07.17|80.0|40.0|30.0|
|02.07.17|90.0|78.0|86.0|
|29.07.17|80.0|12.0|0.0|
|30.07.17|70.0|0.0|0.0|
I found this query in the talk Turn Your App Data into Answers with Firebase and BigQuery (Google I/O '19).
It should work :)
#standardSQL
###################################################
# Part 1: Cohort of New Users Starting on DEC 24
###################################################
WITH
new_user_cohort AS (
SELECT DISTINCT
user_pseudo_id as new_user_id
FROM
`[your_project].[your_firebase_table].events_*`
WHERE
event_name = '[chosen_event]' AND
#set the date from when starting cohort analysis
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) = '20191224' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
),
num_new_users AS (
SELECT count(*) as num_users_in_cohort FROM new_user_cohort
),
#############################################
# Part 2: Engaged users from Dec 24 cohort
#############################################
engaged_users_by_day AS (
SELECT
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) as event_day,
COUNT(DISTINCT user_pseudo_id) as num_engaged_users
FROM
`[your_project].[your_firebase_table].events_*`
INNER JOIN
new_user_cohort ON new_user_id = user_pseudo_id
WHERE
event_name = 'user_engagement' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
GROUP BY
event_day
)
####################################################################
# Part 3: Daily Retention = [Engaged Users / Total Users]
####################################################################
SELECT
event_day,
num_engaged_users,
num_users_in_cohort,
ROUND((num_engaged_users / num_users_in_cohort), 3) as retention_rate
FROM
engaged_users_by_day
CROSS JOIN
num_new_users
ORDER BY
event_day
So I think I may have cracked it... From this output I would then need to manipulate it (pivot it) to make it look like the desired output.
Can anyone review this for me and let me know what you think?
WITH
cohort_items AS (
SELECT
MIN( TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 +
h.time*1000)), DAY) ) AS cohort_day, fullVisitorID
FROM
TABLE123 AS U,
UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN'
GROUP BY 2
),
user_activites AS (
SELECT
A.fullVisitorID,
DATE_DIFF(DATE(TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 + h.time*1000)), DAY)), DATE(C.cohort_day), DAY) AS day_number
FROM `Table123` A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID,
UNNEST(hits) AS h
WHERE
A._TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN'
GROUP BY 1,2),
cohort_size AS (
SELECT
cohort_day,
count(1) as number_of_users
FROM
cohort_items
GROUP BY 1
ORDER BY 1
),
retention_table AS (
SELECT
C.cohort_day,
A.day_number,
COUNT(1) AS number_of_users
FROM
user_activites A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID
GROUP BY 1,2
)
SELECT
B.cohort_day,
S.number_of_users as total_users,
B.day_number,
B.number_of_users / S.number_of_users as percentage
FROM retention_table B
LEFT JOIN cohort_size S ON B.cohort_day = S.cohort_day
WHERE B.cohort_day IS NOT NULL
ORDER BY 1, 3
Thank you in advance!
If you use some techniques available in BigQuery, you can potentially solve this type of problem with a very cost- and performance-effective solution. As an example:
SELECT
init_date,
ARRAY((SELECT AS STRUCT days, freq, ROUND(freq * 100 / MAX(freq) OVER(), 2) FROM UNNEST(data) ORDER BY days)) data
FROM(
SELECT
init_date,
ARRAY_AGG(STRUCT(days, freq)) data
FROM(
SELECT
init_date,
data AS days,
COUNT(data) freq
FROM(
SELECT
init_date,
ARRAY(SELECT DATE_DIFF(PARSE_DATE("%Y%m%d", dts), PARSE_DATE("%Y%m%d", init_date), DAY) AS dt FROM UNNEST(dts) dts) data
FROM(
SELECT
MIN(date) init_date,
ARRAY_AGG(DISTINCT date) dts
FROM `Table123`
WHERE TRUE
AND EXISTS(SELECT 1 FROM UNNEST(hits) where eventinfo.eventCategory = 'recommendation') -- This is your 'ACTION TAKEN' filter
AND _TABLE_SUFFIX BETWEEN "20170724" AND "20170731"
GROUP BY fullvisitorid
)
),
UNNEST(data) data
GROUP BY init_date, days
)
GROUP BY init_date
)
I tested this query against our GA data and selected customers who interacted with our recommendation system (as you can see in the WHERE EXISTS filter). Example of result (absolute values of freq omitted for privacy reasons):
As you can see, for the 28th for instance, 8% of customers came back 1 day later and interacted with the system again.
I recommend playing around with this query to see if it works well for you. It's simpler, cheaper, faster, and hopefully easier to maintain.