How to calculate "Daily user engagement" in BigQuery and get the same result that Firebase shows in its dashboard?

I'm trying to calculate the amount of time the average user spent in my app on a particular day.
I just read this great post from Todd Kerpelman on the Firebase Blog that explains how to do it. With a small modification to his query I arrived at the solution below:
SELECT AVG(user_daily_engagement_time/1000/60)
FROM (
  SELECT user_pseudo_id, event_date, SUM(engagement_time) AS user_daily_engagement_time
  FROM (
    SELECT user_pseudo_id, event_date,
      (SELECT value.int_value FROM UNNEST(event_params) WHERE key = "engagement_time_msec") AS engagement_time
    FROM `mydataset.events_20201104`
  )
  WHERE engagement_time > 0
  GROUP BY 1,2)
The problem is that when I compare it with Firebase Analytics for the same period, the two results differ: the BigQuery figure is more than 2 minutes higher than the one on the Firebase dashboard.
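To make the arithmetic of the query above easier to check by hand, here is a minimal Python sketch of the same aggregation. The function name and the sample rows are hypothetical, and it only mirrors the SQL logic; it does not reproduce how the Firebase dashboard computes its figure.

```python
from collections import defaultdict


def average_daily_engagement_minutes(events):
    """events: iterable of (user_pseudo_id, event_date, engagement_time_msec or None)."""
    per_user_day = defaultdict(int)
    for user_id, event_date, engagement_msec in events:
        # Mirror `WHERE engagement_time > 0`: skip missing or zero values.
        if engagement_msec is not None and engagement_msec > 0:
            per_user_day[(user_id, event_date)] += engagement_msec
    if not per_user_day:
        return None
    # Average over (user, day) pairs, then convert milliseconds to minutes.
    total_msec = sum(per_user_day.values())
    return total_msec / len(per_user_day) / 1000 / 60


# Hypothetical sample rows, not real export data.
sample = [
    ("u1", "20201104", 60_000),
    ("u1", "20201104", 120_000),
    ("u2", "20201104", 60_000),
    ("u3", "20201104", None),  # no positive engagement: dropped from the average
]
print(average_daily_engagement_minutes(sample))
```

Note that a user with no positive engagement time (u3 here) is left out of the denominator. If the dashboard divides by a different user count, the two averages would diverge; whether it does is not confirmed here.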

Related

GA4 vs BigQuery - User Count don't match

I have extracted from Bigquery the active_users and totalusers on 31/12/2022, grouped by CampaignName and Country, using the following query:
select
count(distinct case when (select value.int_value from unnest(event_params) where key = 'engagement_time_msec') > 0 or (select value.string_value from unnest(event_params) where key = 'session_engaged') = '1' then user_pseudo_id else null end) AS active_users
,count(distinct user_pseudo_id) AS totalusers
,traffic_source.name AS CampaignName
,geo.country AS Country
FROM `independent-tea-354108.analytics_254831690.events_20221231`
GROUP BY
traffic_source.name
,geo.country
The result filtered by CampaignName='(organic)' was:
(https://i.stack.imgur.com/LMQAH.png)
But when I compare with the data from GA4, it doesn't match, and the difference is huge (around 15000 more active_users in GA4 than in BigQuery). Please note that this is only for one day; over a month the difference would be even higher:
(https://i.stack.imgur.com/8arYs.png)
I've tried filtering by other CampaignNames and not a single value matches and the differences are always huge.
These are two common reasons for GA4 and BigQuery to differ. You have probably looked at them already.
Check your source table for blank user_pseudo_id values. If you have consent mode on your website, those users may be counted in GA4 but not in BigQuery, and this can cause big differences.
Time zone is another area that can make a difference: event_timestamp in BigQuery is in UTC, while your GA4 property may not be.
I hope these help.

Calculate Session Length from Google Analytics data in BigQuery

How do you calculate session length for website event data that flows via Google Analytics to BigQuery?
A similar question has been posted & answered here. However, the underlying data structure is very different to my case:
Our data structure is: project_id.dataset_id.events_* with a separate table for each day, instead of project_id.dataset_id.ga_sessions_*
The way I've tried to get the session length is with the user_engagement event and the engagement_time_msec field:
SELECT
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
SUM((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec'))/1000.0 as session_length_seconds -- milliseconds to seconds
FROM `project_id.dataset_id.events_*`
WHERE event_name = 'user_engagement'
GROUP BY 1
But I'm getting NULL values for some sessions: BigQuery Output
I haven't found good documentation from Google on this, so any help or links would be greatly appreciated.
This article explains very well how to calculate session length:
Basically there are two ways:
Engaged Session Length (using the maximum engagement_time_msec), which seems to indicate that engagement_time_msec is a cumulative metric
Normal Session Length (using the difference between the maximum and minimum event_timestamp across all events)
Pasting the section of the article here:
Average Session Duration: Again this has changed slightly to engaged session duration. This will be lower than your Universal Analytics session duration as it only counts when the tab is in focus.
Below I show how to do both.
SELECT
sum(engagement_time_msec)/1000 #in milliseconds
/count(distinct concat(user_pseudo_id,ga_session_id)) as ga4_session_duration,
sum(end_time-start_time)/1000000 #timestamp in microseconds
/count(distinct concat(user_pseudo_id,ga_session_id)) as ua_session_duration
from(
SELECT
user_pseudo_id,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as ga_session_id,
max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec')) as engagement_time_msec,
min(event_timestamp) as start_time,
max(event_timestamp) as end_time
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
where _table_suffix BETWEEN "20210101" and "20210131"
group by 1,2)
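The two definitions in the query above can be sketched in Python on a handful of rows. This is only an illustration under stated assumptions: the function name, the dict layout, and the sample events are hypothetical, and a missing engagement_time_msec is treated as 0.

```python
def average_session_durations(events):
    """Return (engaged_seconds, wall_clock_seconds) averaged over sessions.

    events: iterable of dicts with user_pseudo_id, ga_session_id,
    event_timestamp (microseconds) and engagement_time_msec (may be None).
    """
    sessions = {}
    for e in events:
        key = (e["user_pseudo_id"], e["ga_session_id"])
        ts = e["event_timestamp"]
        s = sessions.setdefault(key, {"engaged_msec": 0, "start": ts, "end": ts})
        # Engaged length: keep the maximum engagement_time_msec seen in the session.
        s["engaged_msec"] = max(s["engaged_msec"], e.get("engagement_time_msec") or 0)
        # Wall-clock length: track the first and last event timestamps.
        s["start"] = min(s["start"], ts)
        s["end"] = max(s["end"], ts)
    n = len(sessions)
    if n == 0:
        return None
    engaged = sum(s["engaged_msec"] for s in sessions.values()) / 1000 / n
    wall = sum(s["end"] - s["start"] for s in sessions.values()) / 1_000_000 / n
    return engaged, wall


# Hypothetical events for two sessions.
sample_events = [
    {"user_pseudo_id": "u1", "ga_session_id": 1, "event_timestamp": 0, "engagement_time_msec": 5_000},
    {"user_pseudo_id": "u1", "ga_session_id": 1, "event_timestamp": 30_000_000, "engagement_time_msec": 12_000},
    {"user_pseudo_id": "u2", "ga_session_id": 7, "event_timestamp": 0, "engagement_time_msec": None},
    {"user_pseudo_id": "u2", "ga_session_id": 7, "event_timestamp": 10_000_000, "engagement_time_msec": 4_000},
]
print(average_session_durations(sample_events))
```

On this sample the engaged average is lower than the wall-clock average, which matches the article's remark that engaged duration only counts time when the tab is in focus.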

Results within Bigquery do not remain the same as in GA4

I'm inside BigQuery performing the query below to see how many users I had from August 1st to August 14th, but the number is not matching what GA4 presents me.
with event AS (
SELECT
user_id,
event_name,
PARSE_DATE('%Y%m%d',
event_date) AS event_date,
TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY TIMESTAMP_MICROS(event_timestamp) DESC) AS rn,
FROM
`events_*`
WHERE
event_name= 'push_received')
SELECT COUNT ( DISTINCT user_id)
FROM
event
WHERE
event_date >= '2022-08-01'
GA4 result
Result BQ = 37024
There are quite a few reasons why your GA4 data in the web will not match when compared to the BigQuery export and the Data API.
In this case, I believe you are running into the Time Zone issue. event_date is the date that the event was logged in the registered timezone of your Property. However, event_timestamp is a time in UTC that the event was logged by the client.
To resolve this, simply update your query with:
EXTRACT(DATETIME FROM TIMESTAMP_MICROS(`event_timestamp`) at TIME ZONE 'TIMEZONE OF YOUR PROPERTY' )
Your data should then match the WebUI and the GA4 Data API. This post that I co-authored goes into more detail on this and other reasons why your data doesn't match: https://analyticscanvas.com/3-reasons-your-ga4-data-doesnt-match/
You cannot simply compare totals. Divide it into daily comparisons and look at details.
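The time zone effect described above can be illustrated with a small Python sketch. The helper name and the UTC-3 offset are assumptions chosen for the example; a fixed offset is a simplification, since a real property time zone may observe daylight saving time.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def property_date(event_timestamp_micros, utc_offset_hours):
    """Calendar date of a UTC microsecond timestamp in a fixed-offset time zone."""
    utc_dt = EPOCH + timedelta(microseconds=event_timestamp_micros)
    local_dt = utc_dt.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local_dt.date().isoformat()


# An event logged at 2022-08-01 01:30 UTC (hypothetical timestamp).
ts = int(datetime(2022, 8, 1, 1, 30, tzinfo=timezone.utc).timestamp()) * 1_000_000

print(property_date(ts, 0))   # date when read as UTC
print(property_date(ts, -3))  # date in a property set to UTC-3 (assumed offset)
```

The same event falls on August 1 in UTC but on July 31 in a UTC-3 property, so a filter on one date definition can include or exclude events that the other does not.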

Stuck on what seems like a simple SQL dense_rank task

Been stuck on this issue and could really use a suggestion or help.
What I have in a table is basic user flow on a website. For every Session ID, there's a page visited from start (lands on homepage) to finish (purchase). This has been ordered by timestamp to get a count of pages visited during this process. This 'page count' has also been partitioned by Session ID to go back to 1 every time the ID changes.
What I need to do now is assign a step count (highlighted is what I'm trying to achieve). This should assign a similar count but not continue counting at duplicate steps (i.e., someone visited multiple product pages: it's multiple pages but still only one 'product view' step).
You'd think this would be done using a dense rank, partitioned by session id - but that's where I get stuck. You can't order on page count because that'll assign a unique number to each step count. You can't order by Step because that orders it alphabetically.
What could I do to achieve this?
Screenshot of desired outcome:
Many thanks!
Use lag to check whether two consecutive values are the same, then take a cumulative sum:
select t.*,
sum(case when prev_cs = custom_step then 0 else 1 end) over (partition by session_id order by timestamp) as steps_count
from (select t.*,
lag(custom_step) over (partition by session_id order by timestamp) as prev_cs
from t
) t
Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(flag),
COUNTIF(IFNULL(flag, TRUE)) OVER(PARTITION BY session_id ORDER BY timestamp) AS steps_count
FROM (
SELECT *,
custom_step != LAG(custom_step) OVER(PARTITION BY session_id ORDER BY timestamp) AS flag
FROM `project.dataset.table`
)
-- ORDER BY timestamp
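Both answers rely on the same idea: flag each row whose step differs from the previous row in the session, then keep a running count of the flags. A minimal Python sketch of that logic follows; the function name and the sample rows are hypothetical.

```python
def add_steps_count(rows):
    """rows: list of (session_id, timestamp, custom_step).

    Returns rows sorted by session and timestamp with a steps_count appended.
    The count increases only when the step differs from the previous row
    of the same session (the LAG comparison), and restarts per session.
    """
    result = []
    previous_step = {}  # session_id -> step of the previous row
    counter = {}        # session_id -> running step count
    for session_id, ts, step in sorted(rows, key=lambda r: (r[0], r[1])):
        is_first_row = session_id not in counter
        if is_first_row or previous_step[session_id] != step:
            counter[session_id] = counter.get(session_id, 0) + 1
        previous_step[session_id] = step
        result.append((session_id, ts, step, counter[session_id]))
    return result


# Hypothetical page flow for two sessions.
sample_rows = [
    ("s1", 1, "home"),
    ("s1", 2, "product"),
    ("s1", 3, "product"),
    ("s1", 4, "cart"),
    ("s1", 5, "product"),
    ("s2", 1, "home"),
    ("s2", 2, "home"),
]
for row in add_steps_count(sample_rows):
    print(row)
```

Consecutive duplicates share a step number, while a later return to an earlier step (the second visit to "product" in s1) starts a new step, which is also how the lag-based queries behave.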

Query multiple params in multiple tables with TABLE_DATE_RANGE for Firebase Analytics

I intend to get from the events I have in the applications a stat for most played audios within an article. In the event I send articleId and the audioID that has been played.
I want to obtain as a result rows like this, ordered by number of occurrences:
| ID of the article | ID of the audio | number of occurrences
Since Firebase Analytics exports to BigQuery on a daily basis and I want those events per month, I created a query that takes the values from multiple tables and combined it with the info I found in this thread.
The resulting query is:
SELECT
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Article_ID') AS Article_ID,
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Audio_ID') AS Audio_ID,
COUNT(event_dim.name) as Number_Of_Plays
FROM
TABLE_DATE_RANGE([project-id:my_app_id.app_events_], DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'), CURRENT_TIMESTAMP()), UNNEST(event_dim) AS x
WHERE event_dim.name = 'Audio_Play'
GROUP BY Audio_ID, Article_ID
ORDER BY Number_Of_Plays desc
Unfortunately this query is not parsed correctly and gives me an error:
Error: Table name cannot be resolved: dataset name is missing.
I am pretty sure the issue is related to querying multiple tables in a range, but not sure how to fix it. Thanks.
The other answer you reference uses Standard SQL, whereas you are trying to use TABLE_DATE_RANGE, which is only available in Legacy SQL.
This is the query in Standard SQL that lets you query multiple tables:
#standardSQL
SELECT
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Article_ID') AS Article_ID,
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Audio_ID') AS Audio_ID,
COUNT(x.name) as Number_Of_Plays
FROM
`project-id.my_app_id.app_events_*`, UNNEST(event_dim) AS x
-- Table suffixes are YYYYMMDD, so format the dates without dashes.
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)) AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
-- Filter on the unnested event (x), not on the event_dim array itself.
AND x.name = 'Audio_Play'
GROUP BY Audio_ID, Article_ID
ORDER BY Number_Of_Plays desc
See the FROM clause, `project-id.my_app_id.app_events_*` (Standard SQL separates project and dataset with a dot, not a colon), and the WHERE _TABLE_SUFFIX BETWEEN line.
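When the date bounds for _TABLE_SUFFIX are built outside the query, it helps to check their format first. The sketch below assumes the daily tables carry a YYYYMMDD suffix (as in events_20201104 earlier on this page); the helper name is hypothetical.

```python
from datetime import date, timedelta


def table_suffix_range(end_day, days_back=30):
    """Return (start, end) strings in YYYYMMDD form for a _TABLE_SUFFIX filter."""
    start_day = end_day - timedelta(days=days_back)
    return start_day.strftime("%Y%m%d"), end_day.strftime("%Y%m%d")


start_suffix, end_suffix = table_suffix_range(date(2020, 11, 4))
print(start_suffix, end_suffix)

# For comparison: the default string form of a date contains dashes,
# which would not line up with a YYYYMMDD table suffix.
print(str(date(2020, 11, 4)))
```

Because _TABLE_SUFFIX is compared as a string, bounds containing dashes would not select the intended range of tables under this naming assumption.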