GA4 traffic source data does not match BigQuery - google-bigquery

I tried to export traffic source data and event attribution from BigQuery and match it against GA4 (session_source and session_medium).
I am extracting the event params (source and medium) from BigQuery, but there is a big gap between the two data sources.
Is there any way to solve this?
I tried the SQL below:
with prep as (
select
user_pseudo_id,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_id,
max((select value.string_value from unnest(event_params) where key = 'source')) as source,
max((select value.string_value from unnest(event_params) where key = 'medium')) as medium,
max((select value.string_value from unnest(event_params) where key = 'name')) as campaign,
max((select value.string_value from unnest(event_params) where key = 'term')) as term,
max((select value.string_value from unnest(event_params) where key = 'content')) as content,
platform,
FROM `XXX`
group by
user_pseudo_id,
session_id,
platform
)
select
-- session medium (dimension | the value of a medium associated with a session)
platform,
coalesce(source,'(none)') as source_session,
coalesce(medium,'(none)') as medium_session,
coalesce(campaign,'(none)') as campaign_session,
coalesce(content,'(none)') as content,
coalesce(term,'(none)') as term,
count(distinct concat(user_pseudo_id,session_id)) as sessions
from
prep
group by
platform,
source_session,
medium_session,
campaign_session,
content,
term
order by
sessions desc

I'm also trying to figure out why BigQuery doesn't correctly match the source and medium to the event. The first issue I found is that it assigns the source/medium as google/organic even though there is a gclid parameter in the link. The second issue is that a large share of direct traffic is not recognized as direct: in such cases the events have no source/medium parameters at all.
The values are valid, but only for the source and medium that acquired the user.
When I compare data in UA and GA4, session attribution is correct, so it looks like a problem with the export to BigQuery. I reported this to support and am waiting for a response.

I have also noticed that source/medium does not align between BigQuery and GA4 and, as Justyna commented, a lot of my source/medium values come through as google/organic even when they are not. I am hoping Justyna will post here when there is a solution.
Looking at your code, I can see two other areas that would cause discrepancies:
1) count(distinct concat(user_pseudo_id,session_id)) as sessions
This only captures events with a valid user_pseudo_id and session_id. That is the correct way to count, but in my data there tend to be a few events where one of the ids is null, so your session count does not include them but GA4 does. Use your preferred method of counting nulls to work out whether this is an issue for you.
2) You are also doing an exact count, which again is correct, but GA4 uses an approximate count; see the link below for details.
https://developers.google.com/analytics/blog/2022/hll#using_bigquery_hll_functions_with_google_analytics_event_data
Using the above two techniques I can get a lot closer to the GA4 number of sessions, but they are still not attributed correctly.
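To make the two checks in this answer concrete, here is a minimal sketch. It assumes a tiny inline table in place of the real GA4 export (the rows and the session_id column are made up for illustration) and uses HLL_COUNT with its default precision; the settings GA4 itself uses are described in the linked article, not here.

```sql
-- Illustrative rows only; in practice these come from the GA4 events export.
WITH events AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS user_pseudo_id, 101 AS session_id),
    STRUCT('u1' AS user_pseudo_id, 101 AS session_id),
    STRUCT('u2' AS user_pseudo_id, 202 AS session_id),
    -- an event without ga_session_id
    STRUCT('u3' AS user_pseudo_id, CAST(NULL AS INT64) AS session_id),
    -- an event without user_pseudo_id
    STRUCT(CAST(NULL AS STRING) AS user_pseudo_id, 303 AS session_id)
  ])
)
SELECT
  -- CONCAT returns NULL if any argument is NULL, and COUNT(DISTINCT ...) skips NULLs,
  -- so events missing either id do not contribute a session here.
  COUNT(DISTINCT CONCAT(user_pseudo_id, CAST(session_id AS STRING))) AS exact_sessions,
  -- How many events are affected by a missing id.
  COUNTIF(user_pseudo_id IS NULL OR session_id IS NULL) AS events_missing_an_id,
  -- Approximate distinct count via an HLL++ sketch (default precision, an assumption).
  HLL_COUNT.EXTRACT(
    HLL_COUNT.INIT(CONCAT(user_pseudo_id, CAST(session_id AS STRING)))
  ) AS approx_sessions
FROM events;
```

If events_missing_an_id is non-zero on real data, that is one candidate source of the gap; the approximate count only matters at volumes where the sketch starts to deviate from the exact figure.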

Related

Matching BigQuery data with Traffic Acquisition GA4 report

I'm new to BigQuery and I'm trying to replicate the Traffic Acquisition GA4 report, but not very successfully at the moment, as my results are not even remotely close to the GA4 view.
I understand that the source/medium/campaign fields are event-based and not session-based in GA4 / BQ. My question is: why doesn't every event have source/medium/campaign as event parameter keys? It would seem logical to have these parameters on the 'session_start' event, but unfortunately that's not the case.
I tried the following options to replicate the Traffic Acquisition report:
2.1 To check the first medium for sessions:
with cte as ( select
PARSE_DATE("%Y%m%d", event_date) AS Date,
user_pseudo_id,
concat(user_pseudo_id,(select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
FIRST_VALUE((select value.string_value from unnest(event_params) where key = 'medium')) OVER (PARTITION BY concat(user_pseudo_id,(select value.int_value from unnest(event_params) where key = 'ga_session_id')) ORDER BY event_timestamp) as first_medium
FROM `project`)
select Date, first_medium, count(distinct user_pseudo_id) as Users, count (distinct session_id) as Sessions
from cte
group by 1,2;
The query returns 44k users with 'null' medium and 1.8k organic users while there are 17k users with the 'none' medium and 8k organic users in GA4.
2.2 If I change the first medium to the last medium:
FIRST_VALUE((select value.string_value from unnest(event_params) where key = 'medium')) OVER (PARTITION BY concat(user_pseudo_id,(select value.int_value from unnest(event_params) where key = 'ga_session_id')) ORDER BY event_timestamp desc) as last_medium
Organic medium increases to 9k users, though the results are still not matching the GA4 data.
2.3 I've also tried this code - https://www.ga4bigquery.com/traffic-source-dimensions-metrics-ga4/ - source / medium (based on session), and still got completely different results compared to the GA4.
Any help would be much appreciated!
I have noticed the same thing. Looking deeper, I pulled one day's worth of data from BigQuery into Google Sheets and examined it.
Unsurprisingly, I could replicate the results from the ga4bigquery code you mentioned above, but they did not align with GA4: although close for high-traffic pages, they could be wildly out for the lower ones.
I then did a count for 'email' in the event params source and ea_tracking_id, as well as in traffic_source, and found they are all lower than the GA4 figures.
I went to my dev site, where I know exactly how many sessions have a source of email. GA4 agreed but BigQuery did not; Google seems to be allocating some traffic to "not set" randomly.
I have concluded the problem is not in the SQL and not in the tagging but in the BigQuery GA4 data source. I have logged a query with Google and we will see what happens. Sorry it's not a solution.
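For reference, the "first medium" query in 2.1 returns NULL whenever the earliest event of a session carries no medium parameter. A minimal sketch of taking the first non-null medium per session instead is shown below; the inline rows are made up, and this only illustrates the SQL mechanics. As noted in this thread, it does not necessarily reconcile with the GA4 report.

```sql
-- Illustrative rows only; medium stands in for the 'medium' event parameter.
WITH events AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS user_pseudo_id, 101 AS session_id, 1 AS event_timestamp,
           CAST(NULL AS STRING) AS medium),   -- e.g. session_start without medium
    STRUCT('u1' AS user_pseudo_id, 101 AS session_id, 2 AS event_timestamp,
           'organic' AS medium),
    STRUCT('u1' AS user_pseudo_id, 101 AS session_id, 3 AS event_timestamp,
           'cpc' AS medium),
    STRUCT('u2' AS user_pseudo_id, 202 AS session_id, 1 AS event_timestamp,
           CAST(NULL AS STRING) AS medium),
    STRUCT('u2' AS user_pseudo_id, 202 AS session_id, 2 AS event_timestamp,
           CAST(NULL AS STRING) AS medium)
  ])
)
SELECT
  user_pseudo_id,
  session_id,
  COALESCE(
    -- earliest non-null medium in the session, if any
    ARRAY_AGG(medium IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)],
    '(none)'
  ) AS session_medium
FROM events
GROUP BY user_pseudo_id, session_id;
```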

Firebase Analytics vs. BigQuery - Average engagement time per screen

My engagement time shown in Firebase analytics, under Engagement -> Pages and Screens -> Page Title and Screen Name differs from that which is returned by the following BigQuery query.
SELECT
(SELECT value.string_value FROM UNNEST(event_params) WHERE key="firebase_screen") AS screen,
AVG((SELECT value.int_value FROM UNNEST(event_params) WHERE key="engagement_time_msec")) AS avg_engagement_time,
FROM
`ABC.events_20210923`
GROUP BY screen
ORDER BY avg_engagement_time DESC
However the numbers shown in Firebase Analytics are completely different from the numbers returned by the query. The descending order in which they are shown is about 65% right. Is this a completely different metric or is my query just wrong?
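One possible explanation, offered as an assumption rather than a confirmed answer: the query averages engagement_time_msec per event, whereas a report-style "average engagement time" may be total engagement time divided by the number of users. A minimal sketch on made-up inline rows shows how far the two definitions can diverge:

```sql
-- Illustrative rows only; each row stands for one event carrying engagement_time_msec.
WITH events AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS user_pseudo_id, 'home' AS screen, 4000 AS engagement_time_msec),
    STRUCT('u1' AS user_pseudo_id, 'home' AS screen, 6000 AS engagement_time_msec),
    STRUCT('u2' AS user_pseudo_id, 'home' AS screen, 2000 AS engagement_time_msec),
    STRUCT('u2' AS user_pseudo_id, 'settings' AS screen, 1000 AS engagement_time_msec)
  ])
)
SELECT
  screen,
  -- what the query in the question computes: mean over events
  AVG(engagement_time_msec) AS avg_per_event_msec,
  -- alternative (assumed) definition: total engagement time per distinct user
  SAFE_DIVIDE(SUM(engagement_time_msec), COUNT(DISTINCT user_pseudo_id)) AS avg_per_user_msec
FROM events
GROUP BY screen
ORDER BY avg_per_user_msec DESC;
```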

BigQuery Firebase Average Coins Per Level In The Game

I developed a words game (using firebase as my backend) with levels and coins.
Now, I'm facing some difficulties while trying to query my DB, so that it will output a table with all levels in the game and average user coins for each level. For example :
Level Avg User Coins
0 50
1 12
2 2
Attached is a picture of my events table:
So as you can see, there is an event of 'level_end', then we can see the 'user coins' and 'level_num'. What is the right way to do that?
This is what I managed to do so far, obviously the wrong way :
SELECT event_name,user_id
FROM `words-game-en.analytics_208527783.events_20191004`,
UNNEST(event_params) as event_param
WHERE event_name = "level_end"
AND event_param.key = "user_coins"
You seem to want something like this:
SELECT event_param.level_num, AVG(event_param.user_coins)
FROM `words-game-en.analytics_208527783.events_20191004` CROSS JOIN
UNNEST(event_params) as event_param
WHERE event_name = 'level_end' AND event_param.key = 'user_coins'
GROUP BY level_num
ORDER BY level_num;
I'm a little confused by what is in event_params and what is directly in events, so you might need to properly qualify the column references.
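If level_num and user_coins are both keys inside event_params, as the question suggests, each can be pulled out with a scalar subquery instead of a cross join. The following is a minimal sketch on made-up inline rows; storing both values in value.int_value is an assumption about the data.

```sql
-- Illustrative rows only; the real source would be the events_YYYYMMDD table.
WITH events AS (
  SELECT 'level_end' AS event_name,
         [STRUCT('level_num' AS key, STRUCT(0 AS int_value) AS value),
          STRUCT('user_coins' AS key, STRUCT(60 AS int_value) AS value)] AS event_params
  UNION ALL
  SELECT 'level_end' AS event_name,
         [STRUCT('level_num' AS key, STRUCT(0 AS int_value) AS value),
          STRUCT('user_coins' AS key, STRUCT(40 AS int_value) AS value)] AS event_params
  UNION ALL
  SELECT 'level_end' AS event_name,
         [STRUCT('level_num' AS key, STRUCT(1 AS int_value) AS value),
          STRUCT('user_coins' AS key, STRUCT(12 AS int_value) AS value)] AS event_params
)
SELECT
  -- one scalar subquery per parameter key
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'level_num') AS level_num,
  AVG((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'user_coins')) AS avg_user_coins
FROM events
WHERE event_name = 'level_end'
GROUP BY level_num
ORDER BY level_num;
```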

How to map each parameter in firebase analytics sql to a separate column?

We use firebase analytics and bigQuery to run sql queries on collected data. This is turning out to be complex as some fields like event_params are repeated records. I want to map each of these repeated fields to separate column.
I want to write queries in the above dataset like finding the difference between minIso and maxIso. How can I define a UDF or a view which can return me the table in the column schema?
"I want to map each of these repeated fields to separate column."
Pivoting parameters into columns is conceptually doable but (in my strong opinion) is a "dead end" in most practical cases.
There are many posts here on SO showing how to pivot/transpose rows to columns, and the patterns are: 1) you hardcode all possible keys in your query (and obviously no one likes this), or 2) you create a utility query that extracts all the keys and constructs the needed query for you, which you then need to execute. So either you do it manually in two steps, or you use a client of your choice to script those two steps to run in an automated way.
As I mentioned, there are plenty of examples of this here on SO.
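The first pattern (hardcoding the keys) can be sketched as follows. This assumes the newer event_params layout mentioned in the question, a hypothetical event name photo_taken, and that minIso and maxIso are stored in value.int_value; none of these are confirmed by the original post.

```sql
-- Illustrative row only; 'photo_taken' is a hypothetical event name.
WITH events AS (
  SELECT 'photo_taken' AS event_name,
         [STRUCT('minIso' AS key, STRUCT(100 AS int_value) AS value),
          STRUCT('maxIso' AS key, STRUCT(800 AS int_value) AS value)] AS event_params
)
SELECT
  event_name,
  -- each hardcoded key becomes its own column
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'minIso') AS min_iso,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'maxIso') AS max_iso,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'maxIso')
    - (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'minIso') AS iso_diff
FROM events;
```

Wrapping such a query in a view gives the column-style table asked for, at the cost of editing the view whenever a new key appears.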
"I want to write queries in the above dataset like finding the difference between minIso and maxIso"
If all you need is to do some math with a few parameters in the record, see the examples below.
Dummy example: for each app_instance_id, find the diff between coins_awarded and xp_awarded
#standardSQL
SELECT user_dim.app_info.app_instance_id, ARRAY(
SELECT AS STRUCT name,
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'coins_awarded') -
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'xp_awarded') diff_awarded
FROM UNNEST(event_dim) dim
WHERE dim.name = 'round_completed'
) AS event_dim
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160607`
WHERE 'round_completed' IN (SELECT name FROM UNNEST(event_dim))
with result as
Row app_instance_id event_dim.name event_dim.diff_awarded
1 02B6879DF2639C9E2244AD0783924CFC round_completed 226
2 02B6879DF2639C9E2244AD0783924CFC round_completed 171
3 0DE9DCDF2C407377AE3E779FB05864E7 round_completed 25
...
Dummy Example: leave whole user_dim intact but replace event_dim with just calculated values
#standardSQL
SELECT * REPLACE(ARRAY(
SELECT AS STRUCT name,
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'coins_awarded') -
(SELECT value.int_value FROM UNNEST(dim.params) param WHERE key = 'xp_awarded') diff_awarded
FROM UNNEST(event_dim) dim
WHERE dim.name = 'round_completed'
) AS event_dim)
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160607`
WHERE 'round_completed' IN (SELECT name FROM UNNEST(event_dim))
"This is turning out to be complex as some fields like event_params are repeated records. I want to map each of these repeated fields to separate column."
I hope the examples above show how simple it really is to deal with repeated fields. I strongly recommend learning and practicing work with arrays for the long-term benefits, rather than looking for what [wrongly] looks like a shortcut.

BigQuery Session & Hit level understanding

I want to ask about the concept of events at two levels:
Hit level
Session level
How can I map this logic in BigQuery (standard SQL), and also compute the following?
Sessions
Events Per Session
Unique Events
Can somebody please guide me to understand these concepts?
Sometimes totals.visitors is treated as the session, and sometimes visitId is taken as the session.
To achieve that you need to grapple a little with a few different concepts. The first is "what is a session" in GA lingo; you can find that here. A session is a collection of hits. A hit is one of the following: pageview, event, social interaction or transaction.
Now to see how that is represented in the BQ schema, you can look here. visitId and visitorId will help you define a session (as opposed to a user).
Then you can count the number of totals.hits that are events of the type you want.
It could look something like:
select visitId,
-- count the hits of type EVENT within each visit
sum(case when h.type = 'EVENT' then 1 else 0 end) as events
from `dataset.table_*`, unnest(hits) as h
group by 1
That should work to get you an overview. If you need to slice and dice the event details (i.e. hits.eventInfo.*) then I suggest you make a query for all the visitId and one for all the relevant events and their respective visitId
I hope that works!
Cheers
You can think of these concepts like this:
every row is a session
technically every row with totals.visits=1 is a valid session
hits is an array containing structs which contain information for every hit
You can write subqueries on arrays - basically treat them as tables. I'd recommend to study Working with Arrays and apply/transfer every exercise directly to hits, if possible.
Example for subqueries on session level
SELECT
fullvisitorid,
visitStartTime,
(SELECT SUM(IF(type='EVENT',1,0)) FROM UNNEST(hits)) events,
(SELECT COUNT(DISTINCT CONCAT(eventInfo.eventCategory,eventInfo.eventAction,eventInfo.eventLabel) )
FROM UNNEST(hits) WHERE type='EVENT') uniqueEvents,
(SELECT SUM(IF(type='PAGE',1,0)) FROM UNNEST(hits)) pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE
totals.visits=1
LIMIT
1000
Example for Flattening to hit level
There's also the possibility to use fields in arrays for grouping if you cross join arrays with their parent row
SELECT
h.type,
COUNT(1) hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS t CROSS JOIN t.hits AS h
WHERE
totals.visits=1
GROUP BY
1
Regarding the relation between visitId and Sessions you can read this answer.
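Building on the session-level example, the three metrics from the question can be sketched in one query against the same public sample table. Two assumptions are made here: a session is identified by fullVisitorId plus visitStartTime, and events per session divides by all sessions (the GA interface may divide by sessions that contain at least one event instead).

```sql
SELECT
  -- one row per session, so the distinct key count is the session count
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS sessions,
  -- total event hits across all sessions
  SUM((SELECT COUNTIF(type = 'EVENT') FROM UNNEST(hits))) AS total_events,
  -- average number of event hits per session (all sessions in the denominator)
  SAFE_DIVIDE(
    SUM((SELECT COUNTIF(type = 'EVENT') FROM UNNEST(hits))),
    COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)))
  ) AS events_per_session
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE
  totals.visits = 1;
```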