Matching BigQuery data with Traffic Acquisition GA4 report - google-bigquery

I'm new to BigQuery and I'm trying to replicate the GA4 Traffic Acquisition report, but so far without much success: my results are not even remotely close to the GA4 view.
I understand that the source/medium/campaign fields are event-based rather than session-based in GA4 / BigQuery. My question is: why doesn't every event have a source/medium/campaign as an event parameter key? It seems logical to me that the 'session_start' event would carry these parameters, but unfortunately that's not the case.
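A quick way to check this per event name (a sketch; the table path is a placeholder) is to count how many events carry a 'medium' parameter at all:
select
event_name,
-- events where a scalar subquery over event_params finds a 'medium' value
countif((select value.string_value from unnest(event_params) where key = 'medium') is not null) as events_with_medium,
count(*) as total_events
from `project.dataset.events_*`
group by event_name
order by total_events desc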
I tried the following options to replicate the Traffic Acquisition report:
2.1 To check the first medium for sessions:
with cte as (
select
PARSE_DATE("%Y%m%d", event_date) as Date,
user_pseudo_id,
concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
FIRST_VALUE((select value.string_value from unnest(event_params) where key = 'medium')) OVER (PARTITION BY concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) ORDER BY event_timestamp) as first_medium
FROM `project`
)
select Date, first_medium, count(distinct user_pseudo_id) as Users, count(distinct session_id) as Sessions
from cte
group by 1, 2;
The query returns 44k users with a null medium and 1.8k organic users, while GA4 shows 17k users with the 'none' medium and 8k organic users.
2.2 If I change the first medium to the last medium:
FIRST_VALUE((select value.string_value from unnest(event_params) where key = 'medium')) OVER (PARTITION BY concat(user_pseudo_id,(select value.int_value from unnest(event_params) where key = 'ga_session_id')) ORDER BY event_timestamp desc) as last_medium
The organic medium increases to 9k users, though the results still do not match the GA4 data.
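Note that the FIRST_VALUE ... ORDER BY event_timestamp DESC trick and LAST_VALUE are only equivalent if LAST_VALUE gets an explicit window frame, because the default frame stops at the current row. A sketch of the LAST_VALUE variant (IGNORE NULLS additionally skips events that don't carry the parameter):
LAST_VALUE((select value.string_value from unnest(event_params) where key = 'medium') IGNORE NULLS) OVER (PARTITION BY concat(user_pseudo_id,(select value.int_value from unnest(event_params) where key = 'ga_session_id')) ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as last_medium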
2.3 I've also tried this code - https://www.ga4bigquery.com/traffic-source-dimensions-metrics-ga4/ - source / medium (based on session), and still got completely different results compared to GA4.
Any help would be much appreciated!

I have noticed the same thing. Looking deeper, I pulled one day's worth of data from BigQuery into Google Sheets and examined it.
Unsurprisingly, I could replicate the results from the ga4bigquery code you mention above, but they did not align with GA4, and although close for high-traffic pages, they could be wildly out for the lower ones.
I then did a count for 'email' in the event params source and ea_tracking_id, as well as traffic_source, and found they are all lower than GA4 Analytics.
I went to my dev site, where I know exactly how many sessions have a source of email: GA4 Analytics agreed, but BigQuery did not. Google seems to be randomly allocating some traffic to (not set).
I have concluded the problem is not in the SQL and not in the tagging, but in the BigQuery GA4 data source. I have logged a query with Google and we will see what happens. Sorry it's not a solution.
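If you want to reproduce that check on your own property, a minimal sketch (the table path is a placeholder) comparing the event-parameter source against the user-scoped traffic_source field:
select
'event_params' as field_checked,
-- events attributed to email via the event parameter
countif((select value.string_value from unnest(event_params) where key = 'source') = 'email') as email_rows
from `project.dataset.events_*`
union all
select
'traffic_source',
-- events attributed to email via the user-scoped acquisition field
countif(traffic_source.source = 'email')
from `project.dataset.events_*`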

Related

GA4 traffic source data does not match with BigQuery

I have tried to export traffic source data and event attribution from BigQuery and match them with GA4 (session_source and session_medium).
I am extracting the event params (source and medium) from BigQuery, but there is a big gap between the two data sources.
Any solution to solve it?
I have tried to use the SQL below:
with prep as (
select
user_pseudo_id,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_id,
max((select value.string_value from unnest(event_params) where key = 'source')) as source,
max((select value.string_value from unnest(event_params) where key = 'medium')) as medium,
max((select value.string_value from unnest(event_params) where key = 'name')) as campaign,
max((select value.string_value from unnest(event_params) where key = 'term')) as term,
max((select value.string_value from unnest(event_params) where key = 'content')) as content,
platform,
FROM `XXX`
group by
user_pseudo_id,
session_id,
platform
)
select
-- session medium (dimension | the value of a medium associated with a session)
platform,
coalesce(source,'(none)') as source_session,
coalesce(medium,'(none)') as medium_session,
coalesce(campaign,'(none)') as campaign_session,
coalesce(content,'(none)') as content,
coalesce(term,'(none)') as term,
count(distinct concat(user_pseudo_id,session_id)) as sessions
from
prep
group by
platform,
source_session,
medium_session,
campaign_session,
content,
term
order by
sessions desc
I'm also trying to figure out why BigQuery can't correctly match the source and medium to the event. The first issue I found is that it assigns the source/medium as google/organic even though there is a gclid parameter in the link. The second issue is the huge gaps in recognizing the source as direct: in those cases the events have no such parameters at all.
The values are valid, but only for the source and medium that acquired the user.
When I compare data in UA and GA4, session attribution is correct. So it looks like a problem with the export to BigQuery. I reported this to support and am waiting for a response.
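One way to surface these gclid cases (a sketch; it assumes page_location carries the landing-page URL and that a gclid implies a Google Ads click rather than organic; the table path is a placeholder):
select
user_pseudo_id,
(select value.string_value from unnest(event_params) where key = 'page_location') as page_location,
(select value.string_value from unnest(event_params) where key = 'medium') as medium
from `project.dataset.events_*`
-- events whose URL carries a Google Ads click id but whose exported medium is organic
where (select value.string_value from unnest(event_params) where key = 'page_location') like '%gclid=%'
and (select value.string_value from unnest(event_params) where key = 'medium') = 'organic'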
I have also noticed that source/medium does not align between BigQuery and GA4, and like Justyna has commented, a lot of my source/medium comes through as google/organic even when it is not. I am hoping Justyna will post here when there is a solution.
Looking at your code I can see 2 other areas that would cause discrepancies
1)
count(distinct concat(user_pseudo_id,session_id)) as sessions
This will only capture events with a valid user_pseudo_id and session_id. That is the correct way to count, but in my data there tend to be a few events where these IDs are null, so your session count and GA4's will treat them differently. Use your preferred method of counting nulls to work out whether this is an issue for you (see the first sketch below).
2):
You are also doing an exact count, which again is correct, but GA4 does an approximate count; see the link below for details (and the second sketch below).
https://developers.google.com/analytics/blog/2022/hll#using_bigquery_hll_functions_with_google_analytics_event_data
Using the above two techniques I can get a lot closer to the GA4 number of sessions, but they are still not attributed correctly.
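For point 1, a sketch that quantifies how many events are missing the IDs the session count relies on (the table path is a placeholder):
select
countif(user_pseudo_id is null) as null_pseudo_id,
countif((select value.int_value from unnest(event_params) where key = 'ga_session_id') is null) as null_session_id,
count(*) as total_events
from `project.dataset.events_*`
For point 2, a sketch of the approximate count using the HLL++ functions from the linked post (precision 12 here is an assumption; the linked post lists the exact precision GA4 uses per metric; casting the session id to STRING keeps CONCAT happy):
select
hll_count.extract(
hll_count.init(
concat(user_pseudo_id, cast((select value.int_value from unnest(event_params) where key = 'ga_session_id') as string)),
12)) as approx_sessions
from `project.dataset.events_*`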

Firebase Analytics vs. BigQuery - Average engagement time per screen

My engagement time shown in Firebase Analytics, under Engagement -> Pages and Screens -> Page Title and Screen Name, differs from what is returned by the following BigQuery query.
SELECT
(SELECT value.string_value FROM UNNEST(event_params) WHERE key="firebase_screen") AS screen,
AVG((SELECT value.int_value FROM UNNEST(event_params) WHERE key="engagement_time_msec")) AS avg_engagement_time,
FROM
`ABC.events_20210923`
GROUP BY screen
ORDER BY avg_engagement_time DESC
However, the numbers shown in Firebase Analytics are completely different from the numbers returned by the query. The descending order in which they appear is about 65% the same. Is this a completely different metric, or is my query just wrong?
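For what it's worth, GA4/Firebase "average engagement time" style metrics are generally computed per user rather than per event, so a per-event AVG of engagement_time_msec (which is in milliseconds) will differ from the UI. A sketch of a per-user version, under that assumption and using the same table as above:
SELECT
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = "firebase_screen") AS screen,
-- total engagement time in seconds divided by distinct users; this division by
-- users is an assumption about how the UI aggregates, not a confirmed formula
SUM((SELECT value.int_value FROM UNNEST(event_params) WHERE key = "engagement_time_msec")) / 1000
/ COUNT(DISTINCT user_pseudo_id) AS avg_engagement_time_sec
FROM `ABC.events_20210923`
GROUP BY screen
ORDER BY avg_engagement_time_sec DESC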

BigQuery Session & Hit level understanding

I want to ask about your knowledge regarding the concept of Events:
Hit level
Session level
How in BigQuery (standard SQL) can I map this logic, and also:
Sessions
Events per Session
Unique Events
Can somebody please guide me to understand these concepts?
Sometimes totals.visitors is treated as the session, and sometimes visitId is taken as the session.
To achieve that you need to grapple a little with a few different concepts. The first is "what is a session" in GA lingo; you can find that in the GA documentation. A session is a collection of hits. A hit is one of the following: pageview, event, social interaction or transaction.
To see how that is represented in the BQ schema, look at the BigQuery export schema documentation. visitId and visitorId will help you define a session (as opposed to a user).
Then you can count the hits that are events of the type you want.
It could look something like:
select
visitId,
sum(case when h.type = "EVENT" then 1 else 0 end) as events
from `dataset.table_*`, unnest(hits) as h
group by 1
That should work to get you an overview. If you need to slice and dice the event details (i.e. hits.eventInfo.*), then I suggest you make one query for all the visitIds and one for all the relevant events and their respective visitIds.
I hope that works!
Cheers
You can think of these concepts like this:
every row is a session
technically every row with totals.visits=1 is a valid session
hits is an array containing structs which contain information for every hit
You can write subqueries on arrays - basically treat them as tables. I'd recommend studying Working with Arrays and applying every exercise directly to hits, if possible.
Example for subqueries on session level
SELECT
fullvisitorid,
visitStartTime,
(SELECT SUM(IF(type='EVENT',1,0)) FROM UNNEST(hits)) events,
(SELECT COUNT(DISTINCT CONCAT(eventInfo.eventCategory,eventInfo.eventAction,eventInfo.eventLabel) )
FROM UNNEST(hits) WHERE type='EVENT') uniqueEvents,
(SELECT SUM(IF(type='PAGE',1,0)) FROM UNNEST(hits)) pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE
totals.visits=1
LIMIT
1000
Example for Flattening to hit level
There's also the possibility to use fields in arrays for grouping if you cross join arrays with their parent row
SELECT
h.type,
COUNT(1) hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS t CROSS JOIN t.hits AS h
WHERE
totals.visits=1
GROUP BY
1
Regarding the relation between visitId and Sessions you can read this answer.

Google Analytics session-scoped fields returning multiple values

I've discovered that there are certain GA "session"-scoped fields in BigQuery that have multiple values for the same fullVisitorId and visitId fields.
Grouping the fields doesn't help either. In GA, I've checked the number of users against the number of users split by device, and the user counts differ.
This explains what's going on: a user is grouped under multiple devices. My conclusion is that at some point during the user's session their browser user agent changes, and on the subsequent hit a new device type is set in GA.
I'd have hoped GA would use either the first or last value to avoid this scenario, but I guess it doesn't. My question is: if I'm accepting this as a "flaw" in GA, I'd rather pick one value. What's the best way to select the last or first device value from the below query?
SELECT
fullVisitorId,
visitId,
device.deviceCategory
FROM (
SELECT
*
FROM
`project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT
*
FROM
`project.dataset.ga_sessions_*` mobile ) table
I've tried doing a sub-select and using STRING_AGG(), attempting to order by hits.time and limiting to one value and that still creates another row.
I've tested and found that the below fields all have the same issue:
visitNumber
totals.hits
totals.pageviews
totals.timeOnSite
trafficSource.campaign
trafficSource.medium
trafficSource.source
device.deviceCategory
totals.sessionQualityDim
channelGrouping
device.mobileDeviceInfo
device.mobileDeviceMarketingName
device.mobileDeviceModel
device.mobileInputSelector
device.mobileDeviceBranding
UPDATE
See the queries below around this particular fullVisitorId and visitId (result screenshots omitted): first with the UNION removed, then with visitStartTime added, and then with both visitStartTime and hits.time added.
Well, from the looks of things, I think you have 3 options:
1 - Group by fullVisitorId, visitId and use MAX or MIN of deviceCategory. That should prevent a device switcher from being double-counted. It's kind of arbitrary, but then so is the GA data.
2 - Similar, but if the deviceCategory result can be anything (i.e. isn't constrained in the results to just the valid deviceCategory members), you can use a CASE to check MAX(deviceCategory) = MIN(deviceCategory) and, if they differ, return 'Multiple Devices'.
3 - You could go further: counting the number of different devices used, constructing a concatenation that lists them, etc. (a sketch of this appears at the end of this answer).
I'm going to write up Number 2 for you. In your question, you have 2 different queries: one with [date] and one without - I'll provide both.
Without [date]:
SELECT
fullVisitorId,
visitId,
case when max(device.deviceCategory) = min(device.deviceCategory)
then max(device.deviceCategory)
else 'Multiple Devices'
end as deviceCategory,
{metric aggregations here}
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY fullVisitorId, visitId
With [date]:
SELECT
[date],
fullVisitorId,
visitId,
case when max(device.deviceCategory) = min(device.deviceCategory)
then max(device.deviceCategory)
else 'Multiple Devices'
end as deviceCategory,
{metric aggregations here}
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY [date], fullVisitorId, visitId
I'm assuming here that the Selects and Union that you gave are sound.
Also, I should point out that those {metric aggregations} should be something other than SUMs, otherwise you will still be double-counting.
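For completeness, option 3 could look something like this sketch (same placeholder table names as above):
SELECT
fullVisitorId,
visitId,
-- how many distinct device categories the session touched
COUNT(DISTINCT device.deviceCategory) AS device_count,
-- an alphabetical list of the devices seen in the session
STRING_AGG(DISTINCT device.deviceCategory ORDER BY device.deviceCategory) AS devices
FROM `project.dataset.ga_sessions_*`
GROUP BY fullVisitorId, visitId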
I hope this helps.
It's simply not possible to have two values for one row in this field, because it can only contain one value.
There are 2 possibilities:
you're actually querying two separate datasets / two different views - that's not clearly visible from the example code. The client id (= fullVisitorId) is only unique per property (tracking id, the UA-xxxxx stuff). If you query two views from different properties, you have to expect the same ids to be used twice.
Given they are coming from one property, these two rows could actually be one session split at midnight, which means visitId stays the same but visitStartTime changes. But that would also mean the decision algorithm for device type changed in the meantime ... that would be curious.
Try using visitStartTime and see what happens.
If you're using two different properties, use a user id to combine the sessions, or separate them by adding a constant - without a shared id you can't combine them.
SELECT 'property_A' AS constant FROM ...
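Spelled out, that could look like this (the dataset paths are placeholders):
-- tag each property before the UNION ALL so its sessions stay separable
SELECT 'property_A' AS property, fullVisitorId, visitId FROM `project.datasetA.ga_sessions_*`
UNION ALL
SELECT 'property_B' AS property, fullVisitorId, visitId FROM `project.datasetB.ga_sessions_*`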
hth

Query for selecting sequence of hits consumes large quantity of data

I'm trying to measure the conversion rate through alternative funnels on a website. My query has been designed to output a count of sessions that viewed the relevant start URL and a count of sessions that hit the confirmation page strictly in that order. It does this by comparing the times of the hits.
My query appears to return accurate figures, but in doing so selects a massive quantity of data, just under 23GB for what I've attempted to limit to one hour of one day. I don't seem to have written my query in a particularly efficient way and gather that I'll use up all of my company's data quota fairly quickly if I continue to use it.
Here's the offending query in full:
WITH
s1 AS (
SELECT
fullVisitorId,
visitId,
LOWER(h.page.pagePath) AS path,
device.deviceCategory AS platform,
MIN(h.time) AS s1_time
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
AND
(LOWER(h.page.pagePath) LIKE '{funnel-start-url-1}%' OR LOWER(h.page.pagePath) LIKE '{funnel-start-url-2}%')
AND
totals.visits = 1
AND
h.hour < 21
AND
h.hour >= 20
AND
h.type = "PAGE"
GROUP BY
path,
platform,
fullVisitorId,
visitId
ORDER BY
fullVisitorId ASC, visitId ASC
),
confirmations AS (
SELECT
fullVisitorId,
visitId,
MIN(h.time) AS confirmation_time
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
AND
h.type = "PAGE"
AND
(LOWER(h.page.pagePath) LIKE '{confirmation-url-1}%' OR LOWER(h.page.pagePath) LIKE '{confirmations-url-2}%')
AND
totals.visits = 1
AND
h.hour < 21
AND
h.hour >= 20
GROUP BY
fullVisitorId,
visitId
)
SELECT
platform,
path,
COUNT(path) AS Views,
SUM(
CASE
WHEN s1.s1_time < confirmations.confirmation_time
THEN 1
ELSE 0
END
) AS SubsequentPurchases
FROM
s1
LEFT JOIN
confirmations
ON
s1.fullVisitorId = confirmations.fullVisitorId
AND
s1.visitId = confirmations.visitId
GROUP BY
platform,
path
What is it about this query that means it has to process so much data? Is there a better way to get at these numbers? Ideally any method should be able to measure multiple different routes, but I'd settle for sustainability at this point.
There are probably a few ways that you can optimize your query but it seems like it won't entirely solve your issue (as I'll further try to explain).
As for the query, this one does the same but avoids re-selecting data and the LEFT JOIN operation:
SELECT
path,
platform,
COUNT(path) views,
COUNT(CASE WHEN last_hn > first_hn THEN 1 END) SubsequentPurchases
from (
SELECT
fv,
v,
platform,
path,
first_hn,
MAX(last_hn) OVER(PARTITION BY fv, v) last_hn
from (
SELECT
fullvisitorid fv,
visitid v,
device.devicecategory platform,
LOWER(hits.page.pagepath) path,
MIN(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product') THEN hits.hitnumber ELSE null END) first_hn,
MAX(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'success') then hits.hitnumber ELSE null END) last_hn
FROM `project_id.data_set.ga_sessions_20170112`,
UNNEST(hits) hits
WHERE
REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product|success')
AND totals.visits = 1
AND hits.type = 'PAGE'
GROUP BY
fv, v, path, platform
)
)
GROUP BY
path, platform
HAVING NOT REGEXP_CONTAINS(path, r'success')
first_hn tracks the funnel start URL (for which I used the terms "catalog" and "product") and last_hn tracks the confirmation URLs (for which I used the term "success", but you could add more values to the regex). Also, by using the MIN and MAX operations together with the analytic functions you get some optimization in your query.
There are a few points though to make here:
If you insert WHERE hits.hour = 20, BigQuery still has to scan the whole table to tell what is 20 from what is not. That means the 23 GB you observed still accounts for the whole day.
For comparison, I tested your query against our ga_sessions and it took around 31 days of data to reach 23 GB. As you are not selecting that many fields, it shouldn't be that easy to reach this amount unless you have a considerably high traffic volume in your data source.
Given current BigQuery pricing, 23 GB would cost roughly $0.11 to process, which is quite cost-efficient.
Another thing I could imagine is that you are running this query several times a day and have no cache or proper architecture for these operations.
All this being said, you can optimize your query, but I suspect it won't change much in the end, as it seems you have quite a high volume of data. Processing 23 GB a few times shouldn't be a problem, but if you are concerned it will reach your quota, then it sounds like you are running this query several times a day.
This being the case, see if using either some cache flag or saving the results into another table and then querying that instead will help. Also, you could start saving daily tables with just the sessions you are interested in (those having the URL patterns you are looking for) and then run your final query against these newly created tables, which would let you query over a bigger range of days while spending much less.
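A sketch of that last suggestion: materialize just the sessions of interest into a daily table, then run the funnel query against it (all names are placeholders):
-- keep only sessions that touched a funnel or confirmation page
CREATE TABLE `project.dataset.funnel_sessions_20170107` AS
SELECT
fullVisitorId,
visitId,
device.deviceCategory,
hits
FROM `project.dataset.ga_sessions_20170107`
WHERE EXISTS (
SELECT 1 FROM UNNEST(hits) AS h
WHERE REGEXP_CONTAINS(LOWER(h.page.pagePath), r'/catalog/|product|success')
)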