Disagreement between BigQuery and Google Analytics 4 for Pageviews - why? - google-bigquery

I have a large table of Google Analytics 4 (GA4) events in BigQuery for a number of websites I look after. The table has the following schema:
| field name | type |
| event_date | date |
| event_timestamp | integer |
| event_name | string |
| event_key | string |
| event_string_value | string |
| event_int_value | integer |
| event_float_value | float |
| event_double_value | float |
| user_pseudo_id | string |
| user_first_touch_timestamp | integer |
| device_category | string |
| device_model_name | string |
| device_host_name | string |
| device_web_hostname | string |
| geo_country | string |
| geo_city | string |
| traffic_source_name | string |
I query the table to get the total number of pageviews for a specific site using the following query:
with date_range as (
  select
    '20220601' as start_date,
    '20220630' as end_date)
select
  count(distinct case when event_name = 'page_view' then concat(user_pseudo_id, cast(event_timestamp as string)) end) as pageviews
from
  `project_name.dataset_name.table_name`,
  date_range
WHERE
  event_date BETWEEN PARSE_DATE('%Y%m%d', date_range.start_date) AND PARSE_DATE('%Y%m%d', date_range.end_date)
  AND device_web_hostname in ("www.website_name.com")
What is a mystery to me is that for some sites the pageview figure is out by several hundred. The BigQuery figure is higher. What is interesting is that:
- If I try other events, such as sessions, there are no issues
- As stated, it only happens for some sites, not all
I know enough to know:
- These numbers are never going to agree exactly, but they shouldn't be out by several hundred either
- The BigQuery table holds the unprocessed GA4 data, so the way I am querying it differs from how it is processed in the GA4 interface
I have tried:
- Looking at the GA4 documentation to see how pageviews are used/processed; I can't see anything that enlightens me
- Debugging each site to make sure tags are firing correctly; they are
I've hit a bit of a wall with this and I'd be grateful if anyone has any insight to point me in another possible direction. Thanks in advance!

The issue lies in the following part of the code:
select
count(distinct case when event_name = 'page_view' then concat(user_pseudo_id, cast(event_timestamp as string)) end) as pageviews
You are counting distinct values of the concatenation of user_pseudo_id and event_timestamp, and that combination is not guaranteed to be unique. You also need to include session_id in the key to identify a unique hit.
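To illustrate how the choice of key changes a distinct count, here is a minimal Python sketch. The sample rows and the `session_id` field are hypothetical (the schema in the question does not list a session column), and the sketch only shows the mechanics of the key, not whether this accounts for the gap seen in the question.

```python
from collections import namedtuple

# Hypothetical event rows; session_id is assumed to be available.
Event = namedtuple("Event", "event_name user_pseudo_id session_id event_timestamp")

events = [
    Event("page_view", "u1", 100, 1654070400000000),
    # Same user and timestamp as above, but a different session.
    Event("page_view", "u1", 101, 1654070400000000),
    Event("page_view", "u2", 200, 1654070400000001),
    # Not a page_view, so it is ignored by the count.
    Event("scroll", "u2", 200, 1654070400000002),
]


def count_pageviews(rows, with_session):
    """Count distinct page_view keys, optionally including session_id in the key."""
    keys = set()
    for e in rows:
        if e.event_name != "page_view":
            continue
        if with_session:
            keys.add((e.user_pseudo_id, e.session_id, e.event_timestamp))
        else:
            keys.add((e.user_pseudo_id, e.event_timestamp))
    return len(keys)


without_session = count_pageviews(events, with_session=False)
with_session = count_pageviews(events, with_session=True)
print(without_session, with_session)
```

With these sample rows the two-part key collapses the first two hits into one, while the three-part key keeps them apart.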

Related

GA4 vs BigQuery - User Count don't match

I have extracted from Bigquery the active_users and totalusers on 31/12/2022, grouped by CampaignName and Country, using the following query:
select
count(distinct case when (select value.int_value from unnest(event_params) where key = 'engagement_time_msec') > 0 or (select value.string_value from unnest(event_params) where key = 'session_engaged') = '1' then user_pseudo_id else null end) AS active_users
,count(distinct user_pseudo_id) AS totalusers
,traffic_source.name AS CampaignName
,geo.country AS Country
FROM `independent-tea-354108.analytics_254831690.events_20221231`
GROUP BY
traffic_source.name
,geo.country
The result filtered by CampaignName='(organic)' was:
(https://i.stack.imgur.com/LMQAH.png)
But when I compare with the data from GA4, it doesn't match and the difference is huge (around 15000 more active_users in GA4 than in BigQuery). Please note that this is only for one day, if it was a month the difference would be even higher:
(https://i.stack.imgur.com/8arYs.png)
I've tried filtering by other CampaignNames and not a single value matches and the differences are always huge.
These are two common reasons for differences between GA4 and BigQuery. You have probably looked at them already.
Check your source table for blank user_pseudo_id values. If you have consent mode on your website, those events may be counted in GA4 but not in BigQuery, and this can cause big differences.
Time zone is another area that can make a difference: BigQuery is always in UTC, while your GA4 property may not be.
I hope these help.
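As a rough way to gauge the first point, the sketch below counts how many events carry a blank or missing `user_pseudo_id`. The sample rows are made up for illustration, and the function name `summarize` is a hypothetical helper rather than anything from GA4 or BigQuery.

```python
# Hypothetical rows standing in for exported events.
rows = [
    {"user_pseudo_id": "u1"},
    {"user_pseudo_id": ""},    # blank id
    {"user_pseudo_id": None},  # missing id
    {"user_pseudo_id": "u2"},
    {"user_pseudo_id": "u1"},
]


def summarize(rows):
    """Return total events, events without an id, and distinct non-blank ids."""
    blank = sum(1 for r in rows if not r["user_pseudo_id"])
    distinct_users = len({r["user_pseudo_id"] for r in rows if r["user_pseudo_id"]})
    return {
        "total_events": len(rows),
        "blank_id_events": blank,
        "distinct_users": distinct_users,
    }


summary = summarize(rows)
print(summary)
```

If the share of blank ids is large, a distinct count over user_pseudo_id will miss whatever users sit behind those events.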

Results within Bigquery do not remain the same as in GA4

I'm inside BigQuery performing the query below to see how many users I had from August 1st to August 14th, but the number is not matching what GA4 presents me.
with event AS (
SELECT
user_id,
event_name,
PARSE_DATE('%Y%m%d',
event_date) AS event_date,
TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY TIMESTAMP_MICROS(event_timestamp) DESC) AS rn,
FROM
`events_*`
WHERE
event_name= 'push_received')
SELECT COUNT ( DISTINCT user_id)
FROM
event
WHERE
event_date >= '2022-08-01'
GA4 result
BigQuery result = 37024
There are quite a few reasons why the GA4 data in the web UI will not match the BigQuery export or the Data API.
In this case, I believe you are running into the Time Zone issue. event_date is the date that the event was logged in the registered timezone of your Property. However, event_timestamp is a time in UTC that the event was logged by the client.
To resolve this, simply update your query with:
EXTRACT(DATETIME FROM TIMESTAMP_MICROS(`event_timestamp`) AT TIME ZONE 'TIMEZONE OF YOUR PROPERTY')
Your data should then match the WebUI and the GA4 Data API. This post that I co-authored goes into more detail on this and other reasons why your data doesn't match: https://analyticscanvas.com/3-reasons-your-ga4-data-doesnt-match/
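The effect of the time zone on the reporting date can be sketched in Python. This is only an illustration: the fixed UTC-3 offset is an assumption chosen for the example (a real property would use its own named time zone, including any daylight saving rules), and `property_date` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumption for this example: the property reports in a fixed UTC-3 offset.
PROPERTY_TZ = timezone(timedelta(hours=-3))


def property_date(event_timestamp_micros, tz=PROPERTY_TZ):
    """Convert a UTC microsecond timestamp to the calendar date in tz."""
    utc_dt = EPOCH + timedelta(microseconds=event_timestamp_micros)
    return utc_dt.astimezone(tz).date()


# An event logged at 01:30 UTC on 1 August 2022, expressed in microseconds.
ts = (datetime(2022, 8, 1, 1, 30, tzinfo=timezone.utc) - EPOCH) // timedelta(microseconds=1)

utc_date = property_date(ts, tz=timezone.utc)
local_date = property_date(ts)
print(utc_date, local_date)
```

The same event falls on 1 August in UTC but on 31 July in the UTC-3 property, so a date filter built from the raw timestamp can include or exclude it differently from the GA4 interface.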
You cannot simply compare totals. Break the comparison down by day and look at the details.

Big Query / SQL finding "new" data in a date range

I have a pretty big event log with columns:
id, timestamp, text, user_id
The text field contains a variety of things, like:
Road: This is the road name
City: This is the city name
Type: This is a type
etc..
I would like to get the result to the following:
Given a start and end date, how many **new** users used a road (that haven't before) grouped by road.
I've got various parts of this working fine (like the total number of users, the grouping, the date range and so on). The SQL for getting the new users is eluding me though, having tried solutions like SELECT AS STRUCT on subqueries, amongst other things.
Ultimately, I'd love to see a result like:
road, total_users, new_users
Any help would be much appreciated.
If I understand correctly, you want something like this:
-- @timestamp1 and @timestamp2 are query parameters for the start and end of the range
select road, countif(seqnum = 1) as new_users, count(distinct user_id) as num_users
from (select l.*,
             row_number() over (partition by l.user_id, l.road order by l.timestamp) as seqnum
      from log l
      where l.type = 'Road'
     ) l
where timestamp >= @timestamp1 and timestamp < @timestamp2
group by road;
This assumes that you have a column that specifies the type (i.e. "road") and another column with the name of the road (i.e. "Champs-Elysees").
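The same idea, finding each user's first use of a road and checking whether it falls inside the range, can be sketched in Python. The tuple layout `(timestamp, road, user_id)` and the function name `road_stats` are assumptions made for this example; it presumes the road name has already been parsed out of the text field.

```python
from collections import defaultdict
from datetime import datetime


def road_stats(log, start, end):
    """Return {road: (total_users, new_users)} for events with start <= ts < end.

    A user is "new" for a road if their first ever event on that road
    falls inside the range.
    """
    # First time each (road, user) pair was ever seen, across all history.
    first_seen = {}
    for ts, road, user in log:
        key = (road, user)
        if key not in first_seen or ts < first_seen[key]:
            first_seen[key] = ts

    total = defaultdict(set)
    new = defaultdict(set)
    for ts, road, user in log:
        if start <= ts < end:
            total[road].add(user)
            if first_seen[(road, user)] >= start:
                new[road].add(user)
    return {road: (len(users), len(new[road])) for road, users in total.items()}


# Hypothetical sample data.
log = [
    (datetime(2022, 1, 5), "Main St", "a"),
    (datetime(2022, 2, 1), "Main St", "a"),
    (datetime(2022, 2, 2), "Main St", "b"),
    (datetime(2022, 2, 3), "High St", "a"),
    (datetime(2022, 3, 1), "High St", "c"),
]

stats = road_stats(log, datetime(2022, 2, 1), datetime(2022, 3, 1))
print(stats)
```

User "a" counts towards the total for Main St but not as new, because their first use of that road predates the range.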

Query multiple params in multiple tables with TABLE_DATE_RANGE for Firebase Analytics

From the events in my applications, I want to get a stat for the most played audios within an article. With each event I send the articleId and the audioID that was played.
I want to obtain result rows like this, ordered by number of occurrences:
| ID of the article | ID of the audio | number of occurrences
Since Firebase Analytics exports to BigQuery on a daily basis and I want those events per month, I created a query that takes the values from multiple tables and combined it with the info I found in this thread.
The resulting query is:
SELECT
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Article_ID') AS Article_ID,
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Audio_ID') AS Audio_ID,
COUNT(event_dim.name) as Number_Of_Plays
FROM
TABLE_DATE_RANGE([project-id:my_app_id.app_events_], DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'), CURRENT_TIMESTAMP()), UNNEST(event_dim) AS x
WHERE event_dim.name = 'Audio_Play'
GROUP BY Audio_ID, Article_ID
ORDER BY Number_Of_Plays desc
Unfortunately this query is not parsed correctly and gives me an error:
Error: Table name cannot be resolved: dataset name is missing.
I am pretty sure the issue is related to querying multiple tables in a range, but not sure how to fix it. Thanks.
The other answer you reference uses Standard SQL, whereas you are trying to use TABLE_DATE_RANGE, which is only available in Legacy SQL.
This is the query in Standard SQL, which lets you query multiple tables with a wildcard:
#standardSql
SELECT
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Article_ID') AS Article_ID,
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Audio_ID') AS Audio_ID,
COUNT(x.name) as Number_Of_Plays
FROM
`project-id.my_app_id.app_events_*`, UNNEST(event_dim) AS x
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)) AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
AND x.name = 'Audio_Play'
GROUP BY Audio_ID, Article_ID
ORDER BY Number_Of_Plays desc
Note the wildcard table in the FROM clause (`project-id.my_app_id.app_events_*`) and the WHERE _TABLE_SUFFIX BETWEEN filter. The suffix is a string in YYYYMMDD form, so the dates are formatted with FORMAT_DATE rather than cast directly to string, which would produce YYYY-MM-DD.
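Because the suffix filter is a plain string comparison, the bounds have to use the same format as the table names. A minimal Python sketch of that comparison follows, assuming the daily tables are named with a YYYYMMDD suffix; the table names and helper functions are made up for illustration.

```python
from datetime import date, timedelta


def suffix_bounds(today, days=30):
    """Return (start, end) suffix strings in YYYYMMDD form."""
    start = today - timedelta(days=days)
    return start.strftime("%Y%m%d"), today.strftime("%Y%m%d")


def tables_in_range(table_names, prefix, today, days=30):
    """Pick the tables whose suffix falls between the two bounds (inclusive)."""
    lo, hi = suffix_bounds(today, days)
    return [
        t for t in table_names
        if t.startswith(prefix) and lo <= t[len(prefix):] <= hi
    ]


# Hypothetical daily export tables.
names = [
    "app_events_20220530",
    "app_events_20220531",
    "app_events_20220615",
    "app_events_20220701",
]

bounds = suffix_bounds(date(2022, 6, 30))
selected = tables_in_range(names, "app_events_", date(2022, 6, 30))
print(bounds, selected)
```

Bounds written with dashes, such as "2022-05-31", would compare against "20220531" character by character and not select the intended tables.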

SQL Web Traffic Query

I came upon this question a couple of days back and couldn't find an optimal solution. We have the following table structure for storing some basic web traffic logs of users who visit a particular website.
Table name: [tblWebtraffic]
Columns: Id,IPAddress,PageName,Date
I want a single query (i.e. a single SELECT statement) that returns the total visits, the total unique visitors (based on IPAddress) and the total unique pages visited over the last 60 days.
PS:This is my first question in this site so forgive me if there are some details missing in the question. :)
EDIT: I am using a SQL Server Database.
SELECT pageName, count(*) AS pageHits FROM tblWebtraffic GROUP BY pageName
will give you hits per page, a slight alteration will give unique page hits
SELECT pageName, count(DISTINCT ipAddress) AS uniquePageHits FROM tblWebtraffic GROUP BY pageName
of course, removing the group by on pagename will give you the entire site hits
SELECT count(*) AS siteHits FROM tblWebtraffic
SELECT count(DISTINCT ipAddress) uniqueSiteHits FROM tblWebtraffic
PS: my first attempt at answer, so if anything missing please let me know :)
edit: these will work on MS SQLServer Transact SQL .. MySQL I'm less familiar with but I just tried out SELECT count(Distinct fieldname) and it worked
edit2: thanks for the edits - the code formatting looks great
edit3: answering the question :)
SELECT count(*) AS siteHits, count(DISTINCT ipAddress) AS uniqueSiteHits, count(DISTINCT pageName) AS uniquePages FROM tblWebtraffic WHERE DATEDIFF(d, [Date], getDate()) <= 60
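For comparison, the three aggregates in that final query can be sketched in plain Python. The sample rows are hypothetical and `traffic_summary` is an illustrative helper; the day difference is computed on calendar dates, which is an assumption about how the Date column is stored.

```python
from datetime import date


def traffic_summary(rows, today, days=60):
    """Total hits, unique visitors and unique pages within the last `days` days."""
    recent = [r for r in rows if (today - r["Date"]).days <= days]
    return {
        "siteHits": len(recent),
        "uniqueSiteHits": len({r["IPAddress"] for r in recent}),
        "uniquePages": len({r["PageName"] for r in recent}),
    }


# Hypothetical rows mirroring the tblWebtraffic columns.
rows = [
    {"IPAddress": "10.0.0.1", "PageName": "/home", "Date": date(2024, 2, 20)},
    {"IPAddress": "10.0.0.1", "PageName": "/about", "Date": date(2024, 2, 21)},
    {"IPAddress": "10.0.0.2", "PageName": "/home", "Date": date(2024, 2, 25)},
    {"IPAddress": "10.0.0.3", "PageName": "/home", "Date": date(2023, 11, 1)},  # older than 60 days
]

result = traffic_summary(rows, today=date(2024, 3, 1))
print(result)
```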