Aggregation of reference source of GA4 session and aggregation on exported BQ do not match - google-bigquery

I'm stuck because the numbers don't match when I compare the data in GA4 with the data exported to BigQuery.
If you have any clues, I would appreciate it if you could let me know.
【premise】
I'm running an e-commerce site and using GA4.
I have also set up the export from GA4 to BigQuery using the GA4 export function.
(https://support.google.com/analytics/answer/9358801)
【overview】
Using GA4's Explore tool, I created a report that counts the number of sessions for each referrer.
Next, in BigQuery, where the GA4 data is exported, I aggregated the number of sessions for each session referrer.
Comparing the GA4 report with the BQ results, the number of sessions for certain session referrers is significantly lower in the BQ results.
【detail】
When aggregating the number of sessions per session referrer in BigQuery, I grouped by the string_value of the event_params key 'source'.
[SQL on BQ]
SELECT
  source,
  COUNT(DISTINCT session_id) AS sessions,
  SUM(ss_session_start) AS session_starts
FROM (
  SELECT
    -- 'source' is a string parameter, so read string_value (not int_value)
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') AS source,
    CONCAT(
      user_pseudo_id, '-',
      CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
    ) AS session_id,
    IF(event_name = 'session_start', 1, 0) AS ss_session_start
  FROM `XXXXXXXX.events_*`
  WHERE _table_suffix = 'YYYYMMDD'
)
GROUP BY source
ORDER BY source
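The per-source session count above can be mirrored in plain Python to sanity-check the logic on a few rows. This is a hypothetical sketch (the dict layout only imitates the export's event_params nesting; it is not an official schema): 'source' is a string parameter, so it must be read from string_value, and events carrying no 'source' param at all fall into the NULL bucket.

```python
from collections import defaultdict

def param(event, key, field):
    """Mimic (SELECT value.<field> FROM UNNEST(event_params) WHERE key = ...)."""
    for p in event["event_params"]:
        if p["key"] == key:
            return p["value"].get(field)
    return None  # missing param -> NULL in SQL

def sessions_per_source(events):
    """Count distinct (user_pseudo_id, ga_session_id) pairs per 'source' param."""
    sessions = defaultdict(set)
    for e in events:
        source = param(e, "source", "string_value")  # string param -> string_value
        ga_session_id = param(e, "ga_session_id", "int_value")
        sessions[source].add((e["user_pseudo_id"], ga_session_id))
    return {src: len(ids) for src, ids in sessions.items()}

events = [
    {"user_pseudo_id": "A", "event_params": [
        {"key": "source", "value": {"string_value": "insta"}},
        {"key": "ga_session_id", "value": {"int_value": 1}}]},
    {"user_pseudo_id": "A", "event_params": [  # no 'source' param at all
        {"key": "ga_session_id", "value": {"int_value": 2}}]},
]
print(sessions_per_source(events))  # events lacking 'source' land in the None bucket
```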
[things I tried]
・Depending on the session referrer, the figures were sometimes almost identical.
For example, 'app' is very different in BQ, but 'insta' is almost the same.
・The total number of sessions is almost the same, and there are a large number of null session referrers.
 → So my guess is that what GA4 classifies as 'app' etc. ends up as null in BQ.
・I also tried aggregating by a utm_source that I extracted myself from page_location with a regex, but the same tendency appeared.
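For the regex extraction mentioned in the last bullet, a minimal sketch (the function name and example URLs are illustrative, not from the original post) of pulling utm_source out of page_location:

```python
import re

def utm_source(page_location):
    """Pull utm_source out of a page URL; None if absent."""
    m = re.search(r"[?&]utm_source=([^&#]+)", page_location)
    return m.group(1) if m else None

print(utm_source("https://example.com/?utm_source=insta&utm_medium=social"))  # insta
print(utm_source("https://example.com/"))  # None
```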

Related

Sessions count in GA4 is quite bigger than events_intraday table in bigQuery

Issue
For example, GA4's session count for November is 559,555, taken from Report / Acquisition / Traffic Acquisition. But if I calculate the session count from the BigQuery table, it is 468,991.
That is a big difference. I'd guess BigQuery's number is closer to our actual traffic and to our Google Analytics 360 numbers.
This actually started when we implemented the ecommerce events on our site, but we are not sure whether that is related.
question
Should GA4's on-screen numbers and the data in BigQuery be the same (or at least close)?
How can we solve this issue? We would like the numbers to be close.
FYI
We used this query to calculate the session count in BigQuery.
SELECT
  HLL_COUNT.EXTRACT(
    HLL_COUNT.INIT(
      CONCAT(
        user_pseudo_id,
        (SELECT `value` FROM UNNEST(event_params) WHERE key = 'ga_session_id' LIMIT 1).int_value),
      12)) AS session_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
https://developers.google.com/analytics/blog/2022/hll
I'd really appreciate it if you could give us any advice.
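Two details of that query can push the BigQuery number away from the UI independently of any tracking issue, and both are worth keeping in mind: HLL_COUNT.INIT at precision 12 is an approximate count, and concatenating user_pseudo_id with ga_session_id without a separator can merge distinct sessions into one ID. A small illustration of the collision risk (the IDs here are made up; real pseudo-IDs make collisions rare but not impossible):

```python
def session_id_unsafe(user_pseudo_id, ga_session_id):
    # No separator between the two parts.
    return f"{user_pseudo_id}{ga_session_id}"

def session_id_safe(user_pseudo_id, ga_session_id):
    # A delimiter keeps the parts unambiguous.
    return f"{user_pseudo_id}-{ga_session_id}"

# Two different (user, session) pairs...
a = session_id_unsafe("12", 34)
b = session_id_unsafe("1", 234)
print(a == b)  # True: both collapse into the same ID "1234"
print(session_id_safe("12", 34) == session_id_safe("1", 234))  # False
```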

How to get count of active users grouped by version? (from Firebase using BigQuery)

Problem description
I'm trying to find out how many active users I have in my app, broken down by the 2 or 3 latest versions of the app.
I've read some documentation and other Stack Overflow questions, but none of them solved my problem (and some had outdated solutions).
Examples of solutions I tried:
https://support.google.com/firebase/answer/9037342?hl=en#zippy=%2Cin-this-article (N-day active users - this solution is probably the best, but even after changing the dataset name correctly and removing the _TABLE_SUFFIX conditions, it kept returning a single column n_day_active_users_count = 0)
https://gist.github.com/sbrissenden/cab9bd3a043f1879ded605cba5005457
(this one returns no values for me, and I don't understand why)
How can I get count of active Users from google analytics (this is not a good fit because the other part of my job is already done and generating charts in Data Studio, so using the REST API would make it harder to join my two solutions - one from BigQuery and the other from the REST API)
Discrepancies on "active users metric" between Firebase Analytics dashboard and BigQuery export (this one uses outdated variables)
So, I started to write the solution out of my head, and this is what I get so far:
SELECT
  user_pseudo_id,
  app_info.version,
  ROUND(COUNT(DISTINCT user_pseudo_id) OVER (PARTITION BY app_info.version)
        / SUM(COUNT(DISTINCT user_pseudo_id)) OVER (), 3) AS adoption
FROM `projet-table.events_*`
WHERE platform = 'ANDROID'
GROUP BY app_info.version, user_pseudo_id
ORDER BY app_info.version
Conclusions
I'm not sure if my logic is correct, but I think I can use user_pseudo_id to calculate it, right? The general idea is: users_of_X_version / users_of_all_versions.
(And the results are fairly close to the ones shown on the Google Analytics web platform - I believe the difference is due to the date I turned on the BigQuery integration. But I'd like some confirmation that my logic is correct.)
The biggest problem with my code right now is that I cannot write it without grouping by user_pseudo_id (when I don't, BigQuery says: "SELECT list expression references column user_pseudo_id which is neither grouped nor aggregated at [2:3]"), and that's why I have duplicated rows in the query result.
Also, about the first link in my examples... Is it even possible for a record to have an engagement_time_msec param with a value < 0? If not, why is that condition in the WHERE clause?
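The intended metric - distinct users of version X divided by distinct users overall - can be separated from the GROUP BY problem by sketching it in plain Python first (the function name and sample rows are made up for illustration):

```python
from collections import defaultdict

def adoption_by_version(rows):
    """rows: (user_pseudo_id, app_version) pairs. Returns version -> share of
    distinct users. A user seen on two versions counts toward both, so the
    shares can sum to more than 1.0."""
    users_by_version = defaultdict(set)
    all_users = set()
    for user, version in rows:
        users_by_version[version].add(user)
        all_users.add(user)
    total = len(all_users)
    return {v: round(len(u) / total, 3) for v, u in users_by_version.items()}

rows = [("u1", "2.0"), ("u2", "2.0"), ("u3", "1.9"), ("u1", "2.0")]
print(adoption_by_version(rows))  # {'2.0': 0.667, '1.9': 0.333}
```

In SQL terms this corresponds to aggregating down to one row per version (COUNT(DISTINCT user_pseudo_id) grouped only by app_info.version) and dividing by a total computed over the whole table, rather than keeping user_pseudo_id in the SELECT list - which is what forces the extra grouping and the duplicated rows.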

Read BigQuery streaming (real-time) API

I have a BigQuery data warehouse that gets its data from Google Analytics.
The data is streamed in real time.
Now I want to get this data as it arrives (and not afterwards) from BigQuery using its API.
I have seen the API that lets you query the data after it is saved into BigQuery,
for example:
for example:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total_people
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name, state
    ORDER BY total_people DESC
    LIMIT 20
"""
query_job = client.query(query)  # Make an API request.

print("The query data:")
for row in query_job:
    # Row values can be accessed by field name or index.
    print("name={}, count={}".format(row[0], row["total_people"]))
Is there any way to "listen" for the data and store some of it in the cloud,
rather than letting it be saved and then querying it from BigQuery?
Thanks
There is not currently a streaming read mechanism for accessing managed data in BigQuery; existing mechanisms leverage some form of snapshot-like consistency at a given point in time (tabledata.list, storage API read, etc).
Given that your data is already automatically delivered into BigQuery, the next best thing is likely some kind of delta strategy where you read periodically with some kind of filter (recent data filtered by a timestamp, etc).
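A minimal sketch of that delta strategy, assuming a monotonically increasing event_timestamp column. The query runner is injected so the loop can be exercised without a real client; poll_deltas, the table name, and the column are all assumptions for illustration, not a BigQuery API:

```python
def poll_deltas(run_query, last_ts):
    """Fetch rows newer than last_ts and return them with a new watermark."""
    rows = run_query(
        "SELECT * FROM `project.dataset.events` WHERE event_timestamp > @last_ts",
        last_ts)
    new_ts = max((r["event_timestamp"] for r in rows), default=last_ts)
    return rows, new_ts

# Stand-in for a real BigQuery query call, used here only to demo the loop.
data = [{"event_timestamp": 1}, {"event_timestamp": 2}, {"event_timestamp": 3}]
def fake_query(sql, ts):
    return [r for r in data if r["event_timestamp"] > ts]

rows, watermark = poll_deltas(fake_query, 1)
print(len(rows), watermark)  # 2 3
rows2, watermark2 = poll_deltas(fake_query, watermark)
print(len(rows2))  # 0: nothing new since the last watermark
```

In production, run_query would wrap bigquery.Client().query and the loop would run on a schedule (Cloud Scheduler, cron, etc.), persisting the watermark between runs.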

How to unnest Google Analytics custom dimension in Google Data Prep

Background story:
We use Google Analytics to track user behaviour on our website. The data is exported daily into BigQuery. Our implementation is quite complex and we use a lot of custom dimensions.
Requirements:
1. The data needs to be imported into our internal databases to enable better and more strategic insights.
2. The process needs to run without requiring human interaction
The problem:
Google Analytics data needs to be in a flat format so that we can import it into our database.
Question: How can I unnest custom dimensions data using Google Data Prep?
What does it look like?
----------------
customDimensions
----------------
[{"index":10,"value":"56483799"},{"index":16,"value":"·|·"},{"index":17,"value":"N/A"}]
What do I need it to look like?
----------------------------------------------------------
customDimension10 | customDimension16 | customDimension17
----------------------------------------------------------
56483799 | ·|· | N/A
I know how to achieve this using a standard SQL query in Big Query interface but I really want to have a Google Data Prep flow that does it automatically.
Define the flat format and create it in BigQuery first.
You could
create one big table and repeat several values using CROSS JOINs on all the arrays in the table
create multiple tables (per array) and use ids to connect them, e.g.
for session custom dimensions concatenate fullvisitorid / visitstarttime
for hits concatenate fullvisitorid / visitstarttime / hitnumber
for products concatenate fullvisitorid / visitstarttime / hitnumber / productSku
The second option is a bit more effort, but you save storage because you're not repeating all the information for everything.
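If part of the pipeline runs outside Data Prep, the flattening step itself is small; a minimal Python sketch (the function name is illustrative) that turns the exported JSON array into customDimension<N> columns:

```python
import json

def flatten_custom_dimensions(raw):
    """Map '[{"index":10,"value":"..."}]' to {"customDimension10": "..."}."""
    return {f"customDimension{d['index']}": d["value"] for d in json.loads(raw)}

raw = '[{"index":10,"value":"56483799"},{"index":17,"value":"N/A"}]'
print(flatten_custom_dimensions(raw))
# {'customDimension10': '56483799', 'customDimension17': 'N/A'}
```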

Trouble Looking For Events WITHIN a Session In BigQuery or WITHIN Multiple Sessions

I wanted a second pair of eyes and some help confirming the best way to look within a session at the hit level in BigQuery. I have read the BigQuery developer documentation thoroughly, which provides insight on working WITHIN a session. My challenge is this. Let us assume I write the high-level query to count the number of sessions and group them by device.deviceCategory, as below:
SELECT
  device.deviceCategory,
  COUNT(DISTINCT CONCAT(fullVisitorId, STRING(visitId)), 10000000) AS sessions
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2015-01-01'), TIMESTAMP('2015-06-30')))
GROUP EACH BY device.deviceCategory
ORDER BY sessions DESC
I then run a follow up query like the following to find the number of distinct users (Client ID's):
SELECT
  device.deviceCategory,
  COUNT(DISTINCT fullVisitorId) AS users
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2015-01-01'), TIMESTAMP('2015-06-30')))
GROUP EACH BY device.deviceCategory
ORDER BY users DESC
(Note that I broke those up because of the sheer size of the data I am working with, which produces runs greater than 5TB in some cases.)
My challenge is the following. I feel like I have the wrong approach and have not had success with the WITHIN function. For every user ID (or fullVisitorId), I want to look within all their sessions to find out how many of them were desktop and how many were mobile. Basically, these are the cross-device users. I want to collect a table of these users. I started here:
SELECT COUNT(DISTINCT CONCAT(fullVisitorId, STRING(visitId)), 10000000) AS sessions
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2015-01-01'), TIMESTAMP('2015-06-30')))
WHERE device.deviceCategory = 'desktop' AND device.deviceCategory = 'mobile'
This is not correct, though. Moreover, any version of a WITHIN query I write gives me nonsense results or zeros. Does anyone have strategies or tips for a way forward? What is the best way to use the WITHIN function to look for sessions that have multiple events happening WITHIN the session (my goal being to collect the user IDs that meet certain requirements within a session or across various sessions)? Two days ago I did this in a very manual way, working through the steps and saving intermediate data frames to generate counts. That said, I wanted to see if there is any guidance for doing this quickly with a single query.
I'm not sure if this question is still open on your end, but I believe I see your problem, and it is not misuse of the WITHIN function. It is a data-understanding problem.
When dealing with GA and cross-device identification, you cannot reliably use any combination of fullVisitorId and visitId to identify users, as these are derived from the cookie that GA places on the user's browser. Thus, fullVisitorId identifies a specific browser on a specific device rather than a specific user.
To truly track users across devices, you must leverage the userId functionality (follow this link). This requires the user to sign in in some way, giving them an identifier that you can use across all of their devices to tie their behavior together.
After you implement some type of user identification that you control, rather than GA's cookie assignment, you can use it to look for details across sessions and within those individual sessions.
Hope that helps!
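Once a controlled identifier exists, the cross-device check itself reduces to grouping sessions per user and testing the device set. A sketch under that assumption (the function name and sample rows are made up):

```python
from collections import defaultdict

def cross_device_users(sessions):
    """sessions: (user_id, device_category) pairs. Returns users seen on both
    'desktop' and 'mobile'. user_id should be a signed-in userId, not
    fullVisitorId, for the reasons given above."""
    devices = defaultdict(set)
    for user, device in sessions:
        devices[user].add(device)
    return {u for u, d in devices.items() if {"desktop", "mobile"} <= d}

sessions = [("u1", "desktop"), ("u1", "mobile"), ("u2", "desktop")]
print(cross_device_users(sessions))  # {'u1'}
```

This is also why the WHERE clause above returns nothing: a single session row can never be both 'desktop' and 'mobile' at once; the comparison has to happen across a user's sessions, not within one row.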