Show transactions from a user who saw X page/s in their session - sql

I am working with Google Analytics data in BigQuery.
I'd like to show a list of transaction IDs from users who visited a particular page on a website during their session. I've unnested hits.page.pagePath in order to identify the page, but since I don't know which row the transaction ID will occur on, I am having trouble returning meaningful results.
My code looks like this, but it returns 0 results: all the transaction IDs are NULL, since transactions do not happen on the rows where the page path meets the AND hits.page.pagePath LIKE "%clear-out%" condition:
SELECT hits.transaction.transactionId AS orderid
FROM `xxx.xxx.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 1 day) and
DATE_sub(current_date(), interval 1 day)
AND totals.transactions > 0
AND hits.page.pagePath LIKE "%clear-out%"
AND hits.transaction.transactionId IS NOT NULL
How can I return the transaction IDs for all sessions in which the user viewed a page matching hits.page.pagePath LIKE "%clear-out%"?

When you cross join, the whole session row, including its nested hits array, is repeated for each hit. Use that nested array to look for your page, not the cross-joined hit.
Unfortunately, you're giving both the same name. It's better to keep them separate. Here's what it could look like:
SELECT
h.transaction.transactionId AS orderId
--,ARRAY( (SELECT AS STRUCT hitnumber, page.pagePath, transaction.transactionId FROM t.hits ) ) AS hitInfos -- test: show all hits in this session
FROM
`google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` AS t
CROSS JOIN t.hits AS h
WHERE
totals.transactions > 0 AND h.transaction.transactionId IS NOT NULL
AND
-- use the repeated hits nest (not the cross joined 'h') to check all pagePaths in the session
(SELECT LOGICAL_OR(page.pagePath LIKE "/helmets/%") FROM t.hits )
LOGICAL_OR() is an aggregation function for OR - so if any hit matches the condition it returns TRUE
(This query uses the openly available GA data from Google. It's a bit old but good to play around with.)
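As a minimal sketch of the same idea on data you can run without any GA export, the query below builds two hypothetical sessions inline (the visitId values, page paths and transaction IDs are invented for illustration) and applies LOGICAL_OR() to the whole hits array of each session:

```sql
-- Hypothetical inline data standing in for ga_sessions rows.
WITH sessions AS (
  SELECT 1 AS visitId,
    ARRAY<STRUCT<hitNumber INT64, pagePath STRING, transactionId STRING>>[
      (1, '/clear-out/shoes', NULL),
      (2, '/checkout/done', 'T-100')
    ] AS hits
  UNION ALL
  SELECT 2,
    ARRAY<STRUCT<hitNumber INT64, pagePath STRING, transactionId STRING>>[
      (1, '/home', NULL),
      (2, '/checkout/done', 'T-200')
    ]
)
SELECT
  s.visitId,
  h.transactionId AS orderId
FROM sessions AS s
CROSS JOIN UNNEST(s.hits) AS h
WHERE h.transactionId IS NOT NULL
  -- Check every hit of the session (s.hits), not only the current cross-joined hit h.
  AND (SELECT LOGICAL_OR(x.pagePath LIKE '%clear-out%') FROM UNNEST(s.hits) AS x)
```

Only the transaction from session 1 should come back, because session 2 never visits a clear-out page, even though the transaction hit itself is on a different page in both sessions.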

Related

GA4 vs BigQuery - User Count don't match

I have extracted from Bigquery the active_users and totalusers on 31/12/2022, grouped by CampaignName and Country, using the following query:
select
count(distinct case when (select value.int_value from unnest(event_params) where key = 'engagement_time_msec') > 0 or (select value.string_value from unnest(event_params) where key = 'session_engaged') = '1' then user_pseudo_id else null end) AS active_users
,count(distinct user_pseudo_id) AS totalusers
,traffic_source.name AS CampaignName
,geo.country AS Country
FROM `independent-tea-354108.analytics_254831690.events_20221231`
GROUP BY
traffic_source.name
,geo.country
The result filtered by CampaignName='(organic)' was:
(https://i.stack.imgur.com/LMQAH.png)
But when I compare with the data from GA4, it doesn't match, and the difference is huge (around 15000 more active_users in GA4 than in BigQuery). Please note that this is only for one day; over a month the difference would be even higher:
(https://i.stack.imgur.com/8arYs.png)
I've tried filtering by other CampaignNames and not a single value matches and the differences are always huge.
These are two common reasons for differences between GA4 and BigQuery. You have probably looked at them already.
Check your source table for blank user_pseudo_id values. If you use consent mode on your website, those users may be counted in GA4 but not in BigQuery, and this can cause big differences.
Time zone is another area that can make a difference: timestamps in BigQuery are in UTC, while your GA4 property may report in a different time zone.
I hope these help.
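Both checks can be sketched in one query. The rows below are hypothetical stand-ins for the export table, and 'Europe/Madrid' is only an assumed reporting time zone; substitute your own events_* table and the zone configured on your property:

```sql
-- Hypothetical rows shaped like the GA4 export; replace with your events_* table.
WITH events AS (
  SELECT 'u1' AS user_pseudo_id,
         UNIX_MICROS(TIMESTAMP '2022-12-31 23:30:00+00') AS event_timestamp
  UNION ALL
  SELECT CAST(NULL AS STRING), UNIX_MICROS(TIMESTAMP '2022-12-31 10:00:00+00')
  UNION ALL
  SELECT 'u2', UNIX_MICROS(TIMESTAMP '2022-12-30 23:30:00+00')
)
SELECT
  -- Assumption: the property reports in Europe/Madrid.
  DATE(TIMESTAMP_MICROS(event_timestamp), 'Europe/Madrid') AS local_date,
  COUNT(*) AS events,
  -- Events that cannot be attributed to any user in BigQuery.
  COUNTIF(user_pseudo_id IS NULL) AS events_without_user_pseudo_id,
  COUNT(DISTINCT user_pseudo_id) AS users
FROM events
GROUP BY local_date
ORDER BY local_date
```

A large events_without_user_pseudo_id count, or users shifting between adjacent dates once the time zone is applied, would point at one of the two causes above.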

Selecting the first and last event per user, per day

I have a Google Analytics event which fires on my website when certain interactions are made. It may not fire at all for a user in a session, or it can fire many times.
I'd like to return results showing the userID and the value of the first and last event label, per day. I have tried to do this with MAX(hits.eventInfo.eventLabel), but when I check my results, this is not returning the last value for that user in the day as I was expecting.
SELECT Date,
customDimension.value AS UserID,
MAX(hits.eventInfo.eventLabel) AS last_value
FROM `project.dataset.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 1 day) and
DATE_sub(current_date(), interval 1 day)
AND hits.eventInfo.eventAction = "Value"
AND customDimension.index = 2
GROUP BY Date, UserID
For example, the query above returns results where user X has the following MAX() value:
20180806 User_x 69.96
But when I look at the details of that user's interactions on the day I see:
Based on this, I would expect to see 79.95 as my MAX() result as it has the highest hit number, instead I seem to have selected a value from somewhere in the middle of the session - how can I adjust my query to ensure I select the last event value?
When you are looking for the maximum value of column colA while doing GROUP BY, MAX(colA) will obviously work.
But when you are looking for the value in column colA that corresponds to the maximum value in column colB, you should use STRING_AGG(colA ORDER BY colB DESC LIMIT 1), or something similar using ARRAY_AGG().
So, in your case, I think it will be something like the line below (you should tune it further):
STRING_AGG(hits.eventInfo.eventLabel ORDER BY hits.hitNumber DESC LIMIT 1) AS last_value
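To make the difference concrete, here is a small sketch on invented data (the user ID, hit numbers and labels are hypothetical). It compares MAX() with the ordered STRING_AGG() and ARRAY_AGG() forms:

```sql
-- Hypothetical hits for one user; labels are strings, as in GA event labels.
WITH hits AS (
  SELECT 'User_x' AS userId, 1 AS hitNumber, '89.95' AS eventLabel
  UNION ALL SELECT 'User_x', 2, '69.96'
  UNION ALL SELECT 'User_x', 3, '79.95'
)
SELECT
  userId,
  -- Largest label as a string, regardless of when the hit happened.
  MAX(eventLabel) AS max_label,
  -- Label of the hit with the lowest / highest hitNumber.
  STRING_AGG(eventLabel ORDER BY hitNumber ASC LIMIT 1) AS first_label,
  STRING_AGG(eventLabel ORDER BY hitNumber DESC LIMIT 1) AS last_label,
  -- Same result via ARRAY_AGG, which also works for non-string columns.
  ARRAY_AGG(eventLabel ORDER BY hitNumber DESC LIMIT 1)[OFFSET(0)] AS last_label_via_array
FROM hits
GROUP BY userId
```

Here max_label is '89.95' while last_label is '79.95'. Note that hitNumber restarts in every session, so when a user has several sessions in a day you would probably need to order by visitStartTime first and hitNumber second.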
In your case one should work with subqueries on the hits array. This allows full control over what you want to have. I used the example ga data from Google, so labels are different. But I wrote it in a way you can easily modify to fit your needs:
SELECT
date,
fullvisitorid,
visitstarttime,
(SELECT value FROM t.customDimensions WHERE index=2) userId,
(SELECT
--STRUCT(hour, minute, hitNumber, eventinfo.eventlabel) -- for testing, comment out next line
eventInfo.eventLabel
FROM t.hits
WHERE type='EVENT' AND eventInfo.eventAction <> '' -- modify to fit your condition
ORDER BY hitNumber ASC LIMIT 1
) AS firstEventLabel,
(SELECT
--STRUCT(hour, minute, hitNumber, eventinfo.eventlabel) -- for testing, comment out next line
eventInfo.eventLabel
FROM t.hits
WHERE type='EVENT' AND eventInfo.eventAction <> '' -- modify to fit your condition
ORDER BY hitNumber DESC LIMIT 1
) AS lastEventLabel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` t
LIMIT 1000 -- for testing
Basically, I'm querying the events, ordering them by hitNumber ascending or descending, and limiting to one so that there is only one result per row. The line with userId also shows how to properly get a custom dimension value.
If you are very new to this concept of working with arrays you can learn all about it here: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
MAX() should work. The one case where it returns an unexpected value is when it operates on a string rather than a number, because strings are compared lexicographically.
Does this fix the problem?
MAX(CAST(hits.eventInfo.eventLabel AS FLOAT64)) AS last_value

BigQuery Troubleshooting - Query or Google Analytics?

I am trying to query my Google Analytics tables for the past 20 days with the experiment ID of
zCeqsUOZSL6ESM94wH8XfA
The current code here, currently returns no rows:
SELECT
e.experimentId,
e.experimentVariant,
i.index=1 AS borrower_id
FROM
`93868086.ga_sessions_*`,
UNNEST(hits) as hits,
UNNEST(hits.experiment) AS e,
UNNEST(hits.customDimensions) AS i
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 20 DAY))
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND hits.type = 'PAGE'
AND e.experimentId = 'zCeqsUOZSL6ESM94wH8XfA'
Is my current code too generic to return any values? I have attempted to simplify the query just to see if it will return rows that have an experiment ID populated, but to no avail. I am trying to work out whether the problem is my query or whether the backend is having an issue tracking our A/B testing data. Any critique of my code above would be greatly appreciated.
Your code would return no rows if any of the arrays is empty, because a comma (cross) join against an empty array drops the row. I am not an advocate of unnesting along three independent dimensions. But if you want to keep all rows, use LEFT JOIN instead of the comma:
FROM `93868086.ga_sessions_*`
LEFT JOIN UNNEST(hits) AS hits
LEFT JOIN UNNEST(hits.experiment) AS e
LEFT JOIN UNNEST(hits.customDimensions) AS i
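The effect of an empty array can be seen on a tiny, hypothetical dataset (the visitId values and the experiment name are invented). The sketch counts the rows produced by a comma join and by a LEFT JOIN over the same two sessions, one of which has no experiments:

```sql
-- Hypothetical sessions: the second one has an empty experiments array.
WITH sessions AS (
  SELECT 1 AS visitId, ['exp-A'] AS experiments
  UNION ALL
  SELECT 2, ARRAY<STRING>[]
)
SELECT
  -- Comma join: the session with the empty array disappears.
  (SELECT COUNT(*)
   FROM sessions AS s, UNNEST(s.experiments) AS e) AS rows_comma_join,
  -- LEFT JOIN: the session is kept, with e set to NULL.
  (SELECT COUNT(*)
   FROM sessions AS s LEFT JOIN UNNEST(s.experiments) AS e) AS rows_left_join
```

The comma join yields 1 row and the LEFT JOIN yields 2. Keep in mind that a WHERE condition on the unnested alias, such as a filter on e.experimentId, will still discard the NULL rows that the LEFT JOIN preserved.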

Select the date of a UserID's first/most recent purchase

I am working with Google Analytics data in BigQuery, looking to aggregate the date of last visit and first visit up to UserID level, however my code is currently returning the max visit date for that user, so long as they have purchased within the selected date range, because I am using MAX().
If I remove MAX() I have to GROUP by DATE, which I don't want as this then returns multiple rows per UserID.
Here is my code which returns a series of dates per user - last_visit_date is currently working, as it's the only date that can simply look at the last date of user activity. Any advice on how I can get last_ord_date to select the date on which the order actually occurred?
SELECT
customDimension.value AS UserID,
# Last order date
IF(COUNT(DISTINCT hits.transaction.transactionId) > 0,
(MAX(DATE)),
"unknown") AS last_ord_date,
# first visit date
IF(SUM(totals.newvisits) IS NOT NULL,
(MAX(DATE)),
"unknown") AS first_visit_date,
# last visit date
MAX(DATE) AS last_visit_date,
# first order date
IF(COUNT(DISTINCT hits.transaction.transactionId) > 0,
(MIN(DATE)),
"unknown") AS first_ord_date
FROM
`XXX.XXX.ga_sessions_20*` AS t
CROSS JOIN
UNNEST (hits) AS hits
CROSS JOIN
UNNEST(t.customdimensions) AS customDimension
CROSS JOIN
UNNEST(hits.product) AS hits_product
WHERE
parse_DATE('%y%m%d',
_table_suffix) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 day)
AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 day)
AND customDimension.index = 2
AND customDimension.value NOT LIKE "true"
AND customDimension.value NOT LIKE "false"
AND customDimension.value NOT LIKE "undefined"
AND customDimension.value IS NOT NULL
GROUP BY
UserID
The most efficient and clear way to do this (and also the most portable) is to have a simple table or view with two columns, userid and last_purchase, and another with two columns, userid and first_visit.
Then you inner join these with the original raw table on userid and hit timestamp to get, say, the session IDs you're interested in. That's three steps, but simple, readable and easy to maintain.
With a query that relies on first or last purchase/action, it's very easy to pile up so much complexity (just look at the unnest operations you have there) that it becomes unusable, and you'll spend way too much time trying to figure out the meaning of the output.
Also keep in mind that a wildcard query is limited to 1000 tables, so your first and last visits fall within a rolling window of 1000 days.
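A rough sketch of that split, using CTEs in place of stored tables or views, might look like the following. The sessions data, column names and dates are hypothetical, and the final step uses a LEFT JOIN between the two summaries rather than the join back to the raw table described above, so that users without any order are kept with NULL order dates:

```sql
-- Hypothetical one-row-per-session data, already reduced to a user ID and a date.
WITH sessions AS (
  SELECT 'u1' AS userId, DATE '2018-08-01' AS visit_date, 0 AS transactions
  UNION ALL SELECT 'u1', DATE '2018-08-03', 1
  UNION ALL SELECT 'u1', DATE '2018-08-06', 0
  UNION ALL SELECT 'u2', DATE '2018-08-02', 0
),
-- Step 1: visit dates per user, over all sessions.
visits AS (
  SELECT userId,
         MIN(visit_date) AS first_visit_date,
         MAX(visit_date) AS last_visit_date
  FROM sessions
  GROUP BY userId
),
-- Step 2: order dates per user, over sessions with a transaction only.
orders AS (
  SELECT userId,
         MIN(visit_date) AS first_ord_date,
         MAX(visit_date) AS last_ord_date
  FROM sessions
  WHERE transactions > 0
  GROUP BY userId
)
-- Step 3: combine the two small summaries.
SELECT v.userId, v.first_visit_date, v.last_visit_date,
       o.first_ord_date, o.last_ord_date
FROM visits AS v
LEFT JOIN orders AS o ON v.userId = o.userId
ORDER BY v.userId
```

Because the order dates are computed only from sessions that contain a transaction, last_ord_date no longer collapses to the user's last visit date.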

Count pageviews by page

I would like to count the number of pageviews by page in BigQuery, using Google Analytics data source tables. I only want to count pages that have the custom page content grouping of ProductList_UA or ProductDetails_UA, and I want to trim all the parameters from the end of the page URL so that I return a more manageable list of pages.
So far, my query looks as below, but my counts of pageviews, bounces and exits are far too high (about 8x). Where am I going wrong?
SELECT
IFNULL(REGEXP_EXTRACT(hits.page.pagePath,r'^(.*?)\?'), hits.page.pagePath) AS Trimmed_Page,
COUNT(hits.page.pagepath) AS Pageviews,
SUM(totals.bounces) AS Bounces,
SUM(IF(hits.isexit = TRUE, 1,0)) AS Exits,
SUM(IF(hits.isentrance = TRUE, 1,0)) AS Entrances,
MIN(hits.contentGroup.contentGroup3) AS Content_Group
FROM `xxx.ga_sessions_20*` AS m
CROSS JOIN UNNEST(m.customdimensions) AS customDimension
CROSS JOIN UNNEST(m.hits) AS hits
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 1 day) and
DATE_sub(current_date(), interval 1 day)
AND (hits.contentGroup.contentGroup3 = 'ProductList_UA' OR hits.contentGroup.contentGroup3 = 'ProductDetails_UA')
AND hits.type="PAGE"
AND hits.isInteraction = TRUE
GROUP BY Trimmed_Page
ORDER BY Pageviews DESC
LIMIT 1000
I suspect the cross join with customDimensions is the cause of you seeing more results than expected, as each row of hits will be multiplied by the number of customDimensions in that row. Experiment without that cross join to see if it solves the issue.
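The multiplication is easy to reproduce on a toy session. In this sketch the arrays are hypothetical simplifications of hits and customDimensions (two page hits, four custom dimensions):

```sql
-- Hypothetical session with 2 hits and 4 session-level custom dimensions.
WITH sessions AS (
  SELECT
    1 AS visitId,
    ['/a', '/b'] AS pagePaths,
    [1, 2, 3, 4] AS customDimensionIndexes
)
SELECT
  -- Unnesting hits only: one row per hit.
  (SELECT COUNT(*)
   FROM sessions AS s, UNNEST(s.pagePaths) AS p) AS hits_only,
  -- Unnesting both arrays: one row per (hit, custom dimension) pair.
  (SELECT COUNT(*)
   FROM sessions AS s, UNNEST(s.pagePaths) AS p,
        UNNEST(s.customDimensionIndexes) AS cd) AS hits_x_custom_dimensions
```

This returns 2 and 8: every hit is counted once per custom dimension. An inflation of about 8x in the real query would be consistent with sessions carrying around eight custom dimensions, though that is only a guess without seeing the data.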