Finding the journey made by users in Google BigQuery - sql

I'm looking to find the journey made by users on a particular website. The schema of my dataset is the same as Google Merchandise Store, which can be found here: https://support.google.com/analytics/answer/3437719?hl=en
From the Google BigQuery cookbook, I've implemented and modified the SQL code provided to get the sequence of hits made by every customer.
SELECT
fullVisitorId AS id,
visitId AS visitid,
visitNumber AS visitnumber,
h.hitNumber AS hitNumber,
CASE
WHEN h.eventInfo.eventAction = "Lead" THEN "Lead"
WHEN h.eventInfo.eventAction = "Homepage" THEN "Homepage"
WHEN h.eventInfo.eventAction = "Search" THEN "Search"
WHEN h.eventInfo.eventAction = "High Intent Use" THEN "High Intent Use"
WHEN h.eventInfo.eventAction = "Listing Page" THEN "Listing Page"
END AS journey
FROM
`dataset`,
UNNEST(hits) AS h
WHERE
h.type="PAGE"
OR h.type="EVENT"
ORDER BY
fullVisitorId,
visitId,
visitNumber,
hitNumber
A snippet of the result I got is as follows:
fullVisitorId visitId visitNumber hitnumber journey
001 1001 1 1 Homepage
001 1001 1 2 Search
001 1001 1 3 null
001 1001 1 4 Search
001 1001 1 5 Listing Page
001 1001 1 6 Lead
001 1001 1 2 Search
001 1001 1 7 Lead
002 1002 1 1 Search
...
What I need is another column that shows the journey taken by each visitor up to and including the first "Lead", ignoring duplicates (e.g., if the visitor views 5 search pages back-to-back, the journey should show "Search" only once).
I.e., for visitor 001 on visit 1001, the column would show:
Homepage -> Search -> Listing Page -> Lead
I hope the question is clear. Appreciate any help given! :)

The query below is for BigQuery Standard SQL and applies extra logic on top of your existing query (referenced here as `your_current_query`):
#standardSQL
SELECT
fullVisitorId, visitId,
STRING_AGG(journey, ' -> ' ORDER BY visitNumber, hitnumber) journey_path
FROM (
SELECT
fullVisitorId, visitId,
MIN(visitNumber) visitNumber, MIN(hitnumber) hitnumber, journey
FROM (
SELECT *, COUNTIF(journey = 'Lead') OVER(win) grp
FROM `your_current_query`
WINDOW win AS (
PARTITION BY fullVisitorId, visitId
ORDER BY visitNumber, hitnumber
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
)
WHERE grp = 0
GROUP BY fullVisitorId, visitId, journey
)
GROUP BY fullVisitorId, visitId
So you can just wrap your existing query as below:
#standardSQL
WITH `your_current_query` AS (
SELECT
fullVisitorId, -- keep the original name: the outer query references fullVisitorId
visitId AS visitid,
visitNumber AS visitnumber,
h.hitNumber AS hitNumber,
CASE
WHEN h.eventInfo.eventAction = "Lead" THEN "Lead"
WHEN h.eventInfo.eventAction = "Homepage" THEN "Homepage"
WHEN h.eventInfo.eventAction = "Search" THEN "Search"
WHEN h.eventInfo.eventAction = "High Intent Use" THEN "High Intent Use"
WHEN h.eventInfo.eventAction = "Listing Page" THEN "Listing Page"
END AS journey
FROM
`dataset`,
UNNEST(hits) AS h
WHERE
h.type="PAGE"
OR h.type="EVENT"
)
SELECT
fullVisitorId, visitId,
STRING_AGG(journey, ' -> ' ORDER BY visitNumber, hitnumber) journey_path
FROM (
SELECT
fullVisitorId, visitId,
MIN(visitNumber) visitNumber, MIN(hitnumber) hitnumber, journey
FROM (
SELECT *, COUNTIF(journey = 'Lead') OVER(win) grp
FROM `your_current_query`
WINDOW win AS (
PARTITION BY fullVisitorId, visitId
ORDER BY visitNumber, hitnumber
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
)
WHERE grp = 0
GROUP BY fullVisitorId, visitId, journey
)
GROUP BY fullVisitorId, visitId
--- ORDER BY fullVisitorId, visitId
Following your sample data, the query above should produce the result below:
Row fullVisitorId visitId journey_path
1 001 1001 Homepage -> Search -> Listing Page -> Lead
2 002 1002 Search
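To make the logic of the query easier to follow, here is a minimal Python sketch of the same steps: order the hits within each session, stop once a "Lead" has already been seen, keep only the first occurrence of each step, and join the rest with ' -> '. The sample rows are a simplified, hypothetical version of the data in the question, and `journey_paths` is an illustrative name, not part of any library.

```python
from itertools import groupby

# Hypothetical sample rows:
# (fullVisitorId, visitId, visitNumber, hitNumber, journey)
rows = [
    ("001", 1001, 1, 1, "Homepage"),
    ("001", 1001, 1, 2, "Search"),
    ("001", 1001, 1, 3, None),
    ("001", 1001, 1, 4, "Search"),
    ("001", 1001, 1, 5, "Listing Page"),
    ("001", 1001, 1, 6, "Lead"),
    ("001", 1001, 1, 7, "Lead"),
    ("002", 1002, 1, 1, "Search"),
]


def journey_paths(rows):
    """Return {(fullVisitorId, visitId): journey_path} up to the first Lead."""
    paths = {}
    # Same ordering as the window: visitor, visit, visitNumber, hitNumber.
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2], r[3]))
    for session, hits in groupby(ordered, key=lambda r: (r[0], r[1])):
        steps = []
        leads_before = 0  # plays the role of the "grp" column
        for *_, journey in hits:
            if leads_before > 0:
                break  # equivalent to WHERE grp = 0
            if journey == "Lead":
                leads_before += 1
            # Skip NULLs (as STRING_AGG does) and repeated steps.
            if journey is not None and journey not in steps:
                steps.append(journey)
        paths[session] = " -> ".join(steps)
    return paths


paths = journey_paths(rows)
```

As in the SQL, a step that reappears later in the session is dropped even when the repeats are not back-to-back, because de-duplication keeps only the first occurrence of each step.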

I'd suggest using STRING_AGG to build a string of the journey steps; adding DISTINCT to the aggregation will show each journey step only once per user.
Something like:
STRING_AGG(DISTINCT(journey), '->') as propensity_banding_subset
You could then use a regex to clip off everything after the first 'Lead', unless somebody can suggest a better way to do this within the original string aggregation.

I took Mikhail's great approach and turned it into a more scalable version for those who have really large amounts of data. The idea is the same, but applied in a subquery on the hits array.
SELECT
fullVisitorId AS id,
visitId AS visitid,
visitNumber AS visitnumber,
ARRAY(
(SELECT AS STRUCT *
FROM
(SELECT AS STRUCT
hitNumber,
page.pagePath, -- pagePath instead of CASE-WHEN with events
count(page.pagePath) over (win) elNumber
FROM t.hits
WHERE type IN ('PAGE', 'EVENT')
WINDOW win AS (
PARTITION BY page.pagePath
ORDER BY hitnumber
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
ORDER BY hitNumber)
WHERE elNumber=0
-- instead of 'Lead' I used '/signin.html'
AND hitNumber < (SELECT MIN(hitNumber) FROM t.hits WHERE page.pagePath='/signin.html')
)
) AS journey
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` t
limit 1000
I used the actual sample data and couldn't find the events from the example there, so I simply used page paths. It should be easy to adapt, though.
This version also returns nested data rather than a flat table, which saves space when saving the result as a table and makes later queries on it faster.
There is also no grouping involved: everything happens within the subquery on the array alone, which allows very fast processing due to parallelization.

Related

ClickHouse window function difficulties - how to work with date windows

I have web sessions with utm tags (different traffic channels: cpc, smm, push). Some sessions have tags, but organic sessions have no utm tags. I want to overwrite organic sessions with the previous tags.
The rules I want to apply:
The push channel applies only to the session in which it is registered.
All other non-empty channels are carried forward to all empty sessions for the current and the next day.
Channels are not overwritten: if there was first a cpc channel and then an smm channel on the same day, the cpc sessions come first, and then smm applies for the current and the next day.
ClickHouse version 22.8.10.29
Main idea: use arrays, with UNION ALL for the push channel.
select install_id, session_id, date_uz , started_at, utm_medium, utm_medium_final
from (
SELECT *, arrayFirst(x -> x!='', arrayReverse(utm_medium_array)) as utm_medium_new,
maxIf(date_uz, utm_medium_new = utm_medium) OVER (PARTITION BY install_id ORDER BY started_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_date,
if(date_uz - last_date < 2, utm_medium_new, '') utm_medium_final
--any(utm_medium_new) OVER (PARTITION BY install_id ORDER BY started_at ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING ) as h,
from (
select install_id, session_id, utm_medium, date_uz , started_at,
groupArray(utm_medium) OVER (PARTITION BY install_id ORDER BY started_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS utm_medium_array
from marketing.sessions_with_attribution swa
where date_uz >=today()-50
and utm_medium!='push'
and install_id in ('1cc69a1f-eb17-4be6-8bfc-a5dee2dd9c50','57927c21-e862-4729-b38e-f663aa9d227d')
)
union ALL
select install_id, session_id, utm_medium, date_uz , started_at,
[] utm_medium_array, utm_medium , null, utm_medium
from marketing.sessions_with_attribution swa
where date_uz >=today()-50
and utm_medium = 'push'
and install_id in ('1cc69a1f-eb17-4be6-8bfc-a5dee2dd9c50','57927c21-e862-4729-b38e-f663aa9d227d')
)
order by install_id, started_at

Using OFFSET instead of UNNEST for nested fields in Google Bigquery

A quick question to GBQ gurus.
Here are two queries that are identical in their purpose
first
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits[OFFSET(0)].time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits[OFFSET(0)].eventInfo.eventAction,
hits[OFFSET(0)].TRANSACTION.transactionId,
hits[OFFSET(0)].TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`
WHERE
ARRAY_LENGTH(hits) > 0
AND _table_suffix BETWEEN '20200201'
AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
second
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits.time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits.eventInfo.eventAction,
hits.TRANSACTION.transactionId,
hits.TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`, UNNEST(hits) hits
WHERE
_table_suffix BETWEEN '20200201' AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
The 1st query uses OFFSET to extract data from nested fields. According to the execution details report, it requires about 1.5 MB of shuffling.
The 2nd query uses UNNEST to reach nested data, and the amount of shuffled bytes is around (!) 75 MB.
The amount of processed data is the same in both cases.
Now, the question is:
Does that mean that, according to this article on optimizing communication between slots, I should use OFFSET instead of UNNEST to get the data stored in nested fields?
Thanks!
Let's consider the following examples using a BigQuery public dataset.
UNNEST - returns 6 results:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT h FROM t, UNNEST(hits) h
OFFSET - returns 1 result:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT hits[OFFSET(0)] FROM t
Both queries reference the same record in a GA public table. They show that a join with UNNEST returns one row per element of the array, whereas OFFSET(0) returns only one row holding the first element of the array.
The difference in shuffled data arises because UNNEST performs a JOIN operation, which requires the data to be organized in a specific way and produces one row per hit, while the OFFSET approach takes only the first element of the array. The two queries are therefore not equivalent: they return different results.
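The row-count difference can be sketched in a few lines of Python. The session data and the helper names `unnest_hits` and `first_hit` are hypothetical and only mimic the shape of the GA export (one row per session with a nested list of hits); they say nothing about how BigQuery executes either query.

```python
# Hypothetical session rows; each holds a nested list of hits.
sessions = [
    {"visitId": 1, "hits": [{"hitNumber": 1}, {"hitNumber": 2}, {"hitNumber": 3}]},
    {"visitId": 2, "hits": [{"hitNumber": 1}]},
]


def unnest_hits(sessions):
    """Like `FROM t, UNNEST(hits) h`: one output row per hit."""
    return [(s["visitId"], h["hitNumber"]) for s in sessions for h in s["hits"]]


def first_hit(sessions):
    """Like `hits[OFFSET(0)]` with ARRAY_LENGTH(hits) > 0: one row per session."""
    return [(s["visitId"], s["hits"][0]["hitNumber"]) for s in sessions if s["hits"]]


unnested = unnest_hits(sessions)
firsts = first_hit(sessions)
```

Here the unnested version yields four rows and the OFFSET(0) version two, so the smaller shuffle reflects less output rather than a cheaper way of getting the same output.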

Converting Legacy SQL to Standard SQL - Enhanced Ecommerce

I am in no way a coder, so I have tried but keep falling over on this.
I want to convert this query from Google's Google Analytics BigQuery Cookbook,
"Products purchased by customers who purchased product A (Enhanced Ecommerce)",
into Standard SQL. I have pasted the code below.
I have made a few attempts but keep falling over.
Thank you in advance
John
SELECT hits.product.productSKU AS other_purchased_products,
COUNT(hits.product.productSKU) AS quantity
FROM (
SELECT fullVisitorId, hits.product.productSKU, hits.eCommerceAction.action_type
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
WHERE hits.product.productSKU CONTAINS 'GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND hits.product.productSKU IS NOT NULL
AND hits.product.productSKU !='GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC;
Below is the direct equivalent in BigQuery Standard SQL (no optimizations or improvements, just a straight translation from legacy to standard):
SELECT productSKU AS other_purchased_products, COUNT(productSKU) AS quantity
FROM (
SELECT fullVisitorId, prod.productSKU, hit.eCommerceAction.action_type
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
AND prod.productSKU LIKE '%GGOEYOCR077799%'
AND hit.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND productSKU IS NOT NULL
AND productSKU !='GGOEYOCR077799'
AND action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC
This should produce the same result as the legacy version.

How can I find the previous page with Bigquery

I want to find the previous page in cases where the current page is a product page.
For example, the current page is 'https://www.emag.ro/telefon-mobil-apple-iphone-x-64gb-4g-space-grey-mqac2rm-a/pd/DN094NBBM' and the previous page is 'https://www.emag.ro/search/telefoane-mobile/IPHONE/c?ref=srcql'
How can I use hitNumber to return how many users had this behavior?
I tried the two queries below and want to JOIN them, but I don't know which way is better.
I also tried the LAG function, but I'm not sure it catches all the users.
Thank you in advance.
with
view_product as (
SELECT
ga.fullVisitorId AS GA_USER_ID,
date as date,
h.hitnumber as hitnumber,
CONCAT(ga.fullVisitorId, cast(ga.visitId AS string)) AS SessionID,
(SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) AS PAGETYPE,
(SELECT VALUE FROM h.customDimensions WHERE index =8) as ref_parameter,
visitid as visitid,
h.page.pagePath as page_path
FROM
`emagbigquery.0` ga,
UNNEST(hits) AS h
WHERE h.type='PAGE'
AND _TABLE_SUFFIX = '20190115'
AND (SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) = 'viewproduct'
)
,
SEARCH_page_WITH_REF_SRCQL as (
select
date as date,
ga.fullVisitorId AS GA_USER_ID,
h.hitnumber as hitnumber,
CONCAT(ga.fullVisitorId, cast(ga.visitId AS string)) AS SessionID,
(SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) AS PAGETYPE,
(SELECT VALUE FROM h.customDimensions WHERE index =8) as ref_parameter,
visitid as visitid,
h.page.pagePath as page_path
FROM
`emagbigquery.0` ga,
UNNEST(hits) AS h
WHERE h.type='PAGE'
AND _TABLE_SUFFIX = '20190115'
AND (SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) = 'search'
AND (SELECT VALUE FROM h.customDimensions WHERE index =8) LIKE 'srcql'
)
select
COUNT(DISTINCT GA_USER_ID) AS USERS,
COUNT(DISTINCT SessionID) AS SESSIONS,
previous_page_from_srcql
from (
select
t1.ga_user_id,
t1.sessionid,
t2.hitnumber > t1.hitnumber as previous_page_from_srcql
from SEARCH_page_WITH_REF_SRCQL as t1
inner join view_product as t2
on t1.ga_user_id = t2.ga_user_id
)
group by
previous_page_from_srcql
Try UNNEST WITH OFFSET. It can give you an easy way to later determine that one row came after the other:
WITH path_and_prev AS (
SELECT ARRAY(
SELECT AS STRUCT session.page.pagePath
, LAG(session.page.pagePath) OVER(ORDER BY i) prevPagePath
FROM UNNEST(hits) session WITH OFFSET i
) x
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
)
SELECT COUNT(*) c, pagePath, prevPagePath
FROM path_and_prev, UNNEST(x)
WHERE pagePath='/vests/yellow.html'
AND prevPagePath='/vests/'
GROUP BY 2,3
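The same idea, pairing each page with the one before it in the session and then counting the pairs, can be sketched in Python. The page paths and the function name `count_transitions` are made up for illustration; the previous page is `None` for the first hit of a session, just as LAG returns NULL there.

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of page paths visited.
sessions = [
    ["/", "/vests/", "/vests/yellow.html"],
    ["/vests/", "/vests/yellow.html", "/vests/", "/vests/yellow.html"],
    ["/helmets/", "/vests/yellow.html"],
]


def count_transitions(sessions):
    """Count (previous page, current page) pairs within each session."""
    counts = Counter()
    for paths in sessions:
        prev = None  # LAG yields NULL for the first hit of a session
        for path in paths:
            counts[(prev, path)] += 1
            prev = path
    return counts


transitions = count_transitions(sessions)
vests_to_yellow = transitions[("/vests/", "/vests/yellow.html")]
```

Like the COUNT(*) in the query, this counts occurrences of the transition, not users; counting distinct users would require carrying the visitor id along with each pair.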

BigQuery unnest hits - duplicating values

I'm trying to create a master view of a group of properties that are being imported into BigQuery, but it seems that by using UNNEST(hits) the SQL is duplicating the data, leading to inaccurate values for revenue etc.
I have tried to understand why the UNNEST causes this, but I can't figure it out.
SELECT Date, hostname, channelGrouping, sum(transactionRevenue) as Revenue, sum(Shipping) as Shipping, sum(visits) as Sessions, sum(bounces) as Bounces, sum(transactions) as Transactions
FROM
(SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `102674002.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509')
Group By Date, hostname, channelGrouping
Order by Date
This might do the trick:
SELECT
date,
channelGrouping,
SUM(Revenue) Revenue,
SUM(Shipping) Shipping,
SUM(bounces) bounces,
SUM(transactions) transactions,
hostname,
COUNT(date) sessions
FROM(
SELECT
date,
channelGrouping,
totals.totaltransactionrevenue / 1e6 Revenue,
ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL)) hostnames,
(SELECT SUM(hits.transaction.transactionshipping) / 1e6 FROM UNNEST(hits) hits) Shipping,
totals.bounces bounces,
totals.transactions transactions
FROM `project_id.dataset_id.ga_sessions_*`
WHERE 1 = 1
AND ARRAY_LENGTH(ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL))) > 0
AND _TABLE_SUFFIX BETWEEN '20170601' AND '20170609'
UNION ALL
(...)
),
UNNEST(hostnames) hostname
GROUP BY
date, channelGrouping, hostname
Notice that in this query I avoided applying the UNNEST operation to the hits field at the top level; I do so only inside subselects.
To understand why, you have to understand how GA data is stored in BigQuery. We basically have 2 types of data: session-level data and hit-level data. Each visit to your website ends up generating a row in BigQuery, like so:
{fullvisitorid: 1, visitid: 1, date: '20170601', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 0, bounces: 0}}
If the same customer comes back a day later, it generates another row in BQ, something like:
{fullvisitorid: 1, visitid: 2, date: '20170602', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 50000000, bounces: 2}}
As you can see, fields outside the hits key are session-level, while each hit (i.e., each interaction the customer has on your website) adds another entry to the hits array. When you apply UNNEST, you basically apply a cross join between the values inside the array and the outer fields.
And this is where duplication happens!
Given the previous example, if we apply UNNEST to the hits field, we end up with something like:
fullvisitorid visitid totals.totalTransactionRevenue hits.hitNumber
1 1 0 1
1 1 0 2
1 2 50000000 1
1 2 50000000 2
Notice that each hit inside the hits field causes the outer fields, such as totals.totalTransactionRevenue, to be repeated once per hitNumber in the hits ARRAY.
So if you later apply an operation like SUM(totals.totalTransactionRevenue), you end up summing this field multiplied by the number of hits the customer had in that visitid.
What I tend to do is avoid the UNNEST operation on the hits field at the top level (it is costly, depending on the data volume) and apply it only in subqueries, where the unnesting happens within each row and does not duplicate data.
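A minimal Python sketch of the effect, using the two example sessions above reduced to the fields that matter (the dictionary layout is an illustrative assumption, not the real export schema):

```python
# The two example sessions, reduced to session-level revenue and hit numbers.
sessions = [
    {"visitId": 1, "revenue": 0, "hits": [1, 2]},
    {"visitId": 2, "revenue": 50000000, "hits": [1, 2]},
]

# Summing after a top-level UNNEST: the session-level revenue is repeated
# once per hit, so it is counted len(hits) times.
unnested_sum = sum(s["revenue"] for s in sessions for _ in s["hits"])

# Summing at session level: each session's revenue is counted once.
session_sum = sum(s["revenue"] for s in sessions)
```

With two hits per session, the unnested sum comes out at twice the session-level sum, which is the inflation described above.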