Converting Legacy SQL to Standard SQL - Enhanced Ecommerce - google-bigquery

I am in no way a coder, and I am stuck on this.
I want to convert this query from Google's Google Analytics BigQuery Cookbook,
"Products purchased by customers who purchased product A (Enhanced Ecommerce)",
into Standard SQL. I have pasted the code below.
I have made a few attempts but keep falling over.
Thank you in advance
John
SELECT hits.product.productSKU AS other_purchased_products,
COUNT(hits.product.productSKU) AS quantity
FROM (
SELECT fullVisitorId, hits.product.productSKU, hits.eCommerceAction.action_type
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
WHERE hits.product.productSKU CONTAINS 'GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND hits.product.productSKU IS NOT NULL
AND hits.product.productSKU !='GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC;

Below is a direct equivalent in BigQuery Standard SQL (no optimizations or improvements, just a pure translation from legacy to standard):
SELECT productSKU AS other_purchased_products, COUNT(productSKU) AS quantity
FROM (
SELECT fullVisitorId, prod.productSKU, hit.eCommerceAction.action_type
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
AND prod.productSKU LIKE '%GGOEYOCR077799%'
AND hit.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND productSKU IS NOT NULL
AND productSKU !='GGOEYOCR077799'
AND action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC
It produces exactly the same result as the legacy version.

Related

BigQuery GA data: Scalar subquery produced more than one element in SQL

I am trying to calculate the percentage of sessions per month, based on certain categories of pages that users have visited on the website, in Google BigQuery.
The query is below. It appears to have no syntax errors, but when I run it I get: 'Scalar subquery produced more than one element'.
Do you know why this is happening? Thanks!
WITH cte AS (
SELECT
partition_date as session_date,
EXTRACT(MONTH FROM CAST(partition_date AS date)) AS month,
COUNT(CONCAT(CAST(fullVisitorId as string),
CAST(visitId as string)
)) AS sessions,
h.page.pagePath AS page_path,
CASE WHEN h.page.pagePath LIKE '%/account/profile/%' THEN 'My Profile'
WHEN h.page.pagePath LIKE '%/myaccount/orders%' THEN 'My Orders'
WHEN h.page.pagePath LIKE '%/myaccount/wishlist' THEN 'My Wishlist'
ELSE NULL END as categories,
(SELECT count(CONCAT(CAST(fullVisitorId as string),
CAST(visitId as string)
)) over (partition by EXTRACT(MONTH FROM CAST(partition_date AS date)))
FROM `*.BO_*.ga_sessions`,
UNNEST(hits) AS h
WHERE partition_date BETWEEN '2022-07-31' AND '2022-08-01') as total_sessions
FROM
`*.BO_*.ga_sessions`, UNNEST(hits) AS h
WHERE partition_date BETWEEN '2022-07-31' AND '2022-08-01'
GROUP BY 1, 5
)
SELECT
month,
categories,
total_sessions,
sessions/total_sessions,
FROM cte
WHERE categories IS NOT NULL
GROUP BY 1, 2, 3, 4
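A possible explanation, offered as a sketch rather than a confirmed diagnosis: a scalar subquery in the SELECT list must return at most one row, but a window function such as COUNT(...) OVER (PARTITION BY ...) returns one value per input row, so the subquery above yields one row per unnested hit. One way around this is to compute the monthly totals in their own CTE and join them back. The hits_flat table, its columns and its inline rows are made up for illustration; they stand in for the unnested ga_sessions data.

```sql
#standardSQL
-- Hypothetical inline data: one row per (session, category) after unnesting hits.
WITH hits_flat AS (
  SELECT 7 AS month, 's1' AS session_id, 'My Profile' AS category UNION ALL
  SELECT 8, 's2', 'My Orders' UNION ALL
  SELECT 8, 's3', 'My Orders' UNION ALL
  SELECT 8, 's3', 'My Wishlist'
),
-- Exactly one row per month, so it can be joined safely.
monthly_totals AS (
  SELECT month, COUNT(DISTINCT session_id) AS total_sessions
  FROM hits_flat
  GROUP BY month
),
-- Sessions per month and category.
category_sessions AS (
  SELECT month, category, COUNT(DISTINCT session_id) AS sessions
  FROM hits_flat
  GROUP BY month, category
)
SELECT
  c.month,
  c.category,
  c.sessions,
  t.total_sessions,
  SAFE_DIVIDE(c.sessions, t.total_sessions) AS share_of_sessions
FROM category_sessions AS c
JOIN monthly_totals AS t
  ON c.month = t.month
ORDER BY c.month, c.category
```

Counting distinct session ids, rather than counting rows, keeps a session with several matching hits from being counted more than once.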

Which part of my query is wrong? UNNEST function

I couldn't figure out which part of my code is wrong.
I used the UNNEST function, but I still get the error message
'Cannot access field productSKU on a value with type ARRAY<STRUCT<...>>' in Google BigQuery.
My query is below:
SELECT
hits.product.productSKU AS product_SKU,
hits.product.v2ProductName AS Product_Name,
SUM(totals.transactionRevenue) AS Total_Revenue,
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits.product) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170731' AND totals.transactions >= 1
Group by
hits.product.productSKU
Order by
v2ProductName DESC
Assuming the overall logic of your query reflects what you want to achieve, below is a corrected version that fixes the unnesting and adds the missing field to the GROUP BY. Hopefully you can see what was corrected.
#standardSQL
SELECT
product.productSKU AS product_SKU,
product.v2ProductName AS Product_Name,
SUM(totals.transactionRevenue) AS Total_Revenue,
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hit,
UNNEST(hit.product) AS product
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731' AND totals.transactions >= 1
GROUP BY product_SKU, Product_Name
ORDER BY Product_Name DESC
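The error comes from the fact that hits is itself an ARRAY, so hits.product cannot be reached until hits has been unnested. The following self-contained sketch shows the two-step unnesting on a tiny inline table. The sessions table and its values are invented and only mimic a few fields of the GA export schema.

```sql
#standardSQL
-- Hypothetical one-row table shaped like ga_sessions: hits is an ARRAY of STRUCTs,
-- and each hit carries its own ARRAY of product STRUCTs.
WITH sessions AS (
  SELECT
    'v1' AS fullVisitorId,
    [
      STRUCT(
        1 AS hitNumber,
        [
          STRUCT('SKU1' AS productSKU, 'Pen' AS v2ProductName),
          STRUCT('SKU2' AS productSKU, 'Mug' AS v2ProductName)
        ] AS product
      )
    ] AS hits
)
SELECT
  fullVisitorId,
  hit.hitNumber,
  product.productSKU,
  product.v2ProductName
FROM sessions,
  UNNEST(hits) AS hit,            -- first level: one row per hit
  UNNEST(hit.product) AS product  -- second level: one row per product in that hit
```

Each UNNEST exposes one array level as rows, which is why the nested product array needs its own UNNEST on the already unnested hit.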

Google Big Query: Get New Visitor Count using Custom Dimension

select PARSE_DATE('%Y%m%d', t.date) as Date
,count(distinct(fullvisitorid)) as User
,SUM( totals.newVisits ) AS New_Visitors
,(if(customDimensions.index=1, customDimensions.value,null)) as Orig
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
group by Date, orig
Is there a way to get the new visitor count and use the customDimension at the same time? SUM(totals.newVisits) doesn't work.
Thanks
Below is for BigQuery Standard SQL
SELECT DATE
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( newVisits ) AS New_Visitors
,Orig
FROM (
SELECT PARSE_DATE('%Y%m%d', t.date) AS DATE
,fullvisitorid
,totals.newVisits AS newVisits
,(IF(customDimensions.index=1, customDimensions.value,NULL)) AS Orig
FROM `table` AS t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
GROUP BY DATE, orig, fullvisitorid, newVisits
)
GROUP BY DATE, Orig
The best way in your case is to remove the cross-joins and use sub-selects instead:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM UNNEST(customDimensions) WHERE index=1) Orig
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( totals.newVisits ) AS New_Visitors
FROM
`table` t
GROUP BY Orig, Date
If you have a hit-scoped dimension and really need to flatten the table, you need to build a session id that you can count distinct. That is because the cross-join repeats every session-scoped field on each hit row:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,h.page.pagePathLevel1
,COUNT(DISTINCT(fullvisitorid)) AS User
-- create session id and count distinct
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)) ) AS all_sessions
-- only count distinct session id of sessions where totals.newVisits = 1
,COUNT(DISTINCT
IF(totals.newVisits=1,
CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)),
NULL )
) AS New_Visitors
FROM
-- flatten table to hit scope (a comma means cross-join in standard SQL)
`table` t, t.hits h
GROUP BY 1,2,3
So for new visitors I only provide a session id if totals.newVisits=1; otherwise the IF expression returns NULL, which COUNT ignores.
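A small sketch of that counting trick on made-up data, assuming a table that has already been flattened to hit scope. The flattened table and its rows are hypothetical: session A is a new visit with two hits, session B is a returning visit with two hits.

```sql
#standardSQL
-- Hypothetical hit-scope rows: session fields are repeated once per hit.
WITH flattened AS (
  SELECT 'A' AS fullvisitorid, 100 AS visitstarttime, 1 AS newVisits UNION ALL
  SELECT 'A', 100, 1 UNION ALL
  SELECT 'B', 200, CAST(NULL AS INT64) UNION ALL
  SELECT 'B', 200, CAST(NULL AS INT64)
)
SELECT
  -- distinct session ids: 2
  COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING))) AS all_sessions,
  -- session id only when newVisits = 1, NULL otherwise (NULLs are not counted): 1
  COUNT(DISTINCT
    IF(newVisits = 1,
       CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)),
       NULL)
  ) AS new_visitors,
  -- summing the repeated session field counts session A once per hit: 2
  SUM(newVisits) AS naive_sum
FROM flattened
```

The naive SUM overcounts because session A's newVisits value appears on both of its hit rows, while the distinct session id collapses them back to one.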
If you have something similar on product-scope, you'd need to create an ID that takes into account session and hit.
E.g. counting pages for productSku:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,p.productSku
,COUNT(DISTINCT fullvisitorid) AS users
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING))) AS sessions
,COUNT(DISTINCT
IF(h.type='PAGE',
CONCAT(fullvisitorid, cast(visitstarttime AS STRING),CAST(hitNumber AS STRING)),
NULL)
) as pageviews
,COUNT(1) AS products
FROM
`table` t, t.hits h LEFT JOIN h.product p
GROUP BY 1,2,3
Note that I'm left joining the product array. Since it is sometimes empty, a cross-join would destroy all information for those hits: a cross-join with an empty array results in no rows.
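To see the difference on a toy example, the sketch below uses an invented table t in which hit 2 has an empty product array. For brevity, product is modelled here as a plain ARRAY of strings rather than the ARRAY of STRUCTs found in the real export; that simplification is an assumption of this example.

```sql
#standardSQL
-- Hypothetical session with two hits; the second hit has no products.
WITH t AS (
  SELECT
    'A' AS fullvisitorid,
    [
      STRUCT(1 AS hitNumber, ['SKU1', 'SKU2'] AS product),
      STRUCT(2 AS hitNumber, ARRAY<STRING>[] AS product)
    ] AS hits
)
SELECT
  h.hitNumber,
  p AS productSku
FROM t, t.hits AS h
-- LEFT JOIN keeps hit 2 with a NULL productSku;
-- replacing it with a comma (cross-join) would drop hit 2 entirely.
LEFT JOIN UNNEST(h.product) AS p
ORDER BY h.hitNumber, productSku
```

Swapping the LEFT JOIN for a comma and rerunning is a quick way to confirm that the hit without products disappears from the result.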
Hope that helps!

BigQuery unnest hits - duplicating values

I'm trying to create a master view of a group of properties that are being imported into BigQuery, but it seems that by using UNNEST(hits) the SQL is duplicating the data, leading to inaccurate values for revenue etc.
I have tried to understand why the UNNEST causes this, but I can't figure it out.
SELECT Date, hostname, channelGrouping, sum(transactionRevenue) as Revenue, sum(Shipping) as Shipping, sum(visits) as Sessions, sum(bounces) as Bounces, sum(transactions) as Transactions
FROM
(SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `102674002.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509')
Group By Date, hostname, channelGrouping
Order by Date
This might do the trick:
SELECT
date,
channelGrouping,
SUM(Revenue) Revenue,
SUM(Shipping) Shipping,
SUM(bounces) bounces,
SUM(transactions) transactions,
hostname,
COUNT(date) sessions
FROM(
SELECT
date,
channelGrouping,
totals.totaltransactionrevenue / 1e6 Revenue,
ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL)) hostnames,
(SELECT SUM(hits.transaction.transactionshipping) / 1e6 FROM UNNEST(hits) hits) Shipping,
totals.bounces bounces,
totals.transactions transactions
FROM `project_id.dataset_id.ga_sessions_*`
WHERE 1 = 1
AND ARRAY_LENGTH(ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL))) > 0
AND _TABLE_SUFFIX BETWEEN '20170601' AND '20170609'
UNION ALL
(...)
),
UNNEST(hostnames) hostname
GROUP BY
date, channelGrouping, hostname
Notice that in this query I avoided applying the UNNEST operation to the hits field at the top level; I do so only inside subselects.
To understand why, you have to understand how GA data is stored in BigQuery. There are basically two types of data: session-level data and hit-level data. Each client visiting your website ends up generating a row in BigQuery, like so:
{fullvisitorid: 1, visitid: 1, date: '20170601', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 0, bounces: 0}}
If the same customer comes back a day later, it generates another row in BQ, something like:
{fullvisitorid: 1, visitid: 2, date: '20170602', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 50000000, bounces: 2}}
As you can see, fields outside the hits key are session-level, while each hit (i.e. each interaction the customer has with your website) adds another entry to the hits array. When you apply UNNEST, you essentially cross-join every element of the array with the outer fields.
And this is where the duplication happens!
Given the example above, if we apply UNNEST to the hits field, you end up with something like:
fullvisitorid  visitid  totals.totalTransactionRevenue  hits.hitNumber
1              1        0                               1
1              1        0                               2
1              2        50000000                        1
1              2        50000000                        2
Notice that each hit inside the hits field causes the outer fields, such as totals.totalTransactionRevenue, to be duplicated once per hitNumber in the hits ARRAY.
So if you later apply an operation like SUM(totals.totalTransactionRevenue), you end up summing this field multiplied by the number of hits the customer had in that visitid.
What I tend to do is avoid the UNNEST operation on the hits field at the top level (it can be costly, depending on the data volume) and apply it only in subqueries, where the unnesting happens within each row and does not duplicate data.
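The duplication can be reproduced with the two example sessions from this answer. The inline ga_sessions table below is a hypothetical stand-in that keeps only the fields needed for the arithmetic.

```sql
#standardSQL
-- Two hypothetical sessions with two hits each, mirroring the example rows above.
WITH ga_sessions AS (
  SELECT 1 AS fullvisitorid, 1 AS visitid,
    STRUCT(0 AS totalTransactionRevenue) AS totals,
    [STRUCT(1 AS hitNumber), STRUCT(2 AS hitNumber)] AS hits
  UNION ALL
  SELECT 1 AS fullvisitorid, 2 AS visitid,
    STRUCT(50000000 AS totalTransactionRevenue) AS totals,
    [STRUCT(1 AS hitNumber), STRUCT(2 AS hitNumber)] AS hits
)
SELECT
  -- one row per session: 0 + 50000000 = 50000000
  (SELECT SUM(totals.totalTransactionRevenue)
   FROM ga_sessions) AS session_level_revenue,
  -- one row per hit: (0 * 2) + (50000000 * 2) = 100000000
  (SELECT SUM(totals.totalTransactionRevenue)
   FROM ga_sessions, UNNEST(hits) AS h) AS flattened_revenue
```

The flattened total is exactly the session-level total multiplied by the two hits per session, which is the inflation described above.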

How to use multiple custom dimensions in Google Big Query

Is there a way to use multiple custom dimensions in GBQ without using the MAX function? My problem with using MAX is that it only keeps the max pax_num, but I would like to have the count of visitors for all of the combinations of (Date, product.v2ProductCategory, eCommerceAction.action_type, product.v2ProductName).
Note that pax_num is the number of pax on that ticket. I need every combination of dest + pax_num, not dest + max(pax_num).
SELECT
Date
,count(distinct( concat(FULLVISITORID,cast(visitID as string)))) as visitor
, product.v2ProductCategory as product_category
,max(if(customDimensions.index=2, customDimensions.value,null)) as dest
,max((if(customDimensions.index=21, customDimensions.value,null)) ) as pax_num
,eCommerceAction.action_type as Action_type
,product.v2ProductName as product_name
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions) AS customDimensions
CROSS JOIN UNNEST(hit.product) AS product
GROUP BY
Date
,product.v2ProductCategory
,eCommerceAction.action_type
,product.v2ProductName
Not sure if this is what you are looking for, but if you include the field pax_num in the group by you might already find what you need, like so:
select
date,
count(distinct( concat(FULLVISITORID,cast(visitID as string)))) as sessions,
product.v2ProductCategory category,
max(if(customDimensions.index=2, customDimensions.value, null)) as dest,
if(customDimensions.index=21, customDimensions.value,null) as pax_num,
eCommerceAction.action_type as act_type,
product.v2ProductName as product_name
from `table` as t,
unnest(hits) as hit,
unnest(hit.customDimensions) customDimensions,
unnest(hit.product) as product
group by
date,
category,
act_type,
pax_num,
product_name
having pax_num is not null
You gave as an example the pax_num values "paxnum_5" and "paxnum_6". If you insert the value pax_num in the group by operation, the count aggregation should happen on the level of pax_num which would preserve the values (and not mix everything into the max value as before).
Also, notice that if you count the distinct combinations of fullVisitorId and visitId, you are actually computing the total number of sessions, not visitors (the two are not the same).
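A quick illustration of the sessions versus visitors distinction, using a made-up sessions table in which visitor A has two sessions and visitor B has one:

```sql
#standardSQL
-- Hypothetical data: three sessions belonging to two visitors.
WITH sessions AS (
  SELECT 'A' AS fullVisitorId, 1 AS visitId UNION ALL
  SELECT 'A', 2 UNION ALL
  SELECT 'B', 3
)
SELECT
  -- distinct visitors: 2
  COUNT(DISTINCT fullVisitorId) AS visitors,
  -- distinct visitor + visit combinations, i.e. sessions: 3
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS sessions
FROM sessions
```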
Adding the fullVisitorId solved the problem:
SELECT
Date
,concat(fullVisitorID,cast(visitID as string)) as visitorID
,count(distinct( concat(FULLVISITORID,cast(visitID as string)))) as visitor
, product.v2ProductCategory as product_category
,max(if(customDimensions.index=2, customDimensions.value,null)) as dest
,max((if(customDimensions.index=21, customDimensions.value,null)) ) as pax_num
,eCommerceAction.action_type as Action_type
,product.v2ProductName as product_name
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions) AS customDimensions
CROSS JOIN UNNEST(hit.product) AS product
GROUP BY
Date
,product.v2ProductCategory
,eCommerceAction.action_type
,product.v2ProductName
,visitorID