Bigquery unnest hits - duplicating values) - google-bigquery

Im trying to create a master view of a group in properties that are been imported into big query but it seem by using the unnest(hits) the SQL is duplicating the data leading to inaccurate values for revenues etc...
I have try to look at understanding why the unnest has caused this but I can't figure it out.
SELECT Date, hostname, channelGrouping, sum(transactionRevenue) as Revenue, sum(Shipping) as Shipping, sum(visits) as Sessions, sum(bounces) as Bounces, sum(transactions) as Transactions
FROM
(SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `102674002.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509')
Group By Date, hostname, channelGrouping
Order by Date

This might do the trick:
SELECT
date,
channelGrouping,
SUM(Revenue) Revenue,
SUM(Shipping) Shipping,
SUM(bounces) bounces,
SUM(transactions) transactions,
hostname,
COUNT(date) sessions
FROM(
SELECT
date,
channelGrouping,
totals.totaltransactionrevenue / 1e6 Revenue,
ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL)) hostnames,
(SELECT SUM(hits.transaction.transactionshipping) / 1e6 FROM UNNEST(hits) hits) Shipping,
totals.bounces bounces,
totals.transactions transactions
FROM `project_id.dataset_id.ga_sessions_*`
WHERE 1 = 1
AND ARRAY_LENGTH(ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL))) > 0
AND _TABLE_SUFFIX BETWEEN '20170601' AND '20170609'
UNION ALL
(...)
),
UNNEST(hostnames) hostname
GROUP BY
date, channelGrouping, hostname
Notice that in this query I avoided applying the UNNEST operation in the hits field and I do so only inside subselects.
In order to understand why this is the case you have to understand how ga data is aggregated into BigQuery. Notice that we basically have 2 types of data: the session level data and the hits level. Each client visiting your website ends up generating a row into BigQuery, like so:
{fullvisitorid: 1, visitid:1, date: '20170601', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}, totals: {totalTransactionRevenue:0, bounces: 0}]
If the same customer comes back a day later it generates another row into BQ, something like:
{fullvisitorid: 1, visitid:2, date: '20170602', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}, totals: {totalTransactionRevenue:50000000, bounces: 2}]
As you can see, fields outside the key hits are related to the session level (and therefore each hit, i.e, each interaction the customer has in your website, adds up another entry here). When you apply UNNEST, you basically, apply a cross-join with all values inside of the array to the outer fields.
And this is where duplication happens!
Given the past example, if we apply UNNEST to the hits field, you end up with something like:
fullvisitorid visitid totals.totalTransactionRevenue hits.hitNumber
1 1 0 1
1 1 0 2
1 2 50000000 1
1 2 50000000 2
Notice that for each hit inside the hits field causes the outer fields, such as totals.totalTransactionRevenue to be duplicated for each hitNumber that happened inside the hits ARRAY.
So, if later on, you apply some operation like SUM(totals.totalTransactionRevenue) you end up summing this field multiplied by each hit that the customer had in that visitid.
What I tend to do is to avoid the (costly depending on the data volume) UNNEST operation on the hits field and I do so only in subqueries (where the unnesting happens only at the row level which does not duplicate data).

Related

Using OFFSET instead of UNNEST for nested fields in Google Bigquery

A quick question to GBQ gurus.
Here are two queries that are identical in their purpose
first
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits[
OFFSET(0)].time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits[OFFSET(0)].eventInfo.eventAction,
hits[OFFSET(0)].TRANSACTION.transactionId,
hits[OFFSET(0)].TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`
WHERE
ARRAY_LENGTH(hits) > 0
AND _table_suffix BETWEEN '20200201'
AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
second
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits.time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits.eventInfo.eventAction,
hits.TRANSACTION.transactionId,
hits.TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`, UNNEST(hits) hits
WHERE
_table_suffix BETWEEN '20200201' AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
The 1st one uses OFFSET to extract data from nested fields. According to execution details report, the query requires about 1.5 MB of shuffling.
The 2nd query uses UNNEST to reach nested data. And the amount of shuffled bytes is around (!) 75 MB
The amount of processed data is the same in both cases.
Now, the question is:
Does that mean that according to this article which concerns optimizing communication between slots I should uses OFFSET instead of UNNEST to get the data stored in nested fields?
Thanks!
Let's consider following examples with using BigQuery public dataset.
UNNEST - returns 6 results:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT h FROM t, UNNEST(hits) h
OFFSET - returns 1 result:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT hits[OFFSET(0)] FROM t
Both queries are referencing to the same record inside a GA public table. They show that using a join with UNNEST will bring one row per element inside the array and using OFFSET(0) will bring only one row with the first element of the array.
The reason for difference in high data shuffling is because the UNNEST performs a JOIN operation, which requires the data to be organized in a specific way. The OFFSET approach takes only the first element of the array.

Converting Legacy SQL to Standard SQL - Enhannced Ecommerce

I am in no way a coder so I have tried but falling over on this.
I want to use this query from Googles Google Analytics Big Query Cookbook
Products purchased by customers who purchased product A (Enhanced Ecommerce)
I have pasted the code below
Into Standard SQL.
I have made a few attemps but am falling over and not
Thank you in advance
John
SELECT hits.product.productSKU AS other_purchased_products,
COUNT(hits.product.productSKU) AS quantity
FROM (
SELECT fullVisitorId, hits.product.productSKU, hits.eCommerceAction.action_type
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
WHERE hits.product.productSKU CONTAINS 'GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND hits.product.productSKU IS NOT NULL
AND hits.product.productSKU !='GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC;
Below is pure equivalent in BigQuery Standard SQL (no any optimizations, improvements, etc. - just pure translation from legacy to standard)
SELECT productSKU AS other_purchased_products, COUNT(productSKU) AS quantity
FROM (
SELECT fullVisitorId, prod.productSKU, hit.eCommerceAction.action_type
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
AND prod.productSKU LIKE '%GGOEYOCR077799%'
AND hit.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND productSKU IS NOT NULL
AND productSKU !='GGOEYOCR077799'
AND action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC
obviously produces exactly same result as legacy version

BigQuery Error - Uncategorized name - When combining GA data

I want to combine dimensions date, country & source with sessions and unique events for event category "Downloads". Based on this data, I want to calculate the Download Conversionrate in DataStudio later on. To be honest I'm a noob in SQL. But I hope I'm thinking the right way at least.
Trying the query below I get the following error:
Unrecognized name: Downloads at [40:3]
WITH
ga_tables AS (
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
COUNT ( trafficSource.source ) AS Sessions
FROM
`xxxxxx.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
GROUP BY
date,
Source,
Country
UNION ALL
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS string),'-',CAST(visitId AS string),'-',CAST(date AS string),'-',ifnull(hits.eventInfo.eventLabel,
'null'))) AS Downloads
FROM
`xxxxxx.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Downloads'
GROUP BY
date,
Source,
Country )
SELECT
date,
Country,
Source,
Downloads,
Sessions
FROM
ga_tables
ORDER BY
Sessions ASC
In your with statement, the fourth column in the first select statement is named Sessions, while the fourth column in the statement it's unioned with is called Downloads. Due to the nature of UNION ALL, the final output column will be called Sessions, so it does not exist when you are querying it. If you want Sessions and Downloads to be separate columns, make the query look something like this:
WITH
ga_tables AS (
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
COUNT ( trafficSource.source ) AS Sessions,
NULL AS Downloads
FROM
`xxxxxx.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
GROUP BY
date,
Source,
Country
UNION ALL
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
NULL AS Sessions,
COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS string),'-',CAST(visitId AS string),'-',CAST(date AS string),'-',ifnull(hits.eventInfo.eventLabel,
'null'))) AS Downloads
FROM
`xxxxxx.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Downloads'
GROUP BY
date,
Source,
Country )
Edit: Given what it looks like you want to do with the table though, you might want to rewrite ga_tables to be something like this instead:
WITH
ga_tables AS (SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
MAX(Sessions) AS Sessions,
COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS string),'-',CAST(visitId AS string),'-',CAST(date AS string),'-',ifnull(hits.eventInfo.eventLabel,
'null'))) AS Downloads
FROM (
SELECT
*,
COUNT(trafficSource.source) OVER (PARTITION BY date, Source, Country) AS Sessions
FROM
`xxxxxx.ga_sessions_*`),
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Downloads'
GROUP BY
date,
Source,
Country)

Google Big Query: Get New Visitor Count using Custom Dimension

select PARSE_DATE('%Y%m%d', t.date) as Date
,count(distinct(fullvisitorid)) as User
,SUM( totals.newVisits ) AS New_Visitors
,(if(customDimensions.index=1, customDimensions.value,null)) as Orig
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
group by Date, orig
Is there a way to get new visitor count and use the customDimension at the same time? The sum(total.newVisits) doesn't work.
Thanks
Below is for BigQuery Standard SQL
SELECT DATE
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( newVisits ) AS New_Visitors
,Orig
FROM (
SELECT PARSE_DATE('%Y%m%d', t.date) AS DATE
,fullvisitorid
,totals.newVisits AS newVisits
,(IF(customDimensions.index=1, customDimensions.value,NULL)) AS Orig
FROM `table` AS t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
GROUP BY DATE, orig, fullvisitorid, newVisits
)
GROUP BY DATE, Orig
The best way in your case is to remove the cross-joins and use sub-selects instead:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM UNNEST(customDimensions) WHERE index=1) Orig
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( totals.newVisits ) AS New_Visitors
FROM
`table` t
GROUP BY Orig, Date
In case you have a dimension on hit scope and really need to flatten the table, you need to build a session id you can count distinct. That is because you are repeating all session scoped fields on hit-scope by applying the cross-join:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,h.page.pagePathLevel1
,COUNT(DISTINCT(fullvisitorid)) AS User
-- create session id and count distinct
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)) ) AS all_sessions
-- only count distinct session id of sessions where totals.newVisits = 1
,COUNT(DISTINCT
IF(totals.newVisits=1,
CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)),
NULL )
) AS New_Visitors
FROM
-- flatten table to hit scope (comma means cross-join in stnd sql)
`table` t, t.hits h
GROUP BY 1,2,3
So in case for new visitors I only provide a session id if totals.newVisits=1 - else the if-statement provides NULL which is not countable.
If you have something similar on product-scope, you'd need to create an ID that takes into account session and hit.
E.g. counting pages for productSku:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,p.productSku
,COUNT(DISTINCT fullvisitorid) AS users
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING))) AS sessions
,COUNT(DISTINCT
IF(h.type='PAGE',
CONCAT(fullvisitorid, cast(visitstarttime AS STRING),CAST(hitNumber AS STRING)),
NULL)
) as pageviews
,COUNT(1) AS products
FROM
`table` t, t.hits h LEFT JOIN h.product p
GROUP BY 1,2,3
Note, that I'm left joining the product array. Since it sometimes is empty a cross-join would destroy all hits information: cross-join with empty table results in empty table.
Hope that helps!

FULL OUTER JOIN throwing error "wrap select in parentheses"

I am having trouble with a part of my SQL call, I receive this error
Error: Syntax error: Each subquery argument for table-valued function calls must be enclosed in parentheses. To fix this, replace SELECT... with (SELECT...) at [32:5]
This is at the SELECT after the FULL OUTER JOIN EACH, I'd argue that I have done that, I do not know what is wrong here so any suggestions would be much appriciated.
I am trying to create a funnel that more acurately sorts previous customers from new. There are in total 3 levels in the funnel, for "simplicity" I'll only show two.
SELECT
COUNT(s0.firstHit) AS pageId1,
SUM(s0.exit) AS pageId2,
COUNT(s1.firstHit) AS pageId3,
SUM(s1.exit) AS pageId4
FROM(
SELECT
s0.fullVisitorId,
s0.visitId,
s0.firstHit,
s0.exit,
s1.firstHit,
s1.exit
FROM (
SELECT
fullvisitorid,
visitid,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND 1 = 1
AND EXISTS(SELECT 1 FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId'))
AND EXISTS(SELECT 1 FROM UNNEST(hits) hits WHERE (SELECT COUNT(value) FROM UNNEST(hits.customDimensions) custd WHERE index=20) > 0)
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId) AS s0
FULL OUTER JOIN EACH(
SELECT
fullVisitorId,
visitId,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId) AS s1
ON
s0.fullVisitorId = s1.fullVisitorId
AND s0.visitId = s1.visitId ) s01
You can find ways to write this query and not having the JOIN operations going on.
For instance:
SELECT
fullvisitorid,
visitid,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstFunnelHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstExitFunnelFlag,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE (REGEXP_CONTAINS(page.pagePath, r'pageId')) OR ((select count(1) from unnest(hits) h, unnest(h.customDimensions) custd where custd.index = 20) > 0)) secondFunnelHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId') OR ((select count(1) from unnest(hits) h, unnest(h.customDimensions) custd where custd.index = 20) > 0)) AS secondFunnelExitFlag
FROM `dataset.ga_sessions_2017*`
WHERE 1 = 1
AND _TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
Notice that in just one SELECT you can bring information regarding all visitors who have been in page "pageId" and also visitors who have been in this page and fired the customDimension on index=20.
For each step in your funnel analyzes you can bring new columns as results, such as the firstFunnelHit and secondFunnelHit.
By avoiding expensive JOINs you can query up to teras of data and still have results in seconds.
In the last subquery:
SELECT
fullVisitorId,
visitId
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId
it looks to me like you need a comma after visitId in the third line.
Best of luck.
There is no EACH keyword when using standard SQL; this is specific to legacy SQL. Remove that word and your query will probably work.