FULL OUTER JOIN throwing error "wrap select in parentheses" - sql

I am having trouble with part of my SQL query. I receive this error:
Error: Syntax error: Each subquery argument for table-valued function calls must be enclosed in parentheses. To fix this, replace SELECT... with (SELECT...) at [32:5]
This points at the SELECT after the FULL OUTER JOIN EACH. As far as I can tell I have already wrapped it in parentheses, and I do not know what is wrong here, so any suggestions would be much appreciated.
I am trying to create a funnel that more accurately separates previous customers from new ones. There are 3 levels in the funnel in total; for "simplicity" I'll only show two.
SELECT
COUNT(s0.firstHit) AS pageId1,
SUM(s0.exit) AS pageId2,
COUNT(s1.firstHit) AS pageId3,
SUM(s1.exit) AS pageId4
FROM(
SELECT
s0.fullVisitorId,
s0.visitId,
s0.firstHit,
s0.exit,
s1.firstHit,
s1.exit
FROM (
SELECT
fullvisitorid,
visitid,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND 1 = 1
AND EXISTS(SELECT 1 FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId'))
AND EXISTS(SELECT 1 FROM UNNEST(hits) hits WHERE (SELECT COUNT(value) FROM UNNEST(hits.customDimensions) custd WHERE index=20) > 0)
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId) AS s0
FULL OUTER JOIN EACH(
SELECT
fullVisitorId,
visitId,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId) AS s1
ON
s0.fullVisitorId = s1.fullVisitorId
AND s0.visitId = s1.visitId ) s01

You can write this query without any JOIN operations at all.
For instance:
SELECT
fullvisitorid,
visitid,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstFunnelHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstExitFunnelFlag,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE (REGEXP_CONTAINS(page.pagePath, r'pageId')) OR ((select count(1) from unnest(hits) h, unnest(h.customDimensions) custd where custd.index = 20) > 0)) secondFunnelHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId') OR ((select count(1) from unnest(hits) h, unnest(h.customDimensions) custd where custd.index = 20) > 0)) AS secondFunnelExitFlag
FROM `dataset.ga_sessions_2017*`
WHERE 1 = 1
AND _TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
Notice that a single SELECT returns information about all visitors who have been on page "pageId" as well as visitors who have been on this page and fired the customDimension with index=20.
For each step in your funnel analysis you can add new columns to the result, such as firstFunnelHit and secondFunnelHit.
By avoiding expensive JOINs you can query terabytes of data and still get results in seconds.
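To illustrate why one pass over each session is enough, here is a minimal Python sketch that models a session row as a dict with a hits list and derives both funnel steps from it without joining anything. The sample sessions, the helper name funnel_flags, and the definition of the second step (same page, plus any hit in the session carrying custom dimension index 20) are assumptions made for the illustration, not part of the original query.

```python
import re

# Hypothetical sample data: one dict per session, loosely shaped like a
# ga_sessions row. All values are made up for illustration.
sessions = [
    {"fullVisitorId": "A", "visitId": 1, "hits": [
        {"hitNumber": 1, "pagePath": "/home", "isExit": False,
         "customDimensions": []},
        {"hitNumber": 2, "pagePath": "/pageId", "isExit": True,
         "customDimensions": [{"index": 20, "value": "returning"}]},
    ]},
    {"fullVisitorId": "B", "visitId": 7, "hits": [
        {"hitNumber": 1, "pagePath": "/pageId", "isExit": False,
         "customDimensions": []},
        {"hitNumber": 2, "pagePath": "/other", "isExit": True,
         "customDimensions": []},
    ]},
]


def funnel_flags(session, pattern=r"pageId"):
    """Compute both funnel steps for one session in a single pass, no join."""
    hits = session["hits"]
    # Step 1: hits whose pagePath matches, like the REGEXP_CONTAINS filter.
    page_hits = [h for h in hits if re.search(pattern, h["pagePath"])]
    # min/max over an empty selection yield None, mirroring SQL NULL.
    first_hit = min((h["hitNumber"] for h in page_hits), default=None)
    exit_flag = max((int(h["isExit"]) for h in page_hits), default=None)
    # Step 2 (assumed definition): same page, and some hit carries index 20.
    fired_cd20 = any(
        cd["index"] == 20 for h in hits for cd in h["customDimensions"]
    )
    return {
        "fullVisitorId": session["fullVisitorId"],
        "visitId": session["visitId"],
        "firstFunnelHit": first_hit,
        "firstExitFunnelFlag": exit_flag,
        "secondFunnelHit": first_hit if fired_cd20 else None,
        "secondFunnelExitFlag": exit_flag if fired_cd20 else None,
    }


rows = [funnel_flags(s) for s in sessions]
```

Every funnel step becomes one more key in the returned dict, which corresponds to adding one more column to the single SELECT.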

In the last subquery:
SELECT
fullVisitorId,
visitId
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId
it looks to me like you need a comma after visitId in the third line.
Best of luck.

There is no EACH keyword when using standard SQL; this is specific to legacy SQL. Remove that word and your query will probably work.

Related

Use row_number() in BigQuery in CTE

I'm trying to include row numbers per orderId with this query, but I get an error message saying "Unrecognized name: BQ" when selecting the columns in the subquery. I haven't used it much, so I'm not sure where I'm going wrong. Can anyone see it?
WITH BQ AS(
SELECT
(SELECT customDimensions.value FROM UNNEST(t.customDimensions)
AS customDimensions WHERE customDimensions.index = 6) as
orderId_bq,
hits.eventinfo.eventaction as event_action,
hits.transaction.transactionId as trx_id,
hits.page.pagePath as page,
hitnumber AS hitnumber
FROM `xxxxx-xxxx.xxxxxx.ga_sessions_20210801` t,
UNNEST(HITS) as hits
WHERE (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 8) = 'se'
AND (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 4) = 'soffadirekt'
--AND hits.eventinfo.eventaction IN ('complete purchase')
--AND hits.transaction.transactionId IS NULL
--and hits.page.pagePath != '/backend-transaction'
--and hits.eventinfo.eventaction != 'backend transaction' )
)
SELECT
BQ.event_action,
BQ.trx_id,
BQ.page,
BQ.hitnumber,
FROM (SELECT Row_number()
OVER( PARTITION BY BQ.orderId_bq
ORDER BY BQ.hitnumber) as RN,
BQ.orderId_bq
from BQ
)
I also tried this, but then it doesn't recognize 'flat.orderId' instead:
WITH BQ AS
(SELECT
(SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 6) as orderId_bq,
hits.eventinfo.eventaction as event_action,
hits.transaction.transactionId as trx_id,
hits.page.pagePath as page,
hitnumber AS hitnumber
FROM `xxxx-xxxxxxxx.ga_sessions_20210801` t,
UNNEST(HITS) as hits
WHERE (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 8) = 'se'
AND (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 4) = 'soffadirekt'
),
flat AS (
SELECT
*
from bq
)
SELECT
flat.orderId_bq,
flat.event_action,
flat.trx_id,
flat.page,
flat.hitnumber,
FROM (SELECT Row_number()
OVER( PARTITION BY flat.orderId_bq
ORDER BY flat.hitnumber) as RN,
flat.orderId_bq,
flat.event_action,
flat.trx_id,
flat.page,
flat.hitnumber
FROM flat
)
Query that worked:
WITH raw AS
(SELECT
(SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 6) as orderId_bq,
hits.eventinfo.eventaction as event_action,
hits.transaction.transactionId as trx_id,
hits.page.pagePath as page,
hitnumber AS hitnumber
FROM `xxx-xxx.xxxxxx.ga_sessions_*` t,
UNNEST(HITS) as hits
WHERE (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 8) = 'se'
AND (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 4) = 'soffadirekt'
AND _TABLE_SUFFIX BETWEEN '20211001' AND '20211002'
)
SELECT
raw.event_action,
raw.orderId_bq,
raw.trx_id,
raw.page,
raw.hitnumber,
FROM
(SELECT ROW_NUMBER()
OVER(ORDER BY raw.hitnumber DESC) as RN,
raw.event_action,
raw.orderId_bq,
raw.trx_id,
raw.page,
raw.hitnumber
FROM raw) AS raw
The error Unrecognized name: BQ happens because you are trying to query BQ.*, which does not exist since you did not add the BQ alias to your subquery. Adding AS BQ should work. See query:
SELECT
BQ.event_action,
BQ.trx_id,
BQ.page,
BQ.hitnumber,
FROM (SELECT Row_number()
OVER( PARTITION BY BQ.orderId_bq
ORDER BY BQ.hitnumber) as RN,
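BQ.orderId_bq,
-- the outer SELECT can only see columns that are listed here
BQ.event_action,
BQ.trx_id,
BQ.page,
BQ.hitnumber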
FROM BQ) AS BQ
Just a suggestion: it might be better to use a different alias so your query is more readable.
I tested this on a table of mine, adding AS subq1 to my subquery. Here is a simple test:
WITH subq1 AS (SELECT amount_paid,customer from `my-project.test_dataset.myTable`)
SELECT
subq1.RN,
subq1.customer,
subq1.amount_paid
FROM
(SELECT ROW_NUMBER()
OVER(ORDER BY subq1.amount_paid) as RN,
subq1.customer,
subq1.amount_paid
FROM subq1) AS subq1
LIMIT 3

BigQuery (Google Analytics data): query two different 'hits.customDimensions.index' in the same 'hits.hitNumber'

my goal:
Count 1 for the session if the following two hits.customDimensions.index and associated hits.customDimensions.value appear in the same hits.hitNumber (every row is 1 session if main query is still nested):
['hits.customDimensions.index' = 43 with associated 'hits.customDimensions.value' IN ('login', 'payment', 'order', 'thankyou')] AND ['hits.customDimensions.index' = 10 with associated 'hits.customDimensions.value' = 'checkout'] (both in the same hits.hitNumber)
my problem:
I don't know how to query two different hits.customDimensions.value entries in the same hits.hitNumber in one subquery, without separate WITH tables. If it's possible, which I'm sure it is, the query would be very short and simple. Since I don't know how to express this use case in a subquery, I use a workaround that adds up to 5 WITH tables. I would appreciate an easier way to query this use case.
Explanation workaround query:
Table1: Queries all except the 'problem-metric'
Table2-3: Each table queries one hits.customDimensions.index with associated hits.customDimensions.value filtered for the correct value, sessionId and hitNumber
table4: left join table 2 with table 3 based on date, sessionID and hitNumber. Basically if hitNumber combined with sessionId from table2 and table3 match I count 1
table5: left join table1 with table4 to combine the data
#Table1 - complete data except session_atleast_loginCheckout
WITH
prepared_data AS (
SELECT
date,
SUM((SELECT 1 FROM UNNEST(hits) WHERE CAST(eCommerceAction.action_type AS INT64) BETWEEN 4 AND 6 LIMIT 1)) AS sessions_atleast_basket,
#insert in this row query for sessions_atleast_loginCheckout
SUM((SELECT 1 FROM UNNEST(hits) as h, UNNEST(h.customDimensions) as hcd WHERE index = 43 AND value IN ('payment', 'order', 'thankyou') LIMIT 1)) AS sessions_atleast_payment,
FROM
`big-query-221916.172008714.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND totals.visits = 1
GROUP BY
date
#Table2 - data for hits.customDimensions.index = 10 AND associated hits.customDimensions.value = 'checkout' with hits.hitNumber and sessionId (join later based on hitNumber and sessionId)
loginCheckout_index10_pagetype_data AS (
SELECT
date AS date,
CONCAT(fullVisitorId, '/', CAST( visitStartTime AS STRING)) AS sessionId,
h.hitNumber AS hitNumber,
IF(hcd.value IS NOT NULL, 1, NULL) AS pagetype_checkout
FROM
`big-query-221916.172008714.ga_sessions_*` AS o, UNNEST(hits) as h, UNNEST(h.customDimensions) as hcd
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND hcd.index = 10 AND VALUE = 'checkout' AND h.type = 'PAGE' AND totals.visits = 1),
#Table3 - data for hits.customDimensions.index = 43 AND associated hits.customDimensions.value IN ('login', 'register', 'payment', 'order','thankyou') with hits.hitNumber and sessionId (join later based on hitNumber and sessionId)
loginCheckout_index43_pagelevel1_data AS (
SELECT
date AS date,
CONCAT(fullVisitorId, '/', CAST( visitStartTime AS STRING)) AS sessionId,
h.hitNumber AS hitNumber,
IF(hcd.value IS NOT NULL, 1, NULL) AS pagelevel1_login_to_thankyou
FROM
`big-query-221916.172008714.ga_sessions_*` AS o, UNNEST(hits) as h, UNNEST(h.customDimensions) as hcd
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND hcd.index = 43 AND VALUE IN ('login', 'register', 'payment', 'order', 'thankyou') AND h.type = 'PAGE'
),
#table4 - left join table2 and table 3 on sessionId and hitNumber to get sessions_atleast_loginCheckout
loginChackout_output_data AS(
SELECT
a.date AS date,
COUNT(DISTINCT a.sessionId) AS sessions_atleast_loginCheckout
FROM
loginCheckout_index10_pagetype_data AS a
LEFT JOIN
loginCheckout_index43_pagelevel1_data AS b
ON
a.date = b.date AND
a.sessionId = b.sessionId AND
a.hitNumber = b.hitNumber
WHERE
pagelevel1_login_to_thankyou IS NOT NULL
GROUP BY
date
#table5 - leftjoin table1 with table4 to get all data together
SELECT
prep.date,
prep.sessions_atleast_basket,
log.sessions_atleast_loginCheckout,
prep.sessions_atleast_payment
FROM
prepared_data AS prep
LEFT JOIN
loginChackout_output_data as log
ON
prep.date = log.date AND
It's a bit like Inception, but maybe it helps to keep in mind that the input of UNNEST() is an array and its output is a set of table rows, so one UNNEST subquery can sit inside another:
SELECT
SUM(totals.visits) as sessions
FROM
`big-query-221916.172008714.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND -- the following two hits.customDimensions.index and associated hits.customDimensions.value appear in the same hits.hitNumber
(SELECT COUNT(1)>0 as hitsCountMoreThanZero FROM UNNEST(hits) AS h
WHERE
-- index 43, value IN ('login', 'payment', 'order', 'thankyou')
(select count(1)>0 from unnest(h.customdimensions) where index=43 and value IN ('login', 'payment', 'order', 'thankyou'))
AND
-- index 10, value = 'checkout'
(select count(1)>0 from unnest(h.customdimensions) where index=10 and value='checkout')
)
GROUP BY
date
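The nesting in that query can be read as "some hit exists for which both inner checks are true". A small Python sketch of the same logic, assuming each session is a dict with a hits list and each hit has a customDimensions list (the sample data and the names hit_matches and session_counts are made up for illustration):

```python
# Values accepted for custom dimension 43, taken from the stated goal.
ALLOWED_43 = {"login", "payment", "order", "thankyou"}


def hit_matches(hit):
    """True when this single hit carries both custom dimensions."""
    dims = hit["customDimensions"]
    has_43 = any(d["index"] == 43 and d["value"] in ALLOWED_43 for d in dims)
    has_10 = any(d["index"] == 10 and d["value"] == "checkout" for d in dims)
    return has_43 and has_10


def session_counts(session):
    """True when at least one hit of the session matches (COUNT(1) > 0)."""
    return any(hit_matches(h) for h in session["hits"])


# Hypothetical sessions for illustration only.
sessions = [
    # Both dimensions sit on hit 2, so this session counts.
    {"visits": 1, "hits": [
        {"hitNumber": 1, "customDimensions": [{"index": 10, "value": "home"}]},
        {"hitNumber": 2, "customDimensions": [
            {"index": 10, "value": "checkout"},
            {"index": 43, "value": "login"},
        ]},
    ]},
    # Both dimensions appear, but on different hits, so it does not count.
    {"visits": 1, "hits": [
        {"hitNumber": 1, "customDimensions": [{"index": 10, "value": "checkout"}]},
        {"hitNumber": 2, "customDimensions": [{"index": 43, "value": "payment"}]},
    ]},
]

# Equivalent of SUM(totals.visits) over the sessions that pass the filter.
matching_sessions = sum(s["visits"] for s in sessions if session_counts(s))
```

The second sample session is the case the five-table workaround handled by joining on hitNumber: here it is excluded simply because both checks run against the same hit.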

How can I find the previous page with Bigquery

I want to find the previous page when the current page is a product page.
For example, the current page is 'https://www.emag.ro/telefon-mobil-apple-iphone-x-64gb-4g-space-grey-mqac2rm-a/pd/DN094NBBM' and the previous page is 'https://www.emag.ro/search/telefoane-mobile/IPHONE/c?ref=srcql'.
In terms of hitNumber, how can I return how many users showed this behavior?
I tried these two queries and want to JOIN them, but I don't know the best way to do it.
I also tried the LAG function, but I'm not sure it catches all the users.
Thank you in advance.
with
view_product as (
SELECT
ga.fullVisitorId AS GA_USER_ID,
date as date,
h.hitnumber as hitnumber,
CONCAT(ga.fullVisitorId, cast(ga.visitId AS string)) AS SessionID,
(SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) AS PAGETYPE,
(SELECT VALUE FROM h.customDimensions WHERE index =8) as ref_parameter,
visitid as visitid,
h.page.pagePath as page_path
FROM
`emagbigquery.0` ga,
UNNEST(hits) AS h
WHERE h.type='PAGE'
AND _TABLE_SUFFIX = '20190115'
AND (SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) = 'viewproduct'
)
,
SEARCH_page_WITH_REF_SRCQL as (
select
date as date,
ga.fullVisitorId AS GA_USER_ID,
h.hitnumber as hitnumber,
CONCAT(ga.fullVisitorId, cast(ga.visitId AS string)) AS SessionID,
(SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) AS PAGETYPE,
(SELECT VALUE FROM h.customDimensions WHERE index =8) as ref_parameter,
visitid as visitid,
h.page.pagePath as page_path
FROM
`emagbigquery.0` ga,
UNNEST(hits) AS h
WHERE h.type='PAGE'
AND _TABLE_SUFFIX = '20190115'
AND (SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) = 'search'
AND (SELECT VALUE FROM h.customDimensions WHERE index =8) LIKE 'srcql'
)
select
COUNT(DISTINCT GA_USER_ID) AS USERS,
COUNT(DISTINCT SessionID) AS SESSIONS,
previous_page_from_srcql
from (
select
t1.ga_user_id,
t1.sessionid,
t2.hitnumber > t1.hitnumber as previous_page_from_srcql
from SEARCH_page_WITH_REF_SRCQL as t1
inner join view_product as t2
on t1.ga_user_id = t2.ga_user_id
group by
previous_page_from_srcql
Try UNNEST WITH OFFSET. It can give you an easy way to later determine that one row came after the other:
WITH path_and_prev AS (
SELECT ARRAY(
SELECT AS STRUCT session.page.pagePath
, LAG(session.page.pagePath) OVER(ORDER BY i) prevPagePath
FROM UNNEST(hits) session WITH OFFSET i
) x
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
)
SELECT COUNT(*) c, pagePath, prevPagePath
FROM path_and_prev, UNNEST(x)
WHERE pagePath='/vests/yellow.html'
AND prevPagePath='/vests/'
GROUP BY 2,3
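As a rough model of what LAG over the array offset produces, the following Python sketch pairs each page path with the one before it inside the same session and then counts a given transition. The sample sessions and the names path_and_prev and TARGET are assumptions for illustration; the paths are the ones used in the query above.

```python
from collections import Counter


def path_and_prev(hits):
    """Pair each hit's pagePath with the previous hit's pagePath.

    The first hit of a session has no predecessor, so it is paired with
    None, just as LAG returns NULL for the first row.
    """
    paths = [h["pagePath"] for h in hits]
    return list(zip(paths, [None] + paths[:-1]))


# Hypothetical sessions: each one is a list of hits in array order.
sessions = [
    [{"pagePath": "/vests/"}, {"pagePath": "/vests/yellow.html"}],
    [{"pagePath": "/"}, {"pagePath": "/vests/yellow.html"}],
    [{"pagePath": "/vests/"}, {"pagePath": "/vests/yellow.html"},
     {"pagePath": "/vests/"}, {"pagePath": "/vests/yellow.html"}],
]

# (pagePath, prevPagePath) pair we are looking for.
TARGET = ("/vests/yellow.html", "/vests/")

# COUNT(*) style: every occurrence of the transition is counted.
transitions = Counter(pair for hits in sessions for pair in path_and_prev(hits))
transition_count = transitions[TARGET]

# Distinct style: each session is counted at most once.
sessions_with_transition = sum(
    1 for hits in sessions if TARGET in path_and_prev(hits)
)
```

Note that counting rows gives the number of transitions, not the number of users or sessions; if the question is "how many users did this", a distinct count over a visitor or session id is probably closer to what is wanted.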

Google Big Query: Get New Visitor Count using Custom Dimension

select PARSE_DATE('%Y%m%d', t.date) as Date
,count(distinct(fullvisitorid)) as User
,SUM( totals.newVisits ) AS New_Visitors
,(if(customDimensions.index=1, customDimensions.value,null)) as Orig
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
group by Date, orig
Is there a way to get new visitor count and use the customDimension at the same time? The sum(total.newVisits) doesn't work.
Thanks
Below is for BigQuery Standard SQL
SELECT DATE
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( newVisits ) AS New_Visitors
,Orig
FROM (
SELECT PARSE_DATE('%Y%m%d', t.date) AS DATE
,fullvisitorid
,totals.newVisits AS newVisits
,(IF(customDimensions.index=1, customDimensions.value,NULL)) AS Orig
FROM `table` AS t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
GROUP BY DATE, orig, fullvisitorid, newVisits
)
GROUP BY DATE, Orig
The best way in your case is to remove the cross-joins and use sub-selects instead:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM UNNEST(customDimensions) WHERE index=1) Orig
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( totals.newVisits ) AS New_Visitors
FROM
`table` t
GROUP BY Orig, Date
If your dimension is hit-scoped and you really need to flatten the table, build a session id that you can count distinct. This is necessary because the cross-join repeats every session-scoped field once per hit:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,h.page.pagePathLevel1
,COUNT(DISTINCT(fullvisitorid)) AS User
-- create session id and count distinct
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)) ) AS all_sessions
-- only count distinct session id of sessions where totals.newVisits = 1
,COUNT(DISTINCT
IF(totals.newVisits=1,
CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)),
NULL )
) AS New_Visitors
FROM
-- flatten table to hit scope (comma means cross-join in stnd sql)
`table` t, t.hits h
GROUP BY 1,2,3
So for new visitors I only provide a session id if totals.newVisits=1; otherwise the IF expression returns NULL, which COUNT(DISTINCT ...) does not count.
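The difference between summing the repeated field and counting distinct session ids can be checked on a handful of flattened rows. This is a minimal Python sketch with made-up data; the rows list and the session_id helper are hypothetical and only mimic the CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)) expression.

```python
# Hypothetical flattened rows: session-scoped fields repeat on every hit.
rows = [
    {"fullVisitorId": "A", "visitStartTime": 100, "newVisits": 1, "hitNumber": 1},
    {"fullVisitorId": "A", "visitStartTime": 100, "newVisits": 1, "hitNumber": 2},
    {"fullVisitorId": "A", "visitStartTime": 100, "newVisits": 1, "hitNumber": 3},
    {"fullVisitorId": "B", "visitStartTime": 200, "newVisits": None, "hitNumber": 1},
    {"fullVisitorId": "B", "visitStartTime": 200, "newVisits": None, "hitNumber": 2},
]


def session_id(row):
    """Concatenate visitor id and visit start time into one session key."""
    return f'{row["fullVisitorId"]}{row["visitStartTime"]}'


# SUM(totals.newVisits) on flattened rows: NULLs are skipped, but the one
# new visit of visitor A is added once per hit.
naive_sum = sum(r["newVisits"] or 0 for r in rows)

# COUNT(DISTINCT IF(newVisits = 1, session_id, NULL)): rows that do not
# qualify contribute nothing, and repeats collapse in the set.
new_visitors = len({session_id(r) for r in rows if r["newVisits"] == 1})

# COUNT(DISTINCT session_id) over all rows.
all_sessions = len({session_id(r) for r in rows})
```

With these five rows the naive sum is 3 while the distinct count of new-visit sessions is 1, out of 2 sessions overall.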
If you have something similar on product scope, you need to create an ID that takes both session and hit into account.
E.g. counting pages for productSku:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,p.productSku
,COUNT(DISTINCT fullvisitorid) AS users
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING))) AS sessions
,COUNT(DISTINCT
IF(h.type='PAGE',
CONCAT(fullvisitorid, cast(visitstarttime AS STRING),CAST(hitNumber AS STRING)),
NULL)
) as pageviews
,COUNT(1) AS products
FROM
`table` t, t.hits h LEFT JOIN h.product p
GROUP BY 1,2,3
Note that I'm left joining the product array. Since it is sometimes empty, a cross-join would destroy all the hit information: a cross-join with an empty table results in an empty table.
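The effect of an empty product array is easy to reproduce outside BigQuery. Below is a small Python sketch, with hypothetical hits and helper names, that models the cross-join and the left join of each hit with its own product array:

```python
def cross_join_unnest(hits):
    """Model of `hits h, h.product p`: a hit without products yields no rows."""
    return [(h["hitNumber"], p) for h in hits for p in h["product"]]


def left_join_unnest(hits):
    """Model of `hits h LEFT JOIN h.product p`: every hit keeps one row."""
    out = []
    for h in hits:
        if h["product"]:
            out.extend((h["hitNumber"], p) for p in h["product"])
        else:
            # No products: keep the hit and fill the product side with None.
            out.append((h["hitNumber"], None))
    return out


# Hypothetical hits: the first one has an empty product array.
hits = [
    {"hitNumber": 1, "product": []},
    {"hitNumber": 2, "product": ["sku-1", "sku-2"]},
]

cross_rows = cross_join_unnest(hits)
left_rows = left_join_unnest(hits)
```

In the cross-join result hit 1 is gone entirely, so any pageview or session count built on top of it would silently miss that hit; the left join keeps it with an empty product.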
Hope that helps!

Bigquery unnest hits - duplicating values

I'm trying to create a master view of a group of properties that are being imported into BigQuery, but it seems that using UNNEST(hits) makes the SQL duplicate the data, leading to inaccurate values for revenue etc.
I have tried to understand why the UNNEST causes this, but I can't figure it out.
SELECT Date, hostname, channelGrouping, sum(transactionRevenue) as Revenue, sum(Shipping) as Shipping, sum(visits) as Sessions, sum(bounces) as Bounces, sum(transactions) as Transactions
FROM
(SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `102674002.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509')
Group By Date, hostname, channelGrouping
Order by Date
This might do the trick:
SELECT
date,
channelGrouping,
SUM(Revenue) Revenue,
SUM(Shipping) Shipping,
SUM(bounces) bounces,
SUM(transactions) transactions,
hostname,
COUNT(date) sessions
FROM(
SELECT
date,
channelGrouping,
totals.totaltransactionrevenue / 1e6 Revenue,
ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL)) hostnames,
(SELECT SUM(hits.transaction.transactionshipping) / 1e6 FROM UNNEST(hits) hits) Shipping,
totals.bounces bounces,
totals.transactions transactions
FROM `project_id.dataset_id.ga_sessions_*`
WHERE 1 = 1
AND ARRAY_LENGTH(ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL))) > 0
AND _TABLE_SUFFIX BETWEEN '20170601' AND '20170609'
UNION ALL
(...)
),
UNNEST(hostnames) hostname
GROUP BY
date, channelGrouping, hostname
Notice that in this query I avoided applying the UNNEST operation to the hits field at the top level; I only do so inside subselects.
To understand why, you have to understand how GA data is stored in BigQuery. There are basically 2 types of data: session-level data and hit-level data. Each visit a client makes to your website ends up generating one row in BigQuery, like so:
{fullvisitorid: 1, visitid: 1, date: '20170601', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 0, bounces: 0}}
If the same customer comes back a day later it generates another row into BQ, something like:
{fullvisitorid: 1, visitid: 2, date: '20170602', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 50000000, bounces: 2}}
As you can see, the fields outside the hits key belong to the session level, while each hit (i.e., each interaction the customer has with your website) adds another entry to the hits array. When you apply UNNEST, you basically cross-join every element of the array with the outer fields.
And this is where the duplication happens!
Given the example above, if we apply UNNEST to the hits field, you end up with something like:
fullvisitorid  visitid  totals.totalTransactionRevenue  hits.hitNumber
1              1        0                               1
1              1        0                               2
1              2        50000000                        1
1              2        50000000                        2
Notice that each hit inside the hits field causes the outer fields, such as totals.totalTransactionRevenue, to be repeated once per hitNumber in the hits ARRAY.
So if you later apply an operation like SUM(totals.totalTransactionRevenue), you end up summing this field once for every hit the customer had in that visitid.
What I tend to do is avoid the UNNEST operation on the hits field at the top level (it can be costly, depending on the data volume) and use it only in subqueries, where the unnesting happens within a single row and does not duplicate data.
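The duplication can be reproduced with the two example rows from above. This is a minimal Python sketch, not BigQuery code: the sessions list mirrors the two sample visits, and flattening it plays the role of a top-level UNNEST(hits).

```python
# The two example sessions from above, reduced to the fields that matter.
sessions = [
    {"fullVisitorId": 1, "visitId": 1, "totalTransactionRevenue": 0,
     "hits": [{"hitNumber": 1}, {"hitNumber": 2}]},
    {"fullVisitorId": 1, "visitId": 2, "totalTransactionRevenue": 50000000,
     "hits": [{"hitNumber": 1}, {"hitNumber": 2}]},
]

# Top-level UNNEST(hits): one output row per hit, with the session-level
# fields copied onto every one of them.
flattened = [
    (s["fullVisitorId"], s["visitId"], s["totalTransactionRevenue"], h["hitNumber"])
    for s in sessions
    for h in s["hits"]
]

# Summing the session-level field over hit-level rows counts it once per hit.
inflated_revenue = sum(row[2] for row in flattened)

# Summing over the un-flattened sessions counts it once per session.
correct_revenue = sum(s["totalTransactionRevenue"] for s in sessions)
```

Here the flattened sum is exactly twice the session-level sum because the paying session has two hits; with real data the inflation factor varies per session, which is why the totals look wrong rather than consistently doubled.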