How can I export Google Optimize Data in Google BigQuery? - sql

So I ran a Test and now want to export my Google Optimize data with BigQuery.
Unfortunately both variables "experimentId" and "experimentVariant" in the BigQuery Export are empty .. although there is Test Data in Google Analytics for this date range.
Is there a problem with the synthax?
StandardSQL:
SELECT
clientId,
visitId,
fullVisitorId,
exp.experimentId AS experimentId,
exp.experimentVariant AS experimentVariant,
trafficSource.source AS source,
trafficSource.medium AS medium,
hits.page.pagePath AS pagePath,
timestamp_seconds(visitStartTime+(CAST(ROUND(hits.time/1000) AS INT64)))
AS timestamp,
hits.eventInfo.eventCategory AS EventCategory,
hits.eventInfo.eventAction AS EventAction,
hits.eventInfo.eventLabel AS EventLabel
FROM `123456789.ga_sessions_*` LEFT JOIN
UNNEST(hits) hits LEFT JOIN
UNNEST(hits.experiment) exp
WHERE hits.page.pagePath LIKE '%page1/page2%' AND _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', date '2019-04-16') AND FORMAT_DATE('%Y%m%d', date
'2019-04-22')
GROUP BY
clientId,
visitId,
fullVisitorId,
experimentId,
experimentVariant,
source,
medium,
pagePath,
timestamp,
eventCategory,
EventAction,
EventLabel

Related

Use row_number() in BigQuery in CTE

I'm trying to include row numbers per orderId with this query but I get a error message saying "Unrecognized name: BQ" when selecting the columns in the subquery. I havn't used it too much so not sure where I'm doing it wrong. Can anyone see it?
WITH BQ AS(
SELECT
(SELECT customDimensions.value FROM UNNEST(t.customDimensions)
AS customDimensions WHERE customDimensions.index = 6) as
orderId_bq,
hits.eventinfo.eventaction as event_action,
hits.transaction.transactionId as trx_id,
hits.page.pagePath as page,
hitnumber AS hitnumber
FROM `xxxxx-xxxx.xxxxxx.ga_sessions_20210801` t,
UNNEST(HITS) as hits
WHERE (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 8) = 'se'
AND (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 4) = 'soffadirekt'
--AND hits.eventinfo.eventaction IN ('complete purchase')
--AND hits.transaction.transactionId IS NULL
--and hits.page.pagePath != '/backend-transaction'
--and hits.eventinfo.eventaction != 'backend transaction' )
)
SELECT
BQ.event_action,
BQ.trx_id,
BQ.page,
BQ.hitnumber,
FROM (SELECT Row_number()
OVER( PARTITION BY BQ.orderId_bq
ORDER BY BQ.hitnumber) as RN,
BQ.orderId_bq
from BQ
)
I did also try this but then it doesn't regognize 'flat.orderId' instead:
WITH BQ AS
(SELECT
(SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 6) as orderId_bq,
hits.eventinfo.eventaction as event_action,
hits.transaction.transactionId as trx_id,
hits.page.pagePath as page,
hitnumber AS hitnumber
FROM `xxxx-xxxxxxxx.ga_sessions_20210801` t,
UNNEST(HITS) as hits
WHERE (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 8) = 'se'
AND (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 4) = 'soffadirekt'
),
flat AS (
SELECT
*
from bq
)
SELECT
flat.orderId_bq,
flat.event_action,
flat.trx_id,
flat.page,
flat.hitnumber,
FROM (SELECT Row_number()
OVER( PARTITION BY flat.orderId_bq
ORDER BY flat.hitnumber) as RN,
flat.orderId_bq,
flat.event_action,
flat.trx_id,
flat.page,
flat.hitnumber
FROM flat
)
Query that worked:
WITH raw AS
(SELECT
(SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 6) as orderId_bq,
hits.eventinfo.eventaction as event_action,
hits.transaction.transactionId as trx_id,
hits.page.pagePath as page,
hitnumber AS hitnumber
FROM `xxx-xxx.xxxxxx.ga_sessions_*` t,
UNNEST(HITS) as hits
WHERE (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 8) = 'se'
AND (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 4) = 'soffadirekt'
AND _TABLE_SUFFIX BETWEEN '20211001' AND '20211002'
)
SELECT
raw.event_action,
raw.orderId_bq,
raw.trx_id,
raw.page,
raw.hitnumber,
FROM
(SELECT ROW_NUMBER()
OVER(ORDER BY raw.hitnumber DESC) as RN,
raw.event_action,
raw.orderId_bq,
raw.trx_id,
raw.page,
raw.hitnumber
FROM raw) AS raw
Error Unrecognized name: BQ happens because your are trying to query BQ.* that does not exist since you did not add the BQ alias on your subquery. Adding AS BQ should work. See query:
SELECT
BQ.event_action,
BQ.trx_id,
BQ.page,
BQ.hitnumber,
FROM (SELECT Row_number()
OVER( PARTITION BY BQ.orderId_bq
ORDER BY BQ.hitnumber) as RN,
BQ.orderId_bq
FROM BQ) AS BQ
Just a suggestion, it might be better to use a different alias so your query will be much readable.
I tested this using a table of mine where I add AS subq1 at my subquery. See a simple test:
WITH subq1 AS (SELECT amount_paid,customer from `my-project.test_dataset.myTable`)
SELECT
subq1.RN,
subq1.customer,
subq1.amount_paid
FROM
(SELECT ROW_NUMBER()
OVER(ORDER BY subq1.amount_paid) as RN,
subq1.customer,
subq1.amount_paid
FROM subq1) AS subq1
LIMIT 3
Results:

Using OFFSET instead of UNNEST for nested fields in Google Bigquery

A quick question to GBQ gurus.
Here are two queries that are identical in their purpose
first
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits[
OFFSET(0)].time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits[OFFSET(0)].eventInfo.eventAction,
hits[OFFSET(0)].TRANSACTION.transactionId,
hits[OFFSET(0)].TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`
WHERE
ARRAY_LENGTH(hits) > 0
AND _table_suffix BETWEEN '20200201'
AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
second
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits.time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits.eventInfo.eventAction,
hits.TRANSACTION.transactionId,
hits.TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`, UNNEST(hits) hits
WHERE
_table_suffix BETWEEN '20200201' AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
The 1st one uses OFFSET to extract data from nested fields. According to execution details report, the query requires about 1.5 MB of shuffling.
The 2nd query uses UNNEST to reach nested data. And the amount of shuffled bytes is around (!) 75 MB
The amount of processed data is the same in both cases.
Now, the question is:
Does that mean that according to this article which concerns optimizing communication between slots I should uses OFFSET instead of UNNEST to get the data stored in nested fields?
Thanks!
Let's consider following examples with using BigQuery public dataset.
UNNEST - returns 6 results:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT h FROM t, UNNEST(hits) h
OFFSET - returns 1 result:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT hits[OFFSET(0)] FROM t
Both queries are referencing to the same record inside a GA public table. They show that using a join with UNNEST will bring one row per element inside the array and using OFFSET(0) will bring only one row with the first element of the array.
The reason for difference in high data shuffling is because the UNNEST performs a JOIN operation, which requires the data to be organized in a specific way. The OFFSET approach takes only the first element of the array.

Converting Legacy SQL to Standard SQL - Enhannced Ecommerce

I am in no way a coder so I have tried but falling over on this.
I want to use this query from Googles Google Analytics Big Query Cookbook
Products purchased by customers who purchased product A (Enhanced Ecommerce)
I have pasted the code below
Into Standard SQL.
I have made a few attemps but am falling over and not
Thank you in advance
John
SELECT hits.product.productSKU AS other_purchased_products,
COUNT(hits.product.productSKU) AS quantity
FROM (
SELECT fullVisitorId, hits.product.productSKU, hits.eCommerceAction.action_type
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
WHERE hits.product.productSKU CONTAINS 'GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND hits.product.productSKU IS NOT NULL
AND hits.product.productSKU !='GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC;
Below is pure equivalent in BigQuery Standard SQL (no any optimizations, improvements, etc. - just pure translation from legacy to standard)
SELECT productSKU AS other_purchased_products, COUNT(productSKU) AS quantity
FROM (
SELECT fullVisitorId, prod.productSKU, hit.eCommerceAction.action_type
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
AND prod.productSKU LIKE '%GGOEYOCR077799%'
AND hit.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND productSKU IS NOT NULL
AND productSKU !='GGOEYOCR077799'
AND action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC
obviously produces exactly same result as legacy version

BigQuery Error - Uncategorized name - When combining GA data

I want to combine dimensions date, country & source with sessions and unique events for event category "Downloads". Based on this data, I want to calculate the Download Conversionrate in DataStudio later on. To be honest I'm a noob in SQL. But I hope I'm thinking the right way at least.
Trying the query below I get the following error:
Unrecognized name: Downloads at [40:3]
WITH
ga_tables AS (
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
COUNT ( trafficSource.source ) AS Sessions
FROM
`xxxxxx.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
GROUP BY
date,
Source,
Country
UNION ALL
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS string),'-',CAST(visitId AS string),'-',CAST(date AS string),'-',ifnull(hits.eventInfo.eventLabel,
'null'))) AS Downloads
FROM
`xxxxxx.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Downloads'
GROUP BY
date,
Source,
Country )
SELECT
date,
Country,
Source,
Downloads,
Sessions
FROM
ga_tables
ORDER BY
Sessions ASC
In your with statement, the fourth column in the first select statement is named Sessions, while the fourth column in the statement it's unioned with is called Downloads. Due to the nature of UNION ALL, the final output column will be called Sessions, so it does not exist when you are querying it. If you want Sessions and Downloads to be separate columns, make the query look something like this:
WITH
ga_tables AS (
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
COUNT ( trafficSource.source ) AS Sessions,
NULL AS Downloads
FROM
`xxxxxx.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
GROUP BY
date,
Source,
Country
UNION ALL
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
NULL AS Sessions,
COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS string),'-',CAST(visitId AS string),'-',CAST(date AS string),'-',ifnull(hits.eventInfo.eventLabel,
'null'))) AS Downloads
FROM
`xxxxxx.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Downloads'
GROUP BY
date,
Source,
Country )
Edit: Given what it looks like you want to do with the table though, you might want to rewrite ga_tables to be something like this instead:
WITH
ga_tables AS (SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
MAX(Sessions) AS Sessions,
COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS string),'-',CAST(visitId AS string),'-',CAST(date AS string),'-',ifnull(hits.eventInfo.eventLabel,
'null'))) AS Downloads
FROM (
SELECT
*,
COUNT(trafficSource.source) OVER (PARTITION BY date, Source, Country) AS Sessions
FROM
`xxxxxx.ga_sessions_*`),
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Downloads'
GROUP BY
date,
Source,
Country)

Bigquery unnest hits - duplicating values)

Im trying to create a master view of a group in properties that are been imported into big query but it seem by using the unnest(hits) the SQL is duplicating the data leading to inaccurate values for revenues etc...
I have try to look at understanding why the unnest has caused this but I can't figure it out.
SELECT Date, hostname, channelGrouping, sum(transactionRevenue) as Revenue, sum(Shipping) as Shipping, sum(visits) as Sessions, sum(bounces) as Bounces, sum(transactions) as Transactions
FROM
(SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `102674002.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509')
Group By Date, hostname, channelGrouping
Order by Date
This might do the trick:
SELECT
date,
channelGrouping,
SUM(Revenue) Revenue,
SUM(Shipping) Shipping,
SUM(bounces) bounces,
SUM(transactions) transactions,
hostname,
COUNT(date) sessions
FROM(
SELECT
date,
channelGrouping,
totals.totaltransactionrevenue / 1e6 Revenue,
ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL)) hostnames,
(SELECT SUM(hits.transaction.transactionshipping) / 1e6 FROM UNNEST(hits) hits) Shipping,
totals.bounces bounces,
totals.transactions transactions
FROM `project_id.dataset_id.ga_sessions_*`
WHERE 1 = 1
AND ARRAY_LENGTH(ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL))) > 0
AND _TABLE_SUFFIX BETWEEN '20170601' AND '20170609'
UNION ALL
(...)
),
UNNEST(hostnames) hostname
GROUP BY
date, channelGrouping, hostname
Notice that in this query I avoided applying the UNNEST operation in the hits field and I do so only inside subselects.
In order to understand why this is the case you have to understand how ga data is aggregated into BigQuery. Notice that we basically have 2 types of data: the session level data and the hits level. Each client visiting your website ends up generating a row into BigQuery, like so:
{fullvisitorid: 1, visitid:1, date: '20170601', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}, totals: {totalTransactionRevenue:0, bounces: 0}]
If the same customer comes back a day later it generates another row into BQ, something like:
{fullvisitorid: 1, visitid:2, date: '20170602', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}, totals: {totalTransactionRevenue:50000000, bounces: 2}]
As you can see, fields outside the key hits are related to the session level (and therefore each hit, i.e, each interaction the customer has in your website, adds up another entry here). When you apply UNNEST, you basically, apply a cross-join with all values inside of the array to the outer fields.
And this is where duplication happens!
Given the past example, if we apply UNNEST to the hits field, you end up with something like:
fullvisitorid visitid totals.totalTransactionRevenue hits.hitNumber
1 1 0 1
1 1 0 2
1 2 50000000 1
1 2 50000000 2
Notice that for each hit inside the hits field causes the outer fields, such as totals.totalTransactionRevenue to be duplicated for each hitNumber that happened inside the hits ARRAY.
So, if later on, you apply some operation like SUM(totals.totalTransactionRevenue) you end up summing this field multiplied by each hit that the customer had in that visitid.
What I tend to do is to avoid the (costly depending on the data volume) UNNEST operation on the hits field and I do so only in subqueries (where the unnesting happens only at the row level which does not duplicate data).