Using OFFSET instead of UNNEST for nested fields in Google BigQuery

A quick question for BigQuery gurus.
Here are two queries that serve the same purpose.
First:
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits[OFFSET(0)].time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits[OFFSET(0)].eventInfo.eventAction,
hits[OFFSET(0)].TRANSACTION.transactionId,
hits[OFFSET(0)].TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`
WHERE
ARRAY_LENGTH(hits) > 0
AND _table_suffix BETWEEN '20200201'
AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
Second:
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits.time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits.eventInfo.eventAction,
hits.TRANSACTION.transactionId,
hits.TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`, UNNEST(hits) hits
WHERE
_table_suffix BETWEEN '20200201' AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
The first query uses OFFSET to extract data from nested fields. According to the execution details report, it requires about 1.5 MB of shuffling.
The second query uses UNNEST to reach the nested data, and the amount of shuffled bytes is around 75 MB (!).
The amount of processed data is the same in both cases.
Now, the question is:
Does that mean that, according to this article about optimizing communication between slots, I should use OFFSET instead of UNNEST to get the data stored in nested fields?
Thanks!

Consider the following examples, which use a BigQuery public dataset.
UNNEST - returns 6 results:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT h FROM t, UNNEST(hits) h
OFFSET - returns 1 result:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT hits[OFFSET(0)] FROM t
Both queries reference the same record in a GA public table. They show that a join with UNNEST returns one row per element of the array, while OFFSET(0) returns a single row containing only the first element of the array. So the two queries in the question are not equivalent.
The difference in shuffled bytes arises because UNNEST performs a JOIN operation, which requires the data to be organized in a specific way. The OFFSET approach takes only the first element of the array.
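The row-count difference can be reproduced outside BigQuery. The following is only a rough Python simulation of the two access patterns, not BigQuery code: the `session` dictionary and both helper functions are hypothetical stand-ins, and the six hits are chosen to match the row count reported above.

```python
# Hypothetical stand-in for one ga_sessions row: a session field plus a repeated "hits" field.
session = {
    "visitId": 1501571504,
    "hits": [{"hitNumber": n} for n in range(1, 7)],  # six hits, as in the example above
}

def unnest_hits(row):
    """Mimic `FROM t, UNNEST(hits) h`: one output row per array element."""
    return [{"visitId": row["visitId"], "h": hit} for hit in row["hits"]]

def first_hit(row):
    """Mimic `hits[OFFSET(0)]`: a single output row holding only the first element."""
    return [{"visitId": row["visitId"], "h": row["hits"][0]}]

unnested = unnest_hits(session)
offset_only = first_hit(session)
print(len(unnested), len(offset_only))
```

In this sketch the UNNEST-style function yields six rows and the OFFSET-style function yields one, so any aggregate computed over the two results is computed over different data.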

Related

Google Analytics Object in Google BigQuery

I'm trying to extract data from Google Analytics, but because of incompatibilities between dimensions and metrics, we decided to use Google BigQuery instead to obtain the GA4 data.
I'm struggling to find some metrics/dimensions in Google BigQuery, even after searching the documentation: https://support.google.com/analytics/answer/3437719?hl=en
Google Analytics Dimensions/Metrics:
These are the dimensions and metrics that I used in Google Analytics and can't find in Google BigQuery:
Users
Sessions (I used totals.visits, but I get only NULLs and 1s, while GA shows larger numbers)
TransactionsPerSession
CountryIsoCode (In GA it is only the country code, for instance, Spain --> ES, but in BigQuery it's the country's full name. This can be worked around, but it would be good to have the country code directly from the source)
avgSessionDuration
A great place to get this information is https://www.ga4bigquery.com/
I have copied one of my reports below; it covers points 1, 2, 3 and 5. I don't use country, but it can be found at the link above.
-- subquery to prepare the data
with prep_traffic as (
select
user_pseudo_id,
event_date as date,
count(distinct (ecommerce.transaction_id)) as Transactions,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_id,
max((select value.string_value from unnest(event_params) where key = 'session_engaged')) as session_engaged,
max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec')) as engagement_time_msec,
-- change event_name to the event(s) you want to count
countif(event_name = 'page_view') as event_count,
-- change event_name to the conversion event(s) you want to count
countif(event_name = 'add_payment_info') as conversions,
sum(ecommerce.purchase_revenue) as total_revenue
from
-- change this to your google analytics 4 bigquery export location
`bigquery****.events_*`
where
-- change the date range by using static and/or dynamic dates
_table_suffix between '20230129' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
group by
user_pseudo_id,
session_id,
event_date)
-- main query
select
count(distinct user_pseudo_id) as users,
count(distinct concat(user_pseudo_id,session_id)) as sessions,
count(distinct case when session_engaged = '1' then concat(user_pseudo_id,session_id) end) as engaged_sessions,
ROUND(safe_divide(count(distinct case when session_engaged = '1' then concat(user_pseudo_id,session_id) end),count(distinct user_pseudo_id)),2) as engaged_sessions_per_user,
ROUND(safe_divide(count(distinct case when session_engaged = '1' then concat(user_pseudo_id,session_id) end),count(distinct concat(user_pseudo_id,session_id))),2) as engagement_rate,
(sum(Transactions)) As transactions,
(sum(Transactions))/ count(distinct concat(user_pseudo_id,session_id)) as TransactionsPerSession,
-- engagement time in msec divided by the number of engaged sessions
safe_divide(sum(engagement_time_msec),count(distinct case when session_engaged = '1' then concat(user_pseudo_id,session_id) end)) as avgSessionDuration,
sum(event_count) as event_count,
sum(conversions) as conversions,
ifnull(sum(total_revenue),0) as total_revenue,
date
from
prep_traffic
group by
date
order by
date desc, users desc

Which part of my query is wrong? UNNEST function

I can't figure out which part of my query is wrong.
I used the UNNEST function, but I still get the error message
'Cannot access field productSKU on a value with type ARRAY>' in Google Bigquery.
My query is below:
SELECT
hits.product.productSKU AS product_SKU,
hits.product.v2ProductName AS Product_Name,
SUM(totals.transactionRevenue) AS Total_Revenue,
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits.product) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170731' AND totals.transactions >= 1
Group by
hits.product.productSKU
Order by
v2ProductName DESC
Assuming the overall logic of your query reflects what you want to achieve, below is a corrected version that fixes the unnesting and adds the missing field to the GROUP BY. Hopefully you can see what was changed.
#standardSQL
SELECT
product.productSKU AS product_SKU,
product.v2ProductName AS Product_Name,
SUM(totals.transactionRevenue) AS Total_Revenue,
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hit,
UNNEST(hit.product) AS product
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731' AND totals.transactions >= 1
GROUP BY product_SKU, Product_Name
ORDER BY v2ProductName DESC

Google Big Query: Get New Visitor Count using Custom Dimension

select PARSE_DATE('%Y%m%d', t.date) as Date
,count(distinct(fullvisitorid)) as User
,SUM( totals.newVisits ) AS New_Visitors
,(if(customDimensions.index=1, customDimensions.value,null)) as Orig
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
group by Date, orig
Is there a way to get the new visitor count and use the customDimension at the same time? The SUM(totals.newVisits) doesn't work.
Thanks
Below is for BigQuery Standard SQL
SELECT DATE
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( newVisits ) AS New_Visitors
,Orig
FROM (
SELECT PARSE_DATE('%Y%m%d', t.date) AS DATE
,fullvisitorid
,totals.newVisits AS newVisits
,(IF(customDimensions.index=1, customDimensions.value,NULL)) AS Orig
FROM `table` AS t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
GROUP BY DATE, orig, fullvisitorid, newVisits
)
GROUP BY DATE, Orig
The best way in your case is to remove the cross-joins and use sub-selects instead:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM UNNEST(customDimensions) WHERE index=1) Orig
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( totals.newVisits ) AS New_Visitors
FROM
`table` t
GROUP BY Orig, Date
If you have a hit-scoped dimension and really need to flatten the table, you need to build a session id that you can count distinctly. That is because the cross-join repeats all session-scoped fields on every hit:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,h.page.pagePathLevel1
,COUNT(DISTINCT(fullvisitorid)) AS User
-- create session id and count distinct
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)) ) AS all_sessions
-- only count distinct session id of sessions where totals.newVisits = 1
,COUNT(DISTINCT
IF(totals.newVisits=1,
CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)),
NULL )
) AS New_Visitors
FROM
-- flatten table to hit scope (comma means cross-join in standard SQL)
`table` t, t.hits h
GROUP BY 1,2,3
So for new visitors I only provide a session id if totals.newVisits=1. Otherwise the IF expression returns NULL, which COUNT(DISTINCT ...) does not count.
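To see why the conditional session id helps, here is a small Python simulation (not BigQuery code). The `rows` list is made-up data standing in for a table flattened to hit scope, and `count_distinct` is a hypothetical helper that imitates COUNT(DISTINCT ...) by skipping NULLs.

```python
# Made-up flattened rows (one per hit); session-scoped fields repeat on every hit.
rows = [
    {"fullvisitorid": "A", "visitstarttime": 100, "newVisits": 1},
    {"fullvisitorid": "A", "visitstarttime": 100, "newVisits": 1},
    {"fullvisitorid": "B", "visitstarttime": 200, "newVisits": None},
    {"fullvisitorid": "B", "visitstarttime": 200, "newVisits": None},
    {"fullvisitorid": "B", "visitstarttime": 200, "newVisits": None},
]

def session_id(row):
    """Mimic CONCAT(fullvisitorid, CAST(visitstarttime AS STRING))."""
    return row["fullvisitorid"] + str(row["visitstarttime"])

def count_distinct(values):
    """Imitate COUNT(DISTINCT x): NULL (None) values are ignored."""
    return len({v for v in values if v is not None})

all_sessions = count_distinct(session_id(r) for r in rows)
new_visitors = count_distinct(
    session_id(r) if r["newVisits"] == 1 else None for r in rows
)
# Summing the repeated session-scoped field counts session A once per hit.
naive_sum = sum(r["newVisits"] or 0 for r in rows)
print(all_sessions, new_visitors, naive_sum)
```

With this data the distinct count reports one new-visitor session, while the plain sum over the flattened rows reports two because session A appears once per hit.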
If you have something similar on product-scope, you'd need to create an ID that takes into account session and hit.
E.g. counting pages for productSku:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,p.productSku
,COUNT(DISTINCT fullvisitorid) AS users
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING))) AS sessions
,COUNT(DISTINCT
IF(h.type='PAGE',
CONCAT(fullvisitorid, cast(visitstarttime AS STRING),CAST(hitNumber AS STRING)),
NULL)
) as pageviews
,COUNT(1) AS products
FROM
`table` t, t.hits h LEFT JOIN h.product p
GROUP BY 1,2,3
Note that I'm left joining the product array. Since it is sometimes empty, a cross-join would drop all information for those hits: a cross-join with an empty table results in an empty table.
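The effect of an empty product array can be illustrated with a short Python simulation (again, not BigQuery code). The `hits` data and both join helpers are hypothetical and only imitate how the two join types treat an empty array.

```python
# Made-up hits: the second hit has an empty product array.
hits = [
    {"hitNumber": 1, "product": [{"productSku": "sku-1"}, {"productSku": "sku-2"}]},
    {"hitNumber": 2, "product": []},
]

def cross_join_products(hit_rows):
    """Imitate `t.hits h, h.product p`: hits with no products produce no rows."""
    return [(h["hitNumber"], p["productSku"]) for h in hit_rows for p in h["product"]]

def left_join_products(hit_rows):
    """Imitate `t.hits h LEFT JOIN h.product p`: hits with no products keep one row with NULL."""
    out = []
    for h in hit_rows:
        products = h["product"] or [None]
        for p in products:
            out.append((h["hitNumber"], p["productSku"] if p else None))
    return out

cross_rows = cross_join_products(hits)
left_rows = left_join_products(hits)
print(cross_rows)
print(left_rows)
```

In the cross-join result hit 2 disappears entirely, whereas the left join keeps it with a NULL product, so hit-level counts such as pageviews stay intact.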
Hope that helps!

Counting transactions on 2 different levels

I am using Google Analytics data in BigQuery, my desired output is
USERID INTERACTIONS TRANSACTIONS SCORE CHANNEL
XXX 3 1 33.33 Paid
Below is my query so far. I am getting duplicate transactions and I can't work out why. Unnesting my hits led to a high count of interactions because every row was being counted, so I added the AND hit.isentrance IS TRUE clause. That means I can't use COUNT(DISTINCT hit.transaction.transactionid), as the entrance row will never contain an order ID. Instead I have to use totals.transactions, which is where I think my issues could be coming from.
SELECT UserID, SUM(Campaign_Interactions) AS Interactions, SUM(Transactions) AS Transactions, ROUND(SUM(Transactions)/SUM(Campaign_Interactions), 2) AS Con_Score, MasterChannel FROM(
(SELECT customDimension.value AS UserID, visitid AS visitid1, trafficSource.campaign AS Campaign, COUNT(trafficSource.campaign) AS Campaign_Interactions, SUM (totals.transactions) AS Transactions, ROUND(MAX(totals.transactions)/COUNT(trafficSource.campaign), 2) AS Conversion_Score
FROM `xxx.ga_sessions_20*` AS m
CROSS JOIN UNNEST(m.customdimensions) AS customDimension
CROSS JOIN UNNEST (hits) AS hit
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 7 day) and
DATE_sub(current_date(), interval 1 day)
AND customDimension.index = 2
AND trafficSource.campaign IS NOT NULL
AND (customDimension.value NOT LIKE 'true' AND customDimension.value NOT LIKE 'undefined')
AND hit.isentrance IS TRUE
GROUP BY visitid1, Campaign, userID
ORDER BY Transactions DESC)
JOIN
(SELECT * FROM `xxx.7Days_VisitID_MasterChan`)
ON visitid1 = visitid)
GROUP BY UserID, MasterChannel
ORDER BY UserID
In the results, the ID 00004180-16f5-46e4-9caa-c6b47e03d795 should have only 1 order, but we are seeing it on each row.
It's fine for the user to have interactions across multiple channels; this is expected. Multiple transactions across multiple channels are also fine, but I can see in our CRM that this UserID has made only one order in the last 7 days, so I'd expect to see only a single transaction against the ID here.

BigQuery unnest hits - duplicating values

I'm trying to create a master view of a group of properties that are being imported into BigQuery, but it seems that by using UNNEST(hits) the SQL duplicates the data, leading to inaccurate values for revenue etc.
I have tried to understand why the UNNEST causes this, but I can't figure it out.
SELECT Date, hostname, channelGrouping, sum(transactionRevenue) as Revenue, sum(Shipping) as Shipping, sum(visits) as Sessions, sum(bounces) as Bounces, sum(transactions) as Transactions
FROM
(SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `102674002.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509')
Group By Date, hostname, channelGrouping
Order by Date
This might do the trick:
SELECT
date,
channelGrouping,
SUM(Revenue) Revenue,
SUM(Shipping) Shipping,
SUM(bounces) bounces,
SUM(transactions) transactions,
hostname,
COUNT(date) sessions
FROM(
SELECT
date,
channelGrouping,
totals.totaltransactionrevenue / 1e6 Revenue,
ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL)) hostnames,
(SELECT SUM(hits.transaction.transactionshipping) / 1e6 FROM UNNEST(hits) hits) Shipping,
totals.bounces bounces,
totals.transactions transactions
FROM `project_id.dataset_id.ga_sessions_*`
WHERE 1 = 1
AND ARRAY_LENGTH(ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL))) > 0
AND _TABLE_SUFFIX BETWEEN '20170601' AND '20170609'
UNION ALL
(...)
),
UNNEST(hostnames) hostname
GROUP BY
date, channelGrouping, hostname
Notice that in this query I avoided applying the UNNEST operation to the hits field at the top level; I do so only inside subselects.
To understand why, you have to understand how GA data is stored in BigQuery. There are basically 2 types of data: session-level data and hit-level data. Each visit to your website ends up generating a row in BigQuery, like so:
{fullvisitorid: 1, visitid: 1, date: '20170601', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 0, bounces: 0}}
If the same customer comes back a day later, it generates another row in BQ, something like:
{fullvisitorid: 1, visitid: 2, date: '20170602', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 50000000, bounces: 2}}
As you can see, fields outside the hits key are session-level, while each hit (i.e., each interaction the customer has on your website) adds another entry to the hits array. When you apply UNNEST, you basically cross-join every value inside the array with the outer fields.
And this is where duplication happens!
Given the example above, if we apply UNNEST to the hits field, we end up with something like:
fullvisitorid  visitid  totals.totalTransactionRevenue  hits.hitNumber
1              1        0                               1
1              1        0                               2
1              2        50000000                        1
1              2        50000000                        2
Notice that each hit inside the hits field causes the outer fields, such as totals.totalTransactionRevenue, to be repeated once per hitNumber in the hits ARRAY.
So if you later apply an operation like SUM(totals.totalTransactionRevenue), you end up summing this field multiplied by the number of hits the customer had in that visitid.
What I tend to do is avoid the UNNEST operation on the hits field at the top level (it can be costly, depending on the data volume) and apply it only in subqueries, where the unnesting happens within each row and does not duplicate data.
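The inflation is easy to reproduce with the two example sessions above. This is a plain Python simulation rather than BigQuery code; the `sessions` list simply mirrors the example rows (with the revenue field shortened to a hypothetical `revenue` key), and the list comprehension imitates the cross-join that UNNEST performs.

```python
# The two example sessions, reduced to the fields that matter here.
sessions = [
    {"visitid": 1, "revenue": 0, "hits": [1, 2]},
    {"visitid": 2, "revenue": 50000000, "hits": [1, 2]},
]

# Imitate `FROM table, UNNEST(hits)`: session fields are repeated for every hit.
flattened = [
    {"visitid": s["visitid"], "revenue": s["revenue"], "hitNumber": n}
    for s in sessions
    for n in s["hits"]
]

inflated_total = sum(row["revenue"] for row in flattened)  # summed once per hit
session_total = sum(s["revenue"] for s in sessions)        # summed once per session
print(len(flattened), inflated_total, session_total)
```

Because each session here has two hits, the total computed over the flattened rows is exactly twice the session-level total.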