Google Analytics Object in Google BigQuery - google-bigquery

I'm trying to extract data from Google Analytics, but due to incompatibilities between dimensions and metrics, it was decided to use Google BigQuery instead to obtain the GA4 data.
I'm struggling to find some metrics/dimensions in Google BigQuery, even after searching the documentation: https://support.google.com/analytics/answer/3437719?hl=en
Google Analytics Dimensions/Metrics:
These are the dimensions and metrics I've used from Google Analytics; the ones I can't find in Google BigQuery are:
Users
Sessions (I used totals.visits, but I only get NULLs and 1s, while GA reports higher numbers)
TransactionsPerSession
CountryIsoCode (GA returns just the country code, for instance Spain --> ES, but BigQuery stores the country's complete name. This can be worked around, but it would be good to have the country code directly from the source)
avgSessionDuration

A great place to get this information is https://www.ga4bigquery.com/
I have copied one of my reports below; it covers points 1, 2, 3 & 5. I don't use country, but it can be found in the link above (a sketch for deriving the ISO code follows the query).
-- subquery to prepare the data
with prep_traffic as (
  select
    user_pseudo_id,
    event_date as date,
    count(distinct ecommerce.transaction_id) as Transactions,
    (select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_id,
    max((select value.string_value from unnest(event_params) where key = 'session_engaged')) as session_engaged,
    max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec')) as engagement_time_msec,
    -- change event_name to the event(s) you want to count
    countif(event_name = 'page_view') as event_count,
    -- change event_name to the conversion event(s) you want to count
    countif(event_name = 'add_payment_info') as conversions,
    sum(ecommerce.purchase_revenue) as total_revenue
  from
    -- change this to your google analytics 4 bigquery export location
    `bigquery****.events_*`
  where
    -- change the date range by using static and/or dynamic dates
    _table_suffix between '20230129' and format_date('%Y%m%d', date_sub(current_date(), interval 1 day))
  group by
    user_pseudo_id,
    session_id,
    event_date
)
-- main query
select
  count(distinct user_pseudo_id) as users,
  count(distinct concat(user_pseudo_id, session_id)) as sessions,
  count(distinct case when session_engaged = '1' then concat(user_pseudo_id, session_id) end) as engaged_sessions,
  round(safe_divide(count(distinct case when session_engaged = '1' then concat(user_pseudo_id, session_id) end), count(distinct user_pseudo_id)), 2) as engaged_sessions_per_user,
  round(safe_divide(count(distinct case when session_engaged = '1' then concat(user_pseudo_id, session_id) end), count(distinct concat(user_pseudo_id, session_id))), 2) as engagement_rate,
  sum(Transactions) as transactions,
  safe_divide(sum(Transactions), count(distinct concat(user_pseudo_id, session_id))) as TransactionsPerSession,
  -- total engagement time (msec) divided once by the session count, converted to seconds
  safe_divide(sum(engagement_time_msec), count(distinct concat(user_pseudo_id, session_id))) / 1000 as avgSessionDuration,
  sum(event_count) as event_count,
  sum(conversions) as conversions,
  ifnull(sum(total_revenue), 0) as total_revenue,
  date
from
  prep_traffic
group by
  date
order by
  date desc, users desc
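For point 4 (CountryIsoCode): the GA4 export only exposes geo.country as a full country name; there is no ISO-code field in the schema. A minimal sketch of a workaround, assuming a small hand-maintained mapping CTE (the three rows below are illustrative, not a complete list):

with iso_map as (
  -- extend this mapping to cover the countries you report on
  select 'Spain' as country, 'ES' as iso_code union all
  select 'Germany', 'DE' union all
  select 'France', 'FR'
)
select
  e.user_pseudo_id,
  e.geo.country as country,
  m.iso_code as country_iso_code
from `bigquery****.events_*` e
left join iso_map m
  on e.geo.country = m.country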

Related

Mismatch of retention results from firebase and BigQuery

I calculated the retention in BigQuery with the code below (the code was taken from here), but it is giving me a different retention than the retention already calculated in Firebase: the number of users calculated in BigQuery is always smaller.
What is the difference between these two approaches? Is there a way to get the same result in BigQuery as in Firebase?
#standardSQL
####################################################################
# PART 1: Cohort of New Users starting on SEPT 1
####################################################################
WITH
  new_user_cohort AS (
    SELECT DISTINCT user_pseudo_id AS new_user_id
    FROM `projectId.analytics_YOUR_TABLE.events_*`
    WHERE
      event_name = 'first_open' AND
      #geo.country = 'France' AND
      FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+8")) = '20180901' AND
      _TABLE_SUFFIX BETWEEN '20180830' AND '20180902'
  ),
  num_new_users AS (
    SELECT COUNT(*) AS num_users_in_cohort FROM new_user_cohort
  ),
####################################################################
# PART 2: Engaged users from Sept 1 cohort
####################################################################
  engaged_user_by_day AS (
    SELECT
      FORMAT_TIMESTAMP('%Y%m%d', TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+8")) AS event_day,
      COUNT(DISTINCT user_pseudo_id) AS num_engaged_users
    FROM `projectId.analytics_YOUR_TABLE.events_*`
    INNER JOIN new_user_cohort ON new_user_id = user_pseudo_id
    WHERE
      event_name = 'user_engagement' AND
      _TABLE_SUFFIX BETWEEN '20180830' AND '20180907'
    GROUP BY event_day
  )
####################################################################
# PART 3: Daily Retention = [Engaged Users / Total Users]
####################################################################
SELECT
  event_day,
  num_engaged_users,
  num_users_in_cohort,
  ROUND(num_engaged_users / num_users_in_cohort, 3) AS retention_rate
FROM engaged_user_by_day
CROSS JOIN num_new_users
ORDER BY event_day
I found out that Analytics is using sampling; in my case the report inside Analytics only uses 0.2% of the data.
In Firebase I noticed that they removed the retention tab (at least in my case), but I think sampling was used there as well.

Using OFFSET instead of UNNEST for nested fields in Google Bigquery

A quick question for the GBQ gurus.
Here are two queries that are identical in purpose:
first
SELECT
  fullVisitorId AS userid,
  CONCAT(fullVisitorId, visitStartTime) AS session,
  visitStartTime + (hits[OFFSET(0)].time / 1000) AS eventtime,
  date,
  trafficSource.campaign,
  trafficSource.source,
  trafficSource.medium,
  trafficSource.adContent,
  trafficSource.adwordsClickInfo.campaignId,
  geoNetwork.region,
  geoNetwork.city,
  trafficSource.keyword,
  totals.visits AS visits,
  device.deviceCategory AS deviceType,
  hits[OFFSET(0)].eventInfo.eventAction,
  hits[OFFSET(0)].TRANSACTION.transactionId,
  hits[OFFSET(0)].TRANSACTION.transactionRevenue,
  SUBSTR(channelGrouping, 0, 3) AS newchannelGrouping
FROM
  `some_site.ga_sessions_*`
WHERE
  ARRAY_LENGTH(hits) > 0
  AND _table_suffix BETWEEN '20200201' AND '20200201'
  AND fullVisitorId IN (
    SELECT DISTINCT fullVisitorId
    FROM `some_site.ga_sessions_*`, UNNEST(hits) AS hits
    WHERE _table_suffix BETWEEN '20200201' AND '20200201'
      AND hits.TRANSACTION.transactionId != 'None'
  )
second
SELECT
  fullVisitorId AS userid,
  CONCAT(fullVisitorId, visitStartTime) AS session,
  visitStartTime + (hits.time / 1000) AS eventtime,
  date,
  trafficSource.campaign,
  trafficSource.source,
  trafficSource.medium,
  trafficSource.adContent,
  trafficSource.adwordsClickInfo.campaignId,
  geoNetwork.region,
  geoNetwork.city,
  trafficSource.keyword,
  totals.visits AS visits,
  device.deviceCategory AS deviceType,
  hits.eventInfo.eventAction,
  hits.TRANSACTION.transactionId,
  hits.TRANSACTION.transactionRevenue,
  SUBSTR(channelGrouping, 0, 3) AS newchannelGrouping
FROM
  `some_site.ga_sessions_*`, UNNEST(hits) hits
WHERE
  _table_suffix BETWEEN '20200201' AND '20200201'
  AND fullVisitorId IN (
    SELECT DISTINCT fullVisitorId
    FROM `some_site.ga_sessions_*`, UNNEST(hits) AS hits
    WHERE _table_suffix BETWEEN '20200201' AND '20200201'
      AND hits.TRANSACTION.transactionId != 'None'
  )
The 1st one uses OFFSET to extract data from nested fields. According to the execution details report, the query requires about 1.5 MB of shuffling.
The 2nd query uses UNNEST to reach the nested data, and the amount of shuffled bytes is around (!) 75 MB.
The amount of processed data is the same in both cases.
Now, the question is:
Does that mean that, according to this article on optimizing communication between slots, I should use OFFSET instead of UNNEST to get at data stored in nested fields?
Thanks!
Let's consider the following examples using a BigQuery public dataset.
UNNEST - returns 6 results:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT h FROM t, UNNEST(hits) h
OFFSET - returns 1 result:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT hits[OFFSET(0)] FROM t
Both queries reference the same record inside a GA public table. They show that joining with UNNEST brings one row per element inside the array, while OFFSET(0) brings only one row containing the first element of the array.
The difference in data shuffling is large because UNNEST performs a JOIN operation, which requires the data to be reorganized across slots, whereas the OFFSET approach simply takes the first element of the array.
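One caveat worth adding (not from the original answer): hits[OFFSET(0)] raises an error when the array is empty, which is presumably why the first query filters on ARRAY_LENGTH(hits) > 0. A minimal sketch using SAFE_OFFSET, which returns NULL instead of erroring, against the same public table:

-- SAFE_OFFSET returns NULL for out-of-range indexes instead of raising an error,
-- so the ARRAY_LENGTH(hits) > 0 guard becomes optional
SELECT
  visitId,
  hits[SAFE_OFFSET(0)].eventInfo.eventAction AS first_action
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`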

Converting Legacy SQL to Standard SQL - Enhanced Ecommerce

I am in no way a coder, so I have tried but am falling over on this.
I want to convert this query from Google's Google Analytics BigQuery Cookbook into Standard SQL:
Products purchased by customers who purchased product A (Enhanced Ecommerce)
I have pasted the code below.
I have made a few attempts but keep falling over and am not getting anywhere.
Thank you in advance
John
SELECT hits.product.productSKU AS other_purchased_products,
COUNT(hits.product.productSKU) AS quantity
FROM (
SELECT fullVisitorId, hits.product.productSKU, hits.eCommerceAction.action_type
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
WHERE hits.product.productSKU CONTAINS 'GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND hits.product.productSKU IS NOT NULL
AND hits.product.productSKU !='GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC;
Below is the pure equivalent in BigQuery Standard SQL (no optimizations or improvements - just a direct translation from legacy to standard):
SELECT productSKU AS other_purchased_products, COUNT(productSKU) AS quantity
FROM (
SELECT fullVisitorId, prod.productSKU, hit.eCommerceAction.action_type
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
AND prod.productSKU LIKE '%GGOEYOCR077799%'
AND hit.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND productSKU IS NOT NULL
AND productSKU !='GGOEYOCR077799'
AND action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC
It obviously produces exactly the same result as the legacy version.
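For reference, these are the main legacy-to-standard mappings applied in the rewrite above (a summary of this particular translation, not an exhaustive migration guide):

-- legacy:   TABLE_DATE_RANGE([project:dataset.ga_sessions_], TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
-- standard: `project.dataset.ga_sessions_*` plus WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
-- legacy:   implicit flattening of repeated fields (hits.product.productSKU)
-- standard: explicit joins: UNNEST(hits) hit, UNNEST(hit.product) prod
-- legacy:   x CONTAINS 'str'
-- standard: x LIKE '%str%' (or STRPOS(x, 'str') > 0)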

BigQuery Error - Uncategorized name - When combining GA data

I want to combine the dimensions date, country & source with sessions and unique events for the event category "Downloads". Based on this data, I want to calculate the download conversion rate in Data Studio later on. To be honest, I'm a noob in SQL, but I hope I'm at least thinking the right way.
Trying the query below I get the following error:
Unrecognized name: Downloads at [40:3]
WITH
  ga_tables AS (
    SELECT
      date,
      trafficSource.source AS Source,
      geoNetwork.country AS Country,
      COUNT(trafficSource.source) AS Sessions
    FROM
      `xxxxxx.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20190301' AND '20190301'
    GROUP BY
      date,
      Source,
      Country
    UNION ALL
    SELECT
      date,
      trafficSource.source AS Source,
      geoNetwork.country AS Country,
      COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS STRING), '-', CAST(visitId AS STRING), '-', CAST(date AS STRING), '-', IFNULL(hits.eventInfo.eventLabel, 'null'))) AS Downloads
    FROM
      `xxxxxx.ga_sessions_*`,
      UNNEST(hits) AS hits
    WHERE
      _TABLE_SUFFIX BETWEEN '20190301' AND '20190301'
      AND hits.type = 'EVENT'
      AND hits.eventInfo.eventCategory = 'Downloads'
    GROUP BY
      date,
      Source,
      Country
  )
SELECT
  date,
  Country,
  Source,
  Downloads,
  Sessions
FROM
  ga_tables
ORDER BY
  Sessions ASC
In your WITH statement, the fourth column of the first SELECT is named Sessions, while the fourth column of the SELECT it's unioned with is named Downloads. Because of how UNION ALL works, the final output column takes its name from the first branch, Sessions, so Downloads does not exist when you query it. If you want Sessions and Downloads to be separate columns, make the query look something like this:
WITH
  ga_tables AS (
    SELECT
      date,
      trafficSource.source AS Source,
      geoNetwork.country AS Country,
      COUNT(trafficSource.source) AS Sessions,
      NULL AS Downloads
    FROM
      `xxxxxx.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20190301' AND '20190301'
    GROUP BY
      date,
      Source,
      Country
    UNION ALL
    SELECT
      date,
      trafficSource.source AS Source,
      geoNetwork.country AS Country,
      NULL AS Sessions,
      COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS STRING), '-', CAST(visitId AS STRING), '-', CAST(date AS STRING), '-', IFNULL(hits.eventInfo.eventLabel, 'null'))) AS Downloads
    FROM
      `xxxxxx.ga_sessions_*`,
      UNNEST(hits) AS hits
    WHERE
      _TABLE_SUFFIX BETWEEN '20190301' AND '20190301'
      AND hits.type = 'EVENT'
      AND hits.eventInfo.eventCategory = 'Downloads'
    GROUP BY
      date,
      Source,
      Country
  )
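A hedged follow-up to the fix above: the UNION ALL version yields two rows per date/Source/Country combination, one carrying Sessions and one carrying Downloads, so the final SELECT would typically collapse them before feeding Data Studio, e.g.:

SELECT
  date,
  Country,
  Source,
  -- MAX ignores NULLs, so this merges the two unioned rows into one
  MAX(Sessions) AS Sessions,
  MAX(Downloads) AS Downloads
FROM
  ga_tables
GROUP BY
  date,
  Country,
  Source
ORDER BY
  Sessions ASC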
Edit: Given what it looks like you want to do with the table though, you might want to rewrite ga_tables to be something like this instead:
WITH
  ga_tables AS (
    SELECT
      date,
      trafficSource.source AS Source,
      geoNetwork.country AS Country,
      MAX(Sessions) AS Sessions,
      COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS STRING), '-', CAST(visitId AS STRING), '-', CAST(date AS STRING), '-', IFNULL(hits.eventInfo.eventLabel, 'null'))) AS Downloads
    FROM (
      SELECT
        *,
        -- count sessions at session level, before UNNEST multiplies the rows;
        -- the partition must use the raw field names, as the aliases don't exist yet here
        COUNT(trafficSource.source) OVER (PARTITION BY date, trafficSource.source, geoNetwork.country) AS Sessions
      FROM
        `xxxxxx.ga_sessions_*`
      WHERE
        -- the _TABLE_SUFFIX pseudo column is only visible while querying the
        -- wildcard table itself, so the date filter has to sit in this subquery
        _TABLE_SUFFIX BETWEEN '20190301' AND '20190301'
    ),
    UNNEST(hits) AS hits
    WHERE
      hits.type = 'EVENT'
      AND hits.eventInfo.eventCategory = 'Downloads'
    GROUP BY
      date,
      Source,
      Country
  )

Counting transactions on 2 different levels

I am using Google Analytics data in BigQuery, my desired output is
USERID   INTERACTIONS   TRANSACTIONS   SCORE   CHANNEL
XXX      3              1              33.33   Paid
Below is my query so far. I am getting duplicate transactions and I can't work out why. Unnesting my hits led to a high count of interactions, since every line was being counted, so I added the AND hit.isentrance IS TRUE clause; that means I can't use COUNT(DISTINCT hit.transaction.transactionid), because the entry row never contains an order ID. Instead I have to use totals.transactions, which is where I think my issues could be coming from.
SELECT
  UserID,
  SUM(Campaign_Interactions) AS Interactions,
  SUM(Transactions) AS Transactions,
  ROUND(SUM(Transactions) / SUM(Campaign_Interactions), 2) AS Con_Score,
  MasterChannel
FROM (
  (SELECT
    customDimension.value AS UserID,
    visitid AS visitid1,
    trafficSource.campaign AS Campaign,
    COUNT(trafficSource.campaign) AS Campaign_Interactions,
    SUM(totals.transactions) AS Transactions,
    ROUND(MAX(totals.transactions) / COUNT(trafficSource.campaign), 2) AS Conversion_Score
  FROM `xxx.ga_sessions_20*` AS m
  CROSS JOIN UNNEST(m.customdimensions) AS customDimension
  CROSS JOIN UNNEST(hits) AS hit
  WHERE PARSE_DATE('%y%m%d', _table_suffix)
      BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
      AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    AND customDimension.index = 2
    AND trafficSource.campaign IS NOT NULL
    AND (customDimension.value NOT LIKE 'true' AND customDimension.value NOT LIKE 'undefined')
    AND hit.isentrance IS TRUE
  GROUP BY visitid1, Campaign, UserID
  ORDER BY Transactions DESC)
  JOIN (SELECT * FROM `xxx.7Days_VisitID_MasterChan`) ON visitid1 = visitid
)
GROUP BY UserID, MasterChannel
ORDER BY UserID
And a screenshot of the results. Note that for the ID 00004180-16f5-46e4-9caa-c6b47e03d795 (near the bottom) there should be only 1 order, but we are seeing it on every row.
It's fine for the user to have interactions across multiple channels; this is expected. Multiple transactions across multiple channels would also be fine, but I can see in our CRM that this UserID has made only one order in the last 7 days, so I'd expect to see only a single transaction against the ID here.
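A hedged diagnostic (using the table name from the query above): if `xxx.7Days_VisitID_MasterChan` holds more than one row per visitid, e.g. one per channel, the join will repeat each session's totals.transactions once per matching row, which would produce exactly this symptom. A quick check:

-- list any visit IDs that appear more than once in the joined lookup table
SELECT
  visitid,
  COUNT(*) AS rows_per_visit
FROM
  `xxx.7Days_VisitID_MasterChan`
GROUP BY
  visitid
HAVING
  COUNT(*) > 1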