I want to combine the dimensions date, country and source with sessions and unique events for the event category "Downloads". Based on this data, I want to calculate the download conversion rate in Data Studio later on. To be honest, I'm a noob in SQL, but I hope I'm at least thinking the right way.
Trying the query below I get the following error:
Unrecognized name: Downloads at [40:3]
WITH
ga_tables AS (
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
COUNT ( trafficSource.source ) AS Sessions
FROM
`xxxxxx.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
GROUP BY
date,
Source,
Country
UNION ALL
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS string),'-',CAST(visitId AS string),'-',CAST(date AS string),'-',ifnull(hits.eventInfo.eventLabel,
'null'))) AS Downloads
FROM
`xxxxxx.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Downloads'
GROUP BY
date,
Source,
Country )
SELECT
date,
Country,
Source,
Downloads,
Sessions
FROM
ga_tables
ORDER BY
Sessions ASC
In your WITH clause, the fourth column of the first SELECT is named Sessions, while the fourth column of the SELECT it is unioned with is named Downloads. UNION ALL takes its output column names from the first SELECT, so the combined column is called Sessions and a column named Downloads does not exist when you query it. If you want Sessions and Downloads to be separate columns, make the query look something like this:
WITH
ga_tables AS (
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
COUNT ( trafficSource.source ) AS Sessions,
NULL AS Downloads
FROM
`xxxxxx.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
GROUP BY
date,
Source,
Country
UNION ALL
SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
NULL AS Sessions,
COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS string),'-',CAST(visitId AS string),'-',CAST(date AS string),'-',ifnull(hits.eventInfo.eventLabel,
'null'))) AS Downloads
FROM
`xxxxxx.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'
AND hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Downloads'
GROUP BY
date,
Source,
Country )
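The column-naming behaviour described above can be reproduced locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption made only so the example runs anywhere; the literal rows are made up), and it shows both the lost Downloads alias and the NULL-padding fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two branches with different aliases for the second column,
# mirroring the Sessions / Downloads mismatch in the question.
union_sql = """
SELECT 'US' AS Country, 10 AS Sessions
UNION ALL
SELECT 'US' AS Country, 3 AS Downloads
"""
cursor = conn.execute(union_sql)
column_names = [d[0] for d in cursor.description]  # names come from the first SELECT
rows = cursor.fetchall()

# Querying the second alias fails, because that column name was never kept.
try:
    conn.execute(f"WITH ga_tables AS ({union_sql}) SELECT Downloads FROM ga_tables")
    downloads_error = None
except sqlite3.OperationalError as exc:
    downloads_error = str(exc)

# Padding each branch with NULL keeps both columns in the result.
padded_sql = """
SELECT 'US' AS Country, 10 AS Sessions, NULL AS Downloads
UNION ALL
SELECT 'US' AS Country, NULL AS Sessions, 3 AS Downloads
"""
padded_cursor = conn.execute(padded_sql)
padded_names = [d[0] for d in padded_cursor.description]
padded_rows = padded_cursor.fetchall()

print(column_names, downloads_error, padded_names)
```

With the padded version, each group still produces two rows (one carrying Sessions, one carrying Downloads), so a later SUM per date, Source and Country is needed to bring them onto one line.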
Edit: Given what it looks like you want to do with the table, though, you might want to rewrite ga_tables as something like this instead:
WITH
ga_tables AS (SELECT
date,
trafficSource.source AS Source,
geoNetwork.country AS Country,
MAX(Sessions) AS Sessions,
COUNT(DISTINCT CONCAT(CAST(fullVisitorId AS string),'-',CAST(visitId AS string),'-',CAST(date AS string),'-',ifnull(hits.eventInfo.eventLabel,
'null'))) AS Downloads
FROM (
SELECT
*,
-- the Source and Country aliases do not exist at this level, so partition by the underlying fields
COUNT(trafficSource.source) OVER (PARTITION BY date, trafficSource.source, geoNetwork.country) AS Sessions
FROM
`xxxxxx.ga_sessions_*`
WHERE
-- _TABLE_SUFFIX is a pseudo-column that SELECT * does not pass on, so filter on it here
_TABLE_SUFFIX BETWEEN '20190301'
AND '20190301'),
UNNEST(hits) AS hits
WHERE
hits.type = 'EVENT'
AND hits.eventInfo.eventCategory = 'Downloads'
GROUP BY
date,
Source,
Country)
I'm trying to extract data from Google Analytics, but due to incompatibilities between dimensions and metrics, we decided to use Google BigQuery instead to obtain the GA4 data.
I'm struggling to find some metrics/dimensions in Google BigQuery, even after searching the documentation: https://support.google.com/analytics/answer/3437719?hl=en
Google Analytics Dimensions/Metrics:
These are the dimensions and metrics I used in Google Analytics that I can't find in Google BigQuery:
1. Users
2. Sessions (I used totals.visits, but I only get NULLs and 1s, while GA shows larger numbers)
3. TransactionsPerSession
4. CountryIsoCode (in GA this is just the country code, for instance Spain --> ES, but in BigQuery it is the country's full name. This can be worked around, but it would be good to get the country code directly from the source)
5. avgSessionDuration
A great place to get this information is https://www.ga4bigquery.com/
I have copied one of my reports below; it will give you points 1, 2, 3 and 5. I don't use country, but it can be found via the link above.
-- subquery to prepare the data
with prep_traffic as (
select
user_pseudo_id,
event_date as date,
count(distinct (ecommerce.transaction_id)) as Transactions,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_id,
max((select value.string_value from unnest(event_params) where key = 'session_engaged')) as session_engaged,
max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec')) as engagement_time_msec,
-- change event_name to the event(s) you want to count
countif(event_name = 'page_view') as event_count,
-- change event_name to the conversion event(s) you want to count
countif(event_name = 'add_payment_info') as conversions,
sum(ecommerce.purchase_revenue) as total_revenue
from
-- change this to your google analytics 4 bigquery export location
`bigquery****.events_*`
where
-- change the date range by using static and/or dynamic dates
_table_suffix between '20230129' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
group by
user_pseudo_id,
session_id,
event_date)
-- main query
select
count(distinct user_pseudo_id) as users,
count(distinct concat(user_pseudo_id,session_id)) as sessions,
count(distinct case when session_engaged = '1' then concat(user_pseudo_id,session_id) end) as engaged_sessions,
ROUND(safe_divide(count(distinct case when session_engaged = '1' then concat(user_pseudo_id,session_id) end),count(distinct user_pseudo_id)),2) as engaged_sessions_per_user,
ROUND(safe_divide(count(distinct case when session_engaged = '1' then concat(user_pseudo_id,session_id) end),count(distinct concat(user_pseudo_id,session_id))),2) as engagement_rate,
(sum(Transactions)) As transactions,
(sum(Transactions))/ count(distinct concat(user_pseudo_id,session_id)) as TransactionsPerSession,
-- average engagement time per engaged session, in milliseconds (divide by 1000 for seconds)
safe_divide(sum(engagement_time_msec),count(distinct case when session_engaged = '1' then concat(user_pseudo_id,session_id) end)) as avgSessionDuration,
sum(event_count) as event_count,
sum(conversions) as conversions,
ifnull(sum(total_revenue),0) as total_revenue,
date
from
prep_traffic
group by
date
order by
date desc, users desc
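The sessions metric in the query above counts distinct (user_pseudo_id, session_id) pairs rather than session ids alone. A minimal sketch of why, assuming (as the concat in the query implies) that ga_session_id is only unique within a single user; the event rows here are invented for illustration:

```python
# Hypothetical event rows: (user_pseudo_id, ga_session_id).
events = [
    ("user_a", 1675000000),
    ("user_a", 1675000000),  # second event in the same session
    ("user_b", 1675000000),  # same session id, but a different user
    ("user_b", 1675003600),
]

# Counting session ids alone merges the sessions of user_a and user_b.
sessions_by_id_only = len({session_id for _, session_id in events})

# Counting (user, session id) pairs keeps them apart,
# like count(distinct concat(user_pseudo_id, session_id)).
sessions_by_user_and_id = len({(user, session_id) for user, session_id in events})

users = len({user for user, _ in events})

print(users, sessions_by_id_only, sessions_by_user_and_id)
```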
I have this query that helps me split out the separate keywords within strings (very useful with utm_campaign and utm_content):
SELECT
utm_campaign,
splits[SAFE_OFFSET(0)] AS country,
splits[SAFE_OFFSET(1)] AS product,
splits[SAFE_OFFSET(2)] AS budget,
splits[SAFE_OFFSET(3)] AS source,
splits[SAFE_OFFSET(4)] AS campaign,
splits[SAFE_OFFSET(5)] AS audience
FROM (
SELECT
utm_campaign,
SPLIT(REGEXP_REPLACE(
utm_campaign,
r'([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_(.+)',
r'\1|\2|\3|\4|\5|\6|\7'),
'|') AS splits
FROM funnel_campaign)
For example, if I have a utm_campaign like this:
us_latam_mkt_google_black-friday_audiencie-custom_NNN-NNN_nnn_trafic_responsiv
The query above splits the string at each underscore, so I get a result like this:
utm_campaign | country | product | budget | source | campaign | audience
us_latam_mkt_google_black-friday_audiencie-custom_NNN-NNN_nnn_trafic_responsiv | us | latam | mkt | google | black-friday | audiencie-custom
What I want from the query above, in this case, is the audience column. I tried to add it as a subquery in the REVENUE part of the query below, because that table has no audience column but does have the utm_campaign column, and the sixth fragment of the utm_campaign string is the audience. With this query I get the error "Scalar subquery produced more than one element":
WITH COST AS (
SELECT
POS AS POS,
DATE AS DATE,
EXTRACT(WEEK FROM DATE) AS WEEK,
SOURCE AS SOURCE,
MEDIUM AS MEDIUM,
CAMPAIGN AS CAMPAIGN,
AD_CONTENT,
FORMAT AS FORMAT,
"" AS BU_OD,
SUM(CLICKS)/1000 AS CLICKS,
SUM(IMPRESSIONS)/1000 AS IMPRESSIONS,
SUM(COST)/1000 AS COST,
sum(0) as SESSIONS,
SUM(0) AS TRANSACTIONS,
SUM(0) AS search_flight_pv,
SUM(0) AS search_flight_upv,
SUM(0) AS PAX,
SUM(0) AS REVENUE,
FROM MSR_funnel_campaign_table
WHERE DATE >= DATE '2019-01-01'
AND MEDIUM NOT LIKE 'DISPLAY_CORP'
GROUP BY 1,2,3,4,5,6,7,8,9
),
REVENUE AS(
SELECT
POS AS POS,
date AS DATE,
EXTRACT(WEEK FROM DATE) AS WEEK,
SOURCE_CAT AS SOURCE,
medium_group_2 AS MEDIUM,
CAMPAIGN AS CAMPAIGN,
AD_CONTENT,
CASE
WHEN SOURCE_CAT = 'FACEBOOK' THEN
(
SELECT
splits[SAFE_OFFSET(5)] AS FORMAT,
FROM (
SELECT
ad_content,
SPLIT(REGEXP_REPLACE(
ad_content,
r'([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_(.+)',
r'\1|\2|\3|\4|\5|\6|\7'),
'|') AS splits
FROM ga_digital_marketing)) END AS FORMAT,
BU_OD AS BU_OD,
SUM(0) AS CLICKS,
SUM(0) AS IMPRESSIONS,
SUM(0) AS COST,
sum(sessions)/1000 as SESSIONS,
SUM(TRANSACTIONS)/1000 AS TRANSACTIONS,
SUM(search_flight_pv)/1000 AS search_flight_pv,
SUM(search_flight_upv)/1000 AS search_flight_upv,
SUM(PAX)/1000 AS PAX,
SUM(REVENUE)/1000 AS REVENUE,
FROM ga_digital_marketing
WHERE PAX_TYPE = 'PAID'
AND DATE >= DATE '2019-01-01'
AND MEDIUM NOT LIKE 'DISPLAY_CORP'
GROUP BY 1,2,3,4,5,6,7,8,9
),
COST_REVENUE AS (
SELECT *
FROM COST
UNION ALL
SELECT *
FROM REVENUE
)
SELECT
DATE,
WEEK,
POS,
SOURCE,
MEDIUM,
CAMPAIGN,
AD_CONTENT,
FORMAT,
BU,
CLICKS,
IMPRESSIONS,
SESSIONS,
TRANSACTIONS,
search_flight_pv,
search_flight_upv,
COST,
PAX,
REVENUE,
FROM COST_REVENUE
WHERE
1=1
AND DATE >= '2019-01-01'
What am I doing wrong here?
What I would also like to see is a match between the format dimension from COST and the format dimension from REVENUE (which doesn't exist as a column, but is contained within the campaign column).
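You don't really need the interior SELECT statements, because your campaign data is already in the same row of the table. The scalar subquery reads every row of ga_digital_marketing, so it returns more than one element, which is what triggers the error.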
Change this:
CASE
WHEN SOURCE_CAT = 'FACEBOOK' THEN
(
SELECT
splits[SAFE_OFFSET(5)] AS FORMAT,
FROM (
SELECT
ad_content,
SPLIT(REGEXP_REPLACE(
ad_content,
r'([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_(.+)',
r'\1|\2|\3|\4|\5|\6|\7'),
'|') AS splits
FROM ga_digital_marketing)) END AS FORMAT,
to something like this:
-- also replacing case with if for only 1 case
IF(SOURCE_CAT = 'FACEBOOK',
SPLIT(REGEXP_REPLACE(
ad_content,
r'([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_(.+)',
r'\1|\2|\3|\4|\5|\6|\7'),
'|')[SAFE_OFFSET(5)], NULL) AS FORMAT,
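To see what the row-level expression does, here is a small Python sketch of the same logic. The helper names (split_fragments, safe_offset, format_for_row) are hypothetical and only mirror SPLIT, SAFE_OFFSET and the IF above; Python's re module stands in for BigQuery's regex engine, which behaves the same for this pattern.

```python
import re

# Same pattern as in the query: six underscore-delimited fragments, then the rest.
PATTERN = r'([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_(.+)'


def split_fragments(value):
    """Mirror SPLIT(REGEXP_REPLACE(value, pattern, r'\\1|...|\\7'), '|')."""
    return re.sub(PATTERN, r'\1|\2|\3|\4|\5|\6|\7', value).split('|')


def safe_offset(items, index):
    """Mirror SAFE_OFFSET: return None instead of failing when out of range."""
    return items[index] if 0 <= index < len(items) else None


def format_for_row(source_cat, ad_content):
    """Mirror IF(SOURCE_CAT = 'FACEBOOK', ...[SAFE_OFFSET(5)], NULL) for one row."""
    if source_cat != 'FACEBOOK':
        return None
    return safe_offset(split_fragments(ad_content), 5)


sample = 'us_latam_mkt_google_black-friday_audiencie-custom_NNN-NNN_nnn_trafic_responsiv'
rows = [
    ('FACEBOOK', sample),
    ('GOOGLE', sample),          # not Facebook, so no format
    ('FACEBOOK', 'too_short'),   # pattern does not match, so offset 5 is missing
]
formats = [format_for_row(source_cat, ad_content) for source_cat, ad_content in rows]
print(formats)
```

Each row is processed on its own, so exactly one value comes back per row, which is the property the scalar subquery version lacked.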
I am in no way a coder, so I am struggling with this.
I want to convert this query from Google's Google Analytics BigQuery Cookbook, "Products purchased by customers who purchased product A (Enhanced Ecommerce)", into Standard SQL.
I have pasted the code below.
I have made a few attempts but am falling over.
Thank you in advance
John
SELECT hits.product.productSKU AS other_purchased_products,
COUNT(hits.product.productSKU) AS quantity
FROM (
SELECT fullVisitorId, hits.product.productSKU, hits.eCommerceAction.action_type
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
WHERE hits.product.productSKU CONTAINS 'GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND hits.product.productSKU IS NOT NULL
AND hits.product.productSKU !='GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC;
Below is the pure equivalent in BigQuery Standard SQL (no optimizations or improvements, just a straight translation from legacy to standard):
SELECT productSKU AS other_purchased_products, COUNT(productSKU) AS quantity
FROM (
SELECT fullVisitorId, prod.productSKU, hit.eCommerceAction.action_type
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
AND prod.productSKU LIKE '%GGOEYOCR077799%'
AND hit.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND productSKU IS NOT NULL
AND productSKU !='GGOEYOCR077799'
AND action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC
It produces exactly the same result as the legacy version.
I ran a test and now want to export my Google Optimize data with BigQuery.
Unfortunately, both "experimentId" and "experimentVariant" are empty in the BigQuery export, although there is test data in Google Analytics for this date range.
Is there a problem with the syntax?
Standard SQL:
SELECT
clientId,
visitId,
fullVisitorId,
exp.experimentId AS experimentId,
exp.experimentVariant AS experimentVariant,
trafficSource.source AS source,
trafficSource.medium AS medium,
hits.page.pagePath AS pagePath,
timestamp_seconds(visitStartTime+(CAST(ROUND(hits.time/1000) AS INT64)))
AS timestamp,
hits.eventInfo.eventCategory AS EventCategory,
hits.eventInfo.eventAction AS EventAction,
hits.eventInfo.eventLabel AS EventLabel
FROM `123456789.ga_sessions_*` LEFT JOIN
UNNEST(hits) hits LEFT JOIN
UNNEST(hits.experiment) exp
WHERE hits.page.pagePath LIKE '%page1/page2%' AND _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', date '2019-04-16') AND FORMAT_DATE('%Y%m%d', date
'2019-04-22')
GROUP BY
clientId,
visitId,
fullVisitorId,
experimentId,
experimentVariant,
source,
medium,
pagePath,
timestamp,
eventCategory,
EventAction,
EventLabel
I'm trying to create a master view of a group of properties that are being imported into BigQuery, but it seems that using UNNEST(hits) duplicates the data, leading to inaccurate values for revenue etc.
I have tried to understand why the UNNEST causes this, but I can't figure it out.
SELECT Date, hostname, channelGrouping, sum(transactionRevenue) as Revenue, sum(Shipping) as Shipping, sum(visits) as Sessions, sum(bounces) as Bounces, sum(transactions) as Transactions
FROM
(SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `102674002.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509')
Group By Date, hostname, channelGrouping
Order by Date
This might do the trick:
SELECT
date,
channelGrouping,
SUM(Revenue) Revenue,
SUM(Shipping) Shipping,
SUM(bounces) bounces,
SUM(transactions) transactions,
hostname,
COUNT(date) sessions
FROM(
SELECT
date,
channelGrouping,
totals.totaltransactionrevenue / 1e6 Revenue,
ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL)) hostnames,
(SELECT SUM(hits.transaction.transactionshipping) / 1e6 FROM UNNEST(hits) hits) Shipping,
totals.bounces bounces,
totals.transactions transactions
FROM `project_id.dataset_id.ga_sessions_*`
WHERE 1 = 1
AND ARRAY_LENGTH(ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL))) > 0
AND _TABLE_SUFFIX BETWEEN '20170601' AND '20170609'
UNION ALL
(...)
),
UNNEST(hostnames) hostname
GROUP BY
date, channelGrouping, hostname
Notice that in this query I avoided applying the UNNEST operation to the hits field at the top level; I do so only inside subselects.
To understand why, you have to understand how GA data is stored in BigQuery. There are basically two types of data: session-level data and hit-level data. Each visit to your website ends up generating a row in BigQuery, like so:
{fullvisitorid: 1, visitid: 1, date: '20170601', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 0, bounces: 0}}
If the same customer comes back a day later, it generates another row in BigQuery, something like:
{fullvisitorid: 1, visitid: 2, date: '20170602', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 50000000, bounces: 2}}
As you can see, the fields outside the hits key belong to the session level, while each hit (i.e., each interaction the customer has on your website) adds another entry to the hits array. When you apply UNNEST, you are essentially applying a cross join between the values inside the array and the outer fields.
And this is where duplication happens!
Given the previous example, if we apply UNNEST to the hits field, you end up with something like:
fullvisitorid  visitid  totals.totalTransactionRevenue  hits.hitNumber
1              1        0                               1
1              1        0                               2
1              2        50000000                        1
1              2        50000000                        2
Notice that each hit inside the hits field causes the outer fields, such as totals.totalTransactionRevenue, to be repeated once for every hitNumber in the hits ARRAY.
So if you later apply an operation like SUM(totals.totalTransactionRevenue), you end up summing this field once per hit, i.e. multiplied by the number of hits the customer had in that visitid.
What I tend to do is avoid the UNNEST operation on the hits field at the top level (it can be costly, depending on the data volume) and apply it only inside subqueries, where the unnesting happens within each row and so does not duplicate data.
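The duplication can be illustrated outside BigQuery with the two example rows from above. This is only a sketch: plain Python dictionaries stand in for the session rows, and a nested loop stands in for the cross join that a top-level UNNEST(hits) performs.

```python
# The two example sessions from above, reduced to the relevant fields.
sessions = [
    {"visitid": 1, "revenue_micros": 0,
     "hits": [{"hitNumber": 1}, {"hitNumber": 2}]},
    {"visitid": 2, "revenue_micros": 50000000,
     "hits": [{"hitNumber": 1}, {"hitNumber": 2}]},
]

# Equivalent of "FROM sessions, UNNEST(hits)": one output row per hit,
# with the session-level revenue repeated on each of them.
flattened = [
    (s["visitid"], s["revenue_micros"], h["hitNumber"])
    for s in sessions
    for h in s["hits"]
]

# Summing after the flattening counts the revenue once per hit.
revenue_after_unnest = sum(revenue for _, revenue, _ in flattened) / 1e6

# Summing at the session level counts it once per session.
revenue_session_level = sum(s["revenue_micros"] for s in sessions) / 1e6

print(len(flattened), revenue_after_unnest, revenue_session_level)
```

Here the flattened sum is twice the session-level sum because the paying session has two hits; with real data the inflation factor varies per session, which is why the totals look arbitrary rather than simply doubled.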