Accessing Google Analytics Custom Dimensions with BigQuery - sql

I want to get the following schema out of my GA BigQuery data:
Hostname; customDimension2; customDimension3; PageViews; ScreenViews; TotalEvents; Sessions
At first I just want to get the Hostname and cd2. My query looks like the following:
SELECT hits.page.hostname, hits.customDimensions.value
FROM `dataset`, UNNEST(hits) as hits
WHERE hits.customDimensions.index = 2
LIMIT 1000
I get the following Error:
Cannot access field index on a value with type ARRAY<STRUCT<index INT64, value STRING>> at [1:162]
So how can I handle two different BigQuery arrays?

Since you can have up to 200 entries in that array and you usually only want one of them, it is better not to cross join with it but to write a little subquery instead.
SELECT
page.hostname,
(SELECT value FROM UNNEST(h.customDimensions) WHERE index=2) AS cd2
FROM `dataset`,
UNNEST(hits) as h
LIMIT 1000
The more data you have, the better this query will perform compared to the cross-join version, because the subquery only looks up the one index you need instead of multiplying every hit by its customDimensions array.
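In case it is useful, here is a rough sketch of how the same pattern could be extended to the full schema from the question. Counting PageViews, ScreenViews and TotalEvents by hit type and approximating Sessions as distinct fullVisitorId/visitId combinations are my assumptions, so adjust them to your own definitions:
#standardSQL
SELECT
  Hostname,
  customDimension2,
  customDimension3,
  COUNTIF(hitType = 'PAGE') AS PageViews,
  COUNTIF(hitType = 'APPVIEW') AS ScreenViews,
  COUNTIF(hitType = 'EVENT') AS TotalEvents,
  COUNT(DISTINCT sessionId) AS Sessions
FROM (
  SELECT
    h.page.hostname AS Hostname,
    (SELECT value FROM UNNEST(h.customDimensions) WHERE index = 2) AS customDimension2,
    (SELECT value FROM UNNEST(h.customDimensions) WHERE index = 3) AS customDimension3,
    h.type AS hitType,
    -- session identifier built from the visitor id and visit id
    CONCAT(fullVisitorId, '-', CAST(visitId AS STRING)) AS sessionId
  FROM `dataset`, UNNEST(hits) AS h
)
GROUP BY 1, 2, 3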

Below is for BigQuery Standard SQL
#standardSQL
SELECT hit.page.hostname, customDimension.value
FROM `dataset`, UNNEST(hits) AS hit, UNNEST(hit.customDimensions) AS customDimension
WHERE customDimension.index = 2
LIMIT 100

Related

AWS Timestream query to get average measure for the first month of samples

In AWS Timestream I am trying to get the average heart rate for the first month after we started receiving heart-rate samples for a specific user, plus the average for the last week. I'm having trouble with the query for the first-month part. When I try to use MIN(time) in the WHERE clause I get the error: WHERE clause cannot contain aggregations, window functions or grouping operations.
SELECT * FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time < min(time) + 30
If I add it as a column and try to query on the column, I get the error: Column 'first_sample_time' does not exist
SELECT MIN(time) AS first_sample_time FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time > first_sample_time
Also if I try to add to MIN(time) I get the error: line 1:18: '+' cannot be applied to timestamp, integer
SELECT MIN(time) + 30 AS first_sample_time FROM "DATABASE"."TABLE"
Here is what I finally came up with, but I'm wondering if there is a better way to do it.
WITH first_month AS (
  SELECT
    MIN(time) AS creation_date,
    from_milliseconds(to_milliseconds(MIN(time)) + 2628000000) AS end_of_first_month,
    USER
  FROM "DATABASE"."TABLE"
  WHERE USER = 'xxx'
    AND measure_name = 'heart_rate'
  GROUP BY USER
),
first_month_avg AS (
  SELECT
    AVG(hm.measure_value::double) AS first_month_average,
    fm.USER
  FROM "DATABASE"."TABLE" hm
  JOIN first_month fm ON hm.USER = fm.USER
  WHERE measure_name = 'heart_rate'
    AND hm.time BETWEEN fm.creation_date AND fm.end_of_first_month
  GROUP BY fm.USER
),
last_week_avg AS (
  SELECT
    AVG(measure_value::double) AS last_week_average,
    USER
  FROM "DATABASE"."TABLE"
  WHERE measure_name = 'heart_rate'
    AND time > ago(14d)
    AND USER = 'xxx'
  GROUP BY USER
)
SELECT
  lwa.last_week_average,
  fma.first_month_average,
  lwa.USER
FROM first_month_avg fma
JOIN last_week_avg lwa ON fma.USER = lwa.USER
Is there a better or more efficient way to do this?
I can see you've run into a few challenges along the way to your solution, and hopefully I can clear these up for you and also propose a cleaner way of getting there.
Filtering on aggregates
As you've experienced first hand, SQL doesn't allow aggregates in the WHERE clause, and you also cannot filter on new columns you've created in the SELECT clause, such as aggregates or CASE expressions, because those columns/results are not present in the table you're querying.
Fortunately there are ways around this, such as:
Making your main query a subquery, and then filtering on the result of that query, like below
Select * from (select *,count(that_good_stuff) as total_good_stuff from tasty_table group by 1,2,3) where total_good_stuff > 69
This works because the aggregate column (count) is no longer an aggregate by the time it's referenced in the WHERE clause; it's just an ordinary column in the result of the subquery.
Having clause
If a subquery isn't your cup of tea, you can use the HAVING clause straight after your GROUP BY, which acts like a WHERE clause but is evaluated after grouping, so it can filter on aggregates.
This is better than resorting to a subquery in most cases, as it's more readable and I believe more efficient.
select *,count(that_good_stuff) as total_good_stuff from tasty_table group by 1,2,3 having total_good_stuff > 69
Finally, window functions are fantastic...they've really helped condense many queries I've made in the past by removing the need for subqueries/CTEs. If you could share some example raw data (remove any PII of course) I'd be happy to share an example for your use case.
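In the meantime, here is an untested sketch of the window-function idea applied to the column names visible in your query (it assumes Timestream's window-function support covers MIN(...) OVER (...), and it reuses your millisecond arithmetic and your ago() filter):
WITH samples AS (
  SELECT
    time,
    measure_value::double AS heart_rate,
    -- earliest heart-rate sample for this user, repeated on every row
    MIN(time) OVER (PARTITION BY USER) AS first_sample_time
  FROM "DATABASE"."TABLE"
  WHERE measure_name = 'heart_rate'
    AND USER = 'xxx'
)
SELECT
  AVG(CASE WHEN time <= from_milliseconds(to_milliseconds(first_sample_time) + 2628000000)
           THEN heart_rate END) AS first_month_average,
  AVG(CASE WHEN time > ago(14d) THEN heart_rate END) AS last_week_average
FROM samples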
Nevertheless, hope this helps!
Tom

BigQuery: filter out hits based on hit and product scope dimensions

In BigQuery, there is the Google Analytics based query stated below, and this works correctly.
#standardSQL
SELECT
Date,
SUM(totals.visits) AS Sessions,
SUM(totals.transactions) AS Transactions
FROM
`[projectID].[DatasetID].ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20181217'
AND '20181217'
AND totals.visits > 0
GROUP BY
Date
In this query, I need to exclude all hits where within a hit...
..GA custom dimension #23 (hit-scope) contains value 'editor'
OR
..GA custom dimension #6 (product-scope) matches regular expression value '^63.....$'
OR
..GA hits.page.pagePath matches regular expression value 'gebak|cake'
Note: the intention is not to apply the 3 conditions above on session level but on hit level, since I'd like to reproduce numbers from a different GA view than the one from which the data is loaded into BigQuery. In that other GA view the 3 conditions above are set as view filters.
The 'best' query thus far is the one below (based on Martin Weitzmann's answer below). However, the dataset is not filtered in this query (in other words, the conditions do not work).
SELECT Date,
-- hits,
SUM(totals.transactions),
SUM(totals.visits)
FROM (
(
SELECT date, totals,
-- create own hits array
ARRAY(
SELECT AS STRUCT
hitnumber,
page,
-- create own product array
ARRAY(
SELECT AS STRUCT productSku, productQuantity
FROM h.product AS p
WHERE (SELECT COUNT(1)=0 FROM p.customDimensions WHERE index=6 AND value like '63%')
) AS product
FROM t.hits as h
WHERE
NOT REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
AND
(SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
) AS hits
FROM
`[projectID].[DatasetID].ga_sessions_*` t
WHERE
_TABLE_SUFFIX BETWEEN '20181217'
AND '20181217'
AND totals.visits > 0
))
GROUP BY Date
Does anyone know how to achieve the desired output?
Thanks a lot in advance!
Note: the projectID and datasetID have been masked in both queries because of privacy concerns.
Own arrays approach
You can create your own hits and product arrays by using sub-queries on the original arrays and feeding their output back into the ARRAY() function. In those subqueries you can filter out hits and products:
#standardsql
SELECT
date,
hits
--SUM(totals.visits) AS Sessions,
--SUM(totals.transactions) AS Transactions
FROM
(
SELECT
date, totals,
-- create own hits array
ARRAY(
SELECT AS STRUCT
hitnumber,
page,
-- create own product array
ARRAY(
SELECT AS STRUCT productSku, productQuantity
FROM h.product AS p
WHERE (SELECT COUNT(1)=0 FROM p.customDimensions WHERE index=6 AND value like '63%')
) AS product
FROM t.hits as h
WHERE
NOT REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
AND
(SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
) AS hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20161104` t
)
--GROUP BY 1
LIMIT 100
I left this example ungrouped, but you can easily adjust it by commenting out hits in the outer SELECT, re-enabling the two SUM() columns, and uncommenting the GROUP BY.
Segmentation approach
I think you just need the right sub-query in your WHERE statement:
#standardsql
SELECT
date,
SUM(totals.visits) AS Sessions,
SUM(totals.transactions) AS Transactions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*` t
WHERE
(SELECT COUNT(1)=0 FROM t.hits h
WHERE
(SELECT count(1)>0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
OR
(SELECT count(1)>0 from h.product p, p.customdimensions cd WHERE index=6 AND value like '63%')
OR
REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
)
GROUP BY date
Since all your aggregations are on session level, you don't need any flattening (i.e. cross joins with arrays) of the main table, which is costly.
In your outermost WHERE you enter the hits array with a subquery - it's like a for-each over rows. Here you can already count occurrences of REGEXP_CONTAINS(page.pagePath,r'gebak|cake').
For the other cases, you write a subquery again to enter the respective array - in the first case, customDimensions within hits. This is like a nested for-each inside the other one (subquery in a subquery).
In the second case, I'm simply flattening - but within the subquery only: product with its customDimensions. So this is a one-time nested for-each as well, because I was lazy and cross-joined. I could have written another subquery instead of the cross join, so basically a triple-nested for-each (subquery in a subquery in a subquery).
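For illustration, that product-scope condition written as nested subqueries instead of the cross join could look roughly like this (run here against the public sample data and condensed to the one condition):
#standardSQL
SELECT date, SUM(totals.visits) AS Sessions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20161104` t
WHERE
  -- keep sessions with no hit containing a product whose cd6 starts with '63'
  (SELECT COUNT(1)=0 FROM t.hits h
   WHERE (SELECT COUNT(1)>0 FROM h.product p
          WHERE (SELECT COUNT(1)>0 FROM p.customDimensions WHERE index=6 AND value LIKE '63%')))
GROUP BY date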
Since I'm counting cases that I want to exclude, my outer condition is COUNT(1)=0.
I could only test it with the GA sample data, so it's somewhat untested against your own tables. But I guess you get the idea.
Just a quick example/idea of how to use WITH and REGEXP_CONTAINS on a public dataset:
WITH CD6 AS (
SELECT cd.value, SUM(totals.visits) AS Sessions6Sum
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS prod,
UNNEST(prod.customDimensions) AS cd
WHERE cd.index=6
AND NOT REGEXP_CONTAINS(cd.value,r'^63.....$')
GROUP BY cd.value
),
CD23 AS (
SELECT cd.value, SUM(totals.visits) AS Sessions23Sum
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS prod,
UNNEST(prod.customDimensions) AS cd
WHERE cd.index=23
AND NOT REGEXP_CONTAINS(cd.value,r'editor')
GROUP BY cd.value
)
select CD6.Sessions6Sum + CD23.Sessions23Sum from CD6, CD23
You can find more information on how to use REGEXP_CONTAINS and REGEXP_EXTRACT in the official BigQuery documentation.
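In case the difference between the two is unclear, a minimal illustration (REGEXP_CONTAINS returns a BOOL, REGEXP_EXTRACT returns the matched substring or NULL; the literals here are made-up sample values):
#standardSQL
SELECT
  REGEXP_CONTAINS('cd-editor-hit', r'editor') AS has_editor,   -- true
  REGEXP_EXTRACT('6312345', r'^63.....$') AS cd6_match         -- '6312345'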

Selecting the value of a hit level custom metric from google analytics export in big query

We have created a hit level custom metric in google analytics that I want to retrieve in BigQuery. When running the following query:
#StandardSQL
SELECT h.page.pagePath, SUM(h.customMetrics.value)
FROM `141884522.ga_sessions_20181024`, UNNEST(hits) as h
GROUP BY h.page.pagePath
I get this error:
Error: Cannot access field value on a value with type ARRAY<STRUCT<index INT64, value INT64>> at [2:45]
I can select just h.customMetrics (without grouping) which returns h.customMetrics.value and h.customMetrics.index but I cannot select the value or index specifically.
Anyone knows how to do that?
#standardSQL
SELECT h.page.pagePath, SUM(metric.value)
FROM `141884522.ga_sessions_20181024`, UNNEST(hits) AS h, UNNEST(h.customMetrics) metric
GROUP BY h.page.pagePath
Btw, if you want to see all pagePaths, even those with missing metrics (in case that happens in your data), I would recommend replacing the CROSS JOIN with a LEFT JOIN as in the example below:
#standardSQL
SELECT h.page.pagePath, SUM(metric.value)
FROM `141884522.ga_sessions_20181024`, UNNEST(hits) AS h
LEFT JOIN UNNEST(h.customMetrics) metric
GROUP BY h.page.pagePath

BigQuery accessing CustomDimensions with new SQL syntax

I am migrating to the new SQL syntax in BigQuery, since it seems more flexible. However I am a bit stuck when it comes to accessing the fields in customDimensions. I am writing something quite simple like this:
SELECT
cd.customDimensions.index,
cd.customDimensions.value
FROM `xxxxx.ga_sessions_20170312`, unnest(hits) cd
limit 100
But I get the error
Error: Cannot access field index on a value with type ARRAY<STRUCT<index INT64, value STRING>>
However, if I run something like this, it works perfectly fine:
SELECT
date,
SUM((SELECT SUM(latencyTracking.pageLoadTime) FROM UNNEST(hits))) pageLoadTime,
SUM((SELECT SUM(latencyTracking.serverResponseTime) FROM UNNEST(hits))) serverResponseTime
FROM `xxxxxx.ga_sessions_20170312`
group by 1
Is there some different logic when it comes to querying the customDimensions?
If the intention is to retrieve all custom dimensions in a flattened form, then join with UNNEST(customDimensions) as well:
#standardSQL
SELECT
cd.index,
cd.value
FROM `xxxxx.ga_sessions_20170312`,
unnest(hits) hit,
unnest(hit.customDimensions) cd
limit 100;
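Another option is to keep one row per hit and pivot specific indexes into columns with a scalar subquery per index, along these lines: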
SELECT
fullvisitorid,
(SELECT MAX(IF(index=1, value, NULL)) FROM UNNEST(hits.customDimensions)) AS CustomDimension1,
(SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS CustomDimension2
FROM
`XXXXXXX`, unnest(hits) as hits

Query for selecting sequence of hits consumes large quantity of data

I'm trying to measure the conversion rate through alternative funnels on a website. My query has been designed to output a count of sessions that viewed the relevant start URL and a count of sessions that hit the confirmation page strictly in that order. It does this by comparing the times of the hits.
My query appears to return accurate figures, but in doing so selects a massive quantity of data, just under 23GB for what I've attempted to limit to one hour of one day. I don't seem to have written my query in a particularly efficient way and gather that I'll use up all of my company's data quota fairly quickly if I continue to use it.
Here's the offending query in full:
WITH
s1 AS (
SELECT
fullVisitorId,
visitId,
LOWER(h.page.pagePath) AS path,
device.deviceCategory AS platform,
MIN(h.time) AS s1_time
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
AND
(LOWER(h.page.pagePath) LIKE '{funnel-start-url-1}%' OR LOWER(h.page.pagePath) LIKE '{funnel-start-url-2}%')
AND
totals.visits = 1
AND
h.hour < 21
AND
h.hour >= 20
AND
h.type = "PAGE"
GROUP BY
path,
platform,
fullVisitorId,
visitId
ORDER BY
fullVisitorId ASC, visitId ASC
),
confirmations AS (
SELECT
fullVisitorId,
visitId,
MIN(h.time) AS confirmation_time
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
AND
h.type = "PAGE"
AND
(LOWER(h.page.pagePath) LIKE '{confirmation-url-1}%' OR LOWER(h.page.pagePath) LIKE '{confirmations-url-2}%')
AND
totals.visits = 1
AND
h.hour < 21
AND
h.hour >= 20
GROUP BY
fullVisitorId,
visitId
)
SELECT
platform,
path,
COUNT(path) AS Views,
SUM(
CASE
WHEN s1.s1_time < confirmations.confirmation_time
THEN 1
ELSE 0
END
) AS SubsequentPurchases
FROM
s1
LEFT JOIN
confirmations
ON
s1.fullVisitorId = confirmations.fullVisitorId
AND
s1.visitId = confirmations.visitId
GROUP BY
platform,
path
What is it about this query that means it has to process so much data? Is there a better way to get at these numbers? Ideally any method should be able to measure the multiple different routes, but I'd settle for sustainability at this point.
There are probably a few ways you can optimize your query, but it seems like that won't entirely solve your issue (as I'll try to explain further).
As for the query, this one does the same but avoids re-selecting data and the LEFT JOIN operation:
SELECT
path,
platform,
COUNT(path) views,
COUNT(CASE WHEN last_hn > first_hn THEN 1 END) SubsequentPurchases
from(
SELECT
fv,
v,
platform,
path,
first_hn,
MAX(last_hn) OVER(PARTITION BY fv, v) last_hn
from(
SELECT
fullvisitorid fv,
visitid v,
device.devicecategory platform,
LOWER(hits.page.pagepath) path,
MIN(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product') THEN hits.hitnumber ELSE null END) first_hn,
MAX(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'success') then hits.hitnumber ELSE null END) last_hn
FROM `project_id.data_set.ga_sessions_20170112`,
UNNEST(hits) hits
WHERE
REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product|success')
AND totals.visits = 1
AND hits.type = 'PAGE'
GROUP BY
fv, v, path, platform
)
)
GROUP BY
path, platform
HAVING NOT REGEXP_CONTAINS(path, r'success')
first_hn tracks the funnel start URL (for which I used the terms "catalog" and "product") and last_hn tracks the confirmation URLs (for which I used the term "success", but you could add more values to the regex). Also, by using the MIN and MAX operations and the analytic functions you get some optimizations in your query.
There are a few points to make here, though:
If you insert WHERE hits.hour = 20, BigQuery still has to scan the whole table to tell which hits fall in hour 20 and which do not. That means the 23 GB you observed still accounts for the whole day.
For comparison, I tested your query against our ga_sessions tables and it took around 31 days of data to reach 23 GB. As you are not selecting that many fields, it shouldn't be that easy to reach this amount unless you have a considerably high traffic volume in your data source.
Given current BigQuery pricing, 23 GB would cost you roughly $0.11 to process, which is quite cost-efficient.
Another thing I could imagine is that you are running this query several times a day and have no caching or proper architecture for these operations.
All this being said, you can optimize your query, but I suspect it won't change that much in the end, as it seems you have quite a high volume of data. Processing 23 GB a few times shouldn't be a problem, but if you are concerned that it will eat into your quota, then presumably you are running this query several times a day.
This being the case, see if using the query cache or saving the results into another table and then querying that instead will help. Also, you could start saving daily tables with just the sessions you are interested in (those containing the URL patterns you are looking for) and then run your final query against these newly created tables, which would allow you to query over a bigger range of days while spending much less.
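To sketch that last idea (the table names are placeholders, and this assumes DDL statements are available to you; a scheduled query with a destination table achieves the same thing):
#standardSQL
CREATE TABLE `project.dataset.funnel_sessions_20170107` AS
SELECT
  fullVisitorId,
  visitId,
  device.deviceCategory AS platform,
  hits
FROM `project.dataset.ga_sessions_20170107`
WHERE totals.visits = 1
  -- keep only sessions containing at least one hit on the funnel or confirmation URLs
  AND (SELECT COUNT(1) > 0
       FROM UNNEST(hits) AS h
       WHERE REGEXP_CONTAINS(LOWER(h.page.pagePath), r'/catalog/|product|success'))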