I am migrating to the new SQL syntax in BigQuery, since it seems more flexible. However, I am a bit stuck when it comes to accessing the fields in customDimensions. I am writing something quite simple like this:
SELECT
cd.customDimensions.index,
cd.customDimensions.value
FROM `xxxxx.ga_sessions_20170312`, unnest(hits) cd
limit 100
But I get the error
Error: Cannot access field index on a value with type ARRAY<STRUCT<index INT64, value STRING>>
However, if I run something like this, it works perfectly fine:
SELECT
date,
SUM((SELECT SUM(latencyTracking.pageLoadTime) FROM UNNEST(hits))) pageLoadTime,
SUM((SELECT SUM(latencyTracking.serverResponseTime) FROM UNNEST(hits))) serverResponseTime
FROM `xxxxxx.ga_sessions_20170312`
group by 1
Is there some different logic when it comes to querying the customDimensions?
If the intention is to retrieve all custom dimensions in a flattened form, then join with UNNEST(customDimensions) as well:
#standardSQL
SELECT
cd.index,
cd.value
FROM `xxxxx.ga_sessions_20170312`,
unnest(hits) hit,
unnest(hit.customDimensions) cd
limit 100;
Or, to pivot specific custom dimensions into one column per index, you can use a subquery per dimension:
SELECT
fullvisitorid,
(SELECT MAX(IF(index=1, value, NULL)) FROM UNNEST(hits.customDimensions)) AS CustomDimension1,
(SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS CustomDimension2
FROM
`XXXXXXX`, UNNEST(hits) AS hits
I want to get the following schema out of my GA BigQuery data:
Hostname; customDimension2; customDimension3; PageViews; ScreenViews; TotalEvents; Sessions
At first I just want to get the Hostname and cd2. My query looks like the following:
SELECT hits.page.hostname, hits.customDimensions.value
FROM `dataset`, UNNEST(hits) as hits
WHERE hits.customDimensions.index = 2
LIMIT 1000
I get the following Error:
Cannot access field index on a value with type ARRAY<STRUCT<index INT64, value STRING>> at [1:162]
So how can I handle two different BigQuery arrays?
Since you can have up to 200 entries in that array and you usually only want one of them, it is better not to cross join with it but to write a small subquery.
SELECT
page.hostname,
(SELECT value FROM UNNEST(h.customDimensions) WHERE index=2) AS cd2
FROM `dataset`,
UNNEST(hits) as h
LIMIT 1000
The more data you have, the better this query will perform compared to the cross-join version, because the subquery avoids multiplying rows only to filter most of them out again.
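For example, the same pattern extends to several dimensions without any extra joins (a sketch; index 5 is just a hypothetical second dimension, adjust to your setup):
SELECT
page.hostname,
(SELECT value FROM UNNEST(h.customDimensions) WHERE index=2) AS cd2,
(SELECT value FROM UNNEST(h.customDimensions) WHERE index=5) AS cd5  -- hypothetical index for illustration
FROM `dataset`,
UNNEST(hits) as h
LIMIT 1000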
Below is for BigQuery Standard SQL
#standardSQL
SELECT hit.page.hostname, customDimension.value
FROM `dataset`, UNNEST(hits) AS hit, UNNEST(hit.customDimensions) AS customDimension
WHERE customDimension.index = 2
LIMIT 100
When analyzing GA data in BigQuery, I found duplicate records with the same values for the following fields:
fullVisitorId
visitStartTime
hits.hitNumber
I filtered down the results a bit by a specific fullVisitorId and visitStartTime
SELECT
fullVisitorId,
visitStartTime,
hits.hitNumber,
hits.time,
TIMESTAMP_SECONDS(CAST(visitStartTime + 0.001 * hits.time AS INT64)) AS hitsTimestamp
FROM
`testGAview.ga_sessions_20200113`,
UNNEST(hits) AS hits
WHERE
fullVisitorId = '324982394082304'
AND visitStartTime = 324234233
ORDER BY
fullVisitorId,
visitStartTime,
hitNumber
The above query returns 13 records that have duplicate fullVisitorId, visitStartTime, and hits.hitNumber. I'm not sure how this is possible because, looking at the schema, all of these fields being the same across different rows is unexpected. I should say that this is an extremely small percentage of records (0.002%), so I'm thinking it could be a processing issue on the GA end.
What I'd like to do now is unnest ALL of the fields to see the other values, alongside the fullVisitorId, visitStartTime, and hitNumber
SELECT
*
FROM
`testGAview.ga_sessions_20200113`, UNNEST(hits) AS h
WHERE
fullVisitorId = '324982394082304'
AND visitStartTime = 324234233
AND h.hitNumber = 23
What I'm hoping the above returns is 2 rows that meet the above conditions, and also shows values for all the other fields, to see if they are the exact same.
Can anybody help with this? Thanks!
The real question seems to be "how to compare if two rows are identical". Am I right?
This would solve that problem:
CREATE TEMP FUNCTION first_two_identical(x ANY TYPE) AS (
TO_JSON_STRING(x[OFFSET(0)]) = TO_JSON_STRING(x[OFFSET(1)])
);
SELECT first_two_identical(ARRAY_AGG(a)) identical
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` a
WHERE visitId IN (1501583974, 1501616585)
The SELECT chooses 2 rows (the full rows) and packs them together into an array.
The array is sent to the function first_two_identical()
The function takes the first 2 elements of the array it receives.
To transform the full rows into comparable objects, we used TO_JSON_STRING().
That's it.
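Applied to the duplicate hits from the question, that could look roughly like this (a sketch, reusing the placeholder table, IDs, and hitNumber from the question, and assuming exactly two hits match the filter):
CREATE TEMP FUNCTION first_two_identical(x ANY TYPE) AS (
TO_JSON_STRING(x[OFFSET(0)]) = TO_JSON_STRING(x[OFFSET(1)])
);
-- assumes the filter matches exactly the two suspected duplicate hits
SELECT first_two_identical(ARRAY_AGG(h)) identical
FROM `testGAview.ga_sessions_20200113`, UNNEST(hits) AS h
WHERE fullVisitorId = '324982394082304'
AND visitStartTime = 324234233
AND h.hitNumber = 23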
In BigQuery, there is the Google Analytics based query stated below, and this works correctly.
#standardSQL
SELECT
Date,
SUM(totals.visits) AS Sessions,
SUM(totals.transactions) AS Transactions
FROM
`[projectID].[DatasetID].ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20181217'
AND '20181217'
AND totals.visits > 0
GROUP BY
Date
In this query, I need to exclude all hits where, within a hit...
..GA custom dimension #23 (hit-scope) contains value 'editor'
OR
..GA custom dimension #6 (product-scope) matches regular expression value '^63.....$'
OR
..GA hits.page.pagePath matches regular expression value 'gebak|cake'
Note: the intention is not to apply the 3 conditions stated above on session level but on hit level, since I'd like to reproduce numbers from a different GA view than the one from which the data is loaded into BigQuery. In that other GA view, the 3 conditions stated above are set as view filters.
The 'best' query thus far is the one below (based on Martin Weitzmann's answer below). However, the dataset is not filtered in this query (in other words, the conditions do not work).
SELECT Date,
-- hits,
SUM(totals.transactions),
SUM(totals.visits)
FROM (
(
SELECT date, totals,
-- create own hits array
ARRAY(
SELECT AS STRUCT
hitnumber,
page,
-- create own product array
ARRAY(
SELECT AS STRUCT productSku, productQuantity
FROM h.product AS p
WHERE (SELECT COUNT(1)=0 FROM p.customDimensions WHERE index=6 AND value like '63%')
) AS product
FROM t.hits as h
WHERE
NOT REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
AND
(SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
) AS hits
FROM
`[projectID].[DatasetID].ga_sessions_*` t
WHERE
_TABLE_SUFFIX BETWEEN '20181217'
AND '20181217'
AND totals.visits > 0
))
GROUP BY Date
Does anyone know how to achieve the desired output?
Thanks a lot in advance!
Note: the projectID and datasetID have been masked in both queries because of privacy concerns.
Own arrays approach
You can create your own hits and product arrays by using subqueries on the original arrays and feeding their output back into the ARRAY() function. In those subqueries you can filter out your hits and products:
#standardsql
SELECT
date,
hits
--SUM(totals.visits) AS Sessions,
--SUM(totals.transactions) AS Transactions
FROM
(
SELECT
date, totals,
-- create own hits array
ARRAY(
SELECT AS STRUCT
hitnumber,
page,
-- create own product array
ARRAY(
SELECT AS STRUCT productSku, productQuantity
FROM h.product AS p
WHERE (SELECT COUNT(1)=0 FROM p.customDimensions WHERE index=6 AND value like '63%')
) AS product
FROM t.hits as h
WHERE
NOT REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
AND
(SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
) AS hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20161104` t
)
--GROUP BY 1
LIMIT 100
I left this example in an ungrouped state, but you can easily adjust it by commenting out hits, un-commenting the aggregations, and grouping accordingly ...
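As a hedged sketch (my addition, not part of the original answer): since summing the session-level totals ignores the rebuilt hits array, the hit-level filters only influence the numbers if you aggregate from the array itself, for example by counting the remaining hits per day:
#standardsql
SELECT
date,
SUM(ARRAY_LENGTH(hits)) AS remainingHits,
COUNTIF(ARRAY_LENGTH(hits) > 0) AS sessionsWithRemainingHits
FROM
(
SELECT
date,
-- rebuild the hits array with the same hit-level filters as above
ARRAY(
SELECT AS STRUCT hitnumber, page
FROM t.hits AS h
WHERE
NOT REGEXP_CONTAINS(page.pagePath, r'gebak|cake')
AND (SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value LIKE '%editor%')
) AS hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20161104` t
)
GROUP BY 1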
Segmentation approach
I think you just need the right sub-query in your WHERE statement:
#standardsql
SELECT
date,
SUM(totals.visits) AS Sessions,
SUM(totals.transactions) AS Transactions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*` t
WHERE
(SELECT COUNT(1)=0 FROM t.hits h
WHERE
(SELECT count(1)>0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
OR
(SELECT count(1)>0 from h.product p, p.customdimensions cd WHERE index=6 AND value like '63%')
OR
REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
)
GROUP BY date
Since all your grouping is on session-level fields, you don't need any flattening (i.e. cross joins with arrays) of the main table, which is costly.
In your outermost WHERE you enter the hits array with a subquery - it's like a for-each over its rows. Here you can already count occurrences of REGEXP_CONTAINS(page.pagePath,r'gebak|cake').
For the other cases, you write a subquery again to enter the respective array - in the first case, customDimensions within hits. This is like a nested for-each inside the other one (subquery in a subquery).
In the second case, I'm simply flattening - but within the subquery only: product with its customDimensions. So this is a one-time nested for-each as well, because I was lazy and cross-joined. I could've written another subquery instead of the cross join, making it basically a triple-nested for-each (a subquery in a subquery in a subquery).
Since I'm counting cases that I want to exclude, my outer condition is COUNT(1)=0.
I could only test it with the GA sample data ... so it's kind of untested, but I guess you get the idea.
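As an aside (my reading, not something the original answer states): the COUNT(1)=0 conditions can equivalently be written with NOT EXISTS, which some people find easier to scan:
#standardsql
-- same filters as above, expressed with NOT EXISTS instead of COUNT(1)=0
SELECT
date,
SUM(totals.visits) AS Sessions,
SUM(totals.transactions) AS Transactions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*` t
WHERE
NOT EXISTS (
SELECT 1 FROM t.hits h
WHERE
(SELECT COUNT(1)>0 FROM h.customDimensions WHERE index=23 AND value LIKE '%editor%')
OR (SELECT COUNT(1)>0 FROM h.product p, p.customDimensions cd WHERE index=6 AND value LIKE '63%')
OR REGEXP_CONTAINS(page.pagePath, r'gebak|cake')
)
GROUP BY date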
Just a quick example/idea of how to use WITH and REGEXP_CONTAINS on a public dataset:
WITH CD6 AS (
SELECT cd.value, SUM(totals.visits) AS Sessions6Sum
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS prod,
UNNEST(prod.customDimensions) AS cd
WHERE cd.index=6
AND NOT REGEXP_CONTAINS(cd.value,r'^63.....$')
GROUP BY cd.value
),
CD23 AS (
SELECT cd.value, SUM(totals.visits) AS Sessions23Sum
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS prod,
UNNEST(prod.customDimensions) AS cd
WHERE cd.index=23
AND NOT REGEXP_CONTAINS(cd.value,r'editor')
GROUP BY cd.value
)
SELECT CD6.Sessions6Sum + CD23.Sessions23Sum FROM CD6, CD23
You can get more information on how to use REGEXP_CONTAINS in the official BigQuery documentation.
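For a quick sanity check of the two patterns used above (a minimal illustration, nothing GA-specific):
SELECT
REGEXP_CONTAINS('6312345', r'^63.....$') AS matches_cd6_pattern,  -- TRUE: '63' followed by exactly five characters
REGEXP_CONTAINS('editor-mode', r'editor') AS matches_cd23_pattern -- TRUE: plain substring match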
We have created a hit-level custom metric in Google Analytics that I want to retrieve in BigQuery. When running the following query:
#StandardSQL
SELECT h.page.pagePath, SUM(h.customMetrics.value)
FROM `141884522.ga_sessions_20181024`, UNNEST(hits) as h
GROUP BY h.page.pagePath
I get this error:
Error: Cannot access field value on a value with type ARRAY<STRUCT<index
INT64, value INT64>> at [2:45]
I can select just h.customMetrics (without grouping), which returns h.customMetrics.value and h.customMetrics.index, but I cannot select the value or index specifically.
Does anyone know how to do that?
#standardSQL
SELECT h.page.pagePath, SUM(metric.value)
FROM `141884522.ga_sessions_20181024`, UNNEST(hits) AS h, UNNEST(h.customMetrics) metric
GROUP BY h.page.pagePath
Btw, if you want to see all pagePaths, even those with missing metrics (in case that applies to your data), I would recommend replacing the CROSS JOIN with a LEFT JOIN as in the below example:
#standardSQL
SELECT h.page.pagePath, SUM(metric.value)
FROM `141884522.ga_sessions_20181024`, UNNEST(hits) AS h
LEFT JOIN UNNEST(h.customMetrics) metric
GROUP BY h.page.pagePath
I want to get the total sessions, but because I am unnesting 'hits.product' and 'hits' at the same time, as shown in the code below, it's giving me a lower session count than what I can see in GA. I suspect that it is filtering out the sessions that don't have any products.
There is also a way I can handle that without using UNNEST, by using ARRAY, as shown below:
ARRAY(SELECT DISTINCT v2ProductCategory FROM UNNEST(hits.product)) AS v2ProductCategory
Is there any way I can pull all the sessions and their product category, product name, and hits info (hits.time, hits.page.pagePath) without using ARRAY, which I would then use in my further analysis?
SELECT COUNT(DISTINCT session) FROM (
SELECT
fullvisitorid,
CONCAT(CAST(fullVisitorId AS string),CAST(visitId AS string)) AS session,
hits.time,
hits.page.pagePath,
hits.eCommerceAction.action_type,
product.v2ProductCategory,
product.v2ProductName
FROM
`XXXXXXXXXXXXXXX`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
_TABLE_SUFFIX BETWEEN "20170930"
AND "20170930")