Understanding what leads to the "resources exceeded" error in GBQ?

I'm working in BigQuery on Google Analytics data. At various points in developing the query I get the error: "Resources exceeded". I want to further my understanding of what's happening. I've successfully worked around the problem, but only via trial and error.
When I use the explain tool, it is the 'compute' stage of a query or sub-query that appears to have exceeded resources.
Here's an example of a standard SQL query that succeeds/fails depending on whether certain parts are left in:
SELECT
  fullVisitorId,
  visitId,
  h.type AS type,
  h.hitNumber AS hitNumber,
  h.eventInfo.eventAction AS action,
  LOWER(h.eventInfo.eventCategory) AS category,
  h.page.pagePath AS page,
  h.page.pageTitle AS landingTitle,
  h.page.searchKeyword AS searchTerm,
  LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage,
  SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] AS clickTitle,
  CASE WHEN LEAD(h.page.pageTitle) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) = SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] THEN true ELSE false END AS searchClick
FROM `project.dataset.ga_sessions_*` AS main, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170430'
  AND (
    (h.eventInfo.eventAction = 'click' AND LOWER(h.eventInfo.eventCategory) LIKE '/search%')
    OR h.type = 'PAGE'
  )
ORDER BY
  fullVisitorId ASC, visitId ASC, h.hitNumber ASC
When I remove any one of the following sets of elements, the query runs:
ORDER BY
fullVisitorId ASC, visitId ASC, h.hitNumber ASC
Or:
LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage,
SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] AS clickTitle,
CASE WHEN LEAD(h.page.pageTitle) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) = SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] THEN true ELSE false END AS searchClick
Alternatively, the entire query runs when restricted to a single date partition.
I would describe my current level of understanding as superficial: I know little of the inner workings of GBQ and how it allocates/permits compute resources. I do know that it performs calculations on separate machines where possible; I've heard these described as shards.
What do I need to know about GBQ compute resources in order to understand why the above will work/not work?
N.B.: I only have Tier 1 access, but that doesn't mean I can't gain increased access if I can justify a need. Obviously I don't want to do that at my current level of understanding.

I think the only thing that should be causing a problem in your query is the ORDER BY operation. As you can see in this answer from Jordan, this operation is not parallelizable. You can also check the docs for some ideas of what causes the Resources Exceeded error.
The rest of the query seems to be fine though. I tested your query against our data and it processed almost 300 GB in 20 seconds.
If you still get the error then maybe you are querying quite a high amount of data. If that's the case, you could try breaking the query into smaller date ranges, querying fewer columns, adding some WHERE conditions to filter out rows, changing tier, and so on.
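For instance, here is a sketch of those workarounds applied to a trimmed version of the query above: the date range is narrowed to one week and the trailing ORDER BY is dropped (the ORDER BY inside LEAD() is fine, because each session partition is sorted independently and in parallel):
SELECT
  fullVisitorId,
  visitId,
  h.hitNumber AS hitNumber,
  h.page.pagePath AS page,
  LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage
FROM `project.dataset.ga_sessions_*` AS main, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170407'  -- one week instead of a month
  AND (
    (h.eventInfo.eventAction = 'click' AND LOWER(h.eventInfo.eventCategory) LIKE '/search%')
    OR h.type = 'PAGE'
  )
-- no global ORDER BY: sorting the full result must happen on one node,
-- which is the usual trigger for "Resources exceeded"
If row order matters, sort the (much smaller) result downstream or per session.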

Related

How to UNNEST and flatten all the records in BigQuery when looking at Google Analytics

When analyzing GA data in BigQuery, I found duplicate records with the same values for the following fields:
fullVisitorId
visitStartTime
hits.hitNumber
I filtered the results down to a specific fullVisitorId and visitStartTime:
SELECT
  fullVisitorId,
  visitStartTime,
  hits.hitNumber,
  hits.time,
  TIMESTAMP_SECONDS(CAST(visitStartTime + 0.001 * hits.time AS INT64)) AS hitsTimestamp
FROM
  `testGAview.ga_sessions_20200113`,
  UNNEST(hits) AS hits
WHERE
  fullVisitorId = '324982394082304'
  AND visitStartTime = 324234233
ORDER BY
  fullVisitorId,
  visitStartTime,
  hitNumber
The above query returns 13 records that have duplicate fullVisitorId, visitStartTime, and hits.hitNumber. I'm not sure how this is possible because, looking at the schema, all of these fields being the same across different rows is unexpected. I should say that this is an extremely small percentage of records (0.002%), so I'm thinking it could be a processing issue on the GA end.
What I'd like to do now is unnest ALL of the fields to see the other values, alongside the fullVisitorId, visitStartTime, and hitNumber
SELECT
  *
FROM
  `testGAview.ga_sessions_20200113`,
  UNNEST(hits) AS h
WHERE
  fullVisitorId = '324982394082304'
  AND visitStartTime = 324234233
  AND h.hitNumber = 23
What I'm hoping the above returns is 2 rows that meet the above conditions, and also shows values for all the other fields, to see if they are the exact same.
Can anybody help with this? Thanks!
The real question seems to be "how to compare if two rows are identical". Am I right?
This would solve that problem:
CREATE TEMP FUNCTION first_two_identical(x ANY TYPE) AS (
  TO_JSON_STRING(x[OFFSET(0)]) = TO_JSON_STRING(x[OFFSET(1)])
);

SELECT first_two_identical(ARRAY_AGG(a)) identical
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` a
WHERE visitId IN (1501583974, 1501616585)
The SELECT chooses 2 rows (the full rows) and packs them together into an array.
The array is sent to the function first_two_identical().
The function takes the first 2 elements of the array it receives.
To transform the full rows into comparable objects, we use TO_JSON_STRING().
That's it.
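If the function returns false for a suspicious pair, a quick follow-up (a sketch against the same public table) is to print both rows as JSON and diff them by eye to see which fields actually differ:
SELECT TO_JSON_STRING(a) AS row_json
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` a
WHERE visitId IN (1501583974, 1501616585)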

How to see whole session path in BigQuery?

I'm learning standard SQL in BigQuery and I have a task where I have to show what users did after entering checkout - which specific URLs they visited. I figured out something like this, but it only shows me one previous step, and I need to see at least 5 of them. Is this possible? Thank you
WITH path_and_prev AS (
  SELECT ARRAY(
    SELECT AS STRUCT hits.page.pagePath
      , LAG(hits.page.pagePath) OVER(ORDER BY i) prevPagePath
    FROM UNNEST(hits) hits WITH OFFSET i
  ) x
  FROM `xxxx.ga_sessions_20160801`
)
SELECT COUNT(*) AS cnt, pagePath, prevPagePath
FROM path_and_prev, UNNEST(x)
WHERE REGEXP_CONTAINS(pagePath, r'(checkout/cart)')
GROUP BY 2, 3
ORDER BY cnt DESC
Here is the official GA schema for the BQ export:
https://support.google.com/analytics/answer/3437719?hl=en
(Just a tip: feel free to export it to a sheet (Excel, Google, or whatever) and indent it decently to ease understanding of the nesting :) )
The only way to safely get session behaviour is to use hits.hitNumber. Since pagePath sits under page, which sits under hits, hitNumber will always be populated :)
Up to you to filter on rows with a filled pagePath only, while still displaying the hitNumber value.
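For what it's worth, the LAG() approach in the question also extends to several previous steps, since LAG() takes an offset argument. A sketch against the same table (untested against your data):
WITH path_and_prev AS (
  SELECT ARRAY(
    SELECT AS STRUCT hits.page.pagePath,
      -- one column per lookback step; LAG(expr, n) reaches n hits back
      LAG(hits.page.pagePath, 1) OVER (ORDER BY i) AS prev1,
      LAG(hits.page.pagePath, 2) OVER (ORDER BY i) AS prev2,
      LAG(hits.page.pagePath, 3) OVER (ORDER BY i) AS prev3,
      LAG(hits.page.pagePath, 4) OVER (ORDER BY i) AS prev4,
      LAG(hits.page.pagePath, 5) OVER (ORDER BY i) AS prev5
    FROM UNNEST(hits) hits WITH OFFSET i
  ) x
  FROM `xxxx.ga_sessions_20160801`
)
SELECT COUNT(*) AS cnt, pagePath, prev1, prev2, prev3, prev4, prev5
FROM path_and_prev, UNNEST(x)
WHERE REGEXP_CONTAINS(pagePath, r'checkout/cart')
GROUP BY 2, 3, 4, 5, 6, 7
ORDER BY cnt DESC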
Tell me if this solution matches your issue, or correct me :)

What is faster: LAG(column) or saving the last value of a cursor in a variable?

Basically what the title says. I have this SQL:
SELECT
  LAG(ID_ESTADO) OVER (ORDER BY ID_EXPEDIENTE, orden) ULTIMO_IDESTADO,
  LAG(ID_EXPEDIENTE) OVER (ORDER BY ID_EXPEDIENTE, orden) ULTIMO_EXPEDIENTE,
  LAG(TABLA) OVER (PARTITION BY ID_EXPEDIENTE ORDER BY orden) ULTIMA_TABLA,
  LAG(TIPO_ESTADO) OVER (PARTITION BY ID_EXPEDIENTE ORDER BY orden) ULTIMO_ESTADO,
  LAG(FECHA_ESTADO) OVER (PARTITION BY ID_EXPEDIENTE ORDER BY orden) ULTIMA_FECHA,
  LAG(ID_EXPEDIENTE, 2) OVER (ORDER BY ID_EXPEDIENTE, orden) INMANT_EXPEDIENTE,
  LAG(TABLA, 2) OVER (PARTITION BY ID_EXPEDIENTE ORDER BY orden) INMANT_TABLA,
  LAG(TIPO_ESTADO, 2) OVER (PARTITION BY ID_EXPEDIENTE ORDER BY orden) INMANT_ESTADO,
  ID_ESTADO,
  ID_EXPEDIENTE,
  TABLA,
  TIPO_ESTADO,
  orden,
  anio,
  fecha_estado,
  NUMERO_EXPEDIENTE,
  ANIO_EXPEDIENTE,
  NATURALEZA_EXPEDIENTE,
  ID_OFICINA,
  DESCRIPCION,
  ID_TIPO_INSTANCIA,
  ID_OBJETO_JUICIO,
  DESC_OBJETO_JUICIO,
  FECHA_INICIO,
  ID_EXPEDIENTE_ORIGEN,
  CARATULA_EXPEDIENTE
FROM EST_ESTADOS_CIVIL e
WHERE TABLA LIKE 'R%' AND ANIO BETWEEN 2012 AND 2016
I want to know which is faster: comparing values in a cursor over this result set (with, let's say, TIPO_ESTADO = ULTIMO_ESTADO), or running the query without the LAG columns and, while iterating the cursor, keeping the previous row's values in variables to compare against.
Bear in mind that this change takes my query from roughly 280,000 rows to 3 million rows. I can keep the 280,000 rows, but I need to include at least one or two LAG columns.
As with all questions about performance, you can test the code to see which is better.
As for my opinion, I would not even consider using a cursor for performance reasons. Set-based operations on the data should be better.
Note: if you do use a cursor, you will need an ORDER BY, which will probably kill any performance gains you might expect to get.
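As a sketch of the set-based route, using the question's table and column names (untested): wrap the LAG() query in a derived table and filter on the comparison directly, instead of walking the rows with a cursor:
SELECT *
FROM (
  SELECT
    ID_EXPEDIENTE,
    TIPO_ESTADO,
    orden,
    -- previous state within the same expediente
    LAG(TIPO_ESTADO) OVER (PARTITION BY ID_EXPEDIENTE ORDER BY orden) AS ULTIMO_ESTADO
  FROM EST_ESTADOS_CIVIL
  WHERE TABLA LIKE 'R%' AND ANIO BETWEEN 2012 AND 2016
) t
WHERE TIPO_ESTADO = ULTIMO_ESTADO
The database can then choose a set-based plan, and only the matching rows ever leave the engine.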

Query for selecting sequence of hits consumes large quantity of data

I'm trying to measure the conversion rate through alternative funnels on a website. My query has been designed to output a count of sessions that viewed the relevant start URL and a count of sessions that hit the confirmation page strictly in that order. It does this by comparing the times of the hits.
My query appears to return accurate figures, but in doing so it selects a massive quantity of data: just under 23 GB for what I've attempted to limit to one hour of one day. I don't seem to have written my query in a particularly efficient way, and I gather that I'll use up all of my company's data quota fairly quickly if I continue to use it.
Here's the offending query in full:
WITH
s1 AS (
  SELECT
    fullVisitorId,
    visitId,
    LOWER(h.page.pagePath) AS path,
    device.deviceCategory AS platform,
    MIN(h.time) AS s1_time
  FROM
    `project.dataset.ga_sessions_*`, UNNEST(hits) AS h
  WHERE
    _TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
    AND (
      LOWER(h.page.pagePath) LIKE '{funnel-start-url-1}%'
      OR LOWER(h.page.pagePath) LIKE '{funnel-start-url-2}%'
    )
    AND totals.visits = 1
    AND h.hour < 21
    AND h.hour >= 20
    AND h.type = "PAGE"
  GROUP BY
    path,
    platform,
    fullVisitorId,
    visitId
  ORDER BY
    fullVisitorId ASC, visitId ASC
),
confirmations AS (
  SELECT
    fullVisitorId,
    visitId,
    MIN(h.time) AS confirmation_time
  FROM
    `project.dataset.ga_sessions_*`, UNNEST(hits) AS h
  WHERE
    _TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
    AND h.type = "PAGE"
    AND (
      LOWER(h.page.pagePath) LIKE '{confirmation-url-1}%'
      OR LOWER(h.page.pagePath) LIKE '{confirmation-url-2}%'
    )
    AND totals.visits = 1
    AND h.hour < 21
    AND h.hour >= 20
  GROUP BY
    fullVisitorId,
    visitId
)
SELECT
  platform,
  path,
  COUNT(path) AS Views,
  SUM(
    CASE
      WHEN s1.s1_time < confirmations.confirmation_time THEN 1
      ELSE 0
    END
  ) AS SubsequentPurchases
FROM
  s1
LEFT JOIN
  confirmations
ON
  s1.fullVisitorId = confirmations.fullVisitorId
  AND s1.visitId = confirmations.visitId
GROUP BY
  platform,
  path
What is it about this query that means it has to process so much data? Is there a better way to get at these numbers? Ideally, any method should be able to measure multiple different routes, but I'd settle for sustainability at this point.
There are probably a few ways you can optimize your query, but it seems like that won't entirely solve your issue (as I'll try to explain further below).
As for the query, this one does the same but avoids re-selecting data and the LEFT JOIN operation:
SELECT
  path,
  platform,
  COUNT(path) views,
  COUNT(CASE WHEN last_hn > first_hn THEN 1 END) SubsequentPurchases
FROM (
  SELECT
    fv,
    v,
    platform,
    path,
    first_hn,
    MAX(last_hn) OVER(PARTITION BY fv, v) last_hn
  FROM (
    SELECT
      fullvisitorid fv,
      visitid v,
      device.devicecategory platform,
      LOWER(hits.page.pagepath) path,
      MIN(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product') THEN hits.hitnumber ELSE NULL END) first_hn,
      MAX(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'success') THEN hits.hitnumber ELSE NULL END) last_hn
    FROM `project_id.data_set.ga_sessions_20170112`,
      UNNEST(hits) hits
    WHERE
      REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product|success')
      AND totals.visits = 1
      AND hits.type = 'PAGE'
    GROUP BY
      fv, v, path, platform
  )
)
GROUP BY
  path, platform
HAVING NOT REGEXP_CONTAINS(path, r'success')
first_hn tracks the funnel-start URLs (for which I used the terms "catalog" and "product") and last_hn tracks the confirmation URLs (for which I used the term "success", but you could add more values to the regex). Also, using the MIN and MAX aggregations together with the analytic function gives you some optimizations in the query.
There are a few points though to make here:
If you insert WHERE hits.hour = 20, BigQuery still has to scan the whole table to work out which rows fall in hour 20 and which do not (it charges for the columns you reference, not the rows that survive your filters). That means the 23 GB you observed still accounts for the whole day.
For comparison, I tested your query against our ga_sessions and it took around 31 days' worth of tables to reach 23 GB of data. As you are not selecting that many fields, it shouldn't be that easy to reach this amount unless you have a considerably high traffic volume in your data source.
Given current BigQuery pricing, 23 GB would cost roughly $0.11 to process, which is quite cost-efficient.
Another thing I could imagine is that you are running this query several times a day with no caching or proper architecture for these operations.
All this being said, you can optimize your query, but I suspect it won't change much in the end, as it seems you have quite a high volume of data. Processing 23 GB a few times shouldn't be a problem, but if you are concerned about reaching your quota, then it sounds like you are running this query several times a day.
This being the case, see if using the query cache, or saving the results into another table and querying that instead, will help. Also, you could start saving daily tables with just the sessions you are interested in (those containing the URL patterns you are looking for) and then run your final query against these newly created tables, which would let you query over a bigger range of days while spending much less. A sketch of that last idea follows.
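A minimal sketch of the daily filtered table, reusing the question's placeholder table names and URL patterns (so the patterns are illustrative, not tested):
-- Keep only sessions with at least one hit on a funnel URL pattern,
-- so later funnel queries scan a much smaller table.
CREATE TABLE `project.dataset.funnel_sessions_20170107` AS
SELECT s.*
FROM `project.dataset.ga_sessions_20170107` s
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(s.hits) h
  WHERE REGEXP_CONTAINS(LOWER(h.page.pagePath), r'{funnel-start-url-1}|{confirmation-url-1}')
);
Later funnel queries can then read `funnel_sessions_*` instead of the full `ga_sessions_*` tables.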

PERCENT_RANK() in BigQuery returns Resources exceeded

When I try to use PERCENT_RANK() over a large dataset, it gives me an error:
SELECT
  a2_lngram,
  a2_decade,
  a2_totalfreq,
  a2_totalbooks,
  a2_freq,
  a2_bfreq,
  a2_arf,
  c_avgarf,
  d_arf,
  oi,
  PERCENT_RANK() OVER (ORDER BY d_arf DESC) plarf
FROM [trigram.trigrams8]
With a destination table and allowLargeResults enabled, it returns:
"Resources exceeded during query execution."
When I limit the results to a few hundred rows, it runs fine.
JobID: otichyproject1:job_PpTpmMXYETUMiM_2scGgc997JVg
The dataset is public.
This is expected: The input for an analytic/window function needs to fit in one node for it to run successfully.
PERCENT_RANK() OVER (ORDER BY d_arf DESC) plarf
will only run if all the rows fit in one node. If they don't, you'll see the "Resources exceeded during query execution" error.
There's a way to scale up with analytic functions: Partition your data.
PERCENT_RANK() OVER (PARTITION BY country ORDER BY d_arf DESC) plarf
... then the function can be run over multiple nodes, as long as the rows for each 'country' fit in one VM.
Not your case though - the fix I would do here is to calculate the total in a separate subquery, join, and divide; a sketch follows.
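Here is a sketch of that idea in standard SQL, assuming the same table and that d_arf values repeat often: run the window over the distinct d_arf values (a far smaller input), then join back. PERCENT_RANK() equals (rank - 1) / (total - 1), and (rank - 1) is just the count of rows strictly above the current one:
WITH counts AS (
  SELECT d_arf, COUNT(*) AS n
  FROM `trigram.trigrams8`
  GROUP BY d_arf
),
ranked AS (
  SELECT
    d_arf,
    -- running total minus the current group = rows strictly greater
    SUM(n) OVER (ORDER BY d_arf DESC) - n AS rows_above,
    SUM(n) OVER () AS total
  FROM counts
)
SELECT t.*, r.rows_above / (r.total - 1) AS plarf
FROM `trigram.trigrams8` t
JOIN ranked r USING (d_arf)
The window now runs over one row per distinct d_arf value, so it fits in a node long after the raw table stops doing so.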
In summary, analytic functions are cool, but they have scalability issues on the size of each partition - luckily there are other ways to get the same results.