Issues with analytic functions in BigQuery - google-bigquery

Since June 2nd we have been having issues with analytic functions. When the query (not the partitions) passes a certain size, it fails with the following error:
Resources exceeded during query execution: The query could not be
executed in the allotted memory. Peak usage: 125% of limit. Top memory
consumer(s): analytic OVER() clauses: 97% other/unattributed: 3% . at
[....]
Has anyone encountered the same problem?

BigQuery chooses the number of parallel workers for OVER() clauses based on the size of the tables being queried. A "resources exceeded" error appears when too much data has to be processed by the workers BigQuery assigned to your query.
I assume this issue comes from the OVER() clause and the amount of data involved. You'll need to tune your query a bit (especially the OVER() clauses), as the error message suggests.
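For example (a minimal sketch, not your actual query - `user_id`, `event_time`, and the table name are hypothetical), adding a PARTITION BY so each window only spans one partition's rows, rather than the whole table, keeps the per-worker memory bounded by the largest partition:
#standardSQL
SELECT
  *,
  -- Partitioning the window bounds each worker's state by the largest
  -- partition instead of the whole result set.
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS rn
FROM `project.dataset.events`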
To read more about the error, take a look at the official documentation.
What could also help are slots - the unit of computational capacity required to execute SQL queries:
When you enroll in a flat-rate pricing plan, you purchase a dedicated
number of slots to use for query processing. You can specify the
allocation of slots by location for all projects attached to that
billing account. Contact your sales representative if you are
interested in flat-rate pricing.
I hope you find this information useful.

We were able to overcome this limitation by splitting the original data into several shards and applying the analytic function to each shard.
In essence (for 8 shards):
WITH
t AS (
  SELECT
    RAND() AS __aux,
    *
  FROM <original table>
)
SELECT
  * EXCEPT (__aux),
  F() OVER (...) AS ...
FROM t
WHERE MOD(CAST(__aux * POW(10, 9) AS INT64), 8) = 0
UNION ALL
....
SELECT
  * EXCEPT (__aux),
  F() OVER (...) AS ...
FROM t
WHERE MOD(CAST(__aux * POW(10, 9) AS INT64), 8) = 7
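A variation on the same idea (an untested sketch; `id_column` stands in for any stable column in the original table) is to derive the shard from a hash instead of RAND(), so the split is deterministic and reproducible across runs:
WITH
t AS (
  SELECT
    -- Deterministic shard number in [0, 7], derived from a stable key
    -- instead of RAND(), so reruns produce the same shards.
    MOD(ABS(FARM_FINGERPRINT(CAST(id_column AS STRING))), 8) AS __shard,
    *
  FROM <original table>
)
SELECT
  * EXCEPT (__shard),
  F() OVER (...) AS ...
FROM t
WHERE __shard = 0
UNION ALL
....
SELECT
  * EXCEPT (__shard),
  F() OVER (...) AS ...
FROM t
WHERE __shard = 7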

Related

int64 overflow in sampling n number of rows (not %)

The script below randomly samples an approximate number of rows (50k).
SELECT *
FROM table
qualify rand() <= 50000 / count(*) over()
This has worked a handful of times before, hence, I was shocked to find this error this morning:
int64 overflow: 8475548256593033885 + 6301395400903259047
I have read this post. But as I am not summing, I don't think it is applicable.
The table in question has 267,606,559 rows.
Looking forward to any ideas. Thank you.
I believe a count is actually computed as a sum under the hood, given the way BQ (and other databases) compute counts. You can see this by viewing the Execution Details/Graph (in the BQ UI). This is true even for a simple select count(*) from table query.
For your problem, consider something simpler like:
select *, rand() as my_rand
from table
order by my_rand
limit 50000
Also, if you know the rough size of your data or don't need exactly 50K, consider using the tablesample method:
select * from table
tablesample system (10 percent)
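Another way to sidestep the overflow entirely (a sketch, assuming a second scan of the table is acceptable) is to compute the row count once in a scalar subquery instead of with COUNT(*) OVER(), so no analytic function is involved:
-- Approximate 50k-row sample without an OVER() clause
SELECT *
FROM table
WHERE RAND() <= 50000 / (SELECT COUNT(*) FROM table)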

SQL Server query plan sort operator cost spikes when column count increased

I use a ROW_NUMBER() function inside a CTE that causes a SORT operator in the query plan. This SORT operator has always been the most expensive element of the query, but has recently spiked in cost after I increased the number of columns read from the CTE/query.
What confuses me is that the increase in cost is not proportional to the column count. Normally I can increase the column count without much issue. However, it seems my query has passed some threshold and now costs so much that it has doubled the query execution time from 1 hour to 2+ hours.
I can't figure out what has caused the spike in cost, and it's having an impact on business. Can you advise any ideas or next steps for troubleshooting?
Here is the query (simplified):
WITH versioned_events AS (
SELECT [event].*
,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
ELSE [event].[involvement_id]
END AS [derived_involvement_id]
,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version] ORDER BY [event_created_date] DESC, [timestamp] DESC ) AS [latest_version]
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] as [inv]
ON [event].[service_delivery_id] = [inv].[service_delivery_id]
AND [inv].[role_type_code] = 't'
AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL AND (event.handle_space IS NULL
OR (event.handle_space NOT LIKE 'x%'
AND event.handle_space NOT LIKE 'y%'))
)
INSERT INTO db.schema.table (
....
)
SELECT
....
FROM versioned_events AS [event]
INNER JOIN (
SELECT DISTINCT service_delivery_id, derived_involvement_id
FROM versioned_events
WHERE latest_version = 1
AND ([versioned_events].[timestamp] > '2022-02-07 14:18:09.777610 +00:00')
) AS [delta_events]
ON COALESCE([event].[service_delivery_id],'NULL') = COALESCE([delta_events].[service_delivery_id],'NULL')
AND COALESCE([event].[derived_involvement_id],'NULL') = COALESCE([delta_events].[derived_involvement_id],'NULL')
WHERE [event].[latest_version] = 1
Here is the query plan from the version with the most columns, which experiences the cost spike (the plans for the other versions look the same, except this operator takes much less time, 40-50 mins).
I did a comparison of three executions, each with a different column count in the INSERT INTO ... SELECT ... FROM clause. I can't share the spreadsheet, but I will try to convey my findings so far. The following is true of the query with the most columns:
It takes more than twice as long to execute as the other two executions
It performs more logical & physical reads and scans
It has more CPU time
It reads the most from tempdb
The increase in execution time is not proportional to the increase in reads or the other metrics mentioned
It is true that a level 8 memory spill is happening. I have tried updating statistics, but it didn't help, and since all versions of the query suffer the same problem, the comparison is still like-for-like.
I know it can be hard to help with this kind of problem without being able to poke around, but I would be grateful if anyone could point me in the direction of what to check/try next.
P.S. The table it reads from is a heap and the table it joins to is indexed. The heap table needs to stay a heap, otherwise inserts into it will take too long and the problem is just kicked down the road.
Also, when I say I added more columns, I mean in the SELECT ... FROM versioned_events statement. The columns are replaced with "...." in the above example.
UPDATE
Using a temp table halved the execution time at the high column count that caused the issue, but it actually takes longer with a reduced column count. This goes back to the idea that a threshold is crossed when the column count is increased :(. In any event, we've switched to a temp table for now to see if it helps in production.
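For reference, a rough sketch of one way to structure the temp-table variant (the column lists stay abbreviated with "...." as in the simplified query above; the [involvement] join, handle_space filters, and NULL-safe COALESCE join keys are elided for brevity):
-- Step 1: materialize the versioned events once instead of expanding the CTE twice.
SELECT [event].*
      ,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version]
                          ORDER BY [event_created_date] DESC, [timestamp] DESC) AS [latest_version]
INTO #versioned_events
FROM [database].[schema].[event_table] [event]
WHERE [event].[deletion_type] IS NULL;

-- Step 2: both references now read the materialized temp table.
INSERT INTO db.schema.table ( .... )
SELECT ....
FROM #versioned_events AS [event]
INNER JOIN (
    SELECT DISTINCT [service_delivery_id]
    FROM #versioned_events
    WHERE [latest_version] = 1
      AND [timestamp] > '2022-02-07 14:18:09.777610 +00:00'
) AS [delta_events]
    ON [event].[service_delivery_id] = [delta_events].[service_delivery_id]
WHERE [event].[latest_version] = 1;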

Get the first row of a nested field in BigQuery

I have been struggling with a question that seems simple, yet eludes me.
I am dealing with the public BigQuery table on bitcoin, and I would like to extract the first transaction of each block that was mined. In other words, I want to replace a nested field by its first row, as it appears in the table preview. There is no field that identifies it, only the order in which it was stored in the table.
I ran the following query:
#StandardSQL
SELECT timestamp,
block_id,
FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
But it processes 492 GB when run and throws the following error:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..
It seems so simple, I must be missing something. Do you have an idea about how to handle such a task?
#standardSQL
SELECT * EXCEPT(transactions),
(SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
Recommendation: while playing with a large table like this one, I would recommend creating a smaller version of it so it incurs less cost for your dev/test. The query below can help with this - you can run it in the BigQuery UI with a destination table, which you will then use for your dev work. Make sure you set Allow Large Results and unset Flatten Results so you preserve the original schema.
#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks#1529518619028]
The value of 1529518619028 is taken from the query below (at the time of running) - the reason I went four days back is that I know the number of rows in this table at that time was just 912 vs the current 528,858:
#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000)
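A possible standard SQL alternative for grabbing a point-in-time copy (hedged: it only reaches back within BigQuery's time-travel window, currently 7 days, and assumes time travel is available for this table) is FOR SYSTEM_TIME AS OF:
#standardSQL
SELECT *
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 4 DAY)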
An alternative approach to Mikhail's: Just ask for the first row of an array with [OFFSET(0)]:
#StandardSQL
SELECT timestamp,
block_id,
transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10
That first row from the array still has some nested data that you might want to flatten to their first rows too:
#standardSQL
SELECT timestamp
, block_id
, transactions[OFFSET(0)].transaction_id first_transaction_id
, transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
, transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000
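If some blocks could have empty transactions, inputs, or outputs arrays, OFFSET(0) raises an error at runtime; SAFE_OFFSET(0) returns NULL instead (a defensive variant of the query above):
#standardSQL
SELECT timestamp
  , block_id
  , transactions[SAFE_OFFSET(0)].transaction_id first_transaction_id
  , transactions[SAFE_OFFSET(0)].inputs[SAFE_OFFSET(0)] first_transaction_first_input
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000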

BigQuery join too slow for a table of small size

I have a table with the following details:
- Table Size 39.6 MB
- Number of Rows 691,562
- 2 columns: contact_guid STRING, program_completed STRING
- column 1 is a uuid-like string, around 30 characters long
- column 2 is a string, around 50 characters long
I am trying this query:
#standardSQL
SELECT
cp1.contact_guid AS p1,
cp2.contact_guid AS p2,
COUNT(*) AS cnt
FROM
`data.contact_pairs_program_together` cp1
JOIN
`data.contact_pairs_program_together` cp2
ON
cp1.program_completed=cp2.program_completed
WHERE
cp1.contact_guid < cp2.contact_guid
GROUP BY
cp1.contact_guid,
cp2.contact_guid
HAVING cnt > 1
ORDER BY cnt DESC
Time taken to execute: 1200 secs
I know I am doing a self join and it is mentioned in best practices to avoid self join.
My Questions:
I feel this table size in MB is too small for BigQuery, so why is it taking so much time? And what does a "small table" mean for BigQuery in the context of a join, in terms of number of rows and size in bytes?
Is the number of rows too large? 700k^2 is roughly 5 * 10^11 candidate row pairs during the join. What would be a realistic number of rows for joins?
I did check the documentation regarding joins, but did not find much regarding how big a table can be for joins and how much time can be expected for it to run. How do we estimate rough execution time?
Execution Details:
As shown on the screenshot you provided - you are dealing with an exploding join.
In this case step 3 takes 1.3 million rows and manages to produce 459 million rows. Steps 04 to 0B deal with repartitioning and re-shuffling all that extra data, as the query wasn't provisioned with enough resources to deal with this number of rows: it scaled up from 1 parallel input to 10,000!
You have 2 choices here: either avoid exploding joins, or accept that exploding joins will take a long time to run. But as explained in the question - you already knew that!
How about generating all the extra rows in one operation (do the join, materialize the result) and then running another query to process the 459 million rows? The first query will be slow for the reasons explained, but the second one will run quickly, as BigQuery will provision enough resources to deal with that amount of data.
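A sketch of that two-step approach (hedged: `data.pairs_joined` is a hypothetical destination table, and you could equally set a destination table in the UI instead of using DDL):
#standardSQL
-- Step 1: materialize the exploding join once (slow, but done only once).
CREATE TABLE `data.pairs_joined` AS
SELECT
  cp1.contact_guid AS p1,
  cp2.contact_guid AS p2
FROM `data.contact_pairs_program_together` cp1
JOIN `data.contact_pairs_program_together` cp2
  ON cp1.program_completed = cp2.program_completed
WHERE cp1.contact_guid < cp2.contact_guid;

#standardSQL
-- Step 2: aggregate the materialized rows in a separate query, which gets
-- provisioned for the (now known) 459 million input rows.
SELECT p1, p2, COUNT(*) AS cnt
FROM `data.pairs_joined`
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC;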
I agree with the suggestions below:
see if you can rephrase your query using analytic functions (by Tim)
Using analytic functions would be a much better idea (by Elliott)
Below is how I would write it:
#standardSQL
SELECT
p1, p2, COUNT(1) AS cnt
FROM (
SELECT
contact_guid AS p1,
ARRAY_AGG(contact_guid) OVER(my_win) guids
FROM `data.contact_pairs_program_together`
WINDOW my_win AS (
PARTITION BY program_completed
ORDER BY contact_guid DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
), UNNEST(guids) p2
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC
Please try it and let us know if it helped.

BigQuery COUNT(DISTINCT value) vs COUNT(value)

I found a glitch/bug in BigQuery.
We have a table based on Bank Statistic data under starschema.net:clouddb:bank.Banks_token.
If I run the following query:
SELECT count(*) as totalrow,
count(DISTINCT BankId ) as bankidcnt
FROM bank.Banks_token;
I get the following result:
Row totalrow bankidcnt
1 9513 9903
My problem is: if the table has 9,513 rows, how could I get 9,903 distinct values, which is 390 more than the row count of the table?
In BigQuery, COUNT DISTINCT is a statistical approximation for all results greater than 1000.
You can provide an optional second argument to give the threshold at which approximations are used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual amount of rows is less than 10000). Note, however, that using a larger threshold can be costly in terms of performance.
See the complete documentation here:
https://developers.google.com/bigquery/docs/query-reference#aggfunctions
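Applied to the query above (legacy SQL; the second argument is the threshold below which the count is exact):
SELECT count(*) as totalrow,
count(DISTINCT BankId, 10000) as bankidcnt
FROM bank.Banks_token;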
UPDATE 2017:
With BigQuery #standardSQL, COUNT(DISTINCT) is always exact. For approximate results, use APPROX_COUNT_DISTINCT(). Why would anyone use approximate results? See this article.
I've used EXACT_COUNT_DISTINCT() as a way to get the exact unique count. It's cleaner and more general than COUNT(DISTINCT value, n) with n greater than the number of rows.
Found here: https://cloud.google.com/bigquery/query-reference#aggfunctions
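For completeness, the #standardSQL equivalents (a sketch against the same table, assuming it is reachable from your project via standard SQL):
#standardSQL
SELECT
  COUNT(*) AS totalrow,
  COUNT(DISTINCT BankId) AS bankidcnt,              -- always exact in standard SQL
  APPROX_COUNT_DISTINCT(BankId) AS bankidcnt_approx -- cheaper, approximate
FROM `bank.Banks_token`;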