Understanding "Resources exceeded during query execution" with GROUP EACH BY in BigQuery - google-bigquery

I'm writing a background job to automatically process A/B test data in BigQuery, and I'm finding that I'm hitting "Resources exceeded during query execution" when doing large GROUP EACH BY statements. I saw from "Resources Exceeded during query execution" that reducing the number of groups can make queries succeed, so I split up my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition about what actually causes this error. In particular:
Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
What's the right way to approximate the memory usage and the total memory I have available? Am I correct in assuming each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each group, or is there another way that I should be thinking about it?
How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying over a smaller dataset?
The problematic query looks like this (in practice, it's used as a subquery, and the outer query aggregates the results):
SELECT
  alternative,
  snapshot_time,
  SUM(column_1),
  ...
  SUM(column_139)
FROM
  my_table
CROSS JOIN
  [table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time
(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4)
I realize this query may be asking for trouble, but in this case, the table is only 22MB and the query results in under a million groups and it's still failing with "resources exceeded". Reducing the number of timestamps to process at once fixes the error, but I'm worried that I'll eventually hit a data scale large enough that this approach as a whole will stop working.
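For reference, one batch of the split looks roughly like this (a sketch; the batch column on the timestamps table and the dataset names are hypothetical, and the remaining SUMs are elided):

SELECT
  alternative,
  snapshot_time,
  SUM(column_1)  -- ...SUM(column_2) through SUM(column_139) elided
FROM
  my_table
CROSS JOIN
  (SELECT snapshot_time FROM [mydataset.timestamps] WHERE batch = 0) timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time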

As you've guessed, BigQuery chooses a number of parallel workers (shards) for GROUP EACH and JOIN EACH queries based on the size of the tables being operated upon. It is a rough heuristic, but in practice, it works pretty well.
What is interesting about your query is that the GROUP EACH is being done over a larger table than the original table because of the expansion in the CROSS JOIN. Because of this, we choose a number of shards that is too small for your query.
To answer your specific questions:
Resources exceeded almost always means that a worker ran out of memory. This could be a shard or a mixer, in Dremel terms (mixers are the nodes in the computation tree that aggregate results; GROUP EACH BY pushes aggregation down to the shards, which are the leaves of the computation tree).
There isn't a good way to approximate the amount of resources available. This changes over time, with the goal that more of your queries should just work.
The number of shards is determined by the total bytes processed in the query. As you've noticed, this heuristic doesn't work well with joins that expand the underlying data sets. That said, there is active work underway to be smarter about how we pick the number of shards. To give you an idea of scale, your query got scheduled on only 20 shards, which is a tiny fraction of what a larger table would get.
As a workaround, you could save the intermediate result of the CROSS JOIN as a table, then run the GROUP EACH BY over that temporary table. That should let BigQuery use the expanded size when picking the number of shards. (If that doesn't work, please let me know; it is possible that we need to tweak our assignment thresholds.)
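A sketch of that two-step workaround (legacy SQL; the table names are illustrative, and step 1 assumes the query job is configured with a destination table and allowLargeResults):

-- Step 1: materialize the expanded CROSS JOIN into a destination table,
-- e.g. mydataset.cross_join_tmp.
SELECT
  alternative,
  user_id,
  timestamps.snapshot_time AS snapshot_time,
  column_1  -- ...column_2 through column_139 elided
FROM my_table
CROSS JOIN [mydataset.timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time

-- Step 2: aggregate over the materialized table; the shard count is now
-- picked from the expanded table's size.
SELECT
  alternative,
  snapshot_time,
  SUM(column_1)  -- ...SUM(column_2) through SUM(column_139) elided
FROM [mydataset.cross_join_tmp]
GROUP EACH BY alternative, user_id, snapshot_time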

Related

In BigQuery SQL, how to track resource usage to avoid "Resources exceeded during query execution" error

In BigQuery, "Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex" is an error we receive when our queries are too big or too complex.
For queries that run successfully, how can we see how close we are to receiving this error? What is the criteria for resources here, and what are our limits? We have a few large queries that may be pushing close to this error, but we don't want to unexpectedly receive the error one day as our source table for this query continues to grow.
It is easy for us to find bytes queried, as the editor makes that quite clear, and after executing a query we can see metrics such as Slot time consumed and Bytes shuffled under Execution Details.
Is one of either Slot time consumed or Bytes shuffled the relevant metric here, with there being some limit after which the "Resources exceeded" error is thrown? Or is there some other criterion that determines when this error is thrown?
For added context, none of our queries processes a particularly large amount of data; our largest query is under 20GB processed. However, our larger queries do have a lot of sub-queries with t1 as (), t2 as (), t3 as (), t4 as (), t5 as (), ..., where the subqueries also reference the previous sub-queries, and I believe this is what is leading to the resources exceeded errors.
Edit: Here are the execution details from our largest query, the one that prompted this posting. I refactored 2 of the subqueries into their own CTEs so that this big query run would be successful. This obviously still seems large...
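In the meantime, we can at least rank our jobs by these metrics with the INFORMATION_SCHEMA jobs view (a sketch; region-us is a placeholder for your project's region):

SELECT
  job_id,
  total_slot_ms,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_slot_ms DESC
LIMIT 10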
We've encountered this error quite a bit. According to this article: https://medium.com/@jamiekt/what-to-do-about-bigquery-error-resources-exceeded-during-query-execution-e80734b8c9b6, BigQuery computes a complexity score and will output an error if it crosses a threshold.
At this time, there is no easy way to know the complexity score of a query and how close to the limit it is.
However, if a complex query works today, that same query should keep working even as the underlying data grows in size. Or rather, it shouldn't start failing with a "Query is too complex" error, since the complexity score would remain unchanged, but it could end up failing with other errors, such as "The query could not be executed in the allotted memory".
To help alleviate "Query is too complex" errors, it is recommended to write intermediate results to temp tables (https://cloud.google.com/bigquery/docs/writing-results), avoid UNIONs when possible, and reduce the number of CTEs and nested views; a sketch of the temp-table approach follows.
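A sketch of the temp-table approach inside a BigQuery script (all table and CTE names here are illustrative):

-- Materialize a heavy CTE once as a session-scoped temp table...
CREATE TEMP TABLE t1 AS
SELECT user_id, COUNT(*) AS events
FROM `project.dataset.source_table`
GROUP BY user_id;

-- ...then reference it from the rest of the query instead of repeating
-- or nesting the CTE, which lowers the complexity of any single statement.
SELECT t1.user_id, t1.events, s.plan
FROM t1
JOIN `project.dataset.subscriptions` AS s
  ON s.user_id = t1.user_id;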

Query exhausted resources at this scale factor

I was running a SQL query on Amazon Athena, and I got the following error a couple of times:
Query exhausted resources at this scale factor
This query ran against the "test1" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: *************
Without seeing the query it's hard to say for sure what the problem is, but it's very likely that it is due to an internal issue in Athena that has to do with sorting of large intermediary result sets.
The version of Presto that Athena uses does not have support for sorting datasets that are too big to fit in memory. It used to be the same for aggregations too, but that has been fixed by the Athena team.
The issue most often occurs when you have very wide tables, i.e. many columns, or columns with a lot of data. Each individual row can represent a big chunk of memory, and if a node runs out of memory while trying to sort its chunk the query aborts with the "query exhausted resources at this scale factor" error.
If this matches your situation, the only ways around it are, unfortunately, to limit the number of columns or to eliminate the sorting. Sometimes you can rearrange the query to do the sorting at a different stage, lowering the memory pressure on the sorting stage; see the sketch below.
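One such rearrangement is to sort a narrow projection first and fetch the wide columns only for the surviving rows (a sketch; the table and column names are illustrative):

-- Sort only the key and a row identifier, so each sorted row is small...
WITH top_rows AS (
  SELECT id, created_at
  FROM wide_table
  ORDER BY created_at DESC
  LIMIT 100
)
-- ...then join back to pull the wide columns for just those 100 rows.
SELECT w.*
FROM wide_table w
JOIN top_rows t ON w.id = t.id
ORDER BY w.created_at DESC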
Review these tips and try to refine your query.
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
This error means that aggregated results exceeded allocated resources; I believe the resource in question is memory.

Bigquery Tier 1 exceeded for partitioned table but not for by day tables

We have two tables in BigQuery: one is large (a couple of billion rows) and on a 'table-per-day' basis; the other is date-partitioned, has the exact same schema, but holds only a small subset of the rows (~100 million).
I wanted to run a (standard SQL) query with a subselect in the form of a join (the same happens when the subselect is in the WHERE clause) on the small partitioned dataset. I was not able to run it because tier 1 was exceeded.
If I run the same query on the big dataset (that contains the data I need and a lot of other data) it runs fine.
I do not understand the reason for this.
Is it because:
Partitioned tables need more resources to query
Bigquery has some internal rules that the ratio of data processed to resources needed must meet a certain threshold, i.e. I was not paying enough when I queried the small dataset given the amount of resources I needed.
If 1. is true, we could simply make the small dataset also be on a 'table-per-day' basis in order to solve the issue. But before we do that though we would like to know if it is really going to solve our problem.
Details on the queries:
Big dataset
queries 11 GB, runs 50 secs, Job Id remilon-study:bquijob_2adced71_15bf9a59b0a
Small dataset
Job Id remilon-study:bquijob_5b09f646_15bf9acd941
I'm an engineer on BigQuery and I just took a look at your jobs. It looks like your second query has an additional filter with a nested clause that your first query does not have. It is likely that this extra processing is making your query exceed your tier. I would recommend running the queries in the BigQuery UI and looking at the Explanation tab to see how the queries differ in the query plan.
If you try running the exact same query (modifying only the partition syntax) for both tables and still get the same error I would recommend filing a bug.
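For reference, the "exact same query, modulo partition syntax" pair might look like this in standard SQL (the table names and dates are illustrative):

-- Date-partitioned table: filter on the _PARTITIONTIME pseudo-column.
SELECT COUNT(*)
FROM `project.dataset.small_partitioned`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-05-01') AND TIMESTAMP('2017-05-07')

-- Table-per-day shards: filter on the table suffix instead.
SELECT COUNT(*)
FROM `project.dataset.big_sharded_*`
WHERE _TABLE_SUFFIX BETWEEN '20170501' AND '20170507'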

BigQuery execution time and scaling

I created a test dataset of roughly 450GB in BigQuery, and I am getting execution times of ~9 seconds when querying the largest table (10bn rows) from the WebUI. I just wanted to check whether this is a 'normal' expected result, and whether it would get worse with larger sizes (i.e. 100bn+ rows) or more complex queries. I am aware of table partitioning/etc., but I just want to get a sense of what 'normal' expected speed is before getting into optimization, since the above seems like a 'smallish' size for what BQ is meant for.
The above result is achieved on a simple query like this:
select ColumnA from DataSet.Table order by ColumnB desc limit 100
So the result returned to the client is very small. ColumnA is structured as UUIDs represented in String format and ColumnB is integer.
It's almost impossible to say if this is "normal" or not. BigQuery is a multitenant architecture/infrastructure. That means we all share the same resources (i.e. compute power) in the cluster when running queries. Therefore, query times are never deterministic in BigQuery, i.e. they can vary depending on the number of concurrent queries executing from users at any given time. That said, you can get reserved slots for a flat-rate price, although you'd need to be spending quite a lot of money to justify that.
You can improve execution times by removing compute/shuffle/memory-intensive steps like ORDER BY etc. Obviously, the complexity of the query will also have an impact on the query times.
On some of our projects we can smash through 3TB-5TB with a relatively complex query in about 15s-20s. Sometimes it's quicker, sometimes it's slower. We also run queries over much smaller datasets that can take the same amount of time. This is because of what I wrote at the beginning: BigQuery query times are not deterministic.
Finally, BigQuery will cache results, so if you issue the same query multiple times over the same dataset it will be returned from the cache, i.e. much quicker!

What causes "resources exceeded" in BigQuery?

My query failed with the error "resources exceeded". What causes this error, and how can I fix it?
Update (2016-03-16): For most queries, EACH is no longer required, and may actually increase the likelihood of seeing this error. If you omit the EACH keyword from every JOIN and GROUP BY in your query, the query engine will now dynamically optimize your query to eliminate this error.
There are still corner cases where specifying the EACH keyword can make a query run (or run faster), but generally speaking the BigQuery team recommends that you try your query without EACH first. Pretty soon, the EACH keyword will become a complete no-op.
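For example, a legacy SQL aggregation that previously spelled out EACH can simply drop the keyword (a sketch with illustrative names):

-- Old style: explicitly request the shuffled execution strategy.
SELECT user_id, COUNT(*) FROM [mydataset.events] GROUP EACH BY user_id
-- New style: let the query engine choose the strategy dynamically.
SELECT user_id, COUNT(*) FROM [mydataset.events] GROUP BY user_id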
Original answer: When you use the EACH keyword in JOIN EACH or GROUP EACH BY, or when you use a PARTITION BY clause, BigQuery partitions ("shuffles") your data on the fly according to the join keys or group keys, which allows each worker task to perform its portion of the join or aggregation locally.
The resources exceeded error occurs when one such worker gets too much data and runs over its limit. Generally speaking, the reasons for this error fall into two categories:
Skew: The data is heavily skewed toward one key value (say, a "guest" user ID or a null key), which means that one worker gets all the records for that key and gets overloaded.
Mismatch in data size and worker count: You have too much data for the number of workers that BigQuery assigned your query.
We are working on a number of improvements to help us cope with both scenarios so that you don't need to worry about these issues. For now, though, you can work around the problem with one of the following approaches:
Filter out skewed keys. If your data is skewed because half of your join key values are actually null, you could filter those out by adding WHERE key IS NOT NULL prior to the join.
Reduce the amount of data processed. Filter each side of the join with WHERE ABS(HASH(key)) % 5 == 0 to apply the join to only 1/5 of the data (or whatever fraction you want), and then do the same for == 1, == 2, == 3, == 4 in separate queries. You're manually sharding the data into smaller chunks to make the query go through, but note that you pay 5x as much because you queried the same data 5 times. (See the sketch after this list.)
Revisit your query. Maybe you can build your query in a completely different way, or compute some intermediate results, to get the answer you want.
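A sketch of the manual sharding workaround, combined with the null-key filter (legacy SQL; table and key names are illustrative, and this is just the first of the five queries you would run):

SELECT a.key, SUM(b.value)
FROM (
  SELECT key
  FROM [mydataset.left_table]
  WHERE key IS NOT NULL AND ABS(HASH(key)) % 5 = 0
) a
JOIN EACH (
  SELECT key, value
  FROM [mydataset.right_table]
  WHERE key IS NOT NULL AND ABS(HASH(key)) % 5 = 0
) b ON a.key = b.key
GROUP EACH BY a.key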
I also faced the error
Error: Resources exceeded during query execution
due to using an ORDER BY. More information about that is given by Pentium10:
Using order by on big data databases is not an ordinary operation and at some point it exceeds the attributes of big data resources. You should consider sharding your query or running the order by in your exported data.
As I explained to you today in your other question, adding allowLargeResults will allow you to return a large response, but you can't specify a top-level ORDER BY, TOP or LIMIT clause. Doing so negates the benefit of using allowLargeResults, because the query output can no longer be computed in parallel.
To solve it I've gone through 9 steps