I'm setting up a crude data warehouse for my company and I've successfully pulled contact, company, deal and association data from our CRM into bigquery, but when I join these together into a master table for analysis via our BI platform, I continually get the error:
Query exceeded resource limits. This query used 22602 CPU seconds but would charge only 40M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 22602 CPU seconds were used, and this query must use less than 10200 CPU seconds.
As such, I'm looking to optimise my query. I've already removed all GROUP BY and ORDER BY commands, and have tried using WHERE commands to do additional filtering but this seems illogical to me as it would add processing demands.
My current query is:
SELECT
coy.company_id,
cont.contact_id,
deals.deal_id,
{another 52 fields}
FROM `{contacts}` AS cont
LEFT JOIN `{assoc-contact}` AS ac
ON cont.contact_id = ac.to_id
LEFT JOIN `{companies}` AS coy
ON CAST(ac.from_id AS int64) = coy.company_id
LEFT JOIN `{assoc-deal}` AS ad
ON coy.company_id = CAST(ad.from_id AS int64)
LEFT JOIN `{deals}` AS deals
ON ad.to_id = deals.deal_id;
FYI {assoc-contact} and {assoc-deal} are both separate views I created from the associations table for easier associations of those tables to the companies table.
It should also be noted that this query has occasionally run successfully, so I know it does work, it just fails about 90% of the time due to the query being so big.
TLDR;
Check your join keys. 99% of the time the cause of the problem is a combinatoric explosion.
I can't know for sure since I don't have access to the data of the underlying table, but I will give a general resolution method which in my experience worked every time to find the root cause.
Long Answer
Investigation method
Say you are joining two tables
SELECT
cols
FROM L
JOIN R ON L.c1 = R.c1 AND L.c2 = R.c2
and you run into this error. The first thing you should do is check for duplicates in both tables.
SELECT
c1, c2, COUNT(1) as nb
FROM L
GROUP BY c1, c2
ORDER by nb DESC
And the same thing for each table involved in a join.
I bet that you will find that your join keys is duplicated. BigQuery is very scalable, so in my experience this error happens when you have a join key that repeats more than 100 000 times on both tables. It means that after your join, you will have 100000^2 = 10 billion rows !!!
Why BigQuery gives this error
In my experience, this error message means that your query does too many computation compared to the size of your inputs.
No wonder you're getting this if you end up with 10 billion rows after joining tables with a few million rows each.
BigQuery's on-demand pricing model is based on the amount of data read in your tables. This means that people could try to abuse this by, say running CPU-intensive computations while reading small datasets. To give an extreme example, imagine someone makes a Javascript UDF to mine bitcoin and runs it on BigQuery
SELECT MINE_BITCOIN_UDF()
The query will be billed $0 because it doesn't read anything, but will consume hours of Google's CPU. Of course they had to do something about this.
So this ratio exists to make sure that users don't do anything sketchy by using hours of CPUs while processing a few Mb of inputs.
Other MPP platforms with a different pricing model (e.g. Azure Synapse who charges based on the amount of bytes processed, not read like BQ) would perhaps have run without complaining, and then billed you 10Tb for reading that 40Mb table.
P.S.: Sorry for the late and long answer, it's probably too late for the person who asked, but hopefully it will help whoever runs into that error.
Related
Need help in optimizing the below query. Please suggest here
Db : Redshift
Sort Key:
order Table : install_ts
order_item: Install_ts
Suborder. : install_ts
suborder_item: install_ts
Dist Key: Not Added
query: Extracting selected columns from below table (not all)
select *, rank() OVER (PARTITION BY oo.id,
oi.id
ORDER BY i.suborder_id DESC) AS suborder_rank,
FROM order_item oi
LEFT JOIN order oo ON oo.id=oi.order_id
LEFT JOIN sub_order_item i ON i.order_item_id = oi.id
LEFT JOIN sub_order s ON i.suborder_id = s.id
WHERE
(
(oo.update_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59' AND oo.create_ts>=current_date-500)
OR
oo.create_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59'
OR
oi.create_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59'
OR
(oi.update_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59' AND oi.create_ts>=current_date-500)
)
Without knowing your data in more detail and knowing the full query load you want to optimize it will difficult to get things exactly correct. However I see a number of things in this that are POTENTIAL areas for significant performance issues. I do this work for many clients and while the optimization methods in Redshift differ from other databases there are a number of things you can do.
First off you need to remember that Redshift is a networked cluster of computers acting as a coherent database. This allows it to scale to very large database sizes and achieve horizontally scalable compute performance. It also means that there are network cables in play in the middle of your query.
This leads to the issue that I see most often at issue - massive amounts of data being redistributed across these network cable during the query. In your case you have not assigned a distribution key which will lead to random distribution of your data. But when you perform those joins based on "id" all the rows with the same id need to be moved to one slide (compute element) of the Redshift cluster. Basically all your data is being transferred around your cluster and this takes time. Assigning a common distribution key that is also a common join on (or group by) column, such as id, will greatly reduce this network traffic.
The next most common issue is scanning (or writing) too much data from (or to) disk. The disks on Redshift are fast but the data is often massive and the data is still moving through cables (faster than network but still takes time). Redshift helps this by having metadata stored with your data on disk and when it can it will not read unneeded data based on your Where clauses. In your case you have where clauses that are reducing the needed rows (I hope) which is good. But you are using a column that is not the sort key. If these columns correlates well with each other this may not be a problem but if they do not you could be reading more data than you need. You will want to look at the metadata for these table so see how well your sort keys are working for you. Also, Redshift only reads the columns you reference in your query so if you don't ask for a columns its data is not read from disk. Having "*" in your select is reading all the columns - if you don't need all the columns think about listing only the needed columns.
If your query is working on data that doesn't fit in memory during execution the extra data will need to spill to disk. This creates writes to disk of the swapped data and then reads of it back in. This can happen multiple times during the query. I don't know if this is happening in your case but you will want to check if you are spilling during your query.
The thing that impacts everything else is issues in the query and these take a number of forms. If your joins are not properly qualified then the amount of data can explode making the two issues above much worse. Window functions can be expensive especially when a partition different than the distribution key is used but also when the amount of data being operated on is larger than is needed. There are many other but these are two possible areas that look like they may be in your query. I just cannot tell by the info provided.
I'd start by looking at the explain plan and the actual execution information from system tables. Big numbers for rows or cost or time will lead you to the likely areas that need optimization. Like I said I do this work for customer all the time and can help here as time allows but much more information will be needed to understand what is really impacting your query.
We have two tables in bigquery: one is large (a couple billion rows) and on a 'table-per-day' basis, the other is date partitioned, has the exact same schema but a small subset of the rows (~100 million rows).
I wanted to run a (standard-sql) query with a subselect in form of a join (same when subselect is in the where clause) on the small partitioned dataset. I was not able to run it because tier 1 was exceeded.
If I run the same query on the big dataset (that contains the data I need and a lot of other data) it runs fine.
I do not understand the reason for this.
Is it because:
Partitioned tables need more resources to query
Bigquery has some internal rules that the ratio of data processed to resources needed must meet a certain threshold, i.e. I was not paying enough when I queried the small dataset given the amount of resources I needed.
If 1. is true, we could simply make the small dataset also be on a 'table-per-day' basis in order to solve the issue. But before we do that though we would like to know if it is really going to solve our problem.
Details on the queries:
Big datset
queries 11 GB, runs 50 secs, Job Id remilon-study:bquijob_2adced71_15bf9a59b0a
Small dataset
Job Id remilon-study:bquijob_5b09f646_15bf9acd941
I'm an engineer on BigQuery and I just took a look at your jobs but it looks like your second query has an additional filter with a nested clause that your first query does not. It is likely that that extra processing is making your query exceed your tier. I would recommend running the queries in the BigQuery UI and looking at the Explanation tab to see how the queries differ in the query plan.
If you try running the exact same query (modifying only the partition syntax) for both tables and still get the same error I would recommend filing a bug.
My query failed with the error "resources exceeded". What causes this error, and how can I fix it?
Update (2016-03-16): For most queries, EACH is no longer required, and may actually increase the likelihood of seeing this error. If you omit the EACH keyword from every JOIN and GROUP BY in your query, the query engine will now dynamically optimize your query to eliminate this error.
There are still corner cases where specifying the EACH keyword can make a query run (or run faster), but generally speaking the BigQuery team recommends that you try your query without EACH first. Pretty soon, the EACH keyword will become a complete no-op.
Original answer: When you use the EACH keyword in JOIN EACH or GROUP EACH BY, or when you use a PARTITION BY clause, BigQuery partitions ("shuffles") your data on the fly according to the join keys or group keys, which allows each worker task to perform its portion of the join or aggregation locally.
The resources exceeded error occurs when one such worker gets too much data, and run over its limit. Generally speaking, the reasons for this error fall into two categories:
Skew: The data is heavily skewed toward one key value (say, a "guest" user ID or a null key), which means that one worker gets all the records for that key and gets overloaded.
Mismatch in data size and worker count: You have too much data for the number of workers that BigQuery assigned your query.
We are working on a number of improvements to help us cope with both scenarios so that you don't need to worry about these issues. For now, though, you can work around the problem with one of the following approaches:
Filter out skewed keys. If your data is skewed because half of your join key values are actually null, you could filter those out by adding WHERE key IS NOT NULL prior to the join.
Reduce the amount of data processed. Filter each side of the join with WHERE ABS(HASH(key)) % 5 == 0 to apply the join to only 1/5 of the data (or whatever fraction you want), and then do the same for == 1, == 2, == 3, == 4 in separate queries. You're manually sharding the data in smaller chunks to make the query go through--but note that you pay 5x as much because you queried the same data 5 times.
Revisit your query. Maybe you can build your query in a completely different way, or compute some intermediate results, to get the answer you want.
Also faced the error
Error: Resources exceeded during query execution
due to using an ORDER BY. More information about that is given by Pentium10
Using order by on big data databases is not an ordinary operation and
at some point it exceeds the attributes of big data resources. You
should consider sharding your query or run the order by in your
exported data.
As I explained to you today in your other question, adding
allowLargeResults will allow you to return large response, but you
can't specify a top-level ORDER BY, TOP or LIMIT clause. Doing so
negates the benefit of using allowLargeResults, because the query
output can no longer be computed in parallel.
To solve it I've gone through 9 steps
select DATE(request_time) from logs.nobids_05 limit 1
gave me "3.48 GB processed" which a bit much considering that request_time is a field that appears in each row.
There are many other cases where just touching column automatically adds its total size to the cost. For example,
select * from logs.nobids_05 limit 1
gives me "This query will process 274 GB when run".
I am sure bigquery does not need to read 274GB for outputting 1 row of data.
2019 update: IF you cluster your tables, the cost of a SELECT * LIMIT 1 will be minimal.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
Running a "SELECT * FROM big_table LIMIT 1" with BigQuery would be the equivalent of doing this: https://www.youtube.com/watch?v=KZ-slvv_ZT4.
BigQuery is an analytical database. It's architecture and pricing are optimized for analysis at scale, not for single row handling.
Every operation in BigQuery involves a full table scan, but only of the columns mentioned in the query. The goal is to have predictable costs: Before running the query you are able to know how much data will be involved, therefore its cost. It might seem a big price to query just one row, but the good news is the cost remains constant, even when the queries get way more complex and CPU intensive.
Once in a while you might need to run a single row query, and the costs might seem excessive, but the assumption here is that you are using this tool to analyze data at scale, and the overall costs of having data stored in it should be more than competitive with other tools available. Since you've been working with other tools, I'd love to see a total cost comparison of analytical sessions within real case scenarios.
By the way, BigQuery has a better way for doing the equivalent of "SELECT * LIMIT x". It's free, and it relies on the REST API instead of querying:
https://developers.google.com/bigquery/docs/reference/v2/tabledata/list
This being said, thanks for the feedback, as there is a balancing job between making pricing more complex and the tool better suited for other jobs - and this balance is built on the feedback we get.
I don't think this is a bug. "When you run a query, you're charged according to the total data processed in the columns you select, even if you set an explicit LIMIT on the results." (https://developers.google.com/bigquery/pricing#samplecosts)
I'm writing a background job to automatically process A/B test data in BigQuery, and I'm finding that I'm hitting "Resources exceeded during query execution" when doing large GROUP EACH BY statements. I saw from Resources Exceeded during query execution that reducing the number of groups can make queries succeed, so I split up my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition about what actually causes this error. In particular:
Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
What's the right way to approximate the memory usage and the total memory I have available? Am I correct in assuming each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each group, or is there another way that I should be thinking about it?
How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying over a smaller dataset?
The problematic query looks like this (in practice, it's used as a subquery, and the outer query aggregates the results):
SELECT
alternative,
snapshot_time,
SUM(column_1),
...
SUM(column_139)
FROM
my_table
CROSS JOIN
[table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time
(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4 )
I realize this query may be asking for trouble, but in this case, the table is only 22MB and the query results in under a million groups and it's still failing with "resources exceeded". Reducing the number of timestamps to process at once fixes the error, but I'm worried that I'll eventually hit a data scale large enough that this approach as a whole will stop working.
As you've guessed, BigQuery chooses a number of parallel workers (shards) for GROUP EACH and JOIN EACH queries based on the size of the tables being operated upon. It is a rough heuristic, but in practice, it works pretty well.
What is interesting about your query is that the GROUP EACH is being done over a larger table than the original table because of the expansion in the CROSS JOIN. Because of this, we choose a number of shards that is too small for your query.
To answer your specific questions:
Resources exceeded almost always means that a worker ran out of memory. This could be a shard or a mixer, in Dremel terms (mixers are the nodes in the computation tree that aggregate results. GROUP EACH BY pushes aggregation down to the shards, which are the leaves of the computation tree).
There isn't a good way to approximate the amount of resources available. This changes over time, with the goal that more of your queries should just work.
The number of shards is determined by the total bytes processed in the query. As you've noticed, this heuristic doesn't work well with joins that expand the underlying data sets. That said, there is active work underway to be smarter about how we pick the number of shards. To give you an idea of scale, your query got scheduled on only 20 shards, which is a tiny fraction of what a larger table would get.
As a workaround, you could save the intermediate result of the CROSS JOIN as a table, and running the GROUP EACH BY over that temporary table. That should let BigQuery use the expanded size when picking the number of shards. (if that doesn't work, please let me know, it is possible that we need to tweak our assignment thresholds).