I'm learning BigQuery with the new GitHub dataset, and my queries against the commits table keep failing with "resources exceeded" errors. I trimmed the SQL down to this, and it still fails:
SELECT
commit,
FIRST(repo_name) AS repo_name,
FIRST(author.email) AS author_email,
FIRST(author.time_sec) AS time,
SUM(LENGTH(message)) AS len_commit_msg,
COUNT(difference.new_path) AS num_files
FROM
[bigquery-public-data:github_repos.commits]
GROUP BY
commit
ORDER BY
repo_name,
time
The table in question is large (about 150 million rows), and all I want is a list of commits with basic information about them (length of the commit message and number of changed files).
Is there something particularly wrong with this example? I've tried changing the SUM(LENGTH(message)) part and the COUNT() to no avail. Or is the sort a no-no in BigQuery?
I also checked the previous "resources exceeded" questions, but the answers relate to problems with PARTITION, JOIN, or GROUP EACH BY, all of which I have avoided.
ORDER BY is expensive; try the query without it.
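For example, here is the same legacy SQL query with the final ORDER BY removed (a minimal sketch of that suggestion; you can sort the much smaller aggregated result in a follow-up query or after exporting it):
SELECT
  commit,
  FIRST(repo_name) AS repo_name,
  FIRST(author.email) AS author_email,
  FIRST(author.time_sec) AS time,
  SUM(LENGTH(message)) AS len_commit_msg,
  COUNT(difference.new_path) AS num_files
FROM
  [bigquery-public-data:github_repos.commits]
GROUP BY
  commit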
I've been trying to run this query, and I keep getting the "resources exceeded" error despite setting "Allow Large Results" to true and setting a destination table. I tried adding a limit as well. Is there any way I can optimize the query to avoid this error?
SELECT
repo_name,
commit,
parent,
subject,
message,
difference.*
FROM
FLATTEN(FLATTEN([bigquery-public-data:github_repos.commits], repo_name), parent)
WHERE
(REGEXP_MATCH(message,r'(?i:null pointer dereference)'))
LIMIT
5000000
Thank you very much!
It worked for me (and took about a minute and a half):
#standardSQL
SELECT
repo_name,
commit,
parent,
subject,
message,
difference
FROM
`bigquery-public-data.github_repos.commits`
CROSS JOIN UNNEST(repo_name) AS repo_name
CROSS JOIN UNNEST(parent) AS parent
WHERE
REGEXP_CONTAINS(message, r'(?i:null pointer dereference)');
I suspect that legacy SQL in BigQuery is worse at handling this kind of thing. As an aside, though: is there a reason that you want to flatten the repo_name and parent arrays? Without flattening, the query would produce 37,073 rows, whereas with flattening, it produces 11,419,166. Depending on the kind of analysis that you want to do, you may not need to flatten the data at all.
Edit: since it sounds like flattening was only necessary to work around legacy SQL's limitations related to independently repeating fields, you can remove the CROSS JOINs and avoid flattening:
#standardSQL
SELECT
repo_name,
commit,
parent,
subject,
message,
difference
FROM
`bigquery-public-data.github_repos.commits`
WHERE
REGEXP_CONTAINS(message, r'(?i:null pointer dereference)');
This is faster, too: it takes about 15 seconds instead of 90.
I'm trying to run a pretty simple query, but it's failing with a "resources exceeded" error.
I read in another post that the heuristic used to allocate the number of mixers could fail from time to time.
SELECT
response.auctionId,
response.scenarioId,
ARRAY_AGG(response) AS responses
FROM
rtb_response_logs.2016080515
GROUP BY
response.auctionId,
response.scenarioId
Is there a way to fix my query, knowing that:
a response is composed of 38 fields (most of them short strings)
the maximum number of responses for a single (auctionId, scenarioId) group is fairly low (165)
Query Failed
Error: Resources exceeded during query execution.
Job ID: teads-1307:bquijob_257ce97b_1566a6a3f27
It's a current limitation that arrays (produced by ARRAY_AGG or other means) must fit in the memory of a single machine. We've made a couple of recent improvements that should help to reduce the resources required for queries such as this, however. To confirm whether this is the issue, you could try a query such as:
SELECT
SUM(LENGTH(FORMAT("%t", response))) AS total_response_size
FROM
rtb_response_logs.2016080515
GROUP BY
response.auctionId,
response.scenarioId
ORDER BY total_response_size DESC LIMIT 1;
This formats the structs as strings as a rough heuristic of how much memory they would take to represent. If the result is very large, then perhaps we can restructure the query to use less memory. If the result is not very large, then some other issue is at play, and we'll look into getting it fixed :) Thanks!
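If the per-group arrays do turn out to be large, one way to reduce memory, sketched here on the assumption that only a subset of the 38 fields is actually needed (the field names bidPrice and status are hypothetical placeholders), is to aggregate just those columns instead of the whole struct:
#standardSQL
SELECT
  response.auctionId,
  response.scenarioId,
  -- Aggregate only the fields you actually need; bidPrice and status
  -- are hypothetical stand-ins for a subset of the 38 fields.
  ARRAY_AGG(STRUCT(response.bidPrice, response.status)) AS responses
FROM
  `rtb_response_logs.2016080515`
GROUP BY
  response.auctionId,
  response.scenarioId;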
This query fails with "resources exceeded":
SELECT
*,
DAY(event_timestamp) as whywontitwork,
FROM
looker_scratch.LR_78W8A60O4MQ20L2U6OA5B_events_sql_doctor_activity
But this one works fine:
SELECT
*
FROM
looker_scratch.LR_78W8A60O4MQ20L2U6OA5B_events_sql_doctor_activity
The source table has 14 million rows, but I've run similar queries on much larger datasets before. We have large results enabled and have tried both flattened and unflattened results (though there are no nested fields anyway). The error also occurs if you use the DATE() function instead of DAY(), or a REGEXP_EXTRACT() function.
The job id is realself-main:bquijob_69e3a888_152f1fdc205.
You've hit an internal error in BigQuery. We tweaked our query engine's configuration at around 3pm (US Pacific Time) in an effort to prevent the error.
Update: After observing the error rate, it looks like this change has fixed the problem. If you see any other issues, please let us know. Note that Stack Overflow is best for usage questions, but if you suspect a bug, you can file an issue at our public issue tracker.
Hi there.
Recently I wanted to run a query in the BigQuery web UI using GROUP BY over some tables (the table names match the pattern xxx_mst_yyyymmdd). There will be over 10 million rows. Unfortunately, the query failed with this error:
Query Failed
Error: Resources exceeded during query execution.
I made some improvements to my query, and the error may not happen this time. But as my data grows, the error will appear again in the future. So I checked the latest BigQuery release notes, and there seem to be two ways to solve this:
1. After 2016/01/01, BigQuery will change the query pricing tiers to include "High Compute Tiers", so that the resourcesExceeded error will not happen again.
2. BigQuery Slots.
I checked some of Google's documentation and didn't find anything on how to use BigQuery Slots. Is there any sample or use case for BigQuery Slots? Or do I have to contact the BigQuery team to enable the feature?
I hope someone can help me answer this question. Thanks very much!
A couple of points:
I'm surprised that a GROUP BY with a cardinality of 10M failed with resources exceeded. Can you provide a job ID of the failed query so we can investigate? You mention that you're concerned about hitting these errors more often as your data size increases; you should be able to increase your data size by a few more orders of magnitude without seeing this. Most likely you've either encountered a bug or something was unusual about your query or your data.
"High Compute Tiers" won't necessarily get rid of resourcesExceeded. For the most part, resourcesExceeded means that BigQuery ran into memory limitations; high compute tiers only address CPU usage. (and note, they haven't been enabled yet).
BigQuery slots enable you to process data faster and with more reliable performance. For the most part, they also wouldn't help prevent resourcesExceeded errors.
There is currently (as of Nov 5) a bug where you may need to provide an EACH keyword with GROUP BY. Recent changes should enable BigQuery to automatically select the execution strategy, so EACH shouldn't be needed, but there are a couple of cases where it doesn't pick the right one. When in doubt, add an EACH to your JOIN and GROUP BY operations (see the sketch after these points).
To get your project eligible for using slots you need to contact support.
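For reference, here is a minimal legacy SQL sketch of the EACH hint mentioned above; the table and column names are placeholders rather than anything from the original question:
SELECT
  user_id,
  COUNT(*) AS cnt
FROM
  [mydataset.xxx_mst_20151105]
GROUP EACH BY
  user_id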
I am currently running the following query in BigQuery:
SELECT a, FIRST(grouped_value) concat_value
FROM (SELECT a, group_concat(subreddit) over
(partition by a order by order_field asc
rows between unbounded preceding and unbounded following)
grouped_value
from
[long_list_of_tables] )
GROUP EACH BY a
Unfortunately, I end up with the following error:
Query Failed
Error: Resources exceeded during query execution.
Job ID: trusty-spanner-100412:job_cKtzW1aYFUSuRjixSiShghOAe-s
My limit is not reached as I can run other queries fine.
The query here is from the answer to "GROUP_CONCAT with ORDER BY".
I checked your query, and the results of GROUP_CONCAT are growing too big, so I think another approach is needed. BigQuery will soon announce general availability of JavaScript UDFs, and then sorting and removing duplicates inside a string becomes simple JavaScript code. Once the feature becomes public, I will make sure to publish an example that does sorting and removing duplicates using JavaScript.
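As a rough illustration of that idea (not the promised example; this sketch uses the standard SQL JavaScript UDF syntax that later became available and assumes the concatenated values are comma-separated):
#standardSQL
-- Hypothetical sketch: deduplicate and sort the items inside a
-- comma-separated string with a JavaScript UDF.
CREATE TEMP FUNCTION DedupAndSort(s STRING)
RETURNS STRING
LANGUAGE js AS """
  if (s === null) return null;
  return Array.from(new Set(s.split(','))).sort().join(',');
""";
SELECT DedupAndSort('news,funny,news,aww') AS concat_value;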
This error message has nothing to do with quota; it relates to how the system shards your query to try to run it.
Do you really need the EACH in your GROUP BY?
Normally you use EACH only if you know a will have a lot of distinct values, at the cost of more processing.
Also, when you run this query, do you have "Allow Large Results" enabled? If not, it's possible that you're running into that limit.