I am currently running the following query in BigQuery:
SELECT
  a,
  FIRST(grouped_value) AS concat_value
FROM (
  SELECT
    a,
    GROUP_CONCAT(subreddit) OVER (
      PARTITION BY a
      ORDER BY order_field ASC
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS grouped_value
  FROM
    [long_list_of_tables])
GROUP EACH BY a
Unfortunately, I end up with the following error:
Query Failed
Error: Resources exceeded during query execution.
Job ID: trusty-spanner-100412:job_cKtzW1aYFUSuRjixSiShghOAe-s
I haven't hit my quota, since I can run other queries without problems.
The query here is from the answer to "GROUP_CONCAT with ORDER BY".
I checked your query, and the results of GROUP_CONCAT are growing too big, so another approach is needed. BigQuery will soon announce general availability of JavaScript UDFs, and then sorting and removing duplicates inside a string becomes simple JavaScript code. Once the feature goes public, I will publish an example that does the sorting and de-duplication in JavaScript.
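In the meantime, here is a minimal sketch of what such a UDF might look like once JavaScript UDFs are available in standard SQL; the function name, the comma delimiter, and the source table for concat_value are all assumptions for illustration:
#standardSQL
-- Hypothetical helper: sort a comma-delimited string and drop duplicate entries.
CREATE TEMP FUNCTION SortDedup(s STRING)
RETURNS STRING
LANGUAGE js AS """
  if (s === null) return null;
  var seen = {};
  var out = [];
  s.split(',').forEach(function(part) {
    if (!(part in seen)) { seen[part] = true; out.push(part); }
  });
  return out.sort().join(',');
""";

SELECT a, SortDedup(concat_value) AS concat_value
FROM `my_dataset.my_grouped_results`;  -- stands in for the grouped subquery in the question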
This error message has nothing to do with quota; it is about how the system shards your query in order to run it.
Do you really need the EACH in your GROUP BY?
Normally you add EACH only if you know a will have a very large number of distinct values, and it comes at the cost of extra processing.
Also, when you run this query, do you enable "allow large results"? If not, it's possible that's the error you're running into.
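For what it's worth, in standard SQL the window-plus-FIRST workaround isn't needed at all, since STRING_AGG accepts ORDER BY directly. A hedged sketch using the question's column names and a placeholder table:
#standardSQL
SELECT
  a,
  STRING_AGG(subreddit ORDER BY order_field) AS concat_value
FROM
  `my_dataset.my_table`  -- placeholder for [long_list_of_tables]
GROUP BY a;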
I'm using the following code to query a dataset based on a polygon:
SELECT *
FROM `waze-public-dataset.partner_name.view_jams_clustered`
WHERE ST_INTERSECTS(geo, ST_GEOGFROMTEXT("POLYGON((-99.54913355822276 27.60526592074579,-99.52673174853038 27.60526592074579,-99.52673174853038 27.590813604291416,-99.54913355822276 27.590813604291416,-99.54913355822276 27.60526592074579))")) IS TRUE
The validation message says that "This query will process 1 TB when run".
It seems like there's no problem. However, when I remove the WHERE ST_INTERSECTS clause, the validation message says exactly the same thing: "This query will process 1 TB when run", the same 1 TB, so I'm guessing that the ST_INTERSECTS filter is not being applied.
When you actually run this query, the amount charged should usually be much less, as expected for a spatially clustered table. I ran a SELECT COUNT(*) ... query against one partner dataset, and while the editor UI announced 9 TB before running, the job reported around 150 MB processed after it finished.
The savings come from the clustered table, but which clusters intersect the polygon in the filter depends on the actual data in the table and on how the clusters were created. The clusters, and therefore the cost of the query, can only be determined when the query runs. In this case the editor UI shows the maximum possible cost of the query.
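If you want to confirm this yourself, one option (a hedged sketch; the region qualifier and lookback window are assumptions) is to compare the editor's estimate with what recent jobs actually processed and billed:
#standardSQL
SELECT
  job_id,
  total_bytes_processed,
  total_bytes_billed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC
LIMIT 10;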
I've been trying to run this query and I keep getting the "resources exceeded" error, despite setting "Allow Large Results" to true and specifying a destination table. I also tried adding a LIMIT. Is there any way I can optimize the query to avoid this error?
SELECT
repo_name,
commit,
parent,
subject,
message,
difference.*
FROM
FLATTEN(FLATTEN([bigquery-public-data:github_repos.commits], repo_name), parent)
WHERE
(REGEXP_MATCH(message,r'(?i:null pointer dereference)'))
LIMIT
5000000
Thank you very much!
It worked for me (and took about a minute and a half):
#standardSQL
SELECT
repo_name,
commit,
parent,
subject,
message,
difference
FROM
`bigquery-public-data.github_repos.commits`
CROSS JOIN UNNEST(repo_name) AS repo_name
CROSS JOIN UNNEST(parent) AS parent
WHERE
REGEXP_CONTAINS(message, r'(?i:null pointer dereference)');
I suspect that legacy SQL in BigQuery is worse at handling this kind of thing. As an aside, though: is there a reason that you want to flatten the repo_name and parent arrays? Without flattening, the query would produce 37,073 rows, whereas with flattening, it produces 11,419,166. Depending on the kind of analysis that you want to do, you may not need to flatten the data at all.
Edit: since it sounds like flattening was only necessary to work around legacy SQL's limitations related to independently repeating fields, you can remove the CROSS JOINs and avoid flattening:
#standardSQL
SELECT
repo_name,
commit,
parent,
subject,
message,
difference
FROM
`bigquery-public-data.github_repos.commits`
WHERE
REGEXP_CONTAINS(message, r'(?i:null pointer dereference)');
This is faster, too--it takes about 15 seconds instead of 90.
I'm learning BigQuery with the new Github dataset and my queries to the commits dataset keep failing due to resources exceeded. I trimmed down the SQL to this code and it still fails:
SELECT
commit,
FIRST(repo_name) AS repo_name,
FIRST(author.email) AS author_email,
FIRST(author.time_sec) AS time,
SUM(LENGTH(message)) AS len_commit_msg,
COUNT(difference.new_path) AS num_files
FROM
[bigquery-public-data:github_repos.commits]
GROUP BY
commit
ORDER BY
repo_name,
time
The dataset in question is large (150m rows) and what I want is just a list of commits with basic information about them (length of commit message and number of changed files).
Is there something particularly wrong in this example? I've tried changing the SUM(LENGTH(message)) part and the COUNT() to no avail. Or is the sort a no-no for BigQuery?
I also checked the previous "resources exceeded" questions and the answers relate to problems with PARTITION, JOIN, or GROUP EACH BY, which I have avoided.
ORDER BY is expensive - the final sort of roughly 150 million grouped rows can't be parallelized. Try the query without it.
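For illustration, a hedged sketch of roughly the same aggregation in standard SQL with the ORDER BY dropped; ANY_VALUE stands in for legacy FIRST, and SUM(ARRAY_LENGTH(difference)) is only an approximation of the per-commit file count:
#standardSQL
SELECT
  commit,
  ANY_VALUE(repo_name[SAFE_OFFSET(0)]) AS repo_name,
  ANY_VALUE(author.email) AS author_email,
  ANY_VALUE(author.time_sec) AS time,
  SUM(LENGTH(message)) AS len_commit_msg,
  SUM(ARRAY_LENGTH(difference)) AS num_files
FROM
  `bigquery-public-data.github_repos.commits`
GROUP BY commit;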
We have a 1.01 TB table with known duplicates that we are trying to de-duplicate using GROUP EACH BY.
There is an error message we'd like some help deciphering:
Query Failed
Error:
Shuffle failed with error: Cannot shuffle more than 3.00T in a single shuffle. One of the shuffle partitions in this query exceeded 3.84G. Strategies for working around this error are available on the go/dremelfaq.
Job ID: job_MG3RVUCKSDCEGRSCSGA3Z3FASWTSHQ7I
The query, as you'd imagine, does quite a bit, and looks a little something like this:
SELECT Twenty, Different, Columns, For, Deduping, ...
including_some, INTEGER(conversions),
plus_also, DAYOFWEEK(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
and_also, HOUROFDAY(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
and_a, IF(REGEXP_MATCH(long_string_field,r'ab=(\d+)'),TRUE, NULL) as flag_for_counting,
with_some, joined, reference, columns,
COUNT(*) as duplicate_count
FROM [MainDataset.ReallyBigTable] as raw
LEFT OUTER JOIN [RefDataSet.ReferenceTable] as ref
ON ref.id = raw.refid
GROUP EACH BY ... all columns in the select bar the count...
Question
What does this error mean? Is it trying to do this kind of shuffling? ;-)
And finally, is the dremelfaq referenced in the error message available outside of Google, and would it help us understand what's going on?
Side Note
For completeness, we tried a more modest GROUP EACH:
SELECT our, twenty, nine, string, column, table,
count(*) as dupe_count
FROM [MainDataSet.ReallyBigTable]
GROUP EACH BY all, those, twenty, nine, string, columns
And we received a more subtle error:
Error: Resources exceeded during query execution.
Job ID: job_D6VZEHB4BWZXNMXMMPWUCVJ7CKLKZNK4
Should Bigquery be able to perform these kind of de-duplication queries? How should we best approach this problem?
Actually, the shuffling involved is closer to this: http://www.youtube.com/watch?v=KQ6zr6kCPj8.
When you use the 'EACH' keyword, you're instructing the query engine to shuffle your data... you can think of it as a giant sort operation.
This is likely pushing close to the cluster limits that we've set in BigQuery. I'll talk to some of the other folks on the BigQuery team to see if there is a way we can figure out how to make your query work.
In the meantime, one option would be to partition your data into smaller tables and do the deduping on those smaller tables, then use table copy/append operations to create your final output table. To partition your data, you can do something like:
(SELECT * from [your_big_table] WHERE ABS(HASH(column1) % 10) == 1)
Unfortunately, this is going to be expensive, since it will require running the query over your 1 TB table 10 times.
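Concretely, each per-shard job could look something like the sketch below (legacy SQL; c1, c2, c3 are hypothetical names standing in for the twenty-nine real columns). Run it once per remainder 0 through 9 with "allow large results" and a destination table for each shard, then copy/append the ten shard tables into the final de-duplicated table:
SELECT c1, c2, c3, COUNT(*) AS duplicate_count
FROM [MainDataSet.ReallyBigTable]
WHERE ABS(HASH(c1) % 10) == 0
GROUP EACH BY c1, c2, c3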
As part of a data analysis project, I will be issuing some long-running queries on a MySQL database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, partial results cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data has to be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Could you share concrete data and the SQL query?
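For example, in a query like the following (hypothetical table and column names), the LIMIT 10 is applied only after every group has been aggregated, so nothing can be streamed back early:
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id
LIMIT 10;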
One thing you may consider is sampling your tables down. This is good practice in data analysis in general, since it speeds up iteration while you're writing code.
For example, suppose you have table-creation privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, then in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out on, and you can inner join your sample_table against the other tables to speed up testing. Thanks to the sampling, your results should be roughly representative of what you would get from the full table. Note that the number you're modding by should be prime; otherwise patterns in the IDs can bias the sample. The example above shrinks your table to about 0.1% of the original size (0.0987% to be exact).
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
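For example, in MySQL you could take an approximate 0.1% sample without relying on a numeric key at all; a hedged sketch with the same hypothetical names as above:
create table sample_table_rand as
select unique_id, data_value
from X
where rand() < 0.001;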
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs the whole result set before producing output, as can happen with GROUP BY, ORDER BY, or HAVING clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by setting mysql_use_result as an attribute of the database handle instead of the default mysql_store_result. This is true for the Perl and Java interfaces; I think in the C interface, you have to use the unbuffered version of the function that retrieves the results.