BigQuery maximum query length (characters) workaround - google-bigquery

First, let me explain the problem.
I have 500 unique users. The data for each user is split into smaller gzip files (roughly 25 files per user on average). We have loaded each split gzip file as a separate table in BigQuery, so our dataset contains some 13,000 tables.
Now, we have to run time-range queries to retrieve some data for each user. There are around 500-1000 different time ranges per user, and we would like to combine all of them into a single query with logical OR and AND:
WHERE (timestamp > 2 AND timestamp < 3) OR (timestamp > 4 AND timestamp < 5) OR ... and so on, roughly 1000 times
and run it against the 13,000 tables.
Our own tests suggest that BigQuery has a query length limit of about 10,000 characters.
If we split the conditions into multiple queries, we exceed the 20,000-queries-per-day quota limit.
Is there any workaround, so that we can run these queries without hitting the daily quota limit?
Thanks
JR

I faced a similar issue with the BigQuery SQL query length limit of 1024K characters when passing a large array of values in a WHERE condition.
To resolve it I used a parameterized query: https://cloud.google.com/bigquery/docs/parameterized-queries
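As a rough illustration of that approach, here is a minimal sketch using the Python google-cloud-bigquery client and an ArrayQueryParameter; the project, dataset, table, and column names are placeholders rather than anything from the original question.

from google.cloud import bigquery

client = bigquery.Client()

# Potentially a very long list of values that would otherwise be inlined
# into the SQL text and blow past the query length limit.
user_ids = ["user_001", "user_002", "user_003"]

sql = """
    SELECT user_id, event_timestamp
    FROM `my_project.my_dataset.events`
    WHERE user_id IN UNNEST(@user_ids)
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        # The list travels as a query parameter in the request payload
        # instead of being pasted into the query string.
        bigquery.ArrayQueryParameter("user_ids", "STRING", user_ids),
    ]
)

for row in client.query(sql, job_config=job_config).result():
    print(row.user_id, row.event_timestamp)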

I can think of two things:
Try reducing the number of tables in the dataset. If they share the same schema, could they be combined (denormalised) into one table, or at least into fewer tables? I have loaded 500,000+ JSON gzip files into one table, and querying is much easier (see the load-job sketch after this list).
With the timestamp, you can try to use a shorter common denominator.
For example, instead of
WHERE (timestamp > "2014-06-25:00:00:00" AND timestamp < "2014-06-26:00:00:00")
you could express
WHERE LEFT(timestamp,10) = "2014-06-25"
Hopefully this reduces your query length as well.
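To make the first suggestion concrete, here is a minimal sketch of loading many split gzip files into a single table with one load job, assuming the files sit in Cloud Storage, share a schema, and are newline-delimited JSON; the bucket, path, and table names are placeholders (Python google-cloud-bigquery client).

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema instead
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# A wildcard URI picks up all of the split gzip files in one job,
# so they land in a single table instead of thousands of small ones.
load_job = client.load_table_from_uri(
    "gs://my-bucket/users/*.json.gz",
    "my_project.my_dataset.all_users",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish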

The BigQuery payload limit is increased to 10 MB instead of 1 MB when you use parameterized queries. That helped me.
This is the error message I got when trying to find the payload size limit for parameterized queries:
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Request payload size exceeds the limit: 10485760 bytes.",
    "reason" : "badRequest"
  } ],
  "message" : "Request payload size exceeds the limit: 10485760 bytes.",
  "status" : "INVALID_ARGUMENT"
}

Related

Google BigQuery results

I am getting only part of the result set from the BigQuery API.
Earlier, I solved the issue of 100,000 records per result by using iterators.
However, now I'm stuck at another obstacle.
If I select more than 6-7 columns in a result, I do not get the complete result set.
However, if I select a single column, I get the complete result.
Can there be a size limit as well for results in the BigQuery API?
There are some limits for query jobs.
In particular:
Maximum response size — 128 MB compressed
Of course, it is unlimited when writing large query results to a destination table (and then reading from there).
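As a minimal sketch of that pattern with the Python google-cloud-bigquery client (project, dataset, and table names are placeholders): write the query results to a destination table, then page through that table instead of relying on a single query response.

from google.cloud import bigquery

client = bigquery.Client()
destination = "my_project.my_dataset.query_results"

# Write the full result set to a destination table first.
job_config = bigquery.QueryJobConfig(destination=destination)
client.query(
    "SELECT * FROM `my_project.my_dataset.big_table`",
    job_config=job_config,
).result()  # wait for the query to finish

# Then read the results back from the table in pages, which sidesteps
# the response-size limit on a single query response.
for row in client.list_rows(destination, page_size=10000):
    pass  # handle each row here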

Is there any size limit on the columns a Spark DataFrame can process/hold at a time?

I would like to know whether a Spark DataFrame has a limit on column count, e.g. whether the maximum number of columns it can process/hold at a time is less than 500. I am asking because, while parsing an XML file with fewer than 500 tags, I can process it and generate a corresponding Parquet file successfully, but if there are more than 500 tags, the generated Parquet file is empty. Any idea on this?

BigQuery API limit exceeded error

We've got an error during tabledata.list with this message:
API limit exceeded: Unable to return a row that exceeds the API limits. To retrieve the row, export the table.
It is not listed at https://cloud.google.com/bigquery/troubleshooting-errors#errortable.
This error occurs every time.
We can export this table to GCS normally, and the result looks normal (there are no extremely large rows).
We manage to retrieve several result pages before the actual error occurs.
com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "API limit exceeded: Unable to return a row that exceeds the API limits. To retrieve the row, export the table.",
    "reason" : "apiLimitExceeded"
  } ],
  "message" : "API limit exceeded: Unable to return a row that exceeds the API limits. To retrieve the row, export the table."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145) ~[com.google.api-client.google-api-client-1.21.0.jar:1.21.0]
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113) ~[com.google.api-client.google-api-client-1.21.0.jar:1.21.0]
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40) ~[com.google.api-client.google-api-client-1.21.0.jar:1.21.0]
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321) ~[com.google.api-client.google-api-client-1.21.0.jar:1.21.0]
What does it mean? How can we resolve this error?
Sorry about the inconvenience.
This is a known issue with the tabledata.list method.
The problem is that we have some infrastructure limitations that currently make it impossible to return very large rows from tabledata.list.
"Large" is a relative word here. Unfortunately, some rows have a small size when represented in JSON but can consume a lot of memory when represented in our internal format.
The current workaround is the one mentioned in the error message: export the table (a sketch follows below).
For the long term, we are actively working on improving our system to overcome this limitation. Stay tuned :)
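A minimal sketch of that export workaround with the Python google-cloud-bigquery client; the table name and GCS URI are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

extract_job = client.extract_table(
    "my_project.my_dataset.problem_table",
    # The wildcard lets BigQuery shard the export across multiple files.
    "gs://my-bucket/exports/problem_table-*.json.gz",
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
        compression=bigquery.Compression.GZIP,
    ),
)
extract_job.result()  # wait for the export to finish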
There are two row-related limits for tabledata.list:
The byte size of the proto message has to be less than 10 MB. If one row is larger than this size, we cannot retrieve it.
The number of fields in the row has to be less than 350,000, that is, the number of leaf fields of a row.
If you hit that problem, it usually means only that the first row in your request is too large to return; if you skip that row, retrieval of the following rows might work (see the paging sketch below). You may want to look more closely at that particular row to see why.
In the future, it is likely that the field count limit can be removed, but we will still have the 10 MB size limit due to API server limitations.
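If you want to confirm that a single oversized row is the culprit, a minimal sketch of skipping past it with the Python client's list_rows (which wraps tabledata.list) might look like this; the table name and row index are placeholders.

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_dataset.problem_table")

# Suppose paging failed once row index 42000 was reached: resume one row
# later and see whether the remaining rows come back fine.
for row in client.list_rows(table, start_index=42001, page_size=1000):
    pass  # handle each row here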

Why are my BigQuery streaming inserts being rate limited?

I'm getting 403 rateLimitExceeded errors while doing streaming inserts into BigQuery. I'm doing many streaming inserts in parallel, so while I understand that this might be cause for some rate limiting, I'm not sure what rate limit in particular is the issue.
Here's what I get:
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "Exceeded rate limits: Your table exceeded quota for rows. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors",
    "reason" : "rateLimitExceeded"
  } ],
  "message" : "Exceeded rate limits: Your table exceeded quota for rows. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"
}
Based on BigQuery's troubleshooting docs, 403 rateLimitExceeded is caused by either concurrent rate limiting or API request limits, but the docs make it sound like neither of those apply to streaming operations.
However, the message in the error mentions table exceeded quota for rows, which sounds more like the 403 quotaExceeded error. The streaming quotas are:
Maximum row size: 1 MB - I'm under this - my average row size is in the KB range, and I specifically limit sizes to ensure they don't hit 1 MB
HTTP request size limit: 10 MB - I'm under this - my average batch size is < 400KB and max is < 1MB
Maximum rows per second: 100,000 rows per second, per table. Exceeding this amount will cause quota_exceeded errors. - can't imagine I'd be over this - each batch is about 500 rows, and each batch takes about 500 milliseconds. I'm running in parallel but inserting across about 2,000 tables, so while it's possible (though unlikely) that I'm doing 100k rows/second, there's no way that's per table (more like 1,000 rows/sec per table max)
Maximum rows per request: 500 - I'm right at 500
Maximum bytes per second: 100 MB per second, per table. Exceeding this amount will cause quota_exceeded errors. - Again, my insert rates are not anywhere near this volume by table.
Any thoughts/suggestions as to what this rate limiting is would be appreciated!
I suspect you are occasionally submitting more than 100,000 rows per second to a single table. Might your parallel insert processes occasionally all line up on the same table?
The reason this is reported as a rate limit error is to give a push-back signal to slow down: to handle sporadic spikes of operations on a single table, you can back off and try again to spread the load out.
This is different from a quota failure, which implies that retrying will still fail until the quota epoch rolls over (for example, daily quota limits).
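A minimal sketch of that back-off-and-retry idea, assuming the Python google-cloud-bigquery client (which surfaces 403/429 responses as google.api_core exceptions); the table name, row contents, and retry policy are placeholders.

import time

from google.api_core.exceptions import Forbidden, TooManyRequests
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.events"

def insert_with_backoff(rows, max_attempts=6):
    """Retry streaming inserts with exponential backoff on rate limits."""
    for attempt in range(max_attempts):
        try:
            # Returns per-row errors; an empty list means every row landed.
            return client.insert_rows_json(table_id, rows)
        except (Forbidden, TooManyRequests):
            # Rate limited: back off so sporadic spikes on one table get
            # spread out instead of hammering it again immediately.
            time.sleep(2 ** attempt)
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)

insert_with_backoff([{"user_id": "user_001", "ts": "2014-06-25 12:00:00"}])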

Google BigQuery unable to process larger result set getting "Response too large to return" or "Resources exceeded during query execution"

I am currently working with a large table (~105M records) in a C# application.
When I query the table with an 'Order by' or 'Order Each by' clause, I get a "Resources exceeded during query execution" error.
If I remove the 'Order by' or 'Order Each by' clause, I get a "Response too large to return" error.
Here are sample queries for the two scenarios (I am using the Wikipedia public table):
SELECT Id,Title,Count(*) FROM [publicdata:samples.wikipedia] Group EACH by Id, title Order by Id, Title Desc
SELECT Id,Title,Count(*) FROM [publicdata:samples.wikipedia] Group EACH by Id, title
Here are the questions I have:
What is the maximum size of a BigQuery response?
How do we select all the records in a query request rather than via the 'Export Method'?
1. What is the maximum size of a BigQuery response?
As mentioned in the quota policy for queries, the maximum response size is 10 GB compressed (unlimited when returning large query results).
2. How do we select all the records in a query request rather than via the 'Export Method'?
If you plan to run a query that might return larger results, you can set allowLargeResults to true in your job configuration (see the sketch after this list).
Queries that return large results will take longer to execute, even if the result set is small, and are subject to additional limitations:
You must specify a destination table.
You can't specify a top-level ORDER BY, TOP or LIMIT clause. Doing so negates the benefit of using allowLargeResults, because the query output can no longer be computed in parallel.
Window functions can return large query results only if used in conjunction with a PARTITION BY clause.
Read more about how to paginate to get the results here, and also read the BigQuery Analytics book, starting around page 200, where it is explained how Jobs::getQueryResults works together with the maxResults parameter and its blocking mode.
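A minimal sketch of the allowLargeResults route with the Python google-cloud-bigquery client; it assumes legacy SQL (which is where the flag applies) and a placeholder destination table.

from google.cloud import bigquery

client = bigquery.Client()
destination = "my_project.my_dataset.wikipedia_counts"

job_config = bigquery.QueryJobConfig(
    allow_large_results=True,   # requires a destination table
    destination=destination,
    use_legacy_sql=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

client.query(
    "SELECT Id, Title, COUNT(*) FROM [publicdata:samples.wikipedia] "
    "GROUP EACH BY Id, Title",  # note: no top-level ORDER BY
    job_config=job_config,
).result()

# Page through the stored results instead of pulling them in one response.
for row in client.list_rows(destination, page_size=50000):
    pass  # handle each row here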
Update:
Query Result Size Limitations - When you run a normal query in BigQuery, the response size is limited to 10 GB of compressed data. Sometimes, it is hard to know what 10 GB of compressed data means. Does it get compressed 2x? 10x? The results are compressed within their respective columns, which means the compression ratio tends to be very good. For example, if you have one column that is the name of a country, there will likely be only a few different values. When you have only a few distinct values, there isn't a lot of unique information, and the column will generally compress well. If you return encrypted blobs of data, they will likely not compress well because they will be mostly random. (This is explained in the book linked above, on page 220.)