BigQuery External Data Source Query Quotas - google-bigquery

I have a BigQuery table set up with a Cloud BigTable external data source. This works fine, and I'm able to run queries joining my BigTable data to some of my other BigQuery data. However, when I run too many queries against this table simultaneously, I get the following error:
Error encountered during job execution:
Exceeded rate limits: too many concurrent queries that read Cloud Bigtable data sources for this project. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
I can't find any documentation about the limits on concurrent queries on the linked page or on the BigQuery Quotas and Limits page. I'm not running that many queries here - max 10 at a time. Has anyone run into this before who knows what the actual concurrent query limit is?
edit:
So people don't have to dig through the attached Google ticket, the correct answer (as of April 2018) is 4 concurrent queries.

You should instead look at the Quotas & Limits for Query Jobs:
The following limits apply to query jobs created automatically by running interactive queries and to jobs submitted programmatically using jobs.query and query-type jobs.insert method calls.
Concurrent rate limit for on-demand, interactive queries — 50 concurrent queries
Queries with results that are returned from the query cache, and dry run queries do not count against this limit. You can specify a dry run query using the --dry_run flag or by setting the dryRun property in a query job.
Concurrent rate limit for queries that contain user-defined functions (UDFs) — 6 concurrent queries
The concurrent rate limit for queries that contain UDFs includes both interactive and batch queries. Interactive queries that contain UDFs also count toward the concurrent rate limit for interactive queries.
You can find more details in the linked documentation.
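If you do need to run more queries than the limit allows, one workaround is to throttle on the client side so that no more than 4 queries against the Bigtable-backed table are in flight at once. A minimal sketch with the google-cloud-bigquery Python client (the table and queries are made up):

```python
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical queries against a table backed by a Cloud Bigtable external data source.
queries = [
    "SELECT * FROM `my_project.my_dataset.bigtable_backed_table` WHERE key = '{}'".format(k)
    for k in ("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
]

def run(sql):
    job = client.query(sql)    # starts the query job
    return list(job.result())  # blocks until the job finishes

# Cap client-side concurrency at 4 so we stay under the external-data-source limit.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run, queries))
```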

Related

How to get query time and space information from BigQuery API

I'm going to build a web app that uses BigQuery as part of the backend database, and I want to show the query cost information (e.g. 1.8 sec elapsed, 264.9 MB processed) in the app.
I know we can check a query's information inside GCP, but how do I get that information from the BigQuery API?
The information you are interested in is present in the job statistics.
See jobs.get for more details: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/get
The dry-run sample may also be of interest, though you can get the stats from a real invocation too (a dry run estimates costs without executing the query):
https://cloud.google.com/bigquery/docs/samples/bigquery-query-dry-run
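For example, with the google-cloud-bigquery Python client the elapsed time and bytes processed can be read straight off the finished job, and a dry run gives the bytes estimate without running anything. A sketch (the query is just a placeholder against a public dataset):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT name, COUNT(*) AS n
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
"""

# Dry run: estimates bytes processed without executing the query (no cost).
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_job = client.query(sql, job_config=dry_cfg)
print("Estimated bytes processed:", dry_job.total_bytes_processed)

# Real run: the same job statistics are available once the job completes.
job = client.query(sql)
job.result()  # wait for completion
elapsed = (job.ended - job.started).total_seconds()
print("{:.1f} sec elapsed, {:.1f} MB processed".format(
    elapsed, job.total_bytes_processed / 1e6))
```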

Allowing many users to view stale BigQuery data query results concurrently

If I have a BigQuery dataset with data that I would like to make available to 1000 people (where each of these people would only be allowed to view their subset of the data, and is OK to view a 24hr stale version of their data), how can I do this without exceeding the 50 concurrent queries limit?
The BigQuery documentation mentions that 50 concurrent queries are permitted, which return up-to-the-moment data. I would exceed that limit if everyone needed to view completely fresh data, which they don't.
The documentation also mentions that batch jobs are permitted and that results can be saved into destination tables, which I'm hoping could offer a reliable solution for my scenario. However, I'm having difficulty finding information on how reliably or frequently those batch jobs can be expected to run, and whether someone querying results that exist in those destination tables counts toward the 50 concurrent query limit.
Any advice appreciated.
Without knowing the specifics of your situation and depending on how much data is in the output, I would suggest putting your own cache in front of BigQuery.
This sounds kind of like a dashboarding/reporting solution, so I assume there is a large amount of data going in and a relatively small amount coming out (per user).
Run one query per day with a batch script to generate your output (grouped by user) and then export it to GCS (a rough sketch follows below). You can then break it up into multiple flat files (or just read it into memory on your frontend). When a user hits your frontend, you determine which part of the output to serve them and respond.
This should be relatively cheap if you can work off the cached data and it is small enough that handling the BigQuery output isn't too much additional processing.
Google Cloud Functions might be an easy way to handle this, if you don't want the extra work of setting up a new VM to host your frontend.
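A rough sketch of the daily batch step with the google-cloud-bigquery Python client; the project, dataset, table, bucket, and query below are all made up:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Write the per-user aggregate into a destination table at batch priority,
# so the job doesn't count against the interactive concurrent-query limit.
dest = bigquery.TableReference.from_string("my_project.reporting.daily_user_stats")
cfg = bigquery.QueryJobConfig(
    destination=dest,
    priority=bigquery.QueryPriority.BATCH,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.query(
    "SELECT user_id, SUM(amount) AS total FROM `my_project.raw.events` GROUP BY user_id",
    job_config=cfg,
)
job.result()  # wait for the batch query to finish

# Export the results to GCS; the frontend then serves each user from these
# files, so user requests never touch BigQuery at all.
extract = client.extract_table(dest, "gs://my-report-bucket/daily_user_stats-*.csv")
extract.result()
```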

Query Terminating in Redshift

We are migrating our database from SQL Server 2012 to Amazon Redshift.
The front end of our application is developed in MicroStrategy (MSTR) which fires the queries on Redshift.
Although the application is working fine in Production (on SQL Server 2012), we have run into a strange issue in our PoC Environment on Redshift.
When we kicked off a dashboard in MSTR, the query from the dashboard hits Redshift and it completes successfully without any issues.
But when we stress test the application by running all the dashboards simultaneously, then that particular dashboard's query terminates in Redshift. The database does not throw any error message which is why we cannot troubleshoot why the query is terminating.
Can anyone please suggest how we should go about solving this problem?
Thank you
The problem might be that a timeout is set, via the WLM configuration, on the queue to which you are sending the query (the sketch at the end of this answer shows how to check).
Redshift is designed differently from other databases: it is optimized for analytical queries. For that reason it doesn't cache query results, as an OLTP database would. The other difference is that you have a predefined concurrency level (also part of WLM - http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html). Each concurrency slot has its allocated resources to complete big queries quickly, but this limits the number of concurrent queries that can run. The default configuration is 5, and you can increase it up to 50. The recommendation is to increase it to no more than 15-20, since with 50 each query gets only 2% of the cluster's resources, instead of 20% (with 5) or 5% (with 20).
The combination of these two differences is: if you connect many dashboards, each one sends its queries to Redshift, they compete over the resources (without caching, each query runs again and again), and queries might time out or just be too slow for an interactive dashboard.
Please make sure that you are using the Redshift-optimized drivers for MicroStrategy, which send queries to Redshift under the above assumptions.
You can also consider putting some RDS between your dashboards and Redshift, holding the aggregated data that you need for your dashboards, which can use in-memory caching and higher concurrency on that summary data. There is an interesting pattern you can implement with pgbouncer that lets you send some queries (the analytical ones) to Redshift, and others (the aggregated dashboard ones) to a PostgreSQL instance.
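To check whether WLM is actually killing the queries, look at the WLM system tables: the queue configuration shows any per-queue timeout, and (depending on your cluster version) stl_wlm_rule_action logs queries aborted by query monitoring rules. A rough sketch using psycopg2; the connection details are placeholders:

```python
import psycopg2

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="...")

with conn.cursor() as cur:
    # Per-queue WLM configuration: concurrency (num_query_tasks) and any
    # timeout (max_execution_time, in ms). service_class > 5 are user queues.
    cur.execute("""
        SELECT service_class, num_query_tasks, max_execution_time
        FROM stv_wlm_service_class_config
        WHERE service_class > 5;
    """)
    for row in cur.fetchall():
        print(row)

    # Queries aborted (or logged/hopped) by WLM query monitoring rules.
    cur.execute("""
        SELECT query, service_class, rule, action, recordtime
        FROM stl_wlm_rule_action
        ORDER BY recordtime DESC
        LIMIT 20;
    """)
    for row in cur.fetchall():
        print(row)
```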

BigQuery - unable to submit queries via batch API

Our application batches queries and submits via BigQuery's batch API. We've submitted several batches of queries whose jobs have been stuck in a "running" state for over an hour now. All systems are green according to status.cloud.google.com but that does not appear to be the case for us.
Anyone else experiencing similar behavior? FWIW - query submission via the BQ web UI is no longer working for us either due to exceeding concurrent rate limits (from aforementioned stuck jobs), so something is woefully wrong...
You are submitting your queries via the batch API just fine. It looks like you are doing this very quickly and with computationally expensive queries, so they all compete with each other and slow down.
It looks like you submitted about 200 jobs at approximately the same time on the 18th (a few times), and about 25k jobs on the 17th.
These were all submitted at interactive query priority, and almost all of them immediately failed with rate limit exceeded errors, leaving the max concurrent quota limit of about 50 queries running from each set of queries you submitted.
Spot checking a few of these queries: these are computationally expensive queries. Take a look at your query's Billing Tier (https://cloud.google.com/bigquery/pricing#high-compute), which can be found in the jobs.get output here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#statistics.query.billingTier. These queries also appear to be re-computing the same (or at least very similar) intermediate join results.
When you run 50 large queries simultaneously, they will compete with each other for resources and slow down.
There are several issues you may want to look into:
You are submitting lots of queries at interactive query priority, which has a pretty strict concurrent rate limit. If you want to run many queries simultaneously, try using batch query priority (see the sketch at the end of this answer). https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.query.priority
Your query mix looks like it can be optimized. Can you materialize some intermediate results that are common across all your queries with one join operation, and then run lots of smaller queries against those results?
If you need to run many computationally expensive queries quickly:
You may want to purchase additional slots to increase your query throughput. See https://cloud.google.com/bigquery/pricing#slots.
You may want to rate limit yourself on the client side to prevent your computationally expensive queries from competing with each other. Consider running only a few queries at a time. Your overall throughput will likely be faster.
You are using the batch insert API. This makes it very efficient to insert multiple queries with one HTTP request. I find that the HTTP connection is rarely the cause of latency with large scale data analysis, so to keep client code simple I prefer to use the regular jobs.insert API and insert jobs one at a time. (This becomes more important when you want to deal with error cases, as doing that with batched inserts is difficult.)
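As a concrete example of the batch priority suggestion above, with the google-cloud-bigquery Python client it is a one-line job-config change, and the billing tier mentioned earlier is also available on the finished job (the query here is a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Batch priority: BigQuery queues the job and starts it when idle resources are
# available, so it doesn't count against the interactive concurrent rate limit.
cfg = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)
job = client.query(
    "SELECT COUNT(*) FROM `my_project.my_dataset.my_table`",
    job_config=cfg,
)
rows = job.result()  # blocks until the batch job eventually runs and finishes

print("billing tier:", job.billing_tier)
print(list(rows))
```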

Dataflow to BigQuery quota

I found a couple related questions, but no definitive answer from the Google team, for this particular question:
Is a Cloud DataFlow job, writing to BigQuery, limited to the BigQuery quota of 100K rows-per-second-per-table (i.e. BQ streaming limit)?
google dataflow write to bigquery table performance
Cloud DataFlow performance - are our times to be expected?
Edit:
The main motivation is to find a way to predict runtimes for various input sizes.
I've managed to run jobs which show > 180K rows/sec processed via the Dataflow monitoring UI. But I'm unsure if this is somehow throttled on the insert into the table, since the job runtime was slower by about 2x than a naive calculation (500mm rows / 180k rows/sec = 45 mins, which actually took almost 2 hrs)
From your message, it sounds like you are executing your pipeline in batch, not streaming, mode.
In Batch mode, jobs run on the Google Cloud Dataflow service do not use BigQuery's streaming writes. Instead, we write all the rows to be imported to files on GCS and then invoke a BigQuery load job. Note that this reduces your costs (load jobs are cheaper than streaming writes) and is more efficient overall (BigQuery can be faster doing a bulk load than doing per-row imports). The tradeoff is that no results are available in BigQuery until the entire job finishes successfully.
Load jobs are not limited to a certain number of rows per second; rather, they are limited by the daily quotas.
In Streaming mode, Dataflow does indeed use BigQuery's streaming writes. In that case, the 100,000 rows per second limit does apply. If you exceed that limit, Dataflow will get a quota_exceeded error and will then retry the failing inserts. This behavior will help smooth out short-term spikes that temporarily exceed BigQuery's quota; if your pipeline exceeds quota for a long period of time, this fail-and-retry policy will eventually act as a form of backpressure that slows your pipeline down.
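For reference, in the current Apache Beam Python SDK you can see (and force) which of the two write paths is used; in batch pipelines file loads are the default. A sketch with made-up table, schema, and bucket names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    rows = p | beam.Create([
        {"user_id": 1, "amount": 10.0},
        {"user_id": 2, "amount": 20.0},
    ])

    rows | beam.io.WriteToBigQuery(
        table="my_project:my_dataset.my_table",
        schema="user_id:INTEGER,amount:FLOAT",
        # FILE_LOADS stages files on GCS and runs a BigQuery load job (no
        # 100k rows/sec cap, but daily load quotas apply). STREAMING_INSERTS
        # uses the streaming API and is subject to the streaming limits.
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        custom_gcs_temp_location="gs://my-temp-bucket/bq-loads",
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
```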
--
As for why your job took 2 hours instead of 45 minutes, your job will have multiple stages that proceed serially, and so using the throughput of the fastest stage is not an accurate way to estimate end-to-end runtime. For example, the BigQuery load job is not initiated until after Dataflow finishes writing all rows to GCS. Your rates seem reasonable, but please follow up if you suspect a performance degradation.