Our application batches queries and submits via BigQuery's batch API. We've submitted several batches of queries whose jobs have been stuck in a "running" state for over an hour now. All systems are green according to status.cloud.google.com but that does not appear to be the case for us.
Anyone else experiencing similar behavior? FWIW - query submission via the BQ web UI is no longer working for us either due to exceeding concurrent rate limits (from aforementioned stuck jobs), so something is woefully wrong...
You are submitting your queries via the batch API just fine. It looks like you are doing this very quickly and with computationally expensive queries, so they all compete with each other and slow down.
It looks like you submitted about 200 jobs at approximately the same time on the 18th (a few times), and about 25k jobs on the 17th.
These were all submitted at interactive query priority, and almost all of them immediately failed with rate-limit-exceeded errors, leaving about 50 queries running from each set you submitted, which is the concurrent query quota limit.
Spot checking a few of these queries: these are computationally expensive queries. Take a look at your query's Billing Tier (https://cloud.google.com/bigquery/pricing#high-compute), which can be found in the jobs.get output here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#statistics.query.billingTier. These queries also appear to be re-computing the same (or at least very similar) intermediate join results.
When you run 50 large queries simultaneously, they will compete with each other for resources and slow down.
There are several issues you may want to look into:
You are submitting lots of queries at interactive query priority, which has a pretty strict concurrent rate limit. If you want to run many queries simultaneously, try using batch query priority. https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.query.priority
Your query mix looks like it can be optimized. Can you materialize some intermediate results that are common across all your queries with one join operation, and then run lots of smaller queries against those results?
If you need to run many computationally expensive queries quickly:
You may want to purchase additional slots to increase your query throughput. See https://cloud.google.com/bigquery/pricing#slots.
You may want to rate limit yourself on the client side to prevent your computationally expensive queries from competing with each other. Consider running only a few queries at a time; your overall throughput will likely be higher.
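As a sketch of that client-side rate limiting (assuming a `run_one` callable that wraps job submission and polling for completion; none of these names come from the BigQuery client libraries):

```python
from concurrent.futures import ThreadPoolExecutor

def run_rate_limited(queries, run_one, max_concurrent=4):
    """Run run_one(q) for each query with at most max_concurrent in flight.

    run_one is a caller-supplied callable that would wrap jobs.insert plus
    polling for job completion; it is a placeholder here. Results come back
    in the same order as the input queries.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_one, queries))
```

Capping `max_workers` well below the concurrent query quota keeps the expensive queries from starving each other.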
You are using the batch insert API. This makes it very efficient to insert multiple queries with one HTTP request. I find that the HTTP connection is rarely the cause of latency with large scale data analysis, so to keep client code simple I prefer to use the regular jobs.insert API and insert jobs one at a time. (This becomes more important when you want to deal with error cases, as doing that with batched inserts is difficult.)
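For error handling with one-at-a-time jobs.insert, something like the following sketch; `insert_job` is a stand-in for `service.jobs().insert(...).execute()`, and the catch-all retry is illustrative only (real code would inspect the error reason and retry only transient failures):

```python
import time

def insert_with_retry(insert_job, body, attempts=3, backoff=1.0):
    """Insert one job, retrying on failure with exponential backoff.

    insert_job is assumed to wrap service.jobs().insert(...).execute().
    Catching bare Exception is a simplification for the sketch.
    """
    for attempt in range(attempts):
        try:
            return insert_job(body)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))
```

With batched inserts, a per-job retry policy like this is much harder to express, which is the simplicity argument above.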
Related
If I have a BigQuery dataset with data that I would like to make available to 1000 people (where each of these people would only be allowed to view their subset of the data, and is OK to view a 24hr stale version of their data), how can I do this without exceeding the 50 concurrent queries limit?
In the BigQuery documentation there's mention of 50 concurrent queries being permitted, which give on-the-spot accurate data. I would surpass that limit if all of them needed to view on-the-spot accurate data, which they don't.
In the documentation there is also mention of batch jobs being permitted, and of saving results into destination tables, which I'm hoping could give a reliable solution for my scenario. But I'm having difficulty finding information on how reliably or frequently those batch jobs can be expected to run, and on whether someone querying results that exist in those destination tables counts towards the 50-concurrent-queries limit.
Any advice appreciated.
Without knowing the specifics of your situation and depending on how much data is in the output, I would suggest putting your own cache in front of BigQuery.
This sounds kind of like a dashboarding/reporting solution, so I assume there is a large amount of data going in and a relatively small amount coming out (per user).
Run one query per day with a batch script to generate your output (grouped by user) and then export it to GCS. You can then break it up into multiple flat files (or just read it into memory on your frontend). Each user hits your frontend, you determine which part of the output to serve up to them and respond.
This should be relatively cheap if you can work off the cached data and it is small enough that handling the BigQuery output isn't too much additional processing.
Google Cloud Functions might be an easy way to handle this, if you don't want the extra work of setting up a new VM to host your frontend.
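The per-user lookup in the approach above can be sketched as a small helper; the bucket layout, shard count, and file naming here are purely illustrative assumptions, not anything from your setup:

```python
import hashlib

def user_shard_path(bucket, user_id, num_shards=100):
    """Map a user to the GCS flat file holding their slice of the daily export.

    Assumes the daily batch query wrote its output sharded by a stable hash
    of the user id. A stable hash (not Python's built-in hash(), which is
    randomized per process for strings) keeps the mapping consistent between
    the export job and the frontend.
    """
    shard = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % num_shards
    return f"gs://{bucket}/daily-export/shard-{shard:03d}.json"
```

The frontend then reads only the one shard for the requesting user, and BigQuery is never queried on the request path.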
BigQuery is fast at processing large sets of data; however, retrieving large results from BigQuery is not fast at all.
For example, I ran a query that returned 211,136 rows over three HTTP requests, taking just over 12 seconds in total.
The query itself was returned from cache, so no time was spent executing the query. The host server is Amazon m4.xlarge running in US-East (Virginia).
In production I've seen this process take ~90 seconds when returning ~1M rows. Obviously some of this could be down to network traffic... but it seems too slow for that to be the only cause (those 211,136 rows were only ~1.7MB).
Has anyone else encountered such slow speed when having results returned, and have found a resolution?
Update: Reran the test on a VM inside Google Cloud with very similar results, ruling out network issues between Google and AWS.
Our SLO on this API is 32 seconds, and a call taking 12 seconds is normal. 90 seconds sounds too long; it must be hitting some of our system's tail latency.
I understand that it is embarrassingly slow. There are multiple reasons for it, and we are working on improving the latency of this API. By the end of Q1 next year, we should be able to roll out a change that will cut tabledata.list time in half (by upgrading the API frontend to our new One Platform technology). If we have more resources, we will also make jobs.getQueryResults faster.
Concurrent Requests using TableData.List
It's not great, but there is a resolution.
Make a query, and set the max rows to 1000. If there is no page token simply return the results.
If there is a page token, then disregard the results*, and use the TableData.List API. However, rather than sending one request at a time, send a request for every 10,000 records* in the result. To do this, use the 'MaxResults' and 'StartIndex' fields. (Note that even these smaller pages may be broken into multiple requests*, so paging logic is still needed.)
This concurrency (and smaller pages) leads to significant reductions in retrieval times. Not as good as BigQ simply streaming all results, but enough to start realizing the gains from using BigQ.
Potential pitfalls: Keep an eye on the request count, as with larger result sets there could be 100 req/s throttling. It's also worth noting that there's no guarantee of ordering, so using the StartIndex field as pseudo-paging may not always return correct results*.
* Anything with a single asterisk is still an educated guess, not confirmed as true/best practice.
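The page fan-out described above can be sketched as a helper that computes the (StartIndex, MaxResults) pairs, assuming the total row count is known from the initial query response; actually issuing the concurrent requests is left to your HTTP client:

```python
def list_pages(total_rows, page_size=10_000):
    """Split a result set into (startIndex, maxResults) pairs so that one
    TableData.List request can be issued per page, concurrently.

    The 10,000-row page size mirrors the suggestion above; tune it for your
    row width and throttling limits. Each page may still come back split
    across multiple responses, so per-page pageToken handling is needed too.
    """
    return [(start, min(page_size, total_rows - start))
            for start in range(0, total_rows, page_size)]
```

For a 25,000-row result this yields three pages, which can be fetched in parallel and reassembled by start index.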
I am just starting with BigQuery. My DB is small (10K rows in one table) and my queries are simple counts and group-bys.
It takes an average of 3-4 sec per request, but sometimes it jumps to 10 and even 15 sec.
I am querying from an Amazon Linux server in Ireland using the bq tool.
Is it possible to get results faster (under 1 sec) so I can present my web pages faster?
1) BigQuery is a highly scalable database before being a "super fast" database. It's designed to process HUGE amounts of data, distributing the processing among several different machines using a technique named Dremel. Because it's designed to use several machines and parallel processing, you should expect super-scalability with good performance.
2) BigQuery is an asset when you want to analyze billions of rows.
For example: analyzing all the Wikipedia revisions in 5-10 seconds isn't bad, is it? But even a much smaller table would take about the same time, even if it has only 10k rows.
3) Under this size, you'll be better off using more traditional data storage solutions such as Cloud SQL or the App Engine Datastore. If you want to keep SQL capability, Cloud SQL is the best guess.
Sybase IQ is often installed as a single-node database and it doesn't use Dremel. That said, it's going to be faster than BigQuery in many scenarios... as designed.
4) Certainly the performance differs in a dedicated environment. You can get your own dedicated environment for $20K a month.
That's the expected behaviour. In BigQuery you are using a shared infrastructure, so depending on the load at the moment you will get better or worse response times. Batch queries (those not needing interactivity) are actually encouraged and rewarded by not counting against your quota.
You typically don't use BigQuery as your main database to show data in your web application. Depending on what you want to do, BigQuery can be a Big Data storage and you should have another intermediate store where you could store computed results to display to your users. Or maybe in your use case you don't really need BigQuery and there is a better solution.
In any case, you are not going to be able to avoid a few seconds' wait (even if you go Premium, you get more guarantees about the service, but in no case a service fast enough to be the main backend for a webapp).
The incentive to use batch queries instead of interactive mode queries was pricing, but with newer price changes there is no cost difference anymore - so is there any other incentive (quota, performance, other...) to use batch queries?
With the price change, there are two primary reasons to use batch priority:
it lets you queue up your jobs.
it lets you run low priority queries in a way that doesn't impact high priority ones.
There are a number of rate limits that affect interactive (i.e. non-batch) queries -- you can have at most 20 running concurrently, there are concurrent byte limits and 'large query' limits. If those limits are hit, the query will fail immediately. This is because BigQuery assumes that an interactive query is something you need run immediately.
When you use batch, if you ever hit a rate limit, the query will be queued and retried later. There are still similar rate limits, but they operate separately from interactive rate limits, so your batch queries won't affect your interactive ones.
One example might be that you have periodic queries that you run daily or hourly to build dashboards. Maybe you have 100 queries that you want to run. If you try to run them all at once as interactive, some will fail because of concurrent rate limits. Additionally, you don't necessarily want these queries to interfere with other queries you are running manually from the BigQuery Web UI. So you can run the dashboard queries at batch priority and the other queries will run normally as interactive.
One other point to note is that the scheduling for Batch queries has changed so the average wait times should come down considerably. Instead of waiting a half hour or so, batch queries should start within a minute or two (subject to queueing, etc).
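For reference, the only change needed to run at batch priority is the priority field in the job configuration. A minimal sketch against the v2 jobs.insert API; the `service` client object and PROJECT_ID are placeholders:

```python
def batch_query_body(sql):
    """Build a jobs.insert request body that runs sql at batch priority.

    "priority" lives under configuration.query in the v2 API; the default
    is "INTERACTIVE".
    """
    return {
        "configuration": {
            "query": {
                "query": sql,
                "priority": "BATCH",
            }
        }
    }

# With a google-api-python-client BigQuery service object, this would be
# submitted as:
#   service.jobs().insert(projectId=PROJECT_ID,
#                         body=batch_query_body(sql)).execute()
```

Everything else about the job (polling for completion, reading results) is unchanged; only the scheduling behavior differs.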
Currently, as a single user, a certain query takes 260 ms to run from start to finish.
What will happen if 1000 queries are sent at the same time? Should I expect the same query to take ~4 minutes (260 ms × 1000)?
It is not possible to make predictions without any knowledge of the situation. There will be a number of factors which affect this time:
Resources available to the server (if it is able to hold data in memory, things run quicker than if disk is being accessed)
What is involved in the query (e.g. a repeated query will usually execute quicker the second time around, assuming the underlying data has not changed)
What other bottlenecks are in the system (e.g. if the webserver and database server are on the same system, the two processes will be fighting for available resource under heavy load)
The only way to properly answer this question is to perform load testing on your application.