When should I prefer batch analysis over interactive analysis?

The incentive to use batch queries instead of interactive-mode queries used to be pricing, but with the newer price changes there is no cost difference anymore. So is there any other incentive (quota, performance, other...) to use batch queries?

With the price change, there are two primary reasons to use batch priority:
- It lets you queue up your jobs.
- It lets you run low-priority queries in a way that doesn't impact high-priority ones.
There are a number of rate limits that affect interactive (i.e. non-batch) queries: you can have at most 20 running concurrently, and there are also concurrent byte limits and 'large query' limits. If those limits are hit, the query will fail immediately, because BigQuery assumes that an interactive query is something you need to run immediately.
When you use batch, if you ever hit a rate limit, the query will be queued and retried later. There are still similar rate limits, but they operate separately from interactive rate limits, so your batch queries won't affect your interactive ones.
One example might be that you have periodic queries that you run daily or hourly to build dashboards. Maybe you have 100 queries that you want to run. If you try to run them all at once as interactive, some will fail because of concurrent rate limits. Additionally, you don't necessarily want these queries to interfere with other queries you are running manually from the BigQuery Web UI. So you can run the dashboard queries at batch priority and the other queries will run normally as interactive.
One other point to note is that the scheduling for Batch queries has changed so the average wait times should come down considerably. Instead of waiting a half hour or so, batch queries should start within a minute or two (subject to queueing, etc).
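For reference, here is a minimal sketch of submitting a job at batch priority, assuming the google-cloud-bigquery Python client (the query and table name are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# BATCH priority: if the job can't be scheduled right away, it queues and
# retries instead of failing against the interactive concurrency limits.
job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

job = client.query(
    "SELECT COUNT(*) FROM `my_dataset.my_table`",  # placeholder query
    job_config=job_config,
)
for row in job.result():  # blocks until the queued job completes
    print(row)
```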


DynamoDB large transaction writes are timing out

I have a service that receives events that vary in size from ~5 - 10k items. We split these events up into chunks and these chunks need to be written in transactions because we do some post-processing that depends on a successful write of all the items in the chunk. Ordering of the events is important so we can't Dead Letter them to process at a later time. We're running into an issue where we receive very large (10k) events and they're clogging up the event processor causing a timeout (currently set to 15s). I'm trying to find a way to increase the processing speed of these large events to eliminate timeouts.
I'm open to ideas, but curious: are there any pitfalls of running transactional writes concurrently? E.g. splitting the event into chunks of 100 and having X threads run through them to write to DynamoDB concurrently.
There is no concern with multi-threading writes to DynamoDB, so long as you have the capacity to handle the extra throughput.
I would also advise trying smaller batches: with 100 items in a transaction, if one happens to fail for any reason then they all fail. Typically I suggest aiming for batch sizes of approximately 10, but of course this depends on your use case.
Also ensure that no threads are targeting the same item at the same time, as this would result in conflicting writes causing large amounts of failed batches.
In summary: keep batches as small as possible, ensure your table has adequate capacity, and ensure you don't hit the same items concurrently.
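A sketch of the concurrent approach under those constraints, using boto3 (the table name, key layout, chunk size, and thread count are all illustrative assumptions, not prescriptions):

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

dynamodb = boto3.client("dynamodb")

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def write_chunk(chunk):
    # All-or-nothing: if any item in the transaction fails, they all fail,
    # which is why smaller chunks waste less work on a failure.
    dynamodb.transact_write_items(
        TransactItems=[
            {"Put": {"TableName": "events", "Item": item}} for item in chunk
        ]
    )

def write_event(items, chunk_size=10, workers=8):
    # Chunks must not touch the same item key concurrently, or the
    # transactions will conflict and fail.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(write_chunk, c) for c in chunks(items, chunk_size)]
        for f in futures:
            f.result()  # surfaces failures, e.g. TransactionCanceledException
```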

Why Big Query always has waiting time, is it possible to execute without waiting time?

Whenever I execute a query in BigQuery, I can see in the Explanation tab that the wait time is always "average". Is it possible to execute without wait time, or at least to reduce it?
(Image: the query explanation tab, where the BigQuery wait time is shown as average.)
Why does Big Query always have waiting time?
You might have more work than can be immediately scheduled.
Is it possible to execute without waiting time?
Purchase more BigQuery Slots.
Contact your sales representative for more information or support.
Otherwise, just wait. If your job isn’t time-critical, you can schedule it and wait for it to be executed as resources permit.
BigQuery has a complex scheduling system. Besides, there are usually a couple of stages even for a simple query, and for each stage the scheduler needs to find the best shard (node) to execute the computation.
It takes some time for the scheduler to send the job to the shards, but it should usually not be significantly long.
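Those per-stage wait statistics can also be read programmatically. A minimal sketch, assuming the google-cloud-bigquery Python client (the query is a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query(
    "SELECT word, COUNT(*) AS n "
    "FROM `bigquery-public-data.samples.shakespeare` GROUP BY word"
)
job.result()  # wait for completion so the query plan is populated

# Each plan entry carries the same wait statistics the Explanation tab shows.
for stage in job.query_plan:
    print(stage.name, "avg wait (ms):", stage.wait_ms_avg,
          "max wait (ms):", stage.wait_ms_max)
```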

BigQuery - unable to submit queries via batch API

Our application batches queries and submits via BigQuery's batch API. We've submitted several batches of queries whose jobs have been stuck in a "running" state for over an hour now. All systems are green according to status.cloud.google.com but that does not appear to be the case for us.
Anyone else experiencing similar behavior? FWIW - query submission via the BQ web UI is no longer working for us either due to exceeding concurrent rate limits (from aforementioned stuck jobs), so something is woefully wrong...
You are submitting your queries via the batch API just fine. It looks like you are doing this very quickly and with computationally expensive queries, so they all compete with each other and slow down.
It looks like you submitted about 200 jobs at approximately the same time on the 18th (a few times), and about 25k jobs on the 17th.
These were all submitted at interactive query priority, and almost all of them immediately failed with rate-limit-exceeded errors, leaving roughly the maximum concurrent quota of about 50 queries running from each set you submitted.
Spot checking a few of these queries: these are computationally expensive queries. Take a look at your query's Billing Tier (https://cloud.google.com/bigquery/pricing#high-compute), which can be found in the jobs.get output here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#statistics.query.billingTier. These queries also appear to be re-computing the same (or at least very similar) intermediate join results.
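A minimal sketch of reading that billing tier programmatically, assuming the google-cloud-bigquery Python client (the job ID is a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("job_abc123")  # hypothetical job ID
print(job.billing_tier)  # surfaces statistics.query.billingTier
```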
When you run 50 large queries simultaneously, they will compete with each other for resources and slow down.
There are several issues you may want to look into:
1. You are submitting lots of queries at interactive query priority, which has a pretty strict concurrent rate limit. If you want to run many queries simultaneously, try using batch query priority. https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.query.priority
2. Your query mix looks like it can be optimized. Can you materialize some intermediate results that are common across all your queries with one join operation, and then run lots of smaller queries against those results?
3. If you need to run many computationally expensive queries quickly:
- You may want to purchase additional slots to increase your query throughput. See https://cloud.google.com/bigquery/pricing#slots.
- You may want to rate limit yourself on the client side to prevent your computationally expensive queries from competing with each other. Consider running only a few queries at a time; your overall throughput will likely be faster.
You are using the batch insert API. This makes it very efficient to insert multiple queries with one HTTP request. I find that the HTTP connection is rarely the cause of latency with large scale data analysis, so to keep client code simple I prefer to use the regular jobs.insert API and insert jobs one at a time. (This becomes more important when you want to deal with error cases, as doing that with batched inserts is difficult.)
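Putting the batch-priority and client-side rate-limiting suggestions together, a minimal sketch assuming the google-cloud-bigquery Python client (the queries and pool size are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client = bigquery.Client()
batch_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

def run_one(sql):
    # One jobs.insert per query keeps per-job error handling simple.
    job = client.query(sql, job_config=batch_config)
    return list(job.result())  # blocks; failures surface per job

queries = ["SELECT 1", "SELECT 2"]  # placeholder query mix

# Cap client-side concurrency so expensive queries don't all compete at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_one, queries))
```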

Dataflow to BigQuery quota

I found a couple related questions, but no definitive answer from the Google team, for this particular question:
Is a Cloud DataFlow job, writing to BigQuery, limited to the BigQuery quota of 100K rows-per-second-per-table (i.e. BQ streaming limit)?
google dataflow write to bigquery table performance
Cloud DataFlow performance - are our times to be expected?
Edit:
The main motivation is to find a way to predict runtimes for various input sizes.
I've managed to run jobs which show > 180K rows/sec processed via the Dataflow monitoring UI. But I'm unsure if this is somehow throttled on the insert into the table, since the job runtime was about 2x slower than a naive calculation would suggest (500M rows / 180K rows/sec ≈ 45 mins, while the job actually took almost 2 hrs).
From your message, it sounds like you are executing your pipeline in batch, not streaming, mode.
In Batch mode, jobs run on the Google Cloud Dataflow service do not use BigQuery's streaming writes. Instead, we write all the rows to be imported to files on GCS, and then invoke a BigQuery "load" job. Note that this reduces your costs (load jobs are cheaper than streaming writes) and is more efficient overall (BigQuery can be faster doing a bulk load than doing per-row imports). The tradeoff is that no results are available in BigQuery until the entire job finishes successfully.
Load jobs are not limited to a certain number of rows per second; rather, they are limited by the daily quotas.
In Streaming mode, Dataflow does indeed use BigQuery's streaming writes. In that case, the 100,000 rows per second limit does apply. If you exceed that limit, Dataflow will get a quota_exceeded error and will then retry the failing inserts. This behavior will help smooth out short-term spikes that temporarily exceed BigQuery's quota; if your pipeline exceeds quota for a long period of time, this fail-and-retry policy will eventually act as a form of backpressure that slows your pipeline down.
--
As for why your job took 2 hours instead of 45 minutes, your job will have multiple stages that proceed serially, and so using the throughput of the fastest stage is not an accurate way to estimate end-to-end runtime. For example, the BigQuery load job is not initiated until after Dataflow finishes writing all rows to GCS. Your rates seem reasonable, but please follow up if you suspect a performance degradation.
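For illustration, the two write paths look like this in the Apache Beam Python SDK (the answer predates Beam's current API, and all names here are placeholders):

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery

with beam.Pipeline() as p:
    rows = p | beam.Create([{"id": 1}, {"id": 2}])
    rows | WriteToBigQuery(
        "my-project:my_dataset.my_table",
        schema="id:INTEGER",
        # Batch path: stage rows as files on GCS, then run a BigQuery load job
        # (daily load quotas apply, not a rows-per-second limit).
        method=WriteToBigQuery.Method.FILE_LOADS,
        custom_gcs_temp_location="gs://my-bucket/tmp",  # placeholder bucket
        # Streaming path, subject to the rows-per-second quota, would be:
        # method=WriteToBigQuery.Method.STREAMING_INSERTS,
    )
```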

What data load on my DB should I expect if I get more users?

Currently, as a single user, it takes 260 ms for a certain query to run from start to finish.
What will happen if I have 1000 queries sent at the same time? Should I expect the same query to take ~4 minutes (260 ms × 1000 = 260 s)?
It is not possible to make predictions without any knowledge of the situation. There will be a number of factors which affect this time:
- Resources available to the server (if it is able to hold data in memory, things run quicker than if disk is being accessed)
- What is involved in the query (e.g. a repeated query will usually execute quicker the second time around, assuming the underlying data has not changed)
- What other bottlenecks are in the system (e.g. if the webserver and database server are on the same system, the two processes will be fighting for available resources under heavy load)
The only way to properly answer this question is to perform load testing on your application.
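As a very rough sketch of what that load test could look like (the endpoint URL is hypothetical; purpose-built tools such as JMeter, Locust, or k6 are better suited for serious testing):

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/report"  # hypothetical endpoint that runs the query

def one_request(_):
    start = time.perf_counter()
    urllib.request.urlopen(URL).read()
    return time.perf_counter() - start

# Fire 1000 requests with up to 100 in flight at once.
with ThreadPoolExecutor(max_workers=100) as pool:
    latencies = sorted(pool.map(one_request, range(1000)))

# Latency under load rarely scales linearly; look at the tail, not the mean.
print("p50:", latencies[len(latencies) // 2])
print("p99:", latencies[int(len(latencies) * 0.99)])
```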