Do Dataflow jobs hit any BigQuery quotas and limits? - google-bigquery

I have around 1,500 jobs to implement using Dataflow, scheduled to run on a daily basis. We may end up issuing a large number of DML statements via the BigQuery client library within our jobs. Below are my concerns regarding BigQuery quotas and limits.
Reference: https://cloud.google.com/bigquery/quotas
Please confirm whether we need to take BigQuery's daily usage limits into consideration in either of the scenarios below.
If we implement data inserts using BigQueryIO.write()
If we use DML statements (UPDATE/DELETE) via the BigQuery client library within the Dataflow job
Please suggest.

You absolutely do need to take BigQuery quotas and limits into consideration - even when hooking into it from Dataflow.
Dataflow is just calling the BigQuery API on your behalf. Therefore, all quotas and limits still apply as if you were calling it directly yourself.
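If it helps to make the DML case concrete, here is a minimal sketch (not from the original question) of issuing an UPDATE through the BigQuery client library from inside a Beam DoFn; the table, column, and parameter names are placeholders. Each call like this creates a BigQuery query job, so it counts against the same DML and query quotas as a statement you run yourself:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch only: a DoFn that issues an UPDATE through the BigQuery client library.
// Every element processed here triggers a BigQuery query job, so the usual
// DML / query quotas apply exactly as if you ran the statement yourself.
public class UpdateStatusFn extends DoFn<String, Void> {

  private transient BigQuery bigquery;

  @Setup
  public void setup() {
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws InterruptedException {
    String recordId = c.element();
    QueryJobConfiguration dml =
        QueryJobConfiguration.newBuilder(
                "UPDATE `my_project.my_dataset.my_table` "   // placeholder table
                    + "SET status = 'PROCESSED' "
                    + "WHERE id = @id")
            .addNamedParameter("id", QueryParameterValue.string(recordId))
            .build();
    bigquery.query(dml); // blocks until the DML job completes
  }
}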

Related

PubSub topic with binary data to BigQuery

I'm expected to have thousands of sensors sending telemetry data at 10FPS with around 1KB of binary data per frame, using IOT Core, meaning I'll get it via PubSub. I'd like to get that data to BigQuery, and no processing is needed.
As Dataflow doesn't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to avoid it and go fully serverless.
Question is, what's my best alternative?
I've thought about a Cloud Run service running an express app to accept the data from PubSub, using a global variable to accumulate around 500 rows in RAM, then dumping them using BigQuery's insert() method (Node.js client).
How reasonable is that? Will I gain something from accumulation, or should I just insert to bigquery every single incoming row?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method as you have already mentioned. The insert() method streams each row individually, regardless of how many rows you accumulate before calling it.
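The question uses the Node.js client, but to keep the code in this thread in a single language, here is a minimal sketch of the equivalent legacy streaming insert using the Java client library; the dataset, table, and field names are placeholders:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;

public class StreamingInsertExample {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder dataset/table and row contents.
    TableId tableId = TableId.of("my_dataset", "telemetry");
    Map<String, Object> row = Map.of("sensor_id", "sensor-001", "payload", "base64data");

    // Each insertAll call is a streaming insert and counts against streaming quotas.
    InsertAllResponse response =
        bigquery.insertAll(InsertAllRequest.newBuilder(tableId).addRow(row).build());

    if (response.hasErrors()) {
      // Per-row errors are keyed by the row's index in the request.
      response.getInsertErrors().forEach((index, errors) ->
          System.err.println("Row " + index + " failed: " + errors));
    }
  }
}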
For new projects, the BigQuery Storage Write API is recommended, as it is cheaper and has a richer feature set than the legacy API. The BigQuery Storage Write API currently only has Java, Python, and Go (in preview) client libraries.
Batch Ingestion
If your requirement is to load large, bounded data sets that don’t have to be processed in real-time, prefer batch loading. BigQuery batch load jobs are free. You only pay for storing and querying the data but not for loading the data. Refer to quotas and limits for batch load jobs here. Some more key points on batch loading jobs have been quoted from this article.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user, BigQuery does not make guarantees on performance and available capacity of this shared pool. This is governed by the fair scheduler allocating resources among load jobs that may be competing with loads from other users or projects. Quotas for load jobs are in place to minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all or none of the data. Queries never scan partial data.
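To make the batch option concrete, here is a minimal sketch of a load job submitted through the BigQuery Java client library (the bucket path and table name are placeholders; bq load achieves the same thing from the command line):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class BatchLoadExample {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder GCS path and destination table.
    String sourceUri = "gs://my-bucket/telemetry/*.json";
    TableId tableId = TableId.of("my_dataset", "telemetry");

    LoadJobConfiguration loadConfig =
        LoadJobConfiguration.newBuilder(tableId, sourceUri)
            .setFormatOptions(FormatOptions.json()) // newline-delimited JSON
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
            .build();

    // Load jobs are free; they count against daily load-job quotas, not query capacity.
    Job job = bigquery.create(JobInfo.of(loadConfig));
    job = job.waitFor();

    if (job.getStatus().getError() != null) {
      System.err.println("Load failed: " + job.getStatus().getError());
    }
  }
}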

Can Bigquery scale to 0 workers?

I am using GCP BigQuery to run SQL queries.
As far as I know, BigQuery's cost is charged by slots (workers). I'd like to know: can BigQuery scale to 0 workers once queries are finished? Is there anything I still have to pay for, like storage or Compute Engine instances?
Thank you.
NO.
BigQuery is a serverless managed database service.
It has no dependencies; it doesn't use Compute Engine instances or workers that you manage.
The slots concept relates to upfront (flat-rate) pricing and has nothing to do with workers as you described.
In BigQuery you pay for:
database storage
streaming inserts
query costs
No.
BigQuery converts query statements into a graph of execution by dividing them into multiple stages, which are themselves composed of more granular sets of execution steps.
BigQuery models the units of work in such a way that many potential workers may execute in parallel, and stages communicate with one another using a distributed shuffle architecture. In the BigQuery execution plan, "workers" are used purely to convey information about parallelism.

ETL on Google Cloud - (Dataflow vs. Spring Batch) -> BigQuery

I am considering BigQuery for my data warehouse requirement. Right now, I have my data in Google Cloud (Cloud SQL and Bigtable). I have exposed REST APIs to retrieve data from both. Now, I would like to retrieve data from these APIs, do the ETL, and load the data into BigQuery. I am evaluating two options for ETL (daily job frequency for hourly data) right now:
Use Java Spring Batch, create a microservice, and use Kubernetes as the deployment environment. Will it scale?
Use Cloud DataFlow for ETL
Then use the BigQuery batch insert API (for the initial load) and the streaming insert API (for incremental loads when new data is available in the source) to load a denormalized BigQuery schema.
Please let me know your opinions.
Without knowing your data volumes, specifically how much new or diff data you have per day and how you are doing paging with your REST APIs - here is my guidance...
If you go down the path of using Spring Batch, you are more than likely going to have to come up with your own sharding mechanism: how will you divide up REST calls to instantiate your Spring services? You will also be in the Kubernetes management space and will have to handle retries with the streaming API to BQ.
If you go down the Dataflow route you will have to write some transform code to call your REST API and perform the paging to populate your PCollection destined for BQ. With the recent addition of Dataflow templates you could create a pipeline that is triggered every N hours and parameterize your REST call(s) to just pull data ?since=latestCall. From there you could execute BigQuery writes. I recommend doing this in batch mode as 1) it will scale better if you have millions of rows and 2) it will be less cumbersome to manage (during non-active times). A rough sketch of such a templated pipeline follows.
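A minimal sketch, assuming a hypothetical REST endpoint and placeholder table names (none of these identifiers come from the original question):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class RestToBigQueryPipeline {

  // ValueProvider options are what make the pipeline usable as a Dataflow template:
  // "since" is supplied at execution time, e.g. the timestamp of the latest successful run.
  public interface Options extends PipelineOptions {
    ValueProvider<String> getSince();
    void setSince(ValueProvider<String> value);
  }

  // Hypothetical DoFn that pages through a REST API and emits one TableRow per record.
  static class FetchFromRestApi extends DoFn<String, TableRow> {
    private final ValueProvider<String> since;

    FetchFromRestApi(ValueProvider<String> since) {
      this.since = since;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      // Call something like https://my-api.example.com/records?since=<since>&page=<n>
      // and loop over pages, outputting one row per record. HTTP client code omitted.
      TableRow row = new TableRow().set("fetched_since", since.get());
      c.output(row);
    }
  }

  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline p = Pipeline.create(options);

    p.apply("Trigger", Create.of("single-element-trigger"))
        .apply("Fetch from REST API", ParDo.of(new FetchFromRestApi(options.getSince())))
        .apply("Write to BigQuery", BigQueryIO.writeTableRows()
            .to("my_project:my_dataset.my_table")          // placeholder table
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    p.run();
  }
}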
Since Cloud Dataflow has built-in retry logic for BigQuery and provides consistency across all input and output collections, my vote is for Dataflow in this case.
How big are your REST call results in record count?

Dataflow to BigQuery quota

I found a couple related questions, but no definitive answer from the Google team, for this particular question:
Is a Cloud DataFlow job, writing to BigQuery, limited to the BigQuery quota of 100K rows-per-second-per-table (i.e. BQ streaming limit)?
google dataflow write to bigquery table performance
Cloud DataFlow performance - are our times to be expected?
Edit:
The main motivation is to find a way to predict runtimes for various input sizes.
I've managed to run jobs which show > 180K rows/sec processed via the Dataflow monitoring UI. But I'm unsure if this is somehow throttled on the insert into the table, since the job runtime was about 2x slower than a naive calculation suggests (500M rows / 180K rows/sec ≈ 45 mins, whereas it actually took almost 2 hrs).
From your message, it sounds like you are executing your pipeline in batch, not streaming, mode.
In batch mode, jobs run on the Google Cloud Dataflow service do not use BigQuery's streaming writes. Instead, we write all the rows to be imported to files on GCS, and then invoke a BigQuery load job. Note that this reduces your costs (load jobs are cheaper than streaming writes) and is more efficient overall (BigQuery can be faster doing a bulk load than doing per-row imports). The tradeoff is that no results are available in BigQuery until the entire job finishes successfully.
Load jobs are not limited to a certain number of rows per second; rather, they are limited by the daily quotas.
In Streaming mode, Dataflow does indeed use BigQuery's streaming writes. In that case, the 100,000 rows per second limit does apply. If you exceed that limit, Dataflow will get a quota_exceeded error and will then retry the failing inserts. This behavior will help smooth out short-term spikes that temporarily exceed BigQuery's quota; if your pipeline exceeds quota for a long period of time, this fail-and-retry policy will eventually act as a form of backpressure that slows your pipeline down.
--
As for why your job took 2 hours instead of 45 minutes, your job will have multiple stages that proceed serially, and so using the throughput of the fastest stage is not an accurate way to estimate end-to-end runtime. For example, the BigQuery load job is not initiated until after Dataflow finishes writing all rows to GCS. Your rates seem reasonable, but please follow up if you suspect a performance degradation.

Inserting into BigQuery via load jobs (not streaming)

I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs - not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job using Google Cloud Storage is a multi-step process - write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and optionally the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?
I'm not sure which version of Apache Beam you are using, but it's now possible to use a micro-batching tactic in a streaming pipeline. Either way, you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
.to(destinationTable(options))
.withMethod(Method.FILE_LOADS)
.withJsonSchema(myTableSchema)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1);
.optimizedWrites());
Things to keep in mind
There are two different methods: FILE_LOADS and STREAMING_INSERTS. If you use the first one, you need to include withTriggeringFrequency and withNumFileShards. For the triggering frequency, in my experience, it's better to use minutes, and the value will depend on your throughput. If you receive quite a lot of data, try to keep it small; I have seen "stuck" errors when you increase it too much. The shards mostly affect your GCS billing: if you add too many shards, it will create more files per table per x amount of minutes.
If your input data size is not that big, streaming inserts can work really well and the cost shouldn't be a big deal. In that scenario you can use the STREAMING_INSERTS method and remove withTriggeringFrequency and withNumFileShards. Also, you can add withFailedInsertRetryPolicy with something like InsertRetryPolicy.retryTransientErrors() so no rows are lost (keep in mind that idempotency is not guaranteed with STREAMING_INSERTS, so duplication is possible).
You can check your jobs in BigQuery and validate that everything is working. Keep in mind BigQuery's job quotas (I think it's 1,000 jobs per table) when you are trying to define the triggering frequency and shards.
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
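For reference, a minimal sketch of the STREAMING_INSERTS variant described above, reusing the placeholders from the earlier snippet and adding the retry policy mentioned:

.apply("Streaming inserts", BigQueryIO.writeTableRows()
    .to(destinationTable(options))
    .withMethod(Method.STREAMING_INSERTS)
    .withJsonSchema(myTableSchema)
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withExtendedErrorInfo()
    // Retry transient failures so rows aren't dropped; duplicates remain possible.
    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));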
BigQueryIO.write() always uses BigQuery load jobs when the input PCollection is bounded. If you'd like it to also use them if it is unbounded, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).