Inserting into BigQuery via load jobs (not streaming) - google-bigquery

I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs - not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job using Google Cloud Storage is a multi step process - write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and optionally the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?

I'm not sure which version of Apache Beam you are using, but now it's possible to use a micro-batching tactic using a Stream Pipeline. If you decide one way or another you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
.to(destinationTable(options))
.withMethod(Method.FILE_LOADS)
.withJsonSchema(myTableSchema)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1);
.optimizedWrites());
Things to keep in mind
There are 2 different methods: FILE_LOADS and STREAMING_INSERT, if you use the first one you need to include the withTriggeringFrequency and withNumFileShards. For the first one, from my experience, is better to use minutes and the number will depend on the amount of throughput data. If you receive quite a lot try to keep it small, I have seen "stuck errors" when you increase it too much. The shards can affect mostly your GCS billing, if you add to much shards it will create more files per table per x amount of minutes.
If your input data size is not so big the streaming insert can work really well and the cost shouldn't be a big deal. In that scenario you can use STREAMING_INSERT method and remove the withTriggeringFrequency and withNumFileShards. Also, you can add withFailedInsertRetryPolicy like InsertRetryPolicy.retryTransientErrors() so no rows are being lost (keep in mind that idempotency is not guaranteed with STREAM_INSERTS, so duplication is possible)
You can check your Jobs in BigQuery and validate that everything is working! Keep in mind the policies for jobs with BigQuery (I think is 1000 jobs per table) when you are trying to define triggering frequency and shards.
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow

BigQueryIO.write() always uses BigQuery load jobs when the input PCollection is bounded. If you'd like it to also use them if it is unbounded, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).

Related

Avoid session shutdown on BigQuery Storage API with Dataflow

I am implementing an ETL job that migrates a non partitioned BigQuery Table to a partitioned one.
To do so I use the Storage API from BigQuery. This creates a number of sessions to pull Data from.
In order to route the BigQuery writes to the right partition I use the File Loads methods.
Streaming inserts was not the option due to the limitation of 30 days.
Storage Write API seems to be limited identifying the partition.
By residing to the File Load Method the Data are being written to GCS.
The issue is that this takes too much time and there is the risk of the sessions to close.
Behind the scenes the File Load Method is a complex one with multiple steps. For example writings to GCS and combining the entries to a destination/partition joined file.
Based on the Dataflow processes it seems that nodes can execute workloads on different parts of the pipeline.
How can I avoid the risk of the session closing? Is there a way for my Dataflow nodes to focus only on the critical part which is write to GCS first and once this is done, then focus on all the other aspects?
You can do a Reshuffle right before applying the write to BigQuery. In Dataflow, that will create a checkpoint, and a new stage in the job. The write to BigQuery would start when all steps previous to the reshuffle have finished, and in case of errors and retries, the job would backtrack to that checkpoint.
Please note that doing a reshuffle implies doing a shuffling of data, so there will be a performance impact.

PubSub topic with binary data to BigQuery

I'm expected to have thousands of sensors sending telemetry data at 10FPS with around 1KB of binary data per frame, using IOT Core, meaning I'll get it via PubSub. I'd like to get that data to BigQuery, and no processing is needed.
As Dataflow don't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to try to avoid it and go full serverless.
Question is, what's my best alternative?
I've thought about Cloud Run service running an express app to accept the data from PubSub, and using global variable to accumulate around 500 rows in ram, then dump it using BigQuery's insert() method (NodeJS client).
How reasonable is that? Will I gain something from accumulation, or should I just insert to bigquery every single incoming row?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method as you have already mentioned. The insert() method streams one row at a time irrespective of accumulation of rows.
For new projects, the BigQuery Storage Write API is recommended as it is cheaper and has an enriched feature set than the legacy API does. The BigQuery Storage Write API only supports Java, Python and Go(in preview) client libraries currently.
Batch Ingestion
If your requirement is to load large, bounded data sets that don’t have to be processed in real-time, prefer batch loading. BigQuery batch load jobs are free. You only pay for storing and querying the data but not for loading the data. Refer to quotas and limits for batch load jobs here. Some more key points on batch loading jobs have been quoted from this article.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user,
BigQuery does not make guarantees on performance and available
capacity of this shared pool. This is governed by the fair scheduler
allocating resources among load jobs that may be competing with loads
from other users or projects. Quotas for load jobs are in place to
minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data
ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all or none of the data .
Queries never scan partial data.

How to batch streaming inserts to BigQuery from a Beam job

I'm writing to BigQuery in a beam job from an unbounded source. I'm using STREAMING INSERTS as the Method. I was looking at how to throttle the rows to BigQuery based on the recommendations in
https://cloud.google.com/bigquery/quotas#streaming_inserts
The BigQueryIO.Write API doesn't provide a way to set the micro batches.
I was looking at using Triggers but not sure if BigQuery groups everything in a pane into a request. I've setup the trigger as below
Window.<Long>into(new GlobalWindows())
.triggering(
Repeatedly.forever(
AfterFirst.of(
AfterPane.elementCountAtLeast(5),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(2)))
))
.discardingFiredPanes());
Q1. Does Beam support micro batches or does it create one request for each element in the PCollection?
Q2. If the above trigger makes sense? Even If I set the window/trigger it could be sending one request for every element.
I don't know what you mean by micro-batch. The way I see it, BigQuery support loading data either as batches, either in streaming.
Basically, batch loads are subject to quotas and streaming loads are a bit more expensive.
Once you set the insertion method for your BigQueryIO the documentation states :
Note: If you use batch loads in a streaming pipeline, you must use withTriggeringFrequency to specify a triggering frequency.
Never tried it, but withTriggeringFrequency seems to be what you need here.

Different ways of updating bigquery table

In gcp, I need to update a bigquery table whenever a file (multiple formats such as json,xml) gets uploaded into a bucket. I have two options but not sure what are the pros/cons of each of them. Can someone suggest which is a better solution and why?
Approach 1 :
File uploaded to bucket --> Trigger Cloud Function (which updates the bigquery table) -->Bigquery
Approach 2:
File uploaded to bucket --> Trigger Cloud Function (which triggers a dataflow job) -->Dataflow-->Bigquery.
In production env, which approach is better suited and why? If there are alternative approaches,pls let me know.
This is quite a broad question, so I wouldn't be surprised if it gets voted to be closed. That said however, I'd always go #2 (GCS -> CF -> Dataflow -> BigQuery).
Remember, with Cloud Funtions there is a max execution time. If you kick off a load job from the Cloud Function, you'll need to bake logic into it to poll and check the status (load jobs in BigQuery are async). If it fails, you'll need to handle it. But, what if it's still running and you hit the max execution of your Cloud Function?
At least by using Dataflow, you don't have the problem of max execution times and you can simply rerun your pipeline if it fails for some transient reason e.g. network issues.

Get/Set BigQuery Job ID while doing BigQueryIO.write()

Is it possible to set BigQuery JobID or to get it while the batch pipeline is running.
I know it's possible using BigQuery API but is it possible if I'm using BigQueryIO from Apache Beam?
I need to send an acknowledgement after writing to BigQuery that the load is complete.
Currently this is not possible. It is complicated by the fact that a single BigQueryIO.write() may use many BigQuery jobs under the hood (i.e. BigQueryIO.write() is a general-purpose API for writing data to BigQuery, rather than an API for working with a single specific BigQuery load job), e.g.:
In case the amount of data to be loaded is larger than the BigQuery limits for a single load job, BigQueryIO.write() will shard it into multiple load jobs.
In case you are using one of the destination-dependent write methods (e.g. DynamicDestinations), and are loading into multiple tables at the same time, there'll be at least 1 load job per table.
In case you are writing an unbounded PCollection using the BATCH_LOADS method, it will periodically issue load jobs for newly arrived data, subject to the notes above.
In case you're using the STREAMING_INSERTS method (it is allowed to use it even if you're writing a bounded PCollection), there will be no load jobs at all.
You will need to use one of the typical workarounds for "doing something after something else is done", which is, e.g. wait until the entire pipeline is done using pipeline.run().waitUntilFinish() in your main program and then do your second action.