How to batch streaming inserts to BigQuery from a Beam job

I'm writing to BigQuery from a Beam job that reads an unbounded source, using STREAMING_INSERTS as the Method. I was looking at how to throttle the rows sent to BigQuery based on the recommendations in
https://cloud.google.com/bigquery/quotas#streaming_inserts
The BigQueryIO.Write API doesn't provide a way to set the size of the micro-batches.
I was looking at using triggers, but I'm not sure whether BigQueryIO groups everything in a pane into one request. I've set up the trigger as below:
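Window.<Long>into(new GlobalWindows())
    .triggering(
        // Fire repeatedly, whenever either condition is met first.
        Repeatedly.forever(
            AfterFirst.of(
                // At least 5 elements have arrived in the pane...
                AfterPane.elementCountAtLeast(5),
                // ...or 2 minutes have passed since the first element in the pane.
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(2)))))
    .discardingFiredPanes();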
Q1. Does Beam support micro-batches, or does it create one request for each element in the PCollection?
Q2. Does the above trigger make sense? Even if I set the window and trigger, it could still be sending one request for every element.

I don't know what you mean by micro-batch. The way I see it, BigQuery supports loading data either in batches or by streaming.
Basically, batch loads are subject to quotas, and streaming inserts are a bit more expensive.
Once you set the insertion method for your BigQueryIO, the documentation states:
Note: If you use batch loads in a streaming pipeline, you must use withTriggeringFrequency to specify a triggering frequency.
I have never tried it, but withTriggeringFrequency seems to be what you need here.

Related

What is the best way to add new data to BigQuery through BigQuery API?

I'm using Django as my backend framework to connect my web app with BigQuery. My approach is to use the BigQuery API in views.py to fetch data from BQ. From my research so far, I have found two ways to add data to BQ from Django:
1. Using the insert_rows_json() method, where I just need to have the data in JSON format and it is appended to the BQ table.
2. Using the to_gbq() method, where the data needs to be in a Pandas DataFrame and I can include the parameter if_exists="replace" to update existing tables in BQ.
Currently, I use method 1 for adding new data and method 2 for other operations such as updating and deleting.
My question: is it better to use method 2 for all of my operations, or should I stick to method 1 for adding new data and method 2 for other operations?
Or perhaps there is another, more efficient way that would make the web app run even faster?
Quoted from this doc:
For new projects, we recommend using the BigQuery Storage Write API instead of the tabledata.insertAll method. The Storage Write API has lower pricing and more robust features, including exactly-once delivery semantics. The tabledata.insertAll method is still fully supported.
You can try the BigQuery Storage Write API instead of the legacy insert_rows_json() method for streaming data into BigQuery. It has lower pricing and more robust features, including exactly-once delivery semantics. If you still need the legacy streaming insert_rows_json() method, you can keep using it; it is still fully supported by Google Cloud.
If you stay with your current two methods, use insert_rows_json() for streaming data into BigQuery, because it is a recommended method and is maintained by Google Cloud.
You can also UPDATE and DELETE table data using DML queries via the BigQuery client libraries. However, BigQuery has some limitations on running UPDATE and DELETE queries immediately after streaming inserts:
Rows that were written to a table recently by using streaming (the tabledata.insertall method or the Storage Write API) cannot be modified with UPDATE, DELETE, or MERGE statements. The recent writes are those that occur within the last 30 minutes. All other rows in the table remain modifiable by using UPDATE, DELETE, or MERGE statements. The streamed data can take up to 90 minutes to become available for copy operations.
If you still want to use the to_gbq() method for updating and deleting table data, you can. See the documentation comparing pandas-gbq and google-cloud-bigquery for the differences between the two libraries.
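The 30-minute rule quoted above can be turned into a simple guard before issuing DML. This is only a sketch: StreamingBufferCheck and likelyModifiable are hypothetical names, the 30-minute window is taken from the quote, and it assumes your application records when each row was streamed.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: decides whether a streamed row is old enough for DML. */
final class StreamingBufferCheck {
    // Window taken from the quoted documentation; treat it as an assumption.
    static final Duration RECENT_WRITE_WINDOW = Duration.ofMinutes(30);

    /** Returns true once at least RECENT_WRITE_WINDOW has passed since the row was streamed. */
    static boolean likelyModifiable(Instant streamedAt, Instant now) {
        return Duration.between(streamedAt, now).compareTo(RECENT_WRITE_WINDOW) >= 0;
    }

    public static void main(String[] args) {
        Instant streamedAt = Instant.parse("2024-01-01T00:00:00Z");
        // 10 minutes after the write: still inside the window.
        System.out.println(likelyModifiable(streamedAt, streamedAt.plus(Duration.ofMinutes(10))));
        // 45 minutes after the write: outside the window.
        System.out.println(likelyModifiable(streamedAt, streamedAt.plus(Duration.ofMinutes(45))));
    }
}
```

A web app could use a check like this to postpone an UPDATE or DELETE on freshly streamed rows instead of letting the query fail.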

PubSub topic with binary data to BigQuery

I expect to have thousands of sensors sending telemetry data at 10 FPS, with around 1 KB of binary data per frame, using IoT Core, meaning I'll receive it via PubSub. I'd like to get that data into BigQuery, and no processing is needed.
As Dataflow doesn't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to avoid it and go fully serverless.
The question is, what's my best alternative?
I've thought about a Cloud Run service running an Express app that accepts the data from PubSub, uses a global variable to accumulate around 500 rows in RAM, and then dumps them using BigQuery's insert() method (Node.js client).
How reasonable is that? Will I gain something from accumulation, or should I just insert every single incoming row into BigQuery?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method as you have already mentioned. The insert() method accepts either a single row or an array of rows, so the rows you accumulate can be sent in one call.
For new projects, the BigQuery Storage Write API is recommended, as it is cheaper and has a richer feature set than the legacy API. The BigQuery Storage Write API currently only supports the Java, Python, and Go (in preview) client libraries.
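To get a feel for the numbers in the question, here is a rough estimate in Java. It is only a sketch: the question says "thousands" of sensors, so the 1,000 used here is an assumed figure, IngestEstimate is a hypothetical name, and the calculation assumes every insert call carries a full batch of rows.

```java
/** Back-of-the-envelope ingest estimate; every input is an assumption. */
final class IngestEstimate {
    final long rowsPerSecond;
    final long bytesPerSecond;
    final long callsPerSecond;

    IngestEstimate(long sensors, long framesPerSecond, long bytesPerFrame, long rowsPerCall) {
        if (rowsPerCall <= 0) {
            throw new IllegalArgumentException("rowsPerCall must be positive");
        }
        rowsPerSecond = sensors * framesPerSecond;
        bytesPerSecond = rowsPerSecond * bytesPerFrame;
        // Ceiling division: a partially filled batch still costs one call.
        callsPerSecond = (rowsPerSecond + rowsPerCall - 1) / rowsPerCall;
    }

    public static void main(String[] args) {
        // Assumed: 1,000 sensors, 10 frames per second, 1,024 bytes per frame.
        IngestEstimate perRow = new IngestEstimate(1_000, 10, 1_024, 1);
        IngestEstimate batched = new IngestEstimate(1_000, 10, 1_024, 500);
        System.out.println("rows per second: " + perRow.rowsPerSecond);
        System.out.println("bytes per second: " + perRow.bytesPerSecond);
        System.out.println("calls per second, 1 row per call: " + perRow.callsPerSecond);
        System.out.println("calls per second, 500 rows per call: " + batched.callsPerSecond);
    }
}
```

Under these assumptions the service receives 10,000 rows per second (about 10 MB per second), which means 10,000 insert calls per second without accumulation versus 20 per second with batches of 500. Compare whichever figures apply to you against the streaming quotas before choosing a batch size.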
Batch Ingestion
If your requirement is to load large, bounded data sets that don’t have to be processed in real-time, prefer batch loading. BigQuery batch load jobs are free. You only pay for storing and querying the data but not for loading the data. Refer to quotas and limits for batch load jobs here. Some more key points on batch loading jobs have been quoted from this article.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user, BigQuery does not make guarantees on performance and available capacity of this shared pool. This is governed by the fair scheduler allocating resources among load jobs that may be competing with loads from other users or projects. Quotas for load jobs are in place to minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all or none of the data. Queries never scan partial data.

Flink batching Sink

I'm trying to use Flink in both streaming and batch modes to add a lot of data to Accumulo (a few million records a minute). I want to batch up records before sending them to Accumulo.
I ingest data either from a directory or via Kafka, convert the data using a flatMap, and then pass it to a RichSinkFunction, which adds the data to a collection.
With the streaming data, batching seems OK, in that I can add the records to a fixed-size collection that gets sent to Accumulo once the batch threshold is reached. But for the batch data, which is finite, I'm struggling to find a good approach to batching, as it would require a flush timeout in case no further data arrives within a specified time.
There doesn't seem to be an Accumulo connector, unlike for Elasticsearch and other sinks.
I thought about using a ProcessFunction with a trigger for batch size and time interval, but this requires a keyed window. I didn't want to go down the keyed route because the data looks very skewed: some keys would have a ton of records and some would have very few. If I don't use a windowed approach, then I understand that the operator won't be parallel. I was hoping to batch lazily, so that each sink only cares about the record count or a time interval.
Has anybody got any pointers on how best to address this?
You can access timers in a sink by implementing ProcessingTimeCallback. For an example, look at the BucketingSink -- its open and onProcessingTime methods should get you started.
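To make the timer suggestion concrete, here is a minimal plain-Java sketch of the buffering logic only; it is not Flink API code. SizeOrTimeBatcher and its method names are hypothetical. The assumption is that a sink would call add() from invoke(), onTimer() from onProcessingTime(), and flush() from close(), so the last partial batch of a finite input is still written. Time is passed in explicitly to keep the logic easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers records and flushes when the buffer is full or its oldest record is too old. */
final class SizeOrTimeBatcher<T> {
    private final int maxSize;
    private final long maxAgeMillis;
    private final Consumer<List<T>> flusher;
    private final List<T> buffer = new ArrayList<>();
    private long firstElementTime = -1L; // -1 means the buffer is empty

    SizeOrTimeBatcher(int maxSize, long maxAgeMillis, Consumer<List<T>> flusher) {
        this.maxSize = maxSize;
        this.maxAgeMillis = maxAgeMillis;
        this.flusher = flusher;
    }

    /** Called once per incoming record; flushes as soon as the batch is full. */
    void add(T record, long nowMillis) {
        if (buffer.isEmpty()) {
            firstElementTime = nowMillis;
        }
        buffer.add(record);
        if (buffer.size() >= maxSize) {
            flush();
        }
    }

    /** Called periodically by a timer; flushes a partial batch that has waited too long. */
    void onTimer(long nowMillis) {
        if (!buffer.isEmpty() && nowMillis - firstElementTime >= maxAgeMillis) {
            flush();
        }
    }

    /** Called at end of input (or by the checks above); hands a copy of the batch to the writer. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flusher.accept(new ArrayList<>(buffer));
        buffer.clear();
        firstElementTime = -1L;
    }
}
```

Because each parallel sink instance would own its own batcher, no keying is needed and skewed keys do not matter. The flusher is where the write to Accumulo would go.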

Get/Set BigQuery Job ID while doing BigQueryIO.write()

Is it possible to set the BigQuery job ID, or to get it, while a batch pipeline is running?
I know it's possible using the BigQuery API, but is it possible when using BigQueryIO from Apache Beam?
I need to send an acknowledgement after writing to BigQuery that the load is complete.
Currently this is not possible. It is complicated by the fact that a single BigQueryIO.write() may use many BigQuery jobs under the hood (i.e. BigQueryIO.write() is a general-purpose API for writing data to BigQuery, rather than an API for working with a single specific BigQuery load job), e.g.:
In case the amount of data to be loaded is larger than the BigQuery limits for a single load job, BigQueryIO.write() will shard it into multiple load jobs.
In case you are using one of the destination-dependent write methods (e.g. DynamicDestinations), and are loading into multiple tables at the same time, there'll be at least 1 load job per table.
In case you are writing an unbounded PCollection using the BATCH_LOADS method, it will periodically issue load jobs for newly arrived data, subject to the notes above.
In case you're using the STREAMING_INSERTS method (it is allowed to use it even if you're writing a bounded PCollection), there will be no load jobs at all.
You will need to use one of the typical workarounds for "doing something after something else is done", which is, e.g. wait until the entire pipeline is done using pipeline.run().waitUntilFinish() in your main program and then do your second action.

Inserting into BigQuery via load jobs (not streaming)

I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs, not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built-in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job using Google Cloud Storage is a multi-step process: write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and optionally the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?
I'm not sure which version of Apache Beam you are using, but it is now possible to micro-batch in a streaming pipeline. If you decide to go that way, you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
    .to(destinationTable(options))
    // Use periodic load jobs instead of streaming inserts.
    .withMethod(Method.FILE_LOADS)
    .withJsonSchema(myTableSchema)
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withExtendedErrorInfo()
    // How often a load is triggered for newly arrived data.
    .withTriggeringFrequency(Duration.standardMinutes(2))
    // How many file shards are written for each load.
    .withNumFileShards(1)
    .optimizedWrites());
Things to keep in mind
There are two different methods: FILE_LOADS and STREAMING_INSERTS. If you use the first one, you need to include withTriggeringFrequency and withNumFileShards. For the triggering frequency, in my experience it is better to use minutes, and the exact number will depend on your data throughput. If you receive quite a lot of data, try to keep it small; I have seen "stuck" errors when you increase it too much. The number of shards mostly affects your GCS billing: if you add too many shards, more files are created per table every x minutes.
If your input data is not very large, streaming inserts can work really well, and the cost shouldn't be a big deal. In that scenario you can use the STREAMING_INSERTS method and remove withTriggeringFrequency and withNumFileShards. You can also add withFailedInsertRetryPolicy with a policy such as InsertRetryPolicy.retryTransientErrors() so that no rows are lost (keep in mind that idempotency is not guaranteed with STREAMING_INSERTS, so duplication is possible).
You can check your jobs in BigQuery to validate that everything is working. Keep in mind the BigQuery job quotas (I think it is 1,000 load jobs per table per day) when you are choosing the triggering frequency and the number of shards.
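As a quick sanity check on the triggering frequency, the arithmetic can be sketched like this. LoadJobBudget is a hypothetical helper, the 1,000 figure is only the guess mentioned above treated as a per-table daily limit (check the current quota), and the calculation assumes exactly one load job per table per trigger firing, which may undercount if a firing is split into several jobs.

```java
import java.time.Duration;

/** Hypothetical helper relating a triggering frequency to a daily load-job budget. */
final class LoadJobBudget {
    private static final long SECONDS_PER_DAY = Duration.ofDays(1).getSeconds();

    /** Load jobs per day for one table, assuming one job per trigger firing. */
    static long jobsPerDay(Duration triggeringFrequency) {
        long seconds = triggeringFrequency.getSeconds();
        if (seconds <= 0) {
            throw new IllegalArgumentException("frequency must be at least one second");
        }
        return SECONDS_PER_DAY / seconds;
    }

    /** Shortest whole-second frequency that stays within the given daily limit. */
    static Duration minFrequency(long dailyJobLimit) {
        if (dailyJobLimit <= 0) {
            throw new IllegalArgumentException("dailyJobLimit must be positive");
        }
        // Ceiling division so the resulting job count never exceeds the limit.
        return Duration.ofSeconds((SECONDS_PER_DAY + dailyJobLimit - 1) / dailyJobLimit);
    }

    public static void main(String[] args) {
        long assumedDailyLimit = 1_000; // assumption, not a confirmed quota
        System.out.println("every 2 minutes: " + jobsPerDay(Duration.ofMinutes(2)) + " jobs per day");
        System.out.println("every 1 minute: " + jobsPerDay(Duration.ofMinutes(1)) + " jobs per day");
        System.out.println("shortest frequency within limit: "
                + minFrequency(assumedDailyLimit).getSeconds() + " seconds");
    }
}
```

With these assumptions, a 2-minute frequency issues 720 jobs per day and fits the assumed budget, while a 1-minute frequency issues 1,440 and does not.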
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
BigQueryIO.write() uses BigQuery load jobs by default when the input PCollection is bounded. If you'd like it to use them for an unbounded PCollection as well, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).