Cloud Pub/Sub to BigQuery through Dataflow SQL

I would like to understand how a Dataflow pipeline works.
In my case, something is published to Cloud Pub/Sub periodically, which Dataflow then writes to BigQuery. The volume of messages that come through is in the thousands, so my publisher client has batch settings of 1000 messages, 1 MB, and 10 seconds of latency.
The question is: when messages are published in a batch as stated above, does Dataflow SQL take in the whole batch and write it to BigQuery in one go, or does it write one message at a time?
Also, is there any benefit of one over the other?
Please comment if any other details are required. Thanks

Dataflow SQL is just a way to define an Apache Beam pipeline with SQL syntax and to run it on Dataflow.
Because the source is Pub/Sub, a streaming pipeline is created from your SQL definition. When you run your SQL command, a Dataflow job starts and waits for messages from Pub/Sub.
If you publish a bunch of messages, Dataflow is able to scale up to process them as quickly as possible.
Keep in mind that Dataflow streaming never scales to 0, so you will always pay for 1 or more VMs to keep your pipeline up and running.
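To make that concrete, here is a minimal hand-written Beam equivalent (in Java) of the kind of streaming pipeline Dataflow SQL creates for you: read from a Pub/Sub subscription and write rows to BigQuery. The subscription, table, and field names are placeholders for illustration only, not something taken from the question.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;

public class PubSubToBigQuery {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Unbounded source: this makes the pipeline a streaming pipeline.
        .apply("ReadFromPubSub",
            PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/my-sub"))
        .apply("ToTableRow", MapElements.via(new SimpleFunction<String, TableRow>() {
          @Override
          public TableRow apply(String json) {
            // Placeholder conversion; parse/validate the real message payload here.
            return new TableRow().set("payload", json);
          }
        }))
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }
}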

Related

Duplicates in Pub/Sub-Dataflow-BigQuery pipeline

Consider the following setup:
Pub/Sub
Dataflow: streaming job for validating events from Pub/Sub, unpacking and writing to BigQuery
BigQuery
We have counters on the valid events that pass through our Dataflow pipeline and observe that the counters are higher than the number of events that were available in Pub/Sub.
Note: It seems we also see duplicates in BigQuery, but we are still investigating this.
The following error can be observed in the Dataflow logs:
Pipeline stage consuming pubsub took 1h35m7.83313078s and default ack deadline is 5m.
Consider increasing ack deadline for subscription projects/<redacted>/subscriptions/<redacted>
Note that the Dataflow job is started when there are already millions of messages waiting in Pub/Sub.
Questions:
Can this cause duplicate events to be picked up by the pipeline?
Is there anything we can do to alleviate this issue?
My recommendation is to purge the Pub/Sub subscription's message backlog by launching the Dataflow job in batch mode first, then run it in streaming mode for normal operation. This way, your streaming job starts from a clean basis rather than with a long backlog of enqueued messages.
This also shows the power of Dataflow (and Beam): the same pipeline can run in both streaming and batch mode.

How to batch process data from Google Pub/Sub to Cloud Storage using Dataflow?

I'm building a Change Data Capture pipeline that reads data from a MySQL database and creates a replica in BigQuery. I'll be pushing the changes to Pub/Sub and using Dataflow to transfer them to Google Cloud Storage. I have been able to figure out how to stream the changes, but I need to run batch processing for a few tables in my database.
Can Dataflow be used to run a batch job while reading from an unbounded source like Pub/Sub? Can I run this batch job to transfer data from Pub/Sub to Cloud Storage and then load this data into BigQuery? I want a batch job because a streaming job costs more.
Thank you for the clarification.
First, when you use Pub/Sub in Dataflow (the Beam framework), it is only possible in streaming mode:
Cloud Pub/Sub sources and sinks are currently supported only in streaming pipelines, during remote execution.
If your process doesn't need to be real-time, you can skip Dataflow and save money. For the process I propose, you can use Cloud Functions or Cloud Run (App Engine would also work, but it's not my first recommendation).
In both cases, create a process (on Cloud Run or Cloud Functions) that is triggered periodically (every week?) by Cloud Scheduler.
Solution 1
Connect your process to the pull subscription.
Every time you read a message (or a chunk of messages, for example 1000), stream-write it into BigQuery. -> However, streaming writes are not free in BigQuery ($0.05 per GB).
Loop until the queue is empty. Set the timeout to the max value (9 minutes with Cloud Functions, 15 minutes with Cloud Run) to prevent any timeout issue. A sketch of this loop is shown below.
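Here is a minimal Java sketch of this first solution, using the Pub/Sub synchronous pull API and BigQuery streaming inserts. The subscription and table names are placeholders, error handling is simplified, and you would still need to stop the loop before the function/service timeout.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.pubsub.v1.stub.GrpcSubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStubSettings;
import com.google.pubsub.v1.AcknowledgeRequest;
import com.google.pubsub.v1.PullRequest;
import com.google.pubsub.v1.PullResponse;
import com.google.pubsub.v1.ReceivedMessage;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PubSubToBigQueryStreaming {
  public static void main(String[] args) throws Exception {
    String subscription = "projects/my-project/subscriptions/my-sub"; // placeholder
    TableId tableId = TableId.of("my_dataset", "my_table");           // placeholder
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    SubscriberStubSettings settings = SubscriberStubSettings.newBuilder().build();
    try (SubscriberStub subscriber = GrpcSubscriberStub.create(settings)) {
      while (true) {
        // Pull up to 1000 messages at a time.
        PullResponse response = subscriber.pullCallable().call(
            PullRequest.newBuilder()
                .setSubscription(subscription)
                .setMaxMessages(1000)
                .build());
        if (response.getReceivedMessagesList().isEmpty()) {
          break; // nothing returned: treat the queue as empty and stop
        }

        InsertAllRequest.Builder insert = InsertAllRequest.newBuilder(tableId);
        List<String> ackIds = new ArrayList<>();
        for (ReceivedMessage message : response.getReceivedMessagesList()) {
          String payload = message.getMessage().getData().toStringUtf8();
          // Placeholder row; map your real message fields here.
          insert.addRow(Collections.singletonMap("payload", payload));
          ackIds.add(message.getAckId());
        }

        // Streaming insert into BigQuery ($0.05 per GB), then ack the messages.
        InsertAllResponse result = bigquery.insertAll(insert.build());
        if (result.hasErrors()) {
          continue; // do not ack on failure so the messages are redelivered
        }
        subscriber.acknowledgeCallable().call(
            AcknowledgeRequest.newBuilder()
                .setSubscription(subscription)
                .addAllAckIds(ackIds)
                .build());
      }
    }
  }
}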
Solution 2
Connect your process to the pull subscription.
Read a chunk of messages (for example 1000) and keep them in memory (in an array).
Loop until the queue is empty. Set the timeout to the max value (9 minutes with Cloud Functions, 15 minutes with Cloud Run) to prevent any timeout issue. Also set the memory to the max value (2 GB) to prevent out-of-memory crashes.
Create a load job into BigQuery from your in-memory data array. -> Here the load job is free, but you are limited to 1,000 load jobs per day and per table.
However, this solution can fail if your app plus the data is larger than the max memory value. An alternative is to write a file to GCS every, for example, 1 million rows (depending on the size and memory footprint of each row). Name the files with a unique prefix, for example the date of the day (YYYYMMDD-tempFileXX), and increment the XX at each file creation. Then create a load job, not from data in memory, but from the data in GCS, with a wildcard in the file name (gs://myBucket/YYYYMMDD-tempFile*). This way, all the files that match the prefix will be loaded. A sketch of this GCS-based load job is shown below.
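A minimal Java sketch of the load-job step for the GCS variant, assuming newline-delimited JSON files were already written under the wildcard prefix. The bucket, table, and file format are placeholders.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class GcsToBigQueryLoad {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId tableId = TableId.of("my_dataset", "my_table");   // placeholder
    String sourceUri = "gs://myBucket/20240101-tempFile*";    // wildcard prefix, placeholder

    // Load job: free, but counted against the per-table daily load-job quota.
    LoadJobConfiguration config =
        LoadJobConfiguration.newBuilder(tableId, sourceUri)
            .setFormatOptions(FormatOptions.json())           // newline-delimited JSON files
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
            .build();

    Job job = bigquery.create(JobInfo.of(config));
    job = job.waitFor();                                      // block until the load finishes
    if (job.getStatus().getError() != null) {
      throw new RuntimeException("Load job failed: " + job.getStatus().getError());
    }
  }
}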
Recommendation: Pub/Sub messages are kept for up to 7 days in a subscription. I recommend triggering your process at least every 3 days, so you have time to react and debug before messages are deleted from the subscription.
Personal experience: streaming writes into BigQuery are cheap for a low volume of data. For a few cents, I recommend considering the 1st solution if you can pay for it. The management and the code are smaller/easier!

How can I write streaming Dataflow pipelines that support schema evolution?

I'm building some data streaming pipelines that read from Kafka and write to various sinks using Google Cloud Dataflow. The pipeline looks something like this (simplified).
// Example pipeline that writes to BigQuery.
Pipeline.create(options)
    .apply(KafkaIO.read().withTopic(options.topic))
    .apply(/* Convert to a Row type */)
    .setRowSchema(schemaRegistry.lookup(options.topic))
    .apply(
        BigQueryIO.write<Row>()
            .useBeamSchema()
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withProject(options.outputProject)
            .withDataset(options.outputDataset)
            .withTable(options.outputTable)
    )
I plan to run a pipeline for each of our Kafka topics, of which there are hundreds. The pipeline looks up the schema for the given topic during the planning stage. This allows BigQueryIO to create the necessary tables before starting the pipeline.
Question: How can I support evolving schemas in my Dataflow pipelines?
I've explored the option of updating an existing Dataflow job (using the --update flag). The thought is that I could automate the process of submitting an updated job whenever a schema changes. But updating a job seems to incur about 3 minutes of downtime. For some of the jobs, that much downtime won't work. I'm looking for other solutions that hopefully have no more than a few seconds of downtime.
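For reference, a minimal Java sketch of how the --update relaunch mentioned above might be wired through the pipeline options. The job name is hypothetical, and the availability of setUpdate on DataflowPipelineOptions is an assumption to check against your SDK version.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RelaunchWithUpdate {
  public static void main(String[] args) {
    // Equivalent to passing --update --jobName=<existing job name> on the command line.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setUpdate(true);                     // assumption: replaces the running job in place
    options.setJobName("kafka-to-bq-my-topic");  // hypothetical name; must match the running job

    // ...rebuild the same pipeline with the new schema and run it here...
  }
}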

Dataflow to BigQuery quota

I found a couple of related questions, but no definitive answer from the Google team, for this particular question:
Is a Cloud Dataflow job, writing to BigQuery, limited to the BigQuery quota of 100K rows per second per table (i.e. the BQ streaming limit)?
google dataflow write to bigquery table performance
Cloud DataFlow performance - are our times to be expected?
Edit:
The main motivation is to find a way to predict runtimes for various input sizes.
I've managed to run jobs which show > 180K rows/sec processed via the Dataflow monitoring UI. But I'm unsure whether this is somehow throttled on the insert into the table, since the job runtime was about 2x slower than a naive calculation (500M rows / 180K rows/sec ≈ 45 min, whereas it actually took almost 2 hours).
From your message, it sounds like you are executing your pipeline in batch, not streaming, mode.
In batch mode, jobs run on the Google Cloud Dataflow service do not use BigQuery's streaming writes. Instead, we write all the rows to be imported to files on GCS and then invoke a BigQuery load job. Note that this reduces your costs (load jobs are cheaper than streaming writes) and is more efficient overall (BigQuery can be faster doing a bulk load than doing per-row imports). The tradeoff is that no results are available in BigQuery until the entire load job finishes successfully.
Load jobs are not limited to a certain number of rows per second; rather, they are limited by the daily quotas.
In streaming mode, Dataflow does indeed use BigQuery's streaming writes. In that case, the 100,000 rows per second limit does apply. If you exceed that limit, Dataflow will get a quota_exceeded error and will then retry the failing inserts. This behavior helps smooth out short-term spikes that temporarily exceed BigQuery's quota; if your pipeline exceeds the quota for a long period of time, this fail-and-retry policy eventually acts as a form of backpressure that slows your pipeline down.
--
As for why your job took 2 hours instead of 45 minutes, your job will have multiple stages that proceed serially, and so using the throughput of the fastest stage is not an accurate way to estimate end-to-end runtime. For example, the BigQuery load job is not initiated until after Dataflow finishes writing all rows to GCS. Your rates seem reasonable, but please follow up if you suspect a performance degradation.

Inserting into BigQuery via load jobs (not streaming)

I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs - not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built-in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job via Google Cloud Storage is a multi-step process - write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and, optionally, the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?
I'm not sure which version of Apache Beam you are using, but it is now possible to use a micro-batching tactic within a streaming pipeline. If you decide to go that way, you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
.to(destinationTable(options))
.withMethod(Method.FILE_LOADS)
.withJsonSchema(myTableSchema)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1);
.optimizedWrites());
Things to keep in mind
There are 2 different methods: FILE_LOADS and STREAMING_INSERTS. If you use the first one, you need to include withTriggeringFrequency and withNumFileShards. For the triggering frequency, from my experience, it's better to use minutes, and the number will depend on your throughput: if you receive quite a lot of data, try to keep it small, as I have seen "stuck" errors when you increase it too much. The shards mostly affect your GCS billing: if you add too many shards, more files are created per table per x amount of minutes.
If your input data size is not that big, streaming inserts can work really well and the cost shouldn't be a big deal. In that scenario you can use the STREAMING_INSERTS method and remove withTriggeringFrequency and withNumFileShards. You can also add withFailedInsertRetryPolicy, for example InsertRetryPolicy.retryTransientErrors(), so no rows are lost (keep in mind that idempotency is not guaranteed with STREAMING_INSERTS, so duplication is possible). A variant of the snippet above for this case is shown below.
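To illustrate, a hedged variant of the snippet above for the streaming-inserts scenario; it drops into the same pipeline and reuses the same placeholder destinationTable and myTableSchema.

.apply("Saving with streaming inserts", BigQueryIO.writeTableRows()
    .to(destinationTable(options))
    .withMethod(Method.STREAMING_INSERTS)
    .withJsonSchema(myTableSchema)
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    // Retry transient insert failures so rows are not lost; duplicates remain possible.
    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));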
You can check your jobs in BigQuery and validate that everything is working. Keep in mind BigQuery's load job quotas (I think it is 1,000 jobs per table per day) when you define the triggering frequency and shards.
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
BigQueryIO.write() always uses BigQuery load jobs when the input PCollection is bounded. If you'd like it to also use them if it is unbounded, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).