How to Crash/Stop DataFlow Pub/Sub Ingestion on BigQuery Insert Error - google-bigquery

I am searching for a way to make a Google DataFlow job stop ingesting from Pub/Sub when a (specific) exception happens.
The events from Pub/Sub are JSON read via PubsubIO.Read.Bound<TableRow> using TableRowJsonCoder and directly streamed to BigQuery with
BigQueryIO.Write.Bound.
(There is a ParDo inbetween that changes the contents of one field and some custom partitioning by day happening, but that should be irrelevant for this purpose.)
When there are fields in the events/rows ingested from PubSub that are not columns in the destination BigQuery table, the DataFlow job logs IOExceptions at run time claiming it could not insert the rows, but seems to acknowledge these messages and continues running.
What I want to do instead is to stop ingesting messages from Pub/Sub and/or make the Dataflow job crash, so that alerting could be based on the age of oldest unacknowledged message. At the very least I want to make sure that those Pub/Sub messages that failed to be inserted to BigQuery are not ack'ed so that I can fix the problem, restart the Dataflow job and consume those messages again.
I know that one suggested solution for handling faulty input is described here: https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow
I am also aware of this PR on Apache Beam that would allow inserting the rows without the offending fields:
https://github.com/apache/beam/pull/1778
However in my case I don't really want to guard from faulty input but rather from programmer errors, i.e. the fact that new fields were added to the JSON messages which are pushed to Pub/Sub, but the corresponding DataFlow job was not updated. So I don't really have faulty data, I rather simply want to crash when a programmer makes the mistake not to deploy a new Dataflow job before changing anything about the message format.
I assume it would be possible to (analogue to the blog post solution) create a custom ParDo that validates each row and throws an exception that isn't caught and leads to a crash.
But ideally, I would just like to have some configuration that does not handle the insert error and logs it but instead just crashes the job or at least stops ingestion.

You could have a ParDo with a DoFn which sits before the BQ write. The DoFn would be responsible to get the output table schema every X mins and would validate that each record that is to be written matches the expected output schema (and throw an exception if it doesn't).
Old Pipeline:
PubSub -> Some Transforms -> BQ Sink
New Pipeline:
PubSub -> Some Transforms -> ParDo(BQ Sink Validator) -> BQ Sink
This has the advantage that once someone fixes the output table schema, the pipeline will recover. You'll want to throw a good error messaging stating whats wrong with the incoming PubSub message.
Alternatively, you could have the BQ Sink Validator instead output messages to a PubSub DLQ (monitoring its size). Operationally you would have to update the table and then re-ingest the DLQ back in as an input. This has the advantage that only bad messages block pipeline execution.

Related

Cloud Pub/Sub to BigQuery through Dataflow SQL

I Would like to understand the working of Dataflow pipeline.
In my case, I have something published to cloud pub/sub periodically which Dataflow then writes to BigQuery. The volume of messages that come through are in the thousands so my publisher client has a batch setting for 1000 messages, 1 MB and 10 sec of latency.
The question is: When published in a batch as stated above, Does Dataflow SQL take in the messages in the batch and writes it to BigQuery all in one go? or, Does it writes one message at a time?
On the other hand, Is there any benefit of one over the other?
Please comment if any other details required. Thanks
Dataflow SQL is just a way to define, with SQL syntax, an Apache Beam pipeline, and to run it on Dataflow.
Because it's PubSub, it's a streaming pipeline that is created based on your SQL definition. When you run your SQL command, a Dataflow job starts and wait the messages from pubSub.
If you publish a bunch of messages, Dataflow is able to scale up to process them as soon as possible.
Keep in ming that Dataflow streaming never scale to 0 and therefore you will always pay for 1 or more VM to keep your pipeline up and running.

Duplicates in Pub/Sub-Dataflow-BigQuery pipeline

Consider the following setup:
Pub/Sub
Dataflow: streaming job for validating events from Pub/Sub, unpacking and writing to BigQuery
BigQuery
We have counters on the valid events that pass through our Datafow Pipeline and observe the counters are higher than the amount of events that were available in Pub/Sub.
Note: It seems we also see duplicates in BigQuery, but we are still investigating this.
The following error can be observed in the Dataflow logs:
Pipeline stage consuming pubsub took 1h35m7.83313078s and default ack deadline is 5m.
Consider increasing ack deadline for subscription projects/<redacted>/subscriptions/<redacted >
Note that the Dataflow job is started when there are already millions of messages waiting in Pub/Sub.
Questions:
Can this cause duplicate events to be picked up by the pipeline?
Is there anything we can do to alleviate this issue?
My recommendation is to purge the PubSub subscription message queue by launching the Dataflow job in batch mode. Then run it in streaming mode for the usual operation. Like this, you will be able to start from a clean basis your streaming job and not have a long list of enqueued messages.
In addition, it's the power of Dataflow (and beam) to be able to run in streaming and in batch the same pipeline.

Beam Java Dataflow, Bigquery streaming insert GroupByKey reducing elements

Fairly new to Dataflow/Apache Beam.
I have a pipeline that is reading messages from PubSub, these messages hold a file name for a file in GCS. These are read out of storage, table rows are created from the individual files then streamed into BigQuery.
The problem is, occured inside the streaming insert step, there is a GroupByKey step, this is taking in 20,819 elements, but is emitting considerably less elements at 11,207
And we observe 11,207 records inserted into BigQuery.
.apply(String.format(Constants.STREAM_DATA, tableTagName),
BigQueryIO.writeTableRows().to(table).withExtendedErrorInfo().withoutValidation)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()))
From reading, can't see how to "modify" step, the pipeline up to this point handles records "as expected". We read messages from PubSub withIdAttribute to see if that helps, but to no avail. We mostly observe this when the number of workers increase and the pipeline is under load, but haven't been able to bottom out what's causing it/ how to resolve?
Any help would be appreciated! Thanks

Flink batching Sink

I'm trying to use flink in both a streaming and batch way, to add a lot of data into Accumulo (A few million a minute). I want to batch up records before sending them to Accumulo.
I ingest data either from a directory or via kafka, convert the data using a flatmap and then pass to a RichSinkFunction, which adds the data to a collection.
With the streaming data, batching seems ok, in that I can add the records to a collection of fixed size which get sent to accumulo once the batch threshold is reached. But for the batch data which is finite, I'm struggling to find a good approach to batching as it would require a flush time out in case there is no further data within a specified time.
There doesn't seem to be an Accumulo connector unlike for Elastic search or other alternative sinks.
I thought about using a Process Function with a trigger for batch size and time interval, but this requires a keyed window. I didn't want to go down the keyed route as data looks to be very skewed, in that some keys would have a tonne of records and some would have very few. If I don't use a windowed approach, then I understand that the operator won't be parallel. I was hoping to lazily batch, so each sink only cares about numbers or an interval of time.
Has anybody got any pointers on how best to address this?
You can access timers in a sink by implementing ProcessingTimeCallback. For an example, look at the BucketingSink -- its open and onProcessingTime methods should get you started.

Service Broker Design

I’m looking to introduce SS Service Broker,
I have a remote orders database and a local processing database, all activity on the processing database has to happen in sequence, this seems a perfect job for Service Broker!
I’ve set up the infrastructure, I can send and receive messages and now I’m looking at the design of the processing. As I said all processes for one order need to be completed in sequence so I’ll put them in one conversation.
One of these processes is a request for external flat file data, we then wait (could be several days) and then import and process this file when it returns. How can I process half the tasks, then wait for the flat file to return before processing the other half.
I’ve had some ideas but I’m sure I’m missing a trick somewhere
1) Write all queue items to a status table and use status values – seems to remove some of the flexibility of SSSB and add another layer of tasks
2) Keep the transaction open until we get the data back – not ideal
3) Have the flat file import task continually polling for the file to appear – this seems inefficient
What is the most efficient way of managing this workflow?
thanks in advance
In my opinion it is like chain of responsibility. As far as i can understand we have the following workflow.
1.) Process for message.
2.) Wait for external file, now this can be a busy wait or if external data provides you a notification then we can actually do it in non-polling manner.
3.) Once data is received then process the data.
So my suggestion would be to use 3 different Queues one for each part, when one is done it will forward or put a new message in chained queue.
I am assuming, one order processing will not disrupt another order processing.
I am thinking MSMQ with Windows Sequential Work flow, might also be a candidate for this task.