Consider the following setup:
Pub/Sub
Dataflow: streaming job for validating events from Pub/Sub, unpacking and writing to BigQuery
BigQuery
We have counters on the valid events that pass through our Dataflow pipeline and observe that the counters are higher than the number of events that were available in Pub/Sub.
Note: It seems we also see duplicates in BigQuery, but we are still investigating this.
The following error can be observed in the Dataflow logs:
Pipeline stage consuming pubsub took 1h35m7.83313078s and default ack deadline is 5m.
Consider increasing ack deadline for subscription projects/<redacted>/subscriptions/<redacted>
Note that the Dataflow job is started when there are already millions of messages waiting in Pub/Sub.
Questions:
Can this cause duplicate events to be picked up by the pipeline?
Is there anything we can do to alleviate this issue?
My recommendation is to purge the Pub/Sub subscription backlog by launching the Dataflow job in batch mode first, then running it in streaming mode for normal operation. That way your streaming job starts from a clean basis rather than with a long list of enqueued messages.
In addition, that is one of the strengths of Dataflow (and Beam): the same pipeline can run in both batch and streaming mode.
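To make that concrete, here is a minimal Apache Beam Python sketch of a single transform chain reused by both modes; the subscription, backfill path, table name, and validation check are all placeholders, and the batch path is assumed to read a bounded backfill export rather than the subscription directly:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


class ValidateAndFormat(beam.PTransform):
    """Shared transform chain: parse raw events and keep only the valid ones."""

    def expand(self, pcoll):
        return (
            pcoll
            | "Parse" >> beam.Map(json.loads)
            | "KeepValid" >> beam.Filter(lambda row: "event_id" in row)  # placeholder validation
        )


def run(argv=None):
    options = PipelineOptions(argv)
    streaming = options.view_as(StandardOptions).streaming

    with beam.Pipeline(options=options) as p:
        if streaming:
            # Streaming run: read from the Pub/Sub subscription (placeholder name).
            events = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/my-sub"
            )
        else:
            # Batch run: drain a bounded backfill export of the backlog instead.
            events = p | "ReadBackfill" >> beam.io.ReadFromText("gs://my-bucket/backlog/*.json")

        _ = (
            events
            | ValidateAndFormat()
            | "WriteBQ" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.events",  # placeholder table, assumed to already exist
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```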
Related
Our use case is that our system supports scheduling multiple multi-channel send jobs at any time. Multi-channel send meaning sending emails, notifications, SMS, etc.
How it currently works is that we have one SQS queue per channel. Whenever a job is scheduled, it pushes all of its send records to the appropriate channel SQS queue. Any job scheduled later then pushes its own send records to the appropriate channel queue, and so on. This leads to starvation of later-scheduled jobs if the first scheduled job is high volume, as its records will be processed from the queue before the second job's records are reached.
On the consumer side, our processing rate is much lower than the incoming rate, since we can only do a fixed number of sends per hour. So a high-volume job could keep running for a long time after being scheduled.
To solve the starvation problem, our first idea was to create three queues per channel (low, medium, and high volume), with jobs submitted to a queue according to their volume. The problem is that if two or more jobs of the same volume arrive, we still face starvation.
The only guaranteed way to ensure no starvation and fair processing seems to be having a queue per job, created dynamically. Consumers would process each queue at an equal rate, so the processing bandwidth gets divided between jobs. A high-volume job might take a long time to complete, but it won't choke processing for other jobs.
We could create the SQS queues dynamically for every scheduled job, but that would mean monitoring maybe 50+ queues at some point. A better choice seemed to be a Kinesis stream with multiple shards, where we would need to ensure that every shard only contains a single partition key identifying a single job; I am not sure if that's possible, though.
Are there any better ways to achieve this, so we can do fair processing and not starve any job?
If this is not the right community for such questions, please let me know.
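For illustration, the queue-per-job, equal-rate consumption idea above could look roughly like this on the consumer side (a boto3 sketch, not our actual implementation; the queue-name prefix, batch size, and send function are placeholders):

```python
import time

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # placeholder region

QUEUE_PREFIX = "send-job-"  # assumed naming convention: one queue per scheduled job
BATCH_PER_QUEUE = 10        # equal share taken from each job's queue per pass


def list_job_queues():
    """Discover per-job queues by naming prefix (list_queues returns up to 1000 URLs)."""
    resp = sqs.list_queues(QueueNamePrefix=QUEUE_PREFIX)
    return resp.get("QueueUrls", [])


def send_record(record_body):
    """Placeholder: hand the record to the channel's rate-limited sender."""
    print("sending", record_body)


def process_one_round():
    """Take an equal-sized batch from every job queue, so no job starves another."""
    for queue_url in list_job_queues():
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=BATCH_PER_QUEUE,
            WaitTimeSeconds=1,
        )
        for msg in resp.get("Messages", []):
            send_record(msg["Body"])
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    while True:
        process_one_round()
        time.sleep(1)  # small pause between rounds; real pacing would enforce the per-hour send limit
```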
I would like to understand how a Dataflow pipeline works.
In my case, I have something published to Cloud Pub/Sub periodically, which Dataflow then writes to BigQuery. The volume of messages that come through is in the thousands, so my publisher client has batch settings of 1000 messages, 1 MB, and 10 seconds of latency.
The question is: when publishing in batches as stated above, does Dataflow SQL take in the messages in the batch and write them to BigQuery all in one go, or does it write one message at a time?
On the other hand, is there any benefit of one over the other?
Please comment if any other details are required. Thanks.
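For reference, the publisher-side batching described above is configured roughly like this with the Python client library (project and topic names are placeholders):

```python
from google.cloud import pubsub_v1

# Batch settings matching those described above:
# flush a batch after 1000 messages, 1 MB, or 10 seconds of latency, whichever comes first.
batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=1000,
    max_bytes=1024 * 1024,
    max_latency=10,
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names

for i in range(5000):
    publisher.publish(topic_path, data=f'{{"event_id": {i}}}'.encode("utf-8"))
```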
Dataflow SQL is just a way to define, with SQL syntax, an Apache Beam pipeline, and to run it on Dataflow.
Because it's Pub/Sub, a streaming pipeline is created based on your SQL definition. When you run your SQL command, a Dataflow job starts and waits for messages from Pub/Sub.
If you publish a bunch of messages, Dataflow is able to scale up to process them as soon as possible.
Keep in mind that Dataflow streaming never scales to 0, and therefore you will always pay for one or more VMs to keep your pipeline up and running.
I've seen people using Airflow to schedule hundreds of scraping jobs through Scrapyd daemons. However, one thing they miss in Airflow is monitoring long-lasting jobs like scraping: getting the number of pages and items scraped so far, or the number of URLs that failed so far or were retried without success.
What are my options to monitor current status of long lasting jobs? Is there something already available or I need to resort to external solutions like Prometheus, Grafana and instrument Scrapy spiders myself?
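In case instrumenting the spiders ourselves turns out to be the way to go, a rough sketch of a Scrapy extension that periodically pushes crawler stats to a Prometheus Pushgateway could look like this (the stat keys, push interval, and gateway address are placeholders):

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from scrapy import signals

PUSH_EVERY_N_ITEMS = 500       # placeholder push interval
GATEWAY = "pushgateway:9091"   # placeholder Pushgateway address
TRACKED_STATS = (              # placeholder selection of Scrapy stats keys
    "item_scraped_count",
    "response_received_count",
    "retry/max_reached",
)


class PrometheusStatsExtension:
    """Push a snapshot of crawler stats while the spider is still running."""

    def __init__(self, crawler):
        self.crawler = crawler
        self.items_seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def item_scraped(self, item, spider):
        self.items_seen += 1
        if self.items_seen % PUSH_EVERY_N_ITEMS == 0:
            self._push(spider)

    def spider_closed(self, spider):
        self._push(spider)

    def _push(self, spider):
        stats = self.crawler.stats.get_stats()
        registry = CollectorRegistry()
        for key in TRACKED_STATS:
            gauge = Gauge(key.replace("/", "_"), f"Scrapy stat {key}", registry=registry)
            gauge.set(stats.get(key, 0))
        push_to_gateway(GATEWAY, job=spider.name, registry=registry)
```

Something like this would be enabled through the project's EXTENSIONS setting, with Grafana reading the metrics from the Pushgateway.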
We've had better luck keeping our airflow jobs short and sweet.
With long-running tasks, you risk running into queue back-ups. And we've found the parallelism limits are not quite intuitive. Check out this post for a good breakdown.
In a case kind of like yours, when there's a lot of work to orchestrate and potentially retry, we reach for Kafka. The airflow dags pull messages off of a Kafka topic and then report success/failure via a Kafka wrapper.
We end up with several overlapping airflow tasks running in "micro-batches", each reading a tunable number of messages off Kafka, with the goal of keeping each airflow task under a certain run time.
By keeping the work in each airflow task small, we can easily scale the number of messages up or down to balance the overall task run time against the overall parallelism of the airflow workers.
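As a rough sketch (not our exact code), a micro-batch task along those lines, using kafka-python inside a function you'd wrap in a PythonOperator or @task, could look like this; the topic, group id, brokers, batch size, and per-message work are placeholders:

```python
from kafka import KafkaConsumer

TOPIC = "scrape-requests"        # placeholder topic
GROUP_ID = "airflow-scrapers"    # placeholder consumer group
BATCH_SIZE = 200                 # tune so each airflow task stays short
POLL_TIMEOUT_MS = 5000


def handle(message_value):
    """Placeholder for the real work (e.g. kicking off or checking a scrape)."""
    print("processing", message_value)


def run_micro_batch():
    """Read at most BATCH_SIZE messages, process them, then commit and exit.

    Meant to be wrapped in a PythonOperator / @task so each airflow task run
    stays well under the target run time.
    """
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="kafka:9092",  # placeholder brokers
        group_id=GROUP_ID,
        enable_auto_commit=False,
        auto_offset_reset="earliest",
    )
    try:
        batches = consumer.poll(timeout_ms=POLL_TIMEOUT_MS, max_records=BATCH_SIZE)
        for _partition, messages in batches.items():
            for message in messages:
                handle(message.value)
        consumer.commit()  # only commit after the whole micro-batch succeeded
    finally:
        consumer.close()
```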
Seems like you could explore something similar here?
Hope that helps!
I'm having some trouble with a Dataflow pipeline that reads from PubSub and writes to BigQuery.
I had to drain it to perform some more complex updates. When I reran the pipeline, it started reading from Pub/Sub at a normal rate, but after some minutes it stopped and now it is not reading messages from Pub/Sub anymore! The data watermark is almost one week behind and not progressing. There are more than 300k messages in the subscription to be read, according to Stackdriver.
It was running normally before the update, and now even if I downgrade my pipeline to the previous version (the one running before the update), I still can't get it to work.
I tried several configurations:
1) We use Dataflow autoscaling, and I tried starting the pipeline with more powerful workers (n1-standard-64) and limiting it to ten workers, but it neither improves performance nor autoscales (it keeps only the initial worker).
2) I tried providing more disk through diskSizeGb (2048) and diskType (pd-ssd), but still no improvement.
3) Checked PubSub quotas and pull/push rates, but it's absolutely normal.
Pipeline shows no errors or warnings, and just won't progress.
I checked instance resources, and CPU, RAM, and disk read/write rates are all okay compared to other pipelines. The only thing a little higher is the network rate: about 400k bytes/sec (2000 packets/sec) outgoing and 300k bytes/sec (1800 packets/sec) incoming.
What would you suggest I do?
The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam. Make sure you are following the documentation as a reference when you update. Quotas can be an issue for a slow-running pipeline or a lack of output, but you mentioned those are fine.
It seems the job itself needs to be looked at. I recommend opening an issue on the public issue tracker and we'll take a look. Make sure to provide your project ID, job ID, and all the necessary details.
I am searching for a way to make a Google DataFlow job stop ingesting from Pub/Sub when a (specific) exception happens.
The events from Pub/Sub are JSON read via PubsubIO.Read.Bound<TableRow> using TableRowJsonCoder and directly streamed to BigQuery with BigQueryIO.Write.Bound.
(There is a ParDo inbetween that changes the contents of one field and some custom partitioning by day happening, but that should be irrelevant for this purpose.)
When there are fields in the events/rows ingested from PubSub that are not columns in the destination BigQuery table, the DataFlow job logs IOExceptions at run time claiming it could not insert the rows, but seems to acknowledge these messages and continues running.
What I want to do instead is to stop ingesting messages from Pub/Sub and/or make the Dataflow job crash, so that alerting could be based on the age of oldest unacknowledged message. At the very least I want to make sure that those Pub/Sub messages that failed to be inserted to BigQuery are not ack'ed so that I can fix the problem, restart the Dataflow job and consume those messages again.
I know that one suggested solution for handling faulty input is described here: https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow
I am also aware of this PR on Apache Beam that would allow inserting the rows without the offending fields:
https://github.com/apache/beam/pull/1778
However, in my case I don't really want to guard against faulty input but rather against programmer errors, i.e. the case where new fields were added to the JSON messages pushed to Pub/Sub but the corresponding Dataflow job was not updated. So I don't really have faulty data; I simply want the job to crash when a programmer makes the mistake of not deploying a new Dataflow job before changing anything about the message format.
I assume it would be possible (analogous to the blog post solution) to create a custom ParDo that validates each row and throws an exception that isn't caught and leads to a crash.
But ideally, I would just like to have some configuration that does not handle the insert error and logs it but instead just crashes the job or at least stops ingestion.
You could have a ParDo with a DoFn that sits before the BQ write. The DoFn would be responsible for fetching the output table schema every X minutes and validating that each record to be written matches the expected output schema (throwing an exception if it doesn't).
Old Pipeline:
PubSub -> Some Transforms -> BQ Sink
New Pipeline:
PubSub -> Some Transforms -> ParDo(BQ Sink Validator) -> BQ Sink
This has the advantage that once someone fixes the output table schema, the pipeline will recover. You'll want to throw a good error message stating what's wrong with the incoming Pub/Sub message.
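A rough sketch of such a validator DoFn, shown here with the Beam Python SDK rather than the Dataflow Java SDK used in the question (the table name, refresh interval, and validation details are placeholders):

```python
import time

import apache_beam as beam
from google.cloud import bigquery

TABLE = "my-project.my_dataset.events"  # placeholder destination table
REFRESH_SECONDS = 600                   # re-fetch the schema every X minutes


class ValidateAgainstBQSchema(beam.DoFn):
    """Sits between the existing transforms and the BQ sink, as in the New Pipeline above.

    Fails loudly when a record carries fields the destination table doesn't have.
    """

    def setup(self):
        self._client = bigquery.Client()
        self._known_fields = None
        self._fetched_at = 0.0

    def _fields(self):
        now = time.time()
        if self._known_fields is None or now - self._fetched_at > REFRESH_SECONDS:
            table = self._client.get_table(TABLE)
            self._known_fields = {field.name for field in table.schema}
            self._fetched_at = now
        return self._known_fields

    def process(self, row):
        unknown = set(row) - self._fields()
        if unknown:
            # Raising here keeps the bundle from succeeding, so the offending
            # Pub/Sub messages are not acked and the error shows up in the job logs.
            raise ValueError(
                f"Record has fields not present in {TABLE}: {sorted(unknown)}; row={row}"
            )
        yield row
```

Because failed bundles are retried, the messages are not acked and nothing is lost while the schema is out of date.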
Alternatively, you could have the BQ Sink Validator instead output messages to a Pub/Sub DLQ (monitoring its size). Operationally, you would have to update the table and then re-ingest the DLQ back in as an input. This has the advantage that only the bad messages are held back, rather than blocking the whole pipeline's execution.