Which GCP Logs Explorer query will show the success message for data loaded into BigQuery by a Dataflow job, so that a log sink to Pub/Sub can be created - google-bigquery

I am running a Dataflow streaming job which reads data from a Pub/Sub topic and performs streaming inserts into BigQuery.
Once the data is loaded, I want to capture the success status from the Logs Explorer so I can send an acknowledgement back to another Pub/Sub topic.
Which Logs Explorer query can serve this purpose?
I tried the query below, but it did not help.
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.load.destinationTable.datasetId="REPLACE_WITH_YOUR_DATASET_ID"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.load.destinationTable.projectId="REPLACE_WITH_YOUR_PROJECT_ID"
protoPayload.methodName="jobservice.jobcompleted"
protoPayload.state="DONE"
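For context, once a working filter is found, I plan to create the Pub/Sub log sink from it, for example with the Cloud Logging Java client. A rough sketch (the sink name, destination topic and filter value are placeholders):

import com.google.cloud.logging.Logging;
import com.google.cloud.logging.LoggingOptions;
import com.google.cloud.logging.SinkInfo;
import com.google.cloud.logging.SinkInfo.Destination.TopicDestination;

public class CreateSuccessSink {
  public static void main(String[] args) {
    Logging logging = LoggingOptions.getDefaultInstance().getService();

    // Whatever filter ends up matching the success events goes here verbatim.
    String filter = "protoPayload.methodName=\"jobservice.jobcompleted\"";

    SinkInfo sinkInfo = SinkInfo
        .newBuilder("bq-success-sink",
            TopicDestination.of("REPLACE_WITH_YOUR_PROJECT_ID", "ack-topic"))
        .setFilter(filter)
        .build();

    logging.create(sinkInfo); // exports matching log entries to the Pub/Sub topic
  }
}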
Please help.
Thanking you,
Santanu

Related

Cloud Pub/Sub to BigQuery through Dataflow SQL

I would like to understand how a Dataflow pipeline works.
In my case, I have something published to Cloud Pub/Sub periodically, which Dataflow then writes to BigQuery. The volume of messages that come through is in the thousands, so my publisher client has batch settings of 1000 messages, 1 MB and 10 seconds of latency.
The question is: when messages are published in a batch as stated above, does Dataflow SQL take in the whole batch and write it to BigQuery in one go, or does it write one message at a time?
On the other hand, is there any benefit of one over the other?
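For reference, the publisher batching configuration I described looks roughly like this with the Java client (the project and topic names are placeholders):

import com.google.api.gax.batching.BatchingSettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

public class BatchedPublisher {
  public static void main(String[] args) throws Exception {
    // Thresholds from above: 1000 messages, 1 MB, 10 seconds of latency.
    BatchingSettings batchingSettings = BatchingSettings.newBuilder()
        .setElementCountThreshold(1000L)
        .setRequestByteThreshold(1_000_000L)
        .setDelayThreshold(Duration.ofSeconds(10))
        .build();

    Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "my-topic"))
        .setBatchingSettings(batchingSettings)
        .build();

    // publish() calls are grouped by the client into batched publish requests
    // once any of the thresholds above is reached.
    publisher.publish(PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8("{\"example\": true}"))
        .build());

    publisher.shutdown();
  }
}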
Please comment if any other details are required. Thanks
Dataflow SQL is just a way to define, with SQL syntax, an Apache Beam pipeline and to run it on Dataflow.
Because the source is Pub/Sub, a streaming pipeline is created based on your SQL definition. When you run your SQL command, a Dataflow job starts and waits for messages from Pub/Sub.
If you publish a bunch of messages, Dataflow is able to scale up to process them as soon as possible.
Keep in mind that Dataflow streaming never scales to 0, and therefore you will always pay for 1 or more VMs to keep your pipeline up and running.
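To make that concrete, a Pub/Sub-to-BigQuery Dataflow SQL job boils down to something like the following Beam Java pipeline (a rough sketch; the project, subscription, destination table and the single "payload" column are placeholder assumptions). Messages are handled element by element, and under the hood the streaming-insert sink batches rows before calling the BigQuery API:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubSubToBigQuery {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true); // Pub/Sub is unbounded, so this is a streaming pipeline

    Pipeline p = Pipeline.create(options);

    p.apply("Read Pub/Sub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"))
     .apply("To TableRow", ParDo.of(new DoFn<String, TableRow>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         // Each Pub/Sub message is converted individually here.
         c.output(new TableRow().set("payload", c.element()));
       }
     }))
     .apply("Write BigQuery", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }
}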

Duplicates in Pub/Sub-Dataflow-BigQuery pipeline

Consider the following setup:
Pub/Sub
Dataflow: streaming job for validating events from Pub/Sub, unpacking and writing to BigQuery
BigQuery
We have counters on the valid events that pass through our Dataflow pipeline and observe that the counters are higher than the number of events that were available in Pub/Sub.
Note: It seems we also see duplicates in BigQuery, but we are still investigating this.
The following error can be observed in the Dataflow logs:
Pipeline stage consuming pubsub took 1h35m7.83313078s and default ack deadline is 5m.
Consider increasing ack deadline for subscription projects/<redacted>/subscriptions/<redacted>
Note that the Dataflow job is started when there are already millions of messages waiting in Pub/Sub.
Questions:
Can this cause duplicate events to be picked up by the pipeline?
Is there anything we can do to alleviate this issue?
My recommendation is to purge the Pub/Sub subscription's message backlog by first launching the Dataflow job in batch mode, and then running it in streaming mode for the usual operation. That way, your streaming job starts from a clean basis instead of a long list of enqueued messages.
In addition, it's one of the strengths of Dataflow (and Beam) that the same pipeline can run in both streaming and batch mode.
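As a side note, if you want to follow the log message's suggestion and raise the subscription's ack deadline, a sketch with the Pub/Sub admin client looks like this (the subscription name is a placeholder; 600 seconds is the Pub/Sub maximum):

import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.protobuf.FieldMask;
import com.google.pubsub.v1.Subscription;
import com.google.pubsub.v1.UpdateSubscriptionRequest;

public class IncreaseAckDeadline {
  public static void main(String[] args) throws Exception {
    try (SubscriptionAdminClient client = SubscriptionAdminClient.create()) {
      Subscription subscription = Subscription.newBuilder()
          .setName("projects/my-project/subscriptions/my-subscription")
          .setAckDeadlineSeconds(600) // maximum allowed by Pub/Sub
          .build();
      UpdateSubscriptionRequest request = UpdateSubscriptionRequest.newBuilder()
          .setSubscription(subscription)
          .setUpdateMask(FieldMask.newBuilder().addPaths("ack_deadline_seconds").build())
          .build();
      client.updateSubscription(request);
    }
  }
}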

Add alert when transfer in BigQuery results in "succeeded 0 jobs"

I have a scheduled transfer running daily on BigQuery, mostly without any issues. The transfer reads a .csv file from an AWS S3 bucket and appends the information to a BigQuery table.
Recently there has been an issue where the transfer resulted in neither succeeded nor failed jobs.
(screenshot: transfer logs)
The outcome was that no entries were imported but also no alert was triggered; I had to hear from the reports' users that something had gone wrong.
Question: is there a way to add an alert on BigQuery Transfers for when successful jobs = 0?
BigQuery does have monitoring, although it has some known issues as well; the BigQuery Monitoring documentation will help.
Go to Monitoring -> Dashboard -> Add Chart, then use resource type "Global" and metric type "Uploaded rows".

Sending data from website to BigQuery using Pub/Sub and Cloud Functions

Here's what I'm trying to accomplish
A visitor lands on my website
Javascript collects some information and sends a hit
The hit is processed and inserted into BigQuery
And here's how I have planned to solve it
The hit is sent to a Cloud Functions HTTP trigger (using Ajax)
The Cloud Function sends a message to Pub/Sub
Pub/Sub sends the data to another Cloud Function using a Pub/Sub trigger
The second Cloud Function processes the hit into a BigQuery row and inserts it into BigQuery
Is there a simpler way to solve this?
Some other details to take into account
There are around 1 million hits a day
Don't want to use Cloud Dataflow because it inflates the costs
Can't (probably) skip Pub/Sub because some hits are sent when a person is leaving the site and the request might not have enough time to process everything.
You can use BigQuery streaming inserts; this option is less expensive and you avoid hitting the load jobs quota of 1,000 loads per table per day.
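For example, the core of the second Cloud Function could stream each processed hit with the BigQuery Java client roughly like this (a sketch; the dataset, table and column names are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.HashMap;
import java.util.Map;

public class HitStreamer {

  private static final BigQuery BIGQUERY = BigQueryOptions.getDefaultInstance().getService();

  /** Streams a single processed hit into BigQuery. */
  public static void insertHit(String visitorId, String page, long timestampMillis) {
    Map<String, Object> row = new HashMap<>();
    row.put("visitor_id", visitorId);
    row.put("page", page);
    row.put("ts", timestampMillis);

    InsertAllRequest request = InsertAllRequest
        .newBuilder(TableId.of("my_dataset", "hits"))
        .addRow(row)
        .build();

    InsertAllResponse response = BIGQUERY.insertAll(request);
    if (response.hasErrors()) {
      // Streaming inserts report per-row errors instead of throwing.
      throw new RuntimeException("Insert failed: " + response.getInsertErrors());
    }
  }
}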
Another option, if you don't mind the data taking a long time to load, is to store all the info in a Cloud Storage bucket and then load the data with a transfer. You can schedule it so that the data is uploaded daily. This solution is focused on a batch environment in which you store all the info on one side and then transfer it to the final destination. If you only want streaming, the solution you mentioned is OK.
It's up to you to choose the option that best fits your specific usage.

How to Crash/Stop DataFlow Pub/Sub Ingestion on BigQuery Insert Error

I am searching for a way to make a Google DataFlow job stop ingesting from Pub/Sub when a (specific) exception happens.
The events from Pub/Sub are JSON, read via PubsubIO.Read.Bound<TableRow> using TableRowJsonCoder and streamed directly to BigQuery with BigQueryIO.Write.Bound.
(There is a ParDo in between that changes the contents of one field, and some custom partitioning by day, but that should be irrelevant for this purpose.)
When there are fields in the events/rows ingested from Pub/Sub that are not columns in the destination BigQuery table, the DataFlow job logs IOExceptions at run time claiming it could not insert the rows, but it seems to acknowledge these messages and continues running.
What I want to do instead is to stop ingesting messages from Pub/Sub and/or make the Dataflow job crash, so that alerting could be based on the age of the oldest unacknowledged message. At the very least I want to make sure that the Pub/Sub messages that failed to be inserted into BigQuery are not ack'ed, so that I can fix the problem, restart the Dataflow job and consume those messages again.
I know that one suggested solution for handling faulty input is described here: https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow
I am also aware of this PR on Apache Beam that would allow inserting the rows without the offending fields:
https://github.com/apache/beam/pull/1778
However, in my case I don't really want to guard against faulty input but rather against programmer errors, i.e. the fact that new fields were added to the JSON messages which are pushed to Pub/Sub, but the corresponding DataFlow job was not updated. So I don't really have faulty data; I simply want to crash when a programmer makes the mistake of not deploying a new Dataflow job before changing anything about the message format.
I assume it would be possible (analogous to the blog post solution) to create a custom ParDo that validates each row and throws an exception that isn't caught and leads to a crash.
But ideally, I would just like to have some configuration that does not handle the insert error and logs it but instead just crashes the job or at least stops ingestion.
You could have a ParDo with a DoFn which sits before the BQ write. The DoFn would be responsible for fetching the output table schema every X minutes and for validating that each record to be written matches the expected output schema (throwing an exception if it doesn't).
Old Pipeline:
PubSub -> Some Transforms -> BQ Sink
New Pipeline:
PubSub -> Some Transforms -> ParDo(BQ Sink Validator) -> BQ Sink
This has the advantage that once someone fixes the output table schema, the pipeline will recover. You'll want to throw a good error message stating what's wrong with the incoming Pub/Sub message.
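A minimal sketch of such a validator DoFn (the refresh interval, table reference and failure behaviour are illustrative assumptions; throwing makes the bundle fail, so the offending message is retried instead of ack'ed):

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.TableId;
import java.time.Duration;
import java.time.Instant;
import java.util.HashSet;
import java.util.Set;
import org.apache.beam.sdk.transforms.DoFn;

/** Fails the bundle when an incoming TableRow has fields the destination table does not know. */
public class BqSchemaValidatorFn extends DoFn<TableRow, TableRow> {

  private static final Duration REFRESH_INTERVAL = Duration.ofMinutes(10); // "every X mins"

  private final String project;
  private final String dataset;
  private final String table;

  private transient BigQuery bigquery;
  private transient Set<String> knownFields;
  private transient Instant lastRefresh;

  public BqSchemaValidatorFn(String project, String dataset, String table) {
    this.project = project;
    this.dataset = dataset;
    this.table = table;
  }

  @Setup
  public void setup() {
    // The client is not serializable, so create it on the worker.
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    maybeRefreshSchema();
    TableRow row = c.element();
    for (String fieldName : row.keySet()) {
      if (!knownFields.contains(fieldName)) {
        // Unknown field: fail loudly so the message is retried instead of silently dropped.
        throw new IllegalStateException("Field '" + fieldName + "' is not a column of "
            + dataset + "." + table + "; update the table/pipeline before sending this format.");
      }
    }
    c.output(row);
  }

  private void maybeRefreshSchema() {
    Instant now = Instant.now();
    if (lastRefresh != null
        && Duration.between(lastRefresh, now).compareTo(REFRESH_INTERVAL) < 0) {
      return;
    }
    Set<String> fields = new HashSet<>();
    for (Field field : bigquery.getTable(TableId.of(project, dataset, table))
        .getDefinition().getSchema().getFields()) {
      fields.add(field.getName());
    }
    knownFields = fields;
    lastRefresh = now;
  }
}

It would be wired in just before the sink, e.g. ParDo.of(new BqSchemaValidatorFn("my-project", "my_dataset", "my_table")).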
Alternatively, you could have the BQ Sink Validator instead output failing messages to a Pub/Sub DLQ (monitoring its size). Operationally, you would have to update the table and then re-ingest the DLQ as an input. This has the advantage that only the bad messages are held back in the DLQ rather than blocking the whole pipeline.
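And a sketch of that DLQ variant, side-outputting rows that fail the check to a Pub/Sub topic instead of throwing (the tags, topic, table and the hard-coded field set are placeholders; in practice the set would be refreshed from the live table schema as in the validator above):

import com.google.api.services.bigquery.model.TableRow;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class ValidateWithDlq {

  private static final TupleTag<TableRow> VALID = new TupleTag<TableRow>() {};
  private static final TupleTag<String> DLQ = new TupleTag<String>() {};

  /** Wires the validator between the existing transforms and the sinks. */
  public static void attachSinks(PCollection<TableRow> rows) {
    // Placeholder field set; refresh it from the table schema as in the sketch above.
    final Set<String> knownFields = new HashSet<>(Arrays.asList("field_a", "field_b"));

    PCollectionTuple validated = rows.apply("BQ Sink Validator",
        ParDo.of(new DoFn<TableRow, TableRow>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                TableRow row = c.element();
                if (knownFields.containsAll(row.keySet())) {
                  c.output(row);                 // good rows continue to BigQuery
                } else {
                  c.output(DLQ, row.toString()); // bad rows go to the DLQ topic
                }
              }
            })
            .withOutputTags(VALID, TupleTagList.of(DLQ)));

    validated.get(VALID).apply("Write BQ",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
    validated.get(DLQ).apply("Write DLQ",
        PubsubIO.writeStrings().to("projects/my-project/topics/bq-dlq"));
  }
}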