How to retrieve the root cause of a Dataflow pipeline job failure - error-handling

I am running my pipeline in Dataflow. I want to collect all error messages from a Dataflow job using its id. I am using Apache Beam 2.3.0 and Java 8.
DataflowPipelineJob dataflowPipelineJob = ((DataflowPipelineJob) entry.getValue());
String jobId = dataflowPipelineJob.getJobId();
DataflowClient client = DataflowClient.create(options);
Job job = client.getJob(jobId);
Is there any way to receive only the error messages from the pipeline?

Programmatic support for reading Dataflow log messages is not very mature, but there are a couple options:
Since you already have the DataflowPipelineJob instance, you could use the waitUntilFinish() overload which accepts a JobMessagesHandler parameter to filter and capture error messages. You can see how DataflowPipelineJob uses this in its own waitUntilFinish() implementation.
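For illustration, a rough sketch of that first option (it assumes the Dataflow runner's MonitoringUtil.JobMessagesHandler and the JOB_MESSAGE_ERROR importance level; the overload may also declare IOException/InterruptedException, so wrap accordingly):
import com.google.api.services.dataflow.model.JobMessage;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.runners.dataflow.util.MonitoringUtil;
import org.joda.time.Duration;

// Collect only error-level messages while waiting for the job to finish.
List<String> errors = new ArrayList<>();
MonitoringUtil.JobMessagesHandler errorHandler =
    messages -> {
      for (JobMessage message : messages) {
        if ("JOB_MESSAGE_ERROR".equals(message.getMessageImportance())) {
          errors.add(message.getMessageText());
        }
      }
    };
// dataflowPipelineJob is the DataflowPipelineJob instance from the question above.
dataflowPipelineJob.waitUntilFinish(Duration.standardMinutes(60), errorHandler);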
Alternatively, you can query job logs using the Dataflow REST API: projects.jobs.messages/list. The API takes in a minimumImportance parameter which would allow you to query just for errors.
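And a hedged sketch of the REST option, assuming the generated google-api-services-dataflow (v1b3) client; the auth/builder setup is omitted and the exact method names should be checked against your client version:
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.JobMessage;
import com.google.api.services.dataflow.model.ListJobMessagesResponse;

// 'dataflow' is an authenticated Dataflow API client built elsewhere;
// 'projectId' and 'jobId' identify the job whose messages you want.
ListJobMessagesResponse response =
    dataflow.projects().jobs().messages()
        .list(projectId, jobId)
        .setMinimumImportance("JOB_MESSAGE_ERROR")
        .execute();
if (response.getJobMessages() != null) {
  for (JobMessage message : response.getJobMessages()) {
    System.out.println(message.getTime() + " " + message.getMessageText());
  }
}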
Note that in both cases, there may be error messages which are not fatal and don't directly cause job failure.

Related

Load from GCS to GBQ causes an internal BigQuery error

My application creates thousands of "load jobs" daily to load data from Google Cloud Storage URIs to BigQuery, and only a few of them fail with the error:
"Finished with errors. Detail: An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 7916072"
The application is written in Python and uses these libraries:
google-cloud-storage==1.42.0
google-cloud-bigquery==2.24.1
google-api-python-client==2.37.0
The load job is done by calling:
load_job = self._client.load_table_from_uri(
    source_uris=source_uri,
    destination=destination,
    job_config=job_config,
)
This method has a default parameter:
retry: retries.Retry = DEFAULT_RETRY,
so the job should automatically retry on such errors.
ID of a specific job that finished with the error:
"load_job_id": "6005ab89-9edf-4767-aaf1-6383af5e04b6"
"load_job_location": "US"
After getting the error, the application recreates the job, but that doesn't help.
Subsequent failed job ids:
5f43a466-14aa-48cc-a103-0cfb4e0188a2
43dc3943-4caa-4352-aa40-190a2f97d48d
43084fcd-9642-4516-8718-29b844e226b1
f25ba358-7b9d-455b-b5e5-9a498ab204f7
...
As mentioned in the error message, wait according to the back-off requirements described in the BigQuery Service Level Agreement, then try the operation again.
If the error continues to occur and you have a support plan, please create a new GCP support case. Otherwise, you can open a new issue on the issue tracker describing your problem. You can also try to reduce the frequency of this error by using Reservations.
For more information about the error messages, you can refer to this document.

Azure Data Factory: how to catch any error on any activity and log it into a database?

I am currently working on error handling in ADF.
I know how to get the error from a particular activity by using @activity('').error.message and then pass it to a database.
Do you know a generic way that lets us catch any error on any activity in a pipeline?
Best regards!
You can leverage the flow path dependency aspect within Azure Data Factory to manage error logging through a single activity rather than duplicating the same activities:
The blog below explains this in detail: https://datasharkx.wordpress.com/2021/08/19/error-logging-and-the-art-of-avoiding-redundant-activities-in-azure-data-factory/
Below are the basic principles that need to be followed:
Multiple dependencies with the same source are OR’ed together.
Multiple dependencies with different sources are AND’ed together.
Together this looks like (Act_3fails OR Act_3Skipped) AND (Act_2Completes OR Act_2skipped)
When you have a certain number of activities in pipeline1 and you want to capture the error message when one of them fails, using the Execute Pipeline activity is a suitable approach.
When you use an Execute Pipeline activity to trigger pipeline1 and any activity inside pipeline1 fails, it throws an error message that includes the name of the activity that failed.
Look at the following demonstration. I have a pipeline named p1 which has 3 activities: Get Metadata, ForEach and Script.
I am triggering the p1 pipeline from the p2 pipeline using an Execute Pipeline activity and storing the error message in a variable using a Set Variable activity. For the demonstration, I made each activity fail in turn (3 runs).
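The expression used in the Set Variable activity looks roughly like this (the Execute Pipeline activity name 'Execute pipeline1' is only an assumed example):
@activity('Execute pipeline1').error.message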
When the Get Metadata activity fails, the error message captured in the p2 pipeline is as follows:
Operation on target Get Metadata1 failed: The required Blob is missing. ContainerName: data2, path: data2/files/.
When only the ForEach activity in the p1 pipeline fails, the error message captured in p2 is:
Operation on target ForEach1 failed: Activity failed because an inner activity failed
When only the Script activity in the p1 pipeline fails, the error message captured in p2 is:
Operation on target Script1 failed: Invalid object name 'mydemotable'
So, using an Execute Pipeline activity (to execute pipeline1) lets you capture the error message from whichever activity fails inside the target pipeline.

Change Google Cloud Dataflow BigQuery Priority

I have a Beam job running on Google Cloud Dataflow that reads data from BigQuery. When I run the job it takes minutes for it to start reading data from the (tiny) table. It turns out the Dataflow job kicks off a BigQuery job which runs in BATCH mode and not in INTERACTIVE mode. How can I switch this to run immediately in Apache Beam? I couldn't find a method in the API to change the priority.
Maybe a Googler will correct me, but no, you cannot change this from BATCH to INTERACTIVE because it's not exposed by Beam's API.
From org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.java (here):
private void executeQuery(
    String executingProject,
    String jobId,
    TableReference destinationTable,
    JobService jobService) throws IOException, InterruptedException {
  JobReference jobRef = new JobReference()
      .setProjectId(executingProject)
      .setJobId(jobId);

  JobConfigurationQuery queryConfig = createBasicQueryConfig()
      .setAllowLargeResults(true)
      .setCreateDisposition("CREATE_IF_NEEDED")
      .setDestinationTable(destinationTable)
      .setPriority("BATCH")              // <-- NOT EXPOSED
      .setWriteDisposition("WRITE_EMPTY");

  jobService.startQueryJob(jobRef, queryConfig);
  Job job = jobService.pollJob(jobRef, JOB_POLL_MAX_RETRIES);
  if (parseStatus(job) != Status.SUCCEEDED) {
    throw new IOException(String.format(
        "Query job %s failed, status: %s.", jobId, statusToPrettyString(job.getStatus())));
  }
}
If it's really a problem for you that the query runs in BATCH mode, then one workaround could be (a rough sketch follows the steps):
1. Using the BigQuery API directly, roll your own initial request and set the priority to INTERACTIVE.
2. Write the results of step 1 to a temp table.
3. In your Beam pipeline, read the temp table using BigQueryIO.Read.from().
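A hedged sketch of those three steps, assuming the google-cloud-bigquery client and a recent Beam SDK (readTableRows() takes the place of the older BigQueryIO.Read.from(); every project/dataset/table name below is a placeholder):
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class InteractiveQueryWorkaround {
  public static void main(String[] args) throws InterruptedException {
    // Steps 1 and 2: run the query at INTERACTIVE priority and materialize it into a temp table.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    QueryJobConfiguration queryConfig =
        QueryJobConfiguration.newBuilder("SELECT * FROM `my_project.my_dataset.tiny_table`")
            .setPriority(QueryJobConfiguration.Priority.INTERACTIVE)
            .setDestinationTable(TableId.of("my_project", "my_dataset", "temp_table"))
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
            .build();
    bigquery.create(JobInfo.of(queryConfig)).waitFor();

    // Step 3: read the materialized temp table in the Beam pipeline.
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    PCollection<TableRow> rows =
        pipeline.apply(BigQueryIO.readTableRows().from("my_project:my_dataset.temp_table"));
    // ... apply downstream transforms to 'rows' here ...
    pipeline.run();
  }
}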
You can configure the queries to run with "Interactive" priority by passing a priority parameter. Check this GitHub example for details.
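For illustration, a hedged sketch of that approach, assuming a newer Beam release where the read transform exposes withQueryPriority(...); the query and table names are placeholders:
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.QueryPriority;
import org.apache.beam.sdk.values.PCollection;

// 'pipeline' is an existing Pipeline instance.
PCollection<TableRow> rows =
    pipeline.apply(
        BigQueryIO.readTableRows()
            .fromQuery("SELECT * FROM `my_project.my_dataset.tiny_table`")
            .usingStandardSql()
            .withQueryPriority(QueryPriority.INTERACTIVE));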
Please note that you might be running into BigQuery limits and quotas: with batch priority, if you ever hit a rate limit the query is queued and retried later, whereas an interactive query that hits those limits fails immediately. This is because BigQuery assumes that an interactive query is something you need to run immediately.

Camel sql component with cron

I am trying to set up a scheduler in order to use a cron expression.
<camel:endpoint id="sqlEndpoint" uri="sql:${sqlQuery}?scheduler=spring&scheduler.cron=0+6+8+*+*&dataSourceRef=veloxityDS&useIterator=false"/>
But when I run this as a consumer, this exception occurred:
org.apache.camel.FailedToCreateConsumerException: Failed to create Consumer for endpoint:
Endpoint[sql://$select * from dual?dataSourceRef=veloxityDS&scheduler=spring&scheduler.cron=0+6+8+*+*&useIterator=false].
Reason: There are 1 scheduler parameters that couldn't be set on the endpoint.
Check the uri if the parameters are spelt correctly and that they are properties of the endpoint.
Unknown parameters=[{cron=0 6 8 * *}]
Any ideas?
The endpoint you are trying to create is using parameters that don't exist. There is a full list of parameters at: http://camel.apache.org/sql-component.html
If you want your SQL procedure to run on a time interval you can use a quartz endpoint, a polling consumer, or a route scheduler, depending on your needs (a sketch of the quartz approach follows at the end of this answer).
http://camel.apache.org/polling-consumer.html
http://camel.apache.org/quartz2.html
http://camel.apache.org/cronscheduledroutepolicy.html
Current parameter issues on your endpoint:
scheduler - not a supported parameter
scheduler.cron - not a supported parameter
dataSourceRef - deprecated.
Your scheduling alternatives leveraging only the sql endpoint are:
consumer.delay
consumer.initialDelay
consumer.useFixedDelay
maxMessagesPerPoll
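As an illustration of the quartz2 alternative mentioned above, here is a rough Java DSL sketch (the group/timer names are made up; the cron fires daily at 08:06:00 like the original expression, with the '?' that Quartz requires in the day-of-week field):
import org.apache.camel.builder.RouteBuilder;

public class ScheduledSqlRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Fire daily at 08:06:00 (Quartz cron: second minute hour day-of-month month day-of-week),
        // then let the SQL producer run the query against the registered DataSource bean.
        from("quartz2://veloxity/dailyQuery?cron=0+6+8+*+*+?")
            .to("sql:select * from dual?dataSource=#veloxityDS")
            .to("log:sqlResult");
    }
}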

Pentaho Data Integration: Error Handling

I'm building out an ETL process with Pentaho Data Integration (CE) and I'm trying to operationalize my Transformations and Jobs so that they can be monitored. Specifically, I want to be able to catch any errors and then send them to an error reporting service like Honeybadger or New Relic. I understand how to do row-level error reporting but I don't see a way to do job or transformation failure reporting.
Here is an example job.
The down path is where the transformation succeeds but has row errors. There we can just filter the results and log them.
The path to the right is the case where the transformation fails altogether (e.g. DB credentials are wrong). This is where I'm having trouble: I can't figure out how to get the error info to be sent.
How do I capture transformation failures to be logged?
You cannot capture job-level error details inside the job itself.
However, there are other options for monitoring.
First option is using database logging for transformations or jobs (see the "Log" tab in the job/trans parameters dialog) - this way you always have up-to-date information about the execution status so you can, say, write a job that periodically scans the logging database and sends error reports wherever you need.
That said, this option can be pretty heavyweight to develop and support, and not very flexible for further modifications. So in our company we ended up with monitoring at the job-execution level: when you run a job with kitchen.bat and it fails for any reason, kitchen exits with an "error" status, which you can easily examine and act on with whatever tools you like - .bat commands, PowerShell or (in our case) Jenkins CI.
You could use the writeToLog("e", "Message") function in the Modified Java Script Value step.
Documentation:
// Writes a string to the defined Kettle Log.
//
// Usage:
// writeToLog(var);
//   1: String - The Message which should be written to
//      the Kettle Debug Log
//
// writeToLog(var,var);
//   1: String - The Type of the Log
//        d - Debug
//        l - Detailed
//        e - Error
//        m - Minimal
//        r - RowLevel
//
//   2: String - The Message which should be written to
//      the Kettle Log