Pentaho Data Integration: Error Handling - pentaho

I'm building out an ETL process with Pentaho Data Integration (CE) and I'm trying to operationalize my Transformations and Jobs so that they'll be able to be monitored. Specifically, I want to be able to catch any errors and then send them to an error reporting service like Honeybadger or New Relic. I understand how to do row-level error reporting but I don't see a way to do job or transaction failure reporting.
Here is an example job.
The down path is where the transformation succeeds but has row errors. There we can just filter the results and log them.
The path to the right is the case where the transformation fails all-together (e.g. DB credentials are wrong). This is where I'm having trouble: I can't figure out how to get the error info to be sent.
How do I capture transformation failures to be logged?

You can not capture job-level errors details inside the job itself.
However there are other options for monitoring.
First option is using database logging for transformations or jobs (see the "Log" tab in the job/trans parameters dialog) - this way you always have up-to-date information about the execution status so you can, say, write a job that periodically scans the logging database and sends error reports wherever you need.
Meanwhile this option seems to be something pretty heavy-weight for development and support and not too flexible for further modifications. So in our company we ended up with monitoring on a job-execution level - i.e. when you run a job with kitchen.bat and it fails by any reason you get an "error" status of execution of the kitchen, so you can easily examine it and perform necessary actions with whenever tools you'd like - .bat commands, PowerShell or (in our case) Jenkins CI.

You could use the writeToLog("e", "Message") function in the Modified Java Script step.
Documentation:
// Writes a string to the defined Kettle Log.
//
// Usage:
// writeToLog(var);
// 1: String - The Message which should be written to
// the Kettle Debug Log
//
// writeToLog(var,var);
// 1: String - The Type of the Log
// d - Debug
// l - Detailed
// e - Error
// m - Minimal
// r - RowLevel
//
// 2: String - The Message which should be written to
// the Kettle Log

Related

How to make the SSIS package status to failure when propagate was set to false for a Sequence container

I have an SSIS package with for each loop > sequence container. The sequence container is trying to read file from For each loop and process its data. The requirement was to not fail the entire package when any exception happened in processing a file but to continue processing the next file until all the files were processed from the for each loop. For this, I have set the Propagate variable for the sequence container to False. I have also added email step on On Error event of Sequence container. The package is running as expected and able to process all files even when any exception happened with any file. But I would like the status of my SSIS package to be failed finally since one of the files got failed. How can I achieve that ?
Did you try this options?
(SSIS version in russian on the left side but it's sequence container)
View -> Properties window -> Then click on your sequence container and it will show you ther properties of sequence container.
If i were you first of all i would try property "FailPackageOnFailture" - it should cover your question if i get it right.
P.S. Also you can see the whole properties of your project when you click on a free place in your project
UPDATED (after comments and more clear understanding task):
The idea is - set this param Maximum ErrorCount for SQ as max as you want - in this case it wont stop the package because 1 of the files was failed in SQ and next file will process, but it should stop package after SQ will finish his work because you don't change MaximumErrorCount for package.
Important - a value of zero sets the error count threshold to infinity and package or task never get's Failure

Persist detailed information about failed Item processing

I´ve got a Job that runs a TaskletStep, then a chunk-based step and then another TaskletStep.
In each of these steps, errors (in the form of Exceptions) can occur.
The chunk-based step looks like this:
stepBuilderFactory
.get("step2")
.chunk<SomeItem, SomeItem>(1)
.reader(flatFileItemReader)
.processor(itemProcessor)
.writer {}
.faultTolerant()
.skipPolicy { _ , _ -> true } // skip all Exceptions and continue
.taskExecutor(taskExecutor)
.throttleLimit(taskExecutor.corePoolSize)
.build()
The whole job definition:
jobBuilderFactory.get("job1")
.validator(validator())
.preventRestart()
.start(taskletStep1)
.next(step2)
.next(taskletStep2)
.build()
I expected that Spring Batch somehow picks up the Exceptions that occur along the way, so I can then create a Report including them after the Job has finished processing. Looking at the different contexts, there´s also fields that should contain failureExceptions. However, it seems there´s no such information (especially for the chunked step).
What would be a good approach if I need information about:
what Exceptions did occur in which Job execution
which Item was the one that triggered it
The JobExecution provides a method to get all failure exceptions that happened during the job. You can use that in a JobExecutionListener#afterJob(JobExecution jobExecution) to generate your report.
In regards to which items caused the issue, this will depend on where the exception happens (during the read, process or write operation). For this requirement, you can use one of the ItemReadListener, ItemProcessListener or ItemWriteListener to keep record of the those items (For example, by adding them to the job execution context to be able to get access to them in the JobExecutionListener#afterJob method for your report).

How to receive root cause for Pipeline Dataflow job failure

I am running my pipeline in Dataflow. I want to collect all error messages from Dataflow job using its id. I am using Apache-beam 2.3.0 and Java 8.
DataflowPipelineJob dataflowPipelineJob = ((DataflowPipelineJob) entry.getValue());
String jobId = dataflowPipelineJob.getJobId();
DataflowClient client = DataflowClient.create(options);
Job job = client.getJob(jobId);
Is there any way to receive only error message from pipeline?
Programmatic support for reading Dataflow log messages is not very mature, but there are a couple options:
Since you already have the DataflowPipelineJob instance, you could use the waitUntilFinish() overload which accepts a JobMessagesHandler parameter to filter and capture error messages. You can see how DataflowPipelineJob uses this in its own waitUntilFinish() implementation.
Alternatively, you can query job logs using the Dataflow REST API: projects.jobs.messages/list. The API takes in a minimumImportance parameter which would allow you to query just for errors.
Note that in both cases, there may be error messages which are not fatal and don't directly cause job failure.

Change Google Cloud Dataflow BigQuery Priority

I have a Beam job running on Google Cloud DataFlow that reads data from BigQuery. When I run the job it takes minutes for the job to start reading data from the (tiny) table. It turns out the dataflow job sends of a BigQuery job which runs in BATCH mode and not in INTERACTIVE mode. How can I switch this to run immediately in Apache Beam? I couldn't find a method in the API to change the priority.
Maybe a Googler will correct me, but no, you cannot change this from BATCH to INTERACTIVE because it's not exposed by Beam's API.
From org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.java (here):
private void executeQuery(
String executingProject,
String jobId,
TableReference destinationTable,
JobService jobService) throws IOException, InterruptedException {
JobReference jobRef = new JobReference()
.setProjectId(executingProject)
.setJobId(jobId);
JobConfigurationQuery queryConfig = createBasicQueryConfig()
.setAllowLargeResults(true)
.setCreateDisposition("CREATE_IF_NEEDED")
.setDestinationTable(destinationTable)
.setPriority("BATCH") <-- NOT EXPOSED
.setWriteDisposition("WRITE_EMPTY");
jobService.startQueryJob(jobRef, queryConfig);
Job job = jobService.pollJob(jobRef, JOB_POLL_MAX_RETRIES);
if (parseStatus(job) != Status.SUCCEEDED) {
throw new IOException(String.format(
"Query job %s failed, status: %s.", jobId, statusToPrettyString(job.getStatus())));
}
}
If it's really a problem for you that the query is running in BATCH mode, then one workaround could be:
Using the BigQuery API directly, roll your own initial request, and set the priority to INTERACTIVE.
Write the results of step 1 to a temp table
In your Beam pipeline, read the temp table using BigQueryIO.Read.from()
You can configure to run the queries with "Interactive" priority by passing a priority parameter. Check this Github example for details.
Please note that you might be reaching one of the BigQuery limits and quotas as when you use batch, if you ever hit a rate limit, the query will be queued and retried later. As opposed to the interactive ones, when if these limits are hit, the query will fail immediately. This is because BigQuery assumes that an interactive query is something you need run immediately.

Determine actual errors from a load job

Using the Java SDK I am creating a load job for just a single record with a fairly complicated schema. When monitoring the status of the load job, it takes a surprisingly long time (but perhaps this is due to working out the schema), but then says:
11:21:06.975 [main] INFO xxx.GoogleBigQuery - Job status (21694ms) create_scans_1384744805079_172221126: DONE
11:24:50.618 [main] ERROR xxx.GoogleBigQuery - Job create_scans_1384744805079_172221126 caused error (invalid) with message
Too many errors encountered. Limit is: 0.
11:24:50.810 [main] ERROR xxx.GoogleBigQuery - {
"message" : "Too many errors encountered. Limit is: 0.",
"reason" : "invalid"
?}
BTW - how do I tell the job that it can have more than zero errors using Java?
This load job does not appear in the list of recent jobs in the console, and as far as I can see, none of the Java objects contains any more details about the actual errors encountered. So how can I pro-grammatically find out what is going wrong? All I can find is:
if (err != null) {
log.error("Job {} caused error ({}) with message\n{}", jobID, err.getReason(), err.getMessage());
try {
log.error(err.toPrettyString());
}
...
In general I am having a difficult time finding good documentation for some of these things and am working it out by trial and error and short snippets of code found on here and older groups. If there is a better source of information than the getting started guides, then I would appreciate any pointers to that information. The Javadoc does not really help and I cannot find any complete examples of loading, querying, testing for errors, cataloging errors and so on.
This job is submitted via a NEWLINE_DELIMITIED_JSON record, supplied to the job via:
InputStream dummy = getClass().getResourceAsStream("/googlebigquery/xxx.record");
final InputStreamContent jsonIn = new InputStreamContent("application/octet-stream", dummy);
createTableJob = bigQuery.jobs().insert(projectId, loadJob, jsonIn).execute();
My authentication and so on seems to work correctly as separate Java code to list the projects, and the datasets in the project all works correctly. So I just need help in working what the actual error is - does it not like the schema (I have records nested within records for instance), or does it think that there is an error in the data I am submitting.
Thanks in advance for any help. The job number cited above is an actual failed load job if that helps any Google staffers who might read this.
It sounds like you have a couple of questions, so I'll try to address them all.
First, the way to get the status of the job that failed is to call jobs().get(jobId), which returns a job object that has an errorResult object that has the error that caused the job to fail (e.g. "too many errors"). The errorStream list is a lost of all of the errors on the job, which should tell you which lines hit errors.
Note if you have the job id, it may be easier to use bq to lookup the job -- you can run bq show <job_id> to get the job error information. If you add the --format=prettyjson it will print out all of the information in the job.
A hint you also might want to consider is to supply your own job id when you create the job -- then even if there is an error starting the job (i.e. the insert() call fails, perhaps due to a network error) you can look up the job to see what actually happened.
To tell BigQuery that some errors are allowed during import, you can use the maxBadResults setting in the load job. See https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#getMaxBadRecords().