Fail azure data factory pipeline if notebook execution skipped - azure-data-factory-2

I have built an ADF pipeline that executes multiple Databricks notebooks. When one of the notebooks fails, the remaining commands in that notebook are skipped, but the pipeline itself does not fail. I want to make sure that the pipeline execution stops whenever a notebook fails and its subsequent commands are skipped.
I have tried triggering an error so that the pipeline fails.

If you connect the Azure Databricks activities as follows:
Notebook2-Activity will only be executed if the first Notebook-Activity is successful.
Here the first activity fails and the next is not executed (Monitoring view of Azure Data Factory):

Use assert; here is a Scala code snippet:
try {
  ... ... ...
}
catch {
  case e: Exception => {
    println("exception caught while ... ... : " + e)
    assert(false)
  }
}
This will fail your Notebook Activity in Azure Data Factory.
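A minimal variation on the same idea, assuming a Scala notebook: rethrowing the caught exception (instead of calling assert(false)) also fails the cell, the notebook run, and therefore the ADF Notebook activity, while keeping the original error message visible in the monitoring view:
try {
  // ... notebook logic ...
} catch {
  case e: Exception =>
    println("exception caught while processing: " + e)
    throw e   // rethrow so the notebook run (and the ADF activity) fails with the original error
}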

Related

Azure Data Factory: How to catch any error on any activity and log it into a database?

I am currently working on error handling in ADF.
I know how to get the error from a particular activity by using @activity('').error.message and then passing it to a database.
Does anyone know a generic way that allows us to catch any error from any activity in a pipeline?
Best regards!
You can leverage flow path dependencies within Azure Data Factory to manage error logging from a single set of activities rather than duplicating the same activities.
The blog post https://datasharkx.wordpress.com/2021/08/19/error-logging-and-the-art-of-avoiding-redundant-activities-in-azure-data-factory/ explains this in detail.
Below are the basic principles that need to be followed:
Multiple dependencies with the same source are OR’ed together.
Multiple dependencies with different sources are AND’ed together.
Together this looks like (Act_3 Fails OR Act_3 Skipped) AND (Act_2 Completes OR Act_2 Skipped).
When you have a number of activities in pipeline1 and you want to capture the error message when any of them fails, using an Execute Pipeline activity is the correct approach.
When you use an Execute Pipeline activity to trigger pipeline1 and any activity inside pipeline1 fails, the Execute Pipeline activity throws an error message which includes the name of the activity that failed.
Look at the following demonstration. I have a pipeline named p1 which has 3 activities: Get Metadata, ForEach, and Script.
I am triggering the p1 pipeline from a p2 pipeline using an Execute Pipeline activity and storing the error message in a variable using a Set Variable activity (with an expression such as @activity('<Execute Pipeline activity name>').error.message). For the demonstration, I made sure each activity throws an error in turn (3 runs).
When the Get Metadata activity fails, the error message captured in the p2 pipeline is as follows:
Operation on target Get Metadata1 failed: The required Blob is missing. ContainerName: data2, path: data2/files/.
When only the ForEach activity in the p1 pipeline fails, the error message captured in the p2 pipeline is:
Operation on target ForEach1 failed: Activity failed because an inner activity failed
When only the Script activity in the p1 pipeline fails, the error message captured in the p2 pipeline is:
Operation on target Script1 failed: Invalid object name 'mydemotable'
So, using an Execute Pipeline activity (to execute pipeline1) lets you capture the error message from whichever activity fails inside the required pipeline.

Azure ADF - how to proceed with pipeline activities only after new files arrival?

I have written a generic data-file arrival-checking routine as a Databricks notebook which accepts file names and a time that specifies the acceptable freshness of the files. Many pipelines use this notebook and pass it tuples of file names, and at the end the notebook returns True or False to indicate whether the next workflow activity can start. So far so good.
Now my question is: how do I use this in an Azure ADF pipeline such that, if the check fails, the pipeline waits for 30 minutes or so and then runs the notebook again?
The notebook should run first, so that if the new files are already there the pipeline does not wait.
Since you are talking about the Notebook activity, you can add a Wait activity on the notebook's "on failure" path and set the wait time. After the Wait, add an Execute Pipeline activity. This Execute Pipeline activity should point to a pipeline containing another Execute Pipeline activity that points back to the main pipeline holding the Notebook activity. Basically this forms a cycle, but it only executes when there is a failure.
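For the "on failure" path to fire, the arrival-check notebook has to fail (rather than return False) when the files are not there yet. A minimal sketch of that notebook side, assuming a Scala notebook; the paths, the existence-only check, and the message text are illustrative assumptions, not the asker's actual code:
// Assumed input; in practice these would come from notebook parameters/widgets.
val fileNames = Seq("/mnt/landing/sales.csv", "/mnt/landing/orders.csv")

val allArrived = fileNames.forall { path =>
  try dbutils.fs.ls(path).nonEmpty            // path exists and is non-empty
  catch { case _: Exception => false }        // missing path counts as "not arrived"
}

if (allArrived)
  dbutils.notebook.exit("True")               // Notebook activity succeeds; downstream activities run
else
  throw new RuntimeException("Expected files have not arrived yet")  // activity fails -> Wait + re-run path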

LeaseAlreadyPresent Error in Azure Data Factory V2

I am getting the following error in a pipeline that has a Copy activity with a REST API source and Azure Data Lake Storage Gen2 as the sink.
"message": "Failure happened on 'Sink' side. ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Conflict'. Account: '{Storage Account Name}'. FileSystem: '{Container Name}'. Path: 'foodics_v2/Burgerizzr/transactional/_567a2g7a/2018-02-09/raw/inventory-transactions.json'. ErrorCode: 'LeaseAlreadyPresent'. Message: 'There is already a lease present.'. RequestId: 'd27f1a3d-d01f-0003-28fb-400303000000'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code 'Conflict',Source=Microsoft.DataTransfer.ClientLibrary,'",
The pipeline runs in a ForEach loop with a batch count of 5. When I make it sequential, the error goes away, but I need to run it in parallel.
This is a known issue with an ADF limitation: pipeline variables are not safe when set from parallel threads.
You are probably trying to rename the file name using a variable.
One option is to run a child pipeline after each variable assignment, i.e. Set Variable -> Execute Pipeline.
Or remove those variables and hard-code the variable expressions in the activity.
Hope this helps

BigQuery loads manually but not through the Java SDK

I have a Dataflow pipeline, running locally. The objective is to read a JSON file using TextIO, make sessions, and load it into BigQuery. Given the structure, I have to create a temp directory in GCS and then load the data into BigQuery from there. Previously I had a data schema error that prevented me from loading the data, see here. That issue is resolved.
So now when I run the pipeline locally it ends by dumping a temporary newline-delimited JSON file into GCS. The SDK then gives me the following:
Starting BigQuery load job beam_job_xxxx_00001-1: try 1/3
INFO [main] (BigQueryIO.java:2191) - BigQuery load job failed: beam_job_xxxx_00001-1
...
Exception in thread "main" com.google.cloud.dataflow.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:187)
at pedesys.Dataflow.main(Dataflow.java:148)
Caused by: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.load(BigQueryIO.java:2198)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.processElement(BigQueryIO.java:2146)
The errors are not very descriptive and the data is still not loaded into BigQuery. What is puzzling is that if I go to the BigQuery UI and manually load the same temporary file from GCS (the one dumped by the SDK's Dataflow pipeline) into the same table, it works beautifully.
The relevant code parts are as follows:
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BigQueryOptions.class)
    .setTempLocation("gs://test/temp");
Pipeline p = Pipeline.create(options);
...
...
session_windowed_items.apply(ParDo.of(new FormatAsTableRowFn()))
    .apply(BigQueryIO.Write
        .named("loadJob")
        .to("myproject:db.table")
        .withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    );
The SDK is swallowing the error/exception and not reporting it to users. It is most likely a schema problem. To get the actual error, you need to fetch the job details by either:
CLI - bq show -j beam_job_<xxxx>_00001-1
Browser/Web: use "try it" at the bottom of the page here.
#jkff has raised an issue here to improve the error reporting.
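If you prefer to fetch the job status programmatically rather than through the CLI or the web UI, here is a minimal sketch using the google-cloud-bigquery client (shown in Scala; the job ID is the one from the log above, and project/credentials are assumed to come from the environment):
import com.google.cloud.bigquery.{BigQueryOptions, JobId}

// Connect with the default credentials and project from the environment.
val bigquery = BigQueryOptions.getDefaultInstance.getService

// Look up the load job that the pipeline created (ID taken from the SDK log above).
val job = bigquery.getJob(JobId.of("beam_job_xxxx_00001-1"))

Option(job).foreach { j =>
  val status = j.getStatus
  println(s"State: ${status.getState}")     // e.g. DONE
  println(s"Error: ${status.getError}")     // top-level error, if any
  Option(status.getExecutionErrors).foreach { errs =>
    errs.forEach(e => println(s"Detail: $e"))   // detailed errors, e.g. schema mismatches
  }
}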

Pentaho Data Integration: Error Handling

I'm building out an ETL process with Pentaho Data Integration (CE) and I'm trying to operationalize my Transformations and Jobs so that they can be monitored. Specifically, I want to be able to catch any errors and then send them to an error reporting service like Honeybadger or New Relic. I understand how to do row-level error reporting, but I don't see a way to do job or transformation failure reporting.
Here is an example job.
The downward path is where the transformation succeeds but has row errors. There we can just filter the results and log them.
The path to the right is the case where the transformation fails altogether (e.g. the DB credentials are wrong). This is where I'm having trouble: I can't figure out how to get the error info to be sent.
How do I capture transformation failures to be logged?
You cannot capture job-level error details inside the job itself.
However there are other options for monitoring.
The first option is using database logging for transformations or jobs (see the "Log" tab in the job/transformation properties dialog). This way you always have up-to-date information about the execution status, so you can, say, write a job that periodically scans the logging database and sends error reports wherever you need.
However, this option is rather heavyweight to develop and support, and not very flexible for further modification. So in our company we ended up monitoring at the job-execution level: when you run a job with kitchen.bat and it fails for any reason, kitchen exits with an "error" status, which you can easily examine and act on with whatever tools you like - .bat commands, PowerShell, or (in our case) Jenkins CI.
You could use the writeToLog("e", "Message") function in the Modified Java Script step.
Documentation:
// Writes a string to the defined Kettle Log.
//
// Usage:
// writeToLog(var);
// 1: String - The Message which should be written to
// the Kettle Debug Log
//
// writeToLog(var,var);
// 1: String - The Type of the Log
// d - Debug
// l - Detailed
// e - Error
// m - Minimal
// r - RowLevel
//
// 2: String - The Message which should be written to
// the Kettle Log
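// Example (illustrative, not part of the original documentation): log an
// error-level message from the Modified Java Script step.
writeToLog("e", "Transformation failed: could not connect to the source database");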