LeaseAlreadyPresent Error in Azure Data Factory V2 - azure-data-lake

I am getting the following error in a pipeline that has a Copy activity with a REST API source and Azure Data Lake Storage Gen2 as the sink.
"message": "Failure happened on 'Sink' side. ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Conflict'. Account: '{Storage Account Name}'. FileSystem: '{Container Name}'. Path: 'foodics_v2/Burgerizzr/transactional/_567a2g7a/2018-02-09/raw/inventory-transactions.json'. ErrorCode: 'LeaseAlreadyPresent'. Message: 'There is already a lease present.'. RequestId: 'd27f1a3d-d01f-0003-28fb-400303000000'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code 'Conflict',Source=Microsoft.DataTransfer.ClientLibrary,'",
The Copy activity runs inside a ForEach loop with batch count = 5. When I make the loop sequential, the error goes away, but I need it to run in parallel.

This is a known ADF limitation with variables and parallel (threaded) ForEach runs. You are probably building or renaming the file name with a pipeline variable; variables are scoped to the pipeline run, not to each iteration, so parallel iterations overwrite the same value and end up targeting the same path, which produces the lease conflict.
Your options are:
Run the work as a child pipeline after each variable assignment, i.e. variable -> Execute Pipeline, so each iteration gets its own execution scope.
Or remove those variables and hard-code the variable expressions directly in the activity that uses them (a sketch follows below).
Hope this helps.
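For illustration only, here is a minimal sketch of the second option, assuming the sink file name was previously assembled in a Set Variable activity inside the ForEach; the property names on item() are hypothetical:
Sink dataset file name (replacing @variables('fileName')):
@concat(item().folder, '/', item().fileName)
Because item() is evaluated per iteration, each parallel copy resolves its own path (as long as the items are distinct) and the LeaseAlreadyPresent conflict goes away.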

Related

Load from GCS to GBQ causes an internal BigQuery error

My application creates thousands of "load jobs" daily to load data from Google Cloud Storage URIs into BigQuery, and only a few of them fail with the error:
"Finished with errors. Detail: An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 7916072"
The application is written in Python and uses these libraries:
google-cloud-storage==1.42.0
google-cloud-bigquery==2.24.1
google-api-python-client==2.37.0
The load job is created by calling:
load_job = self._client.load_table_from_uri(
    source_uris=source_uri,
    destination=destination,
    job_config=job_config,
)
This method has a default parameter:
retry: retries.Retry = DEFAULT_RETRY,
so the job should automatically retry on such errors.
ID of a specific job that finished with the error:
"load_job_id": "6005ab89-9edf-4767-aaf1-6383af5e04b6"
"load_job_location": "US"
After getting the error, the application recreates the job, but it doesn't help.
Subsequent failed job ids:
5f43a466-14aa-48cc-a103-0cfb4e0188a2
43dc3943-4caa-4352-aa40-190a2f97d48d
43084fcd-9642-4516-8718-29b844e226b1
f25ba358-7b9d-455b-b5e5-9a498ab204f7
...
As mentioned in the error message, wait according to the back-off requirements described in the BigQuery Service Level Agreement, then try the operation again.
If the error continues to occur and you have a support plan, please create a new GCP support case. Otherwise, you can open a new issue on the issue tracker describing your issue. You can also try to reduce the frequency of this error by using Reservations.
For more information about the error messages, you can refer to this document.
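As a rough sketch (not from the original application), resubmitting the load job with exponential back-off could look something like the following; the attempt count, delays and function name are illustrative:

import random
import time

from google.cloud import bigquery

MAX_ATTEMPTS = 5  # illustrative value

def load_with_backoff(client: bigquery.Client, source_uri, destination, job_config):
    """Submit a load job and resubmit it with exponential back-off if it fails."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        load_job = client.load_table_from_uri(
            source_uris=source_uri,
            destination=destination,
            job_config=job_config,
        )
        try:
            load_job.result()  # block until the job finishes; raises if the job ended with errors
            return load_job
        except Exception as exc:  # any job error; narrow the type if you prefer
            if attempt == MAX_ATTEMPTS:
                raise
            delay = 2 ** attempt + random.uniform(0, 1)  # exponential back-off with jitter
            print(f"Load job {load_job.job_id} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

Note that the retry parameter on load_table_from_uri applies to the API call that creates the job; a job that runs and then finishes with an internal error still has to be resubmitted, which is what the loop above does.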

Pub/Sub message ingestion into BigQuery using Data Fusion

I've built a simple real-time pipeline to receive messages and attributes from a Pub/Sub subscription, wrangle them to keep only a few fields, and load them into a BigQuery table. When deployed and run, the pipeline log says Importing into table '<tablename>' from 0 paths; path[0] is '(empty)'; awaitCompletion: true
I'm unable to understand why there are 0 paths and why all the records are going to errors even though an error collector was set up. Is there a way to debug the Wrangler stage better?
Sample Wrangler directives are below:
keep message,attributes
set-charset :message 'utf-8'
set-type :attributes string
parse-as-json :attributes 1
parse-as-json :message 5
keep attributes_page_url,attributes_cart_remove,attributes_page_title,attributes_transaction_complete,message_event_id,message_data_dom_domain,message_data_dom_title,message_data_dom_pathname,message_data_udo_ut_visitor_id
columns-replace s/^attributes_//g
columns-replace s/^message_//g
Any help is appreciated. Thanks!
The reason you see that there are 0 paths to load is that all records are causing errors during wrangling.
There are 2 ways in which you can capture these errors:
Configure the Wrangler stage to Fail Pipeline on error. This will show the exception/error in the logs.
Attach the Error output from the Wrangler stage to an Error Collector, and store the output in a File or GCS Sink. This allows you to capture the error message for each row. Configure the Error Collector as follows:
Error Message Column Name = errorMsg
Error Code Column Name = errorCode
Error Emitter Node Name = invalidRecord

Azure Data Factory: How to catch any error on any activity and log it into a database?

I am currently working on error handling in ADF.
I know how to get the error from a particular activity by using @activity('').error.message and then passing it to the database.
Do you know a generic way that lets us catch any error from any activity in a pipeline?
Best regards!
You can leverage the flow path dependency aspect within Azure Data Factory to manage error logging from a single set of activities rather than duplicating the same activities.
The blog below explains this in detail: https://datasharkx.wordpress.com/2021/08/19/error-logging-and-the-art-of-avoiding-redundant-activities-in-azure-data-factory/
These are the basic principles to follow:
Multiple dependencies with the same source are OR’ed together.
Multiple dependencies with different sources are AND’ed together.
Together this looks like: (Act_3 fails OR Act_3 skipped) AND (Act_2 completes OR Act_2 skipped)
When you have a certain number of activities in pipeline1 and you want to capture the error message when one of them fails, using an Execute Pipeline activity is the correct approach.
When you use an Execute Pipeline activity to trigger pipeline1 and any activity inside pipeline1 fails, it throws an error message that includes the name of the activity that failed.
Look at the following demonstration. I have a pipeline named p1 which has 3 activities: Get Metadata, ForEach and Script.
I am triggering the p1 pipeline from a p2 pipeline using an Execute Pipeline activity and storing the error message in a variable using a Set Variable activity (see the expression sketch below). For the demonstration, I made a different activity fail on each of the 3 runs.
When the Get Metadata activity fails, the error message captured in the p2 pipeline is as follows:
Operation on target Get Metadata1 failed: The required Blob is missing. ContainerName: data2, path: data2/files/.
When only the ForEach activity in the p1 pipeline fails, the following error message is captured in the p2 pipeline:
Operation on target ForEach1 failed: Activity failed because an inner activity failed
When only the Script activity in the p1 pipeline fails, the following error message is captured in the p2 pipeline:
Operation on target Script1 failed: Invalid object name 'mydemotable'
So, using an Execute Pipeline activity (to execute pipeline1) helps you capture the error message from whichever activity fails inside the required pipeline.
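For reference, the Set Variable activity in p2 reads the message with an expression of roughly this shape; 'Execute Pipeline1' is a placeholder for whatever your Execute Pipeline activity is named:
@activity('Execute Pipeline1').error.message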

Copy data activity fails when sink is immutable

In my Azure Data Factory pipeline, I'm using a Copy data activity inside a ForEach activity to copy files from an input container to an archive container before processing the files in the input container. This normally works, but today I made the archive container immutable by adding a legal hold policy to it, and the next time the Copy data activity ran, it failed with the error below. Is there any way around this, since you should be able to add new files to an immutable container?
Error code: 2200
Failure type: User configuration issue
Details:
Failure happened on 'Sink' side. ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Conflict'. Account: 'mydatalake'. FileSystem: 'raw'. Path: 'Source/ABC/File_2021_03_24.csv'. ErrorCode: 'PathImmutableDueToLegalHold'. Message: 'This operation is not permitted as the path is immutable due to one or more legal holds.'. RequestId: '37f75e88-501a-0026-2fa1-20d52e000000'. TimeStamp: 'Wed, 24 Mar 2021 11:30:54 GMT'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code 'Conflict',Source=Microsoft.DataTransfer.ClientLibrary,'
Source: Pipeline LoadMyData

ADLA/U-SQL Error: Vertex user code error

I just have a simple U-SQL script that extracts a CSV using Extractors.Csv(encoding:Encoding.[Unicode]); and outputs into a Data Lake store table. The file is small, around 600 MB, and is Unicode-encoded. The number of rows is 700K+.
These are the columns:
UserId int,
Email string,
AltEmail string,
CreatedOn DateTime,
IsDeleted bool,
UserGuid Guid,
IFulfillmentContact bool,
IsBillingContact bool,
LastUpdateDate DateTime,
IsTermsOfUse string,
UserTypeId string
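For context, here is a minimal sketch of the script described above, assuming the input path from the error snippet below and a hypothetical target table dbo.Users:

@users =
    EXTRACT UserId int,
            Email string,
            AltEmail string,
            CreatedOn DateTime,
            IsDeleted bool,
            UserGuid Guid,
            IFulfillmentContact bool,
            IsBillingContact bool,
            LastUpdateDate DateTime,
            IsTermsOfUse string,
            UserTypeId string
    FROM "adl://lakestore.azuredatalakestore.net/Data/User.csv"
    USING Extractors.Csv(encoding:Encoding.[Unicode]);

// Hypothetical pre-existing Data Lake Analytics table
INSERT INTO dbo.Users
SELECT * FROM @users;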
When I submit this job to my local account, it works without any issues. Once I submit it to ADLA, I get the following error:
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract_Partition[0][0] with error: Vertex user code error.
Vertex failed with a fail-fast error
Vertex SV1_Extract_Partition[0][0].v1 {BA7B2378-597C-4679-AD69-07413A143E47} failed
Error:
Vertex user code error
exitcode=CsExitCode_StillActive Errorsnippet=An error occurred while processing adl://lakestore.azuredatalakestore.net/Data/User.csv
Any help is appreciated!
Since the file is larger than 250 MB, you need to make sure that you upload it as a row-oriented file and not as a binary file.
Also, please check the reply to the following question to see how you can currently find more details on the error: Debugging u-sql Jobs