ADLA/U-SQL Error: Vertex user code error - azure-data-lake

I have a simple U-SQL script that extracts a CSV using Extractors.Csv(encoding:Encoding.[Unicode]); and outputs it into a Data Lake Store table. The file is small, around 600 MB, Unicode-encoded, and has 700K+ rows.
These are the columns:
UserId int,
Email string,
AltEmail string,
CreatedOn DateTime,
IsDeleted bool,
UserGuid Guid,
IFulfillmentContact bool,
IsBillingContact bool,
LastUpdateDate DateTime,
IsTermsOfUse string,
UserTypeId string
When I submit this job to my local account, it runs without any issues. Once I submit it to ADLA, I get the following error:
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract_Partition[0][0] with error: Vertex user code error.
Vertex failed with a fail-fast error
Vertex SV1_Extract_Partition[0][0].v1 {BA7B2378-597C-4679-AD69-07413A143E47} failed
Error:
Vertex user code error
exitcode=CsExitCode_StillActive Errorsnippet=An error occurred while processing adl://lakestore.azuredatalakestore.net/Data/User.csv
Any help is appreciated!

Since the file is larger than 250 MB, you need to make sure that you upload it as a row-oriented file and not as a binary file.
Also, please check the answer to the following question to see how you can currently find more details on the error: Debugging u-sql Jobs
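As a quick local sanity check before re-uploading (a sketch only; the path below is a hypothetical local copy of the file), you can confirm the file's actual encoding and row count so that the Extractors.Csv encoding parameter and the upload mode match the data:

import codecs

path = "User.csv"  # hypothetical local copy of the CSV

# Peek at the byte-order mark to see which Unicode flavour the file uses.
with open(path, "rb") as f:
    bom = f.read(4)

if bom.startswith(codecs.BOM_UTF16_LE) or bom.startswith(codecs.BOM_UTF16_BE):
    encoding = "utf-16"
elif bom.startswith(codecs.BOM_UTF8):
    encoding = "utf-8-sig"
else:
    encoding = "utf-8"  # fallback; adjust if the check is inconclusive

# Count the rows with the detected encoding; a decode error here usually means
# the encoding passed to the extractor does not match the file.
with open(path, "r", encoding=encoding, newline="") as f:
    row_count = sum(1 for _ in f)

print("detected encoding:", encoding, "rows:", row_count)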

Related

LeaseAlreadyPresent Error in Azure Data Factory V2

I am getting the following error in a pipeline that has a Copy activity with a REST API source and Azure Data Lake Storage Gen2 as the sink.
"message": "Failure happened on 'Sink' side. ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Conflict'. Account: '{Storage Account Name}'. FileSystem: '{Container Name}'. Path: 'foodics_v2/Burgerizzr/transactional/_567a2g7a/2018-02-09/raw/inventory-transactions.json'. ErrorCode: 'LeaseAlreadyPresent'. Message: 'There is already a lease present.'. RequestId: 'd27f1a3d-d01f-0003-28fb-400303000000'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code 'Conflict',Source=Microsoft.DataTransfer.ClientLibrary,'",
The pipeline runs in a ForEach loop with a batch size of 5. When I make it run sequentially, the error goes away, but I need to run it in parallel.
This is a known issue caused by an ADF limitation: pipeline variables are not safe to set from ForEach iterations that run in parallel.
You are probably trying to build the file name using a variable.
One option is to move the steps that follow each variable assignment into a child pipeline,
i.e. Set Variable -> Execute Pipeline
or
remove those variables and hard-code the variable expressions directly in the activity.
Hope this helps

XGBoost Model in AWS-Sagemaker Fails with no error message

I'm trying to train a model using the XGBoost classifier in AWS SageMaker. I'm following the abalone example, but when I run it to build the training job it prints InProgress three times and then just prints Failed. Where do I go to find out why it failed?
I've double-checked the parameters and made sure the input and output files and directories in S3 were correct. I know the read and write permissions are in place because, when setting up the train/validate/test data, I read from and write to S3 with no problems.
import time
import boto3

client = boto3.client('sagemaker')  # job_name is defined earlier in the notebook
status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)
while status != 'Completed' and status != 'Failed':
    time.sleep(60)
    status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print(status)
That is the code where the print statements come from. Is there something I can add to receive a better error message?
The problem was that the file sent for predictions was CSV, but the XGBoost settings were configured to receive libsvm.
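For anyone looking for a better error message than the bare status: the same describe_training_job call used in the polling loop also returns a FailureReason field once the job has failed, so a small addition along these lines surfaces the cause directly (a sketch under the same boto3 assumptions as the question; the job name below is hypothetical). The full training output is also written to CloudWatch Logs.

import boto3

client = boto3.client('sagemaker')
job_name = 'my-xgboost-training-job'  # hypothetical; use the name from your notebook

# DescribeTrainingJob includes a FailureReason field for failed jobs.
desc = client.describe_training_job(TrainingJobName=job_name)
print(desc['TrainingJobStatus'])
if desc['TrainingJobStatus'] == 'Failed':
    print('FailureReason:', desc.get('FailureReason', '<none reported>'))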

SSIS CSV Import Error 0xC0202092 DTS_E_PRIMEOUTPUTFAILED

All of a sudden, a CSV file that is imported into a db/table every morning has been failing every time for the last few weeks. I do not support this process directly, so I don't know much about SSIS, but I would greatly appreciate some help, as I need this working and whoever supports this process has no idea what the issue is. I'm not sure whether the error about the row has anything to do with the data in that row, because it looks fine to me. The CSV includes Active Directory information for every computer in AD and is exported from PowerShell to a server, where the CSV is imported into a table via SSIS. The process is entirely automated and nothing has changed.
[Source - Clean_Gold CSV [1]] Error: The column delimiter for column "LastLogontimestamp" was not found.
[Source - Clean_Gold CSV [1]] Error: An error occurred while processing file "H:\Computers\clean_gold.csv" on data row 40377.
[SSIS.Pipeline] Error: SSIS Error Code DTS_E_PRIMEOUTPUTFAILED. The PrimeOutput method on component "Source - Clean_Gold CSV" (1) returned error code 0xC0202092. The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.
I looked at row 40,378 and saw that someone had put a ", in the description of the computer object in Active Directory. That caused an issue with the delimiting.
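For anyone hunting a row like that by hand, a short script can flag records whose field count no longer matches the header, which is how a stray quote in an AD description field usually shows up. This is just a sketch: the path comes from the error message, and a comma delimiter with standard quoting is assumed.

import csv

path = r"H:\Computers\clean_gold.csv"  # path from the error message

with open(path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f)  # assumes comma delimiter and standard quoting
    header = next(reader)
    expected = len(header)
    # Report any record whose field count differs from the header's.
    for record_no, row in enumerate(reader, start=2):
        if len(row) != expected:
            print(f"record {record_no}: expected {expected} fields, got {len(row)}")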

Getting exception while reading data from blob in Azure

While I am trying to read the list of blobs on Azure, I am getting the following error:
Function evaluation disabled because a previous function evaluation timed out. You must continue execution to reenable function evaluation.
How to resolve this?
Please see the following link; your code likely has an endless loop: https://msdn.microsoft.com/en-us/library/ms234762.aspx
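If the underlying goal is simply to enumerate blobs, a bounded listing call avoids the kind of open-ended loop the debugger was timing out on. Since the question's code is not shown, here is only a sketch using the Python azure-storage-blob package; the connection string and container name are placeholders.

from azure.storage.blob import BlobServiceClient

# Placeholders; substitute your own storage connection string and container.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("<container-name>")

# list_blobs() pages through results and the iteration terminates on its own,
# so no hand-rolled loop is needed.
for blob in container.list_blobs():
    print(blob.name, blob.size)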

File: 0: Unexpected from Google BigQuery load job

I have a compressed JSON file (900 MB, newline-delimited) and when I load it into a new table via the bq command, the load fails:
e.g.
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.gz schema.json
Waiting on bqjob_r3ec270ec14181ca7_000001461d860737_1 ... (1049s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r3ec270ec14181ca7_000001461d860737_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 0: Unexpected. Please try again.
Why am I getting this error?
I tried again with --max_bad_records, and still got no useful error message:
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values --max_bad_records 2 XXX.test23 gs://XXX/20140521/file1.gz schema.json
Waiting on bqjob_r518616022f1db99d_000001461f023f58_1 ... (319s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r518616022f1db99d_000001461f023f58_1': Unexpected. Please try again.
I also cannot find any useful message in the console.
To the BigQuery team: can you have a look using the job ID?
As far as I know, there are two error sections on a job. There is one error result, which is what you see now, and there is a second, which should be a stream of errors. The second is important because it can contain errors even when the job itself succeeds.
You can also set --max_bad_records=3 in the bq tool. Check here for more parameters: https://developers.google.com/bigquery/bq-command-line-tool
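If you want to look at both of those sections programmatically rather than through the bq tool, the google-cloud-bigquery client library exposes them on the job object. A sketch, reusing the project ID and job ID from the question's output:

from google.cloud import bigquery

client = bigquery.Client(project="XXX")  # project ID as in the question
job = client.get_job("bqjob_r3ec270ec14181ca7_000001461d860737_1")

# error_result is the single fatal error that bq prints; errors is the stream
# of per-record errors described above.
print("error_result:", job.error_result)
for err in job.errors or []:
    print(err)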
You probably have an error that occurs on every line, so you should try a small sample set from this big file first.
There is also an open feature request to improve the error message; you can star (vote for) this ticket: https://code.google.com/p/google-bigquery-tools/issues/detail?id=13
Since this answer will be picked up by the BigQuery team, I'll add for them: we need an endpoint where we can query by job ID for the state or the stream of errors. Getting a full list of errors would help a lot when debugging BigQuery jobs, and it could be easy to implement.
I looked up this job in the BigQuery logs, and unfortunately, there isn't any more information than "failed to read" somewhere after about 930 MB have been read.
I've filed a bug that we're dropping important error information in one code path and submitted a fix. However, this fix won't be live until next week, and all that will do is give us more diagnostic information.
Since this is repeatable, it isn't likely to be a transient error reading from GCS. That means one of two problems: either we have trouble decoding the .gz file, or there is something wrong with that particular GCS object.
For the first issue, you could try decompressing the file and re-uploading it as uncompressed. While it may sound like a pain to send gigabytes of data over the network, the good news is that the import will be faster since it can be done in parallel (we can't import a compressed file in parallel since it can only be read sequentially).
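If you go the uncompressed route, the local step is just a decompression before re-uploading to GCS (a sketch; the file names below are hypothetical):

import gzip
import shutil

# Decompress the export locally so it can be re-uploaded uncompressed and
# loaded in parallel; file names are hypothetical.
with gzip.open("data.gz", "rb") as src, open("data.json", "wb") as dst:
    shutil.copyfileobj(src, dst)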
For the second issue (which is somewhat less likely) you could try downloading the file yourself to make sure you don't get errors, or try re-uploading the same file and seeing if that works.