XGBoost Model in AWS SageMaker Fails with no error message

I'm trying to train a model using the XGBoost classifier in AWS SageMaker. I'm following the abalone example, but when I run it to build the training job it reports InProgress three times and then just reports Failed. Where do I go to find out why it failed?
I've double-checked the parameters and made sure the input and output files and directories in S3 were correct. I know I have permission to read and write, because when setting up the data for train/validate/test I read from and write to S3 with no problems.
import time
import boto3
# Poll the training job status every minute until it finishes.
client = boto3.client('sagemaker')
status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)
while status != 'Completed' and status != 'Failed':
    time.sleep(60)
    status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print(status)
That is the code where the print statements come from. Is there something I can add to receive a better error message?

The problem was that the file sent for predictions was CSV, but the XGBoost settings were set to receive libsvm.
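For reference, a hedged sketch of how you could surface the failure reason and declare the CSV content type explicitly via boto3; the endpoint name, payload, and variable names are illustrative, but DescribeTrainingJob does return a FailureReason field for failed jobs:
import boto3

sm = boto3.client('sagemaker')
# DescribeTrainingJob includes a FailureReason once a job has failed,
# which is usually more informative than the bare 'Failed' status.
desc = sm.describe_training_job(TrainingJobName=job_name)
if desc['TrainingJobStatus'] == 'Failed':
    print(desc.get('FailureReason'))

# When sending CSV data for predictions, declare the content type explicitly
# so the XGBoost container does not fall back to expecting libsvm
# (endpoint name and payload below are illustrative):
runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='my-xgboost-endpoint',
    ContentType='text/csv',
    Body='1.0,2.5,3.1\n')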

Related

LeaseAlreadyPresent Error in Azure Data Factory V2

I am getting the following error in a pipeline that has a Copy activity with a REST API as source and Azure Data Lake Storage Gen2 as sink.
"message": "Failure happened on 'Sink' side. ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Conflict'. Account: '{Storage Account Name}'. FileSystem: '{Container Name}'. Path: 'foodics_v2/Burgerizzr/transactional/_567a2g7a/2018-02-09/raw/inventory-transactions.json'. ErrorCode: 'LeaseAlreadyPresent'. Message: 'There is already a lease present.'. RequestId: 'd27f1a3d-d01f-0003-28fb-400303000000'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code 'Conflict',Source=Microsoft.DataTransfer.ClientLibrary,'",
The pipeline runs in a for loop with Batch size = 5. When I make it sequential, the error goes away, but I need to run it in parallel.
This is a known issue with an ADF limitation: pipeline variables are shared across parallel (threaded) iterations of a loop.
You are probably trying to set the file name using a variable inside the loop.
Your option is to move the per-iteration work into a child pipeline, so each variable assignment runs in its own execution,
i.e. Set Variable -> Execute Pipeline,
or
remove those variables and hard-code the variable expressions directly in the activity.
Hope this helps.

ADLA/U-SQL Error: Vertex user code error

I just have a simple U-SQL script that extracts a CSV using Extractors.Csv(encoding:Encoding.[Unicode]); and outputs into a Data Lake Store table. The file size is small, around 600MB, and the file is Unicode. The number of rows is 700K+.
These are the columns:
UserId int,
Email string,
AltEmail string,
CreatedOn DateTime,
IsDeleted bool,
UserGuid Guid,
IFulfillmentContact bool,
IsBillingContact bool,
LastUpdateDate DateTime,
IsTermsOfUse string,
UserTypeId string
When I submit this job to run locally, it works great without any issues. Once I submit it to ADLA, I get the following error:
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract_Partition[0][0] with error: Vertex user code error.
Vertex failed with a fail-fast error
Vertex SV1_Extract_Partition[0][0].v1 {BA7B2378-597C-4679-AD69-07413A143E47} failed
Error:
Vertex user code error
exitcode=CsExitCode_StillActive Errorsnippet=An error occurred while processing adl://lakestore.azuredatalakestore.net/Data/User.csv
Any help is appreciated!
Since the file is larger than 250MB, you need to make sure that you upload it as a row-oriented file and not a binary file.
Also, please check the answer to the following question to see how you can currently find more details on the error: Debugging u-sql Jobs

NuPIC OPF Runtime error getOutputData unknown output categoriesOut

I'm trying to run a TemporalClassification model using the OPF to recognize patterns from a stream. I've adjusted the model params so it has two sensor inputs: a ScalarEncoder and an SDRCategoryEncoder. The latter is marked as classifierOnly and is also set as the predictedField in the inference settings.
When trying to feed the model with input data I get:
RuntimeError: getOutputData unknown output 'categoriesOut' on region Classifier.
A NontemporalClassification model (only the inferenceType changed) runs without such an error.
I've found 6 occurrences of categoriesOut in the nupic code: https://github.com/numenta/nupic/search?utf8=%E2%9C%93&q=categoriesOut
The error arises in nupic/frameworks/opf/clamodel.py at line 558:
classificationDist = classifier.getOutputData('categoriesOut')
It seems that the classifier region in the network is not set up to produce this output.
Can anyone explain why the classifier region doesn't have 'categoriesOut'? I guess there's a misconfiguration in my model params, but there were no errors or warnings during initialization of the model. Are there any mandatory parameters or assignments (besides those noted in the NuPIC documentation) necessary for a TemporalClassification model to run?
There are several types of classifier regions in NuPIC. You can find them in the nupic/regions folder. I've checked the sources and found that 'categoriesOut' is in the outputs dict of KNNClassifierRegion:
https://github.com/numenta/nupic/blob/469f6372082e95dd5d2a96181b745ba36d2e7a8a/nupic/regions/KNNClassifierRegion.py
outputs=dict(
    categoriesOut=dict(
        description='A vector representing, for each category '
            'index, the likelihood that the input to the node belongs '
            'to that category based on the number of neighbors of '
            'that category that are among the nearest K.',
        dataType='Real32',
        count=0,
        regionLevel=True,
        isDefaultOutput=True),
Ensure you use KNNClassifierRegion when configuring your TemporalClassification model. The samples for NontemporalClassification use the CLAClassifier, but CLAClassifierRegion has no categoriesOut in its outputs, and the error described in your question will arise if you keep
'regionName' : 'CLAClassifierRegion'
in the model params of a TemporalClassification model.
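As a rough sketch of the relevant part of the model params (the surrounding keys and values are only illustrative of the usual OPF layout; the point is the regionName value):
# Illustrative fragment of OPF model params; only 'regionName' matters here.
MODEL_PARAMS = {
    'model': 'CLA',
    'modelParams': {
        'inferenceType': 'TemporalClassification',
        # ... sensorParams, spParams, tpParams omitted ...
        'clParams': {
            # KNNClassifierRegion exposes the 'categoriesOut' output that
            # clamodel.py requests; CLAClassifierRegion does not.
            'regionName': 'KNNClassifierRegion',
        },
    },
}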

File: 0: Unexpected from Google BigQuery load job

I have a compressed JSON file (900MB, newline delimited) that I load into a new table via the bq command, and I get the following load failure:
e.g.
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.gz schema.json
Waiting on bqjob_r3ec270ec14181ca7_000001461d860737_1 ... (1049s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r3ec270ec14181ca7_000001461d860737_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 0: Unexpected. Please try again.
Why the error?
I tried again with --max_bad_records, and still got no useful error message:
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values --max_bad_records 2 XXX.test23 gs://XXX/20140521/file1.gz schema.json
Waiting on bqjob_r518616022f1db99d_000001461f023f58_1 ... (319s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r518616022f1db99d_000001461f023f58_1': Unexpected. Please try again.
I also cannot find any useful message in the console.
To the BigQuery team: can you have a look using the job ID?
As far as I know there are two error sections on a job. There is a single error result, which is what you see now, and there is a second one, which is a stream of errors. This second one is important because it can contain errors even while the actual job succeeds.
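For what it's worth, a hedged sketch using the current google-cloud-bigquery Python client (the client libraries have changed since this question was asked) that surfaces both sections; the project and job ID are taken from the question:
# Inspect both the single error result and the error stream of a job.
from google.cloud import bigquery

client = bigquery.Client(project="XXX")
job = client.get_job("bqjob_r3ec270ec14181ca7_000001461d860737_1")
print(job.error_result)       # the one top-level error result
for err in job.errors or []:  # the stream of per-record errors, if any
    print(err)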
Also, you can set --max_bad_records=3 on the bq tool. Check here for more params: https://developers.google.com/bigquery/bq-command-line-tool
You probably have an error that occurs on every line, so you should try a sample set from this big file first.
Also, there is an open feature request to improve the error message; you can star (vote for) this ticket: https://code.google.com/p/google-bigquery-tools/issues/detail?id=13
This answer will be picked up by the BQ team, so for them I am sharing this: we need an endpoint where we can query based on a job ID for the state or the stream of errors. It would help a lot to get a full list of errors, and it would make debugging BQ jobs much easier. This should be easy to implement.
I looked up this job in the BigQuery logs, and unfortunately, there isn't any more information than "failed to read" somewhere after about 930 MB have been read.
I've filed a bug that we're dropping important error information in one code path and submitted a fix. However, this fix won't be live until next week, and all that will do is give us more diagnostic information.
Since this is repeatable, it isn't likely a transient error reading from GCS. That means one of two problems: we have trouble decoding the .gz file, or there is something wrong with that particular GCS object.
For the first issue, you could try decompressing the file and re-uploading it as uncompressed. While it may sound like a pain to send gigabytes of data over the network, the good news is that the import will be faster since it can be done in parallel (we can't import a compressed file in parallel since it can only be read sequentially).
For the second issue (which is somewhat less likely) you could try downloading the file yourself to make sure you don't get errors, or try re-uploading the same file and seeing if that works.
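If you try the decompress-and-re-upload route from the first point, a hedged sketch (the local file names, bucket, and object path are illustrative):
# Decompress the .gz locally and re-upload it uncompressed so BigQuery
# can read the load in parallel (names below are illustrative).
import gzip
import shutil
from google.cloud import storage

with gzip.open("data.gz", "rb") as src, open("data.json", "wb") as dst:
    shutil.copyfileobj(src, dst)

client = storage.Client(project="XXX")
client.bucket("xxx").blob("data.json").upload_from_filename("data.json")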

Best way to detect a "data loss" publish action when calling SSDT's SQLPackage.exe

When calling SQLPackage.exe (syntax described here) with the publish action /a:Publish, there are cases where data loss would occur and execution is halted; this behavior is controlled by the parameter /p:BlockOnDataLoss (which defaults to 'true').
I need to know whether my publish action has succeeded or has failed due to 'data loss'.
Currently, when it succeeds the returned exit code is 0, and when it fails the exit code is just 1. We cannot tell whether a failure was caused by data loss or not. How can we identify this?
Somewhere in the console output, we see a line that contains "... is being dropped, data loss could occur.", so I intend to scan every output line that is printed, but I guess there should be a better way to do this.
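For illustration, a minimal sketch of that line-scanning approach (the SqlPackage source/target arguments below are placeholders, not my real values):
# Scan SqlPackage output for the data-loss warning and combine it with the exit code.
import subprocess

proc = subprocess.run(
    ["SqlPackage.exe", "/a:Publish",
     "/SourceFile:MyDatabase.dacpac",                     # placeholder
     "/TargetConnectionString:Server=.;Database=MyDb;"],  # placeholder
    capture_output=True, text=True)

data_loss = any("data loss could occur" in line
                for line in proc.stdout.splitlines())

if proc.returncode != 0:
    print("Publish failed" + (" due to potential data loss" if data_loss else ""))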
Hope to hear what you think.