Difference between Sagemaker notebook instance and Training job submission - numpy

I am getting an error from a SageMaker training job: "OverflowError: signed integer is greater than maximum". This is an image identification problem with code written in Keras and TensorFlow. The input is a large .npy file stored in an S3 bucket.
The code works fine when run in SageMaker notebook cells but errors out when submitted as a training job via a boto3 request.
I am using the same role in both places, and an ml.g4dn.16xlarge instance in both cases. What could be causing this error?

A couple of things I would check:
Framework versions used in your notebook instance vs. the training job (see the version-pinning sketch below).
Instance storage volume for the training job; since you are using g4dn it comes with an attached SSD, which should ideally be large enough.
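As a rough illustration of the first point, the framework and Python versions could be pinned explicitly when submitting the job with the SageMaker Python SDK (rather than a raw boto3 request) so they match the notebook kernel; the entry point, version numbers, and data path below are placeholders, not taken from the question:

    # Hypothetical sketch: pin the TensorFlow/Python versions of the training job
    # so they match the notebook environment. Script name, versions and S3 path
    # are placeholders.
    import sagemaker
    from sagemaker.tensorflow import TensorFlow

    role = sagemaker.get_execution_role()

    estimator = TensorFlow(
        entry_point="train.py",            # your training script
        role=role,                         # same execution role as the notebook
        instance_count=1,
        instance_type="ml.g4dn.16xlarge",
        framework_version="2.11",          # match the notebook's TensorFlow version
        py_version="py39",                 # match the notebook's Python version
        volume_size=200,                   # extra EBS space for the large .npy input
    )
    estimator.fit({"training": "s3://your-bucket/path/to/data/"})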

This seems like a bug: Requests and urllib3 should only ask for the maximum number of bytes they are capable of handling at once.
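If the overflow is indeed triggered by reading the whole multi-gigabyte .npy object in a single call, one possible workaround (a sketch with placeholder bucket, key, and local path) is to download the object to local disk with boto3, which streams in chunks, and then load it with numpy:

    # Hypothetical workaround sketch: stream the large .npy object to local disk
    # (boto3 downloads in chunks) instead of reading the whole body at once,
    # then load it from disk. Bucket, key and local path are placeholders.
    import boto3
    import numpy as np

    s3 = boto3.client("s3")
    local_path = "/tmp/train.npy"
    s3.download_file("your-bucket", "path/to/data.npy", local_path)

    # Memory-map the array so it is not all pulled into RAM at once.
    data = np.load(local_path, mmap_mode="r")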

Related

Loading large data to Amazon SageMaker notebook

I have 2 folders; each folder has 70 CSV files of 3 MB to 5 MB each, so in total the data is about 20 million rows with 5 columns.
I used AWS Data Wrangler's s3.read_csv to load just one folder with all 70 CSVs into a DataFrame, but I'm not sure this is a good approach given how large the data is.
I want to know how I can load all the CSV files from those 2 folders with AWS Data Wrangler s3.read_csv, or whether I should use PySpark instead.
Another question: is it possible to work locally using Amazon SageMaker dependencies? I'm not sure whether using a SageMaker notebook for the pipeline development might cost my client a lot.
You can use PySpark to load data into your notebook as well; see this repo for instructions.
As for SageMaker, you can use the SageMaker Python SDK, or Boto3 to run jobs from your local machine. You can also create a notebook instance with a small instance size, experiment on a subset of your data, and trigger a Processing job to keep your notebook costs low. You only pay for the duration your processing job runs, and you can scale up for preparing the entire dataset.
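For the first question, one possible approach (a sketch with placeholder S3 prefixes) is to pass both folder prefixes to awswrangler in a single call, or to read in chunks so the full 20 million rows never sit in memory at once:

    # Sketch with placeholder S3 prefixes: read both folders with awswrangler.
    import awswrangler as wr

    paths = ["s3://your-bucket/folder1/", "s3://your-bucket/folder2/"]

    # Option 1: read everything into one DataFrame (fine if it fits in memory).
    df = wr.s3.read_csv(path=paths)

    # Option 2: iterate over chunks to keep memory bounded.
    for chunk in wr.s3.read_csv(path=paths, chunksize=500_000):
        ...  # per-chunk processing goes here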

aws sagemaker giving The model data archive is too large. Please reduce the size of the model data archive

I am using AWS SageMaker to deploy a model whose generated artifacts are huge; the compressed size is about 80 GB. Deploying to an endpoint on an ml.m5.12xlarge instance throws this error:
The model data archive is too large. Please reduce the size of the model data archive or move to an instance type with more memory.
I found that AWS attaches an EBS volume based on instance size (https://docs.aws.amazon.com/sagemaker/latest/dg/host-instance-storage.html), and I could not find anything larger than 30 GB there. Should I go with a multi-model endpoint here?
"d" instances have bigger (NVMe) volumes; you can try deploying to ml.m5d.* for example. But keep in mind that your download and instantiation time may exceed the service limits (between 15-20min in my experience) so if you can't have an endpoint up in that timeframe you may still encounter an error.

Sequential Scripts conditioned by S3 file existence

I have three Python scripts. These are supposed to be executed sequentially, but in different environments.
script1: Generate training and test dataset using an AWS EMR cluster and save it on S3.
script2: Train a Machine Learning model using the training data and save the trained model on S3. (Executed on an AWS GPU instance)
script3: Run evaluation based on the test data and trained model and save the result on S3. (Executed on an AWS GPU instance)
I would like to run all these scripts automatically, without executing them one by one. My questions are:
Are there good practices for handling S3 file-existence conditions (fault tolerance, etc.)?
How can I trigger launching GPU instances and EMR clusters?
Are there good ways or tools to handle this kind of process?
Take a look at https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
You can configure a notification to fire when an event occurs on a bucket, for example when an object is created.
You can attach this notification directly to an AWS Lambda function which, if it is set up with the right role, can create an EMR cluster and any other resources accessible via the AWS SDK.
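A minimal Lambda sketch of that idea (all names, ARNs, and image URIs below are placeholders) that reacts to the "object created" event and kicks off the next stage, for example a SageMaker training job:

    # Minimal sketch of a Lambda handler triggered by an S3 "object created"
    # notification; it starts the next stage (here a SageMaker training job).
    # All names, ARNs and URIs below are placeholders.
    import boto3

    sm = boto3.client("sagemaker")

    def lambda_handler(event, context):
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Only react to the dataset produced by script1.
        if not key.startswith("datasets/train/"):
            return

        sm.create_training_job(
            TrainingJobName="train-job-example",
            AlgorithmSpecification={
                "TrainingImage": "<your-training-image-uri>",
                "TrainingInputMode": "File",
            },
            RoleArn="arn:aws:iam::123456789012:role/YourSageMakerRole",
            InputDataConfig=[{
                "ChannelName": "training",
                "DataSource": {"S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/datasets/train/",
                }},
            }],
            OutputDataConfig={"S3OutputPath": f"s3://{bucket}/models/"},
            ResourceConfig={
                "InstanceType": "ml.p3.2xlarge",
                "InstanceCount": 1,
                "VolumeSizeInGB": 50,
            },
            StoppingCondition={"MaxRuntimeInSeconds": 86400},
        )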

AWS SageMaker: CapacityError: Unable to provision requested ML compute capacity.

We were running two TrainingJob instances of type (1) ml.p3.8xlarge and (2) ml.p3.2xlarge.
Each training job is running a custom algorithm with Tensorflow plus a Keras backend.
Instance (1) is running OK, while instance (2), after a reported training time of 1 hour and without any logging in CloudWatch (no text to log), exits with this error:
Failure reason
CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.
I'm not sure what this message means.
This message means SageMaker tried to launch the instance, but EC2 did not have enough capacity for this instance type, so after waiting for some time (in this case 1 hour) SageMaker gave up and failed the training job.
For more information about capacity issue from ec2, please visit:
troubleshooting-launch-capacity
To solve this, you can either try running the job with a different instance type, as suggested in the failure reason, or wait a few minutes and then submit your request again, as suggested by EC2.
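A rough sketch of that retry-with-fallback approach using boto3 (the job name and base request are placeholders, and the polling/retry intervals are arbitrary):

    # Sketch: resubmit the training job with a fallback instance type (or after a
    # short wait) when it fails with a CapacityError. Job parameters are placeholders.
    import time
    import boto3

    sm = boto3.client("sagemaker")

    def run_with_fallback(job_name, base_request, instance_types):
        for i, instance_type in enumerate(instance_types):
            name = f"{job_name}-{i}"
            request = dict(base_request)
            request["TrainingJobName"] = name
            request["ResourceConfig"] = {**base_request["ResourceConfig"],
                                         "InstanceType": instance_type}
            sm.create_training_job(**request)

            # Poll until the job reaches a terminal state.
            while True:
                desc = sm.describe_training_job(TrainingJobName=name)
                status = desc["TrainingJobStatus"]
                if status in ("Completed", "Failed", "Stopped"):
                    break
                time.sleep(60)

            if status == "Completed":
                return desc
            if "CapacityError" not in desc.get("FailureReason", ""):
                raise RuntimeError(desc.get("FailureReason", "Training job failed"))
            time.sleep(300)  # wait a bit, then retry with the next instance type
        raise RuntimeError("No capacity for any of the requested instance types")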

SageMaker: visualizing training statistics

If I send a TensorFlow training job to a SageMaker instance, what is the typical way to view training progress? Can I access TensorBoard for this launched EC2 instance? Is there some other alternative? What I'm looking for specifically are things like graphs of current training epoch and mAP.
You can now specify metrics (metric name and regex) that you want to track using the AWS Management Console or the Amazon SageMaker Python SDK APIs. After model training starts, Amazon SageMaker will automatically monitor and stream the specified metrics in real time to the Amazon CloudWatch console, where you can visualize the time-series curves.
Ref:
https://docs.aws.amazon.com/sagemaker/latest/dg/API_MetricDefinition.html
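For example (a sketch: the regexes assume the training script prints lines like "loss: 0.123" and "mAP: 0.456", and the script name, role, and framework versions are placeholders), metric definitions can be attached to an estimator in the SageMaker Python SDK like this:

    # Sketch: define metrics via name + regex so SageMaker scrapes them from the
    # training logs and streams them to CloudWatch. Adapt the regexes to whatever
    # your training script actually prints.
    from sagemaker.tensorflow import TensorFlow

    metric_definitions = [
        {"Name": "train:loss", "Regex": "loss: ([0-9\\.]+)"},
        {"Name": "validation:mAP", "Regex": "mAP: ([0-9\\.]+)"},
    ]

    estimator = TensorFlow(
        entry_point="train.py",                 # placeholder training script
        role="<your-sagemaker-execution-role>",
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        framework_version="2.11",
        py_version="py39",
        metric_definitions=metric_definitions,
    )
    estimator.fit("s3://your-bucket/training-data/")

The tracked metrics then appear under the training job in the console and in CloudWatch as time-series graphs, which covers the epoch/mAP curves asked about above.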