AWS SageMaker: CapacityError: Unable to provision requested ML compute capacity. - tensorflow

We were running two TrainingJob instances, of type (1) ml.p3.8xlarge and (2) ml.p3.2xlarge.
Each training job runs a custom algorithm with TensorFlow plus a Keras backend.
Instance (1) runs fine, while instance (2), after a reported training time of 1 hour and without any logging in CloudWatch (no text in the log), exits with this error:
Failure reason
CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.
I'm not sure what this message means.

This message means SageMaker tried to launch the instance, but EC2 did not have enough capacity for that instance type, so after waiting for some time (in this case 1 hour) SageMaker gave up and failed the training job.
For more information about EC2 capacity issues, please see:
troubleshooting-launch-capacity
To solve this, you can either try running the job with a different instance type, as suggested in the failure reason, or wait a few minutes and then submit your request again, as suggested by EC2.
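If it fits your workflow, that retry can be automated. Below is a rough sketch with boto3; the job name, training image, role ARN, S3 path, and the list of fallback instance types are all placeholders rather than values from this question:

import time
import boto3

sm = boto3.client("sagemaker")

def run_with_fallback(base_name, instance_types):
    # Submit the same training job, falling back to the next instance type
    # whenever the job fails with a CapacityError in its failure reason.
    for i, instance_type in enumerate(instance_types):
        name = f"{base_name}-{i}"
        sm.create_training_job(
            TrainingJobName=name,
            AlgorithmSpecification={
                "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/my-image:latest",  # placeholder
                "TrainingInputMode": "File",
            },
            RoleArn="arn:aws:iam::<account>:role/MySageMakerRole",  # placeholder
            OutputDataConfig={"S3OutputPath": "s3://my-bucket/output"},  # placeholder
            ResourceConfig={
                "InstanceType": instance_type,
                "InstanceCount": 1,
                "VolumeSizeInGB": 50,
            },
            StoppingCondition={"MaxRuntimeInSeconds": 86400},
        )
        # Poll until the job reaches a terminal state.
        while True:
            desc = sm.describe_training_job(TrainingJobName=name)
            status = desc["TrainingJobStatus"]
            if status in ("Completed", "Failed", "Stopped"):
                break
            time.sleep(60)
        if status == "Completed":
            return name
        if "CapacityError" not in desc.get("FailureReason", ""):
            raise RuntimeError(desc.get("FailureReason", "unknown failure"))
        # Capacity issue: fall through and retry with the next instance type.
    raise RuntimeError("All candidate instance types were out of capacity")

run_with_fallback("my-job", ["ml.p3.2xlarge", "ml.p3.8xlarge", "ml.g4dn.12xlarge"])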

Difference between Sagemaker notebook instance and Training job submission

I am getting an error from a SageMaker training job with the message "OverflowError: signed integer is greater than maximum". This is an image identification problem with code written in Keras and TensorFlow. The input is a large .npy file stored in an S3 bucket.
The code works fine when run in the SageMaker notebook cells, but errors out when submitted as a training job using a boto3 request.
I am using the same role in both places. What could be the cause of this error? I am using an ml.g4dn.16xlarge instance in both cases.
A couple of things I would check are:
Framework versions used in your notebook instance vs. the training job
Instance storage volume for the training job; since you are using g4dn, it comes with an attached SSD, which should ideally be good enough
This seems like a bug: requests and urllib3 should only ask for the maximum number of bytes they are capable of handling at once.
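One quick way to check the first point is to log the framework versions from a notebook cell and again at the top of the training script, then compare. A minimal sketch, assuming a Python/TensorFlow entry point:

import sys

import numpy as np
import tensorflow as tf

# Print the versions of the pieces most likely to differ between the
# notebook kernel and the training container.
print("python     :", sys.version.split()[0])
print("numpy      :", np.__version__)
print("tensorflow :", tf.__version__)
print("keras      :", getattr(tf.keras, "__version__", "bundled with TF"))

If the training job is launched through the SageMaker Python SDK, pinning framework_version and py_version on the estimator to match what the notebook reports is the usual way to remove that skew.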

GCP: IA ML serving with autoscaling to zero

I wanted to try the ML serving AI Platform from GCP, but I want the nodes to scale up only when there is a call to prediction.
I see in the documentation here:
If you select "Auto scaling", the optional Minimum number of nodes field displays. You can enter the minimum number of nodes to keep running at all times, when the service has scaled down. This field defaults to 0.
But when I try to create my model version, it shows an error telling me that this field should be > 1.
Here is what I tried:
Name: testv1
Container: pre-built
Python version: 3.7
Framework: TensorFlow
Framework version: 2.4.0
ML runtime version: 2.4
Scaling: auto-scaling
Minimum number of nodes: 0
Machine type: n1-standard-4
GPU: 1 x TESLA_K80
I tried to reproduce your case and found the same thing, I was not able to set the Minimum number of nodes to 0.
This seems to be an outdated documentation issue. There is an ongoing Feature Request that explains it was possible to set a minimum of 0 machines with a legacy machine type, and requests to make this option available for current types too.
On the other hand, I went ahead and opened a ticket to update the documentation.
As a workaround, you can deploy your models right when you need them and then proceed to un-deploy them. Be mindful that an undeployment may take up to 45 minutes, so it is advisable to wait 1 hour before re-deploying that model to avoid any issues.
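Here is a rough sketch of that deploy-on-demand workaround, driving the gcloud CLI from Python; the version name, model name, export path, and runtime options are placeholders loosely based on the settings listed in the question:

import subprocess

VERSION = "v1"        # placeholder version name
MODEL = "my_model"    # placeholder model name

def deploy():
    # Create (deploy) the model version when traffic is expected.
    subprocess.run([
        "gcloud", "ai-platform", "versions", "create", VERSION,
        f"--model={MODEL}",
        "--origin=gs://my-bucket/saved_model/",  # placeholder SavedModel path
        "--runtime-version=2.4",
        "--framework=tensorflow",
        "--python-version=3.7",
        "--machine-type=n1-standard-4",
    ], check=True)

def undeploy():
    # Delete (un-deploy) the version when it is no longer needed; as noted
    # above, it may take a while to fully release resources.
    subprocess.run([
        "gcloud", "ai-platform", "versions", "delete", VERSION,
        f"--model={MODEL}", "--quiet",
    ], check=True)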

Can you prevent Google AI platform from terminating an evaluator before it's complete?

I'm running a training job on the Google AI Platform, just training a simple tf.Estimator. Is there a way to prevent the whole job from completing if there's still an evaluation task running?
I remember someone using Kubeflow in GCP who needed to use the '--stream-logs' flag when submitting an AI Platform training job with the gcloud command (1). Otherwise, the job would get stopped before completion.
According to the documentation,
'with the --stream-logs flag, the job will continue to run after this command exits and must be cancelled with gcloud ai-platform jobs cancel JOB_ID'.
It is worth giving it a try and checking whether, in your case, this flag also keeps the job running instead of terminating it prematurely.
If the issue keeps happening with the flag activated, you might want to inspect the logs of the job to better understand the root cause of this behaviour.
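For reference, a minimal submission with the flag could look like the sketch below; the job name, region, and trainer package are placeholders:

import subprocess

# Submit the training job and block until it completes, streaming its logs.
subprocess.run([
    "gcloud", "ai-platform", "jobs", "submit", "training", "my_job_001",
    "--stream-logs",
    "--region=us-central1",
    "--module-name=trainer.task",
    "--package-path=./trainer",
    "--runtime-version=2.4",
    "--python-version=3.7",
], check=True)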

Sometimes get the error "err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory"

I'm very lost on how to solve my particular problem, which is why I followed the getting help guideline in the Object Detection API and made a post here on Stack Overflow.
To start off, my goal was to run distributed training jobs on Azure. I've previously used gcloud ai-platform jobs submit training with great ease to run distributed jobs, but it's a bit difficult on Azure.
I built the tf1 docker image for the Object Detection API from the dockerfile here.
I had a cluster (Azure Kubernetes Service/AKS Cluster) with the following nodes:
4x Standard_DS2_V2 nodes
8x Standard_NC6 nodes
In Azure, NC6 nodes are GPU nodes backed by a single K80 GPU each, while DS2_V2 are typical CPU nodes.
I used TFJob to configure my job with the following replica settings:
Master (limit: 1 GPU) 1 replica
Worker (limit: 1 GPU) 7 replicas
Parameter Server (limit: 1 CPU) 3 replicas
Here's my conundrum: the job fails because one of the workers throws the following error:
tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory
I randomly tried reducing the number of workers, and surprisingly, the job worked, but only if I had 3 or fewer Worker replicas. Although it took a lot of time (a bit more than a day), the model could finish training successfully with 1 Master and 3 Workers.
This was a bit vexing, as I could only use up to 4 GPUs even though the cluster had 8 GPUs allocated. I ran another test: when my cluster had 3 GPU nodes, I could only successfully run the job with 1 Master and 1 Worker! It seems I can't fully utilize the GPUs for some reason.
Finally, I ran into another problem. The above runs were done with a very small amount of data (about 150 MB) since they were tests. I later ran a proper job with a lot more data (about 12 GiB). Even though the cluster had 8 GPU nodes, it could only successfully do the job when there was 1 Master and 1 Worker.
Increasing the Worker replica count to more than 1 immediately caused the same cuda error as above.
I'm not sure if this is an Object Detection API issue, whether it is caused by Kubeflow/TFJob, or even whether it's something Azure-specific. I've opened a similar issue on the Kubeflow page, but I'm also now seeing if I can get some guidance from the Object Detection API community. If you need any further details (like the TFJob yaml, or the pipeline.config for the training) or have any questions, please let me know in the comments.
It might be related to the batch size used by the API.
Try to control the batch size, maybe as described in this answer:
https://stackoverflow.com/a/55529875/2109287
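If this is the TF1 Object Detection API, one hedged way to do that is to rewrite pipeline.config programmatically before launching the job; the config path below is a placeholder, and the object_detection package (already present in the tf1 docker image) is assumed to be importable:

from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

config_path = "pipeline.config"  # placeholder path

# Parse the existing training configuration.
pipeline = pipeline_pb2.TrainEvalPipelineConfig()
with open(config_path, "r") as f:
    text_format.Merge(f.read(), pipeline)

# Lower the per-replica batch size so each single-K80 worker fits in memory.
pipeline.train_config.batch_size = 1

# Write the modified configuration back.
with open(config_path, "w") as f:
    f.write(text_format.MessageToString(pipeline))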
This is caused by insufficient GPU memory. Try the commands below to find and kill stale processes holding GPU memory, then reset the GPU; hopefully it helps :)
$ sudo fuser -v /dev/nvidia*
$ sudo kill -9 <pid>   (e.g. 12345)
$ nvidia-smi --gpu-reset

TensorFlow Distributed Estimator issue

I was successful in getting a simple cluster with one of each: chief, parameter server, worker, evaluator to work. I followed the directions on the tf.estimator.train_and_evaluate page and set a stopping condition by setting train_spec.max_steps to 20000. My problem is that this is sufficient to stop the chief and worker, but the parameter server and evaluator continue to wait in a loop. Since I'm running under a batch scheduler and asked for a certain time limit, I have to wait until then to get my output. Is there any way to signal the parameter server and evaluator to exit when all the training is done?
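For context, here is a minimal single-process sketch of the setup described above, with a toy model and toy data; in the real cluster, each task (chief, worker, ps, evaluator) would pick up its role from the TF_CONFIG environment variable:

import numpy as np
import tensorflow as tf  # TF 1.x style, matching tf.estimator.train_and_evaluate

def model_fn(features, labels, mode):
    # Toy linear regression, just to make the sketch self-contained.
    preds = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, preds)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op,
                                      predictions=preds)

def input_fn():
    x = np.random.rand(128, 4).astype(np.float32)
    y = x.sum(axis=1, keepdims=True)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).repeat().batch(8)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/model")  # placeholder dir
# max_steps stops the chief and the workers, but the parameter servers and the
# evaluator keep waiting in their loops, which is the behaviour asked about.
train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=20000)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=10)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)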