Google Vertex AI GPU only 50% utilized - tensorflow

I am running a custom training job on Google Vertex AI with two Nvidia Tesla V100 accelerators. I am training an ML model, but my GPU utilization is only 50% during training.
I am using the Nvidia Transfer Learning Toolkit to train an object detection model, and I specified GPUs=2 in the TLT commands.
Any ideas on how I can get higher GPU utilization?

Related

Execution of Inference Workloads on Coral Dev Board on CPU, GPU and TPU simultaneously

I am currently working on executing inference workloads on the Coral Dev Board with TensorFlow Lite. I am trying to run inference on the CPU, GPU and TPU simultaneously to reduce inference latency.
Could you help me understand how I can execute inference on all the devices simultaneously? I could divide the layers of the network between the CPU and GPU for the training phase, but I am having trouble assigning layers of the network to each device for inference. The code is written in Python with the Keras API in TensorFlow.
Thanks.
As of now, if you compile your CPU TFLite model with the Edge TPU compiler (https://coral.ai/docs/edgetpu/compiler/), the compiler tries to map the operations onto the TPU only (as long as the operations are supported by the TPU).
The Edge TPU compiler cannot partition the model more than once, and as soon as an unsupported operation occurs, that operation and everything after it executes on the CPU, even if supported operations occur later.
So partitioning a single TFLite model across the CPU, GPU and TPU is not feasible as of now.
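For reference, this is roughly what running an Edge TPU-compiled model looks like with the TFLite runtime (a minimal sketch, not from the original answer; the model file name is a placeholder). The delegate executes the TPU-mapped portion of the model, and anything the compiler could not map simply runs on the CPU, consistent with the behaviour described above.

```python
# Minimal sketch: run an Edge TPU-compiled TFLite model on the Coral Dev Board.
# "model_edgetpu.tflite" is a placeholder file name. Ops mapped by the compiler
# run on the Edge TPU via the delegate; everything else falls back to the CPU.
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy tensor with the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]["index"])
```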

How much RAM do I need to train ssd_mobilenet_v2 model on GPU?

I want to train an object detector using the TensorFlow Object Detection API's SSD MobileNet v2 model on a relatively big dataset (~3000 images for training and ~500 for testing). I successfully managed all the necessary preprocessing steps, created the train.record and test.record files, and tried to run training with train.py, but the training process was killed by the kernel.
>INFO:tensorflow:Restoring parameters from /home/yurii/.../second_attempt/model.ckpt
>INFO:tensorflow:Restoring parameters from /home/yurii/.../second_attempt/model.ckpt
>INFO:tensorflow:Running local_init_op.
>INFO:tensorflow:Running local_init_op.
>INFO:tensorflow:Done running local_init_op.
>INFO:tensorflow:Done running local_init_op.
>INFO:tensorflow:Starting Session.
>INFO:tensorflow:Starting Session.
>INFO:tensorflow:Saving checkpoint to path /home/yurii/.../second_attempt/model.ckpt
>INFO:tensorflow:Saving checkpoint to path /home/yurii/.../second_attempt/model.ckpt
>INFO:tensorflow:Starting Queues.
>INFO:tensorflow:Starting Queues.
>Killed
I've found some information stating that the issue could be due to a lack of RAM on my machine. Previously I trained the model on a smaller dataset (280 images for training and 40 for testing) and everything worked properly.
So, approximately, how much RAM do I need to train MobileNet on my dataset?
I am using an Asus X555L with 4 GB RAM available; the GPU is an Nvidia GeForce 920M (2 GB, compute capability 3.5), the CUDA version is 9.0.176, the cuDNN version is 7.5, the TensorFlow version is 1.7.0, and the Nvidia driver version is 384.130.
Maybe you could reduce the batch size in the config.py file. I am using an HP laptop with 4 GB RAM and a Radeon graphics card. Currently my batch size is set to 4 for my custom object detection project using the same ssd_mobilenet_v2.
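For context, if the training run is driven by the TensorFlow Object Detection API's pipeline configuration (usually a .config protobuf text file; the exact file name may differ from config.py), the batch size lives in the train_config block. A hedged excerpt, with all other fields omitted:

```
train_config {
  # Lower this value if training gets killed for lack of memory.
  batch_size: 4
}
```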

Why does TensorFlow spend so much time on HtoD memcpy with a Titan X?

I'm running the AlexNet model from here in TensorFlow to evaluate how much time the library spends on the GPU, with the following parameters and hardware:
1024 images in the training dataset
10 epochs with a mini-batch size of 128
GPU: GTX Titan X
I found that the real execution time on the GPU is just a fraction of the total training time (the graph below compares TensorFlow's AlexNet against Caffe and its AlexNet implementation).
(information captured with nvidia-smi. 'Porcentagem' means percentage and 'Tempo (s)' means time (seconds))
The GPU utilization rate oscillates wildly between 0 and 100% during training. Why is that? Caffe doesn't oscillate much beyond 40%.
Also, TensorFlow spends a lot of time doing memory copies from host to device, while Caffe doesn't. Why is that?
(figures: nvidia-smi traces for TensorFlow and for Caffe)

Why is multi-GPU TensorFlow retraining not working?

I have been running my TensorFlow retraining job on a single GTX Titan and it works just fine, but when I try to use multiple GPUs in the flower retraining example it does not work: nvidia-smi shows only one GPU being utilized.
Why is this happening, given that multiple GPUs do work when training the Inception model from scratch, but not during retraining?
TensorFlow's flower retraining example does not work with multiple GPUs at all, even if you set --num_gpus > 1. It should support a single GPU as you noted.
The model needs to be modified to utilize multiple GPUs in parallel. Unfortunately, a single-GPU TensorFlow graph like the one built by the flower retraining example can't automatically be split over multiple GPUs at this time; a sketch of the kind of change involved follows.
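To illustrate the kind of modification meant here (a minimal sketch in TF 1.x style, not code from the retraining script; build_model_and_loss and the tiny linear model are hypothetical stand-ins for the retraining graph), one common pattern is to build one "tower" per GPU with shared variables and average the gradients:

```python
# Minimal data-parallel sketch (TF 1.x style), assuming 2 GPUs are available.
import tensorflow as tf

NUM_GPUS = 2

def build_model_and_loss(images, labels):
    # Hypothetical stand-in for the retraining graph: a single linear classifier.
    flat = tf.reshape(images, [-1, 224 * 224 * 3])
    w = tf.get_variable("w", [224 * 224 * 3, 5],
                        initializer=tf.truncated_normal_initializer(stddev=0.01))
    b = tf.get_variable("b", [5], initializer=tf.zeros_initializer())
    logits = tf.matmul(flat, w) + b
    return tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

images = tf.placeholder(tf.float32, [None, 224, 224, 3])
labels = tf.placeholder(tf.int32, [None])
image_splits = tf.split(images, NUM_GPUS)
label_splits = tf.split(labels, NUM_GPUS)

optimizer = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []

# One tower per GPU; variables are created once and reused by later towers.
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        loss = build_model_and_loss(image_splits[i], label_splits[i])
        tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-tower gradients and apply them once to the shared variables.
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars if g is not None]
    averaged.append((tf.reduce_mean(tf.stack(grads), axis=0),
                     grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(averaged)
```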

Tensorflow 0.6 GPU Issue

I am using an Nvidia DIGITS box with a GPU (Nvidia GeForce GTX Titan X) and TensorFlow 0.6 to train a neural network, and everything works. However, when I check the Volatile GPU Util using nvidia-smi -l 1, I notice that it's only 6%, and I think most of the computation is on the CPU, since the process running TensorFlow has about 90% CPU usage. As a result, the training process is very slow. I wonder if there are ways to make full use of the GPU instead of the CPU to speed up training. Thanks!
I suspect you have a bottleneck somewhere (like in this GitHub issue) -- you have some operation which doesn't have a GPU implementation, so it's placed on the CPU, and the GPU is idling because of data transfers. For instance, until recently reduce_mean was not implemented on GPU, and before that Rank was not implemented on GPU while being implicitly used by many ops.
At one point I saw a network from fully_connected_preloaded.py running slowly because a Rank op got placed on the CPU, which triggered a transfer of the entire dataset from GPU to CPU at each step.
To solve this, I would first recommend upgrading to 0.8, since it has a few more ops implemented for GPU (reduce_prod for integer inputs, reduce_mean and others).
Then you can create your session with log_device_placement=True and see how ops are placed across the CPU and GPU, and whether any placement would cause excessive transfers per step.
There are often ops in the input pipeline (such as parse_example) which don't have GPU implementations; I sometimes find it helpful to pin the whole input pipeline to the CPU using a with tf.device("/cpu:0"): block.
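As a rough illustration of both suggestions (a minimal sketch in TF 1.x style, not the code from the question; the tiny dense model and random inputs are stand-ins for a real pipeline), you can pin the input pipeline to the CPU and turn on placement logging like this:

```python
# Minimal sketch (TF 1.x style): pin input-pipeline ops to the CPU, keep the
# model on the GPU, and log device placement to spot CPU-placed ops.
import tensorflow as tf

# Input-pipeline ops (readers, parsing, batching) stay on the CPU.
# Random tensors stand in for a real reader/queue here.
with tf.device("/cpu:0"):
    images = tf.random_uniform([32, 784])
    labels = tf.random_uniform([32], maxval=10, dtype=tf.int32)

# Model and training ops go on the GPU.
with tf.device("/gpu:0"):
    logits = tf.layers.dense(images, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# log_device_placement prints the device chosen for every op, so CPU-placed ops
# (and the transfers they cause) are easy to spot; allow_soft_placement lets ops
# without a GPU kernel fall back to the CPU instead of failing.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)
```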