What I'm trying to do is retrain VGG16 to recognize new types of image data, using Keras with the TensorFlow backend.
But the training process seems very slow, and after checking my GPU performance in the task manager it looks like my GPU is barely being utilized.
This is my code: https://hastebin.com/pepozayutu.py
This is the output in my console: https://hastebin.com/uhonugenej.md
And this is what my task manager looks like during training: https://imgur.com/a/jRJ66
As you can see, the GPU is barely doing anything, so why is my training so slow? It's agonizing to try different setups because each training run takes 20-60 minutes depending on the number of epochs.
I have installed tensorflow-gpu 1.7.0, cuDNN 7.0.5, CUDA 9.0 and Keras 2.1.5. I'm running an NVIDIA GeForce 940MX.
edit: I solved it! It turns out my GPU was only being used for very short bursts, because the actual bottleneck was loading the images. I stored my images as 3000x4000 pixel JPEGs even though I scale them down to 150x150, or sometimes 64x64, for the CNN anyway. Reducing the size of the images on disk got rid of the bottleneck.
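For anyone hitting the same bottleneck, a minimal sketch of that kind of disk-resize script (assuming Pillow is installed; the folder names and target size below are placeholders, not the actual setup):

```python
# Shrink the JPEGs on disk once, so the data loader no longer has to
# decode 3000x4000 images every epoch. Folder names and MAX_SIDE are
# placeholders.
import os
from PIL import Image

SRC_DIR = "data/train_full"    # original full-resolution JPEGs
DST_DIR = "data/train_small"   # downscaled copies used for training
MAX_SIDE = 512                 # still well above the 150x150 network input

os.makedirs(DST_DIR, exist_ok=True)
for name in os.listdir(SRC_DIR):
    if not name.lower().endswith((".jpg", ".jpeg")):
        continue
    img = Image.open(os.path.join(SRC_DIR, name))
    img.thumbnail((MAX_SIDE, MAX_SIDE))  # downscales in place, keeps aspect ratio
    img.save(os.path.join(DST_DIR, name), quality=90)
```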
Increase your batch size to, say, 512. A batch size of 30 is too small.
The new tf.data.Dataset API is great for building pipelines that feed data to your model asynchronously: while your graph is being computed, the pipeline prefetches the data so it's ready to go when the next graph execution cycle starts. With tf.estimator you can run a Keras model with a tf.data.Dataset pipeline as well. An example:
https://www.dlology.com/blog/an-easy-guide-to-build-new-tensorflow-datasets-and-estimator-with-keras-model/
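For illustration, here is a rough TF 1.x-style sketch of such a pipeline (not taken from the linked post; the file list, image size, and batch size are placeholder assumptions):

```python
# Decode and resize images in parallel on the CPU and prefetch batches
# so the GPU is not left waiting for input.
import tensorflow as tf

def _parse(path, label):
    # Read a JPEG from disk, resize it to the network input, scale to [0, 1].
    image = tf.image.decode_jpeg(tf.read_file(path), channels=3)
    image = tf.image.resize_images(image, [150, 150])
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

paths = tf.constant(["img_0001.jpg", "img_0002.jpg"])   # hypothetical file list
labels = tf.constant([0, 1])

dataset = (tf.data.Dataset.from_tensor_slices((paths, labels))
           .shuffle(buffer_size=1000)
           .map(_parse, num_parallel_calls=4)   # parallel decode/resize on CPU
           .batch(32)
           .prefetch(1))                        # overlap input with GPU compute

images, batch_labels = dataset.make_one_shot_iterator().get_next()
```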
Related
We are trying to train our model for object recognition using TensorFlow. Since there are a lot of images (100GB), I guess our current GPU server (1x 2080Ti) might not be enough. We may need to purchase a more powerful one, but I am not sure how to estimate how much GPU memory we need. Is there some approach to estimate the requirements? Thanks!
Your 2080Ti would do just fine for your task. The GPU memory needed for DL tasks depends on many factors, such as the number of trainable parameters in the network, the size of the images you are feeding, the batch size, the floating point type (FP16 or FP32), the number of activations, etc. I think you are confused about loading all of the images into GPU memory at once; we do not do that. Instead we use minibatches, so only a batch of images plus the params needs to fit into memory at a time. Throw any kind of network at your 2080Ti and adjust the batch size, and your training will run smoothly. You could go with your 2080Ti, or get another one or two to increase training speed. This blog post provides beautiful insights about creating optimal DL environments.
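As a rough illustration of how those factors add up (a back-of-the-envelope sketch, not a precise sizing method), you can at least check the weight footprint from the parameter count:

```python
# Back-of-the-envelope sketch: the weights alone take roughly
# params * bytes_per_param; optimizer state and activations
# (which scale with batch size) come on top of that.
from tensorflow.keras.applications import VGG16

model = VGG16(weights=None)               # any Keras model works here
params = model.count_params()
weight_mb = params * 4 / 1024 ** 2        # FP32 = 4 bytes per parameter
print("Parameters:", params)
print("Approx. weight memory: %.1f MB" % weight_mb)
# If you hit OOM, the per-batch activations are usually the culprit,
# so reducing the batch size is the first thing to try.
```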
I am working on classification of high-resolution images using VGG16-Net in Keras.
But I am unable to use images beyond (600 x 600) resolution for training with batch size 1 on an NVIDIA GeForce GTX 1080 GPU;
I am hitting a Resource Exhausted (OOM) error, i.e. unable to allocate a tensor of shape [18, 64, 600, 600].
Can anyone please suggest a solution for this?
I want to use the large images since I am labeling them as Good or Bad based on very small differences.
Thanks in advance!!
The whole network plus a batch of data needs to fit into VRAM. If you really do need to use high-resolution images then you need to use a smaller network.
VGG-16 is old and inefficient anyway, and not recommended for a new project. You could look up things like MobileNetV2 or MnasNet, but bear in mind that all of these commonly used models are generally optimized for inputs around 600x600 or much smaller. Out of interest, I have tried training CNNs on very high-resolution images just to see what would happen, and found that, of course, they train and run painfully slowly with much reduced accuracy: if all of the features in the images are very large with respect to the convolutional filters, then the filters won't be able to pick up on them.
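To make that concrete, a hedged sketch of swapping in a smaller backbone with Keras (this assumes a TF/Keras version recent enough to ship MobileNetV2 in keras.applications, and a hypothetical two-class Good/Bad setup):

```python
# Use a lightweight pretrained backbone at a modest input resolution
# instead of VGG16 on 600x600+ images.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(include_top=False, weights="imagenet",
                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                      # start with pure feature extraction

model = models.Sequential([
    base,
    layers.Dense(2, activation="softmax"),  # hypothetical Good/Bad classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```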
Solved this problem myself. It was because there were too many images in the CelebA dataset and my dataloader was very inefficient. The data loading took too much time and caused the low speed.
But still, this does not explain why the code was running on the CPU while GPU memory was also taken up. In the end I just switched to PyTorch.
My environment: Windows 10, CUDA 9.0, cuDNN 7.0.5, tensorflow-gpu 1.8.0.
I am working on a CycleGAN model. At first it worked fine with my toy dataset and could run on the GPU without major problems (though the first 10 iterations took an extremely long time, which suggests it might have been running on the CPU).
I later tried the CelebA dataset, only changing the folder name to load the data (I loaded the data into memory all at once, then used my own next_batch function and feed_dict to train the model). Then the problem arose: GPU memory was still taken according to GPU-Z, but the GPU load was low (less than 10%) and training was very slow (more than 10 times slower than normal), which suggests the code was running on the CPU.
Would anyone please give me some advice? Any help is appreciated, thanks.
What batch size were you trying? If it's too low (something like 2-8) for a small model, not much memory will be consumed. It all depends on your batch size, the number of parameters in your model, etc. It also depends on the model architecture and how much of the model can be run in parallel. Maybe try increasing your batch size and re-running it?
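Also, one way to confirm whether ops really are landing on the CPU (rather than inferring it from GPU-Z load) is TensorFlow's device placement logging. A minimal TF 1.x sketch, with a toy matmul standing in for the real training step:

```python
# With log_device_placement=True, TensorFlow prints the device chosen
# for every op when the session runs.
import tensorflow as tf

a = tf.random_normal([1024, 1024])
b = tf.random_normal([1024, 1024])
c = tf.matmul(a, b)                      # should land on /device:GPU:0

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(c)
# If the log shows ops pinned to /device:CPU:0, the graph (or its inputs)
# is forcing CPU execution, which would match the low GPU load above.
```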
System information
What is the top-level directory of the model you are using: research/object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes (just a VGG-16 implementation for Faster RCNN)
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow version (use command below): 1.4.0
CUDA/cuDNN version: 8 and 6
GPU model and memory: NVIDIA-1060 6GB
I am trying to train a Faster R-CNN with VGG-16 as the feature extractor (paper) on my custom dataset using the API.
Training params are the same as described in the paper, except that I am running for only 15k steps and resizing the images to 1200x1200 with a batch size of 1.
The training runs fine, but as time progresses it becomes slower. It is shifting between CPU and GPU.
The steps that take around 1 sec are running on the GPU, and the ones with much higher times like ~20 secs are running on the CPU; I cross-verified this using 'top' and 'nvidia-smi'. Why is it shifting between CPU and GPU in the middle of training? I can understand the shift when the model and logs are getting saved, but otherwise I don't understand why.
PS: I am running only the train script; I am not running the eval script.
Update:
This becomes worse over time.
The secs/step is increasing, which also slows down the rate at which the checkpoints and logs get stored.
It should run at less than 1 sec/step, because that was the speed for the first 2k steps when I started training. And my dataset is pretty small (300 images for training).
In my experience, it is possible that your input images are simply too large. When you take a look at TensorBoard during the training session, you can see that all the reshape calculations are running on the GPU. So maybe you can write a Python script to resize your input images without changing the aspect ratio, and at the same time set your batch size a little higher (maybe 4 or 8). Then you can train on your dataset faster and still get a relatively good result (mAP).
I'm working on code that trains a relatively large RNN (a 128-cell LSTM and some added layers). The main process is maxing out one CPU core, and I'm wondering if this is normal or whether I can optimize it. During the training loop (session.run calls) it's using about 60-70% GPU load while using 100% CPU load on one core. Note that data sampling is already being done concurrently on other cores, so it's just the updating of the model parameters. Is this normal for such applications in TensorFlow, or should the CPU load be much lower while the GPU runs at full capacity?
We don't have full documentation on it yet, but you can take a look at the profiling information to see if it gives you more of an idea of where the time is going:
https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659
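For reference, a minimal TF 1.x tracing sketch along those lines (the toy matmul stands in for the real session.run training step):

```python
# Capture per-op timing and device placement for one session.run call
# and dump it as a Chrome trace.
import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([512, 512])
y = tf.matmul(x, x)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(y, options=run_options, run_metadata=run_metadata)

trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())
# Open chrome://tracing and load timeline.json to see which ops run where
# and how long each one takes.
```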
I think an RNN cell has two inputs and must wait for both of them during training, so in other words it is not as easy to parallelize as a CNN. You can use a big batch size to improve GPU utilization, but that may cause other problems, as discussed in the paper On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.