tensorflow training is slow, stops running on gpu? - tensorflow

I'm running the cifar_train program from TensorFlow's tutorial. The tutorial says it reaches a peak performance of 86% accuracy after a few hours. I tried running it on my laptop, and after about 4 days it was still training. I'm wondering whether I'm misreading the tutorial: does training finish after a few hours, or is 86% just the accuracy reached at that point? Also, I'm watching my GPU usage with Afterburner. When I started, it was around 30-40%, but after leaving it overnight the GPU usage is 0% and the CPU is at 100%. Is this normal?

Related

Tensorflow fails to run on GPU from time to time

I solved this problem myself. The celeba dataset contains too many images and my data loader was very inefficient, so data loading took too long and slowed training down.
However, this still does not explain why the code appeared to run on the CPU while the GPU memory was also in use. In the end I just switched to PyTorch.
My environment: Windows 10, CUDA 9.0, cuDNN 7.0.5, tensorflow-gpu 1.8.0.
I am working on a CycleGAN model. At first it worked fine with my toy dataset and ran on the GPU without major problems (though the first 10 iterations took an extremely long time, which suggests they might have been running on the CPU).
I later tried the celeba dataset, changing only the folder name used to load the data (I load all the data into memory at once, then use my own next_batch function and feed_dict to train the model). Then the problem arose: according to GPU-Z the GPU memory was still in use, but the GPU load was low (less than 10%) and training was very slow (more than 10 times slower than normal), which suggests the code was running on the CPU.
Could anyone please give me some advice? Any help is appreciated, thanks.
What is the batch size that you were trying? If it's too low (something like 2-8) for a small model, the memory consumed will not be much. It all depends on your batch size, the number of parameters in your model, etc. It also depends on the model architecture and how much of the model has components that can be run in parallel. Maybe try increasing your batch size and re-running it?
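One way to check whether a training loop like the one in this thread is limited by data loading rather than by the model is to time the two phases separately. The following is only a minimal sketch: `profile_loop` and `input_bound_fraction` are hypothetical helper names, and `next_batch` and `train_step` stand in for the poster's own batch function and a blocking training call (for example a wrapper around `sess.run` with `feed_dict`).

```python
import time


def profile_loop(next_batch, train_step, num_steps):
    """Run num_steps iterations and return (load_seconds, step_seconds).

    next_batch: callable returning one batch (hypothetical stand-in for
                the poster's own next_batch function).
    train_step: callable taking a batch and blocking until the step is
                done (hypothetical stand-in for a sess.run wrapper).
    """
    load_s = 0.0
    step_s = 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next_batch()      # time spent preparing the input
        t1 = time.perf_counter()
        train_step(batch)         # time spent in the training call
        t2 = time.perf_counter()
        load_s += t1 - t0
        step_s += t2 - t1
    return load_s, step_s


def input_bound_fraction(load_s, step_s):
    """Fraction of the measured time that went into loading data."""
    total = load_s + step_s
    return load_s / total if total > 0 else 0.0
```

If the returned fraction is close to 1, the GPU is mostly waiting for input, which would be consistent with low GPU load while GPU memory stays allocated.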

How much performance increase can I expect from Tensorflow on GPU over CPU?

I have installed tensorflow-gpu on Linux Mint 18. My graphics card is a GT 740m. The deviceQuery and bandwidthTest for CUDA and the MNISTsample for cudnn scripts pass (referred here and here).
Tensorflow does use the GPU (e.g. following these instructions works, and the memory and processing utilization of the GPU increase when running programs), but the performance is rather… mediocre.
For instance, when running the script shown on this site, the GPU is only about twice as fast as the CPU. Certainly a nice improvement, but not "really, really fast", as is stated on the site. Another example: using vgg16 with Keras to classify 100 images, each about 300x200 pixels, takes around 30 seconds.
Is there anything I might do to increase the performance, or can I not expect anything better?

Google cloud-ml stuck without logs after several iterations

I am training a TF ML job on cloud-ml, and the job seems to get stuck after a few iterations (900 iterations). Surprisingly, when I run the code locally it works fine. Hyperparameter tuning on GCP does continue training, but it runs slower than on my local laptop, which has a GTX 1060 GPU.
I am also using the runtime version 1.6.
I changed the scale-tier and it doesn't help. What can be the issue?

Object detection training becomes slower over time. Uses more CPU than GPU as the training progresses

System information
What is the top-level directory of the model you are using: research/object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes (just VGG-16 implementation for Faster RCNN)
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow version (use command below): 1.4.0
CUDA/cuDNN version: 8 and 6
GPU model and memory: NVIDIA-1060 6GB
I am trying to train a Faster-RCNN with VGG-16 as the feature extractor (paper) on my custom dataset using the API.
The training parameters are the same as described in the paper, except that I am running for only 15k steps and resizing the images to 1200x1200 with a batch size of 1.
Training runs fine at first, but it becomes slower as time progresses. It keeps shifting between the CPU and the GPU.
The steps that take around 1 sec run on the GPU, and the steps with high times (~20 secs) run on the CPU; I cross-verified this using 'top' and 'nvidia-smi'. Why does it shift between CPU and GPU in the middle of training? I could understand the shift while the model and logs are being saved, but I don't understand this.
PS: I am running only the train script. I am not running the eval script.
Update:
This gets worse over time.
The secs/step keeps increasing, which slows the rate at which checkpoints and logs are stored.
It should run at less than 1 sec/step, because that was the speed for the first 2k steps of training. And my dataset is pretty small (300 images for training).
In my experience, it is possible that your input images are too large. If you look at TensorBoard during the training session, you can see that all the reshape calculations are running on the GPU. So you could write a Python script to resize your input images without changing the aspect ratio, and at the same time set your batch size a little higher (maybe 4 or 8). Then you can train on your dataset faster and still get a relatively good result (mAP).
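The resizing step suggested above comes down to choosing one scale factor for both sides. As a minimal sketch (the helper name `fit_within` and the `max_side` parameter are assumptions for illustration, not part of the object detection API), the target size could be computed like this:

```python
def fit_within(width, height, max_side):
    """Return (new_width, new_height) with the longer side at most max_side.

    Both sides are scaled by the same factor, so the aspect ratio is kept
    up to rounding. Images that already fit are returned unchanged.
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height and max_side must be positive")
    longer = max(width, height)
    if longer <= max_side:
        return width, height
    scale = max_side / longer
    # Clamp to 1 so extremely thin images never collapse to zero pixels.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, `fit_within(1600, 1200, 800)` gives `(800, 600)`. The resulting tuple could then be passed to an image library's resize call, such as Pillow's `Image.resize`, assuming Pillow is what the script uses. Note that any bounding-box annotations would need to be scaled by the same factor.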

Resource Exhausted OOM while loading VGG16

I apologize in advance if this issue seems too basic, but I am new to Tensorflow and would appreciate any help.
I find that I frequently have to reboot my computer to be able to load models such as VGG16 from keras.applications. I have a fairly high-end machine with 4 GeForce GTX 1080 Ti GPUs and an Intel® Core™ i7-6850K CPU @ 3.60GHz × 12, and I use it only for Tensorflow (through Keras).
Right after a reboot I can successfully load models (such as VGG16) and train on large training datasets. But if I let my computer sit idle for a while and rerun the same program, I get a resource exhausted message (OOM), which can only be fixed by rebooting again. It is extremely frustrating to keep rebooting my computer every couple of hours. Does anyone know what's going on and how to solve this issue?
If you have a batch size > 1, try using a lower batch size, which could lower the memory requirements for the GPU.
Also, once you have finished working with the network, check with nvidia-smi whether the GPU memory was released. If not, kill the process that loaded the network (usually some python interpreter).
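The nvidia-smi check can also be scripted. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-compute-apps` option with CSV output; the helper names `parse_compute_apps` and `list_gpu_processes` are made up for this example.

```python
import subprocess

# Ask nvidia-smi for the processes currently holding GPU memory.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, used_memory' lines into (pid, name, mib) tuples.

    mib is None when nvidia-smi reports a non-numeric value such as [N/A].
    """
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # not a data row
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        pid, name, mem = pid.strip(), name.strip(), mem.strip()
        if not pid.isdigit():
            continue
        rows.append((int(pid), name, int(mem) if mem.isdigit() else None))
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the parsed list of GPU processes."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling `list_gpu_processes()` after a training script has exited shows whether a leftover python interpreter still holds memory. Such a process could then be stopped by its PID, for example with `os.kill(pid, signal.SIGTERM)` on Linux, after confirming it is really the stale one.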