TensorFlow GPU - Slow Training - A Lot of Time Spent in QueueDequeueManyV2

I am training a TensorFlow model on a GPU-enabled device. Below is an image of the profiler output. Training is running very slowly; based on the image, the majority of the time is spent in I/O, in the QueueDequeueManyV2 op.
How can I speed up the training?
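QueueDequeueManyV2 is the dequeue op of the older queue-based input pipeline, so a profile dominated by it usually means the GPU is waiting on input data rather than computing. One common remedy is to switch to a tf.data pipeline that decodes in parallel and prefetches batches. The sketch below is only illustrative: the file names, feature keys, image size, and parse_fn are assumptions, not taken from the question.

```python
import tensorflow as tf

# Hypothetical file list and parse function; adjust the feature names,
# image size, and batch size to your own records.
filenames = ["train-000.tfrecord", "train-001.tfrecord"]

def parse_fn(serialized):
    features = tf.parse_single_example(serialized, {
        "image": tf.FixedLenFeature([], tf.string),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, features["label"]

dataset = (tf.data.TFRecordDataset(filenames)
           .map(parse_fn, num_parallel_calls=4)  # decode on several CPU threads
           .shuffle(buffer_size=1000)
           .batch(32)
           .prefetch(2))                         # keep batches ready ahead of the GPU

images, labels = dataset.make_one_shot_iterator().get_next()
```

If the profile still shows the input op dominating after a change like this, the bottleneck is usually disk throughput or the decode step itself, so raising num_parallel_calls or pre-resizing the images on disk are the next things to try.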

Related

Is it possible to do the whole training procedure on the GPU with TensorFlow/Keras?

If the dataset is small enough to fit in GPU memory, is it possible with TensorFlow to allocate it all on the GPU initially and then do the training without any data transfers between CPU and GPU?
It seems to me that this is not possible with tf.data, and that the data transfer is not controlled by the programmer.
Analyzing the GPU workload during training, it reaches 75% with CIFAR-10, but I would expect it to reach 100% given that the dataset fits in GPU memory. Also, analyzing with TensorBoard, I see that there are a lot of Send operations.
(I saw that there is a similar, quite old question here, but at that time tf.data did not exist yet.)
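One workaround sometimes used for datasets this small is to bypass the per-step feed entirely and keep the data as GPU-resident tensors, gathering minibatches by index on the device. A minimal TF 1.x-style sketch, assuming a CIFAR-10-sized array; the shapes and the random-index sampling are illustrative, not taken from the question:

```python
import numpy as np
import tensorflow as tf

# Hypothetical CIFAR-10-sized data; ~600 MB of float32, which fits on most GPUs.
x_np = np.random.rand(50000, 32, 32, 3).astype(np.float32)
y_np = np.random.randint(0, 10, size=(50000,)).astype(np.int32)
batch_size = 128

with tf.device("/gpu:0"):
    # The data is uploaded to the GPU once; after that, minibatches are
    # gathered from GPU-resident tensors with no per-step HtoD copy.
    x_all = tf.constant(x_np)
    y_all = tf.constant(y_np)
    idx = tf.random_uniform([batch_size], 0, x_np.shape[0], dtype=tf.int32)
    x_batch = tf.gather(x_all, idx)
    y_batch = tf.gather(y_all, idx)
    # ... build the model on x_batch / y_batch here ...
```

Embedding the array in a tf.constant bloats the GraphDef, so for anything larger people usually use a non-trainable tf.Variable initialized from a placeholder instead; the principle is the same.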

Object detection training becomes slower over time and uses more CPU than GPU as training progresses

System information
What is the top-level directory of the model you are using: research/object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes (a VGG-16 implementation for Faster R-CNN)
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow version (use command below): 1.4.0
CUDA/cuDNN version: 8 and 6
GPU model and memory: NVIDIA 1060, 6 GB
I am trying to train a Faster R-CNN with VGG-16 as the feature extractor (paper) on my custom dataset using the API.
The training parameters are the same as described in the paper, except that I am running for only 15k steps and resizing the images to 1200x1200 with a batch size of 1.
The training runs fine, but as time progresses it becomes slower and keeps shifting between CPU and GPU.
The steps that take around 1 sec run on the GPU, and the much slower ones (~20 secs) run on the CPU; I cross-verified this using 'top' and 'nvidia-smi'. Why is it shifting between CPU and GPU in the middle of training? I can understand the shift when the model and logs are being saved, but otherwise I don't understand why.
PS: I am running only the train script; I am not running the eval script.
Update:
This gets worse over time:
the secs/step keeps increasing, which also slows down the rate at which checkpoints and logs get stored.
It should run at less than 1 sec/step, because that was the speed for the first 2k steps when I started training, and my dataset is pretty small (300 images for training).
In my experience, it is possible that your input images are simply too large. If you look at TensorBoard during the training session, you can see that all the reshape calculations are running on the GPU. So you could write a Python script to resize your input images beforehand without changing the aspect ratio, and at the same time set your batch size a little higher (maybe 4 or 8). Then you can train on your dataset faster and still get a relatively good result (mAP).
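A minimal offline-resize sketch along those lines (not part of the Object Detection API; the folder names and the 600-pixel limit are placeholders, and Pillow is assumed to be installed):

```python
import os
from PIL import Image

SRC_DIR = "images_raw"      # hypothetical input folder
DST_DIR = "images_resized"  # hypothetical output folder
MAX_SIDE = 600              # shrink so the longest side is at most this many pixels

os.makedirs(DST_DIR, exist_ok=True)
for name in os.listdir(SRC_DIR):
    img = Image.open(os.path.join(SRC_DIR, name))
    scale = min(1.0, MAX_SIDE / float(max(img.size)))
    if scale < 1.0:
        new_size = (int(img.width * scale), int(img.height * scale))
        img = img.resize(new_size, Image.BILINEAR)  # aspect ratio preserved
    img.save(os.path.join(DST_DIR, name))
```

If your annotations store absolute pixel coordinates, remember to scale the bounding boxes by the same factor; normalized coordinates are unaffected.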

Why does TensorFlow spend so much time on HtoD memcpy with a Titan X?

I'm running the AlexNet model from here in TensorFlow to evaluate how much of the training time the library actually spends on the GPU, with the following parameters and hardware:
1024 images in the training dataset
10 epochs with a mini-batch size of 128
using a GTX Titan X GPU
I found that the real execution time on the GPU is just a fraction of the total training time (the graph below compares TensorFlow's AlexNet with Caffe and its AlexNet implementation).
(Information captured with nvidia-smi; 'Porcentagem' means percentage and 'Tempo (s)' means time in seconds.)
The GPU utilization rate oscillates frantically between 0 and 100% during training. Why is that? Caffe doesn't oscillate much beyond 40%.
Also, TensorFlow spends a lot of time doing memory copies from host to device, while Caffe doesn't. Why?
(TensorFlow)
(Caffe)

TensorFlow RNN training 100% CPU while only using 60% GPU

I'm working on code that trains a relatively large RNN (a 128-cell LSTM and some added layers). The main process is maxing out a core on the CPU, and I'm wondering if this is normal or whether I can optimize it. During the training loop (session.run calls) it's using about 60-70% GPU load while using 100% CPU load on one core. Note that the data sampling work is already being done concurrently on other cores, so this is just the updating of the model parameters. Is this normal for such applications in TensorFlow, or should the CPU load be much lower while the GPU runs at full capacity?
We don't have full documentation on it yet, but you can take a look at the profiling information to see if it gives you more of an idea of where the time is going:
https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659
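For reference, a self-contained sketch of the usual RunMetadata/timeline tracing pattern in TF 1.x (whether this exactly matches the linked comment is an assumption), with a toy matmul standing in for the actual training step:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Toy graph; in practice `c` would be your training op (the session.run call).
a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(c, options=run_options, run_metadata=run_metadata)

# Dump a Chrome trace; open it at chrome://tracing to see per-op, per-device
# timings, including HtoD copies and time spent waiting on input.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())
```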
I think an RNN cell has two inputs (the current input and the previous state), so it must wait for both when processing the data; in other words, it is not as easy to parallelize as a CNN. You can use a bigger batch size to improve the GPU utilization rate, but that may cause other problems, as discussed in the paper On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.
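To make the sequential-dependency point concrete, here is a minimal TF 1.x sketch (the sizes are made up): the time dimension has to be processed step by step, so the batch dimension is the main source of parallel work the GPU gets per step.

```python
import tensorflow as tf

# Hypothetical sizes; only the batch dimension gives the GPU parallel work,
# because each time step depends on the state produced by the previous one.
batch_size, seq_len, input_dim, num_units = 256, 100, 64, 128

inputs = tf.placeholder(tf.float32, [batch_size, seq_len, input_dim])
cell = tf.nn.rnn_cell.LSTMCell(num_units)

# dynamic_rnn unrolls over seq_len sequentially; the 256 sequences in the
# batch are the work that runs in parallel at each of those steps.
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
```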

TensorFlow training is slow, stops running on GPU?

I'm running the cifar_train program from TensorFlow's tutorial. The tutorial says it reaches a peak accuracy of about 86% after a few hours. I tried running it on my laptop, and after about 4 days it was still training. I'm wondering if I'm misreading the tutorial by thinking that it will be finished after a few hours, or if that's just the accuracy it reaches by that time. Also, I'm watching my GPU usage via Afterburner: when I started it, usage was around 30-40%, but after looking at it overnight the GPU usage is 0% and the CPU is at 100%. Is this normal?
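One quick check when the GPU sits at 0% is to confirm that TensorFlow is actually placing ops on the GPU at all. This is a generic sketch, not specific to the tutorial script, assuming a GPU-enabled TensorFlow 1.x install:

```python
import tensorflow as tf

# Build a small graph and log where each op gets placed. If the matmul shows
# up on /cpu:0 instead of /gpu:0, the GPU build or CUDA setup isn't being used.
a = tf.random_normal([1024, 1024])
b = tf.random_normal([1024, 1024])
total = tf.reduce_sum(tf.matmul(a, b))

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(total))
```

If the ops are on the GPU but utilization still falls toward zero over time, the usual suspect is the input pipeline starving the GPU rather than the compute itself.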