Tensorflow 0.6 GPU Issue - gpu

I am using an Nvidia DIGITS box with a GPU (Nvidia GeForce GTX Titan X) and TensorFlow 0.6 to train a neural network, and everything works. However, when I check Volatile GPU-Util using nvidia-smi -l 1, it is only 6%. I think most of the computation is happening on the CPU, since the process that runs TensorFlow shows about 90% CPU usage. As a result, training is very slow. Is there a way to make full use of the GPU instead of the CPU to speed up training? Thanks!

I suspect you have a bottleneck somewhere (like in this GitHub issue): some operation has no GPU implementation, so it is placed on the CPU, and the GPU idles while waiting on data transfers. For instance, until recently reduce_mean was not implemented on GPU, and before that Rank was not implemented on GPU even though many ops used it implicitly.
At one point, I saw a network from fully_connected_preloaded.py run slowly because a Rank op was placed on the CPU, which triggered a transfer of the entire dataset from GPU to CPU at each step.
To solve this, I would first recommend upgrading to 0.8, since it implements a few more ops for GPU (reduce_prod for integer inputs, reduce_mean, and others).
Then you can create your session with log_device_placement=True and check whether any ops are placed on CPU or GPU in a way that would cause excessive transfers per step.
Ops in the input pipeline (such as parse_example) often have no GPU implementation. I sometimes find it helpful to pin the whole input pipeline to the CPU using a with tf.device("/cpu:0"): block.
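As a rough illustration of the last two suggestions, here is a minimal sketch assuming a TensorFlow 2.x install, where tf.debugging.set_log_device_placement(True) plays the role of the log_device_placement=True session option mentioned above. The make_dataset helper, the toy tensors, and the divide-by-255 preprocessing are hypothetical stand-ins, not part of the original code.

```python
import tensorflow as tf

# Print the device every op is placed on (TF 2.x counterpart of creating
# a TF 1.x session with log_device_placement=True).
tf.debugging.set_log_device_placement(True)


def make_dataset(features, labels, batch_size=32):
    """Build a small input pipeline whose ops are pinned to the CPU."""
    with tf.device("/CPU:0"):
        ds = tf.data.Dataset.from_tensor_slices((features, labels))
        # Hypothetical preprocessing: scale uint8 inputs to [0, 1].
        ds = ds.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
        # Prepare one batch ahead while the previous one is being consumed.
        ds = ds.batch(batch_size).prefetch(1)
    return ds


# Toy data, only here to make the sketch runnable.
features = tf.zeros([64, 8], dtype=tf.uint8)
labels = tf.zeros([64], dtype=tf.int32)
first_x, first_y = next(iter(make_dataset(features, labels)))
```

The placement log is verbose, so it is probably best enabled only for a few steps while hunting for ops that land on an unexpected device.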

Related

My tensorflow defaults to using my GPU instead of CPU, which is like 10 times slower. How do I fix this and make it use the CPU?

I have one of those gaming laptops with an integrated GPU and a dedicated GPU (NVIDIA GeForce RTX 3070).
I was getting very slow speeds training neural networks on tensorflow. Many, many times slower than another laptop with vastly inferior specs in CPU and GPU.
I think the reason for this slowness is that tensorflow is running on the dedicated GPU, because when I disable the dedicated GPU, training becomes about 10 times faster. These are huge differences, an order of magnitude.
I know the kernel is running on the dedicated GPU by default because when I disable the dedicated GPU in the middle of the session, the kernel dies.
Therefore, I think disabling the dedicated GPU has forced it to run on the CPU (AMD Ryzen 9 5900HX), which should be better.
I'm running this on Anaconda using Jupyter Notebook.
How do I force it to use my CPU instead of my GPU?
Edit: This seems to be a complicated issue. Some more information.
With dedicated GPU disabled, when training, according to the task manager the GPU usage is 0% (as expected) and the CPU usage is 40%.
But with dedicated GPU enabled, when training, GPU usage is about 10% and CPU usage is about 20%. This is 10 times slower than the above. Why is it using both, but less CPU?
With dedicated GPU enabled (i.e. the normal situation), according to the task manager, scikit-learn uses the CPU not the GPU. So this problem is specific to tensorflow.
Killing the dedicated GPU in the middle of the session crashes not only the kernel, but any open Jupyter Notebooks as well.
Forcing Anaconda and Jupyter Notebook to use the integrated GPU instead of the dedicated GPU in the Windows settings doesn't fix the problem. It's still using the dedicated GPU.
Just tell tensorflow to do so:
import tensorflow as tf
with tf.device("/CPU:0"):  # device name might vary
    model.fit(...)
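Another option, sketched below under the assumption that the GPU is being used through CUDA, is to hide the GPU from the process entirely by setting the CUDA_VISIBLE_DEVICES environment variable before TensorFlow is imported. The hide_gpus_from_cuda name is made up for this example. In a Jupyter notebook this would have to run in a fresh kernel, before the first tensorflow import.

```python
import os


def hide_gpus_from_cuda():
    """Hide all CUDA devices from libraries that are imported afterwards."""
    # "-1" matches no device index, so CUDA reports no usable GPUs.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus_from_cuda()

# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
# model.fit(...)  # would then be expected to run on the CPU
```

Unlike the with tf.device(...) block, this affects the whole process, so it cannot be switched back on halfway through a session.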

TensorFlow GPU and CPU offloaded ops segregation

Assuming the TensorFlow GPU library is being used for computation, which operations are offloaded to the GPU (and how often)? What is the performance impact of:
CPU Core count (because it is now not actively involved in computation)
RAM size.
GPU VRAM (What benefit of owning a higher memory GPU)
Say I'd like to decide on these hardware choices. Can someone explain, with an example, which aspect of a machine learning model is affected by each hardware constraint?
(I need a little elaboration on what exact ops are offloaded to GPU and CPU, based on TensorFlow GPU lib for example.)
One way of using tensorflow to efficiently spread work between CPUs and GPUs is to use estimators.
For example:
model = tf.estimator.Estimator(model_fn=model_fn,
                               params=params,
                               model_dir="./models/model-v0-0")
model.train(lambda: input_fn(train_data_path), steps=1000)
In the function 'input_fn' the data batch loading and batch preparation will be offloaded to the CPU while the GPU is working on the model as declared in the function 'model_fn'.
If you are concerned about RAM constraints, look at using the TFRecord format, as it lets you stream examples from disk instead of loading the whole dataset into RAM;
see tensorflow.org/tutorials/load_data/tf_records

Low GPU usage by Keras / Tensorflow?

I'm using keras with the tensorflow backend on a computer with an Nvidia Tesla K20c GPU (CUDA 8).
I'm training a relatively simple Convolutional Neural Network. During training I run the terminal program nvidia-smi to check the GPU use. As you can see in the following output, the GPU utilization commonly shows around 7%-13%.
My question is: during the CNN training, shouldn't the GPU usage be higher? Is this a sign of a bad GPU configuration or usage by keras/tensorflow?
nvidia-smi output
This could be due to several reasons, but most likely you have a bottleneck when reading the training data. Once your GPU has processed a batch, it needs more data. Depending on your implementation, the GPU may have to wait for the CPU to load that data, which results in lower GPU usage and a longer training time.
Try loading all the data into memory if it fits, or use a QueueRunner, which builds an input pipeline that reads data in the background. This will reduce the time your GPU spends waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.
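To make the background-reading idea concrete, here is a minimal sketch using only the Python standard library rather than TensorFlow's QueueRunner. A producer thread fills a bounded queue while the consumer (the training loop) drains it. The prefetch and load_batches names are made up for this example, and load_batches is a stand-in for slow disk reads and preprocessing.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def prefetch(batches, buffer_size=4):
    """Yield items from `batches`, loading them in a background thread."""
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
        finally:
            buf.put(_SENTINEL)  # always signal the end, even on error

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _SENTINEL:
            return
        yield item


def load_batches(num_batches, batch_size):
    """Hypothetical stand-in for slow disk reads and preprocessing."""
    for i in range(num_batches):
        yield list(range(i * batch_size, (i + 1) * batch_size))


prefetched = list(prefetch(load_batches(3, 2)))
```

The bounded queue keeps memory use in check: the loader can run at most buffer_size batches ahead of the consumer. Note that this sketch swallows loader errors (the stream just ends early), which a real pipeline would want to surface.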
You should find the bottleneck:
On Windows, use Task Manager > Performance to monitor how you are using your resources.
On Linux, use nmon, nvidia-smi, and htop to monitor your resources.
The most likely scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard disk frequently, you most probably need to change the way you handle the dataset to reduce the number of disk accesses.
Use memory to pre-load as much as possible.
If you are using a RESTful API or any similar service, make sure that you do not wait long to receive what you need. For RESTful services, the number of requests per second might be limited (check your network usage via nmon/Task Manager).
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using a cache, faster libraries, etc.).
Play with the batch_size (however, it is said that higher values (>512) for batch size might have negative effects on accuracy).
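One rough way to tell which side is the bottleneck is to time how long each step waits for data versus how long the step itself takes. The sketch below does that in plain Python. profile_steps, slow_batches, and cheap_step are hypothetical names, and the sleep is only there to simulate slow loading. With a real GPU framework, train_step would need to block until the device has finished, otherwise the step time is underestimated.

```python
import time


def profile_steps(batches, train_step):
    """Return (seconds spent waiting for data, seconds spent in train_step)."""
    load_time = 0.0
    step_time = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent fetching the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # time spent computing on the batch
        t2 = time.perf_counter()
        load_time += t1 - t0
        step_time += t2 - t1
    return load_time, step_time


def slow_batches(n):
    """Simulated loader: each batch takes about 50 ms to produce."""
    for i in range(n):
        time.sleep(0.05)
        yield i


def cheap_step(batch):
    """Simulated training step that does very little work."""
    return sum(range(1000)) + batch


load_s, step_s = profile_steps(slow_batches(3), cheap_step)
```

If the load time dominates, as it does in this simulated case, the fixes listed above (pre-loading, caching, fewer disk accesses) are the ones to try first.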
The reason may be that your network is "relatively simple". I had an MNIST network with 60k training examples.
With 100 neurons in 1 hidden layer, CPU training was faster, and GPU utilization during GPU training was around 10%.
With 2 hidden layers of 2000 neurons each, the GPU was significantly faster (24s vs 452s on CPU) and its utilization was around 39%.
I have a pretty old PC (24GB DDR3-1333, i7 3770k) but a modern graphics card (RTX 2070, plus SSDs if that matters), so there is a memory-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement there is here. I'd have to train a bigger network and compare it against a better CPU/memory configuration with the same GPU.
I guess that for smaller networks it doesn't matter that much anyway, because they are relatively easy for the CPU.
Measuring GPU performance and utilization is not as straightforward as measuring CPU or memory. A GPU is a massively parallel processing unit, and many factors are involved. The GPU utilization number shown by nvidia-smi is the percentage of time during which at least one GPU multiprocessor group was active. If this number is 0, none of your GPU is being utilized, but a value of 100 does not mean that the GPU is being used to its full potential.
These two articles have lots of interesting information on this topic:
https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/
https://www.imgtec.com/blog/measuring-gpu-compute-performance/
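If you want to log the utilization number over time instead of watching the terminal, nvidia-smi can emit it as CSV. The sketch below is one way to read that from Python. parse_gpu_stats and read_gpu_stats are made-up helper names, and the sample line is invented for illustration rather than real output. Some GPUs report placeholder text instead of numbers for certain fields, which this sketch does not handle.

```python
import csv
import io
import subprocess

# Ask nvidia-smi for three fields per GPU, as bare numbers without a header.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Parse the CSV text produced by QUERY into one dict per GPU."""
    stats = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        util, used, total = row
        stats.append({
            "util_percent": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (needs an Nvidia GPU)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Invented sample line, only to show the expected shape of the input.
sample = "7, 11019, 12288\n"
stats = parse_gpu_stats(sample)
```

Keep in mind the caveat above: a high utilization value read this way still says little about how much of the GPU each kernel actually occupies.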
Low GPU utilization might be due to a small batch size. Keras has a habit of occupying the whole GPU memory whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if the utilization changes.

What's the impact of using a GPU in the performance of serving a TensorFlow model?

I trained a neural network using a GPU (1080 ti). The training speed on GPU is far better than using CPU.
Currently, I want to serve this model using TensorFlow Serving. I'm interested to know whether using a GPU in the serving process has the same impact on performance.
Since training works on batches but inference (serving) handles asynchronous requests, do you suggest using a GPU to serve a model with TensorFlow Serving?
You still need to do a lot of tensor operations on the graph to predict something, so a GPU still provides a performance improvement for inference. Take a look at this Nvidia paper. They have not tested their stuff on TF, but it is still relevant:
Our results show that GPUs provide state-of-the-art inference
performance and energy efficiency, making them the platform of choice
for anyone wanting to deploy a trained neural network in the field. In
particular, the Titan X delivers between 5.3 and 6.7 times higher
performance than the 16-core Xeon E5 CPU while achieving 3.6 to 4.4
times higher energy efficiency.
The short answer is yes, you'll get roughly the same speedup from running on the GPU after training, with a few minor qualifications.
In training you run 2 passes over the data (forward and backward), all on the GPU. During feedforward inference you do less work, so more time is spent transferring data to GPU memory relative to computation than in training. This is probably a minor difference, though, and you can now load the GPU asynchronously if that's an issue (https://github.com/tensorflow/tensorflow/issues/7679).
Whether you'll actually need a GPU for inference depends on your workload. If your workload isn't overly demanding, you might get away with using the CPU; after all, the computation per sample is less than half of what training needs. So consider the number of requests per second you'll need to serve, and test whether you overload your CPU trying to achieve that. If you do, time to get the GPU out!
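For a back-of-the-envelope version of that check, you can multiply the request rate by the measured time per request to get the average number of busy workers, then leave some headroom. The sketch below does that. workers_needed is a made-up name, the 70% target utilization is an arbitrary assumption, and the numbers in the example are invented rather than measured.

```python
import math


def workers_needed(requests_per_second, seconds_per_request,
                   target_utilization=0.7):
    """Estimate how many parallel workers (e.g. CPU cores) a load requires."""
    # Average number of requests in flight at any moment.
    busy_workers = requests_per_second * seconds_per_request
    # Leave headroom so bursts do not immediately saturate the machine.
    return max(1, math.ceil(busy_workers / target_utilization))


# Invented example: 200 requests/s at 50 ms of CPU inference each.
estimate = workers_needed(200, 0.05)
```

If the estimate is well beyond the cores you have, that is the point where serving from the GPU starts to look worthwhile. An actual load test is still the real answer.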

Variables on CPU, training/gradients on GPU

On the CIFAR-10 tutorial, I noticed that the variables are placed in CPU memory, but it is stated in cifar10-train.py that it is trained with a single GPU.
I'm quite confused. Are the layers/activations stored on the GPU? Or alternatively, are the gradients stored on the GPU? Otherwise, it would seem that storing variables on the CPU would not make use of the GPU at all: everything is stored in CPU memory, so only the CPU is used for forward/backward propagation.
If the GPU was used for f/b propagation, wouldn't that be a waste due to latency shuffling data CPU <-> GPU?
Indeed, in cifar10-train the activations and gradients are on the GPU; only the parameters are on the CPU. You are right that this is not optimal for single-GPU training, due to the cost of copying parameters between CPU and GPU. I suspect it is done this way to have a single library for single-GPU and multi-GPU models, since in the multi-GPU case it is probably faster to have the parameters on the CPU. You can easily test what speedup you get by moving all variables to the GPU: just remove the "with tf.device('/cpu:0')" in "_variable_on_cpu" in cifar10.py.