I was profiling the inference latency of a MobileNetV2 model (with a batch size of 20) on my GeForce GTX 1080 GPU.
The TensorFlow timeline looks as follows:
I notice that there is quite a lot of empty space in the "stream: all Compute" line, which I think means my GPU was not always busy. What could be causing this idle time, and are there ways to improve it?
Related
I have one of those gaming laptops with an integrated GPU and a dedicated GPU (NVIDIA GeForce RTX 3070).
I was getting very slow speeds training neural networks in TensorFlow, many times slower than on another laptop with vastly inferior CPU and GPU specs.
I think the reason for this slowness is that TensorFlow is running on the dedicated GPU: when I disable the dedicated GPU, training speeds up, like 10 times faster. These are huge differences, an order of magnitude.
I know the kernel is running on the dedicated GPU by default because when I disable the dedicated GPU in the middle of the session, the kernel dies.
Therefore, I think disabling the dedicated GPU has forced it to run on the CPU (AMD Ryzen 9 5900HX), which should be better.
I'm running this on Anaconda using Jupyter Notebook.
How do I force it to use my CPU instead of my GPU?
Edit: This seems to be a complicated issue. Some more information.
With dedicated GPU disabled, when training, according to the task manager the GPU usage is 0% (as expected) and the CPU usage is 40%.
But with dedicated GPU enabled, when training, GPU usage is about 10% and CPU usage is about 20%. This is 10 times slower than the above. Why is it using both, but less CPU?
With dedicated GPU enabled (i.e. the normal situation), according to the task manager, scikit-learn uses the CPU not the GPU. So this problem is specific to tensorflow.
Killing the dedicated GPU in the middle of the session crashes not only the kernel; opening Jupyter Notebook breaks as well.
Forcing Anaconda and Jupyter Notebook to use the integrated GPU instead of the dedicated GPU in the Windows Settings doesn't fix the problem; it still uses the dedicated GPU.
Just tell TensorFlow to do so:
with tf.device("/CPU:0"):  # device name might vary
    model.fit(...)
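If you would rather hide the GPU from TensorFlow entirely instead of wrapping individual calls, something like the sketch below should also work. It assumes TensorFlow 2.x, and the environment-variable approach has to run before TensorFlow initializes the GPU:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # Option 1: hide all CUDA devices; must be set before importing TensorFlow

import tensorflow as tf

# Option 2 (TF 2.x): keep the GPU physically present but tell TensorFlow not to use it.
tf.config.set_visible_devices([], "GPU")

print(tf.config.get_visible_devices())      # should now list only CPU devices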
I have a GTX 1050 Ti (4GB), an i5 CPU, and 8GB of memory.
I successfully installed tensorflow-gpu with the CUDA driver on Windows 10, and the test shows that TensorFlow is actually using the GPU (snapshot):
However, when training a CNN, while the GPU memory usage is always at 100%, the GPU load is almost 0 with some spikes at 30%~70%:
Is this normal?
EDIT: While the GPU load is almost 0 with spikes, the CPU load is fixed at 100% during training.
EDIT2: I read somewhere that CPU load can be high while GPU load is low if there is a lot of data copying between the CPU and GPU. But I am using the official TensorFlow Object Detection API for the training, so I don't know where in the code that could be happening.
What you see is normal behavior in most cases.
TensorFlow reserves the entire GPU memory up front, which is why memory usage sits at 100%.
The load on the GPU depends on the data it is being fed.
If data loading is slow, then most of the time the GPU is just waiting for data to be copied from disk to the GPU, and during that time it does no work. That is what you see on your screen.
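If the input pipeline really is the bottleneck, overlapping data loading with GPU compute usually helps. A minimal tf.data sketch, assuming a recent TF 2.x; the file name, image size, and TFRecord schema here are placeholders, not taken from the question:

import tensorflow as tf

# Optional: let GPU memory grow on demand instead of being reserved up front
# (this only changes the reported memory usage, not the speed).
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

def parse_fn(serialized):
    # Placeholder decode step: adapt the feature spec to your own TFRecords.
    features = tf.io.parse_single_example(
        serialized,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, [300, 300])
    return image, features["label"]

dataset = (tf.data.TFRecordDataset(["train.tfrecord"])           # placeholder path
           .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)   # parse in parallel on the CPU
           .batch(24)
           .prefetch(tf.data.AUTOTUNE))                          # prepare next batches while the GPU trains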
I am running a deep learning CNN model (4 convolutional layers and 3 fully connected layers, written in Keras with TensorFlow as the backend) on two different machines.
I have 2 machines (A: with a GTX 960 graphics GPU, 2GB memory, 1.17 GHz clock speed; and B: with a Tesla K40 compute GPU, 12GB memory, 745 MHz clock speed).
But when I run the CNN model on A:
Epoch 1/35
50000/50000 [==============================] - 10s 198us/step - loss: 0.0851 - acc: 0.2323
on B:
Epoch 1/35
50000/50000 [==============================] - 43s 850us/step - loss: 0.0800 - acc: 0.3110
The numbers are not even comparable. I am quite new to deep learning and running code on GPUs. Could someone please help me understand why the numbers are so different?
Dataset: CIFAR-10 (32x32 RGB images)
Model batch size: 128
Model number of parameters: 1.2M
OS: Ubuntu 16.04
Nvidia driver version: 384.111
CUDA version: 7.5, V7.5.17
Please let me know if you need any more data.
Edit 1: (adding CPU info)
Machine A (GTX 960): 8 cores - Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Machine B (Tesla K40c): 8 cores - Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
TL;DR: Measure again with a larger batch size.
Those results do not surprise me much. It's a common mistake to think that an expensive Tesla card (or a GPU for that matter) will automatically do everything faster. You have to understand how GPUs work in order to harness their power.
If you compare the base clock speeds of your devices, you will find that your Xeon CPU has the fastest one:
Nvidia K40c: 745MHz
Nvidia GTX 960: 1127MHz
Intel i7: 3400MHz
Intel Xeon: 3500MHz
This gives you a hint of the speeds at which these devices operate and gives a very rough estimate of how fast they can crunch numbers if they would only do one thing at a time, that is, with no parallelization.
So as you see, GPUs are not fast at all (for some definition of fast), in fact they're quite slow. Also note how the K40c is in fact slower than the GTX 960.
However, the real power of a GPU comes from its ability to process a lot of data simultaneously! If you now check again at how much parallelization is possible on these devices, you will find that your K40c is not so bad after all:
Nvidia K40c: 2880 cuda cores
Nvidia GTX 960: 1024 cuda cores
Intel i7: 8 threads
Intel Xeon: 8 threads
Again, these numbers give you a very rough estimate of how many things these devices can do simultaneously.
Note: I am severely simplifying things: In absolutely no way is a CPU core comparable to a cuda core! They are very very different things. And in no way can base clock frequencies be compared like this! It's just to give an idea of what's happening.
So, your devices need to be able to process a lot of data in parallel in order to maximize their throughput. Luckily TensorFlow already does this for you: it will automatically parallelize all those heavy matrix multiplications for maximum throughput. However, this is only going to be fast if the matrices have a certain size. Your batch size is set to 128, which means that almost all of these matrices will have their first dimension set to 128. I don't know the details of your model, but if the other dimensions are not large either, then I suspect that most of your K40c is staying idle during those matrix multiplications.
Try increasing the batch size and measure again. You should find that larger batch sizes make the K40c faster in comparison with the GTX 960. The same should be true for increasing the model's capacity: increase the number of units in the fully-connected layers and the number of filters in the convolutional layers. Adding more layers will probably not help here. The output of the nvidia-smi tool is also very useful to see how busy a GPU really is.
Note however that changing the model's hyper-parameters and/or the batch size will of course have a huge impact on how well the model is able to train, and naturally you might also hit memory limitations.
If increasing the batch size or changing the model is not an option, you could perhaps also try to train two models on the K40c at the same time to make use of the idle cores. However, I have never tried this, so it might not work at all. A quick benchmark along these lines is sketched below.
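One way to test the batch-size claim is to time one epoch at a few batch sizes on each machine while watching nvidia-smi. A rough sketch, where the layer sizes are just a stand-in for the 4-conv/3-dense model described in the question, not its actual architecture:

import time
import tensorflow as tf

def build_model():
    # Stand-in for the 4 convolutional + 3 fully connected layers in the question.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0

for batch_size in (128, 512, 1024):          # larger sizes may exceed the GTX 960's 2GB memory
    model = build_model()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    start = time.time()
    model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
    print("batch_size=%d: %.1fs per epoch" % (batch_size, time.time() - start))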
I'm experimenting with running the AlexNet model from here in TensorFlow to evaluate the time the library spends on the GPU, with the following parameters and hardware:
1024 images in the training dataset
10 epochs with mini-batch sizes of 128
using a GTX Titan X GPU
I found that the real execution time on the GPU is just a fraction of the total training time (the graph below compares TensorFlow and its AlexNet implementation vs Caffe and its AlexNet implementation).
(Information captured with nvidia-smi; 'Porcentagem' means percentage and 'Tempo (s)' means time in seconds.)
The GPU utilization rate oscillates wildly between 0 and 100% during training. Why is that? Caffe doesn't oscillate much beyond 40%.
Also, TensorFlow spends a lot of time doing memory copies from host to device, while Caffe doesn't. Why is that?
(tensorflow)
(caffe)
I am using an Nvidia DIGITS Box with a GPU (Nvidia GeForce GTX Titan X) and TensorFlow 0.6 to train the neural network, and everything works. However, when I check the volatile GPU utilization using nvidia-smi -l 1, I notice that it's only 6%, and I think most of the computation is on the CPU, since the process running TensorFlow has about 90% CPU usage. As a result, training is very slow. I wonder if there are ways to make full use of the GPU instead of the CPU to speed up the training process. Thanks!
I suspect you have a bottleneck somewhere (like in this GitHub issue) -- you have some operation which doesn't have a GPU implementation, so it's placed on the CPU, and the GPU is idling because of data transfers. For instance, until recently reduce_mean was not implemented on the GPU, and before that Rank was not implemented on the GPU while being implicitly used by many ops.
At one point, I saw a network from fully_connected_preloaded.py being slow because there was a Rank op that got placed on the CPU, hence triggering a transfer of the entire dataset from GPU to CPU at each step.
To solve this I would first recommend upgrading to 0.8, since it has a few more ops implemented for GPU (reduce_prod for integer inputs, reduce_mean, and others).
Then you can create your session with log_device_placement=True and see if there are any ops placed on CPU or GPU that would cause excessive transfers per step.
There are often ops in the input pipeline (such as parse_example) which don't have GPU implementations; I sometimes find it helpful to pin the whole input pipeline to the CPU using a with tf.device("/cpu:0"): block.
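Putting those two suggestions together, a minimal sketch in the old graph/Session style these answers assume (the placeholders just stand in for a real reader pipeline; in TF 2.x the equivalent logging call is tf.debugging.set_log_device_placement(True)):

import tensorflow as tf

# Pin the input pipeline to the CPU so ops without GPU kernels (e.g. parse_example)
# don't force large host<->device copies every step.
with tf.device("/cpu:0"):
    images = tf.placeholder(tf.float32, [None, 224, 224, 3])   # placeholder input pipeline
    labels = tf.placeholder(tf.int64, [None])

# ... build the model itself outside the block so it lands on the GPU ...

# Log where every op ends up, to spot anything unexpectedly placed on the CPU.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))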