I recently ran into a bit of a glitch where my detection model, running on two different GPUs (a Quadro RTX 4000 and an RTX A4000) in two different systems, utilizes the GPU very differently.
The model uses only 0.2% of the GPU on the Quadro system and anywhere from 50 to 70% on the A4000 machine. I am curious about why this is happening. The rest of the hardware on both machines is the same.
Additional information: the model uses a 3D convolution and is built on TensorFlow.
It looks like the Quadro RTX 4000 machine is not using the GPU at all.
The method tf.test.is_gpu_available() is deprecated and can still return True even though the GPU is not actually being used.
The correct way to verify GPU availability and usage is to check the output of this snippet:
tf.config.list_physical_devices('GPU')
On the Quadro machine you should also run (in terminal):
watch -n 1 nvidia-smi
to see, in real time, how much GPU memory is being used.
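For example, a minimal check (assuming TensorFlow 2.x) that both lists the visible GPUs and confirms that an op is actually placed on one of them could look like this:

import tensorflow as tf

# List the GPUs TensorFlow can actually see
print(tf.config.list_physical_devices('GPU'))

# Log the device each op is placed on, then run a small op
tf.debugging.set_log_device_placement(True)
a = tf.random.normal((1000, 1000))
b = tf.random.normal((1000, 1000))
c = tf.matmul(a, b)
print(c.device)   # should report /device:GPU:0 if the GPU is being used

If this prints an empty device list or a CPU device on the Quadro machine, the problem is likely the TensorFlow/CUDA setup on that system rather than the model itself.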
Related
I have one of those gaming laptops with both an integrated GPU and a dedicated GPU (NVIDIA GeForce RTX 3070).
I was getting very slow speeds training neural networks with TensorFlow, many times slower than on another laptop with vastly inferior CPU and GPU specs.
I think the reason for this slowness is that TensorFlow is running on the dedicated GPU, because when I disable the dedicated GPU, training speeds up by roughly a factor of 10. These are huge differences, an order of magnitude.
I know the kernel is running on the dedicated GPU by default because when I disable the dedicated GPU in the middle of the session, the kernel dies.
Therefore, I think disabling the dedicated GPU has forced it to run on the CPU (AMD Ryzen 9 5900HX), which should be better.
I'm running this on Anaconda using Jupyter Notebook.
How do I force it to use my CPU instead of my GPU?
Edit: This seems to be a complicated issue. Some more information.
With dedicated GPU disabled, when training, according to the task manager the GPU usage is 0% (as expected) and the CPU usage is 40%.
But with dedicated GPU enabled, when training, GPU usage is about 10% and CPU usage is about 20%. This is 10 times slower than the above. Why is it using both, but less CPU?
With dedicated GPU enabled (i.e. the normal situation), according to the task manager, scikit-learn uses the CPU not the GPU. So this problem is specific to tensorflow.
Killing the dedicated GPU in the middle of the session not only crashes the kernel, but also breaks opening Jupyter Notebook.
Forcing Anaconda and Jupyter Notebook to use the integrated GPU instead of the dedicated GPU in the Windows Settings doesn't fix the problem. It still uses the dedicated GPU.
Just tell tensorflow to do so:
with tf.device("/CPU:0"):   # device name might vary
    model.fit(...)
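If you want to go further and hide the GPU from TensorFlow entirely, so nothing can accidentally be placed on it, a minimal sketch (assuming TensorFlow 2.x) is:

import os
# Option 1: hide the GPU before TensorFlow is imported
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

# Option 2: make no GPUs visible to TensorFlow at runtime
tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices('GPU'))   # should print an empty list

Either approach makes TensorFlow fall back to the CPU for the whole session, whereas the tf.device block above only pins the code inside it.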
I ran the following code on a Tesla K80, which as I understand it consists of two GK210 GPUs, each with 12 GB of on-board RAM, connected by something called a PLX switch. I am confused about how, at the PyTorch level, the fact that there are two graphics cards is hidden from the user.
import torch
torch.cuda.device_count() # 1
(my hunch is that tensorflow provides this same abstraction)
Follow-up questions:
If I am training a model with PyTorch, and I run nvidia-smi and see that the GPU is fully utilized, I would assume this means that both GK210s are at 100% utilization. How does PyTorch distribute kernels across the two GK210s, and can I have faith that this is being done efficiently (i.e. in a way that minimizes data transfer between the two cards)? Any resources that explain how this works would be much appreciated.
If I were writing a CUDA application, could I pin a CUDA stream to each card, and explicitly manage data transfers between the two cards?
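For reference, here is a rough sketch (assuming PyTorch and two visible CUDA devices; all tensor names are just for illustration) of what explicitly pinning a stream to each card and moving data between them looks like:

import torch

assert torch.cuda.device_count() >= 2, "this sketch expects two visible CUDA devices"

# One stream per card
stream0 = torch.cuda.Stream(device="cuda:0")
stream1 = torch.cuda.Stream(device="cuda:1")

x0 = torch.randn(1024, 1024, device="cuda:0")
with torch.cuda.stream(stream0):
    y0 = x0 @ x0                      # kernel launched on card 0

# Explicit copy of the result over to the second card
x1 = y0.to("cuda:1", non_blocking=True)
with torch.cuda.stream(stream1):
    y1 = x1 @ x1                      # kernel launched on card 1

torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")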
Hello there,
I am trying to use DarkFlow, a Python implementation of YOLO (which uses TensorFlow as its backend), on my NVIDIA Jetson Nano to detect objects. I have everything set up, but it doesn't want to train. I set it to GPU mode, and a line in the output says this:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 897MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
This is the last line it outputs before the training gets "Killed" without any further messages. Because it's a heavy convolutional NN, I think the reason is RAM over-consumption. The GPU in my Jetson Nano is the only one I can use, so does anybody have a suggestion for how to lower the memory usage or otherwise solve the problem?
Thanks for the answers in advance!
You may try decreasing batch_size to 1 and lowering the width and height values, but I would not recommend a training session on the Jetson Nano. Its limited capabilities (4 GB of shared RAM) hinder the learning process. To work around the limitations you could try to follow this post or this one to increase the swap area, which acts as extra RAM, but I would still recommend using the Nano only for inference.
EDIT1: It is also known that TensorFlow tends to allocate all available RAM, which causes the process to be killed by the OS. To solve the issue you could use tf.GPUOptions to limit TensorFlow's RAM usage.
Example:
import tensorflow as tf
# Allow TensorFlow to allocate at most 40% of the GPU memory for this process
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)
session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
We have chosen per_process_gpu_memory_fraction as 0.4 because it is best practice not to let TensorFlow allocate more than about half of the available RAM, especially since on the Nano that memory is shared with the CPU.
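If you ever move to TensorFlow 2.x, where tf.GPUOptions and tf.Session no longer exist, the rough equivalent (a sketch, assuming the tf.config API) is:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Either let the GPU memory grow on demand...
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # ...or cap this process at a fixed amount (roughly 40% of the Nano's 4 GB):
    # tf.config.set_logical_device_configuration(
    #     gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1600)])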
Best of luck.
I am using an NVIDIA DIGITS box with a GPU (NVIDIA GeForce GTX Titan X) and TensorFlow 0.6 to train a neural network, and everything works. However, when I check the Volatile GPU-Util using nvidia-smi -l 1, I notice that it's only 6%, and I think most of the computation is on the CPU, since the process running TensorFlow uses about 90% CPU. As a result, the training process is very slow. I wonder if there are ways to make full use of the GPU instead of the CPU to speed up training. Thanks!
I suspect you have a bottleneck somewhere (like in this github issue) -- you have some operation which doesn't have GPU implementation, so it's placed on CPU, and the GPU is idling because of data transfers. For instance, until recently reduce_mean was not implemented on GPU, and before that Rank was not implemented on GPU, and it was implicitly being used by many ops.
At one point, I saw a network from fully_connected_preloaded.py being slow because there was a Rank op that got placed on CPU, and hence triggering the transfer of entire dataset from GPU to CPU at each step.
To solve this I would first recommend upgrading to 0.8, since it has a few more ops implemented for GPU (reduce_prod for integer inputs, reduce_mean and others).
Then you can create your session with log_device_placement=True and see if there are any op placements that would cause excessive transfers between CPU and GPU per step.
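For example (a sketch against the 0.x/1.x session API in use here):

import tensorflow as tf

# Log every op's device assignment while the graph runs
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([[1.0, 2.0]])
    b = tf.constant([[3.0], [4.0]])
    print(sess.run(tf.matmul(a, b)))   # placements are printed to stderr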
There are often ops in the input pipeline (such as parse_example) which don't have GPU implementations; I sometimes find it helpful to pin the whole input pipeline to the CPU using a with tf.device("/cpu:0"): block, as sketched below.
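Something along these lines (a sketch; serialized and feature_spec are hypothetical stand-ins for whatever your input pipeline actually feeds in):

import tensorflow as tf

serialized = tf.placeholder(tf.string, shape=[None])          # hypothetical batch of serialized tf.Examples
feature_spec = {"x": tf.FixedLenFeature([1], tf.float32)}     # hypothetical feature schema

# parse_example has no GPU kernel, so keep the whole input pipeline on the CPU
with tf.device("/cpu:0"):
    features = tf.parse_example(serialized, feature_spec)

# The model's math can still be placed on the GPU
w = tf.Variable(tf.random_normal([1, 10]))
logits = tf.matmul(features["x"], w)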
I understand that TensorFlow requires (for GPU computation) a GPU with NVIDIA Compute Capability >= 3.0. There are many such GPUs to choose from. The gaming-oriented GPUs, e.g. GeForce models, are much less expensive than the compute-oriented models, e.g. Tesla. My limited understanding is that the compute-oriented models may lack video output (not needed for computation) and that the gaming models may be doing 32-bit math instead of 64-bit. Assuming that TensorFlow uses (or prefers) 64-bit, does this mean that the gaming models will not work or will produce deficient results if used with TensorFlow? What attributes should one look for in choosing a GPU to use with TensorFlow?
The GPU-enabled version of TensorFlow has the following requirements:
64-bit Linux
Python 2.7
NVIDIA CUDA® 7.5 (CUDA 8.0 required for Pascal GPUs)
NVIDIA cuDNN v4.0 (minimum) or v5.1 (recommended)
TensorFlow GPU support requires having a GPU card with NVidia Compute Capability >= 3.0. Supported cards include but are not limited to:
NVidia Titan
NVidia Titan X
NVidia K20
NVidia K40
You can see the official docs: TensorFlow GPU support.
Gaming GPUs can work quite well. You want a very recent GPU with lots of memory and CUDA cores. Most people training neural nets these days on GPU use 32 bit floats.
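If you want to double-check at runtime that the card is visible and that you are indeed doing 32-bit math, a small sketch (assuming a modern TensorFlow 2.x install, where get_device_details is available) is:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print(gpus)
if gpus:
    # Reports details such as the card's compute capability
    print(tf.config.experimental.get_device_details(gpus[0]))

print(tf.keras.backend.floatx())   # 'float32' by default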