We aim to reduce CPU usage during inference by moving work to the GPU. However, when we run on the GPU, CPU usage is still high. Using the timeline output from the RunMetadata and RunOptions classes, we obtained the following profile:
CPU and GPU Activity during inference
The 6th row is the CPU usage, the 8th is the GPU. The blocks with op "Unknown" have the same names as the GPU blocks. It is not a back-and-forth host-to-device copy problem, as we can see from the first 4 rows.
What we see is that the CPU does something while the GPU works. I'd like to understand why this happens. Does anyone have a clue?
We are using Tensorflow-GPU 1.14 (C++ interface) on a Jetson TX2, JetPack 4.2.3, CUDA 10.2, compute capability 6.2.
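For context, the trace was collected with the RunOptions/RunMetadata tracing mechanism; a rough Python equivalent of what we do through the C++ interface (toy graph, purely illustrative) is:

import tensorflow as tf
from tensorflow.python.client import timeline

# Ask the runtime to record per-op, per-device (CPU/GPU) timings for one step.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

a = tf.random_normal([1000, 1000])   # toy graph standing in for the real model
b = tf.matmul(a, a)

with tf.Session() as sess:
    sess.run(b, options=run_options, run_metadata=run_metadata)

# Dump a Chrome-trace timeline; chrome://tracing renders it as rows like the ones above.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())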
Regards
Related
I have one of those gaming laptops with both an integrated GPU and a dedicated GPU (NVIDIA GeForce RTX 3070).
I was getting very slow speeds training neural networks on TensorFlow, many times slower than another laptop with vastly inferior CPU and GPU specs.
I think the slowness is because TensorFlow is running on the dedicated GPU: when I disable the dedicated GPU, training speeds up by roughly a factor of 10. These are huge differences, an order of magnitude.
I know the kernel is running on the dedicated GPU by default because when I disable the dedicated GPU in the middle of the session, the kernel dies.
Therefore, I think disabling the dedicated GPU has forced it to run on the CPU (AMD Ryzen 9 5900HX), which should be better.
I'm running this on Anaconda using Jupyter Notebook.
How do I force it to use my CPU instead of my GPU?
Edit: This seems to be a complicated issue. Some more information.
With dedicated GPU disabled, when training, according to the task manager the GPU usage is 0% (as expected) and the CPU usage is 40%.
But with dedicated GPU enabled, when training, GPU usage is about 10% and CPU usage is about 20%. This is 10 times slower than the above. Why is it using both, but less CPU?
With dedicated GPU enabled (i.e. the normal situation), according to the task manager, scikit-learn uses the CPU not the GPU. So this problem is specific to tensorflow.
Killing the dedicated GPU in the middle of a session crashes not only the kernel but also prevents Jupyter Notebook from opening.
Forcing Anaconda and Jupyter Notebook to use the integrated GPU instead of the dedicated GPU in the Windows Settings doesn't fix the problem; it still uses the dedicated GPU.
Just tell TensorFlow to do so:

with tf.device("/CPU:0"):  # device name might vary
    model.fit(...)
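If you are on TF 2.x, another option is to hide the GPU from TensorFlow entirely before any ops are created, so everything falls back to the CPU without per-call device scopes (a minimal sketch):

import tensorflow as tf

# Hide all GPUs from this process; must run before any model or op is built.
tf.config.set_visible_devices([], "GPU")
print(tf.config.get_visible_devices())   # should now list only CPU devices

Setting the CUDA_VISIBLE_DEVICES environment variable to -1 before importing tensorflow has a similar effect.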
I am trying to do a matrix decomposition (or a Tucker decomposition of a tensor) in TensorFlow with a GPU. I have tensorflow-gpu, and my NVIDIA GPU has 4GB of RAM. My problem is that my input matrix is huge, with millions of rows and millions of columns, and it takes more than 5GB in memory, so TensorFlow always gives me an out-of-memory (OOM) error. (If I turn off the GPU, the whole process runs successfully on the CPU using system RAM; of course, it is slow.)
I did some research on TensorFlow and on the NVIDIA CUDA libraries. CUDA seems to have a "unified memory" mechanism in which system RAM and GPU RAM share one address space, but I could not find further details.
I wonder whether TensorFlow supports some memory-sharing mechanism so that I can keep the input in system RAM while the GPU does the calculation piece by piece (since I still want the GPU to accelerate the computation).
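For reference, TensorFlow 1.x appears to expose an experimental unified-memory path through GPUOptions: setting the per-process memory fraction above 1.0 is supposed to let GPU allocations spill into system RAM. A hedged, untested sketch:

import tensorflow as tf

# Experimental: a fraction > 1.0 asks the GPU allocator to use CUDA unified
# memory, so allocations larger than the 4GB card can spill into system RAM
# (expect it to be much slower than fitting entirely in device memory).
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=2.0)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))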
I have a GTX 1050 Ti (4GB), an i5 CPU, and 8GB of memory.
I successfully installed tensorflow-gpu with the CUDA driver on Windows 10, and the test shows that TensorFlow is actually using the GPU (snapshot):
However, when training a CNN, the GPU memory usage is always 100% while the GPU load stays near 0 with some spikes at 30%~70%:
Is it normal ?
EDIT: While the GPU load is almost 0 with spikes, the CPU load is fixed at 100% during training.
EDIT2: I did read somewhere that CPU usage could be high while GPU usage is low if there is a lot of data copying between CPU and GPU. But I am using the official TensorFlow object detection API for training, so I have no idea where in the code that might happen.
What you see is normal behavior in most cases.
TensorFlow reserves the entire GPU memory up front by default.
The load on the GPU is dependent upon the data it is getting for processing.
If the data loading operation is slow, then most of the time the GPU is waiting for data to be copied from disk to the GPU, and during that time it performs no work. That is what you see in your screenshot.
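As an illustration (a toy tf.data sketch; the data here is random placeholder tensors, not your detection pipeline), overlapping input preparation with training usually raises GPU utilization:

import tensorflow as tf

# Toy in-memory data; the key part is .prefetch(), which lets the CPU prepare
# the next batch while the GPU is still busy with the current one.
features = tf.random.uniform([1024, 64])
labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(1024)
           .batch(32)
           .prefetch(tf.data.experimental.AUTOTUNE))

If GPU utilization is still spiky after that, the bottleneck is usually disk I/O or the preprocessing itself.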
I have installed tensorflow-gpu on Linux Mint 18. My graphics card is a GT 740M. The deviceQuery and bandwidthTest scripts for CUDA and the MNIST sample for cuDNN all pass (referenced here and here).
TensorFlow does use the GPU (e.g. following these instructions works, and the GPU's memory and processing utilization increase when running programs), but the performance is rather… mediocre.
For instance, running the script shown on this site, the GPU is only about twice as fast as the CPU. Certainly a nice improvement, but not "really, really fast", as stated on the site. Another example: using VGG16 with Keras to classify 100 images, each about 300x200 pixels, takes around 30 seconds.
Is there anything I might do to increase the performance, or can I not expect anything better?
I am using an Nvidia DIGITS box with a GPU (NVIDIA GeForce GTX Titan X) and TensorFlow 0.6 to train a neural network, and everything works. However, when I check the volatile GPU utilization using nvidia-smi -l 1, I notice that it's only 6%, and I think most of the computation is on the CPU, since the process running TensorFlow has about 90% CPU usage. As a result, the training process is very slow. I wonder if there are ways to make full use of the GPU instead of the CPU to speed up training. Thanks!
I suspect you have a bottleneck somewhere (like in this GitHub issue) -- you have some operation which doesn't have a GPU implementation, so it's placed on the CPU, and the GPU is idling because of data transfers. For instance, until recently reduce_mean was not implemented on the GPU, and before that Rank was not implemented on the GPU while being implicitly used by many ops.
At one point, I saw a network from fully_connected_preloaded.py being slow because there was a Rank op placed on the CPU, triggering a transfer of the entire dataset from GPU to CPU at each step.
To solve this I would first recommend upgrading to 0.8 since it had a few more ops implemented for GPU (reduce_prod for integer inputs, reduce_mean and others).
Then you can create your session with log_device_placement=True and see if there are any ops placed on CPU or GPU that would cause excessive transfers per step.
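For example, a minimal TF 1.x sketch:

import tensorflow as tf

# Print the device assigned to every op at placement time; CPU-placed ops in an
# otherwise GPU-resident graph are candidates for the transfer bottleneck.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(tf.matmul(tf.ones([2, 2]), tf.ones([2, 2])))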
There are often ops in the input pipeline (such as parse_example) which don't have GPU implementations. I sometimes find it helpful to pin the whole input pipeline to the CPU using a with tf.device("/cpu:0"): block.
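A rough sketch of that pattern (queue-based input from that TF era; the file name and feature spec below are made up for illustration):

import tensorflow as tf

# Keep the whole input pipeline on the host: parse_example and friends have no
# GPU kernels, so pinning them avoids implicit per-step GPU<->CPU transfers.
with tf.device("/cpu:0"):
    filename_queue = tf.train.string_input_producer(["train.tfrecords"])  # placeholder path
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    features = tf.parse_example(
        tf.expand_dims(serialized, 0),
        {"x": tf.FixedLenFeature([10], tf.float32)})  # illustrative feature spec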