How to lower RAM consumption in Tensorflow? - tensorflow

Hello there,
I am trying to use DarkFlow, a Python implementation of YOLO (which uses Tensorflow as backend), on my Nvidia Jetson Nano to detect objects. I got all the setup and stuff, but it doesn't want to train. I set it to GPU mode and a line in the output says this:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 897MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
This is the last line it outputs before the training gets "Killed" without any further messages. Because it's a heavy convolutional NN, I think the reason is RAM over-consumption. I can only use this GPU in my Jetson Nano, so does anybody have a suggestion for how to lower the memory usage or otherwise solve the problem?
Thanks for the answers in advance!

You may try decreasing batch_size to 1 and lowering the width and height values, but I would not recommend running a training session on the Jetson Nano. Its limited capabilities (4 GB of shared RAM) hinder the learning process. To work around the limitations you could follow this post or this one to increase the swap area, which acts as RAM, but I would still recommend using the Nano only for inference.
EDIT 1: It is also known that Tensorflow has a tendency to try to allocate all available RAM, which makes the process get killed by the OS. To solve the issue you could use tf.GPUOptions to limit Tensorflow's RAM usage.
Example:
import tensorflow as tf

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)
session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
We chose per_process_gpu_memory_fraction=0.4 because it is best practice not to let Tensorflow allocate more than half of the available RAM (also because the memory on the Nano is shared with the CPU).
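If you are on Tensorflow 2.x, where tf.GPUOptions and tf.Session no longer exist, a rough equivalent is to cap GPU memory with a logical device configuration. This is just a sketch; the 1024 MB cap below is an assumption you should tune for your board:
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Cap Tensorflow to ~1 GB of the Nano's shared memory (adjust as needed)
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])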
Best of luck.

Related

My tensorflow defaults to using my GPU instead of CPU, which is like 10 times slower. How do I fix this and make it use the CPU?

I have one of those gaming laptops with an integrated GPU and a dedicated GPU (NVIDIA GeForce RTX 3070).
I was getting very slow speeds training neural networks on tensorflow. Many, many times slower than another laptop with vastly inferior specs in CPU and GPU.
I think the reason for this slowness is that tensorflow is probably running on the dedicated GPU, because when I disable the dedicated GPU, training speeds up by about 10 times. These are huge differences, an order of magnitude.
I know the kernel is running on the dedicated GPU by default because when I disable the dedicated GPU in the middle of the session, the kernel dies.
Therefore, I think disabling the dedicated GPU has forced it to run on the CPU (AMD Ryzen 9 5900HX), which should be better.
I'm running this on Anaconda using Jupyter Notebook.
How do I force it to use my CPU instead of my GPU?
Edit: This seems to be a complicated issue. Some more information.
With dedicated GPU disabled, when training, according to the task manager the GPU usage is 0% (as expected) and the CPU usage is 40%.
But with dedicated GPU enabled, when training, GPU usage is about 10% and CPU usage is about 20%. This is 10 times slower than the above. Why is it using both, but less CPU?
With dedicated GPU enabled (i.e. the normal situation), according to the task manager, scikit-learn uses the CPU not the GPU. So this problem is specific to tensorflow.
Killing the dedicated GPU in the middle of the session crashes not only the kernel but also prevents Jupyter Notebook from opening.
Forcing Anaconda and Jupyter Notebook to use the integrated GPU instead of the dedicated GPU in the Windows Settings doesn't fix the problem. It's still using the dedicated GPU.
Just tell tensorflow to do so:
with tf.device("/CPU:0"):  # device name might vary
    model.fit(...)
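If you would rather not wrap every call in tf.device, another option (a sketch, assuming Tensorflow 2.x) is to hide the GPU from Tensorflow entirely at the very start of the notebook, before any op or model is created:
import tensorflow as tf

# Hide all GPUs so every op falls back to the CPU; run this first thing in the notebook
tf.config.set_visible_devices([], 'GPU')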

Running the same detection model on different GPUs

I recently ran into a bit of a glitch where my detection model, running on two different GPUs (a Quadro RTX4000 and an RTX A4000) on two different systems, utilizes the GPU differently.
The model uses only 0.2% of the GPU on the Quadro system and anywhere from 50 to 70% on the A4000 machine. I am curious about why this is happening. The rest of the hardware on both machines is the same.
Additional information: The model uses a 3D convolution and is built on tensorflow.
It looks like the model on the Quadro RTX4000 machine is not actually using the GPU.
The method tf.test.is_gpu_available() is deprecated and can return True even when the GPU is not actually being used.
The correct way to verify GPU availability and usage is to check the output of this snippet:
tf.config.list_physical_devices('GPU')
On the Quadro machine you should also run (in terminal):
watch -n 1 nvidia-smi
to see in real time how much GPU memory is being used.
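On the Tensorflow side, a quick way to check both availability and actual placement (a sketch, assuming Tensorflow 2.x) is:
import tensorflow as tf

# Should print at least one PhysicalDevice entry if the GPU is visible
print(tf.config.list_physical_devices('GPU'))

# Log the device each op is placed on, so silent CPU fallbacks become visible
tf.debugging.set_log_device_placement(True)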

How to make Tensorflow load GPU higher?

I have the Tensorflow 1.4 GPU version installed. It successfully detects my GPU and uses it while training and evaluating. I have a GeForce 1050 Ti with 4 GB of memory.
But I could not get GPU load higher than 12-15% (more usually 5-6%). Meanwhile I get high CPU load and a pretty slow training process.
I tested many different examples of different NNs (RNN, LSTM, CNN, GAN, etc.) with plain Tensorflow and with Keras using TF as the backend, but the result is the same.
I found that increasing the batch size helps load the GPU more, but batch size also affects training itself, so I can't increase it beyond a certain limit.
So how can I use the GPU at maximum load and speed up NN training?
If you are using Keras on Ubuntu, you can use multiprocessing and increase the number of workers. If you use a batch generator, you can also increase the queue size limit, depending on how much system RAM you have.
model.fit_generator(..., max_queue_size = 24, ..., workers = 2, use_multiprocessing = True, ...)
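For context, here is a fuller sketch of what that call might look like; the ArraySequence class and the x_train/y_train arrays are hypothetical stand-ins for your own data pipeline, and model is assumed to be an already compiled Keras model:
from tensorflow import keras
import numpy as np

# Hypothetical Sequence that serves batches from arrays already in memory
class ArraySequence(keras.utils.Sequence):
    def __init__(self, x, y, batch_size=128):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[s], self.y[s]

train_seq = ArraySequence(x_train, y_train, batch_size=128)

model.fit_generator(
    train_seq,
    epochs=10,
    max_queue_size=24,         # prefetch more batches so the GPU is not starved
    workers=4,                 # parallel data-loading workers
    use_multiprocessing=True)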

Resource Exhausted OOM while loading VGG16

I apologize in advance if this issue seems too basic, but I am new to Tensorflow and appreciate any help.
I find that I frequently have to reboot my computer to be able to load models such as VGG16 from keras.applications. I have a fairly high-end machine with 4 GeForce GTX 1080 Ti GPUs and an Intel® Core™ i7-6850K CPU @ 3.60GHz × 12, and I use it only for Tensorflow (through Keras).
As soon as I reboot I will be able to successfully load models (such as VGG16) and train on large training datasets. But, if I let my computer sit idle for a while and rerun the same program, I will get a resource exhausted message (OOM) which can be fixed by rebooting my computer again. It is extremely frustrating to keep rebooting my computer every couple of hours. Does anyone know what's going on and how to solve this issue?
If you have a batch size > 1, try using a lower batch size, which could lower the memory requirements for the GPU.
Also, when you are done working with the network, check with nvidia-smi whether the GPU memory was released. If it was not, kill the process which loaded the network (usually some Python interpreter).
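If the memory is being grabbed up front rather than genuinely leaked, one thing worth trying before a reboot (a sketch, assuming Tensorflow 2.x; on 1.x the allow_growth session option plays the same role) is enabling memory growth so Tensorflow only claims GPU memory as it actually needs it:
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at startup
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)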

Tensorflow 0.6 GPU Issue

I am using an Nvidia Digits Box with a GPU (Nvidia GeForce GTX Titan X) and Tensorflow 0.6 to train a neural network, and everything works. However, when I check the volatile GPU utilization using nvidia-smi -l 1, I notice that it's only 6%, and I think most of the computation is on the CPU, since the process running Tensorflow has about 90% CPU usage. The result is that the training process is very slow. I wonder if there are ways to make full use of the GPU instead of the CPU to speed up the training process. Thanks!
I suspect you have a bottleneck somewhere (like in this github issue) -- you have some operation which doesn't have GPU implementation, so it's placed on CPU, and the GPU is idling because of data transfers. For instance, until recently reduce_mean was not implemented on GPU, and before that Rank was not implemented on GPU, and it was implicitly being used by many ops.
At one point, I saw a network from fully_connected_preloaded.py being slow because there was a Rank op that got placed on CPU, and hence triggering the transfer of entire dataset from GPU to CPU at each step.
To solve this I would first recommend upgrading to 0.8 since it had a few more ops implemented for GPU (reduce_prod for integer inputs, reduce_mean and others).
Then you can create your session with log_device_placement=True and see if there are any ops placed on CPU or GPU that would cause excessive transfers per step.
There are often ops in the input pipeline (such as parse_example) which don't have GPU implementations. I sometimes find it helpful to pin the whole input pipeline to the CPU using a with tf.device("/cpu:0"): block.
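Putting both suggestions together, a minimal sketch in the TF 1.x-era API the question uses (the TFRecord filename and reader setup are placeholders, not part of the original question):
import tensorflow as tf

# Log where every op is placed so CPU-only ops (and the transfers they trigger) stand out
config = tf.ConfigProto(log_device_placement=True)

# Keep the input pipeline on the CPU; ops like parse_example have no GPU kernel
with tf.device("/cpu:0"):
    filename_queue = tf.train.string_input_producer(["train.tfrecords"])  # placeholder file
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

sess = tf.Session(config=config)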