I have the TensorFlow 1.4 GPU version installed. It successfully detects my GPU and uses it while training and evaluating. I have a GeForce 1050 Ti with 4 GB of memory.
But I could not get GPU load higher than 12-15% (more usually 5-6%). Meanwhile, I get high CPU load and a pretty slow training process.
I tested many examples of different NNs (RNN, LSTM, CNN, GAN, etc.) with plain TensorFlow and with Keras using TF as the backend, but the result is the same.
I found that increasing the batch size helps load the GPU more, but the batch size also affects the training itself, so I can't increase it beyond certain limits.
So how can I use the GPU at maximum load and speed up the NN training?
If you are using Keras on Ubuntu, you can enable multiprocessing and increase the number of workers. If you use a batch generator, you can also increase the queue size, depending on how much system RAM you have.
model.fit_generator(..., max_queue_size=24, ..., workers=2, use_multiprocessing=True, ...)
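For context, here is a slightly fuller sketch of how those arguments fit together, assuming a keras.utils.Sequence-based generator; the model and data below are dummies, not taken from the question:
# Sketch only: a toy Sequence plus fit_generator call showing where the
# multiprocessing-related arguments go; the model and data are placeholders.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import Sequence

class BatchSequence(Sequence):
    """Serves (inputs, targets) batches; safe to use with worker processes."""
    def __init__(self, x, y, batch_size=64):
        self.x, self.y, self.batch_size = x, y, batch_size
    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))
    def __getitem__(self, idx):
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[s], self.y[s]

x, y = np.random.rand(1024, 32), np.random.rand(1024, 1)
model = Sequential([Dense(64, activation="relu", input_shape=(32,)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit_generator(
    BatchSequence(x, y, batch_size=64),
    epochs=2,
    max_queue_size=24,         # prepared batches buffered in host RAM
    workers=2,                 # CPU processes filling the queue in parallel
    use_multiprocessing=True,  # use processes instead of threads
)
Note that more workers only help if batch preparation (augmentation, decoding, etc.) is actually the bottleneck; otherwise the GPU stays underutilized for other reasons.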
Hello there,
I am trying to use DarkFlow, a Python implementation of YOLO (which uses TensorFlow as the backend), on my Nvidia Jetson Nano to detect objects. I got everything set up, but it refuses to train. I set it to GPU mode, and a line in the output says this:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 897MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
This is the last line it outputs before the training gets "Killed" without any further messages. Because it's a heavy convolutional NN, I think the reason is RAM over-consumption. I can only use this GPU in my Jetson Nano, so does anybody have a suggestion for how to lower the memory usage or otherwise solve the problem?
Thanks for the answers in advance!
You may try to decrease batch_size to 1 and lower the width/height values, but I would not recommend a training session on a Jetson Nano. Its limited capabilities (4 GB of shared RAM) hinder the learning process. To counter the limitations, you could try to follow this post or this one to increase the swap area, which acts as RAM, but I would still recommend using the Nano only for inference.
EDIT 1: It is also known that TensorFlow tends to try to allocate all available RAM, which gets the process killed by the OS. To solve this, you could use tf.GPUOptions to limit TensorFlow's memory usage.
Example:
import tensorflow as tf

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)
session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
We set per_process_gpu_memory_fraction to 0.4 because it is best practice not to let TensorFlow allocate more than half of the available memory (also because the memory is shared with the rest of the system).
Best of luck.
The README in Google's BERT repo says that even a single sentence of length 512 cannot fit in a 12 GB Titan X for the BERT-Large model.
But the BERT paper says 64 TPU chips are used to train BERT-Large with a maximum length of 512 and a batch size of 256. How could they fit a >256x larger batch into only ~171x more memory?
From another point of view, we can compare the two configurations on a memory-usage-per-sample basis:
TPU: Assuming TPUv3 is used in pre-training, the total TPU memory is 32 GB/chip * 64 chips = 2048 GB. According to the paper, a batch size of 256 with maximum length 512 works in this configuration, which means 8 GB of memory is enough to hold a single sample. Furthermore, memory usage per sample drops to only 4 GB if TPUv2 is used (the arithmetic is sketched after this list).
GPU: A 12 GB Titan X cannot hold even a single sample of length 512.
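A quick back-of-the-envelope check of those numbers (the per-chip memory sizes of 32 GB for TPUv3 and 16 GB for TPUv2 are assumed from the public TPU specs):
# Rough arithmetic only; chip memory sizes are assumed, not measured.
tpu_v3_total_gb = 32 * 64        # 2048 GB across 64 chips
tpu_v2_total_gb = 16 * 64        # 1024 GB across 64 chips
batch_size = 256
print(tpu_v3_total_gb / batch_size)   # 8.0 GB of TPU memory per sample
print(tpu_v2_total_gb / batch_size)   # 4.0 GB of TPU memory per sample
print(tpu_v3_total_gb / 12)           # ~171x the memory of a single 12 GB Titan X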
Why is memory consumption on GPUs much larger? Does this mean memory consumption on TPUs is optimized way better than that on GPUs?
This is probably due to the advanced compiler that comes with the TPU, which is optimized for TensorFlow ops. As the out-of-memory section of the BERT README says,
The major use of GPU/TPU memory during DNN training is caching the intermediate activations in the forward pass that are necessary for efficient computation in the backward pass.
However, when compiling for the TPU, a special XLA (a domain-specific compiler for linear algebra that optimizes TensorFlow computations) instruction called fusion
can merge multiple instructions from different TensorFlow operations into a single computation. The TensorFlow operation corresponding to the root instruction in the fusion is used as the namespace of the fusion operation.
On the other hand, running on the GPU with vanilla TensorFlow applies essentially none of these optimizations (or only very limited ones).
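For what it's worth, XLA JIT compilation can also be enabled on the GPU in TF 1.x through the session config; a minimal sketch (whether it actually closes the memory gap for BERT is another question):
# Sketch: turn on XLA JIT for the GPU so the same kind of op fusion is applied.
import tensorflow as tf

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    a = tf.random_normal([1024, 1024])
    b = tf.nn.relu(tf.matmul(a, a) + 1.0)   # matmul + add + relu are candidates for fusion
    sess.run(b)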
Recently I implemented a VGG-16 network using both TensorFlow and PyTorch; the data set is CIFAR-10. Each picture is 32 * 32 RGB.
I used a batch size of 64 at the beginning, and I found PyTorch using much less GPU memory than TensorFlow. Then I did some experiments and got the figure posted below.
After some research, I learned that TensorFlow uses the BFC algorithm to manage memory. That can explain why TensorFlow's memory use decreases or increases in steps of 2048, 1024, ... MB, and why the memory use sometimes does not increase when the batch size gets bigger.
But I am still confused why the memory use is lower at batch size 512 than at the smaller batch sizes 384, 448, etc. The same happens going from batch size 1024 to 1408, and from 2048 to 2688.
Here is my source code:
PyTorch: https://github.com/liupeng3425/tesorflow-vgg/blob/master/vgg-16-pytorch.py
Tensorflow: https://github.com/liupeng3425/tesorflow-vgg/blob/master/vgg-16.py
edit:
I have two Titan XP on my computer, OS: Linux Mint 18.2 64-bit.
I determine GPU memory usage with the command nvidia-smi.
My code runs on GPU1, which is defined in my code:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
And I am sure there is only one application using GPU1.
GPU memory usage can be determined from the per-application list shown below.
For example, in the screenshot posted below, the process name is /usr/bin/python3 and its GPU memory usage is 1563 MiB.
As noted in the comments, by default TensorFlow always takes up all of the memory on a GPU. I assume you have disabled that behaviour for this test, but it does show that the algorithms do not generally attempt to minimize the amount of memory that is reserved, even if it is not all used in the calculations.
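For reference, the usual way to turn that behaviour off in TF 1.x (assuming this is roughly what was done for these measurements):
# Sketch: stop TensorFlow 1.x from reserving all GPU memory up front, so that
# nvidia-smi shows something closer to what the graph actually needs.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # grow the allocation on demand
sess = tf.Session(config=config)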
To find the optimal configuration for your device and code, TensorFlow often runs (parts of) the first calculation multiple times. I suspect that this included settings for pre-loading data onto the GPU. This would mean that the numbers you see happen to be the optimal values for your device and configuration.
Since TensorFlow doesn't mind using more memory, 'optimal' here is measured by speed, not memory usage.
I am using an NVIDIA Tesla P40 to train a classification model. I used TensorFlow's bidirectional_dynamic_rnn to build a bi-LSTM network, and the training efficiency is poor: only about 30% of the computing resources are used, and the speed is no faster than using a CPU with 45 logical cores. Could someone give some advice on fully using the GPU's computing resources, or explain the reason?
First hint: try to increase the batch_size. It will increase the amount of data processed in parallel, therefore decreasing the relative cost of I/O.
Note that it will then require more GPU memory, so you have to tune it to avoid Out Of Memory errors.
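Besides the batch size, it can also help to overlap input preparation with computation; a minimal tf.data sketch (the arrays below are dummies standing in for your real features and labels):
# Sketch: batch and prefetch input data so the GPU does not sit idle waiting
# for the next batch; the arrays here are placeholders.
import numpy as np
import tensorflow as tf

features = np.random.rand(10000, 100).astype(np.float32)
labels = np.random.randint(0, 2, size=(10000,)).astype(np.int32)

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(buffer_size=10000)
           .batch(256)      # a larger batch keeps the GPU busier
           .prefetch(1))    # prepare the next batch while the current one trains
iterator = dataset.make_one_shot_iterator()
next_features, next_labels = iterator.get_next()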
I am using an Nvidia DIGITS Box with a GPU (Nvidia GeForce GTX Titan X) and TensorFlow 0.6 to train a neural network, and everything works. However, when I check the Volatile GPU-Util using nvidia-smi -l 1, I notice that it's only 6%, and I think most of the computation is on the CPU, since the process that runs TensorFlow has about 90% CPU usage. As a result, the training process is very slow. I wonder if there are ways to make full use of the GPU instead of the CPU to speed up training. Thanks!
I suspect you have a bottleneck somewhere (like in this GitHub issue) -- some operation doesn't have a GPU implementation, so it's placed on the CPU, and the GPU is idling because of data transfers. For instance, until recently reduce_mean was not implemented on the GPU, and before that Rank was not implemented on the GPU even though it was implicitly used by many ops.
At one point, I saw a network from fully_connected_preloaded.py being slow because a Rank op got placed on the CPU, triggering a transfer of the entire dataset from GPU to CPU at each step.
To solve this, I would first recommend upgrading to 0.8, since it has a few more ops implemented for the GPU (reduce_prod for integer inputs, reduce_mean, and others).
Then you can create your session with log_device_placement=True and see if there are any ops placed on CPU or GPU that would cause excessive transfers per step.
There are often ops in the input pipeline (such as parse_example) which don't have GPU implementations. I sometimes find it helpful to pin the whole input pipeline to the CPU using a with tf.device("/cpu:0"): block.
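A minimal sketch combining both suggestions, written in TF 1.x syntax (the ops inside the device block are placeholders for a real input pipeline):
# Sketch: log where every op is placed, and pin the input pipeline to the CPU
# so it never drags tensors back and forth on each step.
import tensorflow as tf

with tf.device("/cpu:0"):
    # file readers, parse_example, decoding, etc. would go here
    x = tf.random_normal([128, 784])

w = tf.Variable(tf.zeros([784, 10]))
y = tf.matmul(x, w)   # model ops land on the GPU when one is available

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y)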