Difference between GPU_0_bfc allocator and GPU_host_bfc allocator in TensorFlow Timeline - tensorflow

When I try to profile the memory usage of model training in TensorFlow, I found two relevant pieces of information collected by the TensorFlow Timeline tool, GPU_0_bfc and GPU_host_bfc (see the following figure). I am wondering which one reflects the actual memory usage, and what the difference between them is. Thanks.
Sample TensorFlow Timeline Profiling Result

It seems that GPU_host_bfc tracks memory allocated on the host (i.e. the RAM attached to the CPU), while GPU_0_bfc tracks memory allocated on GPU device 0 itself. The host allocation is usually less than the physically (and/or through swap) available memory.
Further reading here.

Related

System memory usage increase after using GPU?

When I use pytorch/tensorflow to build my neural network, I find that system memory usage increases when I use the GPU.
The example code is pretty simple: a small Conv1d neural network in pytorch.
When I set use_gpu=False, the system memory usage is just 131M. But when I set use_gpu=True, system memory usage increases to more than 2G.
As far as I know, when using the GPU, many parameters are pinned into the GPU's memory, which I thought meant the system memory usage would decrease.
Could anyone help explain this situation?

Can tensorflow consume GPU memory exactly equal to required

I use the TensorFlow C++ API to do CNN inference. I already call set_allow_growth(true), but it still consumes more GPU memory than it strictly needs.
set_per_process_gpu_memory_fraction can only set an upper bound on the GPU memory, but different CNN models have different upper bounds. Is there a good way to solve this problem?
Unfortunately, there's no such flag to use out-of-the-box, but this could be done (manually):
By default, TF allocates all the available GPU memory. Setting set_allow_growth to true causes TF to allocate the needed memory in chunks instead of allocating all GPU memory at once. Every time TF requires more GPU memory than is already allocated, it allocates another chunk.
In addition, as you mentioned, TF supports set_per_process_gpu_memory_fraction, which specifies the maximum GPU memory the process can use, as a fraction of the total GPU memory. This results in out-of-memory (OOM) exceptions if TF requires more GPU memory than allowed.
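For reference, a minimal Python sketch of the same two options (the question uses the C++ API; the TF 1.x session API is assumed here):
import tensorflow as tf
gpu_options = tf.GPUOptions(
    allow_growth=True,                    # grow the allocation in chunks as needed
    per_process_gpu_memory_fraction=0.4)  # cap this process at 40% of the GPU memory
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))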
Unfortunately, I think the chunk size cannot be set by the user and is hard-coded in TF (for some reason I think the chunk size is 4GB but I'm not sure).
This lets you specify the maximum amount of GPU memory that you allow TF to use, as a fraction of the total. If you know how much GPU memory you have in total (which can be retrieved with nvidia-smi) and how much memory you want to allow, you can compute the fraction and pass it to TF.
If you run a small number of neural networks, you can find the required GPU memory for each of them by running it with different amounts of allowed GPU memory, like a binary search, and seeing what the minimum fraction is that lets the NN run. Then, setting the value you found as the set_per_process_gpu_memory_fraction for each NN will achieve what you want.
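A rough sketch of such a search, assuming a hypothetical helper fits(fraction) that launches the network (e.g. in a subprocess) with the given per-process memory fraction and returns whether it ran without an OOM error:
def find_min_fraction(fits, lo=0.0, hi=1.0, tol=0.02):
    # Binary search for the smallest memory fraction that still lets the network run.
    # fits(fraction) is a user-supplied, hypothetical runner returning True on success.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if fits(mid):
            hi = mid   # it fits, try a smaller fraction
        else:
            lo = mid   # OOM, need more memory
    return hi          # smallest fraction known to work, within tol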

MXNet - How to Prevent Full Memory Allocation

Is there any way to prevent full GPU memory allocation for MXNet? So that it only allocates what it needs and not the whole GPU memory.
I want to use another model in Tensorflow/Keras on the same GPU alongside MXNet and it seems that the whole memory gets reserved by MXNet.
MXNet allocates memory as needed. Perhaps there is a memory leak in your program, or TensorFlow is trying to pre-allocate the memory of the entire GPU, which is its default behavior. That behavior is configurable with tf.GPUOptions; see the links on how to use those options.
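For example, a minimal sketch (assuming the TF 1.x session API) that stops TensorFlow from reserving the whole GPU, leaving the rest for MXNet:
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate only what TF actually needs
sess = tf.Session(config=config)
# With standalone Keras, this session can be registered via keras.backend.set_session(sess).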
Hope that helps,
Vishaal

Training on multi-GPUs with a small batch size

I am running TensorFlow on a machine which has two GPUs, each with 3 GB memory. My batch size is only 2GB, and so can fit on one GPU. Is there any point in training with both GPUs (using CUDA_VISIBLE_DEVICES)? If I did, how would TensorFlow distribute the training?
With regards to memory: I assume that you mean that one data batch is 2GB. However, Tensorflow also requires memory to store variables as well as hidden layer results etc. (to compute gradients). For this reason it also depends on your specific model whether or not the memory will be enough. Your best bet would be to just try with one GPU and see if the program crashes due to memory errors.
With regards to distribution: Tensorflow doesn't do this automatically at all. Each op is placed on some device. By default, if you have any number of GPUs available, all GPU-compatible ops will be placed on the first GPU and the rest on the CPU. This is despite Tensorflow reserving all memory on all GPUs by default.
You should have a look at the GPU guide on the Tensorflow website. The most important thing is that you can use the with tf.device context manager to place ops on other GPUs. Using this, the idea would be to split your batch into X chunks (X = number of GPUs) and define your model on each device, each time taking the respective chunk as input and making sure to reuse variables.
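A minimal sketch of that idea (TF 1.x), where build_model is a hypothetical function that builds the network for a given input chunk (using tf.get_variable internally) and returns its loss; the input shape is made up:
import tensorflow as tf
num_gpus = 2
batch_inputs = tf.placeholder(tf.float32, shape=[64, 224, 224, 3])  # made-up shape
chunks = tf.split(batch_inputs, num_gpus)   # one chunk per GPU
tower_losses = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        loss = build_model(chunks[i])               # hypothetical model-building function
        tower_losses.append(loss)
        tf.get_variable_scope().reuse_variables()   # share weights across towers
total_loss = tf.reduce_mean(tower_losses)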
If you are using tf.Estimator, there is some information in this question. It is very easy to do distributed execution here using just two simple wrappers, but I personally haven't been able to use it successfully (pretty slow and crashes randomly with a segfault).

TensorFlow: How to measure how much GPU memory each tensor takes?

I'm currently implementing YOLO in TensorFlow and I'm a little surprised at how much memory it is taking. On my GPU I can train YOLO using their Darknet framework with batch size 64. In TensorFlow I can only do it with batch size 6; with 8 I already run out of memory. For the test phase I can run with batch size 64 without running out of memory.
I am wondering how I can calculate how much memory is being consumed by each tensor. Are all tensors stored on the GPU by default? Can I simply calculate the total memory consumption as the shape * 32 bits (see the rough sketch after this question)?
I noticed that since I'm using momentum, all my tensors also have a /Momentum tensor. Could that also be using a lot of memory?
I am augmenting my dataset with a method distorted_inputs, very similar to the one defined in the CIFAR-10 tutorial. Could it be that this part is occupying a huge chunk of memory? I believe Darknet does these modifications on the CPU.
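As a rough sanity check of the shape * 32 bits idea, assuming float32 tensors (4 bytes per element); the example shape is made up:
import numpy as np
def tensor_bytes(shape, bytes_per_element=4):
    # Rough estimate: number of elements times element size (4 bytes for float32).
    return int(np.prod(shape)) * bytes_per_element
# e.g. a feature map of shape [64, 112, 112, 64] at batch size 64:
print(tensor_bytes([64, 112, 112, 64]) / 1024.0**2, 'MB')   # ~196 MB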
Now that issue 1258 has been closed, you can enable memory logging in Python by setting an environment variable before importing TensorFlow:
import os
os.environ['TF_CPP_MIN_VLOG_LEVEL']='3'
import tensorflow as tf
There will be a lot of logging as a result of this. You'll want to grep the results to find the appropriate lines. For example:
grep MemoryLogTensorAllocation train.log
Sorry for the slow reply. Unfortunately right now the only way to set the log level is to edit tensorflow/core/platform/logging.h and recompile with e.g.
#define VLOG_IS_ON(lvl) ((lvl) <= 1)
There is an open bug, 1258, to control logging more elegantly.
MemoryLogTensorOutput entries are logged at the end of each Op execution, and indicate the tensors that hold the outputs of the Op. It's useful to know these tensors since the memory is not released until the downstream Op consumes the tensors, which may be much later on in a large graph.
See the description in this commit.
The raw memory allocation info is there, although it needs a script to collect the information in an easy-to-read form.
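For instance, a rough sketch of such a script, assuming each MemoryLogTensorAllocation record is dumped on a single log line with allocator_name and allocated_bytes fields (the exact log format varies between TF versions, so the regexes here are assumptions):
import re
import sys
from collections import defaultdict
# Sum allocated bytes per allocator from a TF VLOG dump, e.g.: python sum_allocs.py train.log
bytes_re = re.compile(r'allocated_bytes:\s*(\d+)')
name_re = re.compile(r'allocator_name:\s*"([^"]+)"')
totals = defaultdict(int)
with open(sys.argv[1]) as f:
    for line in f:
        if 'MemoryLogTensorAllocation' not in line:
            continue
        size = bytes_re.search(line)
        name = name_re.search(line)
        if size:
            totals[name.group(1) if name else 'unknown'] += int(size.group(1))
for allocator, total in sorted(totals.items()):
    print('%s: %.1f MB' % (allocator, total / 1024.0**2))
Note that this sums every allocation rather than tracking peak usage, but it is usually enough to see which allocators (e.g. GPU_0_bfc vs GPU_host_bfc) dominate.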