MXNet - How to Prevent Full Memory Allocation

Is there any way to prevent MXNet from allocating the full GPU memory, so that it only allocates what it needs rather than the whole GPU?
I want to run another model in TensorFlow/Keras on the same GPU alongside MXNet, but it seems that MXNet reserves the entire GPU memory.

MXNet allocates GPU memory as needed. Perhaps there is a memory leak in your program, or TensorFlow is trying to pre-allocate memory on the entire GPU, which is its default behavior. That behavior is configurable with tf.GPUOptions; see the links for how to use those options.
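For reference, here is a minimal sketch of how TensorFlow's pre-allocation can be reined in from Python (assuming the TF 1.x session API; the fraction below is just an example value):

```python
import tensorflow as tf

# Option 1: let TF grow its allocation on demand instead of grabbing the whole GPU.
gpu_options = tf.GPUOptions(allow_growth=True)

# Option 2: cap TF at a fixed share of the GPU (example value, adjust to your setup).
# gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
```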
Hope that helps,
Vishaal

Related

Difference between GPU_0_bfc allocator and GPU_host_bfc allocator in TensorFlow Timeline

When I try to profile the memory usage of model training in TensorFlow, I found two relevant pieces of information collected by the TensorFlow Timeline tool, GPU_0_bfc and GPU_host_bfc (see the figure below). I am wondering which one reflects the actual memory usage, and what is the difference between them? Thanks.
[Figure: sample TensorFlow Timeline profiling result]
It seems that GPU_host_bfc tracks memory allocated on the host (i.e. the RAM next to the CPU), whereas GPU_0_bfc tracks memory allocated on the GPU device itself. The host usage is usually well below the physically (and/or via swap) available memory.
Further reading here.
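For context, here is a sketch of how such a timeline is typically produced with the TF 1.x Python API (the toy graph is just a stand-in for a real training step):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# A trivial graph so the snippet is self-contained; substitute your training op.
x = tf.random_normal([1024, 1024])
y = tf.matmul(x, x)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(y, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace; with show_memory=True it includes the per-allocator
# counters (GPU_0_bfc, GPU_host_bfc, ...) shown in the question.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format(show_memory=True))
```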

System memory usage increase after using GPU?

When I use PyTorch/TensorFlow to build my neural network, I find that system memory usage increases when I use the GPU.
The example code is pretty simple: a small Conv1d neural network in PyTorch.
When I set use_gpu=False, system memory usage is just 131M. But when I set use_gpu=True, system memory usage increases to more than 2G.
As far as I know, when using the GPU, many parameters are pinned in the GPU's memory, which should mean that system memory usage decreases.
Could anyone help to explain the situation?
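The asker's code is not shown, but a hypothetical sketch of the kind of setup being described (a tiny Conv1d model with a use_gpu toggle) might look like this:

```python
import torch
import torch.nn as nn

use_gpu = True  # flipping this flag is what changes the reported system memory usage

# Hypothetical small Conv1d network, standing in for the asker's model.
model = nn.Sequential(
    nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3),
    nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=3),
)

device = torch.device("cuda" if use_gpu and torch.cuda.is_available() else "cpu")
model = model.to(device)

x = torch.randn(4, 8, 128, device=device)  # (batch, channels, length)
out = model(x)
print(out.shape, "on", device)
```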

Can tensorflow consume GPU memory exactly equal to what is required?

I use the TensorFlow C++ API to do CNN inference. I already call set_allow_growth(true), but it still consumes more GPU memory than it actually needs.
set_per_process_gpu_memory_fraction can only set an upper bound on the GPU memory, and different CNN models have different upper bounds. Is there a good way to solve this problem?
Unfortunately, there's no such flag to use out-of-the-box, but this could be done (manually):
By default, TF allocates all the available GPU memory. Setting set_allow_growth to true causes TF to allocate the needed memory in chunks instead of grabbing all GPU memory at once; every time TF requires more GPU memory than is already allocated, it allocates another chunk.
In addition, as you mentioned, TF supports set_per_process_gpu_memory_fraction, which specifies the maximum GPU memory the process may use, as a fraction of the total GPU memory. This results in out-of-memory (OOM) exceptions if TF requires more GPU memory than allowed.
Unfortunately, I think the chunk size cannot be set by the user and is hard-coded in TF (for some reason I think the chunk size is 4GB, but I'm not sure).
Together, this lets you specify the maximum amount of GPU memory you allow TF to use, as a fraction of the total. If you know how much GPU memory you have in total (it can be retrieved with nvidia-smi) and how much memory you want to allow, you can compute the fraction and pass it to TF.
If you only run a small number of neural networks, you can find the required GPU memory for each of them by running it with different allowed fractions, binary-search style, to see the minimum fraction that still lets the network run. Setting the value you found as set_per_process_gpu_memory_fraction for each network will then achieve what you wanted.
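As an illustration of the last two points, here is a small sketch of turning a memory budget into that fraction (shown with the TF 1.x Python ConfigProto; the C++ setters named in the question are the protobuf equivalents, and the numbers are example values, not measurements):

```python
import tensorflow as tf

# Example values, not measurements: total memory reported by `nvidia-smi`
# and the budget you settled on (e.g. the minimum found by the binary search above).
total_mib = 11178
budget_mib = 3000

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate in chunks on demand
config.gpu_options.per_process_gpu_memory_fraction = budget_mib / float(total_mib)  # hard cap

sess = tf.Session(config=config)
```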

How can I monitor GPU memory allocated by Tensorflow Serving

When I use the nvidia-smi command I can see in the processes section how much memory is allocated by Tensorflow Serving. Is there a way to check how much is actually used by the loaded models? How can I know whether there is still enough free memory to load the next model?
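As a starting point, per-process GPU memory can at least be polled through nvidia-smi; a minimal sketch follows (note that this reports memory allocated by the serving process, which under TF's default pre-allocation is not the same as the memory the loaded models actually use):

```python
import subprocess

# Ask the driver for per-process GPU memory usage (MiB per process).
out = subprocess.check_output([
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]).decode()

for line in out.strip().splitlines():
    pid, name, used_mib = [field.strip() for field in line.split(",")]
    print("pid=%s  process=%s  used=%s MiB" % (pid, name, used_mib))
```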

Training on multi-GPUs with a small batch size

I am running TensorFlow on a machine which has two GPUs, each with 3 GB memory. My batch size is only 2GB, and so can fit on one GPU. Is there any point in training with both GPUs (using CUDA_VISIBLE_DEVICES)? If I did, how would TensorFlow distribute the training?
With regards to memory: I assume you mean that one data batch is 2GB. However, Tensorflow also needs memory to store the variables as well as the hidden-layer activations (which are kept around to compute gradients), so whether the memory suffices also depends on your specific model. Your best bet is to just try with one GPU and see whether the program crashes with out-of-memory errors.
With regards to distribution: Tensorflow doesn't do this automatically at all. Each op is placed on some device; by default, if any GPUs are available, all GPU-compatible ops are placed on the first GPU and the rest on the CPU. This is despite Tensorflow reserving memory on all visible GPUs by default.
You should have a look at the GPU guide on the Tensorflow website. The most important thing is that you can use the with tf.device context manager to place ops on other GPUs. Using this, the idea would be to split your batch into X chunks (X = number of GPUs) and define your model on each device, each time taking the respective chunk as input and making sure to reuse variables.
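A minimal sketch of that idea (TF 1.x graph API; the tiny model, shapes, and the build_model helper are illustrative assumptions, not the asker's network):

```python
import tensorflow as tf

NUM_GPUS = 2

def build_model(x, reuse):
    # Hypothetical tiny model; reuse=True shares variables between the GPU copies.
    with tf.variable_scope("model", reuse=reuse):
        return tf.layers.dense(x, 10)

inputs = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])

# Split one input batch into NUM_GPUS chunks along the batch dimension.
x_chunks = tf.split(inputs, NUM_GPUS)
y_chunks = tf.split(labels, NUM_GPUS)

losses = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i):
        logits = build_model(x_chunks[i], reuse=(i > 0))
        losses.append(tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=y_chunks[i], logits=logits)))

# Average the per-GPU losses; the optimizer then updates the shared variables.
total_loss = tf.add_n(losses) / NUM_GPUS
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(total_loss)
```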
If you are using tf.Estimator, there is some information in this question. It is very easy to do distributed execution here using just two simple wrappers, but I personally haven't been able to use it successfully (pretty slow and crashes randomly with a segfault).