Is there a way of determining how much GPU memory is in use by TensorFlow?

TensorFlow tends to preallocate the entire available memory on its GPUs. For debugging, is there a way of telling how much of that memory is actually in use?

(1) There is some limited support with Timeline for logging memory allocations. Here is an example of its usage:
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
summary, _ = sess.run([merged, train_step],
                      feed_dict=feed_dict(True),
                      options=run_options,
                      run_metadata=run_metadata)
train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
train_writer.add_summary(summary, i)
print('Adding run metadata for', i)
# Build the Chrome trace once, then print it and write it to a file.
tl = timeline.Timeline(run_metadata.step_stats)
trace = tl.generate_chrome_trace_format(show_memory=True)
print(trace)
with tf.gfile.Open('timeline', mode='w') as trace_file:
    trace_file.write(trace)
You can give this code a try with the MNIST example (mnist with summaries).
This will generate a tracing file named timeline, which you can open with chrome://tracing. Note that this only gives approximate GPU memory usage statistics. It basically simulates a GPU execution, but doesn't have access to the full graph metadata. It also can't know how many variables have been assigned to the GPU.
(2) For a very coarse measure of GPU memory usage, nvidia-smi will show the total device memory usage at the time you run the command.
nvprof can show the on-chip shared memory usage and register usage at the CUDA kernel level, but doesn't show the global/device memory usage.
Here is an example command: nvprof --print-gpu-trace matrixMul
And more details here:
http://docs.nvidia.com/cuda/profiler-users-guide/#abstract

Here's a practical solution that worked well for me:
1. Disable GPU memory pre-allocation using the TF session configuration:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
2. Run nvidia-smi -l (or some other utility) to monitor GPU memory consumption.
3. Step through your code with the debugger until you see the unexpected GPU memory consumption.
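The monitoring step above can also be scripted instead of watching nvidia-smi -l by hand. The following is a minimal sketch, not part of the original answer: it assumes nvidia-smi is on the PATH, and the helper names parse_used_mib and log_gpu_memory are made up for illustration.

```python
import subprocess
import time

# Ask nvidia-smi for used memory only, one bare number (MiB) per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]

def parse_used_mib(csv_text):
    """Parse one integer (MiB in use) per GPU from the CSV output."""
    return [int(line) for line in csv_text.splitlines() if line.strip()]

def log_gpu_memory(samples=5, interval_s=1.0):
    """Print used memory per GPU a few times; requires nvidia-smi."""
    for _ in range(samples):
        out = subprocess.check_output(QUERY).decode("ascii")
        print(parse_used_mib(out))
        time.sleep(interval_s)
```

Calling log_gpu_memory() from a background thread while stepping through the code gives a rough log of when the device-level usage jumps. As with nvidia-smi itself, this shows memory reserved by the process, not memory actually used by tensors.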

There's some code in tensorflow.contrib.memory_stats that will help with this:
from tensorflow.contrib.memory_stats.python.ops.memory_stats_ops import BytesInUse
with tf.device('/device:GPU:0'):  # Replace with device you are interested in
    bytes_in_use = BytesInUse()
with tf.Session() as sess:
    print(sess.run(bytes_in_use))

The TensorFlow profiler has an improved memory timeline that is based on real GPU memory allocator information:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/profiler#visualize-time-and-memory

tf.config.experimental.get_memory_info('GPU:0')
Currently returns the following keys:
'current': The current memory used by the device, in bytes.
'peak': The peak memory used by the device across the run of the program, in bytes.

As @V.M previously mentioned, a solution that works well is tf.config.experimental.get_memory_info('DEVICE_NAME').
This function returns a dictionary with two keys:
'current': The current memory used by the device, in bytes.
'peak': The peak memory used by the device across the run of the program, in bytes.
The values of these keys are the memory ACTUALLY in use, not the allocated memory that nvidia-smi reports.
By default, TensorFlow allocates nearly all of the GPU's memory, which makes nvidia-smi useless for checking how much memory your code uses. Even when tf.config.experimental.set_memory_growth is set to True, so that TensorFlow no longer allocates all available memory up front, it still allocates more memory than is in use, and it does so in discrete steps, e.g. 4589 MiB, then 8717 MiB, then 16943 MiB, then 30651 MiB, and so on.
One caveat: get_memory_info() does not return correct values when called inside a tf.function-decorated function. To determine the peak memory such a function uses, read the 'peak' key after the decorated function has finished executing.
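A minimal sketch of that pattern follows. It assumes a TensorFlow version that provides both tf.config.experimental.get_memory_info and tf.config.experimental.reset_memory_stats; the helper names to_mib and peak_memory_of are made up for illustration.

```python
def to_mib(info):
    """Convert a {'current': bytes, 'peak': bytes} dict to MiB values."""
    return {key: value / 2**20 for key, value in info.items()}

def peak_memory_of(fn, device="GPU:0"):
    """Run fn() and return the peak device memory (MiB) seen during the call.

    The figure includes memory that was already in use before fn started,
    because resetting the stats sets the peak to the current usage.
    """
    import tensorflow as tf  # imported lazily so to_mib works without TF
    tf.config.experimental.reset_memory_stats(device)
    fn()  # e.g. a tf.function-decorated training step
    info = tf.config.experimental.get_memory_info(device)
    return to_mib(info)["peak"]
```

The measurement is taken outside the decorated function, after it returns, which is the situation where the 'peak' key is meaningful.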
In older versions of TensorFlow, tf.config.experimental.get_memory_usage('DEVICE_NAME') was the only available function, and it returned only the memory in use (with no way to determine peak memory).
Finally, you can also consider the TensorFlow Profiler available in TensorBoard, as @Peter mentioned.
Hope this helps :)

Related

Dask-Rapids data movement and out of memory issue

I am using dask (2021.3.0) and rapids (0.18) in my project. I perform the preprocessing on the CPU, and the preprocessed data is then transferred to the GPU for K-means clustering. During this process I get the following error:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(The error occurs before the GPU memory is completely used, i.e. it is not using the GPU memory completely.)
I have a single GPU with 40 GB of memory.
The RAM size is 512 GB.
I am using the following snippet of code:
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
cluster.scale(100)
## perform my preprocessing on data and get output in variable A
# convert variable A to cupy
x = A.map_blocks(cp.asarray)
km = KMeans(n_clusters=4)
predict = km.fit_predict(x).compute()
I am also looking for a solution so that data larger than GPU memory can be preprocessed, and so that whenever GPU memory overflows, the spilled data is moved to a temp directory or to the CPU (as we do with dask, where we define a temp directory for spills from RAM).
Any help will be appreciated.
There are several ways to run larger than GPU datasets.
Check out Nick Becker's blog, which has a few methods well documented.
Check out BlazingSQL, which is built on top of RAPIDS and can perform out-of-core processing. You can try it at beta.blazingsql.com.
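For the spilling part of the question, one option that is not mentioned in the answer above is Dask-CUDA's LocalCUDACluster, which can move Dask-managed GPU data to host memory once a device memory limit is reached. The following is only a sketch under assumptions: it assumes the dask-cuda package is installed, the helper names are hypothetical, and the 80% threshold and the spill directory are arbitrary choices. Spilling covers data held by Dask (such as CuPy chunks); it does not necessarily cover temporary memory that an algorithm like K-means allocates internally.

```python
def cuda_cluster_kwargs(gpu_gib, spill_percent=80, spill_dir="/tmp/dask-spill"):
    """Build keyword arguments for a single-GPU LocalCUDACluster.

    device_memory_limit is the point at which Dask-CUDA starts spilling
    GPU data to host memory; local_directory holds temporary worker files.
    """
    limit_gib = gpu_gib * spill_percent // 100
    return {
        "n_workers": 1,
        "device_memory_limit": "%dGiB" % limit_gib,
        "local_directory": spill_dir,
    }

def start_cluster(gpu_gib=40):
    """Start the cluster and return a client; needs dask-cuda and a GPU."""
    from dask_cuda import LocalCUDACluster   # lazy import: optional dependency
    from dask.distributed import Client
    cluster = LocalCUDACluster(**cuda_cluster_kwargs(gpu_gib))
    return Client(cluster)
```

With the 40 GB GPU from the question, cuda_cluster_kwargs(40) asks for spilling to begin at 32 GiB.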

How to use CUDA pinned "zero-copy" memory for a memory mapped file?

Objective/Problem
In Python, I am looking for a fast way to read/write data from a memory mapped file to a GPU.
In a previous Stack Overflow post [Cupy OutOfMemoryError when trying to cupy.load larger dimension .npy files in memory map mode, but np.load works fine], it is mentioned that this is possible using CUDA pinned "zero-copy" memory. Furthermore, it seems that this method was developed by this person [cuda - Zero-copy memory, memory-mapped file], though that person was working in C++.
My previous attempts have been with Cupy, but I am open to any cuda methods.
What I have tried so far
I mentioned how I tried to use Cupy, which allows you to open numpy files in memory-mapped mode.
import os
import numpy as np
import cupy

# Create .npy files.
for i in range(4):
    numpyMemmap = np.memmap('reg.memmap'+str(i), dtype='float32', mode='w+', shape=(2200000, 512))
    np.save('reg.memmap'+str(i), numpyMemmap)
    del numpyMemmap
    os.remove('reg.memmap'+str(i))

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    NPYmemmap.append(np.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))
del NPYmemmap

# Eventually results in memory error.
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append(cupy.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))
Result of what I have tried
My attempt resulted in an OutOfMemoryError.
It was mentioned that
it appears that cupy.load will require that the entire file fit first in host memory, then in device memory.
And it was also mentioned that
CuPy can't handle mmap memory. So, CuPy uses GPU memory directly in default.
https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.MemoryPool.html#cupy.cuda.MemoryPool.malloc
You can change default memory allocator if you want to use Unified Memory.
I tried using
cupy.cuda.set_allocator(cupy.cuda.MemoryPool(cupy.cuda.memory.malloc_managed).malloc)
But this didn't seem to make a difference. At the time of the error, my CPU RAM usage was at ~16 GB, but my GPU RAM usage was at 0.32 GB. I am using Google Colab, where I have 25 GB of CPU RAM and 12 GB of GPU RAM. So it looks like, after the entire file was loaded into host memory, it checked whether it could fit in device memory, and when it saw that only 12 of the required 16 GB were available, it threw an error (my best guess).
So, now I am trying to figure out a way to use pinned 'zero-copy' memory to handle a memory-mapped file which would feed data to the GPU.
If it matters, the data I am trying to transfer are floating point arrays. Normally, for read-only data, binary files are loaded into GPU memory, but I am working with data that I am trying to both read and write at every step.
It appears to me that currently, cupy doesn't offer a pinned allocator that can be used in place of the usual device memory allocator, i.e. could be used as the backing for cupy.ndarray. If this is important to you, you might consider filing a cupy issue.
However, it seems like it may be possible to create one. This should be considered experimental code. And there are some issues associated with its use.
The basic idea is that we will replace cupy's default device memory allocator with our own, using cupy.cuda.set_allocator as was already suggested to you. We will need to provide our own replacement for the BaseMemory class that is used as the repository for cupy.cuda.memory.MemoryPointer. The key difference here is that we will use a pinned memory allocator instead of a device allocator. This is the gist of the PMemory class below.
A few other things to be aware of:
After doing what you need with pinned memory (allocations), you should probably revert the cupy allocator to its default value. Unfortunately, unlike cupy.cuda.set_allocator, I did not find a corresponding cupy.cuda.get_allocator, which strikes me as a deficiency in cupy and something that also seems worthy of a cupy issue. However, for this demonstration we will just revert to the None choice, which uses one of the default device memory allocators (not the pool allocator, however).
By providing this minimalistic pinned memory allocator, we are still suggesting to cupy that this is ordinary device memory. That means it's not directly accessible from the host code (it is, actually, but cupy doesn't know that). Therefore, various operations (such as cupy.load) will create unneeded host allocations and unneeded copy operations. I think addressing this would require much more than the small change I am suggesting. But at least for your test case, this additional overhead may be manageable. It appears that you want to load data from disk once, and then leave it there. For that type of activity, this should be manageable, especially since you are breaking it up into chunks. As we will see, handling four 5GB chunks will be too much for 25GB of host memory. We will need host memory allocation for the four 5GB chunks (which are actually pinned) and we will also need additional space for one additional 5GB "overhead" buffer. So 25GB is not enough for that. But for demonstration purposes, if we reduce your buffer sizes to 4GB (5x4GB = 20GB) I think it may fit within your 25GB host RAM size.
Ordinary device memory, associated with cupy's default device memory allocator, is tied to a particular device. Pinned memory need not have such an association; however, our trivial replacement of BaseMemory with a lookalike class means that we are suggesting to cupy that this "device" memory, like all other ordinary device memory, has a specific device association. In a single-device setting such as yours, this distinction is meaningless. However, this isn't suitable for robust multi-device use of pinned memory. For that, again, the suggestion would be a more robust change to cupy, perhaps by filing an issue.
Here's an example:
import os
import numpy as np
import cupy

# Lookalike for cupy's device memory class, backed by pinned host memory.
class PMemory(cupy.cuda.memory.BaseMemory):
    def __init__(self, size):
        self.size = size
        self.device_id = cupy.cuda.device.get_device_id()
        self.ptr = 0
        if size > 0:
            self.ptr = cupy.cuda.runtime.hostAlloc(size, 0)
    def __del__(self):
        if self.ptr:
            cupy.cuda.runtime.freeHost(self.ptr)

def my_pinned_allocator(bsize):
    return cupy.cuda.memory.MemoryPointer(PMemory(bsize), 0)

# Replace cupy's default device allocator with the pinned one.
cupy.cuda.set_allocator(my_pinned_allocator)

# Create 4 .npy files, ~4GB each
for i in range(4):
    print(i)
    numpyMemmap = np.memmap('reg.memmap'+str(i), dtype='float32', mode='w+', shape=(10000000, 100))
    np.save('reg.memmap'+str(i), numpyMemmap)
    del numpyMemmap
    os.remove('reg.memmap'+str(i))

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    print(i)
    NPYmemmap.append(np.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))
del NPYmemmap

# allocate pinned memory storage
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append(cupy.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))

# Revert to the default device allocator.
cupy.cuda.set_allocator(None)
I haven't tested this in a setup with 25GB of host memory with these file sizes. But I have tested it with other file sizes that exceed the device memory of my GPU, and it seems to work.
Again, experimental code, not thoroughly tested, your mileage may vary, would be better to attain this functionality via filing of cupy github issues. And, as I've mentioned previously, this sort of "device memory" will be generally much slower to access from device code than ordinary cupy device memory.
Finally, this is not really a "memory mapped file" as all the file contents will be loaded into host memory, and furthermore, this methodology "uses up" host memory. If you have 20GB of files to access, you will need more than 20GB of host memory. As long as you have those files "loaded", 20GB of host memory will be in use.
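The host-memory budget discussed in this answer is simple arithmetic, sketched below. The helper names are made up for illustration, and the "+1" models the single extra staging buffer the answer describes, which is an approximation rather than a measured figure.

```python
def array_gb(shape, bytes_per_item=4):
    """Size of a dense array in decimal gigabytes (float32 = 4 bytes)."""
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * bytes_per_item / 1e9

def host_gb_needed(shape, n_files, bytes_per_item=4):
    """Pinned copies of every file plus one temporary staging buffer."""
    return (n_files + 1) * array_gb(shape, bytes_per_item)

# Files from the question: 2200000 x 512 float32 values each.
print(array_gb((2200000, 512)))
print(host_gb_needed((2200000, 512), 4))
# Reduced files from the answer: 10000000 x 100 float32 values each.
print(host_gb_needed((10000000, 100), 4))
```

The question's files come to about 4.5 GB each (the answer rounds this to 5 GB), so four of them plus a staging buffer leave little or no headroom in 25 GB of host RAM, while the reduced 4 GB files need about 20 GB.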
UPDATE: cupy provides support for pinned allocators now, see here. This answer should only be used for historical reference.

Explanation of parallel arguments of tf.while_loop in TensorFlow

I want to implement an algorithm in TensorFlow that can run in parallel. My question is what the arguments parallel_iterations, swap_memory and maximum_iterations actually do, and which values are appropriate in which situation. Specifically, the documentation on TensorFlow's site (https://www.tensorflow.org/api_docs/python/tf/while_loop) says that parallel_iterations is the number of iterations allowed to run in parallel. Is this the number of threads? When should one allow CPU-GPU memory swapping, and for what reason? What are the advantages and disadvantages of this choice? What is the purpose of maximum_iterations? Can it be combined with parallel_iterations?
swap_memory is used when you want to free up extra memory on the GPU device. Usually, when you are training a model, some activations are kept in GPU memory for later use. With swap_memory, you can store those activations in CPU memory and use the GPU memory to fit, e.g., larger batch sizes. That is the advantage. You would choose this if you need big batch sizes or have long sequences and want to avoid OOM exceptions. The disadvantage is computation time, since the data has to be transferred between CPU memory and GPU memory.
maximum_iterations is something like this:
while num_iter < 100 and <some condition>:
    do something
    num_iter += 1
So it is useful when you check a condition but also want an upper bound. One example is checking whether your model converges: if it doesn't, you still want to stop after k iterations.
As for parallel_iterations, I am not sure, but it sounds like multiple threads, yes. You can try it and see the effect in a sample script.
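To make the maximum_iterations behaviour concrete, here is a small sketch. bounded_while is a plain-Python analogue written for illustration only; tf_bounded_while passes the same loop to tf.while_loop and assumes TensorFlow is installed. The loop (doubling a number until it reaches 1000) is an arbitrary example.

```python
def bounded_while(cond, body, state, maximum_iterations):
    """Pure-Python analogue: stop when cond fails or the cap is reached."""
    num_iter = 0
    while num_iter < maximum_iterations and cond(state):
        state = body(state)
        num_iter += 1
    return state, num_iter

def tf_bounded_while():
    """The same loop expressed with tf.while_loop (needs TensorFlow)."""
    import tensorflow as tf  # lazy import: only needed for this function
    return tf.while_loop(
        cond=lambda x: x < 1000.0,
        body=lambda x: (x * 2.0,),
        loop_vars=(tf.constant(1.0),),
        parallel_iterations=10,   # the documented default
        swap_memory=False,        # keep loop tensors on the GPU
        maximum_iterations=5)     # hard cap on the iteration count

# The cap ends the loop before the condition does: 1 -> 32 in 5 steps.
print(bounded_while(lambda x: x < 1000.0, lambda x: x * 2.0, 1.0, 5))
```

maximum_iterations and parallel_iterations are independent arguments, so they can be passed together as shown.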

Tensorflow: Setting allow_growth to true does still allocate memory of all my GPUs

I have several GPUs but I only want to use one GPU for my training. I am using the following options:
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
Despite setting / using all these options, all of my GPUs allocate memory and
#processes = #GPUs
How can I prevent this from happening?
Note
I do not want to set the devices manually and I do not want to set CUDA_VISIBLE_DEVICES, since I want TensorFlow to automatically find the best (an idle) GPU available.
When I try to start another run it uses the same GPU that is already used by another tensorflow process even though there are several other free GPUs (apart from the memory allocation on them)
I am running tensorflow in a docker container: tensorflow/tensorflow:latest-devel-gpu-py
I had this problem myself. Setting config.gpu_options.allow_growth = True did not do the trick, and all of the GPU memory was still consumed by TensorFlow.
The way around it is the undocumented environment variable TF_FORCE_GPU_ALLOW_GROWTH (I found it in
https://github.com/tensorflow/tensorflow/blob/3e21fe5faedab3a8258d344c8ad1cec2612a8aa8/tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc#L25)
Setting TF_FORCE_GPU_ALLOW_GROWTH=true works perfectly.
In the Python code, you can set
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
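The environment variable has to be in place before TensorFlow initializes its GPU devices, so the safest spot is before the import. A minimal sketch follows; the function names are made up for illustration, and the second function is a TensorFlow 2.x alternative that assumes tf.config.experimental.set_memory_growth is available.

```python
import os

def request_gpu_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of preallocating."""
    # Must run before TensorFlow initializes its GPU devices.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

def enable_growth_tf2():
    """TF 2.x alternative: enable memory growth per physical GPU."""
    import tensorflow as tf  # lazy import: only needed for this function
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)

request_gpu_memory_growth()
# import tensorflow as tf   # import TensorFlow only after this point
```

Note that memory growth changes how much memory each visible GPU reserves; it does not by itself restrict which GPUs the process touches.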
I can offer you a method mask_busy_gpus defined here: https://github.com/yselivonchyk/TensorFlow_DCIGN/blob/master/utils.py
Simplified version of the function:
import subprocess as sp
import os

def mask_unused_gpus(leave_unmasked=1):
    ACCEPTABLE_AVAILABLE_MEMORY = 1024  # MiB that must be free for a GPU to count as available
    COMMAND = "nvidia-smi --query-gpu=memory.free --format=csv"
    try:
        # Split the output into lines and drop the trailing empty line.
        _output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
        # [1:] skips the CSV header line.
        memory_free_info = _output_to_list(sp.check_output(COMMAND.split()))[1:]
        memory_free_values = [int(x.split()[0]) for x in memory_free_info]
        available_gpus = [i for i, x in enumerate(memory_free_values) if x > ACCEPTABLE_AVAILABLE_MEMORY]
        if len(available_gpus) < leave_unmasked:
            raise ValueError('Found only %d usable GPUs in the system' % len(available_gpus))
        os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(map(str, available_gpus[:leave_unmasked]))
    except Exception as e:
        print('GPUs are not masked ("nvidia-smi" may not be installed):', e)
Usage:
mask_unused_gpus()
with tf.Session()...
Prerequisites: nvidia-smi
With this script I was solving the following problem: on a multi-GPU cluster, use only a single GPU (or an arbitrary number of GPUs), and let them be picked automatically.
Shortcoming of the script: if you start multiple scripts at once, they may end up on the same GPU, because the script relies on memory allocation to detect busy GPUs and memory allocation takes a few seconds to kick in.
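One possible way to reduce that race, which is not part of the original answer, is to serialize GPU selection across processes with a lock file and hold the lock until the process has actually allocated GPU memory. The sketch below is Unix only (it uses fcntl), and both the helper name gpu_selection_lock and the lock-file location are arbitrary choices.

```python
import contextlib
import fcntl
import os
import tempfile

@contextlib.contextmanager
def gpu_selection_lock(path=None):
    """Hold an exclusive advisory lock so one process picks a GPU at a time."""
    path = path or os.path.join(tempfile.gettempdir(), "gpu_select.lock")
    with open(path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield path
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Intended use (mask_unused_gpus is the function from the answer above):
# with gpu_selection_lock():
#     mask_unused_gpus()
#     ...create the session and run one step so memory is allocated...
```

A second script started at the same time then waits at the lock and sees the first script's memory allocation when it queries nvidia-smi.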

Tensorflow: dynamically call GPUs with enough free memory

My desktop has two GPUs, which TensorFlow can address as /gpu:0 or /gpu:1. However, if I don't specify which GPU to run the code on, TensorFlow will use /gpu:0 by default, as we all know.
Now I would like to set up the system so that it assigns a GPU dynamically according to the free memory of each GPU. For example, if a script doesn't specify which GPU to run on, the system first assigns /gpu:0 to it; then, if another script is started, it checks whether /gpu:0 has enough free memory. If yes, it keeps assigning /gpu:0, otherwise it assigns /gpu:1. How can I achieve this?
Follow-ups:
I believe the question above may be related to GPU virtualization. That is to say, if I could virtualize the multiple GPUs in a desktop into one GPU, I would get what I want. So besides any setup methods for TensorFlow, any ideas about virtualization are also welcome.
TensorFlow generally assumes it's not sharing the GPU with anyone, so I don't see a way of doing it from inside TensorFlow. However, you could do it from outside as follows: write a shell script that calls nvidia-smi, parses out the GPU k with the most free memory, then sets "CUDA_VISIBLE_DEVICES=k" and calls the TensorFlow script.
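The launcher described above could look roughly like the following, written in Python rather than shell. This is a sketch under assumptions: nvidia-smi must be on the PATH, the function names pick_gpu and launch are made up, and CUDA_DEVICE_ORDER is set to PCI_BUS_ID so that CUDA numbers the GPUs in the same order as nvidia-smi.

```python
import os
import subprocess
import sys

def pick_gpu(csv_text):
    """Return the index of the GPU with the most free memory.

    csv_text is the output of
    nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
    """
    free_mib = [int(line) for line in csv_text.splitlines() if line.strip()]
    return free_mib.index(max(free_mib))

def launch(script_args):
    """Run a Python script with only the chosen GPU visible to it."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"]).decode("ascii")
    env = dict(os.environ,
               CUDA_DEVICE_ORDER="PCI_BUS_ID",
               CUDA_VISIBLE_DEVICES=str(pick_gpu(out)))
    return subprocess.call([sys.executable] + list(script_args), env=env)
```

For example, launch(["train.py"]) would start a hypothetical train.py that sees a single GPU, the one that had the most free memory at launch time.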
Inspired by:
How to set specific gpu in tensorflow?
import os
import subprocess as sp

def leave_gpu_with_most_free_ram():
    try:
        command = "nvidia-smi --query-gpu=memory.free --format=csv"
        output = sp.check_output(command.split()).decode('ascii')
        # Skip the CSV header line and the trailing empty line.
        memory_free_info = output.split('\n')[1:-1]
        memory_free_values = [int(x.split()[0]) for x in memory_free_info]
        least_busy_idx = memory_free_values.index(max(memory_free_values))
        # update CUDA variable
        gpus = [least_busy_idx]
        setting = ','.join(map(str, gpus))
        os.environ["CUDA_VISIBLE_DEVICES"] = setting
        print('Left next %d GPU(s) unmasked: [%s] (from %d available)'
              % (len(gpus), setting, len(memory_free_values)))
    except FileNotFoundError as e:
        print('"nvidia-smi" is probably not installed. GPUs are not masked')
        print(e)
    except sp.CalledProcessError as e:
        print("Error on GPU masking:\n", e.output)
Add a call to this function before importing tensorflow.