How to use CUDA pinned "zero-copy" memory for a memory mapped file? - numpy

Objective/Problem
In Python, I am looking for a fast way to read/write data from a memory mapped file to a GPU.
In a previous SO post [ Cupy OutOfMemoryError when trying to cupy.load larger dimension .npy files in memory map mode, but np.load works fine ], it is mentioned that this is possible using CUDA pinned "zero-copy" memory. Furthermore, it seems that this method was developed by this person [ cuda - Zero-copy memory, memory-mapped file ], though that person was working in C++.
My previous attempts have been with CuPy, but I am open to any CUDA methods.
What I have tried so far
I mentioned how I tried to use CuPy, which allows you to open numpy files in memory-mapped mode.
import os
import numpy as np
import cupy

# Create .npy files.
for i in range(4):
    numpyMemmap = np.memmap('reg.memmap'+str(i), dtype='float32', mode='w+', shape=(2200000, 512))
    np.save('reg.memmap'+str(i), numpyMemmap)
    del numpyMemmap
    os.remove('reg.memmap'+str(i))

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    NPYmemmap.append(np.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))
del NPYmemmap

# Eventually results in memory error.
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append(cupy.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))
Result of what I have tried
My attempt resulted in an OutOfMemoryError.
It was mentioned that
it appears that cupy.load will require that the entire file fit first in host memory, then in device memory.
And it was also mentioned that
CuPy can't handle mmap memory. So, CuPy uses GPU memory directly in default.
https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.MemoryPool.html#cupy.cuda.MemoryPool.malloc
You can change default memory allocator if you want to use Unified Memory.
I tried using
cupy.cuda.set_allocator(cupy.cuda.MemoryPool(cupy.cuda.memory.malloc_managed).malloc)
But this didn't seem to make a difference. At the time of the error, my CPU RAM was at ~16 GB, but my GPU RAM was at 0.32 GB. I am using Google Colab, where my CPU RAM is 25 GB and GPU RAM is 12 GB. So it looks like after the entire file was hosted in host memory, it checked whether it could fit in device memory, and when it saw that it only had 12 of the required 16 GB, it threw an error (my best guess).
So, now I am trying to figure out a way to use pinned 'zero-copy' memory to handle a memory mapped file which would feed data to the GPU.
If it is important, the type of data I am trying to transfer is floating-point arrays. Normally, for read-only data, binary files are loaded into GPU memory, but I am working with data that I am trying to both read and write at every step.

It appears to me that currently, cupy doesn't offer a pinned allocator that can be used in place of the usual device memory allocator, i.e. could be used as the backing for cupy.ndarray. If this is important to you, you might consider filing a cupy issue.
However, it seems like it may be possible to create one. This should be considered experimental code. And there are some issues associated with its use.
The basic idea is that we will replace cupy's default device memory allocator with our own, using cupy.cuda.set_allocator as was already suggested to you. We will need to provide our own replacement for the BaseMemory class that is used as the repository for cupy.cuda.memory.MemoryPointer. The key difference here is that we will use a pinned memory allocator instead of a device allocator. This is the gist of the PMemory class below.
A few other things to be aware of:
after doing what you need with pinned memory (allocations), you should probably revert the cupy allocator to its default value. Unfortunately, unlike cupy.cuda.set_allocator, I did not find a corresponding cupy.cuda.get_allocator, which strikes me as a deficiency in cupy, something that also seems worthy of filing a cupy issue. However, for this demonstration we will just revert to the None choice, which uses one of the default device memory allocators (not the pool allocator, however); a small wrapper sketch for this revert follows these notes.
by providing this minimalistic pinned memory allocator, we are still suggesting to cupy that this is ordinary device memory. That means it's not directly accessible from the host code (it is, actually, but cupy doesn't know that). Therefore, various operations (such as cupy.load) will create unneeded host allocations, and unneeded copy operations. I think to address this would require much more than just this small change I am suggesting. But at least for your test case, this additional overhead may be manageable. It appears that you want to load data from disk once, and then leave it there. For that type of activity, this should be manageable, especially since you are breaking it up into chunks. As we will see, handling four 5GB chunks will be too much for 25GB of host memory. We will need host memory allocation for the four 5GB chunks (which are actually pinned) and we will also need additional space for one additional 5GB "overhead" buffer. So 25GB is not enough for that. But for demonstration purposes, if we reduce your buffer sizes to 4GB (5x4GB = 20GB) I think it may fit within your 25GB host RAM size.
Ordinary device memory associated with cupy's default device memory allocator has an association with a particular device. Pinned memory need not have such an association; however, our trivial replacement of BaseMemory with a lookalike class means that we are telling cupy that this "device" memory, like all other ordinary device memory, has a specific device association. In a single-device setting such as yours, this distinction is meaningless. However, this isn't suitable for robust multi-device use of pinned memory. For that, again, the suggestion would be a more robust change to cupy, perhaps by filing an issue.
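As a side note on the revert point above, a small try/finally wrapper can make the revert harder to forget. This is just my own convenience sketch, not part of cupy's API; my_pinned_allocator refers to the function defined in the example below.

import contextlib
import cupy

@contextlib.contextmanager
def use_allocator(allocator):
    # Temporarily install a cupy device memory allocator, then fall back to the
    # default (None), since there is no cupy.cuda.get_allocator to restore
    # whatever allocator was active before.
    cupy.cuda.set_allocator(allocator)
    try:
        yield
    finally:
        cupy.cuda.set_allocator(None)

# Usage, once my_pinned_allocator is defined as in the example below:
# with use_allocator(my_pinned_allocator):
#     arr = cupy.load('reg.memmap0.npy', mmap_mode='r+')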
Here's an example:
import os
import numpy as np
import cupy

class PMemory(cupy.cuda.memory.BaseMemory):
    # Pinned (page-locked) host memory presented to cupy as "device" memory.
    def __init__(self, size):
        self.size = size
        self.device_id = cupy.cuda.device.get_device_id()
        self.ptr = 0
        if size > 0:
            self.ptr = cupy.cuda.runtime.hostAlloc(size, 0)
    def __del__(self):
        if self.ptr:
            cupy.cuda.runtime.freeHost(self.ptr)

def my_pinned_allocator(bsize):
    return cupy.cuda.memory.MemoryPointer(PMemory(bsize), 0)

cupy.cuda.set_allocator(my_pinned_allocator)

# Create 4 .npy files, ~4GB each.
for i in range(4):
    print(i)
    numpyMemmap = np.memmap('reg.memmap'+str(i), dtype='float32', mode='w+', shape=(10000000, 100))
    np.save('reg.memmap'+str(i), numpyMemmap)
    del numpyMemmap
    os.remove('reg.memmap'+str(i))

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    print(i)
    NPYmemmap.append(np.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))
del NPYmemmap

# Allocate pinned memory storage via cupy.load.
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append(cupy.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))

# Revert to a default device memory allocator.
cupy.cuda.set_allocator(None)
I haven't tested this in a setup with 25GB of host memory with these file sizes. But I have tested it with other file sizes that exceed the device memory of my GPU, and it seems to work.
Again, this is experimental code, not thoroughly tested; your mileage may vary, and it would be better to attain this functionality via the filing of cupy GitHub issues. And, as I've mentioned previously, this sort of "device memory" will generally be much slower to access from device code than ordinary cupy device memory.
Finally, this is not really a "memory mapped file" as all the file contents will be loaded into host memory, and furthermore, this methodology "uses up" host memory. If you have 20GB of files to access, you will need more than 20GB of host memory. As long as you have those files "loaded", 20GB of host memory will be in use.
UPDATE: cupy provides support for pinned allocators now, see here. This answer should only be used for historical reference.
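For completeness, here is a minimal sketch of the pinned-memory APIs documented in current CuPy, which is my reading of what the update refers to (verify the exact calls against your CuPy version). Note that this pins host staging buffers and host-side arrays; it does not by itself make cupy device arrays live in host memory.

import numpy as np
import cupy

# Route cupy's host staging buffers through a pinned (page-locked) pool.
pinned_pool = cupy.cuda.PinnedMemoryPool()
cupy.cuda.set_pinned_memory_allocator(pinned_pool.malloc)

def pinned_empty(shape, dtype):
    # numpy array whose buffer is page-locked host memory
    count = int(np.prod(shape))
    mem = cupy.cuda.alloc_pinned_memory(count * np.dtype(dtype).itemsize)
    return np.frombuffer(mem, dtype, count).reshape(shape)

host = pinned_empty((10000, 512), np.float32)                 # pinned host buffer
host[:] = np.load('reg.memmap0.npy', mmap_mode='r')[:10000]   # fill from the memmap
dev = cupy.asarray(host)                                      # fast H2D copy from pinned memory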

Related

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple python function with a Dask-delayed decorator on it. In my real use-case scenario this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs, and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes in when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine, and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory.
Please help me understand this behavior and, hopefully, implement a more memory-safe approach.
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB)
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask

@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )
    # Simulate a "long" computation
    time.sleep(5)
    return df.iloc[[-1]]  # return the last row

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)

    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))

    # compute
    r = dask.compute(results)[0]
    print(pd.concat(r, ignore_index=True))
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but it's a good idea to have workloads that are a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source
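A rough back-of-the-envelope check of that rule of thumb against the numbers in the question (both figures are taken from the question itself):

# Rough sizing check using the 5-10x rule of thumb quoted above.
df_footprint_gb = 2.2   # in-task DataFrame size reported by the question
worker_limit_gb = 4.0   # memory_limit per worker in the question's cluster
print(f"{5*df_footprint_gb:.0f}-{10*df_footprint_gb:.0f} GB recommended "
      f"vs {worker_limit_gb:.0f} GB worker limit")
# -> 11-22 GB recommended vs 4 GB worker limit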

Dask-Rapids data movement and out-of-memory issue

I am using dask (2021.3.0) and rapids (0.18) in my project. In this, I am performing a preprocessing task on the CPU, and later the preprocessed data is transferred to the GPU for K-means clustering. But in this process, I am getting the following problem:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(it gives the error before GPU memory is used completely, i.e. it is not using the GPU memory completely)
I have a single GPU of size 40 GB.
RAM size is 512 GB.
I am using the following snippet of code:
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
cluster.scale(100)

## perform my preprocessing on data and get output in variable A

# convert variable A to cupy
x = A.map_blocks(cp.asarray)

km = KMeans(n_clusters=4)
predict = km.fit_predict(x).compute()
I am also looking for a solution so that data larger than GPU memory can be preprocessed, and whenever there is a spill in GPU memory, the spilled data is transferred to a temp directory or to the CPU (as we do with dask, where we define a temp directory for when there is a spill in RAM).
Any help will be appreciated.
There are several ways to handle datasets larger than GPU memory.
Check out Nick Becker's blog, which has a few methods well documented.
Check out BlazingSQL, which is built on top of RAPIDS and can perform out-of-core processing. You can try it at beta.blazingsql.com.
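On the spilling part of the question, something along these lines may help, assuming dask_cuda is installed; the limit value is illustrative for a 40 GB GPU, and the final comment stands in for the cuML K-means step from the question.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Sketch: LocalCUDACluster can spill device memory to host once a worker
# crosses device_memory_limit.
cluster = LocalCUDACluster(
    n_workers=1,
    threads_per_worker=1,
    device_memory_limit="30GB",
)
client = Client(cluster)

# ...preprocess on the CPU, map blocks to cupy, then run the K-means fit as before...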

Why are large numpy arrays 64-byte aligned but not smaller ones

The following code:
import numpy as np

x = 253  # array size under test; the question varies this value

prev = []
addresses = []
for i in range(10000):
    a = np.ones(x).astype(np.float32)
    prev.append(a)
    address = a.__array_interface__['data'][0]
    assert(address % 64 == 0)
    assert(address not in addresses)
    addresses.append(address)
This will not raise an AssertionError for values of x > 252, suggesting that arrays bigger than 253 (or bigger than 505 when using float16) are aligned differently from smaller arrays. What is the reason for this?
I am on OSX (Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz) running numpy 1.12.1.
Your test loop isn't accomplishing exactly what you expect. Since only one array exists in memory at a time, it's quite possible - indeed LIKELY - that new ones will be allocated at the same memory address as the one just freed. You'd have to do something like append the arrays to a list (thus making them all exist in memory simultaneously) to actually test 10000 distinct allocations.
However, I can easily believe that you're seeing a real effect, as it's perfectly reasonable for a memory allocator to use different strategies based on the size of the block being allocated. For example, at some point the allocator may stop trying to use memory it already has and start requesting entire memory pages directly from the operating system. Once that threshold is reached, you'd find that everything is aligned on a much higher power-of-2 boundary than 64 - perhaps 4096. You seem to be hitting some intermediate threshold at 1024 bytes (including overhead); it might be interesting to test for 128/256/512/1024-byte alignment.
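For instance, a small probe along these lines (my own sketch) keeps every array alive and reports the largest power-of-2 boundary each data pointer happens to sit on:

import numpy as np

def max_alignment(address, limit=8192):
    # largest power-of-2 boundary (up to limit) that the address sits on
    align = 1
    while align < limit and address % (align * 2) == 0:
        align *= 2
    return align

keep = []  # keep the arrays alive so their addresses are not reused
for n in (64, 128, 252, 253, 512, 4096, 10**6):
    a = np.ones(n).astype(np.float32)
    keep.append(a)
    addr = a.__array_interface__['data'][0]
    print(n, max_alignment(addr))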
Here is my guess: using aligned memory typically involves allocating a larger block and then releasing the bytes at the front that fall before the alignment boundary.
This is insignificant for large arrays, but for small arrays the fragmentation and overhead introduced likely outweigh the benefits.
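To illustrate the "allocate a larger block, then skip to the boundary" idea, here is a sketch of the technique this answer describes (not numpy's actual internal code):

import numpy as np

def aligned_empty(n, dtype=np.float32, alignment=64):
    # Over-allocate by `alignment` bytes, then skip to the next aligned address.
    itemsize = np.dtype(dtype).itemsize
    buf = np.empty(n * itemsize + alignment, dtype=np.uint8)
    start = buf.__array_interface__['data'][0]
    offset = (-start) % alignment          # bytes wasted at the front
    return buf[offset:offset + n * itemsize].view(dtype)

a = aligned_empty(1000)
assert a.__array_interface__['data'][0] % 64 == 0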

Is there a way of determining how much GPU memory is in use by TensorFlow?

Tensorflow tends to preallocate the entire available memory on its GPUs. For debugging, is there a way of telling how much of that memory is actually in use?
(1) There is some limited support with Timeline for logging memory allocations. Here is an example of its usage:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
summary, _ = sess.run([merged, train_step],
                      feed_dict=feed_dict(True),
                      options=run_options,
                      run_metadata=run_metadata)
train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
train_writer.add_summary(summary, i)
print('Adding run metadata for', i)

tl = timeline.Timeline(run_metadata.step_stats)
print(tl.generate_chrome_trace_format(show_memory=True))
trace_file = tf.gfile.Open(name='timeline', mode='w')
trace_file.write(tl.generate_chrome_trace_format(show_memory=True))
You can give this code a try with the MNIST example (mnist with summaries).
This will generate a tracing file named timeline, which you can open with chrome://tracing. Note that this only gives approximate GPU memory usage statistics. It basically simulates a GPU execution, but doesn't have access to the full graph metadata. It also can't know how many variables have been assigned to the GPU.
(2) For a very coarse measure of GPU memory usage, nvidia-smi will show the total device memory usage at the time you run the command.
nvprof can show the on-chip shared memory usage and register usage at the CUDA kernel level, but doesn't show the global/device memory usage.
Here is an example command: nvprof --print-gpu-trace matrixMul
And more details here:
http://docs.nvidia.com/cuda/profiler-users-guide/#abstract
Here's a practical solution that worked well for me:
Disable GPU memory pre-allocation using TF session configuration:
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
Run nvidia-smi -l (or some other utility) to monitor GPU memory consumption.
Step through your code with the debugger until you see the unexpected GPU memory consumption.
There's some code in tensorflow.contrib.memory_stats that will help with this:
from tensorflow.contrib.memory_stats.python.ops.memory_stats_ops import BytesInUse
with tf.device('/device:GPU:0'):  # Replace with device you are interested in
    bytes_in_use = BytesInUse()

with tf.Session() as sess:
    print(sess.run(bytes_in_use))
The TensorFlow profiler has an improved memory timeline that is based on real GPU memory allocator information:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/profiler#visualize-time-and-memory
tf.config.experimental.get_memory_info('GPU:0')
Currently returns the following keys:
'current': The current memory used by the device, in bytes.
'peak': The peak memory used by the device across the run of the program, in bytes.
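A minimal usage sketch (TF 2.x; assumes a single visible GPU exposed as 'GPU:0'):

import tensorflow as tf

if tf.config.list_physical_devices('GPU'):
    x = tf.random.normal((4096, 4096))
    y = tf.matmul(x, x)
    info = tf.config.experimental.get_memory_info('GPU:0')
    print(f"current: {info['current'] / 2**20:.1f} MiB, "
          f"peak: {info['peak'] / 2**20:.1f} MiB")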
As @V.M previously mentioned, a solution that works well is using: tf.config.experimental.get_memory_info('DEVICE_NAME')
This function returns a dictionary with two keys:
'current': The current memory used by the device, in bytes
'peak': The peak memory used by the device across the run of the program, in bytes.
The value of these keys is the ACTUAL memory used, not the allocated memory that is returned by nvidia-smi.
In reality, for GPUs, TensorFlow will allocate all the memory by default, which makes using nvidia-smi to check the memory used by your code useless. Even if tf.config.experimental.set_memory_growth is set to true, TensorFlow will no longer allocate the whole available memory, but it will still allocate more memory than is actually used, and in discrete steps, i.e. it allocates 4589 MiB, then 8717 MiB, then 16943 MiB, then 30651 MiB, etc.
A small note concerning get_memory_info() is that it doesn't return correct values if used inside a tf.function()-decorated function. Thus, the 'peak' key should be checked after executing a tf.function()-decorated function to determine the peak memory used.
For older versions of Tensorflow, tf.config.experimental.get_memory_usage('DEVICE_NAME') was the only available function, and it only returned the used memory (with no option for determining the peak memory).
Final note, you can also consider the Tensorflow Profiler available with Tensorboard, as @Peter mentioned.
Hope this helps :)

Memory mapped file for numpy arrays

I need to read in parts of a huge numpy array stored in a memory mapped file, process the data and repeat for another part of the array. The whole numpy array takes up around 50 GB and my machine has 8 GB of RAM.
I initially created the memory mapped file using numpy.memmap by reading in a lot of smaller files and processing their data and then writing the processed data to the memmap file. During the creation of the memmap file, I had no memory issues (I was using memmap.flush() periodically). Here's how I create the memory mapped file:
mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))
for i1 in np.arange(numFiles):
    auxData = load_data_from(file[i1])
    mmapData[i1,:] = auxData
    mmapData.flush()  # Do this every 10 iterations or so
However, when I try to access small portions (<10 MB) of the memmap file, it floods my whole RAM when the memmap object is created. The machine slows down drastically and I can't do anything. Here's how I try to read in the data from the memory mapped file:
mmapData = np.memmap(mmapFile, mode='r', shape=(large_no1, large_no2))
aux1 = mmapData[5, 1:10000000]
I thought using mmap or numpy.memmap should allow me to access parts of massive arrays without trying to load the whole thing to memory. What am I missing?
Am I using the wrong tool to access parts of a large numpy array (> 20 GB) stored on disk?
Could it be that you're looking at virtual, rather than physical memory consumption, and the slowdown is coming from something else?
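One way to check that (a sketch, assuming psutil is available; the file name, dtype and shape are placeholders for the question's memmap):

import os
import numpy as np
import psutil

proc = psutil.Process(os.getpid())

def report(label):
    # Compare resident (physical) vs virtual memory of this process.
    mi = proc.memory_info()
    print(f"{label}: rss={mi.rss / 2**20:.0f} MiB, vms={mi.vms / 2**20:.0f} MiB")

report("before")
mmapData = np.memmap('mmapFile.dat', mode='r', dtype='float64',
                     shape=(6250, 1000000))   # placeholder: ~50 GB file
report("after opening memmap")   # vms jumps by roughly the file size, rss should not
aux1 = np.array(mmapData[5, :])  # copy one row (~8 MB) into RAM
report("after reading a slice")  # rss grows only by about the slice size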