Memory mapped file for numpy arrays

I need to read in parts of a huge numpy array stored in a memory mapped file, process the data and repeat for another part of the array. The whole numpy array takes up around 50 GB and my machine has 8 GB of RAM.
I initially created the memory mapped file using numpy.memmap by reading in a lot of smaller files and processing their data and then writing the processed data to the memmap file. During the creation of the memmap file, I had no memory issues (I was using memmap.flush() periodically). Here's how I create the memory mapped file:
mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))
for i1 in np.arange(numFiles):
    auxData = load_data_from(file[i1])
    mmapData[i1, :] = auxData
    mmapData.flush()  # Do this every 10 iterations or so
However, when I try to access small portions (<10 MB) of the memmap file, it floods my whole ram when the memmap object is created. The machine slows down drastically and I can't do anything. Here's how I try to read in the data from the memory mapped file:
mmapData = np.memmap(mmapFile, mode='r', shape=(large_no1, large_no2))
aux1 = mmapData[5, 1:10000000]
I thought using mmap or numpy.memmap would allow me to access parts of massive arrays without trying to load the whole thing into memory. What am I missing?
Am I using the wrong tool to access parts of a large numpy array (> 20 GB) stored on disk?

Could it be that you're looking at virtual, rather than physical memory consumption, and the slowdown is coming from something else?
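One way to check that (psutil and the reporting helper here are my own additions, not from the post) is to watch resident vs. virtual memory around the memmap access; opening the memmap should mainly inflate virtual memory, while resident memory should grow only by the pages actually touched:
import numpy as np
import psutil

proc = psutil.Process()

def report(label):
    mem = proc.memory_info()
    print('{}: rss={:.2f} GB, vms={:.2f} GB'.format(label, mem.rss / 1e9, mem.vms / 1e9))

report('before open')
mmapData = np.memmap(mmapFile, mode='r', shape=(large_no1, large_no2))  # as in the question
report('after open')    # vms jumps by roughly the mapped size; rss should not
aux1 = np.array(mmapData[5, 1:10000000])  # force a read of a small slice
report('after slice')   # rss should grow only by the few MB actually read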

Dask-Rapids data movement and out-of-memory issue

I am using dask (2021.3.0) and rapids (0.18) in my project. I perform a preprocessing task on the CPU, and the preprocessed data is then transferred to the GPU for K-means clustering. In this process, I am getting the following problem:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(The error occurs before GPU memory is fully used, i.e. it is not using GPU memory completely.)
I have a single GPU with 40 GB of memory and 512 GB of RAM.
I am using the following snippet of code:
from dask.distributed import Client, LocalCluster  # assumed imports, not shown in the original snippet
import cupy as cp
from cuml.dask.cluster import KMeans                # assumed: the dask/RAPIDS KMeans

cluster = LocalCluster(n_workers=1, threads_per_worker=1)
cluster.scale(100)
client = Client(cluster)  # assumed: a client attached to the cluster (not shown in the original snippet)

# perform my preprocessing on data and get output in variable A
# convert the A variable to cupy
x = A.map_blocks(cp.asarray)
km = KMeans(n_clusters=4)
predict = km.fit_predict(x).compute()
I am also looking for a solution so that the data larger than GPU memory can be preprocessed, and whenever there is a spill in GPU memory the spilled data is transferred into temp directory or CPU (as we do with dask where we define temp directory when there is a spill in RAM).
Any help will be appreciated.
There are several ways to work with datasets larger than GPU memory.
Check out Nick Becker's blog, which has a few methods well documented.
Check out BlazingSQL, which is built on top of RAPIDS and can perform out-of-core processing. You can try it at beta.blazingsql.com.
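Not covered in the answer above, but since the question asks about spilling: dask-cuda's LocalCUDACluster can spill device memory to host (and host memory to disk) once it crosses configurable limits. A minimal sketch, with placeholder limits you would tune to your 40 GB GPU / 512 GB RAM setup:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    n_workers=1,
    device_memory_limit='30GB',          # spill GPU memory to host above this
    memory_limit='400GB',                # spill host memory to disk above this
    local_directory='/tmp/dask-spill',   # where spilled data lands on disk
)
client = Client(cluster)

# build the dask/cupy array x as in the question, then:
# x = A.map_blocks(cp.asarray)
# km = KMeans(n_clusters=4)
# predict = km.fit_predict(x).compute()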

How to use CUDA pinned "zero-copy" memory for a memory mapped file?

Objective/Problem
In Python, I am looking for a fast way to read/write data from a memory mapped file to a GPU.
In a previous SO post [ Cupy OutOfMemoryError when trying to cupy.load larger dimension .npy files in memory map mode, but np.load works fine ], it is mentioned that this is possible using CUDA pinned "zero-copy" memory. Furthermore, it seems that this method was developed by this person [ cuda - Zero-copy memory, memory-mapped file ], though that person was working in C++.
My previous attempts have been with Cupy, but I am open to any cuda methods.
What I have tried so far
I mentioned how I tried to use Cupy, which allows you to open numpy files in memory-mapped mode.
import os
import numpy as np
import cupy

# Create .npy files.
for i in range(4):
    numpyMemmap = np.memmap('reg.memmap'+str(i), dtype='float32', mode='w+', shape=(2200000, 512))
    np.save('reg.memmap'+str(i), numpyMemmap)
    del numpyMemmap
    os.remove('reg.memmap'+str(i))

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    NPYmemmap.append(np.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))
del NPYmemmap

# Eventually results in memory error.
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append(cupy.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))
Result of what I have tried
My attempt resulted in an OutOfMemoryError.
It was mentioned that
it appears that cupy.load will require that the entire file fit first in host memory, then in device memory.
And it was also mentioned that
CuPy can't handle mmap memory. So, CuPy uses GPU memory directly in default.
https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.MemoryPool.html#cupy.cuda.MemoryPool.malloc
You can change default memory allocator if you want to use Unified Memory.
I tried using
cupy.cuda.set_allocator(cupy.cuda.MemoryPool(cupy.cuda.memory.malloc_managed).malloc)
But this didn't seem to make a difference. At the time of the error, my CPU RAM was at ~16 GB, but my GPU RAM was at 0.32 GB. I am using Google Colab, where my CPU RAM is 25 GB and GPU RAM is 12 GB. So it looks like after the entire file was hosted in host memory, it checked whether it could fit in device memory, and when it saw that it only had 12 of the required 16 GB, it threw an error (my best guess).
So, now I am trying to figure out a way to use pinned 'zero-copy' memory to handle a memory mapped file which would feed data to the GPU.
If it matters, the type of data I am trying to transfer is floating-point arrays. Normally, for read-only data, binary files are loaded into GPU memory, but I am working with data that I need to both read and write at every step.
It appears to me that currently, cupy doesn't offer a pinned allocator that can be used in place of the usual device memory allocator, i.e. could be used as the backing for cupy.ndarray. If this is important to you, you might consider filing a cupy issue.
However, it seems like it may be possible to create one. This should be considered experimental code. And there are some issues associated with its use.
The basic idea is that we will replace cupy's default device memory allocator with our own, using cupy.cuda.set_allocator as was already suggested to you. We will need to provide our own replacement for the BaseMemory class that is used as the repository for cupy.cuda.memory.MemoryPointer. The key difference here is that we will use a pinned memory allocator instead of a device allocator. This is the gist of the PMemory class below.
A few other things to be aware of:
after doing what you need with pinned memory (allocations) you should probably revert the cupy allocator to its default value. Unfortunately, unlike cupy.cuda.set_allocator, I did not find a corresponding cupy.cuda.get_allocator, which strikes me as a deficiency in cupy, something that also seems worthy of filing a cupy issue to me. However for this demonstration we will just revert to the None choice, which uses one of the default device memory allocators (not the pool allocator, however).
by providing this minimalistic pinned memory allocator, we are still suggesting to cupy that this is ordinary device memory. That means it's not directly accessible from the host code (it is, actually, but cupy doesn't know that). Therefore, various operations (such as cupy.load) will create unneeded host allocations, and unneeded copy operations. I think to address this would require much more than just this small change I am suggesting. But at least for your test case, this additional overhead may be manageable. It appears that you want to load data from disk once, and then leave it there. For that type of activity, this should be manageable, especially since you are breaking it up into chunks. As we will see, handling four 5GB chunks will be too much for 25GB of host memory. We will need host memory allocation for the four 5GB chunks (which are actually pinned) and we will also need additional space for one additional 5GB "overhead" buffer. So 25GB is not enough for that. But for demonstration purposes, if we reduce your buffer sizes to 4GB (5x4GB = 20GB) I think it may fit within your 25GB host RAM size.
Ordinary device memory associated with cupy's default device memory allocator has an association with a particular device. Pinned memory need not have such an association; however, our trivial replacement of BaseMemory with a lookalike class means that we are suggesting to cupy that this "device" memory, like all other ordinary device memory, has a specific device association. In a single-device setting such as yours, this distinction is meaningless. However, this isn't suitable for robust multi-device use of pinned memory. For that, again the suggestion would be a more robust change to cupy, perhaps by filing an issue.
Here's an example:
import os
import numpy as np
import cupy

class PMemory(cupy.cuda.memory.BaseMemory):
    def __init__(self, size):
        self.size = size
        self.device_id = cupy.cuda.device.get_device_id()
        self.ptr = 0
        if size > 0:
            self.ptr = cupy.cuda.runtime.hostAlloc(size, 0)
    def __del__(self):
        if self.ptr:
            cupy.cuda.runtime.freeHost(self.ptr)

def my_pinned_allocator(bsize):
    return cupy.cuda.memory.MemoryPointer(PMemory(bsize), 0)

cupy.cuda.set_allocator(my_pinned_allocator)

# Create 4 .npy files, ~4GB each.
for i in range(4):
    print(i)
    numpyMemmap = np.memmap('reg.memmap'+str(i), dtype='float32', mode='w+', shape=(10000000, 100))
    np.save('reg.memmap'+str(i), numpyMemmap)
    del numpyMemmap
    os.remove('reg.memmap'+str(i))

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    print(i)
    NPYmemmap.append(np.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))
del NPYmemmap

# Allocate pinned memory storage.
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append(cupy.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+'))

cupy.cuda.set_allocator(None)
I haven't tested this in a setup with 25GB of host memory with these file sizes. But I have tested it with other file sizes that exceed the device memory of my GPU, and it seems to work.
Again, this is experimental code, not thoroughly tested; your mileage may vary, and it would be better to attain this functionality by filing cupy github issues. And, as I've mentioned previously, this sort of "device memory" will generally be much slower to access from device code than ordinary cupy device memory.
Finally, this is not really a "memory mapped file" as all the file contents will be loaded into host memory, and furthermore, this methodology "uses up" host memory. If you have 20GB of files to access, you will need more than 20GB of host memory. As long as you have those files "loaded", 20GB of host memory will be in use.
UPDATE: cupy provides support for pinned allocators now, see here. This answer should only be used for historical reference.
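For reference, one commonly used piece of CuPy's pinned-memory API (which may or may not be the exact feature the update refers to) lets you back a numpy array with page-locked host memory, which speeds up host/device copies; a minimal sketch:
import numpy as np
import cupy as cp

def pinned_empty(shape, dtype=np.float32):
    # allocate page-locked (pinned) host memory and wrap it in a numpy array
    size = int(np.prod(shape))
    mem = cp.cuda.alloc_pinned_memory(size * np.dtype(dtype).itemsize)
    return np.frombuffer(mem, dtype=dtype, count=size).reshape(shape)

host = pinned_empty((1000, 512))
host[...] = np.random.rand(1000, 512).astype(np.float32)
device = cp.asarray(host)    # pinned host -> device copy
back = cp.asnumpy(device)    # device -> host copy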

Extremely high memory usage with pyarrow reading gzipped parquet files

I have a (set of) gzipped parquet files with about 210 columns, of which I am loading about 100 columns into a pandas dataframe. It works fine and very fast when the file size is about 1 MB (with about 50 rows); the python3 process consumes < 500 MB of RAM. However, when the file is > 1.5 MB (70+ rows), it starts consuming 9-10 GB of RAM without ever loading the dataframe. If I specify just 2-3 columns, it is able to load them from the "big" file (still consuming that kind of RAM), but anything beyond that seems impossible. All columns are text.
I am currently using pandas.read_parquet, but I have also tried pyarrow.read_table with the same results.
Any ideas what could be going on? I just don't understand why loading that amount of data should blow up RAM like that and become unusable. My objective is to load the parquet data into a database, so if there are better ways to do it, that would be great to know as well.
The code is below; it's just a simple usage of pandas.read_parquet.
import pandas as pd
df = pd.read_parquet(bytesIO_from_file, columns=[...])
There was a memory usage issue in pyarrow 0.14 that has been resolved: https://issues.apache.org/jira/browse/ARROW-6060
The upcoming 0.15 release will have this fix, as well as a bunch of other optimizations in Parquet reading. If you're curious to try it now, see the docs for installing the development version.
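Not part of the answer above, but if upgrading isn't an option and the file contains more than one row group, one way to bound memory is to read a row group at a time; a sketch (the path and column names are placeholders):
import pyarrow.parquet as pq

pf = pq.ParquetFile('data.parquet')       # placeholder path
wanted_columns = ['col_a', 'col_b']       # placeholder: the ~100 columns you need

for i in range(pf.num_row_groups):
    table = pf.read_row_group(i, columns=wanted_columns)
    chunk = table.to_pandas()
    # write `chunk` to the database, then let it go out of scope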

Is there an optimal number of elements for a tfrecords file?

This is a follow-up to these SO questions:
What is the need to do sharding of TFRecords files?
optimal size of a tfrecord file
and this passage from this tutorial:
For this small dataset we will just create one TFRecords file for the training-set and another for the test-set. But if your dataset is very large then you can split it into several TFRecords files called shards. This will also improve the random shuffling, because the Dataset API only shuffles from a smaller buffer of e.g. 1024 elements loaded into RAM. So if you have e.g. 100 TFRecords files, then the randomization will be much better than for a single TFRecords file.
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/18_TFRecords_Dataset_API.ipynb
So there is an optimal file size, but I am wondering whether there's also an optimal number of elements, since it's the elements themselves that are distributed to the GPU's cores.
Are you trying to optimize:
1. initial data randomization?
2. data randomization across training batches and/or epochs?
3. training/validation throughput (i.e., gpu utilization)?
Initial data randomization should be handled when data are initially saved into sharded files. This can be challenging, assuming you can't read the data into memory. One approach is to read all the unique data ids into memory, shuffle those, do your train/validate/test split, and then write your actual data to file shards in that randomized order. Now your data are initially shuffled/split/sharded.
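A minimal sketch of that write-side shuffling (the load_all_ids and make_example helpers are hypothetical stand-ins for your own data loading):
import random
import tensorflow as tf

ids = load_all_ids()                    # hypothetical: returns the list of unique data ids
random.shuffle(ids)
train_ids = ids[:int(0.8 * len(ids))]   # simple 80/20 split for illustration

num_shards = 100
writers = [tf.io.TFRecordWriter('train/d{:03d}.tfr'.format(s)) for s in range(num_shards)]
for n, item_id in enumerate(train_ids):
    example = make_example(item_id)     # hypothetical: builds a tf.train.Example
    writers[n % num_shards].write(example.SerializeToString())
for w in writers:
    w.close()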
Initial data randomization will make it easier to maintain randomization during training. However, I'd still say it is 'best practice' to re-shuffle file names and re-shuffle a data memory buffer as part of the train/validate data streams. Typically, you'll set up an input stream using multiple threads/processes. The first step is to randomize the file input streams by re-shuffling the filenames. This can be done like:
train_files = tf.data.Dataset.list_files('{}/d*.tfr'.format(train_dir),
                                         shuffle=True)
Now, if your initial data write was already randomized, you 'could' read the entire data from one file, before going to the next, but that would still impact re-randomization throughout the training process, so typically you interleave file reads, reading a certain number of records from each file. This also improves throughput, assuming you are using multiple file read processes (which you should do, to maximize gpu throughput).
blocksize = 1000  # samples read from one file before switching files
train_data = train_files.interleave(interleaveFiles,
                                    block_length=blocksize,
                                    num_parallel_calls=tf.data.experimental.AUTOTUNE)
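The interleaveFiles function isn't defined in the answer; a plausible sketch, assuming each shard is a plain TFRecord file, is simply:
import tensorflow as tf

def interleaveFiles(filename):
    # map one filename to a dataset of the raw serialized records in that file
    return tf.data.TFRecordDataset(filename)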
Here, we're reading 1000 samples from each file, before going on to the next. Again, to re-shuffle the training data each epoch (which may or may not be critical), we re-shuffle the data in memory, setting a memory buffer based on what's available on the machine and how large our data items are (note - before formatting the data for gpu).
buffersize = 1000000  # samples read before shuffling in memory
train_data = train_data.shuffle(buffersize,
                                reshuffle_each_iteration=True)
train_data = train_data.repeat()
The repeat() call is just to allow the data set to 'wrap around' during training. This may or may not be important, depending on how you set up your training process.
To optimize throughput, you can do 2 things:
1. Alter the order of operations in the data input stream. Typically, if you put your randomization operations early, they can operate on 'low weight' entities, like file names, rather than on tensors.
2. Use pre-fetching to let your cpu processes stream data during gpu calculations.
train_data = train_data.map(mapData,
                            num_parallel_calls=tf.data.experimental.AUTOTUNE)
train_data = train_data.padded_batch(batchsize)
train_data = train_data.prefetch(10)
So, mapping and batching happens last (this is usually preferred for maximizing gpu throughput, but it can depend on other factors, like data size (pre and post-tensorizing), and how computationally expensive your map function is).
Finally, you can tune the prefetch size to maximize gpu throughput, constrained by system memory and memory speed.
So, how does this all impact the 'optimal' number of data items in each sharded file?
Obviously, if your data/file size is > your blocksize, blocksize becomes irrelevant, and you might as well read each file completely. Typically, if you are going to use this paradigm, you want blocksize << data/file. I use 10x; so if my blocksize is 1000, I have ~10,000 data items in the file. This may not be optimal, but so far I can maintain >90% gpu usage using this approach on my specific hardware. If you want to tune for your hardware, you could start somewhere around 10x and adjust, based on whatever you are specifically trying to optimize.
If you have very large numbers of files, you may run into problems maintaining good file read streams, but on a modern system you should be able to get to 100,000 files or more and still be fine. Moving large numbers of files around can be difficult, but usually easier than having very small numbers of very big files, so there are some (broad) constraints on file sizes that can impact how many data items/file you end up with. Generally speaking, I'd say having on the order of 100s of files would be ideal for a large dataset. That way you can easily stream files across a network efficiently (again, that will depend on your network). If the data set is small, you'll have 10s to 50s of files, which is fine for streaming, depending on file size (I typically try to hit 100-300MB/file, which works well for moving things around a LAN or WAN).
So, I think file-size and number-of-files places much stronger constraints on your process than number of data items/file, so long as you have an appropriate number of data items/file, given your file read blocksize. Again, you could hyper-shard your files (1 data item/file?), and read entire files into memory, without using file blocking. That might work, and it would certainly be lightweight to shuffle file names, rather than data items. But you might also end up with millions of files!
To really optimize, you'll need to set up an end-to-end training system on a particular machine, and then tweak it to see what works best for your particular data, network, and hardware. So long as your data are effectively randomized and your data files are easy to store/use/share, you just want to optimize gpu throughput. I would be surprised if reordering the data input stream and pre-fetching doesn't get you there.

Why are large numpy arrays 64-byte aligned but not smaller ones

The following code:
import numpy as np

prev = []
addresses = []
for i in range(10000):
    a = np.ones(x).astype(np.float32)
    prev.append(a)
    address = a.__array_interface__['data'][0]
    assert address % 64 == 0
    assert address not in addresses
    addresses.append(address)
will not raise an AssertionError for values of x > 252, suggesting that arrays bigger than 253 elements (or bigger than 505 when using float16) are aligned differently from smaller arrays. What is the reason for this?
I am on OSX (Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz) running numpy 1.12.1.
Your test loop isn't accomplishing exactly what you expect. Since only one array exists in memory at a time, it's quite possible - indeed LIKELY - that new ones will be allocated at the same memory address as the one just freed. You'd have to do something like append the arrays to a list (thus making them all exist in memory simultaneously) to actually test 10000 distinct allocations.
However, I can easily believe that you're seeing a real effect, as it's perfectly reasonable for a memory allocator to use different strategies based on the size of the block being allocated. For example, at some point the allocator may stop trying to use memory it already has, and start requesting entire memory pages directly from the operating system. Once that threshold is reached, you'd find that everything is aligned on a much higher power-of-2 boundary than 64 - perhaps 4096. You seem to be hitting some intermediate threshold at 1024 bytes (including overhead); it might be interesting to test for 128/256/512/1024 byte alignment.
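A quick way to probe those thresholds (my sketch, not from the answer) is to keep many arrays of a given size alive and record the largest power-of-2 boundary they all land on:
import numpy as np

def max_alignment(x, n=1000):
    arrays = [np.ones(x).astype(np.float32) for _ in range(n)]   # keep them all alive
    addresses = [a.__array_interface__['data'][0] for a in arrays]
    for boundary in (4096, 2048, 1024, 512, 256, 128, 64, 32, 16):
        if all(addr % boundary == 0 for addr in addresses):
            return boundary
    return None

for x in (16, 128, 252, 253, 1024, 100000):
    print(x, max_alignment(x))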
Here is my guess: using aligned memory typically involves allocating a larger block and then releasing the upfront bytes that are allocated before the alignment boundary.
This is insignificant for large arrays, but for small arrays the fragmentation and overhead introduced likely outweigh the benefits.
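To make the idea concrete (an illustration of the technique, not numpy's actual internals): over-allocate by the alignment amount and start the array at the next 64-byte boundary, wasting the bytes before it:
import numpy as np

def aligned_empty(n, dtype=np.float32, alignment=64):
    itemsize = np.dtype(dtype).itemsize
    buf = np.empty(n * itemsize + alignment, dtype=np.uint8)   # larger raw block
    start = buf.__array_interface__['data'][0]
    offset = (-start) % alignment                              # bytes to skip before the boundary
    return buf[offset:offset + n * itemsize].view(dtype)

a = aligned_empty(100)
assert a.__array_interface__['data'][0] % 64 == 0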