How to merge very large numpy arrays? - numpy

I will have many Numpy arrays stored in npz files, which are being saved using savez_compressed function.
I am splitting the information in many arrays because, if not, the functions I am using crash due to memory issues. The data is not sparse.
I will need to joint all that info in one unique array (to be able to process it with some routines), and store it into disk (to process it many times with diffente parameters).
Arrays won't fit into RAM+swap memory.
How to merge them into an unique array and save it to a disk?
I suspect that I should use mmap_mode, but I do not realize exactly how. Also, I imagine that can be some performance issues if I do not reserve contiguous disk space at first.
I have read this post but I still cannot realize how to do it.
EDIT
Clarification: I have made many functions to process similar data, some of them require an array as argument. In some cases I could pass them only part of this large array by using slicing. But it is still important to have all the info. in such an array.
This is because of the following: The arrays contain information (from physical simulations) time ordered. Among the argument of the functions, the user can set the initial and last time to process. Also, he/she can set the size of the processing chunk (which is important because this affect to the performance but allowed chunk size depend on the computational resources). Because of this, I cannot store the data as separated chunks.
The way in which this particular array (the one I am trying to create) is built is not important while it works.

You should be able to load chunk by chunk on a np.memap array:
import numpy as np
data_files = ['file1.npz', 'file2.npz2', ...]
# If you do not know the final size beforehand you need to
# go through the chunks once first to check their sizes
rows = 0
cols = None
dtype = None
for data_file in data_files:
with np.load(data_file) as data:
chunk = data['array']
rows += chunk.shape[0]
cols = chunk.shape[1]
dtype = chunk.dtype
# Once the size is know create memmap and write chunks
merged = np.memmap('merged.buffer', dtype=dtype, mode='w+', shape=(rows, cols))
idx = 0
for data_file in data_files:
with np.load(data_file) as data:
chunk = data['array']
merged[idx:idx + len(chunk)] = chunk
idx += len(chunk)
However, as pointed out in the comments working across a dimension which is not the fastest one will be very slow.

This would be an example how to write a 90GB of easily compressible data to disk. The most important points are mentioned here https://stackoverflow.com/a/48405220/4045774
The write/read speed should be in the range of (300 MB/s,500MB/s) on a nomal HDD.
Example
import numpy as np
import tables #register blosc
import h5py as h5
import h5py_cache as h5c
import time
def read_the_arrays():
#Easily compressable data
#A lot smaller than your actual array, I do not have that much RAM
return np.arange(10*int(15E3)).reshape(10,int(15E3))
def writing(hdf5_path):
# As we are writing whole chunks here this isn't realy needed,
# if you forget to set a large enough chunk-cache-size when not writing or reading
# whole chunks, the performance will be extremely bad. (chunks can only be read or written as a whole)
f = h5c.File(hdf5_path, 'w',chunk_cache_mem_size=1024**2*1000) #1000 MB cache size
dset = f.create_dataset("your_data", shape=(int(15E5),int(15E3)),dtype=np.float32,chunks=(10000,100),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
#Lets write to the dataset
for i in range(0,int(15E5),10):
dset[i:i+10,:]=read_the_arrays()
f.close()
def reading(hdf5_path):
f = h5c.File(hdf5_path, 'r',chunk_cache_mem_size=1024**2*1000) #1000 MB cache size
dset = f["your_data"]
#Read chunks
for i in range(0,int(15E3),10):
data=np.copy(dset[:,i:i+10])
f.close()
hdf5_path='Test.h5'
t1=time.time()
writing(hdf5_path)
print(time.time()-t1)
t1=time.time()
reading(hdf5_path)
print(time.time()-t1)

Related

How to load a large h5 file in memory?

I have a large h5 file with 5-dimensional numpy array in HDFS. File size is ~130Gb. I am facing memory issues while loading the file with process gets killed with OOM Error even though machine has 256Gb RAM. How can I write the file in chunks and load back in chunks? I looked around and found that h5py provides method to chunk the dataset like so but how do I load back the data in chunks? Also will it work if the file resides in HDFS?
dset = f.create_dataset("Images2", (100,480,640), 'f', chunks=True)
Idea is to load the file in batches for less I/O time as well as memory issues. Any help would be much appreciated.
Two similar (but different) h5py I/O concepts are mentioned in the answer and comments above:
HDF5 Chunking is used to enable chunked I/O for improved performance. Chunking may not help if you get an OOM error when you try to read a large dataset with insufficient memory.
NumPy style Slicing is used to read a slice of the data from the drive to memory (or write a slice of data to the drive). Slicing is the key to avoid OOM errors when reading very large files.
Also, when creating very large datasets, you generally need to make
it resizeable. You can allocate an initial size, then use the ".resize()" method to increase the size on disk.
I wrote a simple example that shows how to use both slicing and chunking. It loads 100 images at a time into a resizeable dataset. It then closes the file and reopens (read-only) to read 100 images at a time into a NumPy array.
Effective chunking requires appropriate size/shape and is based on your array shape and I/O needs. I set the chunk size/shape in my example to match the size of 100 image array I was writing/reading.
This example should get you started. You will need to modify to use a 5-d array/dataset.
import numpy as np
import h5py
with h5py.File('SO_64645940.h5','w') as h5w:
img_ds = h5w.create_dataset('Images', shape=(100,480,640), dtype='f', maxshape=(None,480,640),chunks=(10,480,640))
next_img_row = 0
arr = np.random.random(100*480*640).reshape(100,480,640)
for cnt in range(1,10):
# print(cnt,img_ds.len(),next_img_row)
if img_ds.len() == next_img_row :
img_ds.resize(100*cnt,axis=0)
print('new ds size=',img_ds.len())
h5w['Images'][next_img_row:next_img_row+100] = arr
next_img_row += 100
with h5py.File('SO_64645940.h5','r') as h5r:
for cnt in range(10):
print('get slice#',str(cnt))
img_arr = h5r['Images'][cnt*100:(cnt+1)*100]
Chunking in HDF5 means that the data is not stored contigous, but in chunks.
See information here: https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage
--> So this doesn't help you with your problem.
The solution might be that you build a function yourself to load the data chunkwise.
I made it for example this way for getting the data chunked:
def get_chunked(data, chunk_size=100):
for i in give_chunk(len(data), chunk_size):
chunked_array = data[i]
yield chunked_array
def give_chunk(length, chunk_size):
it = iter(range(length))
while True:
chunk = list(itertools.islice(it, chunk_size))
if not chunk:
break
yield chunk
For writing the data to HDF5 you can create the dataset first and then write the data chunk wise with slicing, see h5py documentation: https://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data
I really can recommend this book for basic knowledge about HDF5: https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/

Why is Pandas df.to_hdf("a_file", "a_key") output increases in size when executed multiple times

Pandas has a method .to_hdf() to save a dataframe as a HDF table.
However each time the command .to_hdf(path, key) is run, the size of the file increases.
import os
import string
import pandas as pd
import numpy as np
size = 10**4
df = pd.DataFrame({"C":np.random.randint(0,100,size),
"D": np.random.choice(list(string.ascii_lowercase), size = size)})
for iteration in range(4):
df.to_hdf("a_file.h5","key1")
print(os.path.getsize("a_file.h5"))
And the output clearly shows that the size of the file is increasing:
# 1240552
# 1262856
# 1285160
# 1307464
As a new df is saved each time, the hdf size should be constant.
As the increase seems quite modest for small df, with larger df it fastly leads to hdf files that are significantly bigger than the size of the file on the first save.
Sizes I get with a 10**7 long dataframe after 7 iterations:
29MB, 48MB, 67MB, 86MB, 105MB, 125MB, 144MB
Why is it so that the hdf file size is not constant and increase a each new to_hdf()?
This behavior is not really documented if you look in a fast manner at the documentation (which is 2973 pdf pages long). But can be found in #1643, and in the warning in IO Tools section/delete from a table section of the documentation:
If you do not specify anything, the default writing mode is 'a'which is the case of a simple df.to_hdf('a_path.h5','a_key') will nearly double the size of your hdf file each time you run your script.
Solution is to use the write mode: df.to_hdf('a_path.h5','a_key', mode = 'w')
However, this behavior will happen only with the fixed format (which is the default format) but not with the table format (except if append is set to True).

Feeding .npy (numpy files) into tensorflow data pipeline

Tensorflow seems to lack a reader for ".npy" files.
How can I read my data files into the new tensorflow.data.Dataset pipline?
My data doesn't fit in memory.
Each object is saved in a separate ".npy" file. each file contains 2 different ndarrays as features and a scalar as their label.
It is actually possible to read directly NPY files with TensorFlow instead of TFRecords. The key pieces are tf.data.FixedLengthRecordDataset and tf.io.decode_raw, along with a look at the documentation of the NPY format. For simplicity, let's suppose that a float32 NPY file containing an array with shape (N, K) is given, and you know the number of features K beforehand, as well as the fact that it is a float32 array. An NPY file is just a binary file with a small header and followed by the raw array data (object arrays are different, but we're considering numbers now). In short, you can find the size of this header with a function like this:
def npy_header_offset(npy_path):
with open(str(npy_path), 'rb') as f:
if f.read(6) != b'\x93NUMPY':
raise ValueError('Invalid NPY file.')
version_major, version_minor = f.read(2)
if version_major == 1:
header_len_size = 2
elif version_major == 2:
header_len_size = 4
else:
raise ValueError('Unknown NPY file version {}.{}.'.format(version_major, version_minor))
header_len = sum(b << (8 * i) for i, b in enumerate(f.read(header_len_size)))
header = f.read(header_len)
if not header.endswith(b'\n'):
raise ValueError('Invalid NPY file.')
return f.tell()
With this you can create a dataset like this:
import tensorflow as tf
npy_file = 'my_file.npy'
num_features = ...
dtype = tf.float32
header_offset = npy_header_offset(npy_file)
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size, header_bytes=header_offset)
Each element of this dataset contains a long string of bytes representing a single example. You can now decode it to obtain an actual array:
dataset = dataset.map(lambda s: tf.io.decode_raw(s, dtype))
The elements will have indeterminate shape, though, because TensorFlow does not keep track of the length of the strings. You can just enforce the shape since you know the number of features:
dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, dtype), (num_features,)))
Similarly, you can choose to perform this step after batching, or combine it in whatever way you feel like.
The limitation is that you had to know the number of features in advance. It is possible to extract it from the NumPy header, though, just a bit of a pain, and in any case very hardly from within TensorFlow, so the file names would need to be known in advance. Another limitation is that, as it is, the solution requires you to either use only one file per dataset or files that have the same header size, although if you know that all the arrays have the same size that should actually be the case.
Admittedly, if one considers this kind of approach it may just be better to have a pure binary file without headers, and either hard code the number of features or read them from a different source...
You can do it with tf.py_func, see the example here.
The parse function would simply decode the filename from bytes to string and call np.load.
Update: something like this:
def read_npy_file(item):
data = np.load(item.decode())
return data.astype(np.float32)
file_list = ['/foo/bar.npy', '/foo/baz.npy']
dataset = tf.data.Dataset.from_tensor_slices(file_list)
dataset = dataset.map(
lambda item: tuple(tf.py_func(read_npy_file, [item], [tf.float32,])))
Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:
Consuming NumPy arrays
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().
# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
features = data["features"]
labels = data["labels"]
# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
In the case that the file doesn't fit into memory, it seems like the only recommended approach is to first convert the npy data into a TFRecord format, and then use the TFRecord data set format, which can be streamed without fully loading into memory.
Here is a post with some instructions.
FWIW, it seems crazy to me that TFRecord cannot be instantiated with a directory name or file name(s) of npy files directly, but it appears to be a limitation of plain Tensorflow.
If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that would yield only the data needed for the current batch.
In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with, and preferably you should reformat the data first, either as TFRecord or as multiple npy files, and then use other methods.
Problem setup
I had a folder with images that were being fed into an InceptionV3 model for extraction of features. This seemed to be a huge bottleneck for the entire process. As a workaround, I extracted features from each image and then stored them on disk in a .npy format.
Now I had two folders, one for the images and one for the corresponding .npy files. There was an evident problem with the loading of .npy files in the tf.data.Dataset pipeline.
Workaround
I came across TensorFlow's official tutorial on show attend and tell which had a great workaround for the problem this thread (and I) were having.
Load numpy files
First off we need to create a mapping function that accepts the .npy file name and returns the numpy array.
# Load the numpy files
def map_func(feature_path):
feature = np.load(feature_path)
return feature
Use the tf.numpy_function
With the tf.numpy_function we can wrap any python function and use it as a TensorFlow op. The function must accept numpy object (which is exactly what we want).
We create a tf.data.Dataset with the list of all the .npy filenames.
dataset = tf.data.Dataset.from_tensor_slices(feature_paths)
We then use the map function of the tf.data.Dataset API to do the rest of our task.
# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item: tf.numpy_function(
map_func, [item], tf.float16),
num_parallel_calls=tf.data.AUTOTUNE)

how to create hdf5 dataset with early allocate and no fill using h5py

I am trying to create a 78TB HDF5 dataset by filling it in a 2d block-partition manner. This is very slow when the block I'm writing spans rows that haven't ever been written to, because HDF5 is going in and allocating the diskspace and filling in the missing entries with zero.
Instead, I would like h5py to allocate the disk space for my dataset as soon as its created, and never fill it. This is possible with the C api according to Table 16 in the HDF5 Dataset documentation, but how can I do this with h5py, preferably with the high level interface?
I believe that you want to set the fill time to "never", with the H5Pset_fill_time() routine, but I don't know the h5py way to do that.
As Quincey suggested. You can use the low-level H5py API to create the dataset with the FILL_TIME_NEVER property then convert it back to a high-level Dataset object:
# create the rows dataset using the low-level api so I can force it to not do zero-filling, then convert to a high level object
spaceid = h5py.h5s.create_simple((numRows, numCols))
plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
plist.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
plist.set_chunk((rowchunk, colchunk))
datasetid = h5py.h5d.create(fout.id, "rows", h5py.h5t.NATIVE_DOUBLE, spaceid, plist)
rows = h5py.Dataset(datasetid)
Try specifying a chunk shape that matches your write pattern. For example if you are writing in blocks of 1024x1024, it would look like this:
import h5py
import numpy as np
f = h5py.File('mybigdset.h5', 'w')
dset = f.create_dataset('dset', (78*1024*1024, 1024*1024), dtype='f4', chunks=(1024,1024))
arr = np.random.rand(1024,1024)
dset[0:1024, 0:1024] = arr
f.close()
Thankfully, this didn't use 78TB of disk, the file size was just 4MB.

Reading/Writing a large symmetric matrix in Numpy

I have a large (bigger than half of my RAM) n-by-n symmetric matrix S. I want to write it to disk, using only ~n^2/2 space and be able to read it later. The writing part is:
S[np.tril_indices(n)].tofile(fid)
and the reading:
S = np.zeros((n,n))
S[np.tril_indices(n)]=np.fromfile(fid)
S = S + np.tril(S, -1).T
The problem is that all the temporary arrays i am creating do not fit in memory
In this case a plain Python loop seems to be the fastest and simplest solution. Also faster than the vectorized approach when that does fit into memory, at least in my testing on Numpy 1.8.2 and a not so fast hard drive as storage medium. So you could consider:
with open('test.dat', 'wb') as fid:
for i in range(n):
S[i,i:].tofile(fid)
And:
S = np.empty((n, n))
with open('test.dat', 'rb') as fid:
for i in range(n):
data = np.fromfile(fid, count=n-i)
S[i,i:] = data
S[i:,i] = data