Numpy memmap throttles with Pytorch Dataloader when available RAM less than file size - numpy

I'm working on a dataset that is too big to fit into RAM. The solution I'm trying currently is to use numpy memmap to load one sample/row at a time using Dataloader. The solution looks something like this:
class MMDataset(torch.utils.data.Dataset):
def __init__(self, path):
self.file_path = path
self.dataset_len = 44000000
self.bytes_per_value = 32/8
self.num_cols = 512
self.num_rows = 1
def __getitem__(self, index):
x = np.memmap(self.file_path, dtype='float32', mode='r', shape=(
self.num_rows, self.num_cols), offset=int(index*self.num_cols*self.bytes_per_value))
return np.array(x)
def __len__(self):
return self.dataset_len
dataset = MMDataset('./data/emb.memmap')
data_loader = DataLoader(
dataset,
batch_size=4096,
shuffle=True,
num_workers=20
)
When the amount of RAM available is greater than the size of the memmap file, the data loading is fast. I get around 60 batches/second. However, when the RAM available is less than the size of the memmap file, I get around 3 batches/second.
I discovered this when trying various sizes for the memmap file.
Why is this the case? If Dataloader + memmap is going to throttle when available RAM < memmap file size, this defeats the point of the solution.
I've observed that disk i/o is at 500MB/s read constantly when available RAM < memmap file size. This is much higher than the theoretical amount of reading required to load a batch of 4096 samples (closer to 8MB/s).

Related

How to load a large h5 file in memory?

I have a large h5 file with 5-dimensional numpy array in HDFS. File size is ~130Gb. I am facing memory issues while loading the file with process gets killed with OOM Error even though machine has 256Gb RAM. How can I write the file in chunks and load back in chunks? I looked around and found that h5py provides method to chunk the dataset like so but how do I load back the data in chunks? Also will it work if the file resides in HDFS?
dset = f.create_dataset("Images2", (100,480,640), 'f', chunks=True)
Idea is to load the file in batches for less I/O time as well as memory issues. Any help would be much appreciated.
Two similar (but different) h5py I/O concepts are mentioned in the answer and comments above:
HDF5 Chunking is used to enable chunked I/O for improved performance. Chunking may not help if you get an OOM error when you try to read a large dataset with insufficient memory.
NumPy style Slicing is used to read a slice of the data from the drive to memory (or write a slice of data to the drive). Slicing is the key to avoid OOM errors when reading very large files.
Also, when creating very large datasets, you generally need to make
it resizeable. You can allocate an initial size, then use the ".resize()" method to increase the size on disk.
I wrote a simple example that shows how to use both slicing and chunking. It loads 100 images at a time into a resizeable dataset. It then closes the file and reopens (read-only) to read 100 images at a time into a NumPy array.
Effective chunking requires appropriate size/shape and is based on your array shape and I/O needs. I set the chunk size/shape in my example to match the size of 100 image array I was writing/reading.
This example should get you started. You will need to modify to use a 5-d array/dataset.
import numpy as np
import h5py
with h5py.File('SO_64645940.h5','w') as h5w:
img_ds = h5w.create_dataset('Images', shape=(100,480,640), dtype='f', maxshape=(None,480,640),chunks=(10,480,640))
next_img_row = 0
arr = np.random.random(100*480*640).reshape(100,480,640)
for cnt in range(1,10):
# print(cnt,img_ds.len(),next_img_row)
if img_ds.len() == next_img_row :
img_ds.resize(100*cnt,axis=0)
print('new ds size=',img_ds.len())
h5w['Images'][next_img_row:next_img_row+100] = arr
next_img_row += 100
with h5py.File('SO_64645940.h5','r') as h5r:
for cnt in range(10):
print('get slice#',str(cnt))
img_arr = h5r['Images'][cnt*100:(cnt+1)*100]
Chunking in HDF5 means that the data is not stored contigous, but in chunks.
See information here: https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage
--> So this doesn't help you with your problem.
The solution might be that you build a function yourself to load the data chunkwise.
I made it for example this way for getting the data chunked:
def get_chunked(data, chunk_size=100):
for i in give_chunk(len(data), chunk_size):
chunked_array = data[i]
yield chunked_array
def give_chunk(length, chunk_size):
it = iter(range(length))
while True:
chunk = list(itertools.islice(it, chunk_size))
if not chunk:
break
yield chunk
For writing the data to HDF5 you can create the dataset first and then write the data chunk wise with slicing, see h5py documentation: https://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data
I really can recommend this book for basic knowledge about HDF5: https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/

Speed up generation of USE(universal sentence encoder) embeddings

I am working on a semantic similarity problem using universal sentence encoder. The dataset contains abstracts of scholarly articles. The mean length is around 1500. There are ~300k records in data and it will take quite long to generate USE embedding for all of them. I am looking for ways to optimize this. Currently, generating embedding for 10k rows of data took ~15 mins.
from tqdm import tqdm
use_module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(use_module_url)
print ("module %s loaded" % use_module_url)
def embed(input):
return model(input)
def get_features(texts):
if type(texts) is str:
texts = [texts]
return embed(texts)
def data_iterator(data):
chunk_list = []
for x in tqdm(range(0, len(data), 1000)):
if x+1000 > len(data):
chunk_list.append(data[x:len(data)])
else:
chunk_list.append(data[x:x+1000])
return chunk_list
data = df['text'][:10000].values
data_processed = list(map(process_text, data))
Here, I want to speed up the generation of USE embeddings for my data. I am experimenting in kaggle kernel and have turned on the GPU. The GPU utilization doesn`t go beyond 2-3% & CPU utilization was ~120%
%%time
BASE_VECTORS = []
chunk_list = data_iterator(data_processed)
for i in tqdm(chunk_list):
BASE_VECTORS_tmp = get_features(i)
BASE_VECTORS.extend(BASE_VECTORS_tmp)
BASE_VECTORS = np.asarray(BASE_VECTORS)
Time taken
CPU times: user 16min 48s, sys: 2min 59s, total: 19min 47s
Wall time: 15min 13s
Probably you had not installed GPU version of tensorflow, or had some CUDNN version mismatch. Normally USE uses GPU a lot.

How to save numpy array images and put them into a single folder?

I have a numpy array containing 5000 28 by 28 images (5000,28,28). I want to save all these images as jpg files and save them all in a single folder. What is the fastest and most efficient way to accomplish this?
I tried writing 50,000 images of 28x28 to disk as JPEG using:
sequential code (25seconds)
multi-threaded code (19 seconds)
multi-processing code (5 seconds)
on a 12-core MacBook Pro with SSDs. Times are given in parentheses after each item in list above.
#!/usr/bin/env python3
import numpy as np
import imageio
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool
def WriteOne(i,data):
"""Write a single image to disk"""
#print(f"WriteOne called with i={i}")
imageio.imwrite(f"image-{i:04d}.jpg", data)
def WriteSequential(images):
"""Write all images sequentially to disk"""
print("Sequential")
for i, data in enumerate(images):
WriteOne(i, data)
return
def WriteIndices(p):
"""Write given index to disk as JPEG"""
#print(f"WriteIndices {p[0]}")
WriteOne(p[0], p[1])
return
def WriteMultiThread(images):
"""Write all images to disk with multi-threading"""
print("MultiThread")
nThreads = 8
with ThreadPool(nThreads) as pool:
pool.map(WriteIndices, list(enumerate(images)))
return
def WriteMultiProcess(images):
"""Write all images to disk with multi-processing"""
print("MultiProcess")
nProcesses = 8
with Pool(nProcesses) as pool:
pool.map(WriteIndices, list(enumerate(images)))
return
if __name__ == '__main__':
# Synthesize 50000 images, each 28x28
N = 50000
images = np.random.randint(0,256,(N,28,28), dtype=np.uint8)
WriteSequential(images)
WriteMultiThread(images)
WriteMultiProcess(images)

Multiprocessing with large HDF5 files

I have 5 HDF5 files that are 22 GB each. Each HDF5 file is a series of 4801 images that are 1920 by 1200 in size. I need to load the same frame number from each HDF5 file, get rid of some rogue pixels, average the stack of 5 images, and write a new HDF5 file with one processed image at each frame number. Because I can't load all 5 HDF5 files in at once without running out of RAM, I am only loading in chunks of images from each HDF5 file, putting 5 images for each frame number into a queue, processing the stack, and writing the resulting image to an HDF5 file. Right now I am using h5py to perform any reading/writing of HDF5 files.
I would like to know what the most computationally effective way is of working on chunked data? Right now, I am dedicating one processor to be the writer, then looping through some chunk size of data for which I create a number of consumers, put the data in a queue, wait for the consumers to be finished, then rinse and repeat until all of the images are processed. This means that every time the loop advances, it creates new consumer processes - I imagine there is some overhead in this. A sample of the code is below.
#!/usr/bin/env python
import time
import os
from multiprocessing import Process, Queue, JoinableQueue, cpu_count
import glob
import h5py
import numpy as np
'''Function definitions'''
# The consumer function takes data off of the Queue
def consumer(inqueue,output):
# Run indefinitely
while True:
# If the queue is empty, queue.get() will block until the queue has data
all_data = inqueue.get()
if all_data:
#n is the index corresponding to the projection location
n, image_data = all_data
#replace zingers with median and average stack
#Find the median for each pixel of the prefiltered image
med = np.median(image_data,axis=0)
#Loop through the image set
for j in range(image_data.shape[0]):
replicate = image_data[j,...]
mask = replicate - med > zinger_level
replicate[mask] = med[mask] # Substitute with median
image_data[j,...] = replicate # Put data back in place
out = np.mean(image_data,axis=0,dtype=np.float32).astype(np.uint16)
output.put((n,out))
else:
break
#function for writing out HDF5 file
def write_hdf(output,output_filename):
#create output HDF5 file
while True:
args = output.get()
if args:
i,data = args
with h5py.File(output_filename,'a') as fout:
fout['Prefiltered_images'][i,...] = data
else:
break
def fprocess_hdf_stack(hdf_filenames,output_filename):
file_list = []
for fname in hdf_filenames:
file_list.append(h5py.File(fname,'r'))
#process chunks of data so that we don't run out of memory
totsize = h5py.File(hdf_filenames[0],'r')['exchange']['data'].shape[0]
data_shape = h5py.File(hdf_filenames[0],'r')['exchange']['data'].shape
fout.create_dataset('Prefiltered_images',data_shape,dtype=np.uint16)
fout.close()
ints = range(totsize)
chunkSize= 100
#initialize how many consumers we would like working
num_consumers = cpu_count()*2
#Create the Queue objects
inqueue = JoinableQueue()
output = Queue()
#start process for writing HDF5 file
proc = Process(target=write_hdf, args=(output,output_filename))
proc.start()
print("Loading %i images into memory..."%chunkSize)
for i in range(0,totsize,chunkSize):
time0 = time.time()
chunk = ints[i:i+chunkSize]
data_list = []
#Make a list of the HDF5 datasets we are reading in
for files in file_list:
#shape is (angles, rows, columns)
data_list.append(files['exchange']['data'][chunk,...])
data_list = np.asarray(data_list)
print("Elapsed time to load images %i-%i is %0.2f minutes." %(chunk[0],chunk[-1],(time.time() - time0)/60))
consumers = []
#Create consumer processes
for i in range(num_consumers):
p = Process(target=consumer, args=(inqueue,output))
consumers.append(p)
p.start()
for n in range(data_list.shape[1]):
#Feed data into the queue
inqueue.put((chunk[n],data_list[:,n,...]))
#Kill all of the processes when everything is finished
for i in range(num_consumers):
inqueue.put(None)
for c in consumers:
c.join()
print("Elapsed time to process images %i-%i is %0.2f minutes." %(chunk[0],chunk[-1],(time.time() - time0)/60))
time.sleep(1)
output.put(None)
proc.join()
#Close the input HDF5 files.
for hdf_file in file_list:
hdf_file.close()
print("Input HDF5 files closed.")
return
if __name__ == '__main__':
start_time = time.time()
raw_images_filenames = glob.glob(raw_images_dir + raw_images_basename)
tempname = os.path.basename(raw_images_filenames[0]).split('.')[0]
tempname_split = tempname.split('_')[:-1]
output_filename = output_dir+'_'.join(tempname_split) + '_Prefiltered.hdf5'
fprocess_hdf_stack(raw_images_filenames,output_filename)
print("Elapsed time is %0.2f minutes" %((time.time() - start_time)/60))
I don't think my bottleneck is actually in the loading of the images. It is in initializing the consumers and carrying out the processing on the 5 images per each frame number. I've played around with taking the consumer function out of the for loop, but I don't know how to put a memory cap on this so that I don't run out of RAM. Thanks!

Reading/Writing a large symmetric matrix in Numpy

I have a large (bigger than half of my RAM) n-by-n symmetric matrix S. I want to write it to disk, using only ~n^2/2 space and be able to read it later. The writing part is:
S[np.tril_indices(n)].tofile(fid)
and the reading:
S = np.zeros((n,n))
S[np.tril_indices(n)]=np.fromfile(fid)
S = S + np.tril(S, -1).T
The problem is that all the temporary arrays i am creating do not fit in memory
In this case a plain Python loop seems to be the fastest and simplest solution. Also faster than the vectorized approach when that does fit into memory, at least in my testing on Numpy 1.8.2 and a not so fast hard drive as storage medium. So you could consider:
with open('test.dat', 'wb') as fid:
for i in range(n):
S[i,i:].tofile(fid)
And:
S = np.empty((n, n))
with open('test.dat', 'rb') as fid:
for i in range(n):
data = np.fromfile(fid, count=n-i)
S[i,i:] = data
S[i:,i] = data