Randomly read elements from an .h5 file without loading the whole matrix - tensorflow

I have a gigantic training data set that doesn't fit in RAM. I am trying to load random batches of images into a stack without loading the whole .h5 file. My approach is to create a list of indexes and shuffle it instead of shuffling the whole .h5 file.
Let's say:
import numpy as np
import h5py

a = np.arange(2000*2000*2000).reshape(2000, 2000, 2000)
idx = np.random.randint(2000, size=800)  # so that I only need to shuffle this idx at the end of each epoch

# create this huge dataset, 32 GB > my RAM
with h5py.File('./tmp.h5', 'w') as f:
    tmp = f.create_dataset('a', (2000, 2000, 2000))
    tmp[:] = a

# read it
with h5py.File('./tmp.h5', 'r') as f:
    tensor = f['a'][:][idx]  # without [:] there is an error; with it, the whole file is loaded, which I don't want
Does somebody have a solution?

Thanks to #max9111, here's how I propose to solve it:
batch_size = 100
idx = np.arange(2000)
# shuffle in place (np.random.shuffle returns None, so don't reassign)
np.random.shuffle(idx)
Due to the following constraint of h5py:
Selection coordinates must be given in increasing order
one should sort the batch indices before reading:
for step in range(epoch_len // batch_size):
    try:
        with h5py.File(path, 'r') as f:
            # h5py requires the fancy-index coordinates to be sorted
            batch_idx = np.sort(idx[step * batch_size : (step + 1) * batch_size])
            return f['img'][batch_idx], f['label'][batch_idx]
    except Exception:
        raise RuntimeError('epoch finished and drop the remainder')
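
As a usage sketch, the same sorted-index reads can be wrapped in a Python generator and handed to tf.data.Dataset.from_generator (output_signature needs TF >= 2.4), so TensorFlow only ever sees one batch at a time; the shapes and dtypes below are assumptions about the 'img' and 'label' datasets, not part of the code above:

import h5py
import numpy as np
import tensorflow as tf

def h5_batches(path, batch_size, epoch_len):
    idx = np.arange(epoch_len)
    np.random.shuffle(idx)  # shuffle the indices, not the file
    with h5py.File(path, 'r') as f:
        for step in range(epoch_len // batch_size):
            # sorted indices, as required by h5py fancy indexing
            batch_idx = np.sort(idx[step * batch_size:(step + 1) * batch_size])
            yield f['img'][batch_idx], f['label'][batch_idx]

# assumed shapes/dtypes; adjust to your data
ds = tf.data.Dataset.from_generator(
    lambda: h5_batches('./tmp.h5', batch_size=100, epoch_len=2000),
    output_signature=(tf.TensorSpec(shape=(None, 2000, 2000), dtype=tf.float32),
                      tf.TensorSpec(shape=(None,), dtype=tf.int64)))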

Related

Incremental PCA on a big dataset with a large number of requested components

I am trying to find the first 200 principal components of a dataset of 846 images (2048x2048x3 RGB) with sklearn.decomposition.IncrementalPCA.
The data are read with cv2 and reshaped into a 2D numpy array of size [846, 2048*2048*3], dtype float16.
To keep memory usage low, I use partial_fit() and split the original data into smaller chunks (batches) in both the partial_fit() and transform() steps,
just like the solution to this question:
Python PCA on Matrix too large to fit into memory
My code works well for relatively small computations, such as computing 20 components for 200 images of the dataset, and it produces correct results.
However, the task requires me to compute 200 components, which means my batch size must be greater than or equal to 200 (according to sklearn's documentation and the messages printed in the terminal while running the code):
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html
With such a big chunk size I can finish setting up the IPCA model, but I always hit a MemoryError during partial_fit().
A further problem is that
I need to use inverse_transform later, and I am not sure whether that step can also be computed chunk by chunk. (In the code below I do not do so.)
What can I do to avoid this MemoryError? Or should I replace IncrementalPCA with some other method? (Any alternative should offer something like inverse_transform().)
All the memory I can access is 131661572 kB (~127 GB).
My code:
from sklearn.decomposition import PCA, IncrementalPCA
import numpy as np
import cv2
import os

folder_path = "./output_img"

input = []
for i in range(1, 847):
    if i % 10 == 0: print("loading", i, "th image")
    # if i == 60: continue  # special case, should be skipped
    image_path = folder_path + f"/{i}neutral.jpg"
    img = cv2.imread(image_path)
    input.append(img.reshape(-1))
print("Loaded all", i, "images")

# change into a numpy matrix
all_image = np.stack(input, axis=0)
# cast to float16 to save memory
all_image = all_image.astype(np.float16)
### shape: #_of_images x image_pixel_num (50331648 for the img_normals case)
# print(all_image)
# print(all_image.shape)

# PCA, keep 200 components
COM_NUM = 200
pca = IncrementalPCA(n_components=COM_NUM)
print("finished IPCA model set")
saving_path = "./principle847"

element_num = all_image.shape[0]  # how many elements (rows) we have in the dataset
chunk_size = 220                  # how many elements we feed to IPCA at a time

for i in range(0, element_num // chunk_size):
    pca.partial_fit(all_image[i*chunk_size : (i+1)*chunk_size])
    print("finished PCA fit:", i*chunk_size, "to", (i+1)*chunk_size)
pca.partial_fit(all_image[(i+1)*chunk_size : element_num])  # tail
print("finished PCA fit:", (i+1)*chunk_size, "to", element_num)

for i in range(0, element_num // chunk_size):
    if i == 0:
        result = pca.transform(all_image[i*chunk_size : (i+1)*chunk_size])
    else:
        tmp = pca.transform(all_image[i*chunk_size : (i+1)*chunk_size])
        result = np.concatenate((result, tmp), axis=0)
    print("finished PCA transform:", i*chunk_size, "to", (i+1)*chunk_size)
tmp = pca.transform(all_image[(i+1)*chunk_size : element_num])  # tail
result = np.concatenate((result, tmp), axis=0)
print("finished PCA transform:", (i+1)*chunk_size, "to", element_num)

result = pca.inverse_transform(result)

print("PCA mean:", pca.mean_)
mean_img = pca.mean_
mean_img = mean_img.reshape(2048, 2048, 3)
mean_img = mean_img.astype(np.uint8)
cv2.imwrite(os.path.join(saving_path, "mean.png"), mean_img)

result = result.reshape(-1, 2048, 2048, 3)
# result shape: #_of_images x 2048 x 2048 x 3
dst = result
# dst = result / np.linalg.norm(result, axis=(3), keepdims=True)
for j in range(0, COM_NUM):
    reconImage = dst[j]
    # reconImage = reconImage.reshape(4096, 4096, 3)
    reconImage = np.clip(reconImage, 0, 255)
    reconImage = reconImage.astype(np.uint8)
    cv2.imwrite(os.path.join(saving_path, "p" + str(j) + ".png"), reconImage)
    print("Saved", j+1, "principle imgs")
The error looks like this:
File "model_generate.py", line 36, in <module>
pca.partial_fit(all_image[i*chunk_size : (i+1)*chunk_size])
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/sklearn/decomposition/_incremental_pca.py", line 299, in partial_fit
U, V = svd_flip(U, V, u_based_decision=False)
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/sklearn/utils/extmath.py", line 538, in svd_flip
max_abs_rows = np.argmax(np.abs(v), axis=1)
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 1103, in argmax
return _wrapfunc(a, 'argmax', axis=axis, out=out)
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 56, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
MemoryError
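
On the inverse_transform part of the question: the reconstruction is applied independently per row (x_hat = z @ components_ + mean_), so it can be computed chunk by chunk instead of materialising the full 846 x 2048*2048*3 matrix at once. A minimal sketch, reusing pca, result, chunk_size and saving_path from the code above; the output file names are made up:

# reconstruct and save one chunk of rows at a time instead of calling
# inverse_transform on the whole matrix
for start in range(0, result.shape[0], chunk_size):
    recon_chunk = pca.inverse_transform(result[start:start + chunk_size])
    for j, row in enumerate(recon_chunk):
        img = np.clip(row.reshape(2048, 2048, 3), 0, 255).astype(np.uint8)
        cv2.imwrite(os.path.join(saving_path, "recon%d.png" % (start + j)), img)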

Sampling for the large class and augmentation for the small class in each batch

Let's say we have two classes: one is small and the other is large.
I would like to use data augmentation similar to ImageDataGenerator for the small class, and sampling for the large class, so that each batch is balanced (augmentation for the minor class, sampling for the major class).
Also, I would like to continue using image_dataset_from_directory (since the dataset doesn't fit into RAM).
What about the
sample_from_datasets
function?
import tensorflow as tf
from tensorflow.python.data.experimental import sample_from_datasets

def augment(val):
    # Example of an augmentation function
    return val - tf.random.uniform(shape=tf.shape(val), maxval=0.1)

big_dataset_size = 1000
small_dataset_size = 10

# Init some datasets
dataset_class_large_positive = tf.data.Dataset.from_tensor_slices(
    tf.range(100, 100 + big_dataset_size, dtype=tf.float32))
dataset_class_small_negative = tf.data.Dataset.from_tensor_slices(
    -tf.range(1, 1 + small_dataset_size, dtype=tf.float32))

# Upsample and augment the small dataset
dataset_class_small_negative = dataset_class_small_negative \
    .repeat(big_dataset_size // small_dataset_size) \
    .map(augment)

dataset = sample_from_datasets(
    datasets=[dataset_class_large_positive, dataset_class_small_negative],
    weights=[0.5, 0.5]
)
dataset = dataset.shuffle(100)
dataset = dataset.batch(6)

iterator = dataset.as_numpy_iterator()
for i in range(5):
    print(next(iterator))

# [109. -10.044552 136. 140. -1.0505208 -5.0829906]
# [122. 108. 141. -4.0211563 126. 116. ]
# [ -4.085523 111. -7.0003924 -7.027302 -8.0362625 -4.0226436]
# [ -9.039093 118. -1.0695585 110. 128. -5.0553837]
# [100. -2.004463 -9.032592 -8.041705 127. 149. ]
Set up the desired balance between the classes in the weights parameter of sample_from_datasets.
As Yaoshiang noticed, the last batches are imbalanced and the dataset lengths differ. This can be avoided by
# Repeat infinitely both datasets and augment the small one
dataset_class_large_positive = dataset_class_large_positive.repeat()
dataset_class_small_negative = dataset_class_small_negative.repeat().map(augment)
instead of
# Upsample and augment the small dataset
dataset_class_small_negative = dataset_class_small_negative \
    .repeat(big_dataset_size // small_dataset_size) \
    .map(augment)
In this case, however, the dataset is infinite and the number of batches per epoch has to be controlled separately.
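
For example, the epoch length can be bounded by taking a fixed number of batches from the infinite stream, or by telling Keras how many steps make up an epoch; a minimal sketch, where batches_per_epoch is a made-up value:

batches_per_epoch = 100  # hypothetical epoch length

# either cap the stream itself ...
epoch_dataset = dataset.take(batches_per_epoch)

# ... or keep it infinite and bound the epoch in fit()
# model.fit(dataset, steps_per_epoch=batches_per_epoch, epochs=10)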
You can use tf.data.Dataset.from_generator, which allows more control over your data generation without loading all your data into RAM.
def generator():
    i = 0
    while True:
        if i % 2 == 0:
            elem = large_class_sample()
        else:
            elem = small_class_augmented()
        yield elem
        i = i + 1

ds = tf.data.Dataset.from_generator(
    generator,
    output_signature=(
        tf.TensorSpec(shape=yourElem_shape, dtype=yourElem_type)))
This generator will alternate samples between the two classes, and you can add more dataset operations (batch, shuffle, ...).
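
For instance, the usual tail of the pipeline could be added like this (a sketch with illustrative sizes, not values from the answer):

ds = ds.shuffle(buffer_size=256)    # illustrative buffer size
ds = ds.batch(32)                   # illustrative batch size
ds = ds.prefetch(tf.data.AUTOTUNE)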
I didn't totally follow the problem. Would this pseudo-code work? Perhaps there are operators on tf.data.Dataset that are sufficient to solve your problem.
ds = image_dataset_from_directory(...)
ds1 = ds.filter(lambda image, label: label == MAJORITY)
ds2 = ds.filter(lambda image, label: label != MAJORITY)

ds2 = ds2.map(lambda image, label: (data_augment(image), label))
ds1 = ds1.batch(int(10. / MAJORITY_RATIO))
ds2 = ds2.batch(int(10. / MINORITY_RATIO))

ds3 = tf.data.Dataset.zip((ds1, ds2))
ds3 = ds3.map(lambda left, right: tf.concat([left, right], axis=0))
You can use tf.data.Dataset.from_tensor_slices to load the images of the two categories separately and do data augmentation for the minority class. Now that you have two datasets, combine them with tf.data.experimental.sample_from_datasets.
from glob import glob
import os
import tensorflow as tf

# assume class1 is the minority class
files_class1 = glob('class1\\*.jpg')
files_class2 = glob('class2\\*.jpg')

def augment(filepath):
    class_name = tf.strings.split(filepath, os.sep)[0]
    image = tf.io.read_file(filepath)
    image = tf.expand_dims(image, 0)
    if tf.equal(class_name, 'class1'):
        # do all the data augmentation
        image_flip = tf.image.flip_left_right(image)
        return [[image, class_name], [image_flip, class_name]]

# apply data augmentation for class1
train_class1 = tf.data.Dataset.from_tensor_slices(files_class1) \
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
train_class2 = tf.data.Dataset.from_tensor_slices(files_class2)

dataset = tf.data.experimental.sample_from_datasets(
    datasets=[train_class1, train_class2],
    weights=[0.5, 0.5])
dataset = dataset.batch(BATCH_SIZE)

Capacity of queue in tf.data.Dataset

I have a problem with TensorFlow's new input pipeline mechanism. When I create a data pipeline with tf.data.Dataset that decodes jpeg images and then loads them into a queue, it tries to load as many images as it can into the queue. If the throughput of loading images is greater than the throughput of the images processed by my model, the memory usage increases without bound.
Below is the code snippet for building the pipeline with tf.data.Dataset:
def _imread(file_name, label):
    _raw = tf.read_file(file_name)
    _decoded = tf.image.decode_jpeg(_raw, channels=hps.im_ch)
    _resized = tf.image.resize_images(_decoded, [hps.im_width, hps.im_height])
    _scaled = (_resized / 127.5) - 1.0
    return _scaled, label

n_samples = image_files.shape.as_list()[0]
dset = tf.data.Dataset.from_tensor_slices((image_files, labels))
dset = dset.shuffle(n_samples, None)
dset = dset.repeat(hps.n_epochs)
dset = dset.map(_imread, hps.batch_size * 32)
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(hps.batch_size * 2)
Here image_files is a constant tensor that contains the filenames of 30k images. Images are resized to 256x256x3 in _imread.
If I build a pipeline with the following snippet instead:
# refer to "https://www.tensorflow.org/programmers_guide/datasets"
def _imread(file_name, hps):
    _raw = tf.read_file(file_name)
    _decoded = tf.image.decode_jpeg(_raw, channels=hps.im_ch)
    _resized = tf.image.resize_images(_decoded, [hps.im_width, hps.im_height])
    _scaled = (_resized / 127.5) - 1.0
    return _scaled

n_samples = image_files.shape.as_list()[0]
image_file, label = tf.train.slice_input_producer(
    [image_files, labels],
    num_epochs=hps.n_epochs,
    shuffle=True,
    seed=None,
    capacity=n_samples,
)
# Decode the image.
image = _imread(image_file, hps)
images, labels = tf.train.shuffle_batch(
    tensors=[image, label],
    batch_size=hps.batch_size,
    capacity=hps.batch_size * 64,
    min_after_dequeue=hps.batch_size * 8,
    num_threads=32,
    seed=None,
    enqueue_many=False,
    allow_smaller_final_batch=True
)
then memory usage is almost constant throughout training. How can I make tf.data.Dataset load a fixed number of samples? Is the pipeline I create with tf.data.Dataset correct? I think the buffer_size argument of tf.data.Dataset.shuffle applies to image_files and labels, so storing 30k strings shouldn't be a problem, right? Even if all 30k images were loaded, it would require 30000*256*256*3*8/(1024*1024*1024) = 43 GB of memory, yet the pipeline uses 59 GB of the 61 GB of system memory.
This will buffer n_samples elements, which looks to be your entire dataset. You might want to cut down on the buffering here:
dset = dset.shuffle(n_samples, None)
You might as well just repeat forever; repeat won't buffer (Does `tf.data.Dataset.repeat()` buffer the entire dataset in memory?):
dset = dset.repeat()
You are batching and then prefetching hps.batch_size * 2 batches. Ouch!
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(hps.batch_size * 2)
Let's say hps.batch_size = 1000 to make it concrete. The first line above creates batches of 1000 images each. The second line then prefetches 2000 of those batches, buffering a grand total of 2,000,000 images. Oops!
You meant to do:
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(2)
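
Putting these points together, the question's pipeline would look roughly like this (a sketch reusing _imread and hps from the question; the shuffle buffer of 10000 and the 32 parallel map calls are illustrative values, not taken from the answer):

dset = tf.data.Dataset.from_tensor_slices((image_files, labels))
dset = dset.shuffle(10000)                       # bounded shuffle buffer instead of n_samples
dset = dset.repeat()                             # repeat forever; count steps per epoch yourself
dset = dset.map(_imread, num_parallel_calls=32)  # parallel decode without a huge buffer
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(2)                          # prefetch 2 batches, not batch_size * 2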

TensorFlow eval in between two queues

My goal is as follows:
1). Use tf.train.string_input_producer and tf.TextLineReader to read lines from files.
2). Convert the resulting tensors containing the files' lines into ordinary strings using eval to do preprocessing before batching (TensorFlow's limited string operations are insufficient for my purposes)
3). Convert these preprocessed strings back to tensors (presumably using tf.constant ?)
4). Use tf.train.batch on the resulting tensors.
The following code is a simplified version of what I'm working on.
The "After batch" print statement gets executed, the REPL hangs on the print statement with the final eval.
From what I've read, I have a feeling this is because
threads = tf.train.start_queue_runners(coord = coord, sess = sess)
needs to be run after calling tf.train.batch. But if I do this, then the REPL will of course hang on the first eval
evalue = value.eval(session = sess)
needed to do the preprocessing.
What is the best way to convert back and forth between tensors and their values in between queues? (I'm really hoping I can do this without preprocessing my data files beforehand.)
import tensorflow as tf
import os

def process(string):
    return string.upper()

def main():
    sess = tf.Session()

    filenames = tf.constant(["test_data/" + f for f in os.listdir("./test_data")])
    filename_queue = tf.train.string_input_producer(filenames)
    file_reader = tf.TextLineReader()
    key, value = file_reader.read(filename_queue)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord, sess=sess)

    evalue = value.eval(session=sess)
    proc_value = process(evalue)
    tensor_value = tf.constant(proc_value)

    batch = tf.train.batch([tensor_value], batch_size=2, capacity=2)
    print("After batch.")
    print(batch.eval(session=sess))
We discussed a slightly different approach, which I think achieves what you need here:
Converting TensorFlow tutorial to work with my own data
Not sure what file formats you are reading, but the above example reads CSVs row-by-row and packs them into randomized batches.
If you are reading from a CSV then, in a nutshell, I think what you might want to do is: instead of returning value from file_reader.read(filename_queue) immediately, do some pre-processing first and return THAT instead, something like this:
rDefaults = [['a'] for row in range(ROW_LENGTH)]

_, value = reader.read(filename_queue)
whole_row = tf.decode_csv(value, record_defaults=rDefaults)
cell1 = tf.slice(whole_row, [0], [1])  # one specific cell that contains a string
cell2 = tf.slice(whole_row, [1], [2])  # another cell that contains a string
# do some processing on cell1 and cell2
return cell1, cell2
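
As a usage sketch (not part of the answer above), the preprocessed cells can then feed tf.train.batch directly, so nothing has to be eval()'d before batching; it assumes the sess from the question and made-up batch_size/capacity values:

# cell1 and cell2 come from the snippet above; the preprocessing stays inside the graph
cell1_batch, cell2_batch = tf.train.batch(
    [cell1, cell2],
    batch_size=2,   # hypothetical batch size / capacity
    capacity=32)

# build the whole pipeline first, then start the queue runners once
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord, sess=sess)
print(sess.run([cell1_batch, cell2_batch]))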

Converting a list of 1000000 elements to tf.constant is too slow

What is the best way to do the training? It is too slow, and I don't know why it is that slow.
samples_all = tf.constant(samples_all)  # more than 1000000 elems
labels_all = tf.constant(labels_all)
[sample, label] = tf.train.slice_input_producer([samples_all, labels_all])
samples_all contains more than 1000000 string elements that represent image paths. The first two lines of the code take too long to execute.
Below is an example of loading 2M strings into a string variable, which takes less than 1 second on my MacBook. Variables are more efficient for large constants than tf.constant, which is inlined into the graph. Note that there is also tf.train.string_input_producer, which can handle large lists of strings without loading them all into TensorFlow memory.
import time
import numpy as np
import tensorflow as tf

n = 2000000
string_list = np.asarray(["string"] * n)

sess = tf.Session()
var_input = tf.placeholder(dtype=tf.string, shape=(n,))
var = tf.Variable(var_input)

start = time.time()
# var.initializer is equivalent to var.assign(var_input).op
sess.run(var.initializer, feed_dict={var_input: string_list})
elapsed = time.time() - start
rate = n * 6. / elapsed / 10**6
print("%.2f MB/sec" % (rate,))  # => 15.13 MB/sec