I have a problem with TensorFlow's new input pipeline mechanism. When I create a data pipeline with tf.data.Dataset that decodes JPEG images and then loads them into a queue, it tries to load as many images as it can into the queue. If the throughput of loading images is greater than the throughput of images processed by my model, then memory usage increases without bound.
Below is the code snippet for building the pipeline with tf.data.Dataset:
def _imread(file_name, label):
    _raw = tf.read_file(file_name)
    _decoded = tf.image.decode_jpeg(_raw, channels=hps.im_ch)
    _resized = tf.image.resize_images(_decoded, [hps.im_width, hps.im_height])
    _scaled = (_resized / 127.5) - 1.0
    return _scaled, label

n_samples = image_files.shape.as_list()[0]
dset = tf.data.Dataset.from_tensor_slices((image_files, labels))
dset = dset.shuffle(n_samples, None)
dset = dset.repeat(hps.n_epochs)
dset = dset.map(_imread, hps.batch_size * 32)
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(hps.batch_size * 2)
Here image_files is a constant tensor containing the filenames of 30k images. Images are resized to 256x256x3 in _imread.
If I build a pipeline with the following snippet instead:
# refer to "https://www.tensorflow.org/programmers_guide/datasets"
def _imread(file_name, hps):
    _raw = tf.read_file(file_name)
    _decoded = tf.image.decode_jpeg(_raw, channels=hps.im_ch)
    _resized = tf.image.resize_images(_decoded, [hps.im_width, hps.im_height])
    _scaled = (_resized / 127.5) - 1.0
    return _scaled

n_samples = image_files.shape.as_list()[0]
image_file, label = tf.train.slice_input_producer(
    [image_files, labels],
    num_epochs=hps.n_epochs,
    shuffle=True,
    seed=None,
    capacity=n_samples,
)

# Decode image.
image = _imread(image_file, hps)

images, labels = tf.train.shuffle_batch(
    tensors=[image, label],
    batch_size=hps.batch_size,
    capacity=hps.batch_size * 64,
    min_after_dequeue=hps.batch_size * 8,
    num_threads=32,
    seed=None,
    enqueue_many=False,
    allow_smaller_final_batch=True
)
Then memory usage is almost constant throughout training. How can I make tf.data.Dataset load a fixed number of samples? Is the pipeline I built with tf.data.Dataset correct? I think the buffer_size argument in tf.data.Dataset.shuffle applies to image_files and labels, so it shouldn't be a problem to store 30k strings, right? Even if all 30k decoded images were loaded, that would require 30000*256*256*3*8/(1024*1024*1024) = 43 GB of memory. Yet it uses 59 GB of the 61 GB of system memory.
This will buffer n_samples, which looks to be your entire dataset. You might want to cut down on the buffering here.
dset = dset.shuffle(n_samples, None)
You might as well just repeat forever; repeat won't buffer (see: Does `tf.data.Dataset.repeat()` buffer the entire dataset in memory?)
dset = dset.repeat()
You are batching and then prefetching hps.batch_size * 2 batches. Ouch!
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(hps.batch_size * 2)
Let's say hps.batch_size = 1000 to make a concrete example. The first line above creates batches of 1000 images each. The second line above then prefetches 2000 of those batches, buffering a grand total of 2,000,000 images. Oops!
You meant to do:
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(2)
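Putting these fixes together, a minimal sketch of the corrected pipeline might look like the following (reusing _imread, hps, image_files, and labels from the question; the shuffle buffer of 1000 and the 8 parallel map calls are illustrative choices, not requirements):

dset = tf.data.Dataset.from_tensor_slices((image_files, labels))
dset = dset.shuffle(1000)                        # bounded shuffle buffer instead of n_samples
dset = dset.repeat()                             # repeat forever; count epochs in the training loop
dset = dset.map(_imread, num_parallel_calls=8)   # decode in parallel without huge buffering
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(2)                          # at most 2 prepared batches in memory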
Related
Let's say we have two classes: one small and one large.
I would like to use data augmentation similar to ImageDataGenerator for the small class, and sample from each batch in such a way that each batch ends up balanced (for the minor class: augmentation; for the major class: sampling).
Also, I would like to keep using image_dataset_from_directory (since the dataset doesn't fit into RAM).
What about the sample_from_datasets function?
import tensorflow as tf
from tensorflow.python.data.experimental import sample_from_datasets

def augment(val):
    # Example of an augmentation function
    return val - tf.random.uniform(shape=tf.shape(val), maxval=0.1)

big_dataset_size = 1000
small_dataset_size = 10

# Init some datasets
dataset_class_large_positive = tf.data.Dataset.from_tensor_slices(
    tf.range(100, 100 + big_dataset_size, dtype=tf.float32))
dataset_class_small_negative = tf.data.Dataset.from_tensor_slices(
    -tf.range(1, 1 + small_dataset_size, dtype=tf.float32))

# Upsample and augment the small dataset
dataset_class_small_negative = dataset_class_small_negative \
    .repeat(big_dataset_size // small_dataset_size) \
    .map(augment)

dataset = sample_from_datasets(
    datasets=[dataset_class_large_positive, dataset_class_small_negative],
    weights=[0.5, 0.5]
)
dataset = dataset.shuffle(100)
dataset = dataset.batch(6)

iterator = dataset.as_numpy_iterator()
for i in range(5):
    print(next(iterator))

# [109. -10.044552 136. 140. -1.0505208 -5.0829906]
# [122. 108. 141. -4.0211563 126. 116. ]
# [ -4.085523 111. -7.0003924 -7.027302 -8.0362625 -4.0226436]
# [ -9.039093 118. -1.0695585 110. 128. -5.0553837]
# [100. -2.004463 -9.032592 -8.041705 127. 149. ]
Set up the desired balance between the classes in the weights parameter of sample_from_datasets.
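For instance, to draw roughly four large-class samples for every small-class sample, one could use (an illustrative ratio, not part of the original answer):

dataset = sample_from_datasets(
    datasets=[dataset_class_large_positive, dataset_class_small_negative],
    weights=[0.8, 0.2]  # ~80% large class, ~20% small class per draw
)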
As Yaoshiang noticed, the last batches are imbalanced and the dataset lengths differ. This can be avoided by
# Repeat infinitely both datasets and augment the small one
dataset_class_large_positive = dataset_class_large_positive.repeat()
dataset_class_small_negative = dataset_class_small_negative.repeat().map(augment)
instead of
# Upsample and augment the small dataset
dataset_class_small_negative = dataset_class_small_negative \
    .repeat(big_dataset_size // small_dataset_size) \
    .map(augment)
In this case, however, the dataset is infinite and the number of batches per epoch has to be controlled separately.
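With both datasets repeated infinitely, one common way to bound an epoch (a sketch assuming Keras training; the model and the steps_per_epoch value below are placeholders, not from the original answer) is to pass steps_per_epoch to fit:

# Hypothetical: cap each "epoch" at a fixed number of balanced batches
model.fit(
    dataset,
    epochs=10,
    steps_per_epoch=2 * big_dataset_size // 6,  # roughly one pass over both classes at batch size 6
)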
You can use tf.data.Dataset.from_generator, which allows more control over your data generation without loading all of your data into RAM.
def generator():
    i = 0
    while True:
        if i % 2 == 0:
            elem = large_class_sample()
        else:
            elem = small_class_augmented()
        yield elem
        i = i + 1

ds = tf.data.Dataset.from_generator(
    generator,
    output_signature=tf.TensorSpec(shape=yourElem_shape, dtype=yourElem_type))
This generator will alternate samples between the two classes, and you can add more dataset operations (batch, shuffle, ...).
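As a quick sketch of those follow-up operations on the generator-backed dataset (the buffer and batch sizes here are arbitrary examples):

ds = ds.shuffle(256)                # mix the strictly alternating stream
ds = ds.batch(32)
ds = ds.prefetch(tf.data.AUTOTUNE)  # overlap generation with training

for batch in ds.take(1):
    print(batch.shape)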
I didn't totally follow the problem. Would pseudo-code like this work? Perhaps there are some operators on tf.data.Dataset that are sufficient to solve your problem.
ds = image_dataset_from_directory(...)
ds1 = ds.filter(lambda image, label: label == MAJORITY)
ds2 = ds.filter(lambda image, label: label != MAJORITY)
ds2 = ds2.map(lambda image, label: (data_augment(image), label))
ds1 = ds1.batch(int(10. / MAJORITY_RATIO))
ds2 = ds2.batch(int(10. / MINORITY_RATIO))
ds3 = tf.data.Dataset.zip((ds1, ds2))
ds3 = ds3.map(lambda left, right: tf.concat([left, right], axis=0))
You can use tf.data.Dataset.from_tensor_slices to load the images of the two categories separately and do data augmentation for the minority class. Now that you have two datasets, combine them with tf.data.Dataset.sample_from_datasets.
from glob import glob
import os
import tensorflow as tf

# assume class1 is the minority class
files_class1 = glob('class1\\*.jpg')
files_class2 = glob('class2\\*.jpg')

def augment(filepath):
    class_name = tf.strings.split(filepath, os.sep)[0]
    image = tf.io.read_file(filepath)
    image = tf.image.decode_jpeg(image, channels=3)  # decode before augmenting
    image = tf.expand_dims(image, 0)
    if tf.equal(class_name, 'class1'):
        # do all the data augmentation
        image_flip = tf.image.flip_left_right(image)
        return [[image, class_name], [image_flip, class_name]]

# apply data augmentation for class1
train_class1 = tf.data.Dataset.from_tensor_slices(files_class1).\
    map(augment, num_parallel_calls=tf.data.AUTOTUNE)
train_class2 = tf.data.Dataset.from_tensor_slices(files_class2)

dataset = tf.data.Dataset.sample_from_datasets(
    datasets=[train_class1, train_class2],
    weights=[0.5, 0.5])

dataset = dataset.batch(BATCH_SIZE)
I created a pipeline using the tf.data API for reading a dataset of images. I have a big dataset with high-resolution images. However, every time it tries to read the whole dataset, the computer crashes because the code uses all the RAM. I tested the code with about 1280 images and it works without any error, but when I use the full dataset the model crashes.
So, I am wondering if there is a way to make tf.data read only one or two batches ahead, not more than that.
This is the code I am using to create the pipeline:
def decode_img(self, img):
    img = tf.image.convert_image_dtype(img, tf.float32, saturate=False)
    img = tf.image.resize(img, size=self.input_dim, antialias=False, name=None)
    return img

def get_label(self, label):
    y = np.zeros(self.n_class, dtype=np.float32)
    y[label] = 1
    return y

def process_path(self, file_path, label):
    label = self.get_label(label)
    img = Image.open(file_path)
    width, height = img.size
    # Setting the points for the cropped image
    new_height = height // 2
    new_width = width // 2
    newsize = (new_width, new_height)
    img = img.resize(newsize)
    if self.aug_img:
        img = self.policy(img)
    img = self.decode_img(np.array(img, dtype=np.float32))
    return img, label

def create_pip_line(self):
    def _fixup_shape(images, labels):
        images.set_shape([None, None, 3])
        labels.set_shape([7])  # I have 19 classes
        return images, labels

    tf_ds = tf.data.Dataset.from_tensor_slices((self.df["file_path"].values, self.df["class_num"].values))
    tf_ds = tf_ds.map(lambda img, label: tf.numpy_function(self.process_path,
                                                           [img, label],
                                                           (tf.float32, tf.float32)),
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
    tf_ds = tf_ds.map(_fixup_shape)
    if not self.is_val:
        tf_ds = tf_ds.shuffle(len(self.df), reshuffle_each_iteration=True)
    tf_ds = tf_ds.batch(self.batch_size).repeat(self.epoch_num)
    self.tf_ds = tf_ds.prefetch(tf.data.experimental.AUTOTUNE)
The main issue in my code was the shuffle function. It takes a buffer_size argument (the number of elements to load and shuffle at a time) and a reshuffle_each_iteration flag that controls whether the buffer is reshuffled every epoch.
I found that the amount of data loaded into memory depends on this buffer size. So I reduced it from the whole dataset to 100, which makes the pipeline load 100 images, shuffle them, then load the next 100, and so on.
if not self.is_val:
tf_ds = tf_ds.shuffle(100, reshuffle_each_iteration=True)
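A related option, if a buffer of 100 shuffles too weakly: shuffle before the expensive map, while the elements are still just file paths and class numbers, so even a full-dataset buffer only holds strings (a sketch reusing the names from the question above):

tf_ds = tf.data.Dataset.from_tensor_slices((self.df["file_path"].values, self.df["class_num"].values))
if not self.is_val:
    # Shuffling here buffers only (path, label) pairs, not decoded images
    tf_ds = tf_ds.shuffle(len(self.df), reshuffle_each_iteration=True)
tf_ds = tf_ds.map(lambda img, label: tf.numpy_function(self.process_path, [img, label], (tf.float32, tf.float32)),
                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
tf_ds = tf_ds.map(_fixup_shape)
tf_ds = tf_ds.batch(self.batch_size).repeat(self.epoch_num)
self.tf_ds = tf_ds.prefetch(tf.data.experimental.AUTOTUNE)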
I have a gigantic training dataset that doesn't fit in RAM. I tried to load random batches of images into a stack without loading the whole .h5 file. My approach was to create a list of indices and shuffle it instead of shuffling the whole .h5 file.
Let's say:
a = np.arange(2000*2000*2000).reshape(2000, 2000, 2000)
idx = np.random.randint(2000, size=800)  # so that I only need to shuffle this idx at the end of an epoch

# create this huge data, 32 GB > my RAM
with h5py.File('./tmp.h5', 'w') as f:
    tmp = f.create_dataset('a', (2000, 2000, 2000))
    tmp[:] = a

# read it
with h5py.File('./tmp.h5', 'r') as f:
    tensor = f['a'][:][idx]  # if I don't do [:] there will be an error; if I do, it loads the whole file, which I don't want
Does somebody have a solution?
Thanks to @max9111, here's how I propose to solve it:
batch_size = 100
idx = np.arange(2000)

# shuffle in place (np.random.shuffle returns None, so don't reassign)
np.random.shuffle(idx)
Due to the constraint of h5py:
Selection coordinates must be given in increasing order
One should sort before reading:
for step in range(epoch_len // batch_size):
    batch_idx = np.sort(idx[step * batch_size : (step + 1) * batch_size])
    try:
        with h5py.File(path, 'r') as f:
            return f['img'][batch_idx], f['label'][batch_idx]
    except IndexError:
        raise RuntimeError('epoch finished; drop the remainder')
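To tie this back to tf.data: the same sorted-index reads can be wrapped in a Python generator and fed to Dataset.from_generator, so only one batch sits in memory at a time (a sketch under the same path, idx, epoch_len and batch_size assumptions; the shapes and dtypes below are placeholders):

import h5py
import numpy as np
import tensorflow as tf

def h5_batch_generator():
    # Re-shuffle the index list each epoch, then read sorted slices batch by batch
    np.random.shuffle(idx)
    with h5py.File(path, 'r') as f:
        for step in range(epoch_len // batch_size):
            batch_idx = np.sort(idx[step * batch_size : (step + 1) * batch_size])
            yield f['img'][batch_idx], f['label'][batch_idx]

ds = tf.data.Dataset.from_generator(
    h5_batch_generator,
    output_signature=(
        tf.TensorSpec(shape=(None, 256, 256, 3), dtype=tf.float32),  # placeholder image shape
        tf.TensorSpec(shape=(None,), dtype=tf.int64),                # placeholder label shape
    ))
ds = ds.prefetch(1)  # keep at most one extra batch in memory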
Earlier I used threads and queues as my data pipeline and I got really high utilization on both GPUs (the data was created on the fly). I wanted to switch to the tf.data Dataset API, but I am struggling to reproduce those results.
I tried a lot of approaches. Since I create data on the fly, the from_generator() method seemed perfect. The code below is my latest attempt. There seems to be a bottleneck in creating the data even though I use the map() function to process the generated images. In the code below I tried to "multithread" the generators somehow, so that more data comes in at the same time, but no better results so far.
def generator(n):
    with tf.device('/cpu:0'):
        while True:
            ...
            yield image, label

def get_generator(n):
    return partial(generator, n)

def dataset(n):
    return tf.data.Dataset.from_generator(
        get_generator(n),
        output_types=(tf.float32, tf.float32),
        output_shapes=(tf.TensorShape([None, None, 1]), tf.TensorShape([None, None, 1])))

def input_fn():
    # ds = tf.data.Dataset.from_generator(generator, output_types=(tf.float32, tf.float32), output_shapes=(tf.TensorShape([None,None,1]), tf.TensorShape([None,None,1])))
    ds = tf.data.Dataset.range(BATCH_SIZE).apply(
        tf.data.experimental.parallel_interleave(dataset, cycle_length=BATCH_SIZE))
    ds = ds.map(map_func=lambda img, lbl: processImage(img, lbl))
    ds = ds.shuffle(SHUFFLE_SIZE)
    ds = ds.batch(BATCH_SIZE)
    ds = ds.prefetch(1)
    return ds
The expected result would be high GPU utilization (>80%), but for now it is really low, around 10-20%.
You can use tf.data.Dataset.from_tensor_slices instead.
Just pass the image/label paths; the function accepts the filenames as its argument.
def input_func():
    dataset = tf.data.Dataset.from_tensor_slices((images_path, labels_path))
    dataset = dataset.shuffle(buffer_size=1000).repeat()  # buffer_size is required; 1000 is an arbitrary choice
    ...
    return dataset
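To actually address the low GPU utilization, the filename-based dataset usually needs a parallel decode step and prefetching. A minimal sketch (the parse_image helper, the image size, and the AUTOTUNE choices here are assumptions, not part of the original answer):

def parse_image(filename, label):
    # Hypothetical decode step: read and resize a JPEG from its path
    raw = tf.io.read_file(filename)
    img = tf.image.decode_jpeg(raw, channels=3)
    img = tf.image.resize(img, [256, 256]) / 255.0
    return img, label

def input_func():
    dataset = tf.data.Dataset.from_tensor_slices((images_path, labels_path))
    dataset = dataset.shuffle(buffer_size=1000).repeat()
    dataset = dataset.map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)  # parallel CPU decode
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)  # overlap the input pipeline with GPU compute
    return dataset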
TL;DR: how do I ensure that data is loaded in a multi-threaded manner when using the Dataset API in TensorFlow 1.4?
Previously I did something like this with my images on disk:
filename_queue = tf.train.string_input_producer(filenames)
image_reader = tf.WholeFileReader()
_, image_file = image_reader.read(filename_queue)

imsize = 120
image = tf.image.decode_jpeg(image_file, channels=3)
image = tf.image.convert_image_dtype(image, dtype=tf.float32)
image_r = tf.image.resize_images(image, [imsize, imsize])

images = tf.train.shuffle_batch([image_r],
                                batch_size=20,
                                num_threads=30,
                                capacity=200,
                                min_after_dequeue=0)
This ensured that there were multiple threads (num_threads=30) getting data ready for the next learning iterations.
Now with the Dataset api I do something like:
dataset = tf.data.Dataset.from_tensor_slices((filenames, filenames_up, filenames_blacked))
dataset = dataset.map(parse_upscaler_corrector_batch)
After this I create an iterator:
sess = tf.Session()
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
sess.run(iterator.initializer)
value = sess.run(next_element)
Variable value will be passed for further processing.
So how do I ensure that data is being prepared in a multi-threaded manner here? Where can I read about the Dataset API and multi-threaded data reading?
So it appears that the way to achieve this is as follows:
dataset = dataset.map(parse_upscaler_corrector_batch, num_parallel_calls=12).prefetch(32).batch(self.ex_config.batch_size)
If you vary num_parallel_calls=12, you can see that both the network/HDD load and the CPU load spike or drop accordingly.
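For completeness, a minimal sketch of how that fits into the original iterator setup (TF 1.x style; filenames, parse_upscaler_corrector_batch, and batch_size stand in for the names used in the question, and the constants 12 and 32 are just the values quoted above, not tuned recommendations):

dataset = tf.data.Dataset.from_tensor_slices((filenames, filenames_up, filenames_blacked))
dataset = dataset.map(parse_upscaler_corrector_batch, num_parallel_calls=12)  # 12 parallel parsing calls
dataset = dataset.prefetch(32)   # keep up to 32 parsed elements ready
dataset = dataset.batch(batch_size)

iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    value = sess.run(next_element)  # one prepared batch, produced by the parallel pipeline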