.prefetch() and .cache() not speeding up tf.data.Dataset pipeline - tensorflow2.0

I have a very big dataset of high-resolution images, so I am training over small chunks using Keras fit.
To load chunks into memory I have a generator function that produces a tuple of variable-size tensors, which I pass to tf.data.Dataset to create a data pipeline.
def extract_XY(list_idx):
    # read_images(list_idx) and append to a list
    # process, extract patches and convert to a tensor X of shape (?, 100, 100, 3)  (? means variable-size mini-batch)
    Y = f(X)
    return X, Y

for i in range(epochs):
    for j in range(chunks):
        X, Y = extract_XY(list_idx)  # list_idx changes in each loop
        data = tf.data.Dataset.from_tensor_slices((X, Y)).batch(64).cache().prefetch(tf.data.experimental.AUTOTUNE)
        model.fit(data, epochs=2, verbose=1)
My training with Keras fit works, but it is still slow and I see no speed-up from using .cache() or .prefetch().
Can anyone help me understand whether I am using them correctly in my case?
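For reference, here is a generic sketch of where cache() and prefetch() usually sit in a tf.data pipeline (placeholder names, not my exact code):
# Generic sketch (preprocess_fn is a placeholder): heavy per-example work in map(),
# cache() so it only runs in the first epoch, prefetch() last so input preparation
# overlaps with training on the previous batch.
dataset = tf.data.Dataset.from_tensor_slices((X, Y))
dataset = dataset.map(preprocess_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.cache()
dataset = dataset.batch(64)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)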
I can't use the tf.data.Dataset.from_generator option, as my generator doesn't yield a sequence.
Can I make my data pipeline more efficient, i.e., load the next chunk while the model is training? Any suggestions would be helpful.
Will using multiprocessing=True in Keras fit() help more, or is tf.data.Dataset already taking care of the I/O bottleneck?
Thanks in advance!

Related

Tensorflow: How to prefetch data on the GPU from CPU tf.data.Dataset (from_generator)

I am struggling with the following. I am creating a tf.data.Dataset using the from_generator method. I perform these actions on CPU as I don't want to overload my GPU memory.
The dataset consists of tuples, which contain a tf.bool 1-D mask (tf.Tensor) with fixed length, and a tf.float 2-D matrix (tf.Tensor) with variable size. The loss function is decorated using the following decorator, so I would not assume the variable size is the problem.
@tf.function(experimental_relax_shapes=True)
Ideally, the dataset is kept on the CPU, but then prefetched onto the GPU.
def gen():
    for i, j in zip(mask_list, wmat_list):
        yield i, j

dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.bool, tf.float32))
The main training loop currently relies on tf.identity to move the data to the GPU, which is inefficient, as shown in the TensorBoard screenshot below. Roughly 70% of the time is spent loading the data and moving it to the GPU.
for b, (mask, wmat) in enumerate(dataset):
    with tf.GradientTape() as tape:
        mask = tf.identity(mask)
        wmat = tf.identity(wmat)
        mean_error, loss = self.model.loss(mask, wmat)
    epoch_loss += loss.numpy()
    epoch_mean_error += mean_error.numpy()
I have tried the prefetch_to_device function. However, it did not move the data onto the GPU, as verified by printing e.g. mask.device in the training loop.
gpu_transform = tf.data.experimental.prefetch_to_device('/gpu')
dataset.apply(gpu_transform)
To me it resembles this bug: https://github.com/tensorflow/tensorflow/issues/30929. However, it is marked as solved and is over a year old.
Running TF 2.3 using the official Docker image.
I have found the solution to my own question.
The problem was that the tuples in the dataset did not contain tf.Tensors, but numpy arrays. Therefore, the pipeline was probably limited by the functionality of py_func().
The screenshot below shows that the pipeline does not block on the CPU. However, there is still a considerable MemCpy. prefetch_to_device() still does not do anything; this is likely due to a known issue which should be fixed in TF 2.4:
https://github.com/tensorflow/tensorflow/issues/35563
The (unconfirmed) suggested workaround also did not work for me (see edit):
with tf.device("/gpu:0"):
    ds = ds.prefetch(1)
EDIT:
I have further investigated this issue and filed a bug report. It does now seem that the suggested workaround does something, but I am not sure whether it completely prefetches in time.
https://github.com/tensorflow/tensorflow/issues/43905
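For anyone trying the same workaround, here is a minimal sketch of where it slots into this pipeline; the device string and buffer size are assumptions, not confirmed settings:
dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.bool, tf.float32))
# Build the prefetch op under an explicit device scope so the staged
# batch should land in GPU memory rather than host memory.
with tf.device("/gpu:0"):
    dataset = dataset.prefetch(1)

for mask, wmat in dataset:
    print(mask.device)  # check which device the tensors actually live on
    break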

No cache file written in TensorFlow dataset

I am trying to manage a large image dataset that does not fit in memory while requiring some specific calculations. Currently, my code looks like this:
files = [str(f) for f in self.files]
labels = self.categories
batch_size = 32
dataset = tf.data.Dataset.from_generator(
    lambda: zip(files, labels),
    output_types=(tf.string, tf.uint8),
    output_shapes=(tf.TensorShape([]), tf.TensorShape([]))
)
dataset = dataset.map(
    lambda x, y: tf.py_function(_parser, [x, y, category_count], [tf.float32, tf.uint8]),
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
    deterministic=False)
dataset.cache(filename='/tmp/dataset.tmp')
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.shuffle(buffer_size=10 * batch_size, reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset.repeat(None)
else:
    dataset.repeat(1)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
The _parser() function opens an image file, does a bunch of transformations, and returns a tensor and a one-hot encoded vector. The caching step does not seem to work properly, however:
- There is no significant improvement in computation time between the first epoch and the following ones.
- No cache file is created during the process, although the swap partition is almost full (~90%).
Does the cache() function create a file only when both the memory and the swap partition are full? Furthermore, I expect to read only batch_size files at a time. However, it seems that all files are read at once during the mapping step. Should I consider using interleave() combined with from_generator() instead? Or should I batch the files first, then map them?
Note that cache() should be used when the dataset is small. If the dataset is large (as in your case), RAM will not be sufficient to cache its contents, so it does not fit into memory. Either increase the capacity of your RAM or adopt some other method to speed up the training.
The other reason for the slowdown of training is the preprocessing stage, where you use the map() function.
The map() method applies a transformation to each element, whereas the apply() method applies a transformation to the dataset as a whole.
You can use interleave() and retain the same order of map() and then batch().
You are already using parallelism through num_parallel_calls, and setting it to tf.data.experimental.AUTOTUNE makes the best use of whatever resources are available.
You can also normalize your input data and then cache; if it still does not fit into memory, then it's better not to cache a large dataset.
You can follow these performance tips from TensorFlow.
If you have multiple workers/devices, it will help you to speed up the training.
[Illustration: prefetching with multithreaded loading and preprocessing]
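As a generic sketch of the ordering suggested above (interleave for parallel reading, then map, then batch, then prefetch), assuming the input is a list of record/shard files, each of which expands into its own dataset; shard_files and parse_example are placeholders:
filenames = tf.data.Dataset.from_tensor_slices(shard_files)
dataset = filenames.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)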

TF Dataset API: Is the following sequence correct? map,cache,shuffle,batch,repeat,prefetch

I am using this sequence to read image files from disk and feed them into a TF Keras model.
# Make dataset for training
dataset_train = tf.data.Dataset.from_tensor_slices((file_ids_training, file_names_training))
dataset_train = dataset_train.flat_map(lambda file_id, file_name: tf.data.Dataset.from_tensor_slices(
    tuple(tf.py_func(_get_data_for_dataset, [file_id, file_name], [tf.float32, tf.float32]))))
dataset_train = dataset_train.cache()
dataset_train = dataset_train.shuffle(buffer_size=train_buffer_size)
dataset_train = dataset_train.batch(train_batch_size)  # Make dataset, shuffle, and create batches
dataset_train = dataset_train.repeat()
dataset_train = dataset_train.prefetch(1)
dataset_train_iterator = dataset_train.make_one_shot_iterator()
get_train_batch = dataset_train_iterator.get_next()
I have questions about whether this is the optimal sequence.
For example, should repeat() come after shuffle() and before batch()? Should cache() come after batch()?
The answer here, Output differences when changing order of batch(), shuffle() and repeat(), suggests repeat or shuffle before batching. The order I often use is (1) shuffle, (2) repeat, (3) map, (4) batch, but it can vary based on your preferences. I use shuffle before repeat to avoid blurring epoch boundaries. I use map before batch because my mapping function applies to a single example (not to a batch of examples), but you can certainly write a map function that is vectorized and expects to see a batch as input.
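A minimal sketch of that ordering, with map_fn standing in for your per-example preprocessing:
dataset = tf.data.Dataset.from_tensor_slices(file_paths)
dataset = dataset.shuffle(buffer_size=BUFFER_SIZE)  # (1) shuffle before repeat: clean epoch boundaries
dataset = dataset.repeat()                          # (2) repeat
dataset = dataset.map(map_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)  # (3) per-example map
dataset = dataset.batch(BATCH_SIZE)                 # (4) batch
dataset = dataset.prefetch(1)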
I'd suggest using the following order
dataset = (dataset
           .cache(filename='./data/cache/')
           .shuffle(BUFFER_SIZE)
           .repeat(Epoch)
           .map(func, num_parallel_calls=tf.data.AUTOTUNE)
           .filter(fltr)
           .batch(BATCH_SIZE)
           .prefetch(tf.data.AUTOTUNE))
In this way, to further speed up training, the processed data will be saved in binary format (done automatically by TF) by calling cache. The data will be saved to the cache file after the whole dataset has been shuffled and repeated. After that, as @shivaraj said, use the map and filter functions before batching the data. Lastly, call prefetch, as the TF documentation says, to prepare the data beforehand while the GPU is working on the previous batch.
Note:
Calling cache will take a lot of time on the first call, depending on the data size and the memory available. But it speeds up the training by at least 4 times if you need to run multiple experiments without making any change to the dataset's inputs and outputs (labels).
Changing the order in which cache is called will also affect the time it takes to create the cache files. I found this order to be the fastest in every respect, and it also doesn't raise any warnings.
If you are reading images and preprocessing them through a function, then use batch after the map function.
If you use batch before map, then the map function does not get filenames; instead it gets a list of rank 1, which raises:
ValueError: Shape must be rank 0 but is rank 1 for '{{node ReadFile}} = ReadFile[](args_0)' with input shapes: [?].
Hence the sequence is
dataset = tf.data.Dataset.from_tensor_slices(file_paths)
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.repeat()  # can be after batch
dataset = dataset.map(parse_images)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # repeat() may also go here, after batch()
Although you can also choose to place repeat() after batch(), which doesn't affect your execution.
The buffer size in shuffle actually decides the magnitude of randomness you can introduce: the bigger the buffer size, the better the randomness, but you also need more RAM (usually > 8 GB).

How to efficiently shuffle a large tf.data.Dataset when using tf.estimator.train_and_evaluate?

The tf.estimator.train_and_evaluate documentation makes it clear that the input dataset must be properly shuffled for the training to see all examples:
Overfitting: In order to avoid overfitting, it is recommended to set up the training input_fn to shuffle the training data properly. It is also recommended to train the model a little longer, say multiple epochs, before performing evaluation, as the input pipeline starts from scratch for each training. It is particularly important for local training and evaluation.
In my application, I would like to sample examples uniformly from the full tf.data.Dataset, for an arbitrary evaluation frequency and shuffle() buffer size. Otherwise, the training can see at most the first:
(steps_per_second * eval_delay * batch_size) + buffer_size
elements, effectively discarding the rest. Is there an efficient way to work around that without loading the complete dataset into system memory?
I considered sharding the dataset based on the buffer size, but if the evaluation does not occur frequently, it will iterate over the same shard multiple times (a repeat() closes the pipeline). Ideally, I would like to move to another shard after a complete iteration over the dataset. Is that possible?
Thanks for any pointers!
A random sharding of the dataset can be implemented with this Dataset transformation:
import numpy as np
import tensorflow as tf

def random_shard(shard_size, dataset_size):
    num_shards = -(-dataset_size // shard_size)  # Ceil division.
    offsets = np.linspace(
        0, dataset_size, num=num_shards, endpoint=False, dtype=np.int64)

    def _random_shard(dataset):
        # Shuffle the shard offsets, then take `shard_size` consecutive
        # examples starting at each offset.
        sharded_dataset = tf.data.Dataset.from_tensor_slices(offsets)
        sharded_dataset = sharded_dataset.shuffle(num_shards)
        sharded_dataset = sharded_dataset.flat_map(
            lambda offset: dataset.skip(offset).take(shard_size))
        return sharded_dataset

    return _random_shard
This requires knowing the total dataset size in advance. However, if you implement a file-based sharding approach, you also iterate over the full dataset once, so that is not a major issue.
Regarding efficiency, note that skip(offset) actually iterates over offset examples, so some latency is to be expected if offset is large. Careful prefetching should help with this.
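A usage sketch, assuming the full dataset and its total size are already known (all_examples, batch_size, and the numbers are placeholders); an extra shuffle keeps examples within the current shard from arriving in a fixed order:
dataset = tf.data.Dataset.from_tensor_slices(all_examples)
dataset = dataset.apply(random_shard(shard_size=10000, dataset_size=1000000))
dataset = dataset.shuffle(buffer_size=10000)  # shuffle within the current shard
dataset = dataset.repeat()
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)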

fit_generator in keras: where is the batch_size specified?

Hi, I don't understand the Keras fit_generator docs.
I hope my confusion is rational.
There is a batch_size and also the concept of training in batches. Using model.fit(), I specify a batch_size of 128.
To me this means that my dataset will be fed in 128 samples at a time, thereby greatly alleviating memory pressure. It should allow a 100-million-sample dataset to be trained as long as I have got the time to wait. After all, Keras is only "working with" 128 samples at a time. Right?
But I highly suspect that specifying the batch_size alone doesn't do what I want whatsoever. Tons of memory is still being used. For my goals I need to train in batches of 128 examples each.
So I am guessing this is what fit_generator does. I really want to ask: why doesn't batch_size actually work as its name suggests?
More importantly, if fit_generator is needed, where do I specify the batch_size? The docs say to loop indefinitely.
A generator loops over every row once. How do I loop over 128 samples at a time, remember where I last stopped, and recall it the next time Keras asks for the next batch's starting row number (which would be row 129 after the first batch is done)?
You will need to handle the batch size somehow inside the generator. Here is an example to generate random batches:
import numpy as np

data = np.arange(100)
data_lab = data % 2
wholeData = np.array([data, data_lab])
wholeData = wholeData.T

def data_generator(all_data, batch_size=20):
    while True:
        idx = np.random.randint(len(all_data), size=batch_size)
        # Assuming the last column contains labels
        batch_x = all_data[idx, :-1]
        batch_y = all_data[idx, -1]
        # Return a tuple of (Xs, Ys) to feed the model
        yield (batch_x, batch_y)

# The generator loops forever, so pull a single batch to inspect it.
print(next(data_generator(wholeData)))
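To connect this to the original question: with fit_generator the batch size lives inside the generator, and steps_per_epoch tells Keras how many batches make up one epoch. A hedged sketch (the model and the numbers are assumptions):
# Hypothetical: 100 samples at 20 per batch gives 5 steps per epoch.
model.fit_generator(data_generator(wholeData, batch_size=20),
                    steps_per_epoch=5,
                    epochs=10)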
First, Keras's batch_size does work very well. If you are working on a GPU, you should know that the model can be very heavy with Keras, especially if you are using recurrent cells. If you are working on a CPU, the whole program is loaded in memory, so the batch size won't have much of an impact on memory. If you are using fit(), the whole dataset is probably loaded in memory, and Keras produces batches at every step. It's very difficult to predict the amount of memory that will be used.
As for the fit_generator() method, you should build a Python generator function (using yield instead of return), yielding one batch at every step. The yield should be inside an infinite loop (we often use while True: ...).
Do you have some code to illustrate your problem?
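For illustration, a minimal sequential generator of the kind described above, which walks through the data in order and remembers where it stopped between batches (all names are hypothetical):
import numpy as np

def sequential_generator(X, y, batch_size=128):
    start = 0
    while True:  # loop indefinitely, as the docs require
        stop = start + batch_size
        yield X[start:stop], y[start:stop]
        # Remember where we stopped; wrap around at the end of the data.
        start = stop if stop < len(X) else 0

# Hypothetical usage: Keras pulls one batch per step.
# model.fit_generator(sequential_generator(X_train, y_train, 128),
#                     steps_per_epoch=len(X_train) // 128, epochs=5)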