How to efficiently shuffle a large tf.data.Dataset when using tf.estimator.train_and_evaluate? - tensorflow

The tf.estimator.train_and_evaluate documentation makes it clear that the input dataset must be properly shuffled for the training to see all examples:
Overfitting: In order to avoid overfitting, it is recommended to set up the training input_fn to shuffle the training data properly. It is also recommended to train the model a little longer, say multiple epochs, before performing evaluation, as the input pipeline starts from scratch for each training. It is particularly important for local training and evaluation.
In my application, I would like to sample examples uniformly from the full tf.data.Dataset, regardless of the evaluation frequency and of shuffle()'s buffer size. Otherwise, the training can see at most the first:
(steps_per_second * eval_delay * batch_size) + buffer_size
elements, effectively discarding the rest. Is there an efficient way to work around this without loading the complete dataset into system memory?
I considered sharding the dataset based on the buffer size, but if the evaluation does not occur frequently, it will iterate over the same shard multiple times (a repeat() closes the pipeline). Ideally, I would like to move on to another shard after a complete iteration over the dataset. Is that possible?
Thanks for any pointers!

A random sharding of the dataset can be implemented with this Dataset transformation:
import numpy as np
import tensorflow as tf

def random_shard(shard_size, dataset_size):
  num_shards = -(-dataset_size // shard_size)  # Ceil division.
  offsets = np.linspace(
      0, dataset_size, num=num_shards, endpoint=False, dtype=np.int64)
  def _random_shard(dataset):
    sharded_dataset = tf.data.Dataset.from_tensor_slices(offsets)
    sharded_dataset = sharded_dataset.shuffle(num_shards)
    sharded_dataset = sharded_dataset.flat_map(
        lambda offset: dataset.skip(offset).take(shard_size))
    return sharded_dataset
  return _random_shard
This requires knowing the total dataset size in advance. However, if you implement a file-based sharding approach, you also iterate over the full dataset once, so that is not a major issue.
Regarding efficiency, note that skip(offset) actually iterates over offset examples, so some latency is to be expected when offset is large. Careful prefetching should help with this.
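For reference, a minimal sketch of how this transformation could be plugged into an input pipeline via apply(); the shard size, dataset size and filenames below are placeholder assumptions:

import tensorflow as tf

shard_size = 100000     # assumed: roughly what fits comfortably in memory
dataset_size = 1000000  # assumed to be known (e.g. counted once offline)

dataset = tf.data.TFRecordDataset(filenames)  # filenames assumed to be defined
dataset = dataset.apply(random_shard(shard_size, dataset_size))
dataset = dataset.shuffle(buffer_size=shard_size)  # shuffle within the current shard
dataset = dataset.repeat()
dataset = dataset.prefetch(1)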

Related

Why does a tensorflow dataset need to be batched before fit?

If we've made a tensorflow dataset (for example with from_tensor_slices), we need to use the .batch(...) method before we pass this dataset to fit(). The question is: why does fit() expect the dataset to be batched?
Datasets are sliced or batched for the following reasons:
To avoid high memory usage if the entire dataset were used as one batch (which might cause out-of-memory problems).
Computation is faster when training is done in batches.
Weight and bias updates can be made with respect to the labels of each batch when training is done in batches.
A short usage sketch follows the reference below.
Reference:
https://medium.com/@elimu.michael9/understanding-epochs-and-batches-23120a04b3cb
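As an illustration, a minimal sketch of batching before fit(); the toy data and model below are placeholders:

import tensorflow as tf

# Hypothetical toy data: 100 examples with 8 features each.
x = tf.random.normal([100, 8])
y = tf.random.uniform([100], maxval=2, dtype=tf.int32)

dataset = tf.data.Dataset.from_tensor_slices((x, y))
# Keras treats each dataset element as one batch, so an unbatched dataset of
# (8,)-shaped elements would be misinterpreted; batching groups examples so
# each training step sees a (32, 8) tensor.
dataset = dataset.batch(32)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(dataset, epochs=1)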

No cache file written in TensorFlow dataset

I am trying to manage a large image dataset that does not fit in memory, while requiring some specific calculations. Currently, my code looks like this:
files = [str(f) for f in self.files]
labels = self.categories
batch_size = 32
dataset = tf.data.Dataset.from_generator(
    lambda: zip(files, labels),
    output_types=(tf.string, tf.uint8),
    output_shapes=(tf.TensorShape([]), tf.TensorShape([]))
)
dataset = dataset.map(
    lambda x, y: tf.py_function(_parser, [x, y, category_count], [tf.float32, tf.uint8]),
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
    deterministic=False)
dataset.cache(filename='/tmp/dataset.tmp')
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.shuffle(buffer_size=10 * batch_size, reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset.repeat(None)
else:
    dataset.repeat(1)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
The _parser() function opens an image file, does a bunch of transformations, and returns a tensor and a one-hot encoded vector. The caching step does not seem to work properly, however:
There is no significant improvement in computation time between the 1st epoch and the following ones.
No cache file is created during the process, although the swap partition is almost full (~90%).
Does cache() create a file only when both the memory and the swap partition are full? Furthermore, I expect only batch_size files to be read at a time. However, it seems that all files are read at once during the mapping step. Should I consider using interleave() combined with from_generator() instead? Or should I batch the files first, then map them?
Note that cache() should be used when the dataset is small. If the dataset is large (which it is in your case), RAM will not be sufficient to cache its contents, so it does not fit into memory. You should either increase the capacity of your RAM or adopt some other method to speed up the training.
The other cause of slow training is the preprocessing stage, when you use the map() function.
The map() method applies a transformation to each element, unlike the apply() method, which applies a transformation to the dataset as a whole.
You can use interleave() and retain the same order of map() and then batch(), as sketched below.
You are already using parallelism through num_parallel_calls, and setting it to tf.data.experimental.AUTOTUNE makes the best use of whatever is available.
You can also normalize your input data and then cache; if it still does not fit into memory, then it is better not to cache a large dataset.
You can follow these performance tips from TensorFlow.
If you have multiple workers/devices, it will help you speed up the training.
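To make the interleave() suggestion concrete, here is a rough sketch of how the question's pipeline could be restructured; it reuses files, labels, _parser and category_count from the question and is an illustration under those assumptions, not a tested drop-in replacement:

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

path_ds = tf.data.Dataset.from_tensor_slices((files, labels))

# interleave() builds a tiny dataset per element and mixes their outputs,
# so several elements are parsed concurrently.
dataset = path_ds.interleave(
    lambda x, y: tf.data.Dataset.from_tensors((x, y)).map(
        lambda p, l: tf.py_function(_parser, [p, l, category_count],
                                    [tf.float32, tf.uint8])),
    cycle_length=4,
    num_parallel_calls=AUTOTUNE)

# tf.data transformations return new datasets, so cache()/repeat() only take
# effect when their result is reassigned.
dataset = dataset.cache(filename='/tmp/dataset.tmp')
dataset = dataset.shuffle(buffer_size=10 * 32)
dataset = dataset.batch(32, drop_remainder=False)
dataset = dataset.prefetch(AUTOTUNE)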

.prefetch() and .cache() not speeding up tf.data.Dataset pipeline

I have a very big dataset of high-resolution images, so I am training over small chunks using Keras fit().
To load chunks into memory I have a generator function which produces a tuple of tensors of variable size, which I pass to tf.data.Dataset to create a data pipeline.
def extract_XY(list_idx):
    # read_images(list_idx) and append each image to a list X
    # process, extract patches and convert to a tensor X of shape (?, 100, 100, 3)
    # (? means a variable-size mini-batch)
    Y = f(X)
    return X, Y

for i in range(epochs):
    for j in range(chunks):
        X, Y = extract_XY(list_idx)  # list_idx changes in each loop
        data = tf.data.Dataset.from_tensor_slices((X, Y)).batch(64).cache().prefetch(tf.data.experimental.AUTOTUNE)
        model.fit(data, epochs=2, verbose=1)
My training with Keras fit() works but is still slow; I see no speedup from .cache() or .prefetch().
Can anyone help me understand whether I am using them correctly in my case?
I can't use the tf.data.Dataset.from_generator option, as my generator doesn't yield a sequence.
Can I make my data pipeline more efficient, i.e. load the next chunk while the model is training? Any suggestions would be helpful.
Does using multiprocessing=True in Keras fit() help more, or is tf.data.Dataset already taking care of the I/O bottleneck?
Thanks in advance!
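One way to load the next chunk while the model trains on the current one is to let tf.data drive the chunk loader itself, so that prefetch() can overlap loading with training. The sketch below is only a possible direction under several assumptions (TF >= 2.4 for output_signature, a hypothetical chunk_indices list, extract_XY as in the question, and one-dimensional float labels):

import tensorflow as tf

def chunk_generator():
    for list_idx in chunk_indices:   # one entry per chunk (assumed)
        X, Y = extract_XY(list_idx)  # X: (?, 100, 100, 3), Y: labels for X
        yield X, Y

chunk_ds = tf.data.Dataset.from_generator(
    chunk_generator,
    output_signature=(
        tf.TensorSpec(shape=(None, 100, 100, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),  # assumed label shape
    ))

# Flatten variable-size chunks into examples, re-batch uniformly, and prefetch
# so the next chunk is being loaded while the GPU trains on the current one.
data = (chunk_ds
        .unbatch()
        .batch(64)
        .prefetch(tf.data.experimental.AUTOTUNE))

model.fit(data, epochs=2, verbose=1)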

TF Dataset API: Is the following sequence correct? map,cache,shuffle,batch,repeat,prefetch

I am using this sequence to read image files from disk and feed them into a TF Keras model.
#Make dataset for training
dataset_train = tf.data.Dataset.from_tensor_slices((file_ids_training,file_names_training))
dataset_train = dataset_train.flat_map(lambda file_id, file_name: tf.data.Dataset.from_tensor_slices(
    tuple(tf.py_func(_get_data_for_dataset, [file_id, file_name], [tf.float32, tf.float32]))))
dataset_train = dataset_train.cache()
dataset_train= dataset_train.shuffle(buffer_size=train_buffer_size)
dataset_train= dataset_train.batch(train_batch_size) #Make dataset, shuffle, and create batches
dataset_train= dataset_train.repeat()
dataset_train = dataset_train.prefetch(1)
dataset_train_iterator = dataset_train.make_one_shot_iterator()
get_train_batch = dataset_train_iterator.get_next()
I am wondering whether this is the most optimal sequence.
For example, should repeat() come after shuffle() and before batch()? Should cache() come after batch()?
The answer here Output differences when changing order of batch(), shuffle() and repeat() suggests repeat or shuffle before batching. The order I often use is (1) shuffle, (2) repeat, (3) map, (4) batch but it can vary based on your preferences. I use shuffle before repeat to avoid blurring epoch boundaries. I use map before batch because my mapping function applies to a single example (not to a batch of examples) but you can certainly write a map function that is vectorized and expects to see a batch as input.
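As a concrete illustration of that order applied to the pipeline from the question (a sketch only; cache() is left out and the variable names are taken from the question):

dataset_train = tf.data.Dataset.from_tensor_slices((file_ids_training, file_names_training))
dataset_train = dataset_train.shuffle(buffer_size=train_buffer_size)  # (1) shuffle
dataset_train = dataset_train.repeat()                                # (2) repeat
dataset_train = dataset_train.flat_map(                               # (3) map (per example)
    lambda file_id, file_name: tf.data.Dataset.from_tensor_slices(
        tuple(tf.py_func(_get_data_for_dataset, [file_id, file_name],
                         [tf.float32, tf.float32]))))
dataset_train = dataset_train.batch(train_batch_size)                 # (4) batch
dataset_train = dataset_train.prefetch(1)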
I'd suggest using the following order:
dataset = (dataset
    .cache(filename='./data/cache/')
    .shuffle(BUFFER_SIZE)
    .repeat(Epoch)
    .map(func, num_parallel_calls=tf.data.AUTOTUNE)
    .filter(fltr)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE))
This way, to further speed up training, the processed data will be saved in binary format (done automatically by tf) by calling cache. The data will be saved in the cache file after the whole dataset has been shuffled and repeated. After that, just like @shivaraj said, use the map and filter functions before batching the data. Lastly, call prefetch, as described in the tf documentation, to prepare the next batch of data while the GPU is working on the previous one.
Note:
Calling cache will take a lot of time on the first call, depending on the data size and the memory available. But it speeds up training by at least 4 times if you need to run multiple experiments while not making any changes to the dataset's inputs and outputs (labels).
Changing the point at which cache is called also affects the time it takes to create the cache files. I found this order to be the fastest in every respect, and it also doesn't raise any warnings.
If you are reading images and preprocessing them through a function, then use batch after the map function.
If you use batch before map, then the function does not get individual filenames; instead, the map function receives a rank-1 tensor of filenames, and reading fails with:
ValueError: Shape must be rank 0 but is rank 1 for '{{node ReadFile}} = ReadFile[](args_0)' with input shapes: [?].
Hence the sequence is
dataset = tf.data.Dataset.from_tensor_slices(file_paths)
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.repeat() # can be after batch
dataset = dataset.map(parse_images)
dataset = dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)  # .repeat() can go here instead of above
Although you can also choose to place repeat after batch, it doesn't affect your execution.
The buffer size in shuffle decides the magnitude of randomness you can introduce: the bigger the buffer size, the better the randomness, but you also need more RAM (usually > 8 GB).
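For completeness, here is a minimal sketch of what a parse_images function like the one used above might look like; the JPEG format and target size are assumptions:

import tensorflow as tf

def parse_images(file_path):
    # file_path is a rank-0 string tensor (one path per call, since map comes before batch).
    raw = tf.io.read_file(file_path)
    image = tf.io.decode_jpeg(raw, channels=3)           # assumes JPEG inputs
    image = tf.image.resize(image, [224, 224]) / 255.0   # assumed target size and scaling
    return image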

Using Tensorflow Datasets and Estimators with More Data than Ram

I've recently switched my modeling framework to use custom Tensorflow Estimators and Datasets, and am quite happy overall with this workflow.
However, I've just noticed an issue with how my dataset_input_fn loads data from tfrecords. My input function is modeled after the example in the Tensorflow documentation. The issue arises when I have more examples than I can fit into RAM. If I have 1e6 examples and set my shuffle buffer_size to 1e5, a subset of 1e5 examples is selected once, shuffled, and then iterated on, meaning my model is only trained on 10% of my overall dataset. My code that sets up this behavior is borrowed exactly from the Tensorflow documentation example code:
dataset = dataset.map(parser)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
My question: is it possible to fill the shuffle buffer with new examples outside of the initial 1e5 as I train? Is this type of functionality supported with a one_shot_iterator? Do I need to use an initializable iterator?
Thanks!
I have found what appears to be a tenable workaround for now. Through some experimentation, I learned that when instantiating a TFRecordDataset,
filenames = ["file1.tfrecord", ..., "filen.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
and setting up a shuffle buffer:
dataset = dataset.shuffle(buffer_size=10000)
the buffer is only populated with the first 10000 examples from however many tf records that requires. For example, in my case, I have ~300 tfrecord files containing 4096 examples each. On examination, my shuffle buffer appears to consist only of examples from the first 3 tf records in my filenames list. Since my filenames list is static, this means that my model is only trained on my first 3 tfrecords!
My fix for now is pretty simple. In my training loop I already alternate between Estimator.train and Estimator.evaluate, and I noticed that each time I call Estimator.train, the shuffle buffer is repopulated. My solution is therefore to shuffle my filenames each time my input_fn is called. This is not a very elegant solution, but it does achieve the desired effect of allowing me to iterate across all tfrecords.
#My Crappy Fix: shuffle file names in input_fn
np.random.shuffle(filenames)
dataset = tf.data.TFRecordDataset(filenames)
What's annoying about this solution (aside from its kludginess) is that my minibatches are not "globally random". Rather, they are selected from a small subset of tf records, and only that subset is used for each training/evaluation cycle. One way to mitigate this is to increase my shuffle buffer size or decrease my tfrecord size; I'll probably do both. Finally, I think it's worth noting that if
shuffle_buffer_size < (tf_record_size + minibatch_size)
then, as far as I can tell, my TFRecordDataset will pull from a single tfrecord file!
Finally, I don't think the relevant tensorflow documentation conveys these complexities well. The documentation alludes to the ability to train on large datasets that don't fit into memory, but doesn't provide much detail. It seems unlikely that the tf authors had in mind my hacky strategy when writing this, so I remain curious to see if there's a better approach.
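For comparison, a pattern that is often used to get closer to globally random minibatches is to shuffle the file list inside the pipeline and interleave reads across several tfrecord files at once. The sketch below reuses parser and num_epochs from the question; filenames_list, the cycle_length and the buffer sizes are placeholder assumptions:

import tensorflow as tf

filenames = tf.data.Dataset.from_tensor_slices(filenames_list)  # list of tfrecord paths (assumed)
filenames = filenames.shuffle(buffer_size=300)  # reshuffles the file order on every pass

# Pull records from several files at once so the example-level shuffle buffer
# mixes records across files instead of draining one file at a time.
dataset = filenames.interleave(
    tf.data.TFRecordDataset,
    cycle_length=8,
    block_length=1)

dataset = dataset.map(parser)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)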