Tensorflow Data API - prefetch - tensorflow

I am trying to use new features of TF, namely Data API, and I am not sure how prefetch works. In the code below
def dataset_input_fn(...)
dataset = tf.data.TFRecordDataset(filenames, compression_type="ZLIB")
dataset = dataset.map(lambda x:parser(...))
dataset = dataset.map(lambda x,y: image_augmentation(...)
, num_parallel_calls=num_threads
)
dataset = dataset.shuffle(buffer_size)
dataset = dataset.batch(batch_size)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
does it matter between each lines above I put dataset=dataset.prefetch(batch_size)? Or maybe it should be after every operation that would be using output_buffer_size if the dataset was coming from tf.contrib.data?

In discussion on github I found a comment by mrry:
Note that in TF 1.4 there will be a Dataset.prefetch() method that
makes it easier to add prefetching at any point in the pipeline, not
just after a map(). (You can try it by downloading the current nightly
build.)
and
For example, Dataset.prefetch() will start a background thread to
populate a ordered buffer that acts like a tf.FIFOQueue, so that
downstream pipeline stages need not block. However, the prefetch()
implementation is much simpler, because it doesn't need to support as
many different concurrent operations as a tf.FIFOQueue.
so it means prefetch could be put by any command and it works on the previous command. So far I have noticed the biggest performance gains by putting it only at the very end.
There is one more discussion on Meaning of buffer_size in Dataset.map , Dataset.prefetch and Dataset.shuffle where mrry explains a bit more about the prefetch and buffer.
UPDATE 2018/10/01:
From version 1.7.0 Dataset API (in contrib) has an option to prefetch_to_device. Note that this transformation has to be the last in the pipeline and when TF 2.0 arrives contrib will be gone. To have prefetch work on multiple GPUs please use MultiDeviceIterator (example see #13610) multi_device_iterator_ops.py.
https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/data/prefetch_to_device

Related

How exactly does tf.data.Dataset.interleave() differ from map() and flat_map()?

My current understanding is:
Different map_func: Both interleave and flat_map expect "A function mapping a dataset element to a dataset". In contrast, map expects "A function mapping a dataset element to another dataset element".
Arguments: Both interleave and map offer the argument num_parallel_calls, whereas flat_map does not. Moreover, interleave offers these magical arguments block_length and cycle_length. For cycle_length=1, the documentation states that the outputs of interleave and flat_map are equal.
Last, I have seen data loading pipelines without interleave as well as ones with interleave. Any advice when to use interleave vs. map or flat_map would be greatly appreciated
//EDIT: I do see the value of interleave, if we start out with different datasets, such as in the code below
files = tf.data.Dataset.list_files("/path/to/dataset/train-*.tfrecord")
dataset = files.interleave(tf.data.TFRecordDataset)
However, is there any benefit of using interleave over map in a scenario such as the one below?
files = tf.data.Dataset.list_files("/path/to/dataset/train-*.png")
dataset = files.map(load_img, num_parallel_calls=tf.data.AUTOTUNE)
Edit:
Can map not also be used to parallelize I/O?
Indeed, you can read images and labels from a directory with map function. Assume this case:
list_ds = tf.data.Dataset.list_files(my_path)
def process_path(path):
### get label here etc. Images need to be decoded
return tf.io.read_file(path), label
new_ds = list_ds.map(process_path,num_parallel_calls=tf.data.experimental.AUTOTUNE)
Note that, now it is multi-threaded as num_parallel_calls has been set.
The advantage of interlave() function:
Suppose you have a dataset
With cycle_length you can out that many elements from the dataset, i.e 5, then 5 elements are out from the dataset and a map_func can be applied.
After, fetch dataset objects from newly generated objects, block_length pieces of data each time.
In other words, interleave() function can iterate through your dataset while applying a map_func(). Also, it can work with many datasets or data files at the same time. For example, from the docs:
dataset = dataset.interleave(lambda x:
tf.data.TextLineDataset(x).map(parse_fn, num_parallel_calls=1),
cycle_length=4, block_length=16)
However, is there any benefit of using interleave over map in a
scenario such as the one below?
Both interleave() and map() seems a bit similar but their use-case is not the same. If you want to read dataset while applying some mapping interleave() is your super-hero. Your images may need to be decoded while being read. Reading all first, and decoding may be inefficient when working with large datasets. In the code snippet you gave, AFAIK, the one with tf.data.TFRecordDataset should be faster.
TL;DR interleave() parallelizes the data loading step by interleaving the I/O operation to read the file.
map() will apply the data pre-processing to the contents of the datasets.
So you can do something like:
ds = train_file.interleave(lambda x: tf.data.Dataset.list_files(directory_here).map(func,
num_parallel_calls=tf.data.experimental.AUTOTUNE)
tf.data.experimental.AUTOTUNE will decide the level of parallelism for buffer size, CPU power, and also for I/O operations. In other words, AUTOTUNE will handle the level dynamically at runtime.
num_parallel_calls argument spawns multiple threads to utilize multiple cores for parallelizing the tasks. With this you can load multiple datasets in parallel, reducing the time waiting for the files to be opened; as interleave can also take an argument num_parallel_calls. Image is taken from docs.
In the image, there are 4 overlapping datasets, that is determined by the argument cycle_length, so in this case cycle_length = 4.
FLAT_MAP: Maps a function across the dataset and flattens the result. If you want to make sure order stays the same you can use this. And it does not take num_parallel_calls as an argument. Please refer docs for more.
MAP:
The map function will execute the selected function on every element of the Dataset separately. Obviously, data transformations on large datasets can be expensive as you apply more and more operations. The key point is, it can be more time consuming if CPU is not fully utilized. But we can use parallelism APIs:
num_of_cores = multiprocessing.cpu_count() # num of available cpu cores
mapped_data = data.map(function, num_parallel_calls = num_of_cores)
For cycle_length=1, the documentation states that the outputs of
interleave and flat_map are equal
cycle_length --> The number of input elements that will be processed concurrently. When set it to 1, it will be processed one-by-one.
INTERLEAVE: Transformation operations like map can be parallelized.
With parallelism of the map, at the top the CPU is trying to achieve parallelization in transformation, but the extraction of data from the disk can cause overhead.
Besides, once the raw bytes are read into memory, it may also be necessary to map a function to the data, which of course, requires additional computation. Like decrypting data etc. The impact of the various data extraction overheads needs to be parallelized in order to mitigate this with interleaving the contents of each dataset.
So while reading the datasets, you want to maximize:
Source of image: deeplearning.ai

Create round-robin sharding while generating sharded tfrecords

I am new to tensorflow and I am working on image segmentation problem in tensorflow 1.14. I have a huge dataset and generating tfrecords is very slow, when I try to generate one big tfrecord file. So, I would like to create 'n' shards of tfrecords. I could not find a way to do it online. Say I have 600 images and 600 masks. I want to generate 6 shards of tfrecords, with 100 images and 100 masks each in round robin fashion. A high level /pseudo-code of what I want is as follows -
sharded_tf_record_writer:
create n TFRecordWriter
----> for each_item in n TFRecordWriter
-----> write_example in round-robin fashion
I did search online and could not find relevant answer. I do not want to use apache beam for sharding. I appreciate any idea/help/guidance to achieve this.
I had asked the same question in one of the issues of tensorflow datasets and the user - Conchylicultor responded this -
Writing is done by _TFRecordWriter. Tfds will automatically compute the required number of shards and distribute examples across shards, However each shard is written sequentially.
You do not have control over the number of shards, it is also automatically computed.
However, the fact that examples are distributed between shards do not make the writing faster as examples are not pre-processed in parallel. If you want parallelism, then you'll have to use Apache Beam which allow to scale even to huge datasets
The link to the tensorflow/datasets issue is - https://github.com/tensorflow/datasets/issues/676
This might help.
Since you are working with object detection in tensorflow, there are some nice code in the official Tensorflow models repository that will do what you want. Note this code is for Tensorflow2 (not sure if it'll work in TF1)
See this example of writing sharded tfrecords from coco annotations. The idea is that you open up a list of TFRecordWriter in an exit stack (using contextlib2.ExitStack()), which will automatically close the TFRecords when each thread finishes writing to it.
The utility function open_sharded_output_tfrecords function creates this list of TFRecordWriter
import contextlib2
import tensorflow as tf
with contextlib2.ExitStack() as tf_record_close_stack, tf.gfile.GFile(
annotations_file, 'r'
) as fid:
output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
tf_record_close_stack, output_path, num_shards
)
Next you can use the ProcessPoolExecutor to write tfrecords into each shard in a round-robin fashion in parallel (4 workers in this example)
from concurrent.futures.process import ProcesPoolExecutor
with ProcessPoolExecutor(4) as executor:
for idx, image in enumerate(images):
futures = []
future = executor.submit(
_write_tf_record,
image,
idx,
num_shards,
output_tfrecords,
)
futures.append(future)
for future in futures:
future.result()
where _write_tf_record may look something like this:
def _write_tf_record(image, idx, num_shards, output_tfrecords)
tf_example = create_tf_example(image)
shard_idx = idx % num_shards
output_tfrecords[shard_idx].write(tf_example.SerializeToString())
Just make sure you have more shards than multiprocess workers, otherwise the same writer may be accessed by two different processes.

Dataset API, tf.contrib.data.rejection_resample resource exhausted (too many input files)

Context
I have switched to the dataset API (based on this) and this has resulted in a very significant performance boost compared to using queues.
I am using Tensorflow 1.6.
Problem
I have implemented resampling based on the very helpful explanation here.
The problem is that no matter where I place the resampling stage in the input pipeline, the program returns a ResourceExhaustedError. Changing the batch_size does not seem to fix this and this is only resolved when using a fraction of all input files.
My training files (.tfrecords) are ~200 GB in size and split over a few hundred shards, but the dataset API has handled them very well so far and it's only the resampling that is causing this problem.
Input pipeline example
batch_size = 20000
dataset = tf.data.Dataset.list_files(file_list)
dataset = dataset.apply(tf.contrib.data.parallel_interleave(
tf.data.TFRecordDataset, cycle_length=len(file_list), sloppy=True, block_length=10))
if resample:
dataset = dataset.apply(tf.contrib.data.rejection_resample(class_func=class_func, target_dist=target_dist, initial_dist= initial_dist,seed=5))
dataset = dataset.map(lambda _, data: (data))
dataset = dataset.shuffle(5*batch_size,seed=5)
dataset = dataset.apply(tf.contrib.data.map_and_batch(
map_func=_parse_function, batch_size=batch_size, num_parallel_batches=8))
dataset = dataset.prefetch(10)
return dataset
If anyone has an idea of how to work around this, it would be much appreciated!

feed data into a tf.contrib.data.Dataset like a queue

About the tf.contrib.data.Dataset (from TensorFlow 1.2, see here and here) usage:
The way how to get data doesn't really fit any way how I get the data usually. In my case, I have a thread and I receive data there and I don't know in advance when it will end but I see when it ends. Then I wait until I processed all the buffers and then I have finished one epoch. How can I get this logic with the Dataset?
Note that I prefer the Dataset interface over the QueueBase interface because it gives me the iterator interface which I can reinitialize and even reset to a different Dataset. This is more powerful compared to queues which cannot be reopened currently after they are closed (see here and here).
Maybe a similar question, or the same question: How can I wrap around a Dataset over a queue? I have some thread with reads some data from somewhere and which can feed it and queue it somehow. How do I get the data into the Dataset? I could repeat some dummy tensor infinite times and then use map to just return my queue.dequeue() but that really only gets me back to all the original problems with the queue, i.e. how to reopen the queue.
The new Dataset.from_generator() method allows you to define a Dataset that is fed by a Python generator. (To use this feature at present, you must download a nightly build of TensorFlow or build it yourself from source. It will be part of TensorFlow 1.4.)
The easiest way to implement your example would be to replace your receiving thread with a generator, with pseudocode as follows:
def receiver():
while True:
next_element = ... # Receive next element from external source.
# Note that this method may block.
end_of_epoch = ... # Decide whether or not to stop based on next_element.
if not end_of_epoch:
yield next_element # Note: you may need to convert this to an array.
else:
return # Returning will signal OutOfRangeError on downstream iterators.
dataset = tf.contrib.data.Dataset.from_generator(receiver, output_types=...)
# You can chain other `Dataset` methods after the generator. For example:
dataset = dataset.prefetch(...) # This will start a background thread
# to prefetch elements from `receiver()`.
dataset = dataset.repeat(...) # Note that each repetition will call
# `receiver()` again, and start from
# a fresh state.
dataset = dataset.batch(...)
More complicated topologies are possible. For example, you can use Dataset.interleave() to create many receivers in parallel.

In distributed tensorflow, how to write to summary from workers as well

I am using google cloud ml distributed sample for training a model on a cluster of computers. Input and output (ie rfrecords, checkpoints, tfevents) are all on gs:// (google storage)
Similarly to the distributed sample, I use an evaluation step that is called at the end, and the result is written as a summary, in order to use parameter hypertuning / either within Cloud ML, or using my own stack of tools.
But rather than performing a single evaluation on a large batch of data, I am running several evaluation steps, in order to retrieve statistics on the performance criteria, because I don't want to limited to a single value. I want to get information regarding the performance interval. In particular, the variance of performance is important to me. I'd rather select a model with lower average performance but with better worst cases.
I therefore run several evaluation steps. What I would like to do is to parallelize these evaluation steps because right now, only the master is evaluating. When using large clusters, it is a source of inefficiency, and task workers to evaluate as well.
Basically, the supervisor is created as :
self.sv = tf.train.Supervisor(
graph,
is_chief=self.is_master,
logdir=train_dir(self.args.output_path),
init_op=init_op,
saver=self.saver,
# Write summary_ops by hand.
summary_op=None,
global_step=self.tensors.global_step,
# No saving; we do it manually in order to easily evaluate immediately
# afterwards.
save_model_secs=0)
At the end of training I call the summary writer. :
# only on master, this is what I want to remove
if self.is_master and not self.should_stop:
# I want to have an idea of statistics of accuracy
# not just the mean, hence I run on 10 batches
for i in range(10):
self.global_step += 1
# I call an evaluator, and extract the accuracy
evaluation_values = self.evaluator.evaluate()
accuracy_value = self.model.accuracy_value(evaluation_values)
# now I dump the accuracy, ready to use within hptune
eval_summary = tf.Summary(value=[
tf.Summary.Value(
tag='training/hptuning/metric', simple_value=accuracy_value)
])
self.sv.summary_computed(session, eval_summary, self.global_step)
I tried to write summaries from workers as well , but I got an error : basically summary can be written from masters only. Is there any easy way to workaround ? The error is : "Writing a summary requires a summary writer."
My guess is you'd create a separate summary writer on each worker yourself, and write out summaries directly rather.
I suspect you wouldn't use a supervisor for the eval processing either. Just load a session on each worker for doing eval with the latest checkpoint, and writing out independent summaries.