How can I shuffle a whole dataset with TensorFlow?

Now I use the following function for shuffling:
from tensorflow.contrib import data

def input_pipeline(filenames, batch_size):
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
    dataset = data.TextLineDataset(filenames)
    dataset = dataset.map(decode_func)
    dataset = dataset.shuffle(buffer_size=10000)  # Equivalent to min_after_dequeue=10000.
    dataset = dataset.batch(batch_size)
    # Return an *initializable* iterator over the dataset, which will allow us to
    # re-initialize it at the beginning of each epoch.
    return dataset.make_initializable_iterator()
But this only shuffles the data within a window of buffer_size elements, and the buffer is filled in order. My data is enormous, so I cannot set buffer_size large enough. Is there any other solution to shuffle the whole dataset?

Currently there is no support in the Dataset API for shuffling a whole Dataset (greater than 10k examples). According to this thread, the common approach is:
1. Randomly shuffle the entire data once using a MapReduce/Spark/Beam/etc. job to create a set of roughly equal-sized files ("shards").
2. In each epoch:
   a. Randomly shuffle the list of shard filenames, using Dataset.list_files(...).shuffle(num_shards).
   b. Use dataset.interleave(lambda filename: tf.data.TextLineDataset(filename), cycle_length=N) to mix together records from N different shards.
   c. Use dataset.shuffle(B) to shuffle the resulting dataset. Setting B might require some experimentation, but you will probably want to set it to some value larger than the number of records in a single shard. These per-epoch steps are sketched below.
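Putting the per-epoch steps together, a minimal sketch might look like this (the shard filename pattern and the values of num_shards, N, and B are assumptions for illustration):

import tensorflow as tf

num_shards = 100  # assumption: number of shard files produced by the offline job
N = 8             # assumption: how many shards to read from concurrently
B = 200000        # assumption: larger than the record count of a single shard

# a. Randomly shuffle the list of shard filenames.
filenames = tf.data.Dataset.list_files("data/shard-*.txt").shuffle(num_shards)
# b. Mix together records from N different shards.
dataset = filenames.interleave(
    lambda filename: tf.data.TextLineDataset(filename), cycle_length=N)
# c. Shuffle the resulting records for an approximate global shuffle.
dataset = dataset.shuffle(B)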

Related

TF Dataset API: Is the following sequence correct? map,cache,shuffle,batch,repeat,prefetch

I am using this sequence to read image files from disk and feed them into a TF Keras model.
# Make dataset for training
dataset_train = tf.data.Dataset.from_tensor_slices((file_ids_training, file_names_training))
dataset_train = dataset_train.flat_map(lambda file_id, file_name: tf.data.Dataset.from_tensor_slices(
    tuple(tf.py_func(_get_data_for_dataset, [file_id, file_name], [tf.float32, tf.float32]))))
dataset_train = dataset_train.cache()
dataset_train = dataset_train.shuffle(buffer_size=train_buffer_size)
dataset_train = dataset_train.batch(train_batch_size)  # Make dataset, shuffle, and create batches
dataset_train = dataset_train.repeat()
dataset_train = dataset_train.prefetch(1)
dataset_train_iterator = dataset_train.make_one_shot_iterator()
get_train_batch = dataset_train_iterator.get_next()
I am wondering whether this is the optimal sequence. For example, should repeat() come after shuffle() and before batch()? Should cache() come after batch()?
The answer here, Output differences when changing order of batch(), shuffle() and repeat(), suggests repeating or shuffling before batching. The order I often use is (1) shuffle, (2) repeat, (3) map, (4) batch, but it can vary based on your preferences. I use shuffle before repeat to avoid blurring epoch boundaries, and map before batch because my mapping function applies to a single example (not to a batch of examples), but you can certainly write a map function that is vectorized and expects to see a batch as input. A sketch of that order follows.
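For concreteness, here is a minimal sketch of that (1) shuffle, (2) repeat, (3) map, (4) batch order; parse_fn, file_paths, BUFFER_SIZE, and BATCH_SIZE are hypothetical placeholders, not names from the question:

import tensorflow as tf

file_paths = ["img_0.jpg", "img_1.jpg"]  # hypothetical file list
BUFFER_SIZE = 10000
BATCH_SIZE = 32

def parse_fn(file_path):
    # Hypothetical per-example preprocessing: read and decode one image file.
    image = tf.image.decode_jpeg(tf.io.read_file(file_path), channels=3)
    return tf.image.convert_image_dtype(image, tf.float32)

dataset = tf.data.Dataset.from_tensor_slices(file_paths)
dataset = dataset.shuffle(BUFFER_SIZE)  # (1) shuffle within each epoch
dataset = dataset.repeat()              # (2) repeat after shuffle to keep epoch boundaries sharp
dataset = dataset.map(parse_fn)         # (3) map over single examples
dataset = dataset.batch(BATCH_SIZE)     # (4) batch last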
I'd suggest using the following order:
dataset = (dataset
    .cache(filename='./data/cache/')
    .shuffle(BUFFER_SIZE)
    .repeat(Epoch)
    .map(func, num_parallel_calls=tf.data.AUTOTUNE)
    .filter(fltr)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE))
This way, calling cache first saves the processed data in binary format (done automatically by TF), which further speeds up training; the cached data is then shuffled and repeated. After that, as @shivaraj said, use the map and filter functions before batching the data. Lastly, call prefetch, as the tf documentation says, to prepare the next batch of data while the GPU is working on the previous one.
Note:
Calling cache will take a long time on the first call, depending on the data size and the memory available. But it speeds up training by at least 4x if you need to run multiple experiments while not making any changes to the dataset's inputs and outputs (labels).
Changing where cache is called in the pipeline also affects the time it takes to create the cache files. I found this order to be the fastest in every respect, and it also doesn't raise any warnings.
If you are reading images and preprocessing them through a function, then use batch after the map function.
If you use batch before map, the function does not receive individual filenames; instead, the map function receives a rank-1 tensor (a batch of filenames) and raises:
ValueError: Shape must be rank 0 but is rank 1 for '{{node ReadFile}} = ReadFile[](args_0)' with input shapes: [?].
Hence the sequence is
dataset = tf.data.Dataset.from_tensor_slices(file_paths)
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.repeat()  # can also come after batch
dataset = dataset.map(parse_images)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
Although you can also choose to place repeat after batch, it doesn't affect your execution.
The buffer size in shuffle decides the magnitude of randomness you can introduce: the bigger the buffer, the better the randomness, but you also need more RAM (usually > 8 GB).
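For reference, a parse_images function of the kind used above might look like the sketch below; it is a hypothetical implementation (the decode and resize choices are assumptions), and its scalar file_path argument is exactly why map must run before batch, since ReadFile expects a rank-0 string:

import tensorflow as tf

def parse_images(file_path):
    # file_path must be a scalar (rank-0) string tensor; a batched,
    # rank-1 input here is what triggers the ValueError above.
    image = tf.io.read_file(file_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    return tf.image.resize(image, [224, 224])  # assumption: target size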

Using feed_dict is more than 5x faster than using dataset API?

I created a dataset in TFRecord format for testing. Every entry contains 200 columns, named C1 - C199, each being a list of strings, and a label column to denote the labels. The code to create the data can be found here: https://github.com/codescv/tf-dist/blob/8bb3c44f55939fc66b3727a730c57887113e899c/src/gen_data.py#L25
Then I used a linear model to train the data. The first approach looks like this:
dataset = tf.data.TFRecordDataset(data_file)
dataset = dataset.prefetch(buffer_size=batch_size*10)
dataset = dataset.map(parse_tfrecord, num_parallel_calls=5)
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)
features, labels = dataset.make_one_shot_iterator().get_next()
logits = tf.feature_column.linear_model(features=features, feature_columns=columns, cols_to_vars=cols_to_vars)
train_op = ...
with tf.Session() as sess:
    sess.run(train_op)
The full code can be found here: https://github.com/codescv/tf-dist/blob/master/src/lr_single.py
When I run the code above, I get 0.85 steps/sec (batch size being 1024).
In the second approach, I manually get batches from the Dataset into Python, then feed them to a placeholder, like this:
example = tf.placeholder(dtype=tf.string, shape=[None])
features = tf.parse_example(example, features=tf.feature_column.make_parse_example_spec(columns+[tf.feature_column.numeric_column('label', dtype=tf.float32, default_value=0)]))
labels = features.pop('label')
train_op = ...
dataset = tf.data.TFRecordDataset(data_file).repeat().batch(batch_size)
next_batch = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    data_batch = sess.run(next_batch)
    sess.run(train_op, feed_dict={example: data_batch})
The full code can be found here: https://github.com/codescv/tf-dist/blob/master/src/lr_single_feed.py
When I run the code above, I get 5 steps/sec. That is 5x faster than the first approach. This is what I do not understand, because theoretically the second should be slower due to the extra serialization/deserialization of data batches.
Thanks!
There is currently (as of TensorFlow 1.9) a performance issue when using tf.data to map and batch tensors that have a large number of features with a small amount of data in each. The issue has two causes:
1. The dataset.map(parse_tfrecord, ...) transformation will execute O(batch_size * num_columns) small operations to create a batch. By contrast, feeding a tf.placeholder() to tf.parse_example() will execute O(1) operations to create the same batch.
2. Batching many tf.SparseTensor objects using dataset.batch() is much slower than directly creating the same tf.SparseTensor as the output of tf.parse_example().
Improvements to both these issues are underway, and should be available in a future version of TensorFlow. In the meantime, you can improve the performance of the tf.data-based pipeline by switching the order of the dataset.map() and dataset.batch() and rewriting the dataset.map() to work on a vector of strings, like the feeding based version:
dataset = tf.data.TFRecordDataset(data_file)
dataset = dataset.prefetch(buffer_size=batch_size*10)
dataset = dataset.repeat(num_epochs)
# Batch first to create a vector of strings as input to the map().
dataset = dataset.batch(batch_size)

def parse_tfrecord_batch(record_batch):
    features = tf.parse_example(
        record_batch,
        features=tf.feature_column.make_parse_example_spec(
            columns + [
                tf.feature_column.numeric_column(
                    'label', dtype=tf.float32, default_value=0)]))
    labels = features.pop('label')
    return features, labels

# NOTE: Parallelism might not be as useful, because the individual map function now does
# more work per invocation, but you might want to experiment with this.
dataset = dataset.map(parse_tfrecord_batch)

# Add a prefetch at the end to pipeline execution.
dataset = dataset.prefetch(1)

features, labels = dataset.make_one_shot_iterator().get_next()
# ...
EDIT (2018/6/18): To answer your questions from the comments:
Why is dataset.map(parse_tfrecord, ...) O(batch_size * num_columns), not O(batch_size)? If parsing requires enumeration of the columns, why doesn't parse_example take O(num_columns)?
When you wrap TensorFlow code in a Dataset.map() (or other functional transformation) a constant number of extra operations per output are added to "return" values from the function and (in the case of tf.SparseTensor values) "convert" them to a standard format. When you directly pass the outputs of tf.parse_example() to the input of your model, these operations aren't added. While they are very small operations, executing so many of them can become a bottleneck. (Technically the parsing does take O(batch_size * num_columns) time, but the constants involved in parsing are much smaller than executing an operation.)
Why do you add a prefetch at the end of the pipeline?
When you're interested in performance, this is almost always the best thing to do, and it should improve the overall performance of your pipeline. For more information about best practices, see the performance guide for tf.data.

How to efficiently shuffle a large tf.data.Dataset when using tf.estimator.train_and_evaluate?

The tf.estimator.train_and_evaluate documentation makes it clear that the input dataset must be properly shuffled for the training to see all examples:
Overfitting: In order to avoid overfitting, it is recommended to set up the training input_fn to shuffle the training data properly. It is also recommended to train the model a little longer, say multiple epochs, before performing evaluation, as the input pipeline starts from scratch for each training. It is particularly important for local training and evaluation.
In my application, I would like to uniformly sample examples from the full tf.data.Dataset, for an arbitrary evaluation frequency and shuffle() buffer size. Otherwise, the training can see at most the first
(steps_per_second * eval_delay * batch_size) + buffer_size
elements, effectively discarding the rest. For example, at 10 steps/sec, an evaluation delay of 600 seconds, batch size 1024, and a 10,000-element buffer, that is only the first 10 * 600 * 1024 + 10,000 = 6,154,000 elements, regardless of the dataset size. Is there an efficient way to work around that without loading the complete dataset into system memory?
I considered sharding the dataset based on the buffer size, but if evaluation does not occur frequently, training will iterate over the same shard multiple times (a repeat() closes the pipeline). Ideally, I would like to move to another shard after a complete iteration over the dataset. Is that possible?
Thanks for any pointers!
A random sharding of the dataset can be implemented with this Dataset transformation:
def random_shard(shard_size, dataset_size):
num_shards = -(-dataset_size // shard_size) # Ceil division.
offsets = np.linspace(
0, dataset_size, num=num_shards, endpoint=False, dtype=np.int64)
def _random_shard(dataset):
sharded_dataset = tf.data.Dataset.from_tensor_slices(offsets)
sharded_dataset = sharded_dataset.shuffle(num_shards)
sharded_dataset = sharded_dataset.flat_map(
lambda offset: dataset.skip(offset).take(shard_size))
return sharded_dataset
return _random_shard
This requires knowing the total dataset size in advance. However, if you implement a file-based sharding approach, you would also iterate over the full dataset once, so that is not a major issue.
Regarding efficiency, note that skip(offset) actually iterates over offset examples, so some latency is to be expected when offset is large. Careful prefetching should help with this.
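As a usage sketch (the source file, shard_size, and dataset_size values are assumptions for illustration), the transformation above plugs in with Dataset.apply:

dataset = tf.data.TextLineDataset("train.txt")  # hypothetical source file
dataset = dataset.apply(random_shard(shard_size=100000, dataset_size=5000000))
dataset = dataset.shuffle(buffer_size=100000)   # shuffle within each shard
dataset = dataset.repeat()
dataset = dataset.prefetch(1)                   # hide the skip(offset) latency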

How to speed up batch preparation when using Estimators API combined with tf.data.Dataset

I'd like to speed up my training routine, which uses the Estimator API with an input_fn written using tf.data.Dataset.
My implementation takes 2 seconds to prepare a batch of data, then runs training on the GPU for 1 second, and then starts over preparing the next batch, which is really inefficient.
I'm looking for a way to prepare the batches asynchronously and upload them to the GPU to speed up training, or alternatively, a way to cache datasets between invocations of input_fn (dataset.cache() doesn't seem to be a good choice, as the dataset has to be recreated on each input_fn invocation).
Here is a simplified version of my code:
def input_fn(filenames, labels, epochs):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.map(_read_wav, num_parallel_calls=num_map_threads)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))
    dataset = dataset.map(_post_process, num_parallel_calls=num_map_threads)
    dataset = dataset.map(lambda wav, label: ({'wav': wav}, label))
    dataset = dataset.batch(128)
    dataset = dataset.repeat(epochs)  # to iterate over the training set forever
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels
train_input_fn = lambda : input_fn(train_files, train_labels, None)
eval_input_fn = lambda : input_fn(eval_files, eval_labels, 1)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=45000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
I've noticed that the Estimator API is under active development, and in the master branch of tensorflow the input_fn can already return datasets, so maybe I'm asking too early and this feature isn't ready yet. If so, please provide a ticket where this implementation can be tracked.
Using tf.data.Dataset.cache() is indeed not a good choice since it will cache the whole dataset into memory, which takes time and might overflow your memory.
The way to go is to use tf.data.Dataset.prefetch() at the end of your pipeline, which will always make sure that the data pipeline holds buffer_size elements. It is usually enough to have buffer_size = 1 at the end:
dataset = ...
dataset = dataset.batch(128)
dataset = dataset.prefetch(1) # prefetch one batch
As explained by @mrry in this answer, you can also try to increase the number of prefetched batches a bit.
Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.
If you still have a slow input pipeline compared to your GPU computations, you need to increase the number of threads working in parallel using the num_parallel_calls argument of tf.data.Dataset.map().
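For example, a minimal sketch using the question's own map functions (the thread count is a placeholder to tune):

NUM_THREADS = 4  # assumption: tune to the number of available CPU cores

dataset = dataset.map(_read_wav, num_parallel_calls=NUM_THREADS)
dataset = dataset.map(_post_process, num_parallel_calls=NUM_THREADS)
dataset = dataset.batch(128)
dataset = dataset.prefetch(1)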
A few points to add to Olivier's answer, mostly from this post:
repeat before shuffle is slightly faster, at the downside of blurred epoch boundaries. This may be significant in rare cases, but I doubt it.
shuffle before mapping - this reduces the memory footprint of your shuffle buffer, since it only needs to buffer the filenames rather than the file contents.
it makes more sense to me to apply the third map transform to the output of get_next() rather than to the dataset - I'm not sure if that affects speed much. You could also consider merging the other two map calls into one to reduce scheduling issues.
experiment with repeat before batching. It probably won't make a difference, but it might be a minor win. If you repeat before shuffle, as mentioned above, you will have to.
as mentioned by Olivier, use prefetch.
Code with modifications:
def input_fn(filenames, labels, epochs):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.repeat(epochs)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))

    def combined_map_fn(*args):
        return _post_process(_read_wav(*args))

    dataset = dataset.map(combined_map_fn, num_parallel_calls=num_map_threads)
    dataset = dataset.batch(128)
    dataset = dataset.prefetch(1)
    iterator = dataset.make_one_shot_iterator()
    wavs, labels = iterator.get_next()
    features = {'wav': wavs}
    return features, labels

Memory management in Tensorflow's Dataset API

I have a training dataset that is too big to fit into memory, so my code reads only 1,000 records from disk at a time. Now I would like to use Tensorflow's new Dataset API. Does the Dataset API allow me to specify the number of records to keep in memory or does Tensorflow automatically manage memory so that I don't have to?
Yes. Here is an example from the official guide (Using the Dataset API for TensorFlow Input Pipelines, https://www.tensorflow.org/programmers_guide/datasets):
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(...) ## Parsing data with a user specified function
dataset = dataset.shuffle(buffer_size=10000) ## 10000: size of sample/record pool for random selection
dataset = dataset.repeat() ## None: keep repeating
dataset = dataset.batch(32) ## 32: number of samples/records per batch (to be read into memory)
You can specify the number of records via batch_size; in that case, TF will grab only batch_size elements from the file at a time. You can also specify shuffle, which guarantees that at any time there will be at most buffer_size elements in memory.
I verified this on my tfrecords files. I have 100 tfrecords files, each ~10 GB (which is more than the memory on my laptop), and everything works fine.