I have a training dataset that is too big to fit into memory, so my code reads only 1,000 records from disk at a time. Now I would like to use TensorFlow's new Dataset API. Does the Dataset API allow me to specify the number of records to keep in memory, or does TensorFlow automatically manage memory so that I don't have to?
Yes. Here is an example from the official guide, "Using the Dataset API for TensorFlow Input Pipelines" (https://www.tensorflow.org/programmers_guide/datasets):
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(...) ## Parsing data with a user specified function
dataset = dataset.shuffle(buffer_size=10000) ## 10000: size of sample/record pool for random selection
dataset = dataset.repeat() ## None: keep repeating
dataset = dataset.batch(32) ## 32: number of samples/records per batch (to be read into memory)
You specify the number of records per batch via batch_size, and TF will pull only batch_size elements from the file at a time. If you also call shuffle, then at most buffer_size elements will be held in memory at any time.
I verified this on my tfrecords files: I have 100 of them, each ~10 GB (more than the memory on my laptop), and everything works fine.
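For completeness, here is a minimal sketch of how such a pipeline can be consumed with the TF 1.x iterator API; the parse_fn and its feature spec are hypothetical and would need to match how your TFRecords were actually written.

import tensorflow as tf

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]

def parse_fn(serialized):
    # hypothetical feature spec -- adjust to match how the records were written
    features = tf.parse_single_example(serialized, features={
        "image": tf.FixedLenFeature([], tf.string),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    image = tf.decode_raw(features["image"], tf.uint8)
    return image, features["label"]

dataset = tf.data.TFRecordDataset(filenames)   # tf.data replaces tf.contrib.data in TF >= 1.4
dataset = dataset.map(parse_fn)
dataset = dataset.shuffle(buffer_size=10000)   # at most ~10000 parsed records held in memory
dataset = dataset.repeat()
dataset = dataset.batch(32)                    # only 32 records are materialized per step

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    images, labels = sess.run(next_batch)      # each run() pulls one batch from disk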
Related
I am trying to manage a large image dataset that does not fit in memory, while performing some specific computations on it. Currently, my code looks like this:
files = [str(f) for f in self.files]
labels = self.categories
batch_size = 32
dataset = tf.data.Dataset.from_generator(
    lambda: zip(files, labels),
    output_types=(tf.string, tf.uint8),
    output_shapes=(tf.TensorShape([]), tf.TensorShape([]))
)
dataset = dataset.map(
    lambda x, y: tf.py_function(_parser, [x, y, category_count], [tf.float32, tf.uint8]),
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
    deterministic=False)
dataset.cache(filename='/tmp/dataset.tmp')
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.shuffle(buffer_size=10*batch_size, reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset.repeat(None)
else:
    dataset.repeat(1)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
The _parser() function opens an image file, does a bunch of transformations, and returns a tensor and a one-hot encoded vector. The caching step does not seem to work properly, however:
There is no significant improvement in computation time between the first epoch and the following ones.
No cache file is created during the process, although the swap partition is almost full (~90%).
Does the cache() function create a file only when both the memory and the swap partition are full? Furthermore, I expected only batch_size files to be read at a time; however, it seems that all files are read at once during the mapping step. Should I consider using interleave() combined with from_generator() instead? Or should I batch the files first and then map them?
Note that cache() should be used when the dataset is small. If the dataset is large (as in your case), RAM will not be sufficient to hold its contents, so it does not fit into memory. You either need to increase the amount of RAM or use some other method to speed up the training.
Another reason for slow training is the preprocessing stage, where you use the map() function.
The map() method applies a transformation to each element, whereas the apply() method applies a transformation to the dataset as a whole.
You can use interleave() and retain the same order: map() first, then batch().
You are already using parallelism via num_parallel_calls, and setting it to tf.data.experimental.AUTOTUNE makes the best use of whatever resources are available.
You can also normalize your input data and then cache; if it still does not fit into memory, it is better not to cache a large dataset.
You can follow these performance tips from TensorFlow.
If you have multiple workers/devices, that will also help you speed up the training.
(The TensorFlow performance guide includes an illustration showing prefetching combined with multithreaded loading and preprocessing.)
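As a rough sketch of applying some of these suggestions to the pipeline in the question (map before batch with AUTOTUNE, prefetch at the end, and no cache since the data does not fit in memory), assuming a hypothetical _parse_image function in place of the original _parser:

import tensorflow as tf

def _parse_image(path, label):
    # hypothetical per-file parser: read, decode, normalize, resize
    image = tf.io.read_file(path)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [224, 224])
    return image, label

batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((files, labels))
dataset = dataset.shuffle(buffer_size=len(files))   # cheap: only filename strings are buffered
dataset = dataset.map(_parse_image,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE,
                      deterministic=False)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

This way only the filenames sit in the shuffle buffer, and the decoded images that are held in memory are limited to the batches being prepared and prefetched.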
I am using this sequence to read image files from disk and feed them into a TF Keras model.
# Make the dataset for training
dataset_train = tf.data.Dataset.from_tensor_slices((file_ids_training, file_names_training))
dataset_train = dataset_train.flat_map(lambda file_id, file_name: tf.data.Dataset.from_tensor_slices(
    tuple(tf.py_func(_get_data_for_dataset, [file_id, file_name], [tf.float32, tf.float32]))))
dataset_train = dataset_train.cache()
dataset_train = dataset_train.shuffle(buffer_size=train_buffer_size)
dataset_train = dataset_train.batch(train_batch_size)  # make dataset, shuffle, and create batches
dataset_train = dataset_train.repeat()
dataset_train = dataset_train.prefetch(1)
dataset_train_iterator = dataset_train.make_one_shot_iterator()
get_train_batch = dataset_train_iterator.get_next()
I am wondering whether this is the optimal sequence.
For example, should repeat() come after shuffle() and before batch()? Should cache() come after batch()?
The answer to "Output differences when changing order of batch(), shuffle() and repeat()" suggests applying repeat or shuffle before batching. The order I often use is (1) shuffle, (2) repeat, (3) map, (4) batch, but it can vary based on your preferences. I use shuffle before repeat to avoid blurring epoch boundaries, and map before batch because my mapping function applies to a single example (not to a batch of examples), but you can certainly write a vectorized map function that expects to see a batch as input.
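A hedged sketch of how the pipeline from the question could be rearranged to follow that (1) shuffle, (2) repeat, (3) map, (4) batch order; note that the shuffle now operates on the (file_id, file_name) pairs rather than on parsed examples, and cache() is omitted here (see the next answer for where to place it):

dataset_train = tf.data.Dataset.from_tensor_slices((file_ids_training, file_names_training))
dataset_train = dataset_train.shuffle(buffer_size=train_buffer_size)  # (1) shuffle before repeat: clean epoch boundaries
dataset_train = dataset_train.repeat()                                # (2) repeat
dataset_train = dataset_train.flat_map(                               # (3) parse single examples before batching
    lambda file_id, file_name: tf.data.Dataset.from_tensor_slices(
        tuple(tf.py_func(_get_data_for_dataset, [file_id, file_name], [tf.float32, tf.float32]))))
dataset_train = dataset_train.batch(train_batch_size)                 # (4) batch
dataset_train = dataset_train.prefetch(1)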
I'd suggest using the following order:
dataset = (dataset
    .cache(filename='./data/cache/')
    .shuffle(BUFFER_SIZE)
    .repeat(Epoch)
    .map(func, num_parallel_calls=tf.data.AUTOTUNE)
    .filter(fltr)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE))
This way, calling cache first saves the data in a binary format (done automatically by TF), which further speeds up training. The data is written to the cache file, after which the whole dataset is shuffled and repeated. After that, as @shivaraj said, use the map and filter functions before batching the data. Lastly, call prefetch, as described in the TF documentation, to prepare data ahead of time while the GPU is working on the previous batch.
Note:
Calling cache will take a lot of time on the first epoch, depending on the data size and the memory available. But it speeds up training by at least 4x if you need to run multiple experiments without changing the dataset's inputs and outputs (labels).
Changing the position of cache in the pipeline also affects the time it takes to create the cache files. I found this order to be the fastest in every respect, and it also doesn't raise any warnings.
If you are reading images and preprocessing them through a function, use batch after the map function.
If you use batch before map, the function does not receive individual filenames; instead, it receives a rank-1 list of filenames, which produces an error like:
ValueError: Shape must be rank 0 but is rank 1 for '{{node ReadFile}} = ReadFile[](args_0)' with input shapes: [?].
Hence the sequence is:
dataset = tf.data.Dataset.from_tensor_slices(file_paths)
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.repeat() # can be after batch
dataset = dataset.map(parse_images)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # repeat() could instead be placed here, after batch
You can also choose to place repeat after batch; it doesn't affect your execution.
The buffer size in shuffle determines how much randomness is introduced: the bigger the buffer, the better the shuffling, but you need correspondingly more RAM (usually > 8 GB).
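For reference, parse_images above is assumed to be a per-file decode function; a minimal hypothetical version might look like this:

def parse_images(file_path):
    # receives a single scalar filename because map() runs before batch()
    image = tf.io.read_file(file_path)               # rank-0 string, so ReadFile is happy
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [256, 256])
    return image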
I'd like to speed up my training routine, which uses the Estimator API with an input_fn written using tf.data.Dataset.
My implementation takes 2 seconds to prepare a batch of data, then runs training on the GPU for 1 second, and then starts over preparing the next batch, which is really inefficient.
I'm looking for a way to prepare the batches asynchronously and upload them to the GPU to speed up the training, or alternatively for a way to cache datasets between invocations of input_fn (dataset.cache() doesn't seem to be a good choice, as the dataset has to be recreated on each input_fn invocation).
Here is a simplified version of my code:
def input_fn(filenames, labels, epochs):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.map(_read_wav, num_parallel_calls=num_map_threads)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))
    dataset = dataset.map(_post_process, num_parallel_calls=num_map_threads)
    dataset = dataset.map(lambda wav, label: ({'wav': wav}, label))
    dataset = dataset.batch(128)
    dataset = dataset.repeat(epochs)  # with epochs=None, iterate over the training set forever
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels
train_input_fn = lambda : input_fn(train_files, train_labels, None)
eval_input_fn = lambda : input_fn(eval_files, eval_labels, 1)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=45000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
I've noticed that the Estimator API is under active development, and in the master branch of TensorFlow input_fn can already return a Dataset, so maybe I'm asking too early and this feature isn't ready yet. If so, please provide a ticket where its implementation can be tracked.
Using tf.data.Dataset.cache() is indeed not a good choice since it will cache the whole dataset into memory, which takes time and might overflow your memory.
The way to go is to use tf.data.Dataset.prefetch() at the end of your pipeline, which ensures that the data pipeline always holds buffer_size elements ready. It is usually enough to have buffer_size = 1 at the end:
dataset = ...
dataset = dataset.batch(128)
dataset = dataset.prefetch(1) # prefetch one batch
As explained by @mrry in this answer, you can also try to increase the number of prefetched batches a bit.
Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.
If your input pipeline is still slow compared to your GPU computations, you need to increase the number of threads working in parallel using the num_parallel_calls argument of tf.data.Dataset.map().
A few points to add to Olivier's answer, mostly from this post:
repeat before shuffle is slightly faster, at the cost of blurred epoch boundaries. This may be significant in rare cases, but I doubt it.
shuffle before mapping - this reduces the memory footprint of your shuffle buffer, since it only needs to buffer the filenames rather than the file contents.
it makes more sense to me to apply the third map transform to the output of get_next() rather than to the dataset - not sure if that affects speed much. You could also consider merging the two other map calls into a single one to reduce scheduling overhead.
experiment with repeat before batching. It probably won't make much difference, but it might be minor. If you repeat before shuffle, as mentioned above, you will have to.
as mentioned by Olivier, use prefetch.
Code with modifications:
def input_fn(filenames, labels, epochs):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.repeat(epochs)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))

    def combined_map_fn(*args):
        return _post_process(_read_wav(*args))

    dataset = dataset.map(combined_map_fn, num_parallel_calls=num_map_threads)
    dataset = dataset.batch(128)
    dataset = dataset.prefetch(1)

    iterator = dataset.make_one_shot_iterator()
    wavs, labels = iterator.get_next()
    features = {'wav': wavs}
    return features, labels
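As a side note on the question's last point: in newer TensorFlow releases the Estimator API accepts an input_fn that returns the tf.data.Dataset itself, so the manual iterator can be dropped. A hedged sketch of the same pipeline in that style, reusing the names from above:

def input_fn(filenames, labels, epochs):
    def combined_map_fn(*args):
        return _post_process(_read_wav(*args))

    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.repeat(epochs)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))
    dataset = dataset.map(combined_map_fn, num_parallel_calls=num_map_threads)
    dataset = dataset.map(lambda wav, label: ({'wav': wav}, label))
    dataset = dataset.batch(128)
    return dataset.prefetch(1)   # the Estimator builds the iterator itself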
I have a data generator for training a CNN and it works fine. Now I want to speed up training on 2 GPUs (on one PC) by following cifar10_multi_gpu_train.py (https://www.tensorflow.org/programmers_guide/threading_and_queues).
Questions:
1) How do I convert the data generator to a queue? A data item is (image file path, output); the whole dataset is a list of data items; a batch is a subset of the whole dataset. How do I put it into a tensor like the following:
batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
    [images, labels], capacity=2 * FLAGS.num_gpus)
2) What is the content of the queue?
1.a) Does the queue take the whole dataset or a single batch?
1.b) It seems to me that in the cifar10 sample the queue holds one batch. But then how does it cycle through the whole dataset?
1.c) If the queue takes the whole dataset, what data does each GPU thread get? In that case, I am not sure I understand how concurrent GPU training is possible, since each loss and gradient calculation depends on the same model state, and the next loss+gradient calculation can only happen after the previous one has finished updating the model weights.
One approach that might work is something like:
Build lists:
image_list = [("file%d" % i) for i in range(100)]
label_list = read_label_list_from_disk(path)
# convert image_list and label_list to tensors (image_tensor, label_tensor)
Producer:
image_filename_queue, label_queue = \
    tf.train.slice_input_producer([image_tensor, label_tensor], ..)
Reader:
reader = tf.WholeFileReader()
key, value = reader.read(image_filename_queue)
images = tf.image.decode_png(value)
Batching:
image_batch, label_batch = \
    tf.train.batch([images, label_queue], batch_size=batch_size)
Prefetching:
batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
    [image_batch, label_batch], capacity=2 * gpus)
Hopefully, this also explains the concept of queues in TensorFlow.
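For comparison, roughly the same pipeline can be expressed with the Dataset API, which is the direction the rest of this thread takes. This is only a sketch; it assumes PNG images of a uniform shape (as tf.train.batch also requires) and reuses the batch_size and gpus names from above.

dataset = tf.data.Dataset.from_tensor_slices((image_list, label_list))
dataset = dataset.shuffle(buffer_size=len(image_list))

def _load(path, label):
    # decode one image per filename
    image = tf.image.decode_png(tf.read_file(path))
    return image, label

dataset = dataset.map(_load, num_parallel_calls=4)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(2 * gpus)   # plays roughly the role of prefetch_queue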
Currently I use the following function for shuffling:
from tensorflow.contrib import data

def input_pipeline(filenames, batch_size):
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
    dataset = data.TextLineDataset(filenames)
    dataset = dataset.map(decode_func)
    dataset = dataset.shuffle(buffer_size=10000)  # equivalent to min_after_dequeue=10000
    dataset = dataset.batch(batch_size)
    # Return an *initializable* iterator over the dataset, which will allow us to
    # re-initialize it at the beginning of each epoch.
    return dataset.make_initializable_iterator()
But it only shuffles within a window of buffer_size elements, and it fills the buffer in order.
My data is enormous, so I cannot set buffer_size very large. Are there any other solutions for shuffling the whole dataset?
Currently there is no support in the Dataset API for shuffling a whole dataset that is larger than the shuffle buffer (e.g. more than 10k examples). According to this thread, the common approach is:
Randomly shuffle the entire data once using a MapReduce/Spark/Beam/etc. job to create a set of roughly equal-sized files ("shards").
In each epoch:
a. Randomly shuffle the list of shard filenames, using Dataset.list_files(...).shuffle(num_shards).
b. Use dataset.interleave(lambda filename: tf.data.TextLineDataset(filename), cycle_length=N) to mix together records from N different shards.
c. Use dataset.shuffle(B) to shuffle the resulting dataset. Setting B might require some experimentation, but you will probably want to set it to some value larger than the number of records in a single shard.
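A sketch of what those steps can look like in code; the shard filename pattern, cycle_length, and buffer size B here are assumptions to adapt to your data:

num_shards = 100                                        # assumption: data pre-sharded into 100 files
files = tf.data.Dataset.list_files("/data/train-shard-*.txt")
files = files.shuffle(buffer_size=num_shards)           # step a: shuffle shard order

dataset = files.interleave(
    lambda filename: tf.data.TextLineDataset(filename),
    cycle_length=8)                                     # step b: mix records from 8 shards at a time

dataset = dataset.shuffle(buffer_size=50000)            # step c: B larger than records per shard
dataset = dataset.map(decode_func)
dataset = dataset.batch(batch_size)
dataset = dataset.repeat()                              # repeat for multiple epochs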