How can we benefit from sharding the data to speed up training time? - tensorflow

My main issue is: I have 204 GB of training TFRecords for 2 million images, and 28 GB of validation TFRecords for 302,900 images. It takes 8 hours to train one epoch, so training will take 33 days. I want to speed that up by using multiple threads and shards, but I am a little confused about a couple of things.
In the tf.data.Dataset API there is a shard function. The documentation says the following about it:
Creates a Dataset that includes only 1/num_shards of this dataset.
This dataset operator is very useful when running distributed training, as it allows each worker to read a unique subset.
When reading a single input file, you can skip elements as follows:
d = tf.data.TFRecordDataset(FLAGS.input_file)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
Important caveats:
Be sure to shard before you use any randomizing operator (such as shuffle).
Generally it is best if the shard operator is used early in the dataset pipeline. For example, when reading from a set of TFRecord files, shard before converting the dataset to input samples. This avoids reading every file on every worker. The following is an example of an efficient sharding strategy within a complete pipeline:
d = Dataset.list_files(FLAGS.pattern)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.repeat()
d = d.interleave(tf.data.TFRecordDataset,
cycle_length=FLAGS.num_readers, block_length=1)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
So my question about the code above is: when I shard my data using the shard function and set the number of shards (num_workers) to 10, I will have 10 splits of my data. Should I then set num_readers in the d.interleave function to 10, to guarantee that each reader takes one of the 10 splits?
And how can I control which split the interleave function will take? If I set the shard index (worker_index) in the shard function to 1, it will give me the first split. Can anyone give me an idea of how I can perform this distributed training using the above functions?
Then what about num_parallel_calls? Should I set it to 10 as well?
Note that I have a single TFRecord file for training and another one for validation; I have not split the TFRecord files into multiple files.

First of all, how come the dataset is 204 GB for only 2 million images? I think your images are way too large. Try resizing them. After all, you would probably need to resize them to 224 x 224 in the end.
Second, try to reduce the size of your model. Your model could be either too deep or not efficient enough.
Third, try to parallelize your input reading process. It could be the bottleneck.
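On the sharding part of the question: shard(num_shards, index) keeps only the elements whose position modulo num_shards equals index, so one pipeline with one index sees 1/num_shards of the data rather than 10 splits handed to 10 readers. The sketch below is a pure-Python model of that behaviour, not TensorFlow code; read_records, the file name and the record count are hypothetical stand-ins. It also illustrates why sharding the records of a single file still reads the whole file.

```python
def read_records(path, num_records, stats):
    """Hypothetical stand-in for sequentially reading one TFRecord file."""
    for i in range(num_records):
        stats["records_read"] += 1  # every record is still read from disk
        yield (path, i)


def shard(elements, num_shards, index):
    """Simplified model of Dataset.shard: keep every num_shards-th element."""
    for position, element in enumerate(elements):
        if position % num_shards == index:
            yield element


stats = {"records_read": 0}
records = read_records("train.tfrecord", num_records=20, stats=stats)
kept = list(shard(records, num_shards=10, index=1))

# Only 2 of the 20 records are kept, but all 20 were read to find them.
print(len(kept), stats["records_read"])
```

With the data split into several files, sharding the list of file names instead lets each worker open only its own files.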

Related

How exactly does tf.data.Dataset.interleave() differ from map() and flat_map()?

My current understanding is:
Different map_func: Both interleave and flat_map expect "A function mapping a dataset element to a dataset". In contrast, map expects "A function mapping a dataset element to another dataset element".
Arguments: Both interleave and map offer the argument num_parallel_calls, whereas flat_map does not. Moreover, interleave offers these magical arguments block_length and cycle_length. For cycle_length=1, the documentation states that the outputs of interleave and flat_map are equal.
Last, I have seen data loading pipelines without interleave as well as ones with interleave. Any advice on when to use interleave vs. map or flat_map would be greatly appreciated.
//EDIT: I do see the value of interleave, if we start out with different datasets, such as in the code below
files = tf.data.Dataset.list_files("/path/to/dataset/train-*.tfrecord")
dataset = files.interleave(tf.data.TFRecordDataset)
However, is there any benefit of using interleave over map in a scenario such as the one below?
files = tf.data.Dataset.list_files("/path/to/dataset/train-*.png")
dataset = files.map(load_img, num_parallel_calls=tf.data.AUTOTUNE)
Edit:
Can map not also be used to parallelize I/O?
Indeed, you can read images and labels from a directory with the map function. Assume this case:
list_ds = tf.data.Dataset.list_files(my_path)
def process_path(path):
    # Get the label here, etc. (placeholder: `label` must be derived from `path`.)
    # Images still need to be decoded.
    return tf.io.read_file(path), label
new_ds = list_ds.map(process_path, num_parallel_calls=tf.data.experimental.AUTOTUNE)
Note that this is now multi-threaded, because num_parallel_calls has been set.
The advantage of the interleave() function:
Suppose you have a dataset.
cycle_length sets how many elements are taken from the dataset at a time; for example, with cycle_length=5, 5 elements are taken from the dataset and map_func is applied to each of them.
After that, elements are fetched from the newly generated dataset objects, block_length items at a time from each.
In other words, interleave() function can iterate through your dataset while applying a map_func(). Also, it can work with many datasets or data files at the same time. For example, from the docs:
dataset = dataset.interleave(lambda x:
tf.data.TextLineDataset(x).map(parse_fn, num_parallel_calls=1),
cycle_length=4, block_length=16)
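To make cycle_length and block_length concrete, here is a simplified, sequential pure-Python model of the interleave ordering. It is a sketch and not TensorFlow's implementation: the interleave function below is written for illustration only and ignores parallelism entirely. The inputs mirror the small example in the interleave documentation (five input elements, each mapped to six copies of itself).

```python
def interleave(inputs, map_func, cycle_length, block_length):
    """Simplified, sequential model of the Dataset.interleave output order."""
    source = iter(inputs)

    def open_next():
        # Apply map_func to the next input element; None once inputs run out.
        for element in source:
            return iter(map_func(element))
        return None

    slots = [open_next() for _ in range(cycle_length)]
    output = []
    while any(slot is not None for slot in slots):
        for k in range(cycle_length):
            if slots[k] is None:
                continue
            for _ in range(block_length):
                try:
                    output.append(next(slots[k]))
                except StopIteration:
                    slots[k] = open_next()  # refill this slot from the inputs
                    break
    return output


result = interleave(range(1, 6), lambda x: [x] * 6, cycle_length=2, block_length=4)
# With cycle_length=1 the model degenerates to plain concatenation (flat_map).
flat = interleave(range(1, 4), lambda x: [x] * 2, cycle_length=1, block_length=1)
print(result)
print(flat)
```

Two nested datasets are open at a time, and up to four elements are taken from each before moving to the other, which is the pattern the arguments above describe.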
However, is there any benefit of using interleave over map in a
scenario such as the one below?
interleave() and map() seem similar, but their use cases are not the same. If you want to read a dataset while applying some mapping, interleave() is your super-hero. Your images may need to be decoded while being read. Reading everything first and decoding afterwards may be inefficient when working with large datasets. Of the code snippets you gave, AFAIK, the one with tf.data.TFRecordDataset should be faster.
TL;DR interleave() parallelizes the data loading step by interleaving the I/O operation to read the file.
map() will apply the data pre-processing to the contents of the datasets.
So you can do something like:
ds = train_file.interleave(lambda x: tf.data.Dataset.list_files(directory_here).map(func,
    num_parallel_calls=tf.data.experimental.AUTOTUNE))
tf.data.experimental.AUTOTUNE will decide the level of parallelism for buffer size, CPU power, and also for I/O operations. In other words, AUTOTUNE will handle the level dynamically at runtime.
The num_parallel_calls argument spawns multiple threads to use multiple cores for parallelizing the tasks. Since interleave also accepts num_parallel_calls, you can load multiple datasets in parallel, reducing the time spent waiting for files to be opened. Image is taken from docs.
In the image, there are 4 overlapping datasets, that is determined by the argument cycle_length, so in this case cycle_length = 4.
FLAT_MAP: Maps a function across the dataset and flattens the result. If you want to make sure the order stays the same, you can use this. It does not take num_parallel_calls as an argument. Please refer to the docs for more.
MAP:
The map function will execute the selected function on every element of the Dataset separately. Obviously, data transformations on large datasets can be expensive as you apply more and more operations. The key point is, it can be more time consuming if CPU is not fully utilized. But we can use parallelism APIs:
import multiprocessing
num_of_cores = multiprocessing.cpu_count() # num of available cpu cores
mapped_data = data.map(function, num_parallel_calls=num_of_cores)
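As a rough analogy only (tf.data runs its own runtime, so this is not how map is implemented), the idea of applying one function to every element on several threads while keeping the element order can be sketched with the standard library. The preprocess function is a hypothetical stand-in for a real transformation such as decoding.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def preprocess(x):
    # Hypothetical stand-in for a per-element transformation such as decoding.
    return x * x


num_workers = os.cpu_count() or 1  # cpu_count() may return None
with ThreadPoolExecutor(max_workers=num_workers) as pool:
    # Executor.map returns results in input order, even if calls overlap.
    mapped = list(pool.map(preprocess, range(8)))

print(mapped)
```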
For cycle_length=1, the documentation states that the outputs of
interleave and flat_map are equal
cycle_length --> The number of input elements that will be processed concurrently. When it is set to 1, the elements are processed one by one.
INTERLEAVE: Transformation operations like map can be parallelized.
With a parallel map, the CPU tries to parallelize the transformation, but extracting the data from disk can still cause overhead.
Besides, once the raw bytes are read into memory, it may also be necessary to map a function over the data (decrypting it, for example), which of course requires additional computation. The various data extraction overheads need to be parallelized as well, and interleaving the contents of each dataset mitigates them.
So while reading the datasets, you want to maximize: [image not included; source of image: deeplearning.ai]

Why is TensorFlow's tf.data.Dataset.shuffle so slow?

The shuffle step in the following code is very slow for a moderate buffer_size (say 1000):
filenames = tf.constant(filenames)
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
dataset = dataset.shuffle(buffer_size)
If we use numpy to shuffle the data, the code looks as follows:
idx = np.arange(len(filenames))
np.random.shuffle(idx)
new_filenames = [filenames[i] for i in idx]
next_batch_filenames = new_filenames[:batch_size]
# get the corresponding files in batch
This is much faster. I wonder if TF does something beyond simply shuffling the data.
As Anton Codes wrote, your first snippet shuffles batches of whatever _parse_function parses from your files (probably feature data), while your second snippet only shuffles filenames.
If shuffling on file level is sufficient, you can actually achieve (roughly) the same performance via the tf.data.Dataset API:
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames)) # shuffle file names
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
This practice of shuffling "pointers" to your training samples instead of the samples themselves can often improve performance.
NumPy might still be a little bit more efficient though, due to the overhead of shuffling inside the computational graph (which tf.data.Dataset.shuffle does, there is actually a C++ kernel specifically for this operation).
The advantage of the tf.data.Dataset approach is that it can automatically reshuffle the Dataset after each epoch.
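The "shuffle pointers, not samples" idea can be sketched in plain Python. This is an illustrative model with hypothetical file names, not part of either snippet above; it applies one shuffled index list to both filenames and labels so the pairs stay aligned.

```python
import random

filenames = ["img_0.png", "img_1.png", "img_2.png", "img_3.png"]  # hypothetical
labels = [0, 1, 0, 1]

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
idx = list(range(len(filenames)))
rng.shuffle(idx)  # shuffle cheap integer indices, not the decoded images

# Apply the same permutation to both lists so each file keeps its label.
shuffled_filenames = [filenames[i] for i in idx]
shuffled_labels = [labels[i] for i in idx]
```

Only the small index list is permuted; the expensive parsing can then run on the already shuffled names.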
The comparison is of two quite different operations.
Your dataset = tf.data.Dataset.from_tensor_slices((filenames, labels)) pipeline reads from disk, that is, from physical long-term storage, possibly a spinning magnetic hard drive. This is slow. If you can store all of this in RAM instead, or on an ultra-fast RAID-style flash drive, then you'll address your largest bottleneck.
You also have a _parse_function that is fired off for each data point, every time there is a data read. The computation of that parse will take time and depending on what is in there it could be significant.
The comparison to numpy isn't really fair, in that your numpy example doesn't involve reading from disk or parsing data.
That should be the bulk of the difference. If you've addressed the above, the next place to look for more speedup is with these lines
3) dataset = dataset.map(_parse_function)
4) dataset = dataset.batch(batch_size)
5) dataset = dataset.shuffle(buffer_size)
These are your code lines. Line 4 makes batches of data, possibly 32 (batch_size for sure). Then line 5 kicks in and tries to shuffle your batches of 32 in a buffer of length 1000. That happens every time the training loop requests a new training batch. The shuffle step shuffles all those big batches, picks out the first one and adds a new one ... every ... single ... time.
We can reverse the order of batch and shuffle like so
3) dataset = dataset.map(_parse_function)
4) dataset = dataset.shuffle(buffer_size)
5) dataset = dataset.batch(batch_size)
This is better anyway, because before, the contents of the batches were always the same and only their order was mixed. This way the contents of the batches are randomized as well. Next, the shuffle only has to shuffle 1000 items, not 32x1000 items. Last, we can question whether we really need a buffer size of 1000. Let's say our data set has 2000 items. A buffer size of 320 and a batch size of 32 will certainly randomize our data well, effectively giving any item in the buffer a 10% chance of going into the next batch and a 90% chance of being pushed back to mix with other data. That's pretty good. A buffer size of 64 and a batch size of 64 seems almost useless, except that the items are pulled out of the buffer randomly one at a time, and so actually have a chance of not getting drawn and mixing with later data. Just not so much.
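The difference between the two orderings can be seen in a small pure-Python model. This is a simplified sketch of a buffer-based shuffle, not TensorFlow's kernel; the shuffle and batch helpers and the sizes are chosen only for illustration.

```python
import random


def shuffle(elements, buffer_size, rng):
    """Simplified model of a buffered shuffle: emit a random buffer slot, refill it."""
    buffer, out = [], []
    for x in elements:
        if len(buffer) < buffer_size:
            buffer.append(x)  # still filling the buffer
            continue
        k = rng.randrange(buffer_size)
        out.append(buffer[k])  # emit a random buffered element
        buffer[k] = x          # replace it with the incoming one
    while buffer:
        out.append(buffer.pop(rng.randrange(len(buffer))))  # drain at the end
    return out


def batch(elements, batch_size):
    return [elements[i:i + batch_size] for i in range(0, len(elements), batch_size)]


rng = random.Random(0)
data = list(range(12))

# batch -> shuffle: whole batches move around, their contents never change.
batch_then_shuffle = shuffle(batch(data, 4), buffer_size=2, rng=rng)
# shuffle -> batch: individual items are mixed before batches are formed.
shuffle_then_batch = batch(shuffle(data, buffer_size=4, rng=rng), 4)
```

In the first ordering every batch is still a consecutive run such as 4, 5, 6, 7, and each buffer slot holds a whole batch rather than a single item.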

Reading sequential data from TFRecords files within the TensorFlow graph?

I'm working with video data, but I believe this question should apply to any sequential data. I want to pass my RNN 10 sequential examples (video frames) from a TFRecords file. When I first start reading the file, I need to grab 10 examples and use them to create a sequence-example, which is then pushed onto the queue for the RNN to take when it's ready. However, now that I have the 10 frames, the next time I read from the TFRecords file I only need to take 1 example and shift the other 9 over. But when I hit the end of the first TFRecords file, I need to restart the process on the second TFRecords file. It's my understanding that the cond op will process the ops required under each condition even if that condition is not the one that is to be used. This would be a problem when using a condition to check whether to read 10 examples or only 1. Is there any way to resolve this problem and still get the desired result outlined above?
You can use the recently added Dataset.window() transformation in TensorFlow 1.12 to do this:
filenames = tf.data.Dataset.list_files(...)
# Define a function that will be applied to each filename, and return the sequences in that
# file.
def get_examples_from_file(filename):
    # Read and parse the examples from the file using the appropriate logic.
    examples = tf.data.TFRecordDataset(filename).map(...)
    # Selects a sliding window of 10 examples, shifting along 1 example at a time.
    sequences = examples.window(size=10, shift=1, drop_remainder=True)
    # Each element of `sequences` is a nested dataset containing 10 consecutive examples.
    # Use `Dataset.batch()` and get the resulting tensor to convert it to a tensor value
    # (or values, if there are multiple features in an example).
    return sequences.map(
        lambda d: tf.data.experimental.get_single_element(d.batch(10)))
# Alternatively, you can use `filenames.interleave()` to mix together sequences from
# different files.
sequences = filenames.flat_map(get_examples_from_file)
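What the windowing produces can be modelled in plain Python. This is a sketch under assumptions: the file names and contents are hypothetical, and sliding_windows only imitates window(size, shift, drop_remainder=True) followed by batching each window. Because the windows are built per file, no sequence crosses a file boundary.

```python
def sliding_windows(examples, size, shift=1):
    """Model of window(size, shift, drop_remainder=True) with each window batched."""
    return [examples[i:i + size] for i in range(0, len(examples) - size + 1, shift)]


# Hypothetical files, each holding a short run of consecutive frames.
files = {
    "a.tfrecord": [0, 1, 2, 3, 4],
    "b.tfrecord": [10, 11, 12, 13],
}

# flat_map analogue: build windows per file, then concatenate the results.
sequences = [w for name in files for w in sliding_windows(files[name], size=3)]
print(sequences)
```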

Are there any guidelines on sharding a data set?

Are there any guidelines on choosing the number of shard files for a data set, or the number of records in each shard?
In the examples of using tensorflow.contrib.slim,
there are roughly 1024 records in each shard of ImageNet data set.(tensorflow/models/inception)
there are roughly 600 records in each shard of flowers data set. (tensorflow/models/slim)
Do the number of shard files and the number of records in each shard have any impact on training and on the performance of the trained model?
To my knowledge, if we don't split the data set into multiple shards, shuffling will not be quite random, as the capacity of the RandomShuffleQueue may be less than the size of the data set.
Are there any other advantages of using multiple shards?
Update
The documentation says
If you have more reading threads than input files, to avoid the risk that you will have two threads reading the same example from the same file near each other.
Why can't we use 50 threads to read from 5 files?
Newer versions of TensorFlow (for example 2.5) provide a shard method on tf.data.Dataset.
Here is sample code from the TensorFlow documentation:
A = tf.data.Dataset.range(10)
B = A.shard(num_shards=3, index=0)
list(B.as_numpy_iterator())
When reading a single input file, you can shard elements as follows
d = tf.data.TFRecordDataset(input_file)
d = d.shard(num_workers, worker_index)
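To see how the shard indices relate to each other, here is a pure-Python model of the selection rule (keep the elements whose position modulo num_shards equals index). The shard helper is an illustrative stand-in, not the TensorFlow method.

```python
def shard(elements, num_shards, index):
    """Simplified model of Dataset.shard on a list."""
    return [x for position, x in enumerate(elements) if position % num_shards == index]


data = list(range(10))
shards = [shard(data, num_shards=3, index=i) for i in range(3)]
print(shards)
```

Each worker index gets a disjoint subset, and together the subsets cover the whole data set exactly once.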

Incorporating very large constants in Tensorflow

For example, the comments for the Tensorflow image captioning example model state:
NOTE: This script will consume around 100GB of disk space because each image
in the MSCOCO dataset is replicated ~5 times (once per caption) in the output.
This is done for two reasons:
1. In order to better shuffle the training data.
2. It makes it easier to perform asynchronous preprocessing of each image in
TensorFlow.
The primary goal of this question is to see if there is an alternative to this type of duplication. In my use case, storing the data in this way would require each image to be duplicated in the TFRecord files many more times, on the order of 20 - 50 times.
I should note first that I have already fed the images through VGGnet to extract 4096 dim features, and I have these stored as a mapping between filename and the vectors.
Before switching over to Tensorflow, I had been feeding batches containing filename strings and then looking up the corresponding vector on a per-batch basis. This allows me to store all of the image data in ~15GB without needing to duplicate the data on disk.
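The per-batch lookup described here can be sketched as follows. This is a minimal illustration with hypothetical file names and 4-dimensional vectors standing in for the 4096-dimensional features; it is not the original code.

```python
# Hypothetical mapping from filename to a precomputed feature vector.
features = {
    "a.jpg": [0.1, 0.2, 0.3, 0.4],
    "b.jpg": [0.5, 0.6, 0.7, 0.8],
}


def lookup_batch(filenames, table):
    """Fetch the stored vector for each filename in the batch."""
    return [table[name] for name in filenames]


# The same image can appear with several captions in one batch.
batch_filenames = ["b.jpg", "a.jpg", "b.jpg"]
batch_features = lookup_batch(batch_filenames, features)
```

Each vector is stored once and referenced as often as needed, which is why this layout avoids the on-disk duplication.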
My first attempt to do this in Tensorflow involved storing indices in the TFExample buffers and then doing a "preprocessing" step to slice into the corresponding matrix:
img_feat = pd.read_pickle("img_feats.pkl")
img_matrix = np.stack(img_feat)
preloaded_images = tf.Variable(img_matrix)
first_image = tf.slice(preloaded_images, [0,0], [1,4096])
However, in this case, Tensorflow disallows a variable larger than 2GB. So my next thought was to partition this across several variables:
img_tensors = []
for i in range(NUM_SPLITS):
    with tf.Graph().as_default():
        img_tensors.append(tf.Variable(img_matrices[i], name="preloaded_images_%i"%i))
first_image = tf.concat(1, [tf.slice(t, [0,0], [1,4096//NUM_SPLITS]) for t in img_tensors])
In this case, I'm forced to store each partition on a separate graph, because it seems any one graph cannot be this large either. However, now the concat fails because each tensor I am concatenating is on a separate graph.
Any advice on incorporating a large amount (~15GB) of preloaded data into the Tensorflow graph?
Potentially related is this question; however in this case I'd like to override the decoding of the actual JPEG file with the preprocessed value in a tensor op.