Why is TensorFlow's tf.data.Dataset.shuffle so slow? - tensorflow

The shuffle step in the following code works very slow for a moderate buffer_size (say 1000):
filenames = tf.constant(filenames)
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
dataset = dataset.shuffle(buffer_size)
If we use numpy to shuffle the data, the code looks as follows:
idx = np.arange(len(filenames))
np.random.shuffle(idx)
new_filenames = [filenames[i] for i in idx]
next_batch_filenames = new_filenames[:batch_size]
# get the corresponding files in batch
This is much faster. I wonder if TF does something beyond simply shuffles the data.

As Anton Codes wrote, your first snippet shuffles batches of whatever _parse_function parses from your files (probably feature data), while your second snippet only shuffles filenames.
If shuffling on file level is sufficient, you can actually achieve (roughly) the same performance via the tf.data.Dataset API:
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames)) # shuffle file names
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
This practice of shuffling "pointers" to your training samples instead of the samples themselves can often improve performance.
NumPy might still be a little bit more efficient though, due to the overhead of shuffling inside the computational graph (which tf.data.Dataset.shuffle does, there is actually a C++ kernel specifically for this operation).
The advantage of the tf.data.Dataset approach is that it can automatically reshuffle the Dataset after each epoch.

The comparison is of two quite different operations.
Your dataset = tf.data.Dataset.from_tensor_slices((filenames, labels)) reads from disk. Like physical long term storage, possibly a magnetic spinning hard drive. This is slow. If you have the ability to store all of this in ram instead, or on an ultra fast raid style flash drive, then you'll address your largest bottle neck.
You also have a _parse_function that is fired off for each data point, every time there is a data read. The computation of that parse will take time and depending on what is in there it could be significant.
The comparison to numpy isn't really fair, in that your numpy example doesn't involve reading from disk or parsing data.
That should be the bulk of the difference. If you've addressed the above, the next place to look for more speedup is with these lines
3) dataset = dataset.map(_parse_function)
4) dataset = dataset.batch(batch_size)
5) dataset = dataset.shuffle(buffer_size)
These are your code lines. Line 4 makes batches of data, possibly 32 (batch_size for sure). Then line 5 kicks in and tries to shuffle your batches of 32 in a buffer of length 1000. That happens every time the training loop requests a new training batch. The shuffle step shuffles all those big batches, picks out the first one and adds a new one ... every ... single ... time.
We can reverse the order of batch and shuffle like so
3) dataset = dataset.map(_parse_function)
4) dataset = dataset.shuffle(buffer_size)
5) dataset = dataset.batch(batch_size)
This is better anyway, because before the contents of the batches were always the same but the order was mixed. This way the contents of the batches will be randomized also. Next, the shuffle has to only shuffle 1000 items, not 32x1000 items. Last, we can challenge if we really need a buffer size of 1000. Let's say our data set is 2000 items. A buffer size of 320 and a batch size of 32 will certainly randomize our data well, effectively giving any data in the buffer a 10% of going into the next batch and a 90% of being pushed back to mix with other data. That's pretty good. A buffer size of 64 and a batch size of 64 seems almost useless, other than the items are pulled out of the batch randomly one at a time, and so actually have a chance of not getting drawn and mixing with later data. Just not so much.

Related

Tensorflow: How to shuffle a dataset so that it doesn't reshuffle after splitting

I am so confused as to why it's been so hard for me to find the answer to this. I want to be able to shuffle a dataset one time. After shuffling, I then split the dataset into train/val/test splits. I can't find a way to do this without the train/val/test data being all reshuffled together anytime I iterate over the split datasets.
I guess because the train/val/test dataset are all pointing to locations in a dataset which is being shuffled each time.
Here's an example of my code that is trying to do this.
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.shuffle(buffer_size=len(x))
train, val, test = split_tf_dataset(dataset, len(x), test_pct=0.1, val_pct=0.1)
train, val, test = train.batch(batch_size=50, drop_remainder=True), val.batch(batch_size=50, drop_remainder=True), test.batch(batch_size=50, drop_remainder=True)
'split_tf_dataset' is just performing take and skip operations, no randomness added there.
My workaround so far has been to shuffle the data before I create the Dataset, but does Dataset have this functionality that I'm missing? The option 'reshuffle_each_iteration' doesn't seem to do anything in this case.
I would expect setting reshuffle_each_iteration to False to fix this problem, however it seems to have no effect. I've also tried calling Dataset.sample_from_datasets, however with one dataset it only
bounces your input back to you, doing nothing.
This is the numpy code that does what I'm expecting tensorflow should be able to do:
x = x[np.random.choice(np.arange(0, len(x)), size=len(x))]

Is it possible to shuffle the dataset using the index of its elements?

I am using tf.data.experimental.make_csv_dataset in tensorflow (TF1.14 and TF2.0) to read a csv file consisting 3 columns; index, column1 and column2. For me only column 1 and column2 are important. Each element in column1 is an array of shape (1,4) and column2 has (1,1). On this dataset, when I use tf.data.shuffle(buffer_size = some_number) for shuffling, it takes a lot of time to do this shuffling with a message Filling Up the shuffle buffer. My question is if there is a way to shuffle the dataset by using the indices of the column1/column2, because this might not take so much time for shuffling since it is only the indices.
My question is if there is a way to shuffle the dataset by using the indices of the column1/column2, because this might not take so much time for shuffling since it is only the indices
No, unfortunately not. Not in that way.
The reason is that a tf.data.Dataset object is inherently lazily loaded. It is deliberately so as it can represent arbitrarily large (even infinite) datasets so it wouldnt make sense to try to load it all into memory or do all the pre-processing up front.
This means that, whilst it would (of course) be feasible to read and shuffle the index we could not then access the nth element from the original dataset (at least not cheaply).
It's worth mentioning that the shuffle buffer only needs to be filled once so the delay will only happen at the start of training (and the start of each epoch if shuffling each epoch).
A sensible workaround that you may well have already considered is to load the dataset once with the shuffle then write it out somewhere (maybe a tfrecord format) with all the rows preshuffled.

Tensorflow TFrecord file of unequal size

I currently split my data into several TFrecord files and then read the data by shuffling & interleaving them. My code is below:
path_to_files = glob('train_*.tfrecord')
n_files = len(path_to_files)
tf_dataset = tf.data.Dataset.list_files(path).shuffle(n_files)
tf_dataset = tf_dataset.interleave(lambda filename: tf.data.TFRecordDataset(filename, num_parallel_reads=4).map(parseFunc), cycle_length=n_files)
tf_dataset = tf_dataset.shuffle(buffer_size=n_files*3)
tf_dataset = tf_dataset.batch(batchsize)
tf_dataset = tf_dataset.prefetch(buffer_size=batchsize)
I have 2 questions:
1) Is my cod eindeed doing what I intend it to do. Namely, does it randomly sample samples from each of the TFrecord files equally
2) What happens if the TFrecord files contain very different amount of samples (e.g. 1 will have 50 samples and another 500). Does this affect the randomness at all?
Thanks!
So I ran a simulation to test this as follows: I saved 3 files with:
file 1: ~1000 samples of the number 1
file 2: ~2000 samples of the number 2
file 3: ~3000 samples of the number 3
Then I loaded the iterator with the above code and sampled batches until the iterator ran out. Below are my results.
As can be seen in the figure, TF does NOT weigh the TFrecord files by their size when it randomly samples from them. Rather, it randomly samples from each of the unequal sized files with equal probability until one of them runs out of samples. Then it continues from each of the remaining files with equal probability.
Take home: to have truly random sampling, make sure that your TFrecord files are either equally sized or the labels are homogenously distributed between them
Q1: Not exactly. First of all, this line don't needs explicit shuffle, list_files already has this parameter. It's can be controlled by seed value.
tf_dataset = tf.data.Dataset.list_files(path, shuffle=True, seed=1).
Without repeat function, you'll get end of sequence error when iterator exhausts all your files. So it should be like this. With None passed as a value it will iterate indefinitely, or you can set exact number of epochs.
tf_dataset = tf.data.Dataset.list_files(path, shuffle=True, seed).repeat()
Q2: It's ok if the sizes of files is different. The only outcome is that contents of large file will have higher chance of being chosen by iterator. But it won't affect randomness. This line will do it's job, shuffling interleaved dataset. The only thing to remember, is that shuffle buffer controls amount of data, loaded into memory. It's generally recommended to set it to number of examples in dataset (number of all examples in all files), but in some cases it may become a substantial overhead and even cause OOM.
tf_dataset = tf_dataset.shuffle(buffer_size=n_files*3)

Caching a dataset with examples of varied length

My dataset is comprised of audio segments of between 5-180 seconds. The number of examples is small enough to allow caching it in memory, instead of reading from the disk over and over. Storing the data in a constant tensor / variable and using tf.train.slice_input_producer will allow me to cache the dataset in memory, but it requires storing all the data in one matrix. Since some examples are much longer than others, this matrix might be unnecessarily large and perhaps too large for the RAM.
I can simply have a list of numpy arrays for my data, and do the whole input reading-randomizing-preprocessing in a non-tensforflow way with a feed_dict, but I wonder if there is a way to do it without completely giving up on tensorflow for the input reading-randomizing-preprocessing part.
Thanks!
The more recent tf.data library provides a tf.data.Dataset.cache method to cache an entire dataset into memory or into a file.
For instance:
dataset = ...
dataset = dataset.map(preprocessing_fn) # apply preprocessing
dataset = dataset.cache() # cache entire dataset in memory after preprocessing
I've provided more details on how to use cache() in this answer.

Incorporating very large constants in Tensorflow

For example, the comments for the Tensorflow image captioning example model state:
NOTE: This script will consume around 100GB of disk space because each image
in the MSCOCO dataset is replicated ~5 times (once per caption) in the output.
This is done for two reasons:
1. In order to better shuffle the training data.
2. It makes it easier to perform asynchronous preprocessing of each image in
TensorFlow.
The primary goal of this question is to see if there is an alternative to this type of duplication. In my use case, storing the data in this way would require each image to be duplicated in the TFRecord files many more times, on the order of 20 - 50 times.
I should note first that I have already fed the images through VGGnet to extract 4096 dim features, and I have these stored as a mapping between filename and the vectors.
Before switching over to Tensorflow, I had been feeding batches containing filename strings and then looking up the corresponding vector on a per-batch basis. This allows me to store all of the image data in ~15GB without needing to duplicate the data on disk.
My first attempt to do this in in Tensorflow involved storing indices in the TFExample buffers and then doing a "preprocessing" step to slice into the corresponding matrix:
img_feat = pd.read_pickle("img_feats.pkl")
img_matrix = np.stack(img_feat)
preloaded_images = tf.Variable(img_matrix)
first_image = tf.slice(preloaded_images, [0,0], [1,4096])
However, in this case, Tensorflow disallows a variable larger than 2GB. So my next thought was to partition this across several variables:
img_tensors = []
for i in range(NUM_SPLITS):
with tf.Graph().as_default():
img_tensors.append(tf.Variable(img_matrices[i], name="preloaded_images_%i"%i))
first_image = tf.concat(1, [tf.slice(t, [0,0], [1,4096//NUM_SPLITS]) for t in img_tensors])
In this case, I'm forced to store each partition on a separate graph, because it seems any one graph cannot be this large either. However, now the concat fails because each tensor I am concatenating is on a separate graph.
Any advice on incorporating a large amount (~15GB) of preloaded into the Tensorflow graph.
Potentially related is this question; however in this case I'd like to override the decoding of the actual JPEG file with the preprocessed value in a tensor op.