Preprocessing and shuffling using multiple threads in TensorFlow

I have a Python list of image file names.
Each image file must be read and then preprocessed by taking 5 random crops along with their mirror reflections.
To keep the order of images fed to the CNN random, the images produced by preprocessing one source image must not all go into the CNN together.
My thoughts
Let multiple CPU threads preprocess the images and put them in a random shuffle queue.
Dequeue batch_size images from the queue and feed them to the CNN.
My questions
a) Is the above approach the best way to do this?
b) Can anyone provide a code example that I can use as a reference?

This is my implementation, which meets most of your requirements. Usage is simple:
dataset_train = Dataset("path/to/list.train.txt", subtract_mean=True, is_train=True, name='train')
dataset_val = Dataset("path/to/list.val.txt", subtract_mean=True, is_train=False, name='val')
for batch_x, batch_y in dataset_train.batches(batch_size):
    # batch_x: (batch_size, H, W, 3), batch_y: (batch_size)
    ...
# sampling for validation
for val_step, (val_batch_x, val_batch_y) in \
        enumerate(dataset_val.sample_batches(batch_size, 256)):
    ...
It uses "threading" and "concurrent.futures" for background loading, and "Queue" for a fixed-size prefetch buffer (this does not use multiple CPU cores but, as described below, costs no speed).
Each image is randomly cropped and flipped when loaded for training (and only center-cropped for testing and validation).
A batch of images is dequeued and fed to the CNN.
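A minimal sketch of the threading-plus-bounded-queue prefetch described above, using only the standard library (Python 3, where the module is named queue rather than Queue). This is not the Dataset class from this answer: load_example, batches and their parameters are hypothetical names, and the "image" is a placeholder tuple rather than real pixel data.

```python
import queue
import random
import threading
from concurrent.futures import ThreadPoolExecutor


def load_example(path):
    """Hypothetical loader: a real one would read, crop and flip the image."""
    return path, len(path)  # (fake image, fake label)


def batches(paths, batch_size, prefetch=4, workers=4, seed=None):
    """Yield batches loaded by background threads through a bounded queue."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)   # shuffle the file order once per pass
    out = queue.Queue(maxsize=prefetch)  # put() blocks when the queue is full
    done = object()                      # sentinel marking the end of the data

    def producer():
        try:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                for start in range(0, len(paths), batch_size):
                    chunk = paths[start:start + batch_size]
                    out.put(list(pool.map(load_example, chunk)))
        finally:
            out.put(done)  # always unblock the consumer, even after an error

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = out.get()
        if batch is done:
            return
        yield batch


for batch in batches(["img%d.jpg" % i for i in range(10)], batch_size=4, seed=0):
    print(len(batch))
```

The bounded queue is what keeps memory use fixed: the producer thread simply waits whenever it gets more than prefetch batches ahead of the training loop.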
Why not use multi-CPU
Because of the GIL in CPython, Python threads created with threading can't run simultaneously on multiple CPU cores. The alternative for using multiple cores is multiprocessing (mp).
I previously implemented this with mp.Process and mp.Queue, but mp.Queue is VERY SLOW for transferring large data such as images between processes, because it is implemented on top of pipe() (on Linux). The overhead is about 0.5 seconds for a batch of 100 256x256 images on a fast workstation, where training AlexNet on a batch of 64 images takes only 0.6 sec.
I then tried threading and Queue, and found that because the bottleneck is I/O rather than CPU computation, and Queue.Queue transfers 100 images in no time, loading becomes much faster even without multi-CPU.
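For the requirement in the question (5 random crops plus their mirrors per image, not fed to the CNN together), here is a rough sketch in plain Python. Images are modelled as 2-D lists so the example needs no libraries; random_crops_with_mirrors and shuffled_pool are hypothetical helper names, and holding the whole pool in memory is an assumption that only suits small datasets.

```python
import random


def random_crops_with_mirrors(image, crop_h, crop_w, n_crops=5, rng=random):
    """Return n_crops random crops of a 2-D list plus their horizontal mirrors."""
    h, w = len(image), len(image[0])
    variants = []
    for _ in range(n_crops):
        top = rng.randrange(h - crop_h + 1)
        left = rng.randrange(w - crop_w + 1)
        crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
        variants.append(crop)
        variants.append([row[::-1] for row in crop])  # mirror reflection
    return variants


def shuffled_pool(images, crop_h, crop_w, seed=None):
    """Mix the variants of all images so one image's crops are not adjacent."""
    rng = random.Random(seed)
    pool = []
    for idx, image in enumerate(images):
        for variant in random_crops_with_mirrors(image, crop_h, crop_w, rng=rng):
            pool.append((idx, variant))
    rng.shuffle(pool)
    return pool


images = [[[r * 4 + c for c in range(4)] for r in range(4)] for _ in range(2)]
pool = shuffled_pool(images, crop_h=2, crop_w=2, seed=0)
print(len(pool))  # 2 images x 5 crops x 2 (crop + mirror)
```

For a dataset that does not fit in memory, the same mixing effect would have to come from a bounded shuffle buffer rather than one global shuffle.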


Why do I have heavy DeserializeSparse phase after EagerKernelExecutes on the multiple GPU training?

I'm trying to train a small TF2.x model on 4 GPUs (AWS g4dn.12xlarge) that takes both dense and sparse tensors as its input. When I used only dense features, without the sparse ones, my distributed training code worked well without any performance degradation. After including the sparse features, however, I found numerous unexpected chunks in the TensorBoard Profiler's trace_viewer.
The profiler screenshot is attached.
The main problem is that, although all the GPUs seem to compute their given batches well, there is a large timespan between each pair of computation blocks on the host side. There are 17x4 instances of EagerExecute:DeserializeSparse, with the terminal ops being _Send input 0 from /job:localhost/replica:0/task:0/device:GPU:{gpu_number} to /job:localhost/replica:0/task:0/device:CPU:0. Here, 17 is the number of sparse features that the model receives, and 4 is the number of GPUs being used. In addition, tons of MemcpyD2H ops (small pink blocks in the screenshot) occupy each GPU and are not parallelized. That large period of time is about 6x the actual forward pass.
Below is how the model treats sparse tensor inputs:
def call(self, inputs: tf.sparse.SparseTensor):
    with tf.device("/cpu:0"):
        x = self.hash_inputs_from_static_hash_table(inputs)
        x = self.embedding_lookup_sparse(x)
    return self.prediction_head(x)
The data can never be big (batch size = 128 per replica, sparse feature embedding dimension is <10), and I tried moving all sparse-related operations to the CPU so as not to burden the GPUs, but the problem persists exactly as it did before I moved those ops to the CPU manually.
I want to know why those chunks appear after the GPU computations, and hopefully remove them to fully benefit from distributed training with multiple GPUs.
It seems I'm still missing something that can be optimized, and this situation might not be that unique in distributed training, so I'm asking a broader audience for help.

Why is batch size allocated in GPU?

Given a Keras model (on Colab) with input shape (None,256,256,3) and a batch_size of 16, the memory allocated for the input is 16*256*256*3*datatype bytes (datatype=2,4,8 for float16/32/64). This is how it works. My confusion is that, for the same batch_size (=16), a 1*256*256*3 buffer could have been allocated instead, the 16 images passed through one by one, and the final gradient averaged.
1) So, is the allocation dependent on batch size so that 'batch_size' computations can be done in parallel and the configuration that I have mentioned above (1*256*256*3) would be serializing and hence defeating the purpose of GPU?
2) Would the same type of allocation happen on CPU for parallel computation (if the answer to 1) is yes)?
In general, batch size is a hyperparameter you need to tune.
As for your question: the right batch size depends on your data. When you train in batches you are typically running a generator object that loads one batch of data, performs a gradient descent step, and then moves on to the next batch.
Mini-batch gradient descent is usually preferred because it tends to converge faster than full-batch gradient descent (GD).
Also, as you increase the batch size, more training examples are loaded at once, which increases memory allocation.
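To make the memory arithmetic concrete, here is a small sketch using the 256x256x3 input from the question. input_batch_bytes is a hypothetical helper; it counts only the input tensor, while the activations stored for backpropagation also grow with batch size and are not included.

```python
def input_batch_bytes(batch_size, height, width, channels, bytes_per_value):
    """Bytes needed to hold one input batch of the given shape and dtype size."""
    return batch_size * height * width * channels * bytes_per_value


for batch_size in (1, 16, 64):
    n_bytes = input_batch_bytes(batch_size, 256, 256, 3, bytes_per_value=4)
    print(batch_size, n_bytes, n_bytes / 2**20, "MiB")
```

With float32 the input takes 0.75 MiB per image, so a batch of 16 needs 12 MiB and the cost scales linearly with batch size.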
Yes, you can use parallel computation to train on large batches, but overall you are doing the same thing: you still compute a whole batch at each step, just as in ordinary batch computation.
If the CPU has multiple cores, then yes; otherwise you need a GPU, because the computation requires a lot of processing power. All you are doing under the hood is working with n-dimensional matrices: calculating the loss (for example, squared loss), computing partial derivatives, and then updating the weight values.
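The one-image-at-a-time scheme raised in the question is commonly called gradient accumulation. The sketch below uses a toy one-parameter model y = w*x with squared error (my own choice for illustration, not anything from the question's Keras model) and shows that averaging per-example gradients gives the same result as computing the batch gradient at once, at the price of running the examples serially.

```python
import math


def example_grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x


def batch_grad(w, xs, ys):
    """Average gradient computed over the whole batch at once."""
    return sum(example_grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch=1):
    """Same average, but only `micro_batch` examples are processed at a time."""
    total = 0.0
    for start in range(0, len(xs), micro_batch):
        chunk = zip(xs[start:start + micro_batch], ys[start:start + micro_batch])
        total += sum(example_grad(w, x, y) for x, y in chunk)
    return total / len(xs)


xs = [0.5 * i for i in range(16)]
ys = [3.0 * x + 1.0 for x in xs]
print(math.isclose(batch_grad(0.0, xs, ys), accumulated_grad(0.0, xs, ys)))
```

The equivalence holds here because the loss is a plain average over examples. Layers that use statistics of the whole batch, such as batch normalization, would behave differently when fed one example at a time.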

TPU terminology confusion

So I know how epochs, train steps, batch sizes and this kind of stuff are defined, but it is really hard for me to get my head wrapped around the TPU terminology like train loops, iterations per loop and so on. I read this but I'm still confused.
Also, how can I benchmark the time for iterations per loop, for example?
Any explanation would help me a lot there. Thanks!
As the other answers have described, iterations_per_loop is a tuning parameter that controls the amount of work done by the TPU before checking in with it again. A lower number lets you inspect results (and benchmark them) more often, and a higher number reduces the overhead due to synchronization.
This is no different from familiar network or file buffering techniques; changing its value affects performance, but not your final result. In contrast, ML hyperparameters like num_epochs, train_steps, or train_batch_size will change your result.
EDIT: Adding an illustration in pseudocode, below. Notionally, the training loop functions like this:
def process_on_TPU(examples, train_batch_size, iterations_per_loop):
    # The TPU will run `iterations_per_loop` training iterations before returning to the host
    for i in range(0, iterations_per_loop):
        # on every iteration, the TPU will compute `train_batch_size` examples,
        # calculating the gradient from every example in the given batch
        compute(examples[i * train_batch_size : (i + 1) * train_batch_size])

# assume each entry in `examples` is a single training example
for b in range(0, train_steps, train_batch_size * iterations_per_loop):
    process_on_TPU(examples[b:b + train_batch_size * iterations_per_loop],
                   train_batch_size, iterations_per_loop)
From this, it might appear that train_batch_size and iterations_per_loop are simply two different ways of accomplishing the same thing. However, this is not the case: train_batch_size affects the effective learning rate, since (at least in ResNet-50) the gradient is computed at each iteration from the average of the gradients of every example in the batch. Taking 50 steps per 50k examples will produce a different result from taking 1k steps per 50k examples, since the latter case calculates the gradient much more often.
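The distinction can be checked with a runnable toy version of the pseudocode above. This is only a simulation under my own assumptions: the "model" is a single weight fitted to the mean of the examples, nothing runs on a TPU, and train and its parameters are hypothetical names.

```python
def train(examples, train_batch_size, iterations_per_loop, lr=0.1):
    """Run SGD on one weight; return (final weight, number of host round trips)."""
    w = 0.0
    round_trips = 0
    num_batches = len(examples) // train_batch_size
    step = 0
    while step < num_batches:
        # One host call: the device runs up to iterations_per_loop steps.
        for _ in range(min(iterations_per_loop, num_batches - step)):
            batch = examples[step * train_batch_size:(step + 1) * train_batch_size]
            grad = sum(2.0 * (w - x) for x in batch) / len(batch)
            w -= lr * grad
            step += 1
        round_trips += 1
    return w, round_trips


examples = [float(i % 7) for i in range(64)]
print(train(examples, train_batch_size=4, iterations_per_loop=1))
print(train(examples, train_batch_size=4, iterations_per_loop=8))
print(train(examples, train_batch_size=16, iterations_per_loop=8))
```

The first two calls end with the same weight and differ only in the number of round trips (16 versus 2), while the third call changes the batch size and therefore ends with a different weight.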
EDIT 2: Below is a way to visualize what's happening, with a racing metaphor. Think of the TPU as running a race that has a distance of train_steps examples, and its stride lets it cover a batch of examples per step. The race is on a track, which is shorter than the total race distance; the length of the lap is your total number of training examples, and every lap around the track is one epoch. You can think of iterations_per_loop as being the point where the TPU can stop at a "water station" of sorts where the training is temporarily paused for a variety of tasks (benchmarking, checkpointing, other housekeeping).
By "train loop", I'm assuming it's the same meaning as "training loop". The training loop is the one that iterates through each epoch in order to feed the model.
The iterations per loop is related to how Cloud TPU handles the training loop. In order to amortize the TPU launch cost, the model training step is wrapped in a tf.while_loop, such that one Session run actually runs many iterations for a single training loop.
Because of this, Cloud TPU runs a specified number of iterations of the training loop before returning to the host. Therefore, iterations_per_loop is how many iterations will run for one call.
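On the benchmarking part of the question, one simple approach is to time each host-side call and divide by iterations_per_loop. The sketch below is generic Python, not a TPU API: run_loop is a hypothetical callable standing in for whatever executes one loop of iterations_per_loop steps in your setup.

```python
import time


def timed_loops(run_loop, num_loops):
    """Call run_loop num_loops times and return each call's duration in seconds."""
    durations = []
    for _ in range(num_loops):
        start = time.perf_counter()
        run_loop()
        durations.append(time.perf_counter() - start)
    return durations


iterations_per_loop = 100  # illustrative value


def fake_loop():
    """Stand-in workload; replace with the real call that runs one loop."""
    for _ in range(iterations_per_loop):
        sum(range(1000))


durations = timed_loops(fake_loop, num_loops=5)
per_iteration = [d / iterations_per_loop for d in durations]
print(min(per_iteration))
```

The first call in a real run often includes one-time setup such as compilation, so it may be worth discarding it before averaging.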
TPU literally means "Tensor Processing Unit", it's a hardware device used for computation in exactly the same way a GPU is used. The TPUs are effectively Google proprietary GPUs. There are technical differences under the hood of a GPU vs a TPU, mostly regarding speed and power consumption, and some issues of floating point precision, but you don't need to care about the details.
iterations_per_loop appears to be an effort to improve efficiency by loading the TPU with multiple training batches. There are often hardware bandwidth limitations when transferring large amounts of data from main memory to a GPU/TPU.
It appears that the code you reference is passing iterations_per_loop number of training batches to the TPU, then running iterations_per_loop number of training steps before pausing to do another data transfer from main memory to TPU memory.
I'm rather surprised to see that though, I would expect that asynchronous background data transfers would be possible by now.
My only disclaimer is that, while I'm proficient with Tensorflow, and have watched TPU evolution in papers and articles, I'm not directly experienced with the Google API or running on TPUs, so I'm inferring from what I read in the documentation you linked to.

How to efficiently shuffle a large dataset when using tf.estimator.train_and_evaluate?

The tf.estimator.train_and_evaluate documentation makes it clear that the input dataset must be properly shuffled for the training to see all examples:
Overfitting: In order to avoid overfitting, it is recommended to set up the training input_fn to shuffle the training data properly. It is also recommended to train the model a little longer, say multiple epochs, before performing evaluation, as the input pipeline starts from scratch for each training. It is particularly important for local training and evaluation.
In my application, I would like to uniformly sample examples from the full dataset, with an arbitrary evaluation frequency and shuffle() buffer size. Otherwise, the training can at most see the first:
(steps_per_second * eval_delay * batch_size) + buffer_size
elements, effectively discarding the rest. Is there an efficient way to work around that without loading the complete dataset in the system memory?
I considered sharding the dataset based on the buffer size, but if the evaluation does not occur frequently, it will iterate on the same shard multiple times (a repeat() closes the pipeline). Ideally, I would like to move to another shard after a complete iteration over the dataset, is that possible?
Thanks for any pointers!
A random sharding of the dataset can be implemented with this Dataset transformation:
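import numpy as np
import tensorflow as tf

def random_shard(shard_size, dataset_size):
    num_shards = -(-dataset_size // shard_size)  # Ceil division.
    offsets = np.linspace(
        0, dataset_size, num=num_shards, endpoint=False, dtype=np.int64)

    def _random_shard(dataset):
        # Dataset of shard start offsets (the right-hand side was missing here;
        # from_tensor_slices is assumed from context).
        sharded_dataset = tf.data.Dataset.from_tensor_slices(offsets)
        # Visit the shards in random order.
        sharded_dataset = sharded_dataset.shuffle(num_shards)
        # Read shard_size consecutive examples starting at each offset.
        sharded_dataset = sharded_dataset.flat_map(
            lambda offset: dataset.skip(offset).take(shard_size))
        return sharded_dataset

    return _random_shard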
This requires knowing the total dataset size in advance. However, a file-based sharding approach would also iterate over the full dataset once, so that is not a major issue.
Regarding efficiency, note that skip(offset) actually iterates over offset examples, so some latency is to be expected if offset is large. Careful prefetching should help with this.
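The skip/take idea and its cost can be seen in a library-free sketch. This is an analogue in plain Python, not the tf.data code above: make_iterator is a hypothetical factory that reopens the data source, and the offsets here are exact multiples of shard_size, which differs slightly from the np.linspace spacing used in the answer.

```python
import random
from itertools import islice


def random_shard_order(make_iterator, shard_size, dataset_size, seed=None):
    """Yield every example once, shard by shard, with shards in random order."""
    offsets = list(range(0, dataset_size, shard_size))
    random.Random(seed).shuffle(offsets)
    for offset in offsets:
        # Like skip(offset).take(shard_size): reopen the source and scan past
        # `offset` items, which is why large offsets add latency.
        yield from islice(make_iterator(), offset, offset + shard_size)


order = list(random_shard_order(lambda: iter(range(10)), shard_size=3,
                                dataset_size=10, seed=1))
print(order)
```

Each shard would then typically be shuffled with a buffer of at least shard_size elements so that examples are also mixed within a shard.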

Regarding the use of tf.train.shuffle_batch() to create batches

The TensorFlow tutorial gives the following example for tf.train.shuffle_batch():
# Creates batches of 32 images and 32 labels.
image_batch, label_batch = tf.train.shuffle_batch(
    [single_image, single_label],
    batch_size=32,
    capacity=50000,
    min_after_dequeue=10000)
I am not very clear about the meaning of capacity and min_after_dequeue. In this example, they are set to 50000 and 10000 respectively. What is the logic behind this kind of setup, and what does it mean? If the input has 200 images and 200 labels, what will happen?
The tf.train.shuffle_batch() function uses a tf.RandomShuffleQueue internally to accumulate batches of batch_size elements, which are sampled uniformly at random from the elements currently in the queue.
Many training algorithms, such as the stochastic gradient descent–based algorithms that TensorFlow uses to optimize neural networks, rely on sampling records uniformly at random from the entire training set. However, it is not always practical to load the entire training set in memory (in order to sample from it), so tf.train.shuffle_batch() offers a compromise: it fills an internal buffer with between min_after_dequeue and capacity elements, and samples uniformly at random from that buffer. For many training processes, this improves the accuracy of the model and provides adequate randomization.
The min_after_dequeue and capacity arguments have an indirect effect on training performance. Setting a large min_after_dequeue value will delay the start of training, because TensorFlow has to process at least that many elements before training can start. The capacity is an upper bound on the amount of memory that the input pipeline will consume: setting this too large may cause the training process to run out of memory (and possibly start swapping, which will impair the training throughput).
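To illustrate how the two arguments interact, here is a simplified single-threaded model of such a buffer in plain Python. It is an assumption-laden sketch, not TensorFlow's implementation: the real queue is filled concurrently by producer threads, whereas shuffle_buffer below (a hypothetical name) refills to capacity whenever the buffer drops to min_after_dequeue and drains what is left once the source ends.

```python
import random
from itertools import islice


def shuffle_buffer(source, capacity, min_after_dequeue, seed=None):
    """Yield items of source in a locally shuffled order using a bounded buffer."""
    assert capacity >= 1
    rng = random.Random(seed)
    it = iter(source)
    buf = []
    source_done = False
    while True:
        if not source_done and len(buf) <= min_after_dequeue:
            needed = capacity - len(buf)
            new_items = list(islice(it, needed))  # "producer" catches up
            buf.extend(new_items)
            source_done = len(new_items) < needed
        if not buf:
            return
        i = rng.randrange(len(buf))        # pick a uniformly random element
        buf[i], buf[-1] = buf[-1], buf[i]  # swap it to the end for an O(1) pop
        yield buf.pop()


out = list(shuffle_buffer(range(200), capacity=50, min_after_dequeue=10, seed=0))
print(out[:10])
```

In this model the first element can only come from the first 50 items, which shows why a small buffer gives only local randomization. With 200 elements and a capacity of at least 200, the whole dataset sits in the buffer and the order is a uniform shuffle.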
If the dataset has only 200 images, it would easily be possible to load the entire dataset in memory. tf.train.shuffle_batch() would be quite inefficient, because it would enqueue each image and label multiple times in the tf.RandomShuffleQueue. In this case, you may find it more efficient to do the following instead, using tf.train.slice_input_producer() and tf.train.batch():
random_image, random_label = tf.train.slice_input_producer([all_images, all_labels],
                                                           shuffle=True)
image_batch, label_batch = tf.train.batch([random_image, random_label],
                                          batch_size=32)
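The idea behind this suggestion, shuffling a small in-memory dataset by index once per epoch and then slicing batches, can be sketched without TensorFlow. in_memory_batches is a hypothetical helper, and unlike a queue-based pipeline it never mixes examples from two epochs in one batch.

```python
import random


def in_memory_batches(images, labels, batch_size, num_epochs, seed=None):
    """Yield (image_batch, label_batch) lists, reshuffling indices every epoch."""
    rng = random.Random(seed)
    indices = list(range(len(images)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # exact uniform shuffle: everything is in memory
        for start in range(0, len(indices), batch_size):
            chosen = indices[start:start + batch_size]
            yield [images[i] for i in chosen], [labels[i] for i in chosen]


all_images = list(range(200))            # stand-ins for 200 images
all_labels = [i * 10 for i in all_images]
sizes = [len(x) for x, _ in in_memory_batches(all_images, all_labels, 32, 1, seed=0)]
print(sizes)
```

With 200 examples and a batch size of 32 this gives six full batches and one batch of 8 per epoch, and every image stays paired with its own label.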