Why do I have heavy DeserializeSparse phase after EagerKernelExecutes on the multiple GPU training? - tensorflow

I'm trying to train a small TF2.x model on 4 GPUs (AWS g4dn.12xlarge) that takes both dense and sparse tensors as its input. Once I tried without sparse features and just used dense features, my distributed training code worked well without any performance degradation. After including the sparse features, however, I found numerous unexpected chunks on the TensorBoard Profiler's trace_viewer.
Attached the profiler screenshot.
The main problem is that, although it seems all the GPUs computes their given batches well, there is a large timespan between a pair of computation blocks on the host side. There are 17x4 of EagerExecute:DeserializeSparse with the terminal ops of _Send input 0 from /job:localhost/replica:0/task:0/device:GPU:{gpu_number} to /job:localhost/replica:0/task:0/device:CPU:0. Here, 17 is the number of sparse features that the model receives, and 4 is the num of GPUs being utilized. Plus, tons of MemcpyD2H (small pink blocks at the screen shot) are occupying each GPU, not parallelized. That large period of time is about x6 of the actual forward pass.
Below is how the model treats sparse tensor inputs:
def call(self, inputs: tf.sparse.SparseTensor):
with tf.device("\cpu:0"):
x = self.hash_inputs_from_static_hash_table(inputs)
x = self.embedding_lookup_sparse(x)
return self.prediction_head(x)
The data can never be big (batch size = 128 per replica, sparse feature embedding dimension is <10), and I tried to move all sparse-related operations to CPU not to burden GPUs, but the problem persists just as the same as I didn't move those ops to CPU manually.
I want to know why those chunks appear after the GPU computations, and hopefully remove them to fully benefit from distributed training with multiple GPUs.
Seems like I'm still missing something that can be optimized and this situation might not that unique in distributed training, so asking for help for broader audience.

Related

Why is batch size allocated in GPU?

Given a Keras model (on Colab) that has input shape (None,256,256,3) and batch_size is 16 then the memory allocated for that input shape is 16*256*256*3*datatype (datatype=2,4,8 depending on float16/32/64). This is how it works. My confusion is that for a given batch_size (=16) 1*256*256*3 could have been allocated and the 16 images could have been passed one by one and the final gradient could have been averaged.
1) So, is the allocation dependent on batch size so that 'batch_size' computations can be done in parallel and the configuration that I have mentioned above (1*256*256*3) would be serializing and hence defeating the purpose of GPU?
2) Would the same type of allocation happen on CPU for parallel computation (if the answer to 1) is yes)?
In general batch size is what you need to tune-up.
And as for your query batch size is data-dependent, and as you use batches, you are generally running a generator object, which loads data in batches, perform GD and then move on next.
It is preferred to use batch gradient decent as it converges faster than GD
Also as you increase batch size, so more no of training no of examples will be loaded, increasing memory allocation,
Yes you can use parallel computation for training large batches but overall you are doing same, as you are actually calculating whole batches each time which you are doing in genral batch computation
CPU should have cores, Then Yes, Else You Need GPU as Computing Requires A lOt of powers Because all you are doing under the hood is working with n dimensional matrices, calculating partial derivatives and then calculating square loss and further updating weights values

TPU terminology confusion

So I know how epochs, train steps, batch sizes and this kind of stuff are defined, but it is really hard to me to get my head wraped around the TPU terminology like train loops, iterations per loop and so on. I read this but Im still confused.
Also how can I benchmark the time for iterations per loop for example.
Any explanation would help me a lot there. Thanks!
As the other answers have described, iterations_per_loop is a tuning parameter that controls the amount of work done by the TPU before checking in with it again. A lower number lets you inspect results (and benchmark them) more often, and a higher number reduces the overhead due to synchronization.
This is no different from familiar network or file buffering techniques; changing its value affects performance, but not your final result. In contrast, ML hyperparameters like num_epochs, train_steps, or train_batch_size will change your result.
EDIT: Adding an illustration in pseudocode, below. Notionally, the training loop functions like this:
def process_on_TPU(examples, train_batch_size, iterations_per_loop):
# The TPU will run `iterations_per_loop` training iterations before returning to the host
for i in range(0, iterations_per_loop):
# on every iteration, the TPU will compute `train_batch_size` examples,
# calculating the gradient from every example in the given batch
compute(examples[i * train_batch_size : (i + 1) * train_batch_size])
# assume each entry in `example` is a single training example
for b in range(0, train_steps, train_batch_size * iterations_per_loop)
process_on_TPU(examples[b:b + train_batch_size * iterations_per_loop],
train_batch_size,
iterations_per_loop)
From this, it might appear that train_batch_size and iterations_per_loop are simply two different ways of accomplishing the same thing. However, this is not the case; train_batch_size affects the learning rate, since (at least in ResNet-50) the gradient is computed at each iteration from the average of the gradient of every example in the batch. Taking 50 steps per 50k examples will produce a different result from taking from 1k steps per 50k examples, since the latter case calculates the gradient much more often.
EDIT 2: Below is a way to visualize what's happening, with a racing metaphor. Think of the TPU as running a race that has a distance of train_steps examples, and its stride lets it cover a batch of examples per step. The race is on a track, which is shorter than the total race distance; the length of the lap is your total number of training examples, and every lap around the track is one epoch. You can think of iterations_per_loop as being the point where the TPU can stop at a "water station" of sorts where the training is temporarily paused for a variety of tasks (benchmarking, checkpointing, other housekeeping).
By "train loop", I'm assuming it's the same meaning as "training loop". The training loop is the one that iterates through each epoch in order to feed the model.
The iterations per loop is related to how Cloud TPU handles the training loop. In order to amortize the TPU launch cost, the model training step is wrapped in a tf.while_loop, such that one Session run actually runs many iterations for a single training loop.
Because of this, Cloud TPU runs a specified number of iterations of the training loop before returning to the host. Therefore, iterations_per_loop is how many iterations will run for one session.run call.
TPU literally means "Tensor Processing Unit", it's a hardware device used for computation in exactly the same way a GPU is used. The TPUs are effectively Google proprietary GPUs. There are technical differences under the hood of a GPU vs a TPU, mostly regarding speed and power consumption, and some issues of floating point precision, but you don't need to care about the details.
iterations_per_loop appears to be an effort to improve efficiency by loading the TPU with multiple training batches. There are often hardware bandwidth limitations when transferring large amounts of data from main memory to a GPU/TPU.
It appears that the code you reference is passing iterations_per_loop number of training batches to the TPU, then running iterations_per_loop number of training steps before pausing to do another data transfer from main memory to TPU memory.
I'm rather surprised to see that though, I would expect that asynchronous background data transfers would be possible by now.
My only disclaimer is that, while I'm proficient with Tensorflow, and have watched TPU evolution in papers and articles, I'm not directly experienced with the Google API or running on TPUs, so I'm inferring from what I read in the documentation you linked to.

How to fix low volatile GPU-Util with Tensorflow-GPU and Keras?

I have a 4 GPU machine on which I run Tensorflow (GPU) with Keras. Some of my classification problems take several hours to complete.
nvidia-smi returns Volatile GPU-Util which never exceeds 25% on any of my 4 GPUs.
How can I increase GPU Util% and speed up my training?
If your GPU util is below 80%, this is generally the sign of an input pipeline bottleneck. What this means is that the GPU sits idle much of the time, waiting for the CPU to prepare the data:
What you want is the CPU to keep preparing batches while the GPU is training to keep the GPU fed. This is called prefetching:
Great, but if the batch preparation is still way longer than the model training, the GPU will still remain idle, waiting for the CPU to finish the next batch. To make the batch preparation faster we can parallelize the different preprocessing operations:
We can go even further by parallelizing I/O:
Now to implement this in Keras, you need to use the Tensorflow Data API with Tensorflow version >= 1.9.0. Here is an example:
Let's assume, for the sake of this example that you have two numpy arrays x and y. You can use tf.data for any type of data but this is simpler to understand.
def preprocessing(x, y):
# Can only contain TF operations
...
return x, y
dataset = tf.data.Dataset.from_tensor_slices((x, y)) # Creates a dataset object
dataset = dataset.map(preprocessing, num_parallel_calls=64) # parallel preprocessing
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None) # Will automatically prefetch batches
....
model = tf.keras.model(...)
model.fit(x=dataset) # Since tf 1.9.0 you can pass a dataset object
tf.data is very flexible, but as anything in Tensorflow (except eager), it uses a static graph. This can be a pain sometimes but the speed up is worth it.
To go further, you can have a look at the performance guide and the Tensorflow data guide.
I've got similar issue - the memory of all the GPUs were allocated by Keras, but Volatile was around 0% and training was taking almost the same amount of time as on CPU. I was using ImageDataGenerator, which turned out to be a bottleneck. When I increased the number of workers in fit_generator method from default value 1 to all available CPUs, then the training time dropped rapidly.
You can also load the data to the memory and then use flow method to prepare batches with augmented images.

Preprocessing Shuffling using multiple threads in TensorFlow

There is a python list of image file names.
It is important that each image file is read and then following steps be applied on it - Taking 5 random crops and their mirror reflections.
In order to maintain randomness in the order of images fed to the CNN it is also important that all the images from preprocessing of one image should not go into the CNN together.
My thoughts
Let multiple CPU threads preprocess the images and put them in a random shuffle queue.
Let the batchsize number of images be dequeued from the queue and used for CNN.
My questions
a) Is the above way the most optimal way of working it out ?
b) Can anyone provide a code example which can be taken as a reference to work it out ?
This is my implementation of what almost meets your requirements. The usage is easy in train.py.
dataset_train = Dataset("path/to/list.train.txt", subtract_mean=True, is_train=True, name='train')
dataset_val = Dataset("path/to/list.val.txt", subtract_mean=True, is_train=False, name='val')
dataset_train.shuffle()
for batch_x, batch_y in dataset_train.batches(batch_size):
# batch_x: (batch_size, H, W, 3), batch_y: (batch_size)
...
# for samling for validation
for val_step, (val_batch_x, val_batch_y) in \
enumerate(dataset_val.sample_batches(batch_size, 256)):
Use "threading" and "concurrent.futures" for background loading, and "Queue" for a fixed size prefetch (but this doesn't use multi-CPU, decribed below, without lost of speed).
Each image is random-cropped and flipped when loaded for training. (only center-cropped for testing and validation)
A batch of images are dequeued for feeding to the CNN.
Why not use multi-CPU
Due to GIL of CPython python threads with threading can't run simultaneously on multiple CPU cores. The alternative way to use multi-cores is multiprocessing (mp).
I used to implement the function with mp.Process and mp.Queue, but mp.Queue VERY SLOW for transferring large data like images between processes, because of the limitation of its implementation with pipe() (on Linux). The overhead is about 0.5 seconds for a batch of 100 256x256 images on a fast workstation where a batch of 64 images training of AlexNet only takes 0.6 sec.
I tried threading and Queue, and found that as the bottleneck is I/O but not CPU computation, and Queue.Queue transfer 100 images in no time, the loading becomes much faster even without multi-CPU.

GPU + CPU Tensorflow Training

Setup
I have a network, one whose parameter is a large-embedding matrix (3Million X 300 sized), say embed_mat.
During training, for each mini-batch, I only update a small subset of the vectors from embed_mat (max 15000 vectors) which are chosen using the embedding_lookup op. I am using the Adam optimizer to train my model.
As I cannot store this embed_mat in the GPU, due to its size, I define it under CPU (say /cpu:0) device, but the rest of the parameters of the model, the optimizer etc. are defined under a GPU (say, gpu:/0) device.
Questions
I see that my GPU usage is very minimal (200 MB), which suggests all my training is happening on the CPU. What I expected was that the result of the embedding_lookup is copied to the GPU and all my training happens there. Am I doing something wrong.
The training time is very largely affected by the size (num_vectors) of the embedding matrix which doesn't seem correct to me. In any mini-batch, I only update my network parameters and the vectors I looked up (~15000), so the training time should, if at all, grow sub-linearly with the size of the embedding matrix.
Is there a way to automatically and seamlessly split up my embed_mat to multiple GPUs for faster training?
I suspect the Adam Optimizer for this. Looks like because the embed_mat is on the CPU, all training is happening on the CPU. Is this correct?
Try visualizing on tensorboard where each of your ops is placed. In the "graph" tab you can color by "device". Ideally the embedding variable, the embedding lookup, and the embedding gradient update should be in the CPU, while most other things should be in the GPU.