Tensorflow - Inference time evaluation

I'm evaluating different image classification models using Tensorflow, and specifically inference time using different devices.
I was wondering if I have to use pretrained models or not.
I'm using a script that generates 1000 random input images, feeds them one by one to the network, and calculates the mean inference time.
Thank you !

Let me start with a warning:
Most people benchmark neural networks the wrong way. For GPUs there is disk I/O, memory bandwidth, PCI bandwidth, and the GPU speed itself to account for. Then there are implementation faults, like using feed_dict in TensorFlow. The same applies to training these models efficiently.
Let's start with a simple example on a GPU:
import tensorflow as tf
import numpy as np
data = np.arange(9 * 1).reshape(1, 9).astype(np.float32)
data = tf.constant(data, name='data')
activation = tf.layers.dense(data, 10, name='fc')
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(activation))
All it does is create a constant tensor and apply a fully connected layer.
All the operations are placed on the GPU:
fc/bias: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587959: I tensorflow/core/common_runtime/placer.cc:874] fc/bias: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
fc/bias/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587970: I tensorflow/core/common_runtime/placer.cc:874] fc/bias/read: (Identity)/job:localhost/replica:0/task:0/device:GPU:0
fc/bias/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587979: I tensorflow/core/common_runtime/placer.cc:874] fc/bias/Assign: (Assign)/job:localhost/replica:0/task:0/device:GPU:0
fc/kernel: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587988: I tensorflow/core/common_runtime/placer.cc:874] fc/kernel: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
fc/kernel/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
...
Looks good, right?
Benchmarking this graph can give a rough estimate of how fast the TensorFlow graph can be executed. Just replace tf.layers.dense with your network. If you accept the overhead of using Python's time package, you are done.
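A minimal sketch of such a timing loop, assuming the sess and activation from the snippet above and placed inside the with tf.Session(...) block, might look like this:
import time
# Warm-up runs so one-off costs (memory allocation, kernel selection) do not skew the result.
for _ in range(10):
    sess.run(activation)
n_runs = 1000
start = time.perf_counter()
for _ in range(n_runs):
    sess.run(activation)
elapsed = time.perf_counter() - start
print('mean time per run: %.3f ms' % (elapsed / n_runs * 1000))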
But this is, unfortunately, not the entire story.
The result of the tensor op 'fc/BiasAdd:0' still has to be copied back from device memory (GPU) to host memory (CPU, RAM).
Hence you run into the PCI bandwidth limitation at some point, and there is also a Python interpreter sitting somewhere, taking CPU cycles.
Further, the operations are placed on the GPU, not necessarily the values themselves. I'm not sure which TF version you are using, but in older versions even a tf.constant gave no guarantee of being placed on the GPU, which I only noticed when writing my own ops. By the way, see my other answer on how TF decides where to place operations.
Now the hard part: it depends on your graph. Having a tf.cond/tf.where sitting somewhere makes things harder to benchmark, and you run into all the same issues you have to address when training a deep network efficiently. Meaning, a simple constant input cannot cover all cases.
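A contrived sketch of why (hypothetical ops, not your model): the branch that actually executes depends on the values you feed, so a single constant input is not representative:
x = tf.placeholder(tf.float32, shape=[1000, 1000], name='x')
# The executed branch depends on the input values, so the cost of one
# sess.run() depends on the data, not only on the graph structure.
result = tf.cond(tf.reduce_sum(x) > 0,
                 lambda: tf.matmul(x, x),  # expensive branch
                 lambda: x + 1.0)          # cheap branch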
A solution starts by putting/staging some values directly into GPU memory by running
from tensorflow.python.ops import data_flow_ops
dummy = tf.random_normal([1000, 1000])  # stand-in for your real input tensor
stager = data_flow_ops.StagingArea([tf.float32])
enqueue_op = stager.put([dummy])
dequeue_op = tf.reduce_sum(stager.get())
for i in range(1000):
    sess.run(enqueue_op)
beforehand. But again, the TF resource manager decides where it puts values (and there is no guarantee about the ordering or about dropping/keeping values).
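With the staging area filled like this, a sketch of the timed part (continuing the snippet above, inside the same session) only measures the on-device work:
import time
# Everything was staged beforehand, so the timed loop measures the
# stager.get() + reduce_sum work on the device, not host-to-device copies.
n_runs = 1000
start = time.perf_counter()
for _ in range(n_runs):
    sess.run(dequeue_op)
elapsed = time.perf_counter() - start
print('mean time per run: %.3f ms' % (elapsed / n_runs * 1000))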
To sum it up: benchmarking is a highly complex task, just as benchmarking CUDA code is complex. Now you have CUDA parts and, on top of that, Python parts.
And it is a highly subjective task, depending on which parts you are interested in (just the graph, including disk I/O, ...).
I usually run the graph with a tf.constant input, as in the example, and use the profiler to see what's going on in the graph.
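One way to do that in TF 1.x, assuming the sess and activation from the example above (run inside the session block), is to collect RunMetadata and dump a Chrome trace:
from tensorflow.python.client import timeline
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(activation, options=run_options, run_metadata=run_metadata)
# Write a trace that can be opened in chrome://tracing
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())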
For some general ideas on how to improve runtime performance, you might want to read the TensorFlow Performance Guide.

So, to clarify, you are only interested in the runtime per inference step and not in the accuracy or any ML-related performance metrics?
In this case it should not matter much if you initialize your model from a pretrained checkpoint or just from scratch via the given initializers (e.g. truncated_normal or constant) assigned to each variable in your graph.
The underlying mathematical operations will be the same, mainly matrix-multiply operations, for which it doesn't matter (much) which values the underlying add and multiply operations are performed on.
This could be a bit different if your graph contains more advanced control-flow structures like tf.while_loop, which can change how much computation is actually executed depending on the values of certain tensors.
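A toy sketch of what I mean (hypothetical, not your model): here the number of loop iterations, and hence the time per sess.run, depends on the value fed for n rather than on the graph alone:
n = tf.placeholder(tf.int32, shape=[], name='n')
# The loop body runs n times, so the runtime depends on the fed value.
_, total = tf.while_loop(
    cond=lambda i, acc: i < n,
    body=lambda i, acc: (i + 1, acc + tf.reduce_sum(tf.random_normal([256, 256]))),
    loop_vars=(tf.constant(0), tf.constant(0.0)))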
Of course, the time it takes to initialize your graph at the very beginning of program execution will differ depending on whether you initialize from scratch or from a checkpoint.
Hope this helps.

Related

Why do I have a heavy DeserializeSparse phase after EagerKernelExecutes in multi-GPU training?

I'm trying to train a small TF2.x model on 4 GPUs (AWS g4dn.12xlarge) that takes both dense and sparse tensors as its input. When I trained without the sparse features and used only the dense features, my distributed training code worked well without any performance degradation. After including the sparse features, however, I found numerous unexpected chunks in the TensorBoard Profiler's trace_viewer.
Attached the profiler screenshot.
The main problem is that, although all the GPUs seem to compute their given batches well, there is a large timespan between pairs of computation blocks on the host side. There are 17x4 EagerExecute:DeserializeSparse blocks with terminal ops of _Send input 0 from /job:localhost/replica:0/task:0/device:GPU:{gpu_number} to /job:localhost/replica:0/task:0/device:CPU:0. Here, 17 is the number of sparse features that the model receives, and 4 is the number of GPUs being utilized. In addition, tons of MemcpyD2H ops (the small pink blocks in the screenshot) occupy each GPU and are not parallelized. That large period of time is about 6x the actual forward pass.
Below is how the model treats sparse tensor inputs:
def call(self, inputs: tf.sparse.SparseTensor):
    with tf.device("/cpu:0"):
        x = self.hash_inputs_from_static_hash_table(inputs)
        x = self.embedding_lookup_sparse(x)
    return self.prediction_head(x)
The data is never big (batch size = 128 per replica, sparse feature embedding dimension < 10), and I tried to move all sparse-related operations to the CPU so as not to burden the GPUs, but the problem persists exactly as if I had not moved those ops to the CPU manually.
I want to know why those chunks appear after the GPU computations, and hopefully remove them to fully benefit from distributed training with multiple GPUs.
It seems like I'm still missing something that could be optimized, and this situation is probably not that unique in distributed training, so I'm asking for help from a broader audience.

Different optimization with different TF versions

I'm trying to train a convolutional neural network with Keras and TensorFlow 2.6, and I also did it with TensorFlow 1.11. I think the migration went okay (both networks converge), but the results are very different, worse in TF 2.6. I used the Adam optimizer in both cases with the same hyperparameters (learning_rate = 0.001), yet the optimization of the loss function in TF 1.11 is better than in TF 2.6.
I'm trying to find out where the differences could come from. What should be taken into account when working with different TF versions? Can there be important numerical differences? I know that in TF 1.x the default mode is graph and in TF 2 the default is eager; I don't know if this could lead to different behavior during training.
It surprises me how much the loss function is reduced in the first epochs, reaching a lower value at the end of training.
You are correct that they run in different execution modes, eager and graph, but the loss function only describes how much the values need to change, as computed by the method you configured.
You cannot directly compare one model's training history to another. Running it several times, you may find that TF 1 is faster and reaches smaller values of the loss function; it is worth reviewing the changelog for what changed between versions.
Loss functions have been updated, and while the graph is the powerful technique we know, TF 2.x gives you access to values at runtime, which is why you get convenient delegated mechanisms such as callbacks, dynamic functions, and updating values during execution. (It is worth experimenting with both versions on the same task to understand the differences.)
Identical methods should not, by themselves, create different results.
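If you want to rule out the eager-versus-graph difference in your own comparison, one common approach (a hedged sketch, not taken from either version of your code) is to wrap the TF 2.x training step in tf.function so it also runs as a graph; model, optimizer, loss_fn, x, and y stand for your own objects:
import tensorflow as tf
@tf.function  # traces the step into a graph, closer to how TF 1.x executed it
def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
It is also worth double-checking that optimizer defaults (for example Adam's epsilon) really are identical across the two versions rather than relying on them being the same.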

Tensorflow distributed training high bandwidth on Parameter Server

We trained a DNN model in distributed mode with the parameter server strategy. The model has an embedding layer and some code like
with tf.colocate_with(embeddings):
    tf_output = tf.nn.embedding_lookup_sparse(embeddings, tf_ids, name=name + '_lookup')
This results in the GPU utilization on the workers peaking at ~10% (very low) because the network bandwidth is saturated. If we comment out the with tf.colocate_with(embeddings) block, GPU utilization on the workers bumps up to 30+%.
I'm curious what the possible reason for this is. Since the embeddings are allocated on the PS, the difference seems to be where the embedding_lookup happens (at the worker or at the PS). Here comes the mysterious part: colocate_with means both the embedding variable and the embedding lookup op are on the PS.
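For reference, a simplified sketch of the two placements being compared, using a plain tf.nn.embedding_lookup and made-up shapes rather than the actual model code:
import tensorflow as tf
embeddings = tf.get_variable('embeddings', shape=[1000000, 64])  # lives on the PS
ids = tf.placeholder(tf.int64, shape=[None])
# Variant 1 (as above): the lookup is pinned next to the variable,
# so both the embedding variable and the lookup op sit on the PS.
with tf.colocate_with(embeddings):
    lookup_colocated = tf.nn.embedding_lookup(embeddings, ids)
# Variant 2 (colocation removed): the placer is free to put the lookup elsewhere,
# for example on the worker that consumes its output.
lookup_free = tf.nn.embedding_lookup(embeddings, ids)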

How to fix low volatile GPU-Util with Tensorflow-GPU and Keras?

I have a 4 GPU machine on which I run Tensorflow (GPU) with Keras. Some of my classification problems take several hours to complete.
nvidia-smi returns Volatile GPU-Util which never exceeds 25% on any of my 4 GPUs.
How can I increase GPU Util% and speed up my training?
If your GPU util is below 80%, this is generally the sign of an input pipeline bottleneck. What this means is that the GPU sits idle much of the time, waiting for the CPU to prepare the data.
What you want is the CPU to keep preparing batches while the GPU is training, to keep the GPU fed. This is called prefetching.
Great, but if the batch preparation is still way longer than the model training, the GPU will still remain idle, waiting for the CPU to finish the next batch. To make the batch preparation faster we can parallelize the different preprocessing operations.
We can go even further by parallelizing I/O.
Now to implement this in Keras, you need to use the Tensorflow Data API with Tensorflow version >= 1.9.0. Here is an example:
Let's assume, for the sake of this example, that you have two numpy arrays x and y. You can use tf.data for any type of data, but this is simpler to understand.
def preprocessing(x, y):
    # Can only contain TF operations
    ...
    return x, y
dataset = tf.data.Dataset.from_tensor_slices((x, y))  # Creates a dataset object
dataset = dataset.map(preprocessing, num_parallel_calls=64)  # parallel preprocessing
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None)  # Will automatically prefetch batches
....
model = tf.keras.Model(...)
model.fit(x=dataset)  # Since tf 1.9.0 you can pass a dataset object
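If your data lives in files rather than in numpy arrays, the I/O step mentioned above can be parallelized as well. A rough sketch assuming sharded TFRecord files and a hypothetical parse_fn (recent TF versions accept num_parallel_calls on interleave; older 1.x releases offered tf.contrib.data.parallel_interleave for the same purpose):
filenames = tf.data.Dataset.list_files('data/train-*.tfrecord')
dataset = filenames.interleave(tf.data.TFRecordDataset,
                               cycle_length=8,          # read 8 files concurrently
                               num_parallel_calls=8)
dataset = dataset.map(parse_fn, num_parallel_calls=64)  # parse_fn: your record parser
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)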
tf.data is very flexible, but as with anything in TensorFlow (except eager execution), it uses a static graph. This can be a pain sometimes, but the speed-up is worth it.
To go further, you can have a look at the performance guide and the Tensorflow data guide.
I had a similar issue: the memory of all the GPUs was allocated by Keras, but volatile GPU utilization was around 0% and training was taking almost the same amount of time as on the CPU. I was using ImageDataGenerator, which turned out to be the bottleneck. When I increased the number of workers in the fit_generator method from the default value of 1 to all available CPUs, the training time dropped rapidly.
You can also load the data into memory and then use the flow method to prepare batches with augmented images.
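A rough sketch of both options (the directory path, model, x_train and y_train are placeholders for your own objects; shown with the tf.keras import, standalone Keras exposes the same classes):
import multiprocessing
from tensorflow.keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(rotation_range=15, horizontal_flip=True)
# Option 1: keep reading from disk, but give Keras more workers
train_gen = datagen.flow_from_directory('data/train', target_size=(224, 224), batch_size=32)
model.fit_generator(train_gen, workers=multiprocessing.cpu_count())
# Option 2: if the dataset fits in RAM, load it once and augment from memory
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
                    workers=multiprocessing.cpu_count())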

GPU + CPU Tensorflow Training

Setup
I have a network one of whose parameters is a large embedding matrix (3 million x 300), say embed_mat.
During training, for each mini-batch, I only update a small subset of the vectors from embed_mat (at most 15000 vectors), which are chosen using the embedding_lookup op. I am using the Adam optimizer to train my model.
As I cannot store this embed_mat on the GPU due to its size, I define it under the CPU device (say /cpu:0), but the rest of the parameters of the model, the optimizer, etc. are defined under a GPU device (say /gpu:0).
Questions
I see that my GPU usage is minimal (200 MB), which suggests all my training is happening on the CPU. What I expected was that the result of the embedding_lookup is copied to the GPU and all my training happens there. Am I doing something wrong?
The training time is very strongly affected by the size (num_vectors) of the embedding matrix, which doesn't seem right to me. In any mini-batch, I only update my network parameters and the vectors I looked up (~15000), so the training time should, if at all, grow sub-linearly with the size of the embedding matrix.
Is there a way to automatically and seamlessly split up my embed_mat to multiple GPUs for faster training?
I suspect the Adam optimizer is responsible for this. It looks like, because embed_mat is on the CPU, all training is happening on the CPU. Is this correct?
Try visualizing in TensorBoard where each of your ops is placed. In the "graph" tab you can color by "device". Ideally the embedding variable, the embedding lookup, and the embedding gradient update should be on the CPU, while most other things should be on the GPU.
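A minimal TF 1.x sketch for getting the graph into TensorBoard so that device coloring can be inspected (the log directory is an arbitrary choice):
import tensorflow as tf
# Dump the graph so it shows up in TensorBoard's "graph" tab,
# where nodes can be colored by the device they were placed on.
with tf.Session() as sess:
    writer = tf.summary.FileWriter('./logs', sess.graph)
    writer.close()
# Then run: tensorboard --logdir ./logs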