Selectively registering the backward pass of a set of ops on the GPU - tensorflow

I have a set of ops that are faster on CPUs than GPUs, both in terms of the forward and backward (gradient) computations. However, they're only a small fraction of the whole model, most of which is better run on the GPU. Currently, if I just use with tf.device(...) when specifying the forward model, and I let TF decide where to place the optimizer (e.g. the tf.train.AdamOptimizer op), then it puts all the backward-pass computations on the GPU, which is suboptimal. Is there some way of specifying that an op and its gradients should be placed on the CPU?

Currently there's no good way to customize the device assignment for ops in the (automatically generated) gradient computation. However, one thing you can do is to register a "device function" using with tf.device(): (the documentation for tf.Graph.device applies and is more comprehensive). A "device function" is a function that takes a newly constructed tf.Operation and returns a device name, and TensorFlow assigns the operation to that device. This enables you to do the following:
# These are almost certainly faster on GPU, but are just shown as an example.
OPS_ON_CPU = set(["AvgPool", "AvgPoolGrad"])

def _device_function(op):
  if op.type in OPS_ON_CPU:
    return "/cpu:0"
  else:
    # Other ops will be placed on GPU if available, otherwise CPU.
    return ""

with tf.device(_device_function):
  # Build model in here.
  # ...
  loss = ...
  train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
...which will place all ops with type "AvgPool" or "AvgPoolGrad" on the CPU.
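To confirm the resulting placement, one option is to enable device placement logging when creating the session; this is only a minimal sketch, assuming the graph above has already been built:
# Sketch: log_device_placement prints the assigned device for every op,
# so you can check that AvgPool/AvgPoolGrad ended up on the CPU.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(train_op)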

Related

Running Tensorflow model inference script on multiple GPU

I'm trying to run the model scoring (inference graph) from the TensorFlow object detection API on multiple GPUs. I tried specifying the GPU number in the main function, but it runs only on a single GPU. I placed a GPU utilization snapshot here.
I'm using tensorflow-gpu==1.13.1. Can you kindly point out what I'm missing here?
for i in range(2):
    with tf.device('/gpu:{}'.format(i)):
        tf_init()
        init = tf.global_variables_initializer()

with detection_graph.as_default():
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
        # call to run_inference_multiple_images function
The responses to this question should give you a few options for fixing this.
Usually TensorFlow will occupy all visible GPUs unless told otherwise. So if you haven't already tried, you could just remove the with tf.device line (assuming you only have the two GPUs) and TensorFlow should use them both.
Otherwise, I think the easiest fix is setting the environment variable with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1".
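For illustration, a rough sketch combining both ideas; build_inference_graph here is a hypothetical stand-in for loading the detection graph, not part of the object detection API:
import os
# Must be set before TensorFlow initializes CUDA, i.e. before the first GPU op runs.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf

outputs = []
for i in range(2):
    with tf.device('/gpu:{}'.format(i)):
        # One replica of the (hypothetical) inference graph per visible GPU.
        outputs.append(build_inference_graph())

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(outputs)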

Tensorflow - Inference time evaluation

I'm evaluating different image classification models using Tensorflow, and specifically inference time using different devices.
I was wondering if I have to use pretrained models or not.
I'm using a script that generates 1000 random input images, feeds them one by one to the network, and calculates the mean inference time.
Thank you !
Let me start with a warning:
Most people benchmark neural networks the wrong way. For GPUs there is disk I/O, memory bandwidth, PCI bandwidth, and the speed of the GPU itself. Then there are implementation faults like using feed_dict in TensorFlow. The same is true for training these models efficiently.
Let's start with a simple example considering a GPU:
import tensorflow as tf
import numpy as np

data = np.arange(9 * 1).reshape(1, 9).astype(np.float32)
data = tf.constant(data, name='data')
activation = tf.layers.dense(data, 10, name='fc')

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(activation))
All it does is create a constant tensor and apply a fully connected layer.
All the operations are placed on the GPU:
fc/bias: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587959: I tensorflow/core/common_runtime/placer.cc:874] fc/bias: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
fc/bias/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587970: I tensorflow/core/common_runtime/placer.cc:874] fc/bias/read: (Identity)/job:localhost/replica:0/task:0/device:GPU:0
fc/bias/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587979: I tensorflow/core/common_runtime/placer.cc:874] fc/bias/Assign: (Assign)/job:localhost/replica:0/task:0/device:GPU:0
fc/kernel: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587988: I tensorflow/core/common_runtime/placer.cc:874] fc/kernel: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
fc/kernel/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
...
Looks good, right?
Benchmarking this graph can give a rough estimate of how fast the TensorFlow graph can be executed. Just replace tf.layers.dense by your network. If you accept the overhead of using Python's time package, you are done.
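For instance, a minimal timing loop along those lines, assuming the constant-input graph above; the warm-up run is there because the first sess.run typically pays one-off setup costs:
import time

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(activation)  # warm-up: the first run includes one-off setup costs

    start = time.time()
    for _ in range(1000):
        sess.run(activation)
    print('mean time per run: {:.6f} s'.format((time.time() - start) / 1000))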
But this is, unfortunately, not the entire story.
There is also copying the result back from the tensor op 'fc/BiasAdd:0': accessing device memory (GPU) and copying to host memory (CPU, RAM).
Hence you hit the PCI bandwidth limitation at some point. And there is a Python interpreter sitting somewhere as well, taking CPU cycles.
Further, the operations are placed on the GPU, not necessarily the values themselves. I am not sure which TF version you are using, but even a tf.constant gave no guarantee in older versions of being placed on the GPU, which I only noticed when writing my own ops. By the way, see my other answer on how TF decides where to place operations.
Now, the hard part: it depends on your graph. Having a tf.cond/tf.where sitting somewhere makes things harder to benchmark. You now need to go through all the struggles you would also have to address when training a deep network efficiently. In other words, a simple constant cannot cover all cases.
A solution starts by putting/staging some values directly into GPU memory by running
from tensorflow.python.ops import data_flow_ops

stager = data_flow_ops.StagingArea([tf.float32])
enqueue_op = stager.put([dummy])  # `dummy` is the host-side tensor to stage
dequeue_op = tf.reduce_sum(stager.get())

for i in range(1000):
    sess.run(enqueue_op)
beforehand. But again, the TF resource manager decides where it puts values (and there is no guarantee about the ordering or about dropping/keeping values).
To sum it up: benchmarking is a highly complex task, since benchmarking CUDA code alone is already complex, and here you have CUDA plus Python parts on top.
It is also a highly subjective task, depending on which parts you are interested in (just the graph, including disk I/O, ...).
I usually run the graph with a tf.constant input, as in the example, and use the profiler to see what is going on in the graph.
For some general ideas on how to improve runtime performance, you might want to read the TensorFlow Performance Guide.
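As a sketch of the profiler suggestion, one way to inspect per-op timings in TF 1.x is to collect step statistics and dump a Chrome trace (the output file name is arbitrary):
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(activation, options=run_options, run_metadata=run_metadata)

    # Write a trace that can be opened in chrome://tracing.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())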
So, to clarify, you are only interested in the runtime per inference step and not in the accuracy or any other ML-related performance metrics?
In this case, it should not matter much whether you initialize your model from a pretrained checkpoint or just from scratch via the given initializers (e.g. truncated_normal or constant) assigned to each variable in your graph.
The underlying mathematical operations will be the same, mainly matrix multiplications, for which it doesn't matter (much) which values the underlying add and multiply operations are performed on.
This could be a bit different if your graph contains more advanced control-flow structures like tf.while_loop that can influence the actual size of your graph depending on the values of certain tensors.
Of course, the time it takes to initialize your graph at the very beginning of program execution will differ depending on whether you initialize from scratch or from a checkpoint.
Hope this helps.

Can tensorflow summary ops be assigned to GPU?

Here is part of my code.
with tf.Graph().as_default(), tf.device('/cpu:0'):
    global_step = tf.get_variable(
        'global_step',
        [],
        initializer=tf.constant_initializer(0))
    writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())

    with tf.device('/gpu:0'):
        tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
        summary_op = tf.summary.merge_all()
When I run it, I get the following error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'learning_rate': Could not satisfy explicit device specification '/device:GPU:0' because no
supported kernel for GPU devices is available.
[[Node: learning_rate = ScalarSummary[T=DT_FLOAT, _device="/device:GPU:0"](learning_rate/tags, learning_rate/values)]]
If I move these 2 ops into the tf.device("/cpu:0") device scope, it works:
tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
summary_op = tf.summary.merge_all()
I googled it, and there are many suggestions about using "allow_soft_placement=True". But I think that solution basically just changes the device scope automatically. So my questions are:
Why can these 2 ops not be assigned to the GPU? Are there any documents I can look at to figure out which ops can or cannot be assigned to the GPU?
Any suggestion is welcome.
You can't assign a summary operation to a GPU because it is meaningless.
In short, a GPU executes parallel operations. A summary is nothing but a file to which you append new lines every time you write to it. It's a sequential operation that has nothing in common with the operations that GPUs are able to do.
Your error says it all:
Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
That operation (in the tensorflow version you're using) has no GPU implementation and thus must be sent to a CPU device.
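A common workaround, sketched here from the code in the question, is to keep the surrounding scope on the GPU but explicitly pin the summary ops to the CPU via a nested device scope, instead of relying on allow_soft_placement:
with tf.device('/gpu:0'):
    # ... model ops that do have GPU kernels ...
    with tf.device('/cpu:0'):
        # The nested scope forces only the summary ops onto the CPU.
        tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
        summary_op = tf.summary.merge_all()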

Tensorflow contrib.learn.Estimator multi-GPU

In order to use the contrib.learn.Estimator for multi-GPU training, I am attempting to specify GPU assignments in my model_fn.
In pseudo-code:
def model_fn(X, y):
    with tf.device('/gpu:1'):
        ...  # various tensorflow ops for model
    return predictions, loss, train_op
Everything works fine without the tf.device('/gpu:1') call, but with it I encounter the following error:
InvalidArgumentError (see above for traceback): Cannot assign a device to
node 'save/ShardedFilename_1': Could not satisfy explicit device
specification '/device:GPU:1' because no supported kernel
for GPU devices is available.
I do not believe that I am adding the offending op to the graph myself, but rather that it is injected through the Estimator's snapshot functionality.
I believe that the solution is to set allow_soft_placement=True so that non-GPU ops fall back to the CPU, but it's not obvious to me how that is exposed when dealing with contrib.learn.Estimator.
I see that the option is usually set in ConfigProto & passed to the session, but I've been using the Estimator's functionality to manage the session for me. Should I be taking control of the session creation, or am I missing a parameter somewhere to accomplish this?
Many thanks in advance for any advice.
This is fixed along with Estimator leaving contrib in TensorFlow 1.0.
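For reference, with the tf.estimator API in later 1.x releases the session options can be passed via RunConfig, so allow_soft_placement no longer requires managing the session yourself; a sketch, where my_model_fn is a placeholder:
session_config = tf.ConfigProto(allow_soft_placement=True)
run_config = tf.estimator.RunConfig(session_config=session_config)

# my_model_fn is a placeholder for your own model_fn.
estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)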

TensorFlow: slow performance when getting gradients at inputs

I'm building a simple multilayer perceptron with TensorFlow, and I also need to obtain the gradients (or error signal) of the loss at the neural network's inputs.
Here's my code, which works:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(self.network, self.y))
optimizer = tf.train.AdagradOptimizer(learning_rate=nn_learning_rate).minimize(cost)
# ...
for i in range(epochs):
    # ...
    for batch in batches:
        # ...
        sess.run(optimizer, feed_dict=feed_dict)
        grads_wrt_input = sess.run(tf.gradients(cost, self.x), feed_dict=feed_dict)[0]
(edited to include training loop)
Without the last line (grads_wrt_input...), this runs really fast on a CUDA machine. However, the tf.gradients() call reduces performance greatly, by tenfold or more.
I recall that the error signals at the nodes are computed as intermediate values in the backpropagation algorithm, and I have successfully done this using the Java library DeepLearning4j. I was also under the impression that this would be a slight modification to the computation graph already built by the optimizer.
How can this be made faster, or is there any other way to compute the gradients of the loss w.r.t. the inputs?
The tf.gradients() function builds a new backpropagation graph each time it is called, so the reason for the slowdown is that TensorFlow has to parse a new graph on each iteration of the loop. (This can be surprisingly expensive: the current version of TensorFlow is optimized for executing the same graph a large number of times.)
Fortunately the solution is easy: just compute the gradients once, outside the loop. You can restructure your code as follows:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(self.network, self.y))
optimizer = tf.train.AdagradOptimizer(learning_rate=nn_learning_rate).minimize(cost)
grads_wrt_input_tensor = tf.gradients(cost, self.x)[0]
# ...
for i in range(epochs):
    # ...
    for batch in batches:
        # ...
        _, grads_wrt_input = sess.run([optimizer, grads_wrt_input_tensor],
                                      feed_dict=feed_dict)
Note that, for performance, I also combined the two sess.run() calls. This ensures that the forward propagation, and much of the backpropagation, will be reused.
As an aside, one tip for finding performance bugs like this is to call tf.get_default_graph().finalize() before starting your training loop. This will raise an exception if you inadvertently add any nodes to the graph, which makes it easier to trace the cause of these bugs.
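A minimal sketch of that tip, using the restructured code above:
grads_wrt_input_tensor = tf.gradients(cost, self.x)[0]

# Freeze the graph: any accidental graph construction inside the loop
# (e.g. calling tf.gradients per iteration) now raises a RuntimeError.
tf.get_default_graph().finalize()

for i in range(epochs):
    for batch in batches:
        _, grads_wrt_input = sess.run([optimizer, grads_wrt_input_tensor],
                                      feed_dict=feed_dict)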