TensorFlow processing performance with multiple GPUs - tensorflow

Friends!
I have a question about processing with multiple GPUs.
I'm using 4 GPUs and tried a simple A^n + B^n example in 3 ways, like below.
Single GPU
with tf.device('/gpu:0'):
....tf.matpow codes...
Multiple GPUs
with tf.device('/gpu:0'):
....tf.matpow codes...
with tf.device('/gpu:1'):
....tf.matpow codes...
No specific GPU designated (I think maybe all GPUs are used)
....just tf.matpow codes...
When I tried this, the results were confusing.
The results were:
1. single GPU : 6.x seconds
2. multiple GPUs (2 GPUs) : 2.x seconds
3. no specific GPU designated (maybe 4 GPUs) : 4.x seconds
I cannot understand why #2 is faster than #3.
Can anyone help me?
Thanks.

While the TensorFlow scheduler works well for a single GPU, it is not yet as good at optimizing the placement of computations on multiple GPUs (although this is being worked on at present). Without further details, it's hard to know exactly what's going on. To get a better picture, you can log where the computations are actually being placed by the scheduler. You can do this by setting the log_device_placement flag when creating the tf.Session:
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

In the third code sample (where no GPU was designated) TensorFlow didn't use all of your GPUs. By default, if TensorFlow can find a GPU ("/gpu:0") to use, it assigns as many calculations as possible to that GPU. You would need to tell it specifically that you want it to use all 4, as you did in the second code sample.
From the Tensorflow documentation:
If you have more than one GPU in your system, the GPU with the lowest ID will be selected by default. If you would like to run on a different GPU, you will need to specify the preference explicitly:
with tf.device('/gpu:2'):
    # tf code here
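For reference, a minimal sketch of what explicit placement on all four GPUs could look like for the A^n + B^n example (the matpow helper, matrix size, and exponent below are assumptions, not taken from the original post):

import numpy as np
import tensorflow as tf

def matpow(M, n):
    # Naive matrix power via repeated tf.matmul.
    if n < 1:
        return M
    return tf.matmul(M, matpow(M, n - 1))

A = np.random.rand(1000, 1000).astype(np.float32)
B = np.random.rand(1000, 1000).astype(np.float32)
n = 10

partial = []
# Place one independent computation on each of the four GPUs.
for i in range(4):
    with tf.device('/gpu:%d' % i):
        a = tf.constant(A)
        b = tf.constant(B)
        partial.append(matpow(a, n) + matpow(b, n))

# Sum the partial results on the CPU.
with tf.device('/cpu:0'):
    total = tf.add_n(partial)

# allow_soft_placement lets TF fall back to another device if one is unavailable.
with tf.Session(config=tf.ConfigProto(
        log_device_placement=True, allow_soft_placement=True)) as sess:
    sess.run(total)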

Related

Running Tensorflow model inference script on multiple GPU

I'm trying to run the model scoring (inference graph) from the TensorFlow object detection API on multiple GPUs. I tried specifying the GPU number in the main function, but it runs only on a single GPU. I placed a GPU utilization snapshot here.
I'm using tensorflow-gpu==1.13.1; can you kindly point out what I'm missing here?
for i in range(2):
    with tf.device('/gpu:{}'.format(i)):
        tf_init()
        init = tf.global_variables_initializer()
with detection_graph.as_default():
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
        # call to run_inference_multiple_images function
The responses to this question should give you a few options for fixing this.
Usually TensorFlow will occupy all visible GPUs unless told otherwise. So if you haven't already tried, you could just remove the with tf.device line (assuming you only have the two GPUs) and TensorFlow should use them both.
Otherwise, I think the easiest is setting the environment variables with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1".
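For example (a sketch; the variable has to be set before TensorFlow initializes CUDA, e.g. at the very top of the script):

import os
# Expose only GPUs 0 and 1 to this process; TensorFlow will enumerate them
# as /gpu:0 and /gpu:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf  # imported after the variable is set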

How to prevent TPUEstimator from using GPU or TPU

I need to force TPUEstimator to use the CPU. I have a rented google machine and the GPU is already running training. Since the CPUs are idle, I want to start a second Tensorflow session for evaluation but I want to force the evaluation cycle to use CPUs only so that it does not steal GPU time.
I am assuming there is a flag in the run_config or similar for doing this but am struggling to find one in the TF documentation.
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=FLAGS.master,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_tpu_cores,
        per_host_input_for_training=is_per_host))
You can run a TPUEstimator locally by including two arguments: (1) use_tpu should be set to False, and (2) tf.contrib.tpu.RunConfig should be passed as the config argument.
my_tpu_estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,
    config=tf.contrib.tpu.RunConfig(),
    use_tpu=False)
The majority of example TPU models can be run in local mode by setting the command line flags:
$> python mnist_tpu.py --use_tpu=false --master=''
More documentation can be found here.
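If the goal is also to keep the evaluation process off the GPU entirely, a common option (a sketch, not specific to TPUEstimator) is to hide the GPUs from that process before TensorFlow starts:

import os
# Hide all GPUs so the evaluation process runs on the CPU only and does not
# compete with the training job for GPU time.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf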

Tensorflow - Inference time evaluation

I'm evaluating different image classification models using Tensorflow, and specifically inference time using different devices.
I was wondering if I have to use pretrained models or not.
I'm using a script that generates 1000 random input images, feeds them one by one to the network, and calculates the mean inference time.
Thank you !
Let me start with a warning:
Most people benchmark neural networks the wrong way. For GPUs there is disk I/O, memory bandwidth, PCI bandwidth, and the speed of the GPU itself. Then there are implementation faults like using feed_dict in TensorFlow. The same applies to training these models efficiently.
Let's start with a simple example on a GPU:
import tensorflow as tf
import numpy as np

data = np.arange(9 * 1).reshape(1, 9).astype(np.float32)
data = tf.constant(data, name='data')
activation = tf.layers.dense(data, 10, name='fc')

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(activation))
All it does is create a constant tensor and apply a fully connected layer.
All the operations are placed on the GPU:
fc/bias: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587959: I tensorflow/core/common_runtime/placer.cc:874] fc/bias: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
fc/bias/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587970: I tensorflow/core/common_runtime/placer.cc:874] fc/bias/read: (Identity)/job:localhost/replica:0/task:0/device:GPU:0
fc/bias/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587979: I tensorflow/core/common_runtime/placer.cc:874] fc/bias/Assign: (Assign)/job:localhost/replica:0/task:0/device:GPU:0
fc/kernel: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587988: I tensorflow/core/common_runtime/placer.cc:874] fc/kernel: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
fc/kernel/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
...
Looks good, right?
Benchmarking this graph might give a rough estimate of how fast the TensorFlow graph can be executed. Just replace tf.layers.dense with your network. If you accept the overhead of using Python's time package, you are done.
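A minimal timing sketch along those lines, reusing the activation tensor from the example above (the warm-up run and the number of iterations are arbitrary choices):

import time

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(activation)  # warm-up: graph setup, kernel launches, autotuning

    n_runs = 100
    start = time.time()
    for _ in range(n_runs):
        sess.run(activation)
    print('mean time per run: %.6f s' % ((time.time() - start) / n_runs))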
But this is, unfortunately, not the entire story.
The result still has to be copied back from the tensor op 'fc/BiasAdd:0', which means accessing device memory (GPU) and copying to host memory (CPU, RAM).
Hence the PCI bandwidth becomes a limitation at some point. And there is a Python interpreter sitting somewhere as well, taking CPU cycles.
Further, it is the operations that are placed on the GPU, not necessarily the values themselves. I am not sure which TF version you are using, but even a tf.constant gave no guarantee in older versions of being placed on the GPU, which I only noticed when writing my own ops. Btw: see my other answer on how TF decides where to place operations.
Now, the hard part: it depends on your graph. Having a tf.cond/tf.where sitting somewhere makes things harder to benchmark. You now need to go through all the struggles you would otherwise face when training a deep network efficiently. Meaning, a simple constant input cannot cover all cases.
A solution starts by putting/staging some values directly into GPU memory by running
from tensorflow.python.ops import data_flow_ops

# 'dummy' stands for some tensor produced earlier in the graph.
stager = data_flow_ops.StagingArea([tf.float32])
enqueue_op = stager.put([dummy])
dequeue_op = tf.reduce_sum(stager.get())

for i in range(1000):
    sess.run(enqueue_op)
beforehand. But again, the TF resource manager decides where it puts values (and there is no guarantee about the ordering or about dropping/keeping values).
To sum it up: benchmarking is a highly complex task, just as benchmarking CUDA code is complex. Now you have CUDA and, additionally, Python parts.
And it is a highly subjective task, depending on which parts you are interested in (just the graph, including disk I/O, ...).
I usually run the graph with a tf.constant input, as in the example, and use the profiler to see what's going on in the graph.
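One way to do that in TF 1.x (a sketch, again reusing the activation tensor from above) is to collect run metadata and export a Chrome trace for inspection:

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(activation, options=run_options, run_metadata=run_metadata)

# The resulting timeline.json can be opened in chrome://tracing.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())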
For some general ideas on how to improve runtime performance, you might want to read the TensorFlow Performance Guide.
So, to clarify, you are only interested in the runtime per inference step and not in the accuracy or any ML related performance metrics?
In this case it should not matter much if you initialize your model from a pretrained checkpoint or just from scratch via the given initializers (e.g. truncated_normal or constant) assigned to each variable in your graph.
The underlying mathematical operations will be the same, mainly matrix multiplications, for which it doesn't matter (much) which values the underlying add and multiply operations are performed on.
This could be a bit different if your graph contains more advanced control-flow structures like tf.while_loop that can influence the actual size of your graph depending on the values of certain tensors.
Of course, the time it takes to initialize your graph at the very beginning of program execution will differ depending on whether you initialize from scratch or from a checkpoint.
Hope this helps.

Tensorflow batching is very slow

I tried to set up a very simple MNIST example with an Estimator.
First I used the estimator's deprecated fit() parameters x, y and batch_size. This executed very fast and utilized about 100% of my GPU while not affecting the CPU much (about 10% utilization). So it worked as expected.
Because the x, y and batch_size parameters are deprecated, I wanted to use the input_fn parameter for the fit() function. To build the input_fn, I used a tf.slice_input_producer and batched it with tf.train.batch. This is my code https://gist.github.com/andreas-eberle/11f650fca0dce4c9d3d6c0955145e80d. You should be able to just run it with tensorflow 1.0.
My problem is that the training now runs very slowly and only utilizes about 30% of my GPU (shown in nvidia-smi).
I also tried to increase the queue capacity of the slice_input_producer and to increase the number of threads used for batching. However, this only got me to about 45% GPU utilization and resulted in 100% CPU utilization.
What am I doing wrong? Is there a better way for feeding the inputs and batching them? I do not want to create the batches manually (creating subarrays of the numpy input array) because I want to use this example for a more complex input queue where I'll be reading and preprocessing the images in the graph.
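For reference, a minimal sketch of the kind of input_fn described above (the array shapes, batch size, and thread count are assumptions, not taken from the linked gist):

import numpy as np
import tensorflow as tf

def make_input_fn(images, labels, batch_size=128, num_threads=4):
    # images: float32 array [N, 784], labels: int32 array [N]
    def input_fn():
        # Produce single examples from the in-memory arrays.
        image, label = tf.train.slice_input_producer(
            [tf.constant(images), tf.constant(labels)], shuffle=True)
        # Batch them with several threads; capacity controls the queue size.
        image_batch, label_batch = tf.train.batch(
            [image, label], batch_size=batch_size,
            num_threads=num_threads, capacity=10 * batch_size)
        return image_batch, label_batch
    return input_fn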
I don't think my hardware should be the problem:
Windows 10
NVidia GTX 960M
i7-6700HQ
32 GB RAM

how to implement iter_size like caffe in tensorflow

I am using TensorFlow. My GPU memory is not enough, so I want to average the gradients of 4 iterations to update the variables.
How can I do this in TensorFlow?
I met the same problem. I think this example might be useful for your problem. It computes N batches with N GPUs and then does the backpropagation once. What you need to do is modify lines 165-166: run compute_gradients() 'iter_size' times and then run average_gradients() once.
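If multiple GPUs are not available, a common alternative is to accumulate gradients over iter_size small batches on one device and apply the averaged gradient once. Below is a minimal sketch of that idea with a tiny stand-in model (the model, optimizer, and batch sizes are assumptions, not from the linked example):

import numpy as np
import tensorflow as tf

iter_size = 4  # plays the role of Caffe's iter_size

# A tiny stand-in model; replace with your own network and loss.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(pred - y))

opt = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = opt.compute_gradients(loss)

# One non-trainable accumulator per variable (assumes no None gradients).
accums = [tf.Variable(tf.zeros_like(v.initialized_value()), trainable=False)
          for _, v in grads_and_vars]
zero_op = [a.assign(tf.zeros_like(a)) for a in accums]
accum_op = [a.assign_add(g) for a, (g, _) in zip(accums, grads_and_vars)]
apply_op = opt.apply_gradients(
    [(a / iter_size, v) for a, (_, v) in zip(accums, grads_and_vars)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        sess.run(zero_op)
        for _ in range(iter_size):
            xb = np.random.rand(8, 10).astype(np.float32)
            yb = np.random.rand(8, 1).astype(np.float32)
            sess.run(accum_op, feed_dict={x: xb, y: yb})
        # Apply the gradient averaged over the last iter_size mini-batches.
        sess.run(apply_op)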