Model parallelism in TensorFlow multi-GPU training

I am training a model on several GPUs on a single machine using TensorFlow. However, I find the speed is much slower than training on a single GPU. I am wondering whether TensorFlow executes sub-models on different GPUs in parallel or sequentially. For example:
import tensorflow as tf
x = 5
y = 2
with tf.device('/gpu:0'):
    z1 = tf.multiply(x, y)
with tf.device('/gpu:1'):
    z2 = tf.add(x, y)
Is the code inside /gpu:0 and /gpu:1 executed sequentially? If so, how can I make the two parts execute in parallel? Assume the two parts do not depend on each other.

In TensorFlow, if you fetch only z2, only the second block (inside gpu:1) would execute, since nothing depends on the first block.

Yes, it executes sequentially: by its nature, the with block waits until its computation is complete before moving on to the next code block.
You can use TensorFlow's queues and threading to take advantage of your additional compute.
Please refer to this tutorial from TensorFlow:
https://www.tensorflow.org/api_guides/python/threading_and_queues
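The threading suggestion can be illustrated without TensorFlow. The following is a minimal sketch in plain Python, assuming the two parts are independent as stated in the question. The functions multiply and add are hypothetical stand-ins for the two per-device sub-models; the sketch shows only the dispatch pattern and says nothing about how TensorFlow itself schedules ops.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply(x, y):
    # Hypothetical stand-in for the work placed on /gpu:0.
    return x * y


def add(x, y):
    # Hypothetical stand-in for the work placed on /gpu:1.
    return x + y


x, y = 5, 2
with ThreadPoolExecutor(max_workers=2) as pool:
    # Submit both independent tasks before waiting on either result,
    # so neither one blocks the other from starting.
    future_z1 = pool.submit(multiply, x, y)
    future_z2 = pool.submit(add, x, y)
    z1 = future_z1.result()
    z2 = future_z2.result()

print(z1, z2)  # 10 7
```

Whether this pattern yields a real speedup depends on the work each thread does; pure-Python arithmetic like the above will not, while calls that release the interpreter lock can.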

Defining an assignment function as a variable in TensorFlow?

I am training a neural network with SGD (batch size = 1). The inputs are randomly generated, and the labels are calculated from the inputs. That is, the data does not have to be realistic, but the relationships between inputs and labels are specific. I will train my NN for only 1 epoch, but with many batches.
I have the following code:
training_input = tf.Variable(tf.zeros(...))
assign_training_input_with_random_values = training_input.assign(tf.random_normal(...))
# Create a session, initialize a bunch of variables, construct a neural network...
for batch in range(batch_number):
    sess.run(assign_training_input_with_random_values)
    # Train my neural network...
However, I noticed that if I write the above code differently, the speed goes down by a lot:
# Run the assignment operation directly without defining it as a variable
for batch in range(batch_number):
    sess.run(training_input.assign(tf.random_normal(...)))
    # Train my neural network...
The first snippet being significantly faster makes me worry that tensorflow is only randomizing when I define the assign_training_input_with_random_values variable, and the same training examples are fed to the NN over every batch afterwards. In this case, the NN will probably not generalize well. Meanwhile, the second snippet is slow because it is randomizing every batch. Is this actually the case or is there another reason for this?
First, an explanation of your observations
Computational difference between the 1st and 2nd solutions
It makes sense that your first solution is faster than the second. You define the assign operation once and then execute it on every iteration. In the second solution, however, you create a new op on every iteration, which grows the computational graph over time and slows your program down.
Observation about the 1st solution
(After @Y.Z.'s finding) Apparently the first solution does evaluate to a different array of random numbers every time you run it. Therefore, the first solution is also valid.
Another way to implement this
Another way to implement your solution is to use a tf.placeholder and feed in new values on every epoch, as follows.
import tensorflow as tf
import numpy as np
training_input = tf.Variable(tf.zeros(shape=[3, 2]))
tf_random = tf.placeholder(shape=[3, 2], dtype=tf.float32)
assign_training_input_with_random_values = training_input.assign(tf_random)
# Create a session, initialize a bunch of variables, construct a neural network...
epoch = 0
with tf.Session() as sess:
    while epoch < 10:
        epoch += 1
        sess.run(assign_training_input_with_random_values, feed_dict={tf_random: np.random.normal(size=(3, 2))})
Comparing Solution 1 vs. my solution
So it turns out that neither your first solution nor my solution grows the graph. If you run the line
print([n.name for n in tf.get_default_graph().as_graph_def().node])
for your first solution and for my solution (be careful to run tf.reset_default_graph() at the beginning), you'll see that the number of nodes remains constant regardless of the number of iterations. It appears that TensorFlow is smart enough to prune those old tf.random tensors that are no longer used.
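The difference between defining an op once and creating one inside the loop can be sketched without TensorFlow. The following is a toy model in plain Python, not TensorFlow's actual graph implementation: ToyGraph, add_op, and the two run_* functions are hypothetical names, and a "graph" here is nothing more than a list of op names.

```python
class ToyGraph:
    """Hypothetical stand-in for a computation graph: a list of op names."""

    def __init__(self):
        self.nodes = []

    def add_op(self, name):
        # Every call registers a new node, even if an equivalent one exists.
        self.nodes.append(name)
        return name


def run_define_once(steps):
    graph = ToyGraph()
    assign_op = graph.add_op("assign")  # created once, before the loop
    for _ in range(steps):
        _ = assign_op  # the loop only reuses the existing op
    return len(graph.nodes)


def run_define_in_loop(steps):
    graph = ToyGraph()
    for step in range(steps):
        graph.add_op("assign_%d" % step)  # a new node on every iteration
    return len(graph.nodes)


print(run_define_once(100))     # 1
print(run_define_in_loop(100))  # 100
```

In this model the first pattern keeps the node count fixed, while the second grows linearly with the number of iterations, which matches the slowdown described for the second snippet in the question.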

Tensorflow slow inference speed in a loop

I am working on a reinforcement learning implementation using TensorFlow. After profiling the training procedure, I found something really weird:
The following code is in a training loop:
state_batch, \
    action_batch, \
    reward_batch, \
    next_state_batch, \
    is_episode_finished_batch = self.data_manager.get_next_batch()
state_batch = np.divide(state_batch, 10.0)
next_state_batch = np.divide(next_state_batch, 10.0)
# Calculate y for the td_error of the critic
y_batch = []
next_action_batch = self.actor_network.target_evaluate(
    next_state_batch, action_batch)
q_value_batch = self.critic_network.target_evaluate(
    next_state_batch, next_action_batch)
for i in range(0, self.batch_size):
    if is_episode_finished_batch[i]:
        y_batch.append([reward_batch[i]])
    else:
        y_batch.append(reward_batch[i] + GAMMA * q_value_batch[i])
# Now that we have the y batch, train the critic
self.critic_network.train(y_batch, state_batch, action_batch)
# Then get the action gradient batch and adapt the gradient with the gradient inverting method
action_batch_for_gradients = self.actor_network.evaluate(
    state_batch, action_batch)
q_gradient_batch = self.critic_network.get_action_gradient(
    state_batch, action_batch_for_gradients)
q_gradient_batch = self.grad_inv.invert(
    q_gradient_batch, action_batch_for_gradients)
# Now we can train the actor
self.actor_network.train(q_gradient_batch, state_batch, action_batch)
actor_network and critic_network are two classes that implement the actor and the critic of the actor-critic algorithm. Each has its own network and operations, but both live in the same graph and run within the same session. Each member function (such as evaluate, train, ...) contains a session.run call and feeds the data it needs from its parameters.
I observed that computing action_batch_for_gradients is extremely slow, taking 0.x seconds for one inference, which is even slower than self.critic_network.train. Computing action_batch_for_gradients is simply an inference operation in the actor network to get the action. I then duplicated this line and found that only the first call, right after self.critic_network.train, is slow; the second one runs at the normal speed of a forward pass. I think it has something to do with switching within a graph, between training one network and running a forward pass in another. But I can't tell how to avoid it.
I found some discussions on Stack Overflow about reusing the same graph in the loop, instead of building a new one each time, to speed up TensorFlow. But I already build the graph beforehand and only run different parts of it in the training loop. So I don't know what I am doing wrong with TensorFlow in this training loop. I am using TensorFlow 1.6.
I would appreciate your help!
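The duplicated-call experiment described above can be made repeatable with a small timing helper. The following is a minimal sketch using only the standard library: time_calls and fake_inference are hypothetical names, and fake_inference is a placeholder workload standing in for a call such as self.actor_network.evaluate, which is not reproduced here.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn several times in a row and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_inference():
    # Placeholder workload; swap in the real inference call when profiling.
    return sum(range(10000))


durations = time_calls(fake_inference, repeats=3)
for index, seconds in enumerate(durations):
    print("call %d took %.6f s" % (index, seconds))
```

Comparing the first duration with the later ones separates a one-off cost on the first call from the steady-state cost of the operation.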

How does one move data to multiple GPU towers using Tensorflow's Dataset API

We are running multi GPU jobs on Tensorflow and evaluating a migration from the queue based model (using the string_input_producer interface) to the new Tensorflow Dataset API. The latter appears to offer an easier way to switch between Train and Validation, concurrently.
A snippet of code below shows how we are doing this.
train_dataset, train_iterator = get_dataset(train_files, batch_size, epochs)
val_dataset, val_iterator = get_dataset(val_files, batch_size, epochs)
is_validating = tf.placeholder(dtype=bool, shape=())
next_batch = tf.cond(is_validating,
                     lambda: val_iterator.get_next(),
                     lambda: train_iterator.get_next())
validation_tower = self.num_gpus - 1
tower_grads = []
for i in range(self.num_gpus):
    with tf.variable_scope(tf.get_variable_scope(), reuse=(i > 0)):
        with tf.device('/gpu:%d' % i), tf.name_scope('%s_%d' % ('gpu_', i)) as scope:
            if i == validation_tower:
                images, labels = next_batch
                # Loss funcs snipped out
            else:
                images, labels = next_batch
                # Loss funcs snipped out
The get_dataset function builds a dataset, sets a map function and a batch size. It also builds an iterator, but doesn't initialize it. Initialization of the iterator occurs before the session starts.
The is_validating boolean is supplied while the session is running, and every few steps we pass is_validating as True via a feed_dict to use the validation dataset.
The question I have is:
Let's say I have 8 GPUs, so we run training on 7 GPUs. Does the Iterator advance from the same point for each of these 7 GPUs, hence supplying all 7 GPUs with the same data?
At present there are three main options, which have different usability and performance trade-offs:
1. In the Dataset.batch() transform, create a single large batch containing examples for all of your GPUs. Then use tf.split(..., self.num_gpus) on the output of Iterator.get_next() to create sub-batches for each GPU. This is probably the easiest approach, but it does place the splitting on the critical path.
2. In the Dataset.batch() transform, create a mini-batch that is sized for a single GPU. Then call Iterator.get_next() once per GPU to get multiple different batches. (By contrast, in your current code, the same value of next_batch is sent to each GPU, which is probably not what you wanted to happen.)
3. Create multiple iterators, one per GPU. Shard the data using Dataset.shard() early in the pipeline (e.g. on the list of files if your dataset is sharded). Note that this approach will consume more resources on the host, so you may need to dial down any buffer sizes and/or degrees of parallelism.
Note that the current tf.data pipelines run on the CPU only, and an important aspect of an efficient pipeline is staging your training input to the GPU while the previous step is still running. See the TensorFlow CNN benchmarks for example code that shows how to stage data to GPUs efficiently. We are currently working on adding this support to the tf.data API directly.
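The data layout produced by the first and third options above can be sketched in plain Python. This is a toy illustration on lists, not TensorFlow code: split_batch and shard are hypothetical helpers that mimic, respectively, an even split of one large batch along its first dimension and the every-n-th-element selection that sharding by index performs.

```python
def split_batch(batch, num_gpus):
    """Cut one large batch into num_gpus contiguous, equally sized sub-batches."""
    if len(batch) % num_gpus != 0:
        raise ValueError("batch size must be divisible by num_gpus")
    size = len(batch) // num_gpus
    return [batch[i * size:(i + 1) * size] for i in range(num_gpus)]


def shard(examples, num_shards, index):
    """Keep every num_shards-th example, starting at position index."""
    return examples[index::num_shards]


big_batch = list(range(8))  # eight example ids standing in for real data
num_gpus = 4

print(split_batch(big_batch, num_gpus))
# [[0, 1], [2, 3], [4, 5], [6, 7]]
print([shard(big_batch, num_gpus, i) for i in range(num_gpus)])
# [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In both layouts every GPU receives different examples, which is the property the code in the question lacks when it passes the same next_batch to every tower.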

Does Tensorflow simplify a computational graph?

I have a simple question and have already searched quite a bit, but maybe I'm using the wrong keywords.
How does Tensorflow handle a given graph? If one has the simple graph:
x = tf.constant(1.0, name='input')
w = tf.constant(0.8, name='weight')
b = tf.constant(0.8, name='bias')
y_1 = tf.mul(w, x, name='output_1')
y_2 = tf.add(y_1, b, name='output_1')
The arithmetic statement is of course given by the computational graph, but does TensorFlow then compile and simplify it in some way, to save time by not copying memory, etc.? That is, is a 'condensed' version of the computational kernel executed on the 'device', such as the CPU or GPU?
So that it reduces to something like this:
y_2 = tf.add(tf.mul(w, x), b, name='output_1')
Maybe somebody knows a good resource to learn more about how exactly Tensorflow runs under the hood without looking too deep into the source-code.
Thank you very much in advance!
TensorFlow includes various optimizations that can have the effect of simplifying a dataflow graph. In particular:
TensorFlow will apply common subexpression elimination to avoid performing redundant computation. In the case of your example, this will not have much effect, but TensorFlow will observe that w and b are the same constant, and replace them with a single value.
TensorFlow will apply constant propagation so that (computed) values that are the same in every execution of a subgraph will only be computed once. In your example, the entire expression is a constant, so TensorFlow will replace it with a single tf.constant() value corresponding to the result (1.6).
If you use the experimental XLA compiler, TensorFlow will make more aggressive simplifications, and may be able to replace a subgraph with a single TensorFlow kernel, containing just-in-time compiled code. If in your example x were a tf.placeholder(), the remainder of the computation could be compiled into a single kernel with one input and one output.
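The constant propagation described above can be sketched as a small recursive fold over an expression tree. This is a toy illustration in plain Python, not TensorFlow's optimizer: expressions are hypothetical tuples of the form (op, left, right), floats are constants, and a string such as "x" stands for a placeholder whose value is unknown until run time.

```python
def fold(expr):
    """Replace every all-constant subexpression with its computed value."""
    if not isinstance(expr, tuple):
        return expr  # a float constant or a placeholder name
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, float) and isinstance(right, float):
        if op == "mul":
            return left * right
        if op == "add":
            return left + right
        raise ValueError("unknown op: %r" % (op,))
    # At least one input is not a constant, so keep the node.
    return (op, left, right)


all_constant = ("add", ("mul", 0.8, 1.0), 0.8)
with_placeholder = ("add", ("mul", 0.8, "x"), 0.8)

print(fold(all_constant))      # 1.6
print(fold(with_placeholder))  # ('add', ('mul', 0.8, 'x'), 0.8)
```

With all inputs constant, the whole expression collapses to the single value 1.6, as in the answer. Once x is a placeholder, nothing can be folded ahead of time, which is the case where a compiler would instead fuse the remaining ops into one kernel.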

Tensorflow: Which graph statements are executed after the graph is built?

In Tensorflow, which statements within a graph definition block are executed only to build the graph vs. which are executed during training? For example:
with tf.Graph().as_default():
    weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nOutputUnits]))
    weightsLayer1 = tf.div(weightsLayer1, tf.sqrt(tf.to_float(nInputUnits)))
    biasesLayer1 = tf.Variable(tf.zeros([nUnitsHiddenLayer1]))
    layer1output = tf.tanh(tf.matmul(images_placeholder, weightsLayer1) + biasesLayer1)
Intuitively, I assume the lines defining weightsLayer1 and biasesLayer1 are executed only once at startup, since they initialize the weights and biases. However, I assume the line computing layer1output executes at every training step, since layer1output is used downstream to compute the loss, which is minimized by the optimizer. So how does TensorFlow know, during training, to execute only the last line and not the previous ones (which would re-initialize the weights and biases)?
You as the user are actually telling tensorflow which operations to run. During training, you typically tell tensorflow to execute operations that are provided by an optimizer. This looks something like this:
opt = tf.train.GradientDescentOptimizer(0.01)
train_step = opt.minimize(loss)
for i in range(100):
    sess.run(train_step, feed_dict=...)
Calling opt.minimize adds to the computation graph the gradients w.r.t. the trainable variables, as well as operations that update the variables using those gradients. train_step is in fact these update operations grouped using tf.group. If you (the user) run train_step, TensorFlow figures out which parts of the computation graph it needs to run in order to execute these desired operations.
Likewise, if you do something like sess.run(fetches=loss, feed_dict=...), you are asking tensorflow to execute all operations in the graph that are necessary to compute loss.
Finally, initialization operations like the one in weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nOutputUnits])) are usually run by sess.run(tf.initialize_all_variables()).
Edit: After re-reading your question, I want to be more clear about one aspect. No operations are actually executed by the graph definition code you provided. Tensorflow operations are executed if and only if you start a session and request the execution of parts of your graph. As stated above, that includes the initialization operations.
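The two points made above, that defining the graph executes nothing and that running a fetch executes only what the fetch depends on, can be sketched with a toy lazy evaluator. This is plain Python, not TensorFlow: Node, run, and the executed log are hypothetical, and the node names only echo the roles in the question.

```python
executed = []  # log of node names, in the order they actually run


class Node:
    """Hypothetical graph node: a name, a function, and its input nodes."""

    def __init__(self, name, fn, inputs=()):
        self.name = name
        self.fn = fn
        self.inputs = inputs


def run(fetch):
    """Evaluate fetch by first evaluating only the nodes it depends on."""
    values = [run(node) for node in fetch.inputs]
    executed.append(fetch.name)
    return fetch.fn(*values)


# Graph definition: nothing is computed by these lines.
weights = Node("weights", lambda: 2.0)
images = Node("images", lambda: 3.0)
layer1 = Node("layer1", lambda w, x: w * x, (weights, images))
unused = Node("unused", lambda: 0.0)
ran_during_definition = list(executed)  # still empty at this point

result = run(layer1)
print(result)    # 6.0
print(executed)  # ['weights', 'images', 'layer1']
```

The unused node never appears in the log because nothing that was fetched depends on it. The sketch does not model variables keeping their values between runs, which is why initialization is a separate, explicitly requested operation in TensorFlow.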