TensorFlow: slow performance when getting gradients at inputs

I'm building a simple multilayer perceptron with TensorFlow, and I also need to obtain the gradients (or error signal) of the loss at the neural network's inputs.
Here's my code, which works:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(self.network, self.y))
optimizer = tf.train.AdagradOptimizer(learning_rate=nn_learning_rate).minimize(cost)
for i in range(epochs):
    for batch in batches:
        sess.run(optimizer, feed_dict=feed_dict)
        grads_wrt_input = sess.run(tf.gradients(cost, self.x), feed_dict=feed_dict)[0]
(edited to include training loop)
Without the last line (grads_wrt_input...), this runs really fast on a CUDA machine. However, adding the tf.gradients() call slows training down tenfold or more.
I recall that the error signals at the nodes are computed as intermediate values in the backpropagation algorithm, and I have successfully done this using the Java library DeepLearning4j. I was also under the impression that this would be a slight modification to the computation graph already built by optimizer.
How can this be made faster, or is there any other way to compute the gradients of the loss w.r.t. the inputs?

The tf.gradients() function builds a new backpropagation graph each time it is called, so the reason for the slowdown is that TensorFlow has to parse a new graph on each iteration of the loop. (This can be surprisingly expensive: the current version of TensorFlow is optimized for executing the same graph a large number of times.)
Fortunately the solution is easy: just compute the gradients once, outside the loop. You can restructure your code as follows:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(self.network, self.y))
optimizer = tf.train.AdagradOptimizer(learning_rate=nn_learning_rate).minimize(cost)
# Build the gradient op once, before the training loop.
grads_wrt_input_tensor = tf.gradients(cost, self.x)[0]
# ...
for i in range(epochs):
    # ...
    for batch in batches:
        # ...
        _, grads_wrt_input = sess.run([optimizer, grads_wrt_input_tensor],
                                      feed_dict=feed_dict)
Note that, for performance, I also combined the two sess.run() calls. This ensures that the forward propagation, and much of the backpropagation, will be reused.
As an aside, one tip for finding performance bugs like this is to call tf.get_default_graph().finalize() before starting your training loop. This will raise an exception if you inadvertently add any nodes to the graph, which makes it easier to trace the cause of these bugs.
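To make the difference between the two patterns concrete, here is a minimal sketch in plain Python. ToyGraph and build_gradient_nodes are hypothetical stand-ins invented for illustration, not TensorFlow's API; they only mimic the idea that building gradients adds nodes, and that a finalized graph rejects new ones.

```python
class ToyGraph:
    """Hypothetical stand-in for a computation graph; not TensorFlow's API."""

    def __init__(self):
        self.nodes = []
        self._finalized = False

    def add_node(self, name):
        if self._finalized:
            raise RuntimeError("graph is finalized; cannot add node %r" % name)
        self.nodes.append(name)
        return name

    def finalize(self):
        self._finalized = True


def build_gradient_nodes(graph):
    # Stands in for a gradient-building call: every call appends a new node.
    return graph.add_node("grad_%d" % len(graph.nodes))


# Slow pattern: the graph grows on every iteration of the loop.
growing = ToyGraph()
for _ in range(3):
    build_gradient_nodes(growing)

# Fast pattern: build once, finalize, then only reuse the existing node.
fixed = ToyGraph()
grad_node = build_gradient_nodes(fixed)
fixed.finalize()
for _ in range(3):
    _ = grad_node  # reuse; no new nodes are created

# An accidental in-loop build is now caught immediately.
try:
    build_gradient_nodes(fixed)
    caught = False
except RuntimeError:
    caught = True
```

In the toy version the growing graph ends up with one node per iteration, while the finalized graph stays at a single node and raises as soon as something tries to extend it.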


Tensorflow batch normalization: difference momentum and renorm_momentum

I want to replicate a network built with the Lasagne library in TensorFlow. I'm having some trouble with the batch normalization.
This is the Lasagne documentation about the batch normalization used:
In TensorFlow I found two functions to normalize: tf.nn.batch_normalization and tf.layers.batch_normalization.
The first one is simpler but does not let me choose the alpha parameter from lasagne (Coefficient for the exponential moving average of batch-wise means and standard deviations computed during training). I tried using the second function, which has a lot more options, but there are two things I do not understand about it:
1. I am not clear about the difference between momentum and renorm_momentum. If I have an alpha of 0.9 in the Lasagne network, can I just set both TensorFlow momentums to 0.9 and expect the same behaviour?
2. The TF documentation notes:
when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
I do not really understand what is happening here and where I need to put something similar in my code. Can I just put this somewhere before I run the session? What parts of this code piece should I not copy literally but change depending on my code?
There is a big difference between tf.nn.batch_normalization and tf.layers.batch_normalization. See my answer here. So you have made the right choice by using the layers version. Now, on your questions:
1. renorm_momentum only has an effect if you use batch renormalization by setting the renorm argument to True. You can ignore it if you are using default batch normalization.
2. Short answer: You can literally copy that code snippet. Put it exactly where you would normally call optimizer.minimize.
Long answer on 2.: Batch normalization has two "modes": training and inference. During training, the mean and variance of the current minibatch are used. During inference, this is not desirable (e.g. you might not even use batches as input, so there would be no minibatch statistics). For this reason, moving averages over minibatch means/variances are kept during training. These moving averages are then used for inference.
By default, TensorFlow only executes what it needs to. Those moving averages are not needed for training, so they normally would never be executed/updated. The tf.control_dependencies context manager forces TensorFlow to do the updates every time it computes whatever is in the code block (in this case the cost). Since the cost certainly needs to be computed exactly once per training step, this is a good way of making sure the moving averages are updated.
The code example seems a bit arcane, but in context it would really just be (as an example):
loss = ...
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ...
becomes:
loss = ...
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ...
Finally, remember to pass the correct training argument to batch normalization so that minibatch statistics are used during training and moving averages during inference.
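The two modes and the role of the update step can be sketched in plain Python. ToyBatchNorm is a hypothetical, mean-only illustration (real batch normalization also tracks variance and applies scale and offset). The update rule assumes the convention moving = momentum * moving + (1 - momentum) * batch, which is my reading of how momentum works in tf.layers.batch_normalization. Lasagne's alpha appears to weight the newest batch rather than the old average, so check both definitions before copying a value across.

```python
def batch_mean(values):
    return sum(values) / len(values)


class ToyBatchNorm:
    """Hypothetical mean-only normalizer, for illustration only."""

    def __init__(self, momentum):
        self.momentum = momentum
        self.moving_mean = 0.0

    def __call__(self, batch, training):
        if training:
            mean = batch_mean(batch)
            # The "update op": if this line never runs, moving_mean stays
            # at its initial value and inference uses the wrong statistics.
            self.moving_mean = (self.momentum * self.moving_mean
                                + (1.0 - self.momentum) * mean)
        else:
            # Inference: use the stored moving average, not the batch.
            mean = self.moving_mean
        return [x - mean for x in batch]


bn = ToyBatchNorm(momentum=0.9)
for _ in range(200):
    bn([1.0, 2.0, 3.0], training=True)  # batch mean is 2.0

# A single example can be normalized at inference time because the
# moving mean has converged towards the training batch mean.
centered = bn([2.0], training=False)
```

In TensorFlow the assignment inside the training branch is a separate op, which is why it has to be attached to the training step through tf.control_dependencies.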

Why doesn't parallel_iterations in dynamic_rnn work?

I'm wondering how to use the dynamic_rnn function and make it parallel. I set gpu_options.allow_growth = True and use tf.nn.dynamic_rnn(rnn_cell, inputs=X, dtype=tf.float32, time_major=False, parallel_iterations=50) to do so. But neither the GPU memory consumption nor the run time changes when I change the value of parallel_iterations.
It is a very simple rnn, so I think there may not be data dependency.
basic_cell = BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32, parallel_iterations=50)
logits = fully_connected(states, n_outputs, activation_fn=None)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
Thanks in advance! I appreciate any suggestion.
Your observations don't mean that parallel_iterations doesn't work.
Whenever you have an RNN, you have a data dependency, since the output of the n-th step is fed into the (n+1)-th step. In your example with BasicRNNCell, every computation effectively depends on the previous computation, so there are basically no opportunities to run multiple steps in parallel. With more complex cells you might have some computation in each step that is independent of previous steps (e.g. doing some attention over constant memory). In such cases there are opportunities for parallel execution of different steps.
Even if your model allowed parallel execution, you might not see it reflected in memory usage. Memory usage depends on many factors, including when TF returns memory to the GPU; if you are computing gradients, you might need to keep most activations in memory whether you run iterations in parallel or not; the iterations that are run in parallel might not produce a lot of tensors; etc.
Similarly for CPU, if running stuff in parallel always helped performance, we would run a thousand threads in every process. parallel_iterations is simply a knob that is useful to have in some cases.
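The data dependency can be seen in a scalar toy recurrence. This is a minimal sketch, not BasicRNNCell itself: run_rnn and input_projections are hypothetical helpers, and the weights are single numbers rather than matrices.

```python
import math


def run_rnn(inputs, w_in, w_rec, h0=0.0):
    """Toy scalar RNN: each state needs the previous state, so steps are serial."""
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)  # cannot start before h is known
        states.append(h)
    return states


def input_projections(inputs, w_in):
    """Work that does not touch the hidden state; every item is independent."""
    return [w_in * x for x in inputs]


states = run_rnn([1.0, 0.5], w_in=0.5, w_rec=0.1)
projections = input_projections([1.0, 0.5], w_in=0.5)
```

Only work of the second kind, which does not read the hidden state, could overlap across time steps. A plain recurrent cell has very little of it.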

Is it possible to loop through all minibatches in a single tensorflow op using dataset/iterators?

I'm working with tf.data.dataset/iterator mechanism and trying to improve data loading performance. It occurred to me that offloading the entire minibatch loop from Python might help. My data is small enough that storing on CPU or GPU is no problem.
So, is it possible to loop an optimizer node over a full minibatched epoch within a single call to session.run?
The tensor returned by iterator.get_next() is only advanced once per session.run, which would seem to make it impossible to iterate through a dataset of minibatches... but if it could be done, my CPU would only have to touch the Python thread once per epoch.
UPDATE: @muskrat's suggestion to use tf.slice can be used for this purpose. See my subsequent non-answer with a schematic implementation of this using tf.while_loop. However, the question is whether this can be accomplished using dataset/iterators... and I'd still like to know.
From the description it seems that you already have the dataset preloaded as a constant on CPU/GPU, as in this example. That's certainly the first step.
Second, I suggest using tf.slice() to replicate the effect of the minibatch operation. In other words, just manually slice minibatches out of the preloaded constant (your dataset), and you should get the desired behavior. See for example the slice docs or this related post.
If that's not enough detail, please edit your question to include a code example (with mnist or something) and I can give more details.
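As a rough illustration of the slicing idea, here is a plain-Python sketch that cuts complete minibatches out of a preloaded dataset by offset. The helper names are hypothetical; in TensorFlow the list slice would be a tf.slice call with the same begin offset and size.

```python
def num_complete_batches(num_rows, batch_size):
    # Rows left over after the last complete batch are simply not used.
    return num_rows // batch_size


def slice_minibatch(data, batch_index, batch_size):
    # Same arithmetic a tf.slice call would need: begin offset and size.
    begin = batch_index * batch_size
    return data[begin:begin + batch_size]


# Ten rows with two features each, standing in for the preloaded constant.
preloaded = [[float(i), float(i) * 2.0] for i in range(10)]
minibatches = [slice_minibatch(preloaded, b, 4)
               for b in range(num_complete_batches(len(preloaded), 4))]
```

With ten rows and a batch size of four this yields two complete minibatches; the last two rows are dropped.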
This "answer" is an implementation of muskrat's tf.slice suggestion with the details of tf.while_loop worked out (with help from How to use tf.while_loop() in tensorflow and https://www.tensorflow.org/api_docs/python/tf/while_loop).
Unless your data and model are small enough that you're bottlenecked by Python I/O (like me!), this solution is probably academic.
Trains over minibatches without returning to the Python thread.
Uses only ops that have GPU implementations meaning that the entire graph can be placed in the GPU.
On my small dataset, which is presumably bottlenecked by Python I/O, this solution is twice the speed of my dataset/iterator (which touches Python once per minibatch) and four times the speed of passing minibatches through feed_dict.
tf.while_loop is treacherous. It's challenging to understand when ops inside the loop's body are evaluated and when those they depend on are evaluated, particularly given the (thin) official documentation and limited Stack Overflow coverage.
What the tf.while_loop documentation doesn't say is that tensors outside the body of the loop are only evaluated once, even if inner ops depend on them. This means that optimization, model, and loss have to be defined in the loop. This limits flexibility if you'd like to e.g. be able to call validation loss ops between training epochs. Presumably this could be accomplished with tf.cond statements and the appropriate flags passed in via feed_dict. But that is not nearly as flexible or elegant as the dataset/iterator mechanism in tf.data.
Shuffling at each epoch doesn't seem to be available on GPU.
Here's my schematic code (I've omitted the variable and model definitions for brevity):
def buildModel(info, training_data, training_targets):
    graph = tf.Graph()
    with graph.as_default():
        # batch_size is passed in from Python once per epoch.
        batch_size = tf.placeholder(tf.float32, name='batch_size')
        # Initializers for loop variables for tf.while_loop
        batchCounter = tf.Variable(0, dtype=tf.float32, trainable=False)
        lossList = tf.Variable(tf.zeros([0, 1]), trainable=False)
        # In a full example, I'd normalize my data here. And possibly shuffle
        tf_training_data = tf.constant(training_data, dtype=tf.float32)
        tf_training_targets = tf.constant(training_targets, dtype=tf.float32)
        # For brevity, I'll spare the definitions of my variables. Because tf.Variables
        # are essentially treated as globals in the model and are manipulated directly (like with tf.apply)
        # they can reside outside runMinibatch, the body of tf.while_loop.
        # weights_1 =
        # biases_1 =
        # etc.
        def moreMinibatches(batchCount, lossList):
            return (batchCount + 1) * batch_size <= len(training_data)
        def runMinibatch(batchCount, lossList):
            # These tensors and ops have to be defined inside runMinibatch, otherwise they're not
            # updated as tf.while_loop loops. This means slices, model definition, loss tensor,
            # and training op. Note that the slices use the loop variable batchCount.
            dat_batch = tf.slice(tf_training_data, [tf.cast(batchCount * batch_size, tf.int32), 0], [tf.cast(batch_size, tf.int32), -1])
            targ_batch = tf.slice(tf_training_targets, [tf.cast(batchCount * batch_size, tf.int32), 0], [tf.cast(batch_size, tf.int32), -1])
            # Here's where you'd define the model as a function of weights and biases above and dat_batch
            # model = <insert here>
            loss = tf.reduce_mean(tf.squared_difference(model, targ_batch))
            optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)  # for example
            train_op = optimizer.minimize(loss, name='optimizer')
            # control_dependencies ensures that train_op is run before return
            # even though the return values don't explicitly depend on it.
            with tf.control_dependencies([train_op]):
                return batchCount + 1, tf.concat([lossList, [[loss]]], 0)
        # So, the idea is that this trains a full epoch without returning to Python.
        trainAllMinibatches = tf.while_loop(moreMinibatches, runMinibatch, [batchCounter, lossList],
                                            shape_invariants=[batchCounter.get_shape(), tf.TensorShape(None)])
    return (graph,
            {'trainAllMinibatches': trainAllMinibatches,
             'minibatchCounter': batchCounter,
             # 'norm_loss': norm_loss,  # not defined in this schematic
             })
numEpochs = 100  # e.g.
minibatchSize = 32  #
# training_dataset = <data here>
# training_targets = <targets here>
graph, ops = buildModel(info, training_dataset, training_targets)
with tf.Session(graph=graph, config=config) as session:
    for i in range(numEpochs):
        # This op will train on all complete minibatches that fit in the full dataset. finalBatchCount will be
        # the number of complete minibatches in the dataset. lossList holds the loss of each minibatch step.
        # The feed_dict below is an assumed completion; the original line was cut off after the first argument.
        finalBatchCount, lossList = session.run(ops['trainAllMinibatches'],
                                                feed_dict={'batch_size:0': minibatchSize})
        print('minibatch losses at Epoch', i, ': ', lossList)
I implemented the tf.slice() and tf.while_loop approach suggested above to vectorize the minibatch loop.
In my case it was about 1.86 times faster than feeding minibatches through feed_dict, but I found that the loss values did not stabilize from epoch to epoch.
I then changed the code to apply tf.random_shuffle to the inputs every epoch, which largely mitigated the problem (the performance gain dropped to 1.68 times).
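The reshuffling step can be sketched in plain Python as follows. This is only an illustration of the idea (permute the rows once per epoch, then slice as before). epoch_batches is a hypothetical helper, and random.Random stands in for tf.random_shuffle.

```python
import random


def epoch_batches(rows, batch_size, rng):
    """Shuffle once per epoch, then cut the result into complete minibatches."""
    order = list(range(len(rows)))
    rng.shuffle(order)
    shuffled = [rows[i] for i in order]
    n_batches = len(shuffled) // batch_size
    return [shuffled[b * batch_size:(b + 1) * batch_size]
            for b in range(n_batches)]


rng = random.Random(0)  # fixed seed so runs are repeatable
rows = list(range(10))
epoch_1 = epoch_batches(rows, 5, rng)
epoch_2 = epoch_batches(rows, 5, rng)
```

Each epoch still visits every row exactly once (when the batch size divides the dataset), but the grouping into minibatches changes from epoch to epoch. When pairing inputs with targets, both must be permuted with the same order.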

"rewind" tensorflow training step

I occasionally hit a problem with training in TensorFlow and stochastic gradient descent where I load a mini-batch that wreaks havoc on my optimization op, pushing it to NaNs. This, of course, throws an error in the training process and forces me to start over. Even if I wrap the optimization op in a try statement, by the time an exception is raised, the damage is done and I need to restart.
Does anyone have a good way of, essentially, rewinding optimization back to a valid state when it hits an error? I would think you could use checkpoints for this, but the docs on saving/restoring are so spotty that I'm not sure...
As you suggest, checkpoints are the way to do it. The key steps for your case are as follows:
First create a saver object after you've defined your graph:
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)
Next, write out checkpoints intermittently during training:
for step in range(max_steps):
    # ... some training steps here
    # Save the model every 100 iterations
    if step % 100 == 0:
        saver.save(sess, checkpoint_dir, global_step=step)
Finally, when you catch an error, reload the last good checkpoint:
# this next command restores the latest checkpoint or explicitly specify the filename if you want to use some other logic
restore_fn = tf.train.latest_checkpoint(FLAGS.restore_dir)
print('Restoring from %s' % restore_fn)
saver.restore(sess, restore_fn)
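The control flow of save, detect, and restore can be sketched without TensorFlow. The following is a minimal in-memory version under stated assumptions: train_with_rollback and toy_step are hypothetical, the "checkpoint" is a deep copy of a parameter dict rather than a file written by a Saver, and the bad batch is simply skipped after the rewind.

```python
import copy
import math


def train_with_rollback(params, batches, step_fn, save_every=2):
    """Run step_fn over batches; rewind to the last snapshot on NaN/inf."""
    last_good = copy.deepcopy(params)
    skipped = []
    for step, batch in enumerate(batches):
        candidate = step_fn(params, batch)
        if any(math.isnan(v) or math.isinf(v) for v in candidate.values()):
            params = copy.deepcopy(last_good)  # rewind to the snapshot
            skipped.append(step)
            continue
        params = candidate
        if step % save_every == 0:
            last_good = copy.deepcopy(params)  # periodic "checkpoint"
    return params, skipped


def toy_step(params, batch):
    # Hypothetical update rule: add the batch value to the single weight.
    return {"w": params["w"] + batch}


final_params, skipped_steps = train_with_rollback(
    {"w": 0.0}, [1.0, 1.0, float("nan"), 1.0], toy_step)
```

Note that the rewind also discards the progress made since the last snapshot (step 1 in this run), which is the same trade-off as saving a checkpoint only every 100 iterations.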
Answering a different question:
Which optimizer are you using?
Big jumps, like you can get with simple gradient descent, shouldn't be possible with gradient clipping or an optimizer with a limited step size (like Adam).
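For reference, clipping by norm rescales the gradient so that its length never exceeds a chosen threshold. A minimal sketch in plain Python follows; clip_by_norm here is a hypothetical helper written for illustration, not the TensorFlow function of a similar name.

```python
import math


def clip_by_norm(grads, clip_norm):
    """Rescale grads so their L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clip_norm:
        return list(grads)  # already small enough; leave unchanged
    scale = clip_norm / norm
    return [g * scale for g in grads]


clipped = clip_by_norm([3.0, 4.0], 1.0)    # norm 5.0 is scaled down to 1.0
unchanged = clip_by_norm([0.3, 0.4], 1.0)  # norm 0.5 is left alone
```

Clipping bounds the size of a step but keeps its direction. It does not repair a gradient that already contains NaN, so it complements checkpoints rather than replacing them.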

Tensorflow: Which graph statements are executed after the graph is built?

In Tensorflow, which statements within a graph definition block are executed only to build the graph vs. which are executed during training? For example:
with tf.Graph().as_default():
    weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nOutputUnits]))
    weightsLayer1 = tf.div(weightsLayer1, tf.sqrt(tf.to_float(nInputUnits)))
    biasesLayer1 = tf.Variable(tf.zeros([nUnitsHiddenLayer1]))
    layer1output = tf.tanh(tf.matmul(images_placeholder, weightsLayer1) + biasesLayer1)
Intuitively, the lines defining weightsLayer1 and biasesLayer1 I assume are only executed once at startup, since they initialize weights and biases. However, the line computing layer1output I assume executes at every training step, since layer1output is used downstream to compute loss, which is minimized by the optimizer. So, how does Tensorflow know, during training, to only execute the last line and not the previous ones (which would re-initialize the weights and biases)?
You as the user are actually telling tensorflow which operations to run. During training, you typically tell tensorflow to execute operations that are provided by an optimizer. This looks something like this:
opt = tf.train.GradientDescentOptimizer(0.01)
train_step = opt.minimize(loss)
for i in range(100):
    sess.run(train_step, feed_dict=...)
Calling opt.minimize adds to the computation graph the gradients w.r.t. the trainable variables, as well as operations that update the variables using those gradients. train_step is in fact these update operations grouped using tf.group. If you (the user) run train_step, TensorFlow figures out which parts of the computation graph it needs to run in order to execute these operations.
Likewise, if you do something like sess.run(fetches=loss, feed_dict=...), you are asking tensorflow to execute all operations in the graph that are necessary to compute loss.
Finally, initialization operations like the one in weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nOutputUnits])) are usually run by sess.run(tf.initialize_all_variables()) (renamed tf.global_variables_initializer() in later TensorFlow 1.x releases).
Edit: After re-reading your question, I want to be more clear about one aspect. No operations are actually executed by the graph definition code you provided. Tensorflow operations are executed if and only if you start a session and request the execution of parts of your graph. As stated above, that includes the initialization operations.
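The build-then-run distinction can be illustrated with a tiny deferred-evaluation graph in plain Python. Node, make_node, and run are hypothetical names for this sketch. It does not model variables or initializers, only the point that defining nodes executes nothing, and that running a node executes just the nodes it depends on.

```python
class Node:
    """A deferred computation: a function plus the nodes it reads from."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs


executed = []  # records which nodes actually ran, in order


def make_node(name, fn, *inputs):
    def traced(*args):
        executed.append(name)
        return fn(*args)
    return Node(traced, *inputs)


def run(node, cache=None):
    """Evaluate node and, recursively, only the nodes it depends on."""
    cache = {} if cache is None else cache
    if id(node) not in cache:
        args = [run(parent, cache) for parent in node.inputs]
        cache[id(node)] = node.fn(*args)
    return cache[id(node)]


# "Graph definition": nothing is computed here.
weights = make_node("weights", lambda: 2.0)
inputs = make_node("inputs", lambda: 3.0)
output = make_node("output", lambda w, x: w * x, weights, inputs)
unused = make_node("unused", lambda w: w + 1.0, weights)
ran_during_build = list(executed)

# "Session run": only the ancestors of the requested node execute.
result = run(output)
```

After the run, the trace contains weights, inputs, and output, but not unused, mirroring how a session executes only the subgraph needed for the requested fetches.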