Tensorflow: Which graph statements are executed after the graph is built?

In Tensorflow, which statements within a graph definition block are executed only to build the graph vs. which are executed during training? For example:
with tf.Graph().as_default():
    weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nOutputUnits]))
    weightsLayer1 = tf.div(weightsLayer1, tf.sqrt(tf.to_float(nInputUnits)))
    biasesLayer1 = tf.Variable(tf.zeros([nUnitsHiddenLayer1]))
    layer1output = tf.tanh(tf.matmul(images_placeholder, weightsLayer1) + biasesLayer1)
Intuitively, I assume the lines defining weightsLayer1 and biasesLayer1 are executed only once at startup, since they initialize the weights and biases. The line computing layer1output, however, I assume executes at every training step, since layer1output is used downstream to compute the loss, which is minimized by the optimizer. So, how does Tensorflow know, during training, to execute only the last line and not the previous ones (which would re-initialize the weights and biases)?

You as the user are actually telling tensorflow which operations to run. During training, you typically tell tensorflow to execute operations that are provided by an optimizer. This looks something like this:
opt = tf.train.GradientDescentOptimizer(0.01)
train_step = opt.minimize(loss)
for i in range(100):
    sess.run(train_step, feed_dict=...)
Calling opt.minimize adds two things to the computation graph: the gradients w.r.t. the trainable variables, and the operations that update the variables using those gradients. train_step is in fact these update operations grouped using tf.group. If you (the user) run train_step, tensorflow figures out which parts of the computation graph it needs to run in order to execute these operations.
Likewise, if you do something like sess.run(fetches=loss, feed_dict=...), you are asking tensorflow to execute all operations in the graph that are necessary to compute loss.
Finally, initialization operations like the one in weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nOutputUnits])) are usually run by sess.run(tf.initialize_all_variables()) (renamed tf.global_variables_initializer() in later TensorFlow 1.x releases).
Edit: After re-reading your question, I want to be more clear about one aspect. No operations are actually executed by the graph definition code you provided. Tensorflow operations are executed if and only if you start a session and request the execution of parts of your graph. As stated above, that includes the initialization operations.
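The point that nothing runs until you request it can be sketched with a toy model in plain Python. This is not TensorFlow's implementation; Node, run and the node names are hypothetical and only mimic the idea that a fetch executes just the nodes it depends on.

```python
# Toy model of deferred execution; NOT TensorFlow's implementation.
# Node, run and all node names are hypothetical.

class Node:
    def __init__(self, name, fn, inputs=()):
        self.name = name
        self.fn = fn
        self.inputs = list(inputs)


def run(fetch, executed):
    """Evaluate `fetch`, visiting only the nodes it depends on."""
    cache = {}

    def evaluate(node):
        if node.name not in cache:
            args = [evaluate(dep) for dep in node.inputs]
            executed.append(node.name)
            cache[node.name] = node.fn(*args)
        return cache[node.name]

    return evaluate(fetch)


state = {"w": None}


def init_fn():
    state["w"] = 0.5
    return state["w"]


# "Graph definition": creating these nodes computes nothing.
init_w = Node("init_w", init_fn)               # plays the role of a variable initializer
read_w = Node("read_w", lambda: state["w"])    # reads the current variable value
x = Node("x", lambda: 4.0)                     # stands in for a fed placeholder
layer_out = Node("layer_out", lambda w, v: w * v, [read_w, x])

log = []
run(init_w, log)                # explicit, one-off initialization
init_log = list(log)

log.clear()
result = run(layer_out, log)    # a "step": the initializer is not a dependency
```

Running layer_out touches read_w, x and layer_out only, so the weights are not re-initialized on every step.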


Tensorflow batch normalization: difference momentum and renorm_momentum

I want to replicate in TensorFlow a network built with the Lasagne library. I'm having some trouble with the batch normalization.
This is the lasagne documentation about the used batch normalization:
In tensorflow I found two functions to normalize: tf.nn.batch_normalization and tf.layers.batch_normalization.
The first one is simpler but does not let me choose the alpha parameter from lasagne (Coefficient for the exponential moving average of batch-wise means and standard deviations computed during training). I tried using the second function, which has a lot more options, but there are two things I do not understand about it:
I am not clear about the difference between momentum and renorm_momentum. If I have an alpha of 0.9 in the lasagne network, can I just set both tensorflow momentums to 0.9 and expect the same behaviour?
The tf documentation notes:
when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_op = optimizer.minimize(loss)
I do not really understand what is happening here and where I need to put something similar in my code. Can I just put this somewhere before I run the session? What parts of this code piece should I not copy literally but change depending on my code?
There is a big difference between tf.nn.batch_normalization and tf.layers.batch_normalization. See my answer here. So you have made the right choice by using the layers version. Now, on your questions:
1. renorm_momentum only has an effect if you use batch renormalization by setting the renorm argument to True. You can ignore this if using default batch normalization.
2. Short answer: You can literally copy that code snippet. Put it exactly where you would normally call optimizer.minimize.
Long answer on 2.: Batch normalization has two "modes": training and inference. During training, the mean and variance of the current minibatch are used. During inference, this is not desirable (e.g. you might not even use batches as input, so there would be no minibatch statistics). For this reason, moving averages over the minibatch means/variances are kept during training. These moving averages are then used for inference.
By default, Tensorflow only executes what it needs to. The moving averages are not needed to compute a training step, so their update ops would normally never be executed. The tf.control_dependencies context manager forces Tensorflow to run the updates every time it runs an op created inside the code block (in this case the training op returned by minimize). Since the training op is run exactly once per training step, this is a good way of making sure the moving averages are updated.
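The effect of a control dependency can be sketched with a toy in plain Python. It is only an illustration of the idea, not how TensorFlow schedules ops; Op, run and the op names are hypothetical.

```python
# Toy illustration of control dependencies; NOT TensorFlow's implementation.
# Op, run and the op names are hypothetical.

class Op:
    def __init__(self, name, fn, control_inputs=()):
        self.name = name
        self.fn = fn
        self.control_inputs = list(control_inputs)


def run(op, log):
    # Control inputs run first, even though their results are not consumed.
    for dep in op.control_inputs:
        run(dep, log)
    log.append(op.name)
    return op.fn()


stats = {"moving_mean": 0.0}


def update_fn():
    # Pretend the current minibatch mean is 2.0 and the momentum is 0.9.
    stats["moving_mean"] = 0.9 * stats["moving_mean"] + 0.1 * 2.0
    return stats["moving_mean"]


update_moving_mean = Op("update_moving_mean", update_fn)

# Without the dependency, the update op is never needed, so it never runs.
plain_train = Op("train_step", lambda: "stepped")
log_plain = []
run(plain_train, log_plain)
mean_after_plain = stats["moving_mean"]

# With the dependency, every training step also triggers the update.
guarded_train = Op("train_step", lambda: "stepped",
                   control_inputs=[update_moving_mean])
log_guarded = []
run(guarded_train, log_guarded)
mean_after_guarded = stats["moving_mean"]
```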
The code example seems a bit arcane, but in context it would really just be (as an example):
loss = ...
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ...
becomes
loss = ...
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ...
Finally, keep in mind to use the correct training argument for batch normalization so that either minibatch statistics or moving averages are used as intended.
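To make the two modes concrete, here is a minimal sketch in plain Python (not TensorFlow code). It assumes the moving-average update has the form moving = moving * momentum + batch_stat * (1 - momentum), which is how I understand the documented behaviour of the layers version; mean_var, update_moving and normalize are hypothetical helper names. As far as the Lasagne documentation describes it, alpha weights the newest batch instead, so the two conventions appear to be mirrored and a value is worth checking before porting it.

```python
# Sketch of the two batch-norm modes; plain Python, not TensorFlow code.
# mean_var, update_moving and normalize are hypothetical helper names.

def mean_var(batch):
    m = sum(batch) / len(batch)
    v = sum((x - m) ** 2 for x in batch) / len(batch)
    return m, v


def update_moving(moving, batch_stat, momentum):
    # Assumed update form: the old average is weighted by `momentum`.
    return moving * momentum + batch_stat * (1.0 - momentum)


def normalize(batch, mean, var, eps=1e-3):
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]


momentum = 0.9
moving_mean, moving_var = 0.0, 1.0   # common initial values

for batch in ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]):
    m, v = mean_var(batch)
    train_out = normalize(batch, m, v)   # training mode: minibatch statistics
    moving_mean = update_moving(moving_mean, m, momentum)
    moving_var = update_moving(moving_var, v, momentum)

# Inference mode: a single example has no useful batch statistics,
# so the accumulated moving averages are used instead.
infer_out = normalize([5.0], moving_mean, moving_var)
```

If the moving averages were never updated, inference would normalize with the initial values (0 and 1 here), which is the failure the control dependency guards against.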

Tensorflow slow inference speed in a loop

I am working on a reinforcement learning implementation using Tensorflow. After profiling on the training procedure, I found something really weird:
The following code is in a training loop:
state_batch, \
    action_batch, \
    reward_batch, \
    next_state_batch, \
    is_episode_finished_batch = self.data_manager.get_next_batch()
state_batch = np.divide(state_batch, 10.0)
next_state_batch = np.divide(next_state_batch, 10.0)
# Calculate y for the td_error of the critic
y_batch = []
next_action_batch = self.actor_network.target_evaluate(
    next_state_batch, action_batch)
q_value_batch = self.critic_network.target_evaluate(
    next_state_batch, next_action_batch)
for i in range(0, self.batch_size):
    if is_episode_finished_batch[i]:
        # Terminal transition: the target is the reward alone
        y_batch.append(reward_batch[i])
    else:
        y_batch.append(reward_batch[i] + GAMMA * q_value_batch[i])
# Now that we have the y batch, train the critic
self.critic_network.train(y_batch, state_batch, action_batch)
# Then get the action gradient batch and adapt the gradient with the gradient inverting method
action_batch_for_gradients = self.actor_network.evaluate(
    state_batch, action_batch)
q_gradient_batch = self.critic_network.get_action_gradient(
    state_batch, action_batch_for_gradients)
q_gradient_batch = self.grad_inv.invert(
    q_gradient_batch, action_batch_for_gradients)
# Now we can train the actor
self.actor_network.train(q_gradient_batch, state_batch, action_batch)
actor_network and critic_network are two classes that implement the actor and the critic of the actor-critic algorithm. Each has its own network and operations, but both live in the same graph and run within the same session. Each member function (evaluate, train, ...) contains a session.run call and feeds the data it needs from its parameters.
I observed that the call that produces action_batch_for_gradients runs extremely slowly, taking tenths of a second (0.x s) for one inference, which is much slower even than self.critic_network.train. It is simply an inference operation in the actor network to get the action. I then duplicated this line and found that only the first call, right after self.critic_network.train, is slow; the second one runs at the normal speed of a forward pass. I think it has something to do with switching, within one graph, between training one network and running a forward pass in another, but I can't tell how to avoid it.
I found some discussions on Stack Overflow about reusing the same graph in the loop, instead of building a new one each time, to speed up Tensorflow. But I already build the graph beforehand and only run different parts of it in the training loop, so I don't know what I am doing wrong in this loop. I am using Tensorflow 1.6.
I would appreciate your help!

How to smoothly produce Tensorflow auc summaries for training and test sets?

Tensorflow describes writing file summaries to visualize graph execution.
I envision three stages:
training the data (with optimization)
measuring accuracy on the training set (no optimization)
measuring accuracy on the test set (no optimization!)
I'd like all stages in the same script, as in the evaluate function of the wide_and_deep tutorial, but with the low-level API. I'd like three different graphs for stats like loss or AUC, one for each stage.
Suppose I use one session, and in each stage I define an AUC summary op:
# define auc
auc, auc_op = tf.metrics.auc(labels, predictions)
# summary scalar to track it
tf.summary.scalar("auc", auc_op, family=family_name)
# merge all summaries for evaluation and later writing
summary_op = tf.summary.merge_all()
summary_writer.add_summary(summary, step_num)
There are three graphs, but the first graph has all three runs on it, and the second graph has the last two runs. What's worse, each stage starts from the previous stage's state. This makes sense, because all the variables from the previous stages are still around.
I could use a different session for each stage, but that would throw away the model as well.
What is the smooth way to handle this?
I'd like to just clear some of the summary variables. I've tried re-initializing some variables, looked at related questions, read about name scope and variable scope and tried not to re-use variables for AUC, read about variables and sharing, looked into pruning nodes (though I don't understand it), etc. I have not made it work yet.
I am using the low-level API. I saw something like this in the high-level API in _eval_metric_ops, but I don't understand how they 'clear' the different stages. With name_scope?
Do I have to save and load the model into a new session just for this, or is there some clean way to graph each summary separately?
The metric ops keep their state in local variables, so you could run tf.local_variables_initializer() in your Session, which will reset all of your metrics. You could also look through the local variables collection for those with "auc" in the name if you wanted to be a bit more discerning. The high-level way to do this would be to use an Estimator, which will manage metrics for you.
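Why a reset between stages matters can be shown with a toy streaming metric in plain Python. It uses accuracy instead of AUC to keep the arithmetic obvious; the StreamingAccuracy class and its methods are hypothetical and only mimic the accumulating state that the local variables hold.

```python
# Toy stand-in for a streaming metric such as tf.metrics.auc.
# StreamingAccuracy and its method names are hypothetical.

class StreamingAccuracy:
    def __init__(self):
        self.reset()

    def reset(self):
        # Plays the role of running tf.local_variables_initializer().
        self.correct = 0
        self.total = 0

    def update(self, labels, predictions):
        # Accumulate counts and return the running value, like an update op.
        self.correct += sum(int(l == p) for l, p in zip(labels, predictions))
        self.total += len(labels)
        return self.correct / self.total


metric = StreamingAccuracy()

# Stage 1: measure on the training set (4 of 4 correct).
train_acc = metric.update([1, 0, 1, 1], [1, 0, 1, 1])

# Without a reset, stage 2 inherits stage 1's counts (0 of 4 correct here,
# but the running value still includes the 4 earlier hits).
leaky_test_acc = metric.update([1, 0, 1, 0], [0, 1, 0, 1])

# With a reset between stages, each stage is measured on its own.
metric.reset()
test_acc = metric.update([1, 0, 1, 0], [0, 1, 0, 1])
```

In the TensorFlow script, the equivalent of metric.reset() would be running the local-variable initializer in the session before each evaluation stage.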

Tensorflow input pipeline

I have an input pipeline where samples are generated on the fly. I use keras with a custom ImageDataGenerator and a corresponding Iterator to get samples into memory.
Under the assumption that keras in my setup is using feed_dict (and that assumption is itself one of my questions), I am thinking of speeding things up by switching to raw tensorflow + Dataset.from_generator().
Here I see that the suggested solution for input pipelines that generate data on the fly in the most recent Tensorflow is to use Dataset.from_generator().
Does keras with Tensorflow backend use feed_dict method?
If I switch to raw tensorflow + Dataset.from_generator(my_sample_generator) will that cut feed_dict memory copy overhead and buy me performance?
During the predict (evaluation) phase, apart from batch_x and batch_y I also have an opaque index vector in my generator output. That vector corresponds to the sample ids in batch_x. Does that mean that I'm stuck with the feed_dict approach for the predict phase because I need that extra batch_z output from the iterator?
The new tf.contrib.data.Dataset.from_generator() can potentially speed up your input pipeline by overlapping the data preparation with training. However, you will tend to get the best performance by switching over to TensorFlow ops in your input pipeline wherever possible.
To answer your specific questions:
The Keras TensorFlow backend uses tf.placeholder() to represent compiled function inputs, and feed_dict to pass arguments to a function.
With the recent optimizations to tf.py_func() and feed_dict copy overhead, I suspect the amount of time spent in memcpy() will be the same. However, you can more easily use Dataset.from_generator() with Dataset.prefetch() to overlap the training on one batch with preprocessing on the next batch.
It sounds like you can define a separate iterator for the prediction phase. The tf.estimator.Estimator class does something similar by instantiating different "input functions" with different signatures for training and evaluation, then building a separate graph for each role.
Alternatively, you could add a dummy output to your training iterator (for the batch_z values) and switch between training and evaluation iterators using a "feedable iterator".
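The overlap that prefetching buys can be sketched with the standard library alone. This is a rough analogue of the idea behind Dataset.prefetch(), not its implementation; prefetch and my_sample_generator below are hypothetical, and error handling in the producer thread is omitted for brevity.

```python
import queue
import threading


def prefetch(generator, buffer_size=1):
    """Yield items from `generator`, producing them in a background thread."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in generator:
            q.put(item)          # blocks while the buffer is full
        q.put(sentinel)          # signals the end of the data

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def my_sample_generator(n_batches):
    # Hypothetical generator; each element also carries sample ids (batch_z).
    for i in range(n_batches):
        batch_x = [i, i + 1]
        batch_y = [i % 2, (i + 1) % 2]
        batch_z = ["id-%d" % i, "id-%d" % (i + 1)]
        yield batch_x, batch_y, batch_z


seen_ids = []
for batch_x, batch_y, batch_z in prefetch(my_sample_generator(3)):
    # While this loop body works on one batch, the thread prepares the next.
    seen_ids.extend(batch_z)
```

The extra batch_z element simply travels along with each batch, which is the same idea as giving the prediction-phase iterator an additional output.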

TensorFlow: slow performance when getting gradients at inputs

I'm building a simple multilayer perceptron with TensorFlow, and I also need to obtain the gradients (or error signal) of the loss at the neural network's inputs.
Here's my code, which works:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(self.network, self.y))
optimizer = tf.train.AdagradOptimizer(learning_rate=nn_learning_rate).minimize(cost)
for i in range(epochs):
    for batch in batches:
        sess.run(optimizer, feed_dict=feed_dict)
        grads_wrt_input = sess.run(tf.gradients(cost, self.x), feed_dict=feed_dict)[0]
(edited to include training loop)
Without the last line (grads_wrt_input...), this runs really fast on a CUDA machine. With it, however, the tf.gradients() call slows training down by a factor of ten or more.
I recall that the error signals at the nodes are computed as intermediate values in the backpropagation algorithm, and I have successfully done this using the Java library DeepLearning4j. I was also under the impression that this would be a slight modification to the computation graph already built by optimizer.
How can this be made faster, or is there any other way to compute the gradients of the loss w.r.t. the inputs?
The tf.gradients() function builds a new backpropagation graph each time it is called, so the reason for the slowdown is that TensorFlow has to parse a new graph on each iteration of the loop. (This can be surprisingly expensive: the current version of TensorFlow is optimized for executing the same graph a large number of times.)
Fortunately the solution is easy: just compute the gradients once, outside the loop. You can restructure your code as follows:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(self.network, self.y))
optimizer = tf.train.AdagradOptimizer(learning_rate=nn_learning_rate).minimize(cost)
grads_wrt_input_tensor = tf.gradients(cost, self.x)[0]
# ...
for i in range(epochs):
    # ...
    for batch in batches:
        # ...
        _, grads_wrt_input = sess.run([optimizer, grads_wrt_input_tensor],
                                      feed_dict=feed_dict)
Note that, for performance, I also combined the two sess.run() calls. This ensures that the forward propagation, and much of the backpropagation, will be reused.
As an aside, one tip to find performance bugs like this is to call tf.get_default_graph().finalize() before starting your training loop. This will raise an exception if you inadvertently add any nodes to the graph, which makes it easier to trace the cause of these bugs.
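Both the bug and the finalize() safeguard can be imitated with a toy graph in plain Python. ToyGraph, add_node and build_gradients are hypothetical names; the sketch only counts nodes and is not TensorFlow's Graph class.

```python
# Toy graph that only counts nodes; NOT TensorFlow's Graph class.
# ToyGraph, add_node and build_gradients are hypothetical names.

class ToyGraph:
    def __init__(self):
        self.nodes = []
        self.finalized = False

    def add_node(self, name):
        if self.finalized:
            raise RuntimeError("graph is finalized; cannot add %r" % name)
        self.nodes.append(name)
        return name

    def finalize(self):
        self.finalized = True


def build_gradients(graph):
    # Stands in for tf.gradients(): every call adds new nodes.
    return graph.add_node("grad_%d" % len(graph.nodes))


# Anti-pattern: building inside the loop grows the graph on every iteration.
slow = ToyGraph()
for _ in range(5):
    build_gradients(slow)

# Fix: build once, finalize, then only run things inside the loop.
fast = ToyGraph()
grad = build_gradients(fast)
fast.finalize()

caught = False
try:
    build_gradients(fast)    # an accidental rebuild is now detected
except RuntimeError:
    caught = True
```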