"rewind" tensorflow training step - tensorflow

I occasionally hit a problem with training in tensorflow and stochastic gradient descent where I load a mini-batch that wreaks havoc on my optimization op, pushing it to Nans. This, of course, throws an error in the training process and forces me to start over. Even if I wrap the optimization op in a try statement, by the time an exception is raised, the damage is done and I need to re-start.
Does anyone have a good way of, essentially, rewinding optimization back to a valid state when it hits an error? I would think you could use checkpoints for this, but the docs on saving/restoring are so spotty that i'm not sure...

As you suggest checkpoints are the way to do it. The key steps for your case are as follows:
First create a saver object after you've defined your graph:
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)
Next, write out check points intermittently during training:
for step in range(max_steps):
... some training steps here
# Save the model every 100 iterations
if step % 100 == 0:
saver.save(sess, checkpoint_dir, global_step=step)
Finally, when you catch an error, reload the last good checkpoint:
# this next command restores the latest checkpoint or explicitly specify the filename if you want to use some other logic
restore_fn = tf.train.latest_checkpoint(FLAGS.restore_dir)
print('Restoring from %s' % restore_fn)
saver.restore(sess, restore_fn)

Answering a different question:
Which optimizer are you using?
Big jumps, like you can get with simple gradient descent, shouldn't be possible with gradient clipping or an optimizer with a limited step size (like Adam).

Related

How to proceed training after stopping it and changing some parameters?

I'm training my model via model.fit() in Keras. I stopped the training, by interrupting it, or even because it is done, and then changed the batch_size and decided to go with more training. Here is what's happening:
The loss when the training was stopped/finished = 26
The loss when the training proceeded = 46
Meanining that I lost all the progress I made and it is as if I'ms starting over.
It does procceed from where it left only if I don't change anything. But if I changed the batch size, it is as if the optimizer re-initializes my weights and throw out my progress. How can I get a handle on what the optimizer is doing without my consent ?
You most likely have some examples that give you large loss values. MSE makes this worse. When batch size is larger then you are probably getting a lot of these outliers in your batch. You can look at the top loss contributing examples.

Tensorflow batch normalization: difference momentum and renorm_momentum

I want to replicate a network build with the lasagne-library in tensor flow. I'm having some trouble with the batch normalization.
This is the lasagne documentation about the used batch normalization:
http://lasagne.readthedocs.io/en/latest/modules/layers/normalization.html?highlight=batchNorm
In tensorflow I found two functions to normalize:
https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
The first one is simpler but does not let me choose the alpha parameter from lasagne (Coefficient for the exponential moving average of batch-wise means and standard deviations computed during training). I tried using the second function, which has a lot more options, but there are two things I do not understand about it:
I am not clear about the difference between momentum and renorm_momentum. If I have a alpha of 0.9 in the lasagne network, can I just set both tensorflow momentums to 0.9 and expect the same behaviour?
The tf documentation notes:
when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_op = optimizer.minimize(loss)
I do not really understand what is happening here and where I need to put something similar in my code. Can I just put this somewhere before I run the session? What parts of this code piece should I not copy literally but change depending on my code?
There is a big difference between tf.nn.batch_normalization and tf.layers.batch_normalization. See my answer here. So you have made the right choice by using the layers version. Now, on your questions:
renorm_momentum only has an effect is you use batch renormalization by setting the renorm argument to True. You can ignore this if using default batch normalization.
Short answer: You can literally copy that code snippet. Put it exactly where you would normally call optimizer.minimize.
Long answer on 2.: Batch normalization has two "modes": Training and inference. During training, mean and variance of the current minibatch is used. During inference, this is not desirable (e.g. you might not even use batches as input, so there would be no minibatch statistics). For this reason, moving averages over minibatch means/variances are kept during training. These moving averages are then used for inference.
By default, Tensorflow only executes what it needs to. Those moving averages are not needed for training, so they normally would never be executed/updated. The tf.control_dependencies context manager forces Tensorflow to do the updates every time it computes whatever is in the code block (in this case the cost). Since the cost certainly needs to be computed exactly one per training step, this is a good way of making sure the moving averages are updated.
The code example seems a bit arcane, but in context it would really just be (as an example):
loss = ...
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
....
becomes
loss = ...
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
....
Finally, keep in mind to use the correct training argument for batch normalization so that either minibatch statistics or moving averages are used as intended.

`estimator.train` with num_steps in Tensorflow

I have made a custom estimator in Tensorflow 1.4. In estimator.trainfunction, I see a steps parameter, which I am using as a way to stop the training and then evaluate on my validation dataset.
while True:
model.train(input_fn= lambda:train_input_fn(train_data), steps = FLAGS.num_steps)
model.evaluate(input_fn= lambda:train_input_fn(test_data))
After every num_steps, I run evaluate on validation dataset.
What I am observing is, after num_steps, once the evaluation is done, there is a jerk in the plot of AUC/Loss functions(in general all metric).
Plot attached :
I am unable to understand why it's happening.
Is it not the right way to evaluate metrics on validation dataset at regular intervals
Link to code
The issue
The issue comes from the fact that what you plot in TensorBoard is the accuracy or AUC computed since the beginning of estimator.train.
Here is what happens in details:
you create a summary based on the second output of tf.metrics.accuracy
accuracy = tf.metrics.accuracy(labels, predictions)
tf.summary.scalar('accuracy', accuracy[1])
when you call estimator.train(), a new Session is created and all the local variables are initialized again. This includes the local variables of accuracy (sum and count)
during this Session, the op tf.summary.merge_all() is called at regular intervals. What happens is that your summary is the accuracy of all the batches processed since you last called estimator.train(). Therefore, at the beginning of each training phase, the output is pretty noisy and it gets more stable once you progress.
Whenever you evaluate and call estimator.train() again, the local variables are initialized again and you go in a short "noisy" phase, which results in bumps on the training curve.
A solution
If you want a scalar summary that gives you the actual accuracy for each batch, it seems like you need to implement it without using tf.metrics. For instance, if you want the accuracy you will need to do:
accuracy = tf.reduce_mean(tf.cast(tf.equal(labels, predictions), tf.float32))
tf.summary.scalar('accuracy', accuracy)
It is easy to implement this for the accuracy, and I know it might be painful to do for AUC but I don't see a better solution for now.
Maybe having these bumps is not so bad. For instance if you train on one epoch, you will get the overall training accuracy on one epoch at the end.

Saving the state of the AdaGrad algorithm in Tensorflow

I am trying to train a word2vec model, and want to use the embeddings for another application. As there might be extra data later, and my computer is slow when training, I would like my script to stop and resume training later.
To do this, I created a saver:
saver = tf.train.Saver({"embeddings": embeddings,"embeddings_softmax_weights":softmax_weights,"embeddings_softmax_biases":softmax_biases})
I save the embeddings, and softmax weights and biases so I can resume training later. (I assume that this is the correct way, but please correct me if I'm wrong).
Unfortunately when resuming training with this script the average loss seems to go up again.
My idea is that this can be attributed to the AdaGradOptimizer I'm using. Initially the outer product matrix will probably be set to all zero's, where after my training it will be filled (leading to a lower learning rate).
Is there a way to save the optimizer state to resume learning later?
While TensorFlow seems to complain when you attempt to serialize an optimizer object directly (e.g. via tf.add_to_collection("optimizers", optimizer) and a subsequent call to tf.train.Saver().save()), you can save and restore the training update operation which is derived from the optimizer:
# init
if not load_model:
optimizer = tf.train.AdamOptimizer(1e-4)
train_step = optimizer.minimize(loss)
tf.add_to_collection("train_step", train_step)
else:
saver = tf.train.import_meta_graph(modelfile+ '.meta')
saver.restore(sess, tf.train.latest_checkpoint('./'))
train_step = tf.get_collection("train_step")[0]
# training loop
while training:
if iteration % save_interval == 0:
saver = tf.train.Saver()
save_path = saver.save(sess, filepath)
I do not know of a way to get or set the parameters specific to an existing optimizer, so I do not have a direct way of verifying that the optimizer's internal state was restored, but training resumes with loss and accuracy comparable to when the snapshot was created.
I would also recommend using the parameterless call to Saver() so that state variables not specifically mentioned will still be saved, although this might not be strictly necessary.
You may also wish to save the iteration or epoch number for later restoring, as detailed in this example:
http://www.seaandsailor.com/tensorflow-checkpointing.html

TensorFlow: slow performance when getting gradients at inputs

I'm building a simple multilayer perceptron with TensorFlow, and I also need to obtain the gradients (or error signal) of the loss at the neural network's inputs.
Here's my code, which works:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(self.network, self.y))
optimizer = tf.train.AdagradOptimizer(learning_rate=nn_learning_rate).minimize(cost)
...
for i in range(epochs):
....
for batch in batches:
...
sess.run(optimizer, feed_dict=feed_dict)
grads_wrt_input = sess.run(tf.gradients(cost, self.x), feed_dict=feed_dict)[0]
(edited to include training loop)
Without the last line (grads_wrt_input...), this runs really fast on a CUDA machine. However, tf.gradients() reduces performance greatly by tenfold or more.
I recall that the error signals at the nodes are computed as intermediate values in the backpropagation algorithm, and I have successfully done this using the Java library DeepLearning4j. I was also under the impression that this would be a slight modification to the computation graph already built by optimizer.
How can this be made faster, or is there any other way to compute the gradients of the loss w.r.t. the inputs?
The tf.gradients() function builds a new backpropagation graph each time it is called, so the reason for the slowdown is that TensorFlow has to parse a new graph on each iteration of the loop. (This can be surprisingly expensive: the current version of TensorFlow is optimized for executing the same graph a large number of times.)
Fortunately the solution is easy: just compute the gradients once, outside the loop. You can restructure your code as follows:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(self.network, self.y))
optimizer = tf.train.AdagradOptimizer(learning_rate=nn_learning_rate).minimize(cost)
grads_wrt_input_tensor = tf.gradients(cost, self.x)[0]
# ...
for i in range(epochs):
# ...
for batch in batches:
# ...
_, grads_wrt_input = sess.run([optimizer, grads_wrt_input_tensor],
feed_dict=feed_dict)
Note that, for performance, I also combined the two sess.run() calls. This ensures that the forward propagation, and much of the backpropagation, will be reused.
As an aside, one tip to find performance bugs like this is to call tf.get_default_graph().finalize() before starting your training loop. This will raise an exception if you inadvertantly add any nodes to the graph, which makes it easier to trace the cause of these bugs.