Difference between `apply_gradients` and `minimize` of optimizer in tensorflow

I am confused about the difference between apply_gradients and minimize of the optimizer in tensorflow. For example,
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(cnn.loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
and
optimizer = tf.train.AdamOptimizer(1e-3)
train_op = optimizer.minimize(cnn.loss, global_step=global_step)
Are they actually the same?
If I want to decay the learning rate, can I use the following code?
global_step = tf.Variable(0, name="global_step", trainable=False)
starter_learning_rate = 1e-3
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100, FLAGS.decay_rate, staircase=True)
# Passing global_step to apply_gradients() (or minimize()) will increment it at each step.
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cnn.loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
Thanks for your help!

You can easily see from this link: https://www.tensorflow.org/get_started/get_started
(the tf.train API part) that they actually do the same job.
The difference is that if you use the separate functions (compute_gradients and apply_gradients), you can apply other mechanisms between them, such as gradient clipping.
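For example, a minimal sketch of gradient clipping between the two calls (the 5.0 clip norm is just an illustrative value; cnn.loss and global_step are the names from the question):
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(cnn.loss)
# clip each gradient to a maximum L2 norm before applying it
clipped_grads_and_vars = [(tf.clip_by_norm(g, 5.0), v)
                          for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped_grads_and_vars, global_step=global_step)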

Here it says that minimize uses tf.GradientTape and then apply_gradients:
Minimize loss by updating var_list.
This method simply computes gradient using tf.GradientTape and calls
apply_gradients(). If you want to process the gradient before applying
then call tf.GradientTape and apply_gradients() explicitly instead of
using this function.
So minimize actually uses apply_gradients just like:
def minimize(self, loss, var_list, grad_loss=None, name=None, tape=None):
    grads_and_vars = self._compute_gradients(loss, var_list=var_list, grad_loss=grad_loss, tape=tape)
    return self.apply_gradients(grads_and_vars, name=name)
In your example you use compute_gradients and apply_gradients; this is indeed valid, but nowadays compute_gradients has been made private and it is therefore not good practice to use it. For this reason the function is no longer in the documentation.
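For reference, a minimal sketch of the explicit TF2-style route the docstring describes (model, loss_fn, x and y are assumed to exist; the clipping step is just an example of processing the gradients before applying them):
optimizer = tf.keras.optimizers.Adam(1e-3)
with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))
# compute the gradients explicitly so they can be processed before the update
grads = tape.gradient(loss, model.trainable_variables)
grads, _ = tf.clip_by_global_norm(grads, 5.0)
optimizer.apply_gradients(zip(grads, model.trainable_variables))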

Related

How to get the training loss and the evaluation loss at every global step in a TensorFlow Estimator?

I can get the training loss at every global step. But I also want to add the evaluation loss to the graph 'lossxx' in TensorBoard. How can I do that?
class MyHook(tf.train.SessionRunHook):
    def after_run(self, run_context, run_value):
        _session = run_context.session
        _session.run(_session.graph.get_operation_by_name('acc_op'))

def my_model(features, labels, mode):
    ...
    logits = tf.layers.dense(net, 3, activation=None)
    predicted_classes = tf.argmax(logits, 1)
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {
            'class': predicted_classes,
            'prob': tf.nn.softmax(logits)
        }
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    # Compute loss.
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    acc, acc_op = tf.metrics.accuracy(labels=labels, predictions=predicted_classes)
    tf.identity(acc_op, 'acc_op')
    loss_sum = tf.summary.scalar('lossxx', loss)
    accuracy_sum = tf.summary.scalar('accuracyxx', acc)
    merg = tf.summary.merge_all()

    # Create training op.
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op,
                                          training_chief_hooks=[
                                              tf.train.SummarySaverHook(save_steps=10, output_dir='./model', summary_op=merg)])

    return tf.estimator.EstimatorSpec(
        mode, loss=loss, eval_metric_ops={'accuracy': (acc, acc_op)}
    )

classifier.train(input_fn=train_input_fn, steps=1000, hooks=[MyHook()])
You actually don't need to create a SummarySaverHook yourself, as one is already included in tf.estimator.Estimator. Just create all the summaries you want with tf.summary.xxx and they will all be evaluated every n steps (see tf.estimator.RunConfig for this).
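A minimal sketch of what that could look like (the values are illustrative; my_model is the model_fn from the question):
# save summaries every 10 steps instead of creating a SummarySaverHook manually
config = tf.estimator.RunConfig(save_summary_steps=10, model_dir='./model')
classifier = tf.estimator.Estimator(model_fn=my_model, config=config)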
Also, you don't need to create a summary for your final loss (loss); that one is created for you automatically. If you do it like this, the training and evaluation summaries will be shown in the same graph on TensorBoard. The estimator creates a sub-directory eval inside your current model_dir to achieve this.
And a small hint: use acc_op directly in the summaries to both update the metric and get its value. Otherwise the tf.metrics functions are quite difficult to handle ;-)
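For example (same names as in the question's model_fn; this is just the hint applied literally):
# summarizing the update op means the accuracy metric is updated whenever the summary is evaluated
accuracy_sum = tf.summary.scalar('accuracyxx', acc_op)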
You need to pass evaluation data to the model alongside the training data by using tf.estimator.train_and_evaluate.
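A minimal sketch, assuming train_input_fn and eval_input_fn are defined elsewhere:
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, steps=100)
tf.estimator.train_and_evaluate(classifier, train_spec, eval_spec)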

MobileNet is not usable when is_training is set to false

The more accurate description of this issue is that MobileNet behaves badly when is_training is not explicitly set to True.
And I'm referring to the MobileNet that is provided by TensorFlow in their model repository https://github.com/tensorflow/models/blob/master/slim/nets/mobilenet_v1.py.
This is how I create the net (phase_train=True):
with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=phase_train)):
    features, endpoints = mobilenet_v1.mobilenet_v1(
        inputs=images_placeholder, features_layer_size=features_layer_size,
        dropout_keep_prob=dropout_keep_prob, is_training=phase_train)
I'm training a recognition network and while training I test on LFW. The results that I get during the training are improving over time and getting a good accuracy.
Before deployment I freeze the graph. If I freeze the graph with is_training=True, the results I get on LFW are the same as during training.
But if I set is_training=False, I get results as if the network hadn't been trained at all...
This behavior actually happens with other networks like Inception.
I tend to believe that I miss something very fundamental here and that this is not a bug in TensorFlow...
Any help would be appreciated.
Adding more code...
This is how I prepare for training:
images_placeholder = tf.placeholder(tf.float32, shape=(None, image_size, image_size, 1), name='input')
labels_placeholder = tf.placeholder(tf.int32, shape=(None))
dropout_placeholder = tf.placeholder_with_default(1.0, shape=(), name='dropout_keep_prob')
phase_train_placeholder = tf.Variable(True, name='phase_train')
global_step = tf.Variable(0, name='global_step', trainable=False)
# build graph
with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=phase_train_placeholder)):
    features, endpoints = mobilenet_v1.mobilenet_v1(
        inputs=images_placeholder, features_layer_size=512, dropout_keep_prob=1.0,
        is_training=phase_train_placeholder)
# loss
logits = slim.fully_connected(inputs=features, num_outputs=train_data.get_class_count(), activation_fn=None,
                              weights_initializer=tf.truncated_normal_initializer(stddev=0.1),
                              weights_regularizer=slim.l2_regularizer(scale=0.00005),
                              scope='Logits', reuse=False)
tf.losses.sparse_softmax_cross_entropy(labels=labels_placeholder, logits=logits,
                                       reduction=tf.losses.Reduction.MEAN)
loss = tf.losses.get_total_loss()
# normalize output for inference
embeddings = tf.nn.l2_normalize(features, 1, 1e-10, name='embeddings')
# optimizer
optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss, global_step=global_step)
This is my train step:
batch_data, batch_labels = train_data.next_batch()
feed_dict = {
    images_placeholder: batch_data,
    labels_placeholder: batch_labels,
    dropout_placeholder: dropout_keep_prob
}
_, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)
I could add the code for how I freeze the graph, but it's not really necessary. It's enough to build the graph with is_training=False, load the latest checkpoint and run the evaluation on LFW to reproduce the problem.
Update...
I found that the problem is in the batch normalization layer. It's enough to set this layer to is_training=False to reproduce the problem.
References that I found after discovering this:
http://ruishu.io/2016/12/27/batchnorm/
https://github.com/tensorflow/tensorflow/issues/10118
Batch Normalization - Tensorflow
Will update with a solution once I have a tested one.
So I found a solution.
Mainly using this reference: http://ruishu.io/2016/12/27/batchnorm/
From the link:
Note: When is_training is True the moving_mean and moving_variance need to be updated, by default the update_ops are placed in tf.GraphKeys.UPDATE_OPS so they need to be added as a dependency to the train_op, example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if update_ops:
    updates = tf.group(*update_ops)
    total_loss = control_flow_ops.with_dependencies([updates], total_loss)
And to be straight to the point,
instead of creating the optimizer like so:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(total_loss, global_step=global_step)
Do it like this:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    train_op = optimizer.minimize(total_loss, global_step=global_step)
That will solve the issue.
is_training should not have this effect. I would need to see more of your code to understand what is happening, but odds are the variable names are not matching when you set is_training to False, probably because of a variable scope reuse issue.

Batch Normalization in a Custom Estimator in Tensorflow

I'm referring to a note at tf.layers.batch_normalization:
Note: when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
How would one implement this in a custom Estimator? For example, looking at this example on TensorFlow's website: The complete abalone model_fn
In the following issue, at the very bottom, there is an example:
https://github.com/tensorflow/tensorflow/issues/16455
if mode == tf.estimator.ModeKeys.TRAIN:
    lr = 0.001
    optimizer = tf.train.RMSPropOptimizer(learning_rate=lr, decay=0.9)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode,
                                      loss=loss,
                                      train_op=train_op)
I guess you can pass the train_op defined this way to the train_op parameter of the EstimatorSpec.

Compute gradient norm of each part of composite loss function

Assume I have the following loss function:
loss_a = tf.reduce_mean(my_loss_fn(model_output, targets))
loss_b = tf.reduce_mean(my_other_loss_fn(model_output, targets))
loss_final = loss_a + tf.multiply(alpha, loss_b)
To visualize the norm of the gradients w.r.t to loss_final one could do this:
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
grads_and_vars = optimizer.compute_gradients(loss_final)
grads, _ = list(zip(*grads_and_vars))
norms = tf.global_norm(grads)
gradnorm_s = tf.summary.scalar('gradient norm', norms)
train_op = optimizer.apply_gradients(grads_and_vars, name='train_op')
However, I would like to plot the norm of the gradients w.r.t to loss_a and loss_b separately. How can I do this in the most efficient way? Do I have to call compute_gradients(..) on both loss_a and loss_b separately and then add those two gradients together before passing them to optimizer.apply_gradients(..)? I know that this would mathematically be correct due to the summation rule, but it just seems a bit cumbersome and I also don't know how you would implement the summation of the gradients correctly. Also, loss_final is rather simple, because it's just a summation. What if loss_final was more complicated, e.g. a division?
I'm using Tensorflow 0.12.
You are right that combining gradients could get messy. Instead just compute the gradients of each of the losses as well as the final loss. Because tensorflow optimizes the directed acyclic graph (DAG) before compilation, this doesn't result in duplication of work.
For example:
import tensorflow as tf

with tf.name_scope('inputs'):
    W = tf.Variable(dtype=tf.float32, initial_value=tf.random_normal((4, 1), dtype=tf.float32), name='W')
    x = tf.random_uniform((6, 4), dtype=tf.float32, name='x')

with tf.name_scope('outputs'):
    y = tf.matmul(x, W, name='y')

def my_loss_fn(output, targets, name):
    return tf.reduce_mean(tf.abs(output - targets), name=name)

def my_other_loss_fn(output, targets, name):
    return tf.sqrt(tf.reduce_mean((output - targets) ** 2), name=name)

def get_tensors(loss_fn):
    loss = loss_fn(y, targets, 'loss')
    grads = tf.gradients(loss, W, name='gradients')
    norm = tf.norm(grads, name='norm')
    return loss, grads, norm

targets = tf.random_uniform((6, 1))
with tf.name_scope('a'):
    loss_a, grads_a, norm_a = get_tensors(my_loss_fn)
with tf.name_scope('b'):
    loss_b, grads_b, norm_b = get_tensors(my_other_loss_fn)

with tf.name_scope('combined'):
    loss = tf.add(loss_a, loss_b, name='loss')
    grad = tf.gradients(loss, W, name='gradients')

with tf.Session() as sess:
    tf.global_variables_initializer().run(session=sess)
    writer = tf.summary.FileWriter('./tensorboard_results', sess.graph)
    res = sess.run([norm_a, norm_b, grad])
    print(*res, sep='\n')
Edit: In response to your comment... You can check the DAG of a tensorflow model using TensorBoard; I've updated the code above to store the graph.
Run tensorboard --logdir $PWD/tensorboard_results in a terminal and navigate to the URL printed on the command line (typically http://localhost:6006/). Then click on the GRAPH tab to view the DAG. You can recursively expand the tensors, ops and name scopes to see sub-graphs and the individual operations with their inputs.

tensorflow momentum part of variable not initialized?

TensorFlow is telling me that the momentum part of a variable is uninitialized when I use the momentum optimizer. When I use the GradientDescent optimizer things work fine.
Here is a relevant part of the stack trace:
tensorflow.python.framework.errors.FailedPreconditionError: Attempting to use uninitialized value fc3/biases/Momentum
[[Node: Momentum/update_fc3/biases/ApplyMomentum = ApplyMomentum[T=DT_FLOAT, _class=["loc:#fc3/biases"], use_locking=false, _device="/job:localhost/replica:0/task:0/cpu:0"](fc3/biases, fc3/biases/Momentum, Momentum/learning_rate, gradients/fc3/logits_grad/tuple/control_dependency_1, Momentum/momentum)]]
Caused by op u'Momentum/update_fc3/biases/ApplyMomentum', defined at:
...
train_op = vgg.optimizer.minimize(vgg.loss, global_step=vgg.global_step)
I think the code is correct; it defines the ops for all layers before the initialize-all-variables op, etc. If it didn't, the GradientDescent optimizer wouldn't work either, right?
Following up on etarion's comment, here is a sketch of the code. It starts with
def train(args):
    datareader = ...  # object to read data - no tensorflow code/import
    with tf.Graph().as_default():
        with_graph(datareader, args)
then with_graph does
def with_graph(datareader, args):
    num_outputs = datareader.num_outputs()
    img_orig = tf.placeholder(tf.float32, shape=datareader.features_placeholder_shape())
    img_vgg16 = preprocess.imgbatch_2_vgg16(imgs=img_orig, channel_mean=8.46)
    labels_placeholder = tf.placeholder(tf.float32, shape=(None, num_outputs))
    vgg = vgg16(imgs=img_vgg16, weights=None, sess=None, trainable=args.trainable, stop_at_fc2=args.fc2)
    add_loss(vgg, labels_placeholder, num_outputs, args)
    add_optimizer(vgg, args)
    sess = tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=12))
    init = tf.initialize_all_variables()
    sess.run(init)

    validation_imgs_orig, validation_labels = datareader.get_validation_set()
    validation_imgs_vgg16 = sess.run(img_vgg16, {img_orig: validation_imgs_orig})
    validation_feed_dict = {img_vgg16: validation_imgs_vgg16,
                            labels_placeholder: validation_labels}

    train_op = vgg.optimizer.minimize(vgg.loss, global_step=vgg.global_step)

    print("Starting training.")
    sys.stdout.flush()
    for step_number in range(3):
        t0 = time.time()
        train_imgs, train_labels = datareader.get_next_minibatch()
        train_feed_dict = {img_orig: train_imgs,
                           labels_placeholder: train_labels}
        sess.run(train_op, feed_dict=train_feed_dict)
        print("step %3d took %.2f sec." % (step_number, time.time() - t0))
        sys.stdout.flush()
The gradient descent optimizer does not have internal variables, but the momentum one does. Somehow you don't initialize the state of the momentum optimizer (I can't tell why exactly without the code). Ways to do that are initializing all variables right before you run the graph (after you have added the optimizer to the graph), or, if you want to be explicit about what you initialize, using the get_slot_names()/get_slot() methods of the optimizer to get the Variables that make up the optimizer's internal state.
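A minimal sketch of the explicit variant (names follow the question's code; on older TensorFlow versions tf.variables_initializer may be called tf.initialize_variables):
momentum_optimizer = vgg.optimizer
# collect the optimizer's slot variables (e.g. the per-variable Momentum accumulators)
slot_vars = [momentum_optimizer.get_slot(var, name)
             for name in momentum_optimizer.get_slot_names()
             for var in tf.trainable_variables()
             if momentum_optimizer.get_slot(var, name) is not None]
sess.run(tf.variables_initializer(slot_vars))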
The problem was that I was defining the training op (via the optimizer's minimize function) after initializing all the variables; once I moved that call in front of initialize_all_variables it worked.
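In terms of the snippet above, a sketch of the corrected ordering:
train_op = vgg.optimizer.minimize(vgg.loss, global_step=vgg.global_step)  # creates the Momentum slot variables
init = tf.initialize_all_variables()  # now also covers the optimizer's internal state
sess.run(init)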