MobileNet is not usable when is_training is set to false - tensorflow

A more accurate description of this issue is that MobileNet behaves badly when is_training is not explicitly set to true.
I'm referring to the MobileNet provided by TensorFlow in their models repository https://github.com/tensorflow/models/blob/master/slim/nets/mobilenet_v1.py.
This is how I create the net (phase_train=True):
with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=phase_train)):
    features, endpoints = mobilenet_v1.mobilenet_v1(
        inputs=images_placeholder, features_layer_size=features_layer_size,
        dropout_keep_prob=dropout_keep_prob, is_training=phase_train)
I'm training a recognition network, and while training I test on LFW. The results I get during training improve over time and reach a good accuracy.
Before deployment I freeze the graph. If I freeze the graph with is_training=True, the results I get on LFW are the same as during training.
But if I set is_training=False, I get results as if the network hadn't been trained at all...
This behavior also happens with other networks, such as Inception.
I tend to believe that I'm missing something very fundamental here and that this is not a bug in TensorFlow...
Any help would be appreciated.
Adding more code...
This is how I prepare for training:
images_placeholder = tf.placeholder(tf.float32, shape=(None, image_size, image_size, 1), name='input')
labels_placeholder = tf.placeholder(tf.int32, shape=(None))
dropout_placeholder = tf.placeholder_with_default(1.0, shape=(), name='dropout_keep_prob')
phase_train_placeholder = tf.Variable(True, name='phase_train')
global_step = tf.Variable(0, name='global_step', trainable=False)
# build graph
with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=phase_train_placeholder)):
    features, endpoints = mobilenet_v1.mobilenet_v1(
        inputs=images_placeholder, features_layer_size=512, dropout_keep_prob=1.0,
        is_training=phase_train_placeholder)
# loss
logits = slim.fully_connected(inputs=features, num_outputs=train_data.get_class_count(), activation_fn=None,
                              weights_initializer=tf.truncated_normal_initializer(stddev=0.1),
                              weights_regularizer=slim.l2_regularizer(scale=0.00005),
                              scope='Logits', reuse=False)
tf.losses.sparse_softmax_cross_entropy(labels=labels_placeholder, logits=logits,
                                       reduction=tf.losses.Reduction.MEAN)
loss = tf.losses.get_total_loss()
# normalize output for inference
embeddings = tf.nn.l2_normalize(features, 1, 1e-10, name='embeddings')
# optimizer
optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss, global_step=global_step)
This is my train step:
batch_data, batch_labels = train_data.next_batch()
feed_dict = {
    images_placeholder: batch_data,
    labels_placeholder: batch_labels,
    dropout_placeholder: dropout_keep_prob
}
_, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)
I could add the code for how I freeze the graph, but it's not really necessary. It's enough to build the graph with is_training=False, load the latest checkpoint, and run the evaluation on LFW to reproduce the problem.
Update...
I found that the problem is in the batch normalization layer. It's enough to set just this layer to is_training=False to reproduce the problem.
References that I found after discovering this:
http://ruishu.io/2016/12/27/batchnorm/
https://github.com/tensorflow/tensorflow/issues/10118
Batch Normalization - Tensorflow
Will update with a solution once I have a tested one.

So I found a solution.
Mainly using this reference: http://ruishu.io/2016/12/27/batchnorm/
From the link:
Note: When is_training is True the moving_mean and moving_variance need to be updated, by default the update_ops are placed in tf.GraphKeys.UPDATE_OPS so they need to be added as a dependency to the train_op, example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if update_ops:
    updates = tf.group(*update_ops)
    total_loss = control_flow_ops.with_dependencies([updates], total_loss)
To get straight to the point:
instead of creating the optimizer like this:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(total_loss, global_step=global_step)
Do it like this:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    train_op = optimizer.minimize(total_loss, global_step=global_step)
That will solve the issue.
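For completeness, here is a minimal, self-contained sketch (TF 1.x layers API, not the original MobileNet code; all names and shapes are illustrative) showing the full pattern: a batch-normalized model, the UPDATE_OPS dependency on the train op, and an is_training placeholder that is flipped off at inference:
import numpy as np
import tensorflow as tf

# Toy model: conv -> batch norm -> global average pool -> dense
x = tf.placeholder(tf.float32, [None, 8, 8, 1], name='x')
y = tf.placeholder(tf.int32, [None], name='y')
is_training = tf.placeholder_with_default(False, shape=(), name='is_training')

net = tf.layers.conv2d(x, 4, 3, padding='same', activation=tf.nn.relu)
net = tf.layers.batch_normalization(net, training=is_training)
net = tf.reduce_mean(net, axis=[1, 2])
logits = tf.layers.dense(net, 2)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

# The batch-norm moving averages are updated only because UPDATE_OPS is made
# a dependency of the train op.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    data = np.random.rand(16, 8, 8, 1).astype(np.float32)
    labels = np.random.randint(0, 2, size=16).astype(np.int32)
    sess.run(train_op, {x: data, y: labels, is_training: True})  # train mode
    sess.run(loss, {x: data, y: labels})                          # inference mode
With this dependency in place, freezing the graph with is_training=False uses moving statistics that have actually been updated during training instead of their initial values.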

is_training should not have this effect. I'd need to see more of your code to understand what is happening, but odds are the variable names are not matching when you set is_training to False, probably because of a variable scope reuse issue.

Related

How to add more details in TensorBoard using the Estimator API

I built my model following https://www.tensorflow.org/tutorials/estimators/cnn.
I added SummarySaverHook to my model
summary_hook = tf.train.SummarySaverHook(
    100,
    output_dir='C:/Users/dir',
    summary_op=tf.summary.merge_all())
# Configure the Training Op (for TRAIN mode)
if mode == tf.estimator.ModeKeys.TRAIN:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    train_op = optimizer.minimize(
        loss=loss,
        global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op, training_hooks=[summary_hook])
But when I run it, I only get an enqueue_input chart (I don't know what that is) and the model graph. I want accuracy and loss charts.
So I want a couple more details in my TensorBoard:
Loss and accuracy charts.
Is it possible to get an accuracy chart over time? With the Estimator I only get the accuracy after the final step.
Can I get more details in TensorBoard, like wrongly predicted images, without creating a Session and Graph myself, only from the Estimator API?
First of all, you don't need to use a summary_hook. You just need to specify the desired metrics with tf.metrics right after you define the logits.
logits = tf.layers.dense(inputs=dropout, units=10)
predictions = {
    "classes": tf.argmax(input=logits, axis=1),
    "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
}
accuracy = tf.metrics.accuracy(labels=labels, predictions=predictions['classes'])
tf.summary.scalar('acc', accuracy[1])
And put this
tf.logging.set_verbosity(tf.logging.INFO)
right after your inputs, if you haven't done so.
You can plot evaluation metrics by passing an eval_metric_ops = {'accuracy': accuracy} dict to tf.estimator.EstimatorSpec, as shown below.
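For example, a hedged sketch of the EVAL branch of a model_fn, assuming loss, labels and predictions are defined as in the snippet above:
if mode == tf.estimator.ModeKeys.EVAL:
    # Metrics passed via eval_metric_ops show up in TensorBoard and in the
    # dict returned by estimator.evaluate()
    accuracy = tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops={"accuracy": accuracy})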
You can use tf.summary for visualizing images, weights and biases, etc.

Tensorflow: Using Batch Normalization gives poor (erratic) validation loss and accuracy

I am trying to use Batch Normalization using tf.layers.batch_normalization() and my code looks like this:
def create_conv_exp_model(fingerprint_input, model_settings, is_training):
    # Dropout placeholder
    if is_training:
        dropout_prob = tf.placeholder(tf.float32, name='dropout_prob')
    # Mode placeholder
    mode_placeholder = tf.placeholder(tf.bool, name="mode_placeholder")
    he_init = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG")
    # Input Layer
    input_frequency_size = model_settings['bins']
    input_time_size = model_settings['spectrogram_length']
    net = tf.reshape(fingerprint_input,
                     [-1, input_time_size, input_frequency_size, 1],
                     name="reshape")
    net = tf.layers.batch_normalization(net,
                                        training=mode_placeholder,
                                        name='bn_0')
    for i in range(1, 6):
        net = tf.layers.conv2d(inputs=net,
                               filters=8*(2**i),
                               kernel_size=[5, 5],
                               padding='same',
                               kernel_initializer=he_init,
                               name="conv_%d"%i)
        net = tf.layers.batch_normalization(net,
                                            training=mode_placeholder,
                                            name='bn_%d'%i)
        with tf.name_scope("relu_%d"%i):
            net = tf.nn.relu(net)
        net = tf.layers.max_pooling2d(net, [2, 2], [2, 2], 'SAME',
                                      name="maxpool_%d"%i)
    net_shape = net.get_shape().as_list()
    net_height = net_shape[1]
    net_width = net_shape[2]
    net = tf.layers.conv2d(inputs=net,
                           filters=1024,
                           kernel_size=[net_height, net_width],
                           strides=(net_height, net_width),
                           padding='same',
                           kernel_initializer=he_init,
                           name="conv_f")
    net = tf.layers.batch_normalization(net,
                                        training=mode_placeholder,
                                        name='bn_f')
    with tf.name_scope("relu_f"):
        net = tf.nn.relu(net)
    net = tf.layers.conv2d(inputs=net,
                           filters=model_settings['label_count'],
                           kernel_size=[1, 1],
                           padding='same',
                           kernel_initializer=he_init,
                           name="conv_l")
    ### Squeeze
    squeezed = tf.squeeze(net, axis=[1, 2], name="squeezed")
    if is_training:
        return squeezed, dropout_prob, mode_placeholder
    else:
        return squeezed, mode_placeholder
And my train step looks like this:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_input)
    gvs = optimizer.compute_gradients(cross_entropy_mean)
    capped_gvs = [(tf.clip_by_value(grad, -2., 2.), var) for grad, var in gvs]
    train_step = optimizer.apply_gradients(gvs)
During training, I am feeding the graph with:
train_summary, train_accuracy, cross_entropy_value, _, _ = sess.run(
    [
        merged_summaries, evaluation_step, cross_entropy_mean, train_step,
        increment_global_step
    ],
    feed_dict={
        fingerprint_input: train_fingerprints,
        ground_truth_input: train_ground_truth,
        learning_rate_input: learning_rate_value,
        dropout_prob: 0.5,
        mode_placeholder: True
    })
During validation,
validation_summary, validation_accuracy, conf_matrix = sess.run(
    [merged_summaries, evaluation_step, confusion_matrix],
    feed_dict={
        fingerprint_input: validation_fingerprints,
        ground_truth_input: validation_ground_truth,
        dropout_prob: 1.0,
        mode_placeholder: False
    })
My loss and accuracy curves (orange is training, blue is validation):
Plot of loss vs. number of iterations
Plot of accuracy vs. number of iterations
The validation loss (and accuracy) seems very erratic. Is my implementation of Batch Normalization wrong? Or is this normal with Batch Normalization, and should I just wait for more iterations?
You need to pass the training flag to tf.layers.batch_normalization(..., training=is_training); otherwise it normalizes the inference minibatches using the minibatch statistics instead of the moving (population) statistics, which is wrong.
There are mainly two things to check.
1. Are you sure that you are using batch normalization (BN) correctly in the train op?
If you read the layer documentation:
Note: when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. Also, be sure to add any batch_normalization ops before getting the update_ops collection. Otherwise, update_ops will be empty, and training/inference will not work properly.
For example:
x_norm = tf.layers.batch_normalization(x, training=training)
# ...
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
2. Otherwise, try lowering the "momentum" in the BN.
During training, BN keeps two moving averages, of the mean and of the variance, that are supposed to approximate the population statistics. The moving mean and variance are initialized to 0 and 1 respectively; at each step they are multiplied by the momentum value (default 0.99) and the new batch value times (1 - momentum) is added. At inference (test) time, the normalization uses these moving statistics. For this reason, it takes these values a while to converge to the "real" mean and variance of the data.
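For example, with the tf.layers API (the value 0.9 here is purely illustrative; the default is 0.99):
net = tf.layers.batch_normalization(net, training=mode_placeholder, momentum=0.9)
A lower momentum makes the moving statistics track the data faster, at the cost of noisier estimates.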
Source:
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
https://github.com/keras-team/keras/issues/7265
https://github.com/keras-team/keras/issues/3366
The original BN paper can be found here:
https://arxiv.org/abs/1502.03167
I also observed oscillations in validation loss when adding batch norm before ReLU. We found that moving the batch norm after the ReLU resolved the issue.
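For reference, the reordering would look roughly like this (a sketch reusing the layer names from the question only for illustration):
net = tf.layers.conv2d(inputs=net, filters=16, kernel_size=[5, 5],
                       padding='same', kernel_initializer=he_init, name="conv_1")
net = tf.nn.relu(net)                                                              # activation first
net = tf.layers.batch_normalization(net, training=mode_placeholder, name='bn_1')   # then batch norm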

How can I get the global_step in a MonitoredTrainingSession?

I am running a distributed MNIST model in distributed TensorFlow. I would like to "manually" monitor the evolution of the global_step for debugging purposes. What is the best and cleanest way to get the global step in a distributed TensorFlow setting?
My code is below:
...
with tf.device(device):
    images = tf.placeholder(tf.float32, [None, 784], name='image_input')
    labels = tf.placeholder(tf.float32, [None], name='label_input')
    data = read_data_sets(FLAGS.data_dir,
                          one_hot=False,
                          fake_data=False)
    logits = mnist.inference(images, FLAGS.hidden1, FLAGS.hidden2)
    loss = mnist.loss(logits, labels)
    loss = tf.Print(loss, [loss], message="Loss = ")
    train_op = mnist.training(loss, FLAGS.learning_rate)

hooks = [tf.train.StopAtStepHook(last_step=FLAGS.nb_steps)]
with tf.train.MonitoredTrainingSession(
        master=target,
        is_chief=(FLAGS.task_index == 0),
        checkpoint_dir=FLAGS.log_dir,
        hooks=hooks) as sess:
    while not sess.should_stop():
        xs, ys = data.train.next_batch(FLAGS.batch_size, fake_data=False)
        sess.run([train_op], feed_dict={images: xs, labels: ys})
        global_step_value = # ... what is the clean way to get this variable
Normally, good practice is to create your global step variable when you define the graph, e.g. global_step = tf.Variable(0, trainable=False, name='global_step'). Then you can use graph.get_tensor_by_name("global_step:0") to get your global step easily.
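A hedged sketch of how that would look in the training loop from the question, assuming the global step tensor was created with the name 'global_step' (for example inside mnist.training()):
# Look the tensor up once, before entering the session
global_step_tensor = tf.get_default_graph().get_tensor_by_name("global_step:0")

with tf.train.MonitoredTrainingSession(master=target,
                                       is_chief=(FLAGS.task_index == 0),
                                       checkpoint_dir=FLAGS.log_dir,
                                       hooks=hooks) as sess:
    while not sess.should_stop():
        xs, ys = data.train.next_batch(FLAGS.batch_size, fake_data=False)
        # Fetch the step value in the same run call as the training op
        _, global_step_value = sess.run([train_op, global_step_tensor],
                                        feed_dict={images: xs, labels: ys})
tf.train.get_global_step() would also return the same tensor, provided the variable was added to the global-step collection.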

tensorflow RNN implementation

I'm building an RNN model to do image classification. I use an input pipeline to feed in the data. However, it returns:
ValueError: Variable rnn/rnn/basic_rnn_cell/weights already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:
I wonder what I can do to fix this, since there are not many examples of implementing an RNN with an input pipeline. I know it would work if I used a placeholder, but my data is already in the form of tensors. Unless I can feed the placeholder with tensors, I'd prefer to just use the pipeline.
def RNN(inputs):
    with tf.variable_scope('cells', reuse=True):
        basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=batch_size)
    with tf.variable_scope('rnn'):
        outputs, states = tf.nn.dynamic_rnn(basic_cell, inputs, dtype=tf.float32)
    fc_drop = tf.nn.dropout(states, keep_prob)
    logits = tf.contrib.layers.fully_connected(fc_drop, batch_size, activation_fn=None)
    return logits
#Training
with tf.name_scope("cost_function") as scope:
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=train_label_batch, logits=RNN(train_batch)))
    train_step = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(cost)
#Accuracy
with tf.name_scope("accuracy") as scope:
    correct_prediction = tf.equal(tf.argmax(RNN(test_image), 1), tf.argmax(test_image_label, 0))
    accuracy = tf.cast(correct_prediction, tf.float32)
You need to use the reuse option correctly. The following changes would solve it. For prediction, you need to reuse the variables that already exist in the graph.
def RNN(inputs, reuse):
    with tf.variable_scope('cells', reuse=reuse):
        basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=batch_size, reuse=reuse)
    ...
...
#Training
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=train_label_batch, logits=RNN(train_batch, reuse=None)))
#Accuracy
...
correct_prediction = tf.equal(tf.argmax(RNN(test_image, reuse=True), 1), tf.argmax(test_image_label, 0))
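A minimal, self-contained sketch of this sharing pattern (the scope name and sizes are illustrative, not from the original code):
import tensorflow as tf

def rnn_logits(inputs, reuse):
    # All variables live under one scope; the second call reuses them instead
    # of trying to create them again.
    with tf.variable_scope('rnn_model', reuse=reuse):
        cell = tf.contrib.rnn.BasicRNNCell(num_units=32)
        outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
        return tf.contrib.layers.fully_connected(state, 10, activation_fn=None)

train_inputs = tf.random_normal([4, 7, 16])   # batch, time steps, features
test_inputs = tf.random_normal([4, 7, 16])

train_logits = rnn_logits(train_inputs, reuse=None)   # first call creates the variables
test_logits = rnn_logits(test_inputs, reuse=True)     # second call reuses them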

How to restore training from an Inception-v3 checkpoint with different trainable variables

I have the pretty common use case of freezing the bottom layers of Inception and training only the first two layers, after which I lower the learning rate and fine-tune the entire Inception model.
Here is my code for running the first part:
train_dir='/home/ubuntu/pynb/TF play/log-inceptionv3flowers'
with tf.Graph().as_default():
    tf.logging.set_verbosity(tf.logging.INFO)
    dataset = get_dataset()
    images, _, labels = load_batch(dataset, batch_size=32)
    # Create the model, use the default arg scope to configure the batch norm parameters.
    with slim.arg_scope(inception.inception_v3_arg_scope()):
        logits, _ = inception.inception_v3(images, num_classes=5, is_training=True)
    # Specify the loss function:
    one_hot_labels = slim.one_hot_encoding(labels, 5)
    tf.losses.softmax_cross_entropy(one_hot_labels, logits)
    total_loss = tf.losses.get_total_loss()
    # Create some summaries to visualize the training process:
    tf.summary.scalar('losses/Total Loss', total_loss)
    # Specify the optimizer and create the train op:
    optimizer = tf.train.RMSPropOptimizer(0.001, 0.9,
                                          momentum=0.9, epsilon=1.0)
    train_op = slim.learning.create_train_op(total_loss, optimizer, variables_to_train=get_variables_to_train())
    # Run the training:
    final_loss = slim.learning.train(
        train_op,
        logdir=train_dir,
        init_fn=get_init_fn(),
        number_of_steps=4500,
        save_summaries_secs=30,
        save_interval_secs=30,
        session_config=tf.ConfigProto(gpu_options=gpu_options))
    print('Finished training. Last batch loss %f' % final_loss)
which runs properly. Then here is my code for running the second part:
train_dir='/home/ubuntu/pynb/TF play/log-inceptionv3flowers'
with tf.Graph().as_default():
    tf.logging.set_verbosity(tf.logging.INFO)
    dataset = get_dataset()
    images, _, labels = load_batch(dataset, batch_size=32)
    # Create the model, use the default arg scope to configure the batch norm parameters.
    with slim.arg_scope(inception.inception_v3_arg_scope()):
        logits, _ = inception.inception_v3(images, num_classes=5, is_training=True)
    # Specify the loss function:
    one_hot_labels = slim.one_hot_encoding(labels, 5)
    tf.losses.softmax_cross_entropy(one_hot_labels, logits)
    total_loss = tf.losses.get_total_loss()
    # Create some summaries to visualize the training process:
    tf.summary.scalar('losses/Total Loss', total_loss)
    # Specify the optimizer and create the train op:
    optimizer = tf.train.RMSPropOptimizer(0.0001, 0.9,
                                          momentum=0.9, epsilon=1.0)
    train_op = slim.learning.create_train_op(total_loss, optimizer)
    # Run the training:
    final_loss = slim.learning.train(
        train_op,
        logdir=train_dir,
        init_fn=get_init_fn(),
        number_of_steps=10000,
        save_summaries_secs=30,
        save_interval_secs=30,
        session_config=tf.ConfigProto(gpu_options=gpu_options))
    print('Finished training. Last batch loss %f' % final_loss)
Notice that in the second part, I do not pass anything to create_train_op's variables_to_train parameter. This error is then shown:
NotFoundError (see above for traceback): Key InceptionV3/Conv2d_4a_3x3/BatchNorm/beta/RMSProp not found in checkpoint
[[Node: save_1/RestoreV2_49 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2_49/tensor_names, save_1/RestoreV2_49/shape_and_slices)]]
[[Node: save_1/Assign_774/_1550 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_2911_save_1/Assign_774", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
I suspect that it's looking for the RMSProp variables for the InceptionV3/Conv2d_4a_3x3 layer, which don't exist because I didn't train that layer in the previous run. I'm not sure how to achieve what I want, as I can see no examples in the documentation of how to do this.
TF Slim has support for reading from a checkpoint whose variable names do not match, described here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/learning.py#L146
You can specify how the variable names in a checkpoint map to the variables in your model.
I hope that helps!
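For example, a hedged sketch of an init_fn that restores only the variables present in the first-stage checkpoint and skips the missing RMSProp slot variables (the checkpoint filename is illustrative):
# Restore every model variable found in the checkpoint and silently skip the
# optimizer slot variables (e.g. .../RMSProp) that the checkpoint doesn't contain.
variables_to_restore = slim.get_variables_to_restore(exclude=['global_step'])
init_fn = slim.assign_from_checkpoint_fn(
    train_dir + '/model.ckpt-4500',   # illustrative checkpoint path
    variables_to_restore,
    ignore_missing_vars=True)
One common pattern is to point the second stage at a fresh logdir and pass this as init_fn to slim.learning.train, so the model weights come from the first run while the new RMSProp slot variables start from their default initializers.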