tensorflow batch normalization doesn't work as expected when is_training flag is False - tensorflow

I have a model in which I perform batch normalization after every convolutional layer except the last one. I use the tensorflow.contrib.layers.batch_norm function to do this. When I set the is_training flag to True, the reported loss value seems correct: for my particular example, it starts in the 60s and decreases to almost 0. When I set the is_training flag to False, I get a loss value on the order of 1e10, which seems absurd.
I have attached the snippet I use in my code.
with tf.control_dependencies(update_ops):
    training(train_output,train_input,sess) # is_training is true here
    validate(test_output,train_input,sess) # is_training is false here
What could be the reason?


Better understanding of training parameter for Keras-Model call method needed

I'd like to get a better understanding of the parameter training, when calling a Keras model.
In all tutorials (like here) it is explained, that when you are doing a custom train step, you should call the model like this (because some layers may behave differently depending if you want to do training or inference):
pred = model(x, training=True)
and when you want to do inference, you should set training to false:
pred = model(x, training=False)
What I am wondering now is how this is affected by the creation of a functional model. Assume I have 2 models, model_base and model_head, and I want to create a new model out of those two, where I want model_base to always be called with training=False (because I plan on freezing it like in this tutorial here):
inputs = keras.Input(shape=(150, 150, 3))
x = base_model(inputs, training=False)
outputs = head_model(x)
new_model = keras.Model(inputs, outputs)
What happens in such a case when I later call new_model(x_new, training=True)? Will the training=False for the base_model be overruled? Or will training now always be set to False for the base_model, regardless of what I pass to the new_model? If the latter is the case, does that also mean that if I set e.g. outputs = head_model(inputs, training=True), this part of the new model would always run in training mode? And how would it work out if I don't give any specific value for training when I run the new_model like this: new_model(x_new)?
Thanks in advance!
training is a boolean argument that determines whether the call function runs in training mode or inference mode. For example, the Dropout layer is primarily used as a regularizer during model training, randomly dropping units, but at inference or prediction time we don't want that to happen.
y = Dropout(0.5)(x, training=True)
Here we set training=True for the Dropout layer, as we would at training time. When we call .fit(), Keras sets this flag to True behind the scenes, and when we call evaluate() or predict() it sets the flag to False. The same applies to a custom training loop: when we pass the input tensor to the model inside the GradientTape scope, we can set this parameter ourselves. It does not have to be set manually, in which case Keras falls back to its default, but passing it explicitly makes the intended mode unambiguous. The same goes for inference time. In short, the training argument is set to True or False depending on whether we want the layers to operate in training mode or inference mode, respectively.
# training mode
with tf.GradientTape() as tape:
    logits = model(x, training=True) # forward pass
# inference mode
val_logits = model(x, training=False)
Now coming to your question. After defining the model
# Freeze the base_model
base_model.trainable = False
inputs = keras.Input(shape=(150, 150, 3))
x = base_model(inputs, training=False)
outputs = head_model(x)
new_model = keras.Model(inputs, outputs)
Now, whether you run this new model with .fit() or with a custom training loop, the base_model will always run in inference mode, because it was called with training=False when the model was built.
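To make the idea of a "pinned" training flag concrete, here is a minimal sketch in plain Python. The classes and the compose helper are toy stand-ins invented for illustration, not Keras code; they only mimic the behaviour described above, where a flag fixed while building the model is not overruled by the outer call, while a sub-model called without an explicit flag follows the outer call.

```python
# Toy stand-ins, NOT Keras classes: they only mimic how a call-time flag
# can be fixed when a model is composed out of sub-models.
class ToyLayer:
    def __init__(self, name):
        self.name = name
        self.seen = []  # records the training flag of every call

    def __call__(self, x, training=False):
        self.seen.append(training)
        return x


def compose(base, head):
    """Build a 'new model' whose base is pinned to training=False."""
    def new_model(x, training=False):
        x = base(x, training=False)         # fixed at composition time
        return head(x, training=training)   # follows the outer call
    return new_model


base, head = ToyLayer("base"), ToyLayer("head")
new_model = compose(base, head)
new_model(1.0, training=True)
new_model(1.0, training=False)
print("base saw:", base.seen)
print("head saw:", head.seen)
```

In this toy version the base only ever sees training=False, whatever is passed to new_model, while the head sees the outer value.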

the batch normalization layer does not work (tensorflow)

I implemented a network using tensorflow, and the loss does not converge. I then inspected some values in the network and found that the BN layer does not work. Please look at the following picture:
We can see that s2 is the result of batch normalization of s1, but the values in s2 are still very large. I don't know what the problem is. Why are the values in s2 so large?
I have uploaded my code to GitHub. Anyone who is interested can test it.
As per the official tensorflow documentation here,
when training, the moving_mean and moving_variance need to be updated.
By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so
they need to be executed alongside the train_op. Also, be sure to add
any batch_normalization ops before getting the update_ops collection.
Otherwise, update_ops will be empty, and training/inference will not
work properly.
For example:
training = tf.placeholder(tf.bool, name="is_training")
# ...
x_norm = tf.layers.batch_normalization(x, training=training)
# ...
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
train_op = optimizer.minimize(loss)
train_op = tf.group([train_op, update_ops])
# or, you can also do something like this:
# with tf.control_dependencies(update_ops):
#     train_op = optimizer.minimize(loss)
So it is really important to run the update ops as stated in the tensorflow documentation, because at training time the moving mean and the moving variance of the layer have to be updated. If you don't do this, batch normalization will not work and the network will not train as expected. It is also useful to declare a placeholder that tells the network whether it is in training or inference mode, because at test (or inference) time the mean and the variance are fixed: they are the moving averages estimated from the means and variances of the previous training batches.
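The effect of skipping the update ops can be sketched without TensorFlow. The snippet below is a simplified single-channel model of batch normalization in plain Python, not the library implementation; the batches, the momentum of 0.9, the epsilon and the initial moving statistics (mean 0, variance 1) are assumptions chosen for illustration.

```python
import math

def batch_stats(batch):
    """Mean and (biased) variance of one batch."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return mean, var

def normalize(batch, mean, var, eps=1e-3):
    return [(v - mean) / math.sqrt(var + eps) for v in batch]

# Hypothetical activations centred around 100.
batches = [[98.0, 100.0, 102.0], [99.0, 101.0, 103.0], [97.0, 100.0, 103.0]]
momentum = 0.9

def train(run_update_ops, steps=200):
    """Return the moving statistics after `steps` training steps."""
    moving_mean, moving_var = 0.0, 1.0  # assumed initial values
    for step in range(steps):
        mean, var = batch_stats(batches[step % len(batches)])
        if run_update_ops:  # what running UPDATE_OPS would do
            moving_mean = momentum * moving_mean + (1 - momentum) * mean
            moving_var = momentum * moving_var + (1 - momentum) * var
    return moving_mean, moving_var

# Inference always normalizes with the moving statistics.
test_batch = [98.0, 100.0, 102.0]
updated = normalize(test_batch, *train(run_update_ops=True))
stale = normalize(test_batch, *train(run_update_ops=False))
print("with updates:   ", [round(v, 2) for v in updated])
print("without updates:", [round(v, 2) for v in stale])
```

With the updates, the inference output stays close to zero mean and unit scale. Without them, the moving statistics keep their initial values, so the "normalized" output is nearly as large as the raw input, which is the kind of symptom described in the question.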

stop_gradient in tensorflow

I am wondering whether tf.stop_gradient stops the gradient computation of just a given op, or stops the update of its input tf.Variable. I have the following problem: during the forward pass in MNIST, I would like to perform a set of operations on the weights (let's say W to W*) and then do a matmul with the inputs. However, I would like to exclude these operations from the backward pass. I want only dE/dW computed during training with back propagation. The code I wrote prevents W from getting updated. Could you please help me understand why? If these were variables, I understand I should set their trainable property to false, but these are operations on weights. If stop_gradient cannot be used for this purpose, then how do I build two graphs, one for the forward pass and the other for back propagation?
def build_layer(inputs, fmap, nscope,layer_size1,layer_size2, faulty_training):
    with tf.name_scope(nscope):
        if (faulty_training):
            ## trainable weight
            weights_i = tf.Variable(tf.truncated_normal([layer_size1, layer_size2],stddev=1.0 / math.sqrt(float(layer_size1))),name='weights_i')
            ## Operations on weight whose gradient should not be computed during backpropagation
            weights_fx_t = tf.multiply(268435456.0,weights_i)
            weight_fx_t = tf.stop_gradient(weights_fx_t)
            weights_fx = tf.cast(weights_fx_t,tf.int32)
            weight_fx = tf.stop_gradient(weights_fx)
            weights_fx_fault = tf.bitwise.bitwise_xor(weights_fx,fmap)
            weight_fx_fault = tf.stop_gradient(weights_fx_fault)
            weights_fl = tf.cast(weights_fx_fault, tf.float32)
            weight_fl = tf.stop_gradient(weights_fl)
            weights = tf.stop_gradient(tf.multiply((1.0/268435456.0),weights_fl))
            ##### end transformation
        else:
            weights = tf.Variable(tf.truncated_normal([layer_size1, layer_size2],stddev=1.0 / math.sqrt(float(layer_size1))),name='weights')
        biases = tf.Variable(tf.zeros([layer_size2]), name='biases')
        hidden = tf.nn.relu(tf.matmul(inputs, weights) + biases)
        return weights,hidden
I am using the tensorflow gradient descent optimizer to do the training.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = optimizer.minimize(loss, global_step=global_step)
Stop gradient will prevent the backpropagation from continuing past that node in the graph. Your code doesn't have any path from weights_i to the loss except the one that goes through weights_fx_t, where the gradient is stopped. This is what is causing weights_i not to be updated during training. You don't need to put stop_gradient after every step. Using it just once will stop the backpropagation there.
If stop_gradient doesn't do what you want then you can get the gradients by doing tf.gradients and you can write your own update op by using tf.assign. This will allow you to alter the gradients however you want.
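As a rough illustration of the "compute the gradient yourself and write your own update" suggestion, here is a scalar sketch in plain Python rather than TensorFlow. It assumes a straight-through style rule, which is a choice made for this example and not something stated in the answer: the forward pass uses the transformed weight W*, and the gradient with respect to W* is applied to W as if the transformation were the identity. The transform function (rounding to steps of 1/4) is a hypothetical stand-in for the bit-level operations in the question.

```python
def transform(w):
    """Hypothetical non-differentiable forward-only op: round to 1/4 steps."""
    return round(w * 4) / 4

def train(w, x, y, learning_rate=0.1, steps=50):
    for _ in range(steps):
        w_star = transform(w)              # forward pass uses W*
        error = x * w_star - y             # prediction error
        grad_w_star = 2 * error * x        # dE/dW* for E = error ** 2
        w -= learning_rate * grad_w_star   # applied to W as if dW*/dW = 1
    return w

w = train(w=0.0, x=1.0, y=0.8)
print("trained W:", w, "forward W*:", transform(w))
```

Unlike the stop_gradient version, where no gradient reaches W at all, W moves here even though the transformation itself is excluded from the backward pass. In this sketch W* ends up on one of the grid points next to the target 0.8.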

significance of "trainable" and "training" flag in tf.layers.batch_normalization

What is the significance of "trainable" and "training" flag in tf.layers.batch_normalization? How are these two different during training and prediction?
The batch norm has two phases:
1. Training:
- Normalize layer activations using the mean and variance of the current batch, then scale and shift them with `gamma` and `beta`.
(`training` should be `True`.)
- Update the `moving_avg` and `moving_var` statistics.
(`trainable` should be `True`)
2. Inference:
- Normalize layer activations using `moving_avg`, `moving_var`, `beta` and `gamma`.
(`training` should be `False`)
Example code to illustrate a few cases:
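import numpy as np
import tensorflow as tf  # TF 1.x API

# random image
img = np.random.randint(0,10,(2,2,4)).astype(np.float32)
# batch norm params initialized
beta = np.ones((4)).astype(np.float32)*1 # all ones
gamma = np.ones((4)).astype(np.float32)*2 # all twos
moving_mean = np.zeros((4)).astype(np.float32) # all zeros
moving_var = np.ones((4)).astype(np.float32) # all ones
# Placeholder for input image
_input = tf.placeholder(tf.float32, shape=(1,2,2,4), name='input')
# batch norm, initialized with the parameters defined above
out = tf.layers.batch_normalization(
    _input,
    beta_initializer=tf.constant_initializer(beta),
    gamma_initializer=tf.constant_initializer(gamma),
    moving_mean_initializer=tf.constant_initializer(moving_mean),
    moving_variance_initializer=tf.constant_initializer(moving_var),
    training=False, trainable=False)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
init_op = tf.global_variables_initializer()
## 2. Run the graph in a session
with tf.Session() as sess:
    # init the variables
    sess.run(init_op)
    for i in range(2):
        ops, o = sess.run([update_ops, out], feed_dict={_input: np.expand_dims(img, 0)})
        print('beta', sess.run('batch_normalization/beta:0'))
        print('gamma', sess.run('batch_normalization/gamma:0'))
        print('moving_avg', sess.run('batch_normalization/moving_mean:0'))
        print('moving_variance', sess.run('batch_normalization/moving_variance:0'))
        print('out', np.round(o))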
When training=False and trainable=False:
img = [[[4., 5., 9., 0.]...
out = [[ 9. 11. 19. 1.]...
The activation is normalized with the initial `moving_avg` (0) and `moving_var` (1), so it is effectively just scaled and shifted by gamma and beta (out ≈ 2 * img + 1).
When training=True and trainable=False:
out = [[ 2. 2. 3. -1.] ...
The activation is normalized using the mean and variance of the current batch, then scaled and shifted by `gamma` and `beta`.
The averages are not updated.
When training=True and trainable=True:
The out is the same as above, but `moving_avg` and `moving_var` get updated to new values.
moving_avg [0.03249997 0.03499997 0.06499994 0.02749997]
moving_variance [1.0791667 1.1266665 1.0999999 1.0925]
This is quite complicated.
And in TF 2.0 the behavior has changed, see this:
About setting layer.trainable = False on a BatchNormalization layer:
The meaning of setting layer.trainable = False is to freeze the
layer, i.e. its internal state will not change during training:
its trainable weights will not be updated during fit() or
train_on_batch(), and its state updates will not be run. Usually,
this does not necessarily mean that the layer is run in inference
mode (which is normally controlled by the training argument that can
be passed when calling a layer). "Frozen state" and "inference mode"
are two separate concepts.
However, in the case of the BatchNormalization layer, setting
trainable = False on the layer means that the layer will be
subsequently run in inference mode (meaning that it will use the
moving mean and the moving variance to normalize the current batch,
rather than using the mean and variance of the current batch). This
behavior has been introduced in TensorFlow 2.0, in order to enable
layer.trainable = False to produce the most commonly expected
behavior in the convnet fine-tuning use case. Note that:
- This behavior only occurs as of TensorFlow 2.0. In 1.*, setting layer.trainable = False would freeze the layer but would not switch it to inference mode.
- Setting trainable on a model containing other layers will recursively set the trainable value of all inner layers.
- If the value of the trainable attribute is changed after calling compile() on a model, the new value doesn't take effect for this model until compile() is called again.
training controls whether to use the training-mode batchnorm (which uses statistics from this minibatch) or inference-mode batchnorm (which uses averaged statistics across the training data). trainable controls whether the variables created inside the batchnorm process are themselves trainable.
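The difference between the two modes can be checked by hand with a small pure-Python sketch. It is a simplified model of the arithmetic, not the tf.layers implementation. It reuses the parameters from the example above (gamma = 2, beta = 1, moving mean 0, moving variance 1) and assumes epsilon = 0.001. Treating the four values as a single batch in the training-mode call is an assumption made purely for illustration.

```python
import math

def batch_norm(values, gamma, beta, moving_mean, moving_var,
               training, eps=1e-3):
    """Simplified batch norm over a list of values sharing one set of parameters."""
    if training:
        # training mode: statistics of the current batch
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
    else:
        # inference mode: stored moving statistics
        mean, var = moving_mean, moving_var
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

values = [4.0, 5.0, 9.0, 0.0]  # first entries of img in the example above
params = dict(gamma=2.0, beta=1.0, moving_mean=0.0, moving_var=1.0)

inference_out = batch_norm(values, training=False, **params)
training_out = batch_norm(values, training=True, **params)
print("inference:", [round(v) for v in inference_out])
print("training: ", [round(v, 2) for v in training_out])
```

In inference mode the rounded output is 9, 11, 19, 1, matching the out values reported for training=False. In training mode the same values are centred on their own mean first, so the output mean equals beta.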

Tensorflow Batch Normalization: tf.contrib.layers.batch_norm

I've recently picked up Tensorflow and have been trying my best to adjust to the environment. It has been nothing but wonderful! However, batch normalization using tf.contrib.layers.batch_norm has been a little tricky.
Right now, here is the function I'm using:
def batch_norm(x, phase):
    return tf.contrib.layers.batch_norm(x,center = True, scale = True,
                                        is_training = phase, updates_collections = None)
Using this, I followed most documentation (also Q & A) that I've found online and it led me to the following conclusions:
1) is_training should be set to True for training and False for testing. This makes sense! When training, I had convergence (error < 1%, CIFAR-10 dataset).
However, during testing my results are terrible (error > 90%) UNLESS I add updates_collections = None as an argument to the batch norm function above. Only with that argument does testing give me the error I expected.
I am also sure to use the following for training:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops): # Ensures, Updating ops will perform before training
    with tf.name_scope('Cross_Entropy'):
        cross_entropy = tf.reduce_mean( # Implement Cross_Entropy to compute the softmax activation
            tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv)) # Cross Entropy: True Output Labels (y_), Softmax output (y_conv)
        tf.summary.scalar('cross_entropy', cross_entropy) # Graphical output Cross Entropy
    with tf.name_scope('train'):
        train_step = tf.train.AdamOptimizer(1e-2).minimize(cross_entropy) # Train Network, Tensorflow minimizes cross_entropy via ADAM Optimization
    with tf.name_scope('Train_Results'):
        with tf.name_scope('Correct_Prediction'):
            correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1)) # Check if prediction is wrong with tf.equal(CNN_result,True_result)
        with tf.name_scope('Accuracy'):
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) # Find the percent accuracy, take mean of correct_prediction outputs
            tf.summary.scalar('accuracy', accuracy) # Graphical output Classification Accuracy
This should make sure that the batch normalization parameters are updating during training.
So this leads me to believe that updates_collections = None is just a nice default for my batch normalization function which, during testing, makes sure that no batch normalization parameters are adjusted... Am I correct?
Lastly: is it normal to get good results (the expected error) during the testing phase with batch normalization turned on AND off? Using the batch norm function above, I was able to train well (is_training = True) and test well (is_training = False). However, during testing with is_training = True I was still able to get great results. This just gives me a bad feeling. Could someone explain why this is happening? Or whether it should be happening at all?
Thank you for your time!
Unstable decay rate (default 0.999) for moving averages might be the reason for reasonably good training performance but poor validation/test performance. Try a slightly lower decay rate (0.99 or 0.9). Also, try zero_debias_moving_mean=True for improved stability.
You can also try different batch sizes and see if validation performance increases. Large batch size can break validation performance when batch normalization is used. See this.
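To see why the decay rate suggested above matters, the following plain-Python sketch tracks how far an exponential moving average has moved towards a constant batch statistic after a fixed number of training steps. It is a simplification that ignores zero-debiasing. The step count of 100, the start value 0 and the constant batch value 1.0 are assumptions for illustration.

```python
def moving_average_after(steps, decay, batch_value=1.0, start=0.0):
    """Exponential moving average after `steps` updates towards batch_value."""
    moving = start
    for _ in range(steps):
        moving = decay * moving + (1 - decay) * batch_value
    return moving

for decay in (0.999, 0.99, 0.9):
    print(decay, round(moving_average_after(steps=100, decay=decay), 3))
```

After 100 steps, the default decay of 0.999 has covered only about 10% of the distance to the true statistic, 0.99 about 63%, and 0.9 essentially all of it. If training is short, the moving averages used at test time may therefore still be far from the batch statistics seen during training.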
Is your phase variable a tensorflow boolean or a Python boolean?