How to handle BN and DO behavioural changes in subclassed models? - tensorflow

Batch normalization and dropout are layers that behave differently depending on whether you're in the training or the inference phase. Usually, Keras takes care of that for me. But if I'm writing a custom training loop, how do I handle it?
What I've done: added an if statement to bypass the dropout layer in inference mode:
class mymodel(tf.keras.models.Model):
    def __init__(self, **kwargs):
        super(mymodel, self).__init__(**kwargs)
        self.l1 = tf.keras.layers.Dense(3, input_shape=(2,))
        self.l2 = tf.keras.layers.Dropout(0.9)

    def call(self, x, training=None):
        x = self.l1(x)
        if training:
            x = self.l2(x)
        return x
I'm not sure whether that's all that is needed. And what about batch normalization?
EDIT: my 'custom training loop' for the toy example above is:
def train_one_step(model, batch):
    with tf.GradientTape() as tape:
        output = model(batch)
    grad = tape.gradient(output, model.trainable_weights)
    optimizer.apply_gradients(zip(grad, model.trainable_weights))

For this you can control the learning phase manually: call K.set_learning_phase(1) before training and K.set_learning_phase(0) before testing/inference, where K is the keras.backend module.
Also note that to run one training step on a given batch you can use model.train_on_batch(x, y), in which case Keras manages the learning phase for you.
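As an alternative to a global learning phase, layers that accept a training argument can simply be handed the flag on every call, so the model does not need its own if statement around dropout. Here is a minimal, framework-free sketch of that idea. The names dropout and TinyModel are made up for illustration and are not Keras APIs; the sketch only assumes the usual inverted-dropout rule (zero a value with probability rate, scale survivors by 1/(1 - rate), do nothing at inference).

```python
import random


def dropout(values, rate, training, rng=random):
    """Inverted dropout on a list of floats (illustrative, not the Keras layer).

    training=False: returns the input unchanged.
    training=True: zeroes each value with probability `rate` and scales
    the survivors by 1 / (1 - rate).
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


class TinyModel:
    """Hypothetical stand-in for a subclassed model.

    It forwards `training` to its layer instead of skipping the layer.
    """

    def __init__(self, rate):
        self.rate = rate

    def __call__(self, x, training=False):
        return dropout(x, self.rate, training)


model = TinyModel(rate=0.5)
batch = [1.0, 2.0, 3.0, 4.0]
train_out = model(batch, training=True)   # some entries zeroed, the rest doubled
infer_out = model(batch, training=False)  # identical to the input
```

In a real custom loop the equivalent would be calling model(batch, training=True) inside the GradientTape block and model(batch, training=False) for evaluation, with call() passing training on to the dropout and batch normalization layers.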


tensorflow, compute gradients with respect to weights that come from two models (encoder, decoder)

I have an encoder model and a decoder model (RNN).
I want to compute the gradients and update the weights.
I'm somewhat confused by what I've seen so far on the web.
Which block is the best practice? Is there any difference between the two options? Training seems to converge faster with Block 1, and I do not know why.
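# BLOCK 1, in two operations
encoder_gradients, decoder_gradients = tape.gradient(loss, [encoder_model.trainable_variables, decoder_model.trainable_variables])
myoptimizer.apply_gradients(zip(encoder_gradients, encoder_model.trainable_variables))
myoptimizer.apply_gradients(zip(decoder_gradients, decoder_model.trainable_variables))
# BLOCK 2, in one operation
gradients = tape.gradient(loss, encoder_model.trainable_variables + decoder_model.trainable_variables)
myoptimizer.apply_gradients(zip(gradients, encoder_model.trainable_variables + decoder_model.trainable_variables))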
You can manually verify this.
First, let's simplify the model. Let the encoder and decoder each be a single dense layer. This keeps things simple, so you can print out the weights before applying the gradients, the gradients themselves, and the weights after applying the gradients.
import tensorflow as tf
import numpy as np
from copy import deepcopy

# create a simple model with one encoder and one decoder layer.
class custom_net(tf.keras.Model):
    def __init__(self):
        super().__init__()  # must run before layers are assigned as attributes
        self.encoder = tf.keras.layers.Dense(3, activation='relu')
        self.decoder = tf.keras.layers.Dense(3, activation='relu')

    def call(self, inp):
        return self.decoder(self.encoder(inp))

net = custom_net()

# create dummy input/output
inp = np.random.randn(1, 1)
gt = np.random.randn(1, 3)  # same shape as the model output

# set persistent to true since we will be accessing the gradient 2 times
with tf.GradientTape(persistent=True) as tape:
    out = net(inp)
    loss = tf.keras.losses.mean_squared_error(gt, out)

# get the gradients as mentioned in the question
enc_grad, dec_grad = tape.gradient(loss,
    [net.encoder.trainable_variables, net.decoder.trainable_variables])
gradients = tape.gradient(loss,
    net.encoder.trainable_variables + net.decoder.trainable_variables)
First, let's use a stateless optimizer like SGD which updates the weights based on the following formula and compare it to the 2 approaches mentioned in the question.
new_weights = weights - learning_rate * gradients.
# Block 1
myoptimizer = tf.keras.optimizers.SGD(learning_rate=1)
# store weights before updating the weights based on the gradients
old_enc_weights = deepcopy(net.encoder.get_weights())
old_dec_weights = deepcopy(net.decoder.get_weights())
myoptimizer.apply_gradients(zip(enc_grad, net.encoder.trainable_variables))
myoptimizer.apply_gradients(zip(dec_grad, net.decoder.trainable_variables))

# manually calculate the weights after gradient update
# since the learning rate is 1, new_weights = weights - grad
cal_enc_weights = []
for weights, grad in zip(old_enc_weights, enc_grad):
    cal_enc_weights.append(weights - grad.numpy())
cal_dec_weights = []
for weights, grad in zip(old_dec_weights, dec_grad):
    cal_dec_weights.append(weights - grad.numpy())

# compare the optimizer's result with the manual calculation;
# each printed norm should be (close to) zero
for weights, man_calc_weight in zip(net.encoder.get_weights(), cal_enc_weights):
    print(np.linalg.norm(weights - man_calc_weight))
for weights, man_calc_weight in zip(net.decoder.get_weights(), cal_dec_weights):
    print(np.linalg.norm(weights - man_calc_weight))
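# block 2
# snapshot the current weights as NumPy arrays before the update
old_weights = deepcopy(net.encoder.get_weights() + net.decoder.get_weights())
myoptimizer.apply_gradients(zip(gradients, net.encoder.trainable_variables + \
                                net.decoder.trainable_variables))
cal_weights = []
for weight, grad in zip(old_weights, gradients):
    cal_weights.append(weight - grad.numpy())
# each printed norm should again be (close to) zero
for weight, man_calc_weight in zip(net.encoder.trainable_variables + net.decoder.trainable_variables, cal_weights):
    print(np.linalg.norm(weight.numpy() - man_calc_weight))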
You will see that both methods update the weights in exactly the same way.
I think you used a stateful optimizer such as Adam or RMSProp. For such optimizers, each call to apply_gradients also updates the optimizer's internal state based on the gradient value and sign. In the first case that state is updated twice per training step; in the second case, only once.
I would stick to the second option if I were you, since you are performing just one step of optimization here.
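To make the stateless vs. stateful difference concrete without TensorFlow, here is a small pure-Python sketch. sgd_step and ToyAdam are hypothetical names written for this illustration. ToyAdam keeps per-parameter moment estimates plus one step counter shared by every apply() call, which is my assumption about how Keras-style optimizers track iterations, so treat it as a toy model of the effect rather than the library's implementation.

```python
import math


def sgd_step(params, grads, learning_rate):
    """Stateless SGD: new_weights = weights - learning_rate * gradients."""
    return [w - learning_rate * g for w, g in zip(params, grads)]


class ToyAdam:
    """Toy Adam: per-parameter moments, ONE step counter shared by all calls."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.step = 0  # incremented on every apply() call
        self.m = {}    # first-moment estimate per parameter name
        self.v = {}    # second-moment estimate per parameter name

    def apply(self, params, grads):
        """Update `params` (dict name -> float) in place using `grads`."""
        self.step += 1
        for name, g in grads.items():
            m = self.beta1 * self.m.get(name, 0.0) + (1 - self.beta1) * g
            v = self.beta2 * self.v.get(name, 0.0) + (1 - self.beta2) * g * g
            self.m[name], self.v[name] = m, v
            # Bias correction depends on the shared step counter.
            m_hat = m / (1 - self.beta1 ** self.step)
            v_hat = v / (1 - self.beta2 ** self.step)
            params[name] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


# SGD: two calls on separate lists vs. one call on the concatenated list.
enc_w, dec_w = [0.5, -1.0], [2.0, 0.25]
enc_g, dec_g = [0.1, 0.2], [-0.3, 0.4]
sgd_two_calls = sgd_step(enc_w, enc_g, 1.0) + sgd_step(dec_w, dec_g, 1.0)
sgd_one_call = sgd_step(enc_w + dec_w, enc_g + dec_g, 1.0)

# Adam, Block 1 style: two calls, so the decoder update sees step == 2.
grads = {"enc": 0.5, "dec": -0.2}
p_two = {"enc": 1.0, "dec": 1.0}
opt_two = ToyAdam()
opt_two.apply(p_two, {"enc": grads["enc"]})
opt_two.apply(p_two, {"dec": grads["dec"]})

# Adam, Block 2 style: one call, both updates see step == 1.
p_one = {"enc": 1.0, "dec": 1.0}
opt_one = ToyAdam()
opt_one.apply(p_one, grads)
```

With SGD the two lists come out identical. With the toy Adam the encoder ends up the same either way, but the decoder receives a different first update in the two-call version because the shared step counter has already advanced.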

how to perform early stopping when writing our own custom training loops in tensorflow 2.0?

To perform early stopping in TensorFlow, tf.keras offers a very convenient mechanism: a callback from tf.keras.callbacks that is passed to fit. When writing a custom training loop, however, I couldn't work out how to make use of tf.keras.callbacks. Can someone provide a basic tutorial on how to do it?
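You have two approaches to writing a custom training loop.
One is the common pattern of two nested for loops (over epochs, then over batches).
Or you can override train_step in a keras.Model subclass and keep calling fit, as shown here. All the callbacks and other fit features remain available this way: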
import tensorflow as tf
from tensorflow import keras

class CustomModel(keras.Model):
    def train_step(self, data):
        # Unpack the data. Its structure depends on your model and
        # on what you pass to `fit()`.
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)  # Forward pass
            # Compute the loss value
            # (the loss function is configured in `compile()`)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(y, y_pred)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}

# Construct and compile an instance of CustomModel
inputs = keras.Input(shape=(32,))
outputs = keras.layers.Dense(1)(inputs)
model = CustomModel(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['...'])
earlystopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# Just use `fit` as usual (x and y are your training inputs and targets)
model.fit(x, y, epochs=3, callbacks=[earlystopping_cb])
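If you stay with the two nested for loops instead, the early-stopping bookkeeping has to be done by hand. The following is a minimal, framework-free sketch of that bookkeeping. EarlyStopper and run_training are hypothetical helpers, not Keras classes; in a real loop val_loss would come from your validation pass and weights from model.get_weights().

```python
import copy


class EarlyStopper:
    """Hypothetical helper mirroring the patience / restore-best-weights idea."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_weights = None
        self.wait = 0  # epochs since the last improvement

    def update(self, val_loss, weights):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_weights = copy.deepcopy(weights)
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


def run_training(val_losses, patience):
    """Simulate the outer epoch loop with precomputed validation losses."""
    stopper = EarlyStopper(patience=patience)
    epochs_run = 0
    for epoch, val_loss in enumerate(val_losses):
        # ... the inner loop over batches (GradientTape etc.) would go here ...
        weights = [float(epoch)]  # stand-in for model.get_weights()
        epochs_run = epoch + 1
        if stopper.update(val_loss, weights):
            break
    # A real loop would now call model.set_weights(stopper.best_weights).
    return epochs_run, stopper.best_loss, stopper.best_weights


result = run_training([1.0, 0.8, 0.9, 0.85, 0.7], patience=2)
```

Here the loss stops improving after the second epoch, so with patience=2 the loop ends after four epochs and the weights saved at the second epoch are the ones to restore.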

Custom layer updates

I want to create a custom layer with weights that update only in the training phase.
From the official documentation, this is the way to write a custom layer:
from keras import backend as K
from keras.layers import Layer

class MyLayer(Layer):
    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(MyLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='kernel',
                                      shape=(input_shape[1], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(MyLayer, self).build(input_shape)  # Be sure to call this at the end

    def call(self, x):
        return K.dot(x, self.kernel)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_dim)
In this GitHub repo, the author added:
new_centers = self.centers - self.alpha * delta_centers
self.add_update((self.centers, new_centers), x)
where self.centers are the weights.
I can't understand why self.add_update is useful in that situation.
Are the weights not updated if I don't call self.add_update? If not, why must new_centers be in the updates list and not in the inputs list? And why is x required?
From the source code:
self.add_update(updates, inputs)
updates: update op or list of update ops to add to the layer.
inputs: input tensor or list of input tensors to mark the updates as conditional on these inputs. If None is passed, the updates are assumed unconditional.
There are two types of weights:
Trainable = Updated automatically by the optimizer with backpropagation
Untrainable = Not updated by backpropagation
For the trainable weights, it's really not recommended to use updates: you would be mixing the optimizer's updates with your own, and that could cause many issues.
For the untrainable weights, you can do whatever you want. Sometimes you want constants and do nothing; sometimes you want these weights to change (but not via backpropagation).
Notice how in that example the weights updated by the user are untrainable:
self.centers = self.add_weight(name='centers',
                               shape=(10, 2),
                               trainable=False)
But the user wants these weights to be updated following some rules. I don't know what they are doing there (didn't analyse the code), but I assume that they are calculating, for instance, something similar to the center point of a group of images, and each batch will have this center in a different position. They want to update this position.
A classic example is the BatchNormalization layer. Besides the trainable scale and bias weights used to rescale the outputs, it has mean and variance weights. These are statistical properties of the data that need to be updated with every batch.
You are not training the "mean" or the "variance", but each batch of data updates these values.
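A framework-free sketch of this kind of weight may help. RunningMean below is a hypothetical class, not the BatchNormalization implementation: it keeps one statistic that no optimizer ever touches, refreshes it with a moving-average rule on each training batch, and only reads it at inference time. The momentum-style rule is an assumption chosen for illustration.

```python
class RunningMean:
    """Hypothetical non-trainable statistic updated by a rule, not by gradients."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = 0.0  # plays the role of a non-trainable weight

    def __call__(self, batch, training=False):
        batch_mean = sum(batch) / len(batch)
        if training:
            # Update rule applied once per batch, only in the training phase.
            self.value = (self.momentum * self.value
                          + (1 - self.momentum) * batch_mean)
            return [x - batch_mean for x in batch]
        # Inference: use the stored statistic and leave it untouched.
        return [x - self.value for x in batch]


layer = RunningMean(momentum=0.5)
layer([2.0, 4.0], training=True)                  # batch mean 3.0 -> value becomes 1.5
centered = layer([1.5, 2.5], training=False)      # uses value 1.5, does not change it
```

The stored value changes with every training batch even though it never receives a gradient, which is the behaviour the mean and variance weights need.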
How does it work?
This is obscure and lies deep down in Keras code.
We need the update operation so we make sure self.centers will have new values for every batch, otherwise it won't.
We use self.add_update in a layer to register that this variable should be updated. (We do similar things in custom optimizers as well, the optimizers contain the updates to the weights made via backpropagation)
Later in the source code for training the model, Keras will collect all these registered updates and make a train function. Somewhere inside this, these updates will be applied to the vars:
# inside a training function from keras
with K.name_scope('training'):
    with K.name_scope(self.optimizer.__class__.__name__):
        training_updates = self.optimizer.get_updates(
            params=self._collected_trainable_weights,
            loss=self.total_loss)
    updates = (self.updates +         # probably the updates registered in layers
               training_updates +     # the updates registered in optimizers
               self.metrics_updates)  # don't know....
    # Gets loss and metrics. Updates weights at each call.
    self.train_function = K.function(
        inputs,
        [self.total_loss] + self.metrics_tensors,
        updates=updates,
        name='train_function',
        **self._function_kwargs)
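The register-now, apply-later pattern can be imitated in a few lines of plain Python. This is only an analogy for the mechanism described above, under the assumption that add_update merely records an assignment and the training function later runs every recorded assignment. CenterLayer and train_function are made-up names, and the update rule is a simplified stand-in for the one in the question.

```python
class CenterLayer:
    """Hypothetical layer whose `centers` change only via registered updates."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.centers = [0.0, 0.0]  # non-trainable state
        self.updates = []          # pending (target, new_values) pairs

    def add_update(self, target, new_values):
        # Only records the assignment; nothing is written yet.
        self.updates.append((target, new_values))

    def call(self, x):
        delta = [c - xi for c, xi in zip(self.centers, x)]
        new_centers = [c - self.alpha * d for c, d in zip(self.centers, delta)]
        self.add_update(self.centers, new_centers)
        return x


def train_function(layers, x):
    """Run the forward pass, then apply every update the layers registered."""
    out = x
    for layer in layers:
        out = layer.call(out)
    for layer in layers:
        for target, new_values in layer.updates:
            target[:] = new_values  # the actual assignment happens here
        layer.updates.clear()
    return out


layer = CenterLayer()
layer.call([2.0, 4.0])
before = list(layer.centers)  # still [0.0, 0.0]: registered but not applied
layer.updates.clear()
train_function([layer], [2.0, 4.0])
after = list(layer.centers)   # now moved halfway towards the input
```

Calling the layer on its own leaves centers unchanged; they only move once something plays the role of the training function and executes the registered updates.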

How can I tell Keras the learning phase when I use train_on_batch to train a model?

I have dropout layers in my model, so I want Keras to distinguish the training and test phases and apply or skip the dropout layers accordingly. I found that K.set_learning_phase can do this, but where do I call it in my training and test code? My code looks like this:
def discriminator(self):
    x_A = Input(shape=self.shape)
    x_B = Input(shape=self.shape)
    x = concatenate([x_A, x_B], axis=-1)
    self.model = Sequential()
    self.model.add(Dropout(0.5, input_shape=self.shape_double))
    self.model.add(LSTM(200, return_sequences=True, kernel_constraint=unit_norm()))
    self.model.add(LSTM(200, return_sequences=True, kernel_constraint=unit_norm()))
    self.model.add(Dense(8, activation="softmax", kernel_constraint=unit_norm()))
    label = self.model(x)
    return Model([x_A, x_B], label)

def train(self, epochs, batch_size):
    for epoch in range(epochs):
        for batch, (train_A, train_B, train_label) in enumerate(Load_train(batch_size)):
            Dloss = self.discriminator.train_on_batch([train_A, train_B], train_label)

def test(self, test_A, test_B, test_label):
    predicted_label_dist = self.discriminator.predict([test_A, test_B])
Any suggestions will be appreciated. Thanks.
Keras figures out the appropriate learning phase on its own by default when you call fit, train_on_batch, or predict. Hence, your dropout will be applied during training but not during testing. However, if you still wish to set the learning phase yourself, i.e. override the default behaviour, you can do it like this (from the Keras docs):
keras.backend.set_learning_phase(value)
value: Learning phase value, either 0 or 1 (integers).
Simply add this call to your training and testing functions: set_learning_phase(1) before training and set_learning_phase(0) before testing.

A Tensorflow training agnostic to Eager and Graph modes

I spend some of my time coding novel (I wish) RNN cells in Tensorflow.
To prototype, I use eager mode (easier to debug).
In order to train, I migrate the code to a graph (runs faster).
I am looking for wrapper code or an example that can run the forward pass and training in a way that is as agnostic as possible to the mode I run it in, eager or graph. I have in mind a set of functions/classes into which the particular neural network/optimizer/data can be plugged, and which can run in both modes with minimal changes between the two. In addition, it should ideally be compatible with many types of NN/optimizer/data instances.
I am quite sure that many had this idea.
I wonder if something like this is feasible given the current eager/graph integration in TF.
Yes. I have been wondering the same. In the Tensorflow documentation you can see:
The same code written for eager execution will also build a graph during graph execution. Do this by simply running the same code in a new Python session where eager execution is not enabled.
But this is hard to achieve, mostly because working with graphs means dealing with placeholders, which cannot be used in eager mode. I tried to get rid of placeholders by using object-oriented layers and the Dataset API. This is the closest I could get to fully compatible code:
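import tensorflow as tf  # TF 1.x API; call tf.enable_eager_execution() first for eager mode

m = 128  # num_examples
n = 5  # num features
epochs = 2
batch_size = 32
steps_per_epoch = m // batch_size
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([m, n], dtype=tf.float32),
     tf.random_uniform([m, 1], dtype=tf.float32)))
dataset = dataset.repeat(epochs)
dataset = dataset.batch(batch_size)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_dim=n),
    tf.keras.layers.Dense(32, activation='relu'),
    # assumed output layer matching the [m, 1] labels; the original listing is cut off here
    tf.keras.layers.Dense(1),
])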
def train_eagerly(model, dataset):
    optimizer = tf.train.AdamOptimizer()
    iterator = dataset.make_one_shot_iterator()
    print('Training eagerly...')
    for epoch in range(epochs):
        print('Epoch', epoch)
        progbar = tf.keras.utils.Progbar(target=steps_per_epoch, stateful_metrics=['loss'])
        for step in range(steps_per_epoch):
            with tf.GradientTape() as tape:
                features, labels = iterator.get_next()
                predictions = model(features, training=True)
                loss_value = tf.losses.mean_squared_error(labels, predictions)
            grads = tape.gradient(loss_value, model.variables)
            optimizer.apply_gradients(zip(grads, model.variables))
            progbar.add(1, values=[('loss', loss_value.numpy())])
def train_graph(model, dataset):
    optimizer = tf.train.AdamOptimizer()
    iterator = dataset.make_initializable_iterator()
    # Build the ops once; in graph mode nothing runs until sess.run is called.
    with tf.GradientTape() as tape:
        features, labels = iterator.get_next()
        predictions = model(features, training=True)
        loss_value = tf.losses.mean_squared_error(labels, predictions)
    grads = tape.gradient(loss_value, model.variables)
    train_op = optimizer.apply_gradients(zip(grads, model.variables))
    print('Training graph...')
    with tf.Session() as sess:
        # Initialize the iterator and all variables, including the optimizer's.
        sess.run([iterator.initializer, tf.global_variables_initializer()])
        for epoch in range(epochs):
            print('Epoch', epoch)
            progbar = tf.keras.utils.Progbar(target=steps_per_epoch, stateful_metrics=['loss'])
            for step in range(steps_per_epoch):
                _, loss = sess.run([train_op, loss_value])
                progbar.add(1, values=[('loss', loss)])
As you can see, the main difference is that I use a one_shot_iterator during Eager training (of course, during graph training, I have to run operations within a session).
I tried to do the same using optimizer.minimize instead of applying the gradients myself, but I could not come up with code that worked in both eager and graph modes.
Also, I'm sure this becomes much harder to do with not so simple models, like the one you are working with.