Suppose we want to minimize the following objective using gradient descent:
min f(alpha * v + (1 - alpha) * w), where v and w are the weights of two models and alpha, between 0 and 1, is the mixing weight of the sum. The result is the combined model v_bar or ū (referred to here as m).
alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
w_weights = tff.learning.ModelWeights.from_model(w)
v_weights = tff.learning.ModelWeights.from_model(v)
m_weights = tff.learning.ModelWeights.from_model(m)
m_weights_trainable = tf.nest.map_structure(lambda v, w: alpha*v + (tf.constant(1.0) - alpha)*w, v_weights.trainable, w_weights.trainable)
tf.nest.map_structure(lambda v, t: v.assign(t), m_weights.trainable, m_weights_trainable)
In the Adaptive Personalized Federated Learning paper, the update step for alpha is based on the gradients of model m evaluated on a minibatch. I tried it with and without tape.watch, but it always leads to "No gradients provided for any variable":
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch([alpha])
    outputs_m = m.forward_pass(batch)
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
Some more information about the initialization of the models:
m.forward_pass(batch) is the default implementation from tff.learning.Model (found here); the model is created with tff.learning.from_keras_model from a tf.keras.Sequential model.
def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec = element_spec,
        loss = tf.keras.losses.MeanSquaredError(),
        metrics = [tf.keras.metrics.MeanSquaredError()])
w = model_fn()
v = model_fn()
m = model_fn()
Some more experimenting as suggested below by Zachary Garrett:
It seems that whenever this weighted sum is calculated, and the new weights for the model are assigned, then it loses track of the previous trainable variables of both summed models. Again, it leads to the No gradients provided for any variable whenever optimizer.apply_gradients(zip([grad], [alpha])) is called. All gradients seem to be None.
with tf.GradientTape() as tape:
    alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
    m_weights_t = tf.nest.map_structure(lambda w, v: tf.math.scalar_mul(alpha, v, name=None) + tf.math.scalar_mul(tf.constant(1.0) - alpha, w, name=None),
                                        w_weights.trainable,
                                        v_weights.trainable)
    m_weights = tff.learning.ModelWeights.from_model(m)
    tf.nest.map_structure(lambda v, t: v.assign(t), m_weights.trainable,
                          m_weights_t)
    outputs_m = m.forward_pass(batch)
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
Another edit:
I think I have a strategy to get it working, but it is bad practice as manually setting trainable_weights or _trainable_weights does not work. Any tips on improving this?
def do_weighted_combination():
    def _mapper(target_layer, v_layer, w_layer):
        target_layer.kernel = v_layer.kernel * alpha + w_layer.kernel * (1-alpha)
        target_layer.bias = v_layer.bias * alpha + w_layer.bias * (1-alpha)
    tf.nest.map_structure(_mapper, m.layers, v.layers, w.layers)

with tf.GradientTape(persistent=True) as tape:
    do_weighted_combination()  # combine inside the tape so alpha is recorded
    predictions = m(x_data)
    loss = m.compiled_loss(y_data, predictions)
g1 = tape.gradient(loss, v.trainable_weights) # Not None
g2 = tape.gradient(loss, alpha) # Not None

For TensorFlow auto-differentiation using tf.GradientTape, operations must occur within the tf.GradientTape Python context manager so that TensorFlow can "see" them.
Possibly what is happening here is that alpha is used outside/before the tape context, when setting the model variables. Then, when m.forward_pass is called, TensorFlow doesn't see any access to alpha and thus can't compute a gradient for it (returning None instead).
Moving the
alpha*v + (tf.constant(1.0) - alpha)*w, v_weights.trainable, w_weights.trainable
logic inside the tf.GradientTape context manager (possibly inside m.forward_pass) may be a solution.
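To see which value the tape should report for alpha once the mixing is recorded, it can help to work the gradient out by hand. By the chain rule, d/d(alpha) f(alpha*v + (1-alpha)*w) is the inner product of grad f(m) with (v - w), where m = alpha*v + (1-alpha)*w. The following is a minimal plain-Python sketch under assumed toy values: the quadratic loss `f`, the vectors `v`, `w`, `target` and all helper names are hypothetical and not taken from the question.

```python
# Toy check of d/d(alpha) f(alpha*v + (1-alpha)*w) = <grad f(m), v - w>.
# f, v, w and target are illustrative assumptions, not the question's model.

def mix(alpha, v, w):
    """Return m = alpha*v + (1-alpha)*w element-wise."""
    return [alpha * vi + (1.0 - alpha) * wi for vi, wi in zip(v, w)]

def f(m, target):
    """Hypothetical loss: half the squared distance to target."""
    return 0.5 * sum((mi - ti) ** 2 for mi, ti in zip(m, target))

def grad_f(m, target):
    """Gradient of f with respect to m."""
    return [mi - ti for mi, ti in zip(m, target)]

def alpha_gradient(alpha, v, w, target):
    """Chain rule: inner product of grad f(m) with (v - w)."""
    m = mix(alpha, v, w)
    return sum(g * (vi - wi) for g, vi, wi in zip(grad_f(m, target), v, w))

v = [1.0, 2.0, -1.0]
w = [0.0, 1.0, 3.0]
target = [0.5, 0.5, 0.5]
alpha = 0.01

analytic = alpha_gradient(alpha, v, w, target)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (f(mix(alpha + eps, v, w), target)
           - f(mix(alpha - eps, v, w), target)) / (2 * eps)

print(analytic, numeric)
```

If the tape returns None instead of a number like this, alpha most likely never entered the recorded computation.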


visualizing batch_norm parameters in tensorboard

My current NN model is giving some anomalous results when I change batch-norm-specific hyperparameters. I'd like to see the distribution of the batch norm parameters beta and gamma over time to make sure that batch norm isn't doing something weird.
Visualizing learned weights or biases is easiest to do with tensorboard, but I'm not sure how to do that with beta and gamma since they're defined and managed within tf.layers.batch_normalization or tf.contrib.layers.batch_norm.
Is there a simple way to reference beta and gamma and put them in a histogram summary without having to write my own version of batch norm?
Building a summary for them is still a chore, but this is what I've come up with for accessing gamma and beta:
def batch_norm(self, x_in):
    with tf.variable_scope('batch_norm'):
        x = tf.layers.batch_normalization( x_in,
                                           momentum = self.bn_decay,
                                           epsilon = self.bn_epsilon,
                                           training = self.is_training)
        gamma = tf.trainable_variables(tf.get_variable_scope().name)[0]
        beta = tf.trainable_variables(tf.get_variable_scope().name)[1]
    return x
What tf.trainable_variables(tf.get_variable_scope().name) does is return all trainable variables within the current scope as a list. In this case there are two variables: the 0th is gamma and the 1st is beta, but that may change with a different implementation.
If you need the specific names, use:
for var in tf.trainable_variables(tf.get_variable_scope().name):
    print(var.name)
Alternatively, if you need not only access to the beta and gamma values but also control over how they are used, you could set center and scale to False in tf.layers.batch_normalization() and define your own scale and offset functionality, like so:
def batch_norm(self, x, name = 'batch_norm'):
    with tf.variable_scope(name):
        x = tf.layers.batch_normalization( x,
                                           momentum = .99,
                                           epsilon = .0001,
                                           center = False,
                                           scale = False,
                                           training = self.is_training)
        gamma = tf.get_variable(
            name = 'gamma',
            shape = x.get_shape()[-1],
            initializer = tf.ones_initializer())
        beta = tf.get_variable(
            name = 'beta',
            shape = x.get_shape()[-1],
            initializer = tf.zeros_initializer())
        x = gamma*x + beta
    return x
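For reference, the arithmetic this second variant performs is a normalization followed by an explicit affine transform. Below is a minimal plain-Python sketch for a single feature column, assuming training-mode batch statistics; the function name `batch_norm_1d` and the sample values are hypothetical.

```python
import math

def batch_norm_1d(xs, gamma=1.0, beta=0.0, epsilon=1e-4):
    """Normalize one feature over the batch, then scale by gamma and shift by beta."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)  # biased batch variance
    normalized = [(x - mean) / math.sqrt(var + epsilon) for x in xs]
    return [gamma * n + beta for n in normalized]

batch = [1.0, 2.0, 3.0, 4.0]
out = batch_norm_1d(batch, gamma=2.0, beta=0.5)

out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((o - out_mean) ** 2 for o in out) / len(out))
print(out_mean, out_std)  # mean close to beta, standard deviation close to gamma
```

Because gamma and beta are ordinary variables here, logging them over time tells you directly what output scale and offset the layer has learned.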

Implementing backpropagation gradient descent using scipy.optimize.minimize

I am trying to train an autoencoder NN (3 layers - 2 visible, 1 hidden) using numpy and scipy for the MNIST digits images dataset. The implementation is based on the notation given here. Below is my code:
def autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, data):
    """
    The input theta is a 1-dimensional array because scipy.optimize.minimize expects
    the parameters being optimized to be a 1d array.
    First convert theta from a 1d array to the (W1, W2, b1, b2)
    matrix/vector format, so that this follows the notation convention of the
    lecture notes and tutorial.
    You must compute the:
        cost : scalar representing the overall cost J(theta)
        grad : array representing the corresponding gradient of each element of theta
    """
    training_size = data.shape[1]
    # unroll theta to get (W1,W2,b1,b2) #
    W1 = theta[0:hidden_size*visible_size]
    W1 = W1.reshape(hidden_size,visible_size)
    W2 = theta[hidden_size*visible_size:2*hidden_size*visible_size]
    W2 = W2.reshape(visible_size,hidden_size)
    b1 = theta[2*hidden_size*visible_size:2*hidden_size*visible_size + hidden_size]
    b2 = theta[2*hidden_size*visible_size + hidden_size: 2*hidden_size*visible_size + hidden_size + visible_size]
    #feedforward pass
    a_l1 = data
    z_l2 = numpy.dot(W1, a_l1) + numpy.tile(b1,(training_size,1)).T
    a_l2 = sigmoid(z_l2)
    z_l3 = numpy.dot(W2, a_l2) + numpy.tile(b2,(training_size,1)).T
    a_l3 = sigmoid(z_l3)
    #backprop: error terms of the output and hidden layers
    delta_l3 = numpy.multiply(-(data-a_l3),numpy.multiply(a_l3,1-a_l3))
    delta_l2 = numpy.multiply(numpy.dot(W2.T, delta_l3),
                              numpy.multiply(a_l2, 1 - a_l2))
    b2_derivative = numpy.sum(delta_l3,axis=1)/training_size
    b1_derivative = numpy.sum(delta_l2,axis=1)/training_size
    W2_derivative = numpy.dot(delta_l3,a_l2.T)/training_size + lambda_*W2
    W1_derivative = numpy.dot(delta_l2,a_l1.T)/training_size + lambda_*W1
    # roll the derivatives back into a 1d array in the same order as theta
    W1_derivative = W1_derivative.reshape(hidden_size*visible_size)
    W2_derivative = W2_derivative.reshape(visible_size*hidden_size)
    b1_derivative = b1_derivative.reshape(hidden_size)
    b2_derivative = b2_derivative.reshape(visible_size)
    grad = numpy.concatenate((W1_derivative,W2_derivative,b1_derivative,b2_derivative))
    cost = 0.5*numpy.sum((data-a_l3)**2)/training_size + 0.5*lambda_*(numpy.sum(W1**2) + numpy.sum(W2**2))
    return cost,grad
I have also implemented a function to estimate the numerical gradient and verify the correctness of my implementation (below).
def compute_gradient_numerical_estimate(J, theta, epsilon=0.0001):
    """
    :param J: a loss (cost) function that computes the real-valued loss given parameters and data
    :param theta: array of parameters
    :param epsilon: amount to vary each parameter in order to estimate
                    the gradient by numerical difference
    :return: array of numerical gradient estimate
    """
    gradient = numpy.zeros(theta.shape)
    eps_vector = numpy.zeros(theta.shape)
    for i in range(0,theta.size):
        eps_vector[i] = epsilon
        cost1,grad1 = J(theta+eps_vector)
        cost2,grad2 = J(theta-eps_vector)
        gradient[i] = (cost1 - cost2)/(2*epsilon)
        eps_vector[i] = 0
    return gradient
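A check of this kind can be exercised on a toy cost whose gradient is known in closed form. The sketch below restates the central-difference idea in plain Python (lists instead of numpy arrays); `numerical_gradient` and `toy_cost_and_grad` are hypothetical names and the cubic cost is only an illustration.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Central-difference estimate of the gradient of J's cost at theta."""
    gradient = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += epsilon
        minus[i] -= epsilon
        cost_plus, _ = J(plus)
        cost_minus, _ = J(minus)
        gradient.append((cost_plus - cost_minus) / (2 * epsilon))
    return gradient

def toy_cost_and_grad(theta):
    """Hypothetical cost: sum of cubes, returned with its analytic gradient."""
    cost = sum(t ** 3 for t in theta)
    grad = [3 * t ** 2 for t in theta]
    return cost, grad

theta = [0.5, -1.0, 2.0]
_, analytic = toy_cost_and_grad(theta)
numeric = numerical_gradient(toy_cost_and_grad, theta)

# Euclidean norm of the difference between the two gradients.
diff_norm = sum((a - n) ** 2 for a, n in zip(analytic, numeric)) ** 0.5
print(diff_norm)
```

A small norm only confirms that cost and gradient agree with each other at the tested point; it says nothing about how the function is wired into the optimizer.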
The norm of the difference between the numerical estimate and the one computed by the function is around 6.87165125021e-09, which seems acceptable. My main problem is getting the "L-BFGS-B" algorithm to work using the scipy.optimize.minimize function, as below:
# theta is the 1-D array of(W1,W2,b1,b2)
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
options_ = {'maxiter': 4000, 'disp': False}
result = scipy.optimize.minimize(J, theta, method='L-BFGS-B', jac=True, options=options_)
I get the below output from this:
scipy.optimize.minimize() details:
fun: 90.802022224079778
hess_inv: <16474x16474 LbfgsInvHessProduct with dtype=float64>
jac: array([ -6.83667742e-06, -2.74886002e-06, -3.23531941e-06, ...,
1.22425735e-01, 1.23425062e-01, 1.28091250e-01])
nfev: 21
nit: 0
status: 2
success: False
x: array([-0.06836677, -0.0274886 , -0.03235319, ..., 0. ,
0. , 0. ])
Now, this post seems to indicate that the error could mean that the gradient function implementation is wrong. But my numerical gradient estimate seems to confirm that my implementation is correct. I have tried varying the initial weights by using a uniform distribution as specified here, but the problem still persists. Is there anything wrong with my backprop implementation?
It turns out the issue was a (very silly) mistake in this line:
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
The lambda parameter x is never used in the body of the lambda. So the theta array proposed by the optimizer wasn't being passed on whenever J was invoked; every call evaluated the same initial theta.
This fixed it:
J = lambda x: utils.autoencoder_cost_and_grad(x, visible_size, hidden_size, lambda_, patches_train)
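The effect of that one-character difference can be reproduced without scipy. In this minimal sketch, `cost_and_grad` is a hypothetical stand-in for autoencoder_cost_and_grad and the numbers are made up for illustration.

```python
def cost_and_grad(theta, scale):
    """Hypothetical stand-in: quadratic cost and its gradient."""
    cost = 0.5 * scale * sum(t * t for t in theta)
    grad = [scale * t for t in theta]
    return cost, grad

theta = [1.0, -2.0]  # initial parameters, captured by both closures below

J_bad = lambda x: cost_and_grad(theta, 3.0)   # ignores x: always evaluates at theta
J_good = lambda x: cost_and_grad(x, 3.0)      # evaluates at the point it is given

trial = [0.0, 0.0]   # a different point, such as an optimizer's trial step

print(J_bad(trial))   # identical to J_bad(theta): the cost never changes
print(J_good(trial))  # cost and gradient at the trial point
```

With a function like `J_bad`, a line search sees no decrease in cost along any step, which would be consistent with the nit: 0 and success: False in the output above.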

consistent forward / backward pass with tensorflow dropout

In reinforcement learning, one usually runs a forward pass of the neural network at each step of the episode in order to calculate the policy. Afterwards, one can calculate the parameter gradients using backpropagation. A simplified implementation of my network looks like this:
class AC_Network(object):
    def __init__(self, s_size, a_size, scope, trainer, parameters_net):
        with tf.variable_scope(scope):
            self.is_training = tf.placeholder(shape=[], dtype=tf.bool)
            self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
            # (...)
            layer = slim.fully_connected(self.inputs,
                                         # (...)
                                         )
            layer = tf.contrib.layers.dropout(inputs=layer, keep_prob=parameters_net["dropout_keep_prob"],
                                              is_training=self.is_training)
            self.policy = slim.fully_connected(layer, a_size,
                                               # (...)
                                               )
            self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
            self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)
            actions_onehot = tf.one_hot(self.actions, a_size, dtype=tf.float32)
            responsible_outputs = tf.reduce_sum(self.policy * actions_onehot, [1])
            self.policy_loss = - policy_loss_multiplier * tf.reduce_mean(tf.log(responsible_outputs) * self.advantages)
            local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
            self.gradients = tf.gradients(self.policy_loss, local_vars)
Now during training I will first roll out the episode by consecutive forward passes (again, simplified version):
s = self.local_env.reset() # list of input variables for the first step
while done == False:
    a_dist = sess.run([self.policy],
                      feed_dict = {self.local_AC.inputs: [s],
                                   self.is_training: True})
    a = np.argmax(a_dist)
    s, r, done, extra_stat = self.local_env.step(a)
# (...)
and in the end I will calculate gradients by backward pass:
p_l, grad = sess.run([self.policy_loss,
                      self.gradients],
                     feed_dict={self.inputs: np.vstack(comb_observations),
                                self.is_training: True,
                                self.actions: np.hstack(comb_actions),})
(please note that I could have made a mistake somewhere above trying to remove as much as possible of the original code irrelevant to the issue in question)
So finally the question: is there a way of ensuring that all the consecutive calls to sess.run will generate the same dropout structure? Ideally I would like to have exactly the same dropout structure within each episode and only change it between episodes. Things seem to work well as they are, but I continue to wonder.
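One possible approach, offered as an assumption rather than something confirmed in this thread, is to stop relying on the dropout op's internal random state: sample a mask once per episode and apply that same mask on every forward pass and on the backward pass. A minimal plain-Python sketch of the idea follows; `sample_mask`, `apply_mask` and the sizes are hypothetical names and values.

```python
import random

def sample_mask(size, keep_prob, rng):
    """Draw one inverted-dropout mask: kept units are scaled by 1/keep_prob."""
    return [(1.0 / keep_prob) if rng.random() < keep_prob else 0.0
            for _ in range(size)]

def apply_mask(activations, mask):
    """Element-wise product of a layer's activations with a fixed mask."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(0)
keep_prob = 0.5

for episode in range(2):
    mask = sample_mask(4, keep_prob, rng)     # new mask only between episodes
    for step in range(3):
        hidden = [1.0, 2.0, 3.0, 4.0]         # stand-in for a layer's output
        dropped = apply_mask(hidden, mask)    # same mask at every step
    # the stored mask would be reused for this episode's backward pass
```

In a TensorFlow graph this would correspond to replacing the dropout layer with a multiplication by a mask fed through a placeholder, which is a design change to the network above and is untested here.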

Tensorflow >r1.0 tf.layers.batch_normalization very bad test performance

I'm trying to use the tf.layers.batch_normalization function provided in the latest Tensorflow API to implement a recurrent batch normalized LSTM.
The implementation is as below (I modified the TF source code):
class BNLSTMCell(tf.nn.rnn_cell.RNNCell):
  """Batch Normalized Long short-term memory unit (LSTM) recurrent network cell.

  cf. Recurrent Batch Normalization
  cf. A Gentle Guide to Using Batch Normalization in TensorFlow
  """

  def __init__(self, num_units, forward_only, gamma_c=1.0, gamma_h=1.0,
               gamma_x=1.0, beta_c=0.0, beta_h=0.0, beta_x=0.0,
               input_size=None, use_peepholes=False, cell_clip=None,
               initializer=None, num_proj=None,
               num_unit_shards=1, num_proj_shards=1,
               forget_bias=1.0, state_is_tuple=False,
               activation=tf.tanh):
    """Initialize the parameters for an LSTM cell.

    Args:
      num_units: int, The number of units in the LSTM cell
      forward_only:
        If False (training):
          1. Normalize layer activations according to mini-batch statistics.
          2. During the training step, update population statistics
             approximation via moving average of mini-batch statistics.
        If True (testing):
          1. Normalize layer activations according to estimated population
             statistics.
          2. No update of population statistics according to mini-batch
             statistics from test data.
      gamma_c: Scale of cell state normalization
      beta_c: Offset of cell state normalization
      gamma_h: Scale of hidden state normalization
      beta_h: Offset of hidden state normalization
        (set to 0 to avoid redundancy)
      gamma_x: Scale of input normalization
      beta_x: Offset of input normalization
        (set to 0 to avoid redundancy)
      input_size: Deprecated and unused.
      use_peepholes: bool, Set True to enable diagonal/peephole connections.
      cell_clip: (optional) A float value, if provided the cell state is clipped
        by this value prior to the cell output activation.
      initializer: (optional) The initializer to use for the weight and
        projection matrices.
      num_proj: (optional) int, The output dimensionality for the projection
        matrices. If None, no projection is performed.
      num_unit_shards: How to split the weight matrix. If >1, the weight
        matrix is stored across num_unit_shards.
      num_proj_shards: How to split the projection matrix. If >1, the
        projection matrix is stored across num_proj_shards.
      forget_bias: Biases of the forget gate are initialized by default to 1
        in order to reduce the scale of forgetting at the beginning of
        the training.
      state_is_tuple: If True, accepted and returned states are 2-tuples of
        the `c_state` and `m_state`. By default (False), they are concatenated
        along the column axis. This default behavior will soon be deprecated.
      activation: Activation function of the inner states.
    """
    if not state_is_tuple:
      logging.warn(
          "%s: Using a concatenated state is slower and will soon be "
          "deprecated. Use state_is_tuple=True." % self)
    if input_size is not None:
      logging.warn("%s: The input_size parameter is deprecated." % self)
    self._num_units = num_units
    self.forward_only = forward_only
    self._gamma_c = gamma_c
    self._beta_c = beta_c
    self._gamma_h = gamma_h
    self._beta_h = beta_h
    self._gamma_x = gamma_x
    self._beta_x = beta_x
    self._use_peepholes = use_peepholes
    self._cell_clip = cell_clip
    self._initializer = initializer
    self._num_proj = num_proj
    self._num_unit_shards = num_unit_shards
    self._num_proj_shards = num_proj_shards
    self._forget_bias = forget_bias
    self._state_is_tuple = state_is_tuple
    self._activation = activation
    if num_proj:
      self._state_size = (
          tf.nn.rnn_cell.LSTMStateTuple(num_units, num_proj)
          if state_is_tuple else num_units + num_proj)
      self._output_size = num_proj
    else:
      self._state_size = (
          tf.nn.rnn_cell.LSTMStateTuple(num_units, num_units)
          if state_is_tuple else 2 * num_units)
      self._output_size = num_units
  @property
  def state_size(self):
    return self._state_size

  @property
  def output_size(self):
    return self._output_size
  def __call__(self, inputs, state, scope=None):
    """Run one step of LSTM.

    Args:
      inputs: input Tensor, 2D, batch x num_units.
      state: if `state_is_tuple` is False, this must be a state Tensor,
        `2-D, batch x state_size`. If `state_is_tuple` is True, this must be a
        tuple of state Tensors, both `2-D`, with column sizes `c_state` and
        `m_state`.
      scope: VariableScope for the created subgraph; defaults to "LSTMCell".

    Returns:
      A tuple containing:
      - A `2-D, [batch x output_dim]`, Tensor representing the output of the
        LSTM after reading `inputs` when previous state was `state`.
        Here output_dim is:
           num_proj if num_proj was set,
           num_units otherwise.
      - Tensor(s) representing the new state of LSTM after reading `inputs` when
        the previous state was `state`. Same type and shape(s) as `state`.

    Raises:
      ValueError: If input size cannot be inferred from inputs via
        static shape inference.
    """
    num_proj = self._num_units if self._num_proj is None else self._num_proj
    if self._state_is_tuple:
      (c_prev, m_prev) = state
    else:
      c_prev = tf.slice(state, [0, 0], [-1, self._num_units])
      m_prev = tf.slice(state, [0, self._num_units], [-1, num_proj])

    dtype = inputs.dtype
    input_size = inputs.get_shape().with_rank(2)[1]
    if input_size.value is None:
      raise ValueError("Could not infer input size from inputs.get_shape()[-1]")
    with tf.variable_scope(scope or type(self).__name__,
                           initializer=self._initializer):  # "LSTMCell"
      w_h = tf.get_variable("W_h", [num_proj, 4 * self._num_units],
                            dtype=dtype)
      w_x = tf.get_variable("W_x", [input_size.value, 4 * self._num_units],
                            dtype=dtype)
      b = tf.get_variable(
          "B", shape=[4 * self._num_units],
          initializer=tf.zeros_initializer(), dtype=dtype)

      # i = input_gate, j = new_input, f = forget_gate, o = output_gate
      hidden_matrix = tf.matmul(m_prev, w_h)
      bn_hidden_matrix = tf.layers.batch_normalization(hidden_matrix,
                                                       training=(not self.forward_only),
                                                       name='bn_hidden_matrix', reuse=None)
      # print(tf.get_collection(tf.GraphKeys.VARIABLES, scope=scope))
      input_matrix = tf.matmul(inputs, w_x)
      bn_input_matrix = tf.layers.batch_normalization(input_matrix,
                                                      training=(not self.forward_only),
                                                      name='bn_input_matrix', reuse=None)
      lstm_matrix = tf.nn.bias_add(
          tf.add(bn_input_matrix, bn_hidden_matrix), b)
      i, j, f, o = tf.split(lstm_matrix, num_or_size_splits=4, axis=1)

      # Diagonal connections
      if self._use_peepholes:
        w_f_diag = tf.get_variable(
            "W_F_diag", shape=[self._num_units], dtype=dtype)
        w_i_diag = tf.get_variable(
            "W_I_diag", shape=[self._num_units], dtype=dtype)
        w_o_diag = tf.get_variable(
            "W_O_diag", shape=[self._num_units], dtype=dtype)

      if self._use_peepholes:
        c = (tf.sigmoid(f + self._forget_bias + w_f_diag * c_prev) * c_prev +
             tf.sigmoid(i + w_i_diag * c_prev) * self._activation(j))
      else:
        c = (tf.sigmoid(f + self._forget_bias) * c_prev + tf.sigmoid(i) *
             self._activation(j))

      if self._cell_clip is not None:
        # pylint: disable=invalid-unary-operand-type
        c = tf.clip_by_value(c, -self._cell_clip, self._cell_clip)
        # pylint: enable=invalid-unary-operand-type

      bn_c = tf.layers.batch_normalization(c,
                                           training=(not self.forward_only),
                                           name='bn_cell', reuse=None)
      if self._use_peepholes:
        m = tf.sigmoid(o + w_o_diag * bn_c) * self._activation(bn_c)
      else:
        m = tf.sigmoid(o) * self._activation(bn_c)

      if self._num_proj is not None:
        concat_w_proj = tf.nn.rnn_cell._get_concat_variable(
            "W_P", [self._num_units, self._num_proj],
            dtype, self._num_proj_shards)
        m = tf.matmul(m, concat_w_proj)

    # tf.concat takes (values, axis) from TensorFlow 1.0 onwards
    new_state = (tf.nn.rnn_cell.LSTMStateTuple(c, m) if self._state_is_tuple
                 else tf.concat([c, m], 1))
    return m, new_state
I built a sequence to sequence model and run the extra updates during training as specified in other posts.
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if extra_update_ops and not forward_only:
    outputs, extra_updates = sess.run([output_feed, extra_update_ops], input_feed)
else:
    outputs = sess.run(output_feed, input_feed)
The training loss looks very reasonable.
However, my test output is garbage. I wonder if anyone has had similar experience and knows how to resolve it.
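One thing that may be worth checking (a guess, not a confirmed diagnosis for this code) is whether the population statistics had enough updates to become meaningful. At test time the layer normalizes with moving averages that are updated as moving = momentum * moving + (1 - momentum) * batch_statistic, and tf.layers.batch_normalization defaults to momentum=0.99 with the moving mean starting at zero. The plain-Python sketch below shows how slowly such an average approaches a constant batch statistic; `moving_average_after` and the value 5.0 are hypothetical.

```python
def moving_average_after(steps, batch_stat, momentum=0.99, initial=0.0):
    """Apply moving = momentum*moving + (1-momentum)*batch_stat repeatedly."""
    moving = initial
    for _ in range(steps):
        moving = momentum * moving + (1.0 - momentum) * batch_stat
    return moving

true_mean = 5.0  # hypothetical batch mean, held constant for illustration
for steps in (10, 100, 1000):
    print(steps, moving_average_after(steps, true_mean))
```

After k updates the average has covered a fraction 1 - momentum**k of the distance to the batch statistic, so with momentum 0.99 it is still far from it after a hundred training steps. If training is short, a smaller momentum or more steps could narrow the gap between training and test behaviour.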

What does opt.apply_gradients() do in TensorFlow?

The documentation is not quite clear about this. I suppose the gradients one can obtain by opt.compute_gradients(E, [v]) contain ∂E/∂x = g(x) for each element x of the tensor that v stores. Does opt.apply_gradients(grads_and_vars) essentially execute x ← x − η·g(x), where η is the learning rate? That would imply that if I want to add a positive additive change p to the variable, I would need to change g(x) ← g(x) − (1/η)p, e.g. like this:
opt = tf.train.GradientDescentOptimizer(learning_rate=l)
grads_and_vars = opt.compute_gradients(loss, var_list)
for i, gv in enumerate(grads_and_vars):  # index renamed so it no longer shadows the learning rate l
    grads_and_vars[i] = (gv[0] - (1/l) * p, gv[1])
train_op = opt.apply_gradients(grads_and_vars)
Is there a better way to do this?
The update rule that the apply_gradients method actually applies depends on the specific optimizer. Take a look at the implementation of apply_gradients in the tf.train.Optimizer class here. It relies on the derived classes implementing the update rule in the methods _apply_dense and _apply_sparse. The update rule you are referring to is implemented by the GradientDescentOptimizer.
Regarding your desired positive additive update: If what you are calling opt is an instantiation of GradientDescentOptimizer, then you could indeed achieve what you want to do by
grads_and_vars = opt.compute_gradients(E, [v])
eta = opt._learning_rate
my_grads_and_vars = [(g-(1/eta)*p, v) for g, v in grads_and_vars]
The more elegant way to do this is probably to write a new optimizer (inheriting from tf.train.Optimizer) that implements your desired update rule directly.
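The arithmetic behind the gradient shift is easy to verify for plain gradient descent. This is a minimal scalar sketch with made-up numbers; `sgd_step` is a hypothetical helper, not a TensorFlow function.

```python
def sgd_step(x, g, eta):
    """Plain gradient descent update: x <- x - eta * g."""
    return x - eta * g

eta = 0.1   # learning rate
x = 2.0     # current variable value
g = 4.0     # gradient dE/dx at x
p = 0.5     # desired extra additive change

plain = sgd_step(x, g, eta)               # x - eta*g
shifted = sgd_step(x, g - p / eta, eta)   # x - eta*g + p

print(plain, shifted)
```

The identity only holds for this update rule; optimizers that rescale or accumulate gradients (momentum, Adam and so on) would turn the shifted gradient into a different step.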
You can also use the eager execution API.
import tensorflow as tf
tfe = tf.contrib.eager
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grad = tfe.implicit_gradients(loss)
optimizer.apply_gradients(grad(model_fn, val_list))
Here is a concrete example:
import numpy as np
import tensorflow as tf
tfe = tf.contrib.eager
W = tfe.Variable(np.random.randn())
b = tfe.Variable(np.random.randn())
def linear_regression(inputs):
    return inputs * W + b
def MSE(model_fn, inputs, labels):
    # n_samples is the number of training examples
    return tf.reduce_sum(tf.pow(model_fn(inputs) - labels, 2)) / (2 * n_samples)
optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.001)
grad = tfe.implicit_gradients(MSE)
optimizer.apply_gradients(grad(linear_regression, train_X, train_Y)) # train_X and train_Y are your input data and label