TensorFlow: How can I compute the backward pass for a given forward function?

I want to construct an L2-norm layer in Caffe style (well, I actually want to use Tensorflow in a pycaffe layer, since using CUDA to write .cu files in Caffe is an onerous task.)
Forward pass:
- input (x): n-D array
- output (y): n-D array with the same shape as the input
- operation:
y = x / sqrt(sum(x^2, axis=(0, 1)))  # channel-wise L2 normalization
class L2NormLayer:
    def __init__(self):
        self.eps = 1e-12
        self.sess = tf.Session()

    def forward(self, in_x):
        self.x = tf.constant(in_x)
        self.xp2 = tf.pow(self.x, 2)
        self.sum_xp2 = tf.reduce_sum(self.xp2, axis=(0, 1))
        self.sqrt_sum_xp2 = tf.sqrt(self.sum_xp2 + self.eps)
        self.hat = tf.div(self.x, self.sqrt_sum_xp2)
        return self.sess.run(self.hat)

    def backward(self, dl):
        # 'dl' is the gradient propagated down from the upper layer (chain rule)
        # how do I calculate this gradient automatically using Tensorflow?
        # hand-crafted backward version
        loss = tf.constant(dl)
        d_x1 = tf.div(loss, self.sqrt_sum_xp2)
        d_sqrt_sum_xp2 = tf.div(-tf.reduce_sum(self.x * dl, axis=(0, 1)),
                                (self.eps + tf.pow(self.sqrt_sum_xp2, 2)))
        d_sum_xp2 = tf.div(d_sqrt_sum_xp2, (self.eps + 2 * tf.sqrt(self.sum_xp2)))
        d_xp2 = tf.ones_like(self.xp2) * d_sum_xp2
        d_x2 = 2 * self.x * d_xp2
        d_x = d_x1 + d_x2
        return self.sess.run(d_x)
As commented in the code, how can I calculate the gradient of the forward-pass function automatically using TensorFlow?

I think your best strategy would be to use existing Caffe layers to achieve your goal.
First, use a "Reduction" layer to compute the squared L2 norm of x:
layer {
  name: "norm_x_sq"
  type: "Reduction"
  bottom: "x"
  top: "norm_x_sq"
  reduction_param { operation: SUMSQ axis: 1 }
}
Use "Power" layer to take the square root of the norm and compute its reciprocal:
layer {
  name: "norm_x-1"
  type: "Power"
  bottom: "norm_x_sq"
  top: "norm_x-1"
  power_param { power: -0.5 }
}
Once you have the denominator, you need to "Tile" it back to the same shape as x:
layer {
  name: "denom"
  type: "Tile"
  bottom: "norm_x-1"
  top: "denom"
  tile_param { axis: 1 tiles: N }  # here you'll have to manually put the target dimension N
}
Finally, use "Eltwise" layer to normalize x:
layer {
  name: "x_norm"
  type: "Eltwise"
  bottom: "x"
  bottom: "denom"
  top: "x_norm"
  eltwise_param { operation: PROD }
}
Some additional notes:
1. Dividing by the norm can be numerically unstable if the norm is very small. You might want to add a tiny constant to "norm_x_sq" before taking the reciprocal of the square root; you can do that using existing layers as well.
2. This example shows how to normalize along the axis=1 dimension. Depending on how your vectors are arranged in the blob, you might be able to use a "Scale" layer for the division instead of Tile + Eltwise.
3. You might also find this thread useful.
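If you do end up keeping the backward pass in TensorFlow instead, tf.gradients can derive it from the same forward graph automatically; the following is only a minimal TF 1.x-style sketch of that idea (the placeholder shapes, random inputs, and session handling are illustrative assumptions, not a drop-in pycaffe layer):

```python
import numpy as np
import tensorflow as tf  # assumes the TF 1.x graph/session API used in the question

eps = 1e-12
x = tf.placeholder(tf.float32, shape=[None, None, None])
dl = tf.placeholder(tf.float32, shape=[None, None, None])  # upstream gradient

# Same forward computation as above: channel-wise L2 normalization.
y = x / tf.sqrt(tf.reduce_sum(tf.square(x), axis=(0, 1)) + eps)

# tf.gradients applies the chain rule for us: given d(loss)/dy = dl, it returns d(loss)/dx.
d_x, = tf.gradients(y, x, grad_ys=dl)

with tf.Session() as sess:
    in_x = np.random.rand(4, 5, 3).astype(np.float32)
    in_dl = np.ones_like(in_x)
    print(sess.run(d_x, feed_dict={x: in_x, dl: in_dl}))
```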

Related

GradientTape for variable weighted sum of two Sequential models in TensorFlow

Suppose we want to minimize the following objective using gradient descent:
min f(alpha * v + (1 - alpha) * w)
where v and w are the weights of the two models and alpha, between 0 and 1, is the weight of the sum, resulting in the combined model v_bar or ū (here referred to as m).
alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
w_weights = tff.learning.ModelWeights.from_model(w)
v_weights = tff.learning.ModelWeights.from_model(v)
m_weights = tff.learning.ModelWeights.from_model(m)
m_weights_trainable = tf.nest.map_structure(
    lambda v, w: alpha * v + (tf.constant(1.0) - alpha) * w,
    v_weights.trainable, w_weights.trainable)
tf.nest.map_structure(lambda v, t: v.assign(t), m_weights.trainable, m_weights_trainable)
In the Adaptive Personalized Federated Learning paper, the update step for alpha suggests updating alpha based on the gradients of model m applied to a minibatch. I tried it with and without tape.watch, but it always leads to "No gradients provided for any variable":
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch([alpha])
    outputs_m = m.forward_pass(batch)
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
Some more information about the initialization of the models:
m.forward_pass(batch) is the default implementation from tff.learning.Model (found here); the model is created with tff.learning.from_keras_model from a tf.keras.Sequential model.
def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=element_spec,
        loss=tf.keras.losses.MeanSquaredError(),
        metrics=[tf.keras.metrics.MeanSquaredError(),
                 tf.keras.metrics.MeanAbsoluteError()],
    )

w = model_fn()
v = model_fn()
m = model_fn()
Some more experimenting, as suggested below by Zachary Garrett: it seems that whenever this weighted sum is calculated and the new weights are assigned to the model, the tape loses track of the trainable variables of both summed models. Again, it leads to "No gradients provided for any variable" whenever optimizer.apply_gradients(zip([grad], [alpha])) is called; all gradients are None.
with tf.GradientTape() as tape:
    alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
    m_weights_trainable = tf.nest.map_structure(
        lambda w, v: tf.math.scalar_mul(alpha, v) + tf.math.scalar_mul(tf.constant(1.0) - alpha, w),
        w.trainable,
        v.trainable)
    m_weights = tff.learning.ModelWeights.from_model(m)
    tf.nest.map_structure(lambda v, t: v.assign(t), m_weights.trainable,
                          m_weights_trainable)
    outputs_m = m.forward_pass(batch)
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
Another edit:
I think I have a strategy that gets it working, but it feels like bad practice: I set the layer attributes by hand, since assigning trainable_weights or _trainable_weights directly does not work. Any tips on improving this?
def do_weighted_combination():
    def _mapper(target_layer, v_layer, w_layer):
        target_layer.kernel = v_layer.kernel * alpha + w_layer.kernel * (1 - alpha)
        target_layer.bias = v_layer.bias * alpha + w_layer.bias * (1 - alpha)
    tf.nest.map_structure(_mapper, m.layers, v.layers, w.layers)

with tf.GradientTape(persistent=True) as tape:
    do_weighted_combination()
    predictions = m(x_data)
    loss = m.compiled_loss(y_data, predictions)

g1 = tape.gradient(loss, v.trainable_weights)  # not None
g2 = tape.gradient(loss, alpha)  # not None
For TensorFlow auto-differentiation using tf.GradientTape, operations must occur within the tf.GradientTape Python context manager so that TensorFlow can "see" them.
Possibly what is happening here is that alpha is used outside/before the tape context, when setting the model variables. Then when m.forward_pass is called, TensorFlow doesn't see any access to alpha and thus can't compute a gradient for it (instead returning None).
Moving the alpha*v + (tf.constant(1.0) - alpha)*w logic inside the tf.GradientTape context manager (possibly inside m.forward_pass) may be a solution.
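Here is a minimal sketch of that idea outside of TFF (the single kernel variables, random data, and SGD optimizer below are stand-ins I'm assuming, not the original models): as long as the weighted combination involving alpha happens under the tape, tape.gradient(loss, alpha) is no longer None.

```python
import tensorflow as tf

# Stand-ins for the trainable weights of models v and w (illustrative only).
v_kernel = tf.Variable(tf.random.normal([4, 1]))
w_kernel = tf.Variable(tf.random.normal([4, 1]))
alpha = tf.Variable(0.5, constraint=lambda t: tf.clip_by_value(t, 0.0, 1.0))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

x_batch = tf.random.normal([8, 4])
y_batch = tf.random.normal([8, 1])

with tf.GradientTape() as tape:
    # The weighted combination is computed *inside* the tape, so the tape
    # records how alpha enters the forward pass.
    mixed_kernel = alpha * v_kernel + (1.0 - alpha) * w_kernel
    predictions = tf.matmul(x_batch, mixed_kernel)
    loss = tf.reduce_mean(tf.square(y_batch - predictions))

grad = tape.gradient(loss, alpha)           # a real tensor, not None
optimizer.apply_gradients([(grad, alpha)])
```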

Implementing backpropagation gradient descent using scipy.optimize.minimize

I am trying to train an autoencoder NN (3 layers - 2 visible, 1 hidden) using numpy and scipy on the MNIST digit images dataset. The implementation is based on the notation given here. Below is my code:
import numpy

def autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, data):
    """
    The input theta is a 1-dimensional array because scipy.optimize.minimize expects
    the parameters being optimized to be a 1d array.
    First convert theta from a 1d array to the (W1, W2, b1, b2)
    matrix/vector format, so that this follows the notation convention of the
    lecture notes and tutorial.
    You must compute the:
        cost : scalar representing the overall cost J(theta)
        grad : array representing the corresponding gradient of each element of theta
    """
    training_size = data.shape[1]
    # unroll theta to get (W1, W2, b1, b2)
    W1 = theta[0:hidden_size*visible_size]
    W1 = W1.reshape(hidden_size, visible_size)
    W2 = theta[hidden_size*visible_size:2*hidden_size*visible_size]
    W2 = W2.reshape(visible_size, hidden_size)
    b1 = theta[2*hidden_size*visible_size:2*hidden_size*visible_size + hidden_size]
    b2 = theta[2*hidden_size*visible_size + hidden_size:2*hidden_size*visible_size + hidden_size + visible_size]
    # feedforward pass
    a_l1 = data
    z_l2 = W1.dot(a_l1) + numpy.tile(b1, (training_size, 1)).T
    a_l2 = sigmoid(z_l2)
    z_l3 = W2.dot(a_l2) + numpy.tile(b2, (training_size, 1)).T
    a_l3 = sigmoid(z_l3)
    # backprop
    delta_l3 = numpy.multiply(-(data - a_l3), numpy.multiply(a_l3, 1 - a_l3))
    delta_l2 = numpy.multiply(W2.T.dot(delta_l3),
                              numpy.multiply(a_l2, 1 - a_l2))
    b2_derivative = numpy.sum(delta_l3, axis=1) / training_size
    b1_derivative = numpy.sum(delta_l2, axis=1) / training_size
    W2_derivative = numpy.dot(delta_l3, a_l2.T) / training_size + lambda_*W2
    W1_derivative = numpy.dot(delta_l2, a_l1.T) / training_size + lambda_*W1
    W1_derivative = W1_derivative.reshape(hidden_size*visible_size)
    W2_derivative = W2_derivative.reshape(visible_size*hidden_size)
    b1_derivative = b1_derivative.reshape(hidden_size)
    b2_derivative = b2_derivative.reshape(visible_size)
    grad = numpy.concatenate((W1_derivative, W2_derivative, b1_derivative, b2_derivative))
    cost = 0.5*numpy.sum((data - a_l3)**2)/training_size + 0.5*lambda_*(numpy.sum(W1**2) + numpy.sum(W2**2))
    return cost, grad
I have also implemented a function to estimate the numerical gradient and verify the correctness of my implementation (below).
def compute_gradient_numerical_estimate(J, theta, epsilon=0.0001):
    """
    :param J: a loss (cost) function that computes the real-valued loss given parameters and data
    :param theta: array of parameters
    :param epsilon: amount to vary each parameter in order to estimate
                    the gradient by numerical difference
    :return: array of numerical gradient estimate
    """
    gradient = numpy.zeros(theta.shape)
    eps_vector = numpy.zeros(theta.shape)
    for i in range(0, theta.size):
        eps_vector[i] = epsilon
        cost1, grad1 = J(theta + eps_vector)
        cost2, grad2 = J(theta - eps_vector)
        gradient[i] = (cost1 - cost2) / (2 * epsilon)
        eps_vector[i] = 0
    return gradient
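For reference, here is a small self-contained check of the two functions above; the sizes, random data, and sigmoid definition are placeholder assumptions rather than the original setup:

```python
import numpy

def sigmoid(z):
    # logistic activation assumed by autoencoder_cost_and_grad above
    return 1.0 / (1.0 + numpy.exp(-z))

visible_size, hidden_size, lambda_ = 8, 4, 1e-4
data = numpy.random.rand(visible_size, 20)              # 20 random toy "examples"
n_params = 2 * hidden_size * visible_size + hidden_size + visible_size
theta = numpy.random.uniform(-0.1, 0.1, n_params)

J = lambda t: autoencoder_cost_and_grad(t, visible_size, hidden_size, lambda_, data)
cost, analytic_grad = J(theta)
numeric_grad = compute_gradient_numerical_estimate(J, theta)
print(numpy.linalg.norm(numeric_grad - analytic_grad))  # should be tiny, around 1e-9
```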
The norm of the difference between the numerical estimate and the one computed by the function is around 6.87165125021e-09, which seems acceptable. My main problem is getting the "L-BFGS-B" algorithm to work through the scipy.optimize.minimize function, as below:
# theta is the 1-D array of (W1, W2, b1, b2)
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
options_ = {'maxiter': 4000, 'disp': False}
result = scipy.optimize.minimize(J, theta, method='L-BFGS-B', jac=True, options=options_)
I get the below output from this:
scipy.optimize.minimize() details:
      fun: 90.802022224079778
 hess_inv: <16474x16474 LbfgsInvHessProduct with dtype=float64>
      jac: array([ -6.83667742e-06,  -2.74886002e-06,  -3.23531941e-06, ...,
                    1.22425735e-01,   1.23425062e-01,   1.28091250e-01])
  message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
     nfev: 21
      nit: 0
   status: 2
  success: False
        x: array([-0.06836677, -0.0274886 , -0.03235319, ...,  0.        ,
                   0.        ,  0.        ])
Now, this post seems to indicate that the error could mean the gradient implementation is wrong, but my numerical estimate seems to confirm that it is correct. I have tried varying the initial weights by using a uniform distribution as specified here, but the problem still persists. Is there anything wrong with my backprop implementation?
Turns out the issue was a (very silly) mistake in this line:
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
The lambda takes a parameter x but never uses it, so the same initial theta array was being passed every time J was invoked.
This fixed it:
J = lambda x: utils.autoencoder_cost_and_grad(x, visible_size, hidden_size, lambda_, patches_train)

Tensorflow symmetric matrix

I want to create an n*n symmetric matrix and train it in TensorFlow. Effectively, I should only train n*(n+1)/2 parameters. How should I do this?
I saw some previous threads which suggest doing the following:
X = tf.Variable(tf.random_uniform([d,d], minval=-.1, maxval=.1, dtype=tf.float64))
X_symm = 0.5 * (X + tf.transpose(X))
However, this means I have to train n*n variables, not n*(n+1)/2 variables.
Even if there is no built-in function to achieve this, a snippet of self-written code would help!
Thanks!
You can use tf.matrix_band_part(input, 0, -1) to create an upper triangular matrix from a square one, so this code lets you effectively train only n*(n+1)/2 values even though it creates n*n variables:
X = tf.Variable(tf.random_uniform([d,d], minval=-.1, maxval=.1, dtype=tf.float64))
X_upper = tf.matrix_band_part(X, 0, -1)
X_symm = 0.5 * (X_upper + tf.transpose(X_upper))
Referring to gdelab's answer: in TensorFlow 2.x, you have to use the following code instead.
X_upper = tf.linalg.band_part(X, 0, -1)
gdelab's answer is correct and will work, since a neural network can adjust the 0.5 factor by itself. I aimed for a solution where the neural network actually has only (n+1)*n/2 output neurons. The following function transforms these into a symmetric matrix:
def create_symmetric_matrix(x, n):
    x_rev = tf.reverse(x[:, n:], [1])
    xc = tf.concat([x, x_rev], axis=1)
    x_res = tf.reshape(xc, [-1, n, n])
    x_upper_triangular = tf.linalg.band_part(x_res, 0, -1)
    x_lower_triangular = tf.linalg.set_diag(
        tf.transpose(x_upper_triangular, perm=[0, 2, 1]),
        tf.zeros([tf.shape(x)[0], n], dtype=tf.float32))
    return x_upper_triangular + x_lower_triangular
where x is a tensor of shape [batch, n*(n+1)/2] and n is the size (number of rows) of the output matrix.
The code is inspired by tfp.math.fill_triangular.
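As a quick sanity check of the function above (the sizes here are arbitrary assumptions): for n = 3, each row of 6 free parameters should map to a 3x3 matrix that equals its own transpose.

```python
import tensorflow as tf

n, batch = 3, 2
x = tf.random.normal([batch, n * (n + 1) // 2])   # 6 free parameters per matrix
sym = create_symmetric_matrix(x, n)               # shape [batch, 3, 3]

# The result should equal its own transpose.
tf.debugging.assert_near(sym, tf.transpose(sym, perm=[0, 2, 1]))
print(sym.numpy())
```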

Using numpy roll in Keras

I'm trying to write a custom regularizer in Keras, and I need to be able to roll the coefficient array.
I know this may be impossible; however, any mechanism that can replicate this roll function would be greatly appreciated.
```
def __call__(self, x):
    regularization = 0.
    # Add components if they are given
    if self.l1:
        # \lambda ||x||
        regularization += self.l1 * K.sum(K.abs(x))
    if self.fuse:
        # \lambda \sum{ |x - x_+1| }
        regularization += self.fuse * K.sum(K.abs(x - np.roll(x, 1)))
    if self.abs_fuse:
        # \lambda \sum{ ||x| - |x_+1|| }
        regularization += self.abs_fuse * K.sum(K.abs(K.abs(x) - K.abs(np.roll(x, 1))))
    return regularization
```
Given that x is of shape (m, 1), a possible solution is to use tile:
def roll_reg(x):
    length = K.int_shape(x)[0]
    x_tile = K.tile(x, [2, 1])
    x_roll = x_tile[length - 1:-1]
    return K.sum(K.abs(x - x_roll))
It will result in some extra memory usage, but if x is a 1-dim vector, I guess the overhead won't be too bad.
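If a recent TensorFlow backend is available, tf.roll mirrors numpy.roll directly and avoids the tiling; here is a minimal sketch of the fused term with it (the fuse coefficient and the tiny example tensor are made up, and x is assumed to be a rank-1 or (m, 1) tensor):

```python
import tensorflow as tf

def fused_penalty(x, fuse=0.01):
    # tf.roll behaves like numpy.roll: shift the entries by 1 along axis 0.
    rolled = tf.roll(x, shift=1, axis=0)
    return fuse * tf.reduce_sum(tf.abs(x - rolled))

x = tf.constant([[1.0], [2.0], [4.0]])
print(fused_penalty(x).numpy())  # (|1-4| + |2-1| + |4-2|) * 0.01 = 0.06
```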

How does Tensorflow Batch Normalization work?

I'm successfully using TensorFlow batch normalization in my deep neural network. I'm doing it the following way:
if apply_bn:
    with tf.variable_scope('bn'):
        beta = tf.Variable(tf.constant(0.0, shape=[out_size]), name='beta', trainable=True)
        gamma = tf.Variable(tf.constant(1.0, shape=[out_size]), name='gamma', trainable=True)
        batch_mean, batch_var = tf.nn.moments(z, [0], name='moments')
        ema = tf.train.ExponentialMovingAverage(decay=0.5)

        def mean_var_with_update():
            ema_apply_op = ema.apply([batch_mean, batch_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(batch_mean), tf.identity(batch_var)

        mean, var = tf.cond(self.phase_train,
                            mean_var_with_update,
                            lambda: (ema.average(batch_mean), ema.average(batch_var)))
        self.z_prebn.append(z)
        z = tf.nn.batch_normalization(z, mean, var, beta, gamma, 1e-3)
        self.z.append(z)
        self.bn.append((mean, var, beta, gamma))
And it works fine for both the training and testing phases.
However, I run into problems when I try to use the computed network parameters in another project, where I need to compute all the matrix multiplications and so on by myself. The problem is that I can't reproduce the behavior of the tf.nn.batch_normalization function:
feed_dict = {
    self.tf_x: np.array([range(self.x_cnt)]) / 100,
    self.keep_prob: 1,
    self.phase_train: False
}
for i in range(len(self.z)):
    # print element [0][1] of each tensor for layer i
    print(self.sess.run([
        self.z_prebn[i][0][1],  # before bn
        self.bn[i][0][1],       # mean
        self.bn[i][1][1],       # var
        self.bn[i][2][1],       # offset
        self.bn[i][3][1],       # scale
        self.z[i][0][1],        # after bn
    ], feed_dict=feed_dict))
# prints
# [-0.077417567, -0.089603029, 0.000436493, -0.016652612, 1.0055743, 0.30664611]
According to the formula on the page https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/nn/batch_normalization:
bn = scale * (x - mean) / (sqrt(var) + 1e-3) + offset
But as we can see,
1.0055743 * (-0.077417567 - -0.089603029)/(0.000436493^0.5 + 1e-3) + -0.016652612
= 0.543057
This differs from the value 0.30664611 computed by TensorFlow itself.
So what am I doing wrong here, and why can't I just calculate the batch-normalized value myself?
Thanks in advance!
The formula used is slightly different from:
bn = scale * (x - mean) / (sqrt(var) + 1e-3) + offset
It should be:
bn = scale * (x - mean) / (sqrt(var + 1e-3)) + offset
The variance_epsilon argument is added to the variance, not to sigma, which is the square root of the variance.
After the correction, the formula yields the correct value:
1.0055743 * (-0.077417567 - -0.089603029)/((0.000436493 + 1e-3)**0.5) + -0.016652612
# 0.30664642276945747
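A quick numpy re-check of both variants with the values printed above (a sanity check only, not TensorFlow code) shows that only the corrected formula reproduces TensorFlow's output:

```python
import numpy as np

x, mean, var = -0.077417567, -0.089603029, 0.000436493
offset, scale, eps = -0.016652612, 1.0055743, 1e-3

wrong = scale * (x - mean) / (np.sqrt(var) + eps) + offset  # ~0.5431 (epsilon added to sigma)
right = scale * (x - mean) / np.sqrt(var + eps) + offset    # ~0.3066 (epsilon added to the variance)
print(wrong, right)  # the second value matches TensorFlow's 0.30664611
```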