Checking the gradient when doing gradient descent - optimization

I'm trying to implement a feed-forward backpropagating autoencoder (training with gradient descent) and wanted to verify that I'm calculating the gradient correctly. This tutorial suggests calculating the derivative of each parameter one at a time: grad_i(theta) = (J(theta_i+epsilon) - J(theta_i-epsilon)) / (2*epsilon). I've written a sample piece of code in Matlab to do just this, but without much luck: the differences between the gradient calculated from the derivative and the gradient found numerically tend to be large (they agree to far fewer than 4 significant figures).
If anyone can offer any suggestions, I would greatly appreciate the help (either in my calculation of the gradient or how I perform the check). Because I've simplified the code greatly to make it more readable, I haven't included any biases, and am no longer tying the weight matrices.
First, I initialize the variables:
numHidden = 200;
numVisible = 784;
low = -4*sqrt(6./(numHidden + numVisible));
high = 4*sqrt(6./(numHidden + numVisible));
encoder = low + (high-low)*rand(numVisible, numHidden);
decoder = low + (high-low)*rand(numHidden, numVisible);
Next, given some input image x, do feed-forward propagation:
a = sigmoid(x*encoder);
z = sigmoid(a*decoder); % (reconstruction of x)
The loss function I'm using is the standard Σ(0.5*(z - x)^2):
% first calculate the error by finding the derivative of sum(0.5*(z-x).^2),
% which is (f(h)-x)*f'(h), where z = f(h), h = a*decoder, and
% f = sigmoid(x). However, since the derivative of the sigmoid is
% sigmoid*(1 - sigmoid), we get:
error_0 = (z - x).*z.*(1-z);
% The gradient \Delta w_{ji} = error_j*a_i
gDecoder = error_0'*a;
% not important, but included for completeness
% do back-propagation one layer down
error_1 = (error_0*encoder).*a.*(1-a);
gEncoder = error_1'*x;
And finally, check that the gradient is correct (in this case, just do it for the decoder):
epsilon = 10e-5;
check = gDecoder(:); % the values we obtained above
for i = 1:size(decoder(:), 1)
    % calculate J+
    theta = decoder(:); % unroll
    theta(i) = theta(i) + epsilon;
    decoderp = reshape(theta, size(decoder)); % re-roll
    a = sigmoid(x*encoder);
    z = sigmoid(a*decoderp);
    Jp = sum(0.5*(z - x).^2);
    % calculate J-
    theta = decoder(:);
    theta(i) = theta(i) - epsilon;
    decoderp = reshape(theta, size(decoder));
    a = sigmoid(x*encoder);
    z = sigmoid(a*decoderp);
    Jm = sum(0.5*(z - x).^2);
    grad_i = (Jp - Jm) / (2*epsilon);
    diff = abs(grad_i - check(i));
    fprintf('%d: %f <=> %f: %f\n', i, grad_i, check(i), diff);
end
Running this on the MNIST dataset (for the first entry) gives results such as:
2: 0.093885 <=> 0.028398: 0.065487
3: 0.066285 <=> 0.031096: 0.035189
5: 0.053074 <=> 0.019839: 0.033235
6: 0.108249 <=> 0.042407: 0.065843
7: 0.091576 <=> 0.009014: 0.082562

Do not apply the sigmoid to both a and z. Just use it on z.
a = x*encoder;
z = sigmoid(a*decoderp);
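For comparison, here is a minimal NumPy sketch of the same kind of check (it is not the original Matlab code). It keeps the sigmoid on both layers, uses tiny layer sizes and random data instead of MNIST, and assumes the analytic decoder gradient should be laid out with the same shape as decoder, i.e. a' times the output delta. All function names here are made up for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(x, encoder, decoder):
    # J = sum(0.5 * (z - x)^2) for a single row-vector input x
    a = sigmoid(x @ encoder)
    z = sigmoid(a @ decoder)
    return 0.5 * np.sum((z - x) ** 2)

def decoder_grad(x, encoder, decoder):
    # Analytic gradient of J with respect to decoder (assumed layout: same shape as decoder)
    a = sigmoid(x @ encoder)
    z = sigmoid(a @ decoder)
    delta = (z - x) * z * (1 - z)        # shape (1, num_visible)
    return a.T @ delta                   # shape (num_hidden, num_visible)

def numeric_decoder_grad(x, encoder, decoder, eps=1e-5):
    # Central difference, perturbing one decoder entry at a time
    grad = np.zeros_like(decoder)
    for idx in np.ndindex(decoder.shape):
        original = decoder[idx]
        decoder[idx] = original + eps
        j_plus = loss(x, encoder, decoder)
        decoder[idx] = original - eps
        j_minus = loss(x, encoder, decoder)
        decoder[idx] = original          # restore the entry
        grad[idx] = (j_plus - j_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
num_visible, num_hidden = 6, 4           # illustrative sizes, not 784 and 200
x = rng.random((1, num_visible))
encoder = rng.uniform(-0.5, 0.5, (num_visible, num_hidden))
decoder = rng.uniform(-0.5, 0.5, (num_hidden, num_visible))

analytic = decoder_grad(x, encoder, decoder)
numeric = numeric_decoder_grad(x, encoder, decoder)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

In this sketch the two gradients are compared entry by entry in the same layout. If an analytic gradient is stored transposed relative to the weight matrix, unrolling both with (:) pairs up the wrong entries, so the shapes may be worth checking before comparing.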

Related

Using Nesterov's accelerated gradient on the Ackley function

I am working on a project where I want to use Nesterov's accelerated gradient method on the Ackley function below
to go from the initial point (25, 20) to within a distance of (2e-4, 5e-4) of the global minimizer and minimum, respectively.
def nag(func, x, lr, num_iters, jac, tol, callback, gamma=0.9, *args, **kwargs):
    vals = [func(x)]
    opt_res = OptimizeResult()
    update = np.zeros(x.size)
    for i in range(1, num_iters+1):
        grad = jac(x - gamma * update)
        prev_x = x
        update = gamma * update + lr(i) * grad
        x = x - update
        vals.append(func(x))
        callback(x)
        if np.linalg.norm(x-prev_x) <= tol:
            break
    opt_res.x = x
    opt_res.nit = i
    return opt_res, np.array(vals, dtype=object)
Using gamma = 0.2, num_iters = 5000, and a customized learning rate function
def lr_(t):
    if t < 5:
        return 1e-4
    elif t < 10:
        return 1e-2
    else:
        return 0.1
I was able to get around (0.0008 and 0.0022), but couldn't get closer to the desired global minimizer and minimum as specified despite playing around with different values for a long time. Does anyone know what I could try so that I can get closer to the desired result?
Or are there other optimization methods that would work better than NAG? I heard Adam or Adagrad should work, but I haven't had much success with them.
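To isolate the update rule from the Ackley function itself, here is a small sketch of the same look-ahead step on a convex quadratic, f(x) = 0.5*||x||^2, whose gradient is simply x. The function name nag_minimize, the constant learning rate, and the test function are assumptions for illustration; the result says nothing about how the method behaves on Ackley, which has many local minima.

```python
import numpy as np

def nag_minimize(jac, x0, lr=0.1, gamma=0.9, num_iters=500):
    """Nesterov-style update: evaluate the gradient at the look-ahead point."""
    x = np.asarray(x0, dtype=float)
    update = np.zeros_like(x)
    for _ in range(num_iters):
        grad = jac(x - gamma * update)       # gradient at the look-ahead position
        update = gamma * update + lr * grad  # momentum-weighted step
        x = x - update
    return x

def quad_jac(x):
    # Gradient of f(x) = 0.5 * ||x||^2
    return x

x_final = nag_minimize(quad_jac, [25.0, 20.0])
print(np.linalg.norm(x_final))
```

On a well-conditioned convex problem like this, the iterates spiral in to the minimizer at the origin, which can serve as a sanity check that the update loop itself is implemented as intended.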

LSTM from scratch in tensorflow 2

I'm trying to write an LSTM in TensorFlow 2.1 from scratch, without using the one already supplied with Keras (tf.keras.layers.LSTM), just to learn and code something. To do so, I've defined a class "Model" that, when called (as in model(input)), computes the matrix multiplications of the LSTM. I'm pasting part of my code here; the other parts are on GitHub (link)
class Model(object):
    [...]
    def __call__(self, inputs):
        assert inputs.shape == (vocab_size, T_steps)
        outputs = []
        for time_step in range(T_steps):
            x = inputs[:,time_step]
            x = tf.expand_dims(x,axis=1)
            z = tf.concat([self.h_prev,x],axis=0)
            f = tf.matmul(self.W_f, z) + self.b_f
            f = tf.sigmoid(f)
            i = tf.matmul(self.W_i, z) + self.b_i
            i = tf.sigmoid(i)
            o = tf.matmul(self.W_o, z) + self.b_o
            o = tf.sigmoid(o)
            C_bar = tf.matmul(self.W_C, z) + self.b_C
            C_bar = tf.tanh(C_bar)
            C = (f * self.C_prev) + (i * C_bar)
            h = o * tf.tanh(C)
            v = tf.matmul(self.W_v, h) + self.b_v
            v = tf.sigmoid(v)
            y = tf.math.softmax(v, axis=0)
            self.h_prev = h
            self.C_prev = C
            outputs.append(y)
        outputs = tf.squeeze(tf.stack(outputs,axis=1))
        return outputs
But this neural network has three problems:
1) it is way slow during training. In comparison a model that uses tf.keras.layers.LSTM() is trained more than 10 times faster. Why is this? Maybe because I didn't use a minibatch training, but a stochastic one?
2) the NN seems to not learn anything at all. After just some (very few!) training examples, the loss seems to settle down and it won't decrease anymore, but rather it oscillates around the reached value. After training, I tested the NN by making it generate some text, but it just outputs nonsense gibberish. Why isn't it learning anything?
3) the loss function outputs very high values. I've coded a categorical cross-entropy loss function but, with 100 characters long sequence, the value of the function is over 370 per training example. Shouldn't it be way lower than this?
I've written the loss function like this:
def compute_loss(predictions, desired_outputs):
    l = 0
    for i in range(T_steps):
        l -= tf.math.log(predictions[desired_outputs[i], i])
    return l
I know they're open questions, but unfortunately I can't make it work. So any answer, even a short one that helps me solve the problem myself, is fine :)
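On the scale of the loss in point 3: compute_loss sums the negative log-likelihood over all time steps, so a useful reference is what a model that guesses uniformly over the vocabulary would score. The sketch below works that out. The vocabulary size is not stated in the question, so the value 40 is purely hypothetical.

```python
import math

T_steps = 100      # sequence length quoted in the question
vocab_size = 40    # hypothetical value; the real vocab_size is not given

# A uniform prediction assigns probability 1/vocab_size to the correct character,
# so each time step contributes -log(1/vocab_size) = log(vocab_size) nats.
per_step = -math.log(1.0 / vocab_size)
total = T_steps * per_step

print(round(per_step, 3), round(total, 1))  # 3.689 368.9
```

Under that assumption, a summed loss near 370 over 100 steps is about 3.7 nats per character, roughly what uniform guessing over a few dozen symbols would give. Dividing the sum by T_steps makes the number easier to compare across sequence lengths.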

Implementing backpropagation gradient descent using scipy.optimize.minimize

I am trying to train an autoencoder NN (3 layers - 2 visible, 1 hidden) using numpy and scipy for the MNIST digits images dataset. The implementation is based on the notation given here. Below is my code:
def autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, data):
    """
    The input theta is a 1-dimensional array because scipy.optimize.minimize expects
    the parameters being optimized to be a 1d array.
    First convert theta from a 1d array to the (W1, W2, b1, b2)
    matrix/vector format, so that this follows the notation convention of the
    lecture notes and tutorial.
    You must compute the:
    cost : scalar representing the overall cost J(theta)
    grad : array representing the corresponding gradient of each element of theta
    """
    training_size = data.shape[1]
    # unroll theta to get (W1,W2,b1,b2) #
    W1 = theta[0:hidden_size*visible_size]
    W1 = W1.reshape(hidden_size,visible_size)
    W2 = theta[hidden_size*visible_size:2*hidden_size*visible_size]
    W2 = W2.reshape(visible_size,hidden_size)
    b1 = theta[2*hidden_size*visible_size:2*hidden_size*visible_size + hidden_size]
    b2 = theta[2*hidden_size*visible_size + hidden_size: 2*hidden_size*visible_size + hidden_size + visible_size]
    #feedforward pass
    a_l1 = data
    z_l2 = W1.dot(a_l1) + numpy.tile(b1,(training_size,1)).T
    a_l2 = sigmoid(z_l2)
    z_l3 = W2.dot(a_l2) + numpy.tile(b2,(training_size,1)).T
    a_l3 = sigmoid(z_l3)
    #backprop
    delta_l3 = numpy.multiply(-(data-a_l3),numpy.multiply(a_l3,1-a_l3))
    delta_l2 = numpy.multiply(W2.T.dot(delta_l3),
                              numpy.multiply(a_l2, 1 - a_l2))
    b2_derivative = numpy.sum(delta_l3,axis=1)/training_size
    b1_derivative = numpy.sum(delta_l2,axis=1)/training_size
    W2_derivative = numpy.dot(delta_l3,a_l2.T)/training_size + lambda_*W2
    #print(W2_derivative.shape)
    W1_derivative = numpy.dot(delta_l2,a_l1.T)/training_size + lambda_*W1
    W1_derivative = W1_derivative.reshape(hidden_size*visible_size)
    W2_derivative = W2_derivative.reshape(visible_size*hidden_size)
    b1_derivative = b1_derivative.reshape(hidden_size)
    b2_derivative = b2_derivative.reshape(visible_size)
    grad = numpy.concatenate((W1_derivative,W2_derivative,b1_derivative,b2_derivative))
    cost = 0.5*numpy.sum((data-a_l3)**2)/training_size + 0.5*lambda_*(numpy.sum(W1**2) + numpy.sum(W2**2))
    return cost,grad
I have also implemented a function to estimate the numerical gradient and verify the correctness of my implementation (below).
def compute_gradient_numerical_estimate(J, theta, epsilon=0.0001):
    """
    :param J: a loss (cost) function that computes the real-valued loss given parameters and data
    :param theta: array of parameters
    :param epsilon: amount to vary each parameter in order to estimate
    the gradient by numerical difference
    :return: array of numerical gradient estimate
    """
    gradient = numpy.zeros(theta.shape)
    eps_vector = numpy.zeros(theta.shape)
    for i in range(0,theta.size):
        eps_vector[i] = epsilon
        cost1,grad1 = J(theta+eps_vector)
        cost2,grad2 = J(theta-eps_vector)
        gradient[i] = (cost1 - cost2)/(2*epsilon)
        eps_vector[i] = 0
    return gradient
The norm of the difference between the numerical estimate and the one computed by the function is around 6.87165125021e-09, which seems to be acceptable. My main problem seems to be getting the optimization algorithm "L-BFGS-B" working using the scipy.optimize.minimize function as below:
# theta is the 1-D array of(W1,W2,b1,b2)
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
options_ = {'maxiter': 4000, 'disp': False}
result = scipy.optimize.minimize(J, theta, method='L-BFGS-B', jac=True, options=options_)
I get the below output from this:
scipy.optimize.minimize() details:
fun: 90.802022224079778
hess_inv: <16474x16474 LbfgsInvHessProduct with dtype=float64>
jac: array([ -6.83667742e-06, -2.74886002e-06, -3.23531941e-06, ...,
1.22425735e-01, 1.23425062e-01, 1.28091250e-01])
message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
nfev: 21
nit: 0
status: 2
success: False
x: array([-0.06836677, -0.0274886 , -0.03235319, ..., 0. ,
0. , 0. ])
Now, this post seems to indicate that the error could mean that the gradient function implementation could be wrong? But my numerical gradient estimate seems to confirm that my implementation is correct. I have tried varying the initial weights by using a uniform distribution as specified here but the problem still persists. Is there anything wrong with my backprop implementation?
Turns out the issue was a (very silly) mistake in this line:
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
I never use the lambda parameter x in the function call. So the parameter array supplied by the optimizer was never being passed on when J was invoked; the same initial theta was used every time.
This fixed it:
J = lambda x: utils.autoencoder_cost_and_grad(x, visible_size, hidden_size, lambda_, patches_train)
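The effect of that one-character difference can be reproduced with a toy cost function. This is only an illustrative sketch: cost_and_grad and theta0 are made-up stand-ins for autoencoder_cost_and_grad and the initial theta.

```python
def cost_and_grad(theta):
    # Toy cost J(theta) = theta^2 with gradient 2*theta
    return theta ** 2, 2 * theta

theta0 = 3.0

# Buggy wrapper: ignores its argument and always evaluates at theta0
J_buggy = lambda x: cost_and_grad(theta0)
# Fixed wrapper: evaluates at whatever point the optimizer asks for
J_fixed = lambda x: cost_and_grad(x)

trial_points = (3.0, 2.0, 1.0)
buggy_costs = [J_buggy(t)[0] for t in trial_points]
fixed_costs = [J_fixed(t)[0] for t in trial_points]

print(buggy_costs)  # [9.0, 9.0, 9.0]
print(fixed_costs)  # [9.0, 4.0, 1.0]
```

With the buggy wrapper the cost never changes no matter which point is tried, so a line search cannot find a decrease. That is consistent with the ABNORMAL_TERMINATION_IN_LNSRCH message and nit: 0 in the output above.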

TensorFlow loss function zeroes out after first epoch

I am trying to implement a discriminative loss function for instance segmentation of images based on this paper: https://arxiv.org/pdf/1708.02551.pdf (This link is just for the readers' reference; I don't expect anyone to read it to help me out!)
My problem: Once I move from a simple loss function to a more complicated one (like you see in the attached code snippet), the loss function zeroes out after the first epoch. I checked the weights, and almost all of them seem to hover closely around -300. They are not exactly identical, but very close to each other (differing only in the decimal places).
Relevant code that implements the discriminative loss function:
def regDLF(y_true, y_pred):
    global alpha
    global beta
    global gamma
    global delta_v
    global delta_d
    global image_height
    global image_width
    global nDim
    y_true = tf.reshape(y_true, [image_height*image_width])
    X = tf.reshape(y_pred, [image_height*image_width, nDim])
    uniqueLabels, uniqueInd = tf.unique(y_true)
    numUnique = tf.size(uniqueLabels)
    Sigma = tf.unsorted_segment_sum(X, uniqueInd, numUnique)
    ones_Sigma = tf.ones((tf.shape(X)[0], 1))
    ones_Sigma = tf.unsorted_segment_sum(ones_Sigma,uniqueInd, numUnique)
    mu = tf.divide(Sigma, ones_Sigma)
    Lreg = tf.reduce_mean(tf.norm(mu, axis = 1))
    T = tf.norm(tf.subtract(tf.gather(mu, uniqueInd), X), axis = 1)
    T = tf.divide(T, Lreg)
    T = tf.subtract(T, delta_v)
    T = tf.clip_by_value(T, 0, T)
    T = tf.square(T)
    ones_Sigma = tf.ones_like(uniqueInd, dtype = tf.float32)
    ones_Sigma = tf.unsorted_segment_sum(ones_Sigma,uniqueInd, numUnique)
    clusterSigma = tf.unsorted_segment_sum(T, uniqueInd, numUnique)
    clusterSigma = tf.divide(clusterSigma, ones_Sigma)
    Lvar = tf.reduce_mean(clusterSigma, axis = 0)
    mu_interleaved_rep = tf.tile(mu, [numUnique, 1])
    mu_band_rep = tf.tile(mu, [1, numUnique])
    mu_band_rep = tf.reshape(mu_band_rep, (numUnique*numUnique, nDim))
    mu_diff = tf.subtract(mu_band_rep, mu_interleaved_rep)
    mu_diff = tf.norm(mu_diff, axis = 1)
    mu_diff = tf.divide(mu_diff, Lreg)
    mu_diff = tf.subtract(2*delta_d, mu_diff)
    mu_diff = tf.clip_by_value(mu_diff, 0, mu_diff)
    mu_diff = tf.square(mu_diff)
    numUniqueF = tf.cast(numUnique, tf.float32)
    Ldist = tf.reduce_mean(mu_diff)
    L = alpha * Lvar + beta * Ldist + gamma * Lreg
    return L
Question: I know it's hard to understand what the code does without reading the paper, but I have a couple of questions:
Is there something glaringly wrong with the loss function defined above?
Does anyone have a general idea as to why the loss function could zero out after the first epoch?
Thank you very much for your time and help!
I think your problem comes from tf.norm, which is not numerically safe (zero vectors somewhere in the input lead to nan in its gradients).
It would be better to replace tf.norm by this custom function:
def tf_norm(inputs, axis=1, epsilon=1e-7, name='safe_norm'):
    squared_norm = tf.reduce_sum(tf.square(inputs), axis=axis, keep_dims=True)
    safe_norm = tf.sqrt(squared_norm+epsilon)
    return tf.identity(safe_norm, name=name)
In your Ldist calculation you use tf.tile and tf.reshape to find the distance between different cluster means in the following manner (suppose we have three clusters):
mu_1 - mu_1
mu_2 - mu_1
mu_3 - mu_1
mu_1 - mu_2
mu_2 - mu_2
mu_3 - mu_2
mu_1 - mu_3
mu_2 - mu_3
mu_3 - mu_3
The problem is that your distance vector contains zero vectors and you perform a norm operation afterwards. tf.norm becomes numerically unstable there, since its gradient involves a division by the length of the vector. The result is that the gradient becomes either zero or inf. See this GitHub issue.
The solution would be to remove those zero vectors, in a fashion similar to this Stack Overflow question.
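As a rough illustration of dropping the i == j pairs before taking the norm, here is a NumPy sketch rather than TensorFlow code. The function name pairwise_mean_distances and the sample cluster means are made up; the same masking idea should carry over to the TensorFlow version, but that is untested here.

```python
import numpy as np

def pairwise_mean_distances(mu):
    """Distances between distinct cluster means only (pairs with i != j)."""
    k = mu.shape[0]
    diff = mu[:, None, :] - mu[None, :, :]   # shape (k, k, d): all pairwise differences
    off_diag = ~np.eye(k, dtype=bool)        # True everywhere except the i == j diagonal
    # Boolean indexing keeps k*(k-1) difference vectors, none of them mu_i - mu_i
    return np.linalg.norm(diff[off_diag], axis=1)

mu = np.array([[0.0, 0.0],
               [3.0, 4.0],
               [6.0, 8.0]])                  # three illustrative cluster means
dists = pairwise_mean_distances(mu)
print(dists)
```

For three clusters this leaves six distances instead of nine, and none of them is the norm of a zero vector as long as the means themselves are distinct.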

Can I implement a gradient descent for arbitrary convex loss function?

I have a loss function I would like to try and minimize:
def lossfunction(X,b,lambs):
    B = b.reshape(X.shape)
    penalty = np.linalg.norm(B, axis = 1)**(0.5)
    return np.linalg.norm(np.dot(X,B)-X) + lambs*penalty.sum()
Gradient descent, or similar methods, might be useful. I can't calculate the gradient of this function analytically, so I am wondering how I can numerically calculate the gradient for this loss function in order to implement a descent method.
NumPy has a gradient function, but it requires me to pass a scalar field at predetermined points.
You could try scipy.optimize.minimize
For your case a sample call would be:
import scipy.optimize
# b0 is an assumed initial guess for the flattened B (same number of entries as X)
scipy.optimize.minimize(lambda b: lossfunction(X, b, lambs), b0, method='Nelder-Mead')
You could estimate the derivative numerically by a central difference:
def derivative(fun, X, b, lambs, h):
    return (fun(X + 0.5*h,b,lambs) - fun(X - 0.5*h,b,lambs))/h
And use it like this:
# assign values to X, b, lambs
# set the value of h
h = 0.001
print(derivative(lossfunction, X, b, lambs, h))
The code above is valid for dimX = 1; some modifications are needed to handle a multidimensional vector X:
def gradient(fun, X, b, lambs, h):
    res = []
    for i in range (0,len(X)):
        t1 = list(X)
        t1[i] = t1[i] + 0.5*h
        t2 = list(X)
        t2[i] = t2[i] - 0.5*h
        res = res + [(fun(t1,b,lambs) - fun(t2,b,lambs))/h]
    return res
Forgive the naivety of the code, I barely know how to write some Python :-)
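Since the quantity being optimized in the question is b, a NumPy variant that differentiates with respect to the parameter vector might look like the sketch below. The name numerical_gradient and the quadratic test function are assumptions for illustration, not part of the original answer.

```python
import numpy as np

def numerical_gradient(f, b, h=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at b."""
    b = np.asarray(b, dtype=float)
    grad = np.zeros_like(b)
    for i in range(b.size):
        step = np.zeros_like(b)
        step.flat[i] = h                 # perturb one coordinate at a time
        grad.flat[i] = (f(b + step) - f(b - step)) / (2 * h)
    return grad

def sum_of_squares(b):
    # Test function with a known gradient: d/db sum(b^2) = 2*b
    return np.sum(b ** 2)

b0 = np.array([1.0, -2.0, 0.5])
g = numerical_gradient(sum_of_squares, b0)
print(g)
```

For the loss in the question, the call would presumably be numerical_gradient(lambda b: lossfunction(X, b, lambs), b0). Note that the penalty term there is not differentiable wherever a row of B is exactly zero, so the estimate is unreliable near such points.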