Can I implement a gradient descent for arbitrary convex loss function? - numpy

I have a loss function I would like to try and minimize:
def lossfunction(X, b, lambs):
    B = b.reshape(X.shape)
    penalty = np.linalg.norm(B, axis=1)**(0.5)
    return np.linalg.norm(np.dot(X, B) - X) + lambs*penalty.sum()
Gradient descent, or similar methods, might be useful. I can't calculate the gradient of this function analytically, so I am wondering how I can numerically calculate the gradient for this loss function in order to implement a descent method.
NumPy has a gradient function, but it requires a scalar field that has already been sampled at predetermined points.

You could try scipy.optimize.minimize.
For your case, a sample call would be:
from scipy.optimize import minimize
# b0 is an initial guess for b (a 1-D array); X and lambs are held fixed
result = minimize(lambda b: lossfunction(X, b, lambs), b0, method='Nelder-Mead')

You could estimate the derivative numerically by a central difference:
def derivative(fun, X, b, lambs, h):
    return (fun(X + 0.5*h, b, lambs) - fun(X - 0.5*h, b, lambs))/h
And use it like this:
# assign values to X, b, lambs
# set the value of h
h = 0.001
print(derivative(lossfunction, X, b, lambs, h))
The code above is valid when X is a scalar (dimX = 1); some modifications are needed to handle a multidimensional vector X:
def gradient(fun, X, b, lambs, h):
    # central difference along each coordinate of a 1-D vector X
    X = np.asarray(X, dtype=float)
    res = np.zeros(len(X))
    for i in range(len(X)):
        t1 = X.copy()
        t1[i] += 0.5*h
        t2 = X.copy()
        t2[i] -= 0.5*h
        res[i] = (fun(t1, b, lambs) - fun(t2, b, lambs))/h
    return res
Forgive the naivety of the code, I barely know how to write Python :-)
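To connect the central-difference idea above to an actual descent method, here is a minimal sketch. The names numerical_gradient, gradient_descent and the test function quadratic are made up for illustration, and the fixed step size lr is an assumption that would need tuning for a real loss such as lossfunction.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    # Central-difference estimate of the gradient of f at the 1-D point x.
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = 0.5 * h
        grad[i] = (f(x + step) - f(x - step)) / h
    return grad

def gradient_descent(f, x0, lr=0.1, num_iters=500, tol=1e-8):
    # Plain gradient descent with a fixed step size (an assumption, not tuned).
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x_new = x - lr * numerical_gradient(f, x)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x

def quadratic(x):
    # Hypothetical smooth test function with its minimum at (1, -2).
    return np.sum((x - np.array([1.0, -2.0]))**2)

x_min = gradient_descent(quadratic, [0.0, 0.0])
print(x_min)
```

For the loss in the question, one would wrap it as a function of the flattened parameter vector only, for example lambda b: lossfunction(X, b, lambs), and pass that as f.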

Related

Using Nesterov's accelerated gradient on the Ackley function

I am working on a project where I want to use Nesterov's accelerated gradient method on the Ackley function below
to go from the initial point of (25, 20) to within the distance of the global minimizer and minimum of (2e-4, 5e-4).
def nag(func, x, lr, num_iters, jac, tol, callback, gamma=0.9, *args, **kwargs):
    vals = [func(x)]
    opt_res = OptimizeResult()
    update = np.zeros(x.size)
    for i in range(1, num_iters+1):
        grad = jac(x - gamma * update)
        prev_x = x
        update = gamma * update + lr(i) * grad
        x = x - update
        vals.append(func(x))
        callback(x)
        if np.linalg.norm(x-prev_x) <= tol:
            break
    opt_res.x = x
    opt_res.nit = i
    return opt_res, np.array(vals, dtype=object)
Using gamma = 0.2, num_iters = 5000, and a customized learning rate function
def lr_(t):
    if t < 5:
        return 1e-4
    elif t < 10:
        return 1e-2
    else:
        return 0.1
I was able to get around (0.0008 and 0.0022), but couldn't get closer to the desired global minimizer and minimum as specified despite playing around with different values for a long time. Does anyone know what I could try so that I can get closer to the desired result?
Or are there other optimization methods that would work better than NAG? I heard Adam or Adagrad should work, but I haven't had much success with them.
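The Ackley function itself is not shown in the question, so the sketch below assumes the standard form with constants a = 20, b = 0.2, c = 2π. It may differ from the asker's version. One property of this form is that the gradient norm does not go to zero near the origin (it approaches a·b/√d, about 2.83 in two dimensions), so a constant step of 0.1 can keep overshooting the minimizer. A decaying schedule such as the hypothetical lr_decay below is one thing to try. It is a guess, not a tested fix.

```python
import numpy as np

A, B, C = 20.0, 0.2, 2.0 * np.pi  # standard Ackley constants (assumed)

def ackley(x):
    # Standard Ackley function for a 1-D point x of any dimension.
    x = np.asarray(x, dtype=float)
    r = np.sqrt(np.mean(x**2))
    return -A * np.exp(-B * r) - np.exp(np.mean(np.cos(C * x))) + A + np.e

def ackley_grad(x):
    # Analytic gradient; undefined exactly at the origin, where r = 0.
    x = np.asarray(x, dtype=float)
    d = x.size
    r = np.sqrt(np.mean(x**2))
    term1 = A * B * np.exp(-B * r) * x / (d * r)
    term2 = np.exp(np.mean(np.cos(C * x))) * C * np.sin(C * x) / d
    return term1 + term2

def lr_decay(t, lr0=0.1, decay=0.01):
    # Hypothetical schedule: the step shrinks as 1 / (1 + decay * t).
    return lr0 / (1.0 + decay * t)
```

These could be passed to the nag function from the question as func=ackley, jac=ackley_grad and lr=lr_decay.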

Why are gradients disconnected

Consider the following code
@tf.function
def get_derivatives(function_to_diff, X):
    f = function_to_diff(X)
    ## Derivatives
    W = X[:,0]
    Z = X[:,1]
    V = X[:,2]
    df_dW = tf.gradients(f, X[:,0])
    return df_dW
I wanted get_derivatives to return the partial derivative of function_to_diff with respect to the first element of X.
However, when I run
def test_function(X):
    return tf.pow(X[:,0],2) * X[:,1] * X[:,2]
get_derivatives(test_function, X)
I get None.
If I use unconnected_gradients='zero' for tf.gradients, I get zeros. In other words, the gradients are disconnected.
Questions
Why are the gradients disconnected?
How can I get the derivative with respect to the first element of X, i.e. how can I restore the connection? I know that if I wrote
def test_function(x, y, z):
    return tf.pow(x,2) * y * z
@tf.function
def get_derivatives(function_to_diff, x, y, z):
    f = function_to_diff(x, y, z)
    df_dW = tf.gradients(f, x)
    return df_dW
this would fix the problem. But what if my function can only take one argument, i.e. what if it looks like test_function(X)? For example, test_function could be a trained neural network that takes only one argument.

LSTM from scratch in tensorflow 2

I'm trying to implement an LSTM in TensorFlow 2.1 from scratch, without using the one already supplied with Keras (tf.keras.layers.LSTM), just to learn and code something. To do so, I've defined a class "Model" that, when called (as in model(input)), computes the matrix multiplications of the LSTM. I'm pasting part of my code here; the other parts are on GitHub (link).
class Model(object):
    [...]
    def __call__(self, inputs):
        assert inputs.shape == (vocab_size, T_steps)
        outputs = []
        for time_step in range(T_steps):
            x = inputs[:,time_step]
            x = tf.expand_dims(x,axis=1)
            z = tf.concat([self.h_prev,x],axis=0)
            f = tf.matmul(self.W_f, z) + self.b_f
            f = tf.sigmoid(f)
            i = tf.matmul(self.W_i, z) + self.b_i
            i = tf.sigmoid(i)
            o = tf.matmul(self.W_o, z) + self.b_o
            o = tf.sigmoid(o)
            C_bar = tf.matmul(self.W_C, z) + self.b_C
            C_bar = tf.tanh(C_bar)
            C = (f * self.C_prev) + (i * C_bar)
            h = o * tf.tanh(C)
            v = tf.matmul(self.W_v, h) + self.b_v
            v = tf.sigmoid(v)
            y = tf.math.softmax(v, axis=0)
            self.h_prev = h
            self.C_prev = C
            outputs.append(y)
        outputs = tf.squeeze(tf.stack(outputs,axis=1))
        return outputs
But this neural network has three problems:
1) it is very slow during training. In comparison, a model that uses tf.keras.layers.LSTM() trains more than 10 times faster. Why is this? Maybe because I didn't use minibatch training, but stochastic training?
2) the NN seems to not learn anything at all. After just a few (very few!) training examples, the loss seems to settle down and won't decrease anymore; instead it oscillates around the value it reached. After training, I tested the NN by making it generate some text, but it just outputs nonsense. Why isn't it learning anything?
3) the loss function outputs very high values. I've coded a categorical cross-entropy loss function but, with a 100-character sequence, the value of the function is over 370 per training example. Shouldn't it be way lower than this?
I've written the loss function like this:
def compute_loss(predictions, desired_outputs):
    l = 0
    for i in range(T_steps):
        l -= tf.math.log(predictions[desired_outputs[i], i])
    return l
I know these are open questions, but unfortunately I can't make it work. So any answer, even a short one that helps me solve the problem myself, is fine :)
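One way to put the loss value in context is the rough check below. A loss of 370 over 100 characters is about 3.7 per character, which is what a uniform guess over roughly 40 symbols would give. In addition, the code above applies a sigmoid to v before the softmax, so the softmax inputs are confined to (0, 1) and the output can never become sharply peaked. The vocabulary size V = 50 is a hypothetical value, since the question does not state it.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

V = 50   # hypothetical vocabulary size (not given in the question)
T = 100  # sequence length from the question

# Loss of a model that predicts a uniform distribution at every step.
uniform_loss = T * np.log(V)

# Best case after a sigmoid: the target logit saturates at 1, all others at 0.
v = np.zeros(V)
v[0] = 1.0
best_loss = -T * np.log(softmax(v)[0])

print(uniform_loss, best_loss)
```

Under these assumptions the best reachable loss is T * log(1 + (V - 1) / e), which stays large however long the network trains. Dropping the sigmoid before the softmax would remove that floor.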

Evaluating the squared term of a gaussian kernel for having a covariance matrix for multi-dimensional inputs [duplicate]

I have the following code. It is taking forever in Python. There must be a way to translate this calculation into a broadcast...
def euclidean_square(a,b):
    squares = np.zeros((a.shape[0],b.shape[0]))
    for i in range(squares.shape[0]):
        for j in range(squares.shape[1]):
            diff = a[i,:] - b[j,:]
            sqr = diff**2.0
            squares[i,j] = np.sum(sqr)
    return squares
You can use np.einsum after calculating the differences in a broadcasted way, like so -
ab = a[:,None,:] - b
out = np.einsum('ijk,ijk->ij',ab,ab)
Or use scipy's cdist with its optional metric argument set as 'sqeuclidean' to give us the squared euclidean distances as needed for our problem, like so -
from scipy.spatial.distance import cdist
out = cdist(a,b,'sqeuclidean')
I collected the different methods proposed here, and in two other questions, and measured the speed of the different methods:
import numpy as np
import scipy.spatial
import sklearn.metrics
def dist_direct(x, y):
    d = np.expand_dims(x, -2) - y
    return np.sum(np.square(d), axis=-1)
def dist_einsum(x, y):
    d = np.expand_dims(x, -2) - y
    return np.einsum('ijk,ijk->ij', d, d)
def dist_scipy(x, y):
    return scipy.spatial.distance.cdist(x, y, "sqeuclidean")
def dist_sklearn(x, y):
    return sklearn.metrics.pairwise.pairwise_distances(x, y, "sqeuclidean")
def dist_layers(x, y):
    res = np.zeros((x.shape[0], y.shape[0]))
    for i in range(x.shape[1]):
        res += np.subtract.outer(x[:, i], y[:, i])**2
    return res
# inspired by the excellent https://github.com/droyed/eucl_dist
def dist_ext1(x, y):
    nx, p = x.shape
    x_ext = np.empty((nx, 3*p))
    x_ext[:, :p] = 1
    x_ext[:, p:2*p] = x
    x_ext[:, 2*p:] = np.square(x)
    ny = y.shape[0]
    y_ext = np.empty((3*p, ny))
    y_ext[:p] = np.square(y).T
    y_ext[p:2*p] = -2*y.T
    y_ext[2*p:] = 1
    return x_ext.dot(y_ext)
# https://stackoverflow.com/a/47877630/648741
def dist_ext2(x, y):
    return np.einsum('ij,ij->i', x, x)[:,None] + np.einsum('ij,ij->i', y, y) - 2 * x.dot(y.T)
I use timeit to compare the speed of the different methods. For the comparison, I use vectors of length 10, with 100 vectors in the first group, and 1000 vectors in the second group.
import timeit
p = 10
x = np.random.standard_normal((100, p))
y = np.random.standard_normal((1000, p))
for method in dir():
    if not method.startswith("dist_"):
        continue
    t = timeit.timeit(f"{method}(x, y)", number=1000, globals=globals())
    print(f"{method:12} {t:5.2f}ms")
On my laptop, the results are as follows:
dist_direct   5.07ms
dist_einsum   3.43ms
dist_ext1     0.20ms  <-- fastest
dist_ext2     0.35ms
dist_layers   2.82ms
dist_scipy    0.60ms
dist_sklearn  0.67ms
While the two methods dist_ext1 and dist_ext2, both based on the idea of writing (x-y)**2 as x**2 - 2*x*y + y**2, are very fast, there is a downside: When the distance between x and y is very small, due to cancellation error the numerical result can sometimes be (very slightly) negative.
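If the slightly negative values matter, for instance because a square root is taken afterwards, one simple guard is to clamp the result at zero. The sketch below reuses the expanded form from dist_ext2. The function name sq_dists_clipped and the random test data are made up for illustration.

```python
import numpy as np

def sq_dists_clipped(x, y):
    # Expanded form |x|^2 - 2 x.y + |y|^2, as in dist_ext2.
    d = (np.einsum('ij,ij->i', x, x)[:, None]
         + np.einsum('ij,ij->i', y, y)
         - 2 * x.dot(y.T))
    # Cancellation can leave tiny negative entries; clamp them to zero.
    return np.maximum(d, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 10))
# y contains exact copies of some rows of x, so some true distances are zero.
y = np.vstack([x[:5], rng.standard_normal((20, 10))])
d = sq_dists_clipped(x, y)
print(d.min())
```

Clamping only hides the sign problem; entries for nearly identical points still carry the same relative rounding error as before.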
Another solution besides using cdist is the following
difference_squared = np.zeros((a.shape[0], b.shape[0]))
for dimension_iterator in range(a.shape[1]):
    difference_squared = difference_squared + np.subtract.outer(a[:, dimension_iterator], b[:, dimension_iterator])**2.

Implementing backpropagation gradient descent using scipy.optimize.minimize

I am trying to train an autoencoder NN (3 layers - 2 visible, 1 hidden) using numpy and scipy for the MNIST digit images dataset. The implementation is based on the notation given here. Below is my code:
def autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, data):
    """
    The input theta is a 1-dimensional array because scipy.optimize.minimize expects
    the parameters being optimized to be a 1d array.
    First convert theta from a 1d array to the (W1, W2, b1, b2)
    matrix/vector format, so that this follows the notation convention of the
    lecture notes and tutorial.
    You must compute the:
        cost : scalar representing the overall cost J(theta)
        grad : array representing the corresponding gradient of each element of theta
    """
    training_size = data.shape[1]
    # unroll theta to get (W1,W2,b1,b2) #
    W1 = theta[0:hidden_size*visible_size]
    W1 = W1.reshape(hidden_size,visible_size)
    W2 = theta[hidden_size*visible_size:2*hidden_size*visible_size]
    W2 = W2.reshape(visible_size,hidden_size)
    b1 = theta[2*hidden_size*visible_size:2*hidden_size*visible_size + hidden_size]
    b2 = theta[2*hidden_size*visible_size + hidden_size: 2*hidden_size*visible_size + hidden_size + visible_size]
    #feedforward pass
    a_l1 = data
    z_l2 = W1.dot(a_l1) + numpy.tile(b1,(training_size,1)).T
    a_l2 = sigmoid(z_l2)
    z_l3 = W2.dot(a_l2) + numpy.tile(b2,(training_size,1)).T
    a_l3 = sigmoid(z_l3)
    #backprop
    delta_l3 = numpy.multiply(-(data-a_l3),numpy.multiply(a_l3,1-a_l3))
    delta_l2 = numpy.multiply(W2.T.dot(delta_l3),
                              numpy.multiply(a_l2, 1 - a_l2))
    b2_derivative = numpy.sum(delta_l3,axis=1)/training_size
    b1_derivative = numpy.sum(delta_l2,axis=1)/training_size
    W2_derivative = numpy.dot(delta_l3,a_l2.T)/training_size + lambda_*W2
    #print(W2_derivative.shape)
    W1_derivative = numpy.dot(delta_l2,a_l1.T)/training_size + lambda_*W1
    W1_derivative = W1_derivative.reshape(hidden_size*visible_size)
    W2_derivative = W2_derivative.reshape(visible_size*hidden_size)
    b1_derivative = b1_derivative.reshape(hidden_size)
    b2_derivative = b2_derivative.reshape(visible_size)
    grad = numpy.concatenate((W1_derivative,W2_derivative,b1_derivative,b2_derivative))
    cost = 0.5*numpy.sum((data-a_l3)**2)/training_size + 0.5*lambda_*(numpy.sum(W1**2) + numpy.sum(W2**2))
    return cost,grad
I have also implemented a function to estimate the numerical gradient and verify the correctness of my implementation (below).
def compute_gradient_numerical_estimate(J, theta, epsilon=0.0001):
    """
    :param J: a loss (cost) function that computes the real-valued loss given parameters and data
    :param theta: array of parameters
    :param epsilon: amount to vary each parameter in order to estimate
                    the gradient by numerical difference
    :return: array of numerical gradient estimate
    """
    gradient = numpy.zeros(theta.shape)
    eps_vector = numpy.zeros(theta.shape)
    for i in range(0,theta.size):
        eps_vector[i] = epsilon
        cost1,grad1 = J(theta+eps_vector)
        cost2,grad2 = J(theta-eps_vector)
        gradient[i] = (cost1 - cost2)/(2*epsilon)
        eps_vector[i] = 0
    return gradient
The norm of the difference between the numerical estimate and the gradient computed by the function is around 6.87165125021e-09, which seems acceptable. My main problem is getting the "L-BFGS-B" optimization algorithm to work with the scipy.optimize.minimize function, as below:
# theta is the 1-D array of(W1,W2,b1,b2)
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
options_ = {'maxiter': 4000, 'disp': False}
result = scipy.optimize.minimize(J, theta, method='L-BFGS-B', jac=True, options=options_)
I get the below output from this:
scipy.optimize.minimize() details:
      fun: 90.802022224079778
 hess_inv: <16474x16474 LbfgsInvHessProduct with dtype=float64>
      jac: array([ -6.83667742e-06, -2.74886002e-06, -3.23531941e-06, ...,
                   1.22425735e-01, 1.23425062e-01, 1.28091250e-01])
  message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
     nfev: 21
      nit: 0
   status: 2
  success: False
        x: array([-0.06836677, -0.0274886 , -0.03235319, ..., 0. ,
                   0. , 0. ])
Now, this post seems to indicate that the error could mean the gradient function is implemented incorrectly. But my numerical gradient estimate seems to confirm that my implementation is correct. I have tried varying the initial weights by using a uniform distribution as specified here, but the problem persists. Is there anything wrong with my backprop implementation?
Turns out the issue was a (very silly) mistake in this line:
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
The lambda's parameter x is never used in its body. So every time J was invoked, it evaluated the cost and gradient at the fixed initial theta instead of at the point the optimizer passed in.
This fixed it:
J = lambda x: utils.autoencoder_cost_and_grad(x, visible_size, hidden_size, lambda_, patches_train)
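The effect of this kind of mistake can be reproduced on a toy problem. The sketch below uses a made-up quadratic cost_and_grad in place of the autoencoder. J_bad ignores its argument and therefore returns the same cost at every point, which gives the line search nothing to work with, while J_good forwards the optimizer's current point.

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta, target):
    # Hypothetical stand-in for the autoencoder: returns (cost, gradient).
    diff = theta - target
    return 0.5 * np.dot(diff, diff), diff

target = np.array([1.0, -2.0, 3.0])
theta = np.zeros(3)

J_bad = lambda x: cost_and_grad(theta, target)   # x is ignored
J_good = lambda x: cost_and_grad(x, target)      # x is forwarded

# The buggy wrapper reports the same cost regardless of where it is evaluated.
same_cost = J_bad(np.ones(3))[0] == J_bad(np.full(3, 5.0))[0]

result = minimize(J_good, theta, method='L-BFGS-B', jac=True)
print(same_cost, result.x)
```

A numerical gradient check does not catch this bug when it calls the cost function directly rather than through the lambda wrapper.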