Using Nesterov's accelerated gradient on the Ackley function - optimization

I am working on a project where I want to use Nesterov's accelerated gradient method on the Ackley function below, starting from the initial point (25, 20), and get within 2e-4 of the global minimizer and within 5e-4 of the global minimum.
import numpy as np
from scipy.optimize import OptimizeResult

def nag(func, x, lr, num_iters, jac, tol, callback, gamma=0.9, *args, **kwargs):
    vals = [func(x)]
    opt_res = OptimizeResult()
    update = np.zeros(x.size)
    for i in range(1, num_iters + 1):
        grad = jac(x - gamma * update)  # gradient at the look-ahead point
        prev_x = x
        update = gamma * update + lr(i) * grad
        x = x - update
        vals.append(func(x))
        callback(x)
        if np.linalg.norm(x - prev_x) <= tol:
            break
    opt_res.x = x
    opt_res.nit = i
    return opt_res, np.array(vals, dtype=object)
Using gamma = 0.2, num_iters = 5000, and a custom learning-rate function
def lr_(t):
    if t < 5:
        return 1e-4
    elif t < 10:
        return 1e-2
    else:
        return 0.1
I was able to get to around (0.0008, 0.0022), but couldn't get closer to the desired global minimizer and minimum as specified, despite experimenting with different values for a long time. Does anyone know what I could try to get closer to the desired result?
Or are there other optimization methods that would work better than NAG? I've heard Adam or Adagrad should work, but I haven't had much success with them.
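For completeness, here is a minimal sketch of how nag() can be invoked; the Ackley implementation and the central-difference helper num_grad() below are illustrative stand-ins, not the exact code from my project:
import numpy as np

def ackley(x):
    # standard 2-D Ackley with a = 20, b = 0.2, c = 2*pi; global minimum at the origin
    a, b, c = 20.0, 0.2, 2.0 * np.pi
    n = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x ** 2) / n))
            - np.exp(np.sum(np.cos(c * x)) / n) + a + np.e)

def num_grad(f, x, h=1e-7):
    # central-difference gradient, one coordinate at a time
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

res, vals = nag(ackley, np.array([25.0, 20.0]), lr_, 5000,
                jac=lambda x: num_grad(ackley, x),
                tol=1e-10, callback=lambda x: None, gamma=0.2)
print(res.x, res.nit)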

Why are gradients disconnected?

Consider the following code:
import tensorflow as tf

@tf.function
def get_derivatives(function_to_diff, X):
    f = function_to_diff(X)
    ## Derivatives
    W = X[:, 0]
    Z = X[:, 1]
    V = X[:, 2]
    df_dW = tf.gradients(f, X[:, 0])
    return df_dW
I wanted get_derivatives to return the partial derivative of function_to_diff with respect to the first element of X.
However, when I run
def test_function(X):
    return tf.pow(X[:, 0], 2) * X[:, 1] * X[:, 2]

get_derivatives(test_function, X)
I get None.
If I use unconnected_gradients='zero' for tf.gradients, I get zeros instead. In other words, the gradients are disconnected.
Questions
Why are the gradients disconnected?
How can I get the derivative with respect to the first element of X, i.e. how can I restore the connection? I know that if I wrote
def test_function(x, y, z):
    return tf.pow(x, 2) * y * z

@tf.function
def get_derivatives(function_to_diff, x, y, z):
    f = function_to_diff(x, y, z)
    df_dW = tf.gradients(f, x)
    return df_dW
this could fix the problem. But what if my function can only take in one argument, i.e. what if it looks like test_function(X)? For example, test_function could be a trained neural network that takes in only one argument.
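One approach, sketched below, assumes it is acceptable to differentiate with respect to the whole tensor X and slice afterwards. The reason the original version is disconnected is that tf.gradients(f, X[:, 0]) asks for the gradient with respect to a new slice tensor that f was never computed from; the gradient with respect to X itself stays connected, and its first column is the derivative you want:
import tensorflow as tf

@tf.function
def get_derivatives(function_to_diff, X):
    with tf.GradientTape() as tape:
        tape.watch(X)                 # X may be a constant, so watch it explicitly
        f = function_to_diff(X)
    df_dX = tape.gradient(f, X)       # same shape as X
    return df_dX[:, 0]                # derivative w.r.t. the first element

X = tf.constant([[1.0, 2.0, 3.0]])
print(get_derivatives(test_function, X))  # 2*x*y*z = 12.0 for this row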

How to implement custom Keras ordinal loss function with tensor evaluation without disturbing TF>2.0 Model Graph?

I am trying to implement a custom loss function in Tensorflow 2.4 using the Keras backend.
The loss function is a ranking loss; I found the following paper with a log-likelihood-style loss: Chen et al., Single-Image Depth Perception in the Wild.
Similarly, I wanted to sample some (in this case 50) points from an image to compare the relative order between ground-truth and predicted depth maps using the NYU-Depth dataset. Being a fan of NumPy, I started working with that but ran into the following exception:
ValueError: No gradients provided for any variable: [...]
I have learned that this happens because the loss function's arguments are not filled with concrete values when it is called; instead, it is compiled into a graph function that is used later. So while I know the dimensions of my tensors (4, 480, 640, 1), I cannot work with the data as I wanted and have to use the keras.backend functions instead, so that in the end (if I understood correctly) there is a path in the TF graph from the input tensors to the output tensor, which has to provide a gradient.
So my question now is: Is this a feasible loss function within keras?
I have already tried a few ideas and different approaches with different variations of my original code, which was something like:
def ranking_loss_function(y_true, y_pred):
    # Chen et al. loss
    y_true_np = K.eval(y_true)
    y_pred_np = K.eval(y_pred)
    if y_true_np.shape[0] != None:
        num_sample_points = 50
        total_samples = num_sample_points ** 2
        err_list = [0 for x in range(y_true_np.shape[0])]
        for i in range(y_true_np.shape[0]):
            sample_points = create_random_samples(y_true, y_pred, num_sample_points)
            for x1, y1 in sample_points:
                for x2, y2 in sample_points:
                    if y_true[i][x1][y1] > y_true[i][x2][y2]:
                        # image_relation_true = 1
                        err_list[i] += np.log(1 + np.exp(-1 * y_pred[i][x1][y1] + y_pred[i][x2][y2]))
                    elif y_true[i][x1][y1] < y_true[i][x2][y2]:
                        # image_relation_true = -1
                        err_list[i] += np.log(1 + np.exp(y_pred[i][x1][y1] - y_pred[i][x2][y2]))
                    else:
                        # image_relation_true = 0
                        err_list[i] += np.square(y_pred[i][x1][y1] - y_pred[i][x2][y2])
        err_list = np.divide(err_list, total_samples)
        return K.constant(err_list)
As you can probably tell, the main idea was to first create the sample points and then based on the existing relation between them in y_true/y_pred continue with the corresponding computation from the cited paper.
Can anyone help me and provide some more helpful information or tips on how to correctly implement this loss using keras.backend functions? Trying to include the ordinal relation information really confused me compared to standard regression losses.
EDIT: Just in case this causes confusion: create_random_samples() just creates 50 random sample points (x, y) coordinate pairs based on the shape[1] and shape[2] of y_true (image width and height)
EDIT(2): After finding this variation on GitHub, I have tried out a variation using only TF functions to retrieve data from the tensors and compute the output. The adjusted and probably more correct version still throws the same exception though:
def ranking_loss_function(y_true, y_pred):
    # In the Wild ranking loss
    y_true_np = K.eval(y_true)
    y_pred_np = K.eval(y_pred)
    if y_true_np.shape[0] != None:
        num_sample_points = 50
        total_samples = num_sample_points ** 2
        bs = y_true_np.shape[0]
        w = y_true_np.shape[1]
        h = y_true_np.shape[2]
        total_samples = total_samples * bs
        num_pairs = tf.constant([total_samples], dtype=tf.float32)
        output = tf.Variable(0.0)
        for i in range(bs):
            sample_points = create_random_samples(y_true, y_pred, num_sample_points)
            for x1, y1 in sample_points:
                for x2, y2 in sample_points:
                    y_true_sq = tf.squeeze(y_true)
                    y_pred_sq = tf.squeeze(y_pred)
                    d1_t = tf.slice(y_true_sq, [i, x1, y1], [1, 1, 1])
                    d2_t = tf.slice(y_true_sq, [i, x2, y2], [1, 1, 1])
                    d1_p = tf.slice(y_pred_sq, [i, x1, y1], [1, 1, 1])
                    d2_p = tf.slice(y_pred_sq, [i, x2, y2], [1, 1, 1])
                    d1_t_sq = tf.squeeze(d1_t)
                    d2_t_sq = tf.squeeze(d2_t)
                    d1_p_sq = tf.squeeze(d1_p)
                    d2_p_sq = tf.squeeze(d2_p)
                    if d1_t_sq > d2_t_sq:
                        # --> Image relation = 1
                        output.assign_add(tf.math.log(1 + tf.math.exp(-1 * d1_p_sq + d2_p_sq)))
                    elif d1_t_sq < d2_t_sq:
                        # --> Image relation = -1
                        output.assign_add(tf.math.log(1 + tf.math.exp(d1_p_sq - d2_p_sq)))
                    else:
                        output.assign_add(tf.math.square(d1_p_sq - d2_p_sq))
        return output / num_pairs
EDIT(3): This is the code for create_random_samples():
(FYI: Because getting the shape from y_true was unreliable in this case, I hard-coded it here, since I know it for the dataset I am currently using.)
def create_random_samples(y_true, y_pred, num_points=50):
    y_true_shape = (4, 480, 640, 1)
    y_pred_shape = (4, 480, 640, 1)
    if y_true_shape[0] != None:
        num_samples = num_points
        population = [(x, y) for x in range(y_true_shape[1]) for y in range(y_true_shape[2])]
        sample_points = random.sample(population, num_samples)
        return sample_points
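As a minimal sketch of a graph-friendly variant: the version below assumes it is acceptable to draw the random coordinates with TF ops inside the loss (no K.eval, no Python conditionals on tensor values) and to compare num_sample_points random pairs per image rather than all pairwise combinations; the masks replace the if/elif/else on the ground-truth relation:
import tensorflow as tf

def ranking_loss(y_true, y_pred, num_sample_points=50):
    # Random pixel coordinates drawn in-graph; y_true/y_pred assumed [b, h, w, 1].
    shape = tf.shape(y_true)
    b, h, w = shape[0], shape[1], shape[2]
    n = num_sample_points
    r1 = tf.random.uniform(tf.stack([b, n]), maxval=h, dtype=tf.int32)
    c1 = tf.random.uniform(tf.stack([b, n]), maxval=w, dtype=tf.int32)
    r2 = tf.random.uniform(tf.stack([b, n]), maxval=h, dtype=tf.int32)
    c2 = tf.random.uniform(tf.stack([b, n]), maxval=w, dtype=tf.int32)
    batch_idx = tf.tile(tf.range(b)[:, None], [1, n])
    idx1 = tf.stack([batch_idx, r1, c1], axis=-1)
    idx2 = tf.stack([batch_idx, r2, c2], axis=-1)
    d1_t = tf.gather_nd(y_true[..., 0], idx1)
    d2_t = tf.gather_nd(y_true[..., 0], idx2)
    d1_p = tf.gather_nd(y_pred[..., 0], idx1)
    d2_p = tf.gather_nd(y_pred[..., 0], idx2)
    diff = d1_p - d2_p
    # Masks encode the ground-truth relation: 1, -1, or 0.
    closer = tf.cast(d1_t > d2_t, diff.dtype)
    farther = tf.cast(d1_t < d2_t, diff.dtype)
    equal = 1.0 - closer - farther
    # softplus(x) == log(1 + exp(x)), a numerically stable form of the paper's terms
    loss = (closer * tf.math.softplus(-diff)
            + farther * tf.math.softplus(diff)
            + equal * tf.square(diff))
    return tf.reduce_mean(loss)
Because every step is a TF op on y_pred, gradients can flow from the scalar loss back to the network outputs, which should avoid the "No gradients provided" error.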

How to calculate cosine similarity given sparse matrix data in TensorFlow?

I'm supposed to change part of a Python script from GitHub. The code is an attention-based similarity measure, but I want to turn it into cosine similarity.
The respective code is in the layers.py file (inside the call method).
Attention-Based:
def __call__(self, inputs):
    x = inputs
    # dropout
    if self.sparse_inputs:
        x = sparse_dropout(x, 1 - self.dropout, self.num_features_nonzero)
    else:
        x = tf.nn.dropout(x, 1 - self.dropout)
    # graph learning
    h = dot(x, self.vars['weights'], sparse=self.sparse_inputs)
    N = self.num_nodes
    edge_v = tf.abs(tf.gather(h, self.edge[0]) - tf.gather(h, self.edge[1]))
    edge_v = tf.squeeze(self.act(dot(edge_v, self.vars['a'])))
    sgraph = tf.SparseTensor(indices=tf.transpose(self.edge), values=edge_v, dense_shape=[N, N])
    sgraph = tf.sparse_softmax(sgraph)
    return h, sgraph
I edited the above code to match what I believe are my requirements (cosine similarity). However, when I run the following version:
def __call__(self, inputs):
    x = inputs
    # dropout
    if self.sparse_inputs:
        x = sparse_dropout(x, 1 - self.dropout, self.num_features_nonzero)
    else:
        x = tf.nn.dropout(x, 1 - self.dropout)
    # graph learning
    h = dot(x, self.vars['weights'], sparse=self.sparse_inputs)
    N = self.num_nodes
    h_norm = tf.nn.l2_normalize(h)
    edge_v = tf.matmul(h_norm, tf.transpose(h_norm))
    h_norm_1 = tf.norm(h_norm)
    edge_v /= h_norm_1 * h_norm_1
    edge_v = dot(edge_v, self.vars['a'])  # It causes an error when I add this line
    zero = tf.constant(0, dtype=tf.float32)
    where = tf.not_equal(edge_v, zero)
    indices = tf.where(where)
    values = tf.gather_nd(edge_v, indices)
    sgraph = tf.SparseTensor(indices, values, dense_shape=[N, N])
    return h, sgraph
The script shows some runtime errors:
Screenshot of error message
I suspect the error here is related to line 226:
edge_v = dot(edge_v, self.vars['a']) # It causes an error when I add this line
Any advice on how to accomplish this successfully?
Link of the script on GitHub:
https://github.com/jiangboahu/GLCN-tf
Note: I don't want to use built-in functions, because I think they are not precise enough for this job.
ETA: It appears that there are some answers around, but as far as I understood them, they seem to tackle different problems.
Thanks a bunch in advance
What's dot? Have you imported the method?
It should either be:
edge_v = tf.keras.backend.dot(edge_v, self.vars['a'])
or:
edge_v = tf.tensordot(edge_v, self.vars['a'], axes=1)
(Note that tf.tensordot requires the axes argument.)
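For the cosine-similarity part itself, a minimal sketch, assuming h holds one embedding per row in a [num_nodes, features] layout: normalize each row to unit length, then a single matrix product gives all pairwise cosine similarities, so the extra global tf.norm division in the edited code becomes unnecessary:
import tensorflow as tf

h = tf.random.normal([5, 8])                           # stand-in for the node embeddings
h_norm = tf.nn.l2_normalize(h, axis=1)                 # unit-length rows
cos_sim = tf.matmul(h_norm, h_norm, transpose_b=True)  # [5, 5], entries in [-1, 1]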

Can I implement a gradient descent for arbitrary convex loss function?

I have a loss function I would like to try and minimize:
import numpy as np

def lossfunction(X, b, lambs):
    B = b.reshape(X.shape)
    penalty = np.linalg.norm(B, axis=1) ** 0.5
    return np.linalg.norm(np.dot(X, B) - X) + lambs * penalty.sum()
Gradient descent, or similar methods, might be useful. I can't calculate the gradient of this function analytically, so I am wondering how I can numerically calculate the gradient for this loss function in order to implement a descent method.
NumPy has a gradient function, but it requires me to pass a scalar field at predetermined points.
You could try scipy.optimize.minimize
For your case a sample call would be:
from scipy.optimize import minimize

minimize(lossfunction, x0, args=(b, lambs), method='Nelder-Mead')
where x0 is your initial guess (minimize optimizes over the first argument of lossfunction).
You could estimate the derivative numerically by a central difference:
def derivative(fun, X, b, lambs, h):
    return (fun(X + 0.5 * h, b, lambs) - fun(X - 0.5 * h, b, lambs)) / h
And use it like this:
# assign values to X, b, lambs
# set the value of h
h = 0.001
print(derivative(lossfunction, X, b, lambs, h))
The code above is valid for dim X = 1; some modifications are needed to account for a multidimensional vector X:
def gradient(fun, X, b, lambs, h):
    res = []
    for i in range(len(X)):
        t1 = list(X)
        t1[i] = t1[i] + 0.5 * h
        t2 = list(X)
        t2[i] = t2[i] - 0.5 * h
        res = res + [(fun(t1, b, lambs) - fun(t2, b, lambs)) / h]
    return res
Forgive the naivety of the code, I barely know how to write Python :-)
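Building on that, a minimal sketch of a plain descent loop driven by the gradient() helper above; the function name, step size, tolerance, and iteration count are all illustrative, not tuned:
import numpy as np

def gradient_descent(fun, x0, b, lambs, lr=1e-3, h=1e-5, num_iters=1000, tol=1e-8):
    # Wrap fun so the list arguments built by gradient() above are converted
    # back to NumPy arrays before evaluation.
    fun_arr = lambda X, b_, lambs_: fun(np.asarray(X), b_, lambs_)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = np.asarray(gradient(fun_arr, x, b, lambs, h))
        x_new = x - lr * g          # step against the numerical gradient
        if np.linalg.norm(x_new - x) < tol:  # stop once steps become tiny
            return x_new
        x = x_new
    return x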

Avoiding optimization pitfalls when modeling an ordinal predicted variable in PyMC3

I am trying to model an ordinal predicted variable using PyMC3 based on the approach in chapter 23 of Doing Bayesian Data Analysis. I would like to determine a good starting value using find_MAP, but am receiving an optimization error.
The model:
import pymc3 as pm
import numpy as np
import theano
import theano.tensor as tt

# Some helper functions
def cdf(x, location=0, scale=1):
    epsilon = np.array(1e-32, dtype=theano.config.floatX)
    location = tt.cast(location, theano.config.floatX)
    scale = tt.cast(scale, theano.config.floatX)
    div = tt.sqrt(2 * scale ** 2 + epsilon)
    div = tt.cast(div, theano.config.floatX)
    erf_arg = (x - location) / div
    return .5 * (1 + tt.erf(erf_arg + epsilon))

def percent_to_thresh(idx, vect):
    return 5 * tt.sum(vect[:idx + 1]) + 1.5

def full_thresh(thresh):
    idxs = tt.arange(thresh.shape[0] - 1)
    thresh_mod, updates = theano.scan(fn=percent_to_thresh,
                                      sequences=[idxs],
                                      non_sequences=[thresh])
    return tt.concatenate([[-1 * np.inf, 1.5], thresh_mod, [6.5, np.inf]])

def compute_ps(thresh, location, scale):
    f_thresh = full_thresh(thresh)
    return cdf(f_thresh[1:], location, scale) - cdf(f_thresh[:-1], location, scale)

# Generate data
real_ps = [0.05, 0.05, 0.1, 0.1, 0.2, 0.3, 0.2]
data = np.random.choice(7, size=1000, p=real_ps)

# Run model
with pm.Model() as model:
    mu = pm.Normal('mu', mu=4, sd=3)
    sigma = pm.Uniform('sigma', lower=0.1, upper=70)
    thresh = pm.Dirichlet('thresh', a=np.ones(5))
    cat_p = compute_ps(thresh, mu, sigma)
    results = pm.Categorical('results', p=cat_p, observed=data)

with model:
    start = pm.find_MAP()
    trace = pm.sample(2000, start=start)
When running this, I receive the following error:
Applied interval-transform to sigma and added transformed sigma_interval_ to model.
Applied stickbreaking-transform to thresh and added transformed thresh_stickbreaking_ to model.
Traceback (most recent call last):
File "cm_net_log.v1-for_so.py", line 53, in <module>
start = pm.find_MAP()
File "/usr/local/lib/python3.5/site-packages/pymc3/tuning/starting.py", line 133, in find_MAP
specific_errors)
ValueError: Optimization error: max, logp or dlogp at max have non-finite values. Some values may be outside of distribution support. max: {'thresh_stickbreaking_': array([-1.04298465, -0.48661088, -0.84326554, -0.44833646]), 'sigma_interval_': array(-2.220446049250313e-16), 'mu': array(7.68422528308479)} logp: array(-3506.530143064723) dlogp: array([ 1.61013190e-06, nan, -6.73994118e-06,
-6.93873894e-06, 6.03358122e-06, 3.18954680e-06])Check that 1) you don't have hierarchical parameters, these will lead to points with infinite density. 2) your distribution logp's are properly specified. Specific issues:
My questions:
How can I determine why dlogp is nan at certain points?
Is there a different way that I can express this model to avoid dlogp being nan?
Also worth noting:
This model runs fine if I don't find_MAP and use a Metropolis sampler. However, I'd like to have the flexibility of using other samplers as this model becomes more complex.
I have a suspicion that the issue is due to the relationship between the thresholds and the normal distribution, but I don't know how to disentangle them for the optimization.
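Regarding question 1, a small inspection sketch that may help, assuming PyMC3's Model.dlogp() and model.test_point are available (they are in the 3.x releases I have used): compile the gradient of the log posterior and evaluate it at candidate points to see which free variable contributes the nan.
# Hedged sketch for inspecting dlogp directly (assumes the pymc3 3.x API)
with model:
    dlogp_fn = model.dlogp()           # compiled gradient of the model's log posterior
    print(dlogp_fn(model.test_point))  # evaluate at the default starting point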
Regarding question 2: I expressed the model for the ordinal predicted variable (single group) differently; I used Theano's @as_op decorator for a function that calculates the probabilities for the outcomes. That also explains why I cannot use find_MAP() or gradient-based samplers: Theano cannot calculate a gradient for the custom function. (http://pymc-devs.github.io/pymc3/notebooks/getting_started.html#Arbitrary-deterministics)
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from theano.compile.ops import as_op
from scipy.stats import norm

# Number of outcomes; df is the DataFrame with the observed ordinal variable Y (defined elsewhere)
nYlevels = df.Y.cat.categories.size

thresh = [k + .5 for k in range(1, nYlevels)]
thresh_obs = np.ma.asarray(thresh)
thresh_obs[1:-1] = np.ma.masked

@as_op(itypes=[tt.dvector, tt.dscalar, tt.dscalar], otypes=[tt.dvector])
def outcome_probabilities(theta, mu, sigma):
    out = np.empty(nYlevels)
    n = norm(loc=mu, scale=sigma)
    out[0] = n.cdf(theta[0])
    out[1] = np.max([0, n.cdf(theta[1]) - n.cdf(theta[0])])
    out[2] = np.max([0, n.cdf(theta[2]) - n.cdf(theta[1])])
    out[3] = np.max([0, n.cdf(theta[3]) - n.cdf(theta[2])])
    out[4] = np.max([0, n.cdf(theta[4]) - n.cdf(theta[3])])
    out[5] = np.max([0, n.cdf(theta[5]) - n.cdf(theta[4])])
    out[6] = 1 - n.cdf(theta[5])
    return out

with pm.Model() as ordinal_model_single:
    theta = pm.Normal('theta', mu=thresh, tau=np.repeat(.5 ** 2, len(thresh)),
                      shape=len(thresh), observed=thresh_obs, testval=thresh[1:-1])
    mu = pm.Normal('mu', mu=nYlevels / 2.0, tau=1.0 / (nYlevels ** 2))
    sigma = pm.Uniform('sigma', nYlevels / 1000.0, nYlevels * 10.0)
    pr = outcome_probabilities(theta, mu, sigma)
    y = pm.Categorical('y', pr, observed=df.Y.cat.codes.as_matrix())
http://nbviewer.jupyter.org/github/JWarmenhoven/DBDA-python/blob/master/Notebooks/Chapter%2023.ipynb