What is a Hessian matrix? - calculus

I know that the Hessian matrix is used in a kind of second-derivative test for functions of more than one independent variable. How does one find the maximum or minimum of such a function? Is it found using the eigenvalues of the Hessian matrix or its principal minors?

You should have a look here:
https://en.wikipedia.org/wiki/Second_partial_derivative_test
For an n-dimensional function f, find an x where the gradient grad f = 0. This is a critical point.
Then the second derivatives tell whether x marks a local minimum, a local maximum, or a saddle point.
The Hessian H is the matrix of all second partial derivatives of f.
1) For the 2D case, the determinant and the minors of the Hessian are relevant.
2) For the nD case, it involves checking whether H is positive (or negative) definite, which can be done by computing the eigenvalues of H (if H is invertible).
In fact, the shortcut in 1) is generalized by 2).
For numeric calculations, some kind of optimization strategy can be used to find an x where grad f = 0.
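As a rough illustration of 2), here is a minimal NumPy sketch (my own, not part of the original answer) that classifies a critical point by the eigenvalues of the Hessian; the function f(x, y) = x^2 - y^2 with critical point (0, 0) is just an assumed example.
import numpy as np

# f(x, y) = x**2 - y**2 has a critical point at (0, 0); its Hessian there is constant.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)  # H is symmetric, so eigvalsh is appropriate
if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("saddle point")  # this example prints "saddle point"
else:
    print("inconclusive (some eigenvalues are zero)")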

Related

CVXPY: quasiconvex fractional programming problem with variance term

I have a portfolio optimization problem where my objective function is the mean divided by the standard deviation.
The variance is that of the difference of two random variables, so it is computed as Var(X) + Var(Y) - 2 * Cov(X, Y). The variance term is written in terms of w (the portfolio selection), a covariance matrix Σ, and σ_δg, a vector of covariances related to the second random variable. The problem is that CVXPY doesn't consider this variance expression to be nonnegative, because some of the covariance entries are negative. Obviously, I know that the variance will always be nonnegative, so I believe this should work as a quasiconvex problem. Is there any way to tell CVXPY that this variance term will always be positive?

A positive semidefinite matrix with negative eigenvalues

From what I know, for any square real matrix A, a matrix generated with the following should be a positive semidefinite (PSD) matrix:
Q = A @ A.T
I have this matrix A, which is sparse and not symmetric. However, regardless of the properties of A, I think the matrix Q should be PSD.
However, upon using np.linalg.eigvals, I get the following:
np.sort(np.linalg.eigvals(Q))
>>>array([-1.54781185e+01+0.j, -7.27494242e-04+0.j, 2.09363431e-04+0.j, ...,
3.55351888e+15+0.j, 5.82221014e+17+0.j, 1.78954577e+18+0.j])
I think the complex eigenvalues result from the numerical instability of the operation. Using scipy.linalg.eigh, which takes advantage of the fact that the matrix is symmetric, gives,
np.sort(eigh(Q, eigvals_only=True))
>>>array([-3.10854357e+01, -6.60108485e+00, -7.34059692e-01, ...,
3.55351888e+15, 5.82221014e+17, 1.78954577e+18])
which again, contains negative eigenvalues.
My goal is to perform a Cholesky decomposition of the matrix Q; however, I keep getting an error message saying that Q is not positive definite, which is again confirmed by the negative eigenvalues shown above.
Does anyone know why the matrix is not PSD? Thank you.
Of course that's a numerical problem, but I would say that Q is probably still PSD.
Notice that the largest eigenvalue is about 1.8e18 while the smallest is about -3.1e1, so their magnitudes differ by a factor of roughly 6e16, which is beyond the relative resolution of double precision (machine epsilon is about 2.2e-16). In fact, min(L) + max(L) == max(L) will probably return True, meaning that the minimum value is negligible compared to the maximum.
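To make that concrete, here is a tiny check (my own, using roughly the extreme eigenvalues reported above):
import numpy as np

lam_min, lam_max = -3.1e1, 1.8e18  # roughly the extreme eigenvalues from above
print(lam_min + lam_max == lam_max)  # True: lam_min is lost to rounding
print(abs(lam_min / lam_max) < np.finfo(np.float64).eps)  # the ratio is below machine epsilon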
What I would suggest is to compute the Cholesky factorization of a slightly shifted version of the matrix, e.g.:
import numpy as np  # Q is the matrix from the question

d = np.linalg.norm(Q) * np.finfo(Q.dtype).eps  # jitter proportional to the norm and machine epsilon
I = np.eye(len(Q))
L = np.linalg.cholesky(Q + d * I)

Automatic Differentiation with respect to rank-based computations

I'm new to automatic differentiation programming, so this may be a naive question. Below is a simplified version of what I'm trying to solve.
I have two input arrays: a vector A of size N and a matrix B of shape (N, M), as well as a parameter vector theta of size M. I define a new array C(theta) = B * theta to get a new vector of size N. I then obtain the indices of elements that fall in the upper and lower quartiles of C, and use them to create new arrays A_low(theta) = A[lower quartile indices of C] and A_high(theta) = A[upper quartile indices of C]. Clearly these two do depend on theta, but is it possible to differentiate A_low and A_high w.r.t. theta?
My attempts so far seem to suggest no. I have tried the Python libraries autograd, JAX, and TensorFlow, but they all return a gradient of zero. (The approaches I have tried so far involve using argsort or extracting the relevant sub-arrays with tf.top_k.)
What I'm seeking help with is either a proof that the derivative is not defined (or cannot be analytically computed) or if it does exist, a suggestion on how to estimate it. My eventual goal is to minimize some function f(A_low, A_high) wrt theta.
This is the JAX computation that I wrote based on your description:
import numpy as np
import jax.numpy as jnp
import jax
from jax import lax

N = 10
M = 20

rng = np.random.default_rng(0)
A = jnp.array(rng.random((N,)))
B = jnp.array(rng.random((N, M)))
theta = jnp.array(rng.random(M))

def f(A, B, theta, k=3):
    C = B @ theta
    _, i_upper = lax.top_k(C, k)
    _, i_lower = lax.top_k(-C, k)
    return A[i_lower], A[i_upper]

x, y = f(A, B, theta)
dx_dtheta, dy_dtheta = jax.jacobian(f, argnums=2)(A, B, theta)
The derivatives are all zero, and I believe this is correct, because the change in value of the outputs does not depend on the change in value of theta.
But, you might ask, how can this be? After all, theta enters into the computation, and if you put in a different value for theta, you get different outputs. How could the gradient be zero?
What you must keep in mind, though, is that differentiation doesn't measure whether an input affects an output. It measures the change in output given an infinitesimal change in input.
Let's use a slightly simpler function as an example:
import jax
import jax.numpy as jnp
A = jnp.array([1.0, 2.0, 3.0])
theta = jnp.array([5.0, 1.0, 3.0])
def f(A, theta):
    return A[jnp.argmax(theta)]
x = f(A, theta)
dx_dtheta = jax.grad(f, argnums=1)(A, theta)
Here the result of differentiating f with respect to theta is all zero, for the same reasons as above. Why? If you make an infinitesimal change to theta, it will in general not affect the sort order of theta. Thus, the entries you choose from A do not change given an infinitesimal change in theta, and thus the derivative with respect to theta is zero.
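Continuing the snippet above, a quick numerical check (my addition, not part of the original code) shows that a small perturbation of theta leaves the output unchanged, which is consistent with a zero gradient:
print(f(A, theta))  # 1.0, since argmax(theta) == 0
print(f(A, theta + 1e-6 * jnp.array([0.0, 1.0, 0.0])))  # still 1.0: the argmax does not change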
Now, you might argue that there are circumstances where this is not the case: for example, if two values in theta are very close together, then certainly perturbing one even infinitesimally could change their respective rank. This is true, but the gradient resulting from this procedure is undefined (the change in output is not smooth with respect to the change in input). The good news is this discontinuity is one-sided: if you perturb in the other direction, there is no change in rank and the gradient is well-defined. In order to avoid undefined gradients, most autodiff systems will implicitly use this safer definition of a derivative for rank-based computations.
The result is that the value of the output does not change when you infinitesimally perturb the input, which is another way of saying the gradient is zero. And this is not a failure of autodiff – it is the correct gradient given the definition of differentiation that autodiff is built on. Moreover, were you to try changing to a different definition of the derivative at these discontinuities, the best you could hope for would be undefined outputs, so the definition that results in zeros is arguably more useful and correct.

Implementation of Isotropic squared exponential kernel with numpy

I've come across a from-scratch implementation of Gaussian processes:
http://krasserm.github.io/2018/03/19/gaussian-processes/
There, the isotropic squared exponential kernel is implemented in NumPy. It is defined as

kappa(x_i, x_j) = sigma_f^2 * exp(-||x_i - x_j||^2 / (2 * l^2))

The implementation is:
import numpy as np

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    # Pairwise squared Euclidean distances between the rows of X1 and the rows of X2
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)
consistent with the implementation of Nando de Freitas: https://www.cs.ubc.ca/~nando/540-2013/lectures/gp.py
However, I am not quite sure how this implementation matches the provided formula, especially in the sqdist part. In my opinion it is wrong, but it works (and delivers the same results as scipy's cdist with squared Euclidean distance). Why do I think it is wrong? If you multiply out the squared norm, you get

x_i^T x_i - 2 x_i^T x_j + x_j^T x_j

which equals either a scalar or an n x n matrix for a vector x_i, depending on whether you define x_i to be a column vector or not. The implementation, however, gives back an n x 1 vector with the squared values.
I hope that anyone can shed light on this.
I found out that the implementation is correct. I just was not aware of the (in my opinion fuzzy) notation which is sometimes used in ML contexts. What is to be achieved is a distance matrix: each row vector of matrix A is compared with each row vector of matrix B to infer the covariance matrix, not (as I somehow guessed) the direct distance between two matrices/vectors.
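For what it's worth, here is a small check (my own sketch, not from the linked post) that the sqdist expansion used in the kernel matches scipy's cdist with the squared Euclidean metric:
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X1 = rng.random((5, 3))  # 5 points in 3 dimensions
X2 = rng.random((7, 3))  # 7 points in 3 dimensions

# ||x_i - x_j||^2 = x_i.x_i + x_j.x_j - 2 x_i.x_j, evaluated for all row pairs at once
sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
print(np.allclose(sqdist, cdist(X1, X2, metric='sqeuclidean')))  # True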

Constrained np.polyfit

I am trying to fit a quadratic to some experimental data using polyfit in numpy. I am looking to get a concave curve, so I want to make sure that the coefficient of the quadratic term is negative. The fit itself is also weighted, i.e., there are weights on the points. Is there an easy way to do that? Thanks.
The use of weights is described here (numpy.polyfit).
Basically, you need a weight vector with the same length as x and y.
To avoid the wrong sign in the coefficient, you could use a fit function definition like
def fitfunc(x, a, b, c):
    return -1 * abs(a) * x**2 + b * x + c
This will give you a negative coefficient for x**2 at all times.
You can use scipy.optimize.curve_fit with a model function like the one above (a sketch follows below).
Or you can run polyfit with degree 2, and if the coefficient of the x**2 term comes out greater than 0, run a linear fit instead (polyfit with degree 1).
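Here is a minimal sketch (my own, under the assumption that the weights are of the 1/sigma kind as in numpy.polyfit) combining the sign-constrained fitfunc above with scipy.optimize.curve_fit; x, y, and w stand in for the experimental data and weights:
import numpy as np
from scipy.optimize import curve_fit

def fitfunc(x, a, b, c):
    return -1 * abs(a) * x**2 + b * x + c

# Placeholder data and weights, for illustration only
x = np.linspace(0, 10, 50)
y = -0.5 * x**2 + 2 * x + 1 + np.random.normal(0, 1, x.size)
w = np.ones_like(x)

popt, _ = curve_fit(fitfunc, x, y, sigma=1 / w)  # curve_fit takes uncertainties, so sigma = 1/w
a, b, c = popt
print("quadratic coefficient:", -abs(a))  # guaranteed <= 0 by construction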