Constrained optimization with hessian in scipy - numpy

I want to minimize a function, subject to constraints (the variables are non-negative). I can compute the gradient and Hessian exactly. So I want something like:
result = scipy.optimize.minimize(objective, x0, jac=grad, hess=hess, bounds=bds)
I need to specify a method for the optimization (http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html). Unfortunately I can't seem to find a method that allows for both user-specified bounds and a Hessian!
This is particularly annoying because methods "TNC" and "Newton-CG" seem essentially the same; however, TNC estimates the Hessian internally (in C code), while Newton-CG doesn't allow for constraints.
So, how can I do a constrained optimization with user-specified Hessian? Seems like there ought to be an easy option for this in scipy -- am I missing something?

I realized a workaround for my problem, which is to transform the constrained optimization into an unconstrained optimization.
In my case, since I have the constraint x > 0, I decided to optimize over log(x) instead of x. This was easy to do for my problem since I am using automatic differentiation.
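For concreteness, here is a minimal sketch of the reparametrization (with a made-up objective f and its gradient standing in for my real problem; I actually get the transformed gradient from automatic differentiation, but the explicit chain rule shows the idea):

import numpy as np
from scipy.optimize import minimize

# Made-up objective and gradient; stand-ins for the real f.
def f(x):
    return np.sum((x - 3.0) ** 2)

def grad_f(x):
    return 2.0 * (x - 3.0)

# Reparametrize x = exp(z): any z gives x > 0, so the problem is unconstrained.
def g(z):
    return f(np.exp(z))

def grad_g(z):
    x = np.exp(z)
    return grad_f(x) * x  # chain rule: dg/dz = df/dx * dx/dz, and dx/dz = exp(z) = x

z0 = np.zeros(5)
res = minimize(g, z0, jac=grad_g, method='BFGS')
x_opt = np.exp(res.x)  # map back to the original variables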
Still, this seems like a somewhat unsatisfying solution -- I still think scipy should allow some constrained second-order minimization method.

Just bumped into exactly this point myself. I think TNC applies an active set to the line search of the CG, not to the direction of the line search; conversely, the Hessian chooses the direction of the line search. So, er, one could maybe cut the line search out of NCG and drop it into TNC. The problem is that when you are at the boundary, the Hessian might not take you out of it.
How about using TNC for an extremely sloppy first guess [give it a really large error bound to hit], then using NCG with a small number of iterations, and checking: if on the boundary, go back to TNC; else continue with NCG. Ugh...
Yes, or use log(x). I'm going to follow your lead.

scipy.sparse.linalg: what's the difference between splu and factorized?

What's the difference between using
scipy.sparse.linalg.factorized(A)
and
scipy.sparse.linalg.splu(A)
Both of them return objects with a .solve(rhs) method, and for both the documentation says they use LU decomposition. I'd like to know the difference in performance between them.
More specifically, I'm writing a python/numpy/scipy app that implements a dynamic FEM model. I need to solve an equation Au = f on each timestep. A is sparse and rather large, but doesn't depend on the timestep, so I'd like to invest some time beforehand to make iterations faster (there may be thousands of them). I tried using scipy.sparse.linalg.inv(A), but it threw memory exceptions when the size of the matrix was large. I used scipy.linalg.spsolve on each step until recently, and am now thinking of using some sort of decomposition for better performance. So if you have suggestions other than LU, feel free to propose!
They should both work well for your problem, assuming that A does not change with each time step.
scipy.sparse.linalg.inv(A) will return a dense matrix that is the same size as A, so it's no wonder it's throwing memory exceptions.
scipy.linalg.solve is also a dense linear solver, which isn't what you want.
Assuming A is sparse, if you only want to solve Au = f once, you can use scipy.sparse.linalg.spsolve. For example
u = spsolve(A, f)
If you want to speed things up dramatically for subsequent solves, you would instead use scipy.sparse.linalg.factorized or scipy.sparse.linalg.splu. For example
from scipy.sparse.linalg import splu
A_inv = splu(A)  # factorize once (A should be in CSC format)
for t in range(iterations):
    u_t = A_inv.solve(f_t)
or
from scipy.sparse.linalg import factorized
A_solve = factorized(A)  # also factorizes once; returns a solve function
for t in range(iterations):
    u_t = A_solve(f_t)
They should both be comparable in speed, and much faster than the previous options.
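To make this concrete, here is a self-contained sketch (with a made-up tridiagonal matrix standing in for your FEM matrix; factorize once, then reuse the factorization every timestep):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 1000
# Made-up sparse system standing in for the FEM matrix A.
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csc')
solve = splu(A)  # pay the factorization cost once

rng = np.random.default_rng(0)
for t in range(100):                   # thousands of timesteps in the real app
    f_t = rng.standard_normal(n)       # stand-in for this timestep's load vector
    u_t = solve.solve(f_t)             # each solve is now just two triangular solves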
As @sascha said, you will need to dig into the documentation to see the differences between splu and factorized. But you can use 'umfpack' instead of the default 'superLU' if you have it installed and set up correctly. I think umfpack will be faster in most cases. Keep in mind that if your matrix A is too large or has too many non-zeros, an LU decomposition / direct solver may take too much memory on your system. In this case, you might be stuck with using an iterative solver. Unfortunately, you won't be able to reuse the solve of A at each time step, but you might be able to find a good preconditioner for A (an approximation to inv(A)) to feed to the solver to speed it up.
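If you do end up needing an iterative solver, here is a hedged sketch of that last idea, using an incomplete LU factorization as a preconditioner for GMRES (whether ILU is a good preconditioner depends entirely on your A):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu, gmres, LinearOperator

n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csc')
f = np.ones(n)

ilu = spilu(A)                          # incomplete LU: cheap, approximate factorization
M = LinearOperator(A.shape, ilu.solve)  # use it as a preconditioner ~ inv(A)
u, info = gmres(A, f, M=M)              # info == 0 means it converged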

Worhp: Local point of infeasibility

I have a problem that is solved successfully with Ipopt and fmincon, but WORHP terminates with "local point of infeasibility". My x0 (initial guess) is feasible.
This may happen with an interior-point algorithm, but I would expect SQP to always stay in the feasible region?
Maybe also check the derivatives with WORHP by enabling CheckValuesDF, CheckValuesDG, CheckValuesHM, CheckStructureDF, CheckStructureDG and CheckStructureHM, if you provide them. What I am pointing at is that WORHP requires a very specific coordinate storage format (in particular for the Hessian), and mistakes here lead to wrong search directions.
Due to the approximation error of the QP subproblem this is not something you can expect in general. Consider, for example, the problem

min_x (x - 2)^2   s.t.   x^3 <= 1,

which has the QP subproblems

min_d 2 (x - 2) d + (1 + 3 lambda x) d^2   s.t.   x^3 - 1 + 3 x^2 d <= 0

for a current x and Lagrangian multiplier lambda, as can be seen by determining the necessary derivatives. With the initial values x_0 = 0 and lambda_0 = 1 we have a feasible initial guess. The first QP to be solved is then

min_d -4 d + d^2   s.t.   -1 <= 0,

whose linearized constraint is satisfied by every d, and which therefore has the unique solution d = 2. Now, depending on the implemented linesearch, the full step might be taken, i.e. the next iterate is x_1 = x_0 + d. That means x_1 = 2, which is not a feasible point anymore, since 2^3 = 8 > 1. In fact, WORHP's SQP algorithm will iterate like this if you disable par.InitialLMest, and it eventually finds the global optimum at x = 1.
Apart from this fundamental property there can also be other effects leading to iterates leaving the feasible set, that will very much be specific to the actual solver implementation. For example numerical inaccuracies, difficulties during the solution of a QP or certain recovery strategies. As to why your problem is not solved successfully using the SQP algorithm of WORHP, I am unable to say much without knowing anything about the problem itself.
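For what it's worth, the toy problem above is easy to replay with scipy's SQP implementation (SLSQP here, not WORHP; a sketch only, and the intermediate iterates are of course solver-specific):

import numpy as np
from scipy.optimize import minimize

# min (x - 2)^2  s.t.  x^3 <= 1, i.e. 1 - x^3 >= 0, starting from the feasible x0 = 0
res = minimize(lambda x: (x[0] - 2) ** 2,
               x0=[0.0],
               method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda x: 1 - x[0] ** 3}])
print(res.x)  # -> approximately [1.], the constrained optimum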

Errors to fit parameters of scipy.optimize

I use the scipy.optimize.minimize function (https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html) with method='L-BFGS-B'.
An example of what it returns is shown below:
fun: 32.372210618549758
hess_inv: <6x6 LbfgsInvHessProduct with dtype=float64>
jac: array([ -2.14583906e-04, 4.09272616e-04, -2.55795385e-05,
3.76587650e-05, 1.49213975e-04, -8.38440428e-05])
message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
nfev: 420
nit: 51
status: 0
success: True
x: array([ 0.75739412, -0.0927572 , 0.11986434, 1.19911266, 0.27866406,
-0.03825225])
The x value correctly contains the fitted parameters. How do I compute the errors associated with those parameters?
TL;DR: You can actually place an upper bound on how precisely the minimization routine has found the optimal values of your parameters. See the snippet at the end of this answer that shows how to do it directly, without resorting to calling additional minimization routines.
The documentation for this method says
The iteration stops when (f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= ftol.
Roughly speaking, the minimization stops when the value of the function f that you're minimizing is minimized to within ftol of the optimum. (This is a relative error if f is greater than 1, and absolute otherwise; for simplicity I'll assume it's an absolute error.) In more standard language, you'll probably think of your function f as a chi-squared value. So this roughly suggests that you would expect

Δf := f_found - f_optimal ≲ ftol · max{1, |f|}.

Of course, just the fact that you're applying a minimization routine like this assumes that your function is well behaved, in the sense that it's reasonably smooth and the optimum being found is well approximated near the optimum by a quadratic function of the parameters x_i:

f(x) ≈ f_optimal + (1/2) Σ_ij Δx_i H_ij Δx_j,

where Δx_i is the difference between the found value of parameter x_i and its optimal value, and H_ij is the Hessian matrix. A little (surprisingly nontrivial) linear algebra gets you to a pretty standard result for an estimate of the uncertainty in any quantity X that's a function of your parameters x_i (dropping factors of order one, which a rough bound like this can't resolve anyway):

(ΔX)^2 ≈ Δf · Σ_ij (∂X/∂x_i) (H^-1)_ij (∂X/∂x_j),

which lets us write

ΔX ≈ sqrt( ftol · max{1, |f|} · Σ_ij (∂X/∂x_i) (H^-1)_ij (∂X/∂x_j) ).

That's the most useful formula in general, but for the specific question here we just have X = x_i, so this simplifies to

Δx_i ≈ sqrt( ftol · max{1, |f|} · (H^-1)_ii ).
Finally, to be totally explicit, let's say you've stored the optimization result in a variable called res. The inverse Hessian is available as res.hess_inv, which is a function that takes a vector and returns the product of the inverse Hessian with that vector. So, for example, we can display the optimized parameters along with the uncertainty estimates with a snippet like this:
import numpy as np

ftol = 2.220446049250313e-09
tmp_i = np.zeros(len(res.x))
for i in range(len(res.x)):
    # probe the inverse Hessian with the i-th unit vector
    tmp_i[i] = 1.0
    hess_inv_i = res.hess_inv(tmp_i)[i]
    uncertainty_i = np.sqrt(max(1, abs(res.fun)) * ftol * hess_inv_i)
    tmp_i[i] = 0.0
    print('x^{0} = {1:12.4e} ± {2:.1e}'.format(i, res.x[i], uncertainty_i))
Note that I've incorporated the max behavior from the documentation, assuming that f^k and f^{k+1} are basically just the same as the final output value, res.fun, which really ought to be a good approximation. Also, for small problems, you can just use np.diag(res.hess_inv.todense()) to get the full inverse and extract the diagonal all at once. But for large numbers of variables, I've found that to be a much slower option. Finally, I've added the default value of ftol, but if you change it in an argument to minimize, you would obviously need to change it here.
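Putting it all together, here is a self-contained toy run (the quadratic objective is made up purely for illustration):

import numpy as np
from scipy.optimize import minimize

def f(x):
    return np.sum((x - np.arange(len(x))) ** 2)  # made-up objective, minimum at x_i = i

res = minimize(f, x0=np.zeros(4), method='L-BFGS-B')
ftol = 2.220446049250313e-09
cov_diag = np.diag(res.hess_inv.todense())  # fine for small problems
uncertainties = np.sqrt(max(1, abs(res.fun)) * ftol * cov_diag)
for xi, dxi in zip(res.x, uncertainties):
    print('{0:12.4e} ± {1:.1e}'.format(xi, dxi))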
One approach to this common problem is to use scipy.optimize.leastsq after minimize with 'L-BFGS-B', starting leastsq from the solution found by 'L-BFGS-B'. That is, leastsq will (normally) include an estimate of the 1-sigma errors as well as the solution.
Of course, that approach makes several assumptions, including that leastsq can be used and is appropriate for solving the problem. From a practical view, this requires that the objective function return an array of residual values with at least as many elements as variables, not a scalar cost function.
You may find lmfit (https://lmfit.github.io/lmfit-py/) useful here: It supports both 'L-BFGS-B' and 'leastsq' and gives a uniform wrapper around these and other minimization methods, so that you can use the same objective function for both methods (and specify how to convert the residual array into the cost function). In addition, parameter bounds can be used for both methods. This makes it very easy to first do a fit with 'L-BFGS-B' and then with 'leastsq', using the values from 'L-BFGS-B' as starting values.
Lmfit also provides methods to explore confidence limits on parameter values in more detail, in case you suspect the simple but fast approach used by leastsq might be insufficient.
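Here is a hedged sketch of that two-stage pattern with lmfit (the model and data are made up; check the lmfit docs for the exact method names and options):

import numpy as np
import lmfit

x = np.linspace(0, 10, 101)
y = 3.0 * np.exp(-x / 2.0) + np.random.default_rng(1).normal(0, 0.05, x.size)

def residual(params, x, y):
    # return the residual array; lmfit/leastsq build the cost function from it
    amp = params['amp'].value
    tau = params['tau'].value
    return y - amp * np.exp(-x / tau)

params = lmfit.Parameters()
params.add('amp', value=1.0, min=0)   # bounds work for both methods
params.add('tau', value=1.0, min=0)

rough = lmfit.minimize(residual, params, args=(x, y), method='lbfgsb')  # fast first pass
fit = lmfit.minimize(residual, rough.params, args=(x, y), method='leastsq')
print(fit.params['amp'].value, fit.params['amp'].stderr)  # value with 1-sigma error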
It really depends what you mean by "errors". There is no general answer to your question, because it depends on what you're fitting and what assumptions you're making.
The easiest case is one of the most common: when the function you are minimizing is a negative log-likelihood. In that case, the inverse of the Hessian matrix returned by the fit (hess_inv) is the covariance matrix describing the Gaussian approximation to the maximum likelihood. The parameter errors are the square roots of the diagonal elements of the covariance matrix.
Beware that if you are fitting a different kind of function or are making different assumptions, then that doesn't apply.
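A minimal sketch of that common case, with made-up Gaussian data and the mean and width fitted by minimizing the negative log-likelihood (with method='BFGS', hess_inv is a plain dense array):

import numpy as np
from scipy.optimize import minimize

data = np.random.default_rng(0).normal(5.0, 2.0, size=1000)

def nll(p):
    mu, sigma = p
    if sigma <= 0:
        return np.inf  # keep the line search out of the invalid region
    # negative log-likelihood of a Gaussian (up to an additive constant)
    return 0.5 * np.sum(((data - mu) / sigma) ** 2) + data.size * np.log(sigma)

res = minimize(nll, x0=[4.0, 1.5], method='BFGS')
errors = np.sqrt(np.diag(res.hess_inv))  # 1-sigma errors from the covariance matrix
print(res.x, errors)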

Implementing a 2D recursive spatial filter using Scipy

Minimally, I would like to know how to achieve what is stated in the title. Specifically, signal.lfilter seems like the only implementation of a difference equation filter in scipy, but it is 1D, as shown in the docs. I would like to know how to implement a 2D version as described by this difference equation. If that's as simple as "bro, use this function," please let me know, pardon my naiveté, and feel free to disregard the rest of the post.
I am new to DSP, and I acknowledge there might be a different approach to answering my question, so I will explain the broader goal and give context in the hope that someone knows how to do what I want with Scipy, or perhaps a better way than what I explicitly asked for.
To get straight into it: broadly speaking, I am using vectorized computation methods (Numpy/Scipy) to implement a Monte Carlo simulation, improving on a naive for loop. I have successfully abstracted most of my operations to array computation / linear algebra, but a few specific ones (recursive computations) have eluded my intuition, and I continually end up in the digital signal processing world when I go looking for how this type of thing has been done by others (that, or machine learning, but those "frameworks" are rather opinionated). The reason most of my google searches end up on scipy.signal or scipy.ndimage library references is clear to me at this point, and having accepted the "signal" representation of my data, I have spent a considerable amount of time (about as much as is reasonable for a field that is not my own) climbing the learning curve to figure out what I need from these libraries.
My simulation entails updating a vector of data representing the state of a system each period for n periods, and then repeating that whole process a "Monte Carlo" number of times. The updates in each of the n periods are inherently recursive, as the next depends on the state of the prior. They can be characterized as a difference equation, as linked above. Additionally, this vector is theoretically indexed on a grid of points with uneven step size. Here is an example vector y and its theoretical grid t:
y = np.r_[0.0024, 0.004, 0.0058, 0.0083, 0.0099, 0.0133, 0.0164]
t = np.r_[0.25, 0.5, 1, 2, 5, 10, 20]
I need to iteratively perform numerous operations to y for each of n "updates." Specifically, I am computing the curvature along the curve y(t) using finite difference approximations and using the result at each point to adjust the corresponding y(t) prior to the next update. In a loop this amounts to inplace variable reassignment with the desired update in each iteration.
y += some_function(y)
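For concreteness, here is a hedged sketch of one such update loop on the uneven grid above, using the standard three-point finite-difference stencil for y'' on a non-uniform grid (the step size dt, the number of updates n, and using the raw curvature as the adjustment are all made-up placeholders):

import numpy as np

y = np.r_[0.0024, 0.004, 0.0058, 0.0083, 0.0099, 0.0133, 0.0164]
t = np.r_[0.25, 0.5, 1, 2, 5, 10, 20]

def curvature(y, t):
    # second derivative on a non-uniform grid (interior points only)
    h1 = t[1:-1] - t[:-2]   # left spacings
    h2 = t[2:] - t[1:-1]    # right spacings
    d2 = np.zeros_like(y)
    d2[1:-1] = 2 * (h2 * y[:-2] - (h1 + h2) * y[1:-1] + h1 * y[2:]) / (h1 * h2 * (h1 + h2))
    return d2

n, dt = 50, 0.01
Y = np.empty((n + 1, len(y)))   # preserve every updated y(t)
Y[0] = y
for k in range(n):
    Y[k + 1] = Y[k] + dt * curvature(Y[k], t)   # each row depends on the prior row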
Not only does this seem inefficient, but vectorizing things seems intuitive given that y is a vector to begin with. Furthermore, I am interested in preserving each "updated" y(t) along the n updates, which would require a data structure of dimensions len(y) x n. At that point, why not perform the updates in place in the array? Herein lies the question. Many of the update operations I have successfully vectorized the "Numpy way" (such as adding random variates to each point), but some appear overly complex in the array world.
Specifically, as mentioned above, the one involving computing curvature at each element using its two neighbouring elements, and then immediately using that result to update the next row of the array before performing its own curvature "update". I was able to implement a non-recursive version (where each row fails to consider its "updated self" from the prior row) of the curvature operation using ndimage's generic_filter. Given the uneven grid, I have unique coefficients (kernel weights) for each triplet in the kernel footprint (instead of always using [1,-2,1] for y'' as I would on a uniform grid). This last part has already forced me to use a spatial filter from ndimage rather than a 1d convolution. I'll point out that something conceptually similar was discussed in this math.exchange post, and it seems to me only the third response saliently addressed the difference between the mathematical notion of "convolution", which should be associative, and general spatial-filtering kernels, which would require two sequential filtering operations or a cleverly merged kernel.
In any case this does not seem to actually address my concern, as it is not about 2D recursive filtering but rather about having a backward-looking kernel footprint. Additionally, I think I've concluded it is not applicable, in that it only allows for "recursion" (backward-looking kernel footprints, in the spatial-filtering world) in a manner directly proportional to the size of the kernel. Meaning: if I wanted to filter each of n rows incorporating calculations on all prior rows, it would require a convolution kernel far too big (for my n, anyway). If I'm understanding all this correctly, a recursive linear filter is algorithmically more efficient in that it reuses (for use in computation) the result of itself applied over the previous n samples (up to the point where the stability of the algorithm is affected) using another companion vector (z). In my case, I would only need to look back one step at the output signal y[n-1] to compute y[n] from the curvature at x[n], as the rest works itself out like a cumsum. signal.lfilter works for this, but I can't use it to compute the curvature, as that requires a kernel footprint that can "see" at least its left and right neighbors (pixels), which is how I ended up using generic_filter.
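For what it's worth, the one-step-back recursion alone is exactly what signal.lfilter expresses: the cumsum-like recursion y[n] = y[n-1] + x[n] is the IIR filter with coefficients b = [1], a = [1, -1] (made-up input):

import numpy as np
from scipy import signal

x = np.array([0.1, 0.2, 0.3, 0.4])
y = signal.lfilter([1.0], [1.0, -1.0], x)  # implements y[n] = x[n] + y[n-1]
assert np.allclose(y, np.cumsum(x))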
It seems to me I should be able to do both simultaneously with one filter, namely spatial and recursive filtering; or somehow I've missed the maths of how the two could be simplified/combined (a convolution of multiple kernels?).
It seems like this should be a common problem, but perhaps it is rarely relevant to do both at once in signal processing and image filtering. Perhaps this is why you don't use signal-processing libraries solely to implement a fast Monte Carlo simulation; though it seems less esoteric than using a tensor math library to implement a recursive neural network scan... which I'm attempting to do right now.
EDIT: For those familiar with the theoretical side of DSP, I know that what I am describing, the process of designing recursive filters with arbitrary impulse responses, is achieved by employing a mathematical technique called the z-transform, which I understand is generally used for two things:
converting between the recursion coefficients and the frequency response
combining cascaded and parallel stages into a single filter
Both are exactly what I am trying to accomplish.
Also, I reworded the title away from FIR / IIR because those imply specific definitions of "recursion" and may be confusing / a misnomer.

How to optimize non-negative constraints with gradient descent

I have an optimization in the following form,
argmin_W f(W)
s.t. W_i > 0, for all i
where W is a vector, and f(W) is a function on W.
I know how to optimize without the non-negative constraints. But I am unsure about how to optimize this with gradient descent.
Optimization on an open set is quite tricky, so let us assume the constraint is W_i >= 0; then you can use several methods:
optimize f(|W|) on the whole domain
use GD on f(W), but after each iteration project your solution back onto the domain, i.e. put W = |W| (see the sketch after this list)
use constrained optimization techniques, such as L-BFGS-B
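Here is a minimal sketch of the second option, projected gradient descent (the objective and step size are made up; I project by clipping with np.maximum, which is the Euclidean projection onto W_i >= 0, while the W = |W| variant above reflects instead):

import numpy as np

c = np.array([0.5, -1.0, 2.0])   # made-up target; note one negative component

def grad_f(W):
    # gradient of the made-up objective f(W) = ||W - c||^2
    return 2.0 * (W - c)

W = np.ones_like(c)
lr = 0.1
for _ in range(200):
    W = W - lr * grad_f(W)    # plain gradient step
    W = np.maximum(W, 0.0)    # project back onto the feasible set W_i >= 0
print(W)   # -> approximately [0.5, 0.0, 2.0]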
I don't think there is a general and simple way of doing it. You will have to do some sort of search at each point to make sure the constraints are met (techniques like line search, trust regions).
Or perhaps f has some structure you can exploit.