Worhp: Local point of infeasibility - worhp

I have a problem that is solved successfully with ipopt and fmincon. worhp terminates on local infeasibility. My x0 (init) is feasible.
This may happen with the interior point algorithm, but I expect sqp to always stay in the feasible zone?

Maybe also check the derivatives with WORHP by enabling CheckValuesDF, CheckValuesDG, CheckValuesHM, CheckStructureDF, CheckStructureDG and CheckStructureHM if you provide them. What I am pointing at is that WORHP requires a very special coordinate storage format (in particular for the Hessian). Mistakes here lead to false search directions.

Due to the approximation error of the QP subproblem this is not something you can expect in general. Consider the problem
which will have the QP subproblems
for a current x and Lagrangian multiplier lambda, as can be seen by determining the necessary derivatives. With initial values x_0 = 0 and lambda_0 = 1 we have a feasible initial guess. The first QP to be solved is then
which has the unique solution d = 2. Now, depending on the implemented linesearch, the full step might be taken, i.e. the next iterate is x_1 = x_0 + d. That means x_1 = 2 which is not a feasible point anymore. In fact, WORHP's SQP algorithm will iterate like this if you disable the par.InitialLMest and eventually find the global optimum at x = 1.
Apart from this fundamental property there can also be other effects leading to iterates leaving the feasible set, that will very much be specific to the actual solver implementation. For example numerical inaccuracies, difficulties during the solution of a QP or certain recovery strategies. As to why your problem is not solved successfully using the SQP algorithm of WORHP, I am unable to say much without knowing anything about the problem itself.

Related

Differential evolution algorithm different results for different runs

As the title says, I am using the Differential Evolution algorithm as implemented in the Python mystic package for a global optimisation problem for O(10) parameters, with bounds and constraints.
I am using the simple interface diffev
result = my.diffev(func, x0, npop = 10*len(list(bnds)), bounds = bnds,
ftol = 1e-11, gtol = gtol, maxiter = 1024**3, maxfun = 1024**3,
constraints = constraint_eq, penalty = penalty,
full_output=True, itermon=mon, scale = scale)
I was experimenting running the SAME optimisation over several times: given a scaling for the differential evolution algorithm, I run 10 times the optimisation problem.
Result? I get different answers for almost all the results!
I experiment with scaling of 0.7, 0.75, 0.8, and 0.85, all roughly same bad behaviour (as suggested on the mystic page).
Here there is an example: on the x-axis there are the parameters, on the y-axis their values. The labels represent the iteration. Ideally you want to see only one line.
I run with gtol = 3500, so it should be quite long. I am using npop = 10*number pars, ftol = 1e-11, and the other important arguments for the diffev algorithm are the default ones.
Does anyone have some suggestion for tuning the differential evolution with mystic? Is there a way to avoid this variance in the results? I know it is a stochastic algorithm, but I did not expect it to give different results while running on gtol of 3500. My understanding was also that this algorithm does not get stuck into local minima, but I might be wrong.
p.s.
This is not relevant for the question, but just to give some context of why this is important for me.
What I need to do for my work is to minimise a function, under the conditions above, for several input data: I optimize for each data configuration over the O(10) parameters, then the configuration with some parameters that gives the overall minimum is the 'chosen' one.
Now, if the optimiser is not stable, it might give me the wrong data configuration by chance as the optimal one, as I run over hundreds of them.
I'm the mystic author. As you state, differential evolution (DE) is a stochastic algorithm. Essentially, DE uses a random mutations on the current solution vector to come up with new candidate solutions. So, you can expect to get different results for different runs in many cases, especially when the function is nonlinear.
Theoretically, if you let it run forever, it will find the global minimum. However, most of us don't want to wait that long. So, there's termination conditions like gtol (change over generations) which sets the cutoff for number of iterations without improvement. There are also solver parameters that effect how the mutation is generated, like cross, scale, and strategy. Essentially, if you get different results for different runs, all that means is that you haven't tuned the optimizer for the particular cost function yet, and should try to play with the settings.
Of importance is the balance between npop and gtol, and that's where I often go first. You want to increase the population of candidates, generally, until it saturates (i.e. doesn't have an effect) or becomes too slow.
If you have other information you can constrain the problem with, that often helps (i.e. use constraints or penalty to restrict your search space).
I also use mystic's visualization tools to try to get an understanding of what the response surface looks like (i.e. visualization and interpolation of log data).
Short answer is, any solver that includes randomness in the algorithm will often need to be tuned before you get consistent results.

scipy.sparse.linalg: what's the difference between splu and factorized?

What's the difference between using
scipy.sparse.linalg.factorized(A)
and
scipy.sparse.linalg.splu(A)
Both of them return objects with .solve(rhs) method and for both it's said in the documentation that they use LU decomposition. I'd like to know the difference in performance for both of them.
More specificly, I'm writing a python/numpy/scipy app that implements dynamic FEM model. I need to solve an equation Au = f on each timestep. A is sparse and rather large, but doesn't depend on timestep, so I'd like to invest some time beforehand to make iterations faster (there may be thousands of them). I tried using scipy.sparse.linalg.inv(A), but it threw memory exceptions when the size of matrix was large. I used scipy.linalg.spsolve on each step until recently, and now am thinking on using some sort of decomposition for better performance. So if you have other suggestions aside from LU, feel free to propose!
They should both work well for your problem, assuming that A does not change with each time step.
scipy.sparse.linalg.inv(A) will return a dense matrix that is the same size as A, so it's no wonder it's throwing memory exceptions.
scipy.linalg.solve is also a dense linear solver, which isn't what you want.
Assuming A is sparse, to solve Au=f and you only want to solve Au=f once, you could use scipy.sparse.linalg.spsolve. For example
u = spsolve(A, f)
If you want to speed things up dramatically for subsequent solves, you would instead use scipy.sparse.linalg.factorized or scipy.sparse.linalg.splu. For example
A_inv = splu(A)
for t in range(iterations):
u_t = A_inv.solve(f_t)
or
A_solve = factorized(A)
for t in range(iterations):
u_t = A_solve(f_t)
They should both be comparable in speed, and much faster than the previous options.
As #sascha said, you will need to dig into the documentation to see the differences between splu and factorize. But, you can use 'umfpack' instead of the default 'superLU' if you have it installed and set up correctly. I think umfpack will be faster in most cases. Keep in mind that if your matrix A is too large or has too many non-zeros, an LU decomposition / direct solver may take too much memory on your system. In this case, you might be stuck with using an iterative solver such as this. Unfortunately, you wont be able to reuse the solve of A at each time step, but you might be able to find a good preconditioner for A (approximation to inv(A)) to feed the solver to speed it up.

Implementing a 2D recursive spatial filter using Scipy

Minimally, I would like to know how to achieve what is stated in the title. Specifically, signal.lfilter seems like the only implementation of a difference equation filter in scipy, but it is 1D, as shown in the docs. I would like to know how to implement a 2D version as described by this difference equation. If that's as simple as "bro, use this function," please let me know, pardon my naiveté, and feel free to disregard the rest of the post.
I am new to DSP and acknowledging there might be a different approach to answering my question so I will explain the broader goal and give context for the question in the hopes someone knows how do want I want with Scipy, or perhaps a better way than what I explicitly asked for.
To get straight into it, broadly speaking I am using vectorized computation methods (Numpy/Scipy) to implement a Monte Carlo simulation to improve upon a naive for loop. I have successfully abstracted most of my operations to array computation / linear algebra, but a few specific ones (recursive computations) have eluded my intuition and I continually end up in the digital signal processing world when I go looking for how this type of thing has been done by others (that or machine learning but those "frameworks" are much opinionated). The reason most of my google searches end up on scipy.signal or scipy.ndimage library references is clear to me at this point, and subsequent to accepting the "signal" representation of my data, I have spent a considerable amount of time (about as much as reasonable for a field that is not my own) ramping up the learning curve to try and figure out what I need from these libraries.
My simulation entails updating a vector of data representing the state of a system each period for n periods, and then repeating that whole process a "Monte Carlo" amount of times. The updates in each of n periods are inherently recursive as the next depends on the state of the prior. It can be characterized as a difference equation as linked above. Additionally this vector is theoretically indexed on an grid of points with uneven stepsize. Here is an example vector y and its theoretical grid t:
y = np.r_[0.0024, 0.004, 0.0058, 0.0083, 0.0099, 0.0133, 0.0164]
t = np.r_[0.25, 0.5, 1, 2, 5, 10, 20]
I need to iteratively perform numerous operations to y for each of n "updates." Specifically, I am computing the curvature along the curve y(t) using finite difference approximations and using the result at each point to adjust the corresponding y(t) prior to the next update. In a loop this amounts to inplace variable reassignment with the desired update in each iteration.
y += some_function(y)
Not only does this seem inefficient, but vectorizing things seems intuitive given y is a vector to begin with. Furthermore I am interested in preserving each "updated" y(t) along the n updates, which would require a data structure of dimensions len(y) x n. At this point, why not perform the updates inplace in the array? This is wherein lies the question. Many of the update operations I have succesfully vectorized the "Numpy way" (such as adding random variates to each point), but some appear overly complex in the array world.
Specifically, as mentioned above the one involving computing curvature at each element using its neighbouring two elements, and then imediately using that result to update the next row of the array before performing its own curvature "update." I was able to implement a non-recursive version (each row fails to consider its "updated self" from the prior row) of the curvature operation using ndimage generic_filter. Given the uneven grid, I have unique coefficients (kernel weights) for each triplet in the kernel footprint (instead of always using [1,-2,1] for y'' if I had a uniform grid). This last part has already forced me to use a spatial filter from ndimage rather than a 1d convolution. I'll point out, something conceptually similar was discussed in this math.exchange post, and it seems to me only the third response saliently addressed the difference between mathematical notion of "convolution" which should be associative from general spatial filtering kernels that would require two sequential filtering operations or a cleverly merged kernel.
In any case this does not seem to actually address my concern as it is not about 2D recursion filtering but rather having a backwards looking kernel footprint. Additionally, I think I've concluded it is not applicable in that this only allows for "recursion" (backward looking kernel footprints in the spatial filtering world) in a manner directly proportional to the size of the recursion. Meaning if I wanted to filter each of n rows incorporating calculations on all prior rows, it would require a convolution kernel far too big (for my n anyways). If I'm understanding all this correctly, a recursive linear filter is algorithmically more efficient in that it returns (for use in computation) the result of itself applied over the previous n samples (up to a level where the stability of the algorithm is affected) using another companion vector (z). In my case, I would only need to look back one step at output signal y[n-1] to compute y[n] from curvature at x[n] as the rest works itself out like a cumsum. signal.lfilter works for this, but I can't used that to compute curvature, as that requires a kernel footprint that can "see" at least its left and right neighbors (pixels), which is how I ended up using generic_filter.
It seems to me I should be able to do both simultaneously with one filter namely spatial and recursive filtering; or somehow I've missed the maths of how this could be mathematically simplified/combined (convolution of multiples kernels?).
It seems like this should be a common problem, but perhaps it is rarely relevant to do both at once in signal processing and image filtering. Perhaps this is why you don't use signals libraries solely to implement a fast monte carlo simulation; though it seems less esoteric than using a tensor math library to implement a recursive neural network scan ... which I'm attempting to do right now.
EDIT: For those familiar with the theoretical side of DSP, I know that what I am describing, the process of designing a recursive filters with arbitrary impulse responses, is achieved by employing a mathematical technique called the z-transform which I understand is generally used for two things:
converting between the recursion coefficients and the frequency response
combining cascaded and parallel stages into a single filter
Both are exactly what I am trying to accomplish.
Also, reworded title away from FIR / IIR because those imply specific definitions of "recursion" and may be confusing / misnomer.

scipy.optimize.fmin_l_bfgs_b returns 'ABNORMAL_TERMINATION_IN_LNSRCH'

I am using scipy.optimize.fmin_l_bfgs_b to solve a gaussian mixture problem. The means of mixture distributions are modeled by regressions whose weights have to be optimized using EM algorithm.
sigma_sp_new, func_val, info_dict = fmin_l_bfgs_b(func_to_minimize, self.sigma_vector[si][pj],
args=(self.w_vectors[si][pj], Y, X, E_step_results[si][pj]),
approx_grad=True, bounds=[(1e-8, 0.5)], factr=1e02, pgtol=1e-05, epsilon=1e-08)
But sometimes I got a warning 'ABNORMAL_TERMINATION_IN_LNSRCH' in the information dictionary:
func_to_minimize value = 1.14462324063e-07
information dictionary: {'task': b'ABNORMAL_TERMINATION_IN_LNSRCH', 'funcalls': 147, 'grad': array([ 1.77635684e-05, 2.87769808e-05, 3.51718654e-05,
6.75015599e-06, -4.97379915e-06, -1.06581410e-06]), 'nit': 0, 'warnflag': 2}
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 6 M = 10
This problem is unconstrained.
At X0 0 variables are exactly at the bounds
At iterate 0 f= 1.14462D-07 |proj g|= 3.51719D-05
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
6 1 21 1 0 0 3.517D-05 1.145D-07
F = 1.144619474757747E-007
ABNORMAL_TERMINATION_IN_LNSRCH
Line search cannot locate an adequate point after 20 function
and gradient evaluations. Previous x, f and g restored.
Possible causes: 1 error in function or gradient evaluation;
2 rounding error dominate computation.
Cauchy time 0.000E+00 seconds.
Subspace minimization time 0.000E+00 seconds.
Line search time 0.000E+00 seconds.
Total User time 0.000E+00 seconds.
I do not get this warning every time, but sometimes. (Most get 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL' or 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH').
I know that it means the minimum can be be reached in this iteration. I googled this problem. Someone said it occurs often because the objective and gradient functions do not match. But here I do not provide gradient function because I am using 'approx_grad'.
What are the possible reasons that I should investigate? What does it mean by "rounding error dominate computation"?
======
I also find that the log-likelihood does not monotonically increase:
########## Convergence !!! ##########
log_likelihood_history: [-28659.725891322563, 220.49993177669558, 291.3513633060345, 267.47745327823907, 265.31567762171181, 265.07311121000367, 265.04217683341682]
It usually start decrease at the second or the third iteration, even through 'ABNORMAL_TERMINATION_IN_LNSRCH' does not occurs. I do not know whether it this problem is related to the previous one.
Scipy calls the original L-BFGS-B implementation. Which is some fortran77 (old but beautiful and superfast code) and our problem is that the descent direction is actually going up. The problem starts on line 2533 (link to the code at the bottom)
gd = ddot(n,g,1,d,1)
if (ifun .eq. 0) then
gdold=gd
if (gd .ge. zero) then
c the directional derivative >=0.
c Line search is impossible.
if (iprint .ge. 0) then
write(0,*)' ascent direction in projection gd = ', gd
endif
info = -4
return
endif
endif
In other words, you are telling it to go down the hill by going up the hill. The code tries something called line search a total of 20 times in the descent direction that you provide and realizes that you are NOT telling it to go downhill, but uphill. All 20 times.
The guy who wrote it (Jorge Nocedal, who by the way is a very smart guy) put 20 because pretty much that's enough. Machine epsilon is 10E-16, I think 20 is actually a little too much. So, my money for most people having this problem is that your gradient does not match your function.
Now, it could also be that "2. rounding errors dominate computation". By this, he means that your function is a very flat surface in which increases are of the order of machine epsilon (in which case you could perhaps rescale the function),
Now, I was thiking that maybe there should be a third option, when your function is too weird. Oscillations? I could see something like $\sin({\frac{1}{x}})$ causing this kind of problem. But I'm not a smart guy, so don't assume that there's a third case.
So I think the OP's solution should be that your function is too flat. Or look at the fortran code.
https://github.com/scipy/scipy/blob/master/scipy/optimize/lbfgsb/lbfgsb.f
Here's line search for those who want to see it. https://en.wikipedia.org/wiki/Line_search
Note. This is 7 months too late. I put it here for future's sake.
As pointed out in the answer by Wilmer E. Henao, the problem is probably in the gradient. Since you are using approx_grad=True, the gradient is calculated numerically. In this case, reducing the value of epsilon, which is the step size used for numerically calculating the gradient, can help.
I also got the error "ABNORMAL_TERMINATION_IN_LNSRCH" using the L-BFGS-B optimizer.
While my gradient function pointed in the right direction, I rescaled the actual gradient of the function by its L2-norm. Removing that or adding another appropriate type of rescaling worked. Before, I guess that the gradient was so large that it went out of bounds immediately.
The problem from OP was unbounded if I read correctly, so this will certainly not help in this problem setting. However, googling the error "ABNORMAL_TERMINATION_IN_LNSRCH" yields this page as one of the first results, so it might help others...
I had a similar problem recently. I sometimes encounter the ABNORMAL_TERMINATION_IN_LNSRCH message after using fmin_l_bfgs_b function of scipy. I try to give additional explanations of the reason why I get this. I am looking for complementary details or corrections if I am wrong.
In my case, I provide the gradient function, so approx_grad=False. My cost function and the gradient are consistent. I double-checked it and the optimization actually works most of the time. When I get ABNORMAL_TERMINATION_IN_LNSRCH, the solution is not optimal, not even close (even this is a subjective point of view). I can overcome this issue by modifying the maxls argument. Increasing maxls helps to solve this issue to finally get the optimal solution. However, I noted that sometimes a smaller maxls, than the one that produces ABNORMAL_TERMINATION_IN_LNSRCH, results in a converging solution. A dataframe summarizes the results. I was surprised to observe this. I expected that reducing maxls would not improve the result. For this reason, I tried to read the paper describing the line search algorithm but I had trouble to understand it.
The line "search algorithm generates a sequence of
nested intervals {Ik} and a sequence of iterates αk ∈ Ik ∩ [αmin ; αmax] according to the [...] procedure". If I understand well, I would say that the maxls argument specifies the length of this sequence. At the end of the maxls iterations (or less if the algorithm terminates in fewer iterations), the line search stops. A final trial point is generated within the final interval Imaxls. I would say the the formula does not guarantee to get an αmaxls that respects the two update conditions, the minimum decrease and the curvature, especially when the interval is still wide. My guess is that in my case, after 11 iterations the generated interval I11 is such that a trial point α11 respects both conditions. But, even though I12 is smaller and still containing acceptable points, α12 is not. Finally after 24 iterations, the interval is very small and the generated αk respects the update conditions.
Is my understanding / explanation accurate?
If so, I would then be surprised that when maxls=12, since the generated α11 is acceptable but not α12, why α11 is not chosen in this case instead of α12?
Pragmatically, I would recommend to try a few higher maxls when getting ABNORMAL_TERMINATION_IN_LNSRCH.

Constrained optimization with hessian in scipy

I want to minimize a function, subject to constraints (the variables are non-negative). I can compute the gradient and Hessian exactly. So I want something like:
result = scipy.optimize.minimize(objective, x0, jac=grad, hess=hess, bounds=bds)
I need to specify a method for the optimization (http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html). Unfortunately I can't seem to find a method that allows for both user-specified bounds and a Hessian!
This is particularly annoying because methods "TNC" and "Newton-CG" seem essentially the same, however TNC estimates Hessian internally (in C code), while Newton-CG doesn't allow for constraints.
So, how can I do a constrained optimization with user-specified Hessian? Seems like there ought to be an easy option for this in scipy -- am I missing something?
I realized a workaround for my problem, which is to transform the constrained optimization into an unconstrained optimization.
In my case, since I have the constraint x > 0, I decided to optimize over log(x) instead of x. This was easy to do for my problem since I am using automatic differentiation.
Still, this seems like a somewhat unsatisfying solution -- I still think scipy should allow some constrained second-order minimization method.
just bumped into exactly this point myself. I think the TNC applies an active set to the line search of the CG, not the direction of the line search. Conversely the Hessian chooses the direction of the line. So, er, could maybe cut the line search out of NCG and drop it into TNC. Problem is when you are at the boundary the Hessian might not take you out of it.
How about using TNC for an extremely sloppy first guess [give it a really large error bound to hit], then use NCG with a small number of iterations, check: if on boundary back to TNC, else continue with NCG. Ugh...
Yes, or use log(x). I'm going to follow your lead.