SLSQP in ScipyOptimizeDriver only executes one iteration, takes a very long time, then exits - optimization

I'm trying to use SLSQP to optimise the angle of attack of an aerofoil to place the stagnation point in a desired location. This is purely as a test case to check that my method for calculating the partials for the stagnation position is valid.
When run with COBYLA, the optimisation converges to the correct alpha (6.04144912) after 47 iterations. When run with SLSQP, it completes one iteration, then hangs for a very long time (10, 20 minutes or more, I didn't time it exactly), and exits with an incorrect value. The output is:
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|0
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Driver debug print for iter coord: rank0:ScipyOptimize_SLSQP|1
--------------------------------------------------------------
Design Vars
{'alpha': array([0.5])}
Nonlinear constraints
None
Linear constraints
None
Objectives
{'obj_cmp.obj': array([0.00023868])}
Optimization terminated successfully. (Exit mode 0)
Current function value: 0.0002386835700364719
Iterations: 1
Function evaluations: 1
Gradient evaluations: 1
Optimization Complete
-----------------------------------
Finished optimisation
Why might SLSQP be misbehaving like this? As far as I can tell, there are no incorrect analytical derivatives when I look at check_partials().
The code is quite long, so I put it on Pastebin here:
core: https://pastebin.com/fKJpnWHp
inviscid: https://pastebin.com/7Cmac5GF
aerofoil coordinates (NACA64-012): https://pastebin.com/UZHXEsr6

You asked two questions whos answers ended up being unrelated to eachother:
Why is the model so slow when you use SLSQP, but fast when you use COBYLA
Why does SLSQP stop after one iteration?
1) Why is SLSQP so slow?
COBYLA is a gradient free method. SLSQP uses gradients. So the solid bet was that slow down happened when SLSQP asked for the derivatives (which COBYLA never did).
Thats where I went to look first. Computing derivatives happens in two steps: a) compute partials for each component and b) solve a linear system with those partials to compute totals. The slow down has to be in one of those two steps.
Since you can run check_partials without too much trouble, step (a) is not likely to be the culprit. So that means step (b) is probably where we need to speed things up.
I ran the summary utility (openmdao summary core.py) on your model and saw this:
============== Problem Summary ============
Groups: 9
Components: 36
Max tree depth: 4
Design variables: 1 Total size: 1
Nonlinear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Linear Constraints: 0 Total size: 0
equality: 0 0
inequality: 0 0
Objectives: 1 Total size: 1
Input variables: 87 Total size: 1661820
Output variables: 44 Total size: 1169614
Total connections: 87 Total transfer data size: 1661820
Then I generated an N2 of your model and saw this:
So we have an output vector that is 1169614 elements long, which means your linear system is a matrix that is about 1e6x1e6. Thats pretty big, and you are using a DirectSolver to try and compute/store a factorization of it. Thats the source of the slow down. Using DirectSolvers is great for smaller models (rule of thumb, is that the output vector should be less than 10000 elements). For larger ones you need to be more careful and use more advanced linear solvers.
In your case we can see from the N2 that there is no coupling anywhere in your model (nothing in the lower triangle of the N2). Purely feed-forward models like this can use a much simpler and faster LinearRunOnce solver (which is the default if you don't set anything else). So I turned off all DirectSolvers in your model, and the derivatives became effectively instant. Make your N2 look like this instead:
The choice of best linear solver is extremely model dependent. One factor to consider is computational cost, another is numerical robustness. This issue is covered in some detail in Section 5.3 of the OpenMDAO paper, and I won't cover everything here. But very briefly here is a summary of the key considerations.
When just starting out with OpenMDAO, using DirectSolver is both the simplest and usually the fastest option. It is simple because it does not require consideration of your model structure, and it's fast because for small models OpenMDAO can assemble the Jacobian into a dense or sparse matrix and provide that for direct factorization. However, for larger models (or models with very large vectors of outputs), the cost of computing the factorization is prohibitively high. In this case, you need to break the solver structure down more intentionally, and use other linear solvers (sometimes in conjunction with the direct solver--- see Section 5.3 of OpenMDAO paper, and this OpenMDAO doc).
You stated that you wanted to use the DirectSolver to take advantage of the sparse Jacobian storage. That was a good instinct, but the way OpenMDAO is structured this is not a problem either way. We are pretty far down in the weeds now, but since you asked I'll give a short summary explanation. As of OpenMDAO 3.7, only the DirectSolver requires an assembled Jacobian at all (and in fact, it is the linear solver itself that determines this for whatever system it is attached to). All other LinearSolvers work with a DictionaryJacobian (which stores each sub-jac keyed to the [of-var, wrt-var] pair). Each sub-jac can be stored as dense or sparse (depending on how you declared that particular partial derivative). The dictionary Jacobian is effectively a form of a sparse-matrix, though not a traditional one. The key takeaway here is that if you use the LinearRunOnce (or any other solver), then you are getting a memory efficient data storage regardless. It is only the DirectSolver that changes over to a more traditional assembly of an actual matrix object.
Regarding the issue of memory allocation. I borrowed this image from the openmdao docs
2) Why does SLSQP stop after one iteration?
Gradient based optimizations are very sensitive to scaling. I ploted your objective function inside your allowed design space and got this:
So we can see that the minimum is at about 6 degrees, but the objective values are TINY (about 1e-4).
As a general rule of thumb, getting your objective to around order of magnitude 1 is a good idea (we have a scaling report feature that helps with this). I added a reference that was about the order of magnitude of your objective:
p.model.add_objective('obj', ref=1e-4)
Then I got a good result:
Optimization terminated successfully (Exit mode 0)
Current function value: [3.02197589e-11]
Iterations: 7
Function evaluations: 9
Gradient evaluations: 7
Optimization Complete
-----------------------------------
Finished optimization
alpha = [6.04143334]
time: 2.1188600063323975 seconds
Unfortunately, scaling is just hard with gradient based optimization. Starting by scaling your objective/constraints to order-1 is a decent rule of thumb, but its common that you need to adjust things beyond that for more complex problems.

Related

Reason why setting tensorflow's variable with small stddev

I have a question about a reason why setting TensorFlow's variable with small stddev.
I guess many people do test MNIST test code from TensorFlow beginner's guide.
As following it, the first layer's weights are initiated by using truncated_normal with stddev 0.1.
And I guessed if setting it with more bigger value, then it would be the same result, which is exactly accurate.
But although increasing epoch count, it doesn't work.
Is there anybody know this reason?
original :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=0.1), name='w_'+name)
#result : (990, 0.93000001, 0.89719999)
modified :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=200), name='w_'+name)
#result : (99990, 0.1, 0.098000005)
The reason is because you want to keep all the layer's variances (or standard deviations) approximately the same, and sane. It has to do with the error backpropagation step of the learning process and the activation functions used.
In order to learn the network's weights, the backpropagation step requires knowledge of the network's gradient, a measure of how strong each weight influences the input to reach the final output; layer's weight variance directly influences the propagation of gradients.
Say, for example, that the activation function is sigmoidal (e.g. tf.nn.sigmoid or tf.nn.tanh); this implies that all input values are squashed into a fixed output value range. For the sigmoid, it is the range 0..1, where essentially all values z greater or smaller than +/- 4 are very close to one (for z > 4) or zero (for z < -4) and only values within that range tend to have some meaningful "change".
Now the difference between the values sigmoid(5) and sigmoid(1000) is barely noticeable. Because of that, all very large or very small values will optimize very slowly, since their influence on the result y = sigmoid(W*x+b) is extremely small. Now the pre-activation value z = W*x+b (where x is the input) depends on the actual input x and the current weights W. If either of them is large, e.g. by initializing the weights with a high variance (i.e. standard deviation), the result will necessarily be (relatively) large, leading to said problem. This is also the reason why truncated_normal is used rather than a correct normal distribution: The latter only guarantees that most of the values are very close to the mean, with some less than 5% chance that this is not the case, while truncated_normal simply clips away every value that is too big or too small, guaranteeing that all weights are in the same range, while still being normally distributed.
To make matters worse, in a typical neural network - especially in deep learning - each network layer is followed by one or many others. If in each layer the output value range is big, the gradients will get bigger and bigger as well; this is known as the exploding gradients problem (a variation of the vanishing gradients, where gradients are getting smaller).
The reason that this is a problem is because learning starts at the very last layer and each weight is adjusted depending on how much it contributed to the error. If the gradients are indeed getting very big towards the end, the very last layer is the first one to pay a high toll for this: Its weights get adjusted very strongly - likely overcorrecting the actual problem - and then only the "remaining" error gets propagated further back, or up, the network. Here, since the last layer was already "fixed a lot" regarding the measured error, only smaller adjustments will be made. This may lead to the problem that the first layers are corrected only by a tiny bit or not at all, effectively preventing all learning there. The same basically happens if the learning rate is too big.
Finding the best weight initialization is a topic by itself and there are somewhat more sophisticated methods such as Xavier initialization or Layer-sequential unit variance, however small normally distributed values are usually simply a good guess.

Usage of scipy.optimize.fmin_slsqp for Integer design variable

I'm trying to use the scipy.optimize.slsqp for an industrial-related constrained optimization. A highly non-linear FE model is used to generate the objective and the constraint functions, and their derivatives/sensitivities.
The objective function is in the form:
obj=a number calculated from the FE model
A series of constraint functions are set, and most of them are in the form:
cons = real number i - real number j (calculated from the FE model)
I would like to try to restrict the design variables to integers as that would be what input into the plant machine.
Another consideration is to have a log file recording what design variable have been tried. if a set of design variable (integer) is already tried for, skip the calculation, perturb the design variable and try again. By limiting the design variable to integers, we are able to limit the number of trials (while leaving the design variable to real, a change in the e.g. 8th decimal point could be regarded as untried values).
I'm using SLSQP as it is one of the SQP method (please correct me if I am wrong), and the it is said to be powerful to deal with nonlinear problems. I understand the SLSQP algorithm is a gradient-based optimizer and there is no way I can implement the restriction of the design variables being integer in the algorithm coded in FORTRAN. So instead, I modified the slsqp.py file to the following (where it calls the python extension built from the FORTRAN algorithm):
slsqp(m, meq, x, xl, xu, fx, c, g, a, acc, majiter, mode, w, jw)
for i in range(len(x)):
x[i]=int(x[i])
The code stops at the 2nd iteration and output the following:
Optimization terminated successfully. (Exit mode 0)
Current function value: -1.286621577077517
Iterations: 7
Function evaluations: 0
Gradient evaluations: 0
However, one of the constraint function is violated (value at about -5.2, while the default convergence criterion of the optimization code = 10^-6).
Questions:
1. Since the FE model is highly nonlinear, I think it's safe to assume the objective and constraint functions will be highly nonlinear too (regardless of their mathematical form). Is that correct?
2. Based on the convergence criterion of the slsqp algorithm(please see below), one of which requires the sum of all constraint violations(absolute values) to be less than a very small value (10^-6), how could the optimization exit with successful termination message?
IF ((ABS(f-f0).LT.acc .OR. dnrm2_(n,s,1).LT.acc).AND. h3.LT.acc)
Any help or advice is appreciated. Thank you.

Implementing a 2D recursive spatial filter using Scipy

Minimally, I would like to know how to achieve what is stated in the title. Specifically, signal.lfilter seems like the only implementation of a difference equation filter in scipy, but it is 1D, as shown in the docs. I would like to know how to implement a 2D version as described by this difference equation. If that's as simple as "bro, use this function," please let me know, pardon my naiveté, and feel free to disregard the rest of the post.
I am new to DSP and acknowledging there might be a different approach to answering my question so I will explain the broader goal and give context for the question in the hopes someone knows how do want I want with Scipy, or perhaps a better way than what I explicitly asked for.
To get straight into it, broadly speaking I am using vectorized computation methods (Numpy/Scipy) to implement a Monte Carlo simulation to improve upon a naive for loop. I have successfully abstracted most of my operations to array computation / linear algebra, but a few specific ones (recursive computations) have eluded my intuition and I continually end up in the digital signal processing world when I go looking for how this type of thing has been done by others (that or machine learning but those "frameworks" are much opinionated). The reason most of my google searches end up on scipy.signal or scipy.ndimage library references is clear to me at this point, and subsequent to accepting the "signal" representation of my data, I have spent a considerable amount of time (about as much as reasonable for a field that is not my own) ramping up the learning curve to try and figure out what I need from these libraries.
My simulation entails updating a vector of data representing the state of a system each period for n periods, and then repeating that whole process a "Monte Carlo" amount of times. The updates in each of n periods are inherently recursive as the next depends on the state of the prior. It can be characterized as a difference equation as linked above. Additionally this vector is theoretically indexed on an grid of points with uneven stepsize. Here is an example vector y and its theoretical grid t:
y = np.r_[0.0024, 0.004, 0.0058, 0.0083, 0.0099, 0.0133, 0.0164]
t = np.r_[0.25, 0.5, 1, 2, 5, 10, 20]
I need to iteratively perform numerous operations to y for each of n "updates." Specifically, I am computing the curvature along the curve y(t) using finite difference approximations and using the result at each point to adjust the corresponding y(t) prior to the next update. In a loop this amounts to inplace variable reassignment with the desired update in each iteration.
y += some_function(y)
Not only does this seem inefficient, but vectorizing things seems intuitive given y is a vector to begin with. Furthermore I am interested in preserving each "updated" y(t) along the n updates, which would require a data structure of dimensions len(y) x n. At this point, why not perform the updates inplace in the array? This is wherein lies the question. Many of the update operations I have succesfully vectorized the "Numpy way" (such as adding random variates to each point), but some appear overly complex in the array world.
Specifically, as mentioned above the one involving computing curvature at each element using its neighbouring two elements, and then imediately using that result to update the next row of the array before performing its own curvature "update." I was able to implement a non-recursive version (each row fails to consider its "updated self" from the prior row) of the curvature operation using ndimage generic_filter. Given the uneven grid, I have unique coefficients (kernel weights) for each triplet in the kernel footprint (instead of always using [1,-2,1] for y'' if I had a uniform grid). This last part has already forced me to use a spatial filter from ndimage rather than a 1d convolution. I'll point out, something conceptually similar was discussed in this math.exchange post, and it seems to me only the third response saliently addressed the difference between mathematical notion of "convolution" which should be associative from general spatial filtering kernels that would require two sequential filtering operations or a cleverly merged kernel.
In any case this does not seem to actually address my concern as it is not about 2D recursion filtering but rather having a backwards looking kernel footprint. Additionally, I think I've concluded it is not applicable in that this only allows for "recursion" (backward looking kernel footprints in the spatial filtering world) in a manner directly proportional to the size of the recursion. Meaning if I wanted to filter each of n rows incorporating calculations on all prior rows, it would require a convolution kernel far too big (for my n anyways). If I'm understanding all this correctly, a recursive linear filter is algorithmically more efficient in that it returns (for use in computation) the result of itself applied over the previous n samples (up to a level where the stability of the algorithm is affected) using another companion vector (z). In my case, I would only need to look back one step at output signal y[n-1] to compute y[n] from curvature at x[n] as the rest works itself out like a cumsum. signal.lfilter works for this, but I can't used that to compute curvature, as that requires a kernel footprint that can "see" at least its left and right neighbors (pixels), which is how I ended up using generic_filter.
It seems to me I should be able to do both simultaneously with one filter namely spatial and recursive filtering; or somehow I've missed the maths of how this could be mathematically simplified/combined (convolution of multiples kernels?).
It seems like this should be a common problem, but perhaps it is rarely relevant to do both at once in signal processing and image filtering. Perhaps this is why you don't use signals libraries solely to implement a fast monte carlo simulation; though it seems less esoteric than using a tensor math library to implement a recursive neural network scan ... which I'm attempting to do right now.
EDIT: For those familiar with the theoretical side of DSP, I know that what I am describing, the process of designing a recursive filters with arbitrary impulse responses, is achieved by employing a mathematical technique called the z-transform which I understand is generally used for two things:
converting between the recursion coefficients and the frequency response
combining cascaded and parallel stages into a single filter
Both are exactly what I am trying to accomplish.
Also, reworded title away from FIR / IIR because those imply specific definitions of "recursion" and may be confusing / misnomer.

LDPC behaviour as density of parity-check matrix increases

My assignment is to implement a Loopy Belief Propagation algorithm for Low-density Parity-check Code. This code uses a parity-check matrix H which is rather sparse (say 750-by-1000 binary matrix with an average of about 3 "ones" per each column). The code to generate the parity-check matrix is taken from here
Anyway, one of the subtasks is to check the reliability of LDPC code when the density of the matrix H increases. So, I fix the channel at 0.5 capacity, fix my code speed at 0.35 and begin to increase the density of the matrix. As the average number of "ones" in a column goes from 3 to 7 in steps of 1, disaster happens. With 3 or 4 the code copes perfectly well. With higher density it begins to fail: not only does it sometimes fail to converge, it oftentimes converges to the wrong codeword and produces mistakes.
So my question is: what type of behaviour is expected of an LDPC code as its sparse parity-check matrix becomes denser? Bonus question for skilled mind-readers: in my case (as the code performance degrades) is it more likely because the Loopy Belief Propagation algo has no guarantee on convergence or because I made a mistake implementing it?
After talking to my TA and other students I understand the following:
According to Shannon's theorem, the reliability of the code should increase with the density of the parity check matrix. That is simply because more checks are made.
However, since we use Loopy Belief Propagation, it struggles a lot when there are more and more edges in the graph forming more and more loops. Therefore, the actual performance degrades.
Whether or not I made a mistake in my code based solely on this behaviour cannot be established. However, since my code does work for sparse matrices, it is likely that the implementation is fine.

Finding Optimal Parameters In A "Black Box" System

I'm developing machine learning algorithms which classify images based on training data.
During the image preprocessing stages, there are several parameters which I can modify that affect the data I feed my algorithms (for example, I can change the Hessian Threshold when extracting SURF features). So the flow thus far looks like:
[param1, param2, param3...] => [black box] => accuracy %
My problem is: with so many parameters at my disposal, how can I systematically pick values which give me optimized results/accuracy? A naive approach is to run i nested for-loops (assuming i parameters) and just iterate through all parameter combinations, but if it takes 5 minute to calculate an accuracy from my "black box" system this would take a long, long time.
This being said, are there any algorithms or techniques which can search for optimal parameters in a black box system? I was thinking of taking a course in Discrete Optimization but I'm not sure if that would be the best use of my time.
Thank you for your time and help!
Edit (to answer comments):
I have 5-8 parameters. Each parameter has its own range. One parameter can be 0-1000 (integer), while another can be 0 to 1 (real number). Nothing is stopping me from multithreading the black box evaluation.
Also, there are some parts of the black box that have some randomness to them. For example, one stage is using k-means clustering. Each black box evaluation, the cluster centers may change. I run k-means several times to (hopefully) avoid local optima. In addition, I evaluate the black box multiple times and find the median accuracy in order to further mitigate randomness and outliers.
As a partial solution, a grid search of moderate resolution and range can be recursively repeated in the areas where the n-parameters result in the optimal values.
Each n-dimensioned result from each step would be used as a starting point for the next iteration.
The key is that for each iteration the resolution in absolute terms is kept constant (i.e. keep the iteration period constant) but the range decreased so as to reduce the pitch/granular step size.
I'd call it a ‘contracting mesh’ :)
Keep in mind that while it avoids full brute-force complexity it only reaches exhaustive resolution in the final iteration (this is what defines the final iteration).
Also that the outlined process is only exhaustive on a subset of the points that may or may not include the global minimum - i.e. it could result in a local minima.
(You can always chase your tail though by offsetting the initial grid by some sub-initial-resolution amount and compare results...)
Have fun!
Here is the solution to your problem.
A method behind it is described in this paper.