I'm using scipy.optimize.lsq_linear to run some linear least squares optimizations and all is well, but a little slow. My A matrix is typically about 100 x 10,000 in size and sparse (sparsity usually ~50%). The bounds on the solution are critical. Given my tolerance lsq_linear typically solves the problems in about 10 seconds and speeding this up would be very helpful for running many optimizations.
I've read about speeding up linear algebra operations using GPU acceleration in PyTorch. It looks like PyTorch handles sparse arrays (torch calls them tensors), which is good. However, I've been digging through the PyTorch documentation, particularly the torch.optim and torch.linalg packages, and I haven't found anything that appears to be able to do a linear least squares optimization with bounds.
Is there a torch method that can do linear least squares optimization with bounds like scipy.optimize.lsq_linear?
Is there another way to speed up lsq_linear or to perform the optimization in a faster way?
For what it's worth, I think I've pushed lsq_linear pretty far. I don't think I can decrease the number of matrix elements, increase sparsity, or decrease optimization tolerances much further without sacrificing the results.
Not easily, no.
I'd try to profile lsq_linear on your problem to see if it's pure python overhead (which can probably be trimmed some) or linear algebra. In the latter case, I'd start with vendoring the lsq_linear code and swapping relevant linear algebra routines. YMMV though.
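For instance, a minimal profiling sketch along those lines (the matrix size, density, bounds and tolerance below are just placeholders standing in for your actual problem):

```python
import cProfile
import pstats

import numpy as np
from scipy import sparse
from scipy.optimize import lsq_linear

# Placeholder problem roughly matching the description: 100 x 10,000, ~50% dense.
rng = np.random.default_rng(0)
A = sparse.random(100, 10_000, density=0.5, format="csr", random_state=rng)
b = rng.normal(size=100)

profiler = cProfile.Profile()
profiler.enable()
res = lsq_linear(A, b, bounds=(0, 1), tol=1e-6)
profiler.disable()

# Show where the time goes: pure-Python overhead vs. the underlying linear algebra.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```

If the top entries are BLAS/LAPACK-level calls, swapping routines (or hardware) is where the win would be; if they are Python-level loops inside lsq_linear, vendoring and trimming is more promising.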
Suppose I have an algorithm whose runtime depends on two parameters. I want to find the set of parameters that minimizes the runtime. The two parameters are continuous double values in the range of 0 to INFINITY.
Therefore, for two parameters a, b, I want to find the values of a and b that minimize the runtime. I think this is a pretty standard problem, but I could not find good literature on it. The little literature I did find (MLE, least squares, etc.) talks about fitting distributions.
First use your brains to understand the possible functional relationship between those parameters and the running time, in a qualitative way. This means having a first idea on the number and positions of possible maxima, smoothness of the function, asymptotic behavior and any other clue that you can find.
Then make up your mind about a reasonable range of values where it makes sense to sample the function values. If those ranges are very wide, it is preferable to sample using a geometric progression rather than arithmetic (say, powers of 2).
Then measure and observe the function values with a graphical viewer and confirm your intuitions. It is likely that this will be enough to spot the gross location of the absolute maximum. Finding an accurate position might be useless if it only gains you the last few percent of improvement. It is also very likely that the location of the optimum will depend on the particular dataset, making accurate location even less useful.
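As a concrete sketch of the sampling step in Python (run_algorithm is a hypothetical stand-in for whatever you are timing):

```python
import time
import numpy as np

def run_algorithm(a, b):
    # Hypothetical stand-in for the algorithm whose runtime depends on a and b.
    ...

# Sample both parameters on a geometric progression (powers of 2) over a wide range.
a_values = 2.0 ** np.arange(-4, 5)   # 1/16 ... 16
b_values = 2.0 ** np.arange(-4, 5)

runtimes = np.empty((len(a_values), len(b_values)))
for i, a in enumerate(a_values):
    for j, b in enumerate(b_values):
        start = time.perf_counter()
        run_algorithm(a, b)
        runtimes[i, j] = time.perf_counter() - start

# Inspect the surface (e.g. with matplotlib's contourf) to spot the gross
# location of the minimum, then refine the grid around it if needed.
i, j = np.unravel_index(np.argmin(runtimes), runtimes.shape)
print("best so far:", a_values[i], b_values[j], runtimes[i, j])
```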
Initially I modeled my objective function as follows:
argmin var(f(x),g(x))+var(c(x),d(x))
where f,g,c,d are linear functions
in order to be able to use linear solvers I modeled the problem as follows
argmin abs(f(x),g(x))+abs(c(x),d(x))
Is it correct to change variance to absolute value in this context? I'm pretty sure they imply the same meaning, namely having the least difference between the two functions.
You haven't given enough context to answer the question. Even though your question doesn't seem to be about regression, in many ways it is similar to the question of choosing between least squares and least absolute deviations approaches to regression. If that term in your objective function is in any sense an error term then the most appropriate way to model the error depends on the nature of the error distribution. Least squares is better if there is normally distributed noise. Least absolute deviations is better in the nonparametric setting and is less sensitive to outliers. If the problem has nothing to do with probability at all then other criteria need to be brought in to decide between the two options.
Having said all this, the two ways of measuring distance are broadly similar. One will be fairly small if and only if the other is -- though they won't be equally small. If they are similar enough for your purposes then the fact that absolute values can be linearized could be a good motivation to use it. On the other hand -- if the variance-based one is really a better expression of what you are interested in then the fact that you can't use LP isn't sufficient justification to adopt absolute values. After all -- quadratic programming is not all that much harder than LP, at least below a certain scale.
To sum up -- they don't imply the same meaning, but they do imply similar meanings; and, whether or not they are similar enough depends upon your purposes.
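For what it's worth, if you do go the LP route, the standard trick is to replace each absolute-value term with an auxiliary variable bounded below by both signs of the difference. A minimal sketch with scipy.optimize.linprog, reading abs(f(x),g(x)) as |f(x) - g(x)| and assuming each linear function is given as a coefficient vector (the vectors below are made up):

```python
import numpy as np
from scipy.optimize import linprog

# Assumed problem data: each linear function is a coefficient vector over x (3 vars here).
f_coef = np.array([1.0, 2.0, 0.0])
g_coef = np.array([0.0, 1.0, 1.0])
c_coef = np.array([2.0, 0.0, 1.0])
d_coef = np.array([1.0, 1.0, 1.0])

n = f_coef.size
# Decision variables: [x_1..x_n, t1, t2], with t1 >= |f(x)-g(x)| and t2 >= |c(x)-d(x)|.
objective = np.concatenate([np.zeros(n), [1.0, 1.0]])

fg = f_coef - g_coef
cd = c_coef - d_coef
# t1 >= +/-(f(x)-g(x)) and t2 >= +/-(c(x)-d(x)), written as A_ub @ z <= 0.
A_ub = np.array([
    np.concatenate([ fg, [-1.0,  0.0]]),
    np.concatenate([-fg, [-1.0,  0.0]]),
    np.concatenate([ cd, [ 0.0, -1.0]]),
    np.concatenate([-cd, [ 0.0, -1.0]]),
])
b_ub = np.zeros(4)

bounds = [(None, None)] * n + [(0, None), (0, None)]  # x free, t1, t2 >= 0
result = linprog(objective, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(result.x[:n], result.fun)
```

The variance-based objective would instead become a quadratic program, which, as noted above, is not much harder at moderate scale.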
This is for self-study of an N-dimensional system of linear homogeneous ordinary differential equations of the form:
dx/dt=Ax
where A is the coefficient matrix of the system.
I have learned that you can check for stability by determining whether the real parts of all the eigenvalues of A are negative. You can check for oscillations by determining whether any eigenvalues of A are purely imaginary.
The author in the book I'm reading then introduces the Routh-Hurwitz criterion for detecting stability and oscillations of the system. This seems to be a more efficient computational short-cut than calculating eigenvalues.
What are the advantages of using Routh-Hurwitz criteria for stability and oscillations, when you can just find the eigenvalues quickly nowadays? For instance, will it be useful when I start to study nonlinear dynamics? Is there some additional use that I am completely missing?
The Wikipedia entry on RH stability analysis has material about control systems, and ends up with a lot of equations in the s-domain (Laplace transforms), but for my applications I will be staying in the time domain for the most part, and just focusing fairly narrowly on stability and oscillations in linear (or linearized) systems.
My motivation: it seems easy to calculate eigenvalues on my computer, and the Routh-Hurwitz criterion comes off as sort of anachronistic, the sort of thing that might save me a lot of time if I were doing this by hand, but not very helpful for doing analysis of small-fry systems via Matlab.
Edit: I've asked this at Math Exchange, which seems more appropriate:
https://math.stackexchange.com/questions/690634/use-of-routh-hurwitz-if-you-have-the-eigenvalues
There is an accepted answer there.
This is just legacy educational curriculum which fell way behind the actual computational age. Routh-Hurwitz gives a very nice theoretical basis for parametrizing root positions and is linked to much more abstract math.
However, for control purposes it is just a nice trick that has no practical value except maybe for simple transfer functions with one or two unknown parameters. It had real value when computing the roots of polynomials was expensive or even done by hand. Today, even root finding for polynomials is based on forming the companion matrix and computing its eigenvalues. In fact you can basically form a meshgrid and check the stability surface by plotting the largest real part in a few minutes.
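As a quick illustration of that last point, here is a sketch in Python/NumPy; the 2x2 matrix A(mu, k) is just a placeholder standing in for whatever parametrized system you have:

```python
import numpy as np

def A(mu, k):
    # Placeholder coefficient matrix depending on two parameters.
    return np.array([[mu, 1.0],
                     [-k, -2.0]])

# Largest real part of the eigenvalues over a parameter grid:
# negative -> stable, positive -> unstable.
mus = np.linspace(-3, 3, 121)
ks = np.linspace(0, 5, 101)
max_real = np.array([[np.max(np.linalg.eigvals(A(mu, k)).real) for k in ks]
                     for mu in mus])
# max_real can then be plotted (e.g. plt.contourf(ks, mus, max_real)) to see
# the stability boundary directly.

# A single system is checked the same way:
lam = np.linalg.eigvals(A(0.5, 3.0))
stable = np.all(lam.real < 0)
oscillatory = np.any(np.abs(lam.imag) > 1e-12)
print(stable, oscillatory)
```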
I would like to maximize a function with one parameter.
So I run gradient descent (or, ascent actually): I start with an initial parameter and keep adding the gradient (times some learning rate factor that gets smaller and smaller), re-evaluate the gradient given the new parameter, and so on until convergence.
But there is one problem: My parameter must stay positive, so it is not supposed to become <= 0 because my function will be undefined. My gradient search will sometimes go into such regions though (when it was positive, the gradient told it to go a bit lower, and it overshoots).
And to make things worse, the gradient at such a point might be negative, driving the search toward even more negative parameter values. (The reason is that the objective function contains logs, but the gradient doesn't.)
What are some good (simple) algorithms that deal with this constrained optimization problem? I'm hoping for just a simple fix to my algorithm. Or maybe ignore the gradient and do some kind of line search for the optimal parameter?
Each time you update your parameter, check to see if it's negative, and if it is, clamp it to zero.
If clamping to zero is not acceptable, try adding a "log-barrier" (Google it). Basically, it adds a smooth "soft" wall to your objective function (and modifies your gradient) to keep the search away from regions you don't want it to go to. You then repeatedly re-run the optimization, hardening up the wall each time so it becomes closer to a vertical barrier, starting each run from the previously found solution. In the limit (in practice only a few iterations are needed), the problem you are solving is identical to the original problem with a hard constraint.
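Here is a rough sketch of both ideas, with a placeholder objective f(x) = log(x) - x (maximum at x = 1) standing in for yours:

```python
import numpy as np

def f(x):
    # Placeholder objective, defined only for x > 0, with its maximum at x = 1.
    return np.log(x) - x

def grad_f(x):
    return 1.0 / x - 1.0

# --- Clamping / projection: take the gradient step, then push the iterate back
# --- into the feasible region (a small positive floor here, since this
# --- particular f is undefined at exactly 0).
x = 3.0
for _ in range(200):
    x = max(x + 0.1 * grad_f(x), 1e-8)
print(x)   # ~1.0

# --- Log-barrier: maximize f(x) + mu*log(x) and shrink mu toward 0,
# --- restarting each round from the previous solution.
x = 3.0
mu = 1.0
for _ in range(10):            # outer rounds: harden the wall
    for _ in range(500):       # inner gradient ascent on the augmented objective
        x += 0.02 * (grad_f(x) + mu / x)
    mu *= 0.1
print(x)   # ~1.0
```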
Without knowing more about your problem, it's hard to give specific advice. Your gradient ascent algorithm may not be particularly suitable for your function space. However, given that's what you've got, here's one tweak that would help.
You're following what you believe is an ascending gradient. But when you move forwards in the direction of the gradient, you discover you have fallen into a pit of negative value. This implies that there was a nearby local maximum, but also a very sharp negative gradient cliff. The obvious fix is to backtrack to your previous position and take a smaller step (e.g. half the size). If you still fall in, repeat with a still smaller step. This will iterate until you find the local maximum at the edge of the cliff.
The problem is, there is no guarantee that your local maximum is actually global (unless you know more about your function than you are sharing). This is the main limitation of naive gradient ascent - it always fixes on the first local maximum and converges to it. If you don't want to switch to a more robust algorithm, one simple approach that could help is to run n iterations of your code, starting each time from random positions in the function space, and keeping the best maximum you find overall. This Monte Carlo approach increases the odds that you will stumble on the global maximum, at the cost of increasing your run time by a factor n. How effective this is will depend on the 'bumpiness' of your objective function.
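A sketch of that restart wrapper, with a deliberately bumpy placeholder objective standing in for yours:

```python
import numpy as np

def objective(x):
    # Deliberately bumpy placeholder with several local maxima on x > 0.
    return np.sin(5 * x) - 0.1 * (x - 3) ** 2

def gradient(x):
    return 5 * np.cos(5 * x) - 0.2 * (x - 3)

def ascend(x0, lr=0.01, steps=500):
    # Plain gradient ascent from x0, clamped to stay positive.
    x = x0
    for _ in range(steps):
        x = max(x + lr * gradient(x), 1e-8)
    return x

rng = np.random.default_rng(0)

# Run the local search from n random starting points, keep the best maximum found.
starts = rng.uniform(0.1, 6.0, size=20)
candidates = [ascend(x0) for x0 in starts]
best = max(candidates, key=objective)
print(best, objective(best))
```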
A simple trick to restrict a parameter to be positive is to re-parametrize the problem in terms of its logarithm (make sure to change the gradient appropriately). Of course it is possible that the optimum moves to -infty with this transformation, and the search does not converge.
At each step, constrain the parameter to be positive. This is (in short) the projected gradient method, which you may want to google.
I have three suggestions, in order of how much thinking and work you want to do.
First, in gradient descent/ascent, you move each time by the gradient times some factor, which you refer to as a "learning rate factor." If, as you describe, this move causes x to become negative, there are two natural interpretations: Either the gradient was too big, or the learning rate factor was too big. Since you can't control the gradient, take the second interpretation. Check whether your move will cause x to become negative, and if so, cut the learning rate factor in half and try again.
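For instance (a rough sketch of that first suggestion; the objective here is a placeholder whose maximum sits at x = 1/3, and the learning rate decays as in your description):

```python
def grad_f(x):
    # Gradient of a placeholder objective f(x) = log(x) - 3x, defined only for x > 0.
    return 1.0 / x - 3.0

x = 0.7
lr = 0.5
for _ in range(500):
    step = lr
    # If the proposed move would push x out of the domain, halve the rate and retry.
    while x + step * grad_f(x) <= 0:
        step *= 0.5
    x += step * grad_f(x)
    lr *= 0.99          # the shrinking learning-rate factor from the question
print(x)                # ~0.333
```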
Second, to elaborate on Aniko's answer, let x be your parameter, and f(x) be your function. Then define a new function g(x) = f(e^x), and note that although the domain of f is (0,infinity), the domain of g is (-infinity, infinity). So g cannot suffer the problems that f suffers. Use gradient descent to find the value x_0 that maximizes g. Then e^(x_0), which is positive, maximizes f. To apply gradient descent on g, you need its derivative, which is f'(e^x)*e^x, by the chain rule.
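And a sketch of that second suggestion, again with a placeholder objective (maximum at x = 1) so the chain-rule gradient is easy to check:

```python
import numpy as np

def f(x):          # placeholder objective, defined for x > 0, maximum at x = 1
    return np.log(x) - x

def f_prime(x):
    return 1.0 / x - 1.0

def g_prime(y):    # d/dy f(e^y) = f'(e^y) * e^y, by the chain rule
    return f_prime(np.exp(y)) * np.exp(y)

# Unconstrained gradient ascent on y; e^y is always positive.
y = np.log(3.0)
for _ in range(2000):
    y += 0.05 * g_prime(y)

print(np.exp(y))   # recovered positive maximizer of f, close to 1
```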
Third, it sounds like you're trying to maximize just one function, not write a general maximization routine. You could consider shelving gradient descent and tailoring the optimization method to the idiosyncrasies of your specific function. We would have to know a lot more about the expected behavior of f to help you do that.
Just use Brent's method for minimization. It is stable and fast and the right thing to do if you have only one parameter. It's what the R function optimize uses. The link also contains a simple C++ implementation. And yes, you can give it MIN and MAX parameter value limits.
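In Python, the equivalent is scipy.optimize.minimize_scalar with method="bounded" (a Brent-style bounded minimizer). A sketch with a placeholder objective, minimizing -f in order to maximize f:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(x):
    # Placeholder objective to maximize, defined for x > 0.
    return np.log(x) - x

# Bounded scalar minimization (as in R's optimize); the bounds are the
# MIN/MAX limits on the parameter.
res = minimize_scalar(lambda x: -f(x), bounds=(1e-8, 100.0), method="bounded")
print(res.x)   # ~1.0
```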
You're getting good answers here. Reparameterizing is what I would recommend. Also, have you considered another search method, like Metropolis-Hastings? It's actually quite simple once you bull through the scary-looking math, and it gives you standard errors as well as an optimum.
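If you want to try that, here is a minimal random-walk Metropolis sketch on the log of the parameter (so positivity is automatic), with a placeholder objective; the samples give you both a rough location and a spread:

```python
import numpy as np

def f(x):
    # Placeholder objective to maximize, defined for x > 0.
    return np.log(x) - x

rng = np.random.default_rng(0)

# Random-walk Metropolis on y = log(x), so x = e^y stays positive.
# Unnormalized target on the y scale: exp(f(e^y)); symmetric Gaussian proposals.
y = np.log(3.0)
samples = []
for _ in range(20_000):
    y_new = y + rng.normal(scale=0.3)
    if np.log(rng.uniform()) < f(np.exp(y_new)) - f(np.exp(y)):
        y = y_new
    samples.append(np.exp(y))

samples = np.array(samples[5_000:])        # drop burn-in
print(samples.mean(), samples.std())       # rough location and spread of the parameter
```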