I'm using scipy.optimize.lsq_linear to run some linear least squares optimizations, and all is well, but a little slow. My A matrix is typically about 100 x 10,000 in size and sparse (sparsity usually ~50%). The bounds on the solution are critical. Given my tolerance, lsq_linear typically solves the problems in about 10 seconds, and speeding this up would be very helpful for running many optimizations.
I've read about speeding up linear algebra operations using GPU acceleration in PyTorch. It looks like PyTorch handles sparse arrays (torch calls them tensors), which is good. However, I've been digging through the PyTorch documentation, particularly the torch.optim and torch.linalg packages, and I haven't found anything that appears to be able to do a linear least squares optimization with bounds.
Is there a torch method that can do linear least squares optimization with bounds like scipy.optimize.lsq_linear?
Is there another way to speed up lsq_linear or to perform the optimization in a faster way?
For what it's worth, I think I've pushed lsq_linear pretty far. I don't think I can decrease the number of matrix elements, increase sparsity, or decrease optimization tolerances much further without sacrificing the results.
Not easily, no.
I'd try to profile lsq_linear on your problem to see if it's pure Python overhead (which can probably be trimmed some) or the linear algebra itself. In the latter case, I'd start by vendoring the lsq_linear code and swapping out the relevant linear algebra routines. YMMV though.
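If it helps, here is a minimal profiling sketch along those lines; A, b, and the bounds below are placeholders for your actual problem, and lsq_solver='lsmr' is just one of the solver options lsq_linear accepts for sparse inputs:

import cProfile
import pstats
import numpy as np
from scipy.optimize import lsq_linear

A = np.random.rand(100, 10_000)   # stand-in for your sparse 100 x 10,000 matrix
b = np.random.rand(100)
bounds = (0, 1)                   # stand-in bounds

profiler = cProfile.Profile()
profiler.enable()
lsq_linear(A, b, bounds=bounds, lsq_solver='lsmr')
profiler.disable()

# Show where the time goes: Python-level overhead vs. linear algebra calls.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)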
With numpy, what is the fastest way to compute one solution to an underdetermined linear system? I don't care which solution the method would return, I'd be happy with any solution.
In particular, I'm dealing with a 7x7 rank-6 matrix which describes the dynamics of a physical system. I'm noticing numpy.linalg.lstsq, numpy.linalg.qr, scipy.linalg.null_space, and scipy.linalg.lu run on the full matrix are all slower on my machine than numpy.linalg.solve run on a correctly-trimmed 6x6 full-rank matrix; solve is twice as fast as lstsq (14.8 µs vs 29.1 µs).
Is there any way to speed up the computation without some horrible C LAPACK-level hacking?
Numpy is not designed to be efficient on very small matrices. Its overheads (due to type checks, value checks, iterators, allocations, etc.) can be quite big on such matrices. In fact, dozens of microseconds is reasonable for such a Numpy function call. Numba can reduce the overheads thanks to fully compiled native code. That being said, Numba can still have a small overhead (due to the call from CPython, a few type checks and allocations), but these are generally reasonable unless you work on extremely small inputs. In that case, it is better to use Numba in the caller function too, since the problem is actually the slow CPython interpreter. The lazy compilation of the Numba function makes the first execution significantly slower. You can provide the signature to Numba to make it faster (eager compilation).
import numba as nb
import numpy as np

# Providing the signature triggers eager compilation, so the compile cost is
# not paid on the first call. The layouts require C-contiguous inputs.
@nb.njit('(float64[:,::1], float64[::1])')
def solve_nb(a, b):
    return np.linalg.solve(a, b)
On my machine, it is about 16% faster on a 7x7 matrix. It requires the matrices to be contiguous (working on non-contiguous arrays is fundamentally inefficient, especially here). If this is not fast enough, then you can call dgesv directly for double-precision matrices (or sgesv for single-precision ones).
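As a sketch of that direct route via SciPy's LAPACK wrappers (scipy.linalg.lapack.dgesv; the matrices here are just placeholders):

import numpy as np
from scipy.linalg import lapack

a = np.random.rand(7, 7)   # placeholder full-rank system
b = np.random.rand(7)

# dgesv returns the LU factors, the pivot indices, the solution and an info flag.
lu, piv, x, info = lapack.dgesv(a, b)
assert info == 0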
Actually, solve does use dgesv internally. lstsq appears to use a singular value decomposition (SVD). An SVD is significantly slower than a QR decomposition, which is generally a bit slower than an LU decomposition.
I am not an expert on the numerical/mathematical part, but AFAIK, solving this with an LU decomposition is less numerically stable than using a QR decomposition, which is in turn less numerically stable than an SVD. Also, I think an SVD/QR method should be used instead of a simple LU decomposition for matrices that are not full rank.
The implementation of dgesv in the standard Netlib LAPACK uses an LU factorization followed by a call to dgetrs (see here). This latter call should be fast compared to the LU factorization. The code of LAPACK implementations is generally pretty generic, so they may have significant overhead on 7x7 matrices (AFAIK, the Intel implementation is one of the fastest in that regard).
An alternative solution is to write your own specialized LU decomposition and your own system solving using Numba or Cython. This solution is tedious, but it should be significantly faster since the compiler can unroll the loops if it knows the bounds, reducing the overheads. You can also perform a single allocation instead of multiple ones.
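A minimal sketch of that idea, assuming Numba (the function name and the simple partial-pivoting scheme are mine, not a tuned LAPACK replacement):

import numba as nb
import numpy as np

@nb.njit('float64[::1](float64[:,::1], float64[::1])')
def solve_small(a, b):
    # Gaussian elimination with partial pivoting on a single augmented work array.
    n = a.shape[0]
    m = np.empty((n, n + 1))
    for i in range(n):
        for j in range(n):
            m[i, j] = a[i, j]
        m[i, n] = b[i]
    for k in range(n):
        # Swap in the row with the largest pivot to keep the elimination stable.
        p = k + np.argmax(np.abs(m[k:, k]))
        if p != k:
            for j in range(n + 1):
                m[k, j], m[p, j] = m[p, j], m[k, j]
        for i in range(k + 1, n):
            f = m[i, k] / m[k, k]
            for j in range(k, n + 1):
                m[i, j] -= f * m[k, j]
    # Back substitution.
    x = np.empty(n)
    for i in range(n - 1, -1, -1):
        s = m[i, n]
        for j in range(i + 1, n):
            s -= m[i, j] * x[j]
        x[i] = s / m[i, i]
    return x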
I have a series of linearly-constrained convex quadratic optimization problems that have around 100,000 variables, 1 linear constraint and 100,000 bound constraints (the same as the number of variables; the solution has to be positive). I am planning to use Gurobi in R and/or Python. I have noticed that, although for small problems the solver can find a solution quite quickly, for medium to large problems (like the one I have), it takes forever (some benchmarks are shown here; credits to Stéphane Caron).
I know that QP methods do not scale very well, but I'd like to know if you are aware of any solver/technique/tool that solves medium to large QP problems faster.
Thanks!
I'm currently computing several eigenvalues of different matrices and trying to find their closed-form solutions. The matrices are Hermitian/self-adjoint and tridiagonal. Additionally, every diagonal element is positive and every off-diagonal element is negative.
Due to what I suspect is an issue with trying to algebraically solve the quintic, sympy cannot solve the eigenvalues of my 14x14 matrix.
Numpy has given me great results that I'm sometimes able to use via Wolfram Alpha, but other times the precision is insufficient to determine which of several candidate closed forms is the right one. As a result, I'd like to increase the precision with which numpy.linalg.eigvalsh outputs eigenvalues. Any help would be greatly appreciated!
Eigenvalue problems of size >= 5 have no general closed-form solution (for the reason you mention), and so all general eigensolvers are iterative. As a result, there are a few sources of error.
First, there are the errors with the convergence of the algorithm itself. I.e. even if all your computations were exact, you would need to run a certain number of iterations to get a certain accuracy.
Second, finite precision limits the overall accuracy.
Numerical analysts study how accurate a solution you can get for a given algorithm and precision and there are results on this.
As for your specific problem, if you are not getting enough accuracy there are a few things you can try to do.
The first is to make sure you are using the best solver for your matrix type. I.e., since your matrix is symmetric and tridiagonal, make sure you are using a solver specialized for that structure (as suggested by norok2).
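For example, with SciPy (a sketch with placeholder matrix entries; eigh_tridiagonal takes the two diagonals directly rather than the full matrix):

import numpy as np
from scipy.linalg import eigh_tridiagonal

# Placeholder 14x14 symmetric tridiagonal matrix: positive diagonal,
# negative off-diagonal, as described in the question.
d = 2.0 * np.ones(14)
e = -1.0 * np.ones(13)

evals = eigh_tridiagonal(d, e, eigvals_only=True)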
If that still doesn't give you enough accuracy, you can try to increase the precision.
However, the main issue with doing this in numpy is that the LAPACK functions under the hood are compiled for float64.
Thus, even if the numpy function allows inputs of higher precision (float128), it will round them before calling the LAPACK functions.
It might be possible to recompile those functions for higher precision, but that may not be worth the effort for your particular problem.
(As a side note, I'm not very familiar with scipy, so it may be the case that they have eigensolvers written in python which support all different types, but you need to be careful that they are actually doing every step in the higher precision and not silently rounding to float64 somewhere.)
For your problem, I would suggest using the package mpmath, which supports arbitrary precision linear algebra.
It is a bit slower since everything is done in software, but for 14x14 matrices it should still be pretty quick.
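A minimal sketch with mpmath, assuming its eigsy routine for real symmetric matrices (the matrix entries below are placeholders for yours):

import mpmath

mpmath.mp.dps = 50   # work with 50 significant decimal digits

n = 14
A = mpmath.zeros(n, n)   # placeholder 14x14 symmetric tridiagonal matrix
for i in range(n):
    A[i, i] = mpmath.mpf(2)
    if i + 1 < n:
        A[i, i + 1] = A[i + 1, i] = mpmath.mpf(-1)

# eigsy is mpmath's eigensolver for real symmetric matrices.
E = mpmath.eigsy(A, eigvals_only=True)
print(E[0])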
Implementing convolution in a pixel shader is somewhat costly due to the very high number of texture fetches.
A direct way of implementing a convolution filter is to make N x N lookups per fragment using two nested for loops per fragment. A simple calculation says that a 1024x1024 image blurred with a 4x4 Gaussian kernel would need 1024 x 1024 x 4 x 4 = 16M lookups.
What can one do about this?
Can one use some optimization that would need fewer lookups? I am not interested in kernel-specific optimizations like the ones for the Gaussian (or are they kernel-specific?)
Can one at least make these lookups faster by somehow exploiting the locality of the pixels one would work with?
Thanks!
Gaussian kernels are separable, which means you can do a horizontal pass first, then a vertical pass (or the other way around). That reduces the work per pixel from N^2 lookups to 2N. This works for all separable filters, not just for blur (not all filters are separable, but many are, and some are "as good as").
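A quick way to convince yourself of the separability on the CPU side (a NumPy sketch; the 5-tap kernel is just an example):

import numpy as np
from scipy.signal import convolve2d

# A 1D Gaussian-like kernel; the 2D kernel is its outer product with itself.
k1 = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
k1 /= k1.sum()
k2 = np.outer(k1, k1)

image = np.random.rand(256, 256)

# Separable version: filter the rows, then the columns (2N taps per pixel).
rows = np.apply_along_axis(lambda r: np.convolve(r, k1, mode='same'), 1, image)
separable = np.apply_along_axis(lambda c: np.convolve(c, k1, mode='same'), 0, rows)

# Direct 2D version (N^2 taps per pixel) for comparison.
direct = convolve2d(image, k2, mode='same')

print(np.abs(separable - direct).max())   # agrees up to floating-point error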
Or, in the particular case of a blur filter (Gauss or not), which are all kinds of "weighted sums", you can take advantage of texture interpolation, which may be faster for small kernel sizes (but definitely not for large kernel sizes).
EDIT: image for the "linear interpolation" method
EDIT (as requested by Jerry Coffin) to summarize the comments:
In the "texture filter" method, linear interpolation will produce a weighted sum of adjacent texels according to the inverse distance from the sample location to the texel center. This is done by the texturing hardware, for free. That way, 16 pixels can be summed in 4 fetches. Texture filtering can be exploited in addition to separating the kernel.
In the example image, on the top left, your sample (the circle) hits the center of a texel. What you get is the same as "nearest" filtering: you get that texel's value. On the top right, you are in the middle between two texels, so what you get is the 50/50 average between them (pictured by the lighter shade of blue). On the bottom right, you sample in between 4 texels, but somewhat closer to the top left one. That gives you a weighted average of all 4, but with the weight biased towards the top left one (darkest shade of blue).
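To make the arithmetic behind that concrete, here is a small sketch (plain Python rather than shader code) that folds pairs of taps of a 1D kernel into single linearly-interpolated fetches; the kernel values are just an example:

# A linear fetch at fractional offset t between texels A and B returns
# (1-t)*A + t*B, so a pair of taps with weights (w0, w1) becomes one fetch
# with weight w0+w1 placed at offset w1/(w0+w1) from the first texel.
kernel = [1/16, 4/16, 6/16, 4/16, 1/16]   # example 5-tap kernel

fetches = []
for i in range(0, len(kernel) - 1, 2):
    w0, w1 = kernel[i], kernel[i + 1]
    fetches.append((i + w1 / (w0 + w1), w0 + w1))   # (texel position, weight)
if len(kernel) % 2:
    fetches.append((float(len(kernel) - 1), kernel[-1]))   # leftover tap

print(fetches)   # 5 taps collapse into 3 hardware-filtered fetches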
The following suggestions are courtesy of datenwolf (see below):
"Another methods I'd like suggest is operating in fourier space, where convolution turns into a simple product of fourier transformed signal and fourier transformed kernel. Although the fourier transform on the GPU itself is quite tedious to implement, at least using OpenGL shaders. But it's quite easy done in OpenCL. Actually I implement such things using OpenCL, now, a lot of image processing in my 3D engine happens in OpenCL.
OpenCL has been specifically designed for running on GPUs. A Fast Fourier Transform is actually the piece of example code on Wikipedia's OpenCL article: en.wikipedia.org/wiki/OpenCL and yes, the performance gain is tremendous. An FFT executes in at most O(n log n), and the inverse the same. The Fourier representation of the filter kernel can be precomputed. The way is FFT -> multiply with kernel -> IFFT, which boils down to O(n + 2n log n) operations. Note that the actual convolution is just O(n) there.
In the case of a separable, finite convolution like a Gaussian blur, the separation solution will outperform the Fourier method. But in the case of generalized, possibly non-separable kernels, the Fourier method is probably the fastest method available.
OpenCL integrates nicely with OpenGL, e.g. you can use OpenGL buffers (textures and vertex) for both input and output of OpenCL programs."
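The direct-versus-FFT trade-off is easy to explore on the CPU with NumPy/SciPy; a sketch with stand-in image and kernel sizes:

import numpy as np
from scipy.signal import convolve2d, fftconvolve

image = np.random.rand(256, 256)
kernel = np.random.rand(15, 15)          # a generic, possibly non-separable kernel

# Direct convolution does O(kernel size) lookups per pixel; the FFT route is
# FFT -> multiply -> IFFT, independent of the kernel size.
direct = convolve2d(image, kernel, mode='same')
via_fft = fftconvolve(image, kernel, mode='same')

print(np.abs(direct - via_fft).max())    # agree up to floating-point error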
Beyond being separable, Gaussian filters can also be computed in O(1) per pixel, independently of the kernel size:
There are recursive computations like the Deriche one:
http://hal.inria.fr/docs/00/07/47/78/PDF/RR-1893.pdf
Rotoglup's answer to my question here may be worth reading; in particular, this blog post about Gaussian blur really helped me understand the concept of separable filters.
One more approach is approximating the Gaussian curve with a step-wise (piecewise-constant) function: https://arxiv.org/pdf/1107.4958.pdf (I guess piecewise-linear functions can also be used, of course).
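One well-known flavour of that idea is that repeated box (step-wise) filters converge towards a Gaussian, and each box pass only needs a running sum; a small NumPy sketch:

import numpy as np

box = np.ones(9) / 9.0                     # a step-wise (box) kernel

# Convolving the box with itself a few times approaches a Gaussian shape
# (central limit theorem); each box pass can be done in O(1) per pixel
# with a sliding-window sum.
approx = box
for _ in range(2):
    approx = np.convolve(approx, box)

x = np.arange(len(approx)) - len(approx) // 2
sigma = np.sqrt(3 * (9**2 - 1) / 12.0)     # variance of the sum of three boxes
gauss = np.exp(-x**2 / (2 * sigma**2))
gauss /= gauss.sum()
print(np.abs(approx - gauss).max())        # small residual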
I am using 3D maths in my application extensively. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or a similar SIMD code?
In my experience I typically see about a 3x improvement in taking an algorithm from x87 to SSE, and a better than 5x improvement in going to VMX/Altivec (because of complicated issues having to do with pipeline depth, scheduling, etc). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not for those where I'm doing one vector at a time ad hoc.
That's not the whole story, but it's possible to get further optimizations using SIMD. Have a look at Miguel's PDC 2008 presentation about implementing SIMD instructions in Mono:
(Benchmark chart from Miguel's blog entry; source: tirania.org)
For some very rough numbers: I've heard some people on ompf.org claim 10x speed ups for some hand-optimized ray tracing routines. I've also had some good speed ups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.
However, I should add that your algorithms should be designed for data-parallel execution.
This means that if you have a generic math library as you've mentioned then it should take packed vectors rather than individual vectors or you'll just be wasting your time.
E.g. Something like
namespace SIMD {
class PackedVec4d
{
    // Structure-of-arrays layout: each __m128 holds the same component of
    // four different vectors, so one SSE operation processes four vectors at once.
    __m128 x;
    __m128 y;
    __m128 z;
    __m128 w;
    //...
};
}
Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.
For 3D operations beware of un-initialized data in your W component. I've seen cases where SSE ops (_mm_add_ps) would take 10x normal time because of bad data in W.
The answer highly depends on what the library is doing and how it is used.
The gains can go from a few percentage points to "several times faster"; the areas most likely to see gains are those where you're not dealing with isolated vectors or values, but with multiple vectors or values that have to be processed in the same way.
Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors being processed.
The domains where gains can be the most drastic are probably those of image and signal processing, computational simulations, as well as general 3D maths operations on meshes (rather than isolated vectors).
These days all the good compilers for x86 generate SSE instructions for SP and DP float math by default. It's nearly always faster to use these instructions than the native x87 ones, even for scalar operations, as long as you schedule them correctly. This will come as a surprise to many who in the past found SSE to be "slow" and thought compilers could not generate fast SSE scalar instructions. But now you have to use a switch to turn off SSE generation and use x87. Note that x87 is effectively deprecated at this point and may be removed from future processors entirely. The one downside of this is that we may lose the ability to do 80-bit DP float in a register. But the consensus seems to be that if you are depending on 80-bit instead of 64-bit DP floats for precision, you should look for a more precision-loss-tolerant algorithm.
Everything above came as a complete surprise to me. It's very counterintuitive. But the data talks.
Most likely you will see only very small speedup, if any, and the process will be more complicated than expected. For more details see The Ubiquitous SSE vector class article by Fabian Giesen.
The Ubiquitous SSE vector class: Debunking a common myth
Not that important
First and foremost, your vector class is probably not as important for the performance of your program as you think (and if it is, it's more likely because you're doing something wrong than because the computations are inefficient). Don't get me wrong, it's probably going to be one of the most frequently used classes in your whole program, at least when doing 3D graphics. But just because vector operations will be common doesn't automatically mean that they'll dominate the execution time of your program.
(The article continues with the sections "Not so hot", "Not easy", "Not now", and "Not ever".)