With numpy, what is the fastest way to compute one solution to an underdetermined linear system? - numpy

With numpy, what is the fastest way to compute one solution to an underdetermined linear system? I don't care which solution the method would return, I'd be happy with any solution.
In particular, I'm dealing with a 7x7 rank-6 matrix which describes the dynamics of a physical system. I'm noticing numpy.linalg.lstsq, numpy.linalg.qr, scipy.linalg.null_space, and scipy.linalg.lu run on the full matrix are all slower on my machine than numpy.linalg.solve run on a correctly-trimmed 6x6 full-rank matrix; solve is twice as fast as lstsq (14.8 µs vs 29.1 µs).
Is there any way to speed up the computation without some horrible C LAPACK-level hacking?

NumPy is not designed to be efficient on very small matrices. Its overheads (type checks, value checks, iterators, allocations, etc.) can be quite large relative to the actual work on such matrices, so a few dozen microseconds per NumPy function call is to be expected. Numba can reduce these overheads by compiling the function to native code. That said, Numba still has a small overhead of its own (the call from CPython, a few type checks, and allocations), but it is generally reasonable unless you work on extremely small inputs. In that case, it is better to compile the caller function with Numba as well, since the real problem is then the slow CPython interpreter. Numba compiles lazily by default, which makes the first call significantly slower. You can provide the signature to Numba so that it compiles eagerly, when the function is defined, and the first call is not penalized.
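import numpy as np
import numba as nb

# Passing the signature triggers eager compilation; '::1' marks a contiguous axis.
@nb.njit('(float64[:,::1], float64[::1])')
def solve_nb(a, b):
    return np.linalg.solve(a, b)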
On my machine, it is about 16% faster on a 7x7 matrix. It requires the arrays to be contiguous (working on non-contiguous arrays is fundamentally inefficient, especially here). If this is not fast enough, you can call dgesv directly for double-precision matrices (or sgesv for single-precision).
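One way to reach dgesv from Python without writing any C is SciPy's low-level LAPACK wrapper, scipy.linalg.lapack.dgesv. The sketch below is only an illustration on a made-up 6x6 system (the matrix and right-hand side are assumptions, not the asker's data). Whether it beats np.linalg.solve on a given machine has to be measured, since the wrapper still carries some Python-level overhead.

```python
import numpy as np
from scipy.linalg.lapack import dgesv

rng = np.random.default_rng(0)
# Hypothetical 6x6 system, stored in Fortran order so that the wrapper
# does not need to copy the matrix before calling LAPACK.
a = np.asfortranarray(rng.standard_normal((6, 6)) + 6.0 * np.eye(6))
b = rng.standard_normal(6)

# dgesv returns the LU factors, the pivot indices, the solution and a status flag.
lu, piv, x, info = dgesv(a, b.reshape(-1, 1))
if info != 0:
    raise np.linalg.LinAlgError(f"dgesv failed with info={info}")
x = x.ravel()  # back to a 1-D solution vector
```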
Actually, solve uses dgesv internally, while lstsq appears to use a singular value decomposition (SVD). An SVD is significantly slower than a QR decomposition, which is in turn generally a bit slower than an LU decomposition.
I am not an expert on the numerical/mathematical side, but AFAIK, solving this with an LU decomposition is less numerically stable than using a QR decomposition, which is in turn less numerically stable than an SVD. Also, I think an SVD/QR method should be used instead of a simple LU decomposition for matrices that are not full-rank.
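To make the trade-off concrete, here is a minimal sketch of both routes on a synthetic rank-6 7x7 system. The matrix, the right-hand side, the explicit rcond value and the choice of dropping the last row and column are all assumptions made for illustration; on real data the trimmed block has to be chosen so that it is actually full-rank.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 7x7 matrix of rank 6 (product of a 7x6 and a 6x7 factor), with a
# right-hand side built from a known vector so that the system is consistent.
a = rng.standard_normal((7, 6)) @ rng.standard_normal((6, 7))
b = a @ rng.standard_normal(7)

# SVD-based route: works directly on the rank-deficient matrix and returns
# the minimum-norm solution. rcond=1e-10 is an arbitrary cutoff chosen here.
x_lstsq, _, rank, _ = np.linalg.lstsq(a, b, rcond=1e-10)

# LU-based route: drop the last row and column (assumed here to leave a
# full-rank 6x6 block), solve it, and set the dropped unknown to zero.
x_trim = np.zeros(7)
x_trim[:6] = np.linalg.solve(a[:6, :6], b[:6])

print(rank, np.allclose(a @ x_lstsq, b), np.allclose(a @ x_trim, b))
```

Both vectors solve the system, but they are different solutions: lstsq picks the one with the smallest norm, while the trimmed solve picks the one whose last component is zero.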
The dgesv implementation in the standard Netlib LAPACK uses an LU factorization followed by a call to dgetrs (see here). This latter call should be fast compared to the LU factorization. LAPACK implementations are generally quite generic, so they may have significant overhead on 7x7 matrices (AFAIK, the Intel implementation is one of the fastest in that regard).
An alternative is to write your own specialized LU decomposition and system solver using Numba or Cython. This is tedious, but it should be significantly faster, since the compiler can unroll the loops when it knows the bounds, which reduces the overheads. You can also perform a single allocation instead of several.
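As a rough sketch of what such a hand-written solver could look like, the function below does Gaussian elimination with partial pivoting using plain scalar loops, which is the style Numba compiles well. The name solve_small and the test system are hypothetical, and the fallback decorator is only there so the code still runs (slowly, in pure Python) when Numba is not installed. The size is left generic here; hard-coding it would give the compiler a better chance to unroll the loops.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for this sketch): run uncompiled if Numba is missing.
    def njit(func):
        return func

@njit
def solve_small(a, b):
    """Solve a @ x = b for a small square full-rank matrix a."""
    n = a.shape[0]
    lu = a.copy()   # working copy of the matrix, overwritten by the elimination
    x = b.copy()    # working copy of the right-hand side, becomes the solution
    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        p = k
        for i in range(k + 1, n):
            if abs(lu[i, k]) > abs(lu[p, k]):
                p = i
        if p != k:
            for j in range(n):
                tmp = lu[k, j]
                lu[k, j] = lu[p, j]
                lu[p, j] = tmp
            tmp = x[k]
            x[k] = x[p]
            x[p] = tmp
        # Eliminate column k below the diagonal.
        for i in range(k + 1, n):
            factor = lu[i, k] / lu[k, k]
            for j in range(k, n):
                lu[i, j] -= factor * lu[k, j]
            x[i] -= factor * x[k]
    # Back substitution on the upper-triangular system.
    for i in range(n - 1, -1, -1):
        s = x[i]
        for j in range(i + 1, n):
            s -= lu[i, j] * x[j]
        x[i] = s / lu[i, i]
    return x

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6)) + 6.0 * np.eye(6)   # hypothetical 6x6 system
b = rng.standard_normal(6)
x = solve_small(a, b)
```

The sketch does no singularity check, so it is only meant for a matrix already known to be full-rank, such as the trimmed 6x6 block from the question.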

Related

Which operations in numpy use SIMD?

Good afternoon!
Currently, I'm digging into the reasons why numpy is fast.
More specifically, I'm wondering why np.sum() is so fast.
My guess is that np.sum() uses some kind of SIMD optimization, but I'm not sure whether it does.
Is there any way that I can check which numpy methods use SIMD operations?
Thanks in advance
NumPy does not yet use SIMD instructions for trivial np.sum calls. However, I made this PR, which should be merged soon and fixes this issue for integers (it will use the 256-bit AVX2 instruction set if available and the 128-bit SSE/Neon instruction set otherwise). Using SIMD instructions for np.sum with floating-point numbers is a bit harder because of the algorithm currently used (pairwise summation) and because one has to care about precision.
Is there any way that I can check which numpy methods use SIMD operations?
Low-level profilers and hardware-counter-based tools (e.g. Linux perf, Intel VTune) can do that, but they are not very user-friendly (i.e. you need some notions of assembly, a rough idea of how processors work, and you have to read some documentation about hardware counters). Another solution is to look at the disassembled code of NumPy using tools like objdump (this requires a pretty good knowledge of assembly and the name of the C function being called) or simply to look at the NumPy C code (note that compilers can auto-vectorize loops, so this solution is not so simple).
Update: If you are using np.sum on contiguous double-precision NumPy arrays, then the benefit of using SIMD instructions is not so big. Indeed, for large contiguous double-precision arrays that do not fit in the cache, a scalar implementation should be able to saturate the memory bandwidth on most PCs (but certainly not on the Apple M1 nor on computing servers), especially on high-frequency processors. On small arrays (e.g. <4000 items), NumPy overheads dominate the execution time of such a function. For contiguous medium-sized arrays (e.g. >10K and <1M items), using SIMD instructions should result in a significant speed-up, especially for single-precision arrays (e.g. 3-4 times faster on DP and 6-8 times faster on SP on mainstream machines).
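A quick way to see which regime a given machine and NumPy build fall into is simply to time np.sum for a few sizes and dtypes. The helper below is a hypothetical sketch (time_sum and the chosen sizes are not from the answer), and the timings it prints will vary with the hardware and the NumPy version.

```python
import timeit
import numpy as np

def time_sum(n, dtype, repeat=5, number=200):
    """Best time per np.sum call (in seconds) on a contiguous array of n items."""
    x = np.ones(n, dtype=dtype)
    best = min(timeit.repeat(lambda: np.sum(x), repeat=repeat, number=number))
    return best / number

# One small size (overhead-dominated) and one medium size (compute-dominated).
for n in (1_000, 100_000):
    t32 = time_sum(n, np.float32)
    t64 = time_sum(n, np.float64)
    print(f"n={n:>7}: float32 {t32 * 1e6:.2f} us, float64 {t64 * 1e6:.2f} us")
```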

Use PyTorch to speed up linear least squares optimization with bounds?

I'm using scipy.optimize.lsq_linear to run some linear least squares optimizations and all is well, but a little slow. My A matrix is typically about 100 x 10,000 in size and sparse (sparsity usually ~50%). The bounds on the solution are critical. Given my tolerance lsq_linear typically solves the problems in about 10 seconds and speeding this up would be very helpful for running many optimizations.
I've read about speeding up linear algebra operations using GPU acceleration in PyTorch. It looks like PyTorch handles sparse arrays (torch calls them tensors), which is good. However, I've been digging through the PyTorch documentation, particularly the torch.optim and torch.linalg packages, and I haven't found anything that appears to be able to do a linear least squares optimization with bounds.
Is there a torch method that can do linear least squares optimization with bounds like scipy.optimize.lsq_linear?
Is there another way to speed up lsq_linear or to perform the optimization in a faster way?
For what it's worth, I think I've pushed lsq_linear pretty far. I don't think I can decrease the number of matrix elements, increase sparsity or decrease optimization tolerances much further without sacrificing the results.
Not easily, no.
I'd try to profile lsq_linear on your problem to see if it's pure python overhead (which can probably be trimmed some) or linear algebra. In the latter case, I'd start with vendoring the lsq_linear code and swapping relevant linear algebra routines. YMMV though.
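A minimal sketch of that profiling step with the standard-library cProfile module is shown below. The problem is a small random stand-in (its sizes, sparsity pattern and bounds are assumptions), so the breakdown it prints is only indicative; the real matrix should be profiled the same way.

```python
import cProfile
import io
import pstats

import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
# Hypothetical stand-in problem, much smaller than the 100 x 10,000 case.
A = rng.standard_normal((50, 200))
A[rng.random(A.shape) < 0.5] = 0.0   # roughly 50% zeros
b = rng.standard_normal(50)
lb, ub = -1.0, 1.0

profiler = cProfile.Profile()
profiler.enable()
res = lsq_linear(A, b, bounds=(lb, ub))
profiler.disable()

# Show the ten entries with the largest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

If most of the cumulative time sits in LAPACK or sparse linear algebra calls, swapping those routines is the place to look; if it sits in Python-level bookkeeping, trimming the wrapper code is more promising.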

Increase performance when calculating feature matrix?

Does calculate_feature_matrix use any libraries such as numba to increase performance?
I am one of the maintainers of Featuretools. calculate_feature_matrix currently only uses functions from Pandas/Numpy/Scipy to increase performance over raw Python. There are several areas where using numba or Cython may help, particularly in the PandasBackend class and in individual feature computation functions.
However, doing so requires a C-compiler or compiled C code, and so adds extra complexity to the installation. Because of this complexity it's currently not high on our priority list, but we may consider adding it in the future.
Instead, we are more focused on scalability to larger datasets, which involves parallelization rather than subroutine optimization.

How to translate computation in index notation into sequence of SIMD ops in general case?

UPD: the question in its original form is poorly formulated, because I badly confuse terminology (SIMD vs. vectorized computations) and give an overly broad example that does not specify exactly what the problem is. I voted to close it as "unclear what you're asking"; I'll link a better-formulated question above whenever it appears.
In mathematics, one would usually describe n-dimensional tensor computation using index notation, that would look something like:
A[i,j,k] = B[k,j] + C[d[k],i,B[k,j]] + d[k]*f[j] // for 0<i<N, 0<j<M, 0<k<K
but if we want to use any SIMD library to efficiently parallelize that computation (and take advantage of linear-algebraic magic), we would have to express it using primitives from BLAS, numpy, tensorflow, OpenCL, ... that is often quite tricky.
Expressions in Einstein notation like A_ijk*B_kj are generally evaluated via np.einsum (using tensordot, sum and transpose, I guess?). Summation and other element-wise ops are also okay; "smart" indexing is quite tricky, though (especially if an index appears more than once in the expression).
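For reference, the contraction A_ijk*B_kj can be written three equivalent ways in NumPy. The sketch below uses arbitrary small shapes (an assumption, just for illustration) and treats the repeated indices j and k as summed, following the Einstein convention.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 3, 4, 5                      # arbitrary small sizes for illustration
A = rng.standard_normal((N, M, K))
B = rng.standard_normal((K, M))

# Einstein convention: j and k are repeated, so they are summed over; i is free.
out_einsum = np.einsum("ijk,kj->i", A, B)

# Same contraction with tensordot: pair A's axes (1, 2) = (j, k) with B's axes (1, 0).
out_tensordot = np.tensordot(A, B, axes=([1, 2], [1, 0]))

# Same contraction with element-wise ops, a transpose and a sum.
out_manual = (A * B.T[None, :, :]).sum(axis=(1, 2))
```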
I wonder if there are any language-agnostic libraries that take an expression in a certain form (let's say, the form above) and translate it into some Intermediate Representation that can be efficiently executed using existing linear-algebra libraries?
There are libraries that attempt to parallelize loop computations (the user API usually looks like #pragma in C++ or @numba.jit in Python), but I'm asking about a slightly different thing: translating an arbitrary expression in the form above into a finite sequence of SIMD commands, like element-wise ops, matvecs, tensordots, etc.
If there are no language-agnostic solutions yet, I am personally interested in numpy computations :)
Further questions about the code:
I see B[k,j] is used as an index and as a value. Is everything integer? If not, which parts are FP, and where does the conversion happen?
Why does i not appear in the right hand side? Is the same data repeated N times?
Oh yikes, so you have a gather operation, with indices coming from d[k] and B[k,j]. Only a few SIMD instruction sets support this (e.g. AVX2).
I mostly manually vectorize stuff in C, with Intel's x86 intrinsics, (or auto-vectorization and check the compiler's asm output to make sure it didn't suck), so IDK if there's any kind of platform-independent way to express that operation.
I wouldn't expect that many cross-platform SIMD languages would provide a gather or anything built on top of a gather. I haven't used numpy though.
I don't expect you'd find a BLAS, LAPACK, or other library function that includes a gather, unless you go looking for implementations of this exact problem.
With an efficient gather (e.g. Intel Skylake or Xeon Phi), it might vectorize ok if you use SIMD in the loop over j, so you load a whole vector at once from B[], and from f[], and use it with a vector holding d[k] broadcast to every position. You probably want to store a transposed result matrix, like A[i][k][j], so the final store doesn't have to be a scatter. You definitely need to avoid looping over k in the inner-most loop, since that makes loads from B[] non-contiguous, and you have d[k] instead of f[j] varying inside the inner loop.
I haven't done much with GPGPU, but they do SIMD differently. Instead of short vectors like CPUs use, they have effectively many scalar processors ganged together. OpenCL or CUDA or whatever other hot new GPGPU tech might handle your gathers much more efficiently.
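For the NumPy case the asker mentions, the gather can be expressed with integer-array indexing and broadcasting. The sketch below assumes that B and d hold integers, that C has shape (D, N, V) with V larger than any value in B, and that the indices run over their full ranges; the sizes and the loop-based reference are there only to check the vectorized expression.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K, D, V = 3, 4, 5, 6, 7          # hypothetical sizes
B = rng.integers(0, V, size=(K, M))    # used both as a value and as an index
d = rng.integers(0, D, size=K)
f = rng.standard_normal(M)
C = rng.standard_normal((D, N, V))

# Index arrays shaped so that they broadcast against each other to (N, M, K).
i_idx = np.arange(N)[:, None, None]    # (N, 1, 1)
Bt = B.T[None, :, :]                   # (1, M, K), Bt[0, j, k] == B[k, j]
d_idx = d[None, None, :]               # (1, 1, K)

# Integer-array indexing performs the gather C[d[k], i, B[k, j]].
A = Bt + C[d_idx, i_idx, Bt] + d_idx * f[None, :, None]

# Reference version with explicit loops, used only to check the result.
A_ref = np.empty((N, M, K))
for i in range(N):
    for j in range(M):
        for k in range(K):
            A_ref[i, j, k] = B[k, j] + C[d[k], i, B[k, j]] + d[k] * f[j]
```

This says nothing about whether NumPy uses hardware gather instructions internally; it only removes the Python-level loops.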
SIMD commands, like elementwise-ops, matvecs, tensordots and etc.
When I think of "SIMD commands", I think of x86 assembly instructions (or ARM NEON, or whatever), or at least C / C++ intrinsics that compile to single instructions. :P
A matrix-vector product is not a single "instruction". If you used that terminology, every function that processes a buffer would be "a SIMD instruction".
The last part of your question seems to be asking for a programming-language independent version of numpy, for gluing together high-performance library functions. Or were you thinking that there might be something that would inter-optimize such operations, so you could write something that would compile to a vectorized loop that did stuff like use each input more than once without having to reload it in separate library calls?
IDK if there's anything like that, other than normal C compiler auto-vectorization of loops over arrays.

Should I trust BLAS libraries unconditionally to improve performance?

I am working on some project that involves computationally intensive image processing algorithms that involve a lot of steps that could be handled by BLAS libraries (mostly level 1 routines). Since my data is quite large it certainly makes sense to consider using BLAS.
I have seen examples where optimised BLAS libraries offer a tremendous increase in performance (a factor of 10 speedup for matrix-matrix multiplication is nothing unusual).
Should I apply the BLAS functions whenever possible and trust it blindly that it will yield a better performance or should I do a case by case analysis and only apply BLAS where it is necessary?
Blindly applying BLAS has the benefit that I save some time now since I don't have to profile my code in detail. On the other hand, carefully analysing each method might give me the best possible performance but I wonder if it is worth spending a few hours now just to gain half a second later when running the software.
A while ago, I read in a book: (1) Golden rule about optimization: don't do it (2) Golden rule about optimization (for experts only): don't do it yet. In short, I'd recommend proceeding as follows:
step 1: implement the algorithms in the simplest / most legible way
step 2: measure performance
step 3: if (and only if) performance is not satisfactory, use a profiler to detect the hot spots. They are often not where we think!
step 4: try different alternatives for the hot spots only (measure performance for each alternative)
More specifically about your question: yes, a good implementation of BLAS can make some difference (it may use AVX instruction sets and, for matrix-matrix multiplication, decompose the matrix into blocks in a way that is more cache-friendly). But again, I would not "trust unconditionally" (it depends on the version of BLAS, on the data, on the target machine, etc.), so measuring performance and comparing is absolutely necessary.
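As an illustration of the "measure each alternative" step for a level 1 operation, here is a small sketch in Python/NumPy that times a dot product done through np.dot (typically backed by BLAS for float64 vectors) against a plain multiply-and-sum. The array size and function names are arbitrary choices, and the outcome depends on the BLAS build and the machine, which is exactly why it is worth measuring.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = rng.standard_normal(1_000_000)

def dot_blas(x, y):
    # For 1-D float64 arrays, np.dot is typically dispatched to the BLAS dot routine.
    return np.dot(x, y)

def dot_plain(x, y):
    # Same result without BLAS: element-wise product, then a reduction.
    return np.sum(x * y)

for func in (dot_blas, dot_plain):
    best = min(timeit.repeat(lambda: func(x, y), repeat=5, number=20)) / 20
    print(f"{func.__name__}: {best * 1e3:.3f} ms per call")
```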