Which operations in numpy use SIMD?

Good afternoon!
Currently, I'm digging into the reason why numpy is fast.
More specifically, I'm wondering why np.sum() is that fast.
My guess is that np.sum() uses some kind of SIMD optimization, but I'm not sure whether it does.
Is there any way that I can check which of numpy's methods use SIMD operations?
Thanks in advance.

Numpy does not currently use SIMD instructions for trivial np.sum calls. However, I made this PR, which should be merged soon and fixes this issue for integers (it will use the 256-bit AVX2 instruction set if available and the 128-bit SSE/Neon instruction set otherwise). Using SIMD instructions for np.sum with floating-point numbers is a bit harder due to the current algorithm used (pair-wise summation) and because one has to care about precision.
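For intuition, here is a rough, non-authoritative Python sketch of pairwise summation (the real Numpy implementation is in C and uses a larger, unrolled base case; the block size below is purely illustrative): the array is split recursively and the halves are summed separately, which bounds the rounding error but makes a straightforward SIMD loop less obvious.
import numpy as np

def pairwise_sum(x, block=128):
    # Sum small blocks naively; this base case is where SIMD could be applied.
    n = x.size
    if n <= block:
        s = 0.0
        for v in x:
            s += v
        return s
    # Otherwise split in half and sum each part separately (better accuracy).
    m = n // 2
    return pairwise_sum(x[:m], block) + pairwise_sum(x[m:], block)

x = np.random.rand(100_000)
assert np.isclose(pairwise_sum(x), np.sum(x))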
Is there any way that I can check which of numpy's methods use SIMD operations?
Low-level profilers and hardware-counter-based tools (e.g. Linux perf, Intel VTune) can do that, but they are not very user-friendly (i.e. you need to have some notions of assembly, know roughly how processors work, and read some documentation about hardware counters). Another solution is to look at the disassembled code of Numpy using tools like objdump (this requires pretty good knowledge of assembly and the name of the C function called), or simply look at the Numpy C code (note that compilers can autovectorize loops, so this solution is not so simple either).
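As a starting point, a small snippet like the following can locate the compiled Numpy extension that contains the inner loops so it can be fed to objdump; the module name is a Numpy 1.x internal detail and may differ between versions, and recent Numpy releases also report the detected SIMD extensions at the end of np.show_config().
import numpy as np
import numpy.core._multiarray_umath as um   # private module holding the C inner loops

print(um.__file__)   # path of the shared library to inspect, e.g. with objdump -d
np.show_config()     # build information (and, in recent versions, detected SIMD features)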
Update: if you are using np.sum on contiguous double-precision Numpy arrays, then the benefit of using SIMD instructions is not so big. Indeed, for large contiguous double-precision arrays not fitting in the cache, a scalar implementation should be able to saturate the memory bandwidth on most PCs (but certainly not on the Apple M1 nor on computing servers), especially on high-frequency processors. On small arrays (e.g. <4000 items), Numpy overheads dominate the execution time of such a function. For contiguous medium-sized arrays (e.g. >10K and <1M items), using SIMD instructions should result in a significant speed-up, especially for single-precision arrays (e.g. 3-4 times faster for double precision and 6-8 times faster for single precision on mainstream machines).
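A minimal timing sketch (sizes and repeat counts are arbitrary) to observe these regimes on your own machine:
import numpy as np
from timeit import timeit

for n in (1_000, 100_000, 10_000_000):          # small / medium / memory-bound
    a64 = np.random.rand(n)                     # contiguous double precision
    a32 = a64.astype(np.float32)                # single precision
    t64 = timeit(lambda: np.sum(a64), number=100) / 100
    t32 = timeit(lambda: np.sum(a32), number=100) / 100
    print(f"n={n:>9}: float64 {t64*1e6:9.2f} us, float32 {t32*1e6:9.2f} us")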

Related

With numpy, what is the fastest way to compute one solution to an underdetermined linear system?

With numpy, what is the fastest way to compute one solution to an underdetermined linear system? I don't care which solution the method would return, I'd be happy with any solution.
In particular, I'm dealing with a 7x7 rank-6 matrix which describes the dynamics of a physical system. I'm noticing numpy.linalg.lstsq, numpy.linalg.qr, scipy.linalg.null_space, and scipy.linalg.lu run on the full matrix are all slower on my machine than numpy.linalg.solve run on a correctly-trimmed 6x6 full-rank matrix; solve is twice as fast as lstsq (14.8 µs vs 29.1 µs).
Is there any way to speed up the computation without some horrible C LAPACK-level hacking?
Numpy is not designed to be efficient on very small matrices. Its overheads (due to type checks, value checks, iterators, allocations, etc.) can be quite big on such matrices. In fact, dozens of microseconds is reasonable for such a Numpy function call. Numba can reduce the overheads thanks to fully compiled native code. That being said, Numba can still have a small overhead (due to the call from CPython, a few type checks and allocations), but it is generally reasonable unless you work on extremely small inputs. In that case, it is better to use Numba in the caller function, since the problem is actually the slow CPython interpreter. The lazy compilation of the Numba function makes the first execution significantly slower. You can provide the signature to Numba to make it faster (eager compilation).
import numpy as np
import numba as nb

@nb.njit('(float64[:,::1], float64[::1])')
def solve_nb(a, b):
    return np.linalg.solve(a, b)
On my machine, it is about 16% faster on a 7x7 matrix. It requires the matrices to be contiguous (working on non-contiguous data is fundamentally inefficient, especially here). If this is not fast enough, then you can call dgesv directly for double-precision matrices (or sgesv for single-precision ones).
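For example, SciPy exposes low-level LAPACK wrappers; a hedged sketch of calling dgesv directly looks something like this (double-check the wrapper's exact return values in your SciPy version):
import numpy as np
from scipy.linalg import lapack

a = np.random.rand(7, 7)                # stand-in for the small system matrix
b = np.random.rand(7, 1)                # right-hand side as an n-by-1 column
lu, piv, x, info = lapack.dgesv(a, b)   # LU factorization + solve in one call
assert info == 0
assert np.allclose(a @ x, b)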
Actually, solve does use dgesv internally. lstsq appears to use a singular value decomposition (SVD). An SVD is significantly slower than a QR decomposition, which is generally a bit slower than an LU decomposition.
I am not an expert on the numerical/mathematical part, but AFAIK, solving this with an LU decomposition is less numerically stable than using a QR decomposition, which is in turn less numerically stable than an SVD. Also, I think an SVD/QR method should be used instead of a simple LU decomposition for matrices that are not full-rank.
The implementation of dgesv in the standard Netlib LAPACK uses an LU factorization followed by a call to dgetrs (see here). This latter call should be fast compared to the LU factorization. The code of LAPACK implementations is generally pretty generic, so it may have significant overhead on 7x7 matrices (AFAIK, the Intel implementation is one of the fastest in that regard).
An alternative solution is to write your own specialized LU decomposition and your own system solving using Numba or Cython. This solution is tedious, but it should be significantly faster, since the compiler can unroll the loops if it knows the bounds, reducing the overheads. You can also perform one allocation instead of multiple ones.
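As an illustration only (not the answer's code), a minimal Numba sketch of such a hand-written LU solve with partial pivoting might look like the following; a truly specialized version would hard-code n = 7 so the compiler can unroll the loops:
import numpy as np
import numba as nb

@nb.njit('(float64[:,::1], float64[::1])')
def lu_solve_small(a, b):
    n = a.shape[0]
    lu = a.copy()                       # single copy of the matrix
    x = b.copy()                        # right-hand side, permuted in place
    # LU factorization with partial pivoting (PA = LU)
    for k in range(n):
        p = k
        for i in range(k + 1, n):
            if abs(lu[i, k]) > abs(lu[p, k]):
                p = i
        if p != k:
            for j in range(n):
                lu[k, j], lu[p, j] = lu[p, j], lu[k, j]
            x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            f = lu[i, k] / lu[k, k]
            lu[i, k] = f                # store the multiplier (unit lower part)
            for j in range(k + 1, n):
                lu[i, j] -= f * lu[k, j]
    # Forward substitution with the unit lower triangular factor
    for i in range(1, n):
        s = x[i]
        for j in range(i):
            s -= lu[i, j] * x[j]
        x[i] = s
    # Backward substitution with the upper triangular factor
    for i in range(n - 1, -1, -1):
        s = x[i]
        for j in range(i + 1, n):
            s -= lu[i, j] * x[j]
        x[i] = s / lu[i, i]
    return x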

NumPy and decimal128

Say I have a memory buffer with a vector of std::decimal::decimal128 (IEEE 754R) elements. Can I wrap and expose that as a NumPy array, and do fast operations on those decimal vectors, like for example computing the variance or auto-correlation over the vector? How would I best do that?
Numpy does not support such a data type yet (at least on mainstream architectures). Only float16, float32, float64 and the non-standard native extended double (generally 80 bits) are supported; put shortly, only floating-point types natively supported by the target architecture. If the target machine supports 128-bit double-precision numbers, then you could try the numpy.longdouble type, but I do not expect this to be the case. In practice, neither x86 nor ARM processors support that yet. IBM processors like POWER9 support it natively, but I am not sure they (fully) support the IEEE-754R standard. For more information please read this. Note that you could theoretically wrap binary data in Numpy types, but you will not be able to do anything (really) useful with it. The Numpy code can theoretically be extended with new types, but please note that Numpy is written in C and not C++, so adding std::decimal::decimal128 to the source code will not be easy.
Note that if you really want to wrap such a type in a Numpy array without having to change/rebuild the Numpy code, you could wrap your type in a pure-Python class. However, be aware that the performance will be very bad, since using pure-Python objects prevents all the optimizations done in Numpy (e.g. SIMD vectorization, use of fast native code, specific algorithms optimized for a given type, etc.).
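For illustration, the pure-Python-object route looks like this with Python's built-in Decimal standing in for decimal128; it works functionally, but every element is a boxed Python object, so none of Numpy's fast paths apply:
import numpy as np
from decimal import Decimal

a = np.array([Decimal("1.10"), Decimal("2.20"), Decimal("3.30")], dtype=object)
print(a.sum())    # Decimal('6.60') -- exact decimal arithmetic
print(a.mean())   # works too, but via slow per-element Python calls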

Need help understanding Kernel Transport speed on GPU (numba, cupy, cuda)

While GPUs speed math calculations there is a fixed overhead for moving a kernel out to the GPU for execution that is high.
I'm using cupy and numba. The first time I execute a function call that is using cupy's GPU version of numpy it is quite slow. But the second time it is fast.
I've realized I don't understand how the kernel, or GPU code, gets out to the GPU to run. Operationally I want to understand this better so that I can know when the things I do will accidentally create a slow step due to some kernel transfer. So I need some sorts of rules or rules of thumb understand the concept.
For example, if I multiply two cupy arrays that are already stashed on the GPU, I might write C = A*B.
At some point the cupy overload of * multiplication has to be compiled into GPU code, and it automagically will also get wrapped by loops that break it down into blocks and threads. So presumably this code is some kernel that gets transported out to the GPU. I'm guessing that the next time I call C*D, the GPU no longer needs to be taught what * means and so it will be fast.
But at some point I would imagine the GPU needs to clear out old code, so * or other operations not being used at that moment might get flushed from memory, and so later on, when the call for A*B happens again, there's going to be a time penalty to recompile it on the GPU.
Or so I imagine. If I'm right, how do I know when these kernels are going to stick around or disappear?
If I'm wrong and this isn't how it works, or there is some other slow step (I'm assuming the data is already transported to arrays on the GPU), then what is this slow step, and how does one organize things so one pays it as little as possible?
I'm trying to avoid writing explicit numba thread-management kernels as one does in CUDA C++, and instead just use the standard numba @njit, @vectorize and @stencil decorators. Likewise in Cupy I want to just work at the level of the numpy syntax, not dive into thread management.
I've read a lot of docs on this, but they just refer to the overheads of kernels, not when these get paid and how one controls that, so I'm confused.
I don't have a full answer to this yet. But so far the biggest clue I've gotten has come from reading up on the currently undocumented function @cupy.fuse(), which makes it clearer than the @numba.jit documents where the kernel launch costs are paid. I have not found the connection to Contexts yet, as recommended by @talonmies.
see https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
The key example is this
import cupy

c = cupy.arange(4)

@cupy.fuse()
def foo(x):
    return x + x + x + x + x
foo(c) will be three times slower with @cupy.fuse() commented out, because each "+" involves a kernel load and a kernel free. Fusion merges all the adds into a single kernel, so the launch and free are paid once. For matrices less than 1 million in size on a typical 2018 GPU, the add() is so fast that the launch and free are the dominant times.
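A rough timing sketch of that effect (the array size, repeat count and synchronization calls are illustrative; actual numbers depend heavily on the GPU):
import time
import cupy

x = cupy.arange(1_000_000, dtype=cupy.float32)

def unfused(v):
    return v + v + v + v + v      # four separate element-wise kernels

@cupy.fuse()
def fused(v):
    return v + v + v + v + v      # merged into a single kernel

for name, fn in (("unfused", unfused), ("fused", fused)):
    fn(x)                                   # warm-up: compile + first launch
    cupy.cuda.Device().synchronize()
    t0 = time.perf_counter()
    for _ in range(1000):
        fn(x)
    cupy.cuda.Device().synchronize()
    print(name, time.perf_counter() - t0)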
I wish I could find some documentation on @fuse. For example, does it unroll internal functions the way @jit does? Could I achieve that by stacking @jit and @fuse?
I'm still however largely in the dark about when the costs are getting paid in numba.

How to translate computation in index notation into a sequence of SIMD ops in the general case?

UPD: the question in its original form is poorly formulated because I strongly confuse terminology (SIMD vs vectorized computations) and give too broad an example that does not specify exactly what the problem is; I voted to close it with "unclear what you're asking", and I'll link a better-formulated question above whenever it appears.
In mathematics, one would usually describe an n-dimensional tensor computation using index notation, which would look something like:
A[i,j,k] = B[k,j] + C[d[k],i,B[k,j]] + d[k]*f[j] // for 0<i<N, 0<j<M, 0<k<K
but if we want to use any SIMD library to efficiently parallelize that computation (and take advantage of linear-algebraic magic), we would have to express it using primitives from BLAS, numpy, tensorflow, OpenCL, ..., which is often quite tricky.
Expressions in Einstein notation like A_ijk*B_kj are generally solved via np.einsum (using tensordot, sum and transpose, I guess?). Summation and other element-wise ops are also okay; "smart" indexing is quite tricky, though (especially if an index appears more than once in the expression).
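For example, a contraction written in Einstein notation as A_ijk*B_kj (summing over the repeated indices j and k) maps directly to np.einsum; the shapes below are made up for illustration:
import numpy as np

A = np.random.rand(4, 5, 6)          # A[i, j, k]
B = np.random.rand(6, 5)             # B[k, j]
out = np.einsum('ijk,kj->i', A, B)   # sum over j and k, keep the free index i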
I wonder if there are any language-agnostic libraries that take an expression in a certain form (let's say, the form above) and translate it into some intermediate representation that can be efficiently executed using existing linear-algebra libraries?
There are libraries that attempt to parallelize loop computations (the user API usually looks like #pragma in C++ or @numba.jit in Python), but I'm asking about a slightly different thing: translating an arbitrary expression in the form above into a finite sequence of SIMD commands, like element-wise ops, matvecs, tensordots, etc.
If there are no language-agnostic solutions yet, I am personally interested in numpy computations :)
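For concreteness, here is a minimal NumPy sketch of the example expression above using broadcasting and "fancy" (gather-style) indexing; the shapes, and the assumption that B and d hold integer indices, are purely illustrative since the original expression does not specify them:
import numpy as np

N, M, K = 3, 4, 5
B = np.random.randint(0, N, size=(K, M))   # used both as a value and as an index
C = np.random.rand(K, N, N)                # indexed as C[d[k], i, B[k, j]]
d = np.random.randint(0, K, size=K)
f = np.random.rand(M)

# Reference implementation with explicit loops
A_ref = np.empty((N, M, K))
for i in range(N):
    for j in range(M):
        for k in range(K):
            A_ref[i, j, k] = B[k, j] + C[d[k], i, B[k, j]] + d[k] * f[j]

# Vectorized version: broadcast index grids, then gather via fancy indexing
ii = np.arange(N)[:, None, None]
jj = np.arange(M)[None, :, None]
kk = np.arange(K)[None, None, :]
A = B[kk, jj] + C[d[kk], ii, B[kk, jj]] + d[kk] * f[jj]
assert np.allclose(A, A_ref)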
Further questions about the code:
I see B[k,j] is used as an index and as a value. Is everything integer? If not, which parts are FP, and where does the conversion happen?
Why does i not appear in the right hand side? Is the same data repeated N times?
Oh yikes, so you have a gather operation, with indices coming from d[k] and B[k,j]. Only a few SIMD instruction sets support this (e.g. AVX2).
I mostly manually vectorize stuff in C with Intel's x86 intrinsics (or use auto-vectorization and check the compiler's asm output to make sure it didn't suck), so IDK if there's any kind of platform-independent way to express that operation.
I wouldn't expect that many cross-platform SIMD languages would provide a gather or anything built on top of a gather. I haven't used numpy though.
I don't expect you'd find a BLAS, LAPACK, or other library function that includes a gather, unless you go looking for implementations of this exact problem.
With an efficient gather (e.g. Intel Skylake or Xeon Phi), it might vectorize ok if you use SIMD in the loop over j, so you load a whole vector at once from B[], and from f[], and use it with a vector holding d[k] broadcast to every position. You probably want to store a transposed result matrix, like A[i][k][j], so the final store doesn't have to be a scatter. You definitely need to avoid looping over k in the inner-most loop, since that makes loads from B[] non-contiguous, and you have d[k] instead of f[j] varying inside the inner loop.
I haven't done much with GPGPU, but they do SIMD differently. Instead of short vectors like CPUs use, they have effectively many scalar processors ganged together. OpenCL or CUDA or whatever other hot new GPGPU tech might handle your gathers much more efficiently.
SIMD commands, like element-wise ops, matvecs, tensordots, etc.
When I think of "SIMD commands", I think of x86 assembly instructions (or ARM NEON, or whatever), or at least C / C++ intrinsics that compile to single instructions. :P
A matrix-vector product is not a single "instruction". If you used that terminology, every function that processes a buffer would be "a SIMD instruction".
The last part of your question seems to be asking for a programming-language independent version of numpy, for gluing together high-performance library functions. Or were you thinking that there might be something that would inter-optimize such operations, so you could write something that would compile to a vectorized loop that did stuff like use each input more than once without having to reload it in separate library calls?
IDK if there's anything like that, other than normal C compiler auto-vectorization of loops over arrays.

How much speed-up from converting 3D maths to SSE or other SIMD?

I am using 3D maths in my application extensively. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or a similar SIMD code?
In my experience I typically see about a 3x improvement when taking an algorithm from x87 to SSE, and better than a 5x improvement when going to VMX/Altivec (because of complicated issues having to do with pipeline depth, scheduling, etc.). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not for those where I'm doing one vector at a time ad hoc.
That's not the whole story, but it's possible to get further optimizations using SIMD; have a look at Miguel's presentation from PDC 2008 about when he implemented SIMD instructions in Mono:
(Picture from Miguel's blog entry; source: tirania.org)
For some very rough numbers: I've heard some people on ompf.org claim 10x speed ups for some hand-optimized ray tracing routines. I've also had some good speed ups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.
However, I should add that your algorithms should be designed for data-parallel execution.
This means that if you have a generic math library as you've mentioned then it should take packed vectors rather than individual vectors or you'll just be wasting your time.
E.g. something like:
#include <xmmintrin.h>

namespace SIMD {
    class PackedVec4d
    {
        __m128 x;
        __m128 y;
        __m128 z;
        __m128 w;
        //...
    };
}
Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.
For 3D operations beware of un-initialized data in your W component. I've seen cases where SSE ops (_mm_add_ps) would take 10x normal time because of bad data in W.
The answer highly depends on what the library is doing and how it is used.
The gains can go from a few percentage points to "several times faster"; the areas most likely to see gains are those where you're not dealing with isolated vectors or values, but with multiple vectors or values that have to be processed in the same way.
Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors being processed.
The domains where gains can be the most drastic are probably those of image and signal processing, computational simulations, as well as general 3D maths operations on meshes (rather than isolated vectors).
These days all the good compilers for x86 generate SSE instructions for SP and DP float math by default. It's nearly always faster to use these instructions than the native ones, even for scalar operations, so long as you schedule them correctly. This will come as a surprise to many, who in the past found SSE to be "slow" and thought compilers could not generate fast SSE scalar instructions. But now you have to use a switch to turn off SSE generation and use x87. Note that x87 is effectively deprecated at this point and may be removed from future processors entirely. The one downside of this is that we may lose the ability to do 80-bit DP float in a register. But the consensus seems to be that if you are depending on 80-bit instead of 64-bit DP floats for the precision, you should look for a more precision-loss-tolerant algorithm.
Everything above came as a complete surprise to me. It's very counter intuitive. But data talks.
Most likely you will see only very small speedup, if any, and the process will be more complicated than expected. For more details see The Ubiquitous SSE vector class article by Fabian Giesen.
The Ubiquitous SSE vector class: Debunking a common myth
Not that important
First and foremost, your vector class is probably not as important for the performance of your program as you think (and if it is, it's more likely because you're doing something wrong than because the computations are inefficient). Don't get me wrong, it's probably going to be one of the most frequently used classes in your whole program, at least when doing 3D graphics. But just because vector operations will be common doesn't automatically mean that they'll dominate the execution time of your program.
(The article's remaining sections are titled "Not so hot", "Not easy", "Not now", and "Not ever".)