How to translate a computation in index notation into a sequence of SIMD ops in the general case? - numpy

UPD: the question in its original form is poorly formulated because I strongly confuse terminology (SIMD vs vectorized computations) and give too broad an example that does not specify exactly what the problem is; I voted to close it with "unclear what you're asking", and I'll link a better-formulated question above whenever it appears
In mathematics, one would usually describe n-dimensional tensor computation using index notation, that would look something like:
A[i,j,k] = B[k,j] + C[d[k],i,B[k,j]] + d[k]*f[j] // for 0<i<N, 0<j<M, 0<k<K
but if we want to use any SIMD library to efficiently parallelize that computation (and take advantage of linear-algebraic magic), we would have to express it using primitives from BLAS, numpy, tensorflow, OpenCL, ... which is often quite tricky.
Expressions in Einstein notation like A_ijk*B_kj are generally handled via np.einsum (using tensordot, sum and transpose under the hood, I guess?). Summation and other element-wise ops are also okay; "smart" indexing is quite tricky, though (especially if an index appears more than once in the expression).
I wonder if there are any language-agnostic libraries that take an expression in a certain form (let's say, the form above) and translate it into some Intermediate Representation that can be efficiently executed using existing linear-algebra libraries?
There are libraries that attempt to parallelize loop computations (the user API usually looks like #pragma in C++ or @numba.jit in python), but I'm asking about a slightly different thing: translating an arbitrary expression in the form above into a finite sequence of SIMD commands, like element-wise ops, matvecs, tensordots, etc.
If there are no language-agnostic solutions yet, I am personally interested in numpy computations :)

Further questions about the code:
I see B[k,j] is used as an index and as a value. Is everything integer? If not, which parts are FP, and where does the conversion happen?
Why does i not appear on the right-hand side? Is the same data repeated N times?
Oh yikes, so you have a gather operation, with indices coming from d[k] and B[k,j]. Only a few SIMD instruction sets support this (e.g. AVX2).
I mostly manually vectorize stuff in C with Intel's x86 intrinsics (or use auto-vectorization and check the compiler's asm output to make sure it didn't suck), so IDK if there's any kind of platform-independent way to express that operation.
I wouldn't expect that many cross-platform SIMD languages would provide a gather or anything built on top of a gather. I haven't used numpy though.
I don't expect you'd find a BLAS, LAPACK, or other library function that includes a gather, unless you go looking for implementations of this exact problem.
With an efficient gather (e.g. Intel Skylake or Xeon Phi), it might vectorize ok if you use SIMD in the loop over j, so you load a whole vector at once from B[], and from f[], and use it with a vector holding d[k] broadcast to every position. You probably want to store a transposed result matrix, like A[i][k][j], so the final store doesn't have to be a scatter. You definitely need to avoid looping over k in the inner-most loop, since that makes loads from B[] non-contiguous, and you have d[k] instead of f[j] varying inside the inner loop.
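To make that concrete, here is a minimal sketch (not code from the question; the element types, array shapes, and the transposed output At[i][k][j] are assumptions for illustration) of the loop ordering described above, with j as the innermost loop:
/* B and d are assumed to hold integers because they are used as indices; A, C
   and f are assumed to be float.  Within the j loop, B[k][j] and f[j] are
   contiguous loads, d[k] is a loop-invariant broadcast, and Crow[B[k][j]] is
   the gather. */
void evaluate(int N, int M, int K, int D, int L,
              float At[N][K][M],       /* transposed result: At[i][k][j] == A[i][j][k] */
              const int   B[K][M],     /* used both as a value and as an index into C  */
              const float C[D][N][L],  /* L must exceed every value stored in B        */
              const int   d[K],
              const float f[M])
{
    for (int i = 0; i < N; ++i) {
        for (int k = 0; k < K; ++k) {
            const float *Crow = C[d[k]][i];   /* fixed for the whole j loop */
            const float dk    = (float)d[k];  /* broadcast candidate        */
            for (int j = 0; j < M; ++j)
                At[i][k][j] = (float)B[k][j] + Crow[B[k][j]] + dk * f[j];
        }
    }
}
With this ordering, a compiler targeting AVX2 at least has a chance of vectorizing the j loop (the Crow[B[k][j]] access becoming a hardware gather), and the store into At stays contiguous.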
I haven't done much with GPGPU, but they do SIMD differently. Instead of short vectors like CPUs use, they have effectively many scalar processors ganged together. OpenCL or CUDA or whatever other hot new GPGPU tech might handle your gathers much more efficiently.
SIMD commands, like element-wise ops, matvecs, tensordots, etc.
When I think of "SIMD commands", I think of x86 assembly instructions (or ARM NEON, or whatever), or at least C / C++ intrinsics that compile to single instructions. :P
A matrix-vector product is not a single "instruction". If you used that terminology, every function that processes a buffer would be "a SIMD instruction".
The last part of your question seems to be asking for a programming-language independent version of numpy, for gluing together high-performance library functions. Or were you thinking that there might be something that would inter-optimize such operations, so you could write something that would compile to a vectorized loop that did stuff like use each input more than once without having to reload it in separate library calls?
IDK if there's anything like that, other than normal C compiler auto-vectorization of loops over arrays.

Related

Which operations in numpy use SIMD?

Good afternoon!
Currently, I'm digging into the reason why numpy is fast.
More specifically, I'm wondering why np.sum() is that fast.
My guess is that np.sum() uses some kind of SIMD optimization, but I'm not sure whether it does.
Is there any way that I can check which numpy methods use SIMD operations?
Thanks in advance
Numpy does not currently use SIMD instructions for trivial np.sum calls. However, I made this PR which should be merged soon and should fix this issue for integers (it will use the 256-bit AVX2 instruction set if available and the 128-bit SSE/Neon instruction set otherwise). Using SIMD instructions for np.sum with floating-point numbers is a bit harder due to the current algorithm used (pair-wise summation) and because one has to be careful about precision.
Is there any way that I can check which numpy methods use SIMD operations?
Low-level profilers and hardware-counter-based tools (e.g. Linux perf, Intel VTune) can do that, but they are not very user-friendly (i.e. you need some notion of assembly, a rough idea of how processors work, and to read some documentation about hardware counters). Another solution is to look at the disassembled code of Numpy using tools like objdump (requires pretty good knowledge of assembly and the name of the C function called), or simply to look at the Numpy C code (note that compilers can auto-vectorize loops, so this solution is not so simple).
Update: if you are using np.sum on contiguous double-precision Numpy arrays, then the benefit of using SIMD instructions is not so big. Indeed, for large contiguous double-precision arrays not fitting in the cache, a scalar implementation should be able to saturate the memory bandwidth on most PCs (though certainly not on an Apple M1 or on computing servers), especially on high-frequency processors. On small arrays (e.g. <4000 items), Numpy overheads dominate the execution time of such a function. For contiguous medium-sized arrays (e.g. >10K and <1M items), using SIMD instructions should result in a significant speed-up, especially for single-precision arrays (e.g. 3-4 times faster for DP and 6-8 times faster for SP on mainstream machines).

Is SSE redundant or discouraged?

Looking around here and the internet, I can find a lot of posts about modern compilers beating SSE in many real situations, and I have just found, in some code I inherited, that when I disable some SSE code written in 2006 for integer-based image processing and force the code down the standard C branch, it runs faster.
On modern processors with multiple cores and advanced pipelining, etc, does older SSE code underperform gcc -O2?
You have to be careful with microbenchmarks. It's really easy to measure something other than what you thought you were. Microbenchmarks also usually don't account for code size at all, in terms of pressure on the L1 I-cache / uop-cache and branch-predictor entries.
In most cases, microbenchmarks usually have all the branches predicted as well as they can be, while a routine that's called frequently but not in a tight loop might not do as well in practice.
There have been many additions to SSE over the years. A reasonable baseline for new code is SSSE3 (found in Intel Core 2 and later, and AMD Bulldozer and later), as long as there is a scalar fallback. The addition of a fast byte-shuffle (pshufb) is a game-changer for some things. SSE4.1 adds quite a few nice things for integer code, too. If old code doesn't use them, compiler output or new hand-written code could do much better.
Currently we're up to AVX2, which handles two 128b lanes at once, in 256b registers. There are a few 256b shuffle instructions. AVX/AVX2 gives 3-operand (non-destructive dest, src1, src2) versions of all the previous SSE instructions, which helps improve code density even when the two-lane aspect of using 256b ops is a downside (or when targeting AVX1 without AVX2 for integer code).
In a year or two, the first AVX512 desktop hardware will probably be around. That adds a huge amount of powerful features (mask registers, and filling in more gaps in the highly non-orthogonal SSE / AVX instruction set), as well as just wider registers and execution units.
If the old SSE code only gave a marginal speedup over the scalar code back when it was written, or nobody ever benchmarked it, that might be the problem. Compiler advances may lead to the generated code for scalar C beating old SSE that takes a lot of shuffling. Sometimes the cost of shuffling data into vector registers eats up all the speedup of being fast once it's there.
Or depending on your compiler options, the compiler might even be auto-vectorizing. IIRC, gcc -O2 doesn't enable -ftree-vectorize, so you need -O3 for auto-vec.
Another thing that might hold back old SSE code is that it might assume unaligned loads/stores are slow, and use palignr or similar techniques to go between unaligned data in registers and aligned loads/stores. So old code might be tuned for an old microarchitecture in a way that's actually slower on recent ones.
So even without using any instructions that weren't available previously, tuning for a different microarchitecture matters.
Compiler output is rarely optimal, especially if you haven't told it about pointers not aliasing (restrict), or being aligned. But it often manages to run pretty fast. You can often improve it a bit (especially by being more hyperthreading-friendly, with fewer uops/insns to do the same work), but you have to know the microarchitecture you're targeting. E.g. Intel Sandybridge and later can only micro-fuse memory operands with one-register addressing modes. There are more links at the x86 tag wiki.
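As a small illustration of the kind of information the compiler can't discover on its own (the function and names here are just illustrative, not from the original code): restrict promises the arrays don't overlap, and an alignment hint (a GCC/Clang extension) lets the vectorizer skip the unaligned prologue.
#include <stddef.h>

void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float s, size_t n)
{
    /* __builtin_assume_aligned is a GCC/Clang extension; remove for other compilers */
    a   = __builtin_assume_aligned(a, 32);
    b   = __builtin_assume_aligned(b, 32);
    dst = __builtin_assume_aligned(dst, 32);

    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + s * b[i];   /* with -O3 (or -O2 -ftree-vectorize) this should auto-vectorize */
}
Without restrict, the compiler has to either emit a runtime overlap check or fall back to scalar code, because dst could alias a or b.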
So to answer the title, no the SSE instruction set is in no way redundant or discouraged. Using it directly, with asm, is discouraged for casual use (use intrinsics instead). Using intrinsics is discouraged unless you can actually get a speedup over compiler output. If they're tied now, it will be easier for a future compiler to do even better with your scalar code than to do better with your vector intrinsics.
Just to add to Peter's already excellent answer, one fundamental point to consider is that the compiler does not know everything that the programmer knows about the problem domain, and there is in general no easy way for the programmer to express useful constraints and other relevant information that a truly smart compiler might be able to exploit in order to aid vectorization. This can give the programmer a huge advantage in many cases.
For example, for a simple case such as:
// add two arrays of floats
float a[N], b[N], c[N];
for (int i = 0; i < N; ++i)
    a[i] = b[i] + c[i];
any decent compiler should be able to do a reasonably good job of vectorizing this with SSE/AVX/whatever, and there would be little point in implementing this with SIMD intrinsics. Apart from relatively minor concerns such as data alignment, or the likely range of values for N, the compiler-generated code should be close to optimal.
But if you have something less straightforward, e.g.
// map array of 4 bit values to 8 bit values using a LUT
const uint8_t LUT[16] = { 0, 1, 3, 7, 11, 15, 20, 27, ..., 255 };
uint8_t in[N]; // 4 bit input values
uint8_t out[N]; // 8 bit output values
for (int i = 0; i < N; ++i)
    out[i] = LUT[in[i]];
you won't see any auto-vectorization from your compiler, because (a) it doesn't know that you can use PSHUFB to implement a small LUT, and (b) even if it did, it has no way of knowing that the input data is constrained to a 4-bit range. So a programmer could write a simple SSE implementation which would most likely be an order of magnitude faster:
// requires SSSE3: #include <tmmintrin.h> (or <immintrin.h>)
// load the 16-entry LUT into a vector register once, outside the loop
__m128i vLUT = _mm_loadu_si128((const __m128i *)LUT);
for (int i = 0; i < N; i += 16)   // assumes N is a multiple of 16
{
    __m128i vin  = _mm_loadu_si128((const __m128i *)&in[i]);
    __m128i vout = _mm_shuffle_epi8(vLUT, vin);  // 16 table lookups in one instruction
    _mm_storeu_si128((__m128i *)&out[i], vout);
}
Maybe in another 10 years compilers will be smart enough to do this kind of thing, and programming languages will have methods to express everything the programmer knows about the problem, the data, and other relevant constraints, at which point it will probably be time for people like me to consider a new career. But until then there will continue to be a large problem space where a human can still easily beat a compiler with manual SIMD optimisation.
These were two separate and, strictly speaking, unrelated questions:
1) Did SSE in general and SSE-tuned codebases in particular become obsolete / "discouraged" / retired?
Answer in brief: not yet and not really. High-level reason: there is still enough hardware around (even in the HPC domain, where one can easily find Nehalem) that only has SSE* on board, with no AVX* available. If you look outside HPC, then consider for example the Intel Atom CPU, which currently supports only up to SSE4.
2) Why is gcc -O2 (i.e. auto-vectorized code, running on SSE-only hardware) faster than some old (presumably intrinsics-based) SSE implementation written 9 years ago?
Answer: it depends, but first of all things are improving very actively on the compiler side. AFAIK the top 4 x86 compiler dev teams have made big to enormous investments in auto-vectorization and explicit vectorization over the course of the past 9 years. And the reason why they did so is also clear: the SIMD "FLOPs" potential of x86 hardware has been increased (formally) by 8 times (i.e. 8x SSE4 peak flops) over the course of the past 9 years.
Let me ask one more question myself:
3) OK, SSE is not obsolete. But will it be obsolete X years from now?
Answer: who knows, but at least in HPC, with wider adoption of AVX2- and AVX-512-compatible hardware, SSE intrinsics codebases are highly likely to be retired soon enough, although it again depends on what you develop. Some low-level optimized HPC / HPC+media libraries will likely keep highly tuned SSE code paths for a long time.
You might very well see modern compilers use SSE4. But even if they stick to the same ISA, they're often a lot better at scheduling. Keeping SSE units busy means careful management of data streaming.
Cores are irrelevant as each instruction stream (thread) runs on a single core.
Yes -- but mainly in the same sense that writing inline assembly is discouraged.
SSE instructions (and other vector instructions) have been around long enough that compilers now have a good understanding of how to use them to generate efficient code.
You won't do a better job than the compiler unless you have a good idea what you're doing. And even then it often won't be worth the effort spent trying to beat the compiler. And even then your efforts at optimizing for one specific CPU might not result in good code for other CPUs.

Performing matrix operations with complex numbers in C

I'm trying to perform computations involving matrix operations and complex math - sometimes together - in C. I'm very familiar with Matlab and I know these types of computations can be performed simply and efficiently there. For example, two matrices of the same size, A and B, each having complex-valued elements, can be summed easily through the expression A+B. Are there any packages or techniques that can be recommended for programming these types of expressions in C or Objective-C? I am aware of complex.h, which allows for performing operations on complex numbers, but am unaware of how to perform operations on complex matrices, which is what I'm really after. Similarly, I'm aware of packages which allow for operations on matrices, but I don't think they will be useful for working on complex matrices.
You want to use BLAS for basic linear algebra operations, like summing or multiplying two matrices, and LAPACK for more computational intensive algorithms, like factoring matrices.
BLAS routines have funny names that look like alphabet soup. This is because of old Fortran restrictions on the length of the function name. The first letter of the name indicates the data type the BLAS routine operates on. Since you're interested in complex numbers, you want to look at routines beginning with c (for complex single precision) or z (for complex double precision). For example, the BLAS routine to multiply complex matrices A and B is CGEMM or ZGEMM (here GEMM stands for general matrix-matrix multiply).
It looks like in Objective-C, BLAS is available through the Accelerate framework. The naming convention is to prepend cblas_ to the original BLAS name. For example, see the documentation for cblas_zgemm.
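For illustration, here is a small sketch of calling the C interface; the header and link flags vary by platform (<Accelerate/Accelerate.h> on macOS, <cblas.h> with OpenBLAS/ATLAS/MKL elsewhere), and the sizes and fill-in below are placeholders, not from the answer.
#include <complex.h>
#include <cblas.h>          /* or <Accelerate/Accelerate.h> on macOS */

enum { N = 3 };

int main(void)
{
    double complex A[N*N] = {0}, B[N*N] = {0}, C[N*N];
    /* ... fill A and B ... */

    double complex alpha = 1.0, beta = 0.0;
    /* C = alpha*A*B + beta*C, i.e. C = A*B here (ZGEMM = complex double GEMM) */
    cblas_zgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, &alpha, A, N, B, N, &beta, C, N);
    return 0;
}
An element-wise sum A+B can be done with cblas_zaxpy on the flattened arrays (copy B into the result, then add 1.0*A), or simply with a plain loop over the N*N complex values.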
Normally, vendors provide optimized versions of the BLAS for their platform. These routines can often be significantly faster than naive implementations of these matrix operations. Often the peak floating-point performance of a machine can be achieved, or nearly achieved, with these routines. In fact, the LINPACK benchmark (LINPACK was the predecessor to LAPACK) uses these routines to benchmark and rank supercomputers.
You are looking for BLAS or LAPACK. They are linear algebra libraries which you can download and install.

How to optimize MATLAB loops?

I have been working lately on a number of iterative algorithms in MATLAB, and been getting hit hard by MATLAB's performance (or lack thereof) when it comes to loops. I'm aware of the benefit of vectorizing code when possible, but are there any tools for optimization when you need the loop for your algorithm?
I am aware of the MEX-file option to write small subroutines in C/C++, although given my algorithms, this can be a very painful option given the data structures required. I mainly use MATLAB for the simplicity and speed of prototyping, so a syntactically complex, statically typed language is not ideal for my situation.
Are there any other suggestions? Even other languages (python?) which have relatively painless matrix tools are an option.
It was once true that vectorization would improve the speed of your MATLAB code. However, that is largely no longer true with the JIT accelerator.
This video demonstrating the MATLAB profiler might help.
The PROFILER is a very useful tool to find bottlenecks in Matlab code. It does not change your code of course, but it helps to find which functions/lines to optimize with vectorization or MEX.
http://www.mathworks.com/access/helpdesk/help/techdoc/ref/profile.html
http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_env/f9-17018.html
If you have a choice, be sure to set up your loops so you scan the data column-wise, which is how the data in MATLAB are arranged. In addition, be sure to preallocate any output arrays before the loop and index into them instead of growing the array inside the for-loop.
If you can cast your code so your operations are called on the whole matrix then you will see great improvement in the speed of your code. Many functions are much quicker when operating on the whole matrix rather than in an element-wise fashion with loops.
You might want to investigate MATLAB's Parallel Computing Toolbox, which can make a big difference if you have the right hardware. I re-wrote about 12 lines of code and got a 4-6x speedup for one of our loop-intensive programs on an eight-core PC.

How much speed-up from converting 3D maths to SSE or other SIMD?

I am using 3D maths in my application extensively. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or a similar SIMD code?
In my experience I typically see about a 3x improvement in taking an algorithm from x87 to SSE, and a better than 5x improvement in going to VMX/Altivec (because of complicated issues having to do with pipeline depth, scheduling, etc). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not for those where I'm doing one vector at a time ad hoc.
That's not the whole story, but it's possible to get further optimizations using SIMD. Have a look at Miguel's presentation from PDC 2008 about implementing SIMD instructions in Mono.
[Picture from Miguel's blog entry (source: tirania.org)]
For some very rough numbers: I've heard some people on ompf.org claim 10x speed ups for some hand-optimized ray tracing routines. I've also had some good speed ups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.
However, I should add that your algorithms should be designed for data-parallel execution.
This means that if you have a generic math library as you've mentioned then it should take packed vectors rather than individual vectors or you'll just be wasting your time.
E.g. Something like
namespace SIMD {
    class PackedVec4d
    {
        __m128 x;   // four x components, one per vector (struct-of-arrays layout)
        __m128 y;
        __m128 z;
        __m128 w;
        //...
    };
}
Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.
For 3D operations, beware of uninitialized data in your W component. I've seen cases where SSE ops (_mm_add_ps) would take 10x the normal time because of bad data in W.
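A minimal sketch of one way to avoid that (my example, not the commenter's code): when building a 3-component vector in an SSE register, give W an explicit, harmless value instead of whatever happens to be in memory.
#include <xmmintrin.h>

static inline __m128 load_vec3(float x, float y, float z)
{
    /* w = 1.0f rather than leftover garbage; a denormal or NaN in the unused
       lane can make subsequent full-width SSE ops run dramatically slower */
    return _mm_set_ps(1.0f, z, y, x);   /* lanes: [x, y, z, 1.0] */
}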
The answer highly depends on what the library is doing and how it is used.
The gains can range from a few percentage points to "several times faster"; the areas most likely to see gains are those where you're not dealing with isolated vectors or values, but with multiple vectors or values that have to be processed in the same way.
Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors being processed.
The domains where the gains can be the most drastic are probably those of image and signal processing, computational simulations, as well as general 3D maths operations on meshes (rather than isolated vectors).
These days all the good compilers for x86 generate SSE instructions for SP and DP float math by default. It's nearly always faster to use these instructions than the native x87 ones, even for scalar operations, so long as you schedule them correctly. This will come as a surprise to many who in the past found SSE to be "slow", and thought compilers could not generate fast SSE scalar instructions. But now you have to use a switch to turn off SSE generation and use x87. Note that x87 is effectively deprecated at this point and may be removed from future processors entirely. The one downside of this is that we may lose the ability to do 80-bit DP float in a register. But the consensus seems to be that if you are depending on 80-bit instead of 64-bit DP floats for precision, you should look for a more precision-loss-tolerant algorithm.
Everything above came as a complete surprise to me. It's very counter intuitive. But data talks.
Most likely you will see only very small speedup, if any, and the process will be more complicated than expected. For more details see The Ubiquitous SSE vector class article by Fabian Giesen.
The Ubiquitous SSE vector class: Debunking a common myth
Not that important
First and foremost, your vector class is probably not as important for the performance of your program as you think (and if it is, it's more likely because you're doing something wrong than because the computations are inefficient). Don't get me wrong, it's probably going to be one of the most frequently used classes in your whole program, at least when doing 3D graphics. But just because vector operations will be common doesn't automatically mean that they'll dominate the execution time of your program.
(The article's remaining sections are titled "Not so hot", "Not easy", "Not now", and "Not ever".)