super-space-optimized code

There are key self-contained algorithms - particularly cryptography-related such as AES, RSA, SHA1 etc - which you can find many implementations of for free on the internet.
Some are written to be nice and portable clean C.
Some are written to be fast - often with macros, and explicit unrolling.
As far as I can tell, none are trying to be especially super-small - so I'm resigned to writing my own - specifically AES128 decryption and SHA1 for ARM THUMB2. (I've verified this by compiling everything I can find for my target machine with GCC, using -Os, -mthumb and so on.)
What patterns and tricks can I use to do so?
Are there compilers/tools that can roll-up code?

Before optimizing for space (or speed): compilers are pretty clever these days. Have you checked whether a normal, readable implementation of AES128 gets small enough for your needs when you tell the compiler to optimize for size?
Writing your own version of AES128 is perhaps a good educational exercise, but you will certainly be fighting bugs, and cryptography is not the kind of trivial code that falls out of thin air. A faulty or weakened implementation (due to some bug in your code) is pretty much the worst case you can have.
Since you are targeting ARM, and GCC is a common compiler for that platform:
-Os: Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
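For a concrete starting point, here is a hedged example of a size-focused build for an ARM Thumb-2 target (the file names and the -mcpu value are placeholders to adjust for your part; -ffunction-sections/-fdata-sections plus --gc-sections let the linker drop unused code):
arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m4 -ffunction-sections -fdata-sections -c aes128.c
arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m4 -Wl,--gc-sections -o firmware.elf aes128.o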

It depends on what kind of space you are trying to optimise: code or data. There are essentially three variants of AES128 commonly in use, each differing in the amount of precomputed lookup table space.
The fastest version uses 4KB of data, arranged as four 32-bit x 256-entry lookup tables (commonly called T-tables). If you can afford that amount of data space, then the only instructions in this version are the loads and the EORs that combine the table results, and these roll up into a very small piece of code.
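As a rough sketch of why this variant rolls up so tightly, here is one column of one round in the usual T-table formulation (the table, function, and variable names are mine, following the common reference layout; decryption has the same shape using the inverse tables):
#include <stdint.h>

extern const uint32_t Te0[256], Te1[256], Te2[256], Te3[256]; /* precomputed T-tables */

/* One output column: four table loads XORed together with a round-key word. */
static uint32_t ttable_column(uint32_t s0, uint32_t s1, uint32_t s2, uint32_t s3, uint32_t rk)
{
    return Te0[(s0 >> 24) & 0xff] ^
           Te1[(s1 >> 16) & 0xff] ^
           Te2[(s2 >>  8) & 0xff] ^
           Te3[ s3        & 0xff] ^ rk;
}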
The intermediate version uses an 8-bit x 256-entry lookup table to encode the SBox. The residual instructions need to implement the ShiftRows and MixColumns steps, so the code size is larger.
The smallest (data-size) version doesn't use any lookup tables at all, but needs to compute all of the individual AES-field operations including the inversion. This will use the most instructions, even if you fold both the field-multiply and inversion into subroutines.
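For the table-free variant, the core building blocks are the GF(2^8) field helpers. A minimal sketch (function names are mine): xtime() multiplies by x modulo the AES polynomial, a general multiply folds on top of it, and inversion can then be expressed as exponentiation, since a^254 = a^-1 in GF(2^8).
#include <stdint.h>

/* Multiply by x (0x02) in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1 (0x11b). */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* General GF(2^8) multiply built from shifts and conditional XORs. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}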


Is FFTW significantly better than GSL for real transform calculations?

I'm unable to get FFTW to link to my code so that I can use its functions. I have spent enough time on this that I am considering giving up on it.
I am very familiar with GSL, and have used the linear algebra libraries extensively with good results. GSL also has a set of FFT functions that seem to do the same things as FFTW. Are they just as good? Or is FFTW significantly better, and worth spending more time to try to get it to work?
(By the way, my problem is that, using g++ on a remote system on which I am not the admin, I cannot get my code to link against the FFTW calls it references. My makefile includes -L/libdirectory -lfftw3, but I still get undefined references for some (not all) fftw functions.)
Here is the source:
#include "fftw3.h"
in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * length);
Here is the relevant compile command:
g++ -std=c++0x -fPIC ... -lm ... -L/libdirectory -lfftw3
Here is the error:
/source.cc: undefined reference to 'fftw_malloc'
Note that the compiler is able to find fftw3.h. I also can declare objects such as fftw_complex and fftw_plan.
EDIT: I still can't get my Makefile to link the static library. However, I was able to recompile with shared libraries and those seem to have worked so far. I still would like to see some benchmarks newer than 11 years old, though!
You didn't mention what you would consider "significantly better", which could mean a variety of things, from speed and accuracy to ease of use, maintenance, and licensing. But I assume you are primarily interested in speed and accuracy comparisons.
For the speed aspect, the reference section of GNU GSL documentation mentions:
For large-scale FFT work we recommend the use of the dedicated FFTW library by Frigo and Johnson. The FFTW library is self-optimizing—it automatically tunes itself for each hardware platform in order to achieve maximum performance.
So by the GSL developers' own admission, FFTW is expected to outperform GSL. How much so? You can have a look at this speed benchmark from FFTW, which suggests that GSL is about 3-4 times slower than FFTW 3. Note that this benchmark wasn't done with g++ (and there doesn't seem to be another readily available benchmark on FFTW's site that uses the GCC compilers and includes GSL), and it was quite likely run on a machine different from yours, so your own results may vary. On the accuracy front, this accuracy benchmark from FFTW suggests that they have similar accuracy in most cases (with FFTW being slightly more accurate), but that GSL tends to exhibit accuracy degradation for real data and larger transform sizes.
For the sake of completeness, I'll briefly mention that as far as licensing goes, both are available under the GNU GPL, but FFTW also offers a non-free license, which could be considered better by someone for whom the GNU GPL is problematic. Otherwise, for ease of use and maintenance, they are both actively developed and offer different but similarly complex APIs. So for those aspects, preference for one library over the other may be based on factors other than the FFT implementations' merits.

Is SSE redundant or discouraged?

Looking around here and on the internet, I can find a lot of posts about modern compilers beating SSE in many real situations. I have also just found, in some code I inherited, that when I disable the SSE code written in 2006 for integer-based image processing and force the code down the standard C branch, it runs faster.
On modern processors with multiple cores and advanced pipelining, etc, does older SSE code underperform gcc -O2?
You have to be careful with microbenchmarks. It's really easy to measure something other than what you thought you were. Microbenchmarks also usually don't account for code size at all, in terms of pressure on the L1 I-cache / uop-cache and branch-predictor entries.
In addition, microbenchmarks usually have all the branches predicted as well as they can be, while a routine that's called frequently but not in a tight loop might not do as well in practice.
There have been many additions to SSE over the years. A reasonable baseline for new code is SSSE3 (found in Intel Core2 and later, and AMD Bulldozer and later), as long as there is a scalar fallback. The addition of a fast byte-shuffle (pshufb) is a game-changer for some things. SSE4.1 adds quite a few nice things for integer code, too. If old code doesn't use it, compiler output, or new hand-written code, could do much better.
Currently we're up to AVX2, which handles two 128b lanes at once, in 256b registers. There are a few 256b shuffle instructions. AVX/AVX2 gives 3-operand (non-destructive dest, src1, src2) versions of all the previous SSE instructions, which helps improve code density even when the two-lane aspect of using 256b ops is a downside (or when targeting AVX1 without AVX2 for integer code).
In a year or two, the first AVX512 desktop hardware will probably be around. That adds a huge amount of powerful features (mask registers, and filling in more gaps in the highly non-orthogonal SSE / AVX instruction set), as well as just wider registers and execution units.
If the old SSE code only gave a marginal speedup over the scalar code back when it was written, or nobody ever benchmarked it, that might be the problem. Compiler advances may lead to the generated code for scalar C beating old SSE that takes a lot of shuffling. Sometimes the cost of shuffling data into vector registers eats up all the speedup of being fast once it's there.
Or depending on your compiler options, the compiler might even be auto-vectorizing. IIRC, gcc -O2 doesn't enable -ftree-vectorize, so you need -O3 for auto-vec.
Another thing that might hold back old SSE code is that it might assume unaligned loads/stores are slow, and used palignr or similar techniques to go between unaligned data in registers and aligned loads/stores. So old code might be tuned for an old microarch in a way that's actually slower on recent ones.
So even without using any instructions that weren't available previously, tuning for a different microarchitecture matters.
Compiler output is rarely optimal, esp. if you haven't told it about pointers not aliasing (restrict), or being aligned. But it often manages to run pretty fast. You can often improve it a bit (esp. for being more hyperthreading-friendly by having fewer uops/insns to do the same work), but you have to know the microarchitecture you're targeting. E.g. Intel Sandybridge and later can only micro-fuse memory operands with one-register addressing mode. Other links at the x86 wiki.
So to answer the title, no the SSE instruction set is in no way redundant or discouraged. Using it directly, with asm, is discouraged for casual use (use intrinsics instead). Using intrinsics is discouraged unless you can actually get a speedup over compiler output. If they're tied now, it will be easier for a future compiler to do even better with your scalar code than to do better with your vector intrinsics.
Just to add to Peter's already excellent answer, one fundamental point to consider is that the compiler does not know everything that the programmer knows about the problem domain, and there is in general no easy way for the programmer to express useful constraints and other relevant information that a truly smart compiler might be able to exploit in order to aid vectorization. This can give the programmer a huge advantage in many cases.
For example, for a simple case such as:
// add two arrays of floats
float a[N], b[N], c[N];
for (int i = 0; i < N; ++i)
a[i] = b[i] + c[i];
any decent compiler should be able to do a reasonably good job of vectorizing this with SSE/AVX/whatever, and there would be little point in implementing this with SIMD intrinsics. Apart from relatively minor concerns such as data alignment, or the likely range of values for N, the compiler-generated code should be close to optimal.
But if you have something less straightforward, e.g.
// map array of 4 bit values to 8 bit values using a LUT
const uint8_t LUT[16] = { 0, 1, 3, 7, 11, 15, 20, 27, ..., 255 };
uint8_t in[N]; // 4 bit input values
uint8_t out[N]; // 8 bit output values
for (int i = 0; i < N; ++i)
out[i] = LUT[in[i]];
you won't see any auto-vectorization from your compiler, because (a) it doesn't know that you can use PSHUFB to implement a small LUT, and (b) even if it did, it has no way of knowing that the input data is constrained to a 4-bit range. So a programmer could write a simple SSE implementation which would most likely be an order of magnitude faster:
__m128i vLUT = _mm_loadu_si128((const __m128i *)LUT);
for (int i = 0; i < N; i += 16)   // note: assumes N is a multiple of 16
{
    __m128i v_in  = _mm_loadu_si128((const __m128i *)&in[i]);  // load 16 x 4-bit indices
    __m128i v_out = _mm_shuffle_epi8(v_in, vLUT);              // 16 parallel LUT lookups (PSHUFB, needs SSSE3)
    _mm_storeu_si128((__m128i *)&out[i], v_out);               // store 16 x 8-bit results
}
Maybe in another 10 years compilers will be smart enough to do this kind of thing, and programming languages will have methods to express everything the programmer knows about the problem, the data, and other relevant constraints, at which point it will probably be time for people like me to consider a new career. But until then there will continue to be a large problem space where a human can still easily beat a compiler with manual SIMD optimisation.
These were two separate and strictly speaking unrelated questions:
1) Did SSE in general and SSE-tuned codebases in particular become obsolete / "discouraged" / retired?
Answer in brief: not yet and not really. The high-level reason: there is still enough hardware around (even in the HPC domain, where one can easily find Nehalem) that only has SSE* on board, with no AVX* available. If you look outside HPC, then consider for example the Intel Atom CPU, which currently supports only up to SSE4.
2) Why is gcc -O2 output (i.e. auto-vectorized code, running on SSE-only hardware) faster than some old (presumably intrinsics-based) SSE implementation written 9 years ago?
Answer: it depends, but first of all, things are improving very actively on the compiler side. AFAIK the top 4 x86 compiler dev teams have made big to enormous investments in auto-vectorization and explicit vectorization over the past 9 years. And the reason why they did so is also clear: the SIMD "FLOPs" potential of x86 hardware has (formally) increased by 8 times (i.e. 8x SSE4 peak FLOPs) over the same period.
Let me ask one more question myself:
3) OK, SSE is not obsolete. But will it be obsolete in X years from now?
Answer: who knows, but at least in HPC, with wider adoption of AVX-2 and AVX-512 compatible hardware, SSE intrinsics codebases are highly likely to retire soon enough, although it again depends on what you develop. Some low-level optimized HPC and media libraries will likely keep highly tuned SSE code paths for a long time.
You might very well see modern compilers use SSE4. But even if they stick to the same ISA, they're often a lot better at scheduling. Keeping SSE units busy means careful management of data streaming.
Cores are irrelevant as each instruction stream (thread) runs on a single core.
Yes -- but mainly in the same sense that writing inline assembly is discouraged.
SSE instructions (and other vector instructions) have been around long enough that compilers now have a good understanding of how to use them to generate efficient code.
You won't do a better job than the compiler unless you have a good idea of what you're doing. And even then it often won't be worth the effort spent trying to beat the compiler. And even then, your efforts at optimizing for one specific CPU might not result in good code for other CPUs.

Is it possible to optimize a compiled binary?

This is more of a curiosity I suppose, but I was wondering whether it is possible to apply compiler optimizations post-compilation. Are most optimization techniques highly-dependent on the IR, or can assembly be translated back and forth fairly easily?
This has been done, though I don't know of many standard tools that do it.
This paper describes an optimizer for Compaq Alpha processors that works after linking has already been done and some of the challenges they faced in writing it.
If you strain the definition a bit, you can use profile-guided optimization to instrument a binary and then rewrite it based on its observable behaviors with regards to cache misses, page faults, etc.
There's also been some work in dynamic translation, in which you run an existing binary in an interpreter and use standard dynamic compilation techniques to try to speed this up. Here's one paper that details this.
Hope this helps!
There's been some recent research interest in this space. Alex Aiken's STOKE project is doing exactly this with some pretty impressive results. In one example, their optimizer found a function that is twice as fast as gcc -O3 for the Montgomery Multiplication step in OpenSSL's RSA library. It applies these optimizations to already-compiled ELF binaries.
Here is a link to the paper.
Some compiler backends have a peephole optimizer which basically does just that: before committing to the assembly that represents the IR, it gets one last opportunity to optimize small windows of instructions.
Basically, you would want to do the same thing starting from the binary: machine code to machine code. It is not the same tool, but it is the same kind of process: examine some block of code and optimize it.
The problem you will run into, though, is that, for example, you may have had some variables marked volatile in C, so they are used very inefficiently in the binary; the optimizer won't know the programmer's intent there and could end up optimizing those accesses away.
You could certainly translate this back to IR and forward again; nothing stops you from doing that.

GCC optimization flags for matrix/vector operations

I am performing matrix operations using C. I would like to know the various compiler optimization flags that improve the speed of execution of these matrix operations - such as multiplication and inversion - for double and int64 data. I am not looking for hand-optimized code; I just want to make the native code faster using compiler flags and learn more about these flags.
The flags that I have found so far which improve matrix code.
-O3/O4
-funroll-loops
-ffast-math
First of all, I don't recommend using -ffast-math for the following reasons:
1. It has been proved that performance actually degrades when using this option in most (if not all) cases, so "fast math" is not actually that fast.
2. This option breaks strict IEEE compliance for floating-point operations, which ultimately results in accumulation of computational errors of an unpredictable nature. You may well get different results in different environments, and the difference may be substantial. The term environment (in this case) means the combination of hardware, OS, and compiler, so the variety of situations in which you can get unexpected results grows exponentially.
3. Another sad consequence is that programs which link against a library built with this option might expect correct (IEEE-compliant) floating-point math, and this is where their expectations break, but it will be very tough to figure out why.
Finally, have a look at this article.
For the same reasons you should avoid -Ofast (as it includes the evil -ffast-math). Extract:
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
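To make the IEEE-compliance point concrete, here is a small, hedged example (not from the original answer) of the kind of result change -ffast-math is allowed to introduce by reassociating floating-point expressions:
#include <stdio.h>

int main(void)
{
    volatile double big = 1e16, small = 1.0;  /* volatile keeps the values out of constant folding */
    double r = (big + small) - big;
    /* Under strict IEEE evaluation r is 0.0, because 1e16 + 1.0 rounds back to 1e16 in double.
       -ffast-math permits reassociation to (big - big) + small, which would give 1.0,
       so the same source may print different results depending on flags. */
    printf("%g\n", r);
    return 0;
}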
There is no such flag as -O4. At least I'm not aware of that one, and there is no trace of it in the official GCC documentation. So the maximum in this regard is -O3 and you should be definitely using it, not only to optimize math, but in release builds in general.
-funroll-loops is a very good choice for math routines, especially involving vector/matrix operations where the size of the loop can be deduced at compile-time (and as a result unrolled by the compiler).
I can recommend 2 more flags: -march=native and -mfpmath=sse. Similarly to -O3, -march=native is good in general for release builds of any software and not only math intensive. -mfpmath=sse enables use of XMM registers in floating point instructions (instead of stack in x87 mode).
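Putting the flags above together, a hedged example invocation might look like this (matrix_bench.c is a placeholder file name; on x86-64, -mfpmath=sse is already the default and mainly matters for 32-bit builds):
gcc -O3 -funroll-loops -march=native -mfpmath=sse -o matrix_bench matrix_bench.c -lm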
Furthermore, I'd like to say that it's a pity that you don't want to modify your code to get better performance as this is the main source of speedup for vector/matrix routines. Thanks to SIMD, SSE Intrinsics, and Vectorization, the heavy-linear-algebra code can be orders of magnitude faster than without them. However, proper application of these techniques requires in-depth knowledge of their internals and quite some time/effort to modify (actually rewrite) the code.
Nevertheless, there is one option that could be suitable in your case. GCC offers auto-vectorization, which can be enabled with -ftree-vectorize, though that is unnecessary here since -O3 already includes it. The point is that you should still help GCC a little to understand which code can be auto-vectorized. The modifications are usually minor (if needed at all), but you have to make yourself familiar with them. So see the Vectorizable Loops section in the link above.
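As a minimal sketch of the kind of loop GCC's auto-vectorizer handles well (the function and parameter names are illustrative, not from the question): restrict-qualified pointers tell the compiler the arrays don't alias, and a simple counted, unit-stride loop gives it a clean pattern to vectorize.
#include <stddef.h>

/* y[i] += a * x[i]; restrict promises the compiler that x and y do not overlap. */
void scaled_add(double * restrict y, const double * restrict x, double a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
Compiling with -O3 (and, on recent GCC versions, adding -fopt-info-vec) should report this loop as vectorized.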
Finally, I recommend looking into Eigen, a C++ template-based library with highly efficient implementations of the most common linear algebra routines. It utilizes all the techniques mentioned here so far in a very clever way. The interface is purely object-oriented, neat, and pleasant to use. The object-oriented approach is very natural for linear algebra, as it usually manipulates objects such as matrices, vectors, quaternions, rotations, filters, and so on. As a result, when programming with Eigen, you never have to deal with low-level concepts (such as SSE or vectorization) yourself; you just enjoy solving your specific problem.

Consistent behaviour of float code with GCC

I do some numerical computing, and I have often had problems with floating-point computations when using GCC. For my current purpose, I don't care too much about the real precision of the results, but I want this firm property:
no matter WHERE the SAME code is in my program, when it is run on the SAME inputs, I want it to give the SAME outputs.
How can I force GCC to do this? And specifically, what is the behavior of -ffast-math and the different -O optimization levels?
I've heard that GCC might try to be clever, and sometimes load floats in registers, and sometime read them directly from memory, and that this might change the precision of the floats, resulting in a different output. How can I avoid this?
Again, I want :
my computations to be fast
my computations to be reliable (ie. same input -> same result)
I don't care that much about the precision for this particular code, so I can be fine with reduced precision if this brings reliability
could anyone tell me what is the way to go for this problem?
If your targets include x86 processors, using the switch that makes gcc use SSE2 instructions (instead of the historical stack-based ones) will make these run more like the others.
If your targets include PowerPC processors, using the switch that makes gcc not use the fmadd instruction (to replace a multiplication followed by an addition in the source code) will make these run more like the others.
Do not use -ffast-math: it allows the compiler to take some shortcuts, and this will cause differences between architectures. GCC is more standards-compliant, and therefore more predictable, without this option.
Including your own math functions (exp, sin, ...) in your application instead of relying on those from the system's library can only help with predictability.
And lastly, even when the compiler does rigorously respect the standard (I mean C99 here), there may be some differences, because C99 allows intermediate results to be computed with a higher precision than required by the type of the expression. If you really want the program always to give the same results, write three-address code. Or, use only the maximum precision available for all computations, which would be double if you can avoid the historical x86 instructions. In any case do not use lower-precision floats in an attempt to improve predictability: the effect would be the opposite, as per the above clause in the standard.
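As a small sketch of the three-address-code suggestion (the function name is mine): every intermediate result is assigned to a named double, so with GCC in a standards-conforming mode (where -fexcess-precision=standard applies) or with SSE math, each intermediate is rounded to double at that point instead of being carried in a wider x87 register.
/* Dot product written in three-address style: one operation per statement,
   each result stored into a double so excess precision is discarded deterministically. */
double dot3(const double a[3], const double b[3])
{
    double p0 = a[0] * b[0];
    double p1 = a[1] * b[1];
    double p2 = a[2] * b[2];
    double s01 = p0 + p1;
    double s = s01 + p2;
    return s;
}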
I think that GCC is pretty well documented so I'm not going to reveal my own ignorance by trying to answer the parts of your question about its options and their effects. I would, though, make the general statement that when numeric precision and performance are concerned, it pays big dividends to read the manual. The clever people who work on GCC put a lot of effort into their documentation, reading it is rewarding (OK, it can be a trifle dull, but heck, it's a compiler manual not a bodice-ripper).
If it is important to you that you get identical-to-the-last-bit numeric results, you'll have to concern yourself with more than just GCC and how you can control its behaviour. You'll need to lock down the libraries it calls, the hardware it runs on, and probably a number of other factors I haven't thought of yet. In the worst (?) case you may even want to, and I've seen this done, write your own implementations of f-p maths to guarantee bit-identity across platforms. This is difficult, and therefore expensive, and leaves you possibly less certain of the correctness of your own code than of the code used by GCC.
However, you write
I don't care that much about the precision for this particular code, so I can be fine with reduced precision if this brings reliability
which prompts a question back to you -- why don't you simply use 5-decimal-digit precision as your standard of (reduced) precision? It's what an awful lot of us in numerical computing do all the time; we ignore the finer aspects of numerical analysis since they are difficult, and costly in computation time, to circumvent. I'm thinking of things like interval arithmetic and high-precision maths. (Of course, if 5 is not right for you, choose another single-digit number.)
But the good news is that this is entirely justifiable: we're dealing with scientific data which, by its nature, comes with errors attached (of course we generally don't know what the errors are but that's another matter) so it's OK to disregard the last few digits in the decimal representation of, say, a 64-bit f-p number. Go right ahead and ignore a few more of them. Even better, it doesn't matter how many bits your f-p numbers have, you will always lose some precision doing numerical calculations on computers; adding more bits just pushes the errors back, both towards the least-significant-bits and towards the end of long-running computations.
The case you have to watch out for is where you have such a poor algorithm, or a poor implementation of an algorithm, that it loses lots of precision quickly. This usually shows up with any reasonable size of f-p number. Your test suite should have exposed this if it is a real problem for you.
To conclude: you have to deal with loss of precision in some way and it's not necessarily wrong to brush the finer details under the carpet.