I'm unable to get FFTW to link to my code so that I can use its functions. I have spent enough time on this that I am considering giving up on it.
I am very familiar with GSL, and have used the linear algebra libraries extensively with good results. GSL also has a set of FFT functions that seem to do the same things as FFTW. Are they just as good? Or is FFTW significantly better, and worth spending more time to try to get it to work?
(By the way, the error occurs when using g++ on a remote system on which I am not the admin: I am unable to get my code to link against the FFTW calls it references. My makefile includes -L/libdirectory -lfftw3, but I still get undefined references for some (not all) fftw functions.)
Here is the source:
#include "fftw3.h"
fftw_complex *in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * length);
Here is the relevant compile command:
g++ -std=c++0x -fPIC ... -lm ... -L/libdirectory -lfftw3
Here is the error:
/source.cc: undefined reference to 'fftw_malloc'
Note that the compiler is able to find fftw3.h. I also can declare objects such as fftw_complex and fftw_plan.
EDIT: I still can't get my Makefile to link the static library. However, I was able to recompile with shared libraries and those seem to have worked so far. I still would like to see some benchmarks newer than 11 years old, though!
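For reference, here is a minimal stand-alone test case (just a sketch, with an arbitrary length) that exercises the same calls; if this links cleanly with the same -L/-lfftw3 arguments placed after the source file, the problem is in my Makefile rather than the installation:
#include <fftw3.h>
int main() {
    const int length = 1024;
    // Allocate aligned buffers with FFTW's allocator.
    fftw_complex *in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * length);
    fftw_complex *out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * length);
    for (int i = 0; i < length; ++i) { in[i][0] = 0.0; in[i][1] = 0.0; }
    // Plan and execute a forward 1-D complex DFT.
    fftw_plan p = fftw_plan_dft_1d(length, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}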
You didn't mention what you would consider "significantly better", which could mean a variety of things: speed, accuracy, ease of use, maintenance, licensing, etc. But I assume you are primarily interested in speed and accuracy comparisons.
For the speed aspect, the reference section of GNU GSL documentation mentions:
For large-scale FFT work we recommend the use of the dedicated FFTW library by Frigo and Johnson. The FFTW library is self-optimizing—it automatically tunes itself for each hardware platform in order to achieve maximum performance.
So according to the GSL developers' own admission, FFTW is expected to outperform GSL. How much so? You can have a look at this speed benchmark from FFTW, which suggests that GSL is about 3-4 times slower than FFTW 3. Note that this benchmark wasn't done with g++ (and there doesn't seem to be another readily available benchmark on FFTW's site for the gcc compilers that includes GSL), and it was quite likely run on a machine with a different configuration than yours, so your own results may vary. On the accuracy front, this accuracy benchmark from FFTW suggests that they have similar accuracy in most cases (with FFTW being slightly more accurate), but that GSL tends to exhibit accuracy degradation for real data and larger transform sizes.
For the sake of completeness I'll briefly mention that as far as licensing goes, both are offered under the GNU GPL, but FFTW also offers a non-free license, which could be considered better by someone for whom the GNU GPL is problematic. Otherwise, for ease of use and maintenance, they are both actively developed and offer different but similarly complex APIs. So for those aspects, preference for one library over the other may be based on factors other than the merits of the FFT implementations themselves.
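To give a rough feel for the API difference, here is a minimal sketch (my own helper names, error checking omitted) of an out-of-place forward complex DFT in FFTW 3 next to the equivalent in-place transform in GSL:
#include <fftw3.h>
#include <gsl/gsl_fft_complex.h>

// FFTW 3: plan once, then execute; in/out are arrays of n fftw_complex.
void forward_fftw(fftw_complex *in, fftw_complex *out, int n) {
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);
}

// GSL: mixed-radix transform on 2*n interleaved (re, im) doubles, in place.
void forward_gsl(double *data, size_t n) {
    gsl_fft_complex_wavetable *wt = gsl_fft_complex_wavetable_alloc(n);
    gsl_fft_complex_workspace *ws = gsl_fft_complex_workspace_alloc(n);
    gsl_fft_complex_forward(data, 1, n, wt, ws);
    gsl_fft_complex_workspace_free(ws);
    gsl_fft_complex_wavetable_free(wt);
}
In real code you would of course cache the FFTW plan and the GSL wavetable/workspace instead of rebuilding them for every transform.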
Related
This is more of a curiosity I suppose, but I was wondering whether it is possible to apply compiler optimizations post-compilation. Are most optimization techniques highly-dependent on the IR, or can assembly be translated back and forth fairly easily?
This has been done, though I don't know of many standard tools that do it.
This paper describes an optimizer for Compaq Alpha processors that works after linking has already been done and some of the challenges they faced in writing it.
If you strain the definition a bit, you can use profile-guided optimization to instrument a binary and then rewrite it based on its observable behaviors with regards to cache misses, page faults, etc.
There's also been some work in dynamic translation, in which you run an existing binary in an interpreter and use standard dynamic compilation techniques to try to speed this up. Here's one paper that details this.
Hope this helps!
There's been some recent research interest in this space. Alex Aiken's STOKE project is doing exactly this with some pretty impressive results. In one example, their optimizer found a function that is twice as fast as gcc -O3 for the Montgomery Multiplication step in OpenSSL's RSA library. It applies these optimizations to already-compiled ELF binaries.
Here is a link to the paper.
Some compiler backends have a peephole optimizer which basically does just that: before it commits to the assembly that represents the IR, it has a small opportunity to optimize.
Basically you would want to do the same thing from the binary, machine code to machine code: not the same tool, but the same kind of process. Examine a block of code of some size and optimize it.
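As a toy illustration of the idea (a made-up instruction representation, not any real tool), a peephole pass just slides over the instruction stream and drops or rewrites known wasteful patterns:
#include <string>
#include <vector>

struct Insn { std::string op, dst, src; };   // toy machine instruction

// Drop "mov rX, rX" no-ops and "add rX, #0" instructions that do nothing useful.
std::vector<Insn> peephole(const std::vector<Insn>& code) {
    std::vector<Insn> out;
    for (const Insn& i : code) {
        if (i.op == "mov" && i.dst == i.src) continue;
        if (i.op == "add" && i.src == "0")   continue;
        out.push_back(i);
    }
    return out;
}
A binary-to-binary optimizer does the same kind of thing, except it first has to decode the machine code and must be far more conservative about what it can prove (which is exactly the volatile problem mentioned below).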
The problem you will run into, though, is that, for example, some variables may have been marked volatile in C and are therefore used very inefficiently in the binary; the optimizer won't know the programmer's intent there and could end up optimizing those accesses away.
You could certainly take this back to IR and forward again; nothing stops you from doing that.
I am performing matrix operations using C. I would like to know what the various compiler optimization flags are to improve the speed of execution of these matrix operations for double and int64 data, such as multiplication, inverse, etc. I am not looking for hand-optimized code; I just want to make the native code faster using compiler flags and learn more about these flags.
These are the flags I have found so far that improve matrix code:
-O3/O4
-funroll-loops
-ffast-math
First of all, I don't recommend using -ffast-math for the following reasons:
In practice, performance has often been observed to degrade when using this option, so "fast math" is not actually that fast.
This option breaks strict IEEE compliance for floating-point operations, which ultimately results in the accumulation of computational errors of an unpredictable nature. You may well get different results in different environments, and the difference may be substantial. "Environment" here means the combination of hardware, OS and compiler, so the number of situations in which you can get unexpected results grows combinatorially.
Another sad consequence is that programs which link against a library built with this option may expect correct (IEEE-compliant) floating-point math; that expectation is silently broken, and it will be very tough to figure out why.
Finally, have a look at this article.
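To make the IEEE-compliance point concrete: -ffast-math lets the compiler reassociate floating-point operations, and floating-point addition is not associative, so the bit-exact result can change between builds. A tiny illustration of the underlying effect (independent of any particular flag):
#include <cstdio>

int main() {
    double big = 1e16, tiny = 1.0;
    // Both groupings are mathematically 1e16 + 2, but rounding differs:
    double a = (big + tiny) + tiny;   // tiny is absorbed twice -> 1e16
    double b = big + (tiny + tiny);   // -> 1e16 + 2
    std::printf("%.1f\n%.1f\n", a, b);
    return 0;
}
Under -ffast-math the compiler is allowed to pick either grouping (and apply similar rewrites) as it sees fit, which is why results can differ across compilers, flags and machines.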
For the same reasons you should avoid -Ofast (as it includes the evil -ffast-math). Here is the relevant extract from the GCC documentation:
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
There is no such flag as -O4. At least I'm not aware of one, and there is no trace of it in the official GCC documentation. So the maximum in this regard is -O3, and you should definitely be using it, not only to optimize math but in release builds in general.
-funroll-loops is a very good choice for math routines, especially involving vector/matrix operations where the size of the loop can be deduced at compile-time (and as a result unrolled by the compiler).
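For instance, in a hypothetical fixed-size routine like the one below, both trip counts are compile-time constants, so the compiler can unroll the loops completely and keep everything in registers:
// 4x4 matrix-vector product: trip counts known at compile time.
void matvec4(const double A[4][4], const double x[4], double y[4]) {
    for (int i = 0; i < 4; ++i) {
        double sum = 0.0;
        for (int j = 0; j < 4; ++j)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }
}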
I can recommend 2 more flags: -march=native and -mfpmath=sse. Like -O3, -march=native is good in general for release builds of any software, not only math-intensive code. -mfpmath=sse enables the use of XMM registers in floating-point instructions (instead of the x87 register stack).
Furthermore, I'd like to say that it's a pity that you don't want to modify your code to get better performance, as this is the main source of speedup for vector/matrix routines. Thanks to SIMD, SSE intrinsics, and vectorization, heavy linear-algebra code can be orders of magnitude faster than without them. However, proper application of these techniques requires in-depth knowledge of their internals and quite some time/effort to modify (actually rewrite) the code.
Nevertheless, there is one option that could be suitable in your case. GCC offers auto-vectorization, which can be enabled by -ftree-vectorize, but you don't need to pass it explicitly since you are already using -O3 (which includes -ftree-vectorize). The point is that you should still help GCC a little bit to understand which code can be auto-vectorized. The modifications are usually minor (if needed at all), but you have to familiarize yourself with them; see the Vectorizable Loops section in the link above.
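As a small example of the kind of help meant here (my own illustration, not from the GCC docs): unit-stride loops over arrays that provably don't overlap are easy targets for the vectorizer, and the GCC-specific __restrict__ qualifier is one way to state the no-overlap guarantee:
// With -O3 (which implies -ftree-vectorize) GCC can turn this into SIMD code,
// because __restrict__ promises that x and y never alias.
void axpy(double* __restrict__ y, const double* __restrict__ x,
          double a, int n) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}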
Finally, I recommend you look into Eigen, a C++ template-based library which has highly efficient implementations of the most common linear algebra routines. It utilizes all the techniques mentioned here so far in a very clever way. The interface is purely object-oriented, neat, and pleasant to use. The object-oriented approach is a natural fit for linear algebra, since it mostly manipulates well-defined objects such as matrices, vectors, quaternions, rotations, filters, and so on. As a result, when programming with Eigen, you never have to deal with low-level concepts (such as SSE or vectorization) yourself; you just enjoy solving your specific problem.
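A minimal sketch of what that looks like in practice (sizes chosen arbitrarily):
#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(100, 100);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(100, 100);

    Eigen::MatrixXd C    = A * B;          // matrix multiplication
    Eigen::MatrixXd Ainv = A.inverse();    // dense inverse

    std::cout << (A * Ainv).topLeftCorner(3, 3) << "\n";  // ~identity block
    return 0;
}
Eigen's expression templates take care of blocking and vectorization internally, so you get most of the SIMD benefit without writing any intrinsics yourself.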
I upgraded to OSX 10.7 Lion this weekend, and now I'm trying to get all my unit and regression tests to pass... but there are quite a few problems. Several of my regression tests are now producing numerical results which differ (e.g. in the 3rd decimal place). This is a surprise because I get consistent results between OSX 10.6 and Linux, and for various compilers (because we already apply some tricks to keep the numerics stable enough to be comparable)... But it seems that OSX 10.7 is producing significantly different results. We could raise the threshold of course to get all these tests to pass, but I'd rather not because that weakens the test.
By default "g++" is now aliased to "llvm-g++-4.2". Can somebody explain to me what kinds of differences to expect in my results for g++ vs llvm? If I want to preserve my regression results, do I basically have to choose between llvm and -ffast-math?
Basic floating-point computation shouldn't be substantially different between llvm-gcc-4.2 and gcc-4.2 in the normal case; a basic floating-point operation will generate a functionally identical code sequence with llvm-gcc and gcc-4.2 (assuming default compiler flags).
You're mentioning -ffast-math; LLVM generally does relatively few additional optimizations when -ffast-math is used. That could cause substantial differences if you're depending on the compiler to do certain transformations, I guess.
Beyond that, it's really hard to say without an actual testcase.
I was wondering, is it possible to use Artificial Intelligence to make compilers better?
Things I could imagine if it was possible -
More specific error messages
Improving compiler optimizations, so the compiler could actually understand what you're trying to do, and do it better
If it is possible, are there any research projects on this subject?
You should look at MILEPOST GCC -
MILEPOST GCC is the first practical attempt to build a machine learning enabled open-source self-tuning production (and research) compiler that can adapt to any architecture using iterative feedback-directed compilation, machine learning and collective optimization.
An optimizing compiler is actually a very complex expert system, and expert systems are one of the oldest branches of artificial intelligence.
Are you referring to something like Genetic Programming?
http://en.wikipedia.org/wiki/Genetic_programming
This is indeed a field being researched. Look at the MILEPOST branch of GCC, which relies on profile-guided optimization and machine learning. The recent scientific literature on compilers is full of papers using a combination of data mining, machine learning (through genetic algorithms or neural networks), and more "classical" pattern recognition of certain code patterns.
There are key self-contained algorithms - particularly cryptography-related such as AES, RSA, SHA1 etc - which you can find many implementations of for free on the internet.
Some are written to be nice and portable clean C.
Some are written to be fast - often with macros, and explicit unrolling.
As far as I can tell, none are trying to be especially super-small, so I'm resigned to writing my own, specifically AES128 decryption and SHA1 for ARM THUMB2. (I've verified this by compiling all the implementations I can find for my target machine with GCC, using -Os and -mthumb and such.)
What patterns and tricks can I use to do so?
Are there compilers/tools that can roll-up code?
Before optimizing for space (or speed): compilers are pretty clever these days. Have you checked whether a normal, readable implementation of AES128 gets small enough for your needs when you tell the compiler to optimize for space?
Writing your own version of AES128 is perhaps a good educational exercise, but you will certainly be fighting bugs, and cryptography is not the kind of trivial stuff that falls out of thin air. A faulty or weak implementation (due to bugs in your own code) is pretty much the worst case you can have.
Since you are targeting ARM and gcc is pretty common as a compiler for that platform:
-Os Optimize for size.
-Os enables all -O2 optimizations that do not typically
increase code size. It also performs further optimizations
designed to reduce code size.
It depends on what kind of space you are trying to optimise: code or data. There are essentially three variants of AES128 commonly in use, each differing in the amount of precomputed lookup table space.
The fastest version uses 4k of data arranged as four 32-bit x 256-entry lookup tables (commonly called T-tables). If you can afford that amount of data space, then the only instructions in this version are the EORs that combine the table results, and these will roll up into a very small piece of code.
The intermediate version uses an 8-bit x 256-entry lookup table to encode the SBox. The remaining instructions need to implement the shift-rows and mix-columns steps, so the code size is larger.
The smallest (data-size) version doesn't use any lookup tables at all, but needs to compute all of the individual AES-field operations including the inversion. This will use the most instructions, even if you fold both the field-multiply and inversion into subroutines.
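To illustrate what the table-free variant computes at runtime, the field arithmetic boils down to a few shift-and-XOR helpers; a rough sketch using the standard AES reduction polynomial (not taken from any particular implementation):
#include <stdint.h>

// Multiply by x in GF(2^8), reducing modulo x^8 + x^4 + x^3 + x + 1 (0x11B).
static inline uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

// General GF(2^8) multiply, shift-and-add style; this is what replaces
// every table lookup in the smallest variant.
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}
The SBox inversion can then be expressed as a chain of gf_mul calls (a^-1 equals a^254 in this field), trading quite a few extra cycles per byte for the 256-byte table you save.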