Consistant behaviour of float code with GCC - optimization

I do some numerical computing, and I have often had problems with floating points computations when using GCC. For my current purpose, I don't care too much about the real precision of the results, but I want this firm property:
no matter WHERE the SAME code is in my program, when it is run on the SAME inputs, I want it to give the SAME outputs.
How can I force GCC to do this? And specifically, what is the behavior of --fast-math, and the different -O optimizations?
I've heard that GCC might try to be clever, and sometimes load floats in registers, and sometime read them directly from memory, and that this might change the precision of the floats, resulting in a different output. How can I avoid this?
Again, I want :
my computations to be fast
my computations to be reliable (ie. same input -> same result)
I don't care that much about the precision for this particular code, so I can be fine with reduced precision if this brings reliability
could anyone tell me what is the way to go for this problem?

If your targets include x86 processors, using the switch that makes gcc use SSE2 instructions (instead of the historical stack-based ones) will make these run more like the others.
If your targets include PowerPC processors, using the switch that makes gcc not use the fmadd instruction (to replace a multiplication followed by an addition in the source code) will make these run more like the others.
Do not use --fast-math: this allows the compiler to take some shortcuts, and this will cause differences between architectures. Gcc is more standard-compliant, and therefore predictable, without this option.
Including your own math functions (exp, sin, ...) in your application instead of relying on those from the system's library can only help with predictability.
And lastly, even when the compiler does rigorously respect the standard (I mean C99 here), there may be some differences, because C99 allows intermediate results to be computed with a higher precision than required by the type of the expression. If you really want the program always to give the same results, write three-address code. Or, use only the maximum precision available for all computations, which would be double if you can avoid the historical x86 instructions. In any case do not use lower-precision floats in an attempt to improve predictability: the effect would be the opposite, as per the above clause in the standard.

I think that GCC is pretty well documented so I'm not going to reveal my own ignorance by trying to answer the parts of your question about its options and their effects. I would, though, make the general statement that when numeric precision and performance are concerned, it pays big dividends to read the manual. The clever people who work on GCC put a lot of effort into their documentation, reading it is rewarding (OK, it can be a trifle dull, but heck, it's a compiler manual not a bodice-ripper).
If it is important to you that you get identical-to-the-last-bit numeric results you'll have to concern yourself with more than just GCC and how you can control its behaviour. You'll need to lock down the libraries it calls, the hardware it runs on and probably a number of other factors I haven't thought of yet. In the worst (?) case you may even want to, and I've seen this done, write your own implementations of f-p maths to guarantee bit-identity across platforms. This is difficult, and therefore expensive, and leaves you possibly less certain of the correctness of your own code than of the code usd by GCC.
However, you write
I don't care that much about the precision for this particular code, so I can be fine with reduced precision if this brings reliability
which prompts the question to you -- why don't you simply use 5-decimal-digit precision as your standard of (reduced) precision ? It's what an awful lot of us in numerical computing do all the time; we ignore the finer aspects of numerical analysis since they are difficult, and costly in computation time, to circumvent. I'm thinking of things like interval arithmetic and high-precision maths. (OF course, if 5 is not right for you, choose another single-digit number.)
But the good news is that this is entirely justifiable: we're dealing with scientific data which, by its nature, comes with errors attached (of course we generally don't know what the errors are but that's another matter) so it's OK to disregard the last few digits in the decimal representation of, say, a 64-bit f-p number. Go right ahead and ignore a few more of them. Even better, it doesn't matter how many bits your f-p numbers have, you will always lose some precision doing numerical calculations on computers; adding more bits just pushes the errors back, both towards the least-significant-bits and towards the end of long-running computations.
The case you have to watch out for is where you have such a poor algorithm, or a poor implementation of an algorithm, that it loses lots of precision quickly. This usually shows up with any reasonable size of f-p number. Your test suite should have exposed this if it is a real problem for you.
To conclude: you have to deal with loss of precision in some way and it's not necessarily wrong to brush the finer details under the carpet.

Related

How to decide when to use fixed point arithmetic over float?

How to decide when to use fixed point arithmetic over float?
I have read that, fixed point is used when there is no Floating point unit in the processor. When there is no FPU, does that mean 'float' datatype is not supported ?
If you have no FPU fixed point will be faster and more deterministic
If you have no FPU fixed point will in many cases be smaller - certainly for simple arithmetic.
If you need your code to generate bit-identical results across different platforms or toolchains with or without an FPU, then fixed point is necessary.
If you need to do complex math requiring trig or log functions for example, floating point is the path of least resistance, but by no means the only option - but you need to develop or find a library (see links below) - and there are plenty of ways of doing that badly.
If you need wide dynamic range floating point is simpler. For example the square root of a number less than one is a smaller number - with fixed point you can run out of bits and end up with zero, with floating point, the point is simply moved to increase the resolution at the expense of range.
If you have an FPU and are using an RTOS and don't want the overhead of stacking FPU registers on a context switch (or if it is not supported), fixed-point avoids the need, and avoids errors if you forget to enable the option for every task that needs it.
Generally if your operation is trivial, use fixed point or at least an integer representation by selecting your units appropriately. For example storing voltage values in integer millivolts (or even in ADC quanta) rather then in volts can avoid unnecessary floating point.
If you are doing complex maths and have an FPU, floating point is the simpler, least error prone solution. Even without and FPU, if your solution meets timing and code size constraints, floating point may still be simpler, but may restrict your ability to use the same code in more constraint execution environments. So is reuse across a wide range of platforms is required fixed-point may be preferable.
Whatever you do, avoid "decimal fixed point" in most cases, use where possible use a binary fixed point representation (Q representation), where for example 10Q6 has 10 integer bits and 6 fractional bits. The reason for this is that rescaling following multiply/divide are then shift operations rather than potentially expensive multiply/divide operations and you loose no precision in the rescaling.
Some useful references:
https://www.drdobbs.com/cpp/optimizing-math-intensive-applications-w/207000448
http://jet.ro/files/The_neglected_art_of_Fixed_Point_arithmetic_20060913.pdf
If there is no FPU available, compilers can often emulate floating point arithmetics. But this is inefficient, as it takes a lot of cycles.
If you are resource constraint (which you often are in environments without a FPU), you can then opt for fixed point arithmetic, which uses regular integer operations.
Just rambling: When I was using FP, I missed support from the compiler (C/C++) to be able to mark variables to be fixed point (with some specific number of fractional bits).
If you have a standard compliant compiler, then float and double are always available and work correctly. If there isn't an FPU then the calculations are done in software (called soft-FPU or FPU emulation). This is slower and uses more memory.
When to use fixed point is mainly a matter of opinion, but when NOT to use it is when a variable has a large dynamic range, ie: when a number could be very large but you still need it to be accurate if it is very small.
Eg: Displaying the speed of a car. I need to know the difference between 68 and 70 mph, but I don't care about the difference between 0.68 mph and 0.70 mph. This has low dynamic range (that I care about) so I could use fixed-point if other reasons suggested that I might want to. Alternatively, measuring the radio-activity of a sample: I care about the difference between 100 and 90 counts per second and I still care about the difference between 1 and 0.9 counts per second. This high dynamic range means that fixed point would not be suitable.
How to decide when to use fixed point arithmetic over float?
It depends on many factors which may or may not affect you, including...
Pros:
Fixed-point requires less circuitry so may be more practical on smaller, simpler devices.
Fixed-point uses less energy so may be more practical
on battery-powered devices,
in applications where intensive computation incurs a significant energy bill, or
where heat dissipation is a problem.
Fixed-point is really just integer arithmetic so operations are lossless.
Fixed-point allows you to precisely express real numbers in the number base of your choice, e.g. the values 1/10 or 1/3.
Floating-point arithmetic can exhibit inconsistent behavior related to things like
global rounding modes,
optimisation,
associativity,
implementation-defined behavior, and
variations in FPU hardware.
Cons:
While lossless, fixed-point arithmetic is prone to out-of-range errors, such as overflow. (Libraries and UB sanitizers can help avoid/detect errors.)
Lossless division is achieved with the help of the modulus operator (%), which is often harder to work with.
Fixed-point arithmetic is not as well supported in most languages: you have to perform error-prone calculations by hand using integers or find a library to help you.
Floating-point formats tend to be more consistent across architectures, unlike integers which vary in width and endianness.
Not only do floating-point types have a dynamic radix point, but the optimal position of that point is maintained automatically, saving headaches and precision loss.
Really, floating-point is an easy-to-use and widely-supported solution for representing real numbers. If floating-point works for you, think carefully before exploring alternatives.
Exception: an application where the pros make fixed-point a natural choice is currency. Lossless representation of denominations is especially important in finance. (example)
I have read that, fixed point is used when there is no Floating point unit in the processor.
That is true. Without an FPU, floating-point arithmetic is very slow!
When there is no FPU, does that mean 'float' datatype is not supported ?
No, that shouldn't necessarily follow, although implementations may vary. (IIRC, older versions of GCC had a flag to enable floating-point support.) It's entirely possible to perform floating-point operations using
the compiler, by generating equivalent ALU instructions, or
the system, by interrupting the process when an FPU instruction is encountered and deferring to a software routine.
Both of these are much slower, so using fixed-point arithmetic on such hardware may become a practical choice. It may be best to think of fixed-point as an optimisation technique, used once a performance deficit has been identified.

SMT solvers for bit vector arithmetic

I'm planning some experiments in symbolic execution of C code, using an off-the-shelf SMT solver, and wondering which solver to use; looking at e.g. the SMT contest entrants, and taking only the open-source systems, narrows it down to Beaver, Boolector, CVC3, OpenSMT, Sateen, Sonolar, STP, Verit; which is still a long list.
Trying to narrow it down a little further, I notice that some of the systems advertise the ability to handle bit vector arithmetic, whereas others only advertise the ability to handle general integer arithmetic. In principle, the former is correct for C, where variables are machine words, not unbounded integers. How much difference does it make in practice? What happens if you try to use a general integer system for this kind of job? Does one of the following scenarios apply?
A bit vector system is slightly more efficient, but you can use either, no problem.
You can use a general integer system with a bit of tweaking.
A general integer system is fine for signed int (because the result of overflow is undefined) but will give the wrong answer for unsigned.
A general integer system just isn't correct for machine word arithmetic, and I can reduce my short list to only those systems that provide bit vector arithmetic.
Something else...?
I've tried to ask as specific a question as possible, but if anyone can suggest any other criteria for narrowing down the list, that would be great!
I've had good experience using STP for symbolic execution. STP was designed precisely for this task. Also, there have been a number of symbolic execution tools that have successfully used STP for this purpose, so there is reason to believe that STP doesn't suck. I would definitely recommend STP to others as a default choice for this sort of experimentation.
However, I haven't tried the other systems, so I don't know how STP compares to them.
Personally, I see STP as the baseline and the default choice for this kind of application. So, if you only have time to try one solver, trying STP seems like a pretty reasonable choice.
If I had to guess, my guess would be that bit-vector arithmetic is important to support, because any large systems code is going to have a non-trivial amount of code that performs bitwise operations. Also, I'd suspect/worry some systems code may rely upon the behavior of unsigned arithmetic to wrap modulo 2n, and if you try to model it with integers, you will not get the semantics of C right (because, as you say, integers just aren't correct for machine-word arithmetic) and consequently, if you try to use an integer-only solver, you may experience some difficulties. However, I have no hard evidence for either of these suspicions.
P.S. Z3 might also be a contender to add to your list to consider. (Do you really need your solver to be open-source, as long as it is free? I'd expect that a symbolic execution tool would use it only as a blackbox, without modification.)
According to SMT-Wikipedia at 2011-08, we have:
Based on these measures, it appears that the most vibrant, well-organized projects are OpenSMT, STP and CVC4.
I'm just checking this stuff - so far, all three seems reasonable, plus older CVC -> CVC3.

Compiler optimizations: Where/how can I get a feel for what the payoff is for different optimizations?

In my independent study of various compiler books and web sites, I am learning about many different ways that a compiler can optimize the code that is being compiled, but I am having trouble figuring out how much of a benefit each optimization will tend to give.
How do most compiler writers go about deciding which optimizations to implement first? Or which optimizations are worth the effort or not worth the effort? I realize that this will vary between types of code and even individual programs, but I'm hoping that there is enough similarity between most programs to say, for instance, that one given technique will usually give you a better performance gain than another technique.
I found when implementing textbook compiler optimizations that some of them tended to reverse the improvements made by other optimizations. This entailed a lot of work trying to find the right balance between them.
So there really isn't a good answer to your question. Everything is a tradeoff. Many optimizations work well on one type of code, but are pessimizations for other types. It's like designing a house - if you make the kitchen bigger, the pantry gets smaller.
The real work in building an optimizer is trying out the various combinations, benchmarking the results, and, like a master chef, picking the right mix of ingredients.
Tongue in cheek:
Hubris
Benchmarks
Embarrassment
More seriously, it depends on your compiler's architecture and goals. Here's one person's experience...
Go for the "big payoffs":
native code generation
register allocation
instruction scheduling
Go for the remaining "low hanging fruit":
strength reduction
constant propagation
copy propagation
Keep bennchmarking.
Look at the output; fix anything that looks stupid.
It is usually the case that combining optimizations, or even repeating optimization passes, is more effective than you might expect. The benefit is more than the sum of the parts.
You may find that introduction of one optimization may necessitate another. For example, SSA with Briggs-Chaitin register allocation really benefits from copy propagation.
Historically, there are "algorithmical" optimizations from which the code should benefit in most of the cases, like loop unrolling (and compiler writers should implement those "documented" and "tested" optimizations first).
Then there are types of optimizations that could benefit from the type of processor used (like using SIMD instructions on modern CPUs).
See Compiler Optimizations on Wikipedia for a reference.
Finally, various type of optimizations could be tested profiling the code or doing accurate timing of repeated executions.
I'm not a compiler writer, but why not just incrementally optimize portions of your code, profiling all the while?
My optimization scheme usually goes:
1) make sure the program is working
2) find something to optimize
3) optimize it
4) compare the test results with what came out from 1; if they are different, then the optimization is actually a breaking change.
5) compare the timing difference
Incrementally, I'll get it faster.
I choose which portions to focus on by using a profiler. I'm not sure what extra information you'll garner by asking the compiler writers.
This really depends on what you are compiling. There is was a reasonably good discussion about this on the LLVM mailing list recently, it is of course somewhat specific to the optimizers they have available. They use abbreviations for a lot of their optimization passes, if you not familiar with any of acronyms they are tossing around you can look at their passes page for documentation. Ultimately you can spend years reading academic papers on this subject.
This is one of those topics where academic papers (ACM perhaps?) may be one of the better sources of up-to-date information. The best thing to do if you really want to know could be to create some code in unoptimized form and some in the form that the optimization would take (loops unrolled, etc) and actually figure out where the gains are likely to be using a compiler with optimizations turned off.
It is worth noting that in many cases, compiler writers will NOT spend much time, if any, on ensuring that their libraries are optimized. Benchmarks tend to de-emphasize or even ignore library differences, presumably because you can just use different libraries. For example, the permutation algorithms in GCC are asymptotically* less efficient than they could be when trying to permute complex data. This relates to incorrectly making deep copies during calls to swap functions. This will likely be corrected in most compilers with the introduction of rvalue references (part of the C++0x standard). Rewriting the STL to be much faster is surprisingly easy.
*This assumes the size of the class being permuted is variable. E.g. permutting a vector of vectors of ints would slow down if the vectors of ints were larger.
One that can give big speedups but is rarely done is to insert memory prefetch instructions. The trick is to figure out what memory the program will be wanting far enough in advance, never ask for the wrong memory and never overflow the D-cache.

Overhead of using bignums

I have hit upon this problem about whether to use bignums in my language as a default datatype when there's numbers involved. I've evaluated this myself and reduced it to a convenience&comfort vs. performance -question. The answer to that question depends about how large the performance hit is in programs that aren't getting optimized.
How small is the overhead of using bignums in places where a fixnum or integer would had sufficed? How small can it be at best implementations? What kind of implementations reach the smallest overhead and what kind of additional tradeoffs do they result in?
What kind of hit can I expect to the results in the overall language performance if I'll put my language to default on bignums?
You can perhaps look at how Lisp does it. It will almost always do the exactly right thing and implicitly convert the types as it becomes necessary. It has fixnums ("normal" integers), bignums, ratios (reduced proper fractions represented as a set of two integers) and floats (in different sizes). Only floats have a precision error, and they are contagious, i.e. once a calculation involves a float, the result is a float, too. "Practical Common Lisp" has a good description of this behaviour.
To be honest, the best answer is "try it and see".
Clearly bignums can't be as efficient as native types, which typically fit in a single CPU register, but every application is different - if yours doesn't do a whole load of integer arithmetic then the overhead could be negligible.
Come to think of it... I don't think it will have much performance hits at all.
Because bignums by nature, will have a very large base, say a base of 65536 or larger for which is usually a maximum possible value for traditional fixnum and integers.
I don't know how large you would set the bignum's base to be but if you set it sufficiently large enough so that when it is used in place of fixnums and/or integers, it would never exceeds its first bignum-digit thus the operation will be nearly identical to normal fixnums/int.
This opens an opportunity for optimizations where for a bignum that never grows over its first bignum-digit, you could replace them with uber-fast one-bignum-digit operation.
And then switch over to n-digit algorithms when the second bignum-digit is needed.
This could be implemented with a bit flag and a validating operation on all arithmetic operations, roughly thinking, you could use the highest-order bit to signify bignum, if a data block has its highest-order bit set to 0, then process them as if they were normal fixnum/ints but if it is set to 1, then parse the block as a bignum structure and use bignum algorithms from there.
That should avoid performance hits from simple loop iterator variables which I think is the first possible source of performance hits.
It's just my rough thinking though, a suggestion since you should know better than me :-)
p.s. sorry, forgot what the technical terms of bignum-digit and bignum-base were
your reduction is correct, but the choice depends on the performance characteristics of your language, which we cannot possibly know!
once you have your language implemented, you can measure the performance difference, and perhaps offer the programmer a directive to choose the default
You will never know the actual performance hit until you create your own benchmark as the results will vary per language, per language revision and per cpu and. There's no language independent way to measure this except for the obvious fact that a 32bit integer uses twice the memory of a 16bit integer.
How small is the overhead of using bignums in places where a fixnum or integer would had sufficed? Show small can it be at best implementations?
The bad news is that even in the best possible software implementation, BigNum is going to be slower than the builtin arithmetics by orders of magnitude (i.e. everything from factor 10 up to factor 1000).
I don't have exact numbers but I don't think exact numbers will help very much in such a situation: If you need big numbers, use them. If not, don't. If your language uses them by default (which language does? some dynamic languages do …), think whether the disadvantage of switching to another language is compensated for by the gain in performance (which it should rarely be).
(Which could roughly be translated to: there's a huge difference but it shouldn't matter. If (and only if) it matters, use another language because even with the best possible implementation, this language evidently isn't well-suited for the task.)
I totally doubt that it would be worth it, unless it is very domain-specific.
The first thing that comes to mind are all the little for loops throughout programs, are the little iterator variables all gonna be bignums? That's scary!
But if your language is rather functional... then maybe not.

When to use Fixed Point these days

For intense number-crunching i'm considering using fixed point instead of floating point. Of course it'll matter how many bytes the fixed point type is in size, on what CPU it'll be running on, if i can use (for Intel) the MMX or SSE or whatever new things come up...
I'm wondering if these days when floating point runs faster than ever, is it ever worth considering fixed point? Are there general rules of thumb where we can say it'll matter by more than a few percent? What is the overview from 35,000 feet of numerical performance? (BTW, i'm assuming a general CPU as found in most computers, not DSP or specialized embedded systems.)
It's still worth it. Floating point is faster than in the past, but fixed-point is also. And fixed is still the only way to go if you care about precision beyond that guaranteed by IEEE 754.
In situations where you are dealing with very large amounts of data, fixed point can be twice as memory efficient, e.g. a four byte long integer as opposed to an eight byte double. A technique often used in large geospatial datasets is to reduce all the data to a common origin, such that the most significant bits can be disposed of, and work with fixed point integers for the rest. Floating point is only important if the point does actually float, i.e. you are dealing with a very wide range of numbers at very high accuracy.
Another good reason to use fixed decimal is that rounding is much simpler and predictable. Most of the financial software uses fixed point arbitrary precision decimals with half-even rounding to represent money.
Its nearly ALWAYS faster to use fixed point (experience of x86, pentium, 68k and ARM). It can, though, also depend on the application type. For graphics programming (one of my main uses of fixed point) I've been able to optimize the code using prebuilt cosine tables, log tables etc. But also the basic mathematical operations have also proven faster.
A comment on financial software. It was said in an earlier answer that fixed point is useful for financial calculations. In my own experience (development of large treasury management system and extensive experience of credit card processing) I would NOT use fixed point. You will have rounding errors using either floating or fixed point. We always use whole amounts to represent monetary amounts, counting the minimum amount possible (1c for Euro or dollar). This ensure no partial amounts are ever lost. When doing complex calculations values are converted to doubles, application specific rounding rules are applied and results are converted back to whole numbers.
Use fixed-point when the hardware doesn't support floating-point or the hardware implementation sucks.
Also beware when making classes for it. Something you think would be quick could actually turn out to be a dog when it comes to profiling due to (un)necessary copies of classes. That is another question for another time however.
Another reason to use fixed-point is that ARM devices, like mobile phones and tablets, lack of FPU (at least many of them).
For developing real-time applications it makes sense to optimize functions using fixed-point arithmetic. There are implementations of FFTs (Fast Fourier Transform), very importan for graphics, that base its improvements on efficiency on relying on floating point arithmetic.
Since you are using a general-purpose CPU, I would suggest not using fixed point, unless performance is so critical for your application that you have to count every tic. The hassle of implementing fixed point, and dealing with issues like overflow is just not worth it, when you have a CPU, which will do it for you.
IMHO, fixed point is only necessary when you are using a DSP without hardware support for floating point operations.