Can using a different cpu cause the output of the same program to be different? - deterministic

Should the same program always output the same result, if it does not use any random numbers or I/O, or is it possible that it will output something different on a different cpu (but same architecture, no need for recompile) ? I'm specially thinking about calculations with floats and doubles which depend heavilly on precision, as used in numerical approximations.

I don't believe it should ever happen, since CPUs using the same architecture should basically use the same set of registers that they use for storing data.
Short: The result will be the same.

Related

In CUDA programming, is atomic function faster than reducing after calculating the intermediate results?

Atomic functions (such as atomic_add) are widely used for counting or performing summation/aggregation in CUDA programming. However, I can not find information about the speed of atomic functions compared with ordinary global memory read/write.
Consider the following task, where we want to calculate a floating-point array with 256K elements. Each element is the sum over 1000 intermediate variables which is calculated first. One approach is to use atomic_add; While another approach is to use a 256K*1000 temporary array for the intermediate results and then to reduce this array (by taking summation).
Is the first approach using atomic function faster than the second?
In your specific case, even without you providing a concrete program, one does not need to know anything about the difference in latency or in bandwidth between atomic and non-atomic operations to rule out both your approaches: They are both quite inefficient.
You should have single blocks handling single output variables (or a small number of output variables), so that the sum of each 1,000 intermediate variables is not performed via global memory. You may want to read the "classic" presentation by Mark Harris:
Optimizing Parallel Reduction in CUDA
to get the basics. There have been improvements over this in recent years, due to newer hardware capabilities. For a more recent actual implementation, see the CUB library's block reduction primitive.
Also relevant: CUDA: how to sum all elements of an array into one number within the GPU?
If you implement it this way, each output element will only be written to once. And even if the computation of the 1,000 intermediates somehow needs to be distributed among multiple blocks, for whatever reason you have not shared in the question - you should still distribute it over a smaller number, rather than 1,000, so that the global-memory writes of the result take up a small enough fraction of the total computation time, that it is not worth bothering with something other than an atomic addition.

Bit Fields vs Integer

I'm writing an API that gets information about the CPU (using CPUID). What I'm wondering is should I store the values from the bit field returned by calling CPUID in separate integer values, or should I just store the entire bit field in a value and write functions to get the different values on-the-fly?
What is preferable in this case? Memory usage or speed? If it's memory usage, I'll just store the entire bit field in a single variable. If it's speed, I'll store each value in a separate variable.
You're only going to query a CPU once. With modern computers having both huge amounts of memory and processing power, it would make no difference either way.
Just do what would make more sense for the next person who reads it.
Programs must be written for people to read, and only incidentally for machines to execute.
— The Structure and Interpretation of Computer Programs
I think it does not matter here, b/c you will not call your CPU-id code 10000 times per second.. will you?
I think you can define different interface (method) for different value. this is more clear and easy to use. a clear, accuracy & easy to use of interface should be the first thing to consider, then performance (memory usage & speed).

Consistant behaviour of float code with GCC

I do some numerical computing, and I have often had problems with floating points computations when using GCC. For my current purpose, I don't care too much about the real precision of the results, but I want this firm property:
no matter WHERE the SAME code is in my program, when it is run on the SAME inputs, I want it to give the SAME outputs.
How can I force GCC to do this? And specifically, what is the behavior of --fast-math, and the different -O optimizations?
I've heard that GCC might try to be clever, and sometimes load floats in registers, and sometime read them directly from memory, and that this might change the precision of the floats, resulting in a different output. How can I avoid this?
Again, I want :
my computations to be fast
my computations to be reliable (ie. same input -> same result)
I don't care that much about the precision for this particular code, so I can be fine with reduced precision if this brings reliability
could anyone tell me what is the way to go for this problem?
If your targets include x86 processors, using the switch that makes gcc use SSE2 instructions (instead of the historical stack-based ones) will make these run more like the others.
If your targets include PowerPC processors, using the switch that makes gcc not use the fmadd instruction (to replace a multiplication followed by an addition in the source code) will make these run more like the others.
Do not use --fast-math: this allows the compiler to take some shortcuts, and this will cause differences between architectures. Gcc is more standard-compliant, and therefore predictable, without this option.
Including your own math functions (exp, sin, ...) in your application instead of relying on those from the system's library can only help with predictability.
And lastly, even when the compiler does rigorously respect the standard (I mean C99 here), there may be some differences, because C99 allows intermediate results to be computed with a higher precision than required by the type of the expression. If you really want the program always to give the same results, write three-address code. Or, use only the maximum precision available for all computations, which would be double if you can avoid the historical x86 instructions. In any case do not use lower-precision floats in an attempt to improve predictability: the effect would be the opposite, as per the above clause in the standard.
I think that GCC is pretty well documented so I'm not going to reveal my own ignorance by trying to answer the parts of your question about its options and their effects. I would, though, make the general statement that when numeric precision and performance are concerned, it pays big dividends to read the manual. The clever people who work on GCC put a lot of effort into their documentation, reading it is rewarding (OK, it can be a trifle dull, but heck, it's a compiler manual not a bodice-ripper).
If it is important to you that you get identical-to-the-last-bit numeric results you'll have to concern yourself with more than just GCC and how you can control its behaviour. You'll need to lock down the libraries it calls, the hardware it runs on and probably a number of other factors I haven't thought of yet. In the worst (?) case you may even want to, and I've seen this done, write your own implementations of f-p maths to guarantee bit-identity across platforms. This is difficult, and therefore expensive, and leaves you possibly less certain of the correctness of your own code than of the code usd by GCC.
However, you write
I don't care that much about the precision for this particular code, so I can be fine with reduced precision if this brings reliability
which prompts the question to you -- why don't you simply use 5-decimal-digit precision as your standard of (reduced) precision ? It's what an awful lot of us in numerical computing do all the time; we ignore the finer aspects of numerical analysis since they are difficult, and costly in computation time, to circumvent. I'm thinking of things like interval arithmetic and high-precision maths. (OF course, if 5 is not right for you, choose another single-digit number.)
But the good news is that this is entirely justifiable: we're dealing with scientific data which, by its nature, comes with errors attached (of course we generally don't know what the errors are but that's another matter) so it's OK to disregard the last few digits in the decimal representation of, say, a 64-bit f-p number. Go right ahead and ignore a few more of them. Even better, it doesn't matter how many bits your f-p numbers have, you will always lose some precision doing numerical calculations on computers; adding more bits just pushes the errors back, both towards the least-significant-bits and towards the end of long-running computations.
The case you have to watch out for is where you have such a poor algorithm, or a poor implementation of an algorithm, that it loses lots of precision quickly. This usually shows up with any reasonable size of f-p number. Your test suite should have exposed this if it is a real problem for you.
To conclude: you have to deal with loss of precision in some way and it's not necessarily wrong to brush the finer details under the carpet.

Overhead of using bignums

I have hit upon this problem about whether to use bignums in my language as a default datatype when there's numbers involved. I've evaluated this myself and reduced it to a convenience&comfort vs. performance -question. The answer to that question depends about how large the performance hit is in programs that aren't getting optimized.
How small is the overhead of using bignums in places where a fixnum or integer would had sufficed? How small can it be at best implementations? What kind of implementations reach the smallest overhead and what kind of additional tradeoffs do they result in?
What kind of hit can I expect to the results in the overall language performance if I'll put my language to default on bignums?
You can perhaps look at how Lisp does it. It will almost always do the exactly right thing and implicitly convert the types as it becomes necessary. It has fixnums ("normal" integers), bignums, ratios (reduced proper fractions represented as a set of two integers) and floats (in different sizes). Only floats have a precision error, and they are contagious, i.e. once a calculation involves a float, the result is a float, too. "Practical Common Lisp" has a good description of this behaviour.
To be honest, the best answer is "try it and see".
Clearly bignums can't be as efficient as native types, which typically fit in a single CPU register, but every application is different - if yours doesn't do a whole load of integer arithmetic then the overhead could be negligible.
Come to think of it... I don't think it will have much performance hits at all.
Because bignums by nature, will have a very large base, say a base of 65536 or larger for which is usually a maximum possible value for traditional fixnum and integers.
I don't know how large you would set the bignum's base to be but if you set it sufficiently large enough so that when it is used in place of fixnums and/or integers, it would never exceeds its first bignum-digit thus the operation will be nearly identical to normal fixnums/int.
This opens an opportunity for optimizations where for a bignum that never grows over its first bignum-digit, you could replace them with uber-fast one-bignum-digit operation.
And then switch over to n-digit algorithms when the second bignum-digit is needed.
This could be implemented with a bit flag and a validating operation on all arithmetic operations, roughly thinking, you could use the highest-order bit to signify bignum, if a data block has its highest-order bit set to 0, then process them as if they were normal fixnum/ints but if it is set to 1, then parse the block as a bignum structure and use bignum algorithms from there.
That should avoid performance hits from simple loop iterator variables which I think is the first possible source of performance hits.
It's just my rough thinking though, a suggestion since you should know better than me :-)
p.s. sorry, forgot what the technical terms of bignum-digit and bignum-base were
your reduction is correct, but the choice depends on the performance characteristics of your language, which we cannot possibly know!
once you have your language implemented, you can measure the performance difference, and perhaps offer the programmer a directive to choose the default
You will never know the actual performance hit until you create your own benchmark as the results will vary per language, per language revision and per cpu and. There's no language independent way to measure this except for the obvious fact that a 32bit integer uses twice the memory of a 16bit integer.
How small is the overhead of using bignums in places where a fixnum or integer would had sufficed? Show small can it be at best implementations?
The bad news is that even in the best possible software implementation, BigNum is going to be slower than the builtin arithmetics by orders of magnitude (i.e. everything from factor 10 up to factor 1000).
I don't have exact numbers but I don't think exact numbers will help very much in such a situation: If you need big numbers, use them. If not, don't. If your language uses them by default (which language does? some dynamic languages do …), think whether the disadvantage of switching to another language is compensated for by the gain in performance (which it should rarely be).
(Which could roughly be translated to: there's a huge difference but it shouldn't matter. If (and only if) it matters, use another language because even with the best possible implementation, this language evidently isn't well-suited for the task.)
I totally doubt that it would be worth it, unless it is very domain-specific.
The first thing that comes to mind are all the little for loops throughout programs, are the little iterator variables all gonna be bignums? That's scary!
But if your language is rather functional... then maybe not.

When to use Fixed Point these days

For intense number-crunching i'm considering using fixed point instead of floating point. Of course it'll matter how many bytes the fixed point type is in size, on what CPU it'll be running on, if i can use (for Intel) the MMX or SSE or whatever new things come up...
I'm wondering if these days when floating point runs faster than ever, is it ever worth considering fixed point? Are there general rules of thumb where we can say it'll matter by more than a few percent? What is the overview from 35,000 feet of numerical performance? (BTW, i'm assuming a general CPU as found in most computers, not DSP or specialized embedded systems.)
It's still worth it. Floating point is faster than in the past, but fixed-point is also. And fixed is still the only way to go if you care about precision beyond that guaranteed by IEEE 754.
In situations where you are dealing with very large amounts of data, fixed point can be twice as memory efficient, e.g. a four byte long integer as opposed to an eight byte double. A technique often used in large geospatial datasets is to reduce all the data to a common origin, such that the most significant bits can be disposed of, and work with fixed point integers for the rest. Floating point is only important if the point does actually float, i.e. you are dealing with a very wide range of numbers at very high accuracy.
Another good reason to use fixed decimal is that rounding is much simpler and predictable. Most of the financial software uses fixed point arbitrary precision decimals with half-even rounding to represent money.
Its nearly ALWAYS faster to use fixed point (experience of x86, pentium, 68k and ARM). It can, though, also depend on the application type. For graphics programming (one of my main uses of fixed point) I've been able to optimize the code using prebuilt cosine tables, log tables etc. But also the basic mathematical operations have also proven faster.
A comment on financial software. It was said in an earlier answer that fixed point is useful for financial calculations. In my own experience (development of large treasury management system and extensive experience of credit card processing) I would NOT use fixed point. You will have rounding errors using either floating or fixed point. We always use whole amounts to represent monetary amounts, counting the minimum amount possible (1c for Euro or dollar). This ensure no partial amounts are ever lost. When doing complex calculations values are converted to doubles, application specific rounding rules are applied and results are converted back to whole numbers.
Use fixed-point when the hardware doesn't support floating-point or the hardware implementation sucks.
Also beware when making classes for it. Something you think would be quick could actually turn out to be a dog when it comes to profiling due to (un)necessary copies of classes. That is another question for another time however.
Another reason to use fixed-point is that ARM devices, like mobile phones and tablets, lack of FPU (at least many of them).
For developing real-time applications it makes sense to optimize functions using fixed-point arithmetic. There are implementations of FFTs (Fast Fourier Transform), very importan for graphics, that base its improvements on efficiency on relying on floating point arithmetic.
Since you are using a general-purpose CPU, I would suggest not using fixed point, unless performance is so critical for your application that you have to count every tic. The hassle of implementing fixed point, and dealing with issues like overflow is just not worth it, when you have a CPU, which will do it for you.
IMHO, fixed point is only necessary when you are using a DSP without hardware support for floating point operations.