In what cases do we need functions for both double, float and long double? - objective-c

In the math-headers we see
extern float fabsf(float);
extern double fabs(double);
extern long double fabsl(long double);
...
extern float fmodf(float, float);
extern double fmod(double, double);
extern long double fmodl(long double, long double);
Why is there one function for each type?
Isn't this a lot of duplicate code? If I were to, say, write a lerp function or a clamp function, would I need to write one for each type?
Seems like we will have duplicate code where there's only one thing changing – the type.
extern float clampf(float value, float min, float max)
{
    if(value > max)
        return max;
    if(value < min)
        return min;
    return value;
}

extern double clamp(double value, double min, double max)
{
    if(value > max)
        return max;
    if(value < min)
        return min;
    return value;
}
Question 1: What is the historical reason for this structure?
Question 2: Should I follow the same pattern? Or should I only implement the double-kind since it is the one which is most common?
Question 3: Or should I just use macros to overcome the type issue altogether?

Historically (circa C89 and before), the math library contained only the double-precision versions of these functions, which is why those versions have no suffix. If you needed to compute the sine of a float, you either wrote your own implementation, or (more likely!) you simply wrote:
float x;
float y = sin(x);
However, this introduces some overhead on modern architectures. Specifically, on the most common architectures today, it is necessary for the compiler to emit code that looks something like this:
convert x to double
call sin
convert result to float
These conversions are pretty fast (about the same as an addition, usually), but they still have some cost. On top of the cost of conversion, sin needs to deliver a result that has ~53 bits of precision, more than half of which are completely wasted if the result is just going to be converted back to single precision. Between these two factors, it is possible for a dedicated single-precision sin routine to be about twice as fast; that’s a significant win for some very frequently-used library functions!
If we look at functions like fabs (and assume that the compiler does not simply inline and lower them), the situation is much, much worse. fabs, on a typical modern architecture, is a simple bitwise-and operation. So the two conversions bracketing the call (if all you have is double) are significantly more expensive than the operation itself, and can easily cause a 5x slowdown. That’s why multiple versions of these functions were added to support each FP type.
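To make that concrete, here is roughly what a dedicated single-precision fabs boils down to on an IEEE 754 machine (an illustrative sketch only; real libraries use compiler builtins, and my_fabsf is just a made-up name):
#include <stdint.h>
#include <string.h>

static inline float my_fabsf(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* view the float's bit pattern */
    bits &= 0x7fffffffu;              /* clear the sign bit; that's the whole job */
    memcpy(&x, &bits, sizeof bits);
    return x;
}
With a float version available, no round trip through double is needed at all.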
If you don’t want to keep track of all of them, you can #include <tgmath.h>, which will infer the correct function to use based on the type of the argument (meaning
sin((float)x)
will generate a call to sinf(x), whereas
sin((long double)x)
will call sinl(x)).
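For example, a minimal, self-contained use of <tgmath.h> might look like this (C99 or later; the variable names are mine):
#include <tgmath.h>
#include <stdio.h>

int main(void)
{
    float xf = 0.5f;
    long double xl = 0.5L;
    float yf = sin(xf);         /* the tgmath.h macro dispatches to sinf here */
    long double yl = sin(xl);   /* ...and to sinl here */
    printf("%f %Lf\n", yf, yl);
    return 0;
}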
In your own code, you usually know a priori what the type of your arguments is, and only need to support one or maybe two types. clamp and lerp in particular are graphics operations, and almost universally are used only in single-precision variants.
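For instance, a single-precision lerp to sit next to the clampf from the question could be as small as this (a sketch; lerpf is not a standard library function):
static inline float lerpf(float a, float b, float t)
{
    /* linear interpolation: t = 0 gives a, t = 1 gives b */
    return a + t * (b - a);
}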
Incidentally, the fact that you’re using clamp and lerp is a pretty good indication that you might want to look at writing your code in OpenCL instead of C/Obj-C; the OpenCL math library implements these operations (and many other similar operations) for you, and provides implementations that work with a wide range of basic types, including vectors.

float and double are different data types, same as int and long int. You can use the functions which operate on double on float values and implicit conversion will happen to make it work as expected in most circumstances, but if you use functions which operate on float on double values, you will almost inevitably lose precision.
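A small illustration of that loss (a sketch in C; the exact digits printed may vary by platform):
#include <math.h>
#include <stdio.h>

int main(void)
{
    double d = 1.2345678901234567;   /* roughly 16 significant decimal digits */
    float  f = fabsf(d);             /* d is silently converted to float for the call */
    printf("%.17g\n", fabs(d));      /* 1.2345678901234567 */
    printf("%.17g\n", (double)f);    /* something like 1.2345678806304932: only ~7 digits survive */
    return 0;
}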
There are other longer explanations available, e.g. What's the difference between a single precision and double precision floating point operation? .

Related

Kotlin - Converting Float to Double while maintaining precision

In Kotlin 123.456 is a valid Double value, however, 123.456F.toDouble() results in 123.45600128173828 - presumably just the way precision is handled between the two.
I want to be able to convert freely between the two, specifically for cases like this:
123.456F -> 123.456 // Float to Double
123.456 -> 123.456F // Double to Float
How can I convert a float to a double in cases like this, and maintain precision?
It's a bit ugly, but you could convert your Float to a String and back out to a Double:
val myDouble: Double = 123.456f.toString().toDouble()
// 123.456d
You could always encapsulate this in an extension function:
fun Float.toExactDouble(): Double =
    this.toString().toDouble()
val myDouble = 123.456f.toExactDouble()
In Kotlin 123.456 is a valid Double value
Actually, that's not quite true.  There's a Double value very close to 123.456, but it's not exactly 123.456.  What you're seeing is the consequences of that.
So you can't maintain precision, because you don't have that precision to start with!
Short answer:
If you need exact values, don't use floating-point!
(In particular: Never store money values in floating-point! See for example this question.)
The best alternative is usually BigDecimal which can store and calculate decimal fractions to an arbitrary precision. They're less efficient, but Kotlin's operator overloading makes them painless to use (unlike Java!).
Or if you're not going to be doing any calculations, you could store them as Strings.
Or if you'll only need a certain number of decimal places, you could scale them all up to Ints (or Longs).
Technical explanation:
Floats and Doubles use binary floating-point; they store an integer, and an integer power of 2 to multiply or divide it by.  (For example, 3/4 would be stored as 3*2⁻².)  This means they can store a wide range of binary fractions exactly.
However, just as you can't store 1/3 as a decimal fraction (it's 0.3333333333…, but any finite number of digits will only be an approximation), so you can't store 1/10 as a binary fraction (it's 0.000110011001100…).  This means that a binary floating-point number can't store most decimal numbers exactly.
Instead, they store the nearest possible value to the number you want.  And the routines which convert them to a String will try to undo that difference, by rounding appropriately.  But that doesn't always give the result you expect.
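You can see the "nearest possible value" effect directly with a couple of prints (shown in C for brevity; Kotlin's Float and Double use the same IEEE 754 formats, so the stored values are identical):
#include <stdio.h>

int main(void)
{
    /* 0.1 has no exact binary representation, so the stored value is slightly off */
    printf("%.20f\n", 0.1);    /* prints something like 0.10000000000000000555 */
    printf("%.20f\n", 0.1f);   /* prints something like 0.10000000149011611938 */
    return 0;
}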
Floating-point numbers are great when you need a huge range of values (e.g. in scientific and technical use), but don't care about storing them exactly.

Difference between Objective-C primitive numbers

What is the difference between the C primitive numbers in Objective-C? I know what they are and how to use them (somewhat), but I'm not sure what the capabilities and uses of each one are. Could anyone clear up which ones are best for some scenarios and not others?
int
float
double
long
short
What can I store with each one? I know that some can store more precise numbers and some can only store whole numbers. Say for example I wanted to store a latitude (possibly retrieved from a CLLocation object), which one should I use to avoid losing any data?
I also noticed that there are unsigned variants of each one. What does that mean and how is it different from a primitive number that is not unsigned?
Apple has some interesting documentation on this, however it doesn't fully satisfy my question.
Well, first off types like int, float, double, long, and short are C primitives, not Objective-C. As you may be aware, Objective-C is sort of a superset of C. The Objective-C NSNumber is a wrapper class for all of these types.
So I'll answer your question with respect to these C primitives, and how Objective-C interprets them. Basically, each numeric type can be placed in one of two categories: Integer Types and Floating-Point Types.
Integer Types
short
int
long
long long
These can only store, well, integers (whole numbers), and are characterized by two traits: size and signedness.
Size means how much physical memory in the computer a type requires for storage, that is, how many bytes. Technically, the exact memory allocated for each type is implementation-dependent, but there are a few guarantees: (1) char will always be 1 byte; (2) sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long).
Signedness simply means whether or not the type can represent negative values. So a signed integer, or int, can represent a certain range of negative or positive numbers (traditionally –2,147,483,648 to 2,147,483,647), and an unsigned integer, or unsigned int, can represent the same number of values, but all non-negative (0 to 4,294,967,295).
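If you want to see what your own platform uses, a quick check might look like this (sizes and ranges are implementation-dependent; the comments show typical 64-bit desktop values):
#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("short: %zu, int: %zu, long: %zu, long long: %zu bytes\n",
           sizeof(short), sizeof(int), sizeof(long), sizeof(long long));
    printf("int range: %d .. %d\n", INT_MIN, INT_MAX);    /* e.g. -2147483648 .. 2147483647 */
    printf("unsigned int range: 0 .. %u\n", UINT_MAX);    /* e.g. 0 .. 4294967295 */
    return 0;
}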
Floating-Point Types
float
double
long double
These are used to store decimal values (aka fractions) and are also categorized by size. Again the only real guarantee you have is that sizeof(float) <= sizeof(double) <= sizeof(long double). Floating-point types are stored using a rather peculiar memory model that can be difficult to understand, and that I won't go into, but there is an excellent guide here.
There's a fantastic blog post about C primitives in an Objective-C context over at RyPress. Lots of intro CS textbooks also have good resources.
Firstly I would like to specify the difference between an unsigned int and an int. Say that you have a very high number, and that you write a loop iterating with an unsigned int:
for(unsigned int i=0; i< N; i++)
{ ... }
If N is a number defined with #define, it may be higher than the maximum value storable in an int, as opposed to an unsigned int. If i overflows, it will wrap around to zero and you'll end up in an infinite loop; that's why I prefer to use an int for loops.
The same can happen if, by mistake, you iterate with an int while comparing it to a long. If N is a long you should iterate with a long, but if N is an int you can still safely iterate with a long.
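A minimal illustration of the wraparound being described:
#include <stdio.h>
#include <limits.h>

int main(void)
{
    unsigned int i = UINT_MAX;   /* the largest value an unsigned int can hold */
    i++;                         /* unsigned arithmetic wraps around: i is now 0 */
    printf("%u\n", i);           /* prints 0 */
    return 0;
}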
Another pitfall that may occur is when using the shift operator with an integer constant, then assigning it to an int or long. Maybe you also log sizeof(long), notice that it returns 8, and you don't care about portability, so you think that you wouldn't lose precision here:
long i= 1 << 34;
But 1 isn't a long, so the shift overflows, and by the time the result is converted to a long the precision is already lost. Instead you should write:
long i= 1l << 34;
Newer compilers will warn you about this.
Taken from this question: Converting Long 64-bit Decimal to Binary.
About float and double there is a thing to consider: they use a mantissa and an exponent to represent the number. It's something like:
value= 2^exponent * mantissa
So the higher the exponent, the less exact the representation. It may also happen that a number is so large that its representation is inaccurate enough that, surprisingly, printing it gives you a different number:
float f= 9876543219124567;
NSLog(@"%.0f", f); // On my machine it prints 9876543585124352
If I use a double it prints 9876543219124568, and if I use a long double with the %.0Lf format it prints the correct value. Always be careful when using floating-point numbers; unexpected things may happen.
For example, it may also happen that two floating point numbers have almost the same value: you expect them to be equal, but there is a subtle difference, so the equality comparison fails. This has been treated hundreds of times on Stack Overflow, so I will just post this link: What is the most effective way for float and double comparison?.

What does the floating point "f" designator signify?

I wonder if someone can clarify what the "f" behind a floating point number is used to signify?
float myFloat = 12.0f;
as opposed to:
float myFloat = 12.0;
I have seen this used many times but it's pretty hard to find an explanation either online or in books. I am assuming it's either something carried over from another language that's supported for consistency by C, or maybe it's there as a directive for the compiler when it comes to evaluate the maths.
I am just curious if there is any practical difference between the "f" and using the "." to signify a floating point number?
It means it's a single-precision float rather than a double precision double.
From the C99 standard:
An unsuffixed floating constant has type double. If suffixed by the letter f or F, it has type float.
Objective-C is based on C, maybe not C99, but this convention has been around in C for a long time.
There are sometimes performance concerns when converting from float to double, and you can avoid them by using the 'f'. Also, when doing, say, a square root, sin, cos, etc., a wild guess would be that
float answer = sqrt(12.0f)
is about 10x slower than
float answer = sqrtf(12.0f)
It really makes a difference on the phone and iPad if you are doing millions of these kinds of operations. Stay in float if you need speed and can deal with the lower resolution. If you are not good at math, or are not using much math in your program, use double everywhere, as there are more gotchas with the lower-precision 32-bit float.

OpenCL Alternative Modulo Uses, Advice

There is this simple function which I have used with C++ in the past to simulate simple forms of tessellation. The function takes a number and a divisor. The divisor must be (a power of two - 1) and n should be between 0 and divisor. It returns a modulus result of n % (d+1) using bitwise &.
Fairly sure the function goes like:
unsigned int BitwiseMod(unsigned int n, unsigned int d){ return n & d; }
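(For reference, a quick check that the trick matches the modulo it replaces whenever d is one less than a power of two; plain C with arbitrary sample values:)
#include <stdio.h>

int main(void)
{
    const unsigned int d = 7;    /* 2^3 - 1 */
    for (unsigned int n = 0; n < 20; ++n)
    {
        /* n & d equals n % (d + 1) because d + 1 is a power of two */
        printf("%2u: %u %u\n", n, n & d, n % (d + 1));
    }
    return 0;
}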
I am wanting to use this effectively in OpenCL and am wondering if it will work as I imagine it to. In my mind, modulus is a very expensive operation on the GPU, but I am familiar with using it to form magnitude spaces and other techniques to travel through data.
More often, I would be more likely to simply write this assuming functions have some overhead.
x[i] = 8*(i&d)+offset[i]; //OR in other contexts,...
num = i&d+offset[i];
x[num] = data;
The question is: will this be useful or get in the way? If useful, can you give me some examples where I might try to apply it?
On NVidia's architectures, GT200 and up, Modulo isn't particularly slow, not slower than a normal integer divide. See this paper for details.
However, using a bitwise AND is still quite a lot faster. As function calls are expensive on GPUs, OpenCL compilers aggressively use inlining to improve performance by default. You should be fine with a function call, as it will be inlined.

Comparing IEEE floats and doubles for equality

What is the best method for comparing IEEE floats and doubles for equality? I have heard of several methods, but I wanted to see what the community thought.
The best approach I think is to compare ULPs.
bool is_nan(float f)
{
    return (*reinterpret_cast<unsigned __int32*>(&f) & 0x7f800000) == 0x7f800000 &&
           (*reinterpret_cast<unsigned __int32*>(&f) & 0x007fffff) != 0;
}

bool is_finite(float f)
{
    return (*reinterpret_cast<unsigned __int32*>(&f) & 0x7f800000) != 0x7f800000;
}

// if this symbol is defined, NaNs are never equal to anything (as is normal in IEEE floating point)
// if this symbol is not defined, NaNs are hugely different from regular numbers, but might be equal to each other
#define UNEQUAL_NANS 1
// if this symbol is defined, infinities are never equal to finite numbers (as they're unimaginably greater)
// if this symbol is not defined, infinities are 1 ULP away from +/- FLT_MAX
#define INFINITE_INFINITIES 1

// test whether two IEEE floats are within a specified number of representable values of each other
// This depends on the fact that IEEE floats are properly ordered when treated as signed-magnitude integers
bool equal_float(float lhs, float rhs, unsigned __int32 max_ulp_difference)
{
#ifdef UNEQUAL_NANS
    if(is_nan(lhs) || is_nan(rhs))
    {
        return false;
    }
#endif
#ifdef INFINITE_INFINITIES
    if((is_finite(lhs) && !is_finite(rhs)) || (!is_finite(lhs) && is_finite(rhs)))
    {
        return false;
    }
#endif
    signed __int32 left(*reinterpret_cast<signed __int32*>(&lhs));
    // transform signed-magnitude ints into 2s complement signed ints
    if(left < 0)
    {
        left = 0x80000000 - left;
    }
    signed __int32 right(*reinterpret_cast<signed __int32*>(&rhs));
    // transform signed-magnitude ints into 2s complement signed ints
    if(right < 0)
    {
        right = 0x80000000 - right;
    }
    if(static_cast<unsigned __int32>(std::abs(left - right)) <= max_ulp_difference)
    {
        return true;
    }
    return false;
}
A similar technique can be used for doubles. The trick is to convert the floats so that they're ordered (as if integers) and then just see how different they are.
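For example, using the equal_float above (the 4-ULP threshold is an arbitrary choice):
float a = 1.0f / 3.0f;              /* 0.3333333432674408...                  */
float b = 1.0f - 2.0f / 3.0f;       /* 0.3333333134651184..., one ULP smaller */
bool exact = (a == b);              /* false: bitwise comparison fails        */
bool close = equal_float(a, b, 4);  /* true: within 4 representable values    */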
The current version I am using is this
bool is_equals(float A, float B,
               float maxRelativeError, float maxAbsoluteError)
{
    if (fabs(A - B) < maxAbsoluteError)
        return true;
    float relativeError;
    if (fabs(B) > fabs(A))
        relativeError = fabs((A - B) / B);
    else
        relativeError = fabs((A - B) / A);
    if (relativeError <= maxRelativeError)
        return true;
    return false;
}
This seems to take care of most problems by combining relative and absolute error tolerance. Is the ULP approach better? If so, why?
#DrPizza: I am no performance guru but I would expect fixed point operations to be quicker than floating point operations (in most cases).
It rather depends on what you are doing with them. A fixed-point type with the same range as an IEEE float would be many many times slower (and many times larger).
Things suitable for floats:
3D graphics, physics/engineering, simulation, climate simulation....
In numerical software you often want to test whether two floating point numbers are exactly equal. LAPACK is full of examples for such cases. Sure, the most common case is where you want to test whether a floating point number equals "Zero", "One", "Two", "Half". If anyone is interested I can pick some algorithms and go more into detail.
Also in BLAS you often want to check whether a floating point number is exactly Zero or One. For example, the routine dgemv can compute operations of the form
y = beta*y + alpha*A*x
y = beta*y + alpha*A^T*x
y = beta*y + alpha*A^H*x
So if beta equals One you have a "plus assignment", and for beta equals Zero a "simple assignment". So you certainly can cut the computational cost if you give these (common) cases special treatment.
Sure, you could design the BLAS routines in such a way that you can avoid exact comparisons (e.g. using some flags). However, LAPACK is full of examples where it is not possible.
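A minimal sketch of the kind of special-casing meant here (not the actual BLAS source; the routine name and shape are made up for illustration):
/* Scale a vector by beta, as the beta*y part of dgemv would. */
void scale_vector(int n, double beta, double *y)
{
    if (beta == 1.0)            /* exact compare: y is left untouched */
        return;
    if (beta == 0.0) {          /* exact compare: plain assignment, no multiplication */
        for (int i = 0; i < n; ++i)
            y[i] = 0.0;
        return;
    }
    for (int i = 0; i < n; ++i) /* general case */
        y[i] *= beta;
}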
P.S.:
There are certainly many cases where you don't want to check for "is exactly equal". For many people this might even be the only case they ever have to deal with. All I want to point out is that there are other cases too.
Although LAPACK is written in Fortran the logic is the same if you are using other programming languages for numerical software.
Oh dear lord please don't interpret the float bits as ints unless you're running on a P6 or earlier.
Even if it causes it to copy from vector registers to integer registers via memory, and even if it stalls the pipeline, it's the best way to do it that I've come across, insofar as it provides the most robust comparisons even in the face of floating point errors.
i.e. it is a price worth paying.
This seems to take care of most problems by combining relative and absolute error tolerance. Is the ULP approach better? If so, why?
ULPs are a direct measure of the "distance" between two floating point numbers. This means that they don't require you to conjure up the relative and absolute error values, nor do you have to make sure to get those values "about right". With ULPs, you can express directly how close you want the numbers to be, and the same threshold works just as well for small values as for large ones.
If you have floating point errors you have even more problems than this. Although I guess that is up to personal perspective.
Even if we do the numeric analysis to minimize accumulation of error, we can't eliminate it and we can be left with results that ought to be identical (if we were calculating with reals) but differ (because we cannot calculate with reals).
If you are looking for two floats to be equal, then they should be identically equal in my opinion. If you are facing a floating point rounding problem, perhaps a fixed point representation would suit your problem better.
If you are looking for two floats to be equal, then they should be identically equal in my opinion. If you are facing a floating point rounding problem, perhaps a fixed point representation would suit your problem better.
Perhaps we cannot afford the loss of range or performance that such an approach would inflict.
#DrPizza: I am no performance guru but I would expect fixed point operations to be quicker than floating point operations (in most cases).
#Craig H: Sure. I'm totally okay with it printing that. If a or b store money then they should be represented in fixed point. I'm struggling to think of a real world example where such logic ought to be applied to floats. Things suitable for floats:
weights
ranks
distances
real world values (like from an ADC)
For all these things, either you crunch the numbers and simply present the results to the user for human interpretation, or you make a comparative statement (even if such a statement is, "this thing is within 0.001 of this other thing"). A comparative statement like mine is only useful in the context of the algorithm: the "within 0.001" part depends on what physical question you're asking. That's my 0.02. Or should I say 2/100ths?
It rather depends on what you are doing with them. A fixed-point type with the same range as an IEEE float would be many many times slower (and many times larger).
Okay, but if I want an infinitesimally small bit-resolution then it's back to my original point: == and != have no meaning in the context of such a problem.
An int lets me express ~10^9 values (regardless of the range) which seems like enough for any situation where I would care about two of them being equal. And if that's not enough, use a 64-bit OS and you've got about 10^19 distinct values.
I can express values in a range of 0 to 10^200 (for example) in an int; it is just the bit-resolution that suffers (the resolution would be greater than 1, but, again, no application has that sort of range as well as that sort of resolution).
To summarize, I think in all cases one either is representing a continuum of values, in which case != and == are irrelevant, or one is representing a fixed set of values, which can be mapped to an int (or a another fixed-precision type).
An int lets me express ~10^9 values (regardless of the range) which seems like enough for any situation where I would care about two of them being equal. And if that's not enough, use a 64-bit OS and you've got about 10^19 distinct values.
I have actually hit that limit... I was trying to juggle times in ps and times in clock cycles in a simulation where you easily hit 10^10 cycles. No matter what I did I very quickly overflowed the puny range of 64-bit integers... 10^19 is not as much as you think it is; gimme 128-bit computing now!
Floats allowed me to get a solution to the mathematical issues, as the values overflowed with lots of zeros at the low end. So you basically had a decimal point floating around in the number with no loss of precision (I could live with the more limited number of distinct values allowed in the mantissa of a float compared to a 64-bit int, but desperately needed the range!).
And then things converted back to integers to compare etc.
Annoying, and in the end I scrapped the entire attempt and just relied on floats and < and > to get the work done. Not perfect, but works for the use case envisioned.
If you are looking for two floats to be equal, then they should be identically equal in my opinion. If you are facing a floating point rounding problem, perhaps a fixed point representation would suit your problem better.
Perhaps I should explain the problem better. In C++, the following code:
#include <iostream>
using namespace std;

int main()
{
    float a = 1.0;
    float b = 0.0;
    for(int i = 0; i < 10; ++i)
    {
        b += 0.1;
    }
    if(a != b)
    {
        cout << "Something is wrong" << endl;
    }
    return 1;
}
prints the phrase "Something is wrong". Are you saying that it should?
Oh dear lord please don't interpret the float bits as ints unless you're running on a P6 or earlier.
it's the best way to do it that I've come across, insofar as it provides the most robust comparisons even in the face of floating point errors.
If you have floating point errors you have even more problems than this. Although I guess that is up to personal perspective.