Numeric limits in Stanza - numeric-limits

In Stanza, I would like to get the numeric limits (min and max) of the Int and Double types.
In C++, there are INT_MIN, INT_MAX, DBL_MIN and DBL_MAX.

The following special constants are declared in core.
BYTE-MAX
BYTE-MIN
INT-MAX
INT-MIN
LONG-MAX
LONG-MIN
FLOAT-MAX
FLOAT-MIN-NORMAL
FLOAT-MIN
FLOAT-NAN
FLOAT-POSITIVE-INFINITY
FLOAT-NEGATIVE-INFINITY
DOUBLE-MAX
DOUBLE-MIN-NORMAL
DOUBLE-MIN
DOUBLE-NAN
DOUBLE-POSITIVE-INFINITY
DOUBLE-NEGATIVE-INFINITY
Patrick
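
For comparison, the C and C++ macros the question mentions come from <limits.h>/<climits> and <float.h>/<cfloat>. A minimal C sketch printing a few of them (note that DBL_MIN is the smallest positive normalized double, not the most negative finite value):

#include <stdio.h>
#include <limits.h>   /* INT_MIN, INT_MAX, LONG_MIN, LONG_MAX, ... */
#include <float.h>    /* FLT_MAX, FLT_MIN, DBL_MAX, DBL_MIN, ...   */

int main(void) {
    printf("INT_MIN  = %d\n", INT_MIN);
    printf("INT_MAX  = %d\n", INT_MAX);
    printf("DBL_MAX  = %g\n", DBL_MAX);
    printf("DBL_MIN  = %g\n", DBL_MIN);    /* smallest positive normalized double */
    printf("-DBL_MAX = %g\n", -DBL_MAX);   /* most negative finite double */
    return 0;
}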

Related

Kotlin - Converting Float to Double while maintaining precision

In Kotlin, 123.456 is a valid Double value; however, 123.456F.toDouble() results in 123.45600128173828 - presumably just the way precision is handled between the two.
I want to be able to convert freely between the two, specifically for cases like this:
123.456F -> 123.456 // Float to Double
123.456 -> 123.456F // Double to Float
How can I convert a float to a double in cases like this, and maintain precision?
It's a bit ugly, but you could convert your Float to a String and back out to a Double:
val myDouble: Double = 123.456f.toString().toDouble()
// 123.456d
You could always encapsulate this in an extension function:
fun Float.toExactDouble(): Double =
    this.toString().toDouble()
val myDouble = 123.456f.toExactDouble()
In Kotlin 123.456 is a valid Double value
Actually, that's not quite true.  There's a Double value very close to 123.456, but it's not exactly 123.456.  What you're seeing is a consequence of that.
So you can't maintain precision, because you don't have that precision to start with!
Short answer:
If you need exact values, don't use floating-point!
(In particular: Never store money values in floating-point! See for example this question.)
The best alternative is usually BigDecimal which can store and calculate decimal fractions to an arbitrary precision. They're less efficient, but Kotlin's operator overloading makes them painless to use (unlike Java!).
Or if you're not going to be doing any calculations, you could store them as Strings.
Or if you'll only need a certain number of decimal places, you could scale them all up to Ints (or Longs).
Technical explanation:
Floats and Doubles use binary floating-point; they store an integer, and an integer power of 2 to multiply or divide it by.  (For example, 3/4 would be stored as 3*2⁻².)  This means they can store a wide range of binary fractions exactly.
However, just as you can't store 1/3 as a decimal fraction (it's 0.3333333333…, but any finite number of digits will only be an approximation), so you can't store 1/10 as a binary fraction (it's 0.000110011001100…).  This means that a binary floating-point number can't store most decimal numbers exactly.
Instead, they store the nearest possible value to the number you want.  And the routines which convert them to a String will try to undo that difference, by rounding appropriately.  But that doesn't always give the result you expect.
Floating-point numbers are great when you need a huge range of values (e.g. in scientific and technical use), but don't care about storing them exactly.
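Since the underlying IEEE-754 behaviour is language-independent, the same effect is easy to reproduce in C. A minimal sketch, assuming a typical IEEE-754 platform (the exact digits printed may vary slightly):

#include <stdio.h>

int main(void) {
    float  f = 123.456f;   /* the nearest float to 123.456 */
    double d = f;          /* widening is exact, so d carries the float's error */
    printf("%.17g\n", d);           /* prints 123.45600128173828 */
    printf("%d\n", d == 123.456);   /* 0: not the same as the nearest double to 123.456 */
    return 0;
}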

Saving float and double precision in C++ during cin/cout and scanf/printf

I want to read floats and doubles from standard input, preserve their precision (exactly the same digits after the decimal point), and be able to output them (via cout/printf) as-is. What is the most convenient (and simplest) way to do this?
Thanks!
float f;
cin >> f;
cout << f;
Use setprecision.
Here is the solution
cout<<setprecision(the precision you want to set here)<<variablename;
eg. If you want to set precision of the output to 5 for variable var use it like this:
cout<<setprecision(5)<<var;
setprecision is a manipulator. Learn more about manipulators here.
It sets the decimal precision to be used to format floating-point values on output operations. Behaves as if member precision were called with n as argument on the stream on which it is inserted/extracted as a manipulator (it can be inserted/extracted on input streams or output streams).
This is a manipulator and is declared in header <iomanip>
Since the input has an unknown precision, the simplest method is to read them as strings, not doubles/floats.
If you need the float value, a simple string to double conversion is needed.
Any other method will probably fail since you rely on imperfect conversion from string to float done by the standard library.
The latter can't distinguish between 0.4 and 0.40.
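A minimal sketch of the read-as-string approach using scanf/printf (which the question also asks about); the buffer size and the %g output format are arbitrary choices for the example:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char buf[64];
    if (scanf("%63s", buf) == 1) {
        double value = strtod(buf, NULL);   /* numeric value, if you need it */
        printf("%s\n", buf);                /* echoes "0.40" exactly as typed */
        printf("%g\n", value);              /* would print "0.4" instead */
    }
    return 0;
}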

In what cases do we need functions for both double, float and long double?

In the math-headers we see
extern float fabsf(float);
extern double fabs(double);
extern long double fabsl(long double);
...
extern float fmodf(float, float);
extern double fmod(double, double);
extern long double fmodl(long double, long double);
Why is there one function for each type?
Isn't this a lot of duplicate code? If I were to, say, write a lerp function or a clamp function, would I need to write one for each type?
Seems like we will have duplicate code where there's only one thing changing – the type.
extern float clampf(float value, float min, float max)
{
    if(value > max)
        return max;
    if(value < min)
        return min;
    return value;
}

extern double clamp(double value, double min, double max)
{
    if(value > max)
        return max;
    if(value < min)
        return min;
    return value;
}
Question 1: What is the historical reason for this structure?
Question 2: Should I follow the same pattern? Or should I only implement the double-kind since it is the one which is most common?
Question 3: Or should I just use macros to overcome the type issue altogether?
Historically (circa C89 and before), the math library contained only the double-precision versions of these functions, which is why those versions have no suffix. If you needed to compute the sine of a float, you either wrote your own implementation, or (more likely!) you simply wrote:
float x;
float y = sin(x);
However, this introduces some overhead on modern architectures. Specifically, on the most common architectures today, it is necessary for the compiler to emit code that looks something like this:
convert x to double
call sin
convert result to float
These conversions are pretty fast (about the same as an addition, usually), but they still have some cost. On top of the cost of conversion, sin needs to deliver a result that has ~53 bits of precision, more than half of which are completely wasted if the result is just going to be converted back to single precision. Between these two factors, it is possible for a dedicated single-precision sin routine to be about twice as fast; that’s a significant win for some very frequently-used library functions!
If we look at functions like fabs (and assume that the compiler does not simply inline and lower them), the situation is much, much worse. fabs, on a typical modern architecture, is a simple bitwise-and operation. So the two conversions bracketing the call (if all you have is double) are significantly more expensive than the operation itself, and can easily cause a 5x slowdown. That’s why multiple versions of these functions were added to support each FP type.
If you don’t want to keep track of all of them, you can #include <tgmath.h>, which will infer the correct function to use based on the type of the argument (meaning
sin((float)x)
will generate a call to sinf(x), whereas
sin((long double)x)
will call sinl(x)).
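A minimal sketch of that type-generic dispatch (standard C99 <tgmath.h>; the 0.5 arguments are arbitrary):

#include <stdio.h>
#include <tgmath.h>   /* type-generic macros: sin, fabs, fmod, ... */

int main(void) {
    float       f = 0.5f;
    double      d = 0.5;
    long double l = 0.5L;
    /* one spelling; the macro dispatches to sinf, sin and sinl by argument type */
    printf("%.8f  %.17f  %.20Lf\n", sin(f), sin(d), sin(l));
    return 0;
}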
In your own code, you usually know a priori what the type of your arguments is, and only need to support one or maybe two types. clamp and lerp in particular are graphics operations, and almost universally are used only in single-precision variants.
Incidentally, the fact that you’re using clamp and lerp is a pretty good indication that you might want to look at writing your code in OpenCL instead of C/Obj-C; the OpenCL math library implements these operations (and many other similar operations) for you, and provides implementations that work with a wide range of basic types, including vectors.
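If you do write per-type versions of your own helpers, C11's _Generic selection offers one way to expose a single name for them. A sketch under that assumption; the double version is renamed clampd here so the generic macro can take the name clamp:

#include <stdio.h>

/* per-type implementations, as in the question's code */
static float  clampf(float value, float min, float max)
{ return value > max ? max : (value < min ? min : value); }

static double clampd(double value, double min, double max)
{ return value > max ? max : (value < min ? min : value); }

/* C11 type-generic front end: one spelling, dispatched on the first argument's type */
#define clamp(v, lo, hi) _Generic((v), float: clampf, default: clampd)(v, lo, hi)

int main(void) {
    printf("%f\n", clamp(1.5f, 0.0f, 1.0f));   /* calls clampf -> 1.000000 */
    printf("%f\n", clamp(1.5,  0.0,  1.0));    /* calls clampd -> 1.000000 */
    return 0;
}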
float and double are different data types, just as int and long int are. You can use the functions which operate on double with float values, and implicit conversion will make it work as expected in most circumstances; but if you use the functions which operate on float with double values, you will almost inevitably lose precision.
There are other, longer explanations available, e.g. What's the difference between a single precision and double precision floating point operation?

Difference between Objective-C primitive numbers

What is the difference between the C primitive number types used in Objective-C? I know what they are and how to use them (somewhat), but I'm not sure what the capabilities and uses of each one are. Could anyone clear up which ones are best for some scenarios and not others?
int
float
double
long
short
What can I store with each one? I know that some can store more precise numbers and some can only store whole numbers. Say, for example, I wanted to store a latitude (possibly retrieved from a CLLocation object): which one should I use to avoid losing any data?
I also noticed that there are unsigned variants of each one. What does that mean and how is it different from a primitive number that is not unsigned?
Apple has some interesting documentation on this; however, it doesn't fully answer my question.
Well, first off types like int, float, double, long, and short are C primitives, not Objective-C. As you may be aware, Objective-C is sort of a superset of C. The Objective-C NSNumber is a wrapper class for all of these types.
So I'll answer your question with respect to these C primitives, and how Objective-C interprets them. Basically, each numeric type can be placed in one of two categories: Integer Types and Floating-Point Types.
Integer Types
short
int
long
long long
These can only store, well, integers (whole numbers), and are characterized by two traits: size and signedness.
Size means how much physical memory in the computer a type requires for storage, that is, how many bytes. Technically, the exact memory allocated for each type is implementation-dependent, but there are a few guarantees: (1) char will always be 1 byte; (2) sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long).
Signedness simply means whether or not the type can represent negative values. So a signed integer, or int, can represent a certain range of negative and positive numbers (traditionally –2,147,483,648 to 2,147,483,647), and an unsigned integer, or unsigned int, can represent a range of the same size, but all non-negative (0 to 4,294,967,295).
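A quick way to check the sizes and ranges on your own machine; the results are implementation-dependent, and the comments show typical 64-bit values:

#include <stdio.h>
#include <limits.h>

int main(void) {
    printf("short: %zu, int: %zu, long: %zu, long long: %zu bytes\n",
           sizeof(short), sizeof(int), sizeof(long), sizeof(long long));
    /* typically 2, 4, 8, 8 on 64-bit Apple platforms */
    printf("INT_MIN = %d, INT_MAX = %d, UINT_MAX = %u\n", INT_MIN, INT_MAX, UINT_MAX);
    return 0;
}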
Floating-Point Types
float
double
long double
These are used to store decimal values (aka fractions) and are also categorized by size. Again the only real guarantee you have is that sizeof(float) <= sizeof(double) <= sizeof (long double). Floating-point types are stored using a rather peculiar memory model that can be difficult to understand, and that I won't go into, but there is an excellent guide here.
There's a fantastic blog post about C primitives in an Objective-C context over at RyPress. Lots of intro CS textbooks also have good resources.
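On the latitude question specifically: a float only guarantees about 6 significant decimal digits, while a double guarantees 15 (Core Location's CLLocationDegrees is a typedef for double), so double is the safer choice. A small C sketch; the coordinate is just an illustrative value:

#include <stdio.h>
#include <float.h>

int main(void) {
    /* decimal digits guaranteed to survive a text round trip */
    printf("float:  %d digits\n", FLT_DIG);   /* typically 6  */
    printf("double: %d digits\n", DBL_DIG);   /* typically 15 */

    float  lat_f = 37.33182162f;              /* illustrative latitude */
    double lat_d = 37.33182162;
    printf("as float:  %.8f\n", lat_f);       /* last digits already off */
    printf("as double: %.8f\n", lat_d);       /* 37.33182162 */
    return 0;
}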
First, I would like to explain the difference between an unsigned int and an int. Say you have a very large number, and you write a loop iterating with an unsigned int:
for(unsigned int i=0; i< N; i++)
{ ... }
If N is a number defined with #define, it may be higher than the maximum value storable in an int (as opposed to an unsigned int). If i overflows, it will start again from zero and you'll end up in an infinite loop; that's why I prefer to use an int for loops.
The same happens if by mistake you iterate with an int while comparing it to a long. If N is a long, you should iterate with a long; but if N is an int, you can still safely iterate with a long.
Another pitfall occurs when using the shift operator with an integer constant and then assigning the result to an int or long. Maybe you log sizeof(long), notice that it returns 8, don't care about portability, and so think you won't lose precision here:
long i= 1 << 34;
But 1 isn't a long, so the shift overflows, and by the time the result is converted to a long the precision has already been lost. Instead you should write:
long i= 1l << 34;
Newer compilers will warn you about this.
Taken from this question: Converting Long 64-bit Decimal to Binary.
About float and double, there is one more thing to consider: they use a mantissa and an exponent to represent the number. It's something like:
value = 2^exponent * mantissa
So the higher the exponent, the coarser the spacing between representable values, and the less exact the representation. A number may even be so large that its representation is inaccurate enough that, surprisingly, printing it gives you a different number:
float f = 9876543219124567;
NSLog(@"%.0f", f); // On my machine it prints 9876543585124352
If I use a double it prints 9876543219124568, and if I use a long double with the %.0Lf format it prints the correct value. Always be careful when using floating-point numbers; unexpected things may happen.
For example, two floating-point numbers may have almost the same value: you expect them to be equal, but there is a subtle difference, and the equality comparison fails. This has been treated hundreds of times on Stack Overflow, so I will just post this link: What is the most effective way for float and double comparison?
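A minimal C illustration of that last point (the 1e-9 tolerance is an arbitrary choice for the example; picking a good tolerance is exactly what the linked question discusses):

#include <stdio.h>
#include <math.h>

int main(void) {
    double a = 0.1 + 0.2;
    double b = 0.3;
    printf("a == b         -> %d\n", a == b);              /* 0 with IEEE-754 doubles */
    printf("|a - b| < 1e-9 -> %d\n", fabs(a - b) < 1e-9);   /* 1 */
    return 0;
}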

MAXFLOAT in Objective-C

Max float is defined as:
math.h
#define MAXFLOAT 0x1.fffffep+127f
I'm a little sad I never noticed this before. What does this actually say? I would have expected something like this:
#define MAXFLOAT 0xFFFFFFFF-1
Would that even work?
0x1.fffffep+127 is (2 - 2⁻²³) times 2^127, i.e. just under 2 times 2^127. It's a floating-point number, with an exponent, in hexadecimal.
0x = hex notation
1 = integer part of the number
.fffffe = fractional part of the number
p+127 = scientific notation for "times two to the 127th power"
MAXFLOAT is required for UNIX conformance:
MAXFLOAT
[XSI] Value of maximum non-infinite single-precision floating-point number.
0x1.fffffep+127f is precisely that value, represented as a standard C hexadecimal floating-point literal.
The C standard requires that FLT_MAX be defined in <float.h>, and it has the same value ("maximum representable finite floating-point number", per §5.2.4.2.2). FLT_MAX is the more portable choice, as it is required by the language standard.
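A small C sketch tying the pieces together (assumes a typical IEEE-754 float; MAXFLOAT itself is the XSI/macOS spelling, so the portable FLT_MAX is used here):

#include <stdio.h>
#include <float.h>

int main(void) {
    printf("%a\n", (double)FLT_MAX);               /* hexadecimal form, e.g. 0x1.fffffep+127 */
    printf("%.9g\n", FLT_MAX);                     /* decimal form, 3.40282347e+38 */
    printf("%d\n", FLT_MAX == 0x1.fffffep+127f);   /* 1: the literal is exactly FLT_MAX */
    return 0;
}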