Why BigFloat.to_s is not precise enough? - gmp

I am not sure if this is a bug. But I've been playing with big and I cant understand why this code works this way:
https://carc.in/#/r/2w96
Code
require "big"
x = BigInt.new(1<<30) * (1<<30) * (1<<30)
puts "BigInt: #{x}"
x = BigFloat.new(1<<30) * (1<<30) * (1<<30)
puts "BigFloat: #{x}"
puts "BigInt from BigFloat: #{x.to_big_i}"
Output
BigInt: 1237940039285380274899124224
BigFloat: 1237940039285380274900000000
BigInt from BigFloat: 1237940039285380274899124224
First I though that BigFloat requires to change BigFloat.default_precision to work with bigger number. But from this code it looks like it only matters when trying to output #to_s value.
Same with precision of BigFloat set to 1024 (https://carc.in/#/r/2w98):
Output
BigInt: 1237940039285380274899124224
BigFloat: 1237940039285380274899124224
BigInt from BigFloat: 1237940039285380274899124224
BigFloat.to_s uses LibGMP.mpf_get_str(nil, out expptr, 10, 0, self). Where GMP is saying:
mpf_get_str (char *str, mp_exp_t *expptr, int base, size_t n_digits, const mpf_t op)
Convert op to a string of digits in base base. The base argument may vary from 2 to 62 or from -2 to -36. Up to n_digits digits will be generated. Trailing zeros are not returned. No more digits than can be accurately represented by op are ever generated. If n_digits is 0 then that accurate maximum number of digits are generated.
Thanks.

In GMP (it applies to all languages not just Crystal), integers (C mpz_t, Crystal BigInt) and floats (C mpf_t, Crystal BigFloat) have separate default precision.
Also, note that using an explicit precision is better than setting a default one, because the default precision might not be reentrant (it depends on a configure-time switch). Also, if someone reads only a part of your code, they may skip the part with setting the default precision and assume a wrong one. Although I do not know the Crystal binding well, I assume that such functionality is exposed somewhere.
The zero parameter passed to mpf_get_str means to guess the value from the precision. I know the number of significant digits is proportional and close to precision / log2(10). Floating point numbers have finite precision. In that case, it was not the mpf_get_str call which made the last digits zero - it was the internal representation that did not keep such data. It looks like your (default) precision is too small to store all the necessary digits.
To summarize, there are two solutions:
Set a global default precision. Although this approach will work, it will require to either change the default precision frequently, or use one in the whole program. Both ways, the approach with the default precision is a form of procrastination which is going to have its vengeance later.
Set a precision on variable basis. This is a better solution than the former. Although it requires more code (1-2 more lines per variable initialization), it is going to pay back later. For example, in a space object tracking system, the physics calculations have to be super-precise, but other systems could use lower precision numbers for speed and memory saving.
I am still unsure what made the conversion BigFloat --> BigInt yield the missing digits.

Related

How to handle precision problems of floating point numbers?

I am using Firebird 3.0.4 (both in Windows and Linux) and I have the following procedure that clearly demonstrates my problem with floating point numbers, and that also demonstrates a possible workaround:
create or alter procedure test_float returns (res double precision,
res1 double precision,
res2 double precision)
as
declare variable z1 double precision;
declare variable z2 double precision;
declare variable z3 double precision;
begin
z1=15;
z2=1.1;
z3=0.49;
res=z1*z2*z3; /* one expects res to be 8.085, but internally, inside the procedure
it is represented as 8.084999999999.
The procedure-internal representation is repaired when then
res is sent to the output of the procedure, but the procedure-internal
representation (which is worng) impacts the further calculations */
res1=round(res, 2);
res2=round(round(res, 8), 2);
suspend;
end
On can see the result of the procedure with:
select proc.res, proc.res1, proc.res2
from test_float proc
The result is
RES RES1 RES2
8,085 8,08 8,09
But one can expect that RES2 should be 8.09.
One can clearly see that the internal representation of the res contains 8.0849999 (e.g. one can assign res to the exception message and then raise this exception), it is repaired during output but it leads to the failed calculations when such variable is used in the further calculations.
RES2 demonstrates the repair: I can always apply ROUND(..., 8) to repair the internal representation. I am ready to go with this solution, but my question is - is it acceptable workaround (when the outer ROUND is with strictly less than 5 decimal places) or is there better workaround.
All my tests pass with this workaround, but the feeling is bad.
Of course, I know the minimum that every programmer should know about floats (there is article about that) and I know that one should not use double for business calculations.
This is an inherent problem with calculating with floating point numbers, and is not specific to Firebird. The problem is that the calculation of 15 * 1.1 * 0.49 using double precision numbers is not exactly 8.085. In fact, if you would do 8.085 - RES, you'd get a value that is (approximately) 1.776356839400251e-015 (although likely your client will just present it as 0.00000000).
You would get similar results in different languages. For example, in Java
DecimalFormat df = new DecimalFormat("#.00");
df.format(15 * 1.1 * 0.49);
will also produce 8.08 for exactly the same reason.
Also, if you would change the order of operations, you would get a different result. For example using 15 * 0.49 * 1.1 would produce 8.085 and round to 8.09, so the actual results would match your expectations.
Given round itself also returns a double precision, this isn't really a good way to handle this in your SQL code, because the rounded value with a higher number of decimals might still yield a value slightly less than what you'd expect because of how floating point numbers work, so the double round may still fail for some numbers even if the presentation in your client 'looks' correct.
If you purely want this for presentation purposes, it might be better to do this in your frontend, but alternatively you could try tricks like adding a small value and casting to decimal, for example something like:
cast(RES + 1e-10 as decimal(18,2))
However this still has rounding issues, because it is impossible to distinguish between values that genuinely are 8.08499999999 (and should be rounded down to 8.08), and values where the result of calculation just happens to be 8.08499999999 in floating point, while it would be 8.085 in exact numerics (and therefor need to be rounded up to 8.09).
In a similar vein, you could try to use double casting to decimal (eg cast(cast(res as decimal(18,3)) as decimal(18,2))), or casting the decimal and then rounding (eg round(cast(res as decimal(18,3)), 2). This would be a bit more consistent than double rounding because the first cast will convert to exact numerics, but again this has similar downside as mentioned above.
Although you don't want to hear this answer, if you want exact numeric semantics, you shouldn't be using floating point types.

Fortran: can you explain this formatting string

I have a Fortran program which I need to modify, so I'm reading it and trying to understand. Can you please explain what the formatting string in the following statement means:
write(*,'(1p,(5x,3(1x,g20.10)))') x(jr,1:ncols)
http://www.fortran.com/F77_std/rjcnf0001-sh-13.html
breifly, you are writing three general (g) format floats per line. Each float has a total field width of 20 characters and 10 places to the right of the decimal. Large magnitude numbers are in exponential form.
The 1xs are simply added spaces (which could as well have been accomplished by increasing the field width ie, g21.10 since the numbers are right justified. The 5x puts an additional 5 spaces at the beginning of each line.
The somewhat tricky thing here is tha lead 1p which is a scale factor. It causes the mantissa of all exponential form numbers produced by the following g format to be multiplied by 10, and the exponent changed accordingly, ie instead of the default,
g17.10 -> b0.1234567890E+12
you get:
1p,g17.10 -> b1.2345678900E+11
b denotes a blank in the output. Be sure to allow room for a - in your field width count...
for completeness in the case of scale greater than one the number of decimal places is reduced (preserving the total precision) ie,
3p,g17.10 -> b123.45678900E+09 ! note only 8 digits after the decimal
that is 1p buys you a digit of precision over the default, but you don't get any more. Negative scales cost you precision, preserving the 10 digits:
-7p,g17.10 -> b0.0000000123E+19
I should add, the p scale factor edit descriptor does something completely different on input. Read the docs...
I'd like to add slightly to George's answer. Unfortunately this is a very nasty (IMO) part of Fortran. In general, bear in mind that a Fortran format specification is automatically repeated as long as there are values remaining in the input/output list, so it isn't necessary to provide formats for every value to be processed.
Scale factors
In the output, all floating point values following kP are multiplied by 10k. Fields containing exponents (E) have their exponent reduced by k, unless the exponent format is fixed by using EN (engineering) or ES (scientific) descriptors. Scaling does not apply to G editing, unless the value is such that E editing is applied. Thus, there is a difference between (1P,G20.10) and (1P,F20.10).
Grouping
A format like n() repeats the descriptors within parentheses n times before proceeding.

How can I use SYNCSORT to format a Packed Decimal field with a specifc sign value?

I want to use SYNCSORT to force all Packed Decimal fields to a negative sign value. The critical requirement is the 2nd nibble must be Hex 'D'. I have a method that works but it seems much too complex. In keeping with the KISS principle, I'm hoping someone has a better method. Perhaps using a bit mask on the last 4 bits? Here is the code I have come up with. Is there a better way?
*
* This sort logic is intended to force all Packed Decimal amounts to
* have a negative sign with a B'....1101' value (Hex 'xD').
*
SORT FIELDS=COPY
OUTFIL FILES=1,
INCLUDE=(8,1,BI,NE,B'....1..1',OR, * POSITIVE PACKED DECIMAL
8,1,BI,EQ,B'....1111'), * UNSIGNED PACKED DECIMAL
OUTREC=(1:1,7, * INCLUDING +0
8:(-1,MUL,8,1,PD),PD,LENGTH=1,
9:9,72)
OUTFIL FILES=2,
INCLUDE=(8,1,BI,EQ,B'....1..1',AND, * NEGATIVE PACKED DECIMAL
8,1,BI,NE,B'....1111'), * NOT UNSIGNED PACKED DECIMAL
OUTREC=(1:1,7, * INCLUDING -0
8:(+1,MUL,8,1,PD),PD,LENGTH=1,
9:9,72)
In the code that processes the VSAM file, can you change the read logic to GET with KEY GTEQ and check for < 0 on the result instead of doing a specific keyed read?
If you did that, you could accept all three negative packed values xA, xB and xD.
Have you considered writing an E15 user exit? The E15 user exit lets you
manipulate records as they are input to the sort process. In this case you would have a
REXX, COBOL or other LE compatible language subroutine patch the packed decimal sign field as it is input to the sort process. No need to split into multiple files to be merged later on.
Here is a link to example JCL
for invoking an E15 exit from DFSORT (same JCL for SYNCSORT). Chapter 4 of this reference
describes how to develop User Exit routines, again this is a DFSORT manual but I believe SyncSort is
fully compatible in this respect. Writing a user exit is no different than writing any other subroutine - get the linkage right and the rest is easy.
This is a very general outline, but I hope it helps.
Okay, it took some digging but NEALB's suggestion to seek help on MVSFORUMS.COM paid off... here is the final result. The OUTREC logic used with SORT/MERGE replaces OUTFIL and takes advantage of new capabilities (IFTHEN, WHEN and OVERLAY) in Syncsort 1.3 that I didn't realize existed. It pays to have current documentation available!
*
* This MERGE logic is intended to assert that the Packed Decimal
* field has a negative sign with a B'....1101' value (Hex X'.D').
*
*
MERGE FIELDS=(27,5.4,BI,A),EQUALS
SUM FIELDS=NONE
OUTREC IFTHEN=(WHEN=(32,1,BI,NE,B'....1..1',OR,
32,1,BI,EQ,B'....1111'),
OVERLAY=(32:(-1,MUL,32,1,PD),PD,LENGTH=1)),
IFTHEN=(WHEN=(32,1,BI,EQ,B'....1..1',AND,
32,1,BI,NE,B'....1111'),
OVERLAY=(32:(+1,MUL,32,1,PD),PD,LENGTH=1))
Looking at the last byte of a packed field is possible. You want positive/unsigned to negative, so if it is greater than -1, subtract it from zero.
From a short-lived Answer by MikeC, it is now known that the data contains non-preferred signs (that is, it can contain A through F in the low-order half-byte, whereas a preferred sign would be C (positive) or D (negative). F is unsigned, treated as positive.
This is tested with DFSORT. It should work with SyncSORT. Turns out that DFSORT can understand a negative packed-decimal zero, but it will not create a negative packed-decimal zero (it will allow a zoned-decimal negative zero to be created from a negative zero packed-decimal).
The idea is that a non-preferred sign is valid and will be accurately signed for input to a decimal machine instruction, but the result will always be a preferred sign, and will be correct. So by adding zero first, the field gets turned into a preferred sign and then the test for -1 will work as expected. With data in the sign-nybble for packed-decimal fields, SORT has some specific and documented behaviours, which just don't happen to help here.
Since there is only one value to deal with to become the negative zero, X'0C', after the normalisation of signs already done, there is a simple test and replacement with a constant of X'0D' for the negative zero. Since the negative zero will not work, the second test is changed from the original minus one to zero.
With non-preferred signs in the data:
SORT FIELDS=COPY
INREC IFTHEN=(WHEN=INIT,
OVERLAY=(32:+0,ADD,32,1,PD,TO=PD,LENGTH=1)),
IFTHEN=(WHEN=(32,1,CH,EQ,X'0C'),
OVERLAY=(32:X'0D')),
IFTHEN=(WHEN=(32,1,PD,GT,0),
OVERLAY=(32:+0,SUB,32,1,PD,TO=PD,LENGTH=1))
With preferred signs in the data:
SORT FIELDS=COPY
INREC IFTHEN=(WHEN=(32,1,CH,EQ,X'0C'),
OVERLAY=(32:X'0D')),
IFTHEN=(WHEN=(32,1,PD,GT,0),
OVERLAY=(32:+0,SUB,32,1,PD,TO=PD,LENGTH=1))
Note: If non-preferred signs are stuffed through a COBOL program not using compiler option NUMPROC(NOPFD) then results will be "interesting".

Is there a practical limit to the size of bit masks?

There's a common way to store multiple values in one variable, by using a bitmask. For example, if a user has read, write and execute privileges on an item, that can be converted to a single number by saying read = 4 (2^2), write = 2 (2^1), execute = 1 (2^0) and then add them together to get 7.
I use this technique in several web applications, where I'd usually store the variable into a field and give it a type of MEDIUMINT or whatever, depending on the number of different values.
What I'm interested in, is whether or not there is a practical limit to the number of values you can store like this? For example, if the number was over 64, you couldn't use (64 bit) integers any more. If this was the case, what would you use? How would it affect your program logic (ie: could you still use bitwise comparisons)?
I know that once you start getting really large sets of values, a different method would be the optimal solution, but I'm interested in the boundaries of this method.
Off the top of my head, I'd write a set_bit and get_bit function that could take an array of bytes and a bit offset in the array, and use some bit-twiddling to set/get the appropriate bit in the array. Something like this (in C, but hopefully you get the idea):
// sets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// result is 0 on success, non-zero on failure (offset out-of-bounds)
int set_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//set the right bit
bytes[offset >> 3] |= (1 << (offset & 0x7));
return 0; //success
}
//gets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// returns (-1) on error, 0 if bit is "off", positive number if "on"
int get_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//get the right bit
return (bytes[offset >> 3] & (1 << (offset & 0x7));
}
I've used bit masks in filesystem code where the bit mask is many times bigger than a machine word. think of it like an "array of booleans";
(journalling masks in flash memory if you want to know)
many compilers know how to do this for you. Adda bit of OO code to have types that operate senibly and then your code starts looking like it's intent, not some bit-banging.
My 2 cents.
With a 64-bit integer, you can store values up to 2^64-1, 64 is only 2^6. So yes, there is a limit, but if you need more than 64-its worth of flags, I'd be very interested to know what they were all doing :)
How many states so you need to potentially think about? If you have 64 potential states, the number of combinations they can exist in is the full size of a 64-bit integer.
If you need to worry about 128 flags, then a pair of bit vectors would suffice (2^64 * 2).
Addition: in Programming Pearls, there is an extended discussion of using a bit array of length 10^7, implemented in integers (for holding used 800 numbers) - it's very fast, and very appropriate for the task described in that chapter.
Some languages ( I believe perl does, not sure ) permit bitwise arithmetic on strings. Giving you a much greater effective range. ( (strlen * 8bit chars ) combinations )
However, I wouldn't use a single value for superimposition of more than one /type/ of data. The basic r/w/x triplet of 3-bit ints would probably be the upper "practical" limit, not for space efficiency reasons, but for practical development reasons.
( Php uses this system to control its error-messages, and I have already found that its a bit over-the-top when you have to define values where php's constants are not resident and you have to generate the integer by hand, and to be honest, if chmod didn't support the 'ugo+rwx' style syntax I'd never want to use it because i can never remember the magic numbers )
The instant you have to crack open a constants table to debug code you know you've gone too far.
Old thread, but it's worth mentioning that there are cases requiring bloated bit masks, e.g., molecular fingerprints, which are often generated as 1024-bit arrays which we have packed in 32 bigint fields (SQL Server not supporting UInt32). Bit wise operations work fine - until your table starts to grow and you realize the sluggishness of separate function calls. The binary data type would work, were it not for T-SQL's ban on bitwise operators having two binary operands.
For example .NET uses array of integers as an internal storage for their BitArray class.
Practically there's no other way around.
That being said, in SQL you will need more than one column (or use the BLOBS) to store all the states.
You tagged this question SQL, so I think you need to consult with the documentation for your database to find the size of an integer. Then subtract one bit for the sign, just to be safe.
Edit: Your comment says you're using MySQL. The documentation for MySQL 5.0 Numeric Types states that the maximum size of a NUMERIC is 64 or 65 digits. That's 212 bits for 64 digits.
Remember that your language of choice has to be able to work with those digits, so you may be limited to a 64-bit integer anyway.

Comparing IEEE floats and doubles for equality

What is the best method for comparing IEEE floats and doubles for equality? I have heard of several methods, but I wanted to see what the community thought.
The best approach I think is to compare ULPs.
bool is_nan(float f)
{
return (*reinterpret_cast<unsigned __int32*>(&f) & 0x7f800000) == 0x7f800000 && (*reinterpret_cast<unsigned __int32*>(&f) & 0x007fffff) != 0;
}
bool is_finite(float f)
{
return (*reinterpret_cast<unsigned __int32*>(&f) & 0x7f800000) != 0x7f800000;
}
// if this symbol is defined, NaNs are never equal to anything (as is normal in IEEE floating point)
// if this symbol is not defined, NaNs are hugely different from regular numbers, but might be equal to each other
#define UNEQUAL_NANS 1
// if this symbol is defined, infinites are never equal to finite numbers (as they're unimaginably greater)
// if this symbol is not defined, infinities are 1 ULP away from +/- FLT_MAX
#define INFINITE_INFINITIES 1
// test whether two IEEE floats are within a specified number of representable values of each other
// This depends on the fact that IEEE floats are properly ordered when treated as signed magnitude integers
bool equal_float(float lhs, float rhs, unsigned __int32 max_ulp_difference)
{
#ifdef UNEQUAL_NANS
if(is_nan(lhs) || is_nan(rhs))
{
return false;
}
#endif
#ifdef INFINITE_INFINITIES
if((is_finite(lhs) && !is_finite(rhs)) || (!is_finite(lhs) && is_finite(rhs)))
{
return false;
}
#endif
signed __int32 left(*reinterpret_cast<signed __int32*>(&lhs));
// transform signed magnitude ints into 2s complement signed ints
if(left < 0)
{
left = 0x80000000 - left;
}
signed __int32 right(*reinterpret_cast<signed __int32*>(&rhs));
// transform signed magnitude ints into 2s complement signed ints
if(right < 0)
{
right = 0x80000000 - right;
}
if(static_cast<unsigned __int32>(std::abs(left - right)) <= max_ulp_difference)
{
return true;
}
return false;
}
A similar technique can be used for doubles. The trick is to convert the floats so that they're ordered (as if integers) and then just see how different they are.
I have no idea why this damn thing is screwing up my underscores. Edit: Oh, perhaps that is just an artefact of the preview. That's OK then.
The current version I am using is this
bool is_equals(float A, float B,
float maxRelativeError, float maxAbsoluteError)
{
if (fabs(A - B) < maxAbsoluteError)
return true;
float relativeError;
if (fabs(B) > fabs(A))
relativeError = fabs((A - B) / B);
else
relativeError = fabs((A - B) / A);
if (relativeError <= maxRelativeError)
return true;
return false;
}
This seems to take care of most problems by combining relative and absolute error tolerance. Is the ULP approach better? If so, why?
#DrPizza: I am no performance guru but I would expect fixed point operations to be quicker than floating point operations (in most cases).
It rather depends on what you are doing with them. A fixed-point type with the same range as an IEEE float would be many many times slower (and many times larger).
Things suitable for floats:
3D graphics, physics/engineering, simulation, climate simulation....
In numerical software you often want to test whether two floating point numbers are exactly equal. LAPACK is full of examples for such cases. Sure, the most common case is where you want to test whether a floating point number equals "Zero", "One", "Two", "Half". If anyone is interested I can pick some algorithms and go more into detail.
Also in BLAS you often want to check whether a floating point number is exactly Zero or One. For example, the routine dgemv can compute operations of the form
y = beta*y + alpha*A*x
y = beta*y + alpha*A^T*x
y = beta*y + alpha*A^H*x
So if beta equals One you have an "plus assignment" and for beta equals Zero a "simple assignment". So you certainly can cut the computational cost if you give these (common) cases a special treatment.
Sure, you could design the BLAS routines in such a way that you can avoid exact comparisons (e.g. using some flags). However, the LAPACK is full of examples where it is not possible.
P.S.:
There are certainly many cases where you don't want check for "is exactly equal". For many people this even might be the only case they ever have to deal with. All I want to point out is that there are other cases too.
Although LAPACK is written in Fortran the logic is the same if you are using other programming languages for numerical software.
Oh dear lord please don't interpret the float bits as ints unless you're running on a P6 or earlier.
Even if it causes it to copy from vector registers to integer registers via memory, and even if it stalls the pipeline, it's the best way to do it that I've come across, insofar as it provides the most robust comparisons even in the face of floating point errors.
i.e. it is a price worth paying.
This seems to take care of most problems by combining relative and absolute error tolerance. Is the ULP approach better? If so, why?
ULPs are a direct measure of the "distance" between two floating point numbers. This means that they don't require you to conjure up the relative and absolute error values, nor do you have to make sure to get those values "about right". With ULPs, you can express directly how close you want the numbers to be, and the same threshold works just as well for small values as for large ones.
If you have floating point errors you have even more problems than this. Although I guess that is up to personal perspective.
Even if we do the numeric analysis to minimize accumulation of error, we can't eliminate it and we can be left with results that ought to be identical (if we were calculating with reals) but differ (because we cannot calculate with reals).
If you are looking for two floats to be equal, then they should be identically equal in my opinion. If you are facing a floating point rounding problem, perhaps a fixed point representation would suit your problem better.
If you are looking for two floats to be equal, then they should be identically equal in my opinion. If you are facing a floating point rounding problem, perhaps a fixed point representation would suit your problem better.
Perhaps we cannot afford the loss of range or performance that such an approach would inflict.
#DrPizza: I am no performance guru but I would expect fixed point operations to be quicker than floating point operations (in most cases).
#Craig H: Sure. I'm totally okay with it printing that. If a or b store money then they should be represented in fixed point. I'm struggling to think of a real world example where such logic ought to be allied to floats. Things suitable for floats:
weights
ranks
distances
real world values (like from a ADC)
For all these things, either you much then numbers and simply present the results to the user for human interpretation, or you make a comparative statement (even if such a statement is, "this thing is within 0.001 of this other thing"). A comparative statement like mine is only useful in the context of the algorithm: the "within 0.001" part depends on what physical question you're asking. That my 0.02. Or should I say 2/100ths?
It rather depends on what you are
doing with them. A fixed-point type
with the same range as an IEEE float
would be many many times slower (and
many times larger).
Okay, but if I want a infinitesimally small bit-resolution then it's back to my original point: == and != have no meaning in the context of such a problem.
An int lets me express ~10^9 values (regardless of the range) which seems like enough for any situation where I would care about two of them being equal. And if that's not enough, use a 64-bit OS and you've got about 10^19 distinct values.
I can express values a range of 0 to 10^200 (for example) in an int, it is just the bit-resolution that suffers (resolution would be greater than 1, but, again, no application has that sort of range as well as that sort of resolution).
To summarize, I think in all cases one either is representing a continuum of values, in which case != and == are irrelevant, or one is representing a fixed set of values, which can be mapped to an int (or a another fixed-precision type).
An int lets me express ~10^9 values
(regardless of the range) which seems
like enough for any situation where I
would care about two of them being
equal. And if that's not enough, use a
64-bit OS and you've got about 10^19
distinct values.
I have actually hit that limit... I was trying to juggle times in ps and time in clock cycles in a simulation where you easily hit 10^10 cycles. No matter what I did I very quickly overflowed the puny range of 64-bit integers... 10^19 is not as much as you think it is, gimme 128 bits computing now!
Floats allowed me to get a solution to the mathematical issues, as the values overflowed with lots zeros at the low end. So you basically had a decimal point floating aronud in the number with no loss of precision (I could like with the more limited distinct number of values allowed in the mantissa of a float compared to a 64-bit int, but desperately needed th range!).
And then things converted back to integers to compare etc.
Annoying, and in the end I scrapped the entire attempt and just relied on floats and < and > to get the work done. Not perfect, but works for the use case envisioned.
If you are looking for two floats to be equal, then they should be identically equal in my opinion. If you are facing a floating point rounding problem, perhaps a fixed point representation would suit your problem better.
Perhaps I should explain the problem better. In C++, the following code:
#include <iostream>
using namespace std;
int main()
{
float a = 1.0;
float b = 0.0;
for(int i=0;i<10;++i)
{
b+=0.1;
}
if(a != b)
{
cout << "Something is wrong" << endl;
}
return 1;
}
prints the phrase "Something is wrong". Are you saying that it should?
Oh dear lord please don't interpret the float bits as ints unless you're running on a P6 or earlier.
it's the best way to do it that I've come across, insofar as it provides the most robust comparisons even in the face of floating point errors.
If you have floating point errors you have even more problems than this. Although I guess that is up to personal perspective.