how the tert command works for real numbers - dm-script

I like very much the tert() command in DigitalMicrograph but there is something I dont understand in it. Consider the test script:
image test:= realimage("",4,100,1);
number value1 = 1;
number value2 = 0.1;
if(value2==0.1) result("\nvalue2 really equals 0.1");
test.setPixel(5,0,value1);
test.setPixel(10,0,value2);
image mask = imageclone(test);
mask = 0;
mask = tert(test==value1,1,mask);
mask = tert(test==value2,1,mask);
mask.showimage()
The script finds the pixel where the "test" array equals to value1 but doesnot find this for value2. It seems the tert command understands the condition (test==value) only when the "value" is an integer number. Otherwise, it considers that the equivalence is not EXACT. That is strange because the number Value2 was implicitely (I assume) defined as a real number and then assigned to the real array. How DigitalMicrograph decides whether the value is integer/real/double?

What you have observed here is a typical problem when comparing floating point values without allowing a small but non-zero tolerance for the limited precision of floating-point representations. In your specific test script, the problem arises because Number values in DMS are always double precision, so you are effectively comparing 4-byte float values in the test image with the 8-byte float values stored in the Number objects. Your script will correctly find both values if you change the first line to allocate the test image with 8-byte floating-point values, as follows:
image test:= realimage("",8,100,1);
On the other hand, a more robust solution is to use a small non-zero tolerance when comparing floating point values for 'equality'. Specifically, if you change the test lines of your script that invoke the tert function as follows, then it will also correctly find both values:
mask = tert(Abs((test-value1)/value1)<1e-7,1,mask);
mask = tert(Abs((test-value2)/value2)<1e-7,1,mask);

Related

How to handle precision problems of floating point numbers?

I am using Firebird 3.0.4 (both in Windows and Linux) and I have the following procedure that clearly demonstrates my problem with floating point numbers, and that also demonstrates a possible workaround:
create or alter procedure test_float returns (res double precision,
res1 double precision,
res2 double precision)
as
declare variable z1 double precision;
declare variable z2 double precision;
declare variable z3 double precision;
begin
z1=15;
z2=1.1;
z3=0.49;
res=z1*z2*z3; /* one expects res to be 8.085, but internally, inside the procedure
it is represented as 8.084999999999.
The procedure-internal representation is repaired when then
res is sent to the output of the procedure, but the procedure-internal
representation (which is worng) impacts the further calculations */
res1=round(res, 2);
res2=round(round(res, 8), 2);
suspend;
end
On can see the result of the procedure with:
select proc.res, proc.res1, proc.res2
from test_float proc
The result is
RES RES1 RES2
8,085 8,08 8,09
But one can expect that RES2 should be 8.09.
One can clearly see that the internal representation of the res contains 8.0849999 (e.g. one can assign res to the exception message and then raise this exception), it is repaired during output but it leads to the failed calculations when such variable is used in the further calculations.
RES2 demonstrates the repair: I can always apply ROUND(..., 8) to repair the internal representation. I am ready to go with this solution, but my question is - is it acceptable workaround (when the outer ROUND is with strictly less than 5 decimal places) or is there better workaround.
All my tests pass with this workaround, but the feeling is bad.
Of course, I know the minimum that every programmer should know about floats (there is article about that) and I know that one should not use double for business calculations.
This is an inherent problem with calculating with floating point numbers, and is not specific to Firebird. The problem is that the calculation of 15 * 1.1 * 0.49 using double precision numbers is not exactly 8.085. In fact, if you would do 8.085 - RES, you'd get a value that is (approximately) 1.776356839400251e-015 (although likely your client will just present it as 0.00000000).
You would get similar results in different languages. For example, in Java
DecimalFormat df = new DecimalFormat("#.00");
df.format(15 * 1.1 * 0.49);
will also produce 8.08 for exactly the same reason.
Also, if you would change the order of operations, you would get a different result. For example using 15 * 0.49 * 1.1 would produce 8.085 and round to 8.09, so the actual results would match your expectations.
Given round itself also returns a double precision, this isn't really a good way to handle this in your SQL code, because the rounded value with a higher number of decimals might still yield a value slightly less than what you'd expect because of how floating point numbers work, so the double round may still fail for some numbers even if the presentation in your client 'looks' correct.
If you purely want this for presentation purposes, it might be better to do this in your frontend, but alternatively you could try tricks like adding a small value and casting to decimal, for example something like:
cast(RES + 1e-10 as decimal(18,2))
However this still has rounding issues, because it is impossible to distinguish between values that genuinely are 8.08499999999 (and should be rounded down to 8.08), and values where the result of calculation just happens to be 8.08499999999 in floating point, while it would be 8.085 in exact numerics (and therefor need to be rounded up to 8.09).
In a similar vein, you could try to use double casting to decimal (eg cast(cast(res as decimal(18,3)) as decimal(18,2))), or casting the decimal and then rounding (eg round(cast(res as decimal(18,3)), 2). This would be a bit more consistent than double rounding because the first cast will convert to exact numerics, but again this has similar downside as mentioned above.
Although you don't want to hear this answer, if you want exact numeric semantics, you shouldn't be using floating point types.

Getting Value of a Double Variable as Infinity

Code:
var num = Math.pow(2.0, 5.0)
num = Math.pow(5.0, num)
num = Math.pow(2.0, num)
print(num)
Output:
Infinity
When I run the above code I am getting the value as Infinity. I can't understand what's happening. Is it crossing the limit of the double variable? Also, Math.pow() method will always return a double.
I have to run the above code and need the value of num variable. How can I solve it?
Actually, I have to compare two numbers after the calculation of exponential programmatically to determine which one is higher?
Like: (2^(5^(2^5))) or (5^(2^(5^2))). So, which one is higher?
Thanks in advance.
Infinity is a legal value for double and what you'll get whenever your result sufficiently exceeds the maximum number representable as a double (~1.8*10^308). This isn't the only "special" value: there are also negative infinity, negative zero, NaN, subnormal numbers.
There are many sources on how floating point numbers work, you can start here: http://floating-point-gui.de/
I have to run the above code and need the value of num variable. How can I solve it?
When you run this code, the value of num is infinity. There's nothing to solve there.
Actually, I have to compare two numbers after the calculation of exponential programmatically to determine which one is higher?
Like: (2^(5^(2^5))) or (5^(2^(5^2))). So, which one is higher?
You can't use double for that if the numbers are too large, period. Try calculating and comparing their logarithms or logarithms of logarithms.

Why BigFloat.to_s is not precise enough?

I am not sure if this is a bug. But I've been playing with big and I cant understand why this code works this way:
https://carc.in/#/r/2w96
Code
require "big"
x = BigInt.new(1<<30) * (1<<30) * (1<<30)
puts "BigInt: #{x}"
x = BigFloat.new(1<<30) * (1<<30) * (1<<30)
puts "BigFloat: #{x}"
puts "BigInt from BigFloat: #{x.to_big_i}"
Output
BigInt: 1237940039285380274899124224
BigFloat: 1237940039285380274900000000
BigInt from BigFloat: 1237940039285380274899124224
First I though that BigFloat requires to change BigFloat.default_precision to work with bigger number. But from this code it looks like it only matters when trying to output #to_s value.
Same with precision of BigFloat set to 1024 (https://carc.in/#/r/2w98):
Output
BigInt: 1237940039285380274899124224
BigFloat: 1237940039285380274899124224
BigInt from BigFloat: 1237940039285380274899124224
BigFloat.to_s uses LibGMP.mpf_get_str(nil, out expptr, 10, 0, self). Where GMP is saying:
mpf_get_str (char *str, mp_exp_t *expptr, int base, size_t n_digits, const mpf_t op)
Convert op to a string of digits in base base. The base argument may vary from 2 to 62 or from -2 to -36. Up to n_digits digits will be generated. Trailing zeros are not returned. No more digits than can be accurately represented by op are ever generated. If n_digits is 0 then that accurate maximum number of digits are generated.
Thanks.
In GMP (it applies to all languages not just Crystal), integers (C mpz_t, Crystal BigInt) and floats (C mpf_t, Crystal BigFloat) have separate default precision.
Also, note that using an explicit precision is better than setting a default one, because the default precision might not be reentrant (it depends on a configure-time switch). Also, if someone reads only a part of your code, they may skip the part with setting the default precision and assume a wrong one. Although I do not know the Crystal binding well, I assume that such functionality is exposed somewhere.
The zero parameter passed to mpf_get_str means to guess the value from the precision. I know the number of significant digits is proportional and close to precision / log2(10). Floating point numbers have finite precision. In that case, it was not the mpf_get_str call which made the last digits zero - it was the internal representation that did not keep such data. It looks like your (default) precision is too small to store all the necessary digits.
To summarize, there are two solutions:
Set a global default precision. Although this approach will work, it will require to either change the default precision frequently, or use one in the whole program. Both ways, the approach with the default precision is a form of procrastination which is going to have its vengeance later.
Set a precision on variable basis. This is a better solution than the former. Although it requires more code (1-2 more lines per variable initialization), it is going to pay back later. For example, in a space object tracking system, the physics calculations have to be super-precise, but other systems could use lower precision numbers for speed and memory saving.
I am still unsure what made the conversion BigFloat --> BigInt yield the missing digits.

Objective C, division between floats not giving an exact answer

Right now I have a line of code like this:
float x = (([self.machine micSensitivity] - 0.0075f) / 0.00025f);
Where [self.machine micSensitivity] is a float containing the value 0.010000
So,
0.01 - 0.0075 = 0.0025
0.0025 / 0.00025 = 10.0
But in this case, it keeps returning 9.999999
I'm assuming there's some kind of rounding error but I can't seem to find a clean way of fixing it. micSensitivity is incremented/decremented by 0.00025 and that formula is meant to return a clean integer value for the user to reference so I'd rather get the programming right than just adding 0.000000000001.
Thanks.
that formula is meant to return a clean integer value for the user to reference
If that is really important to you, then why do you not multiply all the numbers in this story by 10000, coerce to int, and do integer arithmetic?
Or, if you know that the answer is arbitrarily close to an integer, round to that integer and present it.
Floating-point arithmetic is binary, not decimal. It will almost always give rounding errors. You need to take that into account. "float" has about six digit precision. "double" has about 15 digits precision. You throw away nine digits precision for no reason.
Now think: What do you want to display? What do you want to display if the result of your calculation is 9.999999999? What would you want to display if the result is 9.538105712?
None of the numbers in your question, except 10.0, can be exactly represented in a float or a double on iOS. If you want to do float math with those numbers, you will have rounding errors.
You can round your result to the nearest integer easily enough:
float x = rintf((self.machine.micSensitivity - 0.0075f) / 0.00025f);
Or you can just multiply all your numbers, including the allowed values of micSensitivity, by 4000 (which is 1/0.00025), and thus work entirely with integers.
Or you can change the allowed values of micSensitivity so that its increment is a fraction whose denominator is a power of 2. For example, if you use an increment of 0.000244140625 (which is 2-12), and change 0.0075 to 0.00732421875 (which is 30 * 2-12), you should get exact results, as long as your micSensitivity is within the range ±4096 (since 4096 is 212 and a float has 24 bits of significand).
The code you have posted is correct and functioning properly. This is a known side effect of using floating point arithmetic. See the wiki on floating point accuracy problems for a dull explanation as to why.
There are several ways to work around the problem depending on what you need to use the number for.
If you need to compare two floats, then most everything works OK: less than and greater than do what you would expect. The only trouble is testing if two floats are equal.
// If x and y are within a very small number from each other then they are equal.
if (fabs(x - y) < verySmallNumber) { // verySmallNumber is usually called epsilon.
// x and y are equal (or at least close enough)
}
If you want to print a float, then you can specify a precision to round to.
// Get a string of the x rounded to five digits of precision.
NSString *xAsAString = [NSString stringWithFormat:#"%.5f", x];
9.999999 is equal 10. there is prove:
9.999999 = x then 10x = 99.999999 then 10x-x = 9x = 90 then x = 10

Why do we do unsigned right shift or signed right shift? [duplicate]

I understand what the unsigned right shift operator ">>>" in Java does, but why do we need it, and why do we not need a corresponding unsigned left shift operator?
The >>> operator lets you treat int and long as 32- and 64-bit unsigned integral types, which are missing from the Java language.
This is useful when you shift something that does not represent a numeric value. For example, you could represent a black and white bit map image using 32-bit ints, where each int encodes 32 pixels on the screen. If you need to scroll the image to the right, you would prefer the bits on the left of an int to become zeros, so that you could easily put the bits from the adjacent ints:
int shiftBy = 3;
int[] imageRow = ...
int shiftCarry = 0;
// The last shiftBy bits are set to 1, the remaining ones are zero
int mask = (1 << shiftBy)-1;
for (int i = 0 ; i != imageRow.length ; i++) {
// Cut out the shiftBits bits on the right
int nextCarry = imageRow & mask;
// Do the shift, and move in the carry into the freed upper bits
imageRow[i] = (imageRow[i] >>> shiftBy) | (carry << (32-shiftBy));
// Prepare the carry for the next iteration of the loop
carry = nextCarry;
}
The code above does not pay attention to the content of the upper three bits, because >>> operator makes them
There is no corresponding << operator because left-shift operations on signed and unsigned data types are identical.
>>> is also the safe and efficient way of finding the rounded mean of two (large) integers:
int mid = (low + high) >>> 1;
If integers high and low are close to the the largest machine integer, the above will be correct but
int mid = (low + high) / 2;
can get a wrong result because of overflow.
Here's an example use, fixing a bug in a naive binary search.
Basically this has to do with sign (numberic shifts) or unsigned shifts (normally pixel related stuff).
Since the left shift, doesn't deal with the sign bit anyhow, it's the same thing (<<< and <<)...
Either way I have yet to meet anyone that needed to use the >>>, but I'm sure they are out there doing amazing things.
As you have just seen, the >> operator automatically fills the
high-order bit with its previous contents each time a shift occurs.
This preserves the sign of the value. However, sometimes this is
undesirable. For example, if you are shifting something that does not
represent a numeric value, you may not want sign extension to take
place. This situation is common when you are working with pixel-based
values and graphics. In these cases you will generally want to shift a
zero into the high-order bit no matter what its initial value was.
This is known as an unsigned shift. To accomplish this, you will use
java’s unsigned, shift-right operator,>>>, which always shifts zeros
into the high-order bit.
Further reading:
http://henkelmann.eu/2011/02/01/java_the_unsigned_right_shift_operator
http://www.java-samples.com/showtutorial.php?tutorialid=60
The signed right-shift operator is useful if one has an int that represents a number and one wishes to divide it by a power of two, rounding toward negative infinity. This can be nice when doing things like scaling coordinates for display; not only is it faster than division, but coordinates which differ by the scale factor before scaling will differ by one pixel afterward. If instead of using shifting one uses division, that won't work. When scaling by a factor of two, for example, -1 and +1 differ by two, and should thus differ by one afterward, but -1/2=0 and 1/2=0. If instead one uses signed right-shift, things work out nicely: -1>>1=-1 and 1>>1=0, properly yielding values one pixel apart.
The unsigned operator is useful either in cases where either the input is expected to have exactly one bit set and one will want the result to do so as well, or in cases where one will be using a loop to output all the bits in a word and wants it to terminate cleanly. For example:
void processBitsLsbFirst(int n, BitProcessor whatever)
{
while(n != 0)
{
whatever.processBit(n & 1);
n >>>= 1;
}
}
If the code were to use a signed right-shift operation and were passed a negative value, it would output 1's indefinitely. With the unsigned-right-shift operator, however, the most significant bit ends up being interpreted just like any other.
The unsigned right-shift operator may also be useful when a computation would, arithmetically, yield a positive number between 0 and 4,294,967,295 and one wishes to divide that number by a power of two. For example, when computing the sum of two int values which are known to be positive, one may use (n1+n2)>>>1 without having to promote the operands to long. Also, if one wishes to divide a positive int value by something like pi without using floating-point math, one may compute ((value*5468522205L) >>> 34) [(1L<<34)/pi is 5468522204.61, which rounded up yields 5468522205]. For dividends over 1686629712, the computation of value*5468522205L would yield a "negative" value, but since the arithmetically-correct value is known to be positive, using the unsigned right-shift would allow the correct positive number to be used.
A normal right shift >> of a negative number will keep it negative. I.e. the sign bit will be retained.
An unsigned right shift >>> will shift the sign bit too, replacing it with a zero bit.
There is no need to have the equivalent left shift because there is only one sign bit and it is the leftmost bit so it only interferes when shifting right.
Essentially, the difference is that one preserves the sign bit, the other shifts in zeros to replace the sign bit.
For positive numbers they act identically.
For an example of using both >> and >>> see BigInteger shiftRight.
In the Java domain most typical applications the way to avoid overflows is to use casting or Big Integer, such as int to long in the previous examples.
int hiint = 2147483647;
System.out.println("mean hiint+hiint/2 = " + ( (((long)hiint+(long)hiint)))/2);
System.out.println("mean hiint*2/2 = " + ( (((long)hiint*(long)2)))/2);
BigInteger bhiint = BigInteger.valueOf(2147483647);
System.out.println("mean bhiint+bhiint/2 = " + (bhiint.add(bhiint).divide(BigInteger.valueOf(2))));