So one day I was experimenting (like any other good coder does), and I came across this:
>>> 1e308
1e+308
>>> 1e309
inf
What is going on? First of all, 308’s factors are 2, 2, 7, and 11.
Further investigation yields:
>>> 1.7976931348623158075e308 # No, I didn’t copy it incorrectly
1.7976931348623157e+308
>>> 1.79769313486231581e308
inf
So what is going on? There doesn't seem to be any relationship between an absurdly big number and an equally weird number with over 10 decimal places.
Also, all of this was done in the Python REPL console, so other environments might behave differently.
A double-precision floating point number in IEEE-754 has an 11-bit exponent and a 53-bit mantissa (52 stored bits plus an implicit leading bit). The 11-bit exponent covers normal values from about 2**(-1022) up to just under 2**1024, and 2**1023 is roughly 10**308.
You get about 3.32 bits per decimal digit, so a 53-bit mantissa gives you roughly 16 decimal digits of precision (17 digits are enough to round-trip any double).
The largest finite double is about 1.7976931348623157E+308; the longer literal you typed, 1.7976931348623158075E+308, is just close enough to round down to it rather than overflow to infinity.
I recommend https://en.wikipedia.org/wiki/IEEE_754 .
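For reference, these limits can be checked directly from the same Python prompt; a quick look at sys.float_info:

import sys
import math

print(sys.float_info.max)              # 1.7976931348623157e+308, the largest finite double
print(sys.float_info.max_exp)          # 1024 -- finite doubles are strictly below 2**1024
print(math.log10(sys.float_info.max))  # about 308.25, hence the cutoff between 1e308 and 1e309
print(sys.float_info.dig)              # 15 -- decimal digits that always survive a round trip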
println(log(it.toDouble(), 10.0).toInt()+1) // n1
println(log10(it.toDouble()).toInt() + 1) // n2
I had to count the "length" of a number in base n for needs unrelated to this question, and stumbled upon a bug (or rather unexpected behavior): for it == 1000 these two functions give different results.
n1(1000) = 3,
n2(1000) = 4.
Checking values before conversion to int resulted in:
n1_double(1000) = 3.9999999999999996,
n2_double(1000) = 4.0
I understand that some floating-point arithmetic magic is involved, but what is especially weird to me is that for 100, 10000, and other inputs I checked, n1 == n2.
What is special about it == 1000? How do I ensure that log gives me the intended result (4, not 3.99..)? Right now I can't even figure out which cases I need to double-check, since it is not just powers of 10; it is 1000 (and probably some other numbers) specifically.
I looked into the implementations of log() and log10(). log is implemented as
if (base <= 0.0 || base == 1.0) return Double.NaN
return nativeMath.log(x) / nativeMath.log(base) //log() here is a natural logarithm
while log10 is implemented as
return nativeMath.log10(x)
I suspect the division in the first case is the reason for the error, but I can't figure out why it causes an error only in specific cases.
I also found this question:
Python math.log and math.log10 giving different results
But I already know that one is more precise than the other. However, there is no log10 analogue for an arbitrary base n, so I'm curious about the reason WHY it is specifically 1000 that goes wrong.
PS: I understand there are methods of calculating the length of a number without floating-point arithmetic or base-n logarithms, but at this point it is a scientific curiosity.
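For comparison, here is one such float-free way to get the decimal length, sketched in Python rather than Kotlin (digit_count is just an illustrative name):

def digit_count(n: int) -> int:
    # Count decimal digits by repeated integer division; no floating point involved.
    n = abs(n)
    count = 1
    while n >= 10:
        n //= 10
        count += 1
    return count

print(digit_count(1000))  # 4
print(digit_count(999))   # 3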
but I can't figure out why it causes an error only in specific cases.
return nativeMath.log(x) / nativeMath.log(base)
//log() here is a natural logarithm
Consider x = 1000 and nativeMath.log(x). The natural logarithm is not exactly representable. It is near
6.90775527898213_681... (Double answer)
6.90775527898213_705... (closer answer)
Consider base = 10 and nativeMath.log(base). The natural logarithm is not exactly representable. It is near
2.302585092994045_901... (Double)
2.302585092994045_684... (closer answer)
The only exactly correct nativeMath.log(x) for a finite x is when x == 1.0.
The quotient of the division of 6.90775527898213681... / 2.302585092994045901... is not exactly representable. It is near 2.9999999999999995559...
The conversion of the quotient to text is not exact.
So we have 4 computation errors with the system giving us a close (rounded) result instead at each step.
Sometimes these rounding errors cancel out in a way we find acceptable and the value of "3.0" is reported. Sometimes not.
Performed with higher-precision math, it is easy to see that log(1000) came out slightly below its true value while log(10) came out slightly above. These two round-off errors in opposite directions compound in the division, leaving the quotient 1 ULP lower than hoped.
When log(x, 10) is computed for some other power-of-10 x and log(x) happens to land slightly above its true value, I'd expect the quotient to suffer a 1 ULP error less often. Perhaps it works out to roughly 50/50 over all powers of 10.
log10(x) is designed to compute the logarithm in a different fashion, exploiting the fact that the base is 10.0, and it is certainly exact for powers of 10.
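The same effect is easy to reproduce in Python, whose two-argument math.log is likewise computed as a quotient of natural logarithms:

import math

# Quotient of two rounded natural logs: 1 ULP below 3, so truncation drops to 2.
print(math.log(1000) / math.log(10))           # 2.9999999999999996
print(int(math.log(1000) / math.log(10)) + 1)  # 3  (the n1-style result)

# log10 handles the base directly and is exact for powers of ten.
print(math.log10(1000))                        # 3.0
print(int(math.log10(1000)) + 1)               # 4  (the n2-style result)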
According to the documentation, the decimal.Round method uses a round-to-even algorithm, which is not what most applications expect. So I always end up writing a custom function to do the more natural round-half-up algorithm:
public static decimal RoundHalfUp(this decimal d, int decimals)
{
    if (decimals < 0)
    {
        throw new ArgumentException("The decimals must be non-negative",
            "decimals");
    }

    decimal multiplier = (decimal)Math.Pow(10, decimals);
    decimal number = d * multiplier;

    // Shift half a unit away from zero, then truncate: ties such as 2.5
    // round to 3 (and -2.5 to -3) instead of to the nearest even number.
    number += (number >= 0) ? 0.5m : -0.5m;
    return decimal.Truncate(number) / multiplier;
}
Does anybody know the reason behind this framework design decision?
Is there any built-in implementation of the round-half-up algorithm in the framework? Or maybe some unmanaged Windows API?
It could be misleading for beginners who simply write decimal.Round(2.5m, 0) expecting 3 as a result but get 2 instead.
The other answers with reasons why the Banker's algorithm (aka round half to even) is a good choice are quite correct. It does not suffer from negative or positive bias as much as the round half away from zero method over most reasonable distributions.
But the question was why .NET uses banker's rounding as the default, and the answer is that Microsoft followed the IEEE 754 standard. This is also mentioned in MSDN for Math.Round under Remarks.
Also note that .NET supports the alternative method specified by IEEE by providing the MidpointRounding enumeration. They could of course have provided more alternatives for resolving ties, but they chose to just fulfill the IEEE standard.
Probably because it's a better algorithm. Over the course of many roundings, the .5's end up rounding up and down equally often on average. This gives better estimates of the actual results if you are, for instance, adding a bunch of rounded numbers. I would say that even though it isn't what some may expect, it's probably the more correct thing to do.
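To make that averaging argument concrete, here is a rough illustration using Python's decimal module rather than .NET (the two tie-breaking rules are the same ones MidpointRounding offers):

from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

# 0.5, 1.0, 1.5, ..., 100.0 -- half of these sit exactly on a .5 tie.
values = [Decimal(n) / 2 for n in range(1, 201)]
exact = sum(values)

half_up = sum(v.quantize(Decimal("1"), rounding=ROUND_HALF_UP) for v in values)
half_even = sum(v.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN) for v in values)

print(half_up - exact)    # 50.0 -- every tie was pushed upward
print(half_even - exact)  # 0.0  -- ties alternate up and down, so the bias cancels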
While I cannot answer the question of "Why did Microsoft's designers choose this as the default?", I just want to point out that an extra function is unnecessary.
Math.Round allows you to specify a MidpointRounding:
ToEven - When a number is halfway between two others, it is rounded toward the nearest even number.
AwayFromZero - When a number is halfway between two others, it is rounded toward the nearest number that is away from zero.
Decimals are mostly used for money; banker’s rounding is common when working with money. Or you could say:
It is mostly bankers that need the decimal type; therefore it does “banker’s rounding”.
Banker's rounding has the advantage that on average you will get the same result whether you:
round a set of “invoice lines” before adding them up,
or add them up and then round the total
Rounding before adding up saved a lot of work in the days before computers.
(In the UK, when we went decimal, banks would not deal with half pence, but for many years there was still a half-pence coin and shops often had prices ending in a half pence – so lots of rounding)
Use another overload of the Round function, like this:
decimal.Round(2.5m, 0, MidpointRounding.AwayFromZero)
It will output 3. And if you use
decimal.Round(2.5m, 0, MidpointRounding.ToEven)
you will get banker's rounding.
I have a Fortran program which I need to modify, so I'm reading it and trying to understand it. Can you please explain what the format string in the following statement means:
write(*,'(1p,(5x,3(1x,g20.10)))') x(jr,1:ncols)
http://www.fortran.com/F77_std/rjcnf0001-sh-13.html
Briefly, you are writing three general-format (g) floats per line. Each float has a total field width of 20 characters and 10 digits to the right of the decimal point. Large-magnitude numbers are printed in exponential form.
The 1x descriptors are simply added spaces (which could as well have been accomplished by increasing the field width, i.e. g21.10, since the numbers are right-justified). The 5x puts an additional 5 spaces at the beginning of each line.
The somewhat tricky thing here is the leading 1p, which is a scale factor. It causes the mantissa of all exponential-form numbers produced by the following g format to be multiplied by 10, with the exponent adjusted accordingly, i.e. instead of the default,
g17.10 -> b0.1234567890E+12
you get:
1p,g17.10 -> b1.2345678900E+11
b denotes a blank in the output. Be sure to allow room for a - in your field width count...
For completeness: when the scale factor is greater than one, the number of decimal places is reduced (preserving the total precision), i.e.,
3p,g17.10 -> b123.45678900E+09 ! note only 8 digits after the decimal
That is, 1p buys you a digit of precision over the default, but you don't get any more. Negative scale factors cost you precision while still printing 10 digits after the decimal:
-7p,g17.10 -> b0.0000000123E+19
I should add, the p scale factor edit descriptor does something completely different on input. Read the docs...
I'd like to add slightly to George's answer. Unfortunately this is a very nasty (IMO) part of Fortran. In general, bear in mind that a Fortran format specification is automatically repeated as long as there are values remaining in the input/output list, so it isn't necessary to provide formats for every value to be processed.
Scale factors
In the output, all floating point values following kP are multiplied by 10**k. Fields containing exponents (E) have their exponent reduced by k, unless the exponent format is fixed by using EN (engineering) or ES (scientific) descriptors. Scaling does not apply to G editing, unless the value is such that E editing is applied. Thus, there is a difference between (1P,G20.10) and (1P,F20.10).
Grouping
A format like n() repeats the descriptors within parentheses n times before proceeding.
Here is my question:
If we have the following value
0.59144706948010461
and we try to convert it to Single, we receive the following value:
0.591447055
As you can see, this is not what we should receive. Could you please explain how this value gets created and how I can avoid this situation?
Thank you!
As you can see, this is not what we should receive.
Why not? I strongly suspect that's the closest Single value to the Double you've given.
From the documentation for Single, having fixed the typo:
All floating-point numbers have a limited number of significant digits, which also determines how accurately a floating-point value approximates a real number. A Single value has up to 7 decimal digits of precision, although a maximum of 9 digits is maintained internally.
Your Double value is 0.5914471 when limited to 7 significant digits - and so is the Single value you're getting. Your original Double value isn't exactly 0.59144706948010461 either... the exact values of the Double and Single values are:
Double: 0.5914470694801046146693579430575482547283172607421875
Single: 0.591447055339813232421875
It's important that you understand a bit about how binary floating point works - see my articles on binary floating point and decimal floating point for more background.
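Those exact expansions are easy to verify from Python (whose float is the same IEEE double); constructing a Decimal from a float prints the value actually stored:

from decimal import Decimal

# The exact binary value behind the Double in the question.
print(Decimal(0.59144706948010461))
# 0.5914470694801046146693579430575482547283172607421875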
When converting from double to float you're also rounding. The result should be the single-precision number that is closest to the number you are rounding.
That is exactly what you're getting here.
Single-precision floating-point numbers between 0.5 and 1 are of the form n / 2^24, where n is an integer between 2^23 and 2^24.
0.59144706948010461... = 9922835.23723472274456576... / 2^24
so the closest single-precision floating-point number is
9922835 / 2^24 = 0.5914470553...
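The same arithmetic can be checked in Python by forcing a round trip through a 32-bit float with the struct module (just an illustration; nothing here is specific to the language in the question):

import struct

x = 0.59144706948010461

# Round-trip through a 32-bit float to see what a Single actually stores.
single = struct.unpack('f', struct.pack('f', x))[0]
print(single)            # 0.5914470553398132

# The nearest numerator n for n / 2**24, and the value it represents.
print(round(x * 2**24))  # 9922835
print(9922835 / 2**24)   # 0.5914470553398132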
I have broken this problem down to about as simple as I can get it. Feel free to try the same thing and tell me if you get the same error and what solution you might have. I have already tried it on several computers.
float total = 200000.0f + 154196.8f;
NSLog(#"total: %f", total);
The output is:
total: 354196.812500
If anyone has any sort of logical explanation, feel free to share it.
I'd suggest you brush up on your floats
http://www.altdevblogaday.com/2012/05/20/thats-not-normalthe-performance-of-odd-floats/
If you need higher precision, use a double.
Additionally http://randomascii.wordpress.com/2012/03/08/float-precisionfrom-zero-to-100-digits-2/
See What Every Programmer Should Know About Floating-Point Arithmetic for all the deep understanding. The short answer is that all floating point representations have limitations on their precision, and that things that can be expressed in a small number of digits in decimal may not be expressible in a small number of digits in binary (and specifically not in floating point formats).
Note that while double can improve things, it is no panacea. It is quite common to have small rounding errors, even with double. You may easily get 1.99999999 when you expect 2.
Hint:
long double total = 200000.0 + 154196.8;
NSLog(#"total: %Lf", total);
On my machine prints the correct value.
A 32 bit floating point has a 23 bit mantissa, the closest value is 0.5+0.25+0.125.
You should use more bits to get the correct representation.
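For anyone who wants to reproduce the effect outside Objective-C, the same rounding can be shown in Python by pushing the numbers through a 32-bit float with struct (a sketch of the explanation above, not the original program):

import struct

def as_float32(x):
    # Round a Python float (a 64-bit double) to the nearest 32-bit float.
    return struct.unpack('f', struct.pack('f', x))[0]

print(as_float32(154196.8))             # 154196.796875 -- .8 is already not representable
print(as_float32(200000.0 + 154196.8))  # 354196.8125   -- matches the NSLog output

# At this magnitude the spacing between 32-bit floats is 2**-5 = 0.03125, so the
# fractional part must be a multiple of 1/32; the closest is .8125 = 0.5 + 0.25 + 0.0625.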