Objective-C: division between floats not giving an exact answer

Right now I have a line of code like this:
float x = (([self.machine micSensitivity] - 0.0075f) / 0.00025f);
Where [self.machine micSensitivity] is a float containing the value 0.010000
0.01 - 0.0075 = 0.0025
0.0025 / 0.00025 = 10.0
But in this case, it keeps returning 9.999999
I'm assuming there's some kind of rounding error but I can't seem to find a clean way of fixing it. micSensitivity is incremented/decremented by 0.00025 and that formula is meant to return a clean integer value for the user to reference so I'd rather get the programming right than just adding 0.000000000001.

that formula is meant to return a clean integer value for the user to reference
If that is really important to you, then why do you not multiply all the numbers in this story by 100000 (so that the 0.00025 step becomes 25), coerce to int, and do integer arithmetic?
Or, if you know that the answer is arbitrarily close to an integer, round to that integer and present it.

Floating-point arithmetic is binary, not decimal. It will almost always give rounding errors, and you need to take that into account. "float" has about six to seven decimal digits of precision; "double" has about 15. By using float you throw away nine digits of precision for no reason.
Now think: What do you want to display? What do you want to display if the result of your calculation is 9.999999999? What would you want to display if the result is 9.538105712?

None of the numbers in your question, except 10.0, can be exactly represented in a float or a double on iOS. If you want to do float math with those numbers, you will have rounding errors.
You can round your result to the nearest integer easily enough:
float x = rintf((self.machine.micSensitivity - 0.0075f) / 0.00025f);
Or you can just multiply all your numbers, including the allowed values of micSensitivity, by 4000 (which is 1/0.00025), and thus work entirely with integers.
Or you can change the allowed values of micSensitivity so that its increment is a fraction whose denominator is a power of 2. For example, if you use an increment of 0.000244140625 (which is 2^-12), and change 0.0075 to 0.00732421875 (which is 30 * 2^-12), you should get exact results, as long as your micSensitivity is within the range ±4096 (since 4096 is 2^12 and a float has 24 bits of significand).

The code you have posted is correct and functioning properly. This is a known side effect of using floating point arithmetic. See the wiki on floating point accuracy problems for a dull explanation as to why.
There are several ways to work around the problem depending on what you need to use the number for.
If you need to compare two floats, then most everything works OK: less than and greater than do what you would expect. The only trouble is testing if two floats are equal.
// If x and y are within a very small number of each other, treat them as equal.
if (fabs(x - y) < verySmallNumber) { // verySmallNumber is usually called epsilon.
    // x and y are equal (or at least close enough)
}
If you want to print a float, then you can specify a precision to round to.
// Get a string of x rounded to five decimal places.
NSString *xAsAString = [NSString stringWithFormat:@"%.5f", x];
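For reference, the same two workarounds can be written in plain C. This is only a sketch: nearly_equal and format_rounded are hypothetical names, and the right epsilon depends on the magnitude of the values being compared.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 when x and y differ by less than epsilon. */
int nearly_equal(float x, float y, float epsilon) {
    return fabsf(x - y) < epsilon;
}

/* Hypothetical helper: writes x rounded to five decimal places into buf and
   returns the number of characters snprintf wanted to write. */
int format_rounded(char *buf, size_t size, float x) {
    return snprintf(buf, size, "%.5f", x);
}
```

Applied to the value from the question, nearly_equal(x, 10.0f, 1e-4f) is true, and format_rounded produces the text 10.00000.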

9.999999... (repeating) is equal to 10. Here is the proof:
Let x = 9.999999..., then 10x = 99.999999..., so 10x - x = 9x = 90, and therefore x = 10.


Getting Value of a Double Variable as Infinity

var num = Math.pow(2.0, 5.0)
num = Math.pow(5.0, num)
num = Math.pow(2.0, num)
When I run the above code I am getting the value as Infinity. I can't understand what's happening. Is it crossing the limit of the double variable? Also, Math.pow() method will always return a double.
I have to run the above code and need the value of num variable. How can I solve it?
Actually, I have to compare two numbers after the calculation of exponential programmatically to determine which one is higher?
Like: (2^(5^(2^5))) or (5^(2^(5^2))). So, which one is higher?
Thanks in advance.
Infinity is a legal value for double and what you'll get whenever your result sufficiently exceeds the maximum number representable as a double (~1.8*10^308). This isn't the only "special" value: there are also negative infinity, negative zero, NaN, subnormal numbers.
There are many sources on how floating point numbers work, you can start here: http://floating-point-gui.de/
I have to run the above code and need the value of num variable. How can I solve it?
When you run this code, the value of num is infinity. There's nothing to solve there.
Actually, I have to compare two numbers after the calculation of exponential programmatically to determine which one is higher?
Like: (2^(5^(2^5))) or (5^(2^(5^2))). So, which one is higher?
You can't use double for that if the numbers are too large, period. Try calculating and comparing their logarithms or logarithms of logarithms.

Converting int to double screws up the decimal point

In the debug window, when I input this command:
po 1912/10.0
The output is 191.19999999999999.
What I really want to get back is 191.2.
Why is this happening, and how can I convert an int into a double with precision?
From What Every Programmer Should Know About Floating-Point Arithmetic:
Why don’t my numbers, like 0.1 + 0.2 add up to a nice round 0.3, and instead I get a weird result like 0.30000000000000004?
Because internally, computers use a format (binary floating-point) that cannot accurately represent a number like 0.1, 0.2 or 0.3 at all.
When the code is compiled or interpreted, your “0.1” is already rounded to the nearest number in that format, which results in a small rounding error even before the calculation happens.
This is why programmers say you should only ever store money as an integer. For example int cents = 1995; rather than float dollars = 19.95.
If your app doesn't need to be 100% precise (for example, if you're calculating screen coordinates or translucency or a color) just format your float rounded to 1 or 2 decimal places:
double someValue = 1912/10.0;
NSLog(@"2 decimals: %.2f", someValue);
NSLog(@"0 decimals: %.0f", someValue);
This code will output:
2 decimals: 191.20
0 decimals: 191
That's normal for a floating point number. Double is obviously just an extended precision floating point number. If you want to keep the pristine decimal digits, then don't allow any float/double conversion. Instead store the result as a scaled integer (in your case 1912) and place the decimal manually.
Let me try to explain this another way. When you express a number with a fractional part as a float or double, precision is most often lost. There's no way around that. If you store 1912 and 10 as floating-point values and then divide the first stored value by the second, the result will NEVER be exactly 191.2. That's just the way floating point numbers work. If you look at the number in a debugger you'll see something like 191.19999999999999, as you describe. This is itself a rounded display: the stored value is the binary fraction closest to 191.2, and its exact decimal expansion is finite but much longer than the 17 significant digits shown.
If you're going to use floating point, that's what you'll get. No way around it.
If you really want to get 191.2, then you can't use floating point, at least without doing rounding. Instead, you need to normalize the numbers by just storing the value as 1912 and printing the value with a decimal point to the left of the 2.
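A small C sketch of the scaled-integer idea: the value is kept as a count of tenths (1912 meaning 191.2) and the decimal point is placed only when formatting. The name format_tenths is hypothetical, and the sketch assumes non-negative values.

```c
#include <stdio.h>

/* Hypothetical helper: value_in_tenths = 1912 represents 191.2.
   No floating-point conversion happens, so no digits are lost.
   Returns the number of characters snprintf wanted to write. */
int format_tenths(char *buf, size_t size, unsigned value_in_tenths) {
    return snprintf(buf, size, "%u.%u",
                    value_in_tenths / 10,   /* integer part */
                    value_in_tenths % 10);  /* single fractional digit */
}
```

Calling format_tenths(buf, sizeof buf, 1912) fills buf with the text 191.2.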
There's another brief online description at http://floating-point-gui.de/basic/

Convert.ToSingle() from double in vb.net returns wrong value

Here is my question :
If we have the following value
and we try to convert it to Single we receive the next value:
As you can see this is not what we should receive. Could you please explain how this value gets created and how I can avoid this situation?
Thank you!
As you can see this is not what we should receive.
Why not? I strongly suspect that's the closest Single value to the Double you've given.
From the documentation for Single, having fixed the typo:
All floating-point numbers have a limited number of significant digits, which also determines how accurately a floating-point value approximates a real number. A Single value has up to 7 decimal digits of precision, although a maximum of 9 digits is maintained internally.
Your Double value is 0.5914471 when limited to 7 significant digits - and so is the Single value you're getting. Your original Double value isn't exactly 0.59144706948010461 either... the exact values of the Double and Single values are:
Double: 0.5914470694801046146693579430575482547283172607421875
Single: 0.591447055339813232421875
It's important that you understand a bit about how binary floating point works - see my articles on binary floating point and decimal floating point for more background.
When converting from double to float you're also rounding. The result should be the single-precision number that is closest to the number you are rounding.
That is exactly what you're getting here.
Single-precision floating-point numbers between 0.5 and 1 are of the form n / 2^24 where n is an integer between 2^23 and 2^24.
0.59144706948010461... = 9922835.23723472274456576... / 2^24
so the closest single-precision floating-point number is
9922835 / 2^24 = 0.5914470553...

Precise Multiplication

first post!
I have a problem with the multiplication in a program that I'm writing for a numerical simulation. Basically, I am trying to calculate:
result1 = (a + b)*c
and this loops thousands of times. I need to expand this code to be
result2 = a*c + b*c
However, when I do that I start to get significant errors in my results. I used a high precision library, which did improve things, but the simulation ran horribly slow (the simulation took 50 times longer) and it really isn't a practical solution. From this I realised that it isn't really the precision of the variables a, b, & c that is hurting me, but something in the way the multiplication is done.
My question is: how can I multiply out these brackets in a way so that result1 = result2?
It was a problem with the addition. So I reordered the terms and applied Kahan addition by writing the following piece of code:
double Modelsimple::sum(double a, double b, double c, double d) {
    // reorder the variables in order from smallest to greatest
    double tempone = (a<b?a:b);
    double temptwo = (c<d?c:d);
    double tempthree = (a>b?a:b);
    double tempfour = (c>d?c:d);
    double one = (tempone<temptwo?tempone:temptwo);
    double four = (tempthree>tempfour?tempthree:tempfour);
    double tempfive = (tempone>temptwo?tempone:temptwo);
    double tempsix = (tempthree<tempfour?tempthree:tempfour);
    double two = (tempfive<tempsix?tempfive:tempsix);
    double three = (tempfive>tempsix?tempfive:tempsix);
    // kahan addition
    double total = one;
    double tempsum = one + two;
    double error = (tempsum - one) - two;
    total = tempsum;
    // first iteration complete
    double tempadd = three - error;
    tempsum = total + tempadd;
    error = (tempsum - total) - tempadd;
    total = tempsum;
    // second iteration complete
    tempadd = four - error;
    total += tempadd;
    return total;
}
This gives me results that are as close to the precise answer as makes no difference. However, in a fictitious simulation of a mine collapse, the code with the Kahan addition takes 2 minutes whereas the high precision library takes over a day to finish!!
Thanks to all the help here. This problem was really a pain in the a$$.
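For comparison, the same compensation step can be written as a loop over any number of terms. This is a generic C sketch of Kahan summation, not the poster's code: the names kahan_sum and naive_sum are hypothetical, the inputs are not sorted first, and it assumes the compiler is not run with options such as -ffast-math that may optimize the compensation away.

```c
#include <stddef.h>

/* Kahan (compensated) summation: 'compensation' carries the low-order bits
   that were lost when the previous term was added to 'sum'. */
double kahan_sum(const double *values, size_t count) {
    double sum = 0.0;
    double compensation = 0.0;
    for (size_t i = 0; i < count; i++) {
        double y = values[i] - compensation; /* term corrected by the last error */
        double t = sum + y;                  /* low-order bits of y may be lost here */
        compensation = (t - sum) - y;        /* recover what was lost */
        sum = t;
    }
    return sum;
}

/* Plain left-to-right summation, for comparison. */
double naive_sum(const double *values, size_t count) {
    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum += values[i];
    }
    return sum;
}
```

As a quick check, summing 1.0 followed by ten copies of 1e-16 gives exactly 1.0 with naive_sum, because each small term is below half a unit in the last place of 1.0, while kahan_sum returns a value greater than 1.0.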
I am presuming your numbers are all floating point values.
You should not expect result1 to equal result2 due to limitations in the scale of the numbers and precision in the calculations. Which one to use will depend upon the numbers you are dealing with. More important than result1 and result2 being the same is that they are close enough to the real answer (eg that you would have calculated by hand) for your application.
Imagine that a and b are both very large, and c much less than 1. (a + b) might overflow so that result1 will be incorrect. result2 would not overflow because it scales everything down before adding.
There are also problems with loss of precision when combining numbers of widely differing size, as the smaller number has significant digits reduced when it is converted to use the same exponent as the larger number it is added to.
If you give some specific examples of a, b and c which are causing you issues it might be possible to suggest further improvements.
I have been using the following program as a test, using values for a and b between 10^5 and 10^10, and c around 10^-5, but so far cannot find any differences.
Thinking about the storage of 10^5 vs 10^10, I think it requires about 13 bits vs 33 bits, so you may lose about 20 bits of precision when you add a and b together in result1.
But multiplying them by the same value c essentially reduces the exponent but leaves the significand the same, so it should also lose about 20 bits of precision in result2.
A double significand usually stores 53 bits, so I suspect your results will still retain 33 bits, or about 10 decimal digits of precision.
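#include <stdio.h>
int main(void)
{
    double a = 13584.9484893449;
    double b = 43719848748.3911;
    double c = 0.00001483394434;
    double result1 = (a+b)*c;
    double result2 = a*c + b*c;
    double diff = result1 - result2;
    printf("size of double is %zu\n", sizeof(double));
    // Assumed reconstruction: the original listing is cut off after the line above.
    printf("result1 = %.17g\nresult2 = %.17g\ndiff = %.17g\n", result1, result2, diff);
    return 0;
}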
However I do find a difference if I change all the doubles to float and use c=0.00001083394434. Are you sure that you are using 64 (or 80) bit doubles when doing your calculations?
Usually "loss of precision" in these kinds of calculations can be traced to "poorly formulated problem". For example, when you have to add a series of numbers of very different sizes, you will get a different answer depending on the order in which you sum them. The problem is even more acute when you subtract numbers.
The best approach in your case above is to look not simply at this one line, but at the way that result1 is used in your subsequent calculations. In principle, an engineering calculation should not require precision in the final result beyond about three significant figures; but in many instances (for example, finite element methods) you end up subtracting two numbers that are very similar in magnitude - in which case you may lose many significant figures and get a meaningless answer. Given that you are talking about "materials properties" and "strain", I am suspecting that is actually at the heart of your problem.
One approach is to look at places where you compute a difference, and see if you can reformulate your problem (for example, if you can differentiate your function, you can replace Y(x+dx)-Y(x) with dx * Y'(x)).
There are many excellent references on the subject of numerical stability. It is a complicated subject. Just "throwing more significant figures at the problem" is almost never the best solution.

Succession of numbers passed into a label coming up with weird results

I have 2 buttons that each have a tag number that I pass into this string in which I am just trying to type in either 1,1,1,1,1,1,1,1,1 or 2,2,2,2,2,2,2 or shoot - even, 1,2,2,1,1,1.
Everything works fine until the 8th or 9th time of pressing the button "1", when the label says 111111112. Then if I press the 1 again the label says 1111111168.
Maybe I am going about this totally wrong? Made sense in my head - but now I am just confused. Any help would be amazing, thank you!
-(IBAction)buttonDigitPressed:(id)sender {
    currentNumber = currentNumber * 10 + (float)[sender tag];
    NSLog(@"currentNumber: %.f", currentNumber);
    phoneNumberLabel.text = [NSString stringWithFormat:@"%.f", currentNumber];
}
This image shows me hitting the 1 a bunch of times.. you'd think it would just keep showing 1's all the way across, no?
If this is a string operation, you should not do it using numbers. Possible reasons of the error: running out of range (because float is not big enough), loss of precision (because of the nature of float), etc. What you should do instead is
phoneNumberLabel.text = [phoneNumberLabel.text stringByAppendingFormat:@"%ld", (long)[sender tag]];
(Single precision) floating point numbers have a 24-bit significand (23 stored mantissa bits plus an implicit leading bit), therefore the largest integer up to which every integer can be represented exactly by a float is 2^24 = 16777216.
Beyond that, not every integer can be represented exactly by a float, therefore calculations with numbers having 8 or more digits using float cannot be relied on to be exact.
Double precision floating point numbers can represent every integer up to 2^53 = 9007199254740992 exactly.
A better solution might be to work with integer types (e.g. uint64_t), or with strings as suggested in H2CO3's answer.
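The effect is easy to reproduce in plain C. The sketch below mirrors the currentNumber * 10 + tag step from the question with a float and with a uint64_t; both function names are hypothetical.

```c
#include <stdint.h>

/* Same arithmetic as in the question: the result is rounded to the nearest
   float, which stops being exact once the value exceeds 2^24. */
float append_digit_float(float current, unsigned digit) {
    return current * 10.0f + (float)digit;
}

/* Integer version: exact for any value that fits in 64 bits
   (up to 19 decimal digits). */
uint64_t append_digit_u64(uint64_t current, unsigned digit) {
    return current * 10u + digit;
}
```

Starting from 0 and appending the digit 1 nine times, the float version ends at 111111112 (floats near that value are spaced 8 apart), while the uint64_t version ends at 111111111.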