If the size of a float is 4 bytes then shouldn't it be able to hold digits from 8,388,607 to -8,388,608 or somewhere around there because I probably calculated it wrong.
Why does f display the extra 15 because the value of f (0.1) is still between 8,388,607 to -8,388,608 right?
int main(int argc, const char * argv[])
{
@autoreleasepool {
float f = .1;
printf("%lu", sizeof(float));
printf("%.10f", f);
}
return 0;
}
2012-08-28 20:53:38.537 prog[841:403] 4
2012-08-28 20:53:38.539 prog[841:403] 0.1000000015
The values -8,388,608 ... 8,388,607 lead me to believe that you think floats use two's complement, which they don't. In any case, the range you have indicates 24 bits, not the 32 that you'd get from four bytes.
Floats in C use IEEE754 representation, which basically has three parts:
the sign.
the exponent (sort of a scale).
the fraction (actual digits of the number).
You basically get a certain amount of precision (such as 7 decimal digits) and the exponent dictates whether you use those for a number like 0.000000001234567 or 123456700000.
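To make the three parts concrete, here is a minimal sketch that pulls them out of a float. It assumes float is the 32-bit IEEE 754 binary32 format (1 sign bit, 8 exponent bits, 23 fraction bits), which is true on most platforms but not guaranteed by C. The names `float_fields` and `split_float` are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of an IEEE 754 binary32 value. */
struct float_fields {
    uint32_t sign;      /* 1 bit: 0 = positive, 1 = negative */
    uint32_t exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;  /* 23 bits; the leading 1 is implicit, not stored */
};

/* Assumption: sizeof(float) == 4 and the layout is IEEE 754 binary32. */
struct float_fields split_float(float f)
{
    uint32_t bits;
    struct float_fields out;

    memcpy(&bits, &f, sizeof bits);       /* copy the raw bit pattern */
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}
```

For 0.8125f this gives sign 0, exponent 126 (2^-1 once the bias of 127 is removed) and fraction 0x500000 (binary .101), that is, 1.625 * 2^-1.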
The reason you get those extra digits at the end of your 0.1 is because that number cannot be represented exactly in IEEE754. See this answer for a treatise explaining why that is the case.
Numbers are only representable exactly if they can be built by adding inverse powers of two (like 1/2, 1/16, 1/65536 and so on) within the number of bits of precision (ie, number of bits in the fraction), subject to scaling.
So, for example, a number like 0.5 is okay since it's 1/2. Similarly 0.8125 is okay since that can be built from 1/2, 1/4 and 1/16.
There is no way (at least within 23 bits of precision) that you can build 0.1 from inverse powers of two, so it gives you the nearest match.
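A quick way to see which values survive the trip into a float is to measure what storing them changes. This is only a sketch: `float_storage_error` is a hypothetical helper, and its argument is a double, so 0.1 here already means "the double nearest to 0.1".

```c
/* Returns how much the value changes when it is stored in a float.
 * A result of exactly 0.0 means the value fits in a float unchanged. */
double float_storage_error(double x)
{
    float f = (float)x;        /* round to the nearest float */
    return (double)f - x;      /* widen back and compare */
}
```

With this, 0.5 and 0.8125 give exactly 0.0, while 0.1 gives roughly 1.49e-9, which is the trailing ...15 in the 0.1000000015 printed by the question's program.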
When I calculate log(8) / log(2) I get 3 as one would expect:
?log(8)/log(2)
3
However, if I take the int of this calculation like this the result is 2 and thus wrong:
?int(log(8)/log(2))
2
How and why does this happen?
Likely because the actual number returned is of type double. Because floats and doubles cannot accurately represent most base 10 rational numbers the number returned is something like 2.99999999999. Then when you apply int() the .999999999 is truncated.
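The same effect can be sketched in C. The question's language is not C, and whether log(8)/log(2) comes out just below 3 depends on the math library, so this example feeds in the value 2.9999999999999996 directly. Both function names are made up for illustration.

```c
/* Conversion to int simply drops the fractional part. */
int truncate_to_int(double x)
{
    return (int)x;
}

/* Round to the nearest integer instead, by adding half a unit
 * (away from zero) before the conversion. */
int round_to_int(double x)
{
    return (int)(x < 0 ? x - 0.5 : x + 0.5);
}
```

truncate_to_int(2.9999999999999996) gives 2, while round_to_int of the same value gives 3.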
How a floating-point number works: one bit is dedicated to the sign, a few bits store an exponent, and the rest hold the actual fraction. This leads to numbers being represented in a form similar to 1.45 * 10^4, except that the base is two instead of 10.
I am programming a fixed-point speech enhancement algorithm on a 16-bit processor. At some point I need to do 32-bit fractional multiplication. I have read other posts about doing 32-bit multiplication byte by byte and I see why this works for Q0.31 formats. But I use different Q formats with varying number of fractional bits.
So I have found out that for fractional bits less than 16, this works:
(low*low >> N) + low*high + high*low + (high*high << N)
where N is the number of fractional bits. I have read that the low*low result should be unsigned as well as the low bytes themselves. In general this gives exactly the result I want in any Q format with less than 16 fractional bits.
Now it gets tricky when the fractional bits are more than 16. I have tried out several numbers of shifts, different shifts for low*low and high*high I have tried to put it on paper, but I can't figure it out.
I know it may be very simple but the whole idea eludes me and I would be grateful for some comments or guidelines!
It's the same formula. For N > 16, the shifts just mean you throw out a whole 16-bit word which would have over- or underflowed. low*low >> N means just shift N-16 bit in the high word of the 32-bit result of the multiply and add to the low word of the result. high * high << N means just use the low word of the multiply result shifted left N-16 and add to the high word of the result.
There are a few ideas at play.
First, multiplication of 2 shorter integers to produce a longer product. Consider unsigned multiplication of 2 32-bit integers via multiplications of their 16-bit "halves", each of which produces a 32-bit product and the total product is 64-bit:
a * b = (a_hi * 2^16 + a_lo) * (b_hi * 2^16 + b_lo) =
a_hi * b_hi * 2^32 + (a_hi * b_lo + a_lo * b_hi) * 2^16 + a_lo * b_lo.
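As a sketch of that decomposition in C (the function name `umul32_by_halves` is made up here): on a 16-bit target each partial product would be one 16x16->32 multiply, and the 64-bit type below only serves to hold the sum.

```c
#include <stdint.h>

/* Unsigned 32x32 -> 64 multiply built from four 16x16 -> 32 products. */
uint64_t umul32_by_halves(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xFFFFu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xFFFFu;

    uint64_t hi_hi = (uint64_t)a_hi * b_hi;                          /* weight 2^32 */
    uint64_t cross = (uint64_t)a_hi * b_lo + (uint64_t)a_lo * b_hi;  /* weight 2^16 */
    uint64_t lo_lo = (uint64_t)a_lo * b_lo;                          /* weight 2^0  */

    return (hi_hi << 32) + (cross << 16) + lo_lo;
}
```

The result should match (uint64_t)a * b for every pair of inputs, which is an easy way to check an implementation of the same idea on the real target.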
Now, if you need a signed multiplication, you can construct it from unsigned multiplication (e.g. from the above).
Supposing a < 0 and b >= 0, a *signed b must be equal to
2^64 - ((-a) *unsigned b), where
-a = 2^32 - a (because this is 2's complement)
IOW,
a *signed b =
2^64 - ((2^32 - a) *unsigned b) =
2^64 + (a *unsigned b) - (b * 2^32), where 2^64 can be discarded since we're using 64 bits only.
In exactly the same way you can calculate a *signed b for a >= 0 and b < 0 and must get a symmetric result:
(a *unsigned b) - (a * 2^32)
You can similarly show that for a < 0 and b < 0 the signed multiplication can be built on top of the unsigned multiplication this way:
(a *unsigned b) - ((a + b) * 2^32)
So, you multiply a and b as unsigned first, then if a < 0, you subtract b from the top 32 bits of the product and if b < 0, you subtract a from the top 32 bits of the product, done.
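That recipe can be sketched as follows. The name `smul32_via_unsigned` is hypothetical, and the final conversion from uint64_t to int64_t assumes the usual two's complement wrap-around, which is what common compilers do but is implementation-defined in older C standards.

```c
#include <stdint.h>

/* Signed 32x32 -> 64 multiply built on top of an unsigned multiply. */
int64_t smul32_via_unsigned(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;          /* same bits, read as unsigned */
    uint32_t ub = (uint32_t)b;
    uint64_t p  = (uint64_t)ua * ub;    /* unsigned product */

    if (a < 0) p -= (uint64_t)ub << 32; /* subtract b from the top 32 bits */
    if (b < 0) p -= (uint64_t)ua << 32; /* subtract a from the top 32 bits */

    return (int64_t)p;                  /* assumption: two's complement wrap */
}
```

The subtractions are done modulo 2^64 on unsigned values, so they cannot trigger signed-overflow undefined behavior.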
Now that we can multiply 32-bit signed integers and get 64-bit signed products, we can finally turn to the fractional stuff.
Suppose now that out of those 32 bits in a and b, N bits are used for the fractional part. That means that if you look at a and b as plain integers, they are going to be 2^N times greater than what they really represent, e.g. 1.0 is going to look like 2^N (or 1 << N).
So, if you multiply two such integers the product is going to be 2^N * 2^N = 2^(2*N) times greater than what it should represent, e.g. 1.0 * 1.0 is going to look like 2^(2*N) (or 1 << (2*N)). IOW, plain integer multiplication is going to double the number of fractional bits. If you want the product to have the same number of fractional bits as in the multiplicands, what do you do? You divide the product by 2^N (or shift it arithmetically N positions right). Simple.
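Put together, a fractional multiply for any number of fractional bits might look like the sketch below. It uses a native 64-bit intermediate for brevity (on the 16-bit processor that product would be assembled from 16-bit pieces), it assumes the compiler shifts negative values arithmetically, and `q_mul` and `frac_bits` are names invented for this example.

```c
#include <stdint.h>

/* Multiply two fixed-point numbers that both have frac_bits fractional bits.
 * The 64-bit product has 2*frac_bits fractional bits; shifting right by
 * frac_bits brings it back to the input format.
 * Assumptions: 0 <= frac_bits <= 31, >> on negative values is arithmetic,
 * and the true result fits in 32 bits. */
int32_t q_mul(int32_t a, int32_t b, unsigned frac_bits)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (int32_t)(wide >> frac_bits);
}
```

For example, with 20 fractional bits 0.5 is 1 << 19, and q_mul(1 << 19, 1 << 19, 20) yields 1 << 18, which is 0.25.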
A few words of caution, just in case...
In C (and C++) you cannot legally shift a variable left or right by a number of bits equal to or greater than the variable's width. The code will compile, but not work as you may expect it to. So, if you want to shift a 32-bit variable, you can shift it by 0 through 31 positions left or right (31 is the max, not 32).
If you shift signed integers left, you cannot overflow the result legally. All signed overflows result in undefined behavior. So, you may want to stick to unsigned.
Right shifts of negative signed integers are implementation-specific. They can either do an arithmetic shift or a logical shift. Which one, it depends on the compiler. So, if you need one of the two you need to either ensure that your compiler just supports it directly or implement it in some other ways.
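One possible "other way" to get an arithmetic right shift without relying on the compiler's treatment of negative values is to only ever shift non-negative numbers. This is a sketch under the assumption of a two's complement representation; `asr32` is a made-up name.

```c
#include <stdint.h>

/* Arithmetic (sign-extending) right shift for 0 <= n <= 31.
 * For negative x, ~x is non-negative, so shifting it is fully defined;
 * complementing the result again gives floor(x / 2^n). */
int32_t asr32(int32_t x, unsigned n)
{
    if (x >= 0)
        return x >> n;
    return ~(~x >> n);
}
```

For instance, asr32(-7, 1) is -4 (rounding toward negative infinity), the same result an arithmetic shift instruction would give.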
I am currently trying to figure out how to multiply two numbers in fixed point representation.
Say my number representation is as follows:
[SIGN][2^0].[2^-1][2^-2]..[2^-14]
In my case, the number 10.01000000000000 = -0.25.
How would I for example do 0.25x0.25 or -0.25x0.25 etc?
Hope you can help!
You should use 2's complement representation instead of a separate sign bit. It's much easier to do maths on that, no special handling is required. The range is also improved because there's no wasted bit pattern for negative 0. To multiply, just do as normal fixed-point multiplication. The normal Q2.14 format will store value x/2^14 for the bit pattern of x, therefore if we have bit patterns A and B then
(A/2^14) x (B/2^14) = (A x B / 2^14) / 2^14
So you just need to multiply A and B directly then divide the product by 2^14 to get the result back into the form x/2^14 like this
AxB = ((int32_t)A*B) >> 14;
A rounding step is needed to get the nearest value. You can find the way to do it in Q number format#Math operations. The simplest way to round to nearest is just add back the bit that was last shifted out (i.e. the first fractional bit) like this
AxB = (int32_t)A*B;
AxB = (AxB >> 14) + ((AxB >> 13) & 1);
You might also want to read these
Fixed-point arithmetic.
Emulated Fixed Point Division/Multiplication
Fixed point math in c#?
With 2 bits you can represent the integer range of [-2, 1]. So using Q2.14 format, -0.25 would be stored as 11.11000000000000. Using 1 sign bit you can only represent -1, 0, 1, and it makes calculations more complex because you need to split the sign bit then combine it back at the end.
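Here is a small sketch of the Q2.14 case with the rounding step included. The helper names are invented for this example, `q14_from_double` simply truncates (good enough for values like 0.25 that are exact), and the shifts assume the compiler shifts negative values arithmetically.

```c
#include <stdint.h>

/* Convert a real value in [-2, 2) to Q2.14 by scaling with 2^14. */
int16_t q14_from_double(double x)
{
    return (int16_t)(x * 16384.0);
}

/* Q2.14 multiply: form the 32-bit product, drop 14 fractional bits,
 * and add back the last bit shifted out to round to nearest.
 * Assumption: >> on a negative int32_t is an arithmetic shift. */
int16_t q14_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)((p >> 14) + ((p >> 13) & 1));
}
```

With these, -0.25 is stored as the bit pattern 0xF000 (11.11000000000000), and q14_mul of -0.25 and 0.25 gives -1024, which is -0.0625 in Q2.14.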
Multiply into a larger sized variable, and then right shift by the number of bits of fixed point precision.
Here's a simple example in C:
int a = 0.25 * (1 << 16);
int b = -0.25 * (1 << 16);
int c = ((long long)a * b) >> 16; /* widen first so the product cannot overflow int */
printf("%.2f * %.2f = %.2f\n", a / 65536.0, b / 65536.0 , c / 65536.0);
You basically multiply everything by a constant to bring the fractional parts up into the integer range, then multiply the two factors, then (optionally) divide by one of the constants to return the product to the standard range for use in future calculations. It's like multiplying prices expressed in fractional dollars by 100 and then working in cents (i.e. $1.95 * 100 cents/dollar = 195 cents).
Be careful not to overflow the range of the variable you are multiplying into. Your constant might need to be smaller to avoid overflow, like using 1 << 8 instead of 1 << 16 in the example above.
I am trying to create an array of values. These values should be "2.4,1.6,.8,0". I am subtracting .8 at every step.
This is how I am doing it (code snippet):
float mean = [[_scalesDictionary objectForKey:@"M1"] floatValue]; //3.2f
float sD = [[_scalesDictionary objectForKey:@"SD1"] floatValue]; //0.8f
nextRegion = mean;
hitWall = NO;
NSMutableArray *minusRegion = [NSMutableArray array];
while (!hitWall) {
nextRegion -= sD;
if(nextRegion<0.0f){
nextRegion = 0.0f;
hitWall = YES;
}
[minusRegion addObject:[NSNumber numberWithFloat:nextRegion]];
}
I am getting this output:
minusRegion = (
"2.4",
"1.6",
"0.8000001",
"1.192093e-07",
0
)
I do not want the incredibly small number between .8 and 0. Is there a standard way to truncate these values?
Neither 3.2 nor .8 is exactly representable as a 32-bit float. The representable number closest to 3.2 is 3.2000000476837158203125 (in hexadecimal floating-point, 0x1.99999ap+1). The representable number closest to .8 is 0.800000011920928955078125 (0x1.99999ap-1).
When 0.800000011920928955078125 is subtracted from 3.2000000476837158203125, the exact mathematical result is 2.400000035762786865234375 (0x1.3333338p+1). This result is also not exactly representable as a 32-bit float. (You can see this easily in the hexadecimal floating-point. A 32-bit float has a 24-bit significand. “1.3333338” has one bit in the “1”, 24 bits in the middle six digits, and another bit in the “8”.) So the result is rounded to the nearest 32-bit float, which is 2.400000095367431640625 (0x1.333334p+1).
Subtracting 0.800000011920928955078125 from that yields 1.6000001430511474609375 (0x1.99999cp+0), which is exactly representable. (The “1” is one bit, the five nines are 20 bits, and the “c” has two significant bits. The low two bits in the “c” are trailing zeroes and may be neglected. So there are 23 significant bits.)
Subtracting 0.800000011920928955078125 from that yields 0.800000131130218505859375 (0x1.99999ep-1), which is also exactly representable.
Finally, subtracting 0.800000011920928955078125 from that yields 1.1920928955078125e-07 (0x1p-23).
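The chain of subtractions above can be reproduced with a few lines of C. This is a sketch that assumes float arithmetic is done in IEEE 754 single precision with round-to-nearest (no extended-precision intermediates); `subtract_repeatedly` is a made-up name.

```c
/* Subtract step from start the given number of times, rounding to
 * float after every subtraction, as the loop in the question does. */
float subtract_repeatedly(float start, float step, int times)
{
    float v = start;
    for (int i = 0; i < times; i++)
        v -= step;
    return v;
}
```

Under that assumption, subtract_repeatedly(3.2f, 0.8f, 4) returns 1.1920928955078125e-07 rather than 0, which is the stray value in the question's output.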
The lesson to be learned here is that floating-point does not represent all numbers, and it rounds results to give you the closest numbers it can represent. When writing software to use floating-point arithmetic, you must understand and allow for these rounding operations. One way to allow for this is to use numbers that you know can be represented. Others have suggested using integer arithmetic. Another option is to use mostly values that you know can be represented exactly in floating-point, which includes integers up to 2^24. So you could start with 32 and subtract 8, yielding 24, then 16, then 8, then 0. Those would be the intermediate values you use for loop control and continuing calculations with no error. When you are ready to deliver results, then you could divide by 10, producing numbers near 3.2, 2.4, 1.6, .8, and 0 (exactly). This way, your arithmetic would introduce only one rounding error into each result, instead of accumulating rounding errors from iteration to iteration.
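A minimal sketch of that approach, written in plain C rather than with NSMutableArray: the loop counts in exact integer tenths and divides by 10 only when a value is delivered. The function name and parameters are hypothetical.

```c
/* Fill out[] with start - step, start - 2*step, ... down to 0, where
 * start and step are given in tenths (32 and 8 for 3.2 and 0.8).
 * Returns the number of values written (at most cap).
 * Assumption: step_tenths is positive. */
int fill_from_tenths(int start_tenths, int step_tenths, float *out, int cap)
{
    int n = 0;
    for (int v = start_tenths - step_tenths; v >= 0 && n < cap; v -= step_tenths)
        out[n++] = (float)v / 10.0f;   /* one rounding per delivered value */
    return n;
}
```

Calling fill_from_tenths(32, 8, out, 8) writes four values: the floats nearest 2.4, 1.6 and 0.8, followed by exactly 0.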
You're looking at good old floating-point rounding error. Fortunately, in your case it should be simple to deal with. Just clamp:
if( val < increment ){
val = 0.0;
}
Although, as Eric Postpischil explained below:
Clamping in this way is a bad idea, because sometimes rounding will cause the iteration variable to be slightly less than the increment instead of slightly more, and this clamping will effectively skip an iteration. For example, if the initial value were 3.6f (instead of 3.2f), and the step were .9f (instead of .8f), then the values in each iteration would be slightly below 3.6, 2.7, 1.8, and .9. At that point, clamping converts the value slightly below .9 to zero, and an iteration is skipped.
Therefore it might be necessary to subtract a small amount when doing the comparison.
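One way to sketch such a tolerant comparison is to clamp only when the value has dropped below half a step, so that a value that is a hair under one full step is still kept. This is an illustration, not the original poster's code: `step_down` is a made-up name, the half-step threshold is my own choice, and it assumes the start value is nominally a whole multiple of the step.

```c
/* Step down from start by step, writing each value to out[].
 * A value below half a step is treated as "really zero": it is
 * written as 0 and the loop stops. Returns the count written. */
int step_down(float start, float step, float *out, int cap)
{
    int n = 0;
    float v = start;
    while (n < cap) {
        v -= step;
        if (v < step * 0.5f) {   /* tolerant clamp, not v < step */
            out[n++] = 0.0f;
            break;
        }
        out[n++] = v;
    }
    return n;
}
```

With 3.2f and 0.8f this produces four values ending in 0, and with 3.6f and 0.9f the value near 0.9 is kept instead of being skipped.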
A better option which you should consider is doing your calculations with integers rather than floats, then converting later.
int increment = 8;
int val = 32;
while( val > 0 ){
val -= increment;
float new_float_val = val / 10.0;
};
Another way to do this is to multiply the numbers you get from the subtraction by 10, then convert to an integer, then divide that integer by 10.0.
You can do this easily with the floor function (floorf) like this:
float newValue = floorf(oldValue*10)/10;
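A variant of the same idea, sketched here with rounding to the nearest tenth instead of flooring: a value that ends up slightly below a multiple of 0.1 (say 2.3999999) would be floored to 2.3, while rounding brings it back to 2.4. The name `round_to_tenths` is made up, and the sketch only handles non-negative inputs.

```c
/* Snap a non-negative float to the nearest tenth by going through
 * an integer count of tenths. */
float round_to_tenths(float x)
{
    int tenths = (int)(x * 10.0f + 0.5f);  /* nearest whole number of tenths */
    return (float)tenths / 10.0f;
}
```

Applied to the question's output, round_to_tenths(0.8000001f) gives the float nearest 0.8 and round_to_tenths(1.192093e-07f) gives 0.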
I am learning Objective-C and have completed a simple program that gave an unexpected result. The program is just a multiplication table test: the user inputs the number of iterations (test questions), then inputs answers. After that, the program displays the number of right and wrong answers, the percentage, and an accepted/failed result.
#import <Foundation/Foundation.h>
int main (int argc, const char * argv[])
{
NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
NSLog(@"Welcome to multiplication table test");
int rightAnswers = 0; //the sum of the right answers
int wrongAnswers = 0; //the sum of wrong answers
int combinations; //the number of combinations
NSLog(@"Please, input the number of test combinations");
scanf("%d",&combinations);
for(int i=0; i<combinations; ++i)
{
int firstInt=rand()%8+1;
int secondInt=rand()%8+1;
int result=firstInt*secondInt;
int answer;
NSLog(@"%d*%d=",firstInt,secondInt);
scanf("%d",&answer);
if(answer==result)
{
NSLog(@"Ok");
rightAnswers++;
}
else
{
NSLog(@"Error");
wrongAnswers++;
}
}
int percent=(100/combinations)*rightAnswers;
NSLog(@"Combinations passed: %d",combinations);
NSLog(@"Answered right: %d times",rightAnswers);
NSLog(@"Answered wrong: %d times",wrongAnswers);
NSLog(@"Completed %d percent",percent);
if(percent>=70)NSLog(@"accepted");
else
NSLog(@"failed");
[pool drain];
return 0;
}
Problem (strange result)
When I input 3 iterations and answer them all correctly, I don't get 100% right. I get only 99%. I tried the same calculation on my iPhone calculator.
100 / 3 = 33.3333333... percent for one right answer (the program displays 33%; the digits after the decimal point get cut off)
33.3333333... * 3=100%
Can someone explain where I went wrong? Thanks.
This is a result of integer division. When you perform division between two integer types, the result is automatically rounded towards 0 to form an integer. So, integer division of (100 / 3) gives a result of 33, not 33.33.... When you multiply that by 3, you get 99. To fix this, you can force floating point division by changing 100 to 100.0. The .0 tells the compiler that it should use a floating point type instead of an integer, forcing floating point division. As a result, rounding will not occur after the division. However, 33.33... cannot be represented exactly by binary numbers. Because of this, you may still see incorrect results at times. Since you store the result as an integer, rounding down will still occur after the multiplication, which will make it more obvious. If you want to use an integer type, you should use the round function on the result:
int percent = round((100.0 / combinations) * rightAnswers);
This will cause the number to be rounded to the closest integer before converting it to an integer type. Alternately, you could use a floating point storage type and specify a certain number of decimal places to display:
float percent = (100.0 / combinations) * rightAnswers;
NSLog(@"Completed %.1f percent",percent); // Display result with 1 decimal place
Finally, since floating point math will still cause rounding for numbers that can't be represented in binary, I would suggest multiplying by rightAnswers before dividing by combinations. This will increase the chances that the result is representable. For example, 100/3=33.33... is not representable and will be rounded. If you multiply by 3 first, you get 300/3=100, which is representable and will not be rounded.
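If you would rather stay in integer arithmetic, the multiply-first idea works there too. A minimal sketch in plain C, with made-up function names: the first function reproduces the question's calculation, the second multiplies before dividing and adds half the divisor to round to nearest.

```c
/* The calculation from the question: 100 / total is truncated first. */
int percent_truncating(int right, int total)
{
    return (100 / total) * right;
}

/* Multiply first, then divide; adding total / 2 rounds to nearest.
 * Assumptions: total > 0 and right >= 0. */
int percent_rounded(int right, int total)
{
    return (100 * right + total / 2) / total;
}
```

For 3 right answers out of 3, the first gives 99 and the second gives 100.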
Ints are integers. They can't represent an arbitrary real number like 1/3. Even floating-point numbers, which can represent reals, won't have enough precision to represent an infinitely repeating decimal like 100/3. You'll either need to use an arbitrary-precision library, use a library that includes rationals as a data type, or just store as much precision as you need and round from there (e.g. make your integer unit hundredths-of-a-percent instead of a single percentage point).
You might want to implement some sort of rounding because 33.333... * 3 = 99.99999%. 100/3 is an infinitely repeating decimal, so you need some sort of rounding (maybe at the 3rd decimal place) for the answer to come out correct. I would say if ((int)(num*1000) % 10 >= 5) num += .01, or something along those lines: multiplying by 1000 moves the decimal point 3 places, and then mod returns the 3rd digit (which could be zero). You also might want to round only once at the end, after you sum everything up, to avoid accumulating errors.
EDIT: I didn't realize you were using integers; the numbers at the end threw me off. You might want to use double or float (a float is only accurate to about 7 significant digits, which is OK for what you want).
100/3 is 33. Integer mathematics here.