Integer part bit growth for fixed point numbers of the 0.xyz kind - embedded

First of all we should agree on the definition of the QM.N format. I will follow this resource and its conventions.
For the purposes of this paper the notion of a Q-point for a fixed-point number is introduced.
This labeling convention is as follows:
Q[QI].[QF]
Where QI = # of integer bits & QF = # of fractional bits
For signed integer variable types we will include the sign bit in QI as it does have integer
weight albeit negative in sign.
Based on this convention, if I had to represent the number -0.123 in the format Q1.7 I would write it as: 1.1110001
The theory says that:
When performing an integer multiplication the product is 2*WL bits wide if both the multiplier and
multiplicand are WL bits long. If the integer multiplication is on fixed-point variables, the number of
integer and fractional bits in the product is the sum of the corresponding multiplier and
multiplicand Q-points, as described by the following equations:
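QI_product = QI_multiplier + QI_multiplicand
QF_product = QF_multiplier + QF_multiplicand
(writing QI and QF for the integer and fractional bit counts, as defined above)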
Knowing this is useful because after multiplication we have double the precision, and we need to rescale the output back to our input precision. Knowing where the integer part sits allows us to prevent overflow and to pick the relevant bits out of the long bit string that the multiplication produces.
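For instance, here is a minimal C sketch of that bookkeeping for Q1.7 (q1_7_mul is just an illustrative helper, not the code used for the tests below; it assumes >> on a negative value is an arithmetic shift, which most compilers provide):

#include <stdint.h>
#include <stdio.h>

/* Q1.7 * Q1.7 gives a Q2.14 product in 16 bits; shifting right by 7
   rescales it back to Q1.7, dropping the extra fractional bits. */
static int8_t q1_7_mul(int8_t a, int8_t b)
{
    int16_t product = (int16_t)a * (int16_t)b;  /* Q2.14 intermediate */
    return (int8_t)(product >> 7);              /* back to Q1.7 */
}

int main(void)
{
    int8_t a = 66;  /* +0.515625  = 66/128 */
    int8_t b = 7;   /* +0.0546875 =  7/128 */
    printf("%f\n", q1_7_mul(a, b) / 128.0);  /* ~0.023 after truncation */
    return 0;
}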
However, when performing the multiplication between two Q1.7 numbers of the format 0.xyz, I have noticed that the integer part never grows, allowing me to pick only one bit from the integer part. I have written a piece of code that picks only the fractional part after multiplication, and here are the results.
Test 0
Testing +0.5158*+0.0596
A:real_val:+0.5156 fixed: 66 int: 0 frac: 1000010
B:real_val:+0.0547 fixed: 7 int: 0 frac: 0000111
C: real_val:+0.0282 fixed: 462 int: 00 frac: 00000111001110
Floating multiplication: +0.0307
Test 1
Testing +0.4842*-0.9558
A:real_val:+0.4766 fixed: 61 int: 0 frac: 0111101
B:real_val:-0.9531 fixed: -122 int: 1 frac: 0000110
C: real_val:-0.4542 fixed: -7442 int: 11 frac: 10001011101110
Floating multiplication: -0.4628
Test 2
Testing +0.2812*+0.2433
A:real_val:+0.2734 fixed: 35 int: 0 frac: 0100011
B:real_val:+0.2422 fixed: 31 int: 0 frac: 0011111
C: real_val:+0.0662 fixed: 1085 int: 00 frac: 00010000111101
Floating multiplication: +0.0684
Test 3
Testing -0.7235*-0.9037
A:real_val:-0.7188 fixed: -92 int: 1 frac: 0100100
B:real_val:-0.8984 fixed: -115 int: 1 frac: 0001101
C: real_val:+0.6458 fixed: 10580 int: 00 frac: 10100101010100
Floating multiplication: +0.6538
My question to you is whether I am overlooking anything here, or whether this is normal and expected behaviour for fixed-point numbers. If so, I will be happy with my numbers never overflowing during multiplication.
Basically, what I mean is that after multiplying two Q1.X numbers of the form 0.xyz, the integer part will always be 0 (if the result is positive) or 1111... (if the result is negative).
So my accumulator register will be filled with only 2*X meaningful bits, and I can take only those, plus the sign.

No, the number of bits in the result is still the sum of the bits in the inputs.
Summary:
Signed Q1.31 times signed Q1.31 equals signed Q2.62.
Unsigned Q1.31 times unsigned Q1.31 equals unsigned Q2.62.
Explanation:
Unsigned Q1.n numbers can represent from zero (inclusive) to two (exclusive). If you multiply two such numbers together the range of results is from zero (inclusive) to 4 (exclusive). Just less than four is three point something, and three fits in the two bits above the point.
Signed Q1.n numbers can represent from negative one (inclusive) to one (exclusive). If you multiply two such numbers together the range of results is negative one (exclusive) to one (inclusive). Signed Q1.31 times signed Q1.31 would fit in Q1.62 except for the single case -1.0 times -1.0 equals +1.0, which requires the extra bit above the point.
The equations in your question apply equally in both these cases.
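The exceptional case is easy to demonstrate in C (a small illustrative snippet, assuming int64_t is available):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* -1.0 in Q1.31 is INT32_MIN, i.e. -2^31. */
    int32_t a = INT32_MIN, b = INT32_MIN;

    /* The exact product is +2^62, i.e. +1.0, which needs the second bit
       above the point: it fits Q2.62 in an int64_t, but not Q1.62. */
    int64_t product = (int64_t)a * b;
    printf("%lld\n", (long long)product);  /* 4611686018427387904 = 2^62 */
    return 0;
}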

Related

How does numpy manage to divide float32 by 2**63?

Here Daniel mentions:
... you pick any integer in [0, 2²⁴), and divide it by 2²⁴, then you can recover your original integer by multiplying the result again by 2²⁴. This works with 2²⁴ but not with 2²⁵ or any other larger number.
But when I tried
>>> b = np.divide(1, 2**63, dtype=np.float32)
>>> b*2**63
1.0
It isn't working for 2⁶⁴, but I'm left wondering why it works for all the exponents from 24 to 63, and moreover whether it's unique to numpy.
In the context that passage is in, it is not saying that an integer value cannot be divided by 2²⁵ or 2⁶³ and then multiplied to restore the original value. It is saying that this will not work to create an unbiased distribution of numbers.
The text leaves some things not explicitly stated, but I suspect it is discussing taking a value of integer type, converting it to IEEE-754 single-precision, and then dividing it. This will not work for factors larger than 2²⁴ because the conversion from integer type to IEEE-754 single-precision will have to round the number.
For example, for 2³², all numbers from 0 to 16,777,215 will convert to themselves with no error, and then dividing by 2³² will produce a unique floating-point number for each. But both 16,777,216 and 16,777,217 will convert to 16,777,216, and then dividing by 2³² will produce the same number for them (1/256). All numbers from 2,147,483,520 to 2,147,483,776 will map to 2,147,483,648, which then produces ½, so that is 257 numbers mapping to one floating-point number. But all the numbers from 2,147,483,777 to 2,147,484,031 map to 2,147,483,904. So this one has 255 numbers mapping to it. (The difference is due to the round-to-nearest-ties-to-even rule.) At the high end, the 129 numbers from 4,294,967,168 to 4,294,967,296 map to 4,294,967,296, for which dividing produces 1, which is out of the desired half-open interval, [0, 1).
On the other hand, if we use integers from 0 to 16,777,215 (2²⁴−1), there is no rounding, and each result maps from exactly one starting number and stays within the interval.
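The 2²⁴ cutoff is easy to see in C as well (a small illustration; float has the same 24-bit significand as np.float32):

#include <stdio.h>

int main(void)
{
    /* 16,777,216 = 2^24 is the last point at which every integer converts
       to IEEE-754 single precision without rounding. */
    float a = 16777216.0f;
    float b = (float)16777217;    /* rounds to 16777216.0f */
    printf("%.1f %.1f\n", a, b);  /* prints: 16777216.0 16777216.0 */
    return 0;
}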
Note that “significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old word for the fraction portion of a logarithm. Significands are linear. Mantissas are logarithmic. And the significand of the IEEE-754 single-precision format has 24 bits, not 23. The primary field used to encode the significand has 23 bits, but the exponent field provides another bit.

Min and Max values of Float and Double types in Kotlin

It's simple to find out what the exact min and max values for Int and Long integers are in Kotlin:
Signed 32 bit Integer:
Int.MIN_VALUE // -2147483648
Int.MAX_VALUE // 2147483647
Signed 64 bit Integer:
Long.MIN_VALUE // -9223372036854775808
Long.MAX_VALUE // 9223372036854775807
However, if I try to print the Float or Double types' min and max values, I'll get unbalanced numbers (where both values are expressed using scientific notation).
Signed 32 bit Floating Point Number:
Float.MIN_VALUE // 1.4e-45
Float.MAX_VALUE // 3.4028235e38
Signed 64 bit Floating Point Number:
Double.MIN_VALUE // 4.9e-324
Double.MAX_VALUE // 1.7976931348623157e308
Why are the positive and negative values of the Float and Double types so "unbalanced"?
The conceptual definition of MIN_VALUE is different for integers vs floating-point numbers.
Int.MIN_VALUE is the largest negative value.
Float.MIN_VALUE is the smallest positive value.
In other words, 1.4E-45 is 0.00[40 zeroes]0014, and not a very large negative number. The largest possible negative value is represented by -1 * Float.MAX_VALUE.
Just to add to this discussion: I made the mistake of expecting that Float.MIN_VALUE and Double.MIN_VALUE correlate with Int.MIN_VALUE, in that both should represent the most negative value for that datatype.
For a Float or Double you have, aside from MIN_VALUE and MAX_VALUE, the additional properties NEGATIVE_INFINITY and POSITIVE_INFINITY, which technically could be your largest negative and positive values. I was trying to find a number that would represent the value of a var that had not been used yet. MIN_VALUE didn't work for me because I was dealing with both positive and negative numbers.

32-bit fractional multiplication with cross-multiplication method (no 64-bit intermediate result)

I am programming a fixed-point speech enhancement algorithm on a 16-bit processor. At some point I need to do 32-bit fractional multiplication. I have read other posts about doing 32-bit multiplication byte by byte and I see why this works for Q0.31 formats. But I use different Q formats with varying number of fractional bits.
So I have found out that for fractional bits less than 16, this works:
(low*low >> N) + low*high + high*low + (high*high << N)
where N is the number of fractional bits. I have read that the low*low result should be unsigned as well as the low bytes themselves. In general this gives exactly the result I want in any Q format with less than 16 fractional bits.
Now it gets tricky when the fractional bits are more than 16. I have tried out several numbers of shifts, and different shifts for low*low and high*high; I have tried to put it on paper, but I can't figure it out.
I know it may be very simple but the whole idea eludes me and I would be grateful for some comments or guidelines!
It's the same formula. For N > 16, the shifts just mean you throw out a whole 16-bit word that would have over- or underflowed. low*low >> N means: take the high word of the 32-bit multiply result, shift it right by N-16 bits, and add it to the low word of the result. high*high << N means: take the low word of that multiply result, shift it left by N-16 bits, and add it to the high word of the result.
There are a few ideas at play.
First, multiplication of 2 shorter integers to produce a longer product. Consider unsigned multiplication of 2 32-bit integers via multiplications of their 16-bit "halves", each of which produces a 32-bit product and the total product is 64-bit:
a * b = (a_hi * 2^16 + a_lo) * (b_hi * 2^16 + b_lo) =
a_hi * b_hi * 2^32 + (a_hi * b_lo + a_lo * b_hi) * 2^16 + a_lo * b_lo.
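Sketched in C (umul32_wide is just an illustrative name; a real 16-bit target would keep the partial products in 16- and 32-bit registers rather than a uint64_t, but the decomposition is the same):

#include <stdint.h>

/* Unsigned 32x32 -> 64-bit multiply built from four 16x16 -> 32-bit products. */
static uint64_t umul32_wide(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xFFFFu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xFFFFu;

    uint64_t lo    = (uint64_t)a_lo * b_lo;                          /* a_lo * b_lo  */
    uint64_t cross = (uint64_t)a_hi * b_lo + (uint64_t)a_lo * b_hi;  /* middle terms */
    uint64_t hi    = (uint64_t)a_hi * b_hi;                          /* a_hi * b_hi  */

    return (hi << 32) + (cross << 16) + lo;   /* weights 2^32, 2^16 and 1 */
}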
Now, if you need a signed multiplication, you can construct it from unsigned multiplication (e.g. from the above).
Supposing a < 0 and b >= 0, a *signed b must equal
2^64 - ((-a) *unsigned b), where
-a = 2^32 - a (because this is 2's complement)
IOW,
a *signed b =
2^64 - ((2^32 - a) *unsigned b) =
2^64 + (a *unsigned b) - (b * 2^32), where 2^64 can be discarded since we're using 64 bits only.
In exactly the same way you can calculate a *signed b for a >= 0 and b < 0 and must get a symmetric result:
(a *unsigned b) - (a * 2^32)
You can similarly show that for a < 0 and b < 0 the signed multiplication can be built on top of the unsigned multiplication this way:
(a *unsigned b) - ((a + b) * 2^32)
So, you multiply a and b as unsigned first, then if a < 0, you subtract b from the top 32 bits of the product and if b < 0, you subtract a from the top 32 bits of the product, done.
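In C, on top of the umul32_wide sketch above, the correction looks like this (illustrative; the final conversion back to int64_t relies on two's-complement wrap-around):

/* Signed 32x32 -> 64-bit multiply: multiply as unsigned, then subtract b from
   the top 32 bits if a < 0, and subtract a from the top 32 bits if b < 0. */
static int64_t smul32_wide(int32_t a, int32_t b)
{
    uint64_t p = umul32_wide((uint32_t)a, (uint32_t)b);
    if (a < 0) p -= (uint64_t)(uint32_t)b << 32;
    if (b < 0) p -= (uint64_t)(uint32_t)a << 32;
    return (int64_t)p;  /* the true product of two int32_t always fits in 64 bits */
}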
Now that we can multiply 32-bit signed integers and get 64-bit signed products, we can finally turn to the fractional stuff.
Suppose now that out of those 32 bits in a and b, N bits are used for the fractional part. That means that if you look at a and b as at plain integers, they are going to be 2^N times greater than what they really represent, e.g. 1.0 is going to look like 2^N (or 1 << N).
So, if you multiply two such integers the product is going to be 2^N * 2^N = 2^(2*N) times greater than what it should represent, e.g. 1.0 * 1.0 is going to look like 2^(2*N) (or 1 << (2*N)). IOW, plain integer multiplication is going to double the number of fractional bits. If you want the product to have the same number of fractional bits as in the multiplicands, what do you do? You divide the product by 2^N (or shift it arithmetically N positions right). Simple.
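Putting it together, a fractional multiply that keeps N fractional bits might look like this sketch (qn_mul is an illustrative name, built on the smul32_wide sketch above; it assumes the shift on a negative 64-bit value is arithmetic):

static int32_t qn_mul(int32_t a, int32_t b, int n)
{
    int64_t product = smul32_wide(a, b);  /* 2*n fractional bits */
    return (int32_t)(product >> n);       /* rescale back to n fractional bits */
}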
A few words of caution, just in case...
In C (and C++) you cannot legally shift a variable left or right by the same or greater number of bits contained in the variable. The code will compile, but not work as you may expect it to. So, if you want to shift a 32-bit variable, you can shift it by 0 through 31 positions left or right (31 is the max, not 32).
If you shift signed integers left, you cannot overflow the result legally. All signed overflows result in undefined behavior. So, you may want to stick to unsigned.
Right shifts of negative signed integers are implementation-defined. They can either do an arithmetic shift or a logical shift. Which one depends on the compiler. So, if you need one of the two, you need to either ensure that your compiler supports it directly or implement it in some other way.

Imprecision on floating point decimals?

If the size of a float is 4 bytes, then shouldn't it be able to hold values from 8,388,607 down to -8,388,608, or somewhere around there? (I may have calculated that wrong.)
Why does f display the extra "15", given that the value of f (0.1) is still between -8,388,608 and 8,388,607?
int main(int argc, const char * argv[])
{
    @autoreleasepool {
        float f = .1;
        printf("%lu", sizeof(float));
        printf("%.10f", f);
    }
    return 0;
}
2012-08-28 20:53:38.537 prog[841:403] 4
2012-08-28 20:53:38.539 prog[841:403] 0.1000000015
The values -8,388,608 ... 8,388,607 lead me to believe that you think floats use two's complement, which they don't. In any case, the range you have indicates 24 bits, not the 32 that you'd get from four bytes.
Floats in C use IEEE754 representation, which basically has three parts:
the sign.
the exponent (sort of a scale).
the fraction (actual digits of the number).
You basically get a certain amount of precision (such as 7 decimal digits) and the exponent dictates whether you use those for a number like 0.000000001234567 or 123456700000.
The reason you get those extra digits at the end of your 0.1 is because that number cannot be represented exactly in IEEE754. See this answer for a treatise explaining why that is the case.
Numbers are only representable exactly if they can be built by adding inverse powers of two (like 1/2, 1/16, 1/65536 and so on) within the number of bits of precision (ie, number of bits in the fraction), subject to scaling.
So, for example, a number like 0.5 is okay since it's 1/2. Similarly 0.8125 is okay since that can be built from 1/2, 1/4 and 1/16.
There is no way (at least within 23 bits of precision) that you can build 0.1 from inverse powers of two, so it gives you the nearest match.
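For instance (a small C illustration of both cases):

#include <stdio.h>

int main(void)
{
    /* 0.8125 = 1/2 + 1/4 + 1/16 is exactly representable as a float;
       0.1 is not, so the nearest float gets stored instead. */
    printf("%.10f\n", 0.8125f);  /* prints 0.8125000000 */
    printf("%.10f\n", 0.1f);     /* prints 0.1000000015 */
    return 0;
}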

Properly subtracting float values

I am trying to create an array of values. These values should be "2.4,1.6,.8,0". I am subtracting .8 at every step.
This is how I am doing it (code snippet):
float mean = [[_scalesDictionary objectForKey:@"M1"] floatValue];  // 3.2f
float sD = [[_scalesDictionary objectForKey:@"SD1"] floatValue];   // 0.8f

nextRegion = mean;
hitWall = NO;
NSMutableArray *minusRegion = [NSMutableArray array];
while (!hitWall) {
    nextRegion -= sD;
    if (nextRegion < 0.0f) {
        nextRegion = 0.0f;
        hitWall = YES;
    }
    [minusRegion addObject:[NSNumber numberWithFloat:nextRegion]];
}
I am getting this output:
minusRegion = (
"2.4",
"1.6",
"0.8000001",
"1.192093e-07",
0
)
I do not want the incredibly small number between .8 and 0. Is there a standard way to truncate these values?
Neither 3.2 nor .8 is exactly representable as a 32-bit float. The representable number closest to 3.2 is 3.2000000476837158203125 (in hexadecimal floating-point, 0x1.99999ap+1). The representable number closest to .8 is 0.800000011920928955078125 (0x1.99999ap-1).
When 0.800000011920928955078125 is subtracted from 3.2000000476837158203125, the exact mathematical result is 2.400000035762786865234375 (0x1.3333338p+1). This result is also not exactly representable as a 32-bit float. (You can see this easily in the hexadecimal floating-point. A 32-bit float has a 24-bit significand. “1.3333338” has one bit in the “1”, 24 bits in the middle six digits, and another bit in the ”8”.) So the result is rounded to the nearest 32-bit float, which is 2.400000095367431640625 (0x1.333334p+1).
Subtracting 0.800000011920928955078125 from that yields exactly 1.600000083446502685546875 (0x1.99999bp+0), which again needs more than 24 bits and is not representable; it falls exactly halfway between two representable values, and the ties-to-even rule rounds it to 1.6000001430511474609375 (0x1.99999cp+0). (In “1.99999c” the “1” is one bit, the five nines are 20 bits, and the “c” has two significant bits; its low two bits are trailing zeroes and may be neglected. So there are 23 significant bits, and that value is exactly representable.)
Subtracting 0.800000011920928955078125 from that yields 0.800000131130218505859375 (0x1.99999ep-1), which is also exactly representable.
Finally, subtracting 0.800000011920928955078125 from that yields 1.1920928955078125e-07 (0x1p-23).
The lesson to be learned here is that floating-point does not represent all numbers, and it rounds results to give you the closest numbers it can represent. When writing software to use floating-point arithmetic, you must understand and allow for these rounding operations. One way to allow for this is to use numbers that you know can be represented. Others have suggested using integer arithmetic. Another option is to use mostly values that you know can be represented exactly in floating-point, which includes integers up to 2²⁴. So you could start with 32 and subtract 8, yielding 24, then 16, then 8, then 0. Those would be the intermediate values you use for loop control and continuing calculations with no error. When you are ready to deliver results, then you could divide by 10, producing numbers near 3.2, 2.4, 1.6, .8, and 0 (exactly). This way, your arithmetic would introduce only one rounding error into each result, instead of accumulating rounding errors from iteration to iteration.
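If you want to watch these intermediate values, something like this prints them with enough digits (illustrative; the exact digits assume plain IEEE-754 single-precision evaluation, as on typical SSE-based platforms):

#include <stdio.h>

int main(void)
{
    float v = 3.2f;
    for (int i = 0; i < 4; ++i) {
        v -= 0.8f;
        /* shows 2.4000000953..., 1.6000001430..., 0.8000001311..., then 0.0000001192... */
        printf("%.25f\n", v);
    }
    return 0;
}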
You're looking at good old floating-point rounding error. Fortunately, in your case it should be simple to deal with. Just clamp:
if( val < increment ){
    val = 0.0;
}
Although, as Eric Postpischil explained below:
Clamping in this way is a bad idea, because sometimes rounding will cause the iteration variable to be slightly less than the increment instead of slightly more, and this clamping will effectively skip an iteration. For example, if the initial value were 3.6f (instead of 3.2f), and the step were .9f (instead of .8f), then the values in each iteration would be slightly below 3.6, 2.7, 1.8, and .9. At that point, clamping converts the value slightly below .9 to zero, and an iteration is skipped.
Therefore it might be necessary to subtract a small amount when doing the comparison.
A better option which you should consider is doing your calculations with integers rather than floats, then converting later.
int increment = 8;
int val = 32;
while( val > 0 ){
    val -= increment;
    float new_float_val = val / 10.0;
}
Another way to do this is to multiply the numbers you get by subtraction by 10, then convert to an integer, then divide that integer by 10.0.
You can do this easily with the floor function (floorf) like this:
float newValue = floorf(oldValue * 10) / 10;