I came across a coding recommendation for IBM PL/I: do not create FIXED type variables. The example of wrong code is:
DECLARE avar FIXED BIN;
while the correct code given is:
DECLARE avar BIN;
I wanted to know whether this is a correct recommendation, because there are a great many occurrences in the code, existing and new, where FIXED is and will be used.
And if it is a correct recommendation, should it apply only to BIN, or to BIN(n,m) as well?
I can only guess as to the rationale for this "coding guideline".
In PL/I, a number can be binary (BIN) or decimal (DEC). Generally, when we think of numbers we think in decimal rather than binary, so we think of numbers like 123.45 (FIXED DEC (5,2)), not 1001101.011B (FIXED BIN (10,3)). Fixed binary is an odd duck, and we generally don't understand it intuitively. Can you tell me what the aforementioned fixed binary number is in decimal? Maybe the integer part, but what about the fractional part?
Let's break it down:
1001101.011B
||||||| ||+---> 1/8 .125
||||||| |+----> 1/4 .25
||||||| +-----> 1/2 0.0
||||||+-------> 1 1
|||||+--------> 2 0
||||+---------> 4 4
|||+----------> 8 8
||+-----------> 16 0
|+------------> 32 0
+-------------> 64 64
======
77.375
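If you would rather check that sum mechanically, here is a quick C sketch (the constant 0x26B is just the bit pattern 1001101011 written as an integer, and 8 = 2^3 accounts for the three fractional bits):
#include <stdio.h>

int main(void)
{
    /* 1001101.011B: ten significant bits, three of them fractional.
       Stored as the integer 1001101011B = 619 (0x26B) with a scale factor of 2^3. */
    int bits = 0x26B;
    double value = bits / 8.0;   /* divide by 2^3 to place the binary point */
    printf("%.3f\n", value);     /* prints 77.375 */
    return 0;
}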
Did you get that? Going the other way you can convert 123.45 to fixed binary like this:
Integer                                       Fractional
 Part                                            Part
123.45   .00000000                               .45
   -64    |||||||+--> 1/256 = .00390625         -.25
------    ||||||+---> 1/128 = .0078125           ----
 59.45    |||||+----> 1/64  = .015625            .2
   -32    ||||+-----> 1/32  = .03125            -.125
------    |||+------> 1/16  = .0625              -----
 27.45    ||+-------> 1/8   = .125               .075
   -16    |+--------> 1/4   = .25               -.0625
------    +---------> 1/2   = .5                 ------
 11.45                                           .0125
    -8                                          -.0078125
------                                           ---------
  3.45                                           .0046875
    -2                                          -.00390625
------                                           ----------
  1.45                                           .00078125 <--+
    -1                                                        |
                                                              |
= 1111011B                          = .01110011B (and it is not exact)
So 123.45 ~ 1111011.01110011B
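You can confirm that in a couple of lines of C (a quick sketch; 256 = 2^8 matches the eight fractional bits used above):
#include <stdio.h>

int main(void)
{
    /* Scale 123.45 to 8 fractional bits and truncate, as in the worked example. */
    int q = (int)(123.45 * 256);            /* 31603 = 1111011.01110011B */
    printf("%d -> %.8f\n", q, q / 256.0);   /* prints 31603 -> 123.44921875 */
    return 0;
}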
So FIXED BIN is probably not a good idea: not every fixed decimal number can be represented in FIXED BIN, and it is not really intuitive.
If FIXED/FLOAT is not specified, FLOAT is the default. This in fact is how the floating point numbers you are likely used to are stored. So we can use it for our regular floating point calculations.
As for the (p) vs. (p,q) question, (p) is for floating point, (p,q) is for fixed point, though q defaults to 0 when it is missing from the fixed point precision specification.
You may then be wondering about DEC. You probably should use FIXED DEC (p,q) to get what you are probably thinking of as a fixed-point number, like currency. These are likely encoded as packed decimal, where each digit is stored in a nibble (4 bits) with the sign in the last nibble. So 123.45 would be stored as x'12 34 5f', where x'f' means positive or unsigned and x'd' means negative. The decimal point is assumed, not stored.
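To make that layout concrete, here is a small C sketch that packs the digits of 123.45 (held as the integer 12345 with an assumed scale of 2) into the x'12 34 5f' form described above; the packing loop is purely illustrative, not an IBM API:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    long v = 12345;        /* 123.45 with an assumed scale of 2, positive */
    uint8_t packed[3];

    /* Last byte holds the low-order digit and the sign nibble (F = positive). */
    packed[2] = (uint8_t)((v % 10) << 4 | 0x0F);
    v /= 10;

    /* Remaining bytes each hold two digits, working from right to left. */
    for (int i = 1; i >= 0; --i) {
        packed[i] = (uint8_t)(v % 10);
        v /= 10;
        packed[i] |= (uint8_t)((v % 10) << 4);
        v /= 10;
    }

    printf("%02X %02X %02X\n", packed[0], packed[1], packed[2]);   /* 12 34 5F */
    return 0;
}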
So to summarize, for exact calculations use FIXED DEC, and for floating point calculation use BIN or FLOAT BIN.
Here Daniel mentions
... you pick any integer in [0, 2²⁴), and divide it by 2²⁴, then you can recover your original integer by multiplying the result again by 2²⁴. This works with 2²⁴ but not with 2²⁵ or any other larger number.
But when I tried
>>> import numpy as np
>>> b = np.divide(1, 2**63, dtype=np.float32)
>>> b*2**63
1.0
It doesn't work for 2⁶⁴, but I'm left wondering why it works for all the exponents from 24 to 63, and whether this behaviour is unique to numpy.
In the context that passage is in, it is not saying that an integer value cannot be divided by 2²⁵ or 2⁶³ and then multiplied to restore the original value. It is saying that this will not work to create an unbiased distribution of numbers.
The text leaves some things not explicitly stated, but I suspect it is discussing taking a value of integer type, converting it to IEEE-754 single-precision, and then dividing it. This will not work for factors larger than 2²⁴ because the conversion from integer type to IEEE-754 single-precision will have to round the number.
For example, for 2³², all numbers from 0 to 16,777,215 will convert to themselves with no error, and then dividing by 2³² will produce a unique floating-point number for each. But both 16,777,216 and 16,777,217 will convert to 16,777,216, and then dividing by 2³² will produce the same number for them (1/256). All numbers from 2,147,483,520 to 2,147,483,776 will map to 2,147,483,648, which then produces ½, so that is 257 numbers mapping to one floating-point number. But all the numbers from 2,147,483,777 to 2,147,484,031 map to 2,147,483,904. So this one has 255 numbers mapping to it. (The difference is due to the round-to-nearest-ties-to-even rule.) At the high end, the 129 numbers from 4,294,967,168 to 4,294,967,296 map to 4,294,967,296, for which dividing produces 1, which is out of the desired half-open interval, [0, 1).
On the other hand, if we use integers from 0 to 16,777,215 (2²⁴ − 1), there is no rounding, and each result maps from exactly one starting number and stays within the interval.
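The cutoff is easy to demonstrate outside numpy as well. A small C sketch (single precision has a 24-bit significand, so 2²⁴ + 1 is the first integer that cannot survive the conversion):
#include <stdio.h>

int main(void)
{
    /* Integers up to 2^24 convert to float exactly; the next one already rounds. */
    float a = 16777216.0f;   /* 2^24 */
    float b = 16777217.0f;   /* 2^24 + 1 rounds to 2^24 in single precision */
    printf("%s\n", a == b ? "equal" : "different");                 /* equal */

    /* So dividing by 2^32 cannot give a distinct result for every integer. */
    printf("%.10g %.10g\n", a / 4294967296.0f, b / 4294967296.0f);  /* both 0.00390625 */
    return 0;
}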
Note that “significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old word for the fraction portion of a logarithm. Significands are linear. Mantissas are logarithmic. And the significand of the IEEE-754 single-precision format has 24 bits, not 23. The primary field used to encode the significand has 23 bits, but the exponent field provides another bit.
I was working on a bit of SQL for a project, and I noticed some seemingly strange behavior in SQL Server, with regard to what the answer looks like when dividing with decimals.
Here are some examples which illustrate the behavior I'm seeing:
DECLARE @Ratio Decimal(38,16)
SET @Ratio = CAST(210 as Decimal(38,16))/CAST(222 as Decimal(38,16));
select @Ratio -- Results in 0.9459450000000000

DECLARE @Ratio Decimal(38,16)
SET @Ratio = CAST(210 as Decimal)/CAST(222 as Decimal);
select @Ratio -- Results in 0.9459459459459459
For the code above, the answer for the query which is (seemingly) less precise gives a more precise value as the answer. When I cast both the dividend and the divisor as Decimal(38,16), I get a number with a scale of 6 (casting it to a Decimal(38,16) again results in the 0's padding the scale).
When I cast the dividend and divisor to just a default Decimal, with no precision or scale set manually, I get the full 16 digits in the scale of my result.
Out of curiosity, I began experimenting more with it, using these queries:
select CAST(210 as Decimal(38,16))/CAST(222 as Decimal(38,16)) --0.945945
select CAST(210 as Decimal(28,16))/CAST(222 as Decimal(28,16)) --0.9459459459
select CAST(210 as Decimal(29,16))/CAST(222 as Decimal(29,16)) --0.945945945
As you can see, as I increase the precision, the scale of the answer appears to decrease. I can't see a correlation between the scale of the result and the scale or precision of the dividend and divisor.
I found some other SO questions pointing to a place in the msdn documentation which states that the resulting precision and scale during an operation on a decimal is determined by performing a set of calculations on the precision and scale of the divisor and dividend, and that:
The result precision and scale have an absolute maximum of 38. When a result precision is greater than 38, the corresponding scale is reduced to prevent the integral part of a result from being truncated.
So I tried running through those equations myself to determine what the output of dividing a Decimal(38,16) into another Decimal(38,16) would look like, and according to what I found, I still should have gotten back a more precise number than I did.
So I'm either doing the math wrong, or there's something else going on here that I'm missing. I'd greatly appreciate any insight that any of you has to offer.
Thanks in advance...
The documentation is a little incomplete as to the magic of the value 6 and when to apply the max function, but here's a table of my findings, based on that documentation.
As it says, the formulas for division are:
Result precision = p1 - s1 + s2 + max(6, s1 + p2 + 1)
Result scale = max(6, s1 + p2 + 1)
And, as you yourself highlight, we then have the footnote:
The result precision and scale have an absolute maximum of 38. When a result precision is greater than 38, the corresponding scale is reduced to prevent the integral part of a result from being truncated.
So, here's what I produced in my spreadsheet:
p1  s1  p2  s2  prInit  srInit  prOver  prAdjusted  srAdjusted
38  16  38  16      93      55      55          38           6
28  16  28  16      73      45      35          38          10
29  16  29  16      75      46      37          38           9
So, I'm using pr and sr to indicate the precision and scale of the result. The prInit and srInit formulas are exactly the formulas from the documentation. As we can see, in all 3 cases the precision of the result is vastly larger than 38, so the footnote applies. prOver is just max(0, prInit - 38) - how much we have to adjust the precision by when the footnote applies. prAdjusted is just prInit - prOver. We can see in all three cases that the final precision of the result is 38.
If I apply the same adjustment factor to the scales then I would obtain results of 0, 10 and 9. But we can see that your result for the (38,16) case has a scale of 6. So I believe that is where the max(6, ...) part of the documentation actually applies. So my final formula for srAdjusted is max(6, srInit - prOver), and now my final adjusted values appear to match your results.
And, of course, if we consult the documentation for decimal, we can see that the default precision and scale, if you do not specify them, are (18,0), so here's the row for when you didn't specify precision and scale:
p1  s1  p2  s2  prInit  srInit  prOver  prAdjusted  srAdjusted
18   0  18   0      37      19       0          37          19
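If it helps, the whole calculation fits in a few lines of C. Note that the srAdjusted rule is my reading of the footnote, not something the documentation spells out:
#include <stdio.h>

static int imax(int a, int b) { return a > b ? a : b; }

/* Result type of decimal(p1,s1) / decimal(p2,s2) per the documented formulas,
   with the over-38 adjustment applied the way it is described above. */
static void div_result_type(int p1, int s1, int p2, int s2, int *pr, int *sr)
{
    int prInit = p1 - s1 + s2 + imax(6, s1 + p2 + 1);
    int srInit = imax(6, s1 + p2 + 1);
    int prOver = imax(0, prInit - 38);   /* how far the precision exceeds 38 */
    *pr = prInit - prOver;
    *sr = imax(6, srInit - prOver);      /* scale is never reduced below 6 */
}

int main(void)
{
    int pr, sr;
    div_result_type(38, 16, 38, 16, &pr, &sr);
    printf("decimal(38,16) / decimal(38,16) -> decimal(%d,%d)\n", pr, sr);   /* (38,6)  */
    div_result_type(18, 0, 18, 0, &pr, &sr);
    printf("decimal(18,0)  / decimal(18,0)  -> decimal(%d,%d)\n", pr, sr);   /* (37,19) */
    return 0;
}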
When I calculate log(8) / log(2) I get 3 as one would expect:
?log(8)/log(2)
3
However, if I take the int of this calculation like this, the result is 2, which is wrong:
?int(log(8)/log(2))
2
How and why does this happen?
Likely because the actual number returned is of type double. Because floats and doubles cannot exactly represent most base-10 rational numbers, the value returned is something like 2.99999999999, and when you apply int() the .99999999999 part is truncated.
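The question looks like VB, but the same mechanism can be shown in C. This sketch manufactures a value just below 3.0 with nextafter rather than relying on log(), since some platforms happen to return exactly 3 for log(8)/log(2):
#include <math.h>
#include <stdio.h>

int main(void)
{
    double almost3 = nextafter(3.0, 0.0);   /* largest double below 3.0 */
    printf("%.17g\n", almost3);             /* 2.9999999999999996 */
    printf("%d\n", (int)almost3);           /* 2 - the cast truncates          */
    printf("%ld\n", lround(almost3));       /* 3 - rounding to nearest instead */
    return 0;
}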
How floating-point numbers work: a bit is dedicated to the sign, a few bits store an exponent, and the rest hold the actual fraction. This leads to numbers being represented in a form similar to 1.45 * 10^4, except that the base is two instead of ten.
I am programming a fixed-point speech enhancement algorithm on a 16-bit processor. At some point I need to do 32-bit fractional multiplication. I have read other posts about doing 32-bit multiplication byte by byte and I see why this works for Q0.31 formats. But I use different Q formats with varying number of fractional bits.
So I have found out that for fractional bits less than 16, this works:
(low*low >> N) + low*high + high*low + (high*high << N)
where N is the number of fractional bits. I have read that the low*low result should be unsigned as well as the low bytes themselves. In general this gives exactly the result I want in any Q format with less than 16 fractional bits.
Now it gets tricky when there are more than 16 fractional bits. I have tried out several numbers of shifts, and different shifts for low*low and high*high; I have tried to put it on paper, but I can't figure it out.
I know it may be very simple but the whole idea eludes me and I would be grateful for some comments or guidelines!
It's the same formula. For N > 16, the shifts just mean you throw out a whole 16-bit word which would have over- or underflowed. low*low >> N means: take the high word of the 32-bit multiply result, shift it right by N-16, and add it to the low word of the result. high*high << N means: take the low word of the multiply result, shift it left by N-16, and add it to the high word of the result.
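Here is a tiny C check of that equivalence for the right-shift case (the value and N are arbitrary):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t p = 0x89ABCDEFu;   /* some 32-bit partial product */
    unsigned N = 20;            /* more than 16 fractional bits */

    uint32_t direct    = p >> N;                 /* shift the whole product    */
    uint32_t word_wise = (p >> 16) >> (N - 16);  /* high word, shifted by N-16 */

    printf("%08X %08X\n", (unsigned)direct, (unsigned)word_wise);   /* 0000089A 0000089A */
    return 0;
}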
There are a few ideas at play.
First, multiplication of 2 shorter integers to produce a longer product. Consider unsigned multiplication of 2 32-bit integers via multiplications of their 16-bit "halves", each of which produces a 32-bit product and the total product is 64-bit:
a * b = (a_hi * 2^16 + a_lo) * (b_hi * 2^16 + b_lo) =
a_hi * b_hi * 2^32 + (a_hi * b_lo + a_lo * b_hi) * 2^16 + a_lo * b_lo.
Now, if you need a signed multiplication, you can construct it from unsigned multiplication (e.g. from the above).
Supposing a < 0 and b >= 0, a *signed b must equal
2^64 - ((-a) *unsigned b), where
-a = 2^32 - a (because this is 2's complement)
IOW,
a *signed b =
2^64 - ((2^32 - a) *unsigned b) =
2^64 + (a *unsigned b) - (b * 2^32), where the 2^64 can be discarded since we're using 64 bits only.
In exactly the same way you can calculate a *signed b for a >= 0 and b < 0 and must get a symmetric result:
(a *unsigned b) - (a * 2^32)
You can similarly show that for a < 0 and b < 0 the signed multiplication can be built on top of the unsigned multiplication this way:
(a *unsigned b) - ((a + b) * 2^32)
So, you multiply a and b as unsigned first, then if a < 0, you subtract b from the top 32 bits of the product and if b < 0, you subtract a from the top 32 bits of the product, done.
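Here is a small C sketch of those two steps: an unsigned 32x32 -> 64 multiply built from 16-bit halves, plus the sign correction. On a real 16-bit target you would keep everything in 16-bit words rather than use 64-bit types, but the identity is the same:
#include <stdint.h>
#include <stdio.h>

/* Unsigned 32x32 -> 64-bit multiply from 16-bit halves, as in the identity above. */
static uint64_t umul32x32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint64_t lo  = (uint64_t)a_lo * b_lo;                           /* weight 2^0  */
    uint64_t mid = (uint64_t)a_hi * b_lo + (uint64_t)a_lo * b_hi;   /* weight 2^16 */
    uint64_t hi  = (uint64_t)a_hi * b_hi;                           /* weight 2^32 */

    return lo + (mid << 16) + (hi << 32);
}

/* Signed multiply on top of the unsigned one: if an operand is negative,
   subtract the other operand from the top 32 bits of the product. */
static int64_t smul32x32(int32_t a, int32_t b)
{
    uint64_t p = umul32x32((uint32_t)a, (uint32_t)b);
    if (a < 0) p -= (uint64_t)(uint32_t)b << 32;
    if (b < 0) p -= (uint64_t)(uint32_t)a << 32;
    return (int64_t)p;
}

int main(void)
{
    printf("%lld\n", (long long)smul32x32(-123456, 789));   /* -97406784 */
    return 0;
}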
Now that we can multiply 32-bit signed integers and get 64-bit signed products, we can finally turn to the fractional stuff.
Suppose now that out of those 32 bits in a and b, N bits are used for the fractional part. That means that if you look at a and b as plain integers, they are going to be 2^N times greater than what they really represent, e.g. 1.0 is going to look like 2^N (or 1 << N).
So, if you multiply two such integers the product is going to be 2^N * 2^N = 2^(2*N) times greater than what it should represent, e.g. 1.0 * 1.0 is going to look like 2^(2*N) (or 1 << (2*N)). IOW, plain integer multiplication is going to double the number of fractional bits. If you want the product to have the same number of fractional bits as the multiplicands, what do you do? You divide the product by 2^N (or shift it arithmetically N positions right). Simple.
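Putting that last step into C, a Q-format multiply is just a widening multiply followed by a shift by N (N = 20 is an arbitrary choice here; the shift on a negative value assumes the compiler does an arithmetic shift, see the caveats below):
#include <stdint.h>
#include <stdio.h>

#define N 20   /* number of fractional bits (arbitrary choice for the example) */

/* Product of two values with N fractional bits: the raw product has 2*N
   fractional bits, so shift right by N to get back to N fractional bits. */
static int32_t qmul(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * b;   /* exact 64-bit product */
    return (int32_t)(p >> N);     /* assumes arithmetic right shift for negatives */
}

int main(void)
{
    int32_t a = (int32_t)( 2.50 * (1 << N));
    int32_t b = (int32_t)(-1.25 * (1 << N));
    printf("%f\n", qmul(a, b) / (double)(1 << N));   /* prints -3.125000 */
    return 0;
}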
A few words of caution, just in case...
In C (and C++) you cannot legally shift a variable left or right by the same or greater number of bits contained in the variable. The code will compile, but not work as you may expect it to. So, if you want to shift a 32-bit variable, you can shift it by 0 through 31 positions left or right (31 is the max, not 32).
If you shift signed integers left, you cannot overflow the result legally. All signed overflows result in undefined behavior. So, you may want to stick to unsigned.
Right shifts of negative signed integers are implementation-defined. They can do either an arithmetic shift or a logical shift, depending on the compiler. So, if you need one of the two, you need to either ensure that your compiler supports it directly or implement it in some other way.
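One of those "other ways", sketched in C: it computes floor division by 2^n, which is what an arithmetic right shift does, without ever shifting a negative value:
#include <stdint.h>
#include <stdio.h>

/* Arithmetic right shift (floor division by 2^n) of a signed 32-bit value
   without relying on implementation-defined behaviour; n must be 0..31. */
static int32_t asr32(int32_t x, unsigned n)
{
    if (x >= 0)
        return (int32_t)((uint32_t)x >> n);
    /* For x < 0: floor(x / 2^n) = -((-x - 1) / 2^n) - 1, and -x - 1 == ~x >= 0. */
    return -(int32_t)(~(uint32_t)x >> n) - 1;
}

int main(void)
{
    printf("%d %d\n", asr32(-3276800, 4), asr32(3276800, 4));   /* -204800 204800 */
    return 0;
}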
I am currently trying to figure out how to multiply two numbers in fixed point representation.
Say my number representation is as follows:
[SIGN][2^0].[2^-1][2^-2]..[2^-14]
In my case, the number 10.01000000000000 = -0.25.
How would I for example do 0.25x0.25 or -0.25x0.25 etc?
Hope you can help!
You should use 2's complement representation instead of a separate sign bit. It's much easier to do maths on that; no special handling is required. The range is also improved because there's no wasted bit pattern for negative 0. To multiply, just do normal fixed-point multiplication. The normal Q2.14 format stores the value x/2^14 for the bit pattern x, therefore if we have A and B then the real values they represent are A/2^14 and B/2^14, and the real product is (A x B)/2^28.
So you just need to multiply A and B directly, then divide the product by 2^14 to get the result back into the form x/2^14, like this:
AxB = ((int32_t)A*B) >> 14;
A rounding step is needed to get the nearest value. You can find the way to do it in the Wikipedia article Q (number format), under Math operations. The simplest way to round to nearest is to add back the last bit that was shifted out (i.e. the first discarded fractional bit), like this:
AxB = (int32_t)A*B;
AxB = (AxB >> 14) + ((AxB >> 13) & 1);
You might also want to read these
Fixed-point arithmetic.
Emulated Fixed Point Division/Multiplication
Fixed point math in c#?
With 2 bits you can represent the integer range of [-2, 1]. So using Q2.14 format, -0.25 would be stored as 11.11000000000000. Using 1 sign bit you can only represent -1, 0, 1, and it makes calculations more complex because you need to split the sign bit then combine it back at the end.
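A quick C sketch of that encoding (using int16_t as a stand-in for a 16-bit Q2.14 value):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* -0.25 in two's-complement Q2.14: multiply by 2^14 and keep 16 bits. */
    int16_t q = (int16_t)(-0.25 * (1 << 14));             /* -4096 */
    printf("%d 0x%04X\n", q, (unsigned)(uint16_t)q);      /* -4096 0xF000 = 11.11000000000000B */
    printf("%f\n", q / (double)(1 << 14));                /* -0.250000 */
    return 0;
}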
Multiply into a larger sized variable, and then right shift by the number of bits of fixed point precision.
Here's a simple example in C:
int a = 0.25 * (1 << 16);                  /* 0.25 in 16.16 fixed point  */
int b = -0.25 * (1 << 16);                 /* -0.25 in 16.16 fixed point */
int c = (int)(((long long)a * b) >> 16);   /* multiply into a wider type, then shift back */
printf("%.2f * %.2f = %.2f\n", a / 65536.0, b / 65536.0, c / 65536.0);
You basically multiply everything by a constant to bring the fractional parts up into the integer range, then multiply the two factors, then (optionally) divide by one of the constants to return the product to the standard range for use in future calculations. It's like multiplying prices expressed in fractional dollars by 100 and then working in cents (i.e. $1.95 * 100 cents/dollar = 195 cents).
Be careful not to overflow the range of the variable you are multiplying into. Your constant might need to be smaller to avoid overflow, like using 1 << 8 instead of 1 << 16 in the example above.