How does numpy manage to divide float32 by 2**63? - numpy

Here Daniel mentions
... you pick any integer in [0, 2²⁴), and divide it by 2²⁴, then you can recover your original integer by multiplying the result again by 2²⁴. This works with 2²⁴ but not with 2²⁵ or any other larger number.
But when I tried
>>> b = np.divide(1, 2**63, dtype=np.float32)
>>> b*2**63
1.0
Although it isn't working for 2⁶⁴, but I'm left wondering why it's working for all the exponents from 24 to 63. And moreover if it's unique to numpy only.

In the context that passage is in, it is not saying that an integer value cannot be divided by 225 or 263 and then multiplied to restore the original value. It is saying that this will not work to create an unbiased distribution of numbers.
The text leaves some things not explicitly stated, but I suspect it is discussing taking a value of integer type, converting it to IEEE-754 single-precision, and then dividing it. This will not work for factors larger than 224 because the conversion from integer type to IEEE-754 single-precision will have to round the number.
For example, for 232, all numbers from 0 to 16,777,215 will convert to themselves with no error, and then dividing by 232 will produce a unique floating-point number for each. But both 16,777,216 and 16,777,217 will convert to 16,777,216, and then dividing by 232 will produce the same number for them (1/256). All numbers from 2,147,483,520 to 2,147,483,776 will map to 2,147,483,648, which then produces ½, so that is 257 numbers mapping to one floating-point number. But all the numbers from 2,147,483,777 to 2,147,484,031 map to 2,147,483,904. So this one has 255 numbers mapping to it. (The difference is due to the round-to-nearest-ties-to-even rule.) At the high end, the 129 numbers from 4,294,967,168 to 4,294,967,296 map to 4,294,967,296, for which dividing produces 1, which is out of the desired half-open interval, [0, 1).
On the other hand, if we use integers from 0 to 16,777,215 (224−1), there is no rounding, and each result maps from exactly one starting number and stays within the interval.
Note that “significand“ is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old word for the fraction portion of a logarithm. Significands are linear. Mantissas are logarithmic. And the significand of the IEEE-754 single-precision format has 24 bits, not 23. The primary field used to encode the significand has 23 bits, but the exponent field provides another bit.

Related

Integer part bit growth for fixed point numbers of the 0.xyz kind

First of all we should agree on the definition of the QM.N format. I will follow this resource and its conventions.
For the purposes of this paper the notion of a Q-point for a fixed-point number is introduced.
This labeling convention is as follows:
Q[QI].[QF]
Where QI = # of integer bits & QF = # of fractional bits
For signed integer variable types we will include the sign bit in QI as it does have integer
weight albeit negative in sign.
Based on this convention, if I had to represent the number -0.123 in the format Q1.7 I would write it as: 1.1110001
The theory says that:
When performing an integer multiplication the product is 2xWL if both the multiplier and
multiplicand are WL long. If the integer multiplication is on fixed-point variables, the number of
integer and fractional bits in the product is the sum of the corresponding multiplier and
multiplicand Q-points as described by the following equations
Knowing this is useful because after multiplication we have double precision, and we need to rescale the output to our input precision. Knowing where the integer part is allows us to prevent overflow and to pick the relevant bits, as in the example where the long string is the result of the multiplication:
However, when performing the multiplication between two Q1.7 numbers of the format 0.xyz I have noticed that the integer part never grows, allowing me to pick only one bit from the integer part. I have written a piece of code that picks only the fractional part after multiplication, and here are the results.
Test 0
Testing +0.5158*+0.0596
A:real_val:+0.5156 fixed: 66 int: 0 frac: 1000010
B:real_val:+0.0547 fixed: 7 int: 0 frac: 0000111
C: real_val:+0.0282 fixed: 462 int: 00 frac: 00000111001110
Floating multiplication: +0.0307
Test 1
Testing +0.4842*-0.9558
A:real_val:+0.4766 fixed: 61 int: 0 frac: 0111101
B:real_val:-0.9531 fixed: -122 int: 1 frac: 0000110
C: real_val:-0.4542 fixed: -7442 int: 11 frac: 10001011101110
Floating multiplication: -0.4628
Test 2
Testing +0.2812*+0.2433
A:real_val:+0.2734 fixed: 35 int: 0 frac: 0100011
B:real_val:+0.2422 fixed: 31 int: 0 frac: 0011111
C: real_val:+0.0662 fixed: 1085 int: 00 frac: 00010000111101
Floating multiplication: +0.0684
Test 3
Testing -0.7235*-0.9037
A:real_val:-0.7188 fixed: -92 int: 1 frac: 0100100
B:real_val:-0.8984 fixed: -115 int: 1 frac: 0001101
C: real_val:+0.6458 fixed: 10580 int: 00 frac: 10100101010100
Floating multiplication: +0.6538
My question to you is if I am overlooking anything here or if this is normal and expected behaviour from fixed points. If so, I will be happy with my numbers never overflowing during multiplication.
Basically what I mean is that after multiplication of two Q1.X numbers in the form 0.xyz the integer part will always be 0 (if the result is positive) or 1111.. if the result is negative.
So my accumulator register will be filled with only 2*X of meaningful bits and I can take only them, plus the sign.
No, the number of bits in the result is still the sum of the bits in the inputs.
Summary:
Signed Q1.31 times signed Q1.31 equals signed Q2.62.
Unsigned Q1.31 times unsigned Q1.31 equals unsigned Q2.62.
Explanation:
Unsigned Q1.n numbers can represent from zero (inclusive) to two (exclusive). If you multiply two such numbers together the range of results is from zero (inclusive) to 4 (exclusive). Just less than four is three point something, and three fits in the two bits above the point.
Signed Q1.n numbers can represent from negative one (inclusive) to one (exclusive). If you multiply two such numbers together the range of results is negative one (exclusive) to one (inclusive). Signed Q1.31 times signed Q1.31 would fit in Q1.62 except for the single case -1.0 times -1.0 equals +1.0, which requires the extra bit above the point.
The equations in your question apply equally in both these cases.

Internal error occurred during runtime generation of Program Dump ID: BCD_OVERFLOW

Internal error occurred during runtime generation of Program (Dump ID: BCD_OVERFLOW)
No error during check but activation gives this error.
This issue occurs in any ABAP code if you try to assign a value to a numeric attribute or variable, which is "out of range" (so leads to an overflow). See here all the possible value range for numeric types:
Type Value Range
---------- ------------------------------------------------------------------------------
b 0 to 255
s -32,768 to +32,767
i -2,147,483,648 to +2,147,483,647
int8 -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807
p The valid length for packed numbers is between 1 and 16 bytes. Two places are
packed into one byte, where the last byte contains only one place and the sign,
which is the number of places or places calculated from 2 * len1. After the
decimal separator, up to 14 decimal places are allowed ( the number of decimal
places should not exceed the number of places). Depending on the field length
len and the number of decimal places dec, the value range is: (-10^(2len-1)
+1) / (10^(+dec)) to (+10^(2len-1) -1) /(10^(+dec)) in increments of 10^(-dec).
Any intermediate values are rounded decimally. Invalid content produces undefined
behavior.
decfloat16 Decimal floating point numbers of this type are represented internally with 16
places in accordance with the IEEE-754-2008 standard. Valid values are numbers
between 1E385(1E-16 - 1) and -1E-383 for the negative range, 0 and +1E-383 to
1E385(1 - 1E-16) for the positive range. Values between the ranges form the
subnormal range and are rounded. Outside of the subnormal range, each 16-digit
decimal number can be represented exactly with a decimal floating point number
of this type.
decfloat34 Decimal floating point numbers of this type are represented internally with 34
places in accordance with the IEEE-754-2008 standard. Valid values are numbers
between 1E6145(1E-34 - 1) and -1E-6143 for the negative range, 0 to +1E-6143
and 1E6145(1 - 1E-34) for the positive range. Values between the ranges form
the subnormal range and are rounded. Outside of the subnormal range, each
34-digit decimal number can be represented exactly using a decimal floating
point number.
f Binary floating point numbers are represented internally according to the
IEEE-754 standard (double precision). In ABAP, 17 places are represented (one
integer digit and 16 decimal places). Valid values are numbers between
-1.7976931348623157E+308 and -2.2250738585072014E-308 for the negative range
and between +2.2250738585072014E-308 and +1.7976931348623157E+308 for the
positive range, plus 0. Both validity intervals are extended to the value zero
by subnormal numbers according to IEEE-754. Not every sixteen-digit number can
be represented exactly by a binary floating point number.
Minimal reproducible example:
REPORT ztest.
DATA num TYPE int1.
num = 1000. " <=== run time error
The solution is to use a larger data type like int2 (up to 32767) or I (integer on 4 bytes):
REPORT ztest.
DATA num TYPE int2. " <=== larger type
num = 1000. " <=== no more error
NB: decfloat34 is the larger possible numeric data type, it can handle virtually any value.
This issue can occur in methods, Function modules, or in Report also.
The main reason behind this issue is the runtime configuration of attributes present in code.
For example,
DATA: num type int1 value 256.
This statement is syntactically fine but the value that is being assigned to variable num is more than the range of type INT1. So it will dump while activation only.
Solution:
DATA: num type int1 value {<=255}
Similarly, this error can occur in any case where the compile-time and runtime configuration conflict.

Error taking int of logs in VBA

When I calculate log(8) / log(2) I get 3 as one would expect:
?log(8)/log(2)
3
However, if I take the int of this calculation like this the result is 2 and thus wrong:
?int(log(8)/log(2))
2
How and why does this happen?
Likely because the actual number returned is of type double. Because floats and doubles cannot accurately represent most base 10 rational numbers the number returned is something like 2.99999999999. Then when you apply int() the .999999999 is truncated.
How floating-point number works: it dedicates a bit for the sign, a few bits to store an exponent, and the rest for the actual fraction. This leads to numbers being represented in a form similar to 1.45 * 10^4; except that instead of the base being 10, it's two.

Number format in Oracle SQL

I've given a task of exporting data from an Oracle view to a fixed length text file, however I've been given specification of how data should be exported to a text file. I.e.
quantity NUM (10)
price NUM (8,2)
participant_id CHAR (3)
brokerage NUM (10,2)
cds_fees NUM (8,2)
My confusion arises in Numeric types where when it says (8,2). If I'm to use same as text, does it effectively means
10 characters (as to_char(<field name>, '9999999.99'))
or
8 characters (as to_char(<field name>, '99999.99'))
when exporting to fixed length text field in the text file?
I was looking at this question which gave an insight, but not entirely.
Appreciate if someone could enlighten me with some examples.
Thanks a lot.
According to the Oracle docs on types
Optionally, you can also specify a precision (total number of digits)
and scale (number of digits to the right of the decimal point):
If a precision is not specified, the column stores values as given. If
no scale is specified, the scale is zero.
So in your case, a NUMBER(8,2), has got:
8 digits in total
2 of which are after the decimal point
This gives you a range of -999999.99 to 999999.99
I assume that you mean NUMBER data type by NUM.
When it says NUMBER(8,2), it means that there will be 8 digits, and that the number should be rounded to the nearest hundredth. Which means that there will be 6 digits before, and 2 digits after the decimal point.
Refer to oracle doc:
You use the NUMBER datatype to store fixed-point or floating-point
numbers. Its magnitude range is 1E-130 .. 10E125. If the value of an
expression falls outside this range, you get a numeric overflow or
underflow error. You can specify precision, which is the total number
of digits, and scale, which is the number of digits to the right of
the decimal point. The syntax follows:
NUMBER[(precision,scale)]
To declare fixed-point numbers, for which you must specify scale, use
the following form:
NUMBER(precision,scale)
To declare floating-point numbers, for which you cannot specify
precision or scale because the decimal point can "float" to any
position, use the following form:
NUMBER
To declare integers, which have no decimal point, use this form:
NUMBER(precision) -- same as NUMBER(precision,0)
You cannot use constants or variables to specify precision and scale;
you must use integer literals. The maximum precision of a NUMBER value
is 38 decimal digits. If you do not specify precision, it defaults to
38 or the maximum supported by your system, whichever is less.
Scale, which can range from -84 to 127, determines where rounding
occurs. For instance, a scale of 2 rounds to the nearest hundredth
(3.456 becomes 3.46). A negative scale rounds to the left of the
decimal point. For example, a scale of -3 rounds to the nearest
thousand (3456 becomes 3000). A scale of 0 rounds to the nearest whole
number. If you do not specify scale, it defaults to 0.
NUMBER(p,s)
p(precision) = length of the number in digits
s(scale) = places after the decimal point
So Number(8,2) in your example is a '999999.99'
You can see more examples here.

Properly subtracting float values

I am trying to create an array of values. These values should be "2.4,1.6,.8,0". I am subtracting .8 at every step.
This is how I am doing it (code snippet):
float mean = [[_scalesDictionary objectForKey:#"M1"] floatValue]; //3.2f
float sD = [[_scalesDictionary objectForKey:#"SD1"] floatValue]; //0.8f
nextRegion = mean;
hitWall = NO;
NSMutableArray *minusRegion = [NSMutableArray array];
while (!hitWall) {
nextRegion -= sD;
if(nextRegion<0.0f){
nextRegion = 0.0f;
hitWall = YES;
}
[minusRegion addObject:[NSNumber numberWithFloat:nextRegion]];
}
I am getting this output:
minusRegion = (
"2.4",
"1.6",
"0.8000001",
"1.192093e-07",
0
)
I do not want the incredibly small number between .8 and 0. Is there a standard way to truncate these values?
Neither 3.2 nor .8 is exactly representable as a 32-bit float. The representable number closest to 3.2 is 3.2000000476837158203125 (in hexadecimal floating-point, 0x1.99999ap+1). The representable number closest to .8 is 0.800000011920928955078125 (0x1.99999ap-1).
When 0.800000011920928955078125 is subtracted from 3.2000000476837158203125, the exact mathematical result is 2.400000035762786865234375 (0x1.3333338p+1). This result is also not exactly representable as a 32-bit float. (You can see this easily in the hexadecimal floating-point. A 32-bit float has a 24-bit significand. “1.3333338” has one bit in the “1”, 24 bits in the middle six digits, and another bit in the ”8”.) So the result is rounded to the nearest 32-bit float, which is 2.400000095367431640625 (0x1.333334p+1).
Subtracting 0.800000011920928955078125 from that yields 1.6000001430511474609375 (0x1.99999cp+0), which is exactly representable. (The “1” is one bit, the five nines are 20 bits, and the “c” has two significant bits. The low bits two bits in the “c” are trailing zeroes and may be neglected. So there are 23 significant bits.)
Subtracting 0.800000011920928955078125 from that yields 0.800000131130218505859375 (0x1.99999ep-1), which is also exactly representable.
Finally, subtracting 0.800000011920928955078125 from that yields 1.1920928955078125e-07 (0x1p-23).
The lesson to be learned here is the floating-point does not represent all numbers, and it rounds results to give you the closest numbers it can represent. When writing software to use floating-point arithmetic, you must understand and allow for these rounding operations. One way to allow for this is to use numbers that you know can be represented. Others have suggested using integer arithmetic. Another option is to use mostly values that you know can be represented exactly in floating-point, which includes integers up to 224. So you could start with 32 and subtract 8, yielding 24, then 16, then 8, then 0. Those would be the intermediate values you use for loop control and continuing calculations with no error. When you are ready to deliver results, then you could divide by 10, producing numbers near 3.2, 2.4, 1.6, .8, and 0 (exactly). This way, your arithmetic would introduce only one rounding error into each result, instead of accumulating rounding errors from iteration to iteration.
You're looking at good old floating-point rounding error. Fortunately, in your case it should be simple to deal with. Just clamp:
if( val < increment ){
val = 0.0;
}
Although, as Eric Postpischil explained below:
Clamping in this way is a bad idea, because sometimes rounding will cause the iteration variable to be slightly less than the increment instead of slightly more, and this clamping will effectively skip an iteration. For example, if the initial value were 3.6f (instead of 3.2f), and the step were .9f (instead of .8f), then the values in each iteration would be slightly below 3.6, 2.7, 1.8, and .9. At that point, clamping converts the value slightly below .9 to zero, and an iteration is skipped.
Therefore it might be necessary to subtract a small amount when doing the comparison.
A better option which you should consider is doing your calculations with integers rather than floats, then converting later.
int increment = 8;
int val = 32;
while( val > 0 ){
val -= increment;
float new_float_val = val / 10.0;
};
Another way to do this is to multiply the numbers you get by subtraction by 10, then convert to an integer, then divide that integer by by 10.0.
You can do this easily with the floor function (floorf) like this:
float newValue = floorf(oldVlaue*10)/10;