Units conversion on a PIC 18F2431 - optimization

I have the following conversion between pounds per square inch (PSI) and megapascals (MPa):
psi = MPa*1.45038;
I need the lowest value possible after conversion to be 1 PSI. An example of what I am looking for is:
psi = ((long)MPa*145)/100
Is there any way to optimize this for memory and speed by not using float or long? I will be implementing this conversion on a microcontroller (PIC 18F2431).

You should divide by powers of 2 instead, which is far cheaper than division by any other value. And if the type can't be negative, use an unsigned type. Depending on the type of MPa and its maximum value you can choose a different denominator to suit your needs. There is no need to cast to a wider type if the multiplication won't overflow.
For example if MPa is of type uint16_t you can do psi = MPa*95052/(1UL << 16); (95052/65536 ≈ 1.450378)
If MPa is not larger than 1024 (i.e. 2^10) then you can multiply it by 2^21 without overflowing, thus you can increase the numerator/denominator for more precision:
psi = MPa*3041667/(1UL << 21);
Edit:
On the PIC 18F2431 int is a 16-bit type. That means 95052 will be of type long and MPa will be promoted to long in the expression. If you don't need much precision, change the scaling to fit in an int/int16_t to avoid dealing with long.
In case MPa is not larger than 20 you can divide it by 2048, which is the largest power of 2 that is less than or equal to 2^16/20.
psi = MPa*2970/(1U << 11);
Note that the * and / have equal precedence so it'll be evaluated from left to right, and the above equation will be identical to
psi = (MPa*2970)/2048; // = MPa*1.4501953125
so there is no need for such excessive parentheses.
Edit 2:
Unfortunately if the range of MPa is [0, 2000] then you can only multiply it by 32 without overflowing a 16-bit unsigned int. The closest ratio that you can achieve is 46/32 = 1.4375, so if you need more precision there's no way around using long. Still, integer math with long is a lot faster than floating-point math on the PIC MCU, and costs significantly less code space.
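To make the suggestion above concrete, here is a minimal C sketch; the function names, the uint16_t input type, and the assumed input ranges are mine, not from the question:

#include <stdint.h>

/* Sketch only: assumes MPa fits in 16 bits and is no larger than 20,
   so MPa * 2970 (at most 59400) still fits in an unsigned 16-bit int. */
uint16_t mpa_to_psi_16(uint16_t mpa)
{
    return (uint16_t)((mpa * 2970U) >> 11);              /* = mpa * 1.4501953125 */
}

/* For the [0, 2000] range, fall back to 32-bit (long) math as described above. */
uint16_t mpa_to_psi_32(uint16_t mpa)
{
    return (uint16_t)(((uint32_t)mpa * 95052UL) >> 16);  /* 95052/65536 ≈ 1.450378 */
}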

Calculate the largest N such that MPa*1.45038*2^N < 2^32
Calculate the constant value of K = floor(1.45038*2^N) once
Calculate the value of psi = (MPa*K)>>N for every value of MPa
Since 0 <= MPa <= 2000, you must choose N such that 2000*1.45038*2^N < 2^32:
2^N < 2^32/(2000*1.45038)
N < log2(2^32/(2000*1.45038))
N < 20.497
N = 20
Therefore, K = floor(1.45038*2^N) = floor(1.45038*2^20) = 1520833.
So for every value of MPa, you can calculate psi = (MPa*1520833)>>20.
You'll need to make sure that MPa is unsigned (or cast it accordingly).
Using this method will allow you to avoid floating-point operations.
For better accuracy, you can use round instead of floor, giving you K = 1520834.
In this specific case it will work out fine, because 2000*1520834 is smaller than 2^32.
But with a different maximum value of MPa or a different conversion scalar, it might not.
In any case, the difference in the outcome of psi for each value of K is negligible.
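Here is a minimal C sketch of that recipe, assuming MPa is held in an unsigned 16-bit variable in the range [0, 2000]; the constant names and the function name are just for illustration:

#include <stdint.h>

#define PSI_K 1520833UL   /* floor(1.45038 * 2^20) */
#define PSI_N 20

uint16_t mpa_to_psi(uint16_t mpa)
{
    /* 2000 * 1520833 < 2^32, so the 32-bit product cannot overflow. */
    return (uint16_t)(((uint32_t)mpa * PSI_K) >> PSI_N);
}

For MPa = 2000 this yields 2900, which matches 2000 * 1.45038 = 2900.76 truncated.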
BTW, if you don't mind the additional memory, then you can use a look-up table instead:
Add a pre-calculated global variable const unsigned short lut[2001] = {...}
For every value of MPa, calculate psi = lut[MPa] instead of psi = (MPa*K)>>N
So instead of mul + shift operations, your program will perform a load operation
Please note, however, that whether or not this is more efficient in terms of runtime performance depends on several things, such as the accessibility of the memory segment in which you allocate the look-up table, the architecture of the processor at hand, runtime caching heuristics, etc.
So you will need to apply some profiling on your program in order to decide which approach is better.
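If you do take the table route, one way to produce the initializer is a small generator program run on the PC rather than on the PIC; this is only a sketch (note that 2001 16-bit entries are roughly 4 KB, which must fit in the device's program memory):

#include <stdio.h>

/* Host-side helper: prints a C initializer for lut[0..2000] using the
   fixed-point constant derived above (K = 1520833, N = 20). */
int main(void)
{
    printf("const unsigned short lut[2001] = {\n");
    for (unsigned long mpa = 0; mpa <= 2000; mpa++) {
        printf("%5lu%s", (mpa * 1520833UL) >> 20, mpa < 2000 ? "," : "");
        if (mpa % 10 == 9 || mpa == 2000) printf("\n");
        else printf(" ");
    }
    printf("};\n");
    return 0;
}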


Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2^-49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2^n if n is limited to integer values.
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
import struct
from itertools import islice

def float_to_bin_parts(number, bits=64):
    if bits == 32:    # single precision
        int_pack = 'I'
        float_pack = 'f'
        exponent_bits = 8
        mantissa_bits = 23
        exponent_bias = 127
    elif bits == 64:  # double precision. all python floats are this
        int_pack = 'Q'
        float_pack = 'd'
        exponent_bits = 11
        mantissa_bits = 52
        exponent_bias = 1023
    else:
        raise ValueError('bits argument must be 32 or 64')
    bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
    return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2^(# of bits − 1) − 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413 × 10^23
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 2^52 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (The value shown is rounded; the exact decimal expansion of 675539944105574 / 4503599627370496 is finite but much longer.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2^(# of bits − 1) − 1 to get the true exponent
Mantissa (last component): Divide by 2^(# of bits) and add 1 to get the true mantissa
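For readers who prefer C, here is a rough equivalent of the decomposition above; it assumes an IEEE 754 64-bit double and uses memcpy to reinterpret the bits (the variable names are mine):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    double d = 9.2;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                    /* reinterpret the 64 bits  */

    uint64_t sign     = bits >> 63;                    /* 1 bit                    */
    uint64_t exponent = (bits >> 52) & 0x7FF;          /* 11 bits, biased by 1023  */
    uint64_t mantissa = bits & 0xFFFFFFFFFFFFFULL;     /* 52 stored fraction bits  */

    printf("sign=%llu exponent=%llu (true %lld) mantissa=%llu\n",
           (unsigned long long)sign,
           (unsigned long long)exponent,
           (long long)exponent - 1023,
           (unsigned long long)mantissa);
    return 0;
}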
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 × 10^11 (binary)
Which we can then convert from binary to decimal:
1.1499999999999999 × 2^3 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 × 10^11 (binary)
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 × 10^(11−110100) (binary)
Convert to decimal:
5179139571476070 × 2^(3−52)
Subtract the exponents:
5179139571476070 × 2^−49
Turn negative exponent into division:
5179139571476070 / 2^49
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
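You can double-check that final fraction directly in C; the program below is only a sanity check of the numbers above and assumes IEEE 754 doubles:

#include <stdio.h>

int main(void)
{
    double d = 9.2;
    /* The exact fraction derived above: 5179139571476070 / 2^49 */
    double f = 5179139571476070.0 / 562949953421312.0;

    printf("%d\n", d == f);    /* prints 1: the two doubles are identical */
    printf("%.17g\n", d);      /* 9.1999999999999993                      */
    printf("%.17g\n", f);      /* 9.1999999999999993                      */
    return 0;
}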
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 × 10^11 (binary)
Shift the decimal point:
10011 × 10^(11−100) (binary)
Subtract the exponents:
10011 × 10^−1 (binary)
Binary to decimal:
19 × 2^−1
Negative exponent to division:
19 / 2^1
Multiply exponent:
19 / 2
Equals:
9.5
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)_10 = 0.2_3
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)_10 = 0.5_10 = 0.1_2 = 0.1111..._3
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.
If you limit your math needs to rational numbers only, the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in high school math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
TL;DR
For hardware-accelerated arithmetic, only a limited set of rational numbers can be represented. Every non-representable number is approximated. Some numbers (i.e. irrational ones) can never be represented, no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer), so unavoidably many, many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can be represented are of the form m·2^e for some integers m and e.
You might come up with a different number system, base 10 for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-point numbers are extremely accurate. They can represent any number in a very wide range with as many as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in its lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2^b·5^c).
On the other hand, the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2^b).
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number, then after each calculation you need to reduce the fraction to its lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator is too large, you need to find the closest representable result. This is non-trivial.
Some languages do offer fraction types, but usually they do it in combination with arbitrary precision. This avoids needing to worry about approximating fractions, but it creates its own problem: when a number passes through a large number of calculation steps, the size of the denominator, and hence the storage needed for the fraction, can explode.
Some languages also offer decimal floating point types. These are mainly used in scenarios where it is important that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). They are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.

How to Make a Uniform Random Integer Generator from a Random Boolean Generator?

I have a hardware-based boolean generator that generates either 1 or 0 uniformly. How to use it to make a uniform 8-bit integer generator? I'm currently using the collected booleans to create the binary string for the 8-bit integer. The generated integers aren't uniformly distributed. It follows the distribution explained on this page. Integers with the same number of 1's and 0's, such as 85 (01010101) and -86 (10101010), have the highest chance of being generated, and integers with a lot of repeating bits, such as 0 (00000000) and -1 (11111111), have the lowest chance.
Here's the page that I've annotated with probabilities for each possible 4-bit integer. We can see that they're not uniform. 3, 5, 6, -7, -6, and -4, which have the same number of 1's and 0's, have a ⁶/₁₆ probability, while 0 and -1, whose bits are all the same, have only a ¹/₁₆ probability.
And here's my implementation in Kotlin.
Based on your edit, there appears to be a misunderstanding here. By "uniform 4-bit integers", you seem to have the following in mind:
Start at 0.
Generate a random bit. If it's 1, add 1, and otherwise subtract 1.
Repeat step 2 three more times.
Output the resulting number.
Although the random bit generator may generate bits where each outcome is as likely as the other to be randomly generated, and each 4-bit chunk may be just as likely as any other to be randomly generated, the sum of the bits in each chunk is not uniformly distributed.
What range of integers do you want? Say you're generating 4-bit integers. Do you want a range of [-4, 4], as in the 4-bit random walk in your question, or do you want a range of [-8, 7], which is what you get when you treat a 4-bit chunk of bits as a two's complement integer?
If the former, the random walk won't generate a uniform distribution, and you will need to tackle the problem in a different way.
In this case, to generate a uniform random number in the range [-4, 4], do the following:
Take 4 bits from the random bit generator and treat them as an integer in [0, 16);
If the integer is greater than 8, go to step 1.
Subtract 4 from the integer and output it.
This algorithm uses rejection sampling, but is variable-time (thus is not appropriate whenever timing differences can be exploited in a security attack). Numbers in other ranges are similarly generated, but the details are too involved to describe in this answer. See my article on random number generation methods for details.
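Here is a rough C sketch of that rejection-sampling loop; random_bit() is a stand-in I made up for whatever routine reads one uniform bit from your hardware:

#include <stdint.h>

extern int random_bit(void);   /* hypothetical: returns 0 or 1, uniformly */

/* Uniform integer in [-4, 4] via rejection sampling (variable-time). */
int random_minus4_to_4(void)
{
    for (;;) {
        int r = (random_bit() << 3) | (random_bit() << 2)
              | (random_bit() << 1) |  random_bit();    /* r is uniform in [0, 15] */
        if (r <= 8)                                     /* accept only 0..8        */
            return r - 4;                               /* map to [-4, 4]          */
    }
}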
Based on the code you've shown me, your approach to building up bytes, ints, and longs is highly error-prone. For example, a better way to build up an 8-bit byte to achieve what you want is as follows (keeping in mind that I am not very familiar with Kotlin, so the syntax may be wrong):
var b = 0
for (i in 0 until 8) {
    b = b shl 1                      // Shift old bits
    if (bitStringBuilder[i] == '1') {
        b = b or 1                   // Set new bit
    }                                // else: leave the new bit as 0
}
value = b.toByte() as T
Also, if MediatorLiveData is not thread safe, then neither is your approach to gathering bits using a StringBuilder (especially because StringBuilder is not thread safe).
The approach you suggest, combining eight bits of the boolean generator to make one uniform integer, will work in theory. However, in practice there are several issues:
You don't mention what kind of hardware it is. In most cases, the hardware won't be likely to generate uniformly random Boolean bits unless the hardware is a so-called true random number generator designed for this purpose. For example, the hardware might generate uniformly distributed bits but have periodic behavior.
Entropy means how hard it is to predict the values a generator produces, compared to ideal random values. For example, a 64-bit data block with 32 bits of entropy is as hard to predict as an ideal random 32-bit data block. Characterizing a hardware device's entropy (or ability to produce unpredictable values) is far from trivial. Among other things, this involves entropy tests that have to be done across the full range of operating conditions suitable for the hardware (e.g., temperature, voltage).
Most hardware cannot produce uniform random values, so usually an additional step, called randomness extraction, entropy extraction, unbiasing, whitening, or deskewing, is done to transform the values the hardware generates into uniformly distributed random numbers. However, it works best if the hardware's entropy is characterized first (see previous point).
Finally, you still have to test whether the whole process delivers numbers that are "adequately random" for your purposes. There are several statistical tests that attempt to do so, such as NIST's Statistical Test Suite or TestU01.
For more information, see "Nondeterministic Sources and Seed Generation".
After your edits to this page, it seems you're going about the problem the wrong way. To produce a uniform random number, you don't add uniformly distributed random bits (e.g., bit() + bit() + bit()), but concatenate them (e.g., (bit() << 2) | (bit() << 1) | bit()). However, again, this will work in theory, but not in practice, for the reasons I mention above.
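In C, "concatenate, don't add" would look something like this; again random_bit() is a placeholder for the hardware source, and the caveats above about bias and whitening still apply:

#include <stdint.h>

extern int random_bit(void);   /* hypothetical: returns 0 or 1, uniformly */

/* Uniform 8-bit value built by concatenating eight independent bits. */
uint8_t random_byte(void)
{
    uint8_t b = 0;
    for (int i = 0; i < 8; i++)
        b = (uint8_t)((b << 1) | (random_bit() & 1));
    return b;
}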

gpu optimization when multiplying by powers of 2 in a shader

Do modern GPUs optimize multiplication by powers of 2 by doing a bit shift? For example suppose I do the following in a shader:
float t = 0;
t *= 16;
t *= 17;
Is it possible the first multiplication will run faster than the second?
Floating point multiplication cannot be done by a bit shift. However, in theory floating point multiplication by power-of-2 constants can be optimized. A floating point value is normally stored in the form S * M * 2^E, where S is the sign, M is the mantissa and E is the exponent. Multiplying by a power-of-2 constant can be done by adding/subtracting to the exponent part of the float, without modifying the other parts. But in practice, I would bet that on GPUs a generic multiply instruction is always used.
I had an interesting observation regarding power-of-2 constants while studying the disassembly output of PVRShaderEditor (PowerVR GPUs). I noticed that a certain range of power-of-2 constants ([2^(-16), 2^10] in my case) uses a special notation, e.g. C65, implying that they are predefined. Whereas arbitrary constants, such as 3.0 or 2.3, use shared register notation (e.g. SH12), which implies they are stored as a uniform and probably incur some setup cost. Thus using power-of-2 constants may yield some optimization benefit, at least on some hardware.
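To make the "just adjust the exponent" idea concrete on the CPU side: the C standard library exposes exactly this operation as ldexpf. This only illustrates the principle; it says nothing about what any particular GPU driver actually emits:

#include <stdio.h>
#include <math.h>

int main(void)
{
    float t = 3.7f;

    float a = t * 16.0f;       /* ordinary multiply by 2^4                     */
    float b = ldexpf(t, 4);    /* scale by 2^4 by adjusting the exponent field */

    printf("%g %g %d\n", a, b, a == b);   /* both are 59.2; the comparison prints 1 */
    return 0;
}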

Time complexity and integer inputs

I came across a question asking to describe the computational complexity in Big O of the following code:
i = 1;
while(i < N) {
i = i * 2;
}
I found this Stack Overflow question asking for the answer, with the most voted answer saying it is Log2(N).
On first thought that answer looks correct; however, I remember learning about pseudo-polynomial runtimes, and how computational complexity measures difficulty with respect to the length of the input, rather than its value.
So for integer inputs, the complexity should be in terms of the number of bits in the input.
Therefore, shouldn't this function be O(N)? Every iteration of the loop increases the number of bits in i by 1, until it has around the same number of bits as N.
This code might be found in a function like the one below:
function FindNextPowerOfTwo(N) {
i = 1;
while(i < N) {
i = i * 2;
}
return i;
}
Here, the input can be thought of as a k-bit unsigned integer which we might as well imagine as a string of k bits. The input size is therefore k = floor(log2(N)) + 1 bits.
The assignment i = 1 should be interpreted as creating a new bit string and assigning it the length-one bit string 1. This is a constant time operation.
The loop condition i < N compares the two bit strings to see which represents the larger number. If implemented intelligently, this will take time proportional to the length of the shorter of the two bit strings which will always be i. As we will see, the length of i's bit string begins at 1 and increases by 1 until it is greater than or equal to the length of N's bit string, k. When N is not a power of two, the length of i's bit string will reach k + 1. Thus, the time taken by evaluating the condition is proportional to 1 + 2 + ... + (k + 1) = (k + 1)(k + 2)/2 = O(k^2) in the worst case.
Inside the loop, we multiply i by two over and over. The complexity of this operation depends on how multiplication is to be interpreted. Certainly, it is possible to represent our bit strings in such a way that we could intelligently multiply by two by performing a bit shift and inserting a zero on the end. This could be made to be a constant-time operation. If we are oblivious to this optimization and perform standard long multiplication, we scan i's bit string once to write out a row of 0s and again to write out i with an extra 0, and then we perform regular addition with carry by scanning both of these strings. The time taken by each step here is proportional to the length of i's bit string (really, say that plus one) so the whole thing is proportional to i's bit-string length. Since the bit-string length of i assumes values 1, 2, ..., (k + 1), the total time is 2 + 3 + ... + (k + 2) = (k + 2)(k + 3)/2 = O(k^2).
Returning i is a constant time operation.
Taking everything together, the runtime is bounded from above and from below by functions of the form c * k^2, so the worst-case complexity is Theta(k^2) = Theta(log(N)^2).
In the given example you are not increasing the value of i by 1 but doubling it every time, so it approaches N much faster. By multiplying it by two you are reducing the size of the search space (between i and N) by half, i.e. reducing the input space by a factor of 2. Thus the complexity of your program is log_2(N).
If by chance you'd be doing -
i = i * 3;
The complexity of your program would be log_3 (N).
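A quick empirical check of the iteration count, in plain C (the test values are arbitrary):

#include <stdio.h>

/* Counts how many doublings it takes for i to reach N. */
static unsigned steps_to_reach(unsigned long n)
{
    unsigned long i = 1;
    unsigned steps = 0;
    while (i < n) {
        i *= 2;
        steps++;
    }
    return steps;
}

int main(void)
{
    printf("%u %u %u\n",
           steps_to_reach(16),        /* 4  = log2(16)          */
           steps_to_reach(1000),      /* 10 = ceil(log2(1000))  */
           steps_to_reach(1000000));  /* 20 = ceil(log2(1e6))   */
    return 0;
}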
It depends on an important question: "Is multiplication a constant-time operation?"
In the real world it is usually considered constant, because you have fixed 32- or 64-bit numbers and multiplying them always takes the same (= constant) time.
On the other hand, you then have the limitation that N must fit in 32/64 bits (or whatever width you use).
In theory, where you do not consider multiplication a constant-time operation, or for some special algorithms where N can grow too much to ignore the cost of multiplication, you are right: you have to start thinking about the complexity of multiplying.
The complexity of multiplying by a constant number (in this case 2): you have to go through each bit each time, and there are log_2(N) bits.
And you have to do this log_2(N) times before you reach N,
which ends with a complexity of log_2(N) * log_2(N) = O(log_2^2(N)).
PS: Akash has a good point that multiplying by 2 can be written as a constant-time operation, because the only thing you need in binary is to "add a zero" (similar to multiplying by 10 in "human readable" format; you just add a zero: 4333*10 = 43330).
However, if multiplying is not that simple (you have to go through all the bits), the previous answer is correct.

Differences between mult and div operations on floating point numbers

Is there any difference in computation precision between these 2 cases:
1) x = y / 1000d;
2) x = y * 0.001d;
Edit: I shouldn't have added the C# tag. The question is only from a 'floating-point' point of view. I don't want to know which is faster; I need to know which case will give me better precision.
No, they're not the same - at least not with C#, using the version I have on my machine (just standard .NET 4.5.1) on my processor - there are enough subtleties involved that I wouldn't like to claim it'll do the same on all machines, or with all languages. This may very well be a language-specific question after all.
Using my DoubleConverter class to show the exact value of a double, and after a few bits of trial and error, here's a C# program which at least on my machine shows a difference:
using System;
class Program
{
static void Main(string[] args)
{
double input = 9;
double x1 = input / 1000d;
double x2 = input * 0.001d;
Console.WriteLine(x1 == x2);
Console.WriteLine(DoubleConverter.ToExactString(x1));
Console.WriteLine(DoubleConverter.ToExactString(x2));
}
}
Output:
False
0.00899999999999999931998839741709161899052560329437255859375
0.009000000000000001054711873393898713402450084686279296875
I can reproduce this in C with the Microsoft C compiler - apologies if it's horrendous C style, but I think it at least demonstrates the differences:
#include <stdio.h>
int main(void) {
double input = 9;
double x1 = input / 1000;
double x2 = input * 0.001;
printf("%s\r\n", x1 == x2 ? "Same" : "Not same");
printf("%.18f\r\n", x1);
printf("%.18f\r\n", x2);
}
Output:
Not same
0.008999999999999999
0.009000000000000001
I haven't looked into the exact details, but it makes sense to me that there is a difference, because dividing by 1000 and multiplying by "the nearest double to 0.001" aren't the same logical operation... because 0.001 can't be exactly represented as a double. The nearest double to 0.001 is actually:
0.001000000000000000020816681711721685132943093776702880859375
... so that's what you end up multiplying by. You're losing information early, and hoping that it corresponds to the same information that you lose otherwise by dividing by 1000. It looks like in some cases it isn't.
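You can see that "nearest double to 0.001" for yourself with a couple of printf calls; nothing here is specific to C#, the same 64-bit pattern is involved (assuming IEEE 754 doubles):

#include <stdio.h>

int main(void)
{
    double thousandth = 0.001;

    printf("%.60f\n", thousandth);   /* 0.00100000000000000002081668171172168513... */
    printf("%a\n",    thousandth);   /* hex float: 0x1.0624dd2f1a9fcp-10             */
    return 0;
}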
You are programming in base 10, but floating point is base 2. You CAN represent 1000 in base 2, but you cannot represent 0.001 in base 2, so you have chosen bad numbers to ask your question. On a computer x/1000 != x*0.001; you might get lucky most of the time with rounding and extra precision, but it is not a mathematical identity.
Now maybe that was your question; maybe you wanted to know why x/1000 != x*0.001. The answer to that question is that this is a binary computer and it uses base 2, not base 10; there are conversion problems with 0.001 when going to base 2, and you cannot exactly represent that fraction in an IEEE floating point number.
In base 10 we know that if we have a fraction with a factor of 3 in the denominator (and lacking one in the numerator to cancel it out) we end up with an infinitely repeated pattern, basically we cannot accurately represent that number with a finite set of digits.
1/3 = 0.33333...
The same problem arises when you try to represent 1/10 in base 2. 10 = 2*5; the 2 is okay (1/2), but the 5 is the real problem (1/5).
1/10th (1/1000 works the same way). Elementary long division:
        0.000110011
        -----------
 1010 | 1.000000000
          1010
          ----
           1100
           1010
           ----
             10000
              1010
              ----
               1100
               1010
               ----
                 10
We have to keep pulling down zeros until we get 10000: 10 goes into 16 one time, remainder 6; drop the next zero, and 10 goes into 12 one time, remainder 2. Then we repeat the pattern, so you end up with 001100110011 repeated forever. Floating point has a fixed number of bits, so we cannot represent an infinite pattern.
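The same long division can be automated; this small C loop just repeats the "bring down a zero, divide, keep the remainder" step and prints the repeating 0011 pattern of 1/10 (purely illustrative):

#include <stdio.h>

int main(void)
{
    /* Binary long division of 1/10: each bit is remainder*2 / 10,
       and the new remainder is remainder*2 % 10. */
    int remainder = 1;
    printf("1/10 in binary = 0.");
    for (int i = 0; i < 24; i++) {
        remainder *= 2;
        putchar('0' + remainder / 10);
        remainder %= 10;
    }
    printf("...\n");   /* 000110011001100110011001... repeats forever */
    return 0;
}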
Now maybe your question has to do with something like: is dividing by 4 the same as multiplying by 1/4th? That is a different question. The answer is that it should be the same; a divide consumes more cycles and/or logic than a multiply, but it works out to the same answer in the end.
Probably not. The compiler (or the JIT) is likely to convert the first case to the second anyway, since multiplication is typically faster than division. You would have to check this by compiling the code (with or without optimizations enabled) and then examining the generated IL with a tool like IL Disassembler or .NET Reflector, and/or examining the native code with a debugger at runtime.
No, there is no difference, except if you set a custom rounding mode.
gcc produces ((double)0.001 - (double)1.0/1000) == 0.0e0
When the compiler converts 0.001 to binary, it divides 1 by 1000. It uses a software floating point simulation compatible with the target architecture to do this.
For higher precision there are long double (80-bit) and software simulation of arbitrary precision.
PS: I used gcc on a 64-bit machine, with both SSE and the x87 FPU.
PPS: With some optimizations, 1/1000.0 could be more precise on x87, since x87 uses an 80-bit internal representation and 1000 == 1000.0. This is true if you use the result for the next calculations promptly; if you return it or write it to memory, the 80-bit value is computed and then rounded to 64 bits. But SSE is more commonly used for double.