How to do numerical integration with quantum harmonic oscillator wavefunction? - physics

How should I do numerical integration (what numerical method, and what tricks should I use) for a one-dimensional integral over an infinite range, where one or more functions in the integrand are 1d quantum harmonic oscillator wave functions? Among other things, I want to calculate matrix elements of some function in the harmonic oscillator basis:
phi_n(x) = N_n H_n(x) exp(-x^2/2)
where H_n(x) is the n-th Hermite polynomial, and
V_{m,n} = \int_{-infinity}^{infinity} phi_m(x) V(x) phi_n(x) dx
I also need the case where the harmonic oscillator wavefunctions in the integrand have different widths.
The problem is that the wavefunctions phi_n(x) are oscillatory, which becomes troublesome for large n, and algorithms like the adaptive Gauss-Kronrod quadrature from GSL (GNU Scientific Library) take a long time and produce large errors.
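(A side note on evaluating phi_n(x) itself: computing N_n and H_n(x) separately can overflow for large n, while the three-term recurrence for the full wavefunction stays well scaled in the classically allowed region. A minimal sketch of my own, in the dimensionless hbar = m = omega = 1 convention:)
// compile: g++ phi_recurrence.cc -o phi_recurrence
#include <cmath>
#include <cstdio>
// phi_0(x) = pi^{-1/4} exp(-x^2/2)
// phi_n(x) = sqrt(2/n) x phi_{n-1}(x) - sqrt((n-1)/n) phi_{n-2}(x)
double phi(int n, double x) {
    double pm1 = std::pow(std::acos(-1.0), -0.25) * std::exp(-0.5 * x * x);
    if (n == 0) return pm1;
    double p = std::sqrt(2.0) * x * pm1;
    for (int k = 2; k <= n; ++k) {
        double pk = std::sqrt(2.0 / k) * x * p - std::sqrt((k - 1.0) / k) * pm1;
        pm1 = p;
        p = pk;
    }
    return p;
}
int main() {
    std::printf("phi_50(1.3) = %.12e\n", phi(50, 1.3)); // stays finite even for large n
    return 0;
}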

An incomplete answer, since I'm a little short on time at the moment; if others can't complete the picture, I can supply more details later.
Apply orthogonality of the wavefunctions whenever and wherever possible. This should significantly cut down the amount of computation.
Do analytically whatever you can. Lift constants, split integrals by parts, whatever. Isolate the region of interest; most wavefunctions are band-limited, and reducing the area of interest will do a lot to save work.
For the quadrature itself, you probably want to split the wavefunctions into three pieces and integrate each separately: the oscillatory bit in the center plus the exponentially-decaying tails on either side. If the wavefunction is odd, you get lucky and the tails will cancel each other, meaning you only have to worry about the center. For even wavefunctions, you only have to integrate one and double it (hooray for symmetry!). Otherwise, integrate the tails using a high order Gauss-Laguerre quadrature rule. You might have to calculate the rules yourself; I don't know if tables list good Gauss-Laguerre rules, as they're not used too often. You probably also want to check the error behavior as the number of nodes in the rule goes up; it's been a long time since I used Gauss-Laguerre rules and I don't remember if they exhibit Runge's phenomenon. Integrate the center part using whatever method you like; Gauss-Kronrod is a solid choice, of course, but there's also Fejer quadrature (which sometimes scales better to high numbers of nodes, which might work nicer on an oscillatory integrand) and even the trapezoidal rule (which exhibits stunning accuracy with certain oscillatory functions). Pick one and try it out; if results are poor, give another method a shot.
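As a tiny illustration of the truncate-to-the-centre idea with the trapezoidal rule (the toy integrand, cutoff and step count below are mine, standing in for phi_m * V * phi_n; treat it as a sketch, not a tuned recipe):
#include <cmath>
#include <cstdio>
#include <functional>
// Plain composite trapezoidal rule on [-L, L].
double trapezoid(const std::function<double(double)>& f, double L, int steps) {
    const double h = 2.0 * L / steps;
    double sum = 0.5 * (f(-L) + f(L)); // end points get half weight
    for (int i = 1; i < steps; ++i) sum += f(-L + i * h);
    return h * sum;
}
int main() {
    // Toy integrand: a Gaussian times an oscillation, standing in for phi_m*V*phi_n.
    auto f = [](double x) { return std::exp(-x * x) * std::cos(7.0 * x); };
    // The exact value over the whole line is sqrt(pi)*exp(-49/4); for smooth
    // integrands that die off before the cutoff, the trapezoidal rule converges fast.
    std::printf("integral ~= %.12e\n", trapezoid(f, 8.0, 400));
    return 0;
}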
Hardest question ever on SO? Hardly :)

I'd recommend a few other things:
Try transforming the function onto a finite domain to make the integration more manageable (a small sketch of one such substitution follows after this list).
Use symmetry where possible - break it up into the sum of two integrals from negative infinity to zero and zero to infinity and see if the function is symmetric or anti-symmetric. It could make your computation easier.
Look into Gauss-Laguerre quadrature and see if it can help you.
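For the first suggestion, here is a small sketch of one possible substitution (the map x = t/(1-t^2) and the midpoint rule are my own choices; any open rule on (-1,1) that avoids the endpoints would do):
#include <cmath>
#include <cstdio>
#include <functional>
// x = t/(1-t^2) maps t in (-1,1) onto x in (-inf,inf), with dx = (1+t^2)/(1-t^2)^2 dt.
double integrate_infinite(const std::function<double(double)>& f, int n) {
    const double h = 2.0 / n;
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double t = -1.0 + (i + 0.5) * h; // midpoints, never t = +/-1
        double s = 1.0 - t * t;
        sum += f(t / s) * (1.0 + t * t) / (s * s);
    }
    return sum * h;
}
int main() {
    // Sanity check: the integral of exp(-x^2) over the whole line is sqrt(pi).
    auto g = [](double x) { return std::exp(-x * x); };
    std::printf("%.10f (sqrt(pi) = %.10f)\n",
                integrate_infinite(g, 2000), std::sqrt(std::acos(-1.0)));
    return 0;
}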

The WKB approximation?

I am not going to explain or qualify any of this right now. This code is written as is and probably incorrect. I am not even sure if it is the code I was looking for; I just remember that years ago I did this problem, and upon searching my archives I found this. You will need to plot the output yourself; some instruction is provided. I will say that integration over an infinite range is a problem that I addressed, and upon execution the code reports the round-off error at 'infinity' (which numerically just means 'large').
// compile g++ base.cc -lm
#include <iostream>
#include <cstdlib>
#include <fstream>
#include <math.h>
using namespace std;
int main ()
{
double xmax,dfx,dx,x,hbar,k,dE,E,E_0,m,psi_0,psi_1,psi_2;
double w,num;
int n,temp,parity,order;
double last;
double propogator(double E,int parity);
double eigen(double E,int parity);
double f(double x, double psi, double dpsi);
double g(double x, double psi, double dpsi);
double rk4(double x, double psi, double dpsi, double E);
ofstream datas ("test.dat");
E_0= 1.602189*pow(10.0,-19.0);// eV-to-joules conversion
dE=E_0*.001;
//w^2=k/m v=1/2 k x^2 V=??? = E_0/xmax x^2 k-->
//w=sqrt( (2*E_0)/(m*xmax) );
//E=(0+.5)*hbar*w;
cout << "Enter what energy level your looking for, as an (0,1,2...) INTEGER: ";
cin >> order;
E=0;
for (n=0; n<=order; n++)
{
parity=0;
//if its even parity is 1 (true)
temp=n;
if ( (n%2)==0 ) {parity=1; }
cout << "Energy " << n << " has these parameters: ";
E=eigen(E,parity);
if (n==order)
{
propogator(E,parity);
cout <<" The postive values of the wave function were written to sho.dat \n";
cout <<" In order to plot the data should be reflected about the y-axis \n";
cout <<" evenly for even energy levels and oddly for odd energy levels\n";
}
E=E+dE;
}
}
double propogator(double E,int parity)
{
ofstream datas ("sho.dat") ;
double hbar =1.054*pow(10.0,-34.0);
double m =9.109534*pow(10.0,-31.0);
double E_0= 1.602189*pow(10.0,-19.0);
double dx =pow(10.0,-10);
double xmax= 100*pow(10.0,-10.0)+dx;
double dE=E_0*.001;
double last=1;
double x=dx;
double psi_2=0.0;
double psi_0=0.0;
double psi_1=1.0;
// cout <<parity << " parity passed \n";
psi_0=0.0;
psi_1=1.0;
if (parity==1)
{
psi_0=1.0;
psi_1=m*(1.0/(hbar*hbar))* dx*dx*(0-E)+1 ;
}
do
{
datas << x << "\t" << psi_0 << "\n";
psi_2=(2.0*m*(dx/hbar)*(dx/hbar)*(E_0*(x/xmax)*(x/xmax)-E)+2.0)*psi_1-psi_0;
//cout << psi_1 << "=psi_1\n";
psi_0=psi_1;
psi_1=psi_2;
x=x+dx;
} while ( x<= xmax);
//I return 666 as a dummy value sometimes to check the function has run
return 666;
}
double eigen(double E,int parity)
{
double hbar =1.054*pow(10.0,-34.0);
double m =9.109534*pow(10.0,-31.0);
double E_0= 1.602189*pow(10.0,-19.0);
double dx =pow(10.0,-10);
double xmax= 100*pow(10.0,-10.0)+dx;
double dE=E_0*.001;
double last=1;
double x=dx;
double psi_2=0.0;
double psi_0=0.0;
double psi_1=1.0;
do
{
psi_0=0.0;
psi_1=1.0;
if (parity==1)
{ psi_0=1.0; psi_1=m*(1.0/(hbar*hbar))* dx*dx*(0-E)+1 ; } // even-parity start: set the outer psi_0 and psi_1
x=dx;
do
{
psi_2=(2.0*m*(dx/hbar)*(dx/hbar)*(E_0*(x/xmax)*(x/xmax)-E)+2.0)*psi_1-psi_0;
psi_0=psi_1;
psi_1=psi_2;
x=x+dx;
} while ( x<= xmax);
if ( sqrt(psi_2*psi_2)<=1.0*pow(10.0,-3.0))
{
cout << E << " is an eigen energy and " << psi_2 << " is psi of 'infinity' \n";
return E;
}
else
{
if ( (last >0.0 && psi_2<0.0) ||( psi_2>0.0 && last<0.0) )
{
E=E-dE;
dE=dE/10.0;
}
}
last=psi_2;
E=E+dE;
} while (E<=E_0);
return E; // fell through without meeting the convergence test; return the last energy tried
}
If this code seems correct, wrong, or interesting, or if you have specific questions, ask and I will answer them.

I am a student majoring in physics, and I have also run into this problem. I have been thinking about this question for the last few days and have come up with my own answer. I think it may help you.
1. In GSL there are functions that can help you integrate an oscillatory function: qawo and qawf. You can pick a value a and split the integration into two parts, [0,a] and [a,+infinity). On the first interval you can use any GSL integration routine you want, and on the second you can use qawo or qawf.
2. Or you can integrate the function up to an upper limit b, i.e. over [0,b]. The integral can then be calculated using Gauss-Legendre quadrature, which is provided in GSL. There may be some difference between the true value and the computed value, but if you set b properly the difference can be neglected, as long as it is smaller than the accuracy you want. The GSL routine only has to be called once to obtain the nodes and their corresponding weights, which can then be reused many times, because the integral is just the sum of f(x_i)*w_i (see the sketch below); for more details search for Gauss-Legendre quadrature on Wikipedia. Multiplications and additions are much faster than a full adaptive integration.
3. There is also a routine that can calculate the integral over an infinite range directly: qagi; you can look it up in the GSL user's guide. But it has to be called every time you need an integral, which may be time consuming; I'm not sure how long it would take in your program.
I suggest choice No. 2.
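To make option 2 concrete, here is a rough sketch using GSL's fixed-order Gauss-Legendre interface (gsl_integration_glfixed_table, present in recent GSL versions); the cutoff b, the node count and the placeholder integrand are mine:
// compile: g++ glfixed_demo.cc -lgsl -lgslcblas -lm
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <gsl/gsl_integration.h>
// Placeholder integrand; in the real problem this would be phi_m(x)*V(x)*phi_n(x).
double integrand(double x, void* params) {
    (void)params;
    return std::exp(-x * x) * x * x;
}
int main() {
    const double b = 10.0;     // truncation point: the Gaussian tail beyond b is negligible
    const std::size_t n = 128; // number of Gauss-Legendre nodes
    // The table of nodes and weights is computed once and reused for every integral.
    gsl_integration_glfixed_table* t = gsl_integration_glfixed_table_alloc(n);
    gsl_function F;
    F.function = &integrand;
    F.params = nullptr;
    std::printf("integral over [-b,b] = %.12f\n", gsl_integration_glfixed(&F, -b, b, t));
    // Alternatively, pull out the nodes and weights and accumulate f(x_i)*w_i
    // yourself for as many different integrands as you like:
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double xi, wi;
        gsl_integration_glfixed_point(-b, b, i, &xi, &wi, t);
        sum += wi * integrand(xi, nullptr);
    }
    std::printf("same integral via explicit nodes = %.12f\n", sum);
    gsl_integration_glfixed_table_free(t);
    return 0;
}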

If you are going to work with harmonic oscillator functions with n less than 100, you might want to try:
http://www.mymathlib.com/quadrature/gauss_hermite.html
The program computes an integral via Gauss-Hermite quadrature with 100 zeros and weights (the zeros of H_100). Once you go beyond H_100 the integrals are not as accurate.
Using this integration method I wrote a program calculating exactly what you want to calculate and it works fairly well. Also, there might be a way to go beyond n=100 by using the asymptotic form of the Hermite-polynomial zeroes but I haven't looked into it.
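For completeness, here is a rough self-contained sketch of the same idea without an external table: build an N-point Gauss-Hermite rule from the orthonormal Hermite polynomials (roots by scan-and-bisect, weights from the Christoffel numbers 1/sum_k p_k(x_i)^2) and use it for V_{m,n}. Everything below (function names, N = 64, the quartic test potential) is my own illustration, in hbar = m = omega = 1 units:
// compile: g++ gauss_hermite_demo.cc -O2 -o gh_demo
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>
// Orthonormal Hermite polynomials p_k(x) w.r.t. the weight exp(-x^2):
// p_0 = pi^{-1/4}, p_1 = sqrt(2)*x*p_0,
// p_k = sqrt(2/k)*x*p_{k-1} - sqrt((k-1)/k)*p_{k-2}.
// The oscillator wavefunction is phi_k(x) = p_k(x)*exp(-x^2/2).
double p_herm(int k, double x) {
    double pm1 = std::pow(std::acos(-1.0), -0.25);
    if (k == 0) return pm1;
    double p = std::sqrt(2.0) * x * pm1;
    for (int j = 2; j <= k; ++j) {
        double pj = std::sqrt(2.0 / j) * x * p - std::sqrt((j - 1.0) / j) * pm1;
        pm1 = p;
        p = pj;
    }
    return p;
}
// N-point Gauss-Hermite rule: nodes are the roots of p_N, weights are the
// Christoffel numbers 1 / sum_{k<N} p_k(x_i)^2.
void gauss_hermite(int N, std::vector<double>& x, std::vector<double>& w) {
    x.clear(); w.clear();
    const double L = std::sqrt(2.0 * N + 1.0) + 1.0; // all roots lie inside (-L, L)
    const int M = 200 * N;                           // scan resolution
    double xa = -L + L / M, fa = p_herm(N, xa);
    for (int i = 1; i < M; ++i) {
        double xb = -L + (2.0 * i + 1.0) * L / M, fb = p_herm(N, xb);
        if (fa * fb < 0.0) {                         // sign change: bisect to the root
            double a = xa, b = xb, f_a = fa;
            for (int it = 0; it < 60; ++it) {
                double mid = 0.5 * (a + b), fm = p_herm(N, mid);
                if (f_a * fm <= 0.0) b = mid; else { a = mid; f_a = fm; }
            }
            double root = 0.5 * (a + b), s = 0.0;
            for (int k = 0; k < N; ++k) { double pk = p_herm(k, root); s += pk * pk; }
            x.push_back(root);
            w.push_back(1.0 / s);
        }
        xa = xb; fa = fb;
    }
}
// V_{m,n} = integral phi_m(x) V(x) phi_n(x) dx; the Gaussian exp(-x^2) from the
// two wavefunctions is exactly the Gauss-Hermite weight, so only p_m*p_n*V remains.
double matrix_element(int m, int n, double (*V)(double),
                      const std::vector<double>& x, const std::vector<double>& w) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += w[i] * p_herm(m, x[i]) * p_herm(n, x[i]) * V(x[i]);
    return sum;
}
double quartic(double x) { return x * x * x * x; } // sample potential V(x) = x^4
int main() {
    std::vector<double> x, w;
    gauss_hermite(64, x, w); // exact while deg(V) + m + n <= 127
    std::printf("V_{2,4} for V(x)=x^4: %.12f\n", matrix_element(2, 4, quartic, x, w));
    // ladder-operator algebra gives 7*sqrt(3) ~ 12.124355653 for this element
    return 0;
}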

Related

Approximation using gmp mpf_class

I am writing a UnitTest using Catch2.
I want to check if two vectors are equal. They look like the following using gmplib:
std::vector<mpf_class> result
Due to me 'faking' the expected_result vector, I get the following message after a failed test:
unittests/test.cpp:01: FAILED:
REQUIRE( actual_result == expected_result )
with expansion:
{ 0.5, 0.166667, 0.166667, 0.166667 }
==
{ 0.5, 0.166667, 0.166667, 0.166667 }
So I was looking for a function that could do an approximation for me.
I just wasn't successful in finding a solution that worked out for me.
I found some Comparison Functions but they do not work on my project.
EDIT:
The "minimal, reproducible example would simply be:
TEST_CASE("DemoTest") {
// simplified:
mpf_class a = 1;
mpf_class b = 6;
mpf_class actual_result = a / b;
mpf_class expected_result= 0.16666666667;
REQUIRE(actual_result == expected_result);
}
The "only" difference to my real application is that the results are stored in vectors. But because I am only "faking" the result by saying it is "0.1666666667" it probably doesn't fit the == anymore. So I need a function that takes an approximation and compares the range like epsilon = +-0.001.
Edit:
After implementing the solution @Arc suggested, it worked well until I had some values that were not completely "even".
So I have a failure with the following values:
actual 0.16666666666666666666700000000000000000000000000000
expected 0.16666666666666665741500000000000000000000000000000
Even though my "expected" value looks like this:
mpf_class expected = 0.16666666666666666666700000000000000000000000000000
Getting back to my original question: is there a way I can compare an approximation of the number with an epsilon of like +-0.0001, or what would be the best way to fix this issue?
First, we need to see some Minimal, Reproducible Example to be sure of what is happening. You can for example cut down some code from your test.cpp until you are left with just a few lines of code, but the issue still happens. Also, please provide compilation and running instructions. Frequently, a little bit of explanation on what your goals are may also help. As Catch2 is available on GitHub you don't need to provide it.
Without seeing the code, the best I can guess is that your code is trying to compare mpf_t types in the mpf_class using the == operator, which I'm afraid has not been overloaded (see here). You should compare mpf_ts with the cmp function, since the C type mpf_t is actually a struct containing the pointer to the actual significand limbs. Check some usage examples in the tests/cxx/ directory of GMP (like here).
I note you are using GNU MP version 4.1, which is very old; you probably want to move to the latest version, 6.2.1, if possible. Also, for floats it's recommended to use the GNU MPFR library instead of GMP floats.
EDIT: I did not yet manage to run Catch2, but the issue with your code is that the expected_result is actually not equal to the actual_result. In GMP, mpf_t variables are created with a 64-bit significand precision (on 64-bit machines), so the division a / b actually results in a binary value that prints as 0.166666666666666666667 (that's 19 sixes after the digit 1). Try printing the result with gmp_printf("%.50Ff\n", actual_result);, because the standard cout output will only give you the value rounded to 6 digits: 0.166667.
But the problem is you can't just assign this like expected_result = 0.166666666666666666667, because in C/C++ numeric constants are parsed as double, thus you have to use the string assignment overload to get more precision.
And you also can't easily (or, in general, justifiably) coin a decimal string that will convert to exactly the same binary value given by a / b, because decimal-to-float conversion has its subtleties; see for example here and here.
So, it all depends on your application and the kind of numerical validation you aim to do. If you know that your decimal validation values are correct to some known precision, and if you set the mpf_t variables to a sufficient precision (using for example mpf_set_prec), then you can use a tolerance comparison, like so.
In C++ (without Catch2), it works like this:
#include <iostream>
#include <gmpxx.h>
using namespace std;
int main (void)
{
mpf_class a = 1;
mpf_class b = 6;
mpf_class actual = a / b;
mpf_class expected;
mpf_class tol;
expected = "0.166666666666666666666666666666667";
tol = "1e-30";
cout << "actual " << actual << "\n";
cout << "expected " << expected << "\n";
gmp_printf("actual %.50Ff\n", actual);
gmp_printf("expected %.50Ff\n", expected);
gmp_printf("tol %.50Ff\n", tol);
mpf_class diff = expected - actual;
gmp_printf("diff %.50Ff\n", diff);
if (abs(actual - expected) < tol)
cout << "ok\n";
else
cout << "nop\n";
return 0;
}
And compile with -lgmpxx -lgmp options.
It produces the output:
actual 0.166667
expected 0.166667
actual 0.16666666666666666666700000000000000000000000000000
expected 0.16666666666666666666700000000000000000000000000000
tol 0.00000000000000000000000000000100000000000000000000
diff 0.00000000000000000000000000000000033333529249058470
ok
If I understand Catch2 correctly, it should be OK if you assign expected_result from a string and then compare with REQUIRE(abs(actual - expected) < tol).
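If it helps, here is a rough sketch of how the vector case could look (assuming Catch2 v2's single-header build and gmpxx; the helper name approx_equal and the tolerance are mine, and with Catch2 v3 the include and main setup differ):
#define CATCH_CONFIG_MAIN
#include <catch2/catch.hpp>
#include <cstddef>
#include <gmpxx.h>
#include <vector>
// Hypothetical helper: element-wise comparison with an absolute tolerance.
static bool approx_equal(const std::vector<mpf_class>& a,
                         const std::vector<mpf_class>& b,
                         const mpf_class& tol) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        mpf_class diff = abs(a[i] - b[i]);
        if (diff > tol) return false;
    }
    return true;
}
TEST_CASE("DemoTest, vector version") {
    std::vector<mpf_class> actual = { mpf_class(1) / 6, mpf_class(1) / 2 };
    std::vector<mpf_class> expected;
    expected.push_back(mpf_class("0.16666666666666666666666666666667"));
    expected.push_back(mpf_class("0.5"));
    mpf_class tol("1e-15");
    REQUIRE(approx_equal(actual, expected, tol));
}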

How to 'checksum' an array of noisy floating point numbers?

What is a quick and easy way to 'checksum' an array of floating point numbers, while allowing for a specified small amount of inaccuracy?
e.g. I have two algorithms which should (in theory, with infinite precision) output the same array. But they work differently, and so floating point errors will accumulate differently, though the array lengths should be exactly the same. I'd like a quick and easy way to test if the arrays seem to be the same. I could of course compare the numbers pairwise, and report the maximum error; but one algorithm is in C++ and the other is in Mathematica and I don't want the bother of writing out the numbers to a file or pasting them from one system to another. That's why I want a simple checksum.
I could simply add up all the numbers in the array. If the array length is N, and I can tolerate an error of 0.0001 in each number, then I would check if abs(sum1-sum2)<0.0001*N. But this simplistic 'checksum' is not robust, e.g. to an error of +10 in one entry and -10 in another. (And anyway, probability theory says that the error probably grows like sqrt(N), not like N.) Of course, any checksum is a low-dimensional summary of a chunk of data so it will miss some errors, if not most... but simple checksums are nonetheless useful for finding non-malicious bug-type errors.
Or I could create a two-dimensional checksum, [sum(x[n]), sum(abs(x[n]))]. But is that the best I can do, i.e. is there a different function I might use that would be "more orthogonal" to the sum(x[n])? And if I used some arbitrary functions, e.g. [sum(f1(x[n])), sum(f2(x[n]))], then how should my 'raw error tolerance' translate into 'checksum error tolerance'?
I'm programming in C++, but I'm happy to see answers in any language.
i have a feeling that what you want may be possible via something like gray codes. if you could translate your values into gray codes and use some kind of checksum that was able to correct n bits you could detect whether or not the two arrays were the same except for n-1 bits of error, right? (each bit of error means a number is "off by one", where the mapping would be such that this was a variation in the least significant digit).
but the exact details are beyond me - particularly for floating point values.
i don't know if it helps, but what gray codes solve is the problem of pathological rounding. rounding sounds like it will solve the problem - a naive solution might round and then checksum. but simple rounding always has pathological cases - for example, if we use floor, then 0.9999999 and 1 are distinct. a gray code approach seems to address that, since neighbouring values are always single bit away, so a bit-based checksum will accurately reflect "distance".
[update:] more exactly, what you want is a checksum that gives an estimate of the hamming distance between your gray-encoded sequences (and the gray encoded part is easy if you just care about 0.0001 since you can multiply everything by 10000 and use integers).
and it seems like such checksums do exist: Any error-correcting code can be used for error detection. A code with minimum Hamming distance, d, can detect up to d − 1 errors in a code word. Using minimum-distance-based error-correcting codes for error detection can be suitable if a strict limit on the minimum number of errors to be detected is desired.
so, just in case it's not clear:
multiply by the reciprocal of the minimum error (e.g. 10000 for 0.0001) to get integers
convert to gray code equivalent
use an error detecting code with a minimum hamming distance larger than the error you can tolerate.
but i am still not sure that's right. you still get the pathological rounding in the conversion from float to integer. so it seems like you need a minimum hamming distance that is 1 + len(data) (worst case, with a rounding error on each value). is that feasible? probably not for large arrays.
maybe ask again with better tags/description now that a general direction is possible? or just add tags now? we need someone who does this for a living. [i added a couple of tags]
I've spent a while looking for a deterministic answer, and been unable to find one. If there is a good answer, it's likely to require heavy-duty mathematical skills (functional analysis).
I'm pretty sure there is no solution based on "discretize in some cunning way, then apply a discrete checksum", e.g. "discretize into strings of 0/1/?, where ? means wildcard". Any discretization will have the property that two floating-point numbers very close to each other can end up with different discrete codes, and then the discrete checksum won't tell us what we want to know.
However, a very simple randomized scheme should work fine. Generate a pseudorandom string S from the alphabet {+1,-1}, and compute csx=sum(X_i*S_i) and csy=sum(Y_i*S_i), where X and Y are my original arrays of floating point numbers. If we model the errors as independent Normal random variables with mean 0, then it's easy to compute the distribution of csx-csy. We could do this for several strings S, and then do a hypothesis test that the mean error is 0. The number of strings S needed for the test is fixed; it doesn't grow with the size of the arrays, so it satisfies my need for a "low-dimensional summary". This method also gives an estimate of the standard deviation of the error, which may be handy.
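A rough sketch of that scheme (the xorshift generator, seed, and sample data are mine; the same generator and seed would have to be reproduced on the Mathematica side so both ends see the same sign string S):
#include <cstdint>
#include <cstdio>
#include <vector>
// Tiny deterministic generator so both implementations can reproduce the same +/-1 string S.
static std::uint64_t rng_state = 88172645463325252ULL;
static int next_sign() {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return (rng_state & 1) ? +1 : -1;
}
// One random projection cs = sum_i S_i * x_i. Running this k times with k different
// fixed seeds gives a k-dimensional "fuzzy checksum".
static double signed_checksum(const std::vector<double>& x) {
    double cs = 0.0;
    for (double v : x) cs += next_sign() * v;
    return cs;
}
int main() {
    std::vector<double> x = {1.0, 2.0, 3.0000001, 4.0};
    std::printf("checksum = %.10f\n", signed_checksum(x));
    // If two arrays differ only by independent N(0, sigma^2) noise per entry, the
    // difference of their checksums is N(0, N*sigma^2), which sets the test threshold.
    return 0;
}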
Try this:
#include <complex>
#include <cmath>
#include <iostream>
// PARAMETERS
const size_t no_freqs = 3;
const double freqs[no_freqs] = {0.05, 0.16, 0.39}; // (for example)
int main() {
std::complex<double> spectral_amplitude[no_freqs];
for (size_t i = 0; i < no_freqs; ++i) spectral_amplitude[i] = 0.0;
size_t n_data = 0;
{
std::complex<double> datum;
while (std::cin >> datum) {
for (size_t i = 0; i < no_freqs; ++i) {
spectral_amplitude[i] += datum * std::exp(
std::complex<double>(0.0, 1.0) * freqs[i] * double(n_data)
);
}
++n_data;
}
}
std::cout << "Fuzzy checksum:\n";
for (size_t i = 0; i < no_freqs; ++i) {
std::cout << real(spectral_amplitude[i]) << "\n";
std::cout << imag(spectral_amplitude[i]) << "\n";
}
std::cout << "\n";
return 0;
}
It returns just a few, arbitrary points of a Fourier transform of the entire data set. These make a fuzzy checksum, so to speak.
How about computing a standard integer checksum on the data obtained by zeroing the least significant digits of the data, the ones that you don't care about?

Getting any point along an NSBezier path

For a program I'm writing, I need to be able to trace a virtual line (that is not straight) that an object must travel along. I was thinking of using NSBezierPath to draw the line, but I cannot find a way to get any point along the line, which I must do so I can move the object along it.
Can anyone suggest a way to find a point along an NSBezierPath? If thats not possible, can anyone suggest a method to do the above?
EDIT: The below code is still accurate, but there are much faster ways to calculate it. See Introduction to Fast Bezier and Even Faster Bezier.
There are two ways to approach this. If you just need to move something along the line, use a CAKeyframeAnimation. This is pretty straightforward and you never need to calculate the points.
If on the other hand you actually need to know the point for some reason, you have to calculate the Bézier yourself. For an example, you can pull the sample code for Chapter 18 from iOS 5 Programming Pushing the Limits. (It is written for iOS, but it applies equally to Mac.) Look in CurvyTextView.m.
Given control points P0_ through P3_, and an offset between 0 and 1 (see below), pointForOffset: will give you the point along the path:
static double Bezier(double t, double P0, double P1, double P2,
double P3) {
return
pow(1-t, 3) * P0
+ 3 * pow(1-t, 2) * t * P1
+ 3 * (1-t) * pow(t, 2) * P2
+ pow(t, 3) * P3;
}
- (CGPoint)pointForOffset:(double)t {
double x = Bezier(t, P0_.x, P1_.x, P2_.x, P3_.x);
double y = Bezier(t, P0_.y, P1_.y, P2_.y, P3_.y);
return CGPointMake(x, y);
}
NOTE: This code violates one of my cardinal rules of always using accessors rather than accessing ivars directly. It's because it's called many thousands of times, and eliminating the method call has a significant performance impact.
"Offset" is not a trivial thing to work out. It does not proceed linearly along the curve. If you need evenly spaced points along the curve, you'll need to calculate the correct offset for each point. This is done with this routine:
// Simplistic routine to find the offset along Bezier that is
// aDistance away from aPoint. anOffset is the offset used to
// generate aPoint, and saves us the trouble of recalculating it
// This routine just walks forward until it finds a point at least
// aDistance away. Good optimizations here would reduce the number
// of guesses, but this is tricky since if we go too far out, the
// curve might loop back on itself, leading to incorrect results.
// Tuning kStep is a good start.
- (double)offsetAtDistance:(double)aDistance
fromPoint:(CGPoint)aPoint
offset:(double)anOffset {
const double kStep = 0.001; // 0.0001 - 0.001 work well
double newDistance = 0;
double newOffset = anOffset + kStep;
while (newDistance <= aDistance && newOffset < 1.0) {
newOffset += kStep;
newDistance = Distance(aPoint,
[self pointForOffset:newOffset]);
}
return newOffset;
}
I leave Distance() as an exercise for the reader, but it's in the example code of course.
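(For reference, a minimal sketch of what Distance() might look like, assuming CGPoint from CoreGraphics on macOS/iOS:)
#include <CoreGraphics/CoreGraphics.h>
#include <math.h>
// Straight-line distance between two points.
static double Distance(CGPoint a, CGPoint b) {
    return hypot(b.x - a.x, b.y - a.y);
}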
The referenced code also provides BezierPrime() and angleForOffset: if you need those. Chapter 18 of iOS:PTL covers this in more detail as part of a discussion on how to draw text along an arbitrary path.

What are the practical uses of modulus (%) in programming? [duplicate]

Possible Duplicate:
Recognizing when to use the mod operator
What are the practical uses of modulus? I know what modulo division is. The first scenarios that come to my mind are using it to find odd and even numbers, and clock arithmetic. But where else could I use it?
The most common use I've found is for "wrapping round" your array indices.
For example, if you just want to cycle through an array repeatedly, you could use:
int a[10];
for (int i = 0; true; i = (i + 1) % 10)
{
// ... use a[i] ...
}
The modulo ensures that i stays in the [0, 10) range.
I usually use it in tight loops, when I have to do something every X iterations as opposed to on every iteration.
Example:
int i;
for (i = 1; i <= 1000000; i++)
{
do_something(i);
if (i % 1000 == 0)
printf("%d processed\n", i);
}
One use for the modulus operation is when making a hash table. It's used to convert the value out of the hash function into an index into the array. (If the hash table size is a power of two, the modulus could be done with a bit-mask, but it's still a modulus operation.)
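For instance (a minimal sketch; std::hash and the table size here are just for illustration):
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
// Fold an arbitrary hash value into a valid bucket index with %.
std::size_t bucket_index(const std::string& key, std::size_t table_size) {
    return std::hash<std::string>{}(key) % table_size;
}
int main() {
    std::printf("bucket = %zu\n", bucket_index("example", 101));
    return 0;
}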
To print a number as a string, you need the modulus to find the value of each digit.
string number_to_string(uint number) {
string result = "";
while (number != 0) {
result = cast(char)((number % 10) + '0') ~ result;
// ^^^^^^^^^^^
number /= 10;
}
return result;
}
For the check digits of International Bank Account Numbers (IBAN), there is the mod-97 technique.
Modulus is also used in large batches, to do something after every n iterations. Here is an example for NHibernate:
ISession session = sessionFactory.openSession();
ITransaction tx = session.BeginTransaction();
for ( int i=0; i<100000; i++ ) {
Customer customer = new Customer(.....);
session.Save(customer);
if ( i % 20 == 0 ) { //20, same as the ADO batch size
//Flush a batch of inserts and release memory:
session.Flush();
session.Clear();
}
}
tx.Commit();
session.Close();
The usual implementation of buffered communications uses circular buffers, and you manage them with modulus arithmetic.
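A minimal sketch of that pattern (the class and sizes are illustrative, not from any particular library):
#include <array>
#include <cstddef>
#include <cstdio>
template <typename T, std::size_t N>
class RingBuffer {
    std::array<T, N> data_{};
    std::size_t head_ = 0;  // next slot to write
    std::size_t count_ = 0; // number of valid elements (at most N)
public:
    void push(const T& v) {
        data_[head_] = v;
        head_ = (head_ + 1) % N;           // wrap the write index
        if (count_ < N) ++count_;
    }
    // i = 0 is the oldest element currently stored
    const T& at(std::size_t i) const {
        std::size_t start = (head_ + N - count_) % N;
        return data_[(start + i) % N];     // wrap the read index
    }
    std::size_t size() const { return count_; }
};
int main() {
    RingBuffer<int, 4> rb;
    for (int i = 1; i <= 6; ++i) rb.push(i);  // 1 and 2 get overwritten
    for (std::size_t i = 0; i < rb.size(); ++i) std::printf("%d ", rb.at(i)); // 3 4 5 6
    std::printf("\n");
    return 0;
}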
For languages that don't have bitwise operators, modulus can be used to get the lowest n bits of a number. For example, to get the lowest 8 bits of x:
x % 256
which is equivalent to:
x & 255
Cryptography. That alone would account for an obscene percentage of modulus (I exaggerate, but you get the point).
Try the Wikipedia page too:
Modular arithmetic is referenced in number theory, group theory, ring theory, knot theory, abstract algebra, cryptography, computer science, chemistry and the visual and musical arts.
In my experience, any sufficiently advanced algorithm is probably going to touch on one or more of the above topics.
Well, there are many perspectives from which you can look at it. If you look at it as a mathematical operation, then it's just modulo division. We don't strictly need it, since whatever % does we could achieve using subtraction as well, but every programming language implements it in a very optimized way.
And modulo division is not limited to finding odd and even numbers or clock arithmetic. There are hundreds of algorithms which need this modulo operation, for example cryptography algorithms. So it's a general mathematical operation like +, -, *, /, etc.
Apart from the mathematical perspective, different languages also use this symbol for their own purposes, such as defining built-in data structures: in Perl, %hash is used to show that the programmer declared a hash. So it all varies based on the programming language design.
So there are still a lot of other perspectives one can add to the list of uses of %.

Comparing IEEE floats and doubles for equality

What is the best method for comparing IEEE floats and doubles for equality? I have heard of several methods, but I wanted to see what the community thought.
The best approach I think is to compare ULPs.
bool is_nan(float f)
{
return (*reinterpret_cast<unsigned __int32*>(&f) & 0x7f800000) == 0x7f800000 && (*reinterpret_cast<unsigned __int32*>(&f) & 0x007fffff) != 0;
}
bool is_finite(float f)
{
return (*reinterpret_cast<unsigned __int32*>(&f) & 0x7f800000) != 0x7f800000;
}
// if this symbol is defined, NaNs are never equal to anything (as is normal in IEEE floating point)
// if this symbol is not defined, NaNs are hugely different from regular numbers, but might be equal to each other
#define UNEQUAL_NANS 1
// if this symbol is defined, infinities are never equal to finite numbers (as they're unimaginably greater)
// if this symbol is not defined, infinities are 1 ULP away from +/- FLT_MAX
#define INFINITE_INFINITIES 1
// test whether two IEEE floats are within a specified number of representable values of each other
// This depends on the fact that IEEE floats are properly ordered when treated as signed magnitude integers
bool equal_float(float lhs, float rhs, unsigned __int32 max_ulp_difference)
{
#ifdef UNEQUAL_NANS
if(is_nan(lhs) || is_nan(rhs))
{
return false;
}
#endif
#ifdef INFINITE_INFINITIES
if((is_finite(lhs) && !is_finite(rhs)) || (!is_finite(lhs) && is_finite(rhs)))
{
return false;
}
#endif
signed __int32 left(*reinterpret_cast<signed __int32*>(&lhs));
// transform signed magnitude ints into 2s complement signed ints
if(left < 0)
{
left = 0x80000000 - left;
}
signed __int32 right(*reinterpret_cast<signed __int32*>(&rhs));
// transform signed magnitude ints into 2s complement signed ints
if(right < 0)
{
right = 0x80000000 - right;
}
if(static_cast<unsigned __int32>(std::abs(left - right)) <= max_ulp_difference)
{
return true;
}
return false;
}
A similar technique can be used for doubles. The trick is to convert the floats so that they're ordered (as if integers) and then just see how different they are.
The current version I am using is this
bool is_equals(float A, float B,
float maxRelativeError, float maxAbsoluteError)
{
if (fabs(A - B) < maxAbsoluteError)
return true;
float relativeError;
if (fabs(B) > fabs(A))
relativeError = fabs((A - B) / B);
else
relativeError = fabs((A - B) / A);
if (relativeError <= maxRelativeError)
return true;
return false;
}
This seems to take care of most problems by combining relative and absolute error tolerance. Is the ULP approach better? If so, why?
#DrPizza: I am no performance guru but I would expect fixed point operations to be quicker than floating point operations (in most cases).
It rather depends on what you are doing with them. A fixed-point type with the same range as an IEEE float would be many many times slower (and many times larger).
Things suitable for floats:
3D graphics, physics/engineering, simulation, climate simulation....
In numerical software you often want to test whether two floating point numbers are exactly equal. LAPACK is full of examples for such cases. Sure, the most common case is where you want to test whether a floating point number equals "Zero", "One", "Two", "Half". If anyone is interested I can pick some algorithms and go more into detail.
Also in BLAS you often want to check whether a floating point number is exactly Zero or One. For example, the routine dgemv can compute operations of the form
y = beta*y + alpha*A*x
y = beta*y + alpha*A^T*x
y = beta*y + alpha*A^H*x
So if beta equals One you have a "plus assignment" and for beta equals Zero a "simple assignment". So you certainly can cut the computational cost if you give these (common) cases special treatment.
Sure, you could design the BLAS routines in such a way that you can avoid exact comparisons (e.g. using some flags). However, LAPACK is full of examples where it is not possible.
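To make the beta special-casing concrete, here is a rough sketch in the spirit of dgemv (not the actual reference BLAS source; the function name and the vector-of-vectors layout are only for illustration):
#include <cstddef>
#include <cstdio>
#include <vector>
// y = beta*y + alpha*A*x, with the exact comparisons against 0.0 and 1.0 that
// skip needless work; by convention beta == 0 overwrites y, so stale contents
// (even NaN/Inf) never propagate.
void gemv_like(double alpha, const std::vector<std::vector<double>>& A,
               const std::vector<double>& x, double beta, std::vector<double>& y) {
    const std::size_t m = A.size(), n = x.size();
    if (beta == 0.0) {
        for (std::size_t i = 0; i < m; ++i) y[i] = 0.0;   // simple assignment
    } else if (beta != 1.0) {
        for (std::size_t i = 0; i < m; ++i) y[i] *= beta; // general scaling
    }                                                     // beta == 1: leave y alone
    if (alpha == 0.0) return;                             // nothing to accumulate
    for (std::size_t i = 0; i < m; ++i) {
        double t = 0.0;
        for (std::size_t j = 0; j < n; ++j) t += A[i][j] * x[j];
        y[i] += alpha * t;
    }
}
int main() {
    std::vector<std::vector<double>> A = {{1.0, 2.0}, {3.0, 4.0}};
    std::vector<double> x = {1.0, 1.0}, y = {100.0, 100.0};
    gemv_like(1.0, A, x, 0.0, y);              // beta == 0: old y is ignored
    std::printf("y = (%g, %g)\n", y[0], y[1]); // expected (3, 7)
    return 0;
}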
P.S.:
There are certainly many cases where you don't want to check for "is exactly equal". For many people this might even be the only case they ever have to deal with. All I want to point out is that there are other cases too.
Although LAPACK is written in Fortran the logic is the same if you are using other programming languages for numerical software.
Oh dear lord please don't interpret the float bits as ints unless you're running on a P6 or earlier.
Even if it causes it to copy from vector registers to integer registers via memory, and even if it stalls the pipeline, it's the best way to do it that I've come across, insofar as it provides the most robust comparisons even in the face of floating point errors.
i.e. it is a price worth paying.
This seems to take care of most problems by combining relative and absolute error tolerance. Is the ULP approach better? If so, why?
ULPs are a direct measure of the "distance" between two floating point numbers. This means that they don't require you to conjure up the relative and absolute error values, nor do you have to make sure to get those values "about right". With ULPs, you can express directly how close you want the numbers to be, and the same threshold works just as well for small values as for large ones.
If you have floating point errors you have even more problems than this. Although I guess that is up to personal perspective.
Even if we do the numeric analysis to minimize accumulation of error, we can't eliminate it and we can be left with results that ought to be identical (if we were calculating with reals) but differ (because we cannot calculate with reals).
If you are looking for two floats to be equal, then they should be identically equal in my opinion. If you are facing a floating point rounding problem, perhaps a fixed point representation would suit your problem better.
If you are looking for two floats to be equal, then they should be identically equal in my opinion. If you are facing a floating point rounding problem, perhaps a fixed point representation would suit your problem better.
Perhaps we cannot afford the loss of range or performance that such an approach would inflict.
#DrPizza: I am no performance guru but I would expect fixed point operations to be quicker than floating point operations (in most cases).
#Craig H: Sure. I'm totally okay with it printing that. If a or b store money then they should be represented in fixed point. I'm struggling to think of a real world example where such logic ought to be applied to floats. Things suitable for floats:
weights
ranks
distances
real world values (like from an ADC)
For all these things, either you crunch the numbers and simply present the results to the user for human interpretation, or you make a comparative statement (even if such a statement is, "this thing is within 0.001 of this other thing"). A comparative statement like mine is only useful in the context of the algorithm: the "within 0.001" part depends on what physical question you're asking. That's my 0.02. Or should I say 2/100ths?
It rather depends on what you are doing with them. A fixed-point type with the same range as an IEEE float would be many many times slower (and many times larger).
Okay, but if I want an infinitesimally small bit-resolution then it's back to my original point: == and != have no meaning in the context of such a problem.
An int lets me express ~10^9 values (regardless of the range) which seems like enough for any situation where I would care about two of them being equal. And if that's not enough, use a 64-bit OS and you've got about 10^19 distinct values.
I can express values in a range of 0 to 10^200 (for example) in an int; it is just the bit-resolution that suffers (resolution would be greater than 1, but, again, no application has that sort of range as well as that sort of resolution).
To summarize, I think in all cases one either is representing a continuum of values, in which case != and == are irrelevant, or one is representing a fixed set of values, which can be mapped to an int (or a another fixed-precision type).
An int lets me express ~10^9 values (regardless of the range) which seems like enough for any situation where I would care about two of them being equal. And if that's not enough, use a 64-bit OS and you've got about 10^19 distinct values.
I have actually hit that limit... I was trying to juggle times in ps and time in clock cycles in a simulation where you easily hit 10^10 cycles. No matter what I did I very quickly overflowed the puny range of 64-bit integers... 10^19 is not as much as you think it is, gimme 128 bits computing now!
Floats allowed me to get a solution to the mathematical issues, as the values overflowed with lots of zeros at the low end. So you basically had a decimal point floating around in the number with no loss of precision (I could live with the more limited number of distinct values allowed in the mantissa of a float compared to a 64-bit int, but desperately needed the range!).
And then things converted back to integers to compare etc.
Annoying, and in the end I scrapped the entire attempt and just relied on floats and < and > to get the work done. Not perfect, but works for the use case envisioned.
If you are looking for two floats to be equal, then they should be identically equal in my opinion. If you are facing a floating point rounding problem, perhaps a fixed point representation would suit your problem better.
Perhaps I should explain the problem better. In C++, the following code:
#include <iostream>
using namespace std;
int main()
{
float a = 1.0;
float b = 0.0;
for(int i=0;i<10;++i)
{
b+=0.1;
}
if(a != b)
{
cout << "Something is wrong" << endl;
}
return 1;
}
prints the phrase "Something is wrong". Are you saying that it should?
Oh dear lord please don't interpret the float bits as ints unless you're running on a P6 or earlier.
it's the best way to do it that I've come across, insofar as it provides the most robust comparisons even in the face of floating point errors.
If you have floating point errors you have even more problems than this. Although I guess that is up to personal perspective.