Principled reasoning about tolerances and formulas for comparing floating-point numbers? - testing

The Python standard library contains the function math.isclose, which is equivalent to:
abs(a - b) <= max(rtol * max(abs(a), abs(b)), atol)
The Numpy library likewise contains numpy.isclose and numpy.allclose, which are equivalent to:
abs(a - b) <= (atol + rtol * abs(b))
Neither documentation page explains why you would want to use one of these formulas over the other, or provides any principled criteria for choosing sensible absolute and relative tolerances, written above as atol and rtol respectively.
I very often end up having to use these functions in tests for my code, but I never learned any principled basis for choosing between these two formulas, or for choosing tolerances that might be appropriate to my use case.
I usually just leave the default values as-is unless I happen to know that I'm doing something that could result in a loss of numerical precision, at which point I hand-tune the tolerances until the results seem right, largely based on gut feeling and checking examples by hand. This is tedious, imperfect, and seems antithetical to the purpose of software testing, particularly property-based testing.
For example, I might want to assert that two different implementations of the same algorithm produce "the same" result, acknowledging that an exact equality comparison doesn't make sense.
What are principled techniques that I can use for choosing a sensible formula and tolerances for comparing floating point numbers? For the sake of this question, I am happy to focus on the case of testing code that uses floating-point numbers.

For example, I might want to assert that two different implementations of the same algorithm produce "the same" result, acknowledging that an exact equality comparison doesn't make sense.
Instead of a single true/false assessment of the "same" result, consider rating the algorithms' sameness on several properties.
If the assessments are within your tolerance/limits, functions are the "same".
Given g(x) and r(x) (the reference function).
Absolute difference: Try y = abs(g(x) - r(x)) for various (if not all) x. What is the largest y?
Relative difference: Try y = abs((g(x) - r(x))/r(x)) for various normal r(x) (not zeroes). What is the largest y?
Relative difference: Like above with r(x) with sub-normal results. Here relative difference may be far larger than with normals and so handled separately. r(x) == +/-0.0 deserves special assessment.
Range test/edge cases: What are the largest and smallest x that "work"? E.g. y = my_exp(x) and exp(x) may return infinity or 0.0 at different x, but are otherwise nearly the "same".
Total ordering difference: (a favorite). Map all non-NaN floating-point values from -inf to +inf to an integer in [-ORDER_N, ORDER_N] with a helper function called total_order(). total_order(+/-0.0) is 0. Find the maximum difference abs(total_order(g(x)) - total_order(r(x))) and use that metric to determine "same"-ness.
Various functions deserve special handling. This field of study has many further considerations.
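The metrics above can be sketched in Python roughly as follows. This is a minimal sketch, assuming IEEE 754 doubles and finite results; error_report is a hypothetical helper name, and total_order is one possible implementation of the helper described above.

```python
import math
import struct

def total_order(x: float) -> int:
    """Map a non-NaN double to an integer; adjacent doubles map to adjacent integers."""
    if x != x:
        raise ValueError("NaN has no place in the ordering")
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles are stored as sign plus magnitude: strip the sign bit
    # and negate, so that -0.0 and +0.0 both map to 0.
    return -(bits & 0x7FFFFFFFFFFFFFFF) if bits < 0 else bits

def error_report(g, r, xs):
    """Return (max absolute, max relative, max total-order) difference of g against r."""
    max_abs = 0.0
    max_rel = 0.0
    max_ulps = 0
    for x in xs:
        gx, rx = g(x), r(x)
        diff = abs(gx - rx)
        max_abs = max(max_abs, diff)
        if rx != 0.0:  # r(x) == +/-0.0 needs its own assessment
            max_rel = max(max_rel, diff / abs(rx))
        max_ulps = max(max_ulps, abs(total_order(gx) - total_order(rx)))
    return max_abs, max_rel, max_ulps

# Example: compare x ** 0.5 against math.sqrt as the reference.
print(error_report(lambda x: x ** 0.5, math.sqrt, [0.5 * i for i in range(1, 1000)]))
```

Each of the three numbers can then be compared against its own limit, rather than folding everything into one true/false check.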

One question when using relative tolerance is - relative to what? If you want to know if 90 and 100 are "equal" with a 10% tolerance, you get different answers if you take 10% of 90 vs 10% of 100.
The standard library uses the larger of a or b when defining the "what" in that scenario, so it would use 10% of 100 as the tolerance. It also uses the larger of that relative tolerance or the absolute tolerance as the "ultimate" tolerance.
The numpy method simply uses b for the "relative" tolerance and takes the total of the relative and absolute tolerance as the "ultimate" tolerance.
Which is better? Neither is better or worse- they are different ways of establishing a tolerance. You can choose which one to use based on how you want to define "close enough".
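To make the difference concrete, here is a small sketch that re-implements the two formulas as quoted in the question (the function names are made up for illustration; they are not the library functions themselves) and applies them to the 90 vs 100 example:

```python
def isclose_symmetric(a, b, rtol, atol=0.0):
    # Formula quoted for math.isclose: relative to the larger magnitude.
    return abs(a - b) <= max(rtol * max(abs(a), abs(b)), atol)

def isclose_asymmetric(a, b, rtol, atol=0.0):
    # Formula quoted for numpy.isclose: relative to b only, tolerances added.
    return abs(a - b) <= atol + rtol * abs(b)

print(isclose_symmetric(90, 100, rtol=0.1))   # True
print(isclose_symmetric(100, 90, rtol=0.1))   # True
print(isclose_asymmetric(90, 100, rtol=0.1))  # True: 10 <= 10% of 100
print(isclose_asymmetric(100, 90, rtol=0.1))  # False: 10 > 10% of 90
```

The second formula depends on argument order, which fits the case where b is a trusted reference value; the first treats both arguments alike.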
The tolerances you choose are contextual as well - are you comparing lengths of lumber or the distance between circuit paths in a microprocessor? Is 1% tolerance "good enough" or do you need ultra-precise tolerance? A tolerance too low might yield too many "false positives" depending on the application, while too high a tolerance will yield too many "false negatives" that might let some problems "slip through the cracks".
Note that the standard function is not vectorized, so if you want to use it on arrays you'll either have to use the numpy function or build a vectorized version of the standard one.

Nobody can choose the tolerances for you; they are problem dependent, because in real life the input data that you work on has (very) limited accuracy, whether it comes from experimental measurement or from numerical computation that introduces truncation errors. So you need to know your data and understand the concepts and methods of error calculus to adjust them.
As regards the formulas, they were designed to be general-purpose, i.e. not knowing if the quantities to be compared can be strictly equal or not (when they are strictly equal, the relative error does not work). Again, this should not be a blind choice.

Related

Does it make sense to use big-O to describe the best case for a function?

I have an extremely pedantic question on big-O notation that I would like some opinions on. One of my uni subjects states “Best O(1) if their first element is the same” for a question on checking if two lists have a common element.
My qualm with this is that it does not describe the function on the entire domain of large inputs, rather the restricted domain of large inputs that have two lists with the same first element. Does it make sense to describe a function by only talking about a subset of that function’s domain? Of course, when restricted to that domain, the time complexity is omega(1), O(1) and therefore theta(1), but this isn’t describing the original function. From my understanding it would be more correct to say the entire function is bounded by omega(1). (and O(m*n) where m, n are the sizes of the two input lists).
What do all of you think?
It is perfectly correct to discuss cases (as you correctly point out, a case is a subset of the function's domain) and bounds on the runtime of algorithms in those cases (Omega, Oh or Theta). Whether or not it's useful is a harder question and one that is very situation-dependent. I'd generally think that Omega-bounds on the best case, Oh-bounds on the worst case and Theta bounds (on the "universal" case of all inputs, when such a bound exists) are the most "useful". But calling the subset of inputs where the first elements of each collection are the same, the "best case", seems like reasonable usage. The "best case" for bubble sort is the subset of inputs which are pre-sorted arrays, and is bound by O(n), better than unmodified merge sort's best-case bound.
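As an illustration of the best and worst cases being discussed, here is a sketch of the naive common-element check with a comparison counter added. The function name and the counter are assumptions for illustration; the course's actual code is not shown in the question.

```python
def have_common_element(xs, ys):
    """Return (found, comparisons) for the naive nested-loop check."""
    comparisons = 0
    for x in xs:
        for y in ys:
            comparisons += 1
            if x == y:
                return True, comparisons
    return False, comparisons

print(have_common_element([1, 2, 3], [1, 5, 6, 7]))  # (True, 1): best case
print(have_common_element([1, 2, 3], [4, 5, 6, 7]))  # (False, 12): m * n comparisons
```

Restricted to inputs whose first elements match, the count stays at 1 regardless of list sizes, which is what the "best case O(1)" statement describes.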
Fundamentally, big-O notation is a way of talking about how some quantity scales. In CS we so often see it used for talking about algorithm runtimes that we forget that all the following are perfectly legitimate use cases for it:
The area of a circle of radius r is O(r^2).
The volume of a sphere of radius r is O(r^3).
The expected number of 2’s showing after rolling n dice is O(n).
The minimum number of 2’s showing after rolling n dice is O(1).
The number of strings made of n pairs of balanced parentheses is O(4^n).
With that in mind, it’s perfectly reasonable to use big-O notation to talk about how the behavior of an algorithm in a specific family of cases will scale. For example, we could say the following:
The time required to sort a sequence of n already-sorted elements with insertion sort is O(n). (There are other, non-sorted sequences where it takes longer.)
The time required to run a breadth-first search of a tree with n nodes is O(n). (In a general graph, this could be larger if the number of edges were larger.)
The time required to insert n items in sorted order into an initially empty binary heap is O(n). (This can be Θ(n log n) for a sequence of elements that is reverse-sorted.)
In short, it’s perfectly fine to use asymptotic notation to talk about restricted subcases of an algorithm. More generally, it’s fine to use asymptotic notation to describe how things grow as long as you’re precise about what you’re quantifying. See @Patrick87’s answer for more about whether to use O, Θ, Ω, etc. when doing so.

What is a good rule-of-thumb floating point comparison method selector?

I'm testing some bits of code, a number of which involve computation using floating-point values - often very large numbers of these. I have some generic (C++-templated, but it doesn't really matter for the sake of this question) code which compares my outputs, be they scalar or arrays, against their expected values.
I'm faced with the problem of choosing a precision threshold, at least for the two C/C++ floating-point types float and double - for various functions I'm testing. As is well known, there is no one-size-fits-all with respect to comparing floating-point values, nor a single precision value which fits any computation based solely on the data type: Relative vs. absolute error, numerous operations which may magnify floating-point rounding errors a lot, computations which are supposed to arrive at 0 so you can't really normalize by the expected value, etc.
What is a generally-reasonable approach/algorithm/rule-of-thumb for choosing a comparison method (and equality thresholds) for floating point values?
I like the approach used in googletest, e.g. EXPECT_DOUBLE_EQ(a,b) and EXPECT_FLOAT_EQ(a,b): the numbers are approximately equal if they are within 4 units in the last position (4 ULP). To do this, you
convert signed-magnitude to offset
subtract as though they were integers
check that the difference <= 4.
This automatically scales for magnitude and relaxes to absolute near zero.
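The three steps can be sketched in Python for doubles as below. This is an illustration of the idea only, not googletest's actual code; to_biased and almost_equal_ulps are hypothetical names, and treating NaN as never equal is an assumption of this sketch.

```python
import math
import struct

SIGN_BIT = 1 << 63
MASK_64 = (1 << 64) - 1

def to_biased(x: float) -> int:
    """Convert the sign-and-magnitude bit pattern of a double to an offset (biased) integer."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    if bits & SIGN_BIT:
        return (~bits + 1) & MASK_64  # negative numbers count down from the midpoint
    return bits | SIGN_BIT            # non-negative numbers count up from the midpoint

def almost_equal_ulps(a: float, b: float, max_ulps: int = 4) -> bool:
    if math.isnan(a) or math.isnan(b):
        return False  # assumption: NaN is never "almost equal" to anything
    # Subtract as though the values were integers and compare with the ULP budget.
    return abs(to_biased(a) - to_biased(b)) <= max_ulps

print(almost_equal_ulps(0.1 + 0.2, 0.3))            # True: 1 ULP apart
print(almost_equal_ulps(1.0, 1.0 + 8 * 2.0 ** -52))  # False: 8 ULPs apart
```

Note that +0.0 and -0.0 map to the same integer, so they compare as equal.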
There is no generally-reasonable approach :-(
One important property of numbers is that the set of numbers can be divided into equivalence classes where all members of the same equivalence class are "equal" in some sense and all members of two different equivalence classes are "not equal". That property is essential for sorting algorithms and hashing.
If you take double with 53 bit mantissa, and just replace the last bits of the mantissa with zeroes, then you still have equivalence classes, and sorting / hashing will work just fine. On the other hand, two numbers can be arbitrarily close together and still compare unequal with this method, because they fall on opposite sides of a class boundary.
The other method is having an algorithm that decides if two numbers are "possibly equal". You can base everything else on this. For example, a is "definitely greater" than b if a > b and a is not "possibly equal" to b. a is "possibly greater" than b if a > b or a is "possibly equal" to b.
Sorting is problematic. You could have a "possibly equal" to b, and b "possibly equal" to c, but a is not "possibly equal" to c.
If you use double with 53 bits mantissa, then it is unlikely that two unrelated numbers are equal within even 45 bits. So you could check quite reasonably whether the absolute value of the difference is less than the absolute value of the larger number, divided by 2^45. Your mileage will vary considerably. What matters is whether you think 0 should be equal to very small numbers or not.
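A minimal sketch of such a "possibly equal" check, assuming the 2^45 divisor suggested above (possibly_equal is a hypothetical name, and <= is used instead of < so that 0 compares equal to itself):

```python
def possibly_equal(a: float, b: float, bits: int = 45) -> bool:
    # Difference no larger than the larger magnitude divided by 2^bits.
    return abs(a - b) <= max(abs(a), abs(b)) / 2.0 ** bits

a = 1.0
b = 1.0 + 0.75 * 2.0 ** -45
c = 1.0 + 1.5 * 2.0 ** -45

print(possibly_equal(a, b))         # True
print(possibly_equal(b, c))         # True
print(possibly_equal(a, c))         # False: the relation is not transitive
print(possibly_equal(0.0, 1e-300))  # False: 0 is only "possibly equal" to 0
```

The last line shows the zero issue: with a purely relative test, no non-zero number is ever close to 0, however small it is.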

Optimizing Parameters using AI technique

I know that my question is general, but I'm new to AI area.
I have an experiment with some parameters (about 6). Each of them is independent, and I want to find the optimal solution that maximizes or minimizes the output function. However, doing this with a traditional programming technique would take a long time, since I would use six nested loops.
I just want to know which AI technique to use for this problem? Genetic Algorithm? Neural Network? Machine learning?
Update
Actually, the problem could have more than one evaluation function.
It will have one function that we should minimize (Cost)
and another function that we want to maximize (Capacity).
Other functions may be added.
Example:
Constructing a glass window can be done in a million ways. However, we want the strongest window at the lowest cost. There are many parameters that affect the pressure capacity of the window such as the strength of the glass, Height and Width, slope of the window.
Obviously, if we go to extreme cases (Largest strength glass, with smallest width and height, and zero slope) the window will be extremely strong. However, the cost for that will be very high.
I want to study the interaction between the parameters in specific range.
Without knowing much about the specific problem it sounds like Genetic Algorithms would be ideal. They've been used a lot for parameter optimisation and have often given good results. Personally, I've used them to narrow parameter ranges for edge detection techniques with about 15 variables and they did a decent job.
Having multiple evaluation functions needn't be a problem if you code this into the Genetic Algorithm's fitness function. I'd look up multi objective optimisation with genetic algorithms.
I'd start here: Multi-Objective optimization using genetic algorithms: A tutorial
First of all if you have multiple competing targets the problem is confused.
You have to find a single value that you want to maximize... for example:
value = strength - k*cost
or
value = strength / (k1 + k2*cost)
In both for a fixed strength the lower cost wins and for a fixed cost the higher strength wins but you have a formula to be able to decide if a given solution is better or worse than another. If you don't do this how can you decide if a solution is better than another that is cheaper but weaker?
In some cases a correctly defined value requires a more complex function... for example for strength the value could increase up to a certain point (i.e. having a result stronger than a prescribed amount is just pointless) or a cost could have a cap (because higher than a certain amount a solution is not interesting because it would place the final price out of the market).
Once you have the criterion, if the parameters are independent, a very simple approach that in my experience is still decent is:
1. pick a random solution by choosing n random values, one for each parameter within the allowed boundaries
2. compute the target value for this starting point
3. pick a random number 1 <= k <= n and, for each of k parameters randomly chosen from the n, compute a random signed increment and change the parameter by that amount
4. compute the new target value from the translated solution
5. if the new value is better keep the new position, otherwise revert to the original one
6. repeat from 3 until you run out of time.
Depending on the target function there are random distributions that work better than others, also may be that for different parameters the optimal choice is different.
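The procedure described above can be sketched as follows. This is only a sketch under a few assumptions that are not in the answer: the increments are Gaussian with a spread of step_fraction times the parameter range, moved values are clamped to the bounds, and a fixed iteration budget stands in for "until you run out of time". All names are hypothetical.

```python
import random

def random_local_search(value, bounds, iterations=10_000, step_fraction=0.1, rng=None):
    """Maximize value(params) by repeatedly perturbing a random subset of parameters."""
    rng = rng or random.Random()
    n = len(bounds)
    # Random starting point inside the allowed boundaries, and its target value.
    current = [rng.uniform(lo, hi) for lo, hi in bounds]
    best = value(current)
    for _ in range(iterations):
        # Pick k parameters and move each by a random signed increment.
        k = rng.randint(1, n)
        candidate = list(current)
        for i in rng.sample(range(n), k):
            lo, hi = bounds[i]
            step = rng.gauss(0.0, step_fraction * (hi - lo))
            candidate[i] = min(hi, max(lo, candidate[i] + step))
        # Keep the new position only if it is better; otherwise stay where we are.
        candidate_value = value(candidate)
        if candidate_value > best:
            current, best = candidate, candidate_value
    return current, best

# Toy target with its maximum at (1, -2); a real target would combine strength and cost.
def toy_value(p):
    return -((p[0] - 1.0) ** 2) - (p[1] + 2.0) ** 2

print(random_local_search(toy_value, [(-5.0, 5.0), (-5.0, 5.0)], rng=random.Random(0)))
```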
Some time ago I wrote a C++ code for solving optimization problems using Genetic Algorithms. Here it is: http://create-technology.blogspot.ro/2015/03/a-genetic-algorithm-for-solving.html
It should be very easy to follow.

Building a ranking with a genetic algorithm

Question after a BIG edit:
I need to build a ranking using a genetic algorithm. I have data like this:
P(a>b)=0.9
P(b>c)=0.7
P(c>d)=0.8
P(b>d)=0.3
Now, let's interpret a, b, c, d as names of football teams, and P(x>y) as the probability that x wins against y. We want to build a ranking of teams, but some observations are missing: P(a>d) and P(a>c) are unknown because a has not played d or c.
The goal is to find the ordering of team names which best describes the current situation in that four-team league.
If we have only 4 teams then the solution is straightforward: first we compute probabilities for all 4!=24 orderings of the four teams, ignoring missing values:
P(abcd)=P(a>b)P(b>c)P(c>d)P(b>d)
P(abdc)=P(a>b)P(b>c)(1-P(c>d))P(b>d)
...
P(dcba)=(1-P(a>b))(1-P(b>c))(1-P(c>d))(1-P(b>d))
and we choose the ranking with highest probability. I don't want to use any other fitness function.
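For the four-team case, the exhaustive computation described above can be sketched like this (ordering_probability and the wins dictionary are hypothetical names; the probabilities are the ones listed in the question):

```python
from itertools import permutations
from math import isclose, prod

# Known results: wins[(x, y)] is P(x > y). Missing pairs are simply absent.
wins = {("a", "b"): 0.9, ("b", "c"): 0.7, ("c", "d"): 0.8, ("b", "d"): 0.3}

def ordering_probability(order, wins):
    """Product of P(x>y) when x is ranked above y, and 1 - P(x>y) otherwise."""
    position = {team: i for i, team in enumerate(order)}
    return prod(p if position[x] < position[y] else 1.0 - p
                for (x, y), p in wins.items())

scores = {"".join(order): ordering_probability(order, wins)
          for order in permutations("abcd")}
best_score = max(scores.values())
print([order for order, s in scores.items() if isclose(s, best_score)])
```

With this particular data several orderings share the highest probability, so the sketch prints all of them instead of picking one.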
My question :
As the number of permutations of n elements is n!, calculating probabilities for all orderings is impossible for large n (my n is about 40). I want to use a genetic algorithm for that problem.
The mutation operator is simply switching the places of two (or more) elements of the ranking.
But how to do crossover of two orderings?
Could P(abcd) be interpreted as the cost function of path 'abcd' in an asymmetric TSP, where the cost of travelling from x to y is different from the cost of travelling from y to x, with P(x>y)=1-P(y>x)? There are so many crossover operators for the TSP, but I think I have to design my own crossover operator, because my problem is slightly different from TSP. Do you have any ideas for a solution or a frame for conceptual analysis?
The easiest way, on a conceptual and implementation level, is to use a crossover operator which exchanges suborderings between two solutions:
CrossOver(ABcD,AcDB) = AcBD
For a random subset of elements (in this case 'a,b,d', shown in capital letters), we copy the order in which those elements appear in the first ordering and paste it over the positions they occupy in the second ordering.
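A short sketch of this subordering crossover, reproducing the example above (suborder_crossover is a hypothetical name, and the subset is passed in explicitly here instead of being drawn at random):

```python
def suborder_crossover(first, second, subset):
    """Impose the order that `subset` has in `first` onto its positions in `second`."""
    in_first_order = iter(x for x in first if x in subset)
    return [next(in_first_order) if x in subset else x for x in second]

child = suborder_crossover("abcd", "acdb", {"a", "b", "d"})
print("".join(child))  # acbd
```

Elements outside the subset keep their positions from the second parent, so the child is always a valid permutation.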
Edit: an asymmetric TSP could be turned into a symmetric TSP, but with forbidden suborderings, which makes the GA approach unsuitable.
It's definitely an interesting problem, and it seems most of the answers and comments have focused on the semantic aspects of the problem (i.e., the meaning of the fitness function, etc.).
I'll chip in some information about the syntactic elements -- how do you do crossover and/or mutation in ways that make sense. Obviously, as you noted with the parallel to the TSP, you have a permutation problem. So if you want to use a GA, the natural representation of candidate solutions is simply an ordered list of your points, taking care to avoid repetition -- that is, a permutation.
TSP is one such permutation problem, and there are a number of crossover operators (e.g., Edge Assembly Crossover) that you can take from TSP algorithms and use directly. However, I think you'll have problems with that approach. Basically, the problem is this: in TSP, the important quality of solutions is adjacency. That is, abcd has the same fitness as cdab, because it's the same tour, just starting and ending at a different city. In your example, absolute position is much more important than this notion of relative position. abcd means in a sense that a is the best point -- it's important that it came first in the list.
The key thing you have to do to get an effective crossover operator is to account for what the properties are in the parents that make them good, and try to extract and combine exactly those properties. Nick Radcliffe called this "respectful recombination" (note that paper is quite old, and the theory is now understood a bit differently, but the principle is sound). Taking a TSP-designed operator and applying it to your problem will end up producing offspring that try to conserve irrelevant information from the parents.
You ideally need an operator that attempts to preserve absolute position in the string. The best one I know of offhand is known as Cycle Crossover (CX). I'm missing a good reference off the top of my head, but I can point you to some code where I implemented it as part of my graduate work. The basic idea of CX is fairly complicated to describe, and much easier to see in action. Take the following two points:
abcdefgh
cfhgedba
Pick a starting point in parent 1 at random. For simplicity, I'll just start at position 0 with the "a".
Now drop straight down into parent 2, and observe the value there (in this case, "c").
Now search for "c" in parent 1. We find it at position 2.
Now drop straight down again, and observe the "h" in parent 2, position 2.
Again, search for this "h" in parent 1, found at position 7.
Drop straight down and observe the "a" in parent 2.
At this point note that if we search for "a" in parent one, we reach a position where we've already been. Continuing past that will just cycle. In fact, we call the sequence of positions we visited (0, 2, 7) a "cycle". Note that we can simply exchange the values at these positions between the parents as a group and both parents will retain the permutation property, because we have the same three values at each position in the cycle for both parents, just in different orders.
Make the swap of the positions included in the cycle.
Note that this is only one cycle. You then repeat this process starting from a new (unvisited) position each time until all positions have been included in a cycle. After the one iteration described in the above steps, you get the following strings (where an "X" denotes a position in the cycle where the values were swapped between the parents).
cbhdefga
afcgedbh
X X    X
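Just keep finding cycles until every position belongs to one, and swap only some of them (for example every other cycle), since swapping all of them would simply exchange the two parents.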
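The walk-through above can be sketched in Python as below. This is not the code from the answerer's framework; the function names are hypothetical, the parents are assumed to be strings of single-character elements, and swapping every other cycle is an assumption taken from the usual formulation of CX (swapping every cycle would just exchange the two parents).

```python
def find_cycle(p1, p2, start):
    """Follow 'drop down into p2, then find that value in p1' until we return to the start."""
    cycle = []
    i = start
    while i not in cycle:
        cycle.append(i)
        i = p1.index(p2[i])
    return cycle

def cycle_crossover(p1, p2):
    """Return two children; alternate cycles are swapped between the parents."""
    c1, c2 = list(p1), list(p2)
    visited = set()
    swap = True
    for start in range(len(p1)):
        if start in visited:
            continue
        cycle = find_cycle(p1, p2, start)
        visited.update(cycle)
        if swap:
            for i in cycle:
                c1[i], c2[i] = p2[i], p1[i]
        swap = not swap
    return "".join(c1), "".join(c2)

print(find_cycle("abcdefgh", "cfhgedba", 0))   # [0, 2, 7]
print(cycle_crossover("abcdefgh", "cfhgedba"))  # ('cbhdefga', 'afcgedbh')
```

For these two parents the remaining cycles are (1, 5, 3, 6), which is left unswapped, and (4), where both parents already hold "e", so the result matches the strings shown after the first iteration.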
The code I linked from my github account is going to be tightly bound to my own metaheuristics framework, but I think it's a reasonably easy task to pull the basic algorithm out from the code and adapt it for your own system.
Note that you can potentially gain quite a lot from doing something more customized to your particular domain. I think something like CX will make a better black box algorithm than something based on a TSP operator, but black boxes are usually a last resort. Other people's suggestions might lead you to a better overall algorithm.
I've worked on a somewhat similar ranking problem and followed a technique similar to what I describe below. Does this work for you:
Assume the unknown value of an object diverges from your estimate via some distribution, say, the normal distribution. Interpret your ranking statements such as a > b, 0.9 as the statement "The value a lies at the 90th percentile of the distribution centered on b".
For every statement:
def realArrival = calculate a's location on a distribution centered on b
def arrivalGap = | realArrival - expectedArrival |
def fitness = Σ arrivalGap
Fitness function is MIN(fitness)
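One possible reading of this pseudocode, as a Python sketch. The interpretation is an assumption: "location" is taken to be the percentile of a's value under a normal distribution centered on b's value, the expected arrival is the stated probability, and sigma is an arbitrary fixed spread. ranking_fitness and its parameters are hypothetical names.

```python
from statistics import NormalDist

def ranking_fitness(values, statements, sigma=1.0):
    """Sum of gaps between where each a actually falls and where its statement says it should."""
    total = 0.0
    for a, b, expected_arrival in statements:
        # Percentile of a's value on a normal distribution centered on b's value.
        real_arrival = NormalDist(mu=values[b], sigma=sigma).cdf(values[a])
        total += abs(real_arrival - expected_arrival)
    return total  # lower is better

statements = [("a", "b", 0.9), ("b", "c", 0.7), ("c", "d", 0.8), ("b", "d", 0.3)]
candidate = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}
print(ranking_fitness(candidate, statements))
```

A search procedure would then adjust the candidate values to minimize this sum and read the ranking off the sorted values.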
FWIW, my problem was actually a bin-packing problem, where the equivalent of your "rank" statements were user-provided rankings (1, 2, 3, etc.). So not quite TSP, but NP-Hard. OTOH, bin-packing has a pseudo-polynomial solution proportional to accepted error, which is what I eventually used. I'm not quite sure that would work with your probabilistic ranking statements.
What an interesting problem! If I understand it, what you're really asking is:
"Given a weighted, directed graph, with each edge-weight in the graph representing the probability that the arc is drawn in the correct direction, return the complete sequence of nodes with maximum probability of being a topological sort of the graph."
So if your graph has N edges, there are 2^N graphs of varying likelihood, with some orderings appearing in more than one graph.
I don't know if this will help (very brief Google searches did not enlighten me, but maybe you'll have more success with more perseverance) but my thoughts are that looking for "topological sort" in conjunction with any of "probabilistic", "random", "noise," or "error" (because the edge weights can be considered as a reliability factor) might be helpful.
I strongly question your assertion, in your example, that P(a>c) is not needed, though. You know your application space best, but it seems to me that specifying P(a>c) = 0.99 will give a different fitness for f(abc) than specifying P(a>c) = 0.01.
You might want to throw in "Bayesian" as well, since you might be able to start to infer values for (in your example) P(a>c) given your conditions and hypothetical solutions. The problem is, "topological sort" and "bayesian" is going to give you a whole bunch of hits related to markov chains and markov decision problems, which may or may not be helpful.

Exponents in Genetic Programming

I want to have real-valued exponents (not just integers) for the terminal variables.
For example, lets say I want to evolve a function y = x^3.5 + x^2.2 + 6. How should I proceed? I haven't seen any GP implementations which can do this.
I tried using the power function, but sometimes the initial solutions have so many exponents that the evaluated value exceeds 'double' bounds!
Any suggestion would be appreciated. Thanks in advance.
DEAP (in Python) implements it. In fact there is an example for that. By adding math.pow from Python to the primitive set you can achieve what you want.
pset.addPrimitive(math.pow, 2)
But using the pow operator you risk getting something like x^(x^(x^(x))), which is probably not desired. You should add a restriction (by some means that I'm not sure of) on where in your tree the pow is allowed (just before a leaf or something like that).
OpenBeagle (in C++) also allows it but you will need to develop your own primitive using the pow from <math.h>, you can use as an example the Sin or Cos primitive.
If only some of the initial population are suffering from the overflow problem then just penalise them with a poor fitness score and they will probably be removed from the population within a few generations.
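A minimal sketch of that penalty idea in plain Python, independent of any GP library. safe_error and PENALTY are hypothetical names, and an individual is represented here simply as a Python callable; Python's math.pow raises OverflowError on overflow and ValueError for a negative base with a non-integer exponent, which is what the sketch relies on.

```python
import math

PENALTY = float("inf")  # worst possible error for a minimizing fitness

def safe_error(candidate, samples):
    """Sum of squared errors, or PENALTY if the candidate overflows or is undefined."""
    total = 0.0
    for x, y in samples:
        try:
            total += (candidate(x) - y) ** 2
        except (OverflowError, ValueError, ZeroDivisionError):
            return PENALTY
    return total if math.isfinite(total) else PENALTY

def target(x):
    return math.pow(x, 3.5) + math.pow(x, 2.2) + 6

samples = [(x, target(x)) for x in (1.0, 2.0, 3.0)]
print(safe_error(target, samples))                                   # 0.0
print(safe_error(lambda x: math.pow(1e200, x + 1.0), samples))       # inf
```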
But, if the problem is that virtually all individuals suffer from this problem, then you will have to add some constraints. The simplest thing to do would be to constrain the exponent child of the power function to be a real literal - which would mean powers would not be allowed to be nested. It depends on whether this is sufficient for your needs though. There are a few ways to add constraints like these (or more complex ones) - try looking in to Constrained Syntactic Structures and grammar guided GP.
A few other simple thoughts: can you use a data-type with a larger range? Also, you could reduce the maximum depth parameter, so that there will be less room for nested exponents. Of course that's only possible to an extent, and it depends on the complexity of the function.
Integers have a different binary representation than reals, so you have to use a slightly different bitstring representation and recombination/mutation operator.
For an excellent demonstration, see slide 24 of www.cs.vu.nl/~gusz/ecbook/slides/Genetic_Algorithms.ppt or check out the Eiben/Smith book "Introduction to Evolutionary Computing Genetic Algorithms." This describes how to map a bit string to a real number. You can then create a representation where x only lies within an interval [y,z]. In this case, choose y and z to be of smaller magnitude than the capacity of the data type you are using (e.g. 10^308 for a double) so you don't run into the overflow issue you describe.
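One common way to do such a mapping, sketched here as an assumption and not necessarily the exact scheme on the slide: read the bit string as an unsigned integer and scale it linearly into the interval. decode, low and high are hypothetical names.

```python
def decode(bits: str, low: float, high: float) -> float:
    """Map a bit string such as '1011' linearly onto the interval [low, high]."""
    fraction = int(bits, 2) / (2 ** len(bits) - 1)
    return low + fraction * (high - low)

print(decode("0000", 0.0, 5.0))  # 0.0
print(decode("1111", 0.0, 5.0))  # 5.0
print(decode("1000", 0.0, 5.0))  # 8/15 of the way from 0 to 5
```

Because every bit string decodes to a value inside [low, high], an exponent represented this way can never leave the chosen range.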
You have to consider that with real-valued exponents and a negative base you will not obtain a real, but a complex number. For example, the Math.Pow implementation in .NET says that you get NaN if you attempt to calculate the power of a negative base to a non-integer exponent. You have to make sure all your x values are positive. I think that's the problem that you're seeing when you "exceed double bounds".
Btw, you can try the HeuristicLab GP implementation. It is very flexible with a configurable grammar.