How to design acceptance probability function for simulated annealing with multiple distinct costs? - optimization

I am using simulated annealing to solve an NP-complete resource scheduling problem. For each candidate ordering of the tasks I compute several different costs (or energy values). Some examples are (though the specifics are probably irrelevant to the question):
global_finish_time: The total number of days that the schedule spans.
split_cost: The number of days by which each task is delayed due to interruptions by other tasks (this is meant to discourage interruption of a task once it has started).
deadline_cost: The sum of the squared number of days by which each missed deadline is overdue.
The traditional acceptance probability function looks like this (in Python):
def acceptance_probability(old_cost, new_cost, temperature):
if new_cost < old_cost:
return 1.0
else:
return math.exp((old_cost - new_cost) / temperature)
So far I have combined my first two costs into one by simply adding them, so that I can feed the result into acceptance_probability. But what I would really want is for deadline_cost to always take precedence over global_finish_time, and for global_finish_time to take precedence over split_cost.
So my question to Stack Overflow is: how can I design an acceptance probability function that takes multiple energies into account but always considers the first energy to be more important than the second energy, and so on? In other words, I would like to pass in old_cost and new_cost as tuples of several costs and return a sensible value .
Edit: After a few days of experimenting with the proposed solutions I have concluded that the only way that works well enough for me is Mike Dunlavey's suggestion, even though this creates many other difficulties with cost components that have different units. I am practically forced to compare apples with oranges.
So, I put some effort into "normalizing" the values. First, deadline_cost is a sum of squares, so it grows exponentially while the other components grow linearly. To address this I use the square root to get a similar growth rate. Second, I developed a function that computes a linear combination of the costs, but auto-adjusts the coefficients according to the highest cost component seen so far.
For example, if the tuple of highest costs is (A, B, C) and the input cost vector is (x, y, z), the linear combination is BCx + Cy + z. That way, no matter how high z gets it will never be more important than an x value of 1.
This creates "jaggies" in the cost function as new maximum costs are discovered. For example, if C goes up then BCx and Cy will both be higher for a given (x, y, z) input and so will differences between costs. A higher cost difference means that the acceptance probability will drop, as if the temperature was suddenly lowered an extra step. In practice though this is not a problem because the maximum costs are updated only a few times in the beginning and do not change later. I believe this could even be theoretically proven to converge to a correct result since we know that the cost will converge toward a lower value.
One thing that still has me somewhat confused is what happens when the maximum costs are 1.0 and lower, say 0.5. With a maximum vector of (0.5, 0.5, 0.5) this would give the linear combination 0.5*0.5*x + 0.5*y + z, i.e. the order of precedence is suddenly reversed. I suppose the best way to deal with it is to use the maximum vector to scale all values to given ranges, so that the coefficients can always be the same (say, 100x + 10y + z). But I haven't tried that yet.

mbeckish is right.
Could you make a linear combination of the different energies, and adjust the coefficients?
Possibly log-transforming them in and out?
I've done some MCMC using Metropolis-Hastings. In that case I'm defining the (non-normalized) log-likelihood of a particular state (given its priors), and I find that a way to clarify my thinking about what I want.

I would take a hint from multi-objective evolutionary algorithm (MOEA) and have it transition if all of the objectives simultaneously pass with the acceptance_probability function you gave. This will have the effect of exploring the Pareto front much like the standard simulated annealing explores plateaus of same-energy solutions.
However, this does give up on the idea of having the first one take priority.
You will probably have to tweak your parameters, such as giving it a higher initial temperature.

I would consider something along the lines of:
If (new deadline_cost > old deadline_cost)
return (calculate probability)
else if (new global finish time > old global finish time)
return (calculate probability)
else if (new split cost > old split cost)
return (calculate probability)
else
return (1.0)
Of course each of the three places you calculate the probability could use a different function.

It depends on what you mean by "takes precedence".
For example, what if the deadline_cost goes down by 0.001, but the global_finish_time cost goes up by 10000? Do you return 1.0, because the deadline_cost decreased, and that takes precedence over anything else?
This seems like it is a judgment call that only you can make, unless you can provide enough background information on the project so that others can suggest their own informed judgment call.

Related

Asynchrony loss function over an array of 1D signals

So I have an array of N 1D-signals (e.g. time series) with same number of samples per signal (all in equal resolution) and I want to define a differentiable loss function to penalize asynchrony among them and therefore be zero if all N 1D signals will be equal to each other. I've been searching the literature to find something but haven't had luck yet.
Few remarks:
1 - since N (number of signals) could be quite large I can not afford to calculate Mean squared loss between every single pair which could grow combinatorialy large. also I'm not quite sure whether it would be optimal in any mathematical sense for the goal to achieve.
There are two naive loss functions that I could think of :
a) Total variation loss for each time sample across all signals (to force to reach ideally zero variation). the problem is here the weight needs to be very large to yield zero varion. masking any other loss term that is going to be added and also there is no inherent order among the N signals, which doesnt make it suitable to TV loss to begin with.
b) minimizing the sum of variance at each time point among all signals. however, choice of the reference of variance (aka mean) could be crucial I believe as just using the sample mean might not really yield the desired result, not quite sure.

Determine the running time of an algorithm with two parameters

I have implemented an algorithm that uses two other algorithms for calculating the shortest path in a graph: Dijkstra and Bellman-Ford. Based on the time complexity of the these algorithms, I can calculate the running time of my implementation, which is easy giving the code.
Now, I want to experimentally verify my calculation. Specifically, I want to plot the running time as a function of the size of the input (I am following the method described here). The problem is that I have two parameters - number of edges and number of vertices.
I have tried to fix one parameter and change the other, but this approach results in two plots - one for varying number of edges and the other for varying number of vertices.
This leads me to my question - how can I determine the order of growth based on two plots? In general, how can one experimentally determine the running time complexity of an algorithm that has more than one parameter?
It's very difficult in general.
The usual way you would experimentally gauge the running time in the single variable case is, insert a counter that increments when your data structure does a fundamental (putatively O(1)) operation, then take data for many different input sizes, and plot it on a log-log plot. That is, log T vs. log N. If the running time is of the form n^k you should see a straight line of slope k, or something approaching this. If the running time is like T(n) = n^{k log n} or something, then you should see a parabola. And if T is exponential in n you should still see exponential growth.
You can only hope to get information about the highest order term when you do this -- the low order terms get filtered out, in the sense of having less and less impact as n gets larger.
In the two variable case, you could try to do a similar approach -- essentially, take 3 dimensional data, do a log-log-log plot, and try to fit a plane to that.
However this will only really work if there's really only one leading term that dominates in most regimes.
Suppose my actual function is T(n, m) = n^4 + n^3 * m^3 + m^4.
When m = O(1), then T(n) = O(n^4).
When n = O(1), then T(n) = O(m^4).
When n = m, then T(n) = O(n^6).
In each of these regimes, "slices" along the plane of possible n,m values, a different one of the terms is the dominant term.
So there's no way to determine the function just from taking some points with fixed m, and some points with fixed n. If you did that, you wouldn't get the right answer for n = m -- you wouldn't be able to discover "middle" leading terms like that.
I would recommend that the best way to predict asymptotic growth when you have lots of variables / complicated data structures, is with a pencil and piece of paper, and do traditional algorithmic analysis. Or possibly, a hybrid approach. Try to break the question of efficiency into different parts -- if you can split the question up into a sum or product of a few different functions, maybe some of them you can determine in the abstract, and some you can estimate experimentally.
Luckily two input parameters is still easy to visualize in a 3D scatter plot (3rd dimension is the measured running time), and you can check if it looks like a plane (in log-log-log scale) or if it is curved. Naturally random variations in measurements plays a role here as well.
In Matlab I typically calculate a least-squares solution to two-variable function like this (just concatenates different powers and combinations of x and y horizontally, .* is an element-wise product):
x = log(parameter_x);
y = log(parameter_y);
% Find a least-squares fit
p = [x.^2, x.*y, y.^2, x, y, ones(length(x),1)] \ log(time)
Then this can be used to estimate running times for larger problem instances, ideally those would be confirmed experimentally to know that the fitted model works.
This approach works also for higher dimensions but gets tedious to generate, maybe there is a more general way to achieve that and this is just a work-around for my lack of knowledge.
I was going to write my own explanation but it wouldn't be any better than this.

Finding Optimal Parameters In A "Black Box" System

I'm developing machine learning algorithms which classify images based on training data.
During the image preprocessing stages, there are several parameters which I can modify that affect the data I feed my algorithms (for example, I can change the Hessian Threshold when extracting SURF features). So the flow thus far looks like:
[param1, param2, param3...] => [black box] => accuracy %
My problem is: with so many parameters at my disposal, how can I systematically pick values which give me optimized results/accuracy? A naive approach is to run i nested for-loops (assuming i parameters) and just iterate through all parameter combinations, but if it takes 5 minute to calculate an accuracy from my "black box" system this would take a long, long time.
This being said, are there any algorithms or techniques which can search for optimal parameters in a black box system? I was thinking of taking a course in Discrete Optimization but I'm not sure if that would be the best use of my time.
Thank you for your time and help!
Edit (to answer comments):
I have 5-8 parameters. Each parameter has its own range. One parameter can be 0-1000 (integer), while another can be 0 to 1 (real number). Nothing is stopping me from multithreading the black box evaluation.
Also, there are some parts of the black box that have some randomness to them. For example, one stage is using k-means clustering. Each black box evaluation, the cluster centers may change. I run k-means several times to (hopefully) avoid local optima. In addition, I evaluate the black box multiple times and find the median accuracy in order to further mitigate randomness and outliers.
As a partial solution, a grid search of moderate resolution and range can be recursively repeated in the areas where the n-parameters result in the optimal values.
Each n-dimensioned result from each step would be used as a starting point for the next iteration.
The key is that for each iteration the resolution in absolute terms is kept constant (i.e. keep the iteration period constant) but the range decreased so as to reduce the pitch/granular step size.
I'd call it a ‘contracting mesh’ :)
Keep in mind that while it avoids full brute-force complexity it only reaches exhaustive resolution in the final iteration (this is what defines the final iteration).
Also that the outlined process is only exhaustive on a subset of the points that may or may not include the global minimum - i.e. it could result in a local minima.
(You can always chase your tail though by offsetting the initial grid by some sub-initial-resolution amount and compare results...)
Have fun!
Here is the solution to your problem.
A method behind it is described in this paper.

Simplification / optimization of GPS track

I've got a GPS track produced by gpxlogger(1) (supplied as a client for gpsd). GPS receiver updates its coordinates every 1 second, gpxlogger's logic is very simple, it writes down location (lat, lon, ele) and a timestamp (time) received from GPS every n seconds (n = 3 in my case).
After writing down a several hours worth of track, gpxlogger saves several megabyte long GPX file that includes several thousands of points. Afterwards, I try to plot this track on a map and use it with OpenLayers. It works, but several thousands of points make using the map a sloppy and slow experience.
I understand that having several thousands of points of suboptimal. There are myriads of points that can be deleted without losing almost anything: when there are several points making up roughly the straight line and we're moving with the same constant speed between them, we can just leave the first and the last point and throw away anything else.
I thought of using gpsbabel for such track simplification / optimization job, but, alas, it's simplification filter works only with routes, i.e. analyzing only geometrical shape of path, without timestamps (i.e. not checking that the speed was roughly constant).
Is there some ready-made utility / library / algorithm available to optimize tracks? Or may be I'm missing some clever option with gpsbabel?
Yes, as mentioned before, the Douglas-Peucker algorithm is a straightforward way to simplify 2D connected paths. But as you have pointed out, you will need to extend it to the 3D case to properly simplify a GPS track with an inherent time dimension associated with every point. I have done so for a web application of my own using a PHP implementation of Douglas-Peucker.
It's easy to extend the algorithm to the 3D case with a little understanding of how the algorithm works. Say you have input path consisting of 26 points labeled A to Z. The simplest version of this path has two points, A and Z, so we start there. Imagine a line segment between A and Z. Now scan through all remaining points B through Y to find the point furthest away from the line segment AZ. Say that point furthest away is J. Then, you scan the points between B and I to find the furthest point from line segment AJ and scan points K through Y to find the point furthest from segment JZ, and so on, until the remaining points all lie within some desired distance threshold.
This will require some simple vector operations. Logically, it's the same process in 3D as in 2D. If you find a Douglas-Peucker algorithm implemented in your language, it might have some 2D vector math implemented, and you'll need to extend those to use 3 dimensions.
You can find a 3D C++ implementation here: 3D Douglas-Peucker in C++
Your x and y coordinates will probably be in degrees of latitude/longitude, and the z (time) coordinate might be in seconds since the unix epoch. You can resolve this discrepancy by deciding on an appropriate spatial-temporal relationship; let's say you want to view one day of activity over a map area of 1 square mile. Imagining this relationship as a cube of 1 mile by 1 mile by 1 day, you must prescale the time variable. Conversion from degrees to surface distance is non-trivial, but for this case we simplify and say one degree is 60 miles; then one mile is .0167 degrees. One day is 86400 seconds; then to make the units equivalent, our prescale factor for your timestamp is .0167/86400, or about 1/5,000,000.
If, say, you want to view the GPS activity within the same 1 square mile map area over 2 days instead, time resolution becomes half as important, so scale it down twice further, to 1/10,000,000. Have fun.
Have a look at Ramer-Douglas-Peucker algorithm for smoothening complex polygons, also Douglas-Peucker line simplification algorithm can help you reduce your points.
OpenSource GeoKarambola java library (no Android dependencies but can be used in Android) that includes a GpxPathManipulator class that does both route & track simplification/reduction (3D/elevation aware).
If the points have timestamp information that will not be discarded.
https://sourceforge.net/projects/geokarambola/
This is the algorith in action, interactively
https://lh3.googleusercontent.com/-hvHFyZfcY58/Vsye7nVrmiI/AAAAAAAAHdg/2-NFVfofbd4ShZcvtyCDpi2vXoYkZVFlQ/w360-h640-no/movie360x640_05_82_05.gif
This algorithm is based on reducing the number of points by eliminating those that have the greatest XTD (cross track distance) error until a tolerated error is satisfied or the maximum number of points is reached (both parameters of the function), wichever comes first.
An alternative algorithm, for on-the-run stream like track simplification (I call it "streamplification") is:
you keep a small buffer of the points the GPS sensor gives you, each time a GPS point is added to the buffer (elevation included) you calculate the max XTD (cross track distance) of all the points in the buffer to the line segment that unites the first point with the (newly added) last point of the buffer. If the point with the greatest XTD violates your max tolerated XTD error (25m has given me great results) then you cut the buffer at that point, register it as a selected point to be appended to the streamplified track, trim the trailing part of the buffer up to that cut point, and keep going. At the end of the track the last point of the buffer is also added/flushed to the solution.
This algorithm is lightweight enough that it runs on an AndroidWear smartwatch and gives optimal output regardless of if you move slow or fast, or stand idle at the same place for a long time. The ONLY thing that maters is the SHAPE of your track. You can go for many minutes/kilometers and, as long as you are moving in a straight line (a corridor within +/- tolerated XTD error deviations) the streamplify algorithm will only output 2 points: those of the exit form last curve and entry on next curve.
I ran in to a similar issue. The rate at which the gps unit takes points is much larger that needed. Many of the points are not geographically far away from each other. The approach that I took is to calculate the distance between the points using the haversine formula. If the distance was not larger than my threshold (0.1 miles in my case) I threw away the point. This quickly gets the number of points down to a manageable size.
I don't know what language you are looking for. Here is a C# project that I was working on. At the bottom you will find the haversine code.
http://blog.bobcravens.com/2010/09/gps-using-the-netduino/
Hope this gets you going.
Bob
This is probably NP-hard. Suppose you have points A, B, C, D, E.
Let's try a simple deterministic algorithm. Suppose you calculate the distance from point B to line A-C and it's smaller than your threshold (1 meter). So you delete B. Then you try the same for C to line A-D, but it's bigger and D for C-E, which is also bigger.
But it turns out that the optimal solution is A, B, E, because point C and D are close to the line B-E, yet on opposite sides.
If you delete 1 point, you cannot be sure that it should be a point that you should keep, unless you try every single possible solution (which can be n^n in size, so on n=80 that's more than the minimum number of atoms in the known universe).
Next step: try a brute force or branch and bound algorithm. Doesn't scale, doesn't work for real-world size. You can safely skip this step :)
Next step: First do a determinstic algorithm and improve upon that with a metaheuristic algorithm (tabu search, simulated annealing, genetic algorithms). In java there are a couple of open source implementations, such as Drools Planner.
All in all, you 'll probably have a workable solution (although not optimal) with the first simple deterministic algorithm, because you only have 1 constraint.
A far cousin of this problem is probably the Traveling Salesman Problem variant in which the salesman cannot visit all cities but has to select a few.
You want to throw away uninteresting points. So you need a function that computes how interesting a point is, then you can compute how interesting all the points are and throw away the N least interesting points, where you choose N to slim the data set sufficiently. It sounds like your definition of interesting corresponds to high acceleration (deviation from straight-line motion), which is easy to compute.
Try this, it's free and opensource online Service:
https://opengeo.tech/maps/gpx-simplify-optimizer/
I guess you need to keep points where you change direction. If you split your track into the set of intervals of constant direction, you can leave only boundary points of these intervals.
And, as Raedwald pointed out, you'll want to leave points where your acceleration is not zero.
Not sure how well this will work, but how about taking your list of points, working out the distance between them and therefore the total distance of the route and then deciding on a resolution distance and then just linear interpolating the position based on each step of x meters. ie for each fix you have a "distance from start" measure and you just interpolate where n*x is for your entire route. (you could decide how many points you want and divide the total distance by this to get your resolution distance). On top of this you could add a windowing function taking maybe the current point +/- z points and applying a weighting like exp(-k* dist^2/accuracy^2) to get the weighted average of a set of points where dist is the distance from the raw interpolated point and accuracy is the supposed accuracy of the gps position.
One really simple method is to repeatedly remove the point that creates the largest angle (in the range of 0° to 180° where 180° means it's on a straight line between its neighbors) between its neighbors until you have few enough points. That will start off removing all points that are perfectly in line with their neighbors and will go from there.
You can do that in Ο(n log(n)) by making a list of each index and its angle, sorting that list in descending order of angle, keeping how many you need from the front of the list, sorting that shorter list in descending order of index, and removing the indexes from the list of points.
def simplify_points(points, how_many_points_to_remove)
angle_map = Array.new
(2..points.length - 1).each { |next_index|
removal_list.add([next_index - 1, angle_between(points[next_index - 2], points[next_index - 1], points[next_index])])
}
removal_list = removal_list.sort_by { |index, angle| angle }.reverse
removal_list = removal_list.first(how_many_points_to_remove)
removal_list = removal_list.sort_by { |index, angle| index }.reverse
removal_list.each { |index| points.delete_at(index) }
return points
end

Need help generating discrete random numbers from distribution

I searched the site but did not find exactly what I was looking for... I wanted to generate a discrete random number from normal distribution.
For example, if I have a range from a minimum of 4 and a maximum of 10 and an average of 7. What code or function call ( Objective C preferred ) would I need to return a number in that range. Naturally, due to normal distribution more numbers returned would center round the average of 7.
As a second example, can the bell curve/distribution be skewed toward one end of the other? Lets say I need to generate a random number with a range of minimum of 4 and maximum of 10, and I want the majority of the numbers returned to center around the number 8 with a natural fall of based on a skewed bell curve.
Any help is greatly appreciated....
Anthony
What do you need this for? Can you do it the craps player's way?
Generate two random integers in the range of 2 to 5 (inclusive, of course) and add them together. Or flip a coin (0,1) six times and add 4 to the result.
Summing multiple dice produces a normal distribution (a "bell curve"), while eliminating high or low throws can be used to skew the distribution in various ways.
The key is you are going for discrete numbers (and I hope you mean integers by that). Multiple dice throws famously generate a normal distribution. In fact, I think that's how we were first introduced to the Gaussian curve in school.
Of course the more throws, the more closely you approximate the bell curve. Rolling a single die gives a flat line. Rolling two dice just creates a ramp up and down that isn't terribly close to a bell. Six coin flips gets you closer.
So consider this...
If I understand your question correctly, you only have seven possible outcomes--the integers (4,5,6,7,8,9,10). You can set up an array of seven probabilities to approximate any distribution you like.
Many frameworks and libraries have this built-in.
Also, just like TokenMacGuy said a normal distribution isn't characterized by the interval it's defined on, but rather by two parameters: Mean μ and standard deviation σ. With both these parameters you can confine a certain quantile of the distribution to a certain interval, so that 95 % of all points fall in that interval. But resticting it completely to any interval other than (−∞, ∞) is impossible.
There are several methods to generate normal-distributed values from uniform random values (which is what most random or pseudorandom number generators are generating:
The Box-Muller transform is probably the easiest although not exactly fast to compute. Depending on the number of numbers you need, it should be sufficient, though and definitely very easy to write.
Another option is Marsaglia's Polar method which is usually faster1.
A third method is the Ziggurat algorithm which is considerably faster to compute but much more complex to program. In applications that really use a lot of random numbers it may be the best choice, though.
As a general advice, though: Don't write it yourself if you have access to a library that generates normal-distributed random numbers for you already.
For skewing your distribution I'd just use a regular normal distribution, choosing μ and σ appropriately for one side of your curve and then determine on which side of your wanted mean a point fell, stretching it appropriately to fit your desired distribution.
For generating only integers I'd suggest you just round towards the nearest integer when the random number happens to fall within your desired interval and reject it if it doesn't (drawing a new random number then). This way you won't artificially skew the distribution (such as you would if you were clamping the values at 4 or 10, respectively).
1 In testing with deliberately bad random number generators (yes, worse than RANDU) I've noticed that the polar method results in an endless loop, rejecting every sample. Won't happen with random numbers that fulfill the usual statistic expectations to them, though.
Yes, there are sophisticated mathematical solutions, but for "simple but practical" I'd go with Nosredna's comment. For a simple Java solution:
Random random=new Random();
public int bell7()
{
int n=4;
for (int x=0;x<6;++x)
n+=random.nextInt(2);
return n;
}
If you're not a Java person, Random.nextInt(n) returns a random integer between 0 and n-1. I think the rest should be similar to what you'd see in any programming language.
If the range was large, then instead of nextInt(2)'s I'd use a bigger number in there so there would be fewer iterations through the loop, depending on frequency of call and performance requirements.
Dan Dyer and Jay are exactly right. What you really want is a binomial distribution, not a normal distribution. The shape of a binomial distribution looks a lot like a normal distribution, but it is discrete and bounded whereas a normal distribution is continuous and unbounded.
Jay's code generates a binomial distribution with 6 trials and a 50% probability of success on each trial. If you want to "skew" your distribution, simply change the line that decides whether to add 1 to n so that the probability is something other than 50%.
The normal distribution is not described by its endpoints. Normally it's described by it's mean (which you have given to be 7) and its standard deviation. An important feature of this is that it is possible to get a value far outside the expected range from this distribution, although that will be vanishingly rare, the further you get from the mean.
The usual means for getting a value from a distribution is to generate a random value from a uniform distribution, which is quite easily done with, for example, rand(), and then use that as an argument to a cumulative distribution function, which maps probabilities to upper bounds. For the standard distribution, this function is
F(x) = 0.5 - 0.5*erf( (x-μ)/(σ * sqrt(2.0)))
where erf() is the error function which may be described by a taylor series:
erf(z) = 2.0/sqrt(2.0) * Σ∞n=0 ((-1)nz2n + 1)/(n!(2n + 1))
I'll leave it as an excercise to translate this into C.
If you prefer not to engage in the exercise, you might consider using the Gnu Scientific Library, which among many other features, has a technique to generate random numbers in one of many common distributions, of which the Gaussian Distribution (hint) is one.
Obviously, all of these functions return floating point values. You will have to use some rounding strategy to convert to a discrete value. A useful (but naive) approach is to simply downcast to integer.