Why is score calculation speed faster for Construction Heuristics than Local Search? - optaplanner

Getting started with OptaPlanner (v.23.0.Final), I am experimenting with the CloudBalancing example. Using the IncrementalScoreCalculator Java class, I notice that the score calculation speed is much higher in the Construction Heuristic phase (>1M/sec) than in the Local Search phase (~50k/sec). How can this happen? Is time spent in the algorithm outside the score calculation also included in that number? That could explain the difference, since the Local Search algorithm spends much more time outside the score calculator than the construction algorithm does.

Two reasons:
1) The Construction Heuristic starts with no processes assigned to a computer, so Process.getComputer() is null for every process. Most constraints match on processes for which computer != null, so they short-circuit and don't do any expensive joins, groupBy's, accumulates, etc. So an empty or partially initialized solution evaluates much faster than a fully initialized one (which Local Search works on).
2) Construction Heuristics only do ChangeMoves. Local Search also does more expensive moves, including swap moves (twice as big) and pillar moves (n times as big), so the amount of delta impact to calculate per move is bigger in Local Search too.
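As a rough illustration of reason 1, here is a minimal Python sketch (this is not OptaPlanner's API; the dict-based processes and the single toy constraint are made up) showing why a constraint that skips unassigned entities does almost no work on a partially initialized solution:

# Toy constraint: group CPU demand per computer, skipping unassigned processes.
def used_cpu_per_computer(processes):
    usage = {}
    for p in processes:
        if p["computer"] is None:   # short-circuit: unassigned, so no join/groupBy work
            continue
        usage[p["computer"]] = usage.get(p["computer"], 0) + p["cpu"]
    return usage

# During the Construction Heuristic most processes are still unassigned:
uninitialized = [{"computer": None, "cpu": 4} for _ in range(1000)]
print(used_cpu_per_computer(uninitialized))   # {} -- almost no work per evaluation

# During Local Search every process is assigned, so every evaluation
# (and every move's delta) touches real groupings and sums:
initialized = [{"computer": i % 10, "cpu": 4} for i in range(1000)]
print(used_cpu_per_computer(initialized))     # {0: 400, 1: 400, ..., 9: 400}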

Related

Raku parallel/functional methods

I am pretty new to Raku and I have a question about functional methods, in particular reduce.
I originally had the method:
sub standardab {
    my $mittel = mittel(@_);
    my $foo = 0;
    for @_ {
        $foo += ($_ - $mittel)**2;
    }
    $foo = sqrt($foo/(@_.elems));
}
and it worked fine. Then I started to use reduce:
sub standardab {
    my $mittel = mittel(@_);
    my $foo = 0;
    $foo = @_.reduce({ $^a + ($^b - $mittel)**2 });
    $foo = sqrt($foo/(@_.elems));
}
my execution time doubled (I am applying this to roughly 1000 elements) and the solution differed by 0.004 (I guess a rounding error).
If I am using
.race.reduce(...)
my execution time is 4 times higher than with the original sequential code.
Can someone tell me the reason for this?
I thought about parallelism initialization time, but - as I said - I am applying this to 1000 elements, and if I change other for loops in my code to reduce it gets even slower!
Thanks for your help
Summary
In general, reduce and for do different things, and they are doing different things in your code. For example, compared with your for code, your reduce code involves twice as many arguments being passed and is doing one less iteration. I think that's likely at the root of the 0.004 difference.
Even if your for and reduce code did the same thing, an optimized version of such reduce code would never be faster than an equally optimized version of equivalent for code.
I thought that race didn't automatically parallelize reduce due to reduce's nature. (Though I see per your and @user0721090601's comment I'm wrong.) But it will incur overhead -- currently a lot.
You could use race to parallelize your for loop instead, if it's slightly rewritten. That might speed it up.
On the difference between your for and reduce code
Here's the difference I meant:
say do for <a b c d> { $^a } # (a b c d) (4 iterations)
say do reduce <a b c d>: { $^a, $^b } # (((a b) c) d) (3 iterations)
For more details of their operation, see their respective doc (for, reduce).
You haven't shared your data, but I will presume that the for and/or reduce computations involve Nums (floats). Addition of floats isn't associative, so you may well get (typically small) discrepancies if the additions end up happening in a different order or grouping.
I presume that explains the 0.004 difference.
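A quick Python demonstration of the effect (the same IEEE 754 floating-point behaviour applies to Raku's Nums): the same three numbers summed in a different order give different results.

a = [1e16, 1.0, 1.0]   # big value first: each +1.0 is lost to rounding
b = [1.0, 1.0, 1e16]   # small values first: they accumulate before the big add
print(sum(a))          # 1e+16
print(sum(b))          # 1.0000000000000002e+16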
On your sequential reduce being 2X slower than your for
my execution time doubled (I am applying this to roughly 1000 elements)
First, your reduce code is different, as explained above. There are general abstract differences (eg taking two arguments per call instead of your for block's one) and perhaps your specific data leads to fundamental numeric computation differences (perhaps your for loop computation is primarily integer or float math while your reduce is primarily rational?). That might explain the execution time difference, or some of it.
Another part of it may be the difference between, on the one hand, a reduce, which will by default compile into calls of a closure, with call overhead, and two arguments per call, and temporary memory storing intermediate results, and, on the other, a for which will by default compile into direct iteration, with the {...} being just inlined code rather than a call of a closure. (That said, it's possible a reduce will sometimes compile to inlined code; and it may even already be that way for your code.)
More generally, Rakudo optimization effort is still in its relatively early days. Most of it has been generic, speeding up all code. Where effort has been applied to particular constructs, the most widely used constructs have gotten the attention so far, and for is widely used and reduce less so. So some or all the difference may just be that reduce is poorly optimized.
On reduce with race
my execution time [for .race.reduce(...)] is 4 times higher than with the original sequential code
I didn't think reduce would be automatically parallelizable with race. Per its doc, reduce works by "iteratively applying a function which knows how to combine two values", and one argument in each iteration is the result of the previous iteration. So it seemed to me it must be done sequentially.
(I see in the comments that I'm misunderstanding what could be done by a compiler with a reduction. Perhaps this is if it's a commutative operation?)
In summary, your code is incurring raceing's overhead without gaining any benefit.
On race in general
Let's say you're using some operation that is parallelizable with race.
First, as you noted, race incurs overhead. There'll be an initialization and teardown cost, at least some of which is paid repeatedly for each evaluation of an overall statement/expression that's being raced.
Second, at least for now, race means use of threads running on CPU cores. For some payloads that can yield a useful benefit despite any initialization and teardown costs. But it will, at best, be a speed up equal to the number of cores.
(One day it should be possible for compiler implementors to spot that a raced for loop is simple enough to be run on a GPU rather than a CPU, and go ahead and send it to a GPU to achieve a spectacular speed up.)
Third, if you literally write .race.foo... you'll get default settings for some tunable aspects of the racing. The defaults are almost certainly not optimal and may be way off.
The currently tunable settings are :batch and :degree. See their doc for more details.
More generally, whether parallelization speeds up code depends on the details of a specific use case such as the data and hardware in use.
On using race with for
If you rewrite your code a bit you can race your for:
$foo = sum do race for @_ { ($_ - $mittel)**2 }
To apply tuning you must repeat the race as a method, for example:
$foo = sum do race for @_.race(:degree(8)) { ($_ - $mittel)**2 }

Determining a program's execution time by its length in bits?

This is a question that popped into my mind while reading about the halting problem, the Collatz conjecture and Kolmogorov complexity. I have tried to search for something similar, but I was unable to find a particular topic, maybe because it is not of great value or it could just be a trivial question.
For the sake of simplicity I will give three examples of programs/functions.
function one(s):
    return s

function two(s):
    while (True):
        print(s)

function three(s):
    for i from 0 to 10^10:
        print(s)
So my question is: is there a way to formalize the length of a program (like the bits used to describe it) and also the internal memory used by the program, in order to determine the minimum/maximum number of steps needed to decide whether the program will terminate or run forever?
For example, in the first function the program doesn't alter its internal memory and halts after some time steps.
In the second example, the program runs forever, but it also doesn't alter its internal memory. For example, if we considered all programs of the same length as program two that do not alter their state, couldn't we determine an upper bound on the number of steps which, if surpassed, would let us conclude that the program will never terminate? (If not, why not?)
In the last example, the program alters its state (variable i), so at each step the upper bound may change.
[In short]
Kolmogorov complexity suggests a way of finding the (descriptive) complexity of an object such as a piece of text. I would like to know, given a formal way of describing the memory space used by a program (at runtime), whether we could compute a maximum number of steps which, if surpassed, would tell us that the program will never terminate.
Finally, I would appreciate any sources that I might find useful and that would help me figure out what exactly I am looking for.
Thank you. (sorry for my English, not my native language. I hope I was clear)
If a deterministic Turing machine enters precisely the same configuration twice (which we can detect by keeping a trace of the configurations seen so far), then we immediately know the TM will loop forever.
If it is known in advance that a deterministic Turing machine cannot possibly use more than some fixed constant amount of its input tape, then the TM must either explicitly halt or eventually enter some configuration it has already visited. Suppose the TM can use at most k tape cells, the tape alphabet is T and the set of states is Q. Then there are at most (|T|+1)^k * |Q| * k distinct configurations (the number of strings over T union {blank} of length k, times the number of states, times the number of possible head positions), and by the pigeonhole principle any run longer than that must revisit some configuration it has already been in before.
one: because we are given that this function does not alter its internal memory, the number of distinct configurations it can be in is bounded, so within that bound we can decide its behaviour - here it simply halts.
two: the same reasoning applies; since it never alters its internal memory, after a bounded number of steps it must revisit a configuration, and we can conclude that it loops forever.
three: because we are given that this function only uses a fixed amount of internal memory (say 34 bits for the loop counter), we can tell in fewer than 2^34 iterations of the loop whether it will halt or not for any given input s, guaranteed.
Now, knowing how much tape a TM is going to use, or how much memory a program is going to use, is not a problem a TM can solve in general. But if you have an oracle (like a person who was able to do a proof) that tells you a correct fixed upper bound on memory, then the halting problem for that program is solvable.
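As a minimal sketch of that idea, here is a Python function that decides halting for any deterministic machine whose state space is known to be finite, simply by remembering the configurations seen so far (the step functions below are made-up toy machines):

def halts(step, state):
    # `step` maps a state to the next state, or to None when the machine halts.
    # Because the state space is finite, the run either halts or revisits a
    # state, and revisiting a state means it will loop forever.
    seen = set()
    while state is not None:
        if state in seen:
            return False          # same configuration twice => infinite loop
        seen.add(state)
        state = step(state)
    return True                   # reached a halting configuration

# A toy "program": an 8-bit counter that halts when it is about to wrap to 0 ...
print(halts(lambda x: None if (x + 1) % 256 == 0 else x + 1, 0))   # True

# ... and one that never halts because it cycles through 0..255 forever.
print(halts(lambda x: (x + 1) % 256, 0))                           # False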

Optimizing Parameters using AI technique

I know that my question is general, but I'm new to the AI area.
I have an experiment with some parameters (about 6 of them). Each one of them is independent, and I want to find the optimal solution that maximizes or minimizes the output function. However, if I do it with a traditional brute-force approach it will take a lot of time, since I would use six nested loops.
I just want to know which AI technique to use for this problem: a genetic algorithm? A neural network? Machine learning?
Update
Actually, the problem could have more than one evaluation function.
It will have one function that we want to minimize (Cost)
and another function that we want to maximize (Capacity).
More functions may be added later.
Example:
Constructing a glass window can be done in a million ways. However, we want the strongest window at the lowest cost. There are many parameters that affect the pressure capacity of the window, such as the strength of the glass, the height and width, and the slope of the window.
Obviously, if we go to extreme cases (the strongest glass, with the smallest width and height, and zero slope) the window will be extremely strong. However, the cost will be very high.
I want to study the interaction between the parameters in a specific range.
Without knowing much about the specific problem it sounds like Genetic Algorithms would be ideal. They've been used a lot for parameter optimisation and have often given good results. Personally, I've used them to narrow parameter ranges for edge detection techniques with about 15 variables and they did a decent job.
Having multiple evaluation functions needn't be a problem if you code this into the Genetic Algorithm's fitness function. I'd look up multi-objective optimisation with genetic algorithms.
I'd start here: Multi-Objective optimization using genetic algorithms: A tutorial
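If you want to see the overall shape of such an approach, here is a small Python sketch of a genetic algorithm for tuning six numeric parameters. The bounds, the placeholder fitness (a weighted sum of a made-up strength and cost, one simple way to fold two objectives into one score) and all the GA settings are assumptions you would adapt to the real problem:

import random

BOUNDS = [(0.0, 10.0)] * 6   # assumed allowed range for each of the 6 parameters

def fitness(ind):
    strength = -sum((p - 3.0) ** 2 for p in ind)   # placeholder "capacity" to maximise
    cost = sum(ind)                                 # placeholder "cost" to minimise
    return strength - 0.1 * cost                    # weighted sum of the two objectives

def random_individual():
    return [random.uniform(lo, hi) for lo, hi in BOUNDS]

def crossover(a, b):
    # Uniform crossover: each parameter comes from one of the two parents.
    return [random.choice(genes) for genes in zip(a, b)]

def mutate(ind, rate=0.2, scale=0.5):
    # Nudge some parameters by a small random amount, clamped to the bounds.
    return [
        min(hi, max(lo, p + random.uniform(-scale, scale))) if random.random() < rate else p
        for p, (lo, hi) in zip(ind, BOUNDS)
    ]

population = [random_individual() for _ in range(50)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                       # simple truncation selection
    population = parents + [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(40)
    ]

best = max(population, key=fitness)
print(best, fitness(best))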
First of all, if you have multiple competing targets the problem is not well defined.
You have to find a single value that you want to maximize... for example:
value = strength - k*cost
or
value = strength / (k1 + k2*cost)
In both, for a fixed strength the lower cost wins, and for a fixed cost the higher strength wins, but now you have a formula to decide whether a given solution is better or worse than another. If you don't do this, how can you decide whether a solution is better than another one that is cheaper but weaker?
In some cases a correctly defined value requires a more complex function... for example for strength the value could increase up to a certain point (i.e. having a result stronger than a prescribed amount is just pointless) or a cost could have a cap (because higher than a certain amount a solution is not interesting because it would place the final price out of the market).
Once you have found your criterion, if the parameters are independent, a very simple approach that in my experience is still decent is the following (a rough code sketch follows after the list):
1. Pick a random solution by choosing n random values, one for each parameter, within the allowed boundaries.
2. Compute the target value for this starting point.
3. Pick a random number 1 <= k <= n and, for each of k parameters randomly chosen from the n, compute a random signed increment and change the parameter by that amount.
4. Compute the new target value for the translated solution.
5. If the new value is better, keep the new position; otherwise revert to the original one.
6. Repeat from 3 until you run out of time.
Depending on the target function there are random distributions that work better than others, and it may be that the optimal choice is different for different parameters.
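Here is a rough Python sketch of the numbered procedure above; the bounds and the target function are made-up placeholders (in the window example the target could be something like strength - k*cost):

import random

BOUNDS = [(0.0, 10.0)] * 6   # assumed allowed range for each parameter

def target(params):
    return -sum((p - 3.0) ** 2 for p in params)    # placeholder objective to maximise

def perturb(params, scale=0.5):
    # Step 3: change a random subset of the parameters by a random signed amount.
    new = params[:]
    k = random.randint(1, len(params))
    for i in random.sample(range(len(params)), k):
        lo, hi = BOUNDS[i]
        new[i] = min(hi, max(lo, new[i] + random.uniform(-scale, scale)))
    return new

best = [random.uniform(lo, hi) for lo, hi in BOUNDS]   # step 1
best_value = target(best)                              # step 2
for _ in range(10_000):                                # step 6: repeat until out of budget
    candidate = perturb(best)                          # step 3
    value = target(candidate)                          # step 4
    if value > best_value:                             # step 5: keep improvements only
        best, best_value = candidate, value

print(best, best_value)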
Some time ago I wrote some C++ code for solving optimization problems using Genetic Algorithms. Here it is: http://create-technology.blogspot.ro/2015/03/a-genetic-algorithm-for-solving.html
It should be very easy to follow.

HLSL branch avoidance

I have a shader where I want to move half of the vertices in the vertex shader. I'm trying to decide the best way to do this from a performance standpoint, because we're dealing with well over 100,000 verts, so speed is critical. I've looked at 3 different methods: (pseudo-code, but enough to give you the idea. The <complex formula> I can't give out, but I can say that it involves a sin() function, as well as a function call (just returns a number, but still a function call), as well as a bunch of basic arithmetic on floating point numbers).
if (y < 0.5)
{
    x += <complex formula>;
}
This has the advantage that the <complex formula> is only executed half the time, but the downside is that it definitely causes a branch, which may actually be slower than the formula. It is the most readable, but we care more about speed than readability in this context.
x += step(y, 0.5) * <complex formula>;
Using HLSL's step() function (which returns 0 if the first param is greater and 1 if less), you can eliminate the branch, but now the <complex formula> is being called every time, and its results are being multiplied by 0 (thus wasted effort) half of the time.
x += (y < 0.5) ? <complex formula> : 0;
This I don't know about. Does the ?: cause a branch? And if not, are both sides of the equation evaluated or only the one that is relevant?
The final possibility is that the <complex formula> could be offloaded back to the CPU instead of the GPU, but I worry that it will be slower in calculating sin() and other operations, which might result in a net loss. Also, it means one more number has to be passed to the shader, and that could cause overhead as well. Anyone have any insight as to which would be the best course of action?
Addendum:
According to http://msdn.microsoft.com/en-us/library/windows/desktop/bb509665%28v=vs.85%29.aspx
the step() function uses a ?: internally, so it's probably no better than my 3rd solution, and potentially worse since <complex formula> is definitely called every time, whereas it may be only called half the time with a straight ?:. (Nobody's answered that part of the question yet.) Though avoiding both and using:
x += (1.0 - y) * <complex formula>;
may well be better than any of them, since there's no comparison being made anywhere. (And y is always either 0 or 1.) Still executes the <complex formula> needlessly half the time, but might be worth it to avoid branches altogether.
Perhaps look at this answer.
My guess (this is a performance question: measure it!) is that you are best off keeping the if statement.
Reason number one: The shader compiler, in theory (and if invoked correctly), should be clever enough to make the best choice between a branch instruction, and something similar to the step function, when it compiles your if statement. The only way to improve on it is to profile[1]. Note that it's probably hardware-dependent at this level of granularity.
[1] Or if you have specific knowledge about how your data is laid out, read on...
Reason number two is the way shader units work: If even one fragment or vertex in the unit takes a different branch to the others, then the shader unit must take both branches. But if they all take the same branch - the other branch is ignored. So while it is per-unit, rather than per-vertex - it is still possible for the expensive branch to be skipped.
For fragments, the shader units have on-screen locality - meaning you get best performance with groups of nearby pixels all taking the same branch (see the illustration in my linked answer). To be honest, I don't know how vertices are grouped into units - but if your data is grouped appropriately - you should get the desired performance benefit.
Finally: It's worth pointing out that your <complex formula> - if you're saying that you can hoist it out of your HLSL manually - it may well get hoisted into a CPU-based pre-shader anyway (on PC at least, from memory Xbox 360 doesn't support this, no idea about PS3). You can check this by decompiling the shader. If it is something that you only need to calculate once per-draw (rather than per-vertex/fragment) it probably is best for performance to do it on the CPU.
I got tired of my conditionals being ignored, so I just made another kernel and did an override in the C execution.
If you need it to be accurate all the time, I suggest this fix.

Time Complexity confusion

I've always been a bit confused about this, possibly due to my lack of understanding of compilers. But let's use Python as an example. If we had some large list of numbers called numlist and wanted to get rid of any duplicates, we could use the set operation on the list, e.g. set(numlist). In return we would have a set of our numbers. This operation, to the best of my knowledge, will be done in O(n) time. Though if I were to create my own algorithm to handle this operation, the absolute best I could ever hope for is O(n^2).
What I don't get is: what allows an internal operation like set() to be so much faster than an algorithm external to the language? The checking still needs to be done, doesn't it?
You can do this in Θ(n) average time using a hash table. Lookup and insertion in a hash table are Θ(1) on average. Thus, you just run through the n items and, for each one, check whether it is already in the hash table and, if not, insert it.
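Here is a minimal Python sketch of that idea (the function name is just for illustration):

def dedupe(items):
    # Hash-set based de-duplication: expected O(1) per lookup/insert,
    # so O(n) on average for the whole list, and the original order is kept.
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

print(dedupe([3, 1, 3, 2, 1]))   # [3, 1, 2]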
What I don't get is: what allows an internal operation like set() to be so much faster than an algorithm external to the language? The checking still needs to be done, doesn't it?
The asymptotic complexity of an algorithm does not change if it is implemented by the language implementers versus being implemented by a user of the language. As long as both are implemented in a Turing-complete language with a random access memory model they have the same capabilities, and algorithms implemented in each will have the same asymptotic complexity. If an algorithm is theoretically O(f(n)) it does not matter if it is implemented in assembly language, C#, or Python; it will still be O(f(n)).
You can do this in O(n) in any language; here is the idea in Python:
# Get min and max values: O(n).
min_val = old_list[0]
max_val = old_list[0]
for x in old_list:
    if x < min_val:
        min_val = x
    if x > max_val:
        max_val = x

# Initialise boolean list: O(max - min + 1), i.e. O(n) for a bounded integer range.
is_in_list = [False] * (max_val - min_val + 1)

# Mark the values that occur in the old list: O(n).
for x in old_list:
    is_in_list[x - min_val] = True

# Create the new (duplicate-free, sorted) list from the booleans.
new_list = []
for i, present in enumerate(is_in_list):
    if present:
        new_list.append(i + min_val)
I'm assuming here that append is an O(1) operation, which it should be unless the implementer was brain-dead. So with k steps each O(n), you still have an O(n) operation.
Whether the steps are explicitly done in your code or whether they're done under the covers of a language is irrelevant. Otherwise you could claim that the C qsort was one operation and you now have the holy grail of an O(1) sort routine :-)
As many people have discovered, you can often trade off space complexity for time complexity. For example, the above only works because we're allowed to introduce the is_in_list and new_list variables. If this were not allowed, the next best solution may be sorting the list (probably no better than O(n log n)) followed by an O(n) (I think) operation to remove the duplicates.
As an extreme example, you can use that same extra-space method to sort an arbitrary number of 32-bit integers (say with each value having 255 or fewer duplicates) in O(n) time, provided you can allocate about four billion bytes for storing the counts.
Simply initialise all the counts to zero and run through each position in your list, incrementing the count based on the number at that position. That's O(n).
Then start at the beginning of the list and run through the count array, placing that many of the correct value in the list. That's O(1), with the 1 being about four billion of course but still constant time :-)
That's also O(1) space complexity but a very big "1". Typically trade-offs aren't quite that severe.
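As a scaled-down sketch of that counting idea (using a 0..255 value range instead of the full 32-bit range so the count array stays small; the function name and range are assumptions for illustration):

def counting_sort(nums, lo=0, hi=255):
    # O(n) to count, O(hi - lo + 1) -- a constant -- to write the output.
    counts = [0] * (hi - lo + 1)
    for x in nums:
        counts[x - lo] += 1
    out = []
    for value, count in enumerate(counts, start=lo):
        out.extend([value] * count)
    return out

print(counting_sort([5, 3, 5, 0, 255, 3]))   # [0, 3, 3, 5, 5, 255]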
The complexity bound of an algorithm is completely unrelated to whether it is implemented 'internally' or 'externally'.
Taking a list and turning it into a set through set() is O(n).
This is because set is implemented as a hash set. That means that to check if something is in the set or to add something to the set only takes O(1), constant time. Thus, to make a set from an iterable (like a list for example), you just start with an empty set and add the elements of the iterable one by one. Since there are n elements and each insertion takes O(1), the total time of converting an iterable to a set is O(n).
To understand how the hash implementation works, see the Wikipedia article on hash tables.
Off hand I can't think of how to do this in O(n), but here is the cool thing:
The difference between n^2 and n is so massive that the difference between you implementing it and Python implementing it is tiny compared to the algorithm used to implement it. For large enough n, the O(n^2) approach is always worse than the O(n) one, even if the n^2 one is in C and the O(n) one is in Python. You should never think that kind of difference comes from the fact that you're not writing in a low-level language.
That said, if you want to implement your own, you can do a sort and then remove duplicates: the sort is O(n log n) and removing the duplicates is O(n)...
There are two issues here.
Time complexity (which is expressed in big O notation) is a formal measure of how long an algorithm takes to run for a given set size. It's more about how well an algorithm scales than about the absolute speed.
The actual speed (say, in milliseconds) of an algorithm is the time complexity multiplied by a constant (in an ideal world).
Two people could implement the same removal of duplicates algorithm with O(log(n)*n) complexity, but if one writes it in Python and the other writes it in optimised C, then the C program will be faster.