How do you derive the time complexity of alpha-beta pruning?

I understand the basics of minimax and alpha-beta pruning. In all the literature, they talk about the time complexity for the best case being O(b^(d/2)), where b = branching factor and d = depth of the tree, and the best case is when all the preferred nodes are expanded first.
In my example of the "best case", I have a binary tree of 4 levels, so out of the 16 terminal nodes, I need to expand at most 7 nodes. How does this relate to O(b^(d/2))?
I don't understand how they come to O(b^(d/2)).

O(b^(d/2)) corresponds to the best-case time complexity of alpha-beta pruning. Explanation:
With an (average or constant) branching factor of b, and a search
depth of d plies, the maximum number of leaf node positions evaluated
(when the move ordering is pessimal) is O(b*b*...*b) = O(b^d) – the
same as a simple minimax search. If the move ordering for the search
is optimal (meaning the best moves are always searched first), the
number of leaf node positions evaluated is about O(b*1*b*1*...*b) for
odd depth and O(b*1*b*1*...*1) for even depth, or O(b^(d/2)). In the
latter case, where the ply of a search is even, the effective
branching factor is reduced to its square root, or, equivalently, the
search can go twice as deep with the same amount of computation.
The explanation of b*1*b*1*... is that all the first player's moves
must be studied to find the best one, but for each, only the best
second player's move is needed to refute all but the first (and best)
first player move – alpha–beta ensures no other second player moves
need be considered.
Put simply, you "skip" every second level.
O describes the limiting behavior of a function when the argument tends towards a particular value or infinity, so in your case comparing precisely O(b^(d/2)) with small values of b and d doesn't really make sense.
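To make the b*1*b*1*... argument concrete, here is a minimal sketch that counts the leaves alpha-beta evaluates on a uniform tree whose children are sorted best-first. The helper names (`build_tree`, `order_best_first`, `leaves_evaluated`) and the random leaf values are assumptions made for this illustration, not part of the quoted explanation.

```python
import random

def build_tree(depth, b, rng):
    """Uniform tree as nested lists, with random floats at the leaves."""
    if depth == 0:
        return rng.random()
    return [build_tree(depth - 1, b, rng) for _ in range(b)]

def minimax(node, maximizing):
    """Plain minimax value, used only to sort the children."""
    if not isinstance(node, list):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

def order_best_first(node, maximizing):
    """Reorder children so the best move for the side to play comes first."""
    if not isinstance(node, list):
        return node
    children = [order_best_first(child, not maximizing) for child in node]
    return sorted(children, key=lambda c: minimax(c, not maximizing),
                  reverse=maximizing)

def alphabeta(node, alpha, beta, maximizing, counter):
    """Standard alpha-beta; counter[0] counts evaluated leaves."""
    if not isinstance(node, list):
        counter[0] += 1
        return node
    best = float("-inf") if maximizing else float("inf")
    for child in node:
        value = alphabeta(child, alpha, beta, not maximizing, counter)
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:  # remaining siblings cannot change the result
            break
    return best

def leaves_evaluated(b, d, seed=0):
    """Leaves visited by alpha-beta on a perfectly ordered b-ary tree of depth d."""
    tree = order_best_first(build_tree(d, b, random.Random(seed)), True)
    counter = [0]
    alphabeta(tree, float("-inf"), float("inf"), True, counter)
    return counter[0]

print(leaves_evaluated(2, 4))  # binary tree with 16 leaves
```

For b = 2 and d = 4 the count is 2^2 + 2^2 - 1 = 7, which matches the 7 of 16 leaves in the question. In general the best case is b^ceil(d/2) + b^floor(d/2) - 1 leaves, and that expression is O(b^(d/2)).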

Related

Does it make sense to use big-O to describe the best case for a function?

I have an extremely pedantic question on big-O notation that I would like some opinions on. One of my uni subjects states “Best O(1) if their first element is the same” for a question on checking if two lists have a common element.
My qualm with this is that it does not describe the function on the entire domain of large inputs, rather the restricted domain of large inputs that have two lists with the same first element. Does it make sense to describe a function by only talking about a subset of that function’s domain? Of course, when restricted to that domain, the time complexity is omega(1), O(1) and therefore theta(1), but this isn’t describing the original function. From my understanding it would be more correct to say the entire function is bounded by omega(1). (and O(m*n) where m, n are the sizes of the two input lists).
What do all of you think?
It is perfectly correct to discuss cases (as you correctly point out, a case is a subset of the function's domain) and bounds on the runtime of algorithms in those cases (Omega, Oh or Theta). Whether or not it's useful is a harder question and one that is very situation-dependent. I'd generally think that Omega-bounds on the best case, Oh-bounds on the worst case and Theta bounds (on the "universal" case of all inputs, when such a bound exists) are the most "useful". But calling the subset of inputs where the first elements of each collection are the same the "best case" seems like reasonable usage. The "best case" for bubble sort is the subset of inputs which are pre-sorted arrays, and it is bounded by O(n), better than unmodified merge sort's best-case bound.
Fundamentally, big-O notation is a way of talking about how some quantity scales. In CS we so often see it used for talking about algorithm runtimes that we forget that all the following are perfectly legitimate use cases for it:
The area of a circle of radius r is O(r^2).
The volume of a sphere of radius r is O(r^3).
The expected number of 2’s showing after rolling n dice is O(n).
The minimum number of 2’s showing after rolling n dice is O(1).
The number of strings made of n pairs of balanced parentheses is O(4^n).
With that in mind, it’s perfectly reasonable to use big-O notation to talk about how the behavior of an algorithm in a specific family of cases will scale. For example, we could say the following:
The time required to sort a sequence of n already-sorted elements with insertion sort is O(n). (There are other, non-sorted sequences where it takes longer.)
The time required to run a breadth-first search of a tree with n nodes is O(n). (In a general graph, this could be larger if the number of edges were larger.)
The time required to insert n items in sorted order into an initially empty binary heap is O(n). (This can be Θ(n log n) for a sequence of elements that is reverse-sorted.)
In short, it’s perfectly fine to use asymptotic notation to talk about restricted subcases of an algorithm. More generally, it’s fine to use asymptotic notation to describe how things grow as long as you’re precise about what you’re quantifying. See @Patrick87’s answer for more about whether to use O, Θ, Ω, etc. when doing so.
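As a small illustration of bounding one family of inputs separately, the sketch below counts the comparisons insertion sort makes on already-sorted versus reverse-sorted input. The function name `insertion_sort_comparisons` is made up for this example.

```python
def insertion_sort_comparisons(items):
    """Sort a copy of items with insertion sort; return (sorted_list, comparisons)."""
    a = list(items)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if a[j] <= key:      # key is already in place relative to a[j]
                break
            a[j + 1] = a[j]      # shift the larger element one slot right
            j -= 1
        a[j + 1] = key
    return a, comparisons

n = 100
_, best = insertion_sort_comparisons(range(n))          # already sorted
_, worst = insertion_sort_comparisons(range(n, 0, -1))  # reverse sorted
print(best, worst)  # n - 1 versus n * (n - 1) / 2
```

On the sorted family the count grows linearly in n, while on the reversed family it grows quadratically, so each family has its own asymptotic bound.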

Is there a more efficient search factor than midpoint for binary search?

The naive binary search is a very efficient algorithm: you take the midpoint of your high and low points in a sorted array and adjust your high or low point accordingly. Then you recalculate your endpoint and iterate until you find your target value (or you don't, of course.)
Now, quite clearly, if you don't use the midpoint, you introduce some risk to the system. Let's say you shift your search target away from the midpoint and you create two sides - I'll call them a big side and small side. (It doesn't matter whether the shift is toward high or low, because it would be symmetrical.) The risk is that if you miss, your search space is bigger than it would be: you've got to search the big side which is bigger. But the reward is that if you hit your search space is smaller.
It occurs to me that the number of spaces being risked vs rewarded is the same, and (without patterns, which I'm assuming there are none) the likelihood of an element being higher and lower than the midpoint is equal. So the risk is that it falls between the new target and the midpoint.
Now because the number of spaces affects the search space, and the search space is measured logarithmically, it seems to me if I used, let's say, 1/4 and 3/4 for our search spaces, I've cut the log of the small space in half, where the large space has only gone up by about .6 or .7.
So with all this in mind: is there a more efficient way of performing a binary search than just using the midpoint?
Let's agree that the search key is equally likely to be at any position in the array; otherwise, we'd want to design an algorithm based on our special knowledge of the location. So all we can choose is where to split the array each time. If we choose a number 0 < x < 1 and split the array there, the chance that it's on the left is x and the chance that it's on the right is 1-x. In the first case we shorten the array by a factor of x and in the second by a factor of 1-x. If we did this many times we'd have a product of many of these factors, and so the 'right' average to use here is the geometric mean. In that sense, the average decrease per step is x with weight x and 1-x with weight 1-x, for a total of x^x * (1-x)^(1-x).
So when is this minimized? If this were the math stackexchange, we'd take derivatives (with the product rule, chain rule, and exponent rule), set them to zero, and solve. But this is stackoverflow, so instead we graph it:
You can see that the further you get from 1/2, the worse you get. For a better understanding I recommend information theory or calculus which have interesting and complementary perspectives on this.
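The geometric-mean factor x^x * (1-x)^(1-x) can also be tabulated directly. This is a rough sketch: `shrink_factor` and `expected_steps` are names invented here, and the step estimate simply assumes the array shrinks by that average factor on every probe.

```python
import math

def shrink_factor(x):
    """Geometric-mean shrink per step when splitting at fraction x (0 < x < 1)."""
    return x ** x * (1 - x) ** (1 - x)

def expected_steps(n, x):
    """Rough number of steps to shrink n elements to 1 at that average rate."""
    return math.log(n) / -math.log(shrink_factor(x))

for x in (0.1, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9):
    print(f"x={x:.2f}  factor={shrink_factor(x):.4f}  "
          f"steps for n=2^20: {expected_steps(2 ** 20, x):.1f}")
```

The factor is smallest at x = 0.5, where it equals 0.5, and it grows symmetrically as the split point moves toward either end.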

Equality of two algorithms

Consider a tree of depth B (i.e.: all the paths have length B) whose nodes represent system states and edges represent actions.
Each action a in ActionSet has a gain and makes the system move from a state to another.
Performing the sequence of actions A-B-C or C-B-A (or any other permutation of these actions) yields the same gain. Moreover:
the higher the number of actions performed before a, the lower the increase of total gain when a is asked
the gain achieved by each path cannot be greater than a quantity H, i.e.: some paths may achieve a gain that is lower than H, but whenever performing an action makes the total gain equal to H, all the other actions performed from that point on will gain 0
what is gained by the sequence of actions #b,h,j, ..., a# is g(a) (0 <= g(a) <= H)
once an action has been performed on a path from the root to a leaf, it cannot be performed a second time on the same path
Application of Algorithm1. I apply the following algorithm (A*-like):
Start from the root.
Expand the first level of the tree, which will contain all the actions in ActionSet. Each expanded action a has gain f(a) = g(a) + h(a), where g(a) is defined as stated before and h(a) is an estimate of what will be earned by performing other B-1 actions
Select the action a* that maximizes f(a)
Expand the children of a*
Iterate 2-3 until an entire path of B actions from the root to a leaf that guarantees the highest f(n) is visited. Notice that the newly selected action can also be chosen from the nodes which were abandoned at previous levels. E.g., if after expanding a* the node maximizing f(a) is a child of the root, it is selected as the new best node
Application of Algorithm2. Now, suppose I have a greedy algorithm that looks only to the g(n) component of the knowledge-plus-heuristic function f(n), i.e., this algorithm chooses actions according to the gain that has been already earned:
at the first step I choose the action a maximizing the gain g(a)
at the second step I choose the action b maximizing the gain g(b)
Claim. Experiments showed me that the two algorithms produce the same result, possibly in a different order (e.g., the first one suggests the sequence A-B-C and the second one suggests B-C-A).
However, I didn't succeed in understanding why.
My question is: is there a formal way of proving that the two algorithms return the same result, although mixed in some cases?
Thank you.
A* search will return the optimal path. From what I understand of the problem, your greedy search is simply performing Bayes calculations and will continue to do so until it finds an optimal set of nodes to take. Since the order of the nodes does not matter, the two should return the same set of nodes, albeit in different orders.
I think this is correct assuming you have the same set of actions you can perform from every node.

Building ranking with genetic algorithm

Question after BIG edit:
I need to build a ranking using a genetic algorithm. I have data like this:
P(a>b)=0.9
P(b>c)=0.7
P(c>d)=0.8
P(b>d)=0.3
Now, let's interpret a,b,c,d as names of football teams, and P(x>y) as the probability that x beats y. We want to build a ranking of the teams, but some observations are missing: P(a>d) and P(a>c) are unknown because there were no matches between a and d or between a and c.
The goal is to find the ordering of team names that best describes the current situation in that four-team league.
If we have only 4 teams then the solution is straightforward: first we compute the probabilities for all 4!=24 orderings of the four teams, ignoring the missing values:
P(abcd)=P(a>b)P(b>c)P(c>d)P(b>d)
P(abdc)=P(a>b)P(b>c)(1-P(c>d))P(b>d)
...
P(dcba)=(1-P(a>b))(1-P(b>c))(1-P(c>d))(1-P(b>d))
and we choose the ranking with highest probability. I don't want to use any other fitness function.
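For four teams, the exhaustive evaluation described above can be sketched as follows. The dictionary layout and the name `ordering_probability` are assumptions for this illustration; the same function could serve as the fitness function inside a genetic algorithm.

```python
from itertools import permutations
from math import prod

# Known pairwise probabilities from the question: (x, y) -> P(x > y).
PAIRWISE = {("a", "b"): 0.9, ("b", "c"): 0.7, ("c", "d"): 0.8, ("b", "d"): 0.3}

def ordering_probability(order, pairwise):
    """Multiply P(x>y) if x is ranked before y, else 1 - P(x>y); skip unknown pairs."""
    position = {team: i for i, team in enumerate(order)}
    return prod(p if position[x] < position[y] else 1.0 - p
                for (x, y), p in pairwise.items())

# Score all 4! = 24 orderings and list the top few.
scores = {"".join(order): ordering_probability(order, PAIRWISE)
          for order in permutations("abcd")}
for order, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(order, round(score, 4))
```

With this data several orderings tie for the highest score, so a search procedure may return any of them.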
My question :
As the number of permutations of n elements is n!, calculating the probabilities for all orderings is infeasible for large n (my n is about 40). I want to use a genetic algorithm for that problem.
Mutation operator is simple switching of places of two (or more) elements of ranking.
But how to make crossover of two orderings ?
Could P(abcd) be interpreted as the cost function of path 'abcd' in an asymmetric TSP, where the cost of travelling from x to y differs from the cost of travelling from y to x, with P(x>y)=1-P(y>x)? There are so many crossover operators for the TSP, but I think I have to design my own crossover operator, because my problem is slightly different from the TSP. Do you have any ideas for a solution or a framework for conceptual analysis?
The easiest way, on a conceptual and implementation level, is to use a crossover operator that exchanges suborderings between two solutions:
CrossOver(ABcD,AcDB) = AcBD
For a random subset of elements (in this case 'a,b,d', shown in capital letters), we copy the first ordering's subordering of those elements into the positions they occupy in the second ordering.
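A minimal sketch of this subordering exchange follows; the function name `subordering_crossover` is invented here, and how the subset is chosen (for example with `random.sample`) is left to the caller.

```python
def subordering_crossover(parent1, parent2, subset):
    """Return parent2 with the elements of `subset` rearranged into parent1's order."""
    donor = iter(x for x in parent1 if x in subset)  # subset in parent1's order
    # Positions holding subset elements take the next donor value; others are kept.
    return [next(donor) if x in subset else x for x in parent2]

child = subordering_crossover("abcd", "acdb", {"a", "b", "d"})
print("".join(child))  # reproduces CrossOver(ABcD, AcDB) from the text
```

Because only the chosen elements are permuted among the positions they already occupy, the child is always a valid permutation.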
Edit: an asymmetric TSP could be turned into a symmetric TSP, but with forbidden suborderings, which makes the GA approach unsuitable.
It's definitely an interesting problem, and it seems most of the answers and comments have focused on the semantic aspects of the problem (i.e., the meaning of the fitness function, etc.).
I'll chip in some information about the syntactic elements -- how do you do crossover and/or mutation in ways that make sense. Obviously, as you noted with the parallel to the TSP, you have a permutation problem. So if you want to use a GA, the natural representation of candidate solutions is simply an ordered list of your points, careful to avoid repetition -- that is, a permutation.
TSP is one such permutation problem, and there are a number of crossover operators (e.g., Edge Assembly Crossover) that you can take from TSP algorithms and use directly. However, I think you'll have problems with that approach. Basically, the problem is this: in TSP, the important quality of solutions is adjacency. That is, abcd has the same fitness as cdab, because it's the same tour, just starting and ending at a different city. In your example, absolute position is much more important than this notion of relative position. abcd means in a sense that a is the best point -- it's important that it came first in the list.
The key thing you have to do to get an effective crossover operator is to account for what the properties are in the parents that make them good, and try to extract and combine exactly those properties. Nick Radcliffe called this "respectful recombination" (note that paper is quite old, and the theory is now understood a bit differently, but the principle is sound). Taking a TSP-designed operator and applying it to your problem will end up producing offspring that try to conserve irrelevant information from the parents.
You ideally need an operator that attempts to preserve absolute position in the string. The best one I know of offhand is known as Cycle Crossover (CX). I'm missing a good reference off the top of my head, but I can point you to some code where I implemented it as part of my graduate work. The basic idea of CX is fairly complicated to describe, and much easier to see in action. Take the following two points:
abcdefgh
cfhgedba
Pick a starting point in parent 1 at random. For simplicity, I'll just start at position 0 with the "a".
Now drop straight down into parent 2, and observe the value there (in this case, "c").
Now search for "c" in parent 1. We find it at position 2.
Now drop straight down again, and observe the "h" in parent 2, position 2.
Again, search for this "h" in parent 1, found at position 7.
Drop straight down and observe the "a" in parent 2.
At this point note that if we search for "a" in parent one, we reach a position where we've already been. Continuing past that will just cycle. In fact, we call the sequence of positions we visited (0, 2, 7) a "cycle". Note that we can simply exchange the values at these positions between the parents as a group and both parents will retain the permutation property, because we have the same three values at each position in the cycle for both parents, just in different orders.
Make the swap of the positions included in the cycle.
Note that this is only one cycle. You then repeat this process starting from a new (unvisited) position each time until all positions have been included in a cycle. After the one iteration described in the above steps, you get the following strings (where an "X" denotes a position in the cycle where the values were swapped between the parents).
cbhdefga
afcgedbh
X X    X
Just keep finding and swapping cycles until you're done.
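The walkthrough above can be sketched in a few lines. This is not the code from the author's framework; `find_cycle` and `swap_cycle` are hypothetical helper names that cover a single cycle only.

```python
def find_cycle(parent1, parent2, start):
    """Follow 'drop down, then look up in parent 1' until the start position recurs."""
    index_in_p1 = {value: i for i, value in enumerate(parent1)}
    cycle = []
    pos = start
    while pos not in cycle:
        cycle.append(pos)
        pos = index_in_p1[parent2[pos]]  # where parent 2's value sits in parent 1
    return cycle

def swap_cycle(parent1, parent2, cycle):
    """Exchange the values at the cycle's positions between the two parents."""
    child1, child2 = list(parent1), list(parent2)
    for i in cycle:
        child1[i], child2[i] = parent2[i], parent1[i]
    return child1, child2

p1, p2 = "abcdefgh", "cfhgedba"
cycle = find_cycle(p1, p2, 0)
c1, c2 = swap_cycle(p1, p2, cycle)
print(cycle, "".join(c1), "".join(c2))
```

Starting at position 0 this finds the cycle (0, 2, 7) and yields the two strings shown above. Note that swapping every cycle would simply exchange the two parents, so a full operator needs some rule for which cycles to swap (alternating cycles is a common choice); that rule is left out of this sketch.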
The code I linked from my github account is going to be tightly bound to my own metaheuristics framework, but I think it's a reasonably easy task to pull the basic algorithm out from the code and adapt it for your own system.
Note that you can potentially gain quite a lot from doing something more customized to your particular domain. I think something like CX will make a better black box algorithm than something based on a TSP operator, but black boxes are usually a last resort. Other people's suggestions might lead you to a better overall algorithm.
I've worked on a somewhat similar ranking problem and followed a technique similar to what I describe below. Does this work for you:
Assume the unknown value of an object diverges from your estimate via some distribution, say, the normal distribution. Interpret your ranking statements such as a > b, 0.9 as the statement "The value a lies at the 90th percentile of the distribution centered on b".
For every statement:
    def realArrival = calculate a's location on a distribution centered on b
    def arrivalGap = | realArrival - expectedArrival |
def fitness = Σ arrivalGap
The goal is to minimize fitness.
FWIW, my problem was actually a bin-packing problem, where the equivalent of your "rank" statements were user-provided rankings (1, 2, 3, etc.). So not quite TSP, but NP-Hard. OTOH, bin-packing has a pseudo-polynomial solution proportional to accepted error, which is what I eventually used. I'm not quite sure that would work with your probabilistic ranking statements.
What an interesting problem! If I understand it, what you're really asking is:
"Given a weighted, directed graph, with each edge-weight in the graph representing the probability that the arc is drawn in the correct direction, return the complete sequence of nodes with maximum probability of being a topological sort of the graph."
So if your graph has N edges, there are 2^N graphs of varying likelihood, with some orderings appearing in more than one graph.
I don't know if this will help (very brief Google searches did not enlighten me, but maybe you'll have more success with more perseverance) but my thoughts are that looking for "topological sort" in conjunction with any of "probabilistic", "random", "noise," or "error" (because the edge weights can be considered as a reliability factor) might be helpful.
I strongly question your assertion, in your example, that P(a>c) is not needed, though. You know your application space best, but it seems to me that specifying P(a>c) = 0.99 will give a different fitness for f(abc) than specifying P(a>c) = 0.01.
You might want to throw in "Bayesian" as well, since you might be able to start to infer values for (in your example) P(a>c) given your conditions and hypothetical solutions. The problem is, "topological sort" and "Bayesian" is going to give you a whole bunch of hits related to Markov chains and Markov decision problems, which may or may not be helpful.

Find minimal path in a tree with multivalued nodes

My math classes are far behind, and I'm currently struggling to find a decent solution to a problem I'm having: I have a tree, in which nodes are actions, and are "weighted" according to multiple criteria : the cost of said action, the time it will take, the necessary resources, the disturbance, etc...
And I want to find in this tree the path that minimizes both the cost AND the time for example, or the disturbance AND cost AND time, etc. My problem is that I have no idea how to do it, except by coming up with a global cost function F(cost, time, resources,...), and applying a regular tree traversal algorithm using the result from F(...) as my only weight.
But then, how do I come up with F ? Something like "F(cost, time, resources) = a * cost + b * time + c * resources" feels very "unprofessional"...
Edit:
I wanted to avoid the word "summing" as I wasn't sure it was really the way to go, but in essence, that's what I'm doing: computing a total cost for each "path" or "branch" that goes from that top node to one of the leaves, and choosing the "path" or "branch" that minimizes the cost. The problem being that each node has a weight based on the time necessary, on financial cost, on resource usage, etc.
So it seems inevitable to have to come up with a formula, as Stephan says, that will reduce all these parameters to one global cost, per node, which I can then sum across nodes as I travel down the tree, and pick the path that minimizes the total cost.
So I guess my question really is, is there a methodology to choose that function?
Thanks for your answers and comments, it's starting to be a bit more clear in my head now.
Let's say we have four pairs (x, y) like (1, 4), (1, 5), (2, 3), (3, 3). Now you want to minimize "both x and y". What do you mean? If you minimize first x and then y you end up with (1, 4). If you minimize first y and then x you find (2, 3).
Unless you choose a global cost function F(x, y), like in your observation, I can't see any meaning of "both". (Anyway, once F is chosen, there may still be multiple minimum points.) By the way, in my opinion a linear combination (the positive multipliers a,b,c being understood as "weights") is not "unprofessional" at all, at least if you have no idea of what a more suitable cost function could be.
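The four pairs above make the point easy to check. The sketch below uses Python's tuple ordering for the two lexicographic readings and an illustrative weighted sum; the weights are arbitrary values chosen for this example.

```python
pairs = [(1, 4), (1, 5), (2, 3), (3, 3)]

# Minimize x first, then y: plain tuple comparison is lexicographic.
x_then_y = min(pairs)

# Minimize y first, then x: compare on the swapped pair.
y_then_x = min(pairs, key=lambda p: (p[1], p[0]))

def weighted(a, b):
    """Return the pair minimizing the linear cost F(x, y) = a*x + b*y."""
    return min(pairs, key=lambda p: a * p[0] + b * p[1])

print(x_then_y, y_then_x)              # the two lexicographic answers differ
print(weighted(2, 1), weighted(1, 2))  # the weights decide which pair wins
```

Each choice of ordering or weights is a different definition of "both", which is why some F has to be fixed before any search can run.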
Edit:
So it seems inevitable to have to come up with a formula, as Stephan says, that will reduce all these parameters to one global cost, per node, which I can then sum across nodes as I travel down the tree, and pick the path that minimizes the total cost.
Caution. Indeed this strategy makes sense only if F is linear. Surely cost, time, resources etc. are additive functions, in the sense that time(node1 -> node2 -> node3) = time(node1) + time(node2) + time(node3), but in general this is not the case for F, unless it is linear. (i.e. F(cost(node1 -> node2)) = F(cost(node1) + cost(node2)) != F(cost(node1)) + F(cost(node2)). )
If you choose a nonlinear global cost function, the right strategy is to compute, for each node, the total cost, total time, total resources from the root to that node, and compute (then minimize) F only for the terminal nodes.
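A minimal sketch of that strategy, under assumptions made up for this example: a `Node` type with only `cost` and `time`, a tiny three-node tree, and a product as the nonlinear F.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    cost: float
    time: float
    children: tuple = ()

def leaf_totals(node, cost=0.0, time=0.0, path=()):
    """Yield (path, total_cost, total_time) for every root-to-leaf path."""
    cost, time, path = cost + node.cost, time + node.time, path + (node.name,)
    if not node.children:
        yield path, cost, time
    for child in node.children:
        yield from leaf_totals(child, cost, time, path)

def best_path(root, F):
    """Apply F once per leaf, to the accumulated totals, and take the minimum."""
    return min(leaf_totals(root), key=lambda t: F(t[1], t[2]))[0]

tree = Node("root", 0, 0, (
    Node("A", 1, 9),
    Node("B", 2, 2, (Node("B1", 2, 2),)),
))

print(best_path(tree, lambda c, t: c + t))  # linear F
print(best_path(tree, lambda c, t: c * t))  # nonlinear F on the path totals
```

Here the linear F prefers the path through B (totals 4 and 4, F = 8) over A (F = 10), while the product prefers A (1 * 9 = 9) over B (4 * 4 = 16). Summing the product node by node would give B a score of 2*2 + 2*2 = 8 and pick the wrong path, which is the caution stated above.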
Coming up with F is the most important thing. If I can give you 6 cost and 5 time or 5 cost and 6 time, which do you prefer? Your cost function needs to take that into consideration. There's no algorithm that's going to solve that problem for you, unfortunately. I denied that for a week before I sat down and wrote F for an optimization application I was working on. Worst case, leave parameters for the user to tinker with.
Why wouldn't a normal graph search algorithm like A* work?
For the path-cost function, you could use the running sum of the relevant criteria. The distance to the goal is more difficult...
It could be the distance to the nearest leaf, pre-computed for all or some nodes, although that sounds awfully expensive. Depending on the structure of your tree, you could come up with a cheaper under-estimate - if it's perfectly balanced, for example, it's trivial.