Genetic Algorithm: Need some clarification on selection and what to do when crossover doesn't happen - optimization

I'm writing a genetic algorithm to minimize a function. I have two questions: one about selection, and the other about what to do when crossover doesn't happen.
Here's an outline of what I'm doing:
while (number of new population < current population)
    # Evaluate all fitnesses and give them a rank. Choose individual based on rank (roulette wheel) to get first parent.
    # Do it again to get second parent, ensuring parent1 =/= parent2
    # Elitism (do only once): choose the fittest individual and immediately copy to new generation
    Multi-point crossover: 50% chance
    if (crossover happened)
        do single point mutation on child (0.75%)
    else
        pick random individual to be copied into new population.
    end
end
And all of this is under another while loop which tracks fitness progression and number of iterations, which I didn't include. So, my questions:
1. As you can see, two parents are chosen randomly in each iteration until the new population is filled up. So, the same two parents may mate more than once, and surely several fit parents will mate many more times than once. Is this in any way bad?
2. In the obitko tutorial, it says if crossover doesn't happen, then child is exact copy of parents. I don't even understand what that means, so, as you can see, I just picked a random parent (uniformly; no fitness considered) and copied it to the new population. This seems weird to me. Whether I actually do this or not, my results really don't change that much. What's the proper way to handle the case when crossover doesn't happen?

Some parents having several offspring is common; I'd even say this is the default practice (and consider biological evolution, where precisely this is one of the main ingredients).
"If crossover doesn't happen, then child is exact copy of parents"
That is a bit confusing. Crossover (well explained in your link) means taking some genes from one parent and some from the other. This is called sexual reproduction and requires two (or more?) parents.
But asexual reproduction is also possible. In this case, you simply take one parent and mutate its genome in the new individual. This is almost what you were attempting, but you are missing the important mutation step (note mutations can be very aggressive or very conservative!)
Note that asexual reproduction requires mutation after copying the genome to create diversity, while in sexual reproduction this is an optional step.
It is fine to use either type of reproduction, or a mix of them. By the way: in some problems genes might not always have the same size. Sexual reproduction is problematic in this case. If you are interested in this problem, take a look at the NEAT algorithm, a popular neuroevolution algorithm designed to address this (wiki and paper).
Finally, elitism (copying the best-performing individuals to the next generation) is common, but it may be problematic. Genetic algorithms often stall in sub-optimal solutions (called local optima, where any small change makes the fitness worse). Elitism can contribute to this problem. Of course, the opposite problem is too much diversity, which makes the search similar to random search, so you need to find the right balance.
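To make the pieces above concrete, here is a minimal sketch of one generation for a minimization problem. It is only an illustration under assumptions that are not in the question: genomes are lists of floats with at least two genes, crossover is one-point rather than multi-point, mutation is Gaussian, and all rates and function names are made up. When crossover does not happen, the child is a copy of a parent that was already chosen by fitness-based selection, and it is still mutated. Unlike the outline in the question, this sketch does not force the two parents to differ.

```python
import random

def rank_roulette(ranked, rng):
    """Pick one genome from a best-first list; the best rank gets weight n, the worst gets 1."""
    n = len(ranked)
    weights = [n - i for i in range(n)]
    return rng.choices(ranked, weights=weights, k=1)[0]

def one_point_crossover(a, b, rng):
    """Take the head of a and the tail of b (assumes len(a) == len(b) >= 2)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome, rate, sigma, rng):
    """Perturb each gene with probability `rate` (Gaussian noise is an assumption)."""
    return [g + rng.gauss(0, sigma) if rng.random() < rate else g for g in genome]

def next_generation(population, cost, rng, p_cross=0.5, p_mut=0.1, sigma=0.3):
    ranked = sorted(population, key=cost)      # lowest cost first (minimization)
    new_pop = [list(ranked[0])]                # elitism: done once per generation
    while len(new_pop) < len(population):
        p1 = rank_roulette(ranked, rng)
        p2 = rank_roulette(ranked, rng)
        if rng.random() < p_cross:
            child = one_point_crossover(p1, p2, rng)
        else:
            child = list(p1)                   # no crossover: copy a selected parent
        new_pop.append(mutate(child, p_mut, sigma, rng))
    return new_pop
```

Because the elite individual is copied unchanged, the best cost in the population can never get worse from one generation to the next in this sketch.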

I don't see anything wrong with the same individual being the parent of more than one child per generation. It will only affect your diversity a little. If you don't like this, or find a real lack of diversity in the final generations, you can flag the individual so it cannot be a parent more than once per generation.
I don't fully agree with the tutorial. I think that once you have selected the individuals that will become parents (based on their fitness, of course), you should always perform the crossover. Otherwise you will be cloning a lot of individuals into the next generation.


Party people come and go: How to use Bayes to know who is in or out?

story/problem
Imagine there are N party guests and a bouncer guarding the only door. Before the party starts, all the guests are outside (that's for sure). However, once the party starts, people come and go. Each time such an event occurs, the bouncer makes a note for each potential guest of the likelihood that it could have been him or her. One could call this score the bouncer's classification confidence. For each event, this is a list of N candidates that adds up to one. All in all, T events have been observed until the next morning.
Unfortunately, some valuables were stolen that night. To narrow down the group of suspects, the host checked the bouncer's notes. However, he soon found contradictory and therefore unreliable data: for example, according to the notes, the same person entered with high confidence twice in a row, which we know is impossible. He therefore wants to improve the quality of the classifications by cleaning the data of such contradictions.
where I am stuck/what I tried
First I solved this by formulating a linear program and solving it in Python. However, as the number of guests increases, this soon becomes computationally infeasible. Therefore I want to use Bayes' theorem to compute the probability of each guest's presence.
Since we have prior beliefs about whether each guest is present, plus repeated information updates, I felt that a Bayesian approach would be the natural way to tackle this problem.
Even though this approach seemed intuitive to me, I faced some problems:
(1) The problem of being 100% sure of the first state
Being 100% sure of a person being outside at time t=0 seems to "break" the Bayesian formula. Let A be person A and data be the information recorded by the bouncer.
If the prior P(A present) is either 0 or 1, the formula collapses to 0 or 1 respectively. Am I wrong here?
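For reference, this is the update being described, written out for the binary hypothesis "A is present" versus "A is not present". The notation is my own, not the question's.

```latex
P(A \mid \text{data})
  = \frac{P(\text{data} \mid A)\, P(A)}
         {P(\text{data} \mid A)\, P(A) + P(\text{data} \mid \neg A)\,\bigl(1 - P(A)\bigr)}
```

Setting P(A) = 0 makes the numerator zero, so the posterior is 0 (as long as P(data | not A) > 0). Setting P(A) = 1 makes the second term of the denominator vanish, so the posterior is 1 (as long as P(data | A) > 0). So the observation is correct for this static form of the formula: a prior of exactly 0 or 1 cannot be moved by any data.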
(2) The problem of using all available information
I did not find a way to use not only the data observed before time step t but also the information gathered later. E.g. given three candidates, the bouncer makes three observations with confidence c_t:
c^{out->in}_1 = (0.5, 0.4, 0.1)^T,
c^{out->in}_2 = (0.4, 0.4, 0.2)^T,
c^{out->in}_3 = (0.9, 0.05, 0.05)^T.
Using only the information that was available until t, the most plausible solution would be: (A entered, B entered, C entered). However, using all the information available, this is a better solution: (B entered, C entered, A entered).
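The two orderings can be checked with a small brute-force sketch. It rests on assumptions of mine: all three events are entries, each guest can enter only once, and an assignment is scored by the product of the confidences (i.e. the events are treated as independent). The function names are illustrative only.

```python
from itertools import permutations
from math import prod

GUESTS = ("A", "B", "C")
# One confidence vector per "out -> in" event, in the order (A, B, C).
CONFIDENCES = [(0.5, 0.4, 0.1), (0.4, 0.4, 0.2), (0.9, 0.05, 0.05)]

def greedy_assignment(confidences):
    """At each event, pick the most likely guest who is not already inside."""
    inside, order = set(), []
    for scores in confidences:
        candidates = [g for g in range(len(scores)) if g not in inside]
        pick = max(candidates, key=lambda g: scores[g])
        inside.add(pick)
        order.append(pick)
    return order

def best_assignment(confidences):
    """Try every one-guest-per-event assignment and keep the highest product."""
    n = len(confidences)
    return max(
        permutations(range(n)),
        key=lambda perm: prod(confidences[t][g] for t, g in enumerate(perm)),
    )

print([GUESTS[g] for g in greedy_assignment(CONFIDENCES)])  # ['A', 'B', 'C']
print([GUESTS[g] for g in best_assignment(CONFIDENCES)])    # ['B', 'C', 'A']
```

The greedy ordering scores 0.5 * 0.4 * 0.05 = 0.01, while the ordering that uses all three observations scores 0.4 * 0.2 * 0.9 = 0.072. Enumerating permutations is only feasible for a handful of guests.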
(3) The problem of noisy observations.
I was thinking of using a Bernoulli distribution as prior, however, I do not observe binary events but confidences.
Since I'm stuck, I'm looking forward to your help on Bayesian reasoning and ultimately finding the thief.

Determining hopeless branches early in branch-and-bound algorithms

I have to design a branch-and-bound algorithm that finds the optimal tour of a graph on the Cartesian plane every time. I have been given the hint that identifying hopeless branches earlier in the runtime will compound into a program that runs "a hundred times faster". I had the idea of assuming that the shortest edge connected to the starting/ending node will be either the first or last edge in the tour, but a thin diamond-shaped graph proves otherwise. Does anyone have ideas for how to eliminate these hopeless branches, or a reference that talks about this?
Basically, is there a better way to branch into subsets of solutions than just lexicographically, e.g. the first branch includes or excludes edge a-b, the second branch includes or excludes edge a-c?
So somewhere in your branch-and-bound algorithm, you look at possible places to go, and then somehow keep track of them to do later.
To make this more efficient, you can do a couple things:
Write a better bound calculator. In other words, come up with an algorithm that determines the bound more accurately. This will result in less time spent on paths that turn out to be poor.
Instead of using a stack (depth-first) or a plain queue (breadth-first) to keep track of things to do, use a priority queue (heap) ordered by bound, i.e. the things that seem best are put at the top of the heap, and the things that seem bad are put at the bottom.
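Both suggestions can be sketched together as follows. This is an illustration, not a reference implementation: the lower bound (cost so far plus the cheapest edge leaving the current city and every unvisited city) is one simple choice of mine, and tighter bounds exist. It assumes the points are 2D tuples and needs Python 3.8+ for `math.dist`.

```python
import heapq
import math

def tsp_branch_and_bound(points):
    """Best-first branch and bound for a closed tour starting at city 0."""
    n = len(points)
    if n < 2:
        return 0.0, list(range(n))
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Cheapest edge leaving each city: every city still to be left costs at least this.
    cheapest_out = [min(dist[i][j] for j in range(n) if j != i) for i in range(n)]

    def lower_bound(cost, current, visited):
        rest = sum(cheapest_out[j] for j in range(n) if j not in visited)
        return cost + cheapest_out[current] + rest

    best_cost, best_tour = math.inf, None
    heap = [(lower_bound(0.0, 0, {0}), 0.0, [0])]   # entries: (bound, cost so far, path)
    while heap:
        bound, cost, path = heapq.heappop(heap)
        if bound >= best_cost:
            break          # the heap is ordered by bound, so nothing left can improve
        current = path[-1]
        if len(path) == n:
            total = cost + dist[current][0]          # close the tour
            if total < best_cost:
                best_cost, best_tour = total, path
            continue
        visited = set(path)
        for nxt in range(n):
            if nxt in visited:
                continue
            new_cost = cost + dist[current][nxt]
            new_bound = lower_bound(new_cost, nxt, visited | {nxt})
            if new_bound < best_cost:                # prune hopeless branches early
                heapq.heappush(heap, (new_bound, new_cost, path + [nxt]))
    return best_cost, best_tour
```

Seeding `best_cost` with the length of any quick heuristic tour instead of infinity would let the pruning test start cutting branches from the first iteration.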
Nearest-neighbor is a simple greedy heuristic. Branch-and-bound is essentially an optimizing loop, and in addition you need a sub-problem solver that supplies the bounds. A nearest-neighbor tour is not guaranteed to be optimal, but it can serve as a quick initial upper bound. For stronger bounds I would look into the simplex algorithm, which is a linear programming algorithm, and also into cutting-plane algorithms for solving the TSP.

Generating a sudoku of a desired difficulty?

So, I've done a fair bit of reading into generation of a Sudoku puzzle. From what I can tell, the standard way to have a Sudoku puzzle of a desired difficulty is to generate a puzzle, and then grade it afterwards, and repeat until you have one of an acceptable rating. This can be refined by generating via backtracking using some of the more complex solving patterns (XY-wing, swordfish, etc.), but that's not quite what I'm wanting to do here.
What I want to do, but have been unable to find any real resource on, is generate a puzzle from a "difficulty value" (0-1.0 value, 0 being the easiest, and 1.0 being the hardest).
For example, I want to create a moderately difficult puzzle, so the value .675 is selected. Now using that value I want to be able to generate a moderately difficult puzzle.
Anyone know of something like this? Or perhaps something with a similar methodology?
Adding another answer for generating a sudoku of desired difficulty on-the-fly.
This means that unlike other approaches the algorithm runs only once and returns a sudoku configuration matching the desired difficulty (with high probability within a range or with probability=1)
Various solutions for generating (and rating) a sudoku difficulty have to do with human-based techniques and approaches, which can be easily rated.
Then one (after having generated a sudoku configuration) re-solves the sudoku with the human-like solver and depending on the techniques the solver used (e.g pairs, x-wing, swordfish etc.) a difficulty rate is also assigned.
Problems with this approach
(and requirements for the use case i had)
In order to generate a sudoku with given difficulty, with previous method one needs to solve a sudoku twice (once with the basic algorithm and once with the human-like solver).
One has to (pre-)generate many sudokus which can only be rated as to difficulty after being solved by the human-like solver. So one cannot generate a desired sudoku on-the-fly once.
The human-like solver can be complicated and in most cases (if not all) is tightly coupled to 9x9 sudoku grids. So no easy generalisation to other sudokus (e.g 4x4, 16x16, 6x6 etc.)
The difficulty rating of the human-like techniques is very subjective. For example, why is x-wing taken to be more difficult than hidden singles? (Personally, I have solved many difficult published sudoku puzzles manually and never used such techniques.)
Another approach was used which has the following benefits:
Generalises well to arbitrary sudokus (9x9, 4x4, 6x6, 16x16 etc..)
The sudoku configuration, with desired difficulty, is generated once and on-the-fly
The difficulty rating is objective.
How it works?
First of all, there is the simple fact that the more difficult a puzzle is, the more time it takes to solve.
But solving time is closely correlated with both the number of clues (givens) and the average number of alternatives to be investigated per empty cell.
Extending my previous answer, it was mentioned that for any sudoku puzzle the minimum number of clues is an objective property of the puzzle (for example for 9x9 grids the minimum number of clues for having a valid sudoku is 17)
One can start from there and compute minimum number of clues per difficulty level (linear correlation).
Furthermore at each step of the sudoku generation process, one can make sure the average alternatives (to be investigated) per empty cell is within given bounds (as a function of desired difficulty)
Depending on whether the algorithm uses backtrack or not (for the use case discussed the algorithm does no backtracking) the desired difficulty can be reached either with probability=1 or with high probability within bounds (respectively).
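As a rough illustration of the linear mapping from difficulty to number of clues for a 9x9 grid, something like the following could be used. The 17-clue minimum is from the text above; the clue count for the easiest level (`max_clues=40`) and the function name are assumptions of mine, not values taken from sudoku.js.

```python
def target_clue_count(difficulty, min_clues=17, max_clues=40):
    """Map a difficulty in [0, 1] linearly onto a number of givens.

    difficulty 1.0 -> min_clues (hardest), difficulty 0.0 -> max_clues (easiest).
    max_clues is an assumed tuning parameter.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be between 0 and 1")
    return round(max_clues - difficulty * (max_clues - min_clues))

print(target_clue_count(0.675))  # 24 with the assumed defaults
```

The bound on average alternatives per empty cell would be a second function of the same difficulty value, checked at each generation step.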
Tests of the sudokus generated with this algorithm and difficulty rating based on the previous approaches (human-like solver), show a correlation of desired and estimated difficulty rates, plus a greater ability for generalisation to arbitrary sudoku configurations.
(have used this online sudoku solver (and also this one) to correlate the difficulty rates of the test sudokus)
The code is available free on github sudoku.js (along with sample demo application), a scaled-down version of CrossWord.js a professional crossword builder in JavaScript, by same author
The sudoku difficulty is related in an interesting way to the (minimum) amount of information needed to specify a unique solution for a given grid.
Sounds like information theory, yes it has applications here too.
Sudoku puzzles should have a unique solution. Furthermore sudoku puzzles have certain symmetries, i.e by row, by column and by sub-square.
These symmetries specify the minimum number of clues (and their position more or less) needed so that the solution would be unique (i.e using a sudoku compiler or an algorithm like backtrack-search).
This would be the most difficult/hard sudoku puzzle level (i.e minimum needed number of clues). Then all other difficulty levels from less hard to easy are generated by allowing more clues than the minimum amount needed.
It should be noted that sudoku difficulty levels are not standard; as explained above, one can have as many or as few difficulty levels as one wants. What is standard is the minimum number (and position) of clues (which is the hardest level and which is related to the sudoku symmetries). One can then generate as many difficulty levels as one wants simply by allowing extra/redundant clues to be visible as well.
It's not as elegant as what you ask, but you can simulate this behavior with caching:
Decide how many "buckets" you want for puzzles. For example, let's say you choose 20. Thus, your buckets will contain puzzles of different difficulty ranges: 0-.05, .05-.1, .1-.15, .. , .9-.95, .95-1
Generate a puzzle
Grade the puzzle
Put it in the appropriate bucket (or throw it away when the bucket is full)
Repeat till your buckets are "filled". The size of the buckets and where they are stored will be based on the needs of your application.
Then when a user requests a certain difficulty puzzle, give them a cached one from the bucket they choose. You might also want to consider swapping numbers and changing orientation of puzzles with known difficulties to generate similar puzzles with the same level of difficulty. Then repeat the above as needed when you need to refill your buckets with new puzzles.
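The caching steps above might look roughly like this. The `generate` and `grade` callables are placeholders for whatever puzzle generator and grader you already have (both hypothetical here), and the bucket count, capacity and attempt limit are arbitrary.

```python
def bucket_index(difficulty, n_buckets):
    """Map a difficulty in [0, 1] to a bucket; 1.0 falls into the last bucket."""
    return min(int(difficulty * n_buckets), n_buckets - 1)

def fill_buckets(generate, grade, n_buckets=20, capacity=5, max_attempts=10_000):
    """Generate and grade puzzles until every bucket is full or attempts run out."""
    buckets = [[] for _ in range(n_buckets)]
    for _ in range(max_attempts):
        if all(len(b) >= capacity for b in buckets):
            break
        puzzle = generate()
        bucket = buckets[bucket_index(grade(puzzle), n_buckets)]
        if len(bucket) < capacity:      # otherwise throw the puzzle away
            bucket.append(puzzle)
    return buckets

def take_puzzle(buckets, difficulty):
    """Hand out a cached puzzle of the requested difficulty, or None if the bucket is empty."""
    bucket = buckets[bucket_index(difficulty, len(buckets))]
    return bucket.pop() if bucket else None
```

With 20 buckets, a request for difficulty .675 is served from the .65-.7 bucket. Rare difficulty ranges may never fill, which is why the loop is capped by `max_attempts`.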
Well, you can't know how complicated a puzzle is before you know how to solve it. And Sudoku solving on generalized grids (and therefore also the difficulty rating) is NP-complete, which means no polynomial-time algorithm is known, and it is most likely impossible to find an algorithm that is asymptotically faster than the proposed randomly-guess-and-check.
However, if you can find one, you have solved the P versus NP problem and should clear a cupboard for the Fields Medal... :)

Travelling Salesman and Map/Reduce: Abandon Channel

This is an academic rather than practical question. In the Traveling Salesman Problem, or any other which involves finding a minimum optimization ... if one were using a map/reduce approach it seems like there would be some value to having some means for the current minimum result to be broadcast to all of the computational nodes in some manner that allows them to abandon computations which exceed that.
In other words if we map the problem out we'd like each node to know when to give up on a given partial result before it's complete but when it's already exceeded some other solution.
One approach that comes immediately to mind would be if the reducer had a means to provide feedback to the mapper. Consider if we had 100 nodes, and millions of paths being fed to them by the mapper. If the reducer feeds the best result to the mapper, then that value could be included as an argument along with each new path (problem subset). In this approach the granularity is fairly rough ... the 100 nodes will each keep grinding away on their partition of the problem to completion and only get the new minimum with their next request from the mapper. (For a small number of nodes and a huge number of problem partitions/subsets to work across, this granularity would be inconsequential; also, it's likely that one could apply heuristics to the sequence in which the possible routes or problem subsets are fed to the nodes to get a rapid convergence towards the optimum and thus minimize the amount of "wasted" computation performed by the nodes.)
Another approach that comes to mind would be for the nodes to be actively subscribed to some sort of channel, or multicast or even broadcast from which they could glean new minimums from their computational loop. In that case they could immediately abandon a bad computation when notified of a better solution (by one of their peers).
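As a toy, single-process illustration of that second idea, the sketch below lets worker threads read a shared "current minimum" inside their loop and abandon a tour as soon as its partial cost can no longer beat it. This is not the API of any map/reduce framework; the class and function names are made up, and it assumes non-negative edge weights.

```python
import itertools
import math
import threading
from concurrent.futures import ThreadPoolExecutor

class SharedBound:
    """Best tour cost found so far, visible to all workers."""
    def __init__(self):
        self._lock = threading.Lock()
        self.value = math.inf

    def offer(self, candidate):
        with self._lock:
            if candidate < self.value:
                self.value = candidate

def tour_cost_or_abandon(tour, dist, bound):
    """Sum the tour edge by edge; give up once the partial sum reaches the bound."""
    total = 0.0
    for a, b in zip(tour, tour[1:] + tour[:1]):
        total += dist[a][b]
        if total >= bound.value:    # a stale read is harmless: the bound only decreases
            return None
    return total

def solve(dist, workers=4):
    bound = SharedBound()
    tours = [(0,) + p for p in itertools.permutations(range(1, len(dist)))]

    def work(tour):
        cost = tour_cost_or_abandon(tour, dist, bound)
        if cost is not None:
            bound.offer(cost)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, tours))
    return bound.value
```

In a real cluster the shared object would be replaced by whatever low-latency channel is available, and the same "stale reads are safe" argument would apply because the minimum only ever moves in one direction.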
So, my questions are:
Is this concept covered by any terms of art in relation to existing map/reduce discussions?
Do any of the current map/reduce frameworks provide features to support this sort of dynamic feedback?
Is there some flaw with this idea ... some reason why it's stupid?
That's a cool topic, and not much has been written about it before. So this is more of a brainstorming post than an answer to all your problems ;)
Every TSP instance can be expressed as a graph, which might look like this one (taken from the German Wikipedia):
Now you can run a graph algorithm on it. MapReduce can be used for graph processing quite well, although it has much overhead.
You need a paradigm that is called "Message Passing". It was described in this paper here: Paper.
I also blogged about it in terms of graph exploration; the post explains quite simply how it works. My Blogpost
This is the way how you can tell the mapper what is the current minimum result (maybe just for the vertex itself).
With all that knowledge in the back of your mind, it should be fairly straightforward to come up with a branch-and-bound algorithm (like the one you described) to get to the goal. Start at a random vertex and branch to every adjacent vertex. This causes a message to be sent to each of these adjacent vertices with the cost at which it can be reached from the start vertex (Map Step). The vertex itself only updates its cost if the new cost is lower than the currently stored cost (Reduce Step). Initially the stored cost should be set to infinity.
You're doing this over and over again until you've reached the start vertex again (obviously after you visited every other one). So you have to somehow keep track of the currently best way to reach a vertex, this can be stored in the vertex itself, too. And every now and then you have to bound this branching and cut off branches that are too costly, this can be done in the reduce step after reading the messages.
Basically this is just a mix of graph algorithms in MapReduce and a kind of shortest paths.
Note that this won't yield the optimal route between the nodes; it is still a heuristic. And you're just parallelizing an NP-hard problem.
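The map and reduce steps described above can be simulated in a few lines of plain Python. This only sketches the cost-propagation part (the shortest-paths flavour), not a full TSP solver, and it runs in one process rather than on a MapReduce framework. The graph format (a dict of dicts with non-negative edge weights) and the function name are assumptions of mine.

```python
import math
from collections import defaultdict

def message_passing_costs(graph, start):
    """graph[v] = {neighbour: edge_weight}; returns the cheapest known cost per vertex."""
    cost = {v: math.inf for v in graph}   # initially infinity everywhere
    cost[start] = 0.0
    active = {start}
    while active:
        # "Map" step: every vertex that changed sends its cost plus edge weight to neighbours.
        inbox = defaultdict(list)
        for v in active:
            for neighbour, weight in graph[v].items():
                inbox[neighbour].append(cost[v] + weight)
        # "Reduce" step: a vertex keeps a message only if it is lower than its stored cost.
        active = set()
        for v, messages in inbox.items():
            best = min(messages)
            if best < cost[v]:
                cost[v] = best
                active.add(v)
    return cost
```

Each pass of the while loop corresponds to one MapReduce iteration, which is where the overhead mentioned earlier comes from.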
BUT, a little self-advertising again (maybe you've read it already in the blog post I've linked): there exists an abstraction over MapReduce that has way less overhead in this kind of graph processing. It is called BSP (Bulk Synchronous Parallel). It is freer in its communication and in its computing model. So I'm sure that this can be implemented a lot better with BSP than with MapReduce. You can realize the channels you've spoken about better with it.
I'm currently involved in a Summer of Code project which targets these SSSP problems with BSP. Maybe you want to visit if you're interested. This could then be a partial solution; it is described very well in my blog, too. SSSP's in my blog
I'm excited to hear some feedback ;)
It seems that Storm implements what I was thinking of. It's essentially a computational topology (think of how each compute node might be routing results based on a key/hashing function to the specific reducers).
This is not exactly what I described, but might be useful if one had a sufficiently low-latency way to propagate current bounding (i.e. local optimum information) which each node in the topology could update/receive in order to know which results to discard.

Difference between Gene Expression Programming and Cartesian Genetic Programming

Something pretty annoying in evolutionary computing is that mildly different and overlapping concepts tend to pick dramatically different names. My latest confusion because of this is that gene-expression-programming seems very similar to cartesian-genetic-programming.
(how) Are these fundamentally different concepts?
I've read that indirect encoding of GP instructions is an effective technique (both GEP and CGP do that). Has some sort of consensus been reached that indirect encoding has made classic tree-based GP obsolete?
Well, it seems that there is some difference between gene expression programming (GEP) and cartesian genetic programming (CGP or what I view as classic genetic programming), but the difference might be more hyped up than it really ought to be. Please note that I have never used GEP, so all of my comments are based on my experience with CGP.
In CGP there is no distinction between genotype and a phenotype, in other words- if you're looking at the "genes" of a CGP you're also looking at their expression. There is no encoding here, i.e. the expression tree is the gene itself.
In GEP the genotype is expressed into a phenotype, so if you're looking at the genes you will not readily know what the expression is going to look like. The "inventor" of GEP, Cândida Ferreira, has written a really good paper and there are some other resources which try to give a shorter overview of the whole concept.
Ferreira says that the benefits are "obvious," but I really don't see anything that would necessarily make GEP better than CGP. Apparently GEP is multigenic, which means that multiple genes are involved in the expression of a trait (i.e. an expression tree). In any case, the fitness is calculated on the expressed tree, so it doesn't seem like GEP is doing anything to increase the fitness. What the author claims is that GEP increases the speed at which the fitness is reached (i.e. in fewer generations), but frankly speaking you can see dramatic performance shifts from a CGP just by having a different selection algorithm, a different tournament structure, splitting the population into tribes, migrating specimens between tribes, including diversity into the fitness, etc.
Selection:
random
roulette wheel
top-n
take half
etc.
Tournament Frequency:
once per epoch
once per every data instance
once per generation.
Tournament Structure:
Take 3, kill 1 and replace it with the child of the other two.
Sort all individuals in the tournament by fitness, kill the lower half and replace it with the offspring of the upper half (where lower is worse fitness and upper is better fitness).
Randomly pick individuals from the tournament to mate and kill the excess individuals.
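As an illustration of the first tournament structure ("take 3, kill 1"), here is a minimal sketch. It assumes higher fitness is better and that `fitness` and `breed` are supplied by the caller; both names are placeholders of mine.

```python
import random

def three_way_tournament(population, fitness, breed, rng):
    """Pick 3 individuals, replace the worst with the child of the other two.

    Returns the index of the replaced individual. Assumes higher fitness is better.
    """
    picked = rng.sample(range(len(population)), 3)
    picked.sort(key=lambda i: fitness(population[i]))   # worst first
    loser, parent1, parent2 = picked
    population[loser] = breed(population[parent1], population[parent2])
    return loser
```

Running this repeatedly gives a steady-state scheme: the population size never changes, and the tournament frequency options listed above decide how often it is called.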
Tribes
A population can be split into tribes that evolve independently of each-other:
Migration- periodically, individual(s) from a tribe would be moved to another tribe
The tribes are logically separated so that they're like their own separate populations running in separate environments.
Diversity Fitness
Incorporate diversity into the fitness, where you count how many individuals have the same fitness value (thus are likely to have the same phenotype) and you penalize their fitness by a proportionate value: the more individuals with the same fitness value, the more penalty for those individuals. This way specimens with unique phenotypes will be encouraged, therefore there will be much less stagnation of the population.
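One possible way to write that penalty down is shown below. It assumes higher fitness is better, treats two individuals as duplicates only if their fitness values are exactly equal, and uses a linear penalty per duplicate; the penalty size is an arbitrary choice.

```python
from collections import Counter

def diversity_adjusted(fitnesses, penalty=0.1):
    """Subtract `penalty` for every other individual sharing the same fitness value."""
    counts = Counter(fitnesses)
    return [f - penalty * (counts[f] - 1) for f in fitnesses]
```

For floating-point fitness values it may be worth rounding before counting, since near-identical phenotypes can differ in the last few digits.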
Those are just some of the things that can greatly affect the performance of a CGP, and when I say greatly I mean that the effect is of the same order as, or greater than, the improvement Ferreira reports. So if Ferreira didn't tinker with those ideas too much, then she could have seen much slower performance from the CGPs... especially if she didn't do anything to combat stagnation. So I would be careful when reading performance statistics on GEP, because sometimes people fail to account for all of the "optimizations" available out there.
There seems to be some confusion in these answers that must be clarified. Cartesian GP is different from classic GP (aka tree-based GP), and GEP. Even though they share many concepts and take inspiration from the same biological mechanisms, the representation of the individuals (the solutions) varies.
In CGP the representation (mapping between genotype and phenotype) is indirect; in other words, not all of the genes in a CGP genome will be expressed in the phenome (a concept also found in GEP and many others). The genotypes can be coded in a grid or array of nodes, and the resulting program graph is the expression of active nodes only.
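A small sketch may help to show what "active nodes only" means. The encoding here is a simplification I made up for illustration: a single row of two-input nodes, addresses below `n_inputs` refer to program inputs, node i has address `n_inputs + i`, and connections are feed-forward (a node only reads lower addresses).

```python
import operator

FUNCS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def active_nodes(nodes, n_inputs, output):
    """Walk backwards from the output address and collect the nodes it depends on."""
    active, stack = set(), [output]
    while stack:
        addr = stack.pop()
        if addr < n_inputs or addr in active:
            continue                      # program input, or already visited
        active.add(addr)
        _, a, b = nodes[addr - n_inputs]
        stack.extend((a, b))
    return sorted(active)

def evaluate(nodes, n_inputs, output, inputs):
    """Compute only the active nodes, in feed-forward order."""
    values = dict(enumerate(inputs))
    for addr in active_nodes(nodes, n_inputs, output):
        func, a, b = nodes[addr - n_inputs]
        values[addr] = FUNCS[func](values[a], values[b])
    return values[output]

# Genome with two inputs (addresses 0 and 1) and three nodes (addresses 2, 3, 4).
genome = [("add", 0, 1), ("mul", 0, 0), ("sub", 2, 1)]
print(active_nodes(genome, 2, 4))        # [2, 4]: the "mul" node is never expressed
print(evaluate(genome, 2, 4, (3, 5)))    # (3 + 5) - 5 = 3
```

The inactive node stays in the genome and can become active again after a mutation changes a connection or the output gene.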
In GEP the representation is also indirect, and similarly not all genes will be expressed in the phenotype. The representation in this case is much different from tree GP or CGP, but the genotypes are also expressed into a program tree. In my opinion GEP is a more elegant representation and easier to implement, but it also suffers from some defects: you have to find the appropriate tail and head size, which is problem specific; the multigenic version is a bit of a forced glue between expression trees; and finally it has too much bloat.
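To illustrate how a linear GEP gene is expressed into a tree, here is a minimal sketch of breadth-first (Karva-style) decoding. The symbol set is my own choice: "Q" stands for square root, and any symbol without an arity entry is treated as a terminal. It assumes the gene is well formed, which the head/tail sizing is meant to guarantee.

```python
import math

ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1}   # "Q" = square root

def decode_karva(gene):
    """Read the gene left to right, filling the tree level by level.

    Returns (root, expressed) where each node is (symbol, children) and
    `expressed` is how many symbols of the gene ended up in the tree.
    """
    nodes = [(symbol, []) for symbol in gene]
    next_free = 1
    i = 0
    while i < next_free:
        symbol, children = nodes[i]
        for _ in range(ARITY.get(symbol, 0)):
            children.append(nodes[next_free])
            next_free += 1
        i += 1
    return nodes[0], next_free

def evaluate(node, env):
    symbol, children = node
    args = [evaluate(child, env) for child in children]
    if symbol == "Q":
        return math.sqrt(args[0])
    if symbol == "+":
        return args[0] + args[1]
    if symbol == "-":
        return args[0] - args[1]
    if symbol == "*":
        return args[0] * args[1]
    if symbol == "/":
        return args[0] / args[1]
    return env[symbol]                    # terminal: look up the variable

root, expressed = decode_karva("Q*+-abcdaab")
print(expressed)                                            # 8 of the 11 symbols are used
print(evaluate(root, {"a": 1, "b": 3, "c": 5, "d": 1}))     # sqrt((1+3)*(5-1)) = 4.0
```

The trailing "aab" is part of the genome but not of the expression tree, which is the GEP counterpart of CGP's inactive nodes.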
Independently of which representation may be better than the other in some specific problem domain, they are general purpose, can be applied to any domain as long as you can encode it.
In general, GEP is simpler than GP. Let's say you allow the following nodes in your program: constants, variables, +, -, *, /, if, ...
For each of such nodes with GP you must create the following operations:
- randomize
- mutate
- crossover
- and probably other genetic operators as well
In GEP for each of such nodes only one operation is needed to be implemented: deserialize, which takes array of numbers (like double in C or Java), and returns the node. It resembles object deserialization in languages like Java or Python (the difference is that deserialization in programming languages uses byte arrays, where here we have arrays of numbers). Even this 'deserialize' operation doesn't have to be implemented by the programmer: it can be implemented by a generic algorithm, just like it's done in Java or Python deserialization.
From one point of view, this simplicity may make the search for the best solution less successful; on the other hand, it requires less work from the programmer, and simpler algorithms may execute faster (easier to optimize, more code and data fit in the CPU cache, and so on). So I would say that GEP is slightly better, but of course the definitive answer depends on the problem, and for many problems the opposite may be true.