Genetic algorithm for optimizing a game-playing agent's heuristic evaluation function

This is in response to an answer given in this question:
How to create a good evaluation function for a game?, particularly the answer by @David (it is the first answer).
Background: I am using a genetic algorithm to optimize the hyperparameters in a game-playing agent that uses minimax with alpha-beta pruning (and iterative deepening). In particular, I would like to optimize the heuristic (evaluation) function parameters using a genetic algorithm. The evaluation function I use is:
f(w) = w * num_my_moves - (1-w) * num_opponent_moves
The only parameter to optimize is w in [0,1].
Here's how I programmed the genetic algorithm:
Create a random population of say 100 agents
Let them play 1000 games at random with replacement.
Let the parents be the top performing agents with some poorer performing agents mixed in for genetic diversity.
Randomly breed some parents to create children. Breeding process: we define a child to be the average of the weights of its parents,
i.e. childWeight = 0.5*(father.w + mother.w)
The new population is formed by the parents and the newly created children.
Randomly mutate 1% of the population as follows: newWeight = agent.w + random.uniform(-0.01, 0.01), clamping at the border cases (i.e. values below zero or above one are clipped back into [0, 1]).
Evolve 10 times (i.e. repeat the process with each new population).
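In code, the loop looks roughly like this (fitness(w) plays the 1000 random games and returns a score; the 40/10 parent split is an arbitrary choice for illustration):

import random

def evolve(fitness, pop_size=100, generations=10, mutation_rate=0.01):
    """fitness(w) is assumed to play the 1000 random games and return a score."""
    population = [random.random() for _ in range(pop_size)]   # each agent is its weight w
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:40] + random.sample(ranked[40:], 10)   # top performers + diversity
        children = [0.5 * (random.choice(parents) + random.choice(parents))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
        for i in range(len(population)):      # mutate ~1% of the population
            if random.random() < mutation_rate:
                w = population[i] + random.uniform(-0.01, 0.01)
                population[i] = min(1.0, max(0.0, w))     # clamp to [0, 1]
    return max(population, key=fitness)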
My question: Please evaluate the steps above. In particular, does anyone have a better way to breed, rather than trivially averaging the parent weights, and does anyone have a better way to mutate, rather than just adding random.uniform(-0.01, 0.01)?

It looks like you're not actually applying a genetic algorithm to your agents, but rather just simple evolution directly on the phenotype/weights. I suggest you try introducing a genetic representation of your weights and evolve this genome instead. An example would be to represent your weights as a binary string and apply evolution to each bit of the string, meaning there is a likelihood that each bit gets mutated. These are called point mutations. There are many other mutations you can apply, but this would do as a start.
What you will notice is that your agents don't get stuck in local minima as much because sometimes a small genetic change can vastly change the phenotype/weights.
OK, that might sound complicated, but it's not really. Let me give you an example:
Say you have a weight of 42 in base 10. This would be 101010 in binary. Now you have implemented a 1% mutation rate on each bit of the binary representation. Let's say the last bit is flipped. Then we have 101011 in binary, or 43 in decimal. Not such a big change. Doing the same with the second bit on the other hand gives you 111010 in binary or 58 decimal. Notice the big jump. This is what we want, and lets your agent population search a larger part of the solution space faster.
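A minimal Python sketch of such point mutations on a 6-bit unsigned encoding (bit indices count from the least significant bit):

import random

def mutate(weight, bits=6, rate=0.01):
    """Flip each bit of the binary representation with probability `rate`."""
    for i in range(bits):
        if random.random() < rate:
            weight ^= 1 << i                # flip bit i
    return weight

# mutate(42) usually returns 42 (0b101010), but flipping bit 0 gives 43
# and flipping bit 4 gives 58 -- the small and large jumps described above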
With regard to breeding, you can try crossover. Let's assume you have many weights, each with a genetic encoding. If you represent the whole genome (all the binary data) as one long binary string, you can combine sections of the two parents' genomes. An example, again. The following are the "father" and "mother" genomes and phenotypes:
Weight Name: W1 W2 W3 W4 W5
Father Phenotype: 43 15 34 17 14
Father Genome: 101011 001111 100010 010001 001110
Mother Genome: 100110 100111 011001 010100 101000
Mother Phenotype: 38 39 25 20 40
What you can do is draw arbitrary lines through both genomes at the same place, and assign the segments arbitrarily to the child. This is a version of crossover.
Weight Name: W1 W2 W3 W4 W5
Father Genome: 101011 00.... ...... .....1 001110
Mother Genome: ...... ..0111 011001 01010. ......
Child Genome: 101011 000111 011001 010101 001110
Child Phenotype: 43 7 25 21 14
Here the first 8 and the last 7 bits come from the father, and the middle comes from the mother. Notice how weights W1 and W5 are entirely from the father and W3 is entirely from the mother, while W2 and W4 are combinations. W4 has hardly changed, while W2 has changed drastically.
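A sketch of this kind of crossover in Python, cutting both genomes at the same two random points (note the masks count bits from the least significant end, i.e. the right-hand end of the printed strings):

import random

def two_point_crossover(father, mother, bits):
    """Cut both genomes at the same two random points; the child takes the
    father's outer segments and the mother's middle segment."""
    i, j = sorted(random.sample(range(1, bits), 2))
    middle = ((1 << j) - 1) ^ ((1 << i) - 1)    # mask selecting bits i..j-1
    return (father & ~middle) | (mother & middle)

# the 30-bit genomes above (5 weights x 6 bits)
father = 0b101011_001111_100010_010001_001110
mother = 0b100110_100111_011001_010100_101000
child = two_point_crossover(father, mother, bits=30)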
I hope this gives you some insight in how to do genetic algorithms. That said, I recommend using a modern library instead of implementing it yourself, unless you are doing it to learn.
Edit: More on handling the weights/binary representation:
If you need fractions, you can do this by separating the numerator and denominator as different weights, or by having one of them as a constant, e.g., 42 and 10 gives 4.2.
Greater-than-zero constraints come for free; to actually get negative numbers you would need to negate your weights.
A less-than-one constraint you can get by dividing the weight by the maximum possible value for that bit-string length. In the examples above you have 6 bits, whose maximum value is 63. If after mutation you get the binary string 101010, or 42 in base 10, you compute 42/63 = 0.667; the value can only ever reach 1.0, at 63/63.
Two weights summing to 1? If you get 101010 and 001000 for W1 and W2, i.e. 42 and 8, then you can take W1_scaled = W1 / (W1 + W2) = 0.84 and W2_scaled = W2 / (W1 + W2) = 0.16. This gives W1_scaled + W2_scaled = 1 always.
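The last two tricks as a small Python sketch (6-bit strings assumed):

def decode_unit(genome, bits=6):
    """Scale an unsigned integer genome into [0, 1]."""
    return genome / ((1 << bits) - 1)       # e.g. 0b101010 -> 42/63 = 0.667

def decode_pair_summing_to_one(w1, w2):
    """Scale two raw weights so they always sum to 1."""
    total = w1 + w2
    return w1 / total, w2 / total           # e.g. 42, 8 -> 0.84, 0.16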

Since I was mentioned.
Rather than averaging the parent weights, I picked random numbers using the parent weights as a min/max. I additionally found I had to widen the range slightly (to compensate for the reduction in standard deviation that comes from averaging two uniform random numbers; the factor is around sqrt(2), though I probably wasn't exact) to resist the pull toward the average. Otherwise the population converges toward the average and can't escape.
So if the parents' weights were 0.1 and 0.2, it might pick a random number between 0.08 and 0.22 for the child weight.
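In code, that rule looks roughly like this (the widening factor of 0.2 here is illustrative; as noted, I never nailed down the exact value):

import random

def breed(w_father, w_mother, widen=0.2):
    """Pick the child's weight uniformly between the parents' weights,
    widening the range a bit to resist the pull toward the average."""
    lo, hi = min(w_father, w_mother), max(w_father, w_mother)
    margin = widen * (hi - lo)
    return random.uniform(lo - margin, hi + margin)

# breed(0.1, 0.2) draws from [0.08, 0.22], as in the example above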
Late edit: A more accepted, studied, understood approach that I didn't know at the time is something called "Differential Evolution".

Related

effective number

In Gelman's book, the effective sample size is defined in terms of the following:
R hat
the between- and within-sequence variances of the MCMC draws, B and W
the number of MCMC samples, denoted by n
the number of chains, denoted by m
I do not know how sampling() calculates the between-sequence variance when chains = 1, so I cannot calculate these terms (B, W, m). I want to implement the algorithm from this paper: https://arxiv.org/abs/1804.06788.
Roughly speaking, this paper constructs a test statistic that is uniformly distributed under the null hypothesis that the MCMC sampling is correct. If the MCMC sampling is not correct, the histogram of the test statistic becomes skewed, and this deviation from uniformity tells us the MCMC contains bias. I want to implement this, but it requires calculating the above quantities.
In rstan, is there a function to extract the above quantities? I think the quantities B, W, and m used in the calculation of the R hat statistic are retained somewhere in the stanfit S4 object.
I am sorry -- I found n_eff, but I do not know the choice of m for the case chains = 1.
In the case that only one chain is estimated (which should not be happening anyway), m = 2, because the post-warmup draws from the single chain are split into a first half and a second half. This splitting method is discussed in the documentation.
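If you want to compute B and W yourself for a single chain, here is a sketch of that split-chain computation following the usual formulas from Gelman et al. (an illustration, not rstan's internal code):

import numpy as np

def split_rhat(draws):
    """Split one chain's post-warmup draws into two halves and compute
    B, W, and R-hat as in Gelman et al."""
    n = len(draws) // 2
    chains = np.array([draws[:n], draws[n:2 * n]])   # m = 2 half-chains
    means = chains.mean(axis=1)
    B = n * means.var(ddof=1)                        # between-sequence variance
    W = chains.var(axis=1, ddof=1).mean()            # within-sequence variance
    var_hat = (n - 1) / n * W + B / n                # marginal variance estimate
    return B, W, np.sqrt(var_hat / W)                # R-hat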

What is meaning of "parameter optimization of SVM by PSO"?

I can change the parameters C and epsilon manually to obtain an optimised result, but I keep finding references to "parameter optimization of SVM by PSO" (or some other optimization algorithm) with no concrete algorithm given. What does it mean: how can PSO automatically optimize the SVM parameters? I read several papers on this topic, but I'm still not sure.
Particle Swarm Optimization is a technique that uses the ML parameters (SVM parameters, in your case) as its features.
Each "particle" in the swarm is characterized by those parameter values. For instance, you might have initial coordinates of
      degree  epsilon  gamma  C
p1    3       0.001    0.25   1.0
p2    3       0.003    0.20   0.9
p3    2       0.0003   0.30   1.2
p4    4       0.010    0.25   0.5
...
pn    ...........................
The "fitness" of each particle (p1-p4 shown here, out of a population of n particles) is measured by the accuracy of the resulting model: the PSO algorithm trains and tests a model for each particle, returning that model's error rate as the value analogous to a training loss (which is how the value is computed).
On each iteration, particles move toward the fittest neighbours. The process repeats until a maximum (hopefully the global one) emerges as a convergence point. In spirit this resembles the familiar gradient descent family, although no gradients are actually computed.
There are two basic PSO variants. In gbest (global best), every particle affects every other particle, sort of a universal gravitation principle. It converges quickly, but may well miss a global max in favor of a local max that happened to be nearer to the swarm's original center. In lbest (local best), a particle responds to only its k closest neighbors. This can form localized clusters; it converges more slowly, but is more likely to find the global max in a non-convex space.
I'll try to briefly explain enough to answer your clarification questions. If that doesn't work, I'm afraid you'll probably have to find someone to discuss this in front of a white board.
To use PSO, you have to decide which SVM parameters you'll try to optimize, and how many particles you want to use. PSO is a meta-algorithm, so its features are the SVM parameters. The PSO parameters are population (how many particles you want to use), update neighbourhood (lbest size and a distance function; gbest is the all-inclusive case), and velocity (a learning rate for the SVM parameters).
For a bit of illustration, let's assume the particle table above, extended to a population of 20 particles. We'll use lbest with a neighbourhood of 4, and a velocity of 0.1. We choose (randomly, in a grid, or however we think might give us nice results) the initial values of degree, epsilon, gamma, and C for each of the 20 particles.
Each iteration of PSO works like this:
# Train the model described by each particle's "position"
for each of the 20 particles:
    train an SVM with the SVM input and that particle's parameters
    test the SVM; record the error rate as the PSO loss value
# Update the particle positions
for each of the 20 particles:
    find the nearest 4 neighbours (using the PSO distance function)
    identify the neighbour with the lowest loss (the SVM's error rate)
    adjust this particle's features (degree, epsilon, gamma, C) 0.1 of the way toward that neighbour's features; 0.1 is our learning rate / velocity. (Yes, I realize that changing degree, a discrete value, is not likely to work without a special case in the update routine.)
Continue iterating through PSO until the particles have converged to your liking.
gbest is simply lbest with an infinite neighbourhood; in that case, you don't need a distance function on the particle space.
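For concreteness, here is a rough gbest sketch using scikit-learn, tuning only C and gamma (the iris dataset, the search bounds, and the PSO coefficients are illustrative assumptions, not anything prescribed by the method):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def loss(params):                       # params = (C, gamma)
    model = SVC(C=params[0], gamma=params[1])
    return 1.0 - cross_val_score(model, X, y, cv=3).mean()   # error rate

rng = np.random.default_rng(0)
lo_b, hi_b = np.array([0.1, 1e-4]), np.array([10.0, 1.0])    # illustrative bounds
pos = rng.uniform(lo_b, hi_b, size=(20, 2))                  # 20 particles
vel = np.zeros_like(pos)
pbest, pbest_loss = pos.copy(), np.array([loss(p) for p in pos])
gbest = pbest[pbest_loss.argmin()]

for _ in range(30):                     # PSO iterations
    r1, r2 = rng.random((2, 20, 1))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo_b, hi_b)
    cur = np.array([loss(p) for p in pos])
    better = cur < pbest_loss
    pbest[better], pbest_loss[better] = pos[better], cur[better]
    gbest = pbest[pbest_loss.argmin()]

print("best C, gamma:", gbest, "error rate:", pbest_loss.min())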

Is there a predefined name for the following solution search/optimization algorithm?

Consider a problem whose solution maximizes an objective function.
Problem: from 500 elements, 15 need to be selected (a candidate solution). The value of the objective function depends on the pairwise relationships between the elements in a candidate solution, and some more.
The steps for solving such a problem are described here:
1. Generate a set of candidate solutions (the population) in a guided random manner // not purely random: a direction is given to generate the population
2. Evaluate the objective function for the current population
3. If the current_best_solution exceeds the global_best_solution, then replace the global_best with the current_best
4. Repeat steps 1, 2, 3 N times (N an arbitrary number)
where the population size and N are both small (approx. 50)
After N iterations it returns the candidate solution stored in global_best_solution.
Is this the description of a well-known algorithm?
If it is, what is the name of that algorithm or if not under which category these type of algorithms fit?
What you have sounds like you are just fishing. Note that you might as well get rid of steps 3 and 4, since running the loop 100 times is the same as doing it once with an initial population 100 times as large.
If you think of the objective function as a random variable (a function of the random decision variables), then what you are doing would, e.g., give you something in the 99.9th percentile with very high probability -- but there is no limit to how far the optimum might be from the 99.9th percentile.
To illustrate the difficulty, consider the following sort of Travelling Salesman Problem. Imagine two clusters of points, A and B, each of which has 100 points. Within the clusters, each point is arbitrarily close to every other point (e.g. 0.0000001). But between the clusters the distance is, say, 1,000,000. The optimal tour would clearly have length 2,000,000 (plus a negligible amount). A random tour is just a random permutation of those 200 decision points. Getting an optimal or near-optimal tour would be akin to shuffling a deck of 200 cards with 100 red and 100 black and having all of the red cards in the deck in a block (counting blocks that "wrap around") -- vanishingly unlikely (it can be calculated as 99 * 100! * 100! / 200! = 1.09 x 10^-57). Even if you generate quadrillions of tours, it is overwhelmingly likely that each of those tours would be off by millions. This is a min problem, but it is also easy to come up with max problems where it is vanishingly unlikely that you will get a near-optimal solution by purely random settings of the decision variables.
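That figure checks out numerically:

from math import factorial

p = 99 * factorial(100) ** 2 / factorial(200)
print(p)    # ~1.09e-57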
This is an extreme example, but it is enough to show that purely random fishing for a solution isn't very reliable. It would make more sense to use evolutionary algorithms or other heuristics such as simulated annealing or tabu search.
Why do you work with a population if the members of that population do not interact?
What you have there is random search.
If you add mutation, it looks like an Evolution Strategy: https://en.wikipedia.org/wiki/Evolution_strategy

sampling 2-dimensional surface: how many sample points along X & Y axes?

I have a set of the first 25 Zernike polynomials. A few are shown below in the Cartesian co-ordinate system.
z2 = 2*x
z3 = 2*y
z4 = sqrt(3)*(2*x^2+2*y^2-1)
:
:
z24 = sqrt(14)*(15*(x^2+y^2)^2-20*(x^2+y^2)+6)*(x^2-y^2)
I am not using the 1st since it is piston; so I have these 24 two-dimensional ANALYTICAL functions expressed in the X-Y Cartesian co-ordinate system. All are defined over the unit circle, as they are orthogonal over the unit circle. The problem I am describing here is relevant to other 2D surfaces too, apart from Zernike polynomials.
Suppose that the origin (0,0) of the XY co-ordinate system and the centre of the unit circle are the same.
Next, I take a linear combination of these 24 polynomials to build a 2D wavefront shape. I use 24 random input coefficients in this combination.
w(x,y) = sum_i a_i * z_i    (i = 2, 3, ..., 24)
a_i = random coefficients
z_i = Zernike polynomials
Up to this point, everything is analytical and can be done on paper.
Now comes the discretization!
I know that when you want to reconstruct a signal (1-D/2-D), your sampling frequency should be at least twice the maximum frequency present in the signal (the Nyquist-Shannon principle).
Here the signal is w(x,y) as defined above, which is just a simple 2-D function of x & y. I want to represent it on a computer now. Obviously I cannot take all the infinitely many points from -1 to +1 along the x axis, and the same for the y axis.
I have to take a finite number of data points (called sample points, or just samples) on this analytical 2-D surface w(x,y).
I am measuring x & y in metres, and -1 <= x <= +1; -1 <= y <= +1.
e.g. If I divide my x axis from -1 to 1 into 50 sample points, then dx = 2/50 = 0.04 metre, and my sampling frequency is 1/dx, i.e. 25 samples per metre. The same holds for the y axis.
But I took 50 samples arbitrarily; I could have taken 10 samples or 1000. That is the crux of the matter here: how many sample points? How do I determine this number?
There is a theorem (the Nyquist-Shannon theorem, mentioned above) which says that if I want to reconstruct w(x,y) faithfully, I must sample it on both axes so that my sampling frequency (i.e. the number of samples per metre) is at least twice the maximum frequency present in w(x,y). This amounts to finding the power spectrum of w(x,y): any function in the space domain can also be represented in the spatial-frequency domain, which is nothing but taking the Fourier transform of the function! That tells us which (spatial) frequencies are present in w(x,y) and what the maximum frequency among them is.
Now my question is how to find this maximum frequency in my case. I cannot use MATLAB's fft2() or any other such tool, since using them means I have already taken samples across the wavefront! The obvious remaining option is to find it analytically, but that is time-consuming and difficult, since I have 24 polynomials and I would have to use the continuous Fourier transform, i.e. pen and paper.
Any help will be appreciated.
Thanks
Key assumptions:
You want to use the Nyquist-Shannon theorem to determine the sampling frequency, and, quoting the question:
"The obvious remaining option is to find it analytically, but that is time-consuming and difficult, since I have 24 polynomials and I would have to use the continuous Fourier transform, i.e. pen and paper."
Given the assumption I have made (and noting that consideration of other mathematical techniques is out of scope for StackOverflow), you have no option but to calculate the continuous Fourier Transform.
However, I believe you haven't considered all the options for calculating the transform other than a laborious paper exercise, e.g.:
Numerical approximation of the continuous F.T. using code
Symbolic Integration e.g. Wolfram Alpha
Surely a numerical approximation of the Fourier Transform will be adequate for your solution?
I am assuming this is for coursework or research, so all you really care about, as a physicist, is the quickest solution that is accurate within the scope of your problem.
So, to conclude: IMHO, don't waste time searching for a more mathematically elegant solution or trick, and just solve the problem with one of the above methods.
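For the first option, a rough numpy sketch: oversample w(x,y) on a deliberately fine grid, take fft2, and read off where the spectral energy dies out. The stand-in wavefront below must be replaced with your actual sum of Zernike terms, and note that the hard circular edge itself contributes high frequencies:

import numpy as np

N = 1024                                    # deliberately oversampled trial grid
x = np.linspace(-1, 1, N)
X, Y = np.meshgrid(x, x)

# stand-in wavefront: replace with your actual sum of Zernike terms
W = 2*X + 2*Y + np.sqrt(3) * (2*X**2 + 2*Y**2 - 1)
W[X**2 + Y**2 > 1] = 0                      # Zernikes live on the unit circle

spectrum = np.abs(np.fft.fftshift(np.fft.fft2(W)))
freqs = np.fft.fftshift(np.fft.fftfreq(N, d=x[1] - x[0]))   # cycles per metre

# radial frequency containing, say, 99.9% of the spectral energy
FX, FY = np.meshgrid(freqs, freqs)
R = np.hypot(FX, FY).ravel()
order = np.argsort(R)
energy = np.cumsum(spectrum.ravel()[order] ** 2)
f_max = R[order][np.searchsorted(energy, 0.999 * energy[-1])]
print("sample at >=", 2 * f_max, "samples per metre")       # Nyquist rate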

How to calculate deceleration needed to reach a certain speed over a certain distance?

I've tried the typical physics equations for this but none of them really work because the equations deal with constant acceleration and mine will need to change to work correctly. Basically I have a car that can be going at a large range of speeds and needs to slow down and stop over a given distance and time as it reaches the end of its path.
So, I have:
V0, or the current speed
Vf, or the speed I want to reach (typically 0)
t, or the amount of time I want to take to reach the end of my path
d, or the distance I want to go as I change from V0 to Vf
I want to calculate
a, or the acceleration needed to go from V0 to Vf
The reason this becomes a programming-specific question is because a needs to be recalculated every single timestep as the car keeps stopping. So, V0 constantly is changed to be V0 from last timestep plus the a that was calculated last timestep. So essentially it will start stopping slowly then will eventually stop more abruptly, sort of like a car in real life.
EDITS:
All right, thanks for the great responses. A lot of what I needed was just some help thinking about this. Let me be more specific now that I've got some more ideas from you all:
I have a car c that is 64 pixels from its destination, so d=64. It is driving at 2 pixels per timestep, where a timestep is 1/60 of a second. I want to find the acceleration a that will bring it to a speed of 0.2 pixels per timestep by the time it has traveled d.
d = 64 //distance
V0 = 2 //initial velocity (in ppt)
Vf = 0.2 //final velocity (in ppt)
Also because this happens in a game loop, a variable delta is passed through to each action, which is the multiple of 1/60s that the last timestep took. In other words, if it took 1/60s, then delta is 1.0, if it took 1/30s, then delta is 0.5. Before acceleration is actually applied, it is multiplied by this delta value. Similarly, before the car moves again its velocity is multiplied by the delta value. This is pretty standard stuff, but it might be what is causing problems with my calculations.
Linear acceleration a for a distance d going from a starting speed Vi to a final speed Vf:
a = (Vf*Vf - Vi*Vi)/(2 * d)
EDIT:
After your edit, let me try and gauge what you need...
If you take this formula and insert your numbers, you get a constant acceleration of -0.0309375. Now, let's keep calling this result 'a'.
What you need between timesteps (frames?) is not actually the acceleration, but the new location of the vehicle, right? So you use the following formula:
Sd = Vi * t + 0.5 * t * t * a
where Sd is the current distance from the start position at current frame/moment/sum_of_deltas, Vi is the starting speed, and t is the time since the start.
With this, your deceleration is constant, but even though it is linear, your speed will accommodate to your constraints.
If you want a non-linear deceleration, you could find some non-linear interpolation method and interpolate not acceleration, but simply the position between the two points:
location = non_linear_function(time);
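For completeness, a small Python sketch of the per-timestep recalculation the question describes, applying a = (Vf^2 - V0^2)/(2d) to the remaining distance each tick (pixel/timestep units as in the edit; the 0.01 stopping threshold is arbitrary):

def step(v, d_remaining, vf=0.2, delta=1.0):
    """One game-loop tick: recompute a from what's left, then integrate."""
    a = (vf * vf - v * v) / (2.0 * d_remaining)   # a = (Vf^2 - V0^2) / (2d)
    v = max(vf, v + a * delta)                    # never slow below the target speed
    d_remaining -= v * delta
    return v, d_remaining

v, d = 2.0, 64.0
while d > 0.01:                                   # arbitrary stopping threshold
    v, d = step(v, d)
print(v)                                          # approaches 0.2 ppt as d shrinks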
The four constraints you give are one too many for a linear system (one with constant acceleration), where any three of the variables would suffice to compute the acceleration and thereby determine the fourth variable. However, the system is way under-specified for a completely general nonlinear system -- there may be uncountably many ways to change acceleration over time while satisfying all the constraints as given. Can you perhaps specify better along what kind of curve acceleration should change over time?
Using 0 index to mean "at the start", 1 to mean "at the end", and D for Delta to mean "variation", given a linearly changing acceleration
a(t) = a0 + t * (a1-a0)/Dt
where a0 and a1 are the two parameters we want to compute to satisfy all the various constraints, I compute (if there's been no misstep, as I did it all by hand):
DV = Dt * (a0+a1)/2
Ds = Dt * (V0 + ((a1-a0)/6 + a0/2) * Dt)
Since DV, Dt and Ds are all given, this leaves 2 linear equations in the unknowns a0 and a1, so you can solve for these (but I'm leaving things in this form to make it easier to double-check my derivation!).
If you're applying the proper formulas at every step to compute changes in space and velocity, it should make no difference whether you compute a0 and a1 once and for all or recompute them at every step based on the remaining Dt, Ds and DV.
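If you'd rather let the computer solve those two linear equations, a small numpy sketch (symbols as defined above; the example numbers are illustrative):

import numpy as np

def linear_accel_params(V0, DV, Ds, Dt):
    """Solve DV = Dt*(a0+a1)/2 and Ds = Dt*V0 + Dt^2*(a0/3 + a1/6) for a0, a1."""
    A = np.array([[Dt / 2.0,     Dt / 2.0],
                  [Dt**2 / 3.0,  Dt**2 / 6.0]])
    b = np.array([DV, Ds - V0 * Dt])
    return np.linalg.solve(A, b)                  # [a0, a1]

# e.g. slowing from 2 to 0.2 ppt over 64 px in 60 timesteps:
a0, a1 = linear_accel_params(V0=2.0, DV=0.2 - 2.0, Ds=64.0, Dt=60.0)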
If you're trying to simulate a time-dependent acceleration, your equations just have to account for that: you integrate F = ma along with the kinematic equations, that's all. If acceleration isn't constant, you have to solve a system of equations instead of just one.
So now it's really three vector equations that you have to integrate simultaneously: one for each component of displacement, velocity, and acceleration, or nine equations in total. The force as a function of time will be an input for your problem.
If you're assuming 1D motion you're down to three simultaneous equations. The ones for velocity and displacement are both pretty easy.
In real life, a car's stopping ability depends on the pressure on the brake pedal, any engine braking that's going on, surface conditions, and such: also, there's that "grab" at the end when the car really stops. Modeling that is complicated, and you're unlikely to find good answers on a programming website. Find some automotive engineers.
Aside from that, I don't know what you're asking for. Are you trying to determine a braking schedule? As in there's a certain amount of deceleration while coasting, and then applying the brake? In real driving, the time is not usually considered in these maneuvers, but rather the distance.
As far as I can tell, your problem is that you aren't asking for anything specific, which suggests that you really haven't figured out what you actually want. If you'd provide a sample use for this, we could probably help you. As it is, you've provided the bare bones of a problem that is either overdetermined or way underconstrained, and there's really nothing we can do with that.
If you need to go from 10 m/s to 0 m/s in 1 m with constant (linear) acceleration, you need 2 equations.
First, find the time (t) it takes to stop.
v0 = initial velocity
vf = final velocity
x0 = initial displacement
xf = final displacement
a = constant linear acceleration
(xf-x0) = .5*(v0+vf)*t
t = 2*(xf-x0)/(v0+vf)
t = 2*(1m-0m)/(10m/s+0m/s)
t = .2 seconds
Next, calculate the linear acceleration between x0 & xf:
(xf-x0) = v0*t + .5*a*t^2
(1m-0m) = (10m/s)*(.2s) + .5*a*((.2s)^2)
1m=(10m/s)*(.2s)+.5*a*(.04s^2)
1m=2m+a*(.02s^2)
-1m=a*(.02s^2)
a=-1m/(.02s^2)
a=-50m/s^2
in terms of gravity (g's)
a=(-50m/s^2)/(9.8m/s^2)
a = 5.1 g over the .2 seconds it takes to go from 10 m/s to 0 m/s
The problem is either overconstrained or underconstrained (is a not constant? is there a maximum a?) or ambiguous.
Simplest formula would be a=(Vf-V0)/t
Edit: if time is not constrained, the distance s is constrained, and acceleration is constant, then the relevant formulae are s = (Vf+V0)/2 * t and t = (Vf-V0)/a, which simplify to a = (Vf^2 - V0^2) / (2s).
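A quick numerical check of that last formula against the worked example above:

def accel(v0, vf, s):
    return (vf**2 - v0**2) / (2 * s)    # a = (Vf^2 - V0^2) / (2s)

print(accel(10.0, 0.0, 1.0))            # -50.0 m/s^2, matching the answer above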