Markov decision process - how to use optimal policy formula? - process

I have a task, where I have to calculate optimal policy
(Reinforcement Learning - Markov decision process) in the grid world (agent movies left,right,up,down).
In left table, there are Optimal values (V*).
In right table, there is sollution (directions) which I don't know how to get by using that "Optimal policy" formula.
Y=0.9 (discount factor)
And here is formula:
So if anyone knows how to use that formula, to get solution (those arrows), please help.
Edit: there is whole problem description on this page:
http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node35.html
Rewards: state A(column 2, row 1) is followed by a reward of +10 and transition to state A', while state B(column 4, row 1) is followed by a reward of +5 and transition to state B'.
You can move: up,down,left,right. You cannot move outside the grid or stay in same place.

Break the math down, piece by piece:
The arg max (....) is telling you to find the argument, a, which maximizes everything in the parentheses. The variables a, s, and s' are an action, a state you're in, and a state that results from that action, respectively. So the arg max (...) is telling you to find an action that maximizes that term.
You know gamma, and someone did the hard work of calculating V*(s'), which is the value of that resulting state. So you know what to plug in there, right?
So what is p(s,a,s')? That is the probability that, starting from s and doing a, you end in some s'. This is meant to represent some kind of faulty actuator-- you say "go forward!" and it foolishly decides to go left (or two squares forward, or remain still, or whatever.) For this problem, I'd expect it to be given to you, but you haven't shared it with us. And the summation over s' is telling you that when you start in s, and you pick an action a, you need to sum over all possible resulting s' states. Again, you need the details of that p(s,a,s') function to know what those are.
And last, there is r(s,a) which is the reward for doing action a in state s, regardless of where you end up. In this problem it might be slightly negative, to represent a fuel cost. If there is a series of rewards in the grid and a grab action, it might be positive. You need that, too.
So, what to do? Pick a state s, and calculate your policy for it. For each s, you're going have the possibility of (s,a1), and (s,a2), and (s,a3), etc. You have to find the a that gives you the biggest result. And of course for each pair (s,a) you may (in fact, almost certainly will) have multiple values of s' to stick in the summation.
If this sounds like a lot of work, well, that's why we have computers.
PS - Read the problem description very carefully to find out what happens if you run into a wall.

Related

X and Y inputs in LabVIEW

I am new to LabVIEW and I am trying to read a code written in LabVIEW. The block diagram is this:
This is the program to input x and y functions into the voltage input. It is meant to give an input voltage in different forms (sine, heartshape , etc.) into the fast-steering mirror or galvano mirror x and y axises.
x and y function controls are for inputting a formula for a function, and then we use "evaluation single value" function to input into a daq assistant.
I understand that { 2*(|-Mpi|)/N }*i + -Mpi*pi goes into the x value. However, I dont understand why we use this kind of formula. Why we need to assign a negative value and then do the absolute value of -M*pi. Also, I don`t understand why we need to divide to N and then multiply by i. And finally, why need to add -Mpi again? If you provide any hints about this I would really appreciate it.
This is just a complicated way to write the code/formula. Given what the code looks like (unnecessary wire bends, duplicate loop-input-tunnels, hidden wires, unnecessary coercion dots, failure to use appropriate built-in 'negate' function) not much care has been given in writing it. So while it probably yields the correct results you should not expect it to do so in the most readable way.
To answer you specific questions:
Why we need to assign a negative value and then do the absolute value
We don't. We can just move the negation immediately before the last addition or change that to a subtraction:
{ 2*(|Mpi|)/N }*i - Mpi*pi
And as #yair pointed out: We are not assigning a value here, we are basically flipping the sign of whatever value the user entered.
Why we need to divide to N and then multiply by i
This gives you a fraction between 0 and 1, no matter how many steps you do in your for-loop. Think of N as a sampling rate. I.e. your mirrors will always do the same movement, but a larger N just produces more steps in between.
Why need to add -Mpi again
I would strongly assume this is some kind of quick-and-dirty workaround for a bug that has not been fixed properly. Looking at the code it seems this +Mpi*pi has been added later on in the development process. And while I don't know what the expected values are I would believe that multiplying only one of the summands by Pi is probably wrong.

Accessing chain position or multi-entity subchain in optaplanner?

We are using Optaplanner to come up with music playlists that follow a set of sound musical principles and rules (with respect to key changes, etc):
https://www.youtube.com/watch?v=Iqya6xqc1jY
We’re using chained planning variables to avoid disrupting an otherwise solid “streak” of playlist tracks, but many of our rules involve some aspect of temporal reasoning across a subchain of X previous tracks. As an example,I’m trying to implement a rule that requires the key to change at least every five songs (to keep the playlist from getting too boring/monotonous). What I’ve come up with works, but I’m wondering if there’s a less awkward way of doing it.
Here’s the rule as we have it right now, which I feel is ugly from a DRY and configurability perspective:
https://github.com/spotfire-io/spotfire-solver/blob/1c0fcda5256c337e214b33043a27fc25f615d0ef/src/main/resources/io/spotfire/solver/rules/rules.drl#L79-L88
rule "Should change key at least once every five songs"
when
$t0: RestPlaylistTrack(keyDistance == 0) // previousTrack is a chained variable
$t1: RestPlaylistTrack(keyDistance == 0, previousTrack == $t0)
$t2: RestPlaylistTrack(keyDistance == 0, previousTrack == $t1)
$t3: RestPlaylistTrack(keyDistance == 0, previousTrack == $t3)
$t4: RestPlaylistTrack(keyDistance == 0, previousTrack == $t3)
then
scoreHolder.addSoftConstraintMatch(kcontext, 0, new BigDecimal(-2));
end
Another example would be implementing a rule that batches tracks of the same genre together (e.g. play 4 jazz tracks in a row followed by 4 rock tracks), or ensuring that we avoid playing the same artist 5 tracks from the last time we played that artist.
In this example, Is there a better way to keep track of the distance between two tracks and then specify a constraint on that? Some potential options we’ve considered include…
Provide a way to extract X-length sub-chains programmatically and apply the rules to that subchain.
Create a shadow variable that represents the position of the track relative to the anchor. Then we could create constraints like RestPlaylistTrack(position < $t.position, position > $t.position - 5) to apply to any tracks within 5 tracks of $t.
Using some sort of Drools aggregate expression that accumulates previous tracks via a map-reducey thing until reaching a certain maximum number of tracks.
The challenge we perceive with the first two solutions is that a chain swap move involves changes to three planning variables. If we have a chain that looks like A <- B <- C <- D, a swap between B and D involves a change to point D to A, B to C, and C to D. At the Drools or shadow variable level, I think there’s a risk doing a bunch of intermediate calculations before the move is complete. This might make score calculation pretty inefficient. For the third option, we’re just not sure how something like that would work mechanically.
If anyone (especially #geoffrey-de-smet) has examples on how this could be done, that would be greatly appreciated. If this is legitimately tricky in the current version of Optaplanner, we think adding a native position mechanism to chained planning modes would be super helpful as a future feature.
It sounds like consecutive shift constraints in nurse rostering. Detecting "no more than n shifts in a row" is non trivial in hand-written DRL. In nurse rostering, we use insertLogicals to deal with those, but I would recommend not to use that (it kills performance). I guesstimate that approach 1) (which gives up incremental calculation) is still faster than any insertLogical approach, unless you're queuing up thousands of songs.
In ConstraintStreams, approach 1 could maybe one day look like this:
constraintFactory.from(Shift.class)
.groupBy(Shift::getEmployee,
sort(Comparable.from(Shift::getStartDateTime, Shift::getId)) // BiConstraintStream<Employee, List<Shift>))
.penalize((employee, sortedShiftList) -> ...); // One match for all bad subsequences of 1 employee
Approach 2) is interesting. Try it out and let us know if it works well enough for you.
Approach 3) is what I aim thinking of in ConstraintStreams at some point. This is incremental. Something like:
constraintFactory.from(Shift.class)
.forEachSortedSubList(Shift::getEmployee,
Comparable.from(Shift::getStartDateTime, Shift::getId,
(employee, sortedShiftSubList) -> ...)
.penalize(...); // One match per bad subsequence
If you have any suggestions on how you 'd like to use the API for approach 3) or how using it could look like, please put them on our google group discussion forum. It could help move the work along.

Complexity of an NFA

I have seen in several sources (e.g. 1, 2 - page 160), that the complexity of running through an NFA is O(m²n). However I haven't understood why it is so.
My intuition is that the complexity should be O(m^n) (where m is the length of the string, and n is the number of states), because for each letter in the input string, there are n possible states that the NFA can move to them.
Can anyone explain this to me?
Thanks.
Let's think about how we'd do one step of simulating the NFA. We begin in some set of active states Qinit. For each state we're in, we look at all the outgoing transitions labeled with the current symbol, then gather them into a set Qmid. Next, we follow all possible ε-transitions from those states and keep repeating this process until there are no more ε-transitions to follow (that is, we find the ε-closure), giving us our new set of state Qnext. We then repeat this process once per character in the input.
So how much time does this take? Well, there are m characters in the input string, so the runtime is m times whatever work we do on one individual character. Each state can have at most n transitions leaving it on each character, so the time to iterate over all the characters in Qinit to find the set Qmid is O(n2): there are O(n) states in Qinit and we only need to scan O(n) transitions per state. From there, we have to keep following ε-transitions until we run out of transitions to follow. We can do that by doing a BFS or DFS starting simultaneously from each state in Qmin and only following ε-transitions. The NFA can have at most O(n2) ε-transitions total (one between each pair of states), so the BFS or DFS will run in time O(n2). Overall we're doing O(n2) work per character, so the total work done is O(mn2).

Enumerable.Count not working

I am monitoring the power of a laser and I want to know when n consecutive measurements are outside a safe range. I have a Queue(Of Double) which has n items (2 in my example) at the time it's being checked. I want to check that all items in the queue satisfy a condition, so I pass the items through a Count() with a predicate. However, the count function always returns the number of items in the queue, even if they don't all satisfy the predicate.
ElseIf _consecutiveMeasurements.AsEnumerable().Count(Function(m) m <= Me.CriticalLowLevel) = ConsecutiveCount Then
_ownedISetInstrument.Disable()
' do other things
A view of the debugger with the execution moving into the If.
Clearly, there are two measurements in the queue, and they are both greater than the CriticalLowLevel, so the count should be zero. I first tried Enumerable.Where(predicate).Count() and I got the same result. What's going on?
Edit:
Of course the values are below the CriticalLowLevel, which I had mistakenly set to 598 instead of 498 for testing. I had over-complicated the problem by focusing my attention on the code when it was my test case which was faulty. I guess I couldn't see the forest for the trees, so they say. Thanks Eric for pointing it out.
Based on your debug snapshot, it looks like both of your measurements are less than the critical level of 598.0, so I would expect the count to match the queue length.
Both data points are <= Me.CriticalLowLevel.
Can you share an example where one of the data points is > Me.CriticalLowLevel that still exhibits this behavior?

How to create an "intercept missile" for a game?

I have a game I am working on that has homing missiles in it. At the moment they just turn towards their target, which produces a rather dumb looking result, with all the missiles following the target around.
I want to create a more deadly flavour of missile that will aim at the where the target "will be" by the time it gets there and I am getting a bit stuck and confused about how to do it.
I am guessing I will need to work out where my target will be at some point in the future (a guess anyway), but I can't get my head around how far ahead to look. It needs to be based on how far the missile is away from the target, but the target it also moving.
My missiles have a constant thrust, combined with a weak ability to turn. The hope is they will be fast and exciting, but steer like a cow (ie, badly, for the non HitchHiker fans out there).
Anyway, seemed like a kind of fun problem for Stack Overflow to help me solve, so any ideas, or suggestions on better or "more fun" missiles would all be gratefully received.
Next up will be AI for dodging them ...
What you are suggesting is called "Command Guidance" but there is an easier, and better way.
The way that real missiles generally do it (Not all are alike) is using a system called Proportional Navigation. This means the missile "turns" in the same direction as the line-of-sight (LOS) between the missile and the target is turning, at a turn rate "proportional" to the LOS rate... This will do what you are asking for as when the LOS rate is zero, you are on collision course.
You can calculate the LOS rate by just comparing the slopes of the line between misile and target from one second to the next. If that slope is not changing, you are on collision course. if it is changing, calculate the change and turn the missile by a proportionate angular rate... you can use any metrics that represent missile and target position.
For example, if you use a proportionality constant of 2, and the LOS is moving to the right at 2 deg/sec, turn the missile to the right at 4 deg/sec. LOS to the left at 6 deg/sec, missile to the left at 12 deg/sec...
In 3-d problem is identical except the "Change in LOS Rate", (and resultant missile turn rate) is itself a vector, i.e., it has not only a magnitude, but a direction (Do I turn the missile left, right or up or down or 30 deg above horizontal to the right, etc??... Imagine, as a missile pilot, where you would "set the wings" to apply the lift...
Radar guided missiles, which "know" the rate of closure. adjust the proportionality constant based on closure (the higher the closure the faster the missile attempts to turn), so that the missile will turn more aggressively in high closure scenarios, (when the time of flight is lower), and less aggressively in low closure (tail chases) when it needs to conserve energy.
Other missiles (like Sidewinders), which do not know the closure, use a constant pre-determined proportionality value). FWIW, Vietnam era AIM-9 sidewinders used a proportionality constant of 4.
I've used this CodeProject article before - it has some really nice animations to explain the math.
"The Mathematics of Targeting and Simulating a Missile: From Calculus to the Quartic Formula":
http://www.codeproject.com/KB/recipes/Missile_Guidance_System.aspx
(also, hidden in the comments at the bottom of that article is a reference to some C++ code that accomplishes the same task from the Unreal wiki)
Take a look at OpenSteer. It has code to solve problems like this. Look at 'steerForSeek' or 'steerForPursuit'.
Have you considered negative feedback on the recent change of bearing over change of time?
Details left as an exercise.
The suggestions is completely serious: if the target does not maneuver this should obtain a near optimal intercept. And it should converge even if the target is actively dodging.
Need more detail?
Solving in a two dimensional space for ease of notation. Take \vec{m} to be the location of the missile and vector \vec{t} To be the location of the target.
The current heading in the direction of motion over last time unit: \vec{h} = \bar{\vec{m}_i - \vec{m}_i-1}}. Let r be the normlized vector between the missile and the target: \vec{r} = \bar{\vec{t} - \vec{m}}. The bearing is b = \vec{r} \dot \vec{h} Compute the bearing at each time tick, and the change thereof, and change heading to minimize that quantity.
The math is harrier in 3d because of the need to find the plane of action at each step, but the process is the same.
You'll want to interpolate the trajectory of both the target and the missile as a function of time. Then look for the times in which the coordinates of the objects are within some acceptable error.