Q-learning value update

Q-learning value update - optimization

I am working on the power management of a device using the Q-learning algorithm. The device has two power modes, i.e., idle and sleep. When the device is asleep, the requests for processing are buffered in a queue. The Q-learning algorithm seeks to minimize a cost function which is a weighted sum of the immediate power consumption and the latency caused by an action.
c(s,a)=lambda*p_avg+(1-lambda)*avg_latency
In each state, the learning algorithm takes an action (executing a time-out value) and evaluates the effect of that action in the next state (using the formula above). The actions are taken from a pool of pre-defined time-out values. The parameter lambda in the equation above is a power-performance parameter (0 <= lambda < 1). It defines whether the algorithm should favor power saving (lambda-->1) or minimizing latency (lambda-->0). The latency for each request is calculated as queuing-time + execution-time.
The problem is that the learning algorithm always favors small time-out values in the sleep state. This is because the average latency for small time-out values is always lower, and hence their cost is also small. When I change the value of lambda from lower to higher, I don't see any effect in the final output policy. The policy always selects small time-out values as the best actions in each state. Instead of the average power and average latency for each state, I have tried using the overall average power consumption and overall average latency to calculate the cost for a state-action pair, but it doesn't help. I also tried using the total energy consumption and total latency experienced by all the requests to calculate the cost for each state-action pair, but it doesn't help either. My question is: what could be a better cost function for this scenario? I update the Q-value as follows:
Q(s,a)=Q(s,a)+alpha*[c(s,a)+gamma*min_a' Q(s',a')-Q(s,a)]
Where alpha is a learning rate (decreased slowly) and gamma=0.9 is a discount factor.
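For concreteness, the cost and update rule above can be sketched in Python as follows. This is only a minimal tabular sketch under assumptions: the state names, the pool of time-out values, and the numeric inputs are hypothetical and not taken from the actual device model.

```python
from collections import defaultdict


def cost(p_avg, avg_latency, lam):
    """Weighted cost c(s,a) = lam*p_avg + (1-lam)*avg_latency."""
    return lam * p_avg + (1.0 - lam) * avg_latency


def q_update(Q, s, a, c, s_next, actions, alpha=0.1, gamma=0.9):
    """One cost-minimizing Q-learning step; returns the new Q(s,a)."""
    # Costs are minimized, so the bootstrap term is the minimum over a'.
    best_next = min(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (c + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical example values (assumptions, not measurements).
Q = defaultdict(float)          # Q-table, all entries start at 0
timeouts = [1, 5, 10]           # assumed pool of time-out values
c = cost(p_avg=2.0, avg_latency=4.0, lam=0.5)
new_q = q_update(Q, "sleep", 5, c, "idle", timeouts)
```

With these made-up numbers the cost is 3.0 and, since the table starts at zero, the first update moves Q("sleep", 5) to alpha * 3.0 = 0.3.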

To answer the questions posed in the comments:
shall I use the entire power consumption and entire latency for all
the requests to calculate the cost in each state (s,a)?
No. In Q-learning, the reward is generally considered an instantaneous signal associated with a single state-action pair. Take a look at Sutton and Barto's page on rewards. As shown there, the instantaneous reward (r_t+1) is subscripted by time step, indicating that it is indeed instantaneous. Note that R_t, the return, already accumulates the rewards that follow time t. Thus, there is no need for you to explicitly keep track of accumulated latency and power consumption (and doing so is likely to be counter-productive).
or shall I use the immediate power consumption and average latency
caused by an action a in state s?
Yes. To underscore the statement above, see the definition of an MDP on page 4 here. The relevant bit:
The reward function specifies expected instantaneous reward as a
function of the current state and action
As I indicated in a comment above, problems in which reward is being "lost" or "washed out" might be better solved with a Q(lambda) implementation because temporal credit assignment is performed more effectively. Take a look at Sutton and Barto's chapter on TD(lambda) methods here. You can also find some good examples and implementations here.

Related

Why is score calculation speed faster for Construction Heuristics than Local Search?

Getting started with OptaPlanner (v.23.0.Final), I am experimenting with the CloudBalancing example. Using the IncrementalScoreCalculator Java class, I notice that the score calculation speed is much higher in the construction phase (>1M/sec) than in the local search phase (~50k/sec). How can this happen? Is the time spent in the algorithm outside the score calculation included? That could explain the difference, since the local search algorithm will spend much more time outside the score calculator than the construction algorithm.

Two reasons:
1) The construction Heuristic starts with no processes assigned to a computer, so all Process.getComputer() is null. Most constraints match on Processes for which computer != null, so they short circuit and don't do any expensive joins, groupBy's, accumulates, etc. So an empty or a partially initialized solution evaluates much faster than a fully initialized one (which Local Search uses).
2) The CH's only do ChangeMove's. LS does more expensive moves including swap moves (twice as big) and pillar moves (n times as big). So the amount of delta impact to calculate per move is bigger in LS too.

Check number of slots used by a query in BigQuery

Is there a way to check how many slots were used by a query over the period of its execution in BigQuery? I checked the execution plan but I could just see the Slot Time in ms but could not see any parameter or any graph to show the number of slots used over the period of execution. I even tried looking at Stackdriver Monitoring but I could not find anything like this. Please let me know if it can be calculated in some way or if I can see it somewhere I might've missed seeing.

A BigQuery job will report the total number of slot-milliseconds from the extended query stats in the job metadata, which is analogous to computational cost. Each stage of the query plan also indicates input stats for the stage, which can be used to indicate the number of units of work each stage dispatched.
More details about the representation can be found in the REST reference for jobs. See statistics.query.totalSlotMs and statistics.query.queryPlan[].parallelInputs for more information.

BigQuery now provides a key in the Jobs API JSON called "timeline". This structure provides "statistics.query.timeline[].completedUnits" which you can obtain either during job execution or after. If you choose to pull this information after a job has executed, "completedUnits" will be the cumulative sum of all the units of work (slots) utilised during the query execution.
The question might have two parts though: (1) Total number of slots utilised (units of work completed) or (2) Maximum parallel number of units used at a point in time by the query.
For (1), the answer is as above, given by "completedUnits".
For (2), you might need to consider the maximum value of queryPlan.parallelInputs across all query stages, which would indicate the maximum "number of parallelizable units of work for the stage" (https://cloud.google.com/bigquery/query-plan-explanation)
If, after this, you additionally want to know if the 2000 parallel slots that you are allocated across your entire on-demand query project is sufficient, you'd need to find the point in time across all queries taking place in your project where the slots being utilised is at a maximum. This is not a trivial task, but Stackdriver monitoring provides the clearest view for you on this.
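As a rough illustration of points (1) and (2), the fields named above can be pulled out of a job resource once it has been fetched as JSON. The sketch below is plain Python over a dict; the sample `job` dict is hypothetical, and the helper name `slot_summary` is made up for this example. The REST API encodes 64-bit integers as strings, hence the `int(...)` conversions.

```python
def slot_summary(job):
    """Summarise slot-related statistics from a Jobs API job resource (dict)."""
    query_stats = job["statistics"]["query"]
    timeline = query_stats.get("timeline", [])
    stages = query_stats.get("queryPlan", [])
    return {
        # Total slot-milliseconds consumed by the query.
        "total_slot_ms": int(query_stats.get("totalSlotMs", 0)),
        # (1) Cumulative units of work completed: largest timeline sample.
        "completed_units": max(
            (int(t.get("completedUnits", 0)) for t in timeline), default=0
        ),
        # (2) Largest number of parallelizable work units in any stage.
        "max_parallel_inputs": max(
            (int(s.get("parallelInputs", 0)) for s in stages), default=0
        ),
    }


# Hypothetical job resource, shaped like a jobs.get response (values made up).
job = {
    "statistics": {
        "query": {
            "totalSlotMs": "48000",
            "timeline": [
                {"elapsedMs": "1000", "completedUnits": "40"},
                {"elapsedMs": "2000", "completedUnits": "95"},
            ],
            "queryPlan": [
                {"name": "S00: Input", "parallelInputs": "80"},
                {"name": "S01: Output", "parallelInputs": "15"},
            ],
        }
    }
}
summary = slot_summary(job)
```

For the made-up sample this yields 48000 slot-ms, 95 completed units, and a maximum of 80 parallel inputs.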

Waiting time of SUMO

I am using SUMO for traffic signal control, and want to optimize the signal phases to minimize some objectives. During the process, I use the traci module to read out the state of the traffic junction. The confusing part is traci.lane.getWaitingTime.
I don't know how the waiting time is calculated, and when I compare it with the output of two detectors, the value seems too large.
Can someone explain how the waiting time is calculated in SUMO?

The waiting time essentially counts the number of seconds a vehicle has a speed of less than 0.1 m/s. In the case of traci.lane this means it is the number of (nearly) standing vehicles multiplied by the time step length (since traci.lane returns the values for the last step).
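The arithmetic described above can be written out as a small Python sketch. It does not call SUMO or traci; the function name and the sample speeds are hypothetical, and it only mirrors the description in this answer rather than SUMO's actual implementation.

```python
def lane_waiting_time_last_step(speeds, step_length=1.0, threshold=0.1):
    """Waiting time of one lane for the last step, per the description above.

    speeds: speeds (m/s) of the vehicles currently on the lane.
    step_length: simulation time step in seconds.
    threshold: speed below which a vehicle counts as standing.
    """
    standing = sum(1 for v in speeds if v < threshold)
    return standing * step_length


# Hypothetical lane with four vehicles, two of them (nearly) standing.
waiting = lane_waiting_time_last_step([0.0, 0.05, 3.2, 0.1], step_length=1.0)
```

Here two vehicles are below 0.1 m/s, so the value for the step is 2 * 1.0 = 2.0 seconds.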

Compensating for laggy positive feedback

I'm trying to make a program run as accurately as possible while staying at a fixed frame rate. How do you do this?
Formally, I have some parameter b in [0,1] that I can set to determine how accurate my computations are (where 0 is least accurate, 0.5 is fairly accurate, and 1 is very accurate). The higher this is, the lower frame rate I will get.
However, there is a "lag", where after changing this parameter, the frame rate won't change until d milliseconds afterwards, where d can vary and is unknown.
Is there a way to change this parameter in a way that prevents "wiggling"? The problem is that if I am experiencing a low frame rate, if I increase the parameter then measure again, it will only be slightly higher, so I will need to increase it more, and then the framerate will be too slow, so I need to decrease the parameter, and I get this oscillating behavior. Is there a way to prevent this? I need to be as reactive as possible in doing this, because changing too slowly will cause the framerate to be incorrect for too long.

Looks like you need an adaptive feedback dampener. Trying an electrical circuit analogy :)
I'd first try to get more info about what the circuit's input signal and responsiveness look like. So I'd first make the algorithm update b not with the desired values but with the previous value plus or minus (as needed, towards the desired value) a small fixed increment, say .01 (ignore the sloppy response time for now). While doing so I'd collect and plot/analyze the "desired" b values, looking for:
the general shape of the changes: smooth or rather "steppy" or "spiky"? (spiky would require a stronger dampening to prevent oscillations, steppy would require a weaker dampening to prevent lagging)
the maximum/typical/minimum changes in values from sample to sample
the distribution of the changes in values from sample to sample (I'd plan the algorithm to react best for changes in a typical range, say 20-80% range and consider acceptable lagging for changes higher than that or oscillations for values lower than that)
The end goal is to be able to obtain parameters for operating alternatively in 2 modes:
a high-speed tracking mode (also the system's initial mode)
a normal tracking mode
In high-speed tracking mode the b value updates can be either:
not dampened - the update value is the full desired value - only if the changes shape is not spiky and only in the 1st b update after entering the high-speed tracking mode. This would help reduce lagging.
dampened - the update delta is just a fraction (dampening factor) of the desired delta and reflects the fact that the effect of the previous b value update might not be completely reflected in the current frame rate due to d. Dampening helps preventing oscillations at the expense of potentially increasing lag (always conflicting requirements). The dampening factor would be higher for a smooth shape and smaller for a spiky shape.
Switching from high-speed tracking mode to normal tracking mode can be done when the delta between b's previous value and its desired value falls below a certain mode change threshold value (eventually maintained for a minimum number of consecutive samples). The mode change threshold value would be initially estimated from the info collected above and/or empirically determined.
In normal tracking mode the delta between b's previous value and its desired value remains below the mode change threshold value and is either ignored (no b update) or applied as an update, using either the desired value or some average one: tiny course corrections, keeping the frame rate practically constant, no dampening, no lagging, no oscillations.
When in normal tracking mode the delta between b's previous value and its desired value goes above the mode change threshold value the system switches again to the high-speed tracking mode.
I would also try to get a general idea of what the d response time looks like. To do that I'd change the algorithm to update b with the desired values not at every iteration, but every n iterations (maybe even retrying for several values of n). This should indicate how many sample periods a b value change generally takes to become fully effective, and should be reflected in the dampening factor: the longer it takes for a change to take effect, the stronger the dampening should be to prevent oscillations.
Of course, this is just the general idea, a lot of experimental trial/adjustment iterations may be required to reach a satisfactory solution.
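To make the two-mode idea a bit more tangible, here is a minimal Python sketch. Everything in it is an assumption for illustration: the class name, the default dampening factor, threshold and sample count, and the choice to always dampen in high-speed mode and to apply the desired value directly in normal mode. How the "desired" b is estimated from the measured frame rate is left to the caller.

```python
class DampedController:
    """Two-mode update of a parameter b in [0, 1] (illustrative sketch)."""

    def __init__(self, damping=0.3, threshold=0.05, hold=3):
        self.damping = damping      # fraction of the desired delta applied
        self.threshold = threshold  # mode change threshold on |desired - b|
        self.hold = hold            # consecutive small deltas before "normal"
        self.mode = "high-speed"    # the system starts in high-speed tracking
        self.small_count = 0

    def update(self, b, desired):
        """Return the next value of b given its current and desired value."""
        delta = desired - b
        if abs(delta) > self.threshold:
            # Large error: (re-)enter high-speed mode, apply a dampened step.
            self.mode = "high-speed"
            self.small_count = 0
            b_new = b + self.damping * delta
        else:
            # Small error: switch to normal mode once it has persisted.
            self.small_count += 1
            if self.small_count >= self.hold:
                self.mode = "normal"
            step = delta if self.mode == "normal" else self.damping * delta
            b_new = b + step
        return min(1.0, max(0.0, b_new))  # keep b inside [0, 1]


# Hypothetical run: one large correction, then two small ones.
ctrl = DampedController(damping=0.5, threshold=0.1, hold=2)
b1 = ctrl.update(0.2, 0.8)     # dampened: 0.2 + 0.5 * 0.6
b2 = ctrl.update(b1, 0.55)     # small delta, still high-speed
b3 = ctrl.update(b2, 0.55)     # second small delta, switches to normal
```

The dampening factor, threshold and hold count are exactly the parameters the measurements described above are meant to estimate, so the defaults here should be treated as placeholders.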

Optimizing Parameters using AI technique

I know that my question is general, but I'm new to AI area.
I have an experiment with some parameters (about 6). Each of them is independent, and I want to find the optimal solution that maximizes or minimizes the output function. However, if I do it with a traditional programming technique it will take a long time, since I would use six nested loops.
I just want to know which AI technique to use for this problem? Genetic Algorithm? Neural Network? Machine learning?
Update
Actually, the problem could have more than one evaluation function.
It will have one function that we should minimize (Cost)
and another function that we want to maximize (Capacity).
Other functions may be added.
Example:
Constructing a glass window can be done in a million ways. However, we want the strongest window with the lowest cost. There are many parameters that affect the pressure capacity of the window, such as the strength of the glass, the height and width, and the slope of the window.
Obviously, if we go to extreme cases (Largest strength glass, with smallest width and height, and zero slope) the window will be extremely strong. However, the cost for that will be very high.
I want to study the interaction between the parameters in specific range.

Without knowing much about the specific problem it sounds like Genetic Algorithms would be ideal. They've been used a lot for parameter optimisation and have often given good results. Personally, I've used them to narrow parameter ranges for edge detection techniques with about 15 variables and they did a decent job.
Having multiple evaluation functions needn't be a problem if you code this into the Genetic Algorithm's fitness function. I'd look up multi objective optimisation with genetic algorithms.
I'd start here: Multi-Objective optimization using genetic algorithms: A tutorial

First of all if you have multiple competing targets the problem is confused.
You have to find a single value that you want to maximize... for example:
value = strength - k*cost
or
value = strength / (k1 + k2*cost)
In both for a fixed strength the lower cost wins and for a fixed cost the higher strength wins but you have a formula to be able to decide if a given solution is better or worse than another. If you don't do this how can you decide if a solution is better than another that is cheaper but weaker?
In some cases a correctly defined value requires a more complex function... for example for strength the value could increase up to a certain point (i.e. having a result stronger than a prescribed amount is just pointless) or a cost could have a cap (because higher than a certain amount a solution is not interesting because it would place the final price out of the market).
Once you find the criteria if the parameters are independent a very simple approach that in my experience is still decent is:
1. pick a random solution by choosing n random values, one for each parameter within the allowed boundaries
2. compute target value for this starting point
3. pick a random number 1 <= k <= n and for each of k parameters randomly chosen from the n compute a random signed increment and change the parameter by that amount.
4. compute the new target value from the translated solution
5. if the new value is better keep the new position, otherwise revert to the original one.
6. repeat from 3 until you run out of time.
Depending on the target function, some random distributions work better than others, and it may be that the optimal choice differs from parameter to parameter.
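The procedure above can be sketched in Python roughly as follows. This is a minimal sketch under assumptions: the function name, the uniform increment scaled to each parameter's range, the fixed iteration budget standing in for "until you run out of time", and the toy target function are all hypothetical choices made for the example.

```python
import random


def random_local_search(target, bounds, iterations=2000, step=0.1, seed=0):
    """Maximize target(params) by random perturbation, keeping improvements."""
    rng = random.Random(seed)
    n = len(bounds)
    # Pick a random starting solution inside the allowed boundaries.
    current = [rng.uniform(lo, hi) for lo, hi in bounds]
    # Compute the target value for the starting point.
    best = target(current)
    for _ in range(iterations):
        # Perturb k randomly chosen parameters by a random signed increment.
        k = rng.randint(1, n)
        candidate = list(current)
        for i in rng.sample(range(n), k):
            lo, hi = bounds[i]
            delta = rng.uniform(-step, step) * (hi - lo)
            candidate[i] = min(hi, max(lo, candidate[i] + delta))
        # Evaluate the translated solution.
        value = target(candidate)
        # Keep the new position only if it is better.
        if value > best:
            current, best = candidate, value
    return current, best


# Toy target (assumption): a single value to maximize, peaking at (1, -2).
def toy_value(p):
    return -((p[0] - 1.0) ** 2) - (p[1] + 2.0) ** 2


search_bounds = [(-5.0, 5.0), (-5.0, 5.0)]
solution, value = random_local_search(toy_value, search_bounds)
```

For a real problem, `toy_value` would be replaced by a combined value such as strength - k*cost, and the uniform increment could be swapped for another distribution per parameter.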

Some time ago I wrote some C++ code for solving optimization problems using Genetic Algorithms. Here it is: http://create-technology.blogspot.ro/2015/03/a-genetic-algorithm-for-solving.html
It should be very easy to follow.
