Last week I read a paper suggesting Markov decision processes (MDPs) as an alternative approach to recommender systems.
The core of the paper was a representation of the recommendation process as an MDP, i.e. in terms of states, actions, transition probabilities, a reward function, and so on.
If we assume a single-user system for simplicity, then states are k-tuples (x1, x2, ..., xk), where the last element xk represents the most recent item purchased by the user.
For example, suppose our current state is (x1, x2, x3), which means the user purchased x1, then x2, then x3, in chronological order. If the user now purchases x4, the new state will be (x2, x3, x4).
The paper suggests that these state transitions are triggered by actions, where an action is "recommending an item x_i to the user". The problem is that such an action may lead to more than one state.
For example, if our current state is (x1, x2, x3) and the action is "recommend x4 to the user", then there are two possible outcomes:
the user accepts the recommendation of x4, and the new state will be (x2, x3, x4)
the user ignores the recommendation of x4 (i.e. buys something else), and the new state will be some state (x2, x3, xi) where xi != x4
My question is: does an MDP actually allow the same action to lead to two or more different states?
UPDATE. I think the actions should be formulated as "gets recommendation of item x_i and accepts it" and "gets recommendation of item x_i and rejects it" rather than simply "gets recommendation of item x_i".
Based on this Wikipedia article, yes, it does.
I'm no expert on this, as I only just looked up the concept, but it looks as though the set of states and the set of actions have no inherent relation. Thus, multiple states can be linked to any action (or not linked) and vice versa. Therefore, an action can lead to two or more different states, and there will be a specific probability for each outcome.
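To make this concrete, here is a minimal Python sketch of a transition function for the recommender example from the question. The catalogue, the acceptance probability of 0.3, and the uniform choice among the other items are made-up assumptions for illustration, not something taken from the paper.

```python
def next_state(state, item):
    """Slide the purchase window: drop the oldest item, append the newest."""
    return state[1:] + (item,)


def transition_probs(state, recommended, catalogue, p_accept=0.3):
    """Return {next_state: probability} for the action 'recommend `recommended`'.

    Assumption (illustrative only): the user buys the recommended item with
    probability p_accept and otherwise picks uniformly among the other items.
    """
    others = [item for item in catalogue if item != recommended]
    probs = {next_state(state, recommended): p_accept}
    for item in others:
        probs[next_state(state, item)] = (1.0 - p_accept) / len(others)
    return probs


catalogue = ["x1", "x2", "x3", "x4", "x5"]
probs = transition_probs(("x1", "x2", "x3"), "x4", catalogue)
for state, p in probs.items():
    print(state, round(p, 3))
```

One action ("recommend x4") maps to five possible next states here, and the probabilities sum to 1, which is exactly what the transition function of an MDP describes.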
Note that, in your example, you may have to have a set of all possible states (which seems as though it could be infinite). Further, based on what I'm reading, your states perhaps shouldn't record past history. It seems as though you could record history by keeping a record of the chain itself: instead of (x1, x2, x3, xi) as a state, you'd have something more like (x1) -> (x2) -> (x3) -> (xi), i.e. four states linked by actions. (Sorry about the notation. I hope that the concept makes sense.) This way, your state represents the choice of purchase (and the state set is therefore finite).
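Sure. In an MDP an action only determines a probability distribution over next states, so the same action can lead to different states. (Separately, the policy itself can also be randomized.) If you want to evaluate the reward of a certain policy, you have to take the expectation over these probability distributions.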
The following reference may be of interest: Puterman, Martin L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
If I remember correctly, it is proven that there is a deterministic policy that gives the optimal reward for any MDP with a finite discrete state space and action space (and possibly some other conditions). While there may be randomized policies that give the same reward, we can thus restrict the search to the set of deterministic policies.
I am given data consisting of N sequences of variable length, each with hidden variables and their corresponding observed variables (i.e., I have both the hidden variables and the observed variables for each sequence).
Is there a way to find the order K of the "best" HMM model for this data, without exhaustive search? (justified heuristics are also legitimate).
I think there may be a confusion about the word "order":
A first-order HMM is an HMM whose transition matrix depends only on the previous state. A 2nd-order HMM is an HMM whose transition matrix depends only on the 2 previous states, and so on. As the order increases, the theory gets "thicker" (i.e., the equations), and very few mainstream libraries implement such complex models.
A search in your favorite search engine with the keywords "second-order HMM" will bring you to meaningful readings about these models.
If by order you mean the number of states, and assuming that a single distribution is assigned to each state (i.e., you do not use HMMs with mixtures of distributions), then the only hyperparameter you need to tune is indeed the number of states.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion, the Akaike Information Criterion, or the Minimum Message Length Criterion, which are based on computations of the model's likelihood. Usually, using these criteria requires training multiple models in order to obtain likelihood results that can be meaningfully compared.
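As a rough sketch of how such a comparison could look in Python: the formulas for AIC and BIC are standard, but the log-likelihood values below are made up, and the parameter count assumes a discrete-emission HMM with one categorical distribution per state. In practice the log-likelihoods would come from the models you trained.

```python
import math


def hmm_param_count(n_states, n_symbols):
    """Free parameters of a discrete-emission HMM: initial, transition, emission."""
    initial = n_states - 1
    transitions = n_states * (n_states - 1)
    emissions = n_states * (n_symbols - 1)
    return initial + transitions + emissions


def aic(log_likelihood, n_params):
    """Akaike Information Criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_observations):
    """Bayesian Information Criterion (lower is better)."""
    return n_params * math.log(n_observations) - 2 * log_likelihood


# Hypothetical log-likelihoods of models trained with 2, 3 and 4 states
# on 1000 observations over 5 symbols (made-up numbers for illustration).
log_likelihoods = {2: -1450.0, 3: -1400.0, 4: -1395.0}
scores = {
    k: bic(ll, hmm_param_count(k, 5), 1000)
    for k, ll in log_likelihoods.items()
}
best_k = min(scores, key=scores.get)
print(best_k, scores)
```

With these invented numbers the 4-state model has the highest likelihood, but its extra parameters are penalized enough that the 3-state model gets the lowest BIC.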
If you just want to get a rough idea of a good K value that may not be optimal, k-means clustering combined with the percentage of variance explained can do the trick: if X clusters explain more than, let's say, 90% of the variance of the observations in your training set, then going with an X-state HMM is a good start. The three criteria mentioned above are interesting because they include a penalty term that grows with the number of parameters of the model and can therefore prevent some overfitting.
These criteria can also be applied when one uses mixture-based HMMs, in which case there are more hyperparameters to tune (i.e., the number of states and the number of components of the mixture models).
I'm trying to build a regression-based ML model using TensorFlow.
I am trying to estimate an object's ETA based on the following:
distance from target
distance from target (X component)
distance from target (Y component)
speed
The object travels on specific journeys. This could be represented as from A->B or from A->C or from D->F (POINT 1 -> POINT 2). There are 500 specific journeys (between a set of points).
These journeys aren't completely straight lines, and every journey is different (i.e. the shape of the route taken).
I have two ways of getting around this problem:
I can have 500 different models with 4 features and one label (the training ETA data).
I can have 1 model with 5 features and one label.
My dilemma is that if I use option 1, that's added complexity, but will be more accurate as every model will be specific to each journey.
If I use option 2, the model will be pretty simple, but I don't know if it would work properly. The new feature that I would add is originCode + destinationCode. Unfortunately, these codes are not quantities that make any numerical sense or follow a pattern; they're just text that defines the journey (for journey A->B, the feature would be 'AB').
Is there some way that I can use one model and make one feature a categorical 'grouping' feature (in order to separate the training data with respect to the journey)?
In ML, I believe that option 2 is generally the better option. We prefer general models rather than tailoring many models to specific tasks, as that gets dangerously close to hardcoding, which is what we're trying to get away from by using ML!
I think that, depending on the training data you have available and the model size, a one-hot vector could be used to describe the start and end points for the model. E.g., say we have 5 points (ABCDE) and we are going from position B to position C; this could be represented by the vector:
0100000100
as in, the first five values correspond to the origin spot whereas the second five are the destination. It is also possible to combine these if you want to reduce your input feature space to:
01100
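A minimal Python sketch of both encodings, using the five points from the example (the function names are just illustrative). Note that the combined form no longer distinguishes B->C from C->B.

```python
def one_hot(point, points):
    """Return a list with a 1 at the index of `point` and 0 elsewhere."""
    return [1 if p == point else 0 for p in points]


def encode_journey(origin, destination, points):
    """Concatenate the one-hot vectors for origin and destination."""
    return one_hot(origin, points) + one_hot(destination, points)


def encode_journey_combined(origin, destination, points):
    """Single vector marking both endpoints; the direction of travel is lost."""
    return [1 if p in (origin, destination) else 0 for p in points]


points = ["A", "B", "C", "D", "E"]
print("".join(map(str, encode_journey("B", "C", points))))           # 0100000100
print("".join(map(str, encode_journey_combined("B", "C", points))))  # 01100
```

These vectors would simply be appended to the four numeric features (distance, X component, Y component, speed) before feeding them to the model.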
There are other things to consider, as Scott has said in the comments:
How much data do you have? Maybe the feature space will be too big this way, I can't be sure. If you have enough data, then the model will intuitively learn the general distances (not actually, but intrinsically in the data) between datapoints.
If you have enough data, you might even be able to accurately predict between two points you don't have data for!
If it does come down to not having enough data, then finding representative features of the journey will come into use, i.e. length of journey, shape of the journey, elevation travelled, etc. A metric for distance travelled from the origin could also be useful.
Best of luck!
I would be inclined to lean toward individual models. This is because, for a given position along a given route and a constant speed, the ETA is a deterministic function of time. If one moves monotonically closer to the target along the route, it is also a deterministic function of distance to target. Thus, there is no information to transfer from one route to the next, i.e. "lumping" their parameters offers no a priori benefit. This is assuming, of course, that you have several "trips" worth of data along each route (i.e. (distance, speed) collected once per minute, or some such). If you have only, say, one datum per route then lumping the parameters is a must. However, in such a low-data scenario, I believe that including a dummy variable for "which route" would ultimately be fruitless, since that would introduce a number of parameters that rivals the size of your dataset.
As a side note, NEITHER of the models you describe could handle new routes. I would be inclined to build an individual model per route, data quantity permitting, and a single model neglecting the route identity entirely just for handling new routes, until sufficient data is available to build a model for that route.
In my study, a person is represented as a pair of real numbers (x, y), where x is in [30, 80] and y is in [60, 120]. There are two types of people, A and B. I have ~300 of each type. How can I generate the largest (or even a large) set of pairs of one person from A with one from B, ((xA, yA), (xB, yB)), such that each pair of points is close? Two points are close if abs(x1 - x2) < dX and abs(y1 - y2) < dY. Similar constraints are acceptable. (That is, this constraint is roughly a Manhattan metric, but Euclidean etc. is OK too.) Not all points need be used, but no point can be reused.
You're looking for the Hungarian Algorithm.
Suggested formulation: A are rows, B are columns, and each cell (i, j) contains a distance metric between Ai and Bj, e.g. abs(X(Ai)-X(Bj)) + abs(Y(Ai)-Y(Bj)). (You can normalize the X and Y values to [0,1] if you want distances to be proportional to the range of each variable.)
Then use the Hungarian Algorithm to minimize matching weight.
You can filter out matches with distances over your threshold. If you're worried that this filtering might cause the approach to be sub-optimal, you could set distances over your threshold to a very high number.
There are many implementations of this algorithm. A short search found implementations in nearly every language I could think of, including VBA for Excel, as well as some online solvers (I'm not sure they can handle a 300x300 matrix, though).
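If only the number of close pairs matters, a simpler alternative can be sketched in plain Python. To be clear, this is not the Hungarian algorithm: it is an augmenting-path maximum bipartite matching (often called Kuhn's algorithm) on the graph whose edges are the "close" pairs from the question. It maximizes how many pairs are formed but makes no attempt to minimize the total distance. The function name and the sample points are made up for illustration.

```python
def close_pairs_matching(points_a, points_b, dx, dy):
    """Maximum-cardinality matching between A and B using only 'close' pairs.

    Returns a list of (index_in_a, index_in_b) tuples; no point is reused.
    """
    # Adjacency list: for each A point, the B points within the thresholds.
    neighbours = [
        [j for j, (xb, yb) in enumerate(points_b)
         if abs(xa - xb) < dx and abs(ya - yb) < dy]
        for (xa, ya) in points_a
    ]
    match_of_b = [None] * len(points_b)  # match_of_b[j] = index of A matched to B_j

    def try_assign(i, visited):
        """Try to match A_i, re-routing earlier matches along an augmenting path."""
        for j in neighbours[i]:
            if j in visited:
                continue
            visited.add(j)
            if match_of_b[j] is None or try_assign(match_of_b[j], visited):
                match_of_b[j] = i
                return True
        return False

    for i in range(len(points_a)):
        try_assign(i, set())

    return [(i, j) for j, i in enumerate(match_of_b) if i is not None]


# Tiny made-up example: A0 is close to both B points, A1 only to B0.
a = [(40.0, 80.0), (41.0, 81.0)]
b = [(41.5, 81.5), (39.0, 79.0)]
pairs = close_pairs_matching(a, b, dx=2.0, dy=2.0)
print(sorted(pairs))
```

In the example, A0 is first paired with B0, then re-routed to B1 so that A1 can take B0, which yields two pairs instead of one. The recursion depth is bounded by the number of A points, so ~300 points per group stays well within Python's default recursion limit.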
Hungarian algorithm did it, thanks Etov.
Source code available here: http://www.filedropper.com/stackoverflow1
What is the difference between Markov chain models and hidden Markov models? I've read about them on Wikipedia, but couldn't understand the differences.
To explain, I'll use an example from natural language processing. Imagine you want to know the probability of this sentence:
I enjoy coffee
In a Markov model, you could estimate its probability by calculating:
P(WORD = I) x P(WORD = enjoy | PREVIOUS_WORD = I) x P(WORD = coffee | PREVIOUS_WORD = enjoy)
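The product above can be sketched in a few lines of Python. The probability values are invented for illustration; in practice they would be estimated from word and word-pair counts in a corpus.

```python
# Hypothetical probabilities, made up for illustration (not estimated from data).
initial = {"I": 0.2}
bigram = {("I", "enjoy"): 0.1, ("enjoy", "coffee"): 0.05}


def sentence_probability(words, initial, bigram):
    """P(w1) times the product of P(w_t | w_{t-1}); unseen events get probability 0."""
    prob = initial.get(words[0], 0.0)
    for prev, word in zip(words, words[1:]):
        prob *= bigram.get((prev, word), 0.0)
    return prob


p = sentence_probability(["I", "enjoy", "coffee"], initial, bigram)
print(p)  # 0.2 * 0.1 * 0.05, i.e. about 0.001
```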
Now, imagine we wanted to know the parts-of-speech tags of this sentence, that is, if a word is a past tense verb, a noun, etc.
We did not observe any parts-of-speech tags in that sentence, but we assume they are there. Thus, we calculate the probability of the parts-of-speech tag sequence. In our case, the actual sequence is:
PRP-VBP-NN
(where PRP=“Personal Pronoun”, VBP=“Verb, non-3rd person singular present”, NN=“Noun, singular or mass”. See https://cs.nyu.edu/grishman/jet/guide/PennPOS.html for complete notation of Penn POS tagging)
But wait! This is a sequence that we can apply a Markov model to. But we call it hidden, since the parts-of-speech sequence is never directly observed. Of course in practice, we will calculate many such sequences and we'd like to find the hidden sequence that best explains our observation (e.g. we are more likely to see words such as 'the', 'this', generated from the determiner (DET) tag)
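Finding the hidden sequence that best explains the observation is usually done with the Viterbi algorithm. Below is a minimal Python sketch for the "I enjoy coffee" example. All probabilities are hypothetical and chosen only so the example works; the emission tables list just the words needed here, so their rows do not sum to 1.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under the HMM."""
    # best[t][tag] = (probability of the best path ending in tag at step t, previous tag)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None) for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][tag] * emit_p[tag].get(word, 0.0), p)
                for p in tags
            )
            column[tag] = (prob, prev)
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Hypothetical model parameters (made up for illustration).
tags = ["PRP", "VBP", "NN"]
start_p = {"PRP": 0.6, "VBP": 0.1, "NN": 0.3}
trans_p = {
    "PRP": {"PRP": 0.1, "VBP": 0.7, "NN": 0.2},
    "VBP": {"PRP": 0.2, "VBP": 0.1, "NN": 0.7},
    "NN": {"PRP": 0.2, "VBP": 0.4, "NN": 0.4},
}
emit_p = {
    "PRP": {"I": 0.9},
    "VBP": {"enjoy": 0.5, "coffee": 0.01},
    "NN": {"coffee": 0.6, "enjoy": 0.01},
}

print(viterbi(["I", "enjoy", "coffee"], tags, start_p, trans_p, emit_p))
# ['PRP', 'VBP', 'NN']
```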
The best explanation I have ever encountered is in a paper from 1989 by Lawrence R. Rabiner: http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
A Markov model is a state machine in which the state changes are probabilistic. In a hidden Markov model, you cannot observe the states themselves, but you can observe outcomes that depend on them.
For example, when you flip a coin, you can get the probabilities, but, if you couldn't see the flips and someone moves one of five fingers with each coin flip, you could take the finger movements and use a hidden Markov model to get the best guess of coin flips.
As I understand it, the question is: what is the difference between a Markov Process and a Hidden Markov Process?
A Markov Process (MP) is a stochastic Process with:
Finite number of states
Probabilistic transitions between these states
Next state determined only by the current state (Markov property)
A Hidden Markov Process (HMM) is also a stochastic Process with:
Finite number of states
Probabilistic transitions between these states
Next state determined only by the current state (Markov property) AND
We’re unsure which state we’re in: The current state emits an observation.
Example - (HMM) Stock Market:
In the Stock Market, people trade with the value of the firm. Let's assume that the real value of the share is $100 (this is unobservable, and in fact you never know it). What you really see is then the value it is traded with: let's assume in this case $90 (this is observable).
For people interested in Markov: The interesting part is when you start taking actions on these models (in the previous example, to gain money). This goes to Markov Decision Processes (MDP) and Partially Observable Markov Decision Processes (POMDPs). To assess a general classification of these models, I have summarized in the following picture the main characteristics of each Markov Model.
Since Matt used parts-of-speech tags as an HMM example, I could add one more example: Speech Recognition. Almost all large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs.
"Matt's example":
I enjoy coffee
In a Markov model, you could estimate its probability by calculating:
P(WORD = I) x P(WORD = enjoy | PREVIOUS_WORD = I) x P(WORD = coffee | PREVIOUS_WORD = enjoy)
In a Hidden Markov Model,
Let's say 30 different people read the sentence "I enjoy hugging" and we have to recognize it.
Every person will pronounce this sentence differently. So we do NOT know whether the person meant "hugging" or "hogging". We will only have a probability distribution over the actual word.
In short, a hidden Markov model is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states.
A hidden Markov model is a doubly embedded stochastic process with two levels.
The upper level is a Markov process whose states are unobservable.
The observation is a probabilistic function of the upper-level Markov states.
Different Markov states have different observation probability functions.
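The two levels can be illustrated with a small generative sketch in Python: the upper level walks through hidden states, and each hidden state emits an observation from its own distribution. The state names, observation names, and probabilities are invented for illustration.

```python
import random


def sample_hmm(n_steps, start_p, trans_p, emit_p, seed=0):
    """Generate (hidden_states, observations) from a two-level HMM."""
    rng = random.Random(seed)

    def draw(dist):
        # Pick one key of `dist` with probability proportional to its value.
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]

    states, observations = [], []
    state = draw(start_p)
    for _ in range(n_steps):
        states.append(state)
        observations.append(draw(emit_p[state]))  # lower level: emission
        state = draw(trans_p[state])              # upper level: Markov step
    return states, observations


# Hypothetical parameters (made up for illustration).
start_p = {"bull": 0.5, "bear": 0.5}
trans_p = {"bull": {"bull": 0.9, "bear": 0.1}, "bear": {"bull": 0.2, "bear": 0.8}}
emit_p = {"bull": {"up": 0.7, "down": 0.3}, "bear": {"up": 0.4, "down": 0.6}}

hidden, observed = sample_hmm(10, start_p, trans_p, emit_p, seed=1)
print(hidden)    # what an HMM user never sees
print(observed)  # what an HMM user actually sees
```

Someone working with an HMM would only be handed the second list and would have to infer the first.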
I am using simulated annealing to solve an NP-complete resource scheduling problem. For each candidate ordering of the tasks I compute several different costs (or energy values). Some examples are (though the specifics are probably irrelevant to the question):
global_finish_time: The total number of days that the schedule spans.
split_cost: The number of days by which each task is delayed due to interruptions by other tasks (this is meant to discourage interruption of a task once it has started).
deadline_cost: The sum of the squared number of days by which each missed deadline is overdue.
The traditional acceptance probability function looks like this (in Python):
import math

def acceptance_probability(old_cost, new_cost, temperature):
    if new_cost < old_cost:
        return 1.0
    else:
        # Worse candidates are accepted with a probability that decays with the cost increase.
        return math.exp((old_cost - new_cost) / temperature)
So far I have combined my first two costs into one by simply adding them, so that I can feed the result into acceptance_probability. But what I would really want is for deadline_cost to always take precedence over global_finish_time, and for global_finish_time to take precedence over split_cost.
So my question to Stack Overflow is: how can I design an acceptance probability function that takes multiple energies into account but always considers the first energy to be more important than the second energy, and so on? In other words, I would like to pass in old_cost and new_cost as tuples of several costs and get back a sensible value.
Edit: After a few days of experimenting with the proposed solutions I have concluded that the only way that works well enough for me is Mike Dunlavey's suggestion, even though this creates many other difficulties with cost components that have different units. I am practically forced to compare apples with oranges.
So, I put some effort into "normalizing" the values. First, deadline_cost is a sum of squares, so it grows quadratically while the other components grow linearly. To address this I use the square root to get a similar growth rate. Second, I developed a function that computes a linear combination of the costs, but auto-adjusts the coefficients according to the highest cost component seen so far.
For example, if the tuple of highest costs is (A, B, C) and the input cost vector is (x, y, z), the linear combination is BCx + Cy + z. That way, no matter how high z gets it will never be more important than an x value of 1.
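A minimal Python sketch of that combination, as I read the description: each component is weighted by the product of the maxima of all less important components. The function names and the sample numbers are illustrative, not the code actually used.

```python
def combined_cost(cost, max_cost):
    """Weight each component by the product of the maxima of all later components.

    For cost = (x, y, z) and max_cost = (A, B, C) this returns B*C*x + C*y + z.
    """
    total = 0.0
    weight = 1.0
    # Walk from the least important component to the most important one.
    for value, maximum in zip(reversed(cost), reversed(max_cost)):
        total += weight * value
        weight *= maximum
    return total


def update_max(max_cost, cost):
    """Track the highest value seen so far for each component."""
    return tuple(max(m, c) for m, c in zip(max_cost, cost))


max_cost = update_max((10.0, 20.0, 5.0), (1.0, 2.0, 3.0))
print(combined_cost((1.0, 2.0, 3.0), max_cost))  # 20*5*1 + 5*2 + 3 = 113.0
```

The scalar result can then be fed into acceptance_probability as before.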
This creates "jaggies" in the cost function as new maximum costs are discovered. For example, if C goes up then BCx and Cy will both be higher for a given (x, y, z) input and so will differences between costs. A higher cost difference means that the acceptance probability will drop, as if the temperature was suddenly lowered an extra step. In practice though this is not a problem because the maximum costs are updated only a few times in the beginning and do not change later. I believe this could even be theoretically proven to converge to a correct result since we know that the cost will converge toward a lower value.
One thing that still has me somewhat confused is what happens when the maximum costs are 1.0 or lower, say 0.5. With a maximum vector of (0.5, 0.5, 0.5) this would give the linear combination 0.5*0.5*x + 0.5*y + z, i.e. the order of precedence is suddenly reversed. I suppose the best way to deal with it is to use the maximum vector to scale all values to given ranges, so that the coefficients can always be the same (say, 100x + 10y + z). But I haven't tried that yet.
mbeckish is right.
Could you make a linear combination of the different energies, and adjust the coefficients?
Possibly log-transforming them in and out?
I've done some MCMC using Metropolis-Hastings. In that case I define the (non-normalized) log-likelihood of a particular state (given its priors), and I find that to be a way to clarify my thinking about what I want.
I would take a hint from multi-objective evolutionary algorithm (MOEA) and have it transition if all of the objectives simultaneously pass with the acceptance_probability function you gave. This will have the effect of exploring the Pareto front much like the standard simulated annealing explores plateaus of same-energy solutions.
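A rough Python sketch of this rule, assuming the same temperature is used for every objective (that choice, and the function name accept_all_objectives, are my assumptions). The acceptance_probability function is repeated from the question so the block runs on its own.

```python
import math
import random


def acceptance_probability(old_cost, new_cost, temperature):
    if new_cost < old_cost:
        return 1.0
    return math.exp((old_cost - new_cost) / temperature)


def accept_all_objectives(old_costs, new_costs, temperature, rng=random):
    """Accept only if every objective independently passes its own test."""
    return all(
        rng.random() < acceptance_probability(old, new, temperature)
        for old, new in zip(old_costs, new_costs)
    )


# A candidate that improves every objective is always accepted.
print(accept_all_objectives((3.0, 3.0), (1.0, 2.0), temperature=1.0))  # True
```

Because several independent tests must all pass, the overall acceptance rate is lower than with a single cost, which is one reason a higher initial temperature may be needed.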
However, this does give up on the idea of having the first one take priority.
You will probably have to tweak your parameters, such as giving it a higher initial temperature.
I would consider something along the lines of:
if (new deadline_cost > old deadline_cost)
    return (calculate probability)
else if (new global_finish_time > old global_finish_time)
    return (calculate probability)
else if (new split_cost > old split_cost)
    return (calculate probability)
else
    return (1.0)
Of course each of the three places you calculate the probability could use a different function.
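A runnable Python version of this idea could look like the sketch below. Using the usual exponential formula with a separate temperature per component is my assumption, standing in for "a different function" at each place.

```python
import math


def lexicographic_acceptance(old_costs, new_costs, temperatures):
    """Costs are tuples ordered from most to least important.

    The first component that got worse decides the probability;
    if no component got worse, the move is always accepted.
    """
    for old, new, temperature in zip(old_costs, new_costs, temperatures):
        if new > old:
            return math.exp((old - new) / temperature)
    return 1.0


# (deadline_cost, global_finish_time, split_cost), made-up values.
old = (4.0, 10.0, 2.0)
new = (4.0, 12.0, 0.0)
p = lexicographic_acceptance(old, new, temperatures=(1.0, 1.0, 1.0))
print(p)  # exp(-2): the finish time got worse by 2, the split cost is ignored
```

Like the pseudocode, this checks the components in order for a worsening only, so a large improvement in a later component never compensates for a worsening in an earlier one.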
It depends on what you mean by "takes precedence".
For example, what if the deadline_cost goes down by 0.001, but the global_finish_time cost goes up by 10000? Do you return 1.0, because the deadline_cost decreased, and that takes precedence over anything else?
This seems like it is a judgment call that only you can make, unless you can provide enough background information on the project so that others can suggest their own informed judgment call.