This question concerns homogeneous discrete HMMs.
In a regular HMM, the probability of the current state depends only on the previous state, that is, Pr(S_t | S_1, S_2, ..., S_{t-1}) = Pr(S_t | S_{t-1}), and the probability of an output observation depends only on the current state, that is, Pr(O_t | O_1, ..., O_{t-1}, S_1, ..., S_t) = Pr(O_t | S_t). We can then use the Forward-Backward (Baum-Welch) algorithm to estimate the transition and emission probabilities.
My question is about the case where the current observation depends on both the current state and the previous observation, that is, Pr(O_t | O_1, ..., O_{t-1}, S_1, ..., S_t) = Pr(O_t | O_{t-1}, S_t). How do you train a model like that? I was thinking of using the same Baum-Welch algorithm, but instead of M emission probabilities for each state (representing the M possible outputs), there would be M×M: the emission probabilities for each state would form a square matrix whose rows represent the observation at the previous step and whose columns represent the observation at the current step.
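To make the idea concrete, here is a minimal numpy sketch of this parameterization with uniform made-up values (note that the first observation needs its own distribution Pr(O_1 | S_1), since it has no predecessor):

import numpy as np

N, M = 2, 3                       # N hidden states, M output symbols (hypothetical sizes)
A  = np.full((N, N), 1.0 / N)     # A[i, j]    = Pr(S_t = j | S_{t-1} = i)
B  = np.full((N, M, M), 1.0 / M)  # B[j, k, l] = Pr(O_t = l | O_{t-1} = k, S_t = j)
b0 = np.full((N, M), 1.0 / M)     # b0[j, l]   = Pr(O_1 = l | S_1 = j)
pi = np.full(N, 1.0 / N)          # initial state distribution

def likelihood(obs):
    # Forward pass: Pr(O_1, ..., O_T) under the conditional-emission model.
    alpha = pi * b0[:, obs[0]]
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t - 1], obs[t]]
    return alpha.sum()

print(likelihood([0, 2, 1]))      # toy observation sequence

The forward recursion (and hence Baum-Welch) goes through unchanged; the only difference is that the emission term at time t is looked up with both O_{t-1} and O_t.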
Is this valid? Any other ideas, or citations to papers addressing this problem? I searched for papers studying such a case, but unfortunately did not find any.
I'm analysing longitudinal panel data, in which individuals transition between different states in a Markov chain. I'm modelling the transition rates between states using a series of multinomial logistic regressions. This means that I end up with a very large number of regression slopes.
For each regression slope, I obtain a posterior distribution (using WinBUGS). From the posterior distribution, we get the mean, standard deviation, and 95% credible interval associated with the slope in question.
The value I am ultimately interested in is the expected first passage time ('hitting time') through the Markov chain. This is a function of all the different predictor variables, and so is built from the many regression slopes produced by the multinomial logistic regressions.
A simple approach would be to take the mean of each posterior distribution as a point-estimate for each regression slope, and solve for the expected first passage time at a series of different values of the predictor variables. I have now done this, but it is potentially misleading because it doesn't show the uncertainty around the predicted values of expected first passage time.
My question is: how can I calculate a credible interval for the expected first passage time?
My first thought was to approximate the error via simulation: sample individual values for the regression slopes from each posterior distribution, obtain the expected first passage time given those values, and then report the standard deviation of all these simulated values. However, I feel like (a) this would make a statistician scream, and (b) it doesn't take into account the fact that the different posterior distributions will be correlated (it samples from each one independently).
In WinBUGS, you can actually obtain the correlations between the posterior distributions. So if the simulation idea is appropriate, I could in theory simulate the regression slope coefficients incorporating these correlations.
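To sketch what I mean (hypothetical numbers, and a toy stand-in for the actual first-passage computation), I could draw correlated slope values from a multivariate normal approximation to the joint posterior, assembled from the means, standard deviations, and correlations WinBUGS reports, and push each draw through the calculation:

import numpy as np

post_mean = np.array([0.5, -1.2, 0.8])        # posterior means of three slopes (made up)
post_cov = np.array([[0.040, 0.010, 0.000],
                     [0.010, 0.090, 0.020],
                     [0.000, 0.020, 0.010]])  # assembled from the sds and correlations

def first_passage_time(slopes, x):
    # Toy stand-in: in reality, solve the chain's first-passage equations
    # given transition rates built from the regression slopes.
    rate = np.exp(slopes @ x)                 # e.g. a log-linear transition rate
    return 1.0 / rate

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(post_mean, post_cov, size=10000)
x = np.array([1.0, 0.0, 2.0])                 # one set of predictor values
fpt = np.array([first_passage_time(d, x) for d in draws])
print(np.percentile(fpt, [2.5, 97.5]))        # approximate 95% credible interval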
Is there a more direct and less approximate way to find the uncertainty? Could I, for instance, use WinBUGS to find the posterior distribution of the expected first passage time for a given set of values of the predictor variables? Rather like the answer to this question: define a new node and monitor it. I would imagine defining a series of new nodes, where each one is for a different set of actual predictor values, and monitoring each one. Does this make good statistical sense?
Any thoughts about this would be really appreciated!
I am trying to use PyMC to find a change point in a time series. The value I am looking at over time is the probability to "convert", which is very small: 0.009 on average, with a range of 0.001-0.016.
I give the two probabilities uniform priors between zero and the maximum observation.
import numpy as np
import pymc as pm  # PyMC 2.x

alpha = df.cnvrs.max()  # upper bound for the uniform priors

center_1_c = pm.Uniform("center_1_c", 0, alpha)  # conversion probability before the change point
center_2_c = pm.Uniform("center_2_c", 0, alpha)  # conversion probability after the change point
day_c = pm.DiscreteUniform("day_c", lower=1, upper=n_days)  # day of the change point

@pm.deterministic
def lambda_(day_c=day_c, center_1_c=center_1_c, center_2_c=center_2_c):
    # Piecewise-constant rate: center_1_c before day_c, center_2_c after.
    out = np.zeros(n_days)
    out[:day_c] = center_1_c
    out[day_c:] = center_2_c
    return out

observation = pm.Uniform("obs", lambda_, value=df.cnvrs.values, observed=True)
When I run this code I get:
ZeroProbability: Stochastic obs's value is outside its support,
or it forbids its parents' current values.
I'm pretty new to PyMC, so I'm not sure whether I'm missing something obvious. My guess is that I might not have appropriate distributions for modelling small probabilities.
It's impossible to tell where you've introduced this bug without more of your output (and programming questions are off-topic here, in any case). But there is a statistical issue: you've somehow constructed a model that cannot produce either the observed variables or the current sample of latent ones.
To give a simple example, say you have a dataset with negative values and you've assumed it to be gamma distributed; this will produce an error, because the data have zero probability under a gamma distribution. Similarly, an error will be thrown if an impossible value is sampled during an MCMC chain.
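For instance, something like this minimal PyMC 2.x snippet (made-up numbers) fails for exactly that reason:

import pymc as pm

data = [1.3, 0.7, -2.1]  # the negative value is impossible under a gamma
x = pm.Gamma("x", alpha=2.0, beta=1.0, value=data, observed=True)
x.logp                   # raises ZeroProbability: the data have zero density here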
I'm new to the HMM universe. I've followed the tutorials using a GaussianHMM machine learner, and they work, but I was just wondering how I can use the code to display the probability of an observation given the most likely sequence, assuming I have multiple sequences of observations? Thanks.
So, for example, if the observations are:
seq1:[1,2,-1,4,2], seq2:[a,v,s,a,f], and the model has 2 states,
once the model predicts the states, how does one calculate the probability of an observed output [1], [a]?
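Assuming the tutorials use hmmlearn's GaussianHMM, one way to do this for the numeric sequence is to predict the state sequence and then evaluate the observation under the Gaussian of its predicted state (a sketch, not the only way):

import numpy as np
from hmmlearn.hmm import GaussianHMM
from scipy.stats import norm

X = np.array([[1.0], [2.0], [-1.0], [4.0], [2.0]])  # seq1 as a column of observations
model = GaussianHMM(n_components=2, covariance_type="diag", random_state=0)
model.fit(X)

states = model.predict(X)  # most likely (Viterbi) state sequence
s = states[0]              # state predicted for the first observation, 1
density = norm.pdf(1.0, loc=model.means_[s, 0],
                   scale=np.sqrt(model.covars_[s][0, 0]))
print(density)             # Gaussian density of observing 1 in that state

For the symbolic sequence (a, v, s, ...) the analogue is a discrete-emission HMM, where the probability is read directly from the fitted emission matrix rather than from a Gaussian density.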
I used to code my MCMC using C. But I'd like to give PyMC a try.
Suppose X_n is the underlying state, whose dynamics follow a Markov chain, and Y_n is the observed data. In particular,
Y_n has a Poisson distribution with mean depending on X_n and a multidimensional unknown parameter theta;
X_n | X_{n-1} has a distribution depending on theta.
How should I describe this model using PyMC?
Another question: I can find conjugate priors for theta but not for X_n. Is it possible to specify which posteriors are updated using conjugate priors and which using MCMC?
Here is an example of a state-space model in PyMC on the PyMC wiki. It basically involves populating a list and allowing PyMC to treat it as a container of PyMC nodes.
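A minimal sketch of that pattern for the model above (PyMC 2.x syntax; for illustration I assume a Gaussian random walk on the log of the Poisson mean, and made-up data):

import numpy as np
import pymc as pm

y = np.array([3, 5, 2, 8, 6])        # made-up observed counts
n = len(y)

theta = pm.Exponential("theta", 1.0) # unknown parameter governing the dynamics

# Populate a plain Python list; PyMC treats it as a container of nodes.
X = [pm.Normal("X_0", mu=0.0, tau=1.0)]
for t in range(1, n):
    X.append(pm.Normal("X_%d" % t, mu=X[t - 1], tau=theta))  # X_n | X_{n-1}

# Poisson observations whose mean depends on the latent state.
lam = [pm.Lambda("lam_%d" % t, lambda x=X[t]: np.exp(x)) for t in range(n)]
Y = [pm.Poisson("Y_%d" % t, mu=lam[t], value=y[t], observed=True) for t in range(n)]

M = pm.MCMC([theta] + X + lam + Y)
M.sample(iter=10000, burn=2000)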
As for the second part of the question, you could certainly calculate some of your conjugate posteriors ahead of time and put them into the model. For example, if you observed binomial data x=4, n=10, you could insert a Beta node p = Beta('p', 5, 7) to represent that posterior (it's really just a prior, as far as the model is concerned, but it is the posterior given the data x). Then PyMC would draw a sample from this posterior at every iteration, to be used wherever it is needed in the model.
What is the difference between Markov chain models and hidden Markov models? I've read about them on Wikipedia, but couldn't understand the differences.
To explain by example, I'll use one from natural language processing. Imagine you want to know the probability of this sentence:
I enjoy coffee
In a Markov model, you could estimate its probability by calculating:
P(WORD = I) × P(WORD = enjoy | PREVIOUS_WORD = I) × P(WORD = coffee | PREVIOUS_WORD = enjoy)
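With made-up probabilities, that computation is just a product of bigram terms:

# Hypothetical probabilities for illustration.
p_first = {"I": 0.2}
p_next = {("I", "enjoy"): 0.1, ("enjoy", "coffee"): 0.3}

sentence = ["I", "enjoy", "coffee"]
prob = p_first[sentence[0]]
for prev, word in zip(sentence, sentence[1:]):
    prob *= p_next[(prev, word)]
print(prob)  # 0.2 * 0.1 * 0.3 = 0.006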
Now, imagine we wanted to know the parts-of-speech tags of this sentence, that is, if a word is a past tense verb, a noun, etc.
We did not observe any parts-of-speech tags in that sentence, but we assume they are there. Thus, we calculate the probability of the parts-of-speech tag sequence. In our case, the actual sequence is:
PRP-VBP-NN
(where PRP=“Personal Pronoun”, VBP=“Verb, non-3rd person singular present”, NN=“Noun, singular or mass”. See https://cs.nyu.edu/grishman/jet/guide/PennPOS.html for complete notation of Penn POS tagging)
But wait! This is a sequence that we can apply a Markov model to. We call it hidden, since the parts-of-speech sequence is never directly observed. Of course, in practice, we will score many such sequences, and we'd like to find the hidden sequence that best explains our observation (e.g. we are more likely to see words such as 'the' and 'this' generated from the determiner (DET) tag), as in the sketch below.
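To contrast with the Markov model above, here is the same sentence scored under an HMM, where the hidden tag sequence carries the Markov structure and each tag emits a word (made-up probabilities):

# Hypothetical HMM parameters for the POS example.
p_start = {"PRP": 0.4}
p_trans = {("PRP", "VBP"): 0.3, ("VBP", "NN"): 0.25}
p_emit = {("PRP", "I"): 0.2, ("VBP", "enjoy"): 0.01, ("NN", "coffee"): 0.05}

tags = ["PRP", "VBP", "NN"]
words = ["I", "enjoy", "coffee"]

# Joint probability of the hidden tags and the observed words.
prob = p_start[tags[0]] * p_emit[(tags[0], words[0])]
for (t_prev, t), w in zip(zip(tags, tags[1:]), words[1:]):
    prob *= p_trans[(t_prev, t)] * p_emit[(t, w)]
print(prob)  # 0.4 * 0.2 * 0.3 * 0.01 * 0.25 * 0.05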
The best explanation I have ever encountered is in a paper from 1989 by Lawrence R. Rabiner: http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
A Markov model is a state machine in which the state changes are probabilistic. In a hidden Markov model, you don't observe the states directly; you only know the outcomes they produce.
For example, when you flip a coin you can observe the outcomes directly. But suppose you couldn't see the flips, and instead someone moved one of five fingers with each coin flip: you could take the finger movements and use a hidden Markov model to get the best guess of the coin flips.
As I understand it, the question is: what is the difference between a Markov Process and a Hidden Markov Process?
A Markov Process (MP) is a stochastic process with:
Finite number of states
Probabilistic transitions between these states
Next state determined only by the current state (Markov property)
A Hidden Markov Model (HMM) is also a stochastic process with:
Finite number of states
Probabilistic transitions between these states
Next state determined only by the current state (Markov property) AND
We do not observe the state directly: the current state emits an observation, and that observation is all we see.
Example - (HMM) Stock Market:
In the stock market, people trade based on the value of the firm. Let's assume that the real value of a share is $100 (this is unobservable; in fact, you never know it). What you actually see is the value it is traded at: let's assume, in this case, $90 (this is observable).
For people interested in Markov models: the interesting part is when you start taking actions on these models (in the previous example, to gain money). This leads to Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs). As a general classification: with fully observable states and no actions you have a Markov chain; with hidden states and no actions, an HMM; with observable states and actions, an MDP; and with hidden states and actions, a POMDP.
Since Matt used parts-of-speech tags as an HMM example, I'll add one more: speech recognition. Almost all large-vocabulary continuous speech recognition (LVCSR) systems are based on HMMs.
"Matt's example":
I enjoy coffee
In a Markov model, you could estimate its probability by calculating:
P(WORD = I) × P(WORD = enjoy | PREVIOUS_WORD = I) × P(WORD = coffee | PREVIOUS_WORD = enjoy)
In a hidden Markov model: let's say 30 different people read the sentence "I enjoy hugging" and we have to recognize it.
Every person pronounces the sentence differently, so we do NOT know whether the person said "hugging" or "hogging". We only have a probability distribution over the actual word.
In short, a hidden Markov model is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states.
A hidden Markov model is a doubly embedded stochastic process with two levels.
The upper level is a Markov process, and its states are unobservable.
The observation is a probabilistic function of the upper-level Markov states.
Different Markov states will have different observation probability functions.