hmmlearn: Using GaussianHMM, how does one calculate the probability of an observation (as opposed to the probability of a state) - hidden-markov-models

I'm new to the HMM universe. I've followed the tutorials using a GaussianHMM machine learner, and they work, but I was wondering how I can use the code to display the probability of an observation given the most likely sequence, assuming I have multiple sequences of observations. Thanks.
So, for example, if the observations are:
seq1: [1,2,-1,4,2], seq2: [a,v,s,a,f], and the model has 2 states,
once the model predicts the states, how does one calculate the probability of an observed output [1],[a]?

Related

Algorithm - finding the order of HMM from observations

I am given a data that consists of N sequences of variable lengths of hidden variables and their corresponding observed variables (i.e., I have both the hidden variables and the observed variables for each sequence).
Is there a way to find the order K of the "best" HMM model for this data, without exhaustive search? (justified heuristics are also legitimate).
I think there may be a confusion about the word "order":
A first-order HMM is an HMM whose transitions depend only on the previous state. A 2nd-order HMM is an HMM whose transitions depend only on the 2 previous states, and so on. As the order increases, the theory (i.e., the equations) gets "thicker", and very few mainstream libraries implement such complex models.
A search in your favorite search engine for the keywords "second-order HMM" will bring you to meaningful readings about these models.
If by order you mean the number of states, and assuming you assign a single distribution to each state (i.e., you do not use HMMs with mixtures of distributions), then the only hyperparameter you need to tune is indeed the number of states.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion, the Akaike Information Criterion, or the Minimum Message Length Criterion, which are based on the model's likelihood. Usually, using these criteria requires training multiple models so that there are meaningful likelihood values to compare.
If you just want a rough idea of a good K value that may not be optimal, k-means clustering combined with the percentage of variance explained can do the trick: if X clusters explain more than, let's say, 90% of the variance of the observations in your training set, then going with an X-state HMM is a good start. The three criteria above are interesting because they include a penalty term that grows with the number of parameters of the model and can therefore prevent some overfitting.
These criteria can also be applied when one uses mixture-based HMMs, in which case there are more hyperparameters to tune (i.e., the number of states and the number of components of the mixture models).
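As a rough illustration of comparing candidate state counts with such criteria, here is a minimal sketch. It assumes one diagonal-covariance Gaussian per state, and the log-likelihood values are hypothetical placeholders for what you would get after training each candidate model on the same data. The helper names are made up for this example.

```python
import math

def n_params_gaussian_hmm(n_states, n_features):
    """Free parameters of an HMM with one diagonal-covariance Gaussian per state."""
    start = n_states - 1                   # initial distribution sums to 1
    trans = n_states * (n_states - 1)      # each transition row sums to 1
    means = n_states * n_features
    variances = n_states * n_features      # diagonal covariance (assumption)
    return start + trans + means + variances

def aic(log_likelihood, n_params):
    # Akaike Information Criterion: lower is better
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_samples):
    # Bayesian Information Criterion: lower is better
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# Hypothetical training log-likelihoods for models with 2, 3 and 4 states
candidate_loglik = {2: -1250.0, 3: -1190.0, 4: -1185.0}
n_samples, n_features = 500, 1

scores = {
    k: bic(ll, n_params_gaussian_hmm(k, n_features), n_samples)
    for k, ll in candidate_loglik.items()
}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

With these made-up numbers the 4-state model has the highest likelihood, but its extra parameters are penalized enough that the 3-state model gets the lowest BIC.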

Inference on several inputs in order to calculate the loss function

I am modeling a perceptual process in TensorFlow. In the setup I am interested in, the modeled agent is playing a resource game: it has to choose 1 out of n resources, relying only on the label that a classifier gives to the resource. Each resource is an ordered pair of two reals. The classifier only sees the first real, but payoffs depend on the second. There is a function taking the first to the second.
Anyway, ideally I'd like to train the classifier in the following way:
In each run, the classifier gives labels to n resources.
The agent then gets the payoff of the resource corresponding to the highest label in some predetermined ranking (say, A > B > C > D), and randomly in case of draw.
The loss is taken to be the normalized absolute difference between the payoff thus obtained and the maximum payoff in the set of resources. I.e., (Payoff_max - Payoff) / Payoff_max
For this to work, one needs to run inference n times, once for each resource, before calculating the loss. Is there a way to do this in tensorflow? If I am tackling the problem in the wrong way feel free to say so, too.
I don't have much knowledge of the ML aspects of this, but from a programming point of view, I can see two ways of doing it. One is to copy your model n times. All the copies can share the same variables. The output of all of these copies would go into some function that determines the highest label. As long as this function is differentiable, variables are shared, and n is not too large, it should work. You would need to feed all n inputs together. Note that backprop will run through each copy and update your weights n times. This is generally not a problem, but if it is, I heard about some fancy tricks one can do by using partial_run.
Another way is to use tf.while_loop. It is pretty clever - it stores activations from each run of the loop and can do backprop through them. The only tricky part should be to accumulate the inference results before feeding them to your loss. Take a look at TensorArray for this. This question can be helpful: Using TensorArrays in the context of a while_loop to accumulate values

How to generate data that fits the normal distribution within each class?

Using numpy, I need to produce training and test data for a machine learning problem. The model is able to predict three different classes (X,Y,Z). The classes represent the types of patients in multiple clinical trials, and the model should be able to predict the type of patient based on data gathered about the patient (such as blood analysis and blood pressure, previous history etc.)
From a previous study we know that, in total, the classes are represented with the following distribution, in terms of a percentage of the total patient count per trial:
X - u=7.2, s=5.3
Y - u=83.7, s=15.2
Z - u=9.1, s=2.3
The u/s describe the distribution in N(u, s) for each class (so, for all trials studied, class X had mean 7.2 and variance 5.3). Unfortunately the data set for the study is not available.
How can I recreate a dataset that follows the same distribution over all classes, and within each class, subject to the constraint of X+Y+Z=100 for each record?
It is easy to generate a dataset that follows the overall distribution (the u values), but how do I get a dataset that has the same distribution within each class?
The problem you have stated is to sample from a mixture distribution. A mixture distribution is just a number of component distributions, each with a weight, such that the weights are nonnegative and sum to 1. Your mixture has 3 components. Each is a Gaussian distribution with the mean and sd you gave. It is reasonable to assume the mixing weights are the proportion of each class in the population. To sample from a mixture, first select a component using the weights as probabilities for a discrete distribution. Then sample from the component. I assume you know how to sample from a Gaussian distribution.
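The two-step procedure described above can be sketched with numpy as follows. This is only an illustration under stated assumptions: the s values are treated as standard deviations, and the mixing weights are taken to be the class means divided by their sum (i.e., the average class proportions). The function name is made up for this example, and the sketch does not enforce the X+Y+Z=100 constraint.

```python
import numpy as np

means = np.array([7.2, 83.7, 9.1])    # u for classes X, Y, Z (from the question)
sds = np.array([5.3, 15.2, 2.3])      # s treated as standard deviations (assumption)
weights = means / means.sum()         # assumed mixing weights: class proportions

def sample_mixture(n, rng):
    """Draw n samples: pick a component by weight, then sample its Gaussian."""
    component = rng.choice(len(weights), size=n, p=weights)
    values = rng.normal(means[component], sds[component])
    return component, values

rng = np.random.default_rng(0)
component, values = sample_mixture(100_000, rng)

# Empirical check: per-component fraction and sample mean
for k, name in enumerate("XYZ"):
    mask = component == k
    print(name, mask.mean(), values[mask].mean())
```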

How do I have to train a HMM with Baum-Welch and multiple observations?

I am having some problems understanding exactly how the Baum-Welch algorithm works. I read that it adjusts the parameters of the HMM (the transition and the emission probabilities) in order to maximize the probability that my observation sequence may be seen by the given model.
However, what happens if I have multiple observation sequences? I want to train my HMM against a large number of observations (and I think this is what is usually done).
ghmm for example can take both a single observation sequence and a full set of observations for the baumWelch method.
Does it work the same in both situations? Or does the algorithm have to know all observations at the same time?
In Rabiner's paper, the parameters of GMMs (weights, means and covariances) are re-estimated in the Baum-Welch algorithm using these equations:
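For reference, the mixture re-estimation formulas from that tutorial have the following standard form. This is a reconstruction, assuming $\gamma_t(j,k)$ denotes the probability of being in state $j$ at time $t$ with the $k$-th mixture component accounting for observation $o_t$, $T$ is the sequence length, and $M$ is the number of components per state:

```latex
\bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}
                    {\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m)}
\qquad
\bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}
                      {\sum_{t=1}^{T} \gamma_t(j,k)}
\qquad
\bar{U}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,
                     (o_t - \mu_{jk})(o_t - \mu_{jk})^{\top}}
                    {\sum_{t=1}^{T} \gamma_t(j,k)}
```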
These are just for the single observation sequence case. In the multiple-sequence case, the numerators and denominators are summed over all observation sequences and then divided to get the parameters (this can be done since they simply represent occupation counts; see pg. 273 of the paper).
So it's not required to know all observation sequences during an invocation of the algorithm. As an example, the HERest tool in HTK has a mechanism that allows splitting up the training data amongst multiple machines. Each machine computes the numerators and denominators and dumps them to a file. In the end, a single machine reads these files, sums up the numerators and denominators and divides them to get the result. See pg. 129 of the HTK book v3.4
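The accumulate-then-divide idea can be sketched in numpy for the simplest case, the mean update with a single Gaussian per state (not the full mixture update). The sketch assumes the state posteriors (gamma) for each sequence have already been computed by a forward-backward pass under the current parameters. Here they are random placeholders, and the function names are made up for this example.

```python
import numpy as np

def mean_statistics(obs, gamma):
    """Per-sequence statistics for the mean update.

    obs:   (T, d) observations of one sequence
    gamma: (T, N) state posteriors for that sequence
    Returns the numerator (N, d) and the denominator (N,).
    """
    numerator = gamma.T @ obs          # sum_t gamma_t(j) * o_t
    denominator = gamma.sum(axis=0)    # sum_t gamma_t(j), the occupation count
    return numerator, denominator

def reestimate_means(sequences, gammas):
    """Sum numerators and denominators over all sequences, then divide once."""
    n_states = gammas[0].shape[1]
    n_features = sequences[0].shape[1]
    num = np.zeros((n_states, n_features))
    den = np.zeros(n_states)
    for obs, gamma in zip(sequences, gammas):
        n, d = mean_statistics(obs, gamma)
        num += n
        den += d
    return num / den[:, None]

# Toy data: two sequences of different lengths, 3 states, placeholder posteriors
rng = np.random.default_rng(0)
sequences = [rng.normal(size=(5, 2)), rng.normal(size=(8, 2))]
gammas = []
for obs in sequences:
    g = rng.random((len(obs), 3))
    gammas.append(g / g.sum(axis=1, keepdims=True))  # rows sum to 1

new_means = reestimate_means(sequences, gammas)
print(new_means)
```

Because each sequence only contributes a pair of sums, the per-sequence step could run on separate machines and the final division could happen wherever the sums are collected.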

Probability Density Function with Zero Standard Deviation

I am implementing an email filtering application using the Naive Bayes algorithm. My application uses the Spambase Data Set from the UCI Machine Learning Repository. Since the attributes are continuous, I calculate the probability using the Probability Density Function (PDF). However, when I evaluate the data using k-fold cross validation, a training set may contain only 0 for one of its attributes. In that case I get a standard deviation of 0 and the PDF returns NaN, which leads to a large number of spam messages being misclassified with that training set. What should I do to fix the problem?
You could use a discrete PDF, which will always be bounded.
Alternatively, simply ignore any attribute with zero variance. There is no point in including distributions with zero variance, because they won't actually do anything. For example, you want to know how old I am, and then I tell you that I live on planet Earth. That shouldn't change your estimate, because every single piece of data you have is for people on planet Earth.
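A minimal numpy sketch of the second suggestion, dropping zero-variance attributes before evaluating the Gaussian density, could look like this. The data is made up, the helper names are hypothetical, and for brevity it handles a single class. In a real Naive Bayes classifier the same statistics would be computed per class.

```python
import numpy as np

def nonconstant_features(X_train):
    """Boolean mask of the columns whose standard deviation is non-zero."""
    return X_train.std(axis=0) > 0

def gaussian_log_pdf(x, mean, std):
    """Element-wise log density of N(mean, std^2) at x."""
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

# Made-up training fold: the first attribute is 0 for every example
X_train = np.array([[0.0, 1.2, 3.0],
                    [0.0, 0.7, 2.5],
                    [0.0, 1.9, 4.1]])
x_new = np.array([0.3, 1.0, 3.3])

keep = nonconstant_features(X_train)
mean = X_train[:, keep].mean(axis=0)
std = X_train[:, keep].std(axis=0)

# Only the informative attributes contribute to the log-likelihood
log_likelihood = gaussian_log_pdf(x_new[keep], mean, std).sum()
print(keep, log_likelihood)
```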