Algorithm - finding the order of an HMM from observations

I am given data consisting of N sequences of variable length, each made up of hidden states and their corresponding observed variables (i.e., I have both the hidden and the observed variables for every sequence).
Is there a way to find the order K of the "best" HMM model for this data, without exhaustive search? (justified heuristics are also legitimate).

I think there may be some confusion about the word "order":
A first-order HMM is an HMM whose transition probabilities depend only on the previous state. A second-order HMM is one whose transition probabilities depend on the two previous states, and so on. As the order increases, the theory (i.e., the equations) gets "thicker", and very few mainstream libraries implement such models.
A search on your favorite browser with the keywords "second-order HMM" will bring up meaningful reading about these models.
If by order you mean the number of states, and assuming you assign a single distribution to each state (i.e., you do not use HMMs with mixtures of distributions), then the number of states is indeed the only hyperparameter you need to tune.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), or the Minimum Message Length (MML) criterion, all of which are based on the model's likelihood. Using these criteria usually requires training several candidate models so that their likelihoods can be compared.
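For instance, here is a minimal sketch of BIC-based selection, assuming Gaussian emissions and the hmmlearn library (neither of which is stated in the question), with toy data standing in for your sequences:

```python
import numpy as np
from hmmlearn import hmm  # assumed library; Gaussian emissions assumed

rng = np.random.default_rng(0)
# toy stand-in for your data: 20 sequences of variable length, 2-d observations
seqs = [rng.normal(size=(rng.integers(30, 60), 2)) for _ in range(20)]
X = np.concatenate(seqs)
lengths = [len(s) for s in seqs]

def bic(model, X, lengths):
    """BIC = p*ln(n) - 2*logL with a rough free-parameter count."""
    k, d = model.n_components, X.shape[1]
    n_params = (k - 1) + k * (k - 1) + 2 * k * d  # start probs, transitions, means, diag covars
    return n_params * np.log(len(X)) - 2.0 * model.score(X, lengths)

scores = {}
for k in range(2, 8):
    m = hmm.GaussianHMM(n_components=k, covariance_type="diag", n_iter=50, random_state=0)
    m.fit(X, lengths)
    scores[k] = bic(m, X, lengths)
best_k = min(scores, key=scores.get)  # lowest BIC wins
```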
These three criteria are interesting because they include a penalty term that grows with the number of parameters of the model and can therefore prevent some overfitting. If you just want a rough idea of a good K value that may not be optimal, a k-means clustering combined with the percentage of variance explained can do the trick: if X clusters explain more than, let's say, 90% of the variance of the observations in your training set, then an X-state HMM is a good starting point, as in the sketch below.
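A rough sketch of that heuristic, assuming scikit-learn and again using placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))  # stand-in for your stacked observations

total_ss = ((X - X.mean(axis=0)) ** 2).sum()  # total sum of squares around the global mean
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    explained = 1.0 - km.inertia_ / total_ss  # fraction of variance explained
    if explained >= 0.90:  # the 90% threshold mentioned above
        print(f"{k} clusters explain {explained:.1%} of the variance -> try a {k}-state HMM")
        break
```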
These criteria can also be applied when one uses mixture-based HMMs, in which case there are more hyperparameters to tune (i.e., the number of states and the number of components in the mixture models).

Related

Remove Outliers from a Multitrace in PYMC3

I have a model which has 3 parameters A, n, and Beta.
I did a Bayesian analysis using pymc3 and got the posterior distributions of the parameters in a multitrace called "trace". Is there any way to remove the outliers of A (and thus the corresponding values of n and Beta) from the multitrace?
Stating that specific values of A are outliers implies that you have enough "domain expertise" to know that the ranges these values fall into have a very low probability of occurrence in the experiment/system you are modelling.
You could therefore narrow your chosen prior distribution for A, such that these "outliers" remain in the tails of the distribution.
Reducing the overall model entropy with such an informative choice of prior is somewhat risky, but it can be considered a valid approach if you know that values within these specific ranges simply do not occur in real-life experiments.
Once Bayes' rule is applied, your posterior distribution will put far less weight on these ranges and should better reflect the actual behaviour of the system.
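As a hedged illustration (the priors, bounds, and likelihood below are hypothetical placeholders, not your actual model), narrowing or truncating the prior on A in PyMC3 could look like this:

```python
import numpy as np
import pymc3 as pm

y_obs = np.random.default_rng(0).normal(loc=2.0, scale=0.5, size=100)  # placeholder data

with pm.Model():
    # a narrower, truncated prior keeps the implausible ranges of A in the far tails
    A = pm.TruncatedNormal("A", mu=2.0, sigma=0.5, lower=0.0, upper=5.0)
    n = pm.Normal("n", mu=1.0, sigma=1.0)
    Beta = pm.HalfNormal("Beta", sigma=1.0)
    # placeholder likelihood; substitute whatever actually links A, n and Beta to your data
    pm.Normal("obs", mu=A, sigma=Beta, observed=y_obs)
    trace = pm.sample(2000, tune=1000)
```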

Select important features then impute or first impute then select important features?

I have a dataset with many features (mostly categorical Yes/No features) and many missing values.
One technique for dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute's usage statistics to find the most informative subset of features. That is, we can generate a large set of very shallow trees, with each tree trained on a small fraction of the total number of attributes. If an attribute is often selected as the best split, it is most likely an informative feature to retain (roughly as in the sketch below).
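A sketch of that idea, assuming scikit-learn and toy Yes/No data (the column indices and the "keep 10" cutoff are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 40))     # Yes/No features encoded as 0/1
y = (X[:, 0] | X[:, 3]).astype(int)        # toy target that depends on a few columns

# many very shallow trees, each grown on a small random subset of the attributes
forest = RandomForestClassifier(
    n_estimators=500, max_depth=2, max_features=0.1, random_state=0
).fit(X, y)

importances = forest.feature_importances_  # how often/usefully each attribute is split on
keep = np.argsort(importances)[::-1][:10]  # e.g. retain the 10 most informative columns
print("columns retained:", sorted(keep))
```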
I am also using an imputer to fill the missing values.
My doubt is about the order of these two steps: which of the two (dimensionality reduction or imputation) should be done first, and why?
From a mathematical perspective you should always avoid data imputation (in the sense: use it only if you have to). In other words, if you have a method that can work with missing values, use it; if you do not, you are left with data imputation.
Data imputation is nearly always heavily biased; this has been shown many times (I believe I even read a paper about it that is roughly 20 years old). In general, in order to do statistically sound data imputation you need to fit a very good generative model. Simply imputing the most common value, the mean, etc. makes assumptions about the data of a similar strength to those of Naive Bayes.
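To make the advice concrete, here is a hedged scikit-learn sketch of the two options: an estimator that copes with missing values natively versus imputing first (toy data; HistGradientBoostingClassifier and SimpleImputer are my choices here, not anything from the question):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 20)).astype(float)  # Yes/No features as 0/1
X[rng.random(X.shape) < 0.2] = np.nan                 # ~20% missing values
y = rng.integers(0, 2, size=500)

# Option 1: no imputation -- this estimator handles NaN natively
clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)

# Option 2: impute first (e.g. most frequent value), then fit whatever model you like
X_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)
```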

What is the output of XGboost using 'rank:pairwise'?

I use the Python implementation of XGBoost. One of the objectives is rank:pairwise and it minimizes the pairwise loss (Documentation). However, it does not say anything about the range of the output. I see numbers between -10 and 10, but can it in principle go from -inf to inf?
Good question. You may have a look at the Kaggle competition forums:
Actually, in the learning-to-rank field, we are trying to predict the relative score of each document for a specific query. That is, this is neither a regression problem nor a classification problem. Hence, if a document attached to a query gets a negative predicted score, it means, and only means, that it is relatively less relevant to that query when compared to the other document(s) with positive scores.
So the model gives a predicted score for ranking.
However, the scores are valid for ranking only within their own group.
So we must set the groups for the input data, as in the sketch below.
For easy ranking, refer to my project xgboostExtension.
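A minimal sketch of setting groups with the core xgboost API (not the project above; data and group sizes are made up):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))
y = rng.integers(0, 3, size=12)     # relevance labels per document
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([4, 4, 4])         # three queries with four documents each

params = {"objective": "rank:pairwise", "eta": 0.1, "max_depth": 3}
bst = xgb.train(params, dtrain, num_boost_round=20)

scores = bst.predict(dtrain)        # arbitrary real-valued scores
# only the ordering of scores *within* each group is meaningful, not the absolute values
```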
If I understand your question correctly, you mean the output of the predict function on a model fitted using rank:pairwise.
Predict gives the predicted variable (y_hat).
This is the same as for reg:linear / binary:logistic etc. The only difference is that reg:linear builds trees to minimize RMSE(y, y_hat), while rank:pairwise builds trees to maximize MAP(rank(y), rank(y_hat)). However, the output is always y_hat.
Depending on the values of your dependent variable, the output can be anything, but I typically expect the output to have much smaller variance than the dependent variable. This is usually the case because it is not necessary to fit extreme data values; the trees just need to produce predictions that are large/small enough to be ranked first/last within the group.

Identical Test set

I have some comments and I want to classify them as positive or negative.
So far I have an annotated dataset.
The thing is that the first 100 rows are classified as positive and the remaining 100 as negative.
I am using SQL Server Analysis Services 2008 R2. The class attribute has two values: POS for positive and NEG for negative.
I also use the Naive Bayes algorithm with maximum input/output attributes = 0 (I want to use all the attributes) for the classification; the test set maximum-cases parameter is set to 30%. The current score from the lift chart is 0.60.
Do I have to mix them up, for example two POS followed by one NEG, in order to get better classification accuracy?
The ordering of the learning instances should not affect classification performance: the probabilities computed by Naive Bayes will be the same for any ordering of the instances in the data set.
However, the selection of different test and training sets can affect classification performance. For example, some instances might be inherently more difficult to classify than others.
Are you getting similarly poor training and test performance? If your training performance is good and/or much better than your test performance, your model may be over-fitted. Otherwise, if your training performance is also poor, I would suggest (a) trying a stronger/more expressive classifier, e.g., an SVM or decision trees, and/or (b) making sure your features are representative/expressive enough of the data.
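This is not SSAS, but the same diagnostic is easy to sketch in scikit-learn, assuming a stratified, shuffled split and a Bernoulli Naive Bayes on toy binary features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30))      # toy binary features
y = np.array([1] * 100 + [0] * 100)         # first 100 POS, remaining 100 NEG

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, shuffle=True, random_state=0
)
nb = BernoulliNB().fit(X_tr, y_tr)
print("train acc:", nb.score(X_tr, y_tr), "test acc:", nb.score(X_te, y_te))
# a large train/test gap suggests over-fitting; both low suggests weak features or model
```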

How do I train an HMM with Baum-Welch and multiple observation sequences?

I am having some problems understanding how exactly the Baum-Welch algorithm works. I read that it adjusts the parameters of the HMM (the transition and emission probabilities) in order to maximize the probability that my observation sequence is produced by the given model.
However, what happens if I have multiple observation sequences? I want to train my HMM on a large set of observations (and I think this is what is usually done).
ghmm, for example, can take either a single observation sequence or a full set of observations for the baumWelch method.
Does it work the same way in both situations? Or does the algorithm have to know all observations at the same time?
In Rabiner's paper, the parameters of the GMMs (weights, means and covariances) are re-estimated in the Baum-Welch algorithm using these equations:
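(The equations were not reproduced above; from memory of Rabiner's tutorial, eqs. (52)-(54), they have roughly this form, with $\gamma_t(j,k)$ the probability of being in state $j$ at time $t$ with the $k$-th mixture component accounting for the observation $o_t$:)

$$\bar{c}_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{k=1}^{M}\gamma_t(j,k)},\qquad
\bar{\mu}_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)\,o_t}{\sum_{t=1}^{T}\gamma_t(j,k)},\qquad
\bar{\Sigma}_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)\,(o_t-\bar{\mu}_{jk})(o_t-\bar{\mu}_{jk})^{\top}}{\sum_{t=1}^{T}\gamma_t(j,k)}$$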
These equations are for the single-observation-sequence case. In the multiple-sequence case, the numerators and denominators are simply summed over all observation sequences and then divided to obtain the parameters. (This can be done because they represent occupation counts; see p. 273 of the paper.)
So it is not required to know all observation sequences during a single invocation of the algorithm. As an example, the HERest tool in HTK has a mechanism that allows the training data to be split across multiple machines: each machine computes the numerators and denominators and dumps them to a file, and at the end a single machine reads these files, sums the numerators and denominators, and divides them to get the result. See p. 129 of the HTK Book v3.4.
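In practice, libraries do this pooling for you. A minimal sketch assuming hmmlearn (ghmm exposes a similar idea), where multiple sequences are passed as one stacked array plus a vector of sequence lengths:

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# 50 toy observation sequences of variable length, 3-d observations
seqs = [rng.normal(size=(rng.integers(20, 40), 3)) for _ in range(50)]
X = np.concatenate(seqs)
lengths = [len(s) for s in seqs]

model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=100)
model.fit(X, lengths)   # Baum-Welch accumulates the sufficient statistics over all sequences
```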