Remove Outliers from a Multitrace in PYMC3 - bayesian

I have a model which has 3 parameters A, n, and Beta.
I did a Bayesian analysis using pymc3 and got the posterior distributions of the parameters in a multitrace called "trace". Is there any way to remove the outliers of A (and thus the corresponding values of n and Beta) from the multitrace?

Stating that specific values of A are outliers implies that you have enough "domain expertise" to know that the ranges where these values fall into have very low probability of occurence in the experiment/system you are modelling.
You could therefore narrow your chosen prior distribution for A, such that these "outliers" remain in the tails of the distribution.
Reducing the overall model entropy with such informative prior's choice is risky in a way but can be considered as a valid approach if you know that values within these specific ranges just do not happen in real-life experiments.
Once the Bayes rule applied, your posterior distribution will put a lot less weight on these ranges and should better reflect the actual system behaviour.

Related

sjmisc::merge_imputations() averages across imputed datasets, which seems unjustified?

The sjmisc package has a function sjmisc::merge_imputations()
This function merges multiple imputed data frames from mice::mids()-objects into a single data frame by computing the mean or selecting the most likely imputed value.
I think this is what Stef van Buuren cautions against in 5.1.2 Not recommended workflow: Averaging the data ?
the procedure ignores the between-imputation variability, and hence shares all the drawbacks of single imputation
Instead, they advocate for mice::with() and mice::pool().
So when might one use sjmisc::merge_imputations() ?
If:
The researcher either only cares about means, not about correlations or other more complicated relationships between variables. Or, is willing to assume that the imputation models were "true" models.
The researcher only cares about point estimates, and is less concerned about the uncertainty in those estimates (variance, standard errors, confidence intervals, hypothesis tests, coefficients of variation).
There is only a small amount of missing data.
Then averaging the imputed values can be a reasonable fix. Averaging the imputed values is basically a version of "stochastic regression imputation". Although note that as the number of imputations increases, averaging the imputed values converges to simple regression imputation. It's still wrong, but it may be a practical method. The sjmisc package documentation quotes Burns et al (2011). https://doi.org/10.1016/j.jclinepi.2010.10.011 From that article:
There were practical benefits in providing DYNOPTA investigators an averaged imputation score as it precludes the necessity for investigators to run MICE for different projects using the MMSE, the need to obtain software capable of combining and analyzing multiple imputed datasets, and many investigators are unfamiliar with MI analysis techniques.
Compare also van Buuren 1.3.5
If you have the ability to use proper pooling methods I would recommend using those instead.

Using Variance Threshold with normalized variance

We know that zero-variance or low-variance features should be dropped to help with model complexity. However, I have come to learn that comparing variances of features can be difficult. For example:
The above features all have different medians, different variances, and ranges. Also, higher values in a distribution tend to have bigger variances. So, to make a fair comparison, can we normalize all features by dividing them by their mean, like so:
normalized_df = df / df.mean()
I have seen this technique in a DataCamp course and it is suggested in the course that after doing a normalization like above, we can choose a lower variance threshold, like 0.005 to make a fair comparison in feature selection. I was wondering if it was correct.
If it is, what kind of threshold should be chosen for normalized features?

Algorithm - finding the order of HMM from observations

I am given a data that consists of N sequences of variable lengths of hidden variables and their corresponding observed variables (i.e., I have both the hidden variables and the observed variables for each sequence).
Is there a way to find the order K of the "best" HMM model for this data, without exhaustive search? (justified heuristics are also legitimate).
I think there may be a confusion about the word "order":
A first-order HMM is an HMM which transition matrix depends only on the previous state. A 2nd-order HMM is an HMM which transition matrix depends only on the 2 previous states, and so on. As the order increases, the theory gets "thicker" (i.e., the equations) and very few implementations of such complex models are implemented in mainstream libraries.
A search on your favorite browser with the keywords "second-order HMM" will bring you to meaningful readings about these models.
If by order you mean the number of states, and with the assumptions that you use single distributions assigned to each state (i.e., you do not use HMMs with mixtures of distributions) then, indeed the only hyperparameter you need to tune is the number of states.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion, the Akaike Information Criterion, or the Minimum Message Length Criterion which are based on model's likelihood computations. Usually, the use of these criteria necessitates training multiple models in order to be able to compute some meaningful likelihood results to compare.
If you just want to get a blur idea of a good K value that may not be optimal, a k-means clustering combined with the percentage of variance explained can do the trick: if X clusters explain more than, let say, 90% of the variance of the observations in your training set then, going with an X-state HMM is a good start. The 3 first criteria are interesting because they include a penalty term that goes with the number of parameters of the model and can therefore prevent some overfitting.
These criteria can also be applied when one uses mixture-based HMMs, in which case there are more hyperparameters to tune (i.e., the number of states and the number of component of the mixture models).

Error propagation in a Bayesian analysis of a Markov chain

I'm analysing longitudinal panel data, in which individuals transition between different states in a Markov chain. I'm modelling the transition rates between states using a series of multinomial logistic regressions. This means that I end up with a very large number of regression slopes.
For each regression slope, I obtain a posterior distribution (using WinBUGS). From the posterior distribution, we get the mean, standard deviation, and 95% credible interval associated with the slope in question.
The value I am ultimately interested in is the expected first passage time ('hitting time') through the Markov chain. This is a function of all the different predictor variables, and so is built from the many regression slopes produced by the multinomial logistic regressions.
A simple approach would be to take the mean of each posterior distribution as a point-estimate for each regression slope, and solve for the expected first passage time at a series of different values of the predictor variables. I have now done this, but it is potentially misleading because it doesn't show the uncertainty around the predicted values of expected first passage time.
My question is: how can I calculate a credible interval for the expected first passage time?
My first thought was to approximate the error via simulation, by sampling individual values for the regression slopes from each posterior distribution, obtaining the expected first passage time given those values, and then plotting the standard deviation of all these simulated values. However, I feel like (a) this would make a statistician scream and (b) it doesn't take into account the fact that different posterior distributions will be correlated (it samples from each one independently).
In WinBUGS, you can actually obtain the correlations between the posterior distributions. So if the simulation idea is appropriate, I could in theory simulate the regression slope coefficients incorporating these correlations.
Is there a more direct and less approximate way to find the uncertainty? Could I, for instance, use WinBUGS to find the posterior distribution of the expected first passage time for a given set of values of the predictor variables? Rather like the answer to this question: define a new node and monitor it. I would imagine defining a series of new nodes, where each one is for a different set of actual predictor values, and monitoring each one. Does this make good statistical sense?
Any thoughts about this would be really appreciated!

Select important features then impute or first impute then select important features?

I have a dataset with lots of features (mostly categorical features(Yes/No)) and lots of missing values.
One of the techniques for dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute’s usage statistics to find the most informative subset of features. That is basically we can generate a large set of very shallow trees, with each tree being trained on a small fraction of the total number of attributes. If an attribute is often selected as best split, it is most likely an informative feature to retain.
I am also using an imputer to fill the missing values.
My doubt is what should be the order to the above two. Which of the above two (dimensionality reduction and imputation) to do first and why?
From mathematical perspective you should always avoid data imputation (in the sense - use it only if you have to). In other words - if you have a method which can work with missing values - use it (if you do not - you are left with data imputation).
Data imputation is nearly always heavily biased, it has been shown so many times, I believe that I even read paper about it which is ~20 years old. In general - in order to do a statistically sound data imputation you need to fit a very good generative model. Just imputing "most common", mean value etc. makes assumptions about the data of similar strength to the Naive Bayes.