Select important features then impute or first impute then select important features? - pandas

I have a dataset with lots of features (mostly categorical, Yes/No) and lots of missing values.
One of the techniques for dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute's usage statistics to find the most informative subset of features. That is, we generate a large set of very shallow trees, each trained on a small fraction of the total number of attributes; if an attribute is often selected as the best split, it is most likely an informative feature to retain (a rough sketch of this idea is given below).
I am also using an imputer to fill the missing values.
My doubt is about the order of these two steps. Which of the two (dimensionality reduction or imputation) should be done first, and why?
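For concreteness, here is a minimal sketch of the shallow-trees idea with scikit-learn. The data below is a synthetic stand-in for mine, and it assumes the matrix is already numeric and free of missing values, which is exactly where the ordering question comes in:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the real data: 30 Yes/No features and a target column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.choice(["Yes", "No"], size=(1000, 30)),
                  columns=[f"f{i}" for i in range(30)])
df["target"] = rng.integers(0, 2, size=1000)

X = pd.get_dummies(df.drop(columns="target"), drop_first=True)  # Yes/No -> 0/1
y = df["target"]

# Many very shallow trees, each split considering only a small subset of attributes.
forest = ExtraTreesClassifier(
    n_estimators=500,
    max_depth=3,          # very shallow trees
    max_features="sqrt",  # small random fraction of attributes per split
    random_state=0,
).fit(X, y)

# Attributes that are frequently chosen as the best split accumulate importance.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```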

From a mathematical perspective you should always avoid data imputation (in the sense of using it only if you have to). In other words, if you have a method that can work with missing values, use it; if you do not, you are left with data imputation.
Data imputation is nearly always heavily biased; this has been shown many times, and I believe I have even read a paper about it that is ~20 years old. In general, to do statistically sound data imputation you need to fit a very good generative model. Just imputing the most common value, the mean, etc. makes assumptions about the data of similar strength to those of Naive Bayes.
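One concrete way to follow this advice (a sketch with synthetic placeholder data, not a prescription): scikit-learn's HistGradientBoostingClassifier accepts NaN values natively, so features can be ranked before deciding whether, and how, to impute.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Synthetic Yes/No data with missing values, standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.choice(["Yes", "No", None], size=(1000, 10)),
                  columns=[f"f{i}" for i in range(10)])
y = rng.integers(0, 2, size=1000)

# Map Yes/No to 1/0 and keep the NaNs as they are -- no imputation step.
X = df.replace({"Yes": 1.0, "No": 0.0}).astype(float)

model = HistGradientBoostingClassifier(random_state=0).fit(X, y)  # NaNs handled internally

# Permutation importance ranks the features on the un-imputed data.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = pd.Series(result.importances_mean, index=X.columns)
print(ranking.sort_values(ascending=False))
```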

Related

sjmisc::merge_imputations() averages across imputed datasets, which seems unjustified?

The sjmisc package has a function sjmisc::merge_imputations()
This function merges multiple imputed data frames from mice::mids()-objects into a single data frame by computing the mean or selecting the most likely imputed value.
I think this is what Stef van Buuren cautions against in section 5.1.2, "Not recommended workflow: Averaging the data"?
the procedure ignores the between-imputation variability, and hence shares all the drawbacks of single imputation
Instead, they advocate for mice::with() and mice::pool().
So when might one use sjmisc::merge_imputations() ?
If:
The researcher either cares only about means, not about correlations or other more complicated relationships between variables, or is willing to assume that the imputation models were the "true" models.
The researcher only cares about point estimates, and is less concerned about the uncertainty in those estimates (variance, standard errors, confidence intervals, hypothesis tests, coefficients of variation).
There is only a small amount of missing data.
Then averaging the imputed values can be a reasonable fix. Averaging the imputed values is basically a version of stochastic regression imputation, although note that as the number of imputations increases, the average converges to simple regression imputation. It is still wrong, but it may be a practical method. The sjmisc documentation quotes Burns et al. (2011), https://doi.org/10.1016/j.jclinepi.2010.10.011. From that article:
There were practical benefits in providing DYNOPTA investigators an averaged imputation score as it precludes the necessity for investigators to run MICE for different projects using the MMSE, the need to obtain software capable of combining and analyzing multiple imputed datasets, and many investigators are unfamiliar with MI analysis techniques.
Compare also van Buuren 1.3.5
If you have the ability to use proper pooling methods I would recommend using those instead.
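For contrast, here is a small numeric sketch (plain numpy, made-up numbers) of what proper pooling via Rubin's rules adds over averaging: the between-imputation variance term, which is exactly what gets discarded when the imputed values are averaged into a single dataset.

```python
import numpy as np

# Hypothetical per-imputation results for one regression coefficient,
# from fitting the same model on m = 5 imputed datasets.
estimates = np.array([0.42, 0.47, 0.40, 0.45, 0.44])       # point estimates
variances = np.array([0.010, 0.012, 0.011, 0.009, 0.010])  # squared standard errors
m = len(estimates)

pooled = estimates.mean()               # pooled point estimate
within = variances.mean()               # W: average within-imputation variance
between = estimates.var(ddof=1)         # B: between-imputation variance
total = within + (1 + 1 / m) * between  # Rubin's rules total variance

print(f"estimate = {pooled:.3f}, pooled SE = {np.sqrt(total):.3f}, "
      f"naive SE = {np.sqrt(within):.3f}")
# Analyzing a single averaged dataset reports something close to the naive SE,
# i.e. it drops the (1 + 1/m) * B term and understates the uncertainty.
```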

A huge number of discrete features

I'm developing a regression model, but I ran into a problem when preparing the data: 17 out of 20 features are categorical, and there are a lot of categories in each of them. Using one-hot encoding, my data table is transformed into a 10000x6000 table. How should I prepare this type of data?
I tried PCA to reduce the dimensionality, but even capturing 70% of the variance requires about 2500 components. That's why I'm asking here.
Unfortunately, I can't attach the dataset, as it is confidential.
How do I prepare the data to achieve the best results in the learning process?
Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data onto a linear subspace, so it only gives a linear mapping of the data. Autoencoders, on the other hand, can give you a non-linear mapping, and so are able to represent a greater amount of the variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
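If you want to try that route, here is a minimal autoencoder sketch in PyTorch; the framework choice and layer sizes are my assumptions, and the data below is a random stand-in for the one-hot matrix.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features, n_latent=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 512), nn.ReLU(),   # non-linear activation is the key
            nn.Linear(512, n_latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 512), nn.ReLU(),
            nn.Linear(512, n_features), nn.Sigmoid(),  # one-hot inputs live in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

X = torch.randint(0, 2, (2000, 6000)).float()    # random stand-in for the one-hot data
model = AutoEncoder(n_features=X.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()                           # reconstruction loss for 0/1 inputs

for epoch in range(20):                          # use mini-batches for the real 10000x6000 data
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

codes = model.encoder(X).detach()                # 64-dimensional non-linear representation
print(codes.shape)
```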
It really depends on exactly what you are trying to do. Getting a covariance matrix (and also a PCA decomposition) will give you great insight into which classes tend to come together (and this requires one-hot encoded categories), but training a model off of that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. They can definitely be used for regression, though they need to be trained specifically for that. scikit-learn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benefit of a random forest is that it works well on tabular data (as is the case here) and can easily be trained using numerical values for class features, meaning your data vector can stay at just 20 dimensions!
Decision tree models (such as random forests) have been shown to outperform deep learning in many cases, and this may be one of them.
TL;DR: if you use a random forest, it can learn even with numerical values for categories, and you can avoid creating incredibly large vectors for your data.
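A sketch of that route with scikit-learn, on synthetic data. The choice of OrdinalEncoder is my assumption; its integer codes impose an arbitrary ordering, which tree models tolerate reasonably well.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in: 17 high-cardinality categorical features + 3 numeric ones.
rng = np.random.default_rng(0)
df = pd.DataFrame({f"cat{i}": rng.integers(0, 300, 1000).astype(str) for i in range(17)})
for i in range(3):
    df[f"num{i}"] = rng.normal(size=1000)
target = rng.normal(size=1000)

cat_cols = [c for c in df.columns if c.startswith("cat")]
num_cols = [c for c in df.columns if c.startswith("num")]

# Each category becomes a single integer, so the design matrix stays at 20 columns
# instead of exploding into thousands of one-hot indicators.
preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), cat_cols),
    ("num", "passthrough", num_cols),
])

model = make_pipeline(preprocess, RandomForestRegressor(n_estimators=300, random_state=0))
model.fit(df, target)
print(model.predict(df.head()))
```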

How can I study the properties of outliers in high-dimensional data?

I have a bundle of high-dimensional data and the instances are labeled as outliers or not. I am looking to get some insights around where these outliers reside within the data. I seek to answer questions like:
Are the outliers spread far apart from each other? Or are they clustered together?
Are the outliers lying 'in-between' clusters of good data? Or are they on the 'edge' boundaries of the data?
If outliers are clustered together, how do these cluster densities compare with clusters of good data?
'Where' are the outliers?
What kind of techniques will let me find these insights? If the data were 2- or 3-dimensional, I could easily plot it and just look at it, but I can't do that with high-dimensional data.
Analyzing the Statistical Properties of Outliers
First of all, you can choose to focus on specific features. For example, if you know a feature is subject to high variation, you can draw a box plot. You can also draw a 2D graph if you want to focus on two features. This shows how much the labelled outliers vary.
Next, there is a metric called the Z-score, which says how many standard deviations a point lies from the mean. The Z-score is signed, meaning that if a point is below the mean, its Z-score will be negative. This can be used to analyze all the features of the dataset: you can find the threshold value in your labelled dataset above which all points are labelled outliers.
Lastly, we can find the interquartile range (IQR) and filter based on it in a similar way. The IQR is simply the difference between the 75th and 25th percentiles, and you can use it much like the Z-score.
Using these techniques, we can analyze some of the statistical properties of the outliers.
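A small sketch of both checks with pandas; the data is synthetic and `is_outlier` stands in for your given outlier labels.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 1000 points in 50 dimensions, roughly 5% labelled as outliers.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 50)))
is_outlier = pd.Series(rng.random(1000) < 0.05)

z_scores = (X - X.mean()) / X.std()        # signed z-score per feature
iqr = X.quantile(0.75) - X.quantile(0.25)
iqr_distance = (X - X.median()) / iqr      # IQR-scaled distance from the median

# Compare the labelled outliers against the rest, feature by feature.
summary = pd.DataFrame({
    "mean_abs_z_outliers": z_scores[is_outlier].abs().mean(),
    "mean_abs_z_inliers": z_scores[~is_outlier].abs().mean(),
    "mean_abs_iqr_dist_outliers": iqr_distance[is_outlier].abs().mean(),
})
print(summary.sort_values("mean_abs_z_outliers", ascending=False).head(10))
```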
If you also want to analyze the clusters, you can adapt the DBSCAN algorithm to your problem. This algorithm clusters data based on density, so it is easy to apply it to the outliers and see whether they form their own clusters or are scattered as noise.
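For example, a scikit-learn sketch on the same kind of synthetic data; the eps and min_samples values are placeholders that need tuning to your data's scale.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: X is the data, is_outlier the given outlier labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
is_outlier = rng.random(1000) < 0.05

X_scaled = StandardScaler().fit_transform(X)

# Cluster the labelled outliers and the good data separately; in DBSCAN the
# label -1 means "noise", i.e. points that do not belong to any dense cluster.
outlier_clusters = DBSCAN(eps=5.0, min_samples=5).fit_predict(X_scaled[is_outlier])
inlier_clusters = DBSCAN(eps=5.0, min_samples=5).fit_predict(X_scaled[~is_outlier])

print(pd.Series(outlier_clusters).value_counts())  # mostly -1 -> outliers are scattered
print(pd.Series(inlier_clusters).value_counts())   # cluster sizes of the good data
```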

Algorithm - finding the order of HMM from observations

I am given data that consist of N sequences of variable lengths of hidden variables and their corresponding observed variables (i.e., I have both the hidden variables and the observed variables for each sequence).
Is there a way to find the order K of the "best" HMM model for this data, without exhaustive search? (justified heuristics are also legitimate).
I think there may be a confusion about the word "order":
A first-order HMM is an HMM whose transition matrix depends only on the previous state; a second-order HMM is one whose transition matrix depends on the two previous states, and so on. As the order increases, the theory (i.e., the equations) gets "thicker", and very few implementations of such complex models exist in mainstream libraries.
A search on your favorite browser with the keywords "second-order HMM" will bring you to meaningful readings about these models.
If by order you mean the number of states, and assuming you use a single distribution assigned to each state (i.e., you do not use HMMs with mixtures of distributions), then indeed the only hyperparameter you need to tune is the number of states.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion, the Akaike Information Criterion, or the Minimum Message Length criterion, which are based on the model's likelihood. Usually, using these criteria requires training multiple models so that their likelihoods can be compared in a meaningful way.
If you just want a rough idea of a good K value that may not be optimal, k-means clustering combined with the percentage of variance explained can do the trick: if X clusters explain more than, let's say, 90% of the variance of the observations in your training set, then an X-state HMM is a good start. The three criteria above are interesting because they include a penalty term that grows with the number of parameters of the model and can therefore prevent some overfitting.
These criteria can also be applied when one uses mixture-based HMMs, in which case there are more hyperparameters to tune (i.e., the number of states and the number of components of the mixture models).
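A rough sketch of the criterion-based search, assuming the hmmlearn library and Gaussian emissions; the parameter count below is my approximation for a diagonal-covariance GaussianHMM, and the data is a random stand-in for the real sequences.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Synthetic stand-in: N sequences of variable length, stacked into one array.
rng = np.random.default_rng(0)
lengths = rng.integers(20, 60, size=30)   # lengths of the N sequences
X = rng.normal(size=(lengths.sum(), 2))   # 2-dimensional observations

def bic(model, X, lengths):
    k, d = model.n_components, X.shape[1]
    # Free parameters: start probs, transition matrix, means, diagonal covariances.
    n_params = (k - 1) + k * (k - 1) + k * d + k * d
    return -2 * model.score(X, lengths) + n_params * np.log(len(X))

scores = {}
for k in range(2, 8):
    model = GaussianHMM(n_components=k, covariance_type="diag",
                        n_iter=100, random_state=0).fit(X, lengths)
    scores[k] = bic(model, X, lengths)

best_k = min(scores, key=scores.get)      # smallest BIC wins
print(scores, "-> best number of states:", best_k)
```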

How to estimate the Scoring Scheme in Pairwise Alignment

I'm not a specialist in bioinformatics. I want to align two nucleotide sequences using a global alignment method. Each sequence is a combination of the letters {A, C, T, G}.
The problem is that I don't know how to choose the best scoring scheme (substitution and gap penalties).
Currently, I'm using the values +1, -1, and -2 for match, mismatch, and gap penalty, and I'm aware that the number of transitions in human DNA is larger than the number of transversions.
My question is how to estimate the penalties for match, mismatch, and gap based on my dataset. Is there any statistical model that can help?
To answer this question we would need to know the dataset well and your exact scope, but generally match/mismatch can be represented as +1/-1; this does not distinguish between transitions and transversions.
For that, I advise you to take a look at this model and the Kimura model.
Finally, for the gap penalty, you may use a "low, medium, or high" penalty according to how divergent the sequences are: if the organisms are closely related you may use a low gap penalty, and a high penalty for more divergent organisms. So the gap penalty depends on how divergent the sequences you are aligning are.
As for knowing whether the sequences are divergent or not, as I said, it depends on and differs according to your data, but you may take a look at these examples of some sequences: link1, link2, link3, link4, and link5
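As a starting point, here is a sketch with Biopython's PairwiseAligner (my choice of tool, not something from the answer) using the +1/-1/-2 scheme from the question; a transition/transversion-aware scheme would instead set aligner.substitution_matrix to a 4x4 matrix, e.g. one derived from a Kimura-style model.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"          # Needleman-Wunsch-style global alignment
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -2

# Two short example sequences; replace with your own.
alignment = aligner.align("ACCTGAGCTA", "ACTTAGGCTA")[0]
print(alignment)
print("score:", alignment.score)
```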