xgboost vs H2o Gradient Boosting - xgboost

I have a dataset in which a variable has a large number of missing values (more than 40% missing). I generated a model in xgboost and in H2O gradient boosting, and got a decent model in both cases. However, xgboost shows this variable as one of the key contributors to the model, while according to H2O gradient boosting the variable is not important. Does xgboost handle variables with missing values differently? The configuration of both models is exactly the same.

Both missing value handling and variable importances are slightly different between the two methods. Both are treating missing values as information (i.e., they learn from them, and don't just impute with a simple constant). The variable importances are computed from the gains of their respective loss functions during tree construction. H2O uses squared error, and XGBoost uses a more complicated one based on gradient and hessian.
One thing you could check is the variance of the variable importances between different runs with different seeds, to see how stable each method is in terms of variable importances.
PS. If you have categoricals, then you're better off leaving the column as a factor for H2O, no need to do your own encoding. This can lead to a different effective count of columns vs XGBoost's dataset, so for column sampling, things will be different.
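The seed check suggested above can be sketched as follows. This is only an illustration: the function name `importance_stability`, the feature names, and the importance values are made up, and in practice each dict would come from one xgboost or H2O run trained with a different seed.

```python
from statistics import mean, stdev

def importance_stability(runs):
    """Summarize per-feature importance across runs trained with different seeds.

    `runs` is a list of dicts mapping feature name -> importance, one per seed.
    Returns {feature: (mean, standard deviation)}; a feature that is absent
    from a run is counted as importance 0.0 for that run.
    """
    features = sorted({name for run in runs for name in run})
    summary = {}
    for name in features:
        values = [run.get(name, 0.0) for run in runs]
        summary[name] = (mean(values), stdev(values))
    return summary

# Hypothetical importances from three seeds (illustrative numbers only).
runs = [
    {"x_missing": 0.40, "x_other": 0.60},
    {"x_missing": 0.10, "x_other": 0.90},
    {"x_missing": 0.25, "x_other": 0.75},
]
summary = importance_stability(runs)
for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg:.3f} sd={sd:.3f}")
```

A standard deviation that is large relative to the mean suggests the ranking of that variable is not stable for the method in question, so a disagreement between the two libraries may partly be noise.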

sjmisc::merge_imputations() averages across imputed datasets, which seems unjustified?

The sjmisc package has a function sjmisc::merge_imputations()
This function merges multiple imputed data frames from mice::mids()-objects into a single data frame by computing the mean or selecting the most likely imputed value.
I think this is what Stef van Buuren cautions against in section 5.1.2, "Not recommended workflow: Averaging the data":
"the procedure ignores the between-imputation variability, and hence shares all the drawbacks of single imputation"
Instead, they advocate for mice::with() and mice::pool().
So when might one use sjmisc::merge_imputations() ?
If:
The researcher only cares about means, not about correlations or other more complicated relationships between variables, or is willing to assume that the imputation models were "true" models.
The researcher only cares about point estimates, and is less concerned about the uncertainty in those estimates (variance, standard errors, confidence intervals, hypothesis tests, coefficients of variation).
There is only a small amount of missing data.
Then averaging the imputed values can be a reasonable fix. Averaging the imputed values is basically a version of "stochastic regression imputation". Although note that as the number of imputations increases, averaging the imputed values converges to simple regression imputation. It's still wrong, but it may be a practical method. The sjmisc package documentation quotes Burns et al (2011). https://doi.org/10.1016/j.jclinepi.2010.10.011 From that article:
There were practical benefits in providing DYNOPTA investigators an averaged imputation score as it precludes the necessity for investigators to run MICE for different projects using the MMSE, the need to obtain software capable of combining and analyzing multiple imputed datasets, and many investigators are unfamiliar with MI analysis techniques.
Compare also van Buuren 1.3.5
If you have the ability to use proper pooling methods I would recommend using those instead.
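To make the contrast with averaging concrete, here is a minimal sketch of Rubin's rules for a single scalar estimate, written in Python for illustration. The function name `pool_rubin` and the numbers are assumptions of this sketch; mice::pool() does this pooling for you in R and also handles degrees of freedom, which are omitted here.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m imputed datasets with Rubin's rules.

    estimates: the estimate obtained from each imputed dataset
    variances: the squared standard error of that estimate in each dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variance of the estimates across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Illustrative values for m = 3 imputed datasets.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

Analysing a single averaged dataset gives you only something like the within term, which is why the resulting standard errors are too small: the between-imputation term is exactly what gets lost.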

A huge number of discrete features

I'm developing a regression model, but I ran into a problem when preparing the data. 17 out of 20 features are categorical, and each of them has a lot of categories. After one-hot encoding, my data table becomes a 10000x6000 table. How should I prepare this type of data?
I tried PCA to reduce the dimension, but it takes 2500 components to capture even 70% of the variance. That's why I joined.
Unfortunately, I can't attach the dataset, as it is confidential.
How do I prepare the data to achieve the best results in the learning process?
Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data onto a linear subspace. This means that it only gives a linear mapping of the data. Autoencoders, on the other hand, can give you a non-linear mapping, and so may be able to represent a greater amount of the variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
It really depends on exactly what you are trying to do. Getting a covariance matrix (and also PCA decomp.) will give you great insight about which classes tend to come together (and this requires one-hot encoded categories), but training a model off of that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. They can definitely be used for regression, though they need to be trained specifically for that. SKLearn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benefit of a random forest is that it is great for tabular data (as is the case here), and it can easily be trained using numerical values for categorical features, meaning your data vector can be of dimension only 20!
Decision tree models (such as random forests) have been shown to outperform deep learning in many cases, and this may be one of them.
TL;DR: If you use a random forest, it can learn even with numerical values for categories, and you can avoid creating incredibly large vectors for data.
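As a rough illustration of "numerical values for categories", the sketch below replaces each category with an integer code, so the table keeps its original number of columns. The helper name `ordinal_encode` and the toy rows are hypothetical. Note that the codes impose an arbitrary order on the categories, which a tree model can work around with several splits but a linear model generally cannot.

```python
def ordinal_encode(rows, categorical_columns):
    """Replace each category with an integer code, one code table per column.

    rows: list of dicts (column name -> value)
    Returns (encoded rows, {column: {category: code}}).
    """
    codebooks = {col: {} for col in categorical_columns}
    encoded = []
    for row in rows:
        new_row = dict(row)
        for col in categorical_columns:
            codes = codebooks[col]
            # Assign the next unused integer the first time a category appears.
            new_row[col] = codes.setdefault(row[col], len(codes))
        encoded.append(new_row)
    return encoded, codebooks

# Toy data: one categorical column and one numeric column.
rows = [
    {"city": "Paris", "size": 3.0},
    {"city": "Oslo", "size": 1.5},
    {"city": "Paris", "size": 2.0},
]
encoded, codebooks = ordinal_encode(rows, ["city"])
print(encoded)
print(codebooks)
```

The encoded rows can then be passed to a tree-based regressor; the code tables need to be kept so that new data is encoded the same way.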

Algorithm - finding the order of HMM from observations

I am given data that consists of N sequences of variable lengths of hidden variables and their corresponding observed variables (i.e., I have both the hidden variables and the observed variables for each sequence).
Is there a way to find the order K of the "best" HMM model for this data, without exhaustive search? (justified heuristics are also legitimate).
I think there may be a confusion about the word "order":
A first-order HMM is an HMM whose transition matrix depends only on the previous state. A 2nd-order HMM is an HMM whose transition matrix depends only on the 2 previous states, and so on. As the order increases, the theory gets "thicker" (i.e., the equations) and very few mainstream libraries implement such complex models.
A search in your favorite search engine with the keywords "second-order HMM" will bring you to meaningful readings about these models.
If by order you mean the number of states, and assuming that you use a single distribution assigned to each state (i.e., you do not use HMMs with mixtures of distributions), then indeed the only hyperparameter you need to tune is the number of states.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion, the Akaike Information Criterion, or the Minimum Message Length Criterion which are based on model's likelihood computations. Usually, the use of these criteria necessitates training multiple models in order to be able to compute some meaningful likelihood results to compare.
If you just want to get a rough idea of a good K value that may not be optimal, a k-means clustering combined with the percentage of variance explained can do the trick: if X clusters explain more than, let's say, 90% of the variance of the observations in your training set, then going with an X-state HMM is a good start. The three criteria mentioned above are interesting because they include a penalty term that grows with the number of parameters of the model and can therefore prevent some overfitting.
These criteria can also be applied when one uses mixture-based HMMs, in which case there are more hyperparameters to tune (i.e., the number of states and the number of components of the mixture models).
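For the first reading of "order" (how many previous states the transitions depend on), the BIC can be sketched directly in the setting of the question, where the hidden states are given. Under that assumption the transition model of each candidate order can be fit by simple counting, so trying a handful of orders is cheap. The function name `markov_bic` and the toy sequence are made up for illustration; emissions are ignored because they do not depend on the order of the state chain.

```python
import math
from collections import Counter

def markov_bic(sequences, order, n_states, max_order):
    """BIC of an order-`order` Markov chain over states, fit by counting.

    The first `max_order` symbols of each sequence serve only as context,
    so every candidate order is scored on the same predicted symbols.
    """
    context_counts = Counter()
    transition_counts = Counter()
    n_predicted = 0
    for seq in sequences:
        for t in range(max_order, len(seq)):
            context = tuple(seq[t - order:t])
            context_counts[context] += 1
            transition_counts[(context, seq[t])] += 1
            n_predicted += 1
    # Maximized log-likelihood: each transition uses its empirical frequency.
    log_lik = sum(
        count * math.log(count / context_counts[context])
        for (context, _), count in transition_counts.items()
    )
    n_params = (n_states ** order) * (n_states - 1)
    return -2.0 * log_lik + n_params * math.log(n_predicted)

# Toy state sequence that strictly alternates, so one previous state suffices.
sequences = [list("AB" * 20)]
max_order = 2
scores = {k: markov_bic(sequences, k, n_states=2, max_order=max_order)
          for k in range(max_order + 1)}
best_order = min(scores, key=scores.get)
print(scores, best_order)
```

The lowest BIC wins. Order 2 fits the toy sequence as well as order 1 does, but it pays a larger penalty for its extra parameters, which is the overfitting protection described above.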

Select important features then impute or first impute then select important features?

I have a dataset with lots of features (mostly categorical Yes/No features) and lots of missing values.
One of the techniques for dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute’s usage statistics to find the most informative subset of features. That is basically we can generate a large set of very shallow trees, with each tree being trained on a small fraction of the total number of attributes. If an attribute is often selected as best split, it is most likely an informative feature to retain.
I am also using an imputer to fill the missing values.
My question is about the order of these two steps. Which of them (dimensionality reduction or imputation) should be done first, and why?
From a mathematical perspective you should always avoid data imputation (in the sense of using it only if you have to). In other words, if you have a method which can work with missing values, use it (if you do not, you are left with data imputation).
Data imputation is nearly always heavily biased. This has been shown many times; I believe I even read a paper about it which is ~20 years old. In general, in order to do a statistically sound data imputation you need to fit a very good generative model. Just imputing the "most common" value, the mean value, etc. makes assumptions about the data of similar strength to those of Naive Bayes.
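One simple way to avoid imputation for Yes/No features is to keep "missing" as its own level and let the tree-based selection step decide whether it is informative. The sketch below assumes this is acceptable for your data; the names `encode_yes_no`, `CODES` and `MISSING` are hypothetical.

```python
CODES = {"yes": 1, "no": 0}
MISSING = -1  # missing kept as a separate level (an assumption of this sketch)

def encode_yes_no(value):
    """Map a Yes/No answer to an integer code without imputing missing values."""
    if value is None or value.strip() == "":
        return MISSING
    # An unexpected answer raises KeyError instead of being silently recoded.
    return CODES[value.strip().lower()]

answers = ["Yes", "no", None, ""]
encoded_answers = [encode_yes_no(v) for v in answers]
print(encoded_answers)
```

With this encoding the feature selection can run first on the raw data, and imputation is needed only if a later model cannot deal with the extra level.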

How to effectively use knn in Stata

I have two questions with executing discrim knn in Stata.
1) How do you properly code the command? I've tried various versions, but seem to always get an error that there are too many variables specified.
The vector with the correct result is buy.
I am trying: discrim knn buy, group(train test) k(1)
2) My understanding of KNN was that factor variables (binary) were fine to use, even encouraged. However, I get the error message that factor variables and time-series operators are not allowed.
Lastly, though I know this isn't the best space for this question, should each vector be normalized for knn? I've heard conflicting responses.
I'm guessing that the error you're getting is
group(): too many variables specified
This is because you can only group by 1 variable with knn. knn performs discriminant analysis based on a single grouping variable, in your case, distinguishing the training from the test. I imagine your train and test variables are binary, in which case using only one of the variables is enough, as they are merely logical opposites of each other. A single variable has enough information to distinguish the two groups.