Is it physical to consider model with number of free parameter higher than data point while model comparison in Ultranest bayesian analysis - bayesian

I am conducting a model comparison for a limited set of data using the Ultranest nested sampling bayesian analysis. I realized that for a few complex models, the number of data points is less than the free parameters in the model. Is it physical to consider such a complex model during model comparison? The complex model is fitting the data better.
Also, does the evidence value (logz) takes care of the above-mentioned problem.

Related

How to implement feature importance on nominal categorical features in tree based classifiers?

I am using SKLearn XGBoost model for my binary classification problem. My data contains nominal categorical features (such as race) for which one hot encoding should be used to feed them to the tree based models.
On the other hand, using feature_importances_ variable of XGBoost yields us the importance of each column on the trained model. So if I do the encoding and then get the features importance of columns, the result will includes names like race_2 and its importance.
What should I do to solve this problem and get a whole score for each nominal feature? Can I take the average of one hot encoded columns importance scores that belong to one feature? (like race_1, race_2 and race_3)
First of all, if your goal is to select the most useful features for later training, I would advise you to use regularization in your model. In the case of xgboost, you can tune the parameter gamma so the model would actually be more dependent on "more useful" features (i.e. tune the minimum loss reduction required for the model to add a partition leaf). Here is a good article on implementing regularization into xgboost models.
On the other hand, if you insist on doing feature importance, I would say grouping the encoded variables and simply adding them is not a good decision. This would result in feature-importance results that do not consider the relationship between these dummy variables.
My suggestion would be to take a look at the permutation tools for this. The basic idea is you take your original dataset, shuffle the values on the column in which you are going to calculate feature importance, train the model and record the score. Repeat this over different columns and the effect of each on the model performance would be a sign of their importance.
It is actually easier done than said, sklearn has this feature built-in to do for you: check out the example provided in here.

How to structure multi-output bayesian optimization

I am trying to use bayesian optimization for a multi-output problem, but am not 100% sure the best way to set it up.
I have a small number of inputs (5) and outputs (3-4) in my problem. For each output, I have a target value I would like to achieve. Ultimately I would like to minimize the MSE between the target vector (of 3-4 outputs) and the true outputs.
The simplest way to do this in my mind is to create a single model which models the MSE as a function the problem inputs. Here, all historical data is first compressed into the single MSE, then this is used to train the GP.
However, I would instead like to create individual models (or a combined, multi-output model) that directly models the outputs of interest, instead of the ultimate cost function (MSE). Primarily, this is because I have noticed more accurate predictive results (of the combined MSE), when first modeling the individual outputs, then creating a MSE, instead of directly modeling the MSE.
My problem arises when creating the acquisition function when I have multiple outputs. Ideally, I'd like to use expected improvement (EI) as my acquisition function. However, I'm not sure how to either 1) combine the multiple output distributions into a single distribution (representing the probability of the combined MSE), which can then be used to determine overall EI or 2) how to combine multiple EI values into a single metric (i.e. combine the E.I. for each output into a unified E.I.).
When reading about multi-output BO, the most common approach seems to be to identify a frontier of solutions, however this is not 100% applicable, as ultimately I can convert the output vector into a single MSE (and the frontier becomes a point).
Is the best approach simply to model the combined MSE directly? Or is there a way that I can model the individual outputs, then combine these modeled outputs into a reasonable acquisition function?

What is the reason for very high variations in val accuracy for multiple model runs?

I have a 2 layered Neural Network that I'm training on about 10000 features (genomic data) with about 100 samples in my data set. Now I realized that anytime I run my model (i.e. compile & fit) I get varying validation/testing accuracys even if I leave the train/test/validation split untouched. Sometimes its around 70% sometimes around 90%.
Due to the stochastic nature of the NN I anticipate some variation but could these strong fluctuations be a sign of something else?
The reason why you're seeing such a big instability with your validation accuracy is because your neural network is huge in comparison to the data you train it on.
Even with just 12 neurons per layer, you still have 12 * 10000 + 12 = 120012 parameters in your first layer. Now think about what the neural network does under the hood. It takes your 10000 inputs, it multiplies each input by some weight and then sums all these inputs. Now you provide it only 64 training examples on which the training algorithm is supposed to decide what are the correct input weights. Just based on intuition, from a purely combinatorial perspective there is going to be large amount of weight assignments that do well on your 64 training samples. And you have no guarantee that the training algorithm will pick such weight assignment that will also do well on your out-of-sample data.
Given neural network is able to represent a wide variety of functions (it's been proven that under certain assumptions it can approximate any function, that's called general approximation). To select the function you want you provide the training algorithm with data to constrain the space of all possible functions the network can represent to a subspace of functions that fit your data. However, such function is in no way guaranteed to represent the true underlying relationship between the input and the output. And especially if the number of parameters is larger than the number of samples (in this case by a few orders of magnitude), you're nearly guaranteed to see your network simply memorize the samples in your training data, simply because it has the capacity to do so and you haven't constrained it enough.
In other words, what you're seeing is overfitting. In NNs, the general rule of thumb is that you want at least a couple of times more samples than you have parameters (look in to the Hoeffding Inequality for theoretical rationale of this) and in effect the more samples you have, the less you're afraid of overfitting.
So here is a couple of possible solutions:
Use an algorithm that's more suitable for the case where you have high input dimension and low sample count, such as Kernel SVM (Support Vector Machine). With such a low sample count, it's quite possible that a Kernel SVM algorithm will achieve better and more consistent validation accuracy. (You can easily test this, they are available in the scikit-learn package, really easy to use)
If you insist on using NN - use regularization. Given the fact you already have working code, this will be easy, just add kernel_regularizer to all your layers, I would try both L1 and L2 regularization (probably separately). L1 regularization tends to push weights to zero so it might help reduce the number of parameters in your problem. L2 just tries to make all the weights small. Use your validation set to decide the best value for each regularization. You can optimize both for the best mean accuracy and also the lowest variance in accuracy on your validation data (do something like 20 training runs for each parameter value of L1 and L2 regularization, usually just trying different orders of magnitude is sufficient, e.g. 1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1).
If most of your input features are not really predictive or if they are highly correlated, PCA (Principal Component Analysis) can be used to project your inputs into a much lower dimensional space (e.g. from 10000 to 20), where you'd have much smaller neural network (still I'd use L1 or L2 for regularization because even then you'd have more weights than training samples)
On a final note, the point of a testing set is to use it very sparsely (ideally only once). It should be the final reported metric after all your research and model tuning is done. You should not optimize any values on it. You should do all this on your validation set. To avoid overfitting on your validation set, look into k-fold cross validation.

Is it a good idea to mix the validation / testing data with the training data?

I am working with a large dataset (e.g. large for a single machine) - with 1,000,000 examples.
I split my dataset to as follows: (80% Training Data, 10% Validation Data, 10% Testing Data). Every time I retrain the model, I shuffle the data first - such that some of the data from the validation / testing set ends up into the training set and vice versa.)
My thinking is this:
Ideally I would want all possible available data for the model to learn. The more the better - for improved accuracy.
Even though 20% of the data is dedicated to validation and testing, that is still 100,000 examples per piece - (i.e. I may potentially miss out on some crucial data that exists within the validation or testing set that the previous training set may not have accounted for.)
Shuffling prevents the training set from learning order where it is not important (at least in my particular dataset).
Here is my workflow process:
The Test Accuracy is more or less the equivalent to the Validation Accuracy (plus or minus 0.5%)
Per each retrain, the results usually ends up something like this: where the accuracy keeps improving (until it runs out of total epoch), but the validation accuracy ends up stuck at a particular percentage. I then save that model. Start the retraining process again. Shuffles data occurs. The training accuracy drops, but validation accuracy jumps up. The training accuracy improves until total epoch. The validation accuracy, converges downward (still greater than the previous run).
See Example:
I plan on doing this until the training accuracy data reaches 99%. (Note: I used Keras-Tuner to find the best architecture/model for my particular problem)
I can't help but think, that I am doing something wrong by doing this. From my perspective, this is just the model eventually learning all 1,000,000 examples. It feels like "mild overfitting" because of the shuffling per each retrain.
Is it a good idea to mix the validation / testing data with the training data?
Am I wrong by doing it this way? If so, why should I not do this method? Is there a better way to approach this?
If you mix your test/validation data with training data, you then can not evaluate your model on that data, since that data has been seen by your model. The model evaluation is done on the basis of how well it is able to make predictions/classification on data which your model has not seen (assuming that the data you are using to evaluate your model is coming from the same distribution as your training data). If you also mix your test set data with training set data, you will eventually end up with really good test set accuracy since that data has been seen by your model, but it might not perform well on new unseen data coming from the same distribution.
If you are worried size of test/validation data, I suggest you further reduce the size of your test/validation data. Use 99.9% instead of 99%. Also, the random shuffling will take care of learning almost every feature of your data.
After all, my point is, never ever evaluate your model on the data it has seen before. It will always give you better results (assuming you have trained your model well untill it memorizes the training data). The validation data is used when you have multiple algorithms/models and you need to select one algorithm/model from all those available models. Here, the validation data is used to select the model. The algo/model which gives good results on validation data is selected (again you do not evaluate your model based on validation set accuracy, it is just used for the selection of the model.) Once you have selected your model based on validation set accuracy, you then evaluate it on new unseen data (called test data) and report the prediction/classification accuracy on test data as your model accuracy.

Using unlabeled dataset in Keras

Usually, when using Keras, the datasets used to train the neural network are labeled.
For example, if I have a 100,000 rows of patients with 12 field per each row, then the last field will indicate if this patient is diabetic or no (0 or 1).
And then after training is finished I can insert a new record and predict if this person is diabetic or no.
But in the case of unlabeled datasets, where I can not label the data due to some reasons, how can I train the neural network to let him know that those are the normal records and any new record that does not match this network will be malicious or not accepted ?
This is called one-class learning and is usually done by using autoencoders. You train an autoencoder on the training data to reconstruct the data itself. The labels in this case is the input itself. This will give you a reconstruction error. https://en.wikipedia.org/wiki/Autoencoder
Now you can define a threshold where the data is benign or not, depending on the reconstruction error. The hope is that the reconstruction of the good data is better than the reconstruction of the bad data.
Edit to answer the question about the difference in performance between supervised and unsupervised learning.
This cannot be said with any certainty, because I have not tried it and I do not know what the final accuracy is going to be. But for a rough estimate supervised learning will perform better on the trained data, because more information is supplied to the algorithm. However if the actual data is quite different to the training data the network will underperform in practice, while the autoencoder tends to deal better with different data. Additionally, per rule of thumb you should have 5000 examples per class to train a neural network reliably, so labeling could take some time. But you will need some data to test anyways.
It sounds like you need fit two different models:
a model for bad record detection
a model for prediction of a patient's likelihood to be diabetic
For both of these models, you will need to have labels. For the first model your labels would indicate whether the record is good or bad (malicious) and the second would be whether the patient is diabetic or not.
In order to detect bad records, you may find that simple logistic regression or SVM performs adequately.