Bayesian Approach in Ensemble Modeling

Is ensemble modeling a Bayesian approach? I am thinking of it like this: our final model (posterior) is based on other primary models (priors). Can you give your opinions?

The question is probably better suited to Cross Validated, but I'll give you a hint.
The way you describe it, the Bayesian approach does not fit directly, because Bayes' theorem states that the posterior equals the prior times the likelihood (normalized). A final ensemble model, by contrast, is a weighted sum of individual models, and it's not clear what you would consider the likelihood that makes an ensemble Bayesian.
If you are looking for a probabilistic interpretation, here's a better one: the ensemble model represents a joint distribution of the model selector variable (what is the probability that a particular model is good for a given input) and the model distribution (the accuracy of a particular model). The better you pick both of these distributions (proper models and their weights), the better the ensemble.
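A minimal sketch of that weighted-sum view, with toy models and assumed weights (none of this comes from the question; it just illustrates the combination rule):

```python
# Two toy "models" that output a probability for the positive class.
def model_a(x):
    return 0.8 if x > 0 else 0.3

def model_b(x):
    return 0.6 if x > 1 else 0.4

def ensemble(x, weights=(0.7, 0.3)):
    # Weighted sum of individual model outputs; the weights play the role
    # of the "model selector" distribution described above.
    wa, wb = weights
    return wa * model_a(x) + wb * model_b(x)

print(ensemble(2))  # 0.7*0.8 + 0.3*0.6 = 0.74
```

Picking better constituent models (the individual outputs) and better weights is exactly what improves the ensemble.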

Related

What is the difference between optimization algorithms and Ensembling methods?

I was going through ensembling methods and was wondering: what is the difference between optimization techniques like gradient descent and ensembling techniques like bagging and boosting?
Optimization like gradient descent is a single-model approach. An ensemble, per Wikipedia, combines multiple models; the constituents of the ensemble are weighted for the overall prediction. Boosting (per Wikipedia, https://en.wikipedia.org/wiki/Ensemble_learning) is retraining with a focus on the examples a model missed (its errors).
To me this is like monocular versus binocular image recognition, the two images being an ensemble. Further scrutiny requiring extra attention to classification errors is boosting: that is, retraining on some of the errors. Perhaps the error cases were represented too infrequently to support good classifications (think black swans here). In vehicles, this could be like combining infrared, thermal, radar, and lidar sensor results for an overall classification. The link above has really good explanations of each of your areas of concern.
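The "focus on the errors" idea behind boosting can be sketched like this (a toy reweighting update, not AdaBoost's exact rule; the weights and labels are made up):

```python
def reweight(weights, correct):
    # Double the weight of misclassified examples, then renormalize,
    # so the next model in the sequence focuses on them.
    new = [w if c else 2 * w for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, False, True]  # third example was misclassified
weights = reweight(weights, correct)
print(weights)  # the misclassified example now carries 0.4 of the weight
```

Bagging, by contrast, trains each constituent independently on a random resample of the data, with no such error-driven feedback between rounds.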

Does knowledge distillation have an ensemble effect?

I don't know much about knowledge distillation, and I have one question.
There is a model showing 99% performance (10-class image classification), but I can't use a bigger model because I have to keep inference time down.
Do I get an ensemble effect if I train with knowledge distillation using another, bigger model?
Optionally, let me know if there's any other way to improve performance beyond this.
The technical answer is no. KD is a different technique from ensembling.
But they are related in the sense that KD was originally proposed to distill larger models, and the authors specifically cite ensemble models as the type of larger model they experimented on.
Net net, give KD a try with your big model as the teacher to see if you can keep much of the bigger model's performance at the smaller model's size. I have empirically found that you can retain 75%-80% of the power of a 5x larger model after distilling it down to the smaller model.
From the abstract of the KD paper:
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
https://arxiv.org/abs/1503.02531
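The core mechanic from the paper, softening the teacher's logits with a temperature so the student can learn from the relative probabilities of the wrong classes, can be sketched like this (the logits and temperature below are made-up values):

```python
import math

def softmax(logits, T=1.0):
    # Temperature T > 1 softens the distribution.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_targets(teacher_logits, T=4.0):
    # The student is trained to match these softened teacher probabilities
    # (typically alongside the usual hard-label loss).
    return softmax(teacher_logits, T)

teacher_logits = [5.0, 1.0, -1.0]
print(softmax(teacher_logits))    # near one-hot: dominated by the first class
print(kd_targets(teacher_logits)) # softer: keeps information about the other classes
```

That softened distribution is the "dark knowledge" the big model (or ensemble) passes to the small one.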

Stratified Kfold

If I am correct, stratified k-fold is used so that the dependent-variable ratio in the splits is similar to that of the original data.
What I want to understand is why it is necessary or important to retain that ratio.
Is it necessary for fraud-detection problems, where the data is highly imbalanced? If yes, why?
Taken from https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation
The Cross-validation article in the Encyclopedia of Database Systems says:
Stratification is the process of rearranging the data as to ensure each fold is a good representative of the whole. For example in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold, each class comprises around half the instances.
About the importance of the stratification, Kohavi (A study of cross-validation and bootstrap for accuracy estimation and model selection) concludes that:
stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation.
All metrics are calculated against the true labels. If there is a bias in the system, say it predicts one label more often, a fold containing more of that label would give artificially inflated results.
Stratification takes care of that by ensuring the true-label distribution is very similar in each fold, so that the aggregated results are more indicative of system performance.
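A minimal sketch of the idea, using a toy round-robin assignment per class (not sklearn's implementation; the 80/20 labels are made up to mimic an imbalanced fraud dataset):

```python
from collections import defaultdict

def stratified_folds(labels, k):
    # Assign indices to folds class by class, so every fold keeps roughly
    # the original label ratio.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

labels = [0] * 8 + [1] * 2          # 80/20 imbalance
for f in stratified_folds(labels, 2):
    print([labels[i] for i in f])   # each fold keeps exactly one positive
```

With a plain random split on data this imbalanced, one fold could easily end up with no positives at all, making its metrics meaningless.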

Predicting new values in logistic regression

I am building a logistic regression model in TensorFlow to approximate a function.
When I randomly select training and testing data from the complete dataset, I get a good result, like so (blue are training points, red are testing points, and the black line is the predicted curve):
But when I select spatially separate testing data, I get a terrible predicted curve, like so:
I understand why this is happening. But shouldn't a machine learning model learn these patterns and predict new values?
Similar thing happens with a periodic function too:
Am I missing something trivial here?
P.S. I did google this query for quite some time but was not able to get a good answer.
Thanks in advance.
What you are trying to do here is not related to logistic regression. Logistic regression is a classifier and you are doing regression.
No, machine learning systems aren't smart enough to learn to extrapolate functions like you have here. When you fit the model you are telling it to find an explanation for the training data. It doesn't care what the model does outside the range of training data. If you want it to be able to extrapolate then you need to give it extra information. You could set it up to assume that the input belonged to a sine wave or a quadratic polynomial and have it find the best fitting one. However, with no assumptions about the form of the function you won't be able to extrapolate.
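To sketch that "extra information" point: if you assume the data comes from a specific form, say y = a·x², then fitting only a lets you extrapolate far outside the training range (the data below is made up to match that form exactly):

```python
def fit_quadratic(xs, ys):
    # Closed-form least squares for the assumed model y = a * x**2.
    num = sum(x * x * y for x, y in zip(xs, ys))
    den = sum(x ** 4 for x in xs)
    return num / den

xs = [1, 2, 3]
ys = [2, 8, 18]          # generated by y = 2 * x**2
a = fit_quadratic(xs, ys)
print(a * 10 ** 2)       # extrapolates to x = 10 -> 200.0
```

A flexible model with no such assumption has no reason to produce anything sensible at x = 10, which is exactly what the bad curve in the question shows.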

One class classification - interpreting the models accuracy

I am using LIBSVM for classification of data, mainly one-class classification.
My training set consists of data from only one class, and my testing set consists of data from two classes (one that belongs to the target class and one that doesn't).
After applying svmtrain and svmpredict to both datasets, the accuracy is 48% on the training set and 34.72% on the testing set.
Is that good? How can I know whether LIBSVM is classifying the datasets correctly?
Whether it is good or not depends entirely on the data you are trying to classify. You should look up the state-of-the-art accuracy of SVM models for your kind of classification problem; then you will be able to tell whether your model is good.
What I can say from your results is that the testing accuracy is worse than the training accuracy, which is normal, as a classifier usually performs better on data it has already seen.
What you can try now is to play with the regularization parameter (C if you are using a linear kernel) and see if the performance improves on the testing set.
You can also trace learning curves to see if your classifier overfit or not, which will help you choose if you need to increase or decrease the regularization.
In your case, you might also want to apply class weighting, as this kind of data is often heavily imbalanced in favor of negative examples.
To know whether Libsvm is classifying the dataset correctly you can look at which examples it predicted correctly and which ones it predicted incorrectly. Then you can try to change your features to improve its results.
If you are worried about your code being correct, you can try to code a toy example and play with it or use an example of someone on the web and replicate their results.
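One way to do that inspection, sketched with made-up labels: tally where predictions and true labels agree and disagree, then derive accuracy from the counts.

```python
def confusion(y_true, y_pred):
    # Count true/false positives and negatives for a binary problem.
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["tp"] += 1
        elif t == 0 and p == 1:
            counts["fp"] += 1
        elif t == 1 and p == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
c = confusion(y_true, y_pred)
print(c)  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 2}
accuracy = (c["tp"] + c["tn"]) / len(y_true)  # 4/6
```

For an imbalanced one-class setup like yours, the fn/fp breakdown is far more informative than the single accuracy number LIBSVM prints.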