What is the difference between optimization algorithms and ensembling methods? - optimization

I was going through ensembling methods and was wondering what the difference is between optimization techniques such as gradient descent and ensembling techniques such as bagging and boosting.

Optimization like gradient descent is a single-model approach: it adjusts the parameters of one model to minimize a loss. An ensemble, per Wikipedia, combines multiple models, and the constituents are weighted for the overall prediction. Boosting (per Wikipedia, https://en.wikipedia.org/wiki/Ensemble_learning) seems to be retraining with a focus on the examples an earlier model got wrong.
To me this is like single image recognition in a monocular fashion vs. binocular image recognition, the two images being an ensemble. Boosting is the further scrutiny that pays extra attention to classification errors, that is to say, retraining on some of the errors. Perhaps the error cases were represented too infrequently in the data to be classified well (thinking black swan here). In vehicles, an ensemble could be like combining infrared, thermal, radar and lidar sensor results for an overall classification. The link above has really good explanations of each of your areas of concern.
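To make the distinction concrete, here is a minimal sketch in plain Python. The function names, learning rate, step count and toy data are all assumptions chosen for illustration. `fit_line_gd` uses gradient descent to train one model, while `bagged_predict` trains several such models on bootstrap resamples and averages their predictions. Note that the ensemble still relies on an optimizer internally: the two ideas work at different levels rather than competing.

```python
import random

def fit_line_gd(xs, ys, lr=0.05, steps=1000):
    """Fit y = w*x + b by gradient descent on mean squared error (one model)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def bagged_predict(xs, ys, x_new, n_models=10, seed=0):
    """Bagging: fit several models on bootstrap resamples, average predictions."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap resample: draw len(xs) indices with replacement
        idx = [rng.randrange(len(xs)) for _ in xs]
        w, b = fit_line_gd([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(w * x_new + b)
    return sum(preds) / len(preds)

# Toy data (assumed): y = 3x + 1 plus a little Gaussian noise
data_rng = random.Random(42)
xs = [i / 10 for i in range(30)]
ys = [3.0 * x + 1.0 + data_rng.gauss(0, 0.1) for x in xs]

w, b = fit_line_gd(xs, ys)                       # optimization: one model
ensemble_pred = bagged_predict(xs, ys, 1.0)      # ensembling: many models
print(w, b, ensemble_pred)
```

Boosting would differ from the bagging loop above in that each new model is trained with extra weight on the examples the previous ones got wrong, instead of on an independent resample.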

Related

Does knowledge distillation have an ensemble effect?

I don't know much about knowledge distillation.
I have one question.
I have a model that reaches 99% performance (10-class image classification), but I can't use a bigger model because I have to keep the inference time.
Would I get an ensemble effect if I trained it with knowledge distillation from another, bigger model?
-------option-------
Alternatively, let me know if there is any other way to improve performance.
The technical answer is no. KD is a different technique from ensembling.
But they are related in the sense that KD was originally proposed to distill larger models, and the authors specifically cite ensemble models as the type of larger model they experimented on.
Net net, give KD a try on your big model to see if you can keep a lot of the performance of the bigger model but with the size of the smaller model. I have empirically found that you can retain 75%-80% of the power of a 5x larger model after distilling it down to the smaller model.
From the abstract of the KD paper:
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
https://arxiv.org/abs/1503.02531
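For a feel of what training with KD involves, here is a minimal sketch of a distillation loss in plain Python. It follows the general recipe in the paper linked above: soften both teacher and student outputs with a temperature, then combine the soft-target term with the usual hard-label cross-entropy. The function names, the `alpha` weighting, and the default values are assumptions for illustration. The `temperature ** 2` factor is the scaling the paper suggests when both terms are used together.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of soft-target and hard-label cross-entropy (assumed weights)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # Cross-entropy between the softened teacher and student distributions
    soft_ce = -sum(pt * math.log(ps)
                   for pt, ps in zip(p_teacher, p_student_soft))
    # Ordinary cross-entropy against the true label at temperature 1
    hard_ce = -math.log(softmax(student_logits)[true_label])
    # temperature**2 keeps the soft term's gradient scale comparable
    return alpha * temperature ** 2 * soft_ce + (1 - alpha) * hard_ce

teacher = [5.0, 1.0, 0.0]   # hypothetical teacher logits for one image
agreeing = distillation_loss([5.0, 1.0, 0.0], teacher, true_label=0)
disagreeing = distillation_loss([0.0, 1.0, 5.0], teacher, true_label=0)
print(agreeing < disagreeing)
```

The student is a single model at inference time, which is why KD keeps the latency of the small model even when the teacher is an ensemble.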

Why do machine learning algorithms focus on speed and not accuracy?

I study ML, and I see that most of the time the focus of the algorithms is run time rather than accuracy: reducing features, sampling from the data set, using approximations and so on.
I'm not sure why that is the focus. Once I have trained my model I don't need to train it again if my accuracy is high enough, so whether training takes 1 hour or 10 days does not really matter: I do it only once, and my goal is to predict my outcomes as well as I can (minimum loss).
If I train a model to tell cats from dogs, I want it to be as accurate as it can be, not the fastest, since once I have trained this model I don't need to train any more models.
I can understand why models that depend on fast-changing data need this focus on speed, but for general model training I don't understand why the focus is on speed.
Speed is a relative term. Accuracy is also relative, depending on the difficulty of the task. Currently the goal is to achieve human-like performance at reasonable cost, because this will replace human labor and cut costs.
From what I have seen in reading papers, people usually focus on accuracy first to produce something that works. They then do ablation studies (studies where pieces of the model are removed or modified) to achieve the same performance with less time or memory.
The field is very experimentally validated. There really isn't much of a theory that states why CNNs work so well, other than that a network with non-linear activation functions can approximate a very wide class of functions (https://en.wikipedia.org/wiki/Universal_approximation_theorem). There have been some recent efforts to explain why they work well. One I recall is MobileNetV2: Inverted Residuals and Linear Bottlenecks. Its explanation of embedding data into a low-dimensional space without losing information might be worth reading.

Stratified Kfold

If I am correct, stratified k-fold is used so that the ratio of the dependent variable in each split is similar to that in the original data.
What I want to understand is why it is necessary or important to retain that ratio.
Is it necessary for fraud detection problems, where the data is highly imbalanced?
If yes, why?
Taken from https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation
Cross-validation article in Encyclopedia of Database Systems says:
Stratification is the process of rearranging the data as to ensure each fold is a
good representative of the whole. For example in a binary classification problem
where each class comprises 50% of the data, it is best to arrange the data such
that in every fold, each class comprises around half the instances.
About the importance of the stratification, Kohavi (A study of cross-validation
and bootstrap for accuracy estimation and model selection) concludes that:
stratification is generally a better scheme, both in terms of bias and variance,
when compared to regular cross-validation.
All metrics are calculated against the true labels. If there is a bias in the system, say it predicts one label more often, a fold containing more of that label would give artificially high results.
One way to take care of that is to ensure the distribution of true labels is very similar in each fold. The aggregated results are then more indicative of system performance.
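As a rough illustration of what stratification does, here is a minimal pure-Python sketch. The function name and the toy fraud dataset are assumptions; a library routine such as scikit-learn's `StratifiedKFold` would normally be used in practice. The sketch groups sample indices by class, then deals each class out across the folds so that every fold ends up with close to the original class ratio.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's indices round-robin so each fold gets its share
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical imbalanced dataset: 5% "fraud" (1), 95% "legitimate" (0)
labels = [1] * 10 + [0] * 190
folds = stratified_folds(labels, k=5)
fraud_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(fraud_per_fold)  # -> [2, 2, 2, 2, 2]
```

With a plain random split of the same data, a fold could easily receive zero or four fraud cases, and a metric such as recall computed on that fold would be far noisier, which is why stratification matters most when one class is rare.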

Image classification / detection - Objects being used in real life vs. stock photo images?

When training detection models, are images of an object being used in real life better (i.e. higher accuracy / mAP) than stock photos of the same object?
The more variety the better. If you train a network on images that all have a white background and expect it to perform under conditions with noisy backgrounds, you should expect worse results on unseen data, because the network never had a chance to learn features that distinguish the target object from background objects.
If you have images with transparent backgrounds one form of data augmentation that would be expected to improve results would be to place that image against many random backgrounds. The closer you come to realistic renderings of an image the better you can expect your results to be.
The more realistic examples you can augment your training dataset with, the better. Note that adding random noise to your data just to generate a larger training dataset generally does not help; augmentation only improves results when the expanded dataset contains realistic variants of the original images.
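The background-swapping augmentation can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the function names, the use of grayscale images stored as nested lists of floats in [0, 1], and above all the randomly generated backgrounds, which are only a stand-in. In line with the caveat about random noise, real use would paste the object onto actual background photographs.

```python
import random

def composite(foreground, alpha, background):
    """Alpha-blend a foreground over a background, pixel by pixel.

    All three arguments are 2D lists of the same shape; alpha is 1.0 where
    the object is opaque and 0.0 where it is transparent.
    """
    return [
        [a * f + (1.0 - a) * b for f, a, b in zip(f_row, a_row, b_row)]
        for f_row, a_row, b_row in zip(foreground, alpha, background)
    ]

def placeholder_background(height, width, rng):
    """Stand-in background; replace with crops of real photos in practice."""
    return [[rng.random() for _ in range(width)] for _ in range(height)]

def augment(foreground, alpha, n_variants, seed=0):
    """Produce n_variants copies of the object, each on a different background."""
    rng = random.Random(seed)
    height, width = len(foreground), len(foreground[0])
    return [
        composite(foreground, alpha, placeholder_background(height, width, rng))
        for _ in range(n_variants)
    ]

# Tiny 2x2 "object": opaque on the diagonal, transparent elsewhere
obj = [[0.5, 0.5], [0.5, 0.5]]
mask = [[1.0, 0.0], [0.0, 1.0]]
variants = augment(obj, mask, n_variants=3)
print(len(variants), variants[0][0][0])  # -> 3 0.5
```

The object pixels stay identical across variants while everything around them changes, so the only consistent signal left for the network to learn from is the object itself.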
My motto when training neural networks is this: the network will cheat any chance it gets. It will learn impressively well, but given the opportunity, it will take shortcuts. Don't let it take shortcuts. That often translates to: make the problem harder, so that no shortcut exists for it to take. Neural networks often perform better under more difficult conditions because the simplest solution they can arrive at is then also the most general-purpose. Read up on multi-task learning for some exciting examples that provide great food for thought.

Is it possible to train Neural Network with low amount of instances?

I ran into a problem where I needed to solve a regression task using as few instances as possible. When I tried XGBoost, I had to feed it 4 instances to get a reasonable result. But a multilayer perceptron tuned for regression problems needs 20 instances; I tried changing the number of neurons and layers, but the answer is still 20. Is it possible to do something to make a neural network solve regression tasks with 2 to 4 instances? If yes, please explain what I should do to succeed. Maybe there is some correlation between how many instances are needed to train a perceptron and get reasonable results, and how informative the features in the dataset are?
Thanks in advance for any help
With small numbers of samples, there are likely better methods to apply; XGBoost definitely comes to mind as a method that does quite well at avoiding overfitting.
Neural networks tend to work well with larger numbers of samples. They often overfit small datasets and underperform other algorithms.
There is, however, an active area of research in semi-supervised techniques using neural networks with large datasets of unlabeled data and small datasets of labeled samples.
Here's a paper to start you down that path, search on 'semi supervised learning'.
http://vdel.me.cmu.edu/publications/2011cgev/paper.pdf
Another area of interest for reducing overfitting on smaller datasets is multi-task learning.
http://ruder.io/multi-task/
Multi task learning requires the network to achieve multiple target goals for a given input. Adding more requirements tends to reduce the space of solutions that the network can converge on and often achieves better results because of it. To say that differently: when multiple objectives are defined, the parameters necessary to do well at one task are often beneficial for the other task and vice versa.
Lastly, another area of open research is GANs and how they might be used in semi-supervised learning. No papers pop to the forefront of my mind on the subject just now, so I'll leave this mention as a footnote.