Machine Learning Classification ensemble - ensemble-learning

I have a question regarding ensembles of classifiers in machine learning. I have read about ensemble methods, but I couldn't relate them to what I have in mind.
If I have a few classifiers for a multi-class problem, and some classifiers perform better on certain classes than others, how can I take advantage of this characteristic in my ensemble?
For example:
Classifier A scores a higher F1 on class 1 than the other classifiers do.
Classifier B scores a higher F1 on class 2 than the other classifiers do.
Classifier C scores a higher F1 on class 3 than the other classifiers do.
How might I build an ensemble that gives more weight to the class 1 probabilities from classifier A and reduces the rest?
I am thinking of a simple two-layer approach.
Layer 1: for each classifier, weight its predicted probabilities across the classes based on its per-class performance, then normalise.
Layer 2: weight each classifier based on its overall F1 performance.
Would this make sense?
Layer 1 (example for model A, with per-class weights 0.25 / 0.5 / 0.25):
Model A      Class 1       Class 2      Class 3
Original     0.2           0.5          0.3
Weight       0.2 * 0.25    0.5 * 0.5    0.3 * 0.25
Equals       0.05          0.25         0.075
Normalise    0.133         0.667        0.2
Layer 2 (classifier weights 0.3 / 0.5 / 0.2):
Models   Class 1        Class 2        Class 3
A        0.3 * 0.133    0.3 * 0.667    0.3 * 0.2
B        0.5 * blah     0.5 * blah     0.5 * blah
C        0.2 * blah     0.2 * blah     0.2 * blah
Avg      Avg            Avg            Avg
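A minimal numpy sketch of the two-layer scheme above (the weights are the made-up numbers from the tables, and the probability rows for models B and C are placeholders, since they are left as "blah"):

import numpy as np

# Probability outputs of models A, B, C for one sample (B and C made up).
probs = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.1, 0.3],
                  [0.3, 0.3, 0.4]])

# Layer 1: per-class weights for each classifier (e.g. from per-class F1).
class_w = np.array([[0.25, 0.50, 0.25],
                    [0.50, 0.25, 0.25],
                    [0.25, 0.25, 0.50]])
weighted = probs * class_w
weighted /= weighted.sum(axis=1, keepdims=True)  # row-normalise per model

# Layer 2: one weight per classifier (e.g. from overall F1); they sum to 1,
# so the weighted sum below is exactly the weighted average of the rows.
model_w = np.array([0.3, 0.5, 0.2])
ensemble = (model_w[:, None] * weighted).sum(axis=0)
print(ensemble, ensemble.argmax())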
Thank you.

While you could, in theory, weight the outputs of different models differently and get a better result, how could you weight them differently based on the class, considering that the task is to find the class in the first place?
You can weight some classifiers globally. Also, once a class has been determined (i.e. enough classifiers agreed that the class is X), you can reweight the probabilities (e.g. if the ensemble said the probability is 60% and some classifiers that are trusted on this class reported a high probability, you can raise it to 80%). However, I am not sure this would help in practice.
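A rough sketch of that reweighting idea (the function name, the trust table, the confidence cut-off, and the boost amount are all hypothetical):

import numpy as np

def reweight(ensemble_probs, member_probs, trusted_by_class, boost=0.2):
    # ensemble_probs: (n_classes,) averaged probabilities.
    # member_probs: (n_members, n_classes) individual model outputs.
    # trusted_by_class: map from class index to trusted member indices.
    pred = int(np.argmax(ensemble_probs))
    members = trusted_by_class.get(pred, [])
    # Boost the winning class only if the members trusted on it are confident.
    if members and member_probs[members, pred].mean() > 0.8:
        out = ensemble_probs.copy()
        out[pred] = min(1.0, out[pred] + boost)
        return out / out.sum()
    return ensemble_probs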

Related

Can SigmoidFocalCrossEntropy in Tensorflow (tf-addons) be used in multiclass classification? (What is the right way?)

The focal loss provided in Tensorflow is used for class imbalance. For binary classification there is plenty of example code available, but there is very little help for multiclass classification. I ran the code below with one-hot encoded target variables for 250 classes and it gave me results without any error.
import pandas as pd
import tensorflow_addons as tfa

y = pd.get_dummies(df['target'])  # one-hot encode the target classes
model.compile(
    optimizer="adam", loss=tfa.losses.SigmoidFocalCrossEntropy(), metrics=metric
)
I just want to know, from whoever wrote this code or someone with enough knowledge of it: can it be used for multiclass classification? If not, how come it did not give me errors, and in fact better results than cross-entropy? Also, in other implementations like this one, a value of alpha has to be given for every class, but there is just one value in Tensorflow's implementation.
What is the correct way to use this?
Some basics first.
Categorical cross-entropy is designed to incentivize a model to predict 100% for the correct label. It was designed for single-label multi-class classification - like CIFAR-10 or ImageNet. These models usually finish in a Dense layer with more than one output.
Binary cross-entropy is designed to incentivize a model to predict 100% if the label is one, or 0% if the label is zero. These models usually finish in a Dense layer with exactly one output.
When you apply binary cross-entropy to a single-label multi-class classification problem, you are doing something that is mathematically valid but defines a slightly different task: you are incentivizing a single-label classification model not only to get the true label correct, but also to drive the false labels toward zero.
For example, if your target is dog and your model predicts 60% dog, CCE doesn't care whether your model predicts 20% cat and 20% French horn, or 40% cat and 0% French horn. So it is aligned with a top-1 accuracy notion.
But if you take that same model and apply BCE, and your model predicts 60% dog, BCE DOES care whether your model predicts 20%/20% cat/French horn vs 40%/0% cat/French horn. To put it in precise terminology, the former is more "calibrated", so it has some additional measure of goodness. However, this has little correlation with top-1 accuracy.
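A quick numpy illustration of the difference (hand-rolled losses for a single sample, not the Keras implementations):

import numpy as np

eps = 1e-12

def cce(probs, true_idx):
    # Categorical cross-entropy looks only at the true class's probability.
    return -np.log(probs[true_idx] + eps)

def bce(probs, true_idx):
    # Binary cross-entropy also penalizes mass placed on the false labels.
    t = np.zeros_like(probs)
    t[true_idx] = 1.0
    return -np.sum(t * np.log(probs + eps) + (1 - t) * np.log(1 - probs + eps))

a = np.array([0.6, 0.2, 0.2])  # 60% dog, 20% cat, 20% French horn
b = np.array([0.6, 0.4, 0.0])  # 60% dog, 40% cat, 0% French horn
print(cce(a, 0), cce(b, 0))    # identical (~0.511): CCE sees only the dog entry
print(bce(a, 0), bce(b, 0))    # different (~0.957 vs ~1.022): BCE sees all three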
When you use BCE, presumably you are spending some of the model's capacity on calibration at the expense of top-1 accuracy. But as you might have seen, it doesn't always work out that way; sometimes BCE gives you superior results. I don't know of a clear explanation for that, but I'd assume the additional signals (in the case of ImageNet, you literally get 1000 times more of them) somehow create a smoother loss that perhaps helps smooth the gradients you receive.
Focal loss additionally lessens the penalty when your model predicts something close to the right answer - like predicting 90% cat when the ground truth is cat - via its gamma focusing parameter, while alpha acts as a class-balancing weight. This is a shift from the original definition of CCE, which is based on the theory of Maximum Likelihood Estimation... which focuses on calibration... vs the metric most ML practitioners actually care about: top-1 accuracy.
Focal loss was originally designed for binary classification, so the original formulation only has a single alpha value. The repo you pointed to extends focal loss to single-label multi-class classification, and therefore has multiple alpha values: one per class. However, by my reading, it loses the possible additional smoothing effect of BCE.
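For reference, the binary focal loss from the original paper (Lin et al.) is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); a tiny numpy sketch of how the focusing term behaves:

import numpy as np

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    # p: predicted probability of the positive class; y: 0 or 1 label.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# The (1 - p_t)^gamma factor crushes the loss for easy, confident examples:
print(binary_focal_loss(0.9, 1))  # ~0.0003: nearly-right prediction, tiny loss
print(binary_focal_loss(0.1, 1))  # ~0.47: very wrong prediction, near-full loss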
Net net, for the best results you'll want to benchmark CCE, BCE, binary focal loss (from TFA, per the original paper), and the single-label multi-class focal loss you found in that repo. In general, those alpha values are discovered via guess-and-check or grid search.
There's a lot of manual guessing and checking in ML unfortunately.

Tensorflow & Keras prediction threshold

What is the threshold value that is used by TF by default to classify an input image as being a certain class?
For example, say I have 3 classes 0, 1, 2, and the labels for images are one-hot encoded like so: [1, 0, 0], meaning this image has label of class 0.
Now when a model outputs a prediction after softmax like this one: [0.39, 0.56, 0.05] does TF use 0.5 as the threshold so the class it predicts is class 1?
What if all the predictions were below 0.5 like [0.33, 0.33, 0.33] what would TF say the result is?
And is there any way to specify a new threshold, for example 0.7, and ensure TF says that a prediction is wrong if no class prediction is above that threshold?
Also, would this logic carry over to the inference stage, so that if the network is uncertain of the class it will refuse to give a classification for the image?
when a model outputs a prediction after softmax like this one: [0.39, 0.56, 0.05] does TF use 0.5 as the threshold so the class it predicts is class 1?
No. There is no threshold involved here. Tensorflow (and any other framework, for that matter) will just pick the maximum one (argmax); the result here (class 1) would be the same even if the probabilistic output was [0.33, 0.34, 0.33].
You seem to erroneously believe that a probability value of 0.5 has some special significance in a 3-class classification problem; it has not: a probability value of 0.5 is "special" only in a binary classification setting (and a balanced one, for that matter). In an n-class setting, the respective "special" value is 1/n (here 0.33), and by definition, there will always be some entry in the probability vector greater than or equal to this value.
What if all the predictions were below 0.5 like [0.33, 0.33, 0.33] what would TF say the result is?
As already implied, there is nothing strange or unexpected with all probabilities being below 0.5 in an n-class problem with n>2.
Now, if all the probabilities happen to be equal, as in the example you show (although highly improbable in practice, the question is valid, at least in theory), such ties should ideally be resolved randomly (i.e. pick a class at random); in practice, since this stage is usually handled by Numpy's argmax method, the prediction will be the first class (i.e. class 0), which is not difficult to demonstrate:
import numpy as np
x = np.array([0.33, 0.33, 0.33])
np.argmax(x)
# 0
due to how such cases are handled by Numpy - from the argmax docs:
In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.
To your next question:
is there any way to specify a new threshold for example 0.7 and ensure TF says that a prediction is wrong if no class prediction is above that threshold?
Not in Tensorflow (or any other framework) itself, but this is always something that can be done in a post-processing stage during inference: irrespective of what is actually returned by your classifier, it is always possible to add some extra logic such that whenever the max probability value is less than a threshold, your system (i.e. your model plus the post-processing logic) returns something like "I don't know / I am not sure / I can't answer". But again, this is external to Tensorflow (or any other framework used) and to the model itself, and it can be used only during inference, not during training (it doesn't make sense during training anyway, because there only the predicted class probabilities are used, not hard classes).
In fact, we implemented such a post-processing module in a toy project some years ago - an online service to classify dog breeds from images: when the max probability returned by the model was less than a threshold (which was the case, say, when the model was presented with an image of a cat instead of a dog), the system was programmed to respond with the question "Are you sure this is a dog?", instead of being forced to make a prediction among the predefined dog breeds...
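A minimal sketch of such a wrapper, assuming a Keras-style model with a softmax output (the threshold value and the abstention message are application choices, not framework settings):

import numpy as np

REJECT_THRESHOLD = 0.7  # application-level choice, external to the model

def predict_or_abstain(model, x, class_names):
    probs = model.predict(x)[0]  # softmax probabilities for one sample
    best = int(np.argmax(probs))
    if probs[best] < REJECT_THRESHOLD:
        return "I am not sure"   # abstain instead of forcing a class
    return class_names[best]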
The threshold is used in the case of binary classification or multi-label classification. In multi-class classification you use argmax: the class with the highest activation is your output class. All classes rarely equal each other; if the model is trained well, there should be one dominant class.

What is the impact from choosing auc/error/logloss as eval_metric for XGBoost binary classification problems?

How does choosing auc, error, or logloss as the eval_metric for XGBoost impact its performance? Assume data are unbalanced. How does it impact accuracy, recall, and precision?
Choosing between different evaluation metrics doesn't directly impact the performance. Evaluation metrics are there for the user to evaluate their model; accuracy is one such metric, and so are precision and recall. The objective function, on the other hand, is what actually drives training and thus affects all of those evaluation metrics.
For example, if one classifier yields a probability of 0.7 for label 1 and 0.3 for label 0, and a different classifier yields 0.9 for label 1 and 0.1 for label 0, the error (log loss) will differ between them, even though both classify the label correctly.
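Concretely, the log loss on the correct label works out to:

import numpy as np
print(-np.log(0.7))  # ~0.357: correct, but less confident
print(-np.log(0.9))  # ~0.105: correct and more confident, so a lower loss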
Personally, most of the time I use ROC AUC to evaluate a binary classification, and if I want to look deeper, I look at a confusion matrix.
When dealing with unbalanced data, one needs to know how unbalanced it is: a 30%-70% ratio, or a 0.1%-99.9% ratio? I've read an article arguing that precision-recall is a better evaluation for highly unbalanced data.
Here is some more reading material:
Handling highly imbalanced classes and why the Receiver Operating Characteristic curve (ROC curve) should not be used, and the precision/recall curve should be preferred, in highly imbalanced situations
ROC and precision-recall with imbalanced datasets
The only way the evaluation metric can impact your model's accuracy (or other evaluation metrics) is when using early_stopping. early_stopping decides when to stop training additional boosters according to your evaluation metric.
early_stopping was designed to prevent over-fitting.
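A minimal sketch of how eval_metric interacts with early stopping in XGBoost (synthetic data and arbitrary parameter values, just to show the mechanism):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

params = {"objective": "binary:logistic", "eval_metric": "auc"}
booster = xgb.train(
    params, dtrain, num_boost_round=1000,
    evals=[(dvalid, "validation")],
    early_stopping_rounds=50,  # stop once validation AUC stops improving
)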

Is it possible to give more weight to a class with less data in object detection while training?

Suppose I have about 250000 images in total with 5 classes, and the actual number of objects across all the photos is as follows:
dog 20000
cat 17000
hen 16000
cow 5000
ox 2000
In such a problem, the accuracy per class after training the model with TensorFlow is as follows:
dog 90
cat 60
hen 50
cow 0.02
ox 0.15
Is it possible to increase the accuracy of the cow and ox classes?
Can we train the model so that more weight is given to the classes with fewer examples in TensorFlow?
I am using transfer learning: Faster R-CNN with ResNet-101
The most typical approach is to change your sampling ratio so that each batch you train on samples each class at the same rate. This means you will oversample the under-represented classes as you train. A reasonably easy way to do this in Tensorflow is to create a separate Dataset object for each class and then merge them back together, e.g. with tf.data.Dataset.interleave, such that they are evenly sampled across classes.
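A sketch of that balanced-sampling idea; here I reach for tf.data.experimental.sample_from_datasets (a close cousin of the interleave approach), and the per-class datasets ds_dog, ds_cat, etc. are assumed to already exist:

import tensorflow as tf

# Hypothetical per-class datasets; each is repeated so the rare classes
# (cow, ox) can be drawn from indefinitely, i.e. oversampled.
per_class = [ds.repeat() for ds in (ds_dog, ds_cat, ds_hen, ds_cow, ds_ox)]

balanced = tf.data.experimental.sample_from_datasets(
    per_class, weights=[1 / 5] * 5  # equal sampling rate per class
)
balanced = balanced.shuffle(1024).batch(32)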
It is also possible to multiply your loss by a constant vector; this would effectively change the learning rate per class. But I'd go with the former option myself.

Tensorflow Loss for Non-Independent Classes

I am using a Tensorflow network for classification between classes that are similar to their neighboring classes, i.e. not independent. For example, say we want to predict among 10 classes, but the predictions are not merely "correct" or "incorrect": if the correct class is 7 and the network predicts 6, the loss should be less than if the network had predicted 5, because 6 is closer to the correct answer. My understanding is that cross entropy with 1-hot vectors provides an "all or nothing" loss rather than a "continuous" loss that reflects the magnitude of the error. If that is correct, how does one implement such a continuous loss in Tensorflow?
-- Update June 13 2016 ----
An example application might be color recognition. If the network predicts "green" but the true color is yellow-green, then the loss should be less than if the network had predicted blue, because green is a better prediction than blue.
You can choose to implement a continuous quantity (e.g. hue from HSV) as a single output and construct your own loss calculation that reflects what you want to optimize. In that case you'd just have a single output value ranging between 0.0 and 1.0, and the loss would be evaluated based on the distance from the labeled value.
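A sketch of what that could look like in Keras, assuming the single output is a hue-like value in [0, 1] (the wrap-around handling is my addition, since hue is circular; drop it for a plain linear quantity):

import tensorflow as tf

def circular_distance_loss(y_true, y_pred):
    # Both values live in [0, 1]; hue wraps around, so take the shorter arc.
    diff = tf.abs(y_true - y_pred)
    return tf.square(tf.minimum(diff, 1.0 - diff))

# model.compile(optimizer="adam", loss=circular_distance_loss)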