What is the threshold value that is used by TF by default to classify an input image as being a certain class?
For example, say I have 3 classes 0, 1, 2, and the labels for images are one-hot encoded like so: [1, 0, 0], meaning this image has label of class 0.
Now when a model outputs a prediction after softmax like this one: [0.39, 0.56, 0.05] does TF use 0.5 as the threshold so the class it predicts is class 1?
What if all the predictions were below 0.5 like [0.33, 0.33, 0.33] what would TF say the result is?
And is there any way to specify a new threshold for example 0.7 and ensure TF says that a prediction is wrong if no class prediction is above that threshold?
Also would this logic carry over to the inference stage too where if the network is uncertain of the class then it will refuse to give a classification for the image?
when a model outputs a prediction after softmax like this one: [0.39, 0.56, 0.05] does TF use 0.5 as the threshold so the class it predicts is class 1?
No. There is not any threshold involved here. Tensorflow (and any other framework, for that matter) will just pick up the maximum one (argmax); the result here (class 1) would be the same even if the probabilistic output was [0.33, 0.34, 0.33].
You seem to erroneously believe that a probability value of 0.5 has some special significance in a 3-class classification problem; it has not: a probability value of 0.5 is "special" only in a binary classification setting (and a balanced one, for that matter). In an n-class setting, the respective "special" value is 1/n (here 0.33), and by definition, there will always be some entry in the probability vector greater than or equal to this value.
What if all the predictions were below 0.5 like [0.33, 0.33, 0.33] what would TF say the result is?
As already implied, there is nothing strange or unexpected with all probabilities being below 0.5 in an n-class problem with n>2.
Now, if all the probabilities happen to be equal, as in the example you show (although highly improbable in practice, the question is valid, at least in theory), ideally, such ties should be resolved randomly (i.e. pick a class in random); in practice, since usually this stage is handled by the argmax method of Numpy, the prediction will be the first class (i.e. class 0), which is not difficult to demonstrate:
import numpy as np
x = np.array([0.33, 0.33, 0.33])
np.argmax(x)
# 0
due to how such cases are handled by Numpy - from the argmax docs:
In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.
To your next question:
is there any way to specify a new threshold for example 0.7 and ensure TF says that a prediction is wrong if no class prediction is above that threshold?
Not in Tensorflow (or any other framework) itself, but this is always something that can be done in a post-processing stage during inference: irrespectively of what is actually returned by your classifier, it is always possible to add some extra logic such that whenever the max probability value is less that a threshold, your system (i.e. your model plus the post-processing logic) returns something like "I don't know / I am not sure / I can't answer". But again, this is external to Tensorflow (or any other framework used) and the model itself, and it can be used only during inference and not during training (in any case, it doesn't make sense during training, because during training only predicted class probabilities are used, and not hard classes).
In fact, we had implemented such a post-processing module in a toy project some years ago, which was an online service to classify dog races from images: when the max probability returned by the model was less than a threshold (which was the case, say, when the model was presented with an image of a cat instead of a dog), the system was programmed to respond with the question "Are you sure this is a dog"?, instead of being forced to make a prediction among the predefined dog races...
the threshold is used in the case of binary classification or multilabel classification, in the case of multi class classification you use argmax, basically the class with the highest activation is your output class, all classes rarely equal each other, if the model is trained well there should be one dominant class
Related
Focal Loss given in Tensorflow is used for class imbalance. For Binary class classification, there are a lots of codes available but for Multiclass classification, a very little help is there. I ran the code with One Hot Encoded target variables of 250 classes and it gave me results without any error.
y = pd.get_dummies(df['target']) # One hot encoded target classes
model.compile(
optimizer="adam", loss=tfa.losses.SigmoidFocalCrossEntropy(), metrics= metric
)
I just want to know whoever wrote this code or someone having enough knowledge of this code, can it be used be used for Multiclass Classification. If no then how come it did not give me errors, instead better results than CrossEntropy. Also, in other implementations like this one, the value of alpha has to be given for every class but just one value in Tensorflow's implementations.
What is the correct way to use this?
Some basics first.
Categorical Crossentropy is designed to incentivize a model a model to predict 100% for the correct label. It was designed for models that predict single-label multi-class classification - like CIFAR10 or Imagenet. Usually these models finish in a Dense layer with more than one output.
Binary Crossentropy is designed to incentivize a model to predict 100% if the label is one, or, 0% is the label is zero. Usually these models finish in a Dense layer with exactly one output.
When you apply Binary Crossentropy to a single-label multi-class classification problem, you are doing something that is mathematically valid but defines a slightly different task: you are incentivizing a single-label classification model to not only get the true label correct, but also minimize the false labels.
For example, if your target is dog, and your model predict 60% dog, CCE doesn't care if your model predicts 20% cat and 20% French horn, or, 40% cat and 0% French horn. So this is aligned with a top-1 accuracy concept.
But if you take that same model and apply BCE, and your model predictions 60% dog, BCE DOES care if your models predict 20%/20% cat/frenchhorn, vs 40%/0% cat/frenchhorn. To put it in precise terminology, the former is more "calibrated" and so it has some additional measure of goodness. However, this has little correlation to top-1 accuracy.
When you use BCE, presumably you are wasting the model's energy to focus on calibration at the expense of top-1 acc. But as you might have seen, it doesn't always work out that way. Sometimes BCE gives you superior results. I don't know that there's a clear explanation of that but I'd assume that the additional signals (in the case of Imagenet, you'll literally get 1000 times more signals) somehow creates a smoother loss value that perhaps helps smooth the gradients you receive.
The alpha value of focal loss additionally penalizes very wrong predictions and lessens the penalty if your model predicts something close to the right answer - like predicting 90% cat if the ground truth is cat. This would be a shift from the original definition of CCE, based on the theory of Maximum Likelihood Estimation... which focuses on calibration... vs the normal metric most ML practitioners care about: top-1 accuracy.
Focal loss was originally designed for binary classification so the original formulation only has a single alpha value. The repo you pointed to extends the concept of Focal Loss to single-label classification and therefore there are multiple alpha values: one per class. However, by my read, it loses the additional possible smoothing effect of BCE.
Net net, for the best results, you'll want to benchmark CCE, BCE, Binary Focal Loss (out of TFA and per the original paper), and the single-label multi-class Focal Loss that you found in that repo. In general, those the discovery of those alpha values is done via guess & check, or grid search.
There's a lot of manual guessing and checking in ML unfortunately.
I have a data-set without labels, but I do have a way to get pairs of examples with opposite labels, that is given a pair x,z I know that their true labels are either 0,1 or 1,0.
So, I am building a model that accepts pairs of samples as input, and learns to classify them with opposite labels. Assuming I have an arbitrary model for predicting a single sample, y_hat = f(x), I am building a model with Keras that accepts pairs of samples (x,z) and outputs pairs of predictions, f(x), f(z). I then use a custom loss function that drives the model towards the correct direction: Given that a regular binary classifier is trained using the Binary Cross Entropy (BCE) to make the predicted and desired output "close", I use the negative BCE. Also, since BCE is not symmetric, I symmetrize it. So, the loss function I give the model.compile method is:
from tensorflow import keras
bce = keras.losses.BinaryCrossentropy()
def neg_sym_bce(y1, y2):
return (- 0.5 * (bce(y1, y2) + bce(y2, y1)))
My problem is, this model fails to learn to classify even a single pair of my data (I get f(x)~=f(z)~=0.5), and if I try to train it with synthetic "easy" data, it takes hundreds of epochs to converge (also on a single pair).
This made me suspect that it has to do with a "vanishing gradient" problem. Indeed, when I plot (see below) the loss for a single pair, which is a function of 2 variables (the 2 outputs), it is evident that there is a wide plateau around the 0.5, 0.5 point. It is also evident that the global minima is, as expected, around the points 0,1 and 1,0.
So, is there a way to deal with the vanishing gradient here? I read about the problem but the references I found deal with vanishing gradient in the network, not in the loss itself.
Or, is there another loss that can drive the model to predict opposite labels?
Think if your labels are always either 0,1 or 1,1 just use categorical_crossentropy for the loss.
This is from one of the tensorflow examples mnist_softmax.py.
Even though the gradients are non-zero, they must be identical and all the ten weight vectors corresponding to the ten classes should be exactly same and produce the same output logits and hence same probabilities. The only case I could think this is possible is while calculating the accuracy using tf.argmax(), whose output is ambiguous in case of ties, we are getting lucky and resulting in 92% accuracy. But then I checked the values of y after training is complete and they give perfectly different outputs indicating the weight vectors of all classes are not same. Can someone explain how this is possible?
Although it is best to initialize the parameters to small random numbers to break symmetry and possibly accelerate learning, it does not necessarily mean you will get same probabilities for all classes if you initialize the weights to zeros.
The reason is because the cross_entropy function is a function of weights, inputs, and correct class labels. So the gradient will be different for each output 'neuron', depending on the correct class label, and this will break the symmetry.
I want to perform a multilabel image classification task for n classes.
I've got sparse label vectors for each image and each dimension of each label vector is currently encoded in this way:
1.0 ->Label true / Image belongs to this class
-1.0 ->Label false / Image does not contain to this class.
0.0 ->missing value/label
E.g.: V= {1.0,-1.0,1.0, 0.0}
For this example V the model should learn, that the corresponding image should be classified in the first and third class.
My problem is currently how to handle the missing values/labels. I've searched through the issues and found this issue:
tensorflow/skflow#113 found here
So could do multilable image classification with:
tf.nn.sigmoid_cross_entropy_with_logits(logits, targets, name=None)
but TensorFlow has this error function for sparse softmax, which is used for exclusive classification:
tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels, name=None)
So is there something like sparse sigmoid cross entropy? (Couldn't find something) or any suggestions how can I handle my multilabel classification problem with sparse labels.
I used weighted_cross_entropy_with_logits as the loss function with positive weights for 1s.
In my case, all the labels are equally important. But 0 was ten times more likely to be appeared as the value of any label than 1.
So I weighed all the 1s by calling the pos_weight parameter of the aforementioned loss function. I used a pos_weight (= weight on positive values) of 10. By the way, I do not recommend any strategy to calculate the pos_weight. I think it will depend explicitly on the data in hand.
if real label = 1,
weighted_cross_entropy = pos_weight * sigmoid_cross_entropy
Weighted cross entropy with logits is same as the Sigmoid cross entropy with logits, except for the extra weight value multiplied to all the targets with a positive real value i.e.; 1.
Theoretically, it should do the job. I am still tuning other parameters to optimize the performance. Will update with performance statistics later.
First I would like to know what you mean by missing data? What is the difference between miss and false in your case?
Next, I think it is wrong that you represent your data like this. You have unrelated information that you try to represent on the same dimension. (If it was false or true it would work)
What seems to me better is to represent for each of your class a probability if it is good, or is missing or is false.
In your case V = [(1,0,0),(0,0,1),(1,0,0),(0,1,0)]
Ok!
So your problem is more about how to handle the missing data I think.
So I think you should definitely use tf.sigmoid_cross_entropy_with_logits()
Just change the target for the missing data to 0.5. (0 for false and 1 for true).
I never tried this approach but it should let your network learn without biasing it too much.
I am using a Tensorflow network for classification between classes that are similar to their neighboring classes, i.e. not independent. For example, let's say we want to predict among 10 classes but the predictions are not merely "correct" or "incorrect." Instead, if the correct class is 7 and network predicts 6, the loss should be less than if the network predicted 5, because 6 is closer to the correct answer than 5. My understanding is that cross entropy and 1-hot vectors provides "all or nothing" loss rather than a "continuous" loss that reflects the magnitude of the error. If that is correct, how does one implement such a continuous loss in Tensorflow?
-- Update June 13 2016 ----
An example application might be color recognition. If the network predicts "green" but the true color is yellow-green, then the loss should be less than if the network predicted blue because green is a better prediction than blue.
You can choose to implement a continuous function (e.g. hue from HSV) as a single output, and construct your own loss calculation that reflects what you want to optimize. In that case you'd just have a single output value that ranged between 0.0 and 1.0, and the loss would be evaluated based on the distance from the labeled value.