Best loss function for multi-class classification when the dataset is imbalanced? - tensorflow

I'm currently using the cross-entropy loss function, but with the imbalanced dataset the performance is not great.
Is there a better loss function?

It's a very broad subject, but IMHO, you should try focal loss: it was introduced by Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollar to handle class imbalance in object detection. Since its introduction it has also been used in the context of segmentation.
The idea of the focal loss is to reduce both the loss and the gradient for correct (or almost correct) predictions while emphasizing the gradient of errors.
As you can see in the graph from the focal loss paper:
The blue curve is the regular cross-entropy loss: on the one hand it has non-negligible loss and gradient even for well-classified examples, and on the other hand it has a weaker gradient for erroneously classified examples.
In contrast, focal loss (all other curves) has a smaller loss and a weaker gradient for well-classified examples, and stronger gradients for erroneously classified examples.
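As a minimal sketch of how this could look in Keras (not the paper's exact formulation: one-hot targets and softmax outputs are assumed, and the scalar alpha here stands in for the paper's class-dependent weighting; gamma=2.0, alpha=0.25 follow common practice):

import tensorflow as tf

def focal_loss(gamma=2.0, alpha=0.25):
    """Multi-class focal loss sketch: down-weights easy examples."""
    def loss_fn(y_true, y_pred):
        # Clip probabilities to avoid log(0).
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        ce = -y_true * tf.math.log(y_pred)            # per-class cross-entropy
        weight = alpha * tf.pow(1.0 - y_pred, gamma)  # (1 - p_t)^gamma modulation
        return tf.reduce_sum(weight * ce, axis=-1)
    return loss_fn

# model.compile(optimizer="adam", loss=focal_loss(), metrics=["accuracy"])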

Related

What's the complete loss function used by YOLOv4?

I am unable to find an explanation of the YOLOv4 loss function.
First, to understand the YOLOv4 loss, I think you should read about the original YOLO loss, introduced in the first YOLO paper (https://arxiv.org/abs/1506.02640).
In YOLOv4, you will have the exact same ideas, but with:
Binary cross-entropy for the objectness and classification scores,
Box-per-cell-level prediction instead of cell-level prediction for the class probabilities, so a slightly different penalization for the classification terms,
CIoU loss instead of MSE for the regression terms (x, y, w, h). CIoU stands for Complete Intersection over Union, and it is not so far from the MSE loss. It compares width and height a bit more interestingly (consistency between aspect ratios), but it keeps a normalized squared distance for the comparison between bounding-box centers. You can find more details in the CIoU paper.
Finally, the YOLOv4 loss combines these three ingredients: a CIoU regression term for the boxes, a binary cross-entropy term for objectness, and a binary cross-entropy term for the per-box class probabilities.
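As a rough structural sketch (the notation here is assumed for illustration, not copied from the paper: S^2 grid cells, B boxes per cell, and 1^{obj}_{ij} indicating that box j of cell i is responsible for an object):

L_{YOLOv4} ≈ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1^{obj}_{ij} (1 - CIoU(b_{ij}, \hat{b}_{ij}))
           + \sum_{i,j} BCE(C_{ij}, \hat{C}_{ij})
           + \sum_{i,j} 1^{obj}_{ij} \sum_{c} BCE(p_{ij}(c), \hat{p}_{ij}(c))

where the CIoU term itself expands to

1 - CIoU = 1 - IoU + \rho^2(b, \hat{b}) / c^2 + \alpha v,  with  v = (4/\pi^2) (\arctan(w/h) - \arctan(\hat{w}/\hat{h}))^2

(\rho is the distance between the box centers, c the diagonal of the smallest enclosing box, and \alpha a positive trade-off weight).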

What loss function to use in Keras when metric is SparseTopKCategoricalAccuracy/TopKCategoricalAccuracy?

For multiclass classification problems, Keras and tf.keras have metrics like SparseTopKCategoricalAccuracy and TopKCategoricalAccuracy. However, if one uses loss functions like SparseCategoricalCrossentropy or CategoricalCrossentropy, they cannot achieve the max values for these two metrics.
What is a good loss function to use when one wants to maximize SparseTopKCategoricalAccuracy or TopKCategoricalAccuracy?
I understand that SparseTopKCategoricalAccuracy is not differentiable, just like Accuracy. What I am looking for is a smooth loss function that approximates it and yields a higher SparseTopKCategoricalAccuracy.
Cross-entropy is not the best loss function when you deal with top-k accuracy, because cross-entropy may be prone to overfitting on small datasets or noisy labels.
As you have already pointed out, "smooth loss" functions have been developed for top-k classification with SVMs. To my knowledge, there is no "off-the-shelf" loss function in Keras/TF that is best suited for top-k. However, I suggest you try the Smooth Surrogate Loss (SSL) presented in the article and implemented in PyTorch for use with deep neural networks (see GitHub). It derives from multi-class SVMs: SSL creates a margin between the correct top-k predictions and the incorrect ones. The training time of SSL is comparable to that of cross-entropy, thanks to a divide-and-conquer approach and the use of polynomials (see the implementation).
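For reference, this is the baseline setup being discussed (the layer sizes and the 100-class output are placeholder assumptions); the SSL loss would replace SparseCategoricalCrossentropy here:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(100, activation="softmax"),  # 100 classes (assumed)
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)],
)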

PyTorch - Superior Model Performance by Misusing Loss Function (Negative Log Likelihood)?

I misread PyTorch's NLLLoss() and accidentally passed my model's probabilities to the loss function instead of my model's log probabilities, which is what the function expects. However, when I train a model under this misused loss function, the model (a) learns faster, (b) learns more stably, (c) reaches a lower loss, and (d) performs better at the classification task.
I don't have a minimal working example, but I'm curious if anyone else has experienced this or knows why this is? Any possible hypotheses?
One hypothesis I have is that the gradient of the misused loss with respect to the model output is more stable, because the derivative isn't scaled by 1/(model output probability).
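To make the misuse concrete (a toy sketch; the tensors are placeholders): NLLLoss just picks out and negates the entry at the target index, so feeding it probabilities minimizes -p[target] instead of -log p[target], and the gradient with respect to the probability becomes a constant -1 rather than -1/p.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)  # toy batch, 10 classes
targets = torch.randint(0, 10, (4,))

# Correct usage: NLLLoss expects log-probabilities.
correct = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# The misuse described above: passing probabilities instead,
# which effectively minimizes -p[target] rather than -log p[target].
misused = F.nll_loss(F.softmax(logits, dim=1), targets)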

Multiple questions regarding the KL term in the ELBO loss with TensorFlow Probability

I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I have a few questions.
What is the proper value of the coefficient of the KL loss?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
As I go from 2D inputs to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
What is the guideline for getting a proper value for this coefficient? Maybe the two loss terms should be of the same order of magnitude?
The current coefficient only takes care of the number of training samples, not the network complexity or the number of parameters, and I assume the KL loss increases with the complexity of the model.
I am trying to implement a neural network with the KL loss without using keras.model.losses, due to software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside a Keras model, so I can't use my original model straight away. Is there a way to get the KL loss not from model.losses, but from the layers or weights of the network in a TF construct?
Is batch normalization or group normalization still helpful in Bayesian deep learning?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
In the BBB paper, eq. 8, they refer to M as the number of mini-batches. To be consistent with non-stochastic gradient learning, the KL should be scaled by the number of mini-batches, which is what is done by Graves. Another alternative is what is done in eq. 9, where they scale the contribution of mini-batch i by \pi_i, with the values in the set {\pi_i} summing to one.
In the TFP example, it does look like num_examples is the total number of independent samples within the training set, which is much larger than the number of batches. Rescaling the KL term like this goes by a few names, such as Safe Bayes or tempering. Have a look at sec. 8 of this paper for some more discussion about the use of tempering within Bayesian inference and its suitability.
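One way to reconcile the two conventions (a sketch, assuming the per-minibatch loss averages the log-likelihood over the |B| examples of a batch instead of summing it): dividing BBB's per-minibatch objective (eq. 8) by |B| turns the KL/M weight into KL/(M|B|) = KL/N, which is exactly the 1/num_examples scaling used in the TFP example:

(1/|B|) [ (1/M) KL(q(w) || p(w)) + \sum_{(x,y) \in B} -\log p(y|x,w) ]
    = (1/N) KL(q(w) || p(w)) + (1/|B|) \sum_{(x,y) \in B} -\log p(y|x,w),    where N = M|B|.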
As I go from 2D inputs to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
The ELBO will always be larger than just your cross-entropy (which defines your likelihood). Have a look at how the KL divergence term in the ELBO is found (and at a full mean-field approach, where each weight/parameter is assumed to be independent).
Since the assumed posterior is factorised (each parameter is assumed independent), you can write the joint distribution as a product. This means that when you take the log while computing the KL between the approximate posterior and the prior, you can write it as a sum of KL terms, one per parameter. Since the KL is >= 0, every parameter you add to your model adds another non-negative term to your ELBO. This is likely why your loss is so much larger for your 3D model: there are more parameters.
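Concretely, for a factorised posterior over D parameters this is (a standard identity, with the Gaussian closed form added here for illustration):

KL(q(w) || p(w)) = \sum_{i=1}^{D} KL(q(w_i) || p(w_i)),   each term >= 0

and for q(w_i) = N(\mu_i, \sigma_i^2) against a standard normal prior p(w_i) = N(0, 1):

KL(q(w_i) || p(w_i)) = \log(1/\sigma_i) + (\sigma_i^2 + \mu_i^2 - 1) / 2

so every extra parameter in the 3D model contributes one more non-negative term.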
Another reason this could occur is if you have less data (if your M is smaller, the KL term is divided by a smaller number and so weighs more).
What is the guideline for getting a proper value for this coefficient? Maybe the two loss terms should be of the same order of magnitude?
I am unsure of any specific guideline; for training you are primarily interested in the gradients. A large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log-likelihood and by the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale the KL term down, but this feels a bit yucky for the Bayesian in me).
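A toy sketch of what "look at the gradients" means in practice (TF2 eager style for brevity; the two scalar terms here are simple stand-ins for your NLL and KL, not real model losses):

import tensorflow as tf

w = tf.Variable([0.5, -1.0])

with tf.GradientTape(persistent=True) as tape:
    nll = tf.reduce_sum(tf.square(w - 1.0))  # stand-in for the negative log-likelihood
    kl = tf.reduce_sum(tf.square(w)) / 2.0   # stand-in for the KL term

# Compare the magnitude each term contributes to the update.
print(tf.norm(tape.gradient(nll, w)).numpy(),
      tf.norm(tape.gradient(kl, w)).numpy())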
The current coefficient only takes care of the number of training samples, not the network complexity or the number of parameters, and I assume the KL loss increases with the complexity of the model.
Yes, as stated before: in general, more parameters means a larger KL term in the ELBO (for a mean-field approach as used in Bayes by Backprop).
I am trying to implement a neural network with the KL loss without using keras.model.losses, due to software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside a Keras model, so I can't use my original model straight away. Is there a way to get the KL loss not from model.losses, but from the layers or weights of the network in a TF construct?
I am unsure about the best way to tackle this part of it. I would be cautious about going to older versions where it isn't explicitly supported. They put those warnings/exceptions in for a reason.
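That said, if you control the variational parameters directly, one workaround (a sketch, not TFP's API: mu and rho are assumed to be the trainable variables of one Bayesian layer, using the common sigma = softplus(rho) parameterization and a standard normal prior) is to compute the closed-form KL yourself, without touching model.losses:

import tensorflow as tf

def kl_to_standard_normal(mu, rho):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over all weights."""
    sigma = tf.nn.softplus(rho)
    return tf.reduce_sum(
        tf.math.log(1.0 / sigma) + (tf.square(sigma) + tf.square(mu) - 1.0) / 2.0
    )

# Summing this over every Bayesian layer gives the KL term of the ELBO.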
Is batch normalization or group normalization still helpful in Bayesian deep learning?
For variational inference (as done in Bayes by Backprop), batch norm is fine. For sampling methods such as MCMC, batch normalization is no longer suitable. Have a look at https://arxiv.org/pdf/1908.03491v1.pdf for info on the suitability of batch norm with sampling methods for approximate Bayesian inference.

When we do supervised classification with NN, why do we train for cross-entropy and not for classification error?

The standard supervised classification setup: we have a bunch of samples, each with the correct label out of N labels. We build a NN with N outputs, transform those to probabilities with softmax, and the loss is the mean cross-entropy between each NN output and the corresponding true label, represented as a 1-hot vector with 1 in the true label and 0 elsewhere. We then optimize this loss by following its gradient. The classification error is used just to measure our model quality.
HOWEVER, I know that when doing policy gradient we can use the likelihood ratio trick, and we no longer need to use cross-entropy! Our loss simply uses tf.gather to pick the NN output corresponding to the correct label, e.g. in this solution of the OpenAI Gym CartPole.
WHY can't we use the same trick when doing supervised learning? I was thinking that the reason we used cross-entropy is because it is differentiable, but apparently tf.gather is differentiable as well.
I mean - IF we measure ourselves on classification error, and we CAN optimize for classification error as it's differentiable, isn't it BETTER to also optimize for classification error instead of this weird cross-entropy proxy?
Policy gradient is using cross-entropy (or KL divergence, as Ishant pointed out). For supervised learning, tf.gather is really just an implementation trick, nothing else. For RL, on the other hand, it is a must, because you do not know "what would happen" if you were to execute the other action. Consequently you end up with a high-variance estimator of your gradients, something that you would like to avoid at all costs, if possible.
Going back to supervised learning though:

CE(p||q) = - \sum_i q_i \log p_i

Let's assume that q_i is one-hot encoded, with 1 at the k-th position. Then

CE(p||q) = - q_k \log p_k = - \log p_k

So if you want, you can implement this as tf.gather; it simply does not matter. The cross-entropy is simply more generic because it handles more complex targets. In particular, in TF you have sparse cross-entropy, which does exactly what you describe: it exploits the one-hot encoding, that's it. Mathematically there is no difference, there is a small difference computation-wise, and there are functions doing exactly what you want.
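To see the equivalence numerically (a toy sketch with made-up probabilities and labels):

import tensorflow as tf

probs = tf.constant([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1]])
labels = tf.constant([0, 1])

# -log p_k via gather:
gathered = -tf.math.log(tf.gather(probs, labels, batch_dims=1))

# The same thing via the built-in sparse cross-entropy on probabilities:
sparse_ce = tf.keras.losses.sparse_categorical_crossentropy(labels, probs)

# Both yield [-log 0.7, -log 0.8].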
Minimization of the cross-entropy loss minimizes the KL divergence between the predicted distribution and the target distribution, which is indeed the same as maximizing the likelihood under the predicted distribution.