What's the complete loss function used by YOLOv4? - object-detection

I am unable to find an explanation of the loss function of YOLOv4.

First, to understand the YOLOv4 loss, I think you should read about the original YOLO loss that was introduced in the first YOLO paper (https://arxiv.org/abs/1506.02640).
In YOLOv4, you will have the exact same ideas, but with:
Binary cross entropy for the objectness and classification scores,
Box-per-cell level prediction instead of cell level prediction for the class probabilities, so a slightly different penalization for the classification terms,
CIoU Loss instead of MSE for the regression terms (x, y, w, h). CIoU stands for Complete Intersection over Union, and is not so far from the MSE loss. It compares widths and heights in a more interesting way (consistency between aspect ratios), while keeping an MSE-style penalty on the distance between the bounding box centers. You can find more details in the DIoU/CIoU paper (https://arxiv.org/abs/1911.08287).
Finally, the YOLOv4 loss is the sum of these three components: the CIoU regression loss, the binary cross-entropy objectness loss, and the binary cross-entropy classification loss; a sketch of the CIoU term follows below.
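To make the regression term concrete, here is a minimal sketch of a CIoU loss in TensorFlow-style Python. The function name and the (cx, cy, w, h) center-format assumption are mine, not YOLOv4's actual implementation:

    import math
    import tensorflow as tf

    # Minimal CIoU loss sketch; boxes are (cx, cy, w, h) tensors of shape (..., 4).
    def ciou_loss(pred, target, eps=1e-7):
        px, py, pw, ph = tf.unstack(pred, axis=-1)
        tx, ty, tw, th = tf.unstack(target, axis=-1)

        # Corner coordinates of both boxes.
        px1, py1, px2, py2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
        tx1, ty1, tx2, ty2 = tx - tw / 2, ty - th / 2, tx + tw / 2, ty + th / 2

        # Intersection area and IoU.
        iw = tf.maximum(tf.minimum(px2, tx2) - tf.maximum(px1, tx1), 0.0)
        ih = tf.maximum(tf.minimum(py2, ty2) - tf.maximum(py1, ty1), 0.0)
        inter = iw * ih
        union = pw * ph + tw * th - inter
        iou = inter / (union + eps)

        # Squared center distance, normalised by the diagonal of the smallest
        # enclosing box (the "distance" part of DIoU/CIoU).
        cw = tf.maximum(px2, tx2) - tf.minimum(px1, tx1)
        ch = tf.maximum(py2, ty2) - tf.minimum(py1, ty1)
        rho2 = (px - tx) ** 2 + (py - ty) ** 2
        c2 = cw ** 2 + ch ** 2 + eps

        # Aspect-ratio consistency term.
        v = (4.0 / math.pi ** 2) * tf.square(
            tf.math.atan(tw / (th + eps)) - tf.math.atan(pw / (ph + eps)))
        alpha = v / (1.0 - iou + v + eps)

        return 1.0 - (iou - rho2 / c2 - alpha * v)

The objectness and classification terms are then ordinary binary cross-entropies summed over the predicted boxes.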

Related

Purpose of using one loss function and a different metric in TensorFlow/Keras?

I'm new to deep learning and I saw this for the first time: MAE as the loss function and MSE as the metric. What is the purpose of this, and what is gained?
model.compile(loss=tf.keras.losses.MeanAbsoluteError(), metrics=[tf.keras.metrics.MeanSquaredError()])
In some cases it is useful to have a loss function different from the metric you are going to evaluate.
Consider the case in which you want to denoise an image: you design a network that takes a noisy image as input and outputs its clean version. Here, your metric might be the Peak Signal-to-Noise Ratio (PSNR) or some sort of structural similarity (SSIM) between your output and the ground-truth clean image. However, during training you might use a different loss function, such as L1 (MAE), L2 (MSE) or even a perceptual loss such as the VGG loss, because these have been shown to lead to better results than directly optimizing for PSNR or SSIM.
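As a hypothetical illustration of this split, a denoising model can be compiled in Keras with an L1 loss while monitoring PSNR; the toy architecture below is only a placeholder:

    import tensorflow as tf

    # Placeholder image-to-image network; any denoising architecture would do.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               input_shape=(None, None, 1)),
        tf.keras.layers.Conv2D(1, 3, padding="same"),
    ])

    def psnr_metric(y_true, y_pred):
        # The metric you actually care about at evaluation time.
        return tf.image.psnr(y_true, y_pred, max_val=1.0)

    # Train on MAE (L1), but report PSNR during training and evaluation.
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.MeanAbsoluteError(),
                  metrics=[psnr_metric])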

What loss function to use in Keras when metric is SparseTopKCategoricalAccuracy/TopKCategoricalAccuracy?

For multiclass classification problems, Keras and tf.keras have metrics like SparseTopKCategoricalAccuracy and TopKCategoricalAccuracy. However, if one uses loss functions like SparseCategoricalCrossentropy or CategoricalCrossentropy, they cannot achieve the max values for these two metrics.
What is a good loss function to use when one wants to maximize SparseTopKCategoricalAccuracy or TopKCategoricalAccuracy?
I understand that SparseTopKCategoricalAccuracy is not differentiable, just like Accuracy. I am trying to find a smooth loss function that approximates it and yields a higher SparseTopKCategoricalAccuracy.
Cross-entropy is not the best loss function when you deal with top-k accuracy, because it may be prone to overfitting on small datasets or noisy labels.
As you have already pointed out, "smooth loss" functions have been developed for top-k classification with SVMs. To my knowledge, there is no "off-the-shelf" loss function in Keras/TF that is best suited for top-k. However, I suggest you try the Smooth Surrogate Loss (SSL) presented in the article and implemented in PyTorch for use with deep neural networks (see GitHub). It derives from multi-class SVMs, as SSL creates a margin between the correct top-k predictions and the incorrect ones. The training time of SSL is comparable to that of cross-entropy thanks to a divide-and-conquer approach and the use of polynomials (see the implementation).
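For reference, the usual Keras baseline simply pairs cross-entropy with a top-k metric; the model, num_classes and k below are illustrative placeholders (SSL itself is only available in PyTorch):

    import tensorflow as tf

    num_classes = 100  # placeholder

    # Placeholder classifier emitting raw logits.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(64,)),
        tf.keras.layers.Dense(num_classes),
    ])

    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)],
    )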

Best loss function for multi-class classification when the dataset is imbalanced?

I'm currently using the cross-entropy loss function, but with an imbalanced dataset the performance is not great.
Is there a better loss function?
It's a very broad subject, but IMHO you should try focal loss: it was introduced by Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár to handle class imbalance in object detection. Since its introduction it has also been used in the context of segmentation.
The idea of the focal loss is to reduce both the loss and the gradient for correct (or almost correct) predictions while emphasizing the gradient of errors.
Comparing the two losses (the focal loss paper plots both):
The regular cross-entropy loss has, on the one hand, a non-negligible loss and gradient even for well-classified examples, and on the other hand a weaker gradient for erroneously classified examples.
In contrast, focal loss (with any focusing parameter gamma > 0) has a smaller loss and weaker gradient for well-classified examples and stronger gradients for erroneously classified examples.
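A minimal sketch of a multi-class focal loss in TensorFlow, assuming softmax probabilities and one-hot targets; gamma=2.0 and alpha=0.25 are the values the paper reports working best:

    import tensorflow as tf

    def focal_loss(gamma=2.0, alpha=0.25):
        def loss_fn(y_true, y_pred):
            # y_true: one-hot targets; y_pred: softmax probabilities.
            y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
            ce = -y_true * tf.math.log(y_pred)            # per-class cross-entropy
            weight = alpha * tf.pow(1.0 - y_pred, gamma)  # down-weights easy examples
            return tf.reduce_sum(weight * ce, axis=-1)
        return loss_fn

    # model.compile(optimizer="adam", loss=focal_loss(), metrics=["accuracy"])

With gamma = 0 (and alpha = 1) this reduces to ordinary cross-entropy; increasing gamma shrinks the loss contribution of well-classified examples.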

Losses in TensorFlow

Can anyone kindly explain what classification loss and localization loss mean in TensorFlow?
I am getting these losses during the SSD training procedure using the TensorFlow API, but I don't understand either of them.
Here I read that the localization loss is the loss of the bounding box regressor, which raises a new question: what is a bounding box regressor?
Can anyone give a brief explanation, please?
Hope this helps - I tried to give a brief explanation as I understand it.
What do classification loss and localization loss mean in TensorFlow?
Classification/localisation loss values are the results of loss functions and represent the "price paid for inaccuracy of predictions" in the classification and localisation problems, respectively.
The loss value given is a sum of the classification loss and the localisation loss.
The optimisation algorithms are trying to reduce these loss values until your loss sum reaches a point where you are happy with the results and consider your network 'trained'.
You can generally think of loss as a score where 'lower score equals better model'.
What is a bounding box regressor?
The bounding box regressor is, I believe, a trained model (or model head) that refines a region of interest (ROI) into a more accurate bounding box around the object.
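To make this concrete, here is a rough sketch of how an SSD-style total loss combines the two terms; the function names and the alpha weight are illustrative, not the TF Object Detection API's actual code:

    import tensorflow as tf

    def smooth_l1(y_true, y_pred):
        # Smooth L1 (Huber) loss, commonly used for box-offset regression.
        diff = tf.abs(y_true - y_pred)
        return tf.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)

    def detection_loss(cls_true, cls_logits, box_true, box_pred, alpha=1.0):
        # Classification loss: price paid for wrong class scores.
        cls_loss = tf.nn.softmax_cross_entropy_with_logits(
            labels=cls_true, logits=cls_logits)
        # Localization loss: price paid for inaccurate box coordinates.
        loc_loss = tf.reduce_sum(smooth_l1(box_true, box_pred), axis=-1)
        # The reported total loss is a (weighted) sum of the two.
        return tf.reduce_mean(cls_loss + alpha * loc_loss)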

When we do supervised classification with NN, why do we train for cross-entropy and not for classification error?

The standard supervised classification setup: we have a bunch of samples, each with the correct label out of N labels. We build a NN with N outputs, transform those to probabilities with softmax, and the loss is the mean cross-entropy between each NN output and the corresponding true label, represented as a 1-hot vector with 1 in the true label and 0 elsewhere. We then optimize this loss by following its gradient. The classification error is used just to measure our model quality.
HOWEVER, I know that when doing policy gradient we can use the likelihood ratio trick, and we no longer need to use cross-entropy! Our loss simply uses tf.gather to pick out the NN output corresponding to the correct label. E.g. this solution of the OpenAI Gym CartPole.
WHY can't we use the same trick when doing supervised learning? I was thinking the reason we use cross-entropy is that it is differentiable, but apparently tf.gather is differentiable as well.
I mean - IF we measure ourselves on classification error, and we CAN optimize for classification error since it's differentiable, isn't it BETTER to optimize for classification error directly instead of this weird cross-entropy proxy?
Policy gradient is using cross-entropy (or KL divergence, as Ishant pointed out). For supervised learning, tf.gather is really just an implementation trick, nothing else. For RL, on the other hand, it is a must, because you do not know "what would happen" if you had executed another action. Consequently you end up with a high-variance estimator of your gradients, something you would like to avoid at all costs, if possible.
Going back to supervised learning, though:
CE(p||q) = - SUM_i q_i log p_i
Let's assume that q is one-hot encoded, with a 1 at the k-th position; then
CE(p||q) = - q_k log p_k = - log p_k
So if you want, you can implement this with tf.gather; it simply does not matter. Cross-entropy is simply more generic because it handles more complex targets. In particular, in TF you have sparse cross-entropy, which does exactly what you describe - it exploits the one-hot encoding, that's it. Mathematically there is no difference, there is a small difference computation-wise, and there are functions doing exactly what you want.
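A quick numeric check of this equivalence, with arbitrary example tensors:

    import tensorflow as tf

    # Two examples, three classes: arbitrary logits and integer labels.
    logits = tf.constant([[2.0, 0.5, -1.0],
                          [0.1, 1.2, 0.3]])
    labels = tf.constant([0, 2])

    log_probs = tf.nn.log_softmax(logits)

    # -log p_k via tf.gather: pick out the log-probability of the true class.
    gathered = -tf.gather(log_probs, labels, batch_dims=1)

    # The same quantity via the built-in sparse cross-entropy.
    sparse_ce = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)

    tf.debugging.assert_near(gathered, sparse_ce)  # equal up to float error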
Minimization of the cross-entropy loss minimizes the KL divergence between the predicted distribution and the target distribution, which is indeed the same as maximizing the likelihood of the data under the predicted distribution.