Can someone give me an explanation for Multibox loss function? - tensorflow

I have found some expression for SSD Multibox-loss function as follows:
multibox_loss = confidence_loss + alpha * location_loss
Can someone explains what are the explanations for those terms?

SSD Multibox (short for Single Shot Multibox Detector) is a neural network that can detect and locate objects in an image in a single forward pass. The network is trained in a supervised manner on a dataset of images where a bounding box and a class label is given for each object of interest. The loss term
multibox_loss = confidence_loss + alpha * location_loss
is made up of two parts:
Confidence loss is a categorical cross-entropy loss for classifying the detected objects. The purpose of this term is to make sure that correct label is assigned to each detected object.
Location loss is a regression loss (either the smooth L1 or the L2 loss) on the parameters (width, height and corner offset) of the detected bounding box. The purpose of this term is to make sure that the correct region of the image is identified for the detected objects. The alpha term is a hyper parameter used to scale the location loss.
The precise formulation of the loss is given in Equation 1 of the SSD: Single Shot MultiBox Detector paper.

Related

Can SigmoidFocalCrossEntropy in Tensorflow (tf-addons) be used in Multiclass Classification? ( What is the right way)?

Focal Loss given in Tensorflow is used for class imbalance. For Binary class classification, there are a lots of codes available but for Multiclass classification, a very little help is there. I ran the code with One Hot Encoded target variables of 250 classes and it gave me results without any error.
y = pd.get_dummies(df['target']) # One hot encoded target classes
model.compile(
optimizer="adam", loss=tfa.losses.SigmoidFocalCrossEntropy(), metrics= metric
)
I just want to know whoever wrote this code or someone having enough knowledge of this code, can it be used be used for Multiclass Classification. If no then how come it did not give me errors, instead better results than CrossEntropy. Also, in other implementations like this one, the value of alpha has to be given for every class but just one value in Tensorflow's implementations.
What is the correct way to use this?
Some basics first.
Categorical Crossentropy is designed to incentivize a model a model to predict 100% for the correct label. It was designed for models that predict single-label multi-class classification - like CIFAR10 or Imagenet. Usually these models finish in a Dense layer with more than one output.
Binary Crossentropy is designed to incentivize a model to predict 100% if the label is one, or, 0% is the label is zero. Usually these models finish in a Dense layer with exactly one output.
When you apply Binary Crossentropy to a single-label multi-class classification problem, you are doing something that is mathematically valid but defines a slightly different task: you are incentivizing a single-label classification model to not only get the true label correct, but also minimize the false labels.
For example, if your target is dog, and your model predict 60% dog, CCE doesn't care if your model predicts 20% cat and 20% French horn, or, 40% cat and 0% French horn. So this is aligned with a top-1 accuracy concept.
But if you take that same model and apply BCE, and your model predictions 60% dog, BCE DOES care if your models predict 20%/20% cat/frenchhorn, vs 40%/0% cat/frenchhorn. To put it in precise terminology, the former is more "calibrated" and so it has some additional measure of goodness. However, this has little correlation to top-1 accuracy.
When you use BCE, presumably you are wasting the model's energy to focus on calibration at the expense of top-1 acc. But as you might have seen, it doesn't always work out that way. Sometimes BCE gives you superior results. I don't know that there's a clear explanation of that but I'd assume that the additional signals (in the case of Imagenet, you'll literally get 1000 times more signals) somehow creates a smoother loss value that perhaps helps smooth the gradients you receive.
The alpha value of focal loss additionally penalizes very wrong predictions and lessens the penalty if your model predicts something close to the right answer - like predicting 90% cat if the ground truth is cat. This would be a shift from the original definition of CCE, based on the theory of Maximum Likelihood Estimation... which focuses on calibration... vs the normal metric most ML practitioners care about: top-1 accuracy.
Focal loss was originally designed for binary classification so the original formulation only has a single alpha value. The repo you pointed to extends the concept of Focal Loss to single-label classification and therefore there are multiple alpha values: one per class. However, by my read, it loses the additional possible smoothing effect of BCE.
Net net, for the best results, you'll want to benchmark CCE, BCE, Binary Focal Loss (out of TFA and per the original paper), and the single-label multi-class Focal Loss that you found in that repo. In general, those the discovery of those alpha values is done via guess & check, or grid search.
There's a lot of manual guessing and checking in ML unfortunately.

Resnet50 image preprocessing

I am using https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3 to extract image feature vectors. However, I'm confused when it comes to how to preprocess the images prior to passing them through the module.
Based on the related Github explanation, it's said that the following should be done:
image_path = "path/to/the/jpg/image"
image_string = tf.read_file(image_path)
image = tf.image.decode_jpeg(image_string, channels=3)
image = tf.image.convert_image_dtype(image, tf.float32)
# All other transformations (during training), in my case:
image = tf.random_crop(image, [224, 224, 3])
image = tf.image.random_flip_left_right(image)
# During testing:
image = tf.image.resize_image_with_crop_or_pad(image, 224, 224)
However, using the aforementioned transformation, the results I am getting suggest that something might be wrong. Moreover, the Resnet paper is saying that the images should be preprocessed by:
A 224×224 crop is randomly sampled from an image or its
horizontal flip, with the per-pixel mean subtracted...
which I can't quite understand what is means. Can someone point me in the right direction?
Looking forward to you answers!
The image modules on TensorFlow Hub all expect pixel values in range [0,1], like you get in your code snippet above. This makes it easy and safe to switch between modules.
Inside the module, the input values are scaled to the range that the network was trained for. The module https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3 has been published from a TF-Slim checkpoint (see documentation), which uses yet another convention for normalizing inputs than He&al. -- but all this is taken care of.
To demystify the language in He&al.: it refers to the mean R, G and B values aggregated over all pixels of the dataset they studied, following the old wisdom that normalizing inputs to zero mean helps neural networks train better. However, later papers on image classification no longer expended this degree of attention to dataset-specific preprocessing.
The citation from the Resnet paper you mentioned is based on the following explanation from the Alexnet paper:
ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of256×256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and thencropped out the central 256×256patch from the resulting image. We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel.
So in the Resnet paper, a similar process consist in taking a of 224x224 pixels part of the image (or of its horizontally flipped version) to ensure the network is given constant-sized images, and then center it by substracting the mean.

Understanding and tracking of metrics in object detection

I have some questions about metrics if I do some training or evaluation on my own dataset. I am still new to this topic and just experimented with tensorflow and googles object detection api and tensorboard...
So I did all this stuff to get things up and running with the object detection api and trained on some images and did some eval on other images.
So I decided to use the weighted PASCAL metrics set for evaluation:
And in tensorboard I get some IoU for every class and also mAP and thats fine to see and now comes the questions.
The IoU gives me the value of how well the overlapping of ground-truth and predictes boxes is and measures the accuracy of my object detector.
First Question: Is there a influencing to IoU if a object with ground-truth is not detected?
Second Question: Is there a influencing of IoU if a ground-truth object is predicted false negativ?
Third Question: What about False Positves where are no ground-truth objects?
Coding Questions:
Fourth Question: Has anyone modified the evaluation workflow from the object detection API to bring in more metrics like accuracy or TP/FP/TN/FN? And if so can provide me some code with explanation or a tutorial you used - that would be awesome!
Fifth Question: If I will monitor some overfitting and take 30% of my 70% traindata and do some evaluation, which parameter shows me that there is some overfitting on my dataset?
Maybe those question are newbie questions or I just have to read and understand more - I dont know - so your help to understand more is appreciated!!
Thanks
Let's start with defining precision with respect to a particular object class: its a proportion of good predictions to all predictions of that class, i.e., its TP / (TP + FP). E.g., if you have dog, cat and bird detector, the dog-precision would be number of correctly marked dogs over all predictions marked as dog (i.e., including false detections).
To calculate the precision, you need to decide if each detected box is TP or FP. To do this you may use IuO measure, i.e., if there is significant (e.g., 50% *) overlap of the detected box with some ground truth box, its TP if both boxes are of the same class, otherwise its FP (if the detection is not matched to any box its also FP).
* thats where the #0.5IUO shortcut comes from, you may have spotted it in the Tensorboard in titles of the graphs with PASCAL metrics.
If the estimator outputs some quality measure (or even probability), you may decide to drop all detections with quality below some threshold. Usually, the estimators are trained to output value between 0 and 1. By changing the threshold you may tune the recall metric of your estimator (the proportion of correctly discovered objects). Lowering the threshold increases the recall (but decreases precision) and vice versa. The average precision (AP) is the average of class predictions calculated over different thresholds, in PASCAL metrics the thresholds are from range [0, 0.1, ... , 1], i.e., its average of precision values for different recall levels. Its an attempt to capture characteristics of the detector in a single number.
The mean average precision is mean of average previsions over all classes. E.g., for our dog, cat, bird detector it would be (dog_AP + cat_AP + bird_AP)/3.
More rigorous definitions could be found in the PASCAL challenge paper, section 4.2.
Regarding your question about overfitting, there could be several indicators of it, one could be, that AP/mAP metrics calculated on the independent test/validation set begin to drop while the loss still decreases.

How to implement loss on a fully convolutional network in TensorFlow?

So I've started to implement the paper "Synthetic Data for Text Localisation in Natural Images" by Gupta et al. and I've encountered a serious problem.
The network architecture is a fully convolutional network. The final layer is basically an NxNx7 Tensor (Imagine a matrix where each cell holds 7 values). Each cell holds a P and C value. P is 6 parameters about a bounding box that should be regressed and C is the confidence.
Now I want to implement squared loss on this layer. As the paper states every cell of the final layer is a prediction, if indeed that predictor's location should contain a bounding box then the loss should be applied on all of the parameters in that predictor(or cell). If it shouldn't contain a bounding box then only regressing the confidence C should be enough.
So I should have dynamically defined separate losses in TensorFlow, how could I do that?
You can use tf.cond, and write something like
loss = tf.cond(is_there_sthg_label, lambda: tf.add(loss1, loss2), lambda: loss2)
EDIT:
Sorry I didn't understand your problem correctly. You can make a mask of size NxN with value True (at runtime) on [i, j] if there is a bounding box, and false else. Then you compute both your losses for each cell, you get tensors loss1 and loss2 of shape NxN, and then
#loss1 is the loss on the confidence only, loss2 is the loss on P
loss_tensor = loss1 + tf.multiply(loss2, tf.cast(mask, loss2.dtype))
total_loss = reduce_sum(loss_tensor)
(this still works if you have batches of course)

patch-wise training and fully convolutional training in FCN

In the FCN paper, the authors discuss the patch wise training and fully convolutional training. What is the difference between these two?
Please refer to section 4.4 attached in the following.
It seems to me that the training mechanism is as follows,
Assume the original image is M*M, then iterate the M*M pixels to extract N*N patch (where N<M). The iteration stride can some number like N/3 to generate overlapping patches. Moreover, assume each single image corresponds to 20 patches, then we can put these 20 patches or 60 patches(if we want to have 3 images) into a single mini-batch for training. Is this understanding right? It seems to me that this so-called fully convolutional training is the same as patch-wise training.
The term "Fully Convolutional Training" just means replacing fully-connected layer with convolutional layers so that the whole network contains just convolutional layers (and pooling layers).
The term "Patchwise training" is intended to avoid the redundancies of full image training.
In semantic segmentation, given that you are classifying each pixel in the image, by using the whole image, you are adding a lot of redundancy in the input. A standard approach to avoid this during training segmentation networks is to feed the network with batches of random patches (small image regions surrounding the objects of interest) from the training set instead of full images. This "patchwise sampling" ensures that the input has enough variance and is a valid representation of the training dataset (the mini-batch should have the same distribution as the training set). This technique also helps to converge faster and to balance the classes. In this paper, they claim that is it not necessary to use patch-wise training and if you want to balance the classes you can weight or sample the loss.
In a different perspective, the problem with full image training in per-pixel segmentation is that the input image has a lot of spatial correlation. To fix this, you can either sample patches from the training set (patchwise training) or sample the loss from the whole image. That is why the subsection is called "Patchwise training is loss sampling".
So by "restricting the loss to a randomly sampled subset of its spatial terms excludes patches from the gradient computation." They tried this "loss sampling" by randomly ignoring cells from the last layer so the loss is not calculated over the whole image.