How to interpret a confusion matrix in YOLOv5 for a single class?

I used the YOLOv5 network for object detection on my dataset, which has only one class, but I do not understand the confusion matrix. Why is FP one and TN zero?

You can take a look at this issue. In short, a confusion matrix isn't the best metric for object detection because it depends entirely on the confidence threshold. Even for single-class detection, try to use mean average precision as the main metric.

A confusion matrix is most often used for classification tasks. FP (False Positive) means the model predicted YES while it was actually NO. FN (False Negative) means the model predicted NO while it was actually YES, and TN (True Negative) means the model predicted NO while it was actually NO. In object detection, mAP (mean Average Precision) is normally used instead. Also, you should upload a picture of the confusion matrix; only then can the community explain why your FP is one and TN is zero.

The matrix indicates that 100% of the background FPs are caused by a single class. Background FPs are detections that do not match any ground-truth label, and with only one class they all necessarily come from that class, so the cell shows 1. Background FN is for the ground-truth objects that the model fails to detect; that cell is shown as empty or null.
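To see where the 1 and the 0 come from, here is a minimal sketch. It assumes the plotted matrix has predictions on the rows and ground truth on the columns, and that each column is divided by its sum, which is my understanding of how YOLOv5 draws its normalized plot. The counts (13 TP, 2 FP, 7 FN) and the helper name normalize_columns are made up for illustration.

```python
def normalize_columns(matrix):
    """Divide every column by its sum (columns that sum to 0 stay all zeros)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_sums = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [[matrix[r][c] / col_sums[c] if col_sums[c] else 0.0
             for c in range(n_cols)]
            for r in range(n_rows)]


# Hypothetical counts for a single-class detector.
tp, fp, fn = 13, 2, 7

# Rows: predicted (object, background); columns: true (object, background).
# The bottom-right cell (background predicted as background) is never counted,
# because a detector has no finite set of "true negative" boxes.
counts = [[tp, fp],
          [fn, 0]]

normalized = normalize_columns(counts)
for row in normalized:
    print(row)
```

Under these assumptions the background column has only one non-zero cell, so after normalization it is 1 no matter how many false positives there are, and the cell below it is 0.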

Related

What is the purpose of the scales factor in Faster Rcnn Box Coder?

I'm using the object detection api and tuning the parameters for a SSD task. My question refers to the box coder at https://github.com/tensorflow/models/blob/master/research/object_detection/box_coders/faster_rcnn_box_coder.py.
Why are these scale factors set to [10, 10, 5, 5]? The original paper does not explain it. I suspect that it has to do either with assigning a different weight to the 4 components of the location error (tx, ty, tw, th) or with some numerical stability issue, but I would like to have confirmation. Thanks
I found the answer here: https://leimao.github.io/blog/Bounding-Box-Encoding-Decoding/, where the variables are used as a sort of representation encoding with variance. The question was also the subject of this issue: https://github.com/rykov8/ssd_keras/issues/53
The network predicts changes for each anchor box. That is, for each anchor box, it predicts an offset for the x, y position and the width, height.
A short description of these parameters can be found, for example, at these links:
https://medium.com/#smallfishbigsea/understand-ssd-and-implement-your-own-caa3232cd6ad
https://lambdalabs.com/blog/how-to-implement-ssd-object-detection-in-tensorflow/
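As a rough illustration of what the scale factors do, here is a simplified sketch of the encoding as I understand it from the linked box coder: each offset is computed relative to the anchor and then multiplied by its scale factor. Multiplying by 10 and 5 is the same as dividing by "variances" of 0.1 and 0.2. The function names encode_box and decode_box are my own, not the API's, and the boxes are plain tuples rather than tensors.

```python
import math


def encode_box(box, anchor, scale_factors=(10.0, 10.0, 5.0, 5.0)):
    """Encode `box` relative to `anchor`.

    Both boxes are (ycenter, xcenter, height, width); scale_factors are
    (y_scale, x_scale, height_scale, width_scale).
    """
    ycenter, xcenter, h, w = box
    ycenter_a, xcenter_a, ha, wa = anchor
    sy, sx, sh, sw = scale_factors
    ty = sy * (ycenter - ycenter_a) / ha   # center offsets, in anchor units
    tx = sx * (xcenter - xcenter_a) / wa
    th = sh * math.log(h / ha)             # log size ratios
    tw = sw * math.log(w / wa)
    return (ty, tx, th, tw)


def decode_box(code, anchor, scale_factors=(10.0, 10.0, 5.0, 5.0)):
    """Invert encode_box: undo the scaling, then undo the offsets."""
    ty, tx, th, tw = code
    ycenter_a, xcenter_a, ha, wa = anchor
    sy, sx, sh, sw = scale_factors
    ycenter = ty / sy * ha + ycenter_a
    xcenter = tx / sx * wa + xcenter_a
    h = math.exp(th / sh) * ha
    w = math.exp(tw / sw) * wa
    return (ycenter, xcenter, h, w)
```

Because the regression loss is computed on the scaled targets, the scaling makes the typically small offsets larger and changes how strongly the position terms are weighted relative to the size terms; the decoded boxes are unchanged as long as the same factors are used in both directions.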

TensorFlow Object Detection API: evaluation mAP behaves weirdly?

I am training an object detector for my own data using Tensorflow Object Detection API. I am following the (great) tutorial by Dat Tran https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9. I am using the provided ssd_mobilenet_v1_coco-model pre-trained model checkpoint as the starting point for the training. I have only one object class.
I exported the trained model, ran it on the evaluation data and looked at the resulting bounding boxes. The trained model worked nicely; I would say that if there were 20 objects, typically 13 had spot-on predicted bounding boxes ("true positives"), 7 were not detected ("false negatives"), and there were 2 problem cases where two or more objects are close to each other and the bounding boxes get drawn between the objects ("false positives"; of course, calling these "false positives" etc. is inaccurate, but this is just for me to understand the concept of precision here). There are almost no other "false positives". This seems a much better result than I was hoping for, and while this kind of visual inspection does not give the actual mAP (which is calculated based on the overlap of the predicted and tagged bounding boxes?), I would roughly estimate the mAP as something like 13/(13+2) > 80%.
However, when I run the evaluation (eval.py) on two different evaluation sets, I get the following mAP graph (0.7 smoothed):
[Figure: mAP during training]
This would indicate huge variation in mAP, and a level of about 0.3 at the end of training, which is far worse than I would expect given how well the bounding boxes are drawn when I run the exported output_inference_graph.pb on the evaluation set.
Here is the total loss graph for the training:
[Figure: total loss during training]
My training data consists of 200 images with about 20 labeled objects each (I labeled them using the labelImg app); the images are extracted from a video and the objects are small and somewhat blurry. The original image size is 1200x900, so I reduced it to 600x450 for the training data. The evaluation data (which I used both as the evaluation data set for eval.py and to visually check what the predictions look like) is similar and consists of 50 images with 20 objects each, but is still in the original size (the training data is extracted from the first 30 min of the video and the evaluation data from the last 30 min).
Question 1: Why is the mAP so low in evaluation when the model appears to work so well? Is it normal for the mAP graph to fluctuate so much? I did not touch the default values for how many images TensorBoard uses to draw the graph (I read this question: "Tensorflow object detection api validation data size" and have some vague idea that there is some default value that can be changed?)
Question 2: Can this be related to the different sizes of the training data and the evaluation data (600x450 vs 1200x900)? If so, should I resize the evaluation data, too? (I did not want to do this as my application uses the original image size, and I want to evaluate how well the model does on that data.)
Question 3: Is it a problem to form the training and evaluation data from images where there are multiple tagged objects per image? (I.e., surely the evaluation routine compares all the predicted bounding boxes in one image to all the tagged bounding boxes in that image, and not all the predicted boxes in one image to a single tagged box, which would produce many false "false positives"?)
(Question 4: It seems to me the model training could have been stopped after around 10000 timesteps, where the mAP kind of leveled out. Is it now overtrained? It's kind of hard to tell when it fluctuates so much.)
I am a newbie with object detection so I very much appreciate any insight anyone can offer! :)
Question 1: This is the tough one... First, I think you don't correctly understand what mAP is, since your rough calculation is wrong. Here is, briefly, how it is computed:
For each class of object, using the overlap between the real objects and the detected ones, the detections are tagged as "True Positive" or "False Positive"; all the real objects with no "True Positive" associated with them are labelled "False Negative".
Then, iterate through all your detections (on all images of the dataset) in decreasing order of confidence. Compute the precision (TP/(TP+FP)) and recall (TP/(TP+FN), where TP+FN is the total number of real objects), counting as TP and FP only the detections that you've already seen (those with confidence at least as high as the current one). This gives you a point (precision, recall) that you can put on a precision-recall graph.
Once you've added all possible points to your graph, you compute the area under the curve: this is the Average Precision for this category.
If you have multiple categories, the mAP is the plain mean of all APs.
Applying that to your case: in the best case your true positives are the detections with the highest confidence. In that case your precision/recall curve will look like a rectangle: you'd have 100% precision up to 13/20 recall, and then points with 13/20 recall and <100% precision; this gives you mAP = AP(category 1) = 13/20 = 0.65. And this is the best case; you can expect less in practice due to false positives with higher confidence.
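The procedure can be sketched in a few lines. This is a minimal illustration, not the evaluation code of the Object Detection API: the function name average_precision is my own, the detections are assumed to be already matched against the ground truth (True for a true positive, False for a false positive) and sorted by decreasing confidence, and the area is computed with the common "make precision non-increasing, then sum rectangles" variant.

```python
def average_precision(is_true_positive, num_ground_truth):
    """AP for one class.

    is_true_positive: list of bools, one per detection, sorted by
        decreasing confidence.
    num_ground_truth: number of real objects (TP + FN).
    """
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Interpolate: precision at a given recall is the best precision
    # reachable at that recall or any higher one.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the precision-recall curve as a sum of rectangles.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Best case from the question: 13 true positives ranked first, then 2 false
# positives, with 20 real objects in total.
best_case = average_precision([True] * 13 + [False] * 2, 20)
# Same detections, but the 2 false positives have the highest confidence.
worse_case = average_precision([False] * 2 + [True] * 13, 20)
print(best_case, worse_case)
```

The best case comes out at 13/20 = 0.65, and ranking the two false positives first pushes the value lower, which is why the order of confidences matters and not only the counts.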
Other reasons why yours could be lower:
Maybe among the bounding boxes that appear to be good, some are still rejected in the calculations because the overlap between the detection and the real object is not quite big enough. The criterion is that the Intersection over Union (IoU) of the two bounding boxes (real one and detection) should be over 0.5. While it seems like a gentle threshold, it's not really; you should probably try to write a script that displays the detected bounding boxes in a different color depending on whether they're accepted or not (if not, you'll get both a FP and a FN).
Maybe you're only visualizing the first 10 images of the evaluation set. If so, change that, for 2 reasons: 1. Maybe you're just lucky on these images, and they're not representative of what follows. 2. Actually, more than luck, if these images are the first from the evaluation set, they come right after the end of the training set in your video, so they are probably quite similar to some images in the training set and therefore easier to predict, so they're not representative of your evaluation set.
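For the IoU check mentioned above, a minimal sketch could look like this. The function name iou and the (xmin, ymin, xmax, ymax) box layout are my own choices for illustration; adapt them to however your boxes are stored.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical small object: a 10x10 ground-truth box and a detection of the
# same size that is off by only 2 pixels in each direction.
ground_truth = (0, 0, 10, 10)
detection = (2, 2, 12, 12)
overlap = iou(ground_truth, detection)
print(overlap, "accepted" if overlap > 0.5 else "rejected")
```

Here the intersection is 8x8 = 64 and the union is 100 + 100 - 64 = 136, so the IoU is about 0.47 and the detection is rejected even though it would look fine by eye. With small objects, a couple of pixels is enough to fall under the threshold.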
Question 2: If you have not changed that part of the mobilenet_v1_coco-model config file, all your images (both for training and testing) are rescaled to 300x300 pixels at the start of the network, so your preprocessing doesn't matter.
Question 3: no it's not a problem at all, all these algorithms were designed to detect multiple objects in images.
Question 4: Given the fluctuations, I'd actually keep training it until you can see improvement or clear overtraining. 10k steps is actually quite small, maybe it's enough because your task is relatively easy, maybe it's not enough and you need to wait ten times that to have significant improvement...

Does the object location in the training set affect the results for Faster RCNN?

Has anyone tested the effect of per-class object location in Faster RCNN?
Suppose one of the object classes in my training data always appears in one area of the frame, say the top right of the image, and the evaluation dataset has an image where this object is in another area, say the bottom left.
Is Faster RCNN able to handle this case?
Or, if I want my network to find all of the classes in all areas of the frame, do I need to provide examples in the training dataset that cover all the areas?
Quoting the Faster R-CNN paper:
An important property of our approach is that it is
translation invariant, both in terms of the anchors and the
functions that compute proposals relative to the anchors. If
one translates an object in an image, the proposal should
translate and the same function should be able to predict the
proposal in either location. This translation-invariant property
is guaranteed by our method*
*As is the case of FCNs [7], our network is translation invariant up to the network’s total stride
So the short answer is that you'll probably be OK if the object is mostly at a certain location in the training set and somewhere else in the test set.
A slightly longer answer is that the location may have side effects that affect the accuracy, and it will probably be better to have the object in different locations. However, you can try, for testing purposes, moving N test samples into the training set and see how the accuracy changes on the remaining test samples.

Why initialize a TensorFlow variable with a small stddev?

I have a question about why TensorFlow variables are initialized with a small stddev.
I guess many people have tried the MNIST example code from the TensorFlow beginner's guide.
Following it, the first layer's weights are initialized using truncated_normal with stddev 0.1.
I guessed that setting it to a much bigger value would give the same result, i.e. the same accuracy.
But even after increasing the epoch count, it doesn't work.
Does anybody know the reason?
original :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=0.1), name='w_'+name)
#result : (990, 0.93000001, 0.89719999)
modified :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=200), name='w_'+name)
#result : (99990, 0.1, 0.098000005)
The reason is that you want to keep the variances (or standard deviations) of all layers approximately the same, and sane. It has to do with the error backpropagation step of the learning process and the activation functions used.
In order to learn the network's weights, the backpropagation step requires knowledge of the network's gradient, a measure of how strongly each weight influences the final output; a layer's weight variance directly influences the propagation of gradients.
Say, for example, that the activation function is sigmoidal (e.g. tf.nn.sigmoid or tf.nn.tanh); this implies that all input values are squashed into a fixed output value range. For the sigmoid, it is the range 0..1, where essentially all inputs z outside +/- 4 produce outputs very close to one (for z > 4) or zero (for z < -4), and only inputs within that range produce a meaningful "change" in the output.
Now the difference between the values sigmoid(5) and sigmoid(1000) is barely noticeable. Because of that, weights that produce very large or very small pre-activations will optimize very slowly, since their influence on the result y = sigmoid(W*x+b) is extremely small. The pre-activation value z = W*x+b (where x is the input) depends on the actual input x and the current weights W. If either of them is large, e.g. because the weights were initialized with a high variance (i.e. standard deviation), the result will necessarily be (relatively) large, leading to said problem. This is also the reason why truncated_normal is used rather than a plain normal distribution: the latter only guarantees that most of the values are close to the mean, with a chance of somewhat less than 5% that a value lies more than two standard deviations away, while truncated_normal discards and re-draws every value more than two standard deviations from the mean, guaranteeing that all weights are in the same range while still being approximately normally distributed.
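To make the saturation argument concrete, here is a small pure-Python sketch (no TensorFlow). Everything in it is an assumption for illustration: the inputs are 784 uniform random values standing in for MNIST pixels, truncated_gauss is my own stand-in for truncated_normal (re-drawing values beyond two standard deviations), and mean_abs_preactivation is a made-up helper that measures how large z = W*x gets for a given stddev.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), at most 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def truncated_gauss(rng, stddev):
    """Draw from N(0, stddev^2), re-drawing values beyond 2 stddev."""
    while True:
        value = rng.gauss(0.0, stddev)
        if abs(value) <= 2.0 * stddev:
            return value


def mean_abs_preactivation(stddev, n_inputs=784, n_units=50, seed=0):
    """Average |z| over n_units, where z = sum(w_i * x_i) for random w and x."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n_inputs)]  # fake pixel values in [0, 1)
    total = 0.0
    for _ in range(n_units):
        z = sum(truncated_gauss(rng, stddev) * xi for xi in x)
        total += abs(z)
    return total / n_units


small = mean_abs_preactivation(0.1)
large = mean_abs_preactivation(200.0)
print("stddev=0.1:", small, "gradient there:", sigmoid_grad(small))
print("stddev=200:", large, "gradient there:", sigmoid_grad(large))
```

With stddev 0.1 the typical pre-activation stays in the region where the sigmoid still has a usable slope, while with stddev 200 it is in the thousands and the slope is numerically zero, so gradient descent has almost nothing to work with.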
To make matters worse, in a typical neural network, especially in deep learning, each network layer is followed by one or many others. If in each layer the output value range is big, the gradients will get bigger and bigger as well; this is known as the exploding gradients problem (the counterpart of the vanishing gradients problem, where gradients get smaller and smaller).
This is a problem because learning starts at the very last layer and each weight is adjusted depending on how much it contributed to the error. If the gradients are indeed getting very big towards the end, the very last layer is the first one to pay a high toll for this: its weights get adjusted very strongly, likely overcorrecting the actual problem, and then only the "remaining" error gets propagated further back, or up, the network. Here, since the last layer was already "fixed a lot" regarding the measured error, only smaller adjustments will be made. This may lead to the problem that the first layers are corrected only by a tiny bit or not at all, effectively preventing all learning there. The same basically happens if the learning rate is too big.
Finding the best weight initialization is a topic in itself, and there are somewhat more sophisticated methods such as Xavier initialization or Layer-sequential unit variance; however, small normally distributed values are usually a good guess.
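For comparison, the normal variant of Xavier (Glorot) initialization picks the standard deviation from the layer sizes as sqrt(2 / (fan_in + fan_out)). A tiny sketch, where the function name and the layer sizes (784 inputs, 10 outputs) are assumptions chosen to resemble a single-layer MNIST model:

```python
import math


def xavier_normal_stddev(fan_in, fan_out):
    """Standard deviation used by Glorot/Xavier normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


# Hypothetical layer: 784 inputs (28x28 pixels) and 10 outputs.
stddev = xavier_normal_stddev(784, 10)
print(stddev)
```

For these sizes the result is about 0.05, the same order of magnitude as the 0.1 from the tutorial and several thousand times smaller than 200.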

Probability Density Function with Zero Standard Deviation

I am now implementing an email filtering application using the Naive Bayes algorithm. My application uses the Spambase Data Set from the UCI Machine Learning Repository. Since the attributes are continuous, I calculate the probability using the Probability Density Function (PDF). However, when I evaluate the data using k-fold cross-validation, a training set may contain only 0 for one of its attributes. In that case I get a standard deviation of 0, the PDF returns NaN, and a large number of spam emails are misclassified with that training set. What should I do to fix the problem?
You could use a discrete PDF, which will always be bounded.
Alternatively, simply ignore any attribute with zero variance. There is no point in including distributions with zero variance, because they won't actually do anything. For example, you want to know how old I am, and then I tell you that I live on planet Earth. That shouldn't change your estimate, because every single piece of data you have is for people on planet Earth.
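The "ignore zero-variance attributes" option could be sketched like this. It is a minimal illustration, not a full classifier: class priors are left out, the function names are my own, and the tiny two-attribute data set is invented. One detail I am assuming is sensible here: an attribute is dropped for every class if its variance is zero in any class, so that the likelihoods of different classes stay comparable.

```python
import math


def fit_class(rows):
    """Per-attribute (mean, variance) for the training rows of one class."""
    n = len(rows)
    stats = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        stats.append((mean, variance))
    return stats


def usable_attributes(stats_by_class):
    """Indices of attributes whose variance is non-zero in every class."""
    n_attributes = len(next(iter(stats_by_class.values())))
    return [j for j in range(n_attributes)
            if all(stats[j][1] > 0.0 for stats in stats_by_class.values())]


def log_likelihood(x, stats, attributes):
    """Sum of Gaussian log-densities over the selected attributes only."""
    total = 0.0
    for j in attributes:
        mean, variance = stats[j]
        total += (-0.5 * math.log(2.0 * math.pi * variance)
                  - (x[j] - mean) ** 2 / (2.0 * variance))
    return total


# Invented training fold: attribute 0 is always 0 for the spam class.
stats_by_class = {
    "spam": fit_class([[0.0, 1.0], [0.0, 3.0]]),
    "ham": fit_class([[0.5, 0.0], [1.5, 0.2]]),
}
attributes = usable_attributes(stats_by_class)
sample = [0.7, 2.5]
scores = {label: log_likelihood(sample, stats, attributes)
          for label, stats in stats_by_class.items()}
print(attributes, scores)
```

Working with log-densities instead of multiplying raw PDF values also avoids underflow when there are many attributes, which is a separate but common source of NaN and zero results.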