What is the ImageNet VID policy for evaluating frames with zero objects inside? - object-detection

I am trying to evaluate my video object detection module, and I am using the ImageNet VID dataset for this purpose. At some point I have to evaluate a frame containing zero objects, meaning there are no ground truth bboxes in this frame (which is expected, since we are talking about video object detection).
Since the module I am using expects at least 1 bbox to be present, I was wondering what ImageNet's official treatment of this case is. The only description I could find on the ImageNet site (which is obviously not exhaustive) states:
The evaluation metric is the same as for the objct detection task,
meaning objects which are not annotated will be penalized, as will
duplicate detections (two annotations for the same object instance).
(sic; typo is from the original text)
This does not mention the above scenario. Since it is such a brief description, I am not sure it covers every edge case. Normally in single-image object detection this is not an issue, since evaluation samples always contain some object. But in this case, does it mean I should simply ignore those frames altogether?
Also, checking this repository about object detection metrics (which is super thorough, by the way), the no-GT case seems to fall into the general False Positive (FP) scenario: the intersection would be 0 (since no GT bbox exists) and the union would just be a non-zero number equal to the area of the FP bbox, so IoU = 0.
So, how does the official ImageNet evaluation deal with these cases? I am not interested in what a reasonable choice would be here, just the official version.

I have just looked through the ImageNet VID 2015 evaluation code, which I got from the evaluation kit from UNC.
The evaluation deals with precision and recall, and so needs to calculate TP, FP and FN for every GT box/detection pair or instance. The IoU calculation is used purely to determine whether a valid detection has taken place.
For frames with no GT boxes and no detections: As we are not recording true negatives, these make no difference to the calculation.
For frames with no GT boxes, but some detections: These false positives are captured for each frame on line 231 of eval_vid_detections.m:
if kmax > 0
    tp(j) = 1;              % detection j matched GT box kmax: true positive
    gt_detected(kmax) = 1;  % mark that GT box as detected
else
    fp(j) = 1;              % no matching GT box: false positive
end
For frames with GT boxes, but no detections: These GT boxes are counted when the GT data is first loaded on line 79: num_pos_per_class(c) = num_pos_per_class(c) + 1;. This is later used when calculating the recall on line 266: recall{c}=(tp/num_pos_per_class(c))';
So, if your frame contains no detections and no GT boxes, you can safely ignore it.
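To make that concrete, here is a minimal Python sketch (mine, not the official MATLAB kit) of the same per-class TP/FP/recall bookkeeping; the frames list, box format and iou helper are illustrative assumptions:

import numpy as np

def iou(a, b):
    # boxes as [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def evaluate_class(frames, iou_thr=0.5):
    # frames: list of (detections, gt_boxes) per frame, for a single class;
    # detections: list of (score, box); gt_boxes: list of box (possibly empty)
    scores, tp, fp = [], [], []
    num_pos = 0  # plays the role of num_pos_per_class(c)
    for detections, gt_boxes in frames:
        num_pos += len(gt_boxes)  # undetected GT boxes only show up here, lowering recall
        matched = [False] * len(gt_boxes)
        for score, box in sorted(detections, key=lambda d: -d[0]):
            best_iou, best_k = 0.0, -1
            for k, g in enumerate(gt_boxes):
                o = iou(box, g)
                if not matched[k] and o > best_iou:
                    best_iou, best_k = o, k
            scores.append(score)
            if best_k >= 0 and best_iou >= iou_thr:
                tp.append(1); fp.append(0)
                matched[best_k] = True
            else:
                tp.append(0); fp.append(1)  # every detection in a frame with zero GT boxes lands here
    order = np.argsort(scores)[::-1]        # sort all detections by decreasing confidence
    tp = np.cumsum(np.array(tp)[order])
    fp = np.cumsum(np.array(fp)[order])
    recall = tp / max(num_pos, 1)
    precision = tp / np.maximum(tp + fp, 1)
    return precision, recall

A frame with no GT boxes and no detections adds nothing to scores, tp, fp or num_pos, so it drops out of the result entirely.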
As an aside, note that the per-instance detection threshold is set like this:
thr = (gt_w.*gt_h)./((gt_w+pixelTolerance).*(gt_h+pixelTolerance));
gt_obj_thr{i} = min(defaultIOUthr,thr);
where pixelTolerance = 10. This gives a bit of a boost to small objects.
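For intuition, a quick numeric check of that formula (Python; using defaultIOUthr = 0.5 and pixelTolerance = 10 as in the kit):

default_iou_thr = 0.5
pixel_tolerance = 10

def instance_iou_threshold(gt_w, gt_h):
    # thr = (gt_w.*gt_h)./((gt_w+pixelTolerance).*(gt_h+pixelTolerance))
    thr = (gt_w * gt_h) / ((gt_w + pixel_tolerance) * (gt_h + pixel_tolerance))
    return min(default_iou_thr, thr)

print(instance_iou_threshold(20, 20))    # ~0.444: a 20x20 box gets a relaxed IoU threshold
print(instance_iou_threshold(200, 200))  # 0.5: a large box stays capped at defaultIOUthr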

What about the case where there is a missing annotation for an object which should have had a GT annotation?
For example, ILSVRC2015_val_00000001/000266.JPEG clearly contains a turtle (in fact, up until 000265.JPEG all the frames have the corresponding turtle annotation), but the corresponding annotation file ILSVRC2015_val_00000001/000266.xml doesn't contain any annotation.
In my analysis, there are 4046 frames out of 176126 in the validation dataset which have no GT annotation. Most of these frames lack a GT annotation despite containing an object belonging to one of the 30 categories of ImageNet VID.

Related

How to interpret a confusion matrix in yolov5 for single class?

I used the yolov5 network for object detection on my database, which has only one class. But I do not understand the confusion matrix. Why is FP one and TN zero?
You can take a look at this issue. In short, the confusion matrix isn't the best metric for object detection because it all depends on the confidence threshold. Even for one-class detection, try to use mean average precision as the main metric.
We usually use a confusion matrix when working with text data. FP (False Positive) means the model predicted YES while it was actually NO. FN (False Negative) means the model predicted NO while it was actually YES, and TN (True Negative) means the model predicted NO while it was actually NO. In object detection, mAP (mean Average Precision) is normally used instead. Also, you should upload a picture of the confusion matrix; only then will the community be able to guide you on why your FP is one and TN is zero.
The matrix indicates that 100% of the background FPs are caused by your single class, i.e. detections that do not match any ground truth label; that is why it shows 1. Background FN is for the ground truth objects that cannot be detected by the model, which shows as empty or null.
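If it helps, here is a rough sketch (my own, not YOLOv5's implementation) of how such a detection confusion matrix for one class plus background is typically filled; the thresholds and box format are illustrative assumptions:

import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_confusion_matrix(detections, gt_boxes, conf_thr=0.25, iou_thr=0.45):
    # rows = predicted {class 0, background}, cols = actual {class 0, background}
    cm = np.zeros((2, 2), dtype=int)
    matched = set()
    kept = [d for d in detections if d[0] >= conf_thr]  # d = (score, box)
    for score, box in sorted(kept, key=lambda d: -d[0]):
        best_iou, best_k = 0.0, -1
        for k, g in enumerate(gt_boxes):
            o = iou(box, g)
            if k not in matched and o > best_iou:
                best_iou, best_k = o, k
        if best_iou >= iou_thr:
            cm[0, 0] += 1            # detection matched a GT object (TP)
            matched.add(best_k)
        else:
            cm[0, 1] += 1            # "background FP": detection with no matching GT
    cm[1, 0] += len(gt_boxes) - len(matched)  # "background FN": GT objects never detected
    # cm[1, 1] (TN) is never incremented: "background correctly left undetected"
    # is not countable in object detection, so it stays 0.
    return cm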

TensorFlow Object Detection API: evaluation mAP behaves weirdly?

I am training an object detector for my own data using the TensorFlow Object Detection API. I am following the (great) tutorial by Dat Tran https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9. I am using the provided ssd_mobilenet_v1_coco pre-trained model checkpoint as the starting point for the training. I have only one object class.
I exported the trained model, ran it on the evaluation data and looked at the resulting bounding boxes. The trained model worked nicely; I would say that if there were 20 objects, typically there were 13 objects with spot-on predicted bounding boxes ("true positives"); 7 where the objects were not detected ("false negatives"); and 2 problem cases where two or more objects are close to each other: the bounding boxes get drawn between the objects in some of these cases ("false positives" <- of course, calling these "false positives" etc. is inaccurate, but this is just for me to understand the concept of precision here). There are almost no other "false positives". This seems like a much better result than what I was hoping for, and while this kind of visual inspection does not give the actual mAP (which is calculated based on the overlap of the predicted and tagged bounding boxes?), I would roughly estimate the mAP as something like 13/(13+2) > 80%.
However, when I run the evaluation (eval.py) (on two different evaluation sets), I get the following mAP graph (0.7 smoothed):
mAP during training
This would indicate a huge variation in mAP, and a level of about 0.3 at the end of the training, which is way worse than what I would assume based on how well the bounding boxes are drawn when I use the exported output_inference_graph.pb on the evaluation set.
Here is the total loss graph for the training:
total loss during training
My training data consists of 200 images with about 20 labeled objects each (I labeled them using the labelImg app); the images are extracted from a video, and the objects are small and kind of blurry. The original image size is 1200x900, so I reduced it to 600x450 for the training data. The evaluation data (which I used both as the evaluation set for eval.py and to visually check what the predictions look like) is similar: it consists of 50 images with 20 objects each, but is still in the original size (the training data is extracted from the first 30 min of the video and the evaluation data from the last 30 min).
Question 1: Why is the mAP so low in evaluation when the model appears to work so well? Is it normal for the mAP graph to fluctuate so much? I did not touch the default values for how many images TensorBoard uses to draw the graph (I read this question: Tensorflow object detection api validation data size and have some vague idea that there is some default value that can be changed?)
Question 2: Can this be related to the different sizes of the training data and the evaluation data (1200x900 vs 600x450)? If so, should I resize the evaluation data, too? (I did not want to do this as my application uses the original image size, and I want to evaluate how well the model does on that data).
Question 3: Is it a problem to form the training and evaluation data from images where there are multiple tagged objects per image (i.e. surely the evaluation routine compares all the predicted bounding boxes in one image to all the tagged bounding boxes in one image, and not all the predicted boxes in one image to one tagged box, which would produce many "false false positives"?)
(Question 4: it seems to me the model training could have been stopped after around 10000 timesteps, where the mAP kind of leveled out; is it now overtrained? It's kind of hard to tell when it fluctuates so much.)
I am a newbie with object detection so I very much appreciate any insight anyone can offer! :)
Question 1: This is the tough one... First, I think you don't correctly understand what mAP is, since your rough calculation is wrong. Here is, briefly, how it is computed:
For each class of object, using the overlap between the real objects and the detected ones, the detections are tagged as "True positive" or "False positive"; all the real objects with no "True positive" associated with them are labelled "False Negative".
Then, iterate through all your detections (on all images of the dataset) in decreasing order of confidence. Compute the precision (TP/(TP+FP)) and recall (TP/(TP+FN)), only counting the detections that you've already seen (with confidence greater than or equal to the current one) for TP and FP. This gives you a point (precision, recall) that you can put on a precision-recall graph.
Once you've added all possible points to your graph, you compute the area under the curve: this is the Average Precision for this category.
If you have multiple categories, the mAP is the standard mean of all APs.
Applying that to your case: in the best case, your true positives are the detections with the highest confidence. In that case your precision/recall curve will look like a rectangle: you'd have 100% precision up to 13/20 recall, and then points with 13/20 recall and <100% precision; this gives you mAP = AP(category 1) = 13/20 = 0.65. And this is the best case; you can expect less in practice due to false positives with higher confidence.
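To make that concrete, here is a small Python sketch (mine, not the evaluation code) that computes AP by simple rectangle integration of the precision-recall curve, using your numbers of 13 true positives, 2 false positives and 20 ground-truth objects:

import numpy as np

def average_precision(tp_flags, num_gt):
    # tp_flags: 1/0 per detection, already sorted by decreasing confidence
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    precision = tp / (tp + fp)
    recall = tp / num_gt
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):   # area under the precision-recall curve
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# Best case: the 13 correct detections all rank above the 2 false positives.
print(average_precision([1]*13 + [0]*2, num_gt=20))   # 0.65
# Worse case: the 2 false positives have the highest confidence.
print(average_precision([0]*2 + [1]*13, num_gt=20))   # ~0.47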
Other reasons why yours could be lower:
maybe among the bounding boxes that appear to be good, some are still rejected in the calculations because the overlap between the detection and the real object is not quite big enough. The criterion is that the Intersection over Union (IoU) of the two bounding boxes (the real one and the detection) should be over 0.5. While it seems like a lenient threshold, it's not really; you should probably try to write a script that displays the detected bounding boxes in a different color depending on whether they're accepted or not (if not, you'll get both an FP and an FN); see the sketch after this list.
maybe you're only visualizing the first 10 images of the evaluation. If so, change that, for 2 reasons: 1. maybe you're just very lucky with these images, and they're not representative of what follows. 2. More than luck: if these images are the first ones from the evaluation set, they come right after the end of the training set in your video, so they are probably quite similar to some images in the training set, which makes them easier to predict and not representative of your evaluation set.
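For the color-coded visualization suggested in the first bullet, a rough matplotlib sketch could look like this (boxes as [x1, y1, x2, y2]; the image path and box lists are placeholders for your own data):

import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.image as mpimg

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def show_detections(image_path, detections, gt_boxes, iou_thr=0.5):
    # green = detection accepted (IoU >= iou_thr with some GT box), red = rejected, white = GT
    fig, ax = plt.subplots()
    ax.imshow(mpimg.imread(image_path))
    for g in gt_boxes:
        ax.add_patch(patches.Rectangle((g[0], g[1]), g[2] - g[0], g[3] - g[1],
                                       fill=False, edgecolor='white'))
    for d in detections:
        accepted = any(iou(d, g) >= iou_thr for g in gt_boxes)
        ax.add_patch(patches.Rectangle((d[0], d[1]), d[2] - d[0], d[3] - d[1],
                                       fill=False, edgecolor='green' if accepted else 'red'))
    plt.show()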
Question 2: if you have not changed that part of the ssd_mobilenet_v1_coco config file, all your images (both for training and testing) are rescaled to 300x300 pixels at the start of the network, so your preprocessing doesn't matter.
Question 3: no, it's not a problem at all; all these algorithms were designed to detect multiple objects in images.
Question 4: Given the fluctuations, I'd actually keep training it until you can see improvement or clear overtraining. 10k steps is actually quite small; maybe it's enough because your task is relatively easy, or maybe it's not enough and you need to wait ten times that to see significant improvement...

Does the object location in training affect the results for Faster RCNN?

Has anyone tried the effect of per-class object location in Faster RCNN?
Suppose my training data always has one of the object classes in one area of the frame, let's say the top right of the image, and in the evaluation dataset I have one image where this object is in another area, say the bottom left.
Is Faster RCNN capable of handling this case?
Or, if I want my network to find all of the classes in all areas of the frame, do I need to provide examples in the training dataset that cover all the areas?
Quoting the Faster R-CNN paper:
An important property of our approach is that it is
translation invariant, both in terms of the anchors and the
functions that compute proposals relative to the anchors. If
one translates an object in an image, the proposal should
translate and the same function should be able to predict the
proposal in either location. This translation-invariant property
is guaranteed by our method*
*As is the case of FCNs [7], our network is translation invariant up to the network’s total stride
So the short answer is that you'll probably be OK if the object is mostly at a certain location in the train set and somewhere else in the test set.
A slightly longer answer is that the location may have side effects that affect the accuracy, and it will probably be better to have the object in different locations; however, you can try to add - for testing purposes - N test samples to the train set and see how the accuracy changes on the remaining test-set samples (the original test set minus those N).
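A minimal sketch of that experiment, assuming the samples are kept as simple Python lists of (image, annotation) pairs:

import random

def move_n_to_train(train_samples, test_samples, n, seed=0):
    # move n randomly chosen test samples into the train set;
    # retrain on new_train, then compare accuracy on new_test with the original test accuracy
    rng = random.Random(seed)
    moved = set(rng.sample(range(len(test_samples)), n))
    new_train = list(train_samples) + [test_samples[i] for i in moved]
    new_test = [s for i, s in enumerate(test_samples) if i not in moved]
    return new_train, new_test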

what's the difference between a ref edge and non-ref edge in tensorflow?

As the title says, I'm wondering about the conceptual difference between a "ref edge" and "non-ref edge" in TensorFlow.
I'm reading the graph partitioning algorithm in TensorFlow.
Here (line 826 of graph_partition.cc) is the comment which mentions
the "non-ref edge":
825 // For a node dst, 'ref_recvs' remembers the recvs introduced by a ref
826 // edge to dst. 'ref_control_inputs' remembers the inputs by a non-ref
827 // edge to dst. We will add a control edge for every pair in
828 // (ref_recvs x ref_control_inputs).
829 std::vector<NodeDef*> ref_recvs;
830 std::vector<string> ref_control_inputs;
Can someone explain the difference more clearly? Thanks very much.
In TensorFlow, most edges are "non-ref" edges, which means that the value flowing along that edge is a constant. If you think of a vertex (operation) in a TensorFlow graph as a function, you can think of a non-ref edge as representing a function argument that is passed by value in a conventional programming language like C or C++. For example, the inputs to and outputs from the operation z = tf.matmul(x, y) are all non-ref edges.
A "ref edge" in TensorFlow allows the value flowing along that edge to be mutated. Continuing the function analogy, a ref edge represents a function argument that is passed by reference (from which we take the name "ref" edge). The most common use of ref edges is in the current internal implementation of tf.Variable: the internal Variable kernel owns a mutable buffer, and outputs a reference to that buffer on a ref edge. Operations such as tf.assign(var, val) expect their var argument to be passed along ref edge, because they need to mutate the value in var.
The graph partitioning algorithm treats ref edges specially because they correspond to values that could change as the graph executes. Since a non-ref edge is a constant value, TensorFlow can assume that all non-ref edges out of the same operation that cross between two devices can be combined into a single edge, which saves on network/memory bandwidth. Since the value on a ref edge can change (e.g. if a variable is updated in the middle of a step), TensorFlow must be careful not to combine the edges, so that the remote device can see the new value. By analogy with C/C++, the TensorFlow graph partitioner treats a ref-edge as representing a volatile variable, for the purposes of optimization.
Finally, as you can tell from the amount of explanation above, ref edges are quite complicated, and there is an ongoing effort to remove them from the TensorFlow execution model. The replacement is "resource-typed edges", which allow non-tensor values to flow along an edge (unifying variables, queues, readers, and other complex objects in TensorFlow), and explicit operations that take a variable resource as input and read its value (as a non-ref output edge). The implementation of the new "resource variables" can be seen here in Python and here in C++.

Strict class labels in SVM

I'm using one-vs-all to do a 21-class svm categorization.
I want the label -1 to mean "not in this class" and the label 1 to mean "indeed in this class" for each of the 21 kernels.
I've generated my pre-computed kernels and my test vectors using this standard.
Using easy.py everything went well for 20 of the classes, but for one of them the labels were switched so that all the inputs that should have been labelled with 1 for being in the class were instead labelled -1 and vice-versa.
The difference in that class was that the first vector in the pre-computed kernel was labelled 1, while in all the other kernels the first vector was labelled -1. This suggests that LibSVM relabels all of my vectors.
Is there a way to prevent this or a simple way to work around it?
You already discovered that libsvm uses the label -1 for whatever label it encounters first.
The reason is that libsvm allows arbitrary labels and maps them to -1 and +1 according to the order in which they appear in the label vector.
So you can either check this directly, or look at the model returned by libsvm.
It contains an entry called Label which is a vector containing the order in which libsvm encountered the labels. You can also use this information to switch the sign of your scores.
If during training libsvm encounters label A first, then during prediction libsvm will use positive values for assigning objects the label A and negative values for the other label.
So if you use label 1 for the positive class and 0 for the negative one, then to obtain the right output values you should do the following trick (MATLAB):
% test_data.y contains 0s and 1s
[labels, ~, values] = svmpredict(test_data.y, test_data.X, model, ' ');
if (model.Label(1) == 0)  % check which label libsvm encountered first during training
    values = -values;     % flip the decision values so that positive means class 1
end