Object detection when the object occupies the full region on the image? - tensorflow

I am working with object detection using Tensorflow. I have a mix of 7-8 classes. Initially we had an image classification model, and we are now moving to an object detection model. For one class alone, the object to be detected occupies the entire image. Can we have the bounding box dimensions be the entire width and height of the image? Will it hinder the performance?

It shouldn't hinder the performance as long as there are enough such examples in the training set.
The OD API clips detections that extend beyond the image, so in these cases the resulting bounding box would be the size of the entire image (or one axis would span the entire size and the other less, depending on how much of the image the object occupies).
Assuming your OD model uses anchors, make sure you have anchors which are responsible for such cases (i.e. with a scale of about the entire image).
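To make that concrete, here is a minimal sketch (my own illustration, not code from the OD API) of how a full-image ground-truth box looks in the normalized [ymin, xmin, ymax, xmax] format the API works with, and how an overshooting detection gets clipped back to the image:

```python
# Minimal sketch (not the OD API's own code): a full-image ground-truth box in
# normalized coordinates, and clipping of a detection that overshoots the image.
import numpy as np

def normalize_box(box_pixels, image_height, image_width):
    """Convert a [ymin, xmin, ymax, xmax] box in pixels to normalized coordinates."""
    ymin, xmin, ymax, xmax = box_pixels
    return np.array([ymin / image_height, xmin / image_width,
                     ymax / image_height, xmax / image_width])

def clip_box(box_norm):
    """Clip a normalized box back into the image, as the OD API does for detections."""
    return np.clip(box_norm, 0.0, 1.0)

# An object that fills a 480x640 image is simply the box [0, 0, 1, 1].
full_image_box = normalize_box([0, 0, 480, 640], image_height=480, image_width=640)
print(full_image_box)                                  # [0. 0. 1. 1.]

# A detection that overshoots the image borders gets clipped back to them.
print(clip_box(np.array([-0.05, 0.0, 1.10, 1.02])))    # [0. 0. 1. 1.]
```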

Related

Image labeling for object detection when the object is larger than the image

How should I label objects to detect them if the object is larger than the image? E.g. I want to label a building, but only part of the building is visible in the picture (windows and doors, without the roof). Or should I remove these pictures from my dataset?
Thank you!
In every object detection dataset I've seen, such objects will just have the label cover whatever is visible, so the bounding box will go up to the border of the image.
It really depends what you want your model to do if it sees an image like this. If you want it to be able to recognise partial buildings, then you should keep them in your dataset and label whatever is visible.
Don't label them. Discard them from your training set. The model needs to learn the difference between the negative class (background) and positive classes (windows, doors). If the positive class takes the whole image, the model will have a massive false positive problem.

Object Detection: Aspect Ratio and Scale of Anchor Boxes

I am working on an object detection problem on my own dataset. I want to figure out the scale and aspect ratio that I should specify in the config file of the Faster RCNN provided by the Tensorflow object detection API. The first step is the image resizer. I am using the fixed shape resizer, as it allows a batch size of more than 1. I read that this uses bilinear interpolation for downsampling and upsampling. How do I calculate the new ground-truth box coordinates after this resizing? Also, once we have the new ground-truth box coordinates, how do we calculate the scale and aspect ratio of the anchor boxes to specify in the config file to improve the localisation loss?
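No answer was posted to this one, but a rough sketch of the arithmetic involved may help (my own illustration; it assumes a fixed_shape_resizer target of 640x640 and a grid anchor generator with a 256-pixel base anchor size — check the actual values in your config):

```python
# Sketch (my own, not from the OD API) of: (1) rescaling a ground-truth box to the
# fixed resizer shape, and (2) reading off a candidate anchor scale / aspect ratio.
import math

def resize_box(box, orig_hw, new_hw=(640, 640)):
    """Rescale a [ymin, xmin, ymax, xmax] pixel box from orig (H, W) to new (H, W)."""
    sy = new_hw[0] / orig_hw[0]
    sx = new_hw[1] / orig_hw[1]
    ymin, xmin, ymax, xmax = box
    return [ymin * sy, xmin * sx, ymax * sy, xmax * sx]

def anchor_stats(box, base_anchor_size=256.0):
    """Aspect ratio (w/h) and scale relative to the base anchor, for a resized box."""
    ymin, xmin, ymax, xmax = box
    h, w = ymax - ymin, xmax - xmin
    aspect_ratio = w / h
    scale = math.sqrt(h * w) / base_anchor_size   # sqrt(area) relative to the base anchor
    return aspect_ratio, scale

resized = resize_box([100, 200, 700, 500], orig_hw=(1080, 1920))
print(resized)                  # box coordinates after resizing to 640x640
print(anchor_stats(resized))    # candidate aspect_ratio and scale for the config
```

In practice you would compute these statistics over all resized ground-truth boxes (e.g. via k-means on their shapes) and put the most common aspect ratios and scales into the config.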

YOLO object detection: how does the algorithm predict bounding boxes larger than a grid cell?

I am trying to better understand how the YOLOv2 & v3 algorithms work. The algorithm processes a series of convolutions until it gets down to a 13x13 grid. Then it is able to classify objects within each grid cell, as well as the bounding boxes for those objects.
If you look at this picture, you see that the bounding box in red is larger than any individual grid cell. Also, the bounding box is centered at the center of the object.
My question has to do with how the predicted bounding boxes can exceed the size of the grid cell, when the network activations are based upon the individual grid cell. I mean, everything outside of the grid cell should be unknown to the neurons predicting the bounding boxes for an object detected in that cell, right?
More precisely here are my questions:
1. How does the algorithm predict bounding boxes that are larger than the grid cell?
2. How does the algorithm know in which cell the center of the object is located?
everything outside of the grid cell should be unknown to the neurons predicting the bounding boxes for an object detected in that cell right.
It's not quite right. The cells correspond to a partition of the image, where each neuron has learned to respond if the center of an object is located within its cell.
However, the receptive field of those output neurons is much larger than the cell and actually covers the entire image. The network is therefore able to recognize and draw a bounding box around an object much larger than its assigned "center cell".
So a cell is centered on the center of the receptive field of the output neuron, but is a much smaller part of it. It is also somewhat arbitrary, and one could imagine, for example, having overlapping cells -- in which case you would expect neighboring neurons to fire simultaneously when an object is centered in the overlapping zone of their cells.
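To make the receptive-field point concrete, here is a small sketch (my own, with a made-up layer list rather than the real YOLO backbone) of the standard recursion for the receptive field of stacked convolutions:

```python
# Small sketch (made-up layer list, not the real YOLO backbone) showing how the
# receptive field of the output neurons grows far beyond a single grid cell.
def receptive_field(layers):
    """layers: list of (kernel, stride). Returns the receptive field in input pixels."""
    rf, jump = 1, 1                  # field size and spacing between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump    # each layer widens the field by (k - 1) * jump
        jump *= stride
    return rf

# A toy stack of 3x3 convs with a stride-2 layer every other step (downsampling 32x).
toy_backbone = [(3, 1), (3, 2)] * 5
print(receptive_field(toy_backbone))   # 125 px, much larger than one 32x32 px cell
```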
YOLO predicts offsets to anchors. The anchors are initialised such that there are 13x13 sets of anchors. (Each set has k anchors; different YOLO versions use a different k, e.g. k=5 in YOLOv2.) The anchors are spread over the image, to make sure objects in all parts of it are detected.
The anchors can have an arbitrary size and aspect ratio, unrelated to the grid size. If your dataset has mostly large foreground objects, then you should initialise your anchors to be large. YOLO learns better if it only has to make small adjustments to the anchors.
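For reference, the usual YOLOv2/v3 decoding of a predicted offset (tx, ty, tw, th) against a grid cell (cx, cy) and an anchor (pw, ph) looks roughly like this sketch (simplified; exact details vary between versions). The exp() on the width/height terms is what lets a box grow far beyond a single cell:

```python
# Simplified sketch of YOLOv2/v3 box decoding; real implementations vary in detail.
import math

def decode_yolo_box(tx, ty, tw, th, cx, cy, pw, ph, stride=32):
    """(cx, cy): grid-cell indices; (pw, ph): anchor size in pixels;
    stride: input pixels per grid cell (e.g. 416 / 13 = 32)."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    bx = (sigmoid(tx) + cx) * stride    # the center stays inside the assigned cell...
    by = (sigmoid(ty) + cy) * stride
    bw = pw * math.exp(tw)              # ...but width/height are unbounded above,
    bh = ph * math.exp(th)              # so the box can cover many cells
    return bx, by, bw, bh

# An anchor of 150x120 px with tw = th = 0.5 already yields a ~247x198 px box,
# far larger than a single 32x32 px cell.
print(decode_yolo_box(0.2, -0.1, 0.5, 0.5, cx=6, cy=6, pw=150, ph=120))
```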
Each prediction actually uses information from the whole image. Often context from the rest of the image helps the prediction. e.g. black pixels below a vehicle could be either tyres or shadow.
The algorithm doesn't really "know" in which cell the centre of the object is located. But during training we have that information from the ground truth, and we can train it to guess. With enough training, it ends up pretty good at guessing. The way that works is that the anchor closest to the ground truth is assigned to the object. Other anchors are assigned to the other objects or to the background. Anchors assigned to the background are supposed to have a low confidence, while anchors assigned to an object are assessed for the IoU of their bounding boxes. So the training reinforces one anchor to give a high confidence and an accurate bounding box, while the other anchors give a low confidence. The example in your question doesn't include any predictions with low confidence (probably to keep things simple), but in practice there will be many more low-confidence predictions than high-confidence ones.
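Here is a toy sketch (my own, not the actual YOLO training code) of the assignment step described above: the ground truth is matched to the anchor it overlaps most, and only that anchor is trained to predict the object with high confidence:

```python
# Toy sketch of ground-truth-to-anchor assignment by IoU (not the YOLO training code).
import numpy as np

def iou(box_a, box_b):
    """IoU of two [ymin, xmin, ymax, xmax] boxes."""
    ymin = max(box_a[0], box_b[0]); xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2]); xmax = min(box_a[3], box_b[3])
    inter = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_ground_truth(gt_box, anchors):
    """Return the index of the anchor with the highest IoU with the ground truth."""
    ious = np.array([iou(gt_box, a) for a in anchors])
    return int(np.argmax(ious)), ious

# Three toy anchors: only the best-matching one is trained to output high confidence
# and refine its box; the others are treated as background (low confidence).
anchors = [[10, 10, 50, 50], [0, 0, 120, 120], [20, 20, 200, 260]]
gt = [5, 5, 110, 115]
best, ious = assign_ground_truth(gt, anchors)
print(best, ious)   # index 1 wins with IoU ~0.80
```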
OK, this is not my first time seeing this question; I had the same problem, and in fact for all the YOLO 1 & 2 architectures I encountered, nowhere did the network diagrams imply that classification and localisation kick in at the first layer, or the moment the image is fed in. The image passes through a series of convolution layers, filters and pooling.
This implies that at the lower levels of the network the information is seen or represented differently, i.e. it goes from pixels to outlines, shapes, features etc. before the object is correctly classified or localised, just as in any normal CNN.
Since the tensor representing the bounding box predictions and classifications is located towards the end of the network (regression trained with backpropagation), I believe it is more appropriate to say that the network:
1. divides the image into cells (actually the author of the network did this with the training label datasets);
2. for each cell, tries to predict bounding boxes with confidence scores. I believe the convolutions and filters right after the cell division are what allow the network to correctly predict bounding boxes larger than each cell, because they feed on more than one cell at a time if you look at the complete YOLO architecture.
So, to conclude, my take on it is that the network predicts the larger bounding boxes for a cell, and not that each cell does this on its own. The network can be viewed as a normal CNN whose outputs are the classifications plus a number of bounding boxes per cell, and whose sole goal is to apply convolutions and feature maps to detect, classify and localise objects in a single forward pass.
"Forward pass" implies that neighbouring cells in the grid division don't query other cells backwardly/recursively; the prediction of larger bounding boxes comes from the later feature maps and convolutions that are connected to receptive areas covering the earlier cell divisions. Also, the box being centroidal is a function of the training data; if the labels were changed to use the top-left corner, the predictions wouldn't be centroidal.

Tensorflow object detection API: how to add background class samples?

I am using the tensorflow object detection API. I have two classes of interest. In the first trial I got reasonable results, but I found it was easy to get false positives of both classes on pure background images. These background images (i.e., images without any class bbx) had not been included in the training set.
How can I add them to the training set? It doesn't seem to work if I simply add samples without bbx.
Your goal is to add negative images to your training dataset to strengthen the background class (id 0 in the detection API). You can achieve this with the Pascal VOC XML annotation format. For a negative image, the XML file contains only the height and width of the image, without any object. Usually, for positive images, the coordinates, height and width of the object and the object name are in the XML file. If you use labelImg, you can generate an XML file corresponding to your negative image with the verify button. Roboflow can also generate XML files with and without objects.
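For example, here is a minimal sketch (the file names are hypothetical) that writes such a Pascal VOC XML annotation for a background-only image using Python's standard library:

```python
# Minimal sketch (hypothetical file names): a Pascal VOC XML annotation for a
# pure-background image carries only the image size and filename, with no <object>.
import xml.etree.ElementTree as ET

def write_negative_annotation(filename, width, height, depth=3, out_path="negative.xml"):
    ann = ET.Element("annotation")
    ET.SubElement(ann, "folder").text = "images"
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = str(depth)
    # Deliberately no <object> element: this marks the image as background only.
    ET.ElementTree(ann).write(out_path)

write_negative_annotation("background_0001.jpg", width=640, height=480)
```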

TensorFlow: Collecting my own training data set & Using that training dataset to find the location of object

I'm trying to collect my own training data set for image detection (recognition, for now). Right now, I have 4 classes and 750 images for each. Each image is just a regular image of its class; however, some of the images are blurry or contain outside elements, such as a different background or other factors (but nothing distinguishable). Using that training data set, the image recognition is really bad.
My question is,
1. Does the training image set need to contain the object in various backgrounds/settings/environments (I believe not...)?
2. Let's just say training worked fairly accurately and I want to know the location of the object in the image. I figure there is no way I can find the location using image recognition alone, so if I use bounding boxes, how/where in the code can I see the location of the bounding box?
Thank you in advance!
It is difficult to know in advance what features your program will learn for each class. But then again, if your unseen images will have the same background, the background will play no role. I would suggest data augmentation during training: random color distortion, random flipping, random cropping.
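A minimal tf.data-style sketch of those augmentations (my own example; the parameter values are arbitrary and the crop size assumes images of at least 200x200):

```python
# Sketch of random color distortion, flipping and cropping for classification training.
import tensorflow as tf

def augment(image):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_crop(image, size=[200, 200, 3])  # assumes images >= 200x200
    return image

# Example: map the augmentation over a dataset of decoded images.
# dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```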
You can't see in the code where the bounding box is. You have to label/annotate the boxes yourself first in your collected data, using a tool such as LabelMe. Then comes training the object detector.