How to Calculate Intersection over Union (IoU) for Bounding Boxes in the Case of Multilabeling in TensorFlow 2.x? - tensorflow

How can I calculate the IoU metric for bounding boxes in a multilabel setting? i.e. in my image
I can have more than one bounding box with different classes.
For example: one bounding box for a person, one for a car and another for a bird in the same image.
I found a direct implementation in TensorFlow Addons here:
https://www.tensorflow.org/addons/api_docs/python/tfa/losses/GIoULoss
And here a manual implementation:
https://pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
But all of them are for a single label.
Is it as simple as identifying each bounding box's class, calculating the IoU for each one separately, and then taking the mean?
Or do I need something else?
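For reference, a minimal sketch of the approach described in the question (compute the IoU per class and then average), assuming one predicted box and one ground-truth box per class, both in (xmin, ymin, xmax, ymax) format; this only illustrates the idea, it is not a confirmed answer:

# Minimal sketch: per-class IoU, then the mean over classes. Assumes one
# predicted and one ground-truth box per class, in (xmin, ymin, xmax, ymax).
def iou(box_a, box_b):
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def mean_iou_per_class(pred_boxes, gt_boxes):
    # pred_boxes and gt_boxes are dicts: {class_name: (xmin, ymin, xmax, ymax)}
    ious = [iou(pred_boxes[c], gt_boxes[c]) for c in gt_boxes if c in pred_boxes]
    return sum(ious) / len(ious)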

Related

How to sort bounding boxes from left to right?

I am using the TensorFlow Object Detection API to detect text in images. It gives me a good result with very good accuracy. How can I sort the bounding boxes from left to right so as to make the text readable?
A simple trick would be to sort the output bounding boxes according to their xmin coordinates. For the y coordinate, take the mean of ymin and ymax and group boxes whose means lie within a small tolerance of each other into the same list. This way, you get ordered columns and rows.
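A rough sketch of that trick, assuming boxes in (xmin, ymin, xmax, ymax) format and an illustrative pixel tolerance:

# Sketch: group boxes into rows by their vertical centre, then sort each row
# by xmin. The box format and the tolerance value are assumptions.
def sort_boxes_reading_order(boxes, tolerance=10):
    rows = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2.0):
        y_center = (box[1] + box[3]) / 2.0
        if rows and abs(y_center - rows[-1][0]) <= tolerance:
            rows[-1][1].append(box)          # same row as the previous box
        else:
            rows.append([y_center, [box]])   # start a new row
    # sort each row left to right and flatten row by row
    return [b for _, row in rows for b in sorted(row, key=lambda b: b[0])]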

How to recognize two different objects with the similar shape, but different size

I am using the Mask-RCNN neural network. I retrained my network to detect and mask the wheels of die-cast toy cars. I am using images that show the side of the car (left or right).
Sometimes the cars have wheels of different sizes, as in the image below. The front wheels are much smaller than the rear ones. I want to detect the front ones as "front wheels" and the rear ones as "rear wheels". If there is only one wheel in the photo (caused by bad cropping), then I want to detect that wheel just as a "wheel".
What should I do to mask 2 wheels (and assign proper labels to them) if the image contains two wheels that look similar?
Car image
The output of Mask-RCNN provides bounding boxes. Why not use those bounding boxes to infer the size of each tire and compare the tire sizes? Then you can label them as front or rear based on their area.
# Run detection
results = model.detect([image])
# Visualize results
r = results[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'],
class_names, r['scores'])
You can use the r['rois'] for computation of the areas of tires as following:
y1, x1, y2, x2 = rois[i]
area[i] = (y2 - y1) * (x2 - x1)  # box height times width
Then the only thing left is to decide which area is larger.
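A small sketch of that comparison (the label names and the two-wheel assumption are illustrative):

# Sketch: with exactly two detected wheels, label the larger box as the rear
# wheel. Assumes rois holds boxes as (y1, x1, y2, x2), as returned in r['rois'].
def label_wheels(rois):
    areas = [(y2 - y1) * (x2 - x1) for y1, x1, y2, x2 in rois]
    labels = ['wheel'] * len(rois)
    if len(rois) == 2:
        rear = 0 if areas[0] > areas[1] else 1
        labels[rear] = 'rear wheel'
        labels[1 - rear] = 'front wheel'
    return labels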
Mask-RCNN segments each instance of an object separately, irrespective of the size of the object. It does not classify objects based on perspective; it will classify both wheels simply as wheels.
If you train the model with two classes, such as front wheel and rear wheel, it will work fine while that condition holds, but when the wheels are the same size it won't produce the expected output.
You can train two different models for different sets of cars: if the car has wheels of different sizes, the wheels are classified as front and rear wheels in the next module. This segmentation and classification rely on the logic that the rear wheel of such a car is always bigger than the front wheels, never smaller. But if the car is not classified into that category, then the wheels are not segmented by size, in which case the model classifies them only as wheels.

Tensorflow object detection API - Setting specific color to bounding boxes

I am trying to detect 3 different classes of objects in images using Tensorflow Object Detection. I would like to set the bounding box color for each class to a custom color of my choice in order to suit my application.
For example,
Class 1: Red
Class 2: Blue
Class 3: Green
Unfortunately, TensorFlow object detection sets these colors automatically and I do not know how to change them.
I would be very grateful for any suggestions and help.
You can achieve this by passing track_ids to the function visualize_boxes_and_labels_on_image_array.
Notice that when detection is performed, this plot function is called to visualize the bounding boxes on the images.
Here is how to get the variable track_ids. First, look at the STANDARD_COLORS list and get the index of the color you want to draw boxes with; for example, the index of the color 'Red' is 98. Then loop through the variable output_dict['detection_classes'] (this variable is also passed to the plot function), and whenever you encounter class 1, append 98 to track_ids. This way you build a list of color indexes as track_ids; transform it into a numpy array and pass it into the plot function, and all classes should be plotted in the colors you assigned.
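A sketch following that recipe; it assumes the Object Detection API's visualization_utils module (which defines STANDARD_COLORS and, in recent versions, accepts a track_ids argument), and that image_np, output_dict and category_index already exist from the detection step. Depending on the API version, the colour looked up from a track id may not map one-to-one to the STANDARD_COLORS index.

import numpy as np
from object_detection.utils import visualization_utils as vis_util

# Illustrative mapping: class 1 -> 'Red', class 2 -> 'Blue', class 3 -> 'Green'
color_index = {1: vis_util.STANDARD_COLORS.index('Red'),
               2: vis_util.STANDARD_COLORS.index('Blue'),
               3: vis_util.STANDARD_COLORS.index('Green')}
# one colour index per detected box, in the same order as the detections
track_ids = np.array([color_index[c] for c in output_dict['detection_classes']])

vis_util.visualize_boxes_and_labels_on_image_array(
    image_np,
    output_dict['detection_boxes'],
    output_dict['detection_classes'],
    output_dict['detection_scores'],
    category_index,
    use_normalized_coordinates=True,
    track_ids=track_ids)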

Anchor boxes and offsets in SSD object detection

How do you calculate anchor box offsets for object detection in SSD? As far as I understand, anchor boxes are the boxes in the 8x8 feature map, the 4x4 feature map, or any other feature map at the output layer.
So what are the offsets?
Is it the distance between the centre of the bounding box and the centre of a particular box in say a 4x4 feature map?
If I am using a 4x4 feature map as my output, then my output should be of the dimension:
(4x4, n_classes + 4)
where 4 is for my anchor box co-ordinates.
These 4 co-ordinates can be something like:
(xmin, xmax, ymin, ymax)
This will correspond to the top-left and bottom-right corners of the bounding box.
So why do we need offsets and if so how do we calculate them?
Any help would be really appreciated!
We need offsets because the offsets are what the network actually regresses relative to the default anchor boxes. In SSD, every feature map cell has a predefined number of anchor boxes of different scales and aspect ratios; I think in the paper this number is 6.
Now, because this is a detection problem, we also have ground-truth bounding boxes. Roughly speaking, we compare the IoU of each anchor box with the ground-truth box, and if it is greater than a threshold, say 0.5, we predict the box offsets relative to that anchor box.
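For concreteness, a sketch of the usual centre-size offset encoding used by SSD-style detectors (the variance scaling factors that many implementations apply are omitted here):

import math

# Sketch of the common SSD-style offset encoding (variance terms omitted).
# Boxes are (cx, cy, w, h): centre coordinates, width and height.
def encode_offsets(gt_box, anchor):
    gcx, gcy, gw, gh = gt_box
    acx, acy, aw, ah = anchor
    tx = (gcx - acx) / aw     # shift of the anchor centre in x, in anchor widths
    ty = (gcy - acy) / ah     # shift of the anchor centre in y, in anchor heights
    tw = math.log(gw / aw)    # log scale factor for the width
    th = math.log(gh / ah)    # log scale factor for the height
    return tx, ty, tw, th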

YOLO object detection model?

Currently, I am reading the YOLO9000 paper "https://arxiv.org/pdf/1612.08242.pdf" and I am very confused about how the model can predict the bounding boxes for object detection. I did many examples with TensorFlow, and in most of them we give the model images and the labels of the images.
My questions are:
1- How can we pass the bounding boxes instead of labels to the model?
2- How can the model learn that many boxes belong to one image?
In YOLO, we divide the image into a 7x7 grid. For each grid location, the network predicts three things -
Probability of an object being present in that grid
If an object lies in this grid, what are the co-ordinates of the bounding box?
If an object lies in this grid, which class does it belong to?
If we apply regression for all the above variables for all 49 grid locations, we will be able to tell which grid locations have objects (using the first parameter). For the grid locations that have objects, we can tell the bounding box co-ordinates and the correct class using the second and third parameters.
Once we have designed a network that can output all the information we need, prepare the training data in this format, i.e. find these parameters for every 7x7 grid location in every image in your dataset. Next, you simply train the deep neural network to regress these parameters.
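A minimal sketch of what preparing one such training target can look like; the single-box-per-cell layout and the (cx, cy, w, h) box format relative to the image size are simplifications, not the exact YOLO encoding:

import numpy as np

# Sketch: encode one ground-truth box into a 7x7 target grid (simplified to one
# box per cell). The box is (cx, cy, w, h), relative to the image size.
def encode_target(box, class_id, n_classes, grid=7):
    target = np.zeros((grid, grid, 5 + n_classes), dtype=np.float32)
    cx, cy, w, h = box
    col = min(int(cx * grid), grid - 1)   # grid column containing the box centre
    row = min(int(cy * grid), grid - 1)   # grid row containing the box centre
    target[row, col, 0] = 1.0             # objectness: an object is present here
    target[row, col, 1:5] = [cx, cy, w, h]
    target[row, col, 5 + class_id] = 1.0  # one-hot class label
    return target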
To pass the bounding boxes of an image, we need to create them first. You can create bounding boxes for any image using specific tools. Here, you have to draw a boundary that encloses each object and then label that bounding box/rectangle. You have to do this for every object in the image that you want your model to train on/recognize.
There is one very useful project in this link; you should check it out if you need to understand bounding boxes.
I have just started learning object detection with TensorFlow, so as and when I get proper info on providing bounding boxes to the object detection model, I'll also update it here. Also, if you have solved this problem by now, you could provide the details to help out others facing the same kind of problem.
1- How can we pass the bounding boxes instead of labels to the model?
If we want to train a model that performs object detection (not object classification), we have to pass the ground-truth labels as, for example, .xml files. An xml file contains information about the objects that exist in an image. The information about each object is composed of 5 values:
class name of this object, such as car or human...
xmin: x coordinate of the box's top left point
ymin: y coordinate of the box's top left point
xmax: x coordinate of the box's bottom right point
ymax: y coordinate of the box's bottom right point
One bounding box within an image is specified as a set of 5 values like the above. If there are 3 objects in an image, the xml file will contain 3 such sets of values.
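A short sketch of reading those five values with Python's standard library, assuming Pascal-VOC-style annotation files (the tag names used below are the VOC ones):

import xml.etree.ElementTree as ET

# Sketch: read (class name, xmin, ymin, xmax, ymax) for every object in a
# Pascal-VOC-style annotation file.
def read_annotation(xml_path):
    objects = []
    root = ET.parse(xml_path).getroot()
    for obj in root.findall('object'):
        name = obj.find('name').text
        bndbox = obj.find('bndbox')
        box = [int(float(bndbox.find(tag).text))
               for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        objects.append((name, *box))
    return objects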
2- How can the model learn that many boxes belong to one image?
As you know, the output of YOLOv2 or YOLO9000 has shape (13, 13, D), where D depends on how many classes of objects you're going to detect. You can see that there are 13x13 = 169 cells (grid cells) and each cell has D values (depth).
Among the 169 grid cells, some grid cells are responsible for predicting bounding boxes. If the center of a true bounding box falls in a grid cell, that grid cell is responsible for predicting that bounding box when the model is given the same image.
I think there must be a function that reads the xml annotation files and determines which grid cells are responsible for detecting which bounding boxes.
To make the model learn the box positions and shapes, not only the classes, we have to build an appropriate loss function. The loss function used in YOLOv2 also puts a cost on the box shapes and positions, so the loss is calculated as the weighted sum of the following individual loss values:
Loss on the class name
Loss on the box position (x-y coordinates)
Loss on the box shape (box width and height)
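Schematically, the combination can be written as below; the weights and the three partial losses are placeholders, not the exact YOLOv2 formulation:

# Schematic only: total loss as a weighted sum of the three parts listed above.
# The default weights here are illustrative, not the exact YOLOv2 values.
def total_loss(class_loss, position_loss, shape_loss,
               w_class=1.0, w_coord=5.0, w_shape=5.0):
    return (w_class * class_loss        # loss on the class name
            + w_coord * position_loss   # loss on the x-y coordinates
            + w_shape * shape_loss)     # loss on the box width and height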
SIDE NOTE:
Actually, one grid cell can detect up to B boxes, where B depends on the implementation of YOLOv2. I used darkflow to train YOLOv2 on my custom training data, in which B was 5. So the model can detect 169*B boxes in total, and the loss is the sum of 169*B small losses.
D = B*(5+C), where C is the number of classes you want to detect.
Before being passed to the model, the box shapes and positions are converted into values relative to the image size.
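For illustration, the relative conversion and the depth D mentioned above could look like this (the numbers chosen for B and C are just an example):

# Sketch: convert an absolute box to values relative to the image size, and
# compute the output depth D = B * (5 + C).
def to_relative(xmin, ymin, xmax, ymax, img_w, img_h):
    return xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h

B, C = 5, 20        # e.g. 5 boxes per grid cell and 20 classes
D = B * (5 + C)     # depth of the (13, 13, D) output tensor -> 125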