YOLO object detection model? - tensorflow

Currently, I am reading the YOLO9000 paper (https://arxiv.org/pdf/1612.08242.pdf) and I am quite confused about how the model can predict bounding boxes for object detection. I have done many examples with TensorFlow, and in most of them we give the model images and the labels of those images.
My questions are:
1- How can we pass bounding boxes to the model instead of labels?
2- How can the model learn that many boxes belong to one image?

In YOLO, we divide the image into a 7x7 grid. For each grid location, the network predicts three things:
The probability of an object being present in that grid cell
If an object lies in this grid cell, what are the coordinates of its bounding box?
If an object lies in this grid cell, which class does it belong to?
If we apply regression for all of the above variables over all 49 grid locations, we will be able to tell which grid locations contain objects (using the first parameter). For the grid locations that contain objects, we can tell the bounding box coordinates and the correct class using the second and third parameters.
Once we have designed a network that can output all the information we need, prepare the training data in this format, i.e. find these parameters for every one of the 7x7 grid locations in every image in your dataset. Then you simply train the deep neural network to regress these parameters.
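A rough sketch of how such regression targets could be built for one image is below. This is illustrative NumPy only, not code from any particular implementation; the grid size, class count and the make_target name are assumptions.
```
import numpy as np

S, C = 7, 20  # assumed grid size and number of classes

def make_target(boxes, img_w, img_h):
    """boxes: list of (class_id, x_center, y_center, width, height) in pixels."""
    target = np.zeros((S, S, 5 + C), dtype=np.float32)
    for cls, xc, yc, w, h in boxes:
        col = min(S - 1, int(xc / img_w * S))  # grid cell containing the box center
        row = min(S - 1, int(yc / img_h * S))
        target[row, col, 0] = 1.0                                    # an object is present here
        target[row, col, 1:5] = [xc / img_w, yc / img_h, w / img_w, h / img_h]
        target[row, col, 5 + cls] = 1.0                              # one-hot class label
    return target
```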

To pass the bounding boxes of an image, we need to create them first. You can create bounding boxes for any image using specific annotation tools. You have to draw a boundary that encloses an object and then label that bounding box/rectangle, and you have to do this for every object in the image that you want your model to train on/recognize.
There is one very useful project in this link; you should check it out if you need to understand bounding boxes.
I have just started learning object detection with TensorFlow, so as and when I get proper information on providing bounding boxes to an object detection model, I will update it here. Also, if you have solved this problem by now, you could provide the details to help out others facing the same kind of problem.

1- How can we pass bounding boxes to the model instead of labels?
If we want to train a model that performs object detection (not object classification), we have to pass it the ground-truth labels, for example as .xml files. An .xml file contains information about the objects that exist in an image. Each object is described by 5 values:
the class name of the object, such as car or human...
xmin: x coordinate of the box's top-left point
ymin: y coordinate of the box's top-left point
xmax: x coordinate of the box's bottom-right point
ymax: y coordinate of the box's bottom-right point
One bounding box within an image is specified as a set of 5 values like the above. If there are 3 objects in an image, the .xml file will contain 3 such sets.
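As an illustration, assuming the annotations are Pascal VOC style .xml files, a small reader for those five values per object could look like this (the structure of your own files may differ):
```
import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    """Returns a list of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall('object'):
        name = obj.find('name').text          # class name, e.g. car or human
        box = obj.find('bndbox')
        objects.append((name,
                        int(float(box.find('xmin').text)),
                        int(float(box.find('ymin').text)),
                        int(float(box.find('xmax').text)),
                        int(float(box.find('ymax').text))))
    return objects
```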
2- How can the model learn that many boxes belong to one image?
As you know, the output of YOLOv2 or YOLO9000 has shape (13, 13, D), where D depends on how many classes of object you are going to detect. You can see that there are 13x13 = 169 cells (grid cells) and each cell has D values (depth).
Among the 169 grid cells, some grid cells are responsible for predicting bounding boxes. If the center of a true bounding box falls in a grid cell, that grid cell is responsible for predicting that bounding box when the model is given that image.
I think there must be a function that reads the .xml annotation files and determines which grid cells are responsible for detecting which bounding boxes.
To make the model learn the box positions and shapes, not only the classes, we have to build an appropriate loss function. The loss function used in YOLOv2 also puts a cost on the box shapes and positions, so the loss is calculated as the weighted sum of the following individual loss values (a simplified sketch follows the list):
Loss on the class name
Loss on the box position (x-y coordinates)
Loss on the box shape (box width and height)
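A very simplified sketch of that weighted sum is below (one box per cell, plain sum-of-squares terms, NumPy only); the real YOLOv2 loss additionally handles anchor boxes and uses a different parameterization for the width and height terms, so treat this only as an outline of the idea.
```
import numpy as np

def yolo_like_loss(pred, target, lambda_coord=5.0, lambda_noobj=0.5):
    """pred, target: arrays of shape (S, S, 5 + C) laid out as
    [objectness, x, y, w, h, class probabilities ...]."""
    obj = target[..., 0]          # 1 where a cell is responsible for a box
    noobj = 1.0 - obj

    loss_conf = np.sum(obj * (pred[..., 0] - target[..., 0]) ** 2) \
              + lambda_noobj * np.sum(noobj * (pred[..., 0] - target[..., 0]) ** 2)
    loss_xy = lambda_coord * np.sum(obj * np.sum((pred[..., 1:3] - target[..., 1:3]) ** 2, axis=-1))
    loss_wh = lambda_coord * np.sum(obj * np.sum((pred[..., 3:5] - target[..., 3:5]) ** 2, axis=-1))
    loss_cls = np.sum(obj * np.sum((pred[..., 5:] - target[..., 5:]) ** 2, axis=-1))
    return loss_conf + loss_xy + loss_wh + loss_cls
```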
SIDE NOTE:
Actually, one grid cell can detect up to B boxes, where B depends on the implementation of YOLOv2. I used darkflow to train YOLOv2 on my custom training data, in which B was 5. So the model can detect 169*B boxes in total, and the loss is the sum of 169*B small losses.
D = B*(5+C), where C is the number of classes you want to detect.
Before being passed to the model, the box shapes and positions are converted into values relative to the image size.
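That conversion amounts to dividing by the image dimensions; a tiny sketch with made-up names:
```
def to_relative(x_center, y_center, box_w, box_h, img_w, img_h):
    """Absolute pixel box -> values in [0, 1] relative to the image size."""
    return x_center / img_w, y_center / img_h, box_w / img_w, box_h / img_h
```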

Related

How to calculate Intersection over Union (IoU) for bounding boxes in case of multilabeling in TensorFlow 2.x?

How can I calculate the IoU metric for bounding boxes with multilabel bounding boxes? I.e., in my image
I can have more than one bounding box with different classes.
For example: one bounding box for a person, one for a car and another for a bird in the same image.
I found a direct implementation here in TensorFlow Addons:
https://www.tensorflow.org/addons/api_docs/python/tfa/losses/GIoULoss
And here a manual implementation:
https://pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
But they are all for a single label.
Is it as simple as identifying each bounding box's class, calculating the IoU separately and then taking the mean?
Or do I need something else?
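For reference, a small sketch of the approach the question describes (plain IoU per box, matched by class, then averaged) could look like the following. Whether a per-class mean is the right metric depends on your evaluation needs, and this is not taken from either of the linked implementations.
```
def iou(box_a, box_b):
    """Boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def mean_iou_over_classes(pred_boxes, true_boxes):
    """pred_boxes, true_boxes: dicts mapping class name -> box (one box per class)."""
    scores = [iou(pred_boxes[c], true_boxes[c]) for c in true_boxes if c in pred_boxes]
    return sum(scores) / len(scores) if scores else 0.0
```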

Tensorflow output labels as a value in the 2D grid, or locating it in the grid

My final output should be a 2D grid that contains a value for each grid point. Is there a way to implement this in TensorFlow, where I can input a number of images and each image corresponds to a specific point in a 2D grid? I want my model to be such that when I input a similar image, it detects that specific grid location in the 2D output. I mean that each input image belongs to a specific area in the output image (which I divided into a grid for simplicity, to make it a finite number of locations).

Can YOLO pictures have a bounding box that covers the whole picture?

I wonder why YOLO pictures need to have a bounding box.
Assume that we are using Darknet. Each image needs to have a corresponding .txt file with the same name as the image file. Inside the .txt file, each line needs to be in the format below. It's the same for all YOLO frameworks that use bounding boxes for labeling.
<object-class> <x> <y> <width> <height>
Where x, y, width, and height are relative to the image's width and height.
For example, if we go to this page, press the YOLO Darknet TXT button, download the .zip file and then go to the train folder, we can see these files:
IMG_0074_jpg.rf.64efe06bcd723dc66b0d071bfb47948a.jpg
IMG_0074_jpg.rf.64efe06bcd723dc66b0d071bfb47948a.txt
Where the .txt file looks like this
0 0.7055288461538461 0.6538461538461539 0.11658653846153846 0.4110576923076923
1 0.5913461538461539 0.3545673076923077 0.17307692307692307 0.6538461538461539
Every image has the size 416x416.
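To make those numbers concrete, converting the first example line back to pixel coordinates on a 416x416 image is just a multiplication (a small sketch):
```
img_w = img_h = 416
cls, x, y, w, h = 0, 0.7055288461538461, 0.6538461538461539, 0.11658653846153846, 0.4110576923076923

# center and size in pixels: roughly (293.5, 272.0) and 48.5 x 171.0
center_x, center_y = x * img_w, y * img_h
box_w, box_h = w * img_w, h * img_h
```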
My idea is that every image should have one class. Only one class. And the image should be taken with a camera.
The camera snap should be processed as follows:
Take the camera snap
Cut the camera snap into the desired size
Upscale it to a 416x416 square
And then the .txt file that corresponds to every image should look like this:
<object-class> 0 0 1 1
Question
Is this possible for e.g. Darknet or other frameworks that use bounding boxes to label the classes?
Instead of letting the software, e.g. Darknet, scale the bounding boxes to 416x416 for every class object, I would do it myself and set the .txt file to x = 0, y = 0, width = 1, height = 1 for every image that has only one class object.
Is it possible for me to create a training set in that way and train with it?
A little disclaimer: I have to say that I am not an expert on this. I am part of a project and we are using Darknet, so I have had some time to experiment.
So, if I understand it right, you want to train with cropped single-class images with full-image-sized bounding boxes.
It is possible to do, and I am using something like that, but it is most likely not what you want.
Let me tell you about the problems and unexpected behaviour this method creates.
When you train with images that have full-image-size bounding boxes, YOLO cannot make proper detections, because while training it also learns the backgrounds and empty spaces of your dataset. More specifically, the objects in your training dataset have to be in the same context as in your real-life usage. If you train it with dog images in the jungle, it won't do a good job of predicting dogs in a house.
If you are only going to use it for classification, you can still train it like this and it will still classify fine, but the images you are going to run predictions on should also be like your training dataset. So, looking at your example, if you train on images like this cropped dog picture, your model won't be able to classify the dog in the first image.
For a better example: in my case detection wasn't required. I am working with food images and I only predict the meal on the plate, so I trained with full-image-sized bboxes, since every food image has one class. It classifies the food perfectly, but the bboxes are always predicted as the full image.
So my understanding of the theory part of this: if you feed the network only full-image bboxes, it learns that making the box as big as possible results in a lower error rate, so it optimizes that way. This kind of wastes half of the algorithm, but it works for me.
Also, your images don't need to be 416x416; it resizes whatever size you give it to that, and you can also change this in the cfg file.
I have some code that makes full-sized bboxes for all images in a directory if you want to try it quickly. (It overrides existing annotations, so be careful.)
Finally, the boxes should be like this for them to be centered and full size; x and y are the center of the bbox, so they should be the center (half) of the image:
<object-class> 0.5 0.5 1 1
```
from imagepreprocessing.darknet_functions import create_training_data_yolo, auto_annotation_by_random_points
import os

main_dir = "datasets/my_dataset"

# auto annotating all images by their center points (x,y,w,h)
folders = sorted(os.listdir(main_dir))
for index, folder in enumerate(folders):
    auto_annotation_by_random_points(os.path.join(main_dir, folder), index, annotation_points=((0.5,0.5), (0.5,0.5), (1.0,1.0), (1.0,1.0)))

# creating required files
create_training_data_yolo(main_dir)
```

Tensorflow object detection API - Setting specific color to bounding boxes

I am trying to detect 3 different classes of objects in images using Tensorflow Object Detection. I would like to set the bounding box color for each class to a custom color of my choice in order to suit my application.
For example,
Class 1: Red
Class 2: Blue
Class 3: Green
Unfortunately, Tensorflow object detection sets these colors automatically and I do not know how to change them.
I would be very grateful for any suggestions and help.
You can achieve this by passing track_ids to the function visualize_boxes_and_labels_on_image_array.
Notice that when detection is performed, this plot function is called to visualize the bounding boxes on the images.
Here is how to get the variable track_ids. First, look at the STANDARD_COLORS list and get the index of the color you want to draw boxes with. For example, the 'Red' color's index is 98. Then loop through the variable output_dict['detection_classes'] (this variable is also passed to the plot function), and whenever you encounter class 1, append 98 to track_ids. By doing this you will create a list of color indexes as track_ids; transform it into a numpy array and pass it to the plot function together with everything else, and you should have all classes plotted in the colors you assigned.
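A sketch of what that could look like is below. It assumes the usual output_dict, image_np and category_index variables from the Object Detection API tutorial, and note that the exact mapping from a track id to a color (and whether the id is also printed in the label) differs between versions of visualization_utils, so check the STANDARD_COLORS handling in your installed copy.
```
import numpy as np
from object_detection.utils import visualization_utils as vis_util

# assumed mapping: detection class id -> index of the wanted color in STANDARD_COLORS;
# plain 'Blue' is not in every version of the list, so a close color name is used
class_to_color_index = {
    1: vis_util.STANDARD_COLORS.index('Red'),
    2: vis_util.STANDARD_COLORS.index('RoyalBlue'),
    3: vis_util.STANDARD_COLORS.index('Green'),
}

track_ids = np.array([class_to_color_index[c]
                      for c in output_dict['detection_classes']])

vis_util.visualize_boxes_and_labels_on_image_array(
    image_np,
    output_dict['detection_boxes'],
    output_dict['detection_classes'],
    output_dict['detection_scores'],
    category_index,
    track_ids=track_ids,
    use_normalized_coordinates=True,
    line_thickness=4)
```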

Get the location of object to crop by providing pixel label in tensorflow

I have a dataset of images (every image is in RGB format) and corresponding label images (which contain the label of every pixel in the image).
I need to extract the objects (pixels) of a particular class from the original images.
First I have to find the location of the object using the label image (by providing the label of the given object). This is doable using explicit for loops, but I don't want to use explicit for loops.
Now my questions:
Is there any built-in function in TensorFlow that gives me the location (rectangles are fine) of a given object (if I provide the label of that object)?
After that I can use tf.image.crop_and_resize to crop the image, but I am not able to find any function that will give me the location of the objects.
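I am not aware of a single built-in call that does exactly this, but a sketch without explicit Python loops, using tf.where to get the extent of the labelled pixels and then tf.image.crop_and_resize, could look like this (TF 2.x; in TF 1.x the argument is named box_ind instead of box_indices). The function name and crop size are just placeholders.
```
import tensorflow as tf

def crop_class_region(image, label_img, class_id, crop_size=(128, 128)):
    """image: [H, W, 3] float tensor; label_img: [H, W] integer tensor of pixel labels."""
    coords = tf.where(tf.equal(label_img, class_id))      # [N, 2] (row, col) indices of that class
    y_min = tf.reduce_min(coords[:, 0])
    x_min = tf.reduce_min(coords[:, 1])
    y_max = tf.reduce_max(coords[:, 0])
    x_max = tf.reduce_max(coords[:, 1])

    hw = tf.cast(tf.shape(image)[:2], tf.float32)         # image height and width
    box = tf.cast(tf.stack([y_min, x_min, y_max, x_max]), tf.float32)
    box = box / tf.stack([hw[0], hw[1], hw[0], hw[1]])    # normalized [y1, x1, y2, x2]

    crop = tf.image.crop_and_resize(image[tf.newaxis, ...],
                                    boxes=box[tf.newaxis, :],
                                    box_indices=[0],
                                    crop_size=crop_size)
    return crop[0]
```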