Anchor boxes and offsets in SSD object detection - tensorflow

How do you calculate anchor box offsets for object detection in SSD? As far as I understand, anchor boxes are the boxes in the 8x8 feature map, the 4x4 feature map, or any other feature map in the output layer.
So what are the offsets?
Is it the distance between the centre of the bounding box and the centre of a particular box in say a 4x4 feature map?
If I am using a 4x4 feature map as my output, then my output should be of the dimension:
(4x4, n_classes + 4)
where 4 is for my anchor box co-ordinates.
These 4 co-ordinates can be something like:
(xmin, ymin, xmax, ymax)
This will correspond to the top-left and bottom-right corners of the bounding box.
So why do we need offsets and if so how do we calculate them?
Any help would be really appreciated!

We need offsets because that is what the network actually regresses: it does not predict box coordinates directly, it predicts corrections to predefined default (anchor) boxes. In SSD, every feature map cell has a predefined number of anchor boxes of different scales and aspect ratios; in the paper this number is up to 6 per cell.
Because this is a detection problem, we also have ground-truth bounding boxes. Roughly, we compute the IOU of each anchor box with the ground-truth box, and if it is greater than a threshold, say 0.5, the network is trained to predict the offsets from that anchor box to the ground-truth box.
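The offsets are usually not raw coordinates but an encoding relative to the matched anchor. Below is a minimal sketch of the common SSD-style encoding (center offsets normalized by the anchor size, log-scaled width/height ratios); the function name is mine, and the variance scaling factors that some implementations apply are left out.

import numpy as np

def encode_offsets(anchor, gt_box):
    """Encode a ground-truth box as offsets relative to a matched anchor.

    Both boxes are given as (xmin, ymin, xmax, ymax).
    Returns (dx, dy, dw, dh): the regression targets the network learns.
    """
    # Convert corner format to center/size format.
    ax, ay = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    gx, gy = (gt_box[0] + gt_box[2]) / 2, (gt_box[1] + gt_box[3]) / 2
    gw, gh = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]

    # Center offsets are normalized by the anchor size,
    # width/height are encoded as log ratios.
    dx = (gx - ax) / aw
    dy = (gy - ay) / ah
    dw = np.log(gw / aw)
    dh = np.log(gh / ah)
    return dx, dy, dw, dh

At inference time the predictions are decoded by inverting these formulas, e.g. gx = dx * aw + ax and gw = aw * exp(dw).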

Related

Line Profile Diagonal

When you make a line profile of all x-values or all y-values the extraction from each pixel is clear. But when you take a line profile along a diagonal, how does DM choose which pixels to use in the one dimensional readout?
Not really a scripting question, but I'm rather certain that it uses bi-linear interpolation between the grid-points along the drawn line. (And if perpendicular integration is enabled, it does so in an integral.) It's the same interpolation you would get for a "rotate" image.
In fact, you can think of it as a rotate-image (bi-linearly interpolated) with a 'cut-out' afterwards, potentially summed/projected onto the new X-axis.
Here is an example
Assume we have a 5 x 4 image, which gives the grid as shown below.
I'm drawing top-left corners to indicate the pixel coordinate system convention used in DigitalMicrograph, where
(x/y)=(0/0) is the top-left corner of the image
Now extract a LineProfile from (1/1) to (4/3). I have highlighted the pixels for those coordinates.
Note that a line drawn from the corners seems to be shifted by half a pixel from what feels 'natural', but that is the consequence of the top-left-corner convention. I think this is why a LineProfile marker is shown shifted compared to, e.g., LineAnnotations.
In general, this top-left corner convention makes schematics with 'pixels' seem counter-intuitive. It is easier to think of the image simply as a grid with values at points at the given coordinates than as square pixels.
Now the maths.
The exact profile has a length of sqrt(dX^2 + dY^2) = sqrt(3^2 + 2^2) = sqrt(13) ≈ 3.61 pixels.
As we can only have profiles with an integer number of channels, we actually extract a LineProfile of length = 4, i.e. we round up.
The angle of the profile is given by the arc-tangent of dY/dX, here atan(2/3) ≈ 33.7°.
So to extract the profile, we 'rotate' the grid by that angle - done by bilinear interpolation - and then extract the profile as a grid of size 4 x 1:
This means the 'values' in the profile are from the four points:
Which are each bi-linearly interpolated values from four closest points of the original image:
In case the LineProfile is averaged over a certain width W, you do the same thing but:
extract a 2D grid of size L x W centered symmetrically over the line, i.e. the grid is shifted by (W-1)/2 perpendicular to the profile direction
sum the values along W
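The following is only a rough numpy sketch of the basic (width 1) extraction described above, not DigitalMicrograph script; the helper names and the exact sample spacing are my assumptions and may differ from what DM actually does.

import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly interpolate the image value at floating-point position (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, img.shape[1] - 1), min(y0 + 1, img.shape[0] - 1)
    fx, fy = x - x0, y - y0
    return (img[y0, x0] * (1 - fx) * (1 - fy) + img[y0, x1] * fx * (1 - fy)
            + img[y1, x0] * (1 - fx) * fy + img[y1, x1] * fx * fy)

def line_profile(img, p0, p1):
    """Extract a 1D profile from p0=(x0, y0) to p1=(x1, y1)."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    length = np.hypot(dx, dy)                 # e.g. sqrt(13) ≈ 3.61 for (1,1)->(4,3)
    n = int(np.ceil(length))                  # rounded up to 4 channels
    ux, uy = dx / length, dy / length         # unit step along the line direction
    return np.array([bilinear_sample(img, p0[0] + i * ux, p0[1] + i * uy)
                     for i in range(n)])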

How to sort bounding boxes from left to right?

I am using the tensorflow object detection API to detect text on images. It gives me a good result with very good accuracy. How can I sort the bounding boxes from left to right so as to put the text in a readable form?
A simple trick would be to sort the output bounding boxes according to their xmin coordinates. For the y coordinate, just take the mean of ymin and ymax and group all boxes whose means agree within a small tolerance into one list. This way, you get ordered rows and columns.
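A minimal sketch of that idea (the box format (xmin, ymin, xmax, ymax) and the tolerance value are assumptions):

def sort_boxes(boxes, tol=10):
    """Sort boxes (xmin, ymin, xmax, ymax) top-to-bottom into rows, then left-to-right.

    Boxes whose vertical centers are within `tol` pixels are put in the same row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        y_center = (box[1] + box[3]) / 2
        # Start a new row unless this box is vertically close to the previous row.
        if rows and abs(y_center - rows[-1]["y"]) < tol:
            rows[-1]["boxes"].append(box)
        else:
            rows.append({"y": y_center, "boxes": [box]})
    # Within each row, order the boxes by their xmin coordinate.
    return [sorted(row["boxes"], key=lambda b: b[0]) for row in rows]

Reading the boxes row by row in this order then gives roughly the natural reading order of the text.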

I want to know the size of bounding box in object-detection api

I have used the API
(https://github.com/tensorflow/models/tree/master/object_detection)
And then,
How would I know the length of a bounding box?
I have used the tutorial IPython notebook on GitHub in real time.
But I don't know which command to use to calculate the length of the boxes.
Just to extend Beta's answer:
You can get the predicted bounding boxes from the detection graph. An example for this is given in the tutorial IPython notebook on GitHub. This is where Beta's code snippet comes from. Access the detection_graph and extract the coordinates of the predicted bounding boxes from the tensor:
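The relevant snippet looks roughly like this (sess, detection_graph, image_tensor and image_np_expanded are set up earlier in the notebook):

boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
scores = detection_graph.get_tensor_by_name('detection_scores:0')
classes = detection_graph.get_tensor_by_name('detection_classes:0')
(boxes, scores, classes) = sess.run(
    [boxes, scores, classes],
    feed_dict={image_tensor: image_np_expanded})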
By calling np.squeeze(boxes) you reshape them to (m, 4), where m denotes the number of predicted boxes. You can now access the boxes and compute the length, area or whatever you want.
But remember that the predicted box coordinates are normalized! They are in the following order:
[ymin, xmin, ymax, xmax]
So computing the length in pixel would be something like:
def length_of_bounding_box(bbox):
    return bbox[3]*IMG_WIDTH - bbox[1]*IMG_WIDTH
I wrote a full answer on how to find the bounding box coordinates here and thought it might be useful to someone on this thread too.
Google Object Detection API returns bounding boxes in the format [ymin, xmin, ymax, xmax] and in normalised form (full explanation here). To find the (x,y) pixel coordinates we need to multiply the results by width and height of the image. First get the width and height of your image:
width, height = image.size
Then, extract ymin,xmin,ymax,xmax from the boxes object and multiply to get the (x,y) coordinates:
ymin = boxes[0][i][0]*height
xmin = boxes[0][i][1]*width
ymax = boxes[0][i][2]*height
xmax = boxes[0][i][3]*width
Finally print the coordinates of the box corners:
print('Top left')
print(xmin, ymin)
print('Bottom right')
print(xmax, ymax)
You can get boxes like the following:
boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
similarly for scores, and classes.
Then just call them in session run.
(boxes, scores, classes) = sess.run(
    [boxes, scores, classes],
    feed_dict={image_tensor: imageFile})
Basically, you can get all of those from the graph:
image_tensor = graph.get_tensor_by_name('image_tensor:0')
boxes = graph.get_tensor_by_name('detection_boxes:0')
scores = graph.get_tensor_by_name('detection_scores:0')
classes = graph.get_tensor_by_name('detection_classes:0')
num_detections = graph.get_tensor_by_name('num_detections:0')
and boxes[0] contains all predicted bounding box coordinates in the format [ymin, xmin, ymax, xmax], normalized to [0, 1], which is what you are looking for.
Check out this repo and you may find more details:
https://github.com/KleinYuan/tf-object-detection
The code that recognizes objects and returns the location and confidence information is the following:
(boxes, scores, classes, num_detections) = sess.run(
    [boxes, scores, classes, num_detections],
    feed_dict={image_tensor: image_np_expanded})
To iterate through the boxes:
for i,b in enumerate(boxes[0]):
To get width and height:
width = boxes[0][i][3] - boxes[0][i][1]   # xmax - xmin, still normalized
height = boxes[0][i][2] - boxes[0][i][0]  # ymax - ymin, still normalized
You can find more details here: https://pythonprogramming.net/detecting-distances-self-driving-car/

Is the IOU in Tensorflow Object Detection API wrong?

I just dug a bit through the Tensorflow Object Detection API code, especially the eval_util part, as I wanted to implement the COCO metrics.
But I noticed that the metrics are solely calculated using the bounding boxes which have normalized coordinates between [0, 1].
There are no aspect ratios or absolute coordinates used.
So, doesn't this mean that the intersection over unions calculated on these results are incorrect?
Let's take a 200x100 pixel image as an example.
If the box is off by 20px to the left, that's 0.1 in normalized coordinates.
But if it is off by 20px towards the top, that's 0.2 in normalized coordinates.
Doesn't that mean that being off towards the top penalizes the score harder than being off to the side?
I believe the predicted coordinates are resized to the absolute image coordinates in the eval binary.
But the other thing I would say is that IOU is scale invariant in the sense that if you scale two boxes by some factor, they will still have the same IOU overlap. As an example if we scale by 2 in the x-direction and scale by 3 in the y direction:
If A is (x1, y1, x2, y2) and B is (u1, v1, u2, v2), then
IOU(A, B) = IOU((2*x1, 3*y1, 2*x2, 3*y2), (2*u1, 3*v1, 2*u2, 3*v2))
What this means is that evaluating in normalized coordinates should give the same result as evaluating in absolute coordinates.
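A quick numerical sanity check of that claim (plain Python; boxes as (xmin, ymin, xmax, ymax)):

def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def scale(box, sx, sy):
    """Scale a box by independent factors along x and y."""
    return (box[0] * sx, box[1] * sy, box[2] * sx, box[3] * sy)

a = (0.1, 0.2, 0.5, 0.6)
b = (0.3, 0.3, 0.7, 0.8)
print(iou(a, b))                            # 0.2 ...
print(iou(scale(a, 2, 3), scale(b, 2, 3)))  # ... and still 0.2 after scaling by (2, 3)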

YOLO object detection model?

Currently, I am reading the YOLO9000 paper "https://arxiv.org/pdf/1612.08242.pdf" and I am very confused about how the model can predict the bounding boxes for object detection. I have done many examples with Tensorflow, and in most of them we give the model images and the labels of those images.
My questions are:
1- How can we pass bounding boxes instead of labels to the model?
2- How can the model learn that many boxes belong to one image?
In YOLO, we divide the image into a 7x7 grid. For each grid location, the network predicts three things:
Probability of an object being present in that grid cell
If an object lies in this grid cell, what are the co-ordinates of the bounding box?
If an object lies in this grid cell, which class does it belong to?
If we apply regression for all the above variables for all 49 grid locations, we will be able to tell which grid locations have objects (using the first parameter). For the grid locations that have objects, we can tell the bounding box co-ordinates and the correct class using the second and third parameters.
Once we have designed a network that can output all the information we need, prepare the training data in this format, i.e. find these parameters for every 7x7 grid location in every image in your dataset. Next, you simply train the deep neural network to regress for these parameters.
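As a rough illustration (not the actual darknet/darkflow code; the grid size and channel layout are simplified assumptions), the target preparation for a single image could look like this:

import numpy as np

S = 7  # grid size

def encode_targets(gt_boxes, img_w, img_h, n_classes):
    """Build an (S, S, 5 + n_classes) target tensor from ground-truth boxes.

    gt_boxes: list of (class_id, xmin, ymin, xmax, ymax) in pixels.
    Each responsible cell stores [objectness, x, y, w, h, one-hot class],
    with the box center and size normalized by the image size.
    """
    target = np.zeros((S, S, 5 + n_classes), dtype=np.float32)
    for cls, xmin, ymin, xmax, ymax in gt_boxes:
        # Box center and size, normalized to [0, 1].
        cx, cy = (xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h
        w, h = (xmax - xmin) / img_w, (ymax - ymin) / img_h
        # The cell containing the box center is responsible for this box.
        col, row = min(int(cx * S), S - 1), min(int(cy * S), S - 1)
        target[row, col, 0] = 1.0                 # an object is present here
        target[row, col, 1:5] = [cx, cy, w, h]    # box regression targets
        target[row, col, 5 + cls] = 1.0           # class one-hot
    return target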
To pass the bounding boxes of an image, we need to create them first. You can create bounding boxes for any image using specific tools: you draw boundaries that enclose an object and then label that bounding box/rectangle. You have to do this for every object in the image you want your model to train on/recognize.
There is one very useful project in this link; you should check that out if you need to understand bounding boxes.
I have just started learning object detection with tensorflow, so as and when I get proper info on providing bounding boxes to the object detection model, I'll also update that here. Also, if you have solved this problem by now, you can provide the details to help out others facing the same kind of problem.
1- How can we pass bounding boxes instead of labels to the model?
If we want to train a model that performs object detection (not object classification), we have to pass the ground-truth labels as .xml files, for example. An xml file contains information about the objects that exist in an image. The information about each object is composed of 5 values:
class name of this object, such as car or human...
xmin: x coordinate of the box's top left point
ymin: y coordinate of the box's top left point
xmax: x coordinate of the box's bottom right point
ymax: y coordinate of the box's bottom right point
One bounding box within an image is specified as a set of 5 values like above. If there are 3 objects in an image, the xml file will contain 3 sets of these values.
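For illustration, here is a minimal sketch (assuming the common Pascal VOC layout) of reading those 5 values per object from such an .xml file:

import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Return a list of (name, xmin, ymin, xmax, ymax) tuples from a VOC-style .xml file."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall('object'):
        name = obj.find('name').text
        bbox = obj.find('bndbox')
        coords = [int(float(bbox.find(tag).text))
                  for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        objects.append((name, *coords))
    return objects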
2- How can the model learn that many boxes belong to one image?
As you know, the output of YOLOv2 or YOLO9000 has shape (13, 13, D), where D depends on how many classes of object you're going to detect. You can see that there are 13x13 = 169 cells (grid cells) and each cell has D values (depth).
Among the 169 grid cells, there are some grid cells that are responsible for predicting bounding boxes. If the center of a ground-truth bounding box falls in a grid cell, that grid cell is responsible for predicting that bounding box when the model is given that image.
I think there must be a function that reads the xml annotation files and determines which grid cells are responsible for detecting bounding boxes.
To make the model learn the box positions and shapes, not only the classes, we have to build an appropriate loss function. The loss function used in YOLOv2 also puts cost on the box shapes and positions. So the loss is calculated as the weighted sum of the following individual loss values (a simplified sketch follows the list below):
Loss on the class name
Loss on the box position (x-y coordinates)
Loss on the box shape (box width and height)
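Below is a heavily simplified sketch of such a weighted sum over an (S, S, 5 + C) target grid (plain sum-of-squares terms, one box per cell; the weights and exact form differ in the real YOLOv2 loss):

import numpy as np

def yolo_like_loss(pred, target, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified YOLO-style loss over an (S, S, 5 + C) prediction/target pair.

    Channel 0 is objectness, channels 1:5 are (x, y, w, h),
    channels 5: are class scores. All terms are sum-of-squares here,
    which is a simplification of the actual YOLOv2 loss.
    """
    obj_mask = target[..., 0] == 1.0
    noobj_mask = ~obj_mask

    # Box position and shape loss, only where an object is present.
    coord_loss = np.sum((pred[obj_mask, 1:5] - target[obj_mask, 1:5]) ** 2)
    # Class loss, only where an object is present.
    class_loss = np.sum((pred[obj_mask, 5:] - target[obj_mask, 5:]) ** 2)
    # Objectness loss, with a smaller weight for cells without objects.
    obj_loss = np.sum((pred[obj_mask, 0] - target[obj_mask, 0]) ** 2)
    noobj_loss = np.sum((pred[noobj_mask, 0] - target[noobj_mask, 0]) ** 2)

    return lambda_coord * coord_loss + class_loss + obj_loss + lambda_noobj * noobj_loss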
SIDE NOTE:
Actually, one grid cell can detect up to B boxes, where B depends on the implementation of YOLOv2. I used darkflow to train YOLOv2 on my custom training data, in which B was 5. So the model can detect 169*B boxes in total, and the loss is the sum of 169*B small losses.
D = B*(5+C), where C is the number of classes you want to detect.
Before being passed to the model, the box shapes and positions are converted into values relative to the image size.