How to sort bounding boxes from left to right? - tensorflow

I am using the TensorFlow Object Detection API to detect text in images, and it gives me good results with very good accuracy. How can I sort the bounding boxes from left to right so that the text comes out in a readable order?

A simple trick would be to sort the output bounding boxes according to their xmin coordinates. For the y coordinate, take the mean of ymin and ymax and group all boxes whose means agree within a small tolerance into the same list. This way you get ordered rows and columns.
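A minimal sketch of that grouping, assuming each detection is a dict with xmin, xmax, ymin and ymax in pixels; the row_tolerance value is a made-up parameter you would tune to your text size:

def sort_boxes_reading_order(boxes, row_tolerance=10):
    # Sort boxes into rows (top to bottom), then left to right within each row.
    # boxes: list of dicts with 'xmin', 'xmax', 'ymin', 'ymax' in pixels.
    boxes = sorted(boxes, key=lambda b: (b['ymin'] + b['ymax']) / 2)
    rows = []
    for box in boxes:
        y_center = (box['ymin'] + box['ymax']) / 2
        # Same row if the vertical center is within the tolerance of the last row.
        if rows and abs(y_center - rows[-1]['y']) <= row_tolerance:
            rows[-1]['boxes'].append(box)
        else:
            rows.append({'y': y_center, 'boxes': [box]})
    # Order each row left to right by xmin and return the rows top to bottom.
    return [sorted(row['boxes'], key=lambda b: b['xmin']) for row in rows]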

Related

Line Profile Diagonal

When you make a line profile of all x-values or all y-values, the extraction from each pixel is clear. But when you take a line profile along a diagonal, how does DM choose which pixels to use in the one-dimensional readout?
Not really a scripting question, but I'm rather certain that it uses bi-linear interpolation between the grid-points along the drawn line. (And if perpendicular integration is enabled, it does so in an integral.) It's the same interpolation you would get for a "rotate" image.
In fact, you can think of it as a rotate-image (bi-linearly interpolated) with a 'cut-out' afterwards, potentially summed/projected onto the new X-axis.
Here is an example
Assume we have a 5 x 4 image, which gives the grid as shown below.
I'm drawing the top-left corners to indicate the coordinate-system pixel convention used in DigitalMicrograph, where
(x/y)=(0/0) is the top-left corner of the image
Now extract a LineProfile from (1/1) to (4/3). I have highlighted the pixels for those coordinates.
Note that a line drawn from the corners seems to be shifted by half a pixel from what feels 'natural', but that is a consequence of the top-left-corner convention. I think this is why a LineProfile marker is shown shifted compared to, e.g., LineAnnotations.
In general, this top-left corner convention makes schematics with 'pixels' seem counter-intuitive. It is easier to think of the image simply as a grid with values at points at the given coordinates than as square pixels.
Now the maths.
The exact profile has a length of sqrt(dX^2 + dY^2) = sqrt(3^2 + 2^2) = sqrt(13) ≈ 3.61 pixels.
As we can only have profiles with an integer number of channels, we actually extract a LineProfile of length = 4, i.e. we round up.
The angle of the profile is given by the arc-tangent of dY and dX, i.e. atan(dY/dX) = atan(2/3) ≈ 33.7°.
So to extract the profile, we 'rotate' the grid by that angle - done by bilinear interpolation - and then extract the profile as grid of size 4 x 1:
This means the 'values' in the profile are from the four points:
Which are each bi-linearly interpolated values from four closest points of the original image:
In case the LineProfile is averaged over a certain width W, you do the same thing but:
extract a 2D grid of size L x W centered symmetrically over the line, i.e. the grid is shifted by (W-1)/2 perpendicular to the profile direction
sum the values along W
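For reference, here is a minimal Python sketch (not DM script) of the same idea: sample points along the line and read the image with bilinear interpolation via scipy.ndimage.map_coordinates. The 5 x 4 image and the (1/1)-(4/3) line are the ones from the example above; the test values are arbitrary.

import numpy as np
from scipy.ndimage import map_coordinates

image = np.arange(20, dtype=float).reshape(4, 5)   # 4 rows (y) x 5 columns (x)
x0, y0, x1, y1 = 1, 1, 4, 3                        # profile from (1/1) to (4/3)
length = int(np.ceil(np.hypot(x1 - x0, y1 - y0)))  # ceil(sqrt(13)) = 4 channels

# Sample 'length' points along the line; order=1 is bilinear interpolation,
# which is what the rotate-and-cut-out picture above describes.
xs = np.linspace(x0, x1, length)
ys = np.linspace(y0, y1, length)
profile = map_coordinates(image, [ys, xs], order=1)
print(profile)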

How Calculate Intersection over union (IoU) for Bounding Box in case of Multilabeling in Tensorflow 2.x?

How can I calculate the IoU metric for bounding boxes in the multilabel case, i.e. when my image can have more than one bounding box with different classes?
For example: one bounding box for a person, one for a car and another for a bird in the same image.
I found a direct implementation in TensorFlow Addons here:
https://www.tensorflow.org/addons/api_docs/python/tfa/losses/GIoULoss
And a manual implementation here:
https://pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
But both are for the single-label case.
Is it as simple as identifying the class of each bounding box, calculating the IoU per class separately and then taking the mean?
Or do I need something else?
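A minimal sketch of the per-class approach described above, assuming predictions and ground truths are given as dicts mapping a class id to a single [ymin, xmin, ymax, xmax] box (the one-box-per-class pairing is an assumption, not a prescribed answer):

import numpy as np

def iou(box_a, box_b):
    # Boxes are [ymin, xmin, ymax, xmax].
    ymin = max(box_a[0], box_b[0]); xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2]); xmax = min(box_a[3], box_b[3])
    inter = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def mean_iou_per_class(pred, truth):
    # Compute the IoU per shared class and average the results.
    shared = set(pred) & set(truth)
    return float(np.mean([iou(pred[c], truth[c]) for c in shared])) if shared else 0.0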

Anchor boxes and offsets in SSD object detection

How do you calculate anchor box offsets for object detection in SSD? As far as I understand, anchor boxes are the boxes on the 8x8 feature map, the 4x4 feature map, or any other feature map of the output layer.
So what are the offsets?
Is it the distance between the centre of the bounding box and the centre of a particular box in say a 4x4 feature map?
If I am using a 4x4 feature map as my output, then my output should be of the dimension:
(4x4, n_classes + 4)
where 4 is for my anchor box co-ordinates.
These 4 co-ordinates can be something like:
(xmin, xmax, ymin, ymax)
This will correspond to the top-left and bottom-right corners of the bounding box.
So why do we need offsets and if so how do we calculate them?
Any help would be really appreciated!
We need offsets because that is what we actually regress relative to the default anchor boxes. In SSD, every feature map cell has a predefined number of anchor boxes of different scales and aspect ratios; I think in the paper this number is 6.
Because this is a detection problem, we also have ground-truth bounding boxes. Roughly, we compare the IoU of each anchor box with the ground-truth box, and if it is greater than a threshold (say 0.5) we predict the box offsets relative to that anchor box.
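As an illustration, the usual SSD-style encoding predicts the ground-truth center, width and height relative to the matched anchor in center-size form (the variance scaling used by some implementations is left out here as a simplification):

import numpy as np

def encode_offsets(gt, anchor):
    # Both boxes given as (cx, cy, w, h); returns the four regression targets.
    gcx, gcy, gw, gh = gt
    acx, acy, aw, ah = anchor
    return np.array([
        (gcx - acx) / aw,   # center-x offset, scaled by anchor width
        (gcy - acy) / ah,   # center-y offset, scaled by anchor height
        np.log(gw / aw),    # log width ratio
        np.log(gh / ah),    # log height ratio
    ])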

YOLO object detection model?

Currently, I am reading the YOLO9000 paper ("https://arxiv.org/pdf/1612.08242.pdf") and I am very confused about how the model can predict the bounding boxes for object detection. I have done many examples with TensorFlow, and in most of them we give the model images and labels for those images.
My questions are:
1- How can we pass bounding boxes instead of labels to the model?
2- How can the model learn that many boxes belong to one image?
In YOLO, we divide the image into a 7x7 grid. For each grid location, the network predicts three things:
Probability of an object being present in that grid
If an object lies in this grid, what would be the co-ordinates of the bounding box?
If an object lies in this grid, which class does it belong to?
If we apply regression for all the above variables for all 49 grid locations, we will be able to tell which grid locations have objects (using the first parameter). For the grid locations that have objects, we can tell the bounding box co-ordinates and the correct class using the second and third parameters.
Once we have designed a network that can output all the information we need, prepare the training data in this format, i.e. find these parameters for every 7x7 grid location in every image in your dataset. Next you simply train the deep neural network to regress these parameters.
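A minimal sketch of how such a 7x7 training target could be prepared, assuming each box comes as (class_id, x_center, y_center, w, h) normalised to [0, 1]; a single box per cell is a simplification of the real YOLO format:

import numpy as np

def make_yolo_target(boxes, grid=7, n_classes=20):
    # Per-cell target layout: [objectness, x, y, w, h, one-hot class scores].
    target = np.zeros((grid, grid, 5 + n_classes), dtype=np.float32)
    for class_id, x, y, w, h in boxes:
        col = min(int(x * grid), grid - 1)   # cell column the box center falls into
        row = min(int(y * grid), grid - 1)   # cell row the box center falls into
        target[row, col, 0] = 1.0                     # an object is present here
        target[row, col, 1:5] = [x, y, w, h]          # box coordinates
        target[row, col, 5 + class_id] = 1.0          # class one-hot
    return target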
To pass the bounding boxes of an image, we need to create them first. You can create bounding boxes for any image using specific annotation tools. There, you draw boundaries that enclose an object and then label that bounding box/rectangle. You have to do this for every object in the image that you want your model to train on/recognize.
There is one very useful project at this link; you should check it out if you need to understand bounding boxes.
I have just started learning object detection with TensorFlow, so as and when I get proper information on providing bounding boxes to an object detection model I'll update it here. Also, if you have solved this problem by now, you could add the details to help out others facing the same kind of problem.
1- How we can pass the bounding box instead of labels to the model?
If we want to train a model that performs object detection (not just object classification), we have to pass the ground-truth labels as, for example, .xml files. An xml file contains information about the objects that exist in an image. The information about each object is composed of 5 values:
class name of this object, such as car or human...
xmin: x coordinate of the box's top left point
ymin: y coordinate of the box's top left point
xmax: x coordinate of the box's bottom right point
ymax: y coordinate of the box's bottom right point
One bounding box within an image is specified as a set of 5 values like the above. If there are 3 objects in an image, the xml file will contain 3 sets of these values.
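For illustration, a short sketch of reading those 5 values per object out of a Pascal-VOC-style .xml file with Python's standard library (the tag names are the usual VOC ones, which is an assumption about your annotation format):

import xml.etree.ElementTree as ET

def read_voc_boxes(xml_path):
    # Returns a list of (class_name, xmin, ymin, xmax, ymax) tuples, one per object.
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall('object'):
        name = obj.find('name').text
        bb = obj.find('bndbox')
        boxes.append((name,
                      int(bb.find('xmin').text), int(bb.find('ymin').text),
                      int(bb.find('xmax').text), int(bb.find('ymax').text)))
    return boxes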
2- How can the model learn that many boxes belong to one image?
As you know, the output of YOLOv2 or YOLO9000 has shape (13, 13, D), where D depends on how many classes of objects you're going to detect. You can see that there are 13x13 = 169 cells (grid cells) and each cell has D values (depth).
Among the 169 grid cells, some are responsible for predicting bounding boxes: if the center of a true bounding box falls into a grid cell, that grid cell is responsible for predicting that bounding box when it is given the same image.
I think there must be a function that reads the xml annotation files and determines which grid cells are responsible for which bounding boxes.
To make the model learn the box positions and shapes, not only the classes, we have to build an appropriate loss function. The loss function used in YOLOv2 also puts a cost on the box shapes and positions, so the loss is calculated as the weighted sum of the following individual loss values:
Loss on the class name
Loss on the box position (x-y coordinates)
Loss on the box shape (box width and height)
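A much simplified sketch of that weighted sum, for the cells that contain an object only (one box per cell, no no-object confidence term, and the lambda weights are assumptions; the real YOLOv2 loss has more terms):

import numpy as np

def yolo_like_loss(pred, target, lambda_coord=5.0, lambda_class=1.0):
    # pred and target have shape (S, S, 5 + C): [objectness, x, y, w, h, classes].
    obj_mask = target[..., 0] == 1.0   # cells that actually contain an object
    pos_loss = np.sum((pred[..., 1:3] - target[..., 1:3])[obj_mask] ** 2)
    # Square roots of width/height, as in the YOLO papers, to soften large boxes.
    shape_loss = np.sum((np.sqrt(pred[..., 3:5]) - np.sqrt(target[..., 3:5]))[obj_mask] ** 2)
    class_loss = np.sum((pred[..., 5:] - target[..., 5:])[obj_mask] ** 2)
    return lambda_coord * (pos_loss + shape_loss) + lambda_class * class_loss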
SIDE NOTE:
Actually, one grid cell can detect up to B boxes, where B depends on the implementation of YOLOv2. I used darkflow to train YOLOv2 on my custom training data, in which B was 5. So the model can detect 169*B boxes in total, and the loss is the sum of 169*B small losses.
D = B*(5+C), where C is the number of classes you want to detect.
Before being passed to the model, the box shapes and positions are converted into values relative to the image size.
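For example, converting one box from absolute pixel corners into such relative values could look like this (the center/size output layout matches the usual darknet text format, taken here as an assumption):

def to_relative(xmin, ymin, xmax, ymax, img_w, img_h):
    # Convert absolute corner coordinates into center/size values in [0, 1].
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return x_center, y_center, w, h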

Put pcolormesh and contour onto same grid?

I'm trying to display 2D data with axis labels using both contour and pcolormesh. As has been noted on the matplotlib user list, these functions obey different conventions: pcolormesh expects the x and y values to specify the corners of the individual pixels, while contour expects the centers of the pixels.
What is the best way to make these behave consistently?
One option I've considered is to make a "centers-to-edges" function, assuming evenly spaced data:
import numpy as np

def centers_to_edges(arr):
    # Convert evenly spaced cell-center coordinates into cell-edge coordinates.
    dx = arr[1] - arr[0]
    newarr = np.linspace(arr.min() - dx/2, arr.max() + dx/2, arr.size + 1)
    return newarr
Another option is to use imshow with the extent keyword set.
The first approach doesn't play nicely with 2D axes (e.g., as created by meshgrid or indices), and the second discards the axis numbers entirely.
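For the 1D-axis case, a sketch of how the helper above could be combined so that pcolormesh gets cell edges while contour gets cell centers (evenly spaced axes are assumed, and the sample data is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 9, 10)   # cell-center coordinates
y = np.linspace(0, 4, 5)
Z = np.add.outer(y, x)      # Z[j, i] sampled at the centers (y[j], x[i])

# pcolormesh wants the cell edges, contour wants the cell centers.
plt.pcolormesh(centers_to_edges(x), centers_to_edges(y), Z)
plt.contour(x, y, Z, colors='k')
plt.show()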
Is your data on a regular mesh? If not, you can use griddata() to resample it onto one. If your data is very large, sub-sampling or regridding is always possible, and since the output image will usually be small compared with the data anyway, you can exploit this.
If you use imshow() with "extent" and "interpolation='nearest'", you will see that the data is cell-centered and that extent specifies the outer edges (corners) of the cells. On the other hand, contour assumes that the data is cell-centered and that X, Y are the centers of the cells. So you need to be careful about the input domain for contour. A trivial example:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-10, 10, 1)
X, Y = np.meshgrid(x, x)
P = X**2 + Y**2
# imshow: extent gives the outer cell edges; contour: shift centers by half a cell to match.
plt.imshow(P, extent=[-10, 10, -10, 10], interpolation='nearest', origin='lower')
plt.contour(X + 0.5, Y + 0.5, P, 20, colors='k')
My tests told me that pcolormesh() is a very slow routine, and I always try to avoid it; griddata() and imshow() have always been a good choice for me.