Is the IOU in the Tensorflow Object Detection API wrong?

I just dug a bit through the Tensorflow Object Detection API code, especially the eval_util part, as I wanted to implement the COCO metrics.
But I noticed that the metrics are solely calculated using the bounding boxes which have normalized coordinates between [0, 1].
There are no aspect ratios or absolute coordinates used.
So, doesn't this mean that the intersection-over-union values calculated on these results are incorrect?
Let's take a 200x100 pixel image as an example.
If the box were off by 20px to the left, that would be 0.1 in normalized coordinates.
But if it were off by 20px to the top, that would be 0.2 in normalized coordinates.
Doesn't that mean that being off towards the top penalizes the score more heavily than being off to the side?

I believe the predicted coordinates are resized to the absolute image coordinates in the eval binary.
But the other thing I would say is that IOU is scale invariant in the sense that if you scale both boxes by the same factors, they will still have the same IOU overlap. As an example, if we scale by 2 in the x-direction and by 3 in the y-direction:
If A is (x1, y1, x2, y2) and B is (u1, v1, u2, v2), then
IOU(A, B) = IOU((2*x1, 3*y1, 2*x2, 3*y2), (2*u1, 3*v1, 2*u2, 3*v2)).
What this means is that evaluating in normalized coordinates should give the same result as evaluating in absolute coordinates.
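One way to convince yourself: compute the IOU of a pair of boxes in normalized coordinates and again after scaling both boxes back to, say, 200x100 pixel coordinates. A minimal standalone sketch (not code from the Object Detection API):

    def iou(a, b):
        # boxes as (x1, y1, x2, y2)
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def scale(box, sx, sy):
        # scale the x-coordinates by sx and the y-coordinates by sy
        return (box[0] * sx, box[1] * sy, box[2] * sx, box[3] * sy)

    A = (0.10, 0.20, 0.40, 0.60)   # normalized coordinates
    B = (0.15, 0.25, 0.45, 0.70)
    # both prints show the same value (~0.52)
    print(iou(A, B))
    print(iou(scale(A, 200, 100), scale(B, 200, 100)))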

Related

Tuning first_stage_anchor_generator in faster rcnn model

I am trying to detect some very small objects (~25x25 pixels) in large images (~2040x1536 pixels) using the Faster R-CNN model from the object detection API here: https://github.com/tensorflow/models/tree/master/research/object_detection
I am very confused about the following configuration parameters (I have read the proto file and have also tried modifying them and testing):
first_stage_anchor_generator {
  grid_anchor_generator {
    scales: [0.25, 0.5, 1.0, 2.0]
    aspect_ratios: [0.5, 1.0, 2.0]
    height_stride: 16
    width_stride: 16
  }
}
I am quite new to this area; if someone could explain these parameters a bit, it would be much appreciated.
My question is: how should I adjust the above (or other) parameters to accommodate the fact that I have very small, fixed-size objects to detect in a large image?
Thanks
I don't know the actual answer, but I suspect that the way Faster RCNN works in Tensorflow object detection is as follows:
this article says:
"Anchors play an important role in Faster R-CNN. An anchor is a box. In the default configuration of Faster R-CNN, there are 9 anchors at a position of an image. The following graph shows 9 anchors at the position (320, 320) of an image with size (600, 800)."
The author also gives an image showing an overlapping set of boxes; those are the proposed regions that may contain an object, based on the "CNN" part of the "RCNN" model. Next comes the "R" part of the "RCNN" model, the region proposal. For that, there is another neural network trained alongside the CNN to figure out the best-fitting box. There are a lot of "proposals" for where an object could be, based on all the boxes, but we still don't know where it actually is.
This "region proposal" neural net's job is to find the correct region, and it is trained on the labels you provide with the coordinates of each object in the image.
Looking at this file, I noticed:
line 174: heights = scales / ratio_sqrts * base_anchor_size[0]
line 175: widths = scales * ratio_sqrts * base_anchor_size[1]
which seems to be the final goal of the configuration found in the config file (to generate a list of sliding windows with known widths and heights), while base_anchor_size defaults to [256, 256]. In the comments the author of the code wrote:
"For example, setting scales=[.1, .2, .2]
and aspect ratios = [2,2,1/2] means that we create three boxes: one with scale
.1, aspect ratio 2, one with scale .2, aspect ratio 2, and one with scale .2
and aspect ratio 1/2. Each box is multiplied by "base_anchor_size" before
placing it over its respective center."
This gives insight into how the boxes are created: the code seems to build a list of boxes based on the scales = [...] and aspect_ratios = [...] parameters, which will then be slid over the image. The scale is fairly straightforward: it is how much the default 256-by-256 square box should be scaled before it is used. The aspect ratio is what turns the original square box into a rectangle that is closer to the (scaled) shape of the objects you expect to encounter.
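To make that concrete, here is a rough sketch (not the library code itself) of what lines 174-175 compute for the scales and aspect_ratios from the config above, assuming the default base_anchor_size of [256, 256]:

    import numpy as np

    scales = np.array([0.25, 0.5, 1.0, 2.0])
    aspect_ratios = np.array([0.5, 1.0, 2.0])
    base_anchor_size = [256.0, 256.0]   # default [height, width]

    # every (scale, aspect_ratio) pair yields one anchor shape
    scales_grid, ratios_grid = np.meshgrid(scales, aspect_ratios)
    ratio_sqrts = np.sqrt(ratios_grid)
    heights = scales_grid / ratio_sqrts * base_anchor_size[0]
    widths = scales_grid * ratio_sqrts * base_anchor_size[1]

    for h, w in zip(heights.ravel(), widths.ravel()):
        print("%7.1f x %7.1f" % (h, w))   # anchor height x width in pixels

With these defaults the smallest square anchor is 64x64 pixels (scale 0.25, aspect ratio 1.0), which already dwarfs a 25x25 object; that is exactly why the scales need to come down for this use case.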
Meaning: to configure the scales and aspect ratios well, you should find the "typical" sizes of the objects in the image, whatever they are (e.g., 20 by 30, 5 by 10, etc.), figure out how much the default 256-by-256 square box should be scaled to fit them, then find the "typical" aspect ratios of your objects (the aspect ratio being the ratio of the width to the height) and set those as your aspect_ratios parameters.
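As a back-of-the-envelope example with illustrative numbers (the helper below is hypothetical, not part of the API): for the roughly 25x25 pixel objects from the question, the scale that shrinks the 256x256 base anchor to that size is about 25/256.

    # hypothetical helper: back out a grid_anchor_generator scale from a
    # typical object size, given the default 256x256 base anchor
    def scale_for(object_px, base_anchor_px=256.0):
        return object_px / base_anchor_px

    print(scale_for(25))   # ~0.098 -> a scale near 0.1 for ~25x25 px objects
    print(scale_for(50))   # ~0.195 -> near 0.2 for ~50x50 px objects

Note that the detection pipeline typically resizes the input image first, so measure the typical object size after resizing rather than in the original 2040x1536 frame.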
Note: it seems that the number of elements in the scales and aspect_ratios lists in the config file should be the same but I don't know for sure.
Also, I am not sure how to find the optimal stride, but if your objects are smaller than 16 by 16 pixels, the sliding window you created by setting the scales and aspect ratios might just skip your object altogether.
As far as I know, proposal anchors are generated only for Faster RCNN model types. In this file you can see which parameters may be set for anchor generation, within the config lines you mentioned.
I tried setting base_anchor_size, but it did not work for me. However, this FasterRCNNTutorial mentions that:
[...] you also need to configure the anchor sizes and aspect ratios in the .config file. The base anchor size is 255,255.
The anchor ratios will multiply the x dimension and divide the y dimension, so if you have an aspect ratio of 0.5 your 255x255 anchor becomes 128x510. Each aspect ratio in the list is applied, then the results are multiplied by the scales. So the first step is to resize your images to the training/testing size, then manually check what the smallest and largest objects you expect are, and what the most extreme aspect ratios will be. Set up the config file with values that will cover these cases when the base anchor size is adjusted by the aspect ratios and multiplied by the scales.
I think it's pretty straightforward. I also used this 'workaround'.

3D alpha shape yielding unexpected convex hull of surface

I executed the 3D alpha shape function with CGAL and I got unexpected results.
My input data was a set of 3D points (x, y, z) that represents one building (a box) in a flat area, with some small noise in the coordinates. I expected to get as a result only the surface triangles that represent the building (walls and roof) and the ground.
But as a result I got triangles forming a convex hull of the surface.
I tried to change the "optimal alpha value" but it was the same.
Is there any filtering process or parameter that I can set to get the surface triangles only?
You need to find the tetrahedra on the surface of the shape first; then you can apply alpha shapes and remove the edges exceeding alpha. In CGAL you check, for every tetrahedron, whether it is connected with a super tetrahedron; those are the tetrahedra on the surface of the shape. Then apply alpha shapes.

GLKView GLKMatrix4MakeLookAt description and explanation

For the modelViewMatrix I understand how to form the translation and scale matrices, but I am unable to understand how to form the view matrix using GLKMatrix4MakeLookAt. Can anyone explain how it works and how to give values to its parameters (eye, center, and up X/Y/Z)?
GLK_INLINE GLKMatrix4 GLKMatrix4MakeLookAt(float eyeX, float eyeY, float eyeZ,
                                           float centerX, float centerY, float centerZ,
                                           float upX, float upY, float upZ)
GLKMatrix4MakeLookAt creates a viewing matrix (in the same way as gluLookAt does, in case you look at other OpenGL code). As the parameters suggest, it considers the position of the viewer's eye, the point in space they're looking at (e.g., a point on an object), and the up vector, which specifies which direction is "up" (e.g., pointing towards the sky). The viewing matrix generated is the combination of a rotation matrix (composed of a set of orthonormal basis vectors) and a translation.
Logically, the matrix is basically constructed in a few steps:
1. Compute the line-of-sight vector, which is the normalized vector going from the eye's position to the point you're looking at (the center point).
2. Compute the cross product of the line-of-sight vector with the up vector, and normalize the resulting vector.
3. Compute the cross product of the vector computed in step 2 with the line-of-sight vector to complete the orthonormal basis.
4. Create a 3x3 rotation matrix by setting the first row to the vector created in step 2, the middle row to the vector from step 3, and the bottom row to the negated, normalized line-of-sight vector.
Together, those steps produce a rotation matrix that will rotate the world coordinate system into eye coordinates (a coordinate system where the eye is located at the origin and the line of sight is down the -z axis). The final viewing matrix is computed by multiplying that rotation by a translation to the negated eye position, which moves the "world coordinate positioned eye" to the origin of eye coordinates.
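For reference, here is a small numpy sketch that follows the four steps above. It is not the GLKit source, and GLKit stores matrices column-major, so the layout there is transposed relative to the row-major math here:

    import numpy as np

    def make_look_at(eye, center, up):
        eye, center, up = (np.asarray(v, dtype=float) for v in (eye, center, up))

        f = center - eye
        f /= np.linalg.norm(f)          # step 1: normalized line of sight
        s = np.cross(f, up)
        s /= np.linalg.norm(s)          # step 2: "side" vector
        u = np.cross(s, f)              # step 3: recomputed up, already unit length

        m = np.identity(4)
        m[0, :3] = s                    # step 4: rows of the rotation part
        m[1, :3] = u
        m[2, :3] = -f                   # negated line of sight -> -z is "forward"
        m[:3, 3] = m[:3, :3] @ -eye     # translate the eye to the origin
        return m

    # example: camera at (0, 0, 5) looking at the origin, +y up
    print(make_look_at((0, 0, 5), (0, 0, 0), (0, 1, 0)))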
Here's a related question showing the code of GLKMatrix4MakeLookAt, and here's a question with more detail about eye coordinates and related coordinate systems: (What exactly are eye space coordinates?).

Faster way to perform point-wise interplation of numpy array?

I have a 3D datacube, with two spatial dimensions and the third being a multi-band spectrum at each point of the 2D image.
H[x, y, bands]
Given a wavelength (or band number), I would like to extract the 2D image corresponding to that wavelength. This would be simply an array slice like H[:,:,bnd]. Similarly, given a spatial location (i,j) the spectrum at that location is H[i,j].
I would also like to 'smooth' the image spectrally, to counter low-light noise in the spectra. That is, for band bnd, I choose a window of size wind and fit an n-degree polynomial to the spectrum in that window. With polyfit and polyval I can find the fitted spectral value at that point for band bnd.
Now, if I want the whole image of bnd from the fitted value, then I have to perform this windowed-fitting at each (i,j) of the image. I also want the 2nd-derivative image of bnd, that is, the value of the 2nd-derivative of the fitted spectrum at each point.
Running over the points, I could polyfit-polyval-polyder each of the x*y spectra. While this works, it is a point-wise operation. Is there some pytho-numponic way to do this faster?
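To pin down what "point-wise" means here, the loop version might look roughly like this (array shapes and names are illustrative, not taken from the question):

    import numpy as np

    H = np.random.rand(64, 64, 200)      # placeholder cube: H[x, y, bands]
    bnd, wind, deg = 100, 5, 2
    half = wind // 2
    xs = np.arange(bnd - half, bnd + half + 1)

    smooth = np.empty(H.shape[:2])
    deriv2 = np.empty(H.shape[:2])
    for i in range(H.shape[0]):
        for j in range(H.shape[1]):
            # fit a small window around band `bnd` at this pixel
            coeffs = np.polyfit(xs, H[i, j, bnd - half:bnd + half + 1], deg)
            smooth[i, j] = np.polyval(coeffs, bnd)
            deriv2[i, j] = np.polyval(np.polyder(coeffs, 2), bnd)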
If you do least-squares polynomial fitting to points (x+dx[i],y[i]) for a fixed set of dx and then evaluate the resulting polynomial at x, the result is a (fixed) linear combination of the y[i]. The same is true for the derivatives of the polynomial. So you just need a linear combination of the slices. Look up "Savitzky-Golay filters".
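For what it's worth, scipy already ships this as a ready-made filter. A minimal sketch, assuming H has shape (x, y, bands), that smooths and takes the 2nd derivative along the spectral axis in one vectorized call:

    import numpy as np
    from scipy.signal import savgol_filter

    H = np.random.rand(64, 64, 200)          # placeholder data cube

    smoothed = savgol_filter(H, window_length=5, polyorder=2, axis=-1)
    second_deriv = savgol_filter(H, window_length=5, polyorder=2, deriv=2, axis=-1)

    bnd = 100
    img_smooth = smoothed[:, :, bnd]         # 2D image of the smoothed band
    img_d2 = second_deriv[:, :, bnd]         # 2D image of its 2nd derivative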
EDITED to add a brief example of how S-G filters work. I haven't checked any of the details and you should therefore not rely on it to be correct.
So, suppose you take a filter of width 5 and degree 2. That is, for each band (ignoring, for the moment, ones at the start and end) we'll take that one and the two on either side, fit a quadratic curve, and look at its value in the middle.
So, if f(x) ~= ax^2+bx+c and f(-2),f(-1),f(0),f(1),f(2) = p,q,r,s,t then we want 4a-2b+c ~= p, a-b+c ~= q, etc. Least-squares fitting means minimizing (4a-2b+c-p)^2 + (a-b+c-q)^2 + (c-r)^2 + (a+b+c-s)^2 + (4a+2b+c-t)^2, which means (taking partial derivatives w.r.t. a,b,c):
4(4a-2b+c-p)+(a-b+c-q)+(a+b+c-s)+4(4a+2b+c-t)=0
-2(4a-2b+c-p)-(a-b+c-q)+(a+b+c-s)+2(4a+2b+c-t)=0
(4a-2b+c-p)+(a-b+c-q)+(c-r)+(a+b+c-s)+(4a+2b+c-t)=0
or, simplifying,
34a+10c = 4p+q+s+4t
10b = -2p-q+s+2t
10a+5c = p+q+r+s+t
so a,b,c = (2p-q-2r-s+2t)/14, (2(t-p)+(s-q))/10, (p+q+r+s+t)/5-(2p-q-2r-s+2t)/7.
And of course c is the value of the fitted polynomial at 0, and therefore is the smoothed value we want. So for each spatial position, we have a vector of input spectral data, from which we compute the smoothed spectral data by multiplying by a matrix whose rows (apart from the first and last couple) look like [0 ... 0 -3/35 12/35 17/35 12/35 -3/35 0 ... 0], with the central 17/35 on the main diagonal of the matrix.
So you could do a matrix multiplication for each spatial position; but since it's the same matrix everywhere, you can do it with a single call to tensordot. So if S contains the matrix I just described (er, wait, no, the transpose of the matrix I just described) and A is your 3-dimensional data cube, your spectrally smoothed data cube would be numpy.tensordot(A, S, axes=([2], [0])).
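A sketch of that single-linear-algebra-operation version, using the window-5, degree-2 coefficients derived above (edge handling kept deliberately crude):

    import numpy as np

    # columns of S hold the Savitzky-Golay smoothing weights, so
    # tensordot(A, S, axes=([2], [0]))[x, y, j] smooths band j at pixel (x, y)
    coeffs = np.array([-3.0, 12.0, 17.0, 12.0, -3.0]) / 35.0
    bands = 200
    S = np.zeros((bands, bands))
    for j in range(2, bands - 2):
        S[j - 2:j + 3, j] = coeffs        # column j mixes bands j-2 .. j+2
    for j in (0, 1, bands - 2, bands - 1):
        S[j, j] = 1.0                     # pass the edge bands through unchanged

    A = np.random.rand(64, 64, bands)     # placeholder data cube
    smoothed = np.tensordot(A, S, axes=([2], [0]))   # shape (64, 64, bands)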
This would be a good point at which to repeat my warning: I haven't checked any of the details in the few paragraphs above, which are just meant to give an indication of how it all works and why you can do the whole thing in a single linear-algebra operation.

Solving for optimal alignment of 3d polygonal mesh

I'm trying to implement a geometry templating engine. One of the parts is taking a prototypical polygonal mesh and aligning an instantiation with some points in the larger object.
So, the problem is this: given 3d point positions for some (perhaps all) of the verts in a polygonal mesh, find a scaled rotation that minimizes the difference between the transformed verts and the given point positions. I also have a centerpoint that can remain fixed, if that helps. The correspondence between the verts and the 3d locations is fixed.
I'm thinking this could be done by solving for the coefficients of a transformation matrix, but I'm a little unsure how to build the system to solve.
An example of this is a cube. The prototype would be the unit cube, centered at the origin, with vert indices:
  4----5
  |\    \
  | 6----7
  | |    |
  0 |  1 |
   \|    |
    2----3
An example of the vert locations to fit:
v0: 1.243,2.163,-3.426
v1: 4.190,-0.408,-0.485
v2: -1.974,-1.525,-3.426
v3: 0.974,-4.096,-0.485
v5: 1.974,1.525,3.426
v7: -1.243,-2.163,3.426
So, given that prototype and those points, how do I find the single scale factor, and the rotation about x, y, and z that will minimize the distance between the verts and those positions? It would be best for the method to be generalizable to an arbitrary mesh, not just a cube.
Assuming you have all points and their correspondences, you can fine-tune your match by solving the least squares problem:
minimize Norm(T*V-M)
where T is the transformation matrix you are looking for, V are the vertices to fit, and M are the vertices of the prototype. Norm refers to the Frobenius norm. M and V are 3xN matrices where each column is the 3-vector of a prototype vertex and the corresponding vertex in the fitting vertex set, and T is a 3x3 transformation matrix. The transformation matrix that minimizes the mean squared error is then M*transpose(V)*inverse(V*transpose(V)). The resulting matrix will in general not be orthogonal (and you wanted one with no shear), so you can solve an orthogonal Procrustes problem to find the nearest orthogonal matrix with the SVD.
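If it helps, here is a minimal numpy sketch of that last step done directly with an SVD, in the Kabsch/Umeyama style rather than via the normal equations above, and oriented the way the question asks (prototype -> target positions). It assumes known correspondences and that both point sets are already centered on the fixed centerpoint:

    import numpy as np

    def fit_scaled_rotation(P, Q):
        # P: 3xN prototype vertices, Q: 3xN target positions (both centered)
        H = Q @ P.T                      # cross-covariance of the point sets
        U, S, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(U @ Vt))
        D = np.diag([1.0, 1.0, d])       # reflection guard: keep det(R) = +1
        R = U @ D @ Vt
        scale = np.trace(np.diag(S) @ D) / np.sum(P * P)
        return scale, R

    # usage: four corners of the unit cube (a tetrahedron), stretched by 2
    # and rotated 90 degrees about z
    P = np.array([[-0.5,  0.5,  0.5, -0.5],
                  [-0.5,  0.5, -0.5,  0.5],
                  [-0.5, -0.5,  0.5,  0.5]])
    Rz = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.0,  0.0, 1.0]])
    Q = 2.0 * Rz @ P
    s, R = fit_scaled_rotation(P, Q)
    print(s)   # ~2.0
    print(R)   # ~Rz

The reflection guard matters when the point sets are noisy or nearly degenerate; without it the SVD can hand back an improper rotation (a rotation plus a mirror).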
Now, if you don't know which given points will correspond to which prototype points, the problem you want to solve is called surface registration. This is an active field of research. See for example this paper, which also covers rigid registration, which is what you're after.
If you want to create a mesh on an arbitrary 3D geometry, this is not the way it's typically done.
You should look at octree mesh generation techniques. You'll have better success if you work with a true 3D primitive, which means tetrahedra instead of cubes.
If your geometry is a 3D body, all you'll have is a surface description to start with. Determining "optimal" interior points isn't meaningful, because you don't have any. You'll want them to be arranged in such a way that the tetrahedra inside aren't too distorted, but that's the best you'll be able to do.