How does YOLO deal with 2 object centers in the same grid cell - object-detection

I am trying to better understand YOLO.
When converting the ground-truth bounding boxes into targets for the model, as I understand it:
grid[x // grid_size, y // grid_size, 0 : anchors_number] = [x % grid_size, y % grid_size, obj_width, obj_height, conf, ...]
Am I wrong? If not, how does YOLO work with 2 objects whose centers fall in the same grid cell?

Yes, YOLO can deal with 2 objects in the same image. You just need to create 2 rows in your annotation file.
This is a sample annotation file with 2 boxes of the same class, class 0:
0 0.588196 0.474138 0.823607 0.441645
0 0.688196 0.574138 0.723607 0.341645
The schema for the annotations is:
<class-label x_center_image y_center_image width height>
where the center coordinates, width and height are normalized by the image width and height. There is a clear description of what the values mean here: https://stackoverflow.com/a/66563144/5183735.
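Regarding the original question about two centers in the same grid cell: in anchor-based YOLO versions (v2/v3), each cell has one target slot per anchor, so two objects whose centers fall in the same cell are normally assigned to different anchors, usually the anchor whose shape best matches each box. Below is a minimal sketch of that target-building step, assuming normalized YOLO-format labels; the grid size, anchor shapes and target layout are made up for illustration, not a reference implementation.

import numpy as np

GRID = 13                              # cells per side (illustrative)
ANCHORS = np.array([[0.10, 0.15],      # (w, h) anchor shapes in image-relative units (made up)
                    [0.35, 0.30],
                    [0.80, 0.70]])
NUM_CLASSES = 1

def build_target(labels):
    """labels: iterable of (class_id, x_center, y_center, width, height), all in [0, 1]."""
    target = np.zeros((GRID, GRID, len(ANCHORS), 5 + NUM_CLASSES))
    for class_id, xc, yc, w, h in labels:
        col, row = int(xc * GRID), int(yc * GRID)          # responsible cell
        # pick the anchor whose (w, h) best matches the box (IoU on shapes only)
        inter = np.minimum(ANCHORS[:, 0], w) * np.minimum(ANCHORS[:, 1], h)
        union = ANCHORS[:, 0] * ANCHORS[:, 1] + w * h - inter
        a = int(np.argmax(inter / union))
        # center offsets within the cell, box size, objectness, one-hot class
        target[row, col, a, 0:5] = [xc * GRID - col, yc * GRID - row, w, h, 1.0]
        target[row, col, a, 5 + int(class_id)] = 1.0
        # note: if two objects land in the same cell AND match the same anchor,
        # the later assignment overwrites the earlier one; that object is lost
    return target

# the two class-0 boxes from the annotation file above
t = build_target([(0, 0.588196, 0.474138, 0.823607, 0.441645),
                  (0, 0.688196, 0.574138, 0.723607, 0.341645)])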

Related

Is it possible to train YOLO (any version) for a single class where the image has text data? (find region of equations)

I am wondering if YOLO (any version, especially the one tuned for accuracy, not speed) can be trained on text data. What I am trying to do is to find the region in a text image where any equation is present.
For example, I want to find the 2 gray regions of interest in this image so that I can outline and eventually crop the equations separately.
I am asking this question because:
First of all, I have not found a place where YOLO is used for text data.
Secondly, how can we customise the input for low resolution, unlike the usual (416, 416), since all the images are either cropped or horizontal, mostly in a W = 2H format?
I have implemented YOLOv3 inference for text data, but using OpenCV, which here only runs on the CPU. I want to train the model from scratch.
Please help. Any of Keras, TensorFlow or PyTorch would do.
Here is the code I used for the OpenCV implementation.
import cv2
import numpy as np

# img is assumed to be the input image (BGR numpy array), with width and height its dimensions,
# e.g. img = cv2.imread(image_path); height, width = img.shape[:2]

net = cv2.dnn.readNet(PATH+"yolov3.weights", PATH+"yolov3.cfg") # build the model. NOTE: this will only use the CPU
layer_names = net.getLayerNames() # all the layer names in the network (254 layers)
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()] # the 3 output layers in total

blob = cv2.dnn.blobFromImage(image=img, scalefactor=0.00392, size=(416, 416), mean=(0, 0, 0), swapRB=True)
# output is a numpy array of shape (1, 3, 416, 416). If you need to change the size, change it in the config file too.
# swapRB swaps BGR to RGB, scalefactor rescales pixel values by 1/255, and a mean of 0 is subtracted from all channels.
net.setInput(blob)
outs = net.forward(output_layers) # list of 3 elements, one per output layer

class_ids = []    # class id of each kept detection
confidences = []  # confidence score of each kept detection; near 0 means no object is present
boxes = []        # all the kept bounding boxes
for out in outs:                      # go through the output layers one by one
    for detection in out:             # go through the detections one by one
        scores = detection[5:]        # per-class probabilities (80 values for COCO)
        class_id = np.argmax(scores)  # which class is dominating inside the list
        confidence = scores[class_id]
        if confidence > 0.1:          # consider only boxes whose class confidence is above 0.1
            # box center and size, scaled back to image coordinates
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            # rectangle top-left corner
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            boxes.append([x, y, w, h])            # collect all the bounding boxes
            confidences.append(float(confidence)) # collect all the confidence scores
            class_ids.append(class_id)            # collect all the class ids
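For completeness, the usual next step after collecting boxes, confidences and class ids is non-maximum suppression and drawing the kept boxes. A minimal, hedged sketch of that follow-up (the 0.1 and 0.4 thresholds are arbitrary choices for illustration):

# suppress heavily overlapping boxes, keeping the highest-confidence one
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.1, 0.4)  # score threshold, NMS IoU threshold
for i in np.array(indices).flatten():
    x, y, w, h = boxes[i]
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(img, str(class_ids[i]), (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)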
Being an object detector, YOLO can be used for specific text detection only, not for detecting any text that might be present in the image.
For example, YOLO can be trained to do text-based logo detection.
I want to find the 2 of the Gray regions of interest in this image so
that I can outline and eventually, crop the equations separately.
Your problem statement talks about detecting any equation (math formula) that's present in the image, so it can't be done using YOLO alone. I think Mathpix is similar to your use case. They are most likely using an OCR (Optical Character Recognition) system trained and fine-tuned towards their use case.
Eventually, to do something like Mathpix, an OCR system customised for your use case is what you need. There won't be any ready-made solution out there for this. You'll have to build one.
Proposed Methods:
Mathematical Formula Detection in Heterogeneous Document Images
A Simple Equation Region Detector for Printed Document Images in Tesseract
Note: Tesseract as-is can't be used, because it is a pre-trained model trained to read any character. You can refer to the 2nd paper to train Tesseract towards your use case.
To get some idea about OCR, you can read about it here.
EDIT:
So the idea is to build your own OCR that detects regions constituting an equation/math formula, rather than detecting every character. You need to have a dataset where the equations are marked. Basically, you look for regions with math symbols (say summation, integration, etc.).
Some Tutorials to train your own OCR:
Tesseract training guide
Creating OCR pipeline using CV and DL
Build OCR pipeline
Build Your OCR
Attention OCR
So the idea is that you follow these tutorials to learn how to train and build an OCR for any use case, and then use the research papers I mentioned above, together with the basic ideas I gave above, to build an OCR towards your use case.

How to convert an RGB image to a single channel image (but not grayscale)

I have a 3-channel image, in which each channel has some information encoded in the form of colors.
I want to convert it to a single-channel image with all the information retained. When I convert it into 1 channel (using grayscale) I lose all that color information and get a tensor with zero values, and visualizing this image shows a totally black image.
So, is there any way to change the 3-channel image to a 1-channel image, but not grayscale?
You probably have to keep the 3 channels. 1-channel images do not have colors, since you need an additional dimension to represent them.
Why would you want to drop the channels and keep the color information at the same time?
In typical image processing with deep learning, the tensors have dimensions such as [Batch x Channel x Height x Width] (more frequent in PyTorch) or [Batch x Height x Width x Channel] (more frequent in TensorFlow).
What is the real problem with 3 channels?
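If the actual issue is only where the channel dimension sits rather than the colors themselves, here is a minimal sketch of converting between the two layouts mentioned above (plain NumPy, with made-up array names and sizes):

import numpy as np

batch_nhwc = np.zeros((8, 224, 224, 3))        # [Batch x Height x Width x Channel], TensorFlow-style
batch_nchw = batch_nhwc.transpose(0, 3, 1, 2)  # [Batch x Channel x Height x Width], PyTorch-style
assert batch_nchw.shape == (8, 3, 224, 224)    # same data, only the axis order changes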

How can I achieve better than 80% on the test set

My goal is to detect digits from 0 to 9 on a random background. I wrote a dataset generator with the following features:
Grayscale data
Random digit rotation
Random digit blur
43 different fonts
Random noisy blurred background
Here are 1024 samples of my dataset:
(image: 1024 test-set samples)
I adapted the MNIST expert model to train on the dataset and get almost 100% on the training and validation sets.
On the test set I get approximately 80% correct.
Here is a sample. The green digit is the digit predicted:
(image: a 9 predicted as 5)
It seems that my model has some trouble distinguishing between
1 and 7
8 and 3
9 and 6
5 and 9
I need to detect the digit on any background because the test images are not always binary images.
Now my questions:
For the testset generator:
How useful is applying digit rotation? When I rotate a 7, I get a 1 for some fonts; when I rotate a 9, I get a 6 (rotation > 90°).
Does the convolution filter already handle image rotation?
Are 180,000 image samples enough to train the model?
For the model:
Should I increase the image size from 28x28 to 56x56 when I apply a blur filter onto the dataset?
What filter size should I use?
Do I have to increase the number of hidden layers?
Thanks a lot for any guide.
If you are stuck with the varying image backgrounds, I suggest you try image thresholding, which will separate the foreground from the background and give all your images the same background, assuming your images are of good quality.
Try this (scikit-image library):
import numpy as np
from skimage import filters as flt

# binarize: pixels above Li's minimum cross-entropy threshold become foreground (True)
filtered_image = np.array(original_image > flt.threshold_li(original_image))
Then you can use the filtered images for both training and prediction.
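For instance, a hedged sketch of applying that filter to a whole set of images before feeding them to the model (train_images is a hypothetical array of grayscale images):

import numpy as np
from skimage import filters as flt

train_images = np.random.rand(100, 28, 28)  # stand-in for your real grayscale images
binarized = np.array([img > flt.threshold_li(img) for img in train_images]).astype(np.float32)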
I ended up extracting the dataset patches from existing images instead of using a random background with random digits. This gives us less variance and much better accuracy on the test set.
Here is a working but not very performant implementation, which allows us to define the patch shape and stride size:
def patchify(self, arr, shape, stride):
    patches = []
    arr_shape = arr.shape
    (shape_h, shape_w) = shape
    (stride_h, stride_w) = stride
    num_patches = np.floor(np.array(arr_shape) / np.array(stride))
    (num_patches_row, num_patches_col) = (int(num_patches[0]), int(num_patches[1]))
    for row in range(num_patches_row):
        row_from = row * stride_h
        row_to = row_from + shape_h
        for col in range(num_patches_col):
            col_from = col * stride_w
            col_to = col_from + shape_w
            origin_information = (row_from, row_to, col_from, col_to)
            roi = arr[row_from:row_to, col_from:col_to]
            patches.append((roi, origin_information))
    return patches
Or we can also use scikit-learn, where img is a numpy array:
from sklearn.feature_extraction import image
patches = image.extract_patches_2d(img, (patch_height, patch_width))
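A short usage sketch of the patchify helper above (the array size and patch settings are made up for illustration; None is passed for the unused self parameter since it is called as a plain function here):

import numpy as np

img = np.random.rand(280, 280)  # stand-in for a real grayscale image
patches = patchify(None, img, shape=(28, 28), stride=(28, 28))
print(len(patches))             # 10 * 10 = 100 patches
roi, (row_from, row_to, col_from, col_to) = patches[0]
print(roi.shape)                # (28, 28)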

imshow with non-uniform matrix bin sizes

I am trying to create an image with imshow, but the bins in my matrix are not equal.
For example the following matrix
C = [[1,2,2],[2,3,2],[3,2,3]]
is for X = [1,4,8] and for Y = [2,4,9]
I know I can just set xticks and yticks, but I want the axes to be equal. This means that I will need the squares which build the imshow image to be of different sizes.
Is it possible?
This seems like a job for pcolormesh.
From When to use imshow over pcolormesh:
Fundamentally, imshow assumes that all data elements in your array are
to be rendered at the same size, whereas pcolormesh/pcolor associates
elements of the data array with rectangular elements whose size may
vary over the rectangular grid.
pcolormesh plots a matrix as cells, and takes as arguments the x and y coordinates of the cell edges, which allows you to draw each cell with a different size.
I assume the X and Y of your example data are meant to be the sizes of the cells. So I converted them into edge coordinates with:
import numpy as np
import matplotlib.pyplot as plt

xSize = [1, 4, 8]
ySize = [2, 4, 9]
x = np.append(0, np.cumsum(xSize)) # gives [ 0  1  5 13]
y = np.append(0, np.cumsum(ySize)) # gives [ 0  2  6 15]
Then, if you want behavior similar to imshow, you need to invert the y axis.
c = np.array([[1, 2, 2], [2, 3, 2], [3, 2, 3]])
plt.pcolormesh(x, -y, c)
plt.show()
Which gives us a plot where each cell has its own size.

The math to render a cube?

My friend and I are making a 3d rendering engine from scratch in our VB class at school, but I am not sure how the math to form the cube would work. Given six variables:
rotX
rotY
rotZ
lenX
lenY
lenZ
Which represent the rotation about x, y, z and the length along x, y, z respectively. What would be the formulas to make the cube? I know that all I have to do is calculate three segments, and from those segments just create three parallelograms, so I just need the math to find what the three segments are.
Thanks!
there are 2 basic 3D object representations, and for both of them your data is insufficient.
surface representation
objects are a set of surface polygons/vertices/...
for a cube it is a set of 8 points + the triangles/quads for the 6 faces
analytical representation
objects are a set of equations describing the object
for a cube it is an intersection of 6 planes
I think you are using option 1 so what you need is:
- position
- orientation
- size
usually an axis aligned cube looks like this:
const double a=1.0; //cube size;
double pnt[8][3]= //cube points
{
+a,-a,+a,
+a,+a,+a,
-a,+a,+a,
-a,-a,+a,
+a,-a,-a,
+a,+a,-a,
-a,+a,-a,
-a,-a,-a
};
int tab[24]=
{
0,1,2,3, // 1st.quad
7,6,5,4, // 2nd.quad
4,5,1,0, // 3rd.quad ...
5,6,2,1,
6,7,3,2,
7,4,0,3
};
well, for size and orientation you can apply a transformation matrix
or directly recompute the points from direction vectors
so you need to remember position (a point), orientation (3 vectors) and size (a scalar)
all of the above can be stored in a single 4x4 transformation matrix
but if you want the vectors, then the points will be like this:
P(+a,-a,+a) -> +a*I -a*J +a*K
where I,J,K are the orientation vectors
a is the cube size
P(+a,-a,+a) is the original axis-aligned point in the table above
Option 2 is trickier to implement, and unless you really need it (ray-tracing renderers), forget about it.
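To tie this back to the question's variables, here is a minimal numpy sketch (not the answer author's code) that builds the orientation vectors I, J, K from rotX, rotY, rotZ, scales them by lenX, lenY, lenZ, and transforms the 8 axis-aligned points from the table above. The Rz*Ry*Rx rotation order and treating lenX/lenY/lenZ as full edge lengths are assumptions; use whatever convention your engine needs.

import numpy as np

def rotation_matrix(rotX, rotY, rotZ):
    # combined rotation R = Rz @ Ry @ Rx (the order is a convention choice)
    cx, sx = np.cos(rotX), np.sin(rotX)
    cy, sy = np.cos(rotY), np.sin(rotY)
    cz, sz = np.cos(rotZ), np.sin(rotZ)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def cube_points(rotX, rotY, rotZ, lenX, lenY, lenZ, center=(0.0, 0.0, 0.0)):
    R = rotation_matrix(rotX, rotY, rotZ)
    # columns of R are the orientation vectors I, J, K; scale by half the edge lengths
    I = R[:, 0] * lenX / 2.0
    J = R[:, 1] * lenY / 2.0
    K = R[:, 2] * lenZ / 2.0
    # same sign pattern as the pnt[8][3] table above: P(sx,sy,sz) -> sx*I + sy*J + sz*K
    signs = [(+1, -1, +1), (+1, +1, +1), (-1, +1, +1), (-1, -1, +1),
             (+1, -1, -1), (+1, +1, -1), (-1, +1, -1), (-1, -1, -1)]
    return np.array([np.array(center) + sx * I + sy * J + sz * K for sx, sy, sz in signs])

pts = cube_points(rotX=0.3, rotY=0.5, rotZ=0.0, lenX=2.0, lenY=1.0, lenZ=3.0)
print(pts.shape)  # (8, 3); index with the tab[24] quad table above to draw the 6 faces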