Can YOLO pictures have a bounded box that covering the whole picture? - tensorflow

I wonder why YOLO pictures need to have a bounding box.
Assume that we are using Darknet. Each image need to have a corresponding .txt file with the same name as the image file. Inside the .txt file it need to be. It's the same for all YOLO frameworks that are using bounded boxes for labeling.
<object-class> <x> <y> <width> <height>
Where x, y, width, and height are relative to the image's width and height.
For exampel. If we goto this page and press YOLO Darknet TXT button and download the .zip file and then go to train folder. Then we can see a these files
IMG_0074_jpg.rf.64efe06bcd723dc66b0d071bfb47948a.jpg
IMG_0074_jpg.rf.64efe06bcd723dc66b0d071bfb47948a.txt
Where the .txt file looks like this
0 0.7055288461538461 0.6538461538461539 0.11658653846153846 0.4110576923076923
1 0.5913461538461539 0.3545673076923077 0.17307692307692307 0.6538461538461539
Every image has the size 416x416. This image looks like this:
My idéa is that every image should have one class. Only one class. And the image should taked with a camera like this.
This camera snap should been taked as:
Take camera snap
Cut the camera snap into desired size
Upscale it to square 416x416
Like this:
And then every .txt file that correspons for every image should look like this:
<object-class> 0 0 1 1
Question
Is this possible for e.g Darknet or other framework that are using bounded boxes to labeling the classes?
Instead of let the software e.g Darknet upscale the bounded boxes to 416x416 for every class object, then I should do it and change the .txt file to x = 0, y = 0, width = 1, height = 1 for every image that only having one class object.
Is that possible for me to create a traing set in that way and train with it?

Little disclaimer I have to say that I am not an expert on this, I am part of a project and we are using darknet so I had some time experimenting.
So if I understand it right you want to train with cropped single class images with full image sized bounding boxes.
It is possible to do it and I am using something like that but it is most likely not what you want.
Let me tell you about the problems and unexpected behaviour this method creates.
When you train with images that has full image size bounding boxes yolo can not make proper detection because while training it also learns the backgrounds and empty spaces of your dataset. More specifically objects on your training dataset has to be in the same context as your real life usage. If you train it with dog images on the jungle it won't do a good job of predicting dogs in house.
If you are only going to use it with classification you can still train it like this it still classifies fine but images that you are going to predict also should be like your training dataset, so by looking at your example if you train images like this cropped dog picture your model won't be able to classify the dog on the first image.
For a better example, in my case detection wasn't required. I am working with food images and I only predict the meal on the plate, so I trained with full image sized bboxes since every food has one class. It perfectly classifies the food but the bboxes are always predicted as full image.
So my understanding for the theory part of this, if you feed the network with only full image bboxes it learns that making the box as big as possible is results in less error rate so it optimizes that way, this is kind of wasting half of the algorithm but it works for me.
Also your images don't need to be 416x416 it resizes to that whatever size you give it, you can also change it from cfg file.
I have a code that makes full sized bboxes for all images in a directory if you want to try it fast.(It overrides existing annotations so be careful)
Finally boxes should be like this for them to be centered full size, x and y are center of the bbox it should be center/half of the image.
<object-class> 0.5 0.5 1 1
from imagepreprocessing.darknet_functions import create_training_data_yolo, auto_annotation_by_random_points
import os
main_dir = "datasets/my_dataset"
# auto annotating all images by their center points (x,y,w,h)
folders = sorted(os.listdir(main_dir))
for index, folder in enumerate(folders):
auto_annotation_by_random_points(os.path.join(main_dir, folder), index, annotation_points=((0.5,0.5), (0.5,0.5), (1.0,1.0), (1.0,1.0)))
# creating required files
create_training_data_yolo(main_dir)
```

Related

Yolo Training: multiple objects in one image

I have a set of training images that contain many small objects (10-20). The image resolution is high (9000x6000).
Is it better to split the image into the specific objects before running yolo training? Or just leave it as is.
Does yolo resize an entire image, or does it ‘extract’ the annotated object first before resizing?
If it is the former, I am concerned that the resolution will be bad. Imagine 20 objects in a 416x416 image.
Does yolo resize an entire image, or does it ‘extract’ the annotated
object first before resizing?
Yes, an entire image will be resized in case of Yolo and it does not extract annotated object before resizing.
Since your input images have very high resolution, what you can do is:
Yolo can handle object sizes of 25 x 25 effectively with network input layer size 608 x 608. So if your object sizes in original input image are greater than 250 x 250 you can train the images as they are (with 608 x 608 network size). In that case even when images are resized to network size, objects will be of size greater than 25x25. This should give you good accuracy.
(6000/600) * 25 = 250
If object sizes in original images are smaller than 200 x 200, split your input image into 8 smaller units/blocks, say blocks/tiles of 2250 x 1500. Train these blocks as individual images. Each bigger image (9000 x 6000) corresponds to 8 training images. Each image might contain zero to many objects. You can operate in sliding window method.
The method you choose for training should be used for inference as well.
For training on objects of all sizes use following models: [Use this if you use original image as it is used for training]
Yolov4-custom
Yolov3-SPP
Yolov3_5l
If all of the objects that you want to detect are of smaller size, then for effective detection use Yolov4 with following changes: [Use this if you split original image into 8 blocks]
Set layers = 23 instead of layers = 54
Set stride=4 instead of stride=2
Set stride=4 instead of stride=2
References:
Refer this relevant GitHub thread
darknet documentation

Is it possible to train YOLO (any version) for a single class where the image has text data. (find region of equations)

I am wondering if YOLO (any version, specially the one with accuracy, not speed) can be trained on the text data. What I am trying to do is to find the Region in the text image where any equation is present.
For example, I want to find the 2 of the Gray regions of interest in this image so that I can outline and eventually, crop the equations separately.
I am asking this questions because :
First of all I have not found a place where the YOLO is used for text data.
Secondly, how can we customise for low resolution unlike the (416,416) as all the images are either cropped or horizontal mostly in (W=2H) format.
I have implemented the YOLO-V3 version for text data but using OpenCv which is basically for CPU. I want to train the model from scratch.
Please help. Any of the Keras, Tensorflow or PyTorch would do.
Here is the code I used for implementing in OpenCv.
net = cv2.dnn.readNet(PATH+"yolov3.weights", PATH+"yolov3.cfg") # build the model. NOTE: This will only use CPU
layer_names = net.getLayerNames() # get all the layer names from the network 254 layers in the network
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()] # output layer is the
# 3 output layers in otal
blob = cv2.dnn.blobFromImage(image=img, scalefactor=0.00392, size=(416,416), mean=(0, 0, 0), swapRB=True,)
# output as numpy array of (1,3,416,416). If you need to change the shape, change it in the config file too
# swap BGR to RGB, scale it to a threshold, resize, subtract it from the mean of 0 for all the RGB values
net.setInput(blob)
outs = net.forward(output_layers) # list of 3 elements for each channel
class_ids = [] # id of classes
confidences = [] # to store all the confidence score of objects present in bounding boxes if 0, no object is present
boxes = [] # to store all the boxes
for out in outs: # get all channels one by one
for detection in out: # get detection one by one
scores = detection[5:] # prob of 80 elements if the object(s) is/are inside the box and if yes, with what prob
class_id = np.argmax(scores) # Which class is dominating inside the list
confidence = scores[class_id]
if confidence > 0.1: # consider only those boxes which have a prob of having an object > 0.55
# grid coordinates
center_x = int(detection[0] * width) # centre X of grid
center_y = int(detection[1] * height) # Center Y of grid
w = int(detection[2] * width) # width
h = int(detection[3] * height) # height
# Rectangle coordinates
x = int(center_x - w / 2)
y = int(center_y - h / 2)
boxes.append([x, y, w, h]) # get all the bounding boxes
confidences.append(float(confidence)) # get all the confidence score
class_ids.append(class_id) # get all the clas ids
Being an object detector Yolo can be used for specific text detection only, not for detecting any text that might be present in the image.
For example Yolo can be trained to do text based logo detection like this:
I want to find the 2 of the Gray regions of interest in this image so
that I can outline and eventually, crop the equations separately.
Your problem statement talks about detecting any equation (math formula) that's present in the image so it can't be done using Yolo alone. I think mathpix is similar to your use-case. They will be using OCR (Optical Character Recognition) system trained and fine tuned towards their use-case.
Eventually to do something like mathpix, OCR system customised for your use case is what you need. There won't be any ready ready made solution out there for this. You'll have to build one.
Proposed Methods:
Mathematical Formula Detection in Heterogeneous Document Images
A Simple Equation Region Detector for Printed Document Images in Tesseract
Note: Tesseract as it is can't be used because it is a pre-trained model trained for reading any character. You can refer 2nd paper to train tesseract towards fitting your use case.
To get some idea about OCR, you can read about it here.
EDIT:
So idea is to build your own OCR to detect something that constitutes equation/math formula rather than detecting every character. You need to have data set where equations are marked. Basically you look for region with math symbols(say summation, integration etc.).
Some Tutorials to train your own OCR:
Tesseract training guide
Creating OCR pipeline using CV and DL
Build OCR pipeline
Build Your OCR
Attention OCR
So idea is that you follow these tutorials to get to know how to train
and build your OCR for any use case and then you read research papers
I mentioned above and also some of the basic ideas I gave above to
build OCR towards your use case.

Simple Captcha Solving

I'm trying to solve some simple captcha using OpenCV and pytesseract. Some of captcha samples are:
I tried to the remove the noisy dots with some filters:
import cv2
import numpy as np
import pytesseract
img = cv2.imread(image_path)
_, img = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
img = cv2.morphologyEx(img, cv2.MORPH_OPEN, np.ones((4, 4), np.uint8), iterations=1)
img = cv2.medianBlur(img, 3)
img = cv2.medianBlur(img, 3)
img = cv2.medianBlur(img, 3)
img = cv2.medianBlur(img, 3)
img = cv2.GaussianBlur(img, (5, 5), 0)
cv2.imwrite('res.png', img)
print(pytesseract.image_to_string('res.png'))
Resulting tranformed images are:
Unfortunately pytesseract just recognizes first captcha correctly. Any other better transformation?
Final Update:
As #Neil suggested, I tried to remove noise by detecting connected pixels. To find connected pixels, I found a function named connectedComponentsWithStats, whichs detect connected pixels and assigns group (component) a label. By finding connected components and removing the ones with small number of pixels, I managed to get better overall detection accuracy with pytesseract.
And here are the new resulting images:
I've taken a much more direct approach to filtering ink splotches from pdf documents. I won't share the whole thing it's a lot of code, but here is the general strategy I adopted:
Use Python Pillow library to get an image object where you can manipulate pixels directly.
Binarize the image.
Find all connected pixels and how many pixels are in each group of connected pixels. You can do this using the minesweeper algorithm. Which is easy to search for.
Set some threshold value of pixels that all legitimate letters are expected to have. This will be dependent on your image resolution.
replace all black pixels in groups below the threshold with white pixels.
Convert back to image.
Your final output image is too blurry. To enhance the performance of pytesseract you need to sharpen it.
Sharpening is not as easy as blurring, but there exist a few code snippets / tutorials (e.g. http://datahacker.rs/004-how-to-smooth-and-sharpen-an-image-in-opencv/).
Rather than chaining blurs, blur once either using Gaussian or Median Blur, experiment with parameters to get the blur amount you need, perhaps try one method after the other but there is no reason to chain blurs of the same method.
There is an OCR example in python that detect the characters. Save several images and apply the filter and train a SVM algorithm. that may help you. I did trained a algorithm with even few Images but the results were acceptable. Check this link.
Wish you luck
I know the post is a bit old but I suggest you to try this library I've developed some time ago. If you have a set of labelled captchas that service would fit you. Take a look: https://github.com/punkerpunker/captcha_solver
In README there is a section "Train model on external data" that you might be interested in.

Tensorflow Object Detection Unusually large bounding boxes and wrong results

I am building an object detector in TensorFlow to detect, motorbike riders with and without helmet, I have 1000 Images each for riders with helmet, withouthelmet and pedestrians(pu together -- 3000 IMAGES), My last checkpoint was 35267 steps, I have tested using a traffic video, but I see unusally large bounding boxes with wrong results. Can someone please explain the reason for such detections? Do I need to wait for atleast 50000 steps?? or Do I need to add datasets(Images in the angle to Traffic Cameras)?
Model - SSD Mobilenet COCO - Custom Object Detection,
Training Platform - Google Colab
Please find the Images attachedVideo Snapshot 1
Video Snapshot 2
Day 2 - 10/30/2018
I have tested with Images today, I have got different results, seems to be correct,2nd Day if I test with single object in a Image. Please find the results
Single Object IMage Test 1
Single Object Image Test 2
Tested CHeckpoint - 52,000 Steps
But, If I test with the Images with multiple objects in a road, the detection is wrong and bounding boxes are weirdly bigger, Is it because of the dataset, as I am training with One Motorbike rider(with or with out helmet) per image.
Please find the wrong results
Multi Object Image Test
Multi Object Image Test
I had also tested with images like all Motorbikes in the scene, In this case, I did not get any results, Please find the Images
No Result Image
No Result Image
The results are very confusing, Is there anything I am missing?,
There is no need to wait till 50000 epocs you should get decent result in 35k or even in 10k. I would suggest
go through you data-set again and check all the bounding boxes (data cleaning)
Check your model with inference code for changes like batch normalization etc
Add some more data with different features, angles and color complexities
I would check these points before going further.

Darknet YOLO image size

I am trying to train custom object classifier in Darknet YOLO v2
https://pjreddie.com/darknet/yolo/
I gathered a dataset for images most of them are 6000 x 4000 px and some lower resolutions as well.
Do I need to resize the images before training to be squared ?
I found that the config uses:
[net]
batch=64
subdivisions=8
height=416
width=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
thats why I was wondering how to use it for different sizes of data sets.
You don't have to resize it, because Darknet will do it instead of you!
It means you really don't need to do that and you can use different image sizes during your training. What you posted above is just network configuration. There should be full network definition as well. And the height and the width tell you what's the network resolution. And it also keeps aspect ratio, check e.g this.
You don't need to resize your database images. PJReddie's YOLO architecture does it by itself keeping the aspect ratio safe (no information will miss) according to the resolution in .cfg file.
For Example, if you have image size 1248 x 936, YOLO will resize it to 416 x 312 and then pad the extra space with black bars to fit into 416 x 416 network.
It is very common to resize images before training. 416x416 is slightly larger than common. Most imagenet models resize and square the images to 256x256 for example. So I would expect the same here. Trying to train on 6000x4000 is going to require a farm of GPUs. The standard process is to square the image to the largest dimension (height, or width), padding with 0's on the shorter side, then resizing using standard image resizing tools like PIL.
You do not need to resize the images, you can directly change the values in darknet.cfg file.
When you open darknet.cfg (yolo-darknet.cfg) file, you can all
hyper-parameters and their values.
As showed in your cfg file images dimensions are (416,416)->(weight,height), you can change the values, so that darknet will automatically resize the images before training.
Since the images have high dimensions, you can adjust batch and sub-division values (lower the values 32,16,8 . it has to be multiples of 2), so that darknet will not crash (memory allocation error)
By default the darknet api changes the size of the images in both inference and training, but in theory any input size w, h = 32 x X where X belongs to a natural number should, W is the width, H the height. By default X = 13, so the input size is w, h = (416, 416). I use this rule with yolov3 in opencv, and it works better the bigger X is.