Yolo Training: multiple objects in one image

I have a set of training images that contain many small objects (10-20). The image resolution is high (9000x6000).
Is it better to split the image into the specific objects before running yolo training? Or just leave it as is.
Does yolo resize an entire image, or does it ‘extract’ the annotated object first before resizing?
If it is the former, I am concerned that the resolution will be bad. Imagine 20 objects in a 416x416 image.

Does yolo resize an entire image, or does it ‘extract’ the annotated object first before resizing?
Yes, the entire image is resized in the case of Yolo; it does not extract the annotated objects before resizing.
Since your input images have very high resolution, what you can do is:
Yolo can handle object sizes of about 25 x 25 effectively with a network input size of 608 x 608. So if the objects in your original images are larger than roughly 250 x 250, you can train on the images as they are (with a 608 x 608 network size). In that case, even after the images are resized to the network size, the objects will still be larger than 25 x 25, which should give you good accuracy.
(6000 / 608) * 25 ≈ 247, i.e. roughly 250
If the objects in your original images are smaller than about 200 x 200, split each input image into 8 smaller units/blocks, say blocks/tiles of 2250 x 3000 (a 4 x 2 grid), and train on these blocks as individual images. Each bigger image (9000 x 6000) then corresponds to 8 training images, and each block might contain zero to many objects. At inference time you can operate in a sliding-window manner over the same tiles; a tiling sketch follows below.
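For the splitting approach, a rough sketch of the tiling step might look like the following. This is my own illustration using Pillow, not code from the answer; the file name is a placeholder, and the annotation files would need to be split and shifted into each tile's coordinate frame in the same way.
```
from PIL import Image

def split_into_tiles(path, cols=4, rows=2):
    """Split one large image into cols * rows equally sized tiles."""
    img = Image.open(path)                      # e.g. 9000 x 6000
    tile_w, tile_h = img.width // cols, img.height // rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            tiles.append(img.crop(box))         # each tile is 2250 x 3000 here
    return tiles

# placeholder file name; writes big_image_tile_0.jpg ... big_image_tile_7.jpg
for i, tile in enumerate(split_into_tiles("big_image.jpg")):
    tile.save(f"big_image_tile_{i}.jpg")
```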
The method you choose for training should be used for inference as well.
For training on objects of all sizes, use one of the following models [use this if you train on the original images as they are]:
Yolov4-custom
Yolov3-SPP
Yolov3_5l
If all of the objects that you want to detect are small, then for effective detection use Yolov4 with the following changes [use this if you split the original image into 8 blocks]:
Set layers = 23 instead of layers = 54
Set stride=4 instead of stride=2
Set stride=4 instead of stride=2 (the stride change is applied at two separate places in the cfg, which is why it is listed twice)
References:
Refer to the relevant GitHub thread
The darknet documentation

Related

Can YOLO pictures have a bounding box that covers the whole picture?

I wonder why YOLO pictures need to have a bounding box.
Assume that we are using Darknet. Each image needs to have a corresponding .txt file with the same name as the image file, and inside the .txt file each object needs to be written as shown below. It's the same for all YOLO frameworks that use bounding boxes for labeling.
<object-class> <x> <y> <width> <height>
Where x, y, width, and height are relative to the image's width and height.
For example, if we go to this page, press the YOLO Darknet TXT button, download the .zip file, and then go to the train folder, we can see these files:
IMG_0074_jpg.rf.64efe06bcd723dc66b0d071bfb47948a.jpg
IMG_0074_jpg.rf.64efe06bcd723dc66b0d071bfb47948a.txt
Where the .txt file looks like this
0 0.7055288461538461 0.6538461538461539 0.11658653846153846 0.4110576923076923
1 0.5913461538461539 0.3545673076923077 0.17307692307692307 0.6538461538461539
Every image has the size 416x416.
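To make the relative coordinates concrete, here is a small sketch (my own illustration, not part of the original post) that converts the first annotation line above back into pixel coordinates on a 416x416 image:
```
# first line of the .txt file above: class 0, then relative x, y, width, height
cls, x, y, w, h = 0, 0.7055288461538461, 0.6538461538461539, 0.11658653846153846, 0.4110576923076923
img_w, img_h = 416, 416

box_w = w * img_w              # box width in pixels (~48.5)
box_h = h * img_h              # box height in pixels (~171.0)
x_min = (x - w / 2) * img_w    # left edge: x is the box center, not the corner
y_min = (y - h / 2) * img_h    # top edge
print(cls, x_min, y_min, box_w, box_h)
```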
My idea is that every image should have one class. Only one class. And the image should be taken with a camera.
Each camera snap should then be processed as follows:
Take the camera snap
Cut the camera snap to the desired size
Upscale it to a 416x416 square
And then the .txt file that corresponds to each image should look like this:
<object-class> 0 0 1 1
Question
Is this possible for e.g. Darknet or other frameworks that use bounding boxes for labeling the classes?
Instead of letting the software, e.g. Darknet, scale the bounding boxes to 416x416 for every object, I would do it myself and set the .txt file to x = 0, y = 0, width = 1, height = 1 for every image that contains only one class object.
Is it possible for me to create a training set in that way and train with it?
A little disclaimer: I have to say that I am not an expert on this. I am part of a project that uses darknet, so I have had some time to experiment.
So if I understand it right, you want to train with cropped, single-class images that have full-image-sized bounding boxes.
It is possible to do, and I am using something like that myself, but it is most likely not what you want.
Let me tell you about the problems and unexpected behaviour this method creates.
When you train with images that have full-image-sized bounding boxes, yolo cannot make proper detections, because during training it also learns the backgrounds and empty spaces of your dataset. More specifically, the objects in your training dataset have to appear in the same context as in your real-life usage. If you train it with pictures of dogs in the jungle, it won't do a good job of predicting dogs in a house.
If you are only going to use it for classification, you can still train it like this and it will classify fine, but the images you run predictions on should also look like your training dataset. So, looking at your example: if you train on images like the cropped dog picture, your model won't be able to classify the dog in the first image.
For a better example, in my case detection wasn't required. I am working with food images and I only predict the meal on the plate, so I trained with full-image-sized bboxes, since every food image has one class. It classifies the food perfectly, but the bboxes are always predicted as the full image.
So my understanding of the theory here is that if you feed the network only full-image bboxes, it learns that making the box as big as possible results in a lower error rate, so it optimizes that way. This is kind of wasting half of the algorithm, but it works for me.
Also, your images don't need to be 416x416; whatever size you give it gets resized to that, and you can also change it in the cfg file.
I have some code that makes full-sized bboxes for all images in a directory, if you want to try it quickly. (It overrides existing annotations, so be careful.)
Finally, for the boxes to be centered and full-size they should look like the line below; x and y are the center of the bbox, so they should be the center, i.e. half of the image.
<object-class> 0.5 0.5 1 1
```
from imagepreprocessing.darknet_functions import create_training_data_yolo, auto_annotation_by_random_points
import os

main_dir = "datasets/my_dataset"

# auto annotating all images by their center points (x,y,w,h)
folders = sorted(os.listdir(main_dir))
for index, folder in enumerate(folders):
    auto_annotation_by_random_points(os.path.join(main_dir, folder), index, annotation_points=((0.5,0.5), (0.5,0.5), (1.0,1.0), (1.0,1.0)))

# creating required files
create_training_data_yolo(main_dir)
```
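If you would rather not depend on the imagepreprocessing package, a stdlib-only sketch that writes the same centered, full-image annotations could look like this. This is my own version, assuming the same layout of one subfolder per class under main_dir; like the code above, it overwrites existing .txt files.
```
import os

main_dir = "datasets/my_dataset"   # one subfolder per class, as above

for class_index, folder in enumerate(sorted(os.listdir(main_dir))):
    folder_path = os.path.join(main_dir, folder)
    for name in os.listdir(folder_path):
        if name.lower().endswith((".jpg", ".jpeg", ".png")):
            txt_path = os.path.splitext(os.path.join(folder_path, name))[0] + ".txt"
            with open(txt_path, "w") as f:
                # centered, full-image box: <class> <x_center> <y_center> <width> <height>
                f.write(f"{class_index} 0.5 0.5 1 1\n")
```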

min_scale and max_scale in the model config of Tensorflow Object Detection API

The Tensorflow Object Detection API has model config files for training; these config files have min_scale and max_scale parameters for object detection, which are set to 0.2 and 0.95 respectively by default.
I have some questions about these parameters:
Are these params for detecting the size of objects?
If we set the input size of the network to 300x300 and min_scale=0.2, is the network then unable to detect objects smaller than 300 x 0.2 = 60 pixels?
As far as I know, ssd_mobilenet_v2_coco has problems detecting small objects. If we set min_scale = 0.05 and train the network on small objects with the same model, is it possible to detect small objects of size 300 x 0.05 = 15 pixels?
Are these params for detecting the size of objects?
Well, yes and no. Those parameters are inside the ssd_anchor_generator definition, which is itself an anchor_generator. That part of the system takes care of providing anchor boxes for the subsequent box prediction.
If we set the input size of the network to 300x300 and min_scale=0.2, is the network then unable to detect objects smaller than 300 x 0.2 = 60 pixels?
No. The size of a detectable object is not related only to min_scale (which only affects anchor generation); it is also affected by, for example, the data the network was trained on, the network depth, and so on.
As far as I know, ssd_mobilenet_v2_coco has problems detecting small objects. If we set min_scale = 0.05 and train the network on small objects with the same model, is it possible to detect small objects of size 300 x 0.05 = 15 pixels?
Maybe? That depends entirely on your data. Modifying the min_scale parameter might help (and indeed it might make sense to select another range for those parameters), but experimentation with your data is necessary.
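To see what those two parameters roughly control, here is a small sketch of the linear scale schedule from the SSD paper, which, to my understanding, is approximately what the ssd_anchor_generator uses to derive per-layer anchor sizes; the exact API internals may differ.
```
min_scale, max_scale, num_layers, input_size = 0.2, 0.95, 6, 300

# linearly spaced scales between min_scale and max_scale, one per feature map
scales = [round(min_scale + (max_scale - min_scale) * i / (num_layers - 1), 2)
          for i in range(num_layers)]
anchor_sizes_px = [round(s * input_size) for s in scales]

print(scales)           # [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]
print(anchor_sizes_px)  # [60, 105, 150, 195, 240, 285]
```
This is why min_scale=0.2 on a 300x300 input corresponds to anchors of roughly 60 pixels, but as the answer notes, that is not a hard lower bound on what the network can detect.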

Are images with large dimensions (e.g. 2000 x 2000) auto-scaled to 300 x 300 when used as training data in AWS Sagemaker?

I'm working on a project that trains an ML model to predict the location of Waldo in a Where's Wally? image using AWS Sagemaker, with the underlying object detection algorithm being Single Shot Detection (SSD). I am thinking that using an actual puzzle image with dimensions like 2000 x 2000 as training data is not possible, and that SSD will auto-resize the image to 300 x 300, which would render Waldo a meaningless blur. Does SSD resize images automatically, or will it train on the 2000 x 2000 image? Should I crop and resize all puzzles to 300 x 300 images containing Waldo, or can I include a mix of actual puzzle images with dimensions 2000+ x 2000+ and the 300 x 300 cropped images?
I'm considering augmenting the data by cropping these larger images at locations that contain Wally, so that I can have 300 x 300 images where Wally isn't reduced to a smudge on the page and is actually visible - is this a good idea? I am thinking that SSD does train on the 2000 x 2000 image, but that the FPS will drop by a lot - is this wrong? I feel like if I don't use the 2000 x 2000 images for training, then in the prediction stage, when I start feeding the model images with large dimensions (actual puzzle images), the model won't be able to predict locations accurately - is this not the case?
SageMaker object detection resizes the image based on the input parameter "image_shape", for which you can use a size larger than 300 x 300. But 2000 x 2000 might be too large for the algorithm, and it will also slow down training. You can try an image size somewhere in the middle. Cropping the larger images into small patches is a good idea for solving this problem. For inference, the input image will also be resized to the same size as the training parameter "image_shape", so you may want to crop or resize large images before you send them to the endpoint.
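As a concrete illustration of the cropping idea, here is a rough sketch (my own, not SageMaker code) that cuts a fixed-size patch around a labeled Waldo box so the object stays legible after the detector resizes its input; the file name and box values are made up.
```
from PIL import Image

patch = 512                                           # e.g. train with image_shape=512
img = Image.open("puzzle_0001.jpg")                   # e.g. 2000 x 2000, placeholder name
x_min, y_min, x_max, y_max = 1210, 640, 1260, 720     # example Waldo box in pixels

# center the patch on the box, clamped so it stays inside the image
cx, cy = (x_min + x_max) // 2, (y_min + y_max) // 2
left = min(max(cx - patch // 2, 0), img.width - patch)
top = min(max(cy - patch // 2, 0), img.height - patch)
crop = img.crop((left, top, left + patch, top + patch))
crop.save("puzzle_0001_waldo_crop.jpg")

# the box annotation must be shifted into the crop's coordinate frame as well
new_box = (x_min - left, y_min - top, x_max - left, y_max - top)
```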

Input feature to Feature maps

Can anybody please explain this basic thing to me: how does a 192x28x28 input get reduced to 16x28x28 feature maps using a 1x1 conv mapping? My question is about understanding what exactly happens when 192 goes to 16.
I know about ((I - F + 2P) / S) + 1, but what happens in the process of reducing depth?
Each 1x1 convolution kernel compresses the whole 192*28*28 input (which can be read as 192 feature maps of 28px * 28px each) into a single 1*28*28 map. So it reduces the depth along the "feature map axis" to 1 while preserving the height and width of the original input (with a 1x1 kernel, no padding and stride 1, the spatial formula gives (28 - 1 + 0)/1 + 1 = 28).
But then... why do you get 16? In a convolutional layer you can have several kernels; basically, each kernel is an independent filter of the same size. In your case it looks like your 1x1 conv layer has 16 kernels, hence you get 16 maps of 28*28 (one per kernel).
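A minimal sketch of the same idea in code (using PyTorch purely for illustration; the question itself is framework-agnostic):
```
import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)            # (batch, channels, height, width)
conv1x1 = nn.Conv2d(in_channels=192, out_channels=16, kernel_size=1)

y = conv1x1(x)
print(y.shape)                              # torch.Size([1, 16, 28, 28])

# each of the 16 kernels spans all 192 input channels at a single spatial position,
# i.e. it computes one weighted sum over the depth axis per pixel
print(conv1x1.weight.shape)                 # torch.Size([16, 192, 1, 1])
```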

Darknet YOLO image size

I am trying to train a custom object classifier in Darknet YOLO v2
https://pjreddie.com/darknet/yolo/
I gathered a dataset of images, most of them 6000 x 4000 px, with some lower resolutions as well.
Do I need to resize the images to be square before training?
I found that the config uses:
```
[net]
batch=64
subdivisions=8
height=416
width=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
```
That's why I was wondering how to use it with datasets of different image sizes.
You don't have to resize them, because Darknet will do it for you!
This means you really don't need to do that, and you can use different image sizes during your training. What you posted above is just the network configuration; there should be a full network definition as well. The height and the width tell you the network resolution. It also keeps the aspect ratio, see e.g. this.
You don't need to resize your database images. PJReddie's YOLO architecture does it by itself, keeping the aspect ratio intact (no information is lost), according to the resolution in the .cfg file.
For example, if you have an image of size 1248 x 936, YOLO will resize it to 416 x 312 and then pad the extra space with black bars to fit into the 416 x 416 network.
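A quick way to check the numbers in that example (my own helper, not darknet's code):
```
def letterbox_size(img_w, img_h, net=416):
    """Scaled size when the longer side is fit to the network size; the rest is padded."""
    scale = min(net / img_w, net / img_h)
    return round(img_w * scale), round(img_h * scale)

print(letterbox_size(1248, 936))   # (416, 312): 104 rows of padding fill the 416x416 input
```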
It is very common to resize images before training. 416x416 is slightly larger than common; most ImageNet models resize and square the images to 256x256, for example, so I would expect the same here. Trying to train on 6000x4000 is going to require a farm of GPUs. The standard process is to square the image to the largest dimension (height or width), padding with 0's on the shorter side, and then resize using standard image tools like PIL.
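A short sketch of that pad-then-resize process with Pillow (my own illustration; note that darknet's own preprocessing may pad and center differently, as described in the previous answer):
```
from PIL import Image

def pad_to_square_and_resize(path, target=416):
    img = Image.open(path)
    side = max(img.width, img.height)
    canvas = Image.new("RGB", (side, side), (0, 0, 0))   # pad with zeros (black)
    canvas.paste(img, (0, 0))                            # original kept in the top-left
    return canvas.resize((target, target))

# placeholder file names
pad_to_square_and_resize("IMG_0001.jpg").save("IMG_0001_416.jpg")
```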
You do not need to resize the images; you can directly change the values in the darknet .cfg file.
When you open the darknet .cfg (yolo-darknet.cfg) file, you can see all the hyper-parameters and their values.
As shown in your cfg file, the image dimensions are (416, 416) -> (width, height); you can change these values so that darknet will automatically resize the images before training.
Since your images have high dimensions, you can also adjust the batch and subdivisions values (powers of two such as 8, 16, 32) so that darknet does not crash with a memory allocation error.
By default the darknet API changes the size of the images for both inference and training, but in theory any input size w, h = 32 * X, where X is a natural number, should work (w is the width, h the height). By default X = 13, so the input size is (w, h) = (416, 416). I use this rule with yolov3 in OpenCV, and the bigger X is, the better it works.
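For what it's worth, here is a sketch of how that multiple-of-32 rule is typically applied with OpenCV's DNN module (my own illustration; the cfg, weights, and image file names are placeholders):
```
import cv2

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
frame = cv2.imread("test.jpg")

# any 32 * X works as the network input size; larger X tends to help with small objects
side = 32 * 19                     # 608 instead of the default 416 (32 * 13)
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (side, side), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())
```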