Do images with large dimensions (e.g. 2000 x 2000) be auto-scaled to 300 x 300 when using them for training data in AWS Sagemaker? - object-detection

I'm working on a project that trains an ML model to predict the location of Waldo in a Where's Wally? image using AWS Sagemaker with the underlying object detection algorithm being Single Shot Detection, but I am thinking that using an actual puzzle image with dimensions like 2000 x 2000 as training data is not possible and that SSD will auto-resize the image to 300 x 300 which would render Waldo a meaningless blur. Does SSD re-size images automatically, or will it train on the 2000 x 2000 image? Should I crop resize all puzzles to 300 x 300 images containing Waldo, or can I include a mix of actual puzzle images with dimensions 2000+ x 2000+ and the 300 x 300 cropped images?
I'm considering augmenting the data by cropping these larger images at locations that contain Wally so that I can have 300 x 300 images where Wally isn't reduced to a smudge on a page and is actually visible - is this a good idea? I am thinking that SSD does train on the 2000 x 2000 image, but the FPS will reduce by a lot - is this wrong? I feel like if I don't use the 2000 x 2000 image for training, in the prediction stage where I start feeding the model images with large dimensions (actual puzzle images), the model won't be able to predict locations accurately - is this not the case?

SageMaker object detection resizes the image based on the input parameter "image_shape", which you use a size larger than 300 x 300. But 2000 x 2000 might be too large for the algorithm and it will also slow down the training speed. You can try to use a image size somewhere in the middle. Cropping larger images into small patches is a good idea for solving this problem. For the inference, the input image will also be resized to the same size as the training parameter "image_shape". So you may want to crop or resize the large image before you send them to the endpoint.

Related

Can YOLO pictures have a bounded box that covering the whole picture?

I wonder why YOLO pictures need to have a bounding box.
Assume that we are using Darknet. Each image need to have a corresponding .txt file with the same name as the image file. Inside the .txt file it need to be. It's the same for all YOLO frameworks that are using bounded boxes for labeling.
<object-class> <x> <y> <width> <height>
Where x, y, width, and height are relative to the image's width and height.
For exampel. If we goto this page and press YOLO Darknet TXT button and download the .zip file and then go to train folder. Then we can see a these files
IMG_0074_jpg.rf.64efe06bcd723dc66b0d071bfb47948a.jpg
IMG_0074_jpg.rf.64efe06bcd723dc66b0d071bfb47948a.txt
Where the .txt file looks like this
0 0.7055288461538461 0.6538461538461539 0.11658653846153846 0.4110576923076923
1 0.5913461538461539 0.3545673076923077 0.17307692307692307 0.6538461538461539
Every image has the size 416x416. This image looks like this:
My idéa is that every image should have one class. Only one class. And the image should taked with a camera like this.
This camera snap should been taked as:
Take camera snap
Cut the camera snap into desired size
Upscale it to square 416x416
Like this:
And then every .txt file that correspons for every image should look like this:
<object-class> 0 0 1 1
Question
Is this possible for e.g Darknet or other framework that are using bounded boxes to labeling the classes?
Instead of let the software e.g Darknet upscale the bounded boxes to 416x416 for every class object, then I should do it and change the .txt file to x = 0, y = 0, width = 1, height = 1 for every image that only having one class object.
Is that possible for me to create a traing set in that way and train with it?
Little disclaimer I have to say that I am not an expert on this, I am part of a project and we are using darknet so I had some time experimenting.
So if I understand it right you want to train with cropped single class images with full image sized bounding boxes.
It is possible to do it and I am using something like that but it is most likely not what you want.
Let me tell you about the problems and unexpected behaviour this method creates.
When you train with images that has full image size bounding boxes yolo can not make proper detection because while training it also learns the backgrounds and empty spaces of your dataset. More specifically objects on your training dataset has to be in the same context as your real life usage. If you train it with dog images on the jungle it won't do a good job of predicting dogs in house.
If you are only going to use it with classification you can still train it like this it still classifies fine but images that you are going to predict also should be like your training dataset, so by looking at your example if you train images like this cropped dog picture your model won't be able to classify the dog on the first image.
For a better example, in my case detection wasn't required. I am working with food images and I only predict the meal on the plate, so I trained with full image sized bboxes since every food has one class. It perfectly classifies the food but the bboxes are always predicted as full image.
So my understanding for the theory part of this, if you feed the network with only full image bboxes it learns that making the box as big as possible is results in less error rate so it optimizes that way, this is kind of wasting half of the algorithm but it works for me.
Also your images don't need to be 416x416 it resizes to that whatever size you give it, you can also change it from cfg file.
I have a code that makes full sized bboxes for all images in a directory if you want to try it fast.(It overrides existing annotations so be careful)
Finally boxes should be like this for them to be centered full size, x and y are center of the bbox it should be center/half of the image.
<object-class> 0.5 0.5 1 1
from imagepreprocessing.darknet_functions import create_training_data_yolo, auto_annotation_by_random_points
import os
main_dir = "datasets/my_dataset"
# auto annotating all images by their center points (x,y,w,h)
folders = sorted(os.listdir(main_dir))
for index, folder in enumerate(folders):
auto_annotation_by_random_points(os.path.join(main_dir, folder), index, annotation_points=((0.5,0.5), (0.5,0.5), (1.0,1.0), (1.0,1.0)))
# creating required files
create_training_data_yolo(main_dir)
```

Yolo Training: multiple objects in one image

I have a set of training images that contain many small objects (10-20). The image resolution is high (9000x6000).
Is it better to split the image into the specific objects before running yolo training? Or just leave it as is.
Does yolo resize an entire image, or does it ‘extract’ the annotated object first before resizing?
If it is the former, I am concerned that the resolution will be bad. Imagine 20 objects in a 416x416 image.
Does yolo resize an entire image, or does it ‘extract’ the annotated
object first before resizing?
Yes, an entire image will be resized in case of Yolo and it does not extract annotated object before resizing.
Since your input images have very high resolution, what you can do is:
Yolo can handle object sizes of 25 x 25 effectively with network input layer size 608 x 608. So if your object sizes in original input image are greater than 250 x 250 you can train the images as they are (with 608 x 608 network size). In that case even when images are resized to network size, objects will be of size greater than 25x25. This should give you good accuracy.
(6000/600) * 25 = 250
If object sizes in original images are smaller than 200 x 200, split your input image into 8 smaller units/blocks, say blocks/tiles of 2250 x 1500. Train these blocks as individual images. Each bigger image (9000 x 6000) corresponds to 8 training images. Each image might contain zero to many objects. You can operate in sliding window method.
The method you choose for training should be used for inference as well.
For training on objects of all sizes use following models: [Use this if you use original image as it is used for training]
Yolov4-custom
Yolov3-SPP
Yolov3_5l
If all of the objects that you want to detect are of smaller size, then for effective detection use Yolov4 with following changes: [Use this if you split original image into 8 blocks]
Set layers = 23 instead of layers = 54
Set stride=4 instead of stride=2
Set stride=4 instead of stride=2
References:
Refer this relevant GitHub thread
darknet documentation

Do input size effect mobilenet-ssd in aspect-ratio and real anchor ratio? (Tensorflow API)

im recently using tensorflow api object detection. The default SSD-MobileNet v1 is using 300 x 300 images as input training image, but i gonna edit the image size as width and height in different values. For instance, 320 * 180. Are aspects ratio in .config represent the real ratio of the anchors width/height ratio or they are just for the square images?
You can change the "size" to any different value , the general guidance is preserve the aspect ratio of the original image while the size can be different value.
Aspect ratios represent the real ratio of anchors. You can use it for different input ratios, but you will get the best result if you use input ratio similar to square images.

Darknet YOLO image size

I am trying to train custom object classifier in Darknet YOLO v2
https://pjreddie.com/darknet/yolo/
I gathered a dataset for images most of them are 6000 x 4000 px and some lower resolutions as well.
Do I need to resize the images before training to be squared ?
I found that the config uses:
[net]
batch=64
subdivisions=8
height=416
width=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
thats why I was wondering how to use it for different sizes of data sets.
You don't have to resize it, because Darknet will do it instead of you!
It means you really don't need to do that and you can use different image sizes during your training. What you posted above is just network configuration. There should be full network definition as well. And the height and the width tell you what's the network resolution. And it also keeps aspect ratio, check e.g this.
You don't need to resize your database images. PJReddie's YOLO architecture does it by itself keeping the aspect ratio safe (no information will miss) according to the resolution in .cfg file.
For Example, if you have image size 1248 x 936, YOLO will resize it to 416 x 312 and then pad the extra space with black bars to fit into 416 x 416 network.
It is very common to resize images before training. 416x416 is slightly larger than common. Most imagenet models resize and square the images to 256x256 for example. So I would expect the same here. Trying to train on 6000x4000 is going to require a farm of GPUs. The standard process is to square the image to the largest dimension (height, or width), padding with 0's on the shorter side, then resizing using standard image resizing tools like PIL.
You do not need to resize the images, you can directly change the values in darknet.cfg file.
When you open darknet.cfg (yolo-darknet.cfg) file, you can all
hyper-parameters and their values.
As showed in your cfg file images dimensions are (416,416)->(weight,height), you can change the values, so that darknet will automatically resize the images before training.
Since the images have high dimensions, you can adjust batch and sub-division values (lower the values 32,16,8 . it has to be multiples of 2), so that darknet will not crash (memory allocation error)
By default the darknet api changes the size of the images in both inference and training, but in theory any input size w, h = 32 x X where X belongs to a natural number should, W is the width, H the height. By default X = 13, so the input size is w, h = (416, 416). I use this rule with yolov3 in opencv, and it works better the bigger X is.

Memory Error while working with images in Massachussets Road Dataset

An image in the Massachussets Road Dataset has dimensions (1500,1500,3) . where image_height = image_width = 1500 , and 3 is for RGB channels.
I used skimage.external.tifffile.imread for reading the images and then treid storing them in a numpy array having dimensions ((number_of_images),1500,1500,3) , which gave me a Memory Error.
For solutions, I looked at the PIL module which aids in reshaping the image to a smaller size but I was wondering whether that would cause me to lose some critical information.
Error Message
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-25-4133a180c798> in <module>()
200 if __name__ == '__main__':
201
--> 202 create_train_data()
203
204 create_test_data()
<ipython-input-25-4133a180c798> in create_train_data()
41
42
---> 43 imgs = np.ndarray((total, image_rows, image_cols,3), dtype=np.uint8)
44
45
MemoryError:
1500 x 1500 images, compounded by the fact that these are colour images for classification and especially if you have a lot of images will definitely make you run out of memory.
There is no saving of memory here with sparse matrices because the data is dense. You don't have a choice but to resize the images. It also depends on how complex the images are.
Resizing the images you can get away with if the images are fairly textured. Given the choice of dataset you've made, we can certainly get away with this.
You also tagged your post with conv-neural-network. These full sized images are definitely impractical for a CNN, computation wise and parameter wise. Like what I said earlier, it's best that you resize the images before training (and of course testing).