Tensorflow for Poets Inception v3 image size - tensorflow

I am training my own image set using Tensorflow for Poets as an example,
https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/
What size do the images need to be? I have read that the script automatically resizes the images for you, but what size does it resize them to? Can you pre-resize your images to that size to save disk space (I have 10,000 images of about 1 MB each)?
How does it crop the images? Does it chop off part of the image, add white/black bars, or change the aspect ratio?
Also, I think Inception v3 uses 299x299 images. If your image recognition task requires more detail, is it possible to increase the network's input size, say to 598x598?

I don't know what re-sizing option this implementation uses; if you haven't found that in the documentation, then I expect that we'd need to read the code.
The images can be of any size. Yes, you can shrink your images to save disk space. However, note that you lose image detail; there won't be a way to recover the lost information.
The good news is that you shouldn't need it; CNN models are built for an image size that contains enough detail to handle the problem at hand. Greater image detail generally does not translate to greater accuracy in classification. Doubling the image resolution is usually a waste of storage.
To increase the network's input size, you'd have to edit the code to accept the larger "native" image size. Then you'd have to alter the model topology to account for the greater input size: either a larger step-down factor somewhere (which could defeat the greater resolution), or another layer on the model to capture the larger size.
To get a more accurate model, you generally need a stronger network topology. 2x resolution does not give us much more information to differentiate a horse from a school bus.
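If you do decide to shrink the images ahead of time, a rough sketch of the size arithmetic might look like this. It assumes you want to keep the aspect ratio and leave the shorter side at the 299 pixels mentioned in the question. I have not checked how the retraining script itself resizes or crops, and the function names and the 4000x3000 example size are made up for illustration.

```python
def shrink_size(width, height, target_side=299):
    """Return (new_width, new_height) with the shorter side scaled down to target_side.

    Images whose shorter side is already at or below target_side are left alone,
    so this never upscales. target_side=299 is the Inception v3 size mentioned in
    the question (an assumption about what the script needs, not a verified fact).
    """
    short = min(width, height)
    if short <= target_side:
        return width, height
    new_width = max(1, round(width * target_side / short))
    new_height = max(1, round(height * target_side / short))
    return new_width, new_height


def pixel_ratio(width, height, target_side=299):
    """Fraction of the original pixel count that remains after shrinking."""
    new_width, new_height = shrink_size(width, height, target_side)
    return (new_width * new_height) / (width * height)


if __name__ == "__main__":
    # Hypothetical 4000x3000 photo.
    print(shrink_size(4000, 3000))   # (399, 299)
    print(pixel_ratio(4000, 3000))   # a bit under 0.01
```

The pixel ratio is only a loose proxy for disk savings, since compressed file size does not scale exactly with pixel count.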

Related

Training Image Size Faster-RCNN

I will train on my dataset with faster-rcnn for one class. All my images are 1920x1080. Should I resize or crop the images, or can I train at this size?
Also, my objects are really small (around 60x60).
The config file specifies min_dimension: 600 and max_dimension: 1024, which is why I am unsure whether I can train the model with 1920x1080 images.
If your objects are small, resizing the images to a smaller size is not a good idea. You can change the max_dimension to 1920 or 2000 which might make the speed a bit lower. For cropping the images, you should first consider how the objects are placed in the images. If cropping will cut a lot of objects, then you will have many cases of truncation which might have a negative effect on the model's performance.
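To see what those two config values would do to a 1920x1080 image, here is a small sketch of the arithmetic. It assumes the resizer keeps the aspect ratio, first scales the shorter side to min_dimension, and falls back to scaling the longer side to max_dimension if that would overshoot. That is my understanding of a keep-aspect-ratio resizer, not something checked against the code, and the function names are made up.

```python
def keep_aspect_ratio_size(height, width, min_dimension=600, max_dimension=1024):
    """Approximate output size of a keep-aspect-ratio resizer (assumed rule, see above)."""
    short, long_side = min(height, width), max(height, width)
    # First attempt: scale so the shorter side equals min_dimension.
    new_height = round(height * min_dimension / short)
    new_width = round(width * min_dimension / short)
    # If the longer side overshoots, scale so it equals max_dimension instead.
    if max(new_height, new_width) > max_dimension:
        new_height = round(height * max_dimension / long_side)
        new_width = round(width * max_dimension / long_side)
    return new_height, new_width


def object_side_after_resize(object_side, height, width, **resizer_kwargs):
    """Side length in pixels of a square object after the image is resized."""
    new_height, _ = keep_aspect_ratio_size(height, width, **resizer_kwargs)
    return object_side * new_height / height


if __name__ == "__main__":
    print(keep_aspect_ratio_size(1080, 1920))          # (576, 1024)
    print(object_side_after_resize(60, 1080, 1920))    # 32.0
    # Raising only max_dimension still shrinks the short side to 600 under this rule.
    print(keep_aspect_ratio_size(1080, 1920, min_dimension=600, max_dimension=1920))
```

Under this assumed rule a 60x60 object ends up around 32x32 with the default values, and min_dimension would have to be raised along with max_dimension to keep the image at its native size.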
If you insist on using faster-rcnn for this task, I would personally recommend:
Change the input height and width (the minimum and maximum dimension values) in the config file so that training runs successfully on your dataset.
Change the original region proposal parameters (these should be in the config file, too) to an aspect ratio and scale that match your objects, such as 1:1 and 60.
But if I were you, I would try:
Adding some shortcut connections in the backbone, since small object detection needs high-resolution features.
Cutting off the fast-rcnn head to improve performance: with only one class to detect, the decision is just object versus background (or any other class), and the RPN stage output should be enough to encode that.
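For the suggestion about region proposal scale and ratio, a sketch of the arithmetic may help. It assumes the anchor generator expresses scales as multiples of a base anchor size (256 pixels here) and spreads the aspect ratio by its square root. Both are assumptions about the config convention, so check them against your anchor generator, and note that the function names are made up.

```python
import math


def anchor_size(scale, aspect_ratio, base_size=256):
    """Return (height, width) in pixels of one anchor.

    Assumed convention: area is (scale * base_size) ** 2 and
    aspect_ratio is width / height.
    """
    ratio_sqrt = math.sqrt(aspect_ratio)
    height = scale / ratio_sqrt * base_size
    width = scale * ratio_sqrt * base_size
    return height, width


def scale_for_object(object_side, base_size=256):
    """Scale value whose 1:1 anchor matches a square object of the given side."""
    return object_side / base_size


if __name__ == "__main__":
    scale = scale_for_object(60)
    print(scale)                     # 0.234375
    print(anchor_size(scale, 1.0))   # (60.0, 60.0)
```

The object size that matters is the one after the image resizer has run, so if the images are shrunk before they reach the network, the scale has to shrink by the same factor.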

MobileNet-SSD input resolution

I have a working object detection model (fine-tuned MobileNet SSD) that detects my custom small robot. I'll feed it footage from a webcam mounted on a drone and use the real-time bounding box information.
So, I am about to purchase the camera.
My questions: since SSD resizes the input images to 300x300, is the camera resolution very important? Does higher resolution mean better accuracy (even though the image gets resized to 300x300 anyway)? Should I crop each frame of the camera footage to a 1:1 aspect ratio before running my object detection model on it? Should I divide the image into MxN cropped segments and run inference on them one by one?
My robot is very small and the drone will be at an altitude of 4 meters, so I'll effectively be trying to detect a very tiny spot in my input image.
Any sort of wisdom is greatly appreciated, thank you.
That's quite a few questions; I'll try to answer all of them.
The detection model resizes the input images with some resizing method (e.g. bilinear) before feeding them to the network. It is of course better if the input image is equal to or larger than the network's input size rather than smaller. As a rule of thumb, higher resolution does mean better accuracy, but this depends heavily on the setup and the task.
Suppose you're trying to detect a small object and the original resolution is, for example, 1920x1080. After resizing, the small object will be even smaller in pixels and might be too small to detect. It would therefore be better either to split the image into smaller images (possibly with some overlap, to avoid missed detections when an object is split across tiles) and run detection on each, or to use a model with a higher input resolution. Be aware that the first option works with your current model, whereas the second requires training a new model, possibly with some architectural changes (e.g. adding SSD layers and modifying anchors, depending on the scales you want to detect).
Regarding aspect ratio, you mostly need to stay consistent. It doesn't matter if you don't keep the original aspect ratio, but if you don't, do the same thing in training and in evaluation/test/deployment.
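Splitting a frame into overlapping tiles could be sketched roughly as follows. The tile size of 300 matches the SSD input size from the question; the overlap of 60 pixels and the function names are arbitrary choices for illustration. Each returned box would then be cropped from the frame and passed to the detector.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles of size `tile` that cover [0, length) with some overlap."""
    if tile >= length:
        return [0]
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile sits flush with the far edge
    return starts


def tile_boxes(width, height, tile=300, overlap=60):
    """Return (xmin, ymin, xmax, ymax) crop boxes covering the whole frame."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


if __name__ == "__main__":
    boxes = tile_boxes(1920, 1080)
    print(len(boxes))   # 40
    print(boxes[0])     # (0, 0, 300, 300)
    print(boxes[-1])    # (1620, 780, 1920, 1080)
```

With these numbers a 1920x1080 frame yields 40 tiles, so inference cost per frame grows by about that factor, which matters for real-time use.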

TensorFlow For Poets - Optimal image size

I am trying to use the Tensorflow For Poets Google CodeLab as a template for an image classification project.
I use tens (maybe hundreds) of thousands of images with varying (relatively high) resolutions for retraining, but they are taking up too much disk space (over 10 GB) and I would like to downscale them to save some space.
As far as I understand, image resolution is not much of a concern here and it should not be an issue to scale down all the images (from roughly 4000x3000 to something much smaller).
I tried using 224x224 resolution and everything worked fine, but then I noticed some existing SO questions mentioning that the input images are being scaled to 299x299 rather than 224x224.
This made me wonder: What is the optimal input image resolution when using the code from the said CodeLab to make sure the images take up as little space as possible without making any sacrifice to the performance of the retrained model?
Was I sabotaging the process by overly downscaling the images? I use the mobilenet_v1_0.50_224 model, which is why I thought using 224x224 images for retraining would be the best way to go.
Given all my images have a high enough resolution, would I benefit from modifying the scripts to accept a larger image size?

Best practice for TF Object Detection API for very large images

The Tensorflow Object Detection API offers a variety of models. These are trained at 600x600 image size. Suppose I have a 6000x4000 satellite image, and I want to detect objects continuously throughout the image. What is the best practice for adapting a model from the API to this image size? I don't care about the running time per image for object detection. I have a GPU with 9GB of RAM.
I know I can fit a single 6000x4000 image onto this GPU. I'm not sure if I can fit an image processing neural net for that size onto the GPU. I can think of a few alternatives:
Chip the image into 600x600 blocks, which risks losing features that cross the blocks, but then everything should work out of the box.
Change the image dimensions in the model definition from 600x600 to 6000x4000. Can I retrain from the Model Zoo checkpoint, or do I have to start from scratch if I do this?
Compress the image to smaller size. This distorts the image dimensions and also loses feature detail. For say a picture of a city, the resulting detail would not be adequate to pick out cars and small houses.
You need to try different sizes and see which size lets you train without running out of memory. Memory consumption also depends on how many images you are training on. From what you described, you will end up using an intermediate image size.
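If you go with the chipping option from the question and let the chips overlap, detections have to be moved back into full-image coordinates and duplicates from the overlap regions removed. A minimal sketch, assuming boxes are (xmin, ymin, xmax, ymax) in pixels and using plain greedy non-maximum suppression for the merge (my choice here, not something the API does for you across chips); all names are hypothetical.

```python
def to_full_image(box, chip_origin):
    """Shift a chip-relative box by the chip's top-left corner (x, y)."""
    ox, oy = chip_origin
    xmin, ymin, xmax, ymax = box
    return (xmin + ox, ymin + oy, xmax + ox, ymax + oy)


def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_detections(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs that are already in full-image coordinates."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


if __name__ == "__main__":
    # The same object seen in two 600x600 chips that overlap by 100 pixels.
    from_chip_a = (to_full_image((520, 100, 580, 160), (0, 0)), 0.9)
    from_chip_b = (to_full_image((20, 100, 80, 160), (500, 0)), 0.8)
    print(merge_detections([from_chip_a, from_chip_b]))
```

In this example both chips report the same object, and only the higher-scoring copy survives the merge.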

small object detection with faster-RCNN in tensorflow-models

I'm attempting to train a faster-rcnn model for small digit detection. I'm using the newly released tensorflow object detection API and so far have been fine-tuning a pre-trained faster_rcnn_resnet101_coco from the zoo. All my training attempts have resulted in models with high precision but low recall. Out of the ~120 objects (digits) in each image, only ~20 are ever detected, but when they are detected the classification is accurate. (Also, I am able to train a simple convnet from scratch on my cropped images with high accuracy, so the problem is in the detection aspect of the model.) Each digit is on average 60x30 in the original images (and probably about half that size after the image is resized before being fed into the model). Here is an example image with detected boxes of what I'm seeing:
What is odd to me is how it is able to correctly detect neighboring digits but completely miss the rest that are very similar in terms of pixel dimensions.
I have tried adjusting the hyperparameters around anchor box generation and first_stage_max_proposals, but nothing has improved the results so far. Here is an example config file I have used. What other hyperparameters should I try adjusting? Any other suggestions on how to diagnose the problem? Should I be looking into other architectures, or does my task look doable with faster-rcnn and/or SSD?
In the end, the immediate problem was that I was not using the visualizer correctly. By updating the parameters for visualize_boxes_and_labels_on_image_array as described by Johnathan in the comments, I was able to see that I am at least detecting more boxes than I had thought.
I checked your config file: you are decreasing the resolution of your images to 1024. The region around each digit will not contain many pixels, so you are losing information. What I suggest is to train the model on another dataset (smaller images). You can, for example, crop each image into four areas.
If you have a good GPU, increase the max dimension in the image_resizer, but I guess you will run out of memory.