I am trying to use the TensorFlow For Poets Google CodeLab as a template for an image classification project.
I use tens (maybe hundreds) of thousands of images with varying (relatively high) resolutions for retraining, but they are taking up too much disk space (over 10 GB) and I would like to downscale them to save some space.
As far as I understand, image resolution is not much of a concern here and it should not be an issue to scale down all the images (from roughly 4000x3000 to something much smaller).
I tried using 224x224 resolution and everything worked fine, but then I noticed some existing SO questions mentioning that the input images are being scaled to 299x299 rather than 224x224.
This made me wonder: What is the optimal input image resolution when using the code from the said CodeLab to make sure the images take up as little space as possible without making any sacrifice to the performance of the retrained model?
Was I sabotaging the process by overly downscaling the images? I use the mobilenet_v1_0.50_224 model, which is why I thought using 224x224 images for retraining would be the best way to go.
Given all my images have a high enough resolution, would I benefit from modifying the scripts to accept a larger image size?
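For reference, the downscaling itself is straightforward; here is a minimal Pillow sketch of the kind of batch resize I have in mind (the folder names and the 224x224 target are placeholders, not part of the CodeLab scripts):

```python
from pathlib import Path
from PIL import Image

SRC = Path("original_photos")      # placeholder source folder
DST = Path("downscaled_photos")    # placeholder destination folder
TARGET = (224, 224)                # or (299, 299) if that turns out to be the right size

for path in SRC.glob("**/*.jpg"):
    img = Image.open(path)
    img.thumbnail(TARGET, Image.LANCZOS)   # shrinks in place, keeps aspect ratio
    out = DST / path.relative_to(SRC)
    out.parent.mkdir(parents=True, exist_ok=True)
    img.save(out, quality=90)
```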
Related
I will train my dataset with faster-rcnn for one class. All my images are 1920x1080. Should I resize or crop the images, or can I train with this size?
Also, my objects are really small (around 60x60 pixels).
In the config file the dimensions are written as min_dimension: 600 and max_dimension: 1024, which is why I am unsure whether to train the model with 1920x1080 images.
If your objects are small, resizing the images to a smaller size is not a good idea. You can change the max_dimension to 1920 or 2000, which might make training a bit slower. As for cropping, first consider how the objects are placed in the images: if cropping would cut through many objects, you will get many truncated instances, which can hurt the model's performance.
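For reference, the relevant block in the TF Object Detection API config is the image resizer; a sketch of the change suggested above (the exact value depends on your GPU memory):

```
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1920   # raised from 1024 so small objects keep more pixels
  }
}
```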
If you insist on using faster-rcnn for this task, I personally recommend:
Change the input height and width, and the maximum and minimum dimension values, in the config file so that training runs successfully on your dataset.
Change the original region proposal parameters (also in the config file) to a ratio and scale that suit your objects, such as 1:1 and 60 (see the sketch below).
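A sketch of what the second recommendation could look like in the TF Object Detection API config; the grid anchor scales are relative to the 256-pixel base anchor, so roughly 0.25 matches 60-pixel objects (values are illustrative, not tuned):

```
first_stage_anchor_generator {
  grid_anchor_generator {
    scales: [0.25, 0.5]    # 0.25 * 256 ≈ 64 px, close to the 60x60 objects
    aspect_ratios: [1.0]   # 1:1 ratio as suggested above
    height_stride: 16
    width_stride: 16
  }
}
```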
But if I were you, I would rather try:
Add some shortcut connections in the backbone, since small-object detection needs high-resolution features.
Cut the fast-rcnn head off to improve performance, since you only need to decide whether a detection is THE class or not (i.e. background or anything else), and the RPN-stage output should be enough to encode that information.
I trained an image classifier with TensorFlow using a bunch of JPG images.
Let's say I have 3 classifiers, ClassifierA, ClassifierB, ClassifierC.
When testing the classifiers, I have no issues at all in 90% of the images I use as a test. But in some cases, I have misclassifications due to the image quality.
For example, the image below is the same, saved as BMP and JPG. You'll see little differences due to the format quality.
When I test the BMP version using tf.image.decode_bmp, I get a misclassification, say ClassifierA at 70%.
When I test the JPG version using tf.image.decode_jpeg, I get the right one, ClassifierB at 90%.
When I test the JPG version using tf.image.decode_jpeg with dct_method="INTEGER_ACCURATE", I get the right one with a much better result, ClassifierB at 99%.
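For clarity, the three decode paths I am comparing look roughly like this (file paths are placeholders):

```python
import tensorflow as tf

bmp_bytes = tf.io.read_file("sample.bmp")   # placeholder path
jpg_bytes = tf.io.read_file("sample.jpg")   # placeholder path

bmp_img = tf.image.decode_bmp(bmp_bytes, channels=3)
jpg_fast = tf.image.decode_jpeg(jpg_bytes, channels=3)  # default DCT method
jpg_accurate = tf.image.decode_jpeg(jpg_bytes, channels=3,
                                    dct_method="INTEGER_ACCURATE")  # slower, more faithful decode
```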
What could be the issue here? Why such a difference between BMP and JPG, and how can I solve it, if there is a solution?
update1: I retrained my classifier using different effects and randomly varying the quality at which I save the images in the dataset.
Now I get the right output, but the percentages still change a lot, for example 44% with BMP and over 90% with JPG.
This is a fabulous question, and even more fabulous of an observation. I'm going to use this in my own work in the future!
I expect you have just identified a rather fascinating issue with the dataset. It appears that your model is overfitting to features specific to JPG compression. The solution is to increase data augmentation. In particular, convert your training samples between various formats randomly.
This issue also makes me think that sharpening and blurring operations would make good data augmentation steps. It's common to alter the color, contrast, rotation, scale, orientation, and translation of the image to augment the training dataset, but I don't commonly see blur and sharpness used. I suspect these two augmentation techniques would go a long way toward resolving your issue by themselves.
In case the OP (or others reading this) are not terribly familiar with what "data augmentation" is, I'll define it. It is common to warp your training images in various ways to generate endlessly unique images from your (otherwise finite) dataset. For example, randomly flipping the image left/right is quite simple, common, and effectively doubles your dataset. Changing contrast and brightness settings further alter your images. Adding these and other data augmentation transformations to your pipeline creates a much richer dataset and trains a network that is more robust to these common variations in images.
It's important that the data augmentation techniques you use produce realistic variations. For example, rotating an image is quite a realistic augmentation technique. If your training image is a cat standing horizontally, it's realistically possible that a future sample might be a cat at a 25-degree angle.
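As a rough illustration (not a tuned pipeline), such augmentation could be wired up with tf.image as below, assuming images are float32 in [0, 1]; randomizing the JPEG quality directly attacks the compression-artifact overfitting described above:

```python
import tensorflow as tf

def augment(image):
    # Illustrative ranges only; tune them for your data.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    # Re-encode at a random JPEG quality so the model cannot key on
    # compression artifacts (the BMP-vs-JPG effect above).
    image = tf.image.random_jpeg_quality(image, 40, 100)
    return image

# dataset = dataset.map(augment)   # assuming a tf.data pipeline of single images
```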
I am training my own image set using TensorFlow for Poets as an example:
https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/
What size do the images need to be? I have read that the script automatically resizes the images for you, but what size does it resize them to? Can you pre-resize your images to that size to save disk space (10,000 1 MB images)?
How does it crop the images: does it chop off part of the image, add white/black bars, or change the aspect ratio?
Also, I think Inception v3 uses 299x299 images; what if your image recognition requires more detailed accuracy, is it possible to increase the network's image size, say to 598x598?
I don't know what re-sizing option this implementation uses; if you haven't found that in the documentation, then I expect that we'd need to read the code.
The images can be of any size. Yes, you can shrink your images to save disk space. However, note that you lose image detail; there won't be a way to recover the lost information.
The good news is that you shouldn't need it; CNN models are built for an image size that contains enough detail to handle the problem at hand. Greater image detail generally does not translate to greater accuracy in classification. Doubling the image resolution is usually a waste of storage.
To do that, you'd have to edit the code to accept the larger "native" image size. Then you'd have to alter the model topology to account for the greater input size: either a larger step-down factor somewhere (which could defeat the greater resolution), or another layer on the model to capture the larger size.
To get a more accurate model, you generally need a stronger network topology. 2x resolution does not give us much more information to differentiate a horse from a school bus.
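To make the point concrete: whatever resolution you store, a fixed-input network ends up looking at the same pixels. A minimal sketch (TF 2.x names, Inception v3's 299x299 assumed; not the CodeLab's actual resizing code):

```python
import tensorflow as tf

raw = tf.io.read_file("photo.jpg")        # placeholder path
img = tf.image.decode_jpeg(raw, channels=3)
img = tf.image.resize(img, [299, 299])    # detail beyond 299x299 is discarded anyway
```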
I'm attempting to train a faster-rcnn model for small digit detection. I'm using the newly released TensorFlow Object Detection API and so far have been fine-tuning a pre-trained faster_rcnn_resnet101_coco from the model zoo. All my training attempts have resulted in models with high precision but low recall. Out of the ~120 objects (digits) on each image, only ~20 objects are ever detected, but when detected, the classification is accurate. (Also, I am able to train a simple convnet from scratch on my cropped images with high accuracy, so the problem is in the detection aspect of the model.) Each digit is on average 60x30 in the original images (and probably about half that size after the image is resized before being fed into the model). Here is an example image with detected boxes of what I'm seeing:
What is odd to me is how it is able to correctly detect neighboring digits but completely miss the rest that are very similar in terms of pixel dimensions.
I have tried adjusting the hyperparameters around anchor box generation and first_stage_max_proposals, but nothing has improved the results so far. Here is an example config file I have used. What other hyperparameters should I try adjusting? Any other suggestions on how to diagnose the problem? Should I be looking into other architectures, or does my task look doable with faster-rcnn and/or SSD?
In the end the immediate problem was that I was not using the visualizer correctly. By updating the parameters for visualize_boxes_and_labels_on_image_array as described by Johnathan in the comments, I was able to see that I am at least detecting more boxes than I had thought.
I checked your config file; you are decreasing the resolution of your images to 1024. The region of each digit will not contain many pixels, so you are losing information. What I suggest is to train the model on another dataset (smaller images). You can, for example, crop the images into four areas.
If you have a good GPU, increase the max dimension in the image_resizer, but I guess you will run out of memory.
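A minimal Pillow sketch of the crop-into-four idea mentioned above (the path is a placeholder; note that the ground-truth boxes would have to be remapped to the crops as well):

```python
from PIL import Image

img = Image.open("frame_1920x1080.jpg")   # placeholder path
w, h = img.size
quadrants = {
    "tl": (0, 0, w // 2, h // 2),
    "tr": (w // 2, 0, w, h // 2),
    "bl": (0, h // 2, w // 2, h),
    "br": (w // 2, h // 2, w, h),
}
for name, box in quadrants.items():
    img.crop(box).save(f"frame_{name}.jpg")
```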
I have to build a network that receives images of different sizes. I don't want to resize or crop, so I am using a fully convolutional network.
The problem is that I can't pre-create mini-batches because of the different sizes of the images.
One solution would be to take the biggest image in the intended mini-batch and zero-pad all the other images to the same size. However, that is not efficient in time or memory, especially since the image sizes vary significantly (from 30px up to 3000px).
Another solution, which I am using right now, is to create mini-batches of size 1. That of course solves the problem of different sizes, but it is not good for convergence.
So the question is: does Keras offer some way to collect gradients from several inputs and only then take a learning step?
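To be concrete, what I am after is roughly this hand-rolled accumulation loop (a minimal TF 2.x sketch using tf.GradientTape; the toy model and `dataset` are placeholders), but exposed through the Keras API:

```python
import tensorflow as tf

# Toy fully convolutional model that accepts variable-sized inputs.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(None, None, 3)),
    tf.keras.layers.Conv2D(2, 1),
    tf.keras.layers.GlobalAveragePooling2D(),
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

accum_steps = 8                                   # treat 8 single-image steps as one mini-batch
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (image, label) in enumerate(dataset):   # `dataset` yields one (image, label) pair at a time
    with tf.GradientTape() as tape:
        logits = model(image[tf.newaxis, ...], training=True)
        loss = loss_fn(label[tf.newaxis], logits) / accum_steps
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % accum_steps == 0:             # apply the averaged gradients once per "mini-batch"
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
```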