Why does SSD resize random crops during data augmentation? - tensorflow

The SSD paper details its random-crop data augmentation scheme as:
Data augmentation To make the model more robust to various input object sizes and
shapes, each training image is randomly sampled by one of the following options:
– Use the entire original input image.
– Sample a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3,
0.5, 0.7, or 0.9.
– Randomly sample a patch.
The size of each sampled patch is [0.1, 1] of the original image size, and the aspect ratio
is between 1/2 and 2. We keep the overlapped part of the ground truth box if the center of
it is in the sampled patch. After the aforementioned sampling step, each sampled patch
is resized to fixed size and is horizontally flipped with probability of 0.5, in addition to
applying some photo-metric distortions similar to those described in [14].
https://arxiv.org/pdf/1512.02325.pdf
My question is: what is the reasoning for resizing crops that range in aspect ratios between 0.5 and 2.0?
For instance, if your input image is 300x300, reshaping a crop with AR=2.0 back to a square resolution will severely stretch objects (square features become rectangular, circles become ellipses, etc.). I understand small distortions may help generalization, but training the network on objects distorted up to 2x in either dimension seems counter-productive. Am I misunderstanding how random crop works?
[Edit] I completely understand that augmented images need to be the same size as the original -- I'm more wondering why the authors don't fix the Aspect Ratio to 1.0 to preserve object proportions.
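The sampling procedure quoted from the paper can be sketched in plain Python. This is a simplified sketch, not the authors' code: the box format `(x1, y1, x2, y2)`, the retry count, and the scale range are my assumptions.

```python
import random

def jaccard(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def sample_patch(width, height, boxes, min_iou, max_tries=50):
    """Sample a crop whose IoU with at least one ground-truth box meets min_iou.

    The patch area is in [0.1, 1] of the image and the aspect ratio in
    [1/2, 2], as described in the SSD paper. The caller later resizes the
    patch back to the fixed input size (e.g. 300x300), which is where the
    stretching the question asks about comes from.
    """
    for _ in range(max_tries):
        scale = random.uniform(0.3, 1.0)      # sqrt of the area fraction, so area is in ~[0.1, 1]
        ar = random.uniform(0.5, 2.0)
        w = int(width * scale * ar ** 0.5)
        h = int(height * scale / ar ** 0.5)
        if w == 0 or h == 0 or w > width or h > height:
            continue
        x = random.randint(0, width - w)
        y = random.randint(0, height - h)
        patch = (x, y, x + w, y + h)
        if any(jaccard(patch, b) >= min_iou for b in boxes):
            # keep only boxes whose center falls inside the patch
            kept = [b for b in boxes
                    if x <= (b[0] + b[2]) / 2 <= x + w
                    and y <= (b[1] + b[3]) / 2 <= y + h]
            return patch, kept
    return (0, 0, width, height), boxes  # fall back to the whole image
```
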

GPU architectures force us to use batches to speed up training, and all images in a batch must be the same size. Restricting yourself to less-distorted crops (e.g. a fixed aspect ratio) could make each training example cleaner, but it would make training much slower.

Personally, I consider that any transformation makes sense as long as you, as a human, can still identify the object or subject, and as long as it makes sense within the receptive field of the network. I also suspect that changing the aspect ratio helps the network learn some kind of perspective distortion (look at the cow in Fig. 5 of the paper; it is somewhat "compressed"). Objects like a cup, a tree, or a chair are still identifiable even when stretched. By the same argument, you could decide that some point-controlled or skew transforms don't make sense either.
That said, if you are working with something other than natural images, without perspective, it is probably not a good idea. If your images show objects of a fixed, known size, as in a microscope or another medical imaging device, and your object has a more or less fixed size (say, a cell), then a strong scale distortion (a cell twice as large) is probably a bad idea, whereas stretching a cell into an ellipse might actually make more sense.
With this library you can perform strong augmentations, but not all of them make sense for every kind of image.

Related

Creating a good training set for one-class detection

I am training a one-class (hands) object detector on the egohands data set. My problem is that it detects way too many things as hands. It feels like it is detecting everything that is skin-colored as a hand.
I assume the most likely explanation for this is that my training set is poor: every single image in the set contains hands, and almost no other skin-toned elements appear in the images. I guess it is necessary to also show the network images that do not contain what you are trying to detect?
I just want to verify that I am right with my assumptions before investing lots of time into creating a better training set, so I am very grateful for every hint about what I am doing wrong.
Object detection preprocessing is a critical step; take extra care, as detection networks are sensitive to geometric transformations.
Some proven data augmentation methods include:
1. Random geometric transformations for random cropping (with constraints)
2. Random expansion
3. Random horizontal flips
4. Random resizing (with random interpolation)
5. Random color jittering for brightness, hue, saturation, and contrast
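Item 5 can be sketched, for brightness and contrast at least, in a few lines of pure Python. This is a toy sketch on a flat list of grayscale 8-bit pixel values; the jitter ranges are illustrative, not taken from any particular paper.

```python
import random

def color_jitter(pixels, max_brightness=32, contrast_range=(0.5, 1.5)):
    """Randomly shift brightness and scale contrast of 8-bit pixel values."""
    delta = random.uniform(-max_brightness, max_brightness)
    alpha = random.uniform(*contrast_range)
    mean = sum(pixels) / len(pixels)
    # contrast scales the deviation from the mean; brightness shifts everything
    return [min(255, max(0, round(alpha * (p - mean) + mean + delta)))
            for p in pixels]
```
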

Should the size of the photos be the same for deep learning?

I have lots of images (about 40 GB).
My images are small, but they are not all the same size.
My images are not of natural scenes; I generated them from a signal, so every pixel is important and I cannot crop or delete any pixels.
Is it possible to use deep learning on images with different shapes?
All pixels are important; please take this into consideration.
I want a model that does not depend on a fixed input image size. Is that possible?
Without knowing what you're trying to learn from the data, it's tough to give a definitive answer:
You could pad all the data at the beginning (or end) of the signal so the images are all the same size. This allows you to keep all the important pixels, but adds irrelevant information to the image that the network will most likely ignore.
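The padding idea can be sketched as follows. This is a minimal, framework-agnostic sketch operating on nested lists rather than any particular library's tensors.

```python
def pad_to_common_size(images, fill=0):
    """Zero-pad a list of 2-D images (lists of rows) to the largest H and W."""
    max_h = max(len(img) for img in images)
    max_w = max(len(row) for img in images for row in img)
    padded = []
    for img in images:
        # extend each row to max_w, then append all-fill rows up to max_h
        rows = [row + [fill] * (max_w - len(row)) for row in img]
        rows += [[fill] * max_w for _ in range(max_h - len(rows))]
        padded.append(rows)
    return padded
```
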
I've also had good luck with activations, where you take a pretrained network and pull features from the image at a certain point in the network regardless of its size (as long as it's larger than the network's input size), and then run those features through a classifier.
https://www.mathworks.com/help/deeplearning/ref/activations.html#d117e95083
Or you could window your data and only process smaller chunks at a time.
https://www.mathworks.com/help/audio/examples/cocktail-party-source-separation-using-deep-learning-networks.html
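The windowing option can be sketched generically like this. It is a plain sliding-window chunker over a 1-D signal; the window and hop sizes are illustrative, and the zero-padding of the final chunk is one choice among several.

```python
def windows(signal, size, hop):
    """Split a 1-D signal into fixed-size chunks, zero-padding the last one."""
    out = []
    for start in range(0, len(signal), hop):
        chunk = signal[start:start + size]
        if len(chunk) < size:
            chunk = chunk + [0] * (size - len(chunk))  # pad the tail
        out.append(chunk)
        if start + size >= len(signal):
            break
    return out
```

Every chunk has the same length, so each can be fed to a fixed-input-size network; predictions are then aggregated across chunks.
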

TF Object detection API mAP calculation seemingly wrong

I'm not sure if this is a misunderstanding on my part or a bug in the TF Object Detection (OD) API code, but I figured I'd try here first before posting to github.
Basically, I am comparing two models in TensorBoard, red vs. green. I find that the red model is slightly better at overall mAP, mAP@.50IOU, and mAP@.75IOU. However, green is better at all the mAPs split by object size: mAP large, medium, and small (see the image below at 67.5k steps, where the blue arrow is).
Now I don't have a PhD in math, but my assumption was that if a model has a higher mAP on small, medium, and large objects, it should have a higher overall mAP...
Here are the exact values: (All values obtained at 67.5k steps, without any smoothing)
                Red      Green
mAP            .3599    .3511
mAP@.50IOU     .5670    .5489
mAP@.75IOU     .3981    .3944
mAP (large)    .5557    .7404
mAP (medium)   .3788    .3941
mAP (small)    .1093    .1386
I think a way to gain more insight is to analyze the distribution of bounding-box sizes (small, medium, large) in your dataset. Here is a link to the mAP calculation, where the TF Object Detection API describes how small and medium boxes are defined.
I can imagine that one reason this is happening is that you have many more medium-sized bounding boxes than large ones. Also, I would not read too much into the small-box results, since those mAPs are so low.
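One way to see how the ordering can flip arithmetically: the overall AP is computed by pooling all detections and re-ranking them together, not by averaging the per-size APs, so how a model's confidence scores are calibrated *across* sizes matters for the overall number but not for the per-size ones. A toy sketch (this is not the COCO/TF evaluation code, and the numbers are made up):

```python
def average_precision(dets, n_gt):
    """AP from (score, is_true_positive) detections against n_gt ground truths:
    mean over ground truths of the precision at each true positive."""
    tp = fp = 0
    total = 0.0
    for _, is_tp in sorted(dets, reverse=True):  # rank by descending score
        if is_tp:
            tp += 1
            total += tp / (tp + fp)
        else:
            fp += 1
    return total / n_gt

# "green": perfect ranking within each size bucket, but its large-object
# false positives are scored above all of its small-object true positives
green_large = [(0.90, True)] + [(0.80 - i * 0.01, False) for i in range(9)]
green_small = [(0.40 - i * 0.01, True) for i in range(9)]
# "red": worse within each size bucket, but better calibrated across sizes
red_large = [(0.95, False), (0.85, True)]
red_small = [(0.70, False)] + [(0.60 - i * 0.01, True) for i in range(9)]

ap = average_precision
print(ap(green_large, 1), ap(green_small, 9))  # 1.0 within each size
print(ap(red_large, 1), ap(red_small, 9))      # lower within each size
print(ap(green_large + green_small, 10))       # pooled: green drops sharply
print(ap(red_large + red_small, 10))           # pooled: red wins overall
```

Here green beats red in every size bucket, yet red has the higher overall AP, because green's cross-size score calibration only hurts the pooled ranking.
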

Tiled Instance Normalization

I am currently implementing a few image style transfer algorithms for Tensorflow, but I would like to do it in tiles, so I don't have to run the entire image through the network. Everything works fine, however each image is normalized differently, according to its own statistics, which results in tiles with slightly different characteristics.
I am certain that the only issue is instance normalization, since if I feed the true values (obtained from the entire image) to each tile calculation the result is perfect, however I still have to run the entire image through the network to calculate these values. I also tried calculating these values using a downsampled version of the image, but resolution suffers a lot.
So my question is: is it possible to estimate mean and variance values for instance normalization without feeding the entire image through the network?
You can take a random sample of the pixels of the image, and use the sample mean and sample variance to normalize the whole image. It will not be perfect, but the larger the sample, the better. A few hundred pixels will probably suffice, maybe even less, but you need to experiment.
Use tf.random_uniform() to get random X and Y coordinates, and then use tf.gather_nd() to get the pixel values at the given coordinates.
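The same estimate can be sketched framework-agnostically in pure Python, treating the image as a flat list of pixel values; the sample size is something you would tune as suggested above.

```python
import random

def sampled_mean_var(pixels, n_samples, seed=None):
    """Estimate the mean and variance of an image from a random pixel sample."""
    rng = random.Random(seed)
    sample = [pixels[rng.randrange(len(pixels))] for _ in range(n_samples)]
    mean = sum(sample) / n_samples
    var = sum((p - mean) ** 2 for p in sample) / n_samples
    return mean, var
```

With the estimated mean and variance in hand, every tile is normalized as `(x - mean) / sqrt(var + eps)`, so all tiles share the same statistics.
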

Streaming Jpeg Resizer

Does anyone know of any code that does streaming JPEG resizing? What I mean is reading a chunk of an image at a time (depending on the original source and destination sizes, the chunk size would obviously vary) and resizing it, allowing for lower memory consumption when resizing very large JPEGs. Obviously this wouldn't work for progressive JPEGs (or at least it would become much more complicated), but it should be possible for standard JPEGs.
The design of JPEG data allows simple resizing to 1/2, 1/4, or 1/8 size (other variations are possible), because the 8x8 DCT blocks can be decoded at reduced resolution using only a subset of the coefficients. These same size reductions are easy to do on progressive JPEGs as well, and the quantity of data to parse in a progressive file is much smaller if you only want a reduced-size image. Beyond that, your question is not specific enough to know what you really want to do.
Another simple trick to reduce the data size by 33% is to render the image into a RGB565 bitmap instead of RGB24 (if you don't need the full color space).
I don't know of a library that can do this off the shelf, but it's certainly possible.
Let's say your JPEG is using 8x8 pixel MCUs (the units in which pixels are grouped). Let's also say you are reducing by a factor of 12 to 1. The first output pixel needs to be the average of the 12x12 block of pixels at the top left of the input image. To get to the input pixels with a y coordinate greater than 8, you need to have decoded the start of the second row of MCUs; you can't really decode those pixels before decoding the whole of the first row of MCUs. In practice, that probably means you'll need to store two rows of decoded MCUs. Still, for a 12000x12000 pixel image (roughly 150 megapixels) you'd reduce the memory requirements by a factor of 12000/16 = 750. That should be enough for a PC. If you're looking at embedded use, you could horizontally resize the rows of MCUs as you read them, reducing the memory requirements by another factor of 12, at the cost of a little more code complexity.
I'd find a simple JPEG decoder library like Tiny Jpeg Decoder and look at the main loop in the JPEG decode function. In the case of Tiny Jpeg Decoder, the main loop calls decode_MCU; modify from there. :-)
You've got a bunch of fiddly work to do to make the code work for non-8x8 MCUs, and a load more if you want to reduce by a non-integer factor. Sounds like fun though. Good luck.
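The row-banded averaging described above can be sketched as follows. This is a pure-Python sketch where a generator of pixel rows stands in for decoded MCU rows; a real implementation would hook into the decoder's per-row output, and the function name and interface here are my own invention.

```python
def streamed_downsample(row_source, width, factor):
    """Average factor x factor blocks while buffering at most `factor` rows.

    row_source yields one decoded pixel row (a list of ints) at a time,
    so memory use is O(factor * width) instead of the whole image.
    """
    out, band = [], []
    for row in row_source:
        band.append(row)
        if len(band) == factor:
            # the band is full: emit one downsampled output row
            out_row = []
            for x in range(0, width - factor + 1, factor):
                block = [r[i] for r in band for i in range(x, x + factor)]
                out_row.append(sum(block) // len(block))
            out.append(out_row)
            band = []  # discard the band; only the next rows are needed
    return out
```
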