Convolutional Neural Networks - reasons for changing image data range - tensorflow

When looking at data augmentation techniques for input images in a Convolutional Neural Network, it is often mentioned that you can change/rescale the range of image values from [0,255] to [0,1].
What is the reasoning behind this?

This is scaling, a standard part of preprocessing inputs for any network, not just CNNs. Why is it done? To keep the ranges of all the features in the same region. You can refer to this answer for more information on that.
But in your case the only features are the pixel intensities of the image, so why would you need scaling here? Because most of the parameter initialization schemes that your framework applies automatically assume that the data passed to the network is normalized. With inputs in that range the network tends to converge faster, since many researchers have spent time figuring out the right initialization for the network parameters.
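As a rough illustration of the rescaling discussed above, here is a minimal sketch in plain Python with no framework assumed. The function name `rescale_pixels` and the tiny 2x2 image are made up for the example; in practice the same division would be applied to a NumPy array or a tensor.

```python
def rescale_pixels(image, max_value=255.0):
    """Map pixel intensities in [0, max_value] to floats in [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


# Hypothetical 2x2 grayscale image with 8-bit intensities.
image = [[0, 51], [204, 255]]
scaled = rescale_pixels(image)
print(scaled)
```

Every value ends up between 0 and 1, which is the range most default weight initializations are tuned for.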

Related

Give an example visual recognition task where a fully connected network would be more suitable than a convolutional neural network

I know CNNs have a lot of good properties, like weight sharing, memory savings and feature extraction. However, this question confuses me. Is there any situation in which a fully connected network is better than a CNN? Why?
Thanks a lot guys!
Is there any situation in which a fully connected network is better than a CNN?
Well, I think we should first define what we mean by "better". Accuracy and precision are not the only things to consider: computational time, degrees of freedom and difficulty of the optimization should also be taken into account.
First, consider an input of size h*w*c. Feeding this input to a convolutional layer with F feature maps and kernel size s results in about F*s*s*c learnable parameters (assuming there are no constraints on the ranks of the convolutions; otherwise we have even fewer parameters). Feeding the same input into a fully connected layer that produces the same number of feature maps results in F*d_1*d_2*w*h*c parameters (where d_1, d_2 are the dimensions of each feature map), which is clearly on the order of billions of learnable parameters for any input image of decent resolution.
While it can be tempting to think that we can get away with shallower networks (we already have lots of parameters, right?), fully connected layers are just linear layers after all, so we still need to insert many non-linearities for the network to gain reasonable representational power. This means you still need a deep network, but with so many parameters that training it would be intractable. In addition, a larger network has more degrees of freedom and will therefore model much more than we want: it will model noise unless we feed it more data or constrain it.
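To make the two counts concrete, here is a small arithmetic sketch. The input size (224x224 RGB), the 64 feature maps, the 3x3 kernel and the assumption that each feature map keeps the input's spatial size are illustrative choices, not values from the question. Biases are ignored, as in the formulas above.

```python
def conv_params(num_maps, kernel_size, channels):
    # F * s * s * c weights: one s-by-s-by-c kernel per feature map.
    return num_maps * kernel_size * kernel_size * channels


def dense_params(num_maps, map_h, map_w, in_h, in_w, channels):
    # Each of the F * d_1 * d_2 output units connects to all h * w * c inputs.
    return num_maps * map_h * map_w * in_h * in_w * channels


# Assumed example sizes.
h, w, c = 224, 224, 3
F, s = 64, 3

conv_count = conv_params(F, s, c)
dense_count = dense_params(F, h, w, h, w, c)
print("convolutional layer:", conv_count)
print("fully connected layer:", dense_count)
```

With these sizes the convolutional layer has under two thousand weights, while the fully connected layer needs hundreds of billions.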
So yes, there might be a fully connected network that in theory could give us better performance, but we don't know how to train it yet. Finally, and this is purely based on intuition and therefore might be wrong, but it seems unlikely to me that such a fully connected network would converge to a dense solution. Since many convolutional networks achieve very high levels of accuracy (99% and up) on many tasks, I think that the optimal solution the fully connected network would converge to would be close to the convolutional network. So, we don't really need to train the fully connected one, but just a subset of its architecture.

Do I need every class in a training image for object detection?

I am just starting to dive into TensorFlow's Object Detection. So far I have a very small training set of about 40 images. Each image can contain up to 3 classes. Now I am wondering: does every training image need to contain every class? Is that important for efficient training? Or is it okay if an image contains only one of the object classes?
I get a very high total loss of ~8.0 and thought this might be the reason, but I couldn't find an answer.
In general machine learning systems can cope with some amount of noise.
An image missing labels or having the wrong labels is fine as long as overall you have sufficient data for the model to figure it out.
40 examples for image classification sounds very small. It might work if you start with a pre-trained image network and there are few classes that are very easy to distinguish.
Ignore the absolute loss value; on its own it doesn't mean anything. Look at the curve to see that the loss is decreasing, and stop the training when the curve flattens out. Compare the training loss to the loss on a test dataset to check that the values are sufficiently similar (that is, you are not overfitting). You might also be able to compare against another training run of the exact same system (to check whether the training is stable, for example).
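The advice about watching the curve rather than the absolute value can be sketched as a simple plateau check. This is only one possible heuristic: the function name `has_plateaued`, the window size and the improvement threshold are assumptions made for the example, and the loss values are invented.

```python
def has_plateaued(losses, window=3, min_improvement=0.01):
    """Return True when the mean of the last `window` losses improved by less
    than `min_improvement` compared with the mean of the window before it."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge the trend
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return (previous - recent) < min_improvement


# Invented loss curves: one still decreasing, one that has flattened out.
still_falling = [8.0, 6.0, 4.5, 3.5, 2.8, 2.3]
flattened = [8.0, 4.0, 2.0, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2]

print(has_plateaued(still_falling))
print(has_plateaued(flattened))
```

Note that the starting value of 8.0 plays no role in the decision; only the recent trend does.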

What's the value of random scale / crop / brightness in image classifier

When we retrain the image classifier layer in Mobilenet, the retrain script allows us to specify several parameters to preprocess the input images:
random_scale
random_crop
random_brightness
How should I determine these values? I saw that some articles set random_brightness and random_scale to 30, and random_crop to 0.
Can someone help me to understand these parameters?
Found the answer from this link: https://github.com/tensorflow/hub/blob/master/docs/tutorials/image_retraining.md
A common way of improving the results of image training is by deforming, cropping, or brightening the training inputs in random ways. This has the advantage of expanding the effective size of the training data thanks to all the possible variations of the same images, and tends to help the network learn to cope with all the distortions that will occur in real-life uses of the classifier. The biggest disadvantage of enabling these distortions in our script is that the bottleneck caching is no longer useful, since input images are never reused exactly. This means the training process takes a lot longer (many hours), so it's recommended you try this as a way of polishing your model only after you have one that you're reasonably happy with.
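As a rough sketch of what such settings could mean, the snippet below treats each value as a percentage: `random_scale=30` enlarges the image by up to 30%, `random_brightness=30` multiplies pixel values by a factor between 0.7 and 1.3, and `random_crop` adds a fixed margin that is later cropped away. That interpretation is an assumption and should be checked against the retrain script itself; the helper names `sample_distortion` and `apply_brightness` are hypothetical.

```python
import random


def sample_distortion(random_crop, random_scale, random_brightness, rng):
    """Sample one (scale factor, brightness factor) pair from percentage settings."""
    margin_scale = 1.0 + random_crop / 100.0
    resize_scale = rng.uniform(1.0, 1.0 + random_scale / 100.0)
    brightness = rng.uniform(1.0 - random_brightness / 100.0,
                             1.0 + random_brightness / 100.0)
    return margin_scale * resize_scale, brightness


def apply_brightness(image, factor):
    """Multiply pixels (floats in [0, 1]) by `factor` and clip back to [0, 1]."""
    return [[min(1.0, max(0.0, p * factor)) for p in row] for row in image]


rng = random.Random(0)  # seeded only to make the sketch repeatable
scale, brightness = sample_distortion(0, 30, 30, rng)
image = [[0.2, 0.5], [0.8, 1.0]]  # invented 2x2 grayscale image
print(scale, brightness)
print(apply_brightness(image, brightness))
```

Because the factors are redrawn for every image on every pass, no two training inputs are exactly alike, which is why the bottleneck cache stops being useful.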

Image classification / detection - Objects being used in real life vs. stock photo images?

When training detection models, are images of objects being used in real life better (i.e. higher accuracy / mAP) than stock photos of the same objects?
The more variety the better. If you train a network on images that all have a white background and expect it to perform under conditions with noisy backgrounds, you should expect worse results on unseen data, because the network never had a chance to learn the features that distinguish the target object from background objects.
If you have images with transparent backgrounds one form of data augmentation that would be expected to improve results would be to place that image against many random backgrounds. The closer you come to realistic renderings of an image the better you can expect your results to be.
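The background-swapping idea can be sketched with plain alpha blending. Everything here is illustrative: the helper names are made up, images are small 2D lists of grayscale floats, and randomly generated pixels stand in for real background photos purely to keep the example self-contained. In practice you would sample realistic scenes instead.

```python
import random


def composite(foreground, alpha, background):
    """Alpha-blend a foreground onto a background, pixel by pixel.

    All three arguments are equal-sized 2D lists; alpha runs from
    0.0 (fully transparent) to 1.0 (fully opaque)."""
    return [
        [a * f + (1.0 - a) * b for f, a, b in zip(f_row, a_row, b_row)]
        for f_row, a_row, b_row in zip(foreground, alpha, background)
    ]


def random_background(height, width, rng):
    # Placeholder for loading a real background image.
    return [[rng.random() for _ in range(width)] for _ in range(height)]


def augment_with_backgrounds(foreground, alpha, count, seed=0):
    """Return `count` copies of the foreground, each on a different background."""
    rng = random.Random(seed)
    height, width = len(foreground), len(foreground[0])
    return [
        composite(foreground, alpha, random_background(height, width, rng))
        for _ in range(count)
    ]


# Invented 2x2 object: left column opaque, right column transparent.
foreground = [[1.0, 0.5], [0.2, 0.0]]
alpha = [[1.0, 0.0], [1.0, 0.0]]
variants = augment_with_backgrounds(foreground, alpha, count=3)
print(len(variants))
```

The opaque object pixels are identical in every variant while the surroundings change, so the background cannot serve as a shortcut for the network.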
The more realistic examples you can augment your training dataset with, the better. Note that it generally does not help to add random noise to your data to generate larger training datasets; results only improve when the expanded dataset contains realistic variants of the original images.
My motto when training neural networks is this: the network will cheat any chance it gets. It will learn impressively well, but given the opportunity, it will take shortcuts. Don't let it take shortcuts. That often translates to: make the problem harder, so that no shortcut exists for it to take. Neural networks often perform better under more difficult conditions because the simplest solution they can arrive at is then also the most general-purpose one. Read up on multi-task learning for some exciting examples that provide great food for thought.

Neural network gives different output for same input

What are the potential reasons for a NN to output different values for the same input? Especially when there aren't any random or stochastic processes?
This is a very broad and general question, maybe even too broad to be on here, but there are several things you should know about neural networks:
They are NOT methods for finding one perfect optimal solution. A neural network usually learns from the examples it is given and "figures out" a way to predict results reasonably well. Reasonable is relative: for some models it may mean 50% success, and for others anything short of 99.9% will be considered a failure.
Their outcome is very dependent on the data they were trained on. The order of the data matters, and it's usually a good idea to shuffle data during training, but that can lead to wildly different results. The quality of the data matters too, for example if the training data is very different in nature from the test data.
The best analogy of neural networks in computing is of course - the brain. Even with the same information and same basic underlying biology, we could all evolve different opinions on matters based on endless other variables. Same thing with computer learning to some extent.
Some types of neural networks use dropout layers, that are specifically designed to shut off random parts of the network during training. This should not affect the final prediction process, because for predictions that layer is usually set to allow all the parts of the network to operate, but if you are inputting data and telling the model it is "training" instead of asking it to predict, the results may vary significantly.
The sum of all this is just to say: The training of neural networks should be expected to yield different results from similar starting conditions, and so must be tested multiple times for every condition to determine what parts of it are inevitable and what parts are not.
It might be due to shuffling of the data. If you want to use the same vector, you should turn the shuffle argument off.
You should try disabling dropout. Dropout randomly sets the outputs of certain neurons to 0. This will mean that your output will be different each time.
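The difference between training mode and prediction mode can be sketched in a few lines. This is a minimal stand-in for a dropout layer (the "inverted dropout" variant, which rescales the surviving values), not the implementation of any particular framework; the function name and the activation values are invented.

```python
import random


def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` during training and rescale
    the survivors; pass the values through unchanged at prediction time."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random()  # deliberately unseeded
activations = [1.0, 2.0, 3.0, 4.0]

# Prediction mode: the same input always gives the same output.
inference_a = dropout(activations, 0.5, training=False, rng=rng)
inference_b = dropout(activations, 0.5, training=False, rng=rng)
print(inference_a == inference_b)

# Training mode: which units are zeroed changes from call to call.
print(dropout(activations, 0.5, training=True, rng=rng))
print(dropout(activations, 0.5, training=True, rng=rng))
```

If a model gives different outputs for the same input, checking whether it is accidentally being run in training mode is a cheap first step.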