Give an example visual recognition task where a fully connected network would be more suitable than a convolutional neural network - tensorflow

I know CNNs have a lot of good properties, like weight sharing, memory savings, and feature extraction. However, this question confuses me. Is there any possible situation where a fully connected network is better than a CNN? Why?
Thanks a lot guys!

Is there any possible situation where a fully connected network is better than a CNN?
Well, I think we should first define what we mean by "better". Accuracy and precision are not the only things to consider: computational time, degrees of freedom and difficulty of the optimization should also be taken into account.
First, consider an input of size h*w*c. Feeding this input to a convolutional layer with F feature maps and kernel size s results in about F*s*s*c learnable parameters (assuming there are no constraints on the ranks of the convolutions; otherwise we have even fewer parameters). Feeding the same input into a fully connected layer producing the same number of feature maps results in F*d_1*d_2*w*h*c parameters (where d_1, d_2 are the dimensions of each feature map), which is clearly on the order of billions of learnable parameters for any input image with decent resolution.
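To make the parameter count concrete, here is a minimal Keras sketch; the sizes are made up for illustration, and the input is kept tiny so the dense layer still fits in memory. At realistic resolutions (e.g. 224x224) the dense version grows into the hundreds of billions of parameters.

```python
import tensorflow as tf

h, w, c = 32, 32, 3   # toy resolution so the dense layer still fits in memory
F, s = 16, 3          # F feature maps, s x s kernels

# Convolutional layer: F*s*s*c weights + F biases.
conv = tf.keras.Sequential([
    tf.keras.Input(shape=(h, w, c)),
    tf.keras.layers.Conv2D(F, s, padding="same"),
])
print(conv.count_params())  # 448

# Fully connected layer producing the same F feature maps of size h*w:
# (h*w*c) * (F*h*w) weights + F*h*w biases, and this grows with the
# square of the resolution.
fc = tf.keras.Sequential([
    tf.keras.Input(shape=(h, w, c)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(F * h * w),
])
print(fc.count_params())  # 50,348,032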
While it can be tempting to think that we can get away with a shallower network (we already have lots of parameters, right?), fully connected layers are just linear layers after all, so we still need to insert many non-linearities for the network to gain reasonable representational power. This means you would still need a deep network, but now with so many parameters that training it would be intractable. In addition, a larger network has more degrees of freedom and will therefore model much more than what we want: it will model noise unless we feed it much more data or constrain it.
So yes, there might be a fully connected network that in theory could give us better performance, but we don't know how to train it yet. Finally, and this is purely intuition and therefore might be wrong, it seems unlikely to me that such a fully connected network would converge to a genuinely dense solution. Since many convolutional networks achieve very high accuracy (99% and up) on many tasks, I suspect the optimal solution the fully connected network would converge to would be close to the convolutional one. So we don't really need to train the fully connected network, just a subset of its architecture.

Related

How to arrange different layers in CNN

I searched many articles about convolutional neural networks and found that there are some good structures that I can refer to. For example, AlexNet, VGG, GoogleNet.
However, if I want to customize a CNN architecture myself, how should I arrange/order the different layers, e.g. convolution, dropout, max pooling? Is there any standard, or do I just keep trying different combinations until I get a good result?
In my opinion there isn't a standard per se, but there are common combinations:
1- If you want to create a deeper network, you can use residual blocks to avoid the vanishing gradient problem.
2- 3x3 convolutions are the de facto standard because they reduce computational cost: for example, three stacked 3x3 convolutions cover the same 7x7 receptive field as a single 7x7 convolution at a smaller cost (see the sketch after this list).
3- The main reason for dropout is to introduce regularization, which, as its authors claim, can also be achieved by batch normalization.
4- Before deciding what to improve and how to improve it, one must understand the problem they are trying to solve.
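To illustrate point 2, here is a small Keras sketch comparing a single 7x7 convolution with three stacked 3x3 convolutions; the channel count and spatial size are arbitrary.

```python
import tensorflow as tf

C = 64  # hypothetical number of input and output channels

# One 7x7 convolution: 7*7*C*C weights + C biases.
single = tf.keras.Sequential([
    tf.keras.Input(shape=(56, 56, C)),
    tf.keras.layers.Conv2D(C, 7, padding="same"),
])

# Three stacked 3x3 convolutions cover the same 7x7 receptive field
# with 3*(3*3*C*C) weights + 3*C biases, i.e. 27*C^2 vs 49*C^2.
stacked = tf.keras.Sequential([
    tf.keras.Input(shape=(56, 56, C)),
    tf.keras.layers.Conv2D(C, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(C, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(C, 3, padding="same"),
])

print(single.count_params())   # 200,768
print(stacked.count_params())  # 110,784
```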
You can go through the case study that was taught at Stanford:
Stanford case study
The video can help you understand many of these combinations, how they lead to model improvements, and how to build your own network.
You generally want to put a pooling layer after a convolutional layer. Also, you can think of dropout as a parameter applied to a layer rather than as a separate layer altogether, whichever is easier for you to envision.
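For a concrete picture of a typical ordering, here is a minimal Keras sketch; the layer sizes and number of blocks are placeholders, not a recommendation. The pattern is Conv -> activation -> pooling, repeated, then Flatten -> Dense, with dropout applied to the dense part.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),                  # pooling after convolution
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                    # dropout "applied to" the dense layer above
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```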

Convolutional Neural Networks - reasons for changing image data range

When looking at data augmentation techniques for input images in a Convolutional Neural Network, it is often mentioned that you can change/rescale the range of image values from [0,255] to [0,1].
What is the reasoning behind this?
This is scaling, part of preprocessing inputs for any network, not just a CNN. Why is it done? To keep the ranges of all the features in the same region. You can refer to this answer for more information.
But here, in your case, the only features are the pixel intensities of the image, so why do you need scaling? Because most of the parameter initialization that is done automatically by the framework you are using assumes that the data passed to it is normalized. It also tends to make the network converge faster, since many researchers have spent time figuring out the right initialization for network parameters under that assumption.
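In practice the rescaling is just a division before the data reaches the network; a minimal sketch, assuming 8-bit image data:

```python
import numpy as np

# Hypothetical batch of 8-bit RGB images with values in [0, 255]
images = np.random.randint(0, 256, size=(16, 32, 32, 3), dtype=np.uint8)

# Rescale to [0, 1] before feeding the network
images = images.astype("float32") / 255.0
```

Recent Keras versions also offer a Rescaling layer, so the division can live inside the model itself and stay consistent between training and inference.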

Is it possible to train a Neural Network with a low number of instances?

I have faced a problem where I need to solve a regression task using as few instances as possible. When I tried Xgboost, I only had to feed it 4 instances to get a reasonable result. But a multilayer perceptron tuned for regression needs 20 instances; I tried changing the number of neurons and layers, but the answer is still 20. Is it possible to make a neural network solve regression tasks with 2 to 4 instances? If yes, please explain what I should do to succeed. Maybe there is some correlation between how many instances are needed to train a perceptron to a reasonable result and how informative the features in the dataset are?
Thanks in advance for any help
With small numbers of samples there are likely better methods to apply; Xgboost definitely comes to mind as a method that does quite well at avoiding overfitting.
Neural networks tend to work well with larger numbers of samples. They often overfit small datasets and underperform other algorithms.
There is, however, an active area of research in semi-supervised techniques using neural networks with large datasets of unlabeled data and small datasets of labeled samples.
Here's a paper to start you down that path; search on 'semi-supervised learning' for more.
http://vdel.me.cmu.edu/publications/2011cgev/paper.pdf
Another area of interest to reduce overfitting in smaller datasets is in multi-task learning.
http://ruder.io/multi-task/
Multi-task learning requires the network to achieve multiple target goals for a given input. Adding more requirements tends to reduce the space of solutions the network can converge on, and it often achieves better results because of it. To say that differently: when multiple objectives are defined, the parameters necessary to do well at one task are often beneficial for the other task, and vice versa.
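As a rough illustration of the idea, here is a minimal Keras sketch of a shared trunk with two task-specific heads trained jointly; all names, sizes, and loss weights are placeholders.

```python
import tensorflow as tf

# Shared trunk over a hypothetical 64-dimensional input.
inputs = tf.keras.Input(shape=(64,))
shared = tf.keras.layers.Dense(128, activation="relu")(inputs)
shared = tf.keras.layers.Dense(64, activation="relu")(shared)

# Two heads: the main regression target plus an auxiliary classification task.
main_output = tf.keras.layers.Dense(1, name="regression")(shared)
aux_output = tf.keras.layers.Dense(4, activation="softmax", name="aux_class")(shared)

model = tf.keras.Model(inputs, [main_output, aux_output])
model.compile(
    optimizer="adam",
    loss={"regression": "mse", "aux_class": "sparse_categorical_crossentropy"},
    loss_weights={"regression": 1.0, "aux_class": 0.3},  # auxiliary task weighted lower
)
```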
Lastly, another area of open research is GANs and how they might be used in semi-supervised learning. No papers pop to the forefront of my mind on the subject just now, so I'll leave this mention as a footnote.

Neural network gives different output for same input

What are the potential reasons for a NN to output different values for the same input, especially when there aren't any random or stochastic processes involved?
This is a very broad and general question, maybe even too broad to be on here, but there are several things you should know about neural networks:
They are NOT methods for finding one perfect, optimal solution. A neural network usually learns the examples it is given and "figures out" a way to predict results reasonably well. Reasonable is relative: for some models it may mean 50% success, and for others anything short of 99.9% will be considered failure.
Their outcome is very dependent on the data they were trained on. The order of the data matters, and it's usually a good idea to shuffle data during training, but that can lead to wildly different results. The quality of the data also matters, for example if the training data is very different in nature from the test data.
The best analogy for neural networks in computing is, of course, the brain. Even with the same information and the same basic underlying biology, we could all evolve different opinions based on endless other variables. The same holds for machine learning to some extent.
Some types of neural networks use dropout layers, which are specifically designed to shut off random parts of the network during training. This should not affect the final prediction process, because at prediction time that layer is usually set to let all parts of the network operate, but if you feed the model data while telling it that it is "training" instead of asking it to predict, the results may vary significantly.
The sum of all this is just to say: training a neural network should be expected to yield different results from similar starting conditions, so it must be tested multiple times for every condition to determine which parts of the variation are inevitable and which are not.
It might be due to shuffling of the data. If you want to feed the same vectors in the same order, you should turn the shuffle argument off.
You should try disabling dropout. Dropout randomly sets the outputs of certain neurons to 0, which will make your output different each time.
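A quick way to check whether dropout is the culprit is to compare training-mode and inference-mode calls on the same input; a minimal sketch with arbitrary layer sizes:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),
])

x = tf.ones((1, 4))

# In training mode dropout zeroes random units, so repeated calls differ.
print(model(x, training=True))
print(model(x, training=True))

# In inference mode dropout is a no-op, so the output is deterministic.
print(model(x, training=False))
print(model(x, training=False))
```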

Tensorflow: how to find good neural network architectures/hyperparameters?

I've been using tensorflow on and off for various things that I guess are considered rather easy these days. Captcha cracking, basic OCR, things I remember from my AI education at university. They are problems that are reasonably large and therefore don't really lend themselves to experimenting efficiently with different NN architectures.
As you probably know, Joel Grus came out with FizzBuzz in TensorFlow. TL;DR: learning a mapping from a binary representation of a number (i.e. 12 bits encoding the number) to 4 bits (none_of_the_others, divisible by 3, divisible by 5, divisible by 15). For this toy problem you can quickly compare different networks.
So I've been trying a simple feedforward network and wrote a program to compare various architectures: a 2-hidden-layer feedforward network, then 3 layers, different activation functions, ... Most architectures, well, suck. They get somewhere near a 50-60% success rate and stay there, no matter how much training you do.
A few perform really well. For instance, a sigmoid-activated network with two hidden layers of 23 neurons each works really well (89-90% correct after 2000 training epochs). Unfortunately, anything close to it is rather disastrously bad: take one neuron out of the second layer and it drops to 30% correct, and the same happens when taking one out of the first layer. A single hidden layer with 20 tanh-activated neurons also does pretty well, but most architectures reach little over half this performance.
Now, given that for real problems I can't realistically do these sorts of studies of different architectures, are there ways to get good architectures that are guaranteed to work?
You might find the paper by Yoshua Bengio on Practical Recommendations for Gradient-Based Training of Deep Architectures helpful to learn more about hyperparameters and their settings.
If you're asking specifically for settings that are more likely to succeed, I advise you to read up on Batch Normalization. I find that it decreases the failure rate for bad picks of the learning rate and weight initialization.
Some people also discourage the use of saturating non-linearities like sigmoid() and tanh(), as they suffer from the vanishing gradient problem; ReLU-style activations are a common alternative.
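As a rough sketch of what these recommendations look like in code, assuming the FizzBuzz-style setup from the question (a 12-bit input and 4 output classes; all layer sizes are arbitrary):

```python
import tensorflow as tf

# Dense -> BatchNorm -> ReLU blocks; biases are dropped because
# batch normalization already provides a learnable shift.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(12,)),
    tf.keras.layers.Dense(64, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(64, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```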