I have often seen, in neural network forward propagation, example vectors multiplied from the left (vector-matrix) and sometimes from the right (matrix-vector). Notation, some TensorFlow tutorials, and the datasets I have found seem to prefer the former over the latter, contrary to the way linear algebra tends to be taught (the matrix-vector way).
Moreover, the two conventions imply transposed layouts of the parameters: either the problem variables are enumerated along dimension 0, or the neurons are.
This confuses me and makes me wonder whether there really is a standard here or it has just been coincidence. If there is one, I would like to know whether it follows from some deeper reason. Answering this would put my mind at ease.
(By the way, I know that you will normally use matrices of examples instead of single vectors [or more complex tensors in conv nets, etc.] because of minibatching, but the point still holds.)
Not sure if this answer is what you are looking for, but in the context of TensorFlow, the standard is to use a dense layer (https://www.tensorflow.org/api_docs/python/tf/layers/dense), which is a higher-level abstraction that wraps up the affine transformation logic you are referring to.
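To illustrate the convention the question describes, and the one a dense layer follows (examples enumerated along dimension 0, outputs computed as input times kernel plus bias), here is a minimal NumPy sketch; the shapes and names are just for illustration:

    import numpy as np

    batch, n_features, n_neurons = 32, 12, 4

    X = np.random.randn(batch, n_features)      # examples enumerated along dimension 0
    W = np.random.randn(n_features, n_neurons)  # problem variables along dimension 0
    b = np.zeros(n_neurons)

    Y = X @ W + b                               # vector-matrix (row-vector) convention
    assert Y.shape == (batch, n_neurons)

    # The matrix-vector convention from linear algebra courses is just the transpose:
    # y = W.T @ x for a single column-vector example x of shape (n_features, 1).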
Is there a way to find the inverse of a neural network representation of a function in TensorFlow v1? I need this to find the optimal function in an optimization problem I am solving.
To be precise, the optimal function is found by minimizing the error, computed as the L2 norm of the difference between the approximate optimal function C* (coded as a neural network object) and the inverse of a value function V* (coded as another neural network object).
My problem is that I do not know how to write the inverse of V* in TensorFlow, as I cannot find anything like tf.inverse().
Any help is much appreciated. Thanks.
Unless I am misunderstanding the situation, I believe it is impossible to do this in a generalized way. Many functions do not have a perfect inverse. For a simple example, imagine a square(x) function that computes x^2. You might think the inverse is sqrt(y), but in reality the "correct" result could be either sqrt(y) or -sqrt(y), with no way of telling which is correct.
Similarly, with most neural networks I imagine it would be impossible to find the "true" mathematical inverse. There are architectures that attempt to train a neural net and its inverse simultaneously (autoencoders and BiGAN/ALI come to mind), and for some nets it might be possible to train an inverse empirically, but these can have extremely varying levels of accuracy that depend heavily on many factors.
Depending on how much control you have over V*, you might be able to design it in such a way that it is mathematically invertible (and then you would have to manually code the inverse), or you might be able to make it a simpler model that is not based on a neural net. However, if V* is an arbitrary preexisting net, then you're probably out of luck.
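If an approximate, local inverse at a specific target value is enough, one common trick is to optimize the input of the network directly. The question is about TF v1, but here is a minimal sketch of that idea in TF 2.x style; v_star below is a hypothetical stand-in for your value network, and the result is only a local, approximate inverse near the starting guess:

    import tensorflow as tf

    # Hypothetical stand-in for the user's value network V*.
    v_star = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="tanh", input_shape=(1,)),
        tf.keras.layers.Dense(1),
    ])

    def local_inverse(y_target, steps=500, learning_rate=0.05):
        """Search for an x with V*(x) ~ y_target by gradient descent on x.

        Only a local, approximate inverse near the initial guess; if V* is
        not injective, several equally valid answers may exist.
        """
        x = tf.Variable(tf.zeros((1, 1)))   # initial guess
        opt = tf.keras.optimizers.Adam(learning_rate)
        for _ in range(steps):
            with tf.GradientTape() as tape:
                loss = tf.reduce_sum(tf.square(v_star(x) - y_target))
            grads = tape.gradient(loss, [x])
            opt.apply_gradients(zip(grads, [x]))
        return x

    x_hat = local_inverse(tf.constant([[0.3]]))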
Further reading:
SO: local inverse of a neural network
AI.SE: Can we get the inverse of the function that a neural network represents?
I know CNNs have a lot of good properties, like weight sharing, memory savings, and feature extraction. However, this question confuses me: is there any situation in which a fully connected network is better than a CNN? Why?
Thanks a lot guys!
Is there any situation in which a fully connected network is better than a CNN?
Well, I think we should first define what we mean by "better". Accuracy and precision are not the only things to consider: computational time, degrees of freedom and difficulty of the optimization should also be taken into account.
First, consider an input of size h*w*c. Feeding this input to a convolutional layer with F feature maps and kernel size s results in about F*s*s*c learnable parameters (assuming there are no constraints on the ranks of the convolutions, otherwise we have even fewer parameters). Feeding the same input into a fully connected layer with the same number of feature maps results in F*d_1*d_2*w*h*c parameters (where d_1, d_2 are the dimensions of each feature map), which is clearly on the order of billions of learnable parameters for any input image of decent resolution.
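To make the gap concrete, here is a quick back-of-the-envelope calculation; the image and layer sizes are just illustrative:

    # Illustrative parameter counts for the formulas above.
    h, w, c = 224, 224, 3      # input height, width, channels (arbitrary example)
    F, s = 64, 3               # number of feature maps and kernel size
    d1, d2 = h, w              # assume each feature map keeps the spatial size

    conv_params = F * s * s * c                 # 1,728
    dense_params = F * d1 * d2 * w * h * c      # ~4.8e11, i.e. hundreds of billions

    print(conv_params, dense_params)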
While it can be tempting to think that we can get away with shallower networks (we already have lots of parameters, right?), fully connected layers are just linear layers after all, so we still need to insert many non-linearities for the network to gain reasonable representational power. This means you still need a deep network, but now with so many parameters that training it becomes intractable. In addition, a larger network has more degrees of freedom and will therefore model much more than what we want: it will model noise unless we feed it more data or constrain it.
So yes, there might be a fully connected network that in theory could give us better performance, but we don't know how to train it yet. Finally (and this is purely intuition, so it might be wrong), it seems unlikely to me that such a fully connected network would converge to a dense solution. Since many convolutional networks achieve very high accuracy (99% and up) on many tasks, I think the optimal solution the fully connected network would converge to would be close to the convolutional one. So we don't really need to train the fully connected network, just a subset of its architecture.
I have seen two ways of visualizing transposed convolutions from credible sources, and as far as I can see they conflict.
My question boils down to, for each application of the kernel, do we go from many (e.g. 3x3) elements with input padding to one, or do we go from one element to many (e.g. 3x3)?
Related question: Which version does tf.nn.conv2d_transpose implement?
The sources of my confusion are:
A guide to convolution arithmetic for deep learning has probably the most famous visualization out there, but it isn't peer-reviewed (arXiv).
The second is from Deconvolution and Checkerboard Artifacts, which technically isn't peer-reviewed either (Distill), but it is from a much more reputable source.
(The term deconvolution is used in the article, but it is stated that this is the same as transposed conv.)
Due to the nature of this question it is hard to search for results online; for example, this SO post takes the first position, but I am not sure to what extent I can trust it.
I want to stress a little more what Littleone also mentioned in his last paragraph:
A transposed convolution will reverse the spatial transformation of a regular convolution with the same parameters.
If you perform a regular convolution followed by a transposed convolution and both have the same settings (kernel size, padding, stride), then the input and output will have the same shape. This makes it super easy to build encoder-decoder networks with them. I wrote an article about different types of convolutions in Deep Learning here, where this is also covered.
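As a quick check of that claim, here is a minimal Keras sketch (the layer sizes are arbitrary; with "same" padding and stride 2 the spatial shapes round-trip exactly, while "valid" padding can be off by a pixel when the stride doesn't divide the input size):

    import tensorflow as tf

    x = tf.random.normal((1, 16, 16, 3))

    down = tf.keras.layers.Conv2D(8, kernel_size=3, strides=2, padding="same")
    up = tf.keras.layers.Conv2DTranspose(3, kernel_size=3, strides=2, padding="same")

    h = down(x)   # shape (1, 8, 8, 8): spatial size halved
    y = up(h)     # shape (1, 16, 16, 3): spatial size restored

    print(h.shape, y.shape)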
PS: Please don't call it a deconvolution
Fractionally strided convolutions, deconvolutions, and transposed convolutions all mean the same thing. Both articles are correct, and you don't need to be doubtful, as both of them are cited a lot. But the Distill image is drawn from a different perspective, as it is trying to illustrate the checkerboard-artifact problem.
The first visualisation is a transposed convolution with stride 2 and padding 1. If it were stride 1, there wouldn't be any padding in between the inputs. The padding on the borders depends on the dimensions of the output.
With a deconvolution we generally go from a smaller dimension to a larger one, and the input is generally padded to achieve the desired output dimensions. I believe the confusion arises from the padding patterns. Take a look at this formula:
output = (input - 1)*stride + kernel_size - 2*padding
It's a rearrangement of the general convolution output formula; output here refers to the output of the deconvolution operation. To best understand deconvolution, I suggest thinking in terms of the equation, i.e., flipping what a convolution does: how do I reverse what a convolution operation does?
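As a small sanity check of that formula, here is a sketch; the example numbers are just illustrative:

    def transposed_conv_output_size(input_size, stride, kernel_size, padding):
        """Spatial output size of a transposed convolution (no output padding)."""
        return (input_size - 1) * stride + kernel_size - 2 * padding

    # e.g. a 4x4 input with kernel 3, stride 2, padding 1 gives a 7x7 output
    print(transposed_conv_output_size(4, stride=2, kernel_size=3, padding=1))  # 7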
Hope that helps.
Good explanation by Justin Johnson (part of the Stanford CS231n MOOC):
https://youtu.be/ByjaPdWXKJ4?t=1221 (starts at 20:21)
He reviews strided conv and then he explains transposed convolutions.
What are the potential reasons for a neural network to output different values for the same input, especially when there aren't any random or stochastic processes involved?
This is a very broad and general question, maybe even too broad to be on here, but there are several things you should know about neural networks:
They are NOT methods for finding one perfect optimal solution. A neural network usually learns the examples it is given and "figures out" a way to predict results reasonably well. Reasonable is relative: for some models it may mean 50% success, while for others anything short of 99.9% is considered failure.
Their outcome is very dependent on the data they were trained on. The order of the data matters, and it's usually a good idea to shuffle data during training, but that can lead to wildly different results. The quality of the data also matters, for example if the training data is very different in nature from the test data.
The best analogy for neural networks in computing is, of course, the brain. Even with the same information and the same basic underlying biology, we can all develop different opinions based on endless other variables. The same holds, to some extent, for machine learning.
Some types of neural networks use dropout layers, which are specifically designed to switch off random parts of the network during training. This should not affect the final prediction process, because at prediction time that layer is usually set to let all parts of the network operate, but if you feed data in while telling the model it is "training" instead of asking it to predict, the results may vary significantly.
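As a minimal illustration of that last point (the toy model below is hypothetical), a Keras dropout layer is only active when the call is flagged as training:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1),
    ])

    x = tf.ones((1, 4))
    print(model(x, training=True))   # varies from call to call (dropout active)
    print(model(x, training=True))
    print(model(x, training=False))  # deterministic: dropout disabled
    print(model(x, training=False))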
The sum of all this is just to say: training neural networks should be expected to yield different results from similar starting conditions, so they must be tested multiple times for every condition to determine which parts of the behaviour are inevitable and which are not.
It might be due to shuffling of the data. If you want to use the same vector, you should turn the shuffle argument off.
You should try disabling dropout. Dropout randomly sets the outputs of certain neurons to 0. This will mean that your output will be different each time.
I've been using TensorFlow on and off for various things that I guess are considered rather easy these days: captcha cracking, basic OCR, things I remember from my AI education at university. These problems are reasonably large and therefore don't really lend themselves to experimenting efficiently with different NN architectures.
As you probably know, Joel Grus came out with FizzBuzz in TensorFlow. TL;DR: learning a mapping from a binary representation of a number (i.e. 12 bits encoding the number) to 4 bits (none_of_the_others, divisible by 3, divisible by 5, divisible by 15). For this toy problem, you can quickly compare different networks.
So I've been trying a simple feedforward network and wrote a program to compare various architectures: a 2-hidden-layer feedforward network, then 3 layers, different activation functions, ... Most architectures, well, suck. They get somewhere near a 50-60% success rate and stay there, regardless of how much training you do.
A few perform really well. For instance, two sigmoid-activated hidden layers with 23 neurons each work really well (89-90% correct after 2000 training epochs). Unfortunately, anything close to it is disastrously bad. Take one neuron out of the second layer and accuracy drops to 30%; the same happens when taking one out of the first layer ... A single hidden layer with 20 tanh-activated neurons also does pretty well, but most architectures reach little more than half this performance.
Now, given that for real problems I can't realistically do these kinds of architecture studies, are there ways to get good architectures that are guaranteed to work?
You might find the paper by Yoshua Bengio on Practical Recommendations for Gradient-Based Training of Deep Architectures helpful to learn more about hyperparameters and their settings.
If you're asking specifically for settings with more guaranteed success, I advise you to read up on Batch Normalization. I find that it decreases the failure rate for bad choices of the learning rate and weight initialization.
Some people also discourage the use of non-linearities like sigmoid() and tanh(), as they suffer from the vanishing gradient problem.
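For example, here is a minimal Keras sketch of the kind of small feedforward net discussed in the question (12 input bits, 4 output classes), using ReLU plus Batch Normalization instead of sigmoid/tanh; the layer sizes are arbitrary choices, not a recommendation:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, use_bias=False, input_shape=(12,)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dense(64, use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])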