How to visualize (and understand) transposed convolutions? - tensorflow

I have seen two ways of visualizing transposed convolutions from credible sources, and as far as I can see they conflict.
My question boils down to, for each application of the kernel, do we go from many (e.g. 3x3) elements with input padding to one, or do we go from one element to many (e.g. 3x3)?
Related question: Which version does tf.nn.conv2d_transpose implement?
The sources of my confusion are:
A guide to convolution arithmetic for deep learning has probably the most famous visualization out there, but it isn't peer reviewed (arXiv).
The second is from Deconvolution and Checkerboard Artifacts, which technically isn't peer reviewed either (Distill), but it is from a much more reputable source.
(The term deconvolution is used in the article, but it is stated that this is the same as transposed conv.)
Due to the nature of this question it is hard to look for results online, e.g. this SO post takes the first position, but I am not sure to what extent I can trust it.

I want to stress a little more what Littleone also mentioned in his last paragraph:
A transposed convolution will reverse the spatial transformation of a regular convolution with the same parameters.
If you perform a regular convolution followed by a transposed convolution and both have the same settings (kernel size, padding, stride), then the input and output will have the same shape, as long as the stride fits the padded input without a remainder (otherwise the output can come out slightly smaller). This makes it super easy to build encoder-decoder networks with them. I wrote an article about different types of convolutions in Deep Learning here, where this is also covered.
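The shape claim above can be checked with the usual size formulas. This is a minimal sketch in plain Python; the function names and the settings (kernel 3, stride 2, padding 1) are chosen here for illustration and are not from the original answer.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a regular convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def transposed_conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution, no extra output padding."""
    return (size - 1) * stride + kernel - 2 * padding


# Hypothetical settings, shared by the convolution and its transpose.
kernel, stride, padding = 3, 2, 1

for input_size in (5, 6):
    encoded = conv_output_size(input_size, kernel, stride, padding)
    decoded = transposed_conv_output_size(encoded, kernel, stride, padding)
    print(input_size, "->", encoded, "->", decoded)
```

With these settings an input of size 5 comes back as 5, but an input of size 6 comes back as 5 as well, because the floor in the forward formula discards a remainder. That ambiguity is presumably why tf.nn.conv2d_transpose asks for an explicit output_shape.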
PS: Please don't call it a deconvolution

Fractionally strided convolutions, deconvolutions and transposed convolutions all refer to the same operation. Both sources are correct, and you don't need to be doubtful, as both of them are cited a lot. But the Distill image is drawn from a different perspective, as it's trying to show the artifacts problem.
The first visualisation is a transposed convolution with stride 2 and padding 1. If it were stride 1, there wouldn't be any padding in between the inputs. The padding on the borders depends on the dimension of the output.
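To see that the two pictures describe the same operation, here is a small 1D sketch in plain Python. The "scatter" function follows the one-to-many picture (each input element stamps a scaled copy of the kernel into the output), and the "gather" function follows the many-to-one picture (zeros inserted between inputs, border padding, then a stride-1 slide of the flipped kernel). The function names and the example values are assumptions made for illustration, and the sketch assumes padding <= kernel size - 1.

```python
def transposed_conv1d_scatter(x, kernel, stride, padding):
    """One-to-many view: every input element adds a scaled kernel to the output."""
    k = len(kernel)
    full = [0.0] * ((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            full[i * stride + j] += value * weight
    # Padding of the matching forward convolution is cropped from both ends.
    return full[padding:len(full) - padding]


def transposed_conv1d_gather(x, kernel, stride, padding):
    """Many-to-one view: zero-insert, pad the borders, slide the flipped kernel."""
    k = len(kernel)
    dilated = []
    for i, value in enumerate(x):
        dilated.append(float(value))
        if i < len(x) - 1:
            dilated.extend([0.0] * (stride - 1))  # zeros between inputs
    border = k - 1 - padding  # assumes padding <= k - 1
    padded = [0.0] * border + dilated + [0.0] * border
    flipped = kernel[::-1]
    return [
        sum(padded[start + j] * flipped[j] for j in range(k))
        for start in range(len(padded) - k + 1)
    ]


x = [1, 2, 3]
kernel = [1, 10, 100]
scatter = transposed_conv1d_scatter(x, kernel, stride=2, padding=1)
gather = transposed_conv1d_gather(x, kernel, stride=2, padding=1)
print(scatter)
print(gather)
```

Both calls print the same five values, so for each kernel application you can think of it either way: the result is identical, only the bookkeeping differs.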
With a deconvolution, we generally go from a smaller dimension to a higher dimension, and the input data is generally padded to achieve the desired output dimensions. I believe the confusion arises from the padding patterns. Take a look at this formula:
output = (input - 1) * stride + kernel_size - 2 * padding_of_output
It's a rearrangement of the general convolution output formula. Output here refers to the output of the deconvolution operation. To best understand deconvolution, I suggest thinking in terms of the equation, i.e., flipping what a convolution does. It's asking: how do I reverse what a convolution operation does?
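For reference, the rearrangement can be written out as follows. The symbols are introduced here for illustration: i is the input size of the regular convolution, o its output size, k the kernel size, s the stride and p the padding, and the division is assumed to leave no remainder.

```latex
\begin{align}
o &= \frac{i + 2p - k}{s} + 1 \\
(o - 1)\,s &= i + 2p - k \\
i &= (o - 1)\,s + k - 2p
\end{align}
```

The transposed convolution takes o as its input size and produces i as its output size, which is the formula given above with input = o and output = i.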
Hope that helps.

Good explanation from Justin Johnson (part of the Stanford cs231n mooc):
https://youtu.be/ByjaPdWXKJ4?t=1221 (starts at 20:21)
He reviews strided conv and then he explains transposed convolutions.

Related

Is Capsule Network really rotationally invariant in practice?

Capsule networks are said to perform well under rotation. Is that really the case?
I trained a capsule network on (train-dataset) and reached a training accuracy of ~100%.
I tested the network on (test-dataset-original) and got a test accuracy of ~99%.
I then rotated (test-dataset-original) by 0.5 degrees to get (test-dataset-rotate0p5) and by 1 degree to get (test-dataset-rotate1), and the test accuracy dropped to just ~10%.
I used the network from this repo as a starting point: https://github.com/naturomics/CapsNet-Tensorflow
10% accuracy on rotated test data is not acceptable at all. Perhaps something is not implemented correctly.
We implemented CapsNet on some non-English digit datasets (similar to MNIST) and the results were unbelievably good.
The implemented model was invariant not only to rotation but also to other transforms such as pan, zoom and perspective.
The first layer of a capsule network is a normal convolution. The filters here are not rotation invariant; a pose matrix is only applied to the output feature maps by the primary capsule layer.
I think this is why you also need to show the CapsNet rotated images, though far fewer than for normal convnets.
Capsule networks encapsulate vectors or 4x4 matrices in a neural network. However, matrices can be used for many things, rotations being just one of them. There's no way the network can know that you want to use the encapsulated representation for rotations unless you specifically show it rotated examples, so it can learn to use the representation that way.
Capsule networks came into existence to solve the viewpoint variance problem in convolutional neural networks (CNNs). CapsNet is said to be viewpoint invariant, which includes rotational and translational invariance.
CNNs get translational invariance from max-pooling, but that results in information loss in the receptive field. As the network goes deeper, the receptive field gradually increases, and hence max-pooling in deeper layers causes more information loss. This results in a loss of spatial information, and only local/temporal information is learned by the network. CNNs fail to learn the bigger picture of the input.
The weights Wij (between primary and secondary capsule layer) are backpropagated to learn the affine transformation on the entity represented by the ith capsule in primary layer and make a predicted vector uj|i. So basically this Wij is responsible for learning rotational transformations for a given entity.

How to arrange different layers in CNN

I searched many articles about convolutional neural networks and found that there are some good structures that I can refer to. For example, AlexNet, VGG, GoogleNet.
However, if I want to design a CNN architecture myself, how should I arrange/order the different layers (e.g. convolution layer, dropout, max pooling)? Is there any standard, or do I just keep trying different combinations until one produces a good result?
In my opinion there isn't a standard per se, but there are some common combinations:
1-If you want to create a deeper network, you can use residual blocks to avoid the vanishing gradient problem.
2-3x3 convolutions are the usual choice because they reduce computational cost, e.g. three stacked 3x3 convolutions cover the same receptive field as a single 7x7 convolution at a smaller cost.
3-The main reason for dropout is to introduce regularization, which can also be achieved by batch normalization, as its authors claim.
4-Before deciding what to improve and how to improve it, you must understand the problem you are trying to solve.
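The cost argument in point 2 can be checked with a little arithmetic. This is a rough sketch in plain Python that counts weights only (no biases) and assumes stride-1 convolutions with the same number of input and output channels; the helper names and the channel count of 64 are made up for illustration.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions of one kernel size."""
    return 1 + layers * (kernel - 1)


def conv_weight_count(kernel, channels, layers=1):
    """Weights (biases ignored) of `layers` convolutions, `channels` in and out."""
    return layers * kernel * kernel * channels * channels


channels = 64  # hypothetical channel count

print(stacked_receptive_field(3, layers=3))      # 7, same as one 7x7 kernel
print(conv_weight_count(3, channels, layers=3))  # 110592
print(conv_weight_count(7, channels))            # 200704
```

Three 3x3 layers need 27 * C^2 weights against 49 * C^2 for one 7x7 layer, and they also apply three non-linearities instead of one.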
You can go through the case study which was taught at Stanford:
Stanford case study
The video can help you understand many of these combinations and how they result in model improvement, and can help you build your network.
You generally want to put a pooling layer after a convolutional layer. Also, you can think of dropout as a parameter that is applied to a layer, and not a separate layer altogether -- whichever is easier for you to envision.

Standard parameter representation in neural networks

Many times I have seen in neural network forward propagation that example vectors are multiplied from the left (vector-matrix) and sometimes from the right (matrix-vector). Notation, some TensorFlow tutorials and the datasets I have found seem to prefer the former over the latter, contrary to the way in which linear algebra tends to be taught (the matrix-vector way).
Moreover, they represent inverted ways of representing parameters: enumerate problem variables in dimension 0 or enumerate neurons in dimension 0.
This confuses me and makes me wonder whether there is really a standard here or whether it has only been coincidence. If there is, I would like to know if the standard follows some deeper reasons. I would feel much better having this question answered.
(By the way, I know that you will normally use example matrices instead of vectors [or more complex things in conv nets, etc.] because of the use of minibatches, but the point still holds.)
Not sure if this answer is what you are looking for, but in the context of Tensorflow, the standard is to use a dense layer (https://www.tensorflow.org/api_docs/python/tf/layers/dense) which is a higher level abstraction that wraps up the affine transformation logic you are referring to.
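As a side note, the two conventions compute the same numbers and differ only by a transpose. The sketch below shows this in plain Python with hand-written helpers; the weights, the input and all names are made up for illustration and are not taken from TensorFlow.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


# Hypothetical layer with 2 input features and 3 neurons.
w_features_first = [[1, 2, 3], [4, 5, 6]]      # shape (features, neurons)
w_neurons_first = transpose(w_features_first)  # shape (neurons, features)

x_row = [[10, 20]]        # one example as a row vector, shape (1, features)
x_col = transpose(x_row)  # same example as a column vector, shape (features, 1)

row_result = matmul(x_row, w_features_first)  # vector-matrix, shape (1, neurons)
col_result = matmul(w_neurons_first, x_col)   # matrix-vector, shape (neurons, 1)

print(row_result)
print(transpose(col_result))
```

Both lines print [[90, 120, 150]]. One practical reason often given for the row-vector form is that a minibatch is then just a matrix with one example per row, so dimension 0 is the batch dimension.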

Fully convolutional neural network for semantic segmentation

I have perhaps a naive question, and sorry if this is not the appropriate channel for this kind of question. I have successfully implemented an FCNN for semantic segmentation, but I don't use deconvolution or unpooling layers.
What I simply do, is to resize the ground truth image to the size of my final FCNN layer and then I compute my loss. In this way, I obtain a smaller image as output, but correctly segmented.
Is the process of deconvolution or unpooling needed at all?
I mean, resizing images in Python is quite easy, so why should one involve complicated techniques such as deconv or unpooling to do the same? Surely I am missing something.
What's the advantage in enlarging images using unpooling and performing deconv?
The output of your network after the convolution steps is smaller than your original image: you probably don't want that, you want to have semantic segmentation for the image you give it as input.
If you simply resize it to its original size, new pixels will be interpolated and will therefore lack precision. Deconvolution layers allow the network to learn this resizing (their weights are learned during training, through backpropagation), and therefore to increase your segmentation precision.

Tensorflow: how to find good neural network architectures/hyperparameters?

I've been using tensorflow on and off for various things that I guess are considered rather easy these days. Captcha cracking, basic OCR, things I remember from my AI education at university. They are problems that are reasonably large and therefore don't really lend themselves to experimenting efficiently with different NN architectures.
As you probably know, Joel Grus came out with FizzBuzz in tensorflow. TLDR: learning a mapping from a binary representation of a number (i.e. 12 bits encoding the number) to 4 bits (none_of_the_others, divisible by 3, divisible by 5, divisible by 15). For this toy problem, you can quickly compare different networks.
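For readers who have not seen the toy problem, the encoding described above can be sketched in plain Python as follows. The bit order, the one-hot label layout and the function names are assumptions made for this illustration; only the 12 input bits and the four classes come from the description above.

```python
NUM_BITS = 12  # number of input bits, as stated in the question


def encode_binary(n, num_bits=NUM_BITS):
    """Binary encoding of n as a list of 0/1, least significant bit first."""
    return [(n >> bit) & 1 for bit in range(num_bits)]


def encode_label(n):
    """One-hot label over (none, divisible by 3, divisible by 5, divisible by 15)."""
    if n % 15 == 0:
        return [0, 0, 0, 1]
    if n % 5 == 0:
        return [0, 0, 1, 0]
    if n % 3 == 0:
        return [0, 1, 0, 0]
    return [1, 0, 0, 0]


# A few (input, target) pairs of the kind the networks are trained on.
for n in (7, 9, 10, 30):
    print(n, encode_binary(n), encode_label(n))
```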
So I've been trying a simple feedforward network and wrote a program to compare various architectures. Things like a 2-hidden-layer feedforward network, then 3 layers, different activation functions, ... Most architectures, well, suck. They get somewhere near a 50-60% success rate and remain there, independent of how much training you do.
A few perform really well. For instance, a sigmoid-activated network with two hidden layers of 23 neurons each works really well (89-90% correct after 2000 training epochs). Unfortunately anything close to it is rather disastrously bad. Take one neuron out of the first or second layer and it drops to 30% correct. A single hidden layer of 20 tanh-activated neurons does pretty well too. But most have a little over half this performance.
Now, given that for real problems I can't realistically do these sorts of studies of different architectures, are there ways to get good architectures that are guaranteed to work?
You might find the paper by Yoshua Bengio on Practical Recommendations for Gradient-Based Training of Deep Architectures helpful to learn more about hyperparameters and their settings.
If you're asking specifically for settings that have more guaranteed success, I advise you to read up on Batch Normalization. I find that it decreases the failure rate for bad picks of the learning rate and weight initialization.
Some people also discourage the use of non-linearities like sigmoid() and tanh(), as they suffer from the vanishing gradient problem.
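The vanishing gradient remark can be made concrete with a back-of-the-envelope sketch in plain Python. It only looks at the activation derivative and ignores the weights entirely, which is a simplification; the function names are made up for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid, which peaks at x = 0 with value 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def best_case_gradient_scale(layers):
    """Product of the largest possible sigmoid derivative across `layers` layers."""
    return sigmoid_grad(0.0) ** layers


for depth in (2, 5, 10):
    print(depth, best_case_gradient_scale(depth))
```

Even in the best case each sigmoid layer scales the backpropagated signal by at most 0.25, so the factor shrinks geometrically with depth unless the weights compensate.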