Concatenating convolutions with different strides in TensorFlow

I am attempting to reproduce a CNN from a research paper using TensorFlow. Here is the whole architecture of the CNN, but I am mainly focused on the Reduction A section.
I am wondering if I have spotted a problem with the research paper. As you can see in Reduction A, 3 layers are concatenated. However, 2 of those layers use a stride of 2. Therefore, when concatenating the tensors along the 4th axis (the channel axis), the rightmost layer does not have the same width and height as the other 2 layers. I am aware that I could use padding to fix this, but there is no mention of this in the paper. Do you believe this research paper has a mistake? Should the rightmost path of Reduction A also use a stride of 2?

Considering that all the other reduction and inception blocks have matching strides, it does look like a mistake in the paper. Most likely the 3x3 (384) convolution was also supposed to have a stride of 2, since that is the branch that increases the channel size.
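For what it's worth, here is a minimal Keras sketch of what Reduction A looks like once the 3x3 (384) branch also uses a stride of 2. The input size and filter counts below are assumptions borrowed from Inception-v4's Reduction-A, so adjust them to your paper; the point is that all three branches then produce 17x17 maps and the channel-axis concatenation works without padding.
import tensorflow as tf

inputs = tf.keras.Input(shape=(35, 35, 384))                  # assumed input size
# branch 1: 3x3 max pool, stride 2
pool = tf.keras.layers.MaxPool2D(3, strides=2, padding='valid')(inputs)
# branch 2: 3x3 (384) conv, now with stride 2
conv3 = tf.keras.layers.Conv2D(384, 3, strides=2, padding='valid', activation='relu')(inputs)
# branch 3: 1x1 -> 3x3 -> 3x3, with stride 2 on the last conv
b = tf.keras.layers.Conv2D(192, 1, padding='same', activation='relu')(inputs)
b = tf.keras.layers.Conv2D(224, 3, padding='same', activation='relu')(b)
b = tf.keras.layers.Conv2D(256, 3, strides=2, padding='valid', activation='relu')(b)
# all three outputs are now 17x17, so concatenating along the channel axis works
out = tf.keras.layers.Concatenate(axis=-1)([pool, conv3, b])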

Related

Why does Conv2DTranspose output checkerboarded images even with matched kernel sizes and strides?

I've read the well-known Distill article on why Conv2DTranspose produces checkerboard artefacts multiple times: https://distill.pub/2016/deconv-checkerboard/
My understanding from it, however, is that if both the kernel size and the strides are matched, there shouldn't be any issue. I'm downsampling using Conv2D with kernel=(2,2) and strides=(2,2), and I'm upsampling using Conv2DTranspose with the exact same values. However, if I visualize the output of the Conv2DTranspose layers, they're extremely checkerboarded. Nothing that the next conv layer can't fix, but... why is that? By the way, I don't see artefacts in the final output (also, I'm segmenting, so small intensity noise is not visible in the final quasi-binary mask).
What's the exact definition of Conv2DTranspose anyway? I'd expect the output of such a layer to be non "grid-like" when using matched kernel and strides (as shown in the examples in the Distill link above), but the documentation gives no exact mathematical definition of what it's doing. Where am I going wrong?
The Conv2DTranspose operation is the transpose of the convolution, not its inverse: it reverses the shape transformation of a convolution with the same kernel size and stride. That is also why calling it a "deconvolution" is mathematically incorrect.
This is a good example to illustrate it, since it uses exactly a stride of two: the initial image size is 2x2, the white squares are zero padding, and the final output tensor is 4x4.
Although no longer maintained, the Theano documentation explains the arithmetic behind Conv2DTranspose beautifully: http://deeplearning.net/software/theano_versions/dev/tutorial/conv_arithmetic.html#transposed-convolution-arithmetic. It is worth reading to understand it better.
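To make the arithmetic concrete, here is a small sketch in TF 2.x eager mode, assuming a 2x2 input, a 2x2 kernel and a stride of 2 as in the example above. Because the kernel size matches the stride, the kernel placements never overlap: each 2x2 block of the 4x4 output is just one input pixel multiplied by the kernel. So there is no overlap artefact, but any imbalance between the four kernel weights repeats as a 2x2 pattern across the whole output, which is the grid you see before the next conv layer smooths it out.
import tensorflow as tf

x = tf.constant([[1., 2.], [3., 4.]])[tf.newaxis, :, :, tf.newaxis]   # [1, 2, 2, 1] input
k = tf.constant([[1., 0.], [0., 1.]])[:, :, tf.newaxis, tf.newaxis]   # [2, 2, 1, 1] kernel
y = tf.nn.conv2d_transpose(x, k, output_shape=[1, 4, 4, 1], strides=[1, 2, 2, 1])
print(tf.squeeze(y))   # each 2x2 block is one input pixel scaled by the kernel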

Implementation of VQ-VAE-2 paper

I am trying to build a 2 stage VQ-VAE-2 + PixelCNN as shown in the paper:
"Generating Diverse High-Fidelity Images with VQ-VAE-2" (https://arxiv.org/pdf/1906.00446.pdf).
I have 3 implementation questions:
The paper mentions:
We allow each level in the hierarchy to separately depend on pixels.
I understand the second latent space in the VQ-VAE-2 must be conditioned on a concatenation of the 1st latent space and a downsampled version of the image. Is that correct?
The paper "Conditional Image Generation with PixelCNN Decoders" (https://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders.pdf) says:
h is a one-hot encoding that specifies a class, this is equivalent to adding a class dependent bias at every layer.
As I understand it, the condition is entered as a 1D tensor that is injected into the bias through a convolution. Now, for a 2-stage conditional PixelCNN, one needs to condition on the class vector but also on the latent code of the previous stage. A possibility I see is to append them and feed a 3D tensor. Does anyone see another way to do this?
The loss and optimization are unchanged with 2 stages: one simply adds the loss of each stage into a final loss that is optimized. Is that correct?
After discussing with one of the authors of the paper, I received answers to all these questions and share them below.
Question 1
This is correct, but the downsampling of the image is implemented with strided convolution rather than a non-parametric resize. This can be absorbed as part of the encoder architecture in something like this (the number after each variable indicates its spatial dimension, so for example h64 is [B, 64, 64, D] and so on).
h128 = Relu(Conv2D(image256, stride=(2, 2)))
h64 = Relu(Conv2D(h128, stride=(2, 2)))
h64 = ResNet(h64)
Now for obtaining h32 and q32 we can do:
h32 = Relu(Conv2D(h64, stride=(2, 2)))
h32 = ResNet(h32)
q32 = Quantize(h32)
This way, the gradients flow all the way back to the image and hence we have a dependency between h32 and image256.
Throughout, you can use 1x1 convolutions to adjust the size of the last dimension (the feature channels), strided convolutions for downsampling, and strided transposed convolutions for upsampling the spatial dimensions.
So for this example of quantizing the bottom layer, you need to first upsample q32 spatially to become 64x64, combine it with h64, and feed the result to the quantizer. For additional expressive power we inserted a residual stack in between as well. It looks like this:
hq32 = ResNet(Conv2D(q32, (1, 1)))
hq64 = Conv2DTranspose(hq32, stride=(2, 2))
h64 = Conv2D(concat([h64, hq64]), (1, 1))
q64 = Quantize(h64)
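For reference, here is a rough Keras translation of the pseudocode above. This is only a sketch, not the authors' code: ResNet and Quantize are replaced by hypothetical placeholders for the residual stack and the vector-quantization bottleneck, and D is an assumed channel width.
import tensorflow as tf
from tensorflow.keras import layers

D = 128
quantize = layers.Lambda(lambda t: t)                                       # placeholder for the VQ bottleneck
res = lambda t: layers.Conv2D(D, 3, padding='same', activation='relu')(t)   # placeholder residual stack

image256 = tf.keras.Input(shape=(256, 256, 3))
h128 = layers.Conv2D(D, 4, strides=2, padding='same', activation='relu')(image256)
h64 = res(layers.Conv2D(D, 4, strides=2, padding='same', activation='relu')(h128))
h32 = res(layers.Conv2D(D, 4, strides=2, padding='same', activation='relu')(h64))
q32 = quantize(h32)                                                         # top-level code

hq32 = res(layers.Conv2D(D, 1)(q32))
hq64 = layers.Conv2DTranspose(D, 4, strides=2, padding='same')(hq32)        # upsample 32 -> 64
h64 = layers.Conv2D(D, 1)(layers.Concatenate()([h64, hq64]))
q64 = quantize(h64)                                                         # bottom-level code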
Question 2
The original PixelCNN paper also describes how to use spatial conditioning with convolutions. Flattening and appending to the class embedding as a global conditioning is not a good idea. What you would want to do is apply a transposed convolution to align the spatial dimensions, then a 1x1 convolution to match the feature dimension with the hidden representation of the PixelCNN, and then add it.
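A minimal sketch of that conditioning path, assuming Keras layers and made-up shapes (a 32x32 code map conditioning a 64x64 PixelCNN hidden representation with C channels):
import tensorflow as tf
from tensorflow.keras import layers

C = 128
q32 = tf.keras.Input(shape=(32, 32, C))        # conditioning latent from the previous stage
hidden64 = tf.keras.Input(shape=(64, 64, C))   # PixelCNN hidden representation

cond = layers.Conv2DTranspose(C, 4, strides=2, padding='same')(q32)   # align spatial dims: 32 -> 64
cond = layers.Conv2D(C, 1)(cond)                                      # match the feature dimension
conditioned = layers.Add()([hidden64, cond])                          # added like a location-dependent bias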
Question 3
It's a good idea to train them separately. Besides isolating the losses etc. and being able to tune appropriate learning rates for each stage, you will also be able to use the full memory capacity of your GPU/TPU for each stage. These priors do better and better with larger scale, so it's a good idea not to deny them that.

How to use conv1d_transpose in TensorFlow for single-channel images?

New to TensorFlow. I have a single-channel image of size W x H. I would like to apply a 1D deconvolution to this image with a kernel that computes the deconvolved output row-wise, 3 pixels at a time, meaning that each group of 3 pixels within a row is used only once in the deconvolution. I guess this could be achieved with the stride parameter?
I am aware that there is a conv1d_transpose in the contrib branch of TensorFlow, but with the currently limited documentation on it, I am rather confused about how to achieve the above. Any recommendations are appreciated.
I would do this with a stride and the standard 2D convolution/transpose. I'm not familiar with conv1d_transpose, but I'm all but certain you wouldn't be able to use a 3x3 kernel with a conv1D operation.
A conv1D operation works on a vector, such as an optical spectrum (an example here, just in case it doesn't make sense: https://dr12.sdss.org/spectrumDetail?plateid=5008&mjd=55744&fiber=278).
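One way to read the question, as a hedged sketch: use the standard Conv2DTranspose with a (1, 3) kernel and a (1, 3) stride, so the kernel only acts along rows and its placements never overlap (each group of 3 output pixels in a row comes from a single input pixel). The sizes below are made up.
import tensorflow as tf
from tensorflow.keras import layers

H, W = 96, 96                                   # assumed image height and width
x = tf.keras.Input(shape=(H, W, 1))             # single-channel image, NHWC
y = layers.Conv2DTranspose(1, kernel_size=(1, 3), strides=(1, 3), padding='valid')(x)
# output shape is (H, 3*W, 1): each input pixel expands to 3 pixels along its row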

Detecting text in natural images

I wrote code in TensorFlow using a convolutional neural network to detect text in images. I used a TFRecords file to read the Street View Text dataset, then resized the images to a height and width of 128.
I used 9 conv layers with zero padding and three max_pool layers with a window size of (2×2) and a stride of 2. Since I use just three pooling layers, the last layer's shape will be (16×16). The last conv layer has 256 filters.
I also used two fully connected regression layers (tf.nn.sigmoid) and tf.losses.mean_squared_error as the loss function.
My questions are:
Is this architecture enough for the detection process? I know there is something called NMS for detection. Also, what should the label be in this case?
In general (and this is not a rule, just my experience): you should start with a smaller net of 2 or 3 conv layers and see what happens. If you get good results, focus on the winning topology and tune the hyperparameters (learning rate, batch size and so on); if you don't get good results at all, go deeper, i.e. add conv layers, and evaluate again. 12 conv layers is really big, and your problem's complexity should be correspondingly big, otherwise you will reach good accuracy but waste a lot of compute power and time for nothing. And by the way, use a pyramid form, meaning start wide and finish small (see the sketch below).
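To make that concrete, here is one possible reading of the advice as a small Keras sketch (all filter counts and the regression head are assumptions, not taken from the question): a 2-3 conv-layer net in "pyramid" form, spatially wide at the input and small at the end, that you would only grow if the results justify it.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPool2D(2),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPool2D(2),
    layers.Conv2D(128, 3, padding='same', activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(4, activation='sigmoid'),      # e.g. a box-regression head, as in the question
])
model.compile(optimizer='adam', loss='mse')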

How to implement the DecomposeMe architecture in TensorFlow?

There is a type of architecture that I would like to experiment with in TensorFlow.
The idea is to compose 2-D filter kernels by a combination of 1-D filters.
From the paper:
Simplifying ConvNets through Filter Compositions
The essence of our proposal consists of decomposing the ND kernels of a traditional network into N consecutive layers of 1D kernels.
...
We propose DecomposeMe which is an architecture consisting of decomposed layers. Each decomposed layer represents a N-D convolutional layer as a composition of 1D filters and, in addition, by including a non-linearity
φ(·) in-between.
...
Converting existing structures to decomposed ones is a straight forward process as
each existing ND convolutional layer can systematically be decomposed into sets of
consecutive layers consisting of 1D linearly rectified kernels and 1D transposed kernels
as shown in Figure 1.
If I understand correctly, a single 2-D convolutional layer is replaced with two consecutive 1-D convolutions?
Considering that the weights are shared and transposed, it is not clear to me how exactly to implement this in TensorFlow.
I know this question is old and you probably already figured it out, but it might help someone else with the same problem.
Separable convolution can be implemented in TensorFlow as follows (details omitted):
import tensorflow as tf  # TF 1.x API

# d: 1-D kernel length, K: intermediate feature maps, N: output feature maps
X = tf.placeholder(tf.float32, shape=[None, 100, 100, 3])
v1 = tf.Variable(tf.truncated_normal([d, 1, 3, K], stddev=0.001))   # d x 1 column filters
h1 = tf.Variable(tf.truncated_normal([1, d, K, N], stddev=0.001))   # 1 x d row filters
M1 = tf.nn.relu(tf.nn.conv2d(tf.nn.conv2d(X, v1, [1, 1, 1, 1], 'SAME'), h1, [1, 1, 1, 1], 'SAME'))
Standard 2D convolution with a column vector is the same as convolving each column of the input with that vector. Convolution with v1 produces K feature maps (or an output image with K channels), which is then convolved with h1, producing the final desired number of feature maps N.
Weight sharing, to my knowledge, is simply a misleading term meant to emphasize the fact that you use one filter that is convolved with each patch in the image. Obviously you're going to use the same filter to obtain the results for each output pixel, which is how everyone does it in image/signal processing.
Then, in order to "decompose" a convolution layer as shown on page 5, it can be done by simply adding an activation unit in between the convolutions (ignoring biases):
M1 = tf.nn.relu(tf.nn.conv2d(tf.nn.relu(tf.nn.conv2d(X, v1, [1, 1, 1, 1], 'SAME')), h1, [1, 1, 1, 1], 'SAME'))
Note that each filter in v1 is a column vector [d, 1], and each filter in h1 is a row vector [1, d]. The paper is a little vague, but this is how separable convolution is performed: you convolve the image with the column vectors, then convolve the result with the row vectors to obtain the final result.
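For completeness, the same decomposed layer in the Keras API, with hypothetical values for d, K and N (a sketch, not the paper's code): a (d, 1) vertical convolution with a ReLU, followed by a (1, d) horizontal convolution with a ReLU.
import tensorflow as tf
from tensorflow.keras import layers

d, K, N = 5, 16, 32                     # assumed kernel length and feature-map counts
x = tf.keras.Input(shape=(100, 100, 3))
v = layers.Conv2D(K, kernel_size=(d, 1), padding='same', activation='relu')(x)  # column (vertical) 1D filters
y = layers.Conv2D(N, kernel_size=(1, d), padding='same', activation='relu')(v)  # row (horizontal) 1D filters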