How to use conv1d_transpose in TensorFlow for single-channel images? - tensorflow

New to TensorFlow. I have a single-channel image of size W x H. I would like to do a 1D deconvolution on this image with a kernel that computes the deconvolved output row-wise, three pixels at a time, meaning that it uses each group of 3 pixels within a row only once in the deconvolution process. I guess this could be achieved with the stride parameter?
I am aware that there is a conv1d_transpose in the contrib branch of TensorFlow, but with the current limited documentation on it, I am rather confused how to achieve the above. Any recommendations are appreciated.

I would do this with strides and the standard 2D convolution/transpose. I'm not familiar with conv1d_transpose, but I'm all but certain you wouldn't be able to use a 3x3 kernel with a conv1D operation.
A conv1D operation operates on a vector, such as an optical spectrum (an example here just in case it doesn't make sense: https://dr12.sdss.org/spectrumDetail?plateid=5008&mjd=55744&fiber=278)
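For example, here is a minimal sketch of the row-wise transposed convolution with tf.nn.conv2d_transpose, assuming you want each input pixel to expand into a non-overlapping group of 3 output pixels within its row (the 32x32 image size and batch size of 1 are placeholders):
import tensorflow as tf  # TF 1.x-style API, from the same era as contrib.conv1d_transpose
H, W = 32, 32  # placeholder image size
x = tf.placeholder(tf.float32, shape=[1, H, W, 1])  # single-channel image, batch of 1
# For conv2d_transpose the filter layout is [height, width, out_channels, in_channels];
# a 1x3 kernel only mixes pixels within a row.
kernel = tf.Variable(tf.truncated_normal([1, 3, 1, 1], stddev=0.1))
# Stride 3 along the width, so each group of 3 output pixels comes from one input pixel.
y = tf.nn.conv2d_transpose(x, kernel, output_shape=[1, H, 3 * W, 1],
                           strides=[1, 1, 3, 1], padding='VALID')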

Related

Dimensions of a convolution?

I have some questions regarding how this convolution is calculated and its output dimensions. I'm familiar with simple convolutions with an n x m kernel, using strides, dilations, or padding; that's not a problem. But these dimensions seem odd to me. Since the model I'm using is the pretty well-known onnx-mnist, I assume it is correct.
So, my point is:
If the input has dimensions of 1x1x28x28, how is the output 1x8x28x28?
W denotes the kernel. How can it be 8x1x5x5? As far as I know, the first dimension is the batch size, but here I'm just doing inference with 1 input. Does this make sense?
I'm implementing this convolution operator from scratch, and so far it works for a 1x1x28x28 input and a 1x1x5x5 kernel, but that extra dimension doesn't make sense to me.
Attached is the convolution that I'm trying to do; I hope it is not too ONNX-specific.
I do not see the code you are using, but I guess 8 is the number of kernels. This means you apply 8 different kernels, each of size 5x5, on your input over a batch size of 1. That is how you get 1x8x28x28 in the output; the 8 denotes the number of activation maps (one for each kernel).
Your kernel dimensions (8x1x5x5) explained:
8: Number of different filters/kernels (will be number of output maps per image)
1: Number of input channels. If your input image was RGB instead of grayscale, this would be 3 instead of 1.
5: First spatial dimension
5: Second spatial dimension
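A quick way to check the shape arithmetic is with TensorFlow (note that ONNX uses NCHW/OIHW layouts while TensorFlow's default is NHWC/HWIO, so the shapes below are the transposed equivalents of 1x1x28x28 and 8x1x5x5):
import tensorflow as tf
x = tf.placeholder(tf.float32, shape=[1, 28, 28, 1])            # N, H, W, C_in
w = tf.Variable(tf.truncated_normal([5, 5, 1, 8], stddev=0.1))  # kH, kW, C_in, C_out
# 'SAME' padding keeps the 28x28 spatial size; the 8 filters yield 8 feature maps.
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
print(y.shape)  # (1, 28, 28, 8), i.e. 1x8x28x28 in NCHW terms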

how to generate different samples using PixelCNN?

I am trying PixelCNN, which is an auto-regressive generative model. After training, the model receives an all-zero tensor and generates pixels starting from the top-left corner. Now that the model parameters are fixed, can the model only produce the same output starting from the same zero tensor? How can I produce different samples?
Yes, you always provide an all-zero tensor. However, for PixelCNN each pixel location is represented by a distribution, so during the forward pass you sample from that predicted distribution at each step. That is how the pixel values turn out different on each run.
This is of course because PixelCNN is a probabilistic neural network. So the pixels, as mentioned before, are all represented by conditional probability distributions computed by the layers below, not just point estimates.
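Here is a minimal sketch of that sampling loop in NumPy, assuming a hypothetical model_probs(images) function that runs the trained PixelCNN and returns a [batch, H, W, 256] array of per-pixel probabilities:
import numpy as np

def sample_images(model_probs, batch=4, H=28, W=28):
    images = np.zeros((batch, H, W), dtype=np.float32)  # the all-zero starting tensor
    for i in range(H):
        for j in range(W):
            # Conditional distribution p(x_ij | pixels above and to the left)
            probs = model_probs(images)[:, i, j, :]
            for b in range(batch):
                # Sampling (rather than taking the argmax) is what makes
                # every generated image different, even from the same zero input.
                images[b, i, j] = np.random.choice(256, p=probs[b]) / 255.0
    return images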

concatenating convolutions with different strides in tensorflow

I am attempting to reproduce a CNN from a research paper using TensorFlow. Here is the whole architecture of the CNN, but I am mainly focused on the Reduction A section.
I am wondering if I have spotted a problem with the research paper. As you can see in Reduction A, 3 layers are concatenated. However, 2 of those layers use a stride of 2. Therefore, when concatenating the tensors along the 4th axis (the number of channels), the rightmost layer does not have the same depth, width, and height as the other 2 layers. I am aware that I could use padding to fix this, but there is no mention of this in the paper. Do you believe this research paper has a mistake? Should the rightmost path of Reduction A also use a stride of 2?
Considering that all the other reductions and inceptions have matching strides, it seems like the paper made a mistake. I suppose the 3x3 (384) convolution was supposed to have a stride of 2, since this convolution increases the channel size.
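To see why the spatial sizes have to match before concatenating along the channel axis, here is a rough sketch (the 35x35x256 input and exact branch widths are only illustrative assumptions about the Reduction A block):
import tensorflow as tf
x = tf.placeholder(tf.float32, shape=[None, 35, 35, 256])
# Two branches with stride 2 halve the spatial size to 17x17 ...
a = tf.layers.max_pooling2d(x, pool_size=3, strides=2, padding='valid')
b = tf.layers.conv2d(x, filters=384, kernel_size=3, strides=2, padding='valid')
# ... so a stride-1 branch that stays at 35x35 cannot be concatenated with them:
c = tf.layers.conv2d(x, filters=256, kernel_size=3, strides=1, padding='same')
# tf.concat([a, b, c], axis=3)  # shape error: 17x17 vs 35x35; c also needs stride 2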

What is a 2D float tensor?

Disclaimer: I know nothing about CNNs and deep learning, and I don't know Torch.
I'm using SIFT for my object recognition application. I found the paper Discriminative Learning of Deep Convolutional Feature Point Descriptors particularly interesting because it's CNN-based, and CNNs are more precise than classic image description methods (e.g. SIFT, SURF, etc.), but (quoting the abstract):
using the L2 distance during both training and testing we develop
128-D descriptors whose euclidean distances reflect patch similarity,
and which can be used as a drop-in replacement for any task involving
SIFT
Wow, that's fantastic: that means that we can continue to use any SIFT-based approach, but with more precise descriptors!
However, quoting the github code repository README:
Note the output will be a Nx128 2D float tensor where each row is a
descriptor.
Well, what is a "2D float tensor"? The SIFT descriptor matrix is Nx128 floats; is there something that I am missing?
2D float tensor = 2D float matrix.
FYI: The meaning of tensors in the neural network community
This is a 2-d float tensor.
[[1.0,2.0],
[3.0,4.0]]
This is still a 2-d float tensor, even though it has 3 items per row and 3 rows!
[[1.0,2.0,3.0],
[4.0,5.0,6.0],
[7.0,5.0,6.0]]
The number of bracket levels (the nesting depth) is what matters.
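A tiny illustration with NumPy (the N = 500 random descriptors here are just placeholders):
import numpy as np
descriptors = np.random.rand(500, 128).astype(np.float32)  # N x 128, one descriptor per row
print(descriptors.ndim)   # 2 -> "2D"
print(descriptors.dtype)  # float32 -> "float"
print(descriptors.shape)  # (500, 128), exactly the N x 128 matrix you already expect from SIFT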

How to implement the DecomposeMe architecture in TensorFlow?

There is a type of architecture that I would like to experiment with in TensorFlow.
The idea is to compose 2-D filter kernels by a combination of 1-D filters.
From the paper:
Simplifying ConvNets through Filter Compositions
The essence of our proposal consists of decomposing the ND kernels of a traditional network into N consecutive layers of 1D kernels.
...
We propose DecomposeMe which is an architecture consisting of decomposed layers. Each decomposed layer represents a N-D convolutional layer as a composition of 1D filters and, in addition, by including a non-linearity
φ(·) in-between.
...
Converting existing structures to decomposed ones is a straight forward process as
each existing ND convolutional layer can systematically be decomposed into sets of
consecutive layers consisting of 1D linearly rectified kernels and 1D transposed kernels
as shown in Figure 1.
If I understand correctly, a single 2-D convolutional layer is replaced with two consecutive 1-D convolutions?
Considering that the weights are shared and transposed, it is not clear to me how exactly to implement this in TensorFlow.
I know this question is old and you probably already figured it out, but it might help someone else with the same problem.
Separable convolution can be implemented in TensorFlow roughly as follows (the filter size d and the channel counts K and N are example values):
import tensorflow as tf  # TF 1.x API
d, K, N = 3, 16, 32  # example filter size and channel counts
X = tf.placeholder(tf.float32, shape=[None, 100, 100, 3])
v1 = tf.Variable(tf.truncated_normal([d, 1, 3, K], stddev=0.001))  # d x 1 column filters
h1 = tf.Variable(tf.truncated_normal([1, d, K, N], stddev=0.001))  # 1 x d row filters
conv_v = tf.nn.conv2d(X, v1, strides=[1, 1, 1, 1], padding='SAME')
M1 = tf.nn.relu(tf.nn.conv2d(conv_v, h1, strides=[1, 1, 1, 1], padding='SAME'))
Standard 2D convolution with a column vector is the same as convolving each column of the input with that vector. Convolution with v1 produces K feature maps (or an output image with K channels), which is then convolved with h1, producing the final desired number of feature maps N.
Weight sharing, to my knowledge, is simply a misleading term, meant to emphasize the fact that you use one filter that is convolved with each patch in the image. Obviously you're going to use the same filter to obtain the results for each output pixel, which is how everyone does it in image/signal processing.
Then, to "decompose" a convolution layer as shown on page 5, simply add activation units between the convolutions (ignoring biases):
a1 = tf.nn.relu(tf.nn.conv2d(X, v1, strides=[1, 1, 1, 1], padding='SAME'))
M1 = tf.nn.relu(tf.nn.conv2d(a1, h1, strides=[1, 1, 1, 1], padding='SAME'))
Note that each filter in v1 is a column vector [d, 1], and each filter in h1 is a row vector [1, d]. The paper is a little vague, but this is how separable convolution is performed: you convolve the image with the column vectors, then convolve the result with the row vectors, obtaining the final result.
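As a sanity check of the linear case (the version without the non-linearity in between), a 2D kernel that happens to be an outer product of a column and a row vector gives the same result whether applied directly or as two 1D passes; for example, with NumPy/SciPy:
import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(100, 100).astype(np.float32)
col = np.random.rand(3, 1).astype(np.float32)   # vertical 1D filter, shape [d, 1]
row = np.random.rand(1, 3).astype(np.float32)   # horizontal 1D filter, shape [1, d]
direct = convolve2d(img, col @ row, mode='full')                              # one 3x3 convolution
separable = convolve2d(convolve2d(img, col, mode='full'), row, mode='full')   # two 1D passes
print(np.allclose(direct, separable))  # True (up to floating-point rounding)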