Implementation of VQ-VAE-2 paper - tensorflow

I am trying to build a two-stage VQ-VAE-2 + PixelCNN as described in the paper
"Generating Diverse High-Fidelity Images with VQ-VAE-2".
I have 3 implementation questions:
1. The paper mentions:
"We allow each level in the hierarchy to separately depend on pixels."
I understand this to mean that the second latent space in the VQ-VAE-2 must be conditioned on a concatenation of the first latent space and a downsampled version of the image. Is that correct?
2. The paper "Conditional Image Generation with PixelCNN Decoders" says:
"h is a one-hot encoding that specifies a class this is equivalent to adding a class dependent bias at every layer."
As I understand it, the condition is entered as a 1D tensor that is injected into the bias through a convolution. For a two-stage conditional PixelCNN, one needs to condition on the class vector but also on the latent code of the previous stage. One possibility I see is to append them and feed a 3D tensor. Does anyone see another way to do this?
3. The loss and optimization are unchanged with two stages. One simply adds the losses of each stage into a final loss that is optimized. Is that correct?

After discussing with one of the authors of the paper, I received answers to all of these questions and am sharing them below.
Question 1
This is correct, but the downsampling of the image is implemented with strided convolutions rather than a non-parametric resize. This can be absorbed into the encoder architecture, something like this (the number after each variable indicates its spatial dimension, so for example h64 is [B, 64, 64, D], and so on).
h128 = Relu(Conv2D(image256, stride=(2, 2)))
h64 = Relu(Conv2D(h128, stride=(2, 2)))
h64 = ResNet(h64)
Now for obtaining h32 and q32 we can do:
h32 = Relu(Conv2D(h64, stride=(2, 2)))
h32 = ResNet(h32)
q32 = Quantize(h32)
This way, the gradients flow all the way back to the image and hence we have a dependency between h32 and image256.
Everywhere, you can use a 1x1 convolution to adjust the size of the last dimension (the feature channels), strided convolution for downsampling, and strided transposed convolution for upsampling the spatial dimensions.
So for this example of quantizing the bottom layer, you first need to upsample q32 spatially to 64x64, combine it with h64, and feed the result to the quantizer. For additional expressive power we inserted a residual stack in between as well. It looks like this:
hq32 = ResNet(Conv2D(q32, (1, 1)))
hq64 = Conv2DTranspose(hq32, stride=(2, 2))
h64 = Conv2D(concat([h64, hq64]), (1, 1))
q64 = Quantize(h64)
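The Quantize step in the pseudocode above is not spelled out. As a rough illustration, here is a minimal NumPy sketch of the nearest-neighbour codebook lookup that vector quantization performs in the forward pass. It is only a sketch: the codebook size, the feature dimension and the names `quantize` and `codebook` are assumptions for illustration, and the straight-through gradient and codebook updates needed for training are left out.

```python
import numpy as np

def quantize(h, codebook):
    """Replace each feature vector in h with its nearest codebook entry.

    h:        array of shape [B, H, W, D]
    codebook: array of shape [K, D] (K is an assumed codebook size)
    Returns (q, indices): q has shape [B, H, W, D], indices has shape [B, H, W].
    """
    flat = h.reshape(-1, h.shape[-1])                       # [B*H*W, D]
    # Squared Euclidean distance between every vector and every code.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)                          # [B*H*W]
    q = codebook[indices].reshape(h.shape)
    return q, indices.reshape(h.shape[:-1])

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))      # K = 16 codes of dimension D = 8
h32 = rng.normal(size=(2, 32, 32, 8))    # stand-in for the top-level encoder output
q32, idx32 = quantize(h32, codebook)
```

The integer map `idx32` is what a PixelCNN prior would later be trained to model, while `q32` is what the decoder (and the bottom-level encoder path) consumes.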
Question 2
The original PixelCNN paper also describes how to do spatial conditioning using convolutions. Flattening the latent code and appending it to the class embedding as a global conditioning is not a good idea. What you want to do instead is apply a transposed convolution to align the spatial dimensions, then a 1x1 convolution to match the feature dimension to the hidden representations of the PixelCNN, and then add the result to them.
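To make the shapes concrete, the recipe above can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: the 2x2 kernel with stride 2, the channel counts and the helper names `transposed_conv_2x` and `conv_1x1` are all assumptions, and the random arrays merely stand in for the latent code and the PixelCNN activations.

```python
import numpy as np

def transposed_conv_2x(x, w):
    """Stride-2 transposed convolution with a 2x2 kernel (no overlap).

    x: [B, H, W, C_in], w: [2, 2, C_in, C_out] -> [B, 2H, 2W, C_out]
    """
    b, h, wd, _ = x.shape
    # out[b, 2i+a, 2j+k, d] = sum_c x[b, i, j, c] * w[a, k, c, d]
    y = np.einsum('bijc,akcd->biajkd', x, w)
    return y.reshape(b, 2 * h, 2 * wd, w.shape[-1])

def conv_1x1(x, w):
    """1x1 convolution: a per-pixel linear map over channels."""
    return x @ w  # [B, H, W, C_in] @ [C_in, C_out]

rng = np.random.default_rng(0)
q32 = rng.normal(size=(1, 32, 32, 8))         # top-level latent used as the condition
hidden64 = rng.normal(size=(1, 64, 64, 16))   # hidden activations of the bottom-level PixelCNN
w_up = rng.normal(size=(2, 2, 8, 8))          # transposed-conv weights (assumed shape)
w_proj = rng.normal(size=(8, 16))             # 1x1 conv weights matching the hidden width

cond = conv_1x1(transposed_conv_2x(q32, w_up), w_proj)   # [1, 64, 64, 16]
conditioned = hidden64 + cond                            # spatial conditioning by addition
```

A class embedding can still be added as a separate, spatially constant bias, so the two kinds of conditioning do not need to be concatenated.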
Question 3
It's a good idea to train them separately. Besides isolating the losses and being able to tune appropriate learning rates for each stage, you will also be able to use the full memory capacity of your GPU/TPU for each stage. These priors do better and better at larger scale, so it's a good idea not to deny them that.


Conv 1x1 configuration for feature reduction

I am using a 1x1 convolution in a deep network to reduce a feature x from Bx2CxHxW to BxCxHxW. I have three options:
1. x -> Conv (1x1) -> BatchNorm -> ReLU. The code is output = ReLU(BN(Conv(x))). Reference: ResNet.
2. x -> BN -> ReLU -> Conv. The code is output = Conv(ReLU(BN(x))). Reference: DenseNet.
3. x -> Conv. The code is output = Conv(x).
Which one is most commonly used for feature reduction, and why?
Since you are going to train your net end-to-end, whatever configuration you are using - the weights will be trained to accommodate them.
I guess the first question you need to ask yourself is: do you want to use BatchNorm? If your net is deep and you are concerned with covariate shift, then you probably should have a BatchNorm, which rules out option no. 3.
BatchNorm first?
If your x is the output of another conv layer, then there is actually no difference between your first and second alternatives: your net is a cascade of ...-conv-bn-ReLU-conv-BN-ReLU-conv-... so it is only an "artificial" partitioning of the net into triplets of functions conv, bn, relu, and up to the very first and last functions you can split things however you wish. Moreover, since at inference time batch norm is an affine operation (a per-channel scale + bias using fixed statistics), it can be "folded" into an adjacent conv layer without changing the net's output, so you are basically left with conv-relu pairs.
So, there's not really a big difference between the first two options you highlighted.
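The folding argument can be checked numerically. The sketch below assumes inference-time batch norm (fixed running mean and variance) applied right after a 1x1 convolution, and uses a channels-last layout so that the 1x1 convolution is a plain matrix product; `fold_batchnorm` and all shapes are illustrative assumptions.

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time batch norm into the preceding 1x1 conv.

    w: [C_in, C_out] conv weights, b: [C_out] conv bias.
    gamma, beta, mean, var: [C_out] batch-norm parameters and running statistics.
    """
    scale = gamma / np.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8, 6))               # [B, H, W, 2C] with C = 3
w, b = rng.normal(size=(6, 3)), rng.normal(size=3)
gamma, beta = rng.normal(size=3), rng.normal(size=3)
mean, var = rng.normal(size=3), rng.uniform(0.5, 2.0, size=3)

# Conv followed by batch norm, computed explicitly.
conv_then_bn = gamma * ((x @ w + b) - mean) / np.sqrt(var + 1e-5) + beta
# The same mapping as a single conv with folded weights.
w_f, b_f = fold_batchnorm(w, b, gamma, beta, mean, var)
folded = x @ w_f + b_f
```

Note that this equivalence only holds with fixed statistics; during training, batch norm normalizes with per-batch statistics and cannot be folded away.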
What else to consider?
Do you really need a ReLU when changing the dimension of the features? You can think of reducing dimensions as a linear mapping: a low-rank factorization of the weights applied to x, which ultimately maps into a C-dimensional space instead of a 2C-dimensional one. If you consider a linear mapping, then you might omit the ReLU altogether.
See the truncated SVD trick in Fast R-CNN for an example.
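As a small illustration of that view, the sketch below factorizes a hypothetical 2C -> 2C weight matrix with a truncated SVD into a 2C -> C reduction followed by a C -> 2C layer, with no non-linearity in between. The matrix is constructed to have rank C so the factorization is exact; for real trained weights the truncation would only be an approximation. All names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 4
# Hypothetical 1x1 conv weights mapping 2C -> 2C channels, built to have rank C.
w_full = rng.normal(size=(2 * c, c)) @ rng.normal(size=(c, 2 * c))

u, s, vt = np.linalg.svd(w_full, full_matrices=False)
w_reduce = u[:, :c] * s[:c]   # [2C, C]: the linear 1x1 "reduction" conv
w_expand = vt[:c]             # [C, 2C]: the layer that follows it

x = rng.normal(size=(2, 5, 5, 2 * c))   # channels-last features
reduced = x @ w_reduce                  # [2, 5, 5, C], no ReLU applied
restored = reduced @ w_expand           # matches x @ w_full up to rounding
```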

What are the effects of padding a tensor?

I'm working on a problem using Keras that has been presenting me with issues:
My X data is all of shape (num_samples, 8192, 8), but my Y data is of shape (num_samples, 4), where 4 is a one-hot encoded vector.
Both X and Y data will be run through LSTM layers, but the layers are rejecting the Y data because it doesn't match the shape of the X data.
Is padding the Y data with 0s so that it matches the dimensions of the X data unreasonable? What kind of effects would that have? Is there a better solution?
Edited for clarification:
As requested, here is more information:
My Y data represents the expected output of passing the X data through my model. This is my first time working with LSTMs, so I don't have an architecture in mind, but I'd like to use an architecture that works well with classifying long (8192-length) sequences of words into one of several categories. Additionally, the dataset that I have is of an immense size when fed through an LSTM, so I'm currently using batch-training.
Technologies being used:
Keras (Tensorflow Backend)
TL;DR Is padding one tensor with zeroes in all dimensions to match another tensor's shape a bad idea? What could be a better approach?
First of all, let's make sure your representation is actually what you think it is; the input to an LSTM (or any recurrent layer, for that matter) must have the per-sample shape (timesteps, features). For example, if you have 1000 training samples, each consisting of 100 timesteps, with each timestep having 10 values, your input shape will be (100, 10); the number of samples is the batch dimension and is not part of it. Therefore I assume from your question that each input sample in your X set has 8192 steps and 8 values per step. Great; a single LSTM layer can iterate over these and produce 4-dimensional representations with absolutely no problem, like so:
from keras.layers import Input, LSTM

myLongInput = Input(shape=(8192, 8))
myRecurrentFunction = LSTM(4)
myShortOutput = myRecurrentFunction(myLongInput)
# Output shape: TensorShape([Dimension(None), Dimension(4)])
I assume your problem stems from trying to apply yet another LSTM on top of the first one; the next LSTM expects a tensor that has a time dimension, but your output has none. If that is the case, you'll need to let your first LSTM also output the intermediate representations at each time step, like so:
myNewRecurrentFunction = LSTM(4, return_sequences=True)
myLongOutput = myNewRecurrentFunction(myLongInput)
# Output shape: TensorShape([Dimension(None), Dimension(None), Dimension(4)])
As you can see the new output is now a 3rd order tensor, with the second dimension now being the (yet unassigned) timesteps. You can repeat this process until your final output, where you usually don't need the intermediate representations but rather only the last one. (Sidenote: make sure to set the activation of your last layer to a softmax if your output is in one-hot format)
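To see what return_sequences changes without needing Keras installed, here is a toy NumPy recurrence. It is a plain tanh RNN rather than an LSTM, so it is only a stand-in for the shape behaviour; the function name `simple_rnn`, the weights and the shortened sequence length are assumptions.

```python
import numpy as np

def simple_rnn(x, w_x, w_h, return_sequences=False):
    """Plain tanh RNN (a stand-in for LSTM; only the output shapes matter here).

    x: [batch, timesteps, features] -> [batch, units] or [batch, timesteps, units]
    """
    batch, timesteps, _ = x.shape
    h = np.zeros((batch, w_h.shape[0]))
    outputs = []
    for t in range(timesteps):
        h = np.tanh(x[:, t] @ w_x + h @ w_h)   # update the hidden state
        outputs.append(h)
    return np.stack(outputs, axis=1) if return_sequences else h

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 20, 8))    # 20 timesteps instead of 8192 to keep it fast
w_x, w_h = rng.normal(size=(8, 4)), rng.normal(size=(4, 4))

last = simple_rnn(x, w_x, w_h)                         # shape (3, 4)
seq = simple_rnn(x, w_x, w_h, return_sequences=True)   # shape (3, 20, 4)
```

The last time step of `seq` equals `last`, which is why only the final recurrent layer before a classifier usually drops the time dimension.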
On to your original question, zero-padding has very little negative impact on your network. The network will strain itself a bit in the beginning trying to figure out the concept of the additional values you have just thrown at it, but will very soon be able to learn they're meaningless. This comes at a cost of a larger parameter space (therefore more time and memory complexity), but doesn't really affect predictive power most of the time.
I hope that was helpful.

concatenating convolutions with different strides in tensorflow

I am attempting to reproduce a CNN from a research paper using TensorFlow. Here is the whole architecture of the CNN, but I am mainly focused on the Reduction A section.
I am wondering if I have spotted a problem with the research paper. As you can see in Reduction A, 3 layers are concatenated. However, only 2 of those layers use a stride of 2. Therefore, when concatenating the tensors along the 4th axis (number of channels), the rightmost layer does not have the same width and height as the other 2 layers. I am aware that I could use padding to fix this, but there is no mention of this in the paper. Do you believe this research paper has a mistake? Should the rightmost path of Reduction A also use a stride of 2?
Considering that all the other reductions and inceptions have matching strides, it seems like the paper made a mistake. I suppose the 3x3(384) convolution was supposed to have a stride of 2, since this convolution increases the channel size.
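The mismatch is easy to confirm with the usual output-size arithmetic. The helper below is a sketch using the standard "valid" and "same" padding formulas; the 35x35 input size is a hypothetical value chosen for illustration, not a figure taken from the question.

```python
def conv_output_size(size, kernel, stride, padding="valid"):
    """Spatial output size of a convolution along one axis."""
    if padding == "same":
        return -(-size // stride)            # ceil(size / stride)
    return (size - kernel) // stride + 1     # "valid": no padding

size = 35  # hypothetical input height/width

strided = conv_output_size(size, kernel=3, stride=2)                         # 17
unstrided_valid = conv_output_size(size, kernel=3, stride=1)                 # 33
unstrided_same = conv_output_size(size, kernel=3, stride=1, padding="same")  # 35
```

Whichever padding is used, a branch whose last convolution has stride 1 cannot match the height and width of the stride-2 branches, so channel-wise concatenation would fail.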

Semantic Segmentation with Encoder-Decoder CNNs

Apologies for any misuse of technical terms.
I am working on a semantic segmentation project using CNNs, trying to implement an Encoder-Decoder type of architecture, so the output is the same size as the input.
How do you design the labels?
What loss function should one apply? Especially in the situation of heavy class imbalance (where the ratio between the classes varies from image to image).
The problem deals with two classes (objects of interest and background). I am using Keras with the TensorFlow backend.
So far, I am designing the expected outputs to have the same dimensions as the input images, applying pixel-wise labeling. The final layer of the model has either a softmax activation (for 2 classes) or a sigmoid activation (to express the probability that a pixel belongs to the object class). I am having trouble designing a suitable objective function for such a task, of type:
in agreement with Keras.
Please try to be specific about the dimensions of the tensors involved (input/output of the model). Any thoughts and suggestions are much appreciated. Thank you!
Actually, when you use the TensorFlow backend you can simply apply one of the predefined Keras objectives in the following manner:
from keras.layers import Conv2D

output = Conv2D(number_of_classes,  # 1 for the binary case
                (1, 1),  # kernel size assumed here; the original snippet omitted it
                activation="softmax")(input_to_output)  # or "sigmoid" for binary
model.compile(loss="categorical_crossentropy", ...)  # or "binary_crossentropy" for binary
And then feed either a one-hot encoded feature map or a matrix of shape (image_height, image_width) with integer-encoded classes (remember that in this case you should use sparse_categorical_crossentropy as the loss).
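The two label layouts mentioned above can be illustrated for a single image with NumPy. This is only a sketch of the shapes: the tiny image size is an assumption, and a real batch would add a leading batch axis.

```python
import numpy as np

height, width, num_classes = 4, 5, 2
rng = np.random.default_rng(0)

# Integer-encoded labels: one class id per pixel, shape (height, width).
# This layout pairs with sparse_categorical_crossentropy.
sparse_labels = rng.integers(0, num_classes, size=(height, width))

# One-hot labels: one indicator vector per pixel, shape (height, width, num_classes).
# This layout pairs with categorical_crossentropy.
one_hot_labels = np.eye(num_classes, dtype=np.float32)[sparse_labels]
```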
To deal with the class imbalance (I guess it's because of the background class), I strongly recommend carefully reading the answers to this Stack Overflow question.
I suggest starting with a base architecture used in practice, like the one used for nerve segmentation. There, a dice_loss is used as the loss function. This works very well for a two-class problem, as has been shown in the literature.
Another loss function that has been widely used for such problems is cross entropy. For problems like yours, long and short skip connections are most commonly deployed to stabilize training, as described in the paper above.
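For reference, one common formulation of the soft Dice loss mentioned above can be sketched in NumPy as follows. It is not taken from the linked project: the `smooth` constant and the function name are assumptions, and in Keras the same arithmetic would be written with backend tensor operations so that it is differentiable.

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1.0):
    """Soft Dice loss for binary masks; 0 means perfect overlap.

    y_true: ground-truth mask in {0, 1}, y_pred: predicted probabilities in [0, 1].
    smooth is an assumed stabilising constant that avoids division by zero.
    """
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    intersection = (y_true * y_pred).sum()
    dice = (2.0 * intersection + smooth) / (y_true.sum() + y_pred.sum() + smooth)
    return 1.0 - dice

mask = np.zeros((8, 8))
mask[2:4, 2:4] = 1.0                              # small object, mostly background

perfect = dice_loss(mask, mask)                   # prediction equals the mask
all_background = dice_loss(mask, np.zeros((8, 8)))  # predicts background everywhere
```

Because the score is driven by the overlap with the object pixels, predicting "all background" is penalized heavily even though it would reach high pixel accuracy on this mask, which is why Dice is popular under class imbalance.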
Two ways:
You could try 'flattening':
model.add(Permute((2, 1)))  # now it'll be NUM_CLASSES x HEIGHT x WIDTH
# Use some activation here - model.activation()
# You can use global averaging or softmax
One hot encoding every pixel:
In this case your final layer should Upsample/Unpool/Deconvolve to HEIGHT x WIDTH x CLASSES. So your output is essentially of the shape: (HEIGHT,WIDTH,NUM_CLASSES).

How to implement the DecomposeMe architecture in TensorFlow?

There is a type of architecture that I would like to experiment with in TensorFlow.
The idea is to compose 2-D filter kernels by a combination of 1-D filters.
From the paper:
Simplifying ConvNets through Filter Compositions
The essence of our proposal consists of decomposing the ND kernels of a traditional network into N consecutive layers of 1D kernels.
We propose DecomposeMe which is an architecture consisting of decomposed layers. Each decomposed layer represents a N-D convolutional layer as a composition of 1D filters and, in addition, by including a non-linearity
φ(·) in-between.
Converting existing structures to decomposed ones is a straight forward process as
each existing ND convolutional layer can systematically be decomposed into sets of
consecutive layers consisting of 1D linearly rectified kernels and 1D transposed kernels
as shown in Figure 1.
If I understand correctly, a single 2-D convolutional layer is replaced with two consecutive 1-D convolutions?
Considering that the weights are shared and transposed, it is not clear to me how exactly to implement this in TensorFlow.
I know this question is old and you probably already figured it out, but it might help someone else with the same problem.
Separable convolution can be implemented in TensorFlow as follows (details omitted):
import tensorflow as tf
X = tf.placeholder(tf.float32, shape=[None, 100, 100, 3])
Standard 2D convolution with a column vector is the same as convolving each column of the input with that vector. Convolution with v1 produces K feature maps (or an output image with K channels), which are then convolved with h1, producing the final desired number of feature maps N.
Weight sharing, to my knowledge, is simply a misleading term, meant to emphasize the fact that you use one filter that is convolved with each patch in the image. Obviously you're going to use the same filter to obtain the result for each output pixel, which is how everyone does it in image/signal processing.
Then in order to "decompose" a convolution layer as shown on page 5, it can be done by simply adding activation units in between the convolutions (ignoring biases):
Note that each filter in v1 is a column vector [d,1], and each h1 is a row vector [1,d]. The paper is a little vague, but when performing separable convolution, this is how it's done. That is, you convolve the image with the column vectors, then you convolve the result with the row vectors, obtaining the final result.
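The claim that a column convolution followed by a row convolution acts like a single 2-D convolution can be checked with a small single-channel NumPy sketch. This is an illustration under assumptions, not the paper's implementation: `correlate2d_valid` is a naive helper written for this example, d = 3 is arbitrary, and only one filter pair is used instead of K and N feature maps.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Naive 'valid' 2-D cross-correlation (what deep-learning libraries call convolution)."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(10, 10))
v = rng.normal(size=(3, 1))   # column filter, shape [d, 1]
h = rng.normal(size=(1, 3))   # row filter, shape [1, d]

# Column pass then row pass, with nothing in between.
separable = correlate2d_valid(correlate2d_valid(image, v), h)
# Equivalent single pass with the rank-1 d x d kernel v @ h.
full = correlate2d_valid(image, v @ h)
# "Decomposed" variant: a ReLU between the two 1-D passes.
decomposed = correlate2d_valid(np.maximum(correlate2d_valid(image, v), 0.0), h)
```

Without the non-linearity the two 1-D passes are exactly a rank-1 2-D filter; inserting the ReLU between them is what makes the decomposed layer differ from a plain separable convolution.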