Shifted convolutions as a replacement for masked convolutions in PixelCNN++ - TensorFlow

I read the PixelCNN++ paper and code.
In the code, there is this line (line 298):
''' utilities for shifting the image around, efficient alternative to masking convolutions '''
Afterwards, they define several functions for that purpose:
down_shifted_conv2d, down_right_shifted_conv2d, down_shift, right_shift.
Using these and gated_resnet layers, they (based on figure 2 from the paper) convert the image from 32x32 to 8x8 and back to 32x32. I looked into these layers: it seems like down_shift pads a row of zeros at the top and drops the last row (shifting the content down by one pixel), and down_shifted_conv2d applies some specific padding and uses a specific kernel size.
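For concreteness, here is a rough TF2-style sketch of what these helpers appear to do (my own approximation, not the exact code from the repo):

```python
import tensorflow as tf

def down_shift(x):
    # Shift the feature map down by one row: pad a zero row at the top and
    # drop the last row, so row i of the output only contains rows above i.
    return tf.pad(x, [[0, 0], [1, 0], [0, 0], [0, 0]])[:, :-1, :, :]

def down_shifted_conv2d(x, filters, kernel_size=(2, 3)):
    # Pad only above (and symmetrically left/right), then run a VALID conv,
    # so each output pixel only sees pixels in rows at or above it.
    kh, kw = kernel_size
    x = tf.pad(x, [[0, 0], [kh - 1, 0], [(kw - 1) // 2, (kw - 1) // 2], [0, 0]])
    return tf.keras.layers.Conv2D(filters, kernel_size, padding='valid')(x)

x = tf.random.uniform([1, 32, 32, 3])
print(down_shift(x).shape, down_shifted_conv2d(x, 16).shape)  # (1, 32, 32, 3) (1, 32, 32, 16)
```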
They also split the model into u_list (line 37) and ul_list (line 38), which I think might correspond to the downward and downward+rightward streams mentioned briefly in the paper after figure 2.
Lastly, at the beginning of the model, they pad the last axis with ones (line 37), and state that it is to:
add channel of ones to distinguish image from padding later on
My questions are:
How are the shifted convolutions a replacement for masked convolutions? That is, how do they prevent the network from seeing future pixel values through the layers, and why are they called "shifted"?
What are the downward and downward+rightward streams, how do they work, and are they the same as u_list and ul_list?
Why do they pad the last axis of the input with ones, and in what way does it help them later?

Related

Implement CVAE for a single image

I have a multi-dimensional, hyperspectral image (channels, width, height = 15, 2500, 2500). I want to compress its 15 channel dimensions into 5 channels, so the output would be (channels, width, height = 5, 2500, 2500). One simple way to do this is to apply PCA. However, the performance is not so good. Thus, I want to use a Variational Autoencoder (VAE).
When I looked at the available solutions in the TensorFlow and Keras libraries, they show examples of applying a Convolutional Variational Autoencoder (CVAE) to a whole dataset of images:
https://www.tensorflow.org/tutorials/generative/cvae
https://keras.io/examples/generative/vae/
However, I have a single image. What is the best practice for implementing a CVAE in this case? Is it to generate sample images with a moving-window approach?
One way of doing it would be to have a CVAE that takes as input (and output) the values of all the spectral features for each spatial coordinate (the stacks circled in red in the picture). So, in the case of your image, you would have 2500*2500 = 6250000 input data samples, which are all vectors of length 15, and the middle layer would be a vector of length 5. Instead of the 2D convolutions that are normally used along the spatial domain of images, in this case it would make sense to use 1D convolutions over the spectral domain (since the values of neighbouring wavelengths are also correlated), although I think using only fully-connected layers would also make sense.
As a disclaimer, I haven’t seen CVAEs used in this way before, but like this you would also get many data samples, which is needed in order for the learning to generalise well.
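A rough sketch of this idea (layer sizes are arbitrary, and the VAE training loss with its reconstruction and KL terms is left out):

```python
import tensorflow as tf

n_bands, latent_dim = 15, 5

# Encoder: 1D convolutions over the spectral axis, then mean / log-variance heads.
enc_in = tf.keras.Input(shape=(n_bands, 1))
h = tf.keras.layers.Conv1D(16, 3, padding='same', activation='relu')(enc_in)
h = tf.keras.layers.Conv1D(32, 3, padding='same', activation='relu')(h)
h = tf.keras.layers.Flatten()(h)
z_mean = tf.keras.layers.Dense(latent_dim)(h)
z_log_var = tf.keras.layers.Dense(latent_dim)(h)
encoder = tf.keras.Model(enc_in, [z_mean, z_log_var])

# Decoder: from the 5-dimensional code back to the 15 spectral values.
dec_in = tf.keras.Input(shape=(latent_dim,))
h = tf.keras.layers.Dense(64, activation='relu')(dec_in)
dec_out = tf.keras.layers.Dense(n_bands)(h)
decoder = tf.keras.Model(dec_in, dec_out)

def sample(z_mean, z_log_var):
    # Reparameterization trick: z = mu + sigma * eps
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

# Reshape the (15, 2500, 2500) cube into 2500*2500 samples of shape (15, 1).
image = tf.random.uniform([15, 2500, 2500])                   # stand-in for the real HSI
pixels = tf.reshape(tf.transpose(image, [1, 2, 0]), [-1, n_bands, 1])
z = sample(*encoder(pixels[:1024]))                           # encode one batch of pixels
reconstruction = decoder(z)                                   # shape (1024, 15)
```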
Another option would indeed be what you suggested: to just generate the samples (patches) using a moving window (maybe with a stride that is half the size of the patch). Even though you wouldn't necessarily get enough data samples for the CVAE to generalise really well across all HSI images, I guess it doesn't matter if it overfits, since you want to use it on that same image.
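For example, the patches could be extracted with tf.image.extract_patches (the patch size and stride below are arbitrary):

```python
import tensorflow as tf

image = tf.random.uniform([1, 2500, 2500, 15])   # stand-in HSI, channels last
patch, stride = 64, 32                           # hypothetical patch size and half-patch stride

patches = tf.image.extract_patches(
    images=image,
    sizes=[1, patch, patch, 1],
    strides=[1, stride, stride, 1],
    rates=[1, 1, 1, 1],
    padding='VALID')
# Each position holds one flattened patch; reshape back into image-shaped patches.
patches = tf.reshape(patches, [-1, patch, patch, 15])
print(patches.shape)   # (5929, 64, 64, 15) with these settings
```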

How to apply a Conv1D layer on a whole matrix

I have a matrix where each row represents a point with coordinates (x, y, z). I want to extract features for each point using 3 shared MLP layers (64, 128, 1024), implemented as Conv1D with kernel size 1, and at the end I want to aggregate the features using MaxPooling1D.
My question is: how do I define the input so that it is the whole matrix (I mean that I want each layer to apply to all rows of the matrix, not just one row)?
I wrote some code, but I'm sure it's wrong:
Model = Sequential([
    Conv1D(64, 1, input_dim=(1, 3), activation='relu'),
    BatchNormalization(axis=-1),
    Conv1D(128, 1, activation='relu'),
    BatchNormalization(axis=-1),
    Conv1D(1024, 1, activation='relu'),
    BatchNormalization(axis=-1),
    MaxPooling1D(1)
])
Thanks in advance.
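A minimal sketch of one way to set this up, assuming each point cloud is an (N, 3) matrix (GlobalMaxPooling1D is used here for the final aggregation):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, BatchNormalization, GlobalMaxPooling1D

# Conv1D with kernel size 1 acts as a shared MLP: the same weights are applied
# to every row (point) of the matrix independently.
model = Sequential([
    tf.keras.Input(shape=(None, 3)),    # N points with (x, y, z); N can vary
    Conv1D(64, 1, activation='relu'),
    BatchNormalization(),
    Conv1D(128, 1, activation='relu'),
    BatchNormalization(),
    Conv1D(1024, 1, activation='relu'),
    BatchNormalization(),
    GlobalMaxPooling1D(),               # aggregate over all points -> (batch, 1024)
])

points = tf.random.uniform([2, 500, 3])   # a batch of 2 matrices with 500 points each
print(model(points).shape)                # (2, 1024)
```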

conv2d on a non-rectangular image in TensorFlow

I have a dataset of images which are half black in an upper-triangular fashion, i.e. all pixels below the main diagonal are black.
Is there a way in TensorFlow to give such an image to a conv2d layer and mask or limit the convolution to only the relevant pixels?
If the black translates to 0, then you don't need to do anything: the convolution will multiply the 0 by whatever weight it has, so it's not going to contribute to the result. If it's not 0, you can multiply the data by a binary mask to make those pixels 0.
For the all-black pixels you will still get the bias term, if your convolution has one.
You could multiply the result by a binary mask to zero out the areas you don't want populated. This way you can also decide to drop results that have too many black cells, like around the diagonal.
You can also write your own custom operation that does what you want. I would recommend against it, because you would only get a speedup of at most 2x (and the other operations will lower it). You would probably get more performance by running on a GPU.
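A short sketch of the masking idea (image size and layer are arbitrary):

```python
import tensorflow as tf

H, W, C = 64, 64, 3   # hypothetical image size

# Binary mask that is 1 on and above the main diagonal, 0 below it.
mask = tf.linalg.band_part(tf.ones([H, W]), 0, -1)   # upper-triangular
mask = mask[tf.newaxis, :, :, tf.newaxis]            # [1, H, W, 1], broadcasts over batch and channels

images = tf.random.uniform([8, H, W, C])             # stand-in batch
masked_in = images * mask                            # force the "black" region to 0

conv = tf.keras.layers.Conv2D(16, 3, padding='same')
out = conv(masked_in) * mask                         # also zero out outputs in the unwanted region
```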

Properly concatenate feature maps in Tensorflow

I am attempting to reproduce a Convolutional Neural Network from a research paper using TensorFlow.
There are many places in the diagram where the results of convolutions are concatenated. Currently I am using tf.concat (https://www.tensorflow.org/api_docs/python/tf/concat) along the last axis (representing channels) to concatenate these feature maps. I originally believed that I would want to concatenate along all axes, but this does not seem to be an option in TensorFlow. Now I am facing the problem that the paper indicates that tensors (feature maps) of different sizes should be concatenated. tf.concat does not support concatenation of tensors with different sizes, so I am wondering if this was the correct command to use in the first place. In summary, what is the correct way to concatenate feature maps (sometimes of different sizes) in TensorFlow?
Thank you.
It's impossible, and meaningless, to concatenate feature maps with different sizes.
If you want to concatenate 2 tensors, every dimension except the concatenation one must be equal.
From the image you posted, in fact, you can see that every feature map that gets concatenated has the same spatial extent (but a different depth) as the others.
If you can't concatenate in that way, there's probably something wrong in your code, and most likely the problem is a lack of padding in the convolution operations (i.e. padding='valid' shrinking the feature maps).
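For example (shapes are arbitrary):

```python
import tensorflow as tf

x = tf.random.uniform([1, 32, 32, 3])

# Branches with padding='same' keep the 32x32 spatial extent, so only the
# channel depth differs and concatenation along the last axis works.
a = tf.keras.layers.Conv2D(16, 3, padding='same')(x)   # (1, 32, 32, 16)
b = tf.keras.layers.Conv2D(32, 5, padding='same')(x)   # (1, 32, 32, 32)
merged = tf.concat([a, b], axis=-1)                     # (1, 32, 32, 48)
```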
The problem you encounter with the Inception network may be resolved by using padding in the convolutional layers to keep the sizes the same. For Inception blocks, instead of using "VALID" padding, change it to "SAME". Then, without requiring any resizing, you can concatenate the outputs.
Alternatively, you can pad the feature maps that are going to be concatenated. You can do that by using tf.pad().
If you prefer not to do that, you can use the tf.image.resize_images function to resize them to the same size. However, this is a hacky and computationally expensive approach.
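A short sketch of both options (shapes are arbitrary; tf.image.resize is the TF2 name for tf.image.resize_images):

```python
import tensorflow as tf

small = tf.random.uniform([1, 28, 28, 16])   # e.g. the result of a VALID 5x5 convolution
large = tf.random.uniform([1, 32, 32, 32])

# Option 1: zero-pad the smaller map (2 pixels on each side here) up to 32x32.
padded = tf.pad(small, [[0, 0], [2, 2], [2, 2], [0, 0]])
merged = tf.concat([padded, large], axis=-1)            # (1, 32, 32, 48)

# Option 2: resize instead of padding.
resized = tf.image.resize(small, [32, 32])
merged2 = tf.concat([resized, large], axis=-1)          # (1, 32, 32, 48)
```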
Tensors can only be concatenated along one axis. If you need to concatenate feature maps of different sizes, you must somehow manipulate the sizes of the original tensors.

How to train with inputs of variable size?

This question is rather abstract and not necessarily tied to TensorFlow or Keras. Say that you want to train a language model, and you want to use inputs of different sizes for your LSTMs. In particular, I'm following this paper: https://www.researchgate.net/publication/317379370_A_Neural_Language_Model_for_Query_Auto-Completion.
The authors use, among other things, word embeddings and one-hot encoding of characters. Most likely, the dimensions of each of these inputs are different. Now, to feed that into a network, I see a few alternatives but I'm sure I'm missing something and I would like to know how it should be done.
Create a 3D tensor of shape (instances, 2, max(embeddings,characters)). That is, padding the smaller input with 0s.
Create a 3D tensor of shape (instances, embeddings+characters, 1). That is, concatenating the inputs.
It looks to me like both alternatives are bad for efficiently training the model. So, what's the best way to approach this? I see the authors use an embedding layer for this purpose, but technically, what does that mean?
EDIT
Here are more details. Let's call these inputs X (character-level input) and E (word-level input). For each character of a sequence (a text), I compute x, e and y (the label).
x: character one-hot encoding. My character index is of size 38, so this is a vector filled with 37 zeros and one 1.
e: precomputed word embedding of dimension 200. If the character is a space, I fetch the word embedding of the previous word in the sequence; otherwise, I assign the vector for an incomplete word (INC, also of size 200). Real example with the sequence "red car": r>INC, e>INC, d>INC, _>embeddings["red"], c>INC, a>INC, r>INC.
y: the label to be predicted, which is the next character, one-hot encoded. This output is of the same dimension as x because it uses the same character index. In the example above, for "r", y is the one-hot encoding of "e".
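A toy version of this encoding (the character index and the embeddings below are stand-ins, not my real ones):

```python
import numpy as np

chars = sorted(set("abcdefghijklmnopqrstuvwxyz "))   # stand-in character index
char_to_id = {c: i for i, c in enumerate(chars)}
embeddings = {"red": np.random.rand(200)}            # stand-in precomputed word embedding
INC = np.random.rand(200)                            # vector for incomplete words

def encode(sequence):
    xs, es, ys = [], [], []
    for i, ch in enumerate(sequence[:-1]):
        x = np.zeros(len(chars))
        x[char_to_id[ch]] = 1.0                      # one-hot of the current character
        # word embedding of the previous word on spaces, INC otherwise
        e = embeddings.get(sequence[:i].split()[-1], INC) if ch == " " else INC
        y = np.zeros(len(chars))
        y[char_to_id[sequence[i + 1]]] = 1.0         # next character, one-hot
        xs.append(x); es.append(e); ys.append(y)
    return np.array(xs), np.array(es), np.array(ys)

x, e, y = encode("red car")
print(x.shape, e.shape, y.shape)   # (6, 27) (6, 200) (6, 27)
```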
According to the Keras documentation, the padding idea seems to be the right one. There is a masking parameter in the embedding layer (mask_zero) that will make Keras skip these values instead of processing them. In theory, you don't lose much performance: if the library is well built, the skipping actually skips the extra processing.
You just need to take care not to assign the value zero to any other character, not even spaces or unknown words.
An embedding layer is not only for masking (masking is just an option in an embedding layer).
The embedding layer transforms integer values from a word/character dictionary into actual vectors of a certain shape.
Suppose you have this dictionary:
1: hey
2: ,
3: I'm
4: here
5: not
And you form sentences like
[1,2,3,4,0] -> this is "hey, I'm here"
[1,2,3,5,4] -> this is "hey, I'm not here"
[1,2,1,2,1] -> this is "hey, hey, hey"
The embedding layer will transform each of those integers into a vector of a certain size. This does two good things at the same time:
Transforms the words into vectors, because neural networks can only handle vectors or intensities. A list of indices cannot be processed by a neural network directly; there is no logical relation between the indices and the words.
Creates a vector that will be a "meaningful" set of features for each word.
And after training, they become "meaningful" vectors. Each element starts to represent a certain feature of the word, although that feature is obscure to humans. It's possible for an embedding to be capable of detecting words that are verbs, nouns, feminine, masculine, etc., everything encoded in a combination of numeric values (presence/absence/intensity of features).
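A minimal Keras sketch of this (dimensions are arbitrary):

```python
import numpy as np
import tensorflow as tf

# Toy sentences from the dictionary above; index 0 is reserved for padding.
sentences = np.array([
    [1, 2, 3, 4, 0],   # "hey, I'm here" (padded with a trailing 0)
    [1, 2, 3, 5, 4],   # "hey, I'm not here"
    [1, 2, 1, 2, 1],   # "hey, hey, hey"
])

model = tf.keras.Sequential([
    # mask_zero=True makes downstream layers (the LSTM here) skip timesteps
    # whose index is 0, i.e. the padding.
    tf.keras.layers.Embedding(input_dim=6, output_dim=8, mask_zero=True),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(6, activation='softmax'),
])

print(model(sentences).shape)   # (3, 6)
```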
You may also try the approach in this question, which, instead of using masking, separates the batches by length so that each batch can be trained without needing to pad it: Keras misinterprets training data shape