How to convolve 1xW and reduce width only? - tensorflow

Given a volume [2, 2W, C], after applying pooling with 2x2 window and stride 2, I'm now left with [1, W, C] (height = 1px, width = half what it was before, channels = stays the same).
What I want to do now is apply a convolution op with the sole purpose of reducing that width dimension. Is this even possible?

Yes this is possible (though because it's unusual, the solution is a bit hackish).
Conceptually, there's no issue here. This is frequently done in the depth/channel dimension rather than width, where people usually call it a 1x1 convolution. Again the sole purpose is dimensionality reduction. A nice blog post about it is http://iamaaditya.github.io/2016/03/one-by-one-convolution/ (to be clear, I am not the author of that blog). That is, a typical 1x1 conv layer is really a bank of D2 filters of size 1x1xD, and dimensionality reduction is achieved by D2 < D. Here you want the same thing but in width: 1xWx1 filter size, W2 times. Conceptually then, that's it; it should be easy.
Practically of course, this is not so easy, as in CNNs convention treats width and depth differently: one convolves over width, but filters always operate on the full depth stack; making a 1x1 convolution easy in depth, but tricky in width. You have at least two options in tensorflow:
Use a full width filter with no zero padding
tf.nn.conv2d(input,filter,strides,padding="VALID",...)
such that filter_width = W (as in [filter_height, filter_width, in_channels, out_channels]). You then make several of these, which gets you the output information you want. Pro: This considers the full width of the stack, so serves as a dimensionality reduction in the equivalent sense as a typical (depth) 1x1 convolution. Con: This moves your width information to the depth stack (you get width of 1 for each filter, so your "reduced" dimension is not in the width, but in the depth. That's almost certainly not desirable. You could tf.reshape your way out of it, but yuck.
Use strides to sort of accomplish this
tf.nn.conv2d(input, filter, [1,1,2,1],padding="VALID",...)
where strides has been specified as [1,1,2,1] and you specify filter where filter_width = 2. This will reduce your width dimension by 2 (or 3 or any other factor that divides your width evenly), using a stride that matches your filter width (and critically zero padding that will be in effect 0). Pro this is clean and produces the data sizes you want without the reshaping annoyance above. Con this isn't doing a 1x1 convolution / dimension reduction in the usual sense. It is reducing dimension pairwise (every two adjacent dimensions are becoming one), not mixing all dimensions together. This is not a good dimensionality reduction method, so you might lose a lot of signal. Probably you should try this one because it's much cleaner, but be forewarned about that issue.

Related

Why tf.contrib.layers.instance_norm layer contain StopGradient operation?

Why tf.contrib.layers.instance_norm layer contain StopGradient operation? i.e. why it's needed?
Seems there is StopGradient even in simpler layer tf.nn.moments (that can be building block of tf.contrib.layers.instance_norm).
x_m, x_v = tf.nn.moments(x, [1, 2], keep_dims=True)
Also I find a note on StopGradient in tf.nn.moments source code:
# The dynamic range of fp16 is too limited to support the collection of
# sufficient statistics. As a workaround we simply perform the operations
# on 32-bit floats before converting the mean and variance back to fp16
y = math_ops.cast(x, dtypes.float32) if x.dtype == dtypes.float16 else x
# Compute true mean while keeping the dims for proper broadcasting.
mean = math_ops.reduce_mean(y, axes, keepdims=True, name="mean")
# sample variance, not unbiased variance
# Note: stop_gradient does not change the gradient that gets
# backpropagated to the mean from the variance calculation,
# because that gradient is zero
variance = math_ops.reduce_mean(
math_ops.squared_difference(y, array_ops.stop_gradient(mean)),
axes,
keepdims=True,
name="variance")
So it's sort of optimisation because gradient is always zero?
Attempt of an answer.
This design tells us that minimizing second moment we would not want to propagate gradients through the first moment. Does it make sense? If we try to minimize E[x^2]-E[x]^2 we would minimize E[x^2] while simultaneously maximizing E[x]^2. First term would decrease absolute values of each element (drag them to the center). Second term would increase all values by gradient which would do nothing to minimize variance but might negatively affect other gradient paths.
So, we don't propagate gradient of second moment through the first moment because this gradient would not effect second moment whatsoever, at least when using plain SGD.

Implement CVAE for a single image

I have a multi-dimensional, hyper-spectral image (channels, width, height = 15, 2500, 2500). I want to compress its 15 channel dimensions into 5 channels.So, the output would be (channels, width, height = 5, 2500, 2500). One simple way to do is to apply PCA. However, performance is not so good. Thus, I want to use Variational AutoEncoder(VAE).
When I saw the available solution in Tensorflow or keras library, it shows an example of clustering the whole images using Convolutional Variational AutoEncoder(CVAE).
https://www.tensorflow.org/tutorials/generative/cvae
https://keras.io/examples/generative/vae/
However, I have a single image. What is the best practice to implement CVAE? Is it by generating sample images by moving window approach?
One way of doing it would be to have a CVAE that takes as input (and output) values of all the spectral features for each of the spatial coordinates (the stacks circled in red in the picture). So, in the case of your image, you would have 2500*2500 = 6250000 input data samples, which are all vectors of length 15. And then the dimension of the middle layer would be a vector of length 5. And, instead of 2D convolutions that are normally used along the spatial domain of images, in this case it would make sense to use 1D convolution over the spectral domain (since the values of neighbouring wavelengths are also correlated). But I think using only fully-connected layers would also make sense.
As a disclaimer, I haven’t seen CVAEs used in this way before, but like this, you would also get many data samples, which is needed in order for the learning generalise well.
Another option would be indeed what you suggested -- to just generate the samples (patches) using a moving window (maybe with a stride that is the half size of the patch). Even though you wouldn't necessarily get enough data samples for the CVAE to generalise really well on all HSI images, I guess it doesn't matter (if it overfits), since you want to use it on that same image.

Should RNN attention weights over variable length sequences be re-normalized to "mask" the effects of zero-padding?

To be clear, I am referring to "self-attention" of the type described in Hierarchical Attention Networks for Document Classification and implemented many places, for example: here. I am not referring to the seq2seq type of attention used in encoder-decoder models (i.e. Bahdanau), although my question might apply to that as well... I am just not as familiar with it.
Self-attention basically just computes a weighted average of RNN hidden states (a generalization of mean-pooling, i.e. un-weighted average). When there are variable length sequences in the same batch, they will typically be zero-padded to the length of the longest sequence in the batch (if using dynamic RNN). When the attention weights are computed for each sequence, the final step is a softmax, so the attention weights sum to 1.
However, in every attention implementation I have seen, there is no care taken to mask out, or otherwise cancel, the effects of the zero-padding on the attention weights. This seems wrong to me, but I fear maybe I am missing something since nobody else seems bothered by this.
For example, consider a sequence of length 2, zero-padded to length 5. Ultimately this leads to the attention weights being computed as the softmax of a similarly 0-padded vector, e.g.:
weights = softmax([0.1, 0.2, 0, 0, 0]) = [0.20, 0.23, 0.19, 0.19, 0.19]
and because exp(0)=1, the zero-padding in effect "waters down" the attention weights. This can be easily fixed, after the softmax operation, by multiplying the weights with a binary mask, i.e.
mask = [1, 1, 0, 0, 0]
and then re-normalizing the weights to sum to 1. Which would result in:
weights = [0.48, 0.52, 0, 0, 0]
When I do this, I almost always see a performance boost (in the accuracy of my models - I am doing document classification/regression). So why does nobody do this?
For a while I considered that maybe all that matters is the relative values of the attention weights (i.e., ratios), since the gradient doesn't pass through the zero-padding anyway. But then why would we use softmax at all, as opposed to just exp(.), if normalization doesn't matter? (plus, that wouldn't explain the performance boost...)
Great question! I believe your concern is valid and zero attention scores for the padded encoder outputs do affect the attention. However, there are few aspects that you have to keep in mind:
There are different score functions, the one in tf-rnn-attention uses simple linear + tanh + linear transformation. But even this score function can learn to output negative scores. If you look at the code and imagine inputs consists of zeros, vector v is not necessarily zero due to bias and the dot product with u_omega can boost it further to low negative numbers (in other words, plain simple NN with a non-linearity can make both positive and negative predictions). Low negative scores don't water down the high scores in softmax.
Due to bucketing technique, the sequences within a bucket usually have roughly the same length, so it's unlikely to have half of the input sequence padded with zeros. Of course, it doesn't fix anything, it just means that in real applications negative effect from the padding is naturally limited.
You mentioned it in the end, but I'd like to stress it too: the final attended output is the weighted sum of encoder outputs, i.e. relative values actually matter. Take your own example and compute the weighted sum in this case:
the first one is 0.2 * o1 + 0.23 * o2 (the rest is zero)
the second one is 0.48 * o1 + 0.52 * o2 (the rest is zero too)
Yes, the magnitude of the second vector is two times bigger and it isn't a critical issue, because it goes then to the linear layer. But relative attention on o2 is just 7% higher, than it would have been with masking.
What this means is that even if the attention weights won't do a good job in learning to ignore zero outputs, the end effect on the output vector is still good enough for the decoder to take the right outputs into account, in this case to concentrate on o2.
Hope this convinces you that re-normalization isn't that critical, though probably will speed-up learning if actually applied.
BERT implementation applies a padding mask for calculating attention score.
Adds 0 to the non-padding attention score and adds -10000 to padding attention scores. the e^-10000 is very small w.r.t to other attention score values.
attention_score = [0.1, 0.2, 0, 0, 0]
mask = [0, 0, -10000, -10000] # -10000 is a large negative value
attention_score += mask
weights = softmax(attention_score)

conv2d on non-rectangular image in Tensorflow

I have dataset of images which are half black in a upper triangular fashion, i.e. all pixels below the main diagonal are black.
Is there a way in Tensorflow to give such an image to a conv2d layer and mask or limit the convolution to only the relevant pixels?
If the black translates to 0 then you don't need to do anything. The convolution will multiply the 0 by whatever weight it has so it's not going to contribute to the result. If it's not you can multiply the data with a binary mask to make them 0.
For all black pixels you will still get any bias term if you have any.
You could multiply the result with a binary mask to 0 out the areas you don't want populated. This way you can also decide to drop results that have too many black cells, like around the diagonal.
You can also write your own custom operation that does what you want. I would recommend against it because you only get a speedup of at most 2 (the other operations will lower it). You probably get more performance by running on a GPU.

Dynamic Tensor Aligment/Cropping

I implemented Fully-Convolution Network at TensorFlow. It use encdoder-decoder structure.
When training, I use always same image size (224x224, using random crop) and everything works nicely.
In interference phase, I want to predict one image at a time, because I want to use full-image (not croped). For example, such image have size [406,256]. And here is problem.
In Encoder-Decoder architecture I add two tesors (z = x + y). When training, sizes of both tensor matches. When predicting my single image, sizes does not match (tensor sizes: [1,47,47,64] vs [1,46,46,64]). I think it is cause by some rounding done in Conv and Pool layer.
What should I change in my architecture to works for any image size I want? Should I change rounding parameters? Or add 'cropping' of tensor?
Link to implementation of architecture:
https://gist.github.com/melgor/0e43cadf742fe3336148ab64dd63138f
(the problem occur in line 166)
I found the solution for variable input size:)
What we really need was a 'Crop-layer', that crop one tensor to match other. I found really similar layer here: http://tf-unet.readthedocs.io/en/latest/_modules/tf_unet/layers.html
(crop_and_concat).
I have just made it `crop_and_add' and it is working:
def crop_and_add(x1,x2):
x1_shape = tf.shape(x1)
x2_shape = tf.shape(x2)
# offsets for the top left corner of the crop
offsets = [0, (x1_shape[1] - x2_shape[1]) // 2, (x1_shape[2] - x2_shape[2]) // 2, 0]
size = [-1, x2_shape[1], x2_shape[2], -1]
x1_crop = tf.slice(x1, offsets, size)
return x1_crop + x2
All addition in model I replaced by above layer (so merging encoder and decoder data).
Also, the input to model need to be defined as:
image = tf.placeholder(tf.float32, shape=[1, None, None, 3], name="input_image")
So we know that we will pass single image and that image have 3 channels. but we do not know neither width nor height. And it works very nice! (40 FPS on K80 as AWS P2, size of image is 224x{}-shoter side of image have 224)
FYI, I was also trying to run ENET (2x faster than LinkNet), but in TensorFlow it is slower. I think it is because of PReLu (which is slow at TF). Also it does not support arbitraty size of image becauese of UnPool layer, which need to have predefined output size by list of integers (not placeholders). So LinkNet look better in case of Speed and Performacance in TF.