Filter size and stride when upsampling image using Conv2D Transpose - tensorflow

I am using Conv2DTranspose to upsample images by factors of 18, 9, 6, and 3.
My images are of sizes (1,1), (2,2), (3,3), (6,6). The goal is to upsample each of them to size (18,18).
The problem I am having is choosing the correct filter size, stride, and padding to achieve this. I have read articles about checkerboard patterns that can arise from improper sizes, but I still have not found any guidance on which sizes to choose.
For (1,1) -> (18,18), I have chosen a filter size of (18,18) with no stride and no padding. This makes sense to me, as this one pixel is solely responsible for the look of the entire upsampled image.
But the other three are giving me problems.
One solution I have thought of is that for (2,2) -> (18,18) I use a filter size of (9,9) with stride (9,9), so that each of the (2,2) pixels produces a (9,9) block of upsampled pixels.
Is this a proper way to do it, or would you recommend something else?

Have a look at the Keras docs. You can find the formula to calculate the output shape there:
new_rows = ((rows - 1) * strides[0] + kernel_size[0] - 2 * padding[0] + output_padding[0])
new_cols = ((cols - 1) * strides[1] + kernel_size[1] - 2 * padding[1] + output_padding[1])
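As a quick sanity check of that formula against the sizes in the question, here is a minimal sketch (the choice of 16 filters is arbitrary; padding='valid' corresponds to padding = 0, and output_padding is left at its default):
import tensorflow as tf

# (input size, kernel_size, stride) picked so that (size - 1) * stride + kernel = 18
for size, k, s in [(1, 18, 1), (2, 9, 9), (3, 6, 6), (6, 3, 3)]:
    x = tf.zeros([1, size, size, 16])
    y = tf.keras.layers.Conv2DTranspose(filters=16, kernel_size=k, strides=s,
                                        padding='valid')(x)
    print((size, size), '->', tuple(y.shape[1:3]))   # each prints (18, 18)
So the (9,9) kernel with stride (9,9) from the question does land exactly on (18,18), as do the analogous choices for the other input sizes.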


RGB to gray filter doesn't preserve the shape

I have 209 cat/non-cat images and I am looking to augment my dataset. To do so, I am using the following code to apply a grey filter to each NumPy array of RGB values. The problem is that I need the dimensions to stay the same for my neural network to work, but they end up different. The code:
def rgb2gray(rgb):
    return np.dot(rgb[..., :3], [0.2989, 0.5870, 0.1140])
Normal image dimensions: (64, 64, 3)
After applying the filter: (64, 64)
I know that the missing 3 is probably the RGB channels or something, but I cannot find a way to add a "dummy" third dimension that would not affect the actual image. Can someone provide an alternative to the rgb2gray function that maintains the dimensions?
The whole point of applying that greyscale filter is to reduce the number of channels from 3 (i.e. R,G and B) down to 1 (i.e. grey).
If you really, really want to get a 3-channel image that looks just the same but takes 3x as much memory, just make all 3 channels equal:
grey = np.dstack((grey, grey, grey))
Or fold it into the function itself. Note that the grayscale weights have to run down the columns of the weight matrix, because np.dot contracts the channel axis of rgb with the rows of the matrix:
def rgb2gray(rgb):
    # every output channel is the same weighted sum 0.2989*R + 0.5870*G + 0.1140*B
    return np.dot(rgb[..., :3], [[0.2989, 0.2989, 0.2989],
                                 [0.5870, 0.5870, 0.5870],
                                 [0.1140, 0.1140, 0.1140]])
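A quick shape check of that version (the random array is just a stand-in for one of the images; assumes numpy is imported as np and the rgb2gray definition above):
img = np.random.rand(64, 64, 3)
grey3 = rgb2gray(img)
print(grey3.shape)                                # (64, 64, 3)
print(np.allclose(grey3[..., 0], grey3[..., 1]))  # True -- the three channels are identical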

How to convolve 1xW and reduce width only?

Given a volume [2, 2W, C], after applying pooling with a 2x2 window and stride 2, I'm now left with [1, W, C] (height = 1 px, width = half of what it was before, channels unchanged).
What I want to do now is apply a convolution op with the sole purpose of reducing that width dimension. Is this even possible?
Yes this is possible (though because it's unusual, the solution is a bit hackish).
Conceptually, there's no issue here. This is frequently done in the depth/channel dimension rather than width, where people usually call it a 1x1 convolution. Again the sole purpose is dimensionality reduction. A nice blog post about it is http://iamaaditya.github.io/2016/03/one-by-one-convolution/ (to be clear, I am not the author of that blog). That is, a typical 1x1 conv layer is really a bank of D2 filters of size 1x1xD, and dimensionality reduction is achieved by D2 < D. Here you want the same thing but in width: 1xWx1 filter size, W2 times. Conceptually then, that's it; it should be easy.
Practically, of course, this is not so easy, since the convention in CNNs treats width and depth differently: one convolves over width, but filters always operate on the full depth stack. That makes a 1x1 convolution easy in depth but tricky in width. You have at least two options in tensorflow:
Use a full width filter with no zero padding
tf.nn.conv2d(input, filter, strides, padding="VALID", ...)
such that filter_width = W (as in [filter_height, filter_width, in_channels, out_channels]). You then make several of these filters, which gets you the output information you want. Pro: this considers the full width of the stack, so it serves as dimensionality reduction in the same sense as a typical (depth) 1x1 convolution. Con: this moves your width information into the depth stack (you get a width of 1 for each filter, so your "reduced" dimension ends up not in the width but in the depth). That's almost certainly not desirable. You could tf.reshape your way out of it, but yuck.
Use strides to sort of accomplish this
tf.nn.conv2d(input, filter, [1, 1, 2, 1], padding="VALID", ...)
where strides is specified as [1,1,2,1] and you specify a filter with filter_width = 2. This reduces your width dimension by a factor of 2 (or 3, or any other factor that divides your width evenly), using a stride that matches the filter width (and, critically, zero padding that is effectively 0). Pro: this is clean and produces the data sizes you want without the reshaping annoyance above. Con: it isn't doing a 1x1 convolution / dimensionality reduction in the usual sense. It reduces the dimension pairwise (every two adjacent positions become one) rather than mixing all positions together, so it is not a great dimensionality reduction method and you might lose a lot of signal. You should probably try this one because it's much cleaner, but be forewarned about that issue. A shape-only sketch of both options follows below.
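A minimal sketch of both options, just to show the resulting shapes (the sizes W = 8, C = 16 and the filter counts are arbitrary):
import tensorflow as tf

x = tf.random.normal([1, 1, 8, 16])        # [batch, height=1, width=W, channels=C]

# Option 1: full-width filter with VALID padding -- the width collapses to 1 and
# the "reduced" information ends up in the channel axis (4 output filters here).
f_full = tf.random.normal([1, 8, 16, 4])   # [filter_height, filter_width, in_ch, out_ch]
y_full = tf.nn.conv2d(x, f_full, strides=[1, 1, 1, 1], padding="VALID")
print(y_full.shape)                         # (1, 1, 1, 4)

# Option 2: filter_width = 2 with stride 2 along the width -- the width is halved
# and the channel count stays the same.
f_pair = tf.random.normal([1, 2, 16, 16])
y_pair = tf.nn.conv2d(x, f_pair, strides=[1, 1, 2, 1], padding="VALID")
print(y_pair.shape)                         # (1, 1, 4, 16)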

Getting wrong parameter count for Google NASNet-A neural net

I’m trying to understand the NASNet-A architecture in detail, but can’t match the parameter counts in the paper.
For example, the paper says CIFAR-10 NASNet-A “6 # 768” model has 3.3M params, but by my calculations a single “sep 5x5” primitive in the final cell should alone have 2.9M params… which can’t be right!
Here’s how I derive this count…
The “6 # 768” notation means the “number of filters in the penultimate layer of the network” is 768, which I assume means the number of filters in each of the primitive operations in the cell is 768, and therefore the output depth of the concat operation (with 5 block inputs) is 5 * 768. Since shape is only changed by reduction cells, the input to the final cell (concat output from prior normal cell) will also be of depth 5 * 768.
So for a 5x5 separable convolution with 5 * 768 input channels and 768 output channels, the number of parameters is:
5x5x1 * (5 * 768) = 96,000 params for the 5x5 depthwise filters, plus
1x1x(5 * 768) x 768 = 2,949,120 params for the 1x1 pointwise filters
Where am I going wrong?!
The number of output channels from each operation in a cell's block is given by the defined num_conv_filters. For CIFAR NASNet-A this is 32, and it doubles after each reduction cell.
Although they mention having B=5 blocks and no residual connection, there appear to be 6 concatenated chunks of filters; the last one seems to come from the previous layer.
See: https://github.com/tensorflow/models/blob/d07447a3e34bc66acd9ba7267437ebe9d15b45c0/research/slim/nets/nasnet/nasnet_utils.py#L309
This is why, for example, you get a feature depth of 192 in the first cell:
6*32 = 192.
You can take a look at the expected depths here:
https://github.com/tensorflow/models/blob/d07447a3e34bc66acd9ba7267437ebe9d15b45c0/research/slim/nets/nasnet/nasnet_test.py#L127
So, for example, for the last 5x5 separable convolution you get:
5x5*768 + 768*128 = 117,504 parameters
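A tiny helper reproducing that count (ignoring biases and batch-norm parameters; the numbers are exactly the ones from the line above):
def sep_conv_params(kernel, in_ch, out_ch):
    depthwise = kernel * kernel * in_ch   # one kernel x kernel filter per input channel
    pointwise = in_ch * out_ch            # 1x1 pointwise convolution mixing channels
    return depthwise + pointwise

print(sep_conv_params(5, 768, 128))       # 117504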
For more info about the separable convolution:
http://forums.fast.ai/t/how-depthwise-separable-convolutions-work/4249

Dynamic Tensor Alignment/Cropping

I implemented a fully convolutional network in TensorFlow. It uses an encoder-decoder structure.
When training, I always use the same image size (224x224, via random crop) and everything works nicely.
In the inference phase, I want to predict one image at a time, because I want to use the full image (not cropped). Such an image has, for example, size [406,256]. And here is the problem.
In the encoder-decoder architecture I add two tensors (z = x + y). When training, the sizes of both tensors match. When predicting on my single image, the sizes do not match (tensor sizes: [1,47,47,64] vs [1,46,46,64]). I think it is caused by some rounding done in the Conv and Pool layers.
What should I change in my architecture so it works for any image size I want? Should I change the rounding parameters? Or add 'cropping' of the tensor?
Link to implementation of architecture:
https://gist.github.com/melgor/0e43cadf742fe3336148ab64dd63138f
(the problem occurs at line 166)
I found a solution for variable input size :)
What we really need is a 'crop layer' that crops one tensor to match another. I found a very similar layer here: http://tf-unet.readthedocs.io/en/latest/_modules/tf_unet/layers.html
(crop_and_concat).
I have just turned it into `crop_and_add`, and it is working:
def crop_and_add(x1, x2):
    x1_shape = tf.shape(x1)
    x2_shape = tf.shape(x2)
    # offsets for the top left corner of the crop
    offsets = [0, (x1_shape[1] - x2_shape[1]) // 2, (x1_shape[2] - x2_shape[2]) // 2, 0]
    size = [-1, x2_shape[1], x2_shape[2], -1]
    x1_crop = tf.slice(x1, offsets, size)
    return x1_crop + x2
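A quick sanity check of the function (run eagerly in TF 2.x, assuming tensorflow is imported as tf and the definition above; the shapes mirror the [1,47,47,64] vs [1,46,46,64] mismatch):
x1 = tf.zeros([1, 47, 47, 64])
x2 = tf.zeros([1, 46, 46, 64])
z = crop_and_add(x1, x2)
print(z.shape)   # (1, 46, 46, 64) -- x1 is centre-cropped to x2's spatial size before the add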
I replaced every addition in the model (i.e., wherever encoder and decoder data are merged) with the layer above.
Also, the input to the model needs to be defined as:
image = tf.placeholder(tf.float32, shape=[1, None, None, 3], name="input_image")
So we know that we will pass a single image and that the image has 3 channels, but we know neither its width nor its height. And it works very nicely! (40 FPS on a K80 on an AWS P2; the image size is 224x{}, i.e., the shorter side of the image is 224.)
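For illustration, inference on one full-size image then looks roughly like this (assuming numpy as np; sess and output are hypothetical names for your session and the model's output tensor -- the point is only that an arbitrary-size image can be fed through the [1, None, None, 3] placeholder):
img = np.random.rand(1, 406, 256, 3).astype(np.float32)  # any height/width works now
prediction = sess.run(output, feed_dict={image: img})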
FYI, I also tried running ENet (2x faster than LinkNet), but in TensorFlow it is slower. I think it is because of PReLU (which is slow in TF). Also, it does not support arbitrary image sizes because of the UnPool layer, which needs its output size predefined as a list of integers (not placeholders). So LinkNet looks better in terms of speed and performance in TF.

what is the best way to multiply tensors in tensorflow

Suppose that I have tensors x[i,j,k] and y[p,q] in a graph. What is the correct way to specify the tensor z[i,j,k,p,q] = x[i,j,k] * y[p,q]? This is the coordinate representation of the tensor product of x and y. I can get the job done using a combination of tf.expand_dims, tf.multiply and tf.tile, but I feel like there should be a better way...
I think you can get away without the tile operation using broadcasting.
x_reshaped = tf.reshape(x, (i, j, k, 1, 1))
y_reshaped = tf.reshape(y, (1, 1, 1, p, q))
z = x_reshaped * y_reshaped
When a dimension has size 1 and does not match the size of the corresponding dimension of the other tensor it is being multiplied with, it is automatically copied / broadcast along that dimension and the product is carried out. Tile is often unnecessary; I don't think I have ever even used tile in tensorflow. Here I also used reshape rather than expand_dims, but the result is the same either way.
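A concrete sketch with arbitrary sizes (i,j,k = 2,3,4 and p,q = 5,6), including the expand_dims-style equivalent:
import tensorflow as tf

x = tf.random.normal([2, 3, 4])    # x[i, j, k]
y = tf.random.normal([5, 6])       # y[p, q]

z = tf.reshape(x, [2, 3, 4, 1, 1]) * tf.reshape(y, [1, 1, 1, 5, 6])
print(z.shape)                      # (2, 3, 4, 5, 6)

# The same thing with None-indexing, which is equivalent to tf.expand_dims:
z2 = x[:, :, :, None, None] * y[None, None, None, :, :]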