When would I want to set a stride in the batch or channel dimension for TensorFlow convolution? - tensorflow

TensorFlow implements a basic convolution operation with tf.nn.conv2d.
I am specifically interested in the "strides" parameter, which lets you set the stride of the convolution filter -- how far across the image you shift the filter each time.
The example given in one of the early tutorials, with an image stride of 1 in each direction, is
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
The strides array is explained more in the linked docs:
In detail, with the default NHWC format...
Must have strides[0] = strides[3] = 1. For the most common case of the same horizontal and vertical strides, strides = [1, stride, stride, 1].
Note the order of "strides" matches the order of inputs: [batch, height, width, channels] in the NHWC format.
Obviously a stride other than 1 for batch and channels wouldn't make sense, right? (Your filter should always go across every batch item and every channel.)
But then why is it even an option to put something other than 1 in strides[0] and strides[3]? (By "option" I mean that you can put something other than 1 in the Python list you pass in, disregarding the documentation quote above.)
Is there a situation where I would have a non-one stride for the batch or channels dimension, e.g.
tf.nn.conv2d(x, W, strides=[2, 1, 1, 2], padding='SAME')
If so, what would that example even mean in terms of the convolution operation?

There might be a situation where you send a video in chunks, so that your batch is a sequence of frames. Assuming that neighbouring frames are quite similar, we could skip some of them by increasing the batch stride. That is as far as I understand it; I don't know about the channel stride, though.
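A minimal sketch of what a batch stride of 2 would amount to under this reading: keep every second frame and drop the rest. Since the documentation quoted in the question requires strides[0] = strides[3] = 1, the skipping is done here by plain slicing before any convolution. The helper name subsample_batch and the frame labels are hypothetical and only serve the illustration.

```python
# Hypothetical stand-in for a batch of video frames; each "frame" is just a label.
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]

def subsample_batch(batch, batch_stride):
    """Keep every `batch_stride`-th item of the batch (plain slicing)."""
    return batch[::batch_stride]

kept = subsample_batch(frames, 2)
print(kept)  # ['f0', 'f2', 'f4']
```

With a real NHWC tensor the same idea would be a slice along the first axis before calling the convolution.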

Related

Can I simply replace a tf.nn.conv2d layer with a tf.layer.max_pooling?

I want to know how much better striding is compared to pooling.
My current code looks like this
w = tf.get_variable('w', [k_h, k_w, output_shape[-1], input_.get_shape()[-1]],
                    initializer=tf.random_normal_initializer(stddev=stddev))
deconv = tf.nn.conv2d_transpose(input_, w, output_shape=output_shape, strides=[1, d_h, d_w, 1])
Would the code below be more or less equivalent to the code above?
tf.layers.max_pooling2d(input_, pooling=2, strides=[1, d_h, d_w, 1], padding='same')
The two are different:
tf.nn.conv2d_transpose is used for upsampling: it takes a smaller image and scales it up to a larger one.
tf.layers.max_pooling2d is used for downsampling: it takes a larger image and scales it down to a smaller one.
You cannot replace one with the other, because they are used for opposite purposes.
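To make the opposite directions concrete, here is a small 1-D sketch in plain Python (not the TensorFlow ops themselves). It assumes no padding, and the helper names max_pool_1d and conv_transpose_1d are made up for this illustration. Pooling halves the length, while a stride-2 transposed convolution doubles it.

```python
def max_pool_1d(x, size, stride):
    """Max pooling without padding: slide a window and keep its maximum."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]

def conv_transpose_1d(x, kernel, stride):
    """Transposed convolution without padding: each input value stamps a
    scaled copy of the kernel into the output, `stride` positions apart."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out

signal = [1.0, 3.0, 2.0, 4.0]
pooled = max_pool_1d(signal, size=2, stride=2)                      # length 4 -> 2
upsampled = conv_transpose_1d(pooled, kernel=[1.0, 1.0], stride=2)  # length 2 -> 4
print(pooled)     # [3.0, 4.0]
print(upsampled)  # [3.0, 3.0, 4.0, 4.0]
```

Note that the upsampled signal has the original length but not the original values: pooling discards information that the transposed convolution cannot recover.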

Force symmetry for a TensorFlow conv2d kernel

I'd like to enforce symmetry in the weights within a Variable. I really want an approximate circular symmetry. However, I could imagine either row or column enforced symmetry.
The goal is to reduce training time by reducing the number of free variables. I know my problem calls for a symmetric array, but I might want to include both symmetric and "free" variables. I am using conv2d now, so I believe I need to keep using it.
Here is a function that creates a kernel symmetric with respect to reflection over its center row:
def SymmetricKernels(height,width,in_channels,out_channels,name=None):
    half_kernels = tf.Variable(initial_value=tf.random_normal([(height+1)//2,width,in_channels,out_channels]))
    half_kernels_reversed = tf.reverse(half_kernels[:(height//2),:,:,:],[0])
    kernels = tf.concat([half_kernels,half_kernels_reversed],axis=0,name=name)
    return kernels
Usage example:
w = SymmetricKernels(5,5,1,1)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
w_ = sess.run(w)
w_[:,:,0,0]
# output:
# [[-1.299 -1.835 -1.188 0.093 -1.736]
# [-1.426 -2.087 0.434 0.223 -0.65 ]
# [-0.217 -0.802 -0.892 -0.229 1.383]
# [-1.426 -2.087 0.434 0.223 -0.65 ]
# [-1.299 -1.835 -1.188 0.093 -1.736]]
The idea is to use tf.Variable() to create variables only for the upper half of the kernels (half_kernels), and then to form the symmetric kernels by concatenating the upper half with its reflection.
This idea can be extended to create kernels with both left-right and up-down symmetry.
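As a rough sketch of that extension, the same mirroring can be applied twice, once along each axis. It starts from a top-left quarter of size (height+1)//2 by (width+1)//2. The version below uses plain Python lists for a single 2-D kernel so that it runs without TensorFlow. The helper names mirror and symmetric_kernel_2d are hypothetical. In TensorFlow the analogous construction would presumably use tf.reverse and tf.concat along axes 0 and 1, as in the function above.

```python
def mirror(values, full_length):
    """Extend a 'first half' list to `full_length` by appending its reflection.
    `values` must hold the first (full_length + 1) // 2 entries."""
    return values + values[:full_length // 2][::-1]

def symmetric_kernel_2d(quarter, height, width):
    """Build a height x width kernel that is symmetric both left-right and
    up-down from its top-left quarter (a list of rows)."""
    rows = [mirror(row, width) for row in quarter]  # left-right symmetry
    return mirror(rows, height)                     # up-down symmetry

# For a 3x3 kernel the quarter holds (3+1)//2 = 2 rows of 2 values: 4 free values instead of 9.
kernel = symmetric_kernel_2d([[1, 2], [3, 4]], height=3, width=3)
for row in kernel:
    print(row)
# [1, 2, 1]
# [3, 4, 3]
# [1, 2, 1]
```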
Another thing you can try is to tie the net's hands by convolving twice, reusing the kernel but flipping it for the second convolution (untested code):
def symmetric_convolution(input_tensor, n_filters, size, name, dilations=[1,1,1,1]):
    with tf.variable_scope("", reuse=tf.AUTO_REUSE):
        kernel = tf.get_variable(shape=[*size, input_tensor.shape[-1], n_filters], name='conv_kernel_' + name, ...)
    lr_flipped_kernel = tf.reverse(kernel, axis=[1], name='conv_kernel_flipped_lr_' + name)
    conv_l = tf.nn.conv2d(input=input_tensor, filter=kernel, strides=[1, 1, 1, 1], padding='SAME', dilations=dilations)
    conv_r = tf.nn.conv2d(input=input_tensor, filter=lr_flipped_kernel, strides=[1, 1, 1, 1], padding='SAME', dilations=dilations)
    return tf.reduce_max(tf.concat([conv_l, conv_r], axis=-1), keepdims=True, axis=[-1])
You can add in biases, activations, etc. as needed. I've used something similar in the past – reduce_max will allow your kernel to take whatever shape, and effectively give you two convolutions for one; if you use reduce_sum instead, any asymmetries will average out quite quickly and your kernel will be symmetric. What works best will depend on your use case.

Effect of max_pool in Convolutional Neural Network [tensorflow]

I'm following Udacity Deep Learning video by Vincent Vanhoucke and trying to understand the (practical or intuitive or obvious) effect of max pooling.
Let's say my current model (without pooling) uses convolutions with stride 2 to reduce the dimensionality.
def model(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases
Now I introduced pooling: Replace the strides by a max pooling operation (nn.max_pool()) of stride 2 and kernel size 2.
def model(data):
    conv1 = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME')
    bias1 = tf.nn.relu(conv1 + layer1_biases)
    pool1 = tf.nn.max_pool(bias1, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    conv2 = tf.nn.conv2d(pool1, layer2_weights, [1, 1, 1, 1], padding='SAME')
    bias2 = tf.nn.relu(conv2 + layer2_biases)
    pool2 = tf.nn.max_pool(bias2, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    shape = pool2.get_shape().as_list()
    reshape = tf.reshape(pool2, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases
What would be the compelling reason to use the latter model instead of the no-pool model, besides the improved accuracy? I would love to have some insights from people who have already used CNNs many times!
Both approaches (strides and pooling) reduce the dimensionality of the input (for a stride or pooling size > 1). This is a good thing in itself, because it reduces the computation time and the number of parameters, and it helps to prevent overfitting.
They achieve it in different ways:
you can think of strides as downsampling the result of the 1-strided convolution by just taking every s-th result.
max-pooling downsamples the result by taking the maximum value from a hypercube. If some important feature has been found, max-pooling preserves it regardless of its position within that hypercube.
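The difference between the two can be checked on a tiny 1-D example. This is a plain-Python sketch without padding, and the helper names conv_1d and max_pool_1d are made up for the illustration. The stride-2 convolution gives exactly every second value of the stride-1 result, so a strong response at a skipped position is lost, whereas max-pooling over the stride-1 result keeps it.

```python
def conv_1d(x, kernel, stride=1):
    """Unpadded 1-D convolution (cross-correlation form) with the given stride."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]

def max_pool_1d(x, size, stride):
    """Unpadded max pooling: keep the maximum of each window."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]

x = [1, 5, 2, 8, 3, 7, 4]
kernel = [1, 0, -1]

dense = conv_1d(x, kernel, stride=1)      # [-1, -3, -1, 1, -1]
strided = conv_1d(x, kernel, stride=2)    # [-1, -1, -1]
pooled = max_pool_1d(dense, size=2, stride=2)  # [-1, 1]

# Striding is the same as taking every 2nd result of the stride-1 convolution.
assert strided == dense[::2]
print(dense, strided, pooled)
```

Here the only positive response (the 1 at index 3 of the stride-1 output) is dropped by the strided convolution but survives max-pooling.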
You also mentioned "besides the improved accuracy". But almost everything people do in machine learning is done to improve the accuracy (or some other loss function). So if tomorrow someone shows that sum-square-root pooling achieves the best result on many benchmarks, a lot of people will start to use it.
In a classification task improving the accuracy is the goal.
However, pooling allows you to:
Reduce the input dimensionality
Force the network to learn particular features, depending on the type of pooling you apply.
Reducing the input dimensionality is something you want because it forces the network to project its learned representations into a different, lower-dimensional space. This is good computationally speaking, because you have to allocate less memory and thus you can have bigger batches. But it is also desirable because high-dimensional spaces usually have a lot of redundancy and are spaces in which all objects appear to be sparse and dissimilar (see The curse of dimensionality).
The function you decide to use for the pooling operation, moreover, can force the network to give more importance to some features.
Max-pooling, for instance, is widely used because it allows the network to be robust to small variations of the input image.
What happens in practice is that only the features with the highest activations pass through the max-pooling gate.
If the input image is shifted by a small amount, then the max-pooling op produces the same output although the input is shifted (the maximum shift is thus equal to the kernel size).
CNNs without pooling are also capable of learning this kind of feature, but at a bigger cost in terms of parameters and computing time (see Striving for Simplicity: The All Convolutional Net).
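The shift robustness described above can be seen in a small 1-D sketch. This is plain Python with a window of 2 and a stride of 2, and the helper name max_pool_1d is made up for the illustration. As long as the peak stays inside the same pooling window the output is unchanged; once it crosses into the next window the output moves too, so the robustness only holds for small shifts.

```python
def max_pool_1d(x, size=2, stride=2):
    """Unpadded max pooling: keep the maximum of each window."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]

original     = [0, 0, 9, 0, 0, 0, 0, 0]
shifted_by_1 = [0, 0, 0, 9, 0, 0, 0, 0]  # peak still in the window covering indices 2-3
shifted_by_2 = [0, 0, 0, 0, 9, 0, 0, 0]  # peak has moved into the next window

print(max_pool_1d(original))      # [0, 9, 0, 0]
print(max_pool_1d(shifted_by_1))  # [0, 9, 0, 0]
print(max_pool_1d(shifted_by_2))  # [0, 0, 9, 0]
```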

How to imagine convolution/pooling on images with 3 color channels

I am a beginner and I understood the MNIST tutorials. Now I want to get something going on the SVHN dataset. In contrast to MNIST, it comes with 3 color channels. I am having a hard time visualizing how convolution and pooling work with the additional dimensionality of the color channels.
Does anyone have a good way to think about it, or a link for me?
I appreciate all input :)
This is very simple; the difference lies only in the first convolution:
in grey images, the input shape is [batch_size, H, W, 1], so your first convolution (let's say 3x3) has a filter of shape [3, 3, 1, 32] if you want to have 32 channels afterwards.
in RGB images, the input shape is [batch_size, H, W, 3], so your first convolution (still 3x3) has a filter of shape [3, 3, 3, 32].
In both cases, the output shape (with stride 1 and 'SAME' padding) is [batch_size, H, W, 32].
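One way to picture it: for each output channel, the filter spans all input channels at once, and the products over height, width and channel are summed into a single number. The plain-Python sketch below counts the values involved and computes one such output value. The helper names conv_summary and one_output_value are hypothetical.

```python
def conv_summary(filter_shape):
    """filter_shape is [filter_height, filter_width, in_channels, out_channels]."""
    kh, kw, in_ch, out_ch = filter_shape
    return {
        "inputs_per_output_value": kh * kw * in_ch,
        "weights": kh * kw * in_ch * out_ch,
    }

def one_output_value(patch, filt):
    """patch and filt are nested lists indexed [row][col][channel];
    returns the single number one filter produces at one position."""
    return sum(patch[h][w][c] * filt[h][w][c]
               for h in range(len(filt))
               for w in range(len(filt[0]))
               for c in range(len(filt[0][0])))

print(conv_summary([3, 3, 1, 32]))  # grey: 9 inputs per output value, 288 weights
print(conv_summary([3, 3, 3, 32]))  # RGB: 27 inputs per output value, 864 weights

# A 1x1 patch with three colour values collapses into one number.
print(one_output_value([[[1, 2, 3]]], [[[1, 1, 1]]]))  # 6
```

So the colour channels are not convolved separately in the first layer; they are mixed into each of the 32 output channels, and everything after that layer looks the same as in the grey-scale case.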

How to understand the "Densely Connected Layer" section in tensorflow tutorial

In the Densely Connected Layer section of the TensorFlow tutorial, it says the image size is 7 x 7 after it has been processed. I tried the code, and it seems these parameters work.
But I do not know how to get this 7 x 7 dimension. I understand that:
the original image is 28 x 28,
in the 1st conv layer, the max_pool_2x2 function will reduce both image dimensions by a factor of 4, so after the first pooling operation, the image size is 7 x 7
HERE'S WHAT I DO NOT UNDERSTAND
in the 2nd conv layer, there is another max_pool_2x2 function call, so I think the image size should be reduced by a factor of 4 again. But it actually is not.
Which step did I get wrong?
You also need to know the stride of the max pool and convolution.
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')
Here, we can see that the convolution has a stride of 1 and the max pool has a stride of 2. One way to look at max pool is that it takes a 2x2 box and slides it over the image, each time taking the maximum value over 4 pixels. With a stride of 2, it moves 2 pixels each time it steps! So each image dimension is reduced by a factor of 2, not 4.
In other words, a 28x28 picture with max pool 2x2 and stride 2 will become 14x14. Another max pool 2x2 with stride 2 will reduce it to 7x7.
To further illustrate my point, let's take the case of max pool 2x2 and stride 1. If we don't pad the image, it will become a 27x27 image after max pool.
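The sizes above can be checked with a small helper. This is a sketch of the usual output-size arithmetic for 'SAME' and 'VALID' padding, with a made-up function name output_size, and it does not call TensorFlow.

```python
import math

def output_size(n, window, stride, padding):
    """Output length along one spatial dimension for pooling or convolution."""
    if padding == "SAME":
        # The input is padded so that every stride position gets a window.
        return math.ceil(n / stride)
    if padding == "VALID":
        # No padding: only windows that fit entirely inside the input count.
        return (n - window) // stride + 1
    raise ValueError("padding must be 'SAME' or 'VALID'")

after_pool1 = output_size(28, window=2, stride=2, padding="SAME")           # 14
after_pool2 = output_size(after_pool1, window=2, stride=2, padding="SAME")  # 7
unpadded = output_size(28, window=2, stride=1, padding="VALID")             # 27
print(after_pool1, after_pool2, unpadded)  # 14 7 27
```

The convolutions in the tutorial use stride 1 with 'SAME' padding, so by the same formula they leave the 28x28 and 14x14 sizes unchanged.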
Here's an image for a more complete answer:
Take a look at Teach Yourself Deep Learning with TensorFlow and Udacity with Vincent Vanhoucke.
This is covered in the course. I am currently working through it.
The course is free, however you do have to sign up. It is a series of videos, quizzes and coding projects all self paced and self graded. I am learning a lot and enjoy it.
Here is one of the quizzes.