I'm following Udacity Deep Learning video by Vincent Vanhoucke and trying to understand the (practical or intuitive or obvious) effect of max pooling.
Let's say my current model (without pooling) uses convolutions with stride 2 to reduce the dimensionality.
def model(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases
Now I introduced pooling: Replace the strides by a max pooling operation (nn.max_pool()) of stride 2 and kernel size 2.
def model(data):
    conv1 = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME')
    bias1 = tf.nn.relu(conv1 + layer1_biases)
    pool1 = tf.nn.max_pool(bias1, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    conv2 = tf.nn.conv2d(pool1, layer2_weights, [1, 1, 1, 1], padding='SAME')
    bias2 = tf.nn.relu(conv2 + layer2_biases)
    pool2 = tf.nn.max_pool(bias2, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    shape = pool2.get_shape().as_list()
    reshape = tf.reshape(pool2, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases
What would be a compelling reason to use the latter model instead of the no-pool model, besides the improved accuracy? I would love some insights from people who have already used CNNs many times!
Both approaches (strides and pooling) reduce the dimensionality of the input (for stride/pooling size > 1). This by itself is a good thing, because it reduces computation time and the number of parameters, and it helps prevent overfitting.
They achieve it in different ways:
You can think of strides as downsampling the result of the stride-1 convolution by just taking every s-th result.
Max-pooling downsamples the result by taking the maximum value from each pooling window. If an important feature has been found, max-pooling preserves it regardless of its position within that window (see the sketch below).
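To make the distinction concrete, here is a minimal sketch (TF 1.x style, with a made-up 8x8 single-channel input and one 3x3 filter; 'VALID' padding is used for the convolutions so the two routes line up exactly):

import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(1, 8, 8, 1), dtype=tf.float32)   # NHWC input
w = tf.constant(np.random.rand(3, 3, 1, 1), dtype=tf.float32)   # one 3x3 filter

# Route 1: convolution with stride 2
conv_strided = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='VALID')

# Route 2: stride-1 convolution followed by 2x2 max pooling with stride 2
conv_full = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='VALID')
pooled = tf.nn.max_pool(conv_full, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

with tf.Session() as sess:
    full, strided, p = sess.run([conv_full, conv_strided, pooled])

# The strided convolution keeps every 2nd value of the stride-1 result...
print(np.allclose(strided, full[:, ::2, ::2, :]))  # True
# ...while max pooling keeps the largest value of each 2x2 window instead.
print(strided.shape, p.shape)                      # (1, 3, 3, 1) (1, 3, 3, 1)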
You also mentioned "besides the improved accuracy". But almost everything people do in machine learning is done to improve the accuracy (or some other loss function). So if tomorrow someone shows that sum-square-root pooling achieves the best results on many benchmarks, a lot of people will start to use it.
In a classification task, improving the accuracy is the goal.
However, pooling allows you to:
Reduce the input dimensionality
Force the network to learn particular features, depending on the type of pooling you apply.
Reducing the input dimensionality is something you want because it forces the network to project its learned representations into a different, lower-dimensional space. This is good computationally speaking, because you have to allocate less memory and can therefore use bigger batches. But it is also desirable because high-dimensional spaces usually have a lot of redundancy, and are spaces in which all objects appear to be sparse and dissimilar (see "The curse of dimensionality").
The function you decide to use for the pooling operation, moreover, can force the network to give more importance to some features.
Max-pooling, for instance, is widely used because it allows the network to be robust to small variations of the input image.
What happens in practice is that only the features with the highest activations pass through the max-pooling gate.
If the input image is shifted by a small amount, the max-pooling op still produces the same output even though the input has shifted (the maximum shift it can tolerate is thus on the order of the kernel size).
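As a tiny, hypothetical illustration of that tolerance to shifts (a single "feature" activation at position (0, 0) versus the same activation shifted one pixel to (0, 1)):

import numpy as np
import tensorflow as tf

# one bright activation, then the same activation shifted right by one pixel
a = np.zeros((1, 4, 4, 1), dtype=np.float32); a[0, 0, 0, 0] = 1.0
b = np.zeros((1, 4, 4, 1), dtype=np.float32); b[0, 0, 1, 0] = 1.0

pool = lambda t: tf.nn.max_pool(tf.constant(t), ksize=[1, 2, 2, 1],
                                strides=[1, 2, 2, 1], padding='SAME')

with tf.Session() as sess:
    pa, pb = sess.run([pool(a), pool(b)])

print(np.array_equal(pa, pb))  # True: the one-pixel shift stays inside the 2x2 window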
CNNs without pooling are also capable of learning this kind of feature, but at a bigger cost in terms of parameters and computing time (see "Striving for Simplicity: The All Convolutional Net").
Related
I am using the MNIST data and loading a single image for a CNN; I wanted to see what the image looks like after a single layer. I have gone through the documentation to see if there were any errors with my inputs or with how I am consolidating the data. Is the code wrong in any way, or is it just my computer?
# x is the input placeholder (e.g. shape [None, 28, 28, 1]); X_train holds the loaded MNIST images
# a single 5x5 filter with one input channel and one output channel
filtersw = tf.Variable(tf.random_normal(shape=[5, 5, 1, 1], mean=0.5, stddev=0.01))
filtersb = tf.Variable(tf.zeros(1))
conv = tf.nn.conv2d(x, filtersw, strides=[1, 1, 1, 1], padding='VALID') + filtersb

sess = tf.Session()
sess.run(tf.global_variables_initializer())
afterimage = sess.run(conv, feed_dict={x: X_train[0:1]})
You need to call tf.layers.conv2d instead of tf.nn.conv2d. The two functions share the name conv2d, but they perform different operations: tf.layers.conv2d creates a convolution layer in a CNN (it builds its own filter variables), while tf.nn.conv2d performs convolution with a known filter or set of filters that you supply.
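For illustration, a rough sketch of the two call styles (TF 1.x; the [None, 28, 28, 1] placeholder shape is an assumption matching MNIST):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])

# tf.layers.conv2d builds its own (trainable) filter variables for you
layer_out = tf.layers.conv2d(x, filters=1, kernel_size=5, padding='valid')

# tf.nn.conv2d expects you to supply the filter tensor yourself
filtersw = tf.Variable(tf.random_normal([5, 5, 1, 1], mean=0.5, stddev=0.01))
nn_out = tf.nn.conv2d(x, filtersw, strides=[1, 1, 1, 1], padding='VALID')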
I want to know how much better striding is compared to pooling.
My current code looks like this:
w = tf.get_variable('w', [k_h, k_w, output_shape[-1], input_.get_shape()[-1]],
                    initializer=tf.random_normal_initializer(stddev=stddev))
deconv = tf.nn.conv2d_transpose(input_, w, output_shape=output_shape, strides=[1, d_h, d_w, 1])
Would the code underneath be more or less equivalent to the code above?
tf.layers.max_pooling2d(input_, pool_size=2, strides=[d_h, d_w], padding='same')
The two are different:
tf.nn.conv2d_transpose is used for upsampling
tf.layers.max_pooling2d is used for downsampling
tf.nn.conv2d_transpose takes a lower-resolution image and scales it up to a higher resolution
tf.layers.max_pooling2d takes a higher-resolution image and scales it down to a lower resolution
You cannot replace a tf.nn.conv2d_transpose layer with tf.layers.max_pooling2d because the two are used for completely opposite purposes.
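A quick shape check (TF 1.x, with a hypothetical 1x8x8x3 input) makes the opposite directions visible:

import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(1, 8, 8, 3), dtype=tf.float32)

# downsampling: 2x2 max pooling with stride 2 halves height and width
down = tf.layers.max_pooling2d(x, pool_size=2, strides=2, padding='same')

# upsampling: a stride-2 transposed convolution doubles height and width
w = tf.Variable(tf.random_normal([3, 3, 3, 3], stddev=0.02))
up = tf.nn.conv2d_transpose(x, w, output_shape=[1, 16, 16, 3],
                            strides=[1, 2, 2, 1], padding='SAME')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    d, u = sess.run([down, up])

print(d.shape, u.shape)  # (1, 4, 4, 3) (1, 16, 16, 3)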
TensorFlow implements a basic convolution operation with tf.nn.conv2d.
I am specifically interested in the "strides" parameter, which lets you set the stride of the convolution filter -- how far across the image you shift the filter each time.
The example given in one of the early tutorials, with an image stride of 1 in each direction, is
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
The strides array is explained more in the linked docs:
In detail, with the default NHWC format...
Must have strides[0] = strides[3] = 1. For the most common case of the same horizontal and vertical strides, strides = [1, stride, stride, 1].
Note the order of "strides" matches the order of inputs: [batch, height, width, channels] in the NHWC format.
Obviously having a stride of not 1 for batch and channels wouldn't make sense, right? (your filter should always go across every batch and every channel)
But why is it even an option to put something other than 1 in strides[0] and strides[3], then? (where it being an "option" is in regards to the fact that you could put something other than 1 in the python array you pass in, disregarding the documentation quote above)
Is there a situation where I would have a non-one stride for the batch or channels dimension, e.g.
tf.nn.conv2d(x, W, strides=[2, 1, 1, 2], padding='SAME')
If so, what would that example even mean in terms of the convolution operation?
There might be a situation where you send a video in chunks, so your batch is a sequence of frames. Assuming that neighbouring frames are quite similar, you could omit some of them by increasing the batch stride. That is as far as I understand it; I don't know about the channel stride, though.
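As a hedged sketch of that idea (TF 1.x, made-up shapes): since the documentation quoted above requires strides[0] = strides[3] = 1, skipping frames in a batch is usually done by slicing the batch yourself rather than through the stride argument:

import numpy as np
import tensorflow as tf

frames = tf.constant(np.random.rand(8, 32, 32, 3), dtype=tf.float32)  # a chunk of 8 frames
w = tf.Variable(tf.random_normal([3, 3, 3, 16], stddev=0.05))

every_other = frames[::2]  # keep frames 0, 2, 4, 6 (a "batch stride" of 2)
conv = tf.nn.conv2d(every_other, w, strides=[1, 2, 2, 1], padding='SAME')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(conv).shape)  # (4, 16, 16, 16)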
I was trying to train a sequence-to-sequence LSTM model with a dataset with three labels: [1, 0] for detection of class 1, [0, 1] for detection of class 2, and [0, 0] for detection of nothing. After getting the outputs from the LSTM network, I applied a fully connected layer to each cell's output the following way:
outputs, state = tf.nn.dynamic_rnn(cell, input)
# Shape of outputs is [batch_size, n_time_steps, n_hidden]
# As matmul works only on matrices, reshape to get the
# time dimension into the batch dimension
outputs = tf.reshape(outputs, [-1, n_hidden])
# Shape is [batch_size * n_time_steps, n_hidden]
w = tf.Variable(tf.truncated_normal(shape=[n_hidden, 2], stddev=0.1))
b = tf.Variable(tf.constant(0.1, shape=[2]))
logit = tf.add(tf.matmul(outputs, w), b, name='logit')
# Reshape back to [batch_size, n_time_steps, 2]
logit = tf.reshape(logit, [batch_size, -1, 2])
On the output, I apply tf.nn.sigmoid_cross_entropy_with_logits and reduce the mean. The model seems to work just fine, achieving high accuracy and recall, except for the fact that in almost all cases it outputs either [0, 0] or [1, 1]. The two logit outputs from the fully connected layer always have very similar values (but not the same). This effectively puts a hard cap of 50% on precision, which the model converges to (but not a fraction of a percent above).
Now, my intuition would tell me that something must be wrong with the training step and both fully connected outputs are trained on the same data, but curiously enough when I replace my own implementation with the prepackaged one from tf.contrib:
outputs, state = tf.nn.dynamic_rnn(cell, input)
logit = tf.contrib.layers.fully_connected(outputs, 2, activation_fn=None)
without changing a single other thing, the model starts training properly. Now, the obvious solution would be to just use that implementation, but why doesn't the first one work?
In the CIFAR-10 example, conv2 is defined as follows. How do I know that shape=[5, 5, 64, 64] in kernel = _variable_with_weight_decay should be given those values, i.e., 5, 5, 64, 64? In addition, in biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1)), the shape is defined as [64]; how do I get those values?
# conv2
with tf.variable_scope('conv2') as scope:
    kernel = _variable_with_weight_decay('weights',
                                         shape=[5, 5, 64, 64],
                                         stddev=5e-2,
                                         wd=0.0)
    conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1))
    bias = tf.nn.bias_add(conv, biases)
    conv2 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv2)
Looking at the source, we see that a call to _variable_with_weight_decay boils down to a tf.get_variable call. We are retrieving a weight tensor (creating one if it doesn't already exist).
In a Convolutional Neural Network, the weight tensor defines a mapping from one layer to the next, but it differs from a vanilla NN. The convolution implies you are applying a convolutional filter to your input as you map from one layer to the next. This filter is defined by hyper-parameters, which are the ones fed into shape.
There are four parameters fed into shape; the first two relate to the size of the convolution filter. In this case we have a 5x5 filter. The third parameter defines the input depth, which must match the depth of the previous layer's output (for the first convolution it is the 3 color channels of the input image):
kernel = _variable_with_weight_decay('weights',
                                     shape=[5, 5, 3, 64],
                                     stddev=5e-2,
                                     wd=0.0)
The fourth parameter defines the output depth of the tensor, i.e. the number of filters.
The bias is a perturbation to the system, used for better learning. The bias is added to the output of the convolution; by basic linear-algebra/broadcasting rules, the bias vector must have one entry per output channel, which in this case is 64.
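A small shape check (TF 1.x; the 12x12x64 activation shape is only an assumption of roughly what norm1 looks like in the CIFAR-10 example, and the depth of 64 is what matters here):

import numpy as np
import tensorflow as tf

norm1 = tf.constant(np.random.rand(128, 12, 12, 64), dtype=tf.float32)
kernel = tf.Variable(tf.truncated_normal([5, 5, 64, 64], stddev=5e-2))
biases = tf.Variable(tf.constant(0.1, shape=[64]))

conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
bias = tf.nn.bias_add(conv, biases)  # one bias value per output channel, hence 64

print(conv.get_shape().as_list())    # [128, 12, 12, 64]: last dim = kernel's 4th entry
print(biases.get_shape().as_list())  # [64]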
Cheers!