How is the image reduced to 7x7 by TensorFlow?

I'm reading the tutorial Deep MNIST for Experts. At the start of the section Densely Connected Layer, it says that "[...] the image size has been reduced to 7x7".
I can't seem to figure out how they get to this 7x7 matrix. To my understanding, we start at 28x28 and have two layers of 5x5 convolution kernels. 28 divided by 4 is 7, but 28 isn't divisible by 5.

5x5 is the "window" size for the convolution layer. It does not reduce the image size: TensorFlow and Caffe, among others, automatically supply the border padding. Torch, to name one, requires you to add that border yourself (2 positions in each direction, in this case).
Each kernel (filter) considers a 5x5 subset of the entire image. For instance, to compute the value for position [7, 12], the convolution considers the "window" [5:9, 10:14], i.e. rows 5-9 and columns 10-14. It multiplies each of these 25 values by its corresponding weight and sums the products. That sum becomes the value at the center position [7, 12] in the next layer.
This process repeats for every position in the image, and for each kernel in the layer.
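Here is a tiny sketch of that single-position computation (assuming NumPy; the random image and weights are just hypothetical placeholders):
import numpy as np

image = np.random.rand(28, 28)
weights = np.random.rand(5, 5)                      # one 5x5 kernel

row, col = 7, 12
window = image[row - 2:row + 3, col - 2:col + 3]    # the 5x5 window centered on [7, 12]
value = (window * weights).sum()                    # value at [7, 12] in the next layer
print(value)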
As @Aenimated1 already mentioned, the size reduction comes from two 2x poolings. Each pooling divides the image into 2x2 windows and passes along the maximum value of each window (or another statistic, if the user specifies one). This reduces the 28x28 image to 14x14; the second pooling reduces it to 7x7.

The reduction in the "image size" is the result of the pooling layers added after each convolutional layer. Each 2x2 pooling decreases the width and height by a factor of 2, thus yielding a 7x7 matrix after both pooling ops.
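To make the shape progression concrete, here is a minimal sketch (assuming TensorFlow 2.x / Keras rather than the tutorial's original tf 1.x code; the 32 and 64 feature maps follow the tutorial):
import tensorflow as tf

x = tf.zeros([1, 28, 28, 1])                                               # one 28x28 grayscale image
x = tf.keras.layers.Conv2D(32, 5, padding='same', activation='relu')(x)    # 'SAME' padding: still 28x28
x = tf.keras.layers.MaxPool2D(2)(x)                                        # 28x28 -> 14x14
x = tf.keras.layers.Conv2D(64, 5, padding='same', activation='relu')(x)    # still 14x14
x = tf.keras.layers.MaxPool2D(2)(x)                                        # 14x14 -> 7x7
print(x.shape)                                                             # (1, 7, 7, 64)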


Dimensions of a convolution?

I have some questions regarding how this convolution is calculated and its output dimensions. I'm familiar with simple convolutions with an n x m kernel, and strides, dilations or padding are not a problem, but these dimensions seem odd to me. Since the model I'm using is the well-known onnx-mnist, I assume it is correct.
So, my point is:
If the input has dimensions of 1x1x28x28, how is the output 1x8x28x28?
W denotes the kernel. How can it be 8x1x5x5? As far as I know, the first dimension is the batch size, but here I'm just doing inference with 1 input. Does this make sense?
I'm implementing this convolution operator from scratch, and so far it works for a 1x1x28x28 input and a 1x1x5x5 kernel, but that extra dimension doesn't make sense to me.
Find attached the convolution that I'm trying to do; I hope it is not too ONNX-specific.
I do not see the code you are using, but I guess 8 is the number of kernels. This means you apply 8 different 5x5 kernels to your input over a batch size of 1. That is how you get 1x8x28x28 in the output: the 8 denotes the number of activation maps (one for each kernel).
The numbers of your kernel dimensions (8x1x5x5) explained (a shape check follows the list):
8: Number of different filters/kernels (this will be the number of output maps per image)
1: Number of input channels. If your input image was RGB instead of grayscale, this would be 3 instead of 1.
5: First spatial dimension
5: Second spatial dimension
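A quick way to sanity-check those shapes (assuming PyTorch's F.conv2d as a stand-in for the ONNX Conv node, with padding 2, which is what keeps the 28x28 spatial size):
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 28, 28)      # NCHW input, as in the ONNX graph
w = torch.zeros(8, 1, 5, 5)        # OIHW weight: 8 filters, 1 input channel, 5x5 each
y = F.conv2d(x, w, padding=2)      # pads=2 preserves the 28x28 spatial size
print(y.shape)                     # torch.Size([1, 8, 28, 28])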

How does the conv2D function change the input layer?

In my ResNet32 network coded using TensorFlow, the input size is 32 x 32 x 3 and the output of the layer is 32 x 32 x 32. Why are 32 channels used?
tf.contrib.layers.conv2d(
    inputs,
    num_outputs,   # how to determine the number of channels to be used in my layer?
    kernel_size,
    stride=1,
    padding='SAME',
    data_format=None,
    rate=1,
    activation_fn=tf.nn.relu,
    normalizer_fn=None,
    normalizer_params=None,
    weights_initializer=initializers.xavier_initializer(),
    weights_regularizer=None,
    biases_initializer=tf.zeros_initializer(),
    biases_regularizer=None,
    reuse=None,
    variables_collections=None,
    outputs_collections=None,
    trainable=True,
    scope=None
)
Thanks in advance.
The 3 in the input indicates that the input image is RGB (a color image); these are the color channels. If it were a black and white image, this would be 1 (a monochrome image).
The 32 in the output represents the number of neurons/features/channels you are using, so you are re-representing the 3-channel color image with 32 feature channels.
This helps the network learn a more complex and varied set of features of the image. For example, it can help the network learn better edges.
By setting stride=2 you can reduce the spatial size of the input tensor so that the height and width of the output tensor become half of those of the input. That means if your input tensor of shape (batch, 32, 32, 3) (3 for the RGB channels) goes into a convolution layer that has 32 kernels/filters and stride=2, the shape of the output tensor will be (batch, 16, 16, 32). Alternatively, pooling is also widely used to reduce the output tensor size.
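A minimal sketch of both effects (assuming tf.keras in place of the deprecated tf.contrib.layers; the 3x3 kernel size is an arbitrary choice for illustration):
import tensorflow as tf

x = tf.zeros([1, 32, 32, 3])                                        # (batch, height, width, RGB)
same = tf.keras.layers.Conv2D(32, 3, strides=1, padding='same')(x)  # 32 filters, stride 1
half = tf.keras.layers.Conv2D(32, 3, strides=2, padding='same')(x)  # 32 filters, stride 2
print(same.shape)                                                   # (1, 32, 32, 32)
print(half.shape)                                                   # (1, 16, 16, 32)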
The ability to learn hierarchical representations by stacking conv layers is considered the key to the success of CNNs. In a CNN, as we go deeper the spatial size of the tensor shrinks while the number of channels grows, which helps handle variations in the appearance of complex target objects. This reduction of spatial size also drastically decreases the required number of arithmetic operations and the computation time, while still extracting the prominent features that contribute to the final output/decision. However, finding the optimal number of filters/kernels/output channels is time consuming, so people usually follow proven earlier architectures such as VGG.

How to train a classifier whose inputs contain multi-dimensional feature values

I am trying to model a classifier that takes multi-dimensional features as input. Does anyone know of a dataset that contains multi-dimensional features?
Let's say, for example, that in the MNIST data we have the pixel location as the feature and the feature value is a one-dimensional greyscale value that ranges from 0 to 255. But if we consider a color image, a single greyscale value is not sufficient: we would still take the pixel location as the feature, but the feature value would be three-dimensional (R (0-255) as one dimension, G (0-255) as the second, and B (0-255) as the third). In this case, how can one solve it using a feed-forward neural network?
Small suggestions are also accepted.
The same way.
If you plug the pixels into your network directly, just reshape the tensor to have length H*W*3.
If you use convolutions, note that the last parameter is the number of input/output channels. Just make sure the first convolution uses 3 as its number of input channels.
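A short sketch of both options (assuming tf.keras and a hypothetical 32x32 RGB input):
import tensorflow as tf

rgb = tf.zeros([1, 32, 32, 3])                             # (batch, H, W, 3 color channels)

flat = tf.keras.layers.Flatten()(rgb)                      # shape (1, 32*32*3) for a dense net
logits = tf.keras.layers.Dense(10)(flat)

conv = tf.keras.layers.Conv2D(16, 5, padding='same')(rgb)  # first conv infers 3 input channels
print(flat.shape, conv.shape)                              # (1, 3072) (1, 32, 32, 16)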

Is a Tensorflow 3d convolution layer with kernel depth equal to the input depth equivalent to a 2d convolution layer?

To further explain the title, I am passing a series of single channel pictures into a convolution network and am comparing and contrasting conv3d versus conv2d. There are two possible setups I'm considering:
Setup 1 uses a conv2d layer with each picture input as a single channel. Input dimensions: [batch_size, width, height, num_pictures]. Kernel dimensions: [width, height]. Stride: [1, 1]. Valid padding.
Setup 2 uses a conv3d layer with the pictures as the "depth" component of the kernel. Input dimensions: [batch_size, num_pictures, width, height, 1]. Kernel dimensions: [num_pictures, width, height]. Stride: [1, 1, 1]. Valid padding.
The way I see it, 2d convolutions consider all channels of a given input, so is there functionally any difference between the two setups above, pragmatically and performance-wise?
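For what it's worth, here is a hedged sketch (assuming TensorFlow 2.x / Keras and hypothetical sizes: 4 pictures of 28x28, 8 filters) that compares the two setups' output shapes and parameter counts:
import tensorflow as tf

num_pics, h, w = 4, 28, 28

# Setup 1: the pictures as channels of a 2-D convolution.
x2d = tf.zeros([1, h, w, num_pics])
conv2d = tf.keras.layers.Conv2D(8, (5, 5), padding='valid')
y2d = conv2d(x2d)

# Setup 2: the pictures along the depth axis of a 3-D convolution whose kernel spans them all.
x3d = tf.zeros([1, num_pics, h, w, 1])
conv3d = tf.keras.layers.Conv3D(8, (num_pics, 5, 5), padding='valid')
y3d = conv3d(x3d)

print(y2d.shape, conv2d.count_params())   # (1, 24, 24, 8) 808
print(y3d.shape, conv3d.count_params())   # (1, 1, 24, 24, 8) 808 -- same weight count, extra depth axis of size 1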

Is mxnet.symbol.Convolution cyclic?

Is the Convolution symbol computed cyclically, i.e., does it assume that the padded input symbol is periodic in all dimensions?
More specifically, if I've got an input symbol of dimensions 1x3xHxW, representing an RGB image, and I define a convolution operating on it as below:
conv1 = mxnet.symbol.Convolution(data=input, kernel=(3, 5, 5), pad=(0, 2, 2)...
what will the trained filter look like? I expect it to be composed of linear combinations of 2-D filters operating on each of the color channels R, G, B.
Am I correct?
It turns out that convolutions in mxnet are 3-D: the first two dimensions cover the image coordinates, while the third dimension covers the depth, i.e., the dimension of the feature space. For an RGB image at the input layer the depth is 3 (unless it is a grayscale image, which has depth == 1). For any other layer, the depth is the number of features.
The convolution across the depth dimension is of course cyclical, such that all features of the current layer can affect any feature of the following layer by finding linear combinations that optimize the detection precision.
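A small sketch (assuming the classic mxnet Symbol API and a hypothetical 32x32 RGB input) that makes the depth handling concrete: a plain 2-D Convolution already spans all input channels, so its learned weight has shape (num_filter, 3, kernel_h, kernel_w).
import mxnet as mx

data = mx.symbol.Variable('data')
conv1 = mx.symbol.Convolution(data=data, kernel=(5, 5), pad=(2, 2), num_filter=8)

arg_shapes, out_shapes, _ = conv1.infer_shape(data=(1, 3, 32, 32))
print(conv1.list_arguments())   # data plus an auto-named weight and bias
print(arg_shapes)               # [(1, 3, 32, 32), (8, 3, 5, 5), (8,)]
print(out_shapes)               # [(1, 8, 32, 32)]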