In my ResNet32 network, coded in TensorFlow, the input size is 32 x 32 x 3 and the output of the layer is 32 x 32 x 32. Why are 32 channels used?
tf.contrib.layers.conv2d(
inputs,
num_outputs,   # how do I determine the number of output channels to use in my layer?
kernel_size,
stride=1,
padding='SAME',
data_format=None,
rate=1,
activation_fn=tf.nn.relu,
normalizer_fn=None,
normalizer_params=None,
weights_initializer=initializers.xavier_initializer(),
weights_regularizer=None,
biases_initializer=tf.zeros_initializer(),
biases_regularizer=None,
reuse=None,
variables_collections=None,
outputs_collections=None,
trainable=True,
scope=None
)
Thanks in advance.
The 3 in the input is the number of color channels: the input image is RGB (a color image), so it has 3 channels; if it were a black-and-white (monochrome) image it would have 1.
The 32 in the output is the number of filters / feature maps / output channels you are using: each of the 32 kernels looks at all 3 input channels and produces one feature map, so the layer turns the 3-channel image into a 32-channel representation.
Using more channels helps the network learn a larger and more varied set of features from the image. For example, it can let the network learn better edge detectors.
By setting stride=2 you can reduce the spatial size of the input tensor, so that the height and width of the output tensor become half of those of the input. That means that if you feed an input tensor of shape (batch, 32, 32, 3) (3 for the RGB channels) into a convolution layer with 32 kernels/filters and stride=2, the output tensor will have shape (batch, 16, 16, 32). Alternatively, pooling is also widely used to reduce the output tensor size.
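A minimal sketch of that shape arithmetic, assuming TensorFlow 1.x (where tf.contrib is still available):

import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 32, 32, 3])   # (batch, 32, 32, 3), RGB input

# num_outputs=32 -> 32 output channels; stride=2 -> height and width are halved
conv = tf.contrib.layers.conv2d(inputs, num_outputs=32, kernel_size=3, stride=2, padding='SAME')

print(conv.shape)  # (?, 16, 16, 32)

With stride=1 the same call would keep the 32 x 32 spatial size and only change the channel count from 3 to 32.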
The ability to learn hierarchical representations by stacking conv layers is considered the key to the success of CNNs. As we go deeper in a CNN, the spatial size of the tensor decreases while the number of channels increases, which helps the network handle variation in the appearance of complex target objects. This reduction in spatial size drastically decreases the number of arithmetic operations and the computation time, while still extracting the prominent features that contribute to the final output/decision. However, finding the optimal number of filters/kernels/output channels is time-consuming, so people usually follow proven architectures, e.g. VGG.
Related
I am using the Qubvel segmentation models https://github.com/qubvel/segmentation_models with Keras to train on a medical binary segmentation problem. I am fine with training the model with input images and masks of spatial dimensions 256 x 224, 256 x 256, 512 x 480, 512 x 512, and other values, as long as the width and height are divisible by 32. Otherwise, the models do not train. What is the mathematical reason behind this rule that the input width and height must be divisible by 32?
The U-Net architecture (from the original paper) works with downsampling layers in the encoder and upsampling layers in the decoder, and it has 5 such levels. Choosing a multiple of 32 (2^5) makes sure that the downsampling and upsampling process results in the same resolution for the input and output, so that the loss can be calculated at the pixel level.
Having said that, if you want to make it work for a different input size, you just need to make sure that the decoder (by padding or other means) returns the same size as the corresponding encoder output at each intermediate level (for the skip connections), as well as matching the input size at the output image.
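A quick way to see the rule (just a sketch of the arithmetic, not code from the qubvel library): five stride-2 downsamplings divide each spatial dimension by 2^5 = 32, and the decoder doubles it back five times, so the round trip reproduces the original size only when it is divisible by 32:

def round_trip(size, depth=5):
    # Downsample `depth` times (integer division, like pooling), then upsample back.
    for _ in range(depth):
        size = size // 2
    for _ in range(depth):
        size = size * 2
    return size

print(round_trip(256))  # 256 -> divisible by 32, sizes line up
print(round_trip(250))  # 224 -> mismatch, so skip connections and the pixel-wise loss break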
I am representing images of size 100 px by 100 px, so I can use either the shape (None, 100, 100, 3) or the shape (None, 10000, 3).
I can't find any clear explanation on Google; however, will the following two tensor shapes give similar results?
(None, 100, 100, 3)
(None, 10000, 3)
I assume either is sufficient, as I would have thought the neural network will learn just as well if the image is laid out in a single row. Your thoughts?
For the 1st shape: (100, 100, 3)
This is a 3-dimensional tensor. If you are working with Dense layers, they expect flat, two-dimensional (batch x features) input. Yes, 1D convolutional layers exist, but they are meant for quite different use cases.
A 2D convolutional layer slides a kernel over the image in fixed strides and gathers spatial information. The result is then pooled so that the information is retained but with smaller dimensions.
Hence, learning with this shape will be far better, because spatial features can actually be learned. This is what you want for image classification.
For the 2nd shape: (10000, 3)
This is a 2-dimensional tensor and would work with 1D convolutional layers and Dense layers.
A 1D convolution passes its kernel along only one axis. The pixels of the image also get lined up along that single axis (all the columns end up concatenated), which destroys the spatial structure of the image.
Hence, an image is a 2D object and should be kept in its original dimensions to facilitate learning. 1D tensors have other uses, such as text classification or human activity recognition.
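To make the difference concrete, here is a small Keras sketch (the layer sizes are arbitrary, picked only for illustration) of the two input layouts:

import tensorflow as tf

# 2D input: Conv2D slides its kernel over height AND width, so neighbouring pixels
# in both directions are seen together.
model_2d = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu', input_shape=(100, 100, 3)),
    tf.keras.layers.MaxPooling2D(),
])

# Flattened input: Conv1D only slides along the single 10000-long axis, so pixels that
# were vertical neighbours in the image end up 100 positions apart.
model_1d = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, kernel_size=3, activation='relu', input_shape=(10000, 3)),
])

model_2d.summary()
model_1d.summary()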
How do I select the following: the size of the convolution filters, the strides, the pooling, and the densely connected layer?
There is no single answer to this question. This Reddit thread and this answer have some nice discussion. To quote the second post in the Reddit thread: "Start simple."
CelebA has a similar, maybe exactly the same, image size. When I was working with CelebA on a DCGAN project, I gently cropped and then reshaped the images to 64 x 64 x 3. My discriminator was a convolutional neural network with 4 convolutional layers and one fully connected layer. All conv layers had a 5 x 5 window and a 2 x 2 stride, SAME padding, and no pooling. The output channels per layer were 128 -> 256 -> 512 -> 1024, so the last conv layer output a 4 x 4 x 1024 tensor. My dense layer then had a weight matrix of size classes x 1024. (I had 1 class, since its purpose was to determine whether the input image came from the dataset or from the generator.)
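A rough Keras reconstruction of the discriminator described above (my sketch from the description, not the original code; the leaky-ReLU activation is my assumption, as the post does not name one):

import tensorflow as tf

discriminator = tf.keras.Sequential([
    tf.keras.layers.Conv2D(128, 5, strides=2, padding='same', activation=tf.nn.leaky_relu,
                           input_shape=(64, 64, 3)),                                         # 64x64x3 -> 32x32x128
    tf.keras.layers.Conv2D(256, 5, strides=2, padding='same', activation=tf.nn.leaky_relu),   # -> 16x16x256
    tf.keras.layers.Conv2D(512, 5, strides=2, padding='same', activation=tf.nn.leaky_relu),   # -> 8x8x512
    tf.keras.layers.Conv2D(1024, 5, strides=2, padding='same', activation=tf.nn.leaky_relu),  # -> 4x4x1024
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),   # 1 output: real image vs. generated image
])
discriminator.summary()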
That relatively simple architecture got good results, but it was intentionally built not to overpower the generator. If you're after pure classification, you may want a deeper architecture. You might not want to crop as aggressively as I did; then you can include more conv layers before the fully connected layer. You may want to use a 3 x 3 window with a 1 x 1 stride and use pooling, although I see architectures abandoning pooling in favor of a larger stride. If your dataset is small, it is prone to overfitting. Having fewer weights helps combat this when dropout isn't enough, and that means fewer output channels per layer.
There are a lot of possibilities when choosing an architecture, and there is no hard-and-fast rule for the best architecture. Remember to start simple.
Is the Convolution symbol computed cyclically, i.e., does it assume that the padded input symbol is periodic in all dimensions?
More specifically, if I've got an input symbol of dimensions 1x3xHxW, representing an RGB image, and I define a convolution operating on it as below:
conv1 = mxnet.symbol.Convolution(data=input, kernel=(3, 5, 5), pad=(0, 2, 2)...
what will the trained filter look like? I expect it to be composed of linear combinations of 2-D filters operating on each of the color channels R, G, B.
Am I correct?
It turns out that convolutions in mxnet are effectively 3D: the first two dimensions cover the image coordinates, while the third dimension covers the depth, i.e., the dimension of the feature space. For an RGB image at the input layer the depth is 3 (unless it is a grayscale image, which has depth == 1). For any other layer, the depth is the number of features.
The convolution spans the entire depth dimension, so every feature of the current layer can affect any feature of the following layer through the linear combinations learned to optimize detection accuracy.
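A small NumPy sketch (purely illustrative, not mxnet code) of how a single filter covers the full depth: its weights have shape (in_channels, kH, kW), and each output value is a sum over every input channel rather than a per-channel operation:

import numpy as np

in_channels, kH, kW = 3, 5, 5
image_patch = np.random.rand(in_channels, kH, kW)     # one 5x5 patch of an RGB image
filter_weights = np.random.rand(in_channels, kH, kW)  # one filter spans the full depth

# One output activation = dot product over ALL channels and spatial positions.
output_value = np.sum(image_patch * filter_weights)
print(output_value)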
While reading the chapter "Deep MNIST for Experts" in the TensorFlow tutorial, I came across the function below for the weights of the first layer. I can't understand why the patch size is 5*5 and why the number of features is 32. Are they arbitrary numbers that can be picked freely, or are there rules that must be followed? And is the feature number "32" the number of convolution kernels?
W_conv1 = weight_variable([5, 5, 1, 32])
First Convolutional Layer
We can now implement our first layer. It will consist of convolution, followed by max pooling. The convolution will compute 32 features for each 5x5 patch. Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels. We will also have a bias vector with a component for each output channel.
The patch size and the number of features are network hyper-parameters, and are therefore completely arbitrary.
There are, however, rules of thumb to follow in order to define a network that works and performs well.
The kernel size should be small, because a stack of small kernels covers the same receptive field as a single larger kernel while using fewer parameters (this is an image-processing topic and it is well explained in the VGG paper); see the parameter count below. In addition, operations with small filters are faster to execute.
The number of features to extract (32 in your example) is completely arbitrary, and finding the right number is somewhat of an art.
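To illustrate the small-kernel rule of thumb (the channel count below is arbitrary): two stacked 3x3 convolutions see the same 5x5 receptive field as one 5x5 convolution, but with fewer weights and an extra non-linearity in between:

C = 64  # example: same number of input and output channels

params_one_5x5 = 5 * 5 * C * C          # one 5x5 layer:         102400 weights
params_two_3x3 = 2 * (3 * 3 * C * C)    # two stacked 3x3 layers: 73728 weights

print(params_one_5x5, params_two_3x3)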
Yes, both of them are hyperparameters, selected mostly arbitrarily for this tutorial. A lot of effort currently goes into finding appropriate kernel sizes, but for this tutorial it is not important.
The tutorial says:
The convolution will compute 32 features for each 5x5 patch. Its weight tensor will have a shape of [5, 5, 1, 32]
The documentation of tf.nn.conv2d() says that its second parameter represents your filter and has the form [filter_height, filter_width, in_channels, out_channels]. So [5, 5, 1, 32] means that in_channels is 1: you have a greyscale image, so no surprises here.
32 means that during the learning phase the network will try to learn 32 different kernels, which are then used during prediction. You can change this number to any other value, as it is a hyperparameter that you can tune.
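A minimal sketch (TF 1.x style, to match the tutorial) showing how that [5, 5, 1, 32] weight tensor is used and that 32 is simply the number of output channels you chose:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])                        # greyscale MNIST images
W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))    # 5x5 patch, 1 input channel, 32 output channels

h_conv1 = tf.nn.conv2d(x, W_conv1, strides=[1, 1, 1, 1], padding='SAME')
print(h_conv1.shape)  # (?, 28, 28, 32) -- change 32 to 64 and the layer simply learns 64 kernels instead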