I am reading the "Deep MNIST for Experts" chapter of the TensorFlow tutorial.
It gives the function call below for the weights of the first layer. I can't understand why the patch size is 5x5 and why the number of features is 32. Are they arbitrary numbers that anyone can pick, or must some rules be followed? And is the number of features, "32", the "convolution kernel"?
W_conv1 = weight_variable([5, 5, 1, 32])
First Convolutional Layer
We can now implement our first layer. It will consist of convolution,
followed by max pooling. The convolutional will compute 32 features
for each 5x5 patch. Its weight tensor will have a shape of [5, 5, 1,
32]. The first two dimensions are the patch size, the next is the
number of input channels, and the last is the number of output
channels. We will also have a bias vector with a component for each
output channel.
The patch size and the number of features are network hyperparameters, therefore they are, in principle, arbitrary.
There are, however, rules of thumb to follow in order to define a network that works and performs well.
The kernel size should be small, because a stack of several small kernels covers the same receptive field as a single larger kernel while using fewer weights (it's an image processing topic and it's well explained in the VGG paper). In addition, operations with small filters are usually faster to execute.
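The trade-off between small and large kernels can be checked with a little arithmetic. The sketch below is only an illustration: the helper names and the channel count of 32 are assumptions made for the example, and biases are ignored. It compares two stacked 3x3 layers with a single 5x5 layer. Note that the two options match in receptive field, not in the exact function they compute, since the stacked version normally has a non-linearity in between.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field (pixels per side) of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weight_count(kernel_size, channels, num_layers=1):
    """Weights in num_layers square conv layers, each with `channels` inputs and outputs."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 32  # hypothetical channel count, chosen only for illustration

print(stacked_receptive_field(3, 2))      # 5: two 3x3 layers see a 5x5 patch
print(conv_weight_count(3, channels, 2))  # 18432 weights for two 3x3 layers
print(conv_weight_count(5, channels))     # 25600 weights for one 5x5 layer
```

With the same number of channels everywhere, the stacked 3x3 option needs 18 weights per channel pair against 25 for the 5x5 option.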
The number of features to extract (32 in your example) is likewise arbitrary, and finding the right number is something of an art.
Yes, both of them are hyperparameters, selected mostly arbitrarily for this tutorial. A lot of effort currently goes into finding appropriate kernel sizes, but for this tutorial it is not important.
The tutorial says:
The convolutional will compute 32 features for each 5x5 patch. Its weight tensor will have a shape of [5, 5, 1, 32]
The documentation for tf.nn.conv2d() says that the second parameter is your filter and has the shape [filter_height, filter_width, in_channels, out_channels]. So [5, 5, 1, 32] means that in_channels is 1: you have a greyscale image, so no surprises here.
32 means that during our learning phase, the network will try to learn 32 different kernels which will be used during the prediction. You can change this number to any other number as it is a hyperparameter that you can tune.
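To make the shape [5, 5, 1, 32] concrete, here is a small sketch that counts the trainable values it implies. The helper name conv_layer_param_count is made up for this illustration and is not part of TensorFlow.

```python
def conv_layer_param_count(filter_height, filter_width, in_channels, out_channels):
    """Return (weights, biases) for one 2-D convolutional layer."""
    weights = filter_height * filter_width * in_channels * out_channels
    biases = out_channels  # one bias per output channel, as in the tutorial
    return weights, biases


weights, biases = conv_layer_param_count(5, 5, 1, 32)
print(weights, biases)  # 800 32
```

So the first layer learns 32 kernels of 5x5x1 = 25 weights each, plus 32 biases. Changing 32 to another number only changes how many kernels are learned.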
In my ResNet32 network coded using TensorFlow, the input size is 32 x 32 x 3 and the output of the
layer is 32 x 32 x 32. Why are 32 channels used?
tf.contrib.layers.conv2d(
inputs,
num_outputs,  # how do I determine the number of channels to use in my layer?
kernel_size,
stride=1,
padding='SAME',
data_format=None,
rate=1,
activation_fn=tf.nn.relu,
normalizer_fn=None,
normalizer_params=None,
weights_initializer=initializers.xavier_initializer(),
weights_regularizer=None,
biases_initializer=tf.zeros_initializer(),
biases_regularizer=None,
reuse=None,
variables_collections=None,
outputs_collections=None,
trainable=True,
scope=None
)
Thanks in advance.
The 3 in the input is the number of color channels: the input image is RGB (a color image). If it were a grayscale (monochrome) image, it would be 1.
The 32 in the output is the number of filters in the layer, also called the number of features or output channels. Each filter looks at all 3 color channels and produces one feature map, so the 3-channel image is re-represented with 32 channels.
This helps the network learn a more complex and varied set of image features. For example, it can learn better edge detectors.
By setting stride=2 you can reduce the spatial size of the input tensor so that the height and width of the output tensor are half those of the input tensor. That is, if a tensor of shape (batch, 32, 32, 3) (3 is for the RGB channels) is fed to a convolution layer with 32 kernels/filters, stride=2 and padding='SAME', the output tensor will have shape (batch, 16, 16, 32). Alternatively, pooling is also widely used to reduce the output tensor size.
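The shape arithmetic can be sketched as follows, using the output-size rules TensorFlow documents for 'SAME' and 'VALID' padding. The function conv_output_shape and the kernel size of 3 are assumptions made for this example only.

```python
import math


def conv_output_shape(input_shape, num_filters, kernel_size, stride, padding="SAME"):
    """Output shape of a 2-D convolution on an NHWC input (batch may be None)."""
    batch, height, width, _ = input_shape
    if padding == "SAME":
        out_h = math.ceil(height / stride)
        out_w = math.ceil(width / stride)
    elif padding == "VALID":
        out_h = math.ceil((height - kernel_size + 1) / stride)
        out_w = math.ceil((width - kernel_size + 1) / stride)
    else:
        raise ValueError("padding must be 'SAME' or 'VALID'")
    # The number of output channels is simply the number of filters.
    return (batch, out_h, out_w, num_filters)


print(conv_output_shape((None, 32, 32, 3), 32, kernel_size=3, stride=1))  # (None, 32, 32, 32)
print(conv_output_shape((None, 32, 32, 3), 32, kernel_size=3, stride=2))  # (None, 16, 16, 32)
```

The channel dimension of the output never depends on the input channels; it is whatever num_outputs is set to.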
The ability to learn hierarchical representations by stacking conv layers is considered the key to the success of CNNs. In a CNN, as we go deeper the spatial size of the tensor shrinks whereas the number of channels increases, which helps to handle variations in the appearance of complex target objects. This reduction in spatial size drastically decreases the number of arithmetic operations and the computation time, while the network extracts the prominent features that contribute to the final output/decision. However, finding the optimal number of filters/kernels/output channels is time consuming and, therefore, people follow proven earlier architectures, e.g. VGG.
I am representing images of size 100px by 100px, so I can have the shape (None, 100, 100, 3) or shape (None, 10000, 3)
I can't find any clear explanation on Google. Will the following two tensor shapes give similar results?
(None, 100, 100, 3)
(None, 10000, 3)
I assume either is sufficient, as I would have thought the neural network will still learn just as well if the image is in a single row. Your thoughts?
For the 1st shape: (100, 100, 3)
This is a 3-dimensional tensor. Dense layers are normally fed a flat feature vector per sample, so an image in this shape would have to be flattened before being passed to them. Yes, 1D convolutional layers exist, but they are meant for quite different use cases.
A 2D convolutional layer slides a kernel over the image with a fixed stride and gathers spatial information. The resulting feature map is then pooled, so that the information is retained but at a lower spatial resolution.
Hence, learning with this shape would be far better, because spatial
features can be learned. This is excellent for image
classification.
For the 2nd shape: (10000, 3)
This is a 2-dimensional tensor and would work with 1D convolutional layers and Dense layers.
1D convolutions slide the kernel along only one axis. Also, the pixels of the image would be laid out in a single line (all the rows or columns would be lined up end to end), so pixels that are neighbours in the image can end up far apart. This destroys the spatial structure of the image.
Hence, in the end, an image is a 2D object and must be kept in its original dimensions to facilitate learning. A 1D tensor has other uses, like text classification, human activity recognition, etc.
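A quick way to see what flattening does to neighbourhoods is to look at the index arithmetic. The sketch below assumes a row-major reshape of the 100x100 image (the question does not say how the flattening is done), and the helper names are made up for the example.

```python
WIDTH = 100  # image width in pixels, from the question


def flat_index(row, col, width=WIDTH):
    """Position of pixel (row, col) after a row-major reshape to one line."""
    return row * width + col


def to_row_col(index, width=WIDTH):
    """Inverse of flat_index."""
    return divmod(index, width)


# Vertical neighbours in the image are 100 positions apart once flattened.
print(flat_index(51, 50) - flat_index(50, 50))  # 100

# A width-3 1-D kernel centred on the last pixel of row 50 mixes in
# the first pixel of row 51, which is on the opposite side of the image.
centre = flat_index(50, 99)
window = [to_row_col(i) for i in range(centre - 1, centre + 2)]
print(window)  # [(50, 98), (50, 99), (51, 0)]
```

A small 1-D kernel therefore never sees the pixel directly above or below, and at row boundaries it combines pixels that are not neighbours at all.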
The input dataset of my 1-layer CNN has 10 channels. If I set the number of filters to 16, then there will be 10*16=160 filters.
I want to use the same 16 filters' weights for each input channel, that is, only 16 filters for my input dataset, so that the 10 input channels share the same convolution filter weights.
Does anyone know how to do this in TensorFlow? Thanks a lot.
You could use the lower level tf.nn.conv1d with a filters arg constructed by tiling the same single-channel filters.
f0 = tf.get_variable('filters', shape=(kernel_width, 1, filters_out), initializer=...)
f_tiled = tf.tile(f0, (1, filters_in, 1))
output = tf.nn.conv1d(input, f_tiled, ...)
However, you would get the same effect (and it would be much more efficient and less error prone) by simply adding all your input channels together to form a single-channel input and then using the higher-level layers API.
conv_input = tf.reduce_sum(input, axis=-1, keepdims=True)
output = tf.layers.conv1d(conv_input, filters=...)
Note unless all your channels are almost equivalent, this is probably a bad idea. If you want to reduce the number of free parameters, consider multiple convolutions - a 1x1 to reduce the number of filters, other convolutions with wide kernels and non-linearities, then a 1x1 convolution to get back to a large number of filters. The reduce_sum in the above implementation is effectively a 1x1 convolution with fixed weights of tf.ones, and unless your dataset is tiny you'll almost certainly get a better result from learning the weights followed by some non-linearity.
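The claim that tiling one filter across all input channels equals summing the channels first follows from the linearity of convolution. The sketch below checks it in plain Python for a single output filter. The data, the kernel and the helper conv1d_valid are made up for the illustration, and no padding is used.

```python
def conv1d_valid(signal, kernel):
    """1-D cross-correlation of a single-channel signal with one kernel, no padding."""
    k = len(kernel)
    return [sum(signal[t + j] * kernel[j] for j in range(k))
            for t in range(len(signal) - k + 1)]


# Hypothetical input: 5 time steps, 3 channels per step.
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1], [2, 2, 2]]
kernel = [1, -1, 2]  # the single filter shared by every channel

# (a) Tiled filter: apply the same kernel to each channel, then add the results.
channels = [list(ch) for ch in zip(*x)]
per_channel = [conv1d_valid(ch, kernel) for ch in channels]
tiled = [sum(values) for values in zip(*per_channel)]

# (b) Sum the channels first, then convolve once.
summed = conv1d_valid([sum(step) for step in x], kernel)

print(tiled)   # [39, -5, 34]
print(summed)  # [39, -5, 34]
```

Both routes give the same output, but route (b) does one convolution instead of one per channel.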
Context:
Suppose we have a simple 3-layer feed-forward network. The hidden size of the first linear layer is 100000, i.e. W1 has shape [input_size, 100000], where input_size is much smaller than 100000. Some of the neurons won't be learning anything. I want to select and shut down these neurons using pruning.
Expected outcomes
After pruning the selected neurons, we will have a smaller network with fewer neurons in the first layer, say reduced to 500. And this smaller network turns out to have the same predictive capacity as the large one.
My implementation:
According to some criterion (some metrics applied to check weight similarities after each backpropagation update), I have cherry-picked the indices of the neurons I want to shut down, e.g., [1, 7, 8, ...].
Zero out the weights selected by these indices in W1, i.e. W1[:, [1, 7, 8, ...]] = 0, so that no information is passed forward via these neurons to the next layer.
Will that be enough? Should I be manually intervening in the backpropagation as well? Zeroing out neurons only stops computations passing forward, but for learning/updating the weights, backpropagation matters more. Since I am using PyTorch, it would be great if illustrations were provided in PyTorch; other frameworks like TensorFlow or Keras are also fine.
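One way to think about the two options is sketched below in plain Python rather than PyTorch, so it runs without any dependencies. It is not a full answer to the backpropagation question. It only shows, for a toy network with ReLU hidden units and no biases (both assumptions of this sketch), that zeroing the selected columns of W1 gives the same forward output as physically removing those hidden units, meaning the columns of W1 together with the matching rows of W2. All names and numbers are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def forward(x, w1, w2):
    """Toy two-layer network without biases: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def prune(w1, w2, drop):
    """Remove hidden units in `drop`: columns of w1 and the matching rows of w2."""
    keep = [j for j in range(len(w2)) if j not in drop]
    return [[row[j] for j in keep] for row in w1], [w2[j] for j in keep]


# Hypothetical tiny network: 2 inputs, 4 hidden units, 1 output.
w1 = [[1, -2, 3, 0], [0, 1, -1, 2]]
w2 = [[1], [2], [-1], [3]]
drop = {1, 3}  # indices of the hidden units to shut down
x = [[1, 2], [3, -1]]

# Option 1: zero the selected columns of w1 (the approach in the question).
zeroed_w1 = [[0 if j in drop else v for j, v in enumerate(row)] for row in w1]
out_zeroed = forward(x, zeroed_w1, w2)

# Option 2: build a genuinely smaller network.
small_w1, small_w2 = prune(w1, w2, drop)
out_pruned = forward(x, small_w1, small_w2)

print(out_zeroed)  # [[0], [-7]]
print(out_pruned)  # [[0], [-7]]
```

The design difference is that in the smaller network the removed weights no longer exist, so no later update can change them, whereas zeroed entries remain ordinary trainable parameters. If the hidden layer has a bias, the bias entries of the removed units need the same treatment, otherwise the two options stop being equivalent.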
There is a type of architecture that I would like to experiment with in TensorFlow.
The idea is to compose 2-D filter kernels by a combination of 1-D filters.
From the paper:
Simplifying ConvNets through Filter Compositions
The essence of our proposal consists of decomposing the ND kernels of a traditional network into N consecutive layers of 1D kernels.
...
We propose DecomposeMe which is an architecture consisting of decomposed layers. Each decomposed layer represents a N-D convolutional layer as a composition of 1D filters and, in addition, by including a non-linearity
φ(·) in-between.
...
Converting existing structures to decomposed ones is a straight forward process as
each existing ND convolutional layer can systematically be decomposed into sets of
consecutive layers consisting of 1D linearly rectified kernels and 1D transposed kernels
as shown in Figure 1.
If I understand correctly, a single 2-D convolutional layer is replaced with two consecutive 1-D convolutions?
Considering that the weights are shared and transposed, it is not clear to me how exactly to implement this in TensorFlow.
I know this question is old and you probably already figured it out, but it might help someone else with the same problem.
Separable convolution can be implemented in tensorflow as follows (details omitted):
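# Shorthand: placeholder, Variable, truncated_normal and relu stand for the tf.* functions; conv2 wraps a 2-D convolution.
# d: length of the 1-D kernels, K: intermediate feature maps, N: output feature maps.
X = placeholder(float32, shape=[None, 100, 100, 3])
v1 = Variable(truncated_normal([d, 1, 3, K], stddev=0.001))  # K column filters of shape [d, 1]
h1 = Variable(truncated_normal([1, d, K, N], stddev=0.001))  # N row filters of shape [1, d]
M1 = relu(conv2(conv2(X, v1), h1))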
Standard 2d convolution with a column vector is the same as convolving each column of the input with that vector. Convolution with v1 produces K feature maps (or an output image with K channels), which are then convolved with h1 to produce the final desired number of feature maps, N.
Weight sharing, to my knowledge, is simply a misleading term, meant to emphasize the fact that you use one filter that is convolved with each patch in the image. Obviously you're going to use the same filter to obtain the results for each output pixel, which is how everyone does it in image/signal processing.
Then, to "decompose" a convolution layer as shown on page 5, simply add activation units in between the convolutions (ignoring biases):
M1 = relu(conv2(relu(conv2(X, v1)), h1))
Note that each filter in v1 is a column vector [d,1], and each filter in h1 is a row vector [1,d]. The paper is a little vague, but when performing separable convolution, this is how it's done. That is, you convolve the image with the column vectors, then you convolve the result with the row vectors, obtaining the final result.
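The column-then-row idea can be checked numerically on a tiny single-channel image. The sketch below is plain Python with made-up data and a made-up helper, correlate2d_valid. It covers only the purely linear case with one column filter and one row filter (K = N = 1). Once a ReLU is placed between the two passes, as in the decomposed layer, the pair is no longer equal to a single 2-D kernel.

```python
def correlate2d_valid(image, kernel):
    """2-D cross-correlation without padding, on lists of rows."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


column = [[1], [2], [1]]  # a [3, 1] column filter
row = [[1, 0, -1]]        # a [1, 3] row filter
# The equivalent full 3x3 kernel is the outer product of the two vectors.
full = [[a[0] * b for b in row[0]] for a in column]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 8, 7, 6],
         [5, 4, 3, 2]]

two_pass = correlate2d_valid(correlate2d_valid(image, column), row)
one_pass = correlate2d_valid(image, full)

print(two_pass)  # [[-4, -4], [4, 4]]
print(one_pass)  # [[-4, -4], [4, 4]]
```

The two 1-D passes use 3 + 3 weights where the full kernel uses 9, which is where the savings of a decomposed layer come from.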