What is MobileNetv1 depth_multiplier? - tensorflow

Referring to the TensorFlow MobileNetV1 model: https://github.com/tensorflow/models/blob/9f7a5fa353df0ee2010f8e7a5494ca6b188af8bc/research/slim/nets/mobilenet_v1.py#L171
The param depth_multiplier is documented as:
depth_multiplier: Float multiplier for the depth (number of channels)
for all convolution ops. The value must be greater than zero. Typical
usage will be to set this value in (0, 1) to reduce the number of
parameters or computation cost of the model
But in the paper, they mention two types of multipliers: the width multiplier and the resolution multiplier. So which one corresponds to depth_multiplier?
The Keras documentation says:
depth_multiplier: depth multiplier for depthwise convolution (also
called the resolution multiplier)
I'm so confused!

As described in the paper:
The role of the width multiplier α is to thin a network uniformly at each layer. For a given layer and width multiplier α, the number of input channels M becomes αM and the number of output channels N becomes αN.
The resolution multiplier ρ is applied to the input image and the internal representation of every layer is subsequently reduced by the same multiplier. In practice we implicitly set ρ by setting the input resolution.
In the code:
The depth_multiplier is used to reduce the number of channels at each layer, so the depth_multiplier corresponds to the width multiplier α.
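A minimal sketch of what that scaling looks like (this mirrors the channel-scaling helper in the slim code linked above, though the exact rounding and min-depth handling there may differ):

def scaled_depth(channels, depth_multiplier, min_depth=8):
    # Scale a layer's channel count by the width multiplier (alpha).
    return max(int(channels * depth_multiplier), min_depth)

print(scaled_depth(64, 1.0))   # 64
print(scaled_depth(64, 0.5))   # 32 - half the channels, fewer parameters
print(scaled_depth(64, 0.25))  # 16

The input resolution is untouched by this; to change the resolution multiplier ρ you simply feed a smaller input image.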

Related

Why does the embedding vector need to be renormalized when its norm exceeds the limit?

I saw max_norm=1 in a piece of code, and the documentation says the maximum norm is 1. What does this mean? Why is the embedding vector renormalized when its norm exceeds the limit?
The user vectors are randomly initialized:
self.users = nn.Embedding(n_users, dim, max_norm=1)
max_norm (python:float, optional) – The maximum norm, if the norm of the embedding vector exceeds this limit, renormalization will be performed.
There is also this: if n_users is specified and dim=10, then each user is a 10-dimensional vector. Why is there a concept of dimension here at all, and does the choice of dimension make any difference?
embedding_dim (int) – the size of each embedding vector
max_norm (float, optional) – If given, each embedding vector with norm larger than max_norm is renormalized to have norm max_norm.
An embedding maps an integer index to a vector. embedding_dim is the dimension of that vector, and max_norm specifies the maximum allowed norm: whenever a looked-up vector's norm exceeds max_norm, it is rescaled so its norm equals max_norm.
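A small PyTorch sketch (the sizes are made up) showing the renormalization in action; with max_norm set, the rows of the weight matrix are renormalized when they are looked up in the forward pass:

import torch
import torch.nn as nn

n_users, dim = 100, 10
users = nn.Embedding(n_users, dim, max_norm=1.0)

idx = torch.tensor([0, 1, 2])
print(users.weight[idx].norm(dim=1))  # after random init, norms are usually > 1
vecs = users(idx)                     # lookup renormalizes these rows in place
print(vecs.norm(dim=1))               # each norm is now at most 1.0

As for embedding_dim: a larger dimension gives the model more capacity to distinguish users, but costs more parameters and needs more data to train; 10 is just one choice.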

How does the conv2D function change the input layer?

In my ResNet32 network coded using TensorFlow, the input size is 32 x 32 x 3 and the output of the layer is 32 x 32 x 32. Why are 32 channels used?
tf.contrib.layers.conv2d(
    inputs,
    num_outputs,  # how to determine the number of channels to be used in my layer?
    kernel_size,
    stride=1,
    padding='SAME',
    data_format=None,
    rate=1,
    activation_fn=tf.nn.relu,
    normalizer_fn=None,
    normalizer_params=None,
    weights_initializer=initializers.xavier_initializer(),
    weights_regularizer=None,
    biases_initializer=tf.zeros_initializer(),
    biases_regularizer=None,
    reuse=None,
    variables_collections=None,
    outputs_collections=None,
    trainable=True,
    scope=None
)
Thanks in advance.
The 3 in the input is the number of color channels: the input image is RGB (a color image). If it were a black-and-white (monochrome) image, that number would be 1.
The 32 in the output is the number of filters/features/channels you choose (num_outputs), so you are mapping the 3-channel color image to a 32-channel feature representation.
This helps the network learn a richer and more varied set of features from the image; for example, it can learn better edge detectors.
By setting stride=2 you can reduce the spatial size of the tensor, so the height and width of the output become half those of the input. That means if an input tensor of shape (batch, 32, 32, 3) (3 is the number of RGB channels) goes through a convolution layer with 32 kernels/filters and stride=2, the output tensor will have shape (batch, 16, 16, 32). Alternatively, pooling is also widely used to reduce the output tensor size.
The ability to learn hierarchical representations by stacking conv layers is considered the key to the success of CNNs. In a CNN, as we go deeper, the spatial size of the tensor decreases while the number of channels increases, which helps handle variations in the appearance of complex target objects. This reduction in spatial size drastically decreases the required number of arithmetic operations and computation time, while still extracting the prominent features that contribute to the final output/decision. However, finding the optimal number of filters/kernels/output channels is time consuming, so in practice people follow proven earlier architectures, e.g. VGG.
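A short sketch of that shape arithmetic (illustrative only; tf.keras.layers.Conv2D is used here as a modern stand-in for tf.contrib.layers.conv2d, with num_outputs corresponding to filters):

import tensorflow as tf

x = tf.random.normal([1, 32, 32, 3])  # (batch, height, width, RGB channels)

same_size = tf.keras.layers.Conv2D(filters=32, kernel_size=3, strides=1, padding='same')(x)
halved = tf.keras.layers.Conv2D(filters=32, kernel_size=3, strides=2, padding='same')(x)

print(same_size.shape)  # (1, 32, 32, 32): spatial size kept, 32 feature channels
print(halved.shape)     # (1, 16, 16, 32): stride 2 halves the height and width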

Is mxnet.symbol.Convolution cyclic?

Is the Convolution symbol computed cyclically, i.e., does it assume that the padded input symbol is periodic in all dimensions?
More specifically, if I've got input symbol of dimensions 1x3xHxW, representing an RGB image, and I define a convolution operating on it as below:
conv1 = mxnet.symbol.Convolution(data=input, kernel=(3, 5, 5), pad=(0, 2, 2)...
what will the trained filter look like? I expect it to be composed of linear combinations of 2-D filters operating on each of the color channels R, G, B.
Am I correct?
It turns out that convolutions in mxnet are 3D: two of the kernel dimensions cover the image coordinates, while the third covers the depth, i.e., the dimension of the feature space. For an RGB image at the input layer the depth is 3 (unless it is a grayscale image, which has depth 1). For any other layer, the depth is the number of features.
Across the depth dimension the convolution is effectively full: the kernel spans the whole depth (here, all three color channels), so every feature of the current layer can affect any feature of the following layer through linear combinations that are optimized for detection precision.
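For comparison, here is a small sketch (the input size and filter count are made up) of the more common 2-D form, where a (5, 5) kernel implicitly spans the full depth of the input, so each filter applied to an RGB image has shape 3x5x5:

import mxnet as mx

data = mx.sym.Variable('data')
conv = mx.sym.Convolution(data=data, kernel=(5, 5), pad=(2, 2), num_filter=8)

# Infer output shapes for a 1x3x64x64 (NCHW) RGB input.
_, out_shapes, _ = conv.infer_shape(data=(1, 3, 64, 64))
print(out_shapes)  # [(1, 8, 64, 64)]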

How image is reduced to 7x7 by TensorFlow?

I'm reading the tutorial Deep MNIST for Experts. At the start of the section Densely Connected Layer, it says that "[...] the image size has been reduced to 7x7".
I can't seem to find out how they get to this 7x7 matrix. To my understanding, we start at 28x28 and have two layers of 5x5 convolution kernels. 28 divided by 4 is 7, but 28 is not divisible by 5.
5x5 is the "window" size for the convolution layer. It does not reduce the image size: TensorFlow and Caffe, among others, automatically supply a border pad. Torch, to name one, requires you to add that border (2 locations in each direction, in this case).
Each kernel (filter) considers a 5x5 subset of the entire image. For instance, to compute the value for position [7, 12] in the image, the convolution process considers the "window" [5:9, 10:14]. It multiplies each of these 25 values by its corresponding weight and sums those products. This sum becomes the value in the next layer for the center square [7,12].
This process repeats for every position in the image, and for each kernel in the layer.
As @Aenimated1 already mentioned, the size reduction comes from two 2x poolings. This operation divides the image into 2x2 windows, passing along the maximum value (or another representative value, should the user specify one) of each 2x2 square. This reduces the 28x28 image to 14x14; the second pooling reduces it to 7x7.
The reduction in the "image size" is the result of the pooling layers added after each convolutional layer. Each 2x2 pooling decreases the width and height by a factor of 2, thus yielding a 7x7 matrix after both pooling ops.
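A quick sketch of that shape arithmetic (the filter counts match the tutorial, but the code itself is just illustrative, using tf.keras layers rather than the tutorial's raw ops):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 1])                                      # 28x28x1
x = tf.keras.layers.Conv2D(32, 5, padding='same', activation='relu')(x)   # 28x28x32
x = tf.keras.layers.MaxPool2D(2)(x)                                       # 14x14x32
x = tf.keras.layers.Conv2D(64, 5, padding='same', activation='relu')(x)   # 14x14x64
x = tf.keras.layers.MaxPool2D(2)(x)                                       # 7x7x64

print(x.shape)  # (1, 7, 7, 64)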

variable-length rnn padding and mask out padding gradients

I'm building an RNN and using the sequence_length parameter to supply a list of lengths for the sequences in a batch, where all of the sequences in a batch are padded to the same length.
However, when doing backprop, is it possible to mask out the gradients corresponding to the padded steps, so that these steps have zero contribution to the weight updates? I'm already masking out their corresponding costs like this (where batch_weights is a vector of 0's and 1's, in which the elements corresponding to the padding steps are 0's):
loss = tf.mul(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, tf.reshape(self._targets, [-1])), batch_weights)
self._cost = cost = tf.reduce_sum(loss) / tf.to_float(tf.reduce_sum(batch_weights))
The problem is that I'm not sure whether doing the above actually zeroes out the gradients coming from the padding steps.
For all framewise / feed-forward (non-recurrent) operations, masking the loss/cost is enough.
For all sequence / recurrent operations (e.g. dynamic_rnn), there is always a sequence_length parameter which you need to set to the corresponding sequence lengths. Then there won't be a gradient for the zero-padded steps; in other words, they will contribute 0 to the weight updates.
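A minimal sketch of that second point in the TF1-style API (the sizes and placeholder names here are made up for illustration):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

batch, max_time, features, hidden = 4, 10, 8, 16
inputs = tf.placeholder(tf.float32, [batch, max_time, features])
seq_lens = tf.placeholder(tf.int32, [batch])  # true (unpadded) length of each sequence

cell = tf.nn.rnn_cell.BasicLSTMCell(hidden)
# With sequence_length set, dynamic_rnn stops updating the state after each
# sequence's last real step and emits zeros for the padded steps, so those
# steps produce no gradient.
outputs, state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_lens,
                                   dtype=tf.float32)

Combined with masking the loss as above, the padded positions then contribute nothing to the weight updates.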