Why does slim.nets.vgg use conv2d instead of fully_connected layers? - tensorflow

In tensorflow/models/slim/nets, here is the link to the relevant snippet of vgg. I'm curious why slim.nets.vgg uses conv2d instead of fully_connected layers, even though the two work out the same way. Is it for speed?
I appreciate some explanations. Thanks!

After a while, I think there is at least one reason: it avoids weight-conversion mistakes.
TensorFlow/slim, like other high-level libraries, allows tensors in either the BHWC format (batch_size, height, width, channel; same below), which is the default, or BCHW (for better performance).
When converting weights between these two formats, the weights [in_channel, out_channel] of the first fc layer (the fully connected layer right after the last conv layer) have to be reshaped to, for example, [last_conv_channel, height, width, out_channel], then transposed to [height, width, last_conv_channel, out_channel], and reshaped again to [in_channel, out_channel].
If you store conv weights rather than fully connected weights, this transformation is explicit in the fc layer's weights (they actually are conv weights), which surely avoids conversion mistakes.
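For reference, a sketch of how the fc layers appear as convolutions in slim's vgg.py (abbreviated from the linked source):

import tensorflow.contrib.slim as slim

# Inside vgg_16(...), after the conv/pool stack (net: [batch, 7, 7, 512]):
net = slim.conv2d(net, 4096, [7, 7], padding='VALID', scope='fc6')  # fc6 as a 7x7 conv
net = slim.dropout(net, 0.5, is_training=is_training, scope='dropout6')
net = slim.conv2d(net, 4096, [1, 1], scope='fc7')                   # fc7 as a 1x1 conv
net = slim.conv2d(net, num_classes, [1, 1], activation_fn=None, scope='fc8')

Since a conv kernel natively carries the [kernel_height, kernel_width, in_channels, out_channels] layout, the spatial ordering stays explicit, and no manual reshape/transpose of fc weights is needed when switching between BHWC and BCHW.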

Related

MobileNetV2 preprocessing inconsistency on TF 2.2

In this segmentation tutorial, the preprocessing normalizes image values into [0, 1].
However, according to the documentation of MobileNetV2's preprocess_input (https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/preprocess_input), the preprocessing step normalizes data to the interval [-1, 1].
Which preprocessing is the correct one and why?
If you want to train your own network from scratch, you can apply any normalization you see fit, or even no normalization at all; it's your choice!
If, instead, you want to reuse the pretrained models (e.g.: by setting weights='imagenet' in the definition of MobileNetV2), then you should use the specific preprocessing in https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/preprocess_input, since this model has been trained with this specific preprocessing (normalization to [-1, 1]).
Although you shouldn't, you could also treat the MobileNetV2 pretrained model as a static black-box transformation and plug in any normalization you want. The downside: you would almost surely make better use of this black box by applying the standard normalization.
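A minimal sketch of the pretrained path (the random input is just a stand-in for real images):

import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input

model = MobileNetV2(weights="imagenet")
images = np.random.randint(0, 256, size=(1, 224, 224, 3)).astype("float32")  # raw [0, 255] pixels
x = preprocess_input(images)   # rescales to [-1, 1], matching the pretrained weights
preds = model.predict(x)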

How can I use dropout in Conv Layer to drop activation maps in tensorflow?

I am trying to add dropout in convolutional layers (although it seems people don't do this a lot).
According to cs231n, it is recommended to drop entire activation maps instead of individual units across all activation maps (I think this makes sense, because each activation map extracts the same feature at different positions).
In TensorFlow I can't find any API that does this directly, so how can I do it?
This is my first time asking a question on Stack Overflow, and I would appreciate any advice and answers.
You can actually do this with the available dropout functions via the noise_shape argument. E.g. using the layers API:
x = tf.layers.dropout(x, noise_shape=[batch_size, 1, 1, features])
This is for 2D convolution and channels_last format. We generate only a single noise value over the image width/height, which is broadcast across the spatial dimensions; however, we still generate a different noise value for each feature/activation map.
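A fuller sketch of the same idea in modern Keras (the shapes are hypothetical; SpatialDropout2D is the built-in layer that drops whole feature maps):

import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 64))   # hypothetical conv feature map
# One noise value per (sample, feature map), broadcast over height/width;
# None lets the mask adapt to any batch size:
x1 = tf.keras.layers.Dropout(rate=0.3, noise_shape=(None, 1, 1, 64))(inputs)
# Equivalent built-in layer:
x2 = tf.keras.layers.SpatialDropout2D(rate=0.3)(inputs)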

Shape of tensor for 2D image in Keras

I am a newbie to Keras (and somewhat to TF), but I have found the shape definition for the input layer very confusing.
So in the examples, when we have a 1D vector of length 20 for input, shape gets defined as
...Input(shape=(20,)...)
And when a 2D tensor for greyscale images needs to be defined for MNIST, it is defined as:
...Input(shape=(28, 28, 1)...)
So my question is: why is the tensor not defined as (20) and (28, 28)? Why is a second dimension added and left empty in the first case? And why, in the second, does the number of channels have to be defined?
I understand that it depends on the layer, so Conv1D, Dense, and Conv2D take different shapes, but it seems the first parameter is implicit?
According to the docs, Dense needs (batch_size, ..., input_dim), but how is this related to the example:
Dense(32, input_shape=(784,))
Thanks
Tuples vs numbers
input_shape must be a tuple, so only (20,) can satisfy it; the number 20 is not a tuple. There is also the parameter input_dim, to make your life easier if you have only one dimension; this parameter can take 20. (But really, I find it just confusing; I always work with input_shape and use tuples, to keep a consistent understanding.)
Dense(32, input_shape=(784,)) is the same as Dense(32, input_dim=784).
Images
Images don't have only pixels, they also have channels (red, green, blue).
A black/white image has only one channel.
So: (28 pixels, 28 pixels, 1 channel).
But notice that there isn't any obligation to follow this shape for images everywhere. You can shape them the way you like. But some kinds of layers do demand a certain shape, otherwise they couldn't work.
Some layers demand specific shapes
It's the case of the 2D convolutional layers, which need (size1, size2, channels). They need this shape because they must apply the convolutional filters accordingly.
It's also the case of recurrent layers, which need (timeSteps, featuresPerStep) to perform their recurrent calculations.
MNIST models
Again, there isn't any obligation to shape your image in a specific way. You must do it according to which first layer you choose and what you intend to achieve. It's a free thing.
Many examples simply don't care that an image is a 2D structured thing, and they just use models that take 784 pixels. That's enough. They probably start with Dense layers, which demand shapes like (size,).
Other examples may care, and use a shape of (28, 28), but then these models will have to reshape the input to fit the needs of the next layer.
2D convolutional layers will demand (28, 28, 1).
The main idea is: input arrays must match input_shape or input_dim.
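For example, both of these are valid MNIST starts; the input arrays just have to match the declared shape (a minimal sketch):

from tensorflow import keras

# Dense-first model: each image flattened to a 784-vector
dense_model = keras.Sequential([
    keras.layers.Dense(32, input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])

# Conv-first model: images kept as (28, 28, 1) tensors
conv_model = keras.Sequential([
    keras.layers.Conv2D(16, (3, 3), input_shape=(28, 28, 1)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])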
Tensor shapes
Be careful, though, when reading Keras error messages or working with custom / lambda layers.
All these shapes we defined before omit an important dimension: the batch size or the number of samples.
Internally all tensors will have this additional dimension as the first dimension. Keras will report it as None (a dimension that will adapt to any batch size you have).
So, input_shape=(784,) will be reported as (None, 784).
And input_shape=(28,28,1) will be reported as (None, 28, 28, 1).
And your actual input data must have a shape that matches that reported shape.
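A quick check (minimal sketch):

import numpy as np
from tensorflow import keras

model = keras.Sequential([keras.layers.Dense(32, input_shape=(784,))])
print(model.input_shape)             # (None, 784); None is the batch dimension
model.predict(np.zeros((16, 784)))   # a batch of 16 samples matches (None, 784)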

How to handle the BatchNorm layer when training fully convolutional networks by finetuning?

Training fully convolutional networks (FCNs) for pixelwise semantic segmentation is very memory intensive, so we often use batch_size=1 when training FCNs. However, when we finetune pretrained networks that have BatchNorm (BN) layers, batch_size=1 doesn't make sense for the BN layers. So, how should the BN layers be handled?
Some options:
delete the BN layers (merge the BN layers with the preceding layers for the pretrained model)
Freeze the parameters and statistics of the BN layers
...
Which is better, and is there any demo of an implementation in PyTorch/TF/Caffe?
With only one element, batch normalization will output zero if epsilon is non-zero (the variance is zero, and the mean equals the input).
It's better to delete the BN layers from the network and try the activation function SELU (scaled exponential linear units). This is from the paper 'Self-Normalizing Neural Networks' (SNNs).
Quote from the paper:
While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation function of SNNs are "scaled exponential linear units" (SELUs), which induce self-normalizing properties.
The SELU is defined as:
import tensorflow as tf

def selu(x, name="selu"):
    # Constants from the SNN paper
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    # x for x >= 0, alpha * (exp(x) - 1) for x < 0, both scaled by `scale`
    return scale * tf.where(x >= 0.0, x, alpha * tf.nn.elu(x))
Batch Normalization was introduced to reduce the internal covariate shift of the input feature maps. Because the parameters of each layer change after every optimization step, the input distribution of a layer also changes, and this slows down model convergence. By using Batch Normalization we can normalize the input distribution irrespective of the batch_size (whether batch_size = 1 or larger).
BN normalizes the input distribution
For a convolutional network, the input to an intermediate layer is a 4D tensor [batch_size, width, height, num_filters]. Normalization affects all the feature maps.
delete the BN layers (merge the BN layers with the preceding layers for the pretrained model)
This may further slow down the training steps, and convergence may not be achieved.
Freeze the parameters and statistics of the BN layers
Sometimes the input data distribution for retraining/finetuning may vary significantly from the original data used to train the pretrained model used for initialization, due to which your model may end up in a non-optimal solution.
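For option 2, a minimal Keras sketch of freezing the BN layers before finetuning (model here stands for your pretrained network; in TF2 Keras, trainable = False also makes a BN layer use its moving statistics instead of batch statistics):

import tensorflow as tf

def freeze_batchnorm(model):
    # Freeze every BatchNormalization layer: its gamma/beta stop updating,
    # and (in TF2) it runs in inference mode, using the pretrained
    # moving mean/variance instead of per-batch statistics.
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.BatchNormalization):
            layer.trainable = False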
According to my experiments in PyTorch, if the convolutional layer before the BN outputs more than one value (i.e. 1 x feat_nb x height x width, where height > 1 or width > 1), then the BN still works fine even when the batch size equals one. However, I suspect that in this case the variance estimate might be very biased, since all samples used for the variance calculation come from the same image. Therefore, in my case I still decided to use a small batch.
The effective batch size over convolutional layer
I think the CNN-related section (Section 3.2) of the original BN paper could help. From the authors' point of view, it should be OK to use batch size = 1 for convolutional layers: the "effective batch size" for a convolutional layer is actually batch_size * image_height * image_width.
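As a quick illustration of that effective batch size (a sketch with NHWC features):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 64])          # batch_size = 1
# BN computes per-channel moments over batch, height and width,
# i.e. 1 * 28 * 28 = 784 values per channel:
mean, variance = tf.nn.moments(x, axes=[0, 1, 2])
print(mean.shape, variance.shape)              # (64,) (64,)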
I do not have an exact answer, but here are my thoughts:
networks with BatchNorm (BN) layers, batchsize=1 doesn't make sense for the BN layers
The main motivation of BN is to fix the distribution (mean/variance) of the input within the batch. In my opinion, having only one element does not make sense: judging from the paper, you would need to calculate the mean and the variance of a single element.
You can always just remove BN but are you sure you can't afford at least 16 elements in the batch?
My observation is contrary to Stephan's: using PyTorch on a similar input (batch x feat_nb x height x width, where height > 1 or width > 1), I found that adding BatchNorm after the last conv and before the last non-linearity (sigmoid) actually hurts the accuracy by a big margin. I'm still trying to make sense of it... (batch size = 8)

Semantic Segmentation with Encoder-Decoder CNNs

Apologies for any misuse of technical terms.
I am working on a project of semantic segmentation via CNNs; I am trying to implement an architecture of the encoder-decoder type, so the output is the same size as the input.
How do you design the labels ?
What loss function should one apply? Especially in the situation of heavy class imbalance (where the ratio between the classes varies from image to image).
The problem deals with two classes (objects of interest and background). I am using Keras with tensorflow backend.
So far, I am designing the expected outputs to have the same dimensions as the input images, applying pixel-wise labeling. The final layer of the model has either softmax activation (for 2 classes) or sigmoid activation (to express the probability that the pixels belong to the objects class). I am having trouble designing a suitable objective function for such a task, of the type:
function(y_pred,y_true),
in agreement with Keras.
Please try to be specific about the dimensions of the tensors involved (input/output of the model). Any thoughts and suggestions are much appreciated. Thank you!
Actually, when you use a TensorFlow backend you can simply apply a predefined Keras objective in the following manner:
output = Convolution2D(number_of_classes, # 1 for binary case
                       filter_height,
                       filter_width,
                       activation="softmax")(input_to_output) # or "sigmoid" for binary
...
model.compile(loss = "categorical_crossentropy", ...) # or "binary_crossentropy" for binary
And then feed it either a one-hot encoded feature map or a matrix of shape (image_height, image_width) with integer-encoded classes (remember that in this case you should use sparse_categorical_crossentropy as the loss), as illustrated below.
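To illustrate the two target layouts (the shapes are hypothetical):

import numpy as np

image_height, image_width, number_of_classes = 128, 128, 2

# One-hot targets, paired with loss="categorical_crossentropy":
y_onehot = np.zeros((image_height, image_width, number_of_classes), dtype="float32")

# Integer class ids, paired with loss="sparse_categorical_crossentropy":
y_sparse = np.zeros((image_height, image_width), dtype="int32")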
To deal with class imbalance (I guess it's because of a background class), I strongly recommend you read carefully the answers to this Stack Overflow question.
I suggest starting with a base architecture used in practice, like this one for nerve segmentation: https://github.com/EdwardTyantov/ultrasound-nerve-segmentation. Here a dice_loss is used as the loss function; this works very well for a two-class problem, as has been shown in the literature: https://arxiv.org/pdf/1608.04117.pdf.
Another loss function that has been widely used for such problems is cross entropy. For problems like yours, long and short skip connections are most commonly deployed to stabilize training, as noted in the paper above.
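For reference, a minimal soft Dice loss sketch in Keras, following the approach used in the repository above (the smoothing constant and shapes are assumptions):

from tensorflow.keras import backend as K

def dice_loss(y_true, y_pred, smooth=1.0):
    # y_true / y_pred: tensors of shape (batch, height, width, 1), values in [0, 1]
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    dice = (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return 1.0 - dice  # minimize 1 - Dice to maximize overlap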
Two ways:
You could try 'flattening':
model.add(Reshape((NUM_CLASSES, HEIGHT * WIDTH)))  # from channels_first (NUM_CLASSES, HEIGHT, WIDTH)
model.add(Permute((2, 1)))                         # now (HEIGHT * WIDTH, NUM_CLASSES)
# Use some activation here, e.g. a softmax (or global averaging):
model.add(Activation('softmax'))
One-hot encoding every pixel:
In this case your final layer should upsample/unpool/deconvolve to HEIGHT x WIDTH x NUM_CLASSES, so your output is essentially of the shape (HEIGHT, WIDTH, NUM_CLASSES).
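A minimal sketch of such a decoder tail (the layer choices and names here are assumptions, not a prescription):

from tensorflow.keras import layers

# x: a low-resolution feature map, e.g. (HEIGHT/2, WIDTH/2, filters)
x = layers.UpSampling2D(size=(2, 2))(x)                          # back to (HEIGHT, WIDTH, filters)
x = layers.Conv2D(NUM_CLASSES, (1, 1), activation="softmax")(x)  # (HEIGHT, WIDTH, NUM_CLASSES), per-pixel probabilities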