(Faster R-CNN) ROI Pooling layer is not differentiable w.r.t the box coordinates - object-detection

The paper states that "having an RoI pooling layer that is differentiable w.r.t the box coordinates is a nontrivial problem" and points to "RoI warping" (which crops and resizes the features to a fixed shape) as an operation that is fully differentiable w.r.t. the box coordinates.
I can't figure out why the RoI pooling layer is not differentiable w.r.t. the box coordinates while RoI warping is.

The box coordinates fed into RoI pooling come from the Region Proposal Network, and those outputs are continuous. RoI pooling, however, rounds them to integer bin boundaries before pooling, so the coordinates it actually uses are discrete. Because of this rounding, the pooled output is piecewise constant in the box coordinates: a small change in a coordinate usually changes nothing, so the gradient w.r.t. the coordinates is zero almost everywhere and undefined at the jumps. RoI warping replaces the rounding with bilinear interpolation, which is a continuous function of the coordinates, so meaningful gradients can flow back to them.
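A toy example makes the difference concrete. The snippet below is only an illustrative sketch (a 1-D "feature map" and a single continuous coordinate, not the actual RoI layers): rounding the coordinate, as RoI pooling does, blocks the gradient, while bilinear (here linear) interpolation, as RoI warping does, lets it through.

import tensorflow as tf

feat = tf.constant([0.0, 1.0, 4.0, 9.0, 16.0])
x = tf.Variable(2.3)  # continuous coordinate, as predicted by an RPN

with tf.GradientTape(persistent=True) as tape:
    # RoI-pooling style: snap the coordinate to an integer position.
    pooled = tf.gather(feat, tf.cast(tf.round(x), tf.int32))
    # RoI-warping style: interpolate between the two neighbouring positions.
    lo = tf.floor(x)
    w = x - lo                        # interpolation weight, differentiable in x
    i = tf.cast(lo, tf.int32)
    warped = (1.0 - w) * tf.gather(feat, i) + w * tf.gather(feat, i + 1)

print(tape.gradient(pooled, x))  # None: the rounding/cast step has no gradient
print(tape.gradient(warped, x))  # 5.0 = feat[3] - feat[2]: the coordinate can be refined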

Related

Can someone give me an explanation of the Multibox loss function?

I have found some expression for SSD Multibox-loss function as follows:
multibox_loss = confidence_loss + alpha * location_loss
Can someone explain what those terms mean?
SSD Multibox (short for Single Shot Multibox Detector) is a neural network that can detect and locate objects in an image in a single forward pass. The network is trained in a supervised manner on a dataset of images where a bounding box and a class label are given for each object of interest. The loss term
multibox_loss = confidence_loss + alpha * location_loss
is made up of two parts:
Confidence loss is a categorical cross-entropy loss for classifying the detected objects. The purpose of this term is to make sure that the correct label is assigned to each detected object.
Location loss is a regression loss (Smooth L1 in the SSD paper, though an L2 loss is sometimes used) on the predicted bounding-box parameters (the offsets of the box centre, width and height relative to the default box). The purpose of this term is to make sure that the correct region of the image is identified for each detected object. The alpha term is a hyperparameter used to scale the location loss.
The precise formulation of the loss is given in Equation 1 of the SSD: Single Shot MultiBox Detector paper.
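As a rough illustration of how the two terms combine, here is a sketch only (not the reference SSD implementation; hard-negative mining and the encoding of box targets are omitted, and all tensor shapes are assumptions):

import tensorflow as tf

def multibox_loss(cls_logits, cls_labels, loc_preds, loc_targets,
                  positive_mask, alpha=1.0):
    # cls_logits: [num_boxes, num_classes], cls_labels: [num_boxes]
    # loc_preds / loc_targets: [num_boxes, 4], positive_mask: [num_boxes] bool
    confidence_loss = tf.reduce_sum(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=cls_labels, logits=cls_logits))

    # Smooth L1 on the box parameters, counted only for matched (positive) boxes.
    diff = tf.abs(loc_preds - loc_targets)
    smooth_l1 = tf.where(diff < 1.0, 0.5 * tf.square(diff), diff - 0.5)
    location_loss = tf.reduce_sum(
        tf.cast(positive_mask, tf.float32) * tf.reduce_sum(smooth_l1, axis=-1))

    # As in the paper, normalize by the number of matched default boxes.
    num_pos = tf.maximum(tf.reduce_sum(tf.cast(positive_mask, tf.float32)), 1.0)
    return (confidence_loss + alpha * location_loss) / num_pos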

Is a TensorFlow 3d convolution layer with kernel depth equal to the input depth equivalent to a 2d convolution layer?

To further explain the title, I am passing a series of single-channel pictures into a convolutional network and am comparing and contrasting conv3d versus conv2d. There are two possible setups I'm considering:
Setup 1 uses a conv2d layer with each picture fed in as a separate input channel. Input dimensions [batch_size, width, height, num_pictures]. Kernel dimensions [width, height], stride [1, 1], valid padding.
Setup 2 uses a conv3d layer with the pictures stacked along the "depth" dimension and the kernel spanning all of them in depth. Input dimensions [batch_size, num_pictures, width, height, 1]. Kernel dimensions [num_pictures, width, height], stride [1, 1, 1], valid padding.
The way I see it, a 2d convolution already sums over all channels of its input, so is there any functional difference between the two setups, either in the results or in performance?
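One way to check is to run both setups with the same weights and compare the outputs. The following is only a sketch with made-up shapes (a single output filter, VALID padding), but with the kernel depth equal to num_pictures the two compute the same sums:

import numpy as np
import tensorflow as tf

batch, h, w, n_pics, k = 2, 8, 8, 5, 3
x = np.random.rand(batch, h, w, n_pics).astype("float32")
kernel2d = np.random.rand(k, k, n_pics, 1).astype("float32")   # [kh, kw, in, out]

out2d = tf.nn.conv2d(x, kernel2d, strides=1, padding="VALID")

# Same data and weights rearranged for conv3d: input [batch, depth, h, w, 1],
# kernel [depth, kh, kw, in, out] with depth == num_pictures.
x3d = np.transpose(x, (0, 3, 1, 2))[..., None]
kernel3d = np.transpose(kernel2d, (2, 0, 1, 3))[..., None]
out3d = tf.nn.conv3d(x3d, kernel3d, strides=[1, 1, 1, 1, 1], padding="VALID")

print(out2d.shape)                                               # (2, 6, 6, 1)
print(out3d.shape)                                               # (2, 1, 6, 6, 1)
print(np.allclose(out2d, np.squeeze(out3d, axis=1), atol=1e-5))  # True

In practice the conv2d form avoids the extra singleton depth dimension, which usually makes it the simpler choice.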

How to handle the BatchNorm layer when training fully convolutional networks by finetuning?

Training fully convolutional networks (FCNs) for pixelwise semantic segmentation is very memory intensive, so we often use batchsize=1 for training FCNs. However, when we finetune pretrained networks with BatchNorm (BN) layers, batchsize=1 doesn't make sense for the BN layers. So, how should the BN layers be handled?
Some options:
delete the BN layers (merge the BN layers with the preceding layers for the pretrained model)
Freeze the parameters and statistics of the BN layers
....
Which is better, and is there any demo implementation in PyTorch/TF/Caffe?
Having only one element makes the batch-normalized output zero when epsilon is non-zero: the mean equals the input and the variance is zero, so the normalized value is 0/sqrt(epsilon) = 0.
It's better to delete the BN layers from the network and try the SELU activation function (scaled exponential linear units) instead. This comes from the paper 'Self-Normalizing Neural Networks' (SNNs).
Quote from the paper:
While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation function of SNNs are "scaled exponential linear units" (SELUs), which induce self-normalizing properties.
The SELU is defined as:
import tensorflow as tf

def selu(x, name="selu"):
    # Constants from the SNN paper; tf.nn.elu handles the exp(x) - 1 branch.
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    with tf.name_scope(name):
        return scale * tf.where(x >= 0.0, x, alpha * tf.nn.elu(x))
Batch Normalization was introduced to reduce the internal covariate shift of the input feature maps. Because the parameters of every layer change after each optimization step, the input distribution of a layer also changes, which slows down convergence. By using Batch Normalization we can normalize the input distribution irrespective of the batch size (whether batch_size = 1 or larger).
BN normalizes the input distribution
For a convolutional network, the input to an intermediate layer is a 4-D tensor [batch_size, width, height, num_filters]. The normalization affects all the feature maps.
delete the BN layers (merge the BN layers with the preceding layers for the pretrained model)
This may further slow down training, and convergence may not be achieved.
Freeze the parameters and statistics of the BN layers
Sometimes the data distribution used for retraining/finetuning varies significantly from the original data used to train the pretrained model you initialize from; in that case the frozen statistics no longer match, and your model may end up in a non-optimal solution.
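If you nevertheless go with the second option, a minimal PyTorch sketch (any model with standard nn.BatchNorm layers is assumed; note that a later call to model.train() puts the BN layers back into training mode, so re-apply this afterwards):

import torch.nn as nn

def freeze_batchnorm(model):
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()                    # keep using the stored running stats
            for p in module.parameters():    # stop updating gamma / beta
                p.requires_grad = False
    return model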
According to my experiments in PyTorch, if the convolutional layer before the BN outputs more than one value (i.e. 1 x feat_nb x height x width, where height > 1 or width > 1), then the BN still works fine even when the batch size is equal to one. However, I suspect that in this case the variance estimate might be very biased, since all samples used for the variance calculation come from the same image. Therefore, in my case, I still decided to use a small batch size.
The effective batch size over a convolutional layer
I think the CNN-specific section (Section 3.2) of the original BN paper could help. From the authors' point of view, it should be OK to use batch size = 1 for convolutional layers: the "effective batch size" for a convolutional layer is actually batch_size * image_height * image_width.
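A quick way to see this (a toy PyTorch check, not from the paper): even with a batch of one image, the per-channel statistics are computed over 1 * height * width values, so the normalization is still well defined.

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=8)      # training mode by default
x = torch.randn(1, 8, 32, 32)            # one image, 8 feature maps
y = bn(x)                                # stats taken over 1*32*32 values per channel
print(y.mean().item(), y.std().item())   # roughly 0 and 1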
I do not have an exact answer, but here are my thoughts:
networks with BatchNorm (BN) layers, batchsize=1 doesn't make sense for the BN layers
The main motivation of BN is to fix the distribution (mean/variance) of the inputs within a batch. Judging from the paper, with a single element you would have to estimate the mean and the variance from one sample, which does not make sense.
You can always just remove BN but are you sure you can't afford at least 16 elements in the batch?
My observation is contrary to Stephan's: using PyTorch on a similar input (batch x feat_nb x height x width, where height > 1 or width > 1), I found that adding BatchNorm after the last conv and before the last non-linearity (a sigmoid) actually hurts the accuracy by a big margin. Still trying to make sense of it...
(batch size = 8)

How to implement loss on a fully convolutional network in TensorFlow?

So I've started to implement the paper "Synthetic Data for Text Localisation in Natural Images" by Gupta et al. and I've encountered a serious problem.
The network architecture is a fully convolutional network. The final layer is basically an NxNx7 tensor (imagine a matrix where each cell holds 7 values). Each cell holds P and C values: P is 6 parameters of a bounding box that should be regressed, and C is the confidence.
As the paper states, every cell of the final layer is a prediction. If that predictor's location should contain a bounding box, then the loss should be applied to all of the parameters in that predictor (or cell); if it shouldn't contain a bounding box, then regressing only the confidence C is enough.
So I need to dynamically define separate losses in TensorFlow; how can I do that?
You can use tf.cond, and write something like
loss = tf.cond(is_there_sthg_label, lambda: tf.add(loss1, loss2), lambda: loss2)
EDIT:
Sorry, I didn't understand your problem correctly. You can make a mask of size NxN with value True (at runtime) at [i, j] if there is a bounding box there, and False otherwise. Then you compute both losses for each cell, getting tensors loss1 and loss2 of shape NxN, and then
# loss1 is the loss on the confidence only, loss2 is the loss on P
loss_tensor = loss1 + tf.multiply(loss2, tf.cast(mask, loss2.dtype))
total_loss = tf.reduce_sum(loss_tensor)
(this still works if you have batches of course)
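Putting it together, one possible sketch (the channel layout, with the 6 box parameters P followed by the confidence C, and the use of a squared loss for both terms are assumptions here):

import tensorflow as tf

def detection_loss(preds, targets, mask):
    # preds / targets: [batch, N, N, 7]; mask: [batch, N, N], True where a
    # cell should contain a bounding box.
    p_pred, c_pred = preds[..., :6], preds[..., 6]
    p_true, c_true = targets[..., :6], targets[..., 6]

    loss1 = tf.square(c_pred - c_true)                          # confidence, every cell
    loss2 = tf.reduce_sum(tf.square(p_pred - p_true), axis=-1)  # box parameters P

    loss_tensor = loss1 + tf.multiply(loss2, tf.cast(mask, loss2.dtype))
    return tf.reduce_sum(loss_tensor)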

How is the image reduced to 7x7 by TensorFlow?

I'm reading the tutorial Deep MNIST for Experts. At the start of the section Densely Connected Layer, it says that "[...] the image size has been reduced to 7x7".
I can't seem to find out how they get to this 7x7 matrix. To my understanding, we start at 28x28 and have two layers of 5x5 convolution kernels. 28 divided by 4 is 7, but 28 isn't divisible by 5.
5x5 is the "window" size for the convolution layer. It does not reduce the image size here: the tutorial uses SAME padding, so TensorFlow pads the border for you (Caffe and others expose a similar pad option). Torch, to name one, requires you to add that border yourself (2 pixels in each direction, in this case).
Each kernel (filter) considers a 5x5 subset of the entire image. For instance, to compute the value for position [7, 12] in the image, the convolution considers the window spanning rows 5-9 and columns 10-14. It multiplies each of those 25 values by its corresponding weight and sums the products. That sum becomes the value at [7, 12] in the next layer.
This process repeats for every position in the image, and for each kernel in the layer.
As @Aenimated1 already mentioned, the size reduction comes from two 2x poolings. Each one divides the image into 2x2 windows and passes along the maximum value (or another statistic, should the user specify) of each 2x2 square. The first pooling reduces the 28x28 image to 14x14; the second reduces it to 7x7.
The reduction in the "image size" is the result of the pooling layers added after each convolutional layer. Each 2x2 pooling decreases the width and height by a factor of 2, thus yielding a 7x7 matrix after both pooling ops.
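To see the arithmetic, here is a shape-only sketch (written with the Keras layers rather than the tutorial's low-level ops):

import tensorflow as tf

x = tf.zeros([1, 28, 28, 1])
x = tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu")(x)  # SAME padding keeps 28x28
x = tf.keras.layers.MaxPool2D(2)(x)                                      # -> 14x14
x = tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu")(x)  # keeps 14x14
x = tf.keras.layers.MaxPool2D(2)(x)                                      # -> 7x7
print(x.shape)                                                            # (1, 7, 7, 64)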