L2 regularization for the fully connected parameters in CNN - optimization

In this TensorFlow example, L2 regularization is used for the fully connected parameters:
regularizers = (tf.nn.l2_loss(fc1_weights) + tf.nn.l2_loss(fc1_biases) +
                tf.nn.l2_loss(fc2_weights) + tf.nn.l2_loss(fc2_biases))
What is it? Why are the fully connected parameters used here? And how does it improve performance?

Regularizers in general are terms added to the loss function that prevent the model from over-fitting the training data. They do this by encouraging certain properties on the learned model.
L2 regularization of the parameters, for instance, encourages all the parameters to be small, instead of being peaky. This in turn would encourage the network to pay equal attention to all dimensions of the input vector.
The Wikipedia page is a great introduction to regularization in general, and you can click through to learn in depth about L2 regularization in particular.
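To see how this term is actually used, here is a minimal, hedged sketch of folding such a regularizer into the training loss; the layer shapes, the cross-entropy data term, and the 5e-4 weighting factor are assumptions for illustration, not taken from the original example:

import tensorflow as tf

# Hypothetical fully connected parameters (shapes assumed for illustration).
fc1_weights = tf.Variable(tf.truncated_normal([512, 64], stddev=0.1))
fc1_biases = tf.Variable(tf.zeros([64]))
fc2_weights = tf.Variable(tf.truncated_normal([64, 10], stddev=0.1))
fc2_biases = tf.Variable(tf.zeros([10]))

features = tf.placeholder(tf.float32, [None, 512])
labels = tf.placeholder(tf.int64, [None])

# Forward pass through the two fully connected layers.
hidden = tf.nn.relu(tf.matmul(features, fc1_weights) + fc1_biases)
logits = tf.matmul(hidden, fc2_weights) + fc2_biases

# Data term plus the L2 penalty on the fully connected parameters.
data_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
regularizers = (tf.nn.l2_loss(fc1_weights) + tf.nn.l2_loss(fc1_biases) +
                tf.nn.l2_loss(fc2_weights) + tf.nn.l2_loss(fc2_biases))
loss = data_loss + 5e-4 * regularizers  # 5e-4 is an assumed, tunable strength

Larger weighting factors push the fully connected weights harder toward zero; the value is a hyperparameter to tune.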

Related

How to decide on activation function?

Currently there are a lot of activation functions like sigmoid, tanh, and ReLU (the preferred choice), but I have a question about which considerations should guide the choice of a particular activation function.
For example: when we want to upsample a network in GANs, we prefer using LeakyReLU.
I am a newbie in this subject, and have not found a concrete solution as to which activation function to use in different situations.
My knowledge up until now:
Sigmoid : When you have a binary class to identify
Tanh : ?
ReLU : ?
LeakyReLU : When you want to upsample
Any help or article will be appreciated.
This is an open research question. The choice of activation is also very intertwined with the architecture of the model and the computation/resources available, so it's not something that can be answered in isolation. The paper Efficient Backprop by Yann LeCun et al. has a lot of good insights into what makes a good activation function.
That being said, here are some toy examples that may help get intuition for activation functions. Consider a simple MLP with one hidden layer and a simple classification task:
In the last layer we can use sigmoid in combination with the binary_crossentropy loss in order to use intuition from logistic regression - because we're just doing simple logistic regression on the learned features that the hidden layer gives to the last layer.
What types of features are learned depends on the activation function used in that hidden layer and the number of neurons in that hidden layer.
Here is what ReLU learns when using two hidden neurons:
https://miro.medium.com/max/2000/1*5nK725uTBUeoIA0XjEyA_A.gif
(on the left is what the decision boundary looks like in the feature space)
As you add more neurons you get more pieces with which to approximate the decision boundary. Here is with 3 hidden neurons:
And 10 hidden neurons:
Sigmoid and Tanh produce similar decision boundaries (this is tanh https://miro.medium.com/max/2000/1*jynT0RkGsZFqt3WSFcez4w.gif - sigmoid is similar) which are more continuous and sinusoidal.
The main difference is that sigmoid is not zero-centered, which makes it a poor choice for a hidden layer - especially in deep networks.
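As a hedged sketch of the toy setup described above (the data, layer sizes, and training settings are assumptions for illustration), you could compare hidden activations in a one-hidden-layer MLP with a sigmoid output and binary_crossentropy like this:

import numpy as np
import tensorflow as tf

# Toy 2-D binary classification data (assumed, for illustration only).
X = np.random.randn(1000, 2).astype("float32")
y = (X[:, 0] * X[:, 1] > 0).astype("float32")  # XOR-like labels

def build_mlp(hidden_units, hidden_activation):
    # One hidden layer whose activation shapes the learned features,
    # and a sigmoid output trained with binary cross-entropy.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_units, activation=hidden_activation,
                              input_shape=(2,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Compare decision boundaries by swapping the hidden activation.
for act in ["relu", "tanh", "sigmoid"]:
    model = build_mlp(hidden_units=3, hidden_activation=act)
    model.fit(X, y, epochs=20, verbose=0)
    print(act, model.evaluate(X, y, verbose=0))

Varying hidden_units in this sketch reproduces the effect shown in the animations: more hidden neurons give the boundary more pieces to work with.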

Is batchnorm used in neural networks that are not CNN?

1.) Batchnorm is always used in deep convolutional neural networks. But is it also used in networks that are not CNNs, i.e., networks with just fully connected layers?
2.) Is batchnorm used in shallow CNNs?
3.) If I have a CNN with an input image and an input array IN_array, the output is an array after the last fully connected layer. I call this array FC_array. Suppose I want to concat that FC_array with the IN_array:
CONCAT_array = tf.concat(values=[FC_array, IN_array], axis=-1)
Is it useful to have a batchnorm after the concat layer? Or should that batchnorm be just after the FC_array, before the concat layer?
For information, the IN_array is a tf.one_hot() vector.
Thank you
TL;DR: 1. Yes 2. Yes 3. No
TS;WM:
Batch normalization was a great invention by Sergey Ioffe and Christian Szegedy in early 2015. Back in those days, battling vanishing or exploding gradients was an everyday problem. Read that article if you want to gain a deep understanding, but basically this quote from the abstract should give you some idea:
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.
They did in fact first use batch normalization for DCNNs, which allowed them to beat human performance in top-5 ImageNet classification, but any network with nonlinearities can benefit from batch normalization, including a network consisting of fully connected layers.
Yes, it is used for shallow CNNs too. Any network with more than one layer can benefit from it, although it is true that deeper networks benefit more.
First of all, one-hot vectors should never be normalized. Normalization means you subtract the mean and divide by the variance, thus creating a dataset with 0 mean and 1 variance. If you do this to a one-hot vector, then the cross-entropy loss calculation will be completely off. Second, there is no point in normalizing a concat layer separately, since it does not change the values, it just concatenates them. Batch normalization is done on the input of a layer, so the layer after the concat, which receives the concatenated values, can do it if necessary.
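One possible arrangement that keeps the one-hot vector unnormalized (the shapes and the is_training flag below are assumptions for illustration, not from the question):

import tensorflow as tf

# Assumed shapes; FC_array is the output of the last fully connected layer,
# IN_array is a tf.one_hot() vector, as in the question.
FC_array = tf.placeholder(tf.float32, [None, 128])
IN_array = tf.placeholder(tf.float32, [None, 10])
is_training = tf.placeholder(tf.bool, [])

# Normalize only the learned FC activations; the one-hot part stays untouched.
FC_normed = tf.layers.batch_normalization(FC_array, training=is_training)

# The concat itself just glues values together and needs no batchnorm of its own.
CONCAT_array = tf.concat(values=[FC_normed, IN_array], axis=-1)

# The next layer consumes the concatenated values directly.
next_layer = tf.layers.dense(CONCAT_array, units=64, activation=tf.nn.relu)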

Which kind of regularization to use, L2 regularization or dropout, in MultiRNNCell?

I have been working on a project related to a sequence-to-sequence autoencoder for time series forecasting, so I have used tf.contrib.rnn.MultiRNNCell in the encoder and decoder. I am confused about which strategy to use in order to regularize my seq2seq model. Should I use L2 regularization in the loss or DropoutWrapper (tf.contrib.rnn.DropoutWrapper) in the MultiRNNCell? Or can I use both strategies ... L2 for the weights and bias (projection layer) and DropoutWrapper between the cells in the MultiRNNCell?
Thanks in advance :)
You can use both dropout and L2 regularization at the same time, as is commonly done. They are quite different types of regularization. However, I would note that recent literature has suggested that batch normalization has replaced the need for dropout, as noted in the original paper on batch normalization:
https://arxiv.org/abs/1502.03167
From the abstract: "It also acts as a regularizer, in some cases eliminating the need for Dropout."
L2 regularization is typically applied when batchnorm is in use. There's nothing stopping you from applying all three forms of regularization; the statement above only indicates that you might not see an improvement by applying dropout when batchnorm is already in use.
There are generally optimal values for the amount of L2 regularization to apply and the dropout keep probability. These are hyperparameters you tune by trial and error or a hyperparameter search algorithm.
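As a hedged sketch of combining both strategies in the TF 1.x contrib API used in the question (the cell sizes, sequence shapes, squared-error data term, and 1e-4 L2 strength are assumptions for illustration), you could wrap each cell in a DropoutWrapper and add an L2 term only for the projection layer:

import tensorflow as tf

keep_prob = tf.placeholder(tf.float32, [])
num_units, num_layers = 128, 2

# Dropout between stacked cells via DropoutWrapper (one wrapper per cell).
cells = [
    tf.contrib.rnn.DropoutWrapper(
        tf.contrib.rnn.LSTMCell(num_units), output_keep_prob=keep_prob)
    for _ in range(num_layers)
]
multi_cell = tf.contrib.rnn.MultiRNNCell(cells)

inputs = tf.placeholder(tf.float32, [None, 50, 10])  # assumed [batch, time, features]
outputs, state = tf.nn.dynamic_rnn(multi_cell, inputs, dtype=tf.float32)

# Projection layer whose weights and bias get the L2 penalty.
proj_w = tf.get_variable("proj_w", [num_units, 10])
proj_b = tf.get_variable("proj_b", [10])
predictions = tf.tensordot(outputs, proj_w, axes=[[2], [0]]) + proj_b

targets = tf.placeholder(tf.float32, [None, 50, 10])
data_loss = tf.reduce_mean(tf.square(predictions - targets))
l2_term = 1e-4 * (tf.nn.l2_loss(proj_w) + tf.nn.l2_loss(proj_b))  # assumed strength
loss = data_loss + l2_term

At evaluation time you would feed keep_prob=1.0 so dropout is disabled; the L2 strength and keep probability are exactly the hyperparameters mentioned above.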

Weight Decay and input normalization

I'm new to TensorFlow and find that the sample CNN programs use weight decay to avoid huge weights, while they do not always normalize the input in the first place.
Does the weight decay serve the same purpose as the input normalization?
What is the difference between them?
Weight decay is a type of regularisation used to control overfitting of the model. Weight decay is more commonly known as L2 regularisation. Weight decay is used more commonly in shallow learning algorithms like linear regression, logistic regression, etc. In deep learning (e.g., with CNNs), weight decay is not so common; other regularisation methods like dropout are used instead.
Input normalisation, on the other hand, refers to zero-centering your input data and limiting its range. This procedure helps training converge quickly.
There is no general fixed rule on how these two concepts have to be applied, so you may have seen some variations of them.
Weight decay is a regularization technique, such as L2 regularization, that results in gradient descent shrinking the weights on every iteration.
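A minimal sketch of the contrast (the array shapes, learning rate, and decay factor are assumptions for illustration): input normalization transforms the data before it enters the network, while weight decay adds a shrinkage term to every gradient-descent update:

import numpy as np

# --- Input normalization: transform the data before it enters the network ---
X = np.random.rand(1000, 32) * 255.0                     # assumed raw inputs, e.g. pixel values
X_norm = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)   # zero mean, unit variance

# --- Weight decay: shrink the weights a little on every gradient step ---
W = np.random.randn(32, 10) * 0.1                        # assumed weight matrix
grad_W = np.random.randn(32, 10)                         # stand-in for the data-loss gradient
lr, decay = 0.01, 5e-4                                   # assumed hyperparameters
W = W - lr * (grad_W + decay * W)                        # L2 penalty adds decay * W to the gradient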

Online triplet generation - am I doing it right?

I'm trying to train a convolutional neural network with triplet loss (more about triplet loss here) in order to generate face embeddings (128 values that accurately describe a face).
In order to select only semi-hard triplets (distance(anchor, positive) < distance(anchor, negative)), I first feed all values in a mini-batch and calculate the distances:
distance1, distance2 = sess.run([d_pos, d_neg], feed_dict={x_anchor:input1, x_positive:input2, x_negative:input3})
Then I select the indices of the inputs with distances that respect the formula above:
valids_batch = compute_valids(distance1, distance2, batch_size)
The function compute_valids:
def compute_valids(distance1, distance2, batch_size):
    valids = list()
    for q in range(0, len(distance1)):
        if distance1[q] < distance2[q]:
            valids.append(q)
    return valids
Then I learn only from the training examples with indices returned by this filter function:
input1_valid = [input1[q] for q in valids_batch]
input2_valid = [input2[q] for q in valids_batch]
input3_valid = [input3[q] for q in valids_batch]
_, loss_value, summary = sess.run([optimizer, cost, summary_op], feed_dict={x_anchor:input1_valid, x_positive:input2_valid, x_negative:input3_valid})
Where optimizer is defined as:
model1 = siamese_convnet(x_anchor)
model2 = siamese_convnet(x_positive)
model3 = siamese_convnet(x_negative)
d_pos = tf.reduce_sum(tf.square(model1 - model2), 1)
d_neg = tf.reduce_sum(tf.square(model1 - model3), 1)
cost = triplet_loss(d_pos, d_neg)
optimizer = tf.train.AdamOptimizer(learning_rate = 1e-4).minimize( cost )
But something is wrong because accuracy is very low (50%).
What am I doing wrong?
There are a lot of reasons why your network is performing poorly. From what I understand, your triplet generation method is fine. Here are some tips that may help improve your performance.
The model
In deep metric learning, people usually use models pre-trained on the ImageNet classification task, as these models are pretty expressive and can generate good representations for images. You can fine-tune your model on the basis of these pre-trained models, e.g., VGG16, GoogleNet, ResNet.
How to fine-tune
Even if you have a good pre-trained model, it is often difficult to directly optimize the triplet loss using these models on your own dataset. Since these pre-trained models are trained on ImageNet, if your dataset is vastly different from ImageNet, you can first fine-tune the model on a classification task on your dataset. Once your model performs reasonably well on the classification task on your custom dataset, you can use the classification model as the base network (maybe with a little tweaking) for the triplet network. This will often lead to much better performance.
Hyperparameters
Hyperparameters such as the learning rate, momentum, weight_decay, etc. are also extremely important for good performance (the learning rate is maybe the most important factor). Since you are fine-tuning and not training the network from scratch, you should use a small learning rate, for example, lr=0.001 or lr=0.0001. For momentum, 0.9 is a good choice. For weight_decay, people usually use 0.0005 or 0.00005.
If you add some fully connected layers, then for these layers the learning rate may be higher than for the other layers (0.01, for example).
Which layers to fine-tune
As your network has several layers, you need to decide which layers to fine-tune. Researchers have found that the lower layers of a network just produce generic features such as lines or edges. Typically, people will freeze the lower layers and only update the weights of the upper layers, which tend to produce task-oriented features. You should try optimizing starting from different lower layers and see which setting performs best.
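In TF 1.x this is often done by restricting the optimizer's var_list; here is a hedged sketch in which the scope names "lower_layers" and "upper_layers" and the stand-in loss are assumptions for illustration:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 64])

# Assumed scopes: "lower_layers" holds generic features to freeze,
# "upper_layers" holds the task-oriented layers to fine-tune.
with tf.variable_scope("lower_layers"):
    h = tf.layers.dense(x, 32, activation=tf.nn.relu)
with tf.variable_scope("upper_layers"):
    embedding = tf.layers.dense(h, 16)

cost = tf.reduce_mean(tf.square(embedding))  # stand-in for the triplet loss

# Restrict the optimizer to the upper-layer variables; lower layers stay frozen.
upper_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                               scope="upper_layers")
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(
    cost, var_list=upper_vars)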
Reference
Fast R-CNN (Section 4.5, which layers to fine-tune)
Deep Image Retrieval (Section 5.2, influence of fine-tuning the representation)
distance(anchor, positive) < distance(anchor, negative)
This will select triplets in which the similarity between anchor and positive is greater than between anchor and negative; it is the opposite of a hard triplet. You need examples where d(a,p) > d(a,n) for hard triplets. For semi-hard triplets, you need examples that satisfy d(a,p) < d(a,n) < d(a,p) + margin.
Here is the explanation : https://stackoverflow.com/a/49314187/7693521
I hope I am correct about this, if not please correct me.
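A hedged sketch of a semi-hard filter in the style of the question's compute_valids (the default margin of 0.2 is an assumption and should match the margin used in the triplet loss):

def compute_semi_hard_valids(distance1, distance2, margin=0.2):
    # distance1 holds d(anchor, positive), distance2 holds d(anchor, negative).
    # Keep indices where d(a, p) < d(a, n) < d(a, p) + margin (semi-hard triplets).
    valids = []
    for q in range(len(distance1)):
        if distance1[q] < distance2[q] < distance1[q] + margin:
            valids.append(q)
    return valids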