Which kind of regularization to use: L2 regularization or dropout in MultiRNNCell? - tensorflow

I have been working on a project involving a sequence-to-sequence autoencoder for time series forecasting, so I have used tf.contrib.rnn.MultiRNNCell in the encoder and decoder. I am confused about which strategy to use to regularize my seq2seq model. Should I add L2 regularization to the loss, or use DropoutWrapper (tf.contrib.rnn.DropoutWrapper) in the MultiRNNCell? Or can I use both strategies ... L2 for the weights and biases (projection layer) and DropoutWrapper between the cells in the MultiRNNCell?
Thanks in advance :)

You can use both dropout and L2 regularization at the same time, as is commonly done; they are quite different types of regularization. However, I would note that recent literature suggests batch normalization can replace the need for dropout, as noted in the original paper on batch normalization:
https://arxiv.org/abs/1502.03167
From the abstract: "It also acts as a regularizer, in some cases eliminating the need for Dropout."
L2 regularization is typically still applied when batch norm is in use. There's nothing stopping you from applying all three forms of regularization; the statement above only indicates that you might not see an improvement from dropout when batch norm is already in use.
There are generally optimal values for the amount of L2 regularization to apply and for the dropout keep probability. These are hyperparameters that you tune by trial and error or with a hyperparameter search algorithm.
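To make that concrete, here is a minimal sketch of using both at once with the TF 1.x contrib API from your question: DropoutWrapper between the stacked cells and an L2 penalty on the projection-layer variables added to the loss. The names (keep_prob, l2_scale, proj_w, proj_b, base_loss) are illustrative, not from your code:
```
import tensorflow as tf

num_layers, num_units = 2, 128
keep_prob, l2_scale = 0.8, 1e-4  # illustrative hyperparameters to tune

def make_cell():
    cell = tf.contrib.rnn.LSTMCell(num_units)
    # dropout on the output of each cell in the stack
    return tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)

encoder_cell = tf.contrib.rnn.MultiRNNCell([make_cell() for _ in range(num_layers)])

# ... build the seq2seq graph; suppose the projection layer has variables
# proj_w and proj_b and the sequence loss is base_loss (hypothetical names) ...
# l2_penalty = l2_scale * (tf.nn.l2_loss(proj_w) + tf.nn.l2_loss(proj_b))
# total_loss = base_loss + l2_penalty
```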

Related

Keras - Regularization & custom loss [closed]

I have built a custom Keras model which consists of various layers. Since I wanted to add L2 regularization to such layers, I've passed an instance of keras.regularizers.l2 as the argument for the kernel_regularizer parameter of those layers (as an example, see the constructor of keras.layers.Conv2D). Now, if I were to train this model using, say, Keras's implementation of the binary cross-entropy loss (keras.losses.BinaryCrossentropy), I would be sure that the L2 regularization I've specified would be taken into consideration when computing the loss.
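For reference, attaching the regularizer looks something like this (the filter count, kernel size, and L2 factor here are illustrative):
```
from tensorflow import keras

conv = keras.layers.Conv2D(filters=32, kernel_size=3,
                           kernel_regularizer=keras.regularizers.l2(1e-4))
```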
In my case, however, I have a custom loss function that requires several other parameters aside from y_true and y_pred, meaning that there's no way I can pass this function as the argument for the loss parameter of model.compile(...) (in fact, I don't even call model.compile(...)). As a result, I also had to write a custom training loop. In other words, instead of simply running model.fit(...), I had to:
Perform forward propagation by calling model(x)
Compute the loss
Compute the gradients of the loss with respect to the model's weights (that is, model.trainable_variables) with tf.GradientTape
Apply the gradients
Repeat
My question is: in which phase is regularization accounted for?
During forward propagation?
During the computation/application of the gradients?
Keep in mind that my custom loss function does NOT account for regularization, so if it's not accounted for in either of the two phases I've mentioned above, then I'm actually training a model with no regularization whatsoever (even though I've provided a value for the kernel_regularizer argument in each layer that my network is made of). In that case, would I be forced to compute the regularization term by hand and add it to the loss?
Regularization losses are computed on the forward pass of the model, and their gradients are applied on the backward pass. I don't think your training step is applying any weight regularization, and consequently your model isn't regularized. One way to check this is to look at the weights of a trained model: L1 regularization will actually push some weights exactly to 0, producing sparse weights; L2 regularization shrinks the weights toward 0 but rarely zeroes them out, so you should see small (rather than sparse) weights.
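A quick sketch of that check, assuming model is your trained Keras model (the near-zero threshold is illustrative):
```
import numpy as np

for layer in model.layers:
    for w in layer.get_weights():
        frac_near_zero = np.mean(np.abs(w) < 1e-3)
        print(layer.name, f"{frac_near_zero:.1%} of parameters near zero")
```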
This post outlines writing a training loop from scratch in Keras and has a section on model regularization. The author adds the losses from the regularized layers in his training step with the following line:
```
loss += sum(model.losses)
```
I think this may be what you need. If you are still unsure, train one model with the line above in the training loop and another model without it; inspecting the weights of the two trained models will give you some input on whether or not the weight regularization is working as expected.
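Putting that together, here is a minimal sketch of a custom training step that folds model.losses into your custom loss; custom_loss_fn and extra_args stand in for your own function and its additional parameters:
```
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()

def train_step(model, x, y_true, custom_loss_fn, extra_args):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)                     # 1. forward pass
        loss = custom_loss_fn(y_true, y_pred, **extra_args)  # 2. custom data loss
        loss += tf.add_n(model.losses)                       #    add the layers' L2 penalties
    grads = tape.gradient(loss, model.trainable_variables)   # 3. gradients
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # 4. update
    return loss
```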

RNN Text Generation: How to balance training/test loss with validation loss?

I'm working on a short project that involves implementing a character RNN for text generation. My model uses a single LSTM layer with varying units (messing around with between 50 and 500), dropout at a rate of 0.2, and softmax activation. I'm using RMSprop with a learning rate of 0.01.
My issue is that I can't find a good way to characterize the validation loss. I'm using a validation split of 0.3, and I'm finding that the validation loss becomes constant after only a few epochs (maybe 2-5 or so) while the training loss keeps decreasing. Does validation loss carry much weight in this sort of problem? The purpose of the model is to generate new strings, so quantifying the validation loss on other strings seems... pointless?
It's hard for me to really find the best model, since qualitatively I get the sense that the best model is trained for more epochs than it takes for the validation loss to stop changing, but for fewer epochs than it takes for the training loss to start increasing. I would really appreciate any advice you have regarding this problem, as well as any general advice about RNNs for text generation, especially regarding dropout and overfitting. Thanks!
This is the code for fitting the model for every epoch. The callback is a custom callback that just prints a few tests. I'm now realizing that history_callback.history['loss'] is probably the training loss, isn't it...
```
for i in range(num_epochs):
    history_callback = model.fit(x, y,
                                 batch_size=128,
                                 epochs=1,
                                 callbacks=[print_callback],
                                 validation_split=0.3)
    loss_history.append(history_callback.history['loss'])
    validation_loss_history.append(history_callback.history['val_loss'])
```
My intention for this model isn't to replicate sentences from the training data; rather, I'd like to generate sentences from the same distribution that I'm training on.
Yes, history_callback.history['loss'] is the training loss and history_callback.history['val_loss'] is the validation loss.
Yes, validation loss carries weight in this sort of problem, because you don't just want to replicate the sentences seen during training; you want the model to learn the patterns in the training data and generate new sentences from them.
From the information you mentioned in the question and from the insights identified in the comments (thanks to Brian Bartoldson), it is understood that your model is overfitting. In addition to EarlyStopping and dropout, you can try the techniques mentioned below to mitigate the overfitting problem.
3.a. Shuffle the data by using shuffle=True in model.fit. Code is shown below.
3.b. Use recurrent_dropout. For example, setting recurrent_dropout=0.2 on a recurrent layer (LSTM) drops 20% of the units carried in the recurrent state at each time step.
3.c. Use regularization. You can try l1 or l1_l2 regularization as well for the kernel_regularizer, recurrent_regularizer, bias_regularizer, and activity_regularizer arguments of the LSTM layer.
Sample code using shuffle, early stopping, recurrent_dropout, and regularization is shown below:
```
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

model = Sequential()
regularizer = l2(0.001)
model.add(tf.keras.layers.LSTM(units=50, activation='relu',
                               kernel_regularizer=regularizer,
                               recurrent_regularizer=regularizer,
                               bias_regularizer=regularizer,
                               activity_regularizer=regularizer,
                               dropout=0.2, recurrent_dropout=0.3))

callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=15)
history_callback = model.fit(x, y,
                             batch_size=128,
                             epochs=1,
                             callbacks=[print_callback, callback],
                             validation_split=0.3, shuffle=True)
```
Hope this helps. Happy Learning!

Is it relevant to use both batch_norm and dropout in estimator?

I read that batch normalization and dropout are two different ways to avoid overfitting in neural networks. Is it relevant to use both in the same estimator, as follows?
```
model1 = tf.estimator.DNNClassifier(
    feature_columns=feature_columns_complex_standardized,
    hidden_units=[512, 512, 512],
    optimizer=tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9,
                                     beta2=0.99, epsilon=1e-08,
                                     use_locking=False),
    weight_column=weights,
    dropout=0.5,
    activation_fn=tf.nn.softmax,
    n_classes=10,
    label_vocabulary=Action_vocab,
    model_dir='./Models9/Action/',
    loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE,
    config=tf.estimator.RunConfig().replace(save_summary_steps=10),
    batch_norm=True)
```
There is a small problem in your understanding. The original intent of batch normalization is not to reduce overfitting but to speed up training. Just as you normalize the inputs before passing them to the first layer of your network, batch normalization performs the same operation at the inner (hidden) layers. It reduces the effect of internal covariate shift during training.
But since it is applied to each mini-batch separately, it has the side effect of regularizing your weight parameters. This regularizing effect is quite similar to what you would get if you had set out to reduce overfitting directly.
You can apply both batch_norm and dropout together, but it is advisable to reduce the dropout. Your current dropout rate of 0.5 is very high; I believe a dropout of 0.1 to 0.2 should be enough when you apply it together with batch_norm. Also, the dropout value is a hyperparameter, so there is no fixed answer; you may have to tune it for your input data and network.
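For example, here is a sketch of your estimator with that advice applied (some arguments omitted for brevity; the exact rate is something to tune):
```
model1 = tf.estimator.DNNClassifier(
    feature_columns=feature_columns_complex_standardized,
    hidden_units=[512, 512, 512],
    optimizer=tf.train.AdamOptimizer(learning_rate=0.001),
    n_classes=10,
    dropout=0.2,        # reduced from 0.5 since batch norm is also in use
    batch_norm=True)
```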
Both batch normalization and dropout give a regularization effect in one way or another.
Because batch normalization sees all the training examples in the mini-batch together during its normalization step, it reduces internal covariate shift; this speeds up training, removes the need to set the learning rate very low, and also gives a regularization effect.
If batch normalization is used throughout the network, then the dropout regularization can be reduced in strength or dropped entirely.

Weight Decay and input normalization

I'm new to tensorflow and find that the sample CNN programs use weight decay to avoid huge weights, while they do not always normalize the input in the first place.
Does the weight decay serve the same purpose as the input normalization?
What is the difference between them?
Weight decay is a type of regularisation used to control overfitting of the model. Weight decay is more commonly known as L2 regularisation. It is used more commonly in shallow learning algorithms like linear regression and logistic regression. In deep learning (e.g., networks using CNNs), weight decay is not so common; other regularisation methods like dropout are used instead.
Input normalisation, on the other hand, refers to zero-centering your input data and limiting its range. This procedure helps training converge quickly.
There is no general fixed rule on how these two concepts have to be applied, hence you may have seen some variations of them.
Weight decay is a regularisation technique, such as L2 regularisation, that results in gradient descent shrinking the weights on every iteration.
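To see how the two differ, here is a small NumPy sketch (all values illustrative): input normalisation transforms the data once before training, while weight decay shrinks the weights a little on every update:
```
import numpy as np

# Input normalisation: zero-center and scale the data before training.
x = np.random.rand(100, 3) * 50.0
x_norm = (x - x.mean(axis=0)) / x.std(axis=0)

# Weight decay (L2): each gradient step also shrinks the weights toward zero.
w = np.random.randn(3)
grad = np.ones(3)                 # placeholder gradient of the data loss
lr, decay = 0.01, 1e-4
w = w - lr * (grad + decay * w)   # the decay term pulls w toward zero
```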

L2 regularization for the fully connected parameters in CNN

In this TensorFlow example, L2 regularization is applied to the fully connected parameters:
```
regularizers = (tf.nn.l2_loss(fc1_weights) + tf.nn.l2_loss(fc1_biases) +
                tf.nn.l2_loss(fc2_weights) + tf.nn.l2_loss(fc2_biases))
```
What is it? Why are the fully connected parameters used here? And how does it improve performance?
Regularizers in general are terms added to the loss function that prevent the model from over-fitting the training data. They do this by encouraging certain properties in the learned model.
L2 regularization of the parameters, for instance, encourages all the parameters to be small instead of peaky. This in turn encourages the network to pay equal attention to all dimensions of the input vector.
The Wikipedia page is a great introduction to regularization in general, and you can click through to learn in depth about L2 regularization in particular.
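As for how such a term is used: it is typically scaled by a small factor and added to the data loss, so gradient descent trades off fitting the data against keeping the weights small. A sketch, assuming labels and logits come from the surrounding model (the 5e-4 factor is an illustrative hyperparameter):
```
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
loss += 5e-4 * regularizers  # add the scaled L2 penalty to the data loss
```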