I am using a DenseNet model for one of my projects and am having some difficulties using regularization.
Without any regularization, both validation and training loss (MSE) decrease. The training loss drops faster though, resulting in some overfitting of the final model.
So I decided to use dropout to avoid overfitting. When using Dropout, both validation and training loss decrease to about 0.13 during the first epoch and remain constant for about 10 epochs.
After that both loss functions decrease in the same way as without dropout, resulting in overfitting again. The final loss value is in about the same range as without dropout.
So for me it seems like dropout is not really working.
If I switch to L2 regularization though, I am able to avoid overfitting, but I would rather use dropout as a regularizer.
Now I am wondering if anyone has experienced this kind of behaviour?
I use dropout in both the dense block (bottleneck layer) and in the transition block (dropout rate = 0.5):
def bottleneck_layer(self, x, scope):
    with tf.name_scope(scope):
        x = Batch_Normalization(x, training=self.training, scope=scope+'_batch1')
        x = Relu(x)
        x = conv_layer(x, filter=4 * self.filters, kernel=[1,1], layer_name=scope+'_conv1')
        x = Drop_out(x, rate=dropout_rate, training=self.training)
        x = Batch_Normalization(x, training=self.training, scope=scope+'_batch2')
        x = Relu(x)
        x = conv_layer(x, filter=self.filters, kernel=[3,3], layer_name=scope+'_conv2')
        x = Drop_out(x, rate=dropout_rate, training=self.training)
        return x
def transition_layer(self, x, scope):
    with tf.name_scope(scope):
        x = Batch_Normalization(x, training=self.training, scope=scope+'_batch1')
        x = Relu(x)
        x = conv_layer(x, filter=self.filters, kernel=[1,1], layer_name=scope+'_conv1')
        x = Drop_out(x, rate=dropout_rate, training=self.training)
        x = Average_pooling(x, pool_size=[2,2], stride=2)
        return x
Without any regularization, both validation and training loss (MSE) decrease. The training loss drops faster though, resulting in some overfitting of the final model.
This is not overfitting.
Overfitting starts when your validation loss starts increasing, while your training loss continues decreasing; here is its telltale signature:
The image is adapted from the Wikipedia entry on overfitting - different things may lie on the horizontal axis, e.g. depth or number of boosted trees, number of neural-net fitting iterations, etc.
The (generally expected) difference between training and validation loss is something completely different, called the generalization gap:
An important concept for understanding generalization is the generalization gap, i.e., the difference between a model’s performance on training data and its performance on unseen data drawn from the same distribution.
where, practically speaking, validation data is unseen data indeed.
So for me it seems like dropout is not really working.
It can very well be the case - dropout is not expected to work in every setting or for every problem.
Interesting problem.
I would recommend plotting the validation loss and the training loss to see if it is really overfitting. If you see that the validation loss didn't change while the training loss dropped (you will also probably see a large gap between them), then it is overfitting.
If it is overfitting, then try reducing the number of layers or the number of nodes (and play a little with the dropout rate after you do that). Reducing the number of epochs could also be helpful.
If you would like to use a different method instead of dropout I would recommend using the Gaussian Noise layer.
Keras - https://keras.io/layers/noise/
TensorFlow - https://www.tensorflow.org/api_docs/python/tf/keras/layers/GaussianNoise
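For illustration, a minimal sketch of dropping a GaussianNoise layer into a Keras model; the stddev value and the surrounding layer sizes are arbitrary placeholders, not recommendations:

import tensorflow as tf

# Sketch only: add zero-mean Gaussian noise to the inputs; the layer is active
# during training and a no-op at inference time. Sizes/stddev are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.GaussianNoise(stddev=0.1, input_shape=(32,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')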
Related
My model architecture has two outputs, and I want to train the model with a loss for each output: MSE and cross-entropy. At first, I used two Keras losses:
model1.compile(loss=['mse','sparse_categorical_crossentropy'], metrics = ['mse','accuracy'], optimizer='adam')
It works fine, but the problem is that the cross-entropy loss is very unstable: sometimes it gives 74% accuracy and the next epoch shows 32%. I'm confused as to why.
Now I define a custom loss:
from tensorflow.keras.losses import mean_squared_error, binary_crossentropy

def my_custom_loss(y_true, y_pred):
    mse = mean_squared_error(y_true[0], y_pred[0])
    crossentropy = binary_crossentropy(y_true[1], y_pred[1])
    return mse + crossentropy
But it's not working; the total loss becomes negative.
It is hard to judge the issue based on the information given. A reason might be a batch size that is too small or a learning rate that is too high, making the training unstable. I also wonder why you use sparse_categorical_crossentropy in the top example and binary_crossentropy in the lower one. How many classes do you actually have?
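For illustration only: assuming the two output layers are named (the names 'out_reg' and 'out_cls' below are hypothetical, not from the original code), the two losses can also be attached per output in compile, with optional loss_weights, instead of indexing y_true/y_pred inside a single custom loss:

# Sketch: one loss per named output, plus loss weights; names are assumptions.
model1.compile(
    optimizer='adam',
    loss={'out_reg': 'mse', 'out_cls': 'sparse_categorical_crossentropy'},
    loss_weights={'out_reg': 1.0, 'out_cls': 1.0},
    metrics={'out_reg': 'mse', 'out_cls': 'accuracy'}
)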
I'm building a model that predicts people's age by analyzing their face. I'm using this pretrained model, and made a custom loss function and a custom metric. I obtain decent results, but I want to improve them. In particular, I noticed that after some epochs the model begins to overfit on the training set and then the val_loss increases. How can I avoid this? I'm already using Dropout, but it doesn't seem to be enough.
I think maybe I should use L1 and L2 regularization, but I don't know how.
def resnet_model():
    model = VGGFace(model='resnet50')  # model: {resnet50, vgg16, senet50}
    xl = model.get_layer('avg_pool').output
    x = keras.layers.Flatten(name='flatten')(xl)
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dropout(0.5)(x)
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dropout(0.5)(x)
    x = keras.layers.Dense(11, activation='softmax', name='predictions')(x)
    model = keras.engine.Model(model.input, outputs=x)
    return model
model = resnet_model()
initial_learning_rate = 0.0003
epochs = 20; batch_size = 110
num_steps = train_x.shape[0]//batch_size
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [3*num_steps, 10*num_steps, 16*num_steps, 25*num_steps],
    [1e-4, 1e-5, 1e-6, 1e-7, 5e-7]
)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)
model.compile(loss=custom_loss, optimizer=optimizer, metrics=['accuracy', one_off_accuracy])
model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, validation_data=(test_x, test_y))
This is an example of the results:
There are many regularization methods to help you avoid overfitting your model:
Dropouts:
Randomly disables neurons during the training, in order to force other neurons to be trained as well.
L1/L2 penalties:
Penalizes large weights, discouraging the model from relying too heavily on any single weight. This tries to ensure that all parameters are taken into consideration when classifying an input.
Random Gaussian Noise at the inputs:
Adds random Gaussian noise to the inputs: x = x + r, where r is drawn from a zero-mean normal distribution. This perturbs your model's inputs and helps prevent it from overfitting to your dataset, because in every epoch each input will be slightly different.
Label Smoothing:
Instead of saying that a target is 0 or 1, You can smooth those values (e.g. 0.1 & 0.9).
Early Stopping:
This is quite a common technique for avoiding training your model too long. If you notice that the training loss keeps decreasing while the validation accuracy starts to drop, that is a good sign to stop training, as your model is beginning to overfit.
K-Fold Cross-Validation:
This is a very strong technique, which ensures that your model is not fed all the time with the same inputs and is not overfitting.
Data Augmentations:
By rotating/shifting/zooming/flipping/padding etc. an image, you force your model to learn more robust parameters instead of overfitting to the existing dataset.
I am quite sure there are more techniques to avoid overfitting as well. This repository contains many examples of how the above techniques are applied to a dataset:
https://github.com/kochlisGit/Tensorflow-State-of-the-Art-Neural-Networks
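Below is a minimal Keras sketch, assuming a generic image-classification setup, of how several of the techniques above (Gaussian input noise, an L2 penalty, dropout, label smoothing, and early stopping) could be wired together; all sizes and rates are placeholders, not tuned values:

import tensorflow as tf

# Sketch only: combine input noise, L2 penalty, dropout, label smoothing, early stopping.
model = tf.keras.Sequential([
    tf.keras.layers.GaussianNoise(0.1, input_shape=(32, 32, 3)),        # random input noise
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu',
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 penalty
    tf.keras.layers.Dropout(0.5),                                        # dropout
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1),   # label smoothing
    metrics=['accuracy']
)
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                              restore_best_weights=True)  # early stopping
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])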
You can try incorporating image augmentation in your training, which increases the "sample size" of your data as well as its "diversity", as @Suraj S Jain mentioned. The official tutorial is here: https://www.tensorflow.org/tutorials/images/data_augmentation
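For illustration, a minimal sketch using Keras preprocessing layers similar to those in that tutorial (available as tf.keras.layers in recent TensorFlow versions; the specific transforms and parameters are just examples):

import tensorflow as tf

# Sketch: random flips, rotations and zooms applied on the fly during training.
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])
# Typically placed right after the input, e.g.:
# inputs = tf.keras.Input(shape=(224, 224, 3))
# x = data_augmentation(inputs)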
I trained a simple fully connected network on the CIFAR-10 dataset:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(3*32*32, 300, bias=False)
        self.fc2 = nn.Linear(300, 10, bias=False)

    def forward(self, x):
        x = x.reshape(250, -1)
        self.x2 = F.relu(self.fc1(x))
        x = self.fc2(self.x2)
        return x

def train():
    # The output of torchvision datasets are PILImage images of range [0, 1].
    transform = transforms.Compose([transforms.ToTensor()])
    trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=250, shuffle=True, num_workers=4)
    testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=args.bs, shuffle=False, num_workers=4)

    net = Net()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.02, momentum=0.9, weight_decay=0.0001)

    for epoch in range(20):
        correct = 0
        total = 0
        for data in trainloader:
            inputs, labels = data
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        acc = 100. * correct / total
This network gets to ~50% test accuracy with the parameters specified, after 20 epochs.
Note that I didn't do any whitening of the inputs (no per-channel mean subtraction).
Next I scaled up the model inputs by 255, by replacing outputs = net(inputs) with outputs = net(inputs*255). After this change, the network no longer converges. I looked at the gradients and they seem to grow explosively after just a few iterations, leading to all model outputs being zero. I'd like to understand why this is happening.
Also, I tried scaling down the learning rate by 255. This helps, but the network only gets to ~43% accuracy. Again, I don't understand why this helps, and more importantly why the accuracy is still degraded compared to the original settings.
EDIT: forgot to mention that I don't use biases in this network.
EDIT2: I can recover the original accuracy if I scale down the initial weights in both layers by 255 (in addition to scaling down the learning rate). I also tried scaling down the initial weights only in the first layer, but the network had trouble learning (even when I did scale down the learning rate in both layers). Then I tried scaling down the learning rate only in the first layer - this also didn't help. Finally I tried reducing the learning rate in both layers even more (by 255*255) and this suddenly worked. This does not make sense to me: scaling down the initial weights by the same factor the inputs were scaled up by should have completely eliminated any difference from the original network, since the input to the second layer is then identical. At that point the learning rate should only need to be scaled down in the first layer, but in practice both layers need a significantly lower learning rate...
Scaling up the inputs will lead to exploding gradients because of a few observations:
The learning rate is common to all the weights in a given update step.
Hence, the same scaling factor (i.e. the learning rate) is applied to a given weight's cost derivative regardless of its magnitude, so large and small weights get updated on the same scale.
When the loss landscape is highly erratic, this leads to exploding gradients (like a snowball effect: one overshot update - in, say, the axis of one particular weight - causes another in the opposite direction in the next update, which overshoots again, and so on).
The pixel values range from 0 to 255, so dividing the data by 255 ensures all inputs lie between 0 and 1, which gives smoother convergence because the gradients are uniform with respect to the learning rate. Here, however, you scaled the learning rate instead, which compensates for some of the problems mentioned above but is not as effective as scaling the data itself. Reducing the learning rate also lengthens convergence time, which might be why the model only reaches 43% after 20 epochs; maybe it needs more epochs.
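To illustrate scaling the data rather than the learning rate, here is a sketch of the usual approach in the original PyTorch pipeline; the per-channel mean/std values are approximate, commonly used CIFAR-10 statistics, given only as an example:

import torchvision.transforms as transforms

# Sketch: keep inputs in a small, centered range instead of rescaling the learning rate.
# ToTensor() already maps pixels to [0, 1]; Normalize() additionally subtracts an
# approximate per-channel mean and divides by an approximate per-channel std.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),
                         std=(0.2470, 0.2435, 0.2616)),
])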
Also:
CIFAR-10 is a significant step up from something like the MNIST dataset, hence, fully connected neural networks do not have the representation power needed to accurately predict these images. CNNs are the way to go for any image classification task beyond MNIST. ~50% accuracy is the max you can get with a fully connected neural network unfortunately.
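For a rough sense of what such a model looks like, here is a minimal CNN sketch for 32x32x3 inputs; the architecture and layer sizes are arbitrary illustrations, not a tuned CIFAR-10 baseline:

import torch.nn as nn
import torch.nn.functional as F

# Sketch of a small CNN for 32x32x3 images; sizes are illustrative only.
class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)            # halves spatial resolution
        self.fc1 = nn.Linear(64 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))      # 32x32 -> 16x16
        x = self.pool(F.relu(self.conv2(x)))      # 16x16 -> 8x8
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)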
Maybe decrease the learning rate by a factor of 255 ... just a guess.
I'm working on a short project that involves implementing a character RNN for text generation. My model uses a single LSTM layer with varying units (messing around with between 50 and 500), dropout at a rate of 0.2, and softmax activation. I'm using RMSprop with a learning rate of 0.01.
My issue is that I can't find a good way to characterize the validation loss. I'm using a validation split of 0.3 and I'm finding that the validation loss starts to become constant after only a few epochs (maybe 2-5 or so) while the training loss keeps decreasing. Does validation loss carry much weight in this sort of problem? The purpose of the model is to generate new strings, so quantifying the validation loss with other strings seems... pointless?
It's hard for me to really find the best model, since qualitatively I get the sense that the best model is trained for more epochs than it takes for the validation loss to stop changing, but also for fewer epochs than it takes for the training loss to start increasing. I would really appreciate any advice you have regarding this problem, as well as any general advice about RNNs for text generation, especially regarding dropout and overfitting. Thanks!
This is the code for fitting the model for every epoch. The callback is a custom callback that just prints a few tests. I'm now realizing that history_callback.history['loss'] is probably the training loss isn't it...
for i in range(num_epochs):
    history_callback = model.fit(x, y,
                                 batch_size=128,
                                 epochs=1,
                                 callbacks=[print_callback],
                                 validation_split=0.3)
    loss_history.append(history_callback.history['loss'])
    validation_loss_history.append(history_callback.history['val_loss'])
My intention for this model isn't to replicate sentences from the training data; rather, I'd like to generate sentences from the same distribution I'm training on.
Yes, history_callback.history['loss'] is the training loss and history_callback.history['val_loss'] is the validation loss.
Yes, validation loss carries weight in this sort of problem, because you don't just want to replicate the sentences given during training; you want the model to learn the patterns in the training data and generate new sentences when it sees new data.
From the information you mentioned in the question and from the insights identified from comments (thanks to Brian Bartoldson), it is understood that your model is overfitting. In addition to EarlyStopping and dropout, you can try the below mentioned techniques to mitigate overfitting problem.
3.a. Shuffle the data by using shuffle=True in model.fit. Code is shown below.
3.b. Use recurrent_dropout. For example, if we set recurrent dropout to 0.2 in a recurrent layer (LSTM), 20% of the recurrent (hidden-state) connections are randomly dropped during training for that layer.
3.c. Use regularization. You can try l1 or l1_l2 regularization as well, via the kernel_regularizer, recurrent_regularizer, bias_regularizer and activity_regularizer arguments of the LSTM layer.
Sample code using shuffle, early stopping, recurrent_dropout and regularization is shown below:
import tensorflow as tf
from tensorflow.keras.regularizers import l2
from tensorflow.keras.models import Sequential

model = Sequential()
Regularizer = l2(0.001)

model.add(tf.keras.layers.LSTM(units=50, activation='relu', kernel_regularizer=Regularizer,
                               recurrent_regularizer=Regularizer, bias_regularizer=Regularizer,
                               activity_regularizer=Regularizer, dropout=0.2, recurrent_dropout=0.3))

callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=15)

history_callback = model.fit(x, y,
                             batch_size=128,
                             epochs=1,
                             callbacks=[print_callback, callback],
                             validation_split=0.3, shuffle=True)
Hope this helps. Happy Learning!
When I use the dropout mechanism for an LSTM, the ROUGE score and loss of the no-dropout model are better than those of the model with dropout, so I wonder whether my dropout code is correct. I use TensorFlow 0.12.
cellClass = tf.nn.rnn_cell.LSTMCell
for layer_i in xrange(hps.enc_layers):
    with tf.variable_scope('encoder%d' % layer_i), tf.device(
            self._next_device()):
        # bidirectional rnn cell
        cell_fw = cellClass(
            hps.num_hidden,
            initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=123),
            state_is_tuple=False
        )
        cell_bw = cellClass(
            hps.num_hidden,
            initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=113),
            state_is_tuple=False
        )
        cell_fw = tf.nn.rnn_cell.DropoutWrapper(cell_fw, input_keep_prob=hps.input_dropout, output_keep_prob=hps.output_dropout)
        cell_bw = tf.nn.rnn_cell.DropoutWrapper(cell_bw, input_keep_prob=hps.input_dropout, output_keep_prob=hps.output_dropout)
        (emb_encoder_inputs, fw_state, _) = tf.nn.bidirectional_rnn(
            cell_fw, cell_bw, emb_encoder_inputs, dtype=tf.float32,
            sequence_length=article_lens)

# decoder
cell = cellClass(
    hps.num_hidden,
    initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=113),
    state_is_tuple=False
)
cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=hps.input_dropout, output_keep_prob=hps.output_dropout)
decoder_outputs, self._dec_out_state, self.cur_attns, self.cur_alpha = seq2seq.attention_decoder(
    emb_decoder_inputs, self._dec_in_state, self._enc_top_states,
    cell, num_heads=1, loop_function=loop_function,
    initial_state_attention=initial_state_attention)
When training, I set those keep probabilities to the value I want to use, e.g. 0.5; when computing the loss on the training and validation sets I keep them at 0.5; but in the decoding step I use 1, which does not drop anything. Am I correct?
Almost!
When you calculate accuracy and validation metrics, you need to manually set the keep probability to 1.0 so that you don't actually drop any of your weight values when you are evaluating your network. If you don't do this, you'll essentially miscalculate the value you've trained your network to predict thus far. This could certainly affect your accuracy/validation scores negatively, especially with a 50% dropout rate.
Your dropout layer used in the decoding step is optional and should be experimented with. If you do use it, you'll want to set it to something other than 1.0.
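A minimal sketch of one common way to do this: feed the keep probability at run time so it can be the configured value during training and 1.0 during evaluation/decoding. The names below are illustrative, not from the original code, and use the TF 1.x-style API.

# Sketch (TF 1.x-style API): feed keep_prob so it can differ between phases.
keep_prob = tf.placeholder(tf.float32, shape=[], name='keep_prob')
cell = tf.nn.rnn_cell.DropoutWrapper(cell,
                                     input_keep_prob=keep_prob,
                                     output_keep_prob=keep_prob)
# Training step:              sess.run(train_op, feed_dict={keep_prob: 0.5, ...})
# Evaluation / decoding step: sess.run(loss,     feed_dict={keep_prob: 1.0, ...})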
Just to recap for the passerby: the idea behind dropout is to randomly zero out units across your network during training, to reduce the chance that neurons become erroneously fixed (or whatever term you prefer), which results in overfitting of your network. Remember that, generally speaking, we are trying to approximate, or fit, our networks to a function. Since fitting networks is, in essence, an optimization problem, we have to worry about optimizing to a local minimum (or maximum... depending on which way the picture is oriented in your mind). Dropout is thus a form of regularization that helps us avoid overfitting.
If anyone has any further insights or corrections, please post!