Keras LSTM training early stopping: restore best weights - tensorflow

I am trying to train an LSTM network and am using the callbacks module of Keras for early stopping.
Sample code is as below:
callback = tensorflow.keras.callbacks.EarlyStopping(monitor='loss', min_delta=0.0001,
                                                    patience=7, mode='min',
                                                    restore_best_weights=True, verbose=1)

model1 = Sequential()
model1.add(LSTM(64, activation='swish', input_shape=(trainX.shape[1], trainX.shape[2]),
                return_sequences=True))
model1.add(LSTM(128, activation='swish', return_sequences=True))
model1.add(LSTM(64, activation='elu', return_sequences=False))
model1.add(Dropout(0.01))
model1.add(Dense(trainY.shape[1]))
model1.compile(optimizer='adam', loss='mse')
model1.summary()

model1.fit(trainX, trainY, epochs=n_epochs, batch_size=batchsize, verbose=2,
           callbacks=[callback])
However, I feel the restore_best_weights parameter is not working the way I expected it to.
Even though I have set restore_best_weights=True, once early stopping is triggered, the system does not load the weights of the lowest-loss (best) epoch.
See the training progress below:
Epoch 1/9
1250/1250 - 76s - loss: 0.0012 - 76s/epoch - 61ms/step
Epoch 2/9
1250/1250 - 76s - loss: 0.0011 - 76s/epoch - 61ms/step
Epoch 3/9
1250/1250 - 76s - loss: 0.0011 - 76s/epoch - 60ms/step
Epoch 4/9
1250/1250 - 76s - loss: 0.0010 - 76s/epoch - 60ms/step
Epoch 5/9
1250/1250 - 76s - loss: 9.9930e-04 - 76s/epoch - 61ms/step
Epoch 6/9
1250/1250 - 75s - loss: 9.9933e-04 - 75s/epoch - 60ms/step
Epoch 7/9
Restoring model weights from the end of the best epoch: 3.
1250/1250 - 76s - loss: 0.0010 - 76s/epoch - 61ms/step
Epoch 7: early stopping
I would expect the weights of Epoch 5 to be loaded (since it gives the best loss value). But it seems to restore the weights from Epoch 3 (which has a higher loss value) and then train once more without much improvement (the final loss value is 0.0010, which is worse than the loss values in Epochs 5 and 6).
Am I doing something wrong or is my understanding of the restore_best_weights parameter wrong?
Is there a better way of ensuring the best loss optimized weights are selected when early stopping is triggered?
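One common safeguard, regardless of how restore_best_weights behaves, is to also save the best weights to disk with a ModelCheckpoint callback and reload them after training. A minimal sketch using the model above (the filename is arbitrary):

# keep a copy of the best weights on disk as a safety net;
# the monitored quantity matches the setup above (training loss)
checkpoint = tensorflow.keras.callbacks.ModelCheckpoint(
    'best_weights.h5', monitor='loss', mode='min',
    save_best_only=True, save_weights_only=True, verbose=1)

model1.fit(trainX, trainY, epochs=n_epochs, batch_size=batchsize, verbose=2,
           callbacks=[callback, checkpoint])

# after training, explicitly load the best weights seen during the run
model1.load_weights('best_weights.h5')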

Related

Is my CNN model still overfitting? If so, how can I combat it? Is there something wrong with my architecture?

My CNN model kept getting high accuracy/low loss during training and much lower accuracy/higher loss during validation, so I started suspecting that it's overfitting.
I have therefore introduced a few dropout layers as well as some image augmentation. I've also tried monitoring val_loss after each epoch, using ReduceLROnPlateau and EarlyStopping.
Although those measures helped improve validation accuracy a bit, I'm still nowhere close to the desired result and I'm honestly running out of ideas. This is the result I'm obtaining right now:
Epoch 9/30
999/1000 [============================>.] - ETA: 0s - loss: 0.0072 - accuracy: 0.9980
Epoch 9: ReduceLROnPlateau reducing learning rate to 1.500000071246177e-05.
1000/1000 [==============================] - 19s 19ms/step - loss: 0.0072 - accuracy: 0.9980 - val_loss: 2.2994 - val_accuracy: 0.6570 - lr: 1.5000e-04
Epoch 10/30
1000/1000 [==============================] - 19s 19ms/step - loss: 0.0045 - accuracy: 0.9985 - val_loss: 2.2451 - val_accuracy: 0.6560 - lr: 1.5000e-05
Epoch 11/30
1000/1000 [==============================] - 19s 19ms/step - loss: 0.0026 - accuracy: 0.9995 - val_loss: 2.6080 - val_accuracy: 0.6540 - lr: 1.5000e-05
Epoch 12/30
1000/1000 [==============================] - 19s 19ms/step - loss: 0.0018 - accuracy: 1.0000 - val_loss: 2.8192 - val_accuracy: 0.6560 - lr: 1.5000e-05
Epoch 13/30
1000/1000 [==============================] - 19s 19ms/step - loss: 0.0013 - accuracy: 1.0000 - val_loss: 2.8216 - val_accuracy: 0.6570 - lr: 1.5000e-05
32/32 [==============================] - 1s 23ms/step - loss: 2.8216 - accuracy: 0.6570
Am I wrong to assume that overfitting is still the problem that prevents my model from scoring high on validation and test data?
Or is there something fundamentally wrong with my architecture?
#prevent overfitting, generalize better
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.2)
])

model = tf.keras.models.Sequential()
model.add(data_augmentation)
#same padding, since edges of the pictures often contain valuable information
model.add(layers.Conv2D(64, (3,3), strides=(1,1), padding='same', activation='relu', input_shape=(64,64,3)))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Dropout(0.25))
model.add(layers.Conv2D(32, (3,3), strides=(1,1), padding='same', activation='relu'))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Dropout(0.25))
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
#prevent overfitting
model.add(layers.Dropout(0.25))
#4 output classes, softmax since we want probabilities for each class (they have to sum to 1)
model.add(layers.Dense(4, activation='softmax'))
#labels are not one-hot encoded, therefore sparse categorical crossentropy
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.Adam(learning_rate=0.00015),
              metrics=['accuracy'])
Try using the code below. I would add a BatchNormalization layer right after the Flatten layer:
model.add(layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001))
For the Dense layer, add regularizers:
model.add(layers.Dense(128, kernel_regularizer=regularizers.l2(l=0.016),
                       activity_regularizer=regularizers.l1(0.006),
                       bias_regularizer=regularizers.l1(0.006), activation='relu'))
Also, I suggest you use an adjustable learning rate via the Keras callback ReduceLROnPlateau (documentation is here). My recommended code for that is shown below:
rlronp = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.4,
                                              patience=2, verbose=1, mode="auto")
I also recommend you use the Keras callback EarlyStopping (documentation is here). My recommended code for that is below:
estop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=4,
                                         verbose=1, mode="auto",
                                         restore_best_weights=True)
Before you fit the model, include the code below:
callbacks = [rlronp, estop]
and in model.fit include callbacks=callbacks.
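Putting it all together, the fit call would look something like the sketch below (x_train, y_train, the validation split, the batch size, and the epoch count are placeholders for your own data and settings):

callbacks = [rlronp, estop]

history = model.fit(x_train, y_train,
                    validation_split=0.2,  # both callbacks monitor val_loss, so validation data is required
                    epochs=30,
                    batch_size=32,
                    callbacks=callbacks,
                    verbose=1)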
You can try to add a regularizer to all or some of your layers, for example:
model.add(layers.Conv2D(32, (3,3), strides=(1,1), kernel_regularizer='l1_l2', padding='same', activation='relu'))
You could try to replace Dropout with SpatialDropout2D between the conv layers. You could also try more image augmentation, maybe GaussianNoise, RandomContrast, or RandomBrightness.
Since you have a very high training accuracy, you could also try to simplify your model (fewer units, for example).
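A sketch of those suggestions applied to the model from the question; the specific layers, rates, and the smaller Dense head are illustrative choices, not tuned values:

data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.2),
    layers.RandomContrast(0.2),   # extra augmentation
    layers.GaussianNoise(0.05),   # extra augmentation, only active during training
])

model = tf.keras.models.Sequential()
model.add(data_augmentation)
model.add(layers.Conv2D(64, (3,3), padding='same', activation='relu', input_shape=(64,64,3)))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.SpatialDropout2D(0.25))  # drops whole feature maps instead of single units
model.add(layers.Conv2D(32, (3,3), padding='same', activation='relu', kernel_regularizer='l1_l2'))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.SpatialDropout2D(0.25))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))  # smaller head, since training accuracy is already near 1.0
model.add(layers.Dropout(0.25))
model.add(layers.Dense(4, activation='softmax'))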

How do you add additional layers to a TensorFlow Neural Network?

How do you add additional layers to a TensorFlow neural network, and how do you know whether the additional layers will cause overfitting? It seems that 2 layers won't be very helpful; however, they did give me 91% accuracy, and I wanted 100% accuracy. So I wanted to add 5 to 10 additional layers and try to "overfit" the neural network. Would an overfit model always give 100% accuracy on the training set?
The basic building block of a neural network is the layer.
I'm using the model example from https://www.tensorflow.org/tutorials/keras/classification
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.fit(train_images, train_labels, epochs=10)
The first layer in this network transforms the format of the images from a two-dimensional array (of 28 by 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data.
In this example, after the pixels are flattened, the network consists of a sequence of two tf.keras.layers.Dense (fully connected) layers. The first Dense layer has 128 nodes (or neurons). The second (and last) layer returns an array of length 10.
QUESTION: I wanted to start by adding ONE additional layer and then overfit with, say, 5 layers. How do I manually add an additional layer and fit it? Can I specify 5 additional layers without having to write out each layer? What's a typical estimate for "overfitting" on an image dataset of a given size, say 30x30 pixels?
Adding one additional layer gave me the same accuracy.
Epoch 1/10
1875/1875 [==============================] - 9s 5ms/step - loss: 0.4866 - accuracy: 0.8266
Epoch 2/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3619 - accuracy: 0.8680
Epoch 3/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3278 - accuracy: 0.8785
Epoch 4/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3045 - accuracy: 0.8874
Epoch 5/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2885 - accuracy: 0.8929
Epoch 6/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2727 - accuracy: 0.8980
Epoch 7/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2597 - accuracy: 0.9014
Epoch 8/10
1875/1875 [==============================] - 9s 5ms/step - loss: 0.2475 - accuracy: 0.9061
Epoch 9/10
1875/1875 [==============================] - 9s 5ms/step - loss: 0.2386 - accuracy: 0.9099
Epoch 10/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.2300 - accuracy: 0.9125
You can add layers to a neural network as follows.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
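If you would rather not write out every additional layer by hand, you can also add several identical layers in a loop; a sketch (the count and width are arbitrary):

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))

num_hidden_layers = 5  # add 5 hidden layers without listing each one
for _ in range(num_hidden_layers):
    model.add(tf.keras.layers.Dense(128, activation='relu'))

model.add(tf.keras.layers.Dense(10))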
Overfitting:
Overfitting occurs when a model fits too closely to its training data. When this happens, the algorithm cannot perform accurately on unseen data. For example, if our model reached 95% accuracy on the training set but only 45% accuracy on the test set, then the model is overfitting to its training data. Overfitting does not always give 100% accuracy on the training set.
It can be identified by checking validation metrics such as accuracy and loss. When the model is affected by overfitting, the validation metrics typically improve up to a point, after which they stagnate or start to decline. The model searches for a good fit during the upward trend, and once it has found one, the trend stagnates or declines. For more information, please refer to this.
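One practical way to check this is to pass validation data to fit and plot the training and validation curves from the returned history; a sketch (the validation data here is just a placeholder for whatever held-out set you have):

import matplotlib.pyplot as plt

history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))

plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.legend()
plt.show()
# If the training loss keeps falling while the validation loss rises,
# the model is overfitting.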

Keras: Validation accuracy stays the exact same but validation loss decreases

I know that the problem can't be with the dataset because I've seen other projects use the same dataset.
Here is my data preprocessing code:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

dataset = pd.read_csv('political_tweets.csv')
dataset.head()
dataset = pd.read_csv('political_tweets.csv')["tweet"].values
y_train = pd.read_csv('political_tweets.csv')["dem_or_rep"].values

x_train, x_test, y_train, y_test = train_test_split(dataset, y_train, test_size=0.1)

max_words = 10000
print(max_words)
max_len = 25

tokenizer = Tokenizer(num_words=max_words, filters='!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~\t\n1234567890',
                      lower=False, oov_token="<OOV>")
tokenizer.fit_on_texts(x_train)
x_train = tokenizer.texts_to_sequences(x_train)
x_train = pad_sequences(x_train, max_len, padding='post', truncating='post')

tokenizer.fit_on_texts(x_test)
x_test = tokenizer.texts_to_sequences(x_test)
x_test = pad_sequences(x_test, max_len, padding='post', truncating='post')
And my model:
model = Sequential([
    Embedding(max_words + 1, 64, input_length=max_len),
    Bidirectional(GRU(64, return_sequences=True), merge_mode='concat'),
    GlobalMaxPooling1D(),
    Dense(64, kernel_regularizer=regularizers.l2(0.02)),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
model.summary()
model.compile(loss='binary_crossentropy', optimizer=RMSprop(learning_rate=0.0001), metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=500, verbose=1, shuffle=True,
          validation_data=(x_test, y_test))
Both of my losses decrease, my training accuracy increases, but the validation accuracy stays at 50% (which is awful considering I am doing a binary classification model).
Epoch 1/500
546/546 [==============================] - 35s 64ms/step - loss: 1.7385 - accuracy: 0.5102 - val_loss: 1.2458 - val_accuracy: 0.5102
Epoch 2/500
546/546 [==============================] - 34s 62ms/step - loss: 0.9746 - accuracy: 0.5137 - val_loss: 0.7886 - val_accuracy: 0.5102
Epoch 3/500
546/546 [==============================] - 34s 62ms/step - loss: 0.7235 - accuracy: 0.5135 - val_loss: 0.6943 - val_accuracy: 0.5102
Epoch 4/500
546/546 [==============================] - 34s 62ms/step - loss: 0.6929 - accuracy: 0.5135 - val_loss: 0.6930 - val_accuracy: 0.5102
Epoch 5/500
546/546 [==============================] - 34s 62ms/step - loss: 0.6928 - accuracy: 0.5135 - val_loss: 0.6931 - val_accuracy: 0.5102
Epoch 6/500
546/546 [==============================] - 34s 62ms/step - loss: 0.6927 - accuracy: 0.5135 - val_loss: 0.6931 - val_accuracy: 0.5102
Epoch 7/500
546/546 [==============================] - 37s 68ms/step - loss: 0.6925 - accuracy: 0.5136 - val_loss: 0.6932 - val_accuracy: 0.5106
Epoch 8/500
546/546 [==============================] - 34s 63ms/step - loss: 0.6892 - accuracy: 0.5403 - val_loss: 0.6958 - val_accuracy: 0.5097
Epoch 9/500
546/546 [==============================] - 35s 63ms/step - loss: 0.6815 - accuracy: 0.5633 - val_loss: 0.7013 - val_accuracy: 0.5116
Epoch 10/500
546/546 [==============================] - 34s 63ms/step - loss: 0.6747 - accuracy: 0.5799 - val_loss: 0.7096 - val_accuracy: 0.5055
I've seen other posts on this topic and they say to add dropout, crossentropy, decrease the learning rate, etc. I have done all of this and none of it works.
Any help is greatly appreciated.
Thanks in advance!
A couple of observations for your problem:
Though not particularly familiar with the dataset, I trust that it is used in many circumstances without problems. However, you could check whether it is balanced. train_test_split() has a parameter called stratify which, if fed the y, ensures that each class is represented proportionally in both the training set and the test set (a short sketch follows this list).
Your phenomenon with validation loss and validation accuracy is not out of the ordinary. Imagine that in the first epochs, the neural network assigns 55% confidence to some ground-truth positive examples (GT == 1). As training advances, the network learns better and becomes 90% confident for a ground-truth positive example. Since the threshold for calculating accuracy is 50%, in both situations you have the same accuracy. Nevertheless, the loss has changed significantly, since 90% >> 55%.
Your training seems to advance (slowly but surely). Have you considered using Adam as an off-the-shelf optimizer?
If the low accuracy is maintained over some epochs, you may very well be suffering from a well-known phenomenon called underfitting, in which your model is unable to capture the dependencies in your data. To mitigate or avoid underfitting altogether, you may want to use a more complex model, for example two stacked LSTMs / GRUs.
At this stage, remove the Dropout() layer, since you have underfitting, not overfitting.
Decrease the batch_size. A very big batch_size can lead to local minima, rendering your network unable to properly learn/generalize.
If none of these work, try starting with a lower learning rate, say 0.00001 instead of 0.0001.
Reiterate over the dataset preprocessing steps. Ensure the sentences are converted properly.
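A minimal sketch of the stratify and Adam suggestions, reusing the variables from the question (the learning rate is just a starting point):

from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam

# stratify keeps the class proportions identical in the train and test splits
x_train, x_test, y_train, y_test = train_test_split(dataset, y_train,
                                                    test_size=0.1, stratify=y_train)

# Adam as an off-the-shelf optimizer instead of RMSprop
model.compile(loss='binary_crossentropy',
              optimizer=Adam(learning_rate=1e-4),
              metrics=['accuracy'])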
I have had a similar issue, and I think it might be because the dropout is right before the output layer. Try moving it one layer earlier, as sketched below.
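A sketch of that rearrangement on the model from the question; only the position of the Dropout layer changes, everything else stays as posted:

model = Sequential([
    Embedding(max_words + 1, 64, input_length=max_len),
    Bidirectional(GRU(64, return_sequences=True), merge_mode='concat'),
    GlobalMaxPooling1D(),
    Dropout(0.5),   # moved one layer earlier, away from the output layer
    Dense(64, kernel_regularizer=regularizers.l2(0.02)),
    Dense(1, activation='sigmoid'),
])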

Make accuracy appear in my results and interpret the results of the loss and the val_loss

I'm new to TensorFlow and I'm trying to learn it from examples on GitHub. I found an example, but the loss and val_loss results are greater than 1 (you can see below that the results are between 700 and 800), while in other examples the loss and val_loss are normally between 0 and 1.
In addition, I would like to know how to make the accuracy appear.
This is the code:
https://github.com/simoninithomas/DNN-Speech-Recognizer/blob/master/train_utils.py
Thank you!
def train_model(input_to_softmax,
                pickle_path,
                save_model_path,
                train_json='train_corpus.json',
                valid_json='valid_corpus.json',
                minibatch_size=20,
                spectrogram=True,
                mfcc_dim=13,
                optimizer=SGD(lr=0.02, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5),
                epochs=20,
                verbose=1,
                sort_by_duration=False,
                max_duration=10.0):
    # create a class instance for obtaining batches of data
    audio_gen = AudioGenerator(minibatch_size=minibatch_size,
                               spectrogram=spectrogram, mfcc_dim=mfcc_dim,
                               max_duration=max_duration,
                               sort_by_duration=sort_by_duration)
    # add the training data to the generator
    audio_gen.load_train_data(train_json)
    audio_gen.load_validation_data(valid_json)
    # calculate steps_per_epoch
    num_train_examples = len(audio_gen.train_audio_paths)
    steps_per_epoch = num_train_examples // minibatch_size
    # calculate validation_steps
    num_valid_samples = len(audio_gen.valid_audio_paths)
    validation_steps = num_valid_samples // minibatch_size
    # add CTC loss to the NN specified in input_to_softmax
    model = add_ctc_loss(input_to_softmax)
    # CTC loss is implemented elsewhere, so use a dummy lambda function for the loss
    model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=optimizer)
    # make results/ directory, if necessary
    if not os.path.exists('results'):
        os.makedirs('results')
    # add checkpointer
    checkpointer = ModelCheckpoint(filepath='results/' + save_model_path, verbose=0)
    # train the model
    hist = model.fit_generator(generator=audio_gen.next_train(),
                               steps_per_epoch=steps_per_epoch,
                               epochs=epochs,
                               validation_data=audio_gen.next_valid(),
                               validation_steps=validation_steps,
                               callbacks=[checkpointer], verbose=verbose)
    # save model loss
    with open('results/' + pickle_path, 'wb') as f:
        pickle.dump(hist.history, f)
Epoch 1/20
106/106 [==============================] - 302s - loss: 839.6881 - val_loss: 744.7609
Epoch 2/20
106/106 [==============================] - 276s - loss: 767.3973 - val_loss: 727.8361
Epoch 3/20
106/106 [==============================] - 272s - loss: 752.6904 - val_loss: 720.8375
Epoch 4/20
106/106 [==============================] - 261s - loss: 751.8432 - val_loss: 728.3446
Epoch 5/20
106/106 [==============================] - 261s - loss: 752.1302 - val_loss: 733.3166
Epoch 6/20
106/106 [==============================] - 264s - loss: 752.3786 - val_loss: 722.4345
Epoch 7/20
106/106 [==============================] - 265s - loss: 752.7827 - val_loss: 723.2651
Epoch 8/20
106/106 [==============================] - 263s - loss: 752.5077 - val_loss: 736.0229
Epoch 9/20
106/106 [==============================] - 263s - loss: 752.5616 - val_loss: 731.2018
The loss that you are using is described in this pdf.
When you say accuracy, it could mean a lot of things:
Single-unit accuracy (averaged over the labels you have; note that you have multiple labels for the same data point, since this is a temporal classification). [Would be between 0 and 1]
Error rate: could be defined as the edit distance between the predicted labels and the true labels. [Would be between 0 and MAX_LABELS, averaged over data points.]
Precision of labels, averaged over all timesteps and data points.
There is no reason for it to be between 0 and 1.
On the other hand, your loss is a connectionist temporal classification (CTC) loss. This loss predicts either a label or a blank label at every timestep, and then cross-entropy is applied on top of the labels. The cross-entropy of two probability distributions is only constrained to be positive; it is not restricted to lie between 0 and 1.
Therefore this is not an issue. If you would like to see accuracies, you will have to take some test data and make a prediction. You can then compute the accuracy (as defined above) against the expected labels using whatever metric you want, and use that as your accuracy. Technically you can use any metric defined in TensorFlow (https://www.tensorflow.org/api_docs/python/tf/metrics/) after your prediction step.
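If you want a concrete number, one option is to decode the network's output and compare it to the ground truth with an edit distance. The sketch below is an assumption about how such a metric could be computed in TensorFlow after the prediction step (logits, seq_len, and y_true_sparse are placeholder names, not variables from the code above):

import tensorflow as tf

# logits: [max_time, batch_size, num_classes] output of the acoustic model
# seq_len: [batch_size] length of each input sequence
# y_true_sparse: tf.SparseTensor (int32) holding the true label sequences
decoded, _ = tf.nn.ctc_greedy_decoder(logits, sequence_length=seq_len)
hypothesis = tf.cast(decoded[0], tf.int32)

# normalized edit distance: 0.0 means a perfect match with the ground truth
label_error_rate = tf.reduce_mean(
    tf.edit_distance(hypothesis, y_true_sparse, normalize=True))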

Keras BatchNormalization, differing results in training and evaluation on training dataset

I'm training a CNN; for the sake of debugging my problem I am working on a small subset of the actual training data.
During training the loss and accuracy seem very reasonable and pretty good. (In the example I used the same small subset for validation; the problem already shows up there.)
Fit on x_train and validate on x_train, using batch_size=32
Epoch 10/10
1/10 [==>...........................] - ETA: 2s - loss: 0.5126 - acc: 0.7778
2/10 [=====>........................] - ETA: 1s - loss: 0.3873 - acc: 0.8576
3/10 [========>.....................] - ETA: 1s - loss: 0.3447 - acc: 0.8634
4/10 [===========>..................] - ETA: 1s - loss: 0.3320 - acc: 0.8741
5/10 [==============>...............] - ETA: 0s - loss: 0.3291 - acc: 0.8868
6/10 [=================>............] - ETA: 0s - loss: 0.3485 - acc: 0.8848
7/10 [====================>.........] - ETA: 0s - loss: 0.3358 - acc: 0.8879
8/10 [=======================>......] - ETA: 0s - loss: 0.3315 - acc: 0.8863
9/10 [==========================>...] - ETA: 0s - loss: 0.3215 - acc: 0.8885
10/10 [==============================] - 3s - loss: 0.3106 - acc: 0.8863 - val_loss: 1.5021 - val_acc: 0.2707
When I evaluate on the same training dataset, however, the accuracy is really far off from what I saw during training (I would expect it to be at least as good as during training on the same dataset).
When evaluating straightforwardly, or using
K.set_learning_phase(0)
I get, similar to the validation (Evaluating on x_train using batch_size=32):
Evaluation Accuracy: 0.266318537392, Loss: 1.50756853772
If I set the backend to the learning phase, the results get pretty good again, so the per-batch normalization seems to work well. I suspect that the accumulated mean and variance are not being used properly.
So after
K.set_learning_phase(1)
I get (Evaluating on x_train using batch_size=32):
Evaluation Accuracy: 0.887728457507, Loss: 0.335956037511
I added the BatchNormalization layer after the first convolutional layer like this:
model = models.Sequential()
model.add(Conv2D(80, first_conv_size, strides=2, activation="relu", input_shape=input_shape, padding=padding_name))
model.add(BatchNormalization(axis=-1))
model.add(MaxPooling2D(first_max_pool_size, strides=4, padding=padding_name))
...
Further down the line I would also have some dropout layers, which I removed to investigate the BatchNormalization behavior. My intent is to use the model in the non-training phase for normal prediction.
Shouldn't it work like that, or am I missing some additional configuration?
Thanks!
I'm using keras 2.0.8 with tensorflow 1.1.0 (anaconda)
This is really annoying. When you set the learning_phase to True, a BatchNormalization layer gets its normalization statistics straight from the data, which might be a problem when you have a small batch_size. I came across a similar issue some time ago, and here is my solution:
When building a model, add an option for whether the model will predict in the learning phase or the non-learning phase, and in the model used in the learning phase use the following class instead of BatchNormalization:
class NonTrainableBatchNormalization(BatchNormalization):
    """
    This class makes it possible to freeze batch normalization while Keras
    is in the training phase.
    """
    def call(self, inputs, training=None):
        return super(
            NonTrainableBatchNormalization, self).call(inputs, training=False)
Once you have trained your model, copy its weights into the NonTrainable copy:
learning_phase_model.set_weights(learned_model.get_weights())
Now you can fully enjoy using BatchNormalization in a learning_phase.
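For concreteness, a sketch of the two-model workflow described above, reusing the layer arguments from the question (build_model and the freeze_bn flag are illustrative names, not part of the original answer):

def build_model(freeze_bn=False):
    # choose the frozen or the standard BatchNormalization implementation
    BN = NonTrainableBatchNormalization if freeze_bn else BatchNormalization
    model = models.Sequential()
    model.add(Conv2D(80, first_conv_size, strides=2, activation="relu",
                     input_shape=input_shape, padding=padding_name))
    model.add(BN(axis=-1))
    model.add(MaxPooling2D(first_max_pool_size, strides=4, padding=padding_name))
    # ... rest of the architecture as in the question ...
    return model

learned_model = build_model(freeze_bn=False)        # train this one normally
learning_phase_model = build_model(freeze_bn=True)  # use this one while learning_phase == 1

# after training learned_model:
learning_phase_model.set_weights(learned_model.get_weights())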