Keras Model early stops even though min_delta condition is not achieved - tensorflow

I am training a Keras Sequential model as follows. It is trained on a five-class subset of the MNIST dataset: the input is the flattened 28x28 images and the output is a one-hot encoding of the class they belong to.
model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(784,)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(15, activation='relu'),
    keras.layers.Dense(3, activation='relu'),
    keras.layers.Dense(5, activation='softmax')
])
optzr = keras.optimizers.SGD(learning_rate=0.001, momentum=0.0, nesterov=False)
es = keras.callbacks.EarlyStopping(monitor='loss', min_delta=0.0001, verbose=2)
model.compile(optimizer=optzr, loss='categorical_crossentropy', metrics=['accuracy'])
out = model.fit(xtrain, ytrain, validation_data=(xval, yval), batch_size=32, verbose=2, epochs=20, callbacks=[es])
On running the model, this is what the output is
Epoch 1/20
356/356 - 2s - loss: 1.7157 - accuracy: 0.1894 - val_loss: 1.6104 - val_accuracy: 0.1997 - 2s/epoch - 5ms/step
Epoch 2/20
356/356 - 1s - loss: 1.6094 - accuracy: 0.1946 - val_loss: 1.6102 - val_accuracy: 0.1997 - 1s/epoch - 3ms/step
Epoch 00002: early stopping
Here, even though the loss decreased by more than 0.1, the callback declares the early-stopping condition met and training stops.

You should set patience to 1 in the callback definition. If you don't, it defaults to 0.
es = keras.callbacks.EarlyStopping(monitor='loss', min_delta=1e-4, verbose=2, patience=1)

Keras implements EarlyStopping with an internal counter named wait. The counter increases by one per epoch whenever the monitored quantity does not improve by at least min_delta, and resets to 0 otherwise. Training then stops once wait is greater than or equal to patience:
# Only check after the first epoch.
if self.wait >= self.patience and epoch > 0:
    self.stopped_epoch = epoch
    self.model.stop_training = True
Since patience defaults to 0, self.wait >= self.patience is always True as soon as the first epoch has passed (i.e. as soon as epoch > 0).
To stop as soon as performance stops improving, you actually want to set patience to 1 and not 0.
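To make the mechanism concrete, here is a simplified sketch of the bookkeeping (not the actual Keras source, just the same idea), using the loss values from the run above: with patience=0 the check wait >= patience can never fail once the first epoch has passed, so training stops at the end of epoch 2 no matter how much the loss improved.
# Simplified sketch of EarlyStopping's wait/patience bookkeeping (not the
# actual Keras source). `losses` is the monitored value per epoch.
def stopping_epoch(losses, min_delta=1e-4, patience=0):
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:   # improved by more than min_delta
            best = loss
            wait = 0
        else:
            wait += 1
        if wait >= patience and epoch > 0:   # the check quoted above
            return epoch                     # 0-based epoch at which training stops
    return None                              # no early stop triggered

print(stopping_epoch([1.7157, 1.6094], patience=0))  # 1 -> stops after epoch 2
print(stopping_epoch([1.7157, 1.6094], patience=1))  # None -> keeps training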

Related

What do these numbers mean when training in TensorFlow

Taking the following example:
import tensorflow as tf
data = tf.keras.datasets.mnist
(training_images, training_labels), (val_images, val_labels) = data.load_data()
training_images = training_images / 255.0
val_images = val_images / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(20, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(training_images, training_labels, epochs=20, validation_data=(val_images, val_labels))
The result is something like this:
Epoch 1/20
1875/1875 [==============================] - 4s 2ms/step - loss: 0.4104 - accuracy: 0.8838 - val_loss: 0.2347 - val_accuracy: 0.9304
Where does 1875 come from? What does that number represent? I cannot see where it is coming from; training_images has a shape of 60000x28x28.
1875 is the number of iterations needed to pass over the entire dataset with a batch size of 32:
1875 * 32 = 60k
Epoch: An epoch describes the number of times the algorithm sees the entire dataset. So, each time the algorithm has seen all samples in the dataset, an epoch has completed.
Iteration: An iteration describes the number of times a batch of data passes through the algorithm. In the case of neural networks, that means the forward pass and the backward pass. So, every time you pass a batch of data through the NN, you have completed an iteration.
For more, you can refer to link-1 and link-2.
1875 is the number of steps/batches trained on. For example, with the default batch size of 32, this tells us that you have 60 000 images (plus or minus 31, as the last batch may or may not be full).
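The number shown in the progress bar is simply ceil(num_samples / batch_size); a short check of the arithmetic (nothing here is specific to the model above):
import math

num_samples = 60000   # training_images.shape[0]
batch_size = 32       # Keras' default when fit() is given NumPy arrays
print(math.ceil(num_samples / batch_size))   # 1875

# Passing an explicit batch_size changes the number of steps accordingly, e.g.
# model.fit(training_images, training_labels, batch_size=100, ...)  # -> 600 steps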

tf keras SparseCategoricalCrossentropy and sparse_categorical_accuracy reporting wrong values during training

This is TF 2.3.0. During training, the reported values for the SparseCategoricalCrossentropy loss and sparse_categorical_accuracy seemed way off. I looked through my code but couldn't spot any errors. Here's the code to reproduce:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
x = np.random.randint(0, 255, size=(64, 224, 224, 3)).astype('float32')
y = np.random.randint(0, 3, (64, 1)).astype('int32')
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)
def create_model():
    input_layer = tf.keras.layers.Input(shape=(224, 224, 3), name='img_input')
    x = tf.keras.layers.experimental.preprocessing.Rescaling(1./255, name='rescale_1_over_255')(input_layer)
    base_model = tf.keras.applications.ResNet50(input_tensor=x, weights='imagenet', include_top=False)
    x = tf.keras.layers.GlobalAveragePooling2D(name='global_avg_pool_2d')(base_model.output)
    output = Dense(3, activation='softmax', name='predictions')(x)
    return tf.keras.models.Model(inputs=input_layer, outputs=output)

model = create_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['sparse_categorical_accuracy']
)
model.fit(ds, steps_per_epoch=2, epochs=5)
This is what printed:
Epoch 1/5
2/2 [==============================] - 0s 91ms/step - loss: 1.5160 - sparse_categorical_accuracy: 0.2969
Epoch 2/5
2/2 [==============================] - 0s 85ms/step - loss: 0.0892 - sparse_categorical_accuracy: 1.0000
Epoch 3/5
2/2 [==============================] - 0s 84ms/step - loss: 0.0230 - sparse_categorical_accuracy: 1.0000
Epoch 4/5
2/2 [==============================] - 0s 82ms/step - loss: 0.0109 - sparse_categorical_accuracy: 1.0000
Epoch 5/5
2/2 [==============================] - 0s 82ms/step - loss: 0.0065 - sparse_categorical_accuracy: 1.0000
But if I double-check with model.evaluate and by computing the accuracy "manually":
model.evaluate(ds)
2/2 [==============================] - 0s 25ms/step - loss: 1.2681 - sparse_categorical_accuracy: 0.2188
[1.268101453781128, 0.21875]
y_pred = model.predict(ds)
y_pred = np.argmax(y_pred, axis=-1)
y_pred = y_pred.reshape(-1, 1)
np.sum(y == y_pred)/len(y)
0.21875
The result from model.evaluate(...) agrees with the "manual" check on the metrics. But if you look at the loss/metrics from training, they look way off. It is rather hard to see what's wrong, since no error or exception is ever thrown.
Additionally, I created a very simple case to try to reproduce this, but it actually is not reproducible here. Note that batch_size equals the length of the data, so this isn't mini-batch GD but full-batch GD (to eliminate confusion with mini-batch loss/metrics):
x = np.random.randn(1024, 1).astype('float32')
y = np.random.randint(0, 3, (1024, 1)).astype('int32')
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(1024)
model = Sequential()
model.add(Dense(3, activation='softmax'))
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['sparse_categorical_accuracy']
)
model.fit(ds, epochs=5)
model.evaluate(ds)
As mentioned in my comment, one suspect is the batch norm layer, which I don't have in the case that doesn't reproduce the issue.
You get different results because fit() displays the training loss as the running average of the per-batch losses over the current epoch, and each batch's loss is used to update the model as the epoch proceeds; this can pull the epoch-wise average down. evaluate(), on the other hand, is computed with the model as it is at the end of training, resulting in a different loss. You can check the official Keras FAQ and the related StackOverflow post.
Also, try to increase the learning rate.
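If you want to see the per-epoch discrepancy directly rather than only at the end of training, a small custom callback can re-run evaluate() on the training data after each epoch. This is only a sketch; EvalOnTrainData is a hypothetical helper, not part of Keras, and it assumes the model and ds from the question.
import tensorflow as tf

class EvalOnTrainData(tf.keras.callbacks.Callback):
    """Hypothetical helper: re-evaluate the (now fixed) model on the training
    data after each epoch, so the result can be compared with the running
    average that fit() prints during that epoch."""

    def __init__(self, data):
        super().__init__()
        self.data = data

    def on_epoch_end(self, epoch, logs=None):
        loss, acc = self.model.evaluate(self.data, verbose=0)
        print(f"\nepoch {epoch}: evaluate() on the training data -> "
              f"loss={loss:.4f}, sparse_categorical_accuracy={acc:.4f}")

# usage with the model/dataset from the question:
# model.fit(ds, steps_per_epoch=2, epochs=5, callbacks=[EvalOnTrainData(ds)])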
The big discrepancy seen in the metrics can be explained (or at least partially explained) by the presence of batch norm in the model. Below are two cases: one that does not reproduce the issue and one that does once batch norm is introduced. In both cases, batch_size equals the full length of the data (i.e. full gradient descent, with nothing 'stochastic' about it) to minimize confusion over mini-batch statistics.
Not reproducible:
x = np.random.randn(1024, 1).astype('float32')
y = np.random.randint(0, 3, (1024, 1)).astype('int32')
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(1024)
model = Sequential()
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(3, activation='softmax'))
Reproducible:
from tensorflow.keras.layers import Activation, BatchNormalization
model = Sequential()
model.add(Dense(10))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(10))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(10))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(3, activation='softmax'))
In fact, if you compare model.predict(x) with model(x, training=True), you will see a large difference in y_pred. Also, per the Keras docs, the result depends on what else is in the batch, so the prediction model(x[0:1], training=True) for x[0] will differ from model(x[0:2], training=True), which includes an extra sample.
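A quick way to see this (a sketch assuming the model and x from the reproducible case above):
import numpy as np

p_inference = model.predict(x)                    # uses the batch-norm moving statistics
p_training = model(x, training=True).numpy()      # uses the statistics of this batch
print(np.abs(p_inference - p_training).max())     # non-zero when batch norm is present

# The training=True output even depends on which samples share the batch:
p_single = model(x[0:1], training=True).numpy()
p_pair = model(x[0:2], training=True).numpy()[0:1]
print(np.abs(p_single - p_pair).max())            # also non-zero in general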
It is probably best to go to the Keras docs and the original paper for the details, but I do think you will have to live with this and interpret what you see in the progress bar accordingly. It looks rather fishy if you try to use the training loss/accuracy to see whether you have a bias (not variance) issue. When in doubt, I think we can just run evaluate on the training set once the model has "converged" to a good minimum. I overlooked this detail altogether in my prior work because underfitting (bias) is rare for deep nets, so I went by the validation loss/metrics to determine when to stop training. But I would probably go back to the same model and evaluate it on the training set, just to check that the model has the capacity (no bias issue).

My model's loss value decreases slowly. How can I reduce my loss faster while training?

When I train the model, the loss decreases from 0.9 to 0.5 in 2500 epochs. Is that normal?
My model:
model = Sequential()
model.add(Embedding(vocab_size , emd_dim, weights=[emd_matrix], input_length=maxLen,trainable=False))
model.add(LSTM(256,return_sequences=True,activation="relu",kernel_regularizer=regularizers.l2(0.01),kernel_initializer=keras.initializers.glorot_normal(seed=None)))
model.add(LSTM(256,return_sequences=True,activation="relu",kernel_regularizer=regularizers.l2(0.01),kernel_initializer=keras.initializers.glorot_normal(seed=None)))
model.add(LSTM(256,return_sequences=False,activation="relu",kernel_regularizer=regularizers.l2(0.01),kernel_initializer=keras.initializers.glorot_normal(seed=None)))
model.add(Dense(l_h2i,activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
filepath = "F:/checkpoints/"+modelname+"/lstm-{epoch:02d}-{loss:0.3f}-{acc:0.3f}-{val_loss:0.3f}-{val_acc:0.3f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor="loss", verbose=1, save_best_only=True, mode='min')
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=2, min_lr=0.000001)
print(model.summary())
history = model.fit(X_train_indices, Y_train_oh, batch_size=batch_size,
                    epochs=epochs, validation_split=0.1, shuffle=True,
                    callbacks=[checkpoint, reduce_lr])
Part of the output is shown here:
loss improved from 0.54275 to 0.54272
loss: 0.5427 - acc: 0.8524 - val_loss: 1.1198 - val_acc: 0.7610
loss improved from 0.54272 to 0.54268
loss: 0.5427 - acc: 0.8525 - val_loss: 1.1195 - val_acc: 0.7311
loss improved from 0.54268 to 0.54251
loss: 0.5425 - acc: 0.8519 - val_loss: 1.1218 - val_acc: 0.7420
loss improved from 0.54251 to 0.54249
loss: 0.5425 - acc: 0.8517 - val_loss: 1.1210 - val_acc: 0.7518
Consider updating the ReduceLROnPlateau parameters as in the TensorFlow documentation: factor should be a larger number and patience a smaller one.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                              patience=5, min_lr=0.001)
model.fit(X_train, Y_train, callbacks=[reduce_lr])
Arguments:
monitor: quantity to be monitored.
factor: factor by which the learning rate will be reduced. new_lr = lr * factor
patience: number of epochs with no improvement after which learning rate will be reduced.
verbose: int. 0: quiet, 1: update messages.
mode: one of {auto, min, max}. In min mode, lr will be reduced when the quantity monitored has stopped decreasing; in max mode it will be reduced when the quantity monitored has stopped increasing; in auto mode, the direction is automatically inferred from the name of the monitored quantity.
min_delta: threshold for measuring the new optimum, to only focus on significant changes.
cooldown: number of epochs to wait before resuming normal operation after lr has been reduced.
min_lr: lower bound on the learning rate.
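Putting those arguments together for the model in the question, a fuller configuration might look like the following; the specific values are placeholders to experiment with rather than tuned recommendations.
from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor='loss',    # the quantity the question monitors
                              factor=0.2,        # new_lr = lr * 0.2
                              patience=5,        # epochs with no improvement before reducing
                              min_delta=1e-4,    # ignore negligible improvements
                              cooldown=2,        # epochs to wait after each reduction
                              min_lr=1e-6,       # lower bound on the learning rate
                              mode='min',
                              verbose=1)

history = model.fit(X_train_indices, Y_train_oh,
                    batch_size=batch_size, epochs=epochs,
                    validation_split=0.1, shuffle=True,
                    callbacks=[checkpoint, reduce_lr])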

Is val_sample_weights broken in tf.keras?

I cannot seem to make val_sample_weights work with validation data, using model.fit().
My understanding of val_sample_weights is that it is supposed to do exactly the same as sample_weight, such that the reported loss and val_loss can be directly compared when training with sample_weight.
From my example code below, we see that loss goes to zero when setting sample_weight to zero, but val_loss is unaffected when setting val_sample_weights to zero.
Am I missing or misunderstanding something here, or does this seem like val_sample_weights has no use in tf.keras? Maybe a bug in tf.keras? I'm on Windows, tensorflow version 1.14.0.
Below is a working example if you want to test it yourself:
import tensorflow as tf
import numpy as np
data_size = 100
input_size=3
seed = 1
# synthetic train data
x_train = np.random.rand(data_size, input_size)
y_train = np.random.rand(data_size, 1)
# synthetic test data
x_test = np.random.rand(data_size, input_size)
y_test = np.random.rand(data_size, 1)
tf.random.set_random_seed(seed)
# create model
inputs = tf.keras.layers.Input(shape=(input_size))
pred=tf.keras.layers.Dense(1, activation='sigmoid')(inputs)
model = tf.keras.models.Model(inputs=inputs, outputs=pred)
loss = tf.keras.losses.MSE
metrics = tf.keras.losses.MAE
model.compile(loss=loss , metrics=[metrics], optimizer='adam')
# Make model static, so we can compare it between different scenarios
for layer in model.layers:
    layer.trainable = False
# base model no weights (same result as without class_weights)
model.fit(x=x_train,y=y_train, validation_data=(x_test,y_test))
# which outputs:
#> 100/100 [==============================] - 0s 1ms/sample - loss: 0.0851 - mean_absolute_error: 0.2511 - val_loss: 0.0900 - val_mean_absolute_error: 0.2629
# changing the sample weights to zero; hence loss and val_loss should be zero, and the metrics should be the same
sample_weight_train = np.zeros(100)
sample_weight_val = np.zeros(100)
model.fit(x=x_train,y=y_train,sample_weight=sample_weight_train, validation_data=(x_test,y_test,sample_weight_val))
# which outputs:
#> 100/100 [==============================] - 0s 860us/sample - loss: 0.0000e+00 - mean_absolute_error: 0.2507 - val_loss: 0.0899 - val_mean_absolute_error: 0.2627
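One way to cross-check the weighted validation loss outside of fit() is to pass the same weights to evaluate(), which also accepts a sample_weight argument; this is only a sketch for comparison, not a fix for the fit() behaviour.
# Sketch: compute the weighted validation loss outside of fit() and compare
# it with the val_loss reported above.
weighted_val = model.evaluate(x_test, y_test, sample_weight=sample_weight_val, verbose=0)
print(weighted_val)  # with all-zero weights this should match the zero training loss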

Keras' Resnet finetuning either underfits or overfits: how can I balance the training?

I'm fine-tuning Keras' ResNet, pre-trained on ImageNet, to perform a specific classification task on another dataset of images. My model is structured as follows: ResNet takes the inputs, and on top of ResNet I added my own classifier. In all the experiments I tried, the model either underfitted or overfitted.
I mainly tried two approaches:
Freeze a certain number n of layers nearest the input so they are not updated during training. In particular, ResNet has 175 layers, and I tried n = 0, 10, 30, 50, 80, 175. In all these cases the model underfits, reaching an accuracy of at most 0.75 on the training set and at most 0.51 on the validation set.
Freeze all the batch normalization layers, plus the first n layers (as before), with n = 0, 10, 30, 50. In these cases the model overfits, reaching more than 0.95 accuracy on the training set but around 0.5 on the validation set.
Please note that switching from ResNet to InceptionV3 and freezing 50 layers, I obtain more than 0.95 accuracy on both the validation and test sets.
Here is the main part of my code:
inc_model = ResNet50(weights='imagenet',
                     include_top=False,
                     input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))
print("number of layers:", len(inc_model.layers))  # 175

# Adding custom layers
x = inc_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation="relu")(x)
x = Dropout(0.5)(x)
x = Dense(512, activation="relu")(x)
predictions = Dense(2, activation="softmax")(x)
model_ = Model(inputs=inc_model.input, outputs=predictions)

# fine tuning 1
for layer in inc_model.layers[:30]:
    layer.trainable = False

# fine tuning 2
for layer in inc_model.layers:
    if 'bn' in layer.name:
        layer.trainable = False

# compile the model
model_.compile(optimizer=SGD(lr=0.0001, momentum=0.9),
               loss='categorical_crossentropy',
               metrics=['accuracy'])

checkpointer = ModelCheckpoint(filepath='weights.best.inc.male.resnet.hdf5',
                               verbose=1, save_best_only=True)

hist = model_.fit_generator(train_generator,
                            validation_data=(x_valid, y_valid),
                            steps_per_epoch=TRAINING_SAMPLES/BATCH_SIZE,
                            epochs=NUM_EPOCHS,
                            callbacks=[checkpointer],
                            verbose=1)
Can anyone suggest how to find a stable solution that learns something but doesn't overfit?
EDIT:
The output of the training phase looks like this:
Epoch 1/20
625/625 [==============================] - 2473s 4s/step - loss: 0.6048 - acc: 0.6691 - val_loss: 8.0590 - val_acc: 0.5000
Epoch 00001: val_loss improved from inf to 8.05905, saving model to weights.best.inc.male.resnet.hdf5
Epoch 2/20
625/625 [==============================] - 2432s 4s/step - loss: 0.4445 - acc: 0.7923 - val_loss: 8.0590 - val_acc: 0.5000
Epoch 00002: val_loss did not improve from 8.05905
Epoch 3/20
625/625 [==============================] - 2443s 4s/step - loss: 0.3730 - acc: 0.8407 - val_loss: 8.0590 - val_acc: 0.5000
Epoch 00003: val_loss did not improve from 8.05905
and so on. The validation loss never improves.
You have many choices, but did you try early stopping? You can also try some data augmentation, or test a simpler model.
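A sketch of those suggestions is below; the EarlyStopping settings and augmentation parameters are placeholders to experiment with, not tuned values, and aug_datagen is just a hypothetical name.
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.image import ImageDataGenerator

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

aug_datagen = ImageDataGenerator(rotation_range=15,
                                 width_shift_range=0.1,
                                 height_shift_range=0.1,
                                 horizontal_flip=True)

# build train_generator from aug_datagen as before, then add the callback:
# hist = model_.fit_generator(train_generator, ..., callbacks=[checkpointer, early_stop])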
If you are not using ResNet's preprocessing_function, try using it as shown below:
train_datagen = ImageDataGenerator(dtype='float32', preprocessing_function=tf.keras.applications.resnet.preprocess_input)
test_datagen = ImageDataGenerator(dtype='float32', preprocessing_function=tf.keras.applications.resnet.preprocess_input)
Keep everything else the same.
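As a quick sanity check (a sketch reusing the IMG_HEIGHT and IMG_WIDTH constants from the question), ResNet's preprocess_input is not the same as dividing by 255: it converts RGB to BGR and subtracts the ImageNet channel means, so it should not be combined with a manual rescale.
import numpy as np
import tensorflow as tf

img = np.random.randint(0, 256, size=(1, IMG_HEIGHT, IMG_WIDTH, 3)).astype('float32')
print(tf.keras.applications.resnet.preprocess_input(img.copy())[0, 0, 0])  # BGR, mean-subtracted
print((img / 255.0)[0, 0, 0])                                              # simple rescaling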