I am training a classifier on cats-vs-dogs data. The model is a minor variant of ResNet18 and returns softmax probabilities over the classes. However, I notice the validation loss is frequently NaN, whereas the training loss decreases steadily and behaves as expected. Training and validation accuracy both increase epoch by epoch.
Epoch 1/15
312/312 [==============================] - 1372s 4s/step - loss: 0.7849 - accuracy: 0.5131 - val_loss: nan - val_accuracy: 0.5343
Epoch 2/15
312/312 [==============================] - 1372s 4s/step - loss: 0.6966 - accuracy: 0.5539 - val_loss: 13989871201999266517090304.0000 - val_accuracy: 0.5619
Epoch 3/15
312/312 [==============================] - 1373s 4s/step - loss: 0.6570 - accuracy: 0.6077 - val_loss: 747123703808.0000 - val_accuracy: 0.5679
Epoch 4/15
312/312 [==============================] - 1372s 4s/step - loss: 0.6180 - accuracy: 0.6483 - val_loss: nan - val_accuracy: 0.6747
Epoch 5/15
312/312 [==============================] - 1373s 4s/step - loss: 0.5838 - accuracy: 0.6852 - val_loss: nan - val_accuracy: 0.6240
Epoch 6/15
312/312 [==============================] - 1372s 4s/step - loss: 0.5338 - accuracy: 0.7301 - val_loss: 31236203781405710523301888.0000 - val_accuracy: 0.7590
Epoch 7/15
312/312 [==============================] - 1373s 4s/step - loss: 0.4872 - accuracy: 0.7646 - val_loss: 52170.8672 - val_accuracy: 0.7378
Epoch 8/15
312/312 [==============================] - 1372s 4s/step - loss: 0.4385 - accuracy: 0.7928 - val_loss: 2130819335420217655296.0000 - val_accuracy: 0.8101
Epoch 9/15
312/312 [==============================] - 1373s 4s/step - loss: 0.3966 - accuracy: 0.8206 - val_loss: 116842888.0000 - val_accuracy: 0.7857
Epoch 10/15
312/312 [==============================] - 1372s 4s/step - loss: 0.3643 - accuracy: 0.8391 - val_loss: nan - val_accuracy: 0.8199
Epoch 11/15
312/312 [==============================] - 1373s 4s/step - loss: 0.3285 - accuracy: 0.8557 - val_loss: 788904.2500 - val_accuracy: 0.8438
Epoch 12/15
312/312 [==============================] - 1372s 4s/step - loss: 0.3029 - accuracy: 0.8670 - val_loss: nan - val_accuracy: 0.8245
Epoch 13/15
312/312 [==============================] - 1373s 4s/step - loss: 0.2857 - accuracy: 0.8781 - val_loss: 121907.8594 - val_accuracy: 0.8444
Epoch 14/15
312/312 [==============================] - 1373s 4s/step - loss: 0.2585 - accuracy: 0.8891 - val_loss: nan - val_accuracy: 0.8674
Epoch 15/15
312/312 [==============================] - 1374s 4s/step - loss: 0.2430 - accuracy: 0.8965 - val_loss: 822.7968 - val_accuracy: 0.8776
I checked for the following:
Infinity/NaN in the validation data
Infinity/NaN introduced while normalizing the data (using tf.keras.applications.resnet.preprocess_input)
Whether the model predicts only one class, which could make the loss function behave oddly
Training code for reference:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-3)
model = Resnet18(NUM_CLASSES=NUM_CLASSES)  # variant of the original model
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    train_dataset,
    steps_per_epoch=len(X_train) // BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=valid_dataset,
    validation_steps=len(X_valid) // BATCH_SIZE,
    verbose=1,
)
The most relevant answer I found was the last paragraph of the accepted answer here. However, that doesn't seem to apply: the validation loss diverges from the training loss by orders of magnitude and returns NaN. It looks like the loss function is misbehaving.
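For reference, one common mechanism for inf/NaN cross-entropy values (the actual root cause in this question is not confirmed) is a softmax probability saturating to exactly 0 for the true class, making the loss take log(0). Keras's built-in categorical_crossentropy guards against this by clipping probabilities; a minimal sketch of that guard:

```python
import math

def crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predicted probabilities away from 0 and 1 before taking the log,
    # mirroring the epsilon clipping Keras applies internally.
    clipped = [min(max(p, eps), 1.0 - eps) for p in y_pred]
    return -sum(t * math.log(p) for t, p in zip(y_true, clipped))

# A saturated prediction: the true class received probability 0.0.
# Without clipping this would be -log(0) = inf; with it, large but finite.
print(crossentropy([1.0, 0.0], [0.0, 1.0]))  # ~16.12
```

A custom loss, a float16 pipeline, or logits accidentally treated as probabilities can bypass this clipping, and a single saturated validation batch then yields the huge or NaN values seen in the log above.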
I am monitoring the Keras metric val_recall. It has been improving, but the checkpoint keeps the best value at the lowest one, 0.9958, even though better values such as 0.9978 and 0.9985 have been recorded. The monitor mode is set to 'auto'.
Please help me understand why Keras thinks the metric is not improving.
Epoch 1/10
6883/6883 [==============================] - 1982s 287ms/step - loss: 0.1025 - recall: 0.9738 - accuracy: 0.9631 - val_loss: 0.0537 - val_recall: 0.9978 - val_accuracy: 0.9837
Epoch 00001: val_recall improved from inf to 0.99783, saving model to /content/drive/MyDrive/home/repository/mon/kaggle/toxic_comment_classification/toxicity_classification_2021JUL10_1647/model_Ctoxic_B32_L256/model.h5
Epoch 2/10
6883/6883 [==============================] - 1970s 286ms/step - loss: 0.0348 - recall: 0.9946 - accuracy: 0.9901 - val_loss: 0.0412 - val_recall: 0.9958 - val_accuracy: 0.9888
Epoch 00002: val_recall improved from 0.99783 to 0.99583, saving model to /content/drive/MyDrive/home/repository/mon/kaggle/toxic_comment_classification/toxicity_classification_2021JUL10_1647/model_Ctoxic_B32_L256/model.h5
Epoch 3/10
6883/6883 [==============================] - 1970s 286ms/step - loss: 0.0181 - recall: 0.9968 - accuracy: 0.9952 - val_loss: 0.0446 - val_recall: 0.9984 - val_accuracy: 0.9897
Epoch 00003: val_recall did not improve from 0.99583
Epoch 4/10
6883/6883 [==============================] - 1972s 286ms/step - loss: 0.0125 - recall: 0.9976 - accuracy: 0.9967 - val_loss: 0.0429 - val_recall: 0.9985 - val_accuracy: 0.9902
Epoch 00004: val_recall did not improve from 0.99583
Epoch 5/10
6883/6883 [==============================] - 1973s 287ms/step - loss: 0.0094 - recall: 0.9979 - accuracy: 0.9974 - val_loss: 0.0663 - val_recall: 0.9991 - val_accuracy: 0.9873
Epoch 00005: ReduceLROnPlateau reducing learning rate to 5.9999998484272515e-06.
Epoch 00005: val_recall did not improve from 0.99583
Epoch 6/10
6883/6883 [==============================] - 1970s 286ms/step - loss: 0.0031 - recall: 0.9996 - accuracy: 0.9993 - val_loss: 0.0646 - val_recall: 0.9998 - val_accuracy: 0.9901
Epoch 00006: val_recall did not improve from 0.99583
Epoch 7/10
6883/6883 [==============================] - 1967s 286ms/step - loss: 0.0019 - recall: 0.9998 - accuracy: 0.9997 - val_loss: 0.0641 - val_recall: 0.9997 - val_accuracy: 0.9903
Restoring model weights from the end of the best epoch.
Epoch 00007: val_recall did not improve from 0.99583
Epoch 00007: early stopping
Solution
As per the comment by Innat, setting mode='max' in the callbacks resolved the issue. With mode='auto', Keras infers 'max' only for metric names containing 'acc' or starting with 'fmeasure'; any other name, including 'recall', falls back to 'min', so the callback treated the lowest val_recall as the best.
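The mode='auto' inference can be sketched in a few lines (a paraphrase of the rule in the tf.keras callbacks, not the library code itself):

```python
def infer_mode(monitor: str) -> str:
    # mode='auto' picks 'max' only for accuracy-like metric names;
    # everything else, including 'recall', is treated as 'min'.
    if 'acc' in monitor or monitor.startswith('fmeasure'):
        return 'max'
    return 'min'

print(infer_mode('val_accuracy'))  # max
print(infer_mode('val_recall'))    # min <- why the "best" value stuck at the lowest
```

Passing mode='max' explicitly bypasses this inference entirely, which is why it fixes the problem.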
I am running the following code:
basemodel.fit(X_train, y_train, epochs=25, validation_split=.1, callbacks=call_back)
But I get the result Epoch 00014: val_accuracy did not improve from 0.57709. I am not sure what the issue is, because I can clearly see that my training loss has decreased and my training accuracy has increased.
This is the result:
Epoch 1/25
909/909 [==============================] - 13s 6ms/step - loss: 1.6465 - accuracy: 0.3396 - val_loss: 1.4830 - val_accuracy: 0.4334
Epoch 00001: val_accuracy improved from -inf to 0.43344, saving model to checkpoint/best_model.h5
Epoch 2/25
909/909 [==============================] - 5s 5ms/step - loss: 1.3402 - accuracy: 0.4860 - val_loss: 1.3291 - val_accuracy: 0.4926
Epoch 00002: val_accuracy improved from 0.43344 to 0.49257, saving model to checkpoint/best_model.h5
Epoch 3/25
909/909 [==============================] - 5s 5ms/step - loss: 1.2050 - accuracy: 0.5418 - val_loss: 1.2769 - val_accuracy: 0.5025
Epoch 00003: val_accuracy improved from 0.49257 to 0.50248, saving model to checkpoint/best_model.h5
Epoch 4/25
909/909 [==============================] - 5s 5ms/step - loss: 1.1054 - accuracy: 0.5806 - val_loss: 1.1936 - val_accuracy: 0.5495
Epoch 00004: val_accuracy improved from 0.50248 to 0.54954, saving model to checkpoint/best_model.h5
Epoch 5/25
909/909 [==============================] - 5s 5ms/step - loss: 1.0190 - accuracy: 0.6159 - val_loss: 1.1535 - val_accuracy: 0.5551
Epoch 00005: val_accuracy improved from 0.54954 to 0.55511, saving model to checkpoint/best_model.h5
Epoch 6/25
909/909 [==============================] - 5s 5ms/step - loss: 0.9329 - accuracy: 0.6502 - val_loss: 1.1962 - val_accuracy: 0.5641
Epoch 00006: val_accuracy improved from 0.55511 to 0.56409, saving model to checkpoint/best_model.h5
Epoch 7/25
909/909 [==============================] - 5s 5ms/step - loss: 0.8435 - accuracy: 0.6846 - val_loss: 1.1707 - val_accuracy: 0.5771
Epoch 00007: val_accuracy improved from 0.56409 to 0.57709, saving model to checkpoint/best_model.h5
Epoch 8/25
909/909 [==============================] - 5s 5ms/step - loss: 0.7527 - accuracy: 0.7201 - val_loss: 1.3817 - val_accuracy: 0.5545
Epoch 00008: val_accuracy did not improve from 0.57709
Epoch 9/25
909/909 [==============================] - 5s 5ms/step - loss: 0.6633 - accuracy: 0.7576 - val_loss: 1.5021 - val_accuracy: 0.5207
Epoch 00009: val_accuracy did not improve from 0.57709
Epoch 10/25
909/909 [==============================] - 5s 5ms/step - loss: 0.5865 - accuracy: 0.7874 - val_loss: 1.5610 - val_accuracy: 0.5721
Epoch 00010: val_accuracy did not improve from 0.57709
Epoch 11/25
909/909 [==============================] - 5s 5ms/step - loss: 0.5154 - accuracy: 0.8097 - val_loss: 1.5723 - val_accuracy: 0.5430
Epoch 00011: val_accuracy did not improve from 0.57709
Epoch 12/25
909/909 [==============================] - 5s 6ms/step - loss: 0.4540 - accuracy: 0.8333 - val_loss: 2.1641 - val_accuracy: 0.5650
Epoch 00012: val_accuracy did not improve from 0.57709
Epoch 13/25
909/909 [==============================] - 5s 5ms/step - loss: 0.4106 - accuracy: 0.8511 - val_loss: 2.3236 - val_accuracy: 0.5322
Epoch 00013: val_accuracy did not improve from 0.57709
Epoch 14/25
909/909 [==============================] - 5s 5ms/step - loss: 0.3747 - accuracy: 0.8682 - val_loss: 1.8985 - val_accuracy: 0.5567
Epoch 00014: val_accuracy did not improve from 0.57709
Epoch 15/25
909/909 [==============================] - 5s 5ms/step - loss: 0.3480 - accuracy: 0.8768 - val_loss: 2.1689 - val_accuracy: 0.5505
Epoch 00015: val_accuracy did not improve from 0.57709
Epoch 16/25
909/909 [==============================] - 5s 5ms/step - loss: 0.3224 - accuracy: 0.8878 - val_loss: 2.0880 - val_accuracy: 0.5269
Epoch 00016: val_accuracy did not improve from 0.57709
Epoch 17/25
909/909 [==============================] - 5s 5ms/step - loss: 0.3157 - accuracy: 0.8912 - val_loss: 2.2746 - val_accuracy: 0.5328
Epoch 00017: val_accuracy did not improve from 0.57709
Epoch 18/25
909/909 [==============================] - 5s 5ms/step - loss: 0.2960 - accuracy: 0.8992 - val_loss: 2.3014 - val_accuracy: 0.5582
Epoch 00018: val_accuracy did not improve from 0.57709
Epoch 19/25
909/909 [==============================] - 5s 5ms/step - loss: 0.2961 - accuracy: 0.8998 - val_loss: 2.8190 - val_accuracy: 0.5399
Epoch 00019: val_accuracy did not improve from 0.57709
Epoch 20/25
909/909 [==============================] - 5s 5ms/step - loss: 0.2945 - accuracy: 0.9016 - val_loss: 2.5621 - val_accuracy: 0.5495
Epoch 00020: val_accuracy did not improve from 0.57709
Epoch 21/25
909/909 [==============================] - 5s 5ms/step - loss: 0.2772 - accuracy: 0.9075 - val_loss: 2.6602 - val_accuracy: 0.5402
Epoch 00021: val_accuracy did not improve from 0.57709
Epoch 22/25
909/909 [==============================] - 5s 6ms/step - loss: 0.2857 - accuracy: 0.9070 - val_loss: 2.7156 - val_accuracy: 0.5381
Epoch 00022: val_accuracy did not improve from 0.57709
Epoch 23/25
909/909 [==============================] - 5s 5ms/step - loss: 0.2767 - accuracy: 0.9098 - val_loss: 3.4705 - val_accuracy: 0.5291
Epoch 00023: val_accuracy did not improve from 0.57709
Epoch 24/25
909/909 [==============================] - 5s 6ms/step - loss: 0.2725 - accuracy: 0.9100 - val_loss: 3.5462 - val_accuracy: 0.5706
Epoch 00024: val_accuracy did not improve from 0.57709
Epoch 25/25
909/909 [==============================] - 5s 5ms/step - loss: 0.2675 - accuracy: 0.9134 - val_loss: 2.3214 - val_accuracy: 0.5254
Epoch 00025: val_accuracy did not improve from 0.57709
<tensorflow.python.keras.callbacks.History at 0x7f9d42d7afd0>
Below is my compile call; the learning rate is .01:
basemodel.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=.01), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
This is a case of overfitting: the model is memorizing the training data.
If you swap the validation data for the training data, you will see the validation loss go down as well.
From the discussion we had in the comments: you have just 1000 data points, while the model has 403,463 trainable parameters.
Your options:
Get more data
Use pretrained layers (this is known as transfer learning)
Use regularization
Use Dropout
Use batch normalization (won't be very effective here)
Getting more data or using pretrained layers will be the most effective in your case.
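Of those options, Dropout is the cheapest to try first. As a reminder of what the layer actually does, here is a minimal sketch of inverted dropout in plain Python (not the Keras implementation):

```python
import random

def dropout(activations, rate, training=True, seed=0):
    """Minimal sketch of inverted dropout (not the Keras implementation)."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is the identity at inference time
    rng = random.Random(seed)
    keep = 1.0 - rate
    # Zero each unit with probability `rate`; scale survivors by 1/keep so
    # the expected activation matches between training and inference.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

out = dropout([1.0, 1.0, 1.0, 1.0], rate=0.5)
print(out)  # about half the units zeroed, the rest scaled to 2.0
```

In Keras this is just keras.layers.Dropout(rate) inserted between the Dense layers; it is only active during training.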
I created two Anaconda environments, for TensorFlow 2.x and TensorFlow 1.x respectively. In tensorflow2x, tensorflow 2.3.2 and keras 2.4.3 (the latest) are installed; in tensorflow1x, tensorflow-gpu 1.15 and keras 2.3.1 are installed. I then ran the toy example mnist_cnn.py and found that the TensorFlow 2 environment gives much lower accuracy than the TensorFlow 1 environment.
Here below are the results:
# tensorflow2.3.2 + keras 2.4.3:
Epoch 1/12
60000/60000 [==============================] - 3s 54us/step - loss: 2.2795 - accuracy: 0.1270 - val_loss: 2.2287 - val_accuracy: 0.2883
Epoch 2/12
60000/60000 [==============================] - 3s 52us/step - loss: 2.2046 - accuracy: 0.2435 - val_loss: 2.1394 - val_accuracy: 0.5457
Epoch 3/12
60000/60000 [==============================] - 3s 52us/step - loss: 2.1133 - accuracy: 0.3636 - val_loss: 2.0215 - val_accuracy: 0.6608
Epoch 4/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.9932 - accuracy: 0.4560 - val_loss: 1.8693 - val_accuracy: 0.7147
Epoch 5/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.8430 - accuracy: 0.5239 - val_loss: 1.6797 - val_accuracy: 0.7518
Epoch 6/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.6710 - accuracy: 0.5720 - val_loss: 1.4724 - val_accuracy: 0.7755
Epoch 7/12
60000/60000 [==============================] - 3s 53us/step - loss: 1.5003 - accuracy: 0.6071 - val_loss: 1.2725 - val_accuracy: 0.7928
Epoch 8/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.3414 - accuracy: 0.6363 - val_loss: 1.0991 - val_accuracy: 0.8077
Epoch 9/12
60000/60000 [==============================] - 3s 53us/step - loss: 1.2129 - accuracy: 0.6604 - val_loss: 0.9603 - val_accuracy: 0.8169
Epoch 10/12
60000/60000 [==============================] - 3s 53us/step - loss: 1.1103 - accuracy: 0.6814 - val_loss: 0.8530 - val_accuracy: 0.8281
Epoch 11/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.0237 - accuracy: 0.7021 - val_loss: 0.7689 - val_accuracy: 0.8350
Epoch 12/12
60000/60000 [==============================] - 3s 52us/step - loss: 0.9576 - accuracy: 0.7168 - val_loss: 0.7030 - val_accuracy: 0.8429
Test loss: 0.7029915698051452
Test accuracy: 0.8428999781608582
# tensorflow1.15.5 + keras2.3.1
60000/60000 [==============================] - 5s 84us/step - loss: 0.2631 - accuracy: 0.9198 - val_loss: 0.0546 - val_accuracy: 0.9826
Epoch 2/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0898 - accuracy: 0.9731 - val_loss: 0.0394 - val_accuracy: 0.9866
Epoch 3/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0674 - accuracy: 0.9799 - val_loss: 0.0341 - val_accuracy: 0.9881
Epoch 4/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0563 - accuracy: 0.9835 - val_loss: 0.0320 - val_accuracy: 0.9895
Epoch 5/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0465 - accuracy: 0.9859 - val_loss: 0.0343 - val_accuracy: 0.9889
Epoch 6/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0423 - accuracy: 0.9872 - val_loss: 0.0327 - val_accuracy: 0.9892
Epoch 7/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0387 - accuracy: 0.9882 - val_loss: 0.0279 - val_accuracy: 0.9907
Epoch 8/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0351 - accuracy: 0.9893 - val_loss: 0.0269 - val_accuracy: 0.9909
Epoch 9/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0330 - accuracy: 0.9902 - val_loss: 0.0311 - val_accuracy: 0.9895
Epoch 10/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0292 - accuracy: 0.9915 - val_loss: 0.0256 - val_accuracy: 0.9919
Epoch 11/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0293 - accuracy: 0.9911 - val_loss: 0.0276 - val_accuracy: 0.9911
Epoch 12/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0269 - accuracy: 0.9917 - val_loss: 0.0264 - val_accuracy: 0.9915
Test loss: 0.026934823030711867
Test accuracy: 0.9918000102043152
What caused the poor results for tensorflow 2.3.2 + keras 2.4.3? Is there a compatibility issue between tensorflow and keras here?
According to the author of Keras, users should switch their Keras code to tf.keras in TensorFlow 2.x. In the toy example above, using from tensorflow import keras in place of import keras also leads to lower accuracy. Does tf.keras give poorer accuracy than keras, or am I running the wrong toy example for TensorFlow 2.x?
Update:
I also note that if I downgrade tensorflow to version 2.2.1 (along with keras 2.3.1), the two environments produce about the same result. It seems there were some major changes from keras 2.3.1 to keras 2.4.0 (https://newreleases.io/project/github/keras-team/keras/release/2.4.0).
What are the main differences between keras 2.3.1 and keras 2.4.x?
Which versions of tensorflow are compatible with keras 2.4.x?
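For what it's worth, one default that differs for this exact script (an assumption about the cause, not a confirmed diagnosis): standalone keras 2.3.x shipped Adadelta with lr=1.0, the value mnist_cnn.py was tuned for, while tf.keras.optimizers.Adadelta defaults to learning_rate=0.001. A quick check, plus the pinned optimizer:

```python
import tensorflow as tf

# tf.keras's Adadelta default is far smaller than the lr=1.0 that the
# standalone-keras mnist_cnn.py example relied on (assumed cause of the gap).
opt_default = tf.keras.optimizers.Adadelta()                     # ~0.001
opt_pinned = tf.keras.optimizers.Adadelta(learning_rate=1.0)     # 1.0
print(float(opt_default.learning_rate.numpy()),
      float(opt_pinned.learning_rate.numpy()))
```

Compiling with optimizer=opt_pinned instead of the bare string or default constructor removes the version dependence and makes the two environments comparable.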
I am currently studying the book Hands-On Machine Learning. I want to create a simple neural network for the MNIST handwritten-digit data, as described in chapter 10 of the book. But my model is stuck and the accuracy is not increasing at all.
Here is my code:
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
data = pd.read_csv('sample_data/mnist_train_small.csv', header=None)
test = pd.read_csv('sample_data/mnist_test.csv', header=None)
labels = data[0]
data = data.drop(0, axis=1)
test_labels = test[0]
test = test.drop(0, axis=1)
model = keras.models.Sequential([
    keras.layers.Dense(300, activation='relu', input_shape=(784,)),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
keras.utils.plot_model(model, show_shapes=True)
hist = model.fit(data.to_numpy(), labels.to_numpy(), epochs=20, validation_data=(test.to_numpy(), test_labels.to_numpy()))
The first few outputs are :
Epoch 1/20
625/625 [==============================] - 2s 3ms/step - loss: 2055059923226079526912.0000 - accuracy: 0.1115 - val_loss: 2.4539 - val_accuracy: 0.1134
Epoch 2/20
625/625 [==============================] - 2s 3ms/step - loss: 2.4160 - accuracy: 0.1085 - val_loss: 2.2979 - val_accuracy: 0.1008
Epoch 3/20
625/625 [==============================] - 2s 2ms/step - loss: 2.3006 - accuracy: 0.1110 - val_loss: 2.3014 - val_accuracy: 0.1136
Epoch 4/20
625/625 [==============================] - 2s 3ms/step - loss: 2.3009 - accuracy: 0.1121 - val_loss: 2.3014 - val_accuracy: 0.1136
Epoch 5/20
625/625 [==============================] - 2s 3ms/step - loss: 2.3009 - accuracy: 0.1121 - val_loss: 2.3014 - val_accuracy: 0.1136
Epoch 6/20
625/625 [==============================] - 2s 3ms/step - loss: 2.3008 - accuracy: 0.1121 - val_loss: 2.3014 - val_accuracy: 0.1136
Epoch 7/20
625/625 [==============================] - 2s 3ms/step - loss: 2.3008 - accuracy: 0.1121 - val_loss: 2.3014 - val_accuracy: 0.1136
Epoch 8/20
625/625 [==============================] - 2s 3ms/step - loss: 2.3008 - accuracy: 0.1121 - val_loss: 2.3014 - val_accuracy: 0.1136
Epoch 9/20
625/625 [==============================] - 2s 2ms/step - loss: 2.3008 - accuracy: 0.1121 - val_loss: 2.3014 - val_accuracy: 0.1136
Epoch 10/20
625/625 [==============================] - 2s 3ms/step - loss: 2.3008 - accuracy: 0.1121 - val_loss: 2.3014 - val_accuracy: 0.1136
Epoch 11/20
625/625 [==============================] - 2s 3ms/step - loss: 2.3008 - accuracy: 0.1121 - val_loss: 2.3014 - val_accuracy: 0.1136
Epoch 12/20
625/625 [==============================] - 2s 3ms/step - loss: 2.3008 - accuracy: 0.1121 - val_loss: 2.3014 - val_accuracy: 0.1136
sparse_categorical_crossentropy is actually the right loss here: it expects integer labels, which is what you have (categorical_crossentropy is for one-hot labels; "sparse" does not refer to large, mostly empty matrices). The huge first-epoch loss instead points to unscaled inputs: the pixel values run 0-255, so divide them by 255 before fitting. Also, instead of data[] you can use data.iloc[], and the adam optimizer would converge faster than plain sgd in this problem.
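The scaling step can be sketched as follows (X here is a small stand-in for data.to_numpy(); dividing by 255 is the only change to the pipeline):

```python
import numpy as np

# Raw MNIST pixels are integers in [0, 255]; fed straight into relu + sgd
# they produce huge activations, which matches the exploding first-epoch loss.
X = np.array([[0, 128, 255]], dtype=np.float32)  # stand-in for data.to_numpy()
X_scaled = X / 255.0                             # now in [0.0, 1.0]
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```

The same division is applied to the test frame so that training and validation see identically scaled inputs.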
Using around 27,000 image samples for a CNN, I get very good performance, but all of a sudden, at epoch 42, the validation accuracy drops dramatically (from val_acc: 0.9982 to val_acc: 0.0678)! Any idea? Should I just stop training at the maximum val_acc? It's also weird that the validation accuracy is always higher than the training accuracy.
Using TensorFlow backend.
...
27091/27067 [==============================] - 2645s - loss: 0.0120 - acc: 0.9967 - val_loss: 0.0063 - val_acc: 0.9982
Epoch 33/50
27091/27067 [==============================] - 2674s - loss: 0.0114 - acc: 0.9971 - val_loss: 0.0145 - val_acc: 0.9975
Epoch 34/50
27091/27067 [==============================] - 2654s - loss: 0.0200 - acc: 0.9962 - val_loss: 0.0063 - val_acc: 0.9979
Epoch 35/50
27091/27067 [==============================] - 2649s - loss: 0.0137 - acc: 0.9964 - val_loss: 0.0069 - val_acc: 0.9985
Epoch 36/50
27091/27067 [==============================] - 2663s - loss: 0.0161 - acc: 0.9962 - val_loss: 0.0117 - val_acc: 0.9978
Epoch 37/50
27091/27067 [==============================] - 2680s - loss: 0.0155 - acc: 0.9959 - val_loss: 0.0039 - val_acc: 0.9993
Epoch 38/50
27091/27067 [==============================] - 2660s - loss: 0.0145 - acc: 0.9965 - val_loss: 0.0117 - val_acc: 0.9973
Epoch 39/50
27091/27067 [==============================] - 2647s - loss: 0.0111 - acc: 0.9970 - val_loss: 0.0127 - val_acc: 0.9982
Epoch 40/50
27091/27067 [==============================] - 2644s - loss: 0.0112 - acc: 0.9970 - val_loss: 0.0092 - val_acc: 0.9984
Epoch 41/50
27091/27067 [==============================] - 2658s - loss: 0.0131 - acc: 0.9967 - val_loss: 0.0057 - val_acc: 0.9982
Epoch 42/50
27091/27067 [==============================] - 2662s - loss: 0.0114 - acc: 0.7715 - val_loss: 1.1921e-07 - val_acc: 0.0678
Epoch 43/50
27091/27067 [==============================] - 2661s - loss: 1.1921e-07 - acc: 0.0714 - val_loss: 1.1921e-07 - val_acc: 0.0653
Epoch 44/50
27091/27067 [==============================] - 2668s - loss: 1.1921e-07 - acc: 0.0723 - val_loss: 1.1921e-07 - val_acc: 0.0664
Epoch 45/50
27091/27067 [==============================] - 2669s - loss: 1.1921e-07 - acc: 0.0731 - val_loss: 1.1921e-07 - val_acc: 0.0683
Thanks Marcin Możejko for pointing me in the right direction.
This can happen at very high learning rates: the loss can start increasing after some epochs, as described here.
Reducing the learning rate worked, as described in the Keras callbacks documentation.
Example:
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                              patience=5, min_lr=0.001)
model.fit(X_train, Y_train, callbacks=[reduce_lr])
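As for the "should I just stop training at the maximum val_acc" part of the question: in current Keras that is what the EarlyStopping callback is for. A sketch (the monitor name and parameter values are illustrative, and restore_best_weights requires a reasonably recent Keras):

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Stop when val_acc has not improved for `patience` epochs and roll back
# to the best epoch's weights instead of keeping the last (collapsed) ones.
early_stop = EarlyStopping(monitor='val_acc', mode='max', patience=5,
                           restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                              patience=5, min_lr=0.001)
model.fit(X_train, Y_train, callbacks=[reduce_lr, early_stop])
```

Note the metric is named val_acc in older Keras versions and val_accuracy in newer ones; the monitor string must match what appears in the training log.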