Keras: Why does my val_acc suddenly drop at epoch 42/50?

I'm training a CNN on around 27,000 image samples. It performs very well, but at epoch 42 the validation accuracy suddenly drops dramatically (from val_acc: 0.9982 to val_acc: 0.0678). Any idea why? Should I just stop training at the maximum val_acc? It's also odd that the validation accuracy is always higher than the training accuracy.
Using TensorFlow backend.
...
27091/27067 [==============================] - 2645s - loss: 0.0120 - acc: 0.9967 - val_loss: 0.0063 - val_acc: 0.9982
Epoch 33/50
27091/27067 [==============================] - 2674s - loss: 0.0114 - acc: 0.9971 - val_loss: 0.0145 - val_acc: 0.9975
Epoch 34/50
27091/27067 [==============================] - 2654s - loss: 0.0200 - acc: 0.9962 - val_loss: 0.0063 - val_acc: 0.9979
Epoch 35/50
27091/27067 [==============================] - 2649s - loss: 0.0137 - acc: 0.9964 - val_loss: 0.0069 - val_acc: 0.9985
Epoch 36/50
27091/27067 [==============================] - 2663s - loss: 0.0161 - acc: 0.9962 - val_loss: 0.0117 - val_acc: 0.9978
Epoch 37/50
27091/27067 [==============================] - 2680s - loss: 0.0155 - acc: 0.9959 - val_loss: 0.0039 - val_acc: 0.9993
Epoch 38/50
27091/27067 [==============================] - 2660s - loss: 0.0145 - acc: 0.9965 - val_loss: 0.0117 - val_acc: 0.9973
Epoch 39/50
27091/27067 [==============================] - 2647s - loss: 0.0111 - acc: 0.9970 - val_loss: 0.0127 - val_acc: 0.9982
Epoch 40/50
27091/27067 [==============================] - 2644s - loss: 0.0112 - acc: 0.9970 - val_loss: 0.0092 - val_acc: 0.9984
Epoch 41/50
27091/27067 [==============================] - 2658s - loss: 0.0131 - acc: 0.9967 - val_loss: 0.0057 - val_acc: 0.9982
Epoch 42/50
27091/27067 [==============================] - 2662s - loss: 0.0114 - acc: 0.7715 - val_loss: 1.1921e-07 - val_acc: 0.0678
Epoch 43/50
27091/27067 [==============================] - 2661s - loss: 1.1921e-07 - acc: 0.0714 - val_loss: 1.1921e-07 - val_acc: 0.0653
Epoch 44/50
27091/27067 [==============================] - 2668s - loss: 1.1921e-07 - acc: 0.0723 - val_loss: 1.1921e-07 - val_acc: 0.0664
Epoch 45/50
27091/27067 [==============================] - 2669s - loss: 1.1921e-07 - acc: 0.0731 - val_loss: 1.1921e-07 - val_acc: 0.0683

Thanks to Marcin Możejko for pointing me in the right direction.
This can happen at very high learning rates: the loss can start increasing after some epochs, as described here.
Reducing the learning rate, as described in the Keras callbacks documentation, fixed it.
Example:
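from keras.callbacks import ReduceLROnPlateau

# Multiply the learning rate by 0.2 when val_loss has not improved for 5 epochs.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                              patience=5, min_lr=0.001)
model.fit(X_train, Y_train, callbacks=[reduce_lr])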
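To make the callback's behaviour concrete, here is a simplified pure-Python sketch of a reduce-on-plateau rule. It is not Keras's implementation (the real callback has further options such as cooldown and min_delta); the function name plateau_schedule and the sample val_loss values are made up for illustration.

```python
def plateau_schedule(val_losses, initial_lr, factor=0.2, patience=5, min_lr=0.0):
    """Return the learning rate in effect after each epoch (simplified sketch)."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # epochs since the last improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Shrink the learning rate, but never below the floor.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Hypothetical val_loss values: two improving epochs, then a plateau.
losses = [1.0, 0.9] + [0.95] * 5
print(plateau_schedule(losses, initial_lr=0.01))
print(plateau_schedule(losses, initial_lr=0.001, min_lr=0.001))
```

Note that min_lr is a floor. If the optimizer already starts at 0.001 (the Keras default for Adam), min_lr=0.001 leaves no room for a reduction, so a smaller min_lr may be needed in that case.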

Related

Using TensorFlow-metal plugin, training stops after some time without any errors?

I followed the conda-based steps provided by Apple to install TensorFlow and get the best out of my M1 Pro MacBook Pro. As the title says, training stops after some time without any errors. Please see the Keras training log below. This has happened many times. What could be the reason? Has anyone experienced the same? If so, how can I work around it?
...
Epoch 38/50
625/625 [==============================] - 18s 29ms/step - loss: 1.6704 - acc: 0.4178 - val_loss: 1.8169 - val_acc: 0.4044
Epoch 39/50
625/625 [==============================] - 18s 29ms/step - loss: 1.6788 - acc: 0.4157 - val_loss: 1.6830 - val_acc: 0.4029
Epoch 40/50
625/625 [==============================] - 18s 28ms/step - loss: 1.6921 - acc: 0.4089 - val_loss: 1.7088 - val_acc: 0.4049
Epoch 41/50
625/625 [==============================] - 18s 28ms/step - loss: 1.6705 - acc: 0.4170 - val_loss: 1.6650 - val_acc: 0.4182
Epoch 42/50
625/625 [==============================] - 18s 29ms/step - loss: 1.6659 - acc: 0.4177 - val_loss: 1.9102 - val_acc: 0.3443
Epoch 43/50
625/625 [==============================] - 18s 29ms/step - loss: 1.6760 - acc: 0.4166 - val_loss: 1.6647 - val_acc: 0.4222
Epoch 44/50
532/625 [========================>.....] - ETA: 2s - loss: 1.6639 - acc: 0.4217

Lower model accuracies with tensorflow2.3+keras2.4 than tensorflow1.15+keras2.3

I created two Anaconda environments, tensorflow2x and tensorflow1x. The tensorflow2x environment has tensorflow 2.3.2 and keras 2.4.3 (the latest) installed, while tensorflow1x has tensorflow-gpu 1.15 and keras 2.3.1. I then ran the toy example mnist_cnn.py in both. The TensorFlow 2 environment gives much lower accuracy than the TensorFlow 1 environment.
Here are the results:
# tensorflow2.3.2 + keras 2.4.3:
Epoch 1/12
60000/60000 [==============================] - 3s 54us/step - loss: 2.2795 - accuracy: 0.1270 - val_loss: 2.2287 - val_accuracy: 0.2883
Epoch 2/12
60000/60000 [==============================] - 3s 52us/step - loss: 2.2046 - accuracy: 0.2435 - val_loss: 2.1394 - val_accuracy: 0.5457
Epoch 3/12
60000/60000 [==============================] - 3s 52us/step - loss: 2.1133 - accuracy: 0.3636 - val_loss: 2.0215 - val_accuracy: 0.6608
Epoch 4/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.9932 - accuracy: 0.4560 - val_loss: 1.8693 - val_accuracy: 0.7147
Epoch 5/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.8430 - accuracy: 0.5239 - val_loss: 1.6797 - val_accuracy: 0.7518
Epoch 6/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.6710 - accuracy: 0.5720 - val_loss: 1.4724 - val_accuracy: 0.7755
Epoch 7/12
60000/60000 [==============================] - 3s 53us/step - loss: 1.5003 - accuracy: 0.6071 - val_loss: 1.2725 - val_accuracy: 0.7928
Epoch 8/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.3414 - accuracy: 0.6363 - val_loss: 1.0991 - val_accuracy: 0.8077
Epoch 9/12
60000/60000 [==============================] - 3s 53us/step - loss: 1.2129 - accuracy: 0.6604 - val_loss: 0.9603 - val_accuracy: 0.8169
Epoch 10/12
60000/60000 [==============================] - 3s 53us/step - loss: 1.1103 - accuracy: 0.6814 - val_loss: 0.8530 - val_accuracy: 0.8281
Epoch 11/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.0237 - accuracy: 0.7021 - val_loss: 0.7689 - val_accuracy: 0.8350
Epoch 12/12
60000/60000 [==============================] - 3s 52us/step - loss: 0.9576 - accuracy: 0.7168 - val_loss: 0.7030 - val_accuracy: 0.8429
Test loss: 0.7029915698051452
Test accuracy: 0.8428999781608582
# tensorflow1.15.5 + keras2.3.1
60000/60000 [==============================] - 5s 84us/step - loss: 0.2631 - accuracy: 0.9198 - val_loss: 0.0546 - val_accuracy: 0.9826
Epoch 2/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0898 - accuracy: 0.9731 - val_loss: 0.0394 - val_accuracy: 0.9866
Epoch 3/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0674 - accuracy: 0.9799 - val_loss: 0.0341 - val_accuracy: 0.9881
Epoch 4/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0563 - accuracy: 0.9835 - val_loss: 0.0320 - val_accuracy: 0.9895
Epoch 5/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0465 - accuracy: 0.9859 - val_loss: 0.0343 - val_accuracy: 0.9889
Epoch 6/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0423 - accuracy: 0.9872 - val_loss: 0.0327 - val_accuracy: 0.9892
Epoch 7/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0387 - accuracy: 0.9882 - val_loss: 0.0279 - val_accuracy: 0.9907
Epoch 8/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0351 - accuracy: 0.9893 - val_loss: 0.0269 - val_accuracy: 0.9909
Epoch 9/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0330 - accuracy: 0.9902 - val_loss: 0.0311 - val_accuracy: 0.9895
Epoch 10/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0292 - accuracy: 0.9915 - val_loss: 0.0256 - val_accuracy: 0.9919
Epoch 11/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0293 - accuracy: 0.9911 - val_loss: 0.0276 - val_accuracy: 0.9911
Epoch 12/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0269 - accuracy: 0.9917 - val_loss: 0.0264 - val_accuracy: 0.9915
Test loss: 0.026934823030711867
Test accuracy: 0.9918000102043152
What caused the poor results with tensorflow 2.3.2 + keras 2.4.3? Is there a compatibility issue between tensorflow and keras here?
According to the author of Keras, users should consider switching their Keras code to tf.keras in TensorFlow 2.x. In the toy example above, using
from tensorflow import keras in place of import keras also leads to lower accuracy. Does tf.keras give poorer accuracy than keras? Or am I running the wrong toy example for TensorFlow 2.x?
Update:
I also noticed that if I downgrade tensorflow to version 2.2.1 (along with keras 2.3.1), the two environments produce about the same result. It seems there were some major changes from keras 2.3.1 to keras 2.4.0 (https://newreleases.io/project/github/keras-team/keras/release/2.4.0).
What are the main differences between keras 2.3.1 and keras 2.4.x?
Which versions of tensorflow are compatible with keras 2.4.x?

All weights become NaN in basic Mnist keras example

I'm running the Mnist example given at https://keras.io/examples/mnist_cnn/. After a few epochs, the accuracy drops to near zero and all layer weights become NaN.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 4s 69us/step - loss: 0.2202 - acc: 0.9321 - val_loss: 0.0594 - val_acc: 0.9815
Epoch 2/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0741 - acc: 0.9773 - val_loss: 0.0392 - val_acc: 0.9871
Epoch 3/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0345 - acc: 0.6064 - val_loss: 1.1921e-07 - val_acc: 0.0980
Epoch 4/12
60000/60000 [==============================] - 4s 63us/step - loss: 1.1921e-07 - acc: 0.0987 - val_loss: 1.1921e-07 - val_acc: 0.0980
Epoch 5/12
60000/60000 [==============================] - 4s 62us/step - loss: 1.1921e-07 - acc: 0.0987 - val_loss: 1.1921e-07 - val_acc: 0.0980
Epoch 6/12
60000/60000 [==============================] - 4s 63us/step - loss: 1.1921e-07 - acc: 0.0987 - val_loss: 1.1921e-07 - val_acc: 0.0980
Epoch 7/12
60000/60000 [==============================] - 4s 63us/step - loss: 1.1921e-07 - acc: 0.0987 - val_loss: 1.1921e-07 - val_acc: 0.0980
Epoch 8/12
60000/60000 [==============================] - 4s 63us/step - loss: 1.1921e-07 - acc: 0.0987 - val_loss: 1.1921e-07 - val_acc: 0.0980
Epoch 9/12
60000/60000 [==============================] - 4s 63us/step - loss: 1.1921e-07 - acc: 0.0987 - val_loss: 1.1921e-07 - val_acc: 0.0980
Epoch 10/12
60000/60000 [==============================] - 4s 63us/step - loss: 1.1921e-07 - acc: 0.0987 - val_loss: 1.1921e-07 - val_acc: 0.0980
Epoch 11/12
60000/60000 [==============================] - 4s 62us/step - loss: 1.1921e-07 - acc: 0.0987 - val_loss: 1.1921e-07 - val_acc: 0.0980
Epoch 12/12
60000/60000 [==============================] - 4s 63us/step - loss: 1.1921e-07 - acc: 0.0987 - val_loss: 1.1921e-07 - val_acc: 0.0980
Test loss: 1.1920930376163597e-07
Test accuracy: 0.098
import numpy as np
# Print every layer whose first weight array is entirely NaN.
for layer in model.layers:
    if len(layer.get_weights()) > 0 and np.all(np.isnan(layer.get_weights()[0])):
        print(layer.name)
Output:
conv2d_3
conv2d_4
dense_3
dense_4
TensorFlow 1.12.0
Keras 2.2.4
CUDA Version 10.0.130
cuDNN 7.3.1
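A quick sanity check on the numbers in this log may help with diagnosis. The sketch below assumes the standard MNIST class counts for digit 0 (5923 training and 980 test images), which are dataset statistics and not something stated in the post. It shows that the logged loss 1.1921e-07 matches float32 machine epsilon (2^-23) to the displayed precision, and that the logged accuracies match a model that always predicts digit 0.

```python
# float32 machine epsilon is 2**-23; compare it with the logged loss value.
float32_eps = 2.0 ** -23
eps_text = "%.4e" % float32_eps
print(eps_text)

# Assumed MNIST counts for digit 0 (standard dataset statistics, not from the post).
train_zeros, train_total = 5923, 60000
test_zeros, test_total = 980, 10000

# Accuracy of a model that predicts digit 0 for every image.
always_zero_train_acc = round(train_zeros / train_total, 4)
always_zero_test_acc = round(test_zeros / test_total, 4)
print(always_zero_train_acc, always_zero_test_acc)
```

Both figures are consistent with a network whose outputs have become NaN, so that the reported loss sits at a numerical floor and every prediction falls back to the first class. This is an interpretation of the log, not something the log proves on its own.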

Two ways of calling model.compile give different results

What I have already tried:
The data pipeline and the model-building function are correct.
Code:
The code is at https://gist.github.com/MaoXianXin/cd398521546d967560942e702c243ba7
I want to know why these two model.compile calls give different results.
Compare the accuracy:
model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=0.01), loss='categorical_crossentropy', metrics=['acc'])
Epoch 1/50
2019-05-27 13:53:20.280605: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10.0 locally
468/468 [==============================] - 6s 12ms/step - loss: 14.4332 - acc: 0.1045 - val_loss: 14.4601 - val_acc: 0.1029
Epoch 2/50
468/468 [==============================] - 3s 6ms/step - loss: 14.4354 - acc: 0.1044 - val_loss: 14.4763 - val_acc: 0.1023
Epoch 3/50
468/468 [==============================] - 3s 6ms/step - loss: 14.4359 - acc: 0.1044 - val_loss: 14.4714 - val_acc: 0.1026
Epoch 4/50
468/468 [==============================] - 3s 6ms/step - loss: 14.4359 - acc: 0.1044 - val_loss: 14.4682 - val_acc: 0.1028
model.compile('adam', 'categorical_crossentropy', metrics=['acc'])
Epoch 1/50
2019-05-27 13:51:16.122054: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10.0 locally
468/468 [==============================] - 5s 12ms/step - loss: 3.6567 - acc: 0.7388 - val_loss: 0.0732 - val_acc: 0.9791
Epoch 2/50
468/468 [==============================] - 3s 6ms/step - loss: 0.0812 - acc: 0.9760 - val_loss: 0.0449 - val_acc: 0.9854
Epoch 3/50
468/468 [==============================] - 3s 6ms/step - loss: 0.0533 - acc: 0.9836 - val_loss: 0.0428 - val_acc: 0.9869
Epoch 4/50
468/468 [==============================] - 3s 6ms/step - loss: 0.0426 - acc: 0.9871 - val_loss: 0.0446 - val_acc: 0.9872
Epoch 5/50
468/468 [==============================] - 3s 6ms/step - loss: 0.0376 - acc: 0.9886 - val_loss: 0.0449 - val_acc: 0.9867
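The flat loss around 14.43 in the RMSprop run is itself a clue. As a rough check, assuming Keras's default epsilon of 1e-7 for clipping probabilities, a classifier that puts all its probability on one class pays about -log(1e-7) for every wrong sample and roughly zero for every correct one. The sketch below compares that estimate with the logged values; the variable names are illustrative only.

```python
import math

eps = 1e-7                   # Keras's default epsilon (assumed unchanged here)
worst_case = -math.log(eps)  # loss for one confidently wrong sample

# (split name, logged accuracy, logged loss) taken from epoch 1 of the RMSprop run
logged = [("train", 0.1045, 14.4332), ("val", 0.1029, 14.4601)]

estimates = {}
for name, acc, loss in logged:
    # Wrong samples contribute worst_case each, correct samples roughly zero.
    estimates[name] = (1.0 - acc) * worst_case
    print(name, "estimated %.4f" % estimates[name], "logged %.4f" % loss)
```

The estimates land within about 0.001 of the logged losses, which is consistent with the model having collapsed to predicting a single class with full confidence. Note also that lr=0.01 is ten times the Keras default of 0.001 for RMSprop, while the string 'adam' uses Adam's default of 0.001, so the learning rate, rather than the way compile is written, is a likely suspect.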

First training epoch is very slow

Hi… I'm running MNIST code on my AWS P3 machine, and the initialization seems to take much longer than on my previous P2 machine (even though the P3 is the faster of the two).
Train on 60000 samples, validate on 10000 samples
Epoch 1/10
60000/60000 [==============================] - 265s 4ms/step - loss: 0.2674 - acc: 0.9175 - val_loss: 0.0602 - val_acc: 0.9811
Epoch 2/10
60000/60000 [==============================] - 3s 51us/step - loss: 0.0860 - acc: 0.9742 - val_loss: 0.0393 - val_acc: 0.9866
Epoch 3/10
60000/60000 [==============================] - 3s 50us/step - loss: 0.0647 - acc: 0.9808 - val_loss: 0.0338 - val_acc: 0.9884
Epoch 4/10
60000/60000 [==============================] - 3s 50us/step - loss: 0.0542 - acc: 0.9839 - val_loss: 0.0337 - val_acc: 0.9887
Epoch 5/10
60000/60000 [==============================] - 3s 50us/step - loss: 0.0453 - acc: 0.9863 - val_loss: 0.0311 - val_acc: 0.9900
Epoch 6/10
60000/60000 [==============================] - 3s 51us/step - loss: 0.0412 - acc: 0.9873 - val_loss: 0.0291 - val_acc: 0.9898
Epoch 7/10
60000/60000 [==============================] - 3s 50us/step - loss: 0.0368 - acc: 0.9891 - val_loss: 0.0300 - val_acc: 0.9901
Epoch 8/10
60000/60000 [==============================] - 3s 50us/step - loss: 0.0340 - acc: 0.9897 - val_loss: 0.0298 - val_acc: 0.9897
Epoch 9/10
60000/60000 [==============================] - 3s 50us/step - loss: 0.0320 - acc: 0.9908 - val_loss: 0.0267 - val_acc: 0.9916
Epoch 10/10
60000/60000 [==============================] - 3s 50us/step - loss: 0.0286 - acc: 0.9914 - val_loss: 0.0276 - val_acc: 0.9903
Test loss: 0.02757222411266339
Test accuracy: 0.9903
I’m using Keras=2.1.4
tensorflow-gpu=1.5.0
my keras.json file is configured as follows:
{
"floatx": "float32",
"epsilon": 1e-07,
"backend": "tensorflow",
"image_data_format": "channels_last"
}
Any idea why this happens?
Thanks in advance
Based on this issue:
The first epoch takes the same time, but the counter also takes into
account the time taken by building the part of the computational graph
that deals with training (a few seconds). This used to be done during
the compile step, but now it is done lazily on demand to avoid
unnecessary work.
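The "lazily on demand" behaviour described in that quote can be illustrated with a toy example. This is not Keras code: LazyTrainer, timed_epochs, and the 0.05 s build delay are all hypothetical stand-ins, used only to show why a one-off build cost ends up inside the first epoch's timer.

```python
import time


class LazyTrainer:
    """Toy stand-in for a model that builds its training function on first use."""

    def __init__(self, build_seconds=0.05):
        self.build_seconds = build_seconds
        self.build_count = 0
        self._train_fn = None

    def _build(self):
        time.sleep(self.build_seconds)  # pretend to build the training graph
        self.build_count += 1
        return lambda batch: sum(batch)

    def train_epoch(self, batch):
        # The build happens here, inside the first timed epoch, not at construction.
        if self._train_fn is None:
            self._train_fn = self._build()
        return self._train_fn(batch)


def timed_epochs(trainer, n_epochs, batch):
    """Time each epoch the way a progress bar would."""
    durations = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        trainer.train_epoch(batch)
        durations.append(time.perf_counter() - start)
    return durations


trainer = LazyTrainer()
durations = timed_epochs(trainer, 3, [1, 2, 3])
print(["%.3fs" % d for d in durations])
```

Only the first epoch pays the build cost. Note, though, that the quoted explanation mentions a few seconds, while the log above shows roughly 262 s of extra time in epoch 1 (265 s versus 3 s), so lazy graph building may not be the whole story on that machine.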