I am creating a CNN-LSTM model to classify intracranial hemorrhage from CT scan images. I am using a custom data generator that yields x arrays of shape (512, 512, 3) and y labels of shape [1].
This is a binary classification task; the images are fed to the network in batches and the model is trained on them.
Since I am using a batch size of 32 and 30 slices as the temporal dimension, x has shape (32, 30, 512, 512, 3) and y has shape (32, 1).
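For reference, a minimal sketch of the generator interface described above (the class name and the random data are placeholders; only the shapes are real):

import numpy as np
from tensorflow.keras.utils import Sequence

# Hypothetical illustration of the generator interface: each batch is
# (32, 30, 512, 512, 3) inputs and (32, 1) binary labels.
class DummyScanGenerator(Sequence):
    def __init__(self, n_batches=10, batch_size=32, n_slices=30):
        self.n_batches = n_batches
        self.batch_size = batch_size
        self.n_slices = n_slices

    def __len__(self):
        return self.n_batches

    def __getitem__(self, idx):
        x = np.random.rand(self.batch_size, self.n_slices, 512, 512, 3).astype("float32")
        y = np.random.randint(0, 2, size=(self.batch_size, 1)).astype("float32")
        return x, y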
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (TimeDistributed, Conv2D, MaxPooling2D,
                                     Dropout, Flatten, Dense, Bidirectional, GRU)
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(TimeDistributed(Conv2D(64, (3, 3), activation='relu'), input_shape=(None, 512, 512, 3)))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Dropout(0.3)))
model.add(TimeDistributed(Conv2D(128, (3, 3), activation='relu')))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Dropout(0.3)))
model.add(TimeDistributed(Conv2D(256, (3, 3), activation='relu')))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Dropout(0.3)))
model.add(TimeDistributed(Conv2D(512, (3, 3), activation='relu')))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Dropout(0.3)))
model.add(TimeDistributed(Conv2D(512, (3, 3), activation='relu')))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Dropout(0.3)))
model.add(TimeDistributed(Flatten()))
model.add(TimeDistributed(Dense(512, activation='relu')))
model.add(TimeDistributed(Dropout(0.3)))
model.add(Bidirectional(GRU(512, activation='relu', kernel_regularizer='l2')))
model.add(Dense(1, activation='sigmoid'))

# optim = RMSprop(learning_rate=0.00001)
model.compile(loss='binary_crossentropy',
              # optimizer=SGD(learning_rate=0.1),  # momentum=0.9, decay=0.01
              optimizer=Adam(learning_rate=0.00001),
              # optimizer=Nadam(learning_rate=0.001),
              metrics=['accuracy'])
I am training the model for 5 epochs, but the accuracy seems stuck at about 58%.
I created another model using only the CNN part of the above architecture, without the recurrent part, and was able to get close to 91% accuracy. When I include the recurrent part, the accuracy stagnates even though the loss decreases each epoch, as seen below.
Epoch 1/5
904/904 [==============================] - 1056s 1s/step - loss: 1.4925 - accuracy: 0.5827 - val_loss: 0.7267 - val_accuracy: 0.5938
Epoch 2/5
904/904 [==============================] - 1050s 1s/step - loss: 0.6946 - accuracy: 0.5837 - val_loss: 0.6776 - val_accuracy: 0.5950
Epoch 3/5
904/904 [==============================] - 1057s 1s/step - loss: 0.6801 - accuracy: 0.5836 - val_loss: 0.6763 - val_accuracy: 0.5944
Epoch 4/5
904/904 [==============================] - 1045s 1s/step - loss: 0.6793 - accuracy: 0.5836 - val_loss: 0.6770 - val_accuracy: 0.5944
Epoch 5/5
904/904 [==============================] - 1048s 1s/step - loss: 0.6794 - accuracy: 0.5836 - val_loss: 0.6745 - val_accuracy: 0.5969
Below is my data distribution
What can be the possible reasons here?
When you say other methods yield better performance, please name them. I feel that combining a CNN with an LSTM might be tricky...
Please check padding, strides, etc.
Also, a learning rate of 0.00001 will converge very slowly over the course of training.
There are several questions to answer before one can debug this:
Is your dataset made of 2D or 3D images?
What is the dimension of each image? You have mentioned it, but what is the dimension when the data is arranged in 2D versus 3D format?
Does your data have a temporal component, i.e. is there any relationship between one image and the next?
That being said, if the individual images in your dataset are uncorrelated with each other, then I do not understand why an LSTM or any kind of recurrent architecture is being used. If you have a 3D dataset, use a 3D convolutional network instead, as sketched below.
LSTMs/RNNs are used to model temporal dependencies in the input, for example a sentence where the next word depends on the previous words. In your case, the LSTM is trying to model relationships between images that may not exist, and the information from those 30 images (which seem to be the "temporal" dimension) is being bottlenecked into the final LSTM timestep, which is then used for classification.
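For example, a minimal sketch of a 3D-convolutional alternative (the filter counts and pooling sizes are arbitrary assumptions; only the input shape follows the question):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv3D, MaxPooling3D, GlobalAveragePooling3D, Dense, Dropout

# Sketch: treat the 30 slices as a depth dimension instead of a time axis.
model3d = Sequential([
    Conv3D(32, (3, 3, 3), activation='relu', padding='same', input_shape=(30, 512, 512, 3)),
    MaxPooling3D(pool_size=(1, 2, 2)),
    Dropout(0.3),
    Conv3D(64, (3, 3, 3), activation='relu', padding='same'),
    MaxPooling3D(pool_size=(2, 2, 2)),
    Dropout(0.3),
    GlobalAveragePooling3D(),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model3d.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])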
My CNN model kept getting high accuracy/low loss during training but much lower accuracy/higher loss during validation, so I started suspecting that it is overfitting.
I therefore introduced a few dropout layers as well as some image augmentation. I have also tried monitoring val_loss after each epoch, using ReduceLROnPlateau and EarlyStopping.
Although those measures helped improve validation accuracy a bit, I'm still nowhere close to the desired result and I'm honestly running out of ideas. This is the result I'm obtaining right now:
Epoch 9/30
999/1000 [============================>.] - ETA: 0s - loss: 0.0072 - accuracy: 0.9980
Epoch 9: ReduceLROnPlateau reducing learning rate to 1.500000071246177e-05.
1000/1000 [==============================] - 19s 19ms/step - loss: 0.0072 - accuracy: 0.9980 - val_loss: 2.2994 - val_accuracy: 0.6570 - lr: 1.5000e-04
Epoch 10/30
1000/1000 [==============================] - 19s 19ms/step - loss: 0.0045 - accuracy: 0.9985 - val_loss: 2.2451 - val_accuracy: 0.6560 - lr: 1.5000e-05
Epoch 11/30
1000/1000 [==============================] - 19s 19ms/step - loss: 0.0026 - accuracy: 0.9995 - val_loss: 2.6080 - val_accuracy: 0.6540 - lr: 1.5000e-05
Epoch 12/30
1000/1000 [==============================] - 19s 19ms/step - loss: 0.0018 - accuracy: 1.0000 - val_loss: 2.8192 - val_accuracy: 0.6560 - lr: 1.5000e-05
Epoch 13/30
1000/1000 [==============================] - 19s 19ms/step - loss: 0.0013 - accuracy: 1.0000 - val_loss: 2.8216 - val_accuracy: 0.6570 - lr: 1.5000e-05
32/32 [==============================] - 1s 23ms/step - loss: 2.8216 - accuracy: 0.6570
Am I wrong to assume that overfitting is still the problem that prevents my model from scoring high on validation and test data?
Or is there something fundamentally wrong with my architecture?
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# prevent overfitting, generalize better
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.2)
])

model = tf.keras.models.Sequential()
model.add(data_augmentation)
# same padding, since edges of the pictures often contain valuable information
model.add(layers.Conv2D(64, (3, 3), strides=(1, 1), padding='same', activation='relu', input_shape=(64, 64, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.25))
model.add(layers.Conv2D(32, (3, 3), strides=(1, 1), padding='same', activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.25))
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
# prevent overfitting
model.add(layers.Dropout(0.25))
# 4 output classes, softmax since we want probabilities for each class (summing to 1)
model.add(layers.Dense(4, activation='softmax'))
# not using one-hot encoding, therefore sparse categorical cross-entropy
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.Adam(learning_rate=0.00015),
              metrics=['accuracy'])
Try the code below. I would add a BatchNormalization layer right after the Flatten layer:
model.add(layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001))
For the Dense layer, add regularizers:
model.add(layers.Dense(128, kernel_regularizer=regularizers.l2(0.016),
                       activity_regularizer=regularizers.l1(0.006),
                       bias_regularizer=regularizers.l1(0.006), activation='relu'))
Also, I suggest you use an adjustable learning rate via the Keras callback ReduceLROnPlateau; documentation is here. My recommended code for that is shown below.
rlronp=tf.keras.callbacks.ReduceLROnPlateau( monitor="val_loss", factor=0.4,
patience=2, verbose=1, mode="auto")
I also recommend you use the Keras callback EarlyStopping; documentation for that is here. My recommended code for that is below.
estop=tf.keras.callbacks.EarlyStopping( monitor="val_loss", patience=4,
verbose=1,mode="auto",
restore_best_weights=True)
Before you fit the model, include the code below:
callbacks = [rlronp, estop]
and in model.fit include callbacks=callbacks, as in the sketch that follows.
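A minimal sketch of the fit call with those callbacks wired in (train_ds and val_ds are placeholder names for your actual training and validation data):

history = model.fit(train_ds,                 # placeholder: your training data / generator
                    validation_data=val_ds,   # placeholder: your validation data
                    epochs=30,
                    callbacks=callbacks,
                    verbose=1)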
You can try adding a regularizer to all or some of your layers, for example:
model.add(layers.Conv2D(32, (3,3), strides=(1,1), kernel_regularizer='l1_l2', padding='same', activation = 'relu'))
You could try replacing Dropout with SpatialDropout2D between the conv layers. You could also try more image augmentation, for example GaussianNoise, RandomContrast, or RandomBrightness.
Since you have a very high training accuracy, you could also try to simplify your model (fewer units, for example). A sketch combining these ideas follows.
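A minimal sketch of those suggestions put together (the rates and layer sizes are illustrative assumptions, not tuned values; RandomBrightness needs a fairly recent Keras version):

import tensorflow as tf
from tensorflow.keras import layers

# Heavier augmentation pipeline (rates are illustrative assumptions).
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.2),
    layers.RandomContrast(0.2),
    layers.RandomBrightness(0.2),
    layers.GaussianNoise(0.05),
])

model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),
    data_augmentation,
    # SpatialDropout2D drops whole feature maps instead of individual activations.
    layers.Conv2D(32, (3, 3), padding='same', activation='relu', kernel_regularizer='l1_l2'),
    layers.MaxPooling2D((2, 2)),
    layers.SpatialDropout2D(0.25),
    layers.Conv2D(32, (3, 3), padding='same', activation='relu', kernel_regularizer='l1_l2'),
    layers.MaxPooling2D((2, 2)),
    layers.SpatialDropout2D(0.25),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),   # fewer units: a simpler classification head
    layers.Dropout(0.25),
    layers.Dense(4, activation='softmax'),
])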
How do you add additional layers to a TensorFlow neural network and know that the additional layer will not cause overfitting? It seems that 2 layers won't be very helpful; however, they did give me 91% accuracy, and I wanted 100% accuracy. So I wanted to add 5 to 10 additional layers and try to "overfit" the neural network. Would overfitting always give 100% accuracy on the training set?
The basic building block of a neural network is the layer.
I'm using the model example from https://www.tensorflow.org/tutorials/keras/classification
model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(10)
])
model.fit(train_images, train_labels, epochs=10)
The first layer in this network transforms the format of the images from a two-dimensional array (of 28 by 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data.
In this example, after the pixels are flattened, the network consists of a sequence of two tf.keras.layers.Dense (fully connected) layers. The first Dense layer has 128 nodes (or neurons). The second (and last) layer returns an array of length 10.
QUESTION: I wanted to start by adding ONE additional layer and then overfit with, say, 5 layers. How do I manually add an additional layer and fit it? Can I specify 5 additional layers without having to write out each layer? What's a typical estimate for "overfitting" on an image dataset of a given size, say 30x30 pixels?
Adding one additional layer gave me the same accuracy.
Epoch 1/10
1875/1875 [==============================] - 9s 5ms/step - loss: 0.4866 - accuracy: 0.8266
Epoch 2/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3619 - accuracy: 0.8680
Epoch 3/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3278 - accuracy: 0.8785
Epoch 4/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3045 - accuracy: 0.8874
Epoch 5/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2885 - accuracy: 0.8929
Epoch 6/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2727 - accuracy: 0.8980
Epoch 7/10
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2597 - accuracy: 0.9014
Epoch 8/10
1875/1875 [==============================] - 9s 5ms/step - loss: 0.2475 - accuracy: 0.9061
Epoch 9/10
1875/1875 [==============================] - 9s 5ms/step - loss: 0.2386 - accuracy: 0.9099
Epoch 10/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.2300 - accuracy: 0.9125
You can add layers to a neural network as follows.
model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(10)
])
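And if you want several identical layers without writing each one out (as asked above), you can add them in a loop; a minimal sketch, with the layer count as an arbitrary choice:

import tensorflow as tf

n_hidden_layers = 5   # arbitrary choice for illustration

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))
for _ in range(n_hidden_layers):
    model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(10))

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])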
Overfitting:
Overfitting occurs when a model fits too closely to its training data. When this happens, the algorithm cannot perform accurately on unseen data. For example, if our model reaches 95% accuracy on the training set but only 45% accuracy on the test set, then the model is overfitting its training data. It does not always give 100% accuracy on the training set.
It can be identified by checking validation metrics such as accuracy and loss. When the model is affected by overfitting, the validation metrics typically improve up to a point and then stagnate or start to decline, while the training metrics keep improving; that divergence is the sign of overfitting. For more information, please refer to this.
I am trying to understand the TensorFlow text classification example at https://www.tensorflow.org/tutorials/keras/text_classification. They define the model as follows:
model = tf.keras.Sequential([
layers.Embedding(max_features + 1, embedding_dim),
layers.Dropout(0.2),
layers.GlobalAveragePooling1D(),
layers.Dropout(0.2),
layers.Dense(1)])
To the best of my knowledge, deep learning models use activation functions, and I wonder which activation function the classification model above uses internally.
Can anyone help me understand that?
As you noted, the model definition is written like this:
model = tf.keras.Sequential([
layers.Embedding(max_features + 1, embedding_dim),
layers.Dropout(0.2),
layers.GlobalAveragePooling1D(),
layers.Dropout(0.2),
layers.Dense(1)])
The dataset used in that tutorial is for binary classification, with labels zero and one. By not defining any activation on the last layer of the model, the original author chose to get the logits rather than probabilities. That is why they used the loss function
model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
...
Now, if we set the last layer's activation to sigmoid (the usual pick for binary classification), then we must set from_logits=False. So, here are the two options to choose from:
With logits: from_logits=True
We take the logits from the last layer, and that is why we set from_logits=True.
model = tf.keras.Sequential([
layers.Embedding(max_features + 1, embedding_dim),
layers.Dropout(0.2),
layers.GlobalAveragePooling1D(),
layers.Dropout(0.2),
layers.Dense(1, activation=None)])
model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
optimizer='adam',
metrics=['accuracy'])
history = model.fit(
train_ds, verbose=2,
validation_data=val_ds,
epochs=epochs)
Epoch 1/3
7ms/step - loss: 0.6828 - accuracy: 0.5054 - val_loss: 0.6148 - val_accuracy: 0.5452
Epoch 2/3
7ms/step - loss: 0.5797 - accuracy: 0.6153 - val_loss: 0.4976 - val_accuracy: 0.7406
Epoch 3/3
7ms/step - loss: 0.4664 - accuracy: 0.7734 - val_loss: 0.4197 - val_accuracy: 0.8096
Without logits: from_logits=False
Here we take the probability from the last layer, and that is why we set from_logits=False.
model = tf.keras.Sequential([
layers.Embedding(max_features + 1, embedding_dim),
layers.Dropout(0.2),
layers.GlobalAveragePooling1D(),
layers.Dropout(0.2),
layers.Dense(1, activation='sigmoid')])
model.compile(loss=losses.BinaryCrossentropy(from_logits=False),
optimizer='adam',
metrics=['accuracy'])
history = model.fit(
train_ds, verbose=2,
validation_data=val_ds,
epochs=epochs)
Epoch 1/3
8ms/step - loss: 0.6818 - accuracy: 0.6163 - val_loss: 0.6135 - val_accuracy: 0.7736
Epoch 2/3
7ms/step - loss: 0.5787 - accuracy: 0.7871 - val_loss: 0.4973 - val_accuracy: 0.8226
Epoch 3/3
8ms/step - loss: 0.4650 - accuracy: 0.8365 - val_loss: 0.4195 - val_accuracy: 0.8472
Now, you may wonder why this tutorial uses logits (i.e. no activation on the last layer). The short answer is that it generally doesn't matter; we can choose either option. The catch is that there is a chance of numerical instability when using from_logits=False. Check this answer for more details.
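As a small practical sketch (model here is the from_logits=True version above, and val_ds stands in for your data): when training on logits, you can still recover probabilities at inference time by passing the outputs through a sigmoid.

import tensorflow as tf

# Assuming `model` is the from_logits=True version defined above:
logits = model.predict(val_ds)          # raw scores, unbounded real values
probs = tf.sigmoid(logits)              # map logits to probabilities in (0, 1)
preds = tf.cast(probs > 0.5, tf.int32)  # threshold at 0.5 for class labels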
This model uses a single activation function at the output (a sigmoid), used to make predictions for a binary classification task.
The task to perform often guides the choice of both loss and activation functions. In this case, therefore, the binary cross-entropy loss function is used, together with the sigmoid activation function (also called the logistic function, which outputs values between 0 and 1 for any real input). This is quite well explained in this post.
In contrast, you can also have multiple activation functions in a neural network, depending on its architecture; it is very common for instance in convolutional neural networks to have an activation function for each convolutional layer, as shown in this tutorial.
I hope to find an answer that clarifies my doubt. I created a convolutional autoencoder this way:
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from tensorflow.keras.models import Model

input_layer = Input((1, 200, 4))

# encoder
x = Conv2D(64, (1, 3), activation='relu', padding='same')(input_layer)
x = MaxPooling2D((1, 2), padding='same')(x)
x = Conv2D(32, (1, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((1, 2), padding='same')(x)
x = Conv2D(32, (1, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((1, 2), padding='same')(x)

# decoder
x = Conv2D(32, (1, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((1, 2))(x)
x = Conv2D(32, (1, 3), activation='relu', padding='same')(x)
x = UpSampling2D((1, 2))(x)
x = Conv2D(64, (1, 3), activation='relu')(x)  # note: no padding argument here, so the width shrinks by 2
x = UpSampling2D((1, 2))(x)
decoded = Conv2D(4, (1, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mae',
                    metrics=['mean_squared_error'])
But when I fit the model with the decoder's last activation being sigmoid, as above, the loss decreases only slightly (and remains unchanged at later epochs), as does the mean_squared_error (using default Adam settings):
autoencoder.fit(train, train, epochs=100, batch_size=256, shuffle=True,
validation_data=(test, test), callbacks=callbacks_list)
Epoch 1/100
97/98 [============================>.] - ETA: 0s - loss: 12.3690 - mean_squared_error: 2090.8232
Epoch 00001: loss improved from inf to 12.36328, saving model to weights.best.hdf5
98/98 [==============================] - 6s 65ms/step - loss: 12.3633 - mean_squared_error: 2089.3044 - val_loss: 12.1375 - val_mean_squared_error: 2029.4445
Epoch 2/100
97/98 [============================>.] - ETA: 0s - loss: 12.3444 - mean_squared_error: 2089.8032
Epoch 00002: loss improved from 12.36328 to 12.34172, saving model to weights.best.hdf5
98/98 [==============================] - 6s 64ms/step - loss: 12.3417 - mean_squared_error: 2089.1536 - val_loss: 12.1354 - val_mean_squared_error: 2029.4530
Epoch 3/100
97/98 [============================>.] - ETA: 0s - loss: 12.3461 - mean_squared_error: 2090.5886
Epoch 00003: loss improved from 12.34172 to 12.34068, saving model to weights.best.hdf5
98/98 [==============================] - 6s 63ms/step - loss: 12.3407 - mean_squared_error: 2089.1526 - val_loss: 12.1351 - val_mean_squared_error: 2029.4374
Epoch 4/100
97/98 [============================>.] - ETA: 0s - loss: 12.3320 - mean_squared_error: 2087.0349
Epoch 00004: loss improved from 12.34068 to 12.34050, saving model to weights.best.hdf5
98/98 [==============================] - 6s 63ms/step - loss: 12.3405 - mean_squared_error: 2089.1489 - val_loss: 12.1350 - val_mean_squared_error: 2029.4448
But both loss and mean_squared_error decrease quickly when I change the decoder's last activation to relu.
Epoch 1/100
97/98 [============================>.] - ETA: 0s - loss: 9.8283 - mean_squared_error: 1267.3282
Epoch 00001: loss improved from inf to 9.82359, saving model to weights.best.hdf5
98/98 [==============================] - 6s 64ms/step - loss: 9.8236 - mean_squared_error: 1266.0548 - val_loss: 8.4972 - val_mean_squared_error: 971.0208
Epoch 2/100
97/98 [============================>.] - ETA: 0s - loss: 8.1906 - mean_squared_error: 910.6423
Epoch 00002: loss improved from 9.82359 to 8.19058, saving model to weights.best.hdf5
98/98 [==============================] - 6s 62ms/step - loss: 8.1906 - mean_squared_error: 910.5417 - val_loss: 7.6558 - val_mean_squared_error: 811.6011
Epoch 3/100
97/98 [============================>.] - ETA: 0s - loss: 7.3522 - mean_squared_error: 736.2031
Epoch 00003: loss improved from 8.19058 to 7.35255, saving model to weights.best.hdf5
98/98 [==============================] - 6s 61ms/step - loss: 7.3525 - mean_squared_error: 736.2403 - val_loss: 6.8044 - val_mean_squared_error: 650.5342
Epoch 4/100
97/98 [============================>.] - ETA: 0s - loss: 6.6166 - mean_squared_error: 621.1281
Epoch 00004: loss improved from 7.35255 to 6.61435, saving model to weights.best.hdf5
98/98 [==============================] - 6s 61ms/step - loss: 6.6143 - mean_squared_error: 620.6105 - val_loss: 6.2180 - val_mean_squared_error: 572.2390
I want to verify whether it is valid to use an all-ReLU activation scheme in the network architecture, as I am new to deep learning.
What you have asked raises another, more fundamental question. Ask yourself: "What do you actually want the model to do?" Predict a real value? Or a value within a certain range? The answer follows from that.
But before that, I should give you a brief overview of what activation functions are all about and why we use them.
The main goal of activation functions is to introduce non-linearity into your model. Since a combination of linear functions is still a linear function, without activation functions a neural network is nothing but one giant linear function, and as a linear function it cannot learn any non-linear behaviour at all. This is the primary purpose of using an activation function.
Another purpose is to limit the range of a neuron's output. The following image shows the sigmoid and ReLU activation functions (the image is collected from here).
These two graphs show exactly what kind of limitations they impose on values passed through them. The sigmoid function maps its output into the interval between 0 and 1, so we can think of it as a probability computed from the input value. Where can we use it? For binary classification, if we assign 0 and 1 to the two classes and use a sigmoid in the output layer, it gives us the probability of an input example belonging to a certain class.
Now coming to ReLU: what does it do? It only allows non-negative values through. All the negative values on the horizontal axis are mapped to 0 on the vertical axis, while for positive values the 45-degree straight line shows that they are passed through unchanged. Basically, ReLU replaces negative values with 0 and keeps non-negative values as they are. Mathematically: relu(value) = max(0, value).
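A tiny NumPy sketch of the two range restrictions just described, purely for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes any real value into (0, 1)

def relu(x):
    return np.maximum(0.0, x)          # clips negatives to 0, keeps the rest

values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0, 255.0])
print(sigmoid(values))  # ~[0.007, 0.269, 0.5, 0.731, 0.993, 1.0]
print(relu(values))     # [0., 0., 0., 1., 5., 255.]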
Now picture a situation: say you want to predict real values which can be positive, zero, or even negative. Would you use a ReLU activation in the output layer just because it looks cool? No, obviously not: if you did, the model could never predict any negative value, because all negative values are clipped to 0.
Coming to your case, I believe this model should predict values that are not limited to the range 0 to 1; it should produce real-valued predictions.
Hence, when you use the sigmoid function, you are forcing the model to output values between 0 and 1, which is not a valid prediction in most cases, and so the model produces large loss and MSE values: it is being forced to predict something nowhere near the actual correct output.
When you use ReLU, it performs better, because ReLU does not change any non-negative value. The model is then free to predict any non-negative value, and there is no longer a bound preventing it from predicting values close to the actual outputs.
But I think the model is meant to predict intensity values, which are likely in the range 0 to 255. In that case no negative values should come out of your model anyway, so technically there is no need for a ReLU activation on the last layer, since it would have no negative values to filter out (if I am not mistaken). But you can still use it, as the official TensorFlow documentation does; it serves only as a safety net, ensuring no negative values come out, while leaving non-negative values untouched.
You can use the relu function as the activation in the final layer.
You can see in the autoencoder example at the official TensorFlow site here.
Use the sigmoid/softmax activation function in the final output layer when you are trying to solve classification problems where your labels are class values.
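Summarising the suggestions above, a short sketch of typical final-layer choices (the layer sizes are placeholders):

from tensorflow.keras import layers

# Reconstruction / regression of non-negative values (e.g. pixel intensities):
output_regression = layers.Conv2D(4, (1, 3), activation='relu', padding='same')

# Binary classification (single probability):
output_binary = layers.Dense(1, activation='sigmoid')

# Multi-class classification (probabilities over n_classes):
n_classes = 10  # placeholder
output_multiclass = layers.Dense(n_classes, activation='softmax')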
I'm training CAT/DOG classifier.
My model is:
from tensorflow.keras import layers, models, optimizers

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu',
                        input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=1e-4),
              metrics=['acc'])
history = model.fit_generator(
train_generator,
steps_per_epoch = 100,
epochs=200,
validation_data=validation_generator,
validation_steps=50)
My val_acc is ~83% and my val_loss ~0.36 between the 130th and 140th epochs (excluding the 136th epoch).
Epoch 130/200
100/100 [==============================] - 69s - loss: 0.3297 - acc: 0.8574 - val_loss: 0.3595 - val_acc: 0.8331
Epoch 131/200
100/100 [==============================] - 68s - loss: 0.3243 - acc: 0.8548 - val_loss: 0.3561 - val_acc: 0.8242
Epoch 132/200
100/100 [==============================] - 71s - loss: 0.3200 - acc: 0.8557 - val_loss: 0.2725 - val_acc: 0.8157
Epoch 133/200
100/100 [==============================] - 71s - loss: 0.3236 - acc: 0.8615 - val_loss: 0.3411 - val_acc: 0.8388
Epoch 134/200
100/100 [==============================] - 70s - loss: 0.3115 - acc: 0.8681 - val_loss: 0.3800 - val_acc: 0.8073
Epoch 135/200
100/100 [==============================] - 70s - loss: 0.3210 - acc: 0.8536 - val_loss: 0.3247 - val_acc: 0.8357
Epoch 137/200
100/100 [==============================] - 66s - loss: 0.3117 - acc: 0.8602 - val_loss: 0.3396 - val_acc: 0.8351
Epoch 138/200
100/100 [==============================] - 70s - loss: 0.3211 - acc: 0.8624 - val_loss: 0.3284 - val_acc: 0.8185
I wonder why this happened in the 136th epoch, when val_loss rose to 0.84:
Epoch 136/200
100/100 [==============================] - 67s - loss: 0.3061 - acc: 0.8712 - val_loss: 0.8448 - val_acc: 0.6881
Was it an extremely unlucky dropout that dropped all the important values from the activation matrix, or something else?
Here is my final result:
How is the model able to recover from this?
Thank you :)
It is normal for the values to fluctuate. In your case, it can be explained by your learning rate and the large number of epochs.
You are training for too long; you have reached a plateau (accuracy isn't improving).
Using big learning rates at the end of training can cause plateauing or convergence issues.
In the image, you can see that with a learning rate of 0.1 the model reaches high accuracy very fast but then plateaus and drops in accuracy. With a learning rate of 0.001, it reaches high accuracy more slowly but keeps improving.
So in your case, I think the problem is the large learning rate towards the end of training. You can use a variable learning rate to get the best of both worlds: large at first, but lower towards the end. For example, once the accuracy stops improving by more than 0.1%, drop the learning rate to 0.0000001.
You can do this using LearningRateScheduler or ReduceLROnPlateau from the Keras callbacks; an example with ReduceLROnPlateau is below, and a LearningRateScheduler sketch follows it.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
patience=5, min_lr=1e-10)
model.fit_generator(
train_generator,
steps_per_epoch = 100,
epochs=200,
validation_data=validation_generator,
validation_steps=50,
callbacks=[reduce_lr])
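Alternatively, a minimal LearningRateScheduler sketch (the schedule itself is only an illustrative assumption; it reuses model, train_generator, and validation_generator from the code above):

import tensorflow as tf

def schedule(epoch, lr):
    # keep the initial rate for the first 100 epochs, then halve it each epoch
    if epoch < 100:
        return lr
    return lr * 0.5

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(schedule, verbose=1)

model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=200,
    validation_data=validation_generator,
    validation_steps=50,
    callbacks=[lr_scheduler])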
The architecture you are using is somewhat similar to a VGG.
The sudden drop you are seeing is due to the fact that your model simply starts to strongly overfit after that epoch.
An additional observation, from personal experience: such a sudden, huge discrepancy between training and validation at an advanced stage of training tends to occur in networks that do not have skip connections. Note that this phenomenon is different from 'mere' overfitting.
Networks that do have skip connections do not exhibit this sudden huge drop (particularly at an advanced stage of training). The main intuition is that the skip connections preserve the flow of gradient information. In a very deep convolutional network without such connections, you can reach a point where there is a sudden drop (even in training accuracy, because of vanishing gradients).
For more about skip/residual connections, read more here: https://www.quora.com/How-do-skip-connections-work-in-a-fully-convolutional-neural-network.
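For illustration, a minimal sketch of a skip (residual) connection in the Keras functional API (the shape and filter counts are arbitrary assumptions):

import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(150, 150, 64))

# Two conv layers, then add the block's input back to its output.
x = layers.Conv2D(64, (3, 3), padding='same', activation='relu')(inputs)
x = layers.Conv2D(64, (3, 3), padding='same')(x)
x = layers.Add()([x, inputs])          # the skip connection
outputs = layers.Activation('relu')(x)

residual_block = tf.keras.Model(inputs, outputs)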
UPDATE (according to the photos uploaded):
The sudden drop is only caused by batch training (hopefully you are not in the case I described above), which we use because we do not have enough memory to fit the entire dataset at once. Fluctuations are normal; it just happened that at that specific epoch the weights had values for which the accuracy decreased a lot. Indeed, decreasing the learning rate would help you obtain better accuracy and validation accuracy, since it would help the neural network 'exit' a possible plateau.