Why are training accuracy and validation accuracy different for the same dataset with tensorflow2.0?

I am training with tensorflow2.0 and tensorflow_datasets, but I don't understand why the training accuracy/loss and the validation accuracy/loss are different.
This is my code:
import tensorflow as tf
import tensorflow_datasets as tfds
data_name = 'uc_merced'
dataset = tfds.load(data_name)
# the train_data and the test_data are the same dataset
train_data, test_data = dataset['train'], dataset['train']
def parse(img_dict):
    img = tf.image.resize_with_pad(img_dict['image'], 256, 256)
    # img = img / 255.
    label = img_dict['label']
    return img, label
train_data = train_data.map(parse)
train_data = train_data.batch(96)
test_data = test_data.map(parse)
test_data = test_data.batch(96)
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=21,
                                           input_shape=(256, 256, 3))
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(train_data, epochs=50, verbose=2, validation_data=test_data)
It is very simple and you can run it on your computer. As you can see, my training data and validation data are the same: train_data, test_data = dataset['train'], dataset['train'].
But the training accuracy (loss) is not the same as the validation accuracy (loss). Why does this happen? Is this a bug in tensorflow2.0?
Epoch 1/50
22/22 - 51s - loss: 3.3766 - accuracy: 0.2581 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/50
22/22 - 30s - loss: 1.8221 - accuracy: 0.4590 - val_loss: 123071.9851 - val_accuracy: 0.0476
Epoch 3/50
22/22 - 30s - loss: 1.4701 - accuracy: 0.5405 - val_loss: 12767.8928 - val_accuracy: 0.0519
Epoch 4/50
22/22 - 30s - loss: 1.2113 - accuracy: 0.6071 - val_loss: 3.9311 - val_accuracy: 0.1186
Epoch 5/50
22/22 - 31s - loss: 1.0846 - accuracy: 0.6567 - val_loss: 23.7775 - val_accuracy: 0.1386
Epoch 6/50
22/22 - 31s - loss: 0.9358 - accuracy: 0.7043 - val_loss: 15.3453 - val_accuracy: 0.1543
Epoch 7/50
22/22 - 32s - loss: 0.8566 - accuracy: 0.7243 - val_loss: 8.0415 - val_accuracy: 0.2548

In short, the culprit here is BatchNorm.
Since you have a small dataset and large batch size, you only do 22 updates per epoch. The BatchNorm layer has a default momentum of 0.99, so it takes some time to move the BatchNorm running means/variances to values more appropriate for your dataset (which, given you do not normalise the pixel values away from the [0, 255] range, is pretty far from the typical mean=0, variance=1 sort of range that neural networks are generally designed/initialised to expect).
The reason for the big discrepancy in train vs. validation loss/accuracy is because the training behaviour of batch norm versus the testing behaviour is very different, especially with so few batches. The mean of the data running through the network during training is very far from the running mean accumulated so far, which only updates slowly due to the default BatchNorm momentum/decay of 0.99.
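To make the effect concrete, here is a toy calculation of my own (not from the original answer), using the exponential-moving-average update that BatchNorm applies to its running statistics; the batch mean of ~100 for the un-normalised uc_merced pixels is an assumed value for illustration:
momentum = 0.99
moving_mean = 0.0       # BatchNorm initialises the running mean to 0
batch_mean = 100.0      # assumed rough mean of the un-normalised [0, 255] pixels

for step in range(22):  # one epoch = 22 batches
    moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean

print(moving_mean)      # ~19.8 after one epoch, still far from ~100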
If you reduce your batch size from 96 to, say, 4, you substantially increase the frequency of updates to the BatchNorm running means/variances. Doing this, plus uncommenting the #img = img / 255. line in your data parsing function, alleviates the train/validation discrepancy to a large extent, and gives me this output for three epochs:
Epoch 1/7
525/525 - 51s - loss: 3.2650 - accuracy: 0.1633 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/7
525/525 - 38s - loss: 2.6455 - accuracy: 0.2152 - val_loss: 12.1067 - val_accuracy: 0.2114
Epoch 3/7
525/525 - 38s - loss: 2.5033 - accuracy: 0.2414 - val_loss: 16.9369 - val_accuracy: 0.2095
You can also keep your code the same, and instead modify the keras_applications implementation of Resnet50 to use BatchNormalization(..., momentum=0.9) everywhere. This gives me the following output after two epochs, which I think more or less shows that indeed this is the main cause of your issue:
Epoch 1/2
22/22 [==============================] - 33s 1s/step - loss: 3.1512 - accuracy: 0.2357 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/2
22/22 [==============================] - 16s 748ms/step - loss: 1.7975 - accuracy: 0.4505 - val_loss: 4.1324 - val_accuracy: 0.2810
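If you would rather not edit the keras_applications source, a less invasive alternative (a sketch of my own, not the original answer's code) is to lower the momentum attribute of every BatchNormalization layer after building the model but before compiling; the attribute is read when the layer is first traced, so this should have the same effect as passing momentum=0.9 to the layer constructors:
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None, classes=21,
                                       input_shape=(256, 256, 3))
for layer in model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.momentum = 0.9  # default is 0.99

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])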

Related

Keras model gets worse when fine-tuning

I'm trying to follow the fine-tuning steps described in https://www.tensorflow.org/tutorials/images/transfer_learning#create_the_base_model_from_the_pre-trained_convnets to get a trained model for binary segmentation.
I create an encoder-decoder, with the encoder weights taken from MobileNetV2 and frozen with encoder.trainable = False. Then I define my decoder as described in the tutorial and train the network for 300 epochs using a learning rate of 0.005. I get the following loss value and Jaccard index during the last epochs:
Epoch 297/300
55/55 [==============================] - 85s 2s/step - loss: 0.2443 - jaccard_sparse3D: 0.5556 - accuracy: 0.9923 - val_loss: 0.0440 - val_jaccard_sparse3D: 0.3172 - val_accuracy: 0.9768
Epoch 298/300
55/55 [==============================] - 75s 1s/step - loss: 0.2437 - jaccard_sparse3D: 0.5190 - accuracy: 0.9932 - val_loss: 0.0422 - val_jaccard_sparse3D: 0.3281 - val_accuracy: 0.9776
Epoch 299/300
55/55 [==============================] - 78s 1s/step - loss: 0.2465 - jaccard_sparse3D: 0.4557 - accuracy: 0.9936 - val_loss: 0.0431 - val_jaccard_sparse3D: 0.3327 - val_accuracy: 0.9769
Epoch 300/300
55/55 [==============================] - 85s 2s/step - loss: 0.2467 - jaccard_sparse3D: 0.5030 - accuracy: 0.9923 - val_loss: 0.0463 - val_jaccard_sparse3D: 0.3315 - val_accuracy: 0.9740
I store all the weights of this model and then perform the fine-tuning with the following steps:
model.load_weights('my_pretrained_weights.h5')
model.trainable = True
model.compile(optimizer=Adam(learning_rate=0.00001, name='adam'),
              loss=SparseCategoricalCrossentropy(from_logits=True),
              metrics=[jaccard, "accuracy"])
model.fit(training_generator, validation_data=(val_x, val_y), epochs=5,
          validation_batch_size=2, callbacks=callbacks)
Suddenly the performance of my model is much worse than during the training of the decoder:
Epoch 1/5
55/55 [==============================] - 89s 2s/step - loss: 0.2417 - jaccard_sparse3D: 0.0843 - accuracy: 0.9946 - val_loss: 0.0079 - val_jaccard_sparse3D: 0.0312 - val_accuracy: 0.9992
Epoch 2/5
55/55 [==============================] - 90s 2s/step - loss: 0.1920 - jaccard_sparse3D: 0.1179 - accuracy: 0.9927 - val_loss: 0.0138 - val_jaccard_sparse3D: 7.1138e-05 - val_accuracy: 0.9998
Epoch 3/5
55/55 [==============================] - 95s 2s/step - loss: 0.2173 - jaccard_sparse3D: 0.1227 - accuracy: 0.9932 - val_loss: 0.0171 - val_jaccard_sparse3D: 0.0000e+00 - val_accuracy: 0.9999
Epoch 4/5
55/55 [==============================] - 94s 2s/step - loss: 0.2428 - jaccard_sparse3D: 0.1319 - accuracy: 0.9927 - val_loss: 0.0190 - val_jaccard_sparse3D: 0.0000e+00 - val_accuracy: 1.0000
Epoch 5/5
55/55 [==============================] - 97s 2s/step - loss: 0.1920 - jaccard_sparse3D: 0.1107 - accuracy: 0.9926 - val_loss: 0.0215 - val_jaccard_sparse3D: 0.0000e+00 - val_accuracy: 1.0000
Is there any known reason why this is happening? Is it normal?
Thank you in advance!
OK, I found out what I do differently that makes it NOT necessary to recompile. I do not set encoder.trainable = False. What I do in the code below is equivalent:
for layer in encoder.layers:
    layer.trainable = False
Then train your model. Afterwards you can unfreeze the encoder weights with:
for layer in encoder.layers:
    layer.trainable = True
You do not need to recompile the model. I tested this and it works as expected. You can verify by printing the model summary before and after and looking at the number of trainable parameters. As for changing the learning rate, I find it is best to use the Keras callback ReduceLROnPlateau to automatically adjust the learning rate based on the validation loss. I also recommend the EarlyStopping callback, which monitors the validation loss and halts training if it fails to improve for 'patience' consecutive epochs. Setting restore_best_weights=True loads the weights from the epoch with the lowest validation loss, so you don't have to save and then reload the weights. Set epochs to a large number to ensure this callback activates. The code I use is shown below:
es = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                      verbose=1, restore_best_weights=True)
rlronp = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                              patience=1, verbose=1)
callbacks = [es, rlronp]
In model.fit set callbacks=callbacks
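For completeness, here is a self-contained sketch of the verification step described above (a toy MobileNetV2 encoder plus a small head with random weights; these are assumed stand-ins, not the asker's segmentation model): count the trainable parameters before and after freezing/unfreezing the encoder layers, with the callbacks from above ready to pass to fit.
import tensorflow as tf

encoder = tf.keras.applications.MobileNetV2(input_shape=(96, 96, 3),
                                            include_top=False, weights=None)
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

def count_trainable(m):
    # Sum the sizes of all trainable weight tensors.
    return sum(int(tf.size(w)) for w in m.trainable_weights)

print("before freezing: ", count_trainable(model))

for layer in encoder.layers:   # freeze the encoder
    layer.trainable = False
print("encoder frozen:  ", count_trainable(model))

for layer in encoder.layers:   # unfreeze again for fine-tuning
    layer.trainable = True
print("encoder unfrozen:", count_trainable(model))

# Callbacks from the answer above, passed to fit via callbacks=[es, rlronp].
es = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                      verbose=1, restore_best_weights=True)
rlronp = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                              patience=1, verbose=1)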

Why are the accuracy and validation_accuracy drastically different on the same dataset (no normalization or dropout)?

I'm new to tensorflow 2.0 and I'm running a very simple model that classifies a 1d time series of fixed size (100 values) into one of two classes:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_shape=(100, 1)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
I have a dataset of ~660,000 labeled examples that I feed into the model with batch_size=256. When I train the NN for 10 epochs, using the same data as a validation dataset
history = model.fit(training_dataset,
                    epochs=10,
                    verbose=1,
                    validation_data=training_dataset)
I got the following output
Epoch 1/10
2573/2573 [==============================] - 55s 21ms/step - loss: 0.5271 - acc: 0.7433 - val_loss: 3.4160 - val_acc: 0.4282
Epoch 2/10
2573/2573 [==============================] - 55s 21ms/step - loss: 0.5673 - acc: 0.7318 - val_loss: 3.3634 - val_acc: 0.4282
Epoch 3/10
2573/2573 [==============================] - 55s 21ms/step - loss: 0.5628 - acc: 0.7348 - val_loss: 2.6422 - val_acc: 0.4282
Epoch 4/10
2573/2573 [==============================] - 57s 22ms/step - loss: 0.5589 - acc: 0.7314 - val_loss: 2.6799 - val_acc: 0.4282
Epoch 5/10
2573/2573 [==============================] - 56s 22ms/step - loss: 0.5683 - acc: 0.7278 - val_loss: 2.3266 - val_acc: 0.4282
Epoch 6/10
2573/2573 [==============================] - 55s 21ms/step - loss: 0.5644 - acc: 0.7276 - val_loss: 2.3177 - val_acc: 0.4282
Epoch 7/10
2573/2573 [==============================] - 56s 22ms/step - loss: 0.5664 - acc: 0.7255 - val_loss: 2.3848 - val_acc: 0.4282
Epoch 8/10
2573/2573 [==============================] - 55s 21ms/step - loss: 0.5711 - acc: 0.7237 - val_loss: 2.2369 - val_acc: 0.4282
Epoch 9/10
2573/2573 [==============================] - 55s 22ms/step - loss: 0.5739 - acc: 0.7189 - val_loss: 2.6969 - val_acc: 0.4282
Epoch 10/10
2573/2573 [==============================] - 219s 85ms/step - loss: 0.5778 - acc: 0.7213 - val_loss: 2.5662 - val_acc: 0.4282
How come the accuracy during the training is so different from the validation step, when run on the same dataset? I tried to find some explanation but it seems that such problems usually arise when people use BatchNormalization or Dropout layers, which is not the case here.
Based on the information above, I would assume your data has strong dependencies between examples that are close to each other in the time series.
Therefore, the data flow through the NN is likely like this:
the NN takes a batch, calculates the loss, and updates the weights and biases
the cycle repeats over and over
but since the examples within a batch are not far away from each other in the time series, it is easy for the NN to adjust the weights for that neighbourhood, keeping the loss reasonably low for every next batch
When it is time for validation, the NN just calculates the loss without updating the weights,
so you end up with a network that has learned how to infer on a small portion of the data at a time but does not generalize well to the whole dataset.
That is why the validation error is different from the training error, even on the same dataset.
The list of possible reasons is not limited to this; it is just one assumption.
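One way to test this hypothesis (my suggestion, not part of the answer above) is to shuffle the examples before batching, so that consecutive batches are no longer neighbours in the time series; if the gap between training and validation metrics shrinks, the ordering was the culprit. A minimal sketch with stand-in data (the shapes are assumed from the question):
import tensorflow as tf

# Stand-in arrays for the asker's ~660,000 labelled series of length 100.
features = tf.random.normal((10000, 100, 1))
labels = tf.random.uniform((10000,), maxval=2, dtype=tf.int32)

training_dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=10000, reshuffle_each_iteration=True)  # break the temporal ordering
    .batch(256)
    .prefetch(tf.data.experimental.AUTOTUNE)
)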

Is validation dataset initialized/created every epoch during the training process?

Setup:
U-Net network is trained to process small patches (e.g. 64x64 pixels).
The network is fed with a training dataset and validation dataset using Tensorflow Dataset API.
Small patches are generated by randomly sampling much larger images.
The sampling of image patches takes place during the training process (both training and validation image patches are cropped on the fly).
Tensorflow 2.1 (eager execution mode)
Both training and validation datasets are the same:
dataset = tf.data.Dataset.from_tensor_slices((large_images, large_targets))
dataset = dataset.shuffle(buffer_size=num_large_samples)
dataset = dataset.map(get_patches_from_large_images, num_parallel_calls=num_parallel_calls)
dataset = dataset.unbatch()
dataset = dataset.shuffle(buffer_size=num_small_patches)
dataset = dataset.batch(patches_batch_size)
dataset = dataset.prefetch(1)
dataset = dataset.repeat()
The function get_patches_from_large_images samples a predefined number of small patches from a single large image using tf.image.random_crop. There are two nested loops, for and while. The outer for loop is responsible for generating the predefined number of small patches, and the while loop checks whether a patch randomly generated by tf.image.random_crop meets some predefined criteria (e.g. patches containing only background should be discarded). The inner while loop gives up if it is not able to generate a proper patch within a predefined number of iterations, so we will not get stuck in this loop. This approach is based on the solution presented here.
for i in range(number_of_patches_from_one_large_image):
    num_tries = 0
    patches = []
    while num_tries < max_num_tries_befor_giving_up:
        patch = tf.image.random_crop(large_input_and_target_image, [patch_size, patch_size, 2])
        if patch_meets_some_criterions:
            break
        num_tries = num_tries + 1
    patches.append(patch)
Experiment:
training and validation datasets to feed the model are the same (5 large pairs of input-target images); both datasets produce exactly the same number of small patches from a single large image
batch_size for training and validation is the same and equals 50 image patches,
steps_per_epoch and validation_steps are equal (20 batches)
When training is run with validation_freq=5:
unet_model.fit(dataset_train, epochs=10, steps_per_epoch=20, validation_data = dataset_val, validation_steps=20, validation_freq=5)
Train for 20 steps, validate for 20 steps
Epoch 1/10
20/20 [==============================] - 44s 2s/step - loss: 0.6771 - accuracy: 0.9038
Epoch 2/10
20/20 [==============================] - 4s 176ms/step - loss: 0.4952 - accuracy: 0.9820
Epoch 3/10
20/20 [==============================] - 4s 196ms/step - loss: 0.0532 - accuracy: 0.9916
Epoch 4/10
20/20 [==============================] - 4s 194ms/step - loss: 0.0162 - accuracy: 0.9942
Epoch 5/10
20/20 [==============================] - 42s 2s/step - loss: 0.0108 - accuracy: 0.9966 - val_loss: 0.0081 - val_accuracy: 0.9975
Epoch 6/10
20/20 [==============================] - 1s 36ms/step - loss: 0.0074 - accuracy: 0.9978
Epoch 7/10
20/20 [==============================] - 4s 175ms/step - loss: 0.0053 - accuracy: 0.9985
Epoch 8/10
20/20 [==============================] - 3s 169ms/step - loss: 0.0034 - accuracy: 0.9992
Epoch 9/10
20/20 [==============================] - 3s 171ms/step - loss: 0.0023 - accuracy: 0.9995
Epoch 10/10
20/20 [==============================] - 43s 2s/step - loss: 0.0016 - accuracy: 0.9997 - val_loss: 0.0013 - val_accuracy: 0.9998
We can see that the first epoch and the epochs with validation (every 5th epoch) took much more time than the epochs without validation. The same experiment, but this time with validation run every epoch, gives the following result:
history = unet_model.fit(dataset_train, epochs=10, steps_per_epoch=20, validation_data = dataset_val, validation_steps=20)
Train for 20 steps, validate for 20 steps
Epoch 1/10
20/20 [==============================] - 84s 4s/step - loss: 0.6775 - accuracy: 0.8971 - val_loss: 0.6552 - val_accuracy: 0.9542
Epoch 2/10
20/20 [==============================] - 41s 2s/step - loss: 0.5985 - accuracy: 0.9833 - val_loss: 0.4677 - val_accuracy: 0.9951
Epoch 3/10
20/20 [==============================] - 43s 2s/step - loss: 0.1884 - accuracy: 0.9950 - val_loss: 0.0173 - val_accuracy: 0.9948
Epoch 4/10
20/20 [==============================] - 44s 2s/step - loss: 0.0116 - accuracy: 0.9962 - val_loss: 0.0087 - val_accuracy: 0.9969
Epoch 5/10
20/20 [==============================] - 44s 2s/step - loss: 0.0062 - accuracy: 0.9979 - val_loss: 0.0051 - val_accuracy: 0.9983
Epoch 6/10
20/20 [==============================] - 45s 2s/step - loss: 0.0039 - accuracy: 0.9989 - val_loss: 0.0033 - val_accuracy: 0.9991
Epoch 7/10
20/20 [==============================] - 44s 2s/step - loss: 0.0025 - accuracy: 0.9994 - val_loss: 0.0023 - val_accuracy: 0.9995
Epoch 8/10
20/20 [==============================] - 44s 2s/step - loss: 0.0019 - accuracy: 0.9996 - val_loss: 0.0017 - val_accuracy: 0.9996
Epoch 9/10
20/20 [==============================] - 44s 2s/step - loss: 0.0014 - accuracy: 0.9997 - val_loss: 0.0013 - val_accuracy: 0.9997
Epoch 10/10
20/20 [==============================] - 45s 2s/step - loss: 0.0012 - accuracy: 0.9998 - val_loss: 0.0011 - val_accuracy: 0.9998
Question:
In the first example, we can see that the initialization/creation of the training dataset (dataset_train) took about 40s. However, subsequent epochs (without validation) were shorter and took about 4s. Nevertheless, the duration was extended again to about 40 seconds for the epochs with the validation step. The validation dataset (dataset_val) is exactly the same as the training dataset (dataset_train), so the procedure of its creation/initialization took about 40s. However, I am surprised that every validation step is this expensive. I expected the first validation to take 40s, but the subsequent ones should take about 4s. I thought that the validation dataset would behave like the training dataset, so the first fetch would take a long time but subsequent fetches would be much shorter. Am I right or am I missing something?
Update:
I have checked that creating the iterator from the dataset takes about 40s
dataset_val_it = iter(dataset_val) #40s
If we look inside the fit function, we will see that a data_handler object is created once for the whole training, and it returns the data iterator that is used in the main loop of the training process. The iterator is created by calling the function enumerate_epochs. When the fit function wants to perform the validation process, it calls the evaluate function. Whenever the evaluate function is called, it creates a new data_handler object, which then calls the enumerate_epochs function, which in turn creates the iterator from the dataset. Unfortunately, in the case of complicated datasets, this process is time-consuming.
If you just want a quick fix to speed up your input pipeline, you can try caching the elements of the validation dataset.
Regarding your update that evaluate creates a new data_handler object (and hence a new iterator from the dataset) every time it is called: I've never dug very deep into the tf.data code, but you seem to have a point here. I think it could be interesting to open an issue on GitHub for this.
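To illustrate that quick fix, here is the question's validation pipeline with a cache() step added (my placement of cache() is an assumption; the variable and function names come from the question, so treat this as a sketch rather than a runnable snippet). After the first full pass, the expensively cropped patches are served from the in-memory cache, so recreating the iterator no longer re-triggers patch generation. Note that caching also freezes the validation patches after the first epoch, and the shuffle steps are dropped because ordering does not matter for validation:
dataset_val = tf.data.Dataset.from_tensor_slices((large_images, large_targets))
dataset_val = dataset_val.map(get_patches_from_large_images,
                              num_parallel_calls=num_parallel_calls)
dataset_val = dataset_val.unbatch()
dataset_val = dataset_val.cache()   # added: reuse the generated validation patches
dataset_val = dataset_val.batch(patches_batch_size)
dataset_val = dataset_val.prefetch(1)
dataset_val = dataset_val.repeat()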

Keras: Validation accuracy stays the exact same but validation loss decreases

I know that the problem can't be with the dataset because I've seen other projects use the same dataset.
Here is my data preprocessing code:
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
dataset = pd.read_csv('political_tweets.csv')
dataset.head()
dataset = pd.read_csv('political_tweets.csv')["tweet"].values
y_train = pd.read_csv('political_tweets.csv')["dem_or_rep"].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(dataset, y_train, test_size=0.1)
max_words = 10000
print(max_words)
max_len = 25
tokenizer = Tokenizer(num_words = max_words, filters='!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~\t\n1234567890', lower=False,oov_token="<OOV>")
tokenizer.fit_on_texts(x_train)
x_train = tokenizer.texts_to_sequences(x_train)
x_train = pad_sequences(x_train, max_len, padding='post', truncating='post')
tokenizer.fit_on_texts(x_test)
x_test = tokenizer.texts_to_sequences(x_test)
x_test = pad_sequences(x_test, max_len, padding='post', truncating='post')
And my model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, GRU,
                                     GlobalMaxPooling1D, Dense, Dropout)
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras import regularizers

model = Sequential([
    Embedding(max_words + 1, 64, input_length=max_len),
    Bidirectional(GRU(64, return_sequences=True), merge_mode='concat'),
    GlobalMaxPooling1D(),
    Dense(64, kernel_regularizer=regularizers.l2(0.02)),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
model.summary()
model.compile(loss='binary_crossentropy', optimizer=RMSprop(learning_rate=0.0001), metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=500, verbose=1, shuffle=True, validation_data=(x_test, y_test))
Both of my losses decrease, my training accuracy increases, but the validation accuracy stays at 50% (which is awful considering I am doing a binary classification model).
Epoch 1/500
546/546 [==============================] - 35s 64ms/step - loss: 1.7385 - accuracy: 0.5102 - val_loss: 1.2458 - val_accuracy: 0.5102
Epoch 2/500
546/546 [==============================] - 34s 62ms/step - loss: 0.9746 - accuracy: 0.5137 - val_loss: 0.7886 - val_accuracy: 0.5102
Epoch 3/500
546/546 [==============================] - 34s 62ms/step - loss: 0.7235 - accuracy: 0.5135 - val_loss: 0.6943 - val_accuracy: 0.5102
Epoch 4/500
546/546 [==============================] - 34s 62ms/step - loss: 0.6929 - accuracy: 0.5135 - val_loss: 0.6930 - val_accuracy: 0.5102
Epoch 5/500
546/546 [==============================] - 34s 62ms/step - loss: 0.6928 - accuracy: 0.5135 - val_loss: 0.6931 - val_accuracy: 0.5102
Epoch 6/500
546/546 [==============================] - 34s 62ms/step - loss: 0.6927 - accuracy: 0.5135 - val_loss: 0.6931 - val_accuracy: 0.5102
Epoch 7/500
546/546 [==============================] - 37s 68ms/step - loss: 0.6925 - accuracy: 0.5136 - val_loss: 0.6932 - val_accuracy: 0.5106
Epoch 8/500
546/546 [==============================] - 34s 63ms/step - loss: 0.6892 - accuracy: 0.5403 - val_loss: 0.6958 - val_accuracy: 0.5097
Epoch 9/500
546/546 [==============================] - 35s 63ms/step - loss: 0.6815 - accuracy: 0.5633 - val_loss: 0.7013 - val_accuracy: 0.5116
Epoch 10/500
546/546 [==============================] - 34s 63ms/step - loss: 0.6747 - accuracy: 0.5799 - val_loss: 0.7096 - val_accuracy: 0.5055
I've seen other posts on this topic and they say to add dropout, crossentropy, decrease the learning rate, etc. I have done all of this and none of it works.
Any help is greatly appreciated.
Thanks in advance!
A couple of observations for your problem:
Though not particularly familiar with the dataset, I trust that it is used in many circumstances without problems. However, you could check whether it is balanced. train_test_split() has a parameter called stratify which, if fed the y, ensures that each class is represented proportionally in the training set and the test set (see the sketch after this list).
The phenomenon you see with validation loss and validation accuracy is not out of the ordinary. Imagine that in the first epochs the neural network predicts a ground-truth positive example (GT == 1) with 55% confidence. As training advances, the network learns better and now predicts that same positive example with 90% confidence. Since the threshold for calculating accuracy is 50%, in both situations you have the same accuracy. Nevertheless, the loss has changed significantly, since 90% >> 55%.
Your training seems to advance (slowly but surely). Have you considered using Adam as an off-the-shelf optimizer?
If the low accuracy persists over several epochs, you may very well be suffering from a well-known phenomenon called underfitting, in which your model is unable to capture the dependencies in your data. To mitigate/avoid underfitting, you may want to use a more complex model, e.g. two stacked LSTM/GRU layers.
At this stage, remove the Dropout() layer, since you have underfitting, not overfitting.
Decrease the batch_size. A very big batch_size can lead to poor local minima, rendering your network unable to properly learn/generalize.
If none of these work, try starting with a lower learning rate, say 0.00001 instead of 0.0001.
Revisit the dataset preprocessing steps and make sure the sentences are converted properly.
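A short sketch of the first point (my code, using the file and column names from the question, 'political_tweets.csv', 'tweet' and 'dem_or_rep'): check the class balance, then split with stratify so both splits keep the same class proportions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('political_tweets.csv')
print(df['dem_or_rep'].value_counts(normalize=True))  # rough class-balance check

x = df['tweet'].values
y = df['dem_or_rep'].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, stratify=y)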
I have had a similar issue, and I think it might be because the dropout is right before the output layer. Try moving it one layer earlier.

Keras categorical crossentropy learning stuck by putting all in one category

I was following the tensorflow tutorial on classification but got stuck with the problem that learning stagnates: my trained network ends up in a suboptimal solution, putting all pictures in just one category. My first thought was that this was due to an unbalanced distribution of training pictures across the categories (as also suggested here), so I deleted enough training pictures that the same number of pictures remained in each category. However, the problem did not change. Next I tried different loss functions, different metrics, different optimizers and different layer structures for my model, without any improvement. My model still puts all pictures in just one category after training. Any idea is highly welcome.
Here is one of the models I tried:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(PicHeight, PicWidth, 3)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(number_of_categories, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
And this is the training
Train on 101 samples
Epoch 1/16
101/101 [==============================] - 1s 11ms/sample - loss: 55.8119 - accuracy: 0.1584
Epoch 2/16
101/101 [==============================] - 1s 6ms/sample - loss: 232.9768 - accuracy: 0.1485
Epoch 3/16
101/101 [==============================] - 1s 6ms/sample - loss: 111.9690 - accuracy: 0.1584
Epoch 4/16
101/101 [==============================] - 1s 6ms/sample - loss: 72.1569 - accuracy: 0.1782
Epoch 5/16
101/101 [==============================] - 1s 6ms/sample - loss: 39.3051 - accuracy: 0.1386
Epoch 6/16
101/101 [==============================] - 1s 6ms/sample - loss: 2.6347 - accuracy: 0.0990
Epoch 7/16
101/101 [==============================] - 1s 6ms/sample - loss: 2.3318 - accuracy: 0.1683
Epoch 8/16
101/101 [==============================] - 1s 6ms/sample - loss: 2.5922 - accuracy: 0.2277
Epoch 9/16
101/101 [==============================] - 1s 6ms/sample - loss: 2.0848 - accuracy: 0.1485
Epoch 10/16
101/101 [==============================] - 1s 6ms/sample - loss: 1.9453 - accuracy: 0.1386
Epoch 11/16
101/101 [==============================] - 1s 6ms/sample - loss: 1.9453 - accuracy: 0.1386
Epoch 12/16
101/101 [==============================] - 1s 6ms/sample - loss: 1.9453 - accuracy: 0.1386
Epoch 13/16
101/101 [==============================] - 1s 6ms/sample - loss: 1.9452 - accuracy: 0.1386
Epoch 14/16
101/101 [==============================] - 1s 6ms/sample - loss: 1.9452 - accuracy: 0.1485
Epoch 15/16
101/101 [==============================] - 1s 6ms/sample - loss: 1.9452 - accuracy: 0.1485
Epoch 16/16
101/101 [==============================] - 1s 7ms/sample - loss: 1.9451 - accuracy: 0.1485
25/25 - 0s - loss: 1.9494 - accuracy: 0.1200
The training data has 7 categories with 18 pictures each.
Don't use so many fully connected (FC) layers. They aren't really good at dealing with pictures.
Your dataset is obviously too small for deep learning. Add more training data, or try traditional machine learning such as SVM or logistic regression (LR).
Imbalanced training data won't necessarily have that effect on model performance; it really depends on how imbalanced your data are. If the imbalance is less than 15%, it will be fine. You can definitely improve things with a weighted loss, oversampling, preprocessing/augmentation to create more images, etc.
If you have enough training data and your picture sizes are bigger than 20x20, you should try a CNN (a minimal sketch follows).
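To make the last point concrete, here is a minimal CNN sketch (my suggestion, reusing PicHeight, PicWidth and number_of_categories from the question; the layer sizes are arbitrary and not tuned for this dataset):
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(16, 3, activation='relu',
                        input_shape=(PicHeight, PicWidth, 3)),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, 3, activation='relu'),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation='relu'),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(number_of_categories, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])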