Val_loss in Keras using TFRecord - tensorflow

I was building a model in Keras using TensorFlow's Dataset API and TFRecord. I succeeded in training the model with Keras, but the problem lies in val_loss: it is not showing at all in Keras's progress bar.
if __name__ == '__main__':
    x_train, y_train = input_fn('train_whale_without07.tfrecords')
    x_test, y_test = input_fn('test_whale_without07.tfrecords')
    img_input = layers.Input(tensor=x_train)
    model = CNN(img_input)
    model.compile(optimizer=Adam(lr=0.001), loss='categorical_crossentropy',
                  metrics=[categorical_crossentropy, categorical_accuracy],
                  target_tensors=[y_train])
    model.fit(steps_per_epoch=3000, epochs=EPOCHS, batch_size=None, verbose=1,
              validation_data=([x_test, y_test]))
    model.save('my_model_keras.h5')
The results look like this:
Epoch 1/15
1/3000 [..............................] - ETA: 00:05:12 - loss: 8.1786 - categorical_crossentropy: 8.1786 - categorical_accuracy: 0.0000e+00
Anybody know how to add val_loss?

Validation loss and metrics are only computed at the end of an epoch, not after every training batch, so they won't appear in the progress bar while it is iterating over the training batches; they are shown once the epoch finishes.
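As a minimal, hypothetical illustration (toy NumPy data rather than the TFRecord pipeline from the question), val_loss only appears once an epoch completes, and the per-epoch values are then available in the returned History object:
import numpy as np
import tensorflow as tf

# Toy model just to show where val_loss shows up
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse')
x_train, y_train = np.random.rand(64, 4), np.random.rand(64, 1)
x_val, y_val = np.random.rand(16, 4), np.random.rand(16, 1)

history = model.fit(x_train, y_train, epochs=2,
                    validation_data=(x_val, y_val), verbose=1)
# val_loss is printed at the end of each epoch and stored per epoch here:
print(history.history['val_loss'])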

Related

eager mode and keras.fit have different results

I am trying to convert model.fit() in Keras to eager-mode training. The model is an autoencoder: it has one encoder and two decoders, and the decoders have different loss functions. The loss functions for the decoders are the same in the eager-mode version and in model.fit(); I tried to set everything up the same way as model.fit(), but the resulting losses are different. I would really appreciate some help.
The link for google colab: https://colab.research.google.com/drive/1XNOwJ9oVgs1z9qqXIs_ldnKuSm3Dn2Ud?usp=sharing
Below, the definition and training of the model are shown; I use model.fit() for training. At the end, the output is shown with the loss values.
def fit_ae(x_unlab, p_m, alpha, parameters):
    # Parameters
    _, dim = x_unlab.shape
    epochs = parameters['epochs']
    batch_size = parameters['batch_size']
    # Build model
    inputs = contrib_layers.Input(shape=(dim,))
    # Encoder
    h = contrib_layers.Dense(int(256), activation='relu', name='encoder1')(inputs)
    h = contrib_layers.Dense(int(128), activation='relu', name='encoder2')(h)
    h = contrib_layers.Dense(int(26), activation='relu', name='encoder3')(h)
    # Mask estimator
    output_1 = contrib_layers.Dense(dim, activation='sigmoid', name='mask')(h)
    # Feature estimator
    output_2 = contrib_layers.Dense(dim, activation='sigmoid', name='feature')(h)
    # Projection network
    model = Model(inputs=inputs, outputs=[output_1, output_2])
    model.compile(optimizer='rmsprop',
                  loss={'mask': 'binary_crossentropy',
                        'feature': 'mean_squared_error'},
                  loss_weights={'mask': 1, 'feature': alpha})
    m_unlab = mask_generator(p_m, x_unlab)
    m_label, x_tilde = pretext_generator(m_unlab, x_unlab)
    # Fit model on unlabeled data
    model.fit(x_tilde, {'mask': m_label, 'feature': x_unlab}, epochs=epochs, batch_size=batch_size)
########### OUTPUT
Epoch 1/15
4/4 [==============================] - 1s 32ms/step - loss: 1.0894 - mask_loss: 0.6560 - feature_loss: 0.2167
Epoch 2/15
4/4 [==============================] - 0s 23ms/step - loss: 0.6923 - mask_loss: 0.4336 - feature_loss: 0.1293
Epoch 3/15
4/4 [==============================] - 0s 26ms/step - loss: 0.4720 - mask_loss: 0.3022 - feature_loss: 0.0849
Epoch 4/15
4/4 [==============================] - 0s 23ms/step - loss: 0.4054 - mask_loss: 0.2581 - feature_loss: 0.0736
In the following code, I implemented the same model in eager mode. I set the optimizer and loss functions to the same ones as in the code above, and the data used for training are the same for both models.
###################################################### MODEL AUTOENCODER ============================================
def eager_ae(x_unlab, p_m, alpha, parameters):
    # import pdb; pdb.set_trace()
    _, dim = x_unlab.shape
    epochs = parameters['epochs']
    batch_size = parameters['batch_size']
    E = keras.Sequential([
        Input(shape=[dim,]),
        Dense(256, activation='relu'),
        Dense(128, activation='relu'),
        Dense(26, activation='relu'),
    ])
    # Mask estimator
    output_1 = keras.Sequential([
        Dense(dim, activation='sigmoid'),
    ])
    # Feature estimator
    output_2 = keras.Sequential([
        Dense(dim, activation='sigmoid'),
    ])
    optimizer = tf.keras.optimizers.RMSprop()
    loss_mask = tf.keras.losses.BinaryCrossentropy()
    loss_feature = tf.keras.losses.MeanSquaredError()
    # Generate corrupted samples
    m_unlab = mask_generator(p_m, x_unlab)
    m_label, x_tilde = pretext_generator(m_unlab, x_unlab)
    for epoch in range(epochs):
        loss_metric = tf.keras.metrics.Mean(name='train_loss')
        len_batch = range(int(x_unlab.shape[0] / batch_size))
        for i in len_batch:
            samples = x_tilde[i * batch_size:(i + 1) * batch_size]
            mask = m_label[i * batch_size:(i + 1) * batch_size]
            # train_step(samples, tgt)
            with tf.GradientTape() as tape:
                latent = E(samples, training=True)
                out_mask = output_1(latent)
                out_feat = output_2(latent)
                # import pdb; pdb.set_trace()
                lm = loss_mask(out_mask, tf.Variable(mask, dtype=tf.float32))
                lf = loss_feature(out_feat, tf.Variable(samples, dtype=tf.float32))
                pred_loss = lm + alpha * lf
            trainable_vars = E.trainable_weights + output_1.trainable_weights + output_2.trainable_weights
            grads = tape.gradient(pred_loss, trainable_vars)
            optimizer.apply_gradients(zip(grads, trainable_vars))
            loss_metric.update_state(pred_loss)
        print(f'Epoch {epoch}, Loss {loss_metric.result()}')
    return E
############# OUTPUT
Epoch 0, Loss 7.902271747589111
Epoch 1, Loss 5.336598873138428
Epoch 2, Loss 2.880791664123535
Epoch 3, Loss 1.9296690225601196
Epoch 4, Loss 1.6377944946289062
Epoch 5, Loss 1.5342860221862793
Epoch 6, Loss 1.5015968084335327
Epoch 7, Loss 1.4912563562393188
The total loss in the first code ends up well below 1 (≈0.25), while the total loss in the second code stays above 1 (≈1.3). I cannot find the issue in my second implementation (the second code).
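One thing to keep in mind when comparing the two totals: the loss printed by model.fit() is the weighted sum defined by loss_weights, i.e. mask_loss + alpha * feature_loss, which is the same formula as pred_loss in the eager loop. A quick sanity check against the epoch-1 numbers above (alpha = 2 is only an assumption inferred for illustration, since its actual value is not shown in the post):
alpha = 2.0                               # assumed value of the 'feature' loss weight
mask_loss, feature_loss = 0.6560, 0.2167  # logged values from epoch 1 of model.fit()
print(mask_loss + alpha * feature_loss)   # ~1.0894, matching the reported total loss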

constant loss values with normal CNNs and transfer learning

I am working on the dataset given in the paper https://arxiv.org/ftp/arxiv/papers/1511/1511.02459.pdf
In this paper, a dataset of images (portraits of people) is labeled with a floating-point number between 1 and 5 (1 = ugly, 5 = good looking). I wanted to work on this dataset and use MobileNetV2 with transfer learning (pretrained on ImageNet) in Tensorflow 2.4.0-dev20201009 with CUDA 11.1 on my RTX 3070 8GB. I don't really see my mistake, but training my model often yields a constant validation loss, for example:
78/78 [==============================] - ETA: 0s - loss: 52145660442.33472020-11-20 13:19:36.796481: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:596] layout failed: Invalid argument: Size of values 2 does not match size of permutation 4 # fanin shape insequential/dense/BiasAdd-0-TransposeNHWCToNCHW-LayoutOptimizer
78/78 [==============================] - 16s 70ms/step - loss: 51654522711.5709 - val_loss: 9.5415
Epoch 2/300
78/78 [==============================] - 4s 52ms/step - loss: 9.4870 - val_loss: 9.5415
Epoch 3/300
78/78 [==============================] - 4s 52ms/step - loss: 9.3986 - val_loss: 9.5415
Epoch 4/300
78/78 [==============================] - 4s 51ms/step - loss: 9.4950 - val_loss: 9.5415
Epoch 5/300
78/78 [==============================] - 4s 52ms/step - loss: 9.4076 - val_loss: 9.5415
Epoch 6/300
78/78 [==============================] - 4s 52ms/step - loss: 9.4993 - val_loss: 9.5415
Epoch 7/300
78/78 [==============================] - 4s 52ms/step - loss: 9.3758 - val_loss: 9.5415
...
The validation loss would remain constant for 300 epochs. My code can be found here below. Let me summarize:
I used transfer-learning from Imagenet and froze the convolutional base of MobileNetV2.
I added a dense layer as the classifier plus 1 output neuron. The loss function I used is MSE. The optimizer in the code is SGD, and I also tried Adam, which could also yield constant loss values on the validation set.
The above behavior (constant val loss) also occurs with different learning rates and with Adam. Sometimes the same learning rate yields a reasonable, non-constant val loss; I assume this is due to the random weight initialization of the dense layers in my classifier. I even tried absurd learning rates like 10, and the values are still constant. If the lr is that high, changes should be clearly visible, but this is not the case. What is wrong?
My code:
import os
from typing import Dict, Any
from PIL import Image
from sklearn.model_selection import GridSearchCV
import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
from tensorflow.keras import layers
from tensorflow import keras
import matplotlib.pyplot as plt
import pickle
import numpy as np
import cv2
import random
#method to create the model
def create_model(IMG_SIZE, lr):
    # Limit memory usage of GPU
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            tf.config.experimental.set_virtual_device_configuration(gpus[0], [
                tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024*7)])
        except RuntimeError as e:
            print(e)
    model = keras.Sequential()
    model.add(MobileNetV2(input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False))
    model.layers[0].trainable = False
    model.add(layers.GlobalAveragePooling2D())
    model.add(tf.keras.layers.Dropout(0.8))
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(1, activation="relu"))
    # use adam or sgd as optimizers
    adam = tf.keras.optimizers.Adam(learning_rate=lr, beta_1=0.9, beta_2=0.98,
                                    epsilon=1e-9)
    sgd = tf.keras.optimizers.SGD(lr=lr, decay=1e-6, momentum=0.5)
    model.compile(optimizer=sgd,
                  loss=tf.losses.mean_squared_error,
                  )
    model.summary()
    return model
#preprocessing
def loadImages(IMG_SIZE):
    path = os.path.join(os.getcwd(), 'data\\Images')
    training_data = []
    labelMap = getLabelMap()
    for img in os.listdir(path):
        out_array = np.zeros((350, 350, 3), np.float32)  # original size of images in the dataset
        try:
            img_array = cv2.imread(os.path.join(path, img))
            img_array = img_array.astype('float32')  # cast to float to prevent normalization errors
            out_array = cv2.normalize(img_array, out_array, 0, 1, cv2.NORM_MINMAX)  # normalize image
            out_array = cv2.resize(out_array, (IMG_SIZE, IMG_SIZE))  # resize, bc we need 224x224 for Imagenet pretrained weights
            training_data.append([out_array, float(labelMap[img])])
        except Exception as e:
            pass
    return training_data
#preprocessing, the txt file All_labels.txt has lines of the form 'filename.jpg 3.2' and 3.2 is the label
def getLabelMap():
    map = {}
    path = os.getcwd()
    path = os.path.join(path, "data\\train_test_files\\All_labels.txt")
    f = open(path, "r")
    for line in f:
        line = line.split()
        map[line[0]] = line[1]
    f.close()
    return map
#not important, in case you want to see the images after preprocessing
def showimg(image):
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    plt.imshow(image)
    plt.show()
#pickle the preprocessed data
def pickle_it(training_set, IMG_SIZE):
    X = []
    Y = []
    for features, label in training_set:
        X.append(features)
        Y.append(label)
    X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 3)
    Y = np.array(Y)
    pickle_out = open("X.pickle", "wb")
    pickle.dump(X, pickle_out)
    pickle_out.close()
    pickle_out = open("Y.pickle", "wb")
    pickle.dump(Y, pickle_out)
    pickle_out.close()
#for prediction after training the model
def betterThan(y, Y):
    Z = np.sort(Y)
    cnt = 0
    for z in Z:
        if z > y:
            break
        else:
            cnt = cnt + 1
    return float(cnt/len(Y))
#for prediction after training the model
def predictImage(image, model, Y):
    img_array = cv2.imread(image)
    img_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
    img_array = np.array(img_array).reshape(-1, IMG_SIZE, IMG_SIZE, 3)
    y = model.predict(img_array)
    per = betterThan(y, Y)
    print('You look better than ' + str(per) + '% of the dataset')
#Main/Driver function
#Preprocessing
IMG_SIZE = 224
training_set=[]
training_set = loadImages(IMG_SIZE)
random.shuffle(training_set)
pickle_it(training_set, IMG_SIZE) #I pickle my data, so that I don't always have to go through the preprocessing
#Load preprocessed data
X = pickle.load(open("X.pickle", "rb"))
Y = pickle.load(open("Y.pickle", "rb"))
#Just to check that the images look correct
showimg(X[0])
# define the grid search parameters, feel free to edit the grids
batch_size = [64]
epochsGrid = [300]
learning_rate = [0.1]
#save models and best parameters found in grid search
size_histories = {}
min_val_loss = 10
best_para = {}
#ignore this, used for bugs on my gpu... You possibly don't need this
config = tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True))
sess = tf.compat.v1.Session(config=config)
#grid search, training the model
for epochs in epochsGrid:
    for batch in batch_size:
        for lr in learning_rate:
            model = create_model(IMG_SIZE, lr)
            model_name = str(epochs) + '_' + str(batch) + '_' + str(lr)
            # train the model with the given hyperparameters
            size_histories[model_name] = model.fit(X, Y, batch_size=batch, epochs=epochs, validation_split=0.1)
            # save model with the best loss value
            if min(size_histories[model_name].history['val_loss']) < min_val_loss:
                min_val_loss = min(size_histories[model_name].history['val_loss'])
                best_para['epoch'] = epochs
                best_para['batch'] = batch
                best_para['lr'] = lr
                model.save('savedModel')
#If you want to make prediction
model = tf.keras.models.load_model("savedModel")
image = os.path.join(os.getcwd(), 'data\\otherImages\\beautifulWomen.jpg')
predictImage(image, model, Y)
EDIT:
I have found the issue. It is the 'relu' in the output neuron. When I change my loss from RMSE to MAPE, I see that I get a 100 percent error on validation. I assume this is because all my validation outputs are 0, which is only possible when the value in the output neuron before the 'relu' is negative. I don't know why this is the case, but removing the 'relu' yields better training.
Does anyone know why 'relu' causes this problem with regression tasks?
If this is your last layer
model.add(layers.Dense(1, activation="relu"))
then your model's final output is y if y > 0 else 0. In its untrained state, your model could very well have y pinned to something like -17 or 17 with fairly equal chance. In the case of -17, the relu converts that to 0 and also sets the gradient to 0, which means the network doesn't learn. Yeah, the network doesn't learn anything from any part of a network where a relu unit outputs 0. In the case of the layer before
model.add(layers.Dense(128, activation="relu"))
there will be a really good chance that about half of the units will fire with a positive value and so they learn, so that layer is fine.
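To make the zero-gradient point concrete, here is a small standalone sketch (not from the original post) using tf.GradientTape: for a negative pre-activation, relu outputs 0 and passes back a gradient of 0, so that unit gets no learning signal:
import tensorflow as tf

z = tf.Variable([-17.0, 17.0])   # hypothetical pre-activation values of the last layer
with tf.GradientTape() as tape:
    y = tf.nn.relu(z)            # -17 is clipped to 0, 17 passes through
print(tape.gradient(y, z).numpy())   # [0. 1.] -> the clipped unit receives no gradient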
What can be done in the case of a bad initialization, or when training reaches a bad state in which the output of that last layer is pushed below 0? Well, what if we just don't use relu? What activation to use? None! Let's look at what that would be:
1: model = keras.Sequential()
2: model.add(MobileNetV2(input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False))
3: model.layers[0].trainable = False
4: model.add(layers.GlobalAveragePooling2D())
5: model.add(tf.keras.layers.Dropout(0.8))
6: model.add(layers.Dense(128, activation="relu"))
7: model.add(layers.Dense(1))
Lines 1-6 are all the same. It is important to note that the output of Line 6 passes through the non-linear relu activation, so there is the capability to learn non-linearities. Line 7, without an activation function, is a linear combination of Line 6, with full ability to generate gradients in both the positive and the negative output region. When backprop is applied to learn the target values of 1 to 5, if the network outputs -17, it can learn to output a larger number. Yeah!
If you'd like to have 2 layers of nonlinearity, I'd suggest the following
1: model = keras.Sequential()
2: model.add(MobileNetV2(input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False))
3: model.layers[0].trainable = False
4: model.add(layers.GlobalAveragePooling2D())
5: model.add(layers.Dense(128, activation="tanh"))
6: model.add(layers.Dense(64, activation="tanh"))
7: model.add(layers.Dense(1))
Ditch the dropout unless you have actual proof that it helps in this very specific network (and right now I suspect you don't). Try tanh as your hidden-layer activation function. It has some nice features: it can be positive or negative, it has a gradient even for large and/or negative inputs, and it acts somewhat like automatic weight regularization. But, importantly, the last layer has no activation function.
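To see the contrast with the relu sketch above, tanh keeps a nonzero gradient even for negative pre-activations (hypothetical values again):
import tensorflow as tf

z = tf.Variable([-2.0, 2.0])
with tf.GradientTape() as tape:
    y = tf.tanh(z)
print(tape.gradient(y, z).numpy())   # ~[0.07 0.07], small but nonzero, unlike relu's hard 0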

Tensorflow/Keras stops using gpu after recompiling model

I'm trying to train my sequential model (RNN->GRU->Dense) with Keras/TensorFlow 2.0 in two phases, with different loss weights in each phase. To change the loss weights, I need to recompile the model between the two phases. My problem is that training becomes much, much slower after the recompilation, and I can see no other explanation than that the GPU is no longer used. Here is the relevant code:
# Build model
input_ = tf.keras.layers.Input(shape=(None, num_features))
masking = tf.keras.layers.Masking(mask_value=0.)(input_)
rnn = tf.keras.layers.SimpleRNN(24, return_sequences=True, name="rnn")(masking)
gru = tf.keras.layers.GRU(16, return_sequences=True, name="gru")(rnn)
dense1 = tf.keras.layers.Dense(5, activation=tf.nn.softmax, name="dense1")(gru)
dense2 = tf.keras.layers.Dense(1, activation=tf.math.sigmoid, name="dense2")(gru)
model = tf.keras.Model(inputs=[input_], outputs=[dense1, dense2])
# Learning rate scheduler: reduce learning rate by a factor of 0.5 when there has been no progress for 7 epochs
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.5, patience=7, min_lr=0.0001)
# Compile and fit, phase 1
optimizer = tf.keras.optimizers.Adam(lr=0.01, clipvalue=0.1)
model.compile(optimizer=optimizer, loss=['categorical_crossentropy', 'binary_crossentropy'], sample_weight_mode="temporal", loss_weights=[0.7, 0.3], metrics=['accuracy'])
model.fit_generator(train_generator(), steps_per_epoch=BATCHES_PER_EPOCH, epochs=375, callbacks=[reduce_lr])
# Recompile and fit, phase 2
optimizer.lr = 0.001
model.compile(optimizer=optimizer, loss=['categorical_crossentropy', 'binary_crossentropy'], sample_weight_mode="temporal", loss_weights=[0.99, 0.01], metrics=['accuracy'])
model.fit_generator(train_generator(), steps_per_epoch=BATCHES_PER_EPOCH, epochs=125, callbacks=[reduce_lr])
Output at end of phase 1 and start of phase 2 shows how training becomes about 5 times slower:
Epoch 374/375
4/4 [==============================] - 5s 1s/step - loss: 0.1177 - dense1_loss: 0.1479 - dense2_loss: 0.0473 - dense1_accuracy: 0.9249 - dense2_accuracy: 0.9784
Epoch 375/375
4/4 [==============================] - 5s 1s/step - loss: 0.1177 - dense1_loss: 0.1479 - dense2_loss: 0.0473 - dense1_accuracy: 0.9249 - dense2_accuracy: 0.9784
Epoch 1/125
4/4 [==============================] - 27s 7s/step - loss: 0.1494 - dense1_loss: 0.1504 - dense2_loss: 0.0478 - dense1_accuracy: 0.9225 - dense2_accuracy: 0.9779
Epoch 2/125
4/4 [==============================] - 24s 6s/step - loss: 0.1603 - dense1_loss: 0.1614 - dense2_loss: 0.0545 - dense1_accuracy: 0.9201 - dense2_accuracy: 0.9750
What could be the explanation? Is the model reorganized in some way when it's recompiled, so TensorFlow can no longer map the operations to the GPU?
(I have tried just changing the loss weights with model.loss_weights = [0.99, 0.01] but that doesn't work - recompilation is necessary.)
Try this:
Build two separate models that share the same layers (and therefore the same weights):
input_ = tf.keras.layers.Input(shape=(None, num_features))
masking = tf.keras.layers.Masking(mask_value=0.)(input_)
rnn = tf.keras.layers.SimpleRNN(24, return_sequences=True, name="rnn")(masking)
gru = tf.keras.layers.GRU(16, return_sequences=True, name="gru")(rnn)
dense1 = tf.keras.layers.Dense(5, activation=tf.nn.softmax, name="dense1")(gru)
dense2 = tf.keras.layers.Dense(1, activation=tf.math.sigmoid, name="dense2")(gru)
model1 = tf.keras.Model(inputs=[input_], outputs=[dense1, dense2])
model2 = tf.keras.Model(inputs=[input_], outputs=[dense1, dense2])
Compile and fit each one separately, with different optimiser instances:
optimizer1 = tf.keras.optimizers.Adam(lr=0.01, clipvalue=0.1)
optimizer2 = tf.keras.optimizers.Adam(lr=0.001, clipvalue=0.1)
model1.compile(optimizer=optimizer1, loss=['categorical_crossentropy', 'binary_crossentropy'], sample_weight_mode="temporal", loss_weights=[0.7, 0.3], metrics=['accuracy'])
model2.compile(optimizer=optimizer2, loss=['categorical_crossentropy', 'binary_crossentropy'], sample_weight_mode="temporal", loss_weights=[0.99, 0.01], metrics=['accuracy'])
model1.fit_generator(train_generator(), steps_per_epoch=BATCHES_PER_EPOCH, epochs=375, callbacks=[reduce_lr])
model2.fit_generator(train_generator(), steps_per_epoch=BATCHES_PER_EPOCH, epochs=125, callbacks=[reduce_lr])
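A note on why this works (not part of the original answer): because model1 and model2 are built from the very same layer objects, they share weights, so whatever model1 learns in phase 1 is exactly the starting point for model2 in phase 2. A quick check, using the layer names defined above:
print(model1.get_layer("gru") is model2.get_layer("gru"))        # True, same layer object
print(model1.get_layer("dense1") is model2.get_layer("dense1"))  # True
# Training model1 therefore updates the weights that model2 continues from.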

After saving a checkpoint with ModelCheckpoint, Keras stopped the training process

I am training a CNN with tf.keras. After saving a checkpoint, Keras didn't start the next epoch.
Note:
1) tf.keras.callbacks.ModelCheckpoint was used as the saver
2) fit_generator() was used for training
def iterate_minibatches(inputs, targets, batchsize):
    assert len(inputs) == len(targets)
    indices = np.arange(len(inputs))
    np.random.shuffle(indices)
    for start_idx in np.arange(0, len(inputs) - batchsize + 1, batchsize):
        excerpt = indices[start_idx:start_idx + batchsize]
        yield load_images(inputs[excerpt], targets[excerpt])

# Model path
model_path = "C:/Users/Paperspace/Desktop/checkpoints/cp.ckpt"
# saver = tf.train.Saver(max_to_keep=3)
cp_callback = tf.keras.callbacks.ModelCheckpoint(model_path,
                                                 verbose=1,
                                                 save_weights_only=True,
                                                 period=2)
tb_callback = TensorBoard(log_dir="./Graph/{}".format(time()))
batch_size = 750
history = model.fit_generator(generator=iterate_minibatches(X_train, Y_train, batch_size),
                              validation_data=iterate_minibatches(X_test, Y_test, batch_size),
                              # validation_data=None,
                              steps_per_epoch=len(X_train)//batch_size,
                              validation_steps=len(X_test)//batch_size,
                              verbose=1,
                              epochs=30,
                              callbacks=[cp_callback, tb_callback]
                              )
Actual result: training stops without raising any error.
Expected result: training continues with the next epoch.
Log:
Epoch 1/30
53/53 [==============================] - 919s 17s/step - loss: 1.2445 - acc: 0.0718
426/426 [==============================] - 7058s 17s/step - loss: 1.7877 - acc: 0.0687 - val_loss: 1.2445 - val_acc: 0.0718
Epoch 2/30
WARNING:tensorflow:Your dataset iterator ran out of data.
Epoch 00002: saving model to C:/Users/Paperspace/Desktop/checkpoints/cp.ckpt
WARNING:tensorflow:This model was compiled with a Keras optimizer (<tensorflow.python.keras.optimizers.Adam object at 0x0000023A913DE470>) but is being saved in TensorFlow format with `save_weights`. The model's weights will be saved, but unlike with TensorFlow optimizers in the TensorFlow format the optimizer's state will not be saved.
Consider using a TensorFlow optimizer from `tf.train`.
WARNING:tensorflow:From C:\Users\Paperspace\Anaconda3\lib\site-packages\tensorflow\python\keras\engine\network.py:1436: update_checkpoint_state (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.train.CheckpointManager to manage checkpoints rather than manually editing the Checkpoint proto.
0/426 [..............................] - ETA: 0s - loss: 0.0000e+00 - acc: 0.0687 - val_loss: 0.0000e+00 - val_acc: 0.0000e+00
At first glance, your generator looks incorrect. Keras generators need a while True: loop in them. Maybe this will work for you:
def iterate_minibatches(inputs, targets, batchsize):
    assert len(inputs) == len(targets)
    indices = np.arange(len(inputs))
    np.random.shuffle(indices)
    while True:
        start = 0
        end = batchsize
        while start < len(inputs):
            excerpt = indices[start:end]
            yield load_images(inputs[excerpt], targets[excerpt])
            start += batchsize
            end += batchsize
A Keras generator has to yield batches in an infinite loop. This change should work, otherwise you can follow a tutorial like this.
def iterate_minibatches(inputs, targets, batchsize):
    assert len(inputs) == len(targets)
    while True:
        indices = np.arange(len(inputs))
        np.random.shuffle(indices)
        for start_idx in np.arange(0, len(inputs) - batchsize + 1, batchsize):
            excerpt = indices[start_idx:start_idx + batchsize]
            yield load_images(inputs[excerpt], targets[excerpt])

Keras validation accuracy much lower than training accuracy even with the same dataset for both training and validation

We tried transfer learning with the Keras ResNet50 application (TensorFlow as backend) on our own dataset of 2000 classes, with 14000 images as the training set and 5261 images as the validation set. The training results we got differ greatly in both loss and accuracy between training and validation. Then we tried using the same images for both training and validation, i.e. trained with 14000 images and validated with the same 14000 images; the results for that attempt are similar, i.e. high training accuracy and low validation accuracy.
Keras version: 2.1.6
Tensorflow version: 1.8.0
Code (same dataset for both training and validation) is below:
from __future__ import print_function
from keras.applications.resnet50 import ResNet50
from keras.applications.resnet50 import preprocess_input, decode_predictions
from keras.models import *
from keras.layers import *
from keras.callbacks import *
from keras.preprocessing.image import ImageDataGenerator
from datetime import datetime
from keras.optimizers import SGD
import numpy as np
batch_size = 28 # tweak to your GPUs capacity
img_height = 224 # ResNetInceptionv2 & Xception like 299, ResNet50 & VGG like 224
img_width = img_height
channels = 3
input_shape = (img_height, img_width, channels)
best_model = 'best_model.h5'
train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_generator = train_datagen.flow_from_directory(
    'data/train',  # this is the target directory
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')
classes = len(train_generator.class_indices)
n_of_train_samples = train_generator.samples
callbacks = [ModelCheckpoint(filepath=best_model, verbose=0, save_best_only=True),
             EarlyStopping(monitor='val_acc', patience=3, verbose=0)]
base_model = ResNet50(input_shape=input_shape, weights='imagenet', include_top=False)
# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional ResNet50 layers
for layer in base_model.layers:
    layer.trainable = False
pool_layer = [layer for layer in base_model.layers if layer.name == 'avg_pool'][0]
base_model = Model(base_model.input, pool_layer.input)
base_model.layers.pop()
dropout=[.25,.25]
dense=1024
last = base_model.output
a = MaxPooling2D(pool_size=(7,7),name='maxpool')(last)
b = AveragePooling2D(pool_size=(7,7),name='avgpool')(last)
x = concatenate([a,b], axis = 1)
x = Flatten()(x)
x = Dense(dense, init='uniform', activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(dropout[0])(x)
x = Dense(classes, activation='softmax')(x)
model = Model(base_model.input, outputs=x)
print("Start time: %s" % str(datetime.now()))
# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer=SGD(lr=1e-2, momentum=0.9), loss='categorical_crossentropy', metrics=['accuracy'])
# train the model on the new data for a few epochs
model.fit_generator(
    train_generator,
    steps_per_epoch=n_of_train_samples//batch_size,
    epochs=3,
    validation_data=train_generator,
    validation_steps=n_of_train_samples//batch_size,
    callbacks=callbacks)
print("End time: %s" % str(datetime.now()))
The training results are as below:
Found 14306 images belonging to 2000 classes.
Start time: 2018-05-21 10:51:34.459545
Epoch 1/3
510/510 [==============================] - 10459s 21s/step - loss: 5.6433 - acc: 0.1538 - val_loss: 9.8465 - val_acc: 0.0024
Epoch 2/3
510/510 [==============================] - 10258s 20s/step - loss: 1.3632 - acc: 0.8550 - val_loss: 10.3264 - val_acc: 0.0044
Epoch 3/3
510/510 [==============================] - 63640s 125s/step - loss: 0.2367 - acc: 0.9886 - val_loss: 10.4537 - val_acc: 0.0034
End time: 2018-05-22 10:17:42.028052
We understand that we shouldn't use the same dataset for both training and validation, but we just could not understand why Keras gives us such large differences in both loss and accuracy between training and validation when the dataset is the same for both.
P.S. We tried the same dataset, i.e. 2000 classes with 14000 images as the training set and 5261 images as the validation set, with the fast.ai library's ResNet50, and there the training loss and validation loss are not very different. Code and results with the fast.ai library are below:
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *
from datetime import datetime
PATH = "data/"
sz=224
arch=resnet50
bs=28
tfms = tfms_from_model(arch, sz)
data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs)
learn = ConvLearner.pretrained(arch, data, precompute=False)
print("Start time: %s" % str(datetime.now()))
learn.fit(1e-2, 5)
print("End time: %s" % str(datetime.now()))
Start time: 2018-05-02 18:08:51.644750
0%| | 1/487 [00:14<2:00:00, 14.81s/it, loss=tensor(7.5704)]
[0. 6.13229 5.2504 0.26458]
[1. 3.70098 2.74378 0.6752 ]
[2. 1.80197 1.08414 0.88106]
[3. 0.83221 0.50391 0.9424 ]
[4. 0.45565 0.31056 0.95554]
End time: 2018-05-03 00:27:13.147758
Not an answer, but a suggestion for seeing the loss/metrics per batch, unaffected by the other batches:
def batchEnd(batch, logs):
    print("\nfinished batch " + str(batch) + ": " + str(logs) + "\n")

metricCallback = LambdaCallback(on_batch_end=batchEnd)
callbacks = [metricCallback,
             ModelCheckpoint(filepath=best_model, verbose=0, save_best_only=True),
             EarlyStopping(monitor='val_acc', patience=3, verbose=0)]
With this, you will see the metrics for each batch without the influence of other batches. (Assuming Keras does some kind of averaging/totaling when it shows the metrics for an epoch).
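For completeness, here is a minimal, self-contained usage sketch of that LambdaCallback idea on hypothetical toy data (a tiny model, not the ResNet50 pipeline from the question), so the per-batch logs can be compared with the averaged epoch line:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import LambdaCallback

# Toy model and data, purely for demonstrating the per-batch logging callback
model = Sequential([Dense(1, input_shape=(4,))])
model.compile(optimizer='sgd', loss='mse')
x, y = np.random.rand(32, 4), np.random.rand(32, 1)

def batchEnd(batch, logs):
    # logs holds the loss of this single batch rather than the running epoch average
    print("\nfinished batch " + str(batch) + ": " + str(logs) + "\n")

model.fit(x, y, batch_size=8, epochs=1,
          callbacks=[LambdaCallback(on_batch_end=batchEnd)])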
Each time you start fitting, it can give different results, because the initial weights are loaded differently (in the multi-threaded environment of the library). And if you have an imbalanced dataset, it is also hard to reason about the correctness of the results. Besides, I always believe that at least 50-100 epochs are needed to get a reasonably reliable result (3 is not sufficient).