How can I isolate why my tensorflow model has such a high loss and low accuracy? - tensorflow

The Context:
I am creating a test application that largely replicates the functionality described here.
I was able to run the code found in the tutorial linked above, and I see losses and accuracies that are reasonable, even after just a couple of epochs.
Tutorial Code: Early into the training of the two-headed CNN, losses and accuracy look good
This is because the code starts with the VGG16 model and the already trained weights, and it freezes those layers so that no learning is required for the core classification.
My test code largely replicates the tutorial structure. It uses the exact same dataset and the already-trained VGG16 weights. However, I load the image dataset using generators (rather than pulling all data into memory, as the tutorial does).
You can find how I created those generators in the answer provided here. I struggled for a while before I finally got it to a point that I think is correct.
The Problem:
When I train my model, the classification loss and accuracy are as expected; however, the bounding box loss grows and the bounding box accuracy does not improve over the epochs.
My Code: Even after just a couple epochs you see the bounding box loss starting to grow
Further Details:
I've spent a lot of time looking at the (image, target) tuples yielded by the generator, and I think I am handling the yielded data properly (including the unitrect).
A pycharm view of the images and target tuples yielded by generator
In fact I've also added a debug mode that allows me to display the images and rectangles fed into the training session.
A motorcycle with the bounding box as computed from the unit rectangle bounding box loaded from CSV into the dataframe (df); df is an input to flow_from_dataframe
The model I am using:
imodel = tf.keras.applications.vgg16.VGG16(weights=None, include_top=False,
                                           input_tensor=Input(shape=(224, 224, 3)))
imodel.load_weights(weights, by_name=True)
imodel.trainable = False
# flatten the max-pooling output of VGG
flatten = imodel.output
flatten = Flatten()(flatten)
# construct a fully-connected layer head to output the predicted
# bounding box coordinates
bboxHead = Dense(128, activation="relu")(flatten)
bboxHead = Dense(64, activation="relu")(bboxHead)
bboxHead = Dense(32, activation="relu")(bboxHead)
bboxHead = Dense(4, activation="sigmoid",
                 name="bounding_box")(bboxHead)
# construct a second fully-connected layer head, this one to predict
# the class label
softmaxHead = Dense(512, activation="relu")(flatten)
softmaxHead = Dropout(0.5)(softmaxHead)
softmaxHead = Dense(512, activation="relu")(softmaxHead)
softmaxHead = Dropout(0.5)(softmaxHead)
softmaxHead = Dense(len(classes), activation="softmax",
                    name="class_label")(softmaxHead)
# put together our model, which accepts an input image and then outputs
# bounding box coordinates and a class label
model = Model(
    inputs=imodel.input,
    outputs=(bboxHead, softmaxHead))
# define a dictionary to set the loss methods -- categorical
# cross-entropy for the class label head and mean squared error
# for the bounding box head
losses = {
    "class_label": "categorical_crossentropy",
    "bounding_box": "mean_squared_error",
}
# define a dictionary that specifies the weights per loss (both the
# class label and bounding box outputs will receive equal weight)
lossWeights = {
    "class_label": 1.0,
    "bounding_box": 1.0
}
# initialize the optimizer, compile the model, and show the model
# summary
opt = Adam(lr=learning_rate)
model.compile(loss=losses, optimizer=opt, metrics=["accuracy"], loss_weights=lossWeights)
My call to "fit":
model.fit(x=train_generator[0], steps_per_epoch=train_generator[1],
          validation_data=validation_generator[0], validation_steps=validation_generator[1],
          epochs=epochs, verbose=1)
The weights I load have been used in other experiments; I downloaded them from Kaggle (see vgg16_weights_tf_dim_ordering_tf_kernels.h5).
My Generator:
def generate_image_generator(generator, data_directory, df, subset, target_size, batch_size, shuffle, seed):
    genImages = generator.flow_from_dataframe(dataframe=df, directory=data_directory, target_size=target_size,
                                              x_col="file",
                                              y_col=['cls_onehot', 'bbox'],
                                              subset=subset,
                                              class_mode="multi_output",
                                              batch_size=batch_size, shuffle=shuffle, seed=seed)
    while True:
        images, labels = genImages.next()
        targets = {
            'class_label': labels[0],
            'bounding_box': np.array(labels[1], dtype="float32")
        }
        yield images, targets

def get_train_and_validate_generators(self, data_directory, files, max_images, validation_split, shuffle, seed, target_size):
    generator = ImageDataGenerator(validation_split=validation_split,
                                   rescale=1./255.)
    df = get_dataframe(data_directory, files)
    if max_images:
        df = df.head(max_images)
    train_generator = generate_image_generator(generator, data_directory, df, "training",
                                               target_size,
                                               self.batch_size,
                                               shuffle, seed)
    valid_generator = generate_image_generator(generator, data_directory, df, "validation",
                                               target_size,
                                               self.batch_size,
                                               shuffle, seed)
    # (presumably the function goes on to compute step counts and return
    # (generator, steps) pairs, which fit() indexes as [0] and [1])
Loading the dataframe from a list of CSVs:
def get_dataframe(data_directory, files):
    frames = []
    for di in files:
        df = pd.read_csv(data_directory + di["file"])
        frames.append(df)
    df = pd.concat(frames)
    df['cls_onehot'] = df['cls'].str.get_dummies().values.tolist()
    df['bbox'] = df[['sxu', 'syu', 'exu', 'eyu']].values.tolist()
    return df
A snippet of the CSV:
id,file,sx,sy,ex,ey,cls,sxu,syu,exu,eyu,w,h
0,motorcycle.0001.jpg,31,19,233,141,motorcycle,0.1183206106870229,0.11801242236024845,0.8893129770992366,0.8757763975155279,262,161
1,motorcycle.0002.jpg,32,15,232,142,motorcycle,0.12167300380228137,0.09259259259259259,0.8821292775665399,0.8765432098765432,263,162
2,motorcycle.0003.jpg,30,20,234,143,motorcycle,0.11406844106463879,0.12269938650306748,0.8897338403041825,0.8773006134969326,263,163
3,motorcycle.0004.jpg,30,15,231,132,motorcycle,0.11450381679389313,0.1,0.8816793893129771,0.88,262,150
4,motorcycle.0005.jpg,31,19,232,145,motorcycle,0.1183206106870229,0.1144578313253012,0.8854961832061069,0.8734939759036144,262,166
When I load weights from "imagenet", rather than using the ones I received from Kaggle, I see the very same increase in bounding box loss:
imodel = tf.keras.applications.vgg16.VGG16(weights="imagenet", include_top=False,
                                           input_tensor=Input(shape=(224, 224, 3)))
The Question:
Please provide suggestions on how to isolate this bounding box loss growth problem.

OK, it looks like the problem was not with my generator at all. The code was fine except for one silly oversight: I still had an old call to compile in place. I called compile correctly the first time with the composite loss functions, then called it a second time with only categorical cross-entropy as the loss, effectively ignoring my bounding boxes.
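For anyone hitting the same symptom, here is a minimal sketch of the mistake (variable names follow the model code above; the exact surrounding code is not reproduced). A second compile() call silently replaces the per-output loss dictionary from the first one:
# first (correct) compile, with one loss per output head
model.compile(loss={"class_label": "categorical_crossentropy",
                    "bounding_box": "mean_squared_error"},
              loss_weights={"class_label": 1.0, "bounding_box": 1.0},
              optimizer=opt, metrics=["accuracy"])

# ... leftover from an earlier experiment, further down in my code:
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
# this second call wins: categorical cross-entropy is now applied to both
# outputs, so the bounding-box head is never trained against its intended
# regression loss -- deleting this stale compile() fixed the training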
Anyway, if someone stumbles on this post, I hope they find the complete view of how to do classification and object detection with a generator function useful.
I edited the question above with the correct details, so it now reflects the right answer.
I'd still like to get the perspective of experts who have had to dig into the workings of a model to better understand the underlying details that lead to the loss calculation.
Now that I'm starting to understand TensorFlow at a high level, it's clear how to recognize when things are working; it's not clear how to diagnose when things aren't.

Related

Training with Dataset API and numpy array yields completely different results

I have a CNN regression model whose features come in a (2000, 3000, 1) shape, where 2000 is the total number of samples, each being a (3000, 1) 1D array. The batch size is 8, and 20% of the full dataset is used for validation.
However, zipping the features and labels into a tf.data.Dataset gives completely different scores from feeding the numpy arrays in directly.
The tf.data.Dataset code looks like:
# Load features and labels
features = np.array(features)  # shape is (2000, 3000, 1)
labels = np.array(labels)      # shape is (2000,)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
# Training model
model.fit(train_dataset, validation_data=val_dataset,
          batch_size=8, epochs=1000)
The numpy code looks like:
# Load features and labels
features = np.array(features)  # exactly the same as previous
labels = np.array(labels)      # exactly the same as previous
# Training model
model.fit(x=features, y=labels, shuffle=True, validation_split=0.2,
          batch_size=8, epochs=1000)
Apart from this, the rest of the code is exactly the same, for example:
# Set global random seed
tf.random.set_seed(0)
np.random.seed(0)
# No preprocessing of feature at all
# Load model (exactly the same)
model = load_model()
# Compile model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.mean_absolute_error, ],
)
The former method via the tf.data.Dataset API yields a mean absolute error (MAE) of around 1e-3 on both the training and validation sets, which looks quite suspicious, as the model doesn't have any dropout or regularization to prevent overfitting. On the other hand, feeding the numpy arrays in directly gives a training MAE around 0.1 and a validation MAE around 1.
The low MAE of the tf.data.Dataset method looks super suspicious, but I just couldn't figure out anything wrong with the code. I could also confirm that the number of training batches is 200 and the number of validation batches is 50, meaning I didn't use the training set for validation.
I tried varying the global random seed and using different shuffle seeds, which didn't change the results much. Training was done on NVIDIA V100 GPUs, and I tried TensorFlow versions 2.9, 2.10, and 2.11, which didn't make much difference.
The problem lies in the default behaviour of the "shuffle" method of tf.data.Dataset, more specifically the reshuffle_each_iteration argument, which defaults to True. This means that if I implement the following code:
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
model.fit(train_dataset, validation_data=val_dataset, batch_size=8, epochs=1000)
the dataset is actually reshuffled after each epoch, even though that may not be apparent from the code. As a result, the validation data leaks into the training set (in fact, there is no real distinction between the two sets, since the order is reshuffled every epoch).
So make sure to set reshuffle_each_iteration to False if you want to shuffle the dataset and then do a train/validation split.
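A minimal sketch of the corrected pipeline (buffer size, batch size, and batch counts taken from the code above; everything else assumed unchanged):
# shuffle once with a fixed order across epochs, so take()/skip()
# always see the same samples and the splits stay disjoint
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000, seed=0, reshuffle_each_iteration=False)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
# (optional) reshuffle only the order of the training batches each epoch
train_dataset = train_dataset.shuffle(buffer_size=200)
model.fit(train_dataset, validation_data=val_dataset, epochs=1000)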
UPDATE: TensorFlow has confirmed this issue, and a warning will be added to future docs.
PS: It was a hard lesson for me, as I had been using the model to analyse results for several months (as a graduating MPhil student).

Why does my validation loss / accuracy fluctuate despite manual test show good results

I am training an EfficientNet lite (from scratch) on a dataset of ~10,000,000 images (128x128x1) with ~6500 classes. My training loss is converging, as is my training accuracy.
However, my test loss/accuracy is fluctuating. When I test the CNN manually on some input, it looks very good and recognizes (nearly) everything correctly.
Because my GPU memory is only 8 GB, I am training with batch size 256 and FP16 calculations.
Now my question is: why does the validation loss/accuracy fluctuate so much, and is there something I can do to correct for that?
Here are some (maybe) important details:
Loading the dataset:
tr_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DATA_PATH,
    labels="inferred",
    label_mode="categorical",
    interpolation="bilinear",
    color_mode="grayscale",
    batch_size=bs,
    image_size=img_size,
    shuffle=True,
    seed=123,
    validation_split=val_split,
    subset="training"
)
My model (based on the official TF implementation):
def instantiate_char_cnn(include_augmentation=False, name=NAME):
    eff_net_lite = EfficientNetLiteB0(
        include_top=True,
        weights=None,
        input_shape=(img_size[0], img_size[1], 1),
        classes=len(ls),
        pooling="avg",
        classifier_activation="softmax",
    )
    if(img_augmentation):
        model = tf.keras.Sequential([
            tf.keras.layers.InputLayer(input_shape=(None, None, 1)),
            PreprocessTFLayer(),
            img_augmentation,
            eff_net_lite,
        ], name=name)
The custom layer for preprocessing:
@tf.function
def preprocess_tf(x):
    """
    Preprocessing for TF Lite.
    Args:
        x : a Tensor(batch_size, height, width, channels) of images to preprocess
    Return:
        normalized and resized Tensor of images
    """
    batch, height, width, channels = x.shape
    # resize images
    x = tf.image.resize(x, img_size, method=tf.image.ResizeMethod.BILINEAR)
    # normalize image between [0, 1]
    x = tf.math.divide(x, tf.math.reduce_max(x))
    return x

class PreprocessTFLayer(tf.keras.layers.Layer):
    def __init__(self, name="preprocess_tf", **kwargs):
        super(PreprocessTFLayer, self).__init__(name=name, **kwargs)
        self.preprocess = preprocess_tf

    def call(self, input):
        return self.preprocess(input)

    def get_config(self):
        config = super(PreprocessTFLayer, self).get_config()
        return config

    def get_prunable_weights(self):
        return []
The keras layers for image augmentation:
from tensorflow.keras.layers.experimental.preprocessing import Resizing, Rescaling, RandomZoom, RandomRotation, RandomTranslation

img_augmentation = tf.keras.Sequential(
    [
        RandomErasing.RandomErasing(probability=0.4),
        # random data augmentation
        RandomZoom(height_factor=(-0.2, 1.0), width_factor=(-0.2, 1.0),
                   fill_mode='constant', interpolation='bilinear', fill_value=0.0),
        RandomTranslation(0.2, 0.2, fill_mode="constant"),
        RandomRotation(factor=(-0.1, 0.1), fill_mode='constant', interpolation='bilinear'),
    ],
    name="img_augmentation"
)
There could be many reasons behind this phenomenon, and human error could be involved. The key is how to troubleshoot; manual inspection alone will often not give you useful hints. Some things to try:
Try your implementation on some simpler datasets, such as ImageNet or CIFAR-100, to see whether the same phenomenon reproduces. This helps you make sure you don't have a bug in your evaluation code.
Randomly shuffle and split your dataset into train, validation, and test sets (a sketch of one way to do this follows this list). Train your model again to see whether the same phenomenon reproduces. This helps you make sure that the distributions of the train, validation, and test sets are close and that the phenomenon is not due to a test-set distribution mismatch.
Turn off FP16 and reduce the batch size to see whether the same phenomenon reproduces. This helps you make sure FP16 is not causing any numerical issues.
Use a more reliable implementation, such as the PyTorch or TensorFlow official neural network implementations (it could be a different network, such as ResNet), for your task to see whether the same phenomenon reproduces. This helps you make sure your model (EfficientNet) implementation is not too problematic.
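A minimal sketch of the explicit shuffle-and-split mentioned above (the arrays, split fractions, and shuffle buffer here are illustrative assumptions, not taken from the question; only the batch size matches it):
import numpy as np
import tensorflow as tf

# hypothetical arrays standing in for the real images/labels; the point is
# only the explicit, reproducible three-way split
images = np.random.rand(1000, 128, 128, 1).astype("float32")
labels = np.random.randint(0, 6500, size=1000)

rng = np.random.default_rng(123)
idx = rng.permutation(len(images))                  # shuffle once, up front
n_train, n_val = int(0.8 * len(idx)), int(0.1 * len(idx))
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

train_ds = tf.data.Dataset.from_tensor_slices((images[train_idx], labels[train_idx])).shuffle(2048).batch(256)
val_ds = tf.data.Dataset.from_tensor_slices((images[val_idx], labels[val_idx])).batch(256)
test_ds = tf.data.Dataset.from_tensor_slices((images[test_idx], labels[test_idx])).batch(256)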

Keras/Tensorflow - Generate predictions in batch for imagenet (I get only one result back)

I am generating imagenet tags for all keyframes in a video with a single call and have this code:
# all keras/tf/mobilenet imports
model_imagenet = MobileNetV2(weights='imagenet')

frames_list = []
for frame in frame_set:
    frame_img = frame.to_image()
    frame_pil = frame_img.resize((224, 224), Image.ANTIALIAS)
    ts = int(frame.pts)
    x = image.img_to_array(frame_pil)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    frames_list.append(x)

print(len(frames_list))
preds_list = model_imagenet.predict_on_batch(frames_list)
print("[*]", preds_list)
The result appears thus:
frames_list count: 125
and the predictions look like this, a single row of 1000 dimensions (the ImageNet classes). Shouldn't it be 125 rows?
[[1.15425530e-04 1.83317825e-04 4.28701424e-05 2.87547664e-05
:
7.91769926e-05 1.30803732e-04 4.81895368e-05 3.06891889e-04]]
This generates a prediction for only a single row of the batch. I have tried both predict and predict_on_batch, with the same result.
How can I get a bulk prediction for say 200 frames at one go with Keras/Tensorflow/Mobilenet?
ImageNet is a popular database which consists of 1000 different categories.
The dimension of 1000 is natural and to be expected, since for one image the softmax outputs a probability for each of the 1000 classes.
EDIT: For multiple image predictions, you should use predict_generator(). In addition, as of TensorFlow 2.0, if you use Keras, predict_generator() has been deprecated in favor of plain predict, which also accepts generators as input data.
E.g. (from How to use predict_generator with ImageDataGenerator?):
test_datagen = ImageDataGenerator(rescale=1./255)

# Modify the batch size here
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(200, 200),
    color_mode="rgb",
    shuffle=False,
    class_mode='categorical',
    batch_size=1)

filenames = test_generator.filenames
nb_samples = len(filenames)

predict = model.predict_generator(test_generator, steps=nb_samples)
Please bear in mind that you are unlikely to be able to run very large batches at once, since you are constrained by the memory of the video card.
Also, note the difference between predict and predict_on_batch: What is the difference between the predict and predict_on_batch methods of a Keras model?
OK, here is how I solved it, hope this helps someone else:
preds_list = model_imagenet.predict(np.vstack(frames_list),batch_size=32)
print("[*]",preds_list)
Please note the np.vstack and adjust the batch_size to whatever your computer is capable of.
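For clarity, a minimal sketch of what the vstack call produces, assuming the 125 frames from the question (the decode_predictions step is just an optional way to read the 1000-dimensional rows; it is not part of the original code):
batch = np.vstack(frames_list)            # 125 arrays of shape (1, 224, 224, 3) -> (125, 224, 224, 3)
preds = model_imagenet.predict(batch, batch_size=32)
print(preds.shape)                        # (125, 1000): one row of class probabilities per frame

# optional: map each row to human-readable ImageNet labels
from tensorflow.keras.applications.mobilenet_v2 import decode_predictions
print(decode_predictions(preds, top=3)[0])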

Tensorflow Polynomial Linear Regression curve fit

I have created this linear regression model using Tensorflow (Keras). However, I am not getting good results: my model is trying to fit the points around a straight line. I believe fitting the points with a degree-n polynomial could give better results. I have googled how to change my model to polynomial regression using Tensorflow Keras, but could not find a good resource. Any recommendation on how to improve the prediction?
I have a large dataset. I shuffled it first and then split it into 80% training and 20% testing. The dataset is also normalized.
1) Building model:
def build_model():
    model = keras.Sequential()
    model.add(keras.layers.Dense(units=300, input_dim=32))
    model.add(keras.layers.Activation('sigmoid'))
    model.add(keras.layers.Dense(units=250))
    model.add(keras.layers.Activation('tanh'))
    model.add(keras.layers.Dense(units=200))
    model.add(keras.layers.Activation('tanh'))
    model.add(keras.layers.Dense(units=150))
    model.add(keras.layers.Activation('tanh'))
    model.add(keras.layers.Dense(units=100))
    model.add(keras.layers.Activation('tanh'))
    model.add(keras.layers.Dense(units=50))
    model.add(keras.layers.Activation('linear'))
    model.add(keras.layers.Dense(units=1))
    # sigmoid tanh softmax relu
    optimizer = tf.train.RMSPropOptimizer(0.001,
                                          decay=0.9,
                                          momentum=0.0,
                                          epsilon=1e-10,
                                          use_locking=False,
                                          centered=False,
                                          name='RMSProp')
    # optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae'])
    return model

model = build_model()
model.summary()
2) Train the model:
class PrintDot(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        if epoch % 100 == 0: print('')
        print('.', end='')

EPOCHS = 500

# Store training stats
history = model.fit(train_data, train_labels, epochs=EPOCHS,
                    validation_split=0.2, verbose=1,
                    callbacks=[PrintDot()])
3) Plot train loss and validation loss (loss curves omitted).
4) Stop when the results no longer improve (plot omitted).
5) Evaluate the result
[loss, mae] = model.evaluate(test_data, test_labels, verbose=0)
#Testing set Mean Abs Error: 1.9020842795676374
6) Predict:
test_predictions = model.predict(test_data).flatten()
7) Prediction error (plot omitted).
Polynomial regression is linear regression with some additional input features, which are polynomial functions of the original input features.
i.e.:
Let the original input features be (x1, x2, x3, ...).
Generate a set of polynomial terms by adding transformations of the original features, for example (x1^2, x2^3, x1^3*x2, ...).
One may decide which functions to include depending on constraints such as intuition about correlation with the target values, computational resources, and training time.
Append these new features to the original input feature vector. The transformed input feature vector now has a size of len(x1, x2, x3, ...) + len(x1^2, x2^3, x1^3*x2, ...).
Further, this updated set of input features (x1, x2, x3, x1^2, x2^3, x1^3*x2, ...) is fed into the normal linear regression model. The ANN's architecture may be tuned again to get the best trained model. A sketch of this feature expansion follows below.
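A minimal sketch of that feature expansion, assuming a 32-column input like the one in the question (sklearn's PolynomialFeatures is just one convenient way to generate the terms; you could also build them by hand with numpy):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from tensorflow import keras

# hypothetical data standing in for the real (normalized) dataset
X = np.random.rand(1000, 32).astype("float32")
y = np.random.rand(1000).astype("float32")

# degree-2 expansion: keeps the 32 original features and adds their
# squares and pairwise products (32 + 528 = 560 features in total)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# a much smaller network than in the question, fed with the expanded features
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_dim=X_poly.shape[1]),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer=keras.optimizers.RMSprop(0.001), metrics=["mae"])
model.fit(X_poly, y, epochs=10, validation_split=0.2, verbose=0)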
PS: I see that your network is huge while the number of inputs is only 32; this is not a common scale of architecture. Even for this particular linear model, reducing the network to one or two hidden layers may help in training better models (this is a suggestion, assuming this particular dataset is similar to other commonly seen regression datasets).
I've actually created polynomial layers for TensorFlow 2.0, though these may not be exactly what you are looking for. If they are, you could use those layers directly or follow the procedure used there to create a more general layer: https://github.com/jloveric/piecewise-polynomial-layers

Tensorflow slim how to specify batch size during training

I'm trying to use the slim interface to create and train a convolutional neural network, but I couldn't figure out how to specify the batch size for training.
During training, my net crashes with an "Out of Memory" error on my graphics card.
So I think there should be a way to handle this condition...
Do I have to split the data and the labels into batches and then explicitly loop, or does slim.learning.train take care of it?
In the code I paste below, train_data is all the data in my training set (a numpy array), and the model definition is not included here.
I had a quick look at the sources but no luck so far...
g = tf.Graph()
with g.as_default():
    # Set up the data loading:
    images = train_data
    labels = tf.contrib.layers.one_hot_encoding(labels=train_labels, num_classes=num_classes)

    # Define the model:
    predictions = model7_2(images, num_classes, is_training=True)

    # Specify the loss function:
    slim.losses.softmax_cross_entropy(predictions, labels)
    total_loss = slim.losses.get_total_loss()
    tf.scalar_summary('losses/total loss', total_loss)

    # Specify the optimization scheme:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)
    train_tensor = slim.learning.create_train_op(total_loss, optimizer)

    slim.learning.train(train_tensor,
                        train_log_dir,
                        number_of_steps=1000,
                        save_summaries_secs=300,
                        save_interval_secs=600)
Any hints or suggestions?
Edit:
I re-read the documentation... and I found this example:
image, label = MyPascalVocDataLoader(...)
images, labels = tf.train.batch([image, label], batch_size=32)
But it's not clear at all how image and label are produced before being passed to tf.train.batch, as the MyPascalVocDataLoader function is not specified...
In my case my dataset is loaded from a sqlite database, and I have the training data and labels as numpy arrays... still confused.
Of course I tried to pass my numpy arrays (converted to constant tensors) to tf.train.batch, like this:
image = tf.constant(train_data)
label = tf.contrib.layers.one_hot_encoding(labels=train_labels, num_classes=num_classes)
images, labels = tf.train.batch([image, label], batch_size=32)
But this seems not to be the right path to follow... it seems that tf.train.batch wants only one element from my dataset (how do I pass that? it does not make sense to me to pass only train_data[0] and train_labels[0]).
Here you can create TFRecords, which is the binary file format used by TensorFlow. Since you have the training images and labels, you can easily create TFRecords for training and validation.
After creating the TFRecords, all you need to do is decode the images from the encoded TFRecords and feed them to your model's input. There you can select the batch size and so on.
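A minimal sketch of that idea using the modern tf.data reader rather than the old slim queue runners (the file name, feature keys, and image shape below are assumptions, not taken from the question):
import tensorflow as tf

def parse_example(serialized):
    # feature keys must match whatever was written into the TFRecord
    features = tf.io.parse_single_example(serialized, {
        "image_raw": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_raw(features["image_raw"], tf.uint8)
    image = tf.reshape(image, [64, 64, 3])              # assumed image shape
    image = tf.cast(image, tf.float32) / 255.0
    return image, features["label"]

dataset = (tf.data.TFRecordDataset("train.tfrecords")   # hypothetical file name
           .map(parse_example)
           .shuffle(1000)
           .batch(32)                                    # the batch size is chosen here
           .prefetch(tf.data.AUTOTUNE))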