Training with Dataset API and numpy array yields completely different results - tensorflow

I have a CNN regression model whose features come in shape (2000, 3000, 1), where 2000 is the total number of samples and each sample is a (3000, 1) 1D array. The batch size is 8, and 20% of the full dataset is used for validation.
However, zipping the features and labels into a tf.data.Dataset gives completely different scores from feeding the numpy arrays in directly.
The tf.data.Dataset code looks like:
# Load features and labels
features = np.array(features) # shape is (2000, 3000, 1)
labels = np.array(labels) # shape is (2000,)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
# Training model
model.fit(train_dataset, validation_data=val_dataset,
          batch_size=8, epochs=1000)
The numpy code looks like:
# Load features and labels
features = np.array(features) # exactly the same as previous
labels = np.array(labels) # exactly the same as previous
# Training model
model.fit(x=features, y=labels, shuffle=True, validation_split=0.2,
          batch_size=8, epochs=1000)
Except for this, other code is exactly the same, for example
# Set global random seed
tf.random.set_seed(0)
np.random.seed(0)
# No preprocessing of feature at all
# Load model (exactly the same)
model = load_model()
# Compile model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.mean_absolute_error, ],
)
The former method, via the tf.data.Dataset API, yields a mean absolute error (MAE) of around 1e-3 on both the training and validation sets, which looks quite suspicious, as the model doesn't have any dropout or regularization to prevent overfitting. On the other hand, feeding the numpy arrays in directly gives a training MAE of around 0.1 and a validation MAE of around 1.
The low MAE of the tf.data.Dataset method looks highly suspicious, but I just couldn't find anything wrong with the code. I could also confirm that the number of training batches is 200 and the number of validation batches is 50, meaning I didn't use the training set for validation.
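For reference, one quick way to verify those batch counts (a minimal sketch using the train_dataset/val_dataset objects above; cardinality is statically known for this kind of pipeline):
# Both splits are already batched, so cardinality counts batches here.
print(train_dataset.cardinality().numpy())  # expected: 200
print(val_dataset.cardinality().numpy())    # expected: 50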
I tried varying the global random seed and using different shuffle seeds, which didn't change the results much. Training was done on NVIDIA V100 GPUs, and I tried TensorFlow versions 2.9, 2.10, and 2.11, which didn't make much difference either.

The problem lies in the default behaviour of the "shuffle" method of tf.data.Dataset, more specifically the reshuffle_each_iteration argument, which defaults to True. That means if I run the following code:
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
model.fit(train_dataset, validation_data=val_dataset, batch_size=8, epochs=1000)
The dataset would actually be reshuffled after each epoch, even though it might not look that way. As a result, the validation data leaks into the training set (in fact, there is no real distinction between the two sets, since the order is reshuffled every epoch).
So make sure to set reshuffle_each_iteration to False if you want to shuffle the dataset and then do a train-val split.
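A minimal sketch of the fix (same pipeline as above, with reshuffle_each_iteration disabled so take/skip always see the same order; batch_size is omitted from fit because the dataset is already batched):
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Shuffle once and keep that order for every epoch.
dataset = dataset.shuffle(buffer_size=2000, reshuffle_each_iteration=False)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)

model.fit(train_dataset, validation_data=val_dataset, epochs=1000)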
UPDATE: TensorFlow has confirmed this issue and a warning will be added to the docs in a future release.
PS: It was a hard lesson for me, as I had been using this model to analyse results for several months (as a graduating MPhil student).

Related

Understanding what RandomizedSearchCV + KerasClassifier do when training

I have a training set on which I would like to train a neural network, using K-folds cross validation.
TL;DR: Given the number of epochs, the set of params to be used, and checking on the test set, how does RandomizedSearchCV train the model? I would think that for one combination of params it trains the model on (K-1) folds for epochs epochs, and then tests it on the remaining fold. But then, what prevents us from overfitting? With "vanilla" training and a constant validation set, Keras checks the validation set after each epoch; is that done here as well? Even though verbose=1, I don't see the scores from the fit on the remaining fold. I saw here that we can add callbacks to the KerasClassifier, but then what happens if the settings of KerasClassifier and RandomizedSearchCV clash? Can I add a callback there to check val_prc, for example? If so, what would happen?
Sorry for the long TL;DR!
Regarding the training procedure, I am using the keras-sklearn interface. I defined the model using
model = KerasClassifier(build_fn=get_model_, epochs=120, batch_size=32, verbose=1)
Where get_model_ is a function that returns a compiled tf.keras model.
Given the model, the training procedure is the following:
params = dict({'l2': [0.1, 0.3, 0.5, 0.8],
               'dropout_rate': [0.1, 0.3, 0.5, 0.8],
               'batch_size': [16, 32, 64, 128],
               'learning_rate': [0.001, 0.01, 0.05, 0.1]})

def trainer(model, X, y, folds, params, verbose=None):
    from keras.wrappers.scikit_learn import KerasClassifier
    from tensorflow.keras.optimizers import Adam
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    if not verbose:
        v = 0
    else:
        v = verbose
    clf = RandomizedSearchCV(model,
                             param_distributions=params,
                             n_jobs=1,
                             scoring="roc_auc",
                             cv=folds,
                             verbose=v)
    # -------------- fit ------------
    grid_result = clf.fit(X, y)
    # summarize results
    print('- ' * 40)
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    print('- ' * 40)

# ------ Training -------- #
trainer(model, X_train, y_train, folds, params, verbose=1)
First, am I using RandomizedSearchCV correctly? Regardless of the number of options for each param, I get the same message: Fitting 5 folds for each of 10 candidates, totalling 50 fits
Second, I have a hard problem with imbalanced data + lack of data. Even so, I get unexpectedly low scores and high loss values.
Lastly, and following the TL;DR, what is the training procedure actually being performed by the above code, assuming that it is correct?
Thanks!
First, am I using RandomizedSearchCV correctly? Regardless of the number of options for each param, I get the same message: Fitting 5 folds for each of 10 candidates, totalling 50 fits
RandomizedSearchCV has an argument n_iter that defaults to 10; it will thus sample 10 parameter configurations, no matter how many possible combinations there are. If you want to run all combinations, use GridSearchCV instead.
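For example (an illustrative snippet, not from the original post), you can make the number of sampled candidates explicit:
clf = RandomizedSearchCV(model,
                         param_distributions=params,
                         n_iter=30,        # sample 30 configurations instead of the default 10
                         n_jobs=1,
                         scoring="roc_auc",
                         cv=folds,
                         verbose=1)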
Second, I have a hard problem with imbalanced data + lack of data. Even so, I get unexpectedly low scores and high loss values.
This is far too broad / ill-posed a question for Stack Overflow.
Lastly, and following the TL;DR, what is the training procedure actually being performed by the above code, assuming that it is correct?
For i = 1 to n_iters (10):
    Get random hyperparameters from provided space
    Split data into 5 equal chunks (X_1, y_1), ..., (X_5, y_5)
    scores = []
    for k = 1 to 5:
        Train model with given hyperparameters on all chunks apart from (X_k, y_k)
        Evaluate the above model on (X_k, y_k)
        Append score to scores
    if avg(scores) > best_score:
        best_score = avg(scores)
        best_model = model
        best_hyperparameters = hyperparameters
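If it helps to see that procedure spelled out in code, here is a rough, runnable sketch using plain scikit-learn primitives. This is illustrative only: make_model is a hypothetical factory that returns a fresh estimator (such as a KerasClassifier) for the sampled params, and scoring is simplified to the estimator's default score.
from sklearn.model_selection import ParameterSampler, KFold
import numpy as np

def manual_randomized_search(make_model, X, y, param_space, n_iter=10, n_splits=5):
    best_score, best_params = -np.inf, None
    for sampled in ParameterSampler(param_space, n_iter=n_iter, random_state=0):
        scores = []
        for train_idx, val_idx in KFold(n_splits=n_splits).split(X):
            model = make_model(**sampled)          # fresh model per fold
            model.fit(X[train_idx], y[train_idx])  # trains for the wrapper's `epochs` epochs
            scores.append(model.score(X[val_idx], y[val_idx]))
        if np.mean(scores) > best_score:
            best_score, best_params = float(np.mean(scores)), sampled
    return best_score, best_params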

Training seq2seq model on Google Colab TPU with big dataset - Keras

I'm trying to train a sequence to sequence model for machine translation using Keras on Google Colab TPU.
I have a dataset which I can load into memory, but I have to preprocess it before feeding it to the model. In particular, I need to convert the target words to one-hot vectors, and with many examples I can't hold the entire converted dataset in memory, so I need to create batches of data.
I'm using this function as a batch generator:
def generate_batch_bert(X_ids, X_masks, y, batch_size=1024):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X_ids), batch_size):
            # batch of encoder and decoder data
            encoder_input_data_ids = X_ids[j:j+batch_size]
            encoder_input_data_masks = X_masks[j:j+batch_size]
            y_decoder = y[j:j+batch_size]
            # decoder target and input for teacher forcing
            decoder_input_data = y_decoder[:, :-1]
            decoder_target_seq = y_decoder[:, 1:]
            # batch of decoder target data
            decoder_target_data = to_categorical(decoder_target_seq, vocab_size_fr)
            # keep only batches with the right number of instances for training on TPU
            if encoder_input_data_ids.shape[0] == batch_size:
                yield ([encoder_input_data_ids, encoder_input_data_masks, decoder_input_data], decoder_target_data)
The problem is that whenever I try to run the fit function as follows:
model.fit(x=generate_batch_bert(X_train_ids, X_train_masks, y_train, batch_size=batch_size),
          steps_per_epoch=train_samples//batch_size,
          epochs=epochs,
          callbacks=callbacks,
          validation_data=generate_batch_bert(X_val_ids, X_val_masks, y_val, batch_size=batch_size),
          validation_steps=val_samples//batch_size)
I get the following error:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
Not sure what's wrong and how I can solve this problem.
EDIT
I tried loading a smaller amount of data into memory, so that the conversion of the target words to one-hot encoding doesn't crash the kernel, and it actually works. So there is obviously something wrong with how I generate batches.
It's hard to tell what's wrong since you don't provide your model definition nor any sample data. However, I'm fairly certain that you're running into the same TensorFlow bug that I recently got bitten by.
The workaround is to use the tensorflow.data API, which works much better with TPUs. Like this:
from tensorflow.data import Dataset
import tensorflow as tf

def map_fn(X_id, X_mask, y):
    decoder_target_data = tf.one_hot(y[1:], vocab_size_fr)
    return (X_id, X_mask, y[:-1]), decoder_target_data

...

X_ids = Dataset.from_tensor_slices(X_ids)
X_masks = Dataset.from_tensor_slices(X_masks)
y = Dataset.from_tensor_slices(y)

ds = Dataset.zip((X_ids, X_masks, y)).map(map_fn).batch(1024)

model.fit(x=ds, ...)
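One detail from the original generator is worth carrying over: it only yielded full batches because TPUs want fixed batch shapes. A possible tf.data equivalent (a sketch building on the pipeline above, not part of the original answer) uses drop_remainder=True, and prefetch can overlap preprocessing with training:
# Same pipeline as above, but drop incomplete batches (mirroring the
# generator's behaviour) and prefetch to keep the TPU fed.
ds = (Dataset.zip((X_ids, X_masks, y))
      .map(map_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .batch(1024, drop_remainder=True)
      .prefetch(tf.data.experimental.AUTOTUNE))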

How can I isolate why my tensorflow model has such a high loss and low accuracy?

The Context:
I am creating a test application that largely replicates the functionality described here.
I was able to run the code found in the tutorial linked above, and I see losses and accuracies that are reasonable, even after just a couple of epochs.
Tutorial Code: Early into the training of the two-headed CNN, losses and accuracy look good
This is because the code starts with the VGG16 model and the already trained weights, and it freezes those layers so that no learning is required for the core classification.
My test code largely replicates the tutorial structure. It uses the exact same dataset, and the already-trained VGG16 weights. However I load the image dataset using generators (rather than pulling all data into memory, as the tutorial does).
You can find how I created those generators in the answer provided here. I had struggled for a while, before I finally got it to a point that I think is correct.
The Problem:
When I train my model, the classification loss and accuracy are as expected; however, the bounding box loss grows and the bounding box accuracy does not improve over the epochs.
My Code: Even after just a couple epochs you see the bounding box loss starting to grow
Further Details:
I've spent a lot of time looking at the (image, target) tuples yielded by the generator, and I think I am handling the yielded data properly (including the unitrect).
A pycharm view of the images and target tuples yielded by generator
In fact I've also added a debug mode that allows me to display the images and rectangles fed into the training session.
A motorcycle with the bounding box as computed from the unit rectangle bounding box loaded from CSV into the dataframe (df); df is an input to flow_from_dataframe
The model I am using:
imodel = tf.keras.applications.vgg16.VGG16(weights=None, include_top=False,
                                           input_tensor=Input(shape=(224, 224, 3)))
imodel.load_weights(weights, by_name=True)
imodel.trainable = False

# flatten the max-pooling output of VGG
flatten = imodel.output
flatten = Flatten()(flatten)

# construct a fully-connected layer head to output the predicted
# bounding box coordinates
bboxHead = Dense(128, activation="relu")(flatten)
bboxHead = Dense(64, activation="relu")(bboxHead)
bboxHead = Dense(32, activation="relu")(bboxHead)
bboxHead = Dense(4, activation="sigmoid",
                 name="bounding_box")(bboxHead)

# construct a second fully-connected layer head, this one to predict
# the class label
softmaxHead = Dense(512, activation="relu")(flatten)
softmaxHead = Dropout(0.5)(softmaxHead)
softmaxHead = Dense(512, activation="relu")(softmaxHead)
softmaxHead = Dropout(0.5)(softmaxHead)
softmaxHead = Dense(len(classes), activation="softmax",
                    name="class_label")(softmaxHead)

# put together our model, which accepts an input image and then outputs
# bounding box coordinates and a class label
model = Model(
    inputs=imodel.input,
    outputs=(bboxHead, softmaxHead))

# define a dictionary to set the loss methods -- categorical
# cross-entropy for the class label head and mean squared error
# for the bounding box head
losses = {
    "class_label": "categorical_crossentropy",
    "bounding_box": "mean_squared_error",
}

# define a dictionary that specifies the weights per loss (both the
# class label and bounding box outputs will receive equal weight)
lossWeights = {
    "class_label": 1.0,
    "bounding_box": 1.0
}

# initialize the optimizer, compile the model, and show the model
# summary
opt = Adam(lr=learning_rate)
model.compile(loss=losses, optimizer=opt, metrics=["accuracy"], loss_weights=lossWeights)
My call to "fit"
model.fit(x=train_generator[0], steps_per_epoch=train_generator[1],
          validation_data=validation_generator[0], validation_steps=validation_generator[1],
          epochs=epochs, verbose=1)
The weights that I load I've used in other experiments and downloaded them from kaggle - (see vgg16_weights_tf_dim_ordering_tf_kernels.h5).
My Generator:
def generate_image_generator(generator, data_directory, df, subset, target_size, batch_size, shuffle, seed):
    genImages = generator.flow_from_dataframe(dataframe=df, directory=data_directory, target_size=target_size,
                                              x_col="file",
                                              y_col=['cls_onehot', 'bbox'],
                                              subset=subset,
                                              class_mode="multi_output",
                                              batch_size=batch_size, shuffle=shuffle, seed=seed)
    while True:
        images, labels = genImages.next()
        targets = {
            'class_label': labels[0],
            'bounding_box': np.array(labels[1], dtype="float32")
        }
        yield images, targets
def get_train_and_validate_generators(self, data_directory, files, max_images, validation_split, shuffle, seed, target_size):
    generator = ImageDataGenerator(validation_split=validation_split,
                                   rescale=1./255.)
    df = get_dataframe(data_directory, files)
    if max_images:
        df = df.head(max_images)
    train_generator = generate_image_generator(generator, data_directory, df, "training",
                                               target_size,
                                               self.batch_size,
                                               shuffle, seed)
    valid_generator = generate_image_generator(generator, data_directory, df, "validation",
                                               target_size,
                                               self.batch_size,
                                               shuffle, seed)
Loading the dataframe from a list of CSV
def get_dataframe(data_directory, files):
    frames = []
    for di in files:
        df = pd.read_csv(data_directory + di["file"])
        frames.append(df)
    df = pd.concat(frames)
    df['cls_onehot'] = df['cls'].str.get_dummies().values.tolist()
    df['bbox'] = df[['sxu', 'syu', 'exu', 'eyu']].values.tolist()
    return df
A snippet of the CSV:
id,file,sx,sy,ex,ey,cls,sxu,syu,exu,eyu,w,h
0,motorcycle.0001.jpg,31,19,233,141,motorcycle,0.1183206106870229,0.11801242236024845,0.8893129770992366,0.8757763975155279,262,161
1,motorcycle.0002.jpg,32,15,232,142,motorcycle,0.12167300380228137,0.09259259259259259,0.8821292775665399,0.8765432098765432,263,162
2,motorcycle.0003.jpg,30,20,234,143,motorcycle,0.11406844106463879,0.12269938650306748,0.8897338403041825,0.8773006134969326,263,163
3,motorcycle.0004.jpg,30,15,231,132,motorcycle,0.11450381679389313,0.1,0.8816793893129771,0.88,262,150
4,motorcycle.0005.jpg,31,19,232,145,motorcycle,0.1183206106870229,0.1144578313253012,0.8854961832061069,0.8734939759036144,262,166
When I load weights from "imagenet", rather than using those I downloaded from Kaggle, I see the very same increase in bounding box loss:
imodel = tf.keras.applications.vgg16.VGG16(weights="imagenet", include_top=False,
                                           input_tensor=Input(shape=(224, 224, 3)))
The Question:
Please provide suggestions on how to isolate this bounding box loss growth problem.
OK, it looks like the problem was not with my generator at all. The code was fine except for one silly oversight: I still had an old call to compile running. I called compile correctly the first time with the composite loss function, then I called it a second time with strictly categorical cross-entropy as the loss, effectively ignoring my bounding boxes.
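For illustration only (this is a sketch of the pitfall, not the exact code from my project), the mistake looks roughly like this; the second compile call silently replaces the loss configuration from the first:
# First (correct) compile: each output head gets its own loss.
model.compile(optimizer=opt,
              loss={"class_label": "categorical_crossentropy",
                    "bounding_box": "mean_squared_error"},
              loss_weights=lossWeights,
              metrics=["accuracy"])

# Leftover second compile: a single loss string is applied to every output,
# so the bounding-box head is no longer trained against mean squared error.
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])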
Anyways, if someone stumbles on this post, I hope they find the complete view of how to do classification and object detection, with a generator function, useful.
I edited the question above with the correct details, so it now reflects the right answer.
I'd still like to get the perspective of experts who have had to dig into the workings of a model to better understand the underlying details that lead to the loss calculation.
Now that I'm starting to understand TensorFlow at a high level, it's clear how to recognize when things are working; it's not clear how to diagnose when things aren't working.

Keras/Tensorflow - Generate predictions in batch for imagenet (I get only one result back)

I am generating imagenet tags for all keyframes in a video with a single call and have this code:
# all keras/tf/mobilenet imports
model_imagenet = MobileNetV2(weights='imagenet')

frames_list = []
for frame in frame_set:
    frame_img = frame.to_image()
    frame_pil = frame_img.resize((224, 224), Image.ANTIALIAS)
    ts = int(frame.pts)
    x = image.img_to_array(frame_pil)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    frames_list.append(x)

print(len(frames_list))
preds_list = model_imagenet.predict_on_batch(frames_list)
print("[*]", preds_list)
print("[*]",preds_list)
The result appears thus:
frames_list count: 125
and the predictions look like this, a single row of 1000 values (the ImageNet classes); shouldn't there be 125 rows?
[[1.15425530e-04 1.83317825e-04 4.28701424e-05 2.87547664e-05
:
7.91769926e-05 1.30803732e-04 4.81895368e-05 3.06891889e-04]]
This generates a prediction for only a single row of the batch. I have tried both predict and predict_on_batch with the same result.
How can I get a bulk prediction for say 200 frames at one go with Keras/Tensorflow/Mobilenet?
ImageNet is a popular database which consists of 1000 different categories.
The dimension of 1000 is natural and to be expected, since for one image the softmax outputs a probability for each of the 1000 classes.
EDIT: For multiple image predictions, you should use predict_generator(). In addition, as of TensorFlow 2.0, if you use the Keras backend, predict_generator() has been deprecated in favor of simple predict, which also allows input data as generators.
E.g. : (from How to use predict_generator with ImageDataGenerator?) :
test_datagen = ImageDataGenerator(rescale=1./255)

# Modify the batch size here
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(200, 200),
    color_mode="rgb",
    shuffle=False,
    class_mode='categorical',
    batch_size=1)

filenames = test_generator.filenames
nb_samples = len(filenames)

predict = model.predict_generator(test_generator, steps=nb_samples)
Please bear in mind that you are unlikely to be able to run a very large number of predictions at once, since you are constrained by the memory of the video card.
Also, note the difference between predict and predict_on_batch: What is the difference between the predict and predict_on_batch methods of a Keras model?
OK, here is how I solved it, hope this helps someone else:
preds_list = model_imagenet.predict(np.vstack(frames_list),batch_size=32)
print("[*]",preds_list)
Please note the np.vstack and adjust the batch_size to whatever your computer is capable of.
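If you also want the human-readable ImageNet tags rather than the raw probabilities (not part of the answer above, but a natural next step), Keras provides a decode_predictions helper for MobileNetV2:
from tensorflow.keras.applications.mobilenet_v2 import decode_predictions
import numpy as np

preds_list = model_imagenet.predict(np.vstack(frames_list), batch_size=32)
# preds_list has shape (num_frames, 1000); decode the top-3 tags per frame
for frame_tags in decode_predictions(preds_list, top=3):
    print([(name, float(score)) for (_, name, score) in frame_tags])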

The established way to use the TF Dataset API in Keras is to feed `model.fit` with `make_one_shot_iterator()`, but this iterator is only good for one epoch

Edit:
To clarify why this question is different from the suggested duplicates: this question follows up on them, asking what exactly Keras is doing with the techniques described there. The suggested duplicates pass a Dataset API make_one_shot_iterator() to model.fit; my follow-up is that make_one_shot_iterator() can only go through the dataset once, yet the solutions given specify several epochs.
This is a follow up to these SO questions
How to Properly Combine TensorFlow's Dataset API and Keras?
Tensorflow keras with tf dataset input
Using tf.data.Dataset as training input to Keras model NOT working
Where "Starting from Tensorflow 1.9, one can pass tf.data.Dataset object directly into keras.Model.fit() and it would act similar to fit_generator". Each example has a TF dataset one shot iterator fed into Kera's model.fit.
An example is given below
# Load mnist training data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
training_set = tfdata_generator(x_train, y_train, is_training=True)

model = # your keras model here
model.fit(
    training_set.make_one_shot_iterator(),
    steps_per_epoch=len(x_train) // 128,
    epochs=5,
    verbose=1)
However, according the the Tensorflow Dataset API guide (here https://www.tensorflow.org/guide/datasets ) :
A one-shot iterator is the simplest form of iterator, which only supports iterating once through a dataset
So it's only good for one epoch. However, the code in those SO questions specifies several epochs, with the example above specifying 5 epochs.
Is there any explanation for this contradiction? Does Keras somehow know that when the one shot iterator has gone through the dataset, it can re-initialize and shuffle the data?
You can simply pass the dataset object to model.fit and Keras will handle the iteration.
Consider one of the pre-made datasets:
train, test = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train[0], train[1]))
This will create a dataset object from the training data of the cifar10 dataset. In this case a parse function isn't needed.
You will need one if you create a dataset from paths to images or from a list of numpy arrays.
dataset = tf.data.Dataset.from_tensor_slices((image_path, labels_path))
In that case you'll need a function to load the actual data from the filename. Numpy arrays can be handled the same way, just without tf.read_file:
def parse_func(filename):
    f = tf.read_file(filename)
    image = tf.image.decode_image(f)
    label = #get label from filename
    return image, label
Then you can shuffle, batch, and map any parse function onto this dataset. The shuffle buffer controls how many examples will be preloaded. Repeat controls the epoch count and is better left at None, so the dataset repeats indefinitely. You can use either the plain batch function or combine mapping and batching with map_and_batch:
dataset = dataset.shuffle(buffer_size).repeat()
dataset = dataset.apply(tf.data.experimental.map_and_batch(map_func=parse_func, batch_size=batch_size, num_parallel_batches=num_parallel_batches))
Then the dataset object can be passed to model.fit:
model.fit(dataset, epochs=epochs, steps_per_epoch=steps_per_epoch)
Note that steps_per_epoch is a necessary parameter in this case; it defines when to start a new epoch, so you'll have to know the epoch size in advance.
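Putting the pieces together, a minimal end-to-end sketch (image_paths, parse_func, model, and epoch_size are assumed to be defined as discussed above):
import tensorflow as tf

# image_paths is assumed to be a list of filenames; parse_func (defined above)
# loads each file and derives its label from the filename.
dataset = tf.data.Dataset.from_tensor_slices(image_paths)

# Shuffle with a finite buffer, repeat indefinitely, then map + batch in one
# fused step.
dataset = dataset.shuffle(buffer_size=1000).repeat()
dataset = dataset.apply(tf.data.experimental.map_and_batch(
    map_func=parse_func, batch_size=32, num_parallel_batches=4))

# Because the dataset repeats forever, steps_per_epoch tells Keras where one
# "epoch" ends; epoch_size is the number of examples that make up one epoch.
model.fit(dataset,
          epochs=5,
          steps_per_epoch=epoch_size // 32)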