can't reproduce model.fit with GradientTape - tensorflow

I've been trying to investigate into the reason (e.g. by checking weights, gradients and activations during training) why SGD with a 0.001 learning rate worked in training while Adam fails to do so. (Please see my previous post [here](Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)"Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)"))
Note: I'm using the same model from my previous post here as well.
using tf.keras, i trained the neural network using model.fit():
model.compile(optimizer=SGD(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy'])
model.fit(x=ds,
epoch=80,
validation_data=ds_val)
This resulted in a epoch loss graphed below, within the 1st epoch, it's reached a train loss of 0.46 and then ultimately resulting in a train_loss of 0.1241 and val_loss of 0.2849.
I would've used tf.keras.callbacks.Tensorboard(histogram_freq=1) to train the network with both SGD(0.001) and Adam to investigate but it's throwing an InvalidArgumentError on Variable:0, something I can't decipher. So I tried to write a custom training loop using GradientTape and plotting the values.
using tf.GradientTape(), i tried to reproduce the results using the exact same model and dataset, however the epoch loss is training incredibly slowly, reaching train loss of 0.676 after 15 epochs (see graph below), is there something wrong with my implementation? (code below)
#tf.function
def compute_grads(train_batch: Dict[str,tf.Tensor], target_batch: tf.Tensor,
loss_fn: Loss, model: tf.keras.Model):
with tf.GradientTape(persistent=False) as tape:
# forward pass
outputs = model(train_batch)
# calculate loss
loss = loss_fn(y_true=target_batch, y_pred=outputs)
# calculate gradients for each param
grads = tape.gradient(loss, model.trainable_variables)
return grads, loss
BATCH_SIZE = 8
EPOCHS = 15
bce = BinaryCrossentropy()
optimizer = SGD(learning_rate=0.001)
for epoch in tqdm(range(EPOCHS), desc='epoch'):
# - accumulators
epoch_loss = 0.0
for (i, (train_batch, target_dict)) in tqdm(enumerate(ds_train.shuffle(1024).batch(BATCH_SIZE)), desc='step'):
(grads, loss) = compute_grads(train_batch, target_dict['target'], bce, model)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
epoch_loss += loss
avg_epoch_loss = epoch_loss/(i+1)
tensorboard_scalar(writer, name='epoch_loss', data=avg_epoch_loss, step=epoch) # custom helper function
print("Epoch {}: epoch_loss = {}".format(epoch, avg_epoch_loss))
Thanks in advance!

Check if you have shuffle your dataset then the problem may came from the shuffling using the tf.Dataset method. It only shuffled through the dataset one bucket at the time. Using the Keras.Model.fit yielded better results because it probably adds another shuffling.
By adding a shuffling with numpy.random.shuffle it may improve the training performance. From this reference.
The example of applying it into generation of the dataset is:
numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])
np.random.shuffle(numpy_data)
indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)
train_ds = data.Dataset.from_tensor_slices(
(indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)
If this not work you may need to use Dataset .repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):
train_ds = data.Dataset.from_tensor_slices(
(np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000, reshuffle_each_iteration=True
).batch(batch_size, drop_remainder=True
).repeat(epochs_number)
for ix, (examples, labels) in train_ds.enumerate():
train_step(examples, labels)
current_epoch = ix // (len(index_data) // batch_size)
This workaround is not beautiful nor natural, for the moment you can use this to shuffle each epoch. It's a known issue and will be fixed, in the future you can use for epoch in range(epochs_number) instead of .repeat()
The solution provided here may also help a lot. You might want to check it out.
If this is not the case, you may want to speed up the TF2.0 GradientTape. This can be the solution:
TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code.
The usage is pretty straight-forward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with #tf.function.

Related

Regression custom loss return value in Keras with and without custom loop

When a custom loss is defined in a Keras model, online sources seem to indicate that the the loss should return an array of values (a loss for each sample in the batch). Something like this
def custom_loss_function(y_true, y_pred):
squared_difference = tf.square(y_true - y_pred)
return tf.reduce_mean(squared_difference, axis=-1)
model.compile(optimizer='adam', loss=custom_loss_function)
In the example above, I have no idea when or if the model is taking the batch sum or mean with tf.reduce_sum() or tf.reduce_mean()
In another situation when we want to implement a custom training loop with a custom function, the template to follow according to Keras documentation is this
for epoch in range(epochs):
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
with tf.GradientTape() as tape:
y_batch_pred = model(x_batch_train, training=True)
loss_value = custom_loss_function(y_batch_train, y_batch_pred)
grads = tape.gradient(loss_value, model.trainable_weights)
optimizer.apply_gradients(zip(grads, model.trainable_weights))
So by the book, if I understand correctly, we are supposed to take the mean of the batch gradients. Therefore, the loss value above should be a single value per batch.
However, the example will work with both of the following variations:
tf.reduce_mean(squared_difference, axis=-1) # array of loss for each sample
tf.reduce_mean(squared_difference) # mean loss for batch
So, why does the first option (array loss) above still work? Is apply_gradients applying small changes for each value sequentially? Is this wrong although it works?
What is the correct way without a custom loop, and with a custom loop?
Good question. In my opinion, this is not well documented in the TensorFlow/Keras API. By default, if you do not provide a scalar loss_value, TensorFlow will add them up (and the updates are not sequential). Essentially, this is equivalent to summing the losses along the batch axis.
Currently, the losses in the TensorFlow API include a reduction argument (for example, tf.losses.MeanSquaredError) that allows specifying how to aggregate the loss along the batch axis.

How to avoid overfitting in CNN?

I'm making a model for predicting the age of people by analyzing their face. I'm using this pretrained model, and maked a custom loss function and a custom metrics. So I obtain discrete result but I want to improve it. In particular, I noticed that after some epochs the model begin to overfitt on the training set then the val_loss increases. How can I avoid this? I'm already using Dropout, but this doesn't seem to be enough.
I think maybe I should use l1 and l2 but I don't know how.
def resnet_model():
model = VGGFace(model = 'resnet50')#model :{resnet50, vgg16, senet50}
xl = model.get_layer('avg_pool').output
x = keras.layers.Flatten(name='flatten')(xl)
x = keras.layers.Dense(4096, activation='relu')(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(4096, activation='relu')(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(11, activation='softmax', name='predictions')(x)
model = keras.engine.Model(model.input, outputs = x)
return model
model = resnet_model()
initial_learning_rate = 0.0003
epochs = 20; batch_size = 110
num_steps = train_x.shape[0]//batch_size
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
[3*num_steps, 10*num_steps, 16*num_steps, 25*num_steps],
[1e-4, 1e-5, 1e-6, 1e-7, 5e-7]
)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)
model.compile(loss=custom_loss, optimizer=optimizer, metrics=['accuracy', one_off_accuracy])
model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, validation_data=(test_x, test_y))
This is an example of result:
There are many regularization methods to help you avoid overfitting your model:
Dropouts:
Randomly disables neurons during the training, in order to force other neurons to be trained as well.
L1/L2 penalties:
Penalizes weights that change dramatically. This tries to ensure that all parameters will be equally taken into consideration when classifying an input.
Random Gaussian Noise at the inputs:
Adds random gaussian noise at the inputs: x = x + r where r is a random normal value from range [-1, 1]. This will confuse your model and prevent it from overfitting into your dataset, because in every epoch, each input will be different.
Label Smoothing:
Instead of saying that a target is 0 or 1, You can smooth those values (e.g. 0.1 & 0.9).
Early Stopping:
This is a quite common technique for avoiding training your model too much. If you notice that your model's loss is decreasing along with the validation's accuracy, then this is a good sign to stop the training, as your model begins to overfit.
K-Fold Cross-Validation:
This is a very strong technique, which ensures that your model is not fed all the time with the same inputs and is not overfitting.
Data Augmentations:
By rotating/shifting/zooming/flipping/padding etc. an image you make sure that your model is forced to train better its parameters and not overfit to the existing dataset.
I am quite sure there are also more techniques to avoid overfitting. This repository contains many examples of how the above techniques are deployed in a dataset:
https://github.com/kochlisGit/Tensorflow-State-of-the-Art-Neural-Networks
You can try incorporate image augmentation in your training, which increases the "sample size" of your data as well as the "diversity" as #Suraj S Jain mentioned. The official tutorial is here: https://www.tensorflow.org/tutorials/images/data_augmentation

Best practices in Tensorflow 2.0(Training step)

In tensorflow 2.0 you don't have to worry about training phase(batch size, number of epochs etc), because everything can be defined in compile method: model.fit(X_train,Y_train,batch_size = 64,epochs = 100).
But I have seen the following code style:
optimizer = tf.keras.optimizers.Adam(0.001)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
#tf.function
def train_step(inputs, labels):
with tf.GradientTape() as tape:
predictions = model(inputs, training=True)
regularization_loss = tf.math.add_n(model.losses)
pred_loss = loss_fn(labels, predictions)
total_loss = pred_loss + regularization_loss
gradients = tape.gradient(total_loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
for epoch in range(NUM_EPOCHS):
for inputs, labels in train_data:
train_step(inputs, labels)
print("Finished epoch", epoch)
So here you can observe "more detailed" code, where you manually define by for loops you training procedure.
I have following question: what is the best practice in Tensorflow 2.0? I haven't found a any complete tutorial.
Use what is best for your needs.
Both methods are documented in Tensorflow tutorials.
If you don't need anything special, no extra losses, strange metrics or intricate gradient computation, just use a model.fit() or a model.fit_generator(). This is totally ok and makes your life easier.
A custom training loop might come in handy when you have complicated models with non-trivial loss/gradients calculation.
Up to now, two applications I tried were easier with this:
Training a GAN's generator and discriminator simultaneously without having to do the generation step twice. (It's complicated because you have a loss function that applies to different y_true values, and each case should update only a part of the model) - The other option would require to have a few separate models, each model with its own trainable=True/False configuration, and train then in separate phases.
Training inputs (good for style transfer models) -- Alternatively, create a custom layer that takes dummy inputs and that outputs its own trainable weights. But it gets complicated to compile several loss functions for each of the outputs of the base and style networks.

A Tensorflow training agnostic to Eager and Graph modes

I spend some of my time coding novel (I wish) RNN cells in Tensorflow.
To prototype, I use eager mode (easier to debug).
In order to train, I migrate the code to a graph (runs faster).
I am looking for a wrapper code/example that can run forward pass and training in a way that will be agnostic to the mode I run it - eager or graph, as much as possible. I have in mind a set of functions/classes, to which the particular neural network/optimizer/data can be inserted, and that these set of functions/classes could run in both modes with minimal changes between the two. In addition, it is of course good that it would be compatible with many types of NN/optimizers/data instances.
I am quite sure that many had this idea.
I wonder if something like this is feasible given the current eager/graph integration in TF.
Yes. I have been wondering the same. In the Tensorflow documentation you can see:
The same code written for eager execution will also build a graph during graph execution. Do this by simply running the same code in a new Python session where eager execution is not enabled.
But this is hard to achieve, mostly because working with graphs means dealing with placeholders, which can not be used in Eager mode. I tried to get rid off placeholders using object-oriented layers and the Dataset API. This is the closest I could get to totally compatible code:
m = 128 # num_examples
n = 5 # num features
epochs = 2
batch_size = 32
steps_per_epoch = m // 32
dataset = tf.data.Dataset.from_tensor_slices(
(tf.random_uniform([m, n], dtype=tf.float32),
tf.random_uniform([m, 1], dtype=tf.float32)))
dataset = dataset.repeat(epochs)
dataset = dataset.batch(batch_size)
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, activation='relu', input_dim=n),
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(1)
])
def train_eagerly(model, dataset):
optimizer = tf.train.AdamOptimizer()
iterator = dataset.make_one_shot_iterator()
print('Training graph...')
for epoch in range(epochs):
print('Epoch', epoch)
progbar = tf.keras.utils.Progbar(target=steps_per_epoch, stateful_metrics='loss')
for step in range(steps_per_epoch):
with tf.GradientTape() as tape:
features, labels = iterator.get_next()
predictions = model(features, training=True)
loss_value = tf.losses.mean_squared_error(labels, predictions)
grads = tape.gradient(loss_value, model.variables)
optimizer.apply_gradients(zip(grads, model.variables))
progbar.add(1, values=[('loss', loss_value.numpy())])
def train_graph(model, dataset):
optimizer = tf.train.AdamOptimizer()
iterator = dataset.make_initializable_iterator()
print('Training graph...')
with tf.Session() as sess:
sess.run(iterator.initializer)
sess.run(tf.global_variables_initializer())
for epoch in range(epochs):
print('Epoch', epoch)
progbar = tf.keras.utils.Progbar(target=steps_per_epoch, stateful_metrics='loss')
for step in range(steps_per_epoch):
with tf.GradientTape() as tape:
features, labels = sess.run(iterator.get_next())
predictions = model(features, training=True)
loss_value = tf.losses.mean_squared_error(labels, predictions)
grads = tape.gradient(loss_value, model.variables)
optimizer.apply_gradients(zip(grads, model.variables))
progbar.add(1, values=[('loss', loss_value.eval())])
As you can see, the main difference is that I use a one_shot_iterator during Eager training (of course, during graph training, I have to run operations within a session).
I tried to do the same using optimizer.minimize instead of applying the gradients myself, but I could not come up with a code that worked both for eager and graph modes.
Also, I'm sure this becomes much harder to do with not so simple models, like the one you are working with.

Tensorflow: calculate gradient for tf.multiply

I'm building a neural network that has the following two layers
pseudo_inputs = tf.Variable(a_numpy_ndarray)
weights = tf.Variable(tf.truncated_normal(...))
I then want to multiply them using tf.multiply (which, unlike tf.matmul multiplies corresponding indices, i.e. c_ij = a_ij * b_ij)
input = tf.multiply(pseudo_inputs, weights)
My goal is to learn weights. So I run
train_step = tf.train.AdamOptimizer(learn_rate).minimize(loss, var_list=[weights])
But it doesn't work. The network doesn't change at all.
Looking at tensorboard, I could see that 'input' has no gradient, so I'm assuming that's the problem. Any ideas how to solve this?
From reading tensorflow docs it seems like I might have to write a gradient op for tf.multiply, but I find it hard to believe no one needed to do this before.
I thought the pseudo_inputs should be set as Placeholders in the first line.
And in this line:
train_step = tf.train.AdamOptimizer(learn_rate).minimize(loss, var_list=[weights])
Since weights are to be trained in the graph by minimizing loss then it should not passed as a parameter here.
train = tf.train.AdamOptimizer(learn_rate).minimize(loss)
Then you should first run the train using the samples(you don't have labels) you have.
for x_train, y_train in samples:
sess.run(train, {pseudo_inputs:x_train, y:y_train})
And after that you can get weights by:
W_c, loss_c = sess.run([W, loss], {pseudo_inputs=x_train, y:y_train})