Starting with Adam and then fine-tuning with SGD: changing the optimizer - tensorflow

I read a great blog post about a bag of tricks for image classification.
There is one part I have a hard time figuring out how to implement in TensorFlow; in fact, I have no idea how to do it, or whether it is even possible:
So, start off with Adam: just set a learning rate that’s not absurdly high, commonly defaulted at 0.0001 and you’ll usually get some very good results. Then, once your model starts to saturate with Adam, fine tune with SGD at a smaller learning rate to squeeze in that last bit of accuracy!
Can you change the optimizer without re-compiling in some way?
I have of course tried googling, but can't seem to find much information.
Does anyone know if this is possible in TensorFlow and, if so, how to do it? (Or do you have a source with some info about it?)

You can start from the "Writing a training loop from scratch" example in the TensorFlow documentation.
Create two train_step functions, the first with an Adam optimizer and the second with an SGD optimizer.
optimizer1 = keras.optimizers.Adam(learning_rate=1e-3)
optimizer2 = keras.optimizers.SGD(learning_rate=1e-3)

@tf.function
def train_step1(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer1.apply_gradients(zip(grads, model.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value

@tf.function
def train_step2(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer2.apply_gradients(zip(grads, model.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value
Main loop:
epochs = 20
train_step = train_step1

start_time = time.time()
for epoch in range(epochs):
    if epoch > epochs // 2:
        train_step = train_step2
    total_train_loss = 0.
    # print("\nStart of epoch %d" % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        loss_value = train_step(x_batch_train, y_batch_train)
        total_train_loss += loss_value.numpy()
    ...
Note that the graph of each train_step function is built separately. In graph mode, you cannot have a single train_step function with the optimizer as a parameter that changes between iterations (first Adam, then SGD).
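If you would rather stay with the Keras fit API, a possible alternative (not from the answer above, just a sketch) is to train in two phases and re-compile in between: compile() configures the optimizer and metrics but does not reset the trained weights. The names build_model, x_train and y_train below are placeholders for your own model and data.

import tensorflow as tf
from tensorflow import keras

model = build_model()  # hypothetical helper that returns your model

# Phase 1: train with Adam until the model starts to saturate.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10)

# Phase 2: recompile with SGD at a smaller learning rate and keep training.
# Recompiling resets the optimizer state but keeps the trained weights.
model.compile(optimizer=keras.optimizers.SGD(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10)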

Related

How to apply a function to network output before passing it to the loss?

I'm trying to implement a network in TensorFlow, and I need to apply a function f to the network output and use the returned value as the prediction in the loss.
Is there a simple way to do that, or which part of TensorFlow should I study to achieve it?
You should study how to write custom training loops in TensorFlow: https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch
A simplified and short version could look similar to the code below:
# Repeat for several epochs
for epoch in range(epochs):
    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        # Start tracing your forward pass to calculate gradients
        with tf.GradientTape() as tape:
            prediction = model(x_batch_train, training=True)
            # HERE YOU PLACE YOUR FUNCTION f
            transformed_prediction = f(prediction)
            loss_value = loss_fn(y_batch_train, transformed_prediction)
        grads = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
        (...)
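If you prefer to keep the standard compile/fit workflow, another option (just a sketch, assuming your f is built from differentiable TensorFlow ops) is to fold f into a custom loss function:

import tensorflow as tf

base_loss = tf.keras.losses.MeanSquaredError()  # or whatever loss you need

def loss_with_f(y_true, y_pred):
    # Apply f to the raw network output before computing the loss.
    return base_loss(y_true, f(y_pred))

model.compile(optimizer="adam", loss=loss_with_f)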

Tensorflow, is it possible to down-weight gradients for certain data items?

Say I have a multi-output model with outputs y_0 and y_1.
For some data examples I am confident that y_0 is correct, but I know that y_1 may be a complete guess. My idea was to use a custom training loop and multiply the loss by a calculated weight, but this does not seem to be working. Is there a way to do this through the Keras API that may be simpler than this?
@tf.function
def train_on_batch(x, y):
    y_true = y[:, 0]
    weights = y[:, 1]
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        print("ytrainpred ", y_pred)
        loss_value_pre = loss(y_true, y_pred)
        loss_value = loss_value_pre * weights
    # compute gradient
    grads = tape.gradient(loss_value, model.trainable_weights)
    # update weights
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    # update metrics
    loss_1_train.update_state(y_true[:, 0], loss_value[:, 0])
    loss_2_train.update_state(y_true[:, 1], loss_value[:, 1])
    return loss_value
In the compile method of the Keras model there is a parameter called loss_weights for exactly that: you only need to implement the loss functions that take one output or the other and pass them as a list to the loss parameter. This becomes quite impractical, though, if you have many outputs.
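For the per-example case in the question, here is a hedged sketch using the Keras losses API directly (y_true, y_pred and weights as in the question's train_on_batch): with reduction=NONE the loss returns one value per example, which you can scale by your confidence weights before averaging.

import tensorflow as tf

loss = tf.keras.losses.MeanSquaredError(reduction=tf.keras.losses.Reduction.NONE)

per_example = loss(y_true, y_pred)                   # shape (batch_size,)
loss_value = tf.reduce_mean(per_example * weights)   # down-weight the uncertain examples

# Equivalently, the loss object accepts a sample_weight argument:
loss_value = tf.keras.losses.MeanSquaredError()(y_true, y_pred, sample_weight=weights)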

Why doesn't custom training loop average loss over batch_size?

The code snippet below is the custom training loop from the official TensorFlow tutorial: https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch. Another tutorial also does not average the loss over batch_size, as shown here: https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough
Why is loss_value not averaged over batch_size at the line loss_value = loss_fn(y_batch_train, logits)? Is this a bug? From another question here, Loss function works with reduce_mean but not reduce_sum, reduce_mean is indeed needed to average the loss over batch_size.
The loss_fn is defined in the tutorial as below. It apparently does not average over batch_size.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
From the documentation, I understood that keras.losses.SparseCategoricalCrossentropy sums the loss over the batch without averaging, which would make this essentially reduce_sum instead of reduce_mean!
Type of tf.keras.losses.Reduction to apply to loss. Default value is AUTO. AUTO indicates that the reduction option will be determined by the usage context. For almost all cases this defaults to SUM_OVER_BATCH_SIZE.
The code is shown below.
epochs = 2
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):

        # Open a GradientTape to record the operations run
        # during the forward pass, which enables auto-differentiation.
        with tf.GradientTape() as tape:

            # Run the forward pass of the layer.
            # The operations that the layer applies
            # to its inputs are going to be recorded
            # on the GradientTape.
            logits = model(x_batch_train, training=True)  # Logits for this minibatch

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)

        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss.
        grads = tape.gradient(loss_value, model.trainable_weights)

        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %s samples" % ((step + 1) * 64))
I've figured it out: loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True) does indeed average the loss over batch_size by default.
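A quick sanity check of that conclusion (a small sketch, not from the tutorial), comparing the default reduction with an explicit SUM reduction:

import tensorflow as tf

y_true = tf.constant([0, 1])
logits = tf.constant([[2.0, 0.5], [0.3, 1.7]])

default_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
sum_loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.SUM)

print(default_loss(y_true, logits))  # mean over the batch (SUM_OVER_BATCH_SIZE)
print(sum_loss(y_true, logits))      # sum over the batch: twice the mean for these 2 examples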

Can't reproduce model.fit with GradientTape

I've been trying to investigate (e.g. by checking weights, gradients and activations during training) why SGD with a 0.001 learning rate worked in training while Adam failed to do so. (Please see my previous post, "Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)".)
Note: I'm using the same model from my previous post here as well.
Using tf.keras, I trained the neural network using model.fit():
model.compile(optimizer=SGD(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(x=ds,
          epochs=80,
          validation_data=ds_val)
This resulted in the epoch loss graphed below: within the first epoch it reached a train loss of 0.46, and ultimately a train_loss of 0.1241 and val_loss of 0.2849.
I would have used tf.keras.callbacks.TensorBoard(histogram_freq=1) to train the network with both SGD(0.001) and Adam and investigate, but it throws an InvalidArgumentError on Variable:0, something I can't decipher. So I tried to write a custom training loop using GradientTape and plot the values.
Using tf.GradientTape(), I tried to reproduce the results with the exact same model and dataset; however, the epoch loss decreases incredibly slowly, reaching a train loss of only 0.676 after 15 epochs (see graph below). Is there something wrong with my implementation? (Code below.)
@tf.function
def compute_grads(train_batch: Dict[str, tf.Tensor], target_batch: tf.Tensor,
                  loss_fn: Loss, model: tf.keras.Model):
    with tf.GradientTape(persistent=False) as tape:
        # forward pass
        outputs = model(train_batch)
        # calculate loss
        loss = loss_fn(y_true=target_batch, y_pred=outputs)
    # calculate gradients for each param
    grads = tape.gradient(loss, model.trainable_variables)
    return grads, loss

BATCH_SIZE = 8
EPOCHS = 15

bce = BinaryCrossentropy()
optimizer = SGD(learning_rate=0.001)

for epoch in tqdm(range(EPOCHS), desc='epoch'):
    # - accumulators
    epoch_loss = 0.0

    for (i, (train_batch, target_dict)) in tqdm(enumerate(ds_train.shuffle(1024).batch(BATCH_SIZE)), desc='step'):
        (grads, loss) = compute_grads(train_batch, target_dict['target'], bce, model)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        epoch_loss += loss

    avg_epoch_loss = epoch_loss / (i + 1)
    tensorboard_scalar(writer, name='epoch_loss', data=avg_epoch_loss, step=epoch)  # custom helper function
    print("Epoch {}: epoch_loss = {}".format(epoch, avg_epoch_loss))
Thanks in advance!
If you shuffle your dataset, the problem may come from shuffling with the tf.data.Dataset method: it only shuffles within one buffer of the dataset at a time. Keras' Model.fit probably yields better results because it adds another level of shuffling.
Adding a shuffle with numpy.random.shuffle may improve the training performance (from this reference).
An example of applying it to the generation of the dataset:
numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])
np.random.shuffle(numpy_data)
indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)

train_ds = data.Dataset.from_tensor_slices(
    (indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)
If this does not work, you may need to use Dataset.repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):
train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000, reshuffle_each_iteration=True
).batch(batch_size, drop_remainder=True
).repeat(epochs_number)

for ix, (examples, labels) in train_ds.enumerate():
    train_step(examples, labels)
    current_epoch = ix // (len(index_data) // batch_size)
This workaround is neither beautiful nor natural, but for the moment you can use it to shuffle each epoch. It's a known issue that will be fixed; in the future you will be able to use for epoch in range(epochs_number) instead of .repeat().
The solution provided here may also help a lot; you might want to check it out.
If that is not the case, you may want to speed up the TF 2.0 GradientTape. This can be the solution:
TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code.
The usage is pretty straightforward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with @tf.function.
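As a sketch of that suggestion applied to the code in the question (reusing the model, bce and optimizer names from above), the whole update can be folded into one decorated step so the forward pass, gradient computation and optimizer update run as a single graph call:

import tensorflow as tf

@tf.function
def train_step(train_batch, target_batch):
    with tf.GradientTape() as tape:
        outputs = model(train_batch, training=True)
        loss = bce(y_true=target_batch, y_pred=outputs)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss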

A TensorFlow training loop agnostic to eager and graph modes

I spend some of my time coding novel (I wish) RNN cells in Tensorflow.
To prototype, I use eager mode (easier to debug).
In order to train, I migrate the code to a graph (runs faster).
I am looking for wrapper code or an example that can run the forward pass and training in a way that is, as much as possible, agnostic to the mode I run it in - eager or graph. I have in mind a set of functions/classes into which a particular neural network/optimizer/dataset can be inserted, and which could run in both modes with minimal changes between the two. Ideally, it would also be compatible with many types of NNs/optimizers/data instances.
I am quite sure that many people have had this idea.
I wonder if something like this is feasible given the current eager/graph integration in TF.
Yes, I have been wondering the same. In the TensorFlow documentation you can read:
The same code written for eager execution will also build a graph during graph execution. Do this by simply running the same code in a new Python session where eager execution is not enabled.
But this is hard to achieve, mostly because working with graphs means dealing with placeholders, which cannot be used in eager mode. I tried to get rid of placeholders by using object-oriented layers and the Dataset API. This is the closest I could get to totally compatible code:
m = 128  # num_examples
n = 5    # num_features
epochs = 2
batch_size = 32
steps_per_epoch = m // 32

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([m, n], dtype=tf.float32),
     tf.random_uniform([m, 1], dtype=tf.float32)))
dataset = dataset.repeat(epochs)
dataset = dataset.batch(batch_size)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_dim=n),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])

def train_eagerly(model, dataset):

    optimizer = tf.train.AdamOptimizer()
    iterator = dataset.make_one_shot_iterator()

    print('Training eagerly...')
    for epoch in range(epochs):
        print('Epoch', epoch)
        progbar = tf.keras.utils.Progbar(target=steps_per_epoch, stateful_metrics='loss')

        for step in range(steps_per_epoch):
            with tf.GradientTape() as tape:
                features, labels = iterator.get_next()
                predictions = model(features, training=True)
                loss_value = tf.losses.mean_squared_error(labels, predictions)

            grads = tape.gradient(loss_value, model.variables)
            optimizer.apply_gradients(zip(grads, model.variables))

            progbar.add(1, values=[('loss', loss_value.numpy())])

def train_graph(model, dataset):

    optimizer = tf.train.AdamOptimizer()
    iterator = dataset.make_initializable_iterator()

    print('Training graph...')
    with tf.Session() as sess:
        sess.run(iterator.initializer)
        sess.run(tf.global_variables_initializer())

        for epoch in range(epochs):
            print('Epoch', epoch)
            progbar = tf.keras.utils.Progbar(target=steps_per_epoch, stateful_metrics='loss')

            for step in range(steps_per_epoch):
                with tf.GradientTape() as tape:
                    features, labels = sess.run(iterator.get_next())
                    predictions = model(features, training=True)
                    loss_value = tf.losses.mean_squared_error(labels, predictions)

                grads = tape.gradient(loss_value, model.variables)
                optimizer.apply_gradients(zip(grads, model.variables))

                progbar.add(1, values=[('loss', loss_value.eval())])
As you can see, the main difference is that I use a one_shot_iterator during eager training (of course, during graph training I have to run the operations within a session).
I tried to do the same using optimizer.minimize instead of applying the gradients myself, but I could not come up with code that worked in both eager and graph modes.
Also, I'm sure this becomes much harder to do with models that are not so simple, like the one you are working with.
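For what it's worth, here is a minimal sketch assuming TensorFlow 2.x, where tf.function handles the eager/graph split: the same Python train step runs eagerly while debugging (remove the decorator or call tf.config.run_functions_eagerly(True)) and as a graph otherwise. The toy data and model below are placeholders, not part of the original answer.

import tensorflow as tf

# Toy data just to make the sketch runnable.
features = tf.random.uniform([128, 5])
labels = tf.random.uniform([128, 1])

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam()

@tf.function  # drop this (or call tf.config.run_functions_eagerly(True)) to step through eagerly
def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss_value = tf.reduce_mean(tf.keras.losses.mean_squared_error(y, predictions))
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss_value

for epoch in range(2):
    print('Epoch', epoch, 'loss', float(train_step(features, labels)))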