How to apply a function to network output before passing it to the loss? - tensorflow

I'm trying to implement a network in tensorflow and I need to apply a function f to the network output and use the returned value as the prediction to be used in the loss.
Is there a simple way to make it or which part of tensorflow should I study to achieve that ?

you should study how to write custom training loops in tensorflow: https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch
A simplified and short version could look similar to the code bellow:
#Repeat for several epochs
for epoch in range(epochs):
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
# Start tracing your forward pass to calculate gradients
with tf.GradientTape() as tape:
prediction = model(x_batch_train, training=True)
# HERE YOU PLACE YOUR FUNCTION f
transformed_prediction = f(prediction)
loss_value = loss_fn(y_batch_train, transformed_prediction )
grads = tape.gradient(loss_value, model.trainable_weights)
optimizer.apply_gradients(zip(grads, model.trainable_weights))
(...)

Related

Does optimizer.apply_gradients do gradient descent?

Ive found the following code:
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
# Open a GradientTape to record the operations run
# during the forward pass, which enables auto-differentiation.
with tf.GradientTape() as tape:
# Run the forward pass of the layer.
# The operations that the layer applies
# to its inputs are going to be recorded
# on the GradientTape.
logits = model(x_batch_train, training=True) # Logits for this minibatch
# Compute the loss value for this minibatch.
loss_value = loss_fn(y_batch_train, logits)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
And the last part says
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
But after Ive looked the function apply_gradients up, Im not sure if the sentence
"Run one step of gradient descent by updating" for optimizer.apply_gradients(zip(grads, model.trainable_weights)) is true.
Because it only updates the gradients. And grads = tape.gradient(loss_value, model.trainable_weights) only calculates the derivation of the loss function with respect. But for gradient descent calculate the learning rate with the gradients and subtract that from the value of the loss function. But it seems to work, because the loss is decreasing constantly. So my question is: Does apply_gradients do more than just updating?
full code is here: https://keras.io/guides/writing_a_training_loop_from_scratch/
.apply_gradients performs an update to the weights, using the gradients. Depending on optimizer used it could be gradient descent, which is:
w_{t+1} := w_t - lr * g(w_t)
where g = grad(L)
Note, that there is no need to access loss function or anything else, you just need the gradient (which is a vector of length of your parameters).
In general .apply_gradients can do more than that, e.g. if you were to use Adam it would also accumulate some statistics and use them to rescale gradients etc.

Starting with ADAM and then fine tune with SGD. Changing the optimizer

I read this great blog about a bag of tricks for image classification.
This part i have a hard time to figure out how to implement in tensorflow, or rather, i have no idea how to do it or if it is even possible.
So, start off with Adam: just set a learning rate that’s not absurdly high, commonly defaulted at 0.0001 and you’ll usually get some very good results. Then, once your model starts to saturate with Adam, fine tune with SGD at a smaller learning rate to squeeze in that last bit of accuracy!
Can you change the optimizer without re-compile in some way?
I have ofc tried googling but cant seem to find much information.
Anyone know if this is possible in tensorflow and if so how to do it? (or if you have source that have some info about it)
You can start form training loop from scratch of the tensorflow documentation.
Create two train_step functions, the first with an Adam optimizer and the second with an SGD optimizer.
optimizer1 = keras.optimizers.Adam(learning_rate=1e-3)
optimizer2 = keras.optimizers.SGD(learning_rate=1e-3)
#tf.function
def train_step1(x, y):
with tf.GradientTape() as tape:
logits = model(x, training=True)
loss_value = loss_fn(y, logits)
grads = tape.gradient(loss_value, model.trainable_weights)
optimizer1.apply_gradients(zip(grads, model.trainable_weights))
train_acc_metric.update_state(y, logits)
return loss_value
#tf.function
def train_step2(x, y):
with tf.GradientTape() as tape:
logits = model(x, training=True)
loss_value = loss_fn(y, logits)
grads = tape.gradient(loss_value, model.trainable_weights)
optimizer2.apply_gradients(zip(grads, model.trainable_weights))
train_acc_metric.update_state(y, logits)
return loss_value
Main loop:
epochs = 20
train_step = train_step1
start_time = time.time()
for epoch in range(epochs):
if epoch > epochs//2:
train_step = train_step2
total_train_loss = 0.
# print("\nStart of epoch %d" % (epoch,))
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
loss_value = train_step(x_batch_train, y_batch_train)
total_train_loss += loss_value.numpy()
...
Note that the graph of each train_step function is built separately. In graph mode, you cannot have a single train_step function with the optimizer as a parameter that changes during iterations (Adam and then SGD).

Why doesn't custom training loop average loss over batch_size?

Below code snippet is the custom training loop from Tensorflow official tutorial.https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch . Another tutorial also does not average loss over batch_size, as shown here https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough
Why is the loss_value not averaged over batch_size at this line loss_value = loss_fn(y_batch_train, logits)? Is this a bug? From another question here Loss function works with reduce_mean but not reduce_sum, reduce_mean is indeed needed to average loss over batch_size
The loss_fn is defined in the tutorial as below. It obviously does not average over batch_size.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
From documentation, keras.losses.SparseCategoricalCrossentropy sums loss over the batch without averaging. Thus, this is essentially reduce_sum instead of reduce_mean!
Type of tf.keras.losses.Reduction to apply to loss. Default value is AUTO. AUTO indicates that the reduction option will be determined by the usage context. For almost all cases this defaults to SUM_OVER_BATCH_SIZE.
The code is shown below.
epochs = 2
for epoch in range(epochs):
print("\nStart of epoch %d" % (epoch,))
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
# Open a GradientTape to record the operations run
# during the forward pass, which enables auto-differentiation.
with tf.GradientTape() as tape:
# Run the forward pass of the layer.
# The operations that the layer applies
# to its inputs are going to be recorded
# on the GradientTape.
logits = model(x_batch_train, training=True) # Logits for this minibatch
# Compute the loss value for this minibatch.
loss_value = loss_fn(y_batch_train, logits)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
# Log every 200 batches.
if step % 200 == 0:
print(
"Training loss (for one batch) at step %d: %.4f"
% (step, float(loss_value))
)
print("Seen so far: %s samples" % ((step + 1) * 64))
I've figured it out, the loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True) indeed averages loss over batch_size by default.

how to perform early stopping when writing our own custom training loops in tensorflow 2.0?

To perform early stopping in Tensorflow, tf.keras has a very convenient method which is a call tf.keras.callbacks, which in turn can be used in model.fit() to execute it. When we write Custom training loop, I couldn't understand how to make use of the tf.keras.callbacks to execute it. Can someone provide with a basic tutorial on how to do it?
https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
You have 2 approaches to create custom training loops.
One is this common 2 nested for loops.
or you can do this. All the callbacks and other features are available here
Tip : THE CODE BELLOW IS JUST AN SLICE OF CODE AND MODEL STRUCTURE IS NOT IMPLEMENTED. You should do it by your own.
More info? check here
class CustomModel(keras.Model):
def train_step(self, data):
# Unpack the data. Its structure depends on your model and
# on what you pass to `fit()`.
print(data)
x, y = data
with tf.GradientTape() as tape:
y_pred = self(x, training=True) # Forward pass
# Compute the loss value
# (the loss function is configured in `compile()`)
loss = self.compiled_loss(y, y_pred,
regularization_losses=self.losses)
# Compute gradients
trainable_vars = self.trainable_variables
gradients = tape.gradient(loss, trainable_vars)
# Update weights
self.optimizer.apply_gradients(zip(gradients, trainable_vars))
# Update metrics (includes the metric that tracks the loss)
self.compiled_metrics.update_state(y, y_pred)
# Return a dict mapping metric names to current value
return {m.name: m.result() for m in self.metrics}
# Construct and compile an instance of CustomModel
inputs = keras.Input(shape=(32,))
outputs = keras.layers.Dense(1)(inputs)
model = CustomModel(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['...'])
earlystopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# Just use `fit` as usual
model.fit(train_ds, epochs=3, callbacks=[earlystopping_cb])
more info: https://keras.io/getting_started/intro_to_keras_for_engineers/#using-fit-with-a-custom-training-step

Understanding Gradient Tape with mini batches

In the below example taken from Keras documentation, I want to understand how grads is computed. Does the gradient grads corresponds to the average gradient computed using the batch (x_batch_train, y_batch_train)? In other words, does the algorithm computes the gradient, with respect to each variable, using every sample in the mini batch and then average them to get grads?
for epoch in range(epochs):
print("\nStart of epoch %d" % (epoch,))
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
# Open a GradientTape to record the operations run
# during the forward pass, which enables auto-differentiation.
with tf.GradientTape() as tape:
# Run the forward pass of the layer.
# The operations that the layer applies
# to its inputs are going to be recorded
# on the GradientTape.
logits = model(x_batch_train, training=True) # Logits for this minibatch
# Compute the loss value for this minibatch.
loss_value = loss_fn(y_batch_train, logits)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
Default value is SUM_OVER_BATCH_SIZE .
Read this .
Your suppositions are correct.
The documentation provided by DachuanZhao shows as well, that the sum of the elements in the batch are averaged.