Why is tf.GradientTape.jacobian giving None? - tensorflow

I'm using the IRIS dataset, and am following this official tutorial: Custom training: walkthrough
In the training loop, I am trying to gather the model outputs and weights every 50th epoch (epoch % 50 == 0) into the lists m_outputs_mod50 and gather_weights respectively:
# Keep results for plotting
train_loss_results = []
train_accuracy_results = []
m_outputs_mod50 = []
gather_weights = []
num_epochs = 201

for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
    # gather_kernel(model)

    # Training loop - using batches of 32
    for x, y in train_dataset:
        # Optimize the model
        loss_value, grads = grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Track progress
        epoch_loss_avg.update_state(loss_value)  # Add current batch loss
        # Compare predicted label to actual label
        # training=True is needed only if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        epoch_accuracy.update_state(y, model(x, training=True))

    # End epoch
    train_loss_results.append(epoch_loss_avg.result())
    train_accuracy_results.append(epoch_accuracy.result())
    # pred_hist.append(model.predict(x))

    if epoch % 50 == 0:
        m_outputs_mod50.append(model(x))
        gather_weights.append(model.weights)
        print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(epoch,
                                                                    epoch_loss_avg.result(),
                                                                    epoch_accuracy.result()))
Running the above and then trying to get the jacobian even at epoch 0 (using m_outputs_mod50[0] and gather_weights[0]) via

with tf.GradientTape() as tape:
    print(tape.jacobian(target=m_outputs_mod50[0], sources=gather_weights[0]))
I get a list of None as the output.
Why?

You need to understand how the GradientTape operates. For that, you can follow the guide: Introduction to gradients and automatic differentiation. Here is an excerpt:
TensorFlow provides the tf.GradientTape API for automatic
differentiation; that is, computing the gradient of a computation with
respect to some inputs, usually tf.Variables. TensorFlow "records"
relevant operations executed inside the context of a tf.GradientTape
onto a "tape". TensorFlow then uses that tape to compute the gradients
of a "recorded" computation using reverse mode differentiation.
To compute a gradient (or a jacobian), the tape needs to record the operations that are executed in its context. Then, outside its context, once the forward pass has been executed, it's possible to use the tape to compute the gradient/jacobian.
You could use something like this:
if epoch % 50 == 0:
    with tf.GradientTape() as tape:
        out = model(x)
    jacobian = tape.jacobian(out, model.weights)
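For completeness, here is a minimal self-contained sketch of the same idea; the small Dense model and random batch below are made up for illustration and are not the tutorial's exact IRIS model:

import tensorflow as tf

# Hypothetical toy setup: a small model and a random batch of 32 samples.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(3),
])
x = tf.random.normal((32, 4))

with tf.GradientTape() as tape:
    out = model(x)  # the forward pass is recorded on the tape

# One jacobian per weight tensor; each has shape out.shape + weight.shape.
jacobians = tape.jacobian(out, model.weights)
for w, j in zip(model.weights, jacobians):
    print(w.name, j.shape)

Because the forward pass happens inside the tape's context, none of the entries come back as None.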

Related

How to apply a function to network output before passing it to the loss?

I'm trying to implement a network in tensorflow and I need to apply a function f to the network output and use the returned value as the prediction to be used in the loss.
Is there a simple way to do this, or which part of TensorFlow should I study to achieve that?
You should study how to write custom training loops in TensorFlow: https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch
A simplified and short version could look similar to the code below:
# Repeat for several epochs
for epoch in range(epochs):
    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        # Start tracing your forward pass to calculate gradients
        with tf.GradientTape() as tape:
            prediction = model(x_batch_train, training=True)
            # HERE YOU PLACE YOUR FUNCTION f
            transformed_prediction = f(prediction)
            loss_value = loss_fn(y_batch_train, transformed_prediction)
        grads = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
        (...)
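One caveat worth adding: for the gradients to flow, f has to be built from differentiable TensorFlow operations. As a hedged illustration (the temperature-scaled softmax below is just a hypothetical placeholder for whatever f actually is):

import tensorflow as tf

def f(prediction, temperature=2.0):
    # Hypothetical transformation: temperature-scaled softmax.
    # Since it is composed of TF ops, the tape can differentiate through it.
    return tf.nn.softmax(prediction / temperature, axis=-1)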

Calculating gradients in Custom training loop, difference in performace TF vs Torch

I have attempted to translate a PyTorch implementation of an NN model, which calculates forces and energies in molecular structures, to TensorFlow. This needed a custom training loop and a custom loss function, so I implemented the two different one-step training functions below.
First, using nested gradient tapes:
def calc_gradients(D_train_batch, E_train_batch, F_train_batch, opt):
    # set up gradient tape scope in order to track gradients of both d(Loss)/d(Weights)
    # and d(output)/d(input)
    with tf.GradientTape() as tape1:
        with tf.GradientTape() as tape2:
            # set gradient tape to watch Tensor
            tape2.watch(D_train_batch)
            # pass D through model to get predicted energy vals
            E_pred = model(D_train_batch, training=True)
        df_dD_train_batch = tape2.gradient(E_pred, D_train_batch)
        # matrix mult of -Grad_D(f) x Grad_r(D)
        F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
        # calculate loss value
        loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)
    grads = tape1.gradient(loss, model.trainable_weights)
    opt.apply_gradients(zip(grads, model.trainable_weights))
The other attempt uses a gradient tape with persistent=True:
def calc_gradients_persistent(D_train_batch, E_train_batch, F_train_batch, opt):
    # set up gradient tape scope in order to track gradients of both d(Loss)/d(Weights)
    # and d(output)/d(input)
    with tf.GradientTape(persistent=True) as outer:
        # set gradient tape to watch Tensor
        outer.watch(D_train_batch)
        # output values from model, set training to True to get
        # model.trainable_weights out
        E_pred = model(D_train_batch, training=True)
        # set gradient tape to watch trainable weights
        outer.watch(model.trainable_weights)
        # get gradient of output (f/E_pred) w.r.t. input (D/D_train_batch) and cast to double
        df_dD_train_batch = outer.gradient(E_pred, D_train_batch)
        # matrix mult of -Grad_D(f) x Grad_r(D)
        F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
        # calculate loss value
        loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)
    # get gradient of loss w.r.t. trainable weights for backpropagation
    grads = outer.gradient(loss, model.trainable_weights)
    # update weights using the optimizer and the gradients (grads)
    opt.apply_gradients(zip(grads, model.trainable_weights))
These were attempted translations of the PyTorch code:
# Forward pass: Predict energies from the descriptor input
E_train_pred_batch = model(D_train_batch)
# Get derivatives of model output with respect to input variables. The
# torch.autograd.grad-function can be used for this, as it returns the
# gradients of the outputs with respect to the inputs. It is very important
# to set create_graph=True in this case. Without it the derivatives
# of the NN parameters with respect to the loss from the force error
# will not be populated (=the force error will not affect the
# training), but the model will still run fine without errors.
df_dD_train_batch = torch.autograd.grad(
    outputs=E_train_pred_batch,
    inputs=D_train_batch,
    grad_outputs=torch.ones_like(E_train_pred_batch),
    create_graph=True,
)[0]
# Get derivatives of input variables (=descriptor) with respect to atom
# positions = forces
F_train_pred_batch = -torch.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
# Zero gradients, perform a backward pass, and update the weights.
# D_train_batch.grad.data.zero_()
optimizer.zero_grad()
loss = energy_force_loss(E_train_pred_batch, E_train_batch, F_train_pred_batch, F_train_batch)
loss.backward()
optimizer.step()
which is from the tutorial for the Dscribe library at https://singroup.github.io/dscribe/latest/tutorials/machine_learning/forces_and_energies.html
Question
Using either version of the TF implementation, there is a huge loss in prediction accuracy compared to running the PyTorch version. I was wondering: have I maybe misunderstood the PyTorch code and translated it incorrectly, and if so, where is my discrepancy?
P.S.
The model directly computes the energies E, from which we use the gradient of E w.r.t. D in order to calculate the forces F. The loss function is a weighted sum of the MSE of both the forces and the energies.
These methods are in fact the same; my error was somewhere else, which was creating the differing results. For anyone who's trying to implement the TensorFlow versions: the nested gradient tapes are about 2x faster, at least in this scenario. Also make sure to wrap the functions in a @tf.function in order to use graphs over eager execution; the speed-up is about 10x.
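As a rough sketch of what that wrapping could look like for the nested-tape version above (this reuses the question's model, force_energy_loss and opt, so treat it as an assumption about the setup rather than the author's exact code):

@tf.function  # traces the step into a graph instead of running it eagerly
def train_step(D_train_batch, E_train_batch, F_train_batch, dD_dr_train_batch):
    with tf.GradientTape() as tape1:
        with tf.GradientTape() as tape2:
            tape2.watch(D_train_batch)
            E_pred = model(D_train_batch, training=True)
        # taking this gradient inside tape1's context keeps it differentiable,
        # the TF analogue of create_graph=True in PyTorch
        df_dD = tape2.gradient(E_pred, D_train_batch)
        F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD)
        loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)
    grads = tape1.gradient(loss, model.trainable_weights)
    opt.apply_gradients(zip(grads, model.trainable_weights))
    return loss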

Why doesn't custom training loop average loss over batch_size?

The code snippet below is the custom training loop from the TensorFlow official tutorial: https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch. Another tutorial also does not average the loss over batch_size, as shown here: https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough
Why is the loss_value not averaged over batch_size at the line loss_value = loss_fn(y_batch_train, logits)? Is this a bug? From another question here, Loss function works with reduce_mean but not reduce_sum, reduce_mean is indeed needed to average the loss over batch_size.
The loss_fn is defined in the tutorial as below. It obviously does not average over batch_size.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
From documentation, keras.losses.SparseCategoricalCrossentropy sums loss over the batch without averaging. Thus, this is essentially reduce_sum instead of reduce_mean!
Type of tf.keras.losses.Reduction to apply to loss. Default value is AUTO. AUTO indicates that the reduction option will be determined by the usage context. For almost all cases this defaults to SUM_OVER_BATCH_SIZE.
The code is shown below.
epochs = 2
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        # Open a GradientTape to record the operations run
        # during the forward pass, which enables auto-differentiation.
        with tf.GradientTape() as tape:
            # Run the forward pass of the layer.
            # The operations that the layer applies
            # to its inputs are going to be recorded
            # on the GradientTape.
            logits = model(x_batch_train, training=True)  # Logits for this minibatch

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)

        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss.
        grads = tape.gradient(loss_value, model.trainable_weights)

        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %s samples" % ((step + 1) * 64))
I've figured it out: loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True) does indeed average the loss over batch_size by default.
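A quick way to check this is to compare the reduced loss against the mean of the per-sample losses; the tiny batch below is made up for illustration:

import tensorflow as tf
from tensorflow import keras

# Hypothetical toy batch of 3 samples and 3 classes.
y_true = tf.constant([0, 1, 2])
logits = tf.constant([[2.0, 0.5, 0.1],
                      [0.3, 1.7, 0.2],
                      [0.1, 0.4, 2.2]])

loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
per_sample = keras.losses.sparse_categorical_crossentropy(y_true, logits, from_logits=True)

print(loss_fn(y_true, logits).numpy())     # reduced loss
print(tf.reduce_mean(per_sample).numpy())  # same value: the mean over the batch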

Understanding Gradient Tape with mini batches

In the below example, taken from the Keras documentation, I want to understand how grads is computed. Does the gradient grads correspond to the average gradient computed using the batch (x_batch_train, y_batch_train)? In other words, does the algorithm compute the gradient, with respect to each variable, using every sample in the mini-batch and then average them to get grads?
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        # Open a GradientTape to record the operations run
        # during the forward pass, which enables auto-differentiation.
        with tf.GradientTape() as tape:
            # Run the forward pass of the layer.
            # The operations that the layer applies
            # to its inputs are going to be recorded
            # on the GradientTape.
            logits = model(x_batch_train, training=True)  # Logits for this minibatch

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)

        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss.
        grads = tape.gradient(loss_value, model.trainable_weights)

        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
Your suppositions are correct. The default value of the loss reduction is SUM_OVER_BATCH_SIZE (read this), so the sum of the per-sample losses in the batch is divided by the batch size, and the resulting gradient is the average over the batch. The documentation provided by DachuanZhao shows this as well.
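To make that concrete, here is a small sketch with a made-up two-parameter linear model (not the tutorial's network) checking that the gradient of the mean loss over a batch equals the average of the per-sample gradients:

import tensorflow as tf

# Hypothetical tiny "model": a single weight matrix mapping 2 features to 1 output.
w = tf.Variable([[1.0], [2.0]])
x = tf.constant([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])  # batch of 3 samples
y = tf.constant([[0.5], [1.5], [2.0]])

# Gradient of the loss averaged over the whole batch.
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(x @ w - y))
grad_batch = tape.gradient(loss, w)

# Per-sample gradients, averaged by hand.
per_sample_grads = []
for i in range(3):
    with tf.GradientTape() as tape:
        loss_i = tf.reduce_mean(tf.square(x[i:i + 1] @ w - y[i:i + 1]))
    per_sample_grads.append(tape.gradient(loss_i, w))
manual_avg = tf.add_n(per_sample_grads) / 3.0

print(grad_batch.numpy())  # matches manual_avg
print(manual_avg.numpy())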

can't reproduce model.fit with GradientTape

I've been trying to investigate the reason (e.g. by checking weights, gradients and activations during training) why SGD with a 0.001 learning rate worked in training while Adam fails to do so. (Please see my previous post: "Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)")
Note: I'm using the same model from my previous post here as well.
Using tf.keras, I trained the neural network using model.fit():
model.compile(optimizer=SGD(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x=ds,
          epochs=80,
          validation_data=ds_val)
This resulted in the epoch loss graphed below: within the 1st epoch it reached a train loss of 0.46, ultimately resulting in a train_loss of 0.1241 and a val_loss of 0.2849.
I would've used tf.keras.callbacks.TensorBoard(histogram_freq=1) to train the network with both SGD(0.001) and Adam to investigate, but it's throwing an InvalidArgumentError on Variable:0, something I can't decipher. So I tried to write a custom training loop using GradientTape and plot the values.
Using tf.GradientTape(), I tried to reproduce the results using the exact same model and dataset; however, the epoch loss is decreasing incredibly slowly, reaching a train loss of 0.676 after 15 epochs (see graph below). Is there something wrong with my implementation? (code below)
@tf.function
def compute_grads(train_batch: Dict[str, tf.Tensor], target_batch: tf.Tensor,
                  loss_fn: Loss, model: tf.keras.Model):
    with tf.GradientTape(persistent=False) as tape:
        # forward pass
        outputs = model(train_batch)
        # calculate loss
        loss = loss_fn(y_true=target_batch, y_pred=outputs)
    # calculate gradients for each param
    grads = tape.gradient(loss, model.trainable_variables)
    return grads, loss

BATCH_SIZE = 8
EPOCHS = 15

bce = BinaryCrossentropy()
optimizer = SGD(learning_rate=0.001)

for epoch in tqdm(range(EPOCHS), desc='epoch'):
    # - accumulators
    epoch_loss = 0.0

    for (i, (train_batch, target_dict)) in tqdm(enumerate(ds_train.shuffle(1024).batch(BATCH_SIZE)), desc='step'):
        (grads, loss) = compute_grads(train_batch, target_dict['target'], bce, model)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        epoch_loss += loss

    avg_epoch_loss = epoch_loss / (i + 1)
    tensorboard_scalar(writer, name='epoch_loss', data=avg_epoch_loss, step=epoch)  # custom helper function
    print("Epoch {}: epoch_loss = {}".format(epoch, avg_epoch_loss))
Thanks in advance!
Check whether you have shuffled your dataset; if so, the problem may come from the shuffling done with the tf.data.Dataset method, which only shuffles through the dataset one buffer at a time. Using keras.Model.fit yielded better results because it probably adds another shuffling.
Adding a shuffle with numpy.random.shuffle may improve the training performance (from this reference). An example of applying it to the generation of the dataset is:
numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])
np.random.shuffle(numpy_data)
indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)
train_ds = data.Dataset.from_tensor_slices(
    (indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)
If this does not work, you may need to use Dataset.repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):
train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000, reshuffle_each_iteration=True
).batch(batch_size, drop_remainder=True
).repeat(epochs_number)

for ix, (examples, labels) in train_ds.enumerate():
    train_step(examples, labels)
    current_epoch = ix // (len(index_data) // batch_size)
This workaround is neither beautiful nor natural, but for the moment you can use it to shuffle each epoch. It's a known issue that will be fixed; in the future you will be able to use for epoch in range(epochs_number) instead of .repeat().
The solution provided here may also help a lot. You might want to check it out.
If this is not the case, you may want to speed up the TF 2.0 GradientTape. This can be the solution:
TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code.
The usage is pretty straightforward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with @tf.function.
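As a rough, hedged illustration of that effect (a hypothetical toy model and random data, so nothing here comes from the question's actual setup), the same training step can be timed eagerly and wrapped in tf.function; note that the first graph call also pays a one-off tracing cost:

import timeit
import tensorflow as tf

# Hypothetical toy model and batch, only to compare eager vs. graph execution.
model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation='relu'),
                             tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
x = tf.random.normal((8, 32))
y = tf.cast(tf.random.uniform((8, 1)) > 0.5, tf.float32)

def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

graph_step = tf.function(train_step)  # same logic, traced into a graph on first call

print("eager:", timeit.timeit(lambda: train_step(x, y), number=100))
print("graph:", timeit.timeit(lambda: graph_step(x, y), number=100))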