Calculating gradients in Custom training loop, difference in performace TF vs Torch - tensorflow

I have attempted to translate pytorch implementation of a NN model which calculates forces and energies in molecular structures to TensorFlow. This needed a custom training loop and custom loss function so I implemented to different one step training functions below.
First using Nested Gradient Tapes.
def calc_gradients(D_train_batch, E_train_batch, F_train_batch, opt):
#set up gradient tape scope in order to track gradients of both d(Loss)/d(Weights)
#and d(output)/d(input)
with tf.GradientTape() as tape1:
with tf.GradientTape() as tape2:
#set gradient tape to watch Tensor
tape2.watch(D_train_batch)
#pass D thru model to get predicted energy vals
E_pred = model(D_train_batch, training=True)
df_dD_train_batch = tape2.gradient(E_pred, D_train_batch)
#matrix mult of -Grad_D(f) x Grad_r(D)
F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
#calculate loss value
loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)
grads = tape1.gradient(loss, model.trainable_weights)
opt.apply_gradients(zip(grads, model.trainable_weights))
Other attempt with gradient tape (persistent = true)
def calc_gradients_persistent(D_train_batch, E_train_batch, F_train_batch, opt):
#set up gradient tape scope in order to track gradients of both d(Loss)/d(Weights)
#and d(output)/d(input)
with tf.GradientTape(persistent = True) as outer:
#set gradient tape to watch Tensor
outer.watch(D_train_batch)
#output values from model, set trainable to be true to get
#model.trainable_weights out
E_pred = model(D_train_batch, training=True)
#set gradient tape to watch trainable weights
outer.watch(model.trainable_weights)
#get gradient of output (f/E_pred) w.r.t input (D/D_train_batch) and cast to double
df_dD_train_batch = outer.gradient(E_pred, D_train_batch)
#matrix mult of -Grad_D(f) x Grad_r(D)
F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
#calculate loss value
loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)
#get gradient of loss w.r.t to trainable weights for back propogation
grads = outer.gradient(loss, model.trainable_weights)
#updates weights using the optimizer and the gradients (grads)
opt.apply_gradients(zip(grads, model.trainable_weights))
These were attempted translations of the pytorch code
# Forward pass: Predict energies from the descriptor input
E_train_pred_batch = model(D_train_batch)
# Get derivatives of model output with respect to input variables. The
# torch.autograd.grad-function can be used for this, as it returns the
# gradients of the input with respect to outputs. It is very important
# to set the create_graph=True in this case. Without it the derivatives
# of the NN parameters with respect to the loss from the force error
# will not be populated (=the force error will not affect the
# training), but the model will still run fine without errors.
df_dD_train_batch = torch.autograd.grad(
outputs=E_train_pred_batch,
inputs=D_train_batch,
grad_outputs=torch.ones_like(E_train_pred_batch),
create_graph=True,
)[0]
# Get derivatives of input variables (=descriptor) with respect to atom
# positions = forces
F_train_pred_batch = -torch.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
# Zero gradients, perform a backward pass, and update the weights.
# D_train_batch.grad.data.zero_()
optimizer.zero_grad()
loss = energy_force_loss(E_train_pred_batch, E_train_batch, F_train_pred_batch, F_train_batch)
loss.backward()
optimizer.step()
which is from the tutorial for the Dscribe library at https://singroup.github.io/dscribe/latest/tutorials/machine_learning/forces_and_energies.html
Question
Using either versions of the TF implementation there is a huge loss in prediction accuracy compared to running the pytorch version. I was wondering, have I maybe misunderstood the pytorch code and translated incorrectly and if so where is my discrepancy?
P.S
Model directly computes energies E, from which we use the gradient of E w.r.t D in order to calculate the forces F. The loss function is a weighted sum of MSE of both Force and energies.

These methods are in fact the same, my error was somewhere else which was creating differing results. For anyone whose trying to implement the TensorFlow versions, the nested gradient tapes are about 2x faster, at least in this scenario and also ensure to wrap the functions in an #tf.function in order to use graphs over eager execution, The speed up is about 10x.

Related

Does optimizer.apply_gradients do gradient descent?

Ive found the following code:
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
# Open a GradientTape to record the operations run
# during the forward pass, which enables auto-differentiation.
with tf.GradientTape() as tape:
# Run the forward pass of the layer.
# The operations that the layer applies
# to its inputs are going to be recorded
# on the GradientTape.
logits = model(x_batch_train, training=True) # Logits for this minibatch
# Compute the loss value for this minibatch.
loss_value = loss_fn(y_batch_train, logits)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
And the last part says
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
But after Ive looked the function apply_gradients up, Im not sure if the sentence
"Run one step of gradient descent by updating" for optimizer.apply_gradients(zip(grads, model.trainable_weights)) is true.
Because it only updates the gradients. And grads = tape.gradient(loss_value, model.trainable_weights) only calculates the derivation of the loss function with respect. But for gradient descent calculate the learning rate with the gradients and subtract that from the value of the loss function. But it seems to work, because the loss is decreasing constantly. So my question is: Does apply_gradients do more than just updating?
full code is here: https://keras.io/guides/writing_a_training_loop_from_scratch/
.apply_gradients performs an update to the weights, using the gradients. Depending on optimizer used it could be gradient descent, which is:
w_{t+1} := w_t - lr * g(w_t)
where g = grad(L)
Note, that there is no need to access loss function or anything else, you just need the gradient (which is a vector of length of your parameters).
In general .apply_gradients can do more than that, e.g. if you were to use Adam it would also accumulate some statistics and use them to rescale gradients etc.

stop_gradient in tensorflow

I am wondering if tf.stop_gradient stops the gradient computation of just a given op, or stops the update of its input tf.variable ? I have the following problem - During the forward path computation in MNIST, I would like to perform a set of operations on the weights (let's say W to W*) and then do a matmul with inputs. However, I would like to exclude these operations from the backward path. I want only dE/dW computed during training with back propagation. The code I wrote prevents W from getting updated. Could you please help me understand why ? If these were variables, I understand I should set their trainable property to false, but these are operations on weights. If stop_gradient cannot be used for this purpose, then how do I build two graphs, one for forward path and the other for back propagation ?
def build_layer(inputs, fmap, nscope,layer_size1,layer_size2, faulty_training):
with tf.name_scope(nscope):
if (faulty_training):
## trainable weight
weights_i = tf.Variable(tf.truncated_normal([layer_size1, layer_size2],stddev=1.0 / math.sqrt(float(layer_size1))),name='weights_i')
## Operations on weight whose gradient should not be computed during backpropagation
weights_fx_t = tf.multiply(268435456.0,weights_i)
weight_fx_t = tf.stop_gradient(weights_fx_t)
weights_fx = tf.cast(weights_fx_t,tf.int32)
weight_fx = tf.stop_gradient(weights_fx)
weights_fx_fault = tf.bitwise.bitwise_xor(weights_fx,fmap)
weight_fx_fault = tf.stop_gradient(weights_fx_fault)
weights_fl = tf.cast(weights_fx_fault, tf.float32)
weight_fl = tf.stop_gradient(weights_fl)
weights = tf.stop_gradient(tf.multiply((1.0/268435456.0),weights_fl))
##### end transformation
else:
weights = tf.Variable(tf.truncated_normal([layer_size1, layer_size2],stddev=1.0 / math.sqrt(float(layer_size1))),name='weights')
biases = tf.Variable(tf.zeros([layer_size2]), name='biases')
hidden = tf.nn.relu(tf.matmul(inputs, weights) + biases)
return weights,hidden
I am using the tensorflow gradient descent optimizer to do the training.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = optimizer.minimize(loss, global_step=global_step)
Stop gradient will prevent the backpropagation from continuing past that node in the graph. You code doesn't have any path from weights_i to the loss except the one that goes through weights_fx_t where the gradient is stopped. This is what is causing weights_i not to be updated during training. You don't need to put stop_gradient after every step. Using it just once will stop the backpropagation there.
If stop_gradient doesn't do what you want then you can get the gradients by doing tf.gradients and you can write your own update op by using tf.assign. This will allow you to alter the gradients however you want.

How to feed the list of gradients, or (grad, variable name) pairs, to my model

This is related to a previous question: How to partition a single batch into many invocations to save memory, and also to How to train a big model with relatively large batch size on a single GPU using Tensorflow?; but, still I couldn't find the exact answer. For example, the answer to another related question tensorflow - run optimizer op on a large batch doesn't work for me (btw. it wasn't accepted and there are no more comments there).
I want to try to simulate larger batch size but using only one GPU.
So, I need to compute the gradients for every smaller batch, aggregate/average them across several such smaller batches, and only then apply.
(Basically, it's like synchronized distributed SGD, but on a single device/GPU, performed serially. Of course, the acceleration advantage of distributed SGD is lost but larger batch size itself will maybe enable convergence to larger accuracy and larger step size, as indicated by a few recent papers.)
To keep memory requirement low, I should do standard SGD with small batches, update the gradients after every iteration and then call optimizer.apply_gradients() (where optimizer is one of the implemented optimizers).
So, everything looks simple but when I go to implement it, it is actually not so trivial.
For example, I would like to use one Graph, compute gradients for each iteration, and then, when multiple batches are processed, sum the gradients and pass them to my model. But the list itself can't be fed into the feed_dict parameter of sess.run. Also, passing gradients directly doesn't exactly work, I get the TypeError: unhashable type: 'numpy.ndarray' (I think the reason is that I can't pass in the numpy.ndarray, only tensorflow variable).
I could define a placeholder for the gradients, but for that I would need tu build the model first (to specify the trainable variables etc.).
All in all, please tell me there is a simpler way to implement this.
There is no simpler way than what you have already been told. That way may seem complicated at first, but it actually is really simple. You just have to use the low level API to manually calculate the gradients for each batch, average over them and than manually feed the averaged gradients to the optimizer to apply them.
I'll try to provide some stripped down code of how to do this. I'll use dots as placeholders for actual code which would depend on the problem. What you would usually do would be something like this:
import tensorflow as tf
[...]
input = tf.placeholder(...)
[...]
loss = ...
[...]
# initialize the optimizer
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
# define operation to apply the gradients
minimize = optimizer.minimize(loss)
[...]
if __name__ == '__main__':
session = tf.Session(config=CONFIG)
session.run(tf.global_variables_initializer())
for step in range(1, MAX_STEPS + 1):
data = ...
loss = session.run([minimize, loss],
feed_dict={input: data})[1]
What you want to do instead now, to average over multiple batches to preserver memory would be this:
import tensorflow as tf
[...]
input = tf.placeholder(...)
[...]
loss = ...
[...]
# initialize the optimizer
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
# grab all trainable variables
trainable_variables = tf.trainable_variables()
# define variables to save the gradients in each batch
accumulated_gradients = [tf.Variable(tf.zeros_like(tv.initialized_value()),
trainable=False) for tv in
trainable_variables]
# define operation to reset the accumulated gradients to zero
reset_gradients = [gradient.assign(tf.zeros_like(gradient)) for gradient in
accumulated_gradients]
# compute the gradients
gradients = optimizer.compute_gradients(loss, trainable_variables)
# Note: Gradients is a list of tuples containing the gradient and the
# corresponding variable so gradient[0] is the actual gradient. Also divide
# the gradients by BATCHES_PER_STEP so the learning rate still refers to
# steps not batches.
# define operation to evaluate a batch and accumulate the gradients
evaluate_batch = [
accumulated_gradient.assign_add(gradient[0]/BATCHES_PER_STEP)
for accumulated_gradient, gradient in zip(accumulated_gradients,
gradients)]
# define operation to apply the gradients
apply_gradients = optimizer.apply_gradients([
(accumulated_gradient, gradient[1]) for accumulated_gradient, gradient
in zip(accumulated_gradients, gradients)])
# define variable and operations to track the average batch loss
average_loss = tf.Variable(0., trainable=False)
update_loss = average_loss.assign_add(loss/BATCHES_PER_STEP)
reset_loss = average_loss.assign(0.)
[...]
if __name__ == '__main__':
session = tf.Session(config=CONFIG)
session.run(tf.global_variables_initializer())
data = [batch_data[i] for i in range(BATCHES_PER_STEP)]
for batch_data in data:
session.run([evaluate_batch, update_loss],
feed_dict={input: batch_data})
# apply accumulated gradients
session.run(apply_gradients)
# get loss
loss = session.run(average_loss)
# reset variables for next step
session.run([reset_gradients, reset_loss])
This should be runnable if you fill in the gaps. However I might have made a mistake while stripping it down and pasting it here. For a runnable example you can take a look into a project I am currently working on myself.
I also want to make clear that this is not the same as evaluating the loss for all the batch data at once, since you average over the gradients. This is especially important when your loss does not work well with low statistics. Take a chi square of histograms for example, calculating the average gradients for a chi square of histograms with low bin counts won't be as good as calculating the gradient on just one histogram with all the bins filled up at once.
You would need to give the gradients as the values that get passed to apply_gradients. It can be placeholders, but it is probably easier to use the usual compute_gradients/apply_gradients combination:
# Some loss measure
loss = ...
optimizer = ...
gradients = optimizer.compute_gradients(loss)
# gradients is a list of pairs
_, gradient_tensors = zip(*gradients)
# Apply gradients as usual
train_op = optimizer.apply_gradients(gradients)
# On training
# Compute some gradients
gradient_values = session.run(gradient_tensors, feed_dict={...})
# gradient_values is a sequence of numpy arrays with gradients
# After averaging multiple evaluations of gradient_values apply them
session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))
If you want to compute the averages of the gradients within TensorFlow too, that requires a bit of extra code specifically for that, maybe something like this:
# Some loss measure
loss = ...
optimizer = ...
gradients = optimizer.compute_gradients(loss)
# gradients is a list of pairs
_, gradient_tensors = zip(*gradients)
# Apply gradients as usual
train_op = optimizer.apply_gradients(gradients)
# Additional operations for gradient averaging
gradient_placeholders = [tf.placeholder(t.dtype, (None,) + t.shape)
for t in gradient_tensors]
gradient_averages = [tf.reduce_mean(p, axis=0) for p in gradient_placeholders]
# On training
gradient_values = None
# Compute some gradients
for ...: # Repeat for each small batch
gradient_values_current = session.run(gradient_tensors, feed_dict={...})
if gradient_values is None:
gradient_values = [[g] for g in gradient_values_current]
else:
for g_list, g in zip(gradient_values, gradient_values_current):
g_list.append(g)
# Stack gradients
gradient_values = [np.stack(g_list) for g_list in gradient_values)
# Compute averages
gradient_values_average = session.run(
gradient_averages, feed_dict=dict(zip(gradient_placeholders, gradient_values)))
# After averaging multiple gradients apply them
session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))

How do I get the gradient of the loss at a TensorFlow variable?

The feature I'm after is to be able to tell what the gradient of a given variable is with respect to my error function given some data.
One way to do this would be to see how much the variable has changed after a call to train, but obviously that can vary massively based on the learning algorithm (for example it would be almost impossible to tell with something like RProp) and just isn't very clean.
Thanks in advance.
The tf.gradients() function allows you to compute the symbolic gradient of one tensor with respect to one or more other tensors—including variables. Consider the following simple example:
data = tf.placeholder(tf.float32)
var = tf.Variable(...) # Must be a tf.float32 or tf.float64 variable.
loss = some_function_of(var, data) # some_function_of() returns a `Tensor`.
var_grad = tf.gradients(loss, [var])[0]
You can then use this symbolic gradient to evaluate the gradient in some specific point (data):
sess = tf.Session()
var_grad_val = sess.run(var_grad, feed_dict={data: ...})
In TensorFlow 2.0 you can use GradientTape to achieve this. GradientTape records the gradients of any computation that happens in the context of that. Below is an example of how you might do that.
import tensorflow as tf
# Here goes the neural network weights as tf.Variable
x = tf.Variable(3.0)
# TensorFlow operations executed within the context of
# a GradientTape are recorded for differentiation
with tf.GradientTape() as tape:
# Doing the computation in the context of the gradient tape
# For example computing loss
y = x ** 2
# Getting the gradient of network weights w.r.t. loss
dy_dx = tape.gradient(y, x)
print(dy_dx) # Returns 6

How to get the gradients of loss with respect to activations in Tensorflow

In the cifar10 example, the gradients of loss with respect to parameters can be computed as follows:
grads_and_vars = opt.compute_gradients(loss)
for grad, var in grads_and_vars:
# ...
Is there any way to get the gradients of loss with respect to activations (not the parameters), and watch them in Tensorboard?
You can use the tf.gradients() function to compute the gradient of any scalar tensor with respect to any other tensor (assuming the gradients are defined for all of the ops between those two tensors):
activations = ...
loss = f(..., activations) # `loss` is some function of `activations`.
grad_wrt_activations, = tf.gradients(loss, [activation])
Visualizing this in TensorBoard is tricky in general, since grad_wrt_activation is (typically) a tensor with the same shape as activation. Adding a tf.histogram_summary() op is probably the easiest way to visualize
# Adds a histogram of `grad_wrt_activations` to the graph, which will be logged
# with the other summaries, and shown in TensorBoard.
tf.histogram_summary("Activation gradient", grad_wrt_activations)