How to feed the list of gradients, or (grad, variable name) pairs, to my model - tensorflow

This is related to a previous question, How to partition a single batch into many invocations to save memory, and also to How to train a big model with relatively large batch size on a single GPU using Tensorflow?; but I still couldn't find an exact answer. For example, the answer to another related question, tensorflow - run optimizer op on a large batch, doesn't work for me (by the way, it wasn't accepted and there are no further comments there).
I want to simulate a larger batch size while using only one GPU.
So, I need to compute the gradients for every smaller batch, aggregate/average them across several such smaller batches, and only then apply them.
(Basically, it's like synchronized distributed SGD, but on a single device/GPU, performed serially. Of course, the acceleration advantage of distributed SGD is lost, but a larger batch size by itself may enable convergence to higher accuracy and allow a larger step size, as indicated by a few recent papers.)
To keep the memory requirement low, I should do standard SGD with small batches, accumulate the gradients over several iterations, and only then call optimizer.apply_gradients() (where optimizer is one of the implemented optimizers).
So, everything looks simple but when I go to implement it, it is actually not so trivial.
For example, I would like to use one Graph, compute the gradients for each iteration, and then, once several batches have been processed, sum the gradients and pass them to my model. But the list itself can't be fed into the feed_dict parameter of sess.run. Also, passing the gradients directly doesn't quite work; I get TypeError: unhashable type: 'numpy.ndarray' (I think the reason is that I can't use a numpy.ndarray as a feed_dict key, only a TensorFlow tensor).
I could define a placeholder for the gradients, but for that I would need to build the model first (to specify the trainable variables, etc.).
All in all, please tell me there is a simpler way to implement this.

There is no simpler way than what you have already been told. That way may seem complicated at first, but it is actually really simple. You just have to use the low-level API to manually compute the gradients for each batch, average over them, and then manually feed the averaged gradients to the optimizer to apply them.
I'll try to provide some stripped down code of how to do this. I'll use dots as placeholders for actual code which would depend on the problem. What you would usually do would be something like this:
import tensorflow as tf
[...]
input = tf.placeholder(...)
[...]
loss = ...
[...]
# initialize the optimizer
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
# define operation to apply the gradients
minimize = optimizer.minimize(loss)
[...]
if __name__ == '__main__':
    session = tf.Session(config=CONFIG)
    session.run(tf.global_variables_initializer())

    for step in range(1, MAX_STEPS + 1):
        data = ...
        # use a separate name so the loss tensor isn't overwritten inside the loop
        loss_value = session.run([minimize, loss],
                                 feed_dict={input: data})[1]
What you want to do instead, in order to average over multiple batches and preserve memory, is this:
import tensorflow as tf
[...]
input = tf.placeholder(...)
[...]
loss = ...
[...]
# initialize the optimizer
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
# grab all trainable variables
trainable_variables = tf.trainable_variables()
# define variables to save the gradients in each batch
accumulated_gradients = [tf.Variable(tf.zeros_like(tv.initialized_value()),
                                     trainable=False)
                         for tv in trainable_variables]
# define operation to reset the accumulated gradients to zero
reset_gradients = [gradient.assign(tf.zeros_like(gradient))
                   for gradient in accumulated_gradients]
# compute the gradients
gradients = optimizer.compute_gradients(loss, trainable_variables)
# Note: Gradients is a list of tuples containing the gradient and the
# corresponding variable so gradient[0] is the actual gradient. Also divide
# the gradients by BATCHES_PER_STEP so the learning rate still refers to
# steps not batches.
# define operation to evaluate a batch and accumulate the gradients
evaluate_batch = [
    accumulated_gradient.assign_add(gradient[0] / BATCHES_PER_STEP)
    for accumulated_gradient, gradient in zip(accumulated_gradients,
                                              gradients)]
# define operation to apply the gradients
apply_gradients = optimizer.apply_gradients([
    (accumulated_gradient, gradient[1])
    for accumulated_gradient, gradient in zip(accumulated_gradients,
                                              gradients)])
# define variable and operations to track the average batch loss
average_loss = tf.Variable(0., trainable=False)
update_loss = average_loss.assign_add(loss/BATCHES_PER_STEP)
reset_loss = average_loss.assign(0.)
[...]
if __name__ == '__main__':
    session = tf.Session(config=CONFIG)
    session.run(tf.global_variables_initializer())

    # collect the small batches that together make up one training step
    data = [...]  # list of BATCHES_PER_STEP batches
    for batch_data in data:
        session.run([evaluate_batch, update_loss],
                    feed_dict={input: batch_data})

    # apply accumulated gradients
    session.run(apply_gradients)

    # get the averaged loss of this step
    loss_value = session.run(average_loss)

    # reset variables for the next step
    session.run([reset_gradients, reset_loss])
This should be runnable if you fill in the gaps. However, I might have made a mistake while stripping it down and pasting it here. For a runnable example, you can take a look at a project I am currently working on myself.
I also want to make clear that this is not the same as evaluating the loss for all the batch data at once, since you average over the gradients. This is especially important when your loss does not work well with low statistics. Take a chi-square of histograms, for example: calculating the average of the gradients of chi-squares over histograms with low bin counts won't be as good as calculating the gradient of a single chi-square over one histogram with all the bins filled up at once.

You would need to feed the gradient values to the tensors that get passed to apply_gradients. They could be placeholders, but it is probably easier to use the usual compute_gradients/apply_gradients combination:
# Some loss measure
loss = ...
optimizer = ...
gradients = optimizer.compute_gradients(loss)
# gradients is a list of (gradient, variable) pairs
gradient_tensors, _ = zip(*gradients)
# Apply gradients as usual
train_op = optimizer.apply_gradients(gradients)
# On training
# Compute some gradients
gradient_values = session.run(gradient_tensors, feed_dict={...})
# gradient_values is a sequence of numpy arrays with gradients
# After averaging multiple evaluations of gradient_values apply them
session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))
If you want to compute the averages of the gradients within TensorFlow too, that requires a bit of extra code specifically for that, maybe something like this:
import numpy as np

# Some loss measure
loss = ...
optimizer = ...
gradients = optimizer.compute_gradients(loss)
# gradients is a list of (gradient, variable) pairs
gradient_tensors, _ = zip(*gradients)
# Apply gradients as usual
train_op = optimizer.apply_gradients(gradients)
# Additional operations for gradient averaging
gradient_placeholders = [tf.placeholder(t.dtype, [None] + t.shape.as_list())
                         for t in gradient_tensors]
gradient_averages = [tf.reduce_mean(p, axis=0) for p in gradient_placeholders]
# On training
gradient_values = None
# Compute some gradients
for ...:  # Repeat for each small batch
    gradient_values_current = session.run(gradient_tensors, feed_dict={...})
    if gradient_values is None:
        gradient_values = [[g] for g in gradient_values_current]
    else:
        for g_list, g in zip(gradient_values, gradient_values_current):
            g_list.append(g)
# Stack gradients
gradient_values = [np.stack(g_list) for g_list in gradient_values]
# Compute averages
gradient_values_average = session.run(
    gradient_averages, feed_dict=dict(zip(gradient_placeholders, gradient_values)))
# After averaging multiple gradients apply them
session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))

Related

Calculating gradients in Custom training loop, difference in performance TF vs Torch

I have attempted to translate a PyTorch implementation of a NN model, which calculates forces and energies in molecular structures, to TensorFlow. This needed a custom training loop and a custom loss function, so I implemented two different one-step training functions below.
First, using nested gradient tapes:
def calc_gradients(D_train_batch, E_train_batch, F_train_batch, opt):
    # set up gradient tape scope in order to track gradients of both d(Loss)/d(Weights)
    # and d(output)/d(input)
    with tf.GradientTape() as tape1:
        with tf.GradientTape() as tape2:
            # set gradient tape to watch Tensor
            tape2.watch(D_train_batch)
            # pass D through model to get predicted energy vals
            E_pred = model(D_train_batch, training=True)
        df_dD_train_batch = tape2.gradient(E_pred, D_train_batch)
        # matrix mult of -Grad_D(f) x Grad_r(D)
        F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
        # calculate loss value
        loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)
    grads = tape1.gradient(loss, model.trainable_weights)
    opt.apply_gradients(zip(grads, model.trainable_weights))
The other attempt, with a persistent gradient tape (persistent=True):
def calc_gradients_persistent(D_train_batch, E_train_batch, F_train_batch, opt):
    # set up gradient tape scope in order to track gradients of both d(Loss)/d(Weights)
    # and d(output)/d(input)
    with tf.GradientTape(persistent=True) as outer:
        # set gradient tape to watch Tensor
        outer.watch(D_train_batch)
        # output values from model; set training=True to get
        # model.trainable_weights out
        E_pred = model(D_train_batch, training=True)
        # set gradient tape to watch trainable weights
        outer.watch(model.trainable_weights)
        # get gradient of output (f/E_pred) w.r.t. input (D/D_train_batch)
        df_dD_train_batch = outer.gradient(E_pred, D_train_batch)
        # matrix mult of -Grad_D(f) x Grad_r(D)
        F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
        # calculate loss value
        loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)
    # get gradient of loss w.r.t. trainable weights for backpropagation
    grads = outer.gradient(loss, model.trainable_weights)
    # update weights using the optimizer and the gradients (grads)
    opt.apply_gradients(zip(grads, model.trainable_weights))
These were attempted translations of the PyTorch code:
# Forward pass: Predict energies from the descriptor input
E_train_pred_batch = model(D_train_batch)
# Get derivatives of model output with respect to input variables. The
# torch.autograd.grad function can be used for this, as it returns the
# gradients of the outputs with respect to the inputs. It is very important
# to set create_graph=True in this case. Without it the derivatives
# of the NN parameters with respect to the loss from the force error
# will not be populated (=the force error will not affect the
# training), but the model will still run fine without errors.
df_dD_train_batch = torch.autograd.grad(
    outputs=E_train_pred_batch,
    inputs=D_train_batch,
    grad_outputs=torch.ones_like(E_train_pred_batch),
    create_graph=True,
)[0]
# Get derivatives of input variables (=descriptor) with respect to atom
# positions = forces
F_train_pred_batch = -torch.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
# Zero gradients, perform a backward pass, and update the weights.
# D_train_batch.grad.data.zero_()
optimizer.zero_grad()
loss = energy_force_loss(E_train_pred_batch, E_train_batch, F_train_pred_batch, F_train_batch)
loss.backward()
optimizer.step()
which is from the tutorial for the Dscribe library at https://singroup.github.io/dscribe/latest/tutorials/machine_learning/forces_and_energies.html
Question
Using either version of the TF implementation, there is a huge loss in prediction accuracy compared to running the PyTorch version. I was wondering: have I misunderstood the PyTorch code and translated it incorrectly, and if so, where is my discrepancy?
P.S.
The model directly computes the energies E, and we use the gradient of E w.r.t. D to calculate the forces F. The loss function is a weighted sum of the MSEs of both the forces and the energies.
These methods are in fact the same; my error was somewhere else and was creating the differing results. For anyone who's trying to implement the TensorFlow versions: the nested gradient tapes are about 2x faster, at least in this scenario, and make sure to wrap the functions in @tf.function in order to use graphs instead of eager execution; the speed-up there is about 10x.
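For reference, here is a minimal sketch of what such a weighted energy/force loss might look like; the weight factor energy_weight and the use of plain MSE terms are assumptions for illustration, not the exact loss from the question.
import tensorflow as tf

def force_energy_loss(E_pred, F_pred, E_true, F_true, energy_weight=1.0):
    # hypothetical weighted sum of an energy MSE and a force MSE
    e_loss = tf.reduce_mean(tf.square(E_pred - E_true))
    f_loss = tf.reduce_mean(tf.square(F_pred - F_true))
    return energy_weight * e_loss + f_loss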

Is it possible to loop through all minibatches in a single tensorflow op using dataset/iterators?

I'm working with the tf.data.Dataset/Iterator mechanism and trying to improve data loading performance. It occurred to me that offloading the entire minibatch loop from Python might help. My data is small enough that storing it on CPU or GPU is no problem.
So, is it possible to loop an optimizer node over a full minibatched epoch within a single call to session.run?
The tensor returned by iterator.get_next() is only incremented once per session.run, which would seem to make it impossible to iterate through a dataset of minibatches... but if it could be done, my CPU would only have to touch the Python thread once per epoch.
UPDATE: @muskrat's suggestion to use tf.slice can be used for this purpose. See my subsequent non-answer with a schematic implementation of this using tf.while_loop. However, the question is whether this can be accomplished using dataset/iterators... and I'd still like to know.
From the description it seems that you already have the dataset preloaded as a constant on CPU/GPU, as in this example. That's certainly the first step.
Second, I suggest using tf.slice() to replicate the effect of the minibatch operation. In other words, just manually slice minibatches out of the preloaded constant (your dataset), and you should get the desired behavior. See for example the slice docs or this related post.
If that's not enough detail, please edit your question to include a code example (with mnist or something) and I can give more details.
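As a rough illustration of that suggestion (the names my_numpy_dataset, preloaded_data, and batch_index are made up for this sketch):
import numpy as np
import tensorflow as tf

# stand-in for a dataset that fits in memory, shape [N, num_features]
my_numpy_dataset = np.random.rand(1000, 10).astype(np.float32)

# preload the whole dataset as a constant in the graph
preloaded_data = tf.constant(my_numpy_dataset)

batch_size = 32
batch_index = tf.placeholder(tf.int32, shape=[], name='batch_index')

# manually slice one minibatch out of the preloaded constant
minibatch = tf.slice(preloaded_data,
                     begin=[batch_index * batch_size, 0],
                     size=[batch_size, -1])

with tf.Session() as sess:
    first_batch = sess.run(minibatch, feed_dict={batch_index: 0})
    print(first_batch.shape)  # (32, 10)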
This "answer" is an implementation of muskrat's tf.slice suggestion with the details of tf.while_loop worked out (with help from How to use tf.while_loop() in tensorflow and https://www.tensorflow.org/api_docs/python/tf/while_loop).
Unless your data and model are small enough that you're bottlenecked by Python I/O (like me!), this solution is probably academic.
Advantages:
Trains over minibatches without returning to the Python thread.
Uses only ops that have GPU implementations, meaning that the entire graph can be placed on the GPU.
On my small dataset, which is presumably bottlenecked by Python I/O, this solution is twice the speed of my dataset/iterator (which touches Python once per minibatch) and four times the speed of passing minibatches through feed_dict.
Disadvantages:
tf.while_loop is treacherous. It's challenging to understand when ops inside the loop's body are evaluated and when the ops they depend on are evaluated, particularly given the (thin) official documentation and the limited Stack Overflow coverage.
What the documentation of tf.while_loop doesn't spell out is that tensors defined outside the body of the loop are only evaluated once, even if ops inside the body depend on them. This means that the optimizer, model, and loss have to be defined inside the loop. This limits flexibility if you'd like, for example, to be able to call validation loss ops between training epochs. Presumably this could be accomplished with tf.cond statements and the appropriate flags passed in via feed_dict (see the sketch after this list), but it would not be nearly as flexible or elegant as the dataset/iterator mechanism in tf.data.
Adding shuffling operations at each epoch doesn't seem to be available on the GPU.
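For what it's worth, a minimal sketch of that tf.cond idea, with a boolean flag fed via feed_dict; the two constant losses are just stand-ins for real training and validation loss tensors:
import tensorflow as tf

# stand-ins for the real training and validation losses
train_loss = tf.constant(0.5, name='train_loss')
validation_loss = tf.constant(0.7, name='validation_loss')

# flag passed in via feed_dict to switch which branch is evaluated
is_validation = tf.placeholder(tf.bool, shape=[], name='is_validation')
selected_loss = tf.cond(is_validation,
                        lambda: validation_loss,
                        lambda: train_loss)

with tf.Session() as sess:
    print(sess.run(selected_loss, feed_dict={is_validation: False}))  # 0.5
    print(sess.run(selected_loss, feed_dict={is_validation: True}))   # 0.7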
Here's my schematic code (I've omitted the variable and model definitions for brevity):
def buildModel(info, training_data, training_targets):
    graph = tf.Graph()
    with graph.as_default():
        # The minibatch size is passed in from Python once per epoch.
        batch_size = tf.placeholder(tf.float32, name='batch_size')

        # Initializers for loop variables for tf.while_loop
        batchCounter = tf.Variable(0, dtype=tf.float32, trainable=False)
        lossList = tf.Variable(tf.zeros([0, 1]), trainable=False)

        # In a full example, I'd normalize my data here. And possibly shuffle.
        tf_training_data = tf.constant(training_data, dtype=tf.float32)
        tf_training_targets = tf.constant(training_targets, dtype=tf.float32)

        # For brevity, I'll spare the definitions of my variables. Because tf.Variables
        # are essentially treated as globals in the model and are manipulated directly
        # (like with tf.apply), they can reside outside runMinibatch, the body of
        # tf.while_loop.
        # weights_1 =
        # biases_1 =
        # etc.

        def moreMinibatches(batchCount, lossList):
            return (batchCount + 1) * batch_size <= len(training_data)

        def runMinibatch(batchCount, lossList):
            # These tensors and ops have to be defined inside runMinibatch, otherwise
            # they're not updated as tf.while_loop loops. This means slices, model
            # definition, loss tensor, and training op.
            dat_batch = tf.slice(tf_training_data,
                                 [tf.cast(batchCount * batch_size, tf.int32), 0],
                                 [tf.cast(batch_size, tf.int32), -1])
            targ_batch = tf.slice(tf_training_targets,
                                  [tf.cast(batchCount * batch_size, tf.int32), 0],
                                  [tf.cast(batch_size, tf.int32), -1])

            # Here's where you'd define the model as a function of the weights and
            # biases above and dat_batch
            # model = <insert here>

            loss = tf.reduce_mean(tf.squared_difference(model, targ_batch))
            optimizer = tf.train.AdagradOptimizer(0.01)  # for example
            train_op = optimizer.minimize(loss, name='optimizer')

            # control_dependencies ensures that train_op is run before return,
            # even though the return values don't explicitly depend on it.
            with tf.control_dependencies([train_op]):
                return batchCount + 1, tf.concat([lossList, [[loss]]], 0)

        # So, the idea is that this trains a full epoch without returning to Python.
        trainMinibatches = tf.while_loop(moreMinibatches, runMinibatch,
                                         [batchCounter, lossList],
                                         shape_invariants=[batchCounter.get_shape(),
                                                           tf.TensorShape(None)])

    return (graph,
            {'trainMinibatches': trainMinibatches,
             'batchCounter': batchCounter,
            })

numEpochs = 100    # e.g.
minibatchSize = 32
# training_dataset = <data here>
# training_targets = <targets here>
graph, ops = buildModel(info, training_dataset, training_targets)

with tf.Session(graph=graph, config=config) as session:
    tf.global_variables_initializer().run()
    for i in range(numEpochs):
        # This op trains on all minibatches that fit in the full dataset. finalBatchCount
        # will be the number of complete minibatches in the dataset; lossList holds each
        # minibatch's loss.
        finalBatchCount, lossList = session.run(ops['trainMinibatches'],
                                                feed_dict={'batch_size:0': minibatchSize})
        print('minibatch losses at Epoch', i, ': ', lossList)
I implemented the tf.slice() and tf.while_loop approach suggested above to vectorize the mini-batch loop.
In my case the performance was about 1.86 times faster than running mini-batches through feed_dict, but I found that the loss values of each epoch were not stable.
Then I used tf.random_shuffle to shuffle the inputs every epoch, and the problem was much mitigated (though the performance gain was reduced to 1.68 times).
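Here is a rough sketch of that kind of shuffle, assuming the dataset is preloaded as constants; shuffled row indices are gathered from both tensors so that examples and targets stay paired (the dummy arrays are only there to make the snippet self-contained):
import numpy as np
import tensorflow as tf

# dummy stand-ins for the preloaded dataset
training_data = np.random.rand(1000, 10).astype(np.float32)
training_targets = np.random.rand(1000, 1).astype(np.float32)

tf_training_data = tf.constant(training_data)
tf_training_targets = tf.constant(training_targets)

# shuffle a vector of row indices and gather both tensors with it,
# so that each example keeps its target
shuffled_indices = tf.random_shuffle(tf.range(tf.shape(tf_training_data)[0]))
shuffled_data = tf.gather(tf_training_data, shuffled_indices)
shuffled_targets = tf.gather(tf_training_targets, shuffled_indices)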

Modify gradient in TensorFlow backward pass

I'm trying to modify the gradients for all layers in TensorFlow with a custom gradient, and also save the current gradient. Conceptually, the computation process for a single layer in the i-th iteration looks like:
original_grad = (actual gradient computed by TF)
custom_grad = f(original_grad, stored_grad[i - 1])
stored_grad[i] = original_grad
Use custom_grad to update layer weights
I'm pretty new to TF so quite lost on how / whether this could be achieved.
To answer your question, we must look at what optimizers usually do when you call optimizer.minimize(loss).
They actually perform two subsequent operations: compute_gradients() and apply_gradients().
From Tensorflow documentation of tf.train.Optimizer, we read:
Calling minimize() takes care of both computing the gradients and applying them to the variables.
So, if you want to process the gradients before applying them, you can instead use the optimizer in three steps:
Compute the gradients with compute_gradients().
Process the gradients as you wish.
Apply the processed gradients with apply_gradients().
Directly from the documentation, here is an example of applying some modifications to the gradients:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
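Applied to the pattern described in the question, a minimal sketch might look like the following; the combination function f (here just an arbitrary blend of the current and previous gradient) and the non-trainable variables used as stored_grad are assumptions for illustration, not a prescribed recipe.
import tensorflow as tf

# toy model: a single variable and a quadratic loss
w = tf.Variable(3.0)
loss = tf.square(w - 1.0)

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss, [w])

train_ops = []
for grad, var in grads_and_vars:
    # stored_grad holds the previous iteration's original gradient
    stored_grad = tf.Variable(tf.zeros_like(var), trainable=False)
    # custom_grad = f(original_grad, stored_grad); this blend is a made-up example of f
    custom_grad = 0.9 * grad + 0.1 * stored_grad
    # apply the custom gradient, then overwrite stored_grad with the original gradient
    apply_op = opt.apply_gradients([(custom_grad, var)])
    with tf.control_dependencies([apply_op]):
        train_ops.append(stored_grad.assign(grad))

train_op = tf.group(*train_ops)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(5):
        sess.run(train_op)
    print(sess.run(w))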

How can I intercept the gradient from automatic differentiation in TensorFlow?

Let's say I have two subsequent layers with activations a1 and a2. Is there a way to intercept the gradients that automatic differentiation propagates from layer 2 to layer 1, i.e. ∂E/∂a2? I would like to change this gradient and then pass it on to layer 1.
From tf.train.Optimizer documentation,
Processing gradients before applying them.
Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps:
Compute the gradients with compute_gradients().
Process the gradients as you wish.
Apply the processed gradients with apply_gradients().
Example:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
You might be looking for tf.Graph.gradient_override_map. There is a good example in the tensorflow docs:
@tf.RegisterGradient("CustomSquare")
def _custom_square_grad(op, grad):
    # ...

with tf.Graph().as_default() as g:
    c = tf.constant(5.0)
    s_1 = tf.square(c)  # Uses the default gradient for tf.square.
    with g.gradient_override_map({"Square": "CustomSquare"}):
        s_2 = tf.square(c)  # Uses _custom_square_grad to compute the
                            # gradient of s_2.
There is a real-world use of it here, passing the real-valued gradient back through quantized weights in a DoReFa-Net implementation.
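To tie this back to the question, a common trick is to insert a tf.identity op between the two layers and override its gradient; below is a minimal sketch of that idea. The scaling by 0.5 just stands in for whatever modification you want to apply to ∂E/∂a2, and the two dense layers are made up for the example.
import tensorflow as tf

@tf.RegisterGradient("ScaleGrad")
def _scale_grad(op, grad):
    # modify the incoming gradient dE/da2 before it reaches layer 1
    return 0.5 * grad

g = tf.get_default_graph()

x = tf.placeholder(tf.float32, shape=[None, 4])
a1 = tf.layers.dense(x, 8, activation=tf.nn.relu)   # layer 1
with g.gradient_override_map({"Identity": "ScaleGrad"}):
    a1_hook = tf.identity(a1)                        # interception point between the layers
a2 = tf.layers.dense(a1_hook, 1)                     # layer 2
loss = tf.reduce_mean(tf.square(a2))

# gradients flowing from layer 2 back to layer 1 now pass through _scale_grad
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)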

How do I get the gradient of the loss at a TensorFlow variable?

The feature I'm after is to be able to tell what the gradient of a given variable is with respect to my error function given some data.
One way to do this would be to see how much the variable has changed after a call to train, but obviously that can vary massively based on the learning algorithm (for example it would be almost impossible to tell with something like RProp) and just isn't very clean.
Thanks in advance.
The tf.gradients() function allows you to compute the symbolic gradient of one tensor with respect to one or more other tensors—including variables. Consider the following simple example:
data = tf.placeholder(tf.float32)
var = tf.Variable(...) # Must be a tf.float32 or tf.float64 variable.
loss = some_function_of(var, data) # some_function_of() returns a `Tensor`.
var_grad = tf.gradients(loss, [var])[0]
You can then use this symbolic gradient to evaluate the gradient at some specific point, i.e. for some specific input data:
sess = tf.Session()
var_grad_val = sess.run(var_grad, feed_dict={data: ...})
In TensorFlow 2.0 you can use GradientTape to achieve this. GradientTape records any computation that happens in its context, so that gradients can be computed afterwards. Below is an example of how you might do that.
import tensorflow as tf

# Here go the neural network weights as tf.Variable
x = tf.Variable(3.0)

# TensorFlow operations executed within the context of
# a GradientTape are recorded for differentiation
with tf.GradientTape() as tape:
    # Doing the computation in the context of the gradient tape
    # For example computing loss
    y = x ** 2

# Getting the gradient of network weights w.r.t. loss
dy_dx = tape.gradient(y, x)
print(dy_dx)  # tf.Tensor(6.0, shape=(), dtype=float32)
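Extending the same pattern to actual model weights, here is a small sketch; the Dense layer and the toy loss are just stand-ins for whatever model and loss you have:
import tensorflow as tf

layer = tf.keras.layers.Dense(2)
x = tf.ones((1, 3))

with tf.GradientTape() as tape:
    y = layer(x)                    # builds the layer and records the computation
    loss = tf.reduce_mean(y ** 2)   # toy loss

# gradient of the loss with respect to every trainable weight of the layer
grads = tape.gradient(loss, layer.trainable_variables)
for var, grad in zip(layer.trainable_variables, grads):
    print(var.name, grad.shape)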