Why are these Gradient Accumulation implementations not working? - tensorflow

Note:
After experimenting I noticed that this problem only occurs when I am training on the GPU. I created a github issue (#50454). At this point I am not sure what is happening exactly.
I am working on an implementation of Gradient Accumulation. However, none of my approaches seem to work. Below I describe two approaches that should work in theory but appear to conflict with TensorFlow.
The idea
I want to patch an arbitrary Optimizer-instance by replacing its apply_gradients() function by my own implementation which accumulates gradients.
# Build model first
model.build()
# Patch the optimizer
optimizer = get_patched_optimizer(optimizer, n, model.trainable_variables)
# Compile the model with the patched optimizer
model.compile(optimizer=optimizer)
where
def get_patched_optimizer(optimizer, n, trainable_variables):
    """Patch optimizer for gradient accumulation.
    :param optimizer:
        The optimizer to patch.
    :param n:
        The number of accumulation steps before applying gradients.
    :param trainable_variables:
        Trainable parameters of the model
    :return:
        A patched optimizer for gradient accumulation.
    """
    accumulator = _GradientAccumulationPatch(
        n=n,
        orig_apply_gradients=optimizer.apply_gradients,
        trainable_variables=trainable_variables
    )
    # Replace the original function
    optimizer.apply_gradients = accumulator.apply_gradients
    return optimizer
The happy (but not working) path
The simplest way would be to just accumulate gradients and apply gradients conditionally e.g. whenever current_step % n == 0.
However, it looks like I cannot use tf.cond() in this context, unlike the approach shown in "Gradient Accumulation with Custom model.fit in TF.Keras?".
Using tf.cond() results in the following RuntimeError
RuntimeError: merge_call called while defining a new graph or a tf.function. This can often happen if the function fn passed to strategy.run() contains a nested @tf.function, and the nested @tf.function contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients), or if the function fn uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested tf.functions or control flow statements that may potentially cross a synchronization boundary, for example, wrap the fn passed to strategy.run or the entire strategy.run inside a tf.function or move the control flow out of fn
Here is the implementation of _GradientAccumulationPatch using tf.cond():
class _GradientAccumulationPatch:
    def __init__(
        self,
        n: int,
        orig_apply_gradients,
        trainable_variables
    ):
        self.n = tf.constant(n, dtype=tf.int64)
        policy = tf.keras.mixed_precision.global_policy()
        self.variable_dtype = policy.variable_dtype
        self.accu_gradients = [
            tf.Variable(
                tf.zeros(g.shape, dtype=g.dtype),
            ) for g in trainable_variables
        ]
        self._current_step = tf.Variable(0, dtype=tf.int64)
        self._orig_apply_gradients = orig_apply_gradients
    def apply_gradients(self, grads_and_vars, *args, **kwargs):
        trainable_variables = [var for (_, var) in grads_and_vars]
        gradients = [grad for (grad, _) in grads_and_vars]
        # Always accumulate gradients
        for i, grad in enumerate(gradients):
            self.accu_gradients[i].assign_add(grad)
        tf.cond(
            self._can_apply_on_next_step(),
            true_fn=lambda: self.apply_accu_gradients(trainable_variables, args, kwargs),
            false_fn=lambda: None
        )
    def apply_accu_gradients(self, trainable_variables, *args, **kwargs):
        # Call the original apply_gradients() function
        self._orig_apply_gradients(zip(self.accu_gradients, trainable_variables), *args, **kwargs)
        # Reset all accumulated gradients to zero
        for i in range(len(self.accu_gradients)):
            self.accu_gradients[i].assign(tf.zeros_like(trainable_variables[i]))
    def _can_apply_on_next_step(self):
        """
        :return: True if gradients should be applied; False otherwise.
        """
        # Increment (always do this first)
        self._current_step.assign_add(1)
        count_mod_steps = tf.math.mod(self._current_step, self.n)
        return tf.equal(count_mod_steps, 0)
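For reference, the schedule that the tf.cond() version is meant to implement can be sketched without TensorFlow. This is a minimal plain-Python sketch under stated assumptions, not the implementation from the question: GradientAccumulator, apply_fn and the float "gradients" are hypothetical stand-ins for the patch class, the original apply_gradients() and real tensors.

```python
class GradientAccumulator:
    """Framework-free sketch: sum gradients and apply them on every n-th call."""

    def __init__(self, n, apply_fn, num_vars):
        self.n = n
        self.apply_fn = apply_fn          # hypothetical stand-in for optimizer.apply_gradients
        self.buffer = [0.0] * num_vars    # one accumulator per variable
        self.step = 0

    def apply_gradients(self, grads):
        # Increment first, then accumulate, mirroring the order used in the question.
        self.step += 1
        self.buffer = [b + g for b, g in zip(self.buffer, grads)]
        if self.step % self.n == 0:
            # Hand the accumulated gradients to the wrapped function, then reset.
            self.apply_fn(list(self.buffer))
            self.buffer = [0.0] * len(self.buffer)
            return True
        return False


applied = []
acc = GradientAccumulator(n=3, apply_fn=applied.append, num_vars=2)
flags = [acc.apply_gradients([1.0, 2.0]) for _ in range(7)]
```

With n=3, the wrapped function runs only on calls 3 and 6, each time receiving the sum of the last three gradients; the seventh gradient stays in the buffer.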
The more complicated path (also not working)
It is possible to remove the tf.cond() by simply using the signal apply, given by _can_apply_on_next_step(), as a multiplication factor and apply zero-gradients whenever we are in the accumulation-phase.
The idea would be to always accumulate gradients and always apply them with one particular change:
final_gradients = [grad * apply for grad in gradients]
self._orig_apply_gradients(zip(final_gradients, trainable_variables))
This is how we'd change the apply_gradients() method:
def apply_gradients(self, grads_and_vars, *args, **kwargs):
    can_apply = self._can_apply_on_next_step()
    # 1.0 whenever we want to apply gradients; 0.0 otherwise
    apply = tf.cast(can_apply, dtype=self.variable_dtype)
    # Will be 0.0 if apply is 1.0 and vice versa
    keep = tf.cast(tf.logical_not(can_apply), dtype=self.variable_dtype)
    grads_and_vars = list(grads_and_vars)
    gradients = [grad for (grad, _) in grads_and_vars]
    trainable_variables = [var for (_, var) in grads_and_vars]
    # Accumulate gradients
    for i, grad in enumerate(gradients):
        self.accu_gradients[i].assign_add(grad)
    # Multiply each gradient with our apply-signal
    final_gradients = [grad * apply for grad in self.accu_gradients]
    self._orig_apply_gradients(zip(final_gradients, trainable_variables), *args, **kwargs)
    # This will reset our buffer whenever "keep" is 0.0
    for g in self.accu_gradients:
        g.assign(g * keep)
But the problem is that self.accu_gradients[i].assign_add(grad) does not seem to have any effect. And yes, I have also tried
self.accu_gradients[i].assign(grad + self.accu_gradients[i])
Interestingly, the model starts to converge if I use assign(grad) instead of assign_add(grad), i.e. self.accu_gradients[i].assign(grad), as the loss curves show:
blue: just using assign() # <- no accumulation happening
red: using assign_add()
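The arithmetic of the multiply-by-mask scheme can be checked in isolation. The following is a plain-Python sketch on floats, assuming a hypothetical helper masked_accumulation_step that mirrors the apply/keep logic above; it says nothing about why assign_add() misbehaves on the GPU. One caveat worth keeping in mind: for stateful optimizers such as Adam, applying an all-zero gradient is generally not a no-op, because the moment estimates can still move the weights.

```python
def masked_accumulation_step(buffer, grads, step, n):
    """One step of the multiply-by-mask scheme, on plain floats."""
    step += 1
    can_apply = (step % n == 0)
    apply = 1.0 if can_apply else 0.0   # 1.0 on apply steps, 0.0 otherwise
    keep = 1.0 - apply                  # the opposite of `apply`
    # Always accumulate
    buffer = [b + g for b, g in zip(buffer, grads)]
    # What would be handed to the original apply_gradients()
    final_gradients = [b * apply for b in buffer]
    # Reset the buffer only on apply steps
    buffer = [b * keep for b in buffer]
    return buffer, final_gradients, step


buffer, step, history = [0.0], 0, []
for _ in range(4):
    buffer, final_gradients, step = masked_accumulation_step(buffer, [1.0], step, n=2)
    history.append(final_gradients)
```

With n=2 and a constant gradient of 1.0, the sketch passes zeros on odd steps and the accumulated sum on even steps.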
The train_step()
This patch should work independently of the model. I do have a custom train_step() for my model, though the implementation is pretty straightforward.
Here I am just computing the gradients and then calling the apply_gradients() method of the optimizer:
def train_step(self, data):
    (inputs, (input_lengths, label_lengths), mask), y_true = data
    loss, gradients = self.rnnt_gradient(
        inputs=inputs,
        y_true=y_true,
        input_lengths=input_lengths,
        label_lengths=label_lengths,
        mask=mask
    )
    self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
    return {'loss': loss}
def test_step(self, data):
    (inputs, (input_lengths, label_lengths), mask), y_true = data
    val_loss = self.rnnt_loss_wrapper(
        inputs=inputs,
        y_true=y_true,
        input_lengths=input_lengths,
        label_lengths=label_lengths,
        mask=mask
    )
    return dict(loss=val_loss)
def rnnt_gradient(
    self,
    inputs: tuple,
    y_true: tf.Tensor,
    input_lengths: tf.Tensor,
    label_lengths: tf.Tensor,
    mask=None
):
    with tf.GradientTape() as tape:
        model_loss = self.rnnt_loss_wrapper(
            inputs,
            y_true=y_true,
            input_lengths=input_lengths,
            label_lengths=label_lengths,
            mask=mask
        )
        is_mixed_precision = isinstance(self.optimizer, mixed_precision.LossScaleOptimizer)
        # We always want to return the unmodified model_loss for Tensorboard
        if is_mixed_precision:
            loss = self.optimizer.get_scaled_loss(model_loss)
        else:
            loss = model_loss
    gradients = tape.gradient(loss, self.trainable_variables)
    if is_mixed_precision:
        gradients = self.optimizer.get_unscaled_gradients(gradients)
    return model_loss, gradients

It turns out that this was totally my fault and it was due to the fact that whenever I trained with a mixed_float16 policy, I would have patched the wrong instance.
What I had was something like:
if precision_policy.name.startswith('mixed'):
    logger.info(f'Using LossScaleOptimizer (policy: "{precision_policy.name}")')
    optimizer = keras.mixed_precision.LossScaleOptimizer(optimizer)
if grad_acc_n > 1:
    # --> This patched the LossScaleOptimizer which caused the problem:
    optimizer = grad_acc.get_patched_optimizer(optimizer=optimizer, n=grad_acc_n)
So I would require a check like:
if isinstance(optimizer, keras.mixed_precision.LossScaleOptimizer):
    # Warning: This does NOT work either (just an example)!
    optimizer.inner_optimizer.apply_gradients = accumulator.apply_gradients
    raise Exception('Don\'t do this!')
else:
    optimizer.apply_gradients = accumulator.apply_gradients
However, as stated in the comment, patching the inner_optimizer does not work either. I haven't figured out why but at least I am now able to run a "normal" float32-policy training with my _GradientAccumulationPatch-implementation.

Related

When writing a custom loss function, should I use tf.reduce_mean, and if so how? Does it ever matter?

The sample code below shows that all the following give the same (correct) results when
writing a custom loss function (calculating mean_squared_error) for
a simple linear regression model.
Do not use tf.reduce_mean() (so returning a loss for each example)
Use tf.reduce_mean() (so returning a single loss)
Use tf.reduce_mean(..., axis=-1)
Is there any reason to prefer one approach to another, and are there any circumstances
where it makes a difference?
(There is, for example sample code at
Make a custom loss function in keras
that suggests axis=-1 should be used)
import numpy as np
import tensorflow as tf
# Create simple dataset to do linear regression on
# The mean squared error (~ best achievable MSE loss after fitting linear regression) for this dataset is 0.01
xtrain = np.random.randn(5000) # Already normalized
ytrain = xtrain + np.random.randn(5000) * 0.1 # Close enough to being normalized
# Function to create model and fit linear regression, and report final loss
def cre_and_fit(loss="mean_squared_error", lossdescription="", epochs=20):
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss=loss, optimizer="RMSProp")
    history = model.fit(xtrain, ytrain, epochs=epochs, verbose=False)
    print(f"Final loss value for {lossdescription}: {history.history['loss'][-1]:.4f}")
# Result from standard MSE loss ~ 0.01
cre_and_fit("mean_squared_error","Keras standard MSE")
# This gives the right result, not reducing. Return shape = (batch_size,)
cre_and_fit(lambda y_true, y_pred: (y_true-y_pred)*(y_true-y_pred),
"custom loss, not reducing over batch items" )
# This also gives the right result, reducing over batch items. Return shape = ()
cre_and_fit(lambda y_true, y_pred: tf.reduce_mean((y_true-y_pred)*(y_true-y_pred) ),
"custom loss, reducing over batch items")
# How about using axis=-1? Also gives the same result
cre_and_fit(lambda y_true, y_pred: tf.reduce_mean((y_true-y_pred)*(y_true-y_pred), axis=-1),
"custom loss, reducing with axis=-1" )
When you pass a lambda (or a callable in general) to compile and call fit, TF will wrap it inside a LossFunctionWrapper, which is a subclass of Loss, with a default reduction type of ReductionV2.AUTO. Note that a Loss object always has a reduction type representing how it will reduce the loss tensor to a single scalar.
Under most circumstances, ReductionV2.AUTO translates to ReductionV2.SUM_OVER_BATCH_SIZE which, despite its name, actually computes the mean over all axes of the underlying lambda's output.
import tensorflow as tf
from keras import losses as losses_mod
from keras.utils import losses_utils
a = tf.random.uniform((10,2))
b = tf.random.uniform((10,2))
l_auto = losses_mod.LossFunctionWrapper(fn=lambda y_true, y_pred : tf.square(y_true - y_pred), reduction=losses_utils.ReductionV2.AUTO)
l_sum = losses_mod.LossFunctionWrapper(fn=lambda y_true, y_pred : tf.square(y_true - y_pred), reduction=losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE)
l_auto(a,b).shape.rank == l_sum(a,b).shape.rank == 0 # rank 0 means scalar
l_auto(a,b) == tf.reduce_mean(tf.square(a - b)) # True
l_sum(a,b) == tf.reduce_mean(tf.square(a - b)) # True
So to answer your question, the three options are equivalent since they all eventually result in a single scalar that is the mean of all elements in the raw tf.square(a - b) loss tensor. However, should you wish to perform an operation other than reduce_mean e.g., reduce_sum, in the lambda, then the three will yield different results:
l1 = losses_mod.LossFunctionWrapper(fn=lambda y_true, y_pred : tf.square(y_true - y_pred),
reduction=losses_utils.ReductionV2.AUTO)
l2 = losses_mod.LossFunctionWrapper(fn=lambda y_true, y_pred : tf.reduce_sum(tf.square(y_true - y_pred)),
reduction=losses_utils.ReductionV2.AUTO)
l3 = losses_mod.LossFunctionWrapper(fn=lambda y_true, y_pred : tf.reduce_sum(tf.square(y_true - y_pred), axis=-1),
reduction=losses_utils.ReductionV2.AUTO)
l1(a,b) == tf.reduce_mean(tf.square(a-b)) # True
l2(a,b) == tf.reduce_sum(tf.square(a-b)) # True
l3(a,b) == tf.reduce_mean(tf.reduce_sum(tf.square(a-b), axis=-1)) # True
Concretely, l2(a,b) == tf.reduce_mean(tf.reduce_sum(tf.square(a-b))), but that is just tf.reduce_sum(tf.square(a-b)) since mean of a scalar is itself.
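The same bookkeeping can be followed by hand on a tiny example. The sketch below uses plain Python lists instead of tensors and assumes, as described above, that the wrapper's last step is simply a mean over every element the loss function returned; the numbers and the helper mean are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical per-element squared errors: a batch of 2 examples with 2 outputs each.
squared = [[1.0, 3.0], [5.0, 7.0]]
flat = [v for row in squared for v in row]

# The outer mean() plays the role of the wrapper's final reduction.
no_reduction = mean(flat)                            # fn returns the full (2, 2) values
mean_all = mean([mean(flat)])                        # fn returns one scalar mean
mean_last_axis = mean([mean(row) for row in squared])  # fn returns per-example means

sum_all = mean([sum(flat)])                          # fn returns one scalar sum
sum_last_axis = mean([sum(row) for row in squared])    # fn returns per-example sums
```

All three mean-based variants agree, while the two sum-based variants differ from each other and from the mean, which matches the l1, l2 and l3 comparison.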

Can't apply gradients on tf.Variable

I am trying to learn a similarity matrix (M) between two image embeddings. A single training instance is a pair of images (anchor, positive). Ideally, the model should return a distance of 0 for embeddings of similar images.
The problem is that when I declare the distance matrix (M) as a tf.Variable, I get an error
on this line:
self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
TypeError: 'Variable' object is not iterable.
I think I should use a TensorFlow data type for M that is iterable.
How can I fix this issue?
import tensorflow as tf
from tensorflow import keras
# metric learning model
class MetricLearningModel:
    def __init__(self, lr):
        self.optimizer = keras.optimizers.Adam(lr=lr)
        self.lr = lr
        self.loss_object = keras.losses.MeanSquaredError()
        self.trainable_variables = tf.Variable(
            (tf.ones((2048, 2048), dtype=tf.float32)),
            trainable=True
        )
    def similarity_function(self, anchor_embeddings, positive_embeddings):
        M = self.trainable_variables
        X_i = anchor_embeddings
        X_j = positive_embeddings
        similarity_value = tf.matmul(X_j, M, name='Tensor')
        similarity_value = tf.matmul(similarity_value, tf.transpose(X_i), name='Tensor')
        # distance(x,y) = sqrt( (x-y)#M#(x-y).T )
        return similarity_value
    def train_step(self, anchor, positive):
        anchor_embeddings, positive_embeddings = anchor, positive
        # Calculate gradients
        with tf.GradientTape() as tape:
            # Calculate similarity between anchors and positives.
            similarities = self.similarity_function(anchor_embeddings, positive_embeddings)
            y_pred = similarities
            y_true = tf.zeros(1)
            print(y_true, y_pred)
            loss_value = self.loss_object(
                y_pred=y_true,
                y_true=y_pred,
            )
        gradients = tape.gradient(loss_value, self.trainable_variables)
        # Apply gradients via optimizer
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
metric_model = MetricLearningModel(lr=1e-3)
anchor, positive = tf.ones((1, 2048), dtype=tf.float32), tf.ones((1, 2048), dtype=tf.float32)
metric_model.train_step(anchor, positive)
Python's zip function expects iterable objects, such as a list or a tuple.
In your calls to tape.gradient and optimizer.apply_gradients, you can put your Variable in a list to solve the issue:
with tf.GradientTape() as tape:
    ...  # forward pass and loss computation as before
gradients = tape.gradient(loss_value, [self.trainable_variables])
# Apply gradients via optimizer
self.optimizer.apply_gradients(zip(gradients, [self.trainable_variables]))
tape.gradient mirrors the structure of the sources object it is asked to differentiate with respect to, so if you feed it a list, you will get a list back. It is stated in the documentation:
Returns
a list or nested structure of Tensors (or IndexedSlices, or None), one for each element in sources. Returned structure is the same as the structure of sources.
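The zip behaviour can be reproduced without TensorFlow. This is a small sketch in plain Python, where FakeVariable is a hypothetical stand-in for a single non-iterable variable and 0.5 is a placeholder for its gradient:

```python
class FakeVariable:
    """Hypothetical stand-in for a single, non-iterable variable."""


variable = FakeVariable()
gradient = 0.5  # placeholder for the gradient tensor

try:
    # Passing the bare object fails: zip cannot iterate over it.
    list(zip([gradient], variable))
    error = None
except TypeError as exc:
    error = exc

# Wrapping the variable in a list yields exactly one (gradient, variable) pair.
pairs = list(zip([gradient], [variable]))
```

The first call raises a TypeError, and the second produces the single pair that apply_gradients expects to iterate over.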

Evaluating TF model inside a TF op throws error

I am using TensorFlow 2. I am trying to optimize a function which uses the loss of a trained tensorflow model (poison).
@tf.function
def totalloss(x):
    xt = tf.multiply(x, (1.0 - m)) + tf.multiply(m, d)
    label = targetlabel*np.ones(xt.shape[0])
    loss1 = poison.evaluate(xt, label, steps=1)
    loss2 = tf.linalg.norm(m, 1)
    return loss1 + loss2
I am not able to execute this function; however, when I comment out the @tf.function line, the function works!
I need to use this function as a TensorFlow op so as to optimize 'm' and 'd'.
ValueError: Unknown graph. Aborting.
This is how I am defining the model and variables:
# mask
m = tf.Variable(tf.zeros(shape=(1, 784)), name="m")
d = tf.Variable(tf.zeros(shape=(1, 784)), name="d")
# target
targetlabel = 6
poison = fcn()
poison.load_weights("MNISTP.h5")
adam = tf.keras.optimizers.Adam(lr=.002, decay=1e-6)
poison.compile(optimizer=adam, loss=tf.losses.sparse_categorical_crossentropy)
This is how I am calling the function later. (Executing this line results in the error listed below. However, if I comment out the @tf.function line, this command works!)
loss = totalloss(ptestdata)
This is the entire traceback call:
ValueError: in converted code:
<ipython-input-52-4841ad87022f>:5 totalloss *
loss1 = poison.evaluate(xt, label, steps=1)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:746 evaluate
use_multiprocessing=use_multiprocessing)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py:693 evaluate
callbacks=callbacks)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py:187 model_iteration
f = _make_execution_function(model, mode)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py:555 _make_execution_function
return model._make_execution_function(mode)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:2034 _make_execution_function
self._make_test_function()
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:2010 _make_test_function
**self._function_kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:3544 function
return EagerExecutionFunction(inputs, outputs, updates=updates, name=name)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:3429 __init__
raise ValueError('Unknown graph. Aborting.')
ValueError: Unknown graph. Aborting.
The purpose of the @tf.function decorator is to convert TensorFlow operations written in Python into a TensorFlow graph to achieve better performance. The error may arise because you are using a pre-trained model with a serialized graph, in which case the decorator cannot perform the graph-to-graph conversion.
I've reported this error here: https://github.com/tensorflow/tensorflow/issues/33997
A (temporary) solution is to split your loss function into two smaller functions and apply the decorator only to the one that does not involve the pre-trained model. This way you still get better performance for the other operations, but not for the part that uses the pre-trained model.
For example:
@tf.function
def _other_ops(x):
    xt = tf.multiply(x, (1.0 - m)) + tf.multiply(m, d)
    label = targetlabel * np.ones(xt.shape[0])
    loss2 = tf.linalg.norm(m, 1)
    return xt, label, loss2
def total_loss(x):
    xt, label, loss2 = _other_ops(x)
    loss1 = poison.evaluate(xt, label, steps=1)
    return loss1 + loss2
Update:
According to the discussion in the above TF issue link, an elegant solution is to manually pass the input through each layer of the model. You could get a list of layers in your model by calling your_model.layers
In your case, you might calculate the loss from the prediction of your output with the label in the last layer. Thus, I think you should skip the last layer and calculate the loss outside of the loop:
@tf.function
def totalloss(x):
    xt = tf.multiply(x, (1.0 - m)) + tf.multiply(m, d)
    label = targetlabel*np.ones(xt.shape[0])
    feat = xt
    # Skip the last layer which calculates loss1
    for i in range(len(poison.layers) - 1):
        layer = poison.layers[i]
        feat = layer(feat)
    # Now, calculate loss by yourself (argument order is y_true, y_pred)
    loss1 = tf.keras.losses.sparse_categorical_crossentropy(label, feat)
    loss2 = tf.linalg.norm(m, 1)
    return loss1 + loss2
According to the TF engineers' explanation in that issue, a model may wrap high-level processing that is not guaranteed to work inside @tf.function. So, putting a model inside a function decorated with @tf.function is not recommended, and we need to break the model into smaller pieces to work around it.

What is the proper way to weight decay for Adam Optimizer

Since the Adam optimizer keeps a pair of running averages (mean/variance) of the gradients, I wonder how it should properly handle weight decay. I have seen two ways of implementing it.
Only update mean/variance from the gradients based on the objective loss, decay weight explicitly at each mini-batch. (the following code is taken from https://github.com/dmlc/mxnet/blob/v0.7.0/python/mxnet/optimizer.py)
weight[:] -= lr*mean/(sqrt(variance) + self.epsilon)
wd = self._get_wd(index)
if wd > 0.:
    weight[:] -= (lr * wd) * weight
Update mean/variance from the gradients based on the objective loss + regularization loss, and update weights like usual. (the following code is taken from https://github.com/dmlc/mxnet/blob/master/src/operator/optimizer_op-inl.h#L210)
grad = scalar<DType>(param.rescale_grad) * grad +
scalar<DType>(param.wd) * weight;
// stuff
Assign(out, req[0],
weight -
scalar<DType>(param.lr) * mean /
(F<square_root>(var) + scalar<DType>(param.epsilon)));
These two approaches sometimes show significant differences in training results. I actually think the first one makes more sense (and I find it gives better results from time to time). Caffe and older versions of mxnet follow the first approach, while torch, tensorflow and newer versions of mxnet follow the second one.
Really appreciate your help!
Edit: see also this PR which just got merged into TF.
When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding a L2-regularization term to the loss. When using any other optimizer, this is not true.
Weight decay (don't know how to TeX here, so excuse my pseudo-notation):
w[t+1] = w[t] - learning_rate * dw - weight_decay * w
L2-regularization:
loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)
Computing the gradient of the extra term in L2-regularization gives lambda * w, and thus inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw
gives the same as weight decay, but mixes lambda with the learning_rate (the effective weight_decay is learning_rate * lambda). Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" at page 10)
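A quick numeric check of this claim, as a plain-Python sketch with made-up numbers. The "adaptive" update below is an assumption chosen for illustration: it rescales the gradient by its own magnitude, which is what Adam's bias-corrected first step reduces to, and it is not a full Adam implementation.

```python
lr, lam = 0.1, 0.5
w, g = 2.0, 0.3            # current weight and d(actual_loss)/dw (made-up numbers)
weight_decay = lr * lam    # the factor that makes the two SGD forms coincide

# Plain SGD: L2 term folded into the gradient vs. explicit weight decay.
sgd_l2 = w - lr * (g + lam * w)
sgd_decay = w - lr * g - weight_decay * w


def normalized(grad, eps=1e-8):
    """Rescale a gradient by its own magnitude (illustrative adaptive update)."""
    return grad / (abs(grad) + eps)


# Adaptive update: the L2 term gets rescaled too, the explicit decay does not.
adaptive_l2 = w - lr * normalized(g + lam * w)
adaptive_decay = w - lr * normalized(g) - weight_decay * w
```

Under plain SGD both forms land on the same weight (about 1.87 here). With the rescaled update they differ (about 1.9 versus about 1.8), because the regularization gradient is normalized away together with the loss gradient.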
That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of above paper.
One possible way to implement it is by writing an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is using an additional SGD optimizer just for the weight decay, and "attaching" it to your train_op. Both of these are just crude work-arounds, though. My current code:
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    # define the network.
loss = # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then all sum up and optimize using SGD which, as shown above, corresponds to actual weight-decay.
Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.
I came across the same question. I think this code that I got from here will work for you. It implements the weight decay adam optimizer by inheritance from the tf.train.Optimizer. This is the cleanest solution I have found:
import re
import tensorflow as tf
class AdamWeightDecayOptimizer(tf.train.Optimizer):
    """A basic Adam optimizer that includes "correct" L2 weight decay."""
    def __init__(self,
                 learning_rate,
                 weight_decay_rate=0.0,
                 beta_1=0.9,
                 beta_2=0.999,
                 epsilon=1e-6,
                 exclude_from_weight_decay=None,
                 name="AdamWeightDecayOptimizer"):
        """Constructs a AdamWeightDecayOptimizer."""
        super(AdamWeightDecayOptimizer, self).__init__(False, name)
        self.learning_rate = learning_rate
        self.weight_decay_rate = weight_decay_rate
        self.beta_1 = beta_1
        self.beta_2 = beta_2
        self.epsilon = epsilon
        self.exclude_from_weight_decay = exclude_from_weight_decay
    def apply_gradients(self, grads_and_vars, global_step=None, name=None):
        """See base class."""
        assignments = []
        for (grad, param) in grads_and_vars:
            if grad is None or param is None:
                continue
            param_name = self._get_variable_name(param.name)
            m = tf.get_variable(
                name=param_name + "/adam_m",
                shape=param.shape.as_list(),
                dtype=tf.float32,
                trainable=False,
                initializer=tf.zeros_initializer())
            v = tf.get_variable(
                name=param_name + "/adam_v",
                shape=param.shape.as_list(),
                dtype=tf.float32,
                trainable=False,
                initializer=tf.zeros_initializer())
            # Standard Adam update.
            next_m = (
                tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
            next_v = (
                tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                          tf.square(grad)))
            update = next_m / (tf.sqrt(next_v) + self.epsilon)
            # Just adding the square of the weights to the loss function is *not*
            # the correct way of using L2 regularization/weight decay with Adam,
            # since that will interact with the m and v parameters in strange ways.
            #
            # Instead we want to decay the weights in a manner that doesn't interact
            # with the m/v parameters. This is equivalent to adding the square
            # of the weights to the loss with plain (non-momentum) SGD.
            if self._do_use_weight_decay(param_name):
                update += self.weight_decay_rate * param
            update_with_lr = self.learning_rate * update
            next_param = param - update_with_lr
            assignments.extend(
                [param.assign(next_param),
                 m.assign(next_m),
                 v.assign(next_v)])
        return tf.group(*assignments, name=name)
    def _do_use_weight_decay(self, param_name):
        """Whether to use L2 weight decay for `param_name`."""
        if not self.weight_decay_rate:
            return False
        if self.exclude_from_weight_decay:
            for r in self.exclude_from_weight_decay:
                if re.search(r, param_name) is not None:
                    return False
        return True
    def _get_variable_name(self, param_name):
        """Get the variable name from the tensor name."""
        m = re.match("^(.*):\\d+$", param_name)
        if m is not None:
            param_name = m.group(1)
        return param_name
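To see what a single update of this class does to one weight, the arithmetic inside apply_gradients can be replayed on plain floats. This is only a scalar sketch that mirrors the update shown above (including the absence of bias correction); the function name adam_weight_decay_step and the example numbers are hypothetical.

```python
import math


def adam_weight_decay_step(param, grad, m, v, lr, weight_decay_rate,
                           beta_1=0.9, beta_2=0.999, epsilon=1e-6):
    """Scalar replay of one update: Adam term plus decoupled weight decay."""
    next_m = beta_1 * m + (1.0 - beta_1) * grad
    next_v = beta_2 * v + (1.0 - beta_2) * grad * grad
    update = next_m / (math.sqrt(next_v) + epsilon)
    # The decay is added after the Adam normalization, so it never enters m or v.
    update += weight_decay_rate * param
    return param - lr * update, next_m, next_v


# With a zero gradient, only the decay term moves the weight.
new_param, new_m, new_v = adam_weight_decay_step(
    param=1.0, grad=0.0, m=0.0, v=0.0, lr=0.1, weight_decay_rate=0.01)
```

With a zero gradient the moment estimates stay at zero and the weight shrinks by lr * weight_decay_rate * param, which is the decoupling that the comment in the AdamWeightDecayOptimizer class describes.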
You can use AdamWeightDecayOptimizer in the following way (I have made some changes to make it useful in a more general context). This function returns a train_op that can be used in the Session:
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps):
    """Creates an optimizer training op."""
    global_step = tf.train.get_or_create_global_step()
    learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
    # Implements linear decay of the learning rate.
    learning_rate = tf.train.polynomial_decay(
        learning_rate,
        global_step,
        num_train_steps,
        end_learning_rate=0.0,
        power=1.0,
        cycle=False)
    # Implements linear warmup. I.e., if global_step < num_warmup_steps, the
    # learning rate will be `global_step/num_warmup_steps * init_lr`.
    if num_warmup_steps:
        global_steps_int = tf.cast(global_step, tf.int32)
        warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
        global_steps_float = tf.cast(global_steps_int, tf.float32)
        warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
        warmup_percent_done = global_steps_float / warmup_steps_float
        warmup_learning_rate = init_lr * warmup_percent_done
        is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
        learning_rate = (
            (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
    # It is recommended that you use this optimizer for fine tuning, since this
    # is how the model was trained (note that the Adam m/v variables are NOT
    # loaded from init_checkpoint.)
    optimizer = AdamWeightDecayOptimizer(
        learning_rate=learning_rate,
        weight_decay_rate=0.01,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-6)
    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)
    # You can clip gradients in this step if you need to (in general it is not necessary)
    # (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
    train_op = optimizer.apply_gradients(
        zip(grads, tvars), global_step=global_step)
    # Normally the global step update is done inside of `apply_gradients`.
    # However, `AdamWeightDecayOptimizer` doesn't do this. But if you use
    # a different optimizer, you should probably take this line out.
    new_global_step = global_step + 1
    train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op
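The learning-rate schedule built inside create_optimizer (linear warmup followed by linear decay to zero) can be written as a plain function of the step, which makes it easy to inspect. This is a sketch of the same arithmetic in pure Python; warmup_linear_decay_lr is a hypothetical name and the function is not part of the code above.

```python
def warmup_linear_decay_lr(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at `step`: linear warmup, then linear decay to 0."""
    if num_warmup_steps and step < num_warmup_steps:
        # Warmup phase: ramp from 0 up towards init_lr.
        return init_lr * step / num_warmup_steps
    # Decay phase: polynomial decay with power=1.0, end_learning_rate=0.0
    # and cycle=False, i.e. the step is clipped at num_train_steps.
    clipped = min(step, num_train_steps)
    return init_lr * (1.0 - clipped / num_train_steps)


schedule = [warmup_linear_decay_lr(s, 1.0, 100, 10) for s in (0, 5, 10, 100, 150)]
```

Note that the decay is measured from step 0 rather than from the end of warmup, so the rate at the first post-warmup step is already slightly below init_lr.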

Simple custom gradient with gradient_override_map

I want to use a function that creates weights for a normal dense layer, it basically behaves like an initialization function, only that it "initializes" before every new forward pass.
The flow for my augmented linear layer looks like this:
input = (x, W)
W_new = g(x,W)
output = tf.matmul(x,W_new)
However, g(x,W) is not differentiable, as it involves some sampling. Luckily it also doesn't have any parameters I want to learn so I just try to do the forward and backward pass, as if I would have never replaced W.
Now I need to tell the automatic differentiation to not backpropagate through g(). I do this with:
W_new = tf.stop_gradient(g(x,W))
Unfortunately this does not work, as it complains about non-matching shapes.
What does work is the following:
input = (x, W)
W_new = W + tf.stop_gradient(g(x,W) - W)
output = tf.matmul(x,W_new)
as suggested here: https://stackoverflow.com/a/36480182
Now the forward pass seems to be OK, but I don't know how to override the gradient for the backward pass. I know that I have to use gradient_override_map for this, but I could not transfer the examples I have seen to my particular use case (I am still quite new to TF).
However, I am not sure how to do this and if there isn't an easier way. I assume something similar has to be done in the first forward pass in a given model, where all weights are initialized while we don't have to backpropagate through the init functions as well.
Any help would be very much appreciated!
Hey @jhj, I faced the same problem too; fortunately I found this gist. Hope this helps :)
Sample working -
import tensorflow as tf
from tensorflow.python.framework import ops
import numpy as np
Define custom py_func which takes also a grad op as argument:
def py_func(func, inp, Tout, stateful=True, name=None, grad=None):
    # Need to generate a unique name to avoid duplicates:
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, 1E+8))
    tf.RegisterGradient(rnd_name)(grad)  # see _MySquareGrad for grad example
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": rnd_name, "PyFuncStateless": rnd_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)
Define a custom square function using np.square instead of tf.square:
def mysquare(x, name=None):
    with ops.name_scope(name, "Mysquare", [x]) as name:
        sqr_x = py_func(np.square,
                        [x],
                        [tf.float32],
                        name=name,
                        grad=_MySquareGrad)  # <-- here's the call to the gradient
        return sqr_x[0]
Actual gradient:
def _MySquareGrad(op, grad):
    x = op.inputs[0]
    return grad * 20 * x  # add a "small" error just to see the difference:
with tf.Session() as sess:
    x = tf.constant([1., 2.])
    y = mysquare(x)
    tf.global_variables_initializer().run()
    print(x.eval(), y.eval(), tf.gradients(y, x)[0].eval())