I want to use a function that creates the weights for a normal dense layer; it basically behaves like an initialization function, except that it "initializes" before every new forward pass.
The flow for my augmented linear layer looks like this:
input = (x, W)
W_new = g(x,W)
output = tf.matmul(x,W_new)
However, g(x,W) is not differentiable, as it involves some sampling. Luckily it also doesn't have any parameters I want to learn, so I just try to do the forward and backward pass as if I had never replaced W.
Now I need to tell the automatic differentiation to not backpropagate through g(). I do this with:
W_new = tf.stop_gradient(g(x,W))
Unfortunately this does not work, as it complains about non-matching shapes.
What does work is the following:
input = (x, W)
W_new = W + tf.stop_gradient(g(x,W) - W)
output = tf.matmul(x,W_new)
as suggested here: https://stackoverflow.com/a/36480182
Now the forward pass seems to be OK, but I don't know how to override the gradient for the backward pass. I know that I have to use gradient_override_map for this, but I could not transfer the applications I have seen to my particular use case (I am still quite new to TF).
However, I am not sure how to do this, or whether there is an easier way. I assume something similar has to be done in the first forward pass of a given model, where all weights are initialized, since we don't have to backpropagate through the init functions either.
Any help would be very much appreciated!
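Edit: To make the desired behavior concrete, here is a sketch of what I think the layer should do, written with tf.custom_gradient (assuming my TF version has it; g is the sampling function from above, and the shapes are my assumption):

import tensorflow as tf

@tf.custom_gradient
def sampled_dense(x, W):
    # x: [batch, in_dim], W: [in_dim, out_dim] (shapes assumed)
    W_new = g(x, W)  # non-differentiable sampling step
    output = tf.matmul(x, W_new)

    def grad(dy):
        # Backward pass as if the layer had computed tf.matmul(x, W):
        dx = tf.matmul(dy, W, transpose_b=True)
        dW = tf.matmul(x, dy, transpose_a=True)
        return dx, dW

    return output, grad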
Hey @jhj, I faced the same problem too; fortunately I found this gist. Hope this helps :)
Working sample:
import tensorflow as tf
from tensorflow.python.framework import ops
import numpy as np
Define a custom py_func which also takes a grad op as argument:
def py_func(func, inp, Tout, stateful=True, name=None, grad=None):
    # Need to generate a unique name to avoid duplicates:
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, 1E+8))
    tf.RegisterGradient(rnd_name)(grad)  # see _MySquareGrad for grad example
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": rnd_name, "PyFuncStateless": rnd_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)
Define a custom square function using np.square instead of tf.square:
def mysquare(x, name=None):
    with ops.name_scope(name, "Mysquare", [x]) as name:
        sqr_x = py_func(np.square,
                        [x],
                        [tf.float32],
                        name=name,
                        grad=_MySquareGrad)  # <-- here's the call to the gradient
        return sqr_x[0]
Actual gradient:
def _MySquareGrad(op, grad):
    x = op.inputs[0]
    return grad * 20 * x  # add a "small" error just to see the difference

with tf.Session() as sess:
    x = tf.constant([1., 2.])
    y = mysquare(x)
    tf.global_variables_initializer().run()
    print(x.eval(), y.eval(), tf.gradients(y, x)[0].eval())
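For reference, running this should print something like [1. 2.] [1. 4.] [20. 40.]: the gradient 20 * x (instead of the analytic 2 * x) confirms that the registered custom gradient is actually being used.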
Note:
After experimenting, I noticed that this problem only occurs when I am training on the GPU. I created a GitHub issue (#50454). At this point I am not sure what is happening exactly.
I am working on an implementation for gradient accumulation. However, none of my approaches seem to work. Below I describe two approaches which should work in theory but appear to conflict with TensorFlow.
The idea
I want to patch an arbitrary Optimizer-instance by replacing its apply_gradients() function by my own implementation which accumulates gradients.
# Build model first
model.build()
# Patch the optimizer
optimizer = get_patched_optimizer(optimizer, n, model.trainable_variables)
# Compile the model with the patched optimizer
model.compile(optimizer=optimizer)
where
def get_patched_optimizer(optimizer, n, trainable_variables):
    """Patch optimizer for gradient accumulation.

    :param optimizer:
        The optimizer to patch.
    :param n:
        The number of accumulation steps before applying gradients.
    :param trainable_variables:
        Trainable parameters of the model.
    :return:
        A patched optimizer for gradient accumulation.
    """
    accumulator = _GradientAccumulationPatch(
        n=n,
        orig_apply_gradients=optimizer.apply_gradients,
        trainable_variables=trainable_variables
    )
    # Replace the original function
    optimizer.apply_gradients = accumulator.apply_gradients
    return optimizer
The happy (but not working) path
The simplest way would be to just accumulate gradients and apply gradients conditionally e.g. whenever current_step % n == 0.
However, the problem here is that it looks like I am not able to use tf.cond() in this context, in contrast to how it is done in "Gradient Accumulation with Custom model.fit in TF.Keras?".
Using tf.cond() results in the following RuntimeError:
RuntimeError: merge_call called while defining a new graph or a tf.function. This can often happen if the function fn passed to strategy.run() contains a nested @tf.function, and the nested @tf.function contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients), or if the function fn uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested tf.functions or control flow statements that may potentially cross a synchronization boundary, for example, wrap the fn passed to strategy.run or the entire strategy.run inside a tf.function or move the control flow out of fn
Here is the implementation of _GradientAccumulationPatch using tf.cond():
class _GradientAccumulationPatch:
    def __init__(
            self,
            n: int,
            orig_apply_gradients,
            trainable_variables
    ):
        self.n = tf.constant(n, dtype=tf.int64)
        policy = tf.keras.mixed_precision.global_policy()
        self.variable_dtype = policy.variable_dtype
        # One accumulator variable per trainable variable
        self.accu_gradients = [
            tf.Variable(
                tf.zeros(g.shape, dtype=g.dtype),
            ) for g in trainable_variables
        ]
        self._current_step = tf.Variable(0, dtype=tf.int64)
        self._orig_apply_gradients = orig_apply_gradients

    def apply_gradients(self, grads_and_vars, *args, **kwargs):
        # Materialize first: grads_and_vars may be a zip() and we iterate it twice
        grads_and_vars = list(grads_and_vars)
        trainable_variables = [var for (_, var) in grads_and_vars]
        gradients = [grad for (grad, _) in grads_and_vars]
        # Always accumulate gradients
        for i, grad in enumerate(gradients):
            self.accu_gradients[i].assign_add(grad)
        tf.cond(
            self._can_apply_on_next_step(),
            true_fn=lambda: self.apply_accu_gradients(trainable_variables, *args, **kwargs),
            false_fn=lambda: None
        )

    def apply_accu_gradients(self, trainable_variables, *args, **kwargs):
        # Call the original apply_gradients() function
        self._orig_apply_gradients(zip(self.accu_gradients, trainable_variables), *args, **kwargs)
        # Reset all accumulated gradients to zero
        for i in range(len(self.accu_gradients)):
            self.accu_gradients[i].assign(tf.zeros_like(trainable_variables[i]))

    def _can_apply_on_next_step(self):
        """
        :return: True if gradients should be applied; False otherwise.
        """
        # Increment (always do this first)
        self._current_step.assign_add(1)
        count_mod_steps = tf.math.mod(self._current_step, self.n)
        return tf.equal(count_mod_steps, 0)
The more complicated path (also not working)
It is possible to remove the tf.cond() by using the signal from _can_apply_on_next_step() as a multiplication factor and applying zero gradients whenever we are in the accumulation phase.
The idea would be to always accumulate gradients and always apply them with one particular change:
final_gradients = [grad * apply for grad in gradients]
self._orig_apply_gradients(zip(final_gradients, trainable_variables))
This is how we'd change the apply_gradients() method:
def apply_gradients(self, grads_and_vars, *args, **kwargs):
    can_apply = self._can_apply_on_next_step()
    # 1.0 whenever we want to apply gradients; 0.0 otherwise
    apply = tf.cast(can_apply, dtype=self.variable_dtype)
    # Will be 0.0 if apply is 1.0 and vice versa
    keep = tf.cast(tf.logical_not(can_apply), dtype=self.variable_dtype)
    grads_and_vars = list(grads_and_vars)
    gradients = [grad for (grad, _) in grads_and_vars]
    trainable_variables = [var for (_, var) in grads_and_vars]
    # Accumulate gradients
    for i, grad in enumerate(gradients):
        self.accu_gradients[i].assign_add(grad)
    # Multiply each gradient with our apply-signal
    final_gradients = [grad * apply for grad in self.accu_gradients]
    self._orig_apply_gradients(zip(final_gradients, trainable_variables), *args, **kwargs)
    # This will reset our buffer whenever "keep" is 0.0
    for g in self.accu_gradients:
        g.assign(g * keep)
But the problem is that self.accu_gradients[i].assign_add(grad) does not seem to have any effect. And yes, I have also tried
self.accu_gradients[i].assign(grad + self.accu_gradients[i])
Interestingly, the model starts to converge if I use assign(grad) instead of assign_add(grad), as you can see:
[Plot of the training loss: blue = using assign() (no accumulation happening); red = using assign_add()]
The train_step()
This patch should work independently of the model. I do have a custom train_step() for my model, but the implementation is pretty straightforward.
Here I am just computing the gradients and then calling the apply_gradients() method of the optimizer:
def train_step(self, data):
    (inputs, (input_lengths, label_lengths), mask), y_true = data
    loss, gradients = self.rnnt_gradient(
        inputs=inputs,
        y_true=y_true,
        input_lengths=input_lengths,
        label_lengths=label_lengths,
        mask=mask
    )
    self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
    return {'loss': loss}

def test_step(self, data):
    (inputs, (input_lengths, label_lengths), mask), y_true = data
    val_loss = self.rnnt_loss_wrapper(
        inputs=inputs,
        y_true=y_true,
        input_lengths=input_lengths,
        label_lengths=label_lengths,
        mask=mask
    )
    return dict(loss=val_loss)
def rnnt_gradient(
        self,
        inputs: tuple,
        y_true: tf.Tensor,
        input_lengths: tf.Tensor,
        label_lengths: tf.Tensor,
        mask=None
):
    with tf.GradientTape() as tape:
        model_loss = self.rnnt_loss_wrapper(
            inputs,
            y_true=y_true,
            input_lengths=input_lengths,
            label_lengths=label_lengths,
            mask=mask
        )
        is_mixed_precision = isinstance(self.optimizer, mixed_precision.LossScaleOptimizer)
        # We always want to return the unmodified model_loss for Tensorboard
        if is_mixed_precision:
            loss = self.optimizer.get_scaled_loss(model_loss)
        else:
            loss = model_loss
    gradients = tape.gradient(loss, self.trainable_variables)
    if is_mixed_precision:
        gradients = self.optimizer.get_unscaled_gradients(gradients)
    return model_loss, gradients
It turns out that this was entirely my fault: whenever I trained with a mixed_float16 policy, I was patching the wrong instance.
What I had was something like:
if precision_policy.name.startswith('mixed'):
    logger.info(f'Using LossScaleOptimizer (policy: "{precision_policy.name}")')
    optimizer = keras.mixed_precision.LossScaleOptimizer(optimizer)

if grad_acc_n > 1:
    # --> This patched the LossScaleOptimizer which caused the problem:
    optimizer = grad_acc.get_patched_optimizer(optimizer=optimizer, n=grad_acc_n)
So I would require a check like:
if isinstance(optimizer, keras.mixed_precision.LossScaleOptimizer):
    # Warning: This does NOT work either (just an example)!
    optimizer.inner_optimizer.apply_gradients = accumulator.apply_gradients
    raise Exception('Don\'t do this!')
else:
    optimizer.apply_gradients = accumulator.apply_gradients
However, as stated in the comment, patching the inner_optimizer does not work either. I haven't figured out why, but at least I am now able to run a "normal" float32-policy training with my _GradientAccumulationPatch implementation.
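One untested idea (purely an assumption on my part, I have not verified it): reverse the order, i.e. patch the plain optimizer first and only wrap it in the LossScaleOptimizer afterwards, so the wrapper delegates to the already patched apply_gradients():

# Untested sketch: patch before wrapping (order reversed compared to above)
if grad_acc_n > 1:
    optimizer = grad_acc.get_patched_optimizer(optimizer=optimizer, n=grad_acc_n)

if precision_policy.name.startswith('mixed'):
    optimizer = keras.mixed_precision.LossScaleOptimizer(optimizer)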
I am using TensorFlow 2. I am trying to optimize a function which uses the loss of a trained TensorFlow model (poison).
@tf.function
def totalloss(x):
    xt = tf.multiply(x, (1.0 - m)) + tf.multiply(m, d)
    label = targetlabel*np.ones(xt.shape[0])
    loss1 = poison.evaluate(xt, label, steps=1)
    loss2 = tf.linalg.norm(m, 1)
    return loss1 + loss2
I am not able to execute this function; however, when I comment out the @tf.function line, the function works!
I need to use this function as a TensorFlow op in order to optimize m and d.
ValueError: Unknown graph. Aborting.
This is how I am defining the model and variables:
# mask
m = tf.Variable(tf.zeros(shape=(1, 784)), name="m")
d = tf.Variable(tf.zeros(shape=(1, 784)), name="d")
# target
targetlabel = 6
poison = fcn()
poison.load_weights("MNISTP.h5")
adam = tf.keras.optimizers.Adam(lr=.002, decay=1e-6)
poison.compile(optimizer=adam, loss=tf.losses.sparse_categorical_crossentropy)
This is how I am calling the function later (executing this line results in the error listed below; however, if I comment out the @tf.function line, this command works!):
loss = totalloss(ptestdata)
This is the entire traceback:
ValueError: in converted code:
<ipython-input-52-4841ad87022f>:5 totalloss *
loss1 = poison.evaluate(xt, label, steps=1)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:746 evaluate
use_multiprocessing=use_multiprocessing)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py:693 evaluate
callbacks=callbacks)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py:187 model_iteration
f = _make_execution_function(model, mode)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py:555 _make_execution_function
return model._make_execution_function(mode)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:2034 _make_execution_function
self._make_test_function()
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:2010 _make_test_function
**self._function_kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:3544 function
return EagerExecutionFunction(inputs, outputs, updates=updates, name=name)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:3429 __init__
raise ValueError('Unknown graph. Aborting.')
ValueError: Unknown graph. Aborting.
The purpose of the @tf.function decorator is to convert TensorFlow operations written in Python into a TensorFlow graph to achieve better performance. The error can occur when you try to use a pre-trained model with a serialized graph, because the decorator cannot perform a graph-to-graph conversion.
I've reported this error here: https://github.com/tensorflow/tensorflow/issues/33997
A (temporary) solution is to split your loss function into two smaller functions. The decorator should only be used on the function that does not include the pre-trained model. This way you still get better performance for the other operations, just not for the part using the pre-trained model.
For example:
@tf.function
def _other_ops(x):
    xt = tf.multiply(x, (1.0 - m)) + tf.multiply(m, d)
    label = targetlabel * np.ones(xt.shape[0])
    loss2 = tf.linalg.norm(m, 1)
    return xt, label, loss2

def total_loss(x):
    xt, label, loss2 = _other_ops(x)
    loss1 = poison.evaluate(xt, label, steps=1)
    return loss1 + loss2
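With this split, total_loss() itself runs eagerly, so poison.evaluate() behaves as usual, while _other_ops() still gets the graph-execution speedup.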
Update:
According to the discussion in the TF issue linked above, an elegant solution is to manually pass the input through each layer of the model. You can get the list of layers in your model by calling your_model.layers.
In your case, the last layer computes loss1 from the prediction and the label. Thus, I think you should skip the last layer and calculate the loss yourself after the loop:
@tf.function
def totalloss(x):
    xt = tf.multiply(x, (1.0 - m)) + tf.multiply(m, d)
    label = targetlabel*np.ones(xt.shape[0])
    feat = xt
    # Skip the last layer which calculates loss1
    for i in range(len(poison.layers) - 1):
        layer = poison.layers[i]
        feat = layer(feat)
    # Now, calculate the loss yourself (note: y_true comes first)
    loss1 = tf.keras.losses.sparse_categorical_crossentropy(label, feat)
    loss2 = tf.linalg.norm(m, 1)
    return loss1 + loss2
The way the TF engineers explain this issue is that a model may wrap high-level processing whose behavior is not guaranteed under @tf.function. So putting a model inside a function decorated with @tf.function is not recommended, and we need to break the model into smaller pieces to bypass this.
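As a further alternative (my assumption, not something from the TF issue): it is model.evaluate() that builds its own execution function; the forward pass itself is usually graph-compatible, so calling the model object directly inside the decorated function may also work:

@tf.function
def totalloss(x):
    xt = tf.multiply(x, (1.0 - m)) + tf.multiply(m, d)
    # Build the labels with TF ops so the batch size can stay symbolic
    label = targetlabel * tf.ones((tf.shape(xt)[0],))
    pred = poison(xt, training=False)  # direct call instead of poison.evaluate()
    loss1 = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(label, pred))
    loss2 = tf.linalg.norm(m, 1)
    return loss1 + loss2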
I found that it is easy to use lasagne to build a graph like this:
import lasagne.layers as L
class A:
    def __init__(self):
        self.x = L.InputLayer(shape=(None, 3), name='x')
        self.y = self.x + 1

    def get_y_sym(self, x_var, **kwargs):
        y = L.get_output(self.y, {self.x: x_var}, **kwargs)
        return y
Through the method get_y_sym, we get a tensor rather than a value, and I can then use this tensor as the input of another graph.
But if I use TensorFlow, how could I implement this?
I'm not familiar with lasagne, but you should know that ALL of TensorFlow uses graph-based computation (unless you use tf.Eager, but that's another story). So by default something like:
net = tf.nn.conv2d(...)
returns a reference to a Tensor object. In other words, net is NOT a value, it is a reference to the output of the convolution node created by tf.nn.conv2d(...).
These can then be chained:
net2 = tf.nn.conv2d(net, ...) and so on.
To get "values" one has to open a tf.Session:
with tf.Session() as sess:
    net2_eval = sess.run(net2)
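So a rough TF equivalent of the class in the question might look like this (a sketch; a placeholder plays the role of lasagne's InputLayer, and I keep the toy y = x + 1 computation):

import tensorflow as tf

class A:
    def __init__(self):
        self.x = tf.placeholder(tf.float32, shape=(None, 3), name='x')
        self.y = self.x + 1  # a Tensor, not a value

    def get_y_sym(self):
        # Return the symbolic tensor; it can be fed into another graph
        return self.y

a = A()
z = a.get_y_sym() * 2  # use the tensor as input to further computation
with tf.Session() as sess:
    print(sess.run(z, feed_dict={a.x: [[1., 2., 3.]]}))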
In TensorFlow, we can define our own op and its gradient by:
https://gist.github.com/harpone/3453185b41d8d985356cbe5e57d67342
However, can we modify any variable in the computational graph inside these Python functions, for example in the _MySquareGrad function?
I assume we can get the variable by:
var = tf.get_variable('var')
and then do something to change its value and then assign it back?
e.g.
tmp = var*10
var.assign(tmp)
Thanks!
Also when we do var*10, do we have to convert it to numpy?
Background: I'm familiar with automatic differentiation, but new to Tensorflow and Python. So please point out any syntactic problem and let me know if my intention is clear.
You can modify the variables of the computational graph in these Python functions. Your example code with tmp = var*10 will work and does not convert anything to numpy.
In fact, you should try to avoid converting to numpy as much as possible, since it slows down the computation.
Edit:
You can include your code in the gradient computation graph of the _MySquareGrad function like this:
def _MySquareGrad(op, grad):
    # First get a Variable that was created using tf.get_variable():
    with tf.variable_scope("", reuse=True):
        var = tf.get_variable('var')
    # Now create the assign graph:
    tmp = var * 10.
    assign_op = var.assign(tmp)
    # Now make the assign operation part of the grad calculation graph:
    with tf.control_dependencies([assign_op]):
        x = tf.identity(op.inputs[0])
    return grad * 20 * x
Here is a working example:
import tensorflow as tf
from tensorflow.python.framework import ops
import numpy as np

# Define custom py_func which takes also a grad op as argument:
def py_func(func, inp, Tout, stateful=True, name=None, grad=None):
    # Need to generate a unique name to avoid duplicates:
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, 1E+8))
    tf.RegisterGradient(rnd_name)(grad)  # see _MySquareGrad for grad example
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": rnd_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)

# Def custom square function using np.square instead of tf.square:
def mysquare(x, name=None):
    with ops.name_scope(name, "Mysquare", [x]) as name:
        sqr_x = py_func(np.square,
                        [x],
                        [tf.float32],
                        name=name,
                        grad=_MySquareGrad)  # <-- here's the call to the gradient
        return sqr_x[0]

### Actual gradient (original version, without the variable update):
## def _MySquareGrad(op, grad):
##     x = op.inputs[0]
##     return grad * 20 * x  # add a "small" error just to see the difference

def _MySquareGrad(op, grad):
    # First get a Variable that was created using tf.get_variable():
    with tf.variable_scope("", reuse=True):
        var = tf.get_variable('var')
    # Now create the assign graph:
    tmp = var * 10.
    assign_op = var.assign(tmp)
    # Now make the assign operation part of the grad calculation graph:
    with tf.control_dependencies([assign_op]):
        x = tf.identity(op.inputs[0])
    return grad * 20 * x

with tf.Session() as sess:
    x = tf.constant([1., 2.])
    var = tf.get_variable(name="var", shape=[], initializer=tf.constant_initializer(0.2))
    y = mysquare(x)
    tf.global_variables_initializer().run()
    print(x.eval(), y.eval(), tf.gradients(y, x)[0].eval())
    print("Now var is 10 times larger:", var.eval())
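Note that the assign op is tied to the gradient computation via the control dependency, so it runs on every evaluation of tf.gradients(y, x); each additional backward pass would multiply var by 10 again.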
I was looking at the mechanics section for TensorFlow, specifically the part on shared variables. In the section "The problem", they deal with a convolutional neural net and provide the following code (which runs an image through the model):
# First call creates one set of variables.
result1 = my_image_filter(image1)
# Another set is created in the second call.
result2 = my_image_filter(image2)
If the model was implemented in such a way, would it then be impossible to learn/update the parameters because there's a new set of parameters for each image in my training set?
Edit:
I've also tried the "the problem" approach on a simple linear regression example, and there do not appear to be any issues with this method of implementation. Training seems to work, as shown by the last line of the code. So I'm wondering whether there is a subtle discrepancy between the TensorFlow documentation and what I'm doing:
import tensorflow as tf
import numpy as np

trX = np.linspace(-1, 1, 101)
trY = 2 * trX + np.random.randn(*trX.shape) * 0.33  # create a y value which is approximately linear but with some random noise

X = tf.placeholder("float")  # create symbolic variables
Y = tf.placeholder("float")

def model(X):
    with tf.variable_scope("param"):
        w = tf.Variable(0.0, name="weights")  # create a shared variable (like theano.shared) for the weight matrix
        return tf.mul(X, w)  # lr is just X*w so this model line is pretty simple

y_model = model(X)

cost = tf.pow(Y - y_model, 2)  # use sqr error for cost function
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(cost)  # construct an optimizer to minimize cost and fit line to my data

sess = tf.Session()
init = tf.initialize_all_variables()  # you need to initialize variables (in this case just variable W)
sess.run(init)

with tf.variable_scope("train"):
    for i in range(100):
        for (x, y) in zip(trX, trY):
            sess.run(train_op, feed_dict={X: x, Y: y})

print(sess.run(y_model, feed_dict={X: np.array([1, 2, 3])}))
The variable set only has to be created once for the whole training (and testing) run. The goal of variable scopes is to allow for modularization of subsets of parameters, such as those belonging to layers (e.g. when the architecture of a layer is repeated, the same names can be used within each layer scope).
In your example you create parameters only in the model function. You can print out your variable names to see that each is assigned to the specified scope:
from __future__ import print_function

X = tf.placeholder("float")  # create symbolic variables
Y = tf.placeholder("float")
print("X:", X.name)
print("Y:", Y.name)

def model(X):
    with tf.variable_scope("param"):
        w = tf.Variable(0.0, name="weights")  # create a shared variable (like theano.shared) for the weight matrix
        print("w:", w.name)
        return tf.mul(X, w)
The call to sess.run(train_op, feed_dict={X: x, Y: y}) only evaluates the value of train_op given the provided values of X and Y. No new variables (including parameters) are created there, so the enclosing scope has no effect. You can make sure the variable names stay the same by printing them out again:
with tf.variable_scope("train"):
    print("X:", X.name)
    print("Y:", Y.name)
    for i in range(100):
        for (x, y) in zip(trX, trY):
            sess.run(train_op, feed_dict={X: x, Y: y})
You will see that the variable names stay the same, as the variables were already created beforehand and are only being used here.
If you'd like to retrieve a variable using its scope, you need to use get_variable within a tf.variable_scope enclosure:
with tf.variable_scope("param"):
    w = tf.get_variable("weights", [1])
    print("w:", w.name)
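For completeness, the sharing pattern the mechanics guide itself recommends for the convolutional example looks roughly like this (sketched from that guide, simplified to a single conv layer with assumed shapes):

def my_image_filter(input_images):
    with tf.variable_scope("conv1"):
        # get_variable creates the variable on the first call and
        # reuses it once the enclosing scope has reuse enabled
        weights = tf.get_variable("weights", [5, 5, 32, 32],
                                  initializer=tf.random_normal_initializer())
        conv = tf.nn.conv2d(input_images, weights,
                            strides=[1, 1, 1, 1], padding='SAME')
        return tf.nn.relu(conv)

with tf.variable_scope("image_filters") as scope:
    result1 = my_image_filter(image1)   # creates "image_filters/conv1/weights"
    scope.reuse_variables()             # second call shares the same variables
    result2 = my_image_filter(image2)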