There are a lot of examples of py_func usage on Stack Overflow, but I just want to define a gradient for my custom activation function, which uses only native TensorFlow operations. Below is an example with an identity forward pass.
Suppose I have registered a gradient for my activation "OPLU" (the comments illustrate my understanding so far of what is going on):
@tf.RegisterGradient("OPLUGrad")
def oplugrad(op, grad):
    x = op.inputs[0] # Need x !
    # This print should be executed if oplugrad was launched!
    # Because it was set inside the evaluation chain for output !
    x = tf.Print(x, [tf.shape(x)], message = 'debug: ')
    grad_new = x*grad # let it be, just for example
    return grad_new
And defined my layer:
def tf_oplu(x, name="OPLU"):
    y = ...f(x)...
    # Here new op is created, as far as I understand
    with ops.op_scope([x], name, "OPLUop") as name:
        g = tf.get_default_graph()
        # As far as I understand, here I issue command to tensorflow
        # to use "OPLUGrad" when "OPLU" activation was applied
        with g.gradient_override_map({"OPLU": "OPLUGrad"}):
            # OK, gradient assigned, now return what forward layer computes
            return y
But I don't see any output from tf.Print inside the gradient function, which means the function is never executed.
Question 1: How do I register the gradient properly, and set up these two functions, so that I can use built-in optimizers like AdamOptimizer?
Question 2: As far as I understand, the standard gradient computation is suppressed this way. What if I want the standard gradients to be computed and then modified, without touching the Session() code, i.e. without the manual invocation and modification of gradients in Session.run() that I've seen elsewhere?
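For what it's worth, here is a minimal sketch of one possible route for both questions, assuming TensorFlow 2.x with eager execution (the question itself targets the 1.x graph API). It uses tf.custom_gradient, so no registration or override map is involved. The name identity_with_scaled_grad is hypothetical and this is not the real OPLU: the forward pass is the identity, and the backward pass receives the standard upstream gradient and multiplies it by x, mirroring oplugrad above.

```python
import tensorflow as tf

@tf.custom_gradient
def identity_with_scaled_grad(x):
    """Identity in the forward pass; backward pass returns x * upstream."""
    y = tf.identity(x)

    def grad(upstream):
        # `upstream` is the ordinary gradient arriving from later layers.
        # Any modification goes here; x * upstream mirrors oplugrad.
        return x * upstream

    return y, grad

x = tf.constant([1.0, 2.0, 3.0])
with tf.GradientTape() as tape:
    tape.watch(x)  # x is a plain tensor, so it has to be watched explicitly
    loss = tf.reduce_sum(identity_with_scaled_grad(x))
dx = tape.gradient(loss, x)  # upstream is all ones here, so dx equals x
```

Because the modification lives inside the op itself, an optimizer that differentiates through it should pick up the modified gradient without any change to the training loop.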
EDIT: Here is the example of code for which I want to replace tf.nn.relu with my tf_OPLU
Thank you!
I'm trying to write a wrapper around a model so that the TF model can be called as a function of its weights (and input). However, this wrapper returns different gradients from those of the original model. Details are in the code below (including a Colab notebook to reproduce directly), but at the core I'm using the custom gradient decorator: the respective gradient is computed directly as the upstream gradient contracted (via tensordot) with the respective Jacobian.
To make this clear: I'm computing the gradient for a model, once directly, once by using my custom wrapper. In both cases the parameters in the model are the same. The Jacobian is implemented by TF, so nothing should be wrong there. Still the resulting gradient seems to be wrong.
I'm not sure whether this is a coding mistake I made somewhere or possibly just a numerical problem stemming from the Jacobian matmul; however, my tests on the correlation of the gradients suggest this is more than a numerical issue. The code of the function is provided below, and a Colab notebook reproducing the problem is also available.
Why: This matters for a range of meta-learning methods, for which I'm currently trying to build a small library.
My current 'wrapper' looks something like this:
# calls model on input x but replaces internal weights with the weights argument
# critically supposed to compute the respective gradient for the weights tensor argument!
def call_model_with_weights(model, x, weights, dim_output=2):
    @tf.custom_gradient
    def _call_with_weights(x_and_w):
        x, weights = x_and_w
        # be careful; this assigns weights to the model as a side effect, can ignore for dummy version
        ctrls = [var.assign(val) for var, val in zip(model.trainable_weights, weights)]
        with tf.control_dependencies(ctrls):
            with tf.GradientTape() as tape:
                y = model(x)
            jacobians = tape.jacobian(y, model.trainable_weights)
        def grad(upstream, variables):
            assert len(variables)==len(weights)
            # gradient for each weight should be upstream dotproduct respective jacobian
            dy_dw = [tf.tensordot(upstream, j, axes=[list(range(dim_output)), list(range(dim_output))]) for j in jacobians]
            dy_dw_weights = dy_dw
            return (None, dy_dw_weights), [None for _ in dy_dw] # returning x as derivative of x is wrong, but not important here rn
        return y, grad
    y = _call_with_weights((x, weights))
    return y
Thanks a lot for any help (including how this could be done in a more elegant way). Helping out means you are contributing to a package that plans to mimic PyTorch's 'higher' for TF, which I hope helps some more people <3
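As a sanity check on the identity the wrapper relies on (gradient = upstream contracted with the Jacobian), here is a small sketch assuming TensorFlow 2.x in eager mode. The toy model tanh(x @ w), the shapes, and the upstream tensor are all made up for illustration and do not come from the notebook. The contraction axes are derived from the rank of y rather than passed in by hand, since the contraction has to run over every axis of y.

```python
import tensorflow as tf

tf.random.set_seed(0)
w = tf.Variable(tf.random.normal((3, 2)))  # stand-in for one weight tensor
x = tf.random.normal((4, 3))               # batch of 4 inputs
upstream = tf.random.normal((4, 2))        # hypothetical upstream gradient dL/dy

with tf.GradientTape(persistent=True) as tape:
    y = tf.tanh(tf.matmul(x, w))           # toy "model", output shape (4, 2)
    loss = tf.reduce_sum(y * upstream)     # a loss whose dL/dy is `upstream`

direct = tape.gradient(loss, w)            # reference gradient, shape (3, 2)
jac = tape.jacobian(y, w)                  # shape (4, 2, 3, 2)

# Contract over all axes of y (here two: batch and output).
y_axes = list(range(len(y.shape)))
via_jacobian = tf.tensordot(upstream, jac, axes=[y_axes, y_axes])
```

If the two results agree on a toy case like this but not in the wrapper, the discrepancy is more likely to come from the weight assignment or the axes argument than from the Jacobian itself.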
I have been following an LRP implementation in PyTorch and wanted to test it using TensorFlow and Keras. I am using the same model and weights (VGG16) in Keras and was able to successfully execute the forward pass and the element-wise division using
# keras-tensorflow implementation
z = incr(clasifierLayers[l](A[l])) # forward pass step(1)
s = (R[l+1]/z) # Element wise division step(2)
But I am having trouble recreating the backward pass. In the original code (LRP), which uses PyTorch, the backward pass is computed using
# pyTorch implementation
(z*s).sum().backward(); c = A[l].grad
and when I tried to replicate the backward pass using TensorFlow, my gradient came back as None. Here is my code for the backward pass.
def getGradients(product,layer,l):
    with tf.GradientTape() as tape:
        tape.watch(product)
        a=layers[l](A[l])
    gradient = tape.gradient(product, a)
    return gradient
c = getGradients((z*s).numpy().sum(),layers[l],l) # backward pass step(3)
Can someone tell me what's wrong with this implementation?
Thanks in advance
I tried to replicate the issue with one layer and performing an LRP backward step, here is the code:
import tensorflow as tf
x = tf.ones((1,10))
layer=tf.keras.layers.Dense(10)
y=layer(x)
with tf.GradientTape() as tape:
    tape.watch(x)
    z = tf.keras.layers.Dense(10)(x)+1e-9
    s = y/z
    s = tf.reshape(s, z.shape)
c = tape.gradient(tf.reduce_sum(z*s), x)
y*c
This code works, in the sense that it returns the gradients in c.
I did not test it with a dataset, so I do not know whether it behaves as it should. Nonetheless, I think the problem with your code is that you should have the first block:
# keras-tensorflow implementation
z = incr(clasifierLayers[l](A[l])) # forward pass step(1)
s = (R[l+1]/z) # Element wise division step(2)
inside the GradientTape scope, and ask for the gradients with respect to A[l].
Edit:
I forgot to avoid gradients being propagated through s. The gradient computation should be done as follows:
c = tape.gradient(tf.reduce_sum(z*s.numpy()), x)
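A compact sketch of the four LRP steps in one place, assuming TensorFlow 2.x in eager mode. It swaps s.numpy() for tf.stop_gradient, which has the same effect of treating s as a constant but also works inside tf.function. The weights w, activations a, and relevance r_next are random stand-ins rather than values from VGG16, and a plain matmul replaces the Keras layer.

```python
import tensorflow as tf

tf.random.set_seed(0)
a = tf.ones((1, 10))                  # activations A[l] entering the layer
w = tf.random.normal((10, 10))        # stand-in for the layer's weights
r_next = tf.random.normal((1, 10))    # hypothetical relevance R[l+1]

with tf.GradientTape() as tape:
    tape.watch(a)                     # a is a plain tensor, so watch it
    z = tf.matmul(a, w) + 1e-9        # step 1: forward pass
    s = tf.stop_gradient(r_next / z)  # step 2: division, treated as a constant
    total = tf.reduce_sum(z * s)
c = tape.gradient(total, a)           # step 3: backward pass
r = a * c                             # step 4: relevance R[l]
```

Both the forward pass and the division sit inside the tape, and the gradient is taken with respect to the layer input, which is the point made above.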
I am a researcher in optimization and I am trying to write a custom optimizer. I have run into a problem; I have asked in many places and so far have had no response.
Take any optimizer code, say just copy SGD. In the beginning of get_updates, you see
grads = self.get_gradients(loss, params)
Now add the following line right after this one:
gradsb = self.get_gradients(loss, [tf.Variable(a) for a in params])
This should compute the gradients at a new tensor whose values are all the same as before.
Now try to see what you get:
for a in gradsb:
    print(a)
You get a list of Nones (but if you print the list grads, you see that its entries are still Tensors).
Why?
And how to circumvent this problem? This is important as I'd like to compute the gradients at another point for my algorithm.
When you write gradsb = self.get_gradients(loss, [tf.Variable(a) for a in params]) you are defining a new tf.Variable for each a in params. Because the loss does not depend on these new variables, your gradients are None.
If you want to compute a second gradient you need to make sure that you're computing it with respect to Tensors that the objective does depend on.
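The effect can be reproduced in a few lines. This sketch assumes TensorFlow 2.x with tf.GradientTape rather than the Keras get_gradients method from the question, and the quadratic loss is made up for illustration.

```python
import tensorflow as tf

w = tf.Variable([1.0, 2.0])
with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.square(w))   # the loss is built from w only

w_copy = tf.Variable(w.read_value())     # same values, but a brand-new variable

g_w = tape.gradient(loss, w)             # a tensor: the loss depends on w
g_copy = tape.gradient(loss, w_copy)     # None: the loss never touched w_copy
```

Copying the values does not copy the dependency: the loss was computed from w, so only w has a path to it in the graph.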
Apparently even replacing the current vector of parameters is not OK!! If I type this in the code:
grads = self.get_gradients(loss, params)
tempparam = [tf.Variable(a) for a in params]
params = [tf.add(a,a) for a in params]
gradsn = self.get_gradients(loss, params)
for a in gradsn:
    print(a)
params = [tf.Variable(a) for a in tempparam]
The result is still that None is printed!!
I know you understand what I am trying to do: at each iteration of get_updates, I would like to compute the gradients at a (slightly) different value of the parameter tensors, and use that to construct the update to the parameters for optimization and training. Is there any way to do this within the Keras package?
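One way to get a gradient at a different point is to rebuild the loss from that point, so that the dependency actually exists. The sketch below assumes TensorFlow 2.x in eager mode and sidesteps get_updates entirely, so it is not a drop-in answer for the Keras optimizer API. The names loss_fn and grad_at are hypothetical, and the quadratic objective stands in for a real model loss.

```python
import tensorflow as tf

def loss_fn(w):
    # Hypothetical objective; in practice this would re-run the model's loss.
    return tf.reduce_sum(tf.square(w))

def grad_at(point):
    """Gradient of loss_fn evaluated at an arbitrary tensor `point`."""
    with tf.GradientTape() as tape:
        tape.watch(point)        # point is a plain tensor, so watch it
        loss = loss_fn(point)    # the loss is recomputed from `point`
    return tape.gradient(loss, point)

w = tf.Variable([1.0, 2.0])
g_here = grad_at(w.read_value())           # gradient at w, i.e. 2 * w
g_shifted = grad_at(w.read_value() + 0.5)  # gradient at w + 0.5
```

The cost is one extra forward and backward pass per additional point, which is usually unavoidable for methods that need gradients at two locations.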
Let's say that I have some code such as:
out = tf.nn.softmax(x) # shape (batch,time,n)
labels = .... # reference labels of type (batch,time)->int
And then I define my loss as the Cross Entropy:
loss = -tf.log(tf.gather_nd(out, labels))
Will TensorFlow automatically replace the loss in the computation graph by this?
loss = sparse_softmax_cross_entropy_with_logits(x, labels)
What type of optimizations can I expect that TensorFlow will apply?
Follow-up question: If TensorFlow doesn't do this optimization, how can I do it manually? Consider that I have a modular framework where I get some out tensor which could possibly be the output of a softmax operation, and I want to calculate Cross Entropy, and I want to use sparse_softmax_cross_entropy_with_logits if possible. How could I accomplish this? Can I do something like the following?
if out.op == "softmax": # how to check this?
    x = out.op.sources[0] # how to get this?
    loss = sparse_softmax_cross_entropy_with_logits(x, labels)
else:
    loss = -tf.log(tf.gather_nd(out, labels))
TensorFlow generally doesn't merge nodes together in the way you're hoping. This is because other code (e.g. fetching outputs when running) may depend on intermediate nodes like the softmax, so removing them behind the user's back would be confusing.
If you do want to do this optimization yourself as part of a higher-level framework, you can analyze the current graphdef, but there's no annotation in TF to tell you what the outputs are, since that can vary at runtime depending on how session.run is called.
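For the narrow case in the follow-up question, the producing op of a graph tensor can be inspected directly. The sketch below assumes graph mode (eager tensors have no .op) and only recognises a tensor that is the direct output of a Softmax op over the last axis; anything in between, even an identity, defeats the check. The helper name cross_entropy is hypothetical, and labels has shape (batch,) here rather than (batch, time) to keep the example short.

```python
import tensorflow as tf

def cross_entropy(out, labels):
    """Use the fused op when `out` comes straight from a Softmax op."""
    if out.op.type == "Softmax":
        logits = out.op.inputs[0]    # the tensor that was fed to the softmax
        return tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits)
    # Fallback: pick the probability of the reference label and take -log.
    picked = tf.gather(out, labels, batch_dims=1)
    return -tf.math.log(picked)

graph = tf.Graph()
with graph.as_default():             # graph mode: tensors carry an .op
    x = tf.constant([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]])
    labels = tf.constant([0, 2])
    fused = cross_entropy(tf.nn.softmax(x), labels)
    # The identity hides the Softmax op, so this takes the fallback branch.
    manual = cross_entropy(tf.identity(tf.nn.softmax(x)), labels)

with tf.compat.v1.Session(graph=graph) as sess:
    fused_val, manual_val = sess.run([fused, manual])
```

Both branches give the same loss values up to floating-point error; the fused op is mainly preferable for numerical stability.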
I would like to implement in TensorFlow the technique of "Guided back-propagation", which is introduced in this paper and described in this recipe.
Computationally, that means that when I compute the gradient, e.g., of the output of the NN w.r.t. the input, I have to modify the gradients computed at every ReLU unit. Concretely, the back-propagated signal on those units must be thresholded at zero for the technique to work. In other words, negative partial derivatives arriving at the ReLUs must be ignored.
Given that I am interested in applying these gradient computations only on test examples, i.e., I don't want to update the model's parameters - how shall I do it?
I tried (unsuccessfully) two things so far:
1. Use tf.py_func to wrap my simple numpy version of a ReLU, which is then eligible to have its gradient operation redefined via the g.gradient_override_map context manager.
2. Gather the forward/backward values of backprop and apply the thresholding to those stemming from ReLUs.
I failed with both approaches because they require some knowledge of the internals of TF that currently I don't have.
Can anyone suggest any other route, or sketch the code?
Thanks a lot.
The better solution (your approach 1) uses ops.RegisterGradient and tf.Graph.gradient_override_map. Together they override the gradient computation for a pre-defined Op, e.g. Relu, within the gradient_override_map context, using only Python code.
@ops.RegisterGradient("GuidedRelu")
def _GuidedReluGrad(op, grad):
    return tf.where(0. < grad, gen_nn_ops._relu_grad(grad, op.outputs[0]), tf.zeros(grad.get_shape()))
...
with g.gradient_override_map({'Relu': 'GuidedRelu'}):
    y = tf.nn.relu(x)
Here is the full example implementation of guided relu: https://gist.github.com/falcondai/561d5eec7fed9ebf48751d124a77b087
Update: in TensorFlow >= 1.0, tf.select was renamed to tf.where. I updated the snippet accordingly. (Thanks @sbond for bringing this to my attention :)
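For readers on TensorFlow 2.x, where gradient_override_map is not the usual route, roughly the same rule can be sketched with tf.custom_gradient. This is an assumption-laden port rather than the code from the gist: guided_relu is a hypothetical name, and the weights vector exists only to produce a mixed-sign upstream gradient.

```python
import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    y = tf.nn.relu(x)

    def grad(dy):
        # Pass the signal only where the unit was active AND the
        # incoming gradient is positive; everything else becomes zero.
        gate = tf.logical_and(y > 0, dy > 0)
        return tf.where(gate, dy, tf.zeros_like(dy))

    return y, grad

x = tf.constant([-1.0, 2.0, 3.0])
weights = tf.constant([1.0, -1.0, 1.0])  # makes the upstream gradient [1, -1, 1]
with tf.GradientTape() as tape:
    tape.watch(x)
    out = tf.reduce_sum(guided_relu(x) * weights)
dx = tape.gradient(out, x)  # only the last unit is active with positive gradient
```

The first unit is blocked because it was inactive in the forward pass, the second because its incoming gradient is negative.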
tf.gradients has a grad_ys parameter that can be used for this purpose. Suppose your network has just one relu layer, as follows:
before_relu = f1(inputs, params)
after_relu = tf.nn.relu(before_relu)
loss = f2(after_relu, params, targets)
First, compute the derivative up to after_relu.
Dafter_relu = tf.gradients(loss, after_relu)[0]
Then threshold your gradients that you send down.
Dafter_relu_thresholded = tf.where(Dafter_relu < 0.0, tf.zeros_like(Dafter_relu), Dafter_relu)
Compute the actual gradients w.r.t. params.
Dparams = tf.gradients(after_relu, params, grad_ys=Dafter_relu_thresholded)
You can easily extend this same method for a network with many relu layers.
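The same three steps can be sketched in TensorFlow 2.x, where the output_gradients argument of GradientTape.gradient plays the role of grad_ys. The tiny network, the input, and the loss weights below are made up so that the numbers are easy to follow.

```python
import tensorflow as tf

w = tf.Variable([[1.0, -2.0], [3.0, 4.0]])   # stands in for params
x = tf.constant([[1.0, 1.0]])                # stands in for inputs

with tf.GradientTape(persistent=True) as tape:
    before_relu = tf.matmul(x, w)            # [[4., 2.]]
    after_relu = tf.nn.relu(before_relu)
    loss = tf.reduce_sum(after_relu * tf.constant([[1.0, -1.0]]))

# Step 1: derivative of the loss up to after_relu.
d_after = tape.gradient(loss, after_relu)    # [[1., -1.]]
# Step 2: zero out the negative part of the signal sent down.
d_after_thresholded = tf.where(d_after < 0.0, tf.zeros_like(d_after), d_after)
# Step 3: continue backprop from after_relu with the thresholded signal.
d_w = tape.gradient(after_relu, w, output_gradients=d_after_thresholded)
```

The tape is persistent because gradient is called twice on it; with several relu layers the second and third steps would be repeated once per layer, working backwards.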