I have been following an LRP implementation that uses PyTorch and wanted to test it out using TensorFlow and Keras. I am using the same model with weights (VGG16) in Keras and was able to successfully execute the forward pass and the element-wise division using
# keras-tensorflow implementation
z = incr(clasifierLayers[l](A[l])) # forward pass step(1)
s = (R[l+1]/z) # Element wise division step(2)
But I am facing trouble recreating the backward pass. In the original LRP code, which uses PyTorch, the backward pass is computed using
# pyTorch implementation
(z*s).sum().backward(); c = A[l].grad
and when I tried to replicate the backward pass using TensorFlow, my gradient returns None. Here is my code trying to compute the backward pass:
def getGradients(product, layer, l):
    with tf.GradientTape() as tape:
        tape.watch(product)
        a = layers[l](A[l])
    gradient = tape.gradient(product, a)
    return gradient

c = getGradients((z*s).numpy().sum(), layers[l], l) # backward pass step(3)
Can someone tell me what's wrong with this implementation?
Thanks in advance.
I tried to replicate the issue with one layer and performing an LRP backward step, here is the code:
import tensorflow as tf

x = tf.ones((1, 10))
layer = tf.keras.layers.Dense(10)
y = layer(x)

with tf.GradientTape() as tape:
    tape.watch(x)
    z = tf.keras.layers.Dense(10)(x) + 1e-9
    s = y / z
    s = tf.reshape(s, z.shape)
    c = tape.gradient(tf.reduce_sum(z * s), x)

y * c
This code works, in the sense that it returns the gradients to c.
I did not test it with a dataset, so I do not know if it works as it should. Nonetheless, I think the problem with your code is that you should have the first block:
# keras-tensorflow implementation
z = incr(clasifierLayers[l](A[l])) # forward pass step(1)
s = (R[l+1]/z) # Element wise division step(2)
inside the GradientTape scope and ask for the gradients with respect to A[l].
Edit:
I forgot to prevent gradients from being propagated through s. The gradient computation should be done as follows:
c = tape.gradient(tf.reduce_sum(z*s.numpy()), x)
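Putting the pieces together, here is a minimal sketch of one LRP backward step written against the names from the question (incr, clasifierLayers, A, R and the layer index l are assumed to be defined as in the original code); tf.stop_gradient plays the same role as the .numpy() trick above:

with tf.GradientTape() as tape:
    tape.watch(A[l])
    z = incr(clasifierLayers[l](A[l]))   # forward pass, step (1)
    s = tf.stop_gradient(R[l + 1] / z)   # element-wise division, step (2); s is treated as a constant
    total = tf.reduce_sum(z * s)

c = tape.gradient(total, A[l])           # backward pass, step (3)
R[l] = A[l] * c                          # relevance of the lower layer, as in the PyTorch reference (step 4)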
I'm trying to write a wrapper around a model, such that the TF model can be called as a function of its weights (and input). However, this wrapper returns different gradients than the gradients from the original model. Details are in the code below (including a colab notebook to reproduce the issue directly), but at the core I'm using the custom gradient decorator: the respective gradient is computed directly as the upstream 'gradient' matmul (via tensordot) with the respective Jacobian.
To make this clear: I'm computing the gradient for a model, once directly and once using my custom wrapper. In both cases the parameters in the model are the same. The Jacobian is computed by TF, so nothing should be wrong there. Still, the resulting gradient seems to be wrong.
I'm not sure whether this is a coding mistake I made somewhere, or possibly just a numeric problem stemming from the Jacobian matmul; however, my tests regarding correlation of the gradients suggest this is more than a numeric issue for now. Code of the function is provided below; a link to a colab notebook reproducing the problem can be found here: Colab Notebook reproducing the problem
Why: This is important for a bunch of meta-learning approaches, for which I'm currently trying to build a small library.
My current 'wrapper' looks something like this:
#calls model on input x but replaces internal weights with the weights argument
#critically supposed to compute the respective gradient for the weights tensor argument!
def call_model_with_weights(model, x, weights, dim_output=2):

    @tf.custom_gradient
    def _call_with_weights(x_and_w):
        x, weights = x_and_w
        #be careful; this assigns weights to the model as a side effect, can ignore for dummy version
        ctrls = [var.assign(val) for var, val in zip(model.trainable_weights, weights)]
        with tf.control_dependencies(ctrls):
            with tf.GradientTape() as tape:
                y = model(x)
        jacobians = tape.jacobian(y, model.trainable_weights)

        def grad(upstream, variables):
            assert len(variables) == len(weights)
            #gradient for each weight should be upstream dotproduct respective jacobian
            dy_dw = [tf.tensordot(upstream, j, axes=[list(range(dim_output)), list(range(dim_output))]) for j in jacobians]
            dy_dw_weights = dy_dw
            return (None, dy_dw_weights), [None for _ in dy_dw] # returning x as derivative of x is wrong, but not important here rn

        return y, grad

    y = _call_with_weights((x, weights))
    return y
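For reference, a minimal, self-contained sketch of the plain tf.custom_gradient contract that the wrapper builds on (this toy function is not part of the original code): the inner grad function receives the upstream gradient and returns the gradient for each input of the decorated function.

import tensorflow as tf

@tf.custom_gradient
def square(x):
    y = x ** 2
    def grad(upstream):
        # d(x^2)/dx = 2x; upstream is dLoss/dy
        return upstream * 2.0 * x
    return y, grad

x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = square(x)
print(tape.gradient(y, x))  # tf.Tensor(6.0, ...)

When tf.Variables are used inside the decorated function (as in the wrapper above), grad must additionally accept a variables keyword argument and return a second list with the gradients for those variables, which is the signature the wrapper follows.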
Thanks a lot for any help (including suggestions on how this could be done in a more elegant way); helping out means you are contributing to a package that aims to mimic PyTorch's 'higher' for TF, which I hope helps some more people <3
I am trying to find out how exactly the BatchNormalization layer behaves in TensorFlow. I came up with the following piece of code which, to the best of my knowledge, should be a perfectly valid Keras model; however, the mean and variance of BatchNormalization do not appear to be updated.
From docs https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization
in the case of the BatchNormalization layer, setting trainable = False on the layer means that the layer will be subsequently run in inference mode (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch).
I expect the model to return a different value with each subsequent predict call.
What I see, however, are the exact same values returned 10 times.
Can anyone explain to me why the BatchNormalization layer does not update its internal values?
import tensorflow as tf
import numpy as np

if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(3, 5) * 5 + 0.3

    bn = tf.keras.layers.BatchNormalization(trainable=False, epsilon=1e-9)
    z = input = tf.keras.layers.Input([5])
    z = bn(z)
    model = tf.keras.Model(inputs=input, outputs=z)

    for i in range(10):
        print(x)
        print(model.predict(x))
        print()
I use TensorFlow 2.1.0
Okay, I found the mistake in my assumptions. The moving average is updated during training, not during inference as I thought. This makes perfect sense, as updating the moving averages during inference would likely result in an unstable production model (for example, a long sequence of highly pathological input samples [e.g. such that their generating distribution differs drastically from the one on which the network was trained] could potentially bias the network and result in worse performance on valid input samples).
The trainable parameter is useful when you're fine-tuning a pretrained model and want to freeze some of the layers of the network even during training; when you just call model.predict(x) (or even model(x) or model(x, training=False)), the layer automatically uses the moving averages instead of the batch statistics anyway.
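As a side note, here is a minimal sketch of that fine-tuning scenario (the tiny "pretrained" block below is hypothetical, purely to keep the example self-contained): freezing the block that contains BatchNormalization keeps both its weights and its moving statistics fixed while the new head is trained.

import tensorflow as tf
import numpy as np

# hypothetical "pretrained" feature block containing a BatchNormalization layer
pretrained = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=[5]),
    tf.keras.layers.BatchNormalization(),
])
pretrained.trainable = False  # freeze: BN now runs in inference mode even during fit()

model = tf.keras.Sequential([pretrained, tf.keras.layers.Dense(1)])
model.compile(optimizer='sgd', loss='mse')

x = np.random.randn(32, 5).astype('float32')
y = np.random.randn(32, 1).astype('float32')

moving_mean_before = pretrained.layers[1].moving_mean.numpy().copy()
model.fit(x, y, epochs=2, verbose=0)
moving_mean_after = pretrained.layers[1].moving_mean.numpy()
print(np.allclose(moving_mean_before, moving_mean_after))  # True: the statistics stay frozen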
The code below demonstrates this moving-average behaviour clearly:
import tensorflow as tf
import numpy as np

if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(10, 5) * 5 + 0.3

    z = input = tf.keras.layers.Input([5])
    z = tf.keras.layers.BatchNormalization(trainable=True, epsilon=1e-9, momentum=0.99)(z)
    model = tf.keras.Model(inputs=input, outputs=z)

    # a dummy loss function
    model.compile(loss=lambda x, y: (x - y) ** 2)
    # a dummy fit just to update the batchnorm moving averages
    model.fit(x, x, batch_size=3, epochs=10)

    # first predict uses the moving averages from training
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # outputs the same thing as the previous predict
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # here calling the model with training=True results in an update of the moving averages
    # furthermore, it uses the batch mean and variance as in training,
    # so the result is very different
    pred = model(x, training=True).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # here we see again that the moving averages are used, but they differ slightly after
    # the previous call, as expected
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
In the end, I found that the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) mentions this:
When performing inference using a model containing batch normalization, it is generally (though not always) desirable to use accumulated statistics rather than mini-batch statistics. This is accomplished by passing training=False when calling the model, or using model.predict.
Hopefully this will help someone with a similar misunderstanding in the future.
I have an example of a DNN learning XOR (right click to open in a new tab): https://colab.research.google.com/drive/1M5xFp4gaXPCbnejM8-5_yLp1B6UvwdL8
I'm confused by these 2 lines (related to backpropagation):
Grads = T.gradient(Loss,[W1,B1,W2,B2]);
Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]));
I'm guessing the backward loop is in T.gradient because those are gradient values related to the loss, but I'm still not clear about it. The questions are:
Question 1. Is there backpropagation (the backward loop) in those 2 lines?
Question 2. If there is backpropagation, does it happen in T.gradient or in Optim.apply_gradients?
Question 3. Because backpropagation is done backward, is the order of [W1,B1,W2,B2] important? I believe, e.g., that the shuffled order [B1,W2,B2,W1] can't give the same result, because backpropagation needs the layer order from output back to input.
From my experiments, when shuffling the order of the weights and biases in the variable array, the optimisation process still works. But backpropagation needs the layer order from output back to input, so I don't get this.
Source code:
#!pip install tensorflow==2.0.0rc2
%tensorflow_version 2.x
%reset -f

#libs
import tensorflow as tf;

#data
X = [[0,0],[0,1],[1,0],[1,1]];
Y = [[0], [1], [1], [0] ];
X = tf.convert_to_tensor(X,tf.float32);
Y = tf.convert_to_tensor(Y,tf.float32);

#model
W1 = tf.Variable(tf.random.uniform([2,20],-1,1));
B1 = tf.Variable(tf.random.uniform([  20],-1,1));
W2 = tf.Variable(tf.random.uniform([20,1],-1,1));
B2 = tf.Variable(tf.random.uniform([   1],-1,1));

@tf.function
def feedforward(X):
  H1  = tf.nn.leaky_relu(tf.matmul(X,W1) + B1);
  Out = tf.sigmoid(tf.matmul(H1,W2) + B2);
  return Out;
#end def

#train
Optim = tf.keras.optimizers.SGD(1e-1);
Steps = 1000;

for I in range(Steps):
  if I%(Steps/10)==0:
    Out  = feedforward(X);
    Loss = tf.reduce_sum(tf.square(Y-Out));
    print("Loss:",Loss.numpy());
  #end if

  with tf.GradientTape() as T:
    Out  = feedforward(X);
    Loss = tf.reduce_sum(tf.square(Y-Out));
  #end with

  #BACKPROPAGATION HERE?
  Grads = T.gradient(Loss,[W1,B1,W2,B2]);
  Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]));
#end for

Out  = feedforward(X);
Loss = tf.reduce_sum(tf.square(Y-Out));
print("Loss:",Loss.numpy(),"(Last)");

print("\nDone.");
#eof
Let's take this one step at a time.
Step 1: Calculation of Gradients:
Grads = T.gradient(Loss,[W1,B1,W2,B2])
Here, we calculate the gradients of the loss with respect to the variables in the provided list. The list of gradients is indexed based on the indices of the variables. This means that Grads[0] will be the gradients with respect to W1, and so on.
Step 2: Next, we perform the update. This is done in:
Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]))
Here, Grads[0] is used to update W1, Grads[1] to update B1, and so on.
Note that gradient calculation and the update steps are performed separately. So as long as the variables appear in the same order in both lists, there shouldn't be any problems.
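A quick, hedged check of this point, reusing the names from the question (a persistent tape is needed only because gradient() is called twice): shuffling the variable list does not change the individual gradients, as long as the same order is used when zipping with apply_gradients.

with tf.GradientTape(persistent=True) as T:
    Out = feedforward(X)
    Loss = tf.reduce_sum(tf.square(Y - Out))

g_original = T.gradient(Loss, [W1, B1, W2, B2])
g_shuffled = T.gradient(Loss, [B1, W2, B2, W1])  # shuffled order
print(tf.reduce_max(tf.abs(g_original[0] - g_shuffled[3])).numpy())  # ~0.0: same gradient for W1
del T  # release the persistent tape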
Also, GradientTape has to be used with Eager Execution.
With TensorFlow 2 in its default eager mode, and even without the @tf.function decorator to build a graph, TensorFlow still tracks the relations between tensors during the computation: https://stats.stackexchange.com/a/272000/142160
TensorFlow tracks every variable here:
with tf.GradientTape() as T:
  Out  = feedforward(X);
  Loss = tf.reduce_sum(tf.square(Y-Out));
It is reverse-mode automatic differentiation rather than symbolic differentiation done by hand, and thus all gradients obtained by the following call are already at their proper depths in backpropagation (just like the backward loop that computes the errors at all layers):
Grads = T.gradient(Loss,[W1,B1,W2,B2]);
After that, the optimiser applies the gradients to update the weights and biases:
Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]));
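For intuition (a sketch added here, not part of the original answer): with plain SGD and the learning rate from the question, the apply_gradients call above is equivalent to, and could be replaced by, the manual update below.

learning_rate = 1e-1  # the SGD learning rate from the question
for grad, var in zip(Grads, [W1, B1, W2, B2]):
    var.assign_sub(learning_rate * grad)  # var -= lr * grad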
There are a lot of examples of py_func usage on Stack Overflow, but I just want to define a gradient for my custom activation function, something like this, which uses only TensorFlow native operations. The example below is for an identity forward pass.
Suppose I have registered the gradient for my activation "OPLU" (the comments illustrate my understanding so far of what's going on):
@tf.RegisterGradient("OPLUGrad")
def oplugrad(op, grad):
    x = op.inputs[0] # Need x !
    # This print should be executed if oplugrad was launched!
    # Because it was set inside the evaluation chain for output !
    x = tf.Print(x, [tf.shape(x)], message = 'debug: ')
    grad_new = x*grad # let it be, just for example
    return grad_new
And defined my layer:
def tf_oplu(x, name="OPLU"):
    y = ...f(x)...
    # Here new op is created, as far as I understand
    with ops.op_scope([x], name, "OPLUop") as name:
        g = tf.get_default_graph()
        # As far as I understand, here I issue command to tensorflow
        # to use "OPLUGrad" when "OPLU" activation was applied
        with g.gradient_override_map({"OPLU": "OPLUGrad"}):
            # OK, gradient assigned, now return what forward layer computes
            return y
But I don't see any output from tf.Print inside the gradient function, which means it is not executed.
Question 1: How do I register it properly and wire these two functions together so that I can use built-in optimizers like AdamOptimizer?
Question 2: As far as I understand, the standard gradient computation is suppressed this way. What if I want the standard gradients to be computed first and then modified, without interfering with the Session() code by manually invoking and modifying gradients in the Session() run, as I've seen done elsewhere?
EDIT: Here is the example of code for which I want to replace tf.nn.relu with my tf_OPLU
Thank you!
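As background (added here, not part of the original post): the commonly cited gradient_override_map pattern in TF1 graph mode keys the override map by the type of the op that is actually created inside the scope, for example "Identity" for an identity forward pass, and the op creation itself has to happen inside that scope. A minimal sketch under those assumptions:

import tensorflow as tf  # TF 1.x (or tf.compat.v1) assumed

@tf.RegisterGradient("OPLUGrad")
def oplugrad(op, grad):
    x = op.inputs[0]
    return x * grad  # illustrative custom gradient, as in the question

def tf_oplu(x, name="OPLU"):
    g = tf.get_default_graph()
    # The key is the type of the op created inside this scope ("Identity"),
    # not the name of the Python function that wraps it.
    with g.gradient_override_map({"Identity": "OPLUGrad"}):
        return tf.identity(x, name=name)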
The feature I'm after is to be able to tell what the gradient of a given variable is with respect to my error function given some data.
One way to do this would be to see how much the variable has changed after a call to train, but obviously that can vary massively based on the learning algorithm (for example it would be almost impossible to tell with something like RProp) and just isn't very clean.
Thanks in advance.
The tf.gradients() function allows you to compute the symbolic gradient of one tensor with respect to one or more other tensors—including variables. Consider the following simple example:
data = tf.placeholder(tf.float32)
var = tf.Variable(...) # Must be a tf.float32 or tf.float64 variable.
loss = some_function_of(var, data) # some_function_of() returns a `Tensor`.
var_grad = tf.gradients(loss, [var])[0]
You can then use this symbolic gradient to evaluate the gradient in some specific point (data):
sess = tf.Session()
var_grad_val = sess.run(var_grad, feed_dict={data: ...})
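A concrete, self-contained variant of the pattern above, with a hypothetical loss filled in purely for illustration (TF 1.x sessions, or tf.compat.v1, assumed):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

data = tf.placeholder(tf.float32)
var = tf.Variable(2.0)
loss = tf.reduce_sum((var * data - 1.0) ** 2)  # hypothetical loss, for illustration only
var_grad = tf.gradients(loss, [var])[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # gradient of the loss w.r.t. var, evaluated at this specific data
    print(sess.run(var_grad, feed_dict={data: [1.0, 2.0, 3.0]}))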
In TensorFlow 2.0 you can use GradientTape to achieve this. A GradientTape records the operations executed inside its context so that gradients can be computed afterwards. Below is an example of how you might do that.
import tensorflow as tf

# Here go the neural network weights as tf.Variable
x = tf.Variable(3.0)

# TensorFlow operations executed within the context of
# a GradientTape are recorded for differentiation
with tf.GradientTape() as tape:
    # Doing the computation in the context of the gradient tape
    # For example computing loss
    y = x ** 2

# Getting the gradient of network weights w.r.t. loss
dy_dx = tape.gradient(y, x)
print(dy_dx)  # Returns a tf.Tensor holding 6.0