How to use stop_gradient in TensorFlow

I'm wondering how to use stop_gradient in TensorFlow, and the documentation is not clear to me.
I'm currently using stop_gradient to produce the gradient of the loss function w.r.t. the word embeddings in a CBOW word2vec model. I want to just get the value, and not do backpropagation (as I'm generating adversarial examples).
Currently, I'm using the code:
lossGrad = gradients.gradients(loss, embed)[0]
real_grad = lossGrad.eval(feed_dict)
But when I run this, it does the backpropagation anyway! What am I doing wrong, and just as importantly, how can I fix this?
CLARIFICATION: By "backpropagation" I mean "calculating values and updating model parameters".
UPDATE
If I run the two lines above after the first training step, then I get a different loss after 100 training steps than when I don't run those two lines. I might be fundamentally misunderstanding something about TensorFlow.
I've tried using set_random_seed both at the beginning of the graph declaration and before each training step. The total loss is consistent between multiple runs, but not between including/excluding those two lines. So if it's not the RNG causing the disparity, and it's not unanticipated updating of the model parameters between training steps, do you have any idea what would cause this behavior?
SOLUTION
Welp, it's a bit late but here's how I solved it. I only wanted to optimize over some, but not all, variables. I thought that the way to prevent optimizing some variables would be to use stop_gradient - but I never found a way to make that work. Maybe there is a way, but what worked for me was to restrict my optimizer to a list of variables. So instead of:
opt = tf.train.GradientDescentOptimizer(learning_rate=eta)
train_op = opt.minimize(loss)
I used:
opt = tf.train.GradientDescentOptimizer(learning_rate=eta)
train_op = opt.minimize(loss, var_list=[variables to optimize over])
This prevented opt from updating the variables not in var_list. Hopefully it works for you, too!
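In case it helps, here is a minimal sketch of how one might build that var_list programmatically (the embeddings variable name below is made up for illustration):
all_vars = tf.trainable_variables()
train_vars = [v for v in all_vars if v is not embeddings]  # keep everything except the (hypothetical) embeddings variable
opt = tf.train.GradientDescentOptimizer(learning_rate=eta)
train_op = opt.minimize(loss, var_list=train_vars)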

tf.stop_gradient provides a way to not compute gradient with respect to some variables during back-propagation.
For example, in the code below, we have three variables, w1, w2, w3 and input x. The loss is square(x.dot(w1) - x.dot(w2 * w3)). We want to minimize this loss w.r.t. w1 but want to keep w2 and w3 fixed. To achieve this we can just wrap the second term in tf.stop_gradient(tf.matmul(x, w2*w3)).
In the figure below, I plotted how w1, w2, and w3 evolve from their initial values as a function of training iterations. It can be seen that w2 and w3 remain fixed while w1 changes until it becomes equal to w2 * w3.
(Image omitted: a plot showing that only w1 learns, while w2 and w3 stay fixed.)
import tensorflow as tf
import numpy as np
w1 = tf.get_variable("w1", shape=[5, 1], initializer=tf.truncated_normal_initializer())
w2 = tf.get_variable("w2", shape=[5, 1], initializer=tf.truncated_normal_initializer())
w3 = tf.get_variable("w3", shape=[5, 1], initializer=tf.truncated_normal_initializer())
x = tf.placeholder(tf.float32, shape=[None, 5], name="x")
a1 = tf.matmul(x, w1)
a2 = tf.matmul(x, w2*w3)
a2 = tf.stop_gradient(a2)
loss = tf.reduce_mean(tf.square(a1 - a2))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
gradients = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(gradients)
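The snippet above only builds the graph; to reproduce the plot one still needs to initialize the variables and run the training op in a session. A minimal sketch (the random input data is made up for illustration):
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        sess.run(train_op, feed_dict={x: np.random.rand(10, 5)})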

tf.gradients(loss, embed) computes the partial derivative of the tensor loss with respect to the tensor embed. TensorFlow computes this partial derivative by backpropagation, so it is expected behavior that evaluating the result of tf.gradients(...) performs backpropagation. However, evaluating that tensor does not perform any variable updates, because the expression does not include any assignment operations.
tf.stop_gradient() is an operation that acts as the identity function in the forward direction but stops the accumulated gradient from flowing through that operator in the backward direction. It does not prevent backpropagation altogether, but instead prevents an individual tensor from contributing to the gradients that are computed for an expression. The documentation for the operation has more details about the operation, and when to use it.
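To make the "identity forward, blocked backward" behavior concrete, here is a minimal sketch (written in TF 2.x eager style for brevity):
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    # The first term is excluded from the gradient computation;
    # only the bare x term contributes to dy/dx.
    y = tf.stop_gradient(2.0 * x) + x
print(tape.gradient(y, x).numpy())  # 1.0, not 3.0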

Related

Minimize the output of Tensorflow regression model using one of the inputs

I am trying to train a TensorFlow model, following this guide, with the purpose of solving an optimization problem using deep neural networks (TensorFlow). The model I have so far takes 9 inputs and produces 1 output.
What I'm trying to do now is to use it in an application in which the goal is to minimize the output value by adjusting one input value, given the other input values being fixed.
For example, let's denote the input values x1, x2, ..., x9 and the output y. Given the values for x2, x3, ..., x9, what is the value of x1 that minimizes the output y? (The image illustrating this is omitted here.)
I have trained a network using Keras and saved it as variables.data-00000-of-00001 and variables.index files, and loaded it using tf.keras.models.load_model.
The current code I have is an ultra-slow, hardcoded optimization function: it iterates over candidate x1 values, runs each through the network, appends every output to a list, and takes the x1 value that produced the lowest output. This is obviously not a very good solution. See the code below.
for index, row in input_df.iterrows():
    prediction = model.predict(row[['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']]).flatten()
    prediction = float(prediction)
    X1_predictions.append(prediction)

    # Optimized x1
    x1_values = []
    y_pred_values = []
    for x1 in np.arange(-1, 0, 0.01):
        row['x1'] = x1
        x1_values.append(x1)
        y_prediction = y_model.predict(row[['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']]).flatten()
        y_prediction = float(y_prediction)
        y_pred_values.append(y_prediction)

    min_y_val = min(y_pred_values)
    min_y_idx = y_pred_values.index(min(y_pred_values))
    opt_x1 = x1_values[min_y_idx]
    x1_opt_list.append(opt_x1)
    y_opt_predictions.append(min_y_val)
I haven't worked with TF regression models like this before; how should I go about solving this in a more elegant manner using TensorFlow/Keras instead of lists and for-loops?
I can give you a rough example on how to solve an optimization problem with tensorflow.
Let's imagine that you have a certain function, and want to optimize the input to that function based on some ground truth y. Let's call that function my_funct. (In your case, it would be a frozen neural network.) In my example, I will take a simple function, like a sum:
@tf.function
def my_funct(inp):
    return tf.reduce_sum(inp)
Now, let's define an input and a ground truth. In this optimization problem, my ground truth is the sum of the input + 1, so at the end of the optimization the variable x1 should be equal to inp[0] + 1:
inp = tf.random.normal((9,))
y_true = tf.reduce_sum(inp) + 1
Now, you need to encode the values that you want to optimize (in your example x1) in a tf.Variable. This is the way TensorFlow keeps track of states that need to be optimized. In our case, x1 is the first value of our input.
x1 = tf.Variable(inp[0])
Let's start the optimization itself. We need:
a cost function, which tells us how far we are from the objective
an optimizer, an algorithm that modifies the states of our program so as to reduce the cost function.
In this case, I'm going to use the gradient descent optimizer and the mean squared error as an objective function, but there are plenty of other possibilities that might fit your problem better.
opt = tf.optimizers.SGD()
cost = tf.losses.mse
Then, we can write the optimization itself using TensorFlow. To do that, we need to calculate the gradient of our cost function with respect to our states, which we then give to the optimizer so it can modify the states in the right direction to minimize our cost.
This can be done that way:
STEPS = 200
for _ in range(STEPS):
    with tf.GradientTape() as tape:
        tape.watch(x1)
        y_pred = my_funct(tf.concat([[x1], inp[1:]], axis=0))
        loss = cost([y_true], [y_pred])
    grad = tape.gradient(loss, [x1])
    opt.apply_gradients(zip(grad, [x1]))
It is a bit cumbersome to handle the tf.Variable with the rest of the input as I do with tf.concat. There might be a more elegant way of doing it, but I don't want to over engineer that simple example.
At the end of that process, we should have something close to x1 = inp[0] + 1.
Let's check:
>>> inp[0] + 1
<tf.Tensor: shape=(), dtype=float32, numpy=2.5110626>
>>> x1
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=2.4934747>
Not so bad!
Note: as always in these problems, there are some hyperparameters you can tune to get faster, better results, such as the number of steps, the learning rate, etc.
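As an aside on the tf.concat point above: one possible alternative (a sketch, not part of the original answer) is to keep the whole input in a single tf.Variable and zero out the gradient entries you don't want to update:
inp_var = tf.Variable(inp)
mask = tf.constant([1.] + [0.] * 8)  # only x1, the first entry, is trainable

for _ in range(STEPS):
    with tf.GradientTape() as tape:
        y_pred = my_funct(inp_var)
        loss = cost([y_true], [y_pred])
    [grad] = tape.gradient(loss, [inp_var])
    opt.apply_gradients([(grad * mask, inp_var)])  # masked entries get a zero update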

Where is backpropagation performed in this example

I have an example of a DNN learning XOR: https://colab.research.google.com/drive/1M5xFp4gaXPCbnejM8-5_yLp1B6UvwdL8
I'm confused in these 2 lines (related to backpropagation):
Grads = T.gradient(Loss,[W1,B1,W2,B2]);
Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]));
I'm guessing the backward loop is at T.gradient because those are gradient values related to loss, but I'm still not clear. The questions are:
Question 1. Is there backpropagation (the backward loop) in those 2 lines?
Question 2. If there is backpropagation, is it at T.gradient or Optim.apply_gradients?
Question 3. Because backpropagation is done backward, is the order of [W1,B1,W2,B2] important? I believe, e.g., that the shuffled order [B1,W2,B2,W1] can't be the same, because backpropagation needs the layer order from output back to input.
From my experiments, when shuffling the order of weights and biases in the variable array, the optimisation process still works. But backpropagation needs the layer order from output back to input, so I don't get this.
Source code:
#!pip install tensorflow==2.0.0rc2
%tensorflow_version 2.x
%reset -f

#libs
import tensorflow as tf;

#data
X = [[0,0],[0,1],[1,0],[1,1]];
Y = [[0], [1], [1], [0] ];
X = tf.convert_to_tensor(X,tf.float32);
Y = tf.convert_to_tensor(Y,tf.float32);

#model
W1 = tf.Variable(tf.random.uniform([2,20],-1,1));
B1 = tf.Variable(tf.random.uniform([ 20],-1,1));
W2 = tf.Variable(tf.random.uniform([20,1],-1,1));
B2 = tf.Variable(tf.random.uniform([ 1],-1,1));

#tf.function
def feedforward(X):
    H1  = tf.nn.leaky_relu(tf.matmul(X,W1) + B1);
    Out = tf.sigmoid(tf.matmul(H1,W2) + B2);
    return Out;
#end def

#train
Optim = tf.keras.optimizers.SGD(1e-1);
Steps = 1000;

for I in range(Steps):
    if I%(Steps/10)==0:
        Out  = feedforward(X);
        Loss = tf.reduce_sum(tf.square(Y-Out));
        print("Loss:",Loss.numpy());
    #end if

    with tf.GradientTape() as T:
        Out  = feedforward(X);
        Loss = tf.reduce_sum(tf.square(Y-Out));
    #end with

    #BACKPROPAGATION HERE?
    Grads = T.gradient(Loss,[W1,B1,W2,B2]);
    Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]));
#end for

Out  = feedforward(X);
Loss = tf.reduce_sum(tf.square(Y-Out));
print("Loss:",Loss.numpy(),"(Last)");

print("\nDone.");
#eof
Let's take this one step at a time.
Step 1: Calculation of Gradients:
Grads = T.gradient(Loss,[W1,B1,W2,B2])
Here, we calculate the gradients of the loss with respect to the variables in the provided list. The list of gradients is indexed based on the indices of the variables. This means that Grads[0] will be the gradients with respect to W1, and so on.
Step 2: Next, we perform the update. This is done in:
Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]))
Here, Grads[0] are used to update W1, Grads[1] to update B1 and so on.
Note that gradient calculation and the update steps are performed separately. So as long as the variables appear in the same order in both lists, there shouldn't be any problems.
Also, GradientTape has to be used with Eager Execution.
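To convince yourself that only the pairing matters, not the layer order, note that a shuffled version pairs each gradient with the variable it was computed for (a small sketch based on the code above):
# Shuffled order - still correct, because zip keeps each gradient
# aligned with the variable it was computed for:
Grads = T.gradient(Loss,[B1,W2,B2,W1]);
Optim.apply_gradients(zip(Grads,[B1,W2,B2,W1]));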
With TensorFlow 2 in default eager mode, even without the @tf.function decorator to build a graph, TensorFlow still tracks the relations between tensors during calculation: https://stats.stackexchange.com/a/272000/142160
TensorFlow tracks every variable here:
with tf.GradientTape() as T:
    Out = feedforward(X);
    Loss = tf.reduce_sum(tf.square(Y-Out));
It is automatic differentiation (rather than symbolic or numerical differentiation), and thus all gradients obtained by the following call are already at their proper depths in backpropagation (just like the backward loop that calculates errors at all layers):
Grads = T.gradient(Loss,[W1,B1,W2,B2]);
After that, the optimiser applies the gradients to change the weights and biases:
Optim.apply_gradients(zip(Grads,[W1,B1,W2,B2]));

Tensorflow difference between tf.stop_gradient and feed variables to optimizer?

I'm trying to train a model in self-supervised learning. The flow chart is something like the following (image omitted):
Let's assume that N1 is already trained and we want to train just N2. This is my current implementation:
x_1 = tf.placeholder(tf.float32, [None, 128, 128, 1])
x_2 = tf.placeholder(tf.float32, [None, 128, 128, 1])
s_t1 = tf.stop_gradient(N1(x_1)) # treat s_t1 as a constant
s_t2_pred = N2(s_t1)
s_t2 = tf.stop_gradient(N1(x_2)) # treat s_t2 as a constant
loss = some_loss_function(s_t2, s_t2_pred)
train_op = tf.train.AdamOptimizer(lr).minimize(loss)
In this way, I should be optimizing only N2. What makes me confused is the fact that if I were to use the following code I would obtain very different results (much better than the above):
# treat everything as a variable:
s_t1 = N1(x_1)
s_t2_pred = N2(s_t1)
s_t2 = N1(x_2)
loss = some_loss_function(s_t2, s_t2_pred)
var_list = take_all_variables_in_N2()
train_op = tf.train.AdamOptimizer(lr).minimize(loss, var_list=var_list)
I wonder what is the problem with the first implementation. What is exactly the behaviour of tf.stop_gradient (the documentation is a bit poor)? How does this differ from the second approach?
From a practical perspective in self-supervised learning: what is the difference between the two? Which one is the correct approach?
Thank you :)
I added a possible solution to the problem in the comments below. I would still be happy to receive any feedback from more experienced users and to share some opinions on the best approach to structure a self-supervised learning problem in tensorflow.
Bye, G.
I found a possible solution to my question and I'm posting it here, in case someone may find it useful.
Apparently, tf.stop_gradient() only stops new gradients from being back-propagated through the layers, but: if we have a momentum term (e.g. when using Adam or RMSProp) the variables of such layers can still be updated due to gradients accumulated in the past (contained in the momentum term). Let's have a look at the simple case of SGD + momentum; the formula would be:
w1 = w0 - a*grad(loss) - b*v0
where w0 and w1 are the weights at times 0 and 1, a is the learning rate, and v0 is the accumulated velocity (a function of the past gradients). Using tf.stop_gradient() is equivalent to multiplying the gradient term by zero. Then, the update rule becomes:
w1 = w0 - b*v0
i.e. we still have a momentum component that can update the weights.
A workaround to this problem is to explicitly pass the variables to be updated to the optimizer. For example:
var_list = take_all_variables_in_N2()
train_op = tf.train.AdamOptimizer(lr).minimize(loss, var_list=var_list)
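If take_all_variables_in_N2() needs an implementation, here is one possible sketch (it assumes N2's layers were created inside a variable scope named "N2"):
var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="N2")
train_op = tf.train.AdamOptimizer(lr).minimize(loss, var_list=var_list)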
References:
[1] http://ruder.io/optimizing-gradient-descent/
[2] Using stop_gradient with AdamOptimizer in TensorFlow

What's the difference between GradientTape, implicit_gradients, gradients_function and implicit_value_and_gradients?

I'm trying to switch to TensorFlow eager mode and I find the documentation of GradientTape, implicit_gradients, gradients_function and implicit_value_and_gradients confusing.
What's the difference between them? When should I use one over the other?
The intro point in the documentation does not mention the implicit* functions at all, yet almost all of the examples in the TensorFlow repository seem to use that method for computing gradients.
There are 4 ways to automatically compute gradients when eager execution is enabled (actually, they also work in graph mode):
tf.GradientTape context records computations so that you can call tape.gradient() to get the gradients of any tensor computed while recording, with regards to any trainable variable.
tfe.gradients_function() takes a function (say f()) and returns a gradient function (say fg()) that can compute the gradients of the outputs of f() with regards to the parameters of f() (or a subset of them).
tfe.implicit_gradients() is very similar but fg() computes the gradients of the outputs of f() with regards to all trainable variables these outputs depend on.
tfe.implicit_value_and_gradients() is almost identical but fg() also returns the output of the function f().
Usually, in Machine Learning, you will want to compute the gradients of the loss with regards to the model parameters (i.e. variables), and you will generally also be interested in the value of the loss itself. For this use case, the simplest and most efficient options are tf.GradientTape and tfe.implicit_value_and_gradients() (the other two options do not give you the value of the loss itself, so if you need it, it will require extra computations). I personally prefer tfe.implicit_value_and_gradients() when writing production code, and tf.GradientTape when experimenting in a Jupyter notebook.
Edit: In TF 2.0, it seems that only tf.GradientTape remains. Maybe the other functions will be added back, but I wouldn't count on it.
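For reference, a rough TF 2.x equivalent of the value-and-gradients pattern using only tf.GradientTape (a sketch, not part of the original answer):
import tensorflow as tf

w1 = tf.Variable(2.0)
w2 = tf.Variable(3.0)

with tf.GradientTape() as tape:
    s = w1 * 5. + w2 * 7.  # the function's value, recorded by the tape
[w1_grad, w2_grad] = tape.gradient(s, [w1, w2])
print(s.numpy(), w1_grad.numpy(), w2_grad.numpy())  # 31.0 5.0 7.0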
Detailed example
Let's create a small function to highlight the differences:
import tensorflow as tf
import tensorflow.contrib.eager as tfe
tf.enable_eager_execution()
w1 = tfe.Variable(2.0)
w2 = tfe.Variable(3.0)

def weighted_sum(x1, x2):
    return w1 * x1 + w2 * x2

s = weighted_sum(5., 7.)
print(s.numpy()) # 31
Using tf.GradientTape
Within a GradientTape context, all operations are recorded, then you can compute the gradients of any tensor computed within the context, with regards to any trainable variable. For example, this code computes s within the GradientTape context, and then computes the gradient of s with regards to w1. Since s = w1 * x1 + w2 * x2, the gradient of s with regards to w1 is x1:
with tf.GradientTape() as tape:
    s = weighted_sum(5., 7.)

[w1_grad] = tape.gradient(s, [w1])
print(w1_grad.numpy()) # 5.0 = gradient of s with regards to w1 = x1
Using tfe.gradients_function()
This function returns another function that can compute the gradients of a function's returned value with regards to its parameters. For example, we can use it to define a function that will compute the gradients of s with regards to x1 and x2:
grad_fn = tfe.gradients_function(weighted_sum)
x1_grad, x2_grad = grad_fn(5., 7.)
print(x1_grad.numpy()) # 2.0 = gradient of s with regards to x1 = w1
In the context of optimization, it would make more sense to compute gradients with regards to variables that we can tweak. For this, we can change the weighted_sum() function to take w1 and w2 as parameters as well, and tell tfe.gradients_function() to only consider the parameters named "w1" and "w2":
def weighted_sum_with_weights(w1, x1, w2, x2):
    return w1 * x1 + w2 * x2

grad_fn = tfe.gradients_function(weighted_sum_with_weights, params=["w1", "w2"])
[w1_grad, w2_grad] = grad_fn(w1, 5., w2, 7.)
print(w2_grad.numpy()) # 7.0 = gradient of s with regards to w2 = x2
Using tfe.implicit_gradients()
This function returns another function that can compute the gradients of a function's returned value with regards to all trainable variables it depends on. Going back to the first version of weighted_sum(), we can use it to compute the gradients of s with regards to w1 and w2 without having to explicitly pass these variables. Note that the gradient function returns a list of gradient/variable pairs:
grad_fn = tfe.implicit_gradients(weighted_sum)
[(w1_grad, w1_var), (w2_grad, w2_var)] = grad_fn(5., 7.)
print(w1_grad.numpy()) # 5.0 = gradient of s with regards to w1 = x1
assert w1_var is w1
assert w2_var is w2
This function does seem like the simplest and most useful option, since generally we are interested in computing the gradients of the loss with regards to the model parameters (i.e. variables).
Note: try making w1 untrainable (w1 = tfe.Variable(2., trainable=False)) and redefine weighted_sum(), and you will see that grad_fn only returns the gradient of s with regards to w2.
Using tfe.implicit_value_and_gradients()
This function is almost identical to implicit_gradients() except the function it creates also returns the result of the function being differentiated (in this case weighted_sum()):
grad_fn = tfe.implicit_value_and_gradients(weighted_sum)
s, [(w1_grad, w1_var), (w2_grad, w2_var)] = grad_fn(5., 7.)
print(s.numpy()) # 31.0 = s = w1 * x1 + w2 * x2
When you need both the output of a function and its gradients, this function can give you a nice performance boost, since you get the output of the function for free when computing the gradients using autodiff.

How to freeze/lock weights of one TensorFlow variable (e.g., one CNN kernel of one layer)

I have a TensorFlow CNN model that is performing well and we would like to implement this model in hardware; i.e., an FPGA. It's a relatively small network but it would be ideal if it were smaller. With that goal, I've examined the kernels and find that there are some where the weights are quite strong and there are others that aren't doing much at all (the kernel values are all close to zero). This occurs specifically in layer 2, corresponding to the tf.Variable() named, "W_conv2". W_conv2 has shape [3, 3, 32, 32]. I would like to freeze/lock the values of W_conv2[:, :, 29, 13] and set them to zero so that the rest of the network can be trained to compensate. Setting the values of this kernel to zero effectively removes/prunes the kernel from the hardware implementation thus achieving the goal stated above.
I have found similar questions with suggestions that generally revolve around one of two approaches:
Suggestion #1:
tf.Variable(some_initial_value, trainable = False)
Implementing this suggestion freezes the entire variable. I want to freeze just a slice, specifically W_conv2[:, :, 29, 13].
Suggestion #2:
Optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list)
Again, implementing this suggestion does not allow the use of slices. For instance, if I try the inverse of my stated goal (optimize only a single kernel of a single variable) as follows:
Optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list = W_conv2[:,:,0,0])
I get the following error:
NotImplementedError: ('Trying to optimize unsupported type ', <tf.Tensor 'strided_slice_2228:0' shape=(3, 3) dtype=float32>)
Slicing tf.Variables() isn't possible in the way that I've tried it here. The only thing that I've tried which comes close to doing what I want is using .assign() but this is extremely inefficient, cumbersome, and caveman-like as I've implemented it as follows (after the model is trained):
for _ in range(10000):
    # get a new batch of data
    # reset the values of W_conv2[:,:,29,13]=0 each time through
    for m in range(3):
        for n in range(3):
            assign_op = W_conv2[m,n,29,13].assign(0)
            sess.run(assign_op)
    # re-train the rest of the network
    _, loss_val = sess.run([optimizer, loss], feed_dict = {
        dict_stuff_here
    })
    print(loss_val)
The model was started in Keras then moved to TensorFlow since Keras didn't seem to have a mechanism to achieve the desired results. I'm starting to think that TensorFlow doesn't allow for pruning but find this hard to believe; it just needs the correct implementation.
A possible approach is to initialize these specific weights with zeros, and modify the minimization process such that gradients won't be applied to them. It can be done by replacing the call to minimize() with something like:
W_conv2_weights = np.ones((3, 3, 32, 32), dtype=np.float32)  # float32 to match the gradients' dtype
W_conv2_weights[:, :, 29, 13] = 0.
W_conv2_weights_const = tf.constant(W_conv2_weights)

optimizer = tf.train.RMSPropOptimizer(0.001)

W_conv2_orig_grads = tf.gradients(loss, [W_conv2])[0]  # tf.gradients returns a list
W_conv2_grads = tf.multiply(W_conv2_weights_const, W_conv2_orig_grads)
W_conv2_train_op = optimizer.apply_gradients([(W_conv2_grads, W_conv2)])

rest_grads = tf.gradients(loss, rest_of_vars)
rest_train_op = optimizer.apply_gradients(zip(rest_grads, rest_of_vars))

train_op = tf.group([rest_train_op, W_conv2_train_op])
That is:
Prepare a constant tensor for canceling the appropriate gradients.
Compute gradients only for W_conv2, then multiply them element-wise with the constant W_conv2_weights_const to zero the appropriate gradients, and only then apply them.
Compute and apply gradients "normally" to the rest of the variables.
Group the 2 train ops into a single training op.
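Another way to get a similar effect, closer in spirit to tf.stop_gradient (a sketch, not from the original answer): build an "effective" kernel in which the frozen slice is routed through tf.stop_gradient, and use that tensor in the convolution. Combined with initializing that slice to zero, its entries stay at zero:
mask = np.ones((3, 3, 32, 32), dtype=np.float32)
mask[:, :, 29, 13] = 0.
mask = tf.constant(mask)
# Gradients reach only the unmasked entries of W_conv2;
# the masked slice is treated as a constant in the backward pass.
W_conv2_effective = mask * W_conv2 + (1. - mask) * tf.stop_gradient(W_conv2)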