You can get intermediate gradients with tf.gradients() and you can create a new tensor by applying an op to the result (like clipping), but how do you modify the backpropagation accordingly?
For instance, to implement the Huber loss (with delta=1).
The first method is to create a boolean mask on the batch dimension, doing something like:
cond = tf.less(tf.abs(input_tensor), 1.)  # |x| < 1
cond = tf.cast(cond, tf.float32)          # boolean mask -> float mask
loss = cond * 0.5 * tf.square(input_tensor) + (1. - cond) * (tf.abs(input_tensor) - 0.5)
A simpler way to implement it would be to use the L2 loss and to clip its gradient with respect to the inputs to [-1, 1]:
l2_loss = 0.5 * tf.square(input_tensor)
modified_grad_wrt_input = tf.clip_by_value(tf.gradients(l2_loss, input_tensor)[0], -1., 1.)
But when you train your network you have to use compute_gradients and apply_gradients, which only give you gradients with respect to the variables. How do you make your optimizer use the tensor modified_grad_wrt_input when applying the chain rule?
Do you have to use gradient_override_map, as in this GitHub issue?
Is there a simpler way, without registering a new op/gradient?
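For reference, the gradient_override_map approach mentioned above would look roughly like the sketch below. The registered gradient name "ClipGradToOne" and the placeholder shape are illustrative, and the override is attached to an Identity op inserted where the gradient should be clipped:

import tensorflow as tf

# Custom gradient that clips whatever gradient flows through it to [-1, 1].
@tf.RegisterGradient("ClipGradToOne")
def _clip_grad_to_one(unused_op, grad):
    return tf.clip_by_value(grad, -1., 1.)

input_tensor = tf.placeholder(tf.float32, [None])
g = tf.get_default_graph()
# Route the backward pass of this Identity op through the custom gradient.
with g.gradient_override_map({"Identity": "ClipGradToOne"}):
    clipped = tf.identity(input_tensor)
loss = tf.reduce_sum(tf.square(clipped))
# tf.gradients(loss, input_tensor)[0] is now clipped to [-1, 1].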
In a custom optimizer I would like to update the weights with random values if the loss function has not decreased.
However, I cannot see how to do that in the methods you can override (_resource_apply_dense, _resource_apply_sparse, _create_slots, get_config). None of them are passed the loss function.
I have tried overriding minimize(), but that is not called in a standard training loop.
Any ideas?
If you are writing a custom optimizer, I think the easiest way to apply it is to also define the layers explicitly. In a standard feedforward neural network, if x is the input, then h = tf.tanh(tf.matmul(x, W) + b) is an example of the first hidden layer. Similarly you can add more layers. W and b are then the variables you need to update. The training loop would look something like this:
trainable_variables = [W, b]
for i in range(1000):
    optimizer.minimize(loss, trainable_variables)

but with your own optimizer instead of the one from Keras.
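To actually tie the update to the loss value, one option is to skip minimize() and write the step out with tf.GradientTape. The following is a minimal sketch of the loss-tracking idea; model, loss_fn, x, y, and the stddev of the random re-initialization are placeholder assumptions:

import tensorflow as tf

best_loss = float("inf")
for step in range(1000):
    with tf.GradientTape() as tape:
        loss = loss_fn(model(x), y)  # model, loss_fn, x, y assumed defined
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if float(loss) < best_loss:
        best_loss = float(loss)
    else:
        # the loss did not decrease: overwrite the weights with random values
        for var in model.trainable_variables:
            var.assign(tf.random.normal(var.shape, stddev=0.05))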
When I try to fine-tune a VGG network, I only want to update the weights after the 5th convolutional layer. In Caffe, we can cancel backpropagation in the configuration file. What should I do in TensorFlow? Thanks!
Just use tf.stop_gradient() on the input of your 5th layer. TensorFlow will not backpropagate the error below that point. tf.stop_gradient() is an operation that acts as the identity function in the forward direction but stops the gradient in the backward direction.
From the documentation:
tf.stop_gradient
Stops gradient computation.
When executed in a graph, this op outputs its input tensor as-is. When building ops to compute gradients, this op prevents the contribution of its inputs from being taken into account. Normally, the gradient generator adds ops to a graph to compute the derivatives of a specified 'loss' by recursively finding out the inputs that contributed to its computation. If you insert this op in the graph, its inputs are masked from the gradient generator. They are not taken into account for computing gradients.
Otherwise you can use optimizer.minimize(loss, variables_of_fifth_layer). Here you are running backpropagation but updating only the variables of your 5th layer.
For a fast selection of the variables of interest you could (see the sketch below):
Define as trainable=False all the variables that you don't want to update, and use variables_of_fifth_layer = tf.trainable_variables().
Divide the layers into specific scopes, and then use variables_of_fifth_layer = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "scope/of/fifth/layer").
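As a rough sketch of both options (the tensor, scope, and layer names here are hypothetical):

import tensorflow as tf

# Option 1: freeze everything below the 5th layer with tf.stop_gradient().
# 'fourth_layer_output' is a hypothetical tensor feeding the 5th layer.
frozen = tf.stop_gradient(fourth_layer_output)  # identity forward, zero gradient backward
fifth = tf.layers.conv2d(frozen, 512, 3, padding="same", name="conv5")

# Option 2: update only the variables collected under a known scope.
# 'loss' is assumed to be built on top of the network as usual.
fifth_layer_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "conv5")
train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss, var_list=fifth_layer_vars)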
Is it possible in Caffe to get the gradients with respect to each layer in a CNN, edit them, and apply the new gradients in the training process? If possible, using the pycaffe interface.
For example, in TensorFlow this can be done by means of the functions:
given_optimizer.compute_gradients(total_loss)
given_optimizer.apply_gradients(grads)
I'm not sure what you mean by "apply the new gradients in the training process", but you can access the gradients in the pycaffe interface:
import caffe
net = caffe.Net('/path/to/net.prototxt', '/path/to/weights.caffemodel', caffe.TEST)
# provide inputs to the net, do a pass so that meaningful data/gradients propagate to all the layers
net.forward_backward_all()
# once data/gradients are updated, you can access them
net.blobs['blob_name'].diff # access the gradient of blob 'blob_name'
net.layers[5].blobs[0].diff # access the gradient of the first parameter blob of the 6th layer
To map between layer names and layer indices, you can use this code:
list(net._layer_names).index('layer_name')
This will return the index of layer 'layer_name'.
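If by "apply" you mean performing the weight update yourself, one way is to run the two passes manually and write the edited diffs back into the parameter blobs. This is a sketch; the solver path, clipping range, and learning rate are placeholders:

import numpy as np
import caffe

solver = caffe.SGDSolver('/path/to/solver.prototxt')
net = solver.net
lr = 0.01

net.forward()   # forward pass on the current batch
net.backward()  # backward pass fills the .diff arrays

for layer in net.layers:
    for blob in layer.blobs:                            # parameter blobs (weights, biases)
        blob.diff[...] = np.clip(blob.diff, -1.0, 1.0)  # edit the gradients
        blob.data[...] -= lr * blob.diff                # apply them manually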
The TensorFlow documentation states that:
Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps:
1. Compute the gradients with compute_gradients().
2. Process the gradients as you wish.
3. Apply the processed gradients with apply_gradients().
However, the example given is for vanilla SGD.
Does this two-step process work for other types of optimizers (like momentum, Adam, etc.), which don't use the gradients directly but instead use other derived descent directions?
If so, where do the various intermediate variables and the final descent direction get computed: in compute_gradients or in apply_gradients?
Thanks.
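For concreteness, the three-step pattern does work with stateful optimizers; a sketch follows (TF1 style, with loss assumed to be defined elsewhere and norm clipping as an example processing step). compute_gradients() only returns the raw gradients dL/dw, while the optimizer-specific state, such as momentum accumulators or Adam's moment estimates, lives in slot variables that are created and updated inside apply_gradients():

import tensorflow as tf

optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)         # step 1: raw gradients dL/dw
processed = [(tf.clip_by_norm(g, 5.0), v)
             for g, v in grads_and_vars if g is not None]  # step 2: process as you wish
train_op = optimizer.apply_gradients(processed)            # step 3: momentum update happens here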
I want to use gradient descent with momentum (keeping track of previous gradients) while building a classifier in TensorFlow.
So I don't want to use tensorflow.train.GradientDescentOptimizer; instead, I want to use tensorflow.gradients to calculate the gradients, keep track of previous gradients, and update the weights based on all of them.
How do I do this in TensorFlow?
TensorFlow has an implementation of gradient descent with momentum.
To answer your general question about implementing your own optimization algorithm, TensorFlow gives you the primitives to calculate the gradients and to update variables using the calculated gradients. In your model, suppose loss designates the loss function and var_list is a Python list of the TensorFlow variables in your model (which you can get by calling tf.all_variables or tf.trainable_variables). Then you can calculate the gradients w.r.t. your variables as follows:
grads = tf.gradients(loss, var_list)
For simple gradient descent, you would simply subtract the product of the gradient and the learning rate from each variable. The code for that would look as follows:
var_updates = []
for grad, var in zip(grads, var_list):
var_updates.append(var.assign_sub(learning_rate * grad))
train_op = tf.group(*var_updates)
You can train your model by calling sess.run(train_op). Now, you can do all sorts of things before actually updating your variables. For instance, you can keep track of the gradients in a different set of variables and use them for the momentum algorithm. Or you can clip your gradients before updating the variables. All these are simple TensorFlow operations because the gradient tensors are no different from the other tensors that you compute in TensorFlow. Please look at the implementations (Momentum, RMSProp, Adam) of some of the fancier optimization algorithms to understand how you can implement your own.
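For example, momentum can be hand-rolled on top of the loop above. This is a minimal sketch (TF1 graph mode, with loss and var_list as before, and an illustrative learning rate and momentum coefficient):

import tensorflow as tf

learning_rate = 0.01
momentum = 0.9

grads = tf.gradients(loss, var_list)
var_updates = []
for grad, var in zip(grads, var_list):
    # one non-trainable "velocity" variable per model variable, initialized to zero
    vel = tf.Variable(tf.zeros(var.get_shape()), trainable=False)
    new_vel = vel.assign(momentum * vel + grad)  # accumulate previous gradients
    var_updates.append(var.assign_sub(learning_rate * new_vel))
train_op = tf.group(*var_updates)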