I am new to PyTorch but have a lot of experience with TensorFlow.
I would like to modify the gradient of just a tiny piece of the graph: just the derivative of activation function of a single layer. This can be easily done in Tensorflow using tf.custom_gradient, which allows you to supply customized gradient for any functions.
I would like to do the same thing in PyTorch and I know that you can modify the backward() method, but that requires you to rewrite the derivative for the whole network defined in the forward() method, when I would just like to modify the gradient of a tiny piece of the graph. Is there something like tf.custom_gradient() in PyTorch? Thanks!
You can do this in two ways:
1. Modifying the backward() function:
As you already said in your question, pytorch also allows you to provide a custom backward implementation. However, in contrast to what you wrote, you do not need to re-write the backward() of the entire model - only the backward() of the specific layer you want to change.
Here's a simple and nice tutorial that shows how this can be done.
For example, here is a custom clip activation that instead of killing the gradients outside the [0, 1] domain, simply passes the gradients as-is:
class MyClip(torch.autograd.Function):
#staticmethod
def forward(ctx, x):
return torch.clip(x, 0., 1.)
#staticmethod
def backward(ctx, grad):
return grad
Now you can use MyClip layer wherever you like in your model and you do not need to worry about the overall backward function.
2. Using a backward hook
pytorch allows you to attach hooks to different layer (=sub nn.Modules) of your network. You can register_full_backward_hook to your layer. That hook function can modify the gradients:
The hook should not modify its arguments, but it can optionally return a new gradient with respect to the input that will be used in place of grad_input in subsequent computations.
I am trying to modify a class, SGDRule(optimizer.UpdateRule) of chainer, to make my original optimizer.
To achieve what I want, I need to get not only the gradient but also the loss.
Before generating the gradient by back propagation, a forward path, which yields the loss, must be done. I need that loss.
The problem is that I have to access the loss from the code of update_core_gpu(self, param) in the class.
I learned that the Classifier object has the loss as an attribute. However, I don't know how to access the object from the update rule.
As an alternative, I considered using the Reporter object that I can access from the code. I know how to pass a value to the reporter, but have no idea about how to get the loss that the reporter has.
Does anybody know how to get the current loss in the code of update rule?
If you are using a model that holds the loss, e.g. a Classifier, one simple but maybe less elegant way to do it would be to simply pass the model to the Optimizer and then to each UpdateRule when being constructed in Optimizer.create_update_rule. If you don't want to pass the model, you could probably pass a lambda that returns the loss from the model.
Another, probably a cleaner approach if sufficient for your case, would be to implement an optimizer hook, similar to how gradient clipping is implemented in Chainer. See https://github.com/chainer/chainer/blob/master/chainer/optimizer_hooks/gradient_clipping.py#L56. You can obtain the loss via opt.target.loss (opt.target being
your model) and for instance update the gradient prior to the optimization step.
guys! I have a question to ask.If I want to use maxout as activation function , how should I write the codes in Tensorflow? An input parameter is required in the slim.maxout() function, so it cannot be used for
slim.arg_scope([slim.conv], activation_fn=slim.maxout)?What should I do?
You may have to define maxout in a separate function. For example:
def maxout(inputs,num_inputs):
return slim.maxout(inputs,num_inputs)
slim.arg_scope([slim.conv],activation_fn=maxout)
(I may have the arguments defined incorrectly in the maxout function.)
In any case, I recommend that you switch to tf.layers (tf core API) because tf.slim seems to be in the works of being phased out.
https://github.com/tensorflow/tensorflow/issues/16182#issuecomment-372397483
I have multiple RNN layers right now setup like:
stack = tf.nn.rnn_cell.MultiRNNCell([
tf.nn.rnn_cell.GRUCell(num_hidden, activation=clipped_relu)
for _ in range(num_rnn_layers)
])
But am trying to add layer normalization using https://www.tensorflow.org/api_docs/python/tf/contrib/layers/layer_norm to the RNN layers. I've tried a number of different setups but can't get the model to compile.
Has anyone done this yet? And if so, how did you implement it?
I think you need to define your own layer class that normalizes inside the call function. Did you try that?
There is a layer normalization implementation here:
tf.contrib.rnn.LayerNormBasicLSTMCell
which can be used in the MultiRNNCell function.
I'm using bidirectional_rnn with GRUCell but this is a general question regarding the RNN in Tensorflow.
I couldn't find how to initialize the weight matrices (input to hidden, hidden to hidden). Are they initialized randomly? to zeros? are they initialized differently for each LSTM I create?
EDIT: Another motivation for this question is in pre-training some LSTMs and using their weights in a subsequent model. I don't currently know how to do that currently without saving all the states and restoring the entire model.
Thanks.
How to initialize weight matrices for RNN?
I believe people are using random normal initialization for weight matrices for RNN. Check out the example in TensorFlow GitHub Repo. As the notebook is a bit long, they have a simple LSTM model where they use tf.truncated_normal to initialize weights and tf.zeros to initialize biases (although I have tried using tf.ones to initialize biases before, seem to also work). I believe that the standard deviation is a hyperparameter you could tune yourself. Sometimes weights initialization is important to the gradient flow. Although as far as I know, LSTM itself is designed to handle gradient vanishing problem (and gradient clipping is for helping gradient exploding problem), so perhaps you don't need to be super careful with the setup of std_dev in LSTM? I've read papers recommending Xavier initialization (TF API doc for Xavier initializer) in Convolution Neural Network context. I don't know if people use that in RNN, but I imagine you can even try those in RNN if you want to see if it helps.
Now to follow up with #Allen's answer and your follow up question left in the comments.
How to control initialization with variable scope?
Using the simple LSTM model in the TensorFlow GitHub python notebook that I linked to as an example.
Specifically, if I want to re-factorize the LSTM part of the code in above picture using variable scope control, I may code something as following...
import tensorflow as tf
def initialize_LSTMcell(vocabulary_size, num_nodes, initializer):
'''initialize LSTMcell weights and biases, set variables to reuse mode'''
gates = ['input_gate', 'forget_gate', 'memory_cell', 'output_gate']
with tf.variable_scope('LSTMcell') as scope:
for gate in gates:
with tf.variable_scope(gate) as gate_scope:
wx = tf.get_variable("wx", [vocabulary_size, num_nodes], initializer)
wt = tf.get_variable("wt", [num_nodes, num_nodes], initializer)
bi = tf.get_variable("bi", [1, num_nodes, tf.constant_initializer(0.0)])
gate_scope.reuse_variables() #this line can probably be omitted, b.z. by setting 'LSTMcell' scope variables to 'reuse' as the next line, it'll turn on the reuse mode for all its child scope variables
scope.reuse_variables()
def get_scope_variables(scope_name, variable_names):
'''a helper function to fetch variable based on scope_name and variable_name'''
vars = {}
with tf.variable_scope(scope_name, reuse=True):
for var_name in variable_names
var = tf.get_variable(var_name)
vars[var_name] = var
return vars
def LSTMcell(i, o, state):
'''a function for performing LSTMcell computation'''
gates = ['input_gate', 'forget_gate', 'memory_cell', 'output_gate']
var_names = ['wx', 'wt', 'bi']
gate_comp = {}
with tf.variable_scope('LSTMcell', reuse=True):
for gate in gates:
vars = get_scope_variables(gate, var_names)
gate_comp[gate] = tf.matmul(i, vars['wx']) + tf.matmul(o, vars['wt']) + vars['bi']
state = tf.sigmoid(gate_comp['forget_gate']) * state + tf.sigmoid(gate_comp['input_gate']) * tf.tanh(gate_comp['memory_cell'])
output = tf.sigmoid(gate_comp['output_gate']) * tf.tanh(state)
return output, state
The usage of the re-factorized code would be something like following...
initialize_LSTMcell(volcabulary_size, num_nodes, tf.truncated_normal_initializer(mean=-0.1, stddev=.01))
#...Doing some computation...
LSTMcell(input_tensor, output_tensor, state)
Even though the refactorized code may look less straightforward, but using scope variable control ensures scope encapsulation and allows flexible variable controls (in my opinion at least).
In pre-training some LSTMs and using their weights in a subsequent model. How to do that without saving all the states and restoring the entire model.
Assuming you have a pre-trained model froze and loaded in, if you wanna use their frozen 'wx', 'wt' and 'bi', you can simply find their parent scope names and variable names, then fetch the variables using similar structure in get_scope_variables func.
with tf.variable_scope(scope_name, reuse=True):
var = tf.get_variable(var_name)
Here is a link to understanding variable scope and sharing variables. I hope this is helpful.
The RNN models will create their variables with get_variable, and you can control the initialization by wrapping the code which creates those variables with a variable_scope and passing a default initializer to it. Unless the RNN specifies one explicitly (looking at the code, it doesn't), uniform_unit_scaling_initializer is used.
You should also be able to share model weights by declaring the second model and passing reuse=True to its variable_scope. As long as the namespaces match up, the new model will get the same variables as the first model.
A simple way to initialize all kernel weights with certain initializer is to leave the initializer in tf.variable_scope(). For example:
with tf.variable_scope('rnn', initializer=tf.variance_scaling_initializer()):
basic_cell= tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, state= tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)