How can one implement L1 or L2 regularization on an LSTM in TensorFlow? TF doesn't give you direct access to the internal weights of the LSTM, so I'm not sure how to compute their norms and add them to the loss. My loss function is just RMS for now.
The answers here don't seem to suffice.
The answers in the link you mentioned are the correct way to do it. Iterate through tf.trainable_variables and find the variables associated with your LSTM.
An alternative, more complicated and possibly more brittle approach is to re-enter the LSTM's variable_scope with reuse=True and call get_variable(). But really, the first solution is simpler and less brittle.
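For illustration, a minimal sketch of the first approach (purely illustrative: it assumes the LSTM variables live under a scope whose name contains "lstm", rms_loss stands for your existing objective, and 0.001 is an arbitrary weight-decay factor; adjust all of these to your setup):
import tensorflow as tf
lstm_vars = [v for v in tf.trainable_variables() if 'lstm' in v.name.lower()]  # filter the LSTM's weights
l2_penalty = 0.001 * tf.add_n([tf.nn.l2_loss(v) for v in lstm_vars])           # L2 norm of those weights
total_loss = rms_loss + l2_penalty                                             # add the penalty to the original loss
train_op = tf.train.AdamOptimizer(1e-3).minimize(total_loss)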
TL;DR: collect all the parameters in a list and add their L^n norm to the objective function before computing the gradients for optimisation.
1) In the function where you define the inference:
net = [v for v in tf.trainable_variables()]  # collect every trainable parameter
return *, net  # '*' stands for whatever your inference function already returns
2) Add the L^n norm to the cost and compute the gradients of the cost:
weight_reg = tf.add_n([0.001 * tf.nn.l2_loss(var) for var in net])  # L2 penalty over all parameters
cost = your_original_cost + weight_reg  # original objective (without the regulariser) plus the penalty
param_gradients = tf.gradients(cost, net)
optimiser = tf.train.AdamOptimizer(0.001).apply_gradients(zip(param_gradients, net))
3) Run the optimiser when you want via
_ = sess.run(optimiser, feed_dict={input_var: data})
I'm trying to implement the recurrent neural network architecture proposed in this paper (https://arxiv.org/abs/1611.03824), where the authors use an LSTM to minimize a black-box function (which is, however, assumed to be differentiable); a diagram of the proposed architecture is given in the paper. Briefly, the idea is to use the LSTM as an optimizer: it has to learn a good heuristic for proposing new parameters of the unknown function y = f(parameters), so that it moves towards a minimum. Here's how the proposed procedure works:
Select an initial value for the parameters p0, and evaluate the function: y0 = f(p0)
Call the LSTM cell with input=[p0,y0]; its output is a new value for the parameters, output=p1
Evaluate y1 = f(p1)
Call the LSTM cell with input=[p1,y1], and obtain output=p2
Evaluate y2 = f(p2)
Repeat for a few iterations, for example stopping at the fifth: y5 = f(p5). (A rough sketch of this loop is shown right below.)
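For reference, a minimal eager-mode sketch of that loop (purely illustrative, not the paper's code: it assumes a tf.keras.layers.LSTMCell whose units equal the parameter dimension dim, an initial guess p0 of shape (batch, dim), and a differentiable f returning shape (batch, 1)):
import tensorflow as tf
cell = tf.keras.layers.LSTMCell(units=dim)                          # dim = number of parameters of f
states = cell.get_initial_state(batch_size=batch, dtype=tf.float32)
p = p0                                                              # initial parameter guess
y = f(p)                                                            # initial function value
for _ in range(5):                                                  # five optimisation steps
    x = tf.concat([p, y], axis=-1)                                  # feed [p_t, y_t] to the cell
    p, states = cell(x, states)                                     # cell output = next parameter guess
    y = f(p)                                                        # evaluate the black-box function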
I'm trying to implement a similar model in TensorFlow/Keras but I'm having some trouble. In particular, this case differs from "standard" ones because we don't have a predefined time sequence to analyze; instead, it is generated online, after each iteration of the LSTM cell. Thus, in this case, our input consists of just the starting guess [p0, y0=f(p0)] at time t=0. If I understood correctly, this model is similar to a one-to-many LSTM, but with the difference that the input to the next time step does not come only from the previous cell, but also from the output of an additional function (in our case f).
I managed to create a custom tf.keras.layers.Layer which performs the calculation for a single time step (that is, it runs the LSTM cell and then uses its output as input to the function f):
class my_layer(tf.keras.layers.Layer):
    def __init__(self, units = 4):
        super(my_layer, self).__init__()
        self.cell = tf.keras.layers.LSTMCell(units)

    def call(self, inputs):
        prev_cost = inputs[0]
        prev_params = inputs[1]
        prev_h = inputs[2]
        prev_c = inputs[3]
        # Concatenate the previous parameters and previous cost to create the new input
        new_input = tf.keras.layers.concatenate([prev_cost, prev_params])
        # New parameters obtained from the LSTM cell, along with the new internal states h and c
        new_params, [new_h, new_c] = self.cell(new_input, states = [prev_h, prev_c])
        # Function evaluation
        new_cost = f(new_params)
        return [new_cost, new_params, new_h, new_c]
but I do not know how to build the recurrent part. I tried to do it manually, that is, by doing something like:
my_cell = my_layer(units = 4)
outputs = my_cell(inputs)
outputs1 = my_cell(outputs)
outputs2 = my_cell(outputs1)
Is that correct? Is there some other way to do it more appropriately?
Bonus question: I would like to train the LSTM to optimize not just a single function f, but rather a class of different functions [f1, f2, ...] that share some common structure making them similar enough to be optimized with the same LSTM. How could I implement a training loop that takes as input a list of these functions [f1, f2, ...] and tries to minimize them all? My first thought was to do it the "brute force" way: use a for loop over the functions and a tf.GradientTape that evaluates and applies the gradients for each function (a rough sketch of this idea is given below).
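A possible sketch of that brute-force loop (purely illustrative: unroll is a hypothetical helper that runs the recurrent loop above for a given f and returns the accumulated cost, meta_opt is an ordinary Keras optimizer, model holds the LSTM's trainable variables, and num_epochs, function_list and p0 are assumed to exist):
import tensorflow as tf
for epoch in range(num_epochs):
    for f in function_list:                                      # the family [f1, f2, ...]
        with tf.GradientTape() as tape:
            meta_loss = unroll(model, f, p0, steps=5)            # cost accumulated over the unrolled steps
        grads = tape.gradient(meta_loss, model.trainable_variables)
        meta_opt.apply_gradients(zip(grads, model.trainable_variables))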
Any help is much appreciated!
Thank you very much in advance! :)
I'm trying to train a model with self-supervised learning. The flow chart is roughly the following: N1 encodes each input, and N2 is trained to map N1's encoding of one input (x_1) to N1's encoding of the other (x_2).
Let's assume that N1 is already trained and we want to train just N2. This is my current implementation:
x_1 = tf.placeholder(tf.float32, [None, 128, 128, 1])
x_2 = tf.placeholder(tf.float32, [None, 128, 128, 1])
s_t1 = tf.stop_gradient(N1(x_1)) # treat s_t1 as a constant
s_t2_pred = N2(s_t1)
s_t2 = tf.stop_gradient(N1(x_2)) # treat s_t2 as a constant
loss = some_loss_function(s_t2, s_t2_pred)
train_op = tf.train.AdamOptimizer(lr).minimize(loss)
In this way, I should be optimizing only N2. What makes me confused is the fact that if I were to use the following code I would obtain very different results (much better than the above):
# treat everything as a variable:
s_t1 = N1(x_1)
s_t2_pred = N2(s_t1)
s_t2 = N1(x_2)
loss = some_loss_function(s_t2, s_t2_pred)
var_list = take_all_variables_in_N2()
train_op = tf.train.AdamOptimizer(lr).minimize(loss, var_list=var_list)
I wonder what is the problem with the first implementation. What is exactly the behaviour of tf.stop_gradient (the documentation is a bit poor)? How does this differ from the second approach?
From a practical perspective in self-supervised learning: what is the difference between the two, and which one is the correct approach?
Thank you :)
I added a possible solution to the problem in the comments below. I would still be happy to receive any feedback from more experienced users and to share some opinions on the best approach to structure a self-supervised learning problem in tensorflow.
Bye, G.
I found a possible solution to my question and I'm posting it here, in case someone may find it useful.
Apparently, tf.stop_gradient() only stops new gradients from being back-propagated through the layers, but if we have a momentum term (e.g. when using Adam or RMSProp) the variables of such layers can still be updated due to gradients accumulated in the past (contained in the momentum term). Let's look at the simple case of SGD + momentum; the update formula is:
w1 = w0 - a*grad(loss) - b*v0
where w0 and w1 are the weights at times 0 and 1, a is the learning rate, and v0 is the accumulated velocity (a function of the past gradients). Using tf.stop_gradient() is equivalent to multiplying the second term by zero, so the update rule becomes:
w1 = w0 - b*v0
i.e., there is still a momentum component that can update the weights.
A workaround is to explicitly pass the variables to be updated to the optimizer. For example:
var_list = take_all_variables_in_N2()
train_op = tf.train.AdamOptimizer(lr).minimize(loss, var_list=var_list)
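As an aside, one common way to implement take_all_variables_in_N2 in TF1 is to collect the variables by scope; this sketch assumes N2 was built inside a variable scope named "N2" (adjust the scope name to your model):
var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='N2')  # only N2's trainable variables
train_op = tf.train.AdamOptimizer(lr).minimize(loss, var_list=var_list)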
References:
[1] http://ruder.io/optimizing-gradient-descent/
[2] Using stop_gradient with AdamOptimizer in TensorFlow
I am a researcher in optimization and I am trying to write a custom optimizer. I have come across a problem. I have asked in many places and so far have had no response.
Take any optimizer's code, say just copy SGD. At the beginning of get_updates, you see:
grads = self.get_gradients(loss, params)
now add the following line right after this one:
gradsb = self.get_gradients(loss, [tf.Variable(a) for a in params])
This should compute the gradients at new tensors whose values are the same as before.
now try to see what you get:
for a in gradsb:
    print(a)
You get a list of Nones (whereas if you print the list grads, you see that its entries are still Tensors).
Why?
And how can I circumvent this problem? This is important because my algorithm needs to compute the gradients at another point.
When you write gradsb = self.get_gradients(loss, [tf.Variable(a) for a in params]) you are defining a new tf.Variable for each a in params. Because the loss does not depend on these new variables, your gradients are None.
If you want to compute a second gradient you need to make sure that you're computing it with respect to Tensors that the objective does depend on.
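As a hedged illustration (not part of the original answer): to evaluate gradients at a perturbed point, build the perturbed parameters as tensors derived from params and rebuild the objective from them, so that the new loss actually depends on the tensors you differentiate. Here compute_loss is a hypothetical helper that constructs the loss from a given list of parameter tensors:
delta = 0.01
perturbed = [p + delta for p in params]                     # tensors derived from the original variables
loss_perturbed = compute_loss(perturbed)                    # rebuild the objective from the perturbed tensors
grads_at_perturbed = tf.gradients(loss_perturbed, params)   # gradients of the loss evaluated at params + delta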
Apparently, even replacing the current vector of parameters does not work! If I put this in the code:
grads = self.get_gradients(loss, params)
tempparam = [tf.Variable(a) for a in params]
params = [tf.add(a,a) for a in params]
gradsn = self.get_gradients(loss, params)
for a in gradsn:
    print(a)
params = [tf.Variable(a) for a in tempparam]
The result is still that None is printed!!
I think you can see what I am trying to do: at each iteration of get_updates, I would like to compute the gradients at a (slightly) different value of the parameter tensors, and use them to construct the update to the parameters for optimization and training. Is there any way to do this within the Keras package?
Let's say that I have some code such as:
out = tf.nn.softmax(x) # shape (batch,time,n)
labels = .... # reference labels of type (batch,time)->int
And then I define my loss as the Cross Entropy:
loss = -tf.log(tf.gather_nd(out, labels))
Will TensorFlow automatically replace the loss in the computation graph by this?
loss = sparse_softmax_cross_entropy_with_logits(x, labels)
What type of optimizations can I expect that TensorFlow will apply?
Follow-up question: if TensorFlow doesn't do this optimization, how can I do it manually? Consider that I have a modular framework where I get some out tensor that could possibly be the output of a softmax operation; I want to calculate cross entropy, and I want to use sparse_softmax_cross_entropy_with_logits if possible. How could I accomplish this? Can I do something like the following?
if out.op == "softmax": # how to check this?
x = out.op.sources[0] # how to get this?
loss = sparse_softmax_cross_entropy_with_logits(x, labels)
else:
loss = -tf.log(tf.gather_nd(out, labels))
TensorFlow generally doesn't merge nodes together in the way you're hoping. This is because other code (e.g. fetching outputs when running) may depend on intermediate nodes like the softmax, so removing them behind the user's back would be confusing.
If you do want to do this optimization yourself as part of a higher-level framework, you can analyze the current graphdef, but there's no annotation in TF to tell you what the outputs are, since that can vary at runtime depending on how session.run is called.
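If you do decide to pattern-match on the graph yourself, here is a rough sketch (not an official TF facility) using TF1 graph introspection via op.type and op.inputs; it only covers the simple case where out is directly the output of a Softmax op:
if out.op.type == "Softmax":
    x = out.op.inputs[0]  # the logits tensor feeding the softmax
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=x)
else:
    loss = -tf.log(tf.gather_nd(out, labels))  # fall back to the explicit formulation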
I would like to implement in TensorFlow the technique of "guided back-propagation" introduced in this paper and described in this recipe.
Computationally, that means that when I compute the gradient, e.g. of the input w.r.t. the output of the NN, I will have to modify the gradients computed at every ReLU unit. Concretely, the back-propagated signal at those units must be thresholded at zero to make this technique work. In other words, the partial derivatives at the ReLUs that are negative must be ignored.
Given that I am interested in applying these gradient computations only on test examples, i.e., I don't want to update the model's parameters - how shall I do it?
I tried (unsuccessfully) two things so far:
Use tf.py_func to wrap my simple numpy version of a ReLU, whose gradient operation can then be redefined via the g.gradient_override_map context manager.
Gather the forward/backward values of back-propagation and apply the thresholding to those stemming from ReLUs.
I failed with both approaches because they require some knowledge of the internals of TF that I currently don't have.
Can anyone suggest any other route, or sketch the code?
Thanks a lot.
The better solution is your approach 1, using ops.RegisterGradient and tf.Graph.gradient_override_map. Together they override the gradient computation for a pre-defined op, e.g. Relu, within the gradient_override_map context, using only Python code.
from tensorflow.python.framework import ops
from tensorflow.python.ops import gen_nn_ops

@ops.RegisterGradient("GuidedRelu")
def _GuidedReluGrad(op, grad):
    # Pass the back-propagated signal through the standard ReLU gradient only where it is positive; zero it elsewhere.
    return tf.where(0. < grad, gen_nn_ops._relu_grad(grad, op.outputs[0]), tf.zeros(grad.get_shape()))
...
with g.gradient_override_map({'Relu': 'GuidedRelu'}):
    y = tf.nn.relu(x)
Here is a full example implementation of guided ReLU: https://gist.github.com/falcondai/561d5eec7fed9ebf48751d124a77b087
Update: in TensorFlow >= 1.0, tf.select was renamed to tf.where; I updated the snippet accordingly. (Thanks @sbond for bringing this to my attention.)
tf.gradients has a grad_ys parameter that can be used for this purpose. Suppose your network has just one ReLU layer, as follows:
before_relu = f1(inputs, params)
after_relu = tf.nn.relu(before_relu)
loss = f2(after_relu, params, targets)
First, compute the derivative up to after_relu.
Dafter_relu = tf.gradients(loss, after_relu)[0]
Then threshold the gradients that you send down:
Dafter_relu_thresholded = tf.where(Dafter_relu < 0.0, tf.zeros_like(Dafter_relu), Dafter_relu)  # tf.select in TF < 1.0
Compute the actual gradients w.r.t. params:
Dparams = tf.gradients(after_relu, params, grad_ys=Dafter_relu_thresholded)
You can easily extend this same method for a network with many relu layers.
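For instance, a purely illustrative sketch of the same pattern for two ReLU layers (not from the original answer; it assumes tensors after_relu1 and after_relu2 analogous to the one above): threshold the back-propagated signal at each ReLU boundary and push it further down with grad_ys.
d2 = tf.gradients(loss, after_relu2)[0]
d2 = tf.where(d2 < 0.0, tf.zeros_like(d2), d2)                # threshold at the top ReLU
d1 = tf.gradients(after_relu2, after_relu1, grad_ys=d2)[0]
d1 = tf.where(d1 < 0.0, tf.zeros_like(d1), d1)                # threshold at the next ReLU
Dparams = tf.gradients(after_relu1, params, grad_ys=d1)       # gradients w.r.t. parameters below the first ReLU
# Parameters sitting between the two ReLUs would need an analogous tf.gradients call with grad_ys=d2.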