TensorFlow: How to apply a regularizer on a tensor?

I am implementing a model in TensorFlow 2, and I want to apply a penalty to a tensor in my model (the product of two layers' outputs).
I am used to using regularization on layers (kernel, bias, or activity regularization).
I could build a custom layer that only has an activity regularizer, but I am hoping there is a simpler way to add regularization to a tensor.
I saw this code in Tensorflow:
regularizer = tf.keras.regularizers.L2(2.)
tensor = tf.ones(shape=(5, 5))
regularizer(tensor)
Which outputs:
<tf.Tensor: shape=(), dtype=float32, numpy=50.0>
But does this only compute the regularization value, or does it also add it to my model's loss?
Or would adding self.add_loss(tf.keras.regularizers.L2(2.)(tensor)) in my call function work?
How would you add a penalty on a tensor?
It is my first question on Stack Overflow, so sorry if I didn't ask it in the right place.

No, the regularizer classes pretty much only handle the computation of the penalty itself. E.g. the source code for the L2 regularizer has
def __call__(self, x):
    return self.l2 * tf.reduce_sum(tf.square(x))
and there is also nothing in __init__ that points at any "housekeeping" work. However, activity_regularizer is actually an argument of the base Layer class, so any subclassed layer should be able to handle it by default. That means it should be easy to write a custom layer; you essentially only have to write the call method, which sounds like a one-liner in your case.
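For illustration, here is a minimal sketch of such a custom layer, assuming the tensor you want to penalize is the element-wise product of two inputs (the class name and the rate are made up):

import tensorflow as tf

class PenalizedProduct(tf.keras.layers.Layer):
    """Multiplies two inputs element-wise and penalizes the result."""

    def __init__(self, rate=2.0):
        super().__init__()
        self.regularizer = tf.keras.regularizers.L2(rate)

    def call(self, inputs):
        a, b = inputs
        product = a * b
        # add_loss registers the penalty so it is included in the
        # model's total loss during training.
        self.add_loss(self.regularizer(product))
        return product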
You may even be able to use a Lambda layer to avoid having to write any sub-classing code, since these also inherit from Layer. However, the docs mention that there are some issues with Lambda regarding saving and loading models...

Related

How to get activation of a hidden layer in tensorflow.js?

In TensorFlow.js, I have a very simple tf.Sequential model created like this:
let model = tf.sequential();
model.add(tf.layers.dense({inputShape: [784], units: 128, activation: 'relu'}));
model.add(tf.layers.dense({units: 10}));
model.add(tf.layers.softmax());
During prediction time, how can I get the activation of the second tf.layers.dense layer?
Can I just delete model.layers[2] and use model.predict() as normal?
(I know I can do this in advance by defining two model outputs with the functional API, but let's say I have a pre-made tf.Sequential model that I want to inspect the logits of.)
For more complex models, there's an easier way: if model is the original model, you can create a copy using tf.model({inputs: model.inputs, outputs: model.layers[2].output}), thereby only needing to reference the first and last layers.
I figured out how to do this.
Deleting model.layers[2] doesn't work, since apparently model.predict() doesn't depend on that property.
One way to do this is to create a duplicate tf.Sequential model, copying over all the layers (except the last) from the original.
let m2 = tf.sequential();
m2.add(model.layers[0]);
m2.add(model.layers[1]);
Then m2.predict() will output the logits.

Custom loss function in Keras that penalizes output from intermediate layer

Imagine I have a convolutional neural network to classify MNIST digits, such as this Keras example. This is purely for experimentation so I don't have a clear reason or justification as to why I'm doing this, but let's say I would like to regularize or penalize the output of an intermediate layer. I realize that the visualization below does not correspond to the MNIST CNN example and instead just has several fully connected layers. However, to help visualize what I mean let's say I want to impose a penalty on the node values in layer 4 (either pre or post activation is fine with me).
In addition to having a categorical cross entropy loss term which is typical for multi-class classification, I would like to add another term to the loss function that minimizes the squared sum of the output at a given layer. This is somewhat similar in concept to l2 regularization, except that l2 regularization is penalizing the squared sum of all weights in the network. Instead, I am purely interested in the values of a given layer (e.g. layer 4) and not all the weights in the network.
I realize that this requires writing a custom loss function using the Keras backend to combine categorical cross-entropy and the penalty term, but I am not sure how to use an intermediate layer for the penalty term in the loss function. I would greatly appreciate help on how to do this. Thanks!
Actually, what you are interested in is regularization, and in Keras there are two different kinds of built-in regularization approaches available for most layers (e.g. Dense, Conv1D, Conv2D, etc.):
Weight regularization, which penalizes the weights of a layer. Usually, you can use kernel_regularizer and bias_regularizer arguments when constructing a layer to enable it. For example:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., kernel_regularizer=l1_l2, bias_regularizer=l1_l2)
Activity regularization, which penalizes the output (i.e. activation) of a layer. To enable this, you can use activity_regularizer argument when constructing a layer:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., activity_regularizer=l1_l2)
Note that you can set activity regularization through activity_regularizer argument for all the layers, even custom layers.
In both cases, the penalties are added to the model's loss, and the resulting total is the final loss value that the optimizer minimizes during training.
Further, besides the built-in regularization methods (i.e. L1 and L2), you can define your own custom regularizer method (see Developing new regularizers). As always, the documentation provides additional information which might be helpful as well.
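Per the Keras docs, a custom regularizer can be any callable that takes a tensor and returns a scalar penalty. A minimal sketch, with an arbitrary coefficient:

import tensorflow as tf

def my_regularizer(x):
    # Returns a scalar penalty; here an L1-style sum of absolute values.
    return 1e-3 * tf.reduce_sum(tf.abs(x))

layer = tf.keras.layers.Dense(10, activity_regularizer=my_regularizer)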
Just specify the hidden layer as an additional output. As tf.keras.Models can have multiple outputs, this is totally allowed. Then define your custom loss using both values.
Extending your example:
input = tf.keras.Input(...)
x1 = tf.keras.layers.Dense(10)(input)
x2 = tf.keras.layers.Dense(10)(x1)
x3 = tf.keras.layers.Dense(10)(x2)
model = tf.keras.Model(inputs=[input], outputs=[x3, x2])
For the custom loss function, I think it's something like this:
def custom_loss(y_true, y_pred):
    x3, x2 = y_pred  # outputs were defined as [x3, x2] above
    label = y_true  # you might need to provide a dummy target for x2
    return f1(x2) + f2(label, x3)  # whatever you want to do with f1, f2
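Note that when a compiled Keras model has multiple outputs, losses are normally applied per output rather than receiving all outputs at once, so something along these lines may be closer to what Keras actually expects (the loss weights are arbitrary, and the penalty output needs a dummy target in the training data):

import tensorflow as tf

def penalty_loss(y_true, y_pred):
    # y_true is a dummy target here; only the activations are penalized.
    return tf.reduce_sum(tf.square(y_pred))

model.compile(
    optimizer='adam',
    loss=['categorical_crossentropy', penalty_loss],
    loss_weights=[1.0, 0.01],  # hypothetical weighting of the penalty term
)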
Another way to add loss based on input or calculations at a given layer is to use the add_loss() API. If you are already creating a custom layer, the custom loss can be added directly to the layer. Or a custom layer can be created that simply takes the input, calculates and adds the loss, and then passes the unchanged input along to the next layer.
Here is the code, lightly adapted from the documentation (in case the link is ever broken):
import tensorflow as tf
from tensorflow.keras.layers import Layer

class MyActivityRegularizer(Layer):
    """Layer that creates an activity sparsity regularization loss."""

    def __init__(self, rate=1e-2):
        super(MyActivityRegularizer, self).__init__()
        self.rate = rate

    def call(self, inputs):
        # We use `add_loss` to create a regularization loss
        # that depends on the inputs.
        self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
        return inputs
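As a hypothetical usage example, such a layer can be dropped between any two layers; it passes its input through unchanged and only contributes the penalty:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    MyActivityRegularizer(rate=1e-2),  # penalizes the previous layer's output
    tf.keras.layers.Dense(10, activation='softmax'),
])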

How to specify custom weight updates in tensorflow custom optimizer

In a custom optimizer I would like to update weights with random values if the loss function has not decreased.
However, I cannot see how to do that in the methods you can override (_resource_apply_dense, _resource_apply_sparse, _create_slots, get_config). None of them is passed the loss function.
I have tried overriding minimize(), but that is not called in a standard training loop.
Any ideas?
If you are writing a custom optimizer, I think the easiest way to apply it is to define the layers explicitly as well. In a standard feedforward neural network, if x is the input, then h = tf.tanh(tf.matmul(x, W) + b) is an example of a first hidden layer, and you can stack more layers the same way. Then W and b are the variables you need to update. The training loop would look something like this:
trainable_variables = [W, b]
for i in range(1000):
    optimizer.minimize(loss, trainable_variables)
but with your own optimizer instead of the one from Keras.
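To make that concrete, here is a minimal self-contained sketch under the same assumptions (the data, shapes, and the SGD stand-in are all made up; note that in TF2, minimize expects the loss as a callable):

import tensorflow as tf

# Hypothetical data and parameters for a one-layer network.
x = tf.random.normal([32, 10])
y = tf.random.normal([32, 1])
W = tf.Variable(tf.random.normal([10, 1]))
b = tf.Variable(tf.zeros([1]))

def loss():
    h = tf.tanh(tf.matmul(x, W) + b)
    return tf.reduce_mean(tf.square(h - y))

# Replace SGD with your custom optimizer.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
trainable_variables = [W, b]
for i in range(1000):
    optimizer.minimize(loss, trainable_variables)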

How to apply dropConnect on recurrent weights in keras

I am using Keras and I want to apply DropConnect to the hidden-to-hidden weights in an LSTM. I found that Keras only allows applying dropout to the hidden states (via recurrent_dropout).
I am trying to make a custom implementation of this. I am trying to create a custom recurrent_regularizer using the following:
def dropConnect_reg(weight_matrix):
    return tf.nn.dropout(weight_matrix, rate=0.5)
Then I use it as follows (the task is language modelling, so I apply a softmax layer over the vocabulary):
model.add(LSTM(650, return_sequences=True, recurrent_regularizer=dropConnect_reg))
model.add(Dense(vocab_size, activation='softmax'))
However, I don't think this works properly. Without the custom recurrent_regularizer, the loss is a scalar as expected (the categorical cross-entropy loss); with it, the loss comes out as a full array instead of a single number (dimensions: time_steps, time_steps*4). I am also not sure whether this is applied only during training, as intended.
Any ideas on how to properly implement this?
If you only want the last output from the LSTM, you'll want to set return_sequences=False. If you only want to apply the dropout during training, you'll need something like this:
def dropconnect_regularizer(weight_matrix):
    return tf.nn.dropout(weight_matrix, rate=0.5)

if training:  # `training` is a flag you set when building the model
    regularizer = dropconnect_regularizer
else:
    regularizer = None

model.add(LSTM(650, recurrent_regularizer=regularizer))
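One caveat worth adding: Keras expects a regularizer to return a scalar penalty that is added to the loss, whereas tf.nn.dropout returns a tensor of the same shape as the weight matrix, which is why the loss shows up as a full array. A regularizer that correctly reduces to a scalar, following the question's setup (the coefficient is arbitrary), would look like:

import tensorflow as tf

def recurrent_l2(weight_matrix):
    # A valid regularizer returns a single scalar penalty.
    return 0.01 * tf.reduce_sum(tf.square(weight_matrix))

model.add(tf.keras.layers.LSTM(650, return_sequences=True,
                               recurrent_regularizer=recurrent_l2))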

Guided Back-propagation in TensorFlow

I would like to implement in TensorFlow the technique of "guided back-propagation" introduced in this paper and described in this recipe.
Computationally, that means that when I compute the gradient, e.g., of the input w.r.t. the output of the NN, I will have to modify the gradients computed at every ReLU unit. Concretely, the back-propagated signal at those units must be thresholded at zero to make this technique work. In other words, the partial derivatives at the ReLUs must be ignored wherever the back-propagated signal is negative.
Given that I am interested in applying these gradient computations only to test examples, i.e., I don't want to update the model's parameters, how shall I do it?
I tried (unsuccessfully) two things so far:
Use tf.py_func to wrap my simple numpy version of a ReLU, which is then eligible to have its gradient operation redefined via the g.gradient_override_map context manager.
Gather the forward/backward values during backprop and apply the thresholding to those stemming from ReLUs.
I failed with both approaches because they require some knowledge of the internals of TF that currently I don't have.
Can anyone suggest any other route, or sketch the code?
Thanks a lot.
The better solution (your approach 1) is to use ops.RegisterGradient together with tf.Graph.gradient_override_map. Together they override the gradient computation for a pre-defined op, e.g. Relu, within the gradient_override_map context, using only Python code.
@ops.RegisterGradient("GuidedRelu")
def _GuidedReluGrad(op, grad):
    return tf.where(0. < grad,
                    gen_nn_ops._relu_grad(grad, op.outputs[0]),
                    tf.zeros(grad.get_shape()))
...
with g.gradient_override_map({'Relu': 'GuidedRelu'}):
    y = tf.nn.relu(x)
Here is a full example implementation of guided ReLU: https://gist.github.com/falcondai/561d5eec7fed9ebf48751d124a77b087
Update: in TensorFlow >= 1.0, tf.select was renamed to tf.where. I updated the snippet accordingly. (Thanks @sbond for bringing this to my attention.)
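For completeness: graph-level gradient overrides are not available in TensorFlow 2's eager mode, but a similar effect can be sketched with tf.custom_gradient (the function name is my own):

import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    y = tf.nn.relu(x)
    def grad(dy):
        # Guided backprop: zero the gradient wherever the upstream signal
        # or the forward activation is negative.
        return dy * tf.cast(dy > 0., dy.dtype) * tf.cast(y > 0., dy.dtype)
    return y, grad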
tf.gradients has a grad_ys parameter that can be used for this purpose. Suppose your network has just one ReLU layer, as follows:
before_relu = f1(inputs, params)
after_relu = tf.nn.relu(before_relu)
loss = f2(after_relu, params, targets)
First, compute the derivative up to after_relu.
Dafter_relu = tf.gradients(loss, after_relu)[0]
Then threshold the gradients you send back down:
Dafter_relu_thresholded = tf.where(Dafter_relu < 0.0,
                                   tf.zeros_like(Dafter_relu),
                                   Dafter_relu)  # was tf.select before TF 1.0
Finally, compute the actual gradients w.r.t. params:
Dparams = tf.gradients(after_relu, params, grad_ys=Dafter_relu_thresholded)
You can easily extend this same method to a network with many ReLU layers.