Wrong gradient from tf custom gradient - even though gradient is implemented using the inbuilt Jacobian - tensorflow

I'm trying to write a wrapper around a model, such that the tf model can be called as a function of its weights (and input). However this wrapper returns different gradients than the gradients fromt the original model. Details in the code below (including a colab notebook to reproduce directly), but at the core I'm using the custom gradient decorator - the respective gradient is computed directly as the upstream 'gradient' matmul (via tensordot) the respective jacobian.
To make this clear: I'm computing the gradient for a model, once directly, once by using my custom wrapper. In both cases the parameters in the model are the same. The Jacobian is implemented by TF, so nothing should be wrong there. Still the resulting gradient seems to be wrong.
I'm not sure, whether this is a coding mistake I made somewhere, or possibly just a numeric problem stemming from the Jacobian matmul - however my tests regarding correlation of the gradients suggest this is more than a numeric issue for now. Code of the function is provided below, a link to colab notebook reproducing the problem can be found here: Colab Notebook reproducing the problem
Why: This is important for a bunch of metalearning, which I'm trying to build a small library for currently.
My current 'wrapper' looks something like this:
#calls model on input x but replaces internal weights with the weights argument
#critically supposed to compute the respective gradient for the weights tensor argument!
def call_model_with_weights(model, x, weights, dim_output=2):
#tf.custom_gradient
def _call_with_weights(x_and_w):
x, weights = x_and_w
#be careful; this assigns weights to the model as a side effect, can ignore for dummy version
ctrls = [var.assign(val) for var, val in zip(model.trainable_weights, weights)]
with tf.control_dependencies(ctrls):
with tf.GradientTape() as tape:
y = model(x)
jacobians = tape.jacobian(y, model.trainable_weights)
def grad(upstream, variables):
assert len(variables)==len(weights)
#gradient for each weight should be upstream dotproduct respective jacobian
dy_dw = [tf.tensordot(upstream, j, axes=[list(range(dim_output)), list(range(dim_output))]) for j in jacobians]
dy_dw_weights = dy_dw
return (None, dy_dw_weights), [None for _ in dy_dw] # returning x as derivative of x is wrong, but not important here rn
return y, grad
y = _call_with_weights((x, weights))
return y
Thanks a lot for any help (including how this could be done in a more elegant way), helping out means you are contributing to package that plans to mimic PyTorch 'higher' for TF which I hope helps some more people <3

Related

TensorFlow / Keras Model as function of its weights (and inputs) / Making a tf model pure functional, non-stateful

As the title suggests I'm looking for a way to copy an arbitrary tensorflow (/keras) model, such that I can run the same computational graph from the model, but have the weights (or different weights or a tensor copy of them) as part of the input of the function, something like the following, where an implementation (or idea how to implement) 'smart_copy_function' is what is missing:
model = some_model() #a TF/Keras Model
model_from_weights = smart_copy_function(model)
x = tf.some_model_input # imagine some random input to the model here, e.g. an mnist image
y_model = model(x) #normal way to call model
y_model_from_weights = model_from_weights(model.trainable_weights, x) #same call to copied model
y_model == y_model_from_weights #should be True
I believe this should be doable in a somewhat easy way, as the respective computational graph does exist in TF anyway already.
'This sounds stupid, why would you do this?': I want to build an analog to the PyTorch MetaLearning Framework higher for TensorFlow, since gradients through calls of TF optimizer 'apply_gradients' and variable 'assign' are not supported. To achieve such gradients through parameter gradient updates the above seems to be the way to go. Such gradients through parameter updates are in turn very important for Meta-Learning Research, with papers like 'Model-Agnostic Meta Learning' and 'Teaching with Commentaries' being somewhat famous examples / use cases.

Use TensorFlow loss Global Objectives (recall_at_precision_loss) with Keras (not metrics)

Background
I have a multi-label classification problem with 5 labels (e.g. [1 0 1 1 0]). Therefore, I want my model to improve at metrics such as fixed recall, precision-recall AUC or ROC AUC.
It doesn't make sense to use a loss function (e.g. binary_crossentropy) that is not directly related to the performance measurement I want to optimize. Therefore, I want to use TensorFlow's global_objectives.recall_at_precision_loss() or similar as loss function.
Relevant GitHub:
https://github.com/tensorflow/models/tree/master/research/global_objectives
Relevant paper (Scalable Learning of Non-Decomposable Objectives): https://arxiv.org/abs/1608.04802
Not metric
I'm not looking for implementing a tf.metrics. I already succeeded in that following: https://stackoverflow.com/a/50566908/3399066
Problem
I think my issue can be divided into 2 problems:
How to use global_objectives.recall_at_precision_loss() or similar?
How to use it in a Keras model with TF backend?
Problem 1
There is a file called loss_layers_example.py on the global objectives GitHub page (same as above). However, since I don't have much experience with TF, I don't really understand how to use it. Also, Googling for TensorFlow recall_at_precision_loss example or TensorFlow Global objectives example won't give me any clearer example.
How do I use global_objectives.recall_at_precision_loss() in a simple TF example?
Problem 2
Would something like (in Keras): model.compile(loss = ??.recall_at_precision_loss, ...) be enough?
My feeling tells me it is more complex than that, due to the use of global variables used in loss_layers_example.py.
How to use loss functions similar to global_objectives.recall_at_precision_loss() in Keras?
Similar to Martino's answer, but will infer shape from input (setting it to a fixed batch size did not work for me).
The outside function isn't strictly necessary, but it feels a bit more natural to pass params as you configure the loss function, especially when your wrapper is defined in an external module.
import keras.backend as K
from global_objectives.loss_layers import precision_at_recall_loss
def get_precision_at_recall_loss(target_recall):
def precision_at_recall_loss_wrapper(y_true, y_pred):
y_true = K.reshape(y_true, (-1, 1))
y_pred = K.reshape(y_pred, (-1, 1))
return precision_at_recall_loss(y_true, y_pred, target_recall)[0]
return precision_at_recall_loss_wrapper
Then, when compiling the model:
TARGET_RECALL = 0.9
model.compile(optimizer='adam', loss=get_precision_at_recall_loss(TARGET_RECALL))
I managed to make it work by:
Explicitly reshaping tensors to BATCH_SIZE length (see code below)
Cutting the dataset size to a multiple of BATCH_SIZE
def precision_recall_auc_loss(y_true, y_pred):
y_true = keras.backend.reshape(y_true, (BATCH_SIZE, 1))
y_pred = keras.backend.reshape(y_pred, (BATCH_SIZE, 1))
util.get_num_labels = lambda labels : 1
return loss_layers.precision_recall_auc_loss(y_true, y_pred)[0]

How to wrap a custom TensorFlow loss function in Keras?

This is my third attempt to get a deep learning project off the ground. I'm working with protein sequences. First I tried TFLearn, then raw TensorFlow, and now I'm trying Keras.
The previous two attempts taught me a lot, and gave me some code and concepts that I can re-use. However there has always been an obstacle, and I've asked questions that the developers can't answer (in the case of TFLearn), or I've simply gotten bogged down (TensorFlow object introspection is tedious).
I have written this TensorFlow loss function, and I know it works:
def l2_angle_distance(pred, tgt):
with tf.name_scope("L2AngleDistance"):
# Scaling factor
count = tgt[...,0,0]
scale = tf.to_float(tf.count_nonzero(tf.is_finite(count)))
# Mask NaN in tgt
tgt = tf.where(tf.is_nan(tgt), pred, tgt)
# Calculate L1 losses
losses = tf.losses.cosine_distance(pred, tgt, -1, reduction=tf.losses.Reduction.NONE)
# Square the losses, then sum, to get L2 scalar loss.
# Divide the loss result by the scaling factor.
return tf.reduce_sum(losses * losses) / scale
My target values (tgt) can include NaN, because my protein sequences are passed in a 4D Tensor, despite the fact that the individual sequences differ in length. Before you ask, the data can't be resampled like an image. So I use NaN in the tgt Tensor to indicate "no prediction needed here." Before I calculate the L2 cosine loss, I replace every NaN with the matching values in the prediction (pred) so the loss for every NaN is always zero.
Now, how can I re-use this function in Keras? It appears that the Keras Lambda core layer is not a good choice, because a Lambda only takes a single argument, and a loss function needs two arguments.
Alternately, can I rewrite this function in Keras? I shouldn't ever need to use the Theano or CNTK backend, so it isn't necessary for me to rewrite my function in Keras. I'll use whatever works.
I just looked at the Keras losses.py file to get some clues. I imported keras.backend and had a look around. I also found https://keras.io/backend/. I don't seem to find wrappers for ANY of the TensorFlow function calls I happen to use: to_float(), count_nonzero(), is_finite(), where(), is_nan(), cosine_distance(), or reduce_sum().
Thanks for your suggestions!
I answered my own question. I'm posting the solution for anyone who may come across this same problem.
I tried using my TF loss function directly in Keras, as was independently suggested by Matias Valdenegro. I did not provoke any errors from Keras by doing so, however, the loss value went immediately to NaN.
Eventually I identified the problem. The calling convention for a Keras loss function is first y_true (which I called tgt), then y_pred (my pred). But the calling convention for a TensorFlow loss function is pred first, then tgt. So if you want to keep a Tensorflow-native version of the loss function around, this fix works:
def keras_l2_angle_distance(tgt, pred):
return l2_angle_distance(pred, tgt)
<snip>
model.compile(loss = keras_l2_angle_distance, optimizer = "something")
Maybe Theano or CNTK uses the same parameter order as Keras, I don't know. But I'm back in business.
You don't need to use keras.backend, as your loss is directly written in TensorFlow, then you can use it directly in Keras. The backend functions are an abstraction layer so you can code a loss/layer that will work with the multiple available backends in Keras.
You just have to put your loss in the model.compile call:
model.compile(loss = l2_angle_distance, optimizer = "something")

Guided Back-propagation in TensorFlow

I would like to implement in TensorFlow the technique of "Guided back-propagation" introduced in this Paper and which is described in this recipe .
Computationally that means that when I compute the gradient e.g., of the input wrt. the output of the NN, I will have to modify the gradients computed at every RELU unit. Concretely, the back-propagated signal on those units must be thresholded on zero, to make this technique work. In other words the partial derivative of the RELUs that are negative must be ignored.
Given that I am interested in applying these gradient computations only on test examples, i.e., I don't want to update the model's parameters - how shall I do it?
I tried (unsuccessfully) two things so far:
Use tf.py_func to wrap my simple numpy version of a RELU, which then is eligible to redefine it's gradient operation via the g.gradient_override_map context manager.
Gather the forward/backward values of BackProp and apply the thresholding on those stemming from Relus.
I failed with both approaches because they require some knowledge of the internals of TF that currently I don't have.
Can anyone suggest any other route, or sketch the code?
Thanks a lot.
The better solution (your approach 1) with ops.RegisterGradient and tf.Graph.gradient_override_map. Together they override the gradient computation for a pre-defined Op, e.g. Relu within the gradient_override_map context using only python code.
#ops.RegisterGradient("GuidedRelu")
def _GuidedReluGrad(op, grad):
return tf.where(0. < grad, gen_nn_ops._relu_grad(grad, op.outputs[0]), tf.zeros(grad.get_shape()))
...
with g.gradient_override_map({'Relu': 'GuidedRelu'}):
y = tf.nn.relu(x)
here is the full example implementation of guided relu: https://gist.github.com/falcondai/561d5eec7fed9ebf48751d124a77b087
Update: in Tensorflow >= 1.0, tf.select is renamed to tf.where. I updated the snippet accordingly. (Thanks #sbond for bringing this to my attention :)
The tf.gradients has the grad_ys parameter that can be used for this purpose. Suppose your network has just one relu layer as follows :
before_relu = f1(inputs, params)
after_relu = tf.nn.relu(before_relu)
loss = f2(after_relu, params, targets)
First, compute the derivative up to after_relu.
Dafter_relu = tf.gradients(loss, after_relu)[0]
Then threshold your gradients that you send down.
Dafter_relu_thresholded = tf.select(Dafter_relu < 0.0, 0.0, Dafter_relu)
Compute the actual gradients w.r.t to params.
Dparams = tf.gradients(after_relu, params, grad_ys=Dafter_relu_thresholded)
You can easily extend this same method for a network with many relu layers.

How do I get the gradient of the loss at a TensorFlow variable?

The feature I'm after is to be able to tell what the gradient of a given variable is with respect to my error function given some data.
One way to do this would be to see how much the variable has changed after a call to train, but obviously that can vary massively based on the learning algorithm (for example it would be almost impossible to tell with something like RProp) and just isn't very clean.
Thanks in advance.
The tf.gradients() function allows you to compute the symbolic gradient of one tensor with respect to one or more other tensors—including variables. Consider the following simple example:
data = tf.placeholder(tf.float32)
var = tf.Variable(...) # Must be a tf.float32 or tf.float64 variable.
loss = some_function_of(var, data) # some_function_of() returns a `Tensor`.
var_grad = tf.gradients(loss, [var])[0]
You can then use this symbolic gradient to evaluate the gradient in some specific point (data):
sess = tf.Session()
var_grad_val = sess.run(var_grad, feed_dict={data: ...})
In TensorFlow 2.0 you can use GradientTape to achieve this. GradientTape records the gradients of any computation that happens in the context of that. Below is an example of how you might do that.
import tensorflow as tf
# Here goes the neural network weights as tf.Variable
x = tf.Variable(3.0)
# TensorFlow operations executed within the context of
# a GradientTape are recorded for differentiation
with tf.GradientTape() as tape:
# Doing the computation in the context of the gradient tape
# For example computing loss
y = x ** 2
# Getting the gradient of network weights w.r.t. loss
dy_dx = tape.gradient(y, x)
print(dy_dx) # Returns 6