Custom external loss metric for Gradient Optimizer? - optimization

I have an external function which takes y and y_prediction (in matrix format), and computes a metric which depicts how good or bad the prediction actually is.
Unfortunately the metric is no simple y - ypred or confusion matrix, but still very useful and important. How can I use this number computed for the loss or as an argument for optimizer.minimize?

If I understood correctly, I think there are two ways to do this:
Either the loss you want to compute can be written as TensorFlow ops whose gradients are defined (sadly, SVD for example has no gradient defined in the TensorFlow library), in which case the optimization is direct.
Or you can always write your loss function with numpy operators and use tf.py_func() (https://www.tensorflow.org/api_docs/python/tf/py_func), and then you have to specify the gradient by hand, as explained here: How to make a custom activation function with only Python in Tensorflow?
But you have to know an explicit formula for your gradient ...
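For the second route, here is a minimal sketch of the tf.py_func pattern (TF 1.x style; the MSE metric, its hand-derived gradient, and all names here are illustrative stand-ins for your external function):

import numpy as np
import tensorflow as tf

# stand-in for the external metric, computed in numpy
def my_metric_np(y, y_pred):
    return np.mean((y_pred - y) ** 2).astype(np.float32)

# hand-derived gradient of the metric w.r.t. y_pred
def my_metric_grad_np(y, y_pred):
    return ((2.0 / y_pred.size) * (y_pred - y)).astype(np.float32)

# register the hand-written gradient once, then map the PyFunc op onto it
@tf.RegisterGradient("MyMetricGrad")
def _my_metric_grad(op, grad):
    y_in, y_pred_in = op.inputs
    dpred = tf.py_func(my_metric_grad_np, [y_in, y_pred_in], tf.float32)
    dpred.set_shape(y_pred_in.get_shape())
    return None, grad * dpred          # no gradient w.r.t. the labels

def my_metric_loss(y, y_pred):
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": "MyMetricGrad"}):
        loss = tf.py_func(my_metric_np, [y, y_pred], tf.float32)
        loss.set_shape([])
    return loss

With this in place, optimizer.minimize(my_metric_loss(y, y_pred)) works, because tf.gradients now knows how to propagate through the py_func node.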

Related

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
The tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals (a residual is a Python callable returning a tensor) and variables (a list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every learning case and aggregate all the results into a tensor. The second residual must return all labels as a single constant tensor. The problem is that the tf.keras.Sequential.predict function returns a numpy array instead of a tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate Jacobians with respect to the variables.
The same problem is with variables. It doesn't seem like there's a way to extract all weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam, consume gradients as input and update tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co, which only apply one iteration and actually change the values of tf.Variables via assign_add.
Stepping back a bit to the theoretical big picture, Levenberg-Marquardt is not a general gradient-descent-like solver for arbitrary nonlinear optimization problems (as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, it repeatedly solves the linearized problem jacobian @ delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses Cholesky decomposition on the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so in the case of neural nets it can only be applied to fairly small ones.
Also note that Levenberg-Marquardt is usually applied as a batch algorithm, not a minibatch algorithm like SGD, though there's nothing stopping you from applying the LM iteration on different minibatches in each iteration.
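To make the linearized solve concrete, here is a toy, hand-rolled version of one such step on a two-parameter line fit; this is my own illustration of the update, not tfg's internal code:

import tensorflow as tf

x = tf.constant([0.0, 1.0, 2.0, 3.0])
y = tf.constant([1.0, 3.0, 5.0, 7.0])        # generated with a = 2, b = 1
params = tf.Variable([0.0, 0.0])             # [a, b]

with tf.GradientTape() as tape:
    residuals = params[0] * x + params[1] - y           # shape [4]
jacobian = tape.jacobian(residuals, params)              # shape [4, 2]

# solve the linearized problem jacobian @ delta_params = residuals
delta = tf.linalg.lstsq(jacobian, residuals[:, None])    # uses the Gram matrix internally
params.assign_sub(tf.squeeze(delta, axis=-1))

Because these toy residuals are linear in the parameters, this single solve already lands on a = 2, b = 1.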
I think you may only be able to get one iteration out of tfg's LM algorithm, through something like
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

for input_batch, target_batch in dataset:

    def residual_fn(trainable_params):
        # do not use trainable_params: it will still be at its initial value,
        # since we only do one iteration of Levenberg-Marquardt each time
        return model(input_batch) - target_batch

    new_objective_value, new_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=1)
    for var, new_param in zip(model.trainable_variables, new_params):
        var.assign(new_param)
In contrast, I believe the following naive method, where we assign model parameters before computing the residuals, will not work:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

dataset_iterator = ...

def residual_fn(params):
    input_batch, target_batch = next(dataset_iterator)
    for var, param in zip(model.trainable_variables, params):
        var.assign(param)
    return model(input_batch) - target_batch

final_objective, final_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=10000)
for var, final_param in zip(model.trainable_variables, final_params):
    var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients wrt its input params, since this dependency goes through a tf.assign. But it might even fail before that due to using constructs that are disallowed in graph mode.
Overall, I believe it's best to write your own LM optimizer that works on tf.Variables, since tfg.math.optimizer.levenberg_marquardt.minimize has a very different API that isn't really suited for optimizing Keras model parameters: you can't directly compute model(input, parameters) - target_value without a tf.assign.
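For reference, here is a rough sketch of what such a hand-written step on tf.Variables could look like for a small Keras model. This is my own Gauss-Newton/LM-style update, not tfg's API, and it only makes sense while the parameter count stays small enough for the cubic-cost solve:

import tensorflow as tf

def lm_step(model, x_batch, y_batch, damping=1e-3):
    variables = model.trainable_variables
    with tf.GradientTape() as tape:
        residuals = tf.reshape(model(x_batch) - y_batch, [-1])       # flat residual vector
    # per-variable Jacobians, flattened and concatenated into one [n_residuals, n_params] matrix
    per_var = tape.jacobian(residuals, variables)
    n_res = tf.shape(residuals)[0]
    jacobian = tf.concat([tf.reshape(j, [n_res, -1]) for j in per_var], axis=1)
    # damped least-squares solve of jacobian @ delta = residuals
    delta = tf.squeeze(
        tf.linalg.lstsq(jacobian, residuals[:, None], l2_regularizer=damping), axis=-1)
    # scatter the flat update back onto the individual variables
    offset = 0
    for var in variables:
        size = var.shape.num_elements()
        var.assign_sub(tf.reshape(delta[offset:offset + size], var.shape))
        offset += size
    return tf.reduce_sum(residuals ** 2)

A real LM implementation would also adapt the damping depending on whether the step reduced the objective; the sketch keeps it fixed for brevity.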

Get values of tensors in loss function

I would like to get the values of the y_pred and y_true tensors of this Keras backend function. I need this to perform some custom calculations and change the loss; these calculations are only possible with the real array values.
def mean_squared_error(y_true, y_pred):
    # some code here
    return K.mean(K.square(y_pred - y_true), axis=-1)
Is there a way to do this in Keras? Or in any other ML framework (TF, PyTorch, Theano)?
No, in general you can't compute the loss that way, because Keras is based on frameworks that do automatic differentiation (like Theano, TensorFlow) and they need to know which operations you are doing in between in order to compute the gradients of the loss.
You need to implement your loss computations using keras.backend functions; otherwise there is no way to compute gradients, and optimization won't be possible.
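For example, a custom loss built entirely from backend ops stays differentiable end to end (the weighting term below is just an illustration, not something from the question):

from keras import backend as K

def weighted_mse(y_true, y_pred):
    weights = 1.0 + K.abs(y_true)        # emphasise large targets (illustrative)
    return K.mean(weights * K.square(y_pred - y_true), axis=-1)

It can then be passed to model.compile(loss=weighted_mse) like any built-in loss.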
Try including this within the loss function:
y_true = keras.backend.print_tensor(y_true, message='y_true')
Following is an excerpt from the Keras documentation (https://keras.io/backend/):
print_tensor
keras.backend.print_tensor(x, message='')
Prints message and the tensor value when evaluated.
Note that print_tensor returns a new tensor identical to x which should be used in the later parts of the code. Otherwise, the print operation is not taken into account during evaluation.
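Putting that together, a sketch of the loss above with both tensors printed (note that the returned tensors are the ones actually used afterwards):

from keras import backend as K

def mean_squared_error(y_true, y_pred):
    # use the returned tensors, otherwise the print ops are pruned from the graph
    y_true = K.print_tensor(y_true, message='y_true = ')
    y_pred = K.print_tensor(y_pred, message='y_pred = ')
    return K.mean(K.square(y_pred - y_true), axis=-1)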

Convolutional Neural Network Loss

While calculating the loss function, can I manually calculate the loss like
Loss = tf.reduce_mean(tf.square(np.array(Prediction) - np.array(Y)))
and then optimize this loss using the Adam optimizer?
No.
TensorFlow loss functions typically accept tensors as input and also output a tensor, so np.array() wouldn't work.
In the case of CNNs, you'd generally come across loss functions like cross-entropy, softmax cross-entropy, sigmoid cross-entropy etc. These are already built into the tf.losses module, so you can use them directly.
The loss function that you're trying to apply looks like a mean-squared loss. This is built into tf.losses as well: tf.losses.mean_squared_error.
Having said that, I've also implemented a few loss functions like cross-entropy using a hand-coded formula such as -tf.reduce_mean(tf.reduce_sum(targets * logProb)). This works equally well, as long as the inputs targets and logProb are computed as tensors and not as numpy arrays.
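To illustrate with the asker's mean-squared loss, here is the same idea built from tensors instead of numpy arrays (TF 1.x style; the placeholder shapes and the stand-in dense layer are only for illustration):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])
y = tf.placeholder(tf.float32, [None, 1])
prediction = tf.layers.dense(x, 1)                   # stand-in for the CNN output

loss = tf.reduce_mean(tf.square(prediction - y))     # or tf.losses.mean_squared_error(y, prediction)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)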
No, you actually need to use tensors (tf.Variables or ops) for the loss, not numpy arrays like np.array(Prediction), since TensorFlow evaluates these tensors in the TensorFlow engine.

Any way to backprop derivatives when derivatives of the custom loss function are calculated by myself

I have been using TensorFlow to train deep NN acoustic models for speech recognition for a while. The loss function I use is cross-entropy and the NN models perform very well. Now I want to change the loss function to a more complex one named MMI (Maximum Mutual Information), which is also a classical criterion used in the speech recognition domain. I put one paper here which describes this loss function, in case you are interested.
When using this special loss function, the derivatives of the loss function w.r.t. the activations of the output layer can be computed by some special algorithms defined in the Hidden Markov Model scenario. It means that I can compute the derivatives of the loss function w.r.t. the activations of the output layer by myself, rather than just writing out the loss function and leaving TensorFlow to calculate the derivatives automatically.
But based on my limited experience, I don't know how to backprop the derivatives which I calculate by myself. Is there any way to do this without touching the TensorFlow C++ source code?
Probably yes, if all the computations involved use existing TensorFlow functions.
You just have to set up the chain of operations that compute the gradients from the current variables.
Then you just use tf.assign_add() on the variables with your gradients multiplied by minus the learning rate.
You are thus mimicking what TF usually does in the background.
EDIT: If the gradient calculations are done in numpy, for instance, you can use:
# perform numpy calculations
a = f(output_npy, variables_npy)
grad_from_user = tf.placeholder(tf.float32, a.shape)
grad_update = tf.assign_add(variables_tf, -lr * grad_from_user)
# and then
sess.run(grad_update, feed_dict={grad_from_user: a, ...})
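A fuller, self-contained version of the same pattern (TF 1.x; the toy objective and its hand-derived gradient are illustrative stand-ins for your MMI derivatives):

import numpy as np
import tensorflow as tf

def my_numpy_gradient(w):
    # hand-derived gradient of the toy objective sum(w**2), computed outside TF
    return 2.0 * w

weights = tf.Variable(np.ones(10, dtype=np.float32))
lr = 0.01
grad_from_user = tf.placeholder(tf.float32, shape=(10,))
grad_update = tf.assign_add(weights, -lr * grad_from_user)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        w_npy = sess.run(weights)
        sess.run(grad_update, feed_dict={grad_from_user: my_numpy_gradient(w_npy)})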

Guided Back-propagation in TensorFlow

I would like to implement in TensorFlow the technique of "guided back-propagation" introduced in this paper, which is described in this recipe.
Computationally, that means that when I compute the gradient, e.g., of the input w.r.t. the output of the NN, I will have to modify the gradients computed at every ReLU unit. Concretely, the back-propagated signal on those units must be thresholded at zero to make this technique work. In other words, the partial derivatives of the ReLUs that are negative must be ignored.
Given that I am interested in applying these gradient computations only on test examples, i.e., I don't want to update the model's parameters - how shall I do it?
I tried (unsuccessfully) two things so far:
Use tf.py_func to wrap my simple numpy version of a ReLU, which then is eligible to redefine its gradient operation via the g.gradient_override_map context manager.
Gather the forward/backward values of backprop and apply the thresholding on those stemming from ReLUs.
I failed with both approaches because they require some knowledge of the internals of TF that currently I don't have.
Can anyone suggest any other route, or sketch the code?
Thanks a lot.
The better solution is your approach 1, with ops.RegisterGradient and tf.Graph.gradient_override_map. Together they override the gradient computation for a pre-defined op, e.g. Relu, within the gradient_override_map context, using only Python code.
from tensorflow.python.framework import ops
from tensorflow.python.ops import gen_nn_ops

@ops.RegisterGradient("GuidedRelu")
def _GuidedReluGrad(op, grad):
    # pass the gradient through only where the incoming gradient is positive
    return tf.where(0. < grad,
                    gen_nn_ops._relu_grad(grad, op.outputs[0]),
                    tf.zeros(grad.get_shape()))
...
with g.gradient_override_map({'Relu': 'GuidedRelu'}):
    y = tf.nn.relu(x)
here is the full example implementation of guided relu: https://gist.github.com/falcondai/561d5eec7fed9ebf48751d124a77b087
Update: in TensorFlow >= 1.0, tf.select is renamed to tf.where. I updated the snippet accordingly. (Thanks @sbond for bringing this to my attention.)
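For completeness, here is a usage sketch for test-time saliency with the override above (TF 1.x; the toy conv model and the class index are illustrative, your real graph would go inside the override context):

import tensorflow as tf

g = tf.get_default_graph()
input_images = tf.placeholder(tf.float32, [None, 28, 28, 1])

with g.gradient_override_map({'Relu': 'GuidedRelu'}):
    h = tf.nn.relu(tf.layers.conv2d(input_images, 8, 3))
    logits = tf.layers.dense(tf.layers.flatten(h), 10)

target_score = logits[:, 3]                                   # score of an arbitrary class
guided_grads = tf.gradients(target_score, input_images)[0]    # thresholded at every ReLU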
tf.gradients has a grad_ys parameter that can be used for this purpose. Suppose your network has just one ReLU layer, as follows:
before_relu = f1(inputs, params)
after_relu = tf.nn.relu(before_relu)
loss = f2(after_relu, params, targets)
First, compute the derivative up to after_relu.
Dafter_relu = tf.gradients(loss, after_relu)[0]
Then threshold the gradients that you send down.
Dafter_relu_thresholded = tf.where(Dafter_relu < 0.0, tf.zeros_like(Dafter_relu), Dafter_relu)
Compute the actual gradients w.r.t. params.
Dparams = tf.gradients(after_relu, params, grad_ys=Dafter_relu_thresholded)
You can easily extend this same method to a network with many ReLU layers.
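For example, with two ReLU layers the same chaining could look like this (a sketch in the spirit of the snippet above; f1, f2b and f3 are placeholder functions):

h1 = tf.nn.relu(f1(inputs, params))
h2 = tf.nn.relu(f2b(h1, params))
loss = f3(h2, params, targets)

d_h2 = tf.gradients(loss, h2)[0]
d_h2 = tf.where(d_h2 < 0.0, tf.zeros_like(d_h2), d_h2)    # threshold at the outer ReLU

d_h1 = tf.gradients(h2, h1, grad_ys=d_h2)[0]
d_h1 = tf.where(d_h1 < 0.0, tf.zeros_like(d_h1), d_h1)    # threshold at the inner ReLU

d_params = tf.gradients(h1, params, grad_ys=d_h1)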