TensorFlow / PyTorch: Gradient for a loss which is measured externally

I am relatively new to Machine Learning and Python.
I have a system consisting of an NN whose output is fed into an unknown nonlinear function F, e.g. some hardware. The idea is to train the NN to be an inverse F^(-1) of that unknown nonlinear function F. This means that a loss L is calculated at the output of F. However, backpropagation cannot be used in a straightforward manner to calculate the gradients and update the NN weights, because the gradient of F is not known either.
Is there any way to use a loss function L which is not directly connected to the NN for the calculation of the gradients in TensorFlow or PyTorch? Or to take a loss obtained with any other software (Matlab, C, etc.) and use it for backpropagation?
As far as I know, keras.backend.gradients only allows calculating gradients with respect to connected weights; otherwise, the gradient is either zero or NoneType.
I read about the stop_gradient() function in TensorFlow, but I am not sure whether this is what I am looking for. It allows excluding some variables from the gradient computation during backpropagation. But I think the operation F is not interpreted as a variable anyway.
Can I define an arbitrary loss function (including a hardware measurement) and use it for backpropagation in TensorFlow, or is it required to be connected to the graph as well?
Please, let me know if my question is not specific enough.

AFAIK, all modern deep learning packages (PyTorch, TensorFlow, Keras, etc.) rely on gradient descent (and its many variants) to train networks.
As the name suggests, you cannot do gradient descent without gradients.
However, you might circumvent the "non-differentiability" of your given function F by looking at the problem from a slightly different perspective:
You are trying to learn a model M that "counters" the effect of F. So you have access to F (but not its gradients) and a set of representative inputs X = {x_0, x_1, ..., x_n}.
For each example x_i you can compute y_i = F(x_i), and your end goal is to have a model M that, given y_i, will output x_i.
Therefore, you can treat y_i as your model's input and compute a loss between M(y_i) and the x_i that produced it. This way you do not need to compute gradients through the "black box" F.
In PyTorch, for example, the training loop would look something like:
import torch

optimizer = torch.optim.SGD(M.parameters(), lr=1e-3)
for x in examples:
    y = F(x)                     # apply F to x - we only get the output, WITHOUT any gradients
    pred = M(y)                  # apply the trainable model M to the output of F
    loss = torch.norm(x - pred)  # the loss propagates gradients through M and stops at F
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Gradient calculation in A2C

In A2C, the actor-critic algorithm, the weights are updated via the equations:
delta = TD error,
theta = theta + alpha * delta * grad(log(PI(a|s,theta))), and
w = w + beta * delta * grad(V(s,w)).
So my question is, when using neural networks to implement this,
how are the gradients calculated and
am I correct that the weights are updated via the optimization methods in TensorFlow or PyTorch?
Thanks, Jon
I'm not quite clear what you mean to update with w, but I'll answer the question for theta, assuming it denotes the parameters of the actor model.
1) Gradients can be calculated in a variety of ways, but if focusing on PyTorch, you can call .backward() on f(x) = alpha * delta * log(PI(a|s,theta)), which will compute df/dx for every parameter x that is chained to f(x) via autograd.
2) You are indeed correct that the weights are updated via the optimization methods in PyTorch: autograd computes the gradients, and to complete the optimization step you must call .step() on whatever torch.optim optimizer you constructed over the network's parameters (e.g. weights and biases).
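For concreteness, here is a minimal PyTorch sketch of one actor update; the names (actor, state, action, delta) are illustrative assumptions, not from the question:

import torch

# Hypothetical setup: `actor` maps a state to action logits, and `delta` is the
# TD error already computed from the critic, treated as a constant via .detach().
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

dist = torch.distributions.Categorical(logits=actor(state))
log_prob = dist.log_prob(action)

# minimizing -delta * log pi(a|s) is gradient *ascent* on delta * log pi(a|s)
loss = -delta.detach() * log_prob
optimizer.zero_grad()
loss.backward()   # autograd computes d(loss)/d(theta) for all actor parameters
optimizer.step()  # the optimizer applies the parameter update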

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
The tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals (a residual is a Python callable returning a tensor) and variables (a list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every training sample and aggregate all the results into a tensor. The second residual must return all labels as a single constant tensor. The problem is that the tf.keras.Sequential.predict function returns a numpy array instead of a tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate Jacobians with respect to the variables.
The same problem exists for the variables. There doesn't seem to be a way to extract all the weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam, consume gradients as input and update tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co., which apply only one iteration at a time and actually change the values of tf.Variables via assign_add.
Stepping back a bit to the theoretical big picture, Levenberg-Marquardt is not a general gradient-descent-like solver for arbitrary nonlinear optimization problems (as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, it repeatedly solves the linearized problem Jacobian * delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses Cholesky decomposition on the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so for neural nets it can only be applied to fairly small ones.
Also note that Levenberg-Marquardt is usually applied as a batch algorithm, not a minibatch algorithm like SGD, though nothing stops you from applying the LM iteration to a different minibatch in each step. A minimal sketch of one such step follows.
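To make the lstsq step concrete, here is a minimal sketch of one linearized iteration, assuming a small flat parameter vector params and a residual callable residuals_of (both hypothetical names):

import tensorflow as tf

# One Gauss-Newton/LM-style step: linearize the residuals, solve the
# least-squares system J @ delta = -r, and apply delta as the update.
with tf.GradientTape() as tape:
    tape.watch(params)
    r = residuals_of(params)                      # shape [num_residuals]
jac = tape.jacobian(r, params)                    # shape [num_residuals, num_params]
delta = tf.linalg.lstsq(jac, -r[:, tf.newaxis])   # cubic cost in num_params
params = params + tf.squeeze(delta, axis=-1)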
I think you may only be able to get one iteration at a time out of tfg's LM algorithm, through something like:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

for input_batch, target_batch in dataset:
    def residual_fn(trainable_params):
        # do not use trainable_params here: it still holds the initial values,
        # since we only do one iteration of Levenberg-Marquardt each time
        return model(input_batch) - target_batch
    new_objective_value, new_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=1)
    for var, new_param in zip(model.trainable_variables, new_params):
        var.assign(new_param)
In contrast, I believe the following naive method, where we assign the model parameters before computing the residuals, will not work:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

dataset_iterator = ...

def residual_fn(params):
    input_batch, target_batch = next(dataset_iterator)
    for var, param in zip(model.trainable_variables, params):
        var.assign(param)
    return model(input_batch) - target_batch

final_objective, final_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=10000)
for var, final_param in zip(model.trainable_variables, final_params):
    var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients with respect to its input params, since that dependency goes through a tf.assign. It might even fail before that, due to using constructs that are disallowed in graph mode.
Overall, I believe it's best to write your own LM optimizer that works on tf.Variables, since tfg.math.optimizer.levenberg_marquardt.minimize has a very different API that is not really suited to optimizing Keras model parameters: you can't directly compute model(input, parameters) - target_value without a tf.assign.
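For example, a rough sketch of such a hand-rolled step on tf.Variables might look like the following; model, input_batch, and target_batch are assumed names, and no damping term is included, so this is closer to a plain Gauss-Newton step than true Levenberg-Marquardt:

import tensorflow as tf

def lm_step(model, input_batch, target_batch):
    variables = model.trainable_variables
    with tf.GradientTape() as tape:
        r = tf.reshape(model(input_batch) - target_batch, [-1])
    # flatten the per-variable Jacobians into one [num_residuals, num_params] matrix
    jacs = tape.jacobian(r, variables)
    jac = tf.concat([tf.reshape(j, [tf.shape(r)[0], -1]) for j in jacs], axis=1)
    delta = tf.linalg.lstsq(jac, -r[:, tf.newaxis])  # solve J @ delta = -r
    # scatter the flat update back into the individual variables
    offset = 0
    for v in variables:
        size = int(tf.size(v))
        v.assign_add(tf.reshape(delta[offset:offset + size, 0], v.shape))
        offset += size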

Is it possible to use Keras to optimize the coefficients of a mathematical function?

I'm very new to Keras and to neural networks in general, and I was wondering: if I have a list of points (x, y) that came from a quadratic function of the form ax^2 + bx + c, is it possible
to feed the points into a neural network and
get the coefficients a,b and c as an output from the network?
I know that I can simply use polynomial regression to achieve my goal; that is not the point.
If you are asking how to do polynomial regression using neural networks, here's the recipe.
Your dataset consists of points (x, y). Design your network to be a fully connected (dense) network with 1 input layer and 1 output layer. The input layer consists of 2 nodes, the output layer of 1 node. Then, feed your network the inputs x^2 and x. The output will be computed as:
y = w * X + c
where w is a matrix of learnable parameters. Specifically, it has shape 1x2, since it contains the parameters a and b; c is a bias. The input matrix X has shape 2xN, where N is the number of points in your dataset, and for each point the first component is x^2 and the second component is x.
As loss function, use the standard Mean Squared Error loss. As for the optimizer, simple Stochastic Gradient Descent should work just fine. At convergence, w and c will be good enough to approximate the true quadratic coefficients.
I don't know Keras well, but I think it won't be tough to figure out how to implement this simple network yourself.
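A minimal Keras sketch of this recipe (the coefficients a=2, b=-3, c=1 and the use of Adam instead of plain SGD, for faster convergence, are my own illustrative choices):

import numpy as np
import tensorflow as tf

# Synthetic points from y = 2x^2 - 3x + 1
x = np.random.uniform(-3, 3, size=(1000, 1)).astype("float32")
y = 2 * x**2 - 3 * x + 1

# Feed [x^2, x]; a single Dense(1) unit then learns a, b (kernel) and c (bias)
X = np.concatenate([x**2, x], axis=1)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(2,))])
model.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss="mse")
model.fit(X, y, epochs=100, verbose=0)

(a, b), (c,) = (w.ravel() for w in model.layers[0].get_weights())
print(a, b, c)  # should approach 2, -3, 1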

Tensorflow: Accumulating gradients of a Tensor

TL;DR: you can just skip straight to the question at the end of this post.
Suppose I have an Encoder-Decoder Neural Network, with weights W_1 and W_2 for the encoder and decoder respectively. Let's denote by Z the output of the encoder. The network is trained with batch size n, and all the gradients are calculated with respect to the mean loss over the batch, L_hat = (1/n) * sum_i L_i, where L_i is the per-sample loss.
What I'm trying to achieve is, in the backward pass, to manipulate the gradients of Z before passing them further down to the encoder's weights W_1, i.e. to apply some modified-gradient operator to grad_z.
The above, in the case of a synchronous pass (first calculate the modified gradients of Z, then propagate them down to W_1), is very easy to implement (the Jacobian multiplication is done using the grad_ys argument of tf.gradients):
def modify_grad(grad_z):
    # do some modifications
    ...

grad_z = tf.gradients(L_hat, Z)
mod_grad_z = modify_grad(grad_z)
mod_grad_w1 = tf.gradients(Z, W_1, grad_ys=mod_grad_z)
The problem is that I need to accumulate the gradients grad_z of the tensor Z over several batches. Since its shape is dynamic (with None in the batch dimension), I cannot define a tf.Variable to store it. Furthermore, the batch size n may change during training. How can I store the average of grad_z over several batches?
PS: I just wanted to combine the Pareto-optimal training of arXiv:1810.04650, the asynchronous network training of arXiv:1609.02132, and the batch size scheduling of arXiv:1711.00489.
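One possible workaround (a sketch, assuming only the batch axis of Z is dynamic and its other dimension latent_dim is static; all names below are illustrative): reduce grad_z over the batch dimension first, so the accumulators have a fixed shape:

import tensorflow as tf

latent_dim = 128  # assumed static (non-batch) dimension of Z

# fixed-shape accumulators: running per-sample sum of grad_z and a sample count
grad_sum = tf.Variable(tf.zeros([latent_dim]), trainable=False)
count = tf.Variable(0.0, trainable=False)

def accumulate(grad_z):
    # grad_z has shape [None, latent_dim]; reduce the dynamic axis away first
    grad_sum.assign_add(tf.reduce_sum(grad_z, axis=0))
    count.assign_add(tf.cast(tf.shape(grad_z)[0], tf.float32))

# the running average over all accumulated batches:
# mean_grad_z = grad_sum / count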

Backpropagation with python/numpy - calculating derivative of weight and bias matrices in neural network

I'm developing a neural network model in Python, using various resources to put together all the parts. Everything is working, but I have questions about some of the math. The model has a variable number of hidden layers and uses ReLU activation for all hidden layers except the last one, which uses sigmoid.
The cost function is:
def calc_cost(AL, Y):
    m = Y.shape[1]
    # binary cross-entropy, averaged over the m samples in the batch
    cost = (-1/m) * np.sum((Y * np.log(AL)) + ((1 - Y) * np.log(1 - AL)))
    return cost
where AL is the probability prediction after the last sigmoid activation is applied.
In part of my implementation of backpropagation, I use the following
def linear_backward_step(dZ, A_prev, W, b):
    m = A_prev.shape[1]
    dW = (1/m) * np.dot(dZ, A_prev.T)
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
where, given dZ (the derivative of the cost with respect to the linear step of forward propagation at a given layer), the derivatives of the layer's weight matrix W and bias vector b, and the derivative of the previous layer's activation dA_prev, are each calculated.
The forward step that this complements is the equation: Z = np.dot(W, A_prev) + b
My question is: when calculating dW and db, why is it necessary to multiply by 1/m? I've tried differentiating this using calculus rules, but I'm unsure how this term fits in.
Any help is appreciated!
Your gradient calculation seems wrong if you update per sample: in that case you do not multiply by 1/m (see the next answer for when the 1/m appears). Also, your calculation of m assumes samples are stored along the first axis; under that layout it should be

m = A_prev.shape[0]  # note it's not A_prev.shape[1]

and, in your calc_cost function,

m = Y.shape[0]  # not Y.shape[1]

You can refer to the following example for more information:
Neural Network Case Study
This actually depends on your loss function and on whether you update your weights after each sample or batch-wise. Take a look at the following old-fashioned general-purpose cost function:

MSE = (1/n) * sum_{i=1..n} (y_i - y^_i)^2

(Source: MSE Cost Function for Training Neural Network)
Here, y^_i is your network's output and y_i is the target value.
If you differentiate this with respect to y^_i, you'll never get rid of the 1/n or the sum, because the derivative of a sum is the sum of the derivatives, and since 1/n is a factor of that sum, you can't drop it either. Now think about what standard gradient descent actually does: it updates your weights after calculating the average over all n samples. Stochastic gradient descent updates after each sample, so no averaging is needed. Batch updates calculate the average over each batch, which I guess in your case is 1/m, where m is the batch size.
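A quick numpy check of this point, using a one-parameter least-squares model as a stand-in (all names below are illustrative):

import numpy as np

# cost = (1/m) * sum_i (w*x_i - y_i)^2; its gradient w.r.t. w is the
# *average* of the per-sample gradients, which is where the 1/m comes from
rng = np.random.default_rng(0)
m, w = 8, 1.5
x, y = rng.standard_normal(m), rng.standard_normal(m)

per_sample = 2 * (w * x - y) * x           # dL_i/dw for each sample i
analytic = per_sample.mean()               # gradient of the averaged cost

eps = 1e-6                                 # central finite-difference check
cost = lambda w_: np.mean((w_ * x - y) ** 2)
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)
print(np.isclose(analytic, numeric))       # True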