Is it possible to get Gradient of Matrices in TensorFlow? - tensorflow

I have the given objective that I want to maximize:
I need to maximize the above matrix equation with respect to H, D, R, F, and B. Can this be done in TensorFlow using the tape methods (tf.GradientTape)? I will be computing the gradients separately with respect to each of the matrices above. Thank you
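A minimal sketch of how this can look with tf.GradientTape (the objective below is only a placeholder, since the actual equation is not reproduced here, and the 4x4 shapes are arbitrary):

import tensorflow as tf

# Trainable matrices; shapes are placeholders.
H = tf.Variable(tf.random.normal((4, 4)))
D = tf.Variable(tf.random.normal((4, 4)))
R = tf.Variable(tf.random.normal((4, 4)))
F = tf.Variable(tf.random.normal((4, 4)))
B = tf.Variable(tf.random.normal((4, 4)))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

for step in range(100):
    with tf.GradientTape() as tape:
        # Placeholder objective: replace with the real matrix expression.
        objective = tf.linalg.trace(H @ D @ R @ F @ B)
        loss = -objective                            # minimize the negative to maximize
    grads = tape.gradient(loss, [H, D, R, F, B])     # one gradient per matrix
    optimizer.apply_gradients(zip(grads, [H, D, R, F, B]))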

Related

Tensorflow 1.12: compute the derivative of elements in one tensor with respect to elements in another tensor

I have a b-by-T-by-n tensor Z and an n-by-m matrix X, and I wish to compute the partial derivative of every element in Z with respect to every element in X that shares the same index along the axis of size n. Hence, my output would be a b-by-T-by-n-by-m tensor.
Is there a built-in function or an efficient way to do so in Tensorflow? Thanks!
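One way to sketch this in TF 2.x style (the question mentions TF 1.12, where this would typically go through tf.gradients in a graph instead; the op producing Z below is purely illustrative): compute the full Jacobian with tape.jacobian, which has shape (b, T, n, n, m), and then keep only the entries where the two n indices coincide.

import tensorflow as tf

b, T, n, m = 2, 3, 4, 5
X = tf.Variable(tf.random.normal((n, m)))

with tf.GradientTape() as tape:
    W = tf.random.normal((b, T, m))
    Z = tf.einsum('btm,nm->btn', W, X)       # illustrative op: Z has shape (b, T, n)

J = tape.jacobian(Z, X)                      # full Jacobian, shape (b, T, n, n, m)

# Keep only dZ[b, t, i] / dX[i, k], i.e. the diagonal over the two n axes.
J = tf.transpose(J, perm=[0, 1, 4, 2, 3])    # (b, T, m, n, n)
J = tf.linalg.diag_part(J)                   # (b, T, m, n)
J = tf.transpose(J, perm=[0, 1, 3, 2])       # (b, T, n, m) -- the desired result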

TensorFlow / PyTorch: Gradient for loss which is measured externally

I am relatively new to Machine Learning and Python.
I have a system, which consists of a NN whose output is fed into an unknown nonlinear function F, e.g. some hardware. The idea is to train the NN to be an inverse F^(-1) of that unknown nonlinear function F. This means that a loss L is calculated at the output of F. However, backpropagation cannot be used in a straightforward manner for calculating the gradients and updating the NN weights because the gradient of F is not known either.
Is there any way to use a loss function L that is not directly connected to the NN for the calculation of the gradients in TensorFlow or PyTorch? Or to take a loss that was obtained with other software (Matlab, C, etc.) and use it for backpropagation?
As far as I know, Keras' keras.backend.gradients only allows calculating gradients with respect to connected weights; otherwise the gradient is either zero or NoneType.
I read about the stop_gradient() function in TensorFlow, but I am not sure whether this is what I am looking for. It allows you to skip computing the gradient with respect to some variables during backpropagation. But I think the operation F is not interpreted as a variable anyway.
Can I define any arbitrary loss function (including a hardware measurement) and use it for backpropagation in TensorFlow or is it required to be connected to the graph as well?
Please, let me know if my question is not specific enough.
AFAIK, all modern deep learning packages (PyTorch, TensorFlow, Keras, etc.) rely on gradient descent (and its many variants) to train networks.
As the name suggests, you cannot do gradient descent without gradients.
However, you might circumvent the "non differentiability" of your "given" function F by looking at the problem from a slightly different perspective:
You are trying to learn a model M that "counters" the effect of F. So you have access to F (but not its gradients) and a set of representative inputs X={x_0, x_1, ... x_n}.
For each example x_i you can compute y_i = F(x_i) and your end goal is to have a model M that given y_i will output x_i.
Therefore, you can treat y_i as your model's input and compute a loss between M(y_i) and x_i that produced it. This way you do not need to compute gradients through the "black box" F.
The pseudo-code would look something like this:
for x in examples:
    y = F(x)                           # apply F to x - output only, WITHOUT any gradients
    pred = M(y)                        # apply the trainable model M to the output of F
    loss = ((x - pred) ** 2).mean()    # the loss propagates gradients through M and stops at F
    loss.backward()
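A self-contained version of the same idea (only a sketch: the black-box F is simulated here with a NumPy function, so no gradient history can possibly flow through it, and the model architecture, batch size and learning rate are made-up choices):

import numpy as np
import torch
import torch.nn as nn

def F(x):                                    # stand-in for the unknown hardware function
    return np.tanh(x.detach().numpy()) * 3.0

M = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # trainable inverse model
optimizer = torch.optim.Adam(M.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.rand(64, 1) * 2 - 1                      # representative inputs in [-1, 1]
    y = torch.as_tensor(F(x), dtype=torch.float32)     # black-box output, no gradient history
    pred = M(y)                                        # M tries to recover x from y
    loss = nn.functional.mse_loss(pred, x)             # differentiable only through M
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()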

Is it possible to use Keras to optimize the coefficients of a mathematical function?

I'm very new to Keras and to neural networks in general, and I was wondering: if I had a list of points (x, y) that came from a quadratic function of the form ax^2 + bx + c, is it possible to feed the points into a neural network and get the coefficients a, b, and c as an output from the network?
I know that I can simply use polynomial regression to achieve my goal; that is not the point.
If you are asking how to do polynomial regression using neural networks, here's the recipe.
Your dataset consists of points (x, y). Design your network as a fully connected (dense) network with 1 input layer and 1 output layer. The input layer consists of 2 nodes, the output layer of 1 node. Then feed your network the inputs x^2 and x. The output will be computed as:
y = w * X + c
where w is a matrix of learnable parameters. Specifically, it has shape 1x2 since it contains parameters a and b. c is a bias. The input matrix X has shape 2xN, where N is the number of points in your dataset and for each point, the first component is x^2 and the second component is x.
As loss function, use the standard Mean Squared Error loss. As for the optimizer, a simple Stochastic Gradient Descent should work just fine. At convergence, w and c will be good enough to approximate the true quadratic function.
I don't know Keras, but I think it will not be tough to figure out for yourself how to implement this simple network.
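For what it's worth, a minimal Keras sketch of this recipe could look as follows (the synthetic data, learning rate and epoch count are made-up choices for illustration):

import numpy as np
import tensorflow as tf

# Synthetic data from a known quadratic, here a=2, b=-3, c=1.
x = np.random.uniform(-3, 3, size=(1000, 1)).astype("float32")
y = 2 * x**2 - 3 * x + 1

# Features are [x^2, x]; a single Dense unit then learns the weights [a, b] and the bias c.
features = np.concatenate([x**2, x], axis=1)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(2,))])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
model.fit(features, y, epochs=200, verbose=0)

w, c_hat = model.layers[0].get_weights()
print("a:", w[0, 0], "b:", w[1, 0], "c:", c_hat[0])    # should approach 2, -3, 1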

how to prepare `bias` vector in tensor2tensor?

I am having a problem understanding how bias works in tensor2tensor, especially in multihead_attention or dot_product_attention. I want to use it as a library for my problem.
Let's say I have an input tensor T with dimension (batch, max_input_length, hidden_unit) for a batch of sentences S, and I also have a tensor sequence_length whose dimension is (batch), giving the length of each sentence in S. Now how can I prepare the bias vector for this input?
I want to calculate the bias vector for self_attention, that is, when q, k, and v are the same.
Another thing,
What happens to the bias if q is different and k, v are the same? This is a sort of cross_attention. I think in that case we have to calculate the bias vector for k, but I'm not sure.
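Not a full answer, but a sketch of how such a padding bias can be built from sequence_length with plain TensorFlow ops (if I remember correctly, tensor2tensor's common_attention module ships a helper for this, something like attention_bias_ignore_padding, but the sketch below builds the bias by hand so it does not depend on that exact API):

import tensorflow as tf

def self_attention_bias(sequence_length, max_input_length):
    # mask: (batch, max_input_length), 1.0 at real tokens, 0.0 at padding
    mask = tf.sequence_mask(sequence_length, maxlen=max_input_length, dtype=tf.float32)
    # Large negative values at padded key positions so the softmax ignores them.
    # Shape (batch, 1, 1, max_input_length) broadcasts over heads and query positions.
    return (1.0 - mask)[:, tf.newaxis, tf.newaxis, :] * -1e9

For self-attention (q = k = v) this is all there is: the bias only masks the key/value positions. For cross-attention, where q comes from a different tensor, the bias should still be built from the lengths on the k/v (memory) side, which matches the intuition in the question.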

how does tensorflow calculate gradients *efficiently* from input to loss?

To calculate the derivative of an output layer of size N w.r.t. an input of size M, we need a Jacobian matrix of size M x N. To calculate the complete gradient from the loss to the inputs using the chain rule, we would need a large number of such Jacobians stored in memory.
I assume that tensorflow does not calculate a complete Jacobian matrix for each step of the graph, but does something more efficient. How does it do it?
Thanks
TensorFlow uses Automatic Differentiation to compute gradients efficiently. Concretely, it defines a computation graph in which nodes are operations and each directed edge represents the partial derivative of a child with respect to its parent. The total derivative of an operation f with respect to x is then given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the operations on the edges.
More specifically, TensorFlow uses reverse differentiation, which involves a forward pass to compute the value of each node in the computation graph, and a backward pass to compute the partial derivative of the function f that we are differentiating with respect to every node in the graph. We need to repeat the backward pass for each dimension of function f, so the computational complexity is O(dim(f))*O(f), where dim(f) is the output dimensionality of function f.
Although this approach is memory intensive (it requires storing the values of all the nodes before running the backward pass), it is very efficient for machine learning, where we typically have a scalar function f (i.e. dim(f)=1).
You might find this resource useful.
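As a small illustration of the scalar case (a hedged TF 2.x sketch with made-up sizes), tape.gradient performs a single backward pass and returns something the same shape as the input; the intermediate M x N Jacobians are never materialized, only vector-Jacobian products:

import tensorflow as tf

x = tf.Variable(tf.random.normal((1, 1000)))     # "input" of size M = 1000
W1 = tf.random.normal((1000, 500))
W2 = tf.random.normal((500, 1))

with tf.GradientTape() as tape:
    h = tf.tanh(x @ W1)                          # hidden layer of size 500
    loss = tf.reduce_sum(tf.square(h @ W2))      # scalar loss, dim(f) = 1

grad = tape.gradient(loss, x)                    # one backward pass
print(grad.shape)                                # (1, 1000): same shape as x; no
                                                 # 1000-by-500 Jacobian is ever stored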