In A2C, the advantage actor-critic algorithm, the weights are updated via the equations:
delta = TD error,
theta = theta + alpha * delta * Grad(log(PI(a|s, theta))), and
w = w + beta * delta * Grad(V(s, w)).
So my question is: when using neural networks to implement this,
how are the gradients calculated, and
am I correct that the weights are updated via the optimization methods in TensorFlow or PyTorch?
Thanks, Jon
I'm not quite clear what you mean to update with w, but I'll answer the question for theta, assuming it denotes the parameters of the actor model.
1) Gradients can be calculated in a variety of ways, but focusing on PyTorch, you can call .backward() on f(x) = alpha * delta * log(PI(a|s, theta)), which will compute df/dx for every parameter x that is chained to f(x) via autograd.
2) You are indeed correct that the weights are updated via the optimization methods in PyTorch (the gradients themselves are computed by autograd). However, in order to complete the optimization step, you must call .step() on whatever torch.optim optimizer you have constructed over the network's parameters (e.g. weights and biases).
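For concreteness, here is a minimal sketch of the actor update in PyTorch. The toy policy, state, action, and hyperparameters are illustrative assumptions, not from your question; note the loss is negated because optimizers minimize, which turns the ascent update for theta into a descent step:

    import torch
    import torch.nn as nn

    actor = nn.Linear(4, 2)                  # toy policy: 4-dim state, 2 discrete actions
    optimizer = torch.optim.SGD(actor.parameters(), lr=1e-2)

    state = torch.randn(1, 4)                # s
    action = torch.tensor([0])               # a, sampled earlier
    delta = torch.tensor(0.5)                # TD error from the critic, treated as a constant

    logits = actor(state)
    log_pi = torch.distributions.Categorical(logits=logits).log_prob(action)
    loss = -(delta * log_pi).sum()           # negated so minimizing performs gradient ascent
    optimizer.zero_grad()
    loss.backward()                          # autograd computes d(loss)/d(theta)
    optimizer.step()                         # the optimizer applies the update to theta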
I'm trying to write a wrapper around a model, such that the tf model can be called as a function of its weights (and input). However, this wrapper returns different gradients than the gradients from the original model. Details are in the code below (including a colab notebook to reproduce directly), but at the core I'm using the custom gradient decorator: the respective gradient is computed directly as the upstream 'gradient' matrix-multiplied (via tensordot) with the respective Jacobian.
To make this clear: I'm computing the gradient for a model, once directly and once using my custom wrapper. In both cases the parameters in the model are the same. The Jacobian is implemented by TF, so nothing should be wrong there. Still, the resulting gradient seems to be wrong.
I'm not sure whether this is a coding mistake I made somewhere, or possibly just a numeric problem stemming from the Jacobian matmul; however, my tests regarding correlation of the gradients suggest this is more than a numeric issue for now. Code of the function is provided below; a link to a colab notebook reproducing the problem can be found here: Colab Notebook reproducing the problem
Why: this is important for a number of meta-learning methods, for which I'm currently trying to build a small library.
My current 'wrapper' looks something like this:
# Calls model on input x but replaces internal weights with the weights argument.
# Critically, it is supposed to compute the gradient with respect to the weights tensor argument!
def call_model_with_weights(model, x, weights, dim_output=2):
    @tf.custom_gradient
    def _call_with_weights(x_and_w):
        x, weights = x_and_w
        # Be careful: this assigns weights to the model as a side effect; can ignore for the dummy version.
        ctrls = [var.assign(val) for var, val in zip(model.trainable_weights, weights)]
        with tf.control_dependencies(ctrls):
            with tf.GradientTape() as tape:
                y = model(x)
        jacobians = tape.jacobian(y, model.trainable_weights)

        def grad(upstream, variables):
            assert len(variables) == len(weights)
            # Gradient for each weight should be the upstream gradient dot-product with the respective Jacobian.
            dy_dw = [tf.tensordot(upstream, j, axes=[list(range(dim_output)), list(range(dim_output))])
                     for j in jacobians]
            dy_dw_weights = dy_dw
            # Returning this as the derivative of x is wrong, but not important here right now.
            return (None, dy_dw_weights), [None for _ in dy_dw]

        return y, grad

    y = _call_with_weights((x, weights))
    return y
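For context, a minimal sketch of how I call the wrapper (the toy model and shapes here are illustrative, not the exact ones from my notebook):

    import tensorflow as tf

    # Toy model and data, just to show the calling convention.
    model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(3,))])
    x = tf.random.normal((4, 3))
    weights = [tf.identity(w) for w in model.trainable_weights]

    with tf.GradientTape() as tape:
        tape.watch(weights)
        y = call_model_with_weights(model, x, weights)
        loss = tf.reduce_sum(y)

    # Gradients w.r.t. the explicit weights argument; these are the ones that
    # disagree with tape.gradient(loss, model.trainable_weights) computed directly.
    grads = tape.gradient(loss, weights)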
Thanks a lot for any help (including suggestions for how this could be done in a more elegant way). Helping out means you are contributing to a package that aims to mimic PyTorch's 'higher' for TF, which I hope will help some more people <3
I am relatively new to Machine Learning and Python.
I have a system which consists of an NN whose output is fed into an unknown nonlinear function F, e.g. some hardware. The idea is to train the NN to be an inverse F^(-1) of that unknown nonlinear function F. This means that a loss L is calculated at the output of F. However, backpropagation cannot be used in a straightforward manner for calculating the gradients and updating the NN weights, because the gradient of F is not known either.
Is there any way to use a loss function L that is not directly connected to the NN for the calculation of the gradients in TensorFlow or PyTorch? Or to take a loss that was obtained with other software (MATLAB, C, etc.) and use it for backpropagation?
As far as I know, keras.backend.gradients only allows calculating gradients with respect to connected weights; otherwise the gradient is either zero or NoneType.
I read about the stop_gradient() function in TensorFlow, but I am not sure whether this is what I am looking for. It makes it possible to skip computing the gradient with respect to some variables during backpropagation. But I think the operation F is not interpreted as a variable anyway.
Can I define an arbitrary loss function (including a hardware measurement) and use it for backpropagation in TensorFlow, or is it required to be connected to the graph as well?
Please let me know if my question is not specific enough.
AFAIK, all modern deep learning packages (PyTorch, TensorFlow, Keras, etc.) rely on gradient descent (and its many variants) to train networks.
As the name suggests, you cannot do gradient descent without gradients.
However, you might circumvent the "non-differentiability" of your "given" function F by looking at the problem from a slightly different perspective:
You are trying to learn a model M that "counters" the effect of F. So you have access to F (but not its gradients) and a set of representative inputs X={x_0, x_1, ... x_n}.
For each example x_i you can compute y_i = F(x_i) and your end goal is to have a model M that given y_i will output x_i.
Therefore, you can treat y_i as your model's input and compute a loss between M(y_i) and x_i that produced it. This way you do not need to compute gradients through the "black box" F.
A pseudo code would look something like:
for x in examples:
    y = F(x)             # applying F on x - getting only the output WITHOUT any gradients
    pred = M(y)          # apply the trainable model M to the output of F
    loss = ||x - pred||  # loss will propagate gradients through M and stop at F
    loss.backward()
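A runnable PyTorch sketch of the same idea (F, the architecture of M, and the data here are placeholder assumptions):

    import torch
    import torch.nn as nn

    def F(x):
        # Stand-in for the unknown black box; wrapped in no_grad so no
        # gradients are ever recorded through it.
        with torch.no_grad():
            return torch.tanh(x) + 0.1 * x ** 3

    M = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.Adam(M.parameters(), lr=1e-3)

    for step in range(1000):
        x = torch.rand(64, 1) * 2 - 1    # representative inputs in [-1, 1]
        y = F(x)                         # black-box output, no gradient history
        pred = M(y)                      # trainable inverse model
        loss = ((x - pred) ** 2).mean()  # gradients flow through M only
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()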
I'm very new to Keras and neural networks in general, and I was wondering: if I had a list of points (x, y) that came from a quadratic function of the form ax^2 + bx + c, is it possible
to feed the points into a neural network and
get the coefficients a, b and c as an output from the network?
I know that I can simply use polynomial regression to achieve my goal; that is not the point.
If you are asking how to do polynomial regression using neural networks, here's the recipe.
Your dataset consists of points (x, y). Design your network to be a fully connected network (dense network) with 1 input layer and 1 output layer. The input layer consists of 2 nodes, the output layer consists of 1 node. Then feed your network the inputs x and x^2. The output will be computed as:
y = w * X + c
where w is a matrix of learnable parameters. Specifically, it has shape 1x2 since it contains parameters a and b. c is a bias. The input matrix X has shape 2xN, where N is the number of points in your dataset and for each point, the first component is x^2 and the second component is x.
As loss function, use the standard Mean Squared Error loss. As for the optimizer, a simple Stochastic Gradient Descent should work just fine. At convergence, w and c will be good enough to approximate the true quadratic function.
I don't know Keras, but I think it will not be tough to figure out by yourself how to implement this naive network.
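For reference, a minimal Keras sketch of this recipe (the synthetic data, learning rate, and epoch count are illustrative choices, not tested values):

    import numpy as np
    from tensorflow import keras

    # Synthetic data from a known quadratic, e.g. y = 2x^2 - 3x + 1.
    x = np.random.uniform(-2, 2, size=(1000, 1))
    y = 2 * x**2 - 3 * x + 1

    # Features are [x^2, x]; a single dense unit learns w = [a, b] and bias c.
    X = np.hstack([x**2, x])
    model = keras.Sequential([keras.layers.Dense(1, input_shape=(2,))])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss="mse")
    model.fit(X, y, epochs=200, verbose=0)

    w, c = model.layers[0].get_weights()
    print(w.ravel(), c)  # should approach [2, -3] and [1]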
I need to implement YOLOv2 based on the TensorFlow framework.
Firstly, in my network design there are five anchors for each cell and one class (face), thus the network finally outputs a 4D tensor of shape n * c * h * w. Here n represents the batch size, c = 5 * (location coordinates + objectness score + classification probability) = 5 * (4 + 1 + 1) = 30, and h/w represent the height and width of the feature map respectively.
Secondly, YOLOv2 adopts a multi-task loss function. So I defined the following function to calculate the total loss:
def yolov2_loss_function(pred, ground_truth, global_step):
This function accepts three parameters: pred represents the network output tensor already described above, ground_truth represents the corresponding GT, and global_step represents the number of iterations. The function returns a scalar value denoting the total loss.
Finally, I use the following code to perform SGD training:
......
total_loss = yolov2_loss_function(pred, gt, global_step)
optimizer = tf.train.MomentumOptimizer(learning_rate=lr, momentum=momentum).minimize(total_loss, global_step=global_step)
......
I am not sure if the above process is correct. In particular, total_loss is just a scalar; how does the TensorFlow framework know the residual/gradient of each element in the output tensor and then perform backward propagation? I know the mechanism of automatic differentiation, but I thought its premise was that each output element should have a residual.
In the function yolov2_loss_function I first calculate each element's residual and then sum them into the total loss. However, how does the TensorFlow framework know the residual of each output element?
Thank you very much.
I think you mixed it up.
For any differentiable function $f\colon \mathbb{R}^n \to \mathbb{R},\ x \mapsto f(x)$, the partial derivatives $\frac{\partial}{\partial x_i} f(x)$ exist.
Hence, even for a scalar-valued function the gradient can be written as a vector, as long as the function maps a vector to a scalar.
For f(a, b) = a * a * b, the derivative w.r.t. a is 2 * a * b and w.r.t. b it is a * a. No issue here.
I'm developing a neural network model in Python, using various resources to put together all the parts. Everything is working, but I have questions about some of the math. The model has a variable number of hidden layers and uses ReLU activation for all hidden layers except the last one, which uses sigmoid.
The cost function is:
def calc_cost(AL, Y):
    m = Y.shape[1]
    # binary cross-entropy: note the two log terms are added, not subtracted
    cost = (-1/m) * np.sum((Y * np.log(AL)) + ((1 - Y) * np.log(1 - AL)))
    return cost
where AL is the probability prediction after the last sigmoid activation is applied.
In part of my implementation of backpropagation, I use the following
def linear_backward_step(dZ, A_prev, W, b):
    m = A_prev.shape[1]
    dW = (1/m) * np.dot(dZ, A_prev.T)
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
where, given dZ (the derivative of the cost with respect to the linear step of forward propagation at a given layer), the derivatives of the cost with respect to the layer's weight matrix W, bias vector b, and the previous layer's activation dA_prev are each calculated.
The forward step that this complements is the equation: Z = np.dot(W, A_prev) + b
My question is: in calculating dW and db, why is it necessary to multiply by 1/m? I've tried differentiating this using calculus rules but I'm unsure how this term fits in.
Any help is appreciated!
Your gradient calculation seems wrong: you should not multiply it by 1/m. Also, your calculation of m seems wrong. It should be
# note it's not A_prev.shape[1]
m = A_prev.shape[0]
The same applies to the definition in your calc_cost function:
# should not be Y.shape[1]
m = Y.shape[0]
You can refer to the following example for more information: Neural Network Case Study
This actually depends on your loss function and on whether you update your weights after each sample or batch-wise. Take a look at the following old-fashioned general-purpose cost function:
E = (1/n) * sum_{i=1}^{n} (y^_i - y_i)^2
Source: MSE Cost Function for Training Neural Network
Here, let's say y^_i is your network's output and y_i is your target value.
If you differentiate this with respect to y^_i, you'll never get rid of the 1/n or the sum, because the derivative of a sum is the sum of the derivatives, and since 1/n is a constant factor of the sum, you cannot get rid of it either. Now, think about what standard gradient descent is actually doing: it updates your weights after calculating the average over all n samples. Stochastic gradient descent can be used to update after each sample, so you won't have to average. Batch updates calculate the average over each batch, which I guess in your case is 1/m, where m is the batch size.
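To see concretely that the 1/m in dW is just the 1/m from the averaged cost, here is a small NumPy sanity check (a sketch with toy shapes, using the column-major layout from the question):

    import numpy as np

    np.random.seed(0)
    m = 5                                   # number of examples (columns)
    A_prev = np.random.randn(3, m)
    W = np.random.randn(1, 3)
    b = np.zeros((1, 1))
    Y = np.random.randint(0, 2, (1, m))

    def cost(W):
        Z = W @ A_prev + b
        AL = 1 / (1 + np.exp(-Z))           # sigmoid
        return (-1/m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))

    # Analytic gradient: for sigmoid + cross-entropy, dZ = AL - Y, and the
    # 1/m averages the per-example contributions, matching the 1/m in the cost.
    Z = W @ A_prev + b
    AL = 1 / (1 + np.exp(-Z))
    dW = (1/m) * (AL - Y) @ A_prev.T

    # A numerical gradient of the averaged cost agrees with the analytic dW.
    eps = 1e-6
    num = np.zeros_like(W)
    for i in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[0, i] += eps
        Wm[0, i] -= eps
        num[0, i] = (cost(Wp) - cost(Wm)) / (2 * eps)
    print(np.allclose(dW, num, atol=1e-5))  # True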