How to get the Jacobian-matrix-form derivative of a vector by a vector in the TensorFlow Eager Execution API? - tensorflow

In an MLP model the input of layer l can be computed by this formula:
z = Wa + b
where W is the weight matrix between layer l-1 and layer l, a is the output signal of the layer l-1 neurons, and b is the bias of layer l.
For example, I want to use the TensorFlow Eager Execution API to get the derivatives of z with respect to W, a, and b.
I define a function to calculate the value of z:
import tensorflow as tf

def f002(W, a, b):
    return tf.matmul(W, a) + b
My main program:
def test001(args={}):
    tf.enable_eager_execution()
    tfe = tf.contrib.eager
    a = tf.reshape(tf.constant([1.0, 2.0, 3.0]), [3, 1])
    W = tf.constant([[4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
    b = tf.reshape(tf.constant([1001.0, 1002.0]), [2, 1])
    z = f002(W, a, b)
    print(z)
    grad_f1 = tfe.gradients_function(f002)
    dv = grad_f1(W, a, b)
    print(dv)
I can get the correct value of z in forward mode, but when I print the derivative results it displays something like this:
[<tf.Tensor: id=17, shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
[1., 2., 3.]], dtype=float32)>, <tf.Tensor: id=18, shape=(3, 1),
dtype=float32, numpy=
array([[11.],
[13.],
[15.]], dtype=float32)>, <tf.Tensor: id=16, shape=(2, 1),
dtype=float32, numpy=
array([[1.],
[1.]], dtype=float32)>]
This is not what I want. How can I get the Jacobian-matrix result of the derivative of a vector by a vector?
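For what it's worth, newer TensorFlow versions expose a jacobian method on GradientTape that returns exactly this kind of matrix. A minimal sketch (my own addition, assuming a TF version with GradientTape.jacobian, e.g. TF 2.x, and reusing the tensors above):
import tensorflow as tf

a = tf.reshape(tf.constant([1.0, 2.0, 3.0]), [3, 1])
W = tf.constant([[4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
b = tf.reshape(tf.constant([1001.0, 1002.0]), [2, 1])

with tf.GradientTape() as tape:
    tape.watch(a)                      # a is a plain tensor, so watch it explicitly
    z = tf.matmul(W, a) + b            # z has shape (2, 1)

jac = tape.jacobian(z, a)              # shape (2, 1, 3, 1): dz_i / da_j
print(tf.reshape(jac, [2, 3]))         # the 2x3 Jacobian dz/da, which equals W here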

Related

Autodiff implementation for gradient calculation

I have worked through some papers about the autodiff algorithm in order to implement it myself (for learning purposes). I compared my algorithm in test cases to the output of TensorFlow, and their outputs did not match in most cases. Therefore I worked through the tutorial from this site and re-implemented it with TensorFlow operations, just for the matrix multiplication operation, since that was one of the operations that did not work:
gradient of matmul and unbroadcast method:
def gradient_matmul(node, dx, adj):
    # dx is needed to know which of the two parents should be derived
    a = node.parents[0]
    b = node.parents[1]
    # the operation was node.tensor = tf.matmul(a.tensor, b.tensor)
    if a == dx or b == dx:
        # the result depends on which of the parents is the derivative
        mm = tf.matmul(adj, tf.transpose(b.tensor)) if a == dx else \
             tf.matmul(tf.transpose(a.tensor), adj)
        return mm
    else:
        return None
def unbroadcast(adjoint, node):
    dim_a = len(adjoint.shape)
    dim_b = len(node.shape)
    if dim_a > dim_b:
        sum_axes = tuple(range(dim_a - dim_b))
        res = tf.math.reduce_sum(adjoint, axis=sum_axes)
        return res
    return adjoint
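As a quick illustration of what unbroadcast does (my own toy check, assuming the definitions above): an adjoint with an extra leading batch axis is summed back down to the rank of the original tensor, as happens when a value was broadcast during the forward pass.
import tensorflow as tf

adjoint = tf.ones([4, 3])          # gradient flowing back through a broadcasted op
node = tf.zeros([3])               # the original, un-broadcast tensor
print(unbroadcast(adjoint, node))  # tf.Tensor([4. 4. 4.], shape=(3,), dtype=float32)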
And finally the gradient calculation autodiff algorithm:
from collections import defaultdict

def gradient(y, dx):
    working = [y]
    adjoints = defaultdict(float)
    adjoints[y] = tf.ones(y.tensor.shape)
    while len(working) != 0:
        curr = working.pop(0)
        if curr == dx:
            return adjoints[curr]
        if curr.is_store:
            continue
        adj = adjoints[curr]
        for p in curr.parents:
            # for testing with matrix multiplication as the only operation
            local_grad = gradient_matmul(curr, p, adj)
            adjoints[p] = unbroadcast(tf.add(adjoints[p], local_grad), p.tensor)
            if p not in working:
                working.append(p)
Yet it produces the same output as my initial implementation.
I constructed a matrix multiplication test case:
x = tf.constant([[[1.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]]])
y = tf.constant([[3.0, -7.0], [-1.0, 5.0]])
z = tf.constant([[[1, 1], [2.0, 2]], [[3, 3], [-1, -1]]])
w = tf.matmul(tf.matmul(x, y), z)
Here w should be differentiated with respect to each of the variables.
TensorFlow calculates the gradients:
[<tf.Tensor: shape=(2, 2, 2), dtype=float32, numpy=
array([[[-22., 18.],
[-22., 18.]],
[[ 32., -16.],
[ 32., -16.]]], dtype=float32)>, <tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[66., -8.],
[80., -8.]], dtype=float32)>, <tf.Tensor: shape=(2, 2, 2), dtype=float32, numpy=
array([[[ 5., 5.],
[ -1., -1.]],
[[ 18., 18.],
[-10., -10.]]], dtype=float32)>]
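For reference, the TensorFlow numbers above can be reproduced directly with a GradientTape (my own snippet; since w is not a scalar, these are the gradients of the implicit sum of w's entries):
import tensorflow as tf

x = tf.constant([[[1.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]]])
y = tf.constant([[3.0, -7.0], [-1.0, 5.0]])
z = tf.constant([[[1, 1], [2.0, 2]], [[3, 3], [-1, -1]]])

with tf.GradientTape() as tape:
    tape.watch([x, y, z])              # constants are not watched automatically
    w = tf.matmul(tf.matmul(x, y), z)

print(tape.gradient(w, [x, y, z]))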
My implementation calculates:
[[[-5. 7.]
[-5. 7.]]
[[-5. 7.]
[-5. 7.]]]
[[33. 22.]
[54. 36.]]
[[[ 9. 9.]
[14. 14.]]
[[-5. -5.]
[-6. -6.]]]
Maybe the problem is the difference between numpy's dot and TensorFlow's matmul?
But then I don't know how to fix the gradient or the unbroadcast for the TensorFlow method...
Thanks for taking the time to look over my code! :)
I found the error: the matmul gradient should have been:
def gradient_matmul(node, dx, adj):
    a = node.parents[0]
    b = node.parents[1]
    if a == dx:
        return tf.matmul(adj, b.tensor, transpose_b=True)
    elif b == dx:
        return tf.matmul(a.tensor, adj, transpose_a=True)
    else:
        return None
since I only want to transpose the last two dimensions.
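As a small illustration of that point (my own check, not from the original post): tf.matmul's transpose flags only swap the last two axes, while a plain tf.transpose would also move the batch axis.
import tensorflow as tf

adj = tf.ones([2, 2, 4])                          # batched adjoint
b = tf.ones([2, 3, 4])                            # batched parent tensor
print(tf.matmul(adj, b, transpose_b=True).shape)  # (2, 2, 3): only the last two axes of b are swapped
print(tf.transpose(b).shape)                      # (4, 3, 2): a full transpose also reverses the batch axis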

Jacobian of a vector in Tensorflow

I think this question has never been properly answered (see How to calculate the Jacobian of a vector function with tensorflow or Computing Jacobian in TensorFlow 2.0), so I will try again:
I want to compute the Jacobian of the vector-valued function z = [x**2 + 2*y, y**2], that is, I want to obtain the matrix of partial derivatives
[[2x, 0],
[2, 2y]]
(being automatic differentiation, this matrix will be evaluated at a specific point).
with tf.GradientTape() as g:
    x = tf.Variable(1.0)
    y = tf.Variable(4.0)
    z = tf.convert_to_tensor([x**2 + 2*y, y**2])

jacobian = g.jacobian(z, [x, y])
print(jacobian)
Obtaining
[<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2., 0.], dtype=float32)>, <tf.Tensor: shape=(2,), dtype=float32, numpy=array([2., 8.], dtype=float32)>]
What I naturally want to obtain is the tensor
[[2., 0.],
[2., 8.]]
not that intermediate result. Can it be done?
Try something like this:
import numpy as np
import tensorflow as tf

with tf.GradientTape() as g:
    x = tf.Variable(1.0)
    y = tf.Variable(4.0)
    z = tf.convert_to_tensor([x**2 + 2*y, y**2])

jacobian = g.jacobian(z, [x, y])
print(np.array([jacob.numpy() for jacob in jacobian]))
Result
[[2. 0.]
[2. 8.]]
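Alternatively (my suggestion, staying inside TensorFlow rather than converting to numpy), the list can be stacked into a single tensor:
jacobian_matrix = tf.stack(jacobian)   # stacks the two per-variable rows into shape (2, 2)
print(jacobian_matrix)
# tf.Tensor(
# [[2. 0.]
#  [2. 8.]], shape=(2, 2), dtype=float32)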

Problem about getting None from the GradientTape.gradient in TensorFlow

I tried the following code:
from d2l import tensorflow as d2l
import tensorflow as tf

@tf.function
def corr2d(X, k, Y):  #@save
    """Compute 2D cross-correlation."""
    with tf.GradientTape() as tape:
        for i in range(Y.shape[0]):
            for j in range(Y.shape[1]):
                Y[i, j].assign(tf.reduce_sum(tf.multiply(X[i: i + h, j: j + w], k)))
    print('Gradients = ', tape.gradient(Y, k))                # show the gradient
    print('Watched Variables = ', tape.watched_variables())   # show the watched variables

print(tf.__version__)
Xin = tf.constant([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
kernel = tf.Variable([[0.0, 1.0], [2.0, 3.0]])
h, w = kernel.shape
Y_hat = tf.Variable(tf.zeros((Xin.shape[0] - h + 1, Xin.shape[1] - w + 1)))  # prepare the output tensor
corr2d(Xin, kernel, Y_hat)
print(Y_hat)
I got the following results:
2.4.1
Gradients = None
Watched Variables = (<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32>, <tf.Variable 'Variable:0' shape=(2, 2) dtype=float32>)
<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
array([[19., 25.],
[37., 43.]], dtype=float32)>
Can anyone explain why the returned gradient is None even though the source variable kernel is included in the list of watched variables?
I'm not sure I really understood what you were trying to do. You were passing your variable as the target for the gradient.
It is always easier to think in terms of a cost function and variables.
Let's say your cost function is y = x ** 2. In this case, it is possible to calculate the gradient of y with respect to x.
Basically, you did not have a function from which to calculate any gradient with respect to k.
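To make that concrete, here is a minimal toy sketch of the idea (my own example, not the poster's code): a scalar cost defined from the variable gives a non-None gradient.
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2                  # scalar cost that depends on the variable
print(tape.gradient(y, x))      # tf.Tensor(6.0, shape=(), dtype=float32), i.e. dy/dx at x = 3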
I have done a small change. Check for the variable cost.
import tensorflow as tf

def corr2d(X, k, Y):  #@save
    """Compute 2D cross-correlation."""
    with tf.GradientTape() as tape:
        cost = 0
        for i in range(Y.shape[0]):
            for j in range(Y.shape[1]):
                Y[i, j].assign(tf.reduce_sum(tf.multiply(X[i: i + h, j: j + w], k)))
                cost = cost + tf.reduce_sum(tf.multiply(X[i: i + h, j: j + w], k))
    print('\nGradients = ', tape.gradient(cost, k))           # show the gradient
    print('Watched Variables = ', tape.watched_variables())   # show the watched variables

print(tf.__version__)
Xin = tf.constant([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
kernel = tf.Variable([[0.0, 1.0], [2.0, 3.0]])
h, w = kernel.shape
Y_hat = tf.Variable(tf.zeros((Xin.shape[0] - h + 1, Xin.shape[1] - w + 1)))  # prepare the output tensor
corr2d(Xin, kernel, Y_hat)
print(Y_hat)
And now, you will get
Gradients = tf.Tensor(
[[ 8. 12.]
[20. 24.]], shape=(2, 2), dtype=float32)
Watched Variables = (<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
array([[0., 1.],
[2., 3.]], dtype=float32)>, <tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
array([[19., 25.],
[37., 43.]], dtype=float32)>)
<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
array([[19., 25.],
[37., 43.]], dtype=float32)>
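As a sanity check (my own addition): since the cost is the sum of the windowed products over all output positions, its gradient with respect to k is simply the sum of the 2x2 input patches, which matches the tensor above:
import tensorflow as tf

Xin = tf.constant([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
patches = [Xin[i:i + 2, j:j + 2] for i in range(2) for j in range(2)]
print(tf.add_n(patches))   # [[ 8. 12.]
                           #  [20. 24.]]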

How does the optimization process go with TensorFlow?

I have a simple graph in TensorFlow:
(1) X = tf.Variable(dtype=tf.float32, shape=(1, 3), name="X", initial_value=np.array([[1,2,3]]))
(2) y = tf.reduce_sum(tf.square(X)) - 2 * tf.reduce_sum(tf.sin(tf.square(X)))
(3) training_op = tf.train.GradientDescentOptimizer(0.3).minimize(y)
Here's the code for 5 steps of gradient descent:
with tf.Session() as sess:
    sess.run(init)
    for i in range(5):
(4)     *res, _ = sess.run(fetches=[X, y, training_op])
        print(res)
[array([[1., 2., 3.]], dtype=float32), 13.006426]
[array([[ 1.0483627 , -0.76874477, -2.080069 ]], dtype=float32), 4.9738936]
[array([[ 0.9910337 , -1.0735381 , 0.10702228]], dtype=float32), -1.3677568]
[array([[ 1.0567244 , -0.95272505, 0.17122723]], dtype=float32), -1.3784065]
[array([[ 0.978967 , -1.0848547 , 0.27387527]], dtype=float32), -1.4229481]
I'm trying to figure out how its optimization process goes. Could you please explain it step by step?
I thought it should be like this:
Evaluate X (1)
Evaluate y (2)
Calculate the gradient and make a step (3) (as here it says, "Calling minimize() takes care of both computing the gradients and applying them to the variables")
Then yield all the variables requested in fetches (4)
But the output shows that the first run yields the initial values, so I'm confused...
tf version == '1.15.0'
Thank you in advance!
upd1. If I change the order in the fetches list, the output is still the same.
with tf.Session() as sess:
    sess.run(init)
    for i in range(5):
        _, *res = sess.run(fetches=[training_op, X, y])
        print(res)
[array([[1., 2., 3.]], dtype=float32), 13.006426]
[array([[ 1.0483627 , -0.76874477, -2.080069 ]], dtype=float32), 4.9738936]
[array([[ 0.9910337 , -1.0735381 , 0.10702228]], dtype=float32), -1.3677568]
[array([[ 1.0567244 , -0.95272505, 0.17122723]], dtype=float32), -1.3784065]
[array([[ 0.978967 , -1.0848547 , 0.27387527]], dtype=float32), -1.4229481]
upd2. A slight modification of the answer by @thushv89 does what I initially expected to see:
with tf.Session() as sess:
    sess.run(init)
    for i in range(2):
        res = sess.run(fetches=[X, y])
        print('Variables before the step', res)
        sess.run(training_op)
        res = sess.run(fetches=[X, y])
        print('Variables after the step', res)
        print()
Variables before the step [array([[1., 2., 3.]], dtype=float32), 13.006426]
Variables after the step [array([[ 1.0483627 , -0.76874477, -2.080069 ]], dtype=float32), 4.9738936]
Variables before the step [array([[ 1.0483627 , -0.76874477, -2.080069 ]], dtype=float32), 4.9738936]
Variables after the step [array([[ 0.9910337 , -1.0735381 , 0.10702228]], dtype=float32), -1.3677568]
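As a cross-check (my own addition), the first update above can be reproduced by hand, since dy/dX = 2X - 4X*cos(X**2) elementwise and the learning rate is 0.3:
import numpy as np

X0 = np.array([1.0, 2.0, 3.0])
grad = 2 * X0 - 4 * X0 * np.cos(X0 ** 2)   # d/dX of sum(X**2) - 2*sum(sin(X**2))
print(X0 - 0.3 * grad)                     # approx. [ 1.0484 -0.7687 -2.0801], matching step 1 above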
You have fetches=[X, y, training_op]. These don't respect the order (at least, you shouldn't expect sess.run() to respect the order). Which means that all of
Evaluate X (so the training_op hasn't happened yet)
Evaluate y (still the training_op hasn't happened yet)
Execute training_op (now X and y have changed)
get executed, and then the results are fetched. If you want the variable X to change first:
Option 1: Breaking the sess.run() function
r1 = sess.run(X)
_, r2 = sess.run(fetches=[training_op, y])
print(r1,r2)
Option 2: Using a separate tf.Variable with tf.control_dependencies
X = tf.Variable(dtype=tf.float32, shape=(1, 3), name="X", initial_value=np.array([[1,2,3]]))
prevX = tf.Variable(dtype=tf.float32, shape=(1, 3), name="prevX", initial_value=np.array([[1,2,3]]))
y = tf.reduce_sum(tf.square(X)) - 2 * tf.reduce_sum(tf.sin(tf.square(X)))
assign_op = tf.assign(prevX, X)
with tf.control_dependencies([assign_op]):
    training_op = tf.train.GradientDescentOptimizer(0.3).minimize(y)

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    for i in range(5):
        *res, _ = sess.run(fetches=[prevX, y, training_op])
        print(res)

Initializing LSTM hidden state Tensorflow/Keras

Can someone explain how I can initialize the hidden state of an LSTM in TensorFlow? I am trying to build an LSTM recurrent auto-encoder, so after I have that model trained I want to transfer the learned hidden state of the unsupervised model to the hidden state of the supervised model.
Is that even possible with current API?
This is paper I am trying to recreate:
http://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.pdf
Yes - this is possible but truly cumbersome. Let's go through an example.
Defining a model:
from keras.layers import LSTM, Input
from keras.models import Model
input = Input(batch_shape=(32, 10, 1))
lstm_layer = LSTM(10, stateful=True)(input)
model = Model(input, lstm_layer)
model.compile(optimizer="adam", loss="mse")
It's important to build and compile the model first, as the initial states are reset during compilation. Moreover, you need to specify a batch_shape in which the batch_size is fixed, since in this scenario our network should be stateful (which is done by setting stateful=True).
Now we could set the values of initial states:
import numpy
import keras.backend as K
hidden_states = K.variable(value=numpy.random.normal(size=(32, 10)))
cell_states = K.variable(value=numpy.random.normal(size=(32, 10)))
model.layers[1].states[0] = hidden_states
model.layers[1].states[1] = cell_states
Note that you need to provide the states as Keras variables. states[0] holds the hidden state and states[1] holds the cell state.
Hope that helps.
As stated in the Keras API documentation for recurrent layers (https://keras.io/layers/recurrent/):
Note on specifying the initial state of RNNs
You can specify the initial state of RNN layers symbolically by calling them with the keyword argument initial_state. The value of initial_state should be a tensor or list of tensors representing the initial state of the RNN layer.
You can specify the initial state of RNN layers numerically by calling reset_states with the keyword argument states. The value of states should be a numpy array or list of numpy arrays representing the initial state of the RNN layer.
Since the LSTM layer has two states (hidden state and cell state) the value of initial_state and states is a list of two tensors.
Examples
Stateless LSTM
Input shape: (batch, timesteps, features) = (1, 10, 1)
Number of units in the LSTM layer = 8 (i.e. dimensionality of hidden and cell state)
import tensorflow as tf
import numpy as np
inputs = np.random.random([1, 10, 1]).astype(np.float32)
lstm = tf.keras.layers.LSTM(8)
c_0 = tf.convert_to_tensor(np.random.random([1, 8]).astype(np.float32))
h_0 = tf.convert_to_tensor(np.random.random([1, 8]).astype(np.float32))
outputs = lstm(inputs, initial_state=[h_0, c_0])
Stateful LSTM
Input shape: (batch, timesteps, features) = (1, 10, 1)
Number of units in the LSTM layer = 8 (i.e. dimensionality of hidden and cell state)
Note that for a stateful LSTM you also need to specify the batch size (here via batch_input_shape).
import tensorflow as tf
import numpy as np
from pprint import pprint
inputs = np.random.random([1, 10, 1]).astype(np.float32)
lstm = tf.keras.layers.LSTM(8, stateful=True, batch_input_shape=(1, 10, 1))
c_0 = tf.convert_to_tensor(np.random.random([1, 8]).astype(np.float32))
h_0 = tf.convert_to_tensor(np.random.random([1, 8]).astype(np.float32))
outputs = lstm(inputs, initial_state=[h_0, c_0])
With a stateful LSTM, the states are not reset at the end of each sequence, and we can notice that the output of the layer corresponds to the hidden state (i.e. lstm.states[0]) at the last timestep:
>>> pprint(outputs)
<tf.Tensor: id=821, shape=(1, 8), dtype=float32, numpy=
array([[ 0.07119043, 0.07012419, -0.06118739, -0.11008392, 0.00573938,
-0.05663438, 0.11196419, 0.02663924]], dtype=float32)>
>>>
>>> pprint(lstm.states)
[<tf.Variable 'lstm_1/Variable:0' shape=(1, 8) dtype=float32, numpy=
array([[ 0.07119043, 0.07012419, -0.06118739, -0.11008392, 0.00573938,
-0.05663438, 0.11196419, 0.02663924]], dtype=float32)>,
<tf.Variable 'lstm_1/Variable:0' shape=(1, 8) dtype=float32, numpy=
array([[ 0.14726108, 0.13584498, -0.12986949, -0.22309153, 0.0125412 ,
-0.11446435, 0.22290672, 0.05397629]], dtype=float32)>]
By calling reset_states() it is possible to reset the states:
>>> lstm.reset_states()
>>> pprint(lstm.states)
[<tf.Variable 'lstm_1/Variable:0' shape=(1, 8) dtype=float32, numpy=array([[0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)>,
<tf.Variable 'lstm_1/Variable:0' shape=(1, 8) dtype=float32, numpy=array([[0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)>]
>>>
or to set them to a specific value:
>>> lstm.reset_states(states=[h_0, c_0])
>>> pprint(lstm.states)
[<tf.Variable 'lstm_1/Variable:0' shape=(1, 8) dtype=float32, numpy=
array([[0.59103394, 0.68249655, 0.04518601, 0.7800545 , 0.3799634 ,
0.27347744, 0.54415804, 0.9889024 ]], dtype=float32)>,
<tf.Variable 'lstm_1/Variable:0' shape=(1, 8) dtype=float32, numpy=
array([[0.43390197, 0.28252542, 0.27139077, 0.19655049, 0.7568088 ,
0.05909375, 0.68569875, 0.19087408]], dtype=float32)>]
>>>
>>> pprint(h_0)
<tf.Tensor: id=422, shape=(1, 8), dtype=float32, numpy=
array([[0.59103394, 0.68249655, 0.04518601, 0.7800545 , 0.3799634 ,
0.27347744, 0.54415804, 0.9889024 ]], dtype=float32)>
>>>
>>> pprint(c_0)
<tf.Tensor: id=421, shape=(1, 8), dtype=float32, numpy=
array([[0.43390197, 0.28252542, 0.27139077, 0.19655049, 0.7568088 ,
0.05909375, 0.68569875, 0.19087408]], dtype=float32)>
>>>
I used this approach; it totally worked out for me:
lstm_cell = LSTM(cell_num, return_state=True)
output, h, c = lstm_cell(input, initial_state=[h_prev, c_prev])
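A self-contained sketch of that pattern (the shapes and the tf.keras names here are my own assumptions, not from the answer above):
import numpy as np
import tensorflow as tf

inputs = tf.constant(np.random.random([1, 10, 1]).astype(np.float32))
h_prev = tf.zeros([1, 8])                                  # initial hidden state
c_prev = tf.zeros([1, 8])                                  # initial cell state

lstm_cell = tf.keras.layers.LSTM(8, return_state=True)
output, h, c = lstm_cell(inputs, initial_state=[h_prev, c_prev])
print(output.shape, h.shape, c.shape)                      # (1, 8) (1, 8) (1, 8)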
Assuming the RNN is in layer 1 and the hidden/cell states are numpy arrays, you can do this:
from keras import backend as K
K.set_value(model.layers[1].states[0], hidden_states)
K.set_value(model.layers[1].states[1], cell_states)
States can also be set using
model.layers[1].states[0] = hidden_states
model.layers[1].states[1] = cell_states
but when I did it this way my state values stayed constant even after stepping the RNN.