I am trying to understand the TensorFlow API tf.GradientTape.
Below is the code I got from the official website:
x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    y = x * x
    z = y * y
dz_dx = g.gradient(z, x)  # 108.0 (4*x^3 at x = 3)
dy_dx = g.gradient(y, x)  # 6.0
I wanted to know how they got dz_dx as 108 and dy_dx as 6.
I also did another test like below:
x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    y = x * x * x
    z = y * y
dz_dx = g.gradient(z, x)  # 1458.0
dy_dx = g.gradient(y, x)  # 6.0
This time dz_dx becomes 1458 and I do not know why at all. Could an expert show me how the calculation is done?
From y = x*x, we have dy/dx = 2*x. From z = y*y, we have dz/dy = 2*y. By the chain rule, dz/dx = (dz/dy)*(dy/dx) = (2*y)*(2*x) = (2*x*x)*(2*x) = 4*x^3, which is 108 at x = 3, and dy/dx = 2*x = 6. The same derivation applies to your second example: y = x^3 gives dy/dx = 3*x^2 = 27, and z = y*y = x^6 gives dz/dx = (2*y)*(3*x^2) = 6*x^5 = 1458 at x = 3. By the way, in your second example the comment for dy/dx should be 27 instead of 6.
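If it helps, here is a minimal sketch (assuming TF 2.x eager execution; in TF 1.x you would evaluate these in a session) that reproduces both numbers of the second example:
import tensorflow as tf
x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    y = x * x * x   # y = x^3
    z = y * y       # z = x^6
print(g.gradient(y, x).numpy())  # 3*x^2 = 27.0
print(g.gradient(z, x).numpy())  # 6*x^5 = 1458.0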
I am solving a linear system using different methods. According to my interpretation of the NumPy documentation, all three of my tested methods (SVD inversion, Moore-Penrose inversion, and least squares) should give the same answer. However, the SVD inversion gives a very different answer. I cannot find a mathematical reason for this in Numerical Recipes. Is there a NumPy implementation nuance that is causing this?
I am using the following code on Python 3.8.10, NumPy 1.21.4, in a Jupyter notebook:
y = np.array([176, 166, 194])
x = np.array([324, 322, 376])
x = np.stack([x, np.ones_like(x)], axis=1)
# Solve the matrix using singular value decomposition
u, s, vh = np.linalg.svd(x, full_matrices=False)
s = np.where(s < np.finfo(s.dtype).eps, 0, s)
manual_scale, manual_offset = vh @ np.linalg.inv(np.diag(s)) @ u.T @ y
display(manual_scale, manual_offset, manual_scale * x + manual_offset)
# Solve the matrix using Moore-Penrose Inversion
# Manually
manual_scale, manual_offset = np.linalg.inv(x.T @ x) @ x.T @ y
display(manual_scale, manual_offset, manual_scale * x + manual_offset)
# Using supplied numpy methods
manual_scale, manual_offset = np.linalg.pinv(x) @ y
display(manual_scale, manual_offset, manual_scale * x + manual_offset)
# Solve using lstsq
((manual_scale, manual_offset), residuals, rank, s) = np.linalg.lstsq(x, y)
display(manual_scale, manual_offset, manual_scale * x + manual_offset)
The output (edited for clarity) is then
'SVD'
0.6091639943577222
29.167637174498772
array([[226.53677135, 29.77680117],
[225.31844336, 29.77680117],
[258.21329905, 29.77680117]])
'Manual Moore-Penrose'
0.4388335704125341
29.170697012800005
array([[171.35277383, 29.60953058],
[170.47510669, 29.60953058],
[194.17211949, 29.60953058]])
'Moore-Penrose'
0.43883357041251736
29.170697012802187
array([[171.35277383, 29.60953058],
[170.47510669, 29.60953058],
[194.17211949, 29.60953058]])
'LSTSQ'
/tmp/ipykernel_261995/387148285.py:24: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
((manual_scale, manual_offset), residuals, rank, s) = np.linalg.lstsq(x, y)
0.43883357041251814
29.17069701280214
array([[171.35277383, 29.60953058],
[170.47510669, 29.60953058],
[194.17211949, 29.60953058]])
As you can see, the latter three methods get the same result, yet the manual SVD calculation is different. What is going on?
You are missing a transpose of vh. The SVD solution should be
manual_scale, manual_offset = vh.T @ np.linalg.inv(np.diag(s)) @ u.T @ y
By the way, you can simplify the inverse of the diagonal factor:
manual_scale, manual_offset = vh.T @ np.diag(1/s) @ u.T @ y
(That assumes there are no zeros in s.)
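As a quick sanity check, here is a self-contained sketch (rebuilding x as in the question; pinv_from_svd is a name chosen for illustration) showing that the corrected expression reproduces NumPy's own pseudo-inverse:
import numpy as np
x = np.stack([np.array([324, 322, 376]), np.ones(3)], axis=1)
u, s, vh = np.linalg.svd(x, full_matrices=False)
# vh.T @ diag(1/s) @ u.T is the Moore-Penrose pseudo-inverse of x
pinv_from_svd = vh.T @ np.diag(1 / s) @ u.T
print(np.allclose(pinv_from_svd, np.linalg.pinv(x)))  # True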
For the next person who needs this, the fixed code is below. Thanks Warren!
y = np.array([176, 166, 194])
x = np.array([324, 322, 376])
x = np.stack([x, np.ones_like(x)], axis=1)
# Solve the matrix using singular value decomposition
u, s, vh = np.linalg.svd(x, full_matrices=False)
s = np.where(s < np.finfo(s.dtype).eps, 0, s)
manual_scale, manual_offset = vh.T @ np.diag(1/s) @ u.T @ y
display('SVD')
display(manual_scale, manual_offset, manual_scale * x + manual_offset)
# Solve the matrix using Moore-Penrose Inversion
# Manually
manual_scale, manual_offset = np.linalg.inv(x.T @ x) @ x.T @ y
display('Manual Moore-Penrose')
display(manual_scale, manual_offset, manual_scale * x + manual_offset)
# Using supplied numpy methods
manual_scale, manual_offset = np.linalg.pinv(x) @ y
display('Moore-Penrose')
display(manual_scale, manual_offset, manual_scale * x + manual_offset)
# Solve using lstsq
display('LSTSQ')
((manual_scale, manual_offset), residuals, rank, s) = np.linalg.lstsq(x, y)
display(manual_scale, manual_offset, manual_scale * x + manual_offset)
I have a B x M x N tensor, X, and a B x 1 tensor, Y, which contains the index along dimension 1 of X that I want to keep. What is the shorthand for this slice so that I can avoid a loop?
Essentially I want to do this:
Z = torch.zeros(B, N)
for i in range(B):
    Z[i] = X[i][Y[i]]
The following code is similar to the code in the loop. The difference is that instead of indexing the arrays Z, X, and Y sequentially, we index them in parallel using the index array i:
B, M, N = 13, 7, 19
X = np.random.randint(100, size= [B,M,N])
Y = np.random.randint(M , size= [B,1])
Z = np.random.randint(100, size= [B,N])
i = np.arange(B)
Y = Y.ravel() # reducing array to rank-1, for easy indexing
Z[i] = X[i,Y[i],:]
This code can be further simplified as:
-> Z[i] = X[i,Y[i],:]
-> Z[i] = X[i,Y[i]]
-> Z[i] = X[i,Y]
-> Z = X[i,Y]
The equivalent PyTorch code:
B, M, N = 5, 7, 3
X = torch.randint(100, size= [B,M,N])
Y = torch.randint(M , size= [B,1])
Z = torch.randint(100, size= [B,N])
i = torch.arange(B)
Y = Y.ravel()
Z = X[i,Y]
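To convince yourself the two are equivalent, here is a small sketch (reusing the shapes above; Z_loop and Z_vec are names chosen for illustration) that checks the vectorized indexing against the original loop:
import torch
B, M, N = 5, 7, 3
X = torch.randint(100, size=[B, M, N])
Y = torch.randint(M, size=[B, 1])
# reference: the original loop
Z_loop = torch.zeros(B, N, dtype=X.dtype)
for i in range(B):
    Z_loop[i] = X[i][Y[i]]
# vectorized version
Z_vec = X[torch.arange(B), Y.ravel()]
print(torch.equal(Z_loop, Z_vec))  # True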
The answer provided by @Hammad is short and perfect for the job. Here's an alternative solution if you're interested in using some less-known PyTorch built-ins. We will use torch.gather (the NumPy analogue is np.take_along_axis).
The idea behind torch.gather is to build a new tensor from an index tensor (here ~ Y) and a value tensor (here ~ X); the output has the same shape as the index tensor.
The operation performed is Z[i][j][k] = X[i][Y[i][j][k]][k].
Since X's shape is (B, M, N) and Y's shape is (B, 1), we need to expand Y so that its shape becomes (B, 1, N).
This can be achieved with some axis manipulation:
>>> Y.expand(-1, N)[:, None] # expand dim=1 to N, then unsqueeze dim=1
The actual call to torch.gather will be:
>>> X.gather(dim=1, index=Y.expand(-1, N)[:, None])
Which you can reshape to (B, N) by adding in [:, 0].
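Putting it together, a self-contained sketch (shapes as above; Z_gather and Z_loop are names chosen for illustration) that checks the gather result against a plain loop:
import torch
B, M, N = 5, 7, 3
X = torch.randint(100, size=[B, M, N])
Y = torch.randint(M, size=[B, 1])
# gather along dim=1 with the index expanded to shape (B, 1, N)
Z_gather = X.gather(dim=1, index=Y.expand(-1, N)[:, None])[:, 0]
# reference loop
Z_loop = torch.stack([X[i, Y[i, 0]] for i in range(B)])
print(torch.equal(Z_gather, Z_loop))  # True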
This function can be very effective in tricky scenarios...
I am trying to train the MNIST data (which I downloaded from Kaggle) with simple multi-class logistic regression, but the scipy.optimize functions hang.
Here's the code:
import csv
from math import exp
from numpy import *
from scipy.optimize import fmin, fmin_cg, fmin_powell, fmin_bfgs

# Prepare the data
def getIiter(ifname):
    """
    Get the iterator from a csv file with filename ifname
    """
    ifile = open(ifname, 'r')
    iiter = csv.reader(ifile)
    iiter.__next__()
    return iiter

def parseRow(s):
    y = [int(x) for x in s]
    lab = y[0]
    z = y[1:]
    return (lab, z)

def getAllRows(ifname):
    iiter = getIiter(ifname)
    x = []
    l = []
    for row in iiter:
        lab, z = parseRow(row)
        x.append(z)
        l.append(lab)
    return x, l

def cutData(x, y):
    """
    70% training
    30% testing
    """
    m = len(x)
    t = int(m * .7)
    return [(x[:t], y[:t]), (x[t:], y[t:])]

def num2IndMat(l):
    t = array(l)
    tt = [vectorize(int)((t == i)) for i in range(10)]
    return array(tt).T

def readData(ifname):
    x, l = getAllRows(ifname)
    t = [[1] + y for y in x]
    return array(t), num2IndMat(l)

# Calculate the cost function
def sigmoid(x):
    return 1 / (1 + exp(-x))

vSigmoid = vectorize(sigmoid)
vLog = vectorize(log)

def costFunction(theta, x, y):
    sigxt = vSigmoid(dot(x, theta))
    cm = (- y * vLog(sigxt) - (1 - y) * vLog(1 - sigxt)) / m / N
    return sum(cm)

def unflatten(flatTheta):
    return [flatTheta[i * N : (i + 1) * N] for i in range(n + 1)]

def costFunctionFlatTheta(flatTheta):
    return costFunction(unflatten(flatTheta), trainX, trainY)

def costFunctionFlatTheta1(flatTheta):
    return costFunction(flatTheta.reshape(785, 10), trainX, trainY)

x, y = readData('train.csv')
[(trainX, trainY), (testX, testY)] = cutData(x, y)

m = len(trainX)
n = len(trainX[0]) - 1
N = len(trainY[0])

initTheta = zeros(((n + 1), N))
flatInitTheta = ndarray.flatten(initTheta)
flatInitTheta1 = initTheta.reshape(1, -1)
In the last two lines we flatten initTheta because the fmin{,_cg,_bfgs,_powell} functions seem to only take vectors as the initial value argument x0. I also flattened initTheta using reshape in the hope that this answer could be of help.
There is no problem computing the cost function, which takes less than 2 seconds on my computer:
print(costFunctionFlatTheta(flatInitTheta), costFunctionFlatTheta1(flatInitTheta1))
# 0.69314718056 0.69314718056
But all the fmin functions hang, even if I set maxiter=0.
e.g.
newFlatTheta = fmin(costFunctionFlatTheta, flatInitTheta, maxiter=0)
or
newFlatTheta1 = fmin(costFunctionFlatTheta1, flatInitTheta1, maxiter=0)
When I interrupt the program, it seems to hang at lines in optimize.py that call the cost functions, lines like this:
return function(*(wrapper_args + args))
For example, if I use fmin_cg, this would be line 292 in optimize.py (Version 0.5).
How do I solve this problem?
OK I found a way to stop fmin_cg from hanging.
Basically I just need to write a function that computes the gradient of the cost function, and pass it to the fprime parameter of fmin_cg.
def gradient(theta, x, y):
    return dot(x.T, vSigmoid(dot(x, theta)) - y) / m / N

def gradientFlatTheta(flatTheta):
    return ndarray.flatten(gradient(flatTheta.reshape(785, 10), trainX, trainY))
Then
newFlatTheta = fmin_cg(costFunctionFlatTheta, flatInitTheta, fprime=gradientFlatTheta, maxiter=0)
terminates within seconds, and by setting maxiter to a higher number (say 100) one can train the model within a reasonable amount of time.
The documentation of fmin_cg says the gradient will be numerically computed if no fprime is given, which is what I suspect caused the hanging.
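A rough, hypothetical back-of-the-envelope estimate (numbers taken from the question above) shows why the numerical gradient can look like a hang:
n_params = 785 * 10      # size of flatInitTheta
secs_per_cost_eval = 2   # "less than 2 seconds" per cost evaluation
# a forward-difference gradient needs roughly one cost evaluation per parameter,
# so a single numerical gradient takes on the order of:
print(n_params * secs_per_cost_eval / 3600, "hours")  # ~4.4 hours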
Thanks to this notebook by zgo2016@Kaggle which helped me find the solution.
I'm trying to use the backward and grad functions in PyTorch, but I don't know why the value below is returned.
Here is my code.
x = Variable(torch.FloatTensor([[1,2],[3,4]]), requires_grad=True)
y = x + 2
z = y * y
gradient = torch.ones(2, 2)
z.backward(gradient)
print(x.grad)
I think the result should be [[6, 8], [10, 12]], because dz/dx = 2*(x+2) and x = 1, 2, 3, 4.
But the returned value is [[7, 9], [11, 13]].
Why does this happen? I want to know what the gradient/grad function is doing.
Please help.
The below piece of code on pytorch v0.12.1
import torch
from torch.autograd import Variable
x = Variable(torch.FloatTensor([[1,2],[3,4]]), requires_grad=True)
y = x + 2
z = y * y
gradient = torch.ones(2, 2)
z.backward(gradient)
print(x.grad)
returns
Variable containing:
6 8
10 12
[torch.FloatTensor of size 2x2]
Update your pytorch installation. This explains the working of autograd, which handles gradient computation for pytorch.
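For completeness, here is a minimal sketch using the current PyTorch API (plain tensors with requires_grad=True instead of Variable) that gives the same result:
import torch
x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
y = x + 2
z = y * y
# backward with a gradient of ones is equivalent to z.sum().backward()
z.backward(torch.ones(2, 2))
print(x.grad)  # tensor([[ 6.,  8.], [10., 12.]])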
Let x and y be vectors of length N, and let z = f(x, y). In TensorFlow v1.0.0, tf.hessians(z, x) and tf.hessians(z, y) both return an N by N matrix, which is what I expected.
However, when I concatenate x and y into a vector p of size 2*N using tf.concat and run tf.hessians(z, p), it returns the error "ValueError: None values not supported."
I understand this is because in the computation graph x, y -> z and x, y -> p, so there is no path (and hence no gradient) from p to z. To circumvent the problem I could create p first and then slice it into x and y, but I would have to change a ton of my code. Is there a more elegant way?
related question: Slice of a variable returns gradient None
import tensorflow as tf
import numpy as np
N = 2
A = tf.Variable(np.random.rand(N,N).astype(np.float32))
B = tf.Variable(np.random.rand(N,N).astype(np.float32))
x = tf.Variable(tf.random_normal([N]) )
y = tf.Variable(tf.random_normal([N]) )
#reshape to N by 1
x_1 = tf.reshape(x,[N,1])
y_1 = tf.reshape(y,[N,1])
#concat x and y to form a vector with length of 2*N
p = tf.concat([x,y],axis = 0)
#define the function
z = 0.5*tf.matmul(tf.matmul(tf.transpose(x_1), A), x_1) + 0.5*tf.matmul(tf.matmul(tf.transpose(y_1), B), y_1) + 100
#works , hx and hy are both N by N matrix
hx = tf.hessians(z,x)
hy = tf.hessians(z,y)
#this gives error "ValueError: None values not supported."
#expecting a matrix of size 2*N by 2*N
hp = tf.hessians(z,p)
Compute the Hessian from its definition:
gxy = tf.gradients(z, [x, y])
gp = tf.concat([gxy[0], gxy[1]], axis=0)
hp = []
for i in range(2*N):
    hp.append(tf.gradients(gp[i], [x, y]))
Because tf.gradients sums dy/dx over the elements of y, when computing the second partial derivatives one should slice the gradient vector into scalars and take the gradient of each scalar separately. Tested on tf1.0 and python2.
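If you need the result as a single 2N by 2N matrix, a possible follow-up (a sketch building on the loop above and assuming TF 1.x graph mode; rows and hp_matrix are names introduced here) is to concatenate each row's two gradient pieces and stack them:
rows = []
for i in range(2 * N):
    gi = tf.gradients(gp[i], [x, y])    # two tensors, each of shape (N,)
    rows.append(tf.concat(gi, axis=0))  # one Hessian row of length 2N
hp_matrix = tf.stack(rows, axis=0)      # shape (2N, 2N)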