What does tensorflow.python.eager.tape do in the implementation of tf.contrib.eager.custom_gradient?

I am going through TensorFlow Eager Execution from here and find it difficult to understand the part about customizing gradients.
@tfe.custom_gradient
def logexp(x):
    e = tf.exp(x)
    def grad(dy):
        return dy * (1 - 1/(1 + e))
    return tf.log(1 + e), grad
First, it is difficult to make sense of what dy does in the gradient function.
When I read the implementation of tf.contrib.eager.custom_gradient, I can't really make sense of the working mechanism behind tape. The following is the code I borrowed from the implementation of tf.contrib.eager.custom_gradient. Can anybody explain what tape does here?
from tensorflow.python.eager import tape
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import gen_array_ops
from tensorflow.python.util import nest
from tensorflow.python.framework import ops as tf_ops

def my_custom_gradient(f):
    def decorated(*args, **kwargs):
        for x in args:
            print('args {0}'.format(x))
        input_tensors = [tf_ops.convert_to_tensor(x) for x in args]
        # run f without recording, so f's internal ops don't end up on the tape
        with tape.stop_recording():
            result, grad_fn = f(*args, **kwargs)
        flat_result = nest.flatten(result)
        flat_result = [gen_array_ops.identity(x) for x in flat_result]
        def actual_grad_fn(*outputs):
            print(*outputs)
            return nest.flatten(grad_fn(*outputs))
        # record the whole of f as a single op with a custom backward function
        tape.record_operation(
            f.__name__,       # the name of f, in this case logexp
            flat_result,
            input_tensors,
            actual_grad_fn)   # backward_function
        flat_result = list(flat_result)
        return nest.pack_sequence_as(result, flat_result)
    return decorated
I found the implementation of tape here, but I can't really get much out of it due to the sparse documentation.

I will answer each sub-question separately:
Re: what is dy: The primary use case for gradient functions is during back propagation. During back propagation, the gradient function for each op must take the gradient(s) of the op's output(s) and produce the gradient(s) of its input(s). This is effectively a consequence of the chain rule from calculus. For simple ops, the final gradient is just the multiplication of dy with the op's own gradient (again from the chain rule). For the logexp example, d/dx log(1 + e^x) = e^x/(1 + e^x) = 1 - 1/(1 + e^x), which is exactly why the gradient function returns dy * (1 - 1/(1 + e)).
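To make this concrete, here is a minimal sketch using the modern tf.custom_gradient and tf.GradientTape spellings from TF 2.x (the tfe/contrib symbols in the question are the older equivalents). The tape feeds dy = dL/dy into grad, which applies the chain rule:
import tensorflow as tf

@tf.custom_gradient
def logexp(x):
    e = tf.exp(x)
    def grad(dy):
        # dy is dL/d(logexp(x)); the chain rule multiplies it
        # by d(logexp(x))/dx = 1 - 1/(1 + e^x)
        return dy * (1 - 1 / (1 + e))
    return tf.math.log(1 + e), grad

x = tf.constant(2.0)
with tf.GradientTape() as t:
    t.watch(x)
    y = logexp(x)
# matches e^2 / (1 + e^2), approximately 0.881
print(t.gradient(y, x).numpy())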
Re: custom_gradient is complicated: Yes, it is. The public API for gradient tapes is tfe.GradientTape, which should be much easier to understand and work with. You can find simple examples in its spec and its tests. A more complex "real world" example can be found here. If its basic workings are not clear, please ask specific questions. Also, we will soon publish a more detailed guide for working with gradients when executing eagerly.
The tape that is used to implement custom_gradient and GradientTape is a low level concept that wraps some C++ code. End users should not care about it (it is not exposed in the tfe namespace). It is used to build a "tape" of executed operations. The tape is similar to, but simpler than, a regular TF graph. It allows one to compute gradients between two tensors it has recorded.
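For reference, a minimal GradientTape example (the public API mentioned above, not the internal tape module):
import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape() as g:
    g.watch(x)   # record operations that involve x
    y = x * x    # the multiply is recorded on the tape
# the tape replays its records backwards to get dy/dx = 2x = 6
print(g.gradient(y, x).numpy())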

Related

Equation constraints with scipy least_squares

I'm trying to use least squares to minimize a loss function by changing x, y, z. My problem is nonlinear, hence why I chose scipy's least_squares. The general structure is:
from scipy.optimize import least_squares

def loss_func(x, *arguments):
    # plug x's and args into an arbitrary equation and return loss
    return loss  # loss here is an array

# x_arr contains x, y, z
res = least_squares(loss_func, x_arr, args=arguments)
I am trying to constrain x, y, z by: x - y = some value, z - y = some value. How do I go about doing so? The scipy least_squares documentation only provides bounds. I understand I can create bounds like 0 < x < 5; however, my constraints are equations, not constant bounds. Thank you in advance!
If anyone ever stumbles on this question, I've figured out how to overcome this issue. Since least_squares does not support constraints, it is best to use constrained minimization via scipy.optimize.minimize instead. Since loss_func returns an array of residuals, we can collapse it with the L1 norm (as we want to minimize the absolute differences in this array of residuals).
from scipy.optimize import minimize
import numpy as np

def loss_func(x, *arguments):
    # plug x's and args into an arbitrary equation to get an array of
    # residuals (loss), then reduce it to a scalar with the L1 norm
    return np.linalg.norm(loss, 1)
The bounds can be added to scipy.optimize.minimize fairly easily:)
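For completeness, here is a runnable sketch of that approach with equality constraints. The model, data, and constraint values below are made up for illustration; minimize accepts constraints of the form g(params) == 0 with the SLSQP method. (Note the L1 norm is not smooth, so a smooth objective may converge more reliably.)
from scipy.optimize import minimize
import numpy as np

# toy data generated from x=1, y=2, z=3; in practice this comes from your problem
t = np.linspace(0.0, 1.0, 20)
data = 1.0 + 2.0 * t + 3.0 * t**2

def loss_func(params):
    x, y, z = params
    residuals = (x + y * t + z * t**2) - data
    return np.linalg.norm(residuals, 1)  # L1 norm of the residual array

# equality constraints written as g(params) == 0,
# here x - y = -1.0 and z - y = 1.0 (example values)
cons = (
    {'type': 'eq', 'fun': lambda p: p[0] - p[1] + 1.0},
    {'type': 'eq', 'fun': lambda p: p[2] - p[1] - 1.0},
)
res = minimize(loss_func, x0=np.zeros(3), constraints=cons, method='SLSQP')
print(res.x)  # should recover approximately [1, 2, 3]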

Efficient solving of generalised eigenvalue problems in python

Given a generalized eigenvalue problem Ax = λBx, which of the two approaches shown here is the more efficient way to solve it:
import scipy as sp
import scipy.linalg  # makes sp.linalg available
import numpy as np

def geneivprob(A, B):
    # Use scipy
    lamda, eigvec = sp.linalg.eig(A, B)
    return lamda, eigvec

def geneivprob2(A, B):
    # Reduce the problem to a standard symmetric eigenvalue problem
    Linv = np.linalg.inv(np.linalg.cholesky(B))
    C = Linv @ A @ Linv.transpose()
    #C = np.asmatrix((C + C.transpose())*0.5, np.float32)
    lamda, V = np.linalg.eig(C)
    return lamda, Linv.transpose() @ V
I saw the second version in a codebase and was wondering if it was better than simply using scipy.
Well, there is no obvious advantage in using the second approach; maybe for some class of matrices it will be better, so I would suggest you test with the problems you want to solve. Since you are transforming the eigenvectors, this also transforms how errors affect the solution, and maybe that is the reason for using the second method: not efficiency, but numerical accuracy or convergence.
Another thing is that the second method will only work when B is symmetric positive definite (otherwise the Cholesky factorization does not exist).
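As a side note, when A is symmetric and B is symmetric positive definite, scipy.linalg.eigh solves the generalized problem directly; a small sketch with randomly generated matrices:
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M + M.T                      # symmetric A
N = rng.standard_normal((5, 5))
B = N @ N.T + 5 * np.eye(5)      # symmetric positive definite B

# eigh solves A v = lambda B v directly for the symmetric-definite case
lamda, V = linalg.eigh(A, B)
print(np.allclose(A @ V, B @ V * lamda))  # True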

In OpenMDAO's ExecComp, is shape_by_conn compatible with has_diag_partials?

I have an om.ExecComp that performs a simple operation:
"d_sq = x**2 + y**2"
where x, y, and d_sq are always 1D np.arrays. I'd like to be able to use this with large arrays without allocating a large dense matrix. I'd also like the length of the array to be configured based on the shape of the connections.
However, if I specify x={"shape_by_conn": True} rather than x={"shape":100000}, even if I also have has_diag_partials=True, it attempts to allocate a 100000^2 array. Is there a way to make these two options compatible?
First, I'll note that you're using ExecComp a bit outside its intended design purpose. That's not to say that you're doing something totally invalid, but generally speaking ExecComp was designed for small, cheap calculations. Passing it giant arrays is not something we test for.
That being said, I think what you want will work. When you use shape_by_conn in this way, you need to be sure to size both your inputs and outputs. I've provided an example below, along with a manually defined component that does the same thing. Since your equations are pretty simple, the manually defined component would be a little faster overall.
import numpy as np
import openmdao.api as om

class SparseCalc(om.ExplicitComponent):
    def setup(self):
        self.add_input('x', shape_by_conn=True)
        self.add_input('y', shape_by_conn=True)
        self.add_output('d_sq', shape_by_conn=True, copy_shape='x')

    def setup_partials(self):
        # when using shape_by_conn, you need to declare partials
        # in this secondary method
        md = self.get_io_metadata(iotypes='input')
        # everything should be the same shape, so just need this one
        x_shape = md['x']['shape']
        row_col = np.arange(x_shape[0])
        self.declare_partials('d_sq', 'x', rows=row_col, cols=row_col)
        self.declare_partials('d_sq', 'y', rows=row_col, cols=row_col)

    def compute(self, inputs, outputs):
        outputs['d_sq'] = inputs['x']**2 + inputs['y']**2

    def compute_partials(self, inputs, J):
        J['d_sq', 'x'] = 2*inputs['x']
        J['d_sq', 'y'] = 2*inputs['y']

if __name__ == "__main__":
    p = om.Problem()
    # use IVC here, because you have to have something connected to
    # in order to use shape_by_conn. Normally IVC is not needed
    ivc = p.model.add_subsystem('ivc', om.IndepVarComp(), promotes=['*'])
    ivc.add_output('x', 3*np.ones(10))
    ivc.add_output('y', 2*np.ones(10))

    # p.model.add_subsystem('sparse_calc', SparseCalc(), promotes=['*'])
    p.model.add_subsystem('sparse_exec_calc',
                          om.ExecComp('d_sq = x**2 + y**2',
                                      x={'shape_by_conn': True},
                                      y={'shape_by_conn': True},
                                      d_sq={'shape_by_conn': True,
                                            'copy_shape': 'x'},
                                      has_diag_partials=True),
                          promotes=['*'])
    p.setup(force_alloc_complex=True)
    p.run_model()
If you still find this isn't working as expected, please feel free to submit a bug report with a test case that shows the problem clearly (i.e. will raise the error you're seeing). In this case, the provided manual component can serve as a workaround.

Why does keras (SGD) optimizer.minimize() not reach global minimum in this example?

I'm in the process of completing a TensorFlow tutorial via DataCamp and am transcribing/replicating the code examples I am working through in my own Jupyter notebook.
Here are the original instructions from the coding problem (provided as a screenshot in the original post).
I'm running the following snippet of code and am not able to arrive at the same result that is generated within the tutorial, which I have confirmed are the correct values via a connected scatterplot of x vs. loss_function(x), as seen a bit further below.
# imports
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import Variable, keras

def loss_function(x):
    import math
    return 4.0*math.cos(x-1) + np.divide(math.cos(2.0*math.pi*x), x)
# Initialize x_1 and x_2
x_1 = Variable(6.0, np.float32)
x_2 = Variable(0.3, np.float32)

# Define the optimization operation
opt = keras.optimizers.SGD(learning_rate=0.01)

for j in range(100):
    # Perform minimization using the loss function and x_1
    opt.minimize(lambda: loss_function(x_1), var_list=[x_1])
    # Perform minimization using the loss function and x_2
    opt.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())
I drew a quick connected scatterplot to confirm (successfully) that the loss function I am using gets me back to the same graph provided by the example (seen in the screenshot above):
# Generate loss_function(x) values for a given range of x-values
losses = []
for p in np.linspace(0.1, 6.0, 60):
    losses.append(loss_function(p))

# Define x,y coordinates
x_coordinates = list(np.linspace(0.1, 6.0, 60))
y_coordinates = losses

# Plot
plt.scatter(x_coordinates, y_coordinates)
plt.plot(x_coordinates, y_coordinates)
plt.title('Plot of Input values (x) vs. Losses')
plt.xlabel('x')
plt.ylabel('loss_function(x)')
plt.show()
Here are the resulting global and local minima, respectively, as per the DataCamp environment :
4.38 is the correct global minimum, and 0.42 indeed corresponds to the first local minimum on the graph's RHS (when starting from x_2 = 0.3).
And here are the results from my environment, both of which move in the opposite direction from the one they should be heading in when seeking to minimize the loss value:
I've spent the better part of the last 90 minutes trying to sort out why my results disagree with those of the DataCamp console, and why the optimizer fails to minimize this loss for this simple toy example.
I appreciate any suggestions that you might have after you've run the provided code in your own environments, many thanks in advance!!!
As it turned out, the difference in outputs arose from the default precision of tf.divide() (vs. np.divide()) and tf.cos() (vs. math.cos()) -- operations which were specified in (my transcribed, "custom") definition of loss_function().
The loss_function() had been predefined in the body of the tutorial, and when I inspected it using the inspect package (via inspect.getsourcelines(loss_function)) in order to redefine it in my own environment, the output of that inspection didn't clearly indicate that tf.divide and tf.cos had been used instead of their NumPy/math counterparts (which my version of the code had used).
The actual difference is quite small, but it is apparently sufficient to push the optimizer in the opposite direction (away from the two respective minima).
After swapping in tf.divide() and tf.cos() (as seen below), I was able to arrive at the same results as seen in the DC console.
Here is the code for the loss_function that will back in to the same results as seen in the console (screenshot) :
def loss_function(x):
    import math
    return 4.0*tf.cos(x-1) + tf.divide(tf.cos(2.0*math.pi*x), x)
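For what it's worth, here is a minimal sketch of the dtype difference: the TF ops compute in float32 by default, while the math/NumPy version computes in float64, so the two definitions evaluate to slightly different values:
import math
import numpy as np
import tensorflow as tf

x = 0.3
loss_np = 4.0*math.cos(x-1) + np.divide(math.cos(2.0*math.pi*x), x)
loss_tf = 4.0*tf.cos(x-1) + tf.divide(tf.cos(2.0*math.pi*x), x)

print(loss_np)          # computed in float64 throughout
print(loss_tf.numpy())  # computed with float32 tensors
print(loss_tf.dtype)    # tf.float32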

Exponential decay curve fitting in numpy and scipy

I'm having a bit of trouble with fitting a curve to some data, but can't work out where I am going wrong.
In the past I have done this with numpy.linalg.lstsq for exponential functions and scipy.optimize.curve_fit for sigmoid functions. This time I wished to create a script that would let me specify various functions, determine parameters and test their fit against the data. While doing this I noticed that Scipy leastsq and Numpy lstsq seem to provide different answers for the same set of data and the same function. The function is simply y = e^(l*x) and is constrained such that y=1 at x=0.
Excel trend line agrees with the Numpy lstsq result, but as Scipy leastsq is able to take any function, it would be good to work out what the problem is.
import scipy.optimize as optimize
import numpy as np
import matplotlib.pyplot as plt

## Sampled data
x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763, 0.003630962, 0.001485394, 0.000495131])

# function
fp = lambda p, x: np.exp(p*x)

# error function
e = lambda p, x, y: (fp(p, x) - y)

# using scipy least squares
l1, s = optimize.leastsq(e, -0.004, args=(x, y))
print(l1)
# [-0.0132281]

# using numpy least squares
l2 = np.linalg.lstsq(np.vstack([x, np.zeros(len(x))]).T, np.log(y))[0][0]
print(l2)
# -0.00313461628963 (same answer as Excel trend line)

# smooth x for plotting
x_ = np.arange(0, x[-1], 0.2)

plt.figure()
plt.plot(x, y, 'rx', x_, fp(l1, x_), 'b-', x_, fp(l2, x_), 'g-')
plt.show()
Edit - additional information
The MWE above includes a small sample of the dataset. When fitting the actual data the scipy.optimize.curve_fit curve presents an R^2 of 0.82, while the numpy.linalg.lstsq curve, which is the same as that calculated by Excel, has an R^2 of 0.41.
You are minimizing different error functions.
When you use numpy.linalg.lstsq, the error function being minimized is
np.sum((np.log(y) - p * x)**2)
while scipy.optimize.leastsq minimizes the function
np.sum((y - np.exp(p * x))**2)
The first case requires a linear dependency between the dependent and independent variables, but the solution is known analytically, while the second can handle any dependency but relies on an iterative method.
On a separate note, I cannot test it right now, but when using numpy.linalg.lstsq, you don't need to vstack a row of zeros; the following works as well:
l2 = np.linalg.lstsq(x[:, None], np.log(y))[0][0]
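As a side note, scipy.optimize.curve_fit (mentioned in the question) wraps the same least-squares machinery as leastsq, so a minimal sketch like the following should reproduce the leastsq result rather than the lstsq one:
import numpy as np
from scipy.optimize import curve_fit

x = np.array([0, 14, 37, 975, 2013, 2095, 2147], dtype=float)
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763,
              0.003630962, 0.001485394, 0.000495131])

# curve_fit minimizes sum((y - f(x, p))**2), the same criterion as leastsq
popt, pcov = curve_fit(lambda x, p: np.exp(p * x), x, y, p0=[-0.004])
print(popt)  # should land near leastsq's answer of about -0.0132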
To expound a bit on Jaime's point, any non-linear transformation of the data will lead to a different error function and hence to different solutions. These will also lead to different confidence intervals for the fitted parameters. So you have three possible criteria to use to make a decision: which error you want to minimize, which parameters you want more confidence in, and, if you are using the fit to predict some value, which method yields less error in the predicted value of interest. Playing around a bit analytically and in Excel suggests that different kinds of noise in the data (e.g. whether the noise scales the amplitude, affects the time constant, or is additive) lead to different choices of solution.
I'll also add that while this trick "works" for exponential decay to 0, it can't be used in the more general (and common) case of damped exponentials (rising or falling) to values that cannot be assumed to be 0.