How do I approximate the Jacobian and Hessian of a function numerically? - numpy

I have a function in Python:
def f(x):
    return x[0]**3 + x[1]**2 + 7
    # Actually more than this.
    # No analytical expression
It's a scalar valued function of a vector.
How can I approximate the Jacobian and Hessian of this function in numpy or scipy numerically?

(Updated in late 2017 because there's been a lot of updates in this space.)
Your best bet is probably automatic differentiation. There are now many packages for this, because it's the standard approach in deep learning:
Autograd works transparently with most numpy code. It's pure-Python, requires almost no code changes for typical functions, and is reasonably fast.
There are many deep-learning-oriented libraries that can do this.
Some of the most popular are TensorFlow, PyTorch, Theano, Chainer, and MXNet. Each will require you to rewrite your function in their kind-of-like-numpy-but-needlessly-different API, and in return will give you GPU support and a bunch of deep learning-oriented features that you may or may not care about.
FuncDesigner is an older package I haven't used whose website is currently down.
Another option is to approximate it with finite differences, basically just evaluating (f(x + eps) - f(x - eps)) / (2 * eps) (but obviously with more effort put into it than that). This will probably be slower and less accurate than the other approaches, especially in moderately high dimensions, but is fully general and requires no code changes. numdifftools seems to be the standard Python package for this.
You could also attempt to find fully symbolic derivatives with SymPy, but this will be a relatively manual process.
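To make the finite-difference option concrete, here is a minimal sketch using plain NumPy and central differences. The helper names `fd_gradient` and `fd_hessian` and the step sizes are my own choices for illustration, not part of any library, and a package like numdifftools will handle step selection far more carefully than this.

```python
import numpy as np

def f(x):
    return x[0]**3 + x[1]**2 + 7

def fd_gradient(func, x, eps=1e-5):
    """Central-difference gradient of a scalar function (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return grad

def fd_hessian(func, x, eps=1e-4):
    """Central-difference Hessian of a scalar function (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    hess = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n)
        ei[i] = eps
        for j in range(i, n):
            ej = np.zeros(n)
            ej[j] = eps
            # Four-point stencil for the mixed second derivative d2f / dx_i dx_j
            hess[i, j] = (func(x + ei + ej) - func(x + ei - ej)
                          - func(x - ei + ej) + func(x - ei - ej)) / (4 * eps**2)
            hess[j, i] = hess[i, j]  # the Hessian is symmetric
    return hess

x0 = np.array([1.0, 2.0])
grad = fd_gradient(f, x0)   # analytically [3*x0**2, 2*x1] = [3, 4]
hess = fd_hessian(f, x0)    # analytically [[6*x0, 0], [0, 2]] = [[6, 0], [0, 2]]
```

Note that this costs 2n function evaluations for the gradient and on the order of 2n² for the Hessian, which is why it gets slow in moderately high dimensions.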

Restricted to just SciPy, the most convenient way I found was scipy.misc.derivative, within the appropriate loops, with lambdas to curry the function of interest.


Fast Cross Entropy in Numpy

I have come up with two versions of cross-entropy: one in a more vectorized dot-product form, and the other the typical one you will see in any ML lecture. I am trying to speed up my algorithm, so I will take any chance to make it faster.
cost = -(1.0/m) * np.sum(Y*np.log(A) + (1-Y)*np.log(1-A))
Vectorized Version
cost = -(1.0/m) * (np.dot(np.log(A), Y.T) + np.dot(np.log(1-A), (1-Y).T))
My question: which of the above implementations of cross-entropy loss is computed fastest, given the architecture of the NumPy library and other constraints?
It would help with benchmarking to know the typical size of your label arrays. If they are very short, a pure-Python implementation could actually be faster than using NumPy.
Coming back to your question: in the np.dot style, each dot product already reduces to a single number, so the np.sum call (which sums the array of elementwise products in the other style) is not needed.
I hope it helps.
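If you want to measure it yourself, a small timeit harness along these lines should do. This is only a sketch: the array shapes (1, m), the value of m, and the function names `cost_sum` and `cost_dot` are assumptions on my part, so substitute your real A and Y.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                                          # hypothetical number of examples
A = rng.uniform(0.01, 0.99, size=(1, m))            # predicted probabilities, kept away from 0 and 1
Y = rng.integers(0, 2, size=(1, m)).astype(float)   # binary labels

def cost_sum(A, Y):
    m = Y.shape[1]
    return -(1.0 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))

def cost_dot(A, Y):
    m = Y.shape[1]
    out = -(1.0 / m) * (np.dot(np.log(A), Y.T) + np.dot(np.log(1 - A), (1 - Y).T))
    return out.item()   # the (1, 1) result holds a single number

t_sum = timeit.timeit(lambda: cost_sum(A, Y), number=200)
t_dot = timeit.timeit(lambda: cost_dot(A, Y), number=200)
print(f"np.sum version: {t_sum:.4f} s, np.dot version: {t_dot:.4f} s")
```

Both versions spend most of their time in the two np.log calls, so I would not expect a dramatic difference either way; the timings on your own data and machine are what count.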

How does sklearn's standard DBSCAN run so fast?

I've been messing around with alternative implementations of DBSCAN for clustering radar data (like grid-based DBSCAN). Up to this point, I had been using sklearn's standard euclidean DBSCAN and it would run on 26,000 data points in less than a second. However, when I specify my own distance metric, like this:
X = np.column_stack((beam, gate, time_index))
num_pts = X.shape[0]
epsilons = np.array([[beam_eps]*num_pts, [gate_eps] * num_pts, [time_eps] * num_pts]).T
metric = lambda x, y, eps: np.sqrt(np.sum((x/eps - y/eps)**2))
def dist_metric(x, y, eps):
    return np.sqrt(np.sum((x - y)**2))
db = DBSCAN(eps=eps, min_samples=minPts, metric=dist_metric, metric_params={'eps': epsilons}).fit(X)
it goes from 0.36 seconds to 92 minutes to run on the same data.
What I did in that code snippet can also be accomplished with just transforming the data beforehand and running standard Euclidean DBSCAN, but I'm trying to implement a reasonably fast version of Grid-based DBSCAN, for which the horizontal epsilon varies based on distance from the radar, so I won't be able to do that.
Part of the slowness in the above distance metric is because of that division by epsilon I think, because it only takes about a minute to run if I use a 'custom metric' that's just Euclidean distance:
metric = lambda x, y: np.sqrt(np.sum((x - y)**2))
How does sklearn's euclidean DBSCAN manage to run so much faster? I've been digging through the code, but haven't made sense of it so far.
Because it uses an index.
Furthermore, it avoids the slow and memory intensive Python interpreter, but does all the work in native code (compiled from Cython). This makes a huge difference when dealing with lots of primitive data such as doubles and ints that the Python interpreter would need to box.
Indexes make all the difference for similarity search. They can reduce the runtime from O(n²) to O(n log n).
But while the ball tree index allows custom metrics, the cost of invoking the Python interpreter for every distance computation is very high. If you really want a custom metric, edit the Cython source code and compile sklearn yourself. Alternatively, you can use ELKI, because the JVM can compile extension code into native code when necessary; it does not need to fall back to slow interpreter callbacks like sklearn.
In your case, it will likely be much better to preprocess the data instead: scale it prior to clustering.
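For the constant-epsilon metric in the question's snippet, the preprocessing amounts to dividing each column by its epsilon once, up front. Here is a minimal sketch with made-up data; `eps_per_dim` and the sample values are hypothetical stand-ins for beam_eps, gate_eps and time_eps. It only checks that the plain Euclidean distance on the scaled data equals the custom metric on the raw data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                 # hypothetical (beam, gate, time_index) rows
eps_per_dim = np.array([3.0, 1.0, 10.0])    # hypothetical beam_eps, gate_eps, time_eps

def scaled_metric(x, y, eps):
    # The custom metric from the question: Euclidean distance after per-axis scaling
    return np.sqrt(np.sum((x / eps - y / eps) ** 2))

# Scale once, up front, instead of inside every distance call
X_scaled = X / eps_per_dim

d_custom = scaled_metric(X[0], X[1], eps_per_dim)
d_plain = np.linalg.norm(X_scaled[0] - X_scaled[1])
```

X_scaled can then be passed to DBSCAN with the default Euclidean metric and the same eps, which keeps both the index and the native-code distance computations. This does not cover the grid-based case where epsilon varies with range; that still needs a different approach.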

Do scipy and numpy svd or eig always return the same singular/eigen vector?

Since the SVD decomposition is not unique (pairs of left and right singular vectors can have their sign flipped simultaneously), I was wondering to what extent the U and V matrix returned by scipy.linalg.svd() are 'deterministic' / always the same?
I tried it a few times with a random array on my machine and it seems to always return the same thing (fortunately), but could that vary across machines?
SciPy and NumPy both compute the SVD by delegating to the LAPACK _gesdd routine. Any deterministic implementation of this routine will produce the same results every time on a given machine with a given LAPACK implementation, but as far as I know there is no guarantee that different LAPACK implementations (e.g. NETLIB vs MKL, OSX vs Windows, etc.) will use the same convention. If your application depends on some convention for resolving the sign ambiguity, it would be safest to enforce it yourself in some sort of post-processing of the singular vectors; one useful approach is given in Resolving the Sign Ambiguity in the Singular Value Decomposition (pdf)
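As an illustration of such post-processing, here is a minimal sketch of one simple convention: flip each singular-vector pair so that the largest-magnitude entry of every column of U is positive. This is not the method from the paper above, just an easy rule that makes the output reproducible; the function name `svd_fixed_sign` is my own.

```python
import numpy as np

def svd_fixed_sign(a):
    """SVD with a deterministic sign convention (hypothetical helper)."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    cols = np.arange(u.shape[1])
    idx = np.argmax(np.abs(u), axis=0)     # row of the largest-magnitude entry per column
    signs = np.sign(u[idx, cols])
    signs[signs == 0] = 1.0                # guard; not expected for unit-norm columns
    u = u * signs                          # flip columns of U ...
    vt = vt * signs[:, np.newaxis]         # ... and the matching rows of Vt
    return u, s, vt

a = np.random.default_rng(0).normal(size=(6, 4))
u, s, vt = svd_fixed_sign(a)
```

Because each column of U and the matching row of Vt are flipped together, the product u @ diag(s) @ vt is unchanged. Ties in the largest-magnitude entry, or repeated singular values, can still leave some ambiguity.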

Numpy/Scipy pinv and pinv2 behave differently

I am working with two-dimensional arrays in NumPy for Extreme Learning Machines. One of my arrays, H, is random, and I want to compute its pseudoinverse.
If I use scipy.linalg.pinv2 everything runs smoothly. However, if I use scipy.linalg.pinv, sometimes (30-40% of the time) problems arise.
The reason I am using pinv2 is that I read (here: http://vene.ro/blog/inverses-pseudoinverses-numerical-issues-speed-symmetry.html ) that pinv2 performs better on "tall" and on "wide" arrays.
The problem is that, if H has a column j of all ones, pinv(H) has huge coefficients at row j.
This is in turn a problem because, in such cases, np.dot(pinv(H), Y) contains some nan values (Y is an array of small integers).
Now, I am not into linear algebra and numeric computation enough to understand if this is a bug or some precision related property of the two functions. I would like you to answer this question so that, if it's the case, I can file a bug report (honestly, at the moment I would not even know what to write).
I saved the arrays with np.savetxt(fn, a, '%.2e', ';'): please, see https://dl.dropboxusercontent.com/u/48242012/example.tar.gz to find them.
In the provided file, you can see in pinv(H).csv that rows 14, 33, 55, 56 and 99 have huge values, while in pinv2(H) the same rows have more reasonable values.
Any help is appreciated.
In short, the two functions implement two different ways to calculate the pseudoinverse matrix:
scipy.linalg.pinv uses least squares, which may be quite compute intensive and take up a lot of memory.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html#scipy.linalg.pinv
scipy.linalg.pinv2 uses SVD (singular value decomposition), which should run with a smaller memory footprint in most cases.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv2.html#scipy.linalg.pinv2
numpy.linalg.pinv also implements this method.
As these are two different evaluation methods, the resulting matrices will not be the same. Each method has its own advantages and disadvantages, and it is not always easy to determine which one should be used without deeply understanding the data and what the pseudoinverse will be used for. I'd simply suggest some trial-and-error and use the one which gives you the best results for your classifier.
Note that in some cases these functions cannot converge to a solution, and will then raise a scipy.linalg.LinAlgError. In that case you may try the other pinv implementation, which may greatly reduce the number of errors you receive.
Starting from SciPy 1.7.0, pinv2 is deprecated and pinv itself also uses an SVD-based solution.
DeprecationWarning: scipy.linalg.pinv2 is deprecated since SciPy 1.7.0, use scipy.linalg.pinv instead
That means numpy.linalg.pinv, scipy.linalg.pinv and scipy.linalg.pinv2 now all compute equivalent solutions. They are also about equally fast, with SciPy being slightly faster.
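import numpy as np
import scipy.linalg  # import the submodule explicitly; a bare "import scipy" may not load it
arr = np.random.rand(1000, 2000)
res1 = np.linalg.pinv(arr)
res2 = scipy.linalg.pinv(arr)
res3 = scipy.linalg.pinv2(arr)  # emits the DeprecationWarning above; needs a SciPy release that still ships pinv2
# All three results agree to 10 decimal places.
np.testing.assert_array_almost_equal(res1, res2, decimal=10)
np.testing.assert_array_almost_equal(res1, res3, decimal=10)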
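For reference, the SVD approach itself is short enough to sketch by hand. The following is a rough illustration for real matrices, not the actual SciPy or NumPy source; the name `pinv_svd` and the default `rcond` are my own assumptions. It also shows where huge coefficients can come from: a tiny singular value that is not cut off gets inverted into a very large number.

```python
import numpy as np

def pinv_svd(a, rcond=1e-15):
    """Pseudoinverse of a real matrix via SVD with a relative cutoff (hypothetical helper)."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    cutoff = rcond * s.max()
    s_inv = np.zeros_like(s)
    keep = s > cutoff
    s_inv[keep] = 1.0 / s[keep]     # singular values below the cutoff are treated as zero
    return (vt.T * s_inv) @ u.T     # V @ diag(1/s) @ U.T

rng = np.random.default_rng(0)
H = rng.random((20, 10))            # hypothetical well-conditioned test matrix
P = pinv_svd(H)
```

Raising `rcond` discards more of the small singular values, trading some accuracy for much better-behaved coefficients when the matrix is nearly rank-deficient.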

Is there a GPU accelerated numpy.max(X, axis=0) implementation in Theano?

Is there a GPU-accelerated version of numpy.max(X, axis=None) in Theano?
I looked into the documentation and found theano.tensor.max(X, axis=None), but it is 4-5 times slower than the numpy implementation.
I can assure you, it is not slow because of some bad choice of matrix size. Same matrix under theano.tensor.exp is 40 times faster than its numpy counterpart.
Any suggestions?
The previous answer is only partially correct. Its suggestion should not help, because the workaround is already what ends up in the final compiled code: there is an optimization that performs this transformation automatically.
The title of the question isn't the same as the content. They differ by the axis argument. I'll answer both questions.
If the axis is 0 or None, this operation is supported on the GPU for matrices. If the axis is None, we have a basic implementation that isn't well optimized, as it is harder to parallelize. If the axis is 0, we also have a basic implementation, but it is faster, as it is easier to parallelize.
Also, how did you do your timing? If you just make one function with only that operation and test it via the device=gpu flag to do your comparison, this will include the transfer time between CPU and GPU. This is a memory-bound operation, so if you include the transfer in your timing, personally I don't expect any speedup for that case. To see only the GPU operation, use the Theano profiler: run with the Theano flag profile=True.
The max and exp operations are fundamentally different; exp (and other operations like addition, sin, etc.) is an elementwise operation that is embarrassingly parallelizable, while max requires a parallel-processing scan algorithm that basically builds up a tree of pairwise comparisons over an array. It's not impossible to speed up max, but it's not as easy as exp.
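To illustrate the "tree of pairwise comparisons" idea, here is a toy sketch in NumPy. It runs on the CPU and is not how Theano implements it; `tree_max` is just a name I made up. Each pass through the loop corresponds to one round of comparisons that a GPU could perform in parallel, so n elements need about log2(n) dependent rounds, whereas an elementwise op like exp needs only one independent pass.

```python
import numpy as np

def tree_max(values):
    """Maximum via repeated pairwise comparisons (hypothetical helper)."""
    buf = np.array(values, dtype=float)
    if buf.size == 0:
        raise ValueError("tree_max of an empty array is undefined")
    while buf.size > 1:
        if buf.size % 2:
            # Odd length: duplicate the last element, which cannot change the maximum
            buf = np.append(buf, buf[-1])
        # One round: compare neighbours pairwise, halving the array
        buf = np.maximum(buf[0::2], buf[1::2])
    return buf[0]

x = np.random.default_rng(1).normal(size=1001)
result = tree_max(x)
```

The rounds depend on each other, so there is synchronization between them and less work per round as the array shrinks, which is part of why a reduction is harder to speed up than an elementwise operation.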
Anyway, the theano implementation of max basically consists of the following lines (in theano/tensor/basic.py):
try:
    out = max_and_argmax(x, axis)[0]
except Exception:
    out = CAReduce(scal.maximum, axis)(x)
where max_and_argmax is a bunch of custom code that, to my eye, implements a max+argmax operation using numpy, and CAReduce is a generic GPU-accelerated scan operation used as a fallback (which, according to the comments, doesn't support grad etc.). You could try using the fallback directly and see whether that is faster, maybe something like this:
from theano.tensor.elemwise import CAReduce
from theano.scalar import maximum
def mymax(X, axis=None):
    # Return the symbolic result; without the return, mymax would yield None
    return CAReduce(maximum, axis)(X)