Causes of floating point non-determinism? Including NumPy? - numpy

IEEE floating point operations are deterministic, but see How can floating point calculations be made deterministic? for one way that an overall floating point computation can be non-deterministic:
... parallel computations are non-deterministic in terms of the order in which floating-point computations are performed, which can result in non-bit-exact results across runs.
Two-part question:
How else can an overall floating point computation be non-deterministic, yielding results that are not exactly equal?
Consider a single-threaded Python program that calls NumPy, CVXOPT, and SciPy subroutines such as scipy.optimize.fsolve(), which in turn call native libraries like MINPACK and GLPK and optimized linear algebra subroutines like BLAS, ATLAS, and MKL. “If your numpy/scipy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything.”
Do these native libraries ever parallelize in a way that introduces non-deterministic results?
Assumptions:
The same software, with the same inputs, on the same hardware. The output of multiple runs should be equal.
If that works, it's highly desirable to test that the output after doing a code refactoring is equal. (Yes, some changes in order of operations can make some of the output not-equal.)
All random numbers in the program are pseudo-random numbers used in a consistent way from the same seeds across all runs.
No uninitialized values. Python is generally safe in this respect, but numpy.empty() returns a new array without initializing its entries, and it's not clear that it's much faster than numpy.zeros() in practice. So beware!
@PaulPanzer's test shows that numpy.empty() does return an uninitialized array, and that it can easily and quickly recycle a recently freed array:
import numpy as np
# In an interactive session, the "uninitialized" empty() array often contains
# values recycled from the just-freed arange() result:
np.arange(100); np.empty(100, int); np.empty(100, int)
np.arange(100, 200.0); np.empty(100, float); np.empty(100, float)
It's tricky to get useful timing measurements for these routines! Inside a timeit loop, numpy.empty() can just keep recycling the same one or two memory blocks, so the measured time is independent of the array size. To prevent recycling:
from timeit import timeit
timeit('l.append(numpy.empty(100000))', 'import numpy; l = []')
timeit('l.append(numpy.zeros(100000))', 'import numpy; l = []')
but reducing that array size to numpy.zeros(10000) takes 15x as long; reducing it to numpy.zeros(1000) takes 1.3x as long (on my MBP). Puzzling.
See also:
Hash values are salted in Python 3, so iteration order of sets (and of dicts before Python 3.7; modern dicts preserve insertion order) can vary from run to run, which in turn can vary the order of operations. [I'm wrangling with this problem in Python 2.7.15.]
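As a purely illustrative aside (not part of the original question), summing the same values in a different order can already change a floating point result bit-for-bit, which is the core of the order-of-operations issue above:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)

s_forward = x.sum()                             # one summation order
s_permuted = x[rng.permutation(x.size)].sum()   # same values, different order

print(s_forward == s_permuted)      # frequently False
print(abs(s_forward - s_permuted))  # tiny, but often nonzero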

I found that most (not all) of the non-determinism problems I'm experiencing seem to be fixed in the code for OpenBLAS 0.3.5.
A bunch of threading problems in earlier versions of OpenBLAS are fixed in release 0.3.4, but that release has a macOS compatibility bug that's fixed in the code for release 0.3.5. The bugs also occur with Apple's Accelerate framework version 1.1 and Intel's MKL mkl==2019.0.
See how to install OpenBLAS and compile NumPy and SciPy on it.
Perhaps the remaining problems I'm experiencing are due to other libraries linked to Accelerate?
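One pragmatic mitigation (my own suggestion, not part of the fix above): pin the threaded BLAS backends to a single thread so the reduction order inside dot() and friends is fixed. The environment variables below are the usual knobs for OpenBLAS, MKL, and Accelerate, and they must be set before NumPy is imported; this trades speed for run-to-run reproducibility.

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"     # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"          # Intel MKL
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"   # Apple Accelerate / vecLib
import numpy as np  # import only after the variables are set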
Note: I'm still open to more answers to this question.

Related

Graph Lasso algorithm: sklearn.covariance.graph_lasso()

I am working on replicating a paper titled “Improving Mean Variance Optimization through Sparse Hedging Restriction”. The authors’ idea is to use the Graphical Lasso algorithm to infuse some bias into the estimation of the inverse of the sample covariance matrix. The graphical lasso algorithm works perfectly fine in R, but when I use Python on the same data with the same parameters I get two sorts of errors:
If I use coordinate descent (cd) mode as the solver, I get a floating point error: FloatingPointError: Non SPD result: the system is too ill-conditioned for this solver. (The thing that bugs me is that I tried this solver on a simulated positive definite matrix and it still gave me this error.)
If I use the Least Angle Regression (LARS) mode (which is less stable but recommended for ill-conditioned matrices), I get an overflow error stating: OverflowError: int too large to convert to float
To my knowledge, unlike C++ and other languages, Python is not restricted by an upper maximum for integer numbers (besides the capacity of the machine itself), whereas floats are restricted. I think this might be the source of the latter problem. (I have also heard in the past that R is much more robust in dealing with ill-conditioned matrices.) I would be glad to hear about your experience with graph lasso in R or Python.
With this email, I have attached a little python code that simulates this problem in a few lines. Any input will be of great appreciation.
Thank you all,
Skander
from sklearn.covariance import graph_lasso
from sklearn.datasets import make_spd_matrix

# a random symmetric positive-definite matrix used as the empirical covariance
symetric_PD_mx = make_spd_matrix(100)
glout = graph_lasso(emp_cov=symetric_PD_mx, alpha=0.01, mode="lars")
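As a side note (not part of the original email), the OverflowError from the LARS branch is the generic Python behaviour when an integer beyond the float64 range is converted to a float:

# Python ints are unbounded, but float64 overflows near 1.8e308:
float(2 ** 1100)   # raises OverflowError: int too large to convert to float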

Do scipy and numpy svd or eig always return the same singular/eigen vector?

Since the SVD decomposition is not unique (pairs of left and right singular vectors can have their sign flipped simultaneously), I was wondering to what extent the U and V matrix returned by scipy.linalg.svd() are 'deterministic' / always the same?
I tried it a few times with a random array on my machine and it seems to always return the same thing (fortunately), but could that vary across machines?
SciPy and NumPy both compute the SVD by out-sourcing to the LAPACK _gesdd routine. Any deterministic implementation of this routine will produce the same results every time on a given machine with a given LAPACK implementation, but as far as I know there is no guarantee that different LAPACK implementations (i.e. NETLIB vs MKL, OSX vs Windows, etc.) will use the same convention. If your application depends on some convention for resolving the sign ambiguity, it would be safest to ensure it yourself in some sort of post-processing of the singular vectors; one useful approach is given in Resolving the Sign Ambiguity in the Singular Value Decomposition (pdf).
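For example, here is a minimal sketch of one such post-processing convention (my own, simpler than the data-driven method in the linked paper): make the largest-magnitude entry of each left singular vector positive and flip the matching right singular vector so the product is unchanged.

import numpy as np
from scipy.linalg import svd

def fix_svd_signs(U, s, Vt):
    # For each singular pair, flip signs so that the entry of the left
    # singular vector with the largest magnitude is positive; the matching
    # row of Vt is flipped too, so U @ diag(s) @ Vt stays the same.
    idx = np.argmax(np.abs(U), axis=0)
    signs = np.sign(U[idx, np.arange(U.shape[1])])
    signs[signs == 0] = 1.0
    return U * signs, s, Vt * signs[:, np.newaxis]

A = np.random.default_rng(0).standard_normal((6, 4))
U, s, Vt = svd(A, full_matrices=False)
U, s, Vt = fix_svd_signs(U, s, Vt)
np.testing.assert_allclose(U @ np.diag(s) @ Vt, A, atol=1e-10)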

Numpy/Scipy pinv and pinv2 behave differently

I am working with two-dimensional arrays in NumPy for Extreme Learning Machines. One of my arrays, H, is random, and I want to compute its pseudoinverse.
If I use scipy.linalg.pinv2 everything runs smoothly. However, if I use scipy.linalg.pinv, problems sometimes (30-40% of the time) arise.
The reason why I am using pinv2 is because I read (here: http://vene.ro/blog/inverses-pseudoinverses-numerical-issues-speed-symmetry.html ) that pinv2 performs better on "tall" and on "wide" arrays.
The problem is that, if H has a column j that is all 1s, pinv(H) has huge coefficients in row j.
This is in turn a problem because, in such cases, np.dot(pinv(H), Y) contains some nan values (Y is an array of small integers).
Now, I don't know enough linear algebra and numerical computation to tell whether this is a bug or some precision-related property of the two functions. I would like you to answer this question so that, if it is indeed a bug, I can file a report (honestly, at the moment I would not even know what to write).
I saved the arrays with np.savetxt(fn, a, '%.2e', ';'): please, see https://dl.dropboxusercontent.com/u/48242012/example.tar.gz to find them.
Any help is appreciated. In the provided file, you can see in pinv(H).csv that rows 14, 33, 55, 56 and 99 have huge values, while in pinv2(H) the same rows have more decent values.
Your help is appreciated.
In short, the two functions implement two different ways to calculate the pseudoinverse matrix:
scipy.linalg.pinv uses least squares, which may be quite compute intensive and take up a lot of memory.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html#scipy.linalg.pinv
scipy.linalg.pinv2 uses SVD (singular value decomposition), which should run with a smaller memory footprint in most cases.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv2.html#scipy.linalg.pinv2
numpy.linalg.pinv also implements this method.
As these are two different evaluation methods, the resulting matrices will not be the same. Each method has its own advantages and disadvantages, and it is not always easy to determine which one should be used without deeply understanding the data and what the pseudoinverse will be used for. I'd simply suggest some trial-and-error and use the one which gives you the best results for your classifier.
Note that in some cases these functions cannot converge to a solution and will then raise a scipy.linalg.LinAlgError. In that case you may try the second implementation, pinv2, which will greatly reduce the number of errors you receive.
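A minimal sketch of that fallback (my own illustration, for older SciPy versions where pinv is least-squares based and pinv2 still exists):

import numpy as np
from scipy import linalg

H = np.random.rand(500, 100)
try:
    H_pinv = linalg.pinv(H)    # least-squares based (pre-1.7 behaviour)
except np.linalg.LinAlgError:
    H_pinv = linalg.pinv2(H)   # SVD based, more robust fallback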
Starting from SciPy 1.7.0, pinv2 is deprecated and pinv itself now uses an SVD-based solution as well.
DeprecationWarning: scipy.linalg.pinv2 is deprecated since SciPy 1.7.0, use scipy.linalg.pinv instead
That means numpy.linalg.pinv, scipy.linalg.pinv, and scipy.linalg.pinv2 now all compute equivalent solutions. They are also roughly equally fast, with SciPy being slightly faster.
import numpy as np
import scipy.linalg

arr = np.random.rand(1000, 2000)
res1 = np.linalg.pinv(arr)
res2 = scipy.linalg.pinv(arr)
res3 = scipy.linalg.pinv2(arr)  # still present here, but deprecated since SciPy 1.7.0
np.testing.assert_array_almost_equal(res1, res2, decimal=10)
np.testing.assert_array_almost_equal(res1, res3, decimal=10)

Is there a GPU accelerated numpy.max(X, axis=0) implementation in Theano?

Do we have a GPU accelerated version of numpy.max(X, axis=None) in Theano?
I looked into the documentation and found theano.tensor.max(X, axis=None), but it is 4-5 times slower than the numpy implementation.
I can assure you it is not slow because of some bad choice of matrix size. The same matrix under theano.tensor.exp is 40 times faster than its numpy counterpart.
Any suggestions?
The previous answer is only partial, and its suggested workaround shouldn't help: the fallback it points to is exactly what ends up in the final compiled code anyway, because an optimization performs that transformation automatically.
The title of the question isn't the same as the content. They differ by the axis argument. I'll answer both questions.
If the axis is 0 or None, we support this operation on the GPU for matrices. If the axis is None, we have a basic implementation that isn't well optimized, as it is harder to parallelize. If the axis is 0, we have a basic implementation too, but it is faster as it is easier to parallelize.
Also, how did you do your timing? If you just make one function with only that operation and test it via the device=gpu flag for your comparison, this will include the transfer time between the CPU and the GPU. This is a memory-bound operation, so if you include the transfer in your timing, personally I don't expect any speed-up in that case. To see only the GPU operation, use the Theano profiler: run with the Theano flag profile=True.
The max and exp operations are fundamentally different; exp (and other operations like addition, sin, etc.) is an elementwise operation that is embarrassingly parallelizable, while max requires a parallel-processing scan algorithm that basically builds up a tree of pairwise comparisons over an array. It's not impossible to speed up max, but it's not as easy as exp.
Anyway, the theano implementation of max basically consists of the following lines (in theano/tensor/basic.py):
try:
    out = max_and_argmax(x, axis)[0]
except Exception:
    out = CAReduce(scal.maximum, axis)(x)
where max_and_argmax is a bunch of custom code that, to my eye, implements a max+argmax operation using numpy, and CAReduce is a generic GPU-accelerated scan operation used as a fallback (which, according to the comments, doesn't support grad etc.). You could try using the fallback directly and see whether that is faster, maybe something like this:
from theano.tensor.elemwise import CAReduce
from theano.scalar import maximum

def mymax(X, axis=None):
    return CAReduce(maximum, axis)(X)

Understanding Numpy internals for profiling purposes

Profiling a piece of numpy code shows that I'm spending most of the time within these two functions
numpy/matrixlib/defmatrix.py.__getitem__:301
numpy/matrixlib/defmatrix.py.__array_finalize__:279
Here's the Numpy source:
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L301
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L279
Question #1:
__getitem__ seems to be called every time I'm using something like my_array[arg] and it's getting more expensive if arg is not an integer but a slice. Is there any way to speed up calls to array slices?
E.g. in
for i in range(idx): res[i] = my_array[i:i+10].mean()
Question #2:
When exactly does __array_finalize__ get called and how can I speed up by reducing the number of calls to this function?
Thanks!
You could avoid using matrices so much and just use 2D numpy arrays. I typically only use matrices for a short time, to take advantage of the syntax for multiplication (but with the addition of the .dot method on arrays, I find I do that less and less as well).
But, to your questions:
1) There really is no shortcut around __getitem__ unless defmatrix overrides __getslice__, which it could do but doesn't yet. There are the .item and .itemset methods, which are optimized for integer getting and setting (and return Python objects rather than NumPy array scalars).
2) __array_finalize__ is called whenever an array object (or a subclass) is created. It is called from the C-function that every array-creation gets funneled through. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L1003
In the case of sub-classes defined purely in Python, it is calling back into the Python interpreter from C which has overhead. If the matrix class were a builtin type (a Cython-based cdef class, for example), then the call could avoid the Python interpreter overhead.
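A tiny illustration (my own, not from the answer) of when __array_finalize__ fires for a Python-level subclass:

import numpy as np

class MyArray(np.ndarray):
    def __array_finalize__(self, obj):
        # runs on explicit construction, view casting, and new-from-template
        # (e.g. slicing), so it is hit on every slice of a MyArray
        print("__array_finalize__ called, obj is a", type(obj).__name__)

a = np.arange(10).view(MyArray)   # view casting -> one call
b = a[2:5]                        # slicing (new-from-template) -> another call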
Question 1:
Since array slices create new array objects (views onto the same data, holding pointers into memory, plus Python-level overhead that is especially noticeable for subclasses like matrix), they can be quite expensive in a tight loop. If you're really bottlenecked by this in your example above, you can compute the mean yourself by iterating over the i to i+10 elements and accumulating manually (see the vectorized sketch just below for another option). For some operations this won't give any performance improvement, but avoiding the creation of new data structures will generally speed things up.
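For the specific loop in Question #1, a hypothetical vectorized alternative (my sketch; the names are illustrative) computes all the length-10 window means at once, so no per-iteration slice object is created at all:

import numpy as np

my_array = np.random.rand(10_000)
window = 10

# running sums: res[i] equals my_array[i:i+window].mean()
csum = np.cumsum(np.insert(my_array, 0, 0.0))
res = (csum[window:] - csum[:-window]) / window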
Another note: if you're not using native types inside numpy you will pay a very large performance penalty when manipulating a numpy array. Say your array has dtype=float64 but the float size your machine handles natively is float32 -- this costs numpy a lot of extra computation and overall performance will drop. Sometimes this is fine and you can just take the hit to keep a particular data type. Other times it's arbitrary what type the float or int is stored as internally; in those cases try dtype=float instead of dtype=float64, since numpy should default to your native type. I've had 3x+ speedups on numpy-intensive algorithms by making this change.
Question 2:
__array_finalize__ "is called whenever the system internally allocates a new array from obj, where obj is a subclass (subtype) of the (big)ndarray" according to the SciPy docs. So this is a consequence of the slicing described in the first question: when you slice and make a new array, that array has to be finalized by either making structural copies or wrapping the original structure, and that operation takes time. Avoiding slices will save on this operation, though for multidimensional data it may be impossible to completely avoid calls to __array_finalize__.