I am finding that scipy.linalg.eig sometimes gives inconsistent results, but not every time:
>>> import numpy as np
>>> import scipy.linalg as lin
>>> modmat=np.random.random((150,150))
>>> modmat=modmat+modmat.T # the data I am interested in is described by real symmetric matrices
>>> d,v=lin.eig(modmat)
>>> dx=d.copy()
>>> vx=v.copy()
>>> d,v=lin.eig(modmat)
>>> np.all(d==dx)
False
>>> np.all(v==vx)
False
>>> e,w=lin.eigh(modmat)
>>> ex=e.copy()
>>> wx=w.copy()
>>> e,w=lin.eigh(modmat)
>>> np.all(e==ex)
True
>>> e,w=lin.eigh(modmat)
>>> np.all(e==ex)
False
While I am not the greatest linear algebra wizard, I do understand that the eigendecomposition is inherently subject to rounding error; what I don't understand is why repeating the computation gives a different result. The reproducibility of my results varies from run to run.
What exactly is the nature of the problem? Well, sometimes the differences are acceptable, and sometimes they aren't. Here are some examples:
>>> d[1]
(9.8986888573772465+0j)
>>> dx[1]
(9.8986888573772092+0j)
The difference above of ~4e-14 does not seem like an enormously big deal. Instead, the real problem (at least for my present project) is that some of the eigenvalues cannot seem to agree on the proper sign.
>>> np.all(np.sign(d)==np.sign(dx))
False
>>> np.nonzero(np.sign(d)!=np.sign(dx))
(array([ 38, 39, 40, 41, 42, 45, 46, 47, 79, 80, 81, 82, 83,
84, 109, 112]),)
>>> d[38]
(-6.4011617320002525+0j)
>>> dx[38]
(6.1888785138080209+0j)
Similar code in MATLAB does not seem to have this problem.
The eigenvalue decomposition satisfies A V = V Lambda, which is all that is guaranteed; the order of the eigenvalues, for instance, is not.
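A quick way to convince yourself of this is to check that identity directly for both runs (a small sketch using modmat, d, v and dx, vx from the question; multiplying v by d scales each eigenvector column by its eigenvalue):
np.allclose(np.dot(modmat, v), v * d)      # expected True for both runs,
np.allclose(np.dot(modmat, vx), vx * dx)   # even though d and dx are ordered differently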
Answer to the second part of your question:
Modern compilers and linear algebra libraries contain code paths that do different things
depending on whether the data is aligned in memory on (e.g.) 16-byte boundaries. This affects rounding error, because the floating point operations are done in a different order. Small changes in rounding error can then affect things such as the ordering of the eigenvalues if the algorithm (here, LAPACK's xGEEV) does not guarantee numerical stability in this respect.
(If your code is sensitive to things like this, it is incorrect! Running it on, e.g., a different platform or with a different library version would lead to a similar problem.)
The results are usually quasi-deterministic --- for instance, you get one of two possible results, depending on whether the array happens to be aligned in memory or not. If you are curious about alignment, check A.__array_interface__['data'][0] % 16.
See http://www.nccs.nasa.gov/images/FloatingPoint_consistency.pdf for more details.
I think your problem is that you are expecting the eigenvalues to be returned in a particular order, and they don't always come out the same way. Sort them, and you'll be on your way. If I run your code to generate d and dx with eig, I get the following:
>>> np.max(d - dx)
(19.275224236664116+0j)
But...
>>> d_i = np.argsort(d)
>>> dx_i = np.argsort(dx)
>>> np.max(d[d_i] - dx[dx_i])
(1.1368683772161603e-13+0j)
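If you also need the eigenvectors to correspond, apply the same permutation to the columns of v and vx (a sketch reusing d_i and dx_i from above; note that even after sorting, matching eigenvectors can still differ by an overall sign, so compare them up to sign):
d_s, v_s = d[d_i], v[:, d_i]        # reorder eigenvalues and the matching eigenvector columns
dx_s, vx_s = dx[dx_i], vx[:, dx_i]
np.max(np.abs(np.abs(v_s) - np.abs(vx_s)))   # should be small when the eigenvalues are well separated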
I have the following arrays
a = np.array([[1,2,3],[4,5,6]])
b = np.array([[1,5,10]])
and want to add the values in b to each row of a, giving
np.array([[2,7,13],[5,10,16]])
What is the best approach to achieve this, with performance in mind?
Thanks
Broadcasting does that for you, so:
>>> a+b
just works:
array([[ 2, 7, 13],
[ 5, 10, 16]])
And it can also be done with
>>> a + np.tile(b,(2,1))
which gives the result
array([[ 2, 7, 13],
[ 5, 10, 16]])
Depending on the size of the inputs and your time constraints, both of the following methods are worth considering.
Method 1: Numpy Broadcasting
Operations on two arrays are possible if they are compatible
Operations are generally done together with broadcasting
Broadcasting, in layman's terms, means repeating elements along a specified axis
Conditions for broadcasting
Arrays need to be compatible
Compatibility is decided based on their shapes
Shapes are compared from right to left
While comparing from right to left, the dimensions must either be equal or one of them must be 1
The smaller array is broadcast (repeated) over the bigger array
a.shape, b.shape
((2, 3), (1, 3))
From the rules they are compatible, so they can be added. b is smaller, so b is repeated along the first dimension and can be treated as [[1, 5, 10], [1, 5, 10]]. Note, however, that numpy does not allocate new memory for this; it is just a view.
a + b
array([[ 2, 7, 13],
[ 5, 10, 16]])
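You can see the "no new memory" point with np.broadcast_to, which builds that broadcast view explicitly (a small illustration; the same zero-stride trick is used internally when you write a + b):
bb = np.broadcast_to(b, a.shape)   # read-only view of b with shape (2, 3)
bb.strides                         # first stride is 0: the single row of b is reused, not copied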
Method 2: Numba
Numba gives easy parallelism
It compiles your function to optimized machine code
Why bother? Sometimes numpy broadcasting alone is not good enough: ufuncs (np.add, np.matmul, etc.) allocate temporary memory during operations, and that can be costly if you are already near the memory limit
Using numba, depending on your requirements, you can avoid the temporary memory allocation and the various checks which numpy does, which can speed up code for huge inputs; see, for example, "Why are np.hypot and np.subtract.outer very fast?"
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def sum(a, b):
    s = np.empty(a.shape, dtype=a.dtype)
    # nb.prange hints to numba which loop to parallelize
    for i in nb.prange(a.shape[0]):
        s[i] = a[i] + b[0]   # b has shape (1, 3); add its single row to each row of a
    return s

sum(a, b)
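If the only concern is the temporary array allocated by a + b, plain NumPy can also write into a preallocated output, which sidesteps that allocation without Numba (a small sketch):
out = np.empty_like(a)
np.add(a, b, out=out)   # broadcasts b over the rows of a; no temporary beyond `out`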
I wanted to implement Singular Value Decomposition (SVD) as the collaborative filtering method for recommendation systems. I have this sparse_matrix, with rows representing users and columns representing items, and each matrix entry as the user-item rating.
>>> type(sparse_matrix)
scipy.sparse.csr.csr_matrix
First I factorized this matrix using SVD:
from scipy.sparse.linalg import svds
u, s, vt = svds(sparse_matrix.asfptype(), k = 2)
s_diag = np.diag(s)
Then I make the prediction by taking the dot product of u, s_diag, and vt:
>>> tmp = np.dot(u, s_diag)
>>> pred = np.dot(tmp, vt)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
I got a MemoryError. However, I checked the size and memory usage of tmp and vt:
>>> tmp.shape
(686556, 2)
>>> tmp.nbytes
10984896
>>> vt.shape
(2, 85539)
>>> vt.nbytes
1368624
which means that tmp is around 11MB and vt is 1.4MB. But at the time of np.dot(tmp, vt), my system has over 50GB free memory available, which seems sufficient for this computation. So why am I getting this MemoryError? Is there something wrong with my code? Or is np.dot super expensive in terms of memory usage?
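For reference, the dense result of np.dot(tmp, vt) has shape (686556, 85539); a quick back-of-the-envelope check of its size in float64:
>>> 686556 * 85539 * 8 / 1e9   # rows * cols * 8 bytes per float64, in GB
469.818509472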
I think you get this error because np.dot is not able to handle sparse matrices.
As a check, please try converting the matrices to dense arrays.
Check the sparse documentation (https://docs.scipy.org/doc/scipy/reference/sparse.html).
try:
np.dot(u.toarray(), s_diag.toarray())
or use
u.dot(s_diag)
While the new Categorical Series support since pandas 0.15.0 is fantastic, I'm a bit annoyed with how they decided to make the underlying data inaccessible except through underscored variables. Consider the following code:
import numpy as np
import pandas as pd
x = np.empty(3, dtype=np.int64)
s = pd.DatetimeIndex(x, tz='UTC')
x
Out[17]: array([140556737562568, 55872352, 32])
s[0]
Out[18]: Timestamp('1970-01-02 15:02:36.737562568+0000', tz='UTC')
x[0] = 0
s[0]
Out[20]: Timestamp('1970-01-01 00:00:00+0000', tz='UTC')
y = s.values
y[0] = 5
x[0]
Out[23]: 5
s[0]
Out[24]: Timestamp('1970-01-01 00:00:00.000000005+0000', tz='UTC')
We can see that, both at construction and when asked for its underlying values, this DatetimeIndex makes no deep copies of its underlying data. Not only is this potentially useful in terms of efficiency, but it's great if you are using a DataFrame as a buffer: you can easily get the numpy array containing the underlying data, and from there a pointer to the raw memory, which some low-level C routine can use as the destination of a copy from another block of memory.
Now let's look at the behavior of the new Categorical series. The underlying data, of course, is not the levels but the codes.
x2 = np.zeros(3, dtype=np.int64)
s2 = pd.Categorical.from_codes(x2, ["hello", "bye"])
s2
Out[27]:
[hello, hello, hello]
Categories (2, object): [hello, bye]
x2[0] = 1
s2[0]
Out[29]: 'hello'
y2 = s2.codes
y2[0] = 1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-0366d645c98d> in <module>()
----> 1 y2[0] = 1
ValueError: assignment destination is read-only
y2 = s2._codes
y2[0] = 1
s2[0]
Out[34]: 'bye'
The net effect of this behavior is that, as a developer, efficient manipulation of the underlying data for Categoricals is not part of the interface. Also, as a user, the from_codes constructor is slow because it deep copies the codes, which may often be unnecessary. There should at least be an option to avoid this.
But the fact that codes is a read-only variable and _codes needs to be used instead strikes me as worse. Why wouldn't .codes give the same behavior as .values? Is there some justification for this beyond the notion that the codes are "private"? I'm hoping some of the pandas gurus on Stack Overflow can shed some light on this.
The Categorical type is different from almost all other types in that it is a compound type with a certain guarantee about its data: namely, that the codes provide a factorization of the levels.
So the argument against mutability is that it would be easy to break the codes-categories mapping, and it could be non-performant. Of course this could possibly be mitigated by checking on setitem instead (but with some added code complexity).
The vast majority of users are not going to manipulate the codes/categories directly (they only use the exposed methods), so this is really a protection against accidentally breaking these guarantees.
If you need to efficiently manipulate the underlying data, the best/easiest approach is simply to pull out the codes/categories, mutate them, and then create a new Categorical (which is cheap if the codes/categories are already provided).
e.g.
In [3]: s2 = pd.Categorical.from_codes(x2, ["hello", "bye"])
In [4]: s2
Out[4]:
[hello, hello, hello]
Categories (2, object): [hello, bye]
In [5]: s2.codes
Out[5]: array([0, 0, 0], dtype=int8)
In [6]: pd.Categorical(s2.codes+1,s2.categories,fastpath=True)
Out[6]:
[bye, bye, bye]
Categories (2, object): [hello, bye]
Of course this is quite dangerous: if you added 2 instead, the expression would blow up. Manipulating the codes directly is simply buyer-beware.
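If you go this route, a cheap sanity check on the shifted codes before rebuilding keeps you from constructing an invalid Categorical (a sketch building on s2 from above; the bounds below, -1 as the missing-value code and len(categories) as the upper limit, are the invariants to respect):
new_codes = s2.codes + 1                      # arithmetic returns a fresh, writable array
assert new_codes.min() >= -1                  # -1 is the missing-value code
assert new_codes.max() < len(s2.categories)   # codes must index into the categories
pd.Categorical(new_codes, s2.categories, fastpath=True)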
I have two computers with Python 2.7.2 [MSC v.1500 32 bit (Intel)] on win32 and numpy 1.6.1.
But
numpy.mean(data)
returns
1.13595094681 on my old computer
and
1.13595104218 on my new computer
where
data = [ 0.20227873 -0.02738848 0.59413314 0.88547146 1.26513398 1.21090782
1.62445402 1.80423951 1.58545554 1.26801944 1.22551131 1.16882968
1.19972098 1.41940248 1.75620842 1.28139281 0.91190684 0.83705413
1.19861531 1.30767155]
In both cases
s=0
for n in data[:20]:
    s+=n
print s/20
gives
1.1359509334
Can anyone explain why, and how to avoid it?
Mads
If you want to avoid any differences between the two machines, then make the arrays explicitly 32-bit or 64-bit float arrays. NumPy uses several other libraries that may be 32- or 64-bit. Note that rounding can occur in your print statements as well:
>>> import numpy as np
>>> a = [0.20227873, -0.02738848, 0.59413314, 0.88547146, 1.26513398,
1.21090782, 1.62445402, 1.80423951, 1.58545554, 1.26801944,
1.22551131, 1.16882968, 1.19972098, 1.41940248, 1.75620842,
1.28139281, 0.91190684, 0.83705413, 1.19861531, 1.30767155]
>>> x32 = np.array(a, np.float32)
>>> x64 = np.array(a, np.float64)
>>> x32.mean()
1.135951042175293
>>> x64.mean()
1.1359509335
>>> print x32.mean()
1.13595104218
>>> print x64.mean()
1.1359509335
Another point to note is that if your lower-level libraries (e.g., ATLAS, LAPACK) are multi-threaded, then for large arrays you may see differences in your results regardless, due to a possibly variable order of operations combined with finite floating point precision.
Also, you are at the limit of precision for 32-bit floats:
>>> x32.sum()
22.719021
>>> np.array(sorted(x32)).sum()
22.719019
This is happening because you have float32 arrays (single precision). With single precision, the operations are only accurate to about 6 decimal digits. Hence your results are the same up to the 6th decimal place (after the decimal point, rounding the last digit), but they are not accurate beyond that. Different architectures/machines/compilers will yield different results past that point. If you want the same results you should use higher precision arrays (e.g. float64).
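If the data has to stay in float32 (to save memory, say), another option is to ask mean for a wider accumulator explicitly, which removes most of the run-to-run sensitivity (a sketch using the x32 array from above):
x32.mean(dtype=np.float64)   # data stays float32, but the reduction is accumulated in float64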
I've been rewriting a MATLAB/Octave program in numpy and ran across a difference in some resulting values.
This occurs with both the percentile/prctile and the standard-deviation functions.
In Numpy:
import matplotlib.mlab as ml
import numpy
>>> t = numpy.linspace(0,100, 100)
>>> numpy.percentile(t,95)
95.0
>>> numpy.std(t)
29.157646512850626
>>> ml.prctile(t,95)
95.000000000000014
In Octave:
octave:1> t = linspace(0,100,100)';
octave:2> prctile(t,95)
ans = 95.454545
octave:3> std(t)
ans = 29.304537
Although the array values of 't' are the same, the results are more different than I would expect.
In the numpy help(numpy.std) they specifically mention that the algorithm is:
std = sqrt(mean(abs(x - x.mean())**2))
So I implemented that in Octave and got the exact answer numpy gives. So it seems the standard-deviation functions differ.
But why/how? And which is correct? (if there is such a thing)
And what about prctile/percentile?
Just in case, since I'm on Linux (aptosid)...
GNU Octave, version 3.6.2
numpy.__version__ is '1.6.2rc1'
Numpy simply uses a different algorithm when the percentile lies between two data points. Octave, MATLAB and R always center it exactly between two points when needed (I believe); numpy does a bit more than that. If you check http://en.wikipedia.org/wiki/Percentile you will see there are a couple of ways to calculate percentiles.
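For what it's worth, Octave's value can be reproduced by hand if you assume it places the k-th sorted sample at probability 100*(k-0.5)/N and interpolates linearly between those points (a sketch; this appears to match the 95.454545 above):
t = numpy.linspace(0, 100, 100)
p_k = 100.0 * (numpy.arange(1, len(t) + 1) - 0.5) / len(t)   # 0.5, 1.5, ..., 99.5
numpy.interp(95, p_k, numpy.sort(t))                         # ~95.4545, Octave's answer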
It seems like Octave assumes ddof=1, at least by default, while numpy uses ddof=0 by default:
>>> numpy.std(t, ddof=0)
29.157646512850633
>>> numpy.std(t, ddof=1)
29.304537349375785
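In other words, ddof controls the divisor N - ddof in the variance; you can check both forms directly (a small verification using t from above):
n = len(t)
numpy.sqrt(((t - t.mean())**2).sum() / n)         # matches numpy.std(t), i.e. ddof=0 (divide by N)
numpy.sqrt(((t - t.mean())**2).sum() / (n - 1))   # matches numpy.std(t, ddof=1) (divide by N-1), which is what Octave's std reports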