Perplexing behavior numpy.linalg.eig (A possibly serious issue) - numpy

Consider the following simple piece of code:
import numpy as np
A = np.array([[0,1,1,0],[1,0,0,1],[1,0,0,1],[0,1,1,0]], dtype=float)
eye4 = np.eye(4, dtype=float) # 4x4 identity
H1 = np.kron(A,eye4)
w1,v1 = np.linalg.eig(H1)
H1copy = np.dot(np.dot(v1,np.diag(w1)),np.transpose(v1)) # reconstructing from eigvals and eigvecs
H2 = np.kron(eye4,A)
w2,v2 = np.linalg.eig(H2)
H2copy = np.dot(np.dot(v2,np.diag(w2)),np.transpose(v2))
print np.sum((H1-H1copy)**2) # sum of squares of elements
print np.sum((H2-H2copy)**2)
It produces the output
1.06656622138
8.7514256673e-30
This is very perplexing. These two matrices differ only in the order of the kronecker product. And yet, the accuracy is so low in just one of them. Further, an norm-square error > 1.066 is highly unacceptable according to me. What is going wrong here?
Further, what is the best way to work around this issue, given that the eigenvalue decomposition is a small part of a code that has to be run several (>100) times.

Your matrices are symmetric. Use eigh instead of eig.
If you use eig, the transpose of v1 is not necessarily equal to the inverse of v1.

Related

Applying LSA on term document matrix when number of documents are very less

I have a term-document matrix (X) of shape (6, 25931). The first 5 documents are my source documents and the last document is my target document. The column represents counts for different words in the vocabulary set. I want to get the cosine similarity of the last document with each of the other documents.
But since SVD produces an S of size (min(6, 25931),), If I used the S to reduce my X, I get a 6 * 6 matrix. But In this case, I feel that I will be losing too much information since I am reducing a vector of size (25931,) to (6,).
And when you think about it, usually, the number of documents will always be less than number of vocabulary words. In this case, using SVD to reduce dimensionality will always produce vectors that are of size (no documents,).
According to everything that I have read, when SVD is used like this on a term-document matrix, it's called LSA.
Am I implementing LSA correctly?
If this is correct, then is there any other way to reduce the dimensionality and get denser vectors where the size of the compressed vector is greater than (6,)?
P.S.: I also tried using fit_transform from sklearn.decomposition.TruncatedSVD which expects the vector to be of the form (n_samples, n_components) which is why the shape of my term-document matrix is (6, 25931) and not (25931, 6). I kept getting a (6, 6) matrix which initially confused me. But now it makes sense after I remembered the math behind SVD.
If the objective of the exercise is to find the cosine similarity, then the following approach can help. The author is only attempting to solve for the objective and not to comment on the definition of Latent Semantic Analysis or the definition of Singular Value Decomposition mentioned by the questioner.
Let us first invoke all the required libraries. Please install them if they do not exist in the machine.
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Let us generate some sample data for this exercise.
df = {'sentence': ['one two three','two three four','four five','six seven eight nine ten']}
df = pd.DataFrame(df, columns = ['sentence'])
The first step is to get the exhaustive list of all the possible features. So collate all of the content at one place.
all_content = [' '.join(df['sentence'])]
Let us build a vectorizer and fit it now. Please note that the arguments in the vectorizer are not explained by the author as the focus is on solving the problem.
vectorizer = TfidfVectorizer(encoding = 'latin-1',norm = 'l2', min_df = 0.03, ngram_range = (1,2), max_features = 5000)
vectorizer.fit(all_content)
We can inspect the vocabulary to see if it makes sense. If needed, one could add stop words in the vectorizer above and supress them to see if they are indeed supressed.
print(vectorizer.vocabulary_)
Let us vectorize the sentences for us to deploy cosine similarity.
s1Tokens = vectorizer.transform(df.iloc[1,])
s2Tokens = vectorizer.transform(df.iloc[2,])
Finally, the cosine of the similarity can be computed as follows.
cosine_similarity(s1Tokens , s2Tokens)

linalg.matrix_power(A,n) for a huge $n$ and a huge $A$

I'm trying to use linalg to find $P^{500}$ where $ P$ is a 9x9 matrix but Python displays the following:
Matrix full of inf
I think this is too much for this method so my question is, there is annother library to find $P^{500}$? Must I surrender?
Thank you all in advance
Use the eigendecomposition and then exponentiate the matrix of eigenvalues. Like this. You end up getting an inf up in the first column. Unless you control the type of matrix by their eigenvalues this won't happen I believe. In other words, your eigenvalues have to be bounded. You can generate a random matrix by the Schur decomposition putting the eigenvalues along the diagonal. This is a post I have about generating a matrix with given eigenvalues. This should be the way that method works anyways.
% Generate random 9x9 matrix
n=9;
A = randn(n);
[V,D] = eig(A);
p = 500;
Dp = D^p;
Ap = V^(-1)*Dp*V;
Ap1 = mpower(A,p);
NumPy arrays have homogeneous data types and float datatype maximum is
>>> np.finfo('d').max
1.7976931348623157e+308
>>> _**0.002
4.135322944991858
>>> np.array(4.135)**500
1.7288485271474026e+308
>>> np.array(4.136)**500
__main__:1: RuntimeWarning: overflow encountered in power
inf
So if there is an inner product that results higher than approx. 4.135 it is going to blowup and once it blows up, the next product will be multiplied with infinities and more entries will get infinities until everything becomes infinities.
Metahominid's suggestion certainly helps but it will not solve the issue if your eigenvalues are larger than this value. In general, you need to use specialized high-precision tools to get correct results.

numpy.cov or numpy.linalg.eigvals gives wrong results

I have high (100) dimensional data. I want to get the eigenvectors of the covariance matrix of the data.
Cov = numpy.cov(data)
EVs = numpy.linalg.eigvals(Cov)
I get a vector containing some eigenvalues which are complex numbers. This is mathematically impossible. Granted, the imaginary parts of the complex numbers are very small but it still causes issues later on. Is this a numerical issue? If so, does the issue lie with cov, eigvals function or both?
To give more color on that, I did the same calculation in Mathematica which gives, of course, a correct result. Turns out there are some eigenvalues which are very close to zero but not quiet zero and numpy gets all of these wrong (magnitude wise and it makes some of them into complex numbers)
I was facing a similar issue: np.linalg.eigvals was returning a complex vector in which the imaginary part was quasi-zero everywhere.
Using np.linalg.eigvalsh instead fixed it for me.
I don't know the exact reason, but most probably it is a numerical issue and eigvalsh seems to handle it whereas eigvals doesn't. Note that the ordering of the actual eigenvalues may differ.
The following snippet illustrates the fix:
import numpy as np
from numpy.linalg import eigvalsh, eigvals
D = 10
MUL = 100
EPS = 1e-8
x = np.random.rand(1, D) * MUL
x -= x.mean()
S = np.matmul(x.T, x) + I
# adding epsilon*I avoids negative eigenvalues due to numerical error
# since the matrix is actually positive semidef. (useful for cholesky etc)
S += np.eye(D, dtype=np.float64) * EPS
print(sorted(eigvalsh(S)))
print(sorted(eigvals(S)))

Error when computing eigenvalues of a scipy LinearOperator: "gmres did not converge"

I'm trying to solve a large eigenvalue problem with Scipy where the matrix A is dense but I can compute its action on a vector without having to assemble A explicitly. So in order to avoid memory issues when the matrix A gets big I'd like to use the sparse solver scipy.sparse.linalg.eigs with a LinearOperator that implemements this action.
Applying eigs to an explicit numpy array A works fine. However, if I apply eigs to a LinearOperator instead then the iterative solver fails to converge. This is true even if the matvec method of the LinearOperator is simply matrix-vector multiplication with the given matrix A.
A minimal example illustrating the failure is attached below (I'm using shift-invert mode because I am interested in the smallest few eigenvalues). This computes the eigenvalues of a random matrix A just fine, but fails when applied to a LinearOperator that is directly converted from A. I tried to fiddle with the parameters for the iterative solver (v0, ncv, maxiter) but to no avail.
Am I missing something obvious? Is there a way to make this work? Any suggestions would be highly appreciated. Many thanks!
Edit: I should clarify what I mean by "make this work" (thanks, Dietrich). The example below uses a random matrix for illustration. However, in my application I know that the eigenvalues are almost purely imaginary (or almost purely real if I multiply the matrix by 1j). I'm interested in the 10-20 smallest-magnitude eigenvalues, but the algorithm doesn't behave well (i.e., never stops even for small-ish matrix sizes) if I specify which='SM'. Therefore I'm using shift-invert mode by passing the parameters sigma=0.0, which='LM'. I'm happy to try a different approach so long as it allows me to compute a bunch of smallest-magnitude eigenvalues.
from scipy.sparse.linalg import eigs, LinearOperator, aslinearoperator
import numpy as np
# Set a seed for reproducibility
np.random.seed(0)
# Size of the matrix
N = 100
# Generate a random matrix of size N x N
# and compute its eigenvalues
A = np.random.random_sample((N, N))
eigvals = eigs(A, sigma=0.0, which='LM', return_eigenvectors=False)
print eigvals
# Convert the matrix to a LinearOperator
A_op = aslinearoperator(A)
# Try to solve the same eigenproblem again.
# This time it produces an error:
#
# ValueError: Error in inverting M: function gmres did not converge (info = 1000).
eigvals2 = eigs(A_op, sigma=0.0, which='LM', return_eigenvectors=False)
I tried running your code, but not passing the sigma parameter to eigs() and it ran without problems (read eigs() docs for its meaning). I didn't see the benefit of it in your example.
Eigs can already find the smallest eigenvalues first. Set which = 'SM'

Creating grid and interpolating (x,y,z) for contour plot sagemath

!I have values in the form of (x,y,z). By creating a list_plot3d plot i can clearly see that they are not quite evenly spaced. They usually form little "blobs" of 3 to 5 points on the xy plane. So for the interpolation and the final "contour" plot to be better, or should i say smoother(?), do i have to create a rectangular grid (like the squares on a chess board) so that the blobs of data are somehow "smoothed"? I understand that this might be trivial to some people but i am trying this for the first time and i am struggling a bit. I have been looking at the scipy packages like scipy.interplate.interp2d but the graphs produced at the end are really bad. Maybe a brief tutorial on 2d interpolation in sagemath for an amateur like me? Some advice? Thank you.
EDIT:
https://docs.google.com/file/d/0Bxv8ab9PeMQVUFhBYWlldU9ib0E/edit?pli=1
This is mostly the kind of graphs it produces along with this message:
Warning: No more knots can be added because the number of B-spline
coefficients
already exceeds the number of data points m. Probably causes:
either
s or m too small. (fp>s)
kx,ky=3,3 nx,ny=17,20 m=200 fp=4696.972223 s=0.000000
To get this graph i just run this command:
f_interpolation = scipy.interpolate.interp2d(*zip(*matrix(C)),kind='cubic')
plot_interpolation = contour_plot(lambda x,y:
f_interpolation(x,y)[0], (22.419,22.439),(37.06,37.08) ,cmap='jet', contours=numpy.arange(0,1400,100), colorbar=True)
plot_all = plot_interpolation
plot_all.show(axes_labels=["m", "m"])
Where matrix(c) can be a huge matrix like 10000 X 3 or even a lot more like 1000000 x 3. The problem of bad graphs persists even with fewer data like the picture i attached now where matrix(C) was only 200 x 3. That's why i begin to think that it could be that apart from a possible glitch with the program my approach to the use of this command might be totally wrong, hence the reason for me to ask for advice about using a grid and not just "throwing" my data into a command.
I've had a similar problem using the scipy.interpolate.interp2d function. My understanding is that the issue arises because the interp1d/interp2d and related functions use an older wrapping of FITPACK for the underlying calculations. I was able to get a problem similar to yours to work using the spline functions, which rely on a newer wrapping of FITPACK. The spline functions can be identified because they seem to all have capital letters in their names here http://docs.scipy.org/doc/scipy/reference/interpolate.html. Within the scipy installation, these newer functions appear to be located in scipy/interpolate/fitpack2.py, while the functions using the older wrappings are in fitpack.py.
For your purposes, RectBivariateSpline is what I believe you want. Here is some sample code for implementing RectBivariateSpline:
import numpy as np
from scipy import interpolate
# Generate unevenly spaced x/y data for axes
npoints = 25
maxaxis = 100
x = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
y = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
xsort = np.sort(x)
ysort = np.sort(y)
# Generate the z-data, which first requires converting
# x/y data into grids
xg, yg = np.meshgrid(xsort,ysort)
z = xg**2 - yg**2
# Generate the interpolated, evenly spaced data
# Note that the min/max of x/y isn't necessarily 0 and 100 since
# randomly chosen points were used. If we want to avoid extrapolation,
# the explicit min/max must be found
interppoints = 100
xinterp = np.linspace(xsort[0],xsort[-1],interppoints)
yinterp = np.linspace(ysort[0],ysort[-1],interppoints)
# Generate the kernel that will be used for interpolation
# Note that the default version uses three coefficients for
# interpolation (i.e. parabolic, a*x**2 + b*x +c). Higher order
# interpolation can be used by setting kx and ky to larger
# integers, i.e. interpolate.RectBivariateSpline(xsort,ysort,z,kx=5,ky=5)
kernel = interpolate.RectBivariateSpline(xsort,ysort,z)
# Now calculate the linear, interpolated data
zinterp = kernel(xinterp, yinterp)