Numpy / Scipy - Sparse matrix to vector - numpy

I have sparse CSR matrices (from a product of two sparse vector) and I want to convert each matrix to a flat vector. Indeed, I want to avoid using any dense representation or iterating over indexes.
So far, the only solution that came up was to iterate over non null elements by using coo representation:
import numpy
from scipy import sparse as sp
matrices = [sp.csr_matrix([[1,2],[3,4]])]*3
vectorSize = matrices[0].shape[0]*matrices[0].shape[1]
flatMatrixData = []
flatMatrixRows = []
flatMatrixCols = []
for i in range(len(matrices)):
matrix = matrices[i].tocoo()
flatMatrixData += matrix.data.tolist()
flatMatrixRows += [i]*matrix.nnz
flatMatrixCols += [r+c*2 for r,c in zip(matrix.row, matrix.col)]
flatMatrix = sp.coo_matrix((flatMatrixData,(flatMatrixRows, flatMatrixCols)), shape=(len(matrices), vectorSize), dtype=numpy.float64).tocsr()
It is indeed unsatisfying and inelegant. Does any one know how to achieve this in an efficient way?

Your flatMatrix is (3,4); each row is [1 3 2 4]. If a submatrix is x, then the row is x.A.T.flatten().
F = sp.vstack([x.T.tolil().reshape((1,vectorSize)) for x in matrices])
F is the same (dtype is int). I had to convert each submatrix to lil since csr has not implemented reshape (in my version of sparse). I don't know if other formats work.
Ideally sparse would let you do the whole range of numpy array (or matrix) manipulations, but it isn't there yet.
Given the small dimensions in this example, I won't speculate on the speed of the alternatives.

Related

How to matrix-multiply two sparse SciPy matrices and produce a dense Numpy array efficiently?

I'd like to matrix-multiply two sparse SciPy matrices. However, the result is not sparse, so I'd like to store it as a NumPy array.
Is it possible to do this efficiently, that is without creating a "sparse" matrix first and then converting it? I'm free to choose any input format (whichever is more efficient).
An example: the product of two 10000x10000 99% sparse matrices with randomly distributed zeros will be dense:
n = 10_000
a = np.random.randn(n, n) * (np.random.randint(100, size=(n, n)) == 0)
b = np.random.randn(n, n) * (np.random.randint(100, size=(n, n)) == 0)
c = a.dot(b)
np.count_nonzero(c) / c.size # should be 0.63
import numpy as np
from scipy import sparse
n = 10_000
a = sparse.csr_matrix(np.random.randn(n, n) * (np.random.randint(100, size=(n, n)) == 0))
b = sparse.csr_matrix(np.random.randn(n, n) * (np.random.randint(100, size=(n, n)) == 0))
c = a.dot(b)
>>> c
<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
with 63132806 stored elements in Compressed Sparse Row format>
Yeah, this is a really inefficient way to store this matrix. There's no way in scipy to go straight to dense though. You can use the sparseBLAS functions that go straight to dense (which exist for the use case you are describing).
There's a python wrapper for MKL that I use for this, which wraps mkl_sparse_spmmd:
from sparse_dot_mkl import dot_product_mkl
c_dense = dot_product_mkl(a, b, dense=True)
>>>> np.sum(c_dense != 0)
63132806
It's also threaded so it's a lot faster than scipy. Getting MKL installed is left to the reader (conda install -c intel mkl is probably easiest though)

To find an inverse matrix of A with LU decomposition

The task asks me to generate A matrix with 50 columns and 50 rows with a random library of seed 1007092020 in the range [0,1].
import numpy as np
np.random.seed(1007092020)
A = np.random.randint(2, size=(3,3))
Then I have to find an inverse matrix of A with LU decomposition.
No idea how to do that.
If you need matrix A to be a 50 x 50 matrix with random floating numbers, then you can make that with the following code :
import numpy as np
np.random.seed(1007092020)
A = np.random.random((50,50))
Instead, if you want integers in the range 0,1 (1 included), you can do this
A = np.random.randint(0,2,(50,50))
If you want to compute the inverse using LU decomposition, you can use SciPy. It should be noted that since you are generating random matrices, it is possible that your matrix does not have an inverse. In that case, you can not find the inverse.
Here's some code that will work in case A does have an inverse.
from scipy.linalg import lu
p,l,u = lu(A, permute_l = False)
Now that we have the lower (l) and upper (u) triangular matrices, we can find the inverse of A by the following equation : A^-1 = U^-1 L^-1
l = np.dot(p,l)
l_inv = np.linalg.inv(l)
u_inv = np.linalg.inv(u)
A_inv = np.dot(u_inv,l_inv)

numpy.corrcoeff() MemoryError

Can't understand MemoryError I get using numpy.corrcoeff() to find correlation coefficient between 2 vectors smin & smax as following:
import numpy as np
from numpy import random as rn
r=0.01
sigma=0.2
T=1
K=1
N=252
h=T/N
M = 50000
Z = rn.randn(M,N)
S=np.ones((M,N+1))
smax=np.ones((M,1))
smin=np.ones((M,1))
for i in range(0,N):
S[:,i+1]=S[:,i]*(np.exp((r-(sigma**2)/2)*h+sigma*Z[:,i]*np.sqrt(h)))
for j in range(0,M):
smax[j,:]=np.exp(-r*T)*(np.max(S[j,:])>K)*(np.max(S[j,:])-K)
smin[j,:]=np.exp(-r*T)*(np.min(S[j,:])<K)*(K-np.min(S[j,:]))
c=np.corrcoef(smax,smin)
print(c)
if there is another way to find correlation coeff.,like using pandas it's also good.
The shape of your arrays here is what is the problem. The function documentation states that x is a "1-D or 2-D array containing multiple variables and observations. Each row of x represents a variable, and each column a single observation of all those variables." and that y is an additional set of variables and observations. So this is trying to allocate an array of size (10000, 10000), which is huge.
If you just want to calculate the pearson correlation coefficient between two one dimensional vectors, you can use a much simpler formula than what is implemented here. This documentation has the formula I am referring to.
https://hydroerr.readthedocs.io/en/stable/api/HydroErr.HydroErr.pearson_r.html#HydroErr.HydroErr.pearson_r
But to be able to still use the numpy version you need to pass in the observations and predictions in the same parameter x, and x and y need to be 1D arrays.
import numpy as np
simulated_array = np.random.rand(50000)
observed_array = np.random.rand(50000)
c = np.corrcoef([simulated_array, observed_array])[1, 0]
More explanation about this here.

Sklearn and Sparse Matrices ValueError

I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with 2 columns: The first with vectors representing words stored as a 1x10000 sparse csr matrix (so a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code
for index, row in data.iterrows():
print(row)
print(row[0].shape)
I get the correct output for all the rows
Name: 0, dtype: object
(1, 10000)
Vector (0, 0)\t1.0\n (0, 1)\t1.0\n (0, 2)\t1.0\n ...
Rating 5
Now when I try passing my data in any SKlearn classifier like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size and I've tried reshaping my data in various ways, but with no luck, and the Sklearn classifiers are supposed to be able to deal with csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D matrix did the trick, but for completeness sake the following is the code I used to generate my dataframe if anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
def vectorize(feature, size):
"""Given a numeric string generated from a vocabulary table return a binary vector representation of
each feature"""
vector = sparse.lil_matrix((1, size))
for number in feature.split(' '):
try:
vector[0, int(number) - 1] = 1
except ValueError:
pass
return vector
def vectorize_dataset(data, vectorize, size):
"""Given a dataset in the appropriate "num num num..." format, a specific vectorization format, and a vector size,
returns the dataset in vectorized form"""
result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
for index, row in data.iterrows():
# All the mixing up of decodings and encoding has made it so that Pandas incorrectly parses EOF chars
if type(row[0]) == type('str'):
result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
result_data.iat[index, 1] = data.loc[index][1]
return result_data

Combine Sklearn TFIDF with Additional Data

I am trying to prepare data for supervised learning. I have my Tfidf data, which was generated from a column in my dataframe called "merged"
vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1,2))
X = vect.fit_transform(merged['kws_name_desc'])
print X.shape
print type(X)
(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TFIDF matrix, I have a list of additional numeric features. Each list is length 40 and it's comprised of floats.
So for clarify, I have 57,629 lists of length 40 which I'd like to append on to my TDIDF result.
Currently, I have this in a DataFrame, example data: merged["other_data"]. Below is an example row from the merged["other_data"]
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my dataframe column with the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.
This will do the work.
`df1 = pd.DataFrame(X.toarray()) //Convert sparse matrix to array
df2 = YOUR_DF of size 57k x 40
newDf = pd.concat([df1, df2], axis = 1)`//newDf is the required dataframe
I figured it out:
First: iterate over my pandas column and create a list of lists
for_np = []
for x in merged['other_data']:
row = x.split(",")
row2 = map(float, row)
for_np.append(row2)
Then create a np array:
n = np.array(for_np)
Then use scipy.sparse.hstack on X (my original tfidf sparse matrix and my new matrix. I'll probably end-up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
import scipy.sparse
X = scipy.sparse.hstack([X, n])
You could have a look at the answer to this question:
use Featureunion in scikit-learn to combine two pandas columns for tfidf
Obviously, the anwers given should work, but as soon as you want your classifier to make predictions, you definitely want to work with pipelines and feature unions.