scipy super sparse matrix multiplication is super slow

There are some posts on SO discussing sparse matrix multiplication performance, but they don't seem to answer my question here.
Here is the benchmark code:
# First, construct a sparse matrix
In [1]: import numpy as np
   ...: from scipy.sparse import dok_matrix
In [2]: M = dok_matrix((100, (1<<32)-1), dtype=np.float32)
In [3]: rows, cols = M.shape
In [5]: js = np.random.randint(0, (1<<32)-1, size=100)
In [6]: for i in range(rows):
   ...:     for j in js:
   ...:         M[i,j] = 1.0
   ...:
# Check out sparsity
In [7]: M.shape
Out[7]: (100, 4294967295)
In [8]: M.count_nonzero()
Out[8]: 10000
# Test csr dot performance: 36.3 s user CPU time, 1min 46s wall time
In [9]: csr = M.tocsr()
In [10]: %time csr.dot(csr.T)
CPU times: user 36.3 s, sys: 1min 1s, total: 1min 37s
Wall time: 1min 46s
Out[10]:
<100x100 sparse matrix of type '<class 'numpy.float32'>'
with 10000 stored elements in Compressed Sparse Row format>
The above csr.dot takes 36.3 s of user CPU time (1 min 46 s wall time), which is quite long IMHO.
To speed it up, I coded a naive for-loop dot function as follows:
def lil_matmul_transposeB(A, B):
    # Computes A @ B.T for two LIL matrices. bs is a binary-search helper
    # (not shown); presumably it returns the position of ca in cols_b, or -1.
    rows_a, cols_a = A.shape
    rows_b, cols_b = B.shape
    assert cols_a == cols_b
    C = np.zeros((rows_a, rows_b))
    for ra in range(rows_a):
        cols_a = A.rows[ra]
        data_a = A.data[ra]
        for i, ca in enumerate(cols_a):
            xa = data_a[i]
            for rb in range(rows_b):
                cols_b = B.rows[rb]
                data_b = B.data[rb]
                pos = bs(cols_b, ca)
                if pos != -1:
                    C[ra, rb] += data_b[pos] * xa
    return C
# Test dot performance in LiL format,
In [25]: lil = M.tolil()
In [26]: %time A = F.lil_matmul_transposeB(lil, lil)
CPU times: user 1.26 s, sys: 2.07 ms, total: 1.26 s
Wall time: 1.26 s
The above function only costs 1.26s, much faster than the built-in csr.dot.
So I wonder: did I make a mistake in how I do the sparse matrix multiplication?

That very large 2nd dimension is causing the problem, even though the number of nonzeros is small.
In [12]: Mr = M.tocsr()
In [20]: Mr
Out[20]:
<100x4294967295 sparse matrix of type '<class 'numpy.float32'>'
with 10000 stored elements in Compressed Sparse Row format>
Transpose just turns the csr into csc, without changing the arrays. The indptr for both has shape (101,).
In [21]: Mr.T
Out[21]:
<4294967295x100 sparse matrix of type '<class 'numpy.float32'>'
with 10000 stored elements in Compressed Sparse Column format>
But when I do Mr@Mr.T, I get an error when it tries to convert that Mr.T to csr. That is, the multiplication requires both operands in the same format:
In [22]: Mr.T.tocsr()
Traceback (most recent call last):
File "<ipython-input-22-a376906f557e>", line 1, in <module>
Mr.T.tocsr()
File "/usr/local/lib/python3.8/dist-packages/scipy/sparse/csc.py", line 138, in tocsr
indptr = np.empty(M + 1, dtype=idx_dtype)
MemoryError: Unable to allocate 32.0 GiB for an array with shape (4294967296,) and data type int64
It's trying to make a matrix with an indptr that's (4294967296,) long. On my limited-RAM machine that produces an error. On yours it must be hitting some sort of memory management/swap task that's slowing it way down.
So it's the extreme dimension that's making this slow, even though the nnz is small.
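One way around the huge indptr, sketched here as a suggestion rather than anything the answer above tested: columns that hold no nonzeros contribute nothing to M @ M.T, so the used column indices can be renumbered to a compact range before multiplying. The helper names compact_columns and gram are made up for this illustration, and the example builds its matrix directly in CSR form instead of going through dok_matrix.

```python
import numpy as np
from scipy import sparse


def compact_columns(csr):
    """Renumber the columns that hold nonzeros as 0..k-1, dropping empty ones."""
    csr = sparse.csr_matrix(csr)
    # np.unique returns the sorted used column ids and, for every stored
    # entry, its position in that sorted list (the new, compact column id).
    used, inverse = np.unique(csr.indices, return_inverse=True)
    return sparse.csr_matrix(
        (csr.data, inverse, csr.indptr),
        shape=(csr.shape[0], len(used)),
    )


def gram(csr):
    """Compute csr @ csr.T on the compacted matrix (same result, small indptr)."""
    small = compact_columns(csr)
    return small @ small.T


# Same layout as the question: 100 rows sharing 100 random columns.
rng = np.random.default_rng(0)
n_rows, n_cols = 100, (1 << 32) - 1
js = rng.integers(0, n_cols, size=100, dtype=np.int64)
rows = np.repeat(np.arange(n_rows), js.size)
cols = np.tile(js, n_rows)
data = np.ones(rows.size, dtype=np.float32)
M = sparse.csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))

G = gram(M)
print(G.shape)  # (100, 100)
```

The transpose of the compacted matrix has only as many rows as there are distinct used columns, so converting it to csr no longer needs a 4294967296-element indptr.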

Related

Convert pandas single column to Scipy Sparse Matrix

I have a pandas data frame like this:
a other-columns
0.3 0.2 0.0 0.0 0.0... ....
I want to convert column a into SciPy sparse CSR matrix. a is a probability distribution. I would like to convert without expanding a into multiple columns.
This is naive solution with expanding a into multiple columns:
df = df.join(df['a'].str.split(expand = True).add_prefix('a')).drop(['a'], axis = 1)
df_matrix = scipy.sparse.csr_matrix(df.values)
But, I don't want to expand into multiple columns, as it shoots up the memory. Is it possible to do this by keeping a in 1 column only?
EDIT (Minimum Reproducible Example):
import pandas as pd
from scipy.sparse import csr_matrix
d = {'a': ['0.05 0.0', '0.2 0.0']}
df = pd.DataFrame(data=d)
df = df.join(df['a'].str.split(expand = True).add_prefix('a')).drop(['a'], axis = 1)
df = df.astype(float)
df_matrix = csr_matrix(df.values)
df_matrix
Output:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
I want to achieve above, but, without splitting into multiple columns. Also, in my real file, I have 36 length string (separated by space) columns and millions of rows. It is sure that all rows will contain 36 spaces.

Also, in my real file, I have 36 length string (separated by space) columns and millions of rows. It is sure that all rows will contain 36 spaces.
Convert large csv to sparse matrix for use in sklearn
I can not overstate how much you should not do the thing that follows this sentence.
import pandas as pd
import numpy as np
from scipy import sparse
df = pd.DataFrame({'a': ['0.05 0.0', '0.2 0.0'] * 100000})
chunksize = 10000
sparse_coo = []
for i in range(int(np.ceil(df.shape[0]/chunksize))):
    chunk = df.iloc[i * chunksize:min(i * chunksize + chunksize, df.shape[0]), :]
    sparse_coo.append(sparse.coo_matrix(chunk['a'].apply(lambda x: [float(y) for y in x.split()]).tolist()))
sparse_coo = sparse.vstack(sparse_coo)

You could get the dense array from the column without the expand:
In [179]: df = pd.DataFrame(data=d)
e.g.
In [180]: np.array(df['a'].str.split().tolist(),float)
Out[180]:
array([[0.05, 0. ],
[0.2 , 0. ]])
But I doubt that saves much memory (though I only have a crude understanding of DataFrame memory use).
You could convert each string to a sparse matrix:
In [190]: def foo(astr):
     ...:     alist = astr.split()
     ...:     arr = np.array(alist, float)
     ...:     return sparse.coo_matrix(arr)
In [191]: alist = [foo(row) for row in df['a']]
In [192]: alist
Out[192]:
[<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>,
<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>]
In [193]: sparse.vstack(alist)
Out[193]:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
I tried to make the coo directly from the alist, but that didn't trim out the zeros. There's just as much conversion, but if sufficiently sparse (5% or less) it could save quite a bit on memory (if not time).
sparse.vstack combines the data, rows, cols values from the component matrices to define a new coo matrix. It's the most straightforward way of combining sparse matrices, if not the fastest.
Looks like I could use apply as well
In [205]: df['a'].apply(foo)
Out[205]:
0 (0, 0)\t0.05
1 (0, 0)\t0.2
Name: a, dtype: object
In [206]: df['a'].apply(foo).values
Out[206]:
array([<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>,
<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>], dtype=object)
In [207]: sparse.vstack(df['a'].apply(foo))
Out[207]:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>

Loop through numpy array on indexes and apply function [duplicate]

I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).
What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I'm open to converting them to a different type.
I'm expecting my output to be an array with the shape N X M.
N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.
Here are some things to note:
The numpy function correlate requires input arrays to be one-dimensional.
The numpy function corrcoef accepts two-dimensional arrays, but they must have the same shape.
The scipy.stats function pearsonr requires input arrays to be one-dimensional.

Correlation (default 'valid' case) between two 2D arrays:
You can simply use matrix-multiplication np.dot like so -
out = np.dot(arr_one,arr_two.T)
Correlation with the default "valid" case between each pairwise row combinations (row1,row2) of the two input arrays would correspond to multiplication result at each (row1,row2) position.
Row-wise Correlation Coefficient calculation for two 2D arrays:
def corr2_coeff(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # Sum of squares across rows
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    # Finally get corr coeff
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None],ssB[None]))
This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB
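As a quick sanity check, here is a minimal sketch comparing corr2_coeff with np.corrcoef on small random inputs. The array sizes and seed are arbitrary assumptions. np.corrcoef(A, B) stacks the rows of both arrays, so the cross-correlations sit in the upper-right block of its output.

```python
import numpy as np


def corr2_coeff(A, B):
    # Subtract each row's mean from that row
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # Sum of squared deviations per row
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    # Cross products divided by the product of the row norms
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))


rng = np.random.default_rng(0)
A = rng.random((4, 50))   # N x T
B = rng.random((6, 50))   # M x T

out = corr2_coeff(A, B)               # N x M
ref = np.corrcoef(A, B)[:4, 4:]       # upper-right block: rows of A vs rows of B
print(out.shape, np.allclose(out, ref))
```

np.corrcoef also computes the A-vs-A and B-vs-B blocks, which is wasted work when only the cross block is needed.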
Benchmarking
This section compares the runtime of the proposed approach against generate_correlation_map and the loopy pearsonr-based approach listed in the other answer (taken from the function test_generate_correlation_map(), without the value-correctness verification code at the end of it). Note that the timings for the proposed approach also include a check at the start for an equal number of columns in the two input arrays, as is also done in that other answer. The runtimes are listed next.
Case #1:
In [106]: A = np.random.rand(1000, 100)
In [107]: B = np.random.rand(1000, 100)
In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop
In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop
Case #2:
In [110]: A = np.random.rand(5000, 100)
In [111]: B = np.random.rand(5000, 100)
In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop
In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop
Case #3:
In [114]: A = np.random.rand(10000, 10)
In [115]: B = np.random.rand(10000, 10)
In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop
In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop
The other loopy pearsonr based approach seemed too slow, but here are the runtimes for one small datasize -
In [118]: A = np.random.rand(1000, 100)
In [119]: B = np.random.rand(1000, 100)
In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop
In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop
In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop

@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.
In order to calculate the correlation coefficient, a bit more is required:
import numpy as np
def generate_correlation_map(x, y):
    """Correlate each n with each m.
    Parameters
    ----------
    x : np.array
      Shape N X T.
    y : np.array
      Shape M X T.
    Returns
    -------
    np.array
      N X M array in which each element is a correlation coefficient.
    """
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    if n != y.shape[1]:
        raise ValueError('x and y must ' +
                         'have the same number of timepoints.')
    s_x = x.std(1, ddof=n - 1)
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x,
                 y.T) - n * np.dot(mu_x[:, np.newaxis],
                                   mu_y[np.newaxis, :])
    return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])
Here's a test of this function, which passes:
from scipy.stats import pearsonr
def test_generate_correlation_map():
    x = np.random.rand(10, 10)
    y = np.random.rand(20, 10)
    desired = np.empty((10, 20))
    for n in range(x.shape[0]):
        for m in range(y.shape[0]):
            desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
    actual = generate_correlation_map(x, y)
    np.testing.assert_array_almost_equal(actual, desired)

For those interested in computing the Pearson correlation coefficient between a 1D and 2D array, I wrote the following function, where x is a 1D array and y a 2D array.
def pearsonr_2D(x, y):
    """computes pearson correlation coefficient
       where x is a 1D and y a 2D array"""
    upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:,None]), axis=1)
    lower = np.sqrt(np.sum(np.power(x - np.mean(x), 2)) * np.sum(np.power(y - np.mean(y, axis=1)[:,None], 2), axis=1))
    rho = upper / lower
    return rho
Example run:
>>> x
Out[1]: array([1, 2, 3])
>>> y
Out[2]: array([[ 1, 2, 3],
[ 6, 7, 12],
[ 9, 3, 1]])
>>> pearsonr_2D(x, y)
Out[3]: array([ 1. , 0.93325653, -0.96076892])

Python Memory error on scipy stats. Scipy linalg lstsq <> manual beta

Not sure if this question belongs here or on crossvalidated but since the primary issue is programming language related, I am posting it here.
Inputs:
Y= big 2D numpy array (300000,30)
X= 1D array (30,)
Desired Output:
B= 1D array (300000,), each element of which is the regression coefficient from regressing the corresponding row of Y (length 30) against X
So B[0] = scipy.stats.linregress(X,Y[0])[0]
I tried this first:
B = scipy.stats.linregress(X,Y)[0]
hoping that it will broadcast X according to shape of Y. Next I broadcast X myself to match the shape of Y. But on both occasions, I got this error:
File "C:\...\scipy\stats\stats.py", line 3011, in linregress
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
File "C:\...\numpy\lib\function_base.py", line 1766, in cov
return (dot(X, X.T.conj()) / fact).squeeze()
MemoryError
I used manual approach to calculate beta, and on Sascha's suggestion below also used scipy.linalg.lstsq as follows
B = lstsq(Y.T, X)[0] # first estimate of beta
Y1=Y-Y.mean(1)[:,None]
X1=X-X.mean()
B1= np.dot(Y1,X1)/np.dot(X1,X1) # second estimate of beta
The two estimates of beta are very different however:
>>> B1
Out[10]: array([0.135623, 0.028919, -0.106278, ..., -0.467340, -0.549543, -0.498500])
>>> B
Out[11]: array([0.000014, -0.000073, -0.000058, ..., 0.000002, -0.000000, 0.000001])

Scipy's linregress will output slope+intercept which defines the regression-line.
If you want to access the coefficients naturally, scipy's lstsq might be more appropriate, which is an equivalent formulation.
Of course you need to feed it with the correct dimensions (your data is not ready; needs preprocessing; swap dims).
Code
import numpy as np
from scipy.linalg import lstsq
Y = np.random.random((300000,30))
X = np.random.random(30)
x, res, rank, s = lstsq(Y.T, X) # Y transposed!
print(x)
print(x.shape)
Output
[ 1.73122781e-05 2.70274135e-05 9.80840639e-06 ..., -1.84597771e-05
5.25035470e-07 2.41275026e-05]
(300000,)
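On the discrepancy raised in the question: the manual estimate B1 is the usual least-squares slope, cov(X, y) / var(X), computed for every row at once, so it should agree with scipy.stats.linregress row by row. lstsq(Y.T, X), by contrast, solves one system Y.T b ≈ X with a single unknown per row of Y, which is a different problem, and that would explain why B and B1 differ. A minimal sketch on random data (the array sizes and seed are arbitrary assumptions):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
Y = rng.random((1000, 30))   # stand-in for the (300000, 30) array
X = rng.random(30)

# Vectorised slope of each row of Y regressed on X
X1 = X - X.mean()
Y1 = Y - Y.mean(1)[:, None]
B1 = Y1 @ X1 / (X1 @ X1)

# Reference: the slope from linregress, one row at a time (first 20 rows only)
ref = np.array([linregress(X, y).slope for y in Y[:20]])
print(np.allclose(B1[:20], ref))
```

The vectorised form only allocates arrays the size of Y, so it avoids the large intermediate that triggered the MemoryError in np.cov.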

numpy n-dimensional smart indexing over large tensors - memory efficiency

I'm working with large tensors, so numpy memory allocations for temporary tensors begin to significantly influence execution time, and the code sometimes raises memory allocation errors during those intermediate steps. Here are two approaches I came up with for indexing one tensor with the int values of another tensor (like result_ijk = a[i, b[i, j], k]). Even though the second one seems more memory-efficient, I feel like creating this enormous index matrix and iterating over all its values (even in parallel) is kind of weird (and hits memory limits quite often):
import numpy as np

def test():
    i, j, k, l = 10, 20, 30, 40 # in reality, they're like 1e3..1e6
    a = np.random.rand(i, j, k)
    b = np.random.randint(0, j, size=i*l).reshape((i, l))
    # c_ilk = a[i, b[i, l], k]; shape(c) = (10, 40, 30)
    tmp = a[:, b, :] # <= i*ijk additional memory allocated (!) crazy
    c1 = np.diagonal(tmp, axis1=0, axis2=1).transpose([2, 0, 1])
    print(c1.shape)
    # another approach:
    ii, ll = np.indices((i, l)) # <= 2*i*l of temporary ints allocated
    tmp2 = b[ii, ll] # i*l of ints allocated, slow ops
    c2 = a[ii, tmp2] # slow ops over tensor
    print(c2.shape)
    print(np.allclose(c1, c2))
test()
- any suggestions on how one could optimize this type of n-dim smart indexing code?
If I'm going to use this piece of ~vectorized code in Theano, is it also going to allocate all those temporary buffers, or could it somehow manage to build them on the fly? Is there any package that would perform such indexing in a lazy or more efficient manner, without allocating these ii-like tensors?
(note: I need to take gradients over it in the end, so I can't use fancy jit-compilers like numba :( )

You only need to allocate an array of integers of length i to get your desired result:
i_idx = np.arange(i)
c = a[i_idx[:, None], b[i_idx, :], :]
# or you can use the terser c = a[i_idx[:, None], b[i_idx]]
Broadcasting takes care of duplicating values as needed on the fly, without having to allocate memory for them.
If you time this for large-ish arrays, you'll notice it is only marginally faster than your second approach: as noted by others, the intermediate indexing array is going to be several orders of magnitude smaller than your overall computation, so optimizing it has a small effect on the total runtime or memory footprint.
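A small self-contained check of the broadcast indexing described above, compared against an explicit double loop. The shapes and seed are arbitrary assumptions. np.take_along_axis is shown as an alternative spelling of the same lookup; it is not something the answer itself proposes.

```python
import numpy as np

i, j, k, l = 4, 5, 6, 7
rng = np.random.default_rng(0)
a = rng.random((i, j, k))
b = rng.integers(0, j, size=(i, l))

# Index arrays of shape (i, 1) and (i, l) broadcast to (i, l);
# the trailing k axis of a is kept whole.
i_idx = np.arange(i)
c = a[i_idx[:, None], b]

# Alternative: pick along axis 1; the index array broadcasts over the k axis.
c_alt = np.take_along_axis(a, b[:, :, None], axis=1)

# Reference result from an explicit loop: c[ii, ll, :] = a[ii, b[ii, ll], :]
expected = np.empty((i, l, k))
for ii in range(i):
    for ll in range(l):
        expected[ii, ll] = a[ii, b[ii, ll]]

print(c.shape, np.array_equal(c, expected), np.array_equal(c_alt, expected))
```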

Some methods :
i,j,k,l=[100]*4
a = np.random.randint(0,5,(i, j, k))
b = np.random.randint(0, j,(i, l))
def test1():
    # c_ilk = a[i, b[i, l], k]; shape(c) = (i, l, k)
    tmp = a[:, b, :] # <= i*ijk additional memory allocated (!) crazy
    c1 = np.diagonal(tmp, axis1=0, axis2=1).transpose([2, 0, 1])
    return c1
def test2():
    ii, ll = np.indices((i, l)) # <= 2*i*l of temporary ints allocated
    tmp2 = b[ii, ll] # i*l of ints allocated, slow ops
    c2 = a[ii, tmp2] # slow ops over tensor
    #print(c2.shape)
    return c2
def test3():
    c3=np.empty((i,l,k),dtype=a.dtype)
    for ii in range(i):
        for ll in range(l):
            c3[ii,ll]=a[ii,b[ii,ll]]
    return c3
from numba import jit
test4=jit(test3)
And the corresponding benchmarks :
In [54]: %timeit test1()
1 loop, best of 3: 720 ms per loop
In [55]: %timeit test2()
100 loops, best of 3: 7.79 ms per loop
In [56]: %timeit test3()
10 loops, best of 3: 43.7 ms per loop
In [57]: %timeit test4()
100 loops, best of 3: 4.99 ms per loop
That seems to show (see @Eelco Hoogendoorn's comment) that your second method is nearly optimal for big sizes, while the first is a bad choice.
For numba you can just use this part of the code, and apply gradient in a non "jited" function.

numpy outerproduct of sequence of arrays

I have a matrix A (nXm). My ultimate goal is to get Z of dimension (nXmXm). Currently I am doing it with the loop below, but can it be done without a for loop, using some matrix.tensordot or matrix.multiply.outer?
for i in range(0,A.shape[0]):
    Z[i,:,:] = np.outer(A[i,:],A[i,:])

You could use numpy's Einstein summation, like this:
np.einsum('ij, ik -> ijk', a, a)
Just for completeness, the timing comparison with the also excellent answer (+1) from unutbu:
In [39]: A = np.random.random((1000,50))
In [40]: %timeit using_einsum(A)
100 loops, best of 3: 11.6 ms per loop
In [41]: %timeit using_broadcasting(A)
100 loops, best of 3: 10.2 ms per loop
In [42]: %timeit orig(A)
10 loops, best of 3: 27.8 ms per loop
Which teaches me that
unutbu's machine is faster than mine
broadcasting would be slightly faster than np.einsum
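The timings above call using_einsum, which is not shown in this answer; presumably it just wraps the einsum expression. A minimal sketch under that assumption, with a check against the broadcasting version and against np.outer on one row:

```python
import numpy as np


def using_einsum(A):
    # Z[i, j, k] = A[i, j] * A[i, k]: no index is summed, so this is a
    # batch of outer products, one per row of A.
    return np.einsum('ij,ik->ijk', A, A)


def using_broadcasting(A):
    return A[:, :, np.newaxis] * A[:, np.newaxis, :]


A = np.random.default_rng(0).random((1000, 50))
Z = using_einsum(A)
print(Z.shape)  # (1000, 50, 50)
print(np.allclose(Z, using_broadcasting(A)))
print(np.allclose(Z[3], np.outer(A[3], A[3])))
```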

for i in range(0,A.shape[0]):
    Z[i,:,:] = np.outer(A[i,:],A[i,:])
means
Z_ijk = A_ij * A_ik
which can be computed using NumPy broadcasting:
Z = A[:, :, np.newaxis] * A[:, np.newaxis, :]
A[:, :, np.newaxis] has shape (n, m, 1) and A[:, np.newaxis, :] has shape
(n, 1, m). Multiplying the two causes both arrays to be broadcasted up to
shape (n, m, m).
NumPy multiplication is always performed elementwise. The values along the
broadcasted axis are the same everywhere, so elementwise multiplication results
in Z_ijk = A_ij * A_ik.
import numpy as np
def orig(A):
    Z = np.empty(A.shape+(A.shape[-1],), dtype=A.dtype)
    for i in range(0,A.shape[0]):
        Z[i,:,:] = np.outer(A[i,:],A[i,:])
    return Z
def using_broadcasting(A):
    return A[:, :, np.newaxis] * A[:, np.newaxis, :]
Here is a sanity check showing this produces the correct result:
A = np.random.random((1000,50))
assert np.allclose(using_broadcasting(A), orig(A))
By choosing A.shape[0] to be large we get an example which shows off the
advantage of broadcasting over looping in Python:
In [107]: %timeit using_broadcasting(A)
10 loops, best of 3: 6.12 ms per loop
In [108]: %timeit orig(A)
100 loops, best of 3: 16.9 ms per loop
