How to delete the small non-zero values and increase the sparsity? - numpy

I have a large csr_matrix (46000*46000), but this matrix is very dense: its sparsity is about 0.05%. Most of the non-zero values are less than 1. I want to delete these values and increase the sparsity.
import scipy.sparse as sp
cgc = sp.load_npz('/root/cg.npz')
print(cgc.count_nonzero())  # 2115920056
cgc = cgc[cgc > 1]  # too slow

You have two options (both assume cgc is held as a dense NumPy array):
Zero the elements in place, and then convert. This works on the array in place to save memory and time, but changes your original array (which seems to be saved to disk anyway, so it shouldn't be a problem):
cgc[cgc < 1] = 0
cgc = scipy.sparse.csr_matrix(cgc)
Build the sparse indices and construct the sparse matrix from them (doesn't overwrite the original, but is slow and memory-intensive):
i, j = np.nonzero(cgc > 1)  # np.nonzero returns one index array per axis
cgc_sparse = scipy.sparse.csr_matrix((cgc[i, j], (i, j)), shape=cgc.shape)
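If the matrix is already loaded as a csr_matrix, as in the question, a common alternative is to threshold only the stored values and then compact the structure. The following is a minimal sketch on a tiny stand-in matrix (the 3x3 data is made up for illustration); it uses the csr_matrix.data attribute and the eliminate_zeros() method, and it has not been timed on a matrix of the size in the question.

```python
import numpy as np
import scipy.sparse as sp

# Tiny stand-in for the large matrix loaded from disk (illustrative data).
dense = np.array([[0.0, 0.2, 3.0],
                  [1.5, 0.0, 0.7],
                  [0.0, 4.0, 0.1]])
cgc = sp.csr_matrix(dense)

# Work on the stored values only; no dense copy and no boolean sparse mask.
cgc.data[cgc.data < 1] = 0

# Drop the explicit zeros so they no longer occupy storage.
cgc.eliminate_zeros()

print(cgc.nnz)        # 3 stored values remain: 3.0, 1.5 and 4.0
print(cgc.toarray())
```

Note that the comparison cgc.data < 1 also removes negative values; np.abs(cgc.data) < 1 would be the variant to use if only small magnitudes should go.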

Related

How to matrix-multiply two sparse SciPy matrices and produce a dense Numpy array efficiently?

I'd like to matrix-multiply two sparse SciPy matrices. However, the result is not sparse, so I'd like to store it as a NumPy array.
Is it possible to do this efficiently, that is without creating a "sparse" matrix first and then converting it? I'm free to choose any input format (whichever is more efficient).
An example: the product of two 10000x10000 99% sparse matrices with randomly distributed zeros will be dense:
import numpy as np
n = 10_000
a = np.random.randn(n, n) * (np.random.randint(100, size=(n, n)) == 0)
b = np.random.randn(n, n) * (np.random.randint(100, size=(n, n)) == 0)
c = a.dot(b)
np.count_nonzero(c) / c.size  # should be about 0.63
import numpy as np
from scipy import sparse
n = 10_000
a = sparse.csr_matrix(np.random.randn(n, n) * (np.random.randint(100, size=(n, n)) == 0))
b = sparse.csr_matrix(np.random.randn(n, n) * (np.random.randint(100, size=(n, n)) == 0))
c = a.dot(b)
>>> c
<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
with 63132806 stored elements in Compressed Sparse Row format>
Yeah, this is a really inefficient way to store this matrix. There's no way in SciPy to multiply two sparse matrices straight into a dense result, though. You can use the Sparse BLAS functions that go straight to dense (which exist for the use case you are describing).
There's a Python wrapper for MKL that I use for this, which wraps mkl_sparse_spmmd:
from sparse_dot_mkl import dot_product_mkl
c_dense = dot_product_mkl(a, b, dense=True)
>>> np.sum(c_dense != 0)
63132806
It's also threaded, so it's a lot faster than SciPy. Getting MKL installed is left to the reader (conda install -c intel mkl is probably easiest, though).
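If installing MKL is not an option, one SciPy-only workaround is to densify one operand first: multiplying a csr_matrix by a NumPy array returns a NumPy array, so no sparse result is built. This is only a sketch on a small stand-in size (n = 200 is an assumption for illustration); it costs the memory of one dense input and is not claimed to match the MKL routine's speed.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n = 200  # small stand-in for the 10_000 used in the question

# Roughly 99% sparse inputs, built the same way as in the question.
a = sparse.csr_matrix(rng.standard_normal((n, n)) * (rng.integers(100, size=(n, n)) == 0))
b = sparse.csr_matrix(rng.standard_normal((n, n)) * (rng.integers(100, size=(n, n)) == 0))

# sparse @ dense gives a dense ndarray directly.
c_dense = a @ b.toarray()

# Reference: sparse @ sparse, converted afterwards.
c_ref = (a @ b).toarray()

print(type(c_dense).__name__, np.allclose(c_dense, c_ref))
```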

Generate large array in Dask

I would like to compute the SVD of a large matrix with Dask. I naively tried to create an empty 2D array and update it in a loop, but Dask does not allow mutating the array.
So, I'm looking for a workaround. I tried saving a large (around 65,000 x 65,000, or even larger) array to HDF5 via h5py, but updating the array in a loop is quite inefficient. Should I be using mmap (memory-mapped NumPy arrays) instead?
Below is some sample code, without any Dask implementation. Should I use dask.bag or dask.delayed for this operation?
The sample code takes long strings and, in a window of size 8, generates combinations of two-letter words. In the actual data, the window size would be 20 and the words would be 8 letters long. The input string can be 3 Gb long.
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)  # np.Inf was removed in NumPy 2.0
# generate all possible words of length 2 (AA, AC, AG, AT, CA, etc.)
# then get numerical index (AA -> 0, AC -> 1, etc.)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
# final array to fill, size is [ 16 possible words x 16 possible words ]
counts = np.zeros(shape=(16,16)) # in actual sample we expect 65000x65000 array
# sample sequences (these will be gigabytes long in actual sample)
seq1 = "AAAAACCATCGACTACGACTAC"
seq2 = "ACGATCACGACTACGACTAGATGCATCACGACTAAAAA"
# accumulate results
all_pairs=[]
def generate_pairs(sequence):
    pairs = []
    for i in range(len(sequence) - 8 + 1):
        window = sequence[i:i+8]
        # k (rather than i) avoids shadowing the outer loop variable
        words = [window[k:k+2] for k in range(0, len(window), 2)]
        for pair in itertools.combinations(words, 2):
            pairs.append(pair)
    return pairs
# use function for each sequence
all_pairs.extend(generate_pairs(seq1))
all_pairs.extend(generate_pairs(seq2))
# convert 1D array of pairs into 2D counts of pairs
# for each pair, lookup word index and increase corresponding cell
for j in all_pairs:
    counts[two_index[j[0]], two_index[j[1]]] += 1
print(counts)
EDIT: I might have asked the question in a complicated way, so let me paraphrase it. I need to construct a single large 2D array of size ~65000x65000. The array needs to be filled by counting occurrences of (word1, word2) pairs. Since Dask does not allow item assignment on a Dask array, I cannot fill the array as pairs are processed. Is there a workaround to generate/fill a large 2D array with Dask?
Here's simpler code to test:
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)  # np.Inf was removed in NumPy 2.0
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
seq = "AAAAACCATCGACTACGACTAC"
counts = np.zeros(shape=(16,16))
for i in range(len(seq) - 8 + 1):
    window = seq[i:i+8]
    words = [window[k:k+2] for k in range(0, len(window), 2)]
    for pair in itertools.combinations(words, 2):
        counts[two_index[pair[0]], two_index[pair[1]]] += 1  # problematic part!
print(counts)
print(counts)
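One way to sidestep item assignment altogether is to collect (row, column) index pairs and let a SciPy COO matrix sum the duplicates when it is converted to CSR. The sketch below is not a Dask solution; it only shows the accumulation step on the sample sequence, and names such as rows, cols and ones are made up for illustration. Per-chunk matrices built this way could then be added together, which is the kind of reduction dask.delayed or dask.bag could plausibly parallelize.

```python
import itertools
import numpy as np
from scipy import sparse

bases = ['A', 'C', 'G', 'T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {word: k for k, word in enumerate(all_two)}

seq = "AAAAACCATCGACTACGACTAC"

# Collect index pairs instead of mutating a 2D array cell by cell.
rows, cols = [], []
for start in range(len(seq) - 8 + 1):
    window = seq[start:start + 8]
    words = [window[k:k + 2] for k in range(0, len(window), 2)]
    for w1, w2 in itertools.combinations(words, 2):
        rows.append(two_index[w1])
        cols.append(two_index[w2])

# Duplicate (row, col) entries are summed on conversion to CSR,
# which yields the pair counts.
n = len(all_two)
ones = np.ones(len(rows), dtype=np.int64)
counts = sparse.coo_matrix((ones, (rows, cols)), shape=(n, n)).tocsr()

print(counts.sum())  # 90 pairs: 15 windows x 6 pairs per window
```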

Sklearn PCA: Correct Dimensionality of PCs

I have a dataframe, df, which contains a column called 'event' wherein there is a 24x24x40 numpy array. I want to:
extract this numpy array;
flatten it into a 1x23040 vector;
add this entry as a column in a new numpy array or dataframe;
perform PCA on the resulting matrix.
However, the PCA produces eigenvectors with the dimensions of 'the number of entries', not the 'number of dimensions in the data'.
To illustrate my problem, I demonstrate a minimal example that works perfectly well:
EXAMPLE 1
from sklearn import datasets, decomposition
digits = datasets.load_digits()
X = digits.data
pca = decomposition.PCA()
X_pca = pca.fit_transform(X)
print (X.shape)
Result: (1797, 64)
print (X_pca.shape)
Result: (1797, 64)
There are 1797 entries in each case, with eigenvectors of dimension 64.
Now onto my example:
EXAMPLE 2
from sklearn import datasets, decomposition
import numpy as np
import pandas as pd
hdf = pd.HDFStore('./afile.h5')
df = hdf.select('batch0')
print(df['event'][0].shape)
Result: (1, 24, 24, 40)
print(df['event'][0].flatten().shape)
Result: (23040,)
_list = []
for index, row in df.iterrows():
    entry = df['event'][index].flatten()
    _list.append(entry)
X = np.asarray(_list)
pca = decomposition.PCA()
X_pca=pca.fit_transform(X)
print (X.shape)
Result: (201, 23040)
print (X_pca.shape)
Result: (201, 201)
This has dimensions of the number of data, 201 entries!
I am unfamiliar with dataframes, so it could be that I am iterating through the dataframe incorrectly. However, I have checked that the rows of the resultant numpy array in X in Example 2 can be reshaped and plotted as expected.
Any thoughts would be appreciated!
Kind regards!
Sklearn's documentation states that the number of components retained when you don't specify the n_components parameter is min(n_samples, n_features).
Now, heading to your example:
In your first example, the number of data samples (1797) is greater than the number of dimensions (64), so it keeps the whole dimensionality (since you are not specifying the number of components). However, in your second example, the number of data samples (201) is far less than the number of features (23040), hence sklearn's PCA reduces the number of dimensions to n_samples.
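To make the min(n_samples, n_features) rule concrete, here is a minimal sketch in plain NumPy rather than sklearn. It assumes PCA amounts to an SVD of the centered data; the helper name pca_shapes and the random data are made up for illustration. It also shows that each eigenvector still has n_features entries, and only the number of eigenvectors is capped.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_shapes(n_samples, n_features):
    """Return (projected data shape, eigenvector matrix shape) for random data."""
    X = rng.standard_normal((n_samples, n_features))
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * S                               # plays the role of X_pca
    return scores.shape, Vt.shape                # rows of Vt are the eigenvectors

print(pca_shapes(50, 8))    # ((50, 8), (8, 8)): more samples than features
print(pca_shapes(20, 100))  # ((20, 20), (20, 100)): fewer samples than features
```

In sklearn the eigenvectors live in pca.components_, whose shape is (n_components, n_features), whereas X_pca holds the projected samples. So in Example 2 the eigenvectors would still have 23040 entries each; there are just at most 201 of them.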

Numpy / Scipy - Sparse matrix to vector

I have sparse CSR matrices (from a product of two sparse vectors) and I want to convert each matrix to a flat vector. I want to avoid using any dense representation or iterating over indexes.
So far, the only solution I have come up with is to iterate over the non-null elements using the COO representation:
import numpy
from scipy import sparse as sp
matrices = [sp.csr_matrix([[1,2],[3,4]])]*3
vectorSize = matrices[0].shape[0]*matrices[0].shape[1]
flatMatrixData = []
flatMatrixRows = []
flatMatrixCols = []
for i in range(len(matrices)):
    matrix = matrices[i].tocoo()
    flatMatrixData += matrix.data.tolist()
    flatMatrixRows += [i]*matrix.nnz
    flatMatrixCols += [r+c*2 for r,c in zip(matrix.row, matrix.col)]
flatMatrix = sp.coo_matrix((flatMatrixData,(flatMatrixRows, flatMatrixCols)), shape=(len(matrices), vectorSize), dtype=numpy.float64).tocsr()
It is indeed unsatisfying and inelegant. Does anyone know how to achieve this in an efficient way?
Your flatMatrix is (3,4); each row is [1 3 2 4]. If a submatrix is x, then the row is x.A.T.flatten().
F = sp.vstack([x.T.tolil().reshape((1,vectorSize)) for x in matrices])
F is the same (dtype is int). I had to convert each submatrix to lil since csr has not implemented reshape (in my version of sparse). I don't know if other formats work.
Ideally sparse would let you do the whole range of numpy array (or matrix) manipulations, but it isn't there yet.
Given the small dimensions in this example, I won't speculate on the speed of the alternatives.
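As a further alternative, the question's COO approach can be kept but with the per-element Python work replaced by NumPy array operations. This is a sketch under the assumption that all submatrices share the same shape; variable names such as coo_list and flat are made up for illustration. It preserves the column-major ordering of the original (r + c*2), generalizing the hard-coded 2 to the number of rows.

```python
import numpy as np
from scipy import sparse as sp

matrices = [sp.csr_matrix([[1, 2], [3, 4]])] * 3
n_rows, n_cols = matrices[0].shape

coo_list = [m.tocoo() for m in matrices]

# Stored values of all submatrices, back to back.
data = np.concatenate([m.data for m in coo_list])
# Output row k holds the flattened k-th submatrix.
rows = np.concatenate([np.full(m.nnz, k) for k, m in enumerate(coo_list)])
# Column-major flat position of each stored element.
cols = np.concatenate([m.row + m.col * n_rows for m in coo_list])

flat = sp.coo_matrix((data, (rows, cols)),
                     shape=(len(matrices), n_rows * n_cols)).tocsr()

print(flat.toarray())  # each row is [1 3 2 4]
```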

Tensordot for numpy array and scipy sparse matrix

For a current project I have to compute the inner product of a lot of vectors with the same matrix (which is quite sparse). The vectors are associated with a two dimensional grid so I store the vectors in a three dimensional array:
E.g:
X is an array of dim (I,J,N). The matrix A is of dim (N,N). Now the task is to compute A.dot(X[i,j]) for each i,j in I,J.
For numpy arrays, this is quite easily accomplished with
Y = X.dot(A.T)
Now I'd like to store A as a sparse matrix, since it contains only a very limited number of nonzero entries and the dense product therefore performs a lot of unnecessary multiplications. Unfortunately, the above solution won't work, since numpy's dot doesn't work with sparse matrices. And to the best of my knowledge there is no tensordot-like operation for scipy sparse.
Does anybody know a nice and efficient way to compute the above array Y with a sparse matrix A?
The obvious approach is to run a loop over your vectors and use the sparse matrix's .dot method:
def naive_sps_x_dense_vecs(sps_mat, dense_vecs):
    rows, cols = sps_mat.shape
    I, J, _ = dense_vecs.shape
    out = np.empty((I, J, rows))
    for i in range(I):  # range replaces Python 2's xrange
        for j in range(J):
            out[i, j] = sps_mat.dot(dense_vecs[i, j])
    return out
But you may be able to speed things up a little by reshaping your 3d array to 2d and avoid the Python looping:
def sps_x_dense_vecs(sps_mat, dense_vecs):
    rows, cols = sps_mat.shape
    vecs_shape = dense_vecs.shape
    dense_vecs = dense_vecs.reshape(-1, cols)
    out = sps_mat.dot(dense_vecs.T).T
    # use the saved input shape (the original referenced an undefined name)
    return out.reshape(vecs_shape[:-1] + (rows,))
The problem is that the sparse matrix has to be the first argument so that we can call its .dot method. This means the result comes back transposed, which in turn means that, after transposing, the last reshape is going to trigger a copy of the whole array. So for fairly large values of I and J, combined with not-so-large values of N, the latter method will be several times faster than the former, but the ranking may even be reversed for other combinations of the parameters:
import numpy as np
import scipy.sparse as sps
n, i, j = 100, 500, 500
a = sps.rand(n, n, density=1.0/n, format='csc')
vecs = np.random.rand(i, j, n)
>>> np.allclose(naive_sps_x_dense_vecs(a, vecs), sps_x_dense_vecs(a, vecs))
True
n, i, j = 100, 500, 500
%timeit naive_sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 3.85 s per loop
%timeit sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 576 ms per loop
n, i, j = 1000, 200, 200
%timeit naive_sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 791 ms per loop
%timeit sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 1.3 s per loop
You could use JAX to achieve what you are looking for. Let's suppose your sparse matrix is in csr_array format. You would first transform it into a JAX BCOO array:
from scipy import sparse
from jax.experimental import sparse as jaxsparse
import jax.numpy as jnp
def convert_to_BCOO(x):
    x = x.transpose()  # get the transpose
    x = x.tocoo()
    x = jaxsparse.BCOO((x.data, jnp.column_stack((x.row, x.col))),
                       shape=x.shape)
    return x.sort_indices()  # sort the stored indices and return the BCOO array
You could then use jax.sparsify to create a sparsified dot product as follows.
def dot(x, y):
return jnp.dot(x, y)
sp_dot = jaxsparse.sparsify(dot)
A_transpose = convert_to_BCOO(A)
Y = sp_dot(X,A_transpose)
The function sp_dot now follows the exact same rules as numpy.dot.
Hope this helps!