Multiplying sparse matrices in TensorFlow

I have a very large sparse 2D matrix (200k x 100k, sparsity around 0.002), which I need to multiply by its own transpose.
A = 200k x 100k
and I need to calculate:
A x A.transpose()
The problem is that TensorFlow does not seem to offer sparse-by-sparse matrix multiplication.
tf.sparse_tensor_dense_matmul() requires one side to be turned into a dense matrix, which would hit the memory limit.
tf.sparse_matmul() seems to actually require dense matrices.
There is no way to fit the whole matrix A in GPU memory as a dense tensor.
Is there a way to multiply sparse matrices by other sparse matrices in TensorFlow? I can't seem to find one.
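One workaround I'm aware of (outside TensorFlow, so not a direct answer to the question) is to compute the product with scipy.sparse, which keeps both operands and the result in sparse format. A minimal sketch, with the shape scaled down so it runs quickly:

import numpy as np
from scipy import sparse

# Stand-in for the 200k x 100k matrix; shape reduced so the sketch runs quickly
A = sparse.random(2000, 1000, density=0.002, format='csr', random_state=0)

# Sparse-by-sparse multiplication; the result stays in scipy's sparse format
G = A @ A.T          # shape (2000, 2000)
print(G.shape, G.nnz)

Whether the 200k x 200k result is itself sparse enough to hold in memory depends on the structure of A, so it is worth checking G.nnz before converting anything to dense.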

Related

Working with sparse matrices in numpy and sklearn

I have a time series dataset generated from some electrophysiological data. It is a frequency dataset, and the matrix is quite sparse but huge: it contains 0.005 s time bins for 2000+ neurons recorded over an hour. I am using it to train regressions in sklearn, and I was wondering whether there are ways to represent the matrix more efficiently to speed up my code, since most of the entries are zeros.
Specifically, I will be using these two estimators on the matrix:
https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html
Will scipy.sparse matrices work with these sklearn estimators, and will training be faster using scipy.sparse or other options?
https://docs.scipy.org/doc/scipy/reference/sparse.html
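In my experience many sklearn estimators accept scipy.sparse input directly; the ElasticNetCV docs list sparse matrices as valid X. I am less sure about CCA, which may require (or internally convert to) dense arrays, so check its memory behaviour separately. A minimal sketch with made-up toy data standing in for the binned spike matrix:

import numpy as np
from scipy import sparse
from sklearn.linear_model import ElasticNetCV

# Toy stand-in for the binned spike matrix: time bins x neurons, mostly zeros
rng = np.random.default_rng(0)
X = sparse.csr_matrix(rng.poisson(0.01, size=(5000, 200)).astype(float))
y = rng.normal(size=5000)

# ElasticNetCV accepts the CSR matrix directly, avoiding a dense copy of X
model = ElasticNetCV(cv=3).fit(X, y)
print(model.alpha_)

Whether it actually trains faster depends on how sparse the data really is, but at minimum the CSR representation avoids storing all the zero values.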

How to optimise contraction of three inner tensor indices

In my code for my convolutional neural network there is a step which involves a tensor contraction along 3 dimensions; in NumPy it looks like this (however I am planning on using raw BLAS):
y = np.einsum('abcijk,ijkd->abcd', x, f)
which is symbolic for y_{abcd} = sum_{ijk} x_{abcijk} * f_{ijkd}.
Generally, CNN routines tackle large tensor contractions by converting higher-dimensional tensors into 2-D "images", then using regular matrix multiplication - the cost and redundancy of the initial dimension reduction is offset by leveraging the speed of GEMM and other well-optimised matrix multiplication routines.
My method involves skipping the dimension reduction stage, and the only real computational effort required is in the tensor multiplication of X and F. Do there exist similarly fast routines for contracting the inner indices of tensors? How can I whittle this down to something that is well-established in BLAS etc.? The fact that only inner indices are being contracted suggests to me that there should be a way to take advantage of the locality of reference via the natural traversal order, rather than using generic routines such as "GETT".
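For this particular index pattern there is a direct reduction to a single GEMM without any im2col-style expansion: because i, j, k are the trailing axes of x and the leading axes of f, a row-major reshape exposes the contraction as an ordinary matrix multiplication. A small NumPy sketch (toy sizes, just to show the equivalence):

import numpy as np

# Toy dimensions; the real sizes come from the CNN
a, b, c, i, j, k, d = 2, 3, 4, 5, 6, 7, 8
x = np.random.rand(a, b, c, i, j, k)
f = np.random.rand(i, j, k, d)

# einsum form of the contraction
y_einsum = np.einsum('abcijk,ijkd->abcd', x, f)

# Equivalent single GEMM: i, j, k are the trailing axes of x and the leading
# axes of f, so a row-major reshape turns the contraction into one matmul
y_gemm = (x.reshape(a * b * c, i * j * k) @ f.reshape(i * j * k, d)).reshape(a, b, c, d)

assert np.allclose(y_einsum, y_gemm)

Since both reshapes are views on contiguous memory, this maps straight onto a BLAS sgemm/dgemm call.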

Is concatenated matrix multiplication faster than multiple non-concatenated matmul? If so, why?

The definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output of the previous step. We can simplify the expression to a single matrix multiply by concatenating the 4 small matrices (the resulting matrix is 4 times larger).
My question is: does this improve the efficiency of the matrix multiplication? If so, why? Is it because we can put them in contiguous memory? Or is it just because of the conciseness of the code?
The number of items we multiply doesn't change whether or not we concatenate the matrices (so the complexity shouldn't change), so I'm wondering why we would do this.
Here is an excerpt from the PyTorch documentation of torch.nn.LSTM(*args, **kwargs); W_ii, W_if, W_ig, W_io are concatenated.
weight_ih_l[k] – the learnable input-hidden weights of the k-th layer (W_ii|W_if|W_ig|W_io), of shape (4*hidden_size x input_size)
weight_hh_l[k] – the learnable hidden-hidden weights of the k-th layer (W_hi|W_hf|W_hg|W_ho), of shape (4*hidden_size x hidden_size)
bias_ih_l[k] – the learnable input-hidden bias of the k-th layer (b_ii|b_if|b_ig|b_io), of shape (4*hidden_size)
bias_hh_l[k] – the learnable hidden-hidden bias of the k-th layer (b_hi|b_hf|b_hg|b_ho), of shape (4*hidden_size)
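A minimal PyTorch sketch (toy shapes, weight names made up to mirror the excerpt above) showing that one matmul against the concatenated weight matrix produces the same gate pre-activations as four separate matmuls, just via a single larger GEMM call:

import torch

batch, input_size, hidden_size = 4, 10, 20
x = torch.randn(batch, input_size)

# Four separate input-hidden weight matrices (hypothetical values)
W_ii, W_if, W_ig, W_io = (torch.randn(hidden_size, input_size) for _ in range(4))

# Option 1: four small matmuls
separate = [x @ W.t() for W in (W_ii, W_if, W_ig, W_io)]

# Option 2: one matmul against the concatenated (4*hidden_size, input_size) matrix
W_cat = torch.cat([W_ii, W_if, W_ig, W_io], dim=0)
combined = x @ W_cat.t()              # shape (batch, 4*hidden_size)
gates = combined.chunk(4, dim=1)      # split back into the four gate blocks

for s, g in zip(separate, gates):
    assert torch.allclose(s, g)

The usual motivation cited for this is that one big GEMM incurs less kernel-launch and framework overhead than four small ones; the asymptotic operation count is indeed unchanged.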
The structure of the LSTM isn't about improving multiplication efficiency; it is mainly about avoiding vanishing / exploding gradients (https://stats.stackexchange.com/questions/185639/how-does-lstm-prevent-the-vanishing-gradient-problem). There are studies on mitigating the effects of vanishing gradients, and GRU / LSTM cells + peepholes are a few attempts to do that.

About feeding a sparse matrix into the graph

Because the data dimension is too large, I have to convert the data into sparse matrices instead of dense arrays.
However, the graph includes a CNN, and when I feed the sparse matrix directly I am told that the CNN cannot accept a sparse tensor, so I have to run a 'sparse to dense' operation first.
The problem is that the data I feed (multiple sparse matrices) has to be flattened into a single two-dimensional sparse matrix. For example, I have sparse matrix1 with shape [14,25500] and sparse matrix2 with shape [14,25500]; the shape I would like to feed is [2,14,25500], but what I actually end up with is [28,25500].
So I have to split the tensor after it enters the graph.
Is there another way to solve this problem?
tf.stack is your friend
tf.stack([matrix1, matrix2]) # => [2,14,25500]
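One way to read that suggestion end to end (TF2-style API names; the original question appears to use the TF1 graph API, so adapt accordingly): feed each matrix as its own SparseTensor, densify just before the CNN, and stack to recover the [2,14,25500] shape.

import tensorflow as tf

# Hypothetical sparse inputs, each with dense shape [14, 25500]
sp1 = tf.sparse.SparseTensor(indices=[[0, 0], [13, 25499]], values=[1.0, 2.0],
                             dense_shape=[14, 25500])
sp2 = tf.sparse.SparseTensor(indices=[[1, 1]], values=[3.0],
                             dense_shape=[14, 25500])

# Densify each matrix only at the point where the CNN needs a dense tensor
dense1 = tf.sparse.to_dense(sp1)
dense2 = tf.sparse.to_dense(sp2)

batch = tf.stack([dense1, dense2])   # shape [2, 14, 25500]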

very sparse (and large) matrices and vectors for gradient descent in tensorflow

We are trying to implement stochastic gradient descent (or coordinate gradient descent) on an extremely large and very sparse least-squares problem. That is to say, for the standard least-squares problem:
min_x ||y - Ax||^2
The matrix A is on the order of a billion rows and a billion columns, but it is 99% sparse. Similarly, the coefficient vector x is about 98% sparse, and the observation vector y is also about 97% sparse.
However, although I know that TensorFlow has a stochastic gradient descent function, it is unclear to me whether its gradient descent functions accept sparse matrices/vectors, and even if they do, whether these libraries eventually convert the sparse matrices and vectors into dense representations behind the scenes, which would defeat the point of supplying the data in sparse format in the first place. If TensorFlow's gradient descent method converts everything to dense format, we would easily blow up the memory and also the performance (since we would need roughly 100 * 100 times more computation for all those zero values).
If the SGD algorithms implicitly convert sparse matrices and vectors to their dense counterparts, how hard would it be to change that logic to sparse-only logic? That is, could a reasonably knowledgeable Python/C++ engineer (with some TensorFlow knowledge) do it, or would it require a major architectural rewrite?
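For what it's worth, TensorFlow's optimizers do not require A itself to be dense: if A is held as a tf.sparse.SparseTensor and only x is a dense variable, tf.sparse.sparse_dense_matmul keeps the forward pass sparse and the gradient flows to x as usual. A TF2-style toy sketch (the sizes here are tiny stand-ins for the billion-scale problem, and the original question may target the TF1 API):

import tensorflow as tf

n_rows, n_cols = 1000, 500

# A stays a SparseTensor: only the nonzero entries are stored
A = tf.sparse.SparseTensor(indices=[[0, 0], [1, 3], [999, 499]],
                           values=[1.0, -2.0, 0.5],
                           dense_shape=[n_rows, n_cols])
y = tf.random.normal([n_rows, 1])

# x is a dense variable here; keeping x itself sparse would need custom logic
x = tf.Variable(tf.zeros([n_cols, 1]))
opt = tf.keras.optimizers.SGD(learning_rate=0.01)

for _ in range(100):
    with tf.GradientTape() as tape:
        residual = y - tf.sparse.sparse_dense_matmul(A, x)   # A is never densified
        loss = tf.reduce_sum(tf.square(residual))
    grads = tape.gradient(loss, [x])
    opt.apply_gradients(zip(grads, [x]))

Whether a dense x with a billion entries fits in memory is a separate question; exploiting its 98% sparsity would indeed require custom sparse-update logic rather than a configuration switch.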