I would like to perform the following batch of matrix multiplications
proj = torch.einsum('abi,aic->abc', A, B)
where A is an nxnxd tensor and B is an nxdxd tensor.
When n gets large ~50k, this operation becomes very slow.
However, A is actually sparse in the first two dimensions, i.e., it could actually be written as a set of indices (i,j) and a corresponding set of 1xd vectors.
Could someone help me speed this computation up?
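For concreteness, here is the kind of approach I am hoping for, assuming the nonzeros of A are available as index arrays rows/cols and a matching (nnz, d) tensor of values (all names and sizes below are made up):
import torch

# Made-up sizes; in practice n ~ 50k and nnz is the number of nonzero rows of A.
n, d, nnz = 1000, 16, 5000
rows = torch.randint(n, (nnz,))   # first index of each nonzero (a)
cols = torch.randint(n, (nnz,))   # second index of each nonzero (b)
vals = torch.randn(nnz, d)        # the 1xd vector stored at A[rows[k], cols[k], :]
B = torch.randn(n, d, d)

# Each nonzero row of A only ever multiplies the single d x d block B[rows[k]],
# so gather those blocks and do one batched vector-matrix product over the nonzeros.
out_vals = torch.einsum('kd,kdc->kc', vals, B[rows])   # (nnz, d)

# Optionally scatter back into the dense result (assuming unique (row, col) pairs);
# for very large n you would keep the result in this sparse (indices, values) form instead.
proj = torch.zeros(n, n, d)
proj[rows, cols] = out_vals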
Related
Currently, I'm in the process of optimizing a piece of code I have. I'm trying to speed-up a bunch of matrix multiplications by reducing the size of matrix dimensions. However, in some cases, using both NumPy and MATLAB, I'm not able to obtain the speed-ups I expect. Using MATLAB, I first defined 2 randomized matrices: bigger_mat which is 10000x10000 and smaller_mat which is 10000x100. I then created 2 smaller matrices by slicing bigger_mat and smaller_mat, such that I get matrix dimensions of 200x10000 (bigger_mat_sliced) and 10000x2 (smaller_mat_sliced).
% Defining full (big) array dimension
dim_big = 100;
% Defining sliced (small) array dimension.
dim_small = 2;
% Creating a 10000x10000 randomized array
bigger_mat = rand(dim_big^2, dim_big^2);
% Creating a 10000x100 randomized array
smaller_mat = rand(dim_big^2, dim_big);
% Slicing bigger_mat to obtain a 200x10000 array
bigger_mat_sliced = bigger_mat(1:dim_small * dim_big, :);
% Slicing smaller_mat to obtain a 10000x2 array
smaller_mat_sliced = smaller_mat(:, 1:dim_small);
I then measured the runtimes for the following 3 matrix multiplications:
bigger_mat x smaller_mat
bigger_mat_sliced x smaller_mat
bigger_mat x smaller_mat_sliced
My expectations were as follows:
Multiplication #1 should take the longest amount of time since the unsliced (full) matrices are being multiplied
Multiplications #2 and #3 should take less time than #1, as in both #2 and #3 I'm multiplying a full matrix with a sliced matrix. Specifically, #2 and #3 both should require the same amount of time, and both should be 50 times faster than #1 (the sliced dimensions are scaled down by a factor of dim_big/dim_small = 100/2 = 50).
The timings I got were:
bigger_mat x smaller_mat: Elapsed time is 0.110538 seconds
bigger_mat_sliced x smaller_mat: Elapsed time is 0.002564 seconds
bigger_mat x smaller_mat_sliced: Elapsed time is 0.068878 seconds
While #2 is behaving as expected with a 43x speed-up compared to #1, #3 is only 1.6x faster than #1. I tried running this test using NumPy, but I also got similar timings to those above.
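For completeness, the NumPy version of the test looks roughly like this (exact numbers will of course depend on the machine and the BLAS build):
import time
import numpy as np

dim_big, dim_small = 100, 2

bigger_mat = np.random.rand(dim_big**2, dim_big**2)      # 10000 x 10000
smaller_mat = np.random.rand(dim_big**2, dim_big)        # 10000 x 100
bigger_mat_sliced = bigger_mat[:dim_small * dim_big, :]  # 200 x 10000
smaller_mat_sliced = smaller_mat[:, :dim_small]          # 10000 x 2

for name, lhs, rhs in [
    ("bigger_mat x smaller_mat", bigger_mat, smaller_mat),
    ("bigger_mat_sliced x smaller_mat", bigger_mat_sliced, smaller_mat),
    ("bigger_mat x smaller_mat_sliced", bigger_mat, smaller_mat_sliced),
]:
    t0 = time.perf_counter()
    lhs @ rhs
    print(f"{name}: {time.perf_counter() - t0:.6f} seconds")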
It seems to me that when multiplying two matrices with unequal outer dimensions, for example A(i,k)*B(k,j) where i >> j, slicing the larger of the two outer dimensions (i) by some factor scales down the multiplication time as expected. However, for some reason, scaling down (or slicing) the smaller dimension (j) yields barely any speed-up. I'm really having a hard time understanding these results. I tried looking up the matrix multiplication algorithms implemented in BLAS libraries, hoping to find an explanation, but I soon found myself out of my depth.
Lastly, is there a way to make multiplication #3 as fast as #2? Thanks!
I am using sparse matrices in Python, namely
scipy.sparse.csr_matrix
I am in principle free to choose the exact sparse implementation, as long as the matrices support matrix-vector multiplication and addition/subtraction of matrices with the same sparsity pattern. Currently, at every time step, I construct a new sparse matrix from scratch and add it to the existing matrix. I believe that my code could be unnecessarily losing time due to
Construction time of sparse matrix
Addition of the sparse matrices, assuming that the underlying algorithm inside the CSR matrix implementation has to find matching sparse entries before adding them up.
My guess would be that the sparse matrix is internally stored as a numpy array of values plus a few index arrays denoting where those values are located. The question is whether it is possible to directly add the underlying value arrays without touching the sparsity structure. Is something like this possible?
new_values = np.linspace(0, 1, csr_mat.nnz)  # one increment per stored value
csr_mat.data += new_values
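For reference, a minimal sketch of the idea, assuming both matrices really do share the same sparsity structure (identical .indices and .indptr arrays), in which case adding the .data arrays in place is equivalent to adding the matrices:
import numpy as np
import scipy.sparse as sp

# Two CSR matrices with the exact same sparsity pattern.
A = sp.random(5, 5, density=0.4, format='csr', random_state=0)
B = A.copy()
B.data = np.arange(B.nnz, dtype=float)   # different values, same structure

expected = (A + B).toarray()

# Because the structure is identical, adding the underlying value arrays in
# place gives the same result but skips the pattern-matching work entirely.
A.data += B.data

assert np.allclose(A.toarray(), expected)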
I want to construct a weight matrix in which certain elements are zero and never change, while the other elements are trainable variables. For example:
[[0,0,a,0],[0,0,b,0],[0,0,0,c],[0,0,0,d]]
This is a tf variable, and all zeros stay unchanged. Only a, b, c, d are tuned using gradient descent.
Does anyone know how to define such a matrix?
You should look into SparseTensor. It is highly optimised for operations where the tensor consists mostly of zeros.
So, in your case, to initialise SparseTensor:
a,b,c,d = 10,20,30,40
sparse = tf.SparseTensor([[0,2], [1,2], [2,3], [3,3]], [a,b,c,d], [4,4])
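If the goal is specifically to keep the zeros fixed while only a, b, c, d receive gradients, one possible sketch (not the only way) is to keep the values in a tf.Variable and scatter them into a dense matrix on each forward pass; the loss below is just a placeholder:
import tensorflow as tf

values = tf.Variable([10.0, 20.0, 30.0, 40.0])            # a, b, c, d (trainable)
indices = tf.constant([[0, 2], [1, 2], [2, 3], [3, 3]])   # fixed positions

with tf.GradientTape() as tape:
    # The dense 4x4 weight is rebuilt every step; its zeros are constants
    # and never change, while gradients flow back only into `values`.
    weight = tf.scatter_nd(indices, values, shape=[4, 4])
    loss = tf.reduce_sum(tf.square(weight))               # placeholder loss

grads = tape.gradient(loss, [values])   # gradients exist only for a, b, c, d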
I have two large square sparse matrices, A & B, and need to compute the following: A * B^-1 in the most efficient way. I have a feeling that the answer involves using scipy.sparse, but can't for the life of me figure it out.
After extensive searching, I have run across the following thread: Efficient numpy / lapack routine for product of inverse and sparse matrix? but can't figure out what the most efficient way would be.
Someone suggested using LU decomposition, which is built into the sparse module of scipy, but when I try LU on a sample matrix it says the result is singular (although when I just do A * B^-1 I get an answer). I have also heard someone suggest using linalg.spsolve(), but I can't figure out how to implement this, as it requires a vector as the second argument.
If it helps, once I have the solution such that A * B^-1 = C, I only need to know the values for one row of the matrix C. The matrices will be roughly 1000x1000 to 1500x1500.
Actually 1000x1000 matrices are not that large. You can compute the inverse of such a matrix using numpy.linalg.inv(B) in less than 1 second on a modern desktop computer.
But you can be much more efficient if you rewrite your problem taking into account the fact that you only need one row of C (this is actually very often the case).
Let us write d_i = [0 0 0 ... 0 1 0 ... 0], a vector whose only nonzero entry is a 1 in the i-th position.
If ^t denotes the transpose, you can write:
AB^-1 = C <=> A = CB <=> A^t = B^t C^t
For the i-th row:
A^t d_i = B^t C^t d_i <=> a_i = B^t c_i
So you have a linear system in c_i which can be solved using numpy.linalg.solve:
ci = np.linalg.solve(B.T, a[i])
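A small dense sketch of the row-only approach, just to make the derivation concrete (sizes and the row index are arbitrary):
import numpy as np

n, i = 1000, 42
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# Reference: the full product C = A @ inv(B)
C_full = A @ np.linalg.inv(B)

# Row-only version: solve B^t c_i = a_i for the i-th row of C
c_i = np.linalg.solve(B.T, A[i])

print(np.allclose(C_full[i], c_i))   # True, up to round-off
If B is kept sparse, scipy.sparse.linalg.spsolve(B.T, a_i), with a_i as a 1-D array, should play the same role.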
I have a 4x4 input matrix and I want to multiply every 2x2 slice with a weight stored in a 3x3 weight matrix. Please see the attached image for an example:
In the image, the colored section of the 4x4 input matrix is multiplied by the same colored section of the 3x3 weight matrix and stored in the 4x4 output matrix. When the slices overlap, the output takes the sum of the overlaps (e.g. the blue+red).
I am trying to perform this operation in Tensorflow 2.0 using eager tensors (which can be treated as numpy arrays). This is what I've written to perform this operation and it produces the expected output.
inputm = np.ones([4,4]) # initialize 4x4 input matrix
weightm = np.ones([3,3]) # initialize 3x3 weight matrix
outputm = np.zeros([4,4]) # initialize blank 4x4 output matrix
# iterate through each weight
for i in range(weightm.shape[0]):
    for j in range(weightm.shape[1]):
        outputm[i:i+2, j:j+2] += weightm[i,j] * inputm[i:i+2, j:j+2]
However, I don't think this is efficient, since I am iterating through the weight matrix element by element, and this will be extremely slow when I need to perform this on large matrices of 500x500. I am having a hard time finding a way to vectorize this operation, perhaps by tiling the weight matrix to the same shape as the input matrix and performing a single matrix multiplication. I have also thought about flattening the matrices, but I'm still not able to see a way to do this more efficiently.
Any advice will be much appreciated. Thanks in advance!
Alright, I think I have a solution, but it involves both NumPy operations (e.g. np.repeat) and TensorFlow 2.0 operations (i.e. tf.math.segment_sum). Fair warning: this is not the clearest or most elegant solution in the world, but it is the best I could come up with. So here goes.
The main culprit in your problem is the weight matrix. If you turn it into a 4x4 matrix (with the correct sum of weights at each position), you get a weight matrix you can element-wise multiply with the input. That's my solution. Note that this is written for the 4x4 problem, but you should be able to extend it to the 500x500 case fairly easily.
import numpy as np
import tensorflow as tf
a = np.array([[1,2,3,4],[4,3,2,1],[1,2,3,4],[4,3,2,1]])
w = np.array([[5,4,3],[3,4,5],[5,4,3]])
# We expand the weights to a 6x6 matrix by repeating them twice along both axes
w_rep = np.repeat(w,2,axis=0)
w_rep = np.repeat(w_rep,2,axis=1)
# Let's now jump in to tensorflow
tf_a = tf.constant(a)
tf_w = tf.constant(w_rep)
tf_segments = tf.constant([0,1,1,2,2,3])
# This is the trickiest bit: here we use segment_sum to achieve what we need.
# segment_sum sums segments along the very first dimension of a matrix, so we
# apply it to the repeated weight matrix twice: once on the original and once on the transpose.
tf_w2 = tf.math.segment_sum(tf_w, tf_segments)
tf_w2 = tf.transpose(tf_w2)
tf_w2 = tf.math.segment_sum(tf_w2, tf_segments)
tf_w2 = tf.transpose(tf_w2)
print(tf_w2*a)
PS: I will try to include an illustration of what's going on here in a future edit. But I reckon that will take some time.
After seeing @thushv89's trick, I realised you can get the same result by convolving the weight matrix with a matrix of ones:
import numpy as np
from scipy.signal import convolve2d
a = np.ones([4,4]) # initialize 4x4 input matrix
w = np.ones([3,3]) # initialize 3x3 weight matrix
b = np.multiply(a, convolve2d(w, np.ones((2,2))))
print(b)
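And a quick sanity check (using the loop from the question, but with non-constant data so the comparison is meaningful) that the convolution trick matches the loop:
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
a = rng.random((4, 4))
w = rng.random((3, 3))

# Loop version from the question
out_loop = np.zeros((4, 4))
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        out_loop[i:i+2, j:j+2] += w[i, j] * a[i:i+2, j:j+2]

# Convolution version: each output cell's effective weight is the sum of the
# w[i, j] whose 2x2 window covers it, which is exactly convolve2d(w, ones((2, 2))).
out_conv = a * convolve2d(w, np.ones((2, 2)))

print(np.allclose(out_loop, out_conv))   # True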