NumPy/SciPy: solving several least-squares problems with the same design matrix

I face a least-squares problem that I solve via scipy.linalg.lstsq(M,b), where:
M has shape (n,n)
b has shape (n,)
The issue is that I have to solve it many times for different b's. How can I do this more efficiently? I guess that lstsq does a lot of work that is independent of the value of b.
Any ideas?

If your linear system is well-determined, I would store the LU decomposition of M and reuse it for each b individually, or simply make one solve call with a 2-D array B holding the b's stacked as columns. Which is better depends on your problem, but the idea is the same. Suppose you get each b one at a time; then:
import numpy as np
from scipy.linalg import lstsq, lu_factor, lu_solve, svd, pinv
# as you didn't specify any practical dimensions
n = 100
# number of b's
nb_b = 10
# generate random n-square matrix M
M = np.random.rand(n**2).reshape(n,n)
# Set of nb_b of right hand side vector b as columns
B = np.random.rand(n*nb_b).reshape(n,nb_b)
# compute pivoted LU decomposition of M
M_LU = lu_factor(M)
# then solve for each b
X_LU = np.asarray([lu_solve(M_LU,B[:,i]) for i in range(nb_b)])
but if it is under or over-determined, you need to use lstsq as you did:
X_lstsq = np.asarray([lstsq(M,B[:,i])[0] for i in range(nb_b)])
or simply store the pseudo-inverse M_pinv with pinv (built on lstsq) or pinv2 (built on SVD):
# compute the pseudo-inverse of M
M_pinv = pinv(M)
X_pinv = np.asarray([np.dot(M_pinv,B[:,i]) for i in range(nb_b)])
or you can also do the work by yourself, as in pinv2 for instance, just store the SVD of M, and solve this manually:
# compute svd of M
U,s,Vh = svd(M)
def solve_svd(U,s,Vh,b):
    # U diag(s) Vh x = b <=> diag(s) Vh x = U.T b = c
    c = np.dot(U.T,b)
    # diag(s) Vh x = c <=> Vh x = diag(1/s) c = w (trivial inversion of a diagonal matrix)
    w = np.dot(np.diag(1/s),c)
    # Vh x = w <=> x = Vh.H w (where .H stands for hermitian = conjugate transpose)
    x = np.dot(Vh.conj().T,w)
    return x
X_svd = np.asarray([solve_svd(U,s,Vh,B[:,i]) for i in range(nb_b)])
All of these give the same result when checked with np.allclose (unless the system is not well-determined, in which case the direct LU approach fails). Finally, in terms of performance:
%timeit M_LU = lu_factor(M); X_LU = np.asarray([lu_solve(M_LU,B[:,i]) for i in range(nb_b)])
1000 loops, best of 3: 1.01 ms per loop
%timeit X_lstsq = np.asarray([lstsq(M,B[:,i])[0] for i in range(nb_b)])
10 loops, best of 3: 47.8 ms per loop
%timeit M_pinv = pinv(M); X_pinv = np.asarray([np.dot(M_pinv,B[:,i]) for i in range(nb_b)])
100 loops, best of 3: 8.64 ms per loop
%timeit U,s,Vh = svd(M); X_svd = np.asarray([solve_svd(U,s,Vh,B[:,i]) for i in range(nb_b)])
100 loops, best of 3: 5.68 ms per loop
Nevertheless, it's up to you to check these with appropriate dimensions.
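The single-call variant mentioned at the start of this answer can be sketched as follows. This is only a minimal sketch: it assumes M is square and well-conditioned (the diagonal shift below exists purely to guarantee that for the random demo matrix and is not part of the original problem). Both lu_solve and lstsq accept a 2-D right-hand side, so no Python loop is needed.

```python
import numpy as np
from scipy.linalg import lstsq, lu_factor, lu_solve

rng = np.random.default_rng(0)
n, nb_b = 100, 10

# Assumption: a well-conditioned square M (the shift n*I is only for this demo).
M = rng.random((n, n)) + n * np.eye(n)
# Right-hand sides stored as the columns of B.
B = rng.random((n, nb_b))

# Factor once, then solve for every column of B in a single call.
X_lu = lu_solve(lu_factor(M), B)   # shape (n, nb_b), one solution per column

# lstsq also takes a 2-D right-hand side and returns one solution per column.
X_lstsq = lstsq(M, B)[0]           # shape (n, nb_b)

print(np.allclose(X_lu, X_lstsq))
print(np.allclose(M @ X_lu, B))
```

Note that the solutions come back as columns here, whereas the list comprehensions above stack them as rows.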
Hope this helps.

Your question is unclear, but I am guessing you want to solve Mx=b through scipy.linalg.lstsq(M,b) for different arrays (b0, b1, b2, ...). If that is the case, you could parallelize the process with concurrent.futures.ProcessPoolExecutor. Its documentation is fairly simple, and it lets Python run multiple SciPy solvers at once.
Hope this helps.

You can factorize M into either QR or SVD products and find the lsq solution manually.
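A minimal sketch of the QR route, assuming M has full column rank (the random tall matrix below is illustrative, not the asker's data, and solve_qr is a hypothetical helper name). The factorization is computed once; each new b then costs only a product with Q.T and a triangular solve.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(0)
# Assumption: an overdetermined system with full column rank.
M = rng.random((50, 20))
B = rng.random((50, 5))      # five right-hand sides as columns

# Economy-size QR: Q is (50, 20) with orthonormal columns, R is (20, 20) upper triangular.
Q, R = qr(M, mode="economic")

def solve_qr(Q, R, b):
    # Least-squares solution of M x = b: solve R x = Q.T b by back substitution.
    return solve_triangular(R, Q.T @ b)

X_qr = solve_qr(Q, R, B)                        # works for a vector or a 2-D block
X_ref = np.linalg.lstsq(M, B, rcond=None)[0]    # reference solution
print(np.allclose(X_qr, X_ref))
```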


Vectorizing ARD (Automatic Relevance Determination) kernel implementation in Gaussian processes

I am trying to implement an ARD kernel with NumPy as given in the GPML book (M3 from Equation 5.2).
I am struggling to vectorize this equation for the N x M kernel computation. I have tried the following non-vectorized version. Can someone help me vectorize it in NumPy/PyTorch?
import numpy as np
N = 30 # Number of data points in X1
M = 40 # Number of data points in X2
D = 6 # Number of features (ARD dimensions)
X1 = np.random.rand(N, D)
X2 = np.random.rand(M, D)
Lambda = np.random.rand(D, 1)
L_inv = np.diag(np.random.rand(D))
sigma_f = np.random.rand()
K = np.empty((N, M))
for n in range(N):
    for m in range(M):
        M3 = Lambda@Lambda.T + L_inv**2
        d = (X1[n,:] - X2[m,:]).reshape(-1,1)
        K[n, m] = sigma_f**2 * np.exp(-0.5 * d.T@M3@d)
We can use the rules of broadcasting and the neat NumPy function einsum to vectorize array operations. In short, broadcasting lets us operate on arrays in one-liners by adding new dimensions to the resulting array, while einsum lets us combine multiple arrays by working explicitly in index notation (instead of matrices).
Luckily, no loops are necessary to calculate your kernel. Please see below the vectorized solution, the ARD_kernel function, which is about 30x faster on my machine than the original loopy version. einsum is usually as fast as it gets, but there may be faster methods; I have not checked anything else (e.g. the usual @ operator instead of einsum).
Also, there is a missing term in the code (the Kronecker delta). I don't know if it was omitted on purpose (let me know if you have problems implementing it and I'll edit the answer).
import numpy as np
N = 300 # Number of data points in X1
M = 400 # Number of data points in X2
D = 6 # Number of features (ARD dimensions)
np.random.seed(1) # Fix random seed for reproducibility
X1 = np.random.rand(N, D)
X2 = np.random.rand(M, D)
Lambda = np.random.rand(D, 1)
L_inv = np.diag(np.random.rand(D))
sigma_f = np.random.rand()
# Loopy function
def ARD_kernel_loops(X1, X2, Lambda, L_inv, sigma_f):
    N, M = len(X1), len(X2)  # take the sizes from the inputs rather than from globals
    K = np.empty((N, M))
    M3 = Lambda@Lambda.T + L_inv**2
    for n in range(N):
        for m in range(M):
            d = (X1[n,:] - X2[m,:]).reshape(-1,1)
            K[n, m] = np.exp(-0.5 * d.T@M3@d)
    return K * sigma_f**2
# Vectorized function
def ARD_kernel(X1, X2, Lambda, L_inv, sigma_f):
    M3 = Lambda.squeeze()*Lambda + L_inv**2 # Use broadcasting to avoid transpose
    d = X1[:,None] - X2[None,...] # Use broadcasting to avoid loops
    # order=F for memory layout (as your arrays are (N,M,D) instead of (D,N,M))
    return sigma_f**2 * np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", d, M3, d, order = 'F'))
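The quadratic form can also be written with the @ operator, as hinted at above. The following is a minimal sketch on small random inputs (the sizes and the default_rng seed are assumptions made for the demo); it only checks that both formulations agree and makes no claim about which is faster.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, D = 30, 40, 6
X1 = rng.random((N, D))
X2 = rng.random((M, D))
Lambda = rng.random((D, 1))
L_inv = np.diag(rng.random(D))
sigma_f = rng.random()

M3 = Lambda @ Lambda.T + L_inv**2          # (D, D)
d = X1[:, None, :] - X2[None, :, :]        # (N, M, D) pairwise differences

# einsum form: sum over k and l of d[i,j,k] * M3[k,l] * d[i,j,l]
quad_einsum = np.einsum("ijk,kl,ijl->ij", d, M3, d)

# matmul form: (d @ M3) has shape (N, M, D); multiply by d and sum over the last axis
quad_matmul = ((d @ M3) * d).sum(axis=-1)

K = sigma_f**2 * np.exp(-0.5 * quad_matmul)
print(np.allclose(quad_einsum, quad_matmul))
```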
There is perhaps an additional optimisation. The examples of the M matrices given are all positive definite. This means that the Cholesky decomposition can be applied, so that we can find an upper triangular U such that
M = U'*U
The point of this is that if we apply U to the xs, so that
y[p] = U*x[p], p = 1..N
then
(x[p]-x[q])'*M*(x[p]-x[q]) = (y[p]-y[q])'*(y[p]-y[q])
Thus, if there are N vectors x, each of dimension d,
we convert the N squared O(d squared) operations on the LHS into N squared O(d) operations on the RHS.
This costs an extra Cholesky decomposition (O(d cubed))
and N O(d squared) applications of U to the xs.
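The Cholesky idea can be sketched as follows, under the assumption that M3 is positive definite (true here because L_inv**2 has a strictly positive diagonal for these random inputs). np.linalg.cholesky returns the lower factor L with M3 = L @ L.T, so U = L.T and applying U to each x stored as a row amounts to X @ L. The use of scipy's cdist for the squared distances is my choice for the sketch, not something the answer above prescribes.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
N, M, D = 30, 40, 6
X1 = rng.random((N, D))
X2 = rng.random((M, D))
Lambda = rng.random((D, 1))
L_inv = np.diag(rng.random(D))
sigma_f = rng.random()

M3 = Lambda @ Lambda.T + L_inv**2

# Lower Cholesky factor: M3 = L @ L.T, i.e. U = L.T in the notation above.
L = np.linalg.cholesky(M3)

# y = U x for every row x: (U x)' = x' U' = x' L, hence Y = X @ L.
Y1 = X1 @ L
Y2 = X2 @ L

# The quadratic form becomes a plain squared Euclidean distance between the ys.
K_chol = sigma_f**2 * np.exp(-0.5 * cdist(Y1, Y2, "sqeuclidean"))

# Reference: direct evaluation of the quadratic form.
d = X1[:, None, :] - X2[None, :, :]
K_ref = sigma_f**2 * np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", d, M3, d))
print(np.allclose(K_chol, K_ref))
```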

Efficient way to calculate the pairwise matrix product between one tensor and all the rolling of another tensor

Suppose we have two tensors:
tensor A whose shape is (d,m,n)
tensor B whose shape is (d,n,l).
If we want to get the pairwise matrix product of the right-most matrices of A and B, I think we can use np.einsum('dmn,enl->deml',A,B), whose shape is (d,d,m,l). However, I do not want the product for all of the pairs.
Given a parameter k, 1<=k<=d, I want to get the following pairwise matrix products: A[i] with B[(i+j) % d], for i = 0..d-1 and j = 0..k-1.
Note that we roll through tensor B here (like numpy.roll).
Finally, we actually get a tensor whose shape is (d,k,m,l).
What's the most efficient way to do this?
I know several ways, such as:
First compute np.einsum('dmn,enl->deml',A,B), then use a mask to extract the (d,k) pairs.
Tile B first, then use einsum in some way.
But I think a better way exists.
I doubt you can do much better than a for loop. Here is, for example, a vectorized version using einsum and stride_tricks compared to a double for loop:
from simple_benchmark import BenchmarkBuilder, MultiArgument
import numpy as np
from numpy.lib.stride_tricks import as_strided
B = BenchmarkBuilder()
@B.add_function()
def loopy(A,B,k):
    d,m,n = A.shape
    l = B.shape[-1]
    out = np.empty((d,k,m,l),int)
    for i in range(d):
        for j in range(k):
            out[i,j] = A[i]@B[(i+j)%d]
    return out
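@B.add_function()
def vectory(A,B,k):
    d,m,n = A.shape
    l = B.shape[-1]
    # append the first k-1 matrices so that the rolling window can wrap around
    BB = np.concatenate([B,B[:k-1]],0)
    # view with BB[i,j] = B[(i+j)%d], built without copying
    BB = as_strided(BB,(d,k,n,l),np.repeat(BB.strides,(2,1,1)))
    return np.einsum("ikl,ijln->ijkn",A,BB)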
@B.add_arguments('d x k x m x n x l')
def argument_provider():
    for exp in range(10):
        d,k,m,n,l = (np.r_[1.6,1.5,1.5,1.5,1.5]**exp*(4,2,2,2,2)).astype(int)
        A = np.random.randint(0,10,(d,m,n))
        B = np.random.randint(0,10,(d,n,l))
        yield k*d*m*n*l,MultiArgument([A,B,k])
r = B.run()
import pylab
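For comparison, here is a simpler alternative to the strided view that needs no benchmarking package. It is a sketch rather than part of the answer above: rolled_products is a hypothetical name, and gathering B with a modular index array makes a copy of size (d,k,n,l), so it trades memory for readability.

```python
import numpy as np

def rolled_products(A, B, k):
    """Return out with out[i, j] = A[i] @ B[(i + j) % d], shape (d, k, m, l)."""
    d = A.shape[0]
    # idx[i, j] = (i + j) % d selects the j-th rolled partner of A[i].
    idx = (np.arange(d)[:, None] + np.arange(k)[None, :]) % d
    BB = B[idx]                    # (d, k, n, l); fancy indexing copies the data
    # Broadcasted matmul: (d, 1, m, n) @ (d, k, n, l) -> (d, k, m, l)
    return A[:, None] @ BB

rng = np.random.default_rng(0)
d, k, m, n, l = 5, 3, 2, 3, 4
A = rng.integers(0, 10, (d, m, n))
B = rng.integers(0, 10, (d, n, l))

out = rolled_products(A, B, k)
# Reference: the explicit double loop.
ref = np.array([[A[i] @ B[(i + j) % d] for j in range(k)] for i in range(d)])
print(out.shape, np.array_equal(out, ref))
```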

speeding up numpy code involving array slicing and broadcasting

I have the following code:
x = sp.linspace(-2,2,1000)
z = sp.linspace(-1,3,2000)
X,Z = sp.meshgrid(x,z)
X = X[:,:,sp.newaxis]
Z = Z[:,:,sp.newaxis]
E = sp.zeros((len(z),len(x),3), dtype=complex)
# e_uvect.shape = (2,N,2,3)
# En.shape = (2,N,2)
# d_cum.shape = (N,)
# pol is either 0 or 1
for n in range(N):
    idx = sp.logical_and(Z<d_cum[n], Z>=d_cum[n-1])
    E += e_uvect[pol,n,0,:]*En[pol,n,0]*sp.exp(+1j*[n]*(Z-d_cum[n-1])+1j*self.kx*X)*idx
Basically, the above is part of a code that calculates the electric field of an N-layer structure. In each iteration of the for loop, I find the indices of the array elements that lie within the n-th layer; then, after calculating the electric field, I multiply the whole thing by idx to 'filter' out the part that satisfies sp.logical_and(Z<d_cum[n], Z>=d_cum[n-1]).
It works fine, but I wonder if there is a more efficient way of doing this using numpy array slicing or other methods, because each multiplication involves a large proportion of array elements which are not accepted in each iteration. I tried something like the following to only work on the relevant part of the coordinates array Z and X
idx = sp.logical_and(Z<d_cum[n], Z>=d_cum[n-1])
Z2 = Z[idx]
X2 = X[idx]
E[???] += e_uvect[pol,n,0,:]*En[pol,n,0]*sp.exp(+1j*[n]*(Z2-d_cum[n-1])+1j*self.kx*X2)
But then Z2 and X2 become 1-D arrays, and I'm not sure how to index into E or how to reshape the arrays appropriately.
So is there any way to speed up the original code?
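One way the masked update could look is sketched below. Everything about the physics is a placeholder: d_cum, kz, kx, e_vec and amp are hypothetical stand-ins for the asker's layer data, the grid is smaller than in the question, and the loop starts at n = 1 so that d_cum[n-1] never wraps around. The key point is that indexing the 3-D array E with a 2-D boolean mask yields a (K, 3) block, which matches a (K,) phase array after adding a trailing axis.

```python
import numpy as np

x = np.linspace(-2, 2, 100)
z = np.linspace(-1, 3, 200)
X, Z = np.meshgrid(x, z)                       # both have shape (len(z), len(x))

# Hypothetical layer data, for illustration only.
d_cum = np.array([0.0, 1.0, 2.0, 3.0])         # cumulative layer boundaries
kz = np.array([1.0, 1.5, 2.0, 2.5])            # one wavenumber per layer
kx = 0.5
e_vec = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
amp = np.array([1.0, 0.8, 0.6, 0.4])

E_full = np.zeros(Z.shape + (3,), dtype=complex)
E_masked = np.zeros_like(E_full)

for n in range(1, len(d_cum)):
    idx = (Z < d_cum[n]) & (Z >= d_cum[n - 1])             # 2-D boolean mask

    # Original approach: evaluate on the full grid, then zero out with the mask.
    phase = np.exp(1j * kz[n] * (Z - d_cum[n - 1]) + 1j * kx * X)
    E_full += e_vec[n] * amp[n] * (phase * idx)[:, :, None]

    # Masked approach: evaluate only on the K selected grid points.
    phase_sel = np.exp(1j * kz[n] * (Z[idx] - d_cum[n - 1]) + 1j * kx * X[idx])  # (K,)
    E_masked[idx] += e_vec[n] * amp[n] * phase_sel[:, None]                      # (K, 3)

print(np.allclose(E_full, E_masked))
```

Because Z only varies along the first axis, each mask selects whole rows, so a row slice (for example from np.searchsorted on z) might be cheaper still than a boolean mask.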

Vectorizing a comparison in numpy

How can I vectorize this loop in NumPy? It uses sampling from NumPy's binomial() function to estimate the probability that, out of 55 events, exactly m of a particular type occur, where each event is of that type with probability 5%; i.e. it estimates 55Cm.(0.05)^m.(0.95)^(55-m), where 55Cm = 55!/(m!.(55-m)!)
import numpy as np
M = 7
m = np.arange(M+1)
ntrials = 1000000
p = np.empty(M+1)
for r in m:
    p[r] = np.sum(np.random.binomial(55, 0.05, ntrials)==r)/ntrials
Here is the equivalent code:
p = np.zeros(M+1)
print(p)
I imagine you didn't intend for your output to always be all zero, but under Python 2 it is, because / between two integers truncates (under Python 3 the division already returns floats). So the first thing to do is add a dtype=float argument to your np.sum() call. With that out of the way, we can vectorize the whole thing like this:
samples = np.random.binomial(55, 0.05, (ntrials, M+1))
p = np.sum(samples == m, dtype=float, axis=0) / ntrials
This produces an equivalent, though not identical, result. The reason is that the random numbers are drawn in a different order, so you will get an answer which is "correct" but not identical to that of the old code. If you want the same result as before, you can get it by changing the first line to this:
samples = np.random.binomial(55, 0.05, (M+1, ntrials)).T
Then you draw in the same order as before, with no real performance penalty.
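As a sanity check, the vectorized estimate can be compared with the closed-form binomial probabilities quoted in the question. This is a small sketch with assumptions of its own: it uses the newer default_rng generator with a fixed seed and fewer trials than the original to keep memory use low, so the numbers will not match the legacy np.random stream.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
M = 7
m = np.arange(M + 1)
ntrials = 200_000          # fewer trials than in the question, for a lighter demo

# One column of independent samples per value of m, as in the vectorized answer.
samples = rng.binomial(55, 0.05, (ntrials, M + 1))
p_est = np.sum(samples == m, axis=0) / ntrials

# Exact values: 55Cm * 0.05^m * 0.95^(55-m)
p_exact = np.array([comb(55, r) * 0.05**r * 0.95**(55 - r) for r in m])

print(np.round(p_est, 4))
print(np.round(p_exact, 4))
```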

Tensordot for numpy array and scipy sparse matrix

For a current project I have to compute the inner product of a lot of vectors with the same matrix (which is quite sparse). The vectors are associated with a two dimensional grid so I store the vectors in a three dimensional array:
X is an array of dim (I,J,N). The matrix A is of dim (N,N). Now the task is to compute A.dot(X[i,j]) for each i,j in I,J.
For numpy arrays, this is quite easily accomplished with
Y = X.dot(A.T)
Now I'd like to store A as a sparse matrix, since it contains only a very limited number of nonzero entries and the dense product therefore performs a lot of unnecessary multiplications. Unfortunately, the above solution won't work, since the numpy dot doesn't work with sparse matrices. And to the best of my knowledge there is no tensordot-like operation for scipy sparse.
Does anybody know a nice and efficient way to compute the above array Y with a sparse matrix A?
The obvious approach is to run a loop over your vectors and use the sparse matrix's .dot method:
def naive_sps_x_dense_vecs(sps_mat, dense_vecs):
    rows, cols = sps_mat.shape
    I, J, _ = dense_vecs.shape
    out = np.empty((I, J, rows))
    for i in range(I):
        for j in range(J):
            out[i, j] = sps_mat.dot(dense_vecs[i, j])
    return out
But you may be able to speed things up a little by reshaping your 3d array to 2d and avoid the Python looping:
def sps_x_dense_vecs(sps_mat, dense_vecs):
    rows, cols = sps_mat.shape
    vecs_shape = dense_vecs.shape
    dense_vecs = dense_vecs.reshape(-1, cols)
    out = sps_mat.dot(dense_vecs.T).T
    return out.reshape(vecs_shape[:-1] + (rows,))
The problem is that the sparse matrix has to be the first argument, so that we can call its .dot method. This means the result comes back transposed, which in turn means that, after transposing, the final reshape triggers a copy of the whole array. So for fairly large values of I and J combined with not-so-large values of N, the latter method will be several times faster than the former, but the ranking may even be reversed for other combinations of the parameters:
n, i, j = 100, 500, 500
a = sps.rand(n, n, density=1/n, format='csc')
vecs = np.random.rand(i, j, n)
>>> np.allclose(naive_sps_x_dense_vecs(a, vecs), sps_x_dense_vecs(a, vecs))
n, i, j = 100, 500, 500
%timeit naive_sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 3.85 s per loop
%timeit sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 576 ms per loop
n, i, j = 1000, 200, 200
%timeit naive_sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 791 ms per loop
%timeit sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 1.3 s per loop
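A self-contained version of the reshape approach, checked against the dense result, might look like this. It is a sketch under assumptions: the sizes, density and seeds are arbitrary demo values, and the matrix is created with scipy.sparse.random rather than the sps.rand call used in the timings above.

```python
import numpy as np
import scipy.sparse as sps

def sps_x_dense_vecs(sps_mat, dense_vecs):
    """Apply sps_mat to every vector stored along the last axis of dense_vecs."""
    rows, cols = sps_mat.shape
    lead_shape = dense_vecs.shape[:-1]
    flat = dense_vecs.reshape(-1, cols)      # (I*J, N): one vector per row
    out = sps_mat.dot(flat.T).T              # sparse (rows, N) times dense (N, I*J), transposed back
    return out.reshape(lead_shape + (rows,))

rng = np.random.default_rng(0)
n, i, j = 20, 4, 5
A = sps.random(n, n, density=0.1, format="csr", random_state=0)
X = rng.random((i, j, n))

Y = sps_x_dense_vecs(A, X)
Y_dense = X.dot(A.toarray().T)               # dense reference
print(Y.shape, np.allclose(Y, Y_dense))
```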
You could use jax to achieve what you are looking for. Let's suppose your sparse matrix is in csr_array format. You would first transform it into a jax BCOO array
from scipy import sparse
from jax.experimental import sparse as jaxsparse
import jax.numpy as jnp
def convert_to_BCOO(x):
    x = x.transpose() #get the transpose
    x = x.tocoo()
    x = jaxsparse.BCOO((jnp.asarray(x.data), jnp.column_stack((x.row, x.col))),
                       shape=x.shape)
    x = x.sort_indices()
    return x
You could then use jax.sparsify to create a sparsified dot product as follows.
def dot(x, y):
    return jnp.dot(x, y)
sp_dot = jaxsparse.sparsify(dot)
A_transpose = convert_to_BCOO(A)
Y = sp_dot(X,A_transpose)
The function sp_dot now follows the exact same rules as np.dot.
Hope this helps!