Vectorizing ARD (Automatic Relevance Determination) kernel implementation in Gaussian processes - numpy

I am trying to implement an ARD kernel with NumPy as given in the GPML book (M3 from Equation 5.2).
I am struggling in vectorizing this equation for NxM kernel computation. I have tried the following non-vectorized version. Can someone help in vectorizing this in NumPy/PyTorch?
import numpy as np
N = 30 # Number of data points in X1
M = 40 # Number of data points in X2
D = 6 # Number of features (ARD dimensions)
X1 = np.random.rand(N, D)
X2 = np.random.rand(M, D)
Lambda = np.random.rand(D, 1)
L_inv = np.diag(np.random.rand(D))
sigma_f = np.random.rand()
K = np.empty((N, M))
for n in range(N):
for m in range(M):
M3 = Lambda#Lambda.T + L_inv**2
d = (X1[n,:] - X2[m,:]).reshape(-1,1)
K[n, m] = sigma_f**2 * np.exp(-0.5 * d.T#M3#d)

We can use the rules of broadcasting and the neat NumPy function einsum to vectorize array operations. In few words, broadcasting allows us to operate with arrays in one-liners by adding new dimensions to the resulting array, while einsum allows us to perform operations with multiple arrays by explicitly working in the index notation (instead of matrices).
Luckily, no loops are necessary to calculate your kernel. Please see below the vectorized solution, ARD_kernel function, which is about 30x faster in my machine than the original loopy version. Now, einsum is usually as fast as it gets, but it's possible that there are faster methods though, I've not checked anything else (e.g. usual # operator instead of einsum).
Also, there is a missing term in the code (the Kronecker delta), I don't know if it was omitted in purpose (let me know if you have problems implementing it and I'll edit the answer).
import numpy as np
N = 300 # Number of data points in X1
M = 400 # Number of data points in X2
D = 6 # Number of features (ARD dimensions)
np.random.seed(1) # Fix random seed for reproducibility
X1 = np.random.rand(N, D)
X2 = np.random.rand(M, D)
Lambda = np.random.rand(D, 1)
L_inv = np.diag(np.random.rand(D))
sigma_f = np.random.rand()
# Loopy function
def ARD_kernel_loops(X1, X2, Lambda, L_inv, sigma_f):
K = np.empty((N, M))
M3 = Lambda#Lambda.T + L_inv**2
for n in range(N):
for m in range(M):
d = (X1[n,:] - X2[m,:]).reshape(-1,1)
K[n, m] = np.exp(-0.5 * d.T#M3#d)
return K * sigma_f**2
# Vectorized function
def ARD_kernel(X1, X2, Lambda, L_inv, sigma_f):
M3 = Lambda.squeeze()*Lambda + L_inv**2 # Use broadcasting to avoid transpose
d = X1[:,None] - X2[None,...] # Use broadcasting to avoid loops
# order=F for memory layout (as your arrays are (N,M,D) instead of (D,N,M))
return sigma_f**2 * np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", d, M3, d, order = 'F'))

There is perhaps an additional optimisation. The examples of the M matrices given are all positive definite. This means that the Cholesky decomposition can be applied, wo that we can find upper triangular U so that
M = U'*U
The point of this is that if we apply U to the xs, so
y[p] = U*x[p] p=1..
Then
(x[p]-x[q])'*M*(x[p]-x[q]) = (y[p]-y[q])'*(y[p]-y[q])
Thus if there are N vectors x each of dimension d,
we convert the N squared O(d squared) operations on the LHS to N squared O(d) operations on the RHS
This has cost an extra choleski decompositon (O(d cubed))
and N O( d squared) applications of U to the xs.

Related

Create B such that B # X.ravel() == (A.T # X + X # A).ravel()

I'm not new to numpy, but the solution to this problem has been eluding me for a while now, so I thought that I would ask here.
Given the equation Y = A.T # X + X # A
where A and X are square matrices.
construct B such that B # X.ravel() == Y.ravel()
The problem is just one small part of a bigger code that I am trying to implement so I wrote a minimal reproducible example given below.
The challenge that I pose to you is to fill out the function "create_B" in a way such that the assertion does not raise an error.
The list of things that I have attempted myself is far too long to include in this question, and besides I don't feel that trying to include a list of my own failed attempts adds any value to the question.
import numpy as np
def create_B(A):
B = np.zeros((A.size, A.size))
return B
A = np.random.uniform(-1, 1, (100, 100))
X = np.random.uniform(-1, 1, (100, 100))
Y = A.T # X + X # A
B = create_B(A)
assert np.allclose(B # X.ravel(), Y.ravel())
I should probably mention that I can easily get this working using loops, but in reality A is very big and I need to do this many times so I need an optimized solution (ie. no python loops).
Basically I am looking for a solution of the form:
def create_B(A):
B = np.zeros((A.size, A.size))
B[indices1] = A
B[indices2] += A
return B
Where the challenge then is to create the index tuples indices1 and indices2.
An example of a solution using loops (not what i'm looking for):
def create_B(A):
B = np.zeros((A.size, A.size))
indices = np.arange(A.size).reshape(A.shape)
for i, k in enumerate(indices):
for j, k in enumerate(k):
B[k, indices[:, j]] = A[:, i]
B[k, indices[i, :]] += A[:, j]
return B
You can confirm for yourself that this is indeed a solution.
Then your equation is known as continuous and can be solved with scipy.linalg.solve_continous_lyapunov
Try this
Y = A.T # X + X # A
X_hat = solve_continuous_lyapunov(A.T, Y)
assert np.allclose(A.T # X_hat + X_hat # A, Y)
assert np.allclose(X_hat, X)
If you need B to be computed explicitly you will inevitably end up with a n2 x n2 matrix.
I fear you will not get nothing much better than you have (unless you build it as a sparse matrix)

Why does numpy and pytorch give different results after mean and variance normalization?

I am working on a problem in which a matrix has to be mean-var normalized row-wise. It is also required that the normalization is applied after splitting each row into tiny batches.
The code seem to work for Numpy, but fails with Pytorch (which is required for training).
It seems Pytorch and Numpy results differ. Any help will be greatly appreciated.
Example code:
import numpy as np
import torch
def normalize(x, bsize, eps=1e-6):
nc = x.shape[1]
if nc % bsize != 0:
raise Exception(f'Number of columns must be a multiple of bsize')
x = x.reshape(-1, bsize)
m = x.mean(1).reshape(-1, 1)
s = x.std(1).reshape(-1, 1)
n = (x - m) / (eps + s)
n = n.reshape(-1, nc)
return n
# numpy
a = np.float32(np.random.randn(8, 8))
n1 = normalize(a, 4)
# torch
b = torch.tensor(a)
n2 = normalize(b, 4)
n2 = n2.numpy()
print(abs(n1-n2).max())
In the first example you are calling normalize with a, a numpy.ndarray, while in the second you call normalize with b, a torch.Tensor.
According to the documentation page of torch.std, Bessel’s correction is used by default to measure the standard deviation. As such the default behavior between numpy.ndarray.std and torch.Tensor.std is different.
If unbiased is True, Bessel’s correction will be used. Otherwise, the sample deviation is calculated, without any correction.
torch.std(input, dim, unbiased, keepdim=False, *, out=None) → Tensor
Parameters
input (Tensor) – the input tensor.
unbiased (bool) – whether to use Bessel’s correction (δN = 1).
You can try yourself:
>>> a.std(), b.std(unbiased=True), b.std(unbiased=False)
(0.8364538, tensor(0.8942), tensor(0.8365))

Vectorize multivariate normal pdf python (PyTorch/NumPy)

I have N Gaussian distributions (multivariate) with N different means (covariance is the same for all of them) in D dimensions.
I also have N evaluation points, where I want to evaluate each of these (log) PDFs.
This means I need to get a NxN matrix, call it "kernels". That is, the (i,j)-th entry is the j-th Gaussian evaluated at the i-th point. A naive approach is:
from torch.distributions.multivariate_normal import MultivariateNormal
import numpy as np
# means contains all N means as rows and is thus N x D
# same for eval_points
# cov is not a problem , just a DxD matrix that is equal for all N Gaussians
kernels = np.empty((N,N))
for i in range(N):
for j in range(N):
kernels[i][j] = MultivariateNormal(means[j], cov).log_prob(eval_points[i])
Now one for loop we can get rid of easily, since for example if we wanted all the evaluations of the first Gaussian , we simply do:
MultivariateNormal(means[0], cov).log_prob(eval_points).squeeze()
and this gives us a N x 1 list of values, that is the first Gaussian evaluated at all N points.
My problem is that , in order to get the full N x N matrix , this doesn't work:
kernels = MultivariateNormal(means, cov).log_prob(eval_points).squeeze()
It doesn't figure out that it should evaluate each mean with all evaluation points in eval_points, and it doesn't return a NxN matrix with these which would be what I want. Therefore, I am not able to get rid of the second for loop, over all N Gaussians.
You are passing wrong shaped tensors to MultivariateNormal's constructor. You should pass a collection of mean vectors of shape (N, D) and a collection of precision matrix cov of shape (N, D, D) for N D-dimensional gaussian.
You are passing mu of shape (N, D) but your precision matrix is not well-shaped. You will need to repeat the precision matrix N number of times before passing it to the MultivariateNormal constructor. Here's one way to do it.
N = 10
D = 3
# means contains all N means as rows and is thus N x D
# same for eval_points
# cov is not a problem , just a DxD matrix that is equal for all N Gaussians
mu = torch.from_numpy(np.random.randn(N, D))
cov = torch.from_numpy(make_spd_matrix(D, D))
cov_n = cov[None, ...].repeat_interleave(N, 0)
assert cov_n.shape == (N, D, D)
kernels = MultivariateNormal(mu, cov_n)

Numpy/Scipy : solving several least squares with the same design matrix

I face a least square problem that i solve via scipy.linalg.lstsq(M,b), where :
M has shape (n,n)
b has shape (n,)
The issue is that i have to solve it a bunch of time for different b's. How can i do something more efficient ? I guess that lstsq does a lot of things independently of the value of b.
Ideas ?
In the case your linear system is well-determined, I'll store M LU decomposition and use it for all the b's individually or simply do one solve call for 2d-array B representing the horizontally stacked b's, it really depends on your problem here but this is globally the same idea. Let's suppose you've got each b one at a time, then:
import numpy as np
from scipy.linalg import lstsq, lu_factor, lu_solve, svd, pinv
# as you didn't specified any practical dimensions
n = 100
# number of b's
nb_b = 10
# generate random n-square matrix M
M = np.random.rand(n**2).reshape(n,n)
# Set of nb_b of right hand side vector b as columns
B = np.random.rand(n*nb_b).reshape(n,nb_b)
# compute pivoted LU decomposition of M
M_LU = lu_factor(M)
# then solve for each b
X_LU = np.asarray([lu_solve(M_LU,B[:,i]) for i in range(nb_b)])
but if it is under or over-determined, you need to use lstsq as you did:
X_lstsq = np.asarray([lstsq(M,B[:,i])[0] for i in range(nb_b)])
or simply store the pseudo-inverse M_pinv with pinv (built on lstsq) or pinv2 (built on SVD):
# compute the pseudo-inverse of M
M_pinv = pinv(M)
X_pinv = np.asarray([np.dot(M_pinv,B[:,i]) for i in range(nb_b)])
or you can also do the work by yourself, as in pinv2 for instance, just store the SVD of M, and solve this manually:
# compute svd of M
U,s,Vh = svd(M)
def solve_svd(U,s,Vh,b):
# U diag(s) Vh x = b <=> diag(s) Vh x = U.T b = c
c = np.dot(U.T,b)
# diag(s) Vh x = c <=> Vh x = diag(1/s) c = w (trivial inversion of a diagonal matrix)
w = np.dot(np.diag(1/s),c)
# Vh x = w <=> x = Vh.H w (where .H stands for hermitian = conjugate transpose)
x = np.dot(Vh.conj().T,w)
return x
X_svd = np.asarray([solve_svd(U,s,Vh,B[:,i]) for i in range(nb_b)])
which all give the same result if checked with np.allclose (unless the system is not well-determined resulting in the LU direct approach failure). Finally in terms of performances:
%timeit M_LU = lu_factor(M); X_LU = np.asarray([lu_solve(M_LU,B[:,i]) for i in range(nb_b)])
1000 loops, best of 3: 1.01 ms per loop
%timeit X_lstsq = np.asarray([lstsq(M,B[:,i])[0] for i in range(nb_b)])
10 loops, best of 3: 47.8 ms per loop
%timeit M_pinv = pinv(M); X_pinv = np.asarray([np.dot(M_pinv,B[:,i]) for i in range(nb_b)])
100 loops, best of 3: 8.64 ms per loop
%timeit U,s,Vh = svd(M); X_svd = np.asarray([solve_svd(U,s,Vh,B[:,i]) for i in range(nb_b)])
100 loops, best of 3: 5.68 ms per loop
Nevertheless, it's up to you to check these with appropriate dimensions.
Hope this helps.
Your question is unclear, but I am guessing you mean to compute the equation Mx=b through scipy.linalg.lstsq(M,b) for different arrays (b0, b1, b2..). If that is the case you could just parallelize the process with concurrent.futures.ProcessPoolExecutor. The documentation for this is fairly simple and can help python run multiple scipy solvers at once.
Hope this helps.
You can factorize M into either QR or SVD products and find the lsq solution manually.

Convolution along one axis only

I have two 2-D arrays with the same first axis dimensions. In python, I would like to convolve the two matrices along the second axis only. I would like to get C below without computing the convolution along the first axis as well.
import numpy as np
import scipy.signal as sg
M, N, P = 4, 10, 20
A = np.random.randn(M, N)
B = np.random.randn(M, P)
C = sg.convolve(A, B, 'full')[(2*M-1)/2]
Is there a fast way?
You can use np.apply_along_axis to apply np.convolve along the desired axis. Here is an example of applying a boxcar filter to a 2d array:
import numpy as np
a = np.arange(10)
a = np.vstack((a,a)).T
filt = np.ones(3)
np.apply_along_axis(lambda m: np.convolve(m, filt, mode='full'), axis=0, arr=a)
This is an easy way to generalize many functions that don't have an axis argument.
With ndimage.convolve1d, you can specify the axis...
np.apply_along_axis won't really help you, because you're trying to iterate over two arrays. Effectively, you'd have to use a loop, as described here.
Now, loops are fine if your arrays are small, but if N and P are large, then you probably want to use FFT to convolve instead.
However, you need to appropriately zero pad your arrays first, so that your "full" convolution has the expected shape:
M, N, P = 4, 10, 20
A = np.random.randn(M, N)
B = np.random.randn(M, P)
A_ = np.zeros((M, N+P-1), dtype=A.dtype)
A_[:, :N] = A
B_ = np.zeros((M, N+P-1), dtype=B.dtype)
B_[:, :P] = B
A_fft = np.fft.fft(A_, axis=1)
B_fft = np.fft.fft(B_, axis=1)
C_fft = A_fft * B_fft
C = np.real(np.fft.ifft(C_fft))
# Test
C_test = np.zeros((M, N+P-1))
for i in range(M):
C_test[i, :] = np.convolve(A[i, :], B[i, :], 'full')
assert np.allclose(C, C_test)
for 2D arrays, the function scipy.signal.convolve2d is faster and scipy.signal.fftconvolve can be even faster (depending on the dimensions of the arrays):
Here the same code with N = 100000
import time
import numpy as np
import scipy.signal as sg
M, N, P = 10, 100000, 20
A = np.random.randn(M, N)
B = np.random.randn(M, P)
T1 = time.time()
C = sg.convolve(A, B, 'full')
print(time.time()-T1)
T1 = time.time()
C_2d = sg.convolve2d(A, B, 'full')
print(time.time()-T1)
T1 = time.time()
C_fft = sg.fftconvolve(A, B, 'full')
print(time.time()-T1)
>>> 12.3
>>> 2.1
>>> 0.6
Answers are all the same with slight differences due to different computation methods used (e.g., fft vs direct multiplication, but i don't know what exaclty convolve2d uses):
print(np.max(np.abs(C - C_2d)))
>>>7.81597009336e-14
print(np.max(np.abs(C - C_fft)))
>>>1.84741111298e-13
Late answer, but worth posting for reference. Quoting from comments of the OP:
Each row in A is being filtered by the corresponding row in B. I could
implement it like that, just thought there might be a faster way.
A is on the order of 10s of gigabytes in size and I use overlap-add.
Naive / Straightforward Approach
import numpy as np
import scipy.signal as sg
M, N, P = 4, 10, 20
A = np.random.randn(M, N) # (4, 10)
B = np.random.randn(M, P) # (4, 20)
C = np.vstack([sg.convolve(a, b, 'full') for a, b in zip(A, B)])
>>> C.shape
(4, 29)
Each row in A is convolved with each respective row in B, essentially convolving M 1D arrays/vectors.
No Loop + CUDA Supported Version
It is possible to replicate this operation by using PyTorch's F.conv1d. We have to imagine A as a 4-channel, 1D signal of length 10. We wish to convolve each channel in A with a specific kernel of length 20. This is a special case called a depthwise convolution, often used in deep learning.
Note that torch's conv is implemented as cross-correlation, so we need to flip B in advance to do actual convolution.
import torch
import torch.nn.functional as F
#torch.no_grad()
def torch_conv(A, B):
M, N, P = A.shape[0], A.shape[1], B.shape[1]
C = F.conv1d(A, B[:, None, :], bias=None, stride=1, groups=M, padding=N+(P-1)//2)
return C.numpy()
# Convert A and B to torch tensors + flip B
X = torch.from_numpy(A) # (4, 10)
W = torch.from_numpy(np.fliplr(B).copy()) # (4, 20)
# Do grouped conv and get np array
Y = torch_conv(X, W)
>>> Y.shape
(4, 29)
>>> np.allclose(C, Y)
True
Advantages of using a depthwise convolution with torch:
No loops!
The above solution can also run on CUDA/GPU, which can really speed things up if A and B are very large matrices. (From OP's comment, this seems to be the case: A is 10GB in size.)
Disadvantages:
Overhead of converting from array to tensor (should be negligible)
Need to flip B once before the operation