numpy n-dimensional smart indexing over large tensors - memory efficiency - numpy

I'm working with large tensors, so numpy memory allocations for temporary tensors begin significantly influencing execution time + code sometimes raises memory allocation errors during those intermediate steps. Here're two approaches for indexing one tensor with int values of another tensor (like, result_ijk = a[i, b[i, j], k]) that I came up with, and even though second one seems more memory-efficient, I feel like creating this enormous index-matrix and iterating over all it's values (even in parallel) is kind of wired (and hits memory limits quite often):
def test():
i, j, k, l = 10, 20, 30, 40 # in reality, they're like 1e3..1e6
a = np.random.rand(i, j, k)
b = np.random.randint(0, j, size=i*l).reshape((i, l))
# c_ilk = c[i, b[i, l], k]; shape(c) = (10, 40, 30)
tmp = a[:, b, :] # <= i*ijk additional memory allocated (!) crazy
c1 = np.diagonal(tmp, axis1=0, axis2=1).transpose([2, 0, 1])
print(c1.shape)
# another approach:
ii, ll = np.indices((i, l)) # <= 2*i*l of temporary ints allocated
tmp2 = b[ii, ll] # i*l of ints allocated, slow ops
c2 = a[ii, tmp2] # slow ops over tensor
print(c2.shape)
print(np.allclose(c1, c2))
test()
- any suggestions on how one could optimize this type of n-dim smart indexing code?
If I'm going to use this piece of ~vectorized code in Theano, does it also going to allocate all those temporary buffers or it could somehow manage to build them "on-fly"? Is there any package that would perform such indexing in lazy\more efficient manner without allocation of these ii-like tensors?
(note: I need to take gradients over it in the end, so I can't use fancy jit-compilers like numba :( )

You only need to allocate an array of integers of length i to get your desired result:
i_idx = np.arange(i)
c = a[i_idx[:, None], b[i_idx, :], :]
# or you can use the terser c = a[i_idx[:, None], b[i_idx]]
Broadcasting takes care of duplicating values as needed on the fly, without having to allocate memory for them.
If you time this for large-ish arrays, you'll notice it is only marginally faster than your second approach: as noted by others, the intermediate indexing array is going to be several orders of magnitude smaller than your overall computation, so optimizing it has a small effect on the total runtime or memory footprint.

Some methods :
i,j,k,l=[100]*4
a = np.random.randint(0,5,(i, j, k))
b = np.random.randint(0, j,(i, l))
def test1():
# c_ilk = c[i, b[i, l], k]; shape(c) = (2,3,5)
tmp = a[:, b, :] # <= i*ijk additional memory allocated (!) crazy
c1 = np.diagonal(tmp, axis1=0, axis2=1).transpose([2, 0, 1])
return c1
def test2():
ii, ll = np.indices((i, l)) # <= 2*i*l of temporary ints allocated
tmp2 = b[ii, ll] # i*l of ints allocated, slow ops
c2 = a[ii, tmp2] # slow ops over tensor
#print(c2.shape)
return c2
def test3():
c3=np.empty((i,l,k),dtype=a.dtype)
for ii in range(i):
for ll in range(l):
c3[ii,ll]=a[ii,b[ii,ll]]
return c3
from numba import jit
test4=jit(test3)
And the corresponding benchmarks :
In [54]: %timeit test1()
1 loop, best of 3: 720 ms per loop
In [55]: %timeit test2()
100 loops, best of 3: 7.79 ms per loop
In [56]: %timeit test3()
10 loops, best of 3: 43.7 ms per loop
In [57]: %timeit test4()
100 loop, best of 3: 4.99 ms per loop
That seems to show (see #Eelco Hoogendoorn comment) that your second method is nearly optimal for big sizes, while the first is a bad choice.
For numba you can just use this part of the code, and apply gradient in a non "jited" function.

Related

Vectorizing ARD (Automatic Relevance Determination) kernel implementation in Gaussian processes

I am trying to implement an ARD kernel with NumPy as given in the GPML book (M3 from Equation 5.2).
I am struggling in vectorizing this equation for NxM kernel computation. I have tried the following non-vectorized version. Can someone help in vectorizing this in NumPy/PyTorch?
import numpy as np
N = 30 # Number of data points in X1
M = 40 # Number of data points in X2
D = 6 # Number of features (ARD dimensions)
X1 = np.random.rand(N, D)
X2 = np.random.rand(M, D)
Lambda = np.random.rand(D, 1)
L_inv = np.diag(np.random.rand(D))
sigma_f = np.random.rand()
K = np.empty((N, M))
for n in range(N):
for m in range(M):
M3 = Lambda#Lambda.T + L_inv**2
d = (X1[n,:] - X2[m,:]).reshape(-1,1)
K[n, m] = sigma_f**2 * np.exp(-0.5 * d.T#M3#d)
We can use the rules of broadcasting and the neat NumPy function einsum to vectorize array operations. In few words, broadcasting allows us to operate with arrays in one-liners by adding new dimensions to the resulting array, while einsum allows us to perform operations with multiple arrays by explicitly working in the index notation (instead of matrices).
Luckily, no loops are necessary to calculate your kernel. Please see below the vectorized solution, ARD_kernel function, which is about 30x faster in my machine than the original loopy version. Now, einsum is usually as fast as it gets, but it's possible that there are faster methods though, I've not checked anything else (e.g. usual # operator instead of einsum).
Also, there is a missing term in the code (the Kronecker delta), I don't know if it was omitted in purpose (let me know if you have problems implementing it and I'll edit the answer).
import numpy as np
N = 300 # Number of data points in X1
M = 400 # Number of data points in X2
D = 6 # Number of features (ARD dimensions)
np.random.seed(1) # Fix random seed for reproducibility
X1 = np.random.rand(N, D)
X2 = np.random.rand(M, D)
Lambda = np.random.rand(D, 1)
L_inv = np.diag(np.random.rand(D))
sigma_f = np.random.rand()
# Loopy function
def ARD_kernel_loops(X1, X2, Lambda, L_inv, sigma_f):
K = np.empty((N, M))
M3 = Lambda#Lambda.T + L_inv**2
for n in range(N):
for m in range(M):
d = (X1[n,:] - X2[m,:]).reshape(-1,1)
K[n, m] = np.exp(-0.5 * d.T#M3#d)
return K * sigma_f**2
# Vectorized function
def ARD_kernel(X1, X2, Lambda, L_inv, sigma_f):
M3 = Lambda.squeeze()*Lambda + L_inv**2 # Use broadcasting to avoid transpose
d = X1[:,None] - X2[None,...] # Use broadcasting to avoid loops
# order=F for memory layout (as your arrays are (N,M,D) instead of (D,N,M))
return sigma_f**2 * np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", d, M3, d, order = 'F'))
There is perhaps an additional optimisation. The examples of the M matrices given are all positive definite. This means that the Cholesky decomposition can be applied, wo that we can find upper triangular U so that
M = U'*U
The point of this is that if we apply U to the xs, so
y[p] = U*x[p] p=1..
Then
(x[p]-x[q])'*M*(x[p]-x[q]) = (y[p]-y[q])'*(y[p]-y[q])
Thus if there are N vectors x each of dimension d,
we convert the N squared O(d squared) operations on the LHS to N squared O(d) operations on the RHS
This has cost an extra choleski decompositon (O(d cubed))
and N O( d squared) applications of U to the xs.

scipy super sparse matrix multiplication is super slow

There are some posts on SO discussing sparse matrix multiplication performance, but they don't seem to answer my question here.
Here is the benchmark code,
# First, construct a space matrix
In [1]: from scipy.sparse import dok_matrix
In [2]: M = dok_matrix((100, (1<<32)-1), dtype=np.float32)
In [3]: rows, cols = M.shape
In [5]: js = np.random.randint(0, (1<<32)-1, size=100)
In [6]: for i in range(rows):
...: for j in js:
...: M[i,j] = 1.0
...:
# Check out sparsity
In [7]: M.shape
Out[7]: (100, 4294967295)
In [8]: M.count_nonzero()
Out[8]: 10000
# Test csr dot performance, 36.3 seconds
In [9]: csr = M.tocsr()
In [10]: %time csr.dot(csr.T)
CPU times: user 36.3 s, sys: 1min 1s, total: 1min 37s
Wall time: 1min 46s
Out[10]:
<100x100 sparse matrix of type '<class 'numpy.float32'>'
with 10000 stored elements in Compressed Sparse Row format>
The above csr.dot costs 36.3s, which is quite long IMHO.
In order to speed up, I coded up a naive for-loop dot function as follows,
def lil_matmul_transposeB(A, B):
rows_a, cols_a = A.shape
rows_b, cols_b = B.shape
assert cols_a == cols_b
C = np.zeros((rows_a, rows_b))
for ra in range(rows_a):
cols_a = A.rows[ra]
data_a = A.data[ra]
for i, ca in enumerate(cols_a):
xa = data_a[i]
for rb in range(rows_b):
cols_b = B.rows[rb]
data_b = B.data[rb]
pos = bs(cols_b, ca)
if pos!=-1:
C[ra,rb] += data_b[pos] * xa
return C
# Test dot performance in LiL format,
In [25]: lil = M.tolil()
In [26]: %time A = F.lil_matmul_transposeB(lil, lil)
CPU times: user 1.26 s, sys: 2.07 ms, total: 1.26 s
Wall time: 1.26 s
The above function only costs 1.26s, much faster than the built-in csr.dot.
So I wonder if I made some mistakes here to do the sparse matrix multiplication?
That very large 2nd dimension is giving problems, even though the sparsity is quite small.
In [12]: Mr = M.tocsr()
In [20]: Mr
Out[20]:
<100x4294967295 sparse matrix of type '<class 'numpy.float32'>'
with 10000 stored elements in Compressed Sparse Row format>
Transpose just turns the csr into csc, without changing the arrays. That indptr for both is just (101,).
In [21]: Mr.T
Out[21]:
<4294967295x100 sparse matrix of type '<class 'numpy.float32'>'
with 10000 stored elements in Compressed Sparse Column format>
But when I do Mr#Mr.T, I get an error when it tries to convert that Mr.T to `csr. That is, the multiplication requires the same format:
In [22]: Mr.T.tocsr()
Traceback (most recent call last):
File "<ipython-input-22-a376906f557e>", line 1, in <module>
Mr.T.tocsr()
File "/usr/local/lib/python3.8/dist-packages/scipy/sparse/csc.py", line 138, in tocsr
indptr = np.empty(M + 1, dtype=idx_dtype)
MemoryError: Unable to allocate 32.0 GiB for an array with shape (4294967296,) and data type int64
It's trying to make a matrix with a indptr that's (4294967296,) long. On my limited RAM machine that produces an error. On your's it must be hitting some sort of memory management/swap task that slowing it way down.
So it's the extreme dimension that's making this slow even though the nnz is small.

Loop through numpy array on indexes and apply function [duplicate]

I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).
What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I'm open to converting them to a different type.
I'm expecting my output to be an array with the shape N X M.
N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.
Here are some things to note:
The numpy function correlate requires input arrays to be one-dimensional.
The numpy function corrcoef accepts two-dimensional arrays, but they must have the same shape.
The scipy.stats function pearsonr requires input arrays to be one-dimensional.
Correlation (default 'valid' case) between two 2D arrays:
You can simply use matrix-multiplication np.dot like so -
out = np.dot(arr_one,arr_two.T)
Correlation with the default "valid" case between each pairwise row combinations (row1,row2) of the two input arrays would correspond to multiplication result at each (row1,row2) position.
Row-wise Correlation Coefficient calculation for two 2D arrays:
def corr2_coeff(A, B):
# Rowwise mean of input arrays & subtract from input arrays themeselves
A_mA = A - A.mean(1)[:, None]
B_mB = B - B.mean(1)[:, None]
# Sum of squares across rows
ssA = (A_mA**2).sum(1)
ssB = (B_mB**2).sum(1)
# Finally get corr coeff
return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None],ssB[None]))
This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB
Benchmarking
This section compares runtime performance with the proposed approach against generate_correlation_map & loopy pearsonr based approach listed in the other answer.(taken from the function test_generate_correlation_map() without the value correctness verification code at the end of it). Please note the timings for the proposed approach also include a check at the start to check for equal number of columns in the two input arrays, as also done in that other answer. The runtimes are listed next.
Case #1:
In [106]: A = np.random.rand(1000, 100)
In [107]: B = np.random.rand(1000, 100)
In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop
In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop
Case #2:
In [110]: A = np.random.rand(5000, 100)
In [111]: B = np.random.rand(5000, 100)
In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop
In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop
Case #3:
In [114]: A = np.random.rand(10000, 10)
In [115]: B = np.random.rand(10000, 10)
In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop
In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop
The other loopy pearsonr based approach seemed too slow, but here are the runtimes for one small datasize -
In [118]: A = np.random.rand(1000, 100)
In [119]: B = np.random.rand(1000, 100)
In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop
In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop
In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop
#Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.
In order to calculate the correlation coefficient, a bit more is required:
import numpy as np
def generate_correlation_map(x, y):
"""Correlate each n with each m.
Parameters
----------
x : np.array
Shape N X T.
y : np.array
Shape M X T.
Returns
-------
np.array
N X M array in which each element is a correlation coefficient.
"""
mu_x = x.mean(1)
mu_y = y.mean(1)
n = x.shape[1]
if n != y.shape[1]:
raise ValueError('x and y must ' +
'have the same number of timepoints.')
s_x = x.std(1, ddof=n - 1)
s_y = y.std(1, ddof=n - 1)
cov = np.dot(x,
y.T) - n * np.dot(mu_x[:, np.newaxis],
mu_y[np.newaxis, :])
return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])
Here's a test of this function, which passes:
from scipy.stats import pearsonr
def test_generate_correlation_map():
x = np.random.rand(10, 10)
y = np.random.rand(20, 10)
desired = np.empty((10, 20))
for n in range(x.shape[0]):
for m in range(y.shape[0]):
desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
actual = generate_correlation_map(x, y)
np.testing.assert_array_almost_equal(actual, desired)
For those interested in computing the Pearson correlation coefficient between a 1D and 2D array, I wrote the following function, where x is a 1D array and y a 2D array.
def pearsonr_2D(x, y):
"""computes pearson correlation coefficient
where x is a 1D and y a 2D array"""
upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:,None]), axis=1)
lower = np.sqrt(np.sum(np.power(x - np.mean(x), 2)) * np.sum(np.power(y - np.mean(y, axis=1)[:,None], 2), axis=1))
rho = upper / lower
return rho
Example run:
>>> x
Out[1]: array([1, 2, 3])
>>> y
Out[2]: array([[ 1, 2, 3],
[ 6, 7, 12],
[ 9, 3, 1]])
>>> pearsonr_2D(x, y)
Out[3]: array([ 1. , 0.93325653, -0.96076892])

Tensorflow gather columns of a matrix very slow

Given two matrices A (1000 x 100) and B (100 x 1000), instead of directly computing their product in tensorflow, i.e., tf.dot(A,B), I want to first select 10 cols (randomly) from A and 10 rows from B and then use the tf.dot(A_s,B_s)
Naturally, the second multiplication should be much faster as the number of required multiplications reduces by factor of 10.
However, in reality, it seems selecting given columns of matrix A in tensorflow to creat A_s is an extremly inefficient process.
Given indices of the required columns in idx, I tried the following solutions to creat A_s. The solutions are ranked according to their performance:
. A_s = tf.transpose(tf.gather(tf.unstack(A, axis=1), idx)):
tf.dot(A_s,B_s) 5 times slower than tf.dot(A,B) because creating A_s is too expensive.
2.
p_shape = K.shape(params)
p_flat = K.reshape(params, [-1])
i_flat = K.reshape(K.reshape(
K.arange(0, p_shape[0]) * p_shape[1], [-1, 1]) + indices, [-1])
indices = [i_flat]
v = K.transpose(indices)
updates = i_flat * 0 - 1
shape = tf.to_int32([p_shape[0] * p_shape[1]])
scatter = tf.scatter_nd(v, updates, shape) + 1
out_temp = tf.dynamic_partition(p_flat,
partitions=scatter, num_partitions=2)[0]
A_s = tf.reshape(out_temp, [p_shape[0], -1])
results in 6-7 times slower product
3.
X,Y = tf.meshgrid((tf.range(0, p_shape[0])),indices)
idx = K.concatenate([K.expand_dims(
K.reshape((X),[-1]),1),
K.expand_dims(K.reshape((Y),[-1]),1)],axis=1)
A_s = tf.reshape(tf.gather_nd(params, idx), [p_shape[0], -1])
10-12 times slower.
Any idea on how I can improve the efficiency of column selection process is very much appreciated.
PS1: I ran all the experiments on CPU.
PS2: Matrix A is a placeholder not a variable. In some implementation it can get problematic as its shape may not be inferred.

Tensordot for numpy array and scipy sparse matrix

For a current project I have to compute the inner product of a lot of vectors with the same matrix (which is quite sparse). The vectors are associated with a two dimensional grid so I store the vectors in a three dimensional array:
E.g:
X is an array of dim (I,J,N). The matrix A is of dim (N,N). Now the task is to compute A.dot(X[i,j]) for each i,j in I,J.
For numpy arrays, this is quite easily accomplished with
Y = X.dot(A.T)
Now I'd like to store A as sparse matrix since it is sparse and only contains a very limited number of nonzero entries which results in a lot of unnecessary multiplications. Unfortunately, the above solution won't work since the numpy dot doesn't work with sparse matrices. And to the best of my knowledge there is not tensordot-like operation for scipy sparse.
Does anybody know a nice and efficient way to compute the above array Y with a sparse matrix A?
The obvious approach is to run a loop over your vectors and use the sparse matrix's .dot method:
def naive_sps_x_dense_vecs(sps_mat, dense_vecs):
rows, cols = sps_mat.shape
I, J, _ = dense_vecs.shape
out = np.empty((I, J, rows))
for i in xrange(I):
for j in xrange(J):
out[i, j] = sps_mat.dot(dense_vecs[i, j])
return out
But you may be able to speed things up a little by reshaping your 3d array to 2d and avoid the Python looping:
def sps_x_dense_vecs(sps_mat, dense_vecs):
rows, cols = sps_mat.shape
vecs_shape = dense_vecs.shape
dense_vecs = dense_vecs.reshape(-1, cols)
out = sps_mat.dot(dense_vecs.T).T
return out.reshape(vecs.shape[:-1] + (rows,))
The problem is that we need to have the sparse matrix be the first argument, so that we can call its .dot method, which means that the return is transposed, which in turns means that after transposing, the last reshape is going to trigger a copy of the whole array. So for fairly large values of I and J, combined with not-so-large values of N, the latter method will be several times faster than the former, but performance may even be reversed for other combinations of the parameters:
n, i, j = 100, 500, 500
a = sps.rand(n, n, density=1/n, format='csc')
vecs = np.random.rand(i, j, n)
>>> np.allclose(naive_sps_x_dense_vecs(a, vecs), sps_x_dense_vecs(a, vecs))
True
n, i, j = 100, 500, 500
%timeit naive_sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 3.85 s per loop
%timeit sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 576 ms per
n, i, j = 1000, 200, 200
%timeit naive_sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 791 ms per loop
%timeit sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 1.3 s per loop
You could use jaxto achieve what you are looking for. Let's suppose your sparse matrix is in csr_arrayformat. You would first transform it into a jax BCOO array
from scipy import sparse
from jax.experimental import sparse as jaxsparse
import jax.numpy as jnp
def convert_to_BCOO(x):
x = x.transpose() #get the transpose
x = x.tocoo()
x = jaxsparse.BCOO((x.data, jnp.column_stack((x.row, x.col))),
shape=x.shape)
x = L.sort_indices()
You could then use jax.sparsify to create a sparsified dot product as follows.
def dot(x, y):
return jnp.dot(x, y)
sp_dot = jaxsparse.sparsify(dot)
A_transpose = convert_to_BCOO(A)
Y = sp_dot(X,A_transpose)
The function sp_dot now follows the exact same rules as numpy.dot.
Hope this helps!