Given two matrices A (1000 x 100) and B (100 x 1000), instead of directly computing their product in TensorFlow, i.e., tf.matmul(A, B), I want to first select 10 columns (at random) from A and the matching 10 rows from B, and then compute tf.matmul(A_s, B_s).
Naturally, the second multiplication should be much faster, as the number of required multiplications is reduced by a factor of 10.
However, in practice, selecting the given columns of matrix A in TensorFlow to create A_s seems to be an extremely inefficient process.
Given the indices of the required columns in idx, I tried the following solutions to create A_s. The solutions are ranked according to their performance:
1. A_s = tf.transpose(tf.gather(tf.unstack(A, axis=1), idx))
tf.matmul(A_s, B_s) is about 5 times slower than tf.matmul(A, B), because creating A_s is too expensive.
2.
import tensorflow as tf
from keras import backend as K

# here params is A and indices holds the required column indices (idx)
p_shape = K.shape(params)
p_flat = K.reshape(params, [-1])
# flat positions of the selected columns within every row
i_flat = K.reshape(
    K.reshape(K.arange(0, p_shape[0]) * p_shape[1], [-1, 1]) + indices, [-1])
v = K.reshape(i_flat, [-1, 1])
# mark selected positions with 0, everything else with 1
updates = i_flat * 0 - 1
shape = tf.to_int32([p_shape[0] * p_shape[1]])
scatter = tf.scatter_nd(v, updates, shape) + 1
# partition 0 contains exactly the selected entries
out_temp = tf.dynamic_partition(p_flat,
                                partitions=scatter, num_partitions=2)[0]
A_s = tf.reshape(out_temp, [p_shape[0], -1])
This results in a 6-7 times slower product.
3.
X, Y = tf.meshgrid(tf.range(0, p_shape[0]), indices)
idx = K.concatenate([K.expand_dims(K.reshape(X, [-1]), 1),
                     K.expand_dims(K.reshape(Y, [-1]), 1)], axis=1)
A_s = tf.reshape(tf.gather_nd(params, idx), [p_shape[0], -1])
This is 10-12 times slower.
Any idea on how I can improve the efficiency of the column-selection process would be very much appreciated.
PS1: I ran all the experiments on CPU.
PS2: Matrix A is a placeholder, not a variable, so in some implementations this can get problematic as its shape may not be inferred.
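For reference, a minimal sketch of a more direct selection, assuming a TensorFlow version (1.3+) in which tf.gather accepts an axis argument; the index values are hypothetical, and whether this beats the full tf.matmul(A, B) on CPU still depends on the cost of the gather:

import numpy as np
import tensorflow as tf

A = tf.placeholder(tf.float32, shape=(1000, 100))
B = tf.placeholder(tf.float32, shape=(100, 1000))
# 10 random column indices (hypothetical values)
idx = tf.constant(np.random.choice(100, 10, replace=False).astype(np.int32))
A_s = tf.gather(A, idx, axis=1)  # select 10 columns of A
B_s = tf.gather(B, idx, axis=0)  # select the matching 10 rows of B
C_s = tf.matmul(A_s, B_s)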
Assume I have data like this:
x = np.random.randn(4, 100000)
and I fit a histogram
hist = np.histogramdd(x.T, density=True)  # histogramdd expects an (n_samples, n_dims) array
What I want is to get the probability of a number g, e.g. g = 0.1. Assume some hypothetical function foo:
g = 0.1
prob = foo(hist, g)
print(prob)
>> 0.2223124214
How could I do something like this, where I get the probability back for a single number or a vector of numbers from a fitted histogram? Especially a histogram that is N-dimensional.
histogramdd takes O(r^D) memory, and unless you have a very large dataset or a very small dimension, you will get a poor estimate. Consider your example data: 100k points in 4-D space. The default histogram will be 10 x 10 x 10 x 10, so it has 10,000 bins.
x = np.random.randn(4, 100000)
hist = np.histogramdd(x.transpose(), density=True)
np.mean(hist[0] == 0)
gives something around 0.77, meaning that 77% of the bins in the histogram contain no points.
You probably want to smooth the distribution. Unless you have a good reason not to, I would suggest using a Gaussian kernel density estimate:
import numpy as np
import scipy.stats

x = np.random.randn(4, 100000)   # d x n array
f = scipy.stats.gaussian_kde(x)  # d-dimensional PDF
f([1, 2, 3, 4])                  # evaluate the PDF at a given point
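For a batch of query points, gaussian_kde expects a (d, n) array and returns one density value per column; a small usage sketch:

pts = np.random.randn(4, 10)  # 10 query points in 4-D
densities = f(pts)            # array of shape (10,), one PDF value per point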
I want to split a long vector into smaller pieces of unequal size, sum each piece, and gather the results into a new vector.
I need to do this in pytorch but I am also interested to see how this is done with numpy.
This can easily be accomplished by splitting the vector:
sizes = [3, 7, 5, 9]
X = torch.ones(sum(sizes))
Y = torch.tensor([s.sum() for s in torch.split(X, sizes)])
or with np.ones and np.split.
Is there a more efficient way to do this?
Edit:
Inspired by the first comment:
indices = np.cumsum([0]+sizes)[:-1]
Y = np.add.reduceat(X, indices.tolist())
solves it for numpy. I am still looking for a solution with pytorch.
index_add_ is your friend!
# inputs
sizes = torch.tensor([3, 7, 5, 9], dtype=torch.long)
x = torch.ones(sizes.sum())
# prepare an index vector for summation (what elements of x are summed to each element of y)
ind = torch.zeros(sizes.sum(), dtype=torch.long)
ind[torch.cumsum(sizes, dim=0)[:-1]] = 1
ind = torch.cumsum(ind, dim=0)
# prepare the output
y = torch.zeros(len(sizes))
# do the actual summation
y.index_add_(0, ind, x)
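A quick sanity check of the result against the straightforward split-and-sum version from the question (using torch.stack to collect the per-piece sums):

y_ref = torch.stack([s.sum() for s in torch.split(x, sizes.tolist())])
print(torch.allclose(y, y_ref))  # True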
I'm working with large tensors, so numpy memory allocations for temporary tensors begin to significantly influence execution time, and the code sometimes raises memory-allocation errors during those intermediate steps. Here are two approaches for indexing one tensor with the integer values of another tensor (like result_ilk = a[i, b[i, l], k]) that I came up with. Even though the second one seems more memory-efficient, creating this enormous index matrix and iterating over all its values (even in parallel) feels weird, and it hits memory limits quite often:
import numpy as np

def test():
    i, j, k, l = 10, 20, 30, 40  # in reality, they're like 1e3..1e6
    a = np.random.rand(i, j, k)
    b = np.random.randint(0, j, size=i*l).reshape((i, l))
    # c_ilk = a[i, b[i, l], k]; shape(c) = (10, 40, 30)
    tmp = a[:, b, :]  # <= i*ijk additional memory allocated (!) crazy
    c1 = np.diagonal(tmp, axis1=0, axis2=1).transpose([2, 0, 1])
    print(c1.shape)
    # another approach:
    ii, ll = np.indices((i, l))  # <= 2*i*l temporary ints allocated
    tmp2 = b[ii, ll]  # i*l ints allocated, slow ops
    c2 = a[ii, tmp2]  # slow ops over tensor
    print(c2.shape)
    print(np.allclose(c1, c2))

test()
Any suggestions on how one could optimize this type of n-dimensional smart-indexing code?
If I use this piece of ~vectorized code in Theano, is it also going to allocate all those temporary buffers, or can it somehow manage to build them "on the fly"? Is there any package that would perform such indexing in a lazy/more efficient manner, without allocating these ii-like tensors?
(note: I need to take gradients over it in the end, so I can't use fancy jit-compilers like numba :( )
You only need to allocate an array of integers of length i to get your desired result:
i_idx = np.arange(i)
c = a[i_idx[:, None], b[i_idx, :], :]
# or you can use the terser c = a[i_idx[:, None], b[i_idx]]
Broadcasting takes care of duplicating values as needed on the fly, without having to allocate memory for them.
If you time this for large-ish arrays, you'll notice it is only marginally faster than your second approach: as noted by others, the intermediate indexing array is going to be several orders of magnitude smaller than your overall computation, so optimizing it has a small effect on the total runtime or memory footprint.
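A minimal check, with small hypothetical sizes, that the broadcast-indexed result matches the diagonal-based version from the question:

import numpy as np

i, j, k, l = 10, 20, 30, 40
a = np.random.rand(i, j, k)
b = np.random.randint(0, j, size=(i, l))
i_idx = np.arange(i)
c = a[i_idx[:, None], b[i_idx, :], :]
c1 = np.diagonal(a[:, b, :], axis1=0, axis2=1).transpose([2, 0, 1])
print(c.shape, np.allclose(c, c1))  # (10, 40, 30) True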
Some methods:
import numpy as np
from numba import jit

i, j, k, l = [100]*4
a = np.random.randint(0, 5, (i, j, k))
b = np.random.randint(0, j, (i, l))

def test1():
    # c_ilk = a[i, b[i, l], k]
    tmp = a[:, b, :]  # <= i*ijk additional memory allocated (!) crazy
    c1 = np.diagonal(tmp, axis1=0, axis2=1).transpose([2, 0, 1])
    return c1

def test2():
    ii, ll = np.indices((i, l))  # <= 2*i*l temporary ints allocated
    tmp2 = b[ii, ll]  # i*l ints allocated, slow ops
    c2 = a[ii, tmp2]  # slow ops over tensor
    return c2

def test3():
    c3 = np.empty((i, l, k), dtype=a.dtype)
    for ii in range(i):
        for ll in range(l):
            c3[ii, ll] = a[ii, b[ii, ll]]
    return c3

test4 = jit(test3)
And the corresponding benchmarks :
In [54]: %timeit test1()
1 loop, best of 3: 720 ms per loop
In [55]: %timeit test2()
100 loops, best of 3: 7.79 ms per loop
In [56]: %timeit test3()
10 loops, best of 3: 43.7 ms per loop
In [57]: %timeit test4()
100 loops, best of 3: 4.99 ms per loop
That seems to show (see @Eelco Hoogendoorn's comment) that your second method is nearly optimal for big sizes, while the first is a bad choice.
For numba, you can just use this part of the code and apply the gradient in a non-jitted function.
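For completeness, the broadcasting approach from the answer above can be dropped into the same benchmark (a sketch; exact timings will vary):

def test5():
    # index array of length i only, broadcasting does the rest
    i_idx = np.arange(i)
    return a[i_idx[:, None], b[i_idx, :], :]

print(np.allclose(test2(), test5()))  # True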
How can I vectorize this loop in NumPy? It uses sampling from NumPy's binomial() function to estimate the probability that, out of 55 events, exactly m of a particular type occur, where the probability of one occurring is 5%; i.e., it estimates 55Cm * (0.05)^m * (0.95)^(55-m), where 55Cm = 55!/(m! * (55-m)!).
import numpy as np
M = 7
m = np.arange(M+1)
ntrials = 1000000
p = np.empty(M+1)
for r in m:
    p[r] = np.sum(np.random.binomial(55, 0.05, ntrials) == r)/ntrials
Here is the equivalent code:
p = np.zeros(M+1)
print p
I imagine you didn't intend for your output to always be all zero, but it is! So the first thing to do is add a dtype=float argument to your np.sum() call. With that out of the way, we can vectorize the whole thing like this:
samples = np.random.binomial(55, 0.05, (ntrials, M+1))
p = np.sum(samples == m, dtype=float, axis=0) / ntrials
This produces an equivalent, though not identical, result. The reason is that the random number generation is done in a different sequence, so you will get an answer which is "correct" but not identical to the old code. If you want the identical result to before, you can get that by changing the first line to this:
samples = np.random.binomial(55, 0.05, (M+1, ntrials)).T
Then you draw in the same order as before, with no real performance penalty.
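For reference, the exact values that the simulation estimates are just binomial pmf evaluations, which scipy can compute directly:

from scipy.stats import binom
p_exact = binom.pmf(m, 55, 0.05)  # 55Cm * 0.05^m * 0.95^(55-m) for each m in 0..M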
For a current project I have to compute the inner product of a lot of vectors with the same matrix (which is quite sparse). The vectors are associated with a two dimensional grid so I store the vectors in a three dimensional array:
E.g.:
X is an array of dim (I, J, N). The matrix A is of dim (N, N). Now the task is to compute A.dot(X[i, j]) for each i, j in I, J.
For numpy arrays, this is quite easily accomplished with
Y = X.dot(A.T)
Now I'd like to store A as a sparse matrix, since it only contains a very limited number of nonzero entries, which otherwise results in a lot of unnecessary multiplications. Unfortunately, the above solution won't work, since numpy's dot doesn't work with sparse matrices. And to the best of my knowledge there is no tensordot-like operation for scipy sparse.
Does anybody know a nice and efficient way to compute the above array Y with a sparse matrix A?
The obvious approach is to run a loop over your vectors and use the sparse matrix's .dot method:
import numpy as np
import scipy.sparse as sps

def naive_sps_x_dense_vecs(sps_mat, dense_vecs):
    rows, cols = sps_mat.shape
    I, J, _ = dense_vecs.shape
    out = np.empty((I, J, rows))
    for i in xrange(I):
        for j in xrange(J):
            out[i, j] = sps_mat.dot(dense_vecs[i, j])
    return out
But you may be able to speed things up a little by reshaping your 3d array to 2d and avoid the Python looping:
def sps_x_dense_vecs(sps_mat, dense_vecs):
    rows, cols = sps_mat.shape
    vecs_shape = dense_vecs.shape
    dense_vecs = dense_vecs.reshape(-1, cols)
    out = sps_mat.dot(dense_vecs.T).T
    return out.reshape(vecs_shape[:-1] + (rows,))
The problem is that we need the sparse matrix to be the first argument so that we can call its .dot method, which means that the result comes out transposed; transposing it back makes the final reshape trigger a copy of the whole array. So for fairly large values of I and J combined with not-so-large values of N, the latter method will be several times faster than the former, but performance may be reversed for other combinations of the parameters:
n, i, j = 100, 500, 500
a = sps.rand(n, n, density=1/n, format='csc')
vecs = np.random.rand(i, j, n)
>>> np.allclose(naive_sps_x_dense_vecs(a, vecs), sps_x_dense_vecs(a, vecs))
True
n, i, j = 100, 500, 500
%timeit naive_sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 3.85 s per loop
%timeit sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 576 ms per loop
n, i, j = 1000, 200, 200
%timeit naive_sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 791 ms per loop
%timeit sps_x_dense_vecs(a, vecs)
1 loops, best of 3: 1.3 s per loop
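Given that the crossover depends on the parameters, one pragmatic option is a small dispatcher that picks a method based on the sizes; a sketch with a purely hypothetical threshold that you would tune against benchmarks like the ones above:

def sps_x_dense_vecs_auto(sps_mat, dense_vecs):
    # use the reshape trick when the batch is large relative to the matrix,
    # otherwise fall back to the naive loop
    I, J, _ = dense_vecs.shape
    n = sps_mat.shape[0]
    if I * J > 10 * n:  # hypothetical threshold
        return sps_x_dense_vecs(sps_mat, dense_vecs)
    return naive_sps_x_dense_vecs(sps_mat, dense_vecs)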
You could use jax to achieve what you are looking for. Let's suppose your sparse matrix is in csr_array format. You would first transform it into a jax BCOO array:
from scipy import sparse
from jax.experimental import sparse as jaxsparse
import jax.numpy as jnp
def convert_to_BCOO(x):
    x = x.transpose()  # get the transpose
    x = x.tocoo()
    x = jaxsparse.BCOO((x.data, jnp.column_stack((x.row, x.col))),
                       shape=x.shape)
    x = x.sort_indices()
    return x
You could then use jaxsparse.sparsify to create a sparsified dot product as follows:
def dot(x, y):
    return jnp.dot(x, y)

sp_dot = jaxsparse.sparsify(dot)

A_transpose = convert_to_BCOO(A)
Y = sp_dot(X, A_transpose)
The function sp_dot now follows the exact same rules as numpy.dot.
Hope this helps!