Dynamic axis indexing of NumPy ndarray

I want to obtain the 2D slice in a given direction of a 3D array, where the direction (i.e., the axis from which the slice is extracted) is given by another variable.
Assuming idx is the index of the 2D slice and direction is the axis from which to take it, the initial approach would be:
if direction == 0:
    return A[idx, :, :]
elif direction == 1:
    return A[:, idx, :]
else:
    return A[:, :, idx]
I'm pretty sure there must be a way of doing this without doing conditionals, or at least, not in raw python. Does numpy have a shortcut for this?
The best solution I've found so far (for doing it dynamically) relies on the transpose operation:
# for 3 dimensions [0, 1, 2] and direction == 1 --> [1, 0, 2]
tr = [direction] + list(range(A.ndim))  # list() needed in Python 3
del tr[direction + 1]
return np.transpose(A, tr)[idx]
But I wonder if there is a better/easier/faster function for this, since for 3D the transpose code looks almost worse than the three if/elif branches. It generalizes better to N dimensions, and the larger the N the better it looks in comparison, but for 3D it is much the same.

Transpose is cheap (timewise). There are numpy functions that use it to move the operational axis (or axes) to a known location - usually the front or end of the shape list. tensordot is one that comes to mind.
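A quick way to see why it's cheap (a minimal check of my own; the array is just for illustration): transpose only rearranges shape and strides, returning a view that shares memory with the original.
import numpy as np

A = np.arange(24).reshape(2, 3, 4)
At = A.transpose(1, 0, 2)
assert np.shares_memory(A, At)  # transpose returns a view; no data is copied
assert At.shape == (3, 2, 4)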
Other functions construct an indexing tuple. They may start with a list or array for ease of manipulation, and then turn it into a tuple for application. For example
I = [slice(None)] * A.ndim  # one ':' per dimension
I[axis] = idx               # replace the ':' on the target axis
A[tuple(I)]
np.apply_along_axis does something like that. It's instructive to look at the code for functions like this.
I imagine the writers of the numpy functions worried most about whether it works robustly, secondly about speed, and lastly about whether it looks pretty. You can bury all kinds of ugly code in a function!
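For instance, a minimal sketch (my own) that buries the indexing-tuple trick in a helper:
import numpy as np

def dynamic_slice(A, idx, axis):
    """Return the slice of A at position idx along the given axis."""
    I = [slice(None)] * A.ndim  # one ':' per dimension
    I[axis] = idx               # replace ':' with idx on the target axis
    return A[tuple(I)]

A = np.arange(24).reshape(2, 3, 4)
assert np.array_equal(dynamic_slice(A, 1, 1), A[:, 1, :])
Note that np.take(A, idx, axis=axis) produces the same values, though it returns a copy rather than a view.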
tensordot ends with
at = a.transpose(newaxes_a).reshape(newshape_a)
bt = b.transpose(newaxes_b).reshape(newshape_b)
res = dot(at, bt)
return res.reshape(olda + oldb)
where the preceding code calculated newaxes_a/newaxes_b and newshape_a/newshape_b.
apply_along_axis constructs a (0...,:,0...) index tuple
i = zeros(nd, 'O')
i[axis] = slice(None, None)
i.put(indlist, ind)
...
arr[tuple(i.tolist())]

To index a dimension dynamically, you can use swapaxes, as shown below:
a = np.arange(7 * 8 * 9).reshape((7, 8, 9))
axis = 1
idx = 2
np.swapaxes(a, 0, axis)[idx]
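As a quick sanity check with the arrays above (axis == 1), the swapped view matches static indexing:
assert np.array_equal(np.swapaxes(a, 0, axis)[idx], a[:, idx, :])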
Runtime comparison
Natural method (non-dynamic):
%timeit a[:, idx, :]
300 ns ± 1.58 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
swapaxes:
%timeit np.swapaxes(a, 0, axis)[idx]
752 ns ± 4.54 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Index with a list comprehension (note that recent NumPy requires the final index to be a tuple, not a list):
%timeit a[tuple([idx if i==axis else slice(None) for i in range(a.ndim)])]

This is Python. You could simply use eval() like this:
def get_by_axis(a, idx, axis):
    indexing_list = a.ndim * [':']
    indexing_list[axis] = str(idx)
    expression = f"a[{', '.join(indexing_list)}]"
    return eval(expression)
Obviously, only do this when you do not accept input from untrusted users.
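A hypothetical usage example (the array and arguments here are my own, for illustration):
import numpy as np

a = np.arange(24).reshape(2, 3, 4)
get_by_axis(a, idx=1, axis=2)  # builds and evaluates the string "a[:, :, 1]"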

Related

pandas vectorize function using two dataframes

I have the following operation:
import pandas as pd
import numpy as np
def some_calc(x, y):
    x = x.set_index('Cat')
    y = y.set_index('Cat')
    y = np.sqrt(y['data_point2'])
    vec = pd.DataFrame(x['data_point1'] * y)
    grid = np.random.rand(len(x), len(x))
    result = vec.dot(vec.T).mul(grid).sum().sum()
    return result

sample_size = 100
cats = ['a','b','c','d']
df1 = pd.DataFrame({'Cat': [cats[np.random.randint(4)] for _ in range(sample_size)],
                    'data_point1': np.random.rand(sample_size),
                    'data_point2': np.random.rand(sample_size)})
df2 = df1.groupby('Cat').sum().reset_index()
I would like to run some_calc across each of the df2 rows using the corresponding data points from df1.
The code below works well:
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]), axis=1)
(I reset the index in df2 because I don't know how to apply across indices.
Also, I'm passing Cat (as the index field) along with each data_point vector to some_calc because without an index, v.dot(v.T) collapses the dot product into a single number. That then errors in .mul(), because I need the full MxM matrix rather than a float. I might be doing something wrong here, though...)
I'm currently exploring how I can vectorize the above so that when sample_size grows I will not be hampered by a slow down in the calculation.
I saw in previous threads that you can toggle raw=True so that the input is treated as an np.array rather than a pd.Series:
df2['ApplyRaw'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                                y=df1[df1['Cat']==x['Cat']]['Cat','data_point2']), axis=1, raw=True)
However, it throws an error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I tried omitting Cat from the argument but still the same issue.
Are there any code improvements or tricks I can employ that allow me to vectorize the above?
Or do I have to amend some_calc?
I'm not sure if it's possible to vectorize your function since it's a bit complex. However, some_calc itself and how it is called can be optimized.
What
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]), axis=1)
does is basically the same as a groupby. So instead of creating these groups and applying the function on them, use groupby + apply. Simplifying the some_calc function as well, we get:
def some_calc(df):
    x = df['data_point1'].values
    y = np.sqrt(df['data_point2'].values)
    vec = (x * y).reshape(-1, 1)
    grid = np.random.rand(len(x), len(x))
    result = (vec @ vec.T * grid).sum().sum()
    return result
apply = df1.groupby('Cat').apply(some_calc)
apply.name = 'Apply'
df2.merge(apply, left_on='Cat', right_index=True)
The final merge is just to add the results to the df2 dataframe.
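(A simpler alternative sketch, on the same variables: since the apply Series is indexed by Cat, you can map the results onto df2 directly.)
df2['Apply'] = df2['Cat'].map(apply)  # same result as the merge above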
Timings:
# original
20.5 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# above code
3.62 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Calculating distances between vectors of two numpy arrays

I have two numpy arrays: R with dimensions S x F, and W with dimensions N x M x F. To get concrete, let's assign the following values: N = 5, M = 7, F = 3, S = 4.
The array R contains a collection of S = 4 samples with F = 3 features each. Each row represents a sample and each column a feature. Therefore R[0] is the first sample, R[1] the second, and so on. Each entry R[i] contains F elements, for example R[0] = np.array([1, 4, -2]).
Here is a small snippet to initialize all those values, with an MWE in mind:
import numpy as np
# Size of Map (rows, columns)
N, M = 5, 7
# Number of features
F = 3
# Sample size
S = 4
np.random.seed(13)
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
We can also view a given "depth line" of the array W as a vector with the same dimension as each row of R (this can easily be seen from the size of the last dimension of both arrays). With that, I can access W[2, 3] and obtain np.array([2, 2, -1]) (the values here are just examples).
I created a simple function to calculate the distance from a given vector r to each "depth line" of the matrix W, and then return the position of the depth line of W nearest to r:
def nearest_vector_matrix_naive(r, W):
    delta = np.zeros((N, M), dtype=int)
    for i in range(N):
        for j in range(M):
            norm = 0
            for k in range(F):
                norm += (r[k] - W[i,j,k])**2
            delta[i,j] = norm
            norm = 0
    win_idx = np.unravel_index(np.argmin(delta, axis=None), delta.shape)
    return win_idx
Of course this is a very naive approach, which I could further optimize into the code below, obtaining a HUGE performance boost.
def nearest_vector_matrix(r, W):
    delta = np.sum((W[:,:] - r)**2, axis=2)
    return np.unravel_index(np.argmin(delta, axis=None), delta.shape)
I can use this function simply as:
nearest_idx = nearest_vector_matrix(R[0], W)
# Returns the nearest vector in W to R[0]
W[nearest_idx]
Since I have the array R with a bunch of samples, I use the following snippet to calculate the nearest vectors to an array of samples:
def nearest_samples_matrix(R, W):
    DELTA = np.zeros((R.shape[0], 2))
    for idx, r in enumerate(R):
        delta = np.sum((W[:,:] - r)**2, axis=2)
        DELTA[idx] = np.unravel_index(np.argmin(delta, axis=None), delta.shape)
    return DELTA
This function returns an array with S rows (S being the number of samples) of 2D indexes; that is, DELTA always has shape (S, 2).
I would like to know how I can substitute the for loop inside nearest_samples_matrix (for example, with broadcasting) to enhance the execution performance even further.
I could not figure out how to do it (even though I was able to do it in the first case).
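(For reference, a minimal fully-broadcast sketch — my own addition, not part of the original question or answer. It removes the loop at the cost of materializing an (S, N, M, F) temporary, so it only pays off while that fits in memory:)
def nearest_samples_matrix_bcast(R, W):
    # (S, 1, 1, F) - (N, M, F) broadcasts to (S, N, M, F)
    delta = ((R[:, None, None, :] - W)**2).sum(axis=3)        # (S, N, M)
    flat_idx = delta.reshape(R.shape[0], -1).argmin(axis=1)   # (S,)
    ind_1, ind_2 = np.unravel_index(flat_idx, (W.shape[0], W.shape[1]))
    return np.vstack((ind_1, ind_2)).T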
The best solution depends on the input size of the arrays.
For low-dimensional problems (dim < 20 or so), a kd-tree approach is usually the way to go. There are quite a lot of answers on this topic, e.g. one I wrote a few weeks ago.
If the dimensionality of the problem is too high, you can switch to brute-force algorithms. Both of the following algorithms are much faster than your optimized approach, but on larger input sizes and low-dimensional problems they are much slower than a kd-tree approach (O(log n) instead of O(n^2)).
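For illustration, a minimal sketch of the kd-tree variant using scipy.spatial.cKDTree (the function name is mine; it treats every depth line of W as an F-dimensional point):
import numpy as np
from scipy.spatial import cKDTree

def nearest_samples_matrix_kdtree(R, W):
    tree = cKDTree(W.reshape(-1, W.shape[2]))      # index all N*M depth lines
    _, flat_idx = tree.query(R)                    # nearest neighbour per sample
    ind_1, ind_2 = np.unravel_index(flat_idx, (W.shape[0], W.shape[1]))
    return np.vstack((ind_1, ind_2)).T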
Brute force 1
The following example uses an algorithm described here. It is very fast on high-dimensional problems because most of the calculation is done in a highly optimized matrix-matrix multiplication.
The disadvantages are high memory usage (all distances are calculated in one function call) and precision problems, due to the more error-prone calculation method.
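The trick behind this method is the expansion ||r - w||^2 = ||r||^2 - 2 r·w + ||w||^2, which turns the cross term into one big matrix product; the subtraction of nearly equal terms is also where the precision loss comes from. A sketch of that expansion (W2 is my own name for the flattened W):
W2 = W.reshape(-1, W.shape[2]).astype(np.float64)
Rf = R.astype(np.float64)
# squared distances via one matmul; may dip slightly below zero due to rounding
dist_sq = (Rf**2).sum(1)[:, None] - 2 * Rf @ W2.T + (W2**2).sum(1)[None, :]
sklearn's euclidean_distances uses this same formulation internally: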
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def nearest_samples_matrix_2(R, W):
    R_Temp = R
    W_Temp = W.reshape(-1, W.shape[2])
    dist = euclidean_distances(R_Temp, W_Temp)
    ind_1, ind_2 = np.unravel_index(np.argmin(dist, axis=1), shape=(W.shape[0], W.shape[1]))
    return np.vstack((ind_1, ind_2)).T
Brute force 2
This is quite similar to your naive approach, but uses a JIT compiler (Numba) to get good performance. Temporary arrays are not necessary and the precision should be good (as long as no overflow occurs). There is room for further optimization (loop tiling) on larger input sizes.
import numpy as np
import numba as nb

# parallelization is only beneficial on larger input data
@nb.njit(fastmath=True, parallel=True, cache=True)
def nearest_samples_matrix_3(r, W):
    out = np.empty((r.shape[0], 2), dtype=np.int64)
    for x in nb.prange(r.shape[0]):
        # initialize the running minimum with the distance to W[0, 0]
        ind_i = 0
        ind_j = 0
        delta = 0
        for k in range(W.shape[2]):
            delta += (r[x, k] - W[0, 0, k])**2
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                norm = 0
                for k in range(W.shape[2]):
                    norm += (r[x, k] - W[i, j, k])**2
                if norm < delta:
                    delta = norm
                    ind_i = i
                    ind_j = j
        out[x, 0] = ind_i
        out[x, 1] = ind_j
    return out
Timings
#small Arrays
N, M = 100, 200
F = 30
S = 50
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
#your function
%timeit nearest_samples_matrix(R,W)
#268 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nearest_samples_matrix_2(R,W)
#5.62 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nearest_samples_matrix_3(R,W)
#3.68 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#larger arrays
N, M = 1_000, 2_000
F = 50
S = 100
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
#%timeit nearest_samples_matrix(R,W)
#too slow
%timeit nearest_samples_matrix_2(R,W)
#2.76 s ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nearest_samples_matrix_3(R,W)
#1.42 s ± 402 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numpy elementwise outer product

I want to do the element-wise outer product of two 2d arrays in numpy.
A.shape = (100, 3) # A numpy ndarray
B.shape = (100, 5) # A numpy ndarray
C = element_wise_outer_product(A, B) # A function that does the trick
C.shape = (100, 3, 5) # This should be the result
C[i] = np.outer(A[i], B[i]) # This should be the result
A naive implementation could be the following:
tmp = []
for i in range(len(A)):
    outer_product = np.outer(A[i], B[i])
    tmp.append(outer_product)
C = np.array(tmp)
A better solution inspired by stack overflow:
big_outer = np.multiply.outer(A, B)       # shape (100, 3, 100, 5)
tmp = np.swapaxes(big_outer, 1, 2)        # shape (100, 100, 3, 5)
C_tmp = [tmp[i][i] for i in range(len(A))]
C = np.array(C_tmp)
I'm looking for a vectorized implementation that gets rid of the for loop.
Does anyone have an idea?
Thank you!
Extend A and B to 3D keeping their first axis aligned and introducing new axes along the third and second ones respectively with None/np.newaxis and then multiply with each other. This would allow broadcasting to come into play for a vectorized solution.
Thus, an implementation would be -
A[:,:,None]*B[:,None,:]
We could shorten it a bit by using an ellipsis for A's leading axes and omitting the trailing axis on B, like so -
A[...,None]*B[:,None]
As another vectorized approach we could also use np.einsum, which might be more intuitive once we get past the string notation syntax and consider those notations being representatives of the iterators involved in a naive loopy implementation, like so -
np.einsum('ij,ik->ijk',A,B)
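A quick sanity check of both vectorized forms against np.outer (shapes taken from the question):
import numpy as np

A = np.random.rand(100, 3)
B = np.random.rand(100, 5)
C = A[:, :, None] * B[:, None, :]
assert C.shape == (100, 3, 5)
assert np.allclose(C[0], np.outer(A[0], B[0]))
assert np.allclose(C, np.einsum('ij,ik->ijk', A, B))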
Another solution uses np.lib.stride_tricks.as_strided().
Here the strategy is, in essence, to build a (100, 3, 5) array As and a (100, 3, 5) array Bs such that the normal element-wise product of these arrays produces the desired result. Of course, we don't actually build big memory-consuming arrays, thanks to as_strided(). (as_strided() is like a blueprint that tells NumPy how to map data from the original arrays to construct As and Bs.)
def outer_prod_stride(A, B):
    """stride trick"""
    a = A.shape[-1]
    b = B.shape[-1]
    d = A.strides[-1]
    new_shape = A.shape + (b,)
    As = np.lib.stride_tricks.as_strided(A, shape=new_shape, strides=(a*d, d, 0))
    Bs = np.lib.stride_tricks.as_strided(B, shape=new_shape, strides=(b*d, 0, d))
    return As * Bs
Timings
def outer_prod_broadcasting(A, B):
    """Broadcasting trick"""
    return A[...,None]*B[:,None]

def outer_prod_einsum(A, B):
    """einsum() trick"""
    return np.einsum('ij,ik->ijk', A, B)

def outer_prod_stride(A, B):
    """stride trick"""
    a = A.shape[-1]
    b = B.shape[-1]
    d = A.strides[-1]
    new_shape = A.shape + (b,)
    As = np.lib.stride_tricks.as_strided(A, shape=new_shape, strides=(a*d, d, 0))
    Bs = np.lib.stride_tricks.as_strided(B, shape=new_shape, strides=(b*d, 0, d))
    return As * Bs
%timeit op1 = outer_prod_broadcasting(A, B)
2.54 µs ± 436 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit op2 = outer_prod_einsum(A, B)
3.03 µs ± 637 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit op3 = outer_prod_stride(A, B)
16.6 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Seems my stride-trick solution is slower than both of @Divakar's solutions. Still an interesting method worth knowing, though.

Vectorized way of calculating row-wise dot product of two matrices with Scipy

I want to calculate the row-wise dot product of two matrices of the same dimension as fast as possible. This is the way I am doing it:
import numpy as np

a = np.array([[1,2,3], [3,4,5]])
b = np.array([[1,2,3], [1,2,3]])
result = np.array([])
for row1, row2 in zip(a, b):
    result = np.append(result, np.dot(row1, row2))
print(result)
and of course the output is:
[ 14. 26.]
Straightforward way to do that is:
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)
which avoids the python loop and is faster in cases like:
def npsumdot(x, y):
    return np.sum(x*y, axis=1)

def loopdot(x, y):
    result = np.empty((x.shape[0]))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result
timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop
Check out numpy.einsum for another method:
In [52]: a
Out[52]:
array([[1, 2, 3],
[3, 4, 5]])
In [53]: b
Out[53]:
array([[1, 2, 3],
[1, 2, 3]])
In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])
Looks like einsum is a bit faster than inner1d:
In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop
In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop
In [96]: a = random.randn(10, 100)
In [97]: b = random.randn(10, 100)
In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop
In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop
Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.
Played around with this and found inner1d the fastest. That function however is internal, so a more robust approach is to use
numpy.einsum("ij,ij->i", a, b)
Even better is to align your memory such that the summation happens in the first dimension, e.g.,
a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)
For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.
Note also that the transposed summation is much faster than its untransposed equivalent.
The plot was created with perfplot (a small project of mine)
import numpy
from numpy.core.umath_tests import inner1d
import perfplot

def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    aT = numpy.ascontiguousarray(a.T)
    bT = numpy.ascontiguousarray(b.T)
    return (a, b), (aT, bT)

b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(1, 25)],
    kernels=[
        lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
        lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
        lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
        lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
        lambda data: inner1d(data[0][0], data[0][1]),
    ],
    labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
    xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)
You'll do better avoiding the append, but I can't think of a way to avoid the python loop. A custom Ufunc perhaps? I don't think numpy.vectorize will help you here.
import numpy as np

a = np.array([[1,2,3],[3,4,5]])
b = np.array([[1,2,3],[1,2,3]])
result = np.empty((2,))
for i in range(2):
    result[i] = np.dot(a[i], b[i])
print(result)
EDIT
Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.
from numpy.core.umath_tests import inner1d
inner1d(a,b) # array([14, 26])
I came across this answer and re-verified the results with Numpy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see example below), all but one of the methods are so close to one another that the performance difference is meaningless.
For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.
My large matrix example:
import numpy as np
from numpy.core.umath_tests import inner1d

a = np.random.randn(100, 1000000)  # 800 MB each
b = np.random.randn(100, 1000000)  # pretty big.

def loop_dot(a, b):
    result = np.empty((a.shape[0],))  # one dot product per row pair
    for i, (row1, row2) in enumerate(zip(a, b)):
        result[i] = np.dot(row1, row2)
    return result
%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b) # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!

Sorting very large 1D arrays

I'm about to try out PyTables for the first time and I need to write my data to the HDF5 file per time step. I'll have over 100,000 time steps. When I'm done, I would like to sort my 100,000+ x 6 array by column 2, i.e., I currently have everything sorted by time but now I need to sort the array in order of decreasing rain rate (col 2). I'm unsure how to even begin here. I know that having the entire array in memory is unwise. Any ideas how to do this fast and efficiently?
Appreciate any advice.
I know that having the entire array in memory is unwise.
You might be overthinking it. A 100K x 6 array of float64 takes just ~5MB of RAM. On my computer, sorting such an array takes about 27ms:
In [37]: a = np.random.rand(100000, 6)
In [38]: %timeit a[a[:,1].argsort()]
10 loops, best of 3: 27.2 ms per loop
Unless you have a very old computer, you should put the entire array in memory. Assuming they are floats, it will only take 100000*6*4./2**20 = 2.29 MB. Twice as much for doubles. You can use numpy's sort or argsort for sorting. For example, you can get the sorting indices from your second column:
import numpy as np
a = np.random.normal(0, 1, size=(100000,6))
idx = a[:, 1].argsort()
And then use these to index the columns you want, or the whole array:
b = a[idx]
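Since you actually want decreasing rain rates, just reverse the sorted order (a small sketch using the same array):
# sort rows by the second column in descending order
b_desc = a[a[:, 1].argsort()[::-1]]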
You can even use different types of sort and check their speed:
In [33]: %timeit idx = a[:, 1].argsort(kind='quicksort')
100 loops, best of 3: 12.6 ms per loop
In [34]: %timeit idx = a[:, 1].argsort(kind='mergesort')
100 loops, best of 3: 14.4 ms per loop
In [35]: %timeit idx = a[:, 1].argsort(kind='heapsort')
10 loops, best of 3: 21.4 ms per loop
So you see that for an array of this size it doesn't really matter.