Calculating distances between vectors of two numpy arrays - numpy

I have two numpy arrays R with dimensions S x F and W with dimensions N x M x F. Getting concrete lets assign the following values N = 5, M = 7, F = 3, S = 4
The array R contains a collections of samples S = 4 with F = 3 features. Each line represents a samples and each row a feature. Therefore R[0] is the first sample, R[1] the second and goes on. Each R[i-th] entry, contains F elements, giving for sake of example R[0] = np.array([1, 4, -2]).
Here is a small snippet to initialize all those values, with a MWE in mind
import numpy as np
# Size of Map (rows, columns)
N, M = 5, 7
# Number of features
F = 3
# Sample size
S = 4
np.random.seed(13)
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
We can also see a given "depth line" of numpy array W, as a vector also with same dimension as each row of array R (this can easily be noticed looking at the size of the last dimension of both arrays). With that I can access W[2, 3] and obtain np.array([ 2, 2, -1 ]) (the values here are just examples).
I created a simple function to calculate the distance of a given vector r to each "depth line" of matrix W and the return the position of the nearest element of W depth line to r
def nearest_vector_matrix_naive(r, W):
delta = np.zeros((N,M), dtype=int)
for i in range(N):
for j in range(M):
norm = 0
for k in range(F):
norm += (r[k] - W[i,j,k])**2
delta[i,j] = norm
norm = 0
win_idx = np.unravel_index(np.argmin(delta, axis=None), delta.shape)
return win_idx
Of course this is a very naive approach, that I could further optimize to the code below, obtaining a HUGE performance boost.
def nearest_vector_matrix(r, W):
delta = np.sum((W[:,:] - r)**2, axis=2)
return np.unravel_index(np.argmin(delta, axis=None), delta.shape)
I can use this function simple as
nearest_idx = nearest_vector_matrix(R[0], W)
# Returns the nearest vector in W to R[0]
W[nearest_idx]
Since I have the array R with a bunch of samples I use the following snippet to calculate the nearest vectors to a array of samples:
def nearest_samples_matrix(R, W):
DELTA = np.zeros((R.shape[0],2))
for idx, r in enumerate(R):
delta = np.sum((W[:,:] - r)**2, axis=2)
DELTA[idx] = np.unravel_index(np.argmin(delta, axis=None), delta.shape)
return DELTA
This function returns an array with S rows (S being the number of samples) of 2d indexes. That is DELTA has (S, 2) shape (always).
I would like to know how can I substitute the for loop (for example for a broadcasting) inside nearest_samples_matrix to enhance the code execution performance even further?
I could not figure out how to do it. (besides I was able to do it in the first case)

The best solution depends on the input size of the arrays
For lower dimensional problems dim<20 or less, a kdtree approach is usually the way to go. There are quite a lot of answers regarding this topic eg. one I have written a few weeks ago.
If the dimension of the problems is too high you can switch to brute-force algorithms. Both of the following algorithms are much faster than your optimized approach, but on larger input sizes and low dimensional problems much slower than a kdtree approach O(log(n)) instead of O(n^2).
Brute force 1
The following example uses an algorithm described here. It is very fast on large dimensional problems because most of the calculation is done in a highly optimized matrix-matrix multiplication algorithm.
The disadvantage is high memory usage (all distances are calculated in one function call) and precision problems, because of the more error prone calculation method.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
def nearest_samples_matrix_2(R,W):
R_Temp=R
W_Temp=W.reshape(-1,W.shape[2])
dist=euclidean_distances(R_Temp, W_Temp)
ind_1,ind_2=np.unravel_index(np.argmin(dist,axis=1),shape=(W.shape[0],W.shape[1]))
return np.vstack((ind_1,ind_2)).T
Brute force 2
This is quite similar to your naive approach, but uses a JIT-Compiler (Numba) to get good performance. Temporary arrays are not necessary and the precision should be good (as long as no overflow occurs). There is room for further optimization (loop tiling) on larger input sizes.
import numpy as np
import numba as nb
#parallelization is only beneficial on larger input data
#nb.njit(fastmath=True,parallel=True,cache=True)
def nearest_samples_matrix_3(r, W):
ind_i=0
ind_j=0
out=np.empty((r.shape[0],2),dtype=np.int64)
for x in nb.prange(r.shape[0]):
delta=0
for k in range(W.shape[2]):
delta += (r[x,k] - W[0,0,k])**2
for i in range(W.shape[0]):
for j in range(W.shape[1]):
norm = 0
for k in range(W.shape[2]):
norm += (r[x,k] - W[i,j,k])**2
if norm < delta:
delta=norm
ind_i=i
ind_j=j
out[x,0]=ind_i
out[x,1]=ind_j
return out
Timings
#small Arrays
N, M = 100, 200
F = 30
S = 50
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
#your function
%timeit nearest_samples_matrix(R,W)
#268 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nearest_samples_matrix_2(R,W)
#5.62 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nearest_samples_matrix_3(R,W)
#3.68 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#larger arrays
N, M = 1_000, 2_000
F = 50
S = 100
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
#%timeit nearest_samples_matrix_1(R,W)
#too slow
%timeit nearest_samples_matrix_2(R,W)
#2.76 s ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nearest_samples_matrix_3(R,W)
#1.42 s ± 402 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Related

pandas vectorize function using two dataframes

I have the following operation:
import pandas as pd
import numpy as np
def some_calc(x,y):
x = x.set_index('Cat')
y = y.set_index('Cat')
y = np.sqrt(y['data_point2'])
vec = pd.DataFrame(x['data_point1'] * y)
grid = np.random.rand(len(x),len(x))
result = vec.dot(vec.T).mul(grid).sum().sum()
return result
sample_size = 100
cats = ['a','b','c','d']
df1 = pd.DataFrame({'Cat':[cats[np.random.randint(4)] for _ in range(sample_size)],
'data_point1':np.random.rand(sample_size),
'data_point2':np.random.rand(sample_size)})
df2 = df1.groupby('Cat').sum().reset_index()
I would like to run some_calc across each of the df2 rows using their relative data points from df1.
The code below works well:
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1)
(I reset the index in df2 because I don't know how to apply across indices.
Also, I'm passing both Cat as the index field and data_point as vectors to some_calc because without an index v.dot(v.T) will crunch the dot product into one single number. This errors with .mul() because I need the full MxM matrix as opposed to a float value. I might be doing something wrong here though...)
I'm currently exploring how I can vectorize the above so that when sample_size grows I will not be hampered by a slow down in the calculation.
I saw that in previous threads you can toggle raw=True so that the input deal with np.array as opposed to pd.Series.
df2['ApplyRaw'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
y=df1[df1['Cat']==x['Cat']]['Cat','data_point2']),axis=1, raw=True)
However, it throws an error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I tried omitting Cat from the argument but still the same issue.
Are there any code improvements or tricks I can employ that allow me to vectorize the above?
Or do I have to amend some_calc?
I'm not sure if it's possible to vectorize your function since it's a bit complex. However, some_calc itself and how it is called can be optimized.
What
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1)
does is basically the same as a groupby. So instead of creating these groups and applying the function on them, use groupby + apply. Simplifying the some_calc function as well, we get:
def some_calc(df):
x = df['data_point1'].values
y = np.sqrt(df['data_point2'].values)
vec = (x * y).reshape(-1, 1)
grid = np.random.rand(len(x),len(x))
result = (vec # vec.T * grid).sum().sum()
return result
apply = df1.groupby('Cat').apply(some_calc)
apply.name = 'Apply'
df2.merge(apply, left_on='Cat', right_index=True)
The final merge is just to add the results to the df2 dataframe.
Timings:
# original
20.5 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# above code
3.62 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

numpy / linear algebra - fast 16-bit histogram

If I have an image made of uint16s and want to compute a histogram for each bit, i.e. a vector 'x' of 0..65535 that contains the intensity value, and a vector y that is the number of samples that have that value, is there a vectorized numpy / linear algreba way to compute this?
I did it the obvious way with Numpy, and using your image dimensions on my Mac, it takes 300ms. I then did the same thing with OpenCV and it is 33x faster at 9ms!
#!/usr/bin/env python3
import cv2
import numpy as np
# Dimensions - height, width
h, w = 2160, 2560
# Known image, channel0=1, channel1=3, channel2=5, channel3=65535
R = np.zeros((h,w,4), dtype=np.uint16)
R[...,0] = 1
R[...,1] = 3
R[...,2] = 5
R[...,3] = 65535
def npHistogram(R):
"""Generate histogram using Numpy"""
H, _ = np.histogram(R,65536)
return H
def OpenCVHistogram(R):
"""Generate histogram using OpenCV"""
H = cv2.calcHist([R.ravel()], [0], None, [65536], [0,65536])
return H
A = npHistogram(R)
B = OpenCVHistogram(R)
#%timeit npHistogram(R)
#%timeit OpenCVHistogram(R)
Results
Using IPython, I got these timings
%timeit npHistogram(R)
300 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit OpenCVHistogram(R)
9.02 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Keywords: Python, histogram, slow, Numpy, np.histogram, speedup, OpenCV, image processing.
Ok, if OpenCV is too big a dependency for you to get 9ms processing time instead of 300ms, how about Numba? This runs in 10ms.
#!/usr/bin/env python3
import numpy as np
from numba import jit
# Dimensions - height, width
h, w = 2160, 2560
# Known image, channel0=1, channel1=3, channel2=5, channel3=65535
R = np.zeros((h,w,4), dtype=np.uint16)
R[...,0] = 1
R[...,1] = 3
R[...,2] = 5
R[...,3] = 65535
#jit(nopython=True, nogil=True)
def NumbaHistogram(pixels):
"""Histogram of uint16 image"""
H = np.zeros(65536, dtype=np.int32)
for i in range(len(pixels)):
H[pixels[i]] += 1
return H
#%timeit q = NumbaHistogram(R.ravel())
Results
%timeit NumbaHistogram(R.ravel())
10.4 ms ± 54.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy elementwise outer product

I want to do the element-wise outer product of two 2d arrays in numpy.
A.shape = (100, 3) # A numpy ndarray
B.shape = (100, 5) # A numpy ndarray
C = element_wise_outer_product(A, B) # A function that does the trick
C.shape = (100, 3, 5) # This should be the result
C[i] = np.outer(A[i], B[i]) # This should be the result
A naive implementation can the following.
tmp = []
for i in range(len(A):
outer_product = np.outer(A[i], B[i])
tmp.append(outer_product)
C = np.array(tmp)
A better solution inspired from stack overflow.
big_outer = np.multiply.outer(A, B)
tmp = np.swapaxes(tmp, 1, 2)
C_tmp = [tmp[i][i] for i in range(len(A)]
C = np.array(C_tmp)
I'm looking for a vectorized implementation that gets rid the for loop.
Does anyone have an idea?
Thank you!
Extend A and B to 3D keeping their first axis aligned and introducing new axes along the third and second ones respectively with None/np.newaxis and then multiply with each other. This would allow broadcasting to come into play for a vectorized solution.
Thus, an implementation would be -
A[:,:,None]*B[:,None,:]
We could shorten it a bit by using ellipsis for A's : :,: and skip listing the leftover last axis with B, like so -
A[...,None]*B[:,None]
As another vectorized approach we could also use np.einsum, which might be more intuitive once we get past the string notation syntax and consider those notations being representatives of the iterators involved in a naive loopy implementation, like so -
np.einsum('ij,ik->ijk',A,B)
Another solution using np.lib.stride_tricks.as_strided()..
Here the strategy is to, in essence, build a (100, 3, 5) array As and a (100, 3, 5) array Bs such that the normal element-wise product of these arrays will produce the desired result. Of course, we don't actually build big memory consuming arrays, thanks to as_strided(). (as_strided() is like a blueprint that tells NumPy how you'd map data from the original arrays to construct As and Bs.)
def outer_prod_stride(A, B):
"""stride trick"""
a = A.shape[-1]
b = B.shape[-1]
d = A.strides[-1]
new_shape = A.shape + (b,)
As = np.lib.stride_tricks.as_strided(A, shape=new_shape, strides=(a*d, d, 0))
Bs = np.lib.stride_tricks.as_strided(B, shape=new_shape, strides=(b*d, 0, d))
return As * Bs
Timings
def outer_prod_broadcasting(A, B):
"""Broadcasting trick"""
return A[...,None]*B[:,None]
def outer_prod_einsum(A, B):
"""einsum() trick"""
return np.einsum('ij,ik->ijk',A,B)
def outer_prod_stride(A, B):
"""stride trick"""
a = A.shape[-1]
b = B.shape[-1]
d = A.strides[-1]
new_shape = A.shape + (b,)
As = np.lib.stride_tricks.as_strided(A, shape=new_shape, strides=(a*d, d, 0))
Bs = np.lib.stride_tricks.as_strided(B, shape=new_shape, strides=(b*d, 0, d))
return As * Bs
%timeit op1 = outer_prod_broadcasting(A, B)
2.54 µs ± 436 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit op2 = outer_prod_einsum(A, B)
3.03 µs ± 637 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit op3 = outer_prod_stride(A, B)
16.6 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Seems my stride trick solution is slower than both #Divkar's solutions. ..still an interesting method worth knowing though.

Dynamic axis indexing of Numpy ndarray

I want to obtain the 2D slice in a given direction of a 3D array where the direction (or the axis from where the slice is going to be extracted) is given by another variable.
Assuming idx the index of the 2D slice in a 3D array, and direction the axis in which obtain that 2D slice, the initial approach would be:
if direction == 0:
return A[idx, :, :]
elif direction == 1:
return A[:, idx, :]
else:
return A[:, :, idx]
I'm pretty sure there must be a way of doing this without doing conditionals, or at least, not in raw python. Does numpy have a shortcut for this?
The better solution I've found so far (for doing it dynamically), relies in the transpose operator:
# for 3 dimensions [0,1,2] and direction == 1 --> [1, 0, 2]
tr = [direction] + range(A.ndim)
del tr[direction+1]
return np.transpose(A, tr)[idx]
But I wonder if there is any better/easier/faster function for this, since for 3D the transpose code almost looks more awful than the 3 if/elif. It generalizes better for ND and the larger the N the more beautiful the code gets in comparison, but for 3D is quite the same.
Transpose is cheap (timewise). There are numpy functions that use it to move the operational axis (or axes) to a known location - usually the front or end of the shape list. tensordot is one that comes to mind.
Other functions construct an indexing tuple. They may start with a list or array for ease of manipulation, and then turn it into a tuple for application. For example
I = [slice(None)]*A.ndim
I[axis] = idx
A[tuple(I)]
np.apply_along_axis does something like that. It's instructive to look at the code for functions like this.
I imagine the writers of the numpy functions worried most about whether it works robustly, secondly about speed, and lastly whether it looks pretty. You can bury all kinds of ugly code in a function!.
tensordot ends with
at = a.transpose(newaxes_a).reshape(newshape_a)
bt = b.transpose(newaxes_b).reshape(newshape_b)
res = dot(at, bt)
return res.reshape(olda + oldb)
where the previous code calculated newaxes_.. and newshape....
apply_along_axis constructs a (0...,:,0...) index tuple
i = zeros(nd, 'O')
i[axis] = slice(None, None)
i.put(indlist, ind)
....arr[tuple(i.tolist())]
To index a dimension dynamically, you can use swapaxes, as shown below:
a = np.arange(7 * 8 * 9).reshape((7, 8, 9))
axis = 1
idx = 2
np.swapaxes(a, 0, axis)[idx]
Runtime comparison
Natural method (non dynamic) :
%timeit a[:, idx, :]
300 ns ± 1.58 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
swapaxes:
%timeit np.swapaxes(a, 0, axis)[idx]
752 ns ± 4.54 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Index with list comprehension:
%timeit a[[idx if i==axis else slice(None) for i in range(a.ndim)]]
This is python. You could simply use eval() like this:
def get_by_axis(a, idx, axis):
indexing_list = a.ndim*[':']
indexing_list[axis] = str(idx)
expression = f"a[{', '.join(indexing_list)}]"
return eval(expression)
Obviously, in which case you do not accept input from untrusted users.

Vectorized way of calculating row-wise dot product two matrices with Scipy

I want to calculate the row-wise dot product of two matrices of the same dimension as fast as possible. This is the way I am doing it:
import numpy as np
a = np.array([[1,2,3], [3,4,5]])
b = np.array([[1,2,3], [1,2,3]])
result = np.array([])
for row1, row2 in a, b:
result = np.append(result, np.dot(row1, row2))
print result
and of course the output is:
[ 26. 14.]
Straightforward way to do that is:
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)
which avoids the python loop and is faster in cases like:
def npsumdot(x, y):
return np.sum(x*y, axis=1)
def loopdot(x, y):
result = np.empty((x.shape[0]))
for i in range(x.shape[0]):
result[i] = np.dot(x[i], y[i])
return result
timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop
Check out numpy.einsum for another method:
In [52]: a
Out[52]:
array([[1, 2, 3],
[3, 4, 5]])
In [53]: b
Out[53]:
array([[1, 2, 3],
[1, 2, 3]])
In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])
Looks like einsum is a bit faster than inner1d:
In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop
In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop
In [96]: a = random.randn(10, 100)
In [97]: b = random.randn(10, 100)
In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop
In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop
Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.
Played around with this and found inner1d the fastest. That function however is internal, so a more robust approach is to use
numpy.einsum("ij,ij->i", a, b)
Even better is to align your memory such that the summation happens in the first dimension, e.g.,
a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)
For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.
Note also that the transposed summation is much faster than its untransposed equivalent.
The plot was created with perfplot (a small project of mine)
import numpy
from numpy.core.umath_tests import inner1d
import perfplot
def setup(n):
a = numpy.random.rand(n, 3)
b = numpy.random.rand(n, 3)
aT = numpy.ascontiguousarray(a.T)
bT = numpy.ascontiguousarray(b.T)
return (a, b), (aT, bT)
b = perfplot.bench(
setup=setup,
n_range=[2 ** k for k in range(1, 25)],
kernels=[
lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
lambda data: inner1d(data[0][0], data[0][1]),
],
labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)
You'll do better avoiding the append, but I can't think of a way to avoid the python loop. A custom Ufunc perhaps? I don't think numpy.vectorize will help you here.
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
result=np.empty((2,))
for i in range(2):
result[i] = np.dot(a[i],b[i]))
print result
EDIT
Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.
from numpy.core.umath_tests import inner1d
inner1d(a,b) # array([14, 26])
I came across this answer and re-verified the results with Numpy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see example below), all but one of the methods are so close to one another that the performance difference is meaningless.
For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.
My large matrix example:
import numpy as np
from numpy.core.umath_tests import inner1d
a = np.random.randn(100, 1000000) # 800 MB each
b = np.random.randn(100, 1000000) # pretty big.
def loop_dot(a, b):
result = np.empty((a.shape[1],))
for i, (row1, row2) in enumerate(zip(a, b)):
result[i] = np.dot(row1, row2)
%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b) # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!