I want to do the element-wise outer product of two 2d arrays in numpy.
A.shape = (100, 3) # A numpy ndarray
B.shape = (100, 5) # A numpy ndarray
C = element_wise_outer_product(A, B) # A function that does the trick
C.shape = (100, 3, 5) # This should be the result
C[i] = np.outer(A[i], B[i]) # This should be the result
A naive implementation would be the following.
tmp = []
for i in range(len(A)):
    outer_product = np.outer(A[i], B[i])
    tmp.append(outer_product)
C = np.array(tmp)
A better solution, inspired by Stack Overflow.
big_outer = np.multiply.outer(A, B)  # shape (100, 3, 100, 5)
tmp = np.swapaxes(big_outer, 1, 2)   # shape (100, 100, 3, 5)
C_tmp = [tmp[i][i] for i in range(len(A))]
C = np.array(C_tmp)
I'm looking for a vectorized implementation that gets rid of the for loop.
Does anyone have an idea?
Thank you!
Extend A and B to 3D keeping their first axis aligned and introducing new axes along the third and second ones respectively with None/np.newaxis and then multiply with each other. This would allow broadcasting to come into play for a vectorized solution.
Thus, an implementation would be -
A[:,:,None]*B[:,None,:]
We could shorten it a bit by using an ellipsis for A's leading :,: and by skipping the leftover last axis of B, like so -
A[...,None]*B[:,None]
As another vectorized approach we could also use np.einsum, which might be more intuitive once we get past the string notation syntax and consider those notations being representatives of the iterators involved in a naive loopy implementation, like so -
np.einsum('ij,ik->ijk',A,B)
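As a quick sanity check (a minimal sketch on small random arrays; the helper name loop_outer is made up for this illustration), both vectorized forms can be compared against the explicit loop from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 3))
B = rng.random((100, 5))

def loop_outer(A, B):
    """Reference implementation: one np.outer call per row pair."""
    return np.array([np.outer(a, b) for a, b in zip(A, B)])

C_loop = loop_outer(A, B)
C_bcast = A[:, :, None] * B[:, None, :]    # broadcasting
C_einsum = np.einsum('ij,ik->ijk', A, B)   # einsum

print(C_bcast.shape)
print(np.allclose(C_loop, C_bcast), np.allclose(C_loop, C_einsum))
```

All three arrays should have shape (100, 3, 5) and agree element by element.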
Another solution uses np.lib.stride_tricks.as_strided().
Here the strategy is to, in essence, build a (100, 3, 5) array As and a (100, 3, 5) array Bs such that the normal element-wise product of these arrays will produce the desired result. Of course, we don't actually build big memory consuming arrays, thanks to as_strided(). (as_strided() is like a blueprint that tells NumPy how you'd map data from the original arrays to construct As and Bs.)
def outer_prod_stride(A, B):
    """stride trick"""
    # note: the strides below assume A and B are C-contiguous and share the same itemsize
    a = A.shape[-1]
    b = B.shape[-1]
    d = A.strides[-1]
    new_shape = A.shape + (b,)
    As = np.lib.stride_tricks.as_strided(A, shape=new_shape, strides=(a*d, d, 0))
    Bs = np.lib.stride_tricks.as_strided(B, shape=new_shape, strides=(b*d, 0, d))
    return As * Bs
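For comparison, here is a small sketch of the same zero-stride idea using np.broadcast_to, which builds read-only views without hand-computed strides. The tiny arrays are only for illustration and are not from the original post:

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(10, dtype=float).reshape(2, 5)
shape = (2, 3, 5)

# Broadcast views: the repeated axis gets a stride of 0, so no data is copied.
As = np.broadcast_to(A[:, :, None], shape)
Bs = np.broadcast_to(B[:, None, :], shape)

print(As.strides, Bs.strides)
print(np.shares_memory(As, A), np.shares_memory(Bs, B))

C = As * Bs  # same result as np.einsum('ij,ik->ijk', A, B)
```

The zero entries in the printed strides are what makes these views "blueprints" rather than real (2, 3, 5) arrays.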
Timings
def outer_prod_broadcasting(A, B):
    """Broadcasting trick"""
    return A[...,None]*B[:,None]
def outer_prod_einsum(A, B):
    """einsum() trick"""
    return np.einsum('ij,ik->ijk',A,B)
def outer_prod_stride(A, B):
    """stride trick"""
    a = A.shape[-1]
    b = B.shape[-1]
    d = A.strides[-1]
    new_shape = A.shape + (b,)
    As = np.lib.stride_tricks.as_strided(A, shape=new_shape, strides=(a*d, d, 0))
    Bs = np.lib.stride_tricks.as_strided(B, shape=new_shape, strides=(b*d, 0, d))
    return As * Bs
%timeit op1 = outer_prod_broadcasting(A, B)
2.54 µs ± 436 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit op2 = outer_prod_einsum(A, B)
3.03 µs ± 637 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit op3 = outer_prod_stride(A, B)
16.6 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It seems my stride trick solution is slower than both of @Divkar's solutions. Still, it is an interesting method worth knowing.
Related
I'm trying to understand the performance differences I am seeing between various numba implementations of an algorithm. In particular, I would expect func1d below to be the fastest implementation, since it is the only one that does not copy data; however, according to my timings, func1b appears to be the fastest.
import numpy
import numba
def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))
@numba.njit(fastmath=True)
def func1b(data, a, b, c):
    new_data = a * (1 + numpy.tanh((data / b) - c))
    return new_data
@numba.njit(fastmath=True)
def func1c(data, a, b, c):
    new_data = numpy.empty(data.shape)
    for i in range(new_data.shape[0]):
        for j in range(new_data.shape[1]):
            new_data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return new_data
@numba.njit(fastmath=True)
def func1d(data, a, b, c):
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return data
Helper functions for testing memory copying
def get_data_base(arr):
    """For a given NumPy array, find the base array
    that owns the actual data.
    https://ipython-books.github.io/45-understanding-the-internals-of-numpy-to-avoid-unnecessary-array-copying/
    """
    base = arr
    while isinstance(base.base, numpy.ndarray):
        base = base.base
    return base
def arrays_share_data(x, y):
    return get_data_base(x) is get_data_base(y)
def test_share(func):
    data = numpy.random.randn(100, 3)
    print(arrays_share_data(data, func(data, 0.5, 2.5, 2.5)))
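As an aside, NumPy also ships a public helper, np.shares_memory, that answers the same question without walking the .base chain. A minimal sketch (the arrays here are illustrative and not part of the original test):

```python
import numpy as np

data = np.random.randn(100, 3)
view = data[::2]     # basic slicing returns a view onto the same buffer
copy = data * 2.0    # arithmetic allocates a new array

print(np.shares_memory(data, view))
print(np.shares_memory(data, copy))
```

The first check should report shared memory and the second should not.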
Timings
# force compiling
data = numpy.random.randn(10_000, 300)
_ = func1a(data, 0.5, 2.5, 2.5)
_ = func1b(data, 0.5, 2.5, 2.5)
_ = func1c(data, 0.5, 2.5, 2.5)
_ = func1d(data, 0.5, 2.5, 2.5)
data = numpy.random.randn(10_000, 300)
%timeit func1a(data, 0.5, 2.5, 2.5)
%timeit func1b(data, 0.5, 2.5, 2.5)
%timeit func1c(data, 0.5, 2.5, 2.5)
%timeit func1d(data, 0.5, 2.5, 2.5)
67.2 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
13 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
69.8 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.8 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Test which implementations copy memory
test_share(func1a)
test_share(func1b)
test_share(func1c)
test_share(func1d)
False
False
False
True
Here, copying the data doesn't play a big role: the bottleneck is how fast the tanh function is evaluated. There are many algorithms for it: some are faster, some slower; some are more precise, some less.
Different numpy-distributions use different implementations of tanh-function, e.g. it could be one from mkl/vml or the one from the gnu-math-library.
Depending on the numba version, either the mkl/svml implementation or the gnu-math-library is used as well.
The easiest way to look inside is to use a profiler, for example perf.
For the numpy-version on my machine I get:
>>> perf record python run.py
>>> perf report
Overhead Command Shared Object Symbol
46,73% python libm-2.23.so [.] __expm1
24,24% python libm-2.23.so [.] __tanh
4,89% python _multiarray_umath.cpython-37m-x86_64-linux-gnu.so [.] sse2_binary_scalar2_divide_DOUBLE
3,59% python [unknown] [k] 0xffffffff8140290c
As one can see, numpy uses the slow gnu-math-library (libm) functionality.
For the numba-function I get:
53,98% python libsvml.so [.] __svml_tanh4_e9
3,60% python [unknown] [k] 0xffffffff81831c57
2,79% python python3.7 [.] _PyEval_EvalFrameDefault
which means that fast mkl/svml functionality is used.
That is (almost) all there is to it.
As @user2640045 has rightly pointed out, the numpy performance will be hurt by additional cache misses due to creation of temporary arrays.
However, cache misses don't play such a big role as the calculation of tanh:
%timeit func1a(data, 0.5, 2.5, 2.5) # 91.5 ms ± 2.88 ms per loop
%timeit numpy.tanh(data) # 76.1 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
i.e. creation of temporary objects is responsible for around 20% of the running time.
FWIW, my numba version (0.50.1) is also able to vectorize the versions with handwritten loops and call the mkl/svml functionality. If this does not happen with some other version, numba will fall back to the gnu-math-library functionality, which seems to be what is happening on your machine.
Listing of run.py:
import numpy
# TODO: define func1b for checking numba
def func1a(data, a, b, c):
# pure numpy
return a * (1 + numpy.tanh((data / b) - c))
data = numpy.random.randn(10_000, 300)
for _ in range(100):
func1a(data, 0.5, 2.5, 2.5)
The performance difference is NOT in the evaluation of the tanh-function
I must disagree with @ead. Let's assume for the moment that
the main performance difference is in the evaluation of the tanh-function
Then one would expect that running just tanh from numpy and numba with fast math would show that speed difference.
import numpy as np
import numba as nb
def func_a(data):
    return np.tanh(data)
@nb.njit(fastmath=True)
def func_b(data):
    new_data = np.tanh(data)
    return new_data
data = np.random.randn(10_000, 300)
%timeit func_a(data)
%timeit func_b(data)
Yet on my machine the above code shows almost no difference in performance.
15.7 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.8 ms ± 82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Short detour on NumExpr
I tried a NumExpr version of your code. But before being amazed that it runs almost 7 times faster, keep in mind that it uses all 10 cores available on my machine. After allowing numba to run in parallel too and optimising it a little, the performance benefit is small but still there: 2.56 ms vs. 3.87 ms. See the code below.
import numpy as np
import numba as nb
import numexpr as ne
# assumed: the same constants as in the question's timings
a, b, c = 0.5, 2.5, 2.5
@nb.njit(fastmath=True)
def func_a(data):
    new_data = a * (1 + np.tanh((data / b) - c))
    return new_data
@nb.njit(fastmath=True, parallel=True)
def func_b(data):
    new_data = a * (1 + np.tanh((data / b) - c))
    return new_data
@nb.njit(fastmath=True, parallel=True)
def func_c(data):
    for i in nb.prange(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + np.tanh((data[i, j] / b) - c))
    return data
def func_d(data):
    return ne.evaluate('a * (1 + tanh((data / b) - c))')
data = np.random.randn(10_000, 300)
%timeit func_a(data)
%timeit func_b(data)
%timeit func_c(data)
%timeit func_d(data)
17.4 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.31 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.87 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.56 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The actual explanation
The ~34% of time that NumExpr saves compared to numba is nice, but even nicer is that its authors have a concise explanation of why it is faster than numpy. I am pretty sure that this applies to numba too.
From the NumExpr github page:
The main reason why NumExpr achieves better performance than NumPy is
that it avoids allocating memory for intermediate results. This
results in better cache utilization and reduces memory access in
general.
So
a * (1 + numpy.tanh((data / b) - c))
is slower because it does a lot of steps producing intermediate results.
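To illustrate the point about intermediate results in plain NumPy (a sketch only; func1a_inplace is a hypothetical name, not from the question), the same formula can be evaluated step by step into a single preallocated buffer through the ufunc out= argument:

```python
import numpy as np

def func1a_inplace(data, a, b, c, out=None):
    """Same formula as func1a, but every step writes into one buffer."""
    if out is None:
        out = np.empty_like(data)
    np.divide(data, b, out=out)     # data / b
    np.subtract(out, c, out=out)    # (data / b) - c
    np.tanh(out, out=out)           # tanh(...)
    np.add(out, 1.0, out=out)       # 1 + tanh(...)
    np.multiply(out, a, out=out)    # a * (1 + tanh(...))
    return out

data = np.random.randn(1000, 30)
expected = 0.5 * (1 + np.tanh((data / 2.5) - 2.5))
result = func1a_inplace(data, 0.5, 2.5, 2.5)
print(np.allclose(result, expected))
```

This still makes several passes over the data, so it only removes the temporary allocations; whether it helps on a given machine is something to measure rather than assume.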
I have the following operation:
import pandas as pd
import numpy as np
def some_calc(x,y):
x = x.set_index('Cat')
y = y.set_index('Cat')
y = np.sqrt(y['data_point2'])
vec = pd.DataFrame(x['data_point1'] * y)
grid = np.random.rand(len(x),len(x))
result = vec.dot(vec.T).mul(grid).sum().sum()
return result
sample_size = 100
cats = ['a','b','c','d']
df1 = pd.DataFrame({'Cat':[cats[np.random.randint(4)] for _ in range(sample_size)],
'data_point1':np.random.rand(sample_size),
'data_point2':np.random.rand(sample_size)})
df2 = df1.groupby('Cat').sum().reset_index()
I would like to run some_calc across each of the df2 rows using their relative data points from df1.
The code below works well:
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1)
(I reset the index in df2 because I don't know how to apply across indices.
Also, I'm passing both Cat as the index field and data_point as vectors to some_calc because without an index v.dot(v.T) will crunch the dot product into one single number. This errors with .mul() because I need the full MxM matrix as opposed to a float value. I might be doing something wrong here though...)
I'm currently exploring how I can vectorize the above so that when sample_size grows I will not be hampered by a slow down in the calculation.
I saw in previous threads that you can set raw=True so that the function receives np.array inputs instead of pd.Series.
df2['ApplyRaw'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
y=df1[df1['Cat']==x['Cat']]['Cat','data_point2']),axis=1, raw=True)
However, it throws an error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I tried omitting Cat from the argument but still the same issue.
Are there any code improvements or tricks I can employ that allow me to vectorize the above?
Or do I have to amend some_calc?
I'm not sure if it's possible to vectorize your function since it's a bit complex. However, some_calc itself and how it is called can be optimized.
What
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1)
does is basically the same as a groupby. So instead of creating these groups and applying the function on them, use groupby + apply. Simplifying the some_calc function as well, we get:
def some_calc(df):
    x = df['data_point1'].values
    y = np.sqrt(df['data_point2'].values)
    vec = (x * y).reshape(-1, 1)
    grid = np.random.rand(len(x),len(x))
    result = (vec @ vec.T * grid).sum().sum()
    return result
apply = df1.groupby('Cat').apply(some_calc)
apply.name = 'Apply'
df2.merge(apply, left_on='Cat', right_index=True)
The final merge is just to add the results to the df2 dataframe.
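One further observation, offered as a sketch rather than as part of the answer above: summing the outer product of a vector with itself, weighted element-wise by grid, is the quadratic form v.T @ grid @ v, so the n x n outer product never has to be materialized. The names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
v = rng.random(n)
grid = rng.random((n, n))

# As in some_calc: build the full outer product, weight it, sum everything.
vec = v.reshape(-1, 1)
full = (vec @ vec.T * grid).sum()

# Equivalent quadratic form: sum_ij v_i * grid_ij * v_j
quad = v @ grid @ v

print(np.isclose(full, quad))
```

The random grid is still n x n, so this saves one temporary matrix per group, not the overall quadratic cost.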
Timings:
# original
20.5 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# above code
3.62 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have two numpy arrays: R with dimensions S x F and W with dimensions N x M x F. To be concrete, let's assign the following values: N = 5, M = 7, F = 3, S = 4.
The array R contains a collection of S = 4 samples with F = 3 features. Each row represents a sample and each column a feature. Therefore R[0] is the first sample, R[1] the second, and so on. Each entry R[i] contains F elements, for example R[0] = np.array([1, 4, -2]).
Here is a small snippet to initialize all those values, with a MWE in mind
import numpy as np
# Size of Map (rows, columns)
N, M = 5, 7
# Number of features
F = 3
# Sample size
S = 4
np.random.seed(13)
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
We can also see a given "depth line" of numpy array W, as a vector also with same dimension as each row of array R (this can easily be noticed looking at the size of the last dimension of both arrays). With that I can access W[2, 3] and obtain np.array([ 2, 2, -1 ]) (the values here are just examples).
I created a simple function to calculate the distance of a given vector r to each "depth line" of matrix W and then return the position of the depth line of W nearest to r
def nearest_vector_matrix_naive(r, W):
    delta = np.zeros((N,M), dtype=int)
    for i in range(N):
        for j in range(M):
            norm = 0
            for k in range(F):
                norm += (r[k] - W[i,j,k])**2
            delta[i,j] = norm
            norm = 0
    win_idx = np.unravel_index(np.argmin(delta, axis=None), delta.shape)
    return win_idx
Of course this is a very naive approach, that I could further optimize to the code below, obtaining a HUGE performance boost.
def nearest_vector_matrix(r, W):
    delta = np.sum((W[:,:] - r)**2, axis=2)
    return np.unravel_index(np.argmin(delta, axis=None), delta.shape)
I can use this function simply as
nearest_idx = nearest_vector_matrix(R[0], W)
# Returns the nearest vector in W to R[0]
W[nearest_idx]
Since I have the array R with a bunch of samples I use the following snippet to calculate the nearest vectors to a array of samples:
def nearest_samples_matrix(R, W):
    DELTA = np.zeros((R.shape[0],2))
    for idx, r in enumerate(R):
        delta = np.sum((W[:,:] - r)**2, axis=2)
        DELTA[idx] = np.unravel_index(np.argmin(delta, axis=None), delta.shape)
    return DELTA
This function returns an array with S rows (S being the number of samples) of 2d indexes. That is DELTA has (S, 2) shape (always).
I would like to know how I can replace the for loop inside nearest_samples_matrix (for example with broadcasting) to improve execution performance even further.
I could not figure out how to do it (although I was able to do it in the first case).
The best solution depends on the input size of the arrays
For lower-dimensional problems (dim < 20 or so), a kdtree approach is usually the way to go. There are quite a lot of answers on this topic, e.g. one I wrote a few weeks ago.
If the dimension of the problems is too high you can switch to brute-force algorithms. Both of the following algorithms are much faster than your optimized approach, but on larger input sizes and low dimensional problems much slower than a kdtree approach O(log(n)) instead of O(n^2).
Brute force 1
The following example uses an algorithm described here. It is very fast on large dimensional problems because most of the calculation is done in a highly optimized matrix-matrix multiplication algorithm.
The disadvantage is high memory usage (all distances are calculated in one function call) and precision problems, because of the more error prone calculation method.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
def nearest_samples_matrix_2(R,W):
    R_Temp=R
    W_Temp=W.reshape(-1,W.shape[2])
    dist=euclidean_distances(R_Temp, W_Temp)
    ind_1,ind_2=np.unravel_index(np.argmin(dist,axis=1),shape=(W.shape[0],W.shape[1]))
    return np.vstack((ind_1,ind_2)).T
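For readers without scikit-learn, the two ideas in play can be sketched in plain NumPy. These are illustrations under my own naming (nearest_samples_broadcast and nearest_samples_matmul are hypothetical helpers): the first broadcasts all S x N x M x F differences at once, the second uses the expansion ||r - w||^2 = ||r||^2 - 2 r.w + ||w||^2 so that most of the work becomes one matrix product.

```python
import numpy as np

def nearest_samples_broadcast(R, W):
    """All squared distances at once; needs an (S, N, M, F) temporary."""
    N, M, _ = W.shape
    delta = ((W[None, :, :, :] - R[:, None, None, :]) ** 2).sum(axis=-1)
    flat = delta.reshape(len(R), -1).argmin(axis=1)
    return np.column_stack(np.unravel_index(flat, (N, M)))

def nearest_samples_matmul(R, W):
    """Matrix-product form; ||r||^2 is constant per sample, so it is dropped."""
    N, M, F = W.shape
    Wf = W.reshape(-1, F).astype(np.float64)
    Rf = R.astype(np.float64)
    d2 = (Wf ** 2).sum(axis=1)[None, :] - 2.0 * (Rf @ Wf.T)
    flat = d2.argmin(axis=1)
    return np.column_stack(np.unravel_index(flat, (N, M)))

np.random.seed(13)
N, M, F, S = 5, 7, 3, 4
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))

idx_b = nearest_samples_broadcast(R, W)
idx_m = nearest_samples_matmul(R, W)
print(idx_b.shape, idx_m.shape)
```

Both return an (S, 2) integer array of (row, column) positions. The broadcast version is the direct replacement for the for loop, but its temporary grows with S * N * M * F, so the matrix-product form is likely the better fit for large inputs.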
Brute force 2
This is quite similar to your naive approach, but uses a JIT-Compiler (Numba) to get good performance. Temporary arrays are not necessary and the precision should be good (as long as no overflow occurs). There is room for further optimization (loop tiling) on larger input sizes.
import numpy as np
import numba as nb
# parallelization is only beneficial on larger input data
@nb.njit(fastmath=True,parallel=True,cache=True)
def nearest_samples_matrix_3(r, W):
    out=np.empty((r.shape[0],2),dtype=np.int64)
    for x in nb.prange(r.shape[0]):
        # start from W[0,0] as the current best candidate for this sample
        ind_i=0
        ind_j=0
        delta=0
        for k in range(W.shape[2]):
            delta += (r[x,k] - W[0,0,k])**2
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                norm = 0
                for k in range(W.shape[2]):
                    norm += (r[x,k] - W[i,j,k])**2
                if norm < delta:
                    delta=norm
                    ind_i=i
                    ind_j=j
        out[x,0]=ind_i
        out[x,1]=ind_j
    return out
Timings
#small Arrays
N, M = 100, 200
F = 30
S = 50
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
#your function
%timeit nearest_samples_matrix(R,W)
#268 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nearest_samples_matrix_2(R,W)
#5.62 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nearest_samples_matrix_3(R,W)
#3.68 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#larger arrays
N, M = 1_000, 2_000
F = 50
S = 100
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
#%timeit nearest_samples_matrix(R,W)
#too slow
%timeit nearest_samples_matrix_2(R,W)
#2.76 s ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nearest_samples_matrix_3(R,W)
#1.42 s ± 402 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I want to obtain the 2D slice in a given direction of a 3D array where the direction (or the axis from where the slice is going to be extracted) is given by another variable.
Assuming idx the index of the 2D slice in a 3D array, and direction the axis in which obtain that 2D slice, the initial approach would be:
if direction == 0:
    return A[idx, :, :]
elif direction == 1:
    return A[:, idx, :]
else:
    return A[:, :, idx]
I'm pretty sure there must be a way of doing this without doing conditionals, or at least, not in raw python. Does numpy have a shortcut for this?
The best solution I've found so far (for doing it dynamically) relies on the transpose operator:
# for 3 dimensions [0,1,2] and direction == 1 --> [1, 0, 2]
tr = [direction] + list(range(A.ndim))
del tr[direction+1]
return np.transpose(A, tr)[idx]
But I wonder if there is any better/easier/faster function for this, since for 3D the transpose code almost looks more awful than the 3 if/elif. It generalizes better for ND and the larger the N the more beautiful the code gets in comparison, but for 3D is quite the same.
Transpose is cheap (timewise). There are numpy functions that use it to move the operational axis (or axes) to a known location - usually the front or end of the shape list. tensordot is one that comes to mind.
Other functions construct an indexing tuple. They may start with a list or array for ease of manipulation, and then turn it into a tuple for application. For example
I = [slice(None)]*A.ndim
I[axis] = idx
A[tuple(I)]
np.apply_along_axis does something like that. It's instructive to look at the code for functions like this.
I imagine the writers of the numpy functions worried most about whether it works robustly, secondly about speed, and lastly whether it looks pretty. You can bury all kinds of ugly code in a function!
tensordot ends with
at = a.transpose(newaxes_a).reshape(newshape_a)
bt = b.transpose(newaxes_b).reshape(newshape_b)
res = dot(at, bt)
return res.reshape(olda + oldb)
where the previous code calculated newaxes_.. and newshape....
apply_along_axis constructs a (0...,:,0...) index tuple
i = zeros(nd, 'O')
i[axis] = slice(None, None)
i.put(indlist, ind)
....arr[tuple(i.tolist())]
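Putting the index-tuple idea into a small helper gives something like the sketch below (take_slice is a made-up name). np.take with an axis argument returns the same values, though as a copy rather than a view:

```python
import numpy as np

def take_slice(A, idx, axis):
    """Return A[:, ..., idx, ..., :] with idx placed on the given axis."""
    index = [slice(None)] * A.ndim
    index[axis] = idx
    return A[tuple(index)]

A = np.arange(7 * 8 * 9).reshape((7, 8, 9))

s_view = take_slice(A, 2, 1)      # basic indexing, so this is a view of A
s_copy = np.take(A, 2, axis=1)    # same values, freshly allocated

print(s_view.shape, np.array_equal(s_view, s_copy))
print(np.shares_memory(s_view, A), np.shares_memory(s_copy, A))
```

Because the helper only builds a tuple and indexes once, it works unchanged for any number of dimensions.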
To index a dimension dynamically, you can use swapaxes, as shown below:
a = np.arange(7 * 8 * 9).reshape((7, 8, 9))
axis = 1
idx = 2
np.swapaxes(a, 0, axis)[idx]
Runtime comparison
Natural method (non dynamic) :
%timeit a[:, idx, :]
300 ns ± 1.58 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
swapaxes:
%timeit np.swapaxes(a, 0, axis)[idx]
752 ns ± 4.54 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Index with list comprehension:
%timeit a[tuple(idx if i==axis else slice(None) for i in range(a.ndim))]
This is python. You could simply use eval() like this:
def get_by_axis(a, idx, axis):
    indexing_list = a.ndim*[':']
    indexing_list[axis] = str(idx)
    expression = f"a[{', '.join(indexing_list)}]"
    return eval(expression)
Obviously, in that case you must not accept input from untrusted users.
I want to calculate the row-wise dot product of two matrices of the same dimension as fast as possible. This is the way I am doing it:
import numpy as np
a = np.array([[1,2,3], [3,4,5]])
b = np.array([[1,2,3], [1,2,3]])
result = np.array([])
for row1, row2 in zip(a, b):
    result = np.append(result, np.dot(row1, row2))
print(result)
and of course the output is:
[14. 26.]
Straightforward way to do that is:
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)
which avoids the python loop and is faster in cases like:
def npsumdot(x, y):
    return np.sum(x*y, axis=1)
def loopdot(x, y):
    result = np.empty((x.shape[0]))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result
timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop
Check out numpy.einsum for another method:
In [52]: a
Out[52]:
array([[1, 2, 3],
[3, 4, 5]])
In [53]: b
Out[53]:
array([[1, 2, 3],
[1, 2, 3]])
In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])
Looks like einsum is a bit faster than inner1d:
In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop
In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop
In [96]: a = random.randn(10, 100)
In [97]: b = random.randn(10, 100)
In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop
In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop
Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.
Played around with this and found inner1d the fastest. That function however is internal, so a more robust approach is to use
numpy.einsum("ij,ij->i", a, b)
Even better is to align your memory such that the summation happens in the first dimension, e.g.,
a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)
For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.
Note also that the transposed summation is much faster than its untransposed equivalent.
The plot was created with perfplot (a small project of mine)
import numpy
from numpy.core.umath_tests import inner1d
import perfplot
def setup(n):
a = numpy.random.rand(n, 3)
b = numpy.random.rand(n, 3)
aT = numpy.ascontiguousarray(a.T)
bT = numpy.ascontiguousarray(b.T)
return (a, b), (aT, bT)
b = perfplot.bench(
setup=setup,
n_range=[2 ** k for k in range(1, 25)],
kernels=[
lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
lambda data: inner1d(data[0][0], data[0][1]),
],
labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)
You'll do better avoiding the append, but I can't think of a way to avoid the python loop. A custom Ufunc perhaps? I don't think numpy.vectorize will help you here.
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
result=np.empty((2,))
for i in range(2):
    result[i] = np.dot(a[i],b[i])
print(result)
EDIT
Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.
from numpy.core.umath_tests import inner1d
inner1d(a,b) # array([14, 26])
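Since inner1d lives in an internal test module and may not be importable in every NumPy installation, here is a short sketch of equivalents that use only the public API, on the same a and b as above:

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5]])
b = np.array([[1, 2, 3], [1, 2, 3]])

r_sum = np.sum(a * b, axis=1)                          # multiply, then reduce
r_einsum = np.einsum('ij,ij->i', a, b)                 # fused multiply-reduce
r_matmul = (a[:, None, :] @ b[:, :, None])[:, 0, 0]    # batched 1x3 @ 3x1

print(r_sum, r_einsum, r_matmul)
```

All three give the row-wise dot products 14 and 26 for this input.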
I came across this answer and re-verified the results with Numpy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see example below), all but one of the methods are so close to one another that the performance difference is meaningless.
For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.
My large matrix example:
import numpy as np
from numpy.core.umath_tests import inner1d
a = np.random.randn(100, 1000000) # 800 MB each
b = np.random.randn(100, 1000000) # pretty big.
def loop_dot(a, b):
    result = np.empty((a.shape[0],))
    for i, (row1, row2) in enumerate(zip(a, b)):
        result[i] = np.dot(row1, row2)
    return result
%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b) # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!