I have the following operation:
import pandas as pd
import numpy as np
def some_calc(x, y):
    x = x.set_index('Cat')
    y = y.set_index('Cat')
    y = np.sqrt(y['data_point2'])
    vec = pd.DataFrame(x['data_point1'] * y)
    grid = np.random.rand(len(x), len(x))
    result = vec.dot(vec.T).mul(grid).sum().sum()
    return result

sample_size = 100
cats = ['a', 'b', 'c', 'd']
df1 = pd.DataFrame({'Cat': [cats[np.random.randint(4)] for _ in range(sample_size)],
                    'data_point1': np.random.rand(sample_size),
                    'data_point2': np.random.rand(sample_size)})
df2 = df1.groupby('Cat').sum().reset_index()
I would like to run some_calc across each of the df2 rows using the corresponding data points from df1.
The code below works well:
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]), axis=1)
(I reset the index in df2 because I don't know how to apply across indices.
Also, I'm passing Cat as the index field together with each data_point vector to some_calc because without an index, v.dot(v.T) collapses the dot product into a single number. That then errors in .mul(), since I need the full MxM matrix rather than a float. I might be doing something wrong here, though...)
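To illustrate the point with a toy example (not part of the calculation above): a Series dotted with itself collapses to a scalar, whereas a one-column DataFrame dotted with its transpose keeps the full MxM shape:
s = pd.Series([1.0, 2.0, 3.0])
s.dot(s)            # 14.0 -- collapses to a single float
v = s.to_frame()    # shape (3, 1)
v.dot(v.T)          # 3x3 outer-product matrix, which is what .mul(grid) needs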
I'm currently exploring how I can vectorize the above so that the calculation doesn't slow down as sample_size grows.
I saw in previous threads that you can toggle raw=True so that the function receives np.array inputs instead of pd.Series.
df2['ApplyRaw'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                                y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]), axis=1, raw=True)
However, it throws an error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I tried omitting Cat from the argument but still the same issue.
Are there any code improvements or tricks I can employ that allow me to vectorize the above?
Or do I have to amend some_calc?
I'm not sure if it's possible to vectorize your function since it's a bit complex. However, some_calc itself and how it is called can be optimized.
What
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]), axis=1)
does is basically the same as a groupby. So instead of creating these groups and applying the function on them, use groupby + apply. Simplifying the some_calc function as well, we get:
def some_calc(df):
    x = df['data_point1'].values
    y = np.sqrt(df['data_point2'].values)
    vec = (x * y).reshape(-1, 1)
    grid = np.random.rand(len(x), len(x))
    result = (vec @ vec.T * grid).sum()
    return result
apply = df1.groupby('Cat').apply(some_calc)
apply.name = 'Apply'
df2.merge(apply, left_on='Cat', right_index=True)
The final merge is just to add the results to the df2 dataframe.
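If you prefer to avoid the merge, mapping the grouped result back by category should give the same column (using the apply Series computed above):
df2['Apply'] = df2['Cat'].map(apply)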
Timings:
# original
20.5 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# above code
3.62 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'm trying to combine and average multiple data series (10-100 per call), each of approx shape=(1,100). I want to average the values element-wise and output a series of equal length, i.e. output[i] = mean(series0[i], series1[i], series2[i], ...). This will be called ~10k times a day in early production, hopefully much more later, so I'm interested in wider tips or references if possible.
Currently the existing code in development leans heavily on pandas for its readability, but it is easily amended to output pandas.Series, python3 lists, or numpy.arrays, so anything goes. At a guess, I imagine some or all of the pandas will eventually be cut in favour of numpy.arrays and lists/dicts for speed/memory/cost reasons. I know enough to write the code below, and just about enough to suspect that a list comprehension may be a good contender, but I'm very much learning as I go, so please be gentle.
I could find posts on merge/concat speeds, but rarely is this combined with further functions. So... suggestions on faster ways to produce an average series?
import numpy as np
series_length = 100
repeats=10
def foo(length):
    return np.random.randint(0, 500, length, int)

# produce a list n long, each containing a len=100 series/array/list (format optional) of integers
results = []
for i in range(repeats):
    results.append(foo(series_length))

def some_code_here(data):
    avg_results = [np.mean([series[i] for series in data]) for i in range(series_length)]
    return avg_results

# Output length = series_length
final_solution = some_code_here(results)
You could use np.stack to create a single array from your data and then take the mean along axis = 0, which results in a speed improvement of ~45x on my machine:
import numpy as np
avg_results = np.stack(results).mean(axis=0)
Timings:
series_length = 100
repeats = 10
rng = np.random.default_rng()
results = [rng.integers(0, 500, series_length) for _ in range(repeats)]
#Check they're the same
assert ([np.mean([series[i] for series in results]) for i in range(series_length)] == np.stack(results).mean(axis=0)).all()
#OP
%timeit [np.mean([series[i] for series in results]) for i in range(series_length)]
#Me
%timeit np.stack(results).mean(axis=0)
Output:
787 µs ± 6.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
17.6 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
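As a side note, since the series all have the same length you can also pass the list straight to np.mean and let NumPy do the stacking internally; the result should be identical:
avg_results = np.mean(results, axis=0)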
I have two numpy arrays: R with dimensions S x F and W with dimensions N x M x F. To get concrete, let's assign the following values: N = 5, M = 7, F = 3, S = 4.
The array R contains a collection of S = 4 samples with F = 3 features each. Each row represents a sample and each column a feature, so R[0] is the first sample, R[1] the second, and so on. Each R[i] entry contains F elements, for example R[0] = np.array([1, 4, -2]).
Here is a small snippet to initialize all those values, with an MWE in mind:
import numpy as np
# Size of Map (rows, columns)
N, M = 5, 7
# Number of features
F = 3
# Sample size
S = 4
np.random.seed(13)
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
We can also view a given "depth line" of the array W as a vector with the same dimension as each row of R (this is easy to see by comparing the size of the last dimension of both arrays). For example, W[2, 3] gives np.array([2, 2, -1]) (the values here are just examples).
I created a simple function to calculate the distance from a given vector r to each "depth line" of W and then return the position of the nearest depth line:
def nearest_vector_matrix_naive(r, W):
    delta = np.zeros((N, M), dtype=int)
    for i in range(N):
        for j in range(M):
            norm = 0
            for k in range(F):
                norm += (r[k] - W[i, j, k])**2
            delta[i, j] = norm
    win_idx = np.unravel_index(np.argmin(delta, axis=None), delta.shape)
    return win_idx
Of course this is a very naive approach, that I could further optimize to the code below, obtaining a HUGE performance boost.
def nearest_vector_matrix(r, W):
    delta = np.sum((W[:,:] - r)**2, axis=2)
    return np.unravel_index(np.argmin(delta, axis=None), delta.shape)
I can use this function simply as:
nearest_idx = nearest_vector_matrix(R[0], W)
# Returns the nearest vector in W to R[0]
W[nearest_idx]
Since I have the array R with a bunch of samples, I use the following snippet to calculate the nearest vectors for an array of samples:
def nearest_samples_matrix(R, W):
    DELTA = np.zeros((R.shape[0], 2))
    for idx, r in enumerate(R):
        delta = np.sum((W[:,:] - r)**2, axis=2)
        DELTA[idx] = np.unravel_index(np.argmin(delta, axis=None), delta.shape)
    return DELTA
This function returns an array with S rows (S being the number of samples) of 2D indexes; that is, DELTA always has shape (S, 2).
I would like to know how I can replace the for loop inside nearest_samples_matrix (for example with broadcasting) to improve the execution performance even further.
I could not figure out how to do it, even though I managed it in the first case.
The best solution depends on the input size of the arrays.
For low dimensional problems (dim < 20 or so), a kd-tree approach is usually the way to go. There are quite a lot of answers on this topic, e.g. one I wrote a few weeks ago.
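For reference, a minimal sketch of the kd-tree idea applied to the arrays above (assuming SciPy is available; the function name is just illustrative):
import numpy as np
from scipy.spatial import cKDTree

def nearest_samples_matrix_kdtree(R, W):
    # build the tree once over all N*M depth lines of W (each of length F)
    tree = cKDTree(W.reshape(-1, W.shape[2]))
    _, flat_idx = tree.query(R)                      # nearest flat index for every sample
    return np.column_stack(np.unravel_index(flat_idx, W.shape[:2]))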
If the dimensionality of the problem is too high, you can switch to brute-force algorithms. Both of the following algorithms are much faster than your optimized approach, but on larger inputs and low dimensional problems they are much slower than a kd-tree approach (roughly O(log n) per query instead of O(n)).
Brute force 1
The following example uses an algorithm described here. It is very fast on high dimensional problems because most of the calculation is done in a highly optimized matrix-matrix multiplication routine.
The disadvantages are high memory usage (all distances are calculated in one function call) and precision problems, due to the more error-prone calculation method.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def nearest_samples_matrix_2(R, W):
    R_Temp = R
    W_Temp = W.reshape(-1, W.shape[2])
    dist = euclidean_distances(R_Temp, W_Temp)
    ind_1, ind_2 = np.unravel_index(np.argmin(dist, axis=1), shape=(W.shape[0], W.shape[1]))
    return np.vstack((ind_1, ind_2)).T
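The precision remark above refers to the expansion trick that such pairwise-distance routines typically use: ||a - b||^2 is computed as ||a||^2 + ||b||^2 - 2*a.b, which is fast (the cross term is a single matrix product) but can lose accuracy through cancellation when a and b are close. A rough sketch of the idea, not sklearn's actual code:
Rf = R.astype(float)
Wf = W.reshape(-1, W.shape[2]).astype(float)
# squared distances, shape (S, N*M)
d2 = (Rf**2).sum(1)[:, None] + (Wf**2).sum(1)[None, :] - 2 * Rf @ Wf.T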
Brute force 2
This is quite similar to your naive approach, but uses a JIT compiler (Numba) to get good performance. Temporary arrays are not necessary and the precision should be good (as long as no overflow occurs). There is room for further optimization (loop tiling) on larger input sizes.
import numpy as np
import numba as nb

# parallelization is only beneficial on larger input data
@nb.njit(fastmath=True, parallel=True, cache=True)
def nearest_samples_matrix_3(r, W):
    out = np.empty((r.shape[0], 2), dtype=np.int64)
    for x in nb.prange(r.shape[0]):
        # start with the distance to W[0, 0] as the current minimum
        ind_i = 0
        ind_j = 0
        delta = 0
        for k in range(W.shape[2]):
            delta += (r[x, k] - W[0, 0, k])**2
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                norm = 0
                for k in range(W.shape[2]):
                    norm += (r[x, k] - W[i, j, k])**2
                if norm < delta:
                    delta = norm
                    ind_i = i
                    ind_j = j
        out[x, 0] = ind_i
        out[x, 1] = ind_j
    return out
Timings
#small Arrays
N, M = 100, 200
F = 30
S = 50
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
#your function
%timeit nearest_samples_matrix(R,W)
#268 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nearest_samples_matrix_2(R,W)
#5.62 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nearest_samples_matrix_3(R,W)
#3.68 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#larger arrays
N, M = 1_000, 2_000
F = 50
S = 100
R = np.random.randint(0, 10, size=(S, F))
W = np.random.randint(-4, 5, size=(N, M, F))
#%timeit nearest_samples_matrix(R,W)
#too slow
%timeit nearest_samples_matrix_2(R,W)
#2.76 s ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nearest_samples_matrix_3(R,W)
#1.42 s ± 402 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I want to create new columns based on the elements of column Col1, which is of type set. Each element has a corresponding column name that is stored in a dict. Here is the full code:
import numpy as np
import pandas as pd
np.random.seed(123)
N = 10**4 #number of rows in the dataframe
df = pd.DataFrame({'Cnt': np.random.randint(2,10,N)})
# generate lists of random length
def f(x):
    return set(np.random.randint(101,120,x))
df['Col1'] = df['Cnt'].apply(f)
# dictionary with column names for each element in list
d = {'Item_1':101, 'Item_2':102, 'Item_3':103, 'Item_4':104, 'Item_5':105, 'Item_6':106, 'Item_7':107, 'Item_8':108,
'Item_9':109, 'Item_10':110, 'Item_11':111, 'Item_12':112, 'Item_13':113, 'Item_14':114, 'Item_15':115, 'Item_16':116,
'Item_17':117, 'Item_18':118, 'Item_19':119, 'Item_20':120}
def elem_in_set(x, e):
    return 1 if e in x else 0

def create_columns(input_data, d):
    df = input_data.copy()
    for k, v in d.items():
        df[k] = df.apply(lambda x: elem_in_set(x['Col1'], v), axis=1)
    return df
%timeit create_columns(df, d)
#5.05 s ± 78.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The problem is that the production dataframe has about 400k rows, and my solution does not scale well at all - I'm looking at around 10 minutes on my machine. The column containing all elements (Col1) could be type list instead of set, but that doesn't improve performance.
Is there a faster solution to this?
I made a small change to the apply in your create_columns. It seems to run much faster now.
import numpy as np
import pandas as pd
np.random.seed(123)
N = 10**4 #number of rows in the dataframe
df = pd.DataFrame({'Cnt': np.random.randint(2,10,N)})
# generate lists of random length
def f(x):
    return set(np.random.randint(101,120,x))
df['Col1'] = df['Cnt'].apply(f)
# dictionary with column names for each element in list
d = {'Item_1':101, 'Item_2':102, 'Item_3':103, 'Item_4':104, 'Item_5':105, 'Item_6':106, 'Item_7':107, 'Item_8':108,
'Item_9':109, 'Item_10':110, 'Item_11':111, 'Item_12':112, 'Item_13':113, 'Item_14':114, 'Item_15':115, 'Item_16':116,
'Item_17':117, 'Item_18':118, 'Item_19':119, 'Item_20':120}
def create_columns(input_data, d):
    df = input_data.copy()
    for k, v in d.items():
        df[k] = df.Col1.apply(lambda x: 1 if v in x else 0)
    return df
%timeit create_columns(df, d)
#191 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
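If that is still too slow on the 400k-row frame, a plain list comprehension over the column may shave off a bit more, since it bypasses the per-element apply machinery (a sketch along the same lines, not benchmarked here):
def create_columns_listcomp(input_data, d):
    df = input_data.copy()
    col = df['Col1'].tolist()          # pull the sets out of pandas once
    for k, v in d.items():
        df[k] = [1 if v in s else 0 for s in col]
    return df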
I want to obtain a 2D slice of a 3D array along a given direction, where the direction (i.e. the axis from which the slice is extracted) is given by another variable.
Assuming idx is the index of the 2D slice and direction is the axis along which to take it, the initial approach would be:
if direction == 0:
    return A[idx, :, :]
elif direction == 1:
    return A[:, idx, :]
else:
    return A[:, :, idx]
I'm pretty sure there must be a way of doing this without doing conditionals, or at least, not in raw python. Does numpy have a shortcut for this?
The best solution I've found so far (for doing it dynamically) relies on the transpose operation:
# for 3 dimensions [0, 1, 2] and direction == 1 --> [1, 0, 2]
tr = [direction] + list(range(A.ndim))
del tr[direction + 1]
return np.transpose(A, tr)[idx]
But I wonder if there is any better/easier/faster function for this, since for 3D the transpose code almost looks more awful than the three if/elif branches. It generalizes better for ND, and the larger the N the better the code looks in comparison, but for 3D it's much the same.
Transpose is cheap (timewise). There are numpy functions that use it to move the operational axis (or axes) to a known location - usually the front or end of the shape list. tensordot is one that comes to mind.
Other functions construct an indexing tuple. They may start with a list or array for ease of manipulation, and then turn it into a tuple for application. For example
I = [slice(None)]*A.ndim
I[axis] = idx
A[tuple(I)]
np.apply_along_axis does something like that. It's instructive to look at the code for functions like this.
I imagine the writers of the numpy functions worried most about whether it works robustly, secondly about speed, and lastly about whether it looks pretty. You can bury all kinds of ugly code in a function!
tensordot ends with
at = a.transpose(newaxes_a).reshape(newshape_a)
bt = b.transpose(newaxes_b).reshape(newshape_b)
res = dot(at, bt)
return res.reshape(olda + oldb)
where the previous code calculated newaxes_.. and newshape....
apply_along_axis constructs a (0...,:,0...) index tuple
i = zeros(nd, 'O')
i[axis] = slice(None, None)
i.put(indlist, ind)
....arr[tuple(i.tolist())]
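Wrapped up as a small helper for your own code, the indexing-tuple idea might look like this (an illustrative sketch; the name is made up):
import numpy as np

def slice_along_axis(A, idx, axis):
    # build the (:, ..., idx, ..., :) index with idx in the requested position
    I = [slice(None)] * A.ndim
    I[axis] = idx
    return A[tuple(I)]

A = np.arange(7 * 8 * 9).reshape(7, 8, 9)
assert (slice_along_axis(A, 2, axis=1) == A[:, 2, :]).all()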
To index a dimension dynamically, you can use swapaxes, as shown below:
a = np.arange(7 * 8 * 9).reshape((7, 8, 9))
axis = 1
idx = 2
np.swapaxes(a, 0, axis)[idx]
Runtime comparison
Natural method (non dynamic) :
%timeit a[:, idx, :]
300 ns ± 1.58 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
swapaxes:
%timeit np.swapaxes(a, 0, axis)[idx]
752 ns ± 4.54 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Index with list comprehension:
%timeit a[tuple([idx if i==axis else slice(None) for i in range(a.ndim)])]
This is python. You could simply use eval() like this:
def get_by_axis(a, idx, axis):
    indexing_list = a.ndim * [':']
    indexing_list[axis] = str(idx)
    expression = f"a[{', '.join(indexing_list)}]"
    return eval(expression)
Obviously, only do this if you do not accept input from untrusted users.
I want to calculate the row-wise dot product of two matrices of the same dimension as fast as possible. This is the way I am doing it:
import numpy as np
a = np.array([[1,2,3], [3,4,5]])
b = np.array([[1,2,3], [1,2,3]])
result = np.array([])
for row1, row2 in zip(a, b):
    result = np.append(result, np.dot(row1, row2))
print(result)
and of course the output is:
[ 14.  26.]
Straightforward way to do that is:
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)
which avoids the python loop and is faster in cases like:
def npsumdot(x, y):
    return np.sum(x * y, axis=1)

def loopdot(x, y):
    result = np.empty((x.shape[0]))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result
timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop
Check out numpy.einsum for another method:
In [52]: a
Out[52]:
array([[1, 2, 3],
[3, 4, 5]])
In [53]: b
Out[53]:
array([[1, 2, 3],
[1, 2, 3]])
In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])
Looks like einsum is a bit faster than inner1d:
In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop
In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop
In [96]: a = random.randn(10, 100)
In [97]: b = random.randn(10, 100)
In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop
In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop
Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.
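For example, a minimal harness for redoing the comparison on your own NumPy version might look like this (einsum vs. the elementwise-product sum; inner1d is left out since numpy.core.umath_tests is not a public API):
import timeit
import numpy as np

a = np.random.randn(10, 100)
b = np.random.randn(10, 100)

for label, stmt in [("einsum", "np.einsum('ij,ij->i', a, b)"),
                    ("sum   ", "(a * b).sum(axis=1)")]:
    t = timeit.timeit(stmt, globals=globals(), number=100_000)
    print(f"{label}: {t / 100_000 * 1e6:.2f} us per call")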
Played around with this and found inner1d the fastest. That function however is internal, so a more robust approach is to use
numpy.einsum("ij,ij->i", a, b)
Even better is to align your memory such that the summation happens in the first dimension, e.g.,
a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)
For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.
Note also that the transposed summation is much faster than its untransposed equivalent.
The plot was created with perfplot (a small project of mine)
import numpy
from numpy.core.umath_tests import inner1d
import perfplot

def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    aT = numpy.ascontiguousarray(a.T)
    bT = numpy.ascontiguousarray(b.T)
    return (a, b), (aT, bT)

b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(1, 25)],
    kernels=[
        lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
        lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
        lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
        lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
        lambda data: inner1d(data[0][0], data[0][1]),
    ],
    labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
    xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)
You'll do better avoiding the append, but I can't think of a way to avoid the python loop. A custom Ufunc perhaps? I don't think numpy.vectorize will help you here.
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
result = np.empty((2,))
for i in range(2):
    result[i] = np.dot(a[i], b[i])
print(result)
EDIT
Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.
from numpy.core.umath_tests import inner1d
inner1d(a,b) # array([14, 26])
I came across this answer and re-verified the results with Numpy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see example below), all but one of the methods are so close to one another that the performance difference is meaningless.
For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.
My large matrix example:
import numpy as np
from numpy.core.umath_tests import inner1d
a = np.random.randn(100, 1000000) # 800 MB each
b = np.random.randn(100, 1000000) # pretty big.
def loop_dot(a, b):
    result = np.empty((a.shape[0],))
    for i, (row1, row2) in enumerate(zip(a, b)):
        result[i] = np.dot(row1, row2)
    return result
%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b) # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!