Need pandas optimized code with 1 million stock data - pandas

Currently my code is:
self.df['sma'] = self.df['Close'].rolling(window=30).mean()
self.df['cma'] = self.df.apply(lambda x: self.get_cma(x), axis=1)

def get_cma(self, candle):
    if np.isnan(candle['sma']):
        return np.nan
    secma = (candle['sma'] - self.previous_cma if self.previous_cma is not None else 0) ** 2
    ka = 1 - (candle['var']/secma) if candle['var'] < secma else 0
    cma = ((ka * candle['sma']) + ((1 - ka) * self.previous_cma)) if self.previous_cma is not None else candle[self.src]
    self.previous_cma = cma
    return cma

Can the above be optimized to make it faster?

As you may already know, the secret to performance with Pandas is to do this in vectorized form. This means no apply. Here are the first few steps you need to take to speed up your code, by extracting parts of your get_cma() function to their vectorized equivalents.
if np.isnan(candle['sma']):
    return np.nan
This early exit is not needed in get_cma(), we can do this instead:
self.df['cma'] = np.nan
valid = self.df['sma'].notnull()
# this comment is a placeholder for step 2
self.df.loc[valid, 'cma'] = self.df[valid].apply(self.get_cma, axis=1)
This not only vectorizes the first two lines of get_cma(), it means get_cma() is now only called on not-null rows, rather than every row. Depending on your data that alone may provide a noticeable speedup.
If that's not enough, we need a bigger hammer. The fundamental problem is that each iteration of get_cma() depends on the previous, so it is not easy to vectorize. So let's use Numba to JIT compile the code. First we need to get rid of apply by using a good old for loop over the individual columns, which is equivalent (and will still be slow). Note this is a free (global) function, not a member function, and it takes NumPy arrays instead of Pandas types, because those are what Numba understands:
def get_cma(sma, var, src):
    cma = np.empty_like(sma)
    # take care of the initial value first, to avoid unnecessary branches later
    cma[0] = src[0]
    # now do all remaining rows, cma[ii-1] is previous_cma and is never None
    for ii in range(1, len(sma)):
        secma = (sma[ii] - cma[ii-1]) ** 2
        ka = 1 - (var[ii] / secma) if var[ii] < secma else 0
        cma[ii] = (ka * sma[ii]) + ((1 - ka) * cma[ii-1])
    return cma
Call it like this, passing the required columns as NumPy arrays:
valid_rows = self.df[valid]
self.df.loc[valid, 'cma'] = get_cma(
    valid_rows['sma'].to_numpy(),
    valid_rows['var'].to_numpy(),
    valid_rows[self.src].to_numpy())
Finally, after confirming the code works, decorate get_cma() to compile it with Numba automatically like this:
import numba

@numba.njit
def get_cma(sma, var, src):
    ...  # body unchanged from above
That's it. Please let us know how much faster this runs on your real data. I expect it will be plenty fast enough.
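One way to confirm the loop works before and after adding the decorator is to run it on a tiny input whose result can be checked by hand. The sketch below is only an illustration: the toy arrays are made up, and the `try`/`except` fallback (an assumption added here so the snippet also runs where Numba is not installed) simply leaves the function uncompiled.

```python
import numpy as np

try:
    from numba import njit  # optional: compile the loop when Numba is available
except ImportError:
    def njit(func):         # fallback: run the same loop as plain Python
        return func


@njit
def get_cma(sma, var, src):
    cma = np.empty_like(sma)
    cma[0] = src[0]                      # initial value comes from the source column
    for ii in range(1, len(sma)):
        secma = (sma[ii] - cma[ii - 1]) ** 2
        ka = 1.0 - var[ii] / secma if var[ii] < secma else 0.0
        cma[ii] = ka * sma[ii] + (1.0 - ka) * cma[ii - 1]
    return cma


# Made-up toy data, small enough to verify by hand.
sma = np.array([10.0, 12.0, 11.0, 15.0])
var = np.array([1.0, 1.0, 4.0, 4.0])
src = np.array([10.0, 12.5, 10.5, 15.5])

cma = get_cma(sma, var, src)
print(cma)
```

By hand: row 1 gives secma = 4 and ka = 0.75, so cma = 11.5; row 2 gives secma = 0.25, which is below var, so ka = 0 and cma stays at 11.5; row 3 gives ka = 33/49 and cma = 679/49.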


How could I speed up my written python code: spheres contact detection (collision) using spatial searching

I am working on a spatial search problem for spheres in which I want to find connected spheres. To do so, I search around each sphere for spheres whose centers lie within one maximum sphere diameter of the searching sphere's center. At first I tried SciPy methods, but the SciPy method takes longer than the equivalent NumPy method. With SciPy, I first determine the number K of nearest spheres and then find them with cKDTree.query, which costs more time. However, it is slower than the NumPy method even when the first step is replaced by a constant value (which is not a good idea in this case). This contradicts my expectations about SciPy's spatial search speed. So I tried replacing some NumPy lines with list loops so that I could speed things up with numba prange. Numba runs the code a little faster, but I believe this code can be optimized further, perhaps by vectorization, by using alternative NumPy modules, or by using numba in another way. I iterate over all spheres to prevent probable memory leaks and … when the number of spheres is high.
import numpy as np
import numba as nb
from scipy.spatial import cKDTree, distance

# ---------------------------- input data ----------------------------
""" For testing by prepared files:
radii = np.load('a.npy')     # shape: (n-spheres, )     must be loaded by np.load('a.npy') or np.loadtxt('radii_large.csv')
poss = np.load('b.npy')      # shape: (n-spheres, 3)    must be loaded by np.load('b.npy') or np.loadtxt('pos_large.csv', delimiter=',')
"""
rnd = np.random.RandomState(70)
data_volume = 200000
radii = rnd.uniform(0.0005, 0.122, data_volume)
dia_max = 2 * radii.max()
x = rnd.uniform(-1.02, 1.02, (data_volume, 1))
y = rnd.uniform(-3.52, 3.52, (data_volume, 1))
z = rnd.uniform(-1.02, -0.575, (data_volume, 1))
poss = np.hstack((x, y, z))
# --------------------------------------------------------------------

# @nb.jit('float64[:,::1](float64[:,::1], float64[::1])', forceobj=True, parallel=True)
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = np.array([], dtype=np.float64)
    ends_ind = np.empty([1, 2], dtype=np.int64)
    """ using list looping """
    # particle_corsp_overlaps = []
    # ends_ind = []

    # for particle_idx in nb.prange(len(poss)):  # by list looping
    for particle_idx in range(len(poss)):
        unshared_idx = np.delete(np.arange(len(poss)), particle_idx)  # <--- relatively high time consumer
        poss_without = poss[unshared_idx]

        """ # SCIPY method ---------------------------------------------------------------------------------------------
        nears_i_ind = cKDTree(poss_without).query_ball_point(poss[particle_idx], r=dia_max)  # <--- high time consumer
        if len(nears_i_ind) > 0:
            dist_i, dist_i_ind = cKDTree(poss_without[nears_i_ind]).query(poss[particle_idx], k=len(nears_i_ind))  # <--- high time consumer
            if not isinstance(dist_i, float):
                dist_i[dist_i_ind] = dist_i.copy()
        """  # NUMPY method --------------------------------------------------------------------------------------------
        lx_limit_idx = poss_without[:, 0] <= poss[particle_idx][0] + dia_max
        ux_limit_idx = poss_without[:, 0] >= poss[particle_idx][0] - dia_max
        ly_limit_idx = poss_without[:, 1] <= poss[particle_idx][1] + dia_max
        uy_limit_idx = poss_without[:, 1] >= poss[particle_idx][1] - dia_max
        lz_limit_idx = poss_without[:, 2] <= poss[particle_idx][2] + dia_max
        uz_limit_idx = poss_without[:, 2] >= poss[particle_idx][2] - dia_max
        nears_i_ind = np.where(lx_limit_idx & ux_limit_idx & ly_limit_idx & uy_limit_idx & lz_limit_idx & uz_limit_idx)[0]
        if len(nears_i_ind) > 0:
            dist_i = distance.cdist(poss_without[nears_i_ind], poss[particle_idx][None, :]).squeeze()  # <--- relatively high time consumer
            # """  # -------------------------------------------------------------------------------------------------------
            contact_check = dist_i - (radii[unshared_idx][nears_i_ind] + radii[particle_idx])
            connected = contact_check[contact_check <= 0]

            particle_corsp_overlaps = np.concatenate((particle_corsp_overlaps, connected))
            """ using list looping """
            # if len(connected) > 0:
            #     for value_ in connected:
            #         particle_corsp_overlaps.append(value_)

            contacts_ind = np.where([contact_check <= 0])[1]
            contacts_sec_ind = np.array(nears_i_ind)[contacts_ind]
            sphere_olps_ind = np.where((poss[:, None] == poss_without[contacts_sec_ind][None, :]).all(axis=2))[0]  # <--- high time consumer

            ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
            if particle_idx > 0:
                ends_ind = np.concatenate((ends_ind, ends_ind_mod_temp))
            else:
                ends_ind[0, 0], ends_ind[0, 1] = ends_ind_mod_temp[0, 0], ends_ind_mod_temp[0, 1]
            """ using list looping """
            # for contacted_idx in sphere_olps_ind:
            #     ends_ind.append([particle_idx, contacted_idx])

    # ends_ind_org = np.array(ends_ind)  # using lists
    ends_ind_org = ends_ind
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)  # <--- relatively high time consumer
    gap = np.array(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
In one of my tests on 23000 spheres, the scipy, numpy, and numba-aided methods finished the loop in about 400, 200, and 180 seconds respectively on Colab TPU; for 500.000 spheres it takes 3.5 hours. These execution times are not satisfying at all for my project, where the number of spheres may be up to 1.000.000 in a medium data volume. I will call this code many times in my main code, and I am looking for ways to run it in milliseconds (as fast as possible). Is it possible?
I would appreciate it if anyone could speed up the code as needed.
This code must be executable with python 3.7+, on CPU and GPU.
This code must be applicable to data sizes of at least 300.000 spheres.
Any numpy, scipy, and … equivalents of the parts I wrote that make my code significantly faster will be upvoted.
I would appreciate any recommendations or explanations about:
Which method could be faster for this problem?
Why is scipy not faster than the other methods in this case, and where could it be helpful for this problem?
Choosing between iterative methods and matrix-form methods is confusing for me. Iterative methods use less memory and can be used and tuned with numba and …, but I think they are not useful or comparable with matrix methods (which depend on memory limits) like numpy and … for huge numbers of spheres. For this case, perhaps I could replace the iteration with numpy, but I strongly suspect that it cannot be handled because of the huge matrix operations and memory leaks.
Prepared sample test data:
Poss data: 23000, 500000
Radii data: 23000, 500000
Line by line speed test logs: for two test cases scipy method and numpy time consumption.
UPDATE: this answer is now superseded by this new one (which takes into account the updates to the question), providing even faster code based on a different approach.
Step 1: better algorithm
First of all, building a k-d tree runs in O(n log n) time and doing a query runs in O(log n) time, where n is the number of points. So using a k-d tree seems a good idea at first glance. However, your code builds a k-d tree for each point, resulting in O(n² log n) time. This is why the Scipy solution is slower than the others. The thing is that Scipy does not provide a way to update a k-d tree, and it appears that updating a k-d tree efficiently is not possible. Fortunately, this is not a problem in your case: you can just build one k-d tree with all the points once and then discard the current point from the result of each query.
Moreover, the computation of sphere_olps_ind runs in O(n² m) time, where n is the total number of points and m is the average number of neighbours (i.e. the closest points retrieved from the k-d tree query). Assuming there are no duplicate points, sphere_olps_ind is simply equal to np.sort(contacts_sec_ind). The latter runs in O(m log m), which is drastically better.
Additionally, using np.concatenate in a loop to append values to a Numpy array is slow because it creates a new, bigger array at each iteration. Using a list was a good idea, but appending Numpy arrays directly to a list and then calling np.concatenate once is much faster.
Here is the resulting code:
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = []
    ends_ind = [np.empty([1, 2], dtype=np.int64)]

    kdtree = cKDTree(poss)

    for particle_idx in range(len(poss)):
        # Find the nearest point including the current one and
        # then remove the current point from the output.
        # The distances can be computed directly without a new query.
        cur_point = poss[particle_idx]
        nears_i_ind = np.array(kdtree.query_ball_point(cur_point, r=dia_max), dtype=np.int64)
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue  # only the point itself was found

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = distance.cdist(poss[nears_i_ind], cur_point[None, :]).squeeze()

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        if particle_idx > 0:
            ends_ind.append(ends_ind_mod_temp)
        else:
            ends_ind[0][:] = ends_ind_mod_temp[0, 0], ends_ind_mod_temp[0, 1]

    ends_ind_org = np.concatenate(ends_ind)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)  # <--- relatively high time consumer
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
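The point about np.concatenate can be seen in isolation with a small sketch. The chunk sizes and values below are made up; the two loops produce the same array, but the first one copies everything accumulated so far at every iteration, while the second one copies each element once.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-ins for the per-particle `connected` arrays (some are empty).
chunks = [rng.random(int(rng.integers(0, 5))) for _ in range(1000)]

# Slow pattern: the array is reallocated and fully copied on every iteration.
grown = np.array([], dtype=np.float64)
for chunk in chunks:
    grown = np.concatenate((grown, chunk))

# Faster pattern: collect the pieces in a list and concatenate once at the end.
pieces = []
for chunk in chunks:
    pieces.append(chunk)
joined = np.concatenate(pieces)

print(np.array_equal(grown, joined))
```

With k chunks of average length m, the first loop moves on the order of k² m elements in total, the second on the order of k m.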
Step 2: optimization
First of all, the query_ball_point call can be done on all the points at once, in parallel, by passing poss to the Scipy method and specifying the parameter workers=-1. However, note that this requires more memory.
Moreover, Numba can be used to significantly speed up the computation. The main parts that can be improved are the computation of the distances, the creation of many unnecessary temporary arrays, and the use of direct Numpy array indexing instead of list appends (since an upper bound on the size of the output array is known after the query_ball_point call).
Here is a simple example of optimized code using Numba:
@nb.jit('(float64[:, ::1], int64[::1], int64[::1], float64)')
def compute(poss, all_neighbours, all_neighbours_sizes, dia_max):
    particle_corsp_overlaps = []
    ends_ind_lst = [np.empty((1, 2), dtype=np.int64)]
    an_offset = 0

    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        cur_len = all_neighbours_sizes[particle_idx]
        nears_i_ind = all_neighbours[an_offset:an_offset+cur_len]
        an_offset += cur_len
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue  # only the point itself was found

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = np.empty(len(nears_i_ind), dtype=np.float64)

        # Compute the distances
        x1, y1, z1 = poss[particle_idx]
        for i in range(len(nears_i_ind)):
            x2, y2, z2 = poss[nears_i_ind[i]]
            dist_i[i] = np.sqrt((x2-x1)**2 + (y2-y1)**2 + (z2-z1)**2)

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where(contact_check <= 0)
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.empty((len(sphere_olps_ind), 2), dtype=np.int64)
        for i in range(len(sphere_olps_ind)):
            ends_ind_mod_temp[i, 0] = particle_idx
            ends_ind_mod_temp[i, 1] = sphere_olps_ind[i]

        if particle_idx > 0:
            ends_ind_lst.append(ends_ind_mod_temp)
        else:
            tmp = ends_ind_lst[0]
            tmp[:] = ends_ind_mod_temp[0, :]

    return particle_corsp_overlaps, ends_ind_lst

def ends_gap(poss, dia_max):
    kdtree = cKDTree(poss)
    tmp = kdtree.query_ball_point(poss, r=dia_max, workers=-1)
    all_neighbours = np.concatenate(tmp, dtype=np.int64)
    all_neighbours_sizes = np.array([len(e) for e in tmp], dtype=np.int64)
    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes, dia_max)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org

ends_gap(poss, dia_max)
Performance analysis
Here are the performance results on my 6-core machine (with a i5-9600KF processor) on the small dataset:
Initial code with Scipy: 259 s
Initial default code with Numpy: 112 s
Optimized algorithm: 1.37 s
Final optimized code: 0.22 s
Unfortunately, the Scipy k-d tree is too big to fit in memory on my machine with the big dataset.
Thus the Numba implementation with an efficient algorithm is up to ~510 times faster than the initial Numpy implementation and ~1200 times faster than the initial Scipy implementation.
The Numba code can be further optimized, but please note that the Numba compute call takes less than 25% of the time on my machine. The np.unique call is the most expensive, but it is not easy to make it faster. A significant part of the time is spent in the Scipy-to-Numba data conversion, but this conversion is mandatory as long as Scipy is used. Thus, the code can be improved a bit (e.g. probably 2x faster) with advanced Numba optimizations, but if you need much faster code, you need to use a native language like C++ and a highly-optimized parallel k-d tree implementation. I expect very optimized native code to be an order of magnitude faster, but not much more. I doubt the big dataset can be processed in less than 10 ms on my machine regardless of the implementation.
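The Scipy-to-Numba conversion mentioned above amounts to flattening a ragged list of neighbour lists into two flat arrays, which is the layout the compute function walks with an offset. A minimal sketch, where the `neighbours` list is a made-up stand-in for the output of query_ball_point and `neighbours_of` is a hypothetical helper added only for illustration:

```python
import numpy as np

# Made-up ragged result: neighbours[i] lists the point indices found around point i.
neighbours = [[0, 2], [1], [0, 2, 3], [2, 3]]

# One flat array of all indices plus one array with the length of each list.
sizes = np.fromiter(map(len, neighbours), count=len(neighbours), dtype=np.int64)
values = np.concatenate([np.asarray(n, dtype=np.int64) for n in neighbours])

# Start position of each list inside `values` (exclusive prefix sum of the sizes).
offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))


def neighbours_of(i):
    """Return a view of the neighbours of point i without any Python list."""
    return values[offsets[i]:offsets[i] + sizes[i]]


print(neighbours_of(2))
```

Only contiguous typed arrays cross into the compiled function this way, which is why the conversion cost is paid once up front rather than inside the loop.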
Note that gap differs between the provided functions (other values are left unchanged). However, the same thing happens between the initial Scipy method and the Numpy one. This appears to come from the ordering of variables like nears_i_ind and dist_i, which is undefined by Scipy and changes the gap result in a non-trivial way (not just the order of gap). I am not sure whether this is a problem in the initial implementation. Because of that, it is much harder to compare the correctness of the different implementations.
forceobj should not be used in production as the documentation states this is only useful for testing purposes.
Based on the previous answers, I designed an efficient algorithm with a much lower memory footprint that is much faster than the previous ones (especially on the large dataset). That being said, this algorithm is far more complex and pushes the limits of Python and Numba.
The key issue with the previous algorithms is that they set a dia_max threshold that is much bigger than actually required. Indeed, dia_max is derived from the maximum possible radius so as to be sure not to miss any overlap. The thing is, the big dataset contains balls of very different sizes, and some of them are huge. This means that the previous algorithms were searching a very large radius around many small balls. The result was thousands of neighbours to check per ball while only a few can truly overlap.
One solution to efficiently address this problem is to split the balls into different groups based on their size. The idea is to first sort the balls by radius, then split the sorted balls into two groups, then independently query neighbours between each possible pair of groups, then merge the data so as to apply the previous algorithm (with some additional optimizations). More specifically, the query is applied between small balls and big ones, small balls and other small ones, big balls and other big ones, and big balls and small ones.
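The effect of grouping by size can be checked by brute force on a small made-up dataset. The sketch below is not the tree-based implementation; it uses a dense distance matrix, and the point distribution, the 75% split and all variable names are assumptions chosen for illustration. It compares the number of candidate pairs under one global threshold with the number under per-group thresholds, and checks that no true contact is lost.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
centers = rng.random((n, 3))
# Mostly small balls plus a few large ones (made-up toy distribution).
radii = np.where(rng.random(n) < 0.95, 0.01, 0.15) * rng.uniform(0.5, 1.0, n)

diff = centers[:, None, :] - centers[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))
not_self = ~np.eye(n, dtype=bool)

# Ground truth: balls i and j touch or overlap.
contacts = (dist <= radii[:, None] + radii[None, :]) & not_self

# One global search radius, as in the earlier answers.
global_candidates = (dist <= 2 * radii.max()) & not_self

# Two groups: the 75% smallest balls and the rest. Ball i searches the group
# containing j within radii[i] + (largest radius in that group).
order = np.argsort(radii)
is_big = np.zeros(n, dtype=bool)
is_big[order[n * 3 // 4:]] = True
group_max = np.where(is_big, radii.max(), radii[~is_big].max())
grouped_candidates = (dist <= radii[:, None] + group_max[None, :]) & not_self

print("global candidates: ", int(global_candidates.sum()))
print("grouped candidates:", int(grouped_candidates.sum()))
print("true contacts:     ", int(contacts.sum()))
```

Because radii[j] never exceeds the largest radius of its own group, every true contact still falls inside the grouped threshold, and because that threshold never exceeds twice the global maximum radius, the grouped candidate set can only be smaller or equal.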
Another key point to speed this up is to run the different neighbour queries in parallel using joblib. This solution is far from perfect since the BallTree object needs to be duplicated, which is inefficient, but this is mandatory because of the way parallelism is currently done in CPython (i.e. GIL, pickling, etc.). Using a package that supports parallel requests could bypass this inherent limitation of CPython, but the existing packages doing that do not seem to provide an interface sufficiently useful to address this problem, or are not optimized enough to be actually useful.
Finally, the Numba code can be strongly optimized by removing almost all of the very expensive (implicit) array allocations. Using an in-place sorting algorithm optimized for small arrays also significantly improves the execution time (mainly because Numba's default implementation performs several expensive allocations and is not optimized for small arrays). In addition, the final np.unique operation can be completely rewritten as a basic loop, since the main loop iterates over balls with increasing IDs (hence already sorted).
Here is the resulting code:
import numpy as np
import numba as nb
from sklearn.neighbors import BallTree
from joblib import Parallel, delayed

def flatten_neighbours(arr):
    sizes = np.fromiter(map(len, arr), count=len(arr), dtype=np.int64)
    values = np.concatenate(arr, dtype=np.int64)
    return sizes, values

def find_neighbours(searched_pts, ref_pts, max_dist):
    balltree = BallTree(ref_pts, leaf_size=16, metric='euclidean')
    res = balltree.query_radius(searched_pts, r=max_dist)
    return flatten_neighbours(res)

def vstack_neighbours(top_infos, bottom_infos):
    top_sizes, top_values = top_infos
    bottom_sizes, bottom_values = bottom_infos
    return np.concatenate([top_sizes, bottom_sizes]), np.concatenate([top_values, bottom_values])
@nb.njit('(Tuple([int64[::1],int64[::1]]), Tuple([int64[::1],int64[::1]]), int64)')
def hstack_neighbours(left_infos, right_infos, offset):
    left_sizes, left_values = left_infos
    right_sizes, right_values = right_infos
    n = left_sizes.size
    out_sizes = np.empty(n, dtype=np.int64)
    out_values = np.empty(left_values.size + right_values.size, dtype=np.int64)
    left_cur, right_cur, out_cur = 0, 0, 0
    right_values += offset
    for i in range(n):
        left, right = left_sizes[i], right_sizes[i]
        full = left + right
        out_values[out_cur:out_cur+left] = left_values[left_cur:left_cur+left]
        out_values[out_cur+left:out_cur+full] = right_values[right_cur:right_cur+right]
        out_sizes[i] = full
        left_cur += left
        right_cur += right
        out_cur += full
    return out_sizes, out_values
@nb.njit('(int64[::1], int64[::1], int64[::1], int64[::1])')
def reorder_neighbours(in_sizes, in_values, index, reverse_index):
    n = reverse_index.size
    out_sizes = np.empty_like(in_sizes)
    out_values = np.empty_like(in_values)
    in_offsets = np.empty_like(in_sizes)
    s, cur = 0, 0

    for i in range(n):
        in_offsets[i] = s
        s += in_sizes[i]

    for i in range(n):
        in_ind = reverse_index[i]
        size = in_sizes[in_ind]
        in_offset = in_offsets[in_ind]
        out_sizes[i] = size
        for j in range(size):
            out_values[cur+j] = index[in_values[in_offset+j]]
        cur += size

    return out_sizes, out_values
@nb.njit
def small_inplace_sort(arr):
    if len(arr) < 80:
        # Basic insertion sort
        i = 1
        while i < len(arr):
            x = arr[i]
            j = i - 1
            while j >= 0 and arr[j] > x:
                arr[j+1] = arr[j]
                j = j - 1
            arr[j+1] = x
            i += 1
    else:
        # Larger arrays fall back to the default sort
        arr.sort()
@nb.jit('(float64[:, ::1], float64[::1], int64[::1], int64[::1])')
def compute(poss, radii, neighbours_sizes, neighbours_values):
    n, m = neighbours_sizes.size, np.max(neighbours_sizes)

    # Big buffers allocated with the maximum size.
    # Thanks to virtual memory, they do not take more memory than actually needed.
    particle_corsp_overlaps = np.empty(neighbours_values.size, dtype=np.float64)
    ends_ind_org = np.empty((neighbours_values.size, 2), dtype=np.float64)

    in_offset = 0
    out_offset = 0

    buff1 = np.empty(m, dtype=np.int64)
    buff2 = np.empty(m, dtype=np.float64)
    buff3 = np.empty(m, dtype=np.float64)

    for particle_idx in range(n):
        size = neighbours_sizes[particle_idx]
        cur = 0

        # Copy the neighbours, skipping the current particle itself
        for i in range(size):
            value = neighbours_values[in_offset+i]
            if value != particle_idx:
                buff1[cur] = value
                cur += 1

        nears_i_ind = buff1[0:cur]
        small_inplace_sort(nears_i_ind)  # Note: bottleneck of this function
        in_offset += size

        if len(nears_i_ind) == 0:
            continue

        x1, y1, z1 = poss[particle_idx]
        cur = 0

        for i in range(len(nears_i_ind)):
            index = nears_i_ind[i]
            x2, y2, z2 = poss[index]
            dist = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)
            contact_check = dist - (radii[index] + radii[particle_idx])
            if contact_check <= 0.0:
                buff2[cur] = contact_check
                buff3[cur] = index
                cur += 1

        particle_corsp_overlaps[out_offset:out_offset+cur] = buff2[0:cur]

        contacts_sec_ind = buff3[0:cur]
        sphere_olps_ind = contacts_sec_ind

        for i in range(cur):
            ends_ind_org[out_offset+i, 0] = particle_idx
            ends_ind_org[out_offset+i, 1] = sphere_olps_ind[i]

        out_offset += cur

    # Truncate the views to their real size
    particle_corsp_overlaps = particle_corsp_overlaps[:out_offset]
    ends_ind_org = ends_ind_org[:out_offset]

    assert len(ends_ind_org) % 2 == 0
    size = len(ends_ind_org)//2

    ends_ind = np.empty((size,2), dtype=np.int64)
    ends_ind_idx = np.empty(size, dtype=np.int64)
    gap = np.empty(size, dtype=np.float64)
    cur = 0

    # Find efficiently duplicates (replace np.unique+np.sort)
    for i in range(len(ends_ind_org)):
        left, right = ends_ind_org[i]
        if left < right:
            ends_ind[cur, 0] = left
            ends_ind[cur, 1] = right
            ends_ind_idx[cur] = i
            gap[cur] = particle_corsp_overlaps[i]
            cur += 1

    return gap, ends_ind, ends_ind_idx, ends_ind_org
def ends_gap(poss, radii):
    assert poss.size >= 1

    # Sort the balls
    index = np.argsort(radii)
    reverse_index = np.empty(index.size, np.int64)
    reverse_index[index] = np.arange(index.size, dtype=np.int64)
    sorted_poss = poss[index]
    sorted_radii = radii[index]

    # Split them in two groups: the small and the big ones
    split_ind = len(radii) * 3 // 4
    small_poss, big_poss = np.split(sorted_poss, [split_ind])
    small_radii, big_radii = np.split(sorted_radii, [split_ind])
    max_small_radii = sorted_radii[max(split_ind, 0)]
    max_big_radii = sorted_radii[-1]

    # Find the neighbours in parallel
    result = Parallel(n_jobs=4, backend='threading')([
        delayed(find_neighbours)(small_poss, small_poss, small_radii+max_small_radii),
        delayed(find_neighbours)(small_poss, big_poss,   small_radii+max_big_radii  ),
        delayed(find_neighbours)(big_poss,   small_poss, big_radii+max_small_radii  ),
        delayed(find_neighbours)(big_poss,   big_poss,   big_radii+max_big_radii    )
    ])
    small_small_neighbours = result[0]
    small_big_neighbours = result[1]
    big_small_neighbours = result[2]
    big_big_neighbours = result[3]

    # Merge the (segmented) arrays in a big one
    neighbours_sizes, neighbours_values = vstack_neighbours(
        hstack_neighbours(small_small_neighbours, small_big_neighbours, split_ind),
        hstack_neighbours(big_small_neighbours, big_big_neighbours, split_ind)
    )

    # Reverse the indices.
    # Note that the results in `neighbours_values` associated to
    # `neighbours_sizes[i]` are subsets of `query_radius([poss[i]], r=dia_max)`
    # on a `BallTree(poss)`.
    res = reorder_neighbours(neighbours_sizes, neighbours_values, index, reverse_index)
    neighbours_sizes, neighbours_values = res

    # Finally compute the neighbours with a method similar to the
    # previous one, but using a much faster optimized code.
    return compute(poss, radii, neighbours_sizes, neighbours_values)

result = ends_gap(poss, radii)
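The last loop of compute, which replaces np.unique and np.sort, relies on every contact appearing twice, once as (i, j) and once as (j, i), in rows already ordered by increasing i with j sorted within each i. Under that assumption, keeping the rows with i < j gives the same result as the generic deduplication. A small sketch on made-up pairs:

```python
import numpy as np

# Made-up contact list in the layout produced by the main loop:
# rows are (i, j), ordered by i then j, and each contact is listed in both directions.
pairs = np.array([[0, 2], [0, 3], [1, 3], [2, 0], [3, 0], [3, 1]], dtype=np.int64)

# Generic approach: sort inside each row, then deduplicate the rows.
uniq, first_idx = np.unique(np.sort(pairs, axis=1), axis=0, return_index=True)

# Shortcut: a single pass keeping only the rows with i < j.
keep = np.flatnonzero(pairs[:, 0] < pairs[:, 1])

print(uniq)
print(pairs[keep])
```

The shortcut is a single linear pass with no sorting, but it is only valid when both assumptions hold (symmetric pairs, rows already in lexicographic order).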
Here are the results (still on the same i5-9600KF machine):
Small dataset:
- Reference optimized Numba code: 256 ms
- This highly-optimized Numba code: 82 ms
Big dataset:
- Reference optimized Numba code: 42.7 s (take about 7~8 GiB of RAM)
- This highly-optimized Numba code: 4.2 s (take about 1 GiB of RAM)
Thus the new algorithm is about 3.1 times faster on the small dataset (in addition to the previous optimizations), and about 10 times faster on the big dataset! This is 3 orders of magnitude faster than the initially posted algorithms.
Note that 80% of the time is spent in the BallTree query (which is already mostly parallel). The main Numba computing function takes only 12% of the time, and more than 75% of that time is spent sorting the input indices. As a result, the neighbourhood search is clearly the bottleneck. It can be improved a bit by splitting the current queries into multiple smaller ones, but this will make the code even more complex for a relatively small improvement (e.g. 1.5x faster). Note that more complex code is harder to maintain, and modifications are bug-prone. Thus, I think moving to a native language to overcome the limitations of Python is the best way to increase performance. That being said, writing faster native code to solve this problem is far from simple (unless you find a good k-d tree, octree or ball tree library). Still, it is certainly better than optimizing this code further.
A profiling analysis shows that at least 50% of the time in scikit-learn's BallTree is spent in unoptimized scalar loops that could use SIMD instructions like AVX-2 (and loop unrolling) to be about 4 times faster. Additionally, some multi-threading issues are also visible in the profile.
[Figure: profiler timeline; the 4 threads on the top are the joblib workers, the light-green sections are the idle time]
This shows that this implementation is sub-optimal. One possible way to easily improve the execution time may be to optimize the hot loops of the scikit-learn BallTree implementation. Another strategy could be to try to use threads more efficiently (possibly by releasing the GIL in some parts of the scikit-learn module).
The BallTree class of scikit-learn is written in Cython (BallTree, like KDTree, is based on BinaryTree), so you can try to rebuild the package on your machine and simply tweak the compiler optimizations. Using the flags -O3 -march=native -ffast-math should enable the compiler to use faster SIMD instructions and more aggressive optimizations, resulting in a significant speed-up. Note that using -ffast-math is unsafe as it assumes that the Scikit code will never use NaN, Inf or -0 values (otherwise the result is completely undefined) and that floating-point operations are associative (which can change the results). That being said, such an option is critical to improve the automatic vectorization of numerical code.
As for the GIL, one can see that it is released in the query_radius function, but this does not seem to be the case for the constructor of BallTree. Maybe the simplest solution is to implement a parallel version of query/query_radius like Scipy did.
By fixing the query radius at twice the max sphere radius, you're creating a lot of spurious "collisions" to filter out.
The Python below achieves a significant speedup relative to your answer by using a fourth dimension to improve the selectivity of the kd-tree queries. Each Euclidean ball of radius r is over-approximated by an L1 ball of radius r√d where d is the dimension (3 here). The test for L1 balls colliding in 3d becomes a test for points being within a fixed L1 distance in 4d.
If you switched to a lower level language, you could potentially avoid a separate filtering step by altering the kd-tree implementation to use a combination L2+L1 metric.
import numpy as np
from scipy import spatial
from timeit import default_timer

def load_data():
    centers = np.loadtxt("pos_large.csv", delimiter=",")
    radii = np.loadtxt("radii_large.csv")
    assert radii.shape + (3,) == centers.shape
    return centers, radii

def count_contacts(centers, radii):
    scaled_centers = centers / np.sqrt(centers.shape[1])
    max_radius = radii.max()
    tree = spatial.cKDTree(np.c_[scaled_centers, max_radius - radii])
    count = 0
    for i, x in enumerate(np.c_[scaled_centers, radii - max_radius]):
        for j in tree.query_ball_point(x, r=2 * max_radius, p=1):
            d = centers[i] - centers[j]
            r = radii[i] + radii[j]
            if i < j and np.inner(d, d) <= r * r:
                count += 1
    return count

def main():
    centers, radii = load_data()
    start = default_timer()
    print(count_contacts(centers, radii))
    end = default_timer()
    print(end - start)

if __name__ == "__main__":
    main()
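The fourth-coordinate construction can be sanity-checked by brute force. For a query point built from sphere i and a tree point built from sphere j, the 4-D L1 distance is ||c_i - c_j||_1 / √3 + (2·max_radius - r_i - r_j), so it is at most 2·max_radius exactly when ||c_i - c_j||_1 / √3 ≤ r_i + r_j, and since ||x||_1 ≤ √3·||x||_2 in 3-D, every pair of overlapping spheres passes. The sketch below uses made-up random spheres and a dense distance matrix instead of a k-d tree; the small tolerance is an assumption added to absorb rounding.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
centers = rng.uniform(-1.0, 1.0, (n, 3))   # made-up sphere centers
radii = rng.uniform(0.01, 0.2, n)          # made-up sphere radii
max_radius = radii.max()

scaled = centers / np.sqrt(3)
tree_pts = np.c_[scaled, max_radius - radii]    # points that would be stored in the tree
query_pts = np.c_[scaled, radii - max_radius]   # points that would be used as queries

# Brute-force 4-D L1 distance between every query point i and tree point j.
l1 = np.abs(query_pts[:, None, :] - tree_pts[None, :, :]).sum(axis=2)
candidates = l1 <= 2 * max_radius + 1e-12

# Exact Euclidean overlap test in 3-D.
d = centers[:, None, :] - centers[None, :, :]
overlaps = (d ** 2).sum(axis=2) <= (radii[:, None] + radii[None, :]) ** 2

print("overlapping pairs missed by the L1 filter:", int((overlaps & ~candidates).sum()))
print("candidates:", int(candidates.sum()), "overlaps:", int(overlaps.sum()))
```

The filter is conservative: it may return pairs that do not overlap, which is why the exact test on the original centers is still applied to each candidate.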
As an update to Richard's answer, and to overcome probable memory leaks, I post this answer. During my test runs, memory usage grows and limits the execution to smaller data volumes (at most 200000 on my machine and 100000 on COLAB). This problem leads to much longer runtimes than those reported by Richard. So, I opened a SciPy issue about these performance differences and posted and compared some memory results there.
But I have not received any answer so far, and the origin of these significant performance differences is still not clear to me.
Fezzani referred to another SciPy issue about using chunks and prepared a good comparison showing the influence of chunk size on the runtimes. Strangely, although Fezzani's machine (Intel® Core™ i7-10700K CPU @ 3.80GHz × 16; 32GiB of RAM) seems to be more powerful than Richard's machine (6-core machine with an i5-9600KF processor, 16 GiB of RAM, 2 channels DDR4 @ 3200MHz reaching 36~40 GiB/s), his execution on the large data takes at least (around) 33 seconds with the chunk method (to avoid memory leaks).
I could not figure out why, or which hardware helps a machine get past the memory leaks and reach a satisfyingly fast execution like Richard's (perhaps it is related to the KF type of Richard's CPU).
From looking through some related memory issues, I guess that the cKDTree methods inevitably hit this problem when the data volume is huge or …, and that scikit-learn is perhaps a better choice. In this regard, based on my understanding of JaminSore's answer and the referenced Martelli answer, I tried to evaluate BallTree and KDTree from scikit-learn. BallTree performs better than KDTree in my cases (about 1.5 to 2 times), so I used it. There were no memory leaks for the large data, but it took 2 minutes (Richard's results and mine now differ just in time units ;)). It ran faster than scipy as the data volume increased. In my tests, scipy was faster on smaller data volumes (low memory consumption), and as the data volume grows, scipy's performance falls behind due to its implementation behavior or related bugs (still unclear to me). For my prepared data volume of 100000, scikit-learn performs 1.5 to 2 times faster.
I guess that using arrays is the big advantage of scikit-learn compared to the lists returned by the scipy method, which can be inferred from the aforementioned Martelli answer. It may be the reason for the different performances.
scikit-learn methods return an ndarray of object dtype containing arrays of different lengths, which need to be sorted to get the same results as the main code. I applied this sorting to each element inside the loop of the compute function by changing the nears_i_ind line to nears_i_ind = np.sort(all_neighbours[an_offset:an_offset+cur_len]). With BallTree, tmp and all_neighbours consume nearly the same amount of memory. Note: if both have the same name, memory consumption is reduced (almost halved). So, Richard's ends_gap function modified to use BallTree becomes:
def ends_gap(poss, dia_max):
    balltree = BallTree(poss, metric='euclidean')
    # tmp = balltree.query_radius(poss, r=dia_max)
    # all_neighbours = np.concatenate(tmp, dtype=np.int64)
    all_neighbours = balltree.query_radius(poss, r=dia_max)
    all_neighbours_sizes = np.array([len(e) for e in all_neighbours], dtype=np.int64)
    all_neighbours = np.concatenate(all_neighbours, dtype=np.int64)
    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
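To make the flattening step concrete, here is a minimal sketch that does not need scikit-learn. A hand-written ragged object array stands in for the result of query_radius; the data and the names ragged, flat and recovered are made up for illustration. It shows how the sizes array and a running offset recover each particle's sorted neighbour slice from the flat array, which is what the loop in compute relies on.

```python
import numpy as np

# Stand-in for balltree.query_radius(...): one index array per particle,
# with different lengths and in no particular order (made-up data).
ragged = np.empty(3, dtype=object)
ragged[0] = np.array([2, 0], dtype=np.int64)
ragged[1] = np.array([1], dtype=np.int64)
ragged[2] = np.array([2, 1, 0], dtype=np.int64)

# Flatten into one contiguous int64 array plus the per-particle lengths.
sizes = np.array([len(e) for e in ragged], dtype=np.int64)
flat = np.concatenate(list(ragged)).astype(np.int64)

# Walk the flat array with a running offset, sorting each slice.
recovered = []
offset = 0
for n in sizes:
    recovered.append(np.sort(flat[offset:offset + n]))
    offset += n
```

A flat array plus sizes is a layout that Numba can type, unlike an object array of arrays, which is presumably why compute takes these two arguments.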
It is not multi-threaded yet; multi-threading could improve the speed, and I will try it.
On my machine (1st-gen Intel Core i5 760 CPU @ 2.8GHz, 16 GB dual-channel DDR3 Ripjaws RAM at CL9, x64 Windows) for a data volume of 200000:
There were some mistakes in my two proposed methods that produced different gap values, as mentioned in the Note section by Richard. To produce the same results, return_sorted=True must be added to the nears_i_ind call in the Optimized algorithm, and ends_ind and ends_ind_lst must become lists, besides removing the if-else statements in both codes:
Optimized algorithm:
# radii is the global array of sphere radii from the main code
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = []
    ends_ind = []  # <------- this line is modified
    kdtree = cKDTree(poss)
    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        nears_i_ind = np.array(kdtree.query_ball_point(cur_point, r=dia_max, return_sorted=True), dtype=np.int64)  # <------- this line is modified
        assert len(nears_i_ind) > 0
        if len(nears_i_ind) <= 1:
            continue  # only the particle itself is in range
        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = distance.cdist(poss[nears_i_ind], cur_point[None, :]).squeeze()
        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]
        particle_corsp_overlaps.append(connected)  # collected for the gap computation below
        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)
        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        ends_ind.append(ends_ind_mod_temp)  # <------- this line is modified
    ends_ind_org = np.concatenate(ends_ind)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
Numba final optimized code:
@nb.jit('(float64[:, ::1], int64[::1], int64[::1])')
def compute(poss, all_neighbours, all_neighbours_sizes):
    particle_corsp_overlaps = []
    ends_ind_lst = []  # <------- this line is modified
    an_offset = 0
    for particle_idx in range(len(poss)):
        cur_len = all_neighbours_sizes[particle_idx]
        nears_i_ind = np.sort(all_neighbours[an_offset:an_offset+cur_len])  # <------- this line is modified
        an_offset += cur_len
        assert len(nears_i_ind) > 0
        if len(nears_i_ind) <= 1:
            continue  # only the particle itself is in range
        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = np.empty(len(nears_i_ind), dtype=np.float64)
        x1, y1, z1 = poss[particle_idx]
        for i in range(len(nears_i_ind)):
            x2, y2, z2 = poss[nears_i_ind[i]]
            dist_i[i] = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)
        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]
        particle_corsp_overlaps.append(connected)  # collected for the gap computation in ends_gap
        contacts_ind = np.where(contact_check <= 0)
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)
        ends_ind_mod_temp = np.empty((len(sphere_olps_ind), 2), dtype=np.int64)
        for i in range(len(sphere_olps_ind)):
            ends_ind_mod_temp[i, 0] = particle_idx
            ends_ind_mod_temp[i, 1] = sphere_olps_ind[i]
        ends_ind_lst.append(ends_ind_mod_temp)  # <------- this line is modified
    return particle_corsp_overlaps, ends_ind_lst
def ends_gap(poss, dia_max):
    balltree = BallTree(poss, metric='euclidean')  # <------- new code
    all_neighbours = balltree.query_radius(poss, r=dia_max)  # <------- new code and modified
    all_neighbours_sizes = np.array([len(e) for e in all_neighbours], dtype=np.int64)  # <------- this line is modified
    all_neighbours = np.concatenate(all_neighbours, dtype=np.int64)  # <------- this line is modified
    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
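The last three lines of ends_gap remove mirrored contacts, since each contact is found twice (once from each particle). Here is a small sketch of that step on made-up data; the pairs and overlaps arrays are hypothetical stand-ins for ends_ind_org and the concatenated overlaps.

```python
import numpy as np

# Each contact appears twice, once as (i, j) and once as (j, i).
pairs = np.array([[0, 2], [2, 0], [1, 3], [3, 1], [0, 4]], dtype=np.int64)
overlaps = np.array([-0.1, -0.1, -0.3, -0.3, -0.2])  # one value per row of pairs

# np.sort works along the last axis, so every row becomes (min, max).
sorted_pairs = np.sort(pairs)

# Unique rows, plus the index of the first occurrence of each unique row.
unique_pairs, first_idx = np.unique(sorted_pairs, axis=0, return_index=True)

# Pick the overlap value that belongs to each unique contact.
gap = overlaps[first_idx]
```

Because first_idx points into the original row order, the overlap values must be stored in the same order as the pairs, which is why both lists are filled in the same loop iteration.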
On my machine for around 550000 data volume:
Have you tried FLANN?
This code doesn't solve your problem completely. It simply finds the nearest 50 neighbors to each point in your 500000 point dataset:
import numpy as np
from pyflann import FLANN
p = np.loadtxt("pos_large.csv", delimiter=",")
flann = FLANN()
idx, dist = flann.nn_index(qpts=p, num_neighbors=50)
The last line takes less than a second on my laptop without any tuning or parallelization.

Vectorize a function with a condition

I would like to vectorize a function with a condition, that is, to calculate its values with array arithmetic. np.vectorize handles vectorization, but it does not work with array arithmetic, so it is not a complete solution.
An answer was given as the solution in the question "How to vectorize a function which contains an if statement?" but did not prevent errors here; see the MWE below.
import numpy as np
def myfx(x):
    return np.where(x < 1.1, 1, np.arcsin(1 / x))
x = np.linspace(0, 2, 21)  # example input including 0 (assumed; the original values are not shown)
y = myfx(x)
This runs but raises the following warnings:
<stdin>:2: RuntimeWarning: divide by zero encountered in true_divide
<stdin>:2: RuntimeWarning: invalid value encountered in arcsin
What is the problem, or is there a better way to do this?
I think this could be done by
Getting the indices ks of x for which x[k] >= 1.1 for each k in ks.
Applying np.arcsin(1 / x[ks]) to the slice x[ks], and using 1 for the rest of the elements.
Recombining the arrays.
I am not sure about the efficiency, though.
The statement np.where(x < 1.1, 1, np.arcsin(1 / x)) is equivalent to
mask = x < 1.1
a = 1
b = np.arcsin(1 / x)
np.where(mask, a, b)
Notice that you're calling np.arcsin on all the elements of x, regardless of whether 1 / x <= 1 or not. Your basic plan is correct. You can do the operations in-place on an output array using the where keyword of np.arcsin and np.reciprocal, without having to recombine anything:
def myfx(x):
    mask = (x >= 1.1)
    out = np.ones(x.shape)
    np.reciprocal(x, where=mask, out=out)  # >= 1.1 implies != 0
    return np.arcsin(out, where=mask, out=out)
Using np.ones ensures that the unmasked elements of out are initialized correctly. An equivalent method would be
out = np.empty(x.shape)
out[~mask] = 1
You can always find an arithmetic expression that prevents the "divide by zero":
def myfx(x):
    return np.where(x < 1.1, 1, np.arcsin(1 / np.maximum(x, 1.1)))
The values of the third argument where x < 1.1 are not used, so computing np.arcsin(1/1.1) there is not an issue.
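As a quick check, the sketch below puts the two approaches from the answers side by side and turns warnings into errors, so any divide-by-zero or invalid-value warning would abort the run. The function names myfx_masked and myfx_clamped and the sample input are chosen here for illustration only.

```python
import warnings
import numpy as np

def myfx_masked(x):
    # Compute 1/x and arcsin only where x >= 1.1; the rest keeps the initial 1.
    x = np.asarray(x, dtype=float)
    mask = x >= 1.1
    out = np.ones(x.shape)
    np.reciprocal(x, where=mask, out=out)
    np.arcsin(out, where=mask, out=out)
    return out

def myfx_clamped(x):
    # Clamp x to at least 1.1 so the unused branch never divides by zero.
    x = np.asarray(x, dtype=float)
    return np.where(x < 1.1, 1.0, np.arcsin(1.0 / np.maximum(x, 1.1)))

x = np.array([0.0, 0.5, 1.0, 1.1, 2.0, 10.0])
with warnings.catch_warnings():
    warnings.simplefilter("error")  # any RuntimeWarning becomes an exception
    a = myfx_masked(x)
    b = myfx_clamped(x)
```

The masked version skips the unused elements entirely, while the clamped version still evaluates arcsin for every element but is a single expression.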

Finding n-tuple that minimizes expensive cost function

Suppose there are three variables that take on discrete integer values, say w1 = {1,2,3,4,5,6,7,8,9,10,11,12}, w2 = {1,2,3,4,5,6,7,8,9,10,11,12}, and w3 = {1,2,3,4,5,6,7,8,9,10,11,12}. The task is to pick one value from each set such that the resulting triplet minimizes some (black box, computationally expensive) cost function.
I've tried the surrogate optimization in Matlab but I'm not sure it is appropriate. I've also heard about simulated annealing but found no implementation applied to this instance.
Which algorithm, apart from exhaustive search, can solve this combinatorial optimization problem?
Any help would be much appreciated.
Simulated Annealing (SA) requires, and benefits from, an objective surface that is somewhat smooth, meaning that states close to a solution also have objective values close to it.
For a completely random, spiky surface you might as well do a random search.
If the surface is smooth, even if only partly, it makes sense to try SA.
The idea is that changing only 1 of the 3 values (sometimes) has little effect on our black-box function.
Here is a basic example of how to do this with Simulated Annealing, using frigidum in Python:
import numpy as np
w1 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w2 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w3 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
W = np.array([w1,w2,w3])
I define a black-box using the Rastrigin function.
def rastrigin_function_n( x ):
    """
    N-dimensional Rastrigin
    x_i is in [-5.12, 5.12]
    """
    A = 10
    n = x.shape[0]
    return A*n + np.sum( x**2- A*np.cos(2*np.pi * x) )
def black_box( x ):
    """
    Transform from domain [1,12] to [-5,5]
    to be able to push to rastrigin
    """
    x = (x - 6.5) * (5/5.5)
    return rastrigin_function_n(x)
Simulated Annealing needs to modify state X. Instead of taking/modifying values directly, we keep track of indices. This simplifies creating new proposals as an index is always an integer we can simply add/subtract 1 modulo LENGTH.
LENGTH = W.shape[1]  # 12 candidate values per variable
def random_start():
    """returns 3 random indices"""
    return np.random.randint(0, LENGTH, size=3)
def random_small_step(x):
    """change only 1 index"""
    d = np.array( [1,0,0] )
    if np.random.random() < .5:
        d = np.array( [-1,0,0] )
    np.random.shuffle(d)  # choose at random which index changes
    return (x+d) % LENGTH
def random_big_step(x):
    """change 2 indices"""
    d = np.array( [1,-1,0] )
    np.random.shuffle(d)  # choose at random which two indices change
    return (x+d) % LENGTH
def obj(x):
    """
    We have a triplet of indices,
    1. Calculate corresponding values in W = [w1,w2,w3]
    2. Push the values in our black-box function
    """
    indices = x
    values = W[np.array([0,1,2]), indices]
    return black_box(values)
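And throw an SA scheme at it:
import frigidum
# remaining keyword arguments (e.g. T_start, T_stop, repeats) are left at their defaults here
local_opt = frigidum.sa(random_start=random_start,
                        objective_function=obj,
                        neighbours=[random_small_step, random_big_step])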
I am not sure what the minimum for this function should be, but it found an objective value of 47.9095 at indices np.array([9, 2, 2]).
To change the cooling schedule in frigidum, use alpha=.9. In my experience, the work of experimenting to find which cooling scheme works best doesn't outweigh simply letting it run a little longer. The multiplication you proposed (sometimes called geometric) is the standard one, and it is also implemented in frigidum. So to implement Tn+1 = 0.9*Tn you need alpha=.9. Be aware that this cooling step is done after N repeats, so if repeats=100, it will first do 100 proposals before lowering the temperature by a factor alpha.
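To get a feel for what a geometric schedule costs, here is a small stand-alone sketch. It is not frigidum's code, and frigidum's own loop may differ in detail, for example in how the stop condition is checked. The function geometric_schedule and the start and stop temperatures are chosen for illustration only.

```python
def geometric_schedule(t_start, t_stop, alpha):
    """Temperatures visited under T_{n+1} = alpha * T_n, from t_start down to t_stop."""
    temps = []
    t = t_start
    while t > t_stop:
        temps.append(t)
        t *= alpha
    return temps

temps = geometric_schedule(t_start=5.0, t_stop=0.05, alpha=0.9)

# With `repeats` proposals per temperature level, the total budget is:
repeats = 100
total_proposals = len(temps) * repeats
```

With these numbers the schedule has 44 temperature levels, so 4400 proposals in total. A larger alpha or more repeats both mean a longer run, which is the "let it run a little longer" knob mentioned above.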
Simple variations on the current state often work best. Since it's best practice to set the initial temperature high enough that most proposals (>90%) are accepted, it doesn't matter that the steps are small. But if you fear they are too small, try 2 or 3 variations. Frigidum accepts a list of proposal functions, and combinations can reinforce each other.
I have no experience with MINLP. But even so, experiments can often surprise us. So if the time/cost of bringing another competitor to the table is small, yes!
Try every possible combination of the three values and see which has the lowest cost.
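With 12 values per variable there are only 12**3 = 1728 triplets, so exhaustive search is feasible whenever a single evaluation is affordable. A minimal sketch, using the Rastrigin-based black box from the other answer as a stand-in for the real cost function (the helper names here are illustrative):

```python
import itertools
import numpy as np

def rastrigin(x, A=10):
    return A * x.shape[0] + np.sum(x**2 - A * np.cos(2 * np.pi * x))

def black_box(values):
    # Same mapping as in the other answer: domain [1, 12] -> [-5, 5].
    x = (np.asarray(values, dtype=float) - 6.5) * (5 / 5.5)
    return rastrigin(x)

domain = range(1, 13)

# Evaluate every triplet and keep the cheapest one.
best_values = min(itertools.product(domain, repeat=3), key=black_box)
best_cost = black_box(best_values)
```

For this stand-in function the search should land on the same objective value of about 47.9095 that the annealing run reported, reached when every value is 3 or 10 (indices 2 or 9).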

vectorize join condition in pandas

This code is working correctly as expected. But it takes a lot of time for large dataframes.
for i in excel_df['name_of_college_school']:
    for y in mysql_df['college_name']:
        if SequenceMatcher(None, i.lower(), y.lower()).ratio() > 0.8:
            excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y
I guess I cannot use a function in the join clause to compare values like this.
How do I vectorize this?
Is it possible to update with the highest score? This loop will overwrite the earlier match and it is possible that the earlier match was more relevant than current one.
What you are looking for is fuzzy merging.
a = excel_df.to_numpy()  # as_matrix() was removed in pandas 1.0
b = mysql_df.to_numpy()
for i in a:
    for j in b:
        if SequenceMatcher(None,
                i[college_index_a].lower(), j[college_index_b].lower()).ratio() > 0.8:
            i[dupmark_index] = j[college_index_b]
Never use loc in a loop; it has a huge overhead. Also, get the numerical index of the respective columns, like this:
df.columns.get_loc("college name")
You could avoid one of the loops by using apply: instead of M×N .loc operations, there will now be M operations.
for y in mysql_df['college_name']:
    match = excel_df['name_of_college_school'].apply(lambda x: SequenceMatcher(
        None, x.lower(), y.lower()).ratio() > 0.8)
    excel_df.loc[match, 'dupmark4'] = y
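Regarding the comment about keeping the highest score: the loops above let a later match overwrite an earlier, possibly better one. A minimal sketch of one way to keep only the best-scoring candidate, using just the standard library (the helper best_match and the sample names are hypothetical):

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Return the candidate most similar to name, or None if no score exceeds threshold."""
    best, best_score = None, threshold
    for cand in candidates:
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            # Strictly better than anything seen so far: remember it.
            best, best_score = cand, score
    return best

colleges = ["Harvard University", "Harvard Universty", "MIT"]
exact = best_match("harvard university", colleges)
missing = best_match("xyz", colleges)
```

Assuming the column names from the question, this could be applied as excel_df['dupmark4'] = excel_df['name_of_college_school'].map(lambda n: best_match(n, mysql_df['college_name'])). This is still M×N comparisons, so it improves correctness more than speed.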

Matplotlib: Plot a function with multiple definitions

How do I plot e.g. a function f(x) = x for 0 < x < 1 and f(x) = 1 for x >= 1?
Thanks in advance!
Okay, I have thought for a while and found a solution for the given function, but I'd really like to find a more generic solution. Maybe like f=f1+f2+...fn, where fi is the function in domain i, and then plot f all together.
f = 0.5*(1*(1-np.sign(1-x))+x*(1-np.sign(x-1)))
Matplotlib doesn't care where your data comes from: you can either make lists from two different functions and combine them, or call a function with a conditional in it. The most mathematically appealing choice is probably
def f(x):
    return 1 if x >= 1 else x if x > 0 else ...
Of course, if you care only about x>0, your function can be computed just as np.minimum(x, 1), which works on whole arrays at once.
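For the more generic "one function per domain" form asked about in the question, NumPy's np.piecewise takes a list of conditions and a matching list of functions or constants. A minimal sketch for the function in the question; the sample grid is arbitrary.

```python
import numpy as np

def f(x):
    # f(x) = x for 0 < x < 1 and f(x) = 1 for x >= 1.
    # Elements that satisfy no condition (x <= 0) are left at 0 by np.piecewise.
    x = np.asarray(x, dtype=float)
    return np.piecewise(x, [(x > 0) & (x < 1), x >= 1], [lambda v: v, 1.0])

x = np.linspace(0.01, 3, 300)
y = f(x)
```

The result can then be drawn with matplotlib.pyplot.plot(x, y). More pieces can be added by extending both lists.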