How can I speed up my Python code for sphere contact (collision) detection using spatial searching with NumPy?

I am working on a spatial search problem for spheres in which I want to find connected (touching or overlapping) spheres. To do so, I search around each sphere for spheres whose centers lie within one maximum sphere diameter of the searching sphere's center. At first, I tried SciPy methods, but they took longer than the equivalent NumPy method. For SciPy, I first determine the number of K nearest spheres and then find them with cKDTree.query, which costs more time. However, it is slower than the NumPy method even when the first step is skipped by using a constant value (skipping the first step is not appropriate in this case). This is contrary to my expectations about SciPy's spatial searching speed. So, I tried replacing some NumPy lines with list loops so that the code could be sped up with numba prange. Numba ran the code a little faster, but I believe this code can be optimized further, perhaps by vectorization, by using alternative NumPy functions, or by using numba in another way. I iterate over all spheres to prevent probable memory leaks and …, when the number of spheres is high.
import numpy as np
import numba as nb
from scipy.spatial import cKDTree, distance
# ---------------------------- input data ----------------------------
""" For testing by prepared files:
radii = np.load('a.npy') # shape: (n-spheres, ) must be loaded by np.load('a.npy') or np.loadtxt('radii_large.csv')
poss = np.load('b.npy') # shape: (n-spheres, 3) must be loaded by np.load('b.npy') or np.loadtxt('pos_large.csv', delimiter=',')
"""
rnd = np.random.RandomState(70)
data_volume = 200000
radii = rnd.uniform(0.0005, 0.122, data_volume)
dia_max = 2 * radii.max()
x = rnd.uniform(-1.02, 1.02, (data_volume, 1))
y = rnd.uniform(-3.52, 3.52, (data_volume, 1))
z = rnd.uniform(-1.02, -0.575, (data_volume, 1))
poss = np.hstack((x, y, z))
# --------------------------------------------------------------------
# @nb.jit('float64[:,::1](float64[:,::1], float64[::1])', forceobj=True, parallel=True)
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = np.array([], dtype=np.float64)
    ends_ind = np.empty([1, 2], dtype=np.int64)

    """ using list looping """
    # particle_corsp_overlaps = []
    # ends_ind = []

    # for particle_idx in nb.prange(len(poss)):  # by list looping
    for particle_idx in range(len(poss)):
        unshared_idx = np.delete(np.arange(len(poss)), particle_idx)  # <--- relatively high time consumer
        poss_without = poss[unshared_idx]

        """ # SCIPY method ---------------------------------------------------------------------------------------------
        nears_i_ind = cKDTree(poss_without).query_ball_point(poss[particle_idx], r=dia_max)  # <--- high time consumer
        if len(nears_i_ind) > 0:
            dist_i, dist_i_ind = cKDTree(poss_without[nears_i_ind]).query(poss[particle_idx], k=len(nears_i_ind))  # <--- high time consumer
            if not isinstance(dist_i, float):
                dist_i[dist_i_ind] = dist_i.copy()
        """  # NUMPY method --------------------------------------------------------------------------------------------
        lx_limit_idx = poss_without[:, 0] <= poss[particle_idx][0] + dia_max
        ux_limit_idx = poss_without[:, 0] >= poss[particle_idx][0] - dia_max
        ly_limit_idx = poss_without[:, 1] <= poss[particle_idx][1] + dia_max
        uy_limit_idx = poss_without[:, 1] >= poss[particle_idx][1] - dia_max
        lz_limit_idx = poss_without[:, 2] <= poss[particle_idx][2] + dia_max
        uz_limit_idx = poss_without[:, 2] >= poss[particle_idx][2] - dia_max

        nears_i_ind = np.where(lx_limit_idx & ux_limit_idx & ly_limit_idx & uy_limit_idx & lz_limit_idx & uz_limit_idx)[0]
        if len(nears_i_ind) > 0:
            dist_i = distance.cdist(poss_without[nears_i_ind], poss[particle_idx][None, :]).squeeze()  # <--- relatively high time consumer
        # """  # -------------------------------------------------------------------------------------------------------
            contact_check = dist_i - (radii[unshared_idx][nears_i_ind] + radii[particle_idx])
            connected = contact_check[contact_check <= 0]

            particle_corsp_overlaps = np.concatenate((particle_corsp_overlaps, connected))

            """ using list looping """
            # if len(connected) > 0:
            #     for value_ in connected:
            #         particle_corsp_overlaps.append(value_)

            contacts_ind = np.where([contact_check <= 0])[1]
            contacts_sec_ind = np.array(nears_i_ind)[contacts_ind]
            sphere_olps_ind = np.where((poss[:, None] == poss_without[contacts_sec_ind][None, :]).all(axis=2))[0]  # <--- high time consumer

            ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
            if particle_idx > 0:
                ends_ind = np.concatenate((ends_ind, ends_ind_mod_temp))
            else:
                ends_ind[0, 0], ends_ind[0, 1] = ends_ind_mod_temp[0, 0], ends_ind_mod_temp[0, 1]

            """ using list looping """
            # for contacted_idx in sphere_olps_ind:
            #     ends_ind.append([particle_idx, contacted_idx])

    # ends_ind_org = np.array(ends_ind)  # using lists
    ends_ind_org = ends_ind
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)  # <--- relatively high time consumer
    gap = np.array(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
In one of my tests on 23,000 spheres, the SciPy, NumPy, and numba-aided methods finished the loop in about 400, 200, and 180 seconds respectively on a Colab TPU runtime; for 500,000 spheres it takes 3.5 hours. These execution times are not satisfactory for my project, where the number of spheres may be up to 1,000,000 in a medium data volume. I will call this code many times in my main code, and I am looking for ways to run it in milliseconds (as fast as possible). Is that possible?
I would appreciate it if anyone could speed up the code as needed.
Notes:
- This code must be executable with Python 3.7+, on CPU and GPU.
- This code must be applicable to data sizes of at least 300,000 spheres.
- Any NumPy, SciPy, or … equivalents that replace my own code and make it significantly faster will be upvoted.
I would appreciate any recommendations or explanations about:
- Which method could be faster for this problem?
- Why is SciPy not faster than the other methods in this case, and where could it be helpful for this problem?
- Choosing between iterative methods and matrix-form methods is confusing to me. Iterative methods use less memory and can be tuned with numba and …, but I think they are not comparable with matrix methods (which depend on memory limits) like NumPy and … for huge numbers of spheres. For this case, perhaps I could replace the iteration with NumPy operations, but I strongly suspect that this cannot be handled because of the huge matrix sizes and memory leaks.
Prepared sample test data:
Poss data: 23000, 500000
Radii data: 23000, 500000
Line by line speed test logs: for two test cases scipy method and numpy time consumption.

UPDATE: this answer is now superseded by this new one (which takes into account the updates to the question), providing even faster code based on a different approach.
Step 1: better algorithm
First of all, building a k-d tree runs in O(n log n) time and a query runs in O(log n) time, where n is the number of points. So using a k-d tree seems a good idea at first glance. However, your code builds a new k-d tree for each point, resulting in O(n² log n) time overall. This is why the SciPy solution is slower than the others. SciPy does not provide a way to update a k-d tree, and updating a k-d tree efficiently appears not to be possible. Fortunately, this is not a problem in your case: you can build a single k-d tree with all the points once and then discard the current point from the result of each query.
Moreover, the computation of sphere_olps_ind runs in O(n² m) time, where n is the total number of points and m is the average number of neighbours (i.e. the closest points retrieved from the k-d tree query). Assuming there are no duplicate points, sphere_olps_ind is simply equal to np.sort(contacts_sec_ind). The latter runs in O(m log m) time, which is drastically better.
Additionally, using np.concatenate in a loop to append values to a NumPy array is slow because it creates a new, bigger array at each iteration. Using a list was a good idea, but appending the NumPy arrays directly to a list and then calling np.concatenate once is much faster.
Here is the resulting code:
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = []
    ends_ind = [np.empty([1, 2], dtype=np.int64)]

    kdtree = cKDTree(poss)

    for particle_idx in range(len(poss)):
        # Find the nearest point including the current one and
        # then remove the current point from the output.
        # The distances can be computed directly without a new query.
        cur_point = poss[particle_idx]
        nears_i_ind = np.array(kdtree.query_ball_point(cur_point, r=dia_max), dtype=np.int64)
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = distance.cdist(poss[nears_i_ind], cur_point[None, :]).squeeze()

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        if particle_idx > 0:
            ends_ind.append(ends_ind_mod_temp)
        else:
            ends_ind[0][:] = ends_ind_mod_temp[0, 0], ends_ind_mod_temp[0, 1]

    ends_ind_org = np.concatenate(ends_ind)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)  # <--- relatively high time consumer
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
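The point about np.concatenate can be illustrated in isolation. The following is only a minimal sketch on synthetic data (the helper names `collect_with_repeated_concatenate` and `collect_with_list` are made up for this illustration): both functions produce the same array, but the first one copies all previously collected values at every iteration, while the second copies them only once at the end.

```python
import numpy as np

def collect_with_repeated_concatenate(chunks):
    """Grow one array with np.concatenate at every iteration (re-copies everything each time)."""
    out = np.array([], dtype=np.float64)
    for chunk in chunks:
        out = np.concatenate((out, chunk))
    return out

def collect_with_list(chunks):
    """Append each per-sphere array to a list and concatenate a single time at the end."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    # np.concatenate raises on an empty list, so that case is handled explicitly.
    return np.concatenate(parts) if parts else np.array([], dtype=np.float64)

# Synthetic stand-in for the per-sphere `connected` arrays (0 to 4 values each).
rng = np.random.RandomState(0)
chunks = [rng.uniform(-1.0, 0.0, size=rng.randint(0, 5)) for _ in range(1000)]
slow_result = collect_with_repeated_concatenate(chunks)
fast_result = collect_with_list(chunks)
```

With n chunks, the first variant performs on the order of n² element copies in total, whereas the second performs on the order of n.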
Step 2: optimization
First of all, the query_ball_point call can be done on all the points at once, in parallel, by passing poss to the SciPy method and specifying the parameter workers=-1. However, note that this requires more memory.
Moreover, Numba can be used to significantly speed up the computation. The main parts that can be improved are the computation of the distances, the creation of many unnecessary temporary arrays, and the use of direct NumPy array indexing instead of list appends (since an upper bound on the size of the output array is known after the query_ball_point call).
Here is a simple example of optimized code using Numba:
@nb.jit('(float64[:, ::1], int64[::1], int64[::1], float64)')
def compute(poss, all_neighbours, all_neighbours_sizes, dia_max):
    particle_corsp_overlaps = []
    ends_ind_lst = [np.empty((1, 2), dtype=np.int64)]
    an_offset = 0

    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        cur_len = all_neighbours_sizes[particle_idx]
        nears_i_ind = all_neighbours[an_offset:an_offset+cur_len]
        an_offset += cur_len
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = np.empty(len(nears_i_ind), dtype=np.float64)

        # Compute the distances
        x1, y1, z1 = poss[particle_idx]
        for i in range(len(nears_i_ind)):
            x2, y2, z2 = poss[nears_i_ind[i]]
            dist_i[i] = np.sqrt((x2-x1)**2 + (y2-y1)**2 + (z2-z1)**2)

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]
        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where(contact_check <= 0)
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.empty((len(sphere_olps_ind), 2), dtype=np.int64)
        for i in range(len(sphere_olps_ind)):
            ends_ind_mod_temp[i, 0] = particle_idx
            ends_ind_mod_temp[i, 1] = sphere_olps_ind[i]

        if particle_idx > 0:
            ends_ind_lst.append(ends_ind_mod_temp)
        else:
            tmp = ends_ind_lst[0]
            tmp[:] = ends_ind_mod_temp[0, :]

    return particle_corsp_overlaps, ends_ind_lst

def ends_gap(poss, dia_max):
    kdtree = cKDTree(poss)
    tmp = kdtree.query_ball_point(poss, r=dia_max, workers=-1)
    all_neighbours = np.concatenate(tmp, dtype=np.int64)
    all_neighbours_sizes = np.array([len(e) for e in tmp], dtype=np.int64)
    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes, dia_max)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org

ends_gap(poss, dia_max)
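Step 2 mentions that the output size is bounded once query_ball_point has returned, but the listing above still appends to lists. As a rough illustration of that remark only (plain NumPy/SciPy, no Numba, and a hypothetical function name `contacts_preallocated`), the output arrays can be allocated once with the upper bound and truncated at the end:

```python
import numpy as np
from scipy.spatial import cKDTree

def contacts_preallocated(poss, radii, dia_max):
    """Return (pairs, gaps) for every ordered contact i -> j with i != j."""
    tree = cKDTree(poss)
    neighbours = tree.query_ball_point(poss, r=dia_max)  # one list of indices per sphere
    # Upper bound: every returned candidate except each sphere itself.
    bound = sum(len(lst) for lst in neighbours) - len(poss)
    pairs = np.empty((bound, 2), dtype=np.int64)
    gaps = np.empty(bound, dtype=np.float64)
    cur = 0
    for i, lst in enumerate(neighbours):
        for j in sorted(lst):
            if j == i:
                continue
            gap = np.linalg.norm(poss[j] - poss[i]) - (radii[i] + radii[j])
            if gap <= 0.0:
                pairs[cur] = (i, j)   # write by index instead of appending
                gaps[cur] = gap
                cur += 1
    # Truncate the buffers to the number of contacts actually found.
    return pairs[:cur], gaps[:cur]

# Small synthetic example (sizes chosen only for illustration).
rng = np.random.RandomState(1)
demo_poss = rng.uniform(0.0, 1.0, (200, 3))
demo_radii = rng.uniform(0.01, 0.08, 200)
demo_pairs, demo_gaps = contacts_preallocated(demo_poss, demo_radii, 2 * demo_radii.max())
```

The inner Python loop is slow; the sketch is only meant to show the allocation pattern, which is the part that a compiled Numba loop would benefit from.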
Performance analysis
Here are the performance results on my 6-core machine (with an i5-9600KF processor) on the small dataset:
- Initial code with Scipy: 259 s
- Initial default code with Numpy: 112 s
- Optimized algorithm: 1.37 s
- Final optimized code: 0.22 s
Unfortunately, the Scipy k-d tree is too big to fit in memory on my machine with the big dataset.
Thus, the Numba implementation with an efficient algorithm is up to ~510 times faster than the initial Numpy implementation and ~1200 times faster than the initial Scipy implementation.
The Numba code can be further optimized, but please note that the Numba compute call takes less than 25% of the time on my machine. The np.unique call is the most expensive, but it is not easy to make it faster. A significant part of the time is spent in the Scipy-to-Numba data conversion, but this code is mandatory as long as Scipy is used. Thus, the code can be improved a bit (e.g. probably 2x faster) with advanced Numba optimizations, but if you need much faster code, you need to use a native language like C++ and a highly optimized parallel k-d tree implementation. I expect very optimized native code to be an order of magnitude faster, but not much more. I hardly believe the big dataset can be processed in less than 10 ms on my machine regardless of the implementation.
Notes
Note that gap differs between the provided functions (the other values are unchanged). However, the same thing happens between the initial Scipy method and the initial Numpy one. This appears to come from the ordering of variables like nears_i_ind and dist_i, which is left undefined by Scipy and changes the gap result in a non-trivial way (not just the order of gap). I am not sure whether this is a problem in the initial implementation. Because of that, it is much harder to compare the correctness of the different implementations.
forceobj should not be used in production, as the documentation states that it is only useful for testing purposes.

Based on the previous answers, I designed an efficient algorithm with a much lower memory footprint that is much faster than the previous ones (especially on the large dataset). That being said, this algorithm is far more complex and pushes the limits of Python and Numba.
The key issue with the previous algorithms is that they use a dia_max threshold which is much bigger than actually required. Indeed, dia_max is set to the maximum possible diameter (twice the largest radius) so as to be sure not to miss any overlap. The thing is, the big dataset contains balls of very different sizes, and some of them are huge. This means that the previous algorithms were searching a very large radius around many small balls. The result was thousands of neighbours to check per ball while only a few can truly overlap.
One solution to efficiently address this problem is to split the balls into different groups based on their size. The idea is to first sort the balls by radius, then split the sorted balls into two groups, then independently query neighbours between each possible pair of groups, then merge the data so as to apply the previous algorithm (with some additional optimizations). More specifically, the query is applied between small balls and big ones, small balls and other small ones, big balls and other big ones, and big balls and small ones.
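To see why the grouping shrinks the candidate set without missing contacts, here is a small brute-force sketch (no tree; the function names and the synthetic radii distribution are assumptions made for this illustration). A ball i only needs to search a reference group within radii[i] plus the largest radius of that group, which is never more than the global maximum diameter.

```python
import numpy as np

def pairwise_distances(poss):
    """Dense matrix of centre-to-centre distances (only viable for small demos)."""
    return np.linalg.norm(poss[:, None, :] - poss[None, :, :], axis=2)

def candidate_mask_global(poss, radii):
    """Candidates when every ball is searched within the maximum diameter."""
    return pairwise_distances(poss) <= 2.0 * radii.max()

def candidate_mask_grouped(poss, radii, fraction=0.75):
    """Candidates when ball i searches each group within radii[i] + that group's max radius."""
    threshold = np.sort(radii)[int(len(radii) * fraction)]
    is_big = radii >= threshold
    small_max = radii[~is_big].max() if (~is_big).any() else 0.0
    # Largest radius in the group that each reference ball j belongs to.
    group_max = np.where(is_big, radii.max(), small_max)
    return pairwise_distances(poss) <= radii[:, None] + group_max[None, :]

# Synthetic data: mostly small balls plus a few large ones.
rng = np.random.RandomState(2)
n = 400
demo_poss = rng.uniform(0.0, 1.0, (n, 3))
demo_radii = np.where(rng.uniform(size=n) < 0.9,
                      rng.uniform(0.005, 0.02, n),
                      rng.uniform(0.05, 0.12, n))
grouped = candidate_mask_grouped(demo_poss, demo_radii)
global_ = candidate_mask_global(demo_poss, demo_radii)
print(grouped.sum(), "candidates with groups vs", global_.sum(), "with dia_max")
```

The split here uses a radius threshold rather than the sorted-index split of the full implementation; the two should agree except when several balls share the threshold radius.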
Another key point to speed this up is to run the different neighbour queries in parallel using joblib. This solution is far from perfect, since the BallTree object needs to be duplicated, which is inefficient, but this is mandatory because of the way parallelism is currently done in CPython (i.e. GIL, pickling, etc.). Using a package that supports parallel requests could bypass this inherent limitation of CPython, but the existing packages doing that do not seem to provide an interface sufficiently useful to address this problem, or are not optimized enough to be actually useful.
Finally, the Numba code can be strongly optimized by removing almost all of the very expensive (implicit) array allocations. Using an in-place sorting algorithm optimized for small arrays also significantly improves the execution time (mainly because the default implementation of Numba performs several expensive allocations and is not optimized for small arrays). In addition, the final np.unique operation can be completely rewritten as a basic loop, since the main loop iterates over balls with increasing IDs (hence already sorted).
Here is the resulting code:
import numpy as np
import numba as nb
from sklearn.neighbors import BallTree
from joblib import Parallel, delayed

def flatten_neighbours(arr):
    sizes = np.fromiter(map(len, arr), count=len(arr), dtype=np.int64)
    values = np.concatenate(arr, dtype=np.int64)
    return sizes, values

@delayed
def find_neighbours(searched_pts, ref_pts, max_dist):
    balltree = BallTree(ref_pts, leaf_size=16, metric='euclidean')
    res = balltree.query_radius(searched_pts, r=max_dist)
    return flatten_neighbours(res)

def vstack_neighbours(top_infos, bottom_infos):
    top_sizes, top_values = top_infos
    bottom_sizes, bottom_values = bottom_infos
    return np.concatenate([top_sizes, bottom_sizes]), np.concatenate([top_values, bottom_values])

@nb.njit('(Tuple([int64[::1],int64[::1]]), Tuple([int64[::1],int64[::1]]), int64)')
def hstack_neighbours(left_infos, right_infos, offset):
    left_sizes, left_values = left_infos
    right_sizes, right_values = right_infos
    n = left_sizes.size
    out_sizes = np.empty(n, dtype=np.int64)
    out_values = np.empty(left_values.size + right_values.size, dtype=np.int64)
    left_cur, right_cur, out_cur = 0, 0, 0
    right_values += offset
    for i in range(n):
        left, right = left_sizes[i], right_sizes[i]
        full = left + right
        out_values[out_cur:out_cur+left] = left_values[left_cur:left_cur+left]
        out_values[out_cur+left:out_cur+full] = right_values[right_cur:right_cur+right]
        out_sizes[i] = full
        left_cur += left
        right_cur += right
        out_cur += full
    return out_sizes, out_values

@nb.njit('(int64[::1], int64[::1], int64[::1], int64[::1])')
def reorder_neighbours(in_sizes, in_values, index, reverse_index):
    n = reverse_index.size
    out_sizes = np.empty_like(in_sizes)
    out_values = np.empty_like(in_values)
    in_offsets = np.empty_like(in_sizes)
    s, cur = 0, 0

    for i in range(n):
        in_offsets[i] = s
        s += in_sizes[i]

    for i in range(n):
        in_ind = reverse_index[i]
        size = in_sizes[in_ind]
        in_offset = in_offsets[in_ind]
        out_sizes[i] = size
        for j in range(size):
            out_values[cur+j] = index[in_values[in_offset+j]]
        cur += size

    return out_sizes, out_values

@nb.njit
def small_inplace_sort(arr):
    if len(arr) < 80:
        # Basic insertion sort
        i = 1
        while i < len(arr):
            x = arr[i]
            j = i - 1
            while j >= 0 and arr[j] > x:
                arr[j+1] = arr[j]
                j = j - 1
            arr[j+1] = x
            i += 1
    else:
        arr.sort()
@nb.jit('(float64[:, ::1], float64[::1], int64[::1], int64[::1])')
def compute(poss, radii, neighbours_sizes, neighbours_values):
    n, m = neighbours_sizes.size, np.max(neighbours_sizes)

    # Big buffers allocated with the maximum size.
    # Thanks to virtual memory, they do not take more memory than actually needed.
    particle_corsp_overlaps = np.empty(neighbours_values.size, dtype=np.float64)
    # Indices are stored as integers (int64) rather than floats.
    ends_ind_org = np.empty((neighbours_values.size, 2), dtype=np.int64)

    in_offset = 0
    out_offset = 0

    buff1 = np.empty(m, dtype=np.int64)    # candidate neighbour indices
    buff2 = np.empty(m, dtype=np.float64)  # gaps of the contacts found
    buff3 = np.empty(m, dtype=np.int64)    # indices of the contacts found

    for particle_idx in range(n):
        size = neighbours_sizes[particle_idx]
        cur = 0

        # Copy the neighbours, skipping the current ball itself
        for i in range(size):
            value = neighbours_values[in_offset+i]
            if value != particle_idx:
                buff1[cur] = value
                cur += 1

        nears_i_ind = buff1[0:cur]
        small_inplace_sort(nears_i_ind)  # Note: bottleneck of this function
        in_offset += size

        if len(nears_i_ind) == 0:
            continue

        x1, y1, z1 = poss[particle_idx]
        cur = 0

        for i in range(len(nears_i_ind)):
            index = nears_i_ind[i]
            x2, y2, z2 = poss[index]
            dist = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)
            contact_check = dist - (radii[index] + radii[particle_idx])
            if contact_check <= 0.0:
                buff2[cur] = contact_check
                buff3[cur] = index
                cur += 1

        particle_corsp_overlaps[out_offset:out_offset+cur] = buff2[0:cur]

        contacts_sec_ind = buff3[0:cur]
        small_inplace_sort(contacts_sec_ind)
        sphere_olps_ind = contacts_sec_ind

        for i in range(cur):
            ends_ind_org[out_offset+i, 0] = particle_idx
            ends_ind_org[out_offset+i, 1] = sphere_olps_ind[i]

        out_offset += cur

    # Truncate the views to their real size
    particle_corsp_overlaps = particle_corsp_overlaps[:out_offset]
    ends_ind_org = ends_ind_org[:out_offset]

    # Every contact appears twice: once as (i, j) and once as (j, i)
    assert len(ends_ind_org) % 2 == 0
    size = len(ends_ind_org)//2

    ends_ind = np.empty((size,2), dtype=np.int64)
    ends_ind_idx = np.empty(size, dtype=np.int64)
    gap = np.empty(size, dtype=np.float64)
    cur = 0

    # Find efficiently duplicates (replace np.unique+np.sort)
    for i in range(len(ends_ind_org)):
        left, right = ends_ind_org[i]
        if left < right:
            ends_ind[cur, 0] = left
            ends_ind[cur, 1] = right
            ends_ind_idx[cur] = i
            gap[cur] = particle_corsp_overlaps[i]
            cur += 1

    return gap, ends_ind, ends_ind_idx, ends_ind_org
def ends_gap(poss, radii):
    assert poss.size >= 1

    # Sort the balls
    index = np.argsort(radii)
    reverse_index = np.empty(index.size, np.int64)
    reverse_index[index] = np.arange(index.size, dtype=np.int64)
    sorted_poss = poss[index]
    sorted_radii = radii[index]

    # Split them in two groups: the small and the big ones
    split_ind = len(radii) * 3 // 4
    small_poss, big_poss = np.split(sorted_poss, [split_ind])
    small_radii, big_radii = np.split(sorted_radii, [split_ind])
    max_small_radii = sorted_radii[max(split_ind, 0)]
    max_big_radii = sorted_radii[-1]

    # Find the neighbours in parallel
    result = Parallel(n_jobs=4, backend='threading')([
        find_neighbours(small_poss, small_poss, small_radii+max_small_radii),
        find_neighbours(small_poss, big_poss,   small_radii+max_big_radii  ),
        find_neighbours(big_poss,   small_poss, big_radii+max_small_radii  ),
        find_neighbours(big_poss,   big_poss,   big_radii+max_big_radii    )
    ])
    small_small_neighbours = result[0]
    small_big_neighbours = result[1]
    big_small_neighbours = result[2]
    big_big_neighbours = result[3]

    # Merge the (segmented) arrays in a big one
    neighbours_sizes, neighbours_values = vstack_neighbours(
        hstack_neighbours(small_small_neighbours, small_big_neighbours, split_ind),
        hstack_neighbours(big_small_neighbours, big_big_neighbours, split_ind)
    )

    # Reverse the indices.
    # Note that the results in `neighbours_values` associated to
    # `neighbours_sizes[i]` are subsets of `query_radius([poss[i]], r=dia_max)`
    # on a `BallTree(poss)`.
    res = reorder_neighbours(neighbours_sizes, neighbours_values, index, reverse_index)
    neighbours_sizes, neighbours_values = res

    # Finally compute the neighbours with a method similar to the
    # previous one, but using a much faster optimized code.
    return compute(poss, radii, neighbours_sizes, neighbours_values)

result = ends_gap(poss, radii)
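The replacement of np.unique + np.sort by a single pass, described before the listing, can be checked on a toy input. This is a vectorised NumPy sketch of the same idea rather than the Numba loop itself, with hypothetical function names. It assumes what the main loop guarantees: every contact is listed in both directions, (i, j) and (j, i), and rows are ordered by the first index and then by the second.

```python
import numpy as np

def unique_pairs_reference(ends_ind_org):
    """Reference result: sort each row, then deduplicate the rows with np.unique."""
    return np.unique(np.sort(ends_ind_org), axis=0, return_index=True)

def unique_pairs_single_pass(ends_ind_org):
    """Keep only rows with left < right; relies on a symmetric, row-ordered input."""
    keep = np.flatnonzero(ends_ind_org[:, 0] < ends_ind_org[:, 1])
    return ends_ind_org[keep], keep

# Toy contact list: contacts 0-2, 0-3 and 1-3, each listed in both directions.
toy_ends_ind_org = np.array(
    [[0, 2], [0, 3], [1, 3], [2, 0], [3, 0], [3, 1]], dtype=np.int64)
ref_pairs, ref_idx = unique_pairs_reference(toy_ends_ind_org)
fast_pairs, fast_idx = unique_pairs_single_pass(toy_ends_ind_org)
```

Because a row (i, j) with i < j is emitted while processing ball i, it always precedes its mirror (j, i), so the kept rows are already unique and in lexicographic order, and their positions match the first-occurrence indices returned by np.unique.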
Here are the results (still on the same i5-9600KF machine):
Small dataset:
- Reference optimized Numba code: 256 ms
- This highly-optimized Numba code: 82 ms
Big dataset:
- Reference optimized Numba code: 42.7 s (takes about 7~8 GiB of RAM)
- This highly-optimized Numba code: 4.2 s (takes about 1 GiB of RAM)
Thus, the new algorithm is about 3.1 times faster on the small dataset (in addition to the previous optimizations), and about 10 times faster on the big dataset! This is 3 orders of magnitude faster than the initially posted algorithms.
Note that 80% of the time is spent in the BallTree query (which is already mostly parallel). The main Numba computing function takes only 12% of the time, and more than 75% of that is spent sorting the input indices. As a result, the neighbourhood search is clearly the bottleneck. It can be improved a bit by splitting the current queries into multiple smaller ones, but this would make the code even more complex for a relatively small improvement (e.g. 1.5x faster). Note that more complex code is harder to maintain, and modifications are bug-prone. Thus, I think moving to a native language to overcome the limitations of Python is the best way to increase performance. That being said, writing faster native code to solve this problem is far from simple (unless you find a good k-d tree, octree or ball tree library). Still, it is certainly better than optimizing this code further.
Analysis
A profiling analysis shows that at least 50% of the time in scikit-learn's BallTree is spent in unoptimized scalar loops that could use SIMD instructions like AVX2 (and loop unrolling) to be about 4 times faster. Additionally, some multi-threading issues are visible in the profiler timeline (image not reproduced here: the 4 threads at the top are the joblib workers, and the light-green sections are idle time).
This shows that this implementation is sub-optimal. One possible way to easily improve the execution time may be to optimize the hot loops of the scikit-learn BallTree implementation. Another strategy could be to try to use threads more efficiently (possibly by releasing the GIL in some parts of the scikit-learn module).
The BallTree class of scikit-learn is written in Cython (BallTree, like KDTree, is built on the shared BinaryTree base class), so you can try rebuilding the package on your machine with tweaked compiler optimizations. Using the flags -O3 -march=native -ffast-math should enable the compiler to use faster SIMD instructions and more aggressive optimizations, resulting in a significant speed-up. Note that -ffast-math is unsafe, as it assumes the scikit-learn code never uses NaN, Inf or -0 values (otherwise the result is undefined) and that floating-point operations are associative (which can change the results). That being said, such an option is critical to improve the automatic vectorization of numerical code.
As for the GIL, one can see that it is released in the query_radius function, but this does not seem to be the case for the BallTree constructor. Perhaps the simplest solution is to implement a parallel version of query/query_radius, as Scipy did.

By fixing the query radius at twice the max sphere radius, you're creating a lot of spurious "collisions" to filter out.
The Python below achieves a significant speedup relative to your answer by using a fourth dimension to improve the selectivity of the kd-tree queries. Each Euclidean ball of radius r is over-approximated by an L1 ball of radius r√d where d is the dimension (3 here). The test for L1 balls colliding in 3d becomes a test for points being within a fixed L1 distance in 4d.
If you switched to a lower level language, you could potentially avoid a separate filtering step by altering the kd-tree implementation to use a combination L2+L1 metric.
import numpy as np
from scipy import spatial
from timeit import default_timer


def load_data():
    centers = np.loadtxt("pos_large.csv", delimiter=",")
    radii = np.loadtxt("radii_large.csv")
    assert radii.shape + (3,) == centers.shape
    return centers, radii


def count_contacts(centers, radii):
    scaled_centers = centers / np.sqrt(centers.shape[1])
    max_radius = radii.max()
    tree = spatial.cKDTree(np.c_[scaled_centers, max_radius - radii])
    count = 0
    for i, x in enumerate(np.c_[scaled_centers, radii - max_radius]):
        for j in tree.query_ball_point(x, r=2 * max_radius, p=1):
            d = centers[i] - centers[j]
            r = radii[i] + radii[j]
            if i < j and np.inner(d, d) <= r * r:
                count += 1
    return count


def main():
    centers, radii = load_data()
    start = default_timer()
    print(count_contacts(centers, radii))
    end = default_timer()
    print(end - start)


if __name__ == "__main__":
    main()
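Why the 4D query in count_contacts cannot miss a contact: with R the maximum radius and d the dimension, the L1 distance between the query point (c_i/√d, r_i − R) and the tree point (c_j/√d, R − r_j) is ‖c_i − c_j‖₁/√d + (2R − r_i − r_j). It is at most 2R exactly when ‖c_i − c_j‖₁ ≤ √d (r_i + r_j), and since ‖x‖₁ ≤ √d ‖x‖₂, every pair with ‖c_i − c_j‖₂ ≤ r_i + r_j passes. The sketch below checks this numerically by brute force on synthetic data (no tree; the function names and data are made up for the check):

```python
import numpy as np

def l1_candidates_4d(centers, radii):
    """Boolean matrix of pairs that pass the 4D L1 pre-filter (brute force)."""
    dim = centers.shape[1]
    max_radius = radii.max()
    scaled = centers / np.sqrt(dim)
    tree_pts = np.c_[scaled, max_radius - radii]    # points that would be stored in the tree
    query_pts = np.c_[scaled, radii - max_radius]   # points that would be used as queries
    l1 = np.abs(query_pts[:, None, :] - tree_pts[None, :, :]).sum(axis=2)
    return l1 <= 2.0 * max_radius

def true_contacts(centers, radii):
    """Boolean matrix of pairs whose Euclidean balls touch or overlap."""
    diff = centers[:, None, :] - centers[None, :, :]
    dist2 = np.einsum('ijk,ijk->ij', diff, diff)
    rsum = radii[:, None] + radii[None, :]
    return dist2 <= rsum * rsum

rng = np.random.RandomState(3)
demo_centers = rng.uniform(0.0, 1.0, (300, 3))
demo_radii = rng.uniform(0.005, 0.1, 300)
candidates = l1_candidates_4d(demo_centers, demo_radii)
contacts = true_contacts(demo_centers, demo_radii)
print(int(contacts.sum()), "contacts among", int(candidates.sum()), "candidates")
```

The pre-filter still returns false positives (the L1 ball is larger than the Euclidean ball), which is why count_contacts keeps the exact test on np.inner(d, d).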

As an update to Richard's answer, and to overcome probable memory leaks, I am posting this answer. During my test runs, memory usage grew and limited execution to smaller data volumes (a maximum of 200,000 on my machine and 100,000 on Colab). This problem leads to much longer runtimes than those reported by Richard. So, I opened a SciPy issue about these performance differences and posted and compared some memory results there.
But I have not received any answer so far, and the origin of these significant performance differences is not clear to me yet.
Fezzani referred to another SciPy issue that suggests processing the queries in chunks, and prepared a thorough comparison showing the influence of the chunk size on the runtimes. Strangely, although Fezzani's machine (Intel® Core™ i7-10700K CPU @ 3.80GHz × 16; 32 GiB of RAM) seems to be more powerful than Richard's machine (6-core machine with an i5-9600KF processor, 16 GiB of RAM, 2-channel DDR4 @ 3200MHz reaching 36~40 GiB/s), his run on the large data takes at least (around) 33 seconds with the chunk method (to avoid memory leaks).
I could not figure out why, or which hardware helps a machine avoid the memory leaks and reach execution times as satisfying as Richard's (perhaps it is related to the KF type of Richard's CPU).
Looking through some related memory issues, my guess is that the cKDTree methods inevitably face this problem when the data volume is huge or …, and that scikit-learn is perhaps a better choice. In this regard, based on my understanding of JaminSore's answer and the Martelli answer it refers to, I evaluated BallTree and KDTree from scikit-learn. BallTree performs better than KDTree in my cases (about 1.5 to 2 times), so I used it. There were no memory leaks for the large data, but it took 2 minutes (Richard's results and mine now differ just in time units ;)). It ran faster than scipy as the data volume increased. In my tests, scipy was faster on smaller data volumes (low memory consumption), and as the data volume grows, scipy's performance falls behind due to its implementation behavior or related bugs (still unclear to me). For my prepared data volume of 100,000, scikit-learn performs 1.5 to 2 times faster.
I guess that returning arrays is the big advantage of scikit-learn compared with the lists returned by the scipy method, which can be inferred from the aforementioned Martelli answer. It may be the reason for the different performance.
scikit-learn methods return an ndarray of dtype object containing arrays of different lengths, which need to be sorted to get the same results as the main code. I applied the sorting to each element inside the loop of the compute function by changing the nears_i_ind line to nears_i_ind = np.sort(all_neighbours[an_offset:an_offset+cur_len]). With BallTree, tmp and all_neighbours consume nearly the same amount of memory. Note: if both use the same variable name, memory consumption is reduced (almost halved). So, Richard's ends_gap function modified to use BallTree is:
def ends_gap(poss, dia_max):
    balltree = BallTree(poss, metric='euclidean')

    # tmp = balltree.query_radius(poss, r=dia_max)
    # all_neighbours = np.concatenate(tmp, dtype=np.int64)
    all_neighbours = balltree.query_radius(poss, r=dia_max)
    all_neighbours_sizes = np.array([len(e) for e in all_neighbours], dtype=np.int64)
    all_neighbours = np.concatenate(all_neighbours, dtype=np.int64)

    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
It is not multi-threaded; multi-threading could improve the speed, and I will try it.
On my machine (1st-gen Intel Core i5 760 CPU @ 2.8GHz, 16 GB dual-channel DDR3 CL9 Ripjaws RAM, x64 Windows system) for a data volume of 200,000:
There were some mistakes in my two proposed methods that resulted in different gap values, as mentioned in the Notes section by Richard. To produce the same results, return_sorted=True must be added for nears_i_ind in the optimized algorithm, and ends_ind and ends_ind_lst must be changed to empty lists, with the if-else statements removed, in both codes:
Optimized algorithm:
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = []
    ends_ind = [] # <------- this line is modified
    kdtree = cKDTree(poss)
    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        nears_i_ind = np.array(kdtree.query_ball_point(cur_point, r=dia_max, return_sorted=True), dtype=np.int64) # <------- this line is modified
        assert len(nears_i_ind) > 0
        if len(nears_i_ind) <= 1:
            continue
        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = distance.cdist(poss[nears_i_ind], cur_point[None, :]).squeeze()
        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]
        particle_corsp_overlaps.append(connected)
        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)
        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        ends_ind.append(ends_ind_mod_temp) # <------- this line is modified
    ends_ind_org = np.concatenate(ends_ind)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
Numba final optimized code:
@nb.jit('(float64[:, ::1], int64[::1], int64[::1])')
def compute(poss, all_neighbours, all_neighbours_sizes):
    particle_corsp_overlaps = []
    ends_ind_lst = [] # <------- this line is modified
    an_offset = 0
    for particle_idx in range(len(poss)):
        cur_len = all_neighbours_sizes[particle_idx]
        nears_i_ind = np.sort(all_neighbours[an_offset:an_offset+cur_len]) # <------- this line is modified
        an_offset += cur_len
        assert len(nears_i_ind) > 0
        if len(nears_i_ind) <= 1:
            continue
        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = np.empty(len(nears_i_ind), dtype=np.float64)
        x1, y1, z1 = poss[particle_idx]
        for i in range(len(nears_i_ind)):
            x2, y2, z2 = poss[nears_i_ind[i]]
            dist_i[i] = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)
        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]
        particle_corsp_overlaps.append(connected)
        contacts_ind = np.where(contact_check <= 0)
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)
        ends_ind_mod_temp = np.empty((len(sphere_olps_ind), 2), dtype=np.int64)
        for i in range(len(sphere_olps_ind)):
            ends_ind_mod_temp[i, 0] = particle_idx
            ends_ind_mod_temp[i, 1] = sphere_olps_ind[i]
        ends_ind_lst.append(ends_ind_mod_temp) # <------- this line is modified
    return particle_corsp_overlaps, ends_ind_lst
def ends_gap(poss, dia_max):
    balltree = BallTree(poss, metric='euclidean') # <------- new code
    all_neighbours = balltree.query_radius(poss, r=dia_max) # <------- new code and modified
    all_neighbours_sizes = np.array([len(e) for e in all_neighbours], dtype=np.int64) # <------- this line is modified
    all_neighbours = np.concatenate(all_neighbours, dtype=np.int64) # <------- this line is modified
    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
On my machine for around 550000 data volume:

Have you tried FLANN?
This code doesn't solve your problem completely. It simply finds the nearest 50 neighbors to each point in your 500000 point dataset:
from pyflann import FLANN
p = np.loadtxt("pos_large.csv", delimiter=",")
flann = FLANN()
flann.build_index(pts=p)
idx, dist = flann.nn_index(qpts=p, num_neighbors=50)
The last line takes less than a second on my laptop without any tuning or parallelization.

Related

numpy matmul is very very slow

I'm trying to implement the Gradient descent method for solving $Ax = b$ for a positive definite symmetric matrix $A$ of size about $9600 \times 9600$. I thought my code was relatively simple
# Solves the problem Ax = b for x within epsilon tolerance or until MAX_ITERATION is reached
def GradientDescent(Amat,target,epsilon = .01,MAX_ITERATION = 100,x=np.zeros(9604)):
    CurrentRes = target-np.matmul(Amat,x)
    count = 0
    while(np.linalg.norm(CurrentRes)> epsilon and count < MAX_ITERATION):
        Ar = np.matmul(Amat,CurrentRes)
        alpha = CurrentRes.T.dot(CurrentRes)/CurrentRes.T.dot(Ar)
        x = x+alpha*CurrentRes
        Ax = np.matmul(Amat,x)
        CurrentRes = target-Ax
        count = count+1
    return(x,count,np.linalg.norm(CurrentRes))
# A is square matrix about 9600x9600 and b is about 9600x1
GDSum = GradientDescent(A,b)
but the above takes almost 3 minutes to run a single iteration of the main while loop.
I didn't think that $9600 \times 9600$ was too big for NumPy to handle effectively, but even the step of computing alpha which is just the quotient of two dot products is taking over 30 seconds.
I tried error-testing the code by timing each action in the while loop, and they are all running much slower than expected. A single matrix multiplication is taking almost a minute. The steps involving vector addition or subtraction at least seem to be running quickly.
Perhaps the most relevant bit of information is missing.
Your function is fast when A and b are Numpy arrays, but it's terribly slow when they are lists.
Is that your case?
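A quick way to check this, as a minimal sketch with made-up random data (the size n = 300 is an assumption chosen so the list case finishes quickly): time the same product once with ndarrays and once with nested lists, then convert the lists once with np.asarray before entering the loop.

```python
import time
import numpy as np

n = 300                                   # small size so the list case finishes quickly
rng = np.random.RandomState(0)
A = rng.rand(n, n)
b = rng.rand(n)

A_list, b_list = A.tolist(), b.tolist()   # the same data as nested Python lists

t0 = time.perf_counter()
y_arr = np.matmul(A, b)                   # operands are already ndarrays
t_arr = time.perf_counter() - t0

t0 = time.perf_counter()
y_list = np.matmul(A_list, b_list)        # NumPy has to convert the lists on every call
t_list = time.perf_counter() - t0

# Converting once up front removes the repeated conversion cost.
A_fixed = np.asarray(A_list, dtype=np.float64)
b_fixed = np.asarray(b_list, dtype=np.float64)
print("ndarray: %.6f s, list: %.6f s" % (t_arr, t_list))
```

If type(A) or type(b) turns out to be list in the original script, passing A_fixed and b_fixed to GradientDescent instead is the change to try first.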

How to accelerate my written python code: function containing nested functions for classification of points by polygons

I have written the following NumPy code in Python:
def inbox_(points, polygon):
    """ Finding points in a region """
    ll = np.amin(polygon, axis=0) # lower limit
    ur = np.amax(polygon, axis=0) # upper limit
    in_idx = np.all(np.logical_and(ll <= points, points < ur), axis=1) # points in the range [boolean]
    return in_idx
def operation_(r, gap, ends_ind):
    """ calculation formula which is applied on the points specified by inbox_ function """
    r_active = np.take(r, ends_ind) # taking values from "r" based on indices and shape (paired_values) of "ends_ind"
    r_sub = np.subtract.reduce(r_active, axis=1) # subtracting each paired "r" determined by "ends_ind" [this line will be used in the final return formula]
    r_add = np.add.reduce(r_active, axis=1) # adding each paired "r" determined by "ends_ind" [this line will be used in the final return formula]
    paired_cent_dis = np.sum((r_add, gap), axis=0) # distance of the each two paired points
    return (np.power(gap, 2) * (np.power(paired_cent_dis, 2) + 5 * paired_cent_dis * r_add - 7 * np.power(r_sub, 2))) / (3 * paired_cent_dis) # Formula
def elapses(r, pos, gap, ends_ind, elem_vert, contact_poss):
    if len(gap) > 0:
        elaps = np.empty([len(elem_vert), ], dtype=object)
        operate_ = operation_(r, gap, ends_ind)
        #elbav = np.empty([len(elem_vert), ], dtype=object)
        #con_num = 0
        for i, j in enumerate(elem_vert): # loop for each section (cell or region) of a mesh
            in_bool = inbox_(contact_poss, j) # getting boolean array for points within that section
            elaps[i] = np.sum(operate_[in_bool]) # performing some calculations on that points and get the sum of them for each section
            operate_ = operate_[np.invert(in_bool)] # slicing the arrays by deleting the points on which the calculations were performed to speed-up the code in next loops
            contact_poss = contact_poss[np.invert(in_bool)] # as above
            #con_num += sum(inbox_(contact_poss, j))
            #inba_bool = inbox_(pos, j)
            #elbav[i] = 4 * np.pi * np.sum(np.power(r[inba_bool], 3)) / 3
            #pos = pos[np.invert(inba_bool)]
            #r = r[np.invert(inba_bool)]
        return elaps
r = np.load('a.npy')
pos = np.load('b.npy')
gap = np.load('c.npy')
ends_ind = np.load('d.npy')
elem_vert = np.load('e.npy')
contact_poss = np.load('f.npy')
elapses(r, pos, gap, ends_ind, elem_vert, contact_poss)
# a --------r-------> parameter corresponding to each coordinate (point); here radius (23605,) <class 'numpy.ndarray'> <class 'numpy.float64'>
# b -------pos------> coordinates of the points (23605, 3) <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
# c -------gap------> if we consider points as spheres by that radii [r], it is maximum length for spheres' over-lap (103832,) <class 'numpy.ndarray'> <class 'numpy.float64'>
# d ----ends_ind----> indices for each over-laped spheres (103832, 2) <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.int64'>
# e ---elem_vert----> vertices of the mesh's sections or cells (2000, 8, 3) <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
# f --contact_poss--> a coordinate between the paired spheres (103832, 3) <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
This code will be called frequently from other code with large inputs, so speeding it up is essential. I have tried the jit decorators from the JAX and Numba libraries to accelerate the code, but I could not get them to work properly. I have tested the code on Colab for speed (on 3 data sets, with 20, 250, and 2000 loop iterations respectively), and the results were:
11 (ms), 47 (ms), 6.62 (s) (per loop) <-- without the commented code lines in the code
137 (ms), 1.66 (s) , 4 (m) (per loop) <-- with activating the commented code lines in the code
What this code does is find the coordinates that lie in a given range and then perform some calculations on them.
I would greatly appreciate any answer that speeds up this code significantly (I believe it is possible). I would also be grateful for experienced recommendations on speeding up the code by changing (substituting) the NumPy methods used and …, or by writing the math operations differently.
Notes:
The proposed answers must be executable with Python 2 (working in both versions 2 and 3 would be excellent).
The commented-out lines in the code are unnecessary for the main aim and are written only for further evaluation. Recommendations on handling these lines in the proposed answers are appreciated but not required.
Data sets for test:
small data set: https://drive.google.com/file/d/1CswjyoqS8ogLmLQa_oNTOj221chDcbK8/view?usp=sharing
medium data set: https://drive.google.com/file/d/14RJ0Ackx88NzQWloops5FagzuNQYDSrh/view?usp=sharing
large data set: https://drive.google.com/file/d/1dJnXpb3HiAGcRC9PPTwui9joNcij4E_E/view?usp=sharing
First of all, the algorithm can be improved to be much more efficient. Indeed, a polygon can be directly assigned to each point. This is like a classification of points by polygons. Once the classification is done, you can perform one/many reductions by key where the key is the polygon ID.
This new algorithm consists in:
computing all the bounding boxes of the polygons;
classifying the points by polygons;
performing the reduction by key (where the key is the polygon ID).
This approach is much more efficient than iterating over all the points for each polygon and filtering the attribute arrays (e.g. operate_ and contact_poss). Indeed, filtering is an expensive operation since it requires the target array (which may not fit in the CPU caches) to be fully read and then written back. Not to mention that this operation requires a temporary array to be allocated/deleted if it is not performed in-place, and that it cannot benefit from SIMD instructions on most x86/x86-64 platforms (as it requires the new AVX-512 instruction set). It is also harder to parallelize, since each filtering step is too fast for threads to be useful while the steps have to be done sequentially.
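The three steps listed above can be sketched in plain NumPy on a tiny made-up example. This is only an illustration of the idea, not the implementation proposed in this answer: the helper name sum_by_box is hypothetical, and the dense points-by-boxes boolean array it builds is only reasonable for moderate sizes.

```python
import numpy as np

def sum_by_box(points, boxes_vertices, values):
    """Sum `values` per box; each point goes to the first box containing it."""
    ll = boxes_vertices.min(axis=1)           # (n_boxes, 3) lower corners
    ur = boxes_vertices.max(axis=1)           # (n_boxes, 3) upper corners
    # inside[p, b] is True when point p lies in the half-open box b
    inside = np.all((ll[None] <= points[:, None]) & (points[:, None] < ur[None]), axis=2)
    box_id = np.where(inside.any(axis=1), inside.argmax(axis=1), -1)
    keep = box_id >= 0
    # Reduction by key: one pass over the points, no repeated filtering.
    return np.bincount(box_id[keep], weights=values[keep], minlength=len(ll))

# Two unit cubes side by side along x, each described by its 8 corners.
corners = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], dtype=float)
boxes = np.stack([corners, corners + [1, 0, 0]])
pts = np.array([[0.5, 0.5, 0.5], [1.5, 0.2, 0.2], [0.1, 0.9, 0.3], [5.0, 5.0, 5.0]])
vals = np.array([1.0, 2.0, 3.0, 4.0])
sums = sum_by_box(pts, boxes, vals)           # the last point lies in no box and is ignored
```

Here np.bincount plays the role of the reduction by key: the key is the box index and the weights are the per-point values.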
Regarding the implementation of the algorithm, Numba can be used to speed up the overall computation a lot. The main benefit of using Numba is to drastically reduce the number of expensive temporary arrays created by NumPy in your current implementation. Note that you can specify the function types to Numba so that it compiles the functions when they are defined. Assertions can be used to make the code more robust and to help the compiler know the size of a given dimension, so that it generates significantly faster code (the JIT compiler of Numba can unroll the loops). Ternary operators can help the JIT compiler a bit to generate a faster branch-less program.
Note that the classification can easily be parallelized using multiple threads. However, one needs to be very careful about constant propagation, since some critical constants (like the shape of the working arrays and assertions) tend not to be propagated to the code executed by threads, while this propagation is critical to optimize the hot loops (e.g. vectorization, unrolling). Note also that creating many threads can be expensive on machines with many cores (from 10 ms to 0.1 ms). Thus, it is often better to use a parallel implementation only on big input data.
Here is the resulting implementation (working with both Python2 and Python3):
import numpy as np
import numba as nb
@nb.njit('float64[::1](float64[::1], float64[::1], int64[:,::1])')
def operation_(r, gap, ends_ind):
    """ calculation formula which is applied on the points specified by findMatchingPolygons_ function """
    nPoints = ends_ind.shape[0]
    assert ends_ind.shape[1] == 2
    assert gap.size == nPoints
    formula = np.empty(nPoints, dtype=np.float64)
    for i in range(nPoints):
        ind0, ind1 = ends_ind[i]
        r0, r1 = r[ind0], r[ind1]
        r_sub = r0 - r1
        r_add = r0 + r1
        cur_gap = gap[i]
        paired_cent_dis = r_add + cur_gap
        formula[i] = (cur_gap**2 * (paired_cent_dis**2 + 5 * paired_cent_dis * r_add - 7 * r_sub**2)) / (3 * paired_cent_dis)
    return formula
# Use `parallel=True` for a parallel implementation
@nb.njit('int32[::1](float64[:,::1], float64[:,:,::1])')
def findMatchingPolygons_(points, polygons):
    """ Attribute to all point a region """
    nPolygons = polygons.shape[0]
    nPolygonPoints = polygons.shape[1]
    nPoints = points.shape[0]
    assert points.shape[1] == 3
    assert polygons.shape[2] == 3
    # Compute the bounding boxes of all polygons
    ll = np.empty((nPolygons, 3), dtype=np.float64)
    ur = np.empty((nPolygons, 3), dtype=np.float64)
    for i in range(nPolygons):
        ll_x, ll_y, ll_z = polygons[i, 0]
        ur_x, ur_y, ur_z = polygons[i, 0]
        for j in range(1, nPolygonPoints):
            x, y, z = polygons[i, j]
            ll_x = x if x<ll_x else ll_x
            ll_y = y if y<ll_y else ll_y
            ll_z = z if z<ll_z else ll_z
            ur_x = x if x>ur_x else ur_x
            ur_y = y if y>ur_y else ur_y
            ur_z = z if z>ur_z else ur_z
        ll[i] = ll_x, ll_y, ll_z
        ur[i] = ur_x, ur_y, ur_z
    # Find for each point its corresponding polygon
    pointPolygonId = np.empty(nPoints, dtype=np.int32)
    # Use `nb.prange(nPoints)` for a parallel implementation
    for i in range(nPoints):
        x, y, z = points[i, 0], points[i, 1], points[i, 2]
        pointPolygonId[i] = -1
        for j in range(polygons.shape[0]):
            if ll[j, 0] <= x < ur[j, 0] and ll[j, 1] <= y < ur[j, 1] and ll[j, 2] <= z < ur[j, 2]:
                pointPolygonId[i] = j
                break
    return pointPolygonId
@nb.njit('float64[::1](float64[:,:,::1], float64[:,::1], float64[::1])')
def computeSections_(elem_vert, contact_poss, operate_):
    nPolygons = elem_vert.shape[0]
    elaps = np.zeros(nPolygons, dtype=np.float64)
    pointPolygonId = findMatchingPolygons_(contact_poss, elem_vert)
    for i, polygonId in enumerate(pointPolygonId):
        if polygonId >= 0:
            elaps[polygonId] += operate_[i]
    return elaps
def elapses(r, pos, gap, ends_ind, elem_vert, contact_poss):
    if len(gap) > 0:
        operate_ = operation_(r, gap, ends_ind)
        return computeSections_(elem_vert, contact_poss, operate_)
r = np.load('a.npy')
pos = np.load('b.npy')
gap = np.load('c.npy')
ends_ind = np.load('d.npy')
elem_vert = np.load('e.npy')
contact_poss = np.load('f.npy')
elapses(r, pos, gap, ends_ind, elem_vert, contact_poss)
Here are the results on an old 2-core machine (i7-3520M):
Small dataset:
- Original version: 5.53 ms
- Proposed version (sequential): 0.22 ms (x25)
- Proposed version (parallel): 0.20 ms (x27)
Medium dataset:
- Original version: 53.40 ms
- Proposed version (sequential): 1.24 ms (x43)
- Proposed version (parallel): 0.62 ms (x86)
Big dataset:
- Original version: 5742 ms
- Proposed version (sequential): 144 ms (x40)
- Proposed version (parallel): 67 ms (x86)
Thus, the proposed implementation is up to 86 times faster than the original one.

Finding n-tuple that minimizes expensive cost function

Suppose there are three variables that take on discrete integer values, say w1 = {1,2,3,4,5,6,7,8,9,10,11,12}, w2 = {1,2,3,4,5,6,7,8,9,10,11,12}, and w3 = {1,2,3,4,5,6,7,8,9,10,11,12}. The task is to pick one value from each set such that the resulting triplet minimizes some (black box, computationally expensive) cost function.
I've tried the surrogate optimization in Matlab but I'm not sure it is appropriate. I've also heard about simulated annealing but found no implementation applied to this instance.
Which algorithm, apart from exhaustive search, can solve this combinatorial optimization problem?
Any help would be much appreciated.
Simulated Annealing (SA) requires, and benefits from, an objective surface that is somewhat smooth, so that a state close to a solution is itself a good state.
For a completely random, spiky surface you might as well do a random search.
If the surface is smooth, even only in places, it makes sense to try SA.
The idea is that changing only 1 of the 3 values (sometimes) has little effect on our black-box function.
Here is a basic example of how to do this with Simulated Annealing, using frigidum in Python:
import numpy as np
w1 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w2 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
w3 = np.array( [1,2,3,4,5,6,7,8,9,10,11,12] )
W = np.array([w1,w2,w3])
LENGTH = 12
I define a black-box using the Rastrigin function.
def rastrigin_function_n( x ):
    """
    N-dimensional Rastrigin
    https://en.wikipedia.org/wiki/Rastrigin_function
    x_i is in [-5.12, 5.12]
    """
    A = 10
    n = x.shape[0]
    return A*n + np.sum( x**2- A*np.cos(2*np.pi * x) )
def black_box( x ):
    """
    Transform from domain [1,12] to [-5,5]
    to be able to push to rastrigin
    """
    x = (x - 6.5) * (5/5.5)
    return rastrigin_function_n(x)
Simulated Annealing needs to modify state X. Instead of taking/modifying values directly, we keep track of indices. This simplifies creating new proposals as an index is always an integer we can simply add/subtract 1 modulo LENGTH.
def random_start():
    """
    returns 3 random indices
    """
    return np.random.randint(0, LENGTH, size=3)
def random_small_step(x):
    """
    change only 1 index
    """
    d = np.array( [1,0,0] )
    if np.random.random() < .5:
        d = np.array( [-1,0,0] )
    np.random.shuffle(d)
    return (x+d) % LENGTH
def random_big_step(x):
    """
    change 2 indices
    """
    d = np.array( [1,-1,0] )
    np.random.shuffle(d)
    return (x+d) % LENGTH
def obj(x):
    """
    We have a triplet of indices,
    1. Calculate corresponding values in W = [w1,w2,w3]
    2. Push the values in our black-box function
    """
    indices = x
    values = W[np.array([0,1,2]), indices]
    return black_box(values)
And throw a SA Scheme at it
import frigidum
local_opt = frigidum.sa(random_start=random_start,
                        neighbours=[random_small_step, random_big_step],
                        objective_function=obj,
                        T_start=10**4,
                        T_stop=0.000001,
                        repeats=10**3,
                        copy_state=frigidum.annealing.naked)
I am not sure what the minimum of this function should be, but it found an objective value of 47.9095 at indices np.array([9, 2, 2]).
Edit:
To change the cooling schedule in frigidum, use alpha=.9. In my experience, the effort of experimenting to find the best cooling scheme doesn't outweigh simply letting it run a little longer. The multiplication you proposed (sometimes called geometric) is the standard one and is also implemented in frigidum. So to implement Tn+1 = 0.9*Tn you need alpha=.9. Be aware that this cooling step is done after N repeats, so if repeats=100, it will first do 100 proposals before lowering the temperature by the factor alpha.
Simple variations on the current state often work best. Since it's best practice to set the initial temperature high enough that most proposals (>90%) are accepted, it doesn't matter that the steps are small. But if you fear they are too small, try 2 or 3 variations. Frigidum accepts a list of proposal functions, and combinations can reinforce each other.
I have no experience with MINLP. But even so, experiments can often surprise us. So if the time/cost of bringing another competitor to the table is small, yes!
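To get a feel for what alpha and repeats mean for run time, here is a small arithmetic sketch of the geometric schedule Tn+1 = alpha*Tn. It does not call frigidum; the helper geometric_schedule is hypothetical, and how frigidum treats the exact stopping boundary internally is an assumption.

```python
def geometric_schedule(T_start, T_stop, alpha, repeats):
    """Count temperature levels and total proposals for T_{n+1} = alpha * T_n."""
    levels = 0
    T = T_start
    while T > T_stop:
        levels += 1          # `repeats` proposals are made at this temperature
        T *= alpha           # then the temperature is lowered once
    return levels, levels * repeats

levels, proposals = geometric_schedule(T_start=10**4, T_stop=1e-6, alpha=0.9, repeats=100)
print(levels, proposals)
```

A larger alpha (slower cooling) or a larger repeats both increase the number of objective evaluations, which matters when the black box is expensive.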
Try every possible combination of the three values and see which has the lowest cost.
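With 12 values per variable that is only 12 * 12 * 12 = 1728 evaluations. A minimal sketch, assuming the Rastrigin-based black_box from the other answer as a stand-in for the real cost function:

```python
import itertools
import numpy as np

def black_box(values):
    """Stand-in cost: Rastrigin on values rescaled from [1, 12] to about [-5, 5]."""
    x = (np.asarray(values, dtype=float) - 6.5) * (5 / 5.5)
    return 10 * x.shape[0] + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

w1 = w2 = w3 = range(1, 13)

# Evaluate every triplet once and keep the cheapest one.
best_triplet = min(itertools.product(w1, w2, w3), key=black_box)
best_cost = black_box(best_triplet)
print(best_triplet, best_cost)
```

Whether this is affordable depends on how long one evaluation of the real cost function takes; if there are ties, min returns the first minimizer it encounters.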

How to apply bounds on a variable when performing optimisation in Pytorch?

I am trying to use Pytorch for non-convex optimisation, trying to maximise my objective (so minimise in SGD). I would like to bound my dependent variable x > 0, and also have the sum of my x values be less than 1000.
I think I have the penalty implemented correctly in the form of a ramp penalty, but am struggling with the bounding of the x variable. In Pytorch you can set the bounds using clamp but it doesn't seem appropriate in this case. I think this is because optim needs the gradients free under the hood. Full working example:
import torch
from torch.autograd import Variable
import numpy as np
def objective(x, a, b, c): # Want to maximise this quantity (so minimise in SGD)
    d = 1 / (1 + torch.exp(-a * (x)))
    # Checking constraint
    exceeded_limit = constraint(x).item()
    #print(exceeded_limit)
    obj = torch.sum(d * (b * c - x))
    # If overlimit add ramp penalty
    if exceeded_limit < 0:
        obj = obj - (exceeded_limit * 10)
        print("Exceeded limit")
    return - obj
def constraint(x, limit = 1000): # Must be > 0
    return limit - x.sum()
N = 1000
# x is variable to optimise for
x = Variable(torch.Tensor([1 for ii in range(N)]), requires_grad=True)
a = Variable(torch.Tensor(np.random.uniform(0,100,N)), requires_grad=True)
b = Variable(torch.Tensor(np.random.rand(N)), requires_grad=True)
c = Variable(torch.Tensor(np.random.rand(N)), requires_grad=True)
# Would like to include the clamp
# x = torch.clamp(x, min=0)
# Non-convex method
opt = torch.optim.SGD([x], lr=.01)
for i in range(10000):
    # Zeroing gradients
    opt.zero_grad()
    # Evaluating the objective
    obj = objective(x, a, b, c)
    # Calculate gradients
    obj.backward()
    opt.step()
    if i%1000==0: print("Objective: %.1f" % -obj.item())
print("\nObjective: {}".format(-obj))
print("Limit: {}".format(constraint(x).item()))
if torch.sum(x<0) > 0: print("Bounds not met")
if constraint(x).item() < 0: print("Constraint not met")
Any suggestions as to how to impose the bounds would be appreciated, either using clamp or otherwise. Or generally advice on non-convex optimisation using Pytorch. This is a much simpler and scaled down version of the problem I'm working so am trying to find a lightweight solution if possible. I am considering using a workaround such as transforming the x variable using an exponential function but then you'd have to scale the function to avoid the positive values becoming infinite, and I want some flexibility with being able to set the constraint.
I ran into the same problem as you.
I want to apply bounds on a variable in PyTorch, too.
And I solved this problem with Way3 below.
Your example is a little complex, and I am still learning English,
so I give a simpler example below.
For example, there is a trainable variable v whose bounds are (-1, 1):
v = torch.tensor((0.5, ), requires_grad=True)
v_loss = xxxx
optimizer.zero_grad()
v_loss.backward()
optimizer.step()
Way1. RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
v.clamp_(-1, 1)
Way2. RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed.
v = torch.clamp(v, -1, +1) # equal to v = v.clamp(-1, +1)
Way3. No error. I solved this problem with Way3.
with torch.no_grad():
    v[:] = v.clamp(-1, +1) # You must use v[:]=xxx instead of v=xxx
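Put together as a complete loop, Way3 could look like the sketch below. The toy loss and the learning rate are made up for illustration; the in-place v.clamp_ inside torch.no_grad() is an equivalent way of writing the v[:] = v.clamp(...) assignment.

```python
import torch

v = torch.tensor([0.5], requires_grad=True)
optimizer = torch.optim.SGD([v], lr=0.1)

for _ in range(100):
    optimizer.zero_grad()
    loss = ((v - 3.0) ** 2).sum()     # toy loss whose unconstrained minimum is v = 3
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        v.clamp_(-1.0, 1.0)           # project back into the bounds after each step

print(v)
```

Because the clamp runs outside of autograd, v stays a leaf tensor and the optimizer keeps updating the same object.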

How to implement a method to generate Poincaré sections for a non-linear system of ODEs?

I have been trying to work out how to calculate Poincaré sections for a system of non-linear ODEs, using a paper on the exact system as reference, and have been wrestling with numpy to try and make it run better. This is intended to run within a bounded domain.
Currently, I have the following code
import numpy as np
from scipy.integrate import odeint
X = 0
Y = 1
Z = 2
def generate_poincare_map(function, initial, plane, iterations, delta):
    intersections = []
    p_i = odeint(function, initial.flatten(), [0, delta])[-1]
    for i in range(1, iterations):
        p_f = odeint(function, p_i, [i * delta, (i+1) * delta])[-1]
        if (p_f[Z] > plane) and (p_i[Z] < plane):
            intersections.append(p_i[:2])
        if (p_f[Z] > plane) and (p_i[Z] < plane):
            intersections.append(p_i[:2])
        p_i = p_f
    return np.stack(intersections)
This is pretty wasteful due to the integration solely between successive time steps, and seems to produce incorrect results. The original reference includes sections along the lines of
whereas mine tend to result in something along the lines of
Do you have any advice on how to proceed to make this more correct, and perhaps a little faster?
To get a Poincaré map of the ABC flow
def ABC_ode(u,t):
    A, B, C = 0.75, 1, 1 # matlab parameters
    x, y, z = u
    return np.array([
        A*np.sin(z)+C*np.cos(y),
        B*np.sin(x)+A*np.cos(z),
        C*np.sin(y)+B*np.cos(x)
    ])
def mysolver(u0, tspan): return odeint(ABC_ode, u0, tspan, atol=1e-10, rtol=1e-11)
you first have to understand that the dynamical system is really about the points (cos(x),sin(x)) etc. on the unit circle. So values that differ by multiples of 2*pi represent the same point. The computation of the section has to reflect this, either by computing it on the Cartesian product of the 3 circles or by reducing the angles to a fundamental period. Let's stay with the second variant, and choose [-pi,pi] as the fundamental period to have the zero location well in the center. Keep in mind that jumps larger than pi come from the angle reduction, not from a real crossing of that interval.
from numpy import pi
def find_crosssections(x0,y0):
    u0 = [x0,y0,0]
    px = []
    py = []
    u = mysolver(u0, np.arange(0, 4000, 0.5)); u0 = u[-1]
    u = np.mod(u+pi,2*pi)-pi
    x,y,z = u.T
    for k in range(len(z)-1):
        if z[k]<=0 and z[k+1]>=0 and z[k+1]-z[k]<pi:
            # find a more exact intersection location by linear interpolation
            s = -z[k]/(z[k+1]-z[k]) # 0 = z[k] + s*(z[k+1]-z[k])
            rx, ry = (1-s)*x[k]+s*x[k+1], (1-s)*y[k]+s*y[k+1]
            px.append(rx)
            py.append(ry)
    return px,py
To get a full picture of the Poincare cross-section and avoid duplicate work, use a grid of squares and mark if one of the intersections already fell in it. Only start new iterations from the centers of free squares.
import matplotlib.pyplot as plt
N=20
grid = np.zeros([N,N], dtype=int)
for i in range(N):
    for j in range(N):
        if grid[i,j]>0: continue
        x0, y0 = (2*i+1)*pi/N-pi, (2*j+1)*pi/N-pi
        px, py = find_crosssections(x0,y0)
        for rx,ry in zip(px,py):
            m, n = int((rx+pi)*N/(2*pi)), int((ry+pi)*N/(2*pi))
            grid[m,n]=1
        plt.plot(px, py, '.', ms=2)
You can now play with the density of the grid and the length of the integration interval to get the plot a little more filled out, but all characteristic features are already here. But I'd recommend re-programming this in a compiled language, as the computation will take some time.
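Before porting to a compiled language, the per-sample Python loop in find_crosssections can be replaced by array operations. The following is a sketch under the same crossing condition, with one added guard (z1 > z0) to avoid a division by zero; the function name upward_crossings and the synthetic test signal are made up for illustration, and the ODE integration itself is not sped up by this.

```python
import numpy as np

def upward_crossings(x, y, z):
    """Interpolated (x, y) where the wrapped z crosses 0 upwards, without a Python loop."""
    z0, z1 = z[:-1], z[1:]
    # Upward sign change; jumps of pi or more come from angle wrapping and are rejected.
    hit = (z0 <= 0) & (z1 >= 0) & (z1 - z0 < np.pi) & (z1 > z0)
    s = -z0[hit] / (z1[hit] - z0[hit])        # fraction of the step at which z = 0
    px = (1 - s) * x[:-1][hit] + s * x[1:][hit]
    py = (1 - s) * y[:-1][hit] + s * y[1:][hit]
    return px, py

# Synthetic check: z sweeps through one period, x and y vary linearly in time.
t = np.linspace(0.0, 2 * np.pi, 101)
z = np.mod(t + 1.0 + np.pi, 2 * np.pi) - np.pi   # wrapped angle with one upward crossing
px, py = upward_crossings(2 * t, -t, z)
```

In find_crosssections this would be called as upward_crossings(x, y, z) right after x,y,z = u.T.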