How to implement a method to generate Poincaré sections for a non-linear system of ODEs? - numpy

I have been trying to work out how to calculate Poincaré sections for a system of non-linear ODEs, using a paper on the exact system as reference, and have been wrestling with numpy to try and make it run better. This is intended to run within a bounded domain.
Currently, I have the following code
import numpy as np
from scipy.integrate import odeint

X = 0
Y = 1
Z = 2

def generate_poincare_map(function, initial, plane, iterations, delta):
    intersections = []
    p_i = odeint(function, initial.flatten(), [0, delta])[-1]
    for i in range(1, iterations):
        p_f = odeint(function, p_i, [i * delta, (i+1) * delta])[-1]
        if (p_f[Z] > plane) and (p_i[Z] < plane):
            intersections.append(p_i[:2])
        if (p_f[Z] > plane) and (p_i[Z] < plane):
            intersections.append(p_i[:2])
        p_i = p_f
    return np.stack(intersections)
This is pretty wasteful due to the integration solely between successive time steps, and seems to produce incorrect results. The original reference includes sections along the lines of [figure: Poincaré section from the reference], whereas mine tend to result in something along the lines of [figure: Poincaré section produced by my code].
Do you have any advice on how to proceed to make this more correct, and perhaps a little faster?

To get a Poincaré map of the ABC flow
def ABC_ode(u, t):
    A, B, C = 0.75, 1, 1  # matlab parameters
    x, y, z = u
    return np.array([
        A*np.sin(z)+C*np.cos(y),
        B*np.sin(x)+A*np.cos(z),
        C*np.sin(y)+B*np.cos(x)
    ])

def mysolver(u0, tspan): return odeint(ABC_ode, u0, tspan, atol=1e-10, rtol=1e-11)
you first have to understand that the dynamical system is really about the points (cos(x),sin(x)) etc. on the unit circle. So values that differ by multiples of 2*pi represent the same point. The computation of the section has to reflect this, either by computing it on the Cartesian product of the 3 circles, or by reducing the angles to a fundamental period. Let's stay with the second variant, and choose [-pi,pi] as the fundamental period to have the zero location well in the center. Keep in mind that jumps larger than pi come from the angle reduction, not from a real crossing of that interval.
def find_crosssections(x0, y0):
    pi = np.pi
    u0 = [x0, y0, 0]
    px = []
    py = []
    u = mysolver(u0, np.arange(0, 4000, 0.5))
    # reduce all angles to the fundamental period [-pi, pi)
    u = np.mod(u+pi, 2*pi) - pi
    x, y, z = u.T
    for k in range(len(z)-1):
        # upward crossing of z=0; steps of size >= pi are wrap-around jumps
        if z[k] <= 0 and z[k+1] >= 0 and z[k+1]-z[k] < pi:
            # find a more exact intersection location by linear interpolation
            s = -z[k]/(z[k+1]-z[k])  # 0 = z[k] + s*(z[k+1]-z[k])
            rx, ry = (1-s)*x[k]+s*x[k+1], (1-s)*y[k]+s*y[k+1]
            px.append(rx)
            py.append(ry)
    return px, py
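The crossing test and linear interpolation can also be written without the Python loop. The following is only a sketch: `upward_crossings` is a hypothetical helper name, it assumes the angles are already reduced to [-pi, pi), and the extra `dz > 0` guard (not in the loop version) is an assumption added to avoid dividing by zero when two successive samples both sit exactly on the plane.

```python
import numpy as np

def upward_crossings(x, y, z):
    """Interpolated (x, y) at upward crossings of z = 0 for wrapped angles."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    z0, z1 = z[:-1], z[1:]
    dz = z1 - z0
    # Upward crossing of z = 0; steps of size >= pi are wrap-around jumps.
    # dz > 0 is an added guard (assumption) against division by zero.
    mask = (z0 <= 0) & (z1 >= 0) & (dz > 0) & (dz < np.pi)
    s = -z0[mask] / dz[mask]          # solves 0 = z0 + s*(z1 - z0)
    px = (1 - s) * x[:-1][mask] + s * x[1:][mask]
    py = (1 - s) * y[:-1][mask] + s * y[1:][mask]
    return px, py

# Toy samples: one real crossing (-1 -> 1) and one wrap-around jump (-3 -> 3).
x = np.array([0.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
z = np.array([-1.0, 1.0, 3.0, -3.0, 3.0])
px, py = upward_crossings(x, y, z)
```

Only the first step is reported as a crossing. The jump from -3 to 3 is larger than pi and is therefore treated as an artifact of the angle reduction.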
To get a full picture of the Poincaré cross-section and avoid duplicate work, use a grid of squares and mark each square in which one of the intersections has already fallen. Only start new iterations from the centers of free squares.
import matplotlib.pyplot as plt

pi = np.pi
N = 20
grid = np.zeros([N, N], dtype=int)
for i in range(N):
    for j in range(N):
        if grid[i, j] > 0:
            continue
        x0, y0 = (2*i+1)*pi/N-pi, (2*j+1)*pi/N-pi
        px, py = find_crosssections(x0, y0)
        for rx, ry in zip(px, py):
            m, n = int((rx+pi)*N/(2*pi)), int((ry+pi)*N/(2*pi))
            grid[m, n] = 1
        plt.plot(px, py, '.', ms=2)
You can now play with the density of the grid and the length of the integration interval to get the plot a little more filled out, but all characteristic features are already here. But I'd recommend re-programming this in a compiled language, as the computation will take some time.

Related

numpy matmul is very very slow

I'm trying to implement the Gradient descent method for solving $Ax = b$ for a positive definite symmetric matrix $A$ of size about $9600 \times 9600$. I thought my code was relatively simple
# Solves the problem Ax = b for x within epsilon tolerance or until MAX_ITERATION is reached
def GradientDescent(Amat, target, epsilon=.01, MAX_ITERATION=100, x=np.zeros(9604)):
    CurrentRes = target-np.matmul(Amat, x)
    count = 0
    while (np.linalg.norm(CurrentRes) > epsilon and count < MAX_ITERATION):
        Ar = np.matmul(Amat, CurrentRes)
        alpha = CurrentRes.T.dot(CurrentRes)/CurrentRes.T.dot(Ar)
        x = x+alpha*CurrentRes
        Ax = np.matmul(Amat, x)
        CurrentRes = target-Ax
        count = count+1
    return (x, count, np.linalg.norm(CurrentRes))

# A is square matrix about 9600x9600 and b is about 9600x1
GDSum = GradientDescent(A, b)
but the above takes almost 3 minutes to run a single iteration of the main while loop.
I didn't think that $9600 \times 9600$ was too big for NumPy to handle effectively, but even the step of computing alpha which is just the quotient of two dot products is taking over 30 seconds.
I tried error-testing the code by timing each action in the while loop, and they are all running much slower than expected. A single matrix multiplication is taking almost a minute. The steps involving vector addition or subtraction at least seem to be running quickly.
#A is square matrix about 9600x9600 and b is about 9600x1
GDSum = GradientDescent(A,b)
Perhaps the most relevant bit of information is missing.
Your function is fast when A and b are Numpy arrays, but it's terribly slow when they are lists.
Is that your case?
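If list inputs are indeed the cause, converting once at the top of the function removes the repeated list-to-array conversion inside every `np.matmul` call. The sketch below is not the asker's code: `gradient_descent` and its parameter names are hypothetical. Flattening `b` with `ravel()` is an extra assumption on my part. It prevents an `(n, 1)` target from broadcasting against an `(n,)` iterate into an `(n, n)` residual, which would also make every step far slower.

```python
import numpy as np

def gradient_descent(A, b, epsilon=1e-2, max_iteration=100, x0=None):
    """Steepest descent for Ax = b with A symmetric positive definite."""
    # Convert once: lists (or nested lists) become contiguous float arrays.
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64).ravel()   # (n, 1) -> (n,), an assumption
    x = np.zeros_like(b) if x0 is None else np.asarray(x0, dtype=np.float64).ravel().copy()

    r = b - A @ x                       # current residual
    count = 0
    while np.linalg.norm(r) > epsilon and count < max_iteration:
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)      # exact line search along r
        x = x + alpha * r
        r = b - A @ x
        count += 1
    return x, count, np.linalg.norm(r)

# Small check with plain Python lists as input.
x, count, res = gradient_descent([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0],
                                 epsilon=1e-8, max_iteration=1000)
```

Printing `type(A)`, `A.dtype` and `b.shape` before calling the original function is a quick way to confirm whether either problem applies.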

How could I speed up my written python code: spheres contact detection (collision) using spatial searching

I am working on a spatial search problem for spheres, in which I want to find the spheres that are in contact. To do so, I search around each sphere for spheres whose centers lie within a distance of one maximum sphere diameter from the searching sphere's center. At first I tried SciPy methods, but they take longer than the equivalent NumPy method. With SciPy, I first determine the number K of nearest spheres and then find them with cKDTree.query, which costs more time. However, it is slower than the NumPy method even when the first step is replaced by a constant value (which is not a good idea in this case). This is contrary to my expectations about the speed of SciPy's spatial searching. So I tried replacing some NumPy lines with list loops, to speed things up using numba prange. Numba ran the code a little faster, but I believe this code can be optimized further, perhaps by vectorization, by using alternative NumPy modules, or by using Numba in another way. I iterate over all spheres to prevent probable memory leaks and …, where the number of spheres is high.
import numpy as np
import numba as nb
from scipy.spatial import cKDTree, distance

# ---------------------------- input data ----------------------------
""" For testing by prepared files:
radii = np.load('a.npy')     # shape: (n-spheres, )     must be loaded by np.load('a.npy') or np.loadtxt('radii_large.csv')
poss = np.load('b.npy')      # shape: (n-spheres, 3)    must be loaded by np.load('b.npy') or np.loadtxt('pos_large.csv', delimiter=',')
"""

rnd = np.random.RandomState(70)
data_volume = 200000

radii = rnd.uniform(0.0005, 0.122, data_volume)
dia_max = 2 * radii.max()

x = rnd.uniform(-1.02, 1.02, (data_volume, 1))
y = rnd.uniform(-3.52, 3.52, (data_volume, 1))
z = rnd.uniform(-1.02, -0.575, (data_volume, 1))
poss = np.hstack((x, y, z))
# --------------------------------------------------------------------

# @nb.jit('float64[:,::1](float64[:,::1], float64[::1])', forceobj=True, parallel=True)
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = np.array([], dtype=np.float64)
    ends_ind = np.empty([1, 2], dtype=np.int64)

    """ using list looping """
    # particle_corsp_overlaps = []
    # ends_ind = []

    # for particle_idx in nb.prange(len(poss)):  # by list looping
    for particle_idx in range(len(poss)):
        unshared_idx = np.delete(np.arange(len(poss)), particle_idx)  # <--- relatively high time consumer
        poss_without = poss[unshared_idx]

        """ # SCIPY method ---------------------------------------------------------------------------------------------
        nears_i_ind = cKDTree(poss_without).query_ball_point(poss[particle_idx], r=dia_max)  # <--- high time consumer
        if len(nears_i_ind) > 0:
            dist_i, dist_i_ind = cKDTree(poss_without[nears_i_ind]).query(poss[particle_idx], k=len(nears_i_ind))  # <--- high time consumer
            if not isinstance(dist_i, float):
                dist_i[dist_i_ind] = dist_i.copy()
        """  # NUMPY method --------------------------------------------------------------------------------------------
        lx_limit_idx = poss_without[:, 0] <= poss[particle_idx][0] + dia_max
        ux_limit_idx = poss_without[:, 0] >= poss[particle_idx][0] - dia_max
        ly_limit_idx = poss_without[:, 1] <= poss[particle_idx][1] + dia_max
        uy_limit_idx = poss_without[:, 1] >= poss[particle_idx][1] - dia_max
        lz_limit_idx = poss_without[:, 2] <= poss[particle_idx][2] + dia_max
        uz_limit_idx = poss_without[:, 2] >= poss[particle_idx][2] - dia_max

        nears_i_ind = np.where(lx_limit_idx & ux_limit_idx & ly_limit_idx & uy_limit_idx & lz_limit_idx & uz_limit_idx)[0]
        if len(nears_i_ind) > 0:
            dist_i = distance.cdist(poss_without[nears_i_ind], poss[particle_idx][None, :]).squeeze()  # <--- relatively high time consumer
        # """  # -------------------------------------------------------------------------------------------------------
            contact_check = dist_i - (radii[unshared_idx][nears_i_ind] + radii[particle_idx])
            connected = contact_check[contact_check <= 0]

            particle_corsp_overlaps = np.concatenate((particle_corsp_overlaps, connected))

            """ using list looping """
            # if len(connected) > 0:
            #     for value_ in connected:
            #         particle_corsp_overlaps.append(value_)

            contacts_ind = np.where([contact_check <= 0])[1]
            contacts_sec_ind = np.array(nears_i_ind)[contacts_ind]
            sphere_olps_ind = np.where((poss[:, None] == poss_without[contacts_sec_ind][None, :]).all(axis=2))[0]  # <--- high time consumer

            ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
            if particle_idx > 0:
                ends_ind = np.concatenate((ends_ind, ends_ind_mod_temp))
            else:
                ends_ind[0, 0], ends_ind[0, 1] = ends_ind_mod_temp[0, 0], ends_ind_mod_temp[0, 1]

            """ using list looping """
            # for contacted_idx in sphere_olps_ind:
            #     ends_ind.append([particle_idx, contacted_idx])

    # ends_ind_org = np.array(ends_ind)  # using lists
    ends_ind_org = ends_ind
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)  # <--- relatively high time consumer
    gap = np.array(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
In one of my tests on 23000 spheres, the SciPy, NumPy, and Numba-aided methods finished the loop in about 400, 200, and 180 seconds respectively on Colab TPU; for 500.000 spheres it takes 3.5 hours. These execution times are not satisfactory at all for my project, where the number of spheres may be up to 1.000.000 in a medium data volume. I will call this code many times in my main code, and I am looking for ways to run it in milliseconds (as fast as possible). Is that possible?
I would appreciate it if anyone could speed up the code as needed.
Notes:
This code must be executable with python 3.7+, on CPU and GPU.
This code must be applicable to data sizes of at least 300.000 spheres.
All NumPy, SciPy, and … equivalent modules that replace my own code and make it significantly faster will be upvoted.
I would appreciate any recommendations or explanations about:
Which method could be faster in this subject?
Why is SciPy not faster than the other methods in this case, and where could it be helpful for this subject?
Choosing between iterative methods and matrix-form methods is confusing to me. Iterative methods use less memory and can be tuned with Numba and …, but I think they are not comparable with matrix methods (which depend on memory limits) like NumPy and … for huge numbers of spheres. In this case, perhaps I could replace the iteration with NumPy operations, but I strongly suspect that this cannot be handled because of the huge matrix operations and memory leaks.
Prepared sample test data:
Poss data: 23000, 500000
Radii data: 23000, 500000
Line by line speed test logs: for two test cases scipy method and numpy time consumption.
UPDATE: this answer is now superseded by this new one (which takes into account the updates to the question), providing even faster code based on a different approach.
Step 1: better algorithm
First of all, building a k-d tree runs in O(n log n) time and doing a query runs in O(log n) time, where n is the number of points. So using a k-d tree seems a good idea at first glance. However, your code builds a k-d tree for each point, resulting in O(n² log n) time. This is why the SciPy solution is slower than the others. The thing is that SciPy does not provide a way to update a k-d tree, and updating a k-d tree efficiently appears not to be possible. Fortunately, this is not a problem in your case: you can build one k-d tree with all the points once and then discard the current point, which you do not want to appear in the result of each query.
Moreover, the computation of sphere_olps_ind runs in O(n² m) time, where n is the total number of points and m is the average number of neighbours (i.e. the closest points retrieved from the k-d tree query). Assuming there are no duplicate points, sphere_olps_ind is simply equal to np.sort(contacts_sec_ind). The latter runs in O(m log m) time, which is drastically better.
Additionally, using np.concatenate in a loop to append values to a NumPy array is slow because it creates a new, bigger array at each iteration. Using a list was a good idea, but appending the NumPy arrays directly to a list and then calling np.concatenate once is much faster.
Here is the resulting code:
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = []
    ends_ind = [np.empty([1, 2], dtype=np.int64)]

    kdtree = cKDTree(poss)

    for particle_idx in range(len(poss)):
        # Find the nearest point including the current one and
        # then remove the current point from the output.
        # The distances can be computed directly without a new query.
        cur_point = poss[particle_idx]
        nears_i_ind = np.array(kdtree.query_ball_point(cur_point, r=dia_max), dtype=np.int64)
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = distance.cdist(poss[nears_i_ind], cur_point[None, :]).squeeze()

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        if particle_idx > 0:
            ends_ind.append(ends_ind_mod_temp)
        else:
            ends_ind[0][:] = ends_ind_mod_temp[0, 0], ends_ind_mod_temp[0, 1]

    ends_ind_org = np.concatenate(ends_ind)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)  # <--- relatively high time consumer
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
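As a compact illustration of the "build one tree, query it for everything" idea, here is a different technique from the loop above, not a drop-in replacement: `cKDTree.query_pairs` returns each candidate pair once with i < j, so no self-exclusion or deduplication is needed. The function name `contact_pairs` is hypothetical, and the outputs are not in the same format or order as `ends_ind` and `gap` from `ends_gap`.

```python
import numpy as np
from scipy.spatial import cKDTree

def contact_pairs(poss, radii):
    """Return (pairs, gap) for all touching or overlapping spheres, with i < j."""
    tree = cKDTree(poss)                      # built once for all points
    # Candidate pairs: centres closer than the largest possible contact distance.
    pairs = tree.query_pairs(r=2.0 * radii.max(), output_type='ndarray')
    i, j = pairs[:, 0], pairs[:, 1]
    # Signed gap between the surfaces; <= 0 means contact or overlap.
    gap = np.linalg.norm(poss[i] - poss[j], axis=1) - (radii[i] + radii[j])
    keep = gap <= 0
    return pairs[keep], gap[keep]

# Three spheres on a line: only the first two touch.
centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
rad = np.array([0.6, 0.6, 0.1])
pairs, gap = contact_pairs(centers, rad)
```

Like the fixed `dia_max` query, this still over-collects candidates when the radii differ widely, so the exact distance filter remains necessary.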
Step 2: optimization
First of all, the query_ball_point call can be done on all the points at once, in parallel, by passing poss to the SciPy method and specifying the parameter workers=-1. However, note that this requires more memory.
Moreover, Numba can be used to significantly speed up the computation. The main parts that can be improved are the computation of the distances, the creation of many unnecessary temporary arrays, and the use of list appends where direct NumPy array indexing would do (since an upper bound on the size of the output array is known after the query_ball_point call).
Here is a simple example of optimized code using Numba:
@nb.jit('(float64[:, ::1], int64[::1], int64[::1], float64)')
def compute(poss, all_neighbours, all_neighbours_sizes, dia_max):
    particle_corsp_overlaps = []
    ends_ind_lst = [np.empty((1, 2), dtype=np.int64)]
    an_offset = 0

    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        cur_len = all_neighbours_sizes[particle_idx]
        nears_i_ind = all_neighbours[an_offset:an_offset+cur_len]
        an_offset += cur_len
        assert len(nears_i_ind) > 0

        if len(nears_i_ind) <= 1:
            continue

        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = np.empty(len(nears_i_ind), dtype=np.float64)

        # Compute the distances
        x1, y1, z1 = poss[particle_idx]
        for i in range(len(nears_i_ind)):
            x2, y2, z2 = poss[nears_i_ind[i]]
            dist_i[i] = np.sqrt((x2-x1)**2 + (y2-y1)**2 + (z2-z1)**2)

        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]

        particle_corsp_overlaps.append(connected)

        contacts_ind = np.where(contact_check <= 0)
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)

        ends_ind_mod_temp = np.empty((len(sphere_olps_ind), 2), dtype=np.int64)
        for i in range(len(sphere_olps_ind)):
            ends_ind_mod_temp[i, 0] = particle_idx
            ends_ind_mod_temp[i, 1] = sphere_olps_ind[i]

        if particle_idx > 0:
            ends_ind_lst.append(ends_ind_mod_temp)
        else:
            tmp = ends_ind_lst[0]
            tmp[:] = ends_ind_mod_temp[0, :]

    return particle_corsp_overlaps, ends_ind_lst

def ends_gap(poss, dia_max):
    kdtree = cKDTree(poss)
    tmp = kdtree.query_ball_point(poss, r=dia_max, workers=-1)
    all_neighbours = np.concatenate(tmp, dtype=np.int64)
    all_neighbours_sizes = np.array([len(e) for e in tmp], dtype=np.int64)
    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes, dia_max)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org

ends_gap(poss, dia_max)
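The SciPy-to-Numba hand-off above relies on flattening a ragged list of neighbour lists into one values array plus a sizes array. A minimal sketch of that layout follows. `flatten_ragged` and the `offsets` array are hypothetical additions for illustration (the answer's code tracks a running offset instead).

```python
import numpy as np

def flatten_ragged(lists):
    """Flatten a list of integer lists into (values, sizes, offsets)."""
    sizes = np.array([len(e) for e in lists], dtype=np.int64)
    if len(lists) > 0:
        values = np.concatenate([np.asarray(e, dtype=np.int64) for e in lists])
    else:
        values = np.empty(0, dtype=np.int64)
    # offsets[i]:offsets[i+1] is the slice of `values` that belongs to row i.
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    return values, sizes, offsets

# Neighbour lists as query_ball_point might return them for four points.
neigh = [[0, 2], [1], [0, 2, 3], [3]]
values, sizes, offsets = flatten_ragged(neigh)
row2 = values[offsets[2]:offsets[3]]   # neighbours of point 2
```

Two flat integer arrays are something Numba can type and iterate over efficiently, which is not the case for an object array of Python lists.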
Performance analysis
Here are the performance results on my 6-core machine (with an i5-9600KF processor) on the small dataset:
Initial code with Scipy: 259 s
Initial default code with Numpy: 112 s
Optimized algorithm: 1.37 s
Final optimized code: 0.22 s
Unfortunately, the Scipy k-d tree is too big to fit in memory on my machine with the big dataset.
Thus the Numba implementation with an efficient algorithm is up to ~510 times faster than the initial NumPy implementation and ~1200 times faster than the initial SciPy implementation.
The Numba code can be further optimized, but please note that the Numba compute call takes less than 25% of the time on my machine. The np.unique call is the most expensive, but it is not easy to make it faster. A significant part of the time is spent in the SciPy-to-Numba data conversion, but this code is mandatory as long as SciPy is used. Thus, the code can be improved a bit (e.g. probably 2x faster) with advanced Numba optimizations, but if you need much faster code, then you need to use a native language like C++ and a highly optimized parallel k-d tree implementation. I expect very optimized native code to be an order of magnitude faster, but not much more. I doubt the big dataset can be computed in less than 10 ms on my machine, regardless of the implementation.
Notes
Note that gap differs between the provided functions (the other values are left unchanged). However, the same thing happens between the initial SciPy method and the NumPy one. This appears to come from the ordering of variables like nears_i_ind and dist_i, which is left undefined by SciPy and changes the gap result in a non-trivial way (not just the order of gap). I am not sure whether this is a problem of the initial implementation. Because of that, it is much harder to compare the correctness of the different implementations.
forceobj should not be used in production, as the documentation states it is only useful for testing purposes.
Based on the previous answers, I designed an efficient algorithm with a much lower memory footprint that is much faster than the previous ones (especially on the large dataset). That being said, this algorithm is far more complex and pushes the limits of Python and Numba.
The key issue of the previous algorithms is that they set a dia_max threshold which is much bigger than actually required. Indeed, dia_max is set to the largest possible diameter (twice the maximum radius) so as to be sure not to miss any overlap. The thing is that the big dataset contains balls of very different sizes, and some of them are huge. This means that the previous algorithms were searching a very large radius around many small balls. The result was thousands of neighbours to check per ball, while only a few can truly overlap.
One solution to efficiently address this problem is to split the balls in different groups based on their size. The idea is to first sort balls based on radii, then split the sorted balls in two groups, then independently query neighbours between each possible pair of groups, then merge data so to apply the previous algorithm (with some additional optimizations). More specifically, the query is applied between small balls with big ones, small balls with other small ones, big balls with other big ones, and big balls with small ones.
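The sorting and splitting step can be sketched on its own as follows. The random radii are made up for illustration, the 75/25 split is an assumption for this example, and only two of the four group pairings are shown. The point is that each pairing gets a query radius of "own radius + largest radius in the other group", which is never larger than the global `dia_max`.

```python
import numpy as np

rng = np.random.default_rng(0)
radii = rng.uniform(0.0005, 0.122, 1000)       # illustrative radii only

# index: sorted position -> original ID; reverse_index: original ID -> sorted position
index = np.argsort(radii)
reverse_index = np.empty_like(index)
reverse_index[index] = np.arange(index.size)

sorted_radii = radii[index]
split = len(radii) * 3 // 4                    # assumed 75% small / 25% big
small, big = sorted_radii[:split], sorted_radii[split:]

# Per-ball query radii for two of the four pairings.
r_small_small = small + small.max()            # small balls searching small balls
r_small_big = small + big.max()                # small balls searching big balls

dia_max = 2.0 * radii.max()                    # the single global threshold used before
```

Because most balls are in the small group, most queries use `r_small_small`, which is well below `dia_max` whenever the radii are spread out. The inverse permutation `reverse_index` is what later maps results back to the original ball IDs.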
Another key point to speed this up is to run the different neighbour queries in parallel using joblib. This solution is far from perfect, since the BallTree object needs to be duplicated, which is inefficient, but this is mandatory because of the way parallelism is currently done in CPython (i.e. GIL, pickling, etc.). Using a package that supports parallel requests could bypass this inherent limitation of CPython, but the existing packages that do so do not seem to provide an interface sufficiently useful to address this problem, or are not optimized enough to be actually useful.
Finally, the Numba code can be strongly optimized by removing almost all of the very expensive (implicit) array allocations. Using an in-place sorting algorithm optimized for small arrays also significantly improves the execution time (mainly because the default implementation in Numba performs several expensive allocations and is not optimized for small arrays). In addition, the final np.unique operation can be completely rewritten as a basic loop, since the main loop iterates over balls with increasing IDs (hence already sorted).
Here is the resulting code:
import numpy as np
import numba as nb
from sklearn.neighbors import BallTree
from joblib import Parallel, delayed

def flatten_neighbours(arr):
    sizes = np.fromiter(map(len, arr), count=len(arr), dtype=np.int64)
    values = np.concatenate(arr, dtype=np.int64)
    return sizes, values

@delayed
def find_neighbours(searched_pts, ref_pts, max_dist):
    balltree = BallTree(ref_pts, leaf_size=16, metric='euclidean')
    res = balltree.query_radius(searched_pts, r=max_dist)
    return flatten_neighbours(res)

def vstack_neighbours(top_infos, bottom_infos):
    top_sizes, top_values = top_infos
    bottom_sizes, bottom_values = bottom_infos
    return np.concatenate([top_sizes, bottom_sizes]), np.concatenate([top_values, bottom_values])
@nb.njit('(Tuple([int64[::1],int64[::1]]), Tuple([int64[::1],int64[::1]]), int64)')
def hstack_neighbours(left_infos, right_infos, offset):
    left_sizes, left_values = left_infos
    right_sizes, right_values = right_infos
    n = left_sizes.size
    out_sizes = np.empty(n, dtype=np.int64)
    out_values = np.empty(left_values.size + right_values.size, dtype=np.int64)
    left_cur, right_cur, out_cur = 0, 0, 0
    right_values += offset
    for i in range(n):
        left, right = left_sizes[i], right_sizes[i]
        full = left + right
        out_values[out_cur:out_cur+left] = left_values[left_cur:left_cur+left]
        out_values[out_cur+left:out_cur+full] = right_values[right_cur:right_cur+right]
        out_sizes[i] = full
        left_cur += left
        right_cur += right
        out_cur += full
    return out_sizes, out_values
@nb.njit('(int64[::1], int64[::1], int64[::1], int64[::1])')
def reorder_neighbours(in_sizes, in_values, index, reverse_index):
    n = reverse_index.size
    out_sizes = np.empty_like(in_sizes)
    out_values = np.empty_like(in_values)
    in_offsets = np.empty_like(in_sizes)
    s, cur = 0, 0

    for i in range(n):
        in_offsets[i] = s
        s += in_sizes[i]

    for i in range(n):
        in_ind = reverse_index[i]
        size = in_sizes[in_ind]
        in_offset = in_offsets[in_ind]
        out_sizes[i] = size
        for j in range(size):
            out_values[cur+j] = index[in_values[in_offset+j]]
        cur += size

    return out_sizes, out_values
@nb.njit
def small_inplace_sort(arr):
    if len(arr) < 80:
        # Basic insertion sort
        i = 1
        while i < len(arr):
            x = arr[i]
            j = i - 1
            while j >= 0 and arr[j] > x:
                arr[j+1] = arr[j]
                j = j - 1
            arr[j+1] = x
            i += 1
    else:
        arr.sort()
@nb.jit('(float64[:, ::1], float64[::1], int64[::1], int64[::1])')
def compute(poss, radii, neighbours_sizes, neighbours_values):
    n, m = neighbours_sizes.size, np.max(neighbours_sizes)

    # Big buffers allocated with the maximum size.
    # Thanks to virtual memory, this does not take more memory than actually needed.
    particle_corsp_overlaps = np.empty(neighbours_values.size, dtype=np.float64)
    ends_ind_org = np.empty((neighbours_values.size, 2), dtype=np.float64)

    in_offset = 0
    out_offset = 0

    buff1 = np.empty(m, dtype=np.int64)
    buff2 = np.empty(m, dtype=np.float64)
    buff3 = np.empty(m, dtype=np.float64)

    for particle_idx in range(n):
        size = neighbours_sizes[particle_idx]
        cur = 0

        # Copy the neighbours of this ball, skipping the ball itself
        for i in range(size):
            value = neighbours_values[in_offset+i]
            if value != particle_idx:
                buff1[cur] = value
                cur += 1

        nears_i_ind = buff1[0:cur]
        small_inplace_sort(nears_i_ind)  # Note: bottleneck of this function
        in_offset += size

        if len(nears_i_ind) == 0:
            continue

        x1, y1, z1 = poss[particle_idx]
        cur = 0

        # Keep only the neighbours that actually touch or overlap
        for i in range(len(nears_i_ind)):
            index = nears_i_ind[i]
            x2, y2, z2 = poss[index]
            dist = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)
            contact_check = dist - (radii[index] + radii[particle_idx])
            if contact_check <= 0.0:
                buff2[cur] = contact_check
                buff3[cur] = index
                cur += 1

        particle_corsp_overlaps[out_offset:out_offset+cur] = buff2[0:cur]

        contacts_sec_ind = buff3[0:cur]
        small_inplace_sort(contacts_sec_ind)
        sphere_olps_ind = contacts_sec_ind

        for i in range(cur):
            ends_ind_org[out_offset+i, 0] = particle_idx
            ends_ind_org[out_offset+i, 1] = sphere_olps_ind[i]

        out_offset += cur

    # Truncate the views to their real size
    particle_corsp_overlaps = particle_corsp_overlaps[:out_offset]
    ends_ind_org = ends_ind_org[:out_offset]

    assert len(ends_ind_org) % 2 == 0
    size = len(ends_ind_org)//2

    ends_ind = np.empty((size, 2), dtype=np.int64)
    ends_ind_idx = np.empty(size, dtype=np.int64)
    gap = np.empty(size, dtype=np.float64)
    cur = 0

    # Find efficiently duplicates (replace np.unique+np.sort)
    for i in range(len(ends_ind_org)):
        left, right = ends_ind_org[i]
        if left < right:
            ends_ind[cur, 0] = left
            ends_ind[cur, 1] = right
            ends_ind_idx[cur] = i
            gap[cur] = particle_corsp_overlaps[i]
            cur += 1

    return gap, ends_ind, ends_ind_idx, ends_ind_org
def ends_gap(poss, radii):
    assert poss.size >= 1

    # Sort the balls
    index = np.argsort(radii)
    reverse_index = np.empty(index.size, np.int64)
    reverse_index[index] = np.arange(index.size, dtype=np.int64)
    sorted_poss = poss[index]
    sorted_radii = radii[index]

    # Split them in two groups: the small and the big ones
    split_ind = len(radii) * 3 // 4
    small_poss, big_poss = np.split(sorted_poss, [split_ind])
    small_radii, big_radii = np.split(sorted_radii, [split_ind])
    max_small_radii = sorted_radii[max(split_ind, 0)]
    max_big_radii = sorted_radii[-1]

    # Find the neighbours in parallel
    result = Parallel(n_jobs=4, backend='threading')([
        find_neighbours(small_poss, small_poss, small_radii+max_small_radii),
        find_neighbours(small_poss, big_poss,   small_radii+max_big_radii  ),
        find_neighbours(big_poss,   small_poss, big_radii+max_small_radii  ),
        find_neighbours(big_poss,   big_poss,   big_radii+max_big_radii    )
    ])
    small_small_neighbours = result[0]
    small_big_neighbours = result[1]
    big_small_neighbours = result[2]
    big_big_neighbours = result[3]

    # Merge the (segmented) arrays in a big one
    neighbours_sizes, neighbours_values = vstack_neighbours(
        hstack_neighbours(small_small_neighbours, small_big_neighbours, split_ind),
        hstack_neighbours(big_small_neighbours, big_big_neighbours, split_ind)
    )

    # Reverse the indices.
    # Note that the results in `neighbours_values` associated to
    # `neighbours_sizes[i]` are subsets of `query_radius([poss[i]], r=dia_max)`
    # on a `BallTree(poss)`.
    res = reorder_neighbours(neighbours_sizes, neighbours_values, index, reverse_index)
    neighbours_sizes, neighbours_values = res

    # Finally compute the neighbours with a method similar to the
    # previous one, but using a much faster optimized code.
    return compute(poss, radii, neighbours_sizes, neighbours_values)

result = ends_gap(poss, radii)
Here are the results (still on the same i5-9600KF machine):
Small dataset:
- Reference optimized Numba code: 256 ms
- This highly-optimized Numba code: 82 ms
Big dataset:
- Reference optimized Numba code: 42.7 s (take about 7~8 GiB of RAM)
- This highly-optimized Numba code: 4.2 s (take about 1 GiB of RAM)
Thus the new algorithm is about 3.1 times faster on the small dataset (on top of the previous optimizations), and about 10 times faster on the big dataset! This is 3 orders of magnitude faster than the initially posted algorithms.
Note that 80% of the time is spent in the BallTree query (which is already mostly parallel). The main Numba computing function takes only 12% of the time, and more than 75% of that is spent sorting the input indices. As a result, the neighbourhood search is clearly the bottleneck. It can be improved a bit by splitting the current queries into multiple smaller ones, but this would make the code even more complex for a relatively small improvement (e.g. 1.5x faster). Note that more complex code is harder to maintain and modifications are bug-prone. Thus, I think moving to a native language to overcome the limitations of Python is the best way to increase performance. That being said, writing faster native code to solve this problem is far from simple (unless you find a good k-d tree, octree or ball tree library). Still, it is certainly better than optimizing this code further.
Analysis
A profiling analysis shows that at least 50% of the time in the scikit-learn BallTree is spent in unoptimized scalar loops that could use SIMD instructions like AVX-2 (and loop unrolling) to be about 4 times faster. Additionally, some multi-threading issues are also visible in the profiler timeline (the 4 threads at the top are the joblib workers; the light-green sections are idle time).
This shows that this implementation is sub-optimal. One possible way to easily improve the execution time may be to optimize the hot loops of the scikit-learn BallTree implementation. Another strategy could be to use threads more efficiently (possibly by releasing the GIL in some parts of the scikit-learn module).
As the BallTree class of scikit-learn is written in Cython (BallTree is based on KDTree, itself based on BinaryTree), you can try to rebuild the package on your machine and simply tweak the compiler optimizations. Using the flags -O3 -march=native -ffast-math should enable the compiler to use faster SIMD instructions and more aggressive optimizations, resulting in a significant speed-up. Note that -ffast-math is unsafe, as it assumes the scikit-learn code never uses NaN, Inf or -0 values (otherwise the result is completely undefined) and that floating-point operations are associative (which can change results). That being said, such an option is critical to improve the automatic vectorization of numerical code.
Regarding the GIL, one can see that it is released in the query_radius function, but this does not seem to be the case for the constructor of BallTree. Maybe the simplest solution is to implement a parallel version of query/query_radius like SciPy did.
By fixing the query radius at twice the max sphere radius, you're creating a lot of spurious "collisions" to filter out.
The Python below achieves a significant speedup relative to your answer by using a fourth dimension to improve the selectivity of the kd-tree queries. Each Euclidean ball of radius r is over-approximated by an L1 ball of radius r√d where d is the dimension (3 here). The test for L1 balls colliding in 3d becomes a test for points being within a fixed L1 distance in 4d.
If you switched to a lower level language, you could potentially avoid a separate filtering step by altering the kd-tree implementation to use a combination L2+L1 metric.
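To see why the four-dimensional L1 query never misses a real contact, here is a short derivation. The symbols are my own notation, not the answerer's: $c_i$ and $r_i$ are the center and radius of sphere $i$, $d = 3$ is the dimension and $R = \max_k r_k$.

Spheres $i$ and $j$ are in contact when $\lVert c_i - c_j \rVert_2 \le r_i + r_j$. Since $\lVert v \rVert_1 \le \sqrt{d}\,\lVert v \rVert_2$ for any $v \in \mathbb{R}^d$, every contact satisfies

$$\frac{\lVert c_i - c_j \rVert_1}{\sqrt{d}} \le r_i + r_j .$$

Take the tree points $p_j = \left(c_j/\sqrt{d},\; R - r_j\right)$ and the query points $q_i = \left(c_i/\sqrt{d},\; r_i - R\right)$. Because $r_i + r_j \le 2R$,

$$\lVert q_i - p_j \rVert_1 = \frac{\lVert c_i - c_j \rVert_1}{\sqrt{d}} + \left| r_i + r_j - 2R \right| = \frac{\lVert c_i - c_j \rVert_1}{\sqrt{d}} + 2R - (r_i + r_j),$$

so that

$$\lVert q_i - p_j \rVert_1 \le 2R \iff \frac{\lVert c_i - c_j \rVert_1}{\sqrt{d}} \le r_i + r_j .$$

The fixed-radius L1 query of size $2R$ therefore returns every true contact, plus some false positives where the L1 bound is loose. This is why the exact Euclidean test is still applied to each candidate.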
import numpy as np
from scipy import spatial
from timeit import default_timer
def load_data():
    centers = np.loadtxt("pos_large.csv", delimiter=",")
    radii = np.loadtxt("radii_large.csv")
    assert radii.shape + (3,) == centers.shape
    return centers, radii
def count_contacts(centers, radii):
    scaled_centers = centers / np.sqrt(centers.shape[1])
    max_radius = radii.max()
    tree = spatial.cKDTree(np.c_[scaled_centers, max_radius - radii])
    count = 0
    for i, x in enumerate(np.c_[scaled_centers, radii - max_radius]):
        for j in tree.query_ball_point(x, r=2 * max_radius, p=1):
            d = centers[i] - centers[j]
            r = radii[i] + radii[j]
            if i < j and np.inner(d, d) <= r * r:
                count += 1
    return count
def main():
    centers, radii = load_data()
    start = default_timer()
    print(count_contacts(centers, radii))
    end = default_timer()
    print(end - start)
if __name__ == "__main__":
    main()
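As a sanity check of the over-approximation used above, the sketch below compares, by brute force and without any tree, the exact contact pairs with the pairs that pass the 4-D L1 test. The function names, the random test data and the small tolerance are my own assumptions. The reasoning is that ||c_i - c_j||_1 / sqrt(d) <= ||c_i - c_j||_2, and the fourth coordinate contributes 2*max_radius - r_i - r_j, so every true contact should survive the pre-filter.

```python
import numpy as np

def true_contacts(centers, radii):
    """Pairs (i, j), i < j, whose Euclidean balls touch or overlap (brute force)."""
    diff = centers[:, None, :] - centers[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    rsum = radii[:, None] + radii[None, :]
    i, j = np.nonzero(np.triu(dist2 <= rsum ** 2, k=1))
    return set(zip(i.tolist(), j.tolist()))

def l1_candidates(centers, radii, tol=1e-12):
    """Pairs (i, j), i < j, that pass the 4-D L1 pre-filter (brute force, no tree)."""
    scaled = centers / np.sqrt(centers.shape[1])
    max_radius = radii.max()
    tree_pts = np.c_[scaled, max_radius - radii]    # points stored in the tree
    query_pts = np.c_[scaled, radii - max_radius]   # points used as queries
    l1 = np.abs(query_pts[:, None, :] - tree_pts[None, :, :]).sum(axis=2)
    i, j = np.nonzero(np.triu(l1 <= 2 * max_radius + tol, k=1))
    return set(zip(i.tolist(), j.tolist()))

# Hypothetical test data: 200 random spheres in a 10 x 10 x 10 box.
rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 10.0, size=(200, 3))
radii = rng.uniform(0.2, 0.8, size=200)
exact = true_contacts(centers, radii)
candidates = l1_candidates(centers, radii)
print(len(exact), len(candidates), exact <= candidates)
```

The candidate set is usually larger than the exact one, which is why the Euclidean test is still needed as a final filter.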
As an update to Richard's answer, and to work around probable memory leaks, I am posting this answer. During my test runs, memory usage kept growing, which limited execution to smaller data volumes (at most 200000 on my machine and 100000 on Colab). This problem leads to much longer runtimes than the ones Richard reported. So I opened a SciPy issue about these performance differences, where I posted and compared some memory results.
I have not received an answer so far, and the origin of these significant performance differences is still not clear to me.
Fezzani referred to another SciPy issue that suggests processing the data in chunks, and prepared a comparison showing the influence of the chunk size on the runtime. Strangely, although Fezzani's machine (Intel® Core™ i7-10700K CPU @ 3.80GHz × 16; 32 GiB of RAM) seems more powerful than Richard's (6-core i5-9600KF processor, 16 GiB of dual-channel DDR4 RAM @ 3200MHz reaching 36~40 GiB/s), his run on the large data takes at least around 33 seconds with the chunk method (used to avoid memory leaks).
I could not figure out why, or which hardware lets a machine avoid the memory leaks and run as fast as Richard's (perhaps it is related to the KF variant of Richard's CPU).
Looking through some related memory issues, my guess is that the cKDTree methods inevitably run into this problem when the data volume is huge or …, and that scikit-learn may be a better choice. Based on my understanding of JaminSore's answer and the Martelli answer it refers to, I evaluated BallTree and KDTree from scikit-learn. BallTree performed better than KDTree in my cases (about 1.5 to 2 times faster), so I used it. There were no memory leaks on the large data, but the run took 2 minutes (Richard's results and mine now differ only in the time unit ;)). It ran faster than SciPy as the data volume increased. In my tests, SciPy was faster on smaller data volumes (low memory consumption), but as the data volume grows, SciPy falls behind because of its implementation or related bugs (still unclear to me); on my prepared data set of 100000 points, scikit-learn was 1.5 to 2 times faster.
I guess that returning arrays is the big advantage of scikit-learn over the lists returned by the SciPy method, as the aforementioned Martelli answer suggests. This may be the reason for the performance difference.
The scikit-learn methods return an ndarray of object dtype holding arrays of different lengths, and each of them must be sorted to get the same results as the main code. I applied the sorting to each element inside the loop of the compute function by changing the nears_i_ind line to nears_i_ind = np.sort(all_neighbours[an_offset:an_offset+cur_len]). With BallTree, tmp and all_neighbours consume about the same amount of memory. Note: if both are given the same name, memory consumption is reduced (almost halved). So Richard's ends_gap function, modified to use BallTree, becomes:
def ends_gap(poss, dia_max):
    balltree = BallTree(poss, metric='euclidean')
    # tmp = balltree.query_radius(poss, r=dia_max)
    # all_neighbours = np.concatenate(tmp, dtype=np.int64)
    all_neighbours = balltree.query_radius(poss, r=dia_max)
    all_neighbours_sizes = np.array([len(e) for e in all_neighbours], dtype=np.int64)
    all_neighbours = np.concatenate(all_neighbours, dtype=np.int64)
    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
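To make the flat layout used above easier to follow, here is a small NumPy-only sketch of how a ragged result (one index array per point, as query_radius returns) can be packed into one flat array plus per-point sizes, and how a single point's neighbours are then recovered by slicing. The helper name flatten_ragged and the toy data are hypothetical; the explicit offsets array is just the running sum that the compute function tracks as an_offset.

```python
import numpy as np

def flatten_ragged(neighbours):
    """Return (flat, sizes, offsets) for a sequence of 1-D integer arrays."""
    sizes = np.array([len(e) for e in neighbours], dtype=np.int64)
    # offsets[i] is where the neighbours of point i start in the flat array
    offsets = np.concatenate(([0], np.cumsum(sizes)[:-1])).astype(np.int64)
    flat = np.concatenate(list(neighbours)).astype(np.int64)
    return flat, sizes, offsets

# Toy ragged input: an object array holding index arrays of different lengths.
ragged = np.empty(3, dtype=object)
ragged[0] = np.array([2, 0])
ragged[1] = np.array([1])
ragged[2] = np.array([0, 2, 1])

flat, sizes, offsets = flatten_ragged(ragged)
# Neighbours of point 2, sorted as required to match the main code.
row2 = np.sort(flat[offsets[2]:offsets[2] + sizes[2]])
print(flat, sizes, offsets, row2)
```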
This version is not multi-threaded; multi-threading could improve the speed, and I will try it.
On my machine (1st-gen Intel Core i5 760 CPU @ 2.8GHz, 16 GB of dual-channel CL9 DDR3 Ripjaws RAM, x64 Windows) for a data volume of 200000:
There were some mistakes in my two proposed methods that led to different gap values, as Richard mentioned in his Note section. To produce the same results, return_sorted=True must be added to the nears_i_ind line in the Optimized algorithm, ends_ind and ends_ind_lst must become lists, and the if-else statements must be removed in both codes:
Optimized algorithm:
import numpy as np
from scipy.spatial import cKDTree, distance
# `radii` is assumed to be a global array holding the sphere radii
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = []
    ends_ind = []  # <------- this line is modified
    kdtree = cKDTree(poss)
    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        nears_i_ind = np.array(kdtree.query_ball_point(cur_point, r=dia_max, return_sorted=True), dtype=np.int64)  # <------- this line is modified
        assert len(nears_i_ind) > 0
        if len(nears_i_ind) <= 1:
            continue
        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        # squeeze only axis 1 so that a single neighbour still gives a 1-D array
        dist_i = distance.cdist(poss[nears_i_ind], cur_point[None, :]).squeeze(axis=1)
        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]
        particle_corsp_overlaps.append(connected)
        contacts_ind = np.where([contact_check <= 0])[1]
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)
        ends_ind_mod_temp = np.array([np.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        ends_ind.append(ends_ind_mod_temp)  # <------- this line is modified
    ends_ind_org = np.concatenate(ends_ind)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
Numba final optimized code:
import numba as nb
import numpy as np
from sklearn.neighbors import BallTree
# `radii` is assumed to be a global array holding the sphere radii
@nb.jit('(float64[:, ::1], int64[::1], int64[::1])')
def compute(poss, all_neighbours, all_neighbours_sizes):
    particle_corsp_overlaps = []
    ends_ind_lst = []  # <------- this line is modified
    an_offset = 0
    for particle_idx in range(len(poss)):
        cur_len = all_neighbours_sizes[particle_idx]
        nears_i_ind = np.sort(all_neighbours[an_offset:an_offset+cur_len])  # <------- this line is modified
        an_offset += cur_len
        assert len(nears_i_ind) > 0
        if len(nears_i_ind) <= 1:
            continue
        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = np.empty(len(nears_i_ind), dtype=np.float64)
        x1, y1, z1 = poss[particle_idx]
        for i in range(len(nears_i_ind)):
            x2, y2, z2 = poss[nears_i_ind[i]]
            dist_i[i] = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)
        contact_check = dist_i - (radii[nears_i_ind] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]
        particle_corsp_overlaps.append(connected)
        contacts_ind = np.where(contact_check <= 0)
        contacts_sec_ind = nears_i_ind[contacts_ind]
        sphere_olps_ind = np.sort(contacts_sec_ind)
        ends_ind_mod_temp = np.empty((len(sphere_olps_ind), 2), dtype=np.int64)
        for i in range(len(sphere_olps_ind)):
            ends_ind_mod_temp[i, 0] = particle_idx
            ends_ind_mod_temp[i, 1] = sphere_olps_ind[i]
        ends_ind_lst.append(ends_ind_mod_temp)  # <------- this line is modified
    return particle_corsp_overlaps, ends_ind_lst
def ends_gap(poss, dia_max):
    balltree = BallTree(poss, metric='euclidean')  # <------- new code
    all_neighbours = balltree.query_radius(poss, r=dia_max)  # <------- new code and modified
    all_neighbours_sizes = np.array([len(e) for e in all_neighbours], dtype=np.int64)  # <------- this line is modified
    all_neighbours = np.concatenate(all_neighbours, dtype=np.int64)  # <------- this line is modified
    particle_corsp_overlaps, ends_ind_lst = compute(poss, all_neighbours, all_neighbours_sizes)
    ends_ind_org = np.concatenate(ends_ind_lst)
    ends_ind, ends_ind_idx = np.unique(np.sort(ends_ind_org), axis=0, return_index=True)
    gap = np.concatenate(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
On my machine for around 550000 data volume:
Have you tried FLANN?
This code doesn't solve your problem completely. It simply finds the nearest 50 neighbors to each point in your 500000 point dataset:
import numpy as np
from pyflann import FLANN
p = np.loadtxt("pos_large.csv", delimiter=",")
flann = FLANN()
flann.build_index(pts=p)
idx, dist = flann.nn_index(qpts=p, num_neighbors=50)
The last line takes less than a second on my laptop without any tuning or parallelization.

how to get better Kriging result graphs in openturns?

I performed spherical Kriging, but I can't seem to get good output graphs.
The coordinates (x and y) are around 51 for the latitude and around 6.5 for the longitude.
My observations range from -70 to +10.
Here is my code:
import openturns as ot
import pandas as pd
# your input / output data can be easily formatted as samples for openturns
df = pd.read_csv("kreuzkerpenutm.csv")
inputdata = ot.Sample(df[['x','y']].values)
outputdata = ot.Sample(df[['z']].values)
dimension = 2 # dimension of your input (x,y)
basis = ot.ConstantBasisFactory(dimension).build()
covarianceModel = ot.SphericalModel(dimension)
algo = ot.KrigingAlgorithm(inputdata, outputdata, covarianceModel, basis)
algo.run()
result = algo.getResult()
metamodel = result.getMetaModel()
lower = [-10.0] * 2 # lower bound of the 2D window
upper = [50.0] * 2 # upper bound of the 2D window
graph = metamodel.draw(lower, upper)
graph.setBoundingBox(ot.Interval(lower, upper))
graph.add(ot.Cloud(inputdata)) # overlay a scatter plot of the observation points
graph.setTitle("Kriging metamodel")
# A View object allows us to interact with the underlying matplotlib figure
from openturns.viewer import View
view = View(graph, legend_kw={'bbox_to_anchor':(1,1), 'loc':"upper left"})
view.getFigure().tight_layout()
Here is my output:
[image: kriging metamodel graph]
I don't know why my graph shows neither my inputs nor my kriging results.
Thanks for any ideas and help.
If the input data is not scaled in [-1,1]^d, the kriging metamodel may struggle to identify the scale parameters using maximum likelihood optimization. To help with this, we can:
provide a better starting point for the scale parameters of the covariance model (this is trick "A" below),
set the bounds of the optimization algorithm so that the interval where the parameters are searched for correspond to the data at hand (this is trick "B" below).
This is what the following script does, using simulated data instead of a csv data file. In the script, I create the data using a g function which is scaled so that it produces results in the [-10, 70] range, as in your problem. Please look carefully at the setScale() method, which sets the initial value of the scale of the covariance model: this is the starting point of the optimization algorithm. Then look at the setOptimizationBounds() method, which sets the bounds of the optimization algorithm.
import openturns as ot
dimension = 2 # dimension of your input (x,y)
distribution = ot.ComposedDistribution([ot.Uniform(-10.0, 50.0)] * dimension)
inputdata = distribution.getSample(100)
g = ot.SymbolicFunction(["x", "y"], ["30 + 3.0 * sin(x / 10.0) * (y / 10.0) ^ 2"])
outputdata = g(inputdata)
basis = ot.ConstantBasisFactory(dimension).build()
covarianceModel = ot.SphericalModel(dimension)
covarianceModel.setScale(inputdata.getMax()) # Trick A
algo = ot.KrigingAlgorithm(inputdata, outputdata, covarianceModel, basis)
# Trick B, v2
x_range = inputdata.getMax() - inputdata.getMin()
scale_max_factor = 2.0 # Must be > 1, tune this to match your problem
scale_min_factor = 0.1 # Must be < 1, tune this to match your problem
maximum_scale_bounds = scale_max_factor * x_range
minimum_scale_bounds = scale_min_factor * x_range
scaleOptimizationBounds = ot.Interval(minimum_scale_bounds, maximum_scale_bounds)
algo.setOptimizationBounds(scaleOptimizationBounds)
algo.run()
result = algo.getResult()
metamodel = result.getMetaModel()
metamodel.setInputDescription(["x", "y"])
metamodel.setOutputDescription(["z"])
lower = [-10.0] * 2 # lower bound of the 2D window
upper = [50.0] * 2 # upper bound of the 2D window
graph = metamodel.draw(lower, upper)
graph.setBoundingBox(ot.Interval(lower, upper))
graph.add(ot.Cloud(inputdata)) # overlay a scatter plot of the observation points
graph.setTitle("Kriging metamodel")
# A View object allows us to interact with the underlying matplotlib figure
from openturns.viewer import View
view = View(graph, legend_kw={"bbox_to_anchor": (1, 1), "loc": "upper left"})
view.getFigure().tight_layout()
The previous script produces the following figure.
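An alternative that the script above does not use is to rescale the inputs to [-1,1]^d yourself before fitting, and to map prediction points the same way. The sketch below does this with plain NumPy; the function names and the sample coordinates are hypothetical, and it assumes that no input column is constant.

```python
import numpy as np

def scale_to_unit_box(x):
    """Affinely map each column of x to [-1, 1]; return scaled data, center, half-range."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    center = (hi + lo) / 2.0
    half_range = (hi - lo) / 2.0   # assumed non-zero (no constant column)
    return (x - center) / half_range, center, half_range

def unscale(x_scaled, center, half_range):
    """Inverse of scale_to_unit_box for given center and half-range."""
    return np.asarray(x_scaled, dtype=float) * half_range + center

# Hypothetical coordinates, roughly like the latitude/longitude in the question.
xy = np.array([[51.0, 6.4], [51.2, 6.6], [51.1, 6.5]])
xy_scaled, center, half_range = scale_to_unit_box(xy)
print(xy_scaled)
```

The scaled array can then be wrapped in ot.Sample in the same way as df[['x','y']].values in the question; new points must be transformed with the same center and half_range before they are passed to the metamodel.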
There are other ways to implement trick B. Here is one provided by J.Pelamatti:
# Trick B, v3
import numpy as np
import scipy.spatial
# X_train is the input sample (presumably inputdata in the script above)
for d in range(X_train.getDimension()):
    dist = scipy.spatial.distance.pdist(X_train[:,d])
    scale_max_factor = 2.0 # Must be > 1, tune this to match your problem
    scale_min_factor = 0.1 # Must be < 1, tune this to match your problem
    maximum_scale_bounds = scale_max_factor * np.max(dist)
    minimum_scale_bounds = scale_min_factor * np.min(dist)
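As written, the loop above overwrites the two bounds at every iteration, so only the last dimension survives. A minimal NumPy-only sketch that keeps one pair of bounds per dimension could look like this (the function name scale_bounds and the toy data are my own; it uses a plain array rather than an ot.Sample):

```python
import numpy as np

def scale_bounds(x, min_factor=0.1, max_factor=2.0):
    """Per-dimension (lower, upper) scale bounds from 1-D pairwise distances."""
    x = np.asarray(x, dtype=float)
    lower, upper = [], []
    for d in range(x.shape[1]):
        col = x[:, d]
        # all pairwise distances |x_i - x_j| with i < j in dimension d
        dist = np.abs(col[:, None] - col[None, :])[np.triu_indices(len(col), k=1)]
        lower.append(min_factor * dist.min())   # becomes 0 if two points coincide
        upper.append(max_factor * dist.max())
    return lower, upper

# Toy data: three points in two dimensions.
lower, upper = scale_bounds([[0.0, 10.0], [1.0, 30.0], [4.0, 20.0]])
print(lower, upper)
```

Assuming only the scale parameters are optimized, as in the script above, the two lists could then be passed on as ot.Interval(lower, upper) to algo.setOptimizationBounds().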
This topic is discussed in this particular thread in OT's forum.
Sorry for the late answer.
Which version of openturns are you using?
You probably have an embedded transformation of the input data (standard scaling), which maps the data to approximately the range (-3, 3). In that case, the kriging result should contain the transformation.
In more recent versions of openturns, this feature has been removed.
Hope this can help.
Cheers

Fast way to set diagonals of an (M x N x N) matrix? Einsum / n-dimensional fill_diagonal?

I'm trying to write fast, optimized code based on matrices, and have recently discovered einsum as a tool for achieving significant speed-up.
Is it possible to use this to set the diagonals of a multidimensional array efficiently, or can it only return data?
In my problem, I'm trying to set the diagonals for an array of square matrices (shape: M x N x N) by summing the columns in each square (N x N) matrix.
My current (slow, loop-based) solution is:
import numpy as np
# Build dummy array
dimx = 2 # Dimension x (likely to be < 100)
dimy = 3 # Dimension y (likely to be between 2 and 10)
M = np.random.randint(low=1, high=9, size=[dimx, dimy, dimy])
# Blank the diagonals so we can see the intended effect
np.fill_diagonal(M[0], 0)
np.fill_diagonal(M[1], 0)
# Compute diagonals based on summing columns
diags = np.einsum('ijk->ik', M)
# Set the diagonal for each matrix
# THIS IS SLOW. CAN IT BE IMPROVED?
for i in range(len(M)):
    np.fill_diagonal(M[i], diags[i])
# Print result
M
Can this be improved at all, please? It seems np.fill_diagonal doesn't accept non-square matrices (hence forcing my loop-based solution). Perhaps einsum can help here too?
One approach would be to reshape to 2D, set the columns at steps of ncols+1 with the diagonal values. Reshaping creates a view and as such allows us to directly access those diagonal positions. Thus, the implementation would be -
s0,s1,s2 = M.shape
M.reshape(s0,-1)[:,::s2+1] = diags
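To check that the reshape trick matches the loop from the question, here is a small self-contained comparison (the two wrapper functions and the test values are mine, written only for illustration). Both work on a copy, which is C-contiguous, so the reshape is guaranteed to be a view:

```python
import numpy as np

def set_diagonals_loop(M, diags):
    """Reference version: one np.fill_diagonal call per (N x N) matrix."""
    out = M.copy()
    for i in range(len(out)):
        np.fill_diagonal(out[i], diags[i])
    return out

def set_diagonals_reshape(M, diags):
    """Strided version: flatten each matrix and write every (N + 1)-th element."""
    out = M.copy()
    s0, s1, s2 = out.shape
    out.reshape(s0, -1)[:, ::s2 + 1] = diags
    return out

M = np.arange(18).reshape(2, 3, 3)
diags = np.array([[10, 11, 12], [20, 21, 22]])
print(np.array_equal(set_diagonals_loop(M, diags), set_diagonals_reshape(M, diags)))
```

One caveat: if the array is not contiguous (for example a transposed or sliced array), reshape returns a copy and the assignment is silently lost, so it is safest to apply this to arrays you created yourself.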
If you do np.source(np.fill_diagonal), you'll see that in the 2d case it uses a 'strided' approach:
if a.ndim == 2:
    step = a.shape[1] + 1
    end = a.shape[1] * a.shape[1]
    a.flat[:end:step] = val
@Divakar's solution applies this to your 3d case by 'flattening' on 2 dimensions.
You could sum the columns with M.sum(axis=1), though I vaguely recall some timings that found that einsum was actually a bit faster. sum is a little more conventional.
Someone has asked for the ability to expand dimensions in einsum, but I don't think that will happen.
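On the original question of whether einsum can set values rather than only return them: in recent NumPy versions, an einsum call that only extracts a diagonal returns a writable view, so assigning into it writes through to the array. A minimal sketch under that assumption (the function name is hypothetical):

```python
import numpy as np

def set_diagonals_einsum(M, diags):
    """Write diags[i] onto the diagonal of M[i] through an einsum view."""
    out = np.array(M, copy=True)
    # 'ijj->ij' selects the diagonal of every (N x N) matrix as a view of out
    np.einsum('ijj->ij', out)[...] = diags
    return out

M = np.arange(18).reshape(2, 3, 3)
diags = np.einsum('ijk->ik', M)   # column sums, the same values as M.sum(axis=1)
result = set_diagonals_einsum(M, diags)
print(result)
```

Whether this is faster than the reshape approach would need to be timed on your own array sizes.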

Coordinate Descent Algorithm in Julia for Least Squares not converging

As a warm-up to writing my own elastic net solver, I'm trying to get a fast enough version of ordinary least squares implemented using coordinate descent.
I believe I've implemented the coordinate descent algorithm correctly, but when I use the "fast" version (see below), the algorithm is insanely unstable, outputting regression coefficients that routinely overflow a 64-bit float when the number of features is of moderate size compared to the number of samples.
Linear Regression and OLS
If b = A*x, where A is a matrix, x a vector of the unknown regression coefficients, and b is the output, I want to find x that minimizes
||b - Ax||^2
If A[j] is the jth column of A and A[-j] is A without column j, and the columns of A are normalized so that ||A[j]||^2 = 1 for all j, the coordinate-wise update is then
Coordinate Descent:
x[j] <-- A[j]^T * (b - A[-j] * x[-j])
I'm following along with these notes (page 9-10) but the derivation is simple calculus.
It's pointed out that instead of recomputing A[j]^T(b - A[-j] * x[-j]) all the time, a faster way to do it is with
Fast Coordinate Descent:
x[j] <-- A[j]^T*r + x[j]
where the total residual r = b - Ax is computed outside the loop over coordinates. The equivalence of these update rules follows from noting that Ax = A[j]*x[j] + A[-j]*x[-j] and rearranging terms.
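Written out with the same symbols as above (A_j is column j, A_{-j} and x_{-j} are the remaining columns and coefficients, and ||A_j||^2 = 1), the rearrangement is:

```latex
\begin{aligned}
r &= b - Ax = b - A_{-j}x_{-j} - A_j x_j \\
A_j^{T}\,(b - A_{-j}x_{-j}) &= A_j^{T}\,(r + A_j x_j)
  = A_j^{T} r + \|A_j\|^{2} x_j
  = A_j^{T} r + x_j
\end{aligned}
```

Note that the identity holds for the residual r of the current x, that is, including any coordinates that have already been updated during the same sweep.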
My problem is that while the second method is indeed faster, it's wildly numerically unstable for me whenever the number of features isn't small compared to the number of samples. I was wondering if anyone might have some insight as to why that's the case. I should note that the first method, which is more stable, still starts disagreeing with more standard methods as the number of features approaches the number of samples.
Julia code
Below is some Julia code for the two update rules:
function OLS_builtin(A,b)
    x = A\b
    return(x)
end
function OLS_coord_descent(A,b)
    N,P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        for j = 1:P
            x[j] = dot(A[:,j], b - A[:,1:P .!= j]*x[1:P .!= j])
        end
    end
    return(x)
end
function OLS_coord_descent_fast(A,b)
    N,P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        r = b - A*x
        for j = 1:P
            x[j] += dot(A[:,j],r)
        end
    end
    return(x)
end
Example of the problem
I generate data with the following:
n = 100
p = 50
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(Float64,p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Here I use p=50, and I get good agreement between OLS_coord_descent(X,y) and OLS_builtin(X,y), whereas OLS_coord_descent_fast(X,y) returns exponentially large values for the regression coefficients.
When p is less than about 20, OLS_coord_descent_fast(X,y) agrees with the other two.
Conjecture
Since things agree in the regime p << n, I think the algorithm is formally correct but numerically unstable. Does anyone have any thoughts on whether this guess is correct, and if so, how to correct for the instability while retaining most of the performance gains of the fast version of the algorithm?
The quick answer: you forgot to update r after each x[j] update. The fixed function below behaves like OLS_coord_descent:
function OLS_coord_descent_fast(A,b)
    N,P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        r = b - A*x
        for j = 1:P
            x[j] += dot(A[:,j],r)
            r -= A[:,j]*dot(A[:,j],r) # Add this line
        end
    end
    return(x)
end
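For readers who want to experiment outside Julia, here is a NumPy port of the fixed update. It is a sketch, not the answerer's code: the function names, the random test problem and the default number of cycles are my own choices. It also looks at the largest eigenvalue of A^T A. With a stale residual, one full sweep amounts to x <- x + A^T (b - A x), and that iteration can only converge when every eigenvalue of A^T A is below 2, which may explain why the original version worked for small p and blew up for larger p.

```python
import numpy as np

def ols_coord_descent_fast(A, b, cycles=1000):
    """Cyclic coordinate descent for min ||b - A x||^2; assumes unit-norm columns."""
    n, p = A.shape
    x = np.zeros(p)
    r = b - A @ x                   # residual for the current x
    for _ in range(cycles):
        for j in range(p):
            delta = A[:, j] @ r     # best change of x[j] with the other coordinates fixed
            x[j] += delta
            r -= A[:, j] * delta    # keep r equal to b - A @ x
    return x

def stale_sweep_is_stable(A):
    """True if all eigenvalues of A^T A are below 2 (needed for the stale-residual sweep)."""
    return np.linalg.eigvalsh(A.T @ A).max() < 2.0

def make_problem(n, p, seed=0):
    """Random centered design with unit-norm columns, similar to the question's setup."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    X -= X.mean(axis=0)
    X /= np.sqrt((X ** 2).sum(axis=0))
    y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)
    return X, y - y.mean()

X, y = make_problem(100, 50)
x_cd = ols_coord_descent_fast(X, y)
x_ref = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(x_cd - x_ref)))   # difference to the least-squares solution
print(stale_sweep_is_stable(X))       # whether the un-fixed sweep could converge here
```

Keeping r in sync inside the inner loop costs one extra vector update per coordinate, so the fixed version retains most of the speed advantage over recomputing b - A[-j]*x[-j] from scratch.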