Numpy nditer for non-broadcastable algorithms - numpy

TLDR: how to set up nditer when my algorithm needs a different number of values from each operand, but I want broadcasting to be applied over the "other" axes.
I'm in the process of converting some algorithms to cython since I've got a lot of looping overhead in the implementation.
Originally I implemented the algorithms with support for broadcasting to allow various use-cases, and I would like to keep that.
The algorithms are quite involved, but the issue can be summarized by the following example code:
a = np.arange(5)
b = np.arange(11)
c = 0
for idx in range(len(a)):
    c += a[idx] * b[2 * idx]
    c += a[idx] * b[2 * idx + 1]
Broadcasting of this could be implemented along the first axis with the exact same code:
a = np.arange(5 * 7).reshape((5, 7, 1))
b = np.arange(11 * 6).reshape((11, 1, 6))
c = 0
for idx in range(a.shape[0]):
    c += a[idx] * b[2 * idx]
    c += a[idx] * b[2 * idx + 1]
or along the last axis with some slight modifications (not the same result, but that's not of importance here):
a = np.arange(5 * 7).reshape((7, 1, 5))
b = np.arange(11 * 6).reshape((1, 6, 11))
c = 0
for idx in range(a.shape[-1]):
    c += a[..., idx] * b[..., 2 * idx]
    c += a[..., idx] * b[..., 2 * idx + 1]
The actual algorithms I have need multiple nested for-loops for each "broadcastable unit", can involve more "inputs" (a and b here), and the "output" (c) can also be another array instead of a single value.
When the inner loops are moved over to cython, broadcasting is no longer an option. It seems like nditer would be the way to go here, but I cannot figure out how to make it ignore the fact that one of the axes is not broadcastable. I expected that
a = np.arange(5 * 7).reshape((7, 1, 5))
b = np.arange(11 * 6).reshape((1, 6, 11))
it = np.nditer([a, b], flags=['external_loop'])
would allow me to loop over all axes except the one where I apply my custom algorithm, but that does not seem to be the case. Instead I'm met with a ValueError: operands could not be broadcast together with shapes (7,1,5) (1,6,11).
Ideally I would be able to loop as
for a_inner, b_inner, out_inner in it:
    out_inner[...] = call_to_cythonized_algorithm(a_inner, b_inner)
where the shapes of the _inner variables match what I need for the algorithm (5, 11, and a 0-d output in the examples above), or potentially with one extra dimension which I could loop over in the cython code.
I've tried a couple other flags as well, but I don't really know what I'm doing, and none of them give me an iterator that works.
Is this possible with the current API, or have I found a limitation of nditer?

Related

minimum-difference constrained sparse least squares problem (called from python)

I'm struggling a bit to find a fast algorithm that's suitable.
I just want to minimize:
norm2(x-s)
st
G.x <= h
x >= 0
sum(x) = R
G is sparse and contains only 1s (and zeros obviously).
In the case of iterative algorithms, it would be nice to get the interim solutions to show to the user.
The context is that s is a vector of current results, and the user is saying "well, the sum of these few entries (entries indicated by a few 1.0's in a row in G) should be less than this value (a row in h)". So we have to remove quantities from the entries the user specified (indicated by 1.0 entries in G) in a least-squares optimal way, but since we have a global constraint on the total (R), the values removed need to be allocated in a least-squares optimal way amongst the other entries. The entries can't go negative.
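To make that concrete, here is a minimal sketch of one such constraint (the indices and bound are made up, just to illustrate the structure of G and h described above):
import numpy as np
n = 10                           # toy size; the real problem has ~150,000 entries
g_row = np.zeros(n)
g_row[[2, 5, 7]] = 1.0           # hypothetical: the user selected entries 2, 5 and 7
h_entry = 10.0                   # hypothetical bound: "their sum should be at most 10"
# stacking a few dozen such rows gives G, and the matching bounds give h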
All the algorithms I'm looking at are much more general, and as a result are much more complex. Also, they seem quite slow. I don't see this as a complex problem, although mixes of equality and inequality constraints always seem to make things more complex.
This has to be called from Python, so I'm looking at Python libraries like qpsolvers and scipy.optimize. But I suppose Java or C++ libraries could be used and called from Python, which might be good since multithreading is better in Java and C++.
Any thoughts on what library/package/approach to use to best solve this problem?
The size of the problem is about 150,000 rows in s, and a few dozen rows in G.
Thanks!
Your problem is a linear least squares:
minimize_x norm2(x-s)
such that G x <= h
x >= 0
1^T x = R
Thus it fits the bill of the solve_ls function in qpsolvers.
Here is an example of what I imagine your problem matrices would look like, given what you specified. Since the problem is sparse, we should use SciPy CSC matrices, and regular NumPy arrays for vectors:
import numpy as np
import scipy.sparse as spa
n = 150_000
# minimize 1/2 || x - s ||^2
R = spa.eye(n, format="csc")
s = np.array(range(n), dtype=float)
# such that G * x <= h
G = spa.diags(
    diagonals=[
        [1.0 if i % 2 == 0 else 0.0 for i in range(n)],
        [1.0 if i % 3 == 0 else 0.0 for i in range(n - 1)],
        [1.0 if i % 5 == 0 else 0.0 for i in range(n - 1)],
    ],
    offsets=[0, 1, -1],
)
a_dozen_rows = np.linspace(0, n - 1, 12, dtype=int)
G = G[a_dozen_rows]
h = np.ones(12)
# such that sum(x) == 42
A = spa.csc_matrix(np.ones((1, n)))
b = np.array([42.0]).reshape((1,))
# such that x >= 0
lb = np.zeros(n)
Next, we can solve this problem with:
from qpsolvers import solve_ls
x = solve_ls(R, s, G, h, A, b, lb, solver="cvxopt", verbose=True)
Here I picked CVXOPT, but there are other open-source solvers you can install, such as ProxQP, OSQP or SCS. You can install a set of open-source solvers with pip install qpsolvers[open_source_solvers]. Once some solvers are installed, you can list those that support sparse matrices with:
import qpsolvers
print(qpsolvers.sparse_solvers)
Finally, here is some code to check that the solution returned by the solver satisfies our constraints:
tol = 1e-6 # tolerance for checks
print(f"- Objective: {0.5 * (x - s).dot(x - s):.1f}")
print(f"- G * x <= h: {(G.dot(x) <= h + tol).all()}")
print(f"- x >= 0: {(x + tol >= 0.0).all()}")
print(f"- sum(x) = {x.sum():.1f}")
I just tried it with OSQP (adding the eps_rel=1e-5 keyword argument when calling solve_ls, otherwise the returned solution would be less accurate than the tol = 1e-6 tolerance) and it found a solution in 737 milliseconds on my (rather old) CPU with:
- Objective: 562494373088866.8
- G * x <= h: True
- x >= 0: True
- sum(x) = 42.0
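For reference, here is roughly what that OSQP call looks like, reusing the matrices defined above (the eps_rel keyword argument is simply forwarded to the solver):
x = solve_ls(R, s, G, h, A, b, lb, solver="osqp", eps_rel=1e-5, verbose=True)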
Hoping this helps. Happy solving!

How to accelerate my written python code: function containing nested functions for classification of points by polygons

I have written the following NumPy code in Python:
import numpy as np

def inbox_(points, polygon):
    """ Finding points in a region """
    ll = np.amin(polygon, axis=0)  # lower limit
    ur = np.amax(polygon, axis=0)  # upper limit
    in_idx = np.all(np.logical_and(ll <= points, points < ur), axis=1)  # boolean mask of points in the range
    return in_idx
def operation_(r, gap, ends_ind):
""" calculation formula which is applied on the points specified by inbox_ function """
r_active = np.take(r, ends_ind) # taking values from "r" based on indices and shape (paired_values) of "ends_ind"
r_sub = np.subtract.reduce(r_active, axis=1) # subtracting each paired "r" determined by "ends_ind" [this line will be used in the final return formula]
r_add = np.add.reduce(r_active, axis=1) # adding each paired "r" determined by "ends_ind" [this line will be used in the final return formula]
paired_cent_dis = np.sum((r_add, gap), axis=0) # distance of the each two paired points
return (np.power(gap, 2) * (np.power(paired_cent_dis, 2) + 5 * paired_cent_dis * r_add - 7 * np.power(r_sub, 2))) / (3 * paired_cent_dis) # Formula
def elapses(r, pos, gap, ends_ind, elem_vert, contact_poss):
    if len(gap) > 0:
        elaps = np.empty([len(elem_vert), ], dtype=object)
        operate_ = operation_(r, gap, ends_ind)
        #elbav = np.empty([len(elem_vert), ], dtype=object)
        #con_num = 0
        for i, j in enumerate(elem_vert):  # loop over each section (cell or region) of the mesh
            in_bool = inbox_(contact_poss, j)  # boolean array for points within that section
            elaps[i] = np.sum(operate_[in_bool])  # perform the calculation on those points and sum them for this section
            operate_ = operate_[np.invert(in_bool)]  # slice the arrays by deleting the points already handled, to speed up the next loops
            contact_poss = contact_poss[np.invert(in_bool)]  # as above
            #con_num += sum(inbox_(contact_poss, j))
            #inba_bool = inbox_(pos, j)
            #elbav[i] = 4 * np.pi * np.sum(np.power(r[inba_bool], 3)) / 3
            #pos = pos[np.invert(inba_bool)]
            #r = r[np.invert(inba_bool)]
        return elaps
r = np.load('a.npy')
pos = np.load('b.npy')
gap = np.load('c.npy')
ends_ind = np.load('d.npy')
elem_vert = np.load('e.npy')
contact_poss = np.load('f.npy')
elapses(r, pos, gap, ends_ind, elem_vert, contact_poss)
# a --------r-------> parameter corresponding to each coordinate (point); here radius (23605,) <class 'numpy.ndarray'> <class 'numpy.float64'>
# b -------pos------> coordinates of the points (23605, 3) <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
# c -------gap------> if we consider points as spheres by that radii [r], it is maximum length for spheres' over-lap (103832,) <class 'numpy.ndarray'> <class 'numpy.float64'>
# d ----ends_ind----> indices for each over-laped spheres (103832, 2) <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.int64'>
# e ---elem_vert----> vertices of the mesh's sections or cells (2000, 8, 3) <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
# f --contact_poss--> a coordinate between the paired spheres (103832, 3) <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.float64'>
This code will be called frequently from another code with big-data inputs, so speeding up this code is essential. I have tried to use the jit decorator from the JAX and Numba libraries to accelerate the code, but I could not get it to work properly. I have tested the code on Colab (for 3 data sets with loop counts of 20, 250, and 2000) for speed and the results were:
11 (ms), 47 (ms), 6.62 (s) (per loop) <-- without the commented code lines in the code
137 (ms), 1.66 (s) , 4 (m) (per loop) <-- with activating the commented code lines in the code
What this code does is finding some coordinates in a range and then performing some calculations on them.
I would greatly appreciate any answers that can speed up this code significantly (I believe it can be done). I would also be grateful for any experienced recommendations about speeding up the code by changing (substituting) the NumPy methods used and …, or by writing methods for the math operations.
Notes:
The proposed answers must be executable with Python version 2 (being applicable in both versions 2 and 3 would be excellent).
The commented code lines are unnecessary for the main aim and are written just for further evaluations. Any recommendations for handling these lines alongside the proposed answers are appreciated (but not required).
Data sets for test:
small data set: https://drive.google.com/file/d/1CswjyoqS8ogLmLQa_oNTOj221chDcbK8/view?usp=sharing
medium data set: https://drive.google.com/file/d/14RJ0Ackx88NzQWloops5FagzuNQYDSrh/view?usp=sharing
large data set: https://drive.google.com/file/d/1dJnXpb3HiAGcRC9PPTwui9joNcij4E_E/view?usp=sharing
First of all, the algorithm can be improved to be much more efficient. Indeed, a polygon can be directly assigned to each point. This is like a classification of points by polygons. Once the classification is done, you can perform one/many reductions by key where the key is the polygon ID.
This new algorithm consists in:
computing all the bounding boxes of the polygons;
classifying the points by polygons;
performing the reduction by key (where the key is the polygon ID).
This approach is much more efficient than iterating over all the points for each polygon and filtering the attribute arrays (e.g. operate_ and contact_poss). Indeed, filtering is an expensive operation since it requires the target array (which may not fit in the CPU caches) to be fully read and then written back. Not to mention that this operation requires a temporary array to be allocated/deleted if it is not performed in-place, and that it cannot benefit from SIMD instructions on most x86/x86-64 platforms (as it requires the new AVX-512 instruction set). It is also harder to parallelize since the filtering steps are too fast for threads to be useful, yet the steps need to be done sequentially.
Regarding the implementation of the algorithm, Numba can be used to speed up the overall computation a lot. The main benefit of using Numba is to drastically reduce the number of expensive temporary arrays created by NumPy in your current implementation. Note that you can specify the function types to Numba so that it can compile the functions when they are defined. Assertions can be used to make the code more robust and to help the compiler know the size of a given dimension, so it can generate significantly faster code (the JIT compiler of Numba can unroll the loops). Ternary operators can help the JIT compiler a bit to generate a faster, branch-less program.
Note that the classification can easily be parallelized using multiple threads. However, one needs to be very careful about constant propagation, since some critical constants (like the shape of the working arrays, and assertions) tend not to be propagated to the code executed by threads, while that propagation is critical to optimize the hot loops (e.g. vectorization, unrolling). Note also that creating many threads can be expensive on machines with many cores (from 10 ms to 0.1 ms). Thus, it is often better to use a parallel implementation only on big input data.
Here is the resulting implementation (working with both Python2 and Python3):
import numpy as np
import numba as nb

@nb.njit('float64[::1](float64[::1], float64[::1], int64[:,::1])')
def operation_(r, gap, ends_ind):
    """ calculation formula which is applied on the points specified by findMatchingPolygons_ function """
    nPoints = ends_ind.shape[0]
    assert ends_ind.shape[1] == 2
    assert gap.size == nPoints
    formula = np.empty(nPoints, dtype=np.float64)
    for i in range(nPoints):
        ind0, ind1 = ends_ind[i]
        r0, r1 = r[ind0], r[ind1]
        r_sub = r0 - r1
        r_add = r0 + r1
        cur_gap = gap[i]
        paired_cent_dis = r_add + cur_gap
        formula[i] = (cur_gap**2 * (paired_cent_dis**2 + 5 * paired_cent_dis * r_add - 7 * r_sub**2)) / (3 * paired_cent_dis)
    return formula
# Use `parallel=True` for a parallel implementation
@nb.njit('int32[::1](float64[:,::1], float64[:,:,::1])')
def findMatchingPolygons_(points, polygons):
    """ Attribute a region to each point """
    nPolygons = polygons.shape[0]
    nPolygonPoints = polygons.shape[1]
    nPoints = points.shape[0]
    assert points.shape[1] == 3
    assert polygons.shape[2] == 3

    # Compute the bounding boxes of all polygons
    ll = np.empty((nPolygons, 3), dtype=np.float64)
    ur = np.empty((nPolygons, 3), dtype=np.float64)
    for i in range(nPolygons):
        ll_x, ll_y, ll_z = polygons[i, 0]
        ur_x, ur_y, ur_z = polygons[i, 0]
        for j in range(1, nPolygonPoints):
            x, y, z = polygons[i, j]
            ll_x = x if x < ll_x else ll_x
            ll_y = y if y < ll_y else ll_y
            ll_z = z if z < ll_z else ll_z
            ur_x = x if x > ur_x else ur_x
            ur_y = y if y > ur_y else ur_y
            ur_z = z if z > ur_z else ur_z
        ll[i] = ll_x, ll_y, ll_z
        ur[i] = ur_x, ur_y, ur_z

    # Find for each point its corresponding polygon
    pointPolygonId = np.empty(nPoints, dtype=np.int32)
    # Use `nb.prange(nPoints)` for a parallel implementation
    for i in range(nPoints):
        x, y, z = points[i, 0], points[i, 1], points[i, 2]
        pointPolygonId[i] = -1
        for j in range(polygons.shape[0]):
            if ll[j, 0] <= x < ur[j, 0] and ll[j, 1] <= y < ur[j, 1] and ll[j, 2] <= z < ur[j, 2]:
                pointPolygonId[i] = j
                break
    return pointPolygonId
@nb.njit('float64[::1](float64[:,:,::1], float64[:,::1], float64[::1])')
def computeSections_(elem_vert, contact_poss, operate_):
    nPolygons = elem_vert.shape[0]
    elaps = np.zeros(nPolygons, dtype=np.float64)
    pointPolygonId = findMatchingPolygons_(contact_poss, elem_vert)
    for i, polygonId in enumerate(pointPolygonId):
        if polygonId >= 0:
            elaps[polygonId] += operate_[i]
    return elaps
def elapses(r, pos, gap, ends_ind, elem_vert, contact_poss):
    if len(gap) > 0:
        operate_ = operation_(r, gap, ends_ind)
        return computeSections_(elem_vert, contact_poss, operate_)

r = np.load('a.npy')
pos = np.load('b.npy')
gap = np.load('c.npy')
ends_ind = np.load('d.npy')
elem_vert = np.load('e.npy')
contact_poss = np.load('f.npy')
elapses(r, pos, gap, ends_ind, elem_vert, contact_poss)
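As a rough illustration of the parallel classification mentioned above, here is a toy sketch (not the exact function above: it assumes the bounding boxes ll/ur are already computed, and only shows the change from range to nb.prange together with parallel=True):
import numpy as np
import numba as nb

@nb.njit('int32[::1](float64[:,::1], float64[:,::1], float64[:,::1])', parallel=True)
def classifyPoints_parallel(points, ll, ur):
    # Same point-classification loop as above, but the outer loop runs in parallel
    nPoints = points.shape[0]
    pointPolygonId = np.empty(nPoints, dtype=np.int32)
    for i in nb.prange(nPoints):
        x, y, z = points[i, 0], points[i, 1], points[i, 2]
        pointPolygonId[i] = -1
        for j in range(ll.shape[0]):
            if ll[j, 0] <= x < ur[j, 0] and ll[j, 1] <= y < ur[j, 1] and ll[j, 2] <= z < ur[j, 2]:
                pointPolygonId[i] = j
                break
    return pointPolygonId

# Hypothetical data, just to exercise the function
points = np.random.rand(10000, 3)
ll = np.zeros((50, 3))
ur = np.random.rand(50, 3)
print(classifyPoints_parallel(points, ll, ur)[:10])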
Here are the results on an old 2-core machine (i7-3520M):
Small dataset:
- Original version: 5.53 ms
- Proposed version (sequential): 0.22 ms (x25)
- Proposed version (parallel): 0.20 ms (x27)
Medium dataset:
- Original version: 53.40 ms
- Proposed version (sequential): 1.24 ms (x43)
- Proposed version (parallel): 0.62 ms (x86)
Big dataset:
- Original version: 5742 ms
- Proposed version (sequential): 144 ms (x40)
- Proposed version (parallel): 67 ms (x86)
Thus, the proposed implementation is up to 86 times faster than the original one.

Vectorize a function with a condition

I would like to vectorize a function with a condition, meaning to calculate its values with array arithmetic. np.vectorize handles vectorization, but it does not work with array arithmetic, so it is not a complete solution.
An answer was given as the solution in the question "How to vectorize a function which contains an if statement?" but did not prevent errors here; see the MWE below.
import numpy as np
def myfx(x):
    return np.where(x < 1.1, 1, np.arcsin(1 / x))

x = np.arange(5, dtype=float)  # example input (not in the original post); any array containing 0 and values below 1.1 reproduces the warnings
y = myfx(x)
This runs but raises the following warnings:
<stdin>:2: RuntimeWarning: divide by zero encountered in true_divide
<stdin>:2: RuntimeWarning: invalid value encountered in arcsin
What is the problem, or is there a better way to do this?
I think this could be done by
Getting the indices ks of x for which x[k] > 1.1 for each k in ks.
Applying np.arcsin(1 / x[ks]) to the slice x[ks], and using 1 for the rest of the elements.
Recombining the arrays.
I am not sure about the efficiency, though.
The statement np.where(x < 1.1, 1, np.arcsin(1 / x)) is equivalent to
mask = x < 1.1
a = 1
b = np.arcsin(1 / x)
np.where(mask, a, b)
Notice that you're calling np.arcsin on all the elements of x, regardless of whether 1 / x <= 1 or not. Your basic plan is correct. You can do the operations in-place on an output array using the where keyword of np.arcsin and np.reciprocal, without having to recombine anything:
def myfx(x):
    mask = (x >= 1.1)
    out = np.ones(x.shape)
    np.reciprocal(x, where=mask, out=out)  # >= 1.1 implies != 0
    return np.arcsin(out, where=mask, out=out)
Using np.ones ensures that the unmasked elements of out are initialized correctly. An equivalent method would be
out = np.empty(x.shape)
out[~mask] = 1
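A quick sanity check of the myfx above with a made-up input (it contains a zero and values below 1.1, so the original version would warn):
x = np.array([0.0, 0.5, 1.1, 2.0])
print(myfx(x))  # no RuntimeWarning; the elements with x < 1.1 stay at 1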
You can always find an arithmetic expression that prevents the "divide by zero".
Example:
def myfx(x):
    return np.where(x < 1.1, 1, np.arcsin(1 / np.maximum(x, 1.1)))
The values where x < 1.1 in the right branch are never used, so computing np.arcsin(1/1.1) where x < 1.1 is not an issue.

How to iterate through all non zero values of a sparse matrix and normal matrix

I am using Julia and I want to iterate over the values of a matrix. This matrix can either be a normal matrix or a sparse matrix, but I do not have prior knowledge of that. I would like to write code that works in both cases and is optimised for both.
For simplicity, I made an example that computes the sum of a vector's entries, each multiplied by a random value. What I actually want to do is similar, except that instead of multiplying by a random number I apply a function that takes a long time to compute.
using SparseArrays

myiterator(m::SparseVector) = m.nzval
myiterator(m::AbstractVector) = m

function sumtimesrand(m)
    a = 0.
    for i in myiterator(m)
        a += i * rand()
    end
    return a
end

I = [1, 4, 3, 5]; V = [1, 2, -5, 3];
Msparse = sparsevec(I, V)
M = rand(5)
sumtimesrand(Msparse)
sumtimesrand(M)
I want my code to work this way. I.e. most of the code is the same and by using the right iterator the code is optimised for both cases (sparse and normal vector).
My question is: is there any iterator that does what I am trying to achieve? In this case the iterator returns the values, but an iterator over the indices would work too.
Cheers,
Dylan
I think you almost had what you are asking for? I.e., change your AbstractVector and SparseVector into AbstractArray and AbstractSparseArray. But maybe I am missing something? See MWE below:
using SparseArrays
using BenchmarkTools # to compare performance
# note the changes here to "Array":
myiterator(m::AbstractSparseArray) = m.nzval
myiterator(m::AbstractArray) = m
function sumtimesrand(m)
    a = 0.
    for i in myiterator(m)
        a += i * rand()
    end
    return a
end
N = 1000
spV = sprand(N, 0.01); V = Vector(spV)
spM = sprand(N, N, 0.01); M = Matrix(spM)
@btime sumtimesrand($spV); # 0.044936 μs
@btime sumtimesrand($V);   # 3.919 μs
@btime sumtimesrand($spM); # 0.041678 ms
@btime sumtimesrand($M);   # 4.095 ms

How to fix "LoadError: DimensionMismatch ("cannot broadcast array to have fewer dimensions")"

I'd like to solve the following two coupled differential equations numerically:
d/dt Phi_i  = 1 - 1/N * sum_{j=1}^{N} k_{ij} * sin(Phi_i - Phi_j + a)
d/dt k_{ij} = -epsilon * (sin(Phi_i - Phi_j + b) + k_{ij})
with defined starting conditions phi_0 (1-dim array with N entries) and k_0 (2-dim array with NxN entries)
I tried this: using DifferentialEquations.jl, build a matrix of initial conditions u0 = hcat(Phi_0, k_0) (a 2-dim array, N x (N+1)), and somehow define that the first equation applies to the first column (in my code [:,1]) and the second equation applies to the other columns (in my code [:,2:N+1]).
using Distributions
using DifferentialEquations
N = 100
phi0 = rand(N)*2*pi
k0 = rand(Uniform(-1,1), N,N)
function dynamics(du, u, p, t)
    a = 0.3*pi
    b = -0.53*pi
    epsi = 0.01
    du[:,1] .= 1 .- 1/N .* sum.([u[i,j+1] * sin(u[i,1] - u[j,1] + a) for i in 1:N, j in 1:N], dims=2)
    du[:,2:N+1] .= .- epsi .* [sin(u[i,1] - u[j,1] + b) + u[i,j+1] for i in 1:N, j in 1:N]
end
u0 = hcat(phi0, k0)
tspan = (0.0, 200.0)
prob = ODEProblem(dynamics, u0, tspan)
sol = solve(prob)
Running these lines of code results in this error:
LoadError: DimensionMismatch ("cannot broadcast array to have fewer dimensions") in expression starting at line 47 (which is sol = solve(prob))
I'm new to Julia, and I'm not sure if I'm heading in the right direction with this. Please help me!
First of all, fix the first package: it is Distributions, not Distribution; it took me a while to find that error xD
The main problem is the .= in your first equation. When you do that, you don't just assign new values to an array, you're making a view. I cannot explain exactly what a view is, but what I can tell you is that when you do this kind of assignment, the left and right sides must have the same type.
For example:
N = 100
u = rand(N,N+1)
du = rand(N,N+1)
julia> u[:,1] .= du[:,1]
100-element view(::Array{Float64,2}, :, 1) with eltype Float64:
0.2948248997313967
0.2152933893895821
0.09114453738716022
0.35018616658607926
0.7788869975259098
0.2833659299216609
0.9093344091412392
...
The result is a view and not a Vector. With this syntax, the left and right sides must have the same type, and that does not happen in your example. Note that the types of rand(5) and rand(5,1) are different in Julia: the first is an Array{Float64,1} and the other is an Array{Float64,2}. In your code, du[:,1] is an Array{Float64,1} but 1 .- 1/N .* sum.([u[i,j+1] * sin(u[i,1] - u[j,1] + a) for i in 1:N, j in 1:N], dims=2) is an Array{Float64,2}, which is why it doesn't work. You have two choices: change the assignment to
du[:,1] = ...
Or:
du[:,1] .= 1 .- 1/N .* sum.([u[i,j+1] * sin(u[i,1] - u[j,1] + a) for i in 1:N, j in 1:N], dims=2)[:,1]
The first choice is just a basic assignment; the second keeps the view-based broadcast and matches the types of both sides.