How to efficiently use a numpy function in a cython loop? [duplicate] - numpy

I'm trying to use dot products, matrix inversion and other basic linear algebra operations that are available in numpy from Cython: functions like numpy.linalg.inv (inversion), numpy.dot (dot product) and X.T (transpose of a matrix/array). There's a large overhead to calling numpy.* from Cython functions, and the rest of the function is written in Cython, so I'd like to avoid this.
If I assume users have numpy installed, is there a way to do something like:
#include "numpy/npy_math.h"
as an extern, and call these functions? Or alternatively call BLAS directly (or whatever it is that numpy calls for these core operations)?
To give an example, imagine you have a function in Cython that does many things and in the end needs to make a computation involving dot products and matrix inverses:
cdef myfunc(...):
    # ... do many things faster than Python could
    # ...
    # compute one value using dot products and inv
    # without using
    # import numpy as np
    # np.*
    val = gammaln(sum(v)) - sum(gammaln(v)) + dot((v - 1).T, log(x).T)
How can this be done? If there's a library that implements these in Cython already, I can also use that, but I have not found anything. Even if those procedures are less optimized than BLAS directly, not having the overhead of calling the numpy Python module from Cython will still make things faster overall.
Example functions I'd like to call:
dot product (np.dot)
matrix inversion (np.linalg.inv)
matrix multiplication
taking transpose (equivalent of x.T in numpy)
gammaln function (like scipy.special.gammaln equivalent, which should be available in C)
I realize, as it says on the cython-users mailing list (https://groups.google.com/forum/?fromgroups=#!topic/cython-users/XZjMVSIQnTE), that if you call these functions on large matrices, there is no point in doing it from Cython, since calling it from numpy will just result in the majority of the time being spent in the optimized C code that numpy calls. However, in my case, I have many calls to these linear algebra operations on small matrices. In that case, the overhead introduced by repeatedly going from Cython back to numpy and back to Cython will far outweigh the time spent actually computing the operation in BLAS. Therefore, I'd like to keep everything at the C/Cython level for these simple operations and not go through Python.
I'd prefer not to go through GSL, since that adds another dependency and since it's unclear if GSL is actively maintained. Since I'm assuming users of the code already have scipy/numpy installed, I can safely assume that they have all the associated C code that goes along with these libraries, so I just want to be able to tap into that code and call it from Cython.
edit: I found a library that wraps BLAS in Cython (https://github.com/tokyo/tokyo) which is close but not what I'm looking for. I'd like to call the numpy/scipy C functions directly (I'm assuming the user has these installed.)

Calling BLAS bundled with Scipy is "fairly" straightforward, here's one example for calling DGEMM to compute matrix multiplication: https://gist.github.com/pv/5437087 Note that BLAS and LAPACK expect all arrays to be Fortran-contiguous (modulo the lda/b/c parameters), hence order="F" and double[::1,:] which are required for correct functioning.
Inverses can be computed similarly by applying the LAPACK function dgesv to the identity matrix. For the signature, see here. All this requires dropping down to rather low-level coding: you need to allocate temporary work arrays yourself, and so on. However, these can be encapsulated in your own convenience functions, or you can reuse the code from tokyo by replacing the lib_* functions with function pointers obtained from Scipy in the above way.
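To make the "solve against the identity" idea concrete, here is a minimal Python-level sketch. It uses numpy.linalg.solve (which NumPy documents as being backed by the LAPACK gesv routines) rather than a direct Cython call, and the helper name inv_via_solve is only illustrative.

```python
import numpy as np

def inv_via_solve(a):
    """Invert a square matrix by solving A X = I (hypothetical helper)."""
    a = np.asarray(a, dtype=np.float64)
    n = a.shape[0]
    # Each column of the identity is one right-hand side; the solution
    # columns together form the inverse.
    return np.linalg.solve(a, np.eye(n))

a = np.array([[4.0, 7.0],
              [2.0, 6.0]])
a_inv = inv_via_solve(a)
print(np.allclose(a @ a_inv, np.eye(2)))
```

The same structure carries over to a direct dgesv call: the identity matrix is passed as the right-hand side and is overwritten with the inverse.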
If you use Cython's memoryview syntax (double[::1,:]), the transpose is the same x.T as usual. Alternatively, you can compute the transpose by writing a function of your own that swaps elements of the array across the diagonal. Numpy doesn't actually contain this operation: x.T only changes the strides of the array and doesn't move the data around.
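The stride-swapping behaviour can be checked from plain Python. This small sketch assumes an 8-byte float64 dtype; it also shows that the transpose of a C-contiguous array is a Fortran-contiguous view, which is the layout BLAS and LAPACK expect.

```python
import numpy as np

x = np.arange(6, dtype=np.float64).reshape(2, 3)  # C-contiguous, 8-byte items
xt = x.T                                          # a view: no data is moved

print(x.strides, xt.strides)        # the two stride entries are swapped
print(np.shares_memory(x, xt))      # both arrays use the same buffer
print(xt.flags["F_CONTIGUOUS"])     # the view is Fortran-contiguous

# Only an explicit copy actually rearranges the data in memory.
xt_copy = np.ascontiguousarray(xt)
```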
It would probably be possible to rewrite the tokyo module to use the BLAS/LAPACK exported by Scipy and bundle it in scipy.linalg, so that you could just do from scipy.linalg.blas cimport dgemm. Pull requests are accepted if someone wants to get down to it.
As you can see, it all boils down to passing function pointers around. As alluded to above, Cython does in fact provide its own protocol for exchanging function pointers. For an example, consider from scipy.spatial import qhull; print(qhull.__pyx_capi__) --- those functions could be accessed via from scipy.spatial.qhull cimport XXXX in Cython (they're private though, so don't do that).
However, at the present, scipy.special does not offer this C-API. It would however in fact be quite simple to provide it, given that the interface module in scipy.special is written in Cython.
I don't think there is at the moment any sane and portable way to access the function doing the heavy lifting for gammaln (although you could snoop around the UFunc object, but that's not a sane solution :), so at the moment it's probably best to just grab the relevant part of the source code from scipy.special and bundle it with your project, or use e.g. GSL.
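One further option, offered here as an assumption rather than something the answer above proposes: for positive real scalars, the C99 math library already provides lgamma, which Cython exposes through libc.math. The sketch below writes the expression from the question as a scalar loop in plain Python using math.lgamma and math.log; the function name dirichlet_like_term is hypothetical. The loop body should map onto a typed Cython loop with lgamma and log cimported from libc.math.

```python
import math

def dirichlet_like_term(v, x):
    """gammaln(sum(v)) - sum(gammaln(v)) + dot(v - 1, log(x)) as one loop.

    Assumes v and x are equal-length sequences of positive floats.
    """
    total = 0.0
    acc = 0.0
    for i in range(len(v)):
        total += v[i]
        # dot-product term minus the per-element log-gamma term
        acc += (v[i] - 1.0) * math.log(x[i]) - math.lgamma(v[i])
    return math.lgamma(total) + acc

val = dirichlet_like_term([1.0, 2.0, 3.0], [0.2, 0.3, 0.5])
print(val)
```

No temporary arrays are created, and no Python-level numpy call is made inside the loop.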

Perhaps the easiest way if you do accept using the GSL would be to use this GSL->cython interface https://github.com/twiecki/CythonGSL and call BLAS from there (see the example https://github.com/twiecki/CythonGSL/blob/master/examples/blas2.pyx). It should also take care of the Fortran vs C ordering.
There aren't many new GSL features, but you can safely assume it is actively maintained. The CythonGSL is more complete compared to tokyo; e.g., it features symmetric-matrix products that are absent in numpy.

As I've just encountered the same problem and wrote some additional functions, I'll include them here in case someone else finds them useful. I coded up some matrix multiplication, and also call LAPACK functions for matrix inversion, determinant and Cholesky decomposition. You should still consider doing linear algebra outside of any loops where you can, as I do here. Two LAPACK details matter: ipiv must be an array of C ints, and dgetrf returns 1-based pivot indices, which is what the sign check in the determinant has to account for. Also, please note that I don't do any checking to see if inputs are conformable.
from scipy.linalg.cython_lapack cimport dgetri, dgetrf, dpotrf

cpdef void inv_c(double[:, ::1] A, double[:, ::1] B,
                 double[:, ::1] work, int[::1] ipiv):
    '''invert float type square matrix A

    Parameters
    ----------
    A : memoryview (numpy array)
        n x n array to invert
    B : memoryview (numpy array)
        n x n array to use within the function, function
        will modify this matrix in place to become the inverse of A
    work : memoryview (numpy array)
        n x n array to use within the function
    ipiv : memoryview (numpy array of C ints, e.g. dtype np.intc)
        length n array to use within function
    '''
    cdef int n = A.shape[0], info
    cdef int lwork = n*n  # size of the work buffer passed to dgetri
    B[...] = A
    # The arrays are C-ordered, so LAPACK sees the transpose of A. The
    # inverse of the transpose, read back in C order, is the inverse of A.
    dgetrf(&n, &n, &B[0, 0], &n, &ipiv[0], &info)
    dgetri(&n, &B[0, 0], &n, &ipiv[0], &work[0, 0], &lwork, &info)

cpdef double det_c(double[:, ::1] A, double[:, ::1] work, int[::1] ipiv):
    '''obtain determinant of float type square matrix A

    Notes
    -----
    dgetrf reports pivots as 1-based row indices, so row j was swapped
    (which flips the sign) whenever ipiv[j] != j + 1.

    Parameters
    ----------
    A : memoryview (numpy array)
        n x n array to compute determinant of
    work : memoryview (numpy array)
        n x n array to use within function
    ipiv : memoryview (numpy array of C ints, e.g. dtype np.intc)
        length n vector use within function

    Returns
    -------
    detval : float
        determinant of matrix A
    '''
    cdef int n = A.shape[0], info
    cdef int j
    cdef double detval = 1.
    work[...] = A
    dgetrf(&n, &n, &work[0, 0], &n, &ipiv[0], &info)
    for j in range(n):
        if ipiv[j] != j + 1:
            detval = -detval*work[j, j]
        else:
            detval = detval*work[j, j]
    return detval

cdef void chol_c(double[:, ::1] A, double[:, ::1] B):
    '''cholesky factorization of real symmetric positive definite float matrix A

    Parameters
    ----------
    A : memoryview (numpy array)
        n x n matrix to compute cholesky decomposition
    B : memoryview (numpy array)
        n x n matrix to use within function, will be modified
        in place to become cholesky decomposition of A. works
        similar to np.linalg.cholesky
    '''
    cdef int n = A.shape[0], info
    cdef int i, j
    cdef char uplo = 'U'
    B[...] = A
    # LAPACK works in Fortran order, so its upper factor ends up in the
    # lower triangle of the C-ordered B.
    dpotrf(&uplo, &n, &B[0, 0], &n, &info)
    # zero the strict upper triangle, which still holds entries of A
    for i in range(n):
        for j in range(n):
            if j > i:
                B[i, j] = 0

cpdef void dotmm_c(double[:, :] A, double[:, :] B, double[:, :] out):
    '''matrix multiply matrices A (n x m) and B (m x r)

    Parameters
    ----------
    A : memoryview (numpy array)
        n x m left matrix
    B : memoryview (numpy array)
        m x r right matrix
    out : memoryview (numpy array)
        n x r output matrix
    '''
    cdef Py_ssize_t i, j, k
    cdef double s
    cdef Py_ssize_t n = A.shape[0], m = A.shape[1]
    cdef Py_ssize_t r = B.shape[1]
    for i in range(n):
        for j in range(r):
            s = 0
            for k in range(m):
                s += A[i, k]*B[k, j]
            out[i, j] = s

Related

Why does Cython keep making python objects instead of c? [duplicate]

I am trying to learn Cython, and I compile with annotate=True.
The basic manual says:
If a line is white, it means that the code generated doesn’t interact with Python, so will run as fast as normal C code. The darker the yellow, the more Python interaction there is in that line
Then I wrote this code following (as much as I understood) numpy in cython basic manual instructions:
+14: cdef entropy(counts):
15: '''
16: INPUT: pandas table with counts as obsN
17: OUTPUT: general entropy
18: '''
+19: cdef int l = counts.shape[0]
+20: cdef np.ndarray probs = np.zeros(l, dtype=np.float)
+21: cdef int totals = np.sum(counts)
+22: probs = counts/totals
+23: cdef np.ndarray plogp = np.zeros(l, dtype=np.float)
+24: plogp = ( probs.T * (np.log(probs)) ).T
+25: cdef float d = np.exp(-1 * np.sum(plogp))
+26: cdef float relative_d = d / probs.shape[0]
27:
+28: return {'d':d,
+29: 'relative_d':relative_d
30: }
Where all the "+" at the beginning of the line are yellow in the cython.debug.output.html file.
What am I doing very wrong? How can I make at least part of this function run at c speed?
The function returns a Python dictionary, hence I think that I can't return any C data type. I might be wrong here too.
Thank you for the help!
First of all, Cython does not rewrite Numpy functions; it just calls them like CPython does. This is the case for np.zeros, np.sum or np.log, for example. Such calls will not be faster with Cython. If you want faster code, you can use plain loops to reimplement them in your code. However, this may not be faster. On the one hand, Numpy calls introduce an overhead (due to type checking, which AFAIK is still enabled with Cython, internal function calls, wrappers, etc.) that is certainly significant for small arrays, and each function generates temporary arrays that, when large, are often slow to read and write. On the other hand, some Numpy functions make use of highly optimized code (like BLAS or low-level SIMD intrinsics). Moreover, division in Python does not behave the same way as in C. This is why Cython provides the cython.cdivision directive, which can be set to True (it is False by default). If Python division is used, Cython generates slower wrapping code. Finally, np.ndarray is a CPython type and behaves as such; you can use memoryviews to avoid dealing with Numpy objects.
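The difference in division semantics is easy to see with negative integers. The sketch below emulates the C behaviour in plain Python (truncation via int() and math.fmod); it does not run any Cython, it only illustrates what changes when cdivision is turned on.

```python
import math

a, b = -7, 2

# Python semantics: the quotient is floored and the remainder takes
# the sign of the divisor.
py_quot, py_rem = a // b, a % b

# C semantics (what cdivision=True gives): the quotient is truncated
# toward zero and the remainder takes the sign of the dividend.
c_quot = int(a / b)
c_rem = int(math.fmod(a, b))

print(py_quot, py_rem)
print(c_quot, c_rem)
```

With cdivision left at False, Cython has to add a sign correction and a zero-division check around each C division to reproduce the Python result, which is the extra wrapping code mentioned above.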
If you want fast code, you certainly need to use memoryviews and loops, avoid creating temporary arrays, and possibly use multiple threads. Additionally, you can use np.empty instead of np.zeros in your case. Besides this, Numpy transposition is not very efficient, and Numpy does not solve this problem. You can implement a tiled transposition to speed it up, but this is not trivial to implement efficiently. Here is a Numba implementation that can certainly be easily transformed into Cython code. Putting some cdef on Python Numpy code generally does not make it faster.
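As a rough sketch of what "loops without temporaries" means for the entropy function in the question, the version below is written in plain Python so that it can be checked against the Numpy expression. The name entropy_loop is hypothetical, and the p > 0 guard is an addition that the original code does not have. In Cython, the same body would take a double[::1] memoryview and typed loop variables.

```python
import math
import numpy as np

def entropy_loop(counts):
    """Loop-based entropy: two passes, no temporary arrays (sketch)."""
    n = len(counts)
    total = 0.0
    for i in range(n):
        total += counts[i]
    plogp_sum = 0.0
    for i in range(n):
        p = counts[i] / total
        if p > 0.0:  # assumption: treat 0 * log(0) as 0
            plogp_sum += p * math.log(p)
    d = math.exp(-plogp_sum)
    return {"d": d, "relative_d": d / n}

counts = np.array([10.0, 20.0, 30.0, 40.0])
result = entropy_loop(counts)

# Reference value computed the vectorized way, as in the question
probs = counts / counts.sum()
expected = np.exp(-np.sum(probs * np.log(probs)))
print(result["d"], expected)
```

Run as plain Python this loop is slower than the Numpy version; the point is only that every operation in it is scalar arithmetic that Cython can compile to C once the types are declared.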

Coefficients of 2D Chebyshev series in numpy.polynomial.chebyshev

I understand that chebvander2d and chebval2d return the Vandermonde matrix and fitted values for 2D inputs, and chebfit returns the coefficients for 1D-input series, but how do I get the coefficients for 2D-input series?
Short answer: It looks to me like this is not yet implemented. The whole of 2D polynomials seems more like a draft with some stub functions (as of June 2020).
Long answer (I came looking for the same thing, so I dug a little deeper):
First of all, this applies to all of the polynomial classes, not only chebyshev, so you also cannot fit an "ordinary" polynomial (power series). In fact, you cannot even construct one.
To understand the programming problem, let me recap what a 2D polynomial looks like as a math formula, using an example polynomial of degree 2:
p(x, y) = c_00 + c_10 x + c_01 y + c_20 x^2 + c_11 xy + c_02 y^2
here the indices of c refer to the powers of x and y (the sum of the exponents must be <= degree).
First thing to notice is that, for degree d, there are (d+1)(d+2)/2 coefficients.
They could be stored in the upper left part of a matrix or in a 1D array, e.g. arranged as in the formula above.
The documentation of functions like numpy.polynomial.polynomial.polyval2d implies that numpy expects the matrix variant: p(x, y) = sum_i,j c_i,j * x^i * y^j.
Side note: it may be confusing that the row index i ("y-coordinate") of the matrix is used as the exponent of x, not y; maybe the roles of i and j should be switched if this is eventually implemented, or at least there should be a note in the documentation.
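The matrix convention can be checked directly with polyval2d. In this small sketch the coefficient values are arbitrary; the degree-2 polynomial from the formula above is stored in the upper left triangle, with the row index as the exponent of x.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# c[i, j] multiplies x**i * y**j
c = np.array([
    [1.0, 2.0, 3.0],   # c_00, c_01, c_02
    [4.0, 5.0, 0.0],   # c_10, c_11
    [6.0, 0.0, 0.0],   # c_20
])

x, y = 2.0, 3.0
by_numpy = P.polyval2d(x, y, c)
# c_00 + c_10 x + c_01 y + c_20 x^2 + c_11 xy + c_02 y^2
by_hand = 1 + 4 * x + 2 * y + 6 * x**2 + 5 * x * y + 3 * y**2
print(by_numpy, by_hand)
```

For degree d = 2, the six nonzero entries match the (d+1)(d+2)/2 count.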
This leads to the core problem: the data structure for the 2D coefficients is not defined anywhere; only indirectly, like above, it can be guessed that a matrix should be used. But compared to a 1D array this is a waste of space, and evaluation of the polynomial takes two nested loops instead of just one. Also: does the matrix have to be initialized with np.zeros or do the implemented functions make sure that the lower right part is never touched so that np.empty can be used?
If the whole (d+1)^2 matrix were used, as the polyval2d function doc suggests, the degree of the polynomial would actually be d*2 (if c_d,d != 0).
To test this, I wanted to construct a numpy.polynomial.polynomial.Polynomial (yes, three times polynomial) and check its degree:
import numpy as np
import numpy.polynomial.polynomial as poly

coef = np.array([
    [5.00, 5.01, 5.02],
    [5.10, 5.11, 0.  ],
    [5.20, 0.  , 0.  ]
])
polyObj = poly.Polynomial(coef)
print(polyObj.degree())
This gave a ValueError: Coefficient array is not 1-d before the print statement was reached. So while polyval2d expects a 2D coefficient array, it is not (yet) possible to construct such a polynomial - not manually like this at least. With this insight, it is not surprising that there is no function (yet) that computes a fit for 2D polynomials.
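Until such a function exists, the coefficients can be obtained by hand from the pseudo-Vandermonde matrix. The following is a minimal sketch, not a NumPy API: chebfit2d is a hypothetical helper that solves the least-squares problem on the output of chebvander2d and reshapes the flat solution into the matrix layout that chebval2d expects.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def chebfit2d(x, y, z, deg):
    """Least-squares 2D Chebyshev fit (hypothetical helper).

    deg is [deg_x, deg_y]; the result has shape (deg_x + 1, deg_y + 1).
    """
    deg = list(deg)
    V = C.chebvander2d(x, y, deg)            # one column per T_i(x) * T_j(y)
    coef, *_ = np.linalg.lstsq(V, z, rcond=None)
    return coef.reshape(deg[0] + 1, deg[1] + 1)

# Synthetic check: recover known coefficients from exact samples
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = rng.uniform(-1, 1, 200)
true_c = rng.normal(size=(3, 4))
z = C.chebval2d(x, y, true_c)

fit_c = chebfit2d(x, y, z, [2, 3])
print(np.allclose(fit_c, true_c))
```

Note that this fits the full rectangular coefficient matrix, not only the triangular part with total degree at most d.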

Intersection of sorted numpy arrays

I have a list of sorted numpy arrays. What is the most efficient way to compute the sorted intersection of these arrays?
In my application, I expect the number of arrays to be less than 10^4, I expect the individual arrays to be of length less than 10^7, and I expect the length of the intersection to be close to p*N, where N is the length of the largest array and where 0.99 < p <= 1.0. The arrays are loaded from disk and can be loaded in batches if they won't all fit in memory at once.
A quick and dirty approach is to repeatedly invoke numpy.intersect1d(). That seems inefficient though as intersect1d() does not take advantage of the fact that the arrays are sorted.
Since intersect1d sorts the arrays each time, it is indeed inefficient.
Here you have to sweep the current intersection and each sample together to build the new intersection, which can be done in linear time while maintaining order.
Such a task must often be tuned by hand with low-level routines.
Here is a way to do that with numba:
from numba import njit
import numpy as np

@njit
def drop_missing(intersect, sample):
    i = j = k = 0
    new_intersect = np.empty_like(intersect)
    while i < intersect.size and j < sample.size:
        if intersect[i] == sample[j]:  # the 99% case
            new_intersect[k] = intersect[i]
            k += 1
            i += 1
            j += 1
        elif intersect[i] < sample[j]:
            i += 1
        else:
            j += 1
    return new_intersect[:k]
Now the samples:
n = 10**7
ref = np.random.randint(0, n, n)
ref.sort()

def perturbation(sample, k):
    rands = np.random.randint(0, n, k - 1)
    rands.sort()
    l = np.split(sample, rands)
    return np.concatenate([a[:-1] for a in l])

samples = [perturbation(ref, 100) for _ in range(10)]  # similar samples
And a run for 10 samples:
def find_intersect(samples):
    intersect = samples[0]
    for sample in samples[1:]:
        intersect = drop_missing(intersect, sample)
    return intersect

In [18]: %time u=find_intersect(samples)
Wall time: 307 ms
In [19]: len(u)
Out[19]: 9999009
Extrapolating to 10^4 arrays, it seems that the job can be done in about 5 minutes, excluding loading time.
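For a quick check without Numba, a vectorized variant can be sketched with np.searchsorted. Note that this swaps in a different technique: it does a binary search per element (O(n log m)) instead of the linear sweep above, and it assumes each array is sorted and free of duplicates. The name intersect_sorted is hypothetical.

```python
import numpy as np

def intersect_sorted(a, b):
    """Intersection of two sorted, duplicate-free 1-D arrays (sketch)."""
    if b.size == 0:
        return a[:0]
    idx = np.searchsorted(b, a)        # where each value of a would sit in b
    idx[idx == b.size] = b.size - 1    # clamp positions past the end of b
    return a[b[idx] == a]              # keep values actually present in b

a = np.array([1, 3, 4, 7, 9, 12])
b = np.array([3, 4, 5, 9, 10, 12, 15])
c = np.array([0, 3, 9, 12])

result = a
for other in (b, c):
    result = intersect_sorted(result, other)
print(result)
```

The output stays sorted because it is a masked selection from the first array.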
A few months ago, I wrote a C++-based python extension for this exact purpose. The package is called sortednp and is available via pip. The intersection of multiple sorted numpy arrays, for example, a, b and c, can be calculated with
import sortednp as snp
i = snp.kway_intersect(a, b, c)
By default, this uses an exponential search to advance the array indices internally which is pretty fast in cases where the intersection is small. In your case, it might be faster if you add algorithm=snp.SIMPLE_SEARCH to the method call.

sparse matrix multiplication involving inverted matrix

I have two large square sparse matrices, A & B, and need to compute the following: A * B^-1 in the most efficient way. I have a feeling that the answer involves using scipy.sparse, but can't for the life of me figure it out.
After extensive searching, I have run across the following thread: Efficient numpy / lapack routine for product of inverse and sparse matrix? but can't figure out what the most efficient way would be.
Someone suggested using LU decomposition, which is built into the sparse module of scipy, but when I try to do LU on a sample matrix it says the result is singular (although when I just do A * B^-1 I get an answer). I have also heard someone suggest using linalg.spsolve(), but I can't figure out how to implement this as it requires a vector as the second argument.
If it helps, once I have the solution s.t. A * B^-1 = C, I only need to know the value for one row of the matrix C. The matrices will be roughly 1000x1000 to 1500x1500.
Actually 1000x1000 matrices are not that large. You can compute the inverse of such a matrix using numpy.linalg.inv(B) in less than 1 second on a modern desktop computer.
But you can be much more efficient if you rewrite your problem taking into account the fact that you only need one row of C (this is actually very often the case).
Let us write d_i = [0 0 0 ... 0 1 0 ... 0], a vector whose only nonzero entry is a one in the i-th position.
If ^t denotes the transpose, you can write:
AB^-1 = C <=> A = CB <=> A^t = B^t C^t
For the i-th row (a_i and c_i being the i-th rows of A and C, written as column vectors):
A^t d_i = B^t C^t d_i <=> a_i = B^t c_i
So you have a linear system that can be solved using numpy.linalg.solve:
ci = np.linalg.solve(B.T, A[i])
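The identity can be verified numerically on a small dense example. The matrices here are random stand-ins for A and B, with B kept close to the identity so that it is well conditioned. For scipy.sparse matrices the same equation should fit scipy.sparse.linalg.spsolve, since the right-hand side is a single vector, but that variant is not exercised here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
B = np.eye(n) + 0.1 * rng.normal(size=(n, n))  # well-conditioned test matrix

i = 2
# Row i of C = A B^-1 from a single linear solve, without forming B^-1
c_i = np.linalg.solve(B.T, A[i])

# Reference: full product, for checking only
C_full = A @ np.linalg.inv(B)
print(np.allclose(c_i, C_full[i]))
```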

polynomial surface fit numpy

How do I fit a 2D surface z=f(x,y) with a polynomial in numpy with full cross terms?
This is inherently numerically ill-conditioned but you could do something like this:
import numpy as np
x = np.random.randn(500)
y = np.random.randn(500)
z = np.random.randn(500) # Dependent variable
v = np.array([np.ones(500), x, y, x**2, x * y, y**2])
coefficients, residues, rank, singval = np.linalg.lstsq(v.T, z, rcond=None)
The more terms you add, the worse things get, numerically. Are you sure you want a polynomial interpolant?
There are other bases for polynomials for which the matrix of values is not so badly conditioned but I can't remember what they are called; any college-level numerical analysis textbook would have this material, though.
You can use a combination of polyvander2d and polyval2d, but will need to do the fit yourself using the design matrix output from polyvander2d, probably involving scaling and such. It should be possible to build a class Polynomial2d from those tools.
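A minimal sketch of that combination follows. The helper polyfit2d is hypothetical (NumPy does not ship one); it builds the design matrix with polyvander2d, solves the least-squares problem itself, and evaluates the result with polyval2d. The inputs are kept in [-1, 1] here, which sidesteps the scaling question for this example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def polyfit2d(x, y, z, deg):
    """Least-squares fit with all cross terms x**i * y**j (sketch).

    deg is [deg_x, deg_y]; coef[i, j] multiplies x**i * y**j.
    """
    deg = list(deg)
    V = P.polyvander2d(x, y, deg)              # design matrix
    coef, *_ = np.linalg.lstsq(V, z, rcond=None)
    return coef.reshape(deg[0] + 1, deg[1] + 1)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 300)
y = rng.uniform(-1, 1, 300)
z = 1.0 + 2.0 * x - 3.0 * y + 0.5 * x * y + 4.0 * y**2  # known surface

coef = polyfit2d(x, y, z, [2, 2])
z_fit = P.polyval2d(x, y, coef)
print(np.round(coef, 6))
```

For real data on a wider range, rescaling x and y into [-1, 1] before building the design matrix is the usual way to keep the problem better conditioned.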