Link Cython-wrapped C functions against BLAS from NumPy - numpy

I want to use inside a Cython extension some C functions defined in .c files that uses BLAS subroutines, e.g.
cfile.c
double ddot(int *N, double *DX, int *INCX, double *DY, int *INCY);
double call_ddot(double* a, double* b, int n){
int one = 1;
return ddot(&n, a, &one, b, &one);
}
(Let’s say the functions do more than just call one BLAS subroutine)
pyfile.pyx
cimport numpy as np
import numpy as np
cdef extern from "cfile.c":
double call_ddot(double* a, double* b, int n)
def pyfun(np.ndarray[double, ndim=1] a):
return call_ddot(&a[0], &a[0], <int> a.shape[0])
setup.py:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import numpy
setup(
name = "wrapped_cfun",
packages = ["wrapped_cfun"],
cmdclass = {'build_ext': build_ext},
ext_modules = [Extension("wrapped_cfun.cython_part", sources=["pyfile.pyx"], include_dirs=[numpy.get_include()])]
)
I want this package to link against the same BLAS library that the installed NumPy or SciPy are using, and would like it to be installable from PIP under different operating systems using numpy or scipy as dependencies, without any additional BLAS-related dependency.
Is there any hack for setup.py that would allow me to accomplish this, in a way that it could work with any BLAS implementation?
Update:
With MKL, I can make it work by modifying the Extension object to point to libmkl_rt, which can be extracted from numpy if MKL is installed, e.g.:
Extension("wrapped_cfun.cython_part", sources=["pyfile.pyx"], include_dirs=[numpy.get_include()], extra_link_args=["-L{path to python's lib dir}", "-l:libmkl_rt.{so, dll, dylib}"])
However, the same trick does not work for OpenBLAS (e.g. -l:libopenblasp-r0.2.20.so). Pointing to libblas.{so,dll,dylib} will not work if that file is a link to libopenblas, but works fine it it's a link to libmkl_rt.
Update 2:
It seems OpenBLAS names their C functions with an underscore at the end, e.g. not ddot but ddot_. The code above with l:libopenblas will work if I change ddot to ddot_ in the .c file. I'm still wondering if there is some (ideally run-time) mechanism to detect which name should be used in the c file.

An alternative to depending on linker/loader to provide the right blas-functionality, would be to emulate resolution of the necessary blas-symbols (e.g. ddot) and to use the wrapped blas-function provided by scipy during the runtime.
Not sure, this approach is superior to the "normal way" of building, but wanted to bring it to your attention, even if only because I find this approach interesting.
The idea in a nutshell:
Define an explicit function-pointer to ddot-functionality, called my_ddot in the snippet below.
Use my_ddot-pointer where you would use ddot-otherwise.
Initialize my_ddot-pointer when the cython-module is loaded with the functionality provided by scipy.
Here is a working prototype (I use C-code-verbatim to make the snippet standalone and easily testable in a jupiter-notebook, trust you to transform it to format you need/like):
%%cython
# h-file:
cdef extern from *:
"""
// blas-functionality,
// will be initialized by cython when module is loaded:
typedef double (*ddot_t)(int *N, double *DX, int *INCX, double *DY, int *INCY);
extern ddot_t my_ddot;
double call_ddot(double* a, double* b, int n);
"""
ctypedef double (*ddot_t)(int *N, double *DX, int *INCX, double *DY, int *INCY)
ddot_t my_ddot
double call_ddot(double* a, double* b, int n)
# init the functions of the c-library
# with blas-function provided by scipy
from scipy.linalg.cython_blas cimport ddot
my_ddot=ddot
# a simple function to demonstrate, that it works
def ddot_mult(double[:]a, double[:]b):
cdef int n=len(a)
return call_ddot(&a[0], &b[0], n)
#-------------------------------------------------
# c-file, added so the example is complete
cdef extern from *:
"""
ddot_t my_ddot;
double call_ddot(double* a, double* b, int n){
int one = 1;
return my_ddot(&n, a, &one, b, &one);
}
"""
pass
And now ddot_mult can be used:
import numpy as np
a=np.arange(4, dtype=float)
ddot_mult(a,a) # 14.0 as expected!
An advantage of this approach is, that there is no hustle with distutils and you have a guarantee, to use the same blas-functionality as scipy.
Another perk: One could switch the used engine (mkl, open_blas or even an own implementation) during the runtime without the need to recompile/relink.
On there other hand, there is some additional amount of boilerplate-code and also the danger, that initialization of some symbols will be forgotten.

I've finally figured out an ugly hack for this. I'm not sure if it will always work, but at least it works for cobminations of Windows (mingw and visual studio), Linux, MKL and OpenBlas. I'd still like to know if there are better alternatives, but if not, this will do it:
Edit: Corrected for visual studio now
Modify C files to account for names with underscores (do it for each BLAS function that is called) - need to declare each function twice and add an if for each one
double ddot_(int *N, double *DX, int *INCX, double *DY, int *INCY);
#define ddot(N, DX, INCX, DY, INCY) ddot_(N, DX, INCX, DY, INCY)
daxpy_(int *N, double *DA, double *DX, int *INCX, double *DY, int *INCY);
#define daxpy(N, DA, DX, INCX, DY, INCY) daxpy_(N, DA, DX, INCX, DY, INCY)
... etc
Extract library path from NumPy or SciPy and add it to the link arguments.
Detect if the compiler to be used is visual studio, in which case the linking arguments are quite different.
setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import numpy
from sys import platform
import os
try:
blas_path = numpy.distutils.system_info.get_info('blas')['library_dirs'][0]
except:
if "library_dirs" in numpy.__config__.blas_mkl_info:
blas_path = numpy.__config__.blas_mkl_info["library_dirs"][0]
elif "library_dirs" in numpy.__config__.blas_opt_info:
blas_path = numpy.__config__.blas_opt_info["library_dirs"][0]
else:
raise ValueError("Could not locate BLAS library.")
if platform[:3] == "win":
if os.path.exists(os.path.join(blas_path, "mkl_rt.lib")):
blas_file = "mkl_rt.lib"
elif os.path.exists(os.path.join(blas_path, "mkl_rt.dll")):
blas_file = "mkl_rt.dll"
else:
import re
blas_file = [f for f in os.listdir(blas_path) if bool(re.search("blas", f))]
if len(blas_file) == 0:
raise ValueError("Could not locate BLAS library.")
blas_file = blas_file[0]
elif platform[:3] == "dar":
blas_file = "libblas.dylib"
else:
blas_file = "libblas.so"
## https://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
class build_ext_subclass( build_ext ):
def build_extensions(self):
compiler = self.compiler.compiler_type
if compiler == 'msvc': # visual studio
for e in self.extensions:
e.extra_link_args += [os.path.join(blas_path, blas_file)]
else: # gcc
for e in self.extensions:
e.extra_link_args += ["-L"+blas_path, "-l:"+blas_file]
build_ext.build_extensions(self)
setup(
name = "wrapped_cfun",
packages = ["wrapped_cfun"],
cmdclass = {'build_ext': build_ext_subclass},
ext_modules = [Extension("wrapped_cfun.cython_part", sources=["pyfile.pyx"], include_dirs=[numpy.get_include()], extra_link_args=[])]
)

As yet another alternative with more recent Cython versions, one can create a "public" Cython function (which will be made available to C code and auto-generate a public header) that would simply call the corresponding BLAS function:
from scipy.linalg.cython_blas cimport ddot
cdef public double ddot_(int *n, double *x, int *ldx, double *y, int *ldy):
return ddot(n, x, ldx, y, ldy)
Then one simply declares it in the C code or includes the header, and the rest of the Cython extension builder will take care of linkage:
extern double ddot_(int *n, double *x, int *ldx, double *y, int *ldy);

Related

What is a ".pxi.in" file and how to use it?

I need to understand the source code of pandas. I met several ".pxi.in" files in the source code. For example:
https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/hashtable_class_helper.pxi.in
The files say
DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
I want to know how to use .pxi.in files to generate .pxi files. Is there any tutorials or docs?
Cython comes with tempita. The idea behind it is more or less to have a kind of preprocessor-language for Cython, because otherwise Cython is somewhat lacking this functionality compared to C or C++-templates.
*.pxi.in from pandas uses tempita, to create e.g. classes for int8, int16 usw. from a template, so one doesn't have to repeat the code per hand. *.pxi.in files are converted to *.pxi-file in the setup.py-file.
As example
{{py:
# name
complex_types = ['complex64',
'complex128']
}}
{{for name in complex_types}}
cdef kh{{name}}_t to_kh{{name}}_t({{name}}_t val) nogil:
cdef kh{{name}}_t res
res.real = val.real
res.imag = val.imag
return res
{{endfor}}
will be converted, once tempita step was performed, to
cdef khcomplex64_t to_khcomplex64_t(complex64_t val) nogil:
cdef khcomplex64_t res
res.real = val.real
res.imag = val.imag
return res
cdef khcomplex128_t to_khcomplex128_t(complex128_t val) nogil:
cdef khcomplex128_t res
res.real = val.real
res.imag = val.imag
return res
and the resulting pxi-file will be used while cythonizing pyx-files.

Cython passing int numpy array to C++

First, I know this question appears similar to this one but they are different. I'm struggling trying to pass int (int32) numpy array to C++ via Cython without copying. The files:
doit.cpp:
#include "doit.h"
void run(int *x) {}
doit.h:
#ifndef _DOIT_H_
#define _DOIT_H_
void run(int *);
#endif
q.pyx:
cimport numpy as np
import numpy as np
cdef extern from "doit.h":
void run(int* X)
def pyrun(np.ndarray[np.int_t, ndim=1] X):
X = np.ascontiguousarray(X)
run(&X[0])
I compile with Cython. The error is:
Error compiling Cython file:
------------------------------------------------------------
...
cdef extern from "doit.h":
void run(int* X)
def pyrun(np.ndarray[np.int_t, ndim=1] X):
X = np.ascontiguousarray(X)
run(&X[0])
^
------------------------------------------------------------
py_cpp/q.pyx:9:8: Cannot assign type 'int_t *' to 'int *'
However, if I replace all occurrences of int to double (e.g. int *x to double *x, int_t to double_t), then all errors are gone.
How to solve the problem? Thanks in advance.

A Good Way to Expose CUPY MemoryPointer in C/C++?

NumPy provides well-defined C APIs so that one can easily handle NumPy array in C/C++ space. For example, if I have a C function that takes C arrays (pointers) as arguments, I can just #include <numpy/arrayobject.h>, and pass a NumPy array to it by accessing its data member (or use the C API PyArray_DATA).
Recently I want to achieve the same for CuPy, but I cannot find a header file that I can include. To be specific, my goal is as follows:
I have some CUDA kernels and their callers written in C/C++. The callers run on host but take handles of memory buffers on device as arguments. The computed results of the callers are also stored on device.
I want to wrap the callers into Python functions so that I can control when to transfer data from device to host in Python. That means I have to wrap the resulted device memory pointers in Python objects. CuPy's ndarray is the best choice I can think of.
I can't use CuPy's user-defined-kenrel mechanism because the functions I want to wrap are not directly CUDA kernels. They must contain host code.
Currently, I've found a workaround. I write the Python functions in cython, which take CuPy arrays as inputs and return CuPy arrays. And then I cast .data.ptr attribute into C's size_t type, and then further cast it to whatever pointer type I need. Example code follows.
Example Code
//kernel.cu
#include <math.h>
__global__ void vecSumKernel(float *A, float *B, float *C, int n) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n)
C[i] = A[i] + B[i];
}
// This is the C function I want to wrap into Python.
// Notice it does not allocate any memory on device. I want that to be done by cupy.
extern "C" void vecSum(float *A_d, float *B_d, float *C_d, int n) {
int threadsPerBlock = 512;
if (threadsPerBlock > n) threadsPerBlock = n;
int nBlocks = (int)ceilf((float)n / (float)threadsPerBlock);
vecSumKernel<<<nBlocks, threadsPerBlock>>>(A_d, B_d, C_d, n);
}
//kernel.h
#ifndef KERNEL_H_
#define KERNEL_H_
void vecSum(float *A_d, float *B_d, float *C_d, int n);
#endif
# test_module.pyx
import cupy as cp
import numpy as np
cdef extern from "kernel.h":
void vecSum(float *A_d, float *B_d, float *C_d, int n)
cdef vecSum_wrapper(size_t aPtr, size_t bPtr, size_t cPtr, int n):
# here the Python int -- cp.ndarray.data.ptr -- is first cast to size_t,
# and then cast to (float *).
vecSum(<float*>aPtr, <float*>bPtr, <float*>cPtr, n)
# This is the Python function I want to use
# a, b are cupy arrays
def vec_sum(a, b):
a_ptr = a.data.ptr
b_ptr = b.data.ptr
n = a.shape[0]
output = cp.empty(shape=(n,), dtype=a.dtype)
c_ptr = output.data.ptr
vecSum_wrapper(a_ptr, b_ptr, c_ptr, n)
return output
Compile and Run
To compile, one can first compile the kernel.cu into a static library, say, libVecSum. Then use cython to compile test_module.pyx int test_module.c, and build the Python extension as usual.
# setup.py
from setuptools import Extension, setup
ext_module = Extension(
"cupyExt.test_module",
sources=["cupyExt/test_module.c"],
library_dirs=["cupyExt/"],
libraries=['libVecSum', 'cudart'])
setup(
name="cupyExt",
version="0.0.0",
ext_modules = [ext_module],
)
It seems working.
>>> import cupy as cp
>>> from cupyExt import test_module
>>> a = cp.ones(5, dtype=cp.float32) * 3
>>> b = cp.arange(5, dtype=cp.float32)
>>> c = test_module.vec_sum(a, b)
>>> print(c.device)
<CUDA Device 0>
>>> print(c)
[3. 4. 5. 6. 7.]
Any better ways?
I am not sure if this way is memory safe. I also feel the casting from .data.ptr to C pointers is not good. I want to know people's thoughts and comments on this.

Why does cythons in-place division of numpy arrays use conversion to python floats?

I tried to normalize a vector stored as numpy array, but cython -a shows unexpected conversions to Python values in this code.
Minimal example:
import numpy as np
cimport cython
cimport numpy as np
#cython.wraparound(False)
#cython.boundscheck(False)
cdef vec_diff(np.ndarray[double, ndim=1] vec1, double m):
vec1/=m
return vec1
Cython 0.29.6 run with the -a option generates the following code for the line vec1/=m:
__pyx_t_1 = PyFloat_FromDouble(__pyx_v_m); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 8, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
__pyx_t_2 = __Pyx_PyNumber_InPlaceDivide(((PyObject *)__pyx_v_vec1), __pyx_t_1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 8, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
__Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
if (!(likely(((__pyx_t_2) == Py_None) || likely(__Pyx_TypeTest(__pyx_t_2, __pyx_ptype_5numpy_ndarray))))) __PYX_ERR(0, 8, __pyx_L1_error)
__pyx_t_3 = ((PyArrayObject *)__pyx_t_2);
{
__Pyx_BufFmt_StackElem __pyx_stack[1];
__Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_vec1.rcbuffer->pybuffer);
__pyx_t_4 = __Pyx_GetBufferAndValidate(&__pyx_pybuffernd_vec1.rcbuffer->pybuffer, (PyObject*)__pyx_t_3, &__Pyx_TypeInfo_double, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack);
if (unlikely(__pyx_t_4 < 0)) {
PyErr_Fetch(&__pyx_t_5, &__pyx_t_6, &__pyx_t_7);
if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_vec1.rcbuffer->pybuffer, (PyObject*)__pyx_v_vec1, &__Pyx_TypeInfo_double, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack) == -1)) {
Py_XDECREF(__pyx_t_5); Py_XDECREF(__pyx_t_6); Py_XDECREF(__pyx_t_7);
__Pyx_RaiseBufferFallbackError();
} else {
PyErr_Restore(__pyx_t_5, __pyx_t_6, __pyx_t_7);
}
__pyx_t_5 = __pyx_t_6 = __pyx_t_7 = 0;
}
__pyx_pybuffernd_vec1.diminfo[0].strides = __pyx_pybuffernd_vec1.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_vec1.diminfo[0].shape = __pyx_pybuffernd_vec1.rcbuffer->pybuffer.shape[0];
if (unlikely(__pyx_t_4 < 0)) __PYX_ERR(0, 8, __pyx_L1_error)
}
__pyx_t_3 = 0;
__Pyx_DECREF_SET(__pyx_v_vec1, ((PyArrayObject *)__pyx_t_2));
__pyx_t_2 = 0;
where the first line __pyx_t_1 = PyFloat_FromDouble(__pyx_v_m); has PyFloat_FromDouble highlighted in dark red.
Given that I have told cython that the array contains double values, why does it have to convert to a python float?
Note: Memoryviews do not support the /= operation (would require a loop)
Because this isn't something that Cython does anything special for or optimises at all. All it's doing is calling __Pyx_PyNumber_InPlaceDivide on the Numpy array, which calls the Numpy array's __idiv__ operator.
Since it's calling a Python operator it needs to pass a Python object as the second argument, and hence it needs to convert your double to a Python float.
The Numpy __idiv__ operator is almost certainly written in C so likely to be pretty fast (although there is a little overhead calling it) so there's not a lot of value in Cython doing anything except delegating to Numpy's code.
Memoryviews don't define the whole-array operators (they're just ways to access memory so don't make any claims about meaningful mathematical operations) and hence the fact that it doesn't work is consistent with how Cython deals with these operators.

passing 1 or 2 d numpy array to c throw cython

I am writing an extension to my python code in c and cython, by following this guide.
my c function signature is
void c_disloc(double *pEOutput, double *pNOutput, double *pZOutput, double *pModel, double *pECoords, double *pNCoords, double nu, int NumStat, int NumDisl)
and my cython function is
cdef extern void c_disloc(double *pEOutput, double *pNOutput, double *pZOutput, double *pModel, double *pECoords, double *pNCoords, double nu, int NumStat, int NumDisl)
#cython.boundscheck(False)
#cython.wraparound(False)
def disloc(np.ndarray[double, ndim=2, mode="c"] pEOutput not None,
np.ndarray[double, ndim=2, mode="c"] pNOutput not None,
np.ndarray[double, ndim=2, mode="c"] pZOutput not None,
np.ndarray[double, ndim=1, mode="c"] pModel not None,
np.ndarray[double, ndim=2, mode="c"] pECoords not None,
np.ndarray[double, ndim=2, mode="c"] pNCoords not None,
double nu,int NumStat, int NumDisl ):
c_disloc(&pEOutput[0,0], &pNOutput[0,0], &pZOutput[0,0], &pModel[0], &pECoords[0,0], &pNCoords[0,0], nu, NumStat, NumDisl)
return None
now my c function has the same behavior no matter if the arrays that its getting are 1d or 2d arrays, but I didn't succeed making the cython function to be able to get 1d or 2d numpy arrays.
of course, I could write tow cython function one for the 1d case and one for the 2d case but it will be cleaner to do it with one function.
dose someone knows how to do it?
I'd accept an untyped argument, check that it's a C contiguous array and then use np.ravel to get a flat array (this returns a view, not a copy, when passed a C contiguous array). It's easy to create that as a cdef function:
cdef double* get_array_pointer(arr) except NULL:
assert(arr.flags.c_contiguous) # if this isn't true, ravel will make a copy
cdef double[::1] mview = arr.ravel()
return &mview[0]
Then you'd do
def disloc(pEOutput,
pNOutput,
# etc...
double nu,int NumStat, int NumDisl ):
c_disloc(get_array_pointer(pEOutput), get_array_pointer(pNOutput),
# etc
nu, NumStat, NumDisl)
I've removed the
#cython.boundscheck(False)
#cython.wraparound(False)
since it's obvious they will gain you close to nothing. Using them without thinking about whether they do anything seems like cargo cult programming to me.