Why does Cython's in-place division of NumPy arrays convert to Python floats?

I tried to normalize a vector stored as a NumPy array, but cython -a shows unexpected conversions to Python values in this code.
Minimal example:
import numpy as np
cimport cython
cimport numpy as np
#cython.wraparound(False)
#cython.boundscheck(False)
cdef vec_diff(np.ndarray[double, ndim=1] vec1, double m):
    vec1/=m
    return vec1
Cython 0.29.6 run with the -a option generates the following code for the line vec1/=m:
__pyx_t_1 = PyFloat_FromDouble(__pyx_v_m); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 8, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
__pyx_t_2 = __Pyx_PyNumber_InPlaceDivide(((PyObject *)__pyx_v_vec1), __pyx_t_1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 8, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
__Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
if (!(likely(((__pyx_t_2) == Py_None) || likely(__Pyx_TypeTest(__pyx_t_2, __pyx_ptype_5numpy_ndarray))))) __PYX_ERR(0, 8, __pyx_L1_error)
__pyx_t_3 = ((PyArrayObject *)__pyx_t_2);
{
__Pyx_BufFmt_StackElem __pyx_stack[1];
__Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_vec1.rcbuffer->pybuffer);
__pyx_t_4 = __Pyx_GetBufferAndValidate(&__pyx_pybuffernd_vec1.rcbuffer->pybuffer, (PyObject*)__pyx_t_3, &__Pyx_TypeInfo_double, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack);
if (unlikely(__pyx_t_4 < 0)) {
PyErr_Fetch(&__pyx_t_5, &__pyx_t_6, &__pyx_t_7);
if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_vec1.rcbuffer->pybuffer, (PyObject*)__pyx_v_vec1, &__Pyx_TypeInfo_double, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack) == -1)) {
Py_XDECREF(__pyx_t_5); Py_XDECREF(__pyx_t_6); Py_XDECREF(__pyx_t_7);
__Pyx_RaiseBufferFallbackError();
} else {
PyErr_Restore(__pyx_t_5, __pyx_t_6, __pyx_t_7);
}
__pyx_t_5 = __pyx_t_6 = __pyx_t_7 = 0;
}
__pyx_pybuffernd_vec1.diminfo[0].strides = __pyx_pybuffernd_vec1.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_vec1.diminfo[0].shape = __pyx_pybuffernd_vec1.rcbuffer->pybuffer.shape[0];
if (unlikely(__pyx_t_4 < 0)) __PYX_ERR(0, 8, __pyx_L1_error)
}
__pyx_t_3 = 0;
__Pyx_DECREF_SET(__pyx_v_vec1, ((PyArrayObject *)__pyx_t_2));
__pyx_t_2 = 0;
where the first line __pyx_t_1 = PyFloat_FromDouble(__pyx_v_m); has PyFloat_FromDouble highlighted in dark red.
Given that I have told cython that the array contains double values, why does it have to convert to a python float?
Note: Memoryviews do not support the /= operation (would require a loop)

Because Cython does nothing special for this operation and does not optimise it. All it does is call __Pyx_PyNumber_InPlaceDivide on the NumPy array, which calls the array's in-place division operator (__idiv__ on Python 2, __itruediv__ on Python 3).
Since it's calling a Python operator, it needs to pass a Python object as the second argument, and hence it needs to convert your double to a Python float.
The NumPy operator is almost certainly written in C, so it is likely to be pretty fast (although there is a little overhead in calling it). There's therefore not a lot of value in Cython doing anything except delegating to NumPy's code.
Memoryviews don't define the whole-array operators (they're just ways to access memory, so they don't make any claims about meaningful mathematical operations). The fact that /= doesn't work on them is therefore consistent with how Cython deals with these operators.
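To see what the generated code is delegating to, the same operation can be reproduced from plain Python. This is a small sketch, not Cython's actual implementation; the array values and the name norm are made up for illustration. It shows that /= on a NumPy array dispatches to the array's in-place operator, receives the scalar as a Python object, and modifies the existing array rather than allocating a new one.

```python
import operator

import numpy as np

vec = np.array([3.0, 4.0])
original = vec  # second reference to the same array object

# A plain Python float, analogous to the result of PyFloat_FromDouble(m).
norm = float(np.sqrt(np.dot(vec, vec)))

# operator.itruediv(a, b) is the function form of `a /= b` on Python 3;
# for an ndarray it dispatches to the array's own in-place division.
result = operator.itruediv(vec, norm)

# The returned object is the original array: the division happened in place.
same_object = result is original
print(same_object, vec)
```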

Indexing in Rust ndarray crate based on a boolean mask

I would like to efficiently index into an ndarray using a boolean mask. To better convey what I mean I have some working numpy code and then my attempt in rust ndarray which works but is extremely inefficient.
Numpy:
import numpy as np
shape = (100, 100, 100)
grouping_array = np.random.randint(0, 100, size=shape)
data_array = np.random.rand(*shape)
for i in range(1, 100):
    ith_mean = data_array[grouping_array == i].mean()
    print(ith_mean)
Rust ndarray:
use ndarray::{Array, IxDyn};

fn group_means(
    data: &Array<f32, IxDyn>,
    grouping_var: &Array<f32, IxDyn>,
    n_groups: i32,
) {
    for group in 1..n_groups {
        // Boolean mask: true where the element belongs to the current group.
        let index_array = grouping_var.mapv(|x| x == group as f32);
        // Elements outside the group are replaced by 0 rather than dropped.
        let roi_data = Array::from_iter(
            data
                .iter()
                .zip(index_array.iter())
                .map(|(x, y)| if *y { *x } else { 0. }),
        );
        let mean_roi = roi_data.mean().unwrap();
        println!("group {}; mean {}", group, mean_roi);
    }
}
Here each iteration of the n_groups loop takes about as long as the whole NumPy script, which finishes in less than a second. Is there a better way to do this in the rust-ndarray version?
This is likely not a surprise to others, but since my grouping_var array should (in my use case) always be a 3D array, I changed its type (and therefore also that of index_array) from &Array<f32, IxDyn> to &Array<f32, Ix3>, which dramatically improved performance.
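On the NumPy side, the per-group loop can also be avoided entirely. The sketch below is an alternative technique (not the asker's code and not an ndarray-crate API): it uses np.bincount to accumulate per-label sums and counts in a single pass. The function name group_means_bincount and the small shape are assumptions for illustration, and it assumes the labels are non-negative integers smaller than n_groups.

```python
import numpy as np


def group_means_bincount(data, labels, n_groups):
    """Mean of `data` for every integer label in [0, n_groups), in one pass."""
    flat_labels = labels.ravel()
    # Sum of the data values per label, and number of elements per label.
    sums = np.bincount(flat_labels, weights=data.ravel(), minlength=n_groups)
    counts = np.bincount(flat_labels, minlength=n_groups)
    # Labels that never occur give 0/0; report them as NaN without a warning.
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts


rng = np.random.default_rng(0)
shape = (20, 20, 20)
grouping_array = rng.integers(0, 100, size=shape)
data_array = rng.random(shape)

means = group_means_bincount(data_array, grouping_array, 100)
```

The same idea (one pass that accumulates a sum and a count per group) should translate to a single zipped loop in Rust, though that is not shown here.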

A Good Way to Expose CUPY MemoryPointer in C/C++?

NumPy provides well-defined C APIs so that one can easily handle NumPy arrays in C/C++ space. For example, if I have a C function that takes C arrays (pointers) as arguments, I can just #include <numpy/arrayobject.h>, and pass a NumPy array to it by accessing its data member (or using the C API PyArray_DATA).
Recently I wanted to achieve the same for CuPy, but I cannot find a header file that I can include. To be specific, my goal is as follows:
I have some CUDA kernels and their callers written in C/C++. The callers run on the host but take handles of memory buffers on the device as arguments. The computed results of the callers are also stored on the device.
I want to wrap the callers into Python functions so that I can control when to transfer data from device to host in Python. That means I have to wrap the resulting device memory pointers in Python objects. CuPy's ndarray is the best choice I can think of.
I can't use CuPy's user-defined-kernel mechanism because the functions I want to wrap are not directly CUDA kernels. They must contain host code.
Currently, I've found a workaround. I write the Python functions in Cython, which take CuPy arrays as inputs and return CuPy arrays. I then cast the .data.ptr attribute to C's size_t type, and further cast it to whatever pointer type I need. Example code follows.
Example Code
//kernel.cu
#include <math.h>
__global__ void vecSumKernel(float *A, float *B, float *C, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        C[i] = A[i] + B[i];
}
// This is the C function I want to wrap into Python.
// Notice it does not allocate any memory on device. I want that to be done by cupy.
extern "C" void vecSum(float *A_d, float *B_d, float *C_d, int n) {
    int threadsPerBlock = 512;
    if (threadsPerBlock > n) threadsPerBlock = n;
    int nBlocks = (int)ceilf((float)n / (float)threadsPerBlock);
    vecSumKernel<<<nBlocks, threadsPerBlock>>>(A_d, B_d, C_d, n);
}
//kernel.h
#ifndef KERNEL_H_
#define KERNEL_H_
void vecSum(float *A_d, float *B_d, float *C_d, int n);
#endif
# test_module.pyx
import cupy as cp
import numpy as np
cdef extern from "kernel.h":
    void vecSum(float *A_d, float *B_d, float *C_d, int n)
cdef vecSum_wrapper(size_t aPtr, size_t bPtr, size_t cPtr, int n):
    # here the Python int -- cp.ndarray.data.ptr -- is first cast to size_t,
    # and then cast to (float *).
    vecSum(<float*>aPtr, <float*>bPtr, <float*>cPtr, n)
# This is the Python function I want to use
# a, b are cupy arrays
def vec_sum(a, b):
    a_ptr = a.data.ptr
    b_ptr = b.data.ptr
    n = a.shape[0]
    output = cp.empty(shape=(n,), dtype=a.dtype)
    c_ptr = output.data.ptr
    vecSum_wrapper(a_ptr, b_ptr, c_ptr, n)
    return output
Compile and Run
To compile, one can first compile kernel.cu into a static library, say, libVecSum. Then use Cython to compile test_module.pyx into test_module.c, and build the Python extension as usual.
# setup.py
from setuptools import Extension, setup
ext_module = Extension(
    "cupyExt.test_module",
    sources=["cupyExt/test_module.c"],
    library_dirs=["cupyExt/"],
    libraries=['libVecSum', 'cudart'])
setup(
    name="cupyExt",
    version="0.0.0",
    ext_modules = [ext_module],
)
It seems to work.
>>> import cupy as cp
>>> from cupyExt import test_module
>>> a = cp.ones(5, dtype=cp.float32) * 3
>>> b = cp.arange(5, dtype=cp.float32)
>>> c = test_module.vec_sum(a, b)
>>> print(c.device)
<CUDA Device 0>
>>> print(c)
[3. 4. 5. 6. 7.]
Any better ways?
I am not sure whether this approach is memory safe. I also feel that casting .data.ptr to C pointers is not good practice. I would like to hear people's thoughts and comments on this.
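One way to reduce the risk of the raw pointer cast is to validate the arrays in Python before their addresses are taken. The helper below is a hypothetical sketch, not part of CuPy: it checks only attributes (dtype, ndim, flags.c_contiguous, shape) that both NumPy and CuPy arrays expose, and it is exercised here with NumPy arrays so that it runs without a GPU.

```python
import numpy as np


def check_vec_sum_inputs(a, b):
    """Hypothetical pre-flight check before passing a.data.ptr / b.data.ptr to C.

    Returns the element count n that would be handed to vecSum.
    """
    for name, arr in (("a", a), ("b", b)):
        if arr.dtype != np.float32:
            # The C side reinterprets the buffer as float*, so dtype must match.
            raise TypeError(f"{name} must be float32, got {arr.dtype}")
        if arr.ndim != 1:
            raise ValueError(f"{name} must be 1-D, got {arr.ndim} dimensions")
        if not arr.flags.c_contiguous:
            # A strided view would make the C code read the wrong elements.
            raise ValueError(f"{name} must be C-contiguous")
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return a.shape[0]


n = check_vec_sum_inputs(np.ones(5, dtype=np.float32),
                         np.arange(5, dtype=np.float32))
```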

calling a fortran dll from python using cffi with multidimensional arrays

I use a dll that contains differential equation solvers among other useful mathematical tools. Unfortunately, this dll is written in Fortran. My program is written in python 3.7 and I use spyder as an IDE.
I successfully called easy functions from the dll. However, I can't seem to get functions to work that require multidimensional arrays.
This is the online documentation to the function I am trying to call:
https://www.nag.co.uk/numeric/fl/nagdoc_fl26/html/f01/f01adf.html
The kernel dies without an error message if I execute the following code:
import numpy as np
import cffi as cf
ffi=cf.FFI()
lib=ffi.dlopen("C:\Windows\SysWOW64\DLL20DDS")
ffi.cdef("""void F01ADF (const int *n, double** a, const int *lda, int *ifail);""")
#Integer
nx = 4
n = ffi.new('const int*', nx)
lda = nx + 1
lda = ffi.new('const int*', lda)
ifail = 0
ifail = ffi.new('int*', ifail)
#matrix to be inversed
ax1 = np.array([5,7,6,5],dtype = float, order = 'F')
ax2 = np.array([7,10,8,7],dtype = float, order = 'F')
ax3 = np.array([6,8,10,9],dtype = float, order = 'F')
ax4 = np.array([5,7,9,10], dtype = float, order = 'F')
ax5 = np.array([0,0,0,0], dtype = float, order = 'F')
ax = (ax1,ax2,ax3,ax4,ax5)
#Array
zx = np.zeros(nx, dtype = float, order = 'F')
a = ffi.cast("double** ", zx.__array_interface__['data'][0])
for i in range(lda[0]):
    a[i] = ffi.cast("double* ", ax[i].__array_interface__['data'][0])
lib.F01ADF(n, a, lda, ifail)
Since functions with 1D arrays work, I assume that the multidimensional array is the issue.
Any kind of help is greatly appreciated,
Thilo
Not having access to the dll you refer to makes it hard to give a definitive answer. However, the documentation of the dll and the provided Python script may be enough to diagnose the problem. There are at least two issues in your example:
The C header interface:
Your documentation link clearly states what the function's C header interface should look like. I'm not very well versed in C, Python's cffi or cdef, but the parameter declaration for a in your function interface seems wrong. The double** a (pointer to pointer to double) in your function interface should most likely be double a[] or double* a (pointer to double) as stated in the documentation.
Defining a 2d Numpy array with Fortran ordering:
Note that your Numpy arrays ax1..5 are one-dimensional arrays. Since they have only one dimension, order='F' and order='C' are equivalent in terms of memory layout and access. Thus, specifying order='F' here probably does not have the intended effect (Fortran uses column-major ordering for multi-dimensional arrays).
The variable ax is a tuple of Numpy arrays, not a 2d Numpy array, and will therefore have a very different representation in memory (which is of utmost importance when passing data to the Fortran dll) than a 2d array.
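The difference in memory layout can be checked directly in NumPy. The following is a small illustrative sketch with made-up values: it shows that a 1-D array is both C- and Fortran-contiguous, and that ravel(order='K') exposes the order in which a 2-D array's elements actually sit in memory.

```python
import numpy as np

# A 1-D array: order='F' changes nothing, both contiguity flags are set.
row = np.array([5, 7, 6, 5], dtype=float, order='F')
print(row.flags.c_contiguous, row.flags.f_contiguous)

# The same 2-D values stored row-major (C) and column-major (Fortran).
a_c = np.array([[1, 2], [3, 4], [5, 6]], dtype=float, order='C')
a_f = np.array([[1, 2], [3, 4], [5, 6]], dtype=float, order='F')

# order='K' reads the elements in the order they occur in memory.
memory_c = a_c.ravel(order='K')  # rows are stored one after another
memory_f = a_f.ravel(order='K')  # columns are stored one after another
print(memory_c)
print(memory_f)
```

A Fortran routine that declares a(lda,*) expects the second layout, with lda equal to the number of rows of the NumPy array.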
Towards a solution
My first step would be to correct the C header interface. Next, I would declare ax as a proper Numpy array with two dimensions, using Fortran ordering, and then cast it to the appropriate data type, as in this example:
#file: test.py
import numpy as np
import cffi as cf
ffi=cf.FFI()
lib=ffi.dlopen("./f01adf.dll")
ffi.cdef("""void f01adf_ (const int *n, double a[], const int *lda, int *ifail);""")
# integers
nx = 4
n = ffi.new('const int*', nx)
lda = nx + 1
lda = ffi.new('const int*', lda)
ifail = 0
ifail = ffi.new('int*', ifail)
# matrix to be inversed
ax = np.array([[5, 7, 6, 5],
               [7, 10, 8, 7],
               [6, 8, 10, 9],
               [5, 7, 9, 10],
               [0, 0, 0, 0]], dtype=float, order='F')
# operation on matrix using dll
print("BEFORE:")
print(ax.astype(int))
a = ffi.cast("double* ", ax.__array_interface__['data'][0])
lib.f01adf_(n, a, lda, ifail)
print("\nAFTER:")
print(ax.astype(int))
For testing purposes, consider the following Fortran subroutine that has the same interface as your actual dll as a substitute for your dll. It will simply add 10**(i-1) to the i'th column of input array a. This will allow checking that the interface between Python and Fortran works as intended, and that the intended elements of array a are operated on:
!file: f01adf.f90
Subroutine f01adf(n, a, lda, ifail)
  Integer, Intent (In) :: n, lda
  Integer, Intent (Inout) :: ifail
  Real(Kind(1.d0)), Intent (Inout) :: a(lda,*)
  Integer :: i
  print *, "Fortran DLL says: Hello world!"
  If ((n < 1) .or. (lda < n+1)) Then
    ! Input variables not conforming to requirements
    ifail = 2
  Else
    ! Input variables acceptable
    ifail = 0
    ! add 10**(i-1) to the i'th column of 2d array 'a'
    Do i = 1, n
      a(:, i) = a(:, i) + 10**(i-1)
    End Do
  End If
End Subroutine
Compiling the Fortran code, and then running the suggested Python script, gives me the following output:
> gfortran -O3 -shared -fPIC -fcheck=all -Wall -Wextra -std=f2008 -o f01adf.dll f01adf.f90
> python test.py
BEFORE:
[[ 5  7  6  5]
 [ 7 10  8  7]
 [ 6  8 10  9]
 [ 5  7  9 10]
 [ 0  0  0  0]]
Fortran DLL says: Hello world!
AFTER:
[[   6   17  106 1005]
 [   8   20  108 1007]
 [   7   18  110 1009]
 [   6   17  109 1010]
 [   1   10  100 1000]]

Cython Typing List of Strings

I'm trying to use cython to improve the performance of a loop, but I'm running
into some issues declaring the types of the inputs.
How do I include a field in my typed struct which is a string that can be
either 'front' or 'back'?
I have a np.recarray that looks like the following (note the length of the
recarray is unknown at compile time):
import numpy as np
weights = np.recarray(4, dtype=[('a', np.int64), ('b', np.str_, 5), ('c', np.float64)])
weights[0] = (0, "front", 0.5)
weights[1] = (0, "back", 0.5)
weights[2] = (1, "front", 1.0)
weights[3] = (1, "back", 0.0)
as well as inputs of a list of strings and a pandas.Timestamp
import pandas as pd
ts = pd.Timestamp("2015-01-01")
contracts = ["CLX16", "CLZ16"]
I am trying to cythonize the following loop
def ploop(weights, contracts, timestamp):
    cwts = []
    for gen_num, position, weighting in weights:
        if weighting != 0:
            if position == "front":
                cntrct_idx = gen_num
            elif position == "back":
                cntrct_idx = gen_num + 1
            else:
                raise ValueError("transition.columns must contain "
                                 "'front' or 'back'")
            cwts.append((gen_num, contracts[cntrct_idx], weighting, timestamp))
    return cwts
My attempt involved typing the weights input as a struct in cython,
in a file struct_test.pyx as follows
import numpy as np
cimport numpy as np
cdef packed struct tstruct:
    np.int64_t gen_num
    char[5] position
    np.float64_t weighting
def cloop(tstruct[:] weights_array, contracts, timestamp):
    cdef tstruct weights
    cdef int i
    cdef int cntrct_idx
    cwts = []
    for k in xrange(len(weights_array)):
        w = weights_array[k]
        if w.weighting != 0:
            if w.position == "front":
                cntrct_idx = w.gen_num
            elif w.position == "back":
                cntrct_idx = w.gen_num + 1
            else:
                raise ValueError("transition.columns must contain "
                                 "'front' or 'back'")
            cwts.append((w.gen_num, contracts[cntrct_idx], w.weighting,
                         timestamp))
    return cwts
But I am receiving runtime errors, which I believe are related to the
char[5] position.
import pyximport
pyximport.install()
import struct_test
struct_test.cloop(weights, contracts, ts)
ValueError: Does not understand character buffer dtype format string ('w')
In addition I am a bit unclear how I would go about typing contracts as well
as timestamp.
Your ploop (without the timestamp variable) produces:
In [226]: ploop(weights, contracts)
Out[226]: [(0, 'CLX16', 0.5), (0, 'CLZ16', 0.5), (1, 'CLZ16', 1.0)]
Equivalent function without a loop:
def ploopless(weights, contracts):
    arr_contracts = np.array(contracts) # to allow array indexing
    wgts1 = weights[weights['c']!=0]
    mask = wgts1['b']=='front'
    wgts1['b'][mask] = arr_contracts[wgts1['a'][mask]]
    mask = wgts1['b']=='back'
    wgts1['b'][mask] = arr_contracts[wgts1['a'][mask]+1]
    return wgts1.tolist()
In [250]: ploopless(weights, contracts)
Out[250]: [(0, 'CLX16', 0.5), (0, 'CLZ16', 0.5), (1, 'CLZ16', 1.0)]
I'm taking advantage of the fact that the returned list of tuples has the same (int, str, float) layout as the input weights array. So I'm just making a copy of weights and replacing selected values of the b field.
Note that I use the field selection index before the mask one. The boolean mask produces a copy, so we have to be careful about indexing order.
I'm guessing that the loop-less array version will be competitive in time with cloop (on realistic arrays). The string and list operations in cloop probably limit its speedup.
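The point about indexing order can be seen in a small, self-contained sketch (the array contents are made up for illustration). Selecting the field first yields a view, so the masked assignment reaches the original array; applying the boolean mask first yields a copy, so the assignment is silently lost.

```python
import numpy as np

arr = np.zeros(3, dtype=[('a', np.int64), ('b', np.str_, 5)])
arr['b'] = ['front', 'back', 'front']
mask = arr['b'] == 'front'

# Mask first: arr[mask] is a temporary copy, so `arr` is not modified.
arr[mask]['b'] = 'CLX16'
after_mask_first = arr['b'].tolist()

# Field first: arr['b'] is a view into `arr`, so the assignment sticks.
arr['b'][mask] = 'CLX16'
after_field_first = arr['b'].tolist()

print(after_mask_first)
print(after_field_first)
```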

how to resize and subtract numpy arrays in c++

I have two 3D NumPy arrays in Python with different heights and widths. I want to pass them to my C extension. How can I resize and subtract them in C++? Please see the comments in the code.
static PyObject *my_func(PyObject *self, PyObject *args)
{
    Py_Initialize();
    import_array();
    PyObject *arr1;
    PyObject *arr2;
    if(!PyArg_ParseTuple(args, "OO", &arr1, &arr2))
    {
        return NULL;
    }
    //How can I do this?
    //resize arr1 to [100, 100, 3]
    //resize arr2 to [100, 100, 3]
    //res = arr1 - arr2
    //return res
}
Start by making the desired shape. It's easier to do this as a tuple than a list:
PyObject* shape = Py_BuildValue("iii",100,100,3);
Check this against NULL to ensure no error has occurred, and handle it if one has.
You can call the numpy resize function on both arrays to resize them. Unless you are certain that the data isn't shared then you need to call numpy.resize rather than the .resize method of the arrays. This involves importing the module and getting the resize attribute:
PyObject* np = PyImport_ImportModule("numpy");
PyObject* resize = PyObject_GetAttrString(np,"resize");
PyObject* resize_result = PyObject_CallFunctionObjArgs(resize,arr1, shape,NULL);
I've omitted all the error checking, which you should do after each stage.
Make sure you decrease the reference counts on the various PyObjects once you don't need them any more.
Use PyNumber_Subtract to do the subtraction (do it on the result from resize).
Addition: A shortcut for calling resize that should avoid most of the intermediates:
PyObject* np = PyImport_ImportModule("numpy");
// error check against null
PyObject* resize_result = PyObject_CallMethod(np,"resize","O(iii)",arr1,100,100,3);
(The "(iii)" creates the shape tuple rather than needing to do it separately.)
If you are certain that arr1 and arr2 are the only owners of the data then you can call the numpy .resize method either by the normal C API function calls or the specific numpy function PyArray_Resize.
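Since the C calls above simply drive NumPy's Python-level functions, their effect can be prototyped in Python first. This sketch uses small made-up inputs. Note that numpy.resize fills a larger target by repeating the input data and does not interpolate, so whether it matches the intended meaning of "resize" is an assumption to verify.

```python
import numpy as np

# Made-up inputs with different heights and widths.
arr1 = np.arange(2 * 3 * 3, dtype=np.float64).reshape(2, 3, 3)
arr2 = np.ones((4, 5, 3), dtype=np.float64)

shape = (100, 100, 3)
# Python-level equivalent of calling numpy.resize via PyObject_CallMethod.
resized1 = np.resize(arr1, shape)
resized2 = np.resize(arr2, shape)
# Python-level equivalent of PyNumber_Subtract(resized1, resized2).
res = resized1 - resized2

# numpy.resize repeats the flattened input to fill the new shape.
repeated = np.resize(np.array([1, 2, 3]), (2, 3))
print(res.shape)
print(repeated)
```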