A Good Way to Expose CuPy MemoryPointer in C/C++?

NumPy provides well-defined C APIs so that one can easily handle NumPy arrays in C/C++ space. For example, if I have a C function that takes C arrays (pointers) as arguments, I can just #include <numpy/arrayobject.h> and pass a NumPy array to it by accessing its data member (or by using the C API PyArray_DATA).
Recently I wanted to achieve the same for CuPy, but I cannot find a header file that I can include. To be specific, my goal is as follows:
I have some CUDA kernels and their callers written in C/C++. The callers run on the host but take handles of memory buffers on the device as arguments. The computed results of the callers are also stored on the device.
I want to wrap the callers into Python functions so that I can control from Python when data is transferred from device to host. That means I have to wrap the resulting device memory pointers in Python objects. CuPy's ndarray is the best choice I can think of.
I can't use CuPy's user-defined-kernel mechanism because the functions I want to wrap are not plain CUDA kernels. They must contain host code.
Currently, I've found a workaround. I write the Python functions in Cython; they take CuPy arrays as inputs and return CuPy arrays. I then cast the .data.ptr attribute to C's size_t type, and further cast it to whatever pointer type I need. Example code follows.
Example Code
//kernel.cu
__global__ void vecSumKernel(float *A, float *B, float *C, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        C[i] = A[i] + B[i];
}
// This is the C function I want to wrap into Python.
// Notice it does not allocate any memory on device. I want that to be done by cupy.
extern "C" void vecSum(float *A_d, float *B_d, float *C_d, int n) {
    if (n <= 0) return;  // nothing to do; also avoids a zero-sized block below
    int threadsPerBlock = 512;
    if (threadsPerBlock > n) threadsPerBlock = n;
    // Integer ceiling division: exact for any n, unlike rounding through float.
    int nBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecSumKernel<<<nBlocks, threadsPerBlock>>>(A_d, B_d, C_d, n);
}
//kernel.h
#ifndef KERNEL_H_
#define KERNEL_H_
void vecSum(float *A_d, float *B_d, float *C_d, int n);
#endif
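The grid size in vecSum is a ceiling division of n by the block size. The host-side sketch below (plain Python, no GPU needed) compares a float-based form such as (int)ceilf((float)n / (float)threadsPerBlock) with integer arithmetic. The helper names are made up for illustration, and the float32 rounding is emulated with struct, which assumes IEEE-754 single precision for a C float.

```python
import math
import struct

def to_float32(x):
    """Round a number to the nearest IEEE-754 single, as a C (float) cast would."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

def blocks_via_float(n, threads):
    # Mirrors (int)ceilf((float)n / (float)threads)
    quotient = to_float32(to_float32(n) / to_float32(threads))
    return int(math.ceil(quotient))

def blocks_via_int(n, threads):
    # Integer ceiling division, exact for any non-negative n
    return (n + threads - 1) // threads

n = 2**24 + 1  # first integer that a float32 cannot represent exactly
float_blocks = blocks_via_float(n, 512)
int_blocks = blocks_via_int(n, 512)
print(float_blocks, int_blocks)
```

For this n the float form comes out one block short, so the last element would never be processed; for small arrays both forms agree.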
# test_module.pyx
import cupy as cp
import numpy as np
cdef extern from "kernel.h":
    void vecSum(float *A_d, float *B_d, float *C_d, int n)
cdef vecSum_wrapper(size_t aPtr, size_t bPtr, size_t cPtr, int n):
    # here the Python int -- cp.ndarray.data.ptr -- is first cast to size_t,
    # and then cast to (float *).
    vecSum(<float*>aPtr, <float*>bPtr, <float*>cPtr, n)
# This is the Python function I want to use
# a, b are cupy arrays
def vec_sum(a, b):
    a_ptr = a.data.ptr
    b_ptr = b.data.ptr
    n = a.shape[0]
    output = cp.empty(shape=(n,), dtype=a.dtype)
    c_ptr = output.data.ptr
    vecSum_wrapper(a_ptr, b_ptr, c_ptr, n)
    return output
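The cast chain in vecSum_wrapper (Python int, then size_t, then float*) can be tried out on host memory with ctypes. This is only an analogy: buf stands in for a device allocation and addr for a.data.ptr, and nothing here touches the GPU.

```python
import ctypes

# Host buffer standing in for a device allocation
buf = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)

# A plain Python int, playing the role of a.data.ptr
addr = ctypes.addressof(buf)

# int -> void* -> float*, the same chain as <float*>aPtr in Cython
ptr = ctypes.cast(ctypes.c_void_p(addr), ctypes.POINTER(ctypes.c_float))

values = [ptr[i] for i in range(4)]

# Writing through the pointer changes the original buffer
ptr[0] = 10.0
first = buf[0]
print(values, first)
```

The integer carries no ownership: ptr is valid only while buf is alive, and the same holds for .data.ptr and the CuPy array it came from.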
Compile and Run
To compile, one can first compile kernel.cu into a static library, say, libVecSum. Then use Cython to compile test_module.pyx into test_module.c, and build the Python extension as usual.
# setup.py
from setuptools import Extension, setup
ext_module = Extension(
    "cupyExt.test_module",
    sources=["cupyExt/test_module.c"],
    library_dirs=["cupyExt/"],
    libraries=['libVecSum', 'cudart'])
setup(
    name="cupyExt",
    version="0.0.0",
    ext_modules = [ext_module],
)
It seems to work.
>>> import cupy as cp
>>> from cupyExt import test_module
>>> a = cp.ones(5, dtype=cp.float32) * 3
>>> b = cp.arange(5, dtype=cp.float32)
>>> c = test_module.vec_sum(a, b)
>>> print(c.device)
<CUDA Device 0>
>>> print(c)
[3. 4. 5. 6. 7.]
Any better ways?
I am not sure if this way is memory safe. I also feel the casting from .data.ptr to C pointers is not good. I want to know people's thoughts and comments on this.
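One part of the memory-safety concern is that vec_sum trusts its inputs: the C side reads n floats from each pointer regardless of what the arrays actually hold. Below is a minimal sketch of checks that could run before the pointers are taken. check_vec_sum_inputs and fake_array are hypothetical names; fake_array only mimics the dtype, ndim, shape and flags.c_contiguous attributes of an array so that the example runs without a GPU.

```python
from types import SimpleNamespace

def check_vec_sum_inputs(a, b):
    """Validate two array-like inputs and return the element count."""
    for name, arr in (("a", a), ("b", b)):
        if str(arr.dtype) != "float32":
            raise TypeError(name + " must be float32, got " + str(arr.dtype))
        if arr.ndim != 1:
            raise ValueError(name + " must be one-dimensional")
        if not arr.flags.c_contiguous:
            raise ValueError(name + " must be C-contiguous")
    if a.shape != b.shape:
        raise ValueError("a and b must have the same length")
    return a.shape[0]

def fake_array(n, dtype="float32", contiguous=True):
    # Stand-in exposing only the attributes the check looks at
    return SimpleNamespace(dtype=dtype, ndim=1, shape=(n,),
                           flags=SimpleNamespace(c_contiguous=contiguous))

def rejected(a, b):
    try:
        check_vec_sum_inputs(a, b)
    except (TypeError, ValueError):
        return True
    return False

ok_length = check_vec_sum_inputs(fake_array(5), fake_array(5))
print(ok_length, rejected(fake_array(5), fake_array(4)))
```

In vec_sum, a call like this would go before a.data.ptr is read, and the returned length would replace n = a.shape[0].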

Related

Cython passing int numpy array to C++

First, I know this question appears similar to this one but they are different. I'm struggling trying to pass int (int32) numpy array to C++ via Cython without copying. The files:
doit.cpp:
#include "doit.h"
void run(int *x) {}
doit.h:
#ifndef _DOIT_H_
#define _DOIT_H_
void run(int *);
#endif
q.pyx:
cimport numpy as np
import numpy as np
cdef extern from "doit.h":
    void run(int* X)
def pyrun(np.ndarray[np.int_t, ndim=1] X):
    X = np.ascontiguousarray(X)
    run(&X[0])
I compile with Cython. The error is:
Error compiling Cython file:
------------------------------------------------------------
...
cdef extern from "doit.h":
void run(int* X)
def pyrun(np.ndarray[np.int_t, ndim=1] X):
X = np.ascontiguousarray(X)
run(&X[0])
^
------------------------------------------------------------
py_cpp/q.pyx:9:8: Cannot assign type 'int_t *' to 'int *'
However, if I replace all occurrences of int with double (e.g. int *x with double *x, int_t with double_t), then all errors are gone.
How to solve the problem? Thanks in advance.
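The element types behind the error can be compared from Python. In Cython's numpy.pxd, np.int_t is a typedef for C long, while run expects int*; np.double_t is a typedef for double, which is why the double variant compiles. The sketch below uses ctypes to report the sizes on the current platform. Declaring the buffer with a type that matches the C signature (for example np.int32_t together with int32 input) is one way to make the pointer types agree; treat that as a suggestion to verify, not a confirmed fix.

```python
import ctypes

sizes = {
    "int": ctypes.sizeof(ctypes.c_int),
    "long": ctypes.sizeof(ctypes.c_long),
    "double": ctypes.sizeof(ctypes.c_double),
}

# int* and long* are distinct pointer types in C even where the widths agree
same_width = sizes["int"] == sizes["long"]
print(sizes, "int and long have the same width:", same_width)
```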

pycuda - memcpy_dtoh, not giving what appears to have been set

I have a very simple function where I'm passing in a char array and doing a simple character match. I want to return an array of 1/0 depending on which characters are matched.
Problem: although I can see the value has been set in the data structure (as I print it in the function after it's assigned) when the int array is copied back from the device the values aren't as expected.
I'm sure it's something silly.
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
mod = SourceModule("""
__global__ void test(const char *q, const int chrSize, int *d, const int intSize) {
    int v = 0;
    if( q[threadIdx.x * chrSize] == 'a' || q[threadIdx.x * chrSize] == 'c' ) {
        v = 1;
    }
    d[threadIdx.x * intSize] = v;
    printf("x=%d, y=%d, val=%c ret=%d\\n", threadIdx.x, threadIdx.y, q[threadIdx.x * chrSize], d[threadIdx.x * intSize]);
}
""")
func = mod.get_function("test")
# input data
a = np.asarray(['a','b','c','d'], dtype=np.str_)
# allocate/copy to device
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
# destination array
d = np.zeros((4), dtype=np.int16)
# allocate/copy to device
d_gpu = cuda.mem_alloc(d.nbytes)
cuda.memcpy_htod(d_gpu, d)
# run the function
func(a_gpu, np.int8(a.dtype.itemsize), d_gpu, np.int8(d.dtype.itemsize), block=(4,1,1))
# copy data back and print
cuda.memcpy_dtoh(d, d_gpu)
print(d)
Output:
x=0, y=0, val=a ret=1
x=1, y=0, val=b ret=0
x=2, y=0, val=c ret=1
x=3, y=0, val=d ret=0
[1 0 0 0]
Expected output:
x=0, y=0, val=a ret=1
x=1, y=0, val=b ret=0
x=2, y=0, val=c ret=1
x=3, y=0, val=d ret=0
[1 0 1 0]
You have two main problems, neither of which has anything to do with memcpy_dtoh:
You have declared d and d_gpu as dtype np.int16, but the kernel is expecting C++ int, leading to a type mismatch. You should use the np.int32 type to define the arrays.
The indexing of d within the kernel is incorrect. If you have declared the array to the compiler as a 32-bit type, indexing the array as d[threadIdx.x] automatically applies the correct element stride for the type. Passing intSize to the kernel and using it to index d is not required, and it is incorrect to do so.
If you fix those two issues, I suspect the code will work as intended.
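The [1 0 0 0] result can be reproduced on the host by modelling device memory as a byte buffer. In the sketch below, read_back is a hypothetical helper: it assumes a little-endian layout and a 4-byte C int, and the 64 padding bytes stand in for whatever lies past the real allocation.

```python
import struct

def read_back(values, host_itemsize, index_stride):
    """Write 4-byte ints at d[tid * index_stride], then read the buffer
    back as len(values) host elements of host_itemsize bytes each."""
    alloc_bytes = len(values) * host_itemsize
    memory = bytearray(alloc_bytes + 64)  # padding catches out-of-bounds writes
    for tid, v in enumerate(values):
        # The kernel sees d as int*, so index k lands at byte offset 4 * k
        struct.pack_into("<i", memory, 4 * tid * index_stride, v)
    code = {2: "h", 4: "i"}[host_itemsize]
    return list(struct.unpack_from("<%d%s" % (len(values), code), memory, 0))

matches = [1, 0, 1, 0]  # what the kernel computes for 'a', 'b', 'c', 'd'

# As posted: int16 host array, kernel indexes with intSize = 2
as_posted = read_back(matches, host_itemsize=2, index_stride=2)

# With both fixes: int32 host array, kernel indexes d[threadIdx.x]
as_fixed = read_back(matches, host_itemsize=4, index_stride=1)
print(as_posted, as_fixed)
```

In the posted form only thread 0 writes inside the 8-byte allocation; threads 1 to 3 write at byte offsets 8, 16 and 24, past its end, so the copy back never sees them.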

Why does Cython's in-place division of NumPy arrays use conversion to Python floats?

I tried to normalize a vector stored as numpy array, but cython -a shows unexpected conversions to Python values in this code.
Minimal example:
import numpy as np
cimport cython
cimport numpy as np
#cython.wraparound(False)
#cython.boundscheck(False)
cdef vec_diff(np.ndarray[double, ndim=1] vec1, double m):
    vec1/=m
    return vec1
Cython 0.29.6 run with the -a option generates the following code for the line vec1/=m:
__pyx_t_1 = PyFloat_FromDouble(__pyx_v_m); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 8, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
__pyx_t_2 = __Pyx_PyNumber_InPlaceDivide(((PyObject *)__pyx_v_vec1), __pyx_t_1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 8, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
__Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
if (!(likely(((__pyx_t_2) == Py_None) || likely(__Pyx_TypeTest(__pyx_t_2, __pyx_ptype_5numpy_ndarray))))) __PYX_ERR(0, 8, __pyx_L1_error)
__pyx_t_3 = ((PyArrayObject *)__pyx_t_2);
{
__Pyx_BufFmt_StackElem __pyx_stack[1];
__Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_vec1.rcbuffer->pybuffer);
__pyx_t_4 = __Pyx_GetBufferAndValidate(&__pyx_pybuffernd_vec1.rcbuffer->pybuffer, (PyObject*)__pyx_t_3, &__Pyx_TypeInfo_double, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack);
if (unlikely(__pyx_t_4 < 0)) {
PyErr_Fetch(&__pyx_t_5, &__pyx_t_6, &__pyx_t_7);
if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_vec1.rcbuffer->pybuffer, (PyObject*)__pyx_v_vec1, &__Pyx_TypeInfo_double, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack) == -1)) {
Py_XDECREF(__pyx_t_5); Py_XDECREF(__pyx_t_6); Py_XDECREF(__pyx_t_7);
__Pyx_RaiseBufferFallbackError();
} else {
PyErr_Restore(__pyx_t_5, __pyx_t_6, __pyx_t_7);
}
__pyx_t_5 = __pyx_t_6 = __pyx_t_7 = 0;
}
__pyx_pybuffernd_vec1.diminfo[0].strides = __pyx_pybuffernd_vec1.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_vec1.diminfo[0].shape = __pyx_pybuffernd_vec1.rcbuffer->pybuffer.shape[0];
if (unlikely(__pyx_t_4 < 0)) __PYX_ERR(0, 8, __pyx_L1_error)
}
__pyx_t_3 = 0;
__Pyx_DECREF_SET(__pyx_v_vec1, ((PyArrayObject *)__pyx_t_2));
__pyx_t_2 = 0;
where the first line __pyx_t_1 = PyFloat_FromDouble(__pyx_v_m); has PyFloat_FromDouble highlighted in dark red.
Given that I have told cython that the array contains double values, why does it have to convert to a python float?
Note: Memoryviews do not support the /= operation (would require a loop)
Because this isn't something that Cython does anything special for or optimises at all. All it's doing is calling __Pyx_PyNumber_InPlaceDivide on the Numpy array, which calls the Numpy array's in-place division operator (__itruediv__ on Python 3, __idiv__ on Python 2).
Since it's calling a Python operator it needs to pass a Python object as the second argument, and hence it needs to convert your double to a Python float.
The Numpy in-place division operator is almost certainly written in C, so it is likely to be pretty fast (although there is a little overhead in calling it), and there's not a lot of value in Cython doing anything except delegating to Numpy's code.
Memoryviews don't define the whole-array operators (they're just ways to access memory so don't make any claims about meaningful mathematical operations) and hence the fact that it doesn't work is consistent with how Cython deals with these operators.
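The dispatch described above can be observed in plain Python 3. Vec is a made-up class that records the type of the operand its in-place division method receives; it stands in for the NumPy array and is not how NumPy implements the operator.

```python
class Vec:
    """Tiny stand-in for an array type with an in-place division operator."""

    def __init__(self, data):
        self.data = list(data)
        self.operand_type = None

    def __itruediv__(self, other):
        # Python hands the right-hand side over as an ordinary object
        self.operand_type = type(other)
        self.data = [x / other for x in self.data]
        return self

v = Vec([2.0, 4.0])
same_object = v
v /= 2.0  # calls Vec.__itruediv__(v, 2.0)
print(v.data, v.operand_type, v is same_object)
```

The operator only ever sees a Python object, which is why a C double has to be boxed with PyFloat_FromDouble before the call.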

Link Cython-wrapped C functions against BLAS from NumPy

Inside a Cython extension, I want to use some C functions defined in .c files that use BLAS subroutines, e.g.
cfile.c
double ddot(int *N, double *DX, int *INCX, double *DY, int *INCY);
double call_ddot(double* a, double* b, int n){
    int one = 1;
    return ddot(&n, a, &one, b, &one);
}
(Let’s say the functions do more than just call one BLAS subroutine)
pyfile.pyx
cimport numpy as np
import numpy as np
cdef extern from "cfile.c":
    double call_ddot(double* a, double* b, int n)
def pyfun(np.ndarray[double, ndim=1] a):
    return call_ddot(&a[0], &a[0], <int> a.shape[0])
setup.py:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import numpy
setup(
    name = "wrapped_cfun",
    packages = ["wrapped_cfun"],
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("wrapped_cfun.cython_part", sources=["pyfile.pyx"], include_dirs=[numpy.get_include()])]
)
I want this package to link against the same BLAS library that the installed NumPy or SciPy are using, and would like it to be installable from PIP under different operating systems using numpy or scipy as dependencies, without any additional BLAS-related dependency.
Is there any hack for setup.py that would allow me to accomplish this, in a way that it could work with any BLAS implementation?
Update:
With MKL, I can make it work by modifying the Extension object to point to libmkl_rt, which can be extracted from numpy if MKL is installed, e.g.:
Extension("wrapped_cfun.cython_part", sources=["pyfile.pyx"], include_dirs=[numpy.get_include()], extra_link_args=["-L{path to python's lib dir}", "-l:libmkl_rt.{so, dll, dylib}"])
However, the same trick does not work for OpenBLAS (e.g. -l:libopenblasp-r0.2.20.so). Pointing to libblas.{so,dll,dylib} will not work if that file is a link to libopenblas, but works fine if it's a link to libmkl_rt.
Update 2:
It seems OpenBLAS names its C functions with an underscore at the end, e.g. not ddot but ddot_. The code above with -l:libopenblas will work if I change ddot to ddot_ in the .c file. I'm still wondering if there is some (ideally run-time) mechanism to detect which name should be used in the C file.
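One possible run-time check for the naming question is to open the library and probe for each spelling. The sketch below relies on ctypes attribute lookup, which raises AttributeError for a missing symbol. resolve_symbol is a hypothetical helper, and the stand-in objects replace a real BLAS handle (something like ctypes.CDLL(path)) so that the example runs anywhere. It only discovers the name from Python; it does not by itself change what the C file links against.

```python
import ctypes
from types import SimpleNamespace

def resolve_symbol(lib, base_name):
    """Return (name, symbol) for the first spelling of base_name that lib exports."""
    for candidate in (base_name, base_name + "_"):
        try:
            return candidate, getattr(lib, candidate)
        except AttributeError:
            continue
    raise LookupError("no symbol named %s or %s_" % (base_name, base_name))

# Stand-ins for two BLAS builds with different naming conventions
plain_style = SimpleNamespace(ddot=object())
underscore_style = SimpleNamespace(ddot_=object())

name_plain, _ = resolve_symbol(plain_style, "ddot")
name_underscore, _ = resolve_symbol(underscore_style, "ddot")

# The same lookup works on a real shared-library handle
name_real, _ = resolve_symbol(ctypes.pythonapi, "Py_GetVersion")
print(name_plain, name_underscore, name_real)
```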
An alternative to depending on the linker/loader to provide the right BLAS functionality would be to emulate resolution of the necessary BLAS symbols (e.g. ddot) and to use the wrapped BLAS function provided by scipy at run time.
I'm not sure this approach is superior to the "normal way" of building, but I wanted to bring it to your attention, if only because I find it interesting.
The idea in a nutshell:
Define an explicit function pointer to the ddot functionality, called my_ddot in the snippet below.
Use the my_ddot pointer wherever you would otherwise use ddot.
Initialize the my_ddot pointer with the functionality provided by scipy when the Cython module is loaded.
Here is a working prototype (I use verbatim C code to make the snippet standalone and easily testable in a Jupyter notebook; I trust you to transform it into the format you need):
%%cython
# h-file:
cdef extern from *:
    """
    // blas-functionality,
    // will be initialized by cython when module is loaded:
    typedef double (*ddot_t)(int *N, double *DX, int *INCX, double *DY, int *INCY);
    extern ddot_t my_ddot;
    double call_ddot(double* a, double* b, int n);
    """
    ctypedef double (*ddot_t)(int *N, double *DX, int *INCX, double *DY, int *INCY)
    ddot_t my_ddot
    double call_ddot(double* a, double* b, int n)
# init the functions of the c-library
# with blas-function provided by scipy
from scipy.linalg.cython_blas cimport ddot
my_ddot=ddot
# a simple function to demonstrate, that it works
def ddot_mult(double[:]a, double[:]b):
    cdef int n=len(a)
    return call_ddot(&a[0], &b[0], n)
#-------------------------------------------------
# c-file, added so the example is complete
cdef extern from *:
    """
    ddot_t my_ddot;
    double call_ddot(double* a, double* b, int n){
        int one = 1;
        return my_ddot(&n, a, &one, b, &one);
    }
    """
    pass
And now ddot_mult can be used:
import numpy as np
a=np.arange(4, dtype=float)
ddot_mult(a,a) # 14.0 as expected!
An advantage of this approach is that there is no hassle with distutils and you are guaranteed to use the same BLAS functionality as scipy.
Another perk: one could switch the engine in use (MKL, OpenBLAS or even one's own implementation) at run time without the need to recompile or relink.
On the other hand, there is some additional boilerplate code, and also the danger that the initialization of some symbols will be forgotten.
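The trade-off can be mimicked in plain Python: a slot that must be filled before use and that can be re-pointed at run time. DdotSlot is a made-up name playing the role of the my_ddot pointer, and the two implementations stand in for different BLAS engines.

```python
class DdotSlot:
    """Holds the currently selected dot-product implementation."""

    def __init__(self):
        self._impl = None

    def bind(self, fn):
        self._impl = fn

    def __call__(self, a, b):
        if self._impl is None:
            raise RuntimeError("my_ddot was never initialized")
        return self._impl(a, b)

def ddot_loop(a, b):
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def ddot_sum(a, b):
    return float(sum(x * y for x, y in zip(a, b)))

my_ddot = DdotSlot()

# Forgetting the initialization is detected here; in C a NULL function
# pointer would typically crash instead of raising.
try:
    my_ddot([1.0], [1.0])
    forgot_init = False
except RuntimeError:
    forgot_init = True

vec = [0.0, 1.0, 2.0, 3.0]
my_ddot.bind(ddot_loop)
first = my_ddot(vec, vec)
my_ddot.bind(ddot_sum)  # switch engines without rebuilding anything
second = my_ddot(vec, vec)
print(forgot_init, first, second)
```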
I've finally figured out an ugly hack for this. I'm not sure if it will always work, but at least it works for combinations of Windows (mingw and Visual Studio), Linux, MKL and OpenBLAS. I'd still like to know if there are better alternatives, but if not, this will do:
Edit: Corrected for visual studio now
Modify C files to account for names with underscores (do it for each BLAS function that is called) - need to declare each function twice and add an if for each one
double ddot_(int *N, double *DX, int *INCX, double *DY, int *INCY);
#define ddot(N, DX, INCX, DY, INCY) ddot_(N, DX, INCX, DY, INCY)
void daxpy_(int *N, double *DA, double *DX, int *INCX, double *DY, int *INCY);
#define daxpy(N, DA, DX, INCX, DY, INCY) daxpy_(N, DA, DX, INCX, DY, INCY)
... etc
Extract library path from NumPy or SciPy and add it to the link arguments.
Detect if the compiler to be used is visual studio, in which case the linking arguments are quite different.
setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import numpy
from sys import platform
import os
try:
    blas_path = numpy.distutils.system_info.get_info('blas')['library_dirs'][0]
except:
    if "library_dirs" in numpy.__config__.blas_mkl_info:
        blas_path = numpy.__config__.blas_mkl_info["library_dirs"][0]
    elif "library_dirs" in numpy.__config__.blas_opt_info:
        blas_path = numpy.__config__.blas_opt_info["library_dirs"][0]
    else:
        raise ValueError("Could not locate BLAS library.")
if platform[:3] == "win":
    if os.path.exists(os.path.join(blas_path, "mkl_rt.lib")):
        blas_file = "mkl_rt.lib"
    elif os.path.exists(os.path.join(blas_path, "mkl_rt.dll")):
        blas_file = "mkl_rt.dll"
    else:
        import re
        blas_file = [f for f in os.listdir(blas_path) if bool(re.search("blas", f))]
        if len(blas_file) == 0:
            raise ValueError("Could not locate BLAS library.")
        blas_file = blas_file[0]
elif platform[:3] == "dar":
    blas_file = "libblas.dylib"
else:
    blas_file = "libblas.so"
## https://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
class build_ext_subclass( build_ext ):
    def build_extensions(self):
        compiler = self.compiler.compiler_type
        if compiler == 'msvc': # visual studio
            for e in self.extensions:
                e.extra_link_args += [os.path.join(blas_path, blas_file)]
        else: # gcc
            for e in self.extensions:
                e.extra_link_args += ["-L"+blas_path, "-l:"+blas_file]
        build_ext.build_extensions(self)
setup(
    name = "wrapped_cfun",
    packages = ["wrapped_cfun"],
    cmdclass = {'build_ext': build_ext_subclass},
    ext_modules = [Extension("wrapped_cfun.cython_part", sources=["pyfile.pyx"], include_dirs=[numpy.get_include()], extra_link_args=[])]
)
As yet another alternative with more recent Cython versions, one can create a "public" Cython function (which will be made available to C code and auto-generate a public header) that would simply call the corresponding BLAS function:
from scipy.linalg.cython_blas cimport ddot
cdef public double ddot_(int *n, double *x, int *ldx, double *y, int *ldy):
    return ddot(n, x, ldx, y, ldy)
Then one simply declares it in the C code or includes the header, and the rest of the Cython extension builder will take care of linkage:
extern double ddot_(int *n, double *x, int *ldx, double *y, int *ldy);

OpenCL Performance Optimization

I have started learning OpenCL and I currently try to test how much I can improve performance for a simple skeletal animation algorithm. To do this I have written a program that performs skeletal animation from randomly generated vertices and transformation matrices twice, once with an SSE-optimized linear algebra library in plain C++, and once using my own OpenCL kernel on GPU (I'm testing on an Nvidia GTX 460).
I started off with a simple kernel where each work-item transforms exactly one vertex, with all values read from global memory. Because I was not satisfied with the performance of this kernel, I tried to optimize a little. My current kernel looks like this:
inline float4 MultiplyMatrixVector(float16 m, float4 v)
{
    return (float4) (
        dot(m.s048C, v),
        dot(m.s159D, v),
        dot(m.s26AE, v),
        dot(m.s37BF, v)
    );
}
kernel void skelanim(global const float16* boneMats, global const float4* vertices, global const float4* weights, global const uint4* indices, global float4* resVertices)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    local float16 lBoneMats[NUM_BONES];
    async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0 ; i < NUM_VERTICES_PER_WORK_ITEM ; i++) {
        int vidx = gid*NUM_VERTICES_PER_WORK_ITEM + i;
        float4 vertex = vertices[vidx];
        float4 w = weights[vidx];
        uint4 idx = indices[vidx];
        resVertices[vidx] = (MultiplyMatrixVector(lBoneMats[idx.x], vertex * w.x)
            + MultiplyMatrixVector(lBoneMats[idx.y], vertex * w.y)
            + MultiplyMatrixVector(lBoneMats[idx.z], vertex * w.z)
            + MultiplyMatrixVector(lBoneMats[idx.w], vertex * w.w));
    }
}
Now I process a constant number of vertices per work-item, and I prefetch all the bone matrices into local memory only once for each work-item, which I believed would lead to way better performance because the matrices for multiple vertices could be read from the faster local memory afterwards. Unfortunately, this kernel performs worse than my first attempt, and even worse than the CPU-only implementation.
Why is performance so bad with this should-be optimization?
If it helps, here is how I execute the kernel:
#define NUM_BONES 50
#define NUM_VERTICES 30000
#define NUM_VERTICES_PER_WORK_ITEM 100
#define NUM_ANIM_REPEAT 1000
uint64_t PerformOpenCLSkeletalAnimation(Matrix4* boneMats, Vector4* vertices, float* weights, uint32_t* indices, Vector4* resVertices)
{
    File kernelFile("/home/alemariusnexus/test/skelanim.cl");
    char opts[256];
    sprintf(opts, "-D NUM_VERTICES=%u -D NUM_REPEAT=%u -D NUM_BONES=%u -D NUM_VERTICES_PER_WORK_ITEM=%u", NUM_VERTICES, NUM_ANIM_REPEAT, NUM_BONES, NUM_VERTICES_PER_WORK_ITEM);
    cl_program prog = BuildOpenCLProgram(kernelFile, opts);
    cl_kernel kernel = clCreateKernel(prog, "skelanim", NULL);
    cl_mem boneMatBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_BONES*sizeof(Matrix4), boneMats, NULL);
    cl_mem vertexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*sizeof(Vector4), vertices, NULL);
    cl_mem weightBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(float), weights, NULL);
    cl_mem indexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(uint32_t), indices, NULL);
    cl_mem resVertexBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, NUM_VERTICES*sizeof(Vector4), NULL, NULL);
    uint64_t s, e;
    s = GetTickcount();
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &boneMatBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &vertexBuf);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &weightBuf);
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &indexBuf);
    clSetKernelArg(kernel, 4, sizeof(cl_mem), &resVertexBuf);
    size_t globalWorkSize[] = { NUM_VERTICES / NUM_VERTICES_PER_WORK_ITEM };
    size_t localWorkSize[] = { NUM_BONES };
    for (size_t i = 0 ; i < NUM_ANIM_REPEAT ; i++) {
        clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
    }
    clEnqueueReadBuffer(cq, resVertexBuf, CL_TRUE, 0, NUM_VERTICES*sizeof(Vector4), resVertices, 0, NULL, NULL);
    e = GetTickcount();
    return e-s;
}
I guess there are more things that could be optimized, maybe batching some of the other global reads together, but first I would really like to know why this first optimization didn't work.
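For reference, the launch geometry implied by the defines above works out as follows. This is plain arithmetic on the constants in the question; the variable names are only for illustration, and the 64-byte matrix size assumes a float16 is sixteen 4-byte floats.

```python
NUM_BONES = 50
NUM_VERTICES = 30000
NUM_VERTICES_PER_WORK_ITEM = 100

# globalWorkSize and localWorkSize as set up in the host code
global_work_size = NUM_VERTICES // NUM_VERTICES_PER_WORK_ITEM
local_work_size = NUM_BONES
work_groups = global_work_size // local_work_size

# Local memory staged per work-group: NUM_BONES float16 matrices
bone_bytes_per_group = NUM_BONES * 16 * 4

print(global_work_size, work_groups, bone_bytes_per_group)
```

So each launch runs 300 work-items in 6 work-groups, and every group stages 3200 bytes of bone matrices in local memory.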
Two things are affecting the performance in your exercise.
1) OpenCL C is based on C99, where inline is only a hint: the clcc compiler may ignore the keyword and emit a regular call, or it may inline the function on its own. It is not required to inline.
So it is safer to define your MultiplyMatrixVector as a pre-processor macro, though this is not the major problem in your case.
2) You treat the local memory (the LDM) incorrectly.
Although its latency is several times lower than that of global memory when it is accessed properly, local memory is subject to bank conflicts.
Your vertex index is calculated with stride 100 per work item. The number of banks depends on the GPU in use, but it is usually 16 or 32, i.e. you may access up to 16 (32) four-byte LDM variables in one cycle without penalty if all of them are in different banks. Otherwise you get a bank conflict (two or more threads accessing the same bank), and the accesses are serialized.
The threads in your work group access the array in LDM with no special arrangement to avoid bank conflicts. Moreover, the array elements are float16, i.e. a single element spans all 16 banks (or half of 32 banks). Thus you have a bank conflict in each row of the MultiplyMatrixVector function. The cumulative degree of that conflict is at least 16x32 (here 16 is the number of vector elements you access and 32 is the size of a half wavefront or half warp).
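The float16 point can be illustrated with a simplified bank model. The sketch assumes that consecutive 4-byte words map to consecutive banks and that every thread reads the same component of a different matrix; banks_touched is a hypothetical helper, and real hardware has further rules (such as broadcast when threads read the same address) that this ignores.

```python
def banks_touched(element_indices, component, num_banks, words_per_element=16):
    """Banks hit when each thread reads word `component` of its own float16 element."""
    return sorted({(i * words_per_element + component) % num_banks
                   for i in element_indices})

indices = range(32)  # 32 threads, each reading a different matrix

banks_16 = banks_touched(indices, component=0, num_banks=16)
banks_32 = banks_touched(indices, component=0, num_banks=32)
print(banks_16, banks_32)
```

Under these assumptions, 32 distinct matrices land in a single bank when there are 16 banks, and in only two banks when there are 32, so most of the accesses would be serialized.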
The solution here is not to copy that array to LDM, but to allocate it on the host with CL_MEM_READ_ONLY (which you already did) and declare the boneMats argument of your kernel with the __constant specifier.
Then the OpenCL library allocates the memory in the constant area of the GPU, and access to that array is fast:
kernel void skelanim(__constant const float16* boneMats,
global const float4* vertices,
global const float4* weights,
global const uint4* indices,
global float4* resVertices)
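It looks like EACH thread in a work group is copying the same 50 bone matrices before the computation starts. This will saturate the global memory bandwidth.
Try this:
if ( lid == 0 )
{
    async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);
}
This does the copy only once per work group. Be aware, though, that the OpenCL specification requires async_work_group_copy to be encountered by all work-items of a work-group with the same arguments, so a guarded copy is only portable if the if block contains a plain copy loop instead, followed by the barrier.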