OpenCL Performance Optimization - optimization

I have started learning OpenCL and I currently try to test how much I can improve performance for a simple skeletal animation algorithm. To do this I have written a program that performs skeletal animation from randomly generated vertices and transformation matrices twice, once with an SSE-optimized linear algebra library in plain C++, and once using my own OpenCL kernel on GPU (I'm testing on an Nvidia GTX 460).
I started off with a simple kernel where each work-item transforms exactly one vertex, with all values read from global memory. Because I was not satisfied with the performance of this kernel, I tried to optimize a little. My current kernel looks like this:
inline float4 MultiplyMatrixVector(float16 m, float4 v)
{
return (float4) (
dot(m.s048C, v),
dot(m.s159D, v),
dot(m.s26AE, v),
dot(m.s37BF, v)
);
}
kernel void skelanim(global const float16* boneMats, global const float4* vertices, global const float4* weights, global const uint4* indices, global float4* resVertices)
{
int gid = get_global_id(0);
int lid = get_local_id(0);
local float16 lBoneMats[NUM_BONES];
async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);
barrier(CLK_LOCAL_MEM_FENCE);
for (int i = 0 ; i < NUM_VERTICES_PER_WORK_ITEM ; i++) {
int vidx = gid*NUM_VERTICES_PER_WORK_ITEM + i;
float4 vertex = vertices[vidx];
float4 w = weights[vidx];
uint4 idx = indices[vidx];
resVertices[vidx] = (MultiplyMatrixVector(lBoneMats[idx.x], vertex * w.x)
+ MultiplyMatrixVector(lBoneMats[idx.y], vertex * w.y)
+ MultiplyMatrixVector(lBoneMats[idx.z], vertex * w.z)
+ MultiplyMatrixVector(lBoneMats[idx.w], vertex * w.w));
}
}
Now I process a constant number of vertices per work-item, and I prefetch all the bone matrices into local memory only once for each work-item, which I believed would lead to way better performance because the matrices for multiple vertices could be read from the faster local memory afterwards. Unfortunately, this kernel performs worse than my first attempt, and even worse than the CPU-only implementation.
Why is performance so bad with this should-be optimization?
If it helps, here is how I execute the kernel:
#define NUM_BONES 50
#define NUM_VERTICES 30000
#define NUM_VERTICES_PER_WORK_ITEM 100
#define NUM_ANIM_REPEAT 1000
uint64_t PerformOpenCLSkeletalAnimation(Matrix4* boneMats, Vector4* vertices, float* weights, uint32_t* indices, Vector4* resVertices)
{
File kernelFile("/home/alemariusnexus/test/skelanim.cl");
char opts[256];
sprintf(opts, "-D NUM_VERTICES=%u -D NUM_REPEAT=%u -D NUM_BONES=%u -D NUM_VERTICES_PER_WORK_ITEM=%u", NUM_VERTICES, NUM_ANIM_REPEAT, NUM_BONES, NUM_VERTICES_PER_WORK_ITEM);
cl_program prog = BuildOpenCLProgram(kernelFile, opts);
cl_kernel kernel = clCreateKernel(prog, "skelanim", NULL);
cl_mem boneMatBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_BONES*sizeof(Matrix4), boneMats, NULL);
cl_mem vertexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*sizeof(Vector4), vertices, NULL);
cl_mem weightBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(float), weights, NULL);
cl_mem indexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(uint32_t), indices, NULL);
cl_mem resVertexBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, NUM_VERTICES*sizeof(Vector4), NULL, NULL);
uint64_t s, e;
s = GetTickcount();
clSetKernelArg(kernel, 0, sizeof(cl_mem), &boneMatBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &vertexBuf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &weightBuf);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &indexBuf);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &resVertexBuf);
size_t globalWorkSize[] = { NUM_VERTICES / NUM_VERTICES_PER_WORK_ITEM };
size_t localWorkSize[] = { NUM_BONES };
for (size_t i = 0 ; i < NUM_ANIM_REPEAT ; i++) {
clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}
clEnqueueReadBuffer(cq, resVertexBuf, CL_TRUE, 0, NUM_VERTICES*sizeof(Vector4), resVertices, 0, NULL, NULL);
e = GetTickcount();
return e-s;
}
I guess there are more things that could be optimized, maybe batching some of the other global reads together, but first I would really like to know why this first optimization didn't work.

Two things are affecting the performance in your exercise.
1) OpenCL conforms to C99 std that does not contain anything about inline functions, i.e. the clcc compiler either just ignores the inline keyword and does a regular call, or it supports the inlining silently. But it is not mandated to support that feature.
So, better define your MultiplyMatrixVector as a pre-processor macro. Though this is not a major problem in your case.
2) You incorrectly threat the local memory (the LDM).
Although its latency times less than the latency of the global memory when it accessed properly, the local memory is subject to bank conflicts.
Your vertex index is calculated with stride 100 per work item. The number of banks depends on the GPU in use but usually it is 16 or 32, i.e. you may access up to 16(32) four byte LDM variables in one cycle without penalty if all of them are in different banks. Otherwise, you get a bank conflict (when two or more threads accesses the same bank) that is serialized.
Your 100 threads in a work group accesses the array in LDM with no special arrangement about bank conflicts. Moreover, the array elements are float16, i.e. a single element spans all 16 banks (or half of 32 banks). Thus, you have a bank conflict in each row of MultiplyMatrixVector function. The cummulative degree that conflict at least 16x32 (here 16 is the number of the vector elements you access and 32 is a size of half wavefront or halfwarp).
The solution here is not to copy that array to LDM, but to allocate it in the host with CL_MEM_READ_ONLY (which you already did) and declare your kernel using __constant specifier for boneMats argument.
Then the OpenCL library would allocate the memory in the constant area inside GPU and the access to that array would be fast:
kernel void skelanim(__constant const float16* boneMats,
global const float4* vertices,
global const float4* weights,
global const uint4* indices,
global float4* resVertices)

It looks like EACH thread in a Work Group is copying the same 50 floats before the computation starts. This will saturate the Global Memory bandwidth.
try this
if ( lid == 0 )
{
async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);
}
This does the copy only once per work group.

Related

Libtorch: How to make a Tensor with GPU pointer?

Below is the pseudo-code of what I want to do.
I already know how to move tensor to GPU (.cuda())...
But have no idea about using a GPU pointer to make a new tensor.
Is there any method I've missed?
I don't want to copy devPtr back to the host side but just make the GPU tensor with the pointer.
int main(void) {
float* devPtr;
cudaMalloc((void**)&devPtr, sizeof(float)*HOSTDATA_SIZE);
cudaMemcpy(devPtr, hostData, sizeof(float)*HOSTDATA_SIZE, cudaMemcpyHostToDevice);
torch::Tensor inA = /* make Tensor with devPtr which is already in GPU */;
torch::Tensor inB = torch::randn({1, 10, 512, 512}).cuda();
torch::Tensor out = torch::matmul(inA, inB);
std::cout << out << std::endl;
return 0;
}
I think this should work, can you confirm ?
auto dims = torch::IntArrayRef{1, 10, 512, 512};
auto gpu_tensor = torch::from_blob(dev_ptr, dims, torch::TensorOptions().device(torch::kCUDA))
Be careful, torch::from_blob does not take ownership of the pointer.If you need to make gpu_tensor independant of the lifetime of dev_ptr, then you need to clone it.

A Good Way to Expose CUPY MemoryPointer in C/C++?

NumPy provides well-defined C APIs so that one can easily handle NumPy array in C/C++ space. For example, if I have a C function that takes C arrays (pointers) as arguments, I can just #include <numpy/arrayobject.h>, and pass a NumPy array to it by accessing its data member (or use the C API PyArray_DATA).
Recently I want to achieve the same for CuPy, but I cannot find a header file that I can include. To be specific, my goal is as follows:
I have some CUDA kernels and their callers written in C/C++. The callers run on host but take handles of memory buffers on device as arguments. The computed results of the callers are also stored on device.
I want to wrap the callers into Python functions so that I can control when to transfer data from device to host in Python. That means I have to wrap the resulted device memory pointers in Python objects. CuPy's ndarray is the best choice I can think of.
I can't use CuPy's user-defined-kenrel mechanism because the functions I want to wrap are not directly CUDA kernels. They must contain host code.
Currently, I've found a workaround. I write the Python functions in cython, which take CuPy arrays as inputs and return CuPy arrays. And then I cast .data.ptr attribute into C's size_t type, and then further cast it to whatever pointer type I need. Example code follows.
Example Code
//kernel.cu
#include <math.h>
__global__ void vecSumKernel(float *A, float *B, float *C, int n) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n)
C[i] = A[i] + B[i];
}
// This is the C function I want to wrap into Python.
// Notice it does not allocate any memory on device. I want that to be done by cupy.
extern "C" void vecSum(float *A_d, float *B_d, float *C_d, int n) {
int threadsPerBlock = 512;
if (threadsPerBlock > n) threadsPerBlock = n;
int nBlocks = (int)ceilf((float)n / (float)threadsPerBlock);
vecSumKernel<<<nBlocks, threadsPerBlock>>>(A_d, B_d, C_d, n);
}
//kernel.h
#ifndef KERNEL_H_
#define KERNEL_H_
void vecSum(float *A_d, float *B_d, float *C_d, int n);
#endif
# test_module.pyx
import cupy as cp
import numpy as np
cdef extern from "kernel.h":
void vecSum(float *A_d, float *B_d, float *C_d, int n)
cdef vecSum_wrapper(size_t aPtr, size_t bPtr, size_t cPtr, int n):
# here the Python int -- cp.ndarray.data.ptr -- is first cast to size_t,
# and then cast to (float *).
vecSum(<float*>aPtr, <float*>bPtr, <float*>cPtr, n)
# This is the Python function I want to use
# a, b are cupy arrays
def vec_sum(a, b):
a_ptr = a.data.ptr
b_ptr = b.data.ptr
n = a.shape[0]
output = cp.empty(shape=(n,), dtype=a.dtype)
c_ptr = output.data.ptr
vecSum_wrapper(a_ptr, b_ptr, c_ptr, n)
return output
Compile and Run
To compile, one can first compile the kernel.cu into a static library, say, libVecSum. Then use cython to compile test_module.pyx int test_module.c, and build the Python extension as usual.
# setup.py
from setuptools import Extension, setup
ext_module = Extension(
"cupyExt.test_module",
sources=["cupyExt/test_module.c"],
library_dirs=["cupyExt/"],
libraries=['libVecSum', 'cudart'])
setup(
name="cupyExt",
version="0.0.0",
ext_modules = [ext_module],
)
It seems working.
>>> import cupy as cp
>>> from cupyExt import test_module
>>> a = cp.ones(5, dtype=cp.float32) * 3
>>> b = cp.arange(5, dtype=cp.float32)
>>> c = test_module.vec_sum(a, b)
>>> print(c.device)
<CUDA Device 0>
>>> print(c)
[3. 4. 5. 6. 7.]
Any better ways?
I am not sure if this way is memory safe. I also feel the casting from .data.ptr to C pointers is not good. I want to know people's thoughts and comments on this.

How to predict a mini-batch examples in C++ of MXNet?

In python interface,we can use a mini-batch examples to make prediction like net([[1,2],[3,4],[5,6]]).
But in C++,I can't find a way to do this.
Suppose calling the net to predict a single example needs 10ms. If there is 10000 examples needs to make prediction, that is 100s
void OneInputOneOutputPredict(PredictorHandle pred_hnd, std::vector<mx_float> vector_data, std::vector<mx_float> &output)
{
MXPredSetInput(pred_hnd, "data", vector_data.data(), vector_data.size());
// Do Predict Forward
MXPredForward(pred_hnd);
mx_uint output_index = 0;
mx_uint *shape = 0;
mx_uint shape_len;
MXPredGetOutputShape(pred_hnd, output_index, &shape, &shape_len);
size_t size = 1;
for (mx_uint i = 0; i < shape_len; ++i) size *= shape[i];
std::vector<float> data(size);
assert(0 == MXPredGetOutput(pred_hnd, output_index, &(data[0]), size));
output = data;
}
//very long time
for(int step=0;step<10000;step++)
OneInputOneOutputPredict(pred_hnd, vector_data, vector_label);
Could we use vectorize the code or something way in C++ that make it fast in prediction?
originally
input_shape_data looks like this
const mx_uint input_shape_data[4] = {1, static_cast<mx_uint>(data_len)};
now if I want to predict a mini-batch(batch-size 3)
const mx_uint input_shape_data[4] = {3, static_cast<mx_uint>(data_len)};
If using seq2seq model.If data looks like [[1,2],[3,4],[5,6]],now only flatten it to a list {1,2,3,4,5,6} , then everything is OK

Changing the elements of a C-style array in Objective-C

I am trying to change the elements of a C-style array. Using an NSArray/NSMutableArray is not an option for me.
My code is as so:
int losingPositionsX[] = {0, 4, 8...};
but when I enter this code
losingPositionsX = {8, 16, 24};
to change the arrays elements of he array it has an error of: "expected expression" How can I make the copy?
In C (and by extension, in Objective C) you cannot assign C arrays to each other like that. You copy C arrays with memcpy, like this:
int losingPositionsX[] = {0, 4, 8};
memcpy(losingPositionsX, (int[3]){8, 16, 24}, sizeof(losingPositionsX));
Important: this solution requires that the sizes of the two arrays be equal.
You have to use something like memcpy() or a loop.
#define ARRAY_SIZE 3
const int VALUES[ARRAY_SIZE] = {8, 16, 24};
for (int i = 0; i < ARRAY_SIZE; i++)
losingPositionsX[i] = VALUES[i];
Alternatively, with memcpy(),
// Assuming VALUES has same type and size as losingPositions
memcpy(losingPositionsX, VALUES, sizeof(VALUES));
// Same thing
memcpy(losingPositionsX, VALUES, sizeof(losingPositionsX));
// Same thing (but don't use this one)
memcpy(losingPositionsX, VALUES, sizeof(int) * 3);
Since you are on OS X, which supports C99, you can use compound literals:
memcpy(losingPositionsX, (int[3]){8, 16, 24}, sizeof(losingPositionsX));
The loop is the safest, and will probably be optimized into the same machine code as memcpy() by the compiler. It's relatively easy to make typos with memcpy().
I do not know, whether it is a help for you in relation to memory management. But you can do
int * losingPositionsX = (int[]){ 0, 4, 8 };
losingPositionsX = (int[]){ 8, 16, 32 };

Why is drawing a circle procedurally slower than reading from a texture?

I am making a app where i need to draw a lot of circles in the screen, so i had the idea of replacing the texture of the triangles i am using to draw the texture with a function to draw a circle. However after testing, it turned out to be slower than picking the values off a texture (although the quality is vastly superior). Is it a problem with the way i produce the circle, or reading from a texture is really a lot faster? (about twice as fast)
new code:
precision mediump float;
uniform sampler2D u_Texture;
varying vec4 vColor;
varying vec2 vTexCoordinate;
void main() {
gl_FragColor = vec4(0, 0, 0, 0);
mediump float thing = vTexCoordinate.x * vTexCoordinate.x + vTexCoordinate.y * vTexCoordinate.y;
if(thing < 1.0 && thing > 0.9){
gl_FragColor = vec4(0, 0, 0, 1);
}
if(thing < 0.9){
gl_FragColor = vec4(1, 1, 1, 1) * vColor;
}
};
old code:
gl_FragColor = texture2D(u_Texture, vTexCoordinate) * vColor;
obs: i didn't bother to rename vTexCoordinate, so it now have a value of [-1, 1] where it was [0, 1]
Conditional branches are really expensive on the GPU, since there's no branch prediction, and probably other reasons too. Also, texture lookups latency can often be hidden under shader processing overhead, so it might actually work faster in the end. So it's best to avoid branches and loops in GLSL if you can.