JOCL/OpenCL kernel optimization

I have a kernel running in OpenCL (via a JOCL front end) that is running horribly slowly compared to my other kernels, and I'm trying to figure out why and how to accelerate it. This kernel is very basic: its sole job is to decimate the number of sample points we have. It copies every Nth point from the input array to a smaller output array to shrink the array size.
The kernel is passed a float specifying how many points to skip between 'good' points. So if it is passed 1.5, it will skip one point, then two, then one, and so on, to keep an average of 1.5 points skipped. The input array is already on the GPU (it was generated by an earlier kernel) and the output array will stay on the GPU, so there is no expense to transfer data to or from the CPU.
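For concreteness, the kernel is doing essentially the following (a simplified sketch rather than my actual code; I use JOCL from Java, but I've written the host side here with pyopencl to keep it short, and all names and sizes are made up):

import numpy as np
import pyopencl as cl

KERNEL_SRC = """
__kernel void decimate(__global const float *input,
                       __global float *output,
                       const float skip,    // average points skipped between kept points, e.g. 1.5
                       const int out_len)
{
    int gid = get_global_id(0);
    if (gid < out_len) {
        // each work-item keeps one point; the source index advances by
        // (skip + 1) per kept point, so skip = 1.5 alternates between
        // skipping one and two points
        int src = (int)(gid * (skip + 1.0f) + 0.5f);
        output[gid] = input[src];
    }
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, KERNEL_SRC).build()

data = np.arange(1_000_000, dtype=np.float32)   # stands in for the array already on the GPU
skip = np.float32(1.5)
out_len = int(len(data) / (skip + 1.0))

mf = cl.mem_flags
in_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=data)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=out_len * 4)

prog.decimate(queue, (out_len,), None, in_buf, out_buf, skip, np.int32(out_len))

# copied back here only to inspect the result; in the real pipeline the
# decimated array stays on the GPU for the next kernel
result = np.empty(out_len, dtype=np.float32)
cl.enqueue_copy(queue, result, out_buf)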
This kernel is running 3-5 times slower than any of the other kernels, and as much as 20 times slower than some of the fast ones. I realize that I'm suffering a penalty for not coalescing my array accesses, but I can't believe that it would cause me to run this horribly slowly. After all, every other kernel touches every sample in the array; I would think touching every Xth sample in the array, even if not coalesced, should be at least around the same speed as touching every sample.
The original kernel actually decimated two arrays at once, one for real and one for imaginary data. I tried splitting it into two kernel calls, one to decimate the real data and one to decimate the imaginary data, but this didn't help at all. Likewise, I tried 'unrolling' the kernel by having one thread be responsible for decimating 3-4 points, but this didn't help either. I've tried varying the amount of data passed into each kernel call (i.e. one kernel call on many thousands of data points, or a few kernel calls each on a smaller number of data points), which has allowed me to tweak out small performance gains, but not the order of magnitude I need for this kernel to be considered worth implementing on the GPU.
Just to give a sense of scale: this kernel is taking 98 ms per iteration, while the FFT takes only 32 ms for the same input array size and every other kernel takes 5 ms or less. What else could cause such a simple kernel to run so absurdly slowly compared to the rest of the kernels we're running? Is it possible that I actually can't optimize this kernel sufficiently to warrant running it on the GPU? I don't need this kernel to run faster than the CPU; it just can't be so much slower that it isn't worth keeping all the processing on the GPU.

It turns out the issue isn't with the kernel at all. Instead, the problem is that when I try to release the buffer I was decimating, it causes the entire program to stall while that kernel (and all other kernels in the queue) complete. This appears to be incorrect behavior; as far as I understand, clReleaseMemObject should only decrement a reference count, not block on the queue. The important point, however, is that my kernel is running as efficiently as it should be.

Related

Need help understanding Kernel Transport speed on GPU (numba, cupy, cuda)

While GPUs speed up math calculations, there is a high fixed overhead for moving a kernel out to the GPU for execution.
I'm using cupy and numba. The first time I execute a function call that uses cupy's GPU version of numpy, it is quite slow. But the second time it is fast.
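Here is roughly the kind of thing I mean (an illustrative sketch, not my real code; the operation and array size are made up):

import time
import cupy as cp

a = cp.arange(1_000_000, dtype=cp.float32)
b = cp.arange(1_000_000, dtype=cp.float32)

for i in range(3):
    cp.cuda.Device().synchronize()
    start = time.perf_counter()
    c = a * b + a                      # same element-wise expression every time
    cp.cuda.Device().synchronize()     # wait for the GPU so the timing is honest
    print(f"call {i}: {(time.perf_counter() - start) * 1000:.2f} ms")
# the first call is expected to be much slower than the later ones,
# which is exactly the behaviour I'm asking about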
I've realized I don't understand how the kernel, or GPU code, gets out to the GPU to run. Operationally, I want to understand this better so that I know when the things I do will accidentally create a slow step due to some kernel transfer. So I need some sort of rules of thumb to understand the concept.
For example, if I multiply two cupy arrays that are already stashed on the GPU, I might write C = A * B.
At some point the cupy overload of * has to be turned into code on the GPU, and it will automagically also get wrapped in loops that break it down into blocks and threads. So presumably this code is some kernel that gets transported out to the GPU. I'm guessing that the next time I call C * D, the GPU no longer needs to be taught what * means, and so it will be fast.
But at some point I would imagine the GPU needs to clear out old code, so * or other operations not being used at that moment might get flushed from memory, and so later on, when the call to A * B happens again, there will be a time penalty to recompile it on the GPU.
Or so I imagine. If I'm right, how do I know when these kernels are going to stick around or disappear?
If I'm wrong and this isn't how it works, or there is some other slow step (I'm assuming the data is already in arrays on the GPU), then what is this slow step, and how does one organize things so one pays it as little as possible?
I'm trying to avoid writing explicit numba thread-management kernels as one does in CUDA C++, and instead just use the standard numba @njit, @vectorize, and @stencil decorators. Likewise in cupy I want to just work at the level of the numpy syntax, not dive into thread management.
I've read a lot of docs on this, but they just refer to the overheads of kernels, not when these get paid or how one controls that, so I'm confused.
I don't have a full answer to this yet, but so far the biggest clue I've gotten has come from reading up on the currently undocumented function cupy.fuse(), which makes it clearer than the numba.jit documents where the kernel launch costs are paid. I have not found the connection to contexts yet, as recommended by @talonmies.
see https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
The key example is this:
import cupy

c = cupy.arange(4)

@cupy.fuse()
def foo(x):
    return x + x + x + x + x
Calling foo(c) will be three times slower with @cupy.fuse() commented out, because each "+" involves a kernel launch and a kernel free. Fusion merges all the adds into a single kernel, so the launch and free are paid only once. For matrices with fewer than 1 million elements on a typical 2018 GPU, the add is so fast that the launch and free are the dominant times.
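Here is a runnable version of that example with rough timing added (the array size is arbitrary and the exact ratio will depend on your GPU):

import time
import cupy

def avg_time(f, x, repeats=1000):
    f(x)                                    # first call pays any compile cost
    cupy.cuda.Device().synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        f(x)
    cupy.cuda.Device().synchronize()        # wait for all queued kernels
    return (time.perf_counter() - start) / repeats

def foo_plain(x):
    return x + x + x + x + x                # several separate element-wise kernels

@cupy.fuse()
def foo_fused(x):
    return x + x + x + x + x                # one fused kernel

x = cupy.arange(1_000_000, dtype=cupy.float32)
print("plain:", avg_time(foo_plain, x))
print("fused:", avg_time(foo_fused, x))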
I wish I could find some documentation on fuse(). For example, does it unroll internal functions the way @jit does? Could I achieve that by stacking @jit and @fuse?
However, I'm still largely in the dark about when the costs get paid in numba.

An example: Am I understanding GPU advantage correctly?

Just reading a bit about what the advantage of a GPU is, and I want to verify I understand on a practical level. Let's say I have 10,000 arrays, each containing a billion simple equations to run. On a CPU it would need to go through every single equation, one at a time, but with a GPU I could run all 10,000 arrays as 10,000 different threads, all at the same time, so it would finish a ton faster... Is this example spot on, or have I misunderstood something?
I wouldn't call it spot on, but I think you're headed in the right direction. Mainly, a GPU is optimized for graphics-related calculations. This does not, however, mean that's all it is capable of.
Without knowing how much detail you want me to go into here, I can say at the very least the concept of running things in parallel is relevant. The GPU is very good at performing many tasks simultaneously in one go (known as running in parallel). CPUs can do this too, but the GPU is specifically optimized to handle much larger numbers of specific calculations with preset data.
For example, to render every pixel on your screen requires a calculation, and the GPU will attempt to do as many of these calculations as it can all at the same time. The more powerful the GPU, the more of these it can handle at once and the faster its clock speed. The end result is a higher-end GPU can run your OS and games in 4k resolution, whereas other cards (or integrated graphics) might only be able to handle 1080p or less.
There's a lot more to this as well, but I figured you weren't looking for the insanely technical explanation.
The bottom line is this: for running a single task on one piece of data, the CPU will normally be faster; a single CPU core is generally much faster than a single GPU core. However, a GPU typically has many more cores, so for running a single task on many pieces of data (where you have to run it once for each piece), the GPU will usually be faster. But these are data-driven situations, and as such each situation should be assessed on an individual basis to determine which to use and how to use it.
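If you want to see this concretely, here is a toy comparison along the lines of your example (numpy standing in for the CPU and cupy for the GPU; the libraries, sizes, and whatever numbers it prints are purely illustrative, and it needs a CUDA GPU with cupy installed):

import time
import numpy as np
import cupy as cp

a_cpu = np.random.rand(10_000_000).astype(np.float32)
a_gpu = cp.asarray(a_cpu)                 # copy the data onto the GPU once

t0 = time.perf_counter()
b_cpu = np.sqrt(a_cpu) * 2.0 + 1.0        # one simple operation, many data points
cpu_ms = (time.perf_counter() - t0) * 1000

_ = cp.sqrt(a_gpu)                        # warm-up so compile cost isn't timed
cp.cuda.Device().synchronize()
t0 = time.perf_counter()
b_gpu = cp.sqrt(a_gpu) * 2.0 + 1.0
cp.cuda.Device().synchronize()
gpu_ms = (time.perf_counter() - t0) * 1000

print(f"CPU: {cpu_ms:.1f} ms, GPU: {gpu_ms:.1f} ms")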

Kernel underworking questions and execution costs

I have two questions:
Is it better to make a kernel overwork or underwork? Let's say I want to calculate a difference image with only 4 GPU cores. Should I have each pixel of my image calculated independently by one thread, or should I make one thread calculate a whole line of my image? I don't know which solution is the most optimal. I already vectorized the first option (which was the one implemented), but I only gained a few ms; it is not very significant.
My second question is about the execution costs of a kernel. I know how to measure any OpenCL command queue task (copy, write, read, kernel...), but I think there is also time taken by the host to load the kernel onto the GPU cores. Is there any way to evaluate it?
Baptiste
(1)
Typically you'd process a single item per work-item. If you process multiple items, you need to access them in the right order to ensure coalesced memory access, or you'll be slower than processing a single item (the solution to this is to have each work-item process a column instead of a row); there is a sketch of the single-item version at the end of this answer.
Another reason why working on multiple items can be slower is that you might leave compute units idle. For example, if you process scanlines on a 1000x1000 image with 700 compute units, the work will be chunked into 700 work items and then only 300 work items (leaving 400 idle).
A case where you want to do lots of work in a single kernel is if you're using shared local memory. For example, if you load a look-up table (LUT) into SLM, you should use it for an entire scanline or image.
(2)
I'm sure this is a non-zero amount of time but it is negligible. Kernel code is pretty small. The driver handles moving it to the GPU, and also handles pushing parameter data onto the GPU. Both are very fast, and likely happen while other kernels are running, so are "free".
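For what it's worth, here is a minimal sketch of the one-pixel-per-work-item difference kernel with OpenCL event profiling enabled, so you can put numbers on the per-command costs yourself (pyopencl is used purely for brevity; the kernel source and the CL_PROFILING_* timestamps are the same from any host language, and all names and sizes are made up). The gap between the "queued" and "start" timestamps gives a rough upper bound on the launch overhead being asked about.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

prog = cl.Program(ctx, """
__kernel void diff_image(__global const uchar *a,
                         __global const uchar *b,
                         __global uchar *out)
{
    int i = get_global_id(0);          // one work-item per pixel
    out[i] = abs_diff(a[i], b[i]);     // adjacent work-items touch adjacent pixels -> coalesced
}
""").build()

w, h = 1920, 1080
img_a = np.random.randint(0, 256, h * w, dtype=np.uint8)
img_b = np.random.randint(0, 256, h * w, dtype=np.uint8)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=img_a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=img_b)
o_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=img_a.nbytes)

evt = prog.diff_image(queue, (h * w,), None, a_buf, b_buf, o_buf)
evt.wait()
# profiling timestamps are in nanoseconds
print("queued -> start:", (evt.profile.start - evt.profile.queued) * 1e-6, "ms")
print("start  -> end:  ", (evt.profile.end - evt.profile.start) * 1e-6, "ms")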

MATLAB parallel computing setup

I have a quad-core computer, and I use the Parallel Computing Toolbox.
I set different values for the number of "workers" in the parallel computing settings, for example 2, 4, 8, and so on.
However, no matter what I set, the average CPU usage by MATLAB is exactly 25% of total CPU usage, and none of the cores runs at 100% (all are around 10%-30%). I am using MATLAB to run an optimization problem, so I really want my quad-core computer to use all its power for the computation. Please help.
Setting the number of workers (up to 4 on a quad-core) is not enough. You also need to use a construct like parfor to signal to MATLAB which part of the calculation should be distributed among the workers.
I am curious about what kind of optimization you're running. Normally, optimization problems are very difficult to parallelize, since the result of every iteration depends on the previous one. However, if you want to e.g. try and fit multiple models to the data, or if you have to fit multiple data sets, then you can easily run these in parallel as opposed to sequentially.
Note that having many cores may not be sufficient in terms of resources - if performing the optimization on one worker uses k GB of RAM, performing it on n workers requires at least n*k GB of RAM.

OpenCL optimization and apparent PCI bus limitations?

I'm writing a program using JOGL/OpenCL to utilize the GPU. I have code that kicks in when we work with large data sizes; it is supposed to detect the available memory on the GPU. If there is insufficient memory on the GPU to process the entire calculation at once, it breaks the process up into sub-processes of X frames each, which use less than the maximum GPU global memory available for storage.
I had expected that using the maximum possible value of X would give me the largest speed-up by minimizing the number of kernel calls. Instead I found that using a smaller group (X/2 or X/4) gives me better speeds. I'm trying to figure out why breaking the GPU processing into smaller groups, rather than having the GPU process the maximum amount it can handle at one time, gives me a speed increase, and how I can work out what the best value of X is.
My current tests have been running on a GPU kernel which uses very little processing power (both kernels decimate the output by selecting part of the input and returning it). However, I am fairly certain the same effect occurs when I activate all the kernels, which do a larger amount of processing on the values before returning.
The short answer is, it's complicated. There are many factors at play. These include (but are not limited to):
Amount of local memory you are using.
Amount of private memory you are using.
A limit on the maximum number of work groups a Streaming Multiprocessor (SM) is able to handle at once.
Exceeding register limits, causing memory access slow-down.
And many more...
I recommend you check out the following link:
http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
In particular, check out section 5.3. Dynamic Partitioning of SM Resources. This text is meant to be general purpose, but uses CUDA for its examples. However, the concepts still apply just the same to OpenCL.
This text comes from the following book:
http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands-/dp/0123814723/ref=sr_1_2?ie=UTF8&qid=1314279939&sr=8-2
For what it's worth, I found this book to be very informative. It will give you a deeper understanding of the hardware that will allow you to answer questions like this.
PCIe is full-duplex, bi-directional. I think that means you can write as you read. In that case, if you're doing very little processing, you may be seeing a gain because you're overlapping reads with writes.
Consider a total size of N. In one work unit you do:
write N
process N
read N
total time proportional to: process N, transfer 2N
If you split this in two with parallel read/write, you can get:
write N/2
process N/2
read N/2 and write N/2
process N/2
read N/2
total time proportional to: process N, transfer 3N/2 (saving N/2 transfer time)
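Here is a hypothetical sketch of that split schedule (pyopencl with separate upload/download queues so the two transfer directions can overlap; the original project is in Java, the trivial kernel and all names are made up, and on most hardware you would additionally need pinned host memory to get true overlap):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
up_q = cl.CommandQueue(ctx)     # host -> device transfers
exec_q = cl.CommandQueue(ctx)   # kernel execution
down_q = cl.CommandQueue(ctx)   # device -> host transfers

prog = cl.Program(ctx, """
__kernel void process(__global const float *src, __global float *dst)
{
    int i = get_global_id(0);
    dst[i] = src[i] * 2.0f;     // stand-in for the real per-sample work
}
""").build()

N = 1 << 20
host_in = np.random.rand(N).astype(np.float32)
host_out = np.empty_like(host_in)

mf = cl.mem_flags
read_events = []
for half in range(2):
    lo, hi = half * N // 2, (half + 1) * N // 2
    n = hi - lo
    in_buf = cl.Buffer(ctx, mf.READ_ONLY, size=n * 4)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=n * 4)

    # non-blocking upload; the second half's write can overlap the first
    # half's kernel and read because it sits on its own in-order queue
    w = cl.enqueue_copy(up_q, in_buf, host_in[lo:hi], is_blocking=False)

    kern = prog.process
    kern.set_args(in_buf, out_buf)
    k = cl.enqueue_nd_range_kernel(exec_q, kern, (n,), None, wait_for=[w])

    # non-blocking download on a third queue, so it can overlap the next write
    r = cl.enqueue_copy(down_q, host_out[lo:hi], out_buf, wait_for=[k],
                        is_blocking=False)
    read_events.append(r)

cl.wait_for_events(read_events)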