Kernel underworking questions and execution costs - optimization

I have two questions:
Is it better to make a kernel overwork or underwork? Let's say I want to calculate a difference image with only 4 GPU cores. Should each pixel of my image be calculated independently by 1 thread, or should 1 thread calculate a whole line of my image? I don't know which solution is better optimized. I already vectorized the first option (the one I implemented), but I only gained a few ms, which is not very significant.
My second question is about the execution costs of a kernel. I know how to measure any OpenCL command queue task (copy, write, read, kernel...) but I think there is a time taken by the host to load the kernel to the GPU cores. Is there any way to evaluate it?
Baptiste

(1)
Typically you'd process a single item per work item. If you process multiple items per work item, you need to do them in the right order to ensure coalesced memory access, or you'll be slower than doing a single item. For an image stored row by row, the solution is to process a column per work item instead of a row, so that neighbouring work items read neighbouring addresses at each step.
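To make the access-pattern point concrete, here is a minimal Python sketch (not OpenCL code) that lists which flat buffer offsets four adjacent work items touch at the same loop step. It assumes the image is stored row by row in one flat buffer; `row_per_item`, `column_per_item` and `WIDTH` are illustrative names, not part of any API.

```python
def row_per_item(work_item, step, width):
    # Each work item owns one row; at loop step `step` it reads
    # pixel (row=work_item, col=step) of a row-major image.
    return work_item * width + step

def column_per_item(work_item, step, width):
    # Each work item owns one column; at loop step `step` it reads
    # pixel (row=step, col=work_item) of a row-major image.
    return step * width + work_item

WIDTH = 1000  # assumed image width, for illustration only

# Offsets touched by work items 0..3 during the same loop step (step 0).
rows = [row_per_item(w, 0, WIDTH) for w in range(4)]
cols = [column_per_item(w, 0, WIDTH) for w in range(4)]

print(rows)  # [0, 1000, 2000, 3000] -> strided, not coalesced
print(cols)  # [0, 1, 2, 3]          -> contiguous, coalesced
```

With a row per work item, neighbours are a full image width apart in memory; with a column per work item they sit next to each other, which is the pattern the hardware can combine into fewer memory transactions.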
Another reason why working on multiple items can be slower is that you might leave compute units idle. For example, if you process scanlines on a 1000x1000 image with 700 compute units, the 1000 work items will run as a first batch of 700 and then a second batch of only 300 (leaving 400 compute units idle).
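The arithmetic behind that example can be sketched as follows. This is a deliberately simplified model: it assumes each compute unit runs one work item at a time and that all work items take the same time. `schedule` is a hypothetical helper, not an OpenCL call.

```python
import math

def schedule(work_items, compute_units):
    # Number of batches needed to get through all work items.
    waves = math.ceil(work_items / compute_units)
    # Work items left for the final batch.
    last_wave = work_items - (waves - 1) * compute_units
    idle_in_last_wave = compute_units - last_wave
    # Fraction of available compute-unit slots that did useful work.
    utilization = work_items / (waves * compute_units)
    return waves, idle_in_last_wave, utilization

# One work item per scanline of a 1000x1000 image.
per_line = schedule(1000, 700)
# One work item per pixel of the same image.
per_pixel = schedule(1000 * 1000, 700)

print(per_line)   # 2 batches, 400 idle units in the last one, about 71% utilization
print(per_pixel)  # 1429 batches, 300 idle units in the last one, above 99.9% utilization
```

Under this model the idle tail matters a lot when there are only two batches, and hardly at all when there are more than a thousand.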
A case where you want to do lots of work in a single kernel is if you're using shared local memory. For example, if you load a look-up table (LUT) into SLM, you should use it for an entire scanline or image.
(2)
I'm sure this is a non-zero amount of time but it is negligible. Kernel code is pretty small. The driver handles moving it to the GPU, and also handles pushing parameter data onto the GPU. Both are very fast, and likely happen while other kernels are running, so are "free".
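If you still want a number, one way to estimate it is OpenCL event profiling: on a queue created with CL_QUEUE_PROFILING_ENABLE, clGetEventProfilingInfo reports four timestamps per command (CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END). The Python sketch below only does the subtraction; the timestamps are made up for illustration and `breakdown_ns` is a hypothetical helper. Note that the time before START covers everything that happens ahead of execution, not just uploading the kernel code.

```python
def breakdown_ns(queued, submitted, started, ended):
    # All arguments are device timestamps in nanoseconds, as reported
    # by OpenCL event profiling for a single enqueued command.
    return {
        "host_queue": submitted - queued,    # waiting in the host-side queue
        "device_wait": started - submitted,  # submitted, but not yet running
        "execution": ended - started,        # the kernel itself
    }

# Made-up timestamps, for illustration only.
t = breakdown_ns(queued=1_000, submitted=4_000, started=9_000, ended=509_000)
overhead = t["host_queue"] + t["device_wait"]

print(t)
print(overhead)  # 8000 ns before the kernel starts, versus 500000 ns running
```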

Related

An example: Am I understanding GPU advantage correctly?

Just reading a bit about what the advantage of a GPU is, and I want to verify I understand it on a practical level. Let's say I have 10,000 arrays, each containing a billion simple equations to run. A CPU would need to go through every single equation, 1 at a time, but with a GPU I could run all 10,000 arrays as 10,000 different threads, all at the same time, so it would finish a ton faster... Is this example spot on, or have I misunderstood something?
I wouldn't call it spot on, but I think you're headed in the right direction. Mainly, a GPU is optimized for graphics-related calculations. This does not, however, mean that's all it is capable of.
Without knowing how much detail you want me to go into here, I can say at the very least the concept of running things in parallel is relevant. The GPU is very good at performing many tasks simultaneously in one go (known as running in parallel). CPUs can do this too, but the GPU is specifically optimized to handle much larger numbers of specific calculations with preset data.
For example, to render every pixel on your screen requires a calculation, and the GPU will attempt to do as many of these calculations as it can all at the same time. The more powerful the GPU, the more of these it can handle at once and the faster its clock speed. The end result is a higher-end GPU can run your OS and games in 4k resolution, whereas other cards (or integrated graphics) might only be able to handle 1080p or less.
There's a lot more to this as well, but I figured you weren't looking for the insanely technical explanation.
The bottom line is this: For running a single task on one piece of data, the CPU will normally be faster. A single CPU core is generally much faster than a single GPU core. However, GPUs typically have many more cores, so for running a single task on many pieces of data (so you have to run it once for each), the GPU will usually be faster. But these are data-driven situations, and as such each situation should be assessed on an individual basis to determine which to use and how to use it.

How to measure by c code

I have a question about how to measure the bandwidth of a GPU. I have tried several different ways, but none of them work. For example, I tried dividing the amount of data transferred by the time taken to calculate the bandwidth. However, since the GPU can switch between the warps it is currently executing, the amount of data transferred varies during execution. I wonder whether you could give me some advice about how to do it. That would be really appreciated.

MATLAB parallel computing setup

I have a quad core computer; and I use the parallel computing toolbox.
I set different values for the number of workers in the parallel computing settings, for example 2, 4, 8, and so on.
However, no matter what I set, the average CPU usage by MATLAB is exactly 25% of total CPU usage, and none of the cores run at 100% (all are around 10%-30%). I am using MATLAB to run an optimization problem, so I really want my quad core computer to use all its power for the computation. Please help.
Setting a number of workers (up to 4 on a quad-core) is not enough. You also need to use a command like parfor to signal to MATLAB what part of the calculation should be distributed among the workers.
I am curious about what kind of optimization you're running. Normally, optimization problems are very difficult to parallelize, since the result of every iteration depends on the previous one. However, if you want to e.g. try and fit multiple models to the data, or if you have to fit multiple data sets, then you can easily run these in parallel as opposed to sequentially.
Note that having many cores may not be sufficient in terms of resources - if performing the optimization on one worker uses k GB of RAM, performing it on n workers requires at least n*k GB of RAM.
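As a rough illustration of that memory constraint, the Python sketch below picks a worker count that respects both the core count and the n*k GB rule. `max_workers` and the numbers passed to it are hypothetical; it is not a MATLAB function.

```python
def max_workers(cores, total_ram_gb, ram_per_worker_gb):
    # n workers need at least n * k GB, so memory allows at most
    # floor(total / k) workers; the core count caps it as well.
    by_memory = int(total_ram_gb // ram_per_worker_gb)
    return max(0, min(cores, by_memory))

print(max_workers(cores=4, total_ram_gb=8, ram_per_worker_gb=3))   # 2
print(max_workers(cores=4, total_ram_gb=16, ram_per_worker_gb=3))  # 4
```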

OpenCL optimization and apparent PCI bus limitations?

I'm writing a program using JOGL/OpenCL to utilize the GPU. I have code that kicks in when we work with large data sizes and is supposed to detect the available memory on the GPU. If there is insufficient memory on the GPU to process the entire calculation at once, it will break the process up into sub-processes of X frames each, which use less than the maximum GPU global memory to store.
I had expected that using the maximum possible value of X would give me the largest speed-up by minimizing the number of kernels used. Instead I found that using a smaller group (X/2 or X/4) gives me better speeds. I'm trying to figure out why breaking the GPU processing into smaller groups, rather than having the GPU process the maximum amount it can handle at one time, gives me a speed increase, and how I can work out what the best value of X is.
My current tests have been running on a GPU kernel which uses very little processing power (both kernels decimate output by selecting part of the input and returning it). However, I am fairly certain the same effects occur when I activate all kernels, which do a larger degree of processing on the values before returning.
The short answer is, it's complicated. There are many factors at play. These include (but are not limited to):
Amount of local memory you are using.
Amount of private memory you are using.
A limit on the max number of work groups the Streaming Multiprocessor is able to handle at once.
Exceeding register limits, causing memory access slow-down.
And many more...
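How such limits interact can be sketched with a small Python model: the number of work groups resident on one multiprocessor is the smallest of the per-resource limits. The hardware figures below are illustrative only, not those of any particular GPU, and `resident_groups` is a hypothetical helper.

```python
def resident_groups(group_size, regs_per_item, local_mem_per_group,
                    max_groups, max_items, total_regs, total_local_mem):
    # Each resource imposes its own cap on simultaneously resident groups.
    limits = [
        max_groups,                                 # hard cap on groups
        max_items // group_size,                    # cap on resident work items
        total_regs // (regs_per_item * group_size), # register file
    ]
    if local_mem_per_group > 0:
        limits.append(total_local_mem // local_mem_per_group)  # local memory
    return min(limits)

# Illustrative per-multiprocessor limits (assumed values).
HW = dict(max_groups=8, max_items=768, total_regs=8192, total_local_mem=16384)

a = resident_groups(256, 10, 0, **HW)     # 10 registers per work item
b = resident_groups(256, 11, 0, **HW)     # one more register per work item
c = resident_groups(256, 10, 8192, **HW)  # 8 KB of local memory per group

print(a, b, c)  # 3 2 2
```

In this model a single extra register per work item, or a larger local-memory footprint, drops the resident group count from 3 to 2, which is the kind of cliff that makes the best chunk size hard to predict.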
I recommend you check out the following link:
http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
In particular, check out section 5.3. Dynamic Partitioning of SM Resources. This text is meant to be general purpose, but uses CUDA for its examples. However, the concepts still apply just the same to OpenCL.
This text comes from the following book:
http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands-/dp/0123814723/ref=sr_1_2?ie=UTF8&qid=1314279939&sr=8-2
For what it's worth, I found this book to be very informative. It will give you a deeper understanding of the hardware that will allow you to answer questions like this.
PCI-e is full duplex (bi-directional). I think that means you can write as you read. In which case, if you're doing very little processing, you may be seeing a gain because you're overlapping reads with writes.
Consider a total size of N. In one work unit you do:
write N
process N
read N
total time proportional to: process N, transfer 2N
If you split this in two with parallel read/write, you can get:
write N/2
process N/2
read N/2 and write N/2
process N/2
read N/2
total time proportional to: process N, transfer 3N/2 (saving N/2 transfer time)
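The same accounting can be generalized to any number of chunks. The Python sketch below follows the model in this answer: reads and writes can overlap each other but not processing, and there is no fixed cost per transfer. Both are simplifying assumptions, and `transfer_time` is a hypothetical helper.

```python
def transfer_time(n, chunks):
    # The first write and the last read have nothing to overlap with.
    # Each of the (chunks - 1) steps in between overlaps one read with
    # one write on the full-duplex bus, so it costs one chunk, not two.
    chunk = n / chunks
    return chunk + (chunks - 1) * chunk + chunk

print(transfer_time(1.0, 1))  # 2.0  (write N, read N)
print(transfer_time(1.0, 2))  # 1.5  (the 3N/2 case above)
print(transfer_time(1.0, 4))  # 1.25
```

In this idealized model more chunks always help, with the transfer cost approaching N. In practice each extra transfer and kernel launch presumably carries some fixed cost, which would explain why a mid-sized X can beat both extremes.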

JNCI/JCOL kernel optimization

I have a kernel running in OpenCL (via a JOCL front end) that is running horribly slowly compared to the other kernels, and I'm trying to figure out why and how to accelerate it. This kernel is very basic. Its sole job is to decimate the number of sample points we have. It copies every Nth point from the input array to a smaller output array to shrink our array size.
The kernel is passed a float specifying how many points to skip between 'good' points. So if it is passed 1.5 it will skip one point, then two, then one, etc., to keep an average of 1.5 points being skipped. The input array is already on the GPU (it was generated by an earlier kernel) and the output array will stay on the GPU, so there is no expense to transfer data to or from the CPU.
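For readers who want to see the indexing, here is a minimal Python sketch of one way to get that skip pattern: output element i reads input element floor(i * (skip + 1)). This is an assumption about the mapping, not the actual kernel code, and `source_index` and `decimate` are illustrative names.

```python
import math

def source_index(out_index, skip):
    # Keeping one point and then skipping `skip` points on average
    # gives an average stride of (skip + 1) input samples per output.
    return math.floor(out_index * (skip + 1.0))

def decimate(samples, skip):
    # On the GPU each output index would be handled by its own work
    # item; the loop here stands in for that.
    out = []
    i = 0
    while source_index(i, skip) < len(samples):
        out.append(samples[source_index(i, skip)])
        i += 1
    return out

picked = decimate(list(range(12)), 1.5)
print(picked)  # [0, 2, 5, 7, 10]: skips 1, then 2, then 1, then 2 points
```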
This kernel is running 3-5 times slower than any of the other kernels, and as much as 20 times slower than some of the fast kernels. I realize that I'm suffering a penalty for not coalescing my array accesses, but I can't believe that it would cause me to run this horribly slow. After all, every other kernel is touching every sample in the array; I would think touching every Xth sample in the array, even if not coalesced, should be at least around the same speed as touching every sample in the array.
The original kernel actually decimated two arrays at once, for real and imaginary data. I tried splitting the kernel up into two kernel calls, one to decimate real and one to decimate imaginary data, but this didn't help at all. Likewise I tried 'unrolling' the kernel by having one thread be responsible for decimation of 3-4 points, but this didn't help either. I've tried messing with the size of data passed into each kernel call (i.e. one kernel call on many thousands of data points, or a few kernel calls on a smaller number of data points), which has allowed me to tweak out small performance gains, but not of the order of magnitude I need for this kernel to be considered worth implementing on the GPU.
Just to give a sense of scale, this kernel is taking 98 ms to run per iteration, while the FFT takes only 32 ms for the same input array size and every other kernel is taking 5 ms or less. What else could cause such a simple kernel to run so absurdly slowly compared to the rest of the kernels we're running? Is it possible that I actually can't optimize this kernel sufficiently to warrant running it on the GPU? I don't need this kernel to run faster than the CPU, just not quite as slow compared to the CPU, so I can keep all processing on the GPU.
It turns out the issue isn't with the kernel at all. Instead, the problem is that when I try to release the buffer I was decimating, it causes the entire program to stall while the kernel (and all other kernels in the queue) complete. This appears to be functioning incorrectly: as far as I understand, the clrelease should only decrement a counter, not block on the queue. However, the important point is that my kernel is running as efficiently as it should be.