OpenCL optimization and apparnt PCI bus limitations? - optimization

I'm writing a program using JOGL/openCL to utilize the GPU. I have code that kicks in when we work with data sizes which is suppose to detect the available memory on the GPU. If there is insufficient memory on the GPU to process the entire calculation at once it will break the process up into sub process with X number of frames which utilizes less then the max GPU global memory to store.
I had expected that using the maximum possible value of X would give me the largest speed up by minimizing the number of kernels used. Instead I found using a smaller group (X/2 or X/4) gives me better speeds. I'm trying to figure out why breaking the GPU processing into smaller groups rather then having the GPU process the maximum amount it can handle at one time gives me a speed increase; and how I can optimize to figure out what the best value of X is.
My current tests have been running on a GPU kernel which uses very little processing power (both kernels decimate output by selecting part of input and returning it) However, I am fairly certain the same effects occur when I activate all kernels which do a larger degree of processing on the value before returning.

The short answer is, it's complicated. There are many factors at play. These include (but are not limited to):
Amount of local memory you are using.
Amount of private memory you are using.
A limit on the max number of work groups the Symmetric Multiprocessor is able to handle at once.
Exceeding register limits, causing memory access slow-down.
And many more...
I recommend you check out the following link:
http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
In particular, check out section 5.3. Dynamic Partitioning of SM Resources. This text is meant to be general purpose, but uses CUDA for its examples. However, the concepts still apply just the same to OpenCL.
This text comes from the following book:
http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands-/dp/0123814723/ref=sr_1_2?ie=UTF8&qid=1314279939&sr=8-2
For what its worth, I found this book to be very informative. It will give you a deeper understanding of the hardware that will allow you to answer questions like this.

PCI-e are full duplex bi-directional. i think that means you can write as you read. in which case, if you're doing very little processing, you may be seeing a gain because you're overlappings reads with writes.
consider a total size of N. in one work unit you do:
write N
process N
read N
total time proportional to: process N, transfer 2N
if you split this in two with parallel read/write you can get:
write N/2
process N/2
read N/2 and write N/2
process N/2
read N/2
total time proportional to: process N, transfer 3N/2 (saving N/2 transfer time)

Related

How can computational requirements be compared?

Calculating the solution to an optimization problem takes a 2 GHz CPU one hour. During this process there are no background processes, no RAM is being used and the CPU is at 100% capacity.
Based on this information, can it be derived that a 1 GHz CPU will take two hours to solve the same problem?
A quick search of IPC, frequence, and chip architecture will show you this topic has been breached many times. There are many things that can determine the execution speed of a program (without even going into threading at all) the main ones that pop to mind:
Instruction set - If one chip has an instruction for multiplication, than a*b is atomic. If not, you will need a lot of atomic instructions to perform such an action - big difference in speed, which can prove to make even higher frequency chips slower.
Cycles per second - this is the frequency of the chip.
Instructions per cycle (IPC) - what you are really interested is IPC*frequency, not just frequency. How many atomic actions can you can perform in a second. After the amount of atomic actions (see 1), on a single threaded application this might act as you expect (x2 this => x2 faster program), though no guarantees.
and there are a ton of other nuance technologies that can affect this, like branch prediction which hit the news big time recently. For a complete understanding a book/course might be a better resource.
So, in general, no. If you are comparing two single core, same architecture chips (unlikely), then maybe yes.

An example: Am I understanding GPU advantage correctly?

Just reading a bit about what the advantage of GPU is, and I want to verify I understand on a practical level. Lets say I have 10,000 arrays each containing a billion simple equations to run. On a cpu it would need to go through every single equation, 1 at a time, but with a GPU I could run all 10,000 arrays as as 10,000 different threads, all at the same time, so it would finish a ton faster...is this example spot on or have I misunderstood something?
I wouldn't call it spot on, but I think you're headed in the right direction. Mainly, a GPU is optimized for graphics-related calculations. This does not, however, mean that's all it is capable of.
Without knowing how much detail you want me to go into here, I can say at the very least the concept of running things in parallel is relevant. The GPU is very good at performing many tasks simultaneously in one go (known as running in parallel). CPUs can do this too, but the GPU is specifically optimized to handle much larger numbers of specific calculations with preset data.
For example, to render every pixel on your screen requires a calculation, and the GPU will attempt to do as many of these calculations as it can all at the same time. The more powerful the GPU, the more of these it can handle at once and the faster its clock speed. The end result is a higher-end GPU can run your OS and games in 4k resolution, whereas other cards (or integrated graphics) might only be able to handle 1080p or less.
There's a lot more to this as well, but I figured you weren't looking for the insanely technical explanation.
The bottom line is this: For running a single task on one piece of data, the CPU will normally be faster. A single CPU core is generally much faster than a single GPU core. However, they typically have many cores and for running a single task on many pieces of data (so you have to run it once for each), the GPU will usually be faster. But these are data-driven situations, and as such each situation should be assessed on an individual basis to determine which to use and how to use it.

MATLAB parallel computing setup

I have a quad core computer; and I use the parallel computing toolbox.
I set different number for the "worker" number in the parallel computing setting, for example 2,4,8..............
However, no matter what I set, the AVERAGE cpu usage by MATLAB is exactly 25% of total CPU usage; and None of the cores run at 100% (All are around 10%-30%). I am using MATLAB to run optimization problem, so I really want my quad core computer using all its power to do the computing. Please help
Setting a number of workers (up to 4 on a quad-core) is not enough. You also need to use a command like parfor to signal to Matlab what part of the calculation should be distributed among the workers.
I am curious about what kind of optimization you're running. Normally, optimization problems are very difficult to parallelize, since the result of every iteration depends on the previous one. However, if you want to e.g. try and fit multiple models to the data, or if you have to fit multiple data sets, then you can easily run these in parallel as opposed to sequentially.
Note that having many cores may not be sufficient in terms of resources - if performing the optimization on one worker uses k GB of RAM, performing it on n workers requires at least n*k GB of RAM.

JNCI/JCOL kernel optimization

I have a kernel running in open CL (via a jocl front end) that is running horrible slow compared to the other kernels, I'm trying to figure why and how to accelerate it. This kernel is very basic. it's sole job is to decimate the number of sample points we have. It copies every Nth point from the input array to a smaller output array to shrink our array size.
The kernel is passed a float specifying how many points to skip between 'good' points. So if it is passed 1.5 it will skip one point, ten two, then one etc to keep an average of every 1.5 points being skipped. The input array is already on the GPU (it was generated by an earlier kernel) and the output array will stay on the kernel so there is no expense to transfer data to or from the CPU.
This kernel is running 3-5 times slower then any of the other kernels; and as much as 20 times slower then some of the fast kernels. I realize that I'm suffering a penalty for not coalescing my array accesses; but I can't believe that it would cause me to run this horribly slow. After all every other kernel is touching every sample in the array, I would think touching ever X sample in the array, even if not coalesced, should be around the same speed at least of touching every sample in an array.
The original kernel actually decimated two arrays at once, for real and imaginary data. I tried splitting the kernel up into two kernel calls, one to decimate real and one to decimate imaginary data; but this didn't help at all. Likewise I tried 'unrolling' the kernel by having one thread be responsible for decimation of 3-4 points; but this didn't help any. Ive tried messing with the size of data passed into each kernel call (ie one kernel call on many thousands of data points, or a few kernel calls on a smaller number of data points) which has allowed me to tweak out small performance gains; but not to the order of magnitude I need for this kernel to be considered worth implementing on GPU.
just to give a sense of scale this kernel is taking 98 ms to run per iteration while the FFT takes only 32 ms for the same input array size and every other kernel is taking 5 or less ms. What else could cause such a simple kernel to run so absurdly slow compared to the rest of the kernels were running? Is it possible that I actually can't optimize this kernel sufficiently to warrant running it on the GPU. I don't need this kernel to run faster then CPU; just not quite as slow compared to CPU so I can keep all processing on the GPU.
it turns out the issue isn't with the kernel at all. Instead the problem is that when I try to release the buffer I was decimating it causes the entire program to stall while the kernel (and all other kernels in queue) complete. This appears to be functioning incorrectly, the clrelease should only decrement a counter so far as I understand, not block on the queue. However; the important point is that my kernel is running efficiently as it should be.

FLOPS assigned to sqrt in GPU to measure performance and global efficiency

In a GPU implementation we need to estimate its performance in terms of GLOPS. The code is very basic, but my problem is how many FLOPS should I give to the operations "sqrt" or "mad", whether 1 or more.
Besides, I obtain 50 GFLOPS for my code if 1 say 1 FLOP for these operations, while the theoretical maximum for this GPU is 500GFLOPS. If I express it in precentages I get 10 %. In terms of speedup I get 100 times. So I think it is great, but 10% seems to be a bit low yield, what do you think?
Thanks
The right answer is probably "it depends".
For pure comparative performance between code run on different platforms, I usually count transcendentals, sqrt, mads, as one operation. In that sort of situation, the key performance metric is how long the code takes to run. It is almost impossible to do the comparison any other way - how would you go about comparing the "FLOP" count of a hardware instruction for a transcendental which takes 25 cycles to retire, versus a math library generated stanza of fmad instructions which also takes 25 cycles to complete? Counting instructions or FLOPs becomes meaningless in such a case, both performed the desired operation in the same amount of clock cycles, despite a different apparent FLOP count.
On the other hand, for profiling and performance tuning of a piece of code on given hardware, the FLOP count might be a useful metric to have. In GPUs, it is normal to look at FLOP or IOP count and memory bandwidth utilization to determine where the performance bottleneck of a given code lies. Having those numbers might point you in the direction of useful optimizations.