Need help understanding Kernel Transport speed on GPU (numba, cupy, cuda) - numpy

While GPUs speed up math calculations, there is a high fixed overhead for moving a kernel out to the GPU for execution.
I'm using cupy and numba. The first time I execute a function call that uses cupy's GPU version of numpy it is quite slow, but the second time it is fast.
I've realized I don't understand how the kernel, or GPU code, gets out to the GPU to run. Operationally I want to understand this better so that I can know when the things I do will accidentally create a slow step due to some kernel transfer. So I need some rules, or rules of thumb, to understand the concept.
For example, if I multiply two cupy arrays that are stashed on the GPU already I might write C= A*B
At some point the cupy overload of * multiplication has to be coded out on the GPU, and it automagically also gets wrapped by loops that break it down into blocks and threads. So presumably this code is some kernel that gets transported out to the GPU. I'm guessing that the next time I call C*D the GPU no longer needs to be taught what * means, and so it will be fast.
But at some point I would imagine the GPU needs to clear out old code so * or other operations not being used at that moment might get flushed from memory, and so later on when the call for A*B happens again there's going to be a penalty in time to recompile it out on the GPU.
Or so I imagine. If I'm right how do I know when these kernels are going to stick around or disappear?
If I'm wrong and this isn't how it works, or there is some other slow step (I'm assuming the data is already transported to arrays on the GPU), then what is this slow step, and how does one organize things so one pays it as little as possible?
I'm trying to avoid writing explicit numba thread management kernels as one does in CUDA C++, and just use the standard numba @njit, @vectorize, @stencil decorators. Likewise in cupy I want to work at the level of the numpy syntax, not dive into thread management.
I've read a lot of docs on this but they just refer to overheads of kernels, not when these get paid and how one controls that so I'm confused.
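One way to see when a cost is paid is to time the first call separately from the later ones. The helper below is a minimal sketch, not anything from cupy or numba: the name time_calls and its sync parameter are made up for illustration. If the first-call figure is much larger than the later ones, the cost is one-time setup (such as compilation) rather than per-call overhead.

```python
import statistics
import time

def time_calls(fn, *args, repeats=5, sync=None):
    """Return (first_call_seconds, median_of_later_calls_seconds).

    `sync` is an optional zero-argument callable that is run before the
    clock is stopped, so that asynchronous GPU work is included.
    """
    timings = []
    for _ in range(repeats + 1):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return timings[0], statistics.median(timings[1:])

# CPU-only demonstration so the sketch runs without a GPU.
first, later = time_calls(sum, range(100_000))
print(f"first call: {first:.6f}s, later calls (median): {later:.6f}s")
```

With cupy arrays, GPU work is queued asynchronously, so passing something like sync=cupy.cuda.Stream.null.synchronize should be needed to get meaningful numbers; without it the timer may only measure how long it took to queue the kernel.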

I don't have a full answer to this yet. But so far the biggest clue I've gotten has come from reading up on the currently undocumented function @cupy.fuse(), which makes it clearer than the @numba.jit documents where the kernel launch costs are paid. I have not yet found the connection to Contexts, as recommended by @talonmies.
The key example is this
import cupy

c = cupy.arange(4)

@cupy.fuse()  # comment this line out to compare against the unfused version
def foo(x):
    return x+x+x+x+x
foo(.) will be three times slower with @cupy.fuse() commented out, because each "+" involves a kernel load and a kernel free. Fusion merges all the adds into a single kernel, so the launch and free are paid only once. For matrices of fewer than 1 million elements on a typical 2018 GPU, the add() is so fast that the launch and free dominate the run time.
I wish I could find some documentation on @fuse. For example, does it unroll internal functions the way @jit does? Could I achieve that by stacking @jit and @fuse?
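The effect of fusion can be sketched with a back-of-the-envelope cost model in which every kernel pays a fixed launch cost plus some per-element work. The two constants below are hypothetical placeholders, not measurements, and the model ignores the memory traffic that fusion also saves.

```python
def unfused_time(n_ops, n_elements, launch_s, per_element_s):
    # Each operator is its own kernel: n_ops launches, n_ops passes over the data.
    return n_ops * (launch_s + n_elements * per_element_s)

def fused_time(n_ops, n_elements, launch_s, per_element_s):
    # One kernel: a single launch, the same amount of arithmetic.
    return launch_s + n_ops * n_elements * per_element_s

LAUNCH_S = 10e-6       # hypothetical fixed cost per kernel launch
PER_ELEMENT_S = 1e-10  # hypothetical cost per element per add
N_OPS = 4              # x+x+x+x+x contains four additions

for n in (4, 10_000, 1_000_000, 100_000_000):
    slow = unfused_time(N_OPS, n, LAUNCH_S, PER_ELEMENT_S)
    fast = fused_time(N_OPS, n, LAUNCH_S, PER_ELEMENT_S)
    print(f"n={n:>11,}: unfused/fused = {slow / fast:.2f}")
```

For tiny arrays the ratio approaches the number of operators, and for very large arrays it approaches 1, which is the qualitative pattern described above: fusion matters most when the arithmetic is cheap relative to the launch.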
I'm still however largely in the dark about when the costs are getting paid in numba.


Tensorflow not linking operations into single CUDA kernel

I've just started learning how to use Tensorflow and have run into an issue that's making me doubt my understanding of how it should work. I want to get a rough idea of how much performance I should be getting using basic arithmetical operations on a GPU. I create a one-dimensional tensor of 100 million elements and then chain 1000 add operations on this tensor. My expectation is that the Tensorflow run-time would be able to link these operations into a single CUDA kernel that's executed on the GPU; however, when I run it, it seems that each operation is being issued to the GPU separately. It takes around 5 seconds to complete on my GTX 1080 Ti, which gives around 20 Gflops. While running, python.exe is using up a full CPU core and Nvidia Nsight shows many kernels being submitted. In comparison, when I try the same thing with Alea.GPU I get around 3 Tflops and a single CUDA kernel issued.
Am I misunderstanding how basic operations should work on a GPU? Is the only way to get good GPU efficiency to manually group operations into more complex custom operations, or to use the higher-level ML functions?
Thank you.
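import tensorflow as tf
import time
TENSOR_SIZE, TF_REP = 100000000, 1000  # sizes taken from the description above
def testSpeed(x):
    tf.InteractiveSession()    # assumed: the session creation the answer refers to
    z = tf.zeros(TENSOR_SIZE)  # assumed starting tensor
    for i in range(0, TF_REP):
        z = tf.add(z, x)       # assumed loop body: one chained add per iteration
    return tf.reduce_sum(z).eval()

x = tf.range(0.0, TENSOR_SIZE)
t0 = time.perf_counter()       # assumed timer
testSpeed(x)
t1 = time.perf_counter()
print("Time taken "+str(t1-t0)+"s gflops= " + str(TENSOR_SIZE * TF_REP / 1000000000.0 / (t1 - t0)))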
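The throughput figure quoted in the question can be checked by hand. The sketch below redoes that arithmetic and, under the stated assumption of float32 data with no reuse between kernels, also estimates the memory traffic that 1000 separate add kernels would imply.

```python
TENSOR_SIZE = 100_000_000   # elements, from the question
TF_REP = 1000               # chained adds, from the question
elapsed_s = 5.0             # approximate run time reported in the question

flops = TENSOR_SIZE * TF_REP            # one add per element per repetition
gflops = flops / 1e9 / elapsed_s
print(f"{gflops:.0f} GFLOPS")           # 20 GFLOPS

# Assumption: float32 data, and every add kernel reads two arrays and
# writes one, with nothing kept on-chip between kernels.
bytes_per_element = 3 * 4
traffic_gb_per_s = flops * bytes_per_element / 1e9 / elapsed_s
print(f"{traffic_gb_per_s:.0f} GB/s of implied memory traffic")  # 240 GB/s
```

If that assumption holds, each separate kernel spends its time streaming the whole tensor through memory rather than doing arithmetic, which would explain why a single fused kernel can report a far higher flop rate.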
Firstly, you should separate your code into two stages. The first is a build-graph stage, which defines the various tensors; I suggest collecting them in a function called build_graph(). The second creates your session and runs data through the graph. You are trying to apply imperative programming techniques to a declarative, graph-based library.
Next is the issue of swapping data onto and off of the GPU. When you run tf.reduce_sum(z).eval() you are copying the result from GPU back to CPU every time.
Lastly, you are creating many sessions with tf.InteractiveSession(); you should only create one session. Go back to the first issue to resolve this. A best practice is to never create Tensorflow ops after the session has been created. Tensorflow will allow you to, but you shouldn't need to if you have structured things correctly. If you feel like you need to, post a question asking why you can't do XYZ without defining it before the session was created, and someone will almost certainly offer a correction to the workflow.

An example: Am I understanding GPU advantage correctly?

Just reading a bit about what the advantage of a GPU is, and I want to verify I understand on a practical level. Let's say I have 10,000 arrays, each containing a billion simple equations to run. A CPU would need to go through every single equation, one at a time, but with a GPU I could run all 10,000 arrays as 10,000 different threads, all at the same time, so it would finish a ton faster. Is this example spot on or have I misunderstood something?
I wouldn't call it spot on, but I think you're headed in the right direction. Mainly, a GPU is optimized for graphics-related calculations. This does not, however, mean that's all it is capable of.
Without knowing how much detail you want me to go into here, I can say at the very least the concept of running things in parallel is relevant. The GPU is very good at performing many tasks simultaneously in one go (known as running in parallel). CPUs can do this too, but the GPU is specifically optimized to handle much larger numbers of specific calculations with preset data.
For example, to render every pixel on your screen requires a calculation, and the GPU will attempt to do as many of these calculations as it can all at the same time. The more powerful the GPU, the more of these it can handle at once and the faster its clock speed. The end result is a higher-end GPU can run your OS and games in 4k resolution, whereas other cards (or integrated graphics) might only be able to handle 1080p or less.
There's a lot more to this as well, but I figured you weren't looking for the insanely technical explanation.
The bottom line is this: for running a single task on one piece of data, the CPU will normally be faster. A single CPU core is generally much faster than a single GPU core. However, GPUs typically have many more cores, so for running a single task on many pieces of data (where you have to run it once for each piece), the GPU will usually be faster. But these are data-driven situations, and each one should be assessed individually to determine which to use and how to use it.

Could a GPU speed up comparing every pixel between two images?

I've implemented the game where the user must spot 5 differences in two side by side images, and I've made the image comparison engine to find the different regions first. The performance is pretty good (4-10 ms to compare 800x600), but I'm aware GPUs have so much power.
My question is could a performance gain be realized by using all those cores (just to compare each pixel once)... at the cost of copying the images in. My hunch says it may be worthwhile, but my understanding of GPUs is foggy.
Yes, implementing this process to run on the GPU can result in much faster processing time. The amount of performance increase you get is, as you allude to, related to the size of the images you use. The bigger the images, the faster the GPU will complete the process compared to the CPU.
In the case of processing just two images, with dimensions of 800 x 600, the GPU will still be faster. Relatively, that is a very small amount of memory and can be written to the GPU memory quickly.
The algorithm for performing this process on the GPU is not overly complicated, but for someone with no experience of writing code for the graphics card, the cost of learning how to program a GPU is potentially not worth the result of having this algorithm implemented on a GPU. If, however, the goal is to learn GPU programming, this could be a good early exercise. I would recommend first learning GPU programming, which will take some time and should start with even simpler exercises.
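As a rough illustration of why the comparison parallelizes well, here is a minimal NumPy sketch of the per-pixel test on synthetic data. The function name diff_mask and the test images are invented for the example; this is not the asker's comparison engine.

```python
import numpy as np

def diff_mask(img_a, img_b):
    """Boolean mask, True where any colour channel differs between the images."""
    if img_a.shape != img_b.shape:
        raise ValueError("images must have the same shape")
    return np.any(img_a != img_b, axis=-1)

# Two synthetic 800x600 RGB images that differ in one small rectangle.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(600, 800, 3), dtype=np.uint8)
b = a.copy()
b[100:110, 200:220] ^= 0xFF   # invert a 10x20 patch so every pixel in it differs

mask = diff_mask(a, b)
print(mask.sum())  # 200 differing pixels
```

Every output pixel depends only on the matching input pixels, so the work splits cleanly across threads. Since CuPy mirrors much of the NumPy API, the same function would be expected to work on CuPy arrays, with the copy of both images to the GPU as the extra cost to measure.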

Embarrassingly parallelizable computation with CUDA, how to start?

I need to accelerate many computations I am now doing with PyLab. I thought of using CUDA. The overall unit of computation (A) consists of several thousand entirely independent smaller computations (B). Each of them involves, at its initial stage, doing 40-41 independent, even smaller, computations (C). So parallel programming should really help. With PyLab the overall (A) takes 20 minutes and (B) takes a few tenths of a second.
As a beginner in this realm, my question is what level I should parallelize the computation at, whether at (C) or at (B).
I should clarify that the (C) stage consists of taking a bunch of data (thousands of floats), which is shared between all the (C) processes, and doing various tasks, one of the most time-consuming of which is linear regression, which is itself parallelizable. The output of each procedure (C) is a single float. Each computation (B) consists basically of running procedure (C) many times and doing a linear regression on the data that comes out. Its output, again, is a single float.
I'm not familiar with CUDA programming so I am basically asking what would be the wisest strategy to start with.
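Whatever level is chosen, many small independent regressions can often be expressed as one batched array operation, which is the shape of work that maps well to a GPU. The sketch below assumes, purely for illustration, that all fits share the same x values and that only the slope is needed; batched_slopes is a hypothetical helper, not part of PyLab or CUDA.

```python
import numpy as np

def batched_slopes(x, y):
    """Least-squares slope of each row of y against the shared x.

    x has shape (n_points,), y has shape (n_fits, n_points).
    """
    xc = x - x.mean()
    yc = y - y.mean(axis=1, keepdims=True)
    return (yc @ xc) / (xc @ xc)

x = np.arange(10, dtype=np.float64)
true_slopes = np.array([0.5, 2.0, -3.0])
y = true_slopes[:, None] * x + 1.0   # three noise-free lines with intercept 1
print(batched_slopes(x, y))
```

All fits are computed by a single matrix-vector product instead of a Python loop, so thousands of (C)-sized regressions become one large operation rather than thousands of tiny ones.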
An important consideration when deciding how (and if) to convert your project to CUDA is what type of memory access patterns your code requires. The GPU runs threads in groups of 32, called warps, and to get the best performance, the threads in a warp should access the memory in some basic patterns, that are described in the CUDA Programming Guide (included with CUDA). In general, the more random the access patterns, the more likely the kernel is to become memory bound. In that case, the compute power in the GPU cannot be fully utilized.
The other main case in which the compute power of the GPU cannot be fully utilized is when conditional logic and loops cause the threads in a warp to run through different code paths, as the GPU has to run all the threads in the warp through each code path.
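The divergence cost can be illustrated with a toy model in plain Python. It is a simplification that only counts, per warp of 32 threads, how many distinct branches have to be executed one after another; real hardware behaviour has more detail.

```python
WARP_SIZE = 32

def warp_passes(branch_per_thread):
    """Count branch passes under a toy model in which a warp executes every
    distinct branch taken by any of its threads, one after another."""
    passes = 0
    for start in range(0, len(branch_per_thread), WARP_SIZE):
        warp = branch_per_thread[start:start + WARP_SIZE]
        passes += len(set(warp))
    return passes

n_threads = 128
alternating = [tid % 2 for tid in range(n_threads)]              # neighbours disagree
grouped = [(tid // WARP_SIZE) % 2 for tid in range(n_threads)]   # whole warps agree

print(warp_passes(alternating))  # 8: each of the 4 warps runs both branches
print(warp_passes(grouped))      # 4: each warp runs a single branch
```

The same number of threads take each branch in both layouts; only the arrangement differs. Arranging data so that threads in the same warp take the same path is one of the known alternatives worth researching.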
If you find that these points may cause issues for your code, you should also do some research to see if there are known alternative ways to implement your code to run better on the GPU (this is often the case).
If you see your question about at which level to parallelize the computation in the light of the above considerations, it may become clear which choice to make.

JNCI/JCOL kernel optimization

I have a kernel running in OpenCL (via a JOCL front end) that is running horribly slow compared to the other kernels, and I'm trying to figure out why and how to accelerate it. This kernel is very basic. Its sole job is to decimate the number of sample points we have. It copies every Nth point from the input array to a smaller output array to shrink our array size.
The kernel is passed a float specifying how many points to skip between 'good' points. So if it is passed 1.5 it will skip one point, then two, then one, etc., to keep an average of 1.5 points being skipped. The input array is already on the GPU (it was generated by an earlier kernel) and the output array will stay on the GPU, so there is no expense to transfer data to or from the CPU.
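The index mapping described above can be sketched in a few lines of Python. The rounding rule (truncating i * (1 + skip)) is an assumption that reproduces the "skip one, then two, then one" pattern; the actual kernel may compute it differently.

```python
import math

def decimated_indices(n_input, skip):
    """Input indices kept when `skip` points are dropped, on average,
    between consecutive kept points."""
    stride = 1.0 + skip
    n_output = math.ceil(n_input / stride)
    return [int(i * stride) for i in range(n_output)]

print(decimated_indices(12, 1.5))  # [0, 2, 5, 7, 10]
```

Each output index depends only on its own position i, so one work-item per output element is enough. The reads are strided across the input, which is the non-coalesced access pattern the question mentions.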
This kernel is running 3-5 times slower than any of the other kernels, and as much as 20 times slower than some of the fast kernels. I realize that I'm suffering a penalty for not coalescing my array accesses, but I can't believe that it would cause me to run this horribly slow. After all, every other kernel is touching every sample in the array; I would think touching every Xth sample in the array, even if not coalesced, should be at least around the same speed as touching every sample in an array.
The original kernel actually decimated two arrays at once, for real and imaginary data. I tried splitting the kernel up into two kernel calls, one to decimate real and one to decimate imaginary data, but this didn't help at all. Likewise I tried 'unrolling' the kernel by having one thread be responsible for decimation of 3-4 points, but this didn't help either. I've tried varying the size of data passed into each kernel call (i.e. one kernel call on many thousands of data points, or a few kernel calls on a smaller number of data points), which has allowed me to tweak out small performance gains, but not the order of magnitude I need for this kernel to be considered worth implementing on the GPU.
Just to give a sense of scale, this kernel is taking 98 ms to run per iteration, while the FFT takes only 32 ms for the same input array size and every other kernel is taking 5 ms or less. What else could cause such a simple kernel to run so absurdly slow compared to the rest of the kernels we're running? Is it possible that I actually can't optimize this kernel sufficiently to warrant running it on the GPU? I don't need this kernel to run faster than the CPU, just not so much slower than the CPU, so that I can keep all processing on the GPU.
It turns out the issue isn't with the kernel at all. Instead, the problem is that when I try to release the buffer I was decimating, it causes the entire program to stall while the kernel (and all other kernels in the queue) complete. This appears to be functioning incorrectly: the clrelease should only decrement a counter, so far as I understand, not block on the queue. However, the important point is that my kernel is running as efficiently as it should be.