Basic GPU Architecture

I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/threads.
First of all, I would like to understand if I got these facts straight:
The GPU has some streaming multiprocessors (SMs) that can work in parallel.
Every SM has Blocks, and every Block has its own shared memory.
Inside Blocks we have CUDA cores.
And inside each SM we also have warp schedulers.
Every warp consists of 32 threads, and the warps inside a Block work serially.
That means one warp has to finish before the next warp can start.
Not all Blocks in the GPU can run in parallel at the same time; the number of Blocks that run in parallel is limited (but I do not know how, or by how many).
Now my main question is:
Assume we have 2 different situations:
First, we have 32 threads per block (1 warp) and 16384 objects (or operands).
Second, we have 64 threads per block (2 warps) with the same number of objects.
Why does the first one take more time to run the program? (Other conditions are the same, and even with 128 threads per block it is faster than both of those.)
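For concreteness, here is a minimal sketch of the two launch configurations I am comparing (the kernel name and its body are just placeholders):

// Hypothetical kernel: one thread processes one of the 16384 objects.
__global__ void processObjects(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;              // placeholder work
}

void runBothCases(float *d_data)      // d_data is assumed to be a device pointer
{
    // Case 1: 32 threads per block -> 16384 / 32 = 512 blocks
    processObjects<<<512, 32>>>(d_data, 16384);

    // Case 2: 64 threads per block -> 16384 / 64 = 256 blocks
    processObjects<<<256, 64>>>(d_data, 16384);
}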
Thanks in advance

Related

OpenMP: how to make sure each thread works at least 1 iteration in dynamic scheduling

I am using dynamic scheduling for the loop iterations. But when the work in each iteration is too small, or when there is a huge number of threads, some threads do no work at all. E.g. there are 100 iterations and 90 threads; I want every thread to do at least one iteration, and the remaining 10 iterations can be distributed to the threads that have finished their job. How can I do that?
You cannot force the OpenMP runtime to do this. However, you can give hints to the OpenMP runtime so that it will likely do that when (it decides that) it is possible, at the cost of a higher overhead.
One way is to specify the granularity of the dynamically scheduled loop.
Here is an example:
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < 100; ++i)
    compute(i);
With such code, the runtime is free to share the work evenly between threads (using a work-sharing scheduler) or to let threads steal the work of a master thread that drives the parallel computation (using a work-stealing scheduler). In the second approach, although the granularity is 1 loop iteration, some threads could steal more work than they actually need (e.g. to generally improve performance). If the loop iterations are fast enough, the work will probably not be balanced between threads.
Creating 90 threads is costly, and sending work to 90 threads is also far from free, as it is mostly bounded by the relatively high latency of atomic operations, their scalability, as well as the latency of waking threads up.
Moreover, while such operations appear to be synchronous from the user's point of view, this is not the case in practice (especially with 90 threads and on multi-socket NUMA-based architectures).
As a result, some threads may already have finished computing one iteration of the loop while others may not yet be aware of the parallel computation, or may not even have been created yet.
The overhead of making threads aware of the computation to be done generally grows as the number of threads increases.
In some cases, this overhead can be higher than the actual computation, and it can be more efficient to use fewer threads.
OpenMP runtime developers sometimes have to trade work balancing for smaller communication overheads. Those decisions can perform badly in your case but improve the scalability of other kinds of applications. This is especially true for work-stealing schedulers (e.g. the Clang/ICC OpenMP runtime). Note that improving the scalability of OpenMP runtimes is an ongoing research field.
I advise you to try multiple OpenMP runtimes (including research ones that may or may not be suitable for production code).
You can also play with the OMP_WAIT_POLICY variable to reduce the overhead of waking threads up.
You can also try to use OpenMP tasks to push the runtime a bit harder not to merge iterations (see the sketch below).
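As a rough illustration of the task-based alternative, assuming a compiler that supports the OpenMP 4.5 taskloop construct, something like the following asks the runtime to create one task per iteration via grainsize(1); compute(i) is the same placeholder as above:

#pragma omp parallel
#pragma omp single
#pragma omp taskloop grainsize(1)
for (int i = 0; i < 100; ++i)
    compute(i);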
I also advise you to profile your code to see what is going on and find potential software/hardware bottlenecks.
Update
If you use more OpenMP threads than there are hardware threads on your machine, the processor cannot execute them simultaneously (it can only execute one OpenMP thread on each hardware thread). Consequently, the operating system on your machine schedules the OpenMP threads onto the hardware threads so that they seem to be executed simultaneously from the user's point of view. However, they are not running simultaneously; they are executed in an interleaved way during very small quanta of time (e.g. 100 ms).
For example, if you have a processor with 8 hardware threads and you use 8 OpenMP threads, you can roughly assume that they will run simultaneously.
But if you use 16 OpenMP threads, your operating system can choose to schedule them in the following way:
the first 8 threads are executed for 100 ms;
the last 8 threads are executed for 100 ms;
the first 8 threads are executed again for 100 ms;
the last 8 threads are executed again for 100 ms;
etc.
If your computation lasts for less than 100 ms, the OpenMP dynamic/guided schedulers will move the work of the last 8 threads to the first 8 threads so that the overall execution time is faster. Consequently, the first 8 threads can execute all the work, and the last 8 threads will not have anything to do once they are finally executed. This is the cause of the work imbalance between threads.
Thus, if you want to measure the performance of an OpenMP program, you shall NOT use more OpenMP threads than hardware threads (unless you know exactly what you are doing and you are fully aware of such effects).
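A minimal way to enforce that rule in code, assuming the standard omp_get_num_procs / omp_set_num_threads API (compute(i) is the placeholder from the earlier example):

#include <omp.h>

void compute(int i);   // placeholder work function, as in the earlier example

int main(void)
{
    // Never ask for more OpenMP threads than there are hardware threads.
    omp_set_num_threads(omp_get_num_procs());

    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < 100; ++i)
        compute(i);

    return 0;
}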

Do gpu cores switch tasks when they're done with one?

I'm experimenting with C++ AMP; one thing that's unclear from the MS documentation is this:
If I dispatch a parallel_for_each with an extent of say 1000, then that would mean that it spawns 1000 threads. If the gpu is unable to take on those 1000 threads at the same time, it completes them 300 at a time or 400 or whatever number it can do. Then there was some vague stuff on warps and tiles out of which I got this impression:
Regardless of how the threads are tiled together (or not at all), the whole group must finish before taking on new tasks, so if the internally assigned group has a size of 128 and 30 of them finish, the 30 cores will idle until the other 98 are done too. Is that true? Also, how do I find out what this internal group size is?
During my experimentation, it certainly appears to have some truth to it because assigning more even amounts of work to the threads seems to speed things up, even if there is slightly more work overall.
The reason I'm trying to figure this out is that I'm deciding whether or not to engage in another lengthy experiment that would be based on threads getting uneven amounts of work (sometimes by a factor of 10x), but all the threads would be independent, so data-wise the cores would be free to pick up another thread.
In practice, the underlying execution model of AMP on GPU is the same as CUDA, OpenCL, Compute Shaders, etc. The only thing that changes is the naming of each concept. So if you feel that the AMP documentation is lacking, consider reading up on CUDA or OpenCL. Those are significantly more mature APIs and the knowledge you gain from them applies as well to AMP.
If I dispatch a parallel_for_each with an extent of say 1000, then that would mean that it spawns 1000 threads. If the gpu is unable to take on those 1000 threads at the same time, it completes them 300 at a time or 400 or whatever number it can do.
Maybe. From the high-level view of parallel_for_each, you don't have to care about this. The threads may as well be executed sequentially, one at a time.
If you launch 1000 threads without specifying a tile size, the AMP runtime will choose a tile size for you, based on the underlying hardware. If you specify a tile size, then AMP will use that one.
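For what it's worth, specifying the tile size explicitly in C++ AMP looks roughly like the sketch below; the doubling kernel, the tile size of 256 and the assumption that the extent is a multiple of it are all mine, not something from your code:

#include <amp.h>
#include <vector>
using namespace concurrency;

void double_all(std::vector<float>& data)
{
    array_view<float, 1> av(static_cast<int>(data.size()), data);

    // Explicit tile size of 256 threads; the extent must be divisible by it.
    parallel_for_each(av.extent.tile<256>(),
                      [=](tiled_index<256> idx) restrict(amp)
    {
        av[idx.global] *= 2.0f;
    });

    av.synchronize();
}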
GPUs are made of multiprocessors (in CUDA parlance, or compute units in OpenCL), each composed of a number of cores.
Tiles are assigned per multiprocessor: all threads within the same tile will be run by the same multiprocessor, until all threads within that tile run to completion. Then, the multiprocessor will pick another available tile (if any) and run it, until all tiles are executed. Multiprocessors can execute multiple tiles simultaneously.
if the internally assigned group has the size of 128 and 30 of them finish, the 30 cores will idle until the other 98 are done too. Is that true?
Not necessarily. As mentioned earlier, a multiprocessor may have multiple active tiles. It may therefore schedule threads from other tiles to remain busy.
Important note: on a GPU, threads are not executed at a granularity of 1. For example, NVIDIA hardware executes 32 threads at once.
To not make this answer needlessly lengthy, I encourage you to read up on the concept of warp.
The GPU certainly won't run 1000 threads at the same time, but it also won't complete them 300 at a time.
It uses multithreading, which means that just like in a CPU, it will share run time among the 1000 threads allowing them to complete seemingly at the same time.
Keep in mind that creating a lot of threads may not be worthwhile for several reasons. For instance, if you must complete all 1000 tasks in step 1 before doing step 2, you might as well distribute them over a number of threads equal to the number of cores in your GPU and no more than that.
Using more threads than the number of cores only makes sense if you want to dispatch tasks that are not being waited on, or because you feel that writing your code this way is easier. But keep in mind that thread management is time-costly too and may drag down your performance.

CUDA optimisation - kernel launch conditions

I am fairly new to CUDA and would like to find out more about optimising kernel launch conditions to speed up my code. This is quite a specific scenario but I'll try to generalise it as much as possible so anyone else with a similar question can gain from this in the future.
Assume I've got an array of 300 elements (Array A) that is sent to the kernel as an input. This array is made of a few repeating integers with each integer having a device function specific to it. For example, every time 5 appears in Array A, the kernel performs the function specific to 5. These functions are device functions.
How I have parallelised this problem is by launching 320 blocks (probably not the best number) so that each block will perform the device function relevant to its element in parallel.
The CPU would handle the entire problem in a serial fashion, taking element by element and calling each function one after the other, whereas the GPU would allocate an element to each block so that all 320 blocks can access the relevant device functions and calculate simultaneously.
In theory, for a large number of elements the GPU should be faster - at least I thought so, but in my case it isn't. My assumption is that since 300 elements is a small number, the CPU will always be faster than the GPU.
This is acceptable BUT what I want to know is how I can cut down the GPU execution time at least by a little. Currently, the CPU takes 2.5 milliseconds and the GPU around 12 ms.
Question 1 - How can I choose the optimum number of blocks/threads to launch at the start?
First I tried 320 blocks with 1 thread per block. Then 1 block with 320 threads. No real change in execution time. Will tweaking the number of blocks/threads improve the speed?
Question 2 - If 300 elements is too small, why is that, and roughly how many elements do I need to see the GPU outperforming the CPU?
Question 3 - What optimisation techniques should I look into?
Please let me know if any of this isn't that clear and I'll expand on it.
Thanks in advance.
Internally, CUDA manages threads in groups of 32 (so-called warps). If you have 1 thread per block, the device will still execute 32 of them - 31 threads will simply be inactive (in a divergent state). This is potentially an occupancy issue, though you may not observe it on your device and with your problem size. There is also a limit on the number of blocks a given multiprocessor (SM) can execute. AFAIR, GeForce 4x can run up to 8 blocks on one SM. Hence, if you have a device with 8 SMs, you can simultaneously run only 64 threads if your block size is 1. You can use a tool called the occupancy calculator to estimate a better block size - or you can use the visual profiler.
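As an illustration only (the kernel, the device functions and the array names are placeholders, and cudaOccupancyMaxPotentialBlockSize requires CUDA 6.5 or newer), deriving the grid size from a warp-friendly block size could look like this:

__device__ float deviceFunc5(int x)       { return x * 2.0f; }   // placeholder
__device__ float deviceFuncDefault(int x) { return x * 1.0f; }   // placeholder

// Hypothetical kernel that dispatches on the integer stored in A[i].
__global__ void dispatchKernel(const int *A, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    switch (A[i]) {                       // call the device function matching the value
        case 5:  out[i] = deviceFunc5(A[i]);       break;
        default: out[i] = deviceFuncDefault(A[i]); break;
    }
}

void launch(const int *d_A, float *d_out, int n)
{
    int minGridSize = 0, blockSize = 0;
    // Let the runtime suggest a block size that maximizes occupancy.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, dispatchKernel, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;   // ceiling division
    dispatchKernel<<<gridSize, blockSize>>>(d_A, d_out, n);
}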
This can only be decided by profiling. There are too many unknowns - e.g. what is your ratio of memory accesses to actual computations, how parallelizable your task is, etc.
I would really recommend that you start with the best practices guide.

How does a GPU group threads into warps/wavefronts?

My understanding is that a warp is a group of threads defined at runtime through the task scheduler. One performance-critical part of CUDA is the divergence of threads within a warp. Is there a way to make a good guess at how the hardware will construct warps within a thread block?
For instance, I start a kernel with 1024 threads in a thread block. How are the warps arranged? Can I tell that (or at least make a good guess) from the thread index?
By doing this, one could minimize the divergence of threads within a given warp.
The thread arrangement inside the warp is implementation dependent, but so far I have always observed the same behaviour:
A warp is composed of 32 threads, but the warp scheduler will issue 1 instruction for half a warp at a time (16 threads).
If you use 1D blocks (only the threadIdx.x dimension is valid), then the warp scheduler will issue 1 instruction for threadIdx.x = (0..15), (16..31), ... etc.
If you use 2D blocks (the threadIdx.x and threadIdx.y dimensions are valid), then the warp scheduler will try to issue in the following fashion:
threadIdx.y = 0: threadIdx.x = (0..15), (16..31), ... etc.
So the threads with consecutive threadIdx.x components will execute the same instruction in groups of 16.
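A small sketch of that numbering, assuming the usual x-first linearisation of the thread index (also quoted further below):

#include <cstdio>

__global__ void warpInfo()
{
    // Linear thread ID within the block: x varies fastest, then y, then z.
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;

    int warpId = linear / warpSize;   // which warp this thread belongs to
    int lane   = linear % warpSize;   // position (0..31) inside that warp

    if (lane == 0)                    // one line of output per warp
        printf("warp %d starts at block-linear thread %d\n", warpId, linear);
}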
A warp consists of 32 threads that will be executed at the same time. At any given time a batch of 32 will be executing on the GPU, and this is called a warp.
I haven't found anywhere that states that you can control which warp is going to execute next; the only thing you know is that it consists of 32 threads and that a thread block should always be a multiple of that number.
Threads in a single block will be executed on a single multiprocessor, sharing the software data cache, and can synchronize and share data with threads in the same block; a warp will always be a subset of threads from a single block.
There is also this, with regards to memory operations and latency:
When the threads in a warp issue a device memory operation, that instruction will take a very long time, perhaps hundreds of clock cycles, due to the long memory latency. Mainstream architectures would add a cache memory hierarchy to reduce the latency, and Fermi does include some hardware caches, but mostly GPUs are designed for stream or throughput computing, where cache memories are ineffective. Instead, these GPUs tolerate memory latency by using a high degree of multithreading. A Tesla supports up to 32 active warps on each multiprocessor, and a Fermi supports up to 48. When one warp stalls on a memory operation, the multiprocessor selects another ready warp and switches to that one. In this way, the cores can be productive as long as there is enough parallelism to keep them busy.
source
With regards to dividing up threadblocks into warps, I have found this:
if the block is 2D or 3D, the threads are ordered by first dimension, then second, then third – then split into warps of 32
source
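Putting the two quotes together, here is a hedged sketch of how that split into warps affects divergence (the branch conditions are made up for illustration):

__global__ void branching(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: even and odd lanes of the SAME warp take different paths,
    // so the warp executes both branches one after the other.
    if (i % 2 == 0) data[i] += 1.0f;
    else            data[i] -= 1.0f;

    // Non-divergent: all 32 lanes of a warp share the same value of i / 32,
    // so each warp takes exactly one of the two paths.
    if ((i / 32) % 2 == 0) data[i] *= 2.0f;
    else                   data[i] *= 0.5f;
}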

How to leverage blocks/grid and threads/block?

I'm trying to accelerate this database search application with CUDA, and I'm working on running a core algorithm in parallel with CUDA.
In one test, I ran the algorithm in parallel across a digital sequence of size 5000 with 500 blocks per grid and 100 threads per block, and came back with a run time of roughly 500 ms.
Then I increased the size of the digital sequence to 8192 with 128 blocks per grid and 64 threads per block and somehow came back with a result of 350 ms to run the algorithm.
This would indicate that the number of blocks and threads used, and how they're related, does impact performance.
My question is how to decide the number of blocks/grid and threads/block?
Below I have my GPU specs from a standard device query program:
You should test it because it depends on your particular kernel. One thing you must aim to do is to make the number of threads per block a multiple of the number of threads in a warp. After that you can aim for high occupancy of each SM, but that is not always synonymous with higher performance. It has been shown that sometimes lower occupancy can give better performance. Memory-bound kernels usually benefit more from higher occupancy to hide memory latency; compute-bound kernels not so much. Testing the various configurations is your best bet.
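To make "test it" concrete, a minimal timing sketch (the kernel and the candidate sizes are placeholders, not your actual algorithm) could look like this:

#include <cstdio>

// Hypothetical kernel to benchmark.
__global__ void coreAlgorithm(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i] + 1.0f;    // placeholder work
}

void sweepBlockSizes(float *d_data, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Only multiples of the warp size (32) are worth trying.
    for (int threads = 32; threads <= 1024; threads *= 2) {
        int blocks = (n + threads - 1) / threads;     // ceiling division

        cudaEventRecord(start);
        coreAlgorithm<<<blocks, threads>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d threads/block, %6d blocks: %.3f ms\n", threads, blocks, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}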