How is the #pragma omp simd directive translated for a GPU target device?
A GPU's cores each handle a separate thread. Threads are combined into groups of 32 (a single warp) and assigned to 32 cores for the execution of a single instruction. But SIMD is a sub-thread concept: a single core should have a vector register and be able to handle several chunks of data in the context of a single thread. That is not possible on a GPU core (each core handles a separate thread in a scalar manner).
Does that mean a simd directive can't be translated for a GPU?
Or maybe each thread is handled as if it had a single SIMD lane?
Or maybe the SIMD iterations are spread across an entire warp of 32 threads (but what about memory access then)?
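For concreteness, here is a minimal sketch (not an authoritative mapping) of the kind of offloaded loop being asked about, using the OpenMP 4.5+ combined construct; the function and variable names are illustrative, and how the trailing simd clause is realized on the device (ignored, one lane per thread, or spread over a warp) is exactly the implementation-defined detail the question raises.

    #include <stdio.h>

    void saxpy(int n, float a, float *x, float *y) {
        /* Offload the loop to the target device; the simd clause at the end
           is the directive whose GPU translation is in question. */
        #pragma omp target teams distribute parallel for simd \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        enum { N = 1 << 20 };
        static float x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy(N, 3.0f, x, y);
        printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
        return 0;
    }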
Related
On a general-purpose CPU, parallel processing is performed by splitting a calculation/problem into sub-problems, distributing them, and running them in parallel on a number of cores on one or several sockets/servers.
What is the execution "flow" on a GPU, from loading data to sending back results to the CPU? What are the key differences between execution on a GPU and execution on a CPU?
Should we see a GPU as a "kind of CPU with a higher (huge) number of smaller cores", or are there additional differences in nature?
The fundamental difference in parallel processing between a CPU and a GPU is that CPUs are MIMD (Multiple-Instruction, Multiple-Data), while GPUs are SIMD (Single-Instruction, Multiple-Data). In a multicore CPU, each core fetches its instructions and data independently of the others, whereas in a GPU there is only one instruction stream for a group of cores (typically 32 or 64). While there is only one instruction stream for the 32/64 cores, each of them works on different data elements (typically located together in memory; more below). Such SIMD execution means that the GPU cores operate in lock-step.
For the above-mentioned reason, a GPU can't be viewed as a "kind of CPU with a higher (huge) number of smaller cores".
In order to support SIMD execution (also sometimes called wide execution), we need a wide fetch of input data. For a 32-wide execution, we fetch a contiguous 4B x 32 = 128B block that is consumed (typically) entirely by a 32-wide pipe. Contrast this with a MIMD multicore, where each of 32 CPU cores would fetch a separate instruction and then load from 32 different cache lines. This SIMD nature of (wide) instruction/data fetch results in huge power savings compared to MIMD. As a result, for the same power budget, we can put more cores on a GPU (=> more HW parallelism) than on a multicore CPU.
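As a small illustration of that wide fetch, the following CUDA kernel (the name is made up for the example) has thread i of each warp read the i-th consecutive float, so the 32 four-byte loads of one warp fall in a single contiguous 128B block and can be serviced as one coalesced transaction.

    __global__ void scale(const float *x, float *y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads touch consecutive 4B elements
        if (i < n)
            y[i] = a * x[i];  // one warp reads a contiguous 32 x 4B = 128B block
    }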
The SIMD nature of GPUs is driven by applications that do exactly the same operation over very many input elements (e.g., image processing, where we apply a filter to every pixel of, say, a 1024x768 image), so the wide instruction/data fetch works well. At the same time, applications where each core's computation differs (e.g., take the if() branch when the input data is zero and the else() branch when it is one), or where each core needs to fetch data from a different page, fail to take advantage of the SIMD nature of GPUs.
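A hedged sketch of the divergent pattern just mentioned (kernel name and values are illustrative): when threads of the same warp see different inputs, the warp executes both branches, with the threads for which a branch does not apply masked off.

    __global__ void branchy(const int *in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (in[i] == 0)
                out[i] = -1;   // taken by the threads whose input is zero
            else
                out[i] = +1;   // taken by the remaining threads of the warp
        }
    }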
A partially related fact is that GPUs support applications (e.g., images/videos) that are streaming (almost zero data reuse) and have massive data-parallelism. Streaming means that we don't need huge caches like CPUs, and massive data-parallelism almost entirely cuts the need for HW coherence mechanisms.
I have read in the book "Operating System Internals and Design Principles" by William Stallings that GPUs are Single-Instruction, Multiple-Data (SIMD), and I don't understand what that means. I searched on Google and came up with the following assumption, which I am not sure is right or wrong:
A SIMD GPU means the GPU processes only one instruction on an array of data; for example, in a game, the GPU is only responsible for the graphical representation of the game while the rest of the calculation is done by the CPU. Is that true?
In the context of GPUs, SIMD is a type of hardware architecture in which there are simultaneous (parallel) computations (executions of an instruction), but only a single process (instruction) at any given moment.
Schematically, the SIMD architecture is usually drawn as a single instruction pool feeding several processing units (PUs), each of which reads its operands from a shared data pool:
(diagram credit: Wikipedia, https://en.wikipedia.org/wiki/SIMD)
The data pool in our context is the GPU memory, and a PU is a processing unit or execution unit (a CUDA core, in NVIDIA's GPU terms).
Bottom line: the GPU executes the same instruction simultaneously over different data, one data element per processing unit.
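A minimal CUDA sketch of that bottom line (the kernel name is illustrative): every thread executes the identical add instruction, but each one operates on its own element drawn from the shared data pool.

    __global__ void add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // each PU / CUDA core gets its own element
        if (i < n)
            c[i] = a[i] + b[i];  // same instruction, different data
    }

    // A possible launch, one thread per element: add<<<(n + 255) / 256, 256>>>(a, b, c, n);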
I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/threads.
First of all, I would like to understand if I got these facts straight:
The GPU has a number of streaming multiprocessors (SMs) that can work in parallel.
Every SM has blocks, and every block has its own shared memory.
Inside blocks we have CUDA cores.
And inside each SM we also have warp schedulers.
Every warp consists of 32 threads, and warps inside a block work serially.
That means one warp should finish before the next warp can start.
Not all blocks in the GPU can run in parallel at the same time; the number of blocks that run in parallel is limited (but I do not know how, or how many).
Now my main question is:
Assume we have 2 different situations:
First, we have 32 threads per block (1 warp) and 16384 objects (or operands).
Second, we have 64 threads per block (2 warps) with the same number of objects.
Why does the first one take more time to run the program? (Other conditions are the same, and even with 128 threads per block it is faster than either of those two.)
Thanks in advance
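For concreteness, here is a sketch of the two launch configurations described in the question, with a trivial placeholder kernel (process and d_data are made up for the example); the only difference between the two launches is the block/warp granularity, 512 blocks of 1 warp versus 256 blocks of 2 warps.

    #include <cuda_runtime.h>

    // Trivial placeholder kernel; the point here is only the launch geometry.
    __global__ void process(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int n = 16384;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // Case 1 from the question: 32 threads per block (1 warp) -> 512 blocks.
        process<<<(n + 31) / 32, 32>>>(d_data, n);

        // Case 2: 64 threads per block (2 warps) -> 256 blocks.
        process<<<(n + 63) / 64, 64>>>(d_data, n);

        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }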
Are coprocessors like the Intel Xeon Phi supposed to be utilized much like GPUs, so that one should offload a large number of blocks executing a single kernel and only the overall throughput the coprocessor handles results in a speedup, OR will offloading independent threads (tasks) increase efficiency as well?
The Xeon Phi requires a large degree of both functional parallelism (different threads) and vector parallelism (SIMD). Since the cores are essentially enhanced Pentium processors, serial code runs slowly. This will change somewhat with the next generation as it'll use faster and more modern cores. The current Xeon Phi also suffers from the I/O bottleneck as does any coprocessor, having to communicate over a PCIe bus.
So although you could offload a kernel to the coprocessor and exploit the 512-bit vectorization (similar to a GPGPU), you can also separate your code into many different functional blocks (i.e., different codes/kernels) and run them on different sets of Intel Xeon Phi cores. Again, the different blocks of code must also exploit the 512-bit SIMD vectors.
The Xeon Phi also operates as a native processor, so you can access other resources by mounting NFS directory trees, communicating between cards and other processors in the cluster using TCP/IP, using MPI, etc. Note that this is not 'offload' but native execution. But the PCIe bus is still a significant bottleneck limiting I/O.
To summarize,
You can use an offload model similar to that used by GPGPUs,
The Xeon Phi itself can also support functional parallelism (more than one kernel), but each kernel must also exploit the 512-bit SIMD,
You can also write native code and use MPI, treating the Xeon Phi as a conventional (non-offload) node (always remembering the PCIe I/O bottleneck).
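As a rough sketch of the offload model in the first point of the summary (not the only way to do it; the exact pragmas and flags depend on the toolchain, and the function and variable names here are illustrative), an OpenMP 4.x target region can send a loop to the coprocessor while the inner simd clause requests vectorization so the 512-bit units are used.

    #include <stdio.h>

    void scale_on_phi(float *a, int n, float factor) {
        /* Offload the loop; the simd clause asks the compiler to vectorize it. */
        #pragma omp target map(tofrom: a[0:n])
        #pragma omp parallel for simd
        for (int i = 0; i < n; ++i)
            a[i] *= factor;
    }

    int main(void) {
        enum { N = 1 << 16 };
        static float a[N];
        for (int i = 0; i < N; ++i) a[i] = 1.0f;
        scale_on_phi(a, N, 2.0f);
        printf("a[0] = %f\n", a[0]);   /* expect 2.0 */
        return 0;
    }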
My understanding is that a warp is a group of threads defined at runtime through the task scheduler. One performance-critical part of CUDA is the divergence of threads within a warp. Is there a way to make a good guess about how the hardware will construct warps within a thread block?
For instance, if I start a kernel with 1024 threads in a thread block, how are the warps arranged? Can I tell that (or at least make a good guess) from the thread index?
By knowing this, one could minimize the divergence of threads within a given warp.
The thread arrangement inside the warp is implementation-dependent, but so far I have always observed the same behavior:
A warp is composed of 32 threads, but the warp scheduler will issue 1 instruction for half a warp at a time (16 threads).
If you use 1D blocks (only the threadIdx.x dimension is valid), then the warp scheduler will issue 1 instruction for threadIdx.x = (0..15), (16..31), ... etc.
If you use 2D blocks (the threadIdx.x and threadIdx.y dimensions are valid), then the warp scheduler will try to issue in this fashion:
threadIdx.y = 0, threadIdx.x = (0..15), (16..31), ... etc.
So, threads with consecutive threadIdx.x components will execute the same instruction in groups of 16.
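Here is a small sketch (the kernel and array names are made up) of the warp and lane indices implied by that ordering, for a 1D grid of 1D or 2D blocks: threads are linearized x-first, then y, and every 32 consecutive linear indices form one warp.

    __global__ void who_am_i(int *warp_of, int *lane_of) {
        int linear = threadIdx.y * blockDim.x + threadIdx.x;  // for a 1D block this is just threadIdx.x
        int warp = linear / 32;   // which warp of the block this thread belongs to
        int lane = linear % 32;   // position of the thread inside that warp
        int slot = blockIdx.x * (blockDim.x * blockDim.y) + linear;  // global output index (1D grid assumed)
        warp_of[slot] = warp;
        lane_of[slot] = lane;
    }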
A warp consists of 32 threads that will be executed at the same time. At any given time a batch of 32 threads will be executing on the GPU, and that batch is called a warp.
I haven't found anything stating that you can control which warp is going to execute next; the only things you know are that a warp consists of 32 threads and that a thread block should always be a multiple of that number.
Threads in a single block will be executed on a single multiprocessor, sharing the software data cache, and can synchronize and share data with threads in the same block; a warp will always be a subset of threads from a single block.
There is also this, with regard to memory operations and latency:
When the threads in a warp issue a device memory operation, that instruction will take a very long time, perhaps hundreds of clock cycles, due to the long memory latency. Mainstream architectures would add a cache memory hierarchy to reduce the latency, and Fermi does include some hardware caches, but mostly GPUs are designed for stream or throughput computing, where cache memories are ineffective. Instead, these GPUs tolerate memory latency by using a high degree of multithreading. A Tesla supports up to 32 active warps on each multiprocessor, and a Fermi supports up to 48. When one warp stalls on a memory operation, the multiprocessor selects another ready warp and switches to that one. In this way, the cores can be productive as long as there is enough parallelism to keep them busy.
source
With regard to dividing up thread blocks into warps, I have found this:
if the block is 2D or 3D, the threads are ordered by first dimension, then second, then third – then split into warps of 32
source
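The quoted ordering rule, written out as a small hedged sketch for a 3D block: linearize the thread index x-first, then y, then z, and split the result into groups of 32.

    __device__ int warp_in_block() {
        int linear = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
        return linear / 32;  // 32 = warpSize on current NVIDIA hardware
    }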