GPU execution "flow" vs. CPU

On a general-purpose CPU, parallel processing is performed by splitting the calculation/problem into sub-problems, distributing them, and running them in parallel on a number of cores on one or several sockets/servers.
What is the execution "flow" on a GPU, from loading data to sending results back to the CPU? What are the key differences between execution on a GPU and execution on a CPU?
Should we see a GPU as a "kind of CPU with a higher (huge) number of smaller cores", or are there additional differences in nature?

The fundamental difference in parallel processing between a CPU and a GPU is that CPUs are MIMD (Multi-Instruction-Multi-Data), while GPUs are SIMD (Single-Instruction-Multi-Data). In a multicore CPU, each core fetches its instructions and data independently of the others, whereas in a GPU there is only one instruction stream for a group of cores (typically 32 or 64). While there is only one instruction stream for the 32/64 cores, each of them is working on different data elements (typically located together in memory; more below). Such SIMD execution means that the GPU cores operate in a lock-stepped fashion.
For the above-mentioned reason, a GPU can't be viewed as a "kind of CPU with a higher (huge) number of smaller cores".
In order to support SIMD execution (also sometimes called wide-execution), we need wide fetch of input data. For a 32-wide execution, we fetch a contiguous 4B x 32 block = 128B that is consumed (typically) entirely by a 32-wide pipe. Contrast this to a MIMD multicore, where each of 32 CPU cores would fetch a separate instruction and then load from 32 different cachelines. This SIMD nature of (wide-) instruction/data fetch results in huge power savings compared to MIMD. As a result, for the same power budget, we can put more cores on a GPU (=> more HW parallelism) than a multicore-CPU.
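As a rough sketch of this point, compare a coalesced copy kernel, where the 32 threads of a warp consume one contiguous 32 x 4B = 128B block per load, with a strided one, where each thread touches a different cache line; the kernel names, sizes and stride are purely illustrative:

```
// Coalesced: thread i reads element i, so one warp-wide load consumes a
// single contiguous 32 x 4B = 128B block of memory.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read elements far apart, so one warp-wide
// load touches many different cache lines and loses the wide-fetch benefit.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(1LL * i * stride) % n];
}
```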
The SIMD nature of GPUs is driven by applications that do exactly the same operation over very many input elements (e.g., image processing, where we apply a filter to every pixel of, say, a 1024x768 image), so that wide instruction/data fetch works well. At the same time, applications where each core's computation is different (e.g., taking the if() branch when the input element is zero and the else() branch when it is one), or where each core needs to fetch data from a different page, fail to take advantage of the SIMD nature of GPUs.
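A minimal sketch of that divergence problem (the two branch bodies are made-up placeholders): when threads of the same warp take different branches, the warp executes both paths, with part of its lanes masked off each time.

```
__device__ float path_if(float x)   { return x + 1.0f; }  // placeholder work
__device__ float path_else(float x) { return x * 2.0f; }  // placeholder work

// If some threads of a warp see in[i] == 0 and others don't, the warp runs
// path_if and path_else back to back, with the non-participating lanes
// predicated off, so part of the SIMD width is wasted on each pass.
__global__ void divergent(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] == 0.0f)
            out[i] = path_if(in[i]);
        else
            out[i] = path_else(in[i]);
    }
}
```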
A partially related fact is that GPUs support applications (e.g., images/videos) that are streaming (almost zero data reuse) and have massive data-parallelism. Streaming means that we don't need huge caches like CPUs, and massive data-parallelism almost entirely cuts the need for HW coherence mechanisms.

Related

Does low GPU utilization indicate a bad fit for GPU acceleration?

I'm running some GPU-accelerated PyTorch code and training it against a custom dataset, but while monitoring the state of my workstation during the process, I see GPU usage along the following lines:
I have never written my own GPU primitives, but I have a long history of doing low-level optimizations for CPU-intensive workloads and my experience there makes me concerned that while pytorch/torchvision are offloading the work to the GPU, it may not be an ideal workload for GPU acceleration.
When optimizing CPU-bound code, the goal is to get the CPU to perform as much (meaningful) work as possible per unit of time: a supposedly CPU-bound task that shows only 20% CPU utilization (of a single core or of all cores, depending on whether the task is parallelizable) is a task that is not being performed efficiently, because the CPU is sitting idle when ideally it would be working towards your goal. Low CPU usage means that something other than number crunching is taking up your wall-clock time, whether it's inefficient locking, heavy context switching, pipeline flushes, blocking I/O in the main loop, etc., which prevents the workload from properly saturating the CPU.
When looking at the GPU utilization in the chart above, and again speaking as a complete novice when it comes to GPU utilization, it strikes me that the GPU usage is extremely low and appears to be limited by the rate at which data is being copied into the GPU memory. Is this assumption correct? I would expect to see a spike in copy (to GPU) followed by an extended period of calculations/transforms, followed by a brief copy (back from the GPU), repeated ad infinitum.
I notice that despite the low (albeit constant) copy utilization, the GPU memory is constantly peaking at the 8GB limit. Can I assume the workload is being limited by the low GPU memory available (i.e. not maxing out the copy bandwidth because there's only so much that can be copied)?
Does that mean this is a workload better suited for the CPU (in this particular case with this RTX 2080 and in general with any card)?
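For reference, the copy/compute/copy pattern described above looks roughly like this at the CUDA level (a minimal sketch, not the PyTorch workload in question; the kernel, sizes and batch count are invented): with pinned host buffers and two streams, the transfer for one batch can overlap the kernel of another, and if the copies still dominate the timeline, the workload is transfer-bound rather than compute-bound.

```
#include <cuda_runtime.h>

// Stand-in for the real per-batch computation.
__global__ void process(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const int kBatches = 10;

    float* h_buf[2];
    float* d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMallocHost(&h_buf[s], n * sizeof(float));  // pinned => async copies
        cudaMalloc(&d_buf[s], n * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    for (int batch = 0; batch < kBatches; ++batch) {
        int s = batch & 1;  // ping-pong between the two buffer/stream pairs
        cudaMemcpyAsync(d_buf[s], h_buf[s], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
        cudaMemcpyAsync(h_buf[s], d_buf[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
        cudaFreeHost(h_buf[s]);
    }
    return 0;
}
```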

Why does higher core count lead to higher CPI?

I'm looking at a chart that shows that, in reality, increasing the core count of a CPU usually results in a higher CPI for most instructions, as well as an increase in the total number of instructions the program executes. Why is this happening?
From my understanding, CPI should increase only when increasing the clock frequency, so the CPI increase doesn't make much sense to me.
What chart? What factors are they holding constant while increasing core count? Perhaps total transistor budget, so each core has to be simpler to have more cores?
Making a single core larger has diminishing returns, but building more cores has linear returns for embarrassingly parallel problems; hence Xeon Phi having lots of simple cores, and GPUs being very simple pipelines.
But CPUs that also care about single-thread performance / latency (instead of just throughput) will push into those diminishing returns and build wider cores. Many problems that we run on CPUs are not trivial to parallelize, so lots of weak cores is worse than fewer faster cores. For a given problem size, the more threads you have, the more of its total time each thread will spend communicating with other threads (and maybe waiting for data from them).
If you do keep each core identical when adding more cores, their CPI generally stays the same when running the same code. e.g. SPECint_rate scales nearly linearly with number of cores for current Intel/AMD CPUs (which do scale up by adding more of the same cores).
So that must not be what your chart is talking about. You'll need to clarify the question if you want a more specific answer.
You don't get perfectly linear scaling because cores do compete with each other for memory bandwidth, and space in the shared last-level cache. (Although most current designs scale up the size of last-level cache with the number of cores. e.g. AMD Zen has clusters of 4 cores sharing 8MiB of L3 that's private to those cores. Intel uses a large shared L3 that has a slice of L3 with each core, so the L3 per core is about the same.)
But more cores also means a more complex interconnect to wire them all together and to the memory controllers. Intel many-core Xeon CPUs notably have worse single-thread bandwidth than quad-core "client" chips of the same microarchitecture, even though the cores are the same in both. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

CPU and GPU differences

What is the difference between a single processing unit of CPU and single processing unit of GPU?
Most places I've come across on the internet cover the high-level differences between the two. I want to know what instructions each can perform, how fast they are, and how these processing units are integrated into the complete architecture.
It seems like a question with a long answer. So lots of links are fine.
edit:
In the CPU, the FPU runs real number operations. How fast are the same operations being done in each GPU core? If fast then why is it fast?
I know my question is very generic but my goal is to have such questions answered.
Short answer
The main difference between GPUs and CPUs is that GPUs are designed to execute the same operation in parallel on many independent data elements, while CPUs are designed to execute a single stream of instructions as quickly as possible.
Detailed answer
Part of the question asks
In the CPU, the FPU runs real number operations. How fast are the same operations being done in each GPU core? If fast then why is it fast?
This refers to the floating point (FP) execution units that are used in CPUs and GPUs. The main difference is not how a single FP execution unit is implemented. Rather the difference is that a CPU core will only have a few FP execution units that operate on independent instructions, while a GPU will have hundreds of them that operate on independent data in parallel.
GPUs were originally developed to perform computations for graphics applications, and in these applications the same operation is performed repeatedly on millions of different data points (imagine applying an operation that looks at each pixel on your screen). By using SIMD or SIMT operations the GPU reduces the overhead of processing a single instruction, at the cost of requiring multiple instructions to operate in lock-step.
Later, GPGPU programming became popular because there are many types of programming problems besides graphics that are suited to this model. The main characteristic is that the problem is data parallel, namely that the same operations can be performed independently on many separate data elements.
In contrast to GPUs, CPUs are optimized to execute a single stream of instructions as quickly as possible. CPUs use pipelining, caching, branch prediction, out-of-order execution, etc. to achieve this goal. Most of the transistors and energy spent executing a single floating point instruction are spent in the overhead of managing that instruction's flow through the pipeline, rather than in the FP execution unit itself. While a GPU's and a CPU's FP units will likely differ somewhat, this is not the main difference between the two architectures. The main difference is in how the instruction stream is handled. CPUs also tend to have cache-coherent memory between separate cores, while GPUs do not.
There are of course many variations in how specific CPUs and GPUs are implemented. But the high-level programming difference is that GPUs are optimized for data-parallel workloads, while CPU cores are optimized for executing a single stream of instructions as quickly as possible.
Your question may open up various answers and architecture design considerations. Trying to focus strictly on your question, you need to define more precisely what a "single processing unit" means.
On NVIDIA GPUs, work is arranged in warps, which are not separable: a group of CUDA "cores" will all execute the same instruction on some data, or potentially sit out that instruction; the warp size is 32 entries. This notion of a warp is very similar to the SIMD instructions of CPUs with SSE (2 or 4 entries) or AVX (4 or 8 entries) capability. The AVX operations also operate on a group of values, and the different "lanes" of that vector unit cannot perform different operations at the same time.
CUDA is called SIMT because there is a bit more flexibility for CUDA "threads" than you have with AVX "lanes". However, it is conceptually similar. In essence, a predicate indicates whether the operation should be performed on a given CUDA "core". AVX offers masked operations on its lanes to provide similar behavior. Reading from and writing to memory is also different: GPUs implement both gather and scatter, whereas only AVX2 processors have gather, and scatter is only scheduled for AVX-512.
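To make the predication and gather point concrete, here is a minimal CUDA sketch (the names are illustrative, and the AVX comparison is only conceptual): each thread plays the role of one lane, the if acts as the per-lane predicate, and the indexed load is a per-lane gather.

```
// Each thread ("lane") loads from its own, possibly non-contiguous, index;
// the if masks off lanes past n, much like a masked AVX gather would.
__global__ void gather_scale(const float* table, const int* idx,
                             float* out, int n, float gain)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = gain * table[idx[i]];
}
```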
Considering a "single processing unit" with this analogy would mean a single CUDA "core", or a single AVX "lane" for example. In that case, the two are VERY similar. In practice both operate add, sub, mul, fma in a single cycle (throughput, latency may vary a lot though), in a manner compliant with IEEE norm, in 32bits or 64bits precision. Note that the number of double-precision CUDA "cores" will vary from gamer devices (a.k.a. GeForce) to Tesla solutions. Also, the frequency of each FPU type differs: discrete GPUs navigate in the 1GHz range where CPUs are more in the 2.x-3.xGHz range.
Finally, GPUs have a special function unit which is capable of computing a coarse approximation of some transcendental functions from standard math library. These functions, some of which are also implemented in AVX, LRBNi and AVX-512, perform much better than precise counterparts. The IEEE norm is not strict on most of the functions hence allowing different implementations, but this is more a compiler/linker topic.
In essence, as far as code written to run serially is concerned, the major difference is the clock speed of the cores. GPUs often have hundreds of fairly slow cores (GPU core clocks are typically around 1 GHz, well below CPU clock speeds). This makes them very bad at highly serial applications, but allows them to perform fine-grained, highly concurrent applications (such as rendering) with a great deal of efficiency.
A CPU, however, is designed to perform highly serial applications with little or no multi-threading. Modern CPUs often have 2-8 cores, with clock speeds in excess of 3-4 GHz.
Often, highly optimized systems will take advantage of both resources, using GPUs for highly concurrent tasks and CPUs for highly serial tasks.
There are several other differences such as the actual instruction sets, cache handling, etc, but those are out of scope for this question. (And even more off topic for SO)

Coprocessor accelerators compared to GPUs

Are coprocessors like the Intel Xeon Phi meant to be used much like GPUs, i.e. should one offload a large number of blocks executing a single kernel, so that only the overall throughput the coprocessor handles results in a speed-up, or will offloading independent threads (tasks) increase efficiency as well?
The Xeon Phi requires a large degree of both functional parallelism (different threads) and vector parallelism (SIMD). Since the cores are essentially enhanced Pentium processors, serial code runs slowly. This will change somewhat with the next generation as it'll use faster and more modern cores. The current Xeon Phi also suffers from the I/O bottleneck as does any coprocessor, having to communicate over a PCIe bus.
So though you could offload a kernel to every processor and exploit the 512-bit vectorization (similar to a GPGPU), you can also separate your code into many different functional blocks (i.e. different codes/kernels) and run them on different sets of Intel Xeon Phi cores. Again, the different blocks of code must also exploit the 512-bit SIMD vectors.
The Xeon Phi also operates as a native processor, so you can access other resources by mounting NFS directory trees, communicate between cards and other processors in the cluster using TCP/IP, use MPI, etc. Note that this is not 'offload' but native execution. But the PCIe bus is still a significant bottleneck limiting I/O.
To summarize,
You can use an offload model similar to that used by GPGPUs,
The Xeon Phi itself also can support functional parallelism (more than one kernel) but each kernel must also exploit the 512-bit SIMD.
You can also write native code and use MPI, treating the Xeon Phi as a conventional (non-offload) node (always remembering the PCIe I/O bottleneck)

Which is best among hybrid CPU-GPU, GPU-only, and CPU-only for implementing large matrix addition or matrix multiplication?

Suppose a matrix addition application is implemented in a hybrid CPU-GPU fashion (in CUDA, i.e. using pthreads, where each host thread performs a partial matrix addition on the CPU while the rest is done on the GPU). For instance, if the matrix size is 1000, the first 500 elements are computed by the host CPU and the rest by the GPU, so the computation is split between CPU and GPU. Is this the best approach when compared to CPU-only computation and GPU-only computation?
Please, help me understand this concept.
Is there any profiling tool that will help compare the performance of those three kinds of computation? I'm new to CUDA, so any help/guidance will be appreciated.
Thank you!
The problem with CPU-GPU hybrid computations where you need the result back on the CPU is the latency between the two. If you expect to do some computation on the GPU and have the result back on the CPU, there can easily be several milliseconds of delay from starting the computation on the GPU to getting the results back on the CPU, so the amount of work done on the GPU should be significant. Or you need a significant amount of work on the CPU between starting the GPU computation and getting the results back from the GPU. Performing a 1000-element matrix addition is a tiny amount of work, so you would be better off performing the entire computation on the CPU instead. You also have the overhead of transferring the data back and forth between the CPU and GPU across the PCIe bus, which adds to the cost, so computations which require only a small amount of data to be transferred between the two lean more towards a hybrid solution.
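For concreteness, here is a rough CUDA sketch of the 50/50 split described in the question (function and variable names are invented for illustration); even written this way, the cudaMalloc/cudaMemcpy calls are pure overhead that a CPU-only addition of such a small matrix never pays.

```
#include <cuda_runtime.h>

__global__ void add_kernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Adds two flattened matrices of n elements: first half on the CPU,
// second half on the GPU.
void hybrid_add(const float* a, const float* b, float* c, int n)
{
    int half = n / 2;
    int rest = n - half;
    size_t bytes = rest * sizeof(float);

    // GPU part: allocate, copy in, launch, copy out -- this is the latency
    // and transfer overhead discussed above.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a + half, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b + half, bytes, cudaMemcpyHostToDevice);
    add_kernel<<<(rest + 255) / 256, 256>>>(d_a, d_b, d_c, rest);

    // CPU part runs while the GPU works (the kernel launch is asynchronous).
    for (int i = 0; i < half; ++i)
        c[i] = a[i] + b[i];

    // Blocking copy: waits for the kernel, then brings the GPU half back.
    cudaMemcpy(c + half, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```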
If you never need to read the result back from the GPU to the CPU, then you don't have the latency issue. For example, you could do an N-body simulation on the GPU and perform the visualization on the GPU as well, thus never needing the result on the CPU. But the moment you need the result of the simulation back on the CPU, you have to deal with the latency issue.