Translate OpenCl code for CPU compilation - numpy

Sometimes I find myself writing OpenCl kernel code (using pyopencl), even for tasks which involve moderate computational complexity, because it is easier to develop than a chain of numpy operations (especially if no appropriate numpy function exists).
However, in those cases the transfer overhead/delay between host and device may exceed the time spend for computation.
I was thinking about creating some Python tool, which automatically translates the OpenCl code to e.g. Cython code (or similar) which, after compilation for the CPU, can directly work on the underlying memory of the numpy arrays, without the need to copy the data to the device. I know that the CPU is capable of executing OpenCl kernels with appropriate drivers. However, this still has the disadvantages of additional delay due to the to_device operation. A multicore CPU could also exploit the OpenCL programming model for parallel execution. Furthermore, this approach removes the need for special OpenCl drivers and just requires some build tools for C-Code compilation.
Is that a reasonable idea? I do not want to reinvent the wheel. Any hints for existing frameworks/tools which could achieve my goals are much appreciated.

While converting an OpenCL code to a parallel CPU-oriented code is probably possible, it very hard (if not possible) to generate an efficient code.
Indeed, OpenCL encourage/force programmers to perform big computational steps (kernels) often reading/writing a relatively big portion of memory. However, the GPUs memory bandwidth is generally much higher than the one of CPUs (eg. my Nvidia 1660S has a bandwidth of 336 GB/s while my i5-9600KF with 2 DD4 channel succeed to reach about 40 GB/s while they had a similar price). OpenCL computing kernels are not be fully optimized for CPUs whatever the low-level transformation applied to the code. The main problem lies in the OpenCL algorithms themselves as well as the programming model. Rewriting OpenCL kernels to a CPU code can often result in a more efficient execution if the code is specifically optimized for such a platform. Low-level optimizations include working on in caches using data chunks, using register blocking, using the best SIMD instructions available. High-level optimizations consist in choosing the best algorithm and data structure for the target problem. The best sorting algorithm on a GPU is likely very different from the best one on a CPU. The same thing applies for other problems like computing a prefix sum, a partition/median or even string searching. Thus, one should keep in mind that different hardwares required different computing methods/algorithms.
A high-level algorithmic transformation could theoretically result in an efficient code, but such a transformation is insanely complex to perform if even possible. Indeed, there is fundamental theoretical limitations that strongly prevent many generalized advanced code analysis/transformation starting from the halting problem to high-level optimization.

Related

Fast Matrix Multiplication

I have an interview test where I have to implement a fast matrix multiplication with a given matrix multiplication algorithm.
I have to implement it on any platform with any compiler I want. The task says:
•PC implementation should be ready for SIMD optimization.
• Design a rational interface to the data processing module.
• Write portable ANSIC code where it doesn't degrade the efficiency. Don’t use assembler.
• Think about the number of operations, complexity of the operations. Care about things like function call overhead, loop overhead, memory access time and cache performance
Should I implement this on a platform like raspberry pi? Or on a CPU+DSP or ARM+NEON or CPU+GPU simulator? Or just give the code?
Thank you
There a whole theory about Instruction level parallelism, thread level parallelism, Cache utilization and what not used in speeding up matrix multiplication.
I can point you, first to learn how the CPU cache works. When a block is loaded into cache, how it is mapped to to cache index, when a block is evicted etc. Consult a book on Computer architecture, or Wikipedia.
Then I can point you to the blocking matrix multiplication algorithm.
And last there is the BLAS specification and OpenBLAS as the fastest implementation for CPUs.

CPU and GPU differences

What is the difference between a single processing unit of CPU and single processing unit of GPU?
Most places I've come along on the internet cover the high level differences between the two. I want to know what instructions can each perform and how fast are they and how are these processing units integrated in the compete architecture?
It seems like a question with a long answer. So lots of links are fine.
edit:
In the CPU, the FPU runs real number operations. How fast are the same operations being done in each GPU core? If fast then why is it fast?
I know my question is very generic but my goal is to have such questions answered.
Short answer
The main difference between GPUs and CPUs is that GPUs are designed to execute the same operation in parallel on many independent data elements, while CPUs are designed to execute a single stream of instructions as quickly as possible.
Detailed answer
Part of the question asks
In the CPU, the FPU runs real number operations. How fast are the same
operations being done in each GPU core? If fast then why is it fast?
This refers to the floating point (FP) execution units that are used in CPUs and GPUs. The main difference is not how a single FP execution unit is implemented. Rather the difference is that a CPU core will only have a few FP execution units that operate on independent instructions, while a GPU will have hundreds of them that operate on independent data in parallel.
GPUs were originally developed to perform computations for graphics applications, and in these applications the same operation is performed repeatedly on millions of different data points (imagine applying an operation that looks at each pixel on your screen). By using SIMD or SIMT operations the GPU reduces the overhead of processing a single instruction, at the cost of requiring multiple instructions to operate in lock-step.
Later GPGPU programming became popular because there are many types of programming problems besides graphics that are suited to this model. The main characteristic is that the problem is data parallel, namely the same operations can be performed independently on many separate data elements.
In contrast to GPUs, CPUs are optimized to execute a single stream of instructions as quickly as possible. CPUs use pipelining, caching, branch prediction, out-of-order execution, etc. to achieve this goal. Most of the transistors and energy spent executing a single floating point instruction is spent in the overhead of managing that instructions flow through the pipeline, rather than in the FP execution unit. While a GPU and CPU's FP unit will likely differ somewhat, this is not the main difference between the two architectures. The main difference is in how the instruction stream is handled. CPUs also tend to have cache coherent memory between separate cores, while GPUs do not.
There are of course many variations in how specific CPUs and GPUs are implemented. But the high-level programming difference is that GPUs are optimized for data-parallel workloads, while CPUs cores are optimized for executing a single stream of instructions as quickly as possible.
Your question may open various answers and architecture design considerations. Trying to focus strictly to your question, you need to define more precisely what a "single processing unit" means.
On NVIDIA GPU, you have work arranged in warps which is not separable, that is a group of CUDA "cores" will all operate the same instruction on some data, potentially not doing this instruction - warp size is 32 entries. This notion of warp is very similar to the SIMD instructions of CPUs that have SSE (2 or 4 entries) or AVX (4 or 8 entries) capability. The AVX operations will also operate on a group of values, and different "lanes" of this vector unit may not do different operations at the same time.
CUDA is called SIMT as there is a bit more flexibility on CUDA "threads" than you have on AVX "lanes". However, it is similar conceptually. In essence, a notion of predicate will indicate whether the operations should be performed on some CUDA "core". AVX offers masked operations on its lane to offer similar behavior. Reading from and writing to memory is also different as GPU implement both gather and scatter where only AVX2 processors have gather and scatter is solely scheduled for AVX-512.
Considering a "single processing unit" with this analogy would mean a single CUDA "core", or a single AVX "lane" for example. In that case, the two are VERY similar. In practice both operate add, sub, mul, fma in a single cycle (throughput, latency may vary a lot though), in a manner compliant with IEEE norm, in 32bits or 64bits precision. Note that the number of double-precision CUDA "cores" will vary from gamer devices (a.k.a. GeForce) to Tesla solutions. Also, the frequency of each FPU type differs: discrete GPUs navigate in the 1GHz range where CPUs are more in the 2.x-3.xGHz range.
Finally, GPUs have a special function unit which is capable of computing a coarse approximation of some transcendental functions from standard math library. These functions, some of which are also implemented in AVX, LRBNi and AVX-512, perform much better than precise counterparts. The IEEE norm is not strict on most of the functions hence allowing different implementations, but this is more a compiler/linker topic.
In essence the major difference as far as writing code to run serially is clock speed of the cores. GPUs often have hundreds of fairly slow cores (Often modern GPUs have cores with speeds of 200-400 MHz) This makes them very bad at highly serial applications, but allows them to perform highly granulated and concurrent applications (such as rendering) with a great deal of efficiency.
A CPU however is designed to perform highly serial applications with little or no multi-threading. Modern CPUs often have 2-8 cores, with clock speeds in excess of 3-4 Ghz.
Often times highly optimized systems will take advantage of both resources to use GPUs for highly concurrent tasks, and CPUs for highly serial tasks.
There are several other differences such as the actual instruction sets, cache handling, etc, but those are out of scope for this question. (And even more off topic for SO)

Could a GPU be Programmed and/or Modified to Carry Out CPU Instructions?

I was wondering if a GPU could behave like a CPU if modified or programmed to do so. If there is a way, I would also like to know how that could be done. The reason why is, well, sometimes I do that kind of stuff as experiments, just for fun. Plus, if it isn't a big hassle, then it would be much better than buying an expensive processor just to get better performance. I usually don't need my GPU, only because I use my computer for the simplest of things. My other computer, that's a slightly different story (because I use it for video playback), but you get the idea.
Yes, it's called GPGPU (general purpose GPU), and with it you could program some CPU-like workloads on your GPU using languages like CUDA or OpenCL.
Of course this method doesn't work well with any workload, the CPU is still much better in single-threaded hard-to-parallelize codes, or codes with complicated control flow (due to branch predictors) or memory locality (due to better caching and prefetching). GPGPUs are mostly better for performing very straight-forward highly parallel vectorizable code.
In fact, this method of computation caught enough traction to create a new lines of products, (such as Xeon Phi, formely Larrabee), and enhancing existing GPUs (e.g. Tesla/Fermi, and others)
EDIT
Having reread your question - if you mean running actual CPU ISA on such GPGPU, not just some general CPU task, then the best bet is Xeon Phi mentioned above, it's intended to be based on the same ISA as the CPU (it's the only x86 GPGPU I know of).

Are modern GPUs considered to be RISC based or CISC based?

I'm trying to figure out if modern GPUs have a reduced instruction set, or a complex instruction set.
Wikipedia says that it's not the size of the instruction set, rather how many cycles it takes to complete an instruction.
In RISC processors, each instruction can be completed in one cycle.
In CISC processors, it takes several cycles to complete some instructions.
I'm trying to figure out what the case is for modern GPUs.
If you mean Nvidia then it's clearly RISC as its most GPUs don't even have integer division and modulo operations in hardware, only shifts, bitwise operations and 3 arithmetic operations (addition, subtraction, multiplication) are used to implement those 2. I can't find example but this question (modular arithmetic on the gpu) shows that mod uses
procedure which implements some sophisticated algorithm (about 50 instructions or even more)
Even NVVM (Nvidia virtual machine) language called PTX uses more operations some of which are "baked" into a bunch of simpler operations anyway after conversion to one of native languages (there are different versions of such languages because of nature of GPUs and their generations/families but those are just called SASS altogether).
You can see here all the available operations along with description on each which are yet very short and not very clear (especially if you don't have background in machine level programming like knowing that "scaled" means 1 left shifted to operand just as in x86's "FSCALE" or "Scale factor" etc.):
https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref
If you mean AMDGPU then there is a lot of instructions and it's not so clear because some sources tell that they switched from VLIW to something just when Southern Islands GPUs were released.
RISC instruction set : the load/store unit is independent from other units so basically for loading and storing specific instruction are used
CISC insruction set : the ad/store unit in embedded in the instrction execution routine , therfore the instruction is more comlex than RISC instruction because CISC instruction beside the operation it will perform the load and store stage and this require more transistor logic to be used for one ibstruction
The goal of CISC was to take common coding patterns and accelerate them in hardware. You see this in the constant extensions to the base architecture. See Intel's MMX and SSE, and AMD's 3DNow!, etc. https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions This also makes for good marketing, as you need to upgrade to the new processor to accelerate the newest common tasks, and keeps coders busy constantly translating their code patterns to the new extensions.
The goal of RISC was the opposite. It tried to perform few base functions as fast as possible. The coder then needs to continue to break down their common coding tasks to those simple instructions (although high-level programming languages and code packages/libraries accomplish this for you). RISC continues to survive as the architecture for ARM processors. See: https://en.wikipedia.org/wiki/Reduced_instruction_set_computer
I note that GPUs are similar to the RISC philosophy, in that the goal is to perform as many relatively simple computations as fast as possible. The move toward deep learning created a need for training millions of relatively simple parameters, hence the move back toward a highly parallel, relatively simple architecture. Having both philosophies implemented inside your computer is the best of both worlds.

Using Cuda optimization approaches for OpenCL

The more I learn about OpenCL, the more it seems that the right optimization of your kernel is the key to success. Furthermore I noticed, that the kernels for both languages seem very similar.
So how sensible would it be using Cuda optimization strategies learned from books and tutorials on OpenCL kernels? ... Considering that there is so much more (good) literature for Cuda than for OpenCL.
What is your opinion on that? What is your experience?
Thanks!
If you are working with just nvidia cards, you can use the same optimization approaches in both CUDA as well as OpenCL. A few things to keep in mind though is that OpenCL might have a larger start up time (This was a while ago when I was experimenting with both of them) compared to CUDA on nvidia cards.
However if you are going to work with different architectures, you will need to figure out a way to generalize your OpenCL program to be optimal across multiple platforms, which is not possible with CUDA.
But some of the few basic optimization approaches will remain the same.
For example, on any platform the following will be true.
Reading from and writing to memory
addresses that are aligned will have
higher performance (And sometimes
necessary on platforms like the Cell
Processor).
Knowing and understanding the limited resources
of each platform. (may it be called
constant memory, shared memory,
local memory or cache).
Understanding parallel programming.
For example, figuring out the trade
off between performance gains
(launching more threads) and
overhead costs (launching,
communication and synchronization).
That last part is useful in all kinds of parallel programming (be multi core, many core or grid computing).
While I'm still new at OpenCL (and barely glanced at CUDA), optimization at the developer level can be summarized as structuring your code so that it matches the hardware's (and compiler's) preferred way of doing things.
On GPUs, this can be anything from correctly ordering your data to take advantage of cache coherency (GPUs LOVE to work with cached data, from the top all the way down to the individual cores [there are several levels of cache]) to taking advantage of built-in operations like vector and matrix manipulation. I recently had to implement FDTD in OpenCL and found that by replacing the expanded dot/cross products in the popular implementations with matrix operations (which GPUs love!), reordering loops so that the X dimension (elements of which are stored sequentially) is handled in the innermost loop instead of the outer, avoiding branching (which GPUs hate), etc, I was able to increase the speed performance by about 20%. Those optimizations should work in CUDA, OpenCL or even GPU assembly, and I would expect that to be true of all of the most effective GPU optimizations.
Of course, most of this is application-dependent, so it may fall under the TIAS (try-it-and-see) category.
Here are a few links I found that look promising:
NVIDIA - Best Practices for OpenCL Programming
AMD - Porting CUDA to OpenCL
My research (and even NVIDIA's documentation) points to a nearly 1:1 correspondence between CUDA and OpenCL, so I would be very surprised if optimizations did not translate well between them. Most of what I have read focuses on cache coherency, avoiding branching, etc.
Also, note that in the case of OpenCL, the actual compilation process is handled by the vendor (I believe it happens in the video driver), so it may be worthwhile to have a look at the driver documentation and OpenCL kits from your vendor (NVIDIA, ATI, Intel(?), etc).