Fast Matrix Multiplication

Fast Matrix Multiplication - gpu

I have an interview test where I have to implement a fast matrix multiplication with a given matrix multiplication algorithm.
I have to implement it on any platform with any compiler I want. The task says:
•PC implementation should be ready for SIMD optimization.
• Design a rational interface to the data processing module.
• Write portable ANSIC code where it doesn't degrade the efficiency. Don’t use assembler.
• Think about the number of operations, complexity of the operations. Care about things like function call overhead, loop overhead, memory access time and cache performance
Should I implement this on a platform like raspberry pi? Or on a CPU+DSP or ARM+NEON or CPU+GPU simulator? Or just give the code?
Thank you

There a whole theory about Instruction level parallelism, thread level parallelism, Cache utilization and what not used in speeding up matrix multiplication.
I can point you, first to learn how the CPU cache works. When a block is loaded into cache, how it is mapped to to cache index, when a block is evicted etc. Consult a book on Computer architecture, or Wikipedia.
Then I can point you to the blocking matrix multiplication algorithm.
And last there is the BLAS specification and OpenBLAS as the fastest implementation for CPUs.

Related

Translate OpenCl code for CPU compilation

Sometimes I find myself writing OpenCl kernel code (using pyopencl), even for tasks which involve moderate computational complexity, because it is easier to develop than a chain of numpy operations (especially if no appropriate numpy function exists).
However, in those cases the transfer overhead/delay between host and device may exceed the time spend for computation.
I was thinking about creating some Python tool, which automatically translates the OpenCl code to e.g. Cython code (or similar) which, after compilation for the CPU, can directly work on the underlying memory of the numpy arrays, without the need to copy the data to the device. I know that the CPU is capable of executing OpenCl kernels with appropriate drivers. However, this still has the disadvantages of additional delay due to the to_device operation. A multicore CPU could also exploit the OpenCL programming model for parallel execution. Furthermore, this approach removes the need for special OpenCl drivers and just requires some build tools for C-Code compilation.
Is that a reasonable idea? I do not want to reinvent the wheel. Any hints for existing frameworks/tools which could achieve my goals are much appreciated.

While converting an OpenCL code to a parallel CPU-oriented code is probably possible, it very hard (if not possible) to generate an efficient code.
Indeed, OpenCL encourage/force programmers to perform big computational steps (kernels) often reading/writing a relatively big portion of memory. However, the GPUs memory bandwidth is generally much higher than the one of CPUs (eg. my Nvidia 1660S has a bandwidth of 336 GB/s while my i5-9600KF with 2 DD4 channel succeed to reach about 40 GB/s while they had a similar price). OpenCL computing kernels are not be fully optimized for CPUs whatever the low-level transformation applied to the code. The main problem lies in the OpenCL algorithms themselves as well as the programming model. Rewriting OpenCL kernels to a CPU code can often result in a more efficient execution if the code is specifically optimized for such a platform. Low-level optimizations include working on in caches using data chunks, using register blocking, using the best SIMD instructions available. High-level optimizations consist in choosing the best algorithm and data structure for the target problem. The best sorting algorithm on a GPU is likely very different from the best one on a CPU. The same thing applies for other problems like computing a prefix sum, a partition/median or even string searching. Thus, one should keep in mind that different hardwares required different computing methods/algorithms.
A high-level algorithmic transformation could theoretically result in an efficient code, but such a transformation is insanely complex to perform if even possible. Indeed, there is fundamental theoretical limitations that strongly prevent many generalized advanced code analysis/transformation starting from the halting problem to high-level optimization.

CPU and GPU differences

What is the difference between a single processing unit of CPU and single processing unit of GPU?
Most places I've come along on the internet cover the high level differences between the two. I want to know what instructions can each perform and how fast are they and how are these processing units integrated in the compete architecture?
It seems like a question with a long answer. So lots of links are fine.
edit:
In the CPU, the FPU runs real number operations. How fast are the same operations being done in each GPU core? If fast then why is it fast?
I know my question is very generic but my goal is to have such questions answered.

Short answer
The main difference between GPUs and CPUs is that GPUs are designed to execute the same operation in parallel on many independent data elements, while CPUs are designed to execute a single stream of instructions as quickly as possible.
Detailed answer
Part of the question asks
In the CPU, the FPU runs real number operations. How fast are the same
operations being done in each GPU core? If fast then why is it fast?
This refers to the floating point (FP) execution units that are used in CPUs and GPUs. The main difference is not how a single FP execution unit is implemented. Rather the difference is that a CPU core will only have a few FP execution units that operate on independent instructions, while a GPU will have hundreds of them that operate on independent data in parallel.
GPUs were originally developed to perform computations for graphics applications, and in these applications the same operation is performed repeatedly on millions of different data points (imagine applying an operation that looks at each pixel on your screen). By using SIMD or SIMT operations the GPU reduces the overhead of processing a single instruction, at the cost of requiring multiple instructions to operate in lock-step.
Later GPGPU programming became popular because there are many types of programming problems besides graphics that are suited to this model. The main characteristic is that the problem is data parallel, namely the same operations can be performed independently on many separate data elements.
In contrast to GPUs, CPUs are optimized to execute a single stream of instructions as quickly as possible. CPUs use pipelining, caching, branch prediction, out-of-order execution, etc. to achieve this goal. Most of the transistors and energy spent executing a single floating point instruction is spent in the overhead of managing that instructions flow through the pipeline, rather than in the FP execution unit. While a GPU and CPU's FP unit will likely differ somewhat, this is not the main difference between the two architectures. The main difference is in how the instruction stream is handled. CPUs also tend to have cache coherent memory between separate cores, while GPUs do not.
There are of course many variations in how specific CPUs and GPUs are implemented. But the high-level programming difference is that GPUs are optimized for data-parallel workloads, while CPUs cores are optimized for executing a single stream of instructions as quickly as possible.

Your question may open various answers and architecture design considerations. Trying to focus strictly to your question, you need to define more precisely what a "single processing unit" means.
On NVIDIA GPU, you have work arranged in warps which is not separable, that is a group of CUDA "cores" will all operate the same instruction on some data, potentially not doing this instruction - warp size is 32 entries. This notion of warp is very similar to the SIMD instructions of CPUs that have SSE (2 or 4 entries) or AVX (4 or 8 entries) capability. The AVX operations will also operate on a group of values, and different "lanes" of this vector unit may not do different operations at the same time.
CUDA is called SIMT as there is a bit more flexibility on CUDA "threads" than you have on AVX "lanes". However, it is similar conceptually. In essence, a notion of predicate will indicate whether the operations should be performed on some CUDA "core". AVX offers masked operations on its lane to offer similar behavior. Reading from and writing to memory is also different as GPU implement both gather and scatter where only AVX2 processors have gather and scatter is solely scheduled for AVX-512.
Considering a "single processing unit" with this analogy would mean a single CUDA "core", or a single AVX "lane" for example. In that case, the two are VERY similar. In practice both operate add, sub, mul, fma in a single cycle (throughput, latency may vary a lot though), in a manner compliant with IEEE norm, in 32bits or 64bits precision. Note that the number of double-precision CUDA "cores" will vary from gamer devices (a.k.a. GeForce) to Tesla solutions. Also, the frequency of each FPU type differs: discrete GPUs navigate in the 1GHz range where CPUs are more in the 2.x-3.xGHz range.
Finally, GPUs have a special function unit which is capable of computing a coarse approximation of some transcendental functions from standard math library. These functions, some of which are also implemented in AVX, LRBNi and AVX-512, perform much better than precise counterparts. The IEEE norm is not strict on most of the functions hence allowing different implementations, but this is more a compiler/linker topic.

In essence the major difference as far as writing code to run serially is clock speed of the cores. GPUs often have hundreds of fairly slow cores (Often modern GPUs have cores with speeds of 200-400 MHz) This makes them very bad at highly serial applications, but allows them to perform highly granulated and concurrent applications (such as rendering) with a great deal of efficiency.
A CPU however is designed to perform highly serial applications with little or no multi-threading. Modern CPUs often have 2-8 cores, with clock speeds in excess of 3-4 Ghz.
Often times highly optimized systems will take advantage of both resources to use GPUs for highly concurrent tasks, and CPUs for highly serial tasks.
There are several other differences such as the actual instruction sets, cache handling, etc, but those are out of scope for this question. (And even more off topic for SO)

Are modern GPUs considered to be RISC based or CISC based?

I'm trying to figure out if modern GPUs have a reduced instruction set, or a complex instruction set.
Wikipedia says that it's not the size of the instruction set, rather how many cycles it takes to complete an instruction.
In RISC processors, each instruction can be completed in one cycle.
In CISC processors, it takes several cycles to complete some instructions.
I'm trying to figure out what the case is for modern GPUs.

If you mean Nvidia then it's clearly RISC as its most GPUs don't even have integer division and modulo operations in hardware, only shifts, bitwise operations and 3 arithmetic operations (addition, subtraction, multiplication) are used to implement those 2. I can't find example but this question (modular arithmetic on the gpu) shows that mod uses
procedure which implements some sophisticated algorithm (about 50 instructions or even more)
Even NVVM (Nvidia virtual machine) language called PTX uses more operations some of which are "baked" into a bunch of simpler operations anyway after conversion to one of native languages (there are different versions of such languages because of nature of GPUs and their generations/families but those are just called SASS altogether).
You can see here all the available operations along with description on each which are yet very short and not very clear (especially if you don't have background in machine level programming like knowing that "scaled" means 1 left shifted to operand just as in x86's "FSCALE" or "Scale factor" etc.):
https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref
If you mean AMDGPU then there is a lot of instructions and it's not so clear because some sources tell that they switched from VLIW to something just when Southern Islands GPUs were released.

RISC instruction set : the load/store unit is independent from other units so basically for loading and storing specific instruction are used
CISC insruction set : the ad/store unit in embedded in the instrction execution routine , therfore the instruction is more comlex than RISC instruction because CISC instruction beside the operation it will perform the load and store stage and this require more transistor logic to be used for one ibstruction

The goal of CISC was to take common coding patterns and accelerate them in hardware. You see this in the constant extensions to the base architecture. See Intel's MMX and SSE, and AMD's 3DNow!, etc. https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions This also makes for good marketing, as you need to upgrade to the new processor to accelerate the newest common tasks, and keeps coders busy constantly translating their code patterns to the new extensions.
The goal of RISC was the opposite. It tried to perform few base functions as fast as possible. The coder then needs to continue to break down their common coding tasks to those simple instructions (although high-level programming languages and code packages/libraries accomplish this for you). RISC continues to survive as the architecture for ARM processors. See: https://en.wikipedia.org/wiki/Reduced_instruction_set_computer
I note that GPUs are similar to the RISC philosophy, in that the goal is to perform as many relatively simple computations as fast as possible. The move toward deep learning created a need for training millions of relatively simple parameters, hence the move back toward a highly parallel, relatively simple architecture. Having both philosophies implemented inside your computer is the best of both worlds.

superscalar and VLIW

I want to ask some questions related to ILP.
A superscalar processor is sort of a mixture of the scalar and vector processor. So can I say that architectures of vector processor follows super-scalar ?
Processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that. What does this means?
I have read ' A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor', superscalar cant use more than one processor ? Can anyone provide me example where superscalar are used?
VLIW , I have go through this article there is figure 4 on page 9.It shows a generic VLIW implementation, without the complex reorder buffer and decoding and dispatching logic. The term without decoding is confusing me.
Regards,
anas anjaria

Check this article.
Basic difference can be seen in these pictures:
Simple processor:
Superscalar processor:

A superscalar processor is sort of a mixture of the scalar and vector processor.
LOL, no. A superscalar core is a core that can execute more than one instruction per clock cycle.

A superscalar processor is sort of a mixture of the scalar and vector processor.
No, this is definitely not true.
A scalar processor performs computations on piece of data at a time.
A superscalar can execute multiple scalar instructions at a time.
A VLIW can execute multiple operations at a time.
A vector processor can operate on a vector of data at a time.
The superscalar Haswell CPU that I'm typing this on has 8 execution ports: 4 integer operations, 2 memory reads and 2 stores. Potentially 8 x86 instructions could execute simultaneously. That's superscalar. The 8080 could only execute 1 instruction at a time. That's scalar.
Haswell is both pipelined and superscalar. It's also speculative and out-of-order. It's hyperthreaded (2 threads per core) and multi-core (2-18 cores). It's just a beast.
Instruction level parallelism (ILP) is a characteristic or measure of a program not a CPU. A compiler scheduler will search for ILP statically or a CPU's scheduler will search for ILP dynamically. If they find it, then they can order+execute instructions accordingly.

Check out this first (http://en.wikipedia.org/wiki/Superscalar):
A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.
This means that for example the CPU with 2(two) ALUs (arithmetic logic unit) can physically issue 2 arithmetic instructions and execute them. Each arithmetic instruction will be executed in different ALU unit.
Second check this (http://en.wikipedia.org/wiki/Instruction_level_parallelism):
It will help you not to confuse the different techniques for achieving ILP (instruction level parallelism).
Third (http://en.wikipedia.org/wiki/P5_(microprocessor)): Example for the superscalar processor is the original Intel Pentium. It has two instruction pipelines.

Meaning of bandwidth in CUDA and why it is important

The CUDA programming guide states that
"Bandwidth is one of the most important gating factors for performance. Almost all changes to code should be made in the context of how they affect bandwidth."
It goes on to calculate theoretical bandwidth which is in the order of hundreds of gigabytes per second. I am at a loss as to why how many bytes one can read/write to global memory is a reflection of how well optimised a kernel is.
If I have a kernel which does intensive computation on data stored in shared memory and/or registers, with only a single read at the start and write out at the end from and to global memory, surely the effective bandwidth will be small, while the kernel itself may be very efficient.
Could any one further explain bandwidth in this context?
Thanks

most all nontrivial computational kernels, in CPU and GPU land, memory bound.
GPU has very high computational intensity and throughput, but access to main memory is very slow and has high latency, few hundred cycles per read/store versus four cycles for mmany arithmetic operations.
It sounds like your kernel is computation bound, so your luck. However you still have to watch out for shared memory bank conflict, which can serialize portions of code unexpectedly.

Most kernels are memory bound so maximising memory throughput is critical. If you're lucky enough to have a compute bound kernel then optimizing for computation is generally easier. You do need to look out for divergence and you should still ensure you have enough threads to hide memory latency.
Check out the Advanced CUDA C presentation for more information, including some tips for how to compare your realised performance with theoretical performance. The CUDA Best Practices Gude also has some good information, it's available as part of the CUDA toolkit (download from the NVIDIA site).

Typically kernels are fairly small and simple and perform the same operation on a lot of data. You might have a bunch of kernels that you invoke in sequence to perform some more complex operation (think of it as a processing pipeline). Obviously the throughput of your pipeline will depend both on how efficient your kernels are and whether you are limited by memory bandwidth in any way.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas