superscalar and VLIW - cross-platform

I want to ask some questions related to ILP.
A superscalar processor is sort of a mixture of the scalar and vector processor. So can I say that the architecture of a vector processor follows the superscalar model?
"Processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that." What does this mean?
I have read that "A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor". So a superscalar processor can't use more than one processor? Can anyone give me an example of where superscalar processors are used?
Regarding VLIW: I have gone through this article, and figure 4 on page 9 shows a generic VLIW implementation, without the complex reorder buffer and decoding and dispatching logic. The phrase "without decoding" is confusing me.
Regards,
anas anjaria

Check this article.
The basic difference can be seen in these two pictures (not reproduced here): one diagram of a simple (scalar) processor and one of a superscalar processor.

A superscalar processor is sort of a mixture of the scalar and vector processor.
LOL, no. A superscalar core is a core that can execute more than one instruction per clock cycle.

A superscalar processor is sort of a mixture of the scalar and vector processor.
No, this is definitely not true.
A scalar processor performs computations on one piece of data at a time.
A superscalar can execute multiple scalar instructions at a time.
A VLIW can execute multiple operations at a time.
A vector processor can operate on a vector of data at a time.
The superscalar Haswell CPU that I'm typing this on has 8 execution ports: 4 integer operations, 2 memory reads and 2 stores. Potentially 8 x86 instructions could execute simultaneously. That's superscalar. The 8080 could only execute 1 instruction at a time. That's scalar.
Haswell is both pipelined and superscalar. It's also speculative and out-of-order. It's hyperthreaded (2 threads per core) and multi-core (2-18 cores). It's just a beast.
Instruction level parallelism (ILP) is a characteristic or measure of a program not a CPU. A compiler scheduler will search for ILP statically or a CPU's scheduler will search for ILP dynamically. If they find it, then they can order+execute instructions accordingly.
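As a small illustration (my own sketch, not from the original discussion), here is a C fragment with instruction level parallelism: the first three statements are independent of each other, so a superscalar core (or a VLIW compiler) can execute them in the same cycle, while the final statement depends on all of them and has to wait.
```c
/* Hypothetical example: three independent operations followed by a
 * dependent one. A superscalar scheduler can issue x, y and z together;
 * the return expression must wait for their results. */
int ilp_example(int a, int b, int c, int d)
{
    int x = a + b;    /* independent */
    int y = c * d;    /* independent */
    int z = a - c;    /* independent */
    return x + y + z; /* depends on x, y and z */
}
```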

Check out this first (http://en.wikipedia.org/wiki/Superscalar):
A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.
This means, for example, that a CPU with two ALUs (arithmetic logic units) can physically issue two arithmetic instructions and execute them at the same time, each in a different ALU.
Second check this (http://en.wikipedia.org/wiki/Instruction_level_parallelism):
It will help you not to confuse the different techniques for achieving ILP (instruction level parallelism).
Third (http://en.wikipedia.org/wiki/P5_(microprocessor)): an example of a superscalar processor is the original Intel Pentium, which has two instruction pipelines.

Related

What does SIMD mean?

I have read from the book "Operating System Internals and Design Principles" written by "William Stallings" that GPUs are Single-Instruction on Multiple Data, I don't get it what it means. I searched in google and got this assumption which I am not sure if it is right or wrong and that is:
SIMD GPU means the GPU processes only one instruction on an array of data, for example of a game, the GPU is only responsible for graphical representation of the game and the rest of calculation is being done by CPU, is it true.
In the context of GPUs, SIMD is a type of hardware architecture in which there are simultaneous (parallel) computations (executions of an instruction), but only a single instruction at a given moment.
Schematically, the SIMD architecture can be drawn as a single instruction pool feeding several processing units (PUs), all of which read from a shared data pool (see the diagram on Wikipedia: https://en.wikipedia.org/wiki/SIMD).
The data pool in our context is the GPU memory, and a PU is a processing unit or execution unit (a CUDA core, in NVIDIA's GPU terms).
Bottom line: a single GPU core can simultaneously execute the same instruction over different pieces of data.
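To make "single instruction, multiple data" concrete, here is a minimal CPU-side sketch using SSE intrinsics in C (my own illustration, not GPU code): one add instruction produces four results at once. A GPU applies the same principle with many more lanes, e.g. a 32-wide warp.
```c
/* One SIMD instruction (_mm_add_ps) adds four pairs of floats at once.
 * A GPU does the same thing conceptually, just with far wider "vectors". */
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void)
{
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* load 4 floats           */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* ONE instruction, 4 sums */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; ++i)
        printf("%.1f\n", c[i]);
    return 0;
}
```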

Fast Matrix Multiplication

I have an interview test where I have to implement a fast matrix multiplication with a given matrix multiplication algorithm.
I have to implement it on any platform with any compiler I want. The task says:
• The PC implementation should be ready for SIMD optimization.
• Design a rational interface to the data processing module.
• Write portable ANSI C code where it doesn't degrade the efficiency. Don't use assembler.
• Think about the number of operations and the complexity of the operations. Care about things like function call overhead, loop overhead, memory access time and cache performance.
Should I implement this on a platform like a Raspberry Pi? Or on a CPU+DSP, ARM+NEON, or CPU+GPU simulator? Or just provide the code?
Thank you
There is a whole body of theory about instruction level parallelism, thread level parallelism, cache utilization and so on used in speeding up matrix multiplication.
First, I would point you to learning how the CPU cache works: when a block is loaded into the cache, how it is mapped to a cache index, when a block is evicted, etc. Consult a book on computer architecture, or Wikipedia.
Then I can point you to the blocking matrix multiplication algorithm.
And last, there is the BLAS specification, with OpenBLAS as one of the fastest implementations for CPUs.
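A minimal sketch of the blocked (tiled) multiplication mentioned above, in plain C. The matrix size N and tile size BLOCK are illustrative values I chose (N is assumed to be a multiple of BLOCK); the point is that each tile of A, B and C is reused while it is still in cache, and the innermost loop walks contiguous memory, which leaves room for SIMD optimization.
```c
/* Blocked (tiled) C = A * B for square N x N row-major matrices. */
#include <stddef.h>

#define N     512
#define BLOCK 64   /* tile size; tune so three tiles fit in cache */

void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            C[i][j] = 0.0;

    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t kk = 0; kk < N; kk += BLOCK)
            for (size_t jj = 0; jj < N; jj += BLOCK)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B */
                for (size_t i = ii; i < ii + BLOCK; ++i)
                    for (size_t k = kk; k < kk + BLOCK; ++k) {
                        double a_ik = A[i][k];
                        for (size_t j = jj; j < jj + BLOCK; ++j)
                            C[i][j] += a_ik * B[k][j];
                    }
}
```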

CPU and GPU differences

What is the difference between a single processing unit of CPU and single processing unit of GPU?
Most places I've come across on the internet cover the high-level differences between the two. I want to know what instructions each can perform, how fast they are, and how these processing units are integrated into the complete architecture.
It seems like a question with a long answer. So lots of links are fine.
edit:
In the CPU, the FPU runs real number operations. How fast are the same operations being done in each GPU core? If fast then why is it fast?
I know my question is very generic but my goal is to have such questions answered.
Short answer
The main difference between GPUs and CPUs is that GPUs are designed to execute the same operation in parallel on many independent data elements, while CPUs are designed to execute a single stream of instructions as quickly as possible.
Detailed answer
Part of the question asks
In the CPU, the FPU runs real number operations. How fast are the same operations being done in each GPU core? If fast then why is it fast?
This refers to the floating point (FP) execution units that are used in CPUs and GPUs. The main difference is not how a single FP execution unit is implemented. Rather the difference is that a CPU core will only have a few FP execution units that operate on independent instructions, while a GPU will have hundreds of them that operate on independent data in parallel.
GPUs were originally developed to perform computations for graphics applications, and in these applications the same operation is performed repeatedly on millions of different data points (imagine applying an operation that looks at each pixel on your screen). By using SIMD or SIMT operations the GPU reduces the overhead of processing a single instruction, at the cost of requiring multiple instructions to operate in lock-step.
Later GPGPU programming became popular because there are many types of programming problems besides graphics that are suited to this model. The main characteristic is that the problem is data parallel, namely the same operations can be performed independently on many separate data elements.
In contrast to GPUs, CPUs are optimized to execute a single stream of instructions as quickly as possible. CPUs use pipelining, caching, branch prediction, out-of-order execution, etc. to achieve this goal. Most of the transistors and energy spent executing a single floating point instruction are spent in the overhead of managing that instruction's flow through the pipeline, rather than in the FP execution unit. While a GPU's and a CPU's FP units will likely differ somewhat, this is not the main difference between the two architectures. The main difference is in how the instruction stream is handled. CPUs also tend to have cache-coherent memory between separate cores, while GPUs do not.
There are of course many variations in how specific CPUs and GPUs are implemented. But the high-level programming difference is that GPUs are optimized for data-parallel workloads, while CPU cores are optimized for executing a single stream of instructions as quickly as possible.
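To illustrate that distinction (my own sketch, not part of the answer above): the first loop below is data parallel, since every iteration is independent and could be spread over thousands of GPU lanes, while the second carries a dependence from one iteration to the next as written, which is the kind of serial instruction stream a CPU is built to push through quickly.
```c
/* Data-parallel vs. loop-carried-dependence example, in C. */
#include <stddef.h>

/* y[i] = a*x[i] + y[i]: every element is independent (GPU-friendly). */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* A recurrence: as written, each iteration needs the previous result,
 * so the work forms a single serial chain (CPU-friendly). */
float smooth(size_t n, const float *x)
{
    float v = 0.0f;
    for (size_t i = 0; i < n; ++i)
        v = 0.5f * v + x[i];
    return v;
}
```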
Your question may invite various answers and architecture design considerations. Trying to focus strictly on your question, you need to define more precisely what a "single processing unit" means.
On NVIDIA GPUs, work is arranged in warps, which are not separable: a group of CUDA "cores" will all execute the same instruction on some data, or possibly skip that instruction; the warp size is 32 entries. This notion of a warp is very similar to the SIMD instructions of CPUs with SSE (2 or 4 entries) or AVX (4 or 8 entries) capability. The AVX operations also operate on a group of values, and the different "lanes" of the vector unit cannot perform different operations at the same time.
CUDA is called SIMT because there is a bit more flexibility for CUDA "threads" than for AVX "lanes". However, it is conceptually similar. In essence, a predicate indicates whether the operation should actually be performed on a given CUDA "core", and AVX offers masked operations on its lanes to get similar behavior (see the sketch after this answer). Reading from and writing to memory is also different, as GPUs implement both gather and scatter, whereas only AVX2 processors have gather, and scatter is only scheduled for AVX-512.
Considering a "single processing unit" with this analogy would mean a single CUDA "core" or a single AVX "lane", for example. In that case, the two are VERY similar. In practice, both execute add, sub, mul and fma with single-cycle throughput (latency may vary a lot, though), in a manner compliant with the IEEE norm, in 32-bit or 64-bit precision. Note that the number of double-precision CUDA "cores" varies from gamer devices (a.k.a. GeForce) to Tesla solutions. Also, the frequency of each FPU type differs: discrete GPUs run in the 1 GHz range, while CPUs are more in the 2.x-3.x GHz range.
Finally, GPUs have a special function unit which is capable of computing a coarse approximation of some transcendental functions from the standard math library. These functions, some of which are also implemented in AVX, LRBNi and AVX-512, perform much better than their precise counterparts. The IEEE norm is not strict about most of these functions, hence allowing different implementations, but this is more of a compiler/linker topic.
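A hedged sketch of that masking idea using AVX intrinsics in C (my own illustration, not code from the answer): both branch results are computed for all 8 lanes, and a per-lane mask selects which value is kept, conceptually similar to how a warp predicates its threads.
```c
#include <immintrin.h>

/* y[i] = (x[i] > 0) ? 2*x[i] : x[i], eight floats at a time.
 * n is assumed to be a multiple of 8 for brevity. */
void scale_positive(const float *x, float *y, int n)
{
    const __m256 zero = _mm256_setzero_ps();
    const __m256 two  = _mm256_set1_ps(2.0f);
    for (int i = 0; i < n; i += 8) {
        __m256 v     = _mm256_loadu_ps(x + i);
        __m256 mask  = _mm256_cmp_ps(v, zero, _CMP_GT_OQ); /* per-lane predicate */
        __m256 taken = _mm256_mul_ps(v, two);              /* "taken" branch     */
        __m256 res   = _mm256_blendv_ps(v, taken, mask);   /* select per lane    */
        _mm256_storeu_ps(y + i, res);
    }
}
```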
In essence, the major difference as far as writing serial code goes is the clock speed of the cores. GPUs often have hundreds or thousands of fairly slow cores, clocked well below typical CPU frequencies. This makes them very bad at highly serial applications, but allows them to perform highly fine-grained, concurrent applications (such as rendering) with a great deal of efficiency.
A CPU, however, is designed to perform highly serial applications with little or no multi-threading. Modern CPUs often have 2-8 cores, with clock speeds in excess of 3-4 GHz.
Often, highly optimized systems will take advantage of both resources, using GPUs for highly concurrent tasks and CPUs for highly serial tasks.
There are several other differences, such as the actual instruction sets, cache handling, etc., but those are out of scope for this question. (And even more off topic for SO.)

Are modern GPUs considered to be RISC based or CISC based?

I'm trying to figure out if modern GPUs have a reduced instruction set, or a complex instruction set.
Wikipedia says that it's not the size of the instruction set, rather how many cycles it takes to complete an instruction.
In RISC processors, each instruction can be completed in one cycle.
In CISC processors, it takes several cycles to complete some instructions.
I'm trying to figure out what the case is for modern GPUs.
If you mean Nvidia, then it's clearly RISC, as most of its GPUs don't even have integer division and modulo operations in hardware; only shifts, bitwise operations and three arithmetic operations (addition, subtraction, multiplication) are used to implement those two. I can't find a direct example, but the question "modular arithmetic on the gpu" shows that mod uses a
procedure which implements some sophisticated algorithm (about 50 instructions or even more).
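As a hedged illustration of how such an expansion can look (my own sketch in C, not NVIDIA's actual emitted code), unsigned division and modulo can be built from nothing but shifts, comparisons and subtraction:
```c
#include <stdint.h>

/* Classic shift-and-subtract (restoring) division: computes n / d and
 * n % d without a hardware divide instruction (d must be non-zero). */
void udivmod32(uint32_t n, uint32_t d, uint32_t *q_out, uint32_t *r_out)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; --i) {
        r = (r << 1) | ((n >> i) & 1u);  /* bring down the next bit of n */
        if (r >= d) {                    /* does the divisor fit?        */
            r -= d;
            q |= (1u << i);
        }
    }
    *q_out = q;  /* quotient  n / d */
    *r_out = r;  /* remainder n % d */
}
```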
Even the NVVM (Nvidia virtual machine) language called PTX uses more operations, some of which are "baked" into a bunch of simpler operations anyway after conversion to one of the native languages (there are different versions of such languages because of the nature of GPUs and their generations/families, but collectively they are just called SASS).
Here you can see all the available operations, along with a description of each, which are still very short and not very clear (especially if you don't have a background in machine-level programming, e.g. knowing that "scaled" means 1 shifted left by the operand, just as in x86's "FSCALE" or a "scale factor", etc.):
https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref
If you mean AMD GPUs, then there are a lot of instructions, and it's not as clear-cut, because some sources say that they switched from VLIW to something else (the GCN architecture) right when the Southern Islands GPUs were released.
RISC instruction set: the load/store unit is independent of the other units, so specific instructions are used for loading and storing.
CISC instruction set: the load/store unit is embedded in the instruction execution routine, therefore the instruction is more complex than a RISC instruction, because a CISC instruction performs the load and store stages in addition to the operation itself, and this requires more transistor logic for a single instruction.
The goal of CISC was to take common coding patterns and accelerate them in hardware. You see this in the constant extensions to the base architecture. See Intel's MMX and SSE, and AMD's 3DNow!, etc. https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions This also makes for good marketing, as you need to upgrade to the new processor to accelerate the newest common tasks, and keeps coders busy constantly translating their code patterns to the new extensions.
The goal of RISC was the opposite. It tried to perform few base functions as fast as possible. The coder then needs to continue to break down their common coding tasks to those simple instructions (although high-level programming languages and code packages/libraries accomplish this for you). RISC continues to survive as the architecture for ARM processors. See: https://en.wikipedia.org/wiki/Reduced_instruction_set_computer
I note that GPUs are similar to the RISC philosophy, in that the goal is to perform as many relatively simple computations as fast as possible. The move toward deep learning created a need for training millions of relatively simple parameters, hence the move back toward a highly parallel, relatively simple architecture. Having both philosophies implemented inside your computer is the best of both worlds.

Slow Parallel programming - MPI, VB.NET and FORTRAN

I'm working on parallelizing a software which simulates transport and flow process in the unsaturated soil zone. The software consists of a VB.NET user interface, and a FORTRAN DLL kernel to do the calculations.
I parallelized the software using the MPI.NET package in the VB.NET part. When the program is started with a number of processes, all of them except the master process go into a wait function, while the master process takes care of the interaction of the software with the user. When all the data required for the simulation has been entered, the master process enters the FORTRAN DLL and calls the other processes. These jump to the starting point of the function in the DLL, and together all the processes solve a linear system of equations about 10-20 times (the original partial differential equation is nonlinear, hence these iterations to gain accuracy in the solution). When the solution is computed, all the processes go back to VB.NET. This is done for all the timesteps of the simulation. When all steps are computed, the master process continues with the user interaction, while the other processes go back into the wait function until they are called again by the master process.
The thing is that this program runs much slower than the original, sequential version. Now there might be a number of reasons for this. I used the PETSc library in the FORTRAN DLL to solve the system of equations, and I think I have configured it quite well. My question is whether, somewhere in the architecture I described, there could be a point or two which could cause a significant slowdown if not handled correctly. I'm not sure, for example, whether the repeated calls into the DLL function cost a lot of time.
My system is an Intel Xeon 3470 processor with 8 GB RAM. The systems I tried to solve had up to 120,000 unknowns, which I know is at the very lower bound of what should be calculated in parallel, but at least with the 120,000-unknown matrix I would have expected better performance than I measured.
Thanks in advance for your thoughts,
Martin
I would say that 120,000 degrees of freedom and 10-20 iterations is not that large a problem. Million degree of freedom problems were done when I did finite element analysis for a living, and that was 16 years ago.
Is it possible to solve it using an in-memory solver, without parallelization, with 8GB of RAM? That would certainly be your benchmark. Is that what you're comparing your parallel results to?
Are the parallel processes running on different processors or different machines? Parallelization doesn't buy you anything if everything is done on a single processor. You have to context switch and time slice processes, and there's overhead associated with MPI to communicate between processes. I would expect a parallel solution on a single processor to run more slowly than a single thread, in-memory solution.
If you have multiple processes, then I'd say it's a matter of tuning. I'd plot performance versus number of parallel processes. If there's a speedup, you should find that it improves with more processes until you reach a saturation point, beyond which the overhead is greater than the benefit.
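One practical way to support that tuning plot is to measure compute time and communication/synchronization time separately inside the parallel part. Below is a minimal, illustrative C sketch using standard MPI calls (the same calls exist in the Fortran and MPI.NET bindings); solve_local() is a hypothetical stand-in for the per-process share of the linear solve, not the actual PETSc code.
```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical placeholder for each process's share of the solve. */
static void solve_local(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 10000000L; ++i)
        s += 1.0 / (double)(i + 1);
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    solve_local();                    /* local computation only      */
    double t1 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);      /* stands in for data exchange */
    double t2 = MPI_Wtime();

    double t[2] = { t1 - t0, t2 - t1 }, tmax[2];
    MPI_Reduce(t, tmax, 2, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("compute: %.3f s, sync/comm: %.3f s\n", tmax[0], tmax[1]);

    MPI_Finalize();
    return 0;
}
```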
If you have multiple cores, when you run your program sequentially, can you see that only one or a few processors are utilized?
If the load in the sequential case is high and evenly distributed over all cores then IMHO there is no need to parallelize your program.
My system has a Xeon 3470, which is a quad-core processor, so the computations are all done on these 4 cores on one machine. I don't run the program with more than 4 processes, of course. The old solver that the software had was sequential, and that still runs faster than the parallel version. When I plot the number of processes against runtime, I see that runtime even increases a little bit with smaller models, but that is to be expected because of the communication overhead.
In both the sequential and the parallel case all 4 processors are utilized, and the load balance between them is acceptable.
Like I said, I know that the models I've tested so far are not ideal to talk about parallel performance. I was just wondering if besides the communication overhead due to MPI there could still be another point that could lead to the slowdown of the program.