GNU Parallel and the GPU?

I am interested in getting GNU Parallel to run some numerical computation tasks on the GPU. Generically speaking, here is my initial approach:
Write the tasks to use OpenCL, or some other GPU interfacing library
Call GNU parallel on the task list (I am unsure about the need for this step)
This brought up the following questions:
Does my approach/use-case benefit from the use of GNU Parallel (i.e. should I even use it here)?
Does GNU Parallel offer a built-in mechanism for running tasks in parallel on a GPU?
If so, how can I configure GNU Parallel to do this?

Modern CPUs have multiple cores, which means they can run different instructions at the same time: while core 1 is running a MUL, core 2 may be running an ADD. This is also called MIMD (Multiple Instructions, Multiple Data).
GPUs, however, cannot run different instructions at the same time. They excel at running the same instruction on large amounts of data: SIMD (Single Instruction, Multiple Data).
Modern GPUs have multiple cores that are each SIMD.
So where does GNU Parallel fit into this mix?
GNU Parallel starts programs. If your program uses a GPU and you have a single GPU in your system, GNU Parallel will not make much sense. But if you have, say, 4 GPUs in your system, then it makes sense to keep all 4 running at the same time. So if your program reads the variable CUDA_VISIBLE_DEVICES to decide which GPU to run on, you can do something like this (the job slot number {%} runs from 1 to 4, so subtracting 1 gives device IDs 0 to 3):
seq 10000 | parallel -j4 CUDA_VISIBLE_DEVICES='$(({%} - 1))' compute {}
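For comparison, the same slot-to-GPU mapping can be sketched in Python without GNU Parallel. This is only an illustration: NUM_GPUS, run_on_free_gpu and run_all are hypothetical names, and the child process is a stand-in for compute that merely prints the device ID it was given.

```python
import os
import queue
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

NUM_GPUS = 4  # assumption: number of GPUs in the machine

# Stand-in for `compute`: prints the GPU it was assigned and its task argument.
CHILD = "import os, sys; print(os.environ['CUDA_VISIBLE_DEVICES'], sys.argv[1])"


def run_on_free_gpu(free_gpus, task):
    """Take a free GPU ID, run one task pinned to it, then release the ID."""
    gpu = free_gpus.get()  # blocks until some GPU ID is free
    try:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        done = subprocess.run(
            [sys.executable, "-c", CHILD, str(task)],
            env=env, capture_output=True, text=True, check=True,
        )
        return done.stdout.split()  # [gpu_id, task]
    finally:
        free_gpus.put(gpu)


def run_all(tasks, num_gpus=NUM_GPUS):
    """Run every task, keeping at most one task per GPU in flight."""
    free_gpus = queue.Queue()
    for gpu in range(num_gpus):
        free_gpus.put(gpu)
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        return list(pool.map(lambda t: run_on_free_gpu(free_gpus, t), tasks))
```

Calling run_all(range(10000)) would play the role of the seq 10000 pipeline above. For plain command-line jobs the one-line GNU Parallel version is simpler.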

Related

What is the difference between the gem5 CPU models and which one is more accurate for my simulation?

When running a simulation in gem5, I can select a CPU with fs.py --cpu-type.
This option can also show a list of all CPU types if I use an invalid CPU type such as fs.py --cpu-type.
What is the difference between those CPU types and which one should I choose for my experiment?
Question inspired by: https://www.mail-archive.com/gem5-users@gem5.org/msg16976.html

An overview of the CPU types can be found at: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-cpu-types
In summary:
simplistic CPUs (derived from BaseSimpleCPU): for example AtomicSimpleCPU (the default one). They have no CPU pipeline, and are therefore completely unrealistic. However, they also run much faster. Therefore, they are mostly useful to boot Linux fast and then checkpoint and switch to a more detailed CPU.
Within the simple CPUs we can notably distinguish:
AtomicSimpleCPU: memory requests finish immediately
TimingSimpleCPU: memory requests actually take time to go through to the memory system and return. Since there is no CPU pipeline however, the simulated CPU stalls on every memory request waiting for a response.
An alternative to those is to use KVM CPUs to speed up boot if host and guest ISA are the same, although as of 2019, KVM is less stable as it is harder to implement and debug.
in-order CPUs: derived from the generic MinorCPU by parametrization, Minor stands for In Order:
for ARM: HPI is made by ARM and models a "(2017) modern in-order Armv8-A implementation". This is your best in-order ARM bet.
out-of-order CPUs, derived from the generic DerivO3CPU by parametrization, O3 stands for Out Of Order:
for ARM: there are no models specifically published by ARM as of 2019. The only specific O3 model available is ex5_big for an A15, but you would have to verify its authors' claims on how well it models the real A15 core.
If none of those are accurate enough for your purposes, you could try to create your own in-order/out-of-order models by parametrizing MinorCPU / DerivO3CPU like HPI and ex5_big do, although this could be hard to get right, as there isn't generally enough public information on non-free CPUs to do this without experiments or reverse engineering.
The other thing you will want to think about is the memory system model. There are basically two choices: classic vs Ruby, and within Ruby, several options are available, see also: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-ruby-build

Julia uses only 20-30% of my CPU. What should I do?

I am running a program that does numeric ODE integration in Julia. I am running Windows 10 (64-bit), with an Intel Core i7-4710MQ @ 2.50GHz (8 logical processors).
I noticed that when my code is running in Julia, at most 30% of the CPU is in use. After reading the parallelization documentation, I started Julia using:
C:\Users\*****\AppData\Local\Julia-0.4.5\bin\julia.exe -p 8 and expected to see improvements. However, I did not see any.
Therefore my question is the following:
Is there a special way I have to write my code in order for it to use CPU more efficiently? Is this maybe a limitation posed by my operating system (windows 10)?
I submit my code in the julia console with the command:
include("C:\\Users\\****\\AppData\\Local\\Julia-0.4.5\\13. Fast Filesaving Format.jl").
Within this code I use some additional packages with:
using ODE; using PyPlot; using JLD.
I measure the CPU usage with windows' "Task Manager".

The -p 8 option to julia starts 8 worker processes, and disables multithreading in libraries like BLAS and FFTW so that the workers don't oversubscribe the physical threads on the system, since oversubscription kills performance in well-balanced distributed workloads. If you want to get more speed out of -p 8 then you need to distribute work between those workers, e.g. by having each of them do an independent computation, or by having them collaborate on a computation via SharedArrays. You can't just add workers and not change the program. If you are using BLAS (doing lots of matrix multiplies) or FFTW (doing lots of Fourier transforms), then if you don't use the -p flag, you'll automatically get multithreading from those libraries. Otherwise, there is no (non-experimental) user-level threading in Julia yet. There is experimental threading support and version 1.0 will support threading, but I wouldn't recommend that yet unless you're an expert.
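The point that extra workers only help once work is handed to them is not specific to Julia. As a rough analogue in Python (not Julia code, and not the asker's program), the sketch below starts one worker process per independent ODE and collects the results. The worker script, the test equation dy/dt = -k*y and the name integrate_in_parallel are all assumptions made for illustration.

```python
import subprocess
import sys

# Worker script: forward Euler for dy/dt = -k*y, y(0) = 1, integrated to t = 1.
WORKER = """
import sys
k = float(sys.argv[1])
steps = 10000
h = 1.0 / steps
y = 1.0
for _ in range(steps):
    y += h * (-k * y)
print(y)
"""


def integrate_in_parallel(ks):
    """Start one worker process per rate constant, then gather y(1) from each."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, str(k)],
            stdout=subprocess.PIPE, text=True,
        )
        for k in ks
    ]  # all workers are running concurrently at this point
    return [float(p.communicate()[0]) for p in procs]
```

Each worker here does an independent computation, which is the first of the two strategies mentioned above. A single integration that is not split up this way would still keep only one process busy.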

CPU and GPU differences

What is the difference between a single processing unit of CPU and single processing unit of GPU?
Most places I've come across on the internet cover the high-level differences between the two. I want to know what instructions each can perform, how fast they are, and how these processing units are integrated into the complete architecture.
It seems like a question with a long answer. So lots of links are fine.
edit:
In the CPU, the FPU runs real number operations. How fast are the same operations being done in each GPU core? If fast then why is it fast?
I know my question is very generic but my goal is to have such questions answered.

Short answer
The main difference between GPUs and CPUs is that GPUs are designed to execute the same operation in parallel on many independent data elements, while CPUs are designed to execute a single stream of instructions as quickly as possible.
Detailed answer
Part of the question asks:
"In the CPU, the FPU runs real number operations. How fast are the same operations being done in each GPU core? If fast then why is it fast?"
This refers to the floating point (FP) execution units that are used in CPUs and GPUs. The main difference is not how a single FP execution unit is implemented. Rather the difference is that a CPU core will only have a few FP execution units that operate on independent instructions, while a GPU will have hundreds of them that operate on independent data in parallel.
GPUs were originally developed to perform computations for graphics applications, and in these applications the same operation is performed repeatedly on millions of different data points (imagine applying an operation that looks at each pixel on your screen). By using SIMD or SIMT operations the GPU reduces the overhead of processing a single instruction, at the cost of requiring multiple instructions to operate in lock-step.
Later GPGPU programming became popular because there are many types of programming problems besides graphics that are suited to this model. The main characteristic is that the problem is data parallel, namely the same operations can be performed independently on many separate data elements.
In contrast to GPUs, CPUs are optimized to execute a single stream of instructions as quickly as possible. CPUs use pipelining, caching, branch prediction, out-of-order execution, etc. to achieve this goal. Most of the transistors and energy spent executing a single floating point instruction are spent on the overhead of managing that instruction's flow through the pipeline, rather than in the FP execution unit. While a GPU's and a CPU's FP units will likely differ somewhat, this is not the main difference between the two architectures. The main difference is in how the instruction stream is handled. CPUs also tend to have cache-coherent memory between separate cores, while GPUs do not.
There are of course many variations in how specific CPUs and GPUs are implemented. But the high-level programming difference is that GPUs are optimized for data-parallel workloads, while CPU cores are optimized for executing a single stream of instructions as quickly as possible.

Your question may open various answers and architecture design considerations. Trying to focus strictly to your question, you need to define more precisely what a "single processing unit" means.
On NVIDIA GPUs, work is arranged in warps, which cannot be split: a group of CUDA "cores" all execute the same instruction on different data, with some of them possibly skipping that instruction. The warp size is 32 entries. This notion of a warp is very similar to the SIMD instructions of CPUs that have SSE (2 or 4 entries) or AVX (4 or 8 entries) capability. The AVX operations also operate on a group of values, and different "lanes" of this vector unit cannot do different operations at the same time.
CUDA is called SIMT as there is a bit more flexibility for CUDA "threads" than there is for AVX "lanes". However, it is conceptually similar. In essence, a predicate indicates whether the operation should be performed on a given CUDA "core". AVX offers masked operations on its lanes to provide similar behavior. Reading from and writing to memory also differs: GPUs implement both gather and scatter, whereas only AVX2 processors have gather, and scatter is scheduled solely for AVX-512.
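The predicate idea can be made concrete with a toy model. The sketch below is plain Python, not CUDA, and models no real hardware: warp_execute is a hypothetical helper that runs one if/else over 32 lanes in lock-step, issuing each branch once for the whole warp and masking off the lanes whose predicate does not select it.

```python
WARP_SIZE = 32


def warp_execute(data):
    """Run `if x is even: x //= 2 else: x = 3*x + 1` across one warp.

    Returns the new lane values and how many branch bodies were issued.
    """
    assert len(data) == WARP_SIZE
    predicate = [x % 2 == 0 for x in data]  # one predicate bit per lane
    out = list(data)
    issued = 0
    if any(predicate):
        # 'then' body: issued once, only lanes with predicate True commit
        out = [x // 2 if p else x for x, p in zip(out, predicate)]
        issued += 1
    if not all(predicate):
        # 'else' body: issued once, only lanes with predicate False commit
        out = [x if p else 3 * x + 1 for x, p in zip(out, predicate)]
        issued += 1
    return out, issued
```

When every lane takes the same branch, only one body is issued. When lanes disagree, both bodies are issued and every lane waits through the one it skips, which is the cost of lock-step execution.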
Considering a "single processing unit" with this analogy would mean a single CUDA "core", or a single AVX "lane" for example. In that case, the two are VERY similar. In practice both perform add, sub, mul and fma in a single cycle (that is throughput; latency may vary a lot), in a manner compliant with the IEEE standard, in 32-bit or 64-bit precision. Note that the number of double-precision CUDA "cores" varies from gamer devices (a.k.a. GeForce) to Tesla solutions. Also, the frequency of each FPU type differs: discrete GPUs operate in the 1 GHz range whereas CPUs are more in the 2.x-3.x GHz range.
Finally, GPUs have a special function unit which is capable of computing a coarse approximation of some transcendental functions from the standard math library. These functions, some of which are also implemented in AVX, LRBni and AVX-512, perform much better than their precise counterparts. The IEEE standard is not strict on most of these functions, hence allowing different implementations, but this is more a compiler/linker topic.

In essence, the major difference as far as writing code to run serially is concerned is the clock speed of the cores. GPUs often have hundreds of fairly slow cores (GPU cores are typically clocked well below CPU cores). This makes them very bad at highly serial applications, but allows them to perform fine-grained, highly concurrent applications (such as rendering) with a great deal of efficiency.
A CPU, however, is designed to perform highly serial applications with little or no multi-threading. Modern CPUs often have 2-8 cores, with clock speeds of 3-4 GHz.
Often, highly optimized systems will take advantage of both resources, using GPUs for highly concurrent tasks and CPUs for highly serial tasks.
There are several other differences such as the actual instruction sets, cache handling, etc, but those are out of scope for this question. (And even more off topic for SO)

superscalar and VLIW

I want to ask some questions related to ILP.
A superscalar processor is sort of a mixture of the scalar and vector processor. So can I say that the architecture of a vector processor follows superscalar?
Processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that. What does this mean?
I have read 'A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor'. Can't superscalar use more than one processor? Can anyone provide an example of where superscalar is used?
VLIW: I have gone through this article; there is a figure 4 on page 9. It shows a generic VLIW implementation, without the complex reorder buffer and decoding and dispatching logic. The term "without decoding" is confusing me.
Regards,
anas anjaria

Check this article.
The basic difference can be seen in two pictures (images not preserved): a simple processor and a superscalar processor.

A superscalar processor is sort of a mixture of the scalar and vector processor.
LOL, no. A superscalar core is a core that can execute more than one instruction per clock cycle.

A superscalar processor is sort of a mixture of the scalar and vector processor.
No, this is definitely not true.
A scalar processor performs computations on one piece of data at a time.
A superscalar can execute multiple scalar instructions at a time.
A VLIW can execute multiple operations at a time.
A vector processor can operate on a vector of data at a time.
The superscalar Haswell CPU that I'm typing this on has 8 execution ports: 4 integer operations, 2 memory reads and 2 stores. Potentially 8 x86 instructions could execute simultaneously. That's superscalar. The 8080 could only execute 1 instruction at a time. That's scalar.
Haswell is both pipelined and superscalar. It's also speculative and out-of-order. It's hyperthreaded (2 threads per core) and multi-core (2-18 cores). It's just a beast.
Instruction level parallelism (ILP) is a characteristic or measure of a program not a CPU. A compiler scheduler will search for ILP statically or a CPU's scheduler will search for ILP dynamically. If they find it, then they can order+execute instructions accordingly.
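To make "ILP is a property of the program" concrete, here is a small Python sketch that estimates the available ILP of a straight-line instruction sequence as instruction count divided by the length of the longest dependency chain. The simplifications are assumptions, not a model of Haswell: every instruction takes one cycle, only true (read-after-write) dependencies count, and available_ilp is a hypothetical helper.

```python
def available_ilp(instructions):
    """Estimate ILP for a list of (dest, sources) pairs in program order.

    Assumes unit latency, unlimited execution units, and that register
    renaming removes all but read-after-write dependencies.
    """
    depth = {}          # register name -> chain depth of its latest writer
    critical_path = 0
    for dest, sources in instructions:
        # An instruction can start one step after its deepest input is ready.
        d = 1 + max((depth.get(s, 0) for s in sources), default=0)
        depth[dest] = d
        critical_path = max(critical_path, d)
    return len(instructions) / critical_path


independent = [("a", ["x", "y"]), ("b", ["z", "w"]),
               ("c", ["u", "v"]), ("d", ["s", "t"])]
chain = [("a", ["x", "y"]), ("b", ["a"]), ("c", ["b"]), ("d", ["c"])]
```

Under these assumptions available_ilp(independent) is 4.0 and available_ilp(chain) is 1.0. A wide superscalar core can only speed up the first sequence; the second runs one instruction per cycle no matter how many execution ports exist.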

Check out this first (http://en.wikipedia.org/wiki/Superscalar):
A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.
This means that, for example, a CPU with two ALUs (arithmetic logic units) can issue two arithmetic instructions in the same cycle and execute them. Each arithmetic instruction will be executed in a different ALU.
Second check this (http://en.wikipedia.org/wiki/Instruction_level_parallelism):
It will help you not to confuse the different techniques for achieving ILP (instruction level parallelism).
Third (http://en.wikipedia.org/wiki/P5_(microprocessor)): An example of a superscalar processor is the original Intel Pentium. It has two instruction pipelines.

Slow Parallel programming - MPI, VB.NET and FORTRAN

I'm working on parallelizing a software which simulates transport and flow process in the unsaturated soil zone. The software consists of a VB.NET user interface, and a FORTRAN DLL kernel to do the calculations.
I parallelized the software by using the package MPI.NET in the VB.NET part. When the program is started with a number of processes, all of them but the master process go into a wait function, while the master process takes care of the interaction of the software with the user. When all the data required for the simulation is entered, the master process enters the FORTRAN DLL, and calls the other processes. These jump to the starting point of the function in the DLL, and together all the processes solve a linear system of equations about 10-20 times (the original partial differential equation is nonlinear, hence these iterations to gain accuracy in the solution). When the solution is computed, all the processes go back to VB.NET. This is done for all the timesteps of the simulation. When all steps are computed, the master process continues with the user interaction, while the other processes go back into the wait function, until they are called again by the master process.
The thing is that this program runs much slower than the original, sequential version of it. Now there might be a number of reasons for this. I used the PETSc library in the FORTRAN DLL to solve the system of equations, and I think I have configured it quite well. My question is whether there could be a point or two in the architecture I described that could cause a significant slowdown if not handled correctly. I'm not sure, for example, whether the repeated calls of the DLL function can cost a lot of time.
My system is an Intel Xeon 3470 processor with 8GB RAM. The systems I tried to solve had up to 120,000 unknowns, which I know is at the very lower bound of what should be calculated in parallel, but at least with the 120,000 matrix I would have expected better performance than I measured.
Thanks in advance for your thoughts,
Martin

I would say that 120,000 degrees of freedom and 10-20 iterations is not that large a problem. Million degree of freedom problems were done when I did finite element analysis for a living, and that was 16 years ago.
Is it possible to solve it using an in-memory solver, without parallelization, with 8GB of RAM? That would certainly be your benchmark. Is that what you're comparing your parallel results to?
Are the parallel processes running on different processors or different machines? Parallelization doesn't buy you anything if everything is done on a single processor. You have to context switch and time slice processes, and there's overhead associated with MPI to communicate between processes. I would expect a parallel solution on a single processor to run more slowly than a single thread, in-memory solution.
If you have multiple processes, then I'd say it's a matter of tuning. I'd plot performance versus number of parallel processes. If there's a speedup, you should find that it improves with more processes until you reach a saturation point, beyond which the overhead is greater than the benefit.
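The saturation point described above can be illustrated with a toy cost model. This is a sketch under stated assumptions, not a measurement of the asker's program: runtime uses Amdahl's law plus a communication cost that grows linearly with the number of processes, and t_serial, parallel_fraction and comm_cost are made-up parameters.

```python
def runtime(p, t_serial=10.0, parallel_fraction=0.9, comm_cost=0.25):
    """Modelled wall time with p processes (all parameters are assumptions)."""
    compute = t_serial * ((1 - parallel_fraction) + parallel_fraction / p)
    communication = comm_cost * (p - 1)  # assumed linear MPI overhead
    return compute + communication


def best_process_count(max_p=16, **params):
    """Process count in 1..max_p with the lowest modelled runtime."""
    return min(range(1, max_p + 1), key=lambda p: runtime(p, **params))
```

With the default parameters the modelled runtime falls until 6 processes and rises afterwards. If comm_cost is large relative to the compute time, as can happen for a small system, best_process_count returns 1, meaning the sequential run wins in this model.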

If you have multiple cores, when you run your program sequentially can you see that only one or a few processors are utilized?
If the load in the sequential case is high and evenly distributed over all cores then IMHO there is no need to parallelize your program.

My system has a Xeon 3470, which is a quad-core processor. So the computations are all done on these 4 cores on one machine. I don't run the program with more than 4 processes of course. The old solver that the software had was sequential of course, and that still runs faster than the parallel version. When I plot number of processes against runtime, I see that runtime even increases a little bit with smaller models, but that is to be expected because of the communication overhead.
In both the sequential and the parallel case all 4 processors are utilized, and the load balance between them is acceptable.
Like I said, I know that the models I've tested so far are not ideal to talk about parallel performance. I was just wondering if besides the communication overhead due to MPI there could still be another point that could lead to the slowdown of the program.
