I've noticed that there is a src/arch/null directory in the gem5 source tree at commit 211869ea950f3cc3116655f06b1d46d3fa39fb3a, sitting next to "real" ISAs like src/arch/x86/.
This suggests that there is a NULL ISA in gem5, but it does not seem to have any registers or other common CPU components.
What is this NULL ISA for?
Inspired by: https://www.mail-archive.com/gem5-users#gem5.org/msg16968.html
I believe that the main application of the NULL ISA is to support tests where you don't need to simulate a CPU, notably traffic generators such as the Garnet synthetic traffic mentioned at: http://www.gem5.org/Garnet_Synthetic_Traffic
Traffic generators are setups that produce memory requests that attempt to be similar to those of a real system component such as a CPU, but as a higher level approximation, without actually implementing a detailed microarchitecture.
The advantage is that traffic generators run faster than detailed models and can be easier to implement. The downside is that the simulation won't model the real system as accurately.
Also, doing a NULL build is faster than building a regular ISA, since it skips all the ISA specifics. This can be a big cost saving for the continuous integration system.
As a concrete example, on gem5 6e06d231ecf621d580449127f96fdb20154c4f66 you could run scripts such as:
scons -j`nproc` build/NULL/gem5.opt
./build/NULL/gem5.opt configs/example/ruby_mem_test.py -m 10000000
./build/NULL/gem5.opt configs/example/memcheck.py -m 1000000000
./build/NULL/gem5.opt configs/example/memtest.py -m 10000000000
./build/NULL/gem5.opt configs/example/ruby_random_test.py --maxloads 5000
./build/NULL/gem5.opt configs/example/ruby_direct_test.py --requests 50000
./build/NULL/gem5.opt configs/example/garnet_synth_traffic.py --sim-cycles 5000000
These tests can also be run on a "regular" ISA build, for example:
scons -j`nproc` build/ARM/gem5.opt
./build/ARM/gem5.opt configs/example/ruby_mem_test.py -m 10000000
It is just that in that case, the binary also contains all the extra ARM-specific code, which simply does not get used.
If you try to run a "regular" script with NULL however, it blows up. For example:
./build/NULL/gem5.opt configs/example/se.py -u /tmp/hello.out
fails with:
optparse.OptionValueError: option --cpu-type: invalid choice: 'AtomicSimpleCPU' (choose from )
since there is no valid CPU to run the simulation on (empty choices).
If you look at the source of the traffic generator Python scripts, you can see that the traffic generator implements the CPU memory interface itself, and is therefore seen as a CPU by gem5. For example, configs/example/ruby_mem_test.py does:
cpus = [ MemTest(...
system = System(cpu = cpus,
so we understand that the MemTest SimObject is the CPU of that system.
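To make the idea more concrete, here is a stripped-down, untested sketch of such a config, using the classic memory system instead of Ruby for brevity; parameter and port names vary between gem5 versions, so treat it as an illustration rather than a working script:

import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock='1GHz',
                                   voltage_domain=VoltageDomain())
system.mem_mode = 'timing'
system.mem_ranges = [AddrRange('512MB')]

# The "CPU" slot is filled by a traffic generator, not a real CPU model.
system.cpu = MemTest(max_loads=10000)

system.membus = SystemXBar()
system.cpu.port = system.membus.slave       # pre-2021 port naming
system.system_port = system.membus.slave

system.mem_ctrl = DDR3_1600_8x8(range=system.mem_ranges[0])
system.mem_ctrl.port = system.membus.master

root = Root(full_system=False, system=system)
m5.instantiate()
exit_event = m5.simulate()
print('Exiting @ tick %d: %s' % (m5.curTick(), exit_event.getCause()))

No ISA or pipeline is involved anywhere: the tester just issues memory requests directly into the memory system, which is why a NULL build is enough.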
Related
I wish to simulate a fairly non-trivial program in the gem5 environment.
I have three files that I cross-compiled accordingly for the designated ISA:
main.c
my_library.c
my_library.h
I use the command
build/ARM/gem5.opt configs/example/se.py --cpu-type=TimingSimpleCPU -c test/test-progs/hello/src/my_binary
But is there a way, maybe an argument of the se.py script, that can make my simulation run faster?
The default commands are normally the fastest ones available (and therefore the ones with the lowest simulation accuracy).
gem5.fast build
A .fast build can run about 20% faster without losing simulation accuracy, by disabling some debug-related macros:
scons -j `nproc` build/ARM/gem5.fast
build/ARM/gem5.fast configs/example/se.py --cpu-type=TimingSimpleCPU \
-c test/test-progs/hello/src/my_binary
The speedup is achieved by:
disabling asserts and logging through macros. https://github.com/gem5/gem5/blob/ae7dd927e2978cee89d6828b31ab991aa6de40e2/src/SConscript#L1395 does:
if 'fast' in needed_envs:
    CPPDEFINES = ['NDEBUG', 'TRACING_ON=0'],
NDEBUG is a standardized way to disable assert: _DEBUG vs NDEBUG
TRACING_ON has effects throughout the source, but the most notable one is at: https://github.com/gem5/gem5/blob/ae7dd927e2978cee89d6828b31ab991aa6de40e2/src/base/trace.hh#L173
#if TRACING_ON
#define DPRINTF(x, ...) do {                     \
    using namespace Debug;                       \
    if (DTRACE(x)) {                             \
        Trace::getDebugLogger()->dprintf_flag(   \
            curTick(), name(), #x, __VA_ARGS__); \
    }                                            \
} while (0)
#else // !TRACING_ON
#define DPRINTF(x, ...) do {} while (0)
#endif // TRACING_ON
which means that --debug-flags basically won't do anything.
turning on link time optimization (see: Does the --force-lto gem5 scons build option speed up simulation significantly and how does it compare to a gem5.fast build?), which can however slow down the link step (and therefore how long it takes to relink after a one line change)
So in general, .fast is not worth it while you are developing the simulator: it only pays off once you are done with whatever patches you may have and just need to run hundreds of simulations as fast as possible with different parameters.
TODO: it would be good to benchmark which of the above changes matters most for runtime, and whether LTO actually slows down the link time significantly.
gem5 performance profiling analysis
I'm not aware of any proper performance profiling of gem5 ever having been done to assess which parts of the simulation are slow and whether there is any easy way to improve them. Someone has to do that at some point and post it at: https://gem5.atlassian.net/browse/GEM5
Options that reduce simulation accuracy
Simulation would also be faster, at the cost of lower accuracy, if you drop --cpu-type=TimingSimpleCPU:
build/ARM/gem5.opt configs/example/se.py -c test/test-progs/hello/src/my_binary
which falls back to the default AtomicSimpleCPU, which uses an even simpler (atomic, instantaneous) memory access model.
Other lower accuracy but faster options include:
KVM, but support is not perfect as of 2020, and you need an ARM host to run the simulation on
Gabe's FastModel integration, which is being merged as of 2020; however it requires a FastModel license from ARM, which I think is too expensive for individuals
Also, if someone were to implement binary translation in gem5 (which is how QEMU goes fast), that would be an amazing option.
Related
Gem5 system requirements for decent performance
When running a simulation in gem5, I can select a CPU with fs.py --cpu-type.
This option can also show a list of all available CPU types if I pass an invalid CPU type to fs.py --cpu-type.
What is the difference between those CPU types and which one should I choose for my experiment?
Question inspired by: https://www.mail-archive.com/gem5-users#gem5.org/msg16976.html
An overview of the CPU types can be found at: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-cpu-types
In summary:
simplistic CPUs (derived from BaseSimpleCPU): for example AtomicSimpleCPU (the default one). They have no CPU pipeline, and are therefore completely unrealistic. However, they also run much faster. Therefore, they are mostly useful to boot Linux fast and then checkpoint and switch to a more detailed CPU.
Within the simple CPUs we can notably distinguish:
AtomicSimpleCPU: memory requests finish immediately
TimingSimpleCPU: memory requests actually take time to go through to the memory system and return. Since there is no CPU pipeline however, the simulated CPU stalls on every memory request waiting for a response.
An alternative to those is to use KVM CPUs to speed up boot if host and guest ISA are the same, although as of 2019, KVM is less stable as it is harder to implement and debug.
in-order CPUs: derived from the generic MinorCPU by parametrization (Minor stands for In Order):
for ARM: HPI is made by ARM and models a "(2017) modern in-order Armv8-A implementation". This is your best in-order ARM bet.
out-of-order CPUs: derived from the generic DerivO3CPU by parametrization (O3 stands for Out Of Order):
for ARM: there are no models specifically published by ARM as of 2019. The only specific O3 model available is ex5_big for an A15, but you would have to verify its authors' claims on how well it models the real A15 core.
If none of those are accurate enough for your purposes, you could try to create your own in-order/out-of-order models by parametrizing MinorCPU / DerivO3CPU like HPI and ex5_big do, although this could be hard to get right, as there isn't generally enough public information on non-free CPUs to do this without experiments or reverse engineering.
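For reference, the parametrization that HPI and ex5_big do is essentially a Python subclass of the base CPU model with tuned values. A hedged sketch along those lines (the class name and all numbers below are made up for illustration, not a real core, and the real parameter lists are much longer):

from m5.objects import DerivO3CPU, TournamentBP

class MyInHouseO3CPU(DerivO3CPU):
    # Front-end and back-end widths of the hypothetical core.
    fetchWidth = 2
    decodeWidth = 2
    issueWidth = 2
    commitWidth = 2
    # Buffer and queue sizes.
    numROBEntries = 64
    LQEntries = 16
    SQEntries = 16
    # Branch predictor choice.
    branchPred = TournamentBP()

You would then instantiate this class in your own config script, or register it so that --cpu-type can find it, the same way the existing ARM cores do.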
The other thing you will want to think about is the memory system model. There are basically two choices: classical vs Ruby, and within Ruby, several options are available, see also: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-ruby-build
I have disabled hardware prefetching using the following guidelines:
Installed msr-tools 1.3
wrmsr -a 0x1A4 1
The prefetcher information for my system (Broadwell) is in MSR address 0x1A4, as shown by Intel documentation.
I ran rdmsr -a 0x1A4 and the output showed 1.
According to the Intel docs, if the bit corresponding to a particular prefetcher is set to 1, that prefetcher is disabled.
I wanted to know if there is any other way I can verify that my hardware prefetchers have been disabled.
A disabled prefetcher will slow down some operations that benefit from prefetching. You will need to write some code (probably in assembly language) and measure its performance with the prefetcher enabled and disabled.
Some time ago I wrote a test program to measure memory read performance. It repeatedly read memory in blocks of different sizes, and showed a clear correlation between the block sizes and the capacities of the different levels of memory cache.
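A rough sketch of that kind of measurement, here in Python/NumPy just for illustration (a tight C or assembly loop gives much cleaner numbers, since the interpreter and NumPy add overhead):

import time
import numpy as np

def read_bandwidth(block_bytes, repeats=50):
    # Sequentially read a block of the given size many times and keep the
    # best (least noisy) timing.
    a = np.ones(block_bytes // 8, dtype=np.float64)
    best = float('inf')
    for _ in range(repeats):
        t0 = time.perf_counter()
        a.sum()  # forces a full sequential read of the block
        best = min(best, time.perf_counter() - t0)
    return block_bytes / best / 2**30  # GiB/s

# Bandwidth should drop as the block outgrows each cache level; the gap
# between prefetcher on and off is most visible for the large, sequential cases.
for kib in [16, 64, 256, 1024, 4096, 16384, 65536]:
    print('%8d KiB: %6.1f GiB/s' % (kib, read_bandwidth(kib * 1024)))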
You can run some I/O intensive and prefetch-friendly workloads to verify.
By the way, my CPU is a Gold 6240, and setting 0x1A4 to 1 does not work on it.
Instead, I use sudo wrmsr -a 0x1A4 0xf
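Another way to verify, besides rdmsr and benchmarking, is to read the MSR through the msr kernel module and decode the individual bits. This is a sketch only (needs modprobe msr and root); the bit meanings are my reading of Intel's documentation for MSR 0x1A4, where a set bit means that prefetcher is disabled:

import glob
import os
import struct

MSR = 0x1A4
BITS = {
    0: 'L2 hardware prefetcher',
    1: 'L2 adjacent cache line prefetcher',
    2: 'DCU (L1D) streaming prefetcher',
    3: 'DCU IP prefetcher',
}

for path in sorted(glob.glob('/dev/cpu/*/msr')):
    fd = os.open(path, os.O_RDONLY)
    try:
        # The msr driver exposes each MSR at the file offset equal to its address.
        os.lseek(fd, MSR, os.SEEK_SET)
        value, = struct.unpack('<Q', os.read(fd, 8))
    finally:
        os.close(fd)
    cpu = path.split('/')[3]
    status = ', '.join('%s: %s' % (name, 'disabled' if value >> bit & 1 else 'enabled')
                       for bit, name in BITS.items())
    print('cpu%s 0x%x -> %s' % (cpu, value, status))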
I am interested in getting GNU Parallel to run some numerical computation tasks on the GPU. Generically speaking, here is my initial approach:
Write the tasks to use OpenCL, or some other GPU interfacing library
Call GNU parallel on the task list (I am unsure about the need for this step)
This brought up the following questions:
Does my approach/use-case benefit from the use of GNU Parallel (i.e. should I even use it here)?
Does GNU Parallel offer a built-in mechanism for running tasks in parallel on a GPU? If so, how can I configure GNU Parallel to do this?
Modern CPUs have multiple cores, which means they can run different instructions at the same time: while core 1 is running a MUL, core 2 may be running an ADD. This is also called MIMD - Multiple Instructions, Multiple Data.
GPUs, however, cannot run different instructions at the same time. They excel at running the same instruction on large amounts of data: SIMD - Single Instruction, Multiple Data.
Modern GPUs have multiple cores that are each SIMD.
So where does GNU Parallel fit into this mix?
GNU Parallel starts programs. If your program uses a GPU and you have one single GPU core on your system, GNU Parallel will not make much sense. But if you have, say, 4 GPU cores on your system, then it makes sense to keep these 4 cores running at the same time. So if your program reads the variable CUDA_VISIBLE_DEVICES to decide which GPU core to run on, you can do something like this:
seq 10000 | parallel -j4 CUDA_VISIBLE_DEVICES='$(({%} - 1))' compute {}
I am running a program that does numeric ODE integration in Julia. I am running Windows 10 (64-bit), with an Intel Core i7-4710MQ @ 2.50GHz (8 logical processors).
I noticed that when my code was running in Julia, at most 30% of the CPU was in use. Going through the parallelization documentation, I started Julia using:
C:\Users\*****\AppData\Local\Julia-0.4.5\bin\julia.exe -p 8 and expected to see improvements. I did not see them, however.
Therefore my question is the following:
Is there a special way I have to write my code in order for it to use the CPU more efficiently? Is this maybe a limitation imposed by my operating system (Windows 10)?
I submit my code in the julia console with the command:
include("C:\\Users\\****\\AppData\\Local\\Julia-0.4.5\\13. Fast Filesaving Format.jl").
Within this code I use some additional packages with:
using ODE; using PyPlot; using JLD.
I measure the CPU usage with Windows' "Task Manager".
The -p 8 option to julia starts 8 worker processes and disables multithreading in libraries like BLAS and FFTW, so that the workers don't oversubscribe the physical threads on the system, since that kills performance in well-balanced distributed workloads.
If you want to get more speed out of -p 8, then you need to distribute work between those workers, e.g. by having each of them do an independent computation, or by having them collaborate on a computation via SharedArrays. You can't just add workers and not change the program.
If you are using BLAS (doing lots of matrix multiplies) or FFTW (doing lots of Fourier transforms), then as long as you don't use the -p flag, you'll automatically get multithreading from those libraries. Otherwise, there is no (non-experimental) user-level threading in Julia yet. There is experimental threading support, and version 1.0 will support threading, but I wouldn't recommend it yet unless you're an expert.