How to increase the simulation speed of a gem5 run

I wish to simulate a fairly non-trivial program in the gem5 environment.
I have three files that I cross-compiled for the designated ISA:
main.c
my_library.c
my_library.h
I use the command
build/ARM/gem5.opt configs/example/se.py --cpu-type=TimingSimpleCPU -c test/test-progs/hello/src/my_binary
But is there a way, perhaps an argument of the se.py script, to make my simulation run faster?

The default commands are normally already the fastest available (and therefore the ones with the lowest simulation accuracy).
gem5.fast build
A .fast build can run about 20% faster, without losing simulation accuracy, by disabling some debug-related macros:
scons -j `nproc` build/ARM/gem5.fast
build/ARM/gem5.fast configs/example/se.py --cpu-type=TimingSimpleCPU \
-c test/test-progs/hello/src/my_binary
The speedup is achieved by:
disabling asserts and logging through macros. https://github.com/gem5/gem5/blob/ae7dd927e2978cee89d6828b31ab991aa6de40e2/src/SConscript#L1395 does:
if 'fast' in needed_envs:
    CPPDEFINES = ['NDEBUG', 'TRACING_ON=0'],
NDEBUG is a standardized way to disable assert: _DEBUG vs NDEBUG. A minimal standalone example is sketched just after this list.
TRACING_ON has effects throughout the source, but the most notable one is at: https://github.com/gem5/gem5/blob/ae7dd927e2978cee89d6828b31ab991aa6de40e2/src/base/trace.hh#L173
#if TRACING_ON
#define DPRINTF(x, ...) do { \
    using namespace Debug; \
    if (DTRACE(x)) { \
        Trace::getDebugLogger()->dprintf_flag( \
            curTick(), name(), #x, __VA_ARGS__); \
    } \
} while (0)
#else // !TRACING_ON
#define DPRINTF(x, ...) do {} while (0)
#endif // TRACING_ON
which implies that --debug-flags basically won't do anything in a .fast build.
turning on link time optimization (see: Does the --force-lto gem5 scons build option speed up simulation significantly and how does it compare to a gem5.fast build?), which might slow down the link time (and therefore how long it takes to recompile after a one-line change)
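To make the NDEBUG point above concrete, here is a minimal standalone sketch, independent of gem5, of how defining NDEBUG compiles assert calls away (the file name is just an illustration):
/* ndebug_demo.c: NDEBUG removes assert() at compile time.
 * gcc -o with_asserts ndebug_demo.c          # assert is active, program aborts
 * gcc -DNDEBUG -o no_asserts ndebug_demo.c   # assert expands to nothing
 */
#include <assert.h>
#include <stdio.h>
int main(void)
{
    int x = 1;
    assert(x == 2);   /* aborts the first build, compiled out of the second */
    printf("x = %d\n", x);
    return 0;
}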
So in general, .fast is not worth it while you are developing the simulator; it is only worth it once you have finalized any patches you may have and just need to run hundreds of simulations as fast as possible with different parameters.
TODO it would be good to benchmark which of the above changes matters the most for runtime, and if the link time is actually significantly slowed down by LTO.
gem5 performance profiling analysis
I'm not aware of a proper performance profiling of gem5 ever having been done to assess which parts of the simulation are slow and whether there is any way to improve them easily. Someone has to do that at some point and post it at: https://gem5.atlassian.net/browse/GEM5
Options that reduce simulation accuracy
Simulation would also be faster, though with lower accuracy, without --cpu-type=TimingSimpleCPU:
build/ARM/gem5.opt configs/example/se.py -c test/test-progs/hello/src/my_binary
which defaults to AtomicSimpleCPU, an even simpler CPU model that uses atomic (non-timing) memory accesses.
Other lower accuracy but faster options include:
KVM, but support is not perfect as of 2020, and you need an ARM host to run the simulation on
Gabe's FastModel integration that is getting merged as of 2020, but it requires a FastModel license from ARM, which I think is too expensive for individuals
Also if someone were to implement binary translation in gem5, which is how QEMU goes fast, then that would be an amazing option.
Related
Gem5 system requirements for decent performance

Related

What is the NULL ISA architecture under src/arch/null in gem5?

I've noticed that there is a src/arch/null directory in the gem5 source tree at 211869ea950f3cc3116655f06b1d46d3fa39fb3a, sitting next to "real" ISAs like src/arch/x86/.
This suggests that there is a NULL ISA in gem5, but it does not seem to have any registers or other common CPU components.
What is this NULL ISA for?
Inspired by: https://www.mail-archive.com/gem5-users@gem5.org/msg16968.html
I believe that the main application of the NULL ISA is to support tests where you don't need to simulate a CPU, notably traffic generators such as Garnet, mentioned at: http://www.gem5.org/Garnet_Synthetic_Traffic
Traffic generators are setups that produce memory requests that attempt to be similar to those of a real system component such as a CPU, but at a higher level of approximation and without actually implementing a detailed microarchitecture.
The advantage is that traffic generators run faster than detailed models and can be easier to implement. The downside is that the simulation won't be as accurate as one of the real system.
Also, doing a NULL build is faster than doing a build for a regular ISA, as it skips all the ISA specifics. This can be a big cost-saving win for the continuous integration system.
As a concrete example, on gem5 6e06d231ecf621d580449127f96fdb20154c4f66 you could run scripts such as:
scons -j`nproc` build/NULL/gem5.opt
./build/NULL/gem5.opt configs/example/ruby_mem_test.py -m 10000000
./build/NULL/gem5.opt configs/example/memcheck.py -m 1000000000
./build/NULL/gem5.opt configs/example/memtest.py -m 10000000000
./build/NULL/gem5.opt configs/example/ruby_random_test.py --maxloads 5000
./build/NULL/gem5.opt configs/example/ruby_direct_test.py --requests 50000
./build/NULL/gem5.opt configs/example/garnet_synth_traffic.py --sim-cycles 5000000
These tests can also be run on a "regular" ISA build, for example:
scons -j`nproc` build/ARM/gem5.opt
./build/ARM/gem5.opt configs/example/ruby_mem_test.py -m 10000000
It is just that in that case, you also have all that extra ARM stuff in the binary that does not get used.
If you try to run a "regular" script with NULL however, it blows up. For example:
./build/NULL/gem5.opt configs/example/se.py -u /tmp/hello.out
fails with:
optparse.OptionValueError: option --cpu-type: invalid choice: 'AtomicSimpleCPU' (choose from )
since there is no valid CPU to run the simulation on (empty choices).
If you look at the source of the traffic generator Python scripts, you can see that the traffic generator implements the CPU memory interface itself, and is therefore seen as a CPU by gem5. For example, configs/example/ruby_mem_test.py does:
cpus = [ MemTest(...
system = System(cpu = cpus,
so we understand that the MemTest SimObject is the CPU of that system.

How to switch CPU models in gem5 after restoring a checkpoint and then observe the difference?

I want to boot the Linux kernel in full system (FS) mode with a lightweight CPU to save time, make a checkpoint after boot finishes, and then restore the checkpoint with a more detailed CPU to study a benchmark, as mentioned at: http://gem5.org/Checkpoints
However, when I tried to use -r 1 --restore-with-cpu=, I could not observe cycle differences between the new and old CPU.
The measure I'm looking at is how cache sizes affect the number of cycles that a benchmark takes to run.
The setup I'm using is described in detail at: Why doesn't the Linux kernel see the cache sizes in the gem5 emulator in full system mode? I'm looking at the cycle counts because I can't see cache sizes directly with the Linux kernel currently.
For example, if I boot the Linux kernel from scratch with the detailed and slow HPI model with command (excerpt):
./build/ARM/gem5.opt --cpu-type=HPI --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
and then change the cache sizes, the benchmark does, as expected, get faster as the caches get larger.
However, if I first boot without --cpu-type=HPI, which uses the faster AtomicSimpleCPU model:
./build/ARM/gem5.opt --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
and then create the checkpoint with m5 checkpoint and try to restore with the more detailed HPI CPU:
./build/ARM/gem5.opt --restore-with-cpu=HPI -r 1 --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
then changing the cache sizes makes no difference: I always get the same cycle counts as I do for the AtomicSimpleCPU, indicating that the modified restore was not successful.
The behaviour is analogous on x86 if I try to switch from AtomicSimpleCPU to DerivO3CPU.
Related old thread on the mailing list: http://thread.gmane.org/gmane.comp.emulators.m5.users/14395
Tested at: fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4
From reading through some of the code, I believe that --restore-with-cpu is specifically for the case when your checkpoint was created using a CPU model that isn't the AtomicCPU, since the scripts assume that the AtomicCPU was used to create the checkpoint. I think that when restoring it is important to have the same CPU model the system was checkpointed with; if you give another model with --cpu-type, then it switches to that model after the restore operation has completed.
http://gem5.org/Checkpoints#Sampling has some (small) detail on switching cpu models
First, regarding your question, I don't see how the cycle count is an indication of the restoration result. The cycles being restored should be the same regardless of which CPU you switch to: switching does not change the past cycles. When creating a checkpoint, you basically freeze the simulation at that state, and switching the CPU simply changes all the parameters of the CPU while keeping the tick count unchanged. It is like hot-swapping a CPU.
To correctly verify the restoration, you should keep a copy of config.json from before the restoration and compare it with the new one after restoration. For the x86 case, I could find the string AtomicSimpleCPU there only before the restore.
Furthermore, only --cpu-type determines which CPU is switched to. But that does not make --restore-with-cpu useless. In fact, --restore-with-cpu should only be used when you boot up the system with a CPU other than AtomicSimpleCPU. Most people want to boot up the system with AtomicSimpleCPU and make a checkpoint, since it is faster. But if you mistakenly boot up using DerivO3CPU, then to restore that particular checkpoint you have to set --restore-with-cpu to DerivO3CPU; otherwise, it will fail.
--cpu-type= affected the restore, but --restore-with-cpu= did not
I am not sure why that is, but I have empirically verified that if I do:
-r 1 --cpu-type=HPI
then, as expected, the cache size options start to affect cycle counts: larger caches lead to fewer cycles.
Also keep in mind that caches don't affect AtomicSimpleCPU much, and there is not much point in having them.
TODO so what is the point of --restore-with-cpu= vs --cpu-type if it didn't seem to do anything in my tests?
Except to confuse me, since if --cpu-type != --restore-with-cpu, then the cycle count appears under system.switch_cpus.numCycles instead of system.cpu.numCycles.
I believe this is what is going on (yet untested):
switch_cpus contains the stats for the CPU you switched to
when you set --restore-with-cpu= != --cpu-type, it thinks you have already switched CPUs from the start
--restore-with-cpu has no effect on the initial CPU; it only matters for options that switch the CPU during the run itself, e.g. --fast-forward and --repeat_switch. That is where you will see both the cpu and switch_cpus data get filled up.
TODO: also, if I use or remove --restore-with-cpu=, there is a small 1% cycle difference. But why is there a difference at all? AtomicSimpleCPU cycle count is completely different, so it must not be that it is falling back to it.
The --cpu-type= vs --restore-with-cpu= distinction also showed up for fs.py --fast-forward: https://www.mail-archive.com/gem5-users@gem5.org/msg17418.html
Confirm what is happening with logging
One good sanity check that the CPU you want is actually being used is to enable some logging, as shown at: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/bab029f60656913b5dea629a220ae593cc16147d#gem5-restore-checkpoint-with-a-different-cpu e.g.:
--debug-flags ExecAll,FmtFlag,O3CPU,SimpleCPU
and then see if you start to get O3 messages rather than SimpleCPU ones.

Limiting data collection of Cachegrind, in Valgrind

It is well known that the callgrind analysis tool of the Valgrind suite provides the possibility to start and stop the collection of data via the command line instructions callgrind_control -i on and callgrind_control -i off. For instance, the following code will collect data only after the first hour.
(sleep 3600; callgrind_control -i on) &
valgrind --tool=callgrind --instr-atstart=no ./myprog
Is there a similar option for the cachegrind tool? If so, how can I use it (I cannot find anything in the documentation)? If not, how can I start collecting data after a certain amount of time with cachegrind?
As far as I know, there is no such function for Cachegrind.
However, Callgrind is an extension of Cachegrind, which means that you can use Cachegrind's features with Callgrind.
For example:
valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes ./myprog
will measure your program's cache and branch performance as if you were using Cachegrind.
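In addition to callgrind_control, Valgrind also ships client-request macros in valgrind/callgrind.h that let the program itself decide when collection happens. A minimal sketch, assuming that header is installed in the usual location (warm_up and hot_path are placeholder functions):
/* callgrind_toggle.c: programmatic control of Callgrind instrumentation.
 * The macros compile to no-ops when the program runs outside Valgrind. */
#include <valgrind/callgrind.h>
#include <unistd.h>
static void warm_up(void) { sleep(1); }   /* phase we do not want to measure */
static void hot_path(void) { /* ... the code we actually care about ... */ }
int main(void)
{
    warm_up();                        /* not instrumented when started with --instr-atstart=no */
    CALLGRIND_START_INSTRUMENTATION;  /* begin collecting data here */
    hot_path();
    CALLGRIND_STOP_INSTRUMENTATION;   /* stop collecting */
    CALLGRIND_DUMP_STATS;             /* flush what was gathered so far */
    return 0;
}
Running it with valgrind --tool=callgrind --instr-atstart=no --cache-sim=yes ./myprog then gives Cachegrind-style cache statistics only for the instrumented region.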

Measuring build times to identify bottlenecks

I'm working on improving the build for a few projects. I've improved build times quite significantly, and I'm at a point now where I think the bottlenecks are more subtle.
The build uses GNU-style makefiles. I generate a series of dependency files (.d) and include them in the makefile; otherwise there's nothing fancy going on (e.g., no pre-compiled headers or other caching mechanisms).
The build takes about 95 seconds on a 32-core SPARC Ultra, running with 16 threads in parallel. Idle time hovers around 80% while the build runs, with kernel time hovering between 8 and 10%. I put the code in /tmp, but most of the compiler support files are NFS-mounted, and I believe this may be creating a performance bottleneck.
What tools exist for measuring & tracking down these sorts of problems?
From my own experience, compiling C/C++ code requires the C preprocessor to read a lot of header files. I've experienced situations where generating a complete translation unit took more than 50% of the g++ run time.
Since you mention that the machine is about 80% idle while compiling, the build must be waiting for I/O. iostat and DTrace would be a good starting point.
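If you suspect the preprocessor/header side specifically, here is a minimal sketch of how to check it for a single translation unit (the file name is hypothetical; -E, -H and -ftime-report are standard GCC options):
/* demo.c: estimate how much of the compile is headers and preprocessing.
 * gcc -E demo.c | wc -l        # line count after preprocessing shows header bloat
 * gcc -H -c demo.c             # prints every header file as it is opened
 * gcc -ftime-report -c demo.c  # per-phase timing breakdown of the compile
 */
#include <stdio.h>   /* each system header pulls in further headers */
#include <stdlib.h>
#include <string.h>
int main(void)
{
    printf("hello\n");
    return 0;
}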

Optimization in GCC

I have two questions:
(1) I learned somewhere that -O3 is not recommended with GCC, because
The -O3 optimization level may increase the speed of the resulting executable, but can also increase its size. Under some circumstances where these optimizations are not favorable, this option might actually make a program slower. in fact it should not be used system-wide with gcc 4.x. The behavior of gcc has changed significantly since version 3.x. In 3.x, -O3 has been shown to lead to marginally faster execution times over -O2, but this is no longer the case with gcc 4.x. Compiling all your packages with -O3 will result in larger binaries that require more memory, and will significantly increase the odds of compilation failure or unexpected program behavior (including errors). The downsides outweigh the benefits; remember the principle of diminishing returns. Using -O3 is not recommended for gcc 4.x.
Suppose I have a workstation (Kubuntu 9.04) which has 128 GB of memory and 24 cores and is shared by many users, some of whom may run intensive programs using around 60 GB of memory. Is -O2 a better choice for me than -O3?
(2) I also learned that when a running program crashes unexpectedly, any debugging information is better than none, so the use of -g is recommended for optimized programs, both for development and deployment. But when compiled with -ggdb3 together with -O2 or -O3, will it slow down the speed of execution? Assume I am still using the same workstation.
The only way to know for sure is to benchmark your application compiled with -O2 and -O3. Also, there are some individual optimization options that -O3 includes and which you can turn on and off individually. Concerning the warning about larger binaries, note that just comparing executable file sizes compiled with -O2 and -O3 will not tell you much, because it is the size of small critical internal loops that matters most. You really have to benchmark.
It will result in a larger executable, but there shouldn't be any measurable slowdown.
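A minimal sketch of the kind of benchmark suggested above (work() and its iteration count are placeholders for your own hot code; build the same file once with -O2 and once with -O3 and compare the reported times):
/* bench.c:
 * gcc -O2 -o bench_o2 bench.c   # add -lrt on older glibc for clock_gettime
 * gcc -O3 -o bench_o3 bench.c
 */
#include <stdio.h>
#include <time.h>
/* Placeholder for the code you actually care about. */
static double work(void)
{
    double acc = 0.0;
    for (long i = 0; i < 100000000L; i++)
        acc += (double)i * 0.5;
    return acc;
}
int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double result = work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("result=%f time=%.3fs\n", result, seconds);
    return 0;
}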
Try it
You can rarely make accurate judgments about speed and optimisation without any data.
P.S. This will also tell you if it's worth the effort. How many milliseconds saved in a function used only once at startup is worthwhile?
Firstly, it does appear that the compiler team is essentially admitting that -O3 isn't reliable. It seems like they are saying: try -O3 on your critical loops or critical modules, or your Lattice QCD program, but it's not reliable enough for building the whole system or library.
Secondly, the problem with making the code bigger (inline functions and other things) isn't only that it uses more memory. Even if you have extra RAM, it can slow you down. This is because the faster the CPU chip gets, the more it hurts to have to go out to DRAM. They are saying that some programs will run faster WITH the extra routine calls and unexploded branches (or whatever O3 replaces with bigger things) because without O3 they will still fit in the cache, and that's a bigger win than the O3 transformations.
On the other issue, I wouldn't normally build anything with -g unless I was currently working on it.
-g and/or -ggdb just adds debugging symbols to the executable. It makes the executable file bigger, but that part isn't loaded into memory (except when run in a debugger or similar).
As for what's best for performance of -O2 and -O3, there's no silver bullet. You have to measure/profile it for your particular program.
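A quick way to see the -g point for yourself, as a sketch (the file name is hypothetical; size and strip are standard binutils tools):
/* debug_demo.c: -g adds debug info sections to the ELF file; by design it does
 * not change the generated code at a given -O level.
 * gcc -O2    -o prog       debug_demo.c
 * gcc -O2 -g -o prog_debug debug_demo.c
 * size prog prog_debug     # text/data/bss sizes are the same
 * ls -l prog prog_debug    # the file on disk is bigger with -g
 * strip prog_debug         # removes the debug sections again
 */
#include <stdio.h>
int main(void)
{
    printf("hello\n");
    return 0;
}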
In my experience, GCC does not always generate the best assembly with -O2 and -O3. The best way is to apply specific optimization flags, which can generate better code than -O2 and -O3 alone, because there are flags that are not included in -O2 and -O3 and that can be useful for making your code faster.
One good example is that data prefetch instructions are not inserted into your code by -O2 and -O3, but using additional flags for prefetching can make memory-intensive code 2 to 3% faster.
You can find list of GCC optimization flags at http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.
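As a sketch of what such prefetching looks like in practice (the array size and prefetch distance are arbitrary illustrations; -fprefetch-loop-arrays asks GCC to insert prefetches automatically, while __builtin_prefetch does it by hand):
/* prefetch_demo.c:
 * gcc -O2 -fprefetch-loop-arrays -o prefetch_demo prefetch_demo.c
 */
#include <stdio.h>
#include <stdlib.h>
#define N (1 << 24)
#define PREFETCH_DISTANCE 64   /* elements ahead; tune for your machine */
int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (!a)
        return 1;
    for (size_t i = 0; i < N; i++)
        a[i] = (double)i;
    double sum = 0.0;
    for (size_t i = 0; i < N; i++) {
        if (i + PREFETCH_DISTANCE < N)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0 /* read */, 1 /* low temporal locality */);
        sum += a[i];
    }
    printf("sum=%f\n", sum);
    free(a);
    return 0;
}
Whether this actually helps has to be measured: on many modern CPUs the hardware prefetcher already handles a sequential scan like this well.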
I think this pretty much answers your question:
The downsides outweigh the benefits; remember the principle of diminishing returns. Using -O3 is not recommended for gcc 4.x.
If the guys writing the compiler say not to do it, I wouldn't second guess them.