Limiting data collection of Cachegrind in Valgrind

It is well known that Callgrind, the call-graph profiling tool of the Valgrind suite, provides the possibility to start and stop the collection of data via the command-line instructions callgrind_control -i on and callgrind_control -i off. For instance, the following will collect data only after one hour:
(sleep 3600; callgrind_control -i on) &
valgrind --tool=callgrind --instr-atstart=no ./myprog
Is there a similar option for the Cachegrind tool? If so, how can I use it (I cannot find anything in the documentation)? If not, how can I start collecting data after a certain amount of time with Cachegrind?

As far as I know, there is no such function for Cachegrind.
However, Callgrind is an extension of Cachegrind, which means that you can use Cachegrind's features from within Callgrind.
For example:
valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes ./myprog
This will measure your program's cache and branch prediction performance as if you were using Cachegrind.
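Since the cache and branch simulation are just extra Callgrind options, they can be combined with the instrumentation control from the question. A sketch, reusing the one-hour delay and program name from above:
(sleep 3600; callgrind_control -i on) &
valgrind --tool=callgrind --instr-atstart=no \
    --cache-sim=yes --branch-sim=yes ./myprog
The resulting callgrind.out.<pid> file can then be inspected with callgrind_annotate or KCachegrind, and the cache and branch counters will only cover the period during which instrumentation was switched on.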

Related

gprof displays no time accumulated

I am trying to use gprof to profile my code. However, when I compile with the flag -pg, gprof displays no time accumulated and the report contains all 0's. The version I have is GNU gprof (GNU Binutils for Ubuntu) 2.34. How can I solve this issue?
The program is definitely running for quite a while (>30 seconds) and it contains only two functions. The same thing happens when I use code and instructions from this article: https://www.thegeekstuff.com/2012/08/gprof-tutorial/.
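For context, the usual gprof workflow looks roughly like the following sketch (the file names are placeholders, not from the question); -pg has to be present at both the compile and the link step, and the program has to exit normally for gmon.out to be written:
gcc -pg -O2 myprog.c -o myprog   # compile and link with profiling instrumentation
./myprog                         # a normal exit writes gmon.out in the current directory
gprof ./myprog gmon.out > report.txt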

How to Increase the simulation speed of a gem5 run

I wish to simulate a fairly non-trivial program in the gem5 environment.
I have three files that I cross-compiled accordingly for the designated ISA:
main.c
my_library.c
my_library.h
I use the command
build/ARM/gem5.opt configs/example/se.py --cpu-type=TimingSimpleCPU -c test/test-progs/hello/src/my_binary
But is there a way, maybe an argument of the se.py script, that can make my simulation run faster?
The default commands are normally already the fastest available (and therefore the ones with the lowest simulation accuracy).
gem5.fast build
A .fast build can run about 20% faster without losing simulation accuracy, by disabling some debug-related macros:
scons -j `nproc` build/ARM/gem5.fast
build/ARM/gem5.fast configs/example/se.py --cpu-type=TimingSimpleCPU \
-c test/test-progs/hello/src/my_binary
The speedup is achieved by:
disabling asserts and logging through macros. https://github.com/gem5/gem5/blob/ae7dd927e2978cee89d6828b31ab991aa6de40e2/src/SConscript#L1395 does:
if 'fast' in needed_envs:
    CPPDEFINES = ['NDEBUG', 'TRACING_ON=0'],
NDEBUG is the standardized way to disable assert (see: _DEBUG vs NDEBUG).
TRACING_ON has effects throughout the source, but the most notable one is at: https://github.com/gem5/gem5/blob/ae7dd927e2978cee89d6828b31ab991aa6de40e2/src/base/trace.hh#L173
#if TRACING_ON
#define DPRINTF(x, ...) do { \
    using namespace Debug; \
    if (DTRACE(x)) { \
        Trace::getDebugLogger()->dprintf_flag( \
            curTick(), name(), #x, __VA_ARGS__); \
    } \
} while (0)
#else // !TRACING_ON
#define DPRINTF(x, ...) do {} while (0)
#endif // TRACING_ON
which implies that --debug-flags basically won't do anything on a .fast build.
turning on link time optimization (see: Does the --force-lto gem5 scons build option speed up simulation significantly and how does it compare to a gem5.fast build?), which can slow down the link step (and therefore how long it takes to recompile after a one-line change)
So in general, .fast is not worth it while you are developing the simulator; it pays off once you have finished whatever patches you need and just want to run hundreds of simulations as fast as possible with different parameters.
TODO: it would be good to benchmark which of the above changes matters most for runtime, and whether link time is actually slowed down significantly by LTO.
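A rough way to start such a benchmark is to time the same workload under both builds (reusing the command and binary path from the question, which may need adjusting):
CMD="configs/example/se.py --cpu-type=TimingSimpleCPU -c test/test-progs/hello/src/my_binary"
time build/ARM/gem5.opt  $CMD    # baseline: optimized build with asserts and tracing enabled
time build/ARM/gem5.fast $CMD    # same run with NDEBUG and TRACING_ON=0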
gem5 performance profiling analysis
I'm not aware of any proper performance profiling of gem5 ever having been done to assess which parts of the simulation are slow and whether there is any way to improve it easily. Someone has to do that at some point and post it at: https://gem5.atlassian.net/browse/GEM5
Options that reduce simulation accuracy
Simulation would also be faster, with lower accuracy, without --cpu-type=TimingSimpleCPU:
build/ARM/gem5.opt configs/example/se.py -c test/test-progs/hello/src/my_binary
which uses AtomicSimpleCPU, an even simpler CPU model with atomic memory accesses.
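If I read the default correctly, the same run can also be spelled out explicitly (same paths as in the question):
build/ARM/gem5.opt configs/example/se.py --cpu-type=AtomicSimpleCPU \
    -c test/test-progs/hello/src/my_binary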
Other lower accuracy but faster options include:
KVM, but support is not perfect as of 2020, and you need an ARM host to run the simulation on
Gabe's FastModel integration that is getting merged as of 2020, but it requires a FastModel license from ARM, which I think is too expensive for individuals
Also if someone were to implement binary translation in gem5, which is how QEMU goes fast, then that would be an amazing option.
Related
Gem5 system requirements for decent performance

What is the NULL ISA architecture under src/arch/null in gem5?

I've noticed that there is a src/arch/null directory in gem5 211869ea950f3cc3116655f06b1d46d3fa39fb3a, sitting next to "real" ISAs like src/arch/x86/.
This suggests that there is a NULL ISA in gem5, but it does not seem to have any registers or other common CPU components.
What is this NULL ISA for?
Inspired by: https://www.mail-archive.com/gem5-users#gem5.org/msg16968.html
I believe that the main application of the NULL ISA is to support tests where you don't need to simulate a CPU, notably traffic generators such as Garnet, mentioned at: http://www.gem5.org/Garnet_Synthetic_Traffic
Traffic generators are setups that produce memory requests meant to resemble those of a real system component such as a CPU, but at a higher level of approximation and without actually implementing a detailed microarchitecture.
The advantage is that traffic generators run faster than detailed models and can be easier to implement. The downside is that the simulation won't be as accurate as the real system.
Also, doing a NULL build is faster than building for a regular ISA, as it skips all the ISA specifics. This can be a big cost-saving win for the continuous integration system.
As a concrete example, on gem5 6e06d231ecf621d580449127f96fdb20154c4f66 you could run scripts such as:
scons -j`nproc` build/NULL/gem5.opt
./build/NULL/gem5.opt configs/example/ruby_mem_test.py -m 10000000
./build/NULL/gem5.opt configs/example/memcheck.py -m 1000000000
./build/NULL/gem5.opt configs/example/memtest.py -m 10000000000
./build/NULL/gem5.opt configs/example/ruby_random_test.py --maxloads 5000
./build/NULL/gem5.opt configs/example/ruby_direct_test.py --requests 50000
./build/NULL/gem5.opt configs/example/garnet_synth_traffic.py --sim-cycles 5000000
These tests can also be run on a "regular" ISA build, for example:
scons -j`nproc` build/ARM/gem5.opt
./build/ARM/gem5.opt configs/example/ruby_mem_test.py -m 10000000
It is just that in that case, you also have all that extra ARM stuff in the binary that does not get used.
If you try to run a "regular" script with NULL however, it blows up. For example:
./build/NULL/gem5.opt configs/example/se.py -u /tmp/hello.out
fails with:
optparse.OptionValueError: option --cpu-type: invalid choice: 'AtomicSimpleCPU' (choose from )
since there is no valid CPU to run the simulation on (empty choices).
If you look at the source of the traffic generator Python scripts, you can see that the traffic generator implements the CPU memory interface itself, and is therefore seen as a CPU by gem5. For example, configs/example/ruby_mem_test.py does:
cpus = [ MemTest(...
system = System(cpu = cpus,
so we understand that the MemTest SimObject is the CPU of that system.

Decreasing runtime of Fortran program with compiling and linking flags in GNU

My problem is related to the runtime of a .exe app. I've been given a very large code (which I know is bug free), but it takes too much time to run. I have compiled it with GNU, and I'm not able to use parallel programming either, since my computer has only two processors.
The problem is tied to a single subroutine of 2000 lines. I have noticed that it is mainly made up of loops, which is where I think the problem lies. It is also called around 20000 times by the main program.
First I used the -O flags (the best runtime was with -Ofast). After that, I tried to improve loop performance with -fforce-addr, but no measurable acceleration happened. Lately I have been using other flags, like -mtune, to create code optimized for the local machine.
Here are my main tests and results:
Original program (31s)
COMPOPTS= -pthread -finline-functions -fbacktrace -fzero-initialized-in-bss -fno-automatic -frecord-marker=4
LINKOPTS= -l unlimit -s unlimited
Using -Ofast (25s)
COMPOPTS= -pthread -finline-functions -fbacktrace -fzero-initialized-in-bss -fno-automatic -frecord-marker=4 -cpp
LINKOPTS= -l unlimit -s unlimited
Last situation (24s)
COMPOPTS= -mtune=native -pthread -finline-functions -fbacktrace -fzero-initialized-in-bss -fno-automatic -frecord-marker=4 -cpp -fforce-addr -fschedule-insns2 -ffp-contract=off
LINKOPTS=-l ulimit -s unlimited
I have a .exe version compiled with Intel and its runtime is 7 s. I know Intel is usually around 20-40% faster than GNU, so I think there is some room for improvement.
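For reference, a sketch of what a single compile line combining the flags from the tests above with -Ofast and -mtune=native might look like; the source file name is a placeholder, not taken from the original build system:
gfortran -Ofast -mtune=native -pthread -finline-functions -fbacktrace \
    -fzero-initialized-in-bss -fno-automatic -frecord-marker=4 -cpp \
    -c heavy_subroutine.f90 -o heavy_subroutine.o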

Unable to find processes unused for half an hour

You can list your running processes with
ps ux
I am searching for a way to find processes that I have not touched for 30 minutes.
How can you find processes that have been unused for half an hour?
Define "untouched" and "unused". You can find out lots of things using the f parameter on ps(1) in BSD-like systems, the -o on Solaris and Sys/V-like systems.
Update
Responding to the comment:
Well, you can do it. Consider, for example, something that runs ps periodically and stores the CPU time used along with the sample time. (Actually, you could do this better with a C program calling the appropriate system calls, but that's really an implementation detail.) Store the sample time and PID, and watch for PIDs whose CPU time has not changed over the appropriate interval. This could even be implemented with an awk or perl program like:
while true; do
    ps _flags_
    sleep 30
done | awk -f myprog | tail -f
so that every time awk gets a ps output it mangles it, identifies candidates, and sends them out to be displayed through tail -f.
But then you may well have daemon processes that don't get called often; it's not clear to me that CPU time alone is a good measure.
That's the point about defining what you really want to do: there's probably a way to do it, but I can't think of a combination of ps flags alone that will do it.
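To make the skeleton above a bit more concrete, here is a minimal sketch along those lines, assuming a Linux procps ps. It samples the cumulative CPU time of each of your processes every 30 seconds and reports any PID whose time has not changed for 60 consecutive samples (about 30 minutes); it prints matches directly instead of going through tail -f, and the fields and threshold are assumptions rather than a definitive answer:
while true; do
    ps -u "$USER" -o pid=,time=    # PID and cumulative CPU time, headers suppressed
    sleep 30
done | awk '
{
    pid = $1; t = $2
    if (pid in last && t == last[pid])
        idle[pid]++                # another 30 s without any CPU time used
    else
        idle[pid] = 0
    last[pid] = t
    if (idle[pid] == 60)           # unchanged for roughly 30 minutes
        print "no CPU time used for ~30 min: PID", pid
}'
As noted above, CPU time alone may not be the right measure for daemons that legitimately sit idle, so the reporting rule is the part you would most likely want to refine.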