Why isn't cachegrind completely deterministic? - valgrind

Inspired by SQLite, I'm looking at using valgrind's "cachegrind" tool to do reproducible performance benchmarking. The numbers it outputs are much more stable than any other method of timing I've found, but they're still not deterministic. As an example, here's a simple C program:
int main() {
    volatile int x = 0;
    while (x < 1000000) {
        x++;
    }
}
If I compile it and run it under cachegrind, I get the following results:
$ gcc -O2 x.c -o x
$ valgrind --tool=cachegrind ./x
==11949== Cachegrind, a cache and branch-prediction profiler
==11949== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==11949== Using Valgrind-3.11.0.SVN and LibVEX; rerun with -h for copyright info
==11949== Command: ./x
==11949==
--11949-- warning: L3 cache found, using its data for the LL simulation.
==11949==
==11949== I refs: 11,158,333
==11949== I1 misses: 3,565
==11949== LLi misses: 2,611
==11949== I1 miss rate: 0.03%
==11949== LLi miss rate: 0.02%
==11949==
==11949== D refs: 4,116,700 (3,552,970 rd + 563,730 wr)
==11949== D1 misses: 21,119 ( 19,041 rd + 2,078 wr)
==11949== LLd misses: 7,487 ( 6,148 rd + 1,339 wr)
==11949== D1 miss rate: 0.5% ( 0.5% + 0.4% )
==11949== LLd miss rate: 0.2% ( 0.2% + 0.2% )
==11949==
==11949== LL refs: 24,684 ( 22,606 rd + 2,078 wr)
==11949== LL misses: 10,098 ( 8,759 rd + 1,339 wr)
==11949== LL miss rate: 0.1% ( 0.1% + 0.2% )
$ valgrind --tool=cachegrind ./x
==11982== Cachegrind, a cache and branch-prediction profiler
==11982== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==11982== Using Valgrind-3.11.0.SVN and LibVEX; rerun with -h for copyright info
==11982== Command: ./x
==11982==
--11982-- warning: L3 cache found, using its data for the LL simulation.
==11982==
==11982== I refs: 11,159,225
==11982== I1 misses: 3,611
==11982== LLi misses: 2,611
==11982== I1 miss rate: 0.03%
==11982== LLi miss rate: 0.02%
==11982==
==11982== D refs: 4,117,029 (3,553,176 rd + 563,853 wr)
==11982== D1 misses: 21,174 ( 19,090 rd + 2,084 wr)
==11982== LLd misses: 7,496 ( 6,154 rd + 1,342 wr)
==11982== D1 miss rate: 0.5% ( 0.5% + 0.4% )
==11982== LLd miss rate: 0.2% ( 0.2% + 0.2% )
==11982==
==11982== LL refs: 24,785 ( 22,701 rd + 2,084 wr)
==11982== LL misses: 10,107 ( 8,765 rd + 1,342 wr)
==11982== LL miss rate: 0.1% ( 0.1% + 0.2% )
$
In this case, "I refs" differs by only 0.008% between the two runs, but I still wonder why they differ at all. In more complex programs (ones that run for tens of milliseconds) the counts can vary by more. Is there any way to make the runs completely reproducible?

At the end of a thread on gmane.comp.debugging.valgrind,
Nicholas Nethercote (a Mozilla developer on the Valgrind development team) says that minor variations are common with Cachegrind (and I infer that they will not lead to major problems).
Cachegrind's manual also notes that the results are very sensitive to the environment. For instance, on Linux, address space randomisation (used to improve security) can be a source of the non-determinism:
Another thing worth noting is that results are very sensitive.
Changing the size of the executable being profiled, or the sizes of
any of the shared libraries it uses, or even the length of their file
names, can perturb the results. Variations will be small, but don't
expect perfectly repeatable results if your program changes at all.
More recent GNU/Linux distributions do address space randomisation, in
which identical runs of the same program have their shared libraries
loaded at different locations, as a security measure. This also
perturbs the results.
While these factors mean you shouldn't trust the results to be
super-accurate, they should be close enough to be useful.

Related

Running a Tensorflow program on an IPU Model throws an "Illegal instruction (core dumped)" error

I’m trying to run a TensorFlow2 example from the Graphcore public examples (MNIST). I’m using the IPU model instead of IPU hardware because my machine doesn’t have access to IPU hardware, so I’ve followed the documentation (Running on the IPU Model simulator) and added the following to my model:
# Using IPU model instead of IPU hardware
if self.base_dictionary['ipu_model']:
    os.environ['TF_POPLAR_FLAGS'] = '--use_ipu_model'
When I run the model, it fails with: Illegal instruction (core dumped). I don’t see where this comes from as I used an existing example. What is this error and how do I solve it?
Illegal instruction means that your program contains instructions that your CPU can't handle. The Graphcore TensorFlow wheel is compiled for Skylake-class CPUs with the AVX-512 instruction set available, so processors that do not fit the requirements (i.e. a Skylake-class CPU with AVX-512 capabilities) will not be able to run Graphcore TensorFlow code. (You can see the requirements in the "Requirements" section of the SDK Overview documentation here.)
To see if your processors have AVX-512 capabilities, run cat /proc/cpuinfo and look at the flags field of any of the processors - they should all have the same flags. If you don't see avx512f, your processors don't fit the Graphcore requirements for running TensorFlow code. Here is an example of what the cat command returns on a machine that fits the requirements (result truncated to one processor):
processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
stepping : 4
microcode : 0x2000064
cpu MHz : 1200.703
cache size : 33792 KB
physical id : 1
siblings : 48
core id : 27
cpu cores : 24
apicid : 119
initial apicid : 119
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 5401.49
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
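As an alternative to scanning the flags line by eye, a quick programmatic check is sketched below (an illustration, not from the Graphcore documentation; it assumes GCC or Clang, which provide the __builtin_cpu_supports() builtin, and an x86 host):

// avx512_check.cpp - minimal sketch: report whether this CPU has AVX-512F,
// the feature the Graphcore TensorFlow wheel requires. Assumes g++ or clang++.
#include <cstdio>

int main() {
    if (__builtin_cpu_supports("avx512f"))
        std::printf("avx512f present: CPU meets the AVX-512 requirement\n");
    else
        std::printf("avx512f missing: this CPU cannot run the Graphcore TensorFlow wheel\n");
    return 0;
}

Compile and run it on the target machine (e.g. g++ avx512_check.cpp -o avx512_check && ./avx512_check); the result should match what you see in /proc/cpuinfo.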
Machines provided by Graphcore or their partners will always fit these requirements, so it’s best to use them. They’ll also have enough cores and memory, which might not be the case on e.g. a personal laptop.

PGI 18.1 vs PGI 18.4

Is there any change from PGI version 18.1 to 18.4 regarding #pragma acc routine seq? The code I have works fine with version 18.1 but gives an error when I use the newer version. I generate kernels using the math library.
using namespace std;

#pragma acc routine
double myfunc(double x)
{
    return(fabs(x));
}
The default parallelism for the routine directive is (or was) sequential, i.e. #pragma acc routine is equivalent to #pragma acc routine seq.
This works fine in version 18.1, but I think there might be some change in the new version, since when I compile with 18.4 it gives an error complaining about the math library function.
Oddly enough, the following also causes an error:
#include <cmath>
#include "openacc.h"

using namespace std;

#pragma acc routine seq
double sine( double x )
{
    return ( sin( x ) );
}
This gives a compilation error, but when I change the math library to math.h it compiles fine. Can anyone explain why it is not working with pgc++?
What's the actual error you get? I get the same error with both PGI 18.1 and 18.4:
% pgc++ -c test1.cpp -ta=tesla -Minfo=accel -w -V18.1
PGCC-S-1000-Call in OpenACC region to procedure 'sin' which has no acc routine information (test1.cpp: 10)
PGCC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages) (test1.cpp: 10)
sine(double):
10, Generating acc routine seq
Generating Tesla code
11, Accelerator restriction: call to 'sin' with no acc routine information
The solution here is to include the PGI header "accelmath.h" to get the device versions of the C99 math intrinsics.
% diff test1.cpp test2.cpp
4a5
> #include "accelmath.h"
% pgc++ -c test2.cpp -ta=tesla -Minfo=accel -w -V18.1
sine(double):
12, Generating acc routine seq
Generating Tesla code
% pgc++ -c test2.cpp -ta=tesla -Minfo=accel -w -V18.4
sine(double):
12, Generating acc routine seq
Generating Tesla code
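For reference, a complete test2.cpp consistent with the diff above might look like the following (a reconstruction, not the exact file from the answer; only the added accelmath.h line is confirmed by the diff):

// test2.cpp - sketch of the fixed source: "accelmath.h" is the PGI header
// that supplies device versions of the C99 math intrinsics for OpenACC.
#include <cmath>
#include "openacc.h"
#include "accelmath.h"

using namespace std;

#pragma acc routine seq
double sine( double x )
{
    return ( sin( x ) );
}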

How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?

How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?
I'm interested in all of the following cases:
full system userland benchmark. Maybe the m5 guest tool has a way to do it?
bare metal benchmark. When gem5 exits it dumps the stats automatically, so the main question is how to skip the cycles for the bootloader and go straight to the benchmark itself.
Is there a way besides modifying the benchmark source with instrumentation instructions? How to write those instrumentation instructions in detail?
syscall emulation benchmark. I think gem5 just outputs stats.txt at the end of the run, and then you can just grep for system.cpu.numCycles, but I have to confirm it; currently blocked on: How to solve "FATAL: kernel too old" when running gem5 in syscall emulation SE mode?
I want to use this to learn:
how CPUs work
how to optimize assembly code or compiler settings to run optimally on a given CPU
m5 tool
A good approximation is to run, ideally from a shell script that is the /init program:
m5 resetstats
run-benchmark
m5 dumpstats
Then on the host:
grep -E '^system.cpu.numCycles ' m5out/stats.txt
This gives something like:
system.cpu.numCycles 33942872680 # number of cpu cycles simulated
Note that if you replay from a m5 checkpoint with a different CPU, e.g.:
--restore-with-cpu=HPI --caches
then you need to grep for a different identifier:
grep -E '^system.switch_cpus.numCycles ' m5out/stats.txt
resetstats zeroes out the cumulative stats, and dumpstats dumps what has been collected during the benchmark.
This is not perfect, since the stats also include the small amount of shell and exec work between m5 resetstats, the benchmark, and m5 dumpstats, but if the benchmark runs for long enough this shouldn't matter.
http://arm.ecs.soton.ac.uk/wp-content/uploads/2016/10/gem5_tutorial.pdf also proposes a few more heuristics:
#!/bin/sh
# Wait for system to calm down
sleep 10
# Take a checkpoint in 100000 ns
m5 checkpoint 100000
# Reset the stats
m5 resetstats
run-benchmark
# Exit the simulation
m5 exit
m5 exit also works, since gem5 dumps the stats when it finishes.
Instrumentation instructions
Sometimes it is simply unavoidable to modify the benchmark source a bit with those instructions, in order to:
skip initialization and go directly to the steady state
evaluate individual main loop runs
You can of course deduce those instructions from the gem5 m5 tool source code, but here are some easy-to-reuse one-line copy-pastes for arm and aarch64, e.g. for aarch64:
/* resetstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1");
/* dumpstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1");
The m5 tool uses the same mechanism under the hood, but by adding the instructions directly into the source we avoid forking and exec'ing a separate m5 process, so the measurement is more precise and representative (at the cost of more manual work).
To ensure that the assembly is not reordered around your ROI by the compiler however, you might want to use the techniques mentioned at: Enforcing statement order in C++
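For example, the aarch64 one-liners above can be wrapped in macros and placed around the region of interest; the sketch below is illustrative (the macro names and benchmark_kernel are made up, the instruction encodings are the ones shown above, and it only builds for aarch64 and only does something useful under gem5):

/* m5_roi.c - sketch: bracket a region of interest with resetstats/dumpstats. */
#define M5_RESETSTATS() \
    __asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1")
#define M5_DUMPSTATS() \
    __asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1")

/* Stand-in region of interest: replace with the code you actually want measured. */
static void benchmark_kernel(void) {
    volatile int x = 0;
    while (x < 1000000)
        x++;
}

int main(void) {
    /* Initialization you do not want counted goes before the reset. */
    M5_RESETSTATS();      /* stats start accumulating from zero here */
    benchmark_kernel();
    M5_DUMPSTATS();       /* writes a stats snapshot into m5out/stats.txt */
    return 0;
}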
Address monitoring
Another technique that can be used is to monitor addresses of interest instead of adding magic instructions to the source.
E.g., if you know that a benchmark starts when the PC reaches 0x400, it should be possible to do something when that address is hit.
To find the addresses of interest, you could for example use readelf, gdb, or tracing, and if running full system on top of Linux, ensure that ASLR is turned off.
This technique would be the least intrusive one, but the setup is harder, and to be honest I haven't done it yet. One day, one day.

What is the best way to measure time in gem5 simulation environment

I am running a small matrix multiplication program in the gem5 simulation environment and want to measure the execution time of the program. The program is in Fortran and I use cpu_time before and after the matrix multiplication routine to get the time. But is there any better way to measure time in the gem5 environment?
The standard way of measuring stats for a given binary using gem5 in Full System mode is through providing an rcS script using the --script parameter:
./build/ARM/gem5.fast ... your_options... --script=./script.rcS
Your script should contain m5ops to reset and dump stats as required. An example script.rcS:
m5 resetstats
/bin/yourbinary
m5 dumpstats
Then from the stats.txt you can take the execution time (sim_seconds) or whatever stat you require. If you're using the Syscall Emulation mode you can directly check stats.txt without the need for an rcS script.
You can also add resetstats / dumpstats magic assembly instructions directly inside your benchmarks as shown at: How to count the number of CPU clock cycles between the start and end of a benchmark in gem5? E.g. in aarch64:
/* resetstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1");
/* dumpstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1");
You then likely want to look at the system.cpu.numCycles stat, which shows how many CPU cycles passed.
You can of course look into different stat files depending on your build, but I think the easiest way is to prefix your simulation command with time:
time ./build/ARM/gem5.fast ... your_options... --script=./script.rcS ...

iperf bandwidth; what is the difference between -b 60m and -b 60M

I am using iperf and have the following problem:
with iperf ... -b 60M, I get 12% packet loss
with iperf ... -b 60m, I get 0.2% packet loss
In both of these cases, the reported bandwidth is 60 Mbit/s:
0.0-10.0 sec 71.0 MBytes 59.6 Mbits/sec 0.150 ms 0/50661 (0%)
What is the difference between -b 60m and -b 60M ?
I found the answer after hours of searching:
m = 1000*1000 bit
M = 1024*1024 bit
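So the two flags do not actually request the same rate: -b 60m asks for 60 * 1000 * 1000 = 60,000,000 bit/s, while -b 60M asks for 60 * 1024 * 1024 = 62,914,560 bit/s, roughly 4.9% more, which can be enough to push the link past its capacity and explain the much higher packet loss.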
I also have trouble finding resources on how to interpret iperf results (I am testing on boards and sometimes the results do not follow ordinary patterns).
Does anyone know of any good website or document?