
How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?
I'm interested in all of the following cases:
full system userland benchmark. Maybe the m5 guest tool has a way to do it?
bare metal benchmark. When gem5 exits it dumps the stats automatically, so the main question is how to skip the cycles for the bootloader and go straight to the benchmark itself.
Is there a way besides modifying the benchmark source with instrumentation instructions? How to write those instrumentation instructions in detail?
syscall emulation benchmark. I think gem5 just outputs the stats.txt at the end of the run, and then you can just grep system.cpu.numCycles, but I have to confirm it; currently blocked on: How to solve "FATAL: kernel too old" when running gem5 in syscall emulation SE mode?
I want to use this to learn:
how CPUs work
how to optimize assembly code or compiler settings to run optimally on a given CPU

m5 tool
A good approximation is to run, ideally from a shell script that is the /init program:
m5 resetstats
run-benchmark
m5 dumpstats
Then on host:
grep -E '^system.cpu.numCycles ' m5out/stats.txt
Gives something like:
system.cpu.numCycles 33942872680 # number of cpu cycles simulated
Note that if you replay from a m5 checkpoint with a different CPU, e.g.:
--restore-with-cpu=HPI --caches
then you need to grep for a different identifier:
grep -E '^system.switch_cpus.numCycles ' m5out/stats.txt
resetstats zeroes out the cumulative stats, and dumpstats dumps what has been collected during the benchmark.
This is not perfect, since some time passes between the exec syscall for m5 resetstats finishing and the benchmark starting, but if the benchmark runs for long enough this shouldn't matter.
http://arm.ecs.soton.ac.uk/wp-content/uploads/2016/10/gem5_tutorial.pdf also proposes a few more heuristics:
#!/bin/sh
# Wait for system to calm down
sleep 10
# Take a checkpoint in 100000 ns
m5 checkpoint 100000
# Reset the stats
m5 resetstats
run-benchmark
# Exit the simulation
m5 exit
m5 exit also works since gem5 dumps stats when it finishes.
Instrumentation instructions
Sometimes it seems inevitable that you have to modify the benchmark source code a bit with those instructions in order to:
skip initialization and go directly to steady state
evaluate individual main loop runs
You can of course deduce those instructions from the gem5 m5 tool source code, but here are some very easy-to-reuse one-line copy-pastes for arm and aarch64, e.g. for aarch64:
/* resetstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1")
/* dumpstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1")
The m5 tool uses the same mechanism under the hood, but by adding the instructions directly into the source we avoid the exec syscall, which makes the measurement more precise and representative (at the cost of more manual work).
However, to ensure that the compiler does not reorder the assembly around your ROI, you might want to use the techniques mentioned at: Enforcing statement order in C++
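For instance, here is a minimal C sketch (the macro names are mine; assumes aarch64 with GCC-style inline assembly) that wraps a region of interest with those two magic instructions, with a "memory" clobber added as one such reordering barrier:
/* Sketch: ROI instrumentation via the aarch64 magic instructions.
 * The "memory" clobber discourages the compiler from moving memory
 * accesses across the ROI boundaries. */
#include <stddef.h>

#define M5_RESETSTATS() \
    __asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" \
                          : : : "x0", "x1", "memory")
#define M5_DUMPSTATS() \
    __asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" \
                          : : : "x0", "x1", "memory")

int main(void) {
    volatile size_t sum = 0;
    M5_RESETSTATS();                      /* stats collection starts here */
    for (size_t i = 0; i < 1000000; i++)  /* the code being measured */
        sum += i;
    M5_DUMPSTATS();                       /* stats dumped to m5out/stats.txt */
    return 0;
}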
Address monitoring
Another technique that can be used is to monitor addresses of interest instead of adding magic instructions to the source.
E.g., if you know that a benchmark starts with PC == 0x400, it should be possible to do something when that address is hit.
To find the addresses of interest, you could use readelf, gdb, or tracing, for example, and if running full system on top of Linux, ensure that ASLR is turned off.
This technique would be the least intrusive one, but the setup is harder, and to be honest I haven't done it yet. One day, one day.

Related

Why does the gem5 output show that the number of CPU cycles is less than the number of instructions simulated?

output of gem5 after run a simple spec test program
I ran a SPEC test case using gem5, in SE mode with the O3CPU. This is the command I ran:
/gem5/build/ARM/gem5.opt --outdir=/gem5/spec2006log/m5out_462.libquantum-1-O3CPU /gem5/configs/example/se.py --caches --cpu-type=O3CPU --mem-size=1GB -n 1 --cmd=/benchmark/462.libquantum/exe/libquantum_base.qemurio -o '33 5'
Then I saw strange output from gem5, as the picture shows. Why does the O3CPU execute more than one instruction per CPU cycle? Is the O3CPU a superscalar CPU? I didn't see any related description in the gem5 documentation and it confuses me very much.
I need help, thanks a lot!

Long latency instruction

I would like a long-latency single-uop x86¹ instruction, in order to create long dependency chains as part of testing microarchitectural features.
Currently I'm using fsqrt, but I'm wondering if there is something better.
Ideally, the instruction will score well on the following criteria:
Long latency
Stable/fixed latency
One or a few uops (especially: not microcoded)
Consumes as few uarch resources as possible (load/store buffers, page walkers, etc)
Able to chain (latency-wise) with itself
Able to chain input and output with GP registers
Doesn't interfere with normal OoO execution (beyond whatever ROB, RS, etc, resources it consumes)
So fsqrt is OK in most senses, but the latency isn't that long and it seems hard to chain with GP regs.
¹ On modern Intel x86 in particular, with bonus points if it also works well on AMD Zen*.
Mainstream Intel CPUs don't have any very long latency single-uop integer instructions. There are integer ALUs for 1-cycle latency uops on all ALU ports, and a 3-cycle-latency pipelined ALU on port 1. I think AMD is similar.
The div/sqrt unit is the only truly high-latency ALU, but integer div/idiv are microcoded on Intel, so yes, use FP, where div/sqrt are typically single-uop instructions.
AMD's integer div / idiv are 2-uop instructions (presumably to write the 2 outputs), with data-dependent latency.
Also, AMD Bulldozer/Piledriver (where 2 integer cores share a SIMD/FP unit) has pretty high latency for movd xmm, r32 (10c 2 uops) and movd r32, xmm (8c 1 uop). Steamroller shortens that by 1c each. Ryzen has 3-cycle 1 uop in either direction.
movd to/from XMM regs is cheap on Intel: single-uop with 1-cycle (Broadwell and earlier) or 2-cycle latency (Skylake). (https://agner.org/optimize/)
sqrtss has fixed latency (on IvB and later), other than maybe with subnormal inputs. If your chain-with-integer involves just movd xmm, r32 of an arbitrary integer bit-pattern, you might want to set DAZ/FTZ to remove the possibility of FP assists. NaN inputs are fine; that doesn't cause a slowdown for SSE/AVX math, only x87.
Other CPUs (Sandybridge and earlier, and all AMD) have variable-latency sqrtss so you probably want to control the starting bit-pattern there.
Same goes if you want to use sqrtsd for higher latency per uop than sqrtss. It's still variable latency even on Skylake. (15-16 cycles).
You can assume that the latency is a pure function of the input bit-pattern, so starting a chain of sqrtss instructions with the same input every time will give the same sequence of latencies. Or with a starting input of 0.0, 1.0, +inf, or NaN, you'll get the same latency for every uop in the sequence.
(Simple inputs like 1.0 and 0.0 (few significant figures in the input and output) presumably run with the lowest latency. sqrt(1.0) = 1.0 and sqrt(0) = 0, so these are self-perpetuating. Same for sqrt(NaN) = NaN)
You might use and reg, 0 or other non-dep-breaking zeroing as part of your chain to control the input bit-pattern. Or perhaps or reg, -1 to create NaN. Then you can get fixed latency on Sandybridge or earlier, and on AMD including Zen.
Or perhaps pinsrw xmm0, eax, 7 (2 uops for port 5 on Intel) to only modify the high qword of an XMM, leaving the bottom as known 0.0 or 1.0. Probably cheaper to just and with 0 and use movd, unless port-5 pressure is a non-issue.
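To make the chaining concrete, here is a minimal sketch (the function name and loop structure are mine; assumes x86-64 with GCC-style inline assembly) of a latency chain that couples sqrtss to GP registers through movd, using the non-dep-breaking and with 0 so every input has the fixed bit-pattern 0.0f:
/* Sketch: a sqrtss latency chain threaded through GP registers.
 * "and $0, reg" zeroes the register without breaking the dependency,
 * so each iteration's sqrtss input is the fixed bit-pattern 0.0f. */
#include <stdint.h>

uint32_t sqrt_chain(uint32_t x, uint64_t iters) {
    __asm__ __volatile__ (
        "1:\n\t"
        "and $0, %[x]\n\t"           /* non-dep-breaking zeroing: input = 0.0f */
        "movd %[x], %%xmm0\n\t"      /* GP -> XMM */
        "sqrtss %%xmm0, %%xmm0\n\t"  /* the long-latency link in the chain */
        "movd %%xmm0, %[x]\n\t"      /* XMM -> GP, feeds the next iteration */
        "dec %[n]\n\t"
        "jnz 1b\n\t"
        : [x] "+r" (x), [n] "+r" (iters)
        :
        : "xmm0", "cc");
    return x;
}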
To create a throughput bottleneck (not latency), your best bet on Skylake is vsqrtpd ymm - 1 uop for p0, latency = 15-16, throughput = 9-12.
On Broadwell and earlier, it was 3 uops (2p0 p15), but Skylake I think widened the SIMD divider (in preparation for AVX512 I guess).
vsqrtss might be somewhat better than fsqrt since it at least satisfies relatively easy chaining with GP registers (since GP <-> vector is just a movd away).

How to change the gem5 ARM SVE vector length?

I'm doing an experiment to see which ARM SVE vector length would be the best for my chip design, or to help select which chip has the optimal vector length for my application.
How to change the vector length in a gem5 simulation to see how it affects workload performance?
For SE:
se.py --param 'system.cpu[:].isa[:].sve_vl_se = 2'
For FS:
fs.py --param 'system.sve_vl = 2'
where the values are given in multiples of 128 bits, so 2 means a 256-bit vector length.
You can test this easily with the ADDVL instruction, e.g. with the sketch below.
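A minimal sketch of such a check, assuming a toolchain that accepts SVE instructions (e.g. GCC with -march=armv8-a+sve; the function name is mine):
/* ADDVL adds (VL in bytes) * immediate to a register, so starting
 * from 0 yields the effective vector length in bytes. */
long sve_vl_bytes(void) {
    long vl = 0;
    __asm__ ("addvl %0, %0, #1" : "+r" (vl));
    return vl;  /* e.g. 32 for a 256-bit vector length */
}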
The names of those parameters can easily be determined by looking at the m5out/config.ini generated by a previous run.
Note however that this value is architecturally visible, so it might not be possible to checkpoint after Linux boot and restore with a different vector length than the one used for boot, to speed up experiments. This is likely true in general even though the kernel itself does not run vector instructions, because there is software control of the effective vector length. Maybe it is possible to set a big vector length on the simulator to start with and then tell Linux to reduce it somehow in software, but I'm not sure what the API is.
Tested in gem5 3126e84db773f64e46b1d02a9a27892bf6612d30.
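A possible answer to the "reduce it in software" question above: recent Linux kernels (>= 4.15) expose a per-thread prctl interface, PR_SVE_SET_VL, that takes the vector length in bytes. A minimal sketch, untested by me inside gem5:
#include <stdio.h>
#include <sys/prctl.h>

/* Fallback definitions from <linux/prctl.h>, in case older headers lack them. */
#ifndef PR_SVE_SET_VL
#define PR_SVE_SET_VL 50
#endif
#ifndef PR_SVE_GET_VL
#define PR_SVE_GET_VL 51
#endif

int main(void) {
    /* Request an effective SVE vector length of 32 bytes (256 bits);
     * the kernel clamps the value to what the hardware supports. */
    if (prctl(PR_SVE_SET_VL, 32) < 0)
        perror("prctl(PR_SVE_SET_VL)");
    long vl = prctl(PR_SVE_GET_VL);
    printf("effective VL: %ld bytes\n", vl & 0xffff);  /* low bits hold the length */
    return 0;
}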
To change the vector length, one can use the command line option:
--arm-sve-vl=<vl in quadwords: one of {1, 2, 4, 8, 16}>
where vl is given in multiples of 128 bits (quadwords). So for a simulation of a 512-bit SVE machine, one should use:
--arm-sve-vl=4
This works both for Syscall-Emulation mode and Full System mode.
If one wants to quickly explore the space of different vector lengths, one can also change it during the simulation (only in Full System mode). For example, to change the default SVE vector length to 256 bytes (the value in this file is given in bytes), put the following line in your bootscript, before running the benchmark:
echo 256 >/proc/sys/abi/sve_default_vector_length
You can find more information at https://www.rico.cat/files/ICS18-gem5-sve-tutorial.pdf.

Fastest Cython implementation depends on computer?

I am converting a Python script to Cython and optimizing it for speed. Right now I have two versions: on my desktop V2 is twice as fast as V1; unfortunately, on my laptop V1 is twice as fast as V2, and I am unable to find out why there is such a big difference.
Both computers use:
- Ubuntu 16.04
- Python 2.7.12
- Cython 0.25.2
- Numpy 1.12.1
Desktop:
- Intel® Core™ i3-4370 CPU @ 3.80GHz × 4, 64-bit, 16GB RAM
Laptop:
- Intel® Core™ i5-3210 CPU @ 2.5GHz × 2, 64-bit, 8GB RAM
V1 - you can find the full code here. The only changes made are renaming go.py, preprocessing.py to go.pyx, preprocessing.pyx and using
import pyximport; pyximport.install() to compile them. You can run test.py. This version uses a 2d numpy array board to store data in go.pyx and list comprehension in the get_board function in preprocessing.pyx to process data. During the test no function is called from go.py; only the numpy array board is used.
V2 - you can find the full code here. Quite some stuff has changed; below you can find a list with everything affecting this test case. Be aware that all function and variable declarations have to be in go.pxd. You can run test.py using this command: python test.py build_ext --inplace
The 2d numpy array is replaced by:
cdef char board[ 362 ]
and the function get_board_feature in go.pyx replaces the numpy list comprehension:
cdef char get_board_feature( self, short location ):
    # return correct board feature value
    # 0 active player stone
    # 1 opponent stone
    # 2 empty location
    cdef char value = self.board[ location ]
    if value == EMPTY:
        return 2
    if value == self.player_current:
        return 0
    return 1
The get_board function in preprocessing.pyx is replaced with a function that loops over the array and calls get_board_feature in go.pyx for every location:
@cython.boundscheck(False)
@cython.wraparound(False)
cdef int get_board( self, GameState state, np.ndarray[double, ndim=2] tensor, int offSet ):
    """A feature encoding WHITE BLACK and EMPTY on separate planes, but plane 0
    always refers to the current player and plane 1 to the opponent
    """
    cdef short location
    for location in range( 0, state.size * state.size ):
        tensor[ offSet + state.get_board_feature( location ), location ] = 1
    return offSet + 3
Please let me know if I should include any other information or run certain tests.
cmp, diff test
The V2 go.c and preprocessing.c files are identical.
V1 does not generate a .c file to compare.
Update: compared the .so files.
The V2 go.so files are different:
goD.so goL.so differ: byte 473, line 1
The preprocessing.so files are identical; not sure what to think of that...
They are two different machines and behave differently. There's a reason why processor reviews use large benchmark suites. It could be said that the desktop CPU performs better on average, but execution times between two small but non-trivial pieces of code do not have to favor the desktop CPU. And differences in execution times definitely do not have to follow any linear relationship. Performance is always dependent on a huge number of factors. Possible explanations include, but are not limited to, the smaller L1 and L2 caches on the desktop and the change in vector instruction sets from AVX to AVX2 between the Ivy Bridge laptop and the Haswell desktop.
Generally it's a good idea to concentrate on using good algorithms and to identify and remove bottlenecks when optimizing performance. Trying to stare at benchmarks between different machines will probably only cause a headache.

What is the best way to measure time in gem5 simulation environment

I am running a small matrix multiplication program in the gem5 simulation environment and want to measure the execution time of the program. The program is in Fortran and I use cpu_time before and after the matrix multiplication routine to get the time. But is there a better way to measure time in the gem5 environment?
The standard way of measuring stats for a given binary using gem5 in Full System mode is through providing an rcS script using the --script parameter:
./build/ARM/gem5.fast ... your_options... --script=./script.rcS
Your script should contain m5ops to reset and dump stats as required. An example script.rcS:
m5 resetstats
/bin/yourbinary
m5 dumpstats
Then from the stats.txt you can take the execution time (sim_seconds) or whatever stat you require. If you're using Syscall Emulation mode, you can directly check stats.txt without the need for an rcS script.
You can also add resetstats / dumpstats magic assembly instructions directly inside your benchmarks as shown at: How to count the number of CPU clock cycles between the start and end of a benchmark in gem5? E.g. in aarch64:
/* resetstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1")
/* dumpstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1")
You then likely want to look at system.cpu.numCycles, which shows how many CPU cycles passed.
You can of course look into different stat files depending on your build, but I think the easiest way is to prepend time to your simulation command:
time ./build/ARM/gem5.fast ... your_options... --script=./script.rcS ...