Why does gem5 output show fewer CPU cycles than simulated instructions? - gem5

Output of gem5 after running a simple SPEC test program
I ran a SPEC test case using gem5 in SE mode with the O3CPU. This is the command I ran:
/gem5/build/ARM/gem5.opt --outdir=/gem5/spec2006log/m5out_462.libquantum-1-O3CPU /gem5/configs/example/se.py --caches --cpu-type=O3CPU --mem-size=1GB -n 1 --cmd=/benchmark/462.libquantum/exe/libquantum_base.qemurio -o '33 5'
Then I saw a strange output from gem5, as the picture shows. Why does the O3CPU execute more than one instruction per CPU cycle? Is the O3CPU a superscalar CPU? I didn't see a related description in the gem5 documentation, and it confuses me very much.
I need help, thanks a lot!

Related

How to take a checkpoint at a given tick and then restore using the gem5 Python API?

I had always done this with an m5 checkpoint m5op + fs.py -r. I then also learned that fs.py has --take-checkpoints, which can select the tick.
But today I needed to do it for an integration Linux boot test (tests/gem5/fs/linux/arm/run.py) to start running closer to the point of interest. I don't want to modify the kernel to add the m5op, and the runner script does not have -r/--take-checkpoints options. I wish these were gem5.opt options available to all runs rather than Python script options, but they're not.
On gem5 71b450fc46ca5888971acf3160b813bf24784604 the original script does:
m5.instantiate()
exit_event = m5.simulate()
so to take the checkpoint I can hack it to:
m5.instantiate()
# Run up to desired tick.
exit_event = m5.simulate(100000)
m5.checkpoint('m5out/mycpt')
and to restore hack it to:
m5.instantiate('m5out/mycpt')
exit_event = m5.simulate()
m5.checkpoint()

How to change the gem5 ARM SVE vector length?

I'm doing an experiment to see which ARM SVE vector length would be the best for my chip design, or to help select which chip has the optimal vector length for my application.
How to change the vector length in a gem5 simulation to see how it affects workload performance?
For SE:
se.py --param 'system.cpu[:].isa[:].sve_vl_se = 2'
For FS:
fs.py --param 'system.sve_vl = 2'
where the values are given in multiples of 128 bits, so 2 means length 256.
You can test this easily with the ADDVL instruction as shown in this example.
The names of those parameters can be easily determined by looking at an m5out/config.ini generated from a previous run.
Note, however, that this value is architecturally visible, so it might not be possible to checkpoint after Linux boot and then restore with a different vector length than the one used at boot to speed up experiments. This is likely true in general even though the kernel itself does not run vector instructions, because there is software control of the effective vector length. Maybe it is possible to set a big vector length on the simulator to start with and then tell Linux to reduce it somehow in software, but I'm not sure what the API is.
Tested in gem5 3126e84db773f64e46b1d02a9a27892bf6612d30.
To change the vector length, one can use the command-line option:
--arm-sve-vl=<vl in quadwords: one of {1, 2, 4, 8, 16}>
where vl is given in multiples of 128 bits. So for a simulation of a 512-bit SVE machine, one should use:
--arm-sve-vl=4
This works both for Syscall-Emulation mode and Full System mode.
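The relation between the option value and the vector length in bits can be sketched in Python as below. The helper names sve_vl_to_bits and bits_to_sve_vl are hypothetical and not part of gem5; the set of accepted values is simply the one listed for the option above.

```python
# Hypothetical helpers, not part of gem5: convert between the SVE vector
# length in bits and the quadword count passed to --arm-sve-vl.
QUADWORD_BITS = 128
VALID_VL = (1, 2, 4, 8, 16)  # values listed for --arm-sve-vl


def sve_vl_to_bits(vl):
    """Return the vector length in bits for a quadword count."""
    if vl not in VALID_VL:
        raise ValueError(f"unsupported vl: {vl}")
    return vl * QUADWORD_BITS


def bits_to_sve_vl(bits):
    """Return the quadword count for a vector length in bits."""
    vl, remainder = divmod(bits, QUADWORD_BITS)
    if remainder or vl not in VALID_VL:
        raise ValueError(f"unsupported vector length: {bits} bits")
    return vl


if __name__ == "__main__":
    for bits in (128, 256, 512, 1024, 2048):
        print(f"{bits}-bit SVE -> --arm-sve-vl={bits_to_sve_vl(bits)}")
```

A script that sweeps several vector lengths could use such a helper to build the option string for each run and to reject lengths that are not in the listed set.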
If one wants to quickly explore the space of different vector lengths, one can also change it during the simulation (only in Full system mode). For example, to change the SVE length to 256, put the following line in your bootscript, before running the benchmark:
echo 256 >/proc/sys/abi/sve_default_vector_length
You can get more information at https://www.rico.cat/files/ICS18-gem5-sve-tutorial.pdf.

Why does the cifar10 benchmark show slow performance on every n x 100 step?

I have tried to compare performance between source-built TensorFlow and the Google-provided .whl files for tensorflow-gpu. I have run dozens of benchmark tests, and I always get slow performance on every n x 100 step, such as 0, 100, 200, .... I cannot figure out the reason. Can a TensorFlow expert explain this?
I am running Ubuntu (18.04), Fedora (27, 28), and Windows, with CUDA 9.0/9.1/9.2.
I've tested with TF 1.6, 1.7, 1.8, and 1.9.
My GPU is a 1080 Ti with 11 GB.
My CPU is an Intel 4690K with 32 GB of DRAM.
One sample is attached.
Thank you very much in advance.
Dae-Chul Jo
dcjo00#gmail.com
It could be for several different reasons:
Every 100 steps you are saving the model
Every 100 steps you are testing validation data
Every 100 steps you are saving logs to tensorboard
These are my first guesses in order of probability; if you provide code, I could study it more deeply.
Hope it helps! :)
EDIT: it ended up being:
tf.train.MonitoredTrainingSession has a default of saving summaries every 100 steps, which was proposal 3.
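Before hunting for the responsible hook, it can help to confirm from the step timings that the slow steps really are periodic. A minimal sketch in plain Python, assuming you have recorded per-step durations in a list; the function name slow_step_period and the factor threshold are hypothetical choices, not part of TensorFlow:

```python
from functools import reduce
from math import gcd
from statistics import median


def slow_step_period(durations, factor=3.0):
    """Return (slow_steps, period) for a list of per-step durations.

    A step counts as slow when it takes more than `factor` times the median.
    The period is the greatest common divisor of the slow step indices.
    """
    typical = median(durations)
    slow = [i for i, d in enumerate(durations) if d > factor * typical]
    period = reduce(gcd, slow, 0)
    return slow, period


if __name__ == "__main__":
    # Synthetic timings: 0.01 s per step, 0.5 s on every 100th step.
    timings = [0.5 if step % 100 == 0 else 0.01 for step in range(350)]
    print(slow_step_period(timings))
```

If the reported period matches a default interval of one of the candidates above (saving, validation, summaries), that candidate is the first one to check.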

How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?

How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?
I'm interested in all of the following cases:
full system userland benchmark. Maybe the m5 guest tool has a way to do it?
bare metal benchmark. When gem5 exits it dumps the stats automatically, so the main question is how to skip the cycles for bootloader and go straight to the benchmark itself.
Is there a way besides modifying the benchmark source with instrumentation instructions? How to write those instrumentation instructions in detail?
syscall emulation benchmark. I think gem5 just outputs the stats.txt at the end of the run, and then you can just grep system.cpu.numCycles, but I have to confirm it. I am currently blocked on: How to solve "FATAL: kernel too old" when running gem5 in syscall emulation SE mode?
I want to use this to learn:
how CPUs work
how to optimize assembly code or compiler settings to run optimally on a given CPU
m5 tool
A good approximation is to run, ideally from a shell script that is the /init program:
m5 resetstats
run-benchmark
m5 dumpstats
Then on host:
grep -E '^system.cpu.numCycles ' m5out/stats.txt
Gives something like:
system.cpu.numCycles 33942872680 # number of cpu cycles simulated
Note that if you replay from a m5 checkpoint with a different CPU, e.g.:
--restore-with-cpu=HPI --caches
then you need to grep for a different identifier:
grep -E '^system.switch_cpus.numCycles ' m5out/stats.txt
resetstats zeroes out the cumulative stats, and dumpstats dumps what has been collected during the benchmark.
This is not perfect, since there is some time between the exec syscall for m5 resetstats finishing and the benchmark starting, but if the benchmark is long enough, this shouldn't matter.
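If you want to post-process the value instead of grepping by hand, the same extraction can be sketched in Python. This is only a sketch: read_stat is a hypothetical helper name, and it assumes stat lines have the "name value # comment" layout shown in the grep output above.

```python
import re


def read_stat(stats_text, name):
    """Return every integer value recorded for `name`, in file order."""
    # Matches lines such as:
    # system.cpu.numCycles 33942872680 # number of cpu cycles simulated
    pattern = re.compile(
        r"^" + re.escape(name) + r"\s+(\d+)(?=\s|$)", re.MULTILINE
    )
    return [int(value) for value in pattern.findall(stats_text)]


if __name__ == "__main__":
    sample = (
        "system.cpu.numCycles 33942872680 # number of cpu cycles simulated\n"
    )
    print(read_stat(sample, "system.cpu.numCycles"))
```

The function returns a list because stats.txt can contain more than one dump; in real use you would pass it the contents of m5out/stats.txt and pick the entry that corresponds to your m5 dumpstats call.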
http://arm.ecs.soton.ac.uk/wp-content/uploads/2016/10/gem5_tutorial.pdf also proposes a few more heuristics:
#!/bin/sh
# Wait for system to calm down
sleep 10
# Take a checkpoint in 100000 ns
m5 checkpoint 100000
# Reset the stats
m5 resetstats
run-benchmark
# Exit the simulation
m5 exit
m5 exit also works, since gem5 dumps stats when it finishes.
Instrumentation instructions
Sometimes it seems inevitable that you have to modify the input source code a bit with instrumentation instructions in order to:
skip initialization and go directly to steady state
evaluate individual main loop runs
You can of course deduce those instructions from the gem5 m5 tool code, but here are some very easy to reuse one-line copy-pastes for arm and aarch64, e.g. for aarch64:
/* resetstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1");
/* dumpstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1");
The m5 tool uses the same mechanism under the hood, but by adding the instructions directly into the source, we avoid the syscall, and the measurement is therefore more precise and representative (at the cost of more manual work).
To ensure that the compiler does not reorder the assembly around your ROI, however, you might want to use the techniques mentioned at: Enforcing statement order in C++
Address monitoring
Another technique that can be used is to monitor addresses of interest instead of adding magic instructions to the source.
E.g., if you know that a benchmark starts with PC == 0x400, it should be possible to do something when that address is hit.
To find the addresses of interest, you would have to use, for example, readelf, gdb, or tracing, and, if running full system on top of Linux, ensure that ASLR is turned off.
This technique would be the least intrusive one, but the setup is harder, and to be honest I haven't done it yet. One day, one day.

What does 'Off' mean in the output of nvidia-smi?

I am running TensorFlow code on the GPU.
The image below shows the nvidia-smi info:
I want to ask what 'Off' means in the output of nvidia-smi.
Also, what does the "C" type mean here?
Is my code running on the GPU or the CPU in this situation?
"C" stands for compute and "G" stands for graphics. Both run on the graphics card. "Off" refers to "Persistence-M", which stands for Persistence Mode, a setting that keeps the driver always loaded.