How to get stats for several gem5 full system userland benchmark programs under Linux without having to reboot multiple times? - gem5

I know how to modify my image, reboot, and re-run it, but that would make my experiments very slow, since boot takes a few minutes.
Is there a way to quickly switch:
command line options
the executable
that is being run after boot?
This is not trivial because the Linux kernel knows about:
the state of the root filesystem
the state of memory, and therefore of kernel CLI options that could be used to modify init
so I can't just switch those after a checkpoint.
This question is inspired by: https://www.mail-archive.com/gem5-users@gem5.org/msg16959.html

Here is a fully automated setup that can help you to do it.
The basic workflow is as follows:
run your benchmark from the init executable that gets passed to the Linux kernel
https://unix.stackexchange.com/questions/122717/how-to-create-a-custom-linux-distro-that-runs-just-one-program-and-nothing-else/238579#238579
https://unix.stackexchange.com/questions/174062/can-the-init-process-be-a-shell-script-in-linux/395375#395375
to run a single benchmark with different parameters without
rebooting, do in your init script:
m5 checkpoint
m5 readfile | sh
and set the contents of m5 readfile on a host filesystem file before restoring the checkpoint with:
echo 'm5 resetstats && ./run-benchmark && m5 dumpstats' > path/to/script
build/ARM/gem5.opt configs/example/fs.py --script path/to/script
This is what configs/boot/hack_back_ckpt.rcS does, but I think that script is overly complicated.
to modify the executable without having to reboot, attach a second
disk image and mount after the checkpoint is restored:
How to attach multiple disk images in a simulation with gem5 fs.py?
Another possibility would be 9P but it is not working currently: http://gem5.org/WA-gem5
to count only benchmark instructions, do:
m5 resetstats
./run-benchmark
m5 dumpstats
If that is not precise enough, modify the source of your benchmark with m5ops magic instructions that do resetstats and dumpstats from within the benchmark as shown at: How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?
to make the first boot faster, you can boot with a simple CPU model like the default AtomicSimpleCPU and then switch to a more detailed, slower model after the checkpoint is restored: How to switch CPU models in gem5 after restoring a checkpoint and then observe the difference?
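The readfile workflow above could be driven from the host with a small helper. The following is only a minimal sketch: the function names and the benchmark arguments (--size 10, --size 20) are hypothetical, and the options needed to restore your checkpoint are deliberately not filled in. Pass them through extra_args after checking what your gem5 version's fs.py accepts.

```python
import shlex
import tempfile
from pathlib import Path

GEM5 = "build/ARM/gem5.opt"          # binary path as used in the text
FS_CONFIG = "configs/example/fs.py"  # config script as used in the text


def guest_script(benchmark_cmd):
    """Guest shell line that measures only the benchmark itself."""
    return f"m5 resetstats && {benchmark_cmd} && m5 dumpstats\n"


def write_script(path, benchmark_cmd):
    """Write the host file whose contents `m5 readfile` returns in the guest."""
    path = Path(path)
    path.write_text(guest_script(benchmark_cmd))
    return path


def gem5_command(script_path, extra_args=()):
    """Build the host command line.

    extra_args is where your checkpoint restore options go; they are
    an assumption left to the reader and not filled in here.
    """
    return [GEM5, FS_CONFIG, "--script", str(script_path), *extra_args]


if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp(prefix="readfile-scripts-"))
    # Hypothetical benchmark invocations: one checkpoint restore per entry.
    for i, cmd in enumerate(["./run-benchmark --size 10",
                             "./run-benchmark --size 20"]):
        script = write_script(out_dir / f"script-{i}.sh", cmd)
        print(shlex.join(gem5_command(script)))
```

Each printed line is one host command to run. Because every run restores from the same checkpoint, the boot cost is paid only once.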

Related

Get mapping between /dev/nvidia* and nvidia-smi gpu list

A server with 4 GPUs is used for deep learning.
It often happens that GPU memory is not freed after the training process is terminated (killed). nvidia-smi shows:
Nvidia-smi results
CUDA device 2 is in use. (This might be a process launched with CUDA_VISIBLE_DEVICES=2.)
Some subprocesses are still alive and thus occupy the memory.
One brute-force solution is to kill all processes created by python using:
pkill -u user_name python
This should be helpful if there is only one process to be cleaned up.
Another solution is proposed in the official PyTorch documentation, under "My GPU memory isn't freed properly":
One may find the leftover processes via
ps -elf | grep python
However, if multiple processes are launched and we only want to kill the ones related to a certain GPU, we can group processes by the GPU index (nvidia0, nvidia1, ...) with:
fuser -v /dev/nvidia*
fuser -v results
As we can see, /dev/nvidia3 is used by some python threads. Thus /dev/nvidia3 corresponds to cuda device 2.
The problem is: I want to kill certain processes launched with CUDA_VISIBLE_DEVICES=2, but I do not know the GPU index (/dev/nvidia0, /dev/nvidia1, ...).
How can I find the mapping between CUDA_VISIBLE_DEVICES={0,1,2,3} and /dev/nvidia{0,1,2,3}?
If you set CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable, then the order should be consistent between CUDA and nvidia-smi.
There is also another option, if you are sure that the process was limited to a specific GPU via the CUDA_VISIBLE_DEVICES environment variable. Every process's environment can be examined in /proc/${PID}/environ. The file consists of NUL-separated KEY=VALUE entries, so it is partially binary, but grepping through it usually works if you force grep to treat it as a text file. This might require root privileges.
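The /proc/${PID}/environ approach can be sketched in Python instead of grep. This is a minimal sketch, not a definitive tool: the function names are made up for illustration, and it assumes the GPU restriction was set in the environment the process was started with, since the file does not reflect changes the process made to its own environment later.

```python
import os


def parse_environ(raw):
    """Parse the NUL-separated KEY=VALUE records of /proc/<pid>/environ."""
    env = {}
    for record in raw.split(b"\0"):
        key, sep, value = record.partition(b"=")
        if sep:  # skip empty or malformed records
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def pids_with_visible_devices(wanted, proc_root="/proc"):
    """Return PIDs whose start-up environment has CUDA_VISIBLE_DEVICES == wanted."""
    matches = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, name, "environ"), "rb") as f:
                env = parse_environ(f.read())
        except OSError:
            # The process exited, or we lack permission (may need root).
            continue
        if env.get("CUDA_VISIBLE_DEVICES") == wanted:
            matches.append(int(name))
    return sorted(matches)


if __name__ == "__main__":
    # PIDs launched with CUDA_VISIBLE_DEVICES=2, as in the question.
    print(pids_with_visible_devices("2"))
```

The resulting PIDs can be compared with the fuser -v /dev/nvidia* output to see which device node they hold open before killing anything.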

Icache and Dcache in Simple.py configuration of gem5

I am trying to understand the models generated using gem5. I ran build/X86/gem5.opt with the gem5/configs/learning_gem5/part1/simple.py configuration file provided in the gem5 repo.
In the output directory I get the following .dot graph:
I have the following doubts:
Does this design not have any instruction and data caches? I checked the config.ini file and there were no configuration entries such as ICache/DCache size.
What is the purpose of adding the icache_port and dcache_port?
system.cpu.icache_port = system.membus.slave
system.cpu.dcache_port = system.membus.slave
Does this design not have any instruction and data caches? I checked the config.ini file and there were no configuration entries such as ICache/DCache size.
I'm not very familiar with that config, but unless caches were added explicitly somewhere, then there aren't caches.
Just compare it to an se.py run e.g.:
build/ARM/gem5.opt configs/example/se.py --cmd hello.out \
--caches --l2cache --l1d_size=64kB --l1i_size=64kB --l2_size=256kB
which definitely has caches, e.g. that config.ini at gem5 211869ea950f3cc3116655f06b1d46d3fa39fb3a contains:
[system.cpu.dcache]
size=65536
What is the purpose of adding the icache_port and dcache_port?
I'm not very familiar with the port system.
I think ports are used as a way for components to communicate, often in master / slave pairs, e.g. CPU is a master and the cache is a slave. So here I think that the CPU port is there but there is nothing attached to it, so no caches.
For example on the above se.py example we see this clearly:

Access pagemap in gem5 FS mode

I am trying to run an application which uses pagemap in gem5 FS mode.
But I am not able to use pagemap in gem5. It fails with the following error:
"assert(pagemap>=0) failed"
The line of code is:
int pagemap = open("/proc/self/pagemap", O_RDONLY);
assert(pagemap >= 0);
Also, if I try to run my application in the gem5 terminal with sudo, it fails with:
sudo command not found
How can I use sudo in gem5?
These problems are not gem5 specific, but rather image / Linux specific, and would likely happen on any simulator or real hardware. So I recommend that you remove gem5 from the equation completely, and ask a Linux or image specific question next time, saying exactly what image you are using and which kernel configs, and providing a minimal C example that reproduces the problem: this will greatly improve the probability that you will get help.
I have just done open("/proc/self/pagemap", O_RDONLY) successfully with: this program and on this fs.py setup on aarch64, see also these comments.
If /proc/<pid>/pagemap is not present for any process, do the following:
ensure that procfs is mounted on /proc. This is normally done with an fstab entry of type:
proc /proc proc defaults 0 0
but your init script needs to use fstab as well.
Alternatively, you can mount proc manually with:
mount -t proc proc /proc
you will likely want to ensure that /sys and /dev are mounted as well.
grep the kernel to see if there is some config controlling the file creation.
These kinds of things are often easy to find without knowing anything about the kernel.
If I do:
git grep '"pagemap'
to find the pagemap string, which is likely the creation point, on v4.18 this leads me to fs/proc/base.c, which contains:
#ifdef CONFIG_PROC_PAGE_MONITOR
REG("pagemap", S_IRUSR, proc_pagemap_operations),
#endif
so make sure CONFIG_PROC_PAGE_MONITOR is set.
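The two checks above can be scripted. The following is a minimal sketch under stated assumptions: the helper name config_enabled and the .config path are hypothetical, and the runtime checks only make sense when run inside the guest (or on any Linux machine), not on the host that builds the image.

```python
import os


def config_enabled(config_text, option):
    """True if a kernel .config text sets `option` to y or m."""
    for line in config_text.splitlines():
        # Disabled options appear as "# CONFIG_FOO is not set", which
        # does not match either form below.
        if line.strip() in (f"{option}=y", f"{option}=m"):
            return True
    return False


if __name__ == "__main__":
    # Runtime checks: is procfs mounted, and does pagemap exist?
    print("/proc is a mount point:", os.path.ismount("/proc"))
    print("pagemap present:", os.path.exists("/proc/self/pagemap"))

    # Build-time check against a kernel config file.
    config_path = ".config"  # hypothetical location of your kernel config
    if os.path.exists(config_path):
        with open(config_path) as f:
            enabled = config_enabled(f.read(), "CONFIG_PROC_PAGE_MONITOR")
        print("CONFIG_PROC_PAGE_MONITOR set:", enabled)
```

If /proc is mounted but pagemap is missing, the kernel config is the more likely culprit. If /proc is not a mount point at all, fix the init script first.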
sudo: most embedded / simulator images don't have it. You just log in as root directly and can do anything by default without it. This can be seen from the conventional # in the prompt instead of $.

How to boot the Linux kernel with initrd or initramfs with gem5?

With QEMU, I can either use -initrd "${images_dir}/rootfs.cpio" for the initrd, or pass the initramfs image directly to -kernel Image.
But if I try the initramfs image with gem5 fs.py --kernel Image it fails with:
fatal: Could not load kernel file
with the exact same initramfs kernel image that QEMU was able to consume.
And I don't see an analogue to -initrd.
The only method that I got to work was to pass an ext2 disk image to --disk-image with the raw vmlinux.
https://www.mail-archive.com/gem5-users@gem5.org/msg15198.html
initrd appears unimplemented on arm and x86 at least, since gem5 must know how to load it and inform the kernel about its location, and grepping for initrd only shows some ARM hits under:
src/arch/arm/linux/atag.hh
but they are commented out.
Communicating the initrd to the kernel now appears to be simply doable via the DTB chosen node linux,initrd-start and linux,initrd-end properties, so it might be very easy to implement: https://www.kernel.org/doc/Documentation/devicetree/bindings/chosen.txt (and gem5's existing DTB auto generation) + reusing the infrastructure to load arbitrary bytes to a memory location: How to preload memory with given raw bytes in gem5 from the command line in addition to the main ELF executable?
Initramfs doesn't work because gem5 can only boot from vmlinux, which is the raw ELF file, and the initramfs image only gets attached by the kernel build to a more final image type like Image or bzImage, which QEMU can use to boot. See also: https://unix.stackexchange.com/questions/5518/what-is-the-difference-between-the-following-kernel-makefile-terms-vmlinux-vml/482978#482978
Edit: the following is not needed anymore after the patch mentioned at: How to attach multiple disk images in a simulation with gem5 fs.py? To do this test, I also had to pass a dummy disk image as of gem5 7fa4c946386e7207ad5859e8ade0bbfc14000d91, since the scripts don't handle a missing --disk-image well. You can just create a 512-byte file of zeros and use it:
dd if=/dev/zero of=dummy.iso bs=512 count=1

Synchronous distributed tensorflow training runs asynchronously

System Information:
Debian 4.5.5
TF installed from binary (pip3 install tensorflow-gpu==1.0.1 --user)
TF version: v1.0.0-65-g4763edf-dirty 1.0.1
Bazel version: N.A.
CUDA 8.0 cuDNN v5.1
Steps to reproduce
Make a directory and download the following files into it:
training.py run.sh
Run the command ./run.sh to simply reproduce this issue.
Detailed descriptions for the bug
Recently, I tried to deploy synchronous distributed TensorFlow training on a cluster. I followed the tutorial and the inception example to write my own program. The training.py is from another user's implementation, which follows the same API usage as the official example. I modified it so that it can run on a single machine with multiple GPUs, by making the processes communicate through localhost and mapping each worker to see only one GPU.
The run.sh launched three processes. One of them is the parameter server and the others are two workers implemented by between-graph replication. I created the training supervisor by tf.train.Supervisor() to manage multiple sessions in the distributed training for the initialization and synchronization.
I expected these two workers to synchronize on each batch and work in the same epoch. However, worker 0, which is launched before worker 1, completed the whole training set without waiting for worker 1. After that, the worker 0 process finished training and exited normally, while worker 1 behaved as if it had deadlocked and stayed at near 0% CPU and GPU utilization for several hours.
Based on my observation, I suspect these two workers didn't communicate and synchronize at all for the data they passed. I report this problem as a bug because I create the optimizer tf.train.SyncReplicasOptimizer as suggested by the official website and the inception example. However, it seems that the synchronization behaviors, if any, are very strange and the program can not exit normally.
Source code / logs
Two files:
training.py: This file contains the source code for the parameter server and workers created to use synchronous distributed optimizers (tf.train.SyncReplicasOptimizer).
run.sh: This file launched the parameter server and the workers.
Log:
Please reproduce the issue by following the steps above and look at worker_0_log and worker_1_log.