Multiple CPUIDs not showing up in Gem5 simulation - gem5

I am just starting out with gem5. I ran a program that was statically compiled with the m5threads library, using se.py with the '-n 64' flag set. This should simulate the program running on 64 cores [as per http://pages.cs.wisc.edu/~markhill/cs757/Spring2016/wiki/index.php?n=Main.Homework3], but the Exec trace only shows operations from a single CPU. What am I doing wrong?
Command line used:
Gem5/gem5/build/X86/gem5.opt --debug-flags=Exec,TLB,DRAM Gem5/gem5/configs/example/se.py -n 64 -c paper3/Blackscholes/blackscholes.out --options="1 paper3/Blackscholes/in_16.txt paper3/Blackscholes/output.txt"

A similar thread happened recently at: https://www.mail-archive.com/gem5-users#gem5.org/msg16830.html
Some things you should check:
are you absolutely 100% certain that the program spawns threads? Ensure that by:
reading its source code
tracing with --debug-flags SyscallBase,SyscallVerbose and looking for clone syscalls
m5threads is not needed anymore, I believe. I'm sure this is the case for ARM at least.
here is a minimal runnable example that spawns threads and does show multiple thread IDs in the Exec trace and stats (a small sketch in the same spirit follows this list):
https://github.com/cirosantilli/linux-kernel-module-cheat/blob/6aa375df2ad3ce2e7d741d09b378503c25547df1/userland/posix/pthread_self.c
https://github.com/cirosantilli/linux-kernel-module-cheat/tree/e2b8bcdc3f6cc6803f3f89607bd118f812aed367#gem5-syscall-emulation-mode
don't forget that if the program spawns 64 worker threads, then it likely uses 65 threads in total (the main thread + 64 workers), so you may need 65 CPUs
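For reference, here is a minimal sketch in the same spirit as the linked pthread_self.c (not a copy of it): a program that spawns a few worker threads, which under gem5 SE should show up as clone syscalls and as multiple thread IDs in the Exec trace. Compile statically with something like gcc -pthread -static; NTHREADS is just an illustrative value.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4   /* bump to 64 to mirror the -n 64 experiment */

static void *worker(void *arg) {
    long id = (long)arg;
    /* pthread_t printed as an integer for illustration only; not portable */
    printf("worker %ld, pthread_self = %lu\n", id, (unsigned long)pthread_self());
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}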

Related

Does kernel spawn some processes (not user process) that keep running in background?

I'm learning operating system concepts. Here is part of what I've learned: the kernel is the key piece of the OS that does lots of critical things such as memory management, job scheduling, etc.
Here is where my thinking gets confused: for the OS to operate as expected, the kernel needs to keep running in some sense, perhaps in the background, so it is always able to respond to different system calls and interrupts. I can think of two completely different approaches to achieve this:
The kernel actually spawns some processes purely on its own behalf, not user processes, and keeps them running in the background (like daemons). These background processes handle housekeeping without any involvement from users or user processes. I call this approach "the kernel running on its own".
There is no kernel process at all. Every process we can find in the OS is a user process. The kernel is nothing but a library (a piece of code, along with some key data structures like page tables, etc.) shared among all these user processes. In each process's address space, some portion of the kernel is mapped, so that when any interrupt or system call occurs, the mode is elevated to kernel mode and the kernel code mapped into the user process's address space is executed to handle the event. When the kernel does that, it is still in the context of the current user process. In this approach there exist only user processes, but the kernel periodically runs within the context of each user process (in a different mode).
This is a conceptual question that has confused me for a while. Thanks in advance!
The answer to your question is mostly no. The kernel doesn't spawn kernel mode processes. At boot, the kernel might start some executables but they run in user mode as a privileged user. For example, the Linux kernel will start systemd as the first user mode process as the root user. This process will read configuration files (written by your distribution's developers like Ubuntu) and start some other processes like the X Server for graphics and basic input (from keyboard, mouse, etc).
Your #1 is wrong and your #2 is also somewhat wrong. The kernel isn't a library. It is code loaded in the top half of the virtual address space. The bottom half of the VAS is very big (on the order of a hundred thousand GB, i.e. 128 TiB with typical 48-bit addressing), so user mode processes can become very big as long as you have physical RAM or swap space to back the memory they require. The top half of the VAS is shared between processes, while each process has its own bottom half and, in theory, has access to all of it.
The kernel is called on system call and on interrupt. It doesn't run all the time like a process. It simply is called when an interrupt or syscall occurs. To make it work with more active processes than there are processor cores, timers will be used. On x86-64, each core has one local APIC. The local APIC has a timer that you can program to throw an interrupt after some time. The kernel will thus give a time slice to each process, choose one process in the list and start the timer with its corresponding time slice. When the timer throws an interrupt, the kernel knows that the time slice of that process is over and that it might be time to let another process take its place on that core.
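As a purely illustrative, user-space toy of the time-slice idea described above (no APIC, no context switching, just the round-robin arithmetic; the process names and slice length are made up):

#include <stdio.h>

#define NPROC 3
#define SLICE 2   /* pretend each process gets 2 timer ticks before preemption */

int main(void) {
    const char *proc[NPROC] = { "A", "B", "C" };
    int current = 0, remaining = SLICE;

    for (int tick = 0; tick < 12; tick++) {   /* pretend timer interrupts */
        printf("tick %2d: running %s\n", tick, proc[current]);
        if (--remaining == 0) {               /* slice used up: preempt */
            current = (current + 1) % NPROC;  /* pick the next process */
            remaining = SLICE;
        }
    }
    return 0;
}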
First of all, a library can have its own background threads.
Secondly, the answer is somewhere between these approaches.
Most Unix-like systems are built on a monolithic kernel (or a hybrid one). That means the kernel runs all of its background work in kernel threads within a single address space. I wrote about this in more detail here.
On most Linux distributions, you can run
ps -ef | grep '\[.*\]'
And it will show you kernel threads.
But it will not show you "the kernel process", because ps basically only shows threads: multithreaded processes are seen via their main thread, but the kernel doesn't have a main thread; it owns all the threads.
If you want to look at processes through the lens of address spaces rather than threads, there's not really a way to do it. However, address spaces are useless if no thread can access them, so you can access the actual address space of a thread (if you have permission) via /proc/<pid>/mem. So if you used the above ps command and found a kernel thread, you can see its address space using this approach.
But you don't have to search - you can also access the kernel's address space via /proc/kcore.
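For a programmatic version of the ps trick above, here is a small C sketch based on one common heuristic: on Linux, /proc/<pid>/cmdline is empty for kernel threads (which is also why ps shows their names in brackets). Zombie processes also have an empty cmdline, so treat the output as candidates only.

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void) {
    DIR *proc = opendir("/proc");
    if (!proc) { perror("/proc"); return 1; }

    struct dirent *e;
    while ((e = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)e->d_name[0]))
            continue;                        /* not a <pid> directory */

        char path[64];
        snprintf(path, sizeof path, "/proc/%s/cmdline", e->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                        /* process may have exited */
        int empty = (fgetc(f) == EOF);       /* empty cmdline: likely a kernel thread */
        fclose(f);

        if (empty)
            printf("kernel thread candidate: pid %s\n", e->d_name);
    }
    closedir(proc);
    return 0;
}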
You will see, however, that these kernel threads do not, for the most part, implement core kernel functionality such as scheduling & virtual memory management. In most Unix kernels, those things happen during a system call, executed by the thread that made the system call while it's running in kernel mode.
Windows, on the other hand, is built on a microkernel. That means that the kernel launches other processes and delegates work to them.
On Windows, that microkernel's address space is represented by the "System" service. The other processes - file systems, drivers etc., and other parts of what a monolithic kernel would comprise e.g. virtual memory management - might run in user mode or kernel mode, but still in a different address space than the microkernel.
You can get more details on how this works on Wikipedia.
Thirdly, just to be clear, none of these concepts is to be confused with "system daemons", which are the regular userspace daemons that an OS needs in order to function, e.g. systemd, syslog, cron, etc.
Those are generally created by the "init" process (PID 1 on Unix systems), e.g. systemd; systemd itself, however, is created by the kernel at boot time.

Which process goes to which cpu socket in MPI?

I am running an MPI program, and in my hostfile I have only one node.
The node has 2 sockets with 8 physical cores each, and hyperthreading is disabled.
mpiexec -n 8 -f /pathtohostfile/host_file_test ./a.out
I am using likwid to measure energy consumed by my program.
Questions:
Are the above 8 processes running on the same socket (to save energy), or can processes be assigned to either socket at random?
I am not sure about this, but can a process be migrated to another socket on a context switch?
In case processes are randomly assigned, can I pin my processes to a core/socket to measure the energy?
Since you have only one node, your 8 processes are all under control of the Linux scheduler, so, unless you use numactl or something to pin them down, the OS will place them for best load balancing. And it may decide to migrate them. Look into numactl and other "pinning" tools. hwloc may also do it for you.
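If you just want to observe where each rank currently sits before deciding whether to pin, a small MPI sketch along these lines can help (assumes Linux with glibc, which provides sched_getcpu(); mapping the reported CPU number to a socket can then be done with lscpu or hwloc):

#define _GNU_SOURCE       /* for sched_getcpu() */
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Reports the CPU this rank is running on right now; the scheduler may
     * still migrate it later unless you pin with numactl, hwloc-bind, etc. */
    printf("rank %d is currently on cpu %d\n", rank, sched_getcpu());
    MPI_Finalize();
    return 0;
}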

request for clarification in snakemake's documentation regarding 'resources' and 'threads'

I have a question with regard to resources and threads (it's not clear to me from the documentation).
Are the resources per thread?
That's the case with various HPC job submission systems; for example, that's how jobs work with LSF's bsub:
If I request 64 threads with 1024MiB each, bsub will schedule a job with 64 processes, each consuming 1024MiB individually, and thus consuming 64GiB in total.
(That total memory may or may not be on the same machine, as the 64 processes may or may not be on the same machine depending on the host[span=n] parameters. For OpenMPI use cases it might well be 64 different machines, each allocating its own local 1024MiB chunk. But with host[span=1], it's going to be a single machine with 64 threads and 64GiB of memory.)
When looking at the LSF profile, mem_mb seems to be passed from resources to bsub with only unit conversions, but otherwise the same value;
thus it seems that snakemake and LSF both assume that total_memory = threads * mem_mb.
I just wanted to make sure this assumption is correct.
Upon further analysis, the resources accounting in jobs.py is in contradiction with the above.
Filing a bug report.

How to run a gem5 arm aarch64 full system simulation with fs.py with more than 8 cores?

If I try to use more than --num-cpus=8 cores, e.g. 16, 32 or 64, the terminal just stays blank.
Tested with gem5 at commit 2a9573f5942b5416fb0570cf5cb6cdecba733392 and Linux kernel 4.16.
Related thread: https://www.mail-archive.com/gem5-users#gem5.org/msg15469.html
To add to Ciro's answer, the current GICv2 model in gem5 supports a single core by default because of this line of code. Without enabling gem5ExtensionsEnabled, it won't update highest_int with the receiving interrupt number, and as a result the received interrupt won't get posted to the specified CPU to invoke a handler; that is, there is no jump to the interrupt handler. In addition, even when we turn on gem5ExtensionsEnabled, I think it will support up to 4 cores, because the default values of INT_BITS_MAX and itLines are 32 and 128, respectively (see this); it checks 32 interrupt lines per core across 4 cores. For example, imagine that a system features 16 cores and cpu 5 executes the loop. Also, suppose that another core (say core 11) already has an interrupt with a higher priority than this one. Then the loop will ignore the interrupt from core 11, because the loop index x can grow at most to 3.
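To make the arithmetic concrete, here is a hypothetical paraphrase of the bounded scan described above; it is not the actual gem5 GICv2 source, it only illustrates why an outer index limited to itLines / INT_BITS_MAX = 4 can never reach per-core state for core 11:

#include <stdio.h>

#define INT_BITS_MAX 32
#define IT_LINES     128
#define NUM_CPUS     16

int main(void) {
    int pending[NUM_CPUS] = { 0 };   /* pretend per-core pending flags */
    pending[11] = 1;                 /* core 11 has a pending interrupt */

    /* x can only reach IT_LINES / INT_BITS_MAX - 1 == 3 */
    for (int x = 0; x < IT_LINES / INT_BITS_MAX; x++)
        if (pending[x])
            printf("would consider core %d\n", x);

    printf("core 11 was never considered\n");
    return 0;
}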
To turn on gem5ExtensionsEnabled, you can pass the option --param='system.realview.gic.gem5_extensions=True' to your command, as Ciro stated. However, note that the parameter only sets the haveGem5Extensions variable here; gem5ExtensionsEnabled itself is enabled only when firmware code writes some data (0x200) to the GIC distributor register at the GICD_TYPER offset (see this).
Newer method: GICv3
Since GICv3 was implemented in February 2019 at https://gem5-review.googlesource.com/c/public/gem5/+/13436 you can just use it instead.
The GICv3 hardware natively supports more than 8 CPUs, so it just works.
As of July 2020, gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772, GICv3 is the default GIC for the VExpress_GEM5_V2 platform, but VExpress_GEM5_V2 is not the default fs.py machine type at that commit, so you just have to select it with:
fs.py --machine-type VExpress_GEM5_V2
Once I did that, it just worked; an Atomic boot took about 6x as long on 16 cores as on a single CPU. Tested with this setup: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/d0ada7f58c6624662bbfa3446c7c26422d1c5afb#gem5-arm-full-system-with-more-than-8-cores
Older method: GICv2 extensions
As mentioned at: https://www.mail-archive.com/gem5-dev#gem5.org/msg24593.html gem5 has a GICv2 extension + kernel patch that allows this:
use the ARM linux kernel fork from: https://gem5.googlesource.com/arm/linux/+/refs/heads/gem5/v4.15 in particular the GICv2 extension script commit
for fs.py add the options --param 'system.realview.gic.gem5_extensions = True' --generate-dtb
Tested with this setup: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/bab029f60656913b5dea629a220ae593cc16147d#gem5-arm-full-system-with-more-than-8-cores (gem5 4c8efdbef45d98109769cf675ee3411393e8ed06, Linux kernel fork v4.15, aarch64).

mpi process ids

I would like to get the process ids of an MPI application that I start with the mpirun/mpiexec tools.
For example, I run my code with, let's say, 8 processes, and I want to get the process ids of all 8 processes right at the beginning of the execution, to give to another tool as input.
What would be the right way to do this?
I don't believe that there is any MPI library routine which will return the pid of the o/s process which is running an MPI process. To be absolutely precise, I don't think that the MPI standard requires there to be a one-to-one mapping between MPI processes and o/s processes, nor any other cardinality of mapping, though I don't think I've ever used an MPI implementation where there wasn't a one-to-one mapping between the different views of processes.
All that aside, why not simply use getpid if you are on a Linux machine? Each MPI process should get its own pid. I guess there is a Windows system call which does the same thing, but I don't know much about Windows.
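For what it's worth, a minimal sketch of that approach (assumes a POSIX system; each rank prints its own pid at startup so it can be fed to another tool):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d has pid %ld\n", rank, (long)getpid());
    fflush(stdout);   /* make sure the pid is visible right away */
    MPI_Finalize();
    return 0;
}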