Which process goes to which CPU socket in MPI?

I am running an MPI program, and my hostfile lists only one node.
The node has 2 sockets with 8 physical cores each, and hyperthreading is disabled.
mpiexec -n 8 -f /pathtohostfile/host_file_test ./a.out
I am using likwid to measure energy consumed by my program.
Questions:
Are the above 8 processes running on the same socket (to save energy), or can processes be assigned to either socket at random?
I'm not sure about this, but can a process be moved to another socket on a context switch?
If processes are assigned at random, can I pin each process to a core or socket to measure the energy?

Since you have only one node, your 8 processes are all under the control of the Linux scheduler. Unless you pin them with numactl or a similar tool, the OS will place them for the best load balancing, and it may decide to migrate them, including to the other socket. Look into numactl and other "pinning" tools. hwloc may also do it for you.
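To make the pinning idea concrete, here is a minimal Python sketch of what tools like numactl do underneath on Linux. It assumes the kernel exposes the socket of each CPU under /sys/devices/system/cpu (it falls back to a single socket if not); the helper names cpus_by_socket and pin_to_socket are made up for this illustration and are not part of any MPI or likwid API.

```python
import os
from collections import defaultdict

def cpus_by_socket(sysfs_root="/sys/devices/system/cpu"):
    """Group the CPUs this process may run on by physical package (socket) id."""
    sockets = defaultdict(set)
    for cpu in sorted(os.sched_getaffinity(0)):
        path = f"{sysfs_root}/cpu{cpu}/topology/physical_package_id"
        try:
            with open(path) as f:
                socket_id = int(f.read().strip())
        except (OSError, ValueError):
            # Assumption: topology is not exposed (e.g. some containers),
            # so treat everything as one socket.
            socket_id = 0
        sockets[socket_id].add(cpu)
    return dict(sockets)

def pin_to_socket(socket_id, pid=0):
    """Restrict `pid` (0 means the calling process) to the CPUs of one socket."""
    cpus = cpus_by_socket()[socket_id]
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)

if __name__ == "__main__":
    layout = cpus_by_socket()
    print("sockets:", layout)
    first = min(layout)
    print("now restricted to:", pin_to_socket(first))
```

Once the affinity mask covers only one socket, the scheduler can still move the process between cores of that socket, but not to the other socket. Passing a single-CPU set to os.sched_setaffinity pins to one core instead.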

Related

Does kernel spawn some processes (not user process) that keep running in background?

I'm learning the concepts of operating systems. This is the part I've learned: the kernel is the key piece of the OS that does lots of critical things such as memory management, job scheduling, etc.
This is the part I'm thinking about and getting confused by: for the OS to operate as expected, the kernel in a sense needs to keep running, perhaps in the background, so that it is always able to respond to system calls and interrupts. To achieve this, I can think of two completely different approaches:
1) The kernel actually spawns some processes purely on its own behalf, not user processes, and keeps them running in the background (like daemons). These background processes handle housekeeping without any involvement from the user or from user processes. I call this approach "the kernel is running on its own".
2) There is no kernel process at all. Every process we can find in the OS is a user process. The kernel is nothing but a library (a piece of code, along with some key data structures such as page tables) shared among all these user processes. Some portion of the kernel is mapped into each process's address space, so that when an interrupt or system call occurs, the mode is elevated to kernel mode and the kernel code in that address space runs to handle the event. While the kernel does that, it is still in the context of the current user process. In this approach only user processes exist, but the kernel periodically runs within the context of each user process (in a different mode).
This is a conceptual question that has confused me for a while. Thanks in advance!
The answer to your question is mostly no. The kernel doesn't spawn kernel mode processes. At boot, the kernel might start some executables but they run in user mode as a privileged user. For example, the Linux kernel will start systemd as the first user mode process as the root user. This process will read configuration files (written by your distribution's developers like Ubuntu) and start some other processes like the X Server for graphics and basic input (from keyboard, mouse, etc).
Your #1 is wrong and your #2 is also somewhat wrong. The kernel isn't a library. It is code mapped into the top half of the virtual address space (VAS). The bottom half of the VAS is very large (128 TiB with 48-bit virtual addresses on x86-64), so user mode processes can grow very large as long as there is physical RAM or swap space to back the memory they require. The top half of the VAS is shared between all processes. Each process has its own bottom half and can, in theory, use all of it.
The kernel is called on system calls and on interrupts. It doesn't run all the time like a process; it is simply invoked when an interrupt or syscall occurs. To make this work with more active processes than there are processor cores, timers are used. On x86-64, each core has one local APIC. The local APIC has a timer that can be programmed to raise an interrupt after some time. The kernel thus gives each process a time slice: it picks one process from the list and starts the timer with the corresponding time slice. When the timer interrupt fires, the kernel knows that the time slice of that process is over and that it might be time to let another process take its place on that core.
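The time-slice mechanism described above can be illustrated with a toy simulation of one core. This is a plain round-robin sketch written for this answer, under the assumption that every process gets the same fixed slice; real schedulers (including Linux's) are considerably more elaborate, and the function name round_robin is hypothetical.

```python
from collections import deque

def round_robin(work, time_slice):
    """Simulate one core.

    `work` maps a process name to the CPU time it still needs (> 0).
    Returns the (name, ran_for) slices in the order they were executed.
    """
    queue = deque(work.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        # The timer interrupt fires after at most one time slice.
        ran = min(time_slice, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Not finished: the process goes to the back of the queue.
            queue.append((name, remaining - ran))
    return timeline

if __name__ == "__main__":
    for name, ran in round_robin({"A": 5, "B": 2, "C": 4}, time_slice=2):
        print(name, ran)
```

Each loop iteration stands for one "program the timer, run the process, take the interrupt" cycle; the kernel code only runs at those boundaries, not continuously.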
First of all, a library can have its own background threads.
Secondly, the answer is somewhere between these approaches.
Most Unix-like systems are built on a monolithic (or hybrid) kernel. That means the kernel does all its background work in kernel threads within a single address space. I wrote about this in more detail here.
On most Linux distributions, you can run
ps -ef | grep '\[.*\]'
And it will show you kernel threads.
But it will not show you "the kernel process", because ps basically only shows threads. Multithreaded processes will be seen via their main thread. But the kernel doesn't have a main thread, it owns all the threads.
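As an alternative to matching the bracketed names in the ps output, a small Python sketch can ask the kernel directly. It assumes a Linux /proc filesystem and relies on the PF_KTHREAD flag in the flags field of /proc/<pid>/stat; the helper names are made up for this example.

```python
import os

PF_KTHREAD = 0x00200000  # flag bit the Linux kernel sets on kernel threads

def is_kernel_thread(pid, proc_root="/proc"):
    """Return True if /proc/<pid>/stat has the PF_KTHREAD flag set."""
    try:
        with open(f"{proc_root}/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        return False  # the process exited or is not visible to us
    # comm (field 2) is in parentheses and may contain spaces,
    # so split after the last ')'.
    fields = stat.rsplit(")", 1)[1].split()
    flags = int(fields[6])  # field 9 of the stat line: the task's flags word
    return bool(flags & PF_KTHREAD)

def kernel_threads(proc_root="/proc"):
    """List the PIDs under proc_root that are kernel threads."""
    if not os.path.isdir(proc_root):
        return []  # assumption: no procfs here (not Linux)
    pids = [int(name) for name in os.listdir(proc_root) if name.isdigit()]
    return sorted(pid for pid in pids if is_kernel_thread(pid, proc_root))

if __name__ == "__main__":
    print(kernel_threads())
```

Inside a container the list may well be empty, because kernel threads live in the host's PID namespace.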
If you want to look at processes through the lens of address spaces rather than threads, there's not really a way to do it. However, address spaces are useless if no thread can access them, so you can access the address space of a thread (if you have permission) via /proc/<pid>/mem. So if you used the above ps command and found a kernel thread, you can see its address space using this approach.
But you don't have to search - you can also access the kernel's address space via /proc/kcore.
You will see, however, that these kernel threads aren't, for the most part, core kernel functionality such as scheduling & virtual memory management. In most Unix kernels, these happen during a system call by the thread that made the system call while it's running in kernel mode.
Windows, on the other hand, has a design that borrows from microkernels (the NT kernel is usually described as a hybrid kernel). In a microkernel design, the kernel launches other processes and delegates work to them.
On Windows, the kernel's address space is represented by the "System" process. In a pure microkernel, the other components (file systems, drivers, and other parts of what a monolithic kernel would comprise, e.g. virtual memory management) might run in user mode or kernel mode, but still in a different address space than the microkernel.
You can get more details on how this works on Wikipedia.
Thirdly, just to be clear, none of these concepts should be confused with "system daemons", which are the regular userspace daemons that an OS needs in order to function, e.g. systemd, syslog, cron, etc.
Those are generally created by the "init" process (PID 1 on Unix systems), e.g. systemd; systemd itself, however, is created by the kernel at boot time.

How can I verify that my hardware prefetcher is disabled

I have disabled hardware prefetching using the following guidelines:
Installed msr-tools 1.3
wrmsr -a 0x1A4 1
The prefetcher control for my system (Broadwell) is at MSR address 0x1A4,
as shown in the Intel documentation.
I ran rdmsr -a 0x1A4 and the output showed 1.
According to the Intel docs, if the bit corresponding to a particular prefetcher is set to 1, that prefetcher is disabled.
Is there any other way I can verify that my hardware prefetchers have been disabled?
A disabled prefetcher should slow down operations that benefit from an enabled prefetcher. You will need to write some code (probably in assembly language) and measure its performance with the prefetcher enabled and disabled.
A long time ago I wrote a test program to measure memory read performance. It repeatedly read memory in blocks of different sizes. It showed an obvious correlation between the block sizes and the capacities of the different cache levels.
You can run some memory-intensive, prefetch-friendly workloads to verify.
By the way, my CPU is a Xeon Gold 6240, and setting 0x1A4 to 1 does not work on it.
Instead, I use sudo wrmsr -a 0x1A4 0xf
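The difference between writing 1 and writing 0xf becomes clearer when the value is decoded bit by bit. The sketch below assumes the bit layout Intel has published for MSR 0x1A4 on many Core-family parts (bits 0 to 3, one per prefetcher); check the documentation for your exact model before relying on it. The function names are made up for this example, and read_msr needs root plus the msr kernel module (modprobe msr).

```python
import os
import struct

# Assumed layout of MSR 0x1A4: a set bit means "this prefetcher is disabled".
PREFETCHER_BITS = {
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU (L1 data) hardware prefetcher",
    3: "DCU IP prefetcher",
}

def disabled_prefetchers(msr_value):
    """Return the names of the prefetchers whose disable bit is set."""
    return [name for bit, name in PREFETCHER_BITS.items()
            if (msr_value >> bit) & 1]

def read_msr(cpu, address=0x1A4):
    """Read a 64-bit MSR through /dev/cpu/<cpu>/msr, as rdmsr does."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)

if __name__ == "__main__":
    for value in (0x1, 0xF):
        print(hex(value), "->", disabled_prefetchers(value))
```

Under this layout, a value of 1 turns off only the L2 hardware prefetcher and leaves the other three running, while 0xf turns off all four.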

The mechanics behind the mapping of redis instances to separate CPU cores

It's documented that separate redis instances map to separate CPU cores. If I have 8 redis instances running on a Debian/Ubuntu machine with 8 cores, all of them would map to a core each.
1) What happens if I scale this machine down to 4 cores?
2) Do the changes happen automatically (by default), or is some explicit configuration involved?
3) Is there any way to control the behavior? If so, to what extent?
Would love to understand the technicals behind this, and an illustrative example is most welcome. I run an app hosted in the cloud which uses redis as a back-end. Scaling up (and down) the machine's CPU cores is one of the things I have to do, but I'd like to know what I'm first getting into.
Thanks in advance!
There is no magic. Since redis is single-threaded, a single instance of redis will only occupy a single core at once. Running multiple instances creates the possibility that more than one of them will be executing at once, on different cores (if you have them). How this is done is left entirely up to the operating system. redis itself doesn't do anything to "map" instances to specific cores.
In practice, it's possible that running 8 instances on 8 cores might give you something that looks like a direct mapping of instances to cores, since a smart OS will spread processes across cores (to maximize available resources), and should show some preference for running a process on the same core that it recently vacated (to make best use of cache). But at best, this is only true for the simple case of a 1:1 mapping, with no other processes on the system, all processes equally loaded, no influence from network drivers, etc.
In the general case, all you can say is that the OS will decide how to give CPU time to all of the instances that you run, and it will probably do a pretty good job, because the scheduling parts of the OS were written by people who know what they're doing.
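You can watch the OS make these placement decisions yourself. Here is a small Python sketch, assuming a Linux /proc filesystem, that records which CPU the current process last ran on by reading the "processor" field of /proc/<pid>/stat; the helper names last_cpu and sample_cpus are invented for this illustration.

```python
import time

def last_cpu(pid="self"):
    """Return the CPU number the process (its main thread) last executed on."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # comm (field 2) may contain spaces, so split after the last ')'.
    fields = stat.rsplit(")", 1)[1].split()
    return int(fields[36])  # field 39 of the stat line: "processor"

def sample_cpus(samples=20, busy_seconds=0.01):
    """Burn a little CPU between samples and record where we were running."""
    seen = []
    for _ in range(samples):
        deadline = time.perf_counter() + busy_seconds
        while time.perf_counter() < deadline:
            pass  # busy loop so the scheduler has something to place
        seen.append(last_cpu())
    return seen

if __name__ == "__main__":
    cpus = sample_cpus()
    print("CPUs visited:", cpus)
    print("distinct CPUs:", sorted(set(cpus)))
```

On an idle machine the samples often stay on one core; under load they tend to wander, which is the scheduler's load balancing at work.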
Redis is a (mostly) single-threaded process, which means that an instance of the server will use a single CPU core.
The server process is mapped to a core by the operating system - that's one of the main tasks that an OS is in charge of. To reiterate, assigning resources, including CPU, is an OS decision and a very complex one at that (i.e. try reading the code of the kernel's scheduler ;)).
If I have 8 redis instances running on a Debian/Ubuntu machine with 8 cores, all of them would map to a core each.
Perhaps, that's up to the OS' discretion. There is no guarantee that every instance will get a unique core, and it is possible that one core may be used by several instances.
1) What happens if I scale this machine down to 4 cores?
Scaling down like this means a restart. Once the Redis servers are restarted, the OS will schedule them on the available cores.
2) Do the changes happen automatically (by default), or is some explicit configuration involved?
There are no changes involved - every process, Redis or not, gets a core. Cores are shared between processes, with the OS orchestrating the entire thing.
3) Is there any way to control the behavior? If so, to what extent?
Yes, most operating systems provide interfaces for controlling the allocation of resources. Specifically, the taskset Linux command can be used to set or get a process's CPU affinity.
Note: you should leave CPU affinity setting to the OS - it is supposed to be quite good at that. Instead, make sure that you provision your server correctly for the load.

How to ensure multiple redis instances running on different cores?

I've a 4-core server and I want to run redis on it. To fully utilize the capabilities of the 4 cores, it is expected to launch 4 redis instances, since redis is designed to be single-threaded.
However, I'm curious how to ensure that the 4 instances are exactly running on 4 different cores? How can an instance decide the core on which it is running when it is launched?
Redis itself does not provide such a guarantee.
If you launch 4 instances, there will be 4 different processes that the operating system has to schedule on the 4 cores. It is up to the OS to perform this load balancing and optimize the performance of the system.
Now, if you really want to bind each instance to a specific core, modern operating systems usually provide tools to force a process to execute on a specific CPU core.
For instance, on Linux, you can have a look at the taskset and the numactl commands.
In practice, you need to be careful with this, because once you launch Redis on a specific core (by setting a CPU mask), all of its threads and child processes inherit this CPU mask. So when Redis triggers a background save operation or a background AOF rewrite, it will seriously impact the performance of the Redis instance, because the main Redis thread has to share the CPU core with the background operation (which is typically CPU-intensive).
If you really want to play with CPU binding (but is it really a good idea?), you need to bind N Redis instances to N+1 CPU cores, keeping one core free for the background operations, and make sure that at most one background operation runs at a time across these instances.
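The N+1 rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every instance gets the same mask of N+1 CPUs (rather than one private core each), and the helper names shared_mask and bind_instances are hypothetical, not anything Redis provides.

```python
import os

def shared_mask(n_instances, available=None):
    """Pick N+1 CPUs for N instances; every instance gets the same mask."""
    if available is None:
        available = os.sched_getaffinity(0)
    cpus = sorted(available)
    needed = n_instances + 1  # one spare core for a background save or rewrite
    if len(cpus) < needed:
        raise ValueError(
            f"need {needed} CPUs for {n_instances} instances, have {len(cpus)}")
    return set(cpus[:needed])

def bind_instances(pids, mask):
    """Apply the shared mask to already-running instances.

    Note: this sets the affinity of the thread whose id equals each pid
    (the main thread); children forked afterwards inherit the mask.
    """
    for pid in pids:
        os.sched_setaffinity(pid, mask)

if __name__ == "__main__":
    # Three instances on an assumed 4-core box share CPUs 0-3.
    print(shared_mask(3, available=range(4)))
```

On the 4-core server from the question this means binding at most 3 instances; asking for 4 raises the ValueError, which is the point of keeping one core free.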

Bind a single process to one CPU and move all IRQs, daemons, and rpciod threads to other CPUs

I have a Linux machine with 16 cores in it.
// uname -a
Linux lndbxdev01 2.6.24.7-108.el5rt #1 SMP PREEMPT RT
Mon Mar 23 10:58:10 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
// OS detail
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
I would like to set process affinity so that 1 CPU will be
entirely dedicated to 1 process.
When I say entirely dedicated, I mean that I really want to bind
all other running daemons, IRQ-nnnn, rpciod/nn, etc. to all available
CPUs except the one my process runs on
(on my OS I can count around 500 processes).
Is it safe to do that, or should I leave some of them on the CPU where they are currently running?
If I rebind at least the IRQs, will the performance be better?
Since these threads are tied to interrupts, which are triggered frequently,
they cause frequent process context switches because the kernel has to run them.
I am expecting the following benefits:
because there will be one single process running on that CPU,
there will be NO process context switches at all.
the time slice assigned to my process on that CPU
will be increased, so it will run longer before a process context switch (if any).
Kind Regards
AFG
I guess cpusets can help you overcome this problem. You can define an exclusive cpuset for one of the CPUs and bind the process to that specific cpuset.
http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
You might find cset useful:
http://code.google.com/p/cpuset/
If you are not willing to use either of them, you need to write your own C code to schedule the process on a specific CPU (using sched_setaffinity) and disable all interrupts on that specific CPU.
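For the interrupt side, Linux lets you steer an IRQ away from a CPU by writing a hexadecimal CPU bitmask to /proc/irq/<n>/smp_affinity. The Python sketch below only computes that mask for "all CPUs except the reserved one"; the function names are made up for this example, the write needs root, and some IRQs cannot be moved, in which case the write fails.

```python
def irq_affinity_mask(n_cpus, reserved_cpu):
    """Hex mask of all CPUs except `reserved_cpu`, in smp_affinity format
    (comma-separated 32-bit groups, most significant group first)."""
    if not 0 <= reserved_cpu < n_cpus:
        raise ValueError("reserved_cpu must be one of the machine's CPUs")
    mask = ((1 << n_cpus) - 1) & ~(1 << reserved_cpu)
    groups = []
    while True:
        groups.append(f"{mask & 0xFFFFFFFF:08x}")
        mask >>= 32
        if mask == 0:
            break
    text = ",".join(reversed(groups))
    return text.lstrip("0") or "0"

def move_irq(irq, mask_text):
    """Write the mask for one IRQ (needs root; may fail for unmovable IRQs)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(mask_text)

if __name__ == "__main__":
    # 16 cores as in the question, CPU 3 reserved for the dedicated process.
    print(irq_affinity_mask(16, 3))
```

Combined with sched_setaffinity for the process itself, this keeps movable device interrupts off the dedicated CPU; per-CPU timer interrupts and some per-CPU kernel threads will still run there.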
I hope it helps.