Bind a single process to one CPU and move all IRQs, daemons, rpciod to the other CPUs

I have a Linux machine with 16 cores in it.
// uname -a
Linux lndbxdev01 2.6.24.7-108.el5rt #1 SMP PREEMPT RT
Mon Mar 23 10:58:10 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
// OS detail
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
I would like to set processor affinity so that one CPU is
entirely dedicated to one process.
By entirely dedicated I mean that I really want to bind all the
other running daemons, IRQ-nnnn, rpciod/nn, etc. to all the CPUs
available except the one my process is interested in.
(On my OS I can count around 500 processes.)
Is it safe to do that, or should I leave some of them on the CPU where they are currently running?
If I bind at least the IRQs, will performance be better?
Since those are tied to interrupts, which are triggered frequently,
they induce frequent process context switches, because the kernel has to run them.
I am expecting the following benefits:
- because there will be a single process running on a single CPU,
there will be no process context switches at all;
- the time slice assigned to my process on that CPU
will be increased, so it will run longer before a context switch (if any).
Kind Regards
AFG

I guess cpusets can help you overcome this problem. You can define an exclusive cpuset for one of the CPUs and bind the process to that specific cpuset.
http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
You might find cset useful:
http://code.google.com/p/cpuset/
If you are not willing to use either of them, then you need to write your own C code to schedule the process on a specific CPU (using sched_setaffinity) and to disable all interrupts on that specific CPU.
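As an illustration, here is a minimal shell sketch, assuming cset is installed and a 16-CPU box where CPU 15 is to be isolated (the CPU numbers, affinity mask, and process name are placeholders):

# Shield CPU 15: migrate all movable tasks, including kernel threads,
# to CPUs 0-14, leaving CPU 15 empty.
cset shield --cpu 15 --kthread on

# Run the latency-sensitive process inside the shield.
cset shield --exec ./my_process

# Steer IRQs away from CPU 15 by clearing its bit in each IRQ's
# affinity mask (0x7fff = CPUs 0-14). Some IRQs cannot be moved
# and will reject the write.
for irq in /proc/irq/[0-9]*; do
    echo 7fff > "$irq/smp_affinity" 2>/dev/null
done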
I hope it helps.

Related

Which process goes to which cpu socket in MPI?

I am running an MPI program, and in my hostfile I have only one node.
The node has 2 sockets with 8 physical cores each, and hyperthreading is disabled.
mpiexec -n 8 -f /pathtohostfile/host_file_test ./a.out
I am using likwid to measure energy consumed by my program.
Questions:
Are the above 8 processes running on the same socket (to save energy), or can processes be randomly assigned to either socket?
I am not sure about it, but can a process be moved to another socket by a context switch?
In case processes are randomly assigned, can I pin my processes to a core/socket to measure the energy?
Since you have only one node, your 8 processes are all under control of the Linux scheduler, so, unless you use numactl or something to pin them down, the OS will place them for best load balancing. And it may decide to migrate them. Look into numactl and other "pinning" tools. hwloc may also do it for you.
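For example, a sketch of two pinning approaches (the -bind-to flag is MPICH/Hydra syntax and the socket number is a placeholder; Open MPI spells its binding options differently):

# Ask the MPI launcher to bind each rank to its own core (MPICH/Hydra):
mpiexec -n 8 -bind-to core -f /pathtohostfile/host_file_test ./a.out

# Or confine the whole job to the cores and memory of socket 0 via numactl:
numactl --cpunodebind=0 --membind=0 \
    mpiexec -n 8 -f /pathtohostfile/host_file_test ./a.out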

The mechanics behind the mapping of redis instances to separate CPU cores

It's documented that separate redis instances map to separate CPU cores. If I have 8 redis instances running on a Debian/Ubuntu machine with 8 cores, all of them would map to a core each.
1) What happens if I scale this machine down to 4 cores?
2) Do the changes happen automatically (by default), or is some explicit configuration involved?
3) Is there any way to control the behavior? If so, to what extent?
I would love to understand the technical details behind this, and an illustrative example is most welcome. I run an app hosted in the cloud which uses redis as a back-end. Scaling the machine's CPU cores up (and down) is one of the things I have to do, but I'd like to know what I'm getting into first.
Thanks in advance!
There is no magic. Since redis is single-threaded, a single instance of redis will only occupy a single core at once. Running multiple instances creates the possibility that more than one of them will be executing at once, on different cores (if you have them). How this is done is left entirely up to the operating system. redis itself doesn't do anything to "map" instances to specific cores.
In practice, it's possible that running 8 instances on 8 cores might give you something that looks like a direct mapping of instances to cores, since a smart OS will spread processes across cores (to maximize available resources), and should show some preference for running a process on the same core that it recently vacated (to make best use of cache). But at best, this is only true for the simple case of a 1:1 mapping, with no other processes on the system, all processes equally loaded, no influence from network drivers, etc.
In the general case, all you can say is that the OS will decide how to give CPU time to all of the instances that you run, and it will probably do a pretty good job, because the scheduling parts of the OS were written by people who know what they're doing.
Redis is a (mostly) single-threaded process, which means that an instance of the server will use a single CPU core.
The server process is mapped to a core by the operating system - that's one of the main tasks an OS is in charge of. To reiterate, assigning resources, including CPU, is an OS decision and a very complex one at that (e.g. try reading the code of the kernel's scheduler ;)).
If I have 8 redis instances running on a Debian/Ubuntu machine with 8 cores, all of them would map to a core each.
Perhaps, that's up to the OS' discretion. There is no guarantee that every instance will get a unique core, and it is possible that one core may be used by several instances.
1) What happens if I scale this machine down to 4 cores?
Scaling down like this means a restart. Once the Redis servers are restarted, the OS will assign them to the available cores.
2) Do the changes happen automatically (by default), or is some explicit configuration involved?
There are no changes involved - every process, Redis or not, gets a core. Cores are shared between processes, with the OS orchestrating the entire thing.
3) Is there any way to control the behavior? If so, to what extent?
Yes, most operating systems provide interfaces for controlling the allocation of resources. Specifically, the taskset Linux command can be used to set or get a process's CPU affinity.
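For example, a quick sketch (the core numbers, config path, and PID are placeholders):

# Start an instance pinned to core 2:
taskset -c 2 redis-server /etc/redis/6379.conf

# Show the affinity of a running instance (12345 is a placeholder PID):
taskset -pc 12345

# Move that instance to core 3:
taskset -pc 3 12345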
Note: you should leave CPU affinity setting to the OS - it is supposed to be quite good at that. Instead, make sure that you provision your server correctly for the load.

How to run a gem5 arm aarch64 full system simulation with fs.py with more than 8 cores?

If I try to use more than --num-cpus=8 cores, e.g. 16, 32 or 64, the terminal just stays blank.
Tested with gem5 at commit 2a9573f5942b5416fb0570cf5cb6cdecba733392 and Linux kernel 4.16.
Related thread: https://www.mail-archive.com/gem5-users@gem5.org/msg15469.html
To add to Ciro's answer: the current GICv2 model in gem5 supports a single core by default because of this line of code. Without enabling gem5ExtensionsEnabled, it won't update highest_int with the incoming interrupt number, and as a result the received interrupt won't get posted to the specified CPU to invoke a handler; that is, there is no jump to the interrupt handler. In addition, even when we turn on gem5ExtensionsEnabled, I think it will support at most 4 cores, because the default values of INT_BITS_MAX and itLines are 32 and 128, respectively (see this); it checks 32 interrupt lines per core across 4 cores. For example, imagine that a system features 16 cores and CPU 5 executes the loop. Also, suppose that another core (say core 11) already has an interrupt with higher priority than this one. Then the loop will ignore the interrupt from core 11, because the loop index x can grow at most to 3.
To turn on gem5ExtensionsEnabled, you can pass the option --param='system.realview.gic.gem5_extensions=True' to your command, as Ciro stated. However, note that this parameter sets the haveGem5Extensions variable here, not gem5ExtensionsEnabled, which is enabled only when firmware writes some data (0x200) to the GIC distributor register at the GICD_TYPER offset (see this).
Newer method: GICv3
Since GICv3 was implemented in February 2019 at https://gem5-review.googlesource.com/c/public/gem5/+/13436 you can just use it instead.
The GICv3 hardware natively supports more than 8 CPUs, so it just works.
As of July 2020, gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772, GICv3 is the default GIC for the VExpress_GEM5_V2 platform, but VExpress_GEM5_V2 is not the default fs.py machine type at that commit, so you just have to select it with:
fs.py --machine-type VExpress_GEM5_V2
Once I did that, it just worked; an Atomic boot took about 6x longer on 16 cores than on a single CPU. Tested with this setup: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/d0ada7f58c6624662bbfa3446c7c26422d1c5afb#gem5-arm-full-system-with-more-than-8-cores
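For reference, a fuller invocation might look like this (the gem5 binary, kernel, and disk paths are placeholders):

# 16-core aarch64 full-system run on the GICv3 platform:
./build/ARM/gem5.opt configs/example/fs.py \
    --machine-type VExpress_GEM5_V2 \
    --num-cpus 16 \
    --kernel path/to/vmlinux \
    --disk-image path/to/rootfs.img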
Older method: GICv2 extensions
As mentioned at: https://www.mail-archive.com/gem5-dev@gem5.org/msg24593.html gem5 has a GICv2 extension + kernel patch that allows this:
use the ARM linux kernel fork from: https://gem5.googlesource.com/arm/linux/+/refs/heads/gem5/v4.15 in particular the GICv2 extension script commit
for fs.py add the options --param 'system.realview.gic.gem5_extensions = True' --generate-dtb
Tested with this setup: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/bab029f60656913b5dea629a220ae593cc16147d#gem5-arm-full-system-with-more-than-8-cores (gem5 4c8efdbef45d98109769cf675ee3411393e8ed06, Linux kernel fork v4.15, aarch64).
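Putting those options together, an invocation might look like this (again, the gem5 binary and kernel paths are placeholders):

# 16-core run using the GICv2 extensions and the patched v4.15 kernel:
./build/ARM/gem5.opt configs/example/fs.py \
    --num-cpus 16 \
    --kernel path/to/gem5-v4.15-vmlinux \
    --generate-dtb \
    --param 'system.realview.gic.gem5_extensions = True'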

How to ensure multiple redis instances running on different cores?

I have a 4-core server and I want to run redis on it. To fully utilize the capabilities of the 4 cores, the natural move is to launch 4 redis instances, since redis is designed to be single-threaded.
However, I'm curious how to ensure that the 4 instances run on exactly 4 different cores. How can an instance choose the core on which it runs when it is launched?
Redis itself does not provide such guarantee.
If you launch 4 instances, there will be 4 different processes that the operating system will have to get scheduled on the 4 cores. It is up to the OS to perform this load balancing, optimizing the performance of the system.
Now, if you really want to bind each instance to a specific core, modern OSes usually provide tools to enforce the execution of a process on a specific CPU core.
For instance, on Linux, you can have a look at the taskset and the numactl commands.
In practice, you need to be careful with this, because once you launch Redis on a specific core (setting a CPU mask), all its threads and child processes will inherit that CPU mask. So when Redis triggers a background save operation, or a background AOF rewrite, it will seriously impact the performance of the Redis instance. This is due to the fact that the main Redis thread will have to share its CPU core with the background operation (which is typically CPU-consuming).
If you really want to play with CPU binding (but is it really a good idea?), you need to bind N Redis instances to N+1 CPU cores, keeping one core free for the background operations, and make sure at most one background operation can run at the same time for these instances.
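One hedged reading of that N-instances-on-N+1-cores advice is to give each instance its own core plus a shared spare in its mask, so a forked background save can land on the spare (ports, config paths, and core numbers below are placeholders):

# 3 instances on 4 cores; core 3 is the shared spare for
# background saves and AOF rewrites.
taskset -c 0,3 redis-server /etc/redis/6379.conf
taskset -c 1,3 redis-server /etc/redis/6380.conf
taskset -c 2,3 redis-server /etc/redis/6381.conf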

SQL Server 2005 - Multiple Processor Usage

We have a 16 processor SQL Server 2005 cluster. When looking at CPU usage data we see that most of the time only 4 of the 16 processors are ever utilized. However, in periods of high load, occasionally a 5th and 6th processor will be used, although never anywhere near the utilization of the other 4. I'm concerned that in periods of tremendously high load that not all of the other processors will be utilized and we will have performance degradation.
Is what we're seeing standard SQL Server 2005 cluster behavior? I assumed that all 16 processors would be utilized at all times, though this does not appear to be the case. Is this something we can tune? Or is this expected behavior? Will SQL server be able to utilize all 16 processors if it comes to that?
I'll assume you did due diligence and validated that the CPU consumption belongs to the sqlservr.exe process, so we're not chasing a red herring here. If not, please make sure the CPU is consumed by sqlservr.exe by checking the Process\% Processor Time performance counter.
You need to understand the SQL Server CPU scheduling model, as described in Thread and Task Architecture. SQL Server spreads requests (sys.dm_exec_requests) across schedulers (sys.dm_os_schedulers) by assigning each request to a task (sys.dm_os_tasks) that is run by a worker (sys.dm_os_workers). A worker is backed by an OS thread or fiber (sys.dm_os_threads). Most requests (a batch sent to SQL Server) spawn only one task; some requests, though, may spawn multiple tasks (parallel queries being the most notorious).
The normal behavior of SQL Server 2005 scheduling is to distribute the tasks evenly across all schedulers. Each scheduler corresponds to one CPU core. The result should be an even load on all CPU cores. But I've seen the problem you describe a few times in the lab, when the physical workload would distribute unevenly across only a few CPUs. You have to understand that SQL Server does not control the thread affinity of its workers, but instead relies on the OS affinity algorithm for thread locality. That means that even if SQL Server spreads the requests across the 16 schedulers, the OS might decide to run the threads on only 4 cores. In correlation with this issue, there are two problems that may cause or aggravate this behavior:
Hyperthreading. If you enabled hyperthreading, turn it off. SQL Server and hyperthreading should never mix.
Bad drivers. Make sure you have the proper system device drivers installed (for things like main board and such).
Also make sure your SQL Server 2005 is at least at the SP2 level, preferably at the latest SP with all CUs applied. The same goes for Windows (do you run Windows 2003 or Windows 2008?).
In theory the behavior could also be explained by a very peculiar workload, i.e. SQL sees only a few very long, CPU-demanding requests that have no parallel option. But that would be an extremely skewed load, and I have never seen anything like it in real life.
Even accounting for an IO bottleneck, what I would check is whether you have processor affinities set up, what your maxdop setting is, and whether the box is SMP or NUMA, which should also affect what maxdop you may wish to set.
When you say you have a 16-processor cluster, do you mean 2 SQL Servers in a cluster with 16 processors each, or 2 x 8-way SQL Servers?
Are you sure that you're not bottlenecking elsewhere? On IO perhaps?
Hard to be sure without hard data, but I suspect the problem is that you're more IO-bound or memory-bound than CPU-bound right now, and 4 processors is enough to keep up with your real bottleneck.
My reasoning is that if there were some configuration problem that was keeping you limited to 4 cpus, you wouldn't see it spill over to the 5th and 6th processors at all.