The mechanics behind the mapping of redis instances to separate CPU cores - redis

It's documented that separate redis instances map to separate CPU cores. If I have 8 redis instances running on a Debian/Ubuntu machine with 8 cores, all of them would map to a core each.
1) What happens if I scale this machine down to 4 cores?
2) Do the changes happen automatically (by default), or is some explicit configuration involved?
3) Is there any way to control the behavior? If so, to what extent?
Would love to understand the technicals behind this, and an illustrative example is most welcome. I run an app hosted in the cloud which uses redis as a back-end. Scaling up (and down) the machine's CPU cores is one of the things I have to do, but I'd like to know what I'm first getting into.
Thanks in advance!

There is no magic. Since redis is single-threaded, a single instance of redis will only occupy a single core at once. Running multiple instances creates the possibility that more than one of them will be executing at once, on different cores (if you have them). How this is done is left entirely up to the operating system. redis itself doesn't do anything to "map" instances to specific cores.
In practice, it's possible that running 8 instances on 8 cores might give you something that looks like a direct mapping of instances to cores, since a smart OS will spread processes across cores (to maximize available resources), and should show some preference for running a process on the same core that it recently vacated (to make best use of cache). But at best, this is only true for the simple case of a 1:1 mapping, with no other processes on the system, all processes equally loaded, no influence from network drivers, etc.
In the general case, all you can say is that the OS will decide how to give CPU time to all of the instances that you run, and it will probably do a pretty good job, because the scheduling parts of the OS were written by people who know what they're doing.

Redis is a (mostly) single-threaded process, which means that an instance of the server will use a single CPU core.
The server process is mapped to a core by the operating system - that's one of the main tasks that an OS is in charge of. To reiterate, assigning resources, including CPU, is an OS decision and a very complex one at that (i.e. try reading the code of the kernel's scheduler ;)).
If I have 8 redis instances running on a Debian/Ubuntu machine with 8 cores, all of them would map to a core each.
Perhaps, that's up to the OS' discretion. There is no guarantee that every instance will get a unique core, and it is possible that one core may be used by several instances.
1) What happens if I scale this machine down to 4 cores?
Scaling down like this means a restart. Once the Redis servers are restarted, the OS will assign them with the available cores.
2) Do the changes happen automatically (by default), or is some explicit configuration involved?
There are no changes involved - every process, Redis or not, gets a core. Cores are shared between processes, with the OS orchestrating the entire thing.
3) Is there any way to control the behavior? If so, to what extent?
Yes, most operating systems provide interfaces for controlling the allocation of resources. Specifically, the taskset Linux command can be used to set or get a process's CPU affinity.
Note: you should leave CPU affinity setting to the OS - it is supposed to be quite good at that. Instead, make sure that you provision your server correctly for the load.

Related

VB.NET multi threading and system architecture

I believe I have a reasonable understanding of threading from an Object Oriented perspective using the Thread class and the Runnable interface. In one of my applications there is a "download" button that allows the user to run a task in the background that takes about half an hour whilst continuing to use the VB.NET application.
However, I do not understand how Threading maps to the physical architecture of a computer. If you have a single threaded application that runs on a PC with a quadcore processor then does a .net program use all four processors?
If you have a multi threaded application (say four threads) on a quadcore processor then does each thread execute on different cores?
Do you have any control of this as a developer?
I have referenced a book I read at university called Operating System Concepts, but I have not found a specific answer.
If you have a single threaded application that runs on a PC with a quadcore processor then does a .net program use all four processors?
No, it can’t, at least not simultaneously. However, it’s theoretically possible that the operating system’s scheduler first executes your thread on one processor and later moves it to another processor. Such a scheduler is necessary to allow simultaneously running more applications / threads than there are physical processors present: execution of a thread is sliced into small pieces, which are fed to the processor(s) one after the other. Each threads gets some time slice allocated during which it can calculate before usage of the CPU switches to another thread.
Do you have any control of this as a developer?
Not directly. What you can control is the priority of your thread to make it more important to the task scheduler.
On a more general note, you should not use threads in your use-case – at least not directly. Threads are actually pretty low-level primitives. For your specific use-case, there’s a component called BackgroundWorker which abstracts many of the low-level details of thread management for you.
If you have a multi threaded application (say four threads) on a quadcore processor then does each thread execute on different cores?
Not necessarily; again, the application has next to no control over how exactly its threads are executed; the operating system however tries really hard to schedule threads “smartly”. This means that in practice, if your application has several busy threads, they are spread out as evenly as possible across the available cores. In particular, if there are no more threads than cores then each thread gets its own core.
Generally, you do not need to worry about mapping to physical architecture, .NET and the OS will do their best to maximize efficiency of your application. Just make it multi-threaded, even at the cost of being slower on a single threaded computer. You can, however, limit your maximum number of threads (if your app theoretically scales to infinity), to a number of cores, or double that. Test performance for each case, and decide which maximum works best for you.
Sometimes setting a core # can make your app's performance even worse. For example, if core #1 is currently scanning your PC for viruses, and your antivirus is single threaded. With the above scenario, assuming a quad-core PC, you do NOT want to run your 4-threaded app on a 1-per-core basis.
Having said that, if you really want to run a thread on specific core, it is possible - see this:
How to start a Thread at Specific Core
How Can I Set Processor Affinity in .NET?
Also check this out:
How to control which core a process runs on?

How to ensure multiple redis instances running on different cores?

I've a 4-core server and I want to run redis on it. To fully utilize the capabilities of the 4 cores, it is expected to launch 4 redis instances, since redis is designed to be single-threaded.
However, I'm curious how to ensure that the 4 instances are exactly running on 4 different cores? How can an instance decide the core on which it is running when it is launched?
Redis itself does not provide such guarantee.
If you launch 4 instances, there will be 4 different processes that the operating system will have to get scheduled on the 4 cores. It is up to the OS to perform this load balancing, optimizing the performance of the system.
Now, if you really want to bind each instance to a specific core, modern OS usually provides tools to enforce the execution of a process on a specific CPU core.
For instance, on Linux, you can have a look at the taskset and the numactl commands.
In practice, you need to be careful with this, because once you launch Redis on a specific core (setting a CPU mask), all the threads and child processes will inherit from this CPU mask. So when Redis will try to trigger a background save operation, or a background AOF rewrite, it will seriously impact the performance of the Redis instance. This is due to the fact the main Redis thread will have share the CPU core with the background operation (which is typically CPU consuming).
If you really want to play with CPU binding (but is it really a good idea?), you need to bind N Redis instances to N+1 CPU cores, keeping one core free for the background operations, and make sure at most one background operation can run at the same time for these instances.

Are processes executed on operating system

I saw this question somewhere
Four processes p1, p2, p3, p4 - each have sizes 1GB, 1.2GB, 2GB, 1GB. And each processes is executed as a time sharing fashion. Will they be executed on an operating system.
I think the answer should be No,they are not executed on operating system because OS is itself a process and it will be running in parallel to these processes.There will be switching between processes from time to time with the help of dispatcher.
but i am getting the doubt that answer can also be yes because it uses every process uses memory which is managed by theb operating system .
please help me figure out the right answer for the question..
This depends entirely on the OS in question.
As well as starting processes (and possibly consisting of processes), an operating system generally provides services to the processes that run on it, such as memory management, file systems, communications and so on.
In that context, these processes can be said to be running on top of the OS. In other words, processes generally are of little use unless they communicate outside of themselves.
In any event, the dispatcher (or scheduler) tends to be an integral part of the OS so having your processes scheduled means that you're running on top of that OS.
Modern operating systems also provide paging of memory as well which means that you can use a lot more virtual memory than there is physical memory - the OS is then responsible for handling requests for memory that has been paged out.
If two processes co exist, they have their own share of memory. What we assume operating system does is scheduling. Operating system may ask one of the process to stop and another to begin

Redis - Default blocking VM

The blocking VM performance is better overall, as there is no time lost in
synchronization, spawning of threads, and resuming blocked
clients waiting for values. So if you are willing to accept an higher
latency from time to time, blocking VM can be a good pick. Especially
if swapping happens rarely and most of your often accessed data
happens to fit in your memory.
This is default mode of Redis (and the only mode going forward I believe now VM is deprecated in 2.6), leaving the OS to handle paging (if/when required). I am correct in my understanding that it will take some time to get "hot" when booted/started. When working on a 1gb RAM node with a 16gb dataset, does Redis attempt to load it all into virtual memory at boot and thus 90%+ is immediately paged out, and only after some good amount of usages does the above statement hold true?
Redis VM was already deprecated in Redis 2.4, and has been removed in Redis 2.6. It is a dead end: don't use it.
I think you are confusing the blocking VM with OS paging. They are two different things.
OS paging is the default mode of Redis when Redis VM is not configured at all (whatever the blocking mode). The OS will swap Redis memory if it does not fit in physical memory. The event loop can be frozen at any time. When it happens, performance is abysmal because none of the Redis internal data structures is designed for this (no locality, no paging system).
Redis VM can be configured in non blocking mode (using I/O threads). When I/Os are done, the event loop is not blocked, and Redis is still responsive. However, when too many I/Os pile up, the I/O threads will be completely busy, and you end up with a responsive Redis, but unable to process any queries requiring I/Os.
Redis VM can also be configured in blocking mode. In this mode all I/Os are synchronously performed in the main event loop thread. So the event loop is frozen in case of I/O (for instance in case of a key miss). All clients are impacted. However, general performance (CPU consumption and latency) is better than with the non blocking mode because some threading scheduling/synchronization is saved.
In practice, the difference between OS paging and the Redis blocking VM is the granularity level. With Redis VM, the granularity is the key. With OS paging, well it is the page (a 4 KB block which can span on several unrelated keys).
In all 3 cases, the initial load of the dump file will be extremely slow and generate a peak of random I/Os on your system. As you pointed out, most objects will be loaded and then swapped out. The warm-up time will be significant.
Except if you have extreme locality in your data, or if you do not care at all about the latencies, using 1 GB RAM for a 16 GB dataset with the Redis VM is science-fiction IMO.
There is a reason why the Redis VM was phased out. By design, it will never perform as well as a disk-based datastore (which can exploit file mapping or direct I/Os to avoid the double buffering, and use adapted data structures like B-trees).
Redis as an in-memory store is excellent. But if you need to store something which is bigger than RAM, don't use it. Other (disk-based) stores will all perform much better.

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a farily massively parallel application, running on a hyperthreaded-dual-quad-core-Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is on the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest as idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one, global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the evented-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service object (say for each cpu core), each run by a single thread, you will not have this problem. See the http server example 2 on the boost ASIO page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.