Redis - Default blocking VM - redis

The blocking VM performance is better overall, as there is no time lost in
synchronization, spawning of threads, and resuming blocked
clients waiting for values. So if you are willing to accept an higher
latency from time to time, blocking VM can be a good pick. Especially
if swapping happens rarely and most of your often accessed data
happens to fit in your memory.
This is the default mode of Redis (and, I believe, the only mode going forward now that VM is deprecated in 2.6), leaving the OS to handle paging (if/when required). Am I correct in my understanding that it will take some time to get "hot" after it is booted/started? When working on a 1 GB RAM node with a 16 GB dataset, does Redis attempt to load it all into virtual memory at boot, so that 90%+ is immediately paged out, and only after a good amount of usage does the above statement hold true?

Redis VM was already deprecated in Redis 2.4, and has been removed in Redis 2.6. It is a dead end: don't use it.
I think you are confusing the blocking VM with OS paging. They are two different things.
OS paging is the default mode of Redis when Redis VM is not configured at all (whatever the blocking mode). The OS will swap Redis memory if it does not fit in physical memory. The event loop can be frozen at any time. When it happens, performance is abysmal because none of the Redis internal data structures is designed for this (no locality, no paging system).
Redis VM can be configured in non blocking mode (using I/O threads). When I/Os are done, the event loop is not blocked, and Redis is still responsive. However, when too many I/Os pile up, the I/O threads will be completely busy, and you end up with a responsive Redis, but unable to process any queries requiring I/Os.
Redis VM can also be configured in blocking mode. In this mode all I/Os are synchronously performed in the main event loop thread. So the event loop is frozen in case of I/O (for instance in case of a key miss). All clients are impacted. However, general performance (CPU consumption and latency) is better than with the non blocking mode because some threading scheduling/synchronization is saved.
In practice, the difference between OS paging and the Redis blocking VM is the granularity level. With Redis VM, the granularity is the key. With OS paging, it is the page (a 4 KB block which can span several unrelated keys).
In all 3 cases, the initial load of the dump file will be extremely slow and generate a peak of random I/Os on your system. As you pointed out, most objects will be loaded and then swapped out. The warm-up time will be significant.
Except if you have extreme locality in your data, or if you do not care at all about the latencies, using 1 GB RAM for a 16 GB dataset with the Redis VM is science-fiction IMO.
There is a reason why the Redis VM was phased out. By design, it will never perform as well as a disk-based datastore (which can exploit file mapping or direct I/Os to avoid the double buffering, and use adapted data structures like B-trees).
Redis as an in-memory store is excellent. But if you need to store something which is bigger than RAM, don't use it. Other (disk-based) stores will all perform much better.


Scheduling on multiple cores with each list in each processor vs one list that all processes share

I have a question about how scheduling is done. I know that when a system has multiple CPUs, scheduling is usually done on a per-processor basis. Each processor runs its own scheduler, accessing a ready list of only those processes that are running on it.
So what would be the pros and cons when compared to an approach where there is a single ready list that all processors share?
For example, what issues are there when assigning processes to processors, and what issues might be caused if a process always lives on one processor? Are there any issues in terms of mutex locking of data structures and time spent waiting for the locks?
Generally there is one, giant problem when it comes to multi-core CPU systems - cache coherency.
What does cache coherency mean?
Access to main memory is slow. Depending on the CPU and the memory, an access that misses every cache typically costs on the order of a hundred or more CPU cycles - that's a whole lot of time the CPU is doing no useful work. It'd be significantly better if we minimized this time as much as possible, but the hardware required to do this is expensive, and typically must be in very close proximity to the CPU itself (we're talking within a few millimeters of the core).
This is where the cache comes in. The cache keeps a small subset of main memory in close proximity to the core, allowing accesses to this memory to be several orders of magnitude faster than main memory. For reading this is a simple process - if the memory is in the cache, read from cache, otherwise read from main memory.
Writing is a bit more tricky. Writing to the cache is fast, but now main memory still holds the original value. We can update that memory, but that takes a while, sometimes even longer than reading depending on the memory type and board layout. How do we minimize this as well?
The most common way to do so is with a write-back cache, which, when written to, will flush the data contained in the cache back to main memory at some later point when the CPU is idle or otherwise not doing something. Depending on the CPU architecture, this could be done during idle conditions, or interleaved with CPU instructions, or on a timer (this is up to the designer/fabricator of the CPU).
Why is this a problem?
In a single core system, there is only one path for reads and writes to take - they must go through the cache on their way to main memory, meaning the programs running on the CPU only see what they expect - if they read a value, modified it, then read it back, it would be changed.
In a multi-core system, however, there are multiple paths for data to take when going back to main memory, depending on the CPU that issued the read or write. This presents a problem with write-back caching, since that "later time" introduces a gap in which one CPU might read memory that hasn't yet been updated.
Imagine a dual core system. A job starts on CPU 0 and reads a memory block. Since the memory block isn't in CPU 0's cache, it's read from main memory. Later, the job writes to that memory. Since the cache is write-back, that write will be made to CPU 0's cache and flushed back to main memory later. If CPU 1 then attempts to read that same memory, CPU 1 will attempt to read from main memory again, since it isn't in the cache of CPU 1. But the modification from CPU 0 hasn't left CPU 0's cache yet, so the data you get back is not valid - your modification hasn't gone through yet. Your program could now break in subtle, unpredictable, and potentially devastating ways.
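The dual-core scenario above can be mimicked with a toy model. This is only a sketch under loud assumptions: the `Core` class, its `read`/`write`/`flush` methods, and the address `0x10` are made up for illustration, and the model deliberately has no coherency protocol at all, which is exactly what real multi-core hardware adds to prevent this.

```python
class Core:
    """Toy CPU core with a private write-back cache and NO coherency protocol.

    Hypothetical model for illustration only; real hardware keeps caches coherent.
    """

    def __init__(self, name, memory):
        self.name = name
        self.memory = memory  # shared main memory: address -> value
        self.cache = {}       # private cache: address -> value
        self.dirty = set()    # addresses written locally but not yet flushed

    def read(self, addr):
        # On a cache miss, fetch from main memory (which may be stale).
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        # Write-back: only the private cache is updated for now.
        self.cache[addr] = value
        self.dirty.add(addr)

    def flush(self):
        # "At some later point": copy dirty lines back to main memory.
        for addr in self.dirty:
            self.memory[addr] = self.cache[addr]
        self.dirty.clear()


memory = {0x10: 1}
cpu0, cpu1 = Core("CPU 0", memory), Core("CPU 1", memory)

cpu0.read(0x10)                # miss, loads 1 from main memory
cpu0.write(0x10, 2)            # new value lives only in CPU 0's cache
stale = cpu1.read(0x10)        # CPU 1 misses and reads the old value
cpu0.flush()                   # main memory finally sees the write
after_flush = memory[0x10]
still_stale = cpu1.read(0x10)  # CPU 1 now hits its own outdated copy

print(stale, after_flush, still_stale)
```

Note that in this model CPU 1 keeps seeing the old value even after the flush, because nothing ever invalidates its cached copy. That invalidation traffic is the bookkeeping cost discussed next.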
Because of this, the caches have to be kept synchronized. Application IDs, address monitoring, and other hardware mechanisms exist to synchronize the caches between multiple CPUs. All of these methods have one common problem - they all force the CPU to take time doing bookkeeping rather than actual, useful computations.
The best method of avoiding this is actually keeping processes on one processor as much as possible. If the process doesn't migrate between CPUs, you don't need to keep the caches synchronized, as the other CPUs won't be accessing that memory at the same time (unless the memory is shared between multiple processes, but we'll not go into that here).
Now we come to the issue of how to design our scheduler, and the three main problems there - avoiding process migration, maximizing CPU utilization, and scalability.
Single Queue Multiprocessor scheduling (SQMS)
Single Queue Multiprocessor schedulers are what you suggested - one queue containing available processes, and each core accesses the queue to get the next job to run. This is fairly simple to implement, but has a couple of major drawbacks - it can cause a whole lot of process migration, and does not scale well to larger systems with more cores.
Imagine a system with four cores and five jobs, each of which takes about the same amount of time to run, and each of which is rescheduled when completed. On the first run through, CPU 0 takes job A, CPU 1 takes B, CPU 2 takes C, and CPU 3 takes D, while E is left on the queue. Let's then say CPU 0 finishes job A, puts it on the back of the shared queue, and looks for another job to do. E is currently at the front of the queue, so CPU 0 takes E, and goes on. Now, CPU 1 finishes job B, puts B on the back of the queue, and looks for the next job. It now sees A, and starts running A. But since A was on CPU 0 before, CPU 1 now needs to sync its cache with CPU 0, resulting in lost time for both CPU 0 and CPU 1. In addition, if two CPUs both finish their operations at the same time, they both need to write to the shared list, which has to be done sequentially or the list will get corrupted (just like in multi-threading). This requires that one of the two CPUs wait for the other to finish its writes and sync its cache back to main memory, since the list is in shared memory! This problem gets worse and worse the more CPUs you add, resulting in major problems with large servers (where there can be 16 or even 32 CPU cores), and being completely unusable on supercomputers (some of which have upwards of 1000 cores).
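The four-core, five-job walk-through above can be replayed in a few lines. This is a rough sketch, not a real scheduler: `simulate_sqms` is a hypothetical helper, time advances in lock-step rounds, and the CPUs finish in index order within each round, which is a simplifying assumption.

```python
from collections import deque


def simulate_sqms(num_cpus, jobs, rounds):
    """Count job migrations with one shared ready queue (toy lock-step model)."""
    queue = deque(jobs)           # the single shared ready list
    running = [None] * num_cpus   # job currently on each CPU
    last_cpu = {}                 # job -> CPU it last ran on
    migrations = 0

    for _ in range(rounds):
        for cpu in range(num_cpus):
            # The finished job goes to the back of the shared queue.
            if running[cpu] is not None:
                queue.append(running[cpu])
                running[cpu] = None
            if not queue:
                continue          # nothing to run, this CPU idles
            job = queue.popleft()
            # A job resuming on a different CPU has to warm up a new cache.
            if job in last_cpu and last_cpu[job] != cpu:
                migrations += 1
            last_cpu[job] = cpu
            running[cpu] = job
    return migrations


for rounds in (1, 2, 3):
    print(rounds, simulate_sqms(4, list("ABCDE"), rounds))
```

In this model every job that has run before ends up on a different CPU than last time, so the migration count keeps growing with each round. The lock contention on the shared queue is not modelled here at all.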
Multi-queue Multiprocessor Scheduling (MQMS)
Multi-queue multiprocessor schedulers have a single queue per CPU core, ensuring that all local core scheduling can be done without having to take a shared lock or synchronize the cache. This allows for systems with hundreds of cores to operate without interfering with one another at every scheduling interval, which can happen hundreds of times a second.
The main issue with MQMS comes from CPU Utilization, where one or more CPU cores is doing the majority of the work, and scheduling fairness, where one of the processes on the computer is being scheduled more often than any other process with the same priority.
CPU utilization is the biggest issue - no CPU should ever be idle if a job is scheduled. Suppose all CPUs are busy, so we schedule a new job to a random CPU, and then a different CPU becomes idle: that idle CPU should "steal" the scheduled job from the original CPU to ensure every CPU is doing real work. Doing so, however, requires that we lock both CPU cores and potentially sync the cache, which may degrade any speedup we could get by stealing the scheduled job.
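The stealing step can be sketched as follows. The function name `steal_if_idle`, the "steal from the longest queue" policy, and the "take from the tail" choice are all assumptions made for illustration; real schedulers use their own policies, and would need to lock both run queues around the move.

```python
from collections import deque


def steal_if_idle(queues):
    """Let each idle CPU steal one job from the currently longest queue.

    Returns a list of (thief_cpu, victim_cpu, job) tuples. Toy model: a real
    scheduler would hold locks on both queues while moving the job.
    """
    steals = []
    for thief, own_queue in enumerate(queues):
        if own_queue:
            continue  # this CPU has work, nothing to steal
        victim = max(range(len(queues)), key=lambda i: len(queues[i]))
        # Only steal if the victim keeps at least one job for itself.
        if len(queues[victim]) > 1:
            job = queues[victim].pop()  # take the most recently queued job
            own_queue.append(job)
            steals.append((thief, victim, job))
    return steals


# CPU 0 is overloaded, CPUs 1 and 3 are idle.
queues = [deque("ABC"), deque(), deque("D"), deque()]
steals = steal_if_idle(queues)
print(steals)
print([list(q) for q in queues])
```

Each steal is also a migration, so the stolen job pays the cache cost described earlier. That is the trade-off: better utilization in exchange for some lost locality.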
In conclusion
Both methods exist in the wild - Linux actually has three different mainstream scheduler algorithms, one of which is an SQMS. The choice of scheduler really depends on the way the scheduler is implemented, the hardware you plan to run it on, and the types of jobs you intend to run. If you know you only have two or four cores to run jobs, SQMS is likely perfectly adequate. If you're running a supercomputer where overhead is a major concern, then an MQMS might be the way to go. For a desktop user - just trust the distro, whether that's a Linux OS, Mac, or Windows. Generally, the programmers for the operating system you've got have done their homework on exactly what scheduler will be the best option for the typical use case of their system.
This whitepaper describes the differences between the two types of scheduling algorithms in place.

Why use etcd? Can I use Redis to implement configuration management/service discovery etc.?

I studied etcd for a few hours, and a question suddenly occurred to me. I found that Redis is fully capable of covering the functions that etcd provides, like key/value CRUD and watch, and Redis is very simple to use. Why do people choose etcd instead of Redis?
I googled a few posts, but no post told me the reason.
Redis stores data in memory, which makes it very high performance but not very durable. If the redis server dies, it's easy to lose data. Etcd stores data in files on disc, and performs fsync across multiple nodes before resolving to guarantee consistency, which makes it very durable but not very performant.
That's a good trade-off for kubernetes, which is using etcd for cluster state and configuration, not user data. It would not be a good trade-off for something like user session data which you might be using redis for in your app because you need extremely fast response times and can tolerate a bit of data loss or inconsistency.
A major difference which is affecting my choice of one vs the other is:
etcd keeps the data index in RAM and the data store on disk
redis keeps both data index and data store in RAM
Theoretically, this means etcd ought to be a good fit for large data / small memory scenarios, where redis would require large RAM.
In practice, etcd's current behaviour is that it allocates some memory per transaction when data is accessed. Under heavy load, the memory footprint of the etcd server balloons unboundedly (appears limited by the rate of read requests), and the Go runtime eventually OOM's, killing the server.
In contrast, the redis design requires a virtual address space sized in relation to the dataset, or to the partition of the dataset stored locally.
Memory footprint examples
Eg, with redis, an 8GB dataset partition with an index size of 0.5GB requires 8.5GB of virtual address space (ie, could be handled with 1GB of RAM and 7.5GB of swap), but not less, and the requirement has an upper bound.
The same 8GB dataset, with etcd, would require only 0.5GB of virtual address space, but not less (ie, could be handled with 500MB of RAM and no swap), in theory. In practice, under high load, etcd's memory use is unbounded.
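The arithmetic behind these two examples is simple enough to write down. The helper names below are hypothetical, and the model is only the simplified one used in this answer (redis needs index plus data in its address space, etcd in theory needs only the index); it ignores the unbounded per-request allocations mentioned above.

```python
def redis_address_space_gb(dataset_gb, index_gb):
    # Simplified model from this answer: index and data both live in memory.
    return dataset_gb + index_gb


def etcd_address_space_gb(dataset_gb, index_gb):
    # Theoretical model only: index in memory, data on disk.
    # Ignores etcd's per-request allocations under load.
    return index_gb


def swap_needed_gb(address_space_gb, ram_gb):
    # Whatever does not fit in RAM has to be backed by swap.
    return max(0.0, address_space_gb - ram_gb)


redis_total = redis_address_space_gb(8.0, 0.5)
etcd_total = etcd_address_space_gb(8.0, 0.5)

print(redis_total, swap_needed_gb(redis_total, ram_gb=1.0))
print(etcd_total, swap_needed_gb(etcd_total, ram_gb=0.5))
```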
Other considerations
There are other considerations like data consistency, or supported languages, that have to be evaluated separately.
In my case, the language the server is written in is a factor, as I have in-house C expertise, but no Go expertise. This means I can maintain/diagnose/customize redis (written in C) in-house if needed, but cannot do the same with etcd (written in Go); I'd have to use it as released by the maintainers.
My conclusion
Unfortunately, the memory behaviour of etcd, whereby it needs to allocate memory to access the indexed data, negates the memory advantages it might have theoretically, and the risk of a crash by OOM under high load makes it unsuitable in applications that might experience unexpected usage spikes. Github bug 14362, Github bug 14352, other OOM reports
Furthermore, the ability to customize the server in-house (ie, available C vs Go expertise) is a business consideration that weighs in redis's favour, in my case.

Redis-server using all RAM at startup

I'm using Redis and noticed that it crashes with the following error:
MISCONF Redis is configured to save RDB snapshots
I tried the solution suggested in this post
but everything seems to be OK in term of permissions and space.
The htop command tells me that Redis is consuming 70% of RAM. I tried to stop and restart Redis in order to flush it, but at startup the amount of RAM used by Redis grew dramatically and stopped at around 66%. I'm pretty sure that at this moment no process was using any Redis instance!
What is happening here?
The growing RAM usage is expected behaviour for Redis on the first data load, after restarts, and while writing the data to disk (the snapshot process). Redis tends to allocate as much memory as it can unless you set the "maxmemory" option in your conf file.
It allocates memory but does not release it immediately. Sometimes it takes hours; I have seen such cases.
A well-known fact about Redis is that it can allocate up to twice the size of the dataset it keeps.
I suggest you wait a couple of hours without any restart (Redis can keep working during this time: get/set operations etc.) and keep watching the memory.
Please check this too:
Redis will not always free up (return) memory to the OS when keys are
removed. This is not something special about Redis, but it is how most
malloc() implementations work. For example if you fill an instance
with 5GB worth of data, and then remove the equivalent of 2GB of data,
the Resident Set Size (also known as the RSS, which is the number of
memory pages consumed by the process) will probably still be around
5GB, even if Redis will claim that the user memory is around 3GB. This
happens because the underlying allocator can't easily release the
memory. For example often most of the removed keys were allocated in
the same pages as the other keys that still exist.
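To put a number on the example in that quote, you can compare the `used_memory` and `used_memory_rss` fields that Redis reports in `INFO memory`. The sketch below parses a hand-written sample rather than talking to a server: the sample text and the `parse_info_memory` helper are made up for illustration, with values chosen to match the 5GB/3GB case above.

```python
def parse_info_memory(text):
    """Parse 'key:value' lines as found in the output of INFO memory."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


GIB = 1024 ** 3

# Hypothetical sample: 3 GiB of live data, 5 GiB still resident.
sample = f"# Memory\nused_memory:{3 * GIB}\nused_memory_rss:{5 * GIB}\n"

info = parse_info_memory(sample)
ratio = int(info["used_memory_rss"]) / int(info["used_memory"])
print(f"RSS is {ratio:.2f}x the memory Redis accounts for")
```

A ratio well above 1 after deleting many keys is consistent with what the quote describes: the allocator is holding on to pages that Redis itself no longer counts as used.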

What would be a good hardware configuration for a Redis dedicated server?

I am planning to configure Redis in Master/Slave configuration.
I have three machines (8GB RAM, 8 cores) and am planning to use one master and two slaves.
What would be the recommended hardware configuration for these machines?
Redis is not CPU intensive, so you should get at least 2 cores per server (one for redis, one for backups, maybe one more to do basic stuff on the server?), more is not really relevant. Redis is single-threaded.
Get as much RAM as you can, as it defines the size of your store. Making a dump also consumes RAM, so your usable space is less than you might think. Monitor your RAM usage to prevent surprises.
As for RAM type: if RAM fails, redis fails, and sometimes silently (consistency broken). If you need to be careful with your data, always use ECC RAM. It is expensive, but maybe less expensive than broken data in RAM accessed through redis causing unknown effects. Redis has no known checks against hardware errors from RAM; even though such errors are quite rare (less likely than a broken hard drive), they do happen.

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded-dual-quad-core-Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is in the single digits of percent. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest as idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
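For anyone checking the 6.25% figure, it follows from the machine layout: two sockets, four cores each, two hardware threads per core. The variable names below are just for illustration.

```python
sockets = 2            # dual
cores_per_socket = 4   # quad-core
threads_per_core = 2   # hyper-threading

logical_cpus = sockets * cores_per_socket * threads_per_core
one_cpu_share = 100 / logical_cpus  # percent of total CPU time

print(logical_cpus, one_cpu_share)
```

So the observed total of roughly 6-8% busy (user plus system) is about what a single saturated logical CPU would show in top.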
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
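For reference, the sharding idea looks roughly like this. The real application is C++; this is a Python sketch of the same pattern, and the `ShardedTable` class and its method names are hypothetical.

```python
import threading


class ShardedTable:
    """Look-up table split into independently locked shards (sketch)."""

    def __init__(self, num_shards=256):
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._tables = [{} for _ in range(num_shards)]

    def _shard(self, key):
        # Keys hashing to different shards never contend on the same lock.
        return hash(key) % len(self._locks)

    def put(self, key, value):
        i = self._shard(key)
        with self._locks[i]:
            self._tables[i][key] = value

    def get(self, key, default=None):
        i = self._shard(key)
        with self._locks[i]:
            return self._tables[i].get(key, default)


table = ShardedTable()
table.put(1, "first")
table.put(257, "second")  # lands in the same shard as key 1 (257 % 256 == 1)
table.put(2, "third")     # different shard, different lock
print(table.get(1), table.get(257), table.get(2), table.get(3))
```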
All threads go through one global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service objects (say, one for each CPU core), each run by a single thread, you will not have this problem. See the http server example 2 on the boost ASIO page.
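The shape of that design, one private event loop per thread with nothing shared between loops, can be sketched in Python with asyncio. To be clear, this is an analogy and not Boost.Asio code: `run_loop`, the work split, and the `asyncio.sleep(0)` stand-in for I/O are all assumptions for illustration, and CPython's GIL means this shows the structure only, not the multi-core scaling you would get in C++.

```python
import asyncio
import threading


def run_loop(worker_id, items, results):
    """Run a private event loop in this thread over this worker's items."""

    async def handle_all():
        for item in items:
            await asyncio.sleep(0)  # stand-in for an asynchronous I/O wait
            results[worker_id].append(item * 2)

    # asyncio.run creates a fresh event loop owned by the calling thread,
    # so no loop (and no loop-internal lock) is shared between workers.
    asyncio.run(handle_all())


num_workers = 4
work = list(range(20))
results = [[] for _ in range(num_workers)]

threads = [
    threading.Thread(target=run_loop, args=(i, work[i::num_workers], results))
    for i in range(num_workers)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

The point of the pattern is that each connection is pinned to one loop, so handlers for the same connection never run concurrently and the loops never contend with one another.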
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.