How does Redis achieve high throughput and performance?

I know this is a very generic question. But I wanted to understand the major architectural decisions that allow Redis (or caches like Memcached, Cassandra) to work at amazing performance limits.
How are connections maintained?
Are connections TCP or HTTP?
I know that it is completely written in C. How is the memory managed?
What are the synchronization techniques used to achieve high throughput in spite of competing reads/writes?
Basically, what is the difference between a plain vanilla implementation of a machine with in memory cache and server that can respond to commands and a Redis box? I also understand that the answer needs to be very huge and should include very complex details for completion. But, what I'm looking for are some general techniques used rather than all nuances.

There is a wealth of information in the Redis documentation to understand how it works. Now, to answer your questions specifically:
1) How are connections maintained?
Connections are maintained and managed using the ae event loop (designed by the Redis author). All network I/O operations are non-blocking. You can see ae as a minimalistic implementation using the best network I/O demultiplexing mechanism of the platform (epoll for Linux, kqueue for BSD, etc.), just like libevent, libev, libuv, etc.
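To make the event-loop idea concrete, here is a minimal sketch in Python rather than C. It is not Redis's ae code; it only illustrates the same pattern using the standard `selectors` module, which picks epoll on Linux and kqueue on BSD. The helper names (`make_echo_server`, `run_loop`, `demo`) are made up for this illustration.

```python
import selectors
import socket


def make_echo_server(sel):
    """Register a non-blocking listening socket on a free local port."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose a port
    listener.listen()
    listener.setblocking(False)

    def on_client_readable(conn):
        data = conn.recv(4096)
        if data:
            # Sketch only: a real server buffers output and handles partial writes.
            conn.send(data)
        else:
            sel.unregister(conn)
            conn.close()

    def on_accept(sock):
        conn, _addr = sock.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, on_client_readable)

    sel.register(listener, selectors.EVENT_READ, on_accept)
    return listener


def run_loop(sel, done):
    """Single-threaded loop: wait for ready sockets, run one callback at a time."""
    while not done():
        for key, _mask in sel.select(timeout=1.0):
            key.data(key.fileobj)


def demo():
    """Send one message through the echo server and return the reply."""
    sel = selectors.DefaultSelector()
    listener = make_echo_server(sel)
    client = socket.create_connection(listener.getsockname())
    client.sendall(b"PING\r\n")
    replies = []
    sel.register(client, selectors.EVENT_READ,
                 lambda c: replies.append(c.recv(4096)))
    run_loop(sel, lambda: bool(replies))
    for key in list(sel.get_map().values()):  # close every registered socket
        sel.unregister(key.fileobj)
        key.fileobj.close()
    sel.close()
    return replies[0]
```

Every callback runs to completion on the one thread before the next is dispatched, so the handlers never need locks around shared state.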
2) Are connections TCP or HTTP?
Connections are TCP using the Redis protocol, which is a simple telnet compatible, text oriented protocol supporting binary data. This protocol is typically more efficient than HTTP.
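As a rough illustration of why the protocol is both text-oriented and binary-safe: a client request is an array of length-prefixed "bulk strings", each terminated by CRLF. The helper below (`encode_command` is a name invented for this sketch, not part of any client library) builds such a request by hand.

```python
def encode_command(*args):
    """Encode a command as a Redis protocol array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]          # array header: number of arguments
    for arg in args:
        if isinstance(arg, str):
            arg = arg.encode("utf-8")
        # Each argument carries its byte length, so the payload may hold any bytes.
        parts.append(b"$%d\r\n%s\r\n" % (len(arg), arg))
    return b"".join(parts)


request = encode_command("SET", "key", b"\x00\xff")
```

Because every argument is length-prefixed, the server never has to scan the payload for delimiters, which keeps parsing cheap.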
3) How is the memory managed?
Memory is managed by relying on a general purpose memory allocator. On some platforms, this is actually the system memory allocator. On some other platforms (including Linux), jemalloc has been selected since it offers a good balance between CPU consumption, concurrency support, fragmentation and memory footprint. jemalloc source code is part of the Redis distribution.
Contrary to other products (such as memcached), there is no implementation of a slab allocator in Redis.
A number of optimized data structures have been implemented on top of the general purpose allocator to reduce the memory footprint.
4) What are the synchronization techniques used to achieve high throughput in spite of competing read/writes?
Redis is a single-threaded event loop, so there is no synchronization to be done since all commands are serialized. Now, some threads also run in the background for internal purposes. In the rare cases they access the data managed by the main thread, classical pthread synchronization primitives are used (mutexes for instance). But 100% of the data accesses made on behalf of multiple client connections do not require any synchronization.
You can find more information there:
Redis is single-threaded, then how does it do concurrent I/O?
What is the difference between a plain vanilla implementation of a machine with in memory cache and server that can respond to commands and a Redis box?
There is no difference. Redis is a plain vanilla implementation of a machine with in memory cache and server that can respond to commands. But it is an implementation which is done right:
using the single threaded event loop model
using simple and minimalistic data structures optimized for their corresponding use cases
offering a set of commands carefully chosen to balance minimalism and usefulness
constantly targeting the best raw performance
well adapted to modern OS mechanisms
providing multiple persistence mechanisms because the "one size does fit all" approach is only a dream.
providing the building blocks for HA mechanisms (replication system for instance)
avoiding stacking up useless abstraction layers like pancakes
resulting in a clean and understandable code base that any good C developer can be comfortable with


Why use etcd? Can I use Redis to implement configuration management/service discovery etc.?

I studied etcd for a few hours, and a question suddenly occurred to me. I found that Redis is fully capable of covering the functions that etcd provides, like key/value CRUD and watch, and Redis is very simple to use. Why do people choose etcd instead of Redis?
I googled a few posts, but no post told me the reason.
Redis stores data in memory, which makes it very high performance but not very durable. If the redis server dies, it's easy to lose data. Etcd stores data in files on disk, and performs fsync across multiple nodes before resolving to guarantee consistency, which makes it very durable but not very performant.
That's a good trade-off for kubernetes, which is using etcd for cluster state and configuration, not user data. It would not be a good trade-off for something like user session data which you might be using redis for in your app because you need extremely fast response times and can tolerate a bit of data loss or inconsistency.
A major difference which is affecting my choice of one vs the other is:
etcd keeps the data index in RAM and the data store on disk
redis keeps both data index and data store in RAM
Theoretically, this means etcd ought to be a good fit for large data / small memory scenarios, where redis would require large RAM.
In practice, etcd's current behaviour is that it allocates some memory per transaction when data is accessed. Under heavy load, the memory footprint of the etcd server balloons unboundedly (appears limited by the rate of read requests), and the Go runtime eventually OOM's, killing the server.
In contrast, the redis design requires a virtual address space sized in relation to the dataset, or to the partition of the dataset stored locally.
Memory footprint examples
E.g., with redis, an 8GB dataset partition with an index size of 0.5GB requires 8.5GB of virtual address space (i.e., it could be handled with 1GB of RAM and 7.5GB of swap), but not less, and the requirement has an upper bound.
The same 8GB dataset, with etcd, would in theory require only 0.5GB of virtual address space, but not less (i.e., it could be handled with 500MB of RAM and no swap). In practice, under high load, etcd's memory use is unbounded.
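The arithmetic behind these two examples can be written out as a small sketch. The function names are invented for illustration, and the etcd figure is the theoretical one only; it does not model the per-transaction allocations described above.

```python
def redis_address_space_gb(dataset_gb, index_gb):
    """Redis keeps both the data and its index in memory."""
    return dataset_gb + index_gb


def etcd_address_space_gb(dataset_gb, index_gb):
    """In theory etcd keeps only the index in memory; the data stays on disk."""
    return index_gb


def swap_needed_gb(address_space_gb, ram_gb):
    """Whatever does not fit in RAM has to be backed by swap."""
    return max(0.0, address_space_gb - ram_gb)


redis_total = redis_address_space_gb(8, 0.5)      # 8.5 GB
redis_swap = swap_needed_gb(redis_total, 1)       # 7.5 GB with 1 GB of RAM
etcd_total = etcd_address_space_gb(8, 0.5)        # 0.5 GB
etcd_swap = swap_needed_gb(etcd_total, 0.5)       # 0.0 GB with 500 MB of RAM
```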
Other considerations
There are other considerations like data consistency, or supported languages, that have to be evaluated separately.
In my case, the language the server is written in is a factor, as I have in-house C expertise, but no Go expertise. This means I can maintain/diagnose/customize redis (written in C) in-house if needed, but cannot do the same with etcd (written in Go); I'd have to use it as released by the maintainers.
My conclusion
Unfortunately, the memory behaviour of etcd, whereby it needs to allocate memory to access the indexed data, negates the memory advantages it might have theoretically, and the risk of a crash by OOM under high load makes it unsuitable for applications that might experience unexpected usage spikes. See GitHub bug 14362, GitHub bug 14352, and other OOM reports.
Furthermore, the ability to customize the server in-house (ie, available C vs Go expertise) is a business consideration that weighs in redis's favour, in my case.

Distributed Locking for Device

We have a distributed, clustered WebLogic setup.
Our use case is that whenever a device contacts our system, we need to compute parameters and provision them to the device. There can be concurrent requests from devices, and we can't reject any of them, so we are going with an asynchronous processing approach.
The problem we are facing is that whenever a device contacts us, we need to lock the source device as well as its neighbor devices in order to provision optimized parameters.
Since we have a clustered system, we require a distributed locking system that provides high performance.
Could you suggest any Java framework or approach for distributed locking that suits our requirements?
Typically, when you sense a need for distributed locking, that indicates a design flaw. Distributed locking is usually either slow or unsafe. It's slow when done correctly because strong consistency guarantees are required to ensure two processes can't hold the same lock at the same time, and unsafe when consistency constraints are relaxed in favor of performance gains.
Often you can find a better solution than distributed locking by doing something like consistent hashing to ensure related requests are handled by the same process. Similarly, leader election can be a more performant alternative to distributed locking if you can elect a leader and route related requests to the leader. But certainly there must be some cases where these solutions are not possible, and so I'd better answer your question...
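As a rough sketch of the consistent hashing idea (in Python for brevity, although the question is about Java), the ring below maps a key to one of several application nodes. The assumption, not stated in the question, is that a device and its neighbors can be grouped under one shared key such as a hypothetical `neighborhood-17`, so that all requests touching that group land on the same process and can be serialized there with an ordinary local lock. `HashRing` and the node names are invented for this example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, replicas=64):
        # Each node is placed at `replicas` pseudo-random points on the ring.
        ring = [(self._hash(f"{node}#{i}"), node)
                for node in nodes for i in range(replicas)]
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [node for _, node in ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        """Return the node owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._owners[idx % len(self._owners)]


ring = HashRing(["app-1", "app-2", "app-3"])
owner = ring.node_for("neighborhood-17")  # same key always yields the same node
```

When a node leaves the ring, only the keys it owned move to other nodes; the rest keep their owner, which limits how much in-flight work has to be re-routed.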
Assuming fault tolerance is a requirement, and considering the performance and safety concerns mentioned above, Hazelcast may be a good option for your use case. It's a fast embedded in-memory data grid that has a distributed Lock implementation. Often it's nice to use an embedded system like Hazelcast rather than relying on another cluster, but Hazelcast does have the potential for consistency issues in certain partition scenarios, and that could result in two processes acquiring a lock. TBH I've heard more than a few complaints about locks in Hazelcast, but no doubt others have had positive experiences.
Alternatively, ZooKeeper is likely the most common system for distributed locking in Java. However, ZooKeeper tends to be significantly slower for writes than reads since it's quorum-based - though it is relatively fast and very mature - and locking is a write-heavy workload. Also, in contrast to Hazelcast, one major downside to ZooKeeper is that it's a separate cluster and thus a dependency on another external system. I think ZooKeeper's stability and maturity make it worth a look.
There don't currently seem to be many proven projects in between Hazelcast (an embedded, eventually consistent framework) and ZooKeeper (a strongly consistent external service), which is why (disclaimer: self promotion incoming) I created Atomix to provide safe distributed locking and leader elections as an embedded system for Java. It's a decent option if you need a framework like Hazelcast that has the same (actually stronger) consistency guarantees as ZooKeeper.
If performance and scalability are paramount and you're expecting high rates of requests, you will likely have to sacrifice consistency and look at Hazelcast or something similar.
Alternatively, if fault tolerance is not a requirement (I don't think you specified that it is) you can even just use a Redis instance :-)

Can I use lpop/rpop to create a simple queue system with Redis?

I tried several message/job queue systems but they all seem to add unnecessary complexity and I always end up with the queue process dying for no reason and cryptic log messages.
So now I want to make my own queue system using Redis. How would you go about doing this?
From what I have read, Redis is good because it has lpop and rpush methods, and also a pub/sub system that could be used to notify the workers that there are new messages to be consumed. Is this correct?
Yes you can. In fact there are a number of packages which do exactly this, including Celery and RQ for Python, resque for Ruby, and ports of resque to Java (Jesque) and JavaScript (Coffee-resque).
There's also RestMQ which is implemented in Python, but designed for use with any ReSTful system.
There are MANY others.
Note that Redis LISTs are about the simplest possible network queuing system. However, making things robust over the simple primitives offered by Redis is non-trivial (and may be impossible for some values of "robust", at least on the server side). So many of these libraries for using Redis as a queue add features and protocols intended to minimize the chances of lost messages while ensuring "at-most-once" semantics. Many of these use the RPOPLPUSH Redis primitive with some other processing on the secondary LIST to handle acknowledgement of completed work and re-dispatch of "lost" units. (Consider the case where some client has "popped" a work unit off your queue and died before the work results were posted; how do you detect and mitigate that scenario?)
In some cases people have cooked up elaborate bits of server side (Redis Lua EVAL) scripting to handle more reliable queuing. For example implementing something like RPOPLPUSH but replacing the "push" with a ZADD (thus adding the item and a timestamp to a "sorted set" representing work that's "in progress"). In such systems the work is completed with a ZREM and scanned for "lost" work using ZRANGEBYSCORE.
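The pop-then-track pattern described above can be modelled without a Redis server. The sketch below is a pure-Python stand-in, not Redis client code: a `deque` plays the role of the LIST and a dict plays the role of the sorted set, with comments naming the Redis command each step corresponds to. The class name, method names and the `visibility_timeout` parameter are all invented for this illustration. In real Redis the pop and the ZADD would have to happen atomically, for example inside a Lua script.

```python
from collections import deque


class ReliableQueueModel:
    """In-memory model of a Redis queue with in-progress tracking."""

    def __init__(self, visibility_timeout):
        self.pending = deque()     # stands in for the Redis LIST
        self.in_progress = {}      # stands in for the sorted set: item -> claim time
        self.visibility_timeout = visibility_timeout

    def push(self, item):
        self.pending.appendleft(item)        # like LPUSH

    def claim(self, now):
        """Take the oldest item and record when it was claimed."""
        if not self.pending:
            return None
        item = self.pending.pop()            # like RPOP
        self.in_progress[item] = now         # like ZADD in_progress <now> <item>
        return item

    def ack(self, item):
        self.in_progress.pop(item, None)     # like ZREM once the work is done

    def requeue_lost(self, now):
        """Re-dispatch items claimed too long ago (like ZRANGEBYSCORE up to a cutoff)."""
        cutoff = now - self.visibility_timeout
        lost = [item for item, claimed in self.in_progress.items()
                if claimed <= cutoff]
        for item in lost:
            del self.in_progress[item]
            self.pending.appendleft(item)
        return lost
```

Note that a worker which is merely slow, not dead, will also have its item re-dispatched once the timeout passes, so handlers in a scheme like this should tolerate seeing the same item twice.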
Here are some thoughts on the topic of implementing a robust queuing system by Salvatore Sanfilippo (a.k.a. antirez, author of Redis): Adventures in message queues where he discusses the considerations and forces which led him to work on disque.
I'm sure you'll find some detractors who argue that Redis is a poor substitute for a "real" message bus and queuing system (such as RabbitMQ). Salvatore says as much in his blog entry, and I'd welcome others here to spell out cogent reasons for preferring such systems.
My advice is to start with Redis during your early prototyping; but to keep your use of the system abstracted into some consolidated bit of code. Celery, among others, actually does this for you. You can start using Celery with a Redis backend and readily replace the backend with RabbitMQ or others with little effect on the bulk of your code.
For a catalog of alternatives, consider perusing:

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded-dual-quad-core-Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is in the single digits of percent. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest as idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service objects (say, one for each CPU core), each run by a single thread, you will not have this problem. See the HTTP server example 2 on the boost ASIO page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.

Spread vs MPI vs zeromq?

In one of the answers to Broadcast like UDP with the Reliability of TCP, a user mentions the Spread messaging API. I've also run across one called ØMQ. I also have some familiarity with MPI.
So, my main question is: why would I choose one over the other? More specifically, why would I choose to use Spread or ØMQ when there are mature implementations of MPI to be had?
MPI was designed for tightly coupled compute clusters with fast, reliable networks. Spread and ØMQ are designed for large distributed systems. If you're designing a parallel scientific application, go with MPI, but if you are designing a persistent distributed system that needs to be resilient to faults and network instability, use one of the others.
MPI has very limited facilities for fault tolerance; the default error handling behavior in most implementations is a system-wide fail. Also, the semantics of MPI require that all messages sent eventually be consumed. This makes a lot of sense for simulations on a cluster, but not for a distributed application.
I have not used any of these libraries, but I may be able to give some hints.
MPI is a communication protocol while Spread and ØMQ are actual implementations.
MPI comes from "parallel" programming while Spread comes from "distributed" programming.
So, it really depends on whether you are trying to build a parallel system or a distributed system. They are related to each other, but the implied connotations/goals are different. Parallel programming deals with increasing computational power by using multiple computers simultaneously. Distributed programming deals with reliable (consistent, fault-tolerant and highly available) groups of computers.
The concept of "reliability" is slightly different from that of TCP. TCP's reliability is "give this packet to the end program no matter what." Distributed programming's reliability is "even if some machines die, the system as a whole continues to work in a consistent manner." To really guarantee that all participants got the message, one would need something like two-phase commit or one of its faster alternatives.
You're addressing very different APIs here, with different notions about the kind of services provided and infrastructure for each of them. I don't know enough about MPI and Spread to answer for them, but I can help a little more with ZeroMQ.
ZeroMQ is a simple messaging communication library. It does nothing else than send a message to different peers (including local ones) based on a restricted set of common messaging patterns (PUSH/PULL, REQUEST/REPLY, PUB/SUB, etc.). It handles client connection, retrieval, and basic congestion strictly based on those patterns and you have to do the rest yourself.
Although it appears very restricted, this simple behavior is mostly what you would need for the communication layer of your application. It lets you scale very quickly from a simple prototype, all in memory, to more complex distributed applications in various environments, using simple proxies and gateways between nodes. However, don't expect it to do node deployment, network discovery, or server monitoring; you will have to do it yourself.
Briefly, use ZeroMQ if you have an application that you want to scale from a simple multithreaded process to a distributed and variable environment, or if you want to experiment and prototype quickly and no existing solution seems to fit your model. Expect, however, to put some effort into the deployment and monitoring of your network if you want to scale to a very large cluster.