I was exploring redisson and decided to use for the ease of its simplicity when compared to Jedis, and few other good reviews i found over internet.
The environment on which i will be using redisson is Storm topologies.
It's not a good idea to create threads by application level code in a
Storm Topology
I dig deeper to some extent of redisson code which internally translates the commands to async and command executor and promise.
just want to confirm . Is redisson internally spawning threads to achieve this.
A follow up. Is Jedis also doing the same in its internal implementation.
Please consider pipeline implementation also in your answers
Redisson mentions in some perf documentation I found somewhere that Redisson v.x supports on the order 30k threads/sec. I read this as concurrent requests.
I imagine at least two of those are in separate threads
Not sure what you are asking, but perhaps this gives some insight
https://github.com/redisson/redisson/wiki/11.-Redis-commands-mapping
Related
Hi looking at the docs and apart where it's obviously stated that a feature is partitioned, does for example a regular Redisson map or AtomicLong work regardless of the Redis cluster modes?
Yes, all Redisson objects work in Redis cluster.
I tried several message/job queue systems but they all seem to add unnecessary complexity and I always end up with the queue process dying for no reason and cryptic log messages.
So now I want to make my own queue system using Redis. How would you go about doing this?
From what I have read, Redis is good because it has lpop and rpush methods, and also a pub/sub system that could be used to notify the workers that there are new messages to be consumed. Is this correct?
Yes you can. In fact there are a number of package which do exactly this ... including Celery and RQ for Python and resque for Ruby and ports of resque to Java (Jesque and Javascript (Coffee-resque).
There's also RestMQ which is implemented in Python, but designed for use with any ReSTful system.
There are MANY others.
Note that Redis LISTs are about the simplest possible network queuing system. However, making things robust over the simple primitives offered by Redis is non-trivial (and may be impossible for some values of "robust" --- at least on the server side). So many of these libraries for using Redis as a queue add features and protocols intended to minimize the chances of lost messages while ensuring "at-most-once" semantics. Many of these use the RPOPLPUSH Redis primitive with some other processing on the secondary LIST to handle acknowledgement of completed work and re-dispatch of "lost" units. (Consider the case where some client as "popped" a work unit off your queue and died before the work results were posted; how do you detect and mitigate for that scenario?)
In some cases people have cooked up elaborate bits of server side (Redis Lua EVAL) scripting to handle more reliable queuing. For example implementing something like RPOPLPUSH but replacing the "push" with a ZADD (thus adding the item and a timestamp to a "sorted set" representing work that's "in progress"). In such systems the work is completed with a ZREM and scanned for "lost" work using ZRANGEBYSCORE.
Here are some thoughts on the topic of implementing a robust queuing system by Salvatore Sanfilippo (a.k.a. antirez, author of Redis): Adventures in message queues where he discusses the considerations and forces which led him to work on disque.
I'm sure you'll find some detractors who argue that Redis is a poor substitute for a "real" message bus and queuing system (such as RabbitMQ). Salvatore says as much in his 'blog entry, and I'd welcome others here to spell out cogent reasons for preferring such systems.
My advice is to start with Redis during your early prototyping; but to keep your use of the system abstracted into some consolidated bit of code. Celery, among others, actually does this for you. You can start using Celery with a Redis backend and readily replace the backend with RabbitMQ or others with little effect on the bulk of your code.
For a catalog of alternatives, consider perusing: http://queues.io/
I know this is a very generic question. But, I wanted to understand what are the major architectural decision that allow Redis (or caches like MemCached, Cassandra) to work at amazing performance limits.
How are connections maintained?
Are connections TCP or HTTP?
I know that it is completely written in C. How is the memory managed?
What are the synchronization techniques used to achieve high throughput inspite
of competing read/writes?
Basically, what is the difference between a plain vanilla implementation of a machine with in memory cache and server that can respond to commands and a Redis box? I also understand that the answer needs to be very huge and should include very complex details for completion. But, what I'm looking for are some general techniques used rather than all nuances.
There is a wealth of of information in the Redis documentation to understand how it works. Now, to answer specifically your questions:
1) How are connections maintained?
Connections are maintained and managed using the ae event loop (designed by the Redis author). All network I/O operations are non blocking. You can see ae as a minimalistic implementation using the best network I/O demultiplexing mechanism of the platform (epoll for Linux, kqueue for BSD, etc ...) just like libevent, libev, libuv, etc ...
2) Are connections TCP or HTTP?
Connections are TCP using the Redis protocol, which is a simple telnet compatible, text oriented protocol supporting binary data. This protocol is typically more efficient than HTTP.
3) How is the memory managed?
Memory is managed by relying on a general purpose memory allocator. On some platforms, this is actually the system memory allocator. On some other platforms (including Linux), jemalloc has been selected since it offers a good balance between CPU consumption, concurrency support, fragmentation and memory footprint. jemalloc source code is part of the Redis distribution.
Contrary to other products (such as memcached), there is no implementation of a slab allocator in Redis.
A number of optimized data structures have been implemented on top of the general purpose allocator to reduce the memory footprint.
4) What are the synchronization techniques used to achieve high throughput inspite of competing read/writes?
Redis is a single-threaded event loop, so there is no synchronization to be done since all commands are serialized. Now, some threads also run in the background for internal purposes. In the rare cases they access the data managed by the main thread, classical pthread synchronization primitives are used (mutexes for instance). But 100% of the data accesses made on behalf of multiple client connections do not require any synchronization.
You can find more information there:
Redis is single-threaded, then how does it do concurrent I/O?
What is the difference between a plain vanilla implementation of a machine with in memory cache and server that can respond to commands and a Redis box?
There is no difference. Redis is a plain vanilla implementation of a machine with in memory cache and server that can respond to commands. But it is an implementation which is done right:
using the single threaded event loop model
using simple and minimalistic data structures optimized for their corresponding use cases
offering a set of commands carefully chosen to balance minimalism and usefulness
constantly targeting the best raw performance
well adapted to modern OS mechanisms
providing multiple persistence mechanisms because the "one size does fit all" approach is only a dream.
providing the building blocks for HA mechanisms (replication system for instance)
avoiding stacking up useless abstraction layers like pancakes
resulting in a clean and understandable code base that any good C developer can be comfortable with
I have a small cluster of servers I need to keep in sync. My initial thought on this was to have one server be the "master" and publish updates using redis's pub/sub functionality (since we are already using redis for storage) and letting the other servers in the cluster, the slaves, poll for updates in a long running task. This seemed to be a simple method to keep everything in sync, but then I thought of the obvious issue: What if my "master" goes down? That is where I started looking into techniques to make sure there is always a master, which led me to reading about ideas like leader election. Finally, I stumbled upon Apache Zookeeper (through python binding, "pettingzoo"), which apparently takes care of a lot of the fault tolerance logic for you. I may be able to write my own leader selection code, but I figure it wouldn't be close to as good as something that has been proven and tested, like Zookeeper.
My main issue with using zookeeper is that it is just another component that I may be adding to my setup unnecessarily when I could get by with something simpler. Has anyone ever used redis in this way? Or is there any other simple method I can use to get the type of functionality I am trying to achieve?
More info about pettingzoo (slideshare)
I'm afraid there is no simple method to achieve high-availability. This is usually tricky to setup and tricky to test. There are multiple ways to achieve HA, to be classified in two categories: physical clustering and logical clustering.
Physical clustering is about using hardware, network, and OS level mechanisms to achieve HA. On Linux, you can have a look at Pacemaker which is a full-fledged open-source solution coming with all enterprise distributions. If you want to directly embed clustering capabilities in your application (in C), you may want to check the Corosync cluster engine (also used by Pacemaker). If you plan to use commercial software, Veritas Cluster Server is a well established (but expensive) cross-platform HA solution.
Logical clustering is about using fancy distributed algorithms (like leader election, PAXOS, etc ...) to achieve HA without relying on specific low level mechanisms. This is what things like Zookeeper provide.
Zookeeper is a consistent, ordered, hierarchical store built on top of the ZAB protocol (quite similar to PAXOS). It is quite robust and can be used to implement some HA facilities, but it is not trivial, and you need to install the JVM on all nodes. For good examples, you may have a look at some recipes and the excellent Curator library from Netflix. These days, Zookeeper is used well beyond the pure Hadoop contexts, and IMO, this is the best solution to build a HA logical infrastructure.
Redis pub/sub mechanism is not reliable enough to implement a logical cluster, because unread messages will be lost (there is no queuing of items with pub/sub). To achieve HA of a collection of Redis instances, you can try Redis Sentinel, but it does not extend to your own software.
If you are ready to program in C, a HA framework which is often forgotten (but can be quite useful IMO) is the one coming with BerkeleyDB. It is quite basic but support off-the-shelf leader elections, and can be integrated in any environment. Documentation can be found here and here. Note: you do not have to store your data with BerkeleyDB to benefit from the HA mechanism (only the topology data - the same ones you would put in Zookeeper).
I am looking for a message queue which would replicate messages across a cluster of servers. I am aware that this will cause a performance hit, but that's what the requirements are - message persistence is very important.
The replication can be asynchronous, but it should be there - if there's a large backlog of messages waiting for processing, they shouldn't be lost.
So far I didn't manage to find anything from the well-known MQs. HornetQ for example supported message replication in 2.0 but in 2.2 it seems to be removed. RabbitMQ doesn't replicate messages at all, etc.
Is there anything out there that could meet my requirements?
There are at least three ways of tackling this that come to mind, depending upon how robust you need the solution to be.
One: pick any messaging tech, then replicate your disk-storage. Using something like DRBD you can have the file-backed storage copied to another machine under the covers. If your primary box dies, you should be able to restart on your second machine from the replicated files.
Two: Keep looking. There are various commercial systems that definitely do this, two such (no financial benefit on my part) are Informatica Ultra Messaging (formerly 29West) and Solace. These are commonly used in the financial community.
Three: build your own. ZeroMQ is one such toolkit that you could use to roll-your-own system from pre-built messaging blocks. Even a system that does not officially support it could fairly easily be configured to publish all messages to two queues. Your reader would have to drain both somehow, so this may well be a non-starter, but possible in any case.
Overall: do test your performance assumptions, as all of these will have various performance implications in various scenarios.
Amazon SQS is designed with this very thing in mind, but because of the consistency model (which is a part of messaging anyway), you're responsible for de-duplicating messages on the consumer side. Granted, SQS maybe somewhat slow and the costs can add up for lots of messages, but if you want to guarantee that no messages are lost, then it's a pretty solid way to go.
new Kafka 0.8.1 offers replication!