Can I use lpop/rpop to create a simple queue system with Redis? - redis

I tried several message/job queue systems but they all seem to add unnecessary complexity and I always end up with the queue process dying for no reason and cryptic log messages.
So now I want to make my own queue system using Redis. How would you go about doing this?
From what I have read, Redis is good because it has lpop and rpush methods, and also a pub/sub system that could be used to notify the workers that there are new messages to be consumed. Is this correct?

Yes you can. In fact there are a number of packages which do exactly this, including Celery and RQ for Python, resque for Ruby, and ports of resque to Java (Jesque) and JavaScript (Coffee-resque).
There's also RestMQ, which is implemented in Python but designed for use with any RESTful system.
There are MANY others.
Note that Redis LISTs are about the simplest possible network queuing system. However, making things robust on top of the simple primitives offered by Redis is non-trivial (and may be impossible for some values of "robust" --- at least on the server side). So many of these libraries for using Redis as a queue add features and protocols intended to minimize the chances of lost messages while providing "at-least-once" semantics. Many of these use the RPOPLPUSH Redis primitive, with some additional processing on the secondary LIST to handle acknowledgement of completed work and re-dispatch of "lost" units. (Consider the case where some client has "popped" a work unit off your queue and died before the work results were posted; how do you detect and mitigate that scenario?)
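Here is a minimal sketch of that RPOPLPUSH pattern using the redis-py client; the key names, the single shared "processing" LIST, and the reaper idea are illustrative assumptions, not a production recipe:

```python
import redis

r = redis.Redis()

def enqueue(item):
    # Producers push onto the pending LIST.
    r.lpush("jobs:pending", item)

def fetch_job(timeout=5):
    # Atomically move one item from the pending LIST to a "processing" LIST,
    # so a crashed worker leaves evidence behind instead of losing the unit.
    return r.brpoplpush("jobs:pending", "jobs:processing", timeout=timeout)

def ack_job(item):
    # Work finished: remove the item from the processing LIST.
    r.lrem("jobs:processing", 1, item)

# A separate "reaper" process would periodically scan jobs:processing and
# push items that have lingered too long back onto jobs:pending.
```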
In some cases people have cooked up elaborate bits of server-side (Redis Lua EVAL) scripting to handle more reliable queuing. For example, implementing something like RPOPLPUSH but replacing the "push" with a ZADD (thus adding the item and a timestamp to a "sorted set" representing work that's "in progress"). In such systems the work is completed with a ZREM, and "lost" work is scanned for using ZRANGEBYSCORE.
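A hedged sketch of that sorted-set idea, using a server-side Lua script registered via redis-py; the script and key names are illustrative, not a standard recipe:

```python
import time
import redis

r = redis.Redis()

# Atomically pop one job from the pending LIST and record it, with the
# current timestamp as score, in an "in progress" sorted set.
POP_TO_ZSET = r.register_script("""
local item = redis.call('RPOP', KEYS[1])
if item then
  redis.call('ZADD', KEYS[2], ARGV[1], item)
end
return item
""")

def fetch_job():
    return POP_TO_ZSET(keys=["jobs:pending", "jobs:inprogress"], args=[time.time()])

def ack_job(item):
    # Completed work is removed from the "in progress" set with ZREM.
    r.zrem("jobs:inprogress", item)

def requeue_lost(older_than_seconds=60):
    # Scan for "lost" work with ZRANGEBYSCORE and re-dispatch it.
    cutoff = time.time() - older_than_seconds
    for item in r.zrangebyscore("jobs:inprogress", 0, cutoff):
        r.lpush("jobs:pending", item)
        r.zrem("jobs:inprogress", item)
```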
Here are some thoughts on the topic of implementing a robust queuing system by Salvatore Sanfilippo (a.k.a. antirez, author of Redis): Adventures in message queues where he discusses the considerations and forces which led him to work on disque.
I'm sure you'll find some detractors who argue that Redis is a poor substitute for a "real" message bus and queuing system (such as RabbitMQ). Salvatore says as much in his blog entry, and I'd welcome others here to spell out cogent reasons for preferring such systems.
My advice is to start with Redis during your early prototyping; but to keep your use of the system abstracted into some consolidated bit of code. Celery, among others, actually does this for you. You can start using Celery with a Redis backend and readily replace the backend with RabbitMQ or others with little effect on the bulk of your code.
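For instance, with Celery the broker swap is roughly a one-line configuration change (the URLs below are the usual local defaults, assumed for illustration):

```python
from celery import Celery

# Start with Redis as the broker...
app = Celery("tasks", broker="redis://localhost:6379/0")
# ...and switching to RabbitMQ later is a one-line change:
# app = Celery("tasks", broker="amqp://guest@localhost//")

@app.task
def add(x, y):
    return x + y
```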
For a catalog of alternatives, consider perusing: http://queues.io/


Should I try to use as many queues as possible?

On my machine I have two queue families, one that supports everything and one that only supports transfer.
The queue family that supports everything has a queueCount of 16.
Now the spec states
Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another
Does that mean I should try to use all available queues for maximal performance?
Yes, if you have workloads that are highly independent, use separate queues.
If the queues need a lot of synchronization between themselves, it may kill any potential benefit you may get.
Basically, in the case of the same queue family, what you are doing is supplying the GPU with alternative work it can do (to fill stalls, bubbles, and idle time, giving the GPU the choice). And there is some potential to make better use of the CPU (e.g. single-threaded vs. one queue per thread).
Using separate transfer queues (or another specialized family) even seems to be the recommended approach.
That is generally speaking. A more realistic, empirical, sceptical, and practical view was already presented by the SW and NB answers. In reality one does have to be a bit more cautious, as those queues target the same resources, have the same limits, and share other common restrictions, which limits the potential benefit gained from this. Notably, if the driver does the wrong thing with multiple queues, it may be very bad for the cache.
AMD's Leveraging asynchronous queues for concurrent execution (2016) discusses a bit how this maps to their HW/driver. It shows the potential benefits of using separate queue families. It says that although they expose two queues of the compute family, they did not observe benefits in apps at that time. They also explain why they offer only one graphics queue.
NVIDIA seems to have a similar idea of "async compute", shown in Moving to Vulkan: Asynchronous compute.
To be safe, it seems we should still stick with only one graphics queue and one async compute queue on current HW, though. 16 queues seem like a trap and a way to hurt yourself.
With transfer queues it is not as simple as it seems either. You should use the dedicated ones for host->device transfers, and the non-dedicated ones for device->device transfer ops.
To what end?
Take the typical structure of a deferred renderer. You build your g-buffers, do your lighting passes, do some post-processing and tone mapping, maybe throw in some transparent stuff, and then present the final image. Each process depends on the previous process having completed before it can begin. You can't do your lighting passes until you've finished your g-buffer. And so forth.
How could you parallelize that across multiple queues of execution? You can't parallelize the g-buffer building or the lighting passes, since all of those commands are writing to the same attached images (and you can't do that from multiple queues). And if they're not writing to the same images, then you're going to have to pick a queue in which to combine the resulting images into the final one. Also, I have no idea how depth buffering would work without using the same depth buffer.
And that combination step would require synchronization.
Now, there are many tasks which can be parallelized. Doing frustum culling. Particle system updates. Memory transfers. Things like that; data which is intended for the next frame. But how many queues could you realistically keep busy at once? 3? Maybe 4?
Not to mention, you're going to need to build a rendering system which can scale. Vulkan does not require that implementations provide more than 1 queue. So your code needs to be able to run reasonably on a system that only offers one queue as well as a system that offers 16. And to take advantage of a 16 queue system, you might need to render very differently.
Oh, and be advised that if you ask for a bunch of queues, but don't use them, performance could be impacted. If you ask for 8 queues, the implementation has no choice but to assume that you intend to be able to issue 8 concurrent sets of commands. Which means that the hardware cannot dedicate all of its resources to a single queue. So if you only ever use 3 of them... you may be losing over 50% of your potential performance to resources that the implementation is waiting for you to use.
Granted, the implementation could scale such things dynamically. But unless you profile this particular case, you'll never know. Oh, and if it does scale dynamically... then you won't be gaining a whole lot from using multiple queues like this either.
Lastly, there has been some research into how effective multiple queue submissions can be at keeping the GPU fed, on several platforms (read all of the parts). The general long and short of it seems to be that:
Having multiple queues executing genuine rendering operations isn't helpful.
Having a single rendering queue with one or more compute queues (either as actual compute queues or graphics queues you submit compute work to) is useful at keeping execution units well saturated during rendering operations.
That strongly depends on your actual scenario and setup. It's hard to tell without any details.
If you submit command buffers to multiple queues, you also need to do proper synchronization, and if that's not done right you may actually get worse performance than just using one queue.
Note that even if you submit to only one queue, an implementation may execute command buffers in parallel and even out-of-order (aka "in-flight"); see the details in chapter 2.2 of the spec or this AMD presentation.
If you do compute and graphics, using separate queues with simultaneous submissions (and synchronization) will improve performance on hardware that supports async compute.
So there is no definitive yes or no on this without knowing about your actual use case.
Since you can submit multiple independent workloads in the same queue, and it doesn't seem there is any implicit ordering guarantee among them, you don't really need more than one queue to saturate the queue family. So I guess the sole purpose of multiple queues is to allow for different priorities among the queues, as specified during device creation.
I know this answer is in direct contradiction to the accepted answer, but that answer fails to address the issue that you don't need more queues to send more parallel work to the device.

What pub/sub protocols have subscriber based data propagation?

I'm trying to evaluate different pub/sub messaging protocols on their ability to horizontally scale without producing unnecessary cross chatter.
My architecture will have NodeJS servers with web socket clients connected. I plan on using a consistent hashing based router to direct clients to servers based off of the topics they're interested in subscribing to. This would mean that for a given topic, only a subset of servers will have clients subscribing to that topic. Messages will then be published to a pub/sub broker, which would be responsible for fanning out that data to servers that have subscribers.
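(A minimal sketch of the consistent-hash routing described above, with made-up server names, might look like this:)

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        # Sorted list of (hash, node); virtual nodes smooth the distribution.
        self._ring = []
        for node in nodes:
            for i in range(replicas):
                bisect.insort(self._ring, (self._hash(f"{node}:{i}"), node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, topic):
        # Walk clockwise to the first virtual node at or after the topic's hash.
        idx = bisect.bisect(self._ring, (self._hash(topic), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["ws-server-1", "ws-server-2", "ws-server-3"])
print(ring.node_for("stock.AAPL"))   # every router agrees on the same target server
```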
The situation I want to avoid is one in which every broker receives every request, and the network becomes saturated. This is a clear issue with scaling Redis Pub/Sub. Adding servers shouldn't create an n-squared problem.
The number of clients on the pub/sub protocol would be the number of servers. Ideally, each server would be able to have a local broker to fan out data efficiently to multiple NodeJS processes, as to avoid unnecessary network bandwidth. In most cases, for a given topic, all subscribers would be on that same server.
What pub/sub protocols offer this sort of topic based data propagation?
The protocols I'm evaluating are: MQTT, RabbitMQ, ZMQ, nanomsg. This isn't inclusive, and SAAS options are acceptable.
The delivery-guarantee constraints are easy: at-most-once or at-least-once are both adequate. Acknowledgment isn't important. Event order isn't important. We're looking for fire and forget, with an emphasis on horizontal scalability.
First, let me address a risk of misunderstanding.
In many cases, similar words do not mean the same thing, and that is all the more true of abbreviations.
Having said that, let me review the PUB/SUB terminus technicus.
Martin SUSTRIK's and Pieter HINTJENS' team in imatix & 250bpm have developed a few smart messaging frameworks over the past decades, so these guys know a lot about the architecture benefits, constraints and implementation compromises.
That background helps me to state that these fathers, who laid the grounds of modern messaging, do not consider PUB/SUB to be a protocol.
It is, at least in nanomsg & ZeroMQ, rather a smart, Distributed, Scalability-focused Formal Communication Pattern -- i.e. a behaviour emulated by all involved parties.
Both ZeroMQ and nanomsg are broker-less.
In this sense, asking "what protocols" does not have solid grounds.
Let's start from the "data propagation" side
In initial ZeroMQ implementations, PUB had no other choice but to distribute all messages to all SUB-s that were in a connected state. Pieter HINTJENS explained this decision numerous times: the actual subscription-based filtering was performed on the SUB side (so data was distributed in a 1:all-connected manner).
PUB-side subscription-based filtering was implemented much later; you may check the revision history to find from which version onwards this started to avoid 1:all-connected broadcasts of data.
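A tiny pyzmq illustration of that prefix-based filtering; endpoint and topic names are made up, and whether the filter is applied PUB-side or SUB-side depends on the ZeroMQ version, exactly as described above:

```python
import zmq

ctx = zmq.Context.instance()

pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")

sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "sensors.")   # prefix-based subscription

# pub.send_string("sensors.temp 21.5")   -> delivered to this subscriber
# pub.send_string("logs.debug hello")    -> filtered out (PUB- or SUB-side,
#                                            depending on the ZeroMQ version)
```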
Similarly, you may check the nanomsg remarks from Martin SUSTRIK, who has given many in-depth posts on the performance improvements designed into his fabulous nanomsg project.
Scalability as priority No. 1
If scalability is the focus of your post, and if this were a serious Project, my question number one would be: what is the quantitative metric for comparing feasible candidates against such a Project goal -- i.e. what is that feasibility, translated into a utility function, used to score candidates and compare all of the parallel attributes your Project is interested in?

How does Redis achieve the high throughput and performance?

I know this is a very generic question. But I wanted to understand what the major architectural decisions are that allow Redis (or caches like Memcached, or Cassandra) to work at amazing performance limits.
How are connections maintained?
Are connections TCP or HTTP?
I know that it is completely written in C. How is the memory managed?
What are the synchronization techniques used to achieve high throughput in spite of competing reads/writes?
Basically, what is the difference between a plain vanilla implementation of a machine with in memory cache and server that can respond to commands and a Redis box? I also understand that the answer needs to be very huge and should include very complex details for completion. But, what I'm looking for are some general techniques used rather than all nuances.
There is a wealth of information in the Redis documentation to understand how it works. Now, to answer your questions specifically:
1) How are connections maintained?
Connections are maintained and managed using the ae event loop (designed by the Redis author). All network I/O operations are non-blocking. You can see ae as a minimalistic implementation using the best network I/O demultiplexing mechanism of the platform (epoll for Linux, kqueue for BSD, etc.), just like libevent, libev, libuv, etc.
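To make the idea concrete, here is a rough sketch (not Redis code) of the same single-threaded, non-blocking pattern using Python's selectors module, which picks epoll/kqueue for you much as ae does; the port and the canned reply are made up:

```python
import selectors
import socket

sel = selectors.DefaultSelector()          # epoll on Linux, kqueue on BSD, ...

server = socket.socket()
server.bind(("127.0.0.1", 6400))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, data="accept")

while True:
    for key, _mask in sel.select():        # demultiplex ready sockets
        if key.data == "accept":
            conn, _addr = key.fileobj.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ, data="client")
        else:
            payload = key.fileobj.recv(4096)
            if payload:
                key.fileobj.send(b"+PONG\r\n")   # each request handled inline, in order
            else:
                sel.unregister(key.fileobj)
                key.fileobj.close()
```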
2) Are connections TCP or HTTP?
Connections are TCP, using the Redis protocol, which is a simple telnet-compatible, text-oriented protocol that also supports binary data. This protocol is typically more efficient than HTTP.
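You can see how simple the wire protocol is from a raw TCP socket; the example below uses the inline, telnet-style command form against a local server:

```python
import socket

s = socket.create_connection(("localhost", 6379))
s.sendall(b"PING\r\n")                 # inline, telnet-style command
print(s.recv(64))                      # b'+PONG\r\n'
s.sendall(b"SET greeting hello\r\n")
print(s.recv(64))                      # b'+OK\r\n'
s.close()
```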
3) How is the memory managed?
Memory is managed by relying on a general purpose memory allocator. On some platforms, this is actually the system memory allocator. On some other platforms (including Linux), jemalloc has been selected since it offers a good balance between CPU consumption, concurrency support, fragmentation and memory footprint. jemalloc source code is part of the Redis distribution.
Contrary to other products (such as memcached), there is no implementation of a slab allocator in Redis.
A number of optimized data structures have been implemented on top of the general purpose allocator to reduce the memory footprint.
4) What are the synchronization techniques used to achieve high throughput in spite of competing reads/writes?
Redis is a single-threaded event loop, so there is no synchronization to be done since all commands are serialized. Now, some threads also run in the background for internal purposes. In the rare cases they access the data managed by the main thread, classical pthread synchronization primitives are used (mutexes for instance). But 100% of the data accesses made on behalf of multiple client connections do not require any synchronization.
You can find more information there:
Redis is single-threaded, then how does it do concurrent I/O?
What is the difference between a plain vanilla implementation of a machine with in memory cache and server that can respond to commands and a Redis box?
There is no difference. Redis is a plain vanilla implementation of a machine with in memory cache and server that can respond to commands. But it is an implementation which is done right:
using the single threaded event loop model
using simple and minimalistic data structures optimized for their corresponding use cases
offering a set of commands carefully chosen to balance minimalism and usefulness
constantly targeting the best raw performance
well adapted to modern OS mechanisms
providing multiple persistence mechanisms, because the "one size fits all" approach is only a dream.
providing the building blocks for HA mechanisms (replication system for instance)
avoiding stacking up useless abstraction layers like pancakes
resulting in a clean and understandable code base that any good C developer can be comfortable with

zookeeper vs redis server sync

I have a small cluster of servers I need to keep in sync. My initial thought on this was to have one server be the "master" and publish updates using redis's pub/sub functionality (since we are already using redis for storage) and letting the other servers in the cluster, the slaves, poll for updates in a long running task. This seemed to be a simple method to keep everything in sync, but then I thought of the obvious issue: What if my "master" goes down? That is where I started looking into techniques to make sure there is always a master, which led me to reading about ideas like leader election. Finally, I stumbled upon Apache Zookeeper (through a Python binding, "pettingzoo"), which apparently takes care of a lot of the fault tolerance logic for you. I may be able to write my own leader election code, but I figure it wouldn't be close to as good as something that has been proven and tested, like Zookeeper.
My main issue with using zookeeper is that it is just another component that I may be adding to my setup unnecessarily when I could get by with something simpler. Has anyone ever used redis in this way? Or is there any other simple method I can use to get the type of functionality I am trying to achieve?
More info about pettingzoo (slideshare)
I'm afraid there is no simple method to achieve high availability. This is usually tricky to set up and tricky to test. There are multiple ways to achieve HA, which can be classified in two categories: physical clustering and logical clustering.
Physical clustering is about using hardware, network, and OS level mechanisms to achieve HA. On Linux, you can have a look at Pacemaker which is a full-fledged open-source solution coming with all enterprise distributions. If you want to directly embed clustering capabilities in your application (in C), you may want to check the Corosync cluster engine (also used by Pacemaker). If you plan to use commercial software, Veritas Cluster Server is a well established (but expensive) cross-platform HA solution.
Logical clustering is about using fancy distributed algorithms (like leader election, PAXOS, etc ...) to achieve HA without relying on specific low level mechanisms. This is what things like Zookeeper provide.
Zookeeper is a consistent, ordered, hierarchical store built on top of the ZAB protocol (quite similar to PAXOS). It is quite robust and can be used to implement some HA facilities, but it is not trivial, and you need to install the JVM on all nodes. For good examples, you may have a look at some recipes and the excellent Curator library from Netflix. These days, Zookeeper is used well beyond pure Hadoop contexts, and IMO, it is the best solution to build an HA logical infrastructure.
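As a hedged sketch, leader election through Zookeeper from Python can look like this with the kazoo client (the question mentions pettingzoo; kazoo is another common choice, and the host, path, and identifier here are placeholders):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def lead():
    # Only the elected leader runs this; it blocks until leadership is given up.
    print("I am the master now, publishing updates...")

election = zk.Election("/myapp/election", identifier="server-1")
election.run(lead)   # contends for leadership and runs lead() when it wins
```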
Redis pub/sub mechanism is not reliable enough to implement a logical cluster, because unread messages will be lost (there is no queuing of items with pub/sub). To achieve HA of a collection of Redis instances, you can try Redis Sentinel, but it does not extend to your own software.
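If you do go the Sentinel route, clients typically discover the current master like this with redis-py; the addresses and the "mymaster" service name are the usual examples from the Sentinel docs, not anything specific to your setup:

```python
from redis.sentinel import Sentinel

sentinel = Sentinel([("localhost", 26379)], socket_timeout=0.5)
master = sentinel.master_for("mymaster", socket_timeout=0.5)    # follows failover
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)

master.set("state", "value")
print(replica.get("state"))
```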
If you are ready to program in C, an HA framework which is often forgotten (but can be quite useful IMO) is the one coming with BerkeleyDB. It is quite basic but supports off-the-shelf leader elections, and can be integrated in any environment. Documentation can be found here and here. Note: you do not have to store your data with BerkeleyDB to benefit from the HA mechanism (only the topology data - the same data you would put in Zookeeper).

Replicated message queue

I am looking for a message queue which would replicate messages across a cluster of servers. I am aware that this will cause a performance hit, but that's what the requirements are - message persistence is very important.
The replication can be asynchronous, but it should be there - if there's a large backlog of messages waiting for processing, they shouldn't be lost.
So far I didn't manage to find anything from the well-known MQs. HornetQ for example supported message replication in 2.0 but in 2.2 it seems to be removed. RabbitMQ doesn't replicate messages at all, etc.
Is there anything out there that could meet my requirements?
There are at least three ways of tackling this that come to mind, depending upon how robust you need the solution to be.
One: pick any messaging tech, then replicate your disk-storage. Using something like DRBD you can have the file-backed storage copied to another machine under the covers. If your primary box dies, you should be able to restart on your second machine from the replicated files.
Two: Keep looking. There are various commercial systems that definitely do this, two such (no financial benefit on my part) are Informatica Ultra Messaging (formerly 29West) and Solace. These are commonly used in the financial community.
Three: build your own. ZeroMQ is one such toolkit that you could use to roll-your-own system from pre-built messaging blocks. Even a system that does not officially support it could fairly easily be configured to publish all messages to two queues. Your reader would have to drain both somehow, so this may well be a non-starter, but possible in any case.
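As a rough pyzmq sketch of that "publish to two queues" idea (the addresses are illustrative, and the draining/merging on the consumer side is deliberately left out, as noted above):

```python
import zmq

ctx = zmq.Context.instance()

primary = ctx.socket(zmq.PUSH)
primary.connect("tcp://primary-host:5557")

replica = ctx.socket(zmq.PUSH)
replica.connect("tcp://replica-host:5557")

def send(message: bytes):
    # The same payload goes to both queues; the consumer must reconcile them.
    primary.send(message)
    replica.send(message)

send(b"order-created:42")
```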
Overall: do test your performance assumptions, as all of these will have various performance implications in various scenarios.
Amazon SQS is designed with this very thing in mind, but because of its consistency model (which is a part of messaging anyway), you're responsible for de-duplicating messages on the consumer side. Granted, SQS may be somewhat slow and the costs can add up for lots of messages, but if you want to guarantee that no messages are lost, it's a pretty solid way to go.
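A hedged sketch of consumer-side de-duplication with boto3; the queue URL is a placeholder, and in a real multi-consumer setup the "seen" set would live in a shared store (e.g. Redis) rather than in process memory:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder
seen = set()   # message IDs we have already processed

def process(body):
    print("handling", body)

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        if msg["MessageId"] not in seen:       # drop SQS's occasional duplicates
            seen.add(msg["MessageId"])
            process(msg["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```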
The new Kafka 0.8.1 offers replication!