How to quickly subscribed to relevant subset of large set of routing keys? - rabbitmq

I have the feeling I am not understanding something fundamental in AMQP/RabbitMQ, since I cannot find much help on this specific detail.
Let's assume I have a system made up of several components sending each other messages via a RabbitMQ broker. The messages can have routing keys of the form XXX.YYY. Let's further assume XXX and YYY are numbers between 000 and 999. That means there are a total of 1,000,000 different possible routing keys.
Now, not every component in my system is interested in every message. Let's say there is a component that wants all the messages in which XXX is between 300 and 500 and YYY is between 600 and 900. That means the component wants to process messages referring to 200*300 = 60,000 different routing keys. Also, the component might be restarted at any point in time and needs to be able to start processing the messages quickly after restart.
Furthermore, the routing keys the component is interested in might change at runtime.
There are several ways to approach this that I can think of:
Use topic exchanges and subscribe to each routing key. If I do this using one connection and one channel, it is awfully slow. My understanding is that bindings are created sequentially for each channel and thus creating 60,000 bindings takes a while. Adding and removing bindings is trivial, though. Would it be feasible to create more channels so that bindings can be created in parallel?
Use topic exchanges and wildcards, discard messages you're not interested in in the client. We could subscribe to *.* and receives messages for all 1,000,000 routing keys => much more load in the client. Or subscribe to all 200 relevant values of XXX.* and receive messages for 200,000 routing keys. Is this a generally applied pattern?
Use headers exchanges and set x-match to any. This feels a little hacky and it seems headers exchanges are not widely used. You also have to deal with the maximum size of the header when defining a binding. Do people do this? You only need a handful of bindings though, so re-creating the bindings after a restart is very fast. Updating the set of topics we're interested in is also not a problem: Just re-create everything.
So, I guess my question is: What's the best practice to subscribe to a large amount of topics very quickly (<5s) and still be able change routing keys dynamically at run-time?
Would it be feasible to split the component which needs the messages and the subscription into two components? One component is only responsible for keeping the subscriptions up-to-date (this would exchange-to-exchange subscriptions) and the other components receives every message from the downstream exchange.

Related

Is it possible to buffer messages in exchange until at least one queue is available?

I'm looking for a way to buffer messages received by the exchange as long as there is at least one queue bind to that exchange.
Is it supported by RabbitMQ?
Maybe there are some workarounds (I didn't find any).
EDIT
My use case:
I've got one data producer (which reads real-time data from an external system)
I've got one fanout exchange which receives data from the producer
On system startup, there might be no consumer, but after a few moments, there should be at least one which creates his own queue and binds it to the exchange from 2.
The problem is this short time between step 2. and 3. where there are no queues bound to the exchange created in step 1.
Of course, it's an edge case and after system initialization queues and exchanges are bound and everything works as expected.
Why queues and bindings has to be created by consumers (not by the producer)? Because I need a flexible setup where I can add consumers without any changes in other components code (e.g. producer).
EDIT 2
I'm processing the output from another system which stores both real-time and historical data. There are the cases where I want to read historical data first (on initialization) and then continue to handle real-time data.
I may mislead you by saying that there are multiple consumers. In the case where I need a buffer on exchange there is only one consumer (which writes everything to time series DB as it appears in queue).
The RabbitMQ team monitors this mailing list and only sometimes answers questions on StackOverflow.
Why queues and bindings has to be created by consumers (not by the producer)?
Queues and bindings can be created by producers or consumers or both. The requirement is that the exact same arguments are used when creating them if a client application tries to "re-create" a queue or binding. If different arguments are used, a channel-level error will happen.
As you have found, if a producer publishes to an exchange that can't route messages, they will be lost. Olivier's suggestion to use an alternate exchange is a good one, but I recommend you have your producers create queues and bindings as well.
If you mean to avoid throwing away messages because there is no destination configured for it, yes.
You should look at alternate exchange.
This assume that before (or when) you start (or when), the alternate exchange is created (would typically go for fanout) and a queue is binded to it (let's call it notroutedq).
So the messages are not lost, they will be stored in notroutedq.
From there you can possibly setup a mechanism that would reprocess messages in that queue - reinjecting them into the main exchange most likely - once a given time has passed or when a binding has been added to your main exchange.
-- EDIT --
Thanks for the updated info.
Could you indicate how long typically you'd expect the past messages to be useful to the consumers?
In your description, you mention real-time data and possibly multiple consumers coming and going. Based on that, I'm not sure how much of the data kept in the notroutedq would be of value, and with which frequency you'd expect to resend them to the consumers.
The cases I had with alternate exchange where mostly focused on identifying missing bindings, so that one could easily correct the bindings and reprocess the messages without loss.
If the number of consumers varies through time and the data content is real-time, I'd wonder a bit about the benefit of keeping the data.

Queue Fairness and Messaging Servers

I'm looking to solve a problem that I have with the FIFO nature of messaging severs and queues. In some cases, I'd like to distribute the messages in a queue to the pool of consumers on a criteria other than the message order it was delivered in. Ideally, this would prevent users from hogging shared resources in the system. Take this overly simplified scenario:
There is a feature within an application where a user can empty their trash can.
This event dispatches a DELETE message for each item in trash can
The consumers for this queue invoke a web service that has a rate limited API.
Given that each user can have very large volumes of messages in their trash can, what options do we have to allow concurrent processing of each trash can without regard to the enqueue time? It seems to me that there are a few obvious solutions:
Create a separate queue and pool of consumers for each user
Randomize the message delivery from a single queue to a single pool of consumers
In our case, creating a separate queue and managing the consumers for each user really isn't practical. It can be done but I think I really prefer the second option if it's reasonable. We're using RabbitMQ but not necessarily tied to it if there is a technology more suited to this task.
I'm entertaining the idea of using Rabbit's message priorities to help randomize delivery. By randomly assigning a message a priority between 1 and 10, this should help distribute the messages. The problem with this method is that the messages with the lowest priority may be stuck in the queue forever if the queue is never completely emptied. I thought I could use a TTL on the message and then re-queue the message with an escalated priority but I noticed this in the docs:
Messages which should expire will still only expire from the head of
the queue. This means that unlike with normal queues, even per-queue
TTL can lead to expired lower-priority messages getting stuck behind
non-expired higher priority ones. These messages will never be
delivered, but they will appear in queue statistics.
I fear that I may heading down the rabbit hole with this approach. I wonder how others are solving this problem. Any feedback on creative routing, messaging patterns, or any alternative solutions would be appreaciated.
So I ended up taking a page out of the network router handbook. This a problem they routers need to solve to allow fair traffic patterns. This video has a good breakdown of the problem and the solution.
The translation of the problem into my domain:
And the solution:
The load balancer is a wrapper around a channel and a known number of queues that uses a weighted algorithm to balance between messages received on each queue. We found a really interesting article/implementation that seems to be working well so far.
With this solution, I can also prioritize workspaces after messages have been published to increase their throughput. That's a really nice feature.
The biggest challenge ahead of me is management of the queues. There will be too many queues to leave bound to the exchange for an extended period of time. I'm working on some tools to manage their lifecycle.
One solution could be to interpose a Resequencer. The principle is outlined in the diag in that link. In your case, something like:
The app dispatches its DELETE messages into the delete queue as originally.
The Resequencer (a new component you write) is interposed between the original publishers and original consumers. It:
pulls messages off the DELETE queue into memory
places them into (in-memory) queues-by-user
republishes them to a new queue (eg FairPriorityDeleteQueue), round-robinning to interleave fairly any messages from different original users
limits its republish rate into FairPriorityDeleteQueue, either such that the length of FairPriorityDeleteQueue (obtainable via polling the rabbitmq management api periodically) never exceeds some integer you choose N, or limited to some rate related to the rate-limited delete API the consumers use.
doesn't ack any message it pulled off the original DELETE queue, until it's republished it to FairPriorityDeleteQueue (so you never lose a message)
The original consumers subscribe instead to FairPriorityDeleteQueue.
You set the preFetchCount on these consumers fairly low (<10), to prevent them in turn bulk-buffering the contents of FairPriorityDeleteQueue in memory.
--
Some points to watch:
Rate- or length-limiting publishing into and/or drawing messages out of FairPriorityDeleteQueue is essential. If you don't limit, Resequencer may just hand messages on as fast as it receives them, limiting the potential for resequencing.
Resequencer of course acts as a kind of in-memory buffer while resequencing. If the original publishers can publish very large numbers of messages in to the queue suddenly, you may need to memory-limit the Resequencer process so that it doesn't ingest more than it can hold.
Your particular scenario is greatly helped by the fact that you have an external factor (the final delete API) limiting throughput. Without such an extrinsic limiting factor, it is much harder to choose the optimum parameters for such a resequencer, to balance throughput-versus-resequencing in a particular environment.
I don't think a resequencer is needed in this case. Maybe it is, if you need to ensure the items are deleted in a specific order. But that only comes into play when you send multiple messages at roughly the same time and need to guarantee order on the consumer end.
You should also avoid the timeout scenario, for the reasons you've mentioned. timeout is meant to tell RabbitMQ that a message doesn't need to be processed - or that it needs to be routed to a dead letter queue so that i can be processed by some other code. while you might be able to make timeout work, i don't think it's a good choice.
Priorities may solve part of the problem, but could introduce a scenario where files never get processed. if you have a priority 1 message sitting back in the queue somewhere, and you keep putting priority 2, 3, 5, 10, etc. into the queue, the 1 might not be processed. the timeout doesn't solve this, as you've noted.
For my money, I would suggest a different approach: sending delete requests serially, for a single file.
that is, send 1 message to delete 1 file. wait for a response to say it's done. then send the next message to delete the next file.
here's why i think that will work, and how to manage it:
Long-Running Workflow, Single File Delete Requests
In this scenario, I would suggest taking a multi-step approach to the problem using the idea of a "saga" (aka a long-running workflow object).
when a user requests to delete their trashcan, you send a single message through rabbitmq to the service that can handle the delete process. that service creates an instance of the saga for that user's trashcan.
the saga gathers a list of all files in the trashcan that need to be deleted. then it starts to send the requests to delete the individual files, one at a time.
with each request to delete a single file, the saga waits for the response to say the file was deleted.
when the saga receives the message to say the previous file has been deleted, it sends out the next request to delete the next file.
once all the files are deleted, the saga updates itself and any other part of the system to say the trash can is empty.
Handling Multiple Users
When you have a single user requesting a delete, things will happen fairly quickly for them. they will get their trash emptied soon.
u1 = User 1 Trashcan Delete Request
|u1|u1|u1|u1|u1|u1|u1|u1|u1|u1done|
when you have multiple users requesting a delete, the process of sending one file delete request at a time means each user will have an equal chance of getting the next file delete.
u1 = User 1 Trashcan Delete Request
u2 = User 2 Trashcan Delete Request
|u1|u2|u1|u1|u2|u2|u1|u2|u1|u2|u2|u1|u1|u1|u2|u2|u1|u2|u1|u1done|u2|u2done|
This way, there will be shared use of the resources to delete the files. Over-all, it will take a little longer for each person's trashcan to be emptied, but they will see progress sooner and that's an important aspect of people thinking the system is fast / responsive to their request.
Optimizing Small File Set vs Large File Set
In a scenario where you have a small number of users with a small number of files, the above solution may prove to be slower than if you deleted all the files at once. after all, there will be more messages sent across rabbitmq - at least 2 for every file that needs to be deleted (one delete request, one delete confirmation response)
To optimize this further, you could do a couple of things:
have a minimum trashcan size before you split up the work like this. below that minimum, you just delete it all at once
chunk the work into groups of files, instead of one at a time. maybe 10 or 100 files would be a better group size, than 1 file at a time
Either (or both) of these solutions would help to improve the over-all performance of the process by reducing the number of messages being sent, and batching the work a bit.
You would need to do some testing in your real scenario to see which of these (or maybe both) would help and at what settings.
Many Users Problem
There's one additional problem you may face - many users. If you have 2 or 3 users requesting deletes, it won't be a big deal.
But if you have 100 or 1000 users requesting deletes, it could take a very long time for an individual to get their trashcan emptied.
You may need to have a higher level controlling process for this situation, where all requests to empty trashcans would be managed by yet another Saga. This saga would rate-limit the number of active trashcan-deletion sagas.
For example, if you have 10 active requests for deleting trashcans, the rate-limiting saga would only start 3 of them and it would wait for one to finish before starting the next one.
Again, you would need to test your actual scenario to see if this is needed and see what the limits should be, for performance reasons.
There may be additional scenarios that have to be considered in your actual scenario, but I hope this gets you down the path! :)

RabbitMQ Pub/Sub setup with large number of disconnected clients...

This is a new area for me so hopefully my question makes sense.
In my program I have a large number of clients which are windows services running on laptops - that are often disconnected. Occasionally they come on line and I want them to receive updates based on user profiles. There are many types of notifications that require the client to perform some work on the local application (i.e. the laptop).
I realize that I could do this with a series of restful database queries, but since there are so many clients (upwards to 10,000) and there are lots of different notification types, I was curious if perhaps this was not a problem better suited for a messaging product like RabbitMQ or even 0MQ.
But how would one set this up. (let's assume in RabbitMQ?
Would each user be assigned their own queue?
Or is it preferable to have each queue be a distinct notification type and you would use some combination of direct exchanges or filtering messages based on a routing key, where the routing key could be a username.
Since each user may potentially have a different set of notifications based on their user profile, I am thinking that each client/consumer would have a specific message for each notification sitting on a queue waiting for them to come online and process it.
Is this the right way of thinking about the problem? Thanks in advance.
It will be easier for you to balance a lot of queues than filter long ones, so it's better to use queue per consumer.
Messages can have arbitrary headers and body so it is the right place for notification types.
Since you will be using long-living queues, waiting for consumers on disk - you better use lazy queues https://www.rabbitmq.com/lazy-queues.html (it's available since version 3.6.0)

How to make topic exchanges expandable

So we will have a topic exchange that looks something like
{class}.{genus}
So we have some consumers that bind with the topic
mammal.*
(or bird.*, etc.)
Now suppose later on we want to include species information so the topic exchange now looks like this:
{class}.{genus}.{species}
Now the old consumers are broken :(
However they could have bound as
mammal.*.#
And been able to listen to whatever future information is added. However, this is something my team came up with on our own which leads me to ask:
Is this good practice?
Are there tradeoffs to this I should be aware of?
Is there an alternate way to have a producer be able to add information without breaking existing consumers, without publishing to multiple exchanges?
Typically if you have a need maximum control on queue delivery and want to do the logic in rabbit, then you should consider header exchanges.
Usually when we code up the publish we know exactly which queue it needs to go to, so whether you want to use a routing key or a boolean to do this might not make much difference depending on your application.
This brings up another design consideration to be aware of: whether you want routing logic in rabbit. Someone people prefer to just use simple routing keys and either direct or topic exchanges, focusing on flexible consumers. Its going to be hard to guess at what is best for your application obviously.
Keep in mind that your consumers will be subscribed, often statically, to the queue(s) that the exchange delivers to. Also mammal.# is the same as mammal.*.# (see: ref)

How to use routing_key and queues

I'm setting up a consumer that will listen for messages from two different sources. I want to have a different callback for messages from these two sources(other solutions are welcome though).
I'm very new to rabbitmq and pika and I haven't grasped the nitty gritty details yet. But what i want to know is:
Should i use different queues and setup two
channel.basic_consume(callback_1, ...)
channel.basic_consume(callback_2, ...)
for my callbacks or should i do some tricks with routing keys instead?
That depends on your needs a little. It's really about processing, I am most familiar with Java so I will tell you how I handle things and then you can make a decision based on that.
If I need to have different threads process different data or do different things with the data I create two different queues and each thread will consume a different queue. I use topic exchanges to make sure the queues get the correct messages. If the data is only slightly different then using the routing key I can handle the data differently with the same thread. The decision is purely based on the parallelism I require, ie how many queues I want processing the data.