RabbitMQ - How to ensure two queues stay synchronized - rabbitmq

I have two queues that both have distinct data types that affect one another as they're being processed by my application, therefore processing messages from the two queues asynchronously would cause a data integrity issue.
I'm curious as to the best practice for making sure only one consumer is consuming at any given time. Here is a summary of what I have so far:
EventMessages receive information about external events that may or may not have an impact on the enqueued/existing PurchaseOrderMessages.
Since we anticipate we'll be consuming more PurchaseOrderMessage than EventMessage, maybe we should just ensure the EventMessage Queue is empty (via the API) before we process anything in PurchaseOrderMessage Queue - but that gets into the question of wait times, etc. and this all needs to happen as close to real time as possible.
If there's a way to simply pause a Consumer A until Consumer B is at rest that might be the simplest solution, I'm just not quite sure which direction I need to go in.
UPDATE
To provide some additional context, a PurchaseOrderMessage will contain an Origin and a Destination.
An EventMessage also contains location data.
Each time a PurchaseOrderMessage is processed, it will query the current EventMessage records for any Event locations that match the Origin and Destination of that PurchaseOrder and create an association.
Each time an EventMessage is processed, it will query the current PurchaseOrderMessage records for any Origins or Destinations that match that Event and create an association.
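A minimal Python sketch of these two matching rules, assuming an in-memory store in place of the real database. All class, field, and method names here are hypothetical and only illustrate the symmetry of the two lookups:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseOrder:
    po_id: str
    origin: str
    destination: str


@dataclass(frozen=True)
class Event:
    event_id: str
    location: str


class AssociationStore:
    """Hypothetical in-memory stand-in for the application's database."""

    def __init__(self):
        self.orders = {}
        self.events = {}
        self.associations = set()  # (po_id, event_id) pairs

    def process_order(self, order):
        # Store the order, then match it against every event seen so far.
        self.orders[order.po_id] = order
        for event in self.events.values():
            if event.location in (order.origin, order.destination):
                self.associations.add((order.po_id, event.event_id))

    def process_event(self, event):
        # Store the event, then match it against every order seen so far.
        self.events[event.event_id] = event
        for order in self.orders.values():
            if event.location in (order.origin, order.destination):
                self.associations.add((order.po_id, event.event_id))


po = PurchaseOrder("po-1", "Memphis", "Denver")
ev = Event("ev-1", "Denver")

order_first = AssociationStore()
order_first.process_order(po)
order_first.process_event(ev)

event_first = AssociationStore()
event_first.process_event(ev)
event_first.process_order(po)
```

When each message is handled to completion before the next one starts, either arrival order yields the same association.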
If synchronous queues aren't a good solution, what's an alternative that would ensure none of the associations are missed when EventMessages and PurchaseOrderMessages are getting published to the app at the same time?
UPDATE 2
Ultimately this data will serve a UI which will have a list of PurchaseOrders and the events that might be affecting their delivery dates. It would be too slow to do the "Event Check" as the PurchaseOrder data was being rendered/retrieved by the end user which is why we're wanting to do it as they're processed/consumed.

Let me begin with the bottom line up front - on the face of it, what you are asking doesn't make sense.
Queues should never require synchronization. The very thought of doing so entirely defeats the purpose of having a queue. For some background, visit this answer.
Let's consider some common places from real life where we encounter multiple queues:
Movie theaters (box office, concession counter, usher)
Theme parks (snack bars, major attractions)
Manufacturing floors (each station may have a queue waiting to process)
In each of these examples, from the point of view of the object in the queue, it can only wait in one at a time. It cannot wait in one line while it is waiting in another; such a thing is physically impossible.
Your example seems to take two completely unrelated things and merge them together. You have a queue for PurchaseOrder objects - but what is the queue for? That's the equivalent of going to Disney World and waiting in the Customer queue - what is the purpose of such a queue? If the purpose is not clear, it's not a real queue.
Addressing your issue
This particular issue needs to be addressed first by clearly defining the various operations that are being done to a PurchaseOrder, then creating queues for each of those operations. If these operations are truly synchronous, then your business logic should be coded to wait for one operation to complete before starting another. In this circumstance, it would be considered an exception if a PurchaseOrder got to the head of one queue without fulfilling a pre-requisite.
Please remember that a message queue typically serves a stateless operation. Good design dictates that messages in the queue contain all the information needed for the processor to process the message. If you don't adhere to this, then your database becomes a single point of contention for your system - and while this is not an insurmountable problem, it does make the design more complex.
Waiting in Multiple Queues
Now, if you've ever been to Disney World, you'll also know that they have something called a FastPass+ (FP+), which allows the holder to skip the line at the designated attraction. Disney allocates a certain number of slots per hour for each major attraction at the park, and guests are able to request up to three FP+s during each day. FP+ times are allocated for one hour blocks, and guests cannot have two overlapping FP+ time blocks. Once all FP+ slots have been issued for the ride, no more are made available. The FP+ system ensures these rules are enforced, independently of the standby queues for each ride. Essentially, by using FastPass+, guests can wait in multiple lines virtually and experience more attractions during their visit.
If you are unable to analyze your design and come up with an alternative, perhaps the FastPass+ approach could help alleviate some of the bottlenecks.
Disclaimer: I don't work for Disney, but I do go multiple times per month, always getting my FastPass first

Related

What are the errors in this BPMN?

I have a BPMN diagram (see below) with some errors that I can't seem to figure out. The diagram depicts the Produce Magazine Article Process, where the Writer and Researcher are freelancers who work together to write articles for various publications.
There are a bunch of errors here: three of them are logical (two are related) and one is BPMN syntax.
Let's start with the syntax.
A message is always a communication between two separate pools, so it has to cross pool boundaries. In your case, you have depicted Freelancers as a single pool, so Send information, which runs between lanes but not between pools, is a syntax error. Before suggesting a solution though, I will focus on the logical errors.
A time event is not used to show that some time goes by between the activities. That is actually something natural in the process. It is used to indicate that the flow of time is a trigger of the next action(s). For instance, 7 days after choosing a topic the Publication might contact the Researcher to check on the progress. That would be indicated by a time event. In your case, it seems that the flow continuation is triggered by passing messages, so you should indicate it as an incoming message event. You actually do that in 2 places: one that is obvious (Get article as a "result" of a time event) and a second that correlates to a second problem.
The second thing, which is most probably a logical question, is that since we are talking here about freelancers, the Researcher and Writer are most probably two separate entities, not one organisation as your current diagram suggests. If that is the case, you should have them represented as two separate pools. Then your message would be justified, but rather than a "Wait for information" time event you should still have a "Receive information" incoming message event (that is, BTW, the starting event for the Writer pool; similarly, receiving the Article request by the Researcher should be handled by an incoming message event).
If you prefer to depict the Freelancer as one "organisation", then you should completely abandon the time event (as again you have used it as an indication of time passing and as I have explained earlier that is not how it should be used). You have a simple flow, where once Researcher finishes their job, it is passed to Writer who carries it over from there. In such case, you should have a simple action flow (solid line) between the actions themselves.
It is also good practice to be consistent in using End events (and at least recommended, since some BPM engines verify that) and to always have an End event for every branch of a process. You are missing one or two, depending on how you are going to approach the Freelancers part. Similarly, you should have a Start event for Publication.
Below are the two options shown in the form of diagrams. Note that I also did some minor changes to handle the insufficient information case by Publication. Otherwise, they will be stuck forever waiting for the article to come.
Option with Freelancers as separate pools:
Option with Freelancers considered as a single organisation

Understanding Eventual Consistency, BacklogItem and Tasks example from Vaughn Vernon

I'm struggling to understand how to implement Eventual Consistency with the exposed example of BacklogItems and Tasks from Vaughn Vernon. The statement I've understood so far is (considering the case where he splits BacklogItem and Task into separate aggregate roots):
A BacklogItem can contain one or more tasks. When the remaining hours of all the tasks of a BacklogItem are 0, the status of the BacklogItem should change to "DONE"
I'm aware of the rule that says that you should not update two aggregate roots in the same transaction, and that you should accomplish that with eventual consistency.
Once a Domain Service updates the amount of hours of a Task, a TaskRemainingHoursUpdated event should be published to a DomainEventPublisher which lives in the same thread as the executing code. And here is where I'm at a loss, with the following questions:
I suppose that there should be a subscriber (also living in the same thread I guess) that should react to TaskRemainingHoursUpdated events. At which point in your Desktop/Web application do you perform this subscription to the Bus? At the very initialization of your app? In the application code? Is there any reasoning to place domain subscribers in a specific place?
Should that subscriber (in the same thread) call a BacklogItem repository and perform the update? (But that would be a violation of the rule of not updating two aggregates in the same transaction since this would happen synchronously, right?).
If you want to achieve eventual consistency to fulfil the previously mentioned rule, do I really need a Message Broker like RabbitMQ even though both BacklogItem and Task live inside the same Bounded Context?
If I use this message broker, should I have a background thread or something that just consumes events from a RabbitMQ queue and then dispatches the event to update the product?
I'd appreciate if someone can shed some clear light over this since it is quite complex to picture in its completeness.
So to start with, you need to recognize that, if the BacklogItem is the authority for whether or not it is "Done", then it needs to have all of the information to compute that for itself.
So somewhere within the BacklogItem is data that is tracking which Tasks it knows about, and the known state of those tasks. In other words, the BacklogItem has a stale copy of information about the task.
That's the "eventually consistent" bit; we're trying to arrange the system so that the cached copy of the data in the BacklogItem boundary includes the new changes to the task state.
That in turn means we need to send a command to the BacklogItem advising it of the changes to the task.
From the point of view of the backlog item, we don't really care where the command comes from. We could, for example, make it a manual process "After you complete the task, click this button here to inform the backlog item".
But for the sanity of our users, we're more likely to arrange an event handler to be running: when you see the output from the task, forward it to the corresponding backlog item.
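As a rough sketch of that idea, assuming a dictionary-shaped event and hypothetical names throughout (`record_task_hours`, `on_task_remaining_hours_updated`), the backlog item keeps its own copy of task state and decides for itself whether it is done. Reverting to "IN_PROGRESS" when hours reappear is an assumption of this sketch, not something stated above:

```python
class BacklogItem:
    """Keeps its own (possibly stale) copy of each task's remaining hours."""

    def __init__(self, item_id):
        self.item_id = item_id
        self.status = "IN_PROGRESS"
        self._remaining_hours = {}  # task_id -> last known remaining hours

    def record_task_hours(self, task_id, remaining_hours):
        # Command: update the cached task state, then re-evaluate status.
        self._remaining_hours[task_id] = remaining_hours
        if all(h == 0 for h in self._remaining_hours.values()):
            self.status = "DONE"
        else:
            self.status = "IN_PROGRESS"


def on_task_remaining_hours_updated(event, backlog_items):
    # Event handler: forward the task's output to its backlog item.
    item = backlog_items[event["backlog_item_id"]]
    item.record_task_hours(event["task_id"], event["remaining_hours"])


items = {"b1": BacklogItem("b1")}
on_task_remaining_hours_updated(
    {"backlog_item_id": "b1", "task_id": "t1", "remaining_hours": 3}, items)
on_task_remaining_hours_updated(
    {"backlog_item_id": "b1", "task_id": "t2", "remaining_hours": 0}, items)
status_midway = items["b1"].status
on_task_remaining_hours_updated(
    {"backlog_item_id": "b1", "task_id": "t1", "remaining_hours": 0}, items)
status_final = items["b1"].status
```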
At which point in your Desktop/Web application you perform this subscription to the Bus? At the very initialization of your app?
That seems pretty reasonable.
Should that subscriber (in the same thread) call a BacklogItem repository and perform the update? (But that would be a violation of the rule of not updating two aggregates in the same transaction since this would happen synchronously, right?).
Same thread and same transaction are not necessarily coincident. It can all be coordinated in the same thread; but it probably makes more sense to let the consequences happen in the background. At their core, events and commands are just messages - write the message, put it into an inbox, and let the next thread worry about processing.
If you want to achieve eventual consistency to fulfil the previously mentioned rule, do I really need a Message Broker like RabbitMQ even though both BacklogItem and Task live inside the same Bounded Context?
No; the mechanics of the plumbing matter not at all.
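To make the "put it into an inbox and let the next thread worry about processing" idea concrete, here is a minimal sketch that uses Python's standard `queue.Queue` as an in-process inbox instead of a broker. The event shape and the `publish`/`worker` names are assumptions for illustration only:

```python
import queue
import threading

inbox = queue.Queue()          # in-process stand-in for a broker queue
handled = []                   # records what the background worker processed


def publish(event):
    # Called while handling the Task change: only writes the message.
    inbox.put(event)


def worker():
    # Runs on its own thread, one message at a time.
    while True:
        event = inbox.get()
        if event is None:      # sentinel: stop the worker
            inbox.task_done()
            break
        handled.append(event)  # here the BacklogItem would be loaded and updated
        inbox.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()                 # subscribe once, at application start-up

publish({"type": "TaskRemainingHoursUpdated", "task_id": "t1", "remaining_hours": 0})
publish(None)
inbox.join()                   # wait until the worker has drained the inbox
thread.join(timeout=5)
```

An in-memory inbox like this loses its messages if the process dies, so whether a durable broker is worth having is a separate question from the consistency rule itself.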

Queue Fairness and Messaging Servers

I'm looking to solve a problem that I have with the FIFO nature of messaging servers and queues. In some cases, I'd like to distribute the messages in a queue to the pool of consumers on a criterion other than the order in which they were delivered. Ideally, this would prevent users from hogging shared resources in the system. Take this overly simplified scenario:
There is a feature within an application where a user can empty their trash can.
This event dispatches a DELETE message for each item in trash can
The consumers for this queue invoke a web service that has a rate limited API.
Given that each user can have very large volumes of messages in their trash can, what options do we have to allow concurrent processing of each trash can without regard to the enqueue time? It seems to me that there are a few obvious solutions:
Create a separate queue and pool of consumers for each user
Randomize the message delivery from a single queue to a single pool of consumers
In our case, creating a separate queue and managing the consumers for each user really isn't practical. It can be done but I think I really prefer the second option if it's reasonable. We're using RabbitMQ but not necessarily tied to it if there is a technology more suited to this task.
I'm entertaining the idea of using Rabbit's message priorities to help randomize delivery. By randomly assigning a message a priority between 1 and 10, this should help distribute the messages. The problem with this method is that the messages with the lowest priority may be stuck in the queue forever if the queue is never completely emptied. I thought I could use a TTL on the message and then re-queue the message with an escalated priority but I noticed this in the docs:
Messages which should expire will still only expire from the head of
the queue. This means that unlike with normal queues, even per-queue
TTL can lead to expired lower-priority messages getting stuck behind
non-expired higher priority ones. These messages will never be
delivered, but they will appear in queue statistics.
I fear that I may be heading down the rabbit hole with this approach. I wonder how others are solving this problem. Any feedback on creative routing, messaging patterns, or any alternative solutions would be appreciated.
So I ended up taking a page out of the network router handbook. This is a problem that routers need to solve to allow fair traffic patterns. This video has a good breakdown of the problem and the solution.
The translation of the problem into my domain:
And the solution:
The load balancer is a wrapper around a channel and a known number of queues that uses a weighted algorithm to balance between messages received on each queue. We found a really interesting article/implementation that seems to be working well so far.
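The linked implementation is not reproduced here. As a generic illustration of the idea only, the following sketch does a plain weighted round-robin over in-memory queues; the `WeightedBalancer` class and the workspace names are hypothetical:

```python
from collections import deque


class WeightedBalancer:
    """Pulls from several queues, visiting each in proportion to its weight."""

    def __init__(self):
        self._queues = {}   # name -> deque of pending messages
        self._weights = {}  # name -> messages allowed per round

    def add_queue(self, name, weight=1):
        self._queues[name] = deque()
        self._weights[name] = weight

    def publish(self, name, message):
        self._queues[name].append(message)

    def set_weight(self, name, weight):
        # Weights can be changed after messages were published.
        self._weights[name] = weight

    def drain(self):
        # Yield messages round by round until every queue is empty.
        while any(self._queues.values()):
            for name, pending in self._queues.items():
                for _ in range(self._weights[name]):
                    if not pending:
                        break
                    yield pending.popleft()


balancer = WeightedBalancer()
balancer.add_queue("workspace-a", weight=1)
balancer.add_queue("workspace-b", weight=2)
for i in range(3):
    balancer.publish("workspace-a", f"a{i}")
    balancer.publish("workspace-b", f"b{i}")
order = list(balancer.drain())
```

Here `workspace-b` gets two deliveries per round to `workspace-a`'s one, which is the kind of after-the-fact prioritisation described above.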
With this solution, I can also prioritize workspaces after messages have been published to increase their throughput. That's a really nice feature.
The biggest challenge ahead of me is management of the queues. There will be too many queues to leave bound to the exchange for an extended period of time. I'm working on some tools to manage their lifecycle.
One solution could be to interpose a Resequencer. The principle is outlined in the diagram in that link. In your case, something like:
The app dispatches its DELETE messages into the delete queue as originally.
The Resequencer (a new component you write) is interposed between the original publishers and original consumers. It:
pulls messages off the DELETE queue into memory
places them into (in-memory) queues-by-user
republishes them to a new queue (eg FairPriorityDeleteQueue), round-robinning to interleave fairly any messages from different original users
limits its republish rate into FairPriorityDeleteQueue, either such that the length of FairPriorityDeleteQueue (obtainable via polling the rabbitmq management api periodically) never exceeds some integer you choose N, or limited to some rate related to the rate-limited delete API the consumers use.
doesn't ack any message it pulled off the original DELETE queue, until it's republished it to FairPriorityDeleteQueue (so you never lose a message)
The original consumers subscribe instead to FairPriorityDeleteQueue.
You set the preFetchCount on these consumers fairly low (<10), to prevent them in turn bulk-buffering the contents of FairPriorityDeleteQueue in memory.
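The interleaving step at the heart of such a Resequencer could look roughly like this. It is an in-memory sketch with no broker calls; the `resequence` function and its `max_batch` cap (standing in for the length limit N) are assumptions:

```python
from collections import OrderedDict, deque


def resequence(messages, max_batch):
    """Interleave (user, item) messages round-robin by user.

    `messages` is what was pulled off the original DELETE queue;
    `max_batch` caps how many are republished in this pass.
    """
    per_user = OrderedDict()            # user -> deque of that user's messages
    for user, item in messages:
        per_user.setdefault(user, deque()).append((user, item))

    republished = []
    while per_user and len(republished) < max_batch:
        for user in list(per_user):
            if len(republished) >= max_batch:
                break
            republished.append(per_user[user].popleft())
            if not per_user[user]:
                del per_user[user]
    # Anything still in per_user stays buffered (and un-acked) for the next pass.
    leftover = [m for q in per_user.values() for m in q]
    return republished, leftover


incoming = [("u1", 1), ("u1", 2), ("u1", 3), ("u2", 1), ("u2", 2)]
fair, held_back = resequence(incoming, max_batch=4)
```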
--
Some points to watch:
Rate- or length-limiting publishing into and/or drawing messages out of FairPriorityDeleteQueue is essential. If you don't limit, Resequencer may just hand messages on as fast as it receives them, limiting the potential for resequencing.
Resequencer of course acts as a kind of in-memory buffer while resequencing. If the original publishers can publish very large numbers of messages in to the queue suddenly, you may need to memory-limit the Resequencer process so that it doesn't ingest more than it can hold.
Your particular scenario is greatly helped by the fact that you have an external factor (the final delete API) limiting throughput. Without such an extrinsic limiting factor, it is much harder to choose the optimum parameters for such a resequencer, to balance throughput-versus-resequencing in a particular environment.
I don't think a resequencer is needed in this case. Maybe it is, if you need to ensure the items are deleted in a specific order. But that only comes into play when you send multiple messages at roughly the same time and need to guarantee order on the consumer end.
You should also avoid the timeout scenario, for the reasons you've mentioned. A timeout is meant to tell RabbitMQ that a message doesn't need to be processed, or that it needs to be routed to a dead letter queue so that it can be processed by some other code. While you might be able to make a timeout work, I don't think it's a good choice.
Priorities may solve part of the problem, but could introduce a scenario where files never get processed. if you have a priority 1 message sitting back in the queue somewhere, and you keep putting priority 2, 3, 5, 10, etc. into the queue, the 1 might not be processed. the timeout doesn't solve this, as you've noted.
For my money, I would suggest a different approach: sending delete requests serially, for a single file.
that is, send 1 message to delete 1 file. wait for a response to say it's done. then send the next message to delete the next file.
here's why i think that will work, and how to manage it:
Long-Running Workflow, Single File Delete Requests
In this scenario, I would suggest taking a multi-step approach to the problem using the idea of a "saga" (aka a long-running workflow object).
when a user requests to delete their trashcan, you send a single message through rabbitmq to the service that can handle the delete process. that service creates an instance of the saga for that user's trashcan.
the saga gathers a list of all files in the trashcan that need to be deleted. then it starts to send the requests to delete the individual files, one at a time.
with each request to delete a single file, the saga waits for the response to say the file was deleted.
when the saga receives the message to say the previous file has been deleted, it sends out the next request to delete the next file.
once all the files are deleted, the saga updates itself and any other part of the system to say the trash can is empty.
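The steps above can be sketched as a small state machine. This is only an illustration: `TrashcanSaga` and the `send_delete` callable are hypothetical, and a real saga would also persist its state between messages:

```python
class TrashcanSaga:
    """Deletes one file at a time, advancing only on confirmation."""

    def __init__(self, user, files, send_delete):
        self.user = user
        self.pending = list(files)
        self.send_delete = send_delete  # callable that publishes a delete request
        self.done = False

    def start(self):
        self._send_next()

    def on_file_deleted(self, filename):
        # Called when the confirmation message for `filename` arrives.
        self._send_next()

    def _send_next(self):
        if self.pending:
            self.send_delete(self.user, self.pending.pop(0))
        else:
            self.done = True  # here the rest of the system would be notified


sent = []
saga = TrashcanSaga("u1", ["a.txt", "b.txt"], lambda user, f: sent.append((user, f)))
saga.start()
sent_after_start = list(sent)
saga.on_file_deleted("a.txt")
saga.on_file_deleted("b.txt")
```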
Handling Multiple Users
When you have a single user requesting a delete, things will happen fairly quickly for them. they will get their trash emptied soon.
u1 = User 1 Trashcan Delete Request
|u1|u1|u1|u1|u1|u1|u1|u1|u1|u1done|
when you have multiple users requesting a delete, the process of sending one file delete request at a time means each user will have an equal chance of getting the next file delete.
u1 = User 1 Trashcan Delete Request
u2 = User 2 Trashcan Delete Request
|u1|u2|u1|u1|u2|u2|u1|u2|u1|u2|u2|u1|u1|u1|u2|u2|u1|u2|u1|u1done|u2|u2done|
This way, there will be shared use of the resources to delete the files. Over-all, it will take a little longer for each person's trashcan to be emptied, but they will see progress sooner and that's an important aspect of people thinking the system is fast / responsive to their request.
Optimizing Small File Set vs Large File Set
In a scenario where you have a small number of users with a small number of files, the above solution may prove to be slower than if you deleted all the files at once. after all, there will be more messages sent across rabbitmq - at least 2 for every file that needs to be deleted (one delete request, one delete confirmation response)
To optimize this further, you could do a couple of things:
have a minimum trashcan size before you split up the work like this. below that minimum, you just delete it all at once
chunk the work into groups of files, instead of one at a time. maybe 10 or 100 files would be a better group size, than 1 file at a time
Either (or both) of these solutions would help to improve the over-all performance of the process by reducing the number of messages being sent, and batching the work a bit.
You would need to do some testing in your real scenario to see which of these (or maybe both) would help and at what settings.
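Both optimizations fit in one small helper. The thresholds below (50 and 10) are placeholders rather than recommendations, and `plan_delete_batches` is a hypothetical name:

```python
def plan_delete_batches(files, min_split_size=50, chunk_size=10):
    """Return the list of batches to send, one delete message per batch."""
    files = list(files)
    if len(files) < min_split_size:
        # Small trash can: a single message deletes everything.
        return [files] if files else []
    return [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]


small = plan_delete_batches(range(5))
large = plan_delete_batches(range(120))
```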
Many Users Problem
There's one additional problem you may face - many users. If you have 2 or 3 users requesting deletes, it won't be a big deal.
But if you have 100 or 1000 users requesting deletes, it could take a very long time for an individual to get their trashcan emptied.
You may need to have a higher level controlling process for this situation, where all requests to empty trashcans would be managed by yet another Saga. This saga would rate-limit the number of active trashcan-deletion sagas.
For example, if you have 10 active requests for deleting trashcans, the rate-limiting saga would only start 3 of them and it would wait for one to finish before starting the next one.
Again, you would need to test your actual scenario to see if this is needed and see what the limits should be, for performance reasons.
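A sketch of such a controlling process, using the limit of 3 from the example above. `TrashcanCoordinator` and the `start_saga` callable are hypothetical names:

```python
from collections import deque


class TrashcanCoordinator:
    """Starts at most `max_active` trash-can deletions at a time."""

    def __init__(self, max_active, start_saga):
        self.max_active = max_active
        self.start_saga = start_saga   # callable that starts one user's saga
        self.active = set()
        self.waiting = deque()

    def request(self, user):
        self.waiting.append(user)
        self._fill()

    def on_saga_finished(self, user):
        self.active.discard(user)
        self._fill()

    def _fill(self):
        # Promote waiting requests, oldest first, while there is capacity.
        while self.waiting and len(self.active) < self.max_active:
            user = self.waiting.popleft()
            self.active.add(user)
            self.start_saga(user)


started = []
coordinator = TrashcanCoordinator(max_active=3, start_saga=started.append)
for n in range(10):
    coordinator.request(f"u{n}")
started_initially = list(started)
coordinator.on_saga_finished("u1")
```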
There may be additional scenarios that have to be considered in your actual scenario, but I hope this gets you down the path! :)

RabbitMQ: throttling fast producer against large queues with slow consumer

We're currently using RabbitMQ, where a continuously super-fast producer is paired with a consumer limited by a limited resource (e.g. slow-ish MySQL inserts).
We don't like declaring a queue with x-max-length, since all messages will be dropped or dead-lettered once the limit is reached, and we don't want to lose messages.
Adding more consumers is easy, but they'll all be limited by the one shared resource, so that won't work. The problem still remains: How to slow down the producer?
Sure, we could put a flow control flag in Redis, memcached, MySQL or something else that the producer reads as pointed out in an answer to a similar question, or perhaps better, the producer could periodically test for queue length and throttle itself, but these seem like hacks to me.
I'm mostly questioning whether I have a fundamental misunderstanding. I had expected this to be a common scenario, and so I'm wondering:
What is best practice for throttling producers? How is this done with RabbitMQ? Or do you do this in a completely different way?
Background
Assume the producer actually knows how to slow himself down with the right input. E.g. a hardware sensor or hardware random number generator, that can generate as many events as needed.
In our particular real case, we have an API that users can use to add messages. Instead of devouring and discarding messages, we'd like to apply back-pressure by having our API return an error if the queue is "full", so the caller/user knows to back-off, or have the API block until the consumer catches up. We don't control our user, so regardless of how fast the consumer is, I can create a producer that is faster.
I was hoping for something like the API for a TCP socket, where a write() can block and where a select() can be used to determine if a handle is writable. So either having the RabbitMQ API block or have it return an error if the queue is full.
For the x-max-length property, you said you don't want messages to be dropped or dead-lettered. I see there was an update adding some more capabilities for this, as specified in the documentation:
"Use the overflow setting to configure queue overflow behaviour. If overflow is set to reject-publish, the most recently published messages will be discarded. In addition, if publisher confirms are enabled, the publisher will be informed of the reject via a basic.nack message"
So as I understand it, you can use queue limit to reject the new messages from publishers thus pushing some backpressure to the upstream.
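The following is not RabbitMQ client code. It is an in-process analogy using Python's standard `queue.Queue`, where `maxsize` plays the role of x-max-length and a rejected `put_nowait` plays the role of a rejected publish. The `api_add_message` handler and its status codes are assumptions:

```python
import queue

MAX_LENGTH = 2                      # plays the role of x-max-length
pending = queue.Queue(maxsize=MAX_LENGTH)


def api_add_message(message):
    """Hypothetical API handler: returns an HTTP-style status code."""
    try:
        pending.put_nowait(message)  # like a publish that may be rejected
    except queue.Full:
        return 429                   # tell the caller to back off
    return 202


statuses = [api_add_message(f"m{i}") for i in range(3)]
pending.get_nowait()                 # the consumer catches up by one message
status_after_drain = api_add_message("m3")
```

Swapping `put_nowait` for a blocking `put` with a timeout would give the "block until the consumer catches up" behaviour instead of an error.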
I don't think that this is in any way rabbitmq specific. Basically you have a scenario, where there are two systems of different processing capabilities, and this mismatch will either pose a risk of overflowing the queue (whatever it would be), or even in case of a constant mismatch between producer and consumer, simply create more and more time-distance between event creation and its handling.
I used to deal with this kind of scenario, and unfortunately there is no magic bullet. You either have to speed up event handling (better hardware, more suitable software?) or throttle the event creation (which has nothing to do with MQ really).
Now, I would ask you what the goal is and how the events are produced. Are the events produced constantly, at either an unlimited or just a very high rate (for example readings from sensors: the more, the better), or are they created in batches/spikes (for example: user requests in specific time periods, batch loads from a CRM system)? I assume that the goal is to process everything, because you mention you don't want to lose any queued message.
If the output is constant, then some limiter (either an internal counter, if the producer is the only producer, or external queue length checks if the queue can be filled by some other system) is definitely in order.
IF eventsInTimePeriod / timePeriod > estimatedConsumerBandwidth
    THEN LowerRate()
    ELSE RaiseRate()
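One possible Python reading of that rule, assuming a multiplicative step and a floor on the rate (both are choices of this sketch, as is the `adjust_rate` name):

```python
def adjust_rate(current_rate, events_in_period, period_seconds,
                consumer_bandwidth, step=0.1, min_rate=1.0):
    """Return the producer's next target rate in events per second."""
    observed = events_in_period / period_seconds
    if observed > consumer_bandwidth:
        return max(min_rate, current_rate * (1 - step))  # lower the rate
    return current_rate * (1 + step)                      # raise the rate


slower = adjust_rate(100.0, events_in_period=1200, period_seconds=10,
                     consumer_bandwidth=80.0)
faster = adjust_rate(100.0, events_in_period=500, period_seconds=10,
                     consumer_bandwidth=80.0)
```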
In real world scenarios we used to simply limit the output manually to the estimated values, and there were some alerts set for queue length, time from queue entry to queue exit, etc. Where such limiters were omitted (mostly by mistake), we would later find tasks that were supposed to be handled in a few hours but had been waiting three months for their turn.
I'm afraid it's hard to answer "How to slow down the producer?" if we know nothing about it, but some ideas are the aforementioned rate check, or maybe a blocking AddMessage method:
AddMessage(message)
    WHILE (getQueueLength() > maxAllowedQueueLength)
        spin(1000); // or sleep or whatever
    mqAdapter.AddMessage(message)
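The same idea as runnable Python, with the queue-length lookup and the publish call passed in as callables because the real adapter is not known here. All parameter names are assumptions:

```python
import time


def add_message_blocking(message, get_queue_length, publish,
                         max_queue_length, poll_seconds=1.0, sleep=time.sleep):
    """Wait until the queue is at or below the limit, then publish."""
    while get_queue_length() > max_queue_length:
        sleep(poll_seconds)          # back off instead of busy-spinning
    publish(message)


lengths = iter([5, 4, 2])            # pretend the consumer is catching up
naps = []
published = []
add_message_blocking("hello", lambda: next(lengths), published.append,
                     max_queue_length=3, sleep=naps.append)
```

With several producers, the check and the publish are not atomic, so the limit is approximate rather than strict.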
I'd say it all depends on the specifics of the producer application and, in general, on your architecture.

How do I coalesce processing of related events in NServiceBus?

I have a situation where I have a service subscribing to event messages and performing some work when they arrive. There is a certain class of events which can arrive in short bursts of many events which reference the same underlying data. I would like to be able to defer processing of related events for a short period of time, so that I only do the calculation once for each batch of related events, rather than in response to each individual event. Is there some kind of pattern I can follow which will allow me to collect related events for a period of time and then process them all at once? I was thinking a saga + timeout might be able to achieve this, but not sure if this is an appropriate use for that.
Thanks!
Yes, a saga could be the way to go - however consider the performance of the saga persistence (NHibernate over a DB in the current version, RavenDB in the next version) as compared to your fault-tolerance needs (if a machine crashes, would it be acceptable to lose some messages).
No easy answers, I'm afraid.
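For what the collect-then-process idea amounts to, independent of NServiceBus, here is a small in-memory sketch. It does not use any NServiceBus API. In a saga, the timeout message would play the role of the `due` call, and the `Coalescer` class is hypothetical:

```python
class Coalescer:
    """Collects events per key and releases each batch after a fixed window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self._batches = {}   # key -> (deadline, [events])

    def add(self, key, event, now):
        # The first event for a key opens the window; later ones join the batch.
        deadline, events = self._batches.get(key, (now + self.window, []))
        events.append(event)
        self._batches[key] = (deadline, events)

    def due(self, now):
        # Return and forget every batch whose window has elapsed.
        ready = {k: ev for k, (d, ev) in self._batches.items() if d <= now}
        for key in ready:
            del self._batches[key]
        return ready


c = Coalescer(window_seconds=5)
c.add("order-42", "price changed", now=0)
c.add("order-42", "qty changed", now=2)
c.add("order-7", "price changed", now=4)
first = c.due(now=5)
second = c.due(now=9)
```

Because the batches live only in memory, a crash loses whatever is pending, which is the fault-tolerance trade-off mentioned above.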