How to know when a set of RabbitMQ tasks is complete?

I am using RabbitMQ to have worker processes encode video files. I would like to know when all of the files are complete - that is, when all of the worker processes have finished.
The only way I can think to do this is by using a database. When a video finishes encoding:
UPDATE videos SET status = 'complete' WHERE filename = 'foo.wmv'
-- etc etc etc as each worker finishes --
And then to check whether or not all of the videos have been encoded:
SELECT count(*) FROM videos WHERE status != 'complete'
But if I'm going to do this, then I feel like I am losing the benefit of RabbitMQ as a mechanism for multiple distributed worker processes, since I still have to manually maintain a database queue.
Is there a standard mechanism for RabbitMQ dependencies? That is, a way to say "wait for these 5 tasks to finish, and once they are done, then kick off a new task?"
I don't want to have a parent process add these tasks to a queue and then "wait" for each of them to return a "completed" status. Then I have to maintain a separate process for each group of videos, at which point I've lost the advantage of decoupled worker processes as compared to a single ThreadPool concept.
Am I asking for something which is impossible? Or, are there standard widely-adopted solutions to manage the overall state of tasks in a queue that I have missed?
Edit: after searching, I found this similar question: Getting result of a long running task with RabbitMQ
Are there any particular thoughts that people have about this?

Use a "response" queue. I don't know any specifics about RabbitMQ, so this is general:
Have your parent process send out requests and keep track of how many it sent
Make the parent process also wait on a specific response queue (that the children know about)
Whenever a child finishes something (or can't finish for some reason), send a message to the response queue
Whenever numSent == numResponded, you're done
Something to keep in mind is a timeout -- What happens if a child process dies? You have to do slightly more work, but basically:
With every sent message, include some sort of ID, and add that ID and the current time to a hash table.
For every response, remove that ID from the hash table
Periodically walk the hash table and remove anything that has timed out
This is called the Request Reply Pattern.
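The bookkeeping described above can be sketched in a few lines. This is only an illustration in Python, not RabbitMQ code: `BatchTracker` and its method names are hypothetical, and the parent process is assumed to call `sent` when it publishes a request, `responded` when a message arrives on the response queue, and `expire` on a timer.

```python
import time


class BatchTracker:
    """Tracks outstanding request IDs for one batch (hypothetical helper)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.pending = {}        # request ID -> time the request was sent
        self.timed_out = set()   # IDs that never got a reply in time

    def sent(self, request_id, now=None):
        # Record the ID and the send time (the "hash table" from the answer).
        self.pending[request_id] = time.monotonic() if now is None else now

    def responded(self, request_id):
        # Late or duplicate replies for unknown IDs are simply ignored.
        self.pending.pop(request_id, None)

    def expire(self, now=None):
        # Walk the table and drop anything older than the timeout.
        now = time.monotonic() if now is None else now
        expired = [rid for rid, sent_at in self.pending.items()
                   if now - sent_at > self.timeout_seconds]
        for rid in expired:
            del self.pending[rid]
            self.timed_out.add(rid)
        return expired

    def done(self):
        # The batch is finished once nothing is outstanding.
        return not self.pending
```

When `done()` returns true, the parent can check `timed_out` to decide whether the batch succeeded or whether some files need to be resubmitted.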

Based on Brendan's extremely helpful answer, which should be accepted, I knocked up this quick diagram which may be helpful to some.

I have implemented a workflow where the workflow state machine is implemented as a series of queues. A worker receives a message on one queue, processes the work, and then publishes the same message onto another queue. Then another type of worker process picks up that message, etc.
In your case, it sounds like you need to implement one of the patterns from Enterprise Integration Patterns (that is a free online book) and have a simple worker that collects messages until a set of work is done, and then processes a single message to a queue representing the next step in the workflow.
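A rough sketch of such a collecting worker, in Python, might look like the following. It assumes (this is not from the answer) that every message carries a `batch_id`, a `batch_size` and an `item` name, and that `publish_next` is whatever function publishes to the queue for the next workflow step.

```python
class Aggregator:
    """Collects per-item 'finished' messages until a batch is complete."""

    def __init__(self, publish_next):
        self.publish_next = publish_next  # assumed: publishes to the next queue
        self.seen = {}                    # batch ID -> set of finished item names

    def on_message(self, message):
        batch_id = message["batch_id"]
        finished = self.seen.setdefault(batch_id, set())
        # A set makes a redelivered message for the same item harmless.
        finished.add(message["item"])
        if len(finished) == message["batch_size"]:
            del self.seen[batch_id]
            # One message for the whole batch goes on to the next step.
            self.publish_next({"batch_id": batch_id, "items": sorted(finished)})
            return True
        return False
```

The state here lives in memory, so a restart of the collecting worker would lose partial batches; a real deployment would probably need to persist it somewhere.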

Related

RabbitMQ workers with unique key

I'm thinking of using RabbitMQ for a new project (with little RabbitMQ experience of my own) to solve the following problem:
Upon an event, a long running computation has to be performed. The "work queue" pattern as described in https://www.rabbitmq.com/tutorials/tutorial-two-python.html seems to be perfect, but I want an additional twist: I want no two jobs with the same routing key (or some part of the payload or metadata, however that ends up being implemented) running on the workers at the same time. In other words: when one worker is processing job XY, and another job XY is queued, the message XY must not be delivered to a new idle worker until the running worker has completed the job.
What would be the best strategy to implement that? The only real solution I came up with was that when a worker gets a job, it has to check with all other workers if they are currently processing a similar job, and if so, reject the message (for requeueing).
Depending on your architecture there are two approaches to your problem.
1) The consumers share a cache of tasks under process and if a job of the same type shows up, they reject or requeue it.
This requires a shared cache to be maintained and a bit of logic on the consumers' side.
The side effect is that duplicated jobs will keep returning to the consumers in case of rejection, while in case of requeueing they will be processed with an unpredictable delay (depending on how big the queue is).
2) You use the deduplication plugin on the queue.
You won't need any additional cache, only a few lines of code on the publisher side.
The downside of this approach is that duplicated messages will be dropped. If you want them to be delivered, you will need to instruct the publisher to retry when it receives a negative acknowledgment.
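As an illustration of the shared-cache approach, here is a minimal Python sketch. The names `InFlightKeys` and `handle` are hypothetical, and an in-process set guarded by a lock stands in for the cache; with workers in separate processes or on separate machines the cache would have to live in some external store instead.

```python
import threading


class InFlightKeys:
    """Stand-in for the shared cache of job keys currently being processed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keys = set()

    def try_acquire(self, key):
        # Atomically claim the key; fail if another worker already holds it.
        with self._lock:
            if key in self._keys:
                return False
            self._keys.add(key)
            return True

    def release(self, key):
        with self._lock:
            self._keys.discard(key)


def handle(job_key, work, in_flight):
    """Return 'ack' if the job ran, 'requeue' if a same-key job is running."""
    if not in_flight.try_acquire(job_key):
        return "requeue"
    try:
        work()
        return "ack"
    finally:
        # Always free the key, even if the job raised.
        in_flight.release(job_key)
```

The consumer callback would translate the returned string into the broker's acknowledge or reject-with-requeue call.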

Consume several queues, one item at a time among a pool of workers

My users are editing data collaboratively on the web. My product needs their edits to be made atomic. I can't guarantee it at the database level, so I would like the updates to be performed one at a time.
Here is what I would need in order to process multiple documents in parallel:
Let's say we have two documents A and B
1) The queue server starts empty
2) 1 user submits an update for document A
3) The queue server receives the update, creates QueueA and puts the update in it
4) 3 other users submit updates to document A, which are queued in QueueA
5) 2 other users submit changes for document B, which are queued in new queue QueueB
6) The worker pool is started.
7) Worker1 makes a request, the first message of QueueA is delivered (although it would not be an issue if it was the message in QueueB first). QueueA is marked as busy until it gets a response
8) Another worker makes a request, the item from QueueB is returned. QueueB is marked as busy.
9) On the third request, nothing is returned as both queues are busy.
10) The first worker finishes its task, calls the broker and QueueA is not busy anymore.
11) A worker makes a request, it should get the message from QueueA.
12) The worker handling QueueB times out, which frees QueueB for message consumption.
I have started to read about RabbitMQ, AWS SQS/SNS, Kafka... I am not very knowledgeable in that field, but to my great surprise I haven't been able to find a system matching my requirements on the web.
For now I don't know if my design has issues I haven't seen, or if I just haven't found the right keyword or software for my use... Scalability should be easy, which is why I have looked at these tools.
How could I easily implement this design?
This is an application design question that is hard to accurately address in a Stack Overflow answer. What you are doing sounds like async processing of data using a queue to buffer as well as scale. The scale part is easy: you add more consumers (aka running service processes) and requests can be processed individually in parallel.
I think the best way to think of the problem is to break it down into individual steps of data processing and use the queues as on and off ramps into other distinct processes. More than that, and I'd need some whiteboard time to walk through the entire problem space.
ActiveMQ and RabbitMQ sound like more of a fit here. Pressed to recommend one, I tend to lean towards ActiveMQ because it's Java-based and most shops know how to monitor and support Java-based apps. SQS is limited, and given that this sounds like business data, using HTTP as transport is not a robust solution. Kafka doesn't sound like a fit here.
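For what it's worth, the "one queue per document, marked busy while a worker holds it" behaviour from the numbered steps in the question can be modelled in memory like this. It is a toy Python sketch with a hypothetical `KeyedBroker` class, not something any of the brokers mentioned provides out of the box.

```python
from collections import deque


class KeyedBroker:
    """Toy model: one FIFO per document, at most one in-flight item per FIFO."""

    def __init__(self):
        self.queues = {}   # document ID -> deque of pending updates
        self.busy = set()  # document IDs currently held by a worker

    def publish(self, doc_id, update):
        # Creates the per-document queue on first use (step 3).
        self.queues.setdefault(doc_id, deque()).append(update)

    def request(self):
        # Hand out the oldest update of the first non-busy, non-empty queue.
        for doc_id, pending in self.queues.items():
            if pending and doc_id not in self.busy:
                self.busy.add(doc_id)
                return doc_id, pending.popleft()
        return None  # everything is busy or empty (step 9)

    def release(self, doc_id):
        # Called when the worker finishes or times out (steps 10 and 12).
        self.busy.discard(doc_id)
```

Note that this sketch removes the update as soon as it is handed out, so an update held by a worker that times out would be lost; a real implementation would keep it until the worker confirms completion.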

Queue Fairness and Messaging Servers

I'm looking to solve a problem that I have with the FIFO nature of messaging servers and queues. In some cases, I'd like to distribute the messages in a queue to the pool of consumers on a criterion other than the order in which the messages were enqueued. Ideally, this would prevent users from hogging shared resources in the system. Take this overly simplified scenario:
There is a feature within an application where a user can empty their trash can.
This event dispatches a DELETE message for each item in trash can
The consumers for this queue invoke a web service that has a rate limited API.
Given that each user can have very large volumes of messages in their trash can, what options do we have to allow concurrent processing of each trash can without regard to the enqueue time? It seems to me that there are a few obvious solutions:
Create a separate queue and pool of consumers for each user
Randomize the message delivery from a single queue to a single pool of consumers
In our case, creating a separate queue and managing the consumers for each user really isn't practical. It can be done but I think I really prefer the second option if it's reasonable. We're using RabbitMQ but not necessarily tied to it if there is a technology more suited to this task.
I'm entertaining the idea of using Rabbit's message priorities to help randomize delivery. By randomly assigning a message a priority between 1 and 10, this should help distribute the messages. The problem with this method is that the messages with the lowest priority may be stuck in the queue forever if the queue is never completely emptied. I thought I could use a TTL on the message and then re-queue the message with an escalated priority but I noticed this in the docs:
Messages which should expire will still only expire from the head of
the queue. This means that unlike with normal queues, even per-queue
TTL can lead to expired lower-priority messages getting stuck behind
non-expired higher priority ones. These messages will never be
delivered, but they will appear in queue statistics.
I fear that I may be heading down the rabbit hole with this approach. I wonder how others are solving this problem. Any feedback on creative routing, messaging patterns, or any alternative solutions would be appreciated.
So I ended up taking a page out of the network router handbook. This is a problem that routers need to solve to allow fair traffic patterns. This video has a good breakdown of the problem and the solution.
The translation of the problem into my domain:
And the solution:
The load balancer is a wrapper around a channel and a known number of queues that uses a weighted algorithm to balance between messages received on each queue. We found a really interesting article/implementation that seems to be working well so far.
With this solution, I can also prioritize workspaces after messages have been published to increase their throughput. That's a really nice feature.
The biggest challenge ahead of me is management of the queues. There will be too many queues to leave bound to the exchange for an extended period of time. I'm working on some tools to manage their lifecycle.
One solution could be to interpose a Resequencer. The principle is outlined in the diagram in that link. In your case, something like:
The app dispatches its DELETE messages into the delete queue as originally.
The Resequencer (a new component you write) is interposed between the original publishers and original consumers. It:
pulls messages off the DELETE queue into memory
places them into (in-memory) queues-by-user
republishes them to a new queue (eg FairPriorityDeleteQueue), round-robinning to interleave fairly any messages from different original users
limits its republish rate into FairPriorityDeleteQueue, either such that the length of FairPriorityDeleteQueue (obtainable by polling the RabbitMQ management API periodically) never exceeds some integer N that you choose, or limited to some rate related to the rate-limited delete API the consumers use.
doesn't ack any message it pulled off the original DELETE queue, until it's republished it to FairPriorityDeleteQueue (so you never lose a message)
The original consumers subscribe instead to FairPriorityDeleteQueue.
You set the preFetchCount on these consumers fairly low (<10), to prevent them in turn bulk-buffering the contents of FairPriorityDeleteQueue in memory.
--
Some points to watch:
Rate- or length-limiting publishing into and/or drawing messages out of FairPriorityDeleteQueue is essential. If you don't limit, Resequencer may just hand messages on as fast as it receives them, limiting the potential for resequencing.
Resequencer of course acts as a kind of in-memory buffer while resequencing. If the original publishers can suddenly publish very large numbers of messages into the queue, you may need to memory-limit the Resequencer process so that it doesn't ingest more than it can hold.
Your particular scenario is greatly helped by the fact that you have an external factor (the final delete API) limiting throughput. Without such an extrinsic limiting factor, it is much harder to choose the optimum parameters for such a resequencer, to balance throughput-versus-resequencing in a particular environment.
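The core of the Resequencer, the round-robin interleaving of the in-memory queues-by-user, could look roughly like this in Python. The function name and the `(user, payload)` message shape are assumptions made for the sketch; rate limiting and acknowledgements are left out.

```python
from collections import deque


def interleave_by_user(messages):
    """Reorder (user, payload) pairs round-robin across users.

    Order within each user is preserved; users take turns in the order
    in which they first appeared.
    """
    per_user = {}
    for user, payload in messages:
        per_user.setdefault(user, deque()).append((user, payload))

    out = []
    while per_user:
        # One pass gives every user with pending messages exactly one slot.
        for user in list(per_user):
            pending = per_user[user]
            out.append(pending.popleft())
            if not pending:
                del per_user[user]
    return out
```

For example, three messages from `u1` followed by two from `u2` come out as `u1, u2, u1, u2, u1`, so `u2` no longer waits behind the whole of `u1`'s backlog.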
I don't think a resequencer is needed in this case. Maybe it is, if you need to ensure the items are deleted in a specific order. But that only comes into play when you send multiple messages at roughly the same time and need to guarantee order on the consumer end.
You should also avoid the timeout scenario, for the reasons you've mentioned. Timeout is meant to tell RabbitMQ that a message doesn't need to be processed, or that it needs to be routed to a dead letter queue so that it can be processed by some other code. While you might be able to make timeout work, I don't think it's a good choice.
Priorities may solve part of the problem, but could introduce a scenario where files never get processed. If you have a priority 1 message sitting back in the queue somewhere, and you keep putting priority 2, 3, 5, 10, etc. into the queue, the 1 might not be processed. The timeout doesn't solve this, as you've noted.
For my money, I would suggest a different approach: sending delete requests serially, for a single file.
That is, send 1 message to delete 1 file. Wait for a response to say it's done. Then send the next message to delete the next file.
Here's why I think that will work, and how to manage it:
Long-Running Workflow, Single File Delete Requests
In this scenario, I would suggest taking a multi-step approach to the problem using the idea of a "saga" (aka a long-running workflow object).
When a user requests to delete their trashcan, you send a single message through RabbitMQ to the service that can handle the delete process. That service creates an instance of the saga for that user's trashcan.
The saga gathers a list of all files in the trashcan that need to be deleted. Then it starts to send the requests to delete the individual files, one at a time.
With each request to delete a single file, the saga waits for the response to say the file was deleted.
When the saga receives the message to say the previous file has been deleted, it sends out the next request to delete the next file.
Once all the files are deleted, the saga updates itself and any other part of the system to say the trash can is empty.
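As a rough illustration of that flow, a saga for one trashcan could be modelled like this in Python. `TrashcanSaga` and the `send_delete_request` callback are hypothetical names; in practice the callback would publish to RabbitMQ and `on_deleted` would be driven by the confirmation messages.

```python
class TrashcanSaga:
    """Deletes one user's files strictly one at a time."""

    def __init__(self, user, files, send_delete_request):
        self.user = user
        self.remaining = list(files)     # files still to be requested
        self.send = send_delete_request  # assumed: publishes one delete request
        self.in_flight = None            # file we are waiting to hear about
        self.done = False

    def start(self):
        self._send_next()

    def on_deleted(self, filename):
        # Ignore stale or duplicate confirmations.
        if filename != self.in_flight:
            return
        self.in_flight = None
        self._send_next()

    def _send_next(self):
        if self.remaining:
            self.in_flight = self.remaining.pop(0)
            self.send(self.user, self.in_flight)
        else:
            # Nothing left: the trashcan is empty.
            self.done = True
```

Because each saga has at most one request outstanding, requests from several sagas naturally interleave on the shared queue.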
Handling Multiple Users
When you have a single user requesting a delete, things will happen fairly quickly for them. They will get their trash emptied soon.
u1 = User 1 Trashcan Delete Request
|u1|u1|u1|u1|u1|u1|u1|u1|u1|u1done|
When you have multiple users requesting a delete, the process of sending one file delete request at a time means each user will have an equal chance of getting the next file delete.
u1 = User 1 Trashcan Delete Request
u2 = User 2 Trashcan Delete Request
|u1|u2|u1|u1|u2|u2|u1|u2|u1|u2|u2|u1|u1|u1|u2|u2|u1|u2|u1|u1done|u2|u2done|
This way, there will be shared use of the resources to delete the files. Overall, it will take a little longer for each person's trashcan to be emptied, but they will see progress sooner, and that's an important aspect of people thinking the system is fast and responsive to their request.
Optimizing Small File Set vs Large File Set
In a scenario where you have a small number of users with a small number of files, the above solution may prove to be slower than if you deleted all the files at once. After all, there will be more messages sent across RabbitMQ: at least 2 for every file that needs to be deleted (one delete request, one delete confirmation response).
To optimize this further, you could do a couple of things:
Have a minimum trashcan size before you split up the work like this. Below that minimum, you just delete it all at once.
Chunk the work into groups of files, instead of one at a time. Maybe 10 or 100 files would be a better group size than 1 file at a time.
Either (or both) of these solutions would help to improve the overall performance of the process by reducing the number of messages being sent, and batching the work a bit.
You would need to do some testing in your real scenario to see which of these (or maybe both) would help and at what settings.
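Both optimizations fit in one small planning function. This is a Python sketch with a hypothetical name, and the default values are arbitrary placeholders rather than recommendations.

```python
def plan_delete_batches(files, chunk_size=10, min_split_size=50):
    """Split a trashcan into the batches that will each be sent as one request.

    Below min_split_size everything goes in a single batch; otherwise the
    files are cut into chunks of at most chunk_size.
    """
    files = list(files)
    if len(files) < min_split_size:
        # Small trashcan: delete it all at once (or nothing if it is empty).
        return [files] if files else []
    return [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]
```

The saga would then send one delete request per batch instead of one per file.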
Many Users Problem
There's one additional problem you may face - many users. If you have 2 or 3 users requesting deletes, it won't be a big deal.
But if you have 100 or 1000 users requesting deletes, it could take a very long time for an individual to get their trashcan emptied.
You may need to have a higher level controlling process for this situation, where all requests to empty trashcans would be managed by yet another Saga. This saga would rate-limit the number of active trashcan-deletion sagas.
For example, if you have 10 active requests for deleting trashcans, the rate-limiting saga would only start 3 of them and it would wait for one to finish before starting the next one.
Again, you would need to test your actual scenario to see if this is needed and see what the limits should be, for performance reasons.
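A minimal sketch of that controlling process, again in Python with hypothetical names: `SagaLimiter` keeps at most `max_active` trashcan sagas running and starts the next waiting one whenever an active one finishes.

```python
from collections import deque


class SagaLimiter:
    """Caps the number of trashcan-deletion sagas that are active at once."""

    def __init__(self, max_active, start_saga):
        self.max_active = max_active
        self.start_saga = start_saga  # assumed: kicks off the saga for one user
        self.active = set()
        self.waiting = deque()        # users queued in arrival order

    def request(self, user):
        self.waiting.append(user)
        self._fill()

    def finished(self, user):
        self.active.discard(user)
        self._fill()

    def _fill(self):
        # Start waiting sagas until the cap is reached.
        while self.waiting and len(self.active) < self.max_active:
            user = self.waiting.popleft()
            self.active.add(user)
            self.start_saga(user)
```

With `max_active=3` and 10 requests, three sagas start immediately and the other seven wait their turn, as in the example above.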
There may be additional scenarios that have to be considered in your actual scenario, but I hope this gets you down the path! :)

What Is The Best Way To Constantly Check And Process Items In A Queue?

I have part of my application that receives string messages from remote clients and decodes these into a _Message class. I then want to pass these messages into a queue for immediate processing. The FIFO method is exactly what I require as I would particularly prefer to process these messages in order of receiving.
These messages come in fairly constantly (24 hours a day, maybe 1 every couple of seconds or so...) so I need to ensure that I capture them all and no messages get lost or rejected.
Each _Message will then run through a Routine which will decode and action various parts of the message content.
Therefore, what would be the best way of handling a constant message pool? I started to go down the path of Queues with Queue.Enqueue and Queue.Dequeue but I'm not sure how to constantly poll for items within the Queue without affecting performance or resources.
I then came across ThreadPooling (something very new to me) which sounds like it could be down the right path, but I'm not 100% sure on how it works or how to set it up correctly.
Or....can I use ThreadPooling in conjunction with a Queue? And simply add items into my Queue and have the ThreadPool automatically detect new items?
Any help or guidance would be appreciated. Thanks

RabbitMQ - subscribe to message type as it gets created

I'm new to RabbitMQ and I'm wondering how to implement the following: a producer creates tasks for multiple sites, and there's a bunch of consumers that should process these tasks one by one, but only talking to each site with a concurrency of 1, without starting a new task for a site before the previous one has ended. This way slow sites would be processed slowly, and the fast ones fast (as opposed to slow sites taking up all the worker capacity).
Ideally a site would be processed only by one worker at a time, being replaced by another worker if it dies. This seems like a task for exclusive queues, but apparently there's no easy way to list and subscribe to new queues. What is the proper way to achieve such results with RabbitMQ?
I think you may have things the wrong way round. For workers you have 1 or more producers sending to 1 exchange. The exchange has 1 queue (you can send directly to the queue, but all that is really doing is going via a default exchange; I prefer to be explicit). All consumers connect to the single queue and read off tasks in turn. You should set the queue to require messages to be ACKed before removing them. That way, if a worker process dies, its unacknowledged message should be returned to the queue and picked up by the next consumer/worker.