I have jobs named A, B, C, and D. Job B has to start after job A has finished, so the order of jobs should look like this: A -> B -> C -> D.
I want to scale the number of workers for A, B, C, and D independently. Is there a way to implement this using RabbitMQ? I am basically looking for a way to create a series of jobs.
My current design looks like this:
The caller process creates seriesOfJobs: an array of JSON objects that describe jobs A, B, C, and D using the JSON-RPC protocol
The caller sends the seriesOfJobs to a seriesManager (a separate process) via RabbitMQ RPC and awaits a callback on mainCallbackQueue
The seriesManager parses seriesOfJobs, sends job A to workerA (a separate process) via RabbitMQ RPC, and awaits a callback on callbackQueueA
workerA performs job A and notifies the seriesManager via callbackQueueA
The seriesManager gets the callback from callbackQueueA and sends job B to workerB and awaits a callback, then the same for job C, then the same for job D
The seriesManager knows that jobs A, B, C, and D have finished - it notifies the caller via mainCallbackQueue
I am using the concept of RPC as described in the RabbitMQ RPC tutorial. Is there a simpler way to do this?
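The steps above can be simulated broker-free to make the flow concrete. This is a sketch using in-memory queues in place of RabbitMQ queues; a real implementation would use pika and the callback-queue pattern from the RPC tutorial, and all names here (`run_series`, `worker`, etc.) are illustrative, not from the question:

```python
import queue

def worker(name, job_queue, callback_queue):
    """Stand-in for workerA..workerD: take one job, 'perform' it, ack."""
    job = job_queue.get()
    callback_queue.put(f"{name} finished {job['method']}")

def run_series(series_of_jobs):
    """Stand-in for seriesManager: dispatch jobs strictly in order."""
    results = []
    for job in series_of_jobs:
        job_q, cb_q = queue.Queue(), queue.Queue()   # per-job RPC queue + callback queue
        job_q.put(job)
        worker(f"worker{job['method']}", job_q, cb_q)
        results.append(cb_q.get())                   # block until the callback arrives
    return results  # the caller is notified once all jobs have completed

series = [{"jsonrpc": "2.0", "method": m} for m in "ABCD"]
results = run_series(series)
print(results)
```

The key point is that the manager blocks on each callback queue before dispatching the next job, which is what serializes A -> B -> C -> D.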
(Unfortunately I don't have enough reputation to comment, so this may be a somewhat lacking answer as I can't clarify requirements, though I'll try and edit it to stick to what's needed)
Is there an absolute need for the seriesManager to be present?
It may be more logical to have workerA create job B for workerB and so on and so forth rather than consistently calling back to a central hub.
In which case your current design would change to:
caller creates seriesOfJobs.
caller sends seriesOfJobs to workerA.
workerA performs job A and sends remaining seriesOfJobs to workerB.
workerB performs job B and sends remaining seriesOfJobs to workerC.
workerC performs job C and sends remaining seriesOfJobs to workerD.
workerD performs job D and notifies caller via mainCallbackQueue.
I would regard that as a "simpler way", seeing as it takes that nasty central hub out of the equation.
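The manager-less chain above can be sketched with in-memory queues: each worker performs the first job in seriesOfJobs and forwards the remainder to the next worker's queue. With RabbitMQ, each hop would be a `basic_publish` to the next worker's queue; the names here are illustrative:

```python
import queue

queues = {name: queue.Queue() for name in "ABCD"}
main_callback_queue = queue.Queue()

def worker(name):
    series = queues[name].get()
    job, remaining = series[0], series[1:]
    assert job == name                          # this worker handles its own job type
    if remaining:
        queues[remaining[0]].put(remaining)     # hand the rest to the next worker
    else:
        main_callback_queue.put("series done")  # last worker notifies the caller

queues["A"].put(list("ABCD"))   # caller sends seriesOfJobs to workerA
for name in "ABCD":
    worker(name)
result = main_callback_queue.get()
print(result)  # -> series done
```

Because each worker type reads from its own queue, you can still scale each stage independently: add more consumers on queue B without touching the others.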
Related
I'm using RabbitMQ and the amiquip Rust crate to build out several services that will be processing some data in multiple steps. Roughly, this might look like:
Service A ingests data from external source, publishes its results to Topic A
Service B subscribes to Topic A, does some processing, publishes results to Topic B
Service C subscribes to Topic B, does some processing, publishes results to Topic C
Each step along the way, the data are further refined. I will need to be able to shut down different services for maintenance without missing messages that they're reading (e.g., Service B may be taken down briefly, but the messages published by Service A to Topic A must remain in the queue until Service B comes back online). I am okay with setting some TTL/expiration (not sure what the right terminology is for AMQP); for example, if Service B doesn't come back online after 5 minutes, it's okay if messages published to the topic are lost.
Additionally, there may be another service that should also be able to subscribe to a topic without interfering with another service reading it. For example, Service C2 gets a copy of all messages in Topic B and does something with them; every message read by Service C2 is also read by Service C (no stepping on each other's feet).
I don't know the right terminology used here, so I'm at a bit of a loss for what I should be looking for. Is this possible with AMQP & RabbitMQ?
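One common RabbitMQ mapping for a pipeline like this (a sketch of the standard pattern, not something stated in the question): each "Topic" becomes a fanout exchange, and each consumer gets its own durable queue bound to the upstream exchange. Durable queues hold messages while a service is down, a per-queue message TTL drops them after 5 minutes, and giving Service C2 its own queue on Topic B means it gets a copy of every message without interfering with Service C. All queue and exchange names below are assumptions:

```python
FIVE_MIN_MS = 5 * 60 * 1000   # per-queue message TTL in milliseconds

TOPOLOGY = {
    "exchanges": ["topic_a", "topic_b", "topic_c"],   # fanout, durable
    "queues": {                                        # queue name -> bound exchange
        "service_b_in": "topic_a",
        "service_c_in": "topic_b",
        "service_c2_in": "topic_b",  # C2 gets its own copy of every topic_b message
    },
}

def declare(ch):
    """Apply TOPOLOGY on an (assumed) open pika channel `ch`."""
    for ex in TOPOLOGY["exchanges"]:
        ch.exchange_declare(exchange=ex, exchange_type="fanout", durable=True)
    for q, ex in TOPOLOGY["queues"].items():
        ch.queue_declare(queue=q, durable=True,
                         arguments={"x-message-ttl": FIVE_MIN_MS})
        ch.queue_bind(queue=q, exchange=ex)
```

The terminology to search for is "fanout exchange", "durable queue", and "x-message-ttl" (per-queue message TTL) in the RabbitMQ documentation.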
I have a group of jobs that need to be processed.
Some may take 10 min, some may take 1h.
Now I need to know what is the last job executed because at the end of that group of jobs I need to fire another message.
The message queue in this case is RabbitMQ.
Is there a way I can accomplish this with only RabbitMQ?
What would be a good strategy for this task?
That's a strategy you can use with any messaging system.
I assume you have a group of workers listening to a single queue of jobs (the "jobs queue") to be processed. Now you can have a service, let's call it the Manager, which duplicates this queue and saves all unfinished messages. When a worker finishes a job, it sends an acknowledgment message to the Manager. The Manager discards all finished jobs and stores only the running ones. (If you want to take possible failures into account, it can track that too.)
When the Manager has no more messages, it publishes a message to the "all messages in the group done" topic. Producers can then listen on that topic and fire new job messages into the "jobs queue".
Of course, in a simple case you can have one producer, which could be the Manager at the same time.
Example RabbitMQ implementation.
To implement this in RabbitMQ you can, for example, create one fanout exchange (for the producer to send messages to) and two queues: jobsQueue (to send jobs to workers) and jobTrackingQueue (to send jobs to the Manager for tracking). Then you create a second fanout exchange (for the Manager to send "task done" messages to) and one unnamed queue per producer that wants to know when all messages are done.
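The Manager's bookkeeping can be sketched in a few lines: it mirrors each job it sees on the tracking queue into a set, removes jobs as worker acknowledgments arrive, and fires the "group done" message once the set is empty. Names here are illustrative:

```python
class Manager:
    def __init__(self, done_callback):
        self.running = set()             # unfinished job ids
        self.done_callback = done_callback

    def on_job_published(self, job_id):  # fed from the jobTrackingQueue
        self.running.add(job_id)

    def on_ack(self, job_id):            # a worker finished (or failed) a job
        self.running.discard(job_id)
        if not self.running:             # whole group done -> notify producers
            self.done_callback()

events = []
mgr = Manager(lambda: events.append("group done"))
for jid in (1, 2, 3):
    mgr.on_job_published(jid)
for jid in (3, 1, 2):                    # jobs may finish in any order
    mgr.on_ack(jid)
print(events)  # -> ['group done']
```

Note that this works regardless of job durations (10 minutes or 1 hour): completion is detected by the set draining, not by ordering.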
My users are editing data collaboratively on the web. My product needs their edits to be made atomic. I can't guarantee it at the database level, so I would like the updates to be performed one at a time.
Here is what I would need to be able to parallelize across multiple documents:
Let's say we have two documents A and B
1) The queue server starts empty
2) 1 user submits an update for document A
3) The queue server receives the update, creates QueueA and puts the update in it
4) 3 other users submit updates to document A, which are queued in QueueA
5) 2 other users submit changes for document B, which are queued in new queue QueueB
6) The worker pool is started.
7) Worker1 makes a request, the first message of QueueA is delivered (although it would not be an issue if it was the message in QueueB first). QueueA is marked as busy until it gets a response
8) Another worker makes a request, the item from QueueB is returned. QueueB is marked as busy.
9) On the third request, nothing is returned as both queues are busy.
10) The first worker finishes its task, calls the broker and QueueA is not busy anymore.
11) A worker makes a request, it should get the message from QueueA.
12) Worker B times out, which frees QueueB for message consumption.
I have started reading about RabbitMQ, AWS SQS/SNS, Kafka... I am not very knowledgeable in that field, but to my great surprise I haven't been able to find a system matching my requirements on the web.
For now I don't know if my design has issues I haven't seen, or if I just haven't found the right keyword or software for my use case. Scalability should be easy, which is why I have looked at these tools.
How could I easily implement this design ?
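Steps 1-12 above can be prototyped broker-free to sanity-check the design: one FIFO queue per document, marked busy while a worker holds its head message. All names are illustrative, and a real system would also need the worker-timeout handling from step 12:

```python
from collections import defaultdict, deque

class DocumentBroker:
    def __init__(self):
        self.queues = defaultdict(deque)  # document id -> pending updates
        self.busy = set()                 # documents with an update in flight

    def submit(self, doc, update):        # steps 2-5: enqueue per document
        self.queues[doc].append(update)

    def fetch(self):                      # steps 7-9: hand out at most one update
        for doc, q in self.queues.items():
            if q and doc not in self.busy:
                self.busy.add(doc)        # queue stays busy until ack/timeout
                return doc, q.popleft()
        return None                       # every non-empty queue is busy

    def ack(self, doc):                   # steps 10/12: free the queue
        self.busy.discard(doc)

broker = DocumentBroker()
for u in ("a1", "a2", "a3", "a4"):
    broker.submit("A", u)
for u in ("b1", "b2"):
    broker.submit("B", u)
first = broker.fetch()    # ('A', 'a1')
second = broker.fetch()   # ('B', 'b1')
third = broker.fetch()    # None: both documents are busy
broker.ack("A")
fourth = broker.fetch()   # ('A', 'a2')
```

The "busy" set is what serializes updates per document while still letting different documents proceed in parallel; in RabbitMQ terms, a similar effect is often achieved with one queue per document and a prefetch (QoS) of 1 per consumer.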
This is an application design question that is hard to address accurately in a Stack Overflow answer. What you are doing sounds like async processing of data using a queue to buffer as well as scale. The scale part is easy: you add more consumers (i.e., running service processes) and requests can be processed individually in parallel.
I think the best way to think of the problem is to break it down into individual steps of data processing and use the queues as on and off ramps into other distinct processes. More than that, and I'd need some whiteboard time to walk through the entire problem space.
ActiveMQ and RabbitMQ sound more of a fit here. Pressed to recommend one, I tend to lean toward ActiveMQ because it's Java-based and most shops know how to monitor and support Java-based apps. SQS is limited, and given this sounds like business data, using HTTP as the transport is not a robust solution. Kafka doesn't sound like a fit here.
I'm studying multithreading and what I want is some clarification on the subject matter.
As far as I know, a SERIAL queue executes tasks serially, always one task at a time.
Now, a SYNCHRONOUS function is a function that returns only after all its tasks complete.
Now, I'm a bit confused. What is the difference between those two?
If I understand correctly, both of them will block the current thread (if they are not "covered" in a global concurrent queue), and both of them execute tasks in exactly FIFO order.
So, what exactly is the difference between them? Yes, I understand that serial is a property of a queue, and sync is a function (or operation). But their functionality seems to be similar.
You are comparing a queue with a function, so it is difficult to define "difference". Using a serial queue does guarantee sequential behaviour of its operations. Typically, you use a synchronous dispatch if your program has to wait for all queued operations to complete before your program completes. If every dispatch on a given queue is synchronous, then indeed there is no difference between using a queue or calling the operations.
However, here is a very useful case that shows the difference. Suppose operation A is lengthy and you do not want to block. Suppose operation B returns something computed by operation A, but it is called some arbitrary time later (like in response to a user action). You dispatch_async A onto the queue. Your program is not blocked. Sometime later, you need the result. You dispatch_sync operation B on the same serial queue.
Now if A is already complete, the queue is empty when you add B and B executes immediately. But (and here is the good part) if A is still executing (asynchronously), B is not dispatched until A is done, so your program is blocked until the result it needs is ready.
For more explanation of this, see here.
The dangers of deadlock are nicely handled for you by GCD.
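The serial-queue-plus-sync-dispatch pattern from the answer can be sketched outside GCD as well. Here is a Python analogy (an analogy, not GCD itself): a one-worker `ThreadPoolExecutor` plays the serial queue, `submit()` is the async dispatch, and calling `.result()` on a later future is the sync dispatch that blocks only until everything queued ahead of it has run:

```python
import time
from concurrent.futures import ThreadPoolExecutor

serial_queue = ThreadPoolExecutor(max_workers=1)  # one task at a time, FIFO

state = {}
def operation_a():                 # lengthy; we do not want to block on it
    time.sleep(0.2)
    state["value"] = 42

def operation_b():                 # returns something computed by A
    return state["value"]

serial_queue.submit(operation_a)   # like dispatch_async: returns immediately
# ... the program keeps running; some arbitrary time later we need the result:
result = serial_queue.submit(operation_b).result()  # like dispatch_sync
print(result)  # -> 42 (B cannot run until A has finished on the serial queue)
serial_queue.shutdown()
```

If A has already finished, B runs immediately; if A is still running, the `.result()` call blocks exactly until B's turn comes, which is the behavior the answer describes.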
I am using RabbitMQ to have worker processes encode video files. I would like to know when all of the files are complete - that is, when all of the worker processes have finished.
The only way I can think to do this is by using a database. When a video finishes encoding:
UPDATE videos SET status = 'complete' WHERE filename = 'foo.wmv'
-- etc etc etc as each worker finishes --
And then to check whether or not all of the videos have been encoded:
SELECT count(*) FROM videos WHERE status != 'complete'
But if I'm going to do this, then I feel like I am losing the benefit of RabbitMQ as a mechanism for multiple distributed worker processes, since I still have to manually maintain a database queue.
Is there a standard mechanism for RabbitMQ dependencies? That is, a way to say "wait for these 5 tasks to finish, and once they are done, then kick off a new task?"
I don't want to have a parent process add these tasks to a queue and then "wait" for each of them to return a "completed" status. Then I have to maintain a separate process for each group of videos, at which point I've lost the advantage of decoupled worker processes as compared to a single ThreadPool concept.
Am I asking for something which is impossible? Or, are there standard widely-adopted solutions to manage the overall state of tasks in a queue that I have missed?
Edit: after searching, I found this similar question: Getting result of a long running task with RabbitMQ
Are there any particular thoughts that people have about this?
Use a "response" queue. I don't know any specifics about RabbitMQ, so this is general:
Have your parent process send out requests and keep track of how many it sent
Make the parent process also wait on a specific response queue (that the children know about)
Whenever a child finishes something (or can't finish for some reason), send a message to the response queue
Whenever numSent == numResponded, you're done
Something to keep in mind is a timeout -- What happens if a child process dies? You have to do slightly more work, but basically:
With every sent message, include some sort of ID, and add that ID and the current time to a hash table.
For every response, remove that ID from the hash table
Periodically walk the hash table and remove anything that has timed out
This is called the Request Reply Pattern.
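The bookkeeping described above is small enough to sketch directly: the parent records an ID and timestamp per sent request, removes IDs as responses arrive, and periodically expires anything that has waited too long (a dead child). Names are illustrative:

```python
import time

class RequestTracker:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.pending = {}                 # request id -> time it was sent

    def sent(self, req_id):
        self.pending[req_id] = time.monotonic()

    def responded(self, req_id):
        self.pending.pop(req_id, None)

    def expire(self):
        """Drop (and return) requests whose child presumably died."""
        now = time.monotonic()
        dead = [r for r, t in self.pending.items() if now - t > self.timeout_s]
        for r in dead:
            del self.pending[r]
        return dead

    def done(self):                       # numSent == numResponded
        return not self.pending

tracker = RequestTracker(timeout_s=0.05)
for req_id in (1, 2, 3):
    tracker.sent(req_id)
tracker.responded(1)
tracker.responded(2)
time.sleep(0.1)                           # request 3's worker never answers
timed_out = tracker.expire()              # -> [3]
print(tracker.done())                     # -> True
```

Whether a timed-out request should be retried or reported as failed is a policy decision; the tracker only tells you which IDs went silent.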
Based on Brendan's extremely helpful answer, which should be accepted, I knocked up this quick diagram, which may be helpful to some.
I have implemented a workflow where the workflow state machine is implemented as a series of queues. A worker receives a message on one queue, processes the work, and then publishes the same message onto another queue. Then another type of worker process picks up that message, etc.
In your case, it sounds like you need to implement one of the patterns from Enterprise Integration Patterns (that is a free online book) and have a simple worker that collects messages until a set of work is done, and then processes a single message to a queue representing the next step in the workflow.
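That collecting worker is the "Aggregator" from Enterprise Integration Patterns. A minimal sketch (illustrative names; in RabbitMQ, `publish_next` would be a `basic_publish` to the next workflow queue): it consumes completion messages until the expected set is exhausted, then emits a single message for the next step:

```python
class Aggregator:
    def __init__(self, expected_ids, publish_next):
        self.remaining = set(expected_ids)   # e.g. the 5 video filenames
        self.publish_next = publish_next     # next step's queue, as a callable

    def on_message(self, task_id):           # called per "encoding done" message
        self.remaining.discard(task_id)
        if not self.remaining:               # the whole set of work is complete
            self.publish_next("all videos encoded")

next_queue = []
agg = Aggregator({"a.wmv", "b.wmv"}, next_queue.append)
agg.on_message("b.wmv")
agg.on_message("a.wmv")
print(next_queue)  # -> ['all videos encoded']
```

This keeps the workers themselves decoupled: they only publish completion messages, and the aggregator alone knows when the group as a whole is done.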