How Akka.Net handles system faults during message processing - akka.net

Suppose that one of the cluster nodes received a message and one of its actors started to process it. Somewhere in the middle this node died for some reason. What will happen to the message: will it be processed by another available node, or will it be lost?

By default Akka (like most actor model frameworks) offers at-most-once delivery. This means that messages are sent to actors on a best-effort basis - if they don't reach the target, they won't be redelivered. This also means that if a message reached the target, but the process associated with it was interrupted before finishing, it won't be retried.
That being said, there are numerous ways to offer a redelivery between actors with various guarantees.
The simplest and most unreliable is to use the Ask pattern in combination with e.g. the Polly library. This however won't help if the node on which the sender lives dies - simply because messages are still stored only in memory.
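To make the retry idea concrete, here is a minimal sketch in Python rather than C#. It is not Akka.NET or Polly code: `ask` is a hypothetical stand-in for a request/response call to an actor, and `make_flaky_target` is a fake target that exists only for the demonstration.

```python
import time


class DeliveryFailed(Exception):
    """Raised when every attempt timed out."""


def ask_with_retry(ask, message, attempts=3, delay_seconds=0.0):
    """Call ask(message) until it replies, retrying on TimeoutError.

    The message lives only in this process's memory: if the process
    dies between attempts, the pending retries die with it.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return ask(message)
        except TimeoutError as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise DeliveryFailed(f"gave up after {attempts} attempts") from last_error


def make_flaky_target(failures_before_success):
    """Build a fake target that times out a fixed number of times, then replies."""
    state = {"calls": 0}

    def target(message):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TimeoutError("no reply")
        return f"processed {message}"

    return target, state
```

Note that a timeout does not prove the target never processed the message, so even this simple approach can produce duplicates on the receiving side.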
The more reliable pattern is to use some event log/queue in front of your cluster (e.g. Azure Service Bus, RabbitMQ or Kafka). In this approach clients send requests via the bus/queue, while the first actor in the processing pipeline is responsible for picking them up. If some actor or node in the pipeline dies, the whole pipeline for that message is retried.
Another idea is to use at-least-once delivery found in the Akka.Persistence module. It allows you to use the eventsourcing capabilities of persistent actors to persist messages. However IMO it requires a bit of experience with Akka.
All of these approaches provide at-least-once delivery guarantees, which means that it's possible for the same message to be sent to its destination more than once. This also means that your processing logic needs to account for that, either through idempotent behavior or by recognizing and removing duplicates on the receiver side.

Related

To be sure about concurrency, same group of works in multiple queues (FIFO)

I have a question about multi consumer concurrency.
I want to send work items that come from web requests to distributed queues in RabbitMQ.
I just want to be sure about the order of work items across multiple queues (FIFO).
Because these requests come from different users, each user's requests/work items must be ordered.
I have found this feature under different names in Azure ServiceBus and ActiveMQ (message grouping).
Is there any way to do this in plain RabbitMQ?
I want to guarantee that a customer's requests are ordered relative to each other.
Each customer may have multiple requests but those requests for that customer must be processed in order.
I want to process incoming requests quickly by using multiple consumers on different nodes.
For example, customers 1 to 1000 send over 1 million requests.
If I put this huge number of requests in only one queue, it takes a lot of time to consume. So I want to share the processing load between n (5) nodes. Customer X's requests must be processed in the same sequence.
When working with event-based systems, and especially when using multiple producers and/or consumers, it is important to come to terms with the fact that there usually is no such thing as a guaranteed order of events. And to get a robust system, it is also wise to design the system so the message handlers are idempotent; they should tolerate receiving the same message twice (or more).
There are way too many things that may (and actually should be allowed to) interfere with the order:
The producers may deliver the messages at a slightly different pace
One producer might miss an ack (due to a lost packet) and will resend the message
One consumer may get and process a message, but the ack is lost on the way back, so the message is delivered twice (to another consumer).
Some other service that your handlers depend on might be down, so that you have to reject the message.
That being said, there is one pattern that service bus systems like NServiceBus use to enforce the order in which messages are consumed. There are some requirements:
You will need a centralized storage (like a sql-server or document store) that allows for conditional updates; for instance you want to be able to store the sequence number of the last processed message (or how far you have come in the process), but only if the already stored sequence/progress is the right/expected one. Storing the user-id and the progress even for millions of customers should be a very easy operation for most databases.
You make sure the queue is configured with a dead-letter-queue/exchange for retries, and then set your original queue as a dead-letter-queue for that one again.
You set a TTL (for instance 30 seconds) on the retry/dead-letter-queue. This way the messages that appear on the dead-letter-queue will automatically be pushed back to your original queue after some timeout.
When processing your messages you check your storage/database if you are in the right state to handle the message (i.e. the needed previous steps are already done).
If you are OK to handle it, you do so and update the storage (conditionally!).
If not - you nack the message, so that it is thrown on the dead-letter queue. Basically you are saying "nah - I can't handle this message, there are probably some other messages in the queue that should be handled first".
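The retry loop from the requirements above can be expressed as two sets of queue arguments. This is a sketch under assumptions: the queue names are hypothetical, and routing through the default exchange (the empty string) is just one possible wiring. The argument keys themselves (`x-dead-letter-exchange`, `x-dead-letter-routing-key`, `x-message-ttl`) are standard RabbitMQ queue arguments that you would pass when declaring each queue with your client library.

```python
WORK_QUEUE = "work"              # hypothetical name of the original queue
RETRY_QUEUE = "work.retry"       # hypothetical name of the retry/dead-letter queue
RETRY_DELAY_MS = 30_000          # the 30-second TTL used as an example in the text


def work_queue_arguments():
    """Messages nacked/rejected without requeue are dead-lettered to the retry queue."""
    return {
        "x-dead-letter-exchange": "",              # default exchange routes by queue name
        "x-dead-letter-routing-key": RETRY_QUEUE,
    }


def retry_queue_arguments():
    """Messages sit here until the TTL expires, then are dead-lettered back to work."""
    return {
        "x-message-ttl": RETRY_DELAY_MS,
        "x-dead-letter-exchange": "",
        "x-dead-letter-routing-key": WORK_QUEUE,
    }
```

Nothing consumes from the retry queue; expiry alone moves messages back, which is what produces the delay.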
This way the happy-path is to process a great number of messages in the right order.
But if something happens and you get a message out of order, you will throw it on the retry-queue (the dead-letter-queue) and Rabbit will make sure it will get back in the queue to be retried at a later stage. But only after a delay.
The beauty of this is that you are able to handle most of the situations that may interfere with processing the message (out of order messages, dependent services being down, your handler being shut down in the middle of handling the message) in exactly the same way: by rejecting the message and letting your infrastructure (Rabbit) take care of it being retried after a while.
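A rough sketch of the handler side of this pattern, with an in-memory class standing in for the database. `ProgressStore`, `handle` and the message shape (`customer`, `seq`) are all made up for illustration; a real store would do the compare-and-set as one atomic conditional update.

```python
class ProgressStore:
    """In-memory stand-in for a database that supports conditional updates."""

    def __init__(self):
        self._last_seq = {}

    def last(self, customer_id):
        return self._last_seq.get(customer_id, 0)

    def advance(self, customer_id, expected, new):
        """Store `new` only if the current value still equals `expected`."""
        if self._last_seq.get(customer_id, 0) != expected:
            return False
        self._last_seq[customer_id] = new
        return True


def handle(store, message, process):
    """Return "ack" when done with the message, "nack" to send it to the retry queue."""
    customer, seq = message["customer"], message["seq"]
    last = store.last(customer)
    if seq <= last:
        return "ack"    # duplicate of a message that was already handled
    if seq != last + 1:
        return "nack"   # an earlier message for this customer is still outstanding
    process(message)    # must be idempotent: it can run again if the update below loses a race
    if store.advance(customer, expected=last, new=seq):
        return "ack"
    return "nack"
```

Processing before the conditional update means a crash in between causes a retry instead of a lost message, which is why the handler itself still has to tolerate duplicates.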
(Assuming the OP is asking about things like ActiveMQ's "message grouping":)
This isn't currently built in to RabbitMQ AFAIK (it wasn't as of 2013 as per this answer) and I'm not aware of it now (though I haven't kept up lately).
However, RabbitMQ's model of exchanges and queues is very flexible - exchanges and queues can be easily created dynamically (this can be done in other messaging systems but, for example, if you read ActiveMQ documentation or Red Hat AMQ documentation you'll find all of the examples in the user guides are using pre-declared queues in configuration files loaded at system startup - except for RPC-like request/response communication).
Also it is very easy in RabbitMQ for a consumer (i.e., message consuming thread) to consume from multiple queues.
So you could build, on top of RabbitMQ, a system where you got your desired grouping semantics.
One way would be to create dynamic queues: The first time a customer order or a new group of customer orders was seen, a queue would be created with a unique name for all messages for that group - that queue name would be communicated (via another queue) to a consumer whose sole purpose was to load-balance among other consumers that were responsible for handling customer order groups. I.e., the load-balancer would pull off of its queue a message saying "new group with queue name XYZ" and it would find in a pool of order group consumers a consumer which could take this load and pass it a message saying "start listening to XYZ".
Another way to do it is with pub/sub and topic routing - each customer order group would get a unique topic - and proceed as above.
RabbitMQ Consistent Hash Exchange Type
We are using RabbitMQ and we have found a plugin. It uses a consistent hashing algorithm to distribute messages across queues, so that messages with the same routing key are sent to the same queue.
For more information about consistent hashing:
https://en.wikipedia.org/wiki/Consistent_hashing
https://www.youtube.com/watch?v=viaNG1zyx1g
You can find this plugin on the RabbitMQ web page
plugin : rabbitmq_consistent_hash_exchange
https://www.rabbitmq.com/plugins.html
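To illustrate the idea (not the plugin's actual implementation), here is a small hash ring in Python. The `HashRing` class, the queue names and the use of SHA-256 are assumptions made for the sketch; the point is only that a given customer key always lands on the same queue, so per-customer order is preserved as long as each queue has a single consumer.

```python
import bisect
import hashlib


class HashRing:
    """Maps routing keys to queues so that one key always hits the same queue."""

    def __init__(self, queues, points_per_queue=100):
        # Each queue gets many points on the ring to spread the load evenly.
        self._ring = sorted(
            (self._hash(f"{queue}#{i}"), queue)
            for queue in queues
            for i in range(points_per_queue)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def queue_for(self, routing_key):
        # First point clockwise from the key's hash, wrapping around at the end.
        index = bisect.bisect(self._points, self._hash(routing_key)) % len(self._ring)
        return self._ring[index][1]
```

With a ring like this, removing one queue only moves the keys that were mapped to that queue; the others stay where they were.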

What is a proper way to acknowledge an MQ message from a chain of actors?

We want to use Akka to implement a scenario when messages are fetched from a message queue (RabbitMQ) and then processed by a chain of actors. The queue is durable and messages must not be lost. So we need to send an acknowledgement (BasicAck in RabbitMQ) back to the queue in order to finalize the dequeued message. Because of that the very last actor in the processing chain needs to do the acknowledgement. This seems to be rather common need, and I wonder if there is a known pattern for this. Vaughn Vernon in his book writes about using Return Address, so all messages sent along the chain will have the return address (of the MQ channel actor) and the correlation identifier that specifies the queue message tag. Is this the proper way to do it?
An alternative is to ack the message right after receipt and then use persistent actors to provide guaranteed delivery, but I was advised against such an approach because the use of AMQP eliminates the need for actor persistence in this particular scenario.
I'm not really familiar with Akka, but I think I get the gist of what it does (very similar to "process" in Erlang - I think - which is what RMQ is built on).
In general, your first suggestion from Vaughn Vernon's book is the way to go.
In my specific scenarios, I have taken a "middleware" approach to what you are suggesting. My specific middleware implementation forwards the message itself through a chain of commands that process the message. Each command calls an action.next() method to continue forwarding to the next command.
Prior to sending the message through the middleware, I create a default last-command-in-the-chain. This default command simply calls actions.ack() - which, behind the scenes, acknowledges the message.
I do things this way so that the commands never have to know anything about how to actually implement the mechanics of completing and moving on to the next thing. They have an API specific to themselves, being commands in a chain.
This allows me to change the implementation of acknowledging the message, or how I handle messages from RMQ, etc, without changing the commands directly.
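A rough Python sketch of that middleware idea, not my actual implementation. The names `Actions`, `ack_command`, `process` and `ack_transport` are invented for the example; `ack_transport` stands for whatever really acknowledges the message (e.g. a BasicAck on the channel).

```python
class Actions:
    """Handed to every command; hides how forwarding and acking are done."""

    def __init__(self, message, commands, ack_transport):
        self._message = message
        self._pending = iter(commands)
        self._ack_transport = ack_transport

    def next(self):
        """Forward the message to the next command in the chain."""
        command = next(self._pending)
        command(self._message, self)

    def ack(self):
        """Acknowledge the message; commands never see how this is done."""
        self._ack_transport(self._message)


def ack_command(message, actions):
    """Default last command in the chain: acknowledge the message."""
    actions.ack()


def process(message, commands, ack_transport):
    """Run the message through the commands, ending with the default ack command."""
    actions = Actions(message, list(commands) + [ack_command], ack_transport)
    actions.next()
```

If any command raises or simply never calls `actions.next()`, the ack command is never reached, so the message stays unacknowledged and the broker can redeliver it.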
Ack'ing the message immediately introduces danger, as your actor could crash, Akka itself could crash, and a host of other problems can (and will) occur, and you'll be more likely to lose the message.
Remember, though - there is no 100% perfect setup. You will, at some point, lose a message or process the same message twice. Your system needs to handle these scenarios in some way, at some point. Everything you're doing is heading down the right path to make this less likely, but nothing will ever prevent crashes and message loss 100% of the time.

How-to enable message persistence for akka.net

Is it possible to store akka.net actors inbox messages in database?
What will happen if host with akka.net system crash?
Persisting messages is only part of the bigger issue, which is reliable message processing. In short, the goal is usually not only to persist messages, but to guarantee that a message has been received and correctly processed. By default Akka.NET uses at-most-once delivery semantics, which means that messages are processed on a best-effort basis. This keeps throughput high and spares actors from having to be idempotent. However, sometimes we need higher reliability for some of the messages.
One of the techniques is to use another reliable queue (such as RabbitMQ or Azure Service Bus) in front of your actor system and use it for reliable messaging.
Another solution is to use AtLeastOnceDeliverySemantic actors from the Akka.Persistence library. Here you may specify the actor responsible for re-sending and confirming processed messages. From there you may decide to persist incoming messages using the eventsourcing primitives built into Akka.Persistence itself. The persistence backend is pluggable in this scenario.
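The mechanics of at-least-once delivery can be sketched in a few lines of Python. This is not the Akka.Persistence API: `AtLeastOnceSender` and its methods are hypothetical, and the unconfirmed messages are kept in a plain dict, whereas a persistent actor would journal deliveries and confirmations so that they survive a crash.

```python
class AtLeastOnceSender:
    """Keeps every message until the receiver confirms it, re-sending on demand."""

    def __init__(self, send):
        self._send = send          # callable(delivery_id, payload), assumed to be lossy
        self._unconfirmed = {}     # delivery_id -> payload awaiting confirmation
        self._next_id = 1

    def deliver(self, payload):
        """Send a message and remember it until it is confirmed."""
        delivery_id = self._next_id
        self._next_id += 1
        self._unconfirmed[delivery_id] = payload
        self._send(delivery_id, payload)
        return delivery_id

    def confirm(self, delivery_id):
        """Called when the receiver reports that it processed the message."""
        self._unconfirmed.pop(delivery_id, None)

    def redeliver_unconfirmed(self):
        """Called periodically (e.g. by a timer) to re-send what is still pending."""
        for delivery_id, payload in sorted(self._unconfirmed.items()):
            self._send(delivery_id, payload)
```

Because a confirmation can be lost just like a message, the receiver may see the same delivery id more than once and has to deduplicate or be idempotent.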

RabbitMQ clustering and mirror queues behavior behind the scenes

Can someone please explain what is going on behind the scenes in a RabbitMQ cluster with multiple nodes and queues in mirrored fashion when publishing to a slave node?
From what I read, it seems that all actions other than publishes go only to the master and the master then broadcasts the effect of the actions to the slaves (this is from the documentation). From my understanding it means a consumer will always consume messages from the master queue. Also, if I send a request to a slave for consuming a message, that slave will do an extra hop by getting to the master for fetching that message.
But what happens when I publish to a slave node? Will this node do the same thing of sending first the message to the master?
It seems there are so many extra hops when dealing with slaves, so it seems you could have a better performance if you know only the master. But how do you handle master failure? Then one of the slaves will be elected master, so you have to know where to connect to?
Asking all of this because we are using a RabbitMQ cluster with HAProxy in front, so we can decouple the cluster structure from our apps. This way, whenever a node goes down, HAProxy will redirect to living nodes. But we have problems when we kill one of the rabbit nodes. The connection to rabbit is permanent, so if it fails, you have to recreate it. Also, you have to resend the messages in these cases, otherwise you will lose them.
Even with all of this, messages can still be lost, because they may be in transit when I kill a node (in some buffers, somewhere on the network etc). So you have to use transactions or publisher confirms, which guarantee the delivery after all the mirrors have been filled up with the message. But here is another issue. You may have duplicate messages, because the broker might have sent a confirmation that never reached the producer (due to network failures, etc). Therefore consumer applications will need to perform deduplication or handle incoming messages in an idempotent manner.
Is there a way of avoiding this? Or do I have to decide whether I can lose a couple of messages versus duplication of some messages?
Can someone please explain what is going on behind the scenes in a RabbitMQ cluster with multiple nodes and queues in mirrored fashion when publishing to a slave node?
This blog outlines exactly what happens.
But what happens when I publish to a slave node? Will this node do the same thing of sending first the message to the master?
The message will be redirected to the master Queue - that is, the node on which the Queue was created.
But how do you handle master failure? Then one of the slaves will be elected master, so you have to know where to connect to?
Again, this is covered here. Essentially, you need a separate service that polls RabbitMQ and determines whether nodes are alive or not. RabbitMQ provides a management API for this. Your publishing and consuming applications need to refer to this service either directly, or through a mutual data-store in order to determine the correct node to publish to or consume from.
The connection to rabbit is permanent, so if it fails, you have to recreate it. Also, you have to resend the messages in these cases, otherwise you will lose them.
You need to subscribe to connection-interrupted events to react to severed connections. You will need to build in some level of redundancy on the client in order to ensure that messages are not lost. I suggest, as above, that you introduce a service specifically designed to interrogate RabbitMQ. Your client can attempt to publish a message to the last known active connection, and should this fail, the client might ask the monitor service for an up-to-date listing of the RabbitMQ cluster. Assuming that there is at least one active node, the client may then establish a connection to it and publish the message successfully.
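The client-side fallback could look roughly like the following Python sketch. Everything here is hypothetical: `publish_to` stands for your real publish call (assumed to raise ConnectionError on a dead node), and `list_live_nodes` stands for a query to the monitor service.

```python
class NoNodeAvailable(Exception):
    """Raised when neither the last known node nor any listed node accepts the message."""


def publish_with_failover(message, last_known_node, list_live_nodes, publish_to):
    """Publish to the last known node, falling back to the monitor's node list.

    Returns the node that accepted the message, so the caller can
    remember it as the new last known node.
    """
    try:
        publish_to(last_known_node, message)
        return last_known_node
    except ConnectionError:
        pass  # fall through and ask the monitor service
    for node in list_live_nodes():
        if node == last_known_node:
            continue  # already tried
        try:
            publish_to(node, message)
            return node
        except ConnectionError:
            continue  # the monitor's view may be slightly stale
    raise NoNodeAvailable("no RabbitMQ node accepted the message")
```

The caller still has to hold on to the message until this function returns, otherwise a failure between attempts loses it.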
Even with all of this, messages can still be lost, because they may be in transit when I kill a node
There are certain edge-cases that you can't cover with redundancy, and neither can RabbitMQ. For example, when a message lands in a Queue, the HA policy invokes a background process to copy the message to a backup node. During this process there is potential for the message to be lost before it is persisted to the backup node. Should the active node immediately fail, the message will be lost for good. There is nothing that can be done about this. Unfortunately, when we get down to the level of actual bytes travelling across the wire, there's a limit to the amount of safeguards that we can build.
Therefore consumer applications will need to perform deduplication or handle incoming messages in an idempotent manner.
You can handle this a number of ways. For example, setting the message-ttl to a relatively low value will ensure that duplicated messages don't remain on the Queue for extended periods of time. You can also tag each message with a unique reference, and check that reference at the consumer level. Of course, this would require storing a cache of processed messages to compare incoming messages against; the idea being that if a previously processed message arrives, its tag will have been cached by the consumer, and the message can be ignored.
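A minimal sketch of the tag-and-cache approach in Python. The `SeenMessages` class, the `id` field on the message and the capacity are assumptions for the example; in a multi-consumer setup the cache would have to live in a shared store rather than in process memory.

```python
from collections import OrderedDict


class SeenMessages:
    """Bounded cache of message ids that have already been processed."""

    def __init__(self, capacity=10_000):
        self._capacity = capacity
        self._seen = OrderedDict()

    def first_time(self, message_id):
        """Return True the first time an id is seen, False for a duplicate."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)   # keep recently seen ids longest
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)       # evict the oldest id
        return True


def consume(message, seen, handler):
    """Run the handler only for messages whose id has not been seen before."""
    if seen.first_time(message["id"]):
        handler(message)
        return True
    return False
```

The cache is bounded, so a duplicate that arrives after its id has been evicted will be processed again; size the capacity to cover the window in which duplicates are realistic.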
One thing that I'd stress with AMQP and Queue-based solutions in general is that your infrastructure provides the tools, but not the entire solution. You have to bridge those gaps based on your business needs. Often, the best solution is derived through trial and error. I hope my suggestions are of use. I blog about a number of RabbitMQ design solutions, including the issues you mentioned, here if you're interested.

Why is NServiceBus Bus.Publish() not transactional?

Setup:
I have a couple of subscribers subscribing to an event produced by a publisher on the same machine via MSMQ. The subscribers use two different endpoint names, and each runs in its own process. (This is NSB 4.6.3)
Scenario:
Now, if I do something "bad" to one of the subscribers (say remove proper permission in MSMQ to receive messages, or delete the queue in MSMQ outright...), and call Bus.Publish(), I will still have one event successfully published to the "good" subscriber (if the good one precedes the bad one on the subscriber list in subscription storage), or none successful (if the bad one precedes the good one).
Conclusion:
The upshot here is that Bus.Publish() does not seem to be transactional, as to making publishing to subscribers all succeed or all fail. Depending on the order of the subscribers on the list, the end result might be different.
Questions:
Is this behavior by design?
What is the thought behind this?
If I want to make this call transactional, what is the recommended way? (One option seems to enclose Bus.Publish() in a TransactionScope in my code...)
Publish is transactional, or at least, it is if there is an ambient transaction. Assuming you have not taken steps to disable transactions, all message handlers have an ambient transaction running when you enter the Handle method. (Inspect Transaction.Current.TransactionInformation to see first-hand.) If you are operating out of an IWantToRunWhenBusStartsAndStops, however, there will be no ambient transaction, so then yes you would need to wrap with your own TransactionScope.
How delivery is handled (specific for the MSMQ transport) is different depending upon whether the destination is a local or remote queue.
Remote Queues
For a remote queue, delivery is not directly handled by the publisher at all. It simply drops the two messages in the "Outbox", so to speak. MSMQ uses store-and-forward to ensure that these messages are eventually delivered to their intended destinations, whether that be on the same machine or a remote machine. In these cases, you may look at your outgoing queues and see that there are messages stuck there that are unable to be delivered because of whatever you have done to their destinations.
The safety afforded by store-and-forward means that one errant subscriber cannot take down a publisher, and so overall coupling is reduced. This is a good thing! But it also means that monitoring outgoing queues is a very important part of your DevOps story when deploying an NServiceBus system.
Local Queues
For local queues, MSMQ may still technically use a concept of an outgoing queue in its own plumbing - I'm not sure and it doesn't really matter. But an additional step that MSMQ is capable of doing (and does) is to check the existence of a local queue before you try to send to it, and will throw an exception if it doesn't exist or something is wrong with it. This would indeed affect the publisher.
So yes, if you publish a message from a non-transactional state like the inside of an IWantToRunWhenBusStartsAndStops, and the downed queue happens to be #2 on the list in subscription storage, you could observe a message arriving at SubscriberA but not at SubscriberB. If it were within a message handler with transactions disabled, you could see multiple copies arriving at SubscriberA because of the message retry logic!
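The order dependence is easy to reproduce with a toy model. The Python below is not NServiceBus or MSMQ code; `Broker`, `publish` and `publish_transactional` are made-up names, and the "transaction" is simulated by discarding whatever was sent since a checkpoint.

```python
class QueueUnavailable(Exception):
    """Raised when a send targets a queue that is missing or broken."""


class Broker:
    """Toy local broker: sends to a broken queue fail immediately."""

    def __init__(self, healthy_queues):
        self.healthy = set(healthy_queues)
        self.delivered = []

    def send(self, queue, event):
        if queue not in self.healthy:
            raise QueueUnavailable(queue)
        self.delivered.append((queue, event))


def publish(broker, event, subscribers):
    """No transaction: sends before the first failure stay delivered."""
    for queue in subscribers:
        broker.send(queue, event)


def publish_transactional(broker, event, subscribers):
    """All or nothing: roll back every send from this publish if any one fails."""
    checkpoint = len(broker.delivered)
    try:
        publish(broker, event, subscribers)
    except QueueUnavailable:
        del broker.delivered[checkpoint:]
        raise
```

With `["SubscriberA", "SubscriberB"]` and SubscriberB broken, `publish` leaves one delivery behind, while reversing the list leaves none; `publish_transactional` leaves none in both cases.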
Upshot
IWantToRunWhenBusStartsAndStops is great for quick demos and proving things out, but try to put as little real logic in them as possible, opting instead for the safety of message handlers where the ambient transaction applies. Also remember that an exception inside there could potentially take down your host process. Certainly don't publish inside of one without wrapping it with your own transaction.