NServiceBus TimeoutsDispatcher queue is being flooded with messages during stress tests

I'm doing some stress tests on a saga that uses 2 timeouts. During the test about 21K sagas get created, so that would mean 42K timeouts, but I notice that the TimeoutsDispatcher queue of the saga endpoint is getting flooded with hundreds of thousands of messages until it crashes because the MSMQ storage limit is hit.
I'm seeing this behavior since I switched the persistence mechanism from RavenDB to SQL Server.
Does anyone have an idea what could be wrong?
Transport: MSMQ
Persistence: NHibernate
Packages used:
NHibernate version 4.0.4.4000
NServiceBus version 5.2.14
NServiceBus.Host version 6.0.0
NServiceBus.Log4Net version 1.0.0
NServiceBus.NHibernate version 6.2.7
Test setup:
* endpoint 1 is sending 22000 messages to endpoint 2.
* endpoint 2 hosts a saga that is started by that message.
* each saga publishes an event and then requests 2 timeouts: 1 at 4 minutes, 1 at 10 minutes.
Observed behavior:
* endpoint 1 sends the 22K messages in under a minute.
* endpoint 2 (the saga) processes 5 to 10 messages per second.
* after 4 minutes the first timeouts are fired, while endpoint 2 is still processing messages from its queue and thus is still creating new saga instances.
* from that moment on, the timeoutsdispatcher queue of the saga endpoint is getting filled with messages.
* after 10 minutes or so, the timeoutsdispatcher queue already contains over 170K messages and is still filling up.
* That continues until endpoint 2 crashes because the MSMQ storage limit is hit, or all messages from the input queue are processed. If the latter occurs first, the timeoutsdispatcher queue message count starts to decrease until it eventually reaches 0.

Did you perform the same stress test with RavenDB? And is SQL Server on a machine that's more-or-less equally powerful, with fast drives?
Update
Some checks for your saga
Is the [Unique] attribute used, and is it used properly? In other words, does every incoming message carry a unique id, so that every incoming message that spawns 2 timeouts creates its own saga instance? If every incoming message accesses the same saga, that alone would severely limit throughput. Imagine the saga instance has already been created (otherwise the explanation gets too complex). Message1 comes in, tries to find the row in the database, finds it and locks it. Message2 comes in at the same time, finds the row, but it's locked, so it goes into retry. Message3 up to Message100 come in (if concurrency is set to 100) and all try to do the same thing, immediately failing. You can see how this limits throughput for a while :) A sketch of a properly correlated saga follows this list.
Are the correct indexes in place on your saga table(s) and timeout tables?
What is your maximum concurrency level set to?
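To illustrate the first check, here is a minimal sketch (not your actual code) of a properly correlated saga on NServiceBus 5 with NHibernate persistence. The message, property, and timeout type names (StartOrder, OrderId, FirstTimeout, SecondTimeout) are hypothetical:

using System;
using NServiceBus;
using NServiceBus.Saga;

public class StartOrder : ICommand
{
    public Guid OrderId { get; set; }
}

public class FirstTimeout { }
public class SecondTimeout { }

public class OrderSagaData : ContainSagaData
{
    [Unique] // one saga instance per OrderId
    public virtual Guid OrderId { get; set; } // virtual so NHibernate can proxy it
}

public class OrderSaga : Saga<OrderSagaData>,
    IAmStartedByMessages<StartOrder>,
    IHandleTimeouts<FirstTimeout>,
    IHandleTimeouts<SecondTimeout>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<OrderSagaData> mapper)
    {
        // Correlate on the unique id so every distinct message spawns its own saga instance.
        mapper.ConfigureMapping<StartOrder>(m => m.OrderId).ToSaga(s => s.OrderId);
    }

    public void Handle(StartOrder message)
    {
        Data.OrderId = message.OrderId;
        RequestTimeout<FirstTimeout>(TimeSpan.FromMinutes(4));
        RequestTimeout<SecondTimeout>(TimeSpan.FromMinutes(10));
    }

    public void Timeout(FirstTimeout state) { /* first timeout logic */ }
    public void Timeout(SecondTimeout state) { /* second timeout logic */ }
}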
You say you send 22K messages, resulting in 44K timeout messages. Imagine all of these timeouts are sitting in MSMQ, and imagine the messages are really, really small, say 1 KB each. Header information added by NServiceBus might take up another 2 KB. That's 44,000 times 3 KB, which is roughly 130 MB. So there's no way that alone can fill up a default MSMQ installation, which has a quota of 1 GB.
This probably means your dead letter queue is filling up. Read up on MSMQ connection strings and set the appropriate connection string, for example:
<connectionStrings>
  <add name="NServiceBus/Transport"
       connectionString="deadLetter=false;journal=false;"/>
</connectionStrings>
Messages with the TimeToBeReceived attribute set end up in the dead letter queue, and purging queues also sends all messages to the dead letter queue, unless you set the proper connection string.

Related

Camel RabbitMQ connector reads thousands of messages before using them

In my app, we are using a Camel route to read messages from a RabbitMQ queue.
The configuration looks like this:
from("rabbitmq:myexchange?routingKey=mykey&queue=q")
The producer can send 50k messages within a few minutes, and each message can take 1 second or more to process.
What I can see is that ALL messages are consumed very fast, but the processing of these messages can take many hours. Many hours of processing is expected, but does that mean that the 50K messages are stored in memory? If so, I would like to disable this behavior because I don't want to lose messages when the process goes down... Actually, we are losing most of the messages even when the process stays up, which is even worse. It looks like the connector is not designed to handle so many messages at once, but I cannot say whether that is because of the connector itself or because we did not configure it properly.
I tried with the autoAck option:
from("rabbitmq:myexchange?routingKey=mykey&queue=q&autoAck=false")
This way the messages are rolled back when something goes wrong, but keeping 50K messages unacknowledged at the same time does not seem like a good idea anyway...
There are a couple of things that I would like to share.
AutoAck - Yes, when you want to process the message after receiving it, you should set AutoAck to false and explicitly acknowledge the message once it is processed.
Setting consumer prefetch - You need to fine-tune the prefetch size. The prefetch size is the maximum number of messages which RabbitMQ will present to the consumer in one go, i.e. at most your total unacknowledged message count will be equal to the prefetch size. Depending on your system, if every message is critical you can set the prefetch size to 1; if you have a multi-threaded model for processing messages, you can set the prefetch size to match the number of threads, where each thread processes one message, and so on. (See the sketch below.)
In a way it acts like a buffer architecturally. If your process goes down while processing those messages, any message which was unacked before the process went down will still be there in the queue, and the consumer will get it again for processing.
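For illustration, here is a minimal sketch of the same autoAck/prefetch idea using the raw RabbitMQ .NET client rather than Camel (the queue name and prefetch count are made up; with the Camel component the equivalent is done through its endpoint options):

using System;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

class PrefetchedConsumer
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            // At most 10 unacknowledged messages are handed to this consumer at a time;
            // everything else stays on the broker, so nothing is lost if the process dies.
            channel.BasicQos(0, 10, false);

            var consumer = new EventingBasicConsumer(channel);
            consumer.Received += (sender, ea) =>
            {
                // ... long-running processing of ea.Body here ...
                channel.BasicAck(ea.DeliveryTag, false); // ack only after processing succeeds
            };

            // autoAck = false: a crash before the ack puts the message back on the queue.
            channel.BasicConsume("q", false, consumer);
            Console.ReadLine();
        }
    }
}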

RabbitMQ support for LIFO or time based priority queue

Is there any way to make a RabbitMQ queue behave as a Stack, i.e. the client gets the last message that was posted in the queue (LIFO) rather than the first one? Or maybe alternatively make it a priority queue using a timestamp which the client could set?
RabbitMQ does support priority queues but the priority it allows is just a number up to 255 (recommended to use up to 10).
What I want to achieve is that the latest messages are processed first because they contain the latest information about the source. I still want to process the old messages, but in situations when the client cannot keep up (or there was some downtime and the client is recovering) I want to process the latest state information first.
The only solution I came up with so far is to use a TTL on the messages of the main queue and have them go to a dead letter queue when they expire, which is also processed by the client. However, this is not very clean, and if the source of the message takes longer than the TTL to send a new status update, the latest state will be stuck in the queue behind older expired messages still to be processed.
If it is not possible to achieve with RabbitMQ, is there any other recommended messaging framework that supports this requirement?
Kafka Log Compaction was created for exactly the use case you describe:
Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition. It addresses use cases and scenarios such as restoring state after application crashes or system failure, or reloading caches after application restarts during operational maintenance. Let's dive into these use cases in more detail and then describe how compaction works.
So, RabbitMQ is a queue, not a stack. It is specifically designed NOT to do what you are asking (a queue is always a first-in, first-out data structure).
However, there are options:
Presumably some process (e.g. a web service) exists between the client and the message server. This process could save the data off to an additional storage location (e.g. memcached) for immediate access of the latest value, thus leaving the queue untouched.
You could configure a secondary queue/service combination. When messages are published, they can then be routed to both queues. The first queue is for your heavy processing, and the second queue would be a service whose only task is to update the latest value in memcached or some other fast storage/retrieval system. Thus, message lifetime in this queue would presumably be much shorter (a rough sketch of this option follows this list).
You could implement multiple processing steps. The first step would be to update the current state (presumably a quick operation), after which the message is then re-published to the longer processing step's queue.
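As a rough sketch of the second option (RabbitMQ .NET client, with made-up exchange and queue names): every state update is published to a fanout exchange bound to two queues, one for the heavy processing and one whose only consumer refreshes a "latest value" store:

using System.Text;
using RabbitMQ.Client;

class FanoutSetup
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            // One fanout exchange, two queues: every published update lands in both.
            channel.ExchangeDeclare("state-updates", ExchangeType.Fanout, true);
            channel.QueueDeclare("heavy-processing", true, false, false, null);
            channel.QueueDeclare("latest-value-cache", true, false, false, null);
            channel.QueueBind("heavy-processing", "state-updates", "");
            channel.QueueBind("latest-value-cache", "state-updates", "");

            var body = Encoding.UTF8.GetBytes("latest state of the source goes here");
            channel.BasicPublish("state-updates", "", null, body);
        }
    }
}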

Rabbit MQ backup consumer

I have the following use case that I'm trying to set up in RabbitMQ:
Normally process A should handle all messages sent to queue A.
However, if process A goes down (is no longer consuming from queue A), then process B should handle the messages until process A comes back up.
At first it looks like consumer priorities might be the solution: https://www.rabbitmq.com/consumer-priority.html. However, that will send messages to process B when process A is just blocked working on other messages. I only want them sent to process B when process A is down.
A second option might be dead lettering: https://www.rabbitmq.com/dlx.html. If process A is not reading from queue A, the messages will eventually time out and then move to an exchange that forwards them to a queue that process B reads. However, that option requires waiting for the message to time out, which is not ideal. Also, the message could time out even while process A is still working, which is not ideal either.
Any ideas how RabbitMQ could be configured for the use case described above? Thanks
Based on your answers to my questions, I would probably use a consumer priority so that process A handles as many messages as possible, along with a high prefetch count (if possible, and you must ensure your process can handle such a high number).
Then, process B would handle the messages that process A cannot handle due to the high load, or all the messages when process A is not available. It is probably acceptable that in the case of high load some messages are handled with a higher delay. Do not forget to set a low prefetch count for process B.
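A minimal sketch of that setup with the RabbitMQ .NET client (queue name, priority, and prefetch values are made up). Process A attaches with a high x-priority and a large prefetch; process B uses the same code with a small prefetch and no x-priority argument, so it only receives messages when A is saturated or disconnected:

using System;
using System.Collections.Generic;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

class ProcessA
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            // High prefetch so process A keeps receiving work as long as it is alive and not saturated.
            channel.BasicQos(0, 500, false);

            var consumer = new EventingBasicConsumer(channel);
            consumer.Received += (sender, ea) =>
            {
                // ... handle the message ...
                channel.BasicAck(ea.DeliveryTag, false);
            };

            // x-priority = 10: the broker prefers this consumer whenever it can accept more messages.
            channel.BasicConsume("queue-a", false, "", false, false,
                new Dictionary<string, object> { { "x-priority", 10 } }, consumer);

            Console.ReadLine();
        }
    }
}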
Hope this helps.

How to specify another timeout queue for NSB?

I am using NSB 4.4.2
I want to have something like heartbeats on my saga to show processing statistics.
When I request a timeout, it is sent to the saga's input queue.
If there are many messages ahead of this timeout message, IHandleTimeouts may not be fired at the specified time.
Is it a bug? Or how can I use a separate queue for timeout messages?
Thanks
You are correct - when a timeout is ready to be dispatched, it is sent to the incoming queue of the endpoint, and if there are already many other messages in there, it will have to wait its turn to be processed.
Another thing you might want to consider, is that the endpoint may be down at that time.
If you want to guarantee that your saga code will be invoked at (or very close to) the time of the timeout, you'll need to set up a high availability deployment first. Then, you should look at setting the SLA required of that endpoint - how quickly messages should be processed, and then monitor the time to breach SLA performance counter.
See here for more information: http://docs.particular.net/nservicebus/monitoring-nservicebus-endpoints
You should be prepared to scale out your endpoint as needed to guarantee enough processing power to keep up with the load coming in.
NOTE: The reason we use the same incoming queue for processing these timeouts is by design. A timeout message is almost always the same priority or lower than the other business messages being processed by a saga. As such, it doesn't make sense to have them cut ahead of other messages in line.
Timeouts are sent to the [endpointname].timeouts queue.

MSMQ + WCF - Retry with Growing Delay

I am using MSMQ 4 with WCF. We have a Microsoft Dynamics plugin putting a message on a queue. A service picks up the message and makes an HTTP request to another web server. The web server responds by putting another message on a different queue. A second service picks up the messages and sends the response back to Dynamics...
We have our retry queue set up to retry 3 times and then wait for 5 minutes before retrying again. The Dynamics system sometimes takes so long (due to other plugins) that we can round-trip before the database transaction commits. The users then don't see the update come through for another 5 minutes.
I am curious if there is a way to configure the retry mechanism to retry incrementally. So, the first time it fails, it only waits a few seconds. If it fails a second time, it waits twice that. And the time between retries just keeps growing.
The problem with just reducing the time between retries is that a bad message could easily fill up a log file.
It turns out there is no built-in way of doing this. One slightly involved option is to create multiple queues, each with its own retry/poison sub-queues, each with a growing retry delay. You can reuse the same handler for each queue - the only thing that changes is the configuration. You also need a handler that can read the poison sub-queues (service) and move the message to the next queue in the chain (client).
So, you set receiveErrorHandling to Move. The maxRetryCycles and receiveRetryCount are just 1. Each queue will use a growing retryCycleDelay. Each queue you create will have a poison sub-queue created for it automatically. You simply read from each poison sub-queue and use a client to move it to the next queue.
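A minimal sketch of how one link in that chain could be configured in code with System.ServiceModel's NetMsmqBinding (the delay value is whatever you pick for that step; the class and method names are made up):

using System;
using System.ServiceModel;

static class RetryChainBindings
{
    // One queue in the chain; each subsequent queue gets a longer retryCycleDelay.
    public static NetMsmqBinding CreateBinding(TimeSpan retryCycleDelay)
    {
        return new NetMsmqBinding(NetMsmqSecurityMode.None)
        {
            ReceiveErrorHandling = ReceiveErrorHandling.Move, // failed messages go to the poison sub-queue
            ReceiveRetryCount = 1,            // immediate retries within a cycle
            MaxRetryCycles = 1,               // one retry cycle before the message is considered poison
            RetryCycleDelay = retryCycleDelay // e.g. 30s for the first queue, 60s for the next, and so on
        };
    }
}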
I am sure someone could write some code that would automatically create N queues with a growing retryCycleDelay and hook it up all programmatically. Since it is the same handler/client for every queue, it wouldn't be a big deal.