MSMQ + WCF - Retry with Growing Delay

I am using MSMQ 4 with WCF. We have a Microsoft Dynamics plugin putting a message on a queue. A service picks up the message and makes an HTTP request to another web server. The web server responds by putting another message on a different queue. A second service picks up that message and sends the response back to Dynamics...
We have our retry queue set up to retry 3 times and then wait for 5 minutes before retrying again. The Dynamics system sometimes takes so long (due to other plugins) that we can round-trip before the database transaction commits. The users then don't see the update come through for another 5 minutes.
I am curious if there is a way to configure the retry mechanism to retry incrementally. So, the first time it fails, it only waits a few seconds. If it fails a second time, it waits twice that. And the time between retries just keeps growing.
The problem with just reducing the time between retries is that a bad message could easily fill up a log file.

It turns out there is no built-in way of doing this. One slightly involved option is to create multiple queues, each with its own retry/poison sub-queues, each with a growing retry delay. You can reuse the same handler for each queue - the only thing that changes is the configuration. You also need a handler that can read the poison sub-queues (service) and move the message to the next queue in the chain (client).
So you set receiveErrorHandling to Move, and leave maxRetryCycles and receiveRetryCount at just 1. Each queue uses a progressively larger retryCycleDelay. Each queue you create will have a poison sub-queue created for it automatically. You simply read from each poison sub-queue and use a client to move the message to the next queue in the chain.
I am sure someone could write some code that would automatically create N queues with a growing retryCycleDelay and hook it up all programmatically. Since it is the same handler/client for every queue, it wouldn't be a big deal.
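As a rough illustration, a minimal C# sketch of wiring such a chain up programmatically might look like the following (the OrderService/IOrderService names and queue paths are hypothetical; error handling and the poison-sub-queue mover are omitted):

// Each queue in the chain gets a longer retryCycleDelay than the one before it.
var delays = new[] { TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(30), TimeSpan.FromMinutes(2) };
for (var i = 0; i < delays.Length; i++)
{
    var binding = new NetMsmqBinding
    {
        MaxRetryCycles = 1,
        ReceiveRetryCount = 1,
        RetryCycleDelay = delays[i],
        // exhausted messages are moved to this queue's poison sub-queue
        ReceiveErrorHandling = ReceiveErrorHandling.Move
    };
    var host = new ServiceHost(typeof(OrderService)); // same handler for every queue
    host.AddServiceEndpoint(typeof(IOrderService), binding,
        "net.msmq://localhost/private/orders-retry-" + i);
    host.Open();
}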

Related

NServiceBus Timeoutsdispatcher queue is being flooded with messages during stress tests

I'm doing some stress tests on a saga that uses 2 timeouts. During the test about 21K sagas get created. That would mean 42K timeouts, but I notice that the timeoutsdispatcher queue of the saga is getting flooded with hundreds of thousands of messages until it crashes because the MSMQ storage limit is hit.
I'm seeing this behavior since I switched the persistence mechanism from RavenDB to SQL Server.
Does anyone have an idea what could be wrong?
Transport: MSMQ
Persistence: NHibernate
Packages used:
NHibernate version 4.0.4.4000
NServiceBus version 5.2.14
NServiceBus.Host version 6.0.0
NServiceBus.Log4Net version 1.0.0
NServiceBus.NHibernate version 6.2.7
Test setup:
* endpoint 1 is sending 22000 messages to endpoint 2.
* endpoint 2 hosts a saga that is started by that message.
* each saga publishes an event and then requests 2 timeouts: 1 at 4 minutes, 1 at 10 minutes.
Observed behavior:
* endpoint 1 sends the 22K messages in under a minute.
* endpoint 2 (the saga) processes 5 to 10 messages per second.
* after 4 minutes the first timeouts are fired, while endpoint 2 is still processing messages from its queue and thus is still creating new saga instances.
* from that moment on, the timeoutsdispatcher queue of the saga endpoint is getting filled with messages.
* after 10 minutes or so, the timeoutsdispatcher queue already contains over 170K messages and is still filling up.
* That continues until endpoint 2 crashes because the MSMQ storage limit is hit, or all messages from the input queue are processed. If the latter occurs first, the timeoutsdispatcher queue message count starts to decrease until it eventually reaches 0.
Did you perform the same stress test with RavenDB? And is SQL Server on a machine that's more-or-less equally powerful, with fast drives?
Update
Some checks for your saga
Is the [Unique] attribute used, and is it used properly? In other words, do you use a unique id for every incoming message, so that every incoming message that spawns 2 timeouts creates a unique saga instance? (See the sketch after this list.) If every incoming message is accessing the same saga instance, that would be a great case for extremely limited throughput. Imagine the saga instance has already been created, otherwise the explanation becomes too complex. Message1 comes in, tries to find the row in the database, finds it, and locks it. Message2 comes in at the same time, finds the row, but it's locked, so it goes into retry. Message3 up to Message100 come in (if concurrency is set to 100) and all try to do the same thing, immediately failing. You can see how this limits throughput for a while :)
Are the correct indexes on your Saga table(s) and Timeout tables?
What is your maximum concurrency level set to?
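For reference, a minimal sketch of what a unique correlation property looks like on saga data (the class and property names here are hypothetical):

// NServiceBus 5.x-era saga data: [Unique] tells the persister to enforce
// one saga instance per distinct IncomingMessageId.
public class ProcessingSagaData : ContainSagaData
{
    [Unique]
    public Guid IncomingMessageId { get; set; }
}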
Based on the numbers: you say you send 22K messages, resulting in 44K timeout messages. Imagine all of these timeouts are in MSMQ, and imagine the messages are really, really small, like 1KB, with the header information added by NServiceBus taking up perhaps another 2KB. That's 44,000 times 3KB, which is roughly 130 megabytes. So there's no way that alone can fill up a default MSMQ installation, which has a quota of 1GB by default.
This probably means your dead-letter queue has filled up completely. Find more information on MSMQ connection strings and set the appropriate connection string. For example:
<connectionStrings>
  <add name="NServiceBus/Transport"
       connectionString="deadLetter=false;journal=false;"/>
</connectionStrings>
Messages with the TimeToBeReceived attribute set (link) end up in the dead-letter queue, and purging queues will also send all messages to the dead-letter queue, unless you set the proper connection string.
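For example, a message marked like this (the type name is hypothetical) will be moved to the dead-letter queue if it expires before being received, unless deadLetter=false is set:

// Expired copies of this event go to the dead-letter queue under the
// default MSMQ connection string.
[TimeToBeReceived("00:05:00")]
public class StatusUpdated : IEvent
{
}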

How to specify another timeout queue for NSB?

I am using NSB 4.4.2
I want to have something like heartbeats on my saga to show processing statistics.
When I request a timeout, it is sent to the saga's input queue.
If there are many messages ahead of this timeout message, IHandleTimeouts may not be fired at the specified time.
Is this a bug? Or how can I use a separate queue for timeout messages?
Thanks
You are correct - when a timeout is ready to be dispatched, it is sent to the incoming queue of the endpoint, and if there are already many other messages in there, it will have to wait its turn to be processed.
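For context, a heartbeat-style timeout like the one described is typically requested from inside the saga roughly like this (a sketch against the NServiceBus 4.x saga API; the type names are hypothetical):

public class StatisticsSaga : Saga<StatisticsSagaData>,
    IAmStartedByMessages<StartProcessing>,
    IHandleTimeouts<HeartbeatTimeout>
{
    public void Handle(StartProcessing message)
    {
        RequestTimeout<HeartbeatTimeout>(TimeSpan.FromSeconds(30));
    }

    public void Timeout(HeartbeatTimeout state)
    {
        // publish processing statistics here, then schedule the next beat;
        // note the timeout message is delivered via the saga's own input queue
        RequestTimeout<HeartbeatTimeout>(TimeSpan.FromSeconds(30));
    }
}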
Another thing you might want to consider is that the endpoint may be down at that time.
If you want to guarantee that your saga code will be invoked at (or very close to) the time of the timeout, you'll need to set up a high availability deployment first. Then, you should look at setting the SLA required of that endpoint - how quickly messages should be processed, and then monitor the time to breach SLA performance counter.
See here for more information: http://docs.particular.net/nservicebus/monitoring-nservicebus-endpoints
You should be prepared to scale out your endpoint as needed to guarantee enough processing power to keep up with the load coming in.
NOTE: The reason we use the same incoming queue for processing these timeouts is by design. A timeout message is almost always the same priority or lower than the other business messages being processed by a saga. As such, it doesn't make sense to have them cut ahead of other messages in line.
Timeouts are sent to the [endpointname].timeouts queue.

Setting a long timeout for RabbitMQ ack message

I was wondering if this is possible. I want to pull a task from a queue and do some work that could take anywhere from 3 seconds to (possibly) many minutes before an ack is sent back to RabbitMQ notifying it that the work has been completed. The work is done by a user, which is why the time it takes to process the job varies.
I don't want to ack the message immediately after I pop it off the queue, because I want the message to be requeued if no ack is received. Can anyone give me any insights into how to solve my problem?
Having a long timeout should be fine, and certainly as you say you want redelivery if something goes wrong, so you want to only ack after you finish.
The best way to achieve that, IMO, would be to have multiple consumers on the queue (i.e. multiple threads/processes consuming from the same queue). That should be fine as long as there's no particular ordering constraint on your queue contents (i.e. the way there might be if the queue were to contain contents representing Postgres data that involves FK constraints).
This tutorial on the RabbitMQ website provides more info (Python linked, but there should be similar tutorials for other languages): https://www.rabbitmq.com/tutorials/tutorial-two-python.html
Edit in response to comment from OP:
What's your heartbeat set to? If the client doesn't send a heartbeat within the set period of time, the server will consider the connection to be dead.
Not sure which language you're using, but for Java you would use the setRequestedHeartbeat method to specify the heartbeat.
The way you implement your workers, it's vital that the heartbeat can still be sent back to the RabbitMQ server. If something blocks the client from sending the heartbeat, the server will kill the connection after the time interval expires.
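To make this concrete, here is a hedged sketch using the RabbitMQ .NET client (RequestedHeartbeat is the counterpart of the Java client's setRequestedHeartbeat; the queue name and the DoLongRunningWork helper are hypothetical):

var factory = new ConnectionFactory
{
    HostName = "localhost",
    // give slow consumers a generous heartbeat interval
    RequestedHeartbeat = TimeSpan.FromSeconds(60)
};
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

var consumer = new EventingBasicConsumer(channel);
consumer.Received += (sender, ea) =>
{
    DoLongRunningWork(ea.Body); // may take minutes
    // ack only once the work is done, so an unacked message is redelivered
    channel.BasicAck(ea.DeliveryTag, multiple: false);
};
channel.BasicConsume(queue: "tasks", autoAck: false, consumer: consumer);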

NServicebus - Stopping a long running process?

Here is the application I'm attempting to put together using NServiceBus:
I have 1,000 files that need to be processed by a service. So far I'm thinking I'd have one endpoint, the client, find all of those files and send them out on the bus to be processed.
My other endpoint, the server that does the processing, would listen for these client messages; when one comes in, it processes the file and returns the results.
The client takes the results, marks the file as processed, and waits for the remaining 999 files to be processed. The client doesn't care about the order in which the results come back, just that they all get processed at some point. (In reality the client is going to do something more with the data after it is processed that can't be done by the server, so I can't just fire and forget the request for processing.)
Since processing a single message can take over an hour, I would scale out the application to have multiple servers all attempting to eat through the 1,000 files that need to be processed.
Conceptually, it's like building a personal SETI@home-style service to run on all of my servers.
The issue I'm having is: how do I stop midway through processing the 1,000 files?
I want to keep all of my servers working as much as they can on my data, so when the client starts, does it publish 1,000 commands for the 1,000 files to process and then sit back and wait? And if it does this and decides to stop, how can it clear the bus of all of those commands to process files?
If my client only pushes one or two messages onto the bus at a time, I could easily stop sending messages when I decide to stop on the client, but then I have two other problems:
The servers could be underutilized and I'd end up with idle servers.
How do I stop the servers that are loaded up and processing data? Send them a second command of a different message format?
Thoughts, ideas? Am I approaching this problem using the right tool/right methodology?
One of the things you might want to think about is how you are going to correlate the message processing. I would use a saga for this and have the client generate some kind of batch id which is attached to all the files to be processed. This allows your client to send a CancelProcessing message to the saga, the handler for which could then stop the processing / sending of messages to the file processing endpoints and perform any clean-up operations, such as completing the saga and removing data from the database.
So you would have a client endpoint, a saga endpoint, and one or more file processing endpoints (which would sit behind a distributor). Your client would be responsible for initiating / sending the files to the saga. The saga manages the file correlation and processing activities, while your processing endpoints focus on doing the work.
Remember that the processing endpoints don't necessarily have to be physical endpoints. You can have many of these on one server if you wanted to and use monitoring tools to determine whether or not you need to add or remove nodes.
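As a rough illustration of that shape, a hedged sketch of the batch saga might look like this (all message and type names are hypothetical, and the ConfigureHowToFindSaga mapping on BatchId is omitted for brevity):

public class FileBatchSaga : Saga<FileBatchSagaData>,
    IAmStartedByMessages<StartBatch>,
    IHandleMessages<FileProcessed>,
    IHandleMessages<CancelProcessing>
{
    public void Handle(StartBatch message)
    {
        Data.BatchId = message.BatchId;
        Data.Remaining = message.Files.Count;
        foreach (var file in message.Files)
            Bus.Send(new ProcessFile { BatchId = message.BatchId, Path = file });
    }

    public void Handle(FileProcessed message)
    {
        Data.Remaining--;
        if (Data.Remaining == 0)
            MarkAsComplete(); // whole batch done
    }

    public void Handle(CancelProcessing message)
    {
        // commands already dispatched can't be pulled back off the bus, but the
        // saga can stop tracking, clean up its data, and let processing
        // endpoints check batch state before doing expensive work
        MarkAsComplete();
    }
}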

NServiceBus delay retries

We need to be able to specify a delay in retrying failed messages. NServiceBus retries more or less instantly up to n times (as configured) before moving the message to the error queue.
What I need to be able to do is, for a given message type, specify that it's not to be retried for an arbitrary period of time.
I've read the post here:
NServiceBus Retry Delay
but this doesn't give what I'm looking for.
Kind regards
Ben
This isn't supported as of right now. What you can do is let the messages go to the error queue and set up an endpoint to monitor that queue. Your code could then determine the rules for replaying messages. You could use a saga to achieve this in combination with the timeout manager.
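A hedged sketch of that saga-plus-timeout approach (all type names are hypothetical, and it is written against the later saga API that exposes RequestTimeout):

public class RetryDelaySaga : Saga<RetryDelaySagaData>,
    IAmStartedByMessages<MessageFailed>,
    IHandleTimeouts<RetryDelayExpired>
{
    public void Handle(MessageFailed message)
    {
        Data.FailedMessageId = message.FailedMessageId;
        // wait instead of replaying immediately
        RequestTimeout<RetryDelayExpired>(TimeSpan.FromMinutes(5));
    }

    public void Timeout(RetryDelayExpired state)
    {
        // hand the message id to whatever replays it to the source queue
        Bus.Send(new ReplayFailedMessage { Id = Data.FailedMessageId });
        MarkAsComplete();
    }
}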
Typically you'll have some rules around when to replay messages. In NSB 3.0 we have a better way to do this using the FaultManager. This gives you options on where to put failed messages and includes the exception details. One of the options is a database, which you could then inspect with a scheduled job to determine what to do with each failed message.
Lastly, a low-tech way of getting this is to schedule a job that runs the ReturnToSourceQueue tool periodically to "clean up". We are doing this and including an alert so we don't endlessly cycle messages around.