Service Bus Queue retry strategy: WCF application is timing out because of long retries

We have a service which puts data in a Service Bus queue, and we have a retry strategy implemented for it. But according to the Windows Azure team, the queue can be down for 1-3 minutes, so our retry strategy should keep trying for more than 3 minutes.
If we retry for 3 minutes, the client that is waiting for a response will time out, since the default timeout is 60 seconds. If we increase the timeout, the customer has to wait up to 3 minutes during an outage.
What is the best way to implement this scenario?
a. Should we keep the client waiting? That would not be a good experience for the client.
b. Should we keep the timeout the same? But then the client will retry and we will end up with duplicate records.
Suggestions?

It's a general scenario, and one of the basic rules to follow is that a client shouldn't be left waiting.
So, as per my understanding, what you can do is:
1. Check whether the request is already in the queue.
2. If it is, don't retry.
3. If it isn't, retry.
Also, a timeout is a reasonable exception to surface to the client, or a user-friendly message (you will need a separate approach for that).
The point is that the client should know its status at all times - whether it is waiting, in the queue, or out of the queue. Only then can you build a robust and user-friendly application.
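One way to cover both concerns (long retries behind the scenes, no duplicate records when the client retries) is to enable duplicate detection on the queue and put the long retry policy on the sender. A rough sketch of the sending side, assuming the older Microsoft.ServiceBus.Messaging client (WindowsAzure.ServiceBus package); the "orders" queue name, the orderId idempotency key and the retry numbers are made up:

    using System;
    using Microsoft.ServiceBus;
    using Microsoft.ServiceBus.Messaging;

    class Sender
    {
        public static void Send(string connectionString, string orderId)
        {
            var ns = NamespaceManager.CreateFromConnectionString(connectionString);
            if (!ns.QueueExists("orders"))
            {
                // Duplicate detection makes a repeated send with the same MessageId harmless.
                ns.CreateQueue(new QueueDescription("orders")
                {
                    RequiresDuplicateDetection = true,
                    DuplicateDetectionHistoryTimeWindow = TimeSpan.FromMinutes(10)
                });
            }

            var client = QueueClient.CreateFromConnectionString(connectionString, "orders");

            // Back off exponentially for several minutes to ride out a 1-3 minute outage:
            // 1 s minimum backoff, 30 s maximum backoff, up to 10 attempts.
            client.RetryPolicy = new RetryExponential(
                TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30), 10);

            // A stable MessageId acts as an idempotency key, so if the client retries
            // the whole operation, the broker drops the duplicate.
            client.Send(new BrokeredMessage("payload") { MessageId = orderId });
        }
    }

With something like this, the client-facing timeout can stay at 60 seconds: the client gets an acknowledgement (or a clear "accepted, still being processed" status), the long retries happen in the background, and a client-side retry with the same id doesn't produce a duplicate record.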

Related

RabbitMQ - Delayed message exchange

Currently, we have 2 systems that are communicating directly.
Service A continuously (but not periodically) sends messages to service B. The message is a simple key/value pair: the key is an integer and the value is the current local date and time.
Service B, in order to decide whether to process the request, examines the last incoming request for each key; if the difference between its timestamp and the system time is more than 10 minutes, it starts processing the request.
Now that we are bringing RabbitMQ into our solution, we need to revise this communication model as well. I was thinking of using a delayed message exchange for the 10-minute window, and rewriting/resetting the delay (re-scheduling for another 10 minutes) for duplicate messages coming from service A.
Could you share your ideas about this proposed solution?
Well, after reading the documentation I'm fairly certain that such logic should be implemented in the application layer (in my situation, the consumer software).
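If the check lives in the consumer, a minimal sketch could look like the following (C#, using the 10-minute window from the question; the key type, the sweep interval and ProcessRequest are placeholders). Every new message for a key just resets its clock, and a key is processed only once it has been quiet for the whole window:

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    class DebouncingConsumer : IDisposable
    {
        private static readonly TimeSpan Window = TimeSpan.FromMinutes(10);
        private readonly ConcurrentDictionary<int, DateTime> _lastSeen =
            new ConcurrentDictionary<int, DateTime>();
        private readonly Timer _sweep;

        public DebouncingConsumer()
        {
            // Periodically look for keys that have been quiet for the whole window.
            _sweep = new Timer(_ => Sweep(), null,
                TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));
        }

        // Called for every message from service A; a duplicate simply resets the clock.
        public void OnMessage(int key, DateTime timestamp) => _lastSeen[key] = timestamp;

        private void Sweep()
        {
            foreach (var entry in _lastSeen)
            {
                // Races between the sweep and new messages are ignored in this sketch.
                if (DateTime.Now - entry.Value >= Window &&
                    _lastSeen.TryRemove(entry.Key, out var last))
                {
                    ProcessRequest(entry.Key, last);   // placeholder for service B's real work
                }
            }
        }

        private void ProcessRequest(int key, DateTime lastSeen) { /* ... */ }

        public void Dispose() => _sweep.Dispose();
    }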

NServiceBus Timeoutsdispatcher queue is being flooded with messages during stress tests

I'm doing some stress tests on a saga that uses 2 timeouts. During the test about 21K sagas get created. That would mean 42K timeouts, but I notice that the timeoutsdispatcher queue of the saga is getting flooded with hundreds of thousands of messages until it crashes because the MSMQ storage limit is hit.
I'm seeing this behavior since I switched the persistence mechanism from RavenDB to SQL Server.
Does anyone have an idea what could be wrong?
Transport: MSMQ
Persistence: NHibernate
Packages used:
NHibernate version 4.0.4.4000
NServiceBus version 5.2.14
NServiceBus.Host version 6.0.0
NServiceBus.Log4Net version 1.0.0
NServiceBus.NHibernate version 6.2.7
Test setup:
* endpoint 1 is sending 22000 messages to endpoint 2.
* endpoint 2 hosts a saga that is started by that message.
* each saga publishes an event and then requests 2 timeouts: 1 at 4 minutes, 1 at 10 minutes.
Observed behavior:
* endpoint 1 sends the 22K messages in under a minute.
* endpoint 2 (the saga) processes 5 to 10 messages per second.
* after 4 minutes the first timeouts are fired, while endpoint 2 is still processing messages from its queue and thus is still creating new saga instances.
* from that moment on, the timeoutsdispatcher queue of the saga endpoint is getting filled with messages.
* after 10 minutes or so, the timeoutsdispatcher queue already contains over 170K messages and is still filling up.
* That continues until endpoint 2 crashes because the MSMQ storage limit is hit, or all messages from the input queue are processed. If the latter occurs first, the timeoutsdispatcher queue message count starts to decrease until it eventually reaches 0.
Did you perform the same stress test with RavenDB? And is SQL Server on a machine that's more-or-less equally powerful, with fast drives?
Update
Some checks for your saga
Is the [Unique] attribute used, and is it used properly? In other words, do you use a unique id for every incoming message, so that every incoming message that spawns 2 timeouts creates its own saga instance (see the saga sketch after these checks)? If every incoming message is accessing the same saga, that alone would severely limit throughput. Imagine the saga instance has already been created (otherwise the explanation gets too complex): Message1 comes in, tries to find the row in the database, finds it and locks it. Message2 comes in at the same time, finds the row, but it's locked, so it goes into retry. Message3 up to Message100 come in (if concurrency is set to 100) and all try to do the same thing, immediately failing. You can see this will limit throughput for a while :)
Are the correct indexes on your Saga table(s) and Timeout tables?
What is your maximum concurrency level set to?
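For reference, a saga shaped roughly like the one described would look like this in NServiceBus 5-style code (the message and property names here are made up); the point is the unique correlation id, so each of the 22K incoming messages creates and finds its own saga instance instead of contending for a shared row:

    using System;
    using NServiceBus;
    using NServiceBus.Saga;

    public class StartOrder : ICommand { public Guid OrderId { get; set; } }
    public class OrderStarted : IEvent { public Guid OrderId { get; set; } }
    public class FirstTimeout { }
    public class SecondTimeout { }

    public class OrderSagaData : ContainSagaData
    {
        [Unique]                                  // one saga row per OrderId
        public virtual Guid OrderId { get; set; }
    }

    public class OrderSaga : Saga<OrderSagaData>,
        IAmStartedByMessages<StartOrder>,
        IHandleTimeouts<FirstTimeout>,
        IHandleTimeouts<SecondTimeout>
    {
        protected override void ConfigureHowToFindSaga(SagaPropertyMapper<OrderSagaData> mapper)
        {
            // Each StartOrder with a distinct OrderId maps to its own saga instance.
            mapper.ConfigureMapping<StartOrder>(m => m.OrderId).ToSaga(s => s.OrderId);
        }

        public void Handle(StartOrder message)
        {
            Data.OrderId = message.OrderId;
            Bus.Publish(new OrderStarted { OrderId = message.OrderId });
            RequestTimeout<FirstTimeout>(TimeSpan.FromMinutes(4));
            RequestTimeout<SecondTimeout>(TimeSpan.FromMinutes(10));
        }

        public void Timeout(FirstTimeout state)  { /* first timeout work */ }
        public void Timeout(SecondTimeout state) { MarkAsComplete(); }
    }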
Based on the numbers: you say you send 22k messages, resulting in 44k timeout messages. Imagine all of these timeouts are sitting in MSMQ, and imagine the messages are really, really small, say 1 KB, with maybe another 2 KB of header information added by NServiceBus. That's 44,000 times 3 KB, roughly 130 megabytes, so there's no way that alone can fill up a default MSMQ installation, which has a quota of 1 GB.
This probably means your dead-letter queue has filled up completely. Read up on MSMQ connection strings and set the appropriate connection string. For example:
<connectionStrings>
  <add name="NServiceBus/Transport"
       connectionString="deadLetter=false;journal=false;" />
</connectionStrings>
Messages with the TimeToBeReceived attribute set (link) end up in the dead-letter queue, and purging queues also sends all messages to the dead-letter queue, unless you set the proper connection string.

MSMQ + WCF - Retry with Growing Delay

I am using MSMQ 4 with WCF. We have a Microsoft Dynamics plugin putting a message on a queue. A service picks up the message and makes an HTTP request to another web server. The web server responds by putting another message on a different queue. A second service picks up that message and sends the response back to Dynamics...
We have our retry queue set up to retry 3 times and then wait 5 minutes before retrying again. The Dynamics system sometimes takes so long (due to other plugins) that we can complete the round trip before its database transaction commits. The users then don't see the update come through for another 5 minutes.
I am curious whether there is a way to configure the retry mechanism to back off incrementally: the first time it fails, it waits only a few seconds; if it fails a second time, it waits twice as long; and the time between retries keeps growing.
The problem with just reducing the time between retries is that a bad message could easily fill up a log file.
It turns out there is no built-in way of doing this. One slightly involved option is to create multiple queues, each with its own retry/poison sub-queues, each with a growing retry delay. You can reuse the same handler for each queue - the only thing that changes is the configuration. You also need a handler that can read the poison sub-queues (service) and move the message to the next queue in the chain (client).
So, you set receiveErrorHandling to Move, and set maxRetryCycles and receiveRetryCount to just 1. Each queue in the chain uses a progressively larger retryCycleDelay. Each queue you create will have a poison sub-queue created for it automatically; you simply read from each poison sub-queue and use a client to move the message to the next queue, as in the configuration sketch below.
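As a rough illustration (the binding name and values are invented, but the attributes are the standard netMsmqBinding ones named above), each queue in the chain gets an identical binding except for the delay:

    <bindings>
      <netMsmqBinding>
        <!-- stage 2 of the chain: one cycle, one read attempt, 30-second delay;
             a later stage would use a larger retryCycleDelay; failures are moved
             to this queue's poison sub-queue -->
        <binding name="RetryStage2"
                 receiveErrorHandling="Move"
                 maxRetryCycles="1"
                 receiveRetryCount="1"
                 retryCycleDelay="00:00:30" />
      </netMsmqBinding>
    </bindings>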
I am sure someone could write some code that would automatically create N queues with a growing retryCycleDelay and hook it all up programmatically. Since it is the same handler/client for every queue, it wouldn't be a big deal.
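And a sketch of the mover piece, assuming System.Messaging on .NET 4 / MSMQ 4 with its ";poison" sub-queue path syntax (the queue paths are placeholders). It drains one stage's poison sub-queue and forwards each message, transactionally, to the next stage:

    using System;
    using System.Messaging;

    class PoisonQueueMover
    {
        // e.g. Drain(@".\private$\requests-stage1;poison", @".\private$\requests-stage2")
        public static void Drain(string poisonSubQueuePath, string nextQueuePath)
        {
            using (var poison = new MessageQueue(poisonSubQueuePath))
            using (var next = new MessageQueue(nextQueuePath))
            {
                poison.MessageReadPropertyFilter.SetAll();   // carry the full message across

                while (true)
                {
                    using (var tx = new MessageQueueTransaction())
                    {
                        tx.Begin();
                        Message message;
                        try
                        {
                            message = poison.Receive(TimeSpan.FromSeconds(1), tx);
                        }
                        catch (MessageQueueException)
                        {
                            tx.Abort();
                            break;                           // sub-queue is empty
                        }

                        next.Send(message, tx);              // hand off to the next stage
                        tx.Commit();
                    }
                }
            }
        }
    }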

How to tell for a particular request that all available worker threads are BUSY

I have a high-rate UDP server using Netty (3.6.6-Final), but the back-end servers it calls can take 1 to 10 seconds to respond - I have no control over those, so I cannot improve latency there.
What happens is that all handler worker threads end up busy waiting for responses, so any new request has to wait to be processed and, over time, its response arrives very late. Is it possible to detect, for a given request, that the thread pool is exhausted, so I can intercept the request early and issue a "server busy" response?
I would use an ExecutionHandler configured with an appropriate ThreadPoolExecutor, with a maximum thread count and a bounded task queue. By choosing different RejectedExecutionHandler policies, you can either catch the RejectedExecutionException and answer with a "server busy" response, or use a caller-runs policy, in which case the I/O worker thread executes the task itself and creates back pressure (but that is what you wanted to avoid).
Either way, an execution handler with a limited capacity is the way forward.
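That answer is specific to Netty 3's ExecutionHandler, but the underlying idea - a bounded worker pool that fails fast instead of queueing indefinitely - is general. Purely as an illustration, the same pattern in C# (the limit of 100 and the method names are made up):

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    static class BusyGate
    {
        // Cap on concurrent back-end calls; anything beyond it is rejected immediately.
        private static readonly SemaphoreSlim Permits = new SemaphoreSlim(100);

        public static async Task<string> HandleAsync(Func<Task<string>> callBackend)
        {
            // WaitAsync(0) never blocks: if no permit is free, the pool is "exhausted".
            if (!await Permits.WaitAsync(0))
                return "server busy";        // analogous to catching RejectedExecutionException

            try
            {
                return await callBackend();  // the slow 1-10 second back-end call
            }
            finally
            {
                Permits.Release();
            }
        }
    }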

NServiceBus delay retries

We need to be able to specify a delay when retrying failed messages. NServiceBus retries more or less instantly, up to n times (as configured), before moving the message to the error queue.
What I need is to be able to specify, for a given message type, that it is not to be retried for an arbitrary period of time.
I've read the post here:
NServiceBus Retry Delay
but this doesn't give what I'm looking for.
Kind regards
Ben
This isn't supported as of right now. What you can do is let the messages go to the error queue and set up an endpoint to monitor that queue; your code could then determine the rules for replaying messages. You could use a saga in combination with the Timeout manager to achieve this.
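A sketch of that saga + Timeout manager approach, written against a later (v5-style) saga API purely for illustration; the message types, the 30-minute delay, and the replay step are all assumptions:

    using System;
    using NServiceBus;
    using NServiceBus.Saga;

    // Hypothetical command sent by the endpoint that monitors the error queue.
    public class FailedMessageDetected : ICommand
    {
        public Guid FailureId { get; set; }
    }

    public class ReplayDelayExpired { }

    public class DelayedRetrySagaData : ContainSagaData
    {
        [Unique]
        public virtual Guid FailureId { get; set; }
    }

    public class DelayedRetrySaga : Saga<DelayedRetrySagaData>,
        IAmStartedByMessages<FailedMessageDetected>,
        IHandleTimeouts<ReplayDelayExpired>
    {
        protected override void ConfigureHowToFindSaga(SagaPropertyMapper<DelayedRetrySagaData> mapper)
        {
            mapper.ConfigureMapping<FailedMessageDetected>(m => m.FailureId).ToSaga(s => s.FailureId);
        }

        public void Handle(FailedMessageDetected message)
        {
            Data.FailureId = message.FailureId;
            // Per-message-type rule: don't retry this one for 30 minutes (arbitrary example).
            RequestTimeout<ReplayDelayExpired>(TimeSpan.FromMinutes(30));
        }

        public void Timeout(ReplayDelayExpired state)
        {
            // Delay has passed: trigger the replay here (e.g. move the stored message
            // from the error queue back to its source queue), then finish the saga.
            MarkAsComplete();
        }
    }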
Typically you'll have some rules around when to replay messages. In NSB 3.0 we have a better way to do this using the FaultManager. It gives you options for where to put failed messages, and it includes the exception. One of the options is a database, against which you could set up a job that inspects the exception and decides what to do with the message.
Lastly, a low-tech way of getting this is to schedule a job that periodically runs the ReturnToSourceQueue tool to "clean up". We are doing this, and we include an alert so we don't endlessly cycle messages around.