NServiceBus 2.6 continually retries message when handler throws an exception - nservicebus

We have a production issue that happens infrequently but when it does it makes the entire service unusable.
The issue is that an exception in a message handler causes NServiceBus to enter an endless loop where it continually retries the message.
I found this post that seems to describe the same issue but I can't find a fix:
http://tech.dir.groups.yahoo.com/group/nservicebus/message/4316

Related

WCF InstancePersistenceCommand Exception

I have a WCF application that performs asynchronous communication with external services. When we start a new case, a new workflow instance is created; it processes data, sends an XML document to an external service, and waits for the response. The response requires a person to review the XML before replying, so it is usually delayed for a long time. For this reason, the workflow goes idle and we use persistence with AppFabric.
The problem is that sometimes, when we receive the response, the following exception is raised:
The execution of the InstancePersistenceCommand named {urn:schemas-microsoft-com:System.Activities.Persistence/command}LoadWorkflowByInstanceKey was interrupted by an error.
Normally this error does not occur; it appears only sporadically. However, we are trying to update the app to include new functionality (it does not modify the workflow), and once the application is deployed to the server, the instances that were created with the old deployment and were waiting for a response throw this exception when they receive the response from the external service. Instances started with the new deployment process the response without problems.
I have been looking for information about this problem but haven't found much. Can anybody help me?
SOLUTION:
Thanks a lot for your answer; it may be helpful to me in the future. In this case, the problem was that I had changed the assembly version of one of the projects involved (in order to upload a NuGet package). For a reason that I don't understand, the instances created with the old version raised this exception when the service running the new version had to manipulate them.
If I change the assembly version to upload the NuGet package, then restore the original version and deploy with that version, everything works. Does anybody know the reason?
Thanks a lot.
This may be because a background task tries to extend the lock on the instance store every 30 seconds, and it seems that whenever the connection to the SQL service fails, it marks the instance store as invalid.
You can try <workflowIdle timeToUnload="0"/>. If that doesn't work, you can look at the approaches described in these other questions:
Windows workflow 4.0 InstancePersistenceCommand Error
Why do I get exception "The execution of the InstancePersistenceCommand named LoadWorkflowByInstanceKey was interrupted by an error"
WF4 InstancePersistenceCommand interrupted

How to log recovery by redelivery in Apache-Camel error handler?

How can I find out when a Camel route's redelivery error handler has successfully recovered from an error?
I would like to collect metrics on successful redeliveries performed by a Camel error handler retry.
Specifically, I would like to know how many message exchanges that hit a network error during a file transfer were successfully recovered after a retry.

Troubleshooting Web App process restarting

Our web app process is restarting regularly and we are unable to determine the reason.
When looking at Application Events (using the 'Diagnose and solve problems' blade in the Azure Portal), there are many of the following Info logs from 'IIS AspNetCore Module':
Event ID 1005:
Failed to gracefully shutdown process '14040'.
Event ID 1001:
Application 'MACHINE/WEBROOT/APPHOST/myapplication__xxxx' started process '31628' successfully and is listening on port '17663'.
There is nothing fishy with general resource usage and nothing in our application logs.
What is the best way to troubleshoot the reason behind these process restarts?
EDIT 1:
After fiddling around with web logging in the Web App's Diagnostic Logs, I now get an error logged from W3SVC-WP after each restart, but the message is nonsense:
1<br/>5<br/>50000780
EDIT 2:
Event Id 2284 refers to this:
FailedRequestTracing module failed to write buffered events to log
file for the request that matched failure definition. No logs will be
generated until this condition is corrected. The problem happened at
least %1 times in the last %2 minutes. The data is the error.
I'm not sure if this could be related to our Diagnostic Logs configuration, but seems unlikely.
EDIT 3:
As per Brando Zhang's suggestion, I've used the Web App Crash Diagnoser extension and tried monitoring 2nd Chance Unhandled Exceptions on both my application process and w3wp, but nothing is dumped.
As I understand it, 1st Chance Exceptions do not crash the process, so there is no need to monitor them.
Very likely the application is crashing due to a fatal exception, and that is causing the restarts.
On the Azure App Service platform, you can use Diagnostics as a Service (DaaS) to troubleshoot this.
It can also run an analysis and tell you the root cause most of the time. More step-by-step information can be found on this MSDN blog. Also refer to the tips for using Crash Diagnoser.

Google PubSub error [code=8a75]

Today I started getting this error sporadically. The Google Pub/Sub error code documentation only covers HTTP error codes. Does anyone know about this error?
ERROR Error: The service was unable to fulfill your request. Please try again. [code=8a75]
This error is retryable and is expected to occur occasionally. The recommended solution is either to make your code retry automatically with backoff, or to use one of the official client libraries, which retry these errors with backoff automatically.
In general these errors should be independent, meaning after a retry or two the odds that your RPC fails should be very low.
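As a rough illustration of retrying with backoff, here is a minimal sketch in Python. It is not the client library's implementation: `TransientError`, `call_with_backoff`, and all of the default values are hypothetical names and numbers chosen for the example, and the delay uses exponential growth with random jitter.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable service error."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff.

    The delay ceiling doubles after each failed attempt (capped at max_delay),
    and the actual wait is a random value between 0 and that ceiling.
    All defaults are illustrative assumptions, not recommended values.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the last error.
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use you would call `call_with_backoff(lambda: publish(message))`, where `publish` is whatever function of yours raises the retryable error.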

NServiceBus exceptions logged as INFO messages

I'm running an NServiceBus endpoint on an Azure Worker Role. I send all diagnostics to table storage at the moment. I was getting messages in my DLQ, and I couldn't figure out why no exceptions were being logged in my table storage.
It turns out that NSB logs the exceptions as INFO, which is why I couldn't easily spot them in between all the actual verbose logging.
In my case, a command handler's dependencies couldn't be resolved, so Autofac throws an exception. I totally get why the exception is thrown; I just don't understand why it is logged as INFO. The message ends up in my DLQ, and I only have an INFO trace to understand why.
Is there a reason why exceptions are handled this way in NSB?
NServiceBus does not log the container issue as an error because it happens during an attempt to process a message. First Level Retries and Second Level Retries will be attempted. When an SLR is executed, it logs a WARN about the retry. Ultimately, the message fails processing and an ERROR message is logged. The NSB and Autofac sample can be used to reproduce this.
When the endpoint runs in a scaled-out role and MaxDeliveryCount is not big enough to accommodate all the role instances and the retry count that each instance holds, DeliveryCount reaches its maximum while the NServiceBus endpoint instance still thinks it has attempts left before sending the message to the error queue and logging an error. As in the question here, I'd recommend increasing MaxDeliveryCount.
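To make the sizing argument concrete, here is a small back-of-the-envelope sketch in Python. It assumes, conservatively, that every role instance keeps its own in-memory attempt counter and may use all of its attempts on the same message; the function names and the example numbers are hypothetical and are not taken from NServiceBus or Azure documentation.

```python
def min_delivery_count(instances, attempts_per_instance):
    """Conservative lower bound for MaxDeliveryCount (an assumption).

    If each instance counts attempts independently, the broker may deliver
    the same message up to instances * attempts_per_instance times before
    every instance has exhausted its own retries.
    """
    if instances < 1 or attempts_per_instance < 1:
        raise ValueError("both arguments must be at least 1")
    return instances * attempts_per_instance


def is_delivery_count_sufficient(max_delivery_count, instances,
                                 attempts_per_instance):
    """True if the broker limit leaves room for every instance's attempts."""
    return max_delivery_count >= min_delivery_count(
        instances, attempts_per_instance)
```

For example, with 3 instances that each make 5 attempts, `min_delivery_count(3, 5)` gives 15, so a queue limit of 10 would be exhausted before any single instance is guaranteed to have given up and logged an error.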
There is an open NServiceBus issue for native support of an SLR counter. You can add your voice to the issue. The next version of NServiceBus (V6) will log the message ID along with the exception, so that you can at least correlate a message in the DLQ with the log file.