WCF: The user specified maximum retry count for a particular message has been exceeded - wcf

Getting this WCF error, and no idea how to fix it:
System.ServiceModel.CommunicationException: The sequence has been terminated by the remote endpoint. The user specified maximum retry count for a particular message has been exceeded. Because of this the reliable session cannot continue. The reliable session was faulted.
Any ideas welcome :(

From the error message, it would appear that you're using reliable messaging. One of its features is that if a message transfer fails, it will be retried - up to a maximum number of attempts.
Obviously, in your setup, this max number has been maxed out. This might indicate a problem with the network, or your service code, or both. Really hard to tell from here without knowing what you're doing and what your setup is......
I guess the main question would be: do you really need the reliable messaging feature? What are you trying to achieve with this? If you could turn it off, you wouldn't be seeing those errors... can you switch to some other mechanism, maybe message queueing (MSMQ)? Or can you rearchitect your app so you can live with the odd chance that one message might get delivered "out of band" ?

Related

RabbitMQ: how to handle unwanted duplicate un-ack message after connection lost?

In my app(multiple instances), we occasionally see the case where connection is lost between my app and rabbitmq due to network issues(my app and rabbitmq are both alive), then after connection is recovered(re-established) we will receive messages that are unacked.
This creates an issue for us, because my app wasn't dead, and it is still processing the same message it received before, but now the message is redeivered, and it causes the app to process the message again (which can be fatal to us).
Since the app has multiple instances, it is not easy for an instance to check if another instance is processing the same message at the same time. We can't simply filter out redelivered message, because we need this feature to handle instance/app crashes/re-deployments.
It doesn't seem that there is an api to tell rabbitmq when to not redeliver unacked messages.
So what is the recommended practice to handle this situation ?
Thanks,
The general solution for such scenario is to make the consumers handle the messages in an idempotent manner . Generally what I do is from the producer side ( in case there is no unique identifier in the message body ) I add an attribute idempotencyId to the message body which is a guid and on the consumer side for each message this id is validated against the stored value in database , any duplicates are rejected.
This approach also works for messages which might be shoveled from another cluster or if in a same cluster multiple instances of consumers are listening then too this approach guarantee one time processing.
Would suggest to go over the RabbitMQ Reliability Guide here
Yeah, exactly-once delivery is not something RabbitMQ is good at. In fact, I'd say you should probably not be using it for these kinds of problems. Honestly, the only way to truly fix this is to use distributed transactions or locking.
Anyway, you could turn the problem on its head by ack'ing the message as soon as the consumer gets it, before it starts working on it. That would avoid the RabbitMQ-related duplication issue at least. This is at-most-once delivery.
Of course, it means that if the consumer crashes, the message is lost forever. So you need to persist the message right before you ack it so you can recover it later and also the consumer should remove it once it's complete.
Considering that crashes are rare, you can then have a single dedicated process that just works on those persisted messages. Or for that matter, handle them manually.
Just be aware that you are pushing the duplication problem in front of you, because the consumer might fail to remove the persisted message after it's done working with it anyway, but at least you have the option to implement it however you want.
Storage in this case could be anything from files, a RDBMS or something like ZooKeeper or Redis to lock/unlock in-flight messages.

How to use NServiceBus.FLRetries header (NSB 5.2)

I have a PowerShell script that replays the MSMQ messages from an error queue. I was hoping that NServiceBus.FLRetries updates with further failures. However this is not the case.
I have read the documentation but unable to fully understand this particular header.
What I would like to see is the number of retries increase as the message continue to fail to process (e.g. web service not available).
I am using NSB 5.2.
Any ideas how I can model this if the above header is not usable.
Unfortunately, this won't work.
NServiceBus implements First level retries(FLR) by rolling the receive transaction against the queue back if there is an exception. Rolling the transaction back means the message "never left" the queue in the first place. You can't thus modify the message, including adding/updating any headers like NServiceBus.FLRetries. NServiceBus keeps track of this count in-memory instead.
If the message keeps failing and is moved to Second Level Retries(SLR), NServiceBus will add the NServiceBus.FLRetries header to show how many FLR retries were attempted in total.
If you really need to keep track of the number of retries you can disable FLR and use only SLR instead. SLR retries increment the NServiceBus.Retries header for each retry since they don't rely on rolling the receive transaction back.

Immediate flag in RabbitMQ

I have a clients that uses API. The API sends messeges to rabbitmq. Rabbitmq to workers.
I ought to reply to clients if somethings went wrong - message wasn't routed to a certain queue and wasn't obtained for performing at this time ( full confirmation )
A task who is started after 5-10 seconds does not make sense.
Appropriately, I must use mandatory and immediate flags.
I can't increase counts of workers, I can't run workers on another servers. It's a demand.
So, as I could find the immediate flag hadn't been supporting since rabbitmq v.3.0x
The developers of rabbitmq suggests to use TTL=0 for a queue instead but then I will not be able to check status of message.
Whether any opportunity to change that behavior? Please, share your experience how you solved problems like this.
Thank you.
I'm not sure, but after reading your original question in Russian, it might be that using both publisher and consumer confirms may be what you want. See last three paragraphs in this answer.
As you want to get message result for published message from your worker, it looks like RPC pattern is what you want. See RabbitMQ RPC tuttorial. Pick a programming language section there you most comfortable with, overall concept is the same. You may also find Direct reply-to useful.
It's not the same as immediate flag functionality, but in case all your publishers operate with immediate scenario, it might be that AMQP protocol is not the best choice for such kind of task. Immediate mean "deliver this message right now or burn in hell" and it might be a situation when you publish more than you can process. In such cases RPC + response timeout may be a good choice on application side (e.g. socket timeout). But it doesn't work well for non-idempotent RPC calls while message still be processed, so you may want to use per-queue or per-message TTL (or set queue length limit). In case message will be dead-lettered, you may get it there (in case you need that for some reason).
TL;DR
As to "something" can go wrong, it can go so on different levels which we for simplicity define as:
before RabbitMQ, like sending application failure and network problems;
inside RabbitMQ, say, missed destination queue, message timeout, queue length limit, some hard and unexpected internal error;
after RabbitMQ, in most cases - messages processing application error or some third-party services like data persistence or caching layer outage.
Some errors like network outage or hardware error are a bit epic and are not a subject of this q/a.
Typical scenario for guaranteed message delivery is to use publisher confirms or transactions (which are slower). After you got a confirm it mean that RabbitMQ got your message and if it has route - placed in a queue. If not it is dropped OR if mandatory flag set returned with basic.return method.
For consumers it's similar - after basic.consumer/basic.get, client ack'ed message it considered received and removed from queue.
So when you use confirms on both ends, you are protected from message loss (we'll not run into a situation that there might be some bug in RabbitMQ itself).
Bogdan, thank you for your reply.
Seems, I expressed my thought enough clearly.
Scheme may looks like this. Each component of system must do what it must do :)
The an idea is make every component more simple.
How to task is performed.
Clients goes to HTTP-API with requests and must obtain a respones like this:
Positive - it have put to queue
Negative - response with error and a reason
When I was talking about confirmation I meant that I must to know that a message is delivered ( there are no free workers - rabbitmq can remove a message ), a client must be notified.
A sent message couldn't be delivered to certain queue, a client must be notified.
How to a message is handled.
Messages is sent for performing.
Status of perfoming is written into HeartBeat
Status.
Clients obtain status from HeartBeat by itself and then decide that
it's have to do.
I'm not sure, that RPC may be useful for us i.e. RPC means that clients must to wait response from server. Tasks may works a long time. Excess bound between clients and servers, additional logic on client-side.
Limited size of queue maybe not useful too.
Possible situation when a size of queue maybe greater than counts of workers. ( problem in configuration or defined settings ).
Then an idea with 5-10 seconds doesn't make sense.
TTL doesn't usefull because of:
Setting the TTL to 0 causes messages to be expired upon reaching a
queue unless they can be delivered to a consumer immediately. Thus
this provides an alternative to basic.publish's immediate flag, which
the RabbitMQ server does not support. Unlike that flag, no
basic.returns are issued, and if a dead letter exchange is set then
messages will be dead-lettered.
direct reply-to :
The RPC server will then see a reply-to property with a generated
name. It should publish to the default exchange ("") with the routing
key set to this value (i.e. just as if it were sending to a reply
queue as usual). The message will then be sent straight to the client
consumer.
Then I will not be able to route messages.
So, I'm sorry. I may flounder in terms i.e. I'm new in AMQP and rabbitmq.

How do I correctly configure a WCF NetTcp Duplex Reliable Session?

Please excuse the Obvious Self-Q/A, but this information is widely misunderstood, and almost always incorrectly answered. So I Wanted to place this information here for people searching for a definitive answer to this problem.
Even so, there's still some information I haven't been able to nail down. I will put this towards the end of the question (skip to that if you are not interested in the preamble).
How do I correctly configure a WCF NetTcp Duplex Reliable Session?
There are many questions and answers regarding this topic, and nearly all of them suggest setting inactivityTimeout="Infinite" in your configuration. This doesn't really seem to work correctly, particularly for the case of NetTcp (It may work correctly for WSDualHttp Bindings, but I have never used those).
There are a number of other issues that are often associated with this: Including, Channel not faulting after client or server unexpectedly disconnected, Channel disconnecting after 10 minutes, Channel randomly disconnecting... Channel throwing exception when trying to open... Unable to configure Metadata on same endpoint...
Please note: There are two concepts that are important below. Infrastructure messages are internal to the way WCF communicates, and are used by the framework to keep things running smoothly. Operation messages are messages that occur because your app has done something, like send a message across the wire. Infrastructure messages are largely invisible to your app (but they still occur in the background) while operation messages are the result of an action your app has taken.
Information I have figured out, through hard won trial and error.
Infinite does not appear to be a valid configuration setting in all situations (and certainly, the visual studio validation schema doesn't know about it).
There are two special configuration converters, called InfiniteIntConverter and InfiniteTimeSpanConverter which will sometimes work to convert the value Infinite to either Int.MaxValue or TimeSpan.MaxValue, but I haven't yet figured out the situations in which this appears to be valid as sometimes it works, and sometimes it doesn't. What's more, it appears that some libraries will allow Infinite in the config, while others will not, so you can succeed in one part of a configuration, but fail in another.
You must configure BOTH inactivityTimeout and receiveTimeout, on both the client and the server. While these values do not HAVE to be the same, they probably should be as they will probably cause confusion if they are not. (technically, you can leave inactivityTimeout to its default value if you want, but you should be aware of its value, and what it does)
inactivityTimeout should NEVER be set to a large value, much less Infinite or TimeSpan.MaxValue.
inactivityTimeout has two functions (and this is not widely understood). The first function defines the maximum amount of time that can elapse on a channel without receiving any "infrastructure" or "operation" messages. The second function defines the time period in which infrastructure messages are sent (half the time specified). If no infrastructure or operation messages have been received during the timeout period, the connection is aborted.
receiveTimeout specifies the maximum amount of time that can elapse between operation messages only. This value can be set to a large value, such as TimeSpan.MaxValue (particularly if your channel runs internally over a trusted network or over a vpn). This value is what defines how long the reliable session will "stay alive" if there is no activity between client and server (other than infrastructure messages). ie, your client does not call any methods of the interface, and your server does not call back into the client.
setting a short inactivityTimeout and a large receiveTimeout keeps your reliable session "tacked up" even when there is no operational activity between your client and server. The short inactivity timeout (i like to keep the default 10 minutes or less) sends infrastructure "ping" messages to keep the TCP connection alive while the long receive timeout keeps the reliable session active. while at the same time providing a reasonable timeout in case of disconnection.
If you set inactivityTimeout to a large value, then the reliable session will not be reliable as it has no way to keep the Tcp connection alive, nor does it have any way to verify the integrity of the connection. It won't know if a user has disconnected unexpectedly until you try and send a message to that client and find out the connection is no longer there. This is why many people who use Infinite for this setting resort to creating a "Ping" method in their service, which is completely unnecessary if you've configured these settings correctly.
If you set inactivityTimeout to a value larger than receiveTimeout then it will likewise also be unreliable, as you will still be governed by the receiveTimeout for operation messages. ie. if you forget to set receiveTimeout and leave it at the default 10 minutes, then if the user is idle for 10 minutes, the connection will be aborted.
When the client or server unexpectedly disconnects (app crashes, network failure, someone trips over the power cord, etc..), the other side may not notice right away. I have attached various ChannelFaulted event handlers in various test situations, and sometimes the connection is faulted right away... other times it doesn't seem to fault at all. What i have discovered through trial and error is that the when it doesn't seem to fault, it will actually fault after the inactivityTimeout expires on that end. (so if it's set to 10 minutes, then after 10 minutes it will call the ChannelFaulted event).
I have not yet figured out why in some situations it notices the disconnection right away, and others it waits for the timer to expire. In both cases, I notice internal first chance communication exceptions thrown and handled by the framework, and there are calls to Abort the connection... but somehow the call to the event handler gets lost and it must wait for the timeout. My suspicion is this is somehow thread related.
When trying to configure Metadata to work across the NetTcp channel, I have had sporadic results. Sometimes it works, sometimes it doesn't. I've read many reports that Metadata doesn't work over NetTcp and that you have to use an Http channel for the Metadata, but I have in fact had it work on several occasions using the net.tcp:// url to generate the proxy. Then I would change something, recompile and it would no longer work. Changing things back, it wouldn't work again. So it was very confusing what magic incantation was necessary to make Metadata function over net.tcp, shared with the endpoint on the same port (obviously with a different address).
When configuring both a NetTcp and Metatdata endpoint on the same service, and specifying non-default settings for connection parameters like listenBacklog, and maxConnections, you also need to make sure the Metadata endpoint uses the same settings, which typically means you have to define a custom binding, since these settings are not available from the standard tcp mex binding. This includes setting listenBacklog and maxPendingConnections on tcpTransport, and groupName and maxOutboundConnectionsPerEndpoint on connectionPoolSettings.
The default setting for the Ordered setting of ReliableSession is True. This uses a lot more overhead than turning it off. If you don't need ordered messages, i would suggest turning it off (need to set this on both sides)
-
Configuration I still need to understand:
How do I correctly configure the shared net.tcp Metadata endpoint? (I will add an example when I get a chance) Currently, i'm specifying an http get url to bypass the problem. It's so inconsistent as to why it sometimes works and sometimes does not. I kept getting the error `The URI Prefix is not recognized' when generating the proxy in Visual Studio.
Why does WCF sometimes Fault the channel immediately upon disconnect, and sometimes waits for inactivityTimeout to expire? What controls/causes one vs the other behavior?

NServiceBus delay retries

We need to be able to specify a delay in retrying failed messages. NServiceBus retries more or less instantly up to n times (as configured) before moving the message to error queue.
What I need to be able to do is for a given message type specify that its not to be retried for an arbitrary period of time
I've read the post here:
NServiceBus Retry Delay
but this doesn't give what I'm looking for.
Kind regards
Ben
This isn't supported as of right now. What you can do is let the messages go to the error queue and setup and endpoint to monitor that queue. Your code could then determine the rules for replaying messages. You could use a Saga to achieve this in combination with the Timeout manager.
Typically you'll have some rules around when to replay messages. In NSB 3.0 we have a better way to do this using the FaultManager. This gives you options on where to put failed messages and includes the exception. One of the options is a DB which you could then set up a job to inspect the exception and determine what to do with it.
Lastly a low tech way of getting this is to schedule a job that runs the ReturnToSourceQueue tool periodically to "clean up". We are doing this and including an alert so we don't endlessly cycle messages around.