I am using Redisson 3.8.2 to connect to a replicated AWS ElastiCache cluster. After a while of operation, my client frequently gets timeout exceptions trying to subscribe to topics.
I've checked the load on AWS and on my client: AWS is barely above idle, and the client has far fewer subscriptions than it should be able to support (subscription connection pool * subscriptions per connection).
I have tried adjusting the subscription connection pool and subscriptions-per-connection settings (a sketch of what I've been adjusting is below the stack trace), but the issue persists.
The exception is thrown at a high level, timing out while waiting for the Redisson promise to sync. Looking through the code behind the promise, there is a lot going on: at least two locks in the Java code, plus async futures to subscribe and attach the listener.
Is there a way I can get more debugging information from Redisson about where it is timing out and what stage it reaches, and, when it does fail, to see the state of the connection pools and connection entries?
org.redisson.client.RedisTimeoutException: Subscribe timeout: (7500ms)
at org.redisson.command.CommandAsyncService.syncSubscription(CommandAsyncService.java:142) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonTopic.addListener(RedissonTopic.java:133) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonTopic.addListener(RedissonTopic.java:109) ~[redisson-3.8.2.jar!/:na]
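For reference, the replicated-servers settings I have been adjusting look roughly like this (the endpoint is a placeholder and the values are just ones I have been experimenting with, not recommendations):

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

Config config = new Config();
config.useReplicatedServers()
      // placeholder ElastiCache replication group endpoint
      .addNodeAddress("redis://my-replication-group.cache.amazonaws.com:6379")
      // defaults are 50 and 5; raising them has not made the timeouts go away
      .setSubscriptionConnectionPoolSize(100)
      .setSubscriptionsPerConnection(10);
RedissonClient client = Redisson.create(config);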
I have a queue consumer that gets messages from RabbitMQ using MassTransit and sends them to a cloud system through an HTTP API. I have handled all the exceptions I could and wrapped the HTTP calls in a Polly retry for transient network errors. The problem with this approach is that the message is effectively abandoned once the retries are exhausted.
I have read the MassTransit documentation on error handling and faults, added code to publish the fault, and written a fault consumer that listens for the fault message after some number of Polly retries.
Assume the destination system is down for 10 hours (an outage we don't know about in advance; otherwise I would plan to stop the consumer service). What is the best strategy with MassTransit to stop pulling messages from the queue into the consumer? Is there a way to stop receiving messages based on the number of failures, etc.?
Thanks
You need a circuit breaker; it's a well-known pattern in distributed systems. The circuit breaker activates when the remote system is struggling under load and sending it more requests would potentially strangle it. It also lets you stop sending messages to the remote system when it is down.
The circuit breaker is available in MassTransit out of the box.
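So you don't need to write one yourself, but conceptually the pattern behaves like this minimal sketch (shown here in plain Java purely to illustrate the idea, not MassTransit's middleware; threshold and reset interval are illustrative):

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

class CircuitBreaker {
    private final int tripThreshold;
    private final Duration resetInterval;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    CircuitBreaker(int tripThreshold, Duration resetInterval) {
        this.tripThreshold = tripThreshold;
        this.resetInterval = resetInterval;
    }

    synchronized <T> T call(Callable<T> action) throws Exception {
        if (openedAt != null) {
            if (Instant.now().isBefore(openedAt.plus(resetInterval))) {
                // open: reject immediately instead of hammering the remote system
                throw new IllegalStateException("circuit open, rejecting call");
            }
            openedAt = null; // half-open: let one trial call through
        }
        try {
            T result = action.call();
            consecutiveFailures = 0; // success closes the circuit again
            return result;
        } catch (Exception e) {
            if (++consecutiveFailures >= tripThreshold) {
                openedAt = Instant.now(); // too many failures: open the circuit
            }
            throw e;
        }
    }
}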
I would also not recommend implementing retries with Polly in the consumer. MassTransit has a comprehensive set of retry policies, and using them lets MassTransit see how many failures occur in the consumer, which it cannot do when you use Polly. For example, the circuit breaker middleware won't know about failures inside a Polly-wrapped call and therefore won't react properly.
If the remote system is down for a long time (hours, as you described), any retry policy with a limited number of attempts will eventually fail. The circuit breaker will open, but it resets from time to time and tries calling the consumer again; otherwise it would never know when the remote system has recovered. So you would either need to recover messages from the error queue or add the redelivery middleware.
You can therefore configure your receive pipeline this way:
redelivery -> circuit breaker -> retry -> consumer
I am using ServiceStack 5.0.2 and Redis 3.2.100 on Windows.
I have several nodes with an active Pub/Sub subscription and a few publishes per second.
I noticed that if the Redis service restarts while there is no physical network connection (so one of the clients cannot reach the Redis service), that client stops receiving any messages after the network recovers. Let's call it a "zombie subscriber": it thinks it is still operational but never actually receives a message; the client believes it has a connection, while the corresponding connection on the server is closed.
The problem is that no exception is thrown in RedisSubscription.SubscribeToChannels, so I am not able to detect the issue in order to resubscribe.
I have also analyzed RedisPubSubServer, and I think I have found the problem: in the described case, RedisPubSubServer tries to restart (sending the CTRL stop command), but the "zombie subscriber" never receives it, so no resubscription is made.
"rabbitmqctl list_connections" shows as running but on the UI in the connections tab, under client properties, i see "connection.blocked: true".
I can see that messages are in queued in RabbitMq and the connection is in idle state.
I am running Airflow with Celery. My jobs are not executing at all.
Is this the reason why jobs are not executing?
How do I resolve the issue so that my jobs start running?
I'm experiencing the same kind of issue just using Celery.
It seems that when you have a lot of fairly chunky messages in the queue and the node's memory usage climbs, the RabbitMQ memory watermark gets crossed, which blocks consumer connections, so no worker can access that node (and its queues).
At the same time, publishers keep happily sending messages via the exchange, so you end up in a lose-lose situation.
The only solution we found was to avoid hitting that memory watermark and to scale up the number of consumers.
Keep messages/tasks lean so that the signature is KBs rather than MBs.
I am using the Jedis client to connect to my Redis server. The following are the settings I'm using for connecting with Jedis (using Apache Commons Pool):
JedisPoolConfig poolConfig = new JedisPoolConfig();
poolConfig.setTestOnBorrow(true);
poolConfig.setTestOnReturn(true);
poolConfig.setMaxIdle(400);
// Tests whether connections are dead during idle periods
poolConfig.setTestWhileIdle(true);
poolConfig.setMaxTotal(400);
// configure a good max wait value so that timeouts don't occur
poolConfig.setMaxWaitMillis(120000);
So far, with these settings, I'm not facing any reliability issues (I can always get a Jedis connection whenever I want), but I am seeing a certain lag in Jedis performance.
Can anyone suggest further optimizations for achieving higher performance?
You have 3 tests configured:
TestOnBorrow - Sends a PING request when you ask for the resource.
TestOnReturn - Sends a PING when you return a resource to the pool.
TestWhileIdle - Sends periodic PINGs from idle resources in the pool.
While it is nice to know your connections are still alive, those onBorrow PING requests waste an RTT before every request, and the other two tests waste valuable Redis resources. In theory, a connection can go bad even after a successful PING, so you should catch connection exceptions in your code and deal with them regardless. If your network is stable and you do not have too many drops, you should remove those tests and handle this scenario only in your exception handling.
Also, by setting MaxIdle == MaxTotal, there will be no eviction of resources from your pool (good or bad, depending on your usage). And when your pool is exhausted, an attempt to get a resource will end up timing out after 2 minutes of waiting for a free one.
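A leaner version of that setup, with the validation tests dropped and broken connections handled at the call site, might look something like this (host, port and wait time are placeholders, not recommendations):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;
import redis.clients.jedis.exceptions.JedisConnectionException;

JedisPoolConfig poolConfig = new JedisPoolConfig();
poolConfig.setMaxTotal(400);
poolConfig.setMaxIdle(400);
// fail fast instead of blocking callers for 2 minutes when the pool is exhausted
poolConfig.setMaxWaitMillis(2000);
// no testOnBorrow / testOnReturn: dead connections are detected through the
// exception handling below instead of an extra PING on every borrow/return

JedisPool pool = new JedisPool(poolConfig, "localhost", 6379);

try (Jedis jedis = pool.getResource()) {
    jedis.set("key", "value");
} catch (JedisConnectionException e) {
    // the connection went bad despite the pool; log and retry here
    // (recent Jedis versions discard the broken resource when it is closed)
}

Of the three checks, testWhileIdle is the cheapest one to keep if you want any validation at all, since it runs from the pool's background evictor rather than on your request path.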
Pre-version-3, the recommendation was to run a timeout manager as a standalone process on your cluster, beside the distributor (as detailed here: http://support.nservicebus.com/customer/portal/articles/965131-deploying-nservicebus-in-a-windows-failover-cluster).
After the inclusion of the timeout manager as a satellite assembly, what is the correct way to use it when scaling out with the distributor?
Should each worker of Service A run with timeout manager enabled or should only the distributor process for service A be configured to run a timeout manager for service A?
If each worker runs it, do they share the same Raven instance for storing the timeouts? (And if so, how do you make sure that two or more workers don't pick up the same expired timeout at the same time?)
Allow me to answer this clearly myself.
After a lot of digging, and with help from Andreas Öhlund on the NSB team (http://tech.groups.yahoo.com/group/nservicebus/message/17758), the correct answer to this question is:
As Udi Dahan mentioned, by design ONLY the distributor/master node should run a timeout manager in a scale-out scenario.
Unfortunately in early versions of NServiceBus 3 this is not implemented as designed.
You have the following 3 issues:
1) Running with the Distributor profile does NOT start a timeout manager.
Workaround:
Start the timeout manager on the distributor yourself by including this code on the distributor:
class DistributorProfileHandler : IHandleProfile<Distributor>
{
public void ProfileActivated()
{
Configure.Instance.RunTimeoutManager();
}
}
If you run the Master profile this is not an issue as a timeout manager is started on the master node for you automatically.
2) Workers running with the Worker profile DO each start a local timeout manager.
This is not as designed and messes up the polling against the timeout store and dispatching of timeouts. All workers poll the timeout store with "give me the imminent timeouts for MASTERNODE". Notice they ask for timeouts of MASTERNODE, not for W1, W2 etc. So several workers can end up fetching the same timeouts from the timeout store concurrently, leading to conflicts against Raven when deleting the timeouts from it.
The dispatching always happens through the LOCAL .timeouts/.timeoutsdispatcher queues, while it SHOULD go through the queues of the timeout manager on the MasterNode/Distributor.
Workaround (you'll need to do both):
a) Disable the timeout manager on the workers. Include this code in your workers:
class WorkerProfileHandler : IHandleProfile<Worker>
{
public void ProfileActivated()
{
Configure.Instance.DisableTimeoutManager();
}
}
b) Reroute NServiceBus on the workers to use the .timeouts queue on the MasterNode/Distributor.
If you don't do this, any call to RequestTimeout or Defer on the worker will die with an exception saying that you have forgotten to configure a timeout manager. Include this in your worker config:
<UnicastBusConfig TimeoutManagerAddress="{endpointname}.Timeouts#{masternode}" />
3) Erroneous "Ready" messages back to the distributor.
Because the timeout manager dispatches the messages directly to the workers input queues without removing an entry from the available workers in the distributor storage queue, the workers send erroneous "Ready" messages back to the distributor after handling a timeout. This happens even if you have fixed 1 and 2, and it makes no difference if the timeout was fetched from a local timeout manager on the worker or on one running on the distributor/MasterNode. The consequence is a build up of an extra entry in the storage queue on the distributor for each timeout handled by a worker.
Workaround:
Use NServiceBus 3.3.15 or later.
In version 3+ we created the concept of a master node which hosts inside it all the satellites like the distributor, timeout manager, gateway, etc.
The master node is very simple to run - you just pass a /master flag to the NServiceBus.Host.exe process and it runs everything for you. So, from a deployment perspective, where you used to deploy the distributor, now you deploy the master node.