What is the recommended Redisson configuration to avoid timeouts when connecting to AWS ElastiCache?

We are using Redisson to connect to a replicated Redis on AWS ElastiCache with 1 master and 2 replica nodes.
The app makes use of a number of RLocalCachedMaps, Locks and a few thousand Topics to track user state (topics and subscriptions come and go as users go online and offline).
However, we frequently get a series of RedisTimeoutExceptions. Originally these appeared after the server had been running for several days and would occur continuously until the server was either restarted or crashed with an out-of-memory error. That led me to suspect we were running out of subscriptions, but if I understand our settings (below) correctly they should support over 100,000 subscriptions, and we are nowhere near that.
Furthermore, some of these exceptions occur during warm-up, when load on the server is relatively light; after a few exceptions the connections sort themselves out and there are no major problems for several days, which indicates it is not purely a subscription problem. The commands involved are simple lock/publish/subscribe operations each time, rather than complex batches.
The load on the AWS ElastiCache nodes is minor at all times, and our server is deployed on an AWS EC2 instance, so connectivity should be relatively good.
The two exceptions we get in quantity occur either when taking locks or when subscribing to topics:
Caused by: org.redisson.client.RedisTimeoutException: Subscribe timeout: (7500ms)
at org.redisson.command.CommandAsyncService.syncSubscription(CommandAsyncService.java:142) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonLock.lockInterruptibly(RedissonLock.java:149) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonLock.lockInterruptibly(RedissonLock.java:136) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonLock.lock(RedissonLock.java:118) ~[redisson-3.8.2.jar!/:na]
and
java.util.concurrent.CompletionException: org.redisson.client.RedisTimeoutException
at org.redisson.misc.RedissonPromise.await(RedissonPromise.java:197) ~[redisson-3.8.2.jar!/:na]
at org.redisson.misc.RedissonPromise.await(RedissonPromise.java:206) ~[redisson-3.8.2.jar!/:na]
at org.redisson.command.CommandAsyncService.syncSubscription(CommandAsyncService.java:141) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonTopic.addListener(RedissonTopic.java:133) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonTopic.addListener(RedissonTopic.java:109) ~[redisson-3.8.2.jar!/:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_111]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_111]
Caused by: org.redisson.client.RedisTimeoutException: null
at org.redisson.pubsub.PublishSubscribeService$4.run(PublishSubscribeService.java:220) ~[redisson-3.8.2.jar!/:na]
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:670) ~[netty-common-4.1.30.Final.jar!/:4.1.30.Final]
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:745) ~[netty-common-4.1.30.Final.jar!/:4.1.30.Final]
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:473) ~[netty-common-4.1.30.Final.jar!/:4.1.30.Final]
Our configuration is:
"subscriptionConnectionMinimumIdleSize":32,
"subscriptionConnectionPoolSize":128,
"slaveConnectionMinimumIdleSize":32,
"slaveConnectionPoolSize":128,
"masterConnectionMinimumIdleSize":64,
"masterConnectionPoolSize":128,
"subscriptionsPerConnection": 1000,
"timeout": 3000,
"retryAttempts": 3,
"retryInterval": 1500,
"readMode": "SLAVE",
"subscriptionMode": MASTER
I have read the Redisson FAQ on timeouts, but our timeout exceptions are not obviously server-side or client-side, so I am unsure which timeout parameter would be better to tweak. Furthermore, at 7.5 seconds they are pretty long for user requests to be waiting on. Similarly, I can't find documentation on recommended values for the connection pool sizes or subscriptions per connection, or on what sensible values would be for a production deployment.
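For reference, the same settings expressed with Redisson's Java Config API look roughly like the sketch below; the ElastiCache endpoint is a placeholder and the values simply mirror the JSON above rather than being recommendations:

    import org.redisson.Redisson;
    import org.redisson.api.RedissonClient;
    import org.redisson.config.Config;
    import org.redisson.config.ReadMode;
    import org.redisson.config.SubscriptionMode;

    public class RedissonSetup {
        public static RedissonClient create() {
            Config config = new Config();
            config.useReplicatedServers()
                  // ElastiCache primary endpoint (placeholder address)
                  .addNodeAddress("redis://elasticache-primary.example.com:6379")
                  .setReadMode(ReadMode.SLAVE)                  // reads go to replicas
                  .setSubscriptionMode(SubscriptionMode.MASTER) // subscriptions on the master
                  .setSubscriptionConnectionMinimumIdleSize(32)
                  .setSubscriptionConnectionPoolSize(128)
                  .setSubscriptionsPerConnection(1000)          // 128 * 1000 = 128,000 subscription slots
                  .setSlaveConnectionMinimumIdleSize(32)
                  .setSlaveConnectionPoolSize(128)
                  .setMasterConnectionMinimumIdleSize(64)
                  .setMasterConnectionPoolSize(128)
                  .setTimeout(3000)        // command response timeout, ms
                  .setRetryAttempts(3)
                  .setRetryInterval(1500); // ms between retries
            return Redisson.create(config);
        }
    }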

Related

No further traces reported after timeout exception

I integrated spring-cloud-sleuth with GCP support into an application. Under load the app suddenly stops reporting any spans until it is restarted.
The only tracing-relevant log I can see is the following exception:
Unexpected error flushing spans java.lang.IllegalStateException: timeout waiting for onClose. timeoutMs=5000, resultSet=false
at zipkin2.reporter.stackdriver.internal.AwaitableUnaryClientCallListener.await(AwaitableUnaryClientCallListener.java:49)
at zipkin2.reporter.stackdriver.internal.UnaryClientCall.doExecute(UnaryClientCall.java:50)
at zipkin2.Call$Base.execute(Call.java:380)
at zipkin2.Call$Mapping.doExecute(Call.java:237)
at zipkin2.Call$Base.execute(Call.java:380)
at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.flush(AsyncReporter.java:285)
at zipkin2.reporter.AsyncReporter$Flusher.run(AsyncReporter.java:354)
at java.base/java.lang.Thread.run(Unknown Source)
This exception happens a few times around the time the traces stop and then never again (as if something permanently breaks).
I read in a spring-cloud-gcp issue (see here) that this can be related to too few executor threads, so I already increased the number of threads from 4 to 8.

Blocked system-critical thread has been detected

I'm using Ignite.NET 2.7.6. The setup consists of one server and about 40 clients. After 8 hours of work, the server starts behaving strangely: clients cannot connect to it, some queries return no result, etc.
On the server's side, memory consumption is fine, the number of threads is about 250, and everything looks ok resource-wise. Since I couldn't spot an obvious cause, I decided to work through all the problems on the server's side that were logged as SEVERE.
The first one I encountered is:
Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=tcp-comm-worker, blockedFor=13s]
So I want to understand the reason this happens.
Full server's log can be found here:
https://yadi.sk/d/LF03Vz5vz4tRcw
https://yadi.sk/d/MMe0xrgI3k6lkA
Added:
The issue doesn't seem to be innocuous: this message appears every second from various threads, and the "blockedFor" value keeps growing, from seconds up to hours.
The load on the server is low, but as the server's threads become locked it stops responding and stops registering new clients.
Here are logs from the server:
https://yadi.sk/d/tc3g2hb9B0jtvg
https://yadi.sk/d/05YrlYXcp4xPqg
This is the log from one client:
https://yadi.sk/d/bcbQ7ee4PUzq2w
The client's log's last lines are at 19:03:52, when the server was restarted.
I see the following .NET-specific exception among the others, but it should be triggered by another issue. Anyway, this one has been reported to the community.
class org.apache.ignite.IgniteException: Platform error:System.NullReferenceException: Object reference not set to an instance of an object.
at Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.CacheEntryFilterApply(Int64 memPtr)
at Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.InLongOutLong(Int32 type, Int64 val)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.loggerLog(PlatformProcessorImpl.java:404)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:460)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:512)
at org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutLong(PlatformTargetProxyImpl.java:67)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackUtils.inLongOutLong(Native Method)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackGateway.cacheEntryFilterApply(PlatformCallbackGateway.java:143)
at org.apache.ignite.internal.processors.platform.cache.PlatformCacheEntryFilterImpl.apply(PlatformCacheEntryFilterImpl.java:70)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager$InternalScanFilter.apply(GridCacheQueryManager.java:3139)
The very first exceptions are related to communication issues at the networking level. See below:
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Unknown Source)
[18:46:12,846][WARNING][grid-nio-worker-tcp-comm-0-#48][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=An existing connection was forcibly closed by the remote host]
[18:46:13,861][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/127.0.0.1:47101, failureDetectionTimeout=10000]
[18:46:14,893][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=BB-SRV-DELTA/169.254.40.231:47101, failureDetectionTimeout=10000]
It looks like either the server or some clients don't react to heartbeats or to other networking requests within 10 seconds. Check the logs of the client nodes as well. You might need to scale out your cluster by adding more servers for the sake of load balancing, or adjust the failureDetectionTimeout.
The "Blocked system-critical thread has been detected..." error message is innocuous but confusing. I've restarted the following conversation.
As Denis described, there are a lot of network communication issues.
In general, a client wants to perform some cache operation, but a server thread from the striped pool is blocked for a long time. I don't think it is related to the .NET part.
You can see the following messages:
[18:53:04,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=13s]
If you take a look at the thread:
Thread [name="sys-stripe-7-#8", id=28, state=WAITING, blockCnt=51, waitCnt=3424]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2911)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)
at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1879)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1904)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1875)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1857)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1275)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1212)
The thread is trying to send a Continuous Query callback but is failing to establish a connection to a client node. This causes the thread to be blocked, so it cannot serve other cache API requests that require the same partition.
At first glance, you could try to reduce clientFailureDetectionTimeout (the default is 30 seconds). But this won't fix the network issues completely.
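As a rough illustration of the timeouts these log messages refer to, here is the Java API form (Ignite.NET exposes analogous IgniteConfiguration properties); the values below are simply the defaults, not a recommendation:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ServerStartup {
        public static Ignite start() {
            IgniteConfiguration cfg = new IgniteConfiguration();
            // Time within which a remote node must respond before it is considered failed.
            // Defaults: 10 000 ms for server nodes, 30 000 ms for client nodes.
            cfg.setFailureDetectionTimeout(10_000);
            cfg.setClientFailureDetectionTimeout(30_000);
            return Ignition.start(cfg);
        }
    }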

ADO.NET pooled connections are unable to be reused

I'm working on an ASP.NET MVC application which uses EF 6.x to work with my Azure SQL Database. Recently, with increased load, the app started to get into a state where it's unable to communicate with the SQL server anymore. Using exec sp_who I can see there are 100 active connections to my database, and any new connection fails to be created with the following error:
System.Data.Entity.Core.EntityException: The underlying provider failed on Open. ---> System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
Most of the time the app works with an average active connection count of 10 to 20, and load doesn't change this number: even when load is high it stays at the 10-20 level. But in certain situations it can jump up to 100 in less than a minute, without any ramp-up, and this leaves the app in a state where all my requests are failing. All those 100 connections are in a sleeping state, awaiting command.
The good part is that I found a workaround which helped me mitigate the issue: clearing the connection pool from the client side. I'm using SqlConnection.ClearAllPools(), and it instantly closes all the connections; sp_who shows my regular 10-20 connections after that.
The bad part is that I still don't know the root cause.
Just to clarify, the app load is about 200-300 concurrent users, which generate 1000 requests per minute.
With the great suggestion from #DavidBrowne to track leaked connections with a simple pattern, I was able to find leaked connections while configuring the Owin engine:
private void ConfigureOAuthTokenGeneration(IAppBuilder app)
{
    // here in the create method I'm also creating a connection leak tracker
    app.CreatePerOwinContext(() => MyCoolDb.Create());
    ...
}
Basically, with every request Owin creates a connection and doesn't let it go, and when the Web API load increases I have trouble.
Could this be the real cause, and is Owin smart enough to lazily create a connection only when needed (using the function provided) and release it once it has been used?
It's very unlikely that this is caused by anything other than your application code leaking connections.
Here's a helper library you can use to track when a connection is leaked, and report the call site that initially opened the connection.
http://ssbwcf.codeplex.com/SourceControl/latest#SsbTransportChannel/SqlConnectionLifetimeTracker.cs

Spring AMQP consumer pauses after running for some time

We have a 2-node RabbitMQ cluster with an ha-all policy. We use Spring AMQP in our application to talk to RabbitMQ. The producer part is working fine, but the consumer works for some time and then pauses. The producer and consumer run as different applications. More information on the consumer part:
We use SimpleMessageListenerContainer with a ChannelAwareMessageListener, manual ack mode and the default prefetch (1).
In our application we create queues on demand and add them to the listener.
When we start with 10 concurrentConsumers and 20 maxConcurrentConsumers, consumption runs for around 15 hours and then pauses. The situation occurs within 1 hour when we increase maxConcurrentConsumers to 75.
On the RabbitMQ UI, when this situation occurs we see channels with 3/4 unacked messages on the channel tab; until then each channel has just 1 unacked message.
Our thread dump was similar to this, but setting the heartbeat to 60 did not improve the situation.
Most of the thread dump contains the following trace. If required, I will attach the whole thread dump. Am I missing any setup which might cause the consumer to pause?
"pool-6-thread-16" #86 prio=5 os_prio=0 tid=0x00007f4db09cb000 nid=0x3b33 waiting on condition [0x00007f4ebebec000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007b9930b68> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
at org.springframework.amqp.rabbit.listener.BlockingQueueConsumer$InternalConsumer.handleDelivery(BlockingQueueConsumer.java:660)
at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:144)
at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:99)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
More Info
We dynamically add and remove queues on the SimpleMessageListenerContainer, and we suspect this is causing a problem, because every time we add or remove a queue from the listener, all the BlockingQueueConsumers are removed and created again. Do you think this could cause the problem?
Your problem is somewhere downstream in the target listener.
Look, prefetch(1) causes this:
this.queue = new LinkedBlockingQueue<Delivery>(prefetchCount);
And further, if we don't poll that queue, what do we have here?
BlockingQueueConsumer.this.queue.put(new Delivery(consumerTag, envelope, properties, body));
Right: parking on the lock.
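To make that concrete, here is a minimal sketch (Spring AMQP 1.x API; the queue name and wiring are illustrative, not taken from the question) of the container setup described above, with a comment marking where the dispatcher thread ends up parked:

    import org.springframework.amqp.core.AcknowledgeMode;
    import org.springframework.amqp.core.Message;
    import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
    import org.springframework.amqp.rabbit.core.ChannelAwareMessageListener;
    import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;

    import com.rabbitmq.client.Channel;

    public class ConsumerSetup {
        public SimpleMessageListenerContainer container(CachingConnectionFactory connectionFactory) {
            SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(connectionFactory);
            container.setQueueNames("user-events");               // illustrative queue name
            container.setAcknowledgeMode(AcknowledgeMode.MANUAL); // manual acks, as in the question
            container.setConcurrentConsumers(10);
            container.setMaxConcurrentConsumers(20);
            container.setPrefetchCount(1); // default: per-consumer internal queue capacity == prefetch
            container.setMessageListener(new ChannelAwareMessageListener() {
                @Override
                public void onMessage(Message message, Channel channel) throws Exception {
                    // If this method blocks (slow downstream call, lock, etc.), the consumer thread
                    // stops draining its internal LinkedBlockingQueue, so handleDelivery() parks in
                    // queue.put() - exactly the WAITING frame shown in the thread dump above.
                    channel.basicAck(message.getMessageProperties().getDeliveryTag(), false);
                }
            });
            return container;
        }
    }

In other words, the thing to investigate is what the listener's onMessage is waiting on downstream, rather than the container settings themselves.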
AMQP-621 is now merged to master; we will release 1.6.1.RELEASE in the next few days.

What is a reasonable value for heartbeat in RabbitMQ?

RabbitMQ allows you to "heartbeat" a connection, i.e. from time to time the client and the server check (using empty messages) that the other party is still there and available. So far, so good.
Unfortunately, I was not able to find a place in the documentation that suggests what a reasonable value for this is. I know that you need to specify the heartbeat in seconds, but what is a real-world best-practice value?
Obviously, it should not be too often (traffic), but also not too rare (proxies, …). Any suggestions?
Is 15 seconds fine? 30? 60? …?
This answer is for RabbitMQ < 3.5.5; for newer versions see the answer from #bmaupin.
It depends on your application's needs. Out of the box it is 10 minutes for RabbitMQ. If you fail to ack the heartbeat twice (20 minutes of inactivity), the connection will be closed immediately, without the broker sending any connection.close method or any error.
The case for using heartbeats is firewalls that close connections that have been inactive for a long time, or other network settings that don't let you keep idle connections open.
In fact, heartbeat is not a must; from the RabbitMQ config doc:
heartbeat
Value representing the heartbeat delay, in seconds, that the server sends in the connection.tune frame. If set to 0, heartbeats are disabled. Clients might not follow the server suggestion, see the AMQP reference for more details. Disabling heartbeats might improve performance in situations with a great number of connections, but might lead to connections dropping in the presence of network devices that close inactive connections.
Default: 580
Note that setting the heartbeat interval too short may result in significant network overhead. Keep in mind that heartbeat frames are only sent when there is no other activity on the connection for a heartbeat interval.
The RabbitMQ documentation now provides a recommended heartbeat timeout value between 5 and 20 seconds:
Setting heartbeat timeout value too low can lead to false positives (peer being considered unavailable while it really isn't the case) due to transient network congestion, short-lived server flow control, and so on. This should be taken into consideration when picking a timeout value.
Several years worth of feedback from the users and client library maintainers suggest that values lower than 5 seconds are fairly likely to cause false positives, and values of 1 second or lower are very likely to do so. Values within the 5 to 20 seconds range are optimal for most environments.
Source: https://www.rabbitmq.com/heartbeats.html#false-positives
In addition, as of RabbitMQ 3.5.5 the default heartbeat timeout value is 60 seconds (https://www.rabbitmq.com/heartbeats.html#heartbeats-timeout)
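For completeness, a minimal sketch of requesting a heartbeat from the RabbitMQ Java client (the host is a placeholder); the client and the broker negotiate the effective value:

    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class HeartbeatExample {
        public static Connection connect() throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbitmq.example.org"); // placeholder host
            // Requested heartbeat interval in seconds; 0 disables heartbeats.
            // Values in the 5-20 second range are the documented sweet spot,
            // and 60 seconds is the server default since RabbitMQ 3.5.5.
            factory.setRequestedHeartbeat(15);
            return factory.newConnection();
        }
    }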