Blocked system-critical thread has been detected - ignite

I'm using Ignite.NET 2.7.6. The topology consists of one server and about 40 clients. After 8 hours of work, the server starts behaving strangely: clients cannot connect to it, some queries return no result, etc.
On the server side, memory consumption is fine, the thread count is about 250, and everything looks normal. Since I don't see any obvious problems, I decided to work through all the problems on the server side that were marked as SEVERE.
The first one I encountered is:
Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=tcp-comm-worker, blockedFor=13s]
So I want to understand the reason this happens.
The full server log can be found here:
https://yadi.sk/d/LF03Vz5vz4tRcw
https://yadi.sk/d/MMe0xrgI3k6lkA
Added:
The issue doesn't seem to be innocuous: this message appears every second from various threads, and the "blockedFor" value increases from seconds to hours.
The load on the server is low, but as the server's threads become locked, it stops responding and stops registering new clients.
Here are logs from the server:
https://yadi.sk/d/tc3g2hb9B0jtvg
https://yadi.sk/d/05YrlYXcp4xPqg
This is the log from one client:
https://yadi.sk/d/bcbQ7ee4PUzq2w
The last lines of the client's log are from 19:03:52, when the server was restarted.

I see the following .NET-specific exception among the others, but it should be triggered by another issue. In any case, this one has been reported to the community.
class org.apache.ignite.IgniteException: Platform error: System.NullReferenceException: Object reference not set to an instance of an object.
at Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.CacheEntryFilterApply(Int64 memPtr)
at Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.InLongOutLong(Int32 type, Int64 val)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.loggerLog(PlatformProcessorImpl.java:404)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:460)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:512)
at org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutLong(PlatformTargetProxyImpl.java:67)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackUtils.inLongOutLong(Native Method)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackGateway.cacheEntryFilterApply(PlatformCallbackGateway.java:143)
at org.apache.ignite.internal.processors.platform.cache.PlatformCacheEntryFilterImpl.apply(PlatformCacheEntryFilterImpl.java:70)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager$InternalScanFilter.apply(GridCacheQueryManager.java:3139)
The very first exceptions are related to communication issues at the networking level. See below:
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Unknown Source)
[18:46:12,846][WARNING][grid-nio-worker-tcp-comm-0-#48][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=An existing connection was forcibly closed by the remote host]
[18:46:13,861][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/127.0.0.1:47101, failureDetectionTimeout=10000]
[18:46:14,893][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=BB-SRV-DELTA/169.254.40.231:47101, failureDetectionTimeout=10000]
It looks like either the server or some clients don't respond to heartbeats or other network requests within 10 seconds. Check the logs of the client nodes as well. You might need to scale out your cluster by adding more servers for the sake of load balancing, or adjust the failureDetectionTimeout.
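If it is just slow heartbeats, here is a minimal sketch of raising that timeout via the Java configuration API (Ignite.NET exposes the same setting on its IgniteConfiguration; 30 seconds is an arbitrary example value):
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setFailureDetectionTimeout(30000); // default is 10000 ms, per the warnings above
Ignition.start(cfg);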
The "Blocked system-critical thread has been detected..." error message is innocuous but confusing. I've restarted the following conversation.

As Denis described, there are a lot of network communication issues.
In general, a client would like to perform some cache operation, but a server thread from the striped pool is blocked for a long time. I don't think it relates to the .NET part.
You can see the following messages:
[18:53:04,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=13s]
If you take a look at the thread:
Thread [name="sys-stripe-7-#8", id=28, state=WAITING, blockCnt=51, waitCnt=3424]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2911)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)
at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1879)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1904)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1875)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1857)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1275)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1212)
The thread is trying to send a Continuous Query callback but is failing to establish a connection to a client node. This causes the thread to be blocked, so it cannot serve other cache API requests that require the same partition.
At first glance, you could try reducing clientFailureDetectionTimeout (the default is 30 seconds), but this won't fix the network issues completely.
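A sketch of that tweak via the Java configuration API (Ignite.NET has an equivalent property):
import org.apache.ignite.configuration.IgniteConfiguration;

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setClientFailureDetectionTimeout(10000); // default is 30000 ms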


No further traces reported after timeout exception

I integrated spring-cloud-sleuth with GCP support into an application. Under load the app suddenly stops reporting any spans until it is restarted.
The only tracing-relevant log entry I can see is the following exception:
Unexpected error flushing spans java.lang.IllegalStateException: timeout waiting for onClose. timeoutMs=5000, resultSet=false
at zipkin2.reporter.stackdriver.internal.AwaitableUnaryClientCallListener.await(AwaitableUnaryClientCallListener.java:49)
at zipkin2.reporter.stackdriver.internal.UnaryClientCall.doExecute(UnaryClientCall.java:50)
at zipkin2.Call$Base.execute(Call.java:380)
at zipkin2.Call$Mapping.doExecute(Call.java:237)
at zipkin2.Call$Base.execute(Call.java:380)
at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.flush(AsyncReporter.java:285)
at zipkin2.reporter.AsyncReporter$Flusher.run(AsyncReporter.java:354)
at java.base/java.lang.Thread.run(Unknown Source)
This exception happens a few times around the time the traces stop, and then never again (as if something permanently breaks).
I read in a spring-cloud-gcp issue (see here) that this can be related to too few executor threads, so I already increased the number of threads to 8 (from 4).
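For reference, the setting I changed was the following (I believe this is the spring-cloud-gcp property name, but treat it as an assumption and check the docs for your version):
spring.cloud.gcp.trace.num-executor-threads=8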

AnyLogic - Internal Error(s): Engine still has x events scheduled: xyz: [null]

After stopping my simulation I occasionally get the following error message:
Example:
Exception during stopping the engine:
INTERNAL ERROR(S):
Engine still has 6 events scheduled: 2386.0: [null]
java.lang.RuntimeException: INTERNAL ERROR(S):
Engine still has 6 events scheduled: 2386.0: [null]
at com.anylogic.engine.Engine.g(Unknown Source)
at com.anylogic.engine.Engine.stop(Unknown Source)
at com.anylogic.engine.ExperimentSimulation.stop(Unknown Source)
at com.anylogic.engine.gui.ExperimentHost.executeCommand(Unknown Source)
at com.anylogic.engine.internal.webserver.l.onCommand(Unknown Source)
...
My simulation model looks like this:
(Screenshot: simulation model with 5 machines)
The model is a simulation of a job shop scheduling problem and does the following:
1. Generate Job Agents through inject(20) in the source block
2. The jobs go to the machine defined by a database and wait in the wait-block
3. The jobs are set free from the wait-block by other agents
4. The jobs are processed in the service block
5. The jobs repeat the process 4 additional times
There are 5 agents overall in Step 3 (let's call them Scheduling Agents) and they use the Wait.free() method to set the agents free. One agent controls one wait-block. All 5 Scheduling Agents work simultaneously and are synchronized through the Main agent (Main notifies the Scheduling Agents). The hold-blocks are unblocked immediately after simulation start; they also exist for synchronisation purposes. Every Scheduling Agent owns its own thread, which is started through Thread.start() by a Timeout Event (occurs once, time = 0) defined in Main.
A Thread from a Scheduling Agent looks something like this:
new Thread(new Runnable() {
    public void run() {
        synchronized (sync_obj) {
            sync_obj.waituntilJobarrives();
            sync_obj.Waitblock.free(a_Job);
            sync_obj.waituntilJobisfinished();
            repeat();
        }
    }
});
Now here is my problem: when I start the simulation, the jobs are generated normally and move to their assigned wait-blocks. After that, the Scheduling Agents start their work and free jobs, but sometimes a Scheduling Agent calls the Waitblock.free() method and the job is not set free (checked with traceln() when the method was called). To double-check the issue, I implemented buttons which manually call the Waitblock.free() method, but the Job Agents still won't leave the wait-block. If a job is not set free by its agent, the simulation of the job shop is stuck there. The simulation keeps running, but the 20 jobs never get finished and no error message is displayed (technically there is no error). Only after stopping the simulation does the error message shown above appear in the console.
What makes matters worse is the fact that this error does not appear all the time. Sometimes the simulation works just fine, and sometimes a wait-block stops reacting. Usually, after simulating long enough, this error will appear and one or several wait-blocks stop reacting.
My guess from reading the error message is that the engine received the order to free the agents from the wait-block; it just won't execute it. Can I control the order of events scheduled by the engine (Personal Learning Edition), and if so, how? Or is there another way of fixing the problem?
I am grateful for any help!
EDIT: By removing the hold-block, the "Engine still has X events scheduled" error does not appear as often. But the wait-block still does not respond to the Waitblock.free() method, and the following error message appears in the console:
java.lang.RuntimeException: root.w_Warteblock1.readyEntities.output.readyNotificationAsync.event: negative timeout: -1.25
at com.anylogic.engine.Engine.error(Unknown Source)
at com.anylogic.engine.EventOriginator.g(Unknown Source)
at com.anylogic.engine.EventOriginator.c(Unknown Source)
at com.anylogic.engine.EventTimeout.restart(Unknown Source)
at com.anylogic.libraries.processmodeling.AsynchronousExecutor_xjal.a(Unknown Source)
at com.anylogic.libraries.processmodeling.OutputBlock.notifyReady(Unknown Source)
at com.anylogic.libraries.processmodeling.OutputBuffer.a(Unknown Source)
at com.anylogic.libraries.processmodeling.OutputBuffer.take(Unknown Source)
at com.anylogic.libraries.processmodeling.Wait.free(Unknown Source)
This looks more like a common error that I can catch, so my current workaround is a try/catch block around the thread that calls the Waitblock.free() method, restarting the simulation with the simulation progress saved in an Excel file.
I will tell you my thoughts, but this info might not be enough to draw a conclusion:
I remember getting this error when I pause the simulation, then remove an agent, and then stop the simulation. If I follow those steps, I will get that error...
This means that when you stop your simulation, you need to give it at least a millisecond to finish the scheduled events... In this case your scheduled events are on the thread. So a solution would be to stop the simulation with finishSimulation() before you click the stop button. You have to kill the threads before the finishSimulation() function runs... I'm not sure about this, but give it a try.
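A sketch of that order of operations (schedulingThreads is a hypothetical list you would keep of the 5 custom threads; interrupting them is one way to "kill" them first):
// Stop the custom scheduling threads first so no events remain scheduled on them,
// then end the run gracefully instead of pressing the stop button.
for (Thread t : schedulingThreads) {
    t.interrupt();
}
finishSimulation();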
That's the first problem... The second problem, I think, is related to the hold after the wait. Notice that if your hold block is blocked and you try to release more than 1 agent from the wait block, only 1 agent will be freed when you unblock the hold. This is because there is space for only 1 agent at the exit of the wait block... If you make this mistake, the agent will stay in the wait block forever... A solution is to use a queue just after the wait block. I don't think this problem is related to the error you get, though...
I had this problem in my test suites. I fixed it by calling:
engine.finish();
instead of:
engine.stop();

What is the recommended Redisson configuration to avoid timeouts when connecting to AWS ElastiCache?

We are using Redisson to connect to a replicated Redis on AWS ElastiCache with 1 master and 2 replica nodes.
The app makes use of a number of RLocalCachedMaps, locks, and a few thousand topics to track user state (topics and subscriptions come and go as users go online and offline).
However, we frequently get a series of RedisTimeoutExceptions. Originally these appeared after the server had been running for several days and would occur continuously until either the server was restarted or it crashed with an out-of-memory error. This led me to think it was a lack of available subscriptions; however, if I understand our settings (below) correctly, they should support over 100,000 subscriptions, and we are nowhere near that.
Furthermore, some of these occur during warm-up, when load on the server is relatively light; after a few exceptions the connections sort themselves out and there are no major problems for several days, which indicates it is not a pure subscription problem. The commands are simple lock/publish/subscribe operations each time, rather than complex batches.
The load on the AWS ElastiCache nodes is minor at all times, and our server is deployed on an AWS EC2 instance, so it should have relatively good connectivity!
The 2 exceptions we get in quantity occur either when taking locks or when subscribing to topics:
Caused by: org.redisson.client.RedisTimeoutException: Subscribe timeout: (7500ms)
at org.redisson.command.CommandAsyncService.syncSubscription(CommandAsyncService.java:142) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonLock.lockInterruptibly(RedissonLock.java:149) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonLock.lockInterruptibly(RedissonLock.java:136) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonLock.lock(RedissonLock.java:118) ~[redisson-3.8.2.jar!/:na]
and
java.util.concurrent.CompletionException: org.redisson.client.RedisTimeoutException
at org.redisson.misc.RedissonPromise.await(RedissonPromise.java:197) ~[redisson-3.8.2.jar!/:na]
at org.redisson.misc.RedissonPromise.await(RedissonPromise.java:206) ~[redisson-3.8.2.jar!/:na]
at org.redisson.command.CommandAsyncService.syncSubscription(CommandAsyncService.java:141) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonTopic.addListener(RedissonTopic.java:133) ~[redisson-3.8.2.jar!/:na]
at org.redisson.RedissonTopic.addListener(RedissonTopic.java:109) ~[redisson-3.8.2.jar!/:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_111]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_111]
Caused by: org.redisson.client.RedisTimeoutException: null
at org.redisson.pubsub.PublishSubscribeService$4.run(PublishSubscribeService.java:220) ~[redisson-3.8.2.jar!/:na]
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:670) ~[netty-common-4.1.30.Final.jar!/:4.1.30.Final]
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:745) ~[netty-common-4.1.30.Final.jar!/:4.1.30.Final]
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:473) ~[netty-common-4.1.30.Final.jar!/:4.1.30.Final]
Our Configuration is:
"subscriptionConnectionMinimumIdleSize":32,
"subscriptionConnectionPoolSize":128,
"slaveConnectionMinimumIdleSize":32,
"slaveConnectionPoolSize":128,
"masterConnectionMinimumIdleSize":64,
"masterConnectionPoolSize":128,
"subscriptionsPerConnection": 1000,
"timeout": 3000,
"retryAttempts": 3,
"retryInterval": 1500,
"readMode": "SLAVE",
"subscriptionMode": MASTER
I have read the Redisson FAQ on timeouts, but our timeout exceptions are not obviously server-side or client-side, so I am unsure which timeout parameter would be better to tweak. Furthermore, given that they are 7.5 seconds, that is pretty long for user requests to be waiting. Similarly, I can't find documentation on recommended values for the connection pool sizes or subscriptions per connection, or on what sensible values for a production deployment would be.

RabbitMQ client hangs while trying to declare a queue

I tried searching for a solution to my problem but could not find one on Stack Overflow.
Issue
When a user tries to declare a queue or exchange in a corner case where the RabbitMQ server is having some issue, the client keeps waiting without any timeout, which causes the thread calling RabbitMQ to remain in a waiting state forever.
Below is the stack trace:
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:50)
- locked <0x00000007bb0464c8> (a com.rabbitmq.utility.BlockingValueOrException)
at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:89)
- locked <0x00000007bb0464c8> (a com.rabbitmq.utility.BlockingValueOrException)
at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:343)
at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:216)
at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:118)
at com.rabbitmq.client.impl.ChannelN.queueDeclare(ChannelN.java:833)
at com.rabbitmq.client.impl.ChannelN.queueDeclare(ChannelN.java:61)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory$CachedChannelInvocationHandler.invoke(CachingConnectionFactory.java:917)
- locked <0x00000007bb555300> (a java.lang.Object)
at com.sun.proxy.$Proxy293.queueDeclare(Unknown Source)
at org.springframework.amqp.rabbit.core.RabbitAdmin.declareQueues(RabbitAdmin.java:575)
at org.springframework.amqp.rabbit.core.RabbitAdmin.access$200(RabbitAdmin.java:66)
at org.springframework.amqp.rabbit.core.RabbitAdmin$12.doInRabbit(RabbitAdmin.java:504)
at org.springframework.amqp.rabbit.core.RabbitTemplate.doExecute(RabbitTemplate.java:1456)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1412)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1388)
at org.springframework.amqp.rabbit.core.RabbitAdmin.initialize(RabbitAdmin.java:500)
at org.springframework.amqp.rabbit.core.RabbitAdmin$11.onCreate(RabbitAdmin.java:419)
at org.springframework.amqp.rabbit.connection.CompositeConnectionListener.onCreate(CompositeConnectionListener.java:33)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory.createConnection(CachingConnectionFactory.java:553)
- locked <0x00000007bb057828> (a java.lang.Object)
at org.springframework.amqp.rabbit.core.RabbitTemplate.doExecute(RabbitTemplate.java:1431)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1412)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1388)
at org.springframework.amqp.rabbit.core.RabbitAdmin.declareQueue(RabbitAdmin.java:207)
Any help will be highly appreciated. The queue declaration currently happens in the @PostConstruct of beans that call our messaging component, so while it hangs no new beans can be created.
UPDATE
The issue came up again on our prod server. Connecting directly via amqp-client-3.4.2 seems to work, but via spring-rabbit-1.6.7.RELEASE / spring-amqp-1.6.7.RELEASE it does not.
Via amqp-client-3.4.2
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

ConnectionFactory factory = new ConnectionFactory();
factory.setHost("<<HOST NAME>>");
factory.setUsername("<<USERNAME>>");
factory.setPassword("<<PASSWORD>>");
factory.setVirtualHost("<<VIRTUAL HOST>>");
Connection connection = factory.newConnection();
Channel channel = connection.createChannel();
// durable = true, exclusive = false, autoDelete = false, no extra arguments
channel.queueDeclare(QUEUE_NAME, true, false, false, null);
(Diagram: code flow with the RabbitMQ AMQP client)
The Spring way, which is not working:
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitAdmin;

CachingConnectionFactory factory = new CachingConnectionFactory();
factory.setHost("<<HOST NAME>>");
factory.setUsername("<<USERNAME>>");
factory.setPassword("<<PASSWORD>>");
factory.setVirtualHost("<<VIRTUAL HOST>>");
RabbitAdmin admin = new RabbitAdmin(factory);
Queue queue = new Queue(QUEUE_NAME);
admin.declareQueue(queue); // hangs here
(Diagram: code flow with Spring AMQP)
This issue occurs rarely, and we are still trying to figure out the reason behind this behavior. We tried setting a connection timeout, but it did not work in our test program.
On debugging further, it looks like an exception is preventing the notification from being sent back to our code. For "client not found" kinds of issues, we do get the exception properly.
We are using RabbitMQ 3.6.10 and Erlang 19.3.4 on CentOS Linux 7 (Core)
Declaration of queues is currently in my postconstruct of beans
I can't speak to the hang, but you should NEVER interact with the broker from post construct, afterPropertiesSet(), etc. It is too early in the application context lifecycle.
There are several workarounds: implement SmartLifecycle, return true from isAutoStartup(), and put the bean in an early phase (see Phased). start() will then be called after the application context is fully created.
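A minimal sketch of that workaround (class and queue names are ours):
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.rabbit.core.RabbitAdmin;
import org.springframework.context.SmartLifecycle;

public class QueueDeclaringLifecycle implements SmartLifecycle {

    private final RabbitAdmin admin;

    private volatile boolean running;

    public QueueDeclaringLifecycle(RabbitAdmin admin) {
        this.admin = admin;
    }

    @Override
    public boolean isAutoStartup() {
        return true; // start automatically with the context
    }

    @Override
    public int getPhase() {
        return Integer.MIN_VALUE; // an early phase (see Phased)
    }

    @Override
    public void start() {
        // The context is fully created by now, so it is safe to talk to the broker.
        this.admin.declareQueue(new Queue("my.queue"));
        this.running = true;
    }

    @Override
    public void stop() {
        this.running = false;
    }

    @Override
    public void stop(Runnable callback) {
        stop();
        callback.run();
    }

    @Override
    public boolean isRunning() {
        return this.running;
    }
}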
However, it's generally better to just define the queues, bindings, etc. as beans and let the framework take care of doing all the declarations for you.
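For example (a sketch; names are ours), RabbitAdmin finds such beans and declares them automatically when the first connection is opened:
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitAdmin;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RabbitConfig {

    @Bean
    public RabbitAdmin rabbitAdmin(ConnectionFactory connectionFactory) {
        // declares all Queue, Exchange and Binding beans on first connection
        return new RabbitAdmin(connectionFactory);
    }

    @Bean
    public Queue myQueue() {
        return new Queue("my.queue", true); // durable
    }
}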
I had something semi-similar happen, which I'll share in case it helps anyone.
It appears to me that a call to rabbitAdmin.declareQueue() will wait for any ongoing publisher-confirm callbacks to complete. I couldn't find this documented anywhere, but this was the behaviour I witnessed.
In my case, a separate thread (Thread #2) was processing a publisher confirmation while Thread #1 was trying to declare a queue (and hanging). Thread #1 was waiting for Thread #2 to complete, but due to some funky database locking I was doing, I had a deadlock: Thread #2 was also waiting for Thread #1 to complete.
The solution was to stop doing significant processing in publisher-confirmation callbacks. In my callback, I now just launch yet another thread to do the real processing. This allows the publisher-confirmation callback to return almost immediately, releasing any potential deadlocks.
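In code, the idea looks roughly like this (a sketch; handleConfirm stands in for the real processing):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService confirmWorkers = Executors.newFixedThreadPool(4);

rabbitTemplate.setConfirmCallback((correlationData, ack, cause) ->
        // hand the heavy work to another thread so the callback returns immediately
        confirmWorkers.submit(() -> handleConfirm(correlationData, ack, cause)));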

Spring AMQP consumer pauses after running for some time

We have a 2-node RabbitMQ cluster with an ha-all policy. We use Spring AMQP in our application to talk to RabbitMQ. The producer part is working fine, but the consumer works for some time and then pauses. The producer and consumer run as different applications. More information on the consumer part:
We use SimpleMessageListenerContainer with a ChannelAwareMessageListener, manual ack mode, and the default prefetch (1); a rough sketch of the setup follows below.
In our application we create queues on demand and add them to the listener.
When we started with 10 concurrentConsumers and 20 maxConcurrentConsumers, consumption ran for around 15 hours before pausing. The situation occurs within 1 hour when we increase maxConcurrentConsumers to 75.
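Roughly, our container setup looks like this (a sketch; connectionFactory and listener stand in for our real beans):
import org.springframework.amqp.core.AcknowledgeMode;
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;

SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(connectionFactory);
container.setConcurrentConsumers(10);
container.setMaxConcurrentConsumers(20);
container.setAcknowledgeMode(AcknowledgeMode.MANUAL);
container.setPrefetchCount(1); // the default
container.setMessageListener(listener); // a ChannelAwareMessageListener
container.start();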
On the RabbitMQ UI, when this situation occurs we see channels with 3 or 4 unacked messages on the channel tab; until then each channel has just 1 unacked message.
Our thread dump was similar to this, but setting the heartbeat to 60 did not improve the situation.
Most of the thread dump has the following message; if required I will attach the whole thread dump. Am I missing any setup that might cause the consumer to pause?
"pool-6-thread-16" #86 prio=5 os_prio=0 tid=0x00007f4db09cb000 nid=0x3b33 waiting on condition [0x00007f4ebebec000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007b9930b68> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
at org.springframework.amqp.rabbit.listener.BlockingQueueConsumer$InternalConsumer.handleDelivery(BlockingQueueConsumer.java:660)
at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:144)
at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:99)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
More Info
We dynamically add and remove queues on the SimpleMessageListenerContainer, and we suspect this is causing the problem, because every time we add or remove a queue from the listener, all the BlockingQueueConsumers are removed and created again. Do you think this could cause the problem?
Your problem is somewhere downstream in the target listener.
Look, prefetch(1) causes this:
this.queue = new LinkedBlockingQueue<Delivery>(prefetchCount);
And further, if we don't poll that queue, what do we have here?
BlockingQueueConsumer.this.queue.put(new Delivery(consumerTag, envelope, properties, body));
Right: parking on a lock.
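To make that concrete, here is a tiny JDK-only illustration of the same behavior (not Spring AMQP code):
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

BlockingQueue<String> deliveries = new LinkedBlockingQueue<>(1); // capacity = prefetch
deliveries.put("delivery-1"); // fits
deliveries.put("delivery-2"); // parks this thread until someone polls the queue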
AMQP-621 is now merged to master; we will release 1.6.1.RELEASE in the next few days.