Spring Amqp Consumer pauses after running for sometime - rabbitmq

We have a 2 node RabbitMQ cluster with Ha-all policy. We use Spring AMQP in our application to talk to RabbitMQ. Producer part is working fine, but consumer works for some time and pauses. Producer and consumer are running as different applications. More information on Consumer part.
we use SimpleMessageListenerContainer with ChannelAwareMessageListener, use Manual ack mode and default prefetch(1)
In our application we create queue (on-demand) and add it to the listener
When we started with 10 ConcurrentConsumers and 20 MaxConcurrentConsumers, consumption happens for around 15 hours and pauses. This situation happens within 1 hour when we increase the MaxConcurrentConsumers to 75.
On RabbitMQ UI, we see channels with 3/4 unacked messages on the channel tab when this situation occurs, until then it just have 1 unacked message.
Our thread dump was similar to this. But having heartbeat set to 60 did not help improve this situation.
Most of the thread dump has the following message. If required I will attach the whole thread dump. Let me know if I am missing any setup which might cause the consumer to pause?
"pool-6-thread-16" #86 prio=5 os_prio=0 tid=0x00007f4db09cb000 nid=0x3b33 waiting on condition [0x00007f4ebebec000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007b9930b68> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
at org.springframework.amqp.rabbit.listener.BlockingQueueConsumer$InternalConsumer.handleDelivery(BlockingQueueConsumer.java:660)
at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:144)
at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:99)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
More Info
We dynamically add and remove queues to SimpleMessageListenerContainer and we suspect this is causing some problem, because every time we add or remove a queue from the listener, all the BlockingQueueConsumer are removed and created again. Do you think if this can cause this problem?

Your problem is somewhere downstream in the target listener.
Look, prefetch(1) cause this:
this.queue = new LinkedBlockingQueue<Delivery>(prefetchCount);
And further if we don't poll that queue what we have here?
BlockingQueueConsumer.this.queue.put(new Delivery(consumerTag, envelope, properties, body));
Right, parking on lock.

AMQP-621 is now merged to master; we will release 1.6.1.RELEASE in the next few days.

Related

Blocked system-critical thread has been detected

I'm using Ignite.NET 2.7.6. There is a configuration from one server and about 40 clients. After 8 hours of work, the server starts behaving strangely: clients cannot connect it, some queries have no result, etc.
On the server's side, the memory consumption is ok, the amount of threads is about 250 and all looks ok. I don't see any problems, so I decided to solve all the problems on the server's side that were marked as SEVERE.
The first one I encounter is:
Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=tcp-comm-worker, blockedFor=13s]
So I want to understand the reason this happens.
Full server's log can be found here:
https://yadi.sk/d/LF03Vz5vz4tRcw
https://yadi.sk/d/MMe0xrgI3k6lkA
Added:
The issue doesn't seem to be innocuous, this message appears every second from various threads, the "blockedFor" value is increasing from seconds to hours.
The load on the server is low but as the servers' threads become locked, it stops responding and registering new clients.
Here are logs from the server:
https://yadi.sk/d/tc3g2hb9B0jtvg
https://yadi.sk/d/05YrlYXcp4xPqg
This is the log from one client:
https://yadi.sk/d/bcbQ7ee4PUzq2w
The client's log's last lines are at 19:03:52, when the server was restarted.
I see the following .NET specific exception among the others but it should be triggered by another issue. Anyway, this one is reported to the community.
class org.apache.ignite.IgniteException: Platform error:System.NullReferenceException: Ññûëêà íà îáúåêò íå óêàçûâàåò íà ýêçåìïëÿð îáúåêòà.
â Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.CacheEntryFilterApply(Int64 memPtr)
â Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.InLongOutLong(Int32 type, Int64 val)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.loggerLog(PlatformProcessorImpl.java:404)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:460)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:512)
at org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutLong(PlatformTargetProxyImpl.java:67)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackUtils.inLongOutLong(Native Method)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackGateway.cacheEntryFilterApply(PlatformCallbackGateway.java:143)
at org.apache.ignite.internal.processors.platform.cache.PlatformCacheEntryFilterImpl.apply(PlatformCacheEntryFilterImpl.java:70)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager$InternalScanFilter.apply(GridCacheQueryManager.java:3139)
The very first exceptions are related to the communication issues at the networking level. See below:
java.io.IOException: Óäàëåííûé õîñò ïðèíóäèòåëüíî ðàçîðâàë ñóùåñòâóþùåå ïîäêëþ÷åíèå
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Unknown Source)
[18:46:12,846][WARNING][grid-nio-worker-tcp-comm-0-#48][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Óäàëåííûé õîñò ïðèíóäèòåëüíî ðàçîðâàë ñóùåñòâóþùåå ïîäêëþ÷åíèå]
[18:46:13,861][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/127.0.0.1:47101, failureDetectionTimeout=10000]
[18:46:14,893][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=BB-SRV-DELTA/169.254.40.231:47101, failureDetectionTimeout=10000]
It looks like that either the server or some clients don't react to heartbeats or to other networking requests within 10 seconds. Check the logs of the client nodes as well. You might need to scale out your cluster adding more servers for the sake of load balancing or adjust the failureDetectionTimeou.
The Blocked system-critical thread has been detected... error message is innocuous but confusing. I've restarted the following conversation.
As Denis described, there are a lot of network communication issues.
In general, a client would like to perform some cache operation, but a server thread from the striped pool is blocked for a long time. I don't think it relates to the .NET part.
You can see following messages:
[18:53:04,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=13s]
If you take a look at the thread:
hread [name="sys-stripe-7-#8", id=28, state=WAITING, blockCnt=51, waitCnt=3424]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2911)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)
at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1879)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1904)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1875)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1857)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1275)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1212)
The thread is trying to send a Continuous Query callback but is failing to establish a connection to a client node. This causes the thread to be blocked and it can not serve other cache API requests that require the same partition.
At first glance, you could try to reduce #clientFailureDetectionTimeout, default is 30sec. But this won't fix the network issues completely.

rabbitmq client hangs while trying to declare queue

I tried searching for solution of my problem but could not find it stack overflow.
Issue
When a user tries to declare a queue or exchange, in a corner case where RabbitMQ server is having some issue, the client keeps waiting without any timeout which causes the thread calling the rabbitmq to always remain in waiting state (wait which never ends).
Below is stacktrace
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:50)
- locked <0x00000007bb0464c8> (a com.rabbitmq.utility.BlockingValueOrException)
at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:89)
- locked <0x00000007bb0464c8> (a com.rabbitmq.utility.BlockingValueOrException)
at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:343)
at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:216)
at
(AMQChannel.java:118)
at com.rabbitmq.client.impl.ChannelN.queueDeclare(ChannelN.java:833)
at com.rabbitmq.client.impl.ChannelN.queueDeclare(ChannelN.java:61)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory$CachedChannelInvocationHandler.invoke(CachingConnectionFactory.java:917)
- locked <0x00000007bb555300> (a java.lang.Object)
at com.sun.proxy.$Proxy293.queueDeclare(Unknown Source)
at org.springframework.amqp.rabbit.core.RabbitAdmin.declareQueues(RabbitAdmin.java:575)
at org.springframework.amqp.rabbit.core.RabbitAdmin.access$200(RabbitAdmin.java:66)
at org.springframework.amqp.rabbit.core.RabbitAdmin$12.doInRabbit(RabbitAdmin.java:504)
at org.springframework.amqp.rabbit.core.RabbitTemplate.doExecute(RabbitTemplate.java:1456)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1412)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1388)
at org.springframework.amqp.rabbit.core.RabbitAdmin.initialize(RabbitAdmin.java:500)
at org.springframework.amqp.rabbit.core.RabbitAdmin$11.onCreate(RabbitAdmin.java:419)
at org.springframework.amqp.rabbit.connection.CompositeConnectionListener.onCreate(CompositeConnectionListener.java:33)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory.createConnection(CachingConnectionFactory.java:553)
- locked <0x00000007bb057828> (a java.lang.Object)
at org.springframework.amqp.rabbit.core.RabbitTemplate.doExecute(RabbitTemplate.java:1431)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1412)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1388)
at org.springframework.amqp.rabbit.core.RabbitAdmin.declareQueue(RabbitAdmin.java:207)
Any help will be highly appreciated. Declaration of queues is currently in my postconstruct of beans calling our component handling messaging, thus not letting any new bean create.
UPDATE
The issue came again on our prod server. When trying to connect via amqp-client-3.4.2 directly it seems working. But from spring-rabbit-1.6.7.RELEASE, spring-amqp-1.6.7.RELEASE it is not working.
Via amqp-client-3.4.2
ConnectionFactory factory = new ConnectionFactory();
factory.setHost("<<HOST NAME>>");
factory.setUsername("<<USERNAME>>");
factory.setPassword("<<PASSWORD>>");
factory.setVirtualHost("<<VIRTUAL HOST>>");
Connection connection = factory.newConnection();
Channel channel = connection.createChannel();
channel.queueDeclare(QUEUE_NAME, true, false, false, null);
Code flow with rabbit-amqp client
Spring way which is not working
CachingConnectionFactory factory = new CachingConnectionFactory();
factory.setHost("<<HOST NAME>>");
factory.setUsername("<<USERNAME>>");
factory.setPassword("<<PASSWORD>>");
factory.setVirtualHost("<<VIRTUAL HOST>>");
RabbitAdmin admin = new RabbitAdmin(factory);
Queue queue = new Queue(QUEUE_NAME);
admin.declareQueue(queue);
Code flow with spring amqp
This issue occurs rarely and we are still trying to figure out the reason behind this behavior. We tried setting connection timeout but did not worked in our test program.
On debugging it further it looks like an exception is not letting notification sent back to our code. For client not found kind of issues, we are getting exception properly.
We are using RabbitMQ 3.6.10 and Erlang 19.3.4 on CentOS Linux 7 (Core)
Declaration of queues is currently in my postconstruct of beans
I can't speak to the hang but you should NEVER interact with the broker from post construct, afterPropertiesSet() etc. It is too early in the application context lifecycle.
There are several work arounds - implement SmartLifecycle; return true from isAutoStartup() and put the bean in an early phase (see Phased). start() will be called after the application context is fully created.
However, it's generally better to just define the queues, bindings etc as beans and let the framework take care of doing all the declarations for you.
I had something semi-similar happen, which I'll share in case it helps anyone.
It appears to me that a call to "rabbitAdmin.declareQueue" will wait for any ongoing publisher-confirm callbacks to complete. I couldn't find this documented anywhere but this was the behaviour I witnessed.
In my case, a separate thread (Thread #2) was processing a publisher-confirmation while Thread #1 was trying to declare a queue (and hanging). Thread #1 was waiting for Thread #2 to complete, but in my case I had a deadlock due to some funky database-locking I was doing which actually caused Thread #2 to also wait for Thread #1 to complete.
The solution was for me to stop doing significant processing in publisher-confirmation callbacks. In my callback, I actually just launch yet another thread to do the real processing. This allows my publisher-confirmation callback to return almost-immediately, releasing any potential deadlocks.

Instruct RabbitMQ to resend undelivered messages periodically

Background
We're using langohr to interact with RabbitMQ. We've tried two different approaches to let RabbitMQ resend messages that has not yet been properly handled by our service. One way that works is to send a basic.nack with requeue set to the true but this will resend the message immediately until the service responds with a basic.ack. This is a bit problematic if the service for example tries to persist the message to a datastore that is currently down (and is down for a while). It would be better for us to just fetch the undelivered messages say every 20 seconds or so (i.e. we neither do a basic.ack or basic.nack if the datastore is down, we just let the messages be retained in the queue). We've tried to implement this using an ExecutorService whose gist is implemented like this:
(let [chan (lch/open conn)] ; We create a new channel since channels in Langohr are not thread-safe
(log/info "Triggering \"recover\" for channel" chan)
(try
(lb/recover chan)
(catch Exception e (log/error "Failed to call recover" e))
(finally (lch/close chan))))
Unfortunately this doesn't seem to work (the messages are not redelivered and just remains in the queue). If we restart the service the queued messages are consumed correctly. However we have other services that are implemented using spring-rabbitmq (in Java) and they seem to be taking care of this out of the box. I've tried looking in the source code to figure out how they do it but I haven't managed to do so yet.
Question
How do you instruct RabbitMQ to (re-)deliver messages in the queue periodically (preferably using Langohr)?
I am not sure what you are doing with your Spring AMQP apps, but there's nothing built into RabbitMQ for this.
However, it's pretty easy to set up dead-lettering using a TTL to requeue back to the original queue after some period of time. See this answer for examples, links etc.
EDIT
However, Spring AMQP does have a retry interceptor which can be configured to suspend the consumer thread for some period(s) during retry.
Stateful retry rejects and requeues; stateless retry handles the retries internally and has no interaction with the broker during retries.
See this answer which has instructions: we Nack the message, the nack puts the message into a holding queue for N seconds, then it TTLs out of that queue and into another queue that puts it back in the original queue.
It took a little bit of work to setup, but it works great!

TP-Processorxx in waiting state

I am using jconsole(along with TDA.jar plugin) to take the thread dump of a remote tomcat 6 server.
I see a lot of TP-Processorxx(90 threads) in waiting state. Find below the thread dump
"TP-Processor86" nid=197 state=WAITING
- waiting on <0x20afbfdd> (a org.apache.tomcat.util.threads.ThreadPool$ControlRunnable)
- locked <0x20afbfdd> (a org.apache.tomcat.util.threads.ThreadPool$ControlRunnable)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:662)
at java.lang.Thread.run(Thread.java:619)
I want to know - what are these TP-Processor threads and what they actually do?
Is there any impact on performace because of these waiting threads?
Are these waiting threads a result of some faulty application code?
If you really interested in understanding/debugging thread dumps, you may want to read the following article :
https://dzone.com/articles/how-analyze-java-thread-dumps
To answer your question, threads in the waiting state (with the stack trace provided by you) are generally harmless. They are just waiting for the task to arrive in the queue.

How does RabbitMQ send messages to consumers?

I am a newbie to RabbitMQ, hence need guidance on a basic question:
Does RabbitMQ send messages to consumer as they arrive?
OR
Does RabbitMQ send messages to consumer as they become available?
At message consumption endpoint, I am using com.rabbitmq.client.QueueingConsumer.
Looking at the sprint client source code, I could figure out that
QueueingConsumer keeps listening on socket for any messages the broker sends to it
Any message that is received is parsed and stored as Delivery in a LinkedBlockingQueue encapsulated inside the QueueingConsumer.
This implies that even if the message processing endpoint is busy, messages will be pushed to QueueingConsumer
Is this understanding right?
TLDR: you poll messages from RabbitMQ till the prefetch count is exceeded in which case you will block and only receive heart beat frames till the fetch messages are ACKed. So you can poll but you will only get new messages if the number of non-acked messages is less than the prefetch count. New messages are put on the QueueingConsumer and in theory you should never really have much more than the prefetch count in that QueueingConsumer internal queue.
Details:
Low level wise for (I'm probably going to get some of this wrong) RabbitMQ itself doesn't actually push messages. The client has to continuously read the connections for Frames based on the AMQP protocol. Its hard to classify this as push or pull but just know the client has to continuously read the connection and because the Java client is sadly BIO it is a blocking/polling operation. The blocking/polling is based on the AMQP heartbeat frames and regular frames and socket timeout configuration.
What happens in the Java RabbitMQ client is that there is thread for each channel (or maybe its connection) and that thread loops gathering frames from RabbitMQ which eventually become commands that are put in a blocking queue (I believe its like a SynchronousQueue aka handoff queue but Rabbit has its own special one).
The QueueingConsumer is a higher level API and will pull commands off of that handoff queue mentioned early because if commands are left on the handoff queue it will block the channel frame gathering loop. This is can be bad because timeout the connection. Also the QueueingConsumer allows work to be done on a separate thread instead of being in the same thread as the looping frame thread mentioned earlier.
Now if you look at most Consumer implementations you will probably notice that they are almost always unbounded blocking queues. I'm not entirely sure why the bounding of these queues can't be a multiplier of the prefetch but if they are less than the prefetch it will certainly cause problems with the connection timing out.
I think best answer is product's own answer. As RMQ has both push + pull mechanism defined as part of the protocol. Have a look : https://www.rabbitmq.com/tutorials/amqp-concepts.html
Rabbitmq mainly uses Push mechanism. Poll will consume bandwidth of the server. Poll also has time gaps between each poll. It will not able to achieve low latency. Rabbitmq will push the message to client once there are consumers available for the queue. So the connection is long running. ReadFrame in rabbitmq is basically waiting for incoming frames