rabbitmq client hangs while trying to declare queue - rabbitmq

I tried searching for a solution to my problem but could not find it on Stack Overflow.
Issue
When a user tries to declare a queue or exchange, in a corner case where the RabbitMQ server is having some issue, the client keeps waiting without any timeout, which leaves the calling thread stuck in a waiting state forever (a wait that never ends).
Below is the stack trace:
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:50)
- locked <0x00000007bb0464c8> (a com.rabbitmq.utility.BlockingValueOrException)
at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:89)
- locked <0x00000007bb0464c8> (a com.rabbitmq.utility.BlockingValueOrException)
at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:343)
at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:216)
at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:118)
at com.rabbitmq.client.impl.ChannelN.queueDeclare(ChannelN.java:833)
at com.rabbitmq.client.impl.ChannelN.queueDeclare(ChannelN.java:61)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory$CachedChannelInvocationHandler.invoke(CachingConnectionFactory.java:917)
- locked <0x00000007bb555300> (a java.lang.Object)
at com.sun.proxy.$Proxy293.queueDeclare(Unknown Source)
at org.springframework.amqp.rabbit.core.RabbitAdmin.declareQueues(RabbitAdmin.java:575)
at org.springframework.amqp.rabbit.core.RabbitAdmin.access$200(RabbitAdmin.java:66)
at org.springframework.amqp.rabbit.core.RabbitAdmin$12.doInRabbit(RabbitAdmin.java:504)
at org.springframework.amqp.rabbit.core.RabbitTemplate.doExecute(RabbitTemplate.java:1456)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1412)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1388)
at org.springframework.amqp.rabbit.core.RabbitAdmin.initialize(RabbitAdmin.java:500)
at org.springframework.amqp.rabbit.core.RabbitAdmin$11.onCreate(RabbitAdmin.java:419)
at org.springframework.amqp.rabbit.connection.CompositeConnectionListener.onCreate(CompositeConnectionListener.java:33)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory.createConnection(CachingConnectionFactory.java:553)
- locked <0x00000007bb057828> (a java.lang.Object)
at org.springframework.amqp.rabbit.core.RabbitTemplate.doExecute(RabbitTemplate.java:1431)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1412)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1388)
at org.springframework.amqp.rabbit.core.RabbitAdmin.declareQueue(RabbitAdmin.java:207)
Any help will be highly appreciated. Declaration of queues is currently in the @PostConstruct of beans that call our messaging component, so the hang prevents any new beans from being created.
UPDATE
The issue came again on our prod server. When connecting directly via amqp-client-3.4.2 it seems to work, but via spring-rabbit-1.6.7.RELEASE / spring-amqp-1.6.7.RELEASE it does not.
Via amqp-client-3.4.2
ConnectionFactory factory = new ConnectionFactory();
factory.setHost("<<HOST NAME>>");
factory.setUsername("<<USERNAME>>");
factory.setPassword("<<PASSWORD>>");
factory.setVirtualHost("<<VIRTUAL HOST>>");
Connection connection = factory.newConnection();
Channel channel = connection.createChannel();
channel.queueDeclare(QUEUE_NAME, true, false, false, null);
Code flow with rabbit-amqp client
Spring way which is not working
CachingConnectionFactory factory = new CachingConnectionFactory();
factory.setHost("<<HOST NAME>>");
factory.setUsername("<<USERNAME>>");
factory.setPassword("<<PASSWORD>>");
factory.setVirtualHost("<<VIRTUAL HOST>>");
RabbitAdmin admin = new RabbitAdmin(factory);
Queue queue = new Queue(QUEUE_NAME);
admin.declareQueue(queue);
Code flow with spring amqp
This issue occurs rarely and we are still trying to figure out the reason behind this behavior. We tried setting the connection timeout, but it did not work in our test program.
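For reference, setting the connection timeout looked roughly like this (hosts and the timeout value are placeholders); as far as I understand, this only bounds the initial socket connect, not the wait for the queue.declare-ok reply, which would explain why it made no difference:
CachingConnectionFactory factory = new CachingConnectionFactory();
factory.setHost("<<HOST NAME>>");
factory.setUsername("<<USERNAME>>");
factory.setPassword("<<PASSWORD>>");
factory.setVirtualHost("<<VIRTUAL HOST>>");
// Only limits how long the TCP connection attempt may take (milliseconds);
// it does not put a timeout on the blocking wait inside queueDeclare()
factory.setConnectionTimeout(15000);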
On debugging further, it looks like an exception prevents the notification from being sent back to our code. For "client not found" kinds of issues, we do get the exception properly.
We are using RabbitMQ 3.6.10 and Erlang 19.3.4 on CentOS Linux 7 (Core)

Declaration of queues is currently in the @PostConstruct of beans
I can't speak to the hang, but you should NEVER interact with the broker from @PostConstruct, afterPropertiesSet(), etc. It is too early in the application context lifecycle.
There are several workarounds: implement SmartLifecycle, return true from isAutoStartup(), and put the bean in an early phase (see Phased); start() will then be called after the application context is fully created.
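A minimal sketch of that approach (the class and queue names here are made up for illustration):
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.rabbit.core.RabbitAdmin;
import org.springframework.context.SmartLifecycle;

public class QueueDeclaringLifecycle implements SmartLifecycle {

    private final RabbitAdmin admin;

    private volatile boolean running;

    public QueueDeclaringLifecycle(RabbitAdmin admin) {
        this.admin = admin;
    }

    @Override
    public void start() {
        // Runs only after the application context is fully refreshed
        this.admin.declareQueue(new Queue("my.queue"));
        this.running = true;
    }

    @Override
    public void stop() {
        this.running = false;
    }

    @Override
    public void stop(Runnable callback) {
        stop();
        callback.run();
    }

    @Override
    public boolean isRunning() {
        return this.running;
    }

    @Override
    public boolean isAutoStartup() {
        return true;
    }

    @Override
    public int getPhase() {
        return Integer.MIN_VALUE; // early phase, so it starts before later-phase beans
    }
}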
However, it's generally better to just define the queues, bindings, etc. as beans and let the framework take care of all the declarations for you.
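For example, a minimal sketch of the declarative style (names are illustrative); any Queue, Exchange, or Binding beans in the context are picked up by the RabbitAdmin and declared when the first connection is opened:
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitAdmin;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RabbitConfig {

    @Bean
    public RabbitAdmin rabbitAdmin(ConnectionFactory connectionFactory) {
        // Declares all Queue/Exchange/Binding beans when the first connection opens
        return new RabbitAdmin(connectionFactory);
    }

    @Bean
    public Queue myQueue() {
        // Durable queue; no broker interaction happens while beans are being created
        return new Queue("my.queue", true);
    }
}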

I had something semi-similar happen, which I'll share in case it helps anyone.
It appears to me that a call to "rabbitAdmin.declareQueue" will wait for any ongoing publisher-confirm callbacks to complete. I couldn't find this documented anywhere but this was the behaviour I witnessed.
In my case, a separate thread (Thread #2) was processing a publisher-confirmation while Thread #1 was trying to declare a queue (and hanging). Thread #1 was waiting for Thread #2 to complete, but in my case I had a deadlock due to some funky database-locking I was doing which actually caused Thread #2 to also wait for Thread #1 to complete.
The solution for me was to stop doing significant processing in publisher-confirmation callbacks. In my callback I now just launch yet another thread to do the real processing. This allows the publisher-confirmation callback to return almost immediately, releasing any potential deadlocks.
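In case it helps, a rough sketch of that hand-off, assuming Spring AMQP's RabbitTemplate confirm callback (rabbitTemplate, the executor, and handleConfirm are my own names):
final ExecutorService confirmExecutor = Executors.newSingleThreadExecutor();

rabbitTemplate.setConfirmCallback(new RabbitTemplate.ConfirmCallback() {
    @Override
    public void confirm(final CorrelationData correlationData, final boolean ack, final String cause) {
        // Return from the confirm callback right away; the real work
        // (database updates etc.) happens on a separate thread
        confirmExecutor.submit(new Runnable() {
            @Override
            public void run() {
                handleConfirm(correlationData, ack, cause); // hypothetical processing method
            }
        });
    }
});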

Related

ActiveMQ producer waiting indefinitely for a valid connection

We are facing an issue while producing a message to an ActiveMQ 5.15.4 broker.
The thread trying to produce the message is blocked indefinitely:
Thread 464: (state = BLOCKED)
- java.lang.Object.wait(long) #bci=0 (Compiled frame; information may be imprecise)
- org.apache.activemq.transport.failover.FailoverTransport.oneway(java.lang.Object) #bci=370, line=620 (Compiled frame)
- org.apache.activemq.transport.MutexTransport.oneway(java.lang.Object) #bci=12, line=68 (Interpreted frame)
It seems that the FailoverTransport object is waiting for a valid connection (a non-null transport object), but the reconnection task is never launched.
Any idea how we can reach that situation and how to fix it?
In the end we took several actions and have not faced the issue since:
Switch from CachingConnectionFactory to PooledConnectionFactory
Use a dedicated connection factory for message senders distinct from the one used by the listeners
Define one JmsTemplate per destination and set the defaultDestination only once (we used to have a single JmsTemplate, with the destination supplied in each send call); see the sketch after this list.
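A rough sketch of the resulting configuration (broker URL, bean and destination names are illustrative):
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.pool.PooledConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.jms.core.JmsTemplate;

@Bean
public PooledConnectionFactory senderConnectionFactory() {
    // Dedicated pooled factory for producers, separate from the listeners' factory
    return new PooledConnectionFactory(
            new ActiveMQConnectionFactory("failover:(tcp://broker:61616)"));
}

@Bean
public JmsTemplate ordersJmsTemplate(PooledConnectionFactory senderConnectionFactory) {
    // One JmsTemplate per destination, defaultDestination set exactly once
    JmsTemplate template = new JmsTemplate(senderConnectionFactory);
    template.setDefaultDestinationName("orders.queue");
    return template;
}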

Blocked system-critical thread has been detected

I'm using Ignite.NET 2.7.6. The setup is one server and about 40 clients. After 8 hours of work, the server starts behaving strangely: clients cannot connect to it, some queries return no result, etc.
On the server side, memory consumption is OK, the thread count is about 250, and everything looks fine. I don't see any obvious problems, so I decided to work through all the issues on the server side that were logged as SEVERE.
The first one I encounter is:
Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=tcp-comm-worker, blockedFor=13s]
So I want to understand the reason this happens.
Full server's log can be found here:
https://yadi.sk/d/LF03Vz5vz4tRcw
https://yadi.sk/d/MMe0xrgI3k6lkA
Added:
The issue does not seem to be innocuous: this message appears every second from various threads, and the blockedFor value keeps increasing, from seconds to hours.
The load on the server is low, but as the server's threads become locked it stops responding and stops registering new clients.
Here are logs from the server:
https://yadi.sk/d/tc3g2hb9B0jtvg
https://yadi.sk/d/05YrlYXcp4xPqg
This is the log from one client:
https://yadi.sk/d/bcbQ7ee4PUzq2w
The client's log's last lines are at 19:03:52, when the server was restarted.
I see the following .NET-specific exception among the others, but it should be triggered by another issue. Anyway, this one has been reported to the community.
class org.apache.ignite.IgniteException: Platform error:System.NullReferenceException: Object reference not set to an instance of an object.
at Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.CacheEntryFilterApply(Int64 memPtr)
at Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.InLongOutLong(Int32 type, Int64 val)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.loggerLog(PlatformProcessorImpl.java:404)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:460)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:512)
at org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutLong(PlatformTargetProxyImpl.java:67)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackUtils.inLongOutLong(Native Method)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackGateway.cacheEntryFilterApply(PlatformCallbackGateway.java:143)
at org.apache.ignite.internal.processors.platform.cache.PlatformCacheEntryFilterImpl.apply(PlatformCacheEntryFilterImpl.java:70)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager$InternalScanFilter.apply(GridCacheQueryManager.java:3139)
The very first exceptions are related to the communication issues at the networking level. See below:
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Unknown Source)
[18:46:12,846][WARNING][grid-nio-worker-tcp-comm-0-#48][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=An existing connection was forcibly closed by the remote host]
[18:46:13,861][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/127.0.0.1:47101, failureDetectionTimeout=10000]
[18:46:14,893][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=BB-SRV-DELTA/169.254.40.231:47101, failureDetectionTimeout=10000]
It looks like either the server or some clients don't react to heartbeats or to other networking requests within 10 seconds. Check the logs of the client nodes as well. You might need to scale out your cluster by adding more servers for the sake of load balancing, or adjust the failureDetectionTimeout.
The "Blocked system-critical thread has been detected..." error message is innocuous but confusing. I've restarted the following conversation.
As Denis described, there are a lot of network communication issues.
In general, a client would like to perform some cache operation, but a server thread from the striped pool is blocked for a long time. I don't think it relates to the .NET part.
You can see following messages:
[18:53:04,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=13s]
If you take a look at the thread:
Thread [name="sys-stripe-7-#8", id=28, state=WAITING, blockCnt=51, waitCnt=3424]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2911)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)
at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1879)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1904)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1875)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1857)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1275)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1212)
The thread is trying to send a Continuous Query callback but is failing to establish a connection to a client node. This causes the thread to be blocked, and it cannot serve other cache API requests that require the same partition.
At first glance, you could try reducing clientFailureDetectionTimeout (the default is 30 seconds), but this won't fix the network issues completely.
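If you want to experiment with that, a sketch of the server-side setting (the value is just an example):
IgniteConfiguration cfg = new IgniteConfiguration();
// Lower the client failure detection timeout from its 30 s default (milliseconds)
cfg.setClientFailureDetectionTimeout(10000);
Ignite ignite = Ignition.start(cfg);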

RabbitMQ Java API: I want to use basicGet() for RPC call. Can it be made to block? Can it be interrupted?

I am fairly new to RabbitMQ, and starting on a project that is using RabbitMQ in a fairly old-fashioned "RPC" pattern. So I'm trying something like this on the "server" side:
ConnectionFactory factory = new ConnectionFactory();
factory.setUri(uri);
Connection connection = factory.newConnection();
Channel channel = connection.createChannel();
channel.queueDeclare(queueName, false, false, false, null);
while (!shutdown) {
    GetResponse gr = channel.basicGet(queueName, true);
    // ... build reply ...
    channel.basicPublish("", gr.getProps().getReplyTo(), replyProps, response);
}
My question is: can the thread waiting on basicGet() be interrupted? If so, what happens? (InterruptedException is not declared.) I realize this is not a great pattern, but I just want some way to cleanly shut down a service.
UPDATE: one comment indicates that basicGet() does not block at all, and returns immediately if the queue is empty. If that is the case, let me revise my question to be more precise: How do I wait for a message on a queue and retrieve it, with a timeout?
UPDATE2: After experimenting and asking questions on the rabbitmq mailing list, I conclude that this cannot be done directly. It is simply not The Way That You Do Things in RabbitMQ. Instead, you launch a consumer thread pool using Channel.basicConsume() and wait for your handler method to be called. It can be done indirectly by having your consumer post to a SynchronousQueue (or something similar) and having your foreground thread(s) wait on that, but be warned that this defeats the automatic scaling offered by basicConsume(), makes it harder to properly ACK all requests, and creates additional message buffering that makes it difficult to honor the QoS semantics set by the basicQos() call.
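For completeness, the indirect approach looks roughly like this (the queue name and timeout are placeholders, and the QoS/ack caveats above still apply):
final BlockingQueue<GetResponse> handoff = new SynchronousQueue<GetResponse>();

channel.basicConsume(queueName, true, new DefaultConsumer(channel) {
    @Override
    public void handleDelivery(String consumerTag, Envelope envelope,
                               AMQP.BasicProperties properties, byte[] body) throws IOException {
        try {
            // Push the delivery to the foreground thread
            handoff.put(new GetResponse(envelope, properties, body, 0));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
});

// Foreground thread: wait up to 5 seconds for a message
GetResponse response = handoff.poll(5, TimeUnit.SECONDS);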
It should also be noted that, once you go down the basicConsume() route, the consumer can be interrupted. This is done something like:
// This starts a background thread pool
String consumerTag = channel.basicConsume(queueName, true, consumer);
...
// Shutdown the consumer thread pool
channel.basicCancel(consumerTag);
UPDATE3: See last answer. RabbitMQ comes with an RpcClient class that works splendidly.
basicGet doesn't block; it returns immediately (well, just after a network round trip) and returns null if there are no messages on the queue. So it's not necessary to interrupt the thread.
The RabbitMQ Java API comes with a very nice RPC client implementation called (unsurprisingly) RpcClient. Use it! I made an example on GitHub.
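A minimal sketch of the calling side (the exchange, routing key and timeout are placeholders):
Channel channel = connection.createChannel();
// Publish to the default exchange, routed to the server's request queue,
// and block for the reply with a 5-second timeout
RpcClient rpc = new RpcClient(channel, "", "rpc.requests", 5000);
String reply = rpc.stringCall("ping");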

Spring Amqp Consumer pauses after running for sometime

We have a 2-node RabbitMQ cluster with an ha-all policy. We use Spring AMQP in our application to talk to RabbitMQ. The producer part is working fine, but the consumer works for some time and then pauses. The producer and consumer run as different applications. More information on the consumer part:
We use SimpleMessageListenerContainer with a ChannelAwareMessageListener, manual ack mode, and the default prefetch (1).
In our application we create queues on demand and add them to the listener container.
With 10 concurrentConsumers and 20 maxConcurrentConsumers, consumption runs for around 15 hours and then pauses. The pause happens within 1 hour when we increase maxConcurrentConsumers to 75.
On the RabbitMQ UI channels tab we see channels with 3 or 4 unacked messages when this situation occurs; until then there is just 1 unacked message per channel.
Our thread dump was similar to this one, but setting the heartbeat to 60 did not improve the situation.
Most of the threads in the dump show the following. If required I will attach the whole thread dump. Am I missing any setup that might cause the consumer to pause?
"pool-6-thread-16" #86 prio=5 os_prio=0 tid=0x00007f4db09cb000 nid=0x3b33 waiting on condition [0x00007f4ebebec000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007b9930b68> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
at org.springframework.amqp.rabbit.listener.BlockingQueueConsumer$InternalConsumer.handleDelivery(BlockingQueueConsumer.java:660)
at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:144)
at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:99)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
More Info
We dynamically add and remove queues on the SimpleMessageListenerContainer and suspect this is causing the problem, because every time we add or remove a queue from the listener, all the BlockingQueueConsumer instances are removed and recreated. Could this be the cause?
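(For reference, the add/remove is done roughly like this; the queue name is just an example:)
// Adding a queue to the running container
container.addQueueNames("tenant-42.events");
// ... later, removing it again
container.removeQueueNames("tenant-42.events");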
Your problem is somewhere downstream in the target listener.
Look, prefetch(1) causes this:
this.queue = new LinkedBlockingQueue<Delivery>(prefetchCount);
And further, if we don't poll that queue, what do we have here?
BlockingQueueConsumer.this.queue.put(new Delivery(consumerTag, envelope, properties, body));
Right, parking on lock.
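If the listener genuinely just needs more headroom (rather than being blocked outright), one illustrative knob is the container prefetch, which sizes that internal queue; this is a sketch, not a fix for a blocked listener:
SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(connectionFactory);
container.setQueueNames("my.queue");
container.setAcknowledgeMode(AcknowledgeMode.MANUAL);
// Each BlockingQueueConsumer's internal LinkedBlockingQueue is sized by this value
container.setPrefetchCount(10);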
AMQP-621 is now merged to master; we will release 1.6.1.RELEASE in the next few days.

Jedis is running out of resource instances, web application blocks

I'm running a Tomcat application which uses Jedis to access a Redis database. From time to time the whole application blocks. By monitoring Tomcat using JavaMelody I found out that the problem seems to be related to the JedisPool when an object requests a Jedis instance.
catalina-exec-74
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1104)
redis.clients.util.Pool.getResource(Pool.java:20)
....
This is the JedisPoolConfig I'm using
JedisPoolConfig poolConfig = new JedisPoolConfig();
poolConfig.setMaxActive(20);
poolConfig.setTestOnBorrow(true);
poolConfig.setTestOnReturn(true);
poolConfig.setMaxIdle(5);
poolConfig.setMinIdle(5);
poolConfig.setTestWhileIdle(true);
poolConfig.setNumTestsPerEvictionRun(10);
poolConfig.setTimeBetweenEvictionRunsMillis(10000);
jedisPool = new JedisPool(poolConfig, "localhost");
So obviously some threads try to get a Jedis instance, but the pool is empty and cannot return one, so the default pool behavior is to wait.
I've already double-checked my whole code and I'm pretty sure I return every Jedis instance I use to the pool, so I'm not sure why I'm running out of instances.
Is there a way to check how many instances are left in the pool? I'm trying to find a sensible value for the maxActive parameter to prevent the application from blocking.
Are there any other ways to leak instances besides not returning them to the pool?
Returning the resource to the Pool is important, so remember to do it. Otherwise when closing your app it'll wait for the resource to return.
https://groups.google.com/forum/?fromgroups=#!topic/jedis_redis/UeOhezUd1tQ
After each Jedis method call, return the resource to the pool. Your app has probably used up all the pooled instances and is waiting for one to be released. This may cause the behavior you're describing, and the app is probably blocked as a result.
Jedis jedis = JedisFactory.jedisPool.getResource();
try {
    jedis.set("key", "val");
} finally {
    JedisFactory.jedisPool.returnResource(jedis);
}
Partial answer to hopefully be of some help to people in similar scenarios, though I'm unsure if my problem was the same as yours (if you've since figured it out, please let us know!).
I've already double-checked my whole code and I'm pretty sure I return every Jedis instance I use to the pool, so I'm not sure why I'm running out of instances.
I thought I had, too; I always put my code in try/finally blocks, but it turns out I did have a leak:
Jedis jedis = DBHelper.pool.getResource();
try {
    // Next line caused the leak: it grabs a second resource from the pool,
    // and the first one is never returned
    jedis = DBHelper.pool.getResource();
    ...
} finally {
    DBHelper.pool.returnResource(jedis);
}
No idea how that second call snuck in, but it did and caused both a leak and the web app to block.
Is there a ways to check how many instances are left in the pool? I'm trying to find a sensible value for the maxActive parameter to prevent the application from blocking.
In my case, I both found the bug and optimized based on the number of clients seen by the Redis server. I set the loglevel in redis.conf to verbose (the default is notice), which reports the number of connected clients roughly every 5-10 seconds. Once I found my leak, I repeatedly sent requests to the page calling that method and watched the client count reported in the Redis logs climb but never drop. I would think that's a good start for optimizing.
Anyways, hope this is helpful to someone!
When you use the Jedis pool, every time you get a resource using getResource() you have to return it with returnResource(). If there are more threads than pooled resources, you will have contention. I found it much simpler to have one Jedis connection per thread using a Java ThreadLocal: for each thread, check whether a Jedis connection already exists; if yes, use it, otherwise create one for the running thread. This ensures there is no lock contention or error condition to look after.
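A sketch of that pattern (host details are illustrative, and each thread's connection still needs to be closed when the thread finishes):
private static final ThreadLocal<Jedis> JEDIS = new ThreadLocal<Jedis>() {
    @Override
    protected Jedis initialValue() {
        // One dedicated connection per thread, so there is no pool contention
        return new Jedis("localhost");
    }
};

public static Jedis currentJedis() {
    return JEDIS.get();
}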