RabbitMQ - Channel shutdown: connection error (SpringXD closes RabbitMQ connections repeatedly)

I had a terrible night trying to figure out what is going on with RabbitMQ and SpringXD, unfortunately without success.
The problem:
SpringXD closes RabbitMQ connections repeatedly,
or reports warnings related to the channel cache size.
Fragment from the SpringXD log (during stream initialization/autowiring):
2016-05-03T07:42:43+0200 1.3.0.RELEASE WARN
DeploymentsPathChildrenCache-0 listener.SimpleMessageListenerContainer
- CachingConnectionFactory's channelCacheSize can not be less than the
number of concurrentConsumers so it was reset to match: 4
...
2016-05-03T07:54:17+0200 1.3.0.RELEASE ERROR AMQP Connection
192.168.120.125:5672 connection.CachingConnectionFactory - Channel shutdown: connection error
2016-05-03T17:38:58+0200 1.3.0.RELEASE ERROR AMQP Connection
192.168.120.125:5672 connection.CachingConnectionFactory - Channel shutdown: connection error; protocol method:
method<connection.close>(reply-code=504, reply-text=CHANNEL_ERROR -
second 'channel.open' seen, class-id=20, method-id=10)
Fragment from the RabbitMQ log:
=WARNING REPORT==== 3-May-2016::08:08:09 === closing AMQP connection <0.22276.61> (192.168.120.125:59350 -> 192.168.120.125:5672): client
unexpectedly closed TCP connection
=ERROR REPORT==== 3-May-2016::08:08:11 === closing AMQP connection <0.15409.61> (192.168.120.125:58527 -> 192.168.120.125:5672):
{writer,send_failed,{error,closed}}
(The 'state: blocked' error below is rare.)
=ERROR REPORT==== 3-May-2016::17:38:58 === Error on AMQP connection <0.20542.25> (192.168.120.125:59421 -> 192.168.120.125:5672, vhost:
'/', user: 'xd', state: blocked), channel 7: operation channel.open
caused a connection exception channel_error: "second 'channel.open'
seen"
My setup (6 nodes)
- springxd 1.3.0 distributed (zookeeper)
- RabbitMQ 3.6.0, Erlang R16B03-1 cluster
ackMode: AUTO ## or NONE
autoBindDLQ: false
backOffInitialInterval: 1000
backOffMaxInterval: 10000
backOffMultiplier: 2.0
batchBufferLimit: 10000
batchingEnabled: false
batchSize: 200
batchTimeout: 5000
compress: false
concurrency: 4
deliveryMode: NON_PERSISTENT ## or PERSISTENT
durableSubscription: false
maxAttempts: 10
maxConcurrency: 10
prefix: xdbus.
prefetch: 1000
replyHeaderPatterns: STANDARD_REPLY_HEADERS,*
republishToDLQ: false
requestHeaderPatterns: STANDARD_REQUEST_HEADERS,*
requeue: true
transacted: false
txSize: 1000
spring:
  rabbitmq:
    addresses: priv1:5672,priv2:5672,priv3:5672,priv4:5672,priv5:5672,priv6:5672
    adminAddresses: http://priv1:15672,http://priv2:15672,http://priv3:15672,http://priv4:15672,http://priv5:15672,http://priv6:15672
    nodes: rabbit@priv1,rabbit@priv2,rabbit@priv3,rabbit@priv4,rabbit@priv5,rabbit@priv6
    username: xd
    password: xxxx
    virtual_host: /
    useSSL: false
ha-xdbus policy:
- ^xdbus\. all
- ha-mode: exactly
- ha-params: 2
- queue-master-locator: min-masters
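For reference, that policy corresponds roughly to a rabbitmqctl invocation like the one below (a sketch only; the policy name ha-xdbus and the default vhost are assumed):
rabbitmqctl set_policy --apply-to queues ha-xdbus "^xdbus\." '{"ha-mode":"exactly","ha-params":2,"queue-master-locator":"min-masters"}'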
Rabbit conf
[
{rabbit,
[
{tcp_listeners, [5672]},
{queue_master_locator, "min-masters"}
]
}
].
When ackMode is NONE the following happens:
Eventually the number of consumers drops to zero and I am left with zombie streams that don't recover from that state, which in turn causes unwanted queueing.
When ackMode is AUTO the following happens:
Some messages are left un-acked forever.
SpringXD streams and durable queues
The rabbit module is used as a source or sink; there is no custom autowiring.
Typical stream definitions are as follows:
Ingestion:
event_generator | rabbit --mappedRequestHeaders=XDRoutingKey --routingKey='headers[''XDRoutingKey'']'
Processing/Sink:
rabbit --queues='xdbus.INQUEUE-A' | ENRICHMENT-PROCESSOR-A | elastic-sink
rabbit --queues='xdbus.INQUEUE-B' | ENRICHMENT-PROCESSOR-B | elastic-sink
The xdbus.INQUEUE-xxx queues are created manually (durable) from the RabbitMQ admin GUI.
GLOBAL statistics (from the RabbitMQ Admin)
Connections: 190
Channels: 2263 (Channel cache problem perhaps ?)
Exchanges: 20
Queues: 120
Consumers : 1850
Finally:
I would appreciate it if someone could point out what is wrong with the configuration (I am fairly sure the network is performing well, so there are no network problems, and the max open files limit is not being hit).
Message rates vary from 2k/sec to a maximum of 30k/sec, which is a relatively small load.
Thanks!
Ivan

We have seen some similar instability when churning channels at a high rate.
The work-around was to increase the channel cache size to avoid the high rate of churning; it's not clear where the instability lies, but I don't believe it is in Spring AMQP.
One problem, however, is that XD doesn't expose channelCacheSize as a property.
The answer at the link above has a work-around to add the property by replacing the bus configuration XML. Increasing the cache size solved that user's problem.
We have an open JIRA issue to expose the property but it's not implemented yet.
I see you originally posted this as an 'answer' to that question.
Could someone be more specific and explain where exactly rabbit-bus.xml should be installed, and why is this happening in the first place?
As it says there, you need to put it under the xd config directory:
xd/config/META-INF/spring-xd/bus/rabbit-bus.xml.
EDIT
Technique using the bus extension mechanism instead...
$ cat xd/config/META-INF/spring-xd/bus/ext/cf.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd">
<bean id="rabbitConnectionFactory" class="org.springframework.amqp.rabbit.connection.CachingConnectionFactory">
<constructor-arg ref="rabbitFactory" />
<property name="addresses" value="${spring.rabbitmq.addresses}" />
<property name="username" value="${spring.rabbitmq.username}" />
<property name="password" value="${spring.rabbitmq.password}" />
<property name="virtualHost" value="${spring.rabbitmq.virtual_host}" />
<property name="channelCacheSize" value="${spring.rabbitmq.channelCacheSize:100}" />
</bean>
</beans>
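With that extension file in place, the cache size can be supplied like any other spring.rabbitmq property (a minimal sketch, assuming it sits alongside the other spring.rabbitmq settings in servers.yml as in the question above; 200 is just an illustrative value):
spring:
  rabbitmq:
    # picked up by the ${spring.rabbitmq.channelCacheSize:100} placeholder in cf.xml
    channelCacheSize: 200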
EDIT: TEST RESULTS
Prepopulated queue foo with 1 million messages.
concurrency: 10
prefetch: 1000
txSize: 1000
xd:>stream create foo --definition "rin:rabbit --concurrency=10 --maxConcurrency=10 --prefetch=1000 --txSize=1000 | t1:transform | t2:transform | rout:rabbit --routingKey='''bar'''" --deploy
Created and deployed new stream 'foo'
So with this configuration, we end up with 40 consumers.
I never saw more than 29 publishing channels from the bus, there were 10 publishers for the sink.
1m messages were transferred from foo to bar in less than 5 minutes (via xdbus.foo.0, xdbus.foo.1 and xdbus.foo.2) - 4m messages published.
No errors - but my laptop needs to cool off :D

Related

Streaming Data into Google BigQuery Tables : SocketTimeoutException, 502 Bad Gateway, 500 Internal Server Error warnings

We are using the Camel BigQuery API (version 2.20) to stream records from a message queue on an ActiveMQ server (version 5.14.3) into a Google BigQuery table.
We have implemented and deployed the streaming mechanism as an XML route definition in a Spring Framework thus:
<?xml version="1.0" encoding="UTF-8"?>
<beans
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.springframework.org/schema/beans"
xmlns:beans="http://www.springframework.org/schema/beans"
xsi:schemaLocation="
http://www.springframework.org/schema/beans
./spring-beans.xsd
http://camel.apache.org/schema/spring
./camel-spring.xsd">
<!--
# ==========================================================================
# ActiveMQ JMS Bean Definition
# ==========================================================================
-->
<bean id="jms" class="org.apache.camel.component.jms.JmsComponent">
<property name="connectionFactory">
<bean class="org.apache.activemq.ActiveMQConnectionFactory">
<property name="brokerURL" value="nio://192.168.10.10:61616?jms.useAsyncSend=true" />
<property name="userName" value="MyAmqUserName" />
<property name="password" value="MyAmqPassword" />
</bean>
</property>
</bean>
<!--
# ==========================================================================
# GoogleBigQueryComponent
# https://github.com/apache/camel/tree/master/components/camel-google-bigquery
# ==========================================================================
-->
<bean id="gcp" class="org.apache.camel.component.google.bigquery.GoogleBigQueryComponent">
<property name="connectionFactory">
<bean class="org.apache.camel.component.google.bigquery.GoogleBigQueryConnectionFactory">
<property name="credentialsFileLocation" value="MyDir/MyGcpKeyFile.json" />
</bean>
</property>
</bean>
<!--
# ==========================================================================
# Main Context Bean Definition
# ==========================================================================
-->
<camelContext id="camelContext" xmlns="http://camel.apache.org/schema/spring" >
<!--
========================================================================
https://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/processor/RedeliveryPolicy.html
========================================================================
-->
<onException useOriginalMessage="true">
<exception>com.google.api.client.googleapis.json.GoogleJsonResponseException</exception>
<exception>java.net.SocketTimeoutException</exception>
<exception>java.net.ConnectException</exception>
<redeliveryPolicy
backOffMultiplier="2"
logHandled="false"
logRetryAttempted="true"
maximumRedeliveries="10"
maximumRedeliveryDelay="60000"
redeliveryDelay="1000"
retriesExhaustedLogLevel ="ERROR"
retryAttemptedLogLevel="WARN"
/>
</onException>
<!--
# ==================================================================
# Message Route :
# 1. consume messages from my AMQ queue
# 2. write message to Google BigQuery table
# see https://github.com/apache/camel/blob/master/components/camel-google-bigquery/src/main/docs/google-bigquery-component.adoc
# ==================================================================
-->
<route>
<from uri="jms:my.amq.queue.of.output.data.for.gcp?acknowledgementModeName=DUPS_OK_ACKNOWLEDGE&concurrentConsumers=20" />
<to uri="gcp:my_gcp_project:my_bq_data_set:my_bq_table" />
</route>
</camelContext>
</beans>
The above seems to work and we seem to be landing messages/records at a high rate (one route handles over 12,000 messages per minute), but our logs show a good number of SocketTimeoutException, 502 Bad Gateway and 500 Internal Server Error warnings:
2019-10-21 15:33:13 | WARN | DefaultErrorHandler | Failed delivery for (MessageId: XXX on ExchangeId: XXX). On delivery attempt: 0 caught: java.net.SocketTimeoutException: connect timed out
2019-10-24 12:46:53 | WARN | DefaultErrorHandler | Failed delivery for (MessageId: XXX on ExchangeId: XXX). On delivery attempt: 0 caught: com.google.api.client.googleapis.json.GoogleJsonResponseException: 502 Bad Gateway
2019-10-25 12:33:33 | WARN | DefaultErrorHandler | Failed delivery for (MessageId: XXX on ExchangeId: XXX). On delivery attempt: 0 caught: com.google.api.client.googleapis.json.GoogleJsonResponseException: 500 Internal Server Error
Questions
Is my use of the onException element generally/syntactically correct (barring the fine-tuning of the redeliveryPolicy attributes)? Or have I missed anything else?
My first warning message of interest says, "On delivery attempt: 0 caught: java.net.SocketTimeoutException". My log file does not have "On delivery attempt: 1", On delivery attempt: 2", etc. Does this mean that subsequent delivery attempts of the given message were successful ?
As far as trying to stream data into GCP is concerned, should I treat the "SocketTimeoutException" "500 Internal Server Error" and "502 Bad Gateway" differently from each other or is using the same onException + redeliveryPolicy OK ?
Are there any other ways I can improve the performance of this Camel / Google API method of streaming data into GCP ? Can the Camel / Google API support message batching in order to reduce the number of GCP insert operations ? I'm already using dual streams with deduplication (CamelGoogleBigQueryInsertId).
Disclaimer: I don't have experience in using Camel BigQuery API. My answer is based on observation and understanding of BigQuery API in general.
Based on the observation that retriesExhaustedLogLevel="ERROR" is set: if no ERROR log is present, it probably means the retry succeeded.
Retrying on timeout/500/502 can be handled the same way; at least I'm not aware of how they could usefully be treated differently.
Batching will definitely help, based on public documentation:
Maximum rows per request: 10,000 rows per request
A maximum of 500 rows is recommended. Batching can increase
performance and throughput to a point, but at the cost of per-request
latency. Too few rows per request and the overhead of each request can
make ingestion inefficient. Too many rows per request and the
throughput may drop.
A maximum of 500 rows per request is recommended, but experimentation with representative data (schema and data sizes) will help you determine the ideal batch size.
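By way of illustration (a sketch only, not something from the original answer): with Camel's aggregate EIP you could group rows before handing them to the BigQuery endpoint, assuming each incoming message body is already a Map of row data and that the camel-google-bigquery producer accepts a list of such maps as a single insert request. The groupRows bean, the 500-row completion size and the 2-second timeout are illustrative values.
<!-- collects individual bodies into a java.util.List (declare outside the camelContext) -->
<bean id="groupRows" class="org.apache.camel.processor.aggregate.GroupedBodyAggregationStrategy" />

<route>
    <from uri="jms:my.amq.queue.of.output.data.for.gcp?acknowledgementModeName=DUPS_OK_ACKNOWLEDGE&amp;concurrentConsumers=20" />
    <!-- batch up to 500 row maps, or whatever has arrived within 2 seconds -->
    <aggregate strategyRef="groupRows" completionSize="500" completionTimeout="2000">
        <correlationExpression>
            <constant>true</constant>
        </correlationExpression>
        <!-- one BigQuery insert per aggregated batch instead of one per message -->
        <to uri="gcp:my_gcp_project:my_bq_data_set:my_bq_table" />
    </aggregate>
</route>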

RabbitMQ Ack Timeout

I'm using the RPC pattern to process my objects with RabbitMQ.
To clarify: I have an object to process, and I want the processing to finish before the ack is sent back to the RPC client.
The ack has a default timeout of about 3 minutes.
My processing takes a long time.
How can I change this ack timeout for each object, or what should I do to handle long-running processing like this?
Modern versions of RabbitMQ have a delivery acknowledgement timeout:
In modern RabbitMQ versions, a timeout is enforced on consumer delivery acknowledgement. This helps detect buggy (stuck) consumers that never acknowledge deliveries. Such consumers can affect node's on disk data compaction and potentially drive nodes out of disk space.
If a consumer does not ack its delivery for more than the timeout value (30 minutes by default), its channel will be closed with a PRECONDITION_FAILED channel exception. The error will be logged by the node that the consumer was connected to.
Error message will be:
Channel error on connection <####> :
operation none caused a channel exception precondition_failed: consumer ack timed out on channel 1
The timeout is 30 minutes (1,800,000 ms) by default (see note 1) and is configured by the consumer_timeout parameter in rabbitmq.conf.
note 1: Timeout was 15 minutes (900,000ms) before RabbitMQ 3.8.17.
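For example, to raise it to one hour, rabbitmq.conf would contain the value in milliseconds (the one-hour figure is purely illustrative):
# allow consumers up to 1 hour to acknowledge a delivery (value in milliseconds)
consumer_timeout = 3600000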
If you run RabbitMQ in Docker, you can declare a volume for the rabbitmq.conf file, create that file inside the volume, and set consumer_timeout in it.
For example, with docker compose:
version: "2.4"
services:
rabbitmq:
image: rabbitmq:3.9.13-management-alpine
network_mode: host
container_name: 'you name'
ports:
- 5672:5672
- 15672:15672 ----- if you use gui for rabbit
volumes:
- /etc/rabbitmq/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf
And you need to create the rabbitmq.conf file on your server under /etc/rabbitmq/.
Documentation with the available parameters: https://github.com/rabbitmq/rabbitmq-server/blob/v3.8.x/deps/rabbit/docs/rabbitmq.conf.example

Why does an ActiveMQ cluster fail with "server null" when the Zookeeper master node goes offline?

I have encountered an issue with ActiveMQ where the entire cluster will fail when the master Zookeeper node goes offline.
We have a 3-node ActiveMQ cluster setup in our development environment. Each node has ActiveMQ 5.12.0 and Zookeeper 3.4.6 (*note, we have done some testing with Zookeeper 3.4.7, but this has failed to resolve the issue. Time constraints have so far prevented us from testing ActiveMQ 5.13).
What we have found is that when we stop the master ZooKeeper process (via the "end process tree" command in Task Manager), the remaining two ZooKeeper nodes continue to function as normal. Sometimes the ActiveMQ cluster is able to handle this, but sometimes it does not.
When the cluster fails, we typically see this in the ActiveMQ log:
2015-12-18 09:08:45,157 | WARN | Too many cluster members are connected. Expected at most 3 members but there are 4 connected. | org.apache.activemq.leveldb.replicated.MasterElector | WrapperSimpleAppMain-EventThread
...
...
2015-12-18 09:27:09,722 | WARN | Session 0x351b43b4a560016 for server null, unexpected error, closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn | WrapperSimpleAppMain-SendThread(192.168.0.10:2181)
java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)[:1.7.0_79]
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)[:1.7.0_79]
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)[zookeeper-3.4.6.jar:3.4.6-1569965]
We were immediately concerned by the fact that (A) ActiveMQ seems to think there are four members in the cluster when it is only configured with 3, and (B) when the exception is raised, the server appears to be null. We then increased ActiveMQ's logging level to DEBUG in order to display the list of members:
2015-12-18 09:33:04,236 | DEBUG | ZooKeeper group changed: Map(localhost -> ListBuffer((0000000156,{"id":"localhost","container":null,"address":null,"position":-1,"weight":5,"elected":null}), (0000000157,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":null}), (0000000158,{"id":"localhost","container":null,"address":"tcp://192.168.0.11:61619","position":-1,"weight":10,"elected":null}), (0000000159,{"id":"localhost","container":null,"address":null,"position":-1,"weight":10,"elected":null}))) | org.apache.activemq.leveldb.replicated.MasterElector | ActiveMQ BrokerService[localhost] Task-14
Can anyone suggest why this may be happening and/or suggest a way to resolve this? Our configurations are shown below:
ZooKeeper:
tickTime=2000
dataDir=C:\\zookeeper-3.4.7\\data
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.0.10:2888:3888
server.2=192.168.0.11:2888:3888
server.3=192.168.0.12:2888:3888
ActiveMQ (server.1):
<persistenceAdapter>
<replicatedLevelDB
directory="activemq-data"
replicas="3"
bind="tcp://0.0.0.0:61619"
zkAddress="192.168.0.11:2181,192.168.0.10:2181,192.168.0.12:2181"
zkPath="/activemq/leveldb-stores"
hostname="192.168.0.10"
weight="5"/>
<!-- server.2 has a weight of 10, server.3 has a weight of 1 -->
</persistenceAdapter>

ActiveMQ broker redelivery vs consumer redelivery

I'm trying to understand the difference between ActiveMQ's redeliveryPlugin and the consumer's own attempts to receive a message before it marks it as a poison pill. What's the difference? The documentation has this example:
<broker xmlns="http://activemq.apache.org/schema/core" schedulerSupport="true" >
....
<plugins>
<redeliveryPlugin fallbackToDeadLetter="true" sendToDlqIfMaxRetriesExceeded="true">
<redeliveryPolicyMap>
<redeliveryPolicyMap>
<redeliveryPolicyEntries>
<!-- a destination specific policy -->
<redeliveryPolicy queue="SpecialQueue" maximumRedeliveries="4"
redeliveryDelay="10000" />
</redeliveryPolicyEntries>
<!-- the fallback policy for all other destinations -->
<defaultEntry>
<redeliveryPolicy maximumRedeliveries="4" initialRedeliveryDelay="5000"
redeliveryDelay="10000" />
</defaultEntry>
</redeliveryPolicyMap>
</redeliveryPolicyMap>
</redeliveryPlugin>
</plugins>
Now, I understand the broker's redelivery system as separate from the client's. For instance, after making 6 attempts (by default) to acknowledge a message (CLIENT_ACKNOWLEDGE mode), the consumer sends a poison ack. So, is it true that after receiving the poison ack the broker will try to resend the message to the consumer, which will make another 6 attempts?
In total we may then have 4 x 6 = 24 attempts before the message is sent to a DLQ.
Is my understanding correct?
Yes. The broker is not aware of any client redelivery. That happens in "the driver" - in memory. The broker won't consider if the client has already retried or not. The result is nested retries which is good to be aware of.
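For reference, a minimal sketch of where the client-side counterpart is configured, assuming a Spring-managed connection factory (the bean id, broker URL and delay value are placeholders; 6 mirrors the client default mentioned in the question):
<bean id="connectionFactory" class="org.apache.activemq.ActiveMQConnectionFactory">
    <property name="brokerURL" value="tcp://localhost:61616" />
    <!-- in-memory, "driver"-level redelivery; independent of the broker's redeliveryPlugin above -->
    <property name="redeliveryPolicy">
        <bean class="org.apache.activemq.RedeliveryPolicy">
            <property name="maximumRedeliveries" value="6" />
            <property name="initialRedeliveryDelay" value="1000" />
        </bean>
    </property>
</bean>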

activeMQ master/slave cluster with zookeeper

I have my activeMQ connected to zookeeper (a cluster of 5 zookeepers), in the config file "activemq.xml", I have
<persistenceAdapter>
<replicatedLevelDB
directory="${activemq.data}/leveldb"
replicas="3"
bind="tcp://0.0.0.0:0"
zkAddress="blablabla:2181"
zkPassword="password"
zkPath="/activemq/leveldb-stores"
hostname="blabla"
/>
</persistenceAdapter>
Now I have activeMQ-server1 started and it successfully becomes the master; activeMQ-server2, with the same activemq.xml config file, successfully becomes a slave; activeMQ-server3, with the same activemq.xml config file, successfully becomes a slave but kicks out activeMQ-server2 (which starts to give connection errors).
I think I put the wrong number for replicas; I changed all 3 config files to replicas="4", but it still doesn't work.
What would be the correct replicas number with 3 ActiveMQ servers, or am I wrong about some other part? (I only have 1 ZooKeeper listed in the config, since the 5 ZooKeepers can connect to each other and already form a cluster.)
Thanks :)
You need to list all ZooKeeper servers in the zkAddress portion: zkAddress="zoo1.example.org:2181,zoo2.example.org:2181,zoo3.example.org:2181" (taken from the ActiveMQ Replicated LevelDB documentation).
The replicas value is the number of activemq nodes, not the number of zookeeper nodes. So if you have 3 amq nodes, set replicas="3", not more. http://activemq.apache.org/replicated-leveldb-store.html :
Replicas property :
The number of nodes that will exist in the cluster.
At least (replicas/2)+1 nodes must be online to avoid service outage.
Another thing: all AMQ nodes in the cluster must have the same broker name (MyBroker below):
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="MyBroker" dataDirectory="${activemq.data}">
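Putting those two points together, each broker's persistenceAdapter might look roughly like this (a sketch; the zoo*/amq1 hostnames are placeholders, and hostname should be each broker's own address):
<persistenceAdapter>
    <replicatedLevelDB
        directory="${activemq.data}/leveldb"
        replicas="3"
        bind="tcp://0.0.0.0:0"
        zkAddress="zoo1.example.org:2181,zoo2.example.org:2181,zoo3.example.org:2181,zoo4.example.org:2181,zoo5.example.org:2181"
        zkPassword="password"
        zkPath="/activemq/leveldb-stores"
        hostname="amq1.example.org" />
</persistenceAdapter>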