We have a HA RabbitMQ cluster (v3.2.x) with two nodes that sits behind a load-balancer. Our clients are configured to use a 300s heartbeat. Everything works as expected most of the time.
However, if the client's connection drops (say the client's NIC is disconnected), we have noticed (via tcpdump/Wireshark) that the RabbitMQ node sends 3 heartbeat messages (nearly 15 minutes in our case) before it closes the connection. Why? Why not close it after one failure?
Is there some way to change this behavior on the RabbitMQ server, or do we have to shorten our heartbeat to something much smaller, like 5s or 10s, to get the connection to close sooner?
Related issue...
Looking at the TCPDump (captured on load-balancer), I wonder why the LB doesn't close the connection when it doesn't receive the TCP-ACK from the dead client in response to the proxied RabbitMQ server heartbeat request? In fact, the LB will attempt to send the request several times (never receiving a response, of course). Wouldn't it make sense for the LB to make the assumption the connection has been dropped and close the entire session (including the connection to RabbitMQ node)?
It appears as though RabbitMQ is configured to tolerate two missed heartbeats before it terminates the connection. However, it waits until the next heartbeat would need to be sent before it drops the connection. That is what gives it the appearance of requiring 3 missed heartbeats.
Heartbeat 1 (no response) -> wait -> Heartbeat 2 (no response) -> wait -> Heartbeat 3 -> terminate
There is a slight bug in RabbitMQ (it sends a 3rd heartbeat but immediately terminates the connection), but it isn't really affecting anything.
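To put numbers on the sequence above, here is a small sketch of the arithmetic. It only models the behaviour as described in this answer (two tolerated misses, termination when the third heartbeat is due); the function names are made up for illustration and it is not RabbitMQ's actual implementation.

```python
def seconds_until_close(heartbeat_interval_s, tolerated_misses=2):
    """Seconds from the last traffic until the broker drops the connection,
    assuming it terminates when heartbeat number (tolerated_misses + 1) is due."""
    return heartbeat_interval_s * (tolerated_misses + 1)


def timeline(heartbeat_interval_s, tolerated_misses=2):
    """List (time_s, event, outcome) tuples for a client that never answers."""
    last = tolerated_misses + 1
    events = []
    for n in range(1, last + 1):
        outcome = "terminate" if n == last else "no response, wait"
        events.append((n * heartbeat_interval_s, "heartbeat %d" % n, outcome))
    return events


if __name__ == "__main__":
    for heartbeat in (300, 10):
        print(heartbeat, "s heartbeat -> closed after",
              seconds_until_close(heartbeat), "s")
        for event in timeline(heartbeat):
            print("   ", event)
```

With a 300s heartbeat this gives 900s, which matches the roughly 15 minutes seen in the capture. With a 10s heartbeat the same model gives 30s.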
Related
I start a client that connects to a remote server.
The work involves a lot of computation.
Then the client is accidentally disconnected.
However, the client's computation is still running on the remote server.
Is there a way to stop it?
It will eventually happen when the socket timeout is reached, I guess.
I have some images in my queue. I pass each image to my Flask server, where the image processing is done, and the response comes back to my RabbitMQ consumer. After receiving the response, I get this error: "pika.exceptions.StreamLostError: Stream connection lost(104,'Connection reset by peer')". This happens when the RabbitMQ channel starts consuming again. I don't understand why this happens. I would also like to restart the consumer automatically if this error persists. Is there any way to do that?
Your consumer is probably taking too long to finish processing and send the ack/nack to the server. While it is busy, the client does not send heartbeats, so the server stops receiving them and closes the connection. On the client side you then receive:
pika.exceptions.StreamLostError: Stream connection lost(104,'Connection reset by peer')
You should check the server logs as well. They probably contain something like this:
missed heartbeats from client, timeout: 60s
See this issue for more information.
Do your work on another thread. See this code as an example -
https://github.com/pika/pika/blob/master/examples/basic_consumer_threaded.py
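In case the link goes stale, here is a minimal sketch of the same idea, assuming pika 1.x and a BlockingConnection. The queue name "images", the host, and process_image are placeholders I made up. The point is that the slow work runs on a worker thread, and the ack is handed back to the connection's own thread with add_callback_threadsafe, so the connection stays free to service heartbeats.

```python
import functools
import threading


def process_image(body):
    """Placeholder for the slow work (e.g. the call to the Flask server)."""
    return len(body)


def ack_message(channel, delivery_tag):
    # Runs on the connection's thread; skip the ack if the channel has gone away.
    if channel.is_open:
        channel.basic_ack(delivery_tag=delivery_tag)


def do_work(connection, channel, delivery_tag, body, work):
    # Runs on a worker thread, so the connection thread is not blocked.
    work(body)
    # pika channels are not thread-safe: schedule the ack on the connection thread.
    connection.add_callback_threadsafe(
        functools.partial(ack_message, channel, delivery_tag))


def on_message(channel, method, properties, body,
               connection, threads, work=process_image):
    worker = threading.Thread(
        target=do_work,
        args=(connection, channel, method.delivery_tag, body, work))
    worker.start()
    threads.append(worker)


def main():
    import pika  # third-party; imported here so the helpers above load without it

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="localhost", heartbeat=60))  # assumed values
    channel = connection.channel()
    channel.basic_qos(prefetch_count=1)  # one unacked message at a time
    threads = []
    callback = functools.partial(
        on_message, connection=connection, threads=threads)
    channel.basic_consume(queue="images", on_message_callback=callback)
    try:
        channel.start_consuming()
    except KeyboardInterrupt:
        channel.stop_consuming()
    for worker in threads:
        worker.join()
    connection.close()
```

Call main() to start consuming. The prefetch of 1 is just a conservative choice for the sketch; raise it if you want several images in flight at once.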
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
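You can change the heartbeat timeout by setting heartbeat in ConnectionParameters:
connection_params = pika.ConnectionParameters(heartbeat=10)
where the number is in seconds. For example, this sets the AMQP heartbeat timeout to 10 seconds. Note that heartbeats work at the protocol level and are separate from TCP keepalives, although they serve a similar purpose.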
More information https://www.rabbitmq.com/heartbeats.html and https://www.rabbitmq.com/heartbeats.html#tcp-keepalives
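On the "restart automatically" part of the question, one option is a plain retry loop around whatever function opens the connection and consumes. This is only a sketch: run_with_restart and its parameters are names I made up, and it uses Python's built-in ConnectionError as the default so it runs without pika installed. With pika you would pass retry_on=(pika.exceptions.AMQPConnectionError,), which as far as I know is the parent class of StreamLostError.

```python
import time


def run_with_restart(consume, retry_on=(ConnectionError,),
                     max_restarts=5, delay_s=1.0, sleep=time.sleep):
    """Call consume() and call it again if it fails with a connection error.

    consume should open a fresh connection each time it is called.
    Gives up and re-raises after max_restarts restarts.
    """
    restarts = 0
    while True:
        try:
            return consume()
        except retry_on:
            if restarts >= max_restarts:
                raise
            restarts += 1
            sleep(delay_s)  # brief pause so a dead broker is not hammered


if __name__ == "__main__":
    attempts = []

    def flaky_consume():
        attempts.append(1)
        if len(attempts) < 3:
            raise ConnectionResetError(104, "Connection reset by peer")
        return "done"

    print(run_with_restart(flaky_consume, delay_s=0.1), "after",
          len(attempts), "attempts")
```

Restarting only hides the symptom, though. If the cause is a blocked connection thread, the error will keep coming back until the work is moved off that thread.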
UPDATE - apparently a tcp closure
I see on rabbit server:
=ERROR REPORT==== 24-Jan-2015::03:22:00 ===
closing AMQP connection <0.1070.22> (209.151.226.37:38040 -> 192.168.80.81:5672):
{inet_error,etimedout}
This connection appears alive on my app's side. How can I prevent this? The TCP keepalive parameters look OK.
I have two apps.
One, "processor", consumes jobs from a queue and sends replies to a response queue.
The other, "responder" consumes from this response queue and talks to a database.
I had some replies which apparently made it into the response queue because upon restart of the responder they were handled and database updated appropriately. But before that restart where were they?
How can I pinpoint why they weren't PREVIOUSLY handled? That responder seems to have been running fine.
In the responder I do
res = amqp_consume_message(Cx->conn, genvelope, &tqb, 0);
I ack ( not multiple ) after replying to the database.
I have prefetch at 11.
The processor was closed and restarted a few times during this FWIW. Also the processor is the one that establishes the exchange used for the replies; the responder connects to it.
I have the management url up.
I saw no indication that the replies were available from the consume(), which makes sense since the database wasn't updated. The processor did do its processing and put a reply in the response queue according to logs.
In separate testing I saw that messages in the reply aren't destroyed by restarting the processor - reply exchange is durable.
The apps generally work.
Any debugging suggestions or conceptual info that might be relevant would be appreciated.
I opened 2 TCP connections:
1. A normal connection (while implementing an echo server and client), and
2. An HTTP connection.
I opened the HTTP connection with a modified curl utility, with Apache running as the server; curl does not send the GET request for some time after the connection is established.
For the normal connection, after the connection is established the server waits for a request from the client.
But strangely, for the HTTP connection, if the GET request does not arrive from the client for some time after the connection is established, the server sends a FIN packet to the client and closes its side of the connection.
Is it mandatory for an HTTP client to send the GET request immediately after the initial connection?
Apache has a parameter called Timeout.
Its manual page (Apache Core - Timeout Directive) states:
The TimeOut directive defines the length of time Apache will wait for I/O in various circumstances:
1. When reading data from the client, the length of time to wait for a TCP packet to arrive if the read buffer is empty.
2. When writing data to the client, the length of time to wait for an acknowledgement of a packet if the send buffer is full.
3. In mod_cgi, the length of time to wait for output from a CGI script.
4. In mod_ext_filter, the length of time to wait for output from a filtering process.
5. In mod_proxy, the default timeout value if ProxyTimeout is not configured.
I think you fall into the first case.
EDIT
I was looking through the W3 HTTP document and found no reference to timeouts,
but in chapter 8 (Connections) I found:
8.1.4 Practical Considerations
Servers will usually have some time-out value beyond which they will no longer maintain an inactive connection. (...) The use of persistent connections places no requirements on the length (or existence) of this time-out for either the client or the server.
That sounds to me like "every server or client is free to choose its own behaviour regarding inactive connection timeouts".
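You can reproduce the same effect without Apache. The sketch below is a toy server on the loopback interface that applies a read timeout to a freshly accepted connection, with a client that connects and never sends a request. The 0.2s timeout and the function names are arbitrary choices for the demo, not anything taken from Apache.

```python
import socket
import threading


def serve_once(listener, idle_timeout_s):
    """Accept one connection and wait at most idle_timeout_s for a request."""
    conn, _ = listener.accept()
    with conn:  # closing the socket is what sends the FIN
        conn.settimeout(idle_timeout_s)
        try:
            data = conn.recv(1024)
        except socket.timeout:
            return "timed out"
        return "got %d bytes" % len(data)


def demo(idle_timeout_s=0.2):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    listener.listen(1)
    result = {}
    server = threading.Thread(
        target=lambda: result.update(
            server=serve_once(listener, idle_timeout_s)))
    server.start()

    client = socket.create_connection(listener.getsockname())
    client.settimeout(5)
    # The client sends nothing. recv() returns b"" once the server closes.
    received = client.recv(1024)

    server.join()
    client.close()
    listener.close()
    return result["server"], received


if __name__ == "__main__":
    print(demo())
```

The client sees an empty read, which is how the server's FIN shows up at the socket API. So the close comes from the server's own idle policy, not from anything HTTP requires of the client.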
Let's say I have the following ActiveMQ connection string:
failover:(tcp://broker1:61616,tcp://broker2:61616)?randomize=true
I am sending a few thousand requests to the brokers from a Java producer that has this configuration.
Sometimes I notice that all messages end up going to just one broker, with the other not receiving a single message.
Is this normal behavior?
Out of 10 tests I ran, I noticed this behavior a couple of times. At other times both brokers received messages.
How does randomize=true work?
The only explanation I found on http://activemq.apache.org/failover-transport-reference.html is: "use a random algorithm to choose the the URI to use for reconnect from the list provided"
The randomize flag on the failover transport indicates that the transport should choose at random one of the configured broker URIs to connect to (in your case there are two to choose from). Once a client is connected to one of those brokers, it will remain happily connected and send messages only to that broker until something happens to interrupt the connection. Once the connection is interrupted, the client will again attempt to connect to one of those two brokers. So in your case, a single producer sending all its messages to one broker means it's working just as expected.
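A toy model may make this clearer. The sketch below is not ActiveMQ code, just a simulation of the behaviour described above, with made-up function names: the random choice happens once per connection attempt, not once per message.

```python
import random
from collections import Counter

URIS = ["tcp://broker1:61616", "tcp://broker2:61616"]


def send_all(uris, n_messages, rng, randomize=True):
    """Model one producer: pick a broker at connect time, then send everything to it."""
    connected_to = rng.choice(uris) if randomize else uris[0]
    received = Counter()
    for _ in range(n_messages):
        received[connected_to] += 1  # no reconnect happens, so the broker never changes
    return received


def brokers_chosen(uris, n_connections, rng):
    """Model many separate connections and report which brokers were picked."""
    return {next(iter(send_all(uris, 1, rng))) for _ in range(n_connections)}


if __name__ == "__main__":
    rng = random.Random()
    print(send_all(URIS, 3000, rng))              # every message on one broker
    print(sorted(brokers_chosen(URIS, 200, rng)))
```

One producer connection puts all 3000 messages on a single broker. The load only spreads out across many independent connections or after reconnects.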