What is a reasonable value for heartbeat in RabbitMQ? - rabbitmq

RabbitMQ allows you to "heartbeat" a connection, i.e. from time to time the client and the server check (using empty messages) that the other party is still there and available. So far, so good.
Unfortunately, I was not able to find a place in the documentation where a suggestion is made what a reasonable value for this is. I know that you need to specify the heartbeat in seconds, but what is a real-world best practice value?
Obviously, it should not be too often (traffic), but also not too rare (proxies, …). Any suggestions?
Is 15 seconds fine? 30? 60? …?

This answer if for RabbitMQ < 3.5.5, for newer versions see the answer from #bmaupin.
It depends on your application needs. Out of the box it is 10 min for RabbitMQ. If you fail to ack heartbeat twice (20min of inactivity), connection will be closed immediately without sending any connection.close method or any error from the broker side.
The case to use heartbeat is firewalls that closes inactive for a long time connection or some other network settings that doesn't allow you to have waiting connections.
In fact, hearbeat is not a must, from RabbitMQ config doc
heartbeat
Value representing the heartbeat delay, in seconds, that the server sends in the connection.tune frame. If set to 0, heartbeats are disabled. Clients might not follow the server suggestion, see the AMQP reference for more details. Disabling heartbeats might improve performance in situations with a great number of connections, but might lead to connections dropping in the presence of network devices that close inactive connections.
Default: 580
Note, that having hearbeat interval too short may result in significant network overhead. Keep in mind, that hearbeat frames are sent when there are no other activity on the connection for a hearbeat time interval.

The RabbitMQ documentation now provides a recommended heartbeat timeout value between 5 and 20 seconds:
Setting heartbeat timeout value too low can lead to false positives (peer being considered unavailable while it really isn't the case) due to transient network congestion, short-lived server flow control, and so on. This should be taken into consideration when picking a timeout value.
Several years worth of feedback from the users and client library maintainers suggest that values lower than 5 seconds are fairly likely to cause false positives, and values of 1 second or lower are very likely to do so. Values within the 5 to 20 seconds range are optimal for most environments.
Source: https://www.rabbitmq.com/heartbeats.html#false-positives
In addition, as of RabbitMQ 3.5.5 the default heartbeat timeout value is 60 seconds (https://www.rabbitmq.com/heartbeats.html#heartbeats-timeout)

Related

BITMEX Websocket connection drops / disconnects despite keep alive burning through connection pool

Every once in a while Bitmex disconnects our websocket connection which forces us to reconnect. However, they provide a connection pool of 40 connections per hour. In times of low volatility it seems not to be a problem AT ALL, however as soon as trading activity goes up, we are running through these 40 connections in no time leaving our connection dead eventually.
We do have a keep-alive but it does not solve the problem at all.
We haven’t seen any specifics on the API documentation regarding how to deal with this problem, or the specific reasons we get so many close opcodes whenever the volatility raises
Does anyone know if we are doing something wrong?
EDIT: heartbeat is also in place
I suggest implementing heartbeats as per https://www.bitmex.com/app/wsAPI#Heartbeats
In general, WebSocket connections can drop if the connection remains idle for too long without transmitting any data.

Large RabbitMQ message in Slow network

I am using RabbitMQ with Spring AMQP
large message (>100MB, 102400KB)
small bandwidth (<512Kbps)
low heartbeat interval (10 seconds)
single broker
It will take >= 200*8 seconds to consume the message, which is more than my heartbeat interval. From https://stackoverflow.com/a/42363685/418439
If the message transfer time between nodes (60seconds?) > heartbeat time between nodes, it will cause the cluster to disconnect and the loose the message
Will I also face the disconnection issue even I am using single broker?
Does the heartbeat and consumer using the same thread, where if
consumer is consuming, it is not possible to perform heartbeat?
If so, what can I do to consume the message, without increase heartbeat interval or reduce my message size?
Update:
I have received another answer and comments after I posted my own answer. Thanks for the feedback. Just to clarify, I do not use AMQP for file transfer. Actually the data is in JSON message, some are simple and small but some contain complex information, include some free hand drawing. Besides saving the data at Data Center, we also save a copy of message at branch level via AMQP, for case connectivity to Data Center is not available.
So, the real questions here are a bit more fundamental, and those are: (1) is it appropriate to perform a large file transfer via AMQP, and (2) what purpose does the heartbeat serve?
Heartbeats
First off, let's address the heartbeat question. As the RabbitMQ documentation clearly states, the purpose of the heartbeat is "to ensure that the application layer promptly finds out about disrupted connections."
The reason for this is simple. In an ordinary AMQP usage, there may be several seconds, even minutes between the arrival of successive messages. Without data being exchanged across a TCP session, many firewalls and other networking equipment automatically close ports to lower exposure to the enterprise network. Heartbeats further help mitigate a fundamental weakness in TCP, which is the difficulty of detecting a dropped connection. Networks experience failure, and TCP is not always able to detect that on its own.
So, the bottom line here is that, while you're transferring a large message, the connection is active and the heartbeat function serves no useful purpose, and can cause you trouble. It's best to turn it off in such cases.
AMQP For Moving Large Files?
The second issue, and I believe more important question, is how should large files be dealt with. To answer this, let's first consider what a message queue does: sending messages -- small bits of data which communicate something to another computer system. The operative word here is small. Messages typically contain one of three things: 1. commands (go do something), 2. events (something happened), 3. requests (give me some data), and 4. responses (here is your data). A full discussion on these is beyond the scope, but suffice it to say that each of these can generally be composed of a small message less than 100kB.
Indeed, the AMQP protocol, which underlies RabbitMQ, is a fairly chatty protocol. It requires large messages be divided into multiple segments of no more than 131kB. This can add a significant amount of overhead to a large file transfer, especially when compared to other file transfer mechanisms (FTP, for instance). Secondly, the message has to be fully processed by the broker before it is made available in a queue, and it ties up valuable resources on the broker while this is being done. For one, the whole message must fit into RAM on the broker due to its architecture. This solution may work for one client and one broker, but it will break quickly when scaling out is attempted.
Finally, compression is often desirable when transferring files - HTTP supports gzip compression automatcially. AMQP does not. It is quite common in message-oriented applications to send a message containing a resource locator (e.g. URL) pointing to the larger data file, which is then accessed via appropriate means.
The moral of the story
As the adage goes: "to the man with a hammer, everything looks like a nail." AMQP is not a hammer- it's a precision scalpel. It has a very specific purpose, and narrow applicability within that purpose. Using it for something other than its intended purpose will lead to stability and reliability problems in whatever it is you are designing, and overall dissatisfaction with your end product.
Will I also face the disconnection issue even I am using single
broker?
Yes
Does the heartbeat and consumer use the same thread, where
if consumer is consuming, it is not possible to perform heartbeat?
Can't confirm the thread, but from what I observe when Java RabbitMQ consumer consumes a message, it won't perform heartbeat acknowledgement. If the time to consume longer than 3 x heartbeat timeout timer (due to large message and/or low bandwidth), MQ server will close AMQP connection.
If so, what can I do to consume the message, without increase
heartbeat interval or reduce my message size?
I resolved my issue by increasing heartbeat size. No further code change is required.

RabbitMQ consumer crash and consumer-count

If a consumer of a RabbitMQ crashes, with no graceful disconnection, will a subsequent declare-ok request fired several milliseconds later report a diminished consumer-count? Or is there an amount of time that needs to pass before the reported number will change?
declare-ok count all known consumers regardless their actual state.
See, in fact, some time after connection get dangled it still marked as alive (exact time depends on OS settings and whether do you use heartbeats and whether there are any network operation over that connection). In RabbitMQ management panel you may see connection and it channels with consumer tags listed some time after connection died.

Setting a long timeout for RabbitMQ ack message

I was wondering if this is possible. I want to pull a task from a queue and have some work that could potentially take anywhere from 3 seconds or longer (possibly) minutes before an ack is sent back to RabbitMQ notifying that the work has been completed. The work is done by a user, hence this is why the time it takes to process the job varies.
I don't want to ack the message immediately after I pop off the queue because I want the message to be requeued if no ack is received. Can anyone give me any insights into how to solve my problem?
Having a long timeout should be fine, and certainly as you say you want redelivery if something goes wrong, so you want to only ack after you finish.
The best way to achieve that, IMO, would be to have multiple consumers on the queue (i.e. multiple threads/processes consuming from the same queue). That should be fine as long as there's no particular ordering constraint on your queue contents (i.e. the way there might be if the queue were to contain contents representing Postgres data that involves FK constraints).
This tutorial on the RabbitMQ website provides more info (Python linked, but there should be similar tutorials for other languages): https://www.rabbitmq.com/tutorials/tutorial-two-python.html
Edit in response to comment from OP:
What's your heartbeat set to? If your worker doesn't acknowledge the heartbeat within the set period of time, the server will consider the connection to be dead.
Not sure which language you're using, but for Java you would use the setRequestedHeartbeat method to specify the heartbeat.
The way you implement your workers, it's vital that the heartbeat can still be sent back to the RabbitMQ server. If something blocks the client from sending the heartbeat, the server will kill the connection after the time interval expires.

How to prevent an I/O Completion Port from blocking when completion packets are available?

I have a server application that uses Microsoft's I/O Completion Port (IOCP) mechanism to manage asynchronous network socket communication. In general, this IOCP approach has performed very well in my environment. However, I have encountered an edge case scenario for which I am seeking guidance:
For the purposes of testing, my server application is streaming data (lets say ~400 KB/sec) over a gigabit LAN to a single client. All is well...until I disconnect the client's Ethernet cable from the LAN. Disconnecting the cable in this manner prevents the server from immediately detecting that the client has disappeared (i.e. the client's TCP network stack does not send notification of the connection's termination to the server)
Meanwhile, the server continues to make WSASend calls to the client...and being that these calls are asynchronous, they appear to "succeed" (i.e. the data is buffered by the OS in the outbound queue for the socket).
While this is all happening, I have 16 threads blocked on GetQueuedCompletionStatus, waiting to retrieve completion packets from the port as they become available. Prior to disconnecting the client's cable, there was a constant stream of completion packets. Now, everything (as expected) seems to have come to a halt...for about 32 seconds. After 32 seconds, IOCP springs back into action returning FALSE with a non-null lpOverlapped value. GetLastError returns 121 (The semaphore timeout period has expired.) I can only assume that error 121 is an artifact of WSASend finally timing out after the TCP stack determined the client was gone?
I'm fine with the network stack taking 32 seconds to figure out my client is gone. The problem is that while the system is making this determination, my IOCP is paralyzed. For example, WSAAccept events that post to the same IOCP are not handled by any of the 16 threads blocked on GetQueuedCompletionStatus until the failed completion packet (indicating error 121) is received.
My initial plan to work around this involved using WSAWaitForMultipleEvents immediately after calling WSASend. If the socket event wasn't signaled within (e.g. 3 seconds), then I terminate the socket connection and move on (in hopes of preventing the extensive blocking effect on my IOCP). Unfortunately, WSAWaitForMultipleEvents never seems to encounter a timeout (so maybe asynchronous sockets are signaled by virtue of being asynchronous? Or copying data to the TCP queue qualifies for a signal?)
I'm still trying to sort this all out, but was hoping someone had some insights as to how to prevent the IOCP hang.
Other details: My server application is running on Win7 with 8 cores; IOCP is configured to use at most 8 concurrent threads; my thread pool has 16 threads. Plenty of RAM, processor and bandwidth.
Thanks in advance for your suggestions and advice.
It's usual for the WSASend() completions to stall in this situation. You won't get them until the TCP stack times out its resend attempts and completes all of the outstanding sends in error. This doesn't block any other operations. I expect you are either testing incorrectly or have a bug in your code.
Note that your 'fix' is flawed. You could see this 'delayed send completion' situation at any point during a normal connection if the sender is sending faster than the consumer can consume. See this article on TCP flow control and async writes. A better plan is to use a counter for the amount of oustanding writes (per connection) that you want to allow and stop sending if that counter gets reached and then resume when it drops below a 'low water mark' threshold value.
Note that if you've pulled out the network cable into the machine how do you expect any other operations to complete? Reads will just sit there and only fail once a write has failed and AcceptEx will simply sit there and wait for the condition to rectify itself.