How is the heartbeat negotiated? - rabbitmq

I'm trying to determine how the heartbeat is negotiated, if at all. It is, after all, called the "requested" heartbeat. For example, if the server heartbeat is set lower than the client heartbeat, the client library should use the minimum, right? Otherwise, all connections will be closed by the server.
The connection documentation doesn't really make it clear and I'm not finding much by searching the code (no conclusive usages of ConnectionConfiguration.RequestedHeartbeat).
The server documentation says
This value is negotiated between the client and RabbitMQ server at the time of connection. The client must be configured to request heartbeats... (although the client can still veto them).
The "official" .NET client library uses the lower of the client and server heartbeat values:
Math.Min(clientValue, serverValue);
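As a rough illustration of that rule, here is a minimal Python sketch. The function name is made up for this example. The handling of 0 (heartbeats disabled) follows my reading of the RabbitMQ heartbeat guide and is an assumption here, not something stated in the question or answer above.

```python
def negotiate_heartbeat(client_value: int, server_value: int) -> int:
    """Return the heartbeat timeout (in seconds) both peers would end up using."""
    # A value of 0 means that side asked for heartbeats to be disabled.
    if client_value == 0 or server_value == 0:
        # Assumption: if only one side asks for 0, the non-zero value wins;
        # if both ask for 0, heartbeats stay disabled.
        return max(client_value, server_value)
    # Both values are non-zero: the lower one is used,
    # which corresponds to Math.Min(clientValue, serverValue).
    return min(client_value, server_value)
```

So a client requesting 60 seconds against a server configured for 30 seconds would end up with 30, and the server would not close the connection for missed heartbeats as long as the client honours the negotiated value.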

Related

CometD Failover Ability - VM Switch During Restart

I have a chat implementation working with CometD.
On front end I have a Client that has a clientId=123 and is talking to VirtualMachine-1
The long-polling connection between VirtualMachine-1 and the Client is tied to the clientId. When the connection is established during the handshake, VirtualMachine-1 registers the clientId 123 as its own and accepts its data.
If VM-1 is restarted or fails for some reason, the long-polling connection between the Client and VM-1 is dropped (since VirtualMachine-1 is dead, the heartbeats fail and the Client becomes disconnected).
In that case, the CometD load balancer will re-route the Client's communication to a new VirtualMachine-2. However, since VirtualMachine-2 has a different clientId, it is not able to understand the "123" coming from the Client.
My question is: what is the CometD behavior in this case? How does it re-route the traffic from VM-1 to the new VM-2 so that the handshake succeeds?
When a CometD client is redirected to the second server by the load balancer, the second server does not know about this client.
The client will send a /meta/connect message with clientId=123, and the second server will reply with a 402::unknown_session and advice: {reconnect: "handshake"}.
When receiving the advice to re-handshake, the client will send a /meta/handshake message and will get a new clientId=456 from the second server.
Upon handshake, a well-written CometD application will subscribe (even for dynamic subscriptions) to all needed channels, and eventually be restored to function as before, almost transparently.
Messages published to the client during the switch from one server to the other are completely lost: CometD does not implement any persistence feature.
However, persisting messages until the client acknowledges them is possible: CometD offers a number of listeners that are invoked by the CometD implementation, and through these listeners an application can persist messages (or other information) into its own choice of persistent (and possibly distributed) store: Redis, RDBMS, etc.
CometD handles reconnection transparently for you - it just takes a few messages between client and the new server.
You also want to read about CometD's in-memory clustering features.
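The message exchange described in this answer can be simulated with a toy model. This is only a sketch: ToyServer and ToyClient are hypothetical classes written for this example, not part of CometD, and the generated clientId format is invented. Only the channel names, the 402::unknown_session error and the reconnect advice come from the answer.

```python
class ToyServer:
    """Stand-in for one server node; it only knows sessions it created itself."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = {}  # clientId -> set of subscribed channels
        self._counter = 0

    def handle(self, message):
        channel = message["channel"]
        if channel == "/meta/handshake":
            # A handshake always creates a fresh session with a new clientId.
            self._counter += 1
            client_id = f"{self.name}-{self._counter}"
            self.subscriptions[client_id] = set()
            return {"channel": channel, "successful": True, "clientId": client_id}
        client_id = message.get("clientId")
        if client_id not in self.subscriptions:
            # Unknown session: tell the client to handshake again.
            return {"channel": channel, "successful": False,
                    "error": "402::unknown_session",
                    "advice": {"reconnect": "handshake"}}
        if channel == "/meta/subscribe":
            self.subscriptions[client_id].add(message["subscription"])
        return {"channel": channel, "successful": True}


class ToyClient:
    def __init__(self, channels):
        self.channels = list(channels)
        self.client_id = None

    def handshake(self, server):
        reply = server.handle({"channel": "/meta/handshake"})
        self.client_id = reply["clientId"]
        # Re-subscribe after every handshake so the new server knows our channels.
        for name in self.channels:
            server.handle({"channel": "/meta/subscribe",
                           "clientId": self.client_id, "subscription": name})

    def connect(self, server):
        message = {"channel": "/meta/connect", "clientId": self.client_id}
        reply = server.handle(message)
        advice = reply.get("advice", {})
        if not reply["successful"] and advice.get("reconnect") == "handshake":
            # Follow the advice: handshake again, then retry the connect.
            self.handshake(server)
            reply = server.handle({"channel": "/meta/connect",
                                   "clientId": self.client_id})
        return reply
```

Calling client.handshake(vm1) and later client.connect(vm2) ends with the client holding a new clientId issued by vm2 and with its channels subscribed there. Anything published in between is lost in this model too, as the answer notes.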

What causes a SOAP service to keep disconnecting TLS clients after responding to a single message?

I loaded a client-side .svclog file inside Microsoft Service Trace Viewer and there are a lot of entries in the log saying setting up secure session and close secure session. On the server side, I can see many instances of trust/RST/SCT/Cancel, indicating that the connections are being closed on the server side, but only after giving a response to a SOAP message. It seems like every web service call involves setting up a TLS session for SOAP, and then the connection being closed immediately after sending a response, requiring that TLS be set up again for the very next call.
I read this article: https://blogs.technet.microsoft.com/tspring/2015/02/23/poor-mans-guide-to-troubleshooting-tls-failures/
It said:
Keep in mind that TCP resets should always be expected at some point as the client closes out the session to the server. However, if there are a high volume of TCP resets with little or no “Application Data” (traffic which contains the encapsulated encrypted data between client and server) then you likely have a problem. Particularly if the server side is resetting the connection as opposed to the client.
Unfortunately, the article doesn't expand on this, because it is exactly what I am seeing!
This is a net.tcp web service installed in some customer environment, set up to use Windows authentication.
What's the next step in my diagnosis?
Most likely the behavior you are seeing is normal, and unless you are experiencing some problems I would not be concerned. The MSFT document you quote is referring to TCP resets, but you said your logs show trust/RST/SCT/Cancel entries, and in that context RST means RequestSecurityToken. In other words, your log messages don't in any way imply that there are TCP reset (RST) frames occurring.
The Web Services Secure Conversation Language (WS-SecureConversation) spec says:
It is not uncommon for a requestor to be done with a security context token before it expires. In such cases the requestor can explicitly cancel the security context using this specialized binding based on the WS-Trust Cancel binding. The following Action URIs are used with this binding:
http://schemas.xmlsoap.org/ws/2005/02/trust/RST/SCT/Cancel
http://schemas.xmlsoap.org/ws/2005/02/trust/RSTR/SCT/Cancel
Once a security context has been cancelled it MUST NOT be allowed for authentication or authorization or allow renewal. Proof of possession of the key associated with the security context MUST be proven in order for the context to be cancelled.
If you actually are experiencing transport problems due to unexpected TCP RST frames, or if you are seeing them and are curious to understand their underlying cause, then you'll need to capture network traffic to see how and why TCP resets are occurring, and whether they are normal or abnormal.
I'd do that by firing up Wireshark and looking at the frames. If you see FIN, ACK messages from each side, you can expect the connection to be closed gracefully after a waiting period. Otherwise you'll see RST frames, which occur for a variety of reasons: application resets (performed to avoid tying up a lot of ports in Wait states), a bad sequence number when re-accessing a port that's in a Wait state, router or firewall RST messages (typically sent in both directions), retransmit timeouts, port choice RST messages, and others.
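If you export the TCP flags value from a capture, telling a graceful close from a reset is a matter of checking bits. The sketch below uses the standard TCP flag bit values; the helper names flag_names and close_kind are made up for this example and are not part of Wireshark.

```python
# Standard bit positions of the classic TCP flags in the TCP header.
TCP_FLAGS = [
    (0x01, "FIN"),
    (0x02, "SYN"),
    (0x04, "RST"),
    (0x08, "PSH"),
    (0x10, "ACK"),
    (0x20, "URG"),
]


def flag_names(flags_value):
    """Return the names of the flags set in a TCP flags value."""
    return [name for bit, name in TCP_FLAGS if flags_value & bit]


def close_kind(flags_value):
    """Classify a segment as part of a reset, a graceful close, or neither."""
    names = flag_names(flags_value)
    if "RST" in names:
        return "reset"
    if "FIN" in names:
        return "graceful"
    return "other"
```

For example, a flags value of 0x11 decodes to FIN and ACK (graceful close), while 0x14 decodes to RST and ACK (reset).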
There are lots of resources to help with TCP traffic analysis. You might find it helpful to take a look at https://blogs.technet.microsoft.com/networking/2009/08/12/where-do-resets-come-from-no-the-stork-does-not-bring-them/ for a quick overview.
If you're not familiar with Wireshark it can seem a little complicated, but what you want to do here is very simple and you can get your answer quickly even with no prior experience. Just search for Wireshark tutorials and you'll find one that fits your cognitive style.
You can also use Wireshark to troubleshoot higher-level protocols, including TLS. You can find information about that in many places. I'll just list a few to get you started:
The Wireshark documentation on SSL.
The Wikiversity section on HTTPS.
A 5-minute YouTube tutorial on looking at SSL traffic.
I believe this covers your next diagnostic step reasonably well, but if not, feel free to post more information and I can try to provide a better answer.

Extra TCP connections on the RabbitMQ server after resource alarm

I have RabbitMQ Server 3.6.0 installed on Windows (I know it's time to upgrade, I've already done that on the other server node).
Heartbeats are enabled on both server and client side (heartbeat interval 60s).
I had a resource alarm (RAM limit), and after that I observed a rise in the number of TCP connections to the RMQ Server.
At the moment there are 18,000 connections, while the normal number is 6,000.
Via the management plugin I can see there are a lot of connections with 0 channels, while our "normal" connections have at least 1 channel.
And even an RMQ Server restart doesn't help: all connections are re-established.
   1. Does that mean all of them are really alive?
A similar issue was described here https://github.com/rabbitmq/rabbitmq-server/issues/384, but as far as I can see it was fixed in exactly v3.6.0.
   2. Do I understand correctly that before RMQ Server v3.6.0 the behavior after a resource alarm was that several TCP connections could hang on the server side per 1 real client auto-recovery connection?
Maybe important: we have HAProxy between the server and the clients.
   3. Could HAProxy be an explanation for these extra connections? Maybe it prevents the client from receiving a signal that the connection was closed due to the resource alarm?
Are all of them alive?
Only you can answer this, but I would ask - how is it that you are ending up with many thousands of connections? Really, you should only create one connection per logical process. So if you really have 6,000 logical processes connecting to the server, that might be a reason for that many connections, but in my opinion, you're well beyond reasonable design limits even in that case.
To check, see by how much the connection count decreases when you kill one of your logical processes.
Do I understand correctly that before RMQ Server v3.6.0 the behavior after a resource alarm was that several TCP connections could hang on the server side per 1 real client auto-recovery connection?
As far as I can tell, yes. It looks like the developer in this case ran across a common problem in sockets, namely the detection of dropped connections. If I had a dollar for every time someone misunderstood how TCP works, I'd have more money than Bezos. What they found is that someone made some bad assumptions, when actually a read or write is required to detect a dead socket, and the developer wrote code to (attempt to) handle it properly. It is important to note that this does not look like a very comprehensive fix, so if the same conceptual design problem had been introduced in another part of the code, then this bug might still be around in some form. Searching for bug reports, or asking someone on that support list, might give you a more detailed answer.
Could HAProxy be an explanation for these extra connections?
That depends. In theory, HAProxy is just a pass-through. For the connection to be recognized by the broker, it has to go through a handshake, which is a deliberate process and cannot happen inadvertently. Closing a connection also requires a handshake, which is where HAProxy might be the culprit. If HAProxy thinks the connection is dead and drops it without that process, then it could be a contributing cause. But it is not in and of itself making these new connections.
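To illustrate the point that a socket only reveals a closed peer when you actually try to read from (or write to) it, here is a small Python sketch. The helper name peer_has_closed is made up for this example, and it only covers an orderly close by the peer. A connection that silently vanishes (for instance a proxy dropping it without sending anything) will not show up this way, which is what heartbeats or an actual write are for.

```python
import socket


def peer_has_closed(sock):
    """Attempt a non-blocking peek; an empty read means the peer closed."""
    sock.setblocking(False)
    try:
        # MSG_PEEK looks at pending data without consuming it.
        data = sock.recv(1, socket.MSG_PEEK)
    except BlockingIOError:
        # Nothing to read right now: the connection still looks open.
        return False
    except ConnectionResetError:
        # The peer reset the connection.
        return True
    finally:
        sock.setblocking(True)
    # recv() returning b"" is how an orderly shutdown by the peer is reported.
    return data == b""
```

Until a call like this is made, the local socket object carries no indication that the other end has gone away.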
The RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
I recommended that this user upgrade from Erlang 18, which has known TCP connection issues:
https://groups.google.com/d/msg/rabbitmq-users/R3700QdIVJs/taDYKI6bAgAJ
I've managed to reproduce the problem: in the end it was a bug in the way our client used RMQ connections.
It created 1 auto-recovery connection (which is fine), and sometimes it created a separate simple connection for "temporary" purposes.
Steps to reproduce my problem were:
   1. Reach the memory alarm in RabbitMQ (e.g. set up an easily reached RAM limit and push a lot of big messages). Connections will be in state "blocking".
   2. Start sending messages from our client with this new "temp" connection.
   3. Ensure the connection is in state "blocked".
   4. Without eliminating the resource alarm, restart the RabbitMQ node.
The "temp" connection itself was still there, despite the fact that auto-recovery was not enabled for it. And it continued sending heartbeats, so the server didn't close it.
We will fix the client so that it always uses one and only one connection.
Plus, of course, we will upgrade Erlang.
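A minimal sketch of the "one connection only" idea, in Python rather than our actual client code. SharedConnection is a hypothetical helper and the factory argument stands for whatever call opens the auto-recovery connection; neither comes from a RabbitMQ client library.

```python
import threading


class SharedConnection:
    """Hands the same lazily created connection to every caller in the process."""

    def __init__(self, factory):
        # factory: hypothetical callable that opens the real connection.
        self._factory = factory
        self._lock = threading.Lock()
        self._connection = None

    def get(self):
        # The lock ensures two threads cannot both open a connection.
        with self._lock:
            if self._connection is None:
                self._connection = self._factory()
            return self._connection
```

Code that used to open a "temp" connection would call get() instead and open a channel on the shared connection.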

event-driven TLS server

I'm working on server-side software that receives requests from clients via TLS (over TCP). For better performance and user experience, I'd like to avoid a full handshake for every request. Ideally, the client can just establish a TLS session with the server for hours, although most of the time the session might be idle. At the same time, high throughput is also required.
One easy way to do it is to dedicate a thread to each session and use a big thread pool to boost throughput. But the performance overhead of this method could be huge if I want, say, tens of thousands of concurrent sessions.
The requirement of high throughput leads me to the event-driven model. The idea is that when the connection is idle (namely, no IO event on the underlying socket), the TLS server can switch context to serve other connections. One of the challenges is to sort of freeze the entire TLS session context while the socket is idle and retrieve it when the socket becomes readable/writable.
I'm wondering if there is support already in TLS for this kind of feature? Both cache and ticket seem relevant. Also, I'm wondering if people have implemented this idea.
You are talking about SSL Session resumption, and it is already implemented in both OpenSSL and JSSE, and no doubt every other SSL API you would be using. SSL sessions already survive connections. So there is nothing for you to do to get this.
The part about 'freezing the SSL session context' is completely pointless.
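To see why nothing needs to be frozen in an event-driven design, consider this minimal Python sketch. It uses plain sockets to stay short. With TLS you would register an ssl-wrapped non-blocking socket instead and additionally handle ssl.SSLWantReadError and ssl.SSLWantWriteError, which is left out here. ConnectionContext and serve_ready are names made up for this example.

```python
import selectors
import socket


class ConnectionContext:
    """Per-connection state; it simply stays in memory while the socket is idle."""

    def __init__(self, sock):
        self.sock = sock
        self.requests_served = 0


def serve_ready(selector, timeout=0.1):
    """Handle every connection that is readable right now, then return."""
    handled = 0
    for key, _events in selector.select(timeout):
        ctx = key.data  # the context registered together with the socket
        data = ctx.sock.recv(4096)
        if not data:
            # Peer closed: forget the connection and its context.
            selector.unregister(ctx.sock)
            ctx.sock.close()
            continue
        ctx.requests_served += 1
        ctx.sock.sendall(data.upper())  # stand-in for real request handling
        handled += 1
    return handled
```

Each socket is registered once with selector.register(sock, selectors.EVENT_READ, data=ConnectionContext(sock)). An idle connection costs one registered socket plus its context object, and no thread is tied up waiting on it.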

wcf and duplex communication

I have a lot of client programs and one service.
These client programs communicate with the server over an HTTP channel using WCF.
The clients have dynamic IP.
They are online 24h/day.
I need the following:
The server should notify all the clients at 3-minute intervals. If a client is new (it has just started), the server should notify it immediately.
But because the clients have dynamic IPs, run 24h/day, and sometimes have an unstable connection, is it a good idea to use WCF duplex?
What happens when the connection goes down? Will it automatically recover?
Is it a good idea to use remote MSMQ for this type of notification?
Regards,
WCF duplex is very resource hungry, and as a rule of thumb you should not use more than 10 duplex channels. There is a lot of overhead involved with duplex channels. There is also no automatic recovery.
If you know the interval of 3 minutes and you want the client to get information when it starts why not let the client poll the information from the server?
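A rough sketch of such a polling client, written in Python rather than WCF to keep it short. The fetch and handle callables are hypothetical placeholders for the service call and for whatever the client does with the result. The first poll happens immediately at startup, which covers the "new client" case.

```python
import time


def poll_notifications(fetch, handle, interval_seconds=180,
                       max_polls=None, sleep=time.sleep):
    """Poll the server right away, then once per interval.

    fetch and handle are caller-supplied (hypothetical) callables.
    max_polls=None means poll forever.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            handle(fetch())
        except OSError:
            # Connection problem: skip this round and retry at the next interval.
            pass
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
    return polls
```

Because every poll is a fresh request, a dropped connection or a changed client IP only costs one missed round; there is no long-lived callback channel to recover.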
When the connection goes down the callback will throw an exception and the channel will close.
I am not sure MSMQ will work for you unless each client creates an MSMQ queue for you and you push messages to each one of them. Again, with an unreliable connection it will not help. I don't think you can "push" the data if you lose the connection to a client, or if the client goes offline or changes its IP without notifying your system.