Extra TCP connections on the RabbitMQ server after resource alarm

Extra TCP connections on the RabbitMQ server after resource alarm - rabbitmq

I have RabbitMQ Server 3.6.0 installed on Windows (I know it's time to upgrade, I've already done that on the other server node).
Heartbeats are enabled on both server and client side (heartbeat interval 60s).
I have had a resource alarm (RAM limit), and after that I have observed the raise of amount of TCP connections to RMQ Server.
At the moment there're 18000 connections while normal amount is 6000.
Via management plugin I can see there is a lot of connections with 0 channels, while our "normal" connection have at least 1 channel.
And even RMQ Server restart won't help: all connections would re-establish.
   1. Does that mean all of them are really alive?
Similar issue was described here https://github.com/rabbitmq/rabbitmq-server/issues/384, but as I can see it was fixed exactly in v3.6.0.
   2. Do I understand right that before RMQ Server v3.6.0 the behavior after resource alarm was like that: several TCP connections could hang on server side per 1 real client autorecovery connection?
Maybe important: we have haProxy between the server and the clients. 
   3. Could haProxy be an explanation for this extra connections? Maybe it prevents client from receiving a signal the connection was closed due to resource alarm?

Are all of them alive?
Only you can answer this, but I would ask - how is it that you are ending up with many thousands of connections? Really, you should only create one connection per logical process. So if you really have 6,000 logical processes connecting to the server, that might be a reason for that many connections, but in my opinion, you're well beyond reasonable design limits even in that case.
To check, see how many connections decrease when you kill one of your logical processes.
Do I understand right that before RMQ Server v3.6.0 the behavior after resource alarm was like that: several TCP connections could hang on server side per 1 real client autorecovery connection?
As far as I can tell, yes. It looks like the developer in this case ran across a common problem in sockets, and that is the detection of dropped connections. If I had a dollar for every time someone misunderstood how TCP works, I'd have more money than Bezos. So, what they found is that someone made some bad assumptions, when actually read or write is required to detect a dead socket, and the developer wrote code to (attempt) to handle it properly. It is important to note that this does not look like a very comprehensive fix, so if the conceptual design problem had been introduced to another part of the code, then this bug might still be around in some form. Searching for bug reports might give you a more detailed answer, or asking someone on that support list.
Could haProxy be an explanation for this extra connections?
That depends. In theory, haProxy as is just a pass-through. For the connection to be recognized by the broker, it's got to go through a handshake, which is a deliberate process and cannot happen inadvertently. Closing a connection also requires a handshake, which is where haProxy might be the culprit. If haProxy thinks the connection is dead and drops it without that process, then it could be a contributing cause. But it is not in and of itself making these new connections.

The RabbitMQ team monitors this mailing list and only sometimes answers questions on StackOverflow.
I recommended that this user upgrade from Erlang 18, which has known TCP connection issues -
https://groups.google.com/d/msg/rabbitmq-users/R3700QdIVJs/taDYKI6bAgAJ

I've managed to reproduce the problem: in the end it was a bug in the way our client used RMQ connections.
It created 1 auto-recovery connection (that's all fine with that) and sometimes it created a separate simple connection for "temporary" purposes.
Step to reproduce my problem were:
Reach memory alarm in RabbitMQ (e.g. set up an easily reached RAM
limit and push a lot of big messages). Connections would be in state
"blocking".
Start sending message from our client with this new "temp" connection.
Ensure the connection is in state "blocked".
Without eliminating resource alarm, restart RabbitMQ node.
The "temp" connection itself was here! Despite the fact auto-recovery
was not enabled for it. And it continued sending heartbeats so the
server didn't close it.
We will fix the client to use one and the only connection always.
Plus we of course will upgrade Erlang.

Related

What causes a SOAP service to keep disconnecting TLS clients after responding to a single message?

I loaded a client-side .svclog file inside Microsoft Service Trace Viewer and there are a lot of entries in the log saying setting up secure session and close secure session. On the server side, I can see many instances of trust/RST/SCT/Cancel, indicating that the connections are being closed on the server side, but only after giving a response to a SOAP message. It seems like every web service call involves setting up a TLS session for SOAP, and then the connection being closed immediately after sending a response, requiring that TLS be set up again for the very next call.
I read this article: https://blogs.technet.microsoft.com/tspring/2015/02/23/poor-mans-guide-to-troubleshooting-tls-failures/
It said:
Keep in mind that TCP resets should always be expected at some point as the client closes out the session to the server. However, if there are a high volume of TCP resets with little or no “Application Data” (traffic which contains the encapsulated encrypted data between client and server) then you likely have a problem. Particularly if the server side is resetting the connection as opposed to the client.
Unfortunately, the article doesn't expand on this, because it is exactly what I am seeing!
This is a net.tcp web service installed in some customer environment, set up to use Windows authentication.
What's the next step in my diagnosis?

Most likely the behavior you are seeing is normal, and unless you are experiencing some problems I would not be concerned. The MSFT document you quote is referring to TCP resets, but you said your logs show trust/RST/SCT/Cancel entries, and in that context RST means RequestSecurityToken. In other words, your log messages don't in any way imply that there are TCP reset (RST) frames occurring.
The Web Services Secure Conversation Language (WS-SecureConversation) spec (here) says:
It is not uncommon for a requestor to be done with a security context
token before it expires. In such cases the requestor can explicitly
cancel the security context using this specialized binding based on
the WS-Trust Cancel binding. The following Action URIs are used with
this binding:
http://schemas.xmlsoap.org/ws/2005/02/trust/RST/SCT/Cancel
http://schemas.xmlsoap.org/ws/2005/02/trust/RSTR/SCT/Cancel
Once a
security context has been cancelled it MUST NOT be allowed for
authentication or authorization or allow renewal. Proof of possession
of the key associated with the security context MUST be proven in
order for the context to be cancelled.
If you actually are experiencing transport problems due to unexpected TCP RST frames, or if you are seeing them and are curious to understand their underlying cause, then you'll need to capture network traffic to see how and why TCP resets are occurring, and whether they are normal or abnormal.
I'd do that by firing up WireShark and looking at the frames. If you see FIN, ACK messages from each side then you expect the connection to be closed gracefully after a waiting period. Otherwise you'll see RST frames for a variety of reasons: application resets (performed to avoid tying up a lot of ports in Wait states), bad sequence number when re-accessing a port that's in a Wait state, router or firewall RST messages (typically sent both directions), retransmit timeouts, port choice RST messages, and others.
There are lots of resources to help with TCP traffic analysis. You might find it helpful to take a look at https://blogs.technet.microsoft.com/networking/2009/08/12/where-do-resets-come-from-no-the-stork-does-not-bring-them/ for a quick overview.
If you're not familiar with WireShark it can seem a little complicated, but the thing you want to do here is very simple and you can get your answer very quickly even with no prior experience. Just search for wireshark tutorials and you'll find one that fits your cognitive style.
You can also use WireShark to troubleshoot higher level protocols, including TLS. You can find information about that in many places. I'll just list a few to get you started:
WireShark documentation on SSL is here.
Wikiversity section on HTTPS is here.
A 5-minute youtube tutorial for looking at SSL traffic is here.
I believe this covers your next diagnostic step reasonably well, but if not, feel free to post more information and I can try to provide a better answer.

How to dispose of idle PUBSUB Redis connections

I was recently doing some investigation into some issues I'm facing on my Redis clusters, and saw that I have many connections that are sticking around, despite being idle indefinitely after some period of time.
After some investigation, I found that I have these two settings on my cluster:
timeout 300
tcp-keepalive 0
The stale connections that aren't going away are PUB/SUB client connections (StackExchange.Redis clients, in fact, but that's beside the point), so they do not respect the timeout configuration. As such, the tcp-keepalive seems to be the only other configuration that can ensure these connections get cleaned up over time.
So I applied this setting to all the nodes:
redis-trib.rb call 127.0.0.1:6001 config set tcp-keepalive 300
At this point I went home, and I came back the next morning, assuming the stale connections would have been disposed of properly. Sadly, I was greeted by all the same connections.
My question is: Is there any way from the Redis server side to dispose of these connections gracefully after they've been established? Is it expected that applying the tcp-keepalive configuration after the connections are established and old that they will not be disposed of?
The only solution I've found besides restarting the Redis servers are to do a bit of scripting and use the CLIENT KILL command, which is doable, but I was hoping for something configuration based to handle this.
Thanks in advance for any insight here!

RabbitMQ / EasyNetQ drops connections when machine very active?

I'm new to RabbitMQ / EasyNetQ and am trying to better understand a behaviour I am observing. We've seen that when our server running RabbitMQ is busy all EasyNetQ connections are dropped.
This is the exception simultaneously generated on all clients:
System.Exception: Failed to connect to Broker: 'XXXXXX.domain.com',
Port: 5672 VHost: 'XXXX'. ExceptionMessage: 'None of the specified
endpoints were reachable'
EasyNetQ automatically reconnects when the server is no longer busy, but I wonder if it is typical for RabbitMQ/EasyNetQ to drop connections when the machine is busy? (Or if I should be investigating performance issues with my server.)
(PS: By busy, I simply mean updating a large project from source control, relaunching a large ASP.NET application after redeploying it or running a CPU-intensive calculation on large amounts of data. ).

There are limits to the number of connections a RabbitMQ broker will accept. Is it possible that you are rapidly opening a connection, doing some work, then closing it, much as you would with a database connection? If so, that's not how you should interact with the broker. See the EasyNetQ documentation on connections:
https://github.com/mikehadlow/EasyNetQ/wiki/Connecting-to-RabbitMQ

Redis clients broadcast problems (in the context of Socket.IO)

So I've read some articles about scaling Socket.IO. For various reasons I don't want to use built-in Socket.IO scaling mechanism (mostly it seems to be inefficient, since it publishes a lot more stuff to Redis then required from my point of view).
So I've came up with this simple idea:
Each Socket.IO server creates Redis pub/sub/store clients, connects to Redis and subscribes to a channel. Now, when I want to broadcast data I just publish it to Redis and all other Socket.IO servers get it and push it to users.
There is a problem, though (which I think is also a problem for Socket.IO built-in mechanism). Let's say I want to know the number of all connected users. There are at least two ways of doing that:
Server A publishes give_me_clients to Redis. Then each Socket.IO server counts connections and publishes number_of_clients. Server A grabs this data, combines it and sends it to the client.
Each server updates number_of_clients_for::ID_HERE in Redis whenever user connects/disconnects to the server. Then Server A just fetches data and combines it. Might be more efficient.
There are problems with these solutions though:
Server A is not aware of other servers. Therefore he does not know when he should stop listening to number_of_clients. One could fix it with making Server A aware of other servers: whenever a server connects to Redis he publishes new_server (Server A grabs the data and stores it in memory). But what to do, when Redis - Socket.IO connection breaks? Is there a way for Redis to notify clients that one of the client disconnected?
Actually the same as above. When a Socket.IO server crashes how to clear number_of_clients data?
So the real question is: can Redis notify (publish to chanel) clients that the connection with one of them has just ended??

After a lot of testing it seems, that Redis does not have such functionality. Also I've found out, that scaling Socket.IO is really a pain.
So I've switched from Socket.IO to WS (see this link). It is low level (but perfect for my use) and it only supports WebSockets (in all major versions). But then again I only want to support WebSockets and FlashSocket (which I have to imlement manually, but that's fine).
The advantage is that I can easily create cluster with such servers. HAProxy works with such servers almost out of the box (some minor tuning). Servers can easily communicate on a local net (with UDP or central TCP server if the cluster is big).
The disadvantage is that one have to manually implement some cool features like heartbeats, broadcasting, rooms, etc. Also you want have long-polling fallback, but that's fine in my case. Scaling is still more important, imho.

Too many TIME_WAIT connections

We have a fairly busy website (1 million page views/day) using Apache mod proxy that keeps getting overloaded with connections (>1,000) in the TIME_WAIT state. The connections are to port 3306 (mysql), but mysql only shows a few connections (show process list) and is performing fine.
We have tried changing a bunch of things (keep alive on/off), but nothing seems to help. All other system resources are within reasonable range.
I've searched around, which seems to indicate changing the tcp_time_wait_interval. But that seems a bit drastic. I've worked on busy website before, but never had this problem.
Any suggestions?

Each time_wait connection is a connection that has been closed.
You're probably connecting to mysql, issuing a query, then disconnecting. Repeat for each query on the page. Consider using a connection pooling tool, or at very least, a global variable that holds on to your database connection. If you use a global, you'll have to close the connection at the end of the page. Hopefully you have someplace common you can put that, like a footer include.
As a bonus, you should get a faster page load. MySQL is quick to connect, but not having to re-connect is even faster.

If your client applications are using JDBC, you might be hitting this bug:
http://bugs.mysql.com/bug.php?id=56979
I believe that php has the same problem
Cheers,
Gilles.

We had a similar problem, where our web servers all froze up because our php was making connections to a mysql server that was set up to do reverse host lookups on incoming connections.
When things were slow it worked fine, but under load the responstimes shot through the roof and all the apache servers got stuck in time_wait.
The way we figured the problem out was through using xdebug to create profiling data on the scripts under high load, and looking at that. the mysql_connect calls took up 80-90% of the execution time.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas