RabbitMQ / EasyNetQ drops connections when machine very active? - rabbitmq

I'm new to RabbitMQ / EasyNetQ and am trying to better understand a behaviour I am observing. We've seen that when our server running RabbitMQ is busy all EasyNetQ connections are dropped.
This is the exception simultaneously generated on all clients:
System.Exception: Failed to connect to Broker: 'XXXXXX.domain.com',
Port: 5672 VHost: 'XXXX'. ExceptionMessage: 'None of the specified
endpoints were reachable'
EasyNetQ automatically reconnects when the server is no longer busy, but I wonder if it is typical for RabbitMQ/EasyNetQ to drop connections when the machine is busy? (Or if I should be investigating performance issues with my server.)
(PS: By busy, I simply mean updating a large project from source control, relaunching a large ASP.NET application after redeploying it or running a CPU-intensive calculation on large amounts of data. ).

There are limits to the number of connections a RabbitMQ broker will accept. Is it possible that you are rapidly opening a connection, doing some work, then closing it, much as you would with a database connection? If so, that's not how you should interact with the broker. See the EasyNetQ documentation on connections:
https://github.com/mikehadlow/EasyNetQ/wiki/Connecting-to-RabbitMQ

Related

RabbitMQ as Message Broker used by Spring Websocket dies under load

I develop an application where we need to handle 160k concurrent users which are connected to the backend via a websocket connection.
We decided to use the spring websocket implementation and RabbitMQ as the message broker.
In our application every user needs to subscribe to its user queue /exchange/amq.direct/update as well as to another queue where also other users can potential subscribe to /topic/someUniqueName.
In our first performance test we did the naive approach where every user subscribes to two new queues.
When running the test RabbitMQ dies silently when around 800 users are connected at the same time, so around 1600 queues are active (See the graph of all RabbitMQ objects here).
I read though that you should be careful opening many connections to RabbitMQ.
Now I wonder if the approach that is anticipated by Spring Websocket with opening one queue per user is a conceptional problem for systems with high load or if there is another error in my system.
Limiting factors for RabbitMQ are usually:
memory (can be checked in dashboard) that needs to grow with number of messages and number of queues (if you don't use lazy queues that go directly to disk).
maximum number of file descriptors (at least 1 per connection) that often defaults to too low values on many distributions (ref: https://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-April/019615.html)
CPU for routing the messages
I did find the issue. I actually misconfigured the RabbitMQ service and just gave it a 1024 file descriptor limit. Increasing it solved the issue.

Network issues and Redis PubSub

I am using ServiceStack 5.0.2 and Redis 3.2.100 on Windows.
I have got several nodes with active Pub/Sub Subscription and a few Pub's per second.
I noticed that if Redis Service restarts while there is no physical network connection (so one of the clients cannot connect to Redis Service), that client stops receiving any messages after network recovers. Let's call it a "zombie subscriber": it thinks that it is still operational, but never actually receives a message: client thinks it has a connection, the same connection on server is closed.
The problem is no exception is thrown in RedisSubscription.SubscribeToChannels, so I am not able to detect the issue in order to resubscribe.
I have also analyzed RedisPubSubServer and I think I have discovered a problem. In the described case RedisPubSubServer tries to restart (send stop command CTRL), but "zombie subscriber" does not receive it and no resubscription is made.

Extra TCP connections on the RabbitMQ server after resource alarm

I have RabbitMQ Server 3.6.0 installed on Windows (I know it's time to upgrade, I've already done that on the other server node).
Heartbeats are enabled on both server and client side (heartbeat interval 60s).
I have had a resource alarm (RAM limit), and after that I have observed the raise of amount of TCP connections to RMQ Server.
At the moment there're 18000 connections while normal amount is 6000.
Via management plugin I can see there is a lot of connections with 0 channels, while our "normal" connection have at least 1 channel.
And even RMQ Server restart won't help: all connections would re-establish.
   1. Does that mean all of them are really alive?
Similar issue was described here https://github.com/rabbitmq/rabbitmq-server/issues/384, but as I can see it was fixed exactly in v3.6.0.
   2. Do I understand right that before RMQ Server v3.6.0 the behavior after resource alarm was like that: several TCP connections could hang on server side per 1 real client autorecovery connection?
Maybe important: we have haProxy between the server and the clients. 
   3. Could haProxy be an explanation for this extra connections? Maybe it prevents client from receiving a signal the connection was closed due to resource alarm?
Are all of them alive?
Only you can answer this, but I would ask - how is it that you are ending up with many thousands of connections? Really, you should only create one connection per logical process. So if you really have 6,000 logical processes connecting to the server, that might be a reason for that many connections, but in my opinion, you're well beyond reasonable design limits even in that case.
To check, see how many connections decrease when you kill one of your logical processes.
Do I understand right that before RMQ Server v3.6.0 the behavior after resource alarm was like that: several TCP connections could hang on server side per 1 real client autorecovery connection?
As far as I can tell, yes. It looks like the developer in this case ran across a common problem in sockets, and that is the detection of dropped connections. If I had a dollar for every time someone misunderstood how TCP works, I'd have more money than Bezos. So, what they found is that someone made some bad assumptions, when actually read or write is required to detect a dead socket, and the developer wrote code to (attempt) to handle it properly. It is important to note that this does not look like a very comprehensive fix, so if the conceptual design problem had been introduced to another part of the code, then this bug might still be around in some form. Searching for bug reports might give you a more detailed answer, or asking someone on that support list.
Could haProxy be an explanation for this extra connections?
That depends. In theory, haProxy as is just a pass-through. For the connection to be recognized by the broker, it's got to go through a handshake, which is a deliberate process and cannot happen inadvertently. Closing a connection also requires a handshake, which is where haProxy might be the culprit. If haProxy thinks the connection is dead and drops it without that process, then it could be a contributing cause. But it is not in and of itself making these new connections.
The RabbitMQ team monitors this mailing list and only sometimes answers questions on StackOverflow.
I recommended that this user upgrade from Erlang 18, which has known TCP connection issues -
https://groups.google.com/d/msg/rabbitmq-users/R3700QdIVJs/taDYKI6bAgAJ
I've managed to reproduce the problem: in the end it was a bug in the way our client used RMQ connections.
It created 1 auto-recovery connection (that's all fine with that) and sometimes it created a separate simple connection for "temporary" purposes.
Step to reproduce my problem were:
Reach memory alarm in RabbitMQ (e.g. set up an easily reached RAM
limit and push a lot of big messages). Connections would be in state
"blocking".
Start sending message from our client with this new "temp" connection.
Ensure the connection is in state "blocked".
Without eliminating resource alarm, restart RabbitMQ node.
The "temp" connection itself was here! Despite the fact auto-recovery
was not enabled for it. And it continued sending heartbeats so the
server didn't close it.
We will fix the client to use one and the only connection always.
Plus we of course will upgrade Erlang.

In RabbitMQ, do we need to manage Connections and Channels in a separate thread?

I am new to the world of Message Queues and I am currently evaluating RabbitMQ, ActiveMQ and Kafka. I see that in RabbitMQ, the Producer will create a Connection to the RabbitMQ server and the thread holding the Connection will remain active until the connection is closed. This leads me to believe that there MUST be a separate thread which delivers information to the RMQ Producer thread which will simply publish the message to the queue and keep looping until connection to the RMQ Server is closed? Is this assumption correct? Any thoughts/inputs would be appreciated.
Thanks!
P.S: This isn't the behaviour with Kafka. [ Apache Kafka: Java Producer reusability ]
in general, you should have a single RMQ connection per application instance. that connection can be opened as soon as your application starts.
having a connection does not yet give you the ability to publish or consume messages, though.
to do that, you need to create a channel.
the general best practice is one channel per thread in your application. need to publish a messages from this thread? create a channel for the thread. done with publishing it and not doing any other RMQ work on this channel? close the channel.
unlike connections, channels are cheap and easy to create. they work over the existing RMQ connection, and they take very little resources to create.
you can create thousands of channels in a single connection (though you might want to limit that number for performance reasons)

Simulating loss of broker publisher connectivity in ActiveMQ

I wish to run an experiment in which the publisher loses connection with the broker and then enqueues messages in its own queue and then when it regains connectivity it sends all its queued messages to the broker. How can I I do this since if I call close connection, I can no longer send(raises an exception). A trick that I can think of is to use a network of two brokers and simulate the above by breaking the connection between the two brokers. Is there an API call that I can use to do the above?
This is very much like facebook messenger or whatsapp acting as a publisher and enqueuing our to-send messages if we are offline and sending them once we are connected.
There is plenty of solutions you could use to break the connection in order to test, here is a non-comprehensive list :
Make a script that can set/unset a firewall rule on your environement blocking the connection port
If you are working with VMs, you can suspend/resume the one running Activemq, you can even automate it with tools like vagrant (vagrant suspend, then vagrant up)
Tweak the connection manualy accessing the activemq jmx
Develop an activemq plugin able to trash connections on demand (or maybe there is one ?)
Now in order to have the behavior you wish to obtain there is two options :
1) Make sure your connection is failover so it can be reestablished, and store your message on disk before sending them with your producer.
2)Produce to a local broker embbeded in your app, and connect this one to the remote broker.