RabbitMQ poor performance

We are seeing poor performance in our RabbitMQ clusters, even when they are idle.
After installing the rabbitmq-top plugin, we see many processes with very high reductions/sec: 100k and more.
Questions:
What does it mean?
How to control it?
What might be causing such slowness without any errors?
Notes:
Our clusters are running on Kubernetes 1.15.11
We allocated 3 nodes, each with limits of 8 CPUs and 8 GB of RAM. The memory high watermark (vm_memory_high_watermark) is set to 7 GB. Actual usage is ~1.5 CPU and 1 GB RAM
RabbitMQ 3.8.2. Erlang 22.1
We don't have many consumers or producers. The slowness also occurs in a fairly idle environment
rabbitmqctl status is very slow to return (sometimes 2 minutes) but does not show any errors

After some more investigation, we found that the slowness had two separate causes.
The default RabbitMQ (Erlang) runtime configuration that we got from the bitnami helm chart assigned only a single scheduler. That is fine for a simple app with a few concurrent connections, but a production-grade deployment with thousands of connections needs many more schedulers. Increasing from 1 to 8 schedulers improved throughput dramatically.
Our monitoring was hammering RabbitMQ with a lot of requests (about 100/sec). The monitoring hits the aliveness-test endpoint, which creates a connection, declares a queue (not mirrored), publishes a message and then consumes that message. Disabling the monitoring reduced load dramatically: CPU usage dropped by 80%-90%, and reductions/sec dropped by about 90%.
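If the monitoring cannot simply be switched off, one option is to make sure the expensive check reaches the broker at most once per interval, however often it is polled. The following is only a minimal Python sketch of that idea, not what we actually deployed. ThrottledCheck, probe and min_interval are hypothetical names, and probe stands in for whatever really calls the aliveness-test.

```python
import time
from typing import Callable, Optional


class ThrottledCheck:
    """Run an expensive health probe at most once per `min_interval` seconds.

    Callers in between get the cached result of the last probe run.
    """

    def __init__(self, probe: Callable[[], bool], min_interval: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe                # hypothetical: the real aliveness call
        self._min_interval = min_interval  # seconds between real probe runs
        self._clock = clock                # injectable so the demo needs no sleep
        self._last_run: Optional[float] = None
        self._last_result = False

    def check(self) -> bool:
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self._min_interval:
            self._last_result = self._probe()
            self._last_run = now
        return self._last_result


# Demo: simulate 60 seconds of polling at 100 checks/sec with a fake clock.
calls = []
fake_now = [0.0]
check = ThrottledCheck(lambda: calls.append(1) or True, 30.0,
                       clock=lambda: fake_now[0])
for tick in range(6000):
    fake_now[0] = tick / 100
    check.check()
print(len(calls))  # number of times the broker would really have been hit
```

With a 30 second interval, the 6000 simulated polls result in only 2 real probe runs. The trade-off is that a failure can go unnoticed for up to min_interval seconds.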
References
Performance:
https://www.rabbitmq.com/runtime.html#scheduling
https://www.rabbitmq.com/blog/2020/06/04/how-to-run-benchmarks/
https://www.rabbitmq.com/blog/2020/08/10/deploying-rabbitmq-to-kubernetes-whats-involved/
https://www.rabbitmq.com/runtime.html#cpu-reduce-idle-usage
Monitoring:
http://rabbitmq.1065348.n5.nabble.com/RabbitMQ-API-aliveness-test-td32723.html
https://groups.google.com/forum/#!topic/rabbitmq-users/9pOeHlhQoHA
https://www.rabbitmq.com/monitoring.html

Related

Thousands of TimeoutExceptions after switching to Redis Enterprise

We recently attempted to switch from Azure Redis to Redis Enterprise. Unfortunately, after about an hour we were forced to roll back due to performance issues. We're looking for advice on how to get to the root cause and proceed. Here's what I've figured out so far, but I'm happy to add any more details as necessary.
First off, the client is a .NET Framework app using StackExchange.Redis version 2.1.30. The Azure Redis instance is using 4 shards, and the Redis Enterprise instance is also configured for 4 shards.
When we switched over to Redis Enterprise, we would immediately see several thousand of these exceptions per 5 minute interval:
Timeout performing GET (5000ms), next: GET [Challenges]::306331, inst:
1, qu: 0, qs: 3079, aw: False, rs: ReadAsync, ws: Idle, in: 0,
serverEndpoint: xxxxxxx:17142, mc: 1/1/0, mgr: 9 of 10 available,
clientName: API, IOCP: (Busy=2,Free=998,Min=400,Max=1000), WORKER:
(Busy=112,Free=32655,Min=2000,Max=32767), Local-CPU: 4.5%, v:
2.1.30.38891 (Please take a look at this article for some common client-side issues that can cause timeouts:
https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Looking at this error message, it appears there are a lot of busy threads in the WORKER thread pool (things waiting on a response from Redis Enterprise), but almost nothing in the IOCP thread pool (responses from Redis waiting to be processed by our client code). So, there seems to be some sort of bottleneck on the Redis side.
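To make the fields in that message easier to compare, here is a rough Python sketch that pulls out the counters I'm looking at. It is just an illustration (the client itself is .NET); parse_timeout and pool_over_min are made-up helper names. The field meanings follow the StackExchange.Redis timeouts article linked in the error: qu is queued but unsent, qs is sent and awaiting a response, in is bytes waiting in the socket input buffer, and a thread pool whose Busy count exceeds Min is a sign of thread pool throttling.

```python
import re

# The error text from above, joined back into a single line.
MESSAGE = (
    "Timeout performing GET (5000ms), next: GET [Challenges]::306331, inst: "
    "1, qu: 0, qs: 3079, aw: False, rs: ReadAsync, ws: Idle, in: 0, "
    "serverEndpoint: xxxxxxx:17142, mc: 1/1/0, mgr: 9 of 10 available, "
    "clientName: API, IOCP: (Busy=2,Free=998,Min=400,Max=1000), WORKER: "
    "(Busy=112,Free=32655,Min=2000,Max=32767), Local-CPU: 4.5%, v: 2.1.30.38891"
)


def parse_timeout(message: str) -> dict:
    """Extract queue counters and thread-pool stats from the error text."""
    fields = {}
    for key in ("qu", "qs", "in"):
        # \b keeps "in" from matching inside "inst" or "performing".
        match = re.search(rf"\b{key}: (\d+)", message)
        if match:
            fields[key] = int(match.group(1))
    for pool in ("IOCP", "WORKER"):
        match = re.search(
            rf"{pool}:\s*\(Busy=(\d+),Free=(\d+),Min=(\d+),Max=(\d+)\)", message)
        if match:
            busy, free, minimum, maximum = map(int, match.groups())
            fields[pool] = {"busy": busy, "free": free,
                            "min": minimum, "max": maximum}
    return fields


def pool_over_min(fields: dict, pool: str) -> bool:
    """True when the pool has more busy threads than its configured minimum."""
    return fields[pool]["busy"] > fields[pool]["min"]


fields = parse_timeout(MESSAGE)
print(fields["qs"], fields["in"])
print(pool_over_min(fields, "WORKER"), pool_over_min(fields, "IOCP"))
```

For this message, qs is 3079 and in is 0, and neither pool is over its Min, so by that article's rule of thumb the client is waiting on responses rather than being starved of threads.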
Using AppInsights, I created a graph of the busy worker threads (dark blue), busy IO threads (red), and CPU usage (light blue). (The graph is not shown here.)
The CPU never really goes above 20% or so, the IO threads are barely a blip (I think the max is like 2 busy), but the worker threads kinda grow and grow until eventually everything times out and the process starts over again. A little after 7pm is when we decided to roll back to Azure Redis, so everything is great at that point. So, everything points to Redis being some sort of bottleneck. So, let's look at the Redis side of things.
During this time, Redis reported a max of around 5% CPU usage. Incoming traffic topped out around 1.4MB/s, and outgoing traffic topped out around 9.5MB/s. Ops/sec were around 4k. Latency around this time was 0.05ms, and the slowest thing in the SLOWLOG was like 15ms or so. In other words, the Redis Enterprise node was barely breaking a sweat and was easily able to keep up with the traffic being sent to it. In fact, we had 4 other nodes in the cluster that weren't even being used since Redis didn't even see the need to send anything to other nodes. Redis was basically just yawning.
From here, I was thinking maybe there were network bandwidth constraints. All of our VMs are configured for accelerated networking, and we should have 10gig connections to these machines. I decided to run iperf between the client and the server (output not shown here).
I can easily transfer over 700Mbit/sec between the client and the Redis Enterprise server, yet the server is only sending out 9.5MB/sec (roughly 76Mbit/sec). So, it doesn't appear the problem is network bandwidth.
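For anyone who wants to check the units, here is the arithmetic as a small Python sketch. It only uses the figures quoted above (9.5MB/s outgoing, ~4k ops/sec, 700Mbit/sec from iperf) and assumes 1 MB = 1,000,000 bytes; the variable names are mine.

```python
OUTGOING_MB_PER_S = 9.5      # peak outgoing traffic reported by Redis Enterprise
OPS_PER_S = 4000             # approximate operations per second
IPERF_MBIT_PER_S = 700       # measured client-to-server throughput

# Megabytes per second to megabits per second: 8 bits per byte.
outgoing_mbit_per_s = OUTGOING_MB_PER_S * 8

# Fraction of the measured link throughput that Redis traffic actually used.
link_fraction = outgoing_mbit_per_s / IPERF_MBIT_PER_S

# Average response size per operation, assuming 1 MB = 1,000,000 bytes.
avg_bytes_per_op = OUTGOING_MB_PER_S * 1_000_000 / OPS_PER_S

print(outgoing_mbit_per_s)       # 76.0
print(round(link_fraction, 3))   # 0.109
print(avg_bytes_per_op)          # 2375.0
```

So the Redis traffic is around 11% of what iperf showed the link could carry, with responses averaging a bit over 2 KB each.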
So, here's where we stand:
The same code works great with Azure Redis, yet causes thousands of timeouts when we switch over to Redis Enterprise.
Redis Enterprise is handling 4,000 operations per second and sending out 9 megs a second, and can usually handle a single operation in a fraction of a ms, with the very longest being 15ms.
I can send 700+ Mb/sec between the client and server.
Yet, the WORKER thread pool builds up with pending requests to Redis and eventually times out.
I'm pretty stuck here. What's a good next step to diagnose this issue? Thanks!

How to increase RabbitMQ low publish rates performance

I'm using RabbitMQ 3.6.10.
The machine has 16GB RAM and 4 cores, and the memory high watermark is set to 6GB.
I'm trying to run some tests on Rabbit, with 1 publisher and no consumers.
With 1 connection and 1 channel publishing an unlimited number of messages one after another, the management UI shows an average publish rate of ~4,500/s.
When I increase the number of channels/connections and publish in parallel, in different combinations, the rate still does not go above ~4,500/s.
I have seen many benchmarks that report far more messages per second.
I can't figure out what is causing the bottleneck. Any ideas?
In addition, when using many channels with many messages, at some point the Rabbit RAM fills up and it blocks the publishers from publishing more messages. This is good behavior, but the problem is that Rabbit then stops writing to disk and stays stuck in this state forever. Any ideas?

RabbitMQ delayed exchange plugin loads and resources

We are using RabbitMQ (3.6.6) to send millions of analyses to different analyzers. These are very quick, and we were planning to use a RabbitMQ plugin to schedule monitoring of the analyzed elements.
We were thinking about rabbitmq-delayed-exchange-plugin. We have already run some tests and need some clarification.
Currently:
We are scheduling millions of messages
Delays range from a few minutes to 24 hours
As previously said, these are tests, so we are using a machine with one core and 4G of RAM that also has other apps running on it.
What happened with a high memory watermark set up at 2.0G:
RabbitMQ eventually (after a day or so) starts consuming 100% CPU (there is only one core) and responds to neither the management interface nor rabbitmqctl. This goes on for at least 18 hours (we always end up killing it, deleting the Mnesia delayed-message file on disk, about 100-200 MB, and restarting).
What happened with a high memory watermark set up at 3.6G:
RabbitMQ was killed by the kernel because of high memory usage (4 GB hardware) after about a week of running like this.
Mnesia file for delayed exchange is about 1.5G
RabbitMQ cannot start anymore and gives the trace below (we assume that, because it was terminated with a KILL, the delayed messages somehow ended up corrupted):
{could_not_start,rabbit,
rabbitmq-server[12889]: {{case_clause,{timeout,['rabbit_delayed_messagerabbit@rabbitNode']}},
rabbitmq-server[12889]: [{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
And right now we are asking ourselves: are we in over our heads using the RabbitMQ delayed exchange plugin for this volume of information? If we are, then that is the end of the problem and we rethink and restart; but if not, what would be an appropriate hardware and/or configuration setup?
The RabbitMQ delayed exchange plugin is not designed to store millions of messages.
This is also documented on the plugin page:
"Current design of this plugin doesn't really fit scenarios with a high number of delayed messages (e.g. 100s of thousands or millions). See #72 for details."
See also: https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/issues/72
This plugin is often used as if RabbitMQ was a database. It is not.

Infinispan Distributed Cache Issues

I am new to Infinispan. We have set up an Infinispan cluster so that we can use the distributed cache for our CPU- and memory-intensive task. We are using UDP as the communication medium and Infinispan MapReduce for distributed processing. The problem we are facing is with throughput. When we run the program on a single node, it completes in around 11 minutes, processing a few hundred thousand records and finally emitting around 400 000 records.
However, when we use a cluster for the same dataset, we are seeing a throughput of only 200 records per second being updated between the nodes in the cluster, which impacts the overall processing time. I am not sure which configuration settings are hurting throughput so badly, but I am sure it's something to do with the configuration of either JGroups or Infinispan.
How can this be improved?

NServiceBus Pub/Sub Distributor/Worker Scenario Too Slow

I am working on a proof of concept implementation of NServiceBus v4.x for work.
Right now I have two subscribers and a single publisher.
The publisher can publish over 500 messages per second. It runs great.
Subscriber A runs without distributors/workers. It is a single process.
Subscriber B runs with a single distributor powering N number of workers.
In my test I hit an endpoint that creates and publishes 100,000 messages. I publish them while the subscribers are offline.
Subscriber A processes a steady 100 messages per second.
Subscriber B with 2+ workers (same result with 2, 3, or 4) struggles to top 50 messages per second gross across all workers.
It seems that in my scenario the workers (which I ramped up to 40 threads per worker) are waiting around for the distributor to give them work.
Am I missing something that could be causing the distributor to be throttled? All buses are running an unlimited Dev license.
System Information:
Intel Core i5 M520 @ 2.40 GHz
8 GB of RAM
SSD Hard Drive
UPDATE 08/06/2013: I finished deploying the system to a set of servers. I am experiencing the same results. Every server with a worker that I add decreases the performance of the subscriber.
Subscriber B has a distributor on one server and two additional servers for workers. With Subscriber B and one server with an active worker I am experiencing ~80 messages/events per second. Adding in another worker on an additional physical machine decreases that to ~50 messages per second. Also, these are "dummy messages". No logic actually happens in the handlers other than a log of the message through log4net. Turning off the logging doesn't increase performance.
Suggestions?
If you're scaling out with NServiceBus master/worker nodes on one server, then trying to measure performance is meaningless. One process with multiple threads will always do better than a distributor and multiple worker nodes on the same machine because the distributor will become a bottleneck while everything is competing for the same compute resources.
If the workers are moved to separate servers, it becomes a completely different story. The distributor is very efficient at doling out messages if that's the only thing happening on the server.
Give it a try with multiple servers and see what happens.
Rather than using a dummy handler that does nothing, can you simulate actual processing by adding some sleep time, say 5 seconds, and then compare the results of a single subscriber against going through the distributor?
Scaling out (with or without a distributor) is only useful where the work done by a single machine takes time, so that more computing resources help.
To help with this, monitor the CriticalTime performance counter on the endpoint and when you have the need, add in the distributor.
Scaling out using the distributor when needed is made easy by not having to change code, just starting the same endpoint in distributor and worker profiles.
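To see why handler time matters so much in this comparison, here is a deliberately crude Python model. It is a toy under stated assumptions, not how NServiceBus works internally: throughput is taken as the smaller of the handlers' capacity and a fixed dispatch ceiling. The 100 and 50 messages/sec ceilings and the 40 threads come from the question; the 1 ms dummy handler time, the 4 workers, and the assumption that the single process also runs 40 threads are invented for illustration.

```python
def throughput(workers: int, threads_per_worker: int,
               handler_seconds: float, dispatch_cap: float) -> float:
    """Messages/sec: limited either by handler capacity or by dispatch rate."""
    handler_capacity = workers * threads_per_worker / handler_seconds
    return min(handler_capacity, dispatch_cap)


SINGLE_CAP = 100.0       # observed ceiling for the single-process subscriber
DISTRIBUTOR_CAP = 50.0   # observed ceiling when going through the distributor
THREADS = 40             # threads per worker, as in the question

# Dummy handlers (assumed ~1 ms each): the dispatch ceiling dominates,
# so adding workers cannot help.
dummy_single = throughput(1, THREADS, 0.001, SINGLE_CAP)
dummy_scaled = throughput(4, THREADS, 0.001, DISTRIBUTOR_CAP)

# Handlers that take 5 seconds: handler capacity dominates,
# so more workers raise throughput.
slow_single = throughput(1, THREADS, 5.0, SINGLE_CAP)
slow_scaled = throughput(4, THREADS, 5.0, DISTRIBUTOR_CAP)

print(dummy_single, dummy_scaled)  # 100.0 50.0
print(slow_single, slow_scaled)    # 8.0 32.0
```

In this model the distributor setup loses with dummy handlers (50 vs 100) but wins once each message takes 5 seconds (32 vs 8), which is the kind of difference the sleep test is meant to expose.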
The whole chain is transactional, and you pay heavily for this. Spreading the workload across machines will not really increase performance unless you have very fast disk storage with write-through caching to speed up transactional writes.
Once your PoC is scaled out to several servers, try marking a message as 'Express', which skips transactional writes to the queue, and disable MSDTC on the bus instance to see what kind of performance is possible without transactions. This is not really usable in production unless you know where transactions are not mandatory, or you have an architecture that does not require DTC.