What are the recommended settings for fast queue binding in a RabbitMQ cluster - rabbitmq

When binding a queue in a RabbitMQ cluster with com.rabbit.mq:amqp-client:5.4.3, this takes a considerable amount of time when many queues are bound to an exchange within a short duration (averaging to ca. 1 second, with max times reaching 10 seconds). The RabbitMQ cluster is running 3 nodes in version 3.7.8 (Erlang 20.3.4) inside Rancher/Kubernetes.
When reducing the number of nodes down to a single node, maximum times stay well below 1 second (<700 ms. Still somewhat long, if you ask me, but acceptable). With a single node, average times range between 10 and 100 milliseconds.
I understand that replicating this information between the cluster nodes can take some time, but 1 to 2 orders of magnitude worse performance with just 3 nodes? (the same happens with 2 nodes as well, but 3 nodes is the minimum for a meaningful cluster setup)
Are there some knobs to turn to bring the time to bind a queue to an acceptable level for a cluster? Is there a configuration I'm missing? Having only a single node is a non-option with HA in mind. I couldn't find anything helpful in https://www.rabbitmq.com/clustering.html until now, maybe I missed it?
TLDR:
Are these timings expected for a RabbitMQ cluster?
What is the performance one can expect from a simple RabbitMQ (3 nodes) when binding queues to exchanges?
How many "bind" operations can be performed in e.g. 1 second?
Which factors affect this? Number of exchanges, number of queues, number of existing bindings, frequency of the operation?
What are the options to reduce the time required to create bindings?

Related

Ensure a new RabbitMQ quorum queue replicas are spread among cluster's availability zones

I'm going to run a multi-node (3 zones, 2 nodes in each, expected to grow) RabbitMQ cluster with many dynamically created quorum queues. It will be unmanageable to tinker with the queue replicas manually.
I need to ensure that a new quorum queue (1lead, 2repl) always spans all 3 AZs to be able to survive an AZ outage.
Is there a way to configure node affinity to achieve that goal?
There is one poor man's solution (actually, pretty expensive to do right) that comes on my mind:
create queues with x-quorum-initial-group-size=1
have a reconciliation script that runs periodically and adds replica members to nodes in the right zones
Of course, build in feature for configurable node affinity, which I might miss somehow, would be the best one.

Upper limit on number of redis streams consumer groups?

We are looking at using redis streams as a cluster wide messaging bus, where each node in the cluster has a unique id. The idea is that each node, when spawned, creates a consumer group with that unique id to a central redis stream to guarantee each node in the cluster gets a copy of every message. In an orchestrated environment, cluster nodes will be spawned and removed on the fly, each having a unique id. Over time I can see this resulting in there being 100's or even 1000's of old/unused consumer groups all subscribed to the same redis stream.
My question is this - is there an upper limit to the number of consumer groups that redis can handle and does a large number of (unused) consumer groups have any real processing cost? It seems that a consumer group is just a pointer stored in redis that points to the last read entry in the stream, and is only accessed when a consumer of the group does a ranged XREADGROUP. That would lead me to assume (without diving into Redis code) that the number of consumer groups really does not matter, save for the small amount of RAM that the consumer groups pointers would eat up.
Now, I understand we should be smarter and a node should delete its own consumer groups when it is being killed or we should be cleaning this up on a scheduled basis, but if a consumer group is just a record in redis, I am not sure it is worth the effort - at least at the MVP stage of development.
TL;DR;
Is my understanding correct, that there is no practical limit on the number of consumer groups for a given stream and that they have no processing cost unless used?
Your understanding is correct, there's no practical limit to the number of CGs and these do not impact the operational performance.
That said, other than the wasted RAM (which could become significant, depending on the number of consumers in the group and PEL entries), this will add time complexity to invocations of XINFO STREAM ... FULL and XINFO GROUPS as these list the CGs. Once you have a non-trivial number of CGs, every call to these would become slow (and block the server while it is executing).
Therefore, I'd still recommend implementing some type of "garbage collection" for the "stale" CGs, perhaps as soon as the MVP is done. Like any computing resource (e.g. disk space, network, mutexes...) and given there are no free lunches, CGs need to be managed as well.
P.S. IIUC, you're planning to use a single consumer in each group, and have each CG/consumer correspond to a node in your app's cluster. If that is the case, I'm not sure that you need CGs and you can use the simpler XREAD (instead of XREADGROUP) while keeping the last ID locally in the node.
OTOH, assuming I'm missing something and that there's a real need for this use pattern, I'd imagine Redis being able to support it better by offering some form of expiry for idle groups.

some questions about couchbase's replicas detail

Here I get several questions about the replica functions in couchbase, and hope can be answered. First of all, I wanna give some my own understanding ahout the couchbase; If there are 10 nodes in my cluster, and I set the number of replica to be 3 in each bucket (
actually I find that the maximal value is 3 , and I coundn't find any other way to make it larger than 3), then, does it mean that each data in bucket can only be copied to
other three nodes(I guess the three nodes should be random choosen, but could it select manually )in totally 10 nodes; Furthermore, if some of the 10 nodes have downtime,
will it cause loss of data?
I conclude my questions as follows;
1, The maximal value of the replica number in couchbase is 3, right or wrong? If wrong, how could it be largger than 3.
2, I guess the three nodes should be random choosen, but could it select manually
3, If my understanding is right, it will have loss of data when we find some nodes being in downtime. How could we avoid the loss under that condition
The maximal value of the replica number in couchbase is 3, right or wrong? If wrong, how could it be larger than 3.
The maximum number of replicas that you can have is 3, we run in production with 1 replica but it all comes down to how large your cluster is and performance impact. The more replicas you have the more inter node communication and transfer that will occur.
When you have 3 replicas this means that each node has its data replicated to 3 other nodes, this means you could handle 3 node failures in your cluster. It could happen but it is unlikely, what's more likely to happen is a machine dies and then Couchbase can automatically fail over and promote a replica held in an other node to serve requests/updates.
Couchbase's system is nice because it means you can scale up and down by failing over a node and automatic re-balancing.
I guess the three nodes should be randomly chosen, but could I select it manually?
You have no say on which nodes replicas are held, in fact I think it's a great thing that all of Couchbase's sharding and replica processes are taken out of the developers hands, it's all an automatic process.
If my understanding is right, it will have loss of data when we find some nodes being in downtime. How could we avoid data loss under that condition?
As I said before, if one node goes down then a replica is promoted, with 3 back ups you'd need 3 nodes to fail before you noticed something happening. In a production environment you should already have a warning system for each individual node, be it New Relic, Nagios etc that would report if a server dies. If there was a catastrophic problem and you lost more than 4 nodes then yes you would have data loss.
Bare in mind automatic fail over in Couchbase isn't instantaneous but still it's pretty quick. If you need downtime across the cluster say for server maintenance that needs a restart or something then it is possible to fail a node over, remove it from the cluster, perform operations and tasks on it, then add it back into the cluster and rebalance. Perform those stops again for as many nodes as you need, I've personally done that when I forgot to set specific Linux things that needed a system restart.
Overall to avoid data loss, monitor your servers, make (daily/hourly) backups of the data in your cluster and have machines that are well provisioned for your workrate.
Hope that helps!

NServiceBus Pub/Sub Distributor/Worker Scenario Too Slow

I am working on a proof of concept implementation of NServiceBus v4.x for work.
Right now I have two subscribers and a single publisher.
The publisher can publish over 500 message per second. It runs great.
Subscriber A runs without distributors/workers. It is a single process.
Subscriber B runs with a single distributor powering N number of workers.
In my test I hit an endpoint that creates and publishes 100,000 messages. I do this publish with the subscribers off line.
Subscriber A processes a steady 100 messages per second.
Subscriber B with 2+ workers (same result with 2, 3, or 4) struggles to top 50 messages per second gross across all workers.
It seems in my scenario that the workers (which I ramped up to 40 threads per worker) are waiting around for the distributor to give them work.
Am I missing something possibly that is causing the distributor to be throttled? All Buses are running an unlimited Dev license.
System Information:
Intel Core i5 M520 # 2.40 GHz
8 GBs of RAM
SSD Hard Drive
UPDATE 08/06/2013: I finished deploying the system to a set of servers. I am experiencing the same results. Every server with a worker that I add decreases the performance of the subscriber.
Subscriber B has a distributor on one server and two additional servers for workers. With Subscriber B and one server with an active worker I am experiencing ~80 messages/events per second. Adding in another worker on an additional physical machine decreases that to ~50 messages per second. Also, these are "dummy messages". No logic actually happens in the handlers other than a log of the message through log4net. Turning off the logging doesn't increase performance.
Suggestions?
If you're scaling out with NServiceBus master/worker nodes on one server, then trying to measure performance is meaningless. One process with multiple threads will always do better than a distributor and multiple worker nodes on the same machine because the distributor will become a bottleneck while everything is competing for the same compute resources.
If the workers are moved to separate servers, it becomes a completely different story. The distributor is very efficient at doling out messages if that's the only thing happening on the server.
Give it a try with multiple servers and see what happens.
Rather than have a dummy handler that does nothing, can you simulate actual processing by adding in some sleep time, say 5 seconds. And then compare the results of having a subscriber and through the distributor?
Scaling out (with or without a distributor) is only useful for where the work being done by a single machine takes time and therefore more computing resources helps.
To help with this, monitor the CriticalTime performance counter on the endpoint and when you have the need, add in the distributor.
Scaling out using the distributor when needed is made easy by not having to change code, just starting the same endpoint in distributor and worker profiles.
The whole chain is transactional. You are paying heavy for this. Increasing the workload across machines will really not increase performance when you do not have very fast disk storage with write through caching to speed up transactional writes.
When you have your poc scaled out to several servers just try to mark a messages as 'Express' which does not do transactional writes in the queue and disable MSDTC on the bus instance to see what kind of performance is possible without transactions. This is not really usable for production unless you know where this is not mandatory or what is capable when you have a architecture which does not require DTC.

distributed cluster questions about performance

I'm using 6 servers to make a cluster and they are all disk nodes. I use rabbitmq for collecting log file for our website. Now at the peak hour, the publish rate is about 30k message per second. There are 2 main consumers(hdfs and elasticsearch) and each one need to handle all message, so the delivery rate hit about 60k per second.
In my scenario, a single server can hold 10k delivery rate and I use 6 node to load balance the pressure. My solution is that I created 2 queues on each node. Each message is with a random routing-key(something like message.0, message.1, etc) to distribute the pressure to every node.
What confused me is:
All message send to one node. Should I use a HA Proxy to load balance this publish pressure?
Is there any performance difference between Durable Queues and Transient Queues?
Is there any performance difference between Memory Node and Disk Node? What I know is the difference between memory node and disk node is only about the meta data such as queue configuration.
How can I imrove the performance in publish and delivery codes? I've researched and I know several methods:
disable the confirm mechanism(in publish codes?)
enable HiPE(I've done that and it helped a lot)
For example, input is 1w mps(message per second), there are two consumers to consume all message. Then the output is 2w mps. If my server can handle 1w mps, I need two server to handle the 2w-mps-pressure. Now a new consumer need to consume all message, too. As a result, output hits 3w mps, so I need another one more server. For a conclusion, one more consumer for all message, one more server?
"All message send to one node. Should I use a HA Proxy to load balance this publish pressure?"
This article outlines a number of designs aimed at distributing load in RabbitMQ.
"Is there any performance difference between Durable Queues and Transient Queues?"
Yes, Durable Queues are backed up to disk so that they can be reinstated on server-restart, for example. This adds a nominal overhead, though the actual process occurs asynchronously.
"Is there any performance difference between Memory Node and Disk Node?"
Not that I'm aware of, but that would depend on the machine itself.
"How can I imrove the performance in publish and delivery codes?"
Try this out.