RabbitMQ active/passive HA over WAN

I'm trying to provide disaster recovery between two data centers for RabbitMQ. The secondary data center is passive until the primary DC goes down.
Federation of queues is inappropriate because it wouldn't move messages until the consumers in the secondary DC go active, and that shouldn't happen unless the primary DC is unavailable, at which point those messages are inaccessible.
I’ve considered creating an extra queue in the primary DC that would receive a copy of each message, and then using Federation or Shovel to copy those messages to the secondary. The issue then becomes removing the duplicate message from the secondary DC when the “original” in the primary DC is processed.
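For what it's worth, the "copy queue plus Shovel" approach could be wired up as a dynamic shovel declared through the management HTTP API. The sketch below is only illustrative; the hostnames, credentials, vhost, and queue names are hypothetical, and it does nothing about the duplicate-removal problem:

```python
# Sketch: declare a dynamic shovel that drains a "copy" queue in the primary DC
# and pushes its messages to the standby queue in the secondary DC.
# Hostnames, credentials, vhost and queue names are all hypothetical.
import json
import requests

MGMT = "http://rabbit-primary.example.com:15672"
AUTH = ("guest", "guest")
VHOST = "%2F"  # URL-encoded default vhost "/"

shovel = {
    "value": {
        "src-uri": "amqp://rabbit-primary.example.com",
        "src-queue": "orders.dr-copy",
        "dest-uri": "amqp://rabbit-secondary.example.com",
        "dest-queue": "orders",
        "ack-mode": "on-confirm",  # only ack upstream once the secondary confirms
    }
}

resp = requests.put(
    f"{MGMT}/api/parameters/shovel/{VHOST}/orders-dr-shovel",
    auth=AUTH,
    headers={"content-type": "application/json"},
    data=json.dumps(shovel),
)
resp.raise_for_status()
```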
Mirroring the queue to a node in the secondary DC would work, except that RabbitMQ won’t cluster over a WAN due to latency.
Has anyone else faced this scenario? Thanks.

You quite eloquently explain the issues with using Federation and Shovels to try to solve DR with RabbitMQ. Rabbit isn't really designed to move data efficiently over a WAN.
Moving data across a WAN always presents problems for messaging solutions. For instance, IBM MQ has multi-instance queue managers for HA, but needs a SAN for DR, which becomes expensive both in product cost and in processing time.
Another free product like Rabbit that you could use is Solace. It has HA and DR replication built in. It can handle the active/passive DR scenario you describe by moving each message across the WAN asynchronously in near real time. As soon as you're ready to move application traffic to the backup DC, you can activate the backup instance and start consuming messages. It automatically "removes the duplicate message" as it is consumed from the active side.

Related

Redis primary/secondary without replication

I am new to Redis. I read their documentation on Sentinel and Replication in which they talk about how the replicas try to remain in sync with the master as much as possible, but it is still possible that if the master fails after a successful write, the replica might not receive that write. If Sentinel then marks this replica as the new master, it is possible that the replica serves stale data.
If I cannot afford to lose consistency and prefer it over availability, how can I turn off replication so that when Sentinel marks a new replica as master, all the first requests would be cache misses and my cache can slowly warm up instead of returning potentially stale data?
Also, is that a good idea? Are there other good alternatives?
I cannot afford to lose consistency and prefer it over availability
It's not clear that Redis automated failover is appropriate for your application. It looks like each client would need to carefully keep track of server availability.
Suppose you have a few clients, a master M1, and three replicas R2, R3, R4. Client C5 writes a new bank account balance to M1, which immediately and permanently fails, and R2 is promoted to become master M2. The master did not obtain an acknowledgement from a replica before replying to the client; no Paxos-like consensus protocol happens between servers before the reply is sent to C5.
C5 could remember counters/timestamps embedded in each write request, forget the write payload, and detect stale reads. But client C6 could not, unless you supply such data quickly and reliably outside the protocol. Nathan Fritz observes that your app could issue a write and then PUBLISH an event, and monitor multiple replicas with a SUBSCRIBE for that event, delaying its report of success to the end user. Consider incorporating Derecho into your app if the solid promises of virtual synchrony are necessary. Production releases of Redis are targeted at a different part of the problem space than your primary interest.
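A rough sketch of that write-then-PUBLISH idea using redis-py; the hostnames, key names, and channel names are invented, and only a single replica is watched for brevity:

```python
# Sketch of "write, then PUBLISH, then watch a replica for the event".
# Seeing the published event on a replica implies the preceding SET has
# reached it, since both travel down the same replication stream.
import time
import redis

master = redis.Redis(host="redis-master.example.com")
replica = redis.Redis(host="redis-replica.example.com")

def write_with_confirmation(key: str, value: str, timeout: float = 1.0) -> bool:
    version = master.incr(f"{key}:version")   # monotonic counter for staleness checks
    pubsub = replica.pubsub()
    pubsub.subscribe(f"writes:{key}")
    master.set(key, value)
    master.publish(f"writes:{key}", version)  # announce the write
    deadline = time.time() + timeout
    while time.time() < deadline:
        msg = pubsub.get_message(timeout=0.1)
        if msg and msg["type"] == "message" and int(msg["data"]) >= version:
            return True                        # the replica has seen this write
    return False                               # don't report success to the end user
```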

Does RabbitMQ clustering include scalability too?

I want to build a RabbitMQ system that is able to scale out for the sake of performance.
I've gone through the official documentation on RabbitMQ clustering. However, its clustering doesn't seem to support scalability, because we can only publish/consume through the master queue, even though the master queue is reachable from any node of the cluster. Other than on the node where a master queue resides, we can't process any publish/consume operations.
Why do we cluster then?
Why do we cluster then?
To ensure availability.
To enforce data replication.
To spread the load/data across queues on different nodes. Master queues can be stored on different nodes and replicated with a factor smaller than the number of cluster nodes (see the policy sketch below).
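As an illustration of that last point, a mirroring policy with a replication factor smaller than the cluster size could be set through the management HTTP API. This is just a sketch; the host, credentials, name pattern, and factor are hypothetical:

```python
# Sketch: set a classic-queue mirroring policy with "exactly 2" copies on a
# 3-node cluster via the management HTTP API. Host/credentials/pattern are made up.
import json
import requests

MGMT = "http://rabbit-node-1:15672"
AUTH = ("guest", "guest")
VHOST = "%2F"  # URL-encoded default vhost "/"

policy = {
    "pattern": "^orders\\.",      # queues whose names start with "orders."
    "apply-to": "queues",
    "definition": {
        "ha-mode": "exactly",
        "ha-params": 2,           # replication factor < number of nodes (3)
        "ha-sync-mode": "automatic",
    },
}

resp = requests.put(
    f"{MGMT}/api/policies/{VHOST}/orders-mirroring",
    auth=AUTH,
    headers={"content-type": "application/json"},
    data=json.dumps(policy),
)
resp.raise_for_status()
```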
Other than on the node where a master queue resides, we can't process any publish/consume operations.
A client can connect to any node of the cluster. That node will transfer the request to the node hosting the master queue, and vice versa. The downside is that this generates an extra hop.
To answer the question in the title, "Does RabbitMQ clustering include scalability too?": yes, it does. This is achieved simply by adding nodes to, or removing nodes from, the cluster. Of course, you still have to consider high availability, that is, queue and exchange mirroring, etc.
And just to make something clear regarding:
However, its clustering doesn't seem to support scalability, because we can only publish/consume through the master queue, even though the master queue is reachable from any node of the cluster.
Publishing is done to an exchange; queues have nothing to do with publishing. A publishing client publishes only to an exchange with a routing key; it doesn't need any knowledge of the queues.
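To make that concrete, here is a minimal pika sketch; the host, exchange, and routing key are made up. Note that the publisher never names a queue:

```python
# Minimal illustration with pika: the publisher only knows the exchange and
# routing key; it never references a queue. Names and host are hypothetical.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-node-1"))
channel = conn.channel()
channel.exchange_declare(exchange="orders", exchange_type="topic", durable=True)

channel.basic_publish(
    exchange="orders",
    routing_key="orders.created.eu",
    body=b'{"order_id": 42}',
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
conn.close()
```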

How distributed should queues be in a RabbitMQ cluster?

Assume you have a small RabbitMQ system of 3 nodes that is supposed to handle 100+ decently high-volume queues in the same exchange. Given that queues only exist on the node they are created on (we're not using replicated, high-availability queues), what's the best way to create the queues? Is there any benefit to having the queues distributed among the cluster nodes, or is it better to keep them all on one node and have RabbitMQ do the routing?
It depends on your application, really.
RabbitMQ is smart about sending messages, so it'll only send a message to a node in the cluster if
a queue that holds that message resides on that node or
a consumer has connected to that node and has requested the message.
In general, you should aim to declare queues on the nodes to which both the publishers and the consumers for that queue will connect. In other words, you should aim to connect publishers and consumers to the node that holds the queues they use. This assumes you're trying to conserve overall bandwidth.
If you're using clustering to improve throughput (and you probably are), and you don't care about internal bandwidth used, you should aim to connect your publishers/consumers to the nodes in a balanced way and not worry about the internal routing mechanisms.
One last thing to think about is memory and disk space. Queues store messages in main memory and fall back to disk if that's insufficient. So, if you declare all your queues in one place, you'll end up with one node that's "over-worked" and two nodes with memory to spare.
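A small sketch of that advice with pika (the hostnames and queue names are invented): each queue is declared over a connection to the node that should host it, and the consumers for that queue would connect to the same node:

```python
# Sketch: spread queues across nodes by declaring each one over a connection
# to the node its publishers/consumers will use. Hostnames/queues are made up.
import pika

placement = {
    "rabbit-node-1": ["invoices", "payments"],
    "rabbit-node-2": ["emails"],
    "rabbit-node-3": ["audit-log"],
}

for host, queues in placement.items():
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = conn.channel()
    for queue in queues:
        # A non-mirrored (classic) queue lives on the node where it is declared,
        # so consumers for these queues should also connect to `host`.
        channel.queue_declare(queue=queue, durable=True)
    conn.close()
```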
As part of a move towards redundancy and failover in an application I'm working on, I've just finished setting up a RabbitMQ cluster behind a proxy, and have all of my publishers and consumers connect via the proxy, which round-robins connections to the individual nodes as they come in from the clients. Prior to upgrading RabbitMQ to 2.7.1, this seemed to distribute queues fairly evenly across the separate nodes, though this would of course depend pretty heavily on how your proxy balances the requests and when your clients try to connect (and declare a queue)...
Having said all that, I just upgraded to RabbitMQ 2.7.1, which was pretty painless and gave us HA queues, which is a pretty big win for our apps. At any rate, if you're interested and think it would be of benefit to your queue problem, I'd be happy to share the setup.

NService Bus: Nitty-Gritty Deployment Issues

Please consider the following questions in the context of multiple publications from a scaled-out publisher (using DB subscription storage) and multiple subscriptions with scaled-out subscribers (using distributors), where installs and uninstalls happen regularly for initial deployments, upgrades, etc. using automated MSIs.
Using DB subscription storage, what happens if the DB goes down? If access to the Subscription DB is required in order to Publish a message, how will it be delivered? Will it get lost? Will the call to Bus.Publish throw an exception?
Assuming you need to have no down-time deployments: What if you want to move your subscription DB for a particular publication to a different server? How do you manage a transition like this?
Same question goes for a distributor on the subscriber side: What if you want to move your distributor endpoint? One scenario I can think of is if you have multiple subscriptions utilizing a single distributor machine, it might be hard if you want to move some of them to another distributor server to reduce load.
What would the install/uninstall scenarios look like for a setup like this (both initially, and for continuous upgrades)? It seems like you would want to have some special install/uninstall scripts for deployment of the "logical publication" and subscription DB, as well as for the "logical subscriptions" and the distributors. The publisher instances wouldn't need any special install/uninstall logic (since they just start publishing messages using the configured subscription DB, and then stop when they are uninstalled). The subscriber worker nodes wouldn't need anything special on install other than the correct configuration of the distributor endpoint, but would need uninstall logic to make sure they are removed from the distributors list of worker nodes.
Eventually the publisher will fail and the messages will build up in the internal queue. You will have to plan the disk size you need to handle this based on the message size and how long you want to wait for the DB to come back up. Beyond that, it depends on how much downtime you can handle. You can use DB mirroring or clustering to reduce the DB's downtime.
Mirroring and clustering technologies can also help with this. It depends on whether you want manual or automatic failover and where you're doing it (remote sites?).
Clustering MSMQ could help you here. If you want to drop a distributor and move it within a cluster, you'd be OK. Another possibility is to expose your distributors via HTTP and load-balance them behind either a software or hardware load-balancing solution. Behind the load balancer you'd be freer to move things around.
Sounds like you have a good grasp on this one already :)
To your first question, about the high availability of the subscription DB: you can use a cluster for failover. If the DB is down, then yes, Bus.Publish will throw an exception. It is recommended to keep the subscription DB separate from your application DB to avoid having to bring it down when upgrading your app. This doesn't have to be a separate DB server; a separate DB on the same DB server is fine.
About moving servers: this is usually managed at the DNS level, where for a certain period you'll have both running until communication moves over.
On your third question, about distributors: don't share a distributor between different publishers or subscribers.
As a rule of thumb, it is recommended not to add/remove subscribers when doing these kinds of maintenance activities. This usually simplifies things quite a bit.

NServiceBus: Pros and Cons of using NServiceBus Distributor

I am considering using a network load balancer to load-balance messages between my subscriber instances, instead of using the NServiceBus Distributor (which, from what I can tell, is basically just a software load balancer). Each subscriber instance will have a queue of the same name for messages to be delivered to, and there will be a virtual IP that round-robins between the subscribers. The publisher will only know about the virtual IP and queue name.
Here is what I understand as the pros and cons of doing this:
PROS
No need to install NServiceBus Distributor
One less thing that would need to be managed/updated when we are scaling out (we already use an F5 to load-balance these machines, and our data center guys know it like the back of their hand)
One less point of failure (yes, the NLB could fail, but let's face it, an F5 is going to be a lot more stable than NServiceBus Distributor running on Windows)
No need to have a clustered server for our clustered MSMQ. Two servers are a lot more expensive than just adding another VIP to an F5.
CONS
The NServiceBus Distributor allows you to see the backlog of messages more easily since there is a single queue on the Distributor you can monitor. This makes it easy to know when you should add more worker nodes.
The NServiceBus Distributor is smarter about controlling the number of worker threads, etc. Does it give you more control than an NLB? (Not sure about this one.)
Have I captured this accurately? I know it is recommended to use the NServiceBus Distributor, and I would like to know more of why before I go against that recommendation.
You've got some of the main points down, but one of the main differences is that since the distributor holds on to the load itself, if a machine were to go down, the rest of the load would be distributed between the remaining machines with a much lower SLA impact on the messages.