I have some problems with my RabbitMQ HA cluster.
The problems are as follows:
I have 3 nodes in the cluster.
Nodes 2 and 3 joined node 1.
Under load, the traffic goes to node 1 and almost all of its RAM is used.
If I switch nodes, all the load goes to the next node, but its RAM usage is lower than on node 1.
Memory investigation shows that at this point all the RAM is used by RabbitMQ binary, yet binary actually uses only 1 GB of memory while 5 GB is allocated.
If I switch back, node 1 again uses more RAM than the other nodes.
What is the problem in this case?
Can anybody help me to solve this issue?
If you need more information or screenshots I can send them to you.
RabbitMQ 3.6.10
Erlang 20.3
Traffic reaches RabbitMQ via HAProxy, which runs on the same server as RabbitMQ.
I have a quick question about redis cluster.
I'm setting up a Redis cluster on Google Kubernetes Engine. I'm using the n1-highmem-2 machine type with 13GB RAM, but I'm slightly confused about how to calculate the total available size of the cluster.
I have 3 nodes, each with 13GB of RAM. I'm running 6 pods (2 on each node), with 1 master and 1 slave per node. This all works. I've assigned 6GB of RAM to each pod in my pod definition YAML file.
Is it correct to say that my total cluster size would be 18GB (3 masters * 6GB), or can I count the slaves' size toward the total size of the Redis cluster?
Redis Cluster master-slave model
In order to remain available when a subset of master nodes are failing or are not able to communicate with the majority of nodes, Redis Cluster uses a master-slave model where every hash slot has from 1 (the master itself) to N replicas (N-1 additional slaves nodes).
So, slaves are read-only replicas of the read-write masters and exist for availability; hence your total workable size is the combined size of your master pods (3 × 6GB = 18GB).
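The arithmetic can be written out as a minimal sketch. The function name and parameters are illustrative assumptions, not part of Redis or Kubernetes; the figures are the ones from the question. Treat the result as an upper bound, since the pod memory limit also has to cover Redis overhead beyond the stored data.

```python
def usable_cluster_memory_gb(masters: int, memory_per_pod_gb: int) -> int:
    """Upper bound on data capacity: only masters hold distinct data.

    Slaves mirror the hash slots of their master, so they add
    redundancy rather than capacity and are not counted here.
    """
    return masters * memory_per_pod_gb


# Figures from the question: 3 masters (plus 3 slaves), 6GB assigned per pod.
total = usable_cluster_memory_gb(masters=3, memory_per_pod_gb=6)
print(f"Usable capacity (upper bound): {total}GB")
```

Adding more slaves leaves this number unchanged; only adding masters (or giving each master pod more memory) increases it.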
Keep in mind, though, that leaving masters and slaves on the same Kubernetes node only protects against pod failure, not node failure, so you should consider redistributing them.
You didn't mention how you are installing Redis, but I'd like to mention the Bitnami Redis Helm Chart: it is built for production use, deploys 1 master and 3 slaves for good fault tolerance, and has many settings that are easily customized through the values.yaml file.
My Mule application consists of 2 nodes running in a cluster, and it listens to an IBM MQ cluster (basically connecting to 2 MQ nodes via a queue manager). There are situations where one Mule node pulls more than 80% of the messages from the MQ cluster and the other Mule node picks up the remaining 20%. This is causing CPU performance issues.
We have double-checked that the load balancing is configured properly, and the CPU performance problem occurs only occasionally. Can anybody suggest what the possible reason could be?
Example: in the last scenario there were 200000 messages in the queue, and the node2 Mule server picked up 92% of the messages from the queue within a few minutes.
This issue has been fixed now. We found the root cause: our Mule application running on MULE_NODE01 reads from and writes to WMQ_NODE01, and likewise for node 2. One of the Mule nodes (let's say MULE_NODE02) reads from the Linux/Windows file system and puts huge messages onto its corresponding WMQ_NODE02. IBM MQ then tries to push as much load as possible to the other WMQ node to balance the workload. That's why MULE_NODE01 reads all those loaded files from WMQ_NODE01, which causes the CPU usage alerts.
#JoshMc, your clue helped a lot in understanding the issue. Thanks a lot for helping.
It is the WMQ node in a cluster that tries to push the maximum load to the other WMQ node; this seems to be how MQ works internally.
To solve this, we now connect our Mule nodes to an MQ gateway rather than using 1-to-1 connectivity.
This could be solved by avoiding the race condition caused by multiple listeners:
Configure the listener in the cluster to run on the primary node only.
Republish the message to a persistent VM queue.
Move the logic to another flow that is triggered via a VM listener, and let the Mule cluster do the load balancing.
We have 2 app/web servers running an HA application, and we need to set up Redis with high availability/replication to support our app.
We are taking into account the minimum Sentinel setup requirement of 3 nodes.
We are planning to set up the first app server with the Redis master and 1 sentinel, and the second app server with the Redis slave and 1 sentinel. We plan to add one additional server to hold the third sentinel, to achieve a Sentinel setup with a quorum of 2.
Is this a valid setup? What could the risks be?
Thanks.
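One way to reason about the risks of this layout is a small sketch that checks which server failures still allow a failover. It assumes that a failover needs the configured quorum of sentinels to agree that the master is down and also a majority of all sentinels to authorize it. The server names are hypothetical.

```python
# Hypothetical server names; each hosts one sentinel, as in the plan above.
SENTINELS_PER_SERVER = {"app1": 1, "app2": 1, "extra": 1}
QUORUM = 2  # sentinels that must agree the master is down


def can_fail_over(failed_servers: set) -> bool:
    """Return True if the surviving sentinels can detect and authorize a failover."""
    total = sum(SENTINELS_PER_SERVER.values())
    alive = sum(
        count
        for server, count in SENTINELS_PER_SERVER.items()
        if server not in failed_servers
    )
    majority = total // 2 + 1  # needed to authorize the failover
    return alive >= QUORUM and alive >= majority


for failed in [{"app1"}, {"app2"}, {"extra"}, {"app1", "extra"}]:
    print(sorted(failed), "->", can_fail_over(failed))
```

Under these assumptions, losing any single server still leaves 2 sentinels, which is enough; losing two servers at once leaves only 1 sentinel, and no failover can happen.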
Well, it looks like it is not recommended to put the Redis nodes on the app servers (whereas it is recommended to put the sentinel nodes there).
We ended up with a setup based on KeyDB (a fork of Redis), which claims to be faster and to support high availability/replication (and much more), and created two nodes on the app servers.
Of course, we had to make some small changes on the client side to support some advanced Lua scripts (some binary serialized data was not getting replicated to the other node).
But after some effort, it worked as expected!
Hope this helps.
We currently have a Redis cache cluster with 3 masters and 3 slaves hosted on 3 Windows servers (1 master and 1 slave per server). We are using StackExchange.Redis as our client.
We have RDB disabled but AOF enabled, and we are experiencing some problems with the cluster in the following situation:
The disk on one of our servers became full and the Redis node on this server was unable to write to the AOF file (the error returned to the client was MISCONF Errors writing to the AOF file: No space left on device).
The cluster did not detect that the node was failing and so did not exclude it from the cluster.
All cache operations were blocked until we freed up some space on the server.
We know that we don't need the AOF, so we disabled it after the incident.
But we would like to confirm or refute our understanding of Redis clustering: we expected that if a node experienced a failure, the cluster would redirect all requests to another one. We have tested this with a stopped master node, and a slave was promoted to master, so we are confident that our cluster is working. However, we are not sure why, in our case, the node was not marked as failed.
Is the cluster capable of detecting a node failure when the failure only occurs when a client makes a request to the cluster?
What is the rationale behind requiring at least 3 ActiveMQ instances and 3 ZooKeeper servers for running a master/slave setup with replicated LevelDB storage? If the requirement is imposed by the use of ZooKeeper, which requires at least 3 servers, what is the rationale for ZooKeeper requiring at least 3 servers to provide reliability?
Is it to guarantee consistency in the case of network partitions (by sacrificing availability on the smaller partition), since in a 2-node primary-backup configuration it is impossible to distinguish between a failed peer and both nodes being in different network partitions?
Is it for providing tolerance against Byzantine failures where you need 2f+1 nodes to survive f faulty nodes (considering ONLY crash failures requires only f+1 nodes to survive f faults)?
Or is there any other reason?
Thanks!
ZooKeeper requires at least 3 servers because of how it elects a new ActiveMQ master. ZooKeeper requires a majority (n/2+1, with the division rounded down) to elect a new master. If it does not have that majority, no master will be selected and the system will fail. This is also why you use an odd number of ZooKeeper servers (e.g., 3 servers give you the same fault tolerance as 4: because of the majority requirement, you can still only lose 1 server).
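The majority rule is easy to tabulate. The sketch below uses hypothetical helper names and only illustrates the n/2+1 arithmetic; it is not ZooKeeper code.

```python
def majority(n: int) -> int:
    """Smallest number of servers that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many servers can be lost while a majority is still reachable."""
    return n - majority(n)


for n in range(1, 6):
    print(f"servers={n} majority={majority(n)} tolerated_failures={tolerated_failures(n)}")
```

With 2 servers the majority is 2, so no failure can be tolerated; 3 and 4 servers both tolerate exactly 1 failure, which is why the fourth server adds nothing in terms of fault tolerance.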
For ActiveMQ, the need for at least 3 servers comes from how messages are synced and from the fact that, when a new master is elected, at least a quorum of nodes (N/2+1) is required to identify the latest updates. ActiveMQ will sync messages with 1 slave and then respond with an OK. It will then sync asynchronously with all other slaves. If a quorum is not present when a node fails, ZooKeeper has no way to determine which node is the most up to date. This is what happens when you start with only 2 nodes, so at least 3 are recommended.
From the ActiveMQ site, under How it Works:
All messaging operations which require a sync to disk will wait for the update to be replicated to a quorum of the nodes before completing. So if you configure the store with replicas="3" then the quorum size is (3/2+1)=2. The master will store the update locally and wait for 1 other slave to store the update before reporting success. Another way to think about it is that the store will do synchronous replication to a quorum of the replication nodes and asynchronous replication to any additional nodes.
When a new master is elected, you also need at least a quorum of nodes online to be able to find a node with the latest updates. The node with the latest updates will become the new master. Therefore, it's recommended that you run with at least 3 replica nodes so that you can take one down without suffering a service outage.
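The split between synchronous and asynchronous replication described in the quote can be sketched as follows. The function and key names are hypothetical and serve only to illustrate the arithmetic for a given replicas setting; this is not ActiveMQ code.

```python
def replication_plan(replicas: int) -> dict:
    """Break a replicas setting into quorum, sync slaves, and async slaves."""
    quorum = replicas // 2 + 1
    return {
        "quorum": quorum,
        # The master stores the update itself, so it counts toward the quorum.
        "sync_slaves": quorum - 1,
        # Nodes beyond the quorum receive updates asynchronously.
        "async_slaves": replicas - quorum,
    }


for replicas in (2, 3, 5):
    print(replicas, replication_plan(replicas))
```

For replicas=3 this gives a quorum of 2, with 1 synchronous slave and 1 asynchronous slave, matching the quote. For replicas=2 the quorum is also 2, so there is no spare node and taking either one down leaves the store without a quorum.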