I want to elect a leader from a number of identical processes. All the explanations of Paxos say that some processes are Proposers, some are Voters and some are Accepters. Do I need to assign these roles to my processes when I launch them?
What if all my proposers die? Can I switch existing learners/voters to proposers?
The way I thought about this is that all processes come up as voters and then wait with a random timeout for messages. If they don't get any message before the timeout expires, they assume the role of a proposer. If they receive a message, they kick off another timeout and can become a proposer again if they don't receive any more messages before that timeout ends (or they have an agreement).
Is that a valid approach?
I am a maintainer of several production paxos systems. I'll detail what I have seen in mine and other practical/production systems.
All the explanations of Paxos say that some processes are Proposers, some are Voters and some are Accepters. Do I need to assign these roles to my processes when I launch them?
In a practical system the proposers and acceptors are the same set of nodes. That is, proposer and acceptor are two roles taken on by the same process.
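To make "two roles, one process" concrete, here is a minimal single-decree sketch in Python. It is an in-memory illustration under stated assumptions, not production Paxos: the `Node` class, its method names, and the synchronous method calls standing in for network messages are all hypothetical, and ballots are assumed to be unique per proposer.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One process that plays both the proposer and the acceptor role."""
    node_id: int
    promised: int = -1                           # highest ballot promised (acceptor state)
    accepted: Optional[Tuple[int, str]] = None   # (ballot, value) last accepted

    # --- acceptor role ---
    def on_prepare(self, ballot: int):
        """Promise to ignore lower ballots; report any value already accepted."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def on_accept(self, ballot: int, value: str) -> bool:
        """Accept the value unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

    # --- proposer role ---
    def propose(self, peers: List["Node"], ballot: int, value: str) -> Optional[str]:
        """Return the chosen value, or None if this ballot did not win a majority."""
        majority = len(peers) // 2 + 1
        replies = [p.on_prepare(ballot) for p in peers]
        promises = [acc for ok, acc in replies if ok]
        if len(promises) < majority:
            return None
        prior = [acc for acc in promises if acc is not None]
        if prior:
            # Must adopt the value of the highest-ballot proposal already accepted.
            value = max(prior, key=lambda a: a[0])[1]
        acks = sum(p.on_accept(ballot, value) for p in peers)
        return value if acks >= majority else None
```

Any node can call `propose` at any time, so no process has to be launched as a dedicated proposer. A second proposer with a higher ballot ends up re-proposing the value that was already chosen.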
What if all my proposers die?
Many Paxos systems (such as those used as kv stores) work on a sequence of paxos instances (like a transaction log). It is infeasible to assume that the same set of processes will live forever, so there must be mechanisms to change the set of nodes in the quorum.
In some of the systems I maintain there are two parts to the proposed/chosen value: the payload from the customer and the Paxos membership. The Paxos round is run by the quorum chosen in the prior value. Doing it this way does impede the ability of your system to pipeline the values chosen; you'll have to batch them instead if you want to choose multiple values at once. Alternatively, you can look at the way Raft chooses membership.
In Raft choosing quorum members is a two-phase process. First, the new quorum is proposed and committed; both quorums are used for a while; and then the new quorum takes over. Specifically, in the interim a majority of both the old and the new quorums is required to commit anything, including the take-over by the new quorum.
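The interim rule can be sketched as a small check. This is an illustration only: the function names and the use of plain sets of node names are assumptions, not anything from a Raft library.

```python
def has_majority(acks: set, members: set) -> bool:
    """True if strictly more than half of `members` have acknowledged."""
    return len(acks & members) > len(members) // 2


def joint_commit(acks: set, old_members: set, new_members: set) -> bool:
    """During the transition an entry commits only with a majority of BOTH quorums."""
    return has_majority(acks, old_members) and has_majority(acks, new_members)
```

For example, with old members `{a, b, c}` and new members `{c, d, e}`, acknowledgements from `c`, `d` and `e` are a majority of the new quorum but not of the old one, so nothing commits yet.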
Related
When people describe Paxos, they always assume that there are already some proposers in the cluster. But where are the proposers from, or what decides which processes to be proposers?
How the cluster is initially configured and how it is changed is down to the administrator who is trying to optimise the system.
You can run the different roles on different hosts and have different numbers of them. We could run three proposers, five acceptors and seven learners, whatever you choose. Clients that need to write a value only need to connect to proposers. With multi-Paxos for state replication clients only need to connect to proposers as that is sufficient and the clients don't need to exchange messages with any other role type. Yet there is nothing to prevent clients from also being learners by seeing messages from acceptors.
As long as you follow the Paxos algorithm it all comes down to minimising network hops (latency and bandwidth), costs of hardware, and complexity of the software for your particular workload.
From a practical perspective your clients need to be able to find proposers in the face of failures. A cluster administrator will be configuring which nodes are to be proposers and making sure that they are discovered by clients.
It is hard to visualize from descriptions of the abstract algorithm how things might work as many messaging topologies are possible. When applying the algorithm to a practical application it's far more obvious what setup minimises latency, bandwidth, hardware and complexity. An example might be a three node MySQL cluster running Paxos. You want all three servers to have all the data so they are all learners. All must be acceptors as you need three at a minimum to have one node fail and still maintain progress. They may as well all be proposers to give the best availability and simplicity of software and configuration. Note that one will become the distinguished leader. The database administrator doesn't think about the Paxos roles as they just set up a three-node database cluster.
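The "three at a minimum" figure follows from majority quorums. A quick arithmetic sketch (the function names are illustrative):

```python
def majority(n: int) -> int:
    """Smallest number of acceptors that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many acceptors can fail while a majority is still reachable."""
    return n - majority(n)
```

With three acceptors the majority is two, so one node may fail. Four acceptors still tolerate only one failure, which is why odd cluster sizes are the usual choice.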
The roles in the cluster may need to change. For example, you might want to expand the capacity of a database cluster. Or a server might die so you need to change the cluster membership to swap the dead one for a fresh one. For the Paxos algorithm to work every process must have a strongly consistent view of which processes are in which roles. How do you get consensus? You use Paxos to fix a new value of the cluster membership.
I've found different zookeeper definitions across multiple resources. Maybe some of them are taken out of context, but look at them pls:
A canonical example of Zookeeper usage is distributed-memory computation...
ZooKeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster.
Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
I've worked with Redis and Hazelcast, that would be easier for me to understand Zookeeper by comparing it with them.
Could you please compare Zookeeper with in-memory-data-grids and Redis?
If distributed-memory computation, how does zookeeper differ from in-memory-data-grids?
If synchronization across a cluster, then how does it differ from all other in-memory storages? The same in-memory-data-grids also provide cluster-wide locks. Redis also has some kind of transactions.
If it's only about in-memory consistent data, then there are other alternatives. IMDGs allow you to achieve the same, don't they?
https://zookeeper.apache.org/doc/current/zookeeperOver.html
By default, Zookeeper replicates all your data to every node and lets clients watch the data for changes. Changes are sent very quickly (within a bounded amount of time) to clients. You can also create "ephemeral nodes", which are deleted within a specified time if a client disconnects. ZooKeeper is highly optimized for reads, while writes are very slow (since they generally are sent to every client as soon as the write takes place). Finally, the maximum size of a "file" (znode) in Zookeeper is 1MB, but typically they'll be single strings.
Taken together, this means that ZooKeeper is not meant to store much data, and is definitely not a cache. Instead, it's for managing heartbeats/knowing what servers are online, storing/updating configuration, and possibly message passing (though if you have large numbers of messages or high throughput demands, something like RabbitMQ will be much better for this task).
Basically, ZooKeeper (and Curator, which is built on it) helps in handling the mechanics of clustering -- heartbeats, distributing updates/configuration, distributed locks, etc.
It's not really comparable to Redis, but for the specific questions...
It doesn't support any computation and for most data sets, won't be able to store the data with any performance.
It's replicated to all nodes in the cluster (there's nothing like Redis clustering where the data can be distributed). All messages are processed atomically in full and are sequenced, so there are no real transactions. It can be USED to implement cluster-wide locks for your services (it's very good at that in fact), and there are a lot of locking primitives on the znodes themselves to control which nodes access them.
Sure, but ZooKeeper fills a niche. It's a tool for making distributed applications play nice with multiple instances, not for storing/sharing large amounts of data. Compared to using an IMDG for this purpose, ZooKeeper will be faster, manages heartbeats and synchronization in a predictable way (with a lot of APIs for making this part easy), and has a "push" paradigm instead of "pull" so nodes are notified very quickly of changes.
The quotation from the linked question...
A canonical example of Zookeeper usage is distributed-memory computation
... is, IMO, a bit misleading. You would use it to orchestrate the computation, not provide the data. For example, let's say you had to process rows 1-100 of a table. You might put 10 ZK nodes up, with names like "1-10", "11-20", "21-30", etc. Client applications would be notified of this change automatically by ZK, and the first one would grab "1-10" and set an ephemeral node clients/192.168.77.66/processing/rows_1_10
The next application would see this and go for the next group to process. The actual data to compute would be stored elsewhere (ie Redis, SQL database, etc). If the node failed partway through the computation, another node could see this (after 30-60 seconds) and pick up the job again.
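The claim-and-recover flow can be sketched with an in-memory stand-in. To be clear about the assumptions: `FakeCoordinator` is a hypothetical simulation of ephemeral-node semantics, not the ZooKeeper API, and the session ids and item names are made up for illustration.

```python
class FakeCoordinator:
    """In-memory stand-in for a coordination service; NOT the ZooKeeper API."""

    def __init__(self):
        self.claims = {}  # work item -> id of the session that claimed it

    def try_claim(self, session: str, item: str) -> bool:
        """Claim an item; fails if another live session already holds it."""
        if item in self.claims:
            return False
        self.claims[item] = session
        return True

    def expire(self, session: str) -> None:
        """Drop every ephemeral claim owned by a session that has died."""
        self.claims = {i: s for i, s in self.claims.items() if s != session}


def next_job(coord: FakeCoordinator, session: str, items):
    """Return the first unclaimed item for this session, or None if all are taken."""
    for item in items:
        if coord.try_claim(session, item):
            return item
    return None
```

When a worker's session expires, its claim disappears and the item becomes available to the next worker that asks, which mirrors the "another node could pick up the job again" behaviour described above.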
I'd say the canonical example of ZooKeeper is leader election, though. Let's say you have 3 nodes -- one is master and the other 2 are slaves. If the master goes down, a slave node must become the new leader. This type of thing is perfect for ZK.
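A common way to do this with ZooKeeper is for each candidate to create an ephemeral sequential node and treat the lowest sequence number as the leader. The sketch below simulates that idea in memory; `ElectionRoom` and its methods are hypothetical names, not ZooKeeper or Curator calls.

```python
import itertools


class ElectionRoom:
    """Simulates ephemeral sequential nodes: lowest live sequence number leads."""

    def __init__(self):
        self._seq = itertools.count()
        self.members = {}  # sequence number -> node name

    def join(self, name: str) -> int:
        """Register a candidate and return its sequence number."""
        seq = next(self._seq)
        self.members[seq] = name
        return seq

    def leave(self, seq: int) -> None:
        """Called when a candidate's session ends (crash or shutdown)."""
        self.members.pop(seq, None)

    def leader(self):
        """Name of the candidate with the lowest sequence number, if any."""
        return self.members[min(self.members)] if self.members else None
```

When the current leader's entry vanishes, the next-lowest candidate becomes leader without any further negotiation.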
Consistency Guarantees
ZooKeeper is a high performance, scalable service. Both reads and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
Sequential Consistency
Updates from a client will be applied in the order that they were sent.
Atomicity
Updates either succeed or fail -- there are no partial results.
Single System Image
A client will see the same view of the service regardless of the server that it connects to.
Reliability
Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc) the client will not know if the update has applied or not. We take steps to minimize the failures, but the guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
Timeliness
The client's view of the system is guaranteed to be up-to-date within a certain time bound. (On the order of tens of seconds.) Either system changes will be seen by a client within this bound, or the client will detect a service outage.
LMAX Disruptor is generally implemented using the following approach:
As in this example, the Replicator is responsible for replicating the input events\commands to the slave nodes. Replicating across a set of nodes requires us to apply consensus algorithms if we want the system to be available in the presence of network failures, master failure and slave failures.
I was thinking of applying RAFT consensus algorithm to this problem. One observation is that: "RAFT requires that the input event\commands are stored to the disk (durable storage) during replication" (Reference this link)
This observation essentially means that we cannot perform in-memory replication. Hence it appears that we might have to combine the functionality of the replicator and the journaller to be able to successfully apply the RAFT algorithm to LMAX.
There are two options to do this:
Option 1: Using the replicated log as input event queue
The receiver would read from the network and push the event onto the replicated log instead of the ring buffer
A separate "reader" can read from the log and publish the events onto the ring buffer.
The log can be replicated across nodes using RAFT. We do not need the replicator and journaller as the functionality is already accomplished by RAFT's replicated log
I think a disadvantage of this option has to do with the fact that we do an additional data copy step (receiver to event queue instead of the ring buffer).
Option 2: Use Replicator to push input events\commands to slave's input log file
I was wondering if there is any other solution to design of Replicator? What are the different design options that people have employed for replicators? Particularly any design that can support in-memory replication?
Your intuition is correct about folding the replication and journalling into the Raft component. But, the Raft protocol dictates exactly when things need to be stored on disk.
Here are two different ways to look at it.
I'm assuming there is no hefty computation, such as transaction processing, before the replication because you don't have any in your diagrams.
I, personally, would do the first because it separates concerns into different processes. If I was implementing Raft for myself I would take the first half of the second scenario and put it in its own process.
External Raft Replication
In which Raft is implemented by an external process.
The replication component outsources to an external Raft process the business of replication. After some time, Raft responds to the replication component that it is, in fact, replicated. The replication component updates the items in the ring buffer, and moves its published cursor forward. The business logic sees the published cursor (via waitFor) and consumes the freshly replicated data.
In this scenario, the replication component probably has a lot of in-flight events, so its read cursor is far ahead of the cursor it publishes to the business logic.
There is no need for a journalling component in this scenario because the external raft system does the journalling for you.
Note, the replication may be the slowest component of the system!
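The two-cursor idea can be sketched as follows. This is a simplified, single-threaded illustration rather than Disruptor code: `ReplicationGate` and its methods are hypothetical names, and the acknowledgements are assumed to arrive from whatever external Raft process you use.

```python
class ReplicationGate:
    """Tracks what has been sent to Raft versus what is safe to show downstream."""

    def __init__(self):
        self.read_cursor = -1       # last sequence handed to the Raft process
        self.published_cursor = -1  # last sequence visible to the business logic
        self._acked = set()         # acknowledged sequences not yet published

    def submit(self, seq: int) -> None:
        """Hand the next ring-buffer slot to the external Raft process."""
        if seq != self.read_cursor + 1:
            raise ValueError("slots must be submitted in order")
        self.read_cursor = seq

    def on_replicated(self, seq: int) -> int:
        """Record an acknowledgement and publish the contiguous acknowledged prefix."""
        self._acked.add(seq)
        while self.published_cursor + 1 in self._acked:
            self.published_cursor += 1
            self._acked.discard(self.published_cursor)
        return self.published_cursor

    def in_flight(self) -> int:
        """Number of events sent to Raft but not yet visible to the business logic."""
        return self.read_cursor - self.published_cursor
```

Raft commits entries in log order, so acknowledgements would normally arrive in order; handling gaps here is purely defensive. The business logic only ever waits on `published_cursor`.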
Integrated Raft Replication
In which raft is implemented in the same process as the "Real Business Logic."
In terms of Raft, replication is the business logic. Actually, you have multiple levels of business logic, or equivalently, multiple stages of business logic.
I'm going to use two input disruptors and two output disruptors for this to emphasize the separate business logic. You can combine, split, or rearrange to your heart's content. Or your profiler's content.
The first stage, as I mentioned, is Raft replication. Client events go into the Replication Input Disruptor. The Raft logic picks them up, perhaps in batches, and sends them out to the Followers on the Replication Output Disruptor. All Raft messages also go into the Replication Input Disruptor. The Raft logic also picks these up and sends the appropriate responses to the appropriate Followers/Master on the Replication Output Disruptor.
A journaller component hangs off the Input Ring Buffer; it only has to handle certain types of messages as dictated by Raft. This will likely be the slowest part of the system.
When the data is considered replicated, it is moved to the second stage, via the "Real Business Logic" Input Disruptor. There it is processed, sent to the Client Outbound Disruptor, and then sent to one of your millions of happy paying customers.
I am designing a replication algorithm, to promote a master among many slaves. I want it to be faster and simpler than Paxos. The basic idea is:
Assign each node a 'Promotion Priority', for example for 5 nodes there would be priorities: 50,40,30,20 and 10, 50 the highest and 10 the lowest.
When master needs to be elected, all slaves will send (at the same time) the other 4 nodes a message requesting to become a master, but only that master will be elected that will be confirmed by all slaves with a confirmation message. A slave will send confirmation message if its own 'Promotion Priority' is lower than the asking node, or if the asking node with higher priority times out to issue rejection message for its own request.
If a slave receives a rejection message from slave with higher 'Promotion Priority' it will abort the procedure.
There should be no nodes with the same priority.
There will be a minimum number of confirmation messages that a slave should collect in order to become a master.
This algorithm should be faster because all the slaves will be electing a master in parallel and the priority will help to speed up the process.
What do you think about it? Does any other algorithm for master promotion with priority exists?
What do you think about it?
It is hard to completely assess the validity of your algorithm without knowing the details of your requirements. Overall, it looks like a valid approach, but there are a few issues that I think deserve some attention.
Your question has some similarities to A distributed algorithm to assign a shared resource to one node out of many. Consequently, some of the arguments raised in my answer to that question hold for this question as well.
When master needs to be elected, all slaves will send (at the same
time) the other 4 nodes a message requesting to become a master, but
only that master will be elected that will be confirmed by all slaves
with a confirmation message.
This approach assumes that all slaves know how many slaves are present at any time -- otherwise the supposed master can never draw the conclusion when it has received a confirmation from all slaves. Implicitly, this means that no slaves can leave and join the system without breaking the algorithm.
In practice though, these slaves will come and go, because of crashes, reboots, network outages etc. The chances of this increase with the number of slaves, but whether or not this is a problem depends on your requirements. How fault tolerant does your system have to be?
By the way, since you mention that there are many slaves, I assume that you are using multicast or broadcast to send the request messages. Otherwise, depending on what many means to you, your set-up could be error prone with regard to administrating where all slaves reside.
A slave will send confirmation message if its own 'Promotion Priority'
is lower than the asking node, or if the asking node with higher
priority times out to issue rejection message for its own request.
Similar to the previous remark: a slave might draw the wrong conclusion if some slave has a problem responding for whatever reason. In fact, if one slave is down or has a network problem, all other slaves will draw the same (most likely erroneous) conclusion that the non-responsive slave is the master.
This algorithm should be faster because all the slaves will be
electing a master in parallel
The issues raised in this answer are almost inherent to doing the master selection in a distributed fashion though, and hard to resolve without introducing some kind of centralized decision maker. You gain some, you lose some...
Does any other algorithm for master promotion with priority exists?
Another approach would be to have all slaves in the system constantly maintain administration about who is the current master. This could be done (at the cost of some network bandwidth) by having every slave multicast/broadcast its priority periodically, via some sort of heartbeat message. As a result, every slave will be aware of every other slave, and at the moment that a master needs to be selected, every slave can do that instantly. Network issues or other "system health" problems will be detected because heartbeats are missed. This algorithm is flexible with regard to slaves joining and leaving the system. The higher the heartbeat frequency, the more responsive your system will be to topology changes. However, you might still run into issues of slaves drawing independent conclusions because of a disconnected network. If that is a problem, then you might not be able to solve this in a completely parallel fashion.
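A minimal sketch of the heartbeat bookkeeping each slave would keep, assuming timestamps are passed in explicitly and that a node records its own heartbeat like any other. `HeartbeatTable` and its parameters are illustrative names only.

```python
class HeartbeatTable:
    """Per-node view of which peers are alive, built from received heartbeats."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # node name -> (priority, time of last heartbeat)

    def on_heartbeat(self, node: str, priority: int, now: float) -> None:
        """Record a heartbeat carrying the sender's promotion priority."""
        self.last_seen[node] = (priority, now)

    def master(self, now: float):
        """Highest-priority node whose heartbeat is still fresh, or None."""
        live = {n: p for n, (p, t) in self.last_seen.items()
                if now - t <= self.timeout}
        return max(live, key=live.get) if live else None
```

Every node evaluates `master` locally, so the selection is instant, but two nodes on opposite sides of a network partition will see different tables and can pick different masters, which is the caveat noted above.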
I have implemented a wcf service and now, my client wants it to have three copies of it, working independently on different machines. A master-slave approach. I need to find a solution that will enable behavior:
The first service that is instantiated "asks" the other two whether they are alive. If not, it becomes the master and is the one that is active on the net. The other two, once instantiated, see that there is already a master alive, so they become slaves and start sleeping. There needs to be some mechanism to periodically check whether the master is dead and, if so, choose the next copy that is alive to become the master (until it dies in turn).
This, I think, should be a kind of architectural pattern, so I would be more than glad to be given any advice.
thanks
I would suggest looking at the WCF peer channel (System.Net.PeerToPeer) to facilitate each node knowing about the other nodes. Here is a link that offers a decent introduction.
As for determining which node should be the master, the trick will be negotiating which node should be the master if two or more nodes come online at about the same time. Once the nodes become aware of each other, there needs to be some deterministic mechanism for establishing the master. For example, you could use the earliest creation time, the lowest value of the last octet of each node's IP address, or anything really. You'll just need to define some scheme that allows the nodes to negotiate this automatically.
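One possible deterministic scheme, sketched in Python for brevity (the same logic carries over directly to C#): order peers by creation time and break ties with the last octet of the IP address. The `Peer` type and `pick_master` function are hypothetical, and IPv4 addresses are assumed.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class Peer(NamedTuple):
    ip: str            # IPv4 address of the node (assumed format a.b.c.d)
    created: datetime  # when the service instance started


def pick_master(peers: Iterable[Peer]) -> Peer:
    """Earliest creation time wins; ties go to the lowest last IP octet."""
    return min(peers, key=lambda p: (p.created, int(p.ip.rsplit(".", 1)[1])))
```

Because every node applies the same ordering to the same peer list, they all arrive at the same master regardless of the order in which they discovered each other.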
Finally, as for checking if the master is still alive, I would suggest using the event-based mechanism described here. The master could send out periodic health-and-status events that the other nodes would register for. I'd put a try/catch/finally block at the code entry point so that if the master were to crash, it could publish one final MasterClosing event to let the slaves know it's going away. What this does not account for is a system crash, e.g., power failure, etc. To handle this, provide a timeout in the slaves so that when the timeout expires, the slaves can query the master to see if it's still there. If not, the slaves can negotiate between themselves using your deterministic algorithm about who should be the next master.