In distributed systems, why do we need 2n+1 nodes to handle n failures? - replication

I recently read that in order to handle the failure of n-nodes, the data has to be replicated on 2n+1 nodes. I could not understand the reasoning behind that. Could someone please explain?

This is the valid quorum configuration that requires the least number n of processes to tolerate f faults.
In detail, for fault tolerance, you can never wait for reading or writing to all processes, otherwise you'll block when at least one of them crashes. You need to read and write from sub-sets.
Given that you're not writing and reading all of them, you have to be sure that (1) you read from at least one process that has the latest version of data and that (2) every two writes intersect, such that one of them aborts. These are the quorum rules.
Finally, having n = 2f+1 processes and writing to f+1 is the configuration where you need the least n for f. You might still obey the quorum with a larger write quorum with a smaller read quorum, but then you need more processes to ensure that writes never block waiting for failed processes.

Ok, so think about it like this. A polynomial of degree n is defined uniquely by n+1 points. The proof for this is rather long and requires some knowledge of linear algebra so I will just link it here. Thus, if you want to send a message, you can derive the polynomial that encodes the message ( optimally through some mutually agreed standard so the person who receives the message will know what to do ). But how many points do you send through your channel? If you know the channel will drop n packets and the person receiving requires n+1 packets to read the message, you will need to interpolate your polynomial using the n+1 points you want to send and then calculate n additional points that lie on that polynomial and send the whole set of 2n+1 points so that the person receiving will always be able to reconstruct your polynomial and read the message.

Related

How a proposer know its propose is not approved by a quorum of acceptors?

I am reading "paxos" on wiki, and it reads:
"Rounds fail when multiple Proposers send conflicting Prepare messages, or when the Proposer does not receive a Quorum of responses (Promise or Accepted). In these cases, another round must be started with a higher proposal number."
But I don't understand how the proposer tells the difference between its proposal not being approved and it just takes more time for the message to transmit?
One of the tricky parts to understanding Paxos is that the original paper and most others, including the wiki, do not describe a full protocol capable of real-world use. They only focus on the algorithmic necessities. For example, they say that a proposer must choose a number "n" higher than any previously used number. But they say nothing about how to actually go about doing that, the kinds of failures that can happen, or how to resolve the situation if two proposers simultaneously try to use the same proposal number (as in both choosing n=2). That actually completely breaks the protocol and would lead to incorrect results but I'm not sure I've ever seen that specifically called out. I guess it's just supposed to be "obvious".
Specifically to your question, there's no perfect way to tell the difference using the raw algorithm. Practical implementations typically go the extra mile by sending a Nack message to the Proposer rather than just silently ignoring it. There are plenty of other tricks that can be used but all of them, including the nacks, come with varying downsides. Which approach is best generally depends on both the kind of application employing Paxos and the environment it's intended to run in.
If you're interested, I put together a much longer-winded description of Paxos that includes many of issues practical implementations must address in addition to the core components. It covers this issue along with several others.
Specific to your question it isn't possible for a proposer to distinguish between lost messages, delayed messages, crashed acceptors or stalled acceptors. In each case you get no response. Typically an implementation will timeout on getting less than a quorum response and resend the proposal on the assumption messages were dropped or acceptors are rebooting.
Often implementations add "nack" messages as negative acknowledgement as an optimisation to speed up recovery. The proposer only gets "nack" responses from nodes that are reachable that have accepted a higher promise. The ”nack” can show both the highest promise and also the highest instance known to be fixed. How this helps will be outlined below.
I wrote an implementation of Paxos called TRex with some of these techniques sticking as closely as possible to the description of the algorithm in the paper Paxos Made Simple. I wrote up a description of the practical considerations of timeouts and nacks on a blog post.
One of the interesting techniques it uses is for a timed out node to make the first proposal with a very low number. This will always get "nack" messages. Why? Consider a three node cluster where one network link breaks between a stable proposer and one other node. The other node will timeout and issue a prepare. If it issues a high prepare it will get a promise from the third node. This will interrupt the stable leader. You then have symmetry where the two nodes that cannot message one another can fight with the leadership swapping with no forward progress.
To avoid this a timed out node can start with a low prepare. It can then look at the "nack" messages to learn from the third node that there is a leader who is making progress. It will see this as the highest instance known to be fixed in the nack will be greater than the local value. The timed out node can then not issue a high prepare and instead ask the third node to send it the latest fixed and accepted values. With that enhancement a timed out node can now distinguish between a stable proposer crashing or the connection failing. Such ”nack” based techniques don't affect the correctness of the implementation they are only an optimisation to ensure fast failover and forward progress.

How does one handled skipped event numbers with Paxos?

If we are running multi-paxos then a node may see:
Propose(N)
Accept!(N,Vn)
Accept!(N+1,Vm)
Accept!(N+4,Vo) // huh? where is +2, +3?
Accept!(N+5,Vp)
This may be because either:
There was a stable leader but the network local to this node dropped else delayed +2 and +3.
There was an outage such that there were two attempts to propose such that +2 and +3 were failed rounds proposals
In general operations on the distributed finite state machine wont commute such that a node should apply all operations in order. This implies that a node needs to be able to distinguish between the two cases. If it is failed rounds of proposals the node has no problems. If it is lost messages it suggests that the node should wait till they turn up else try to recover the lost data (e.g. request a snapshot to reinitialise and catchup).
What are the options or strategies to handle this and what overhead do they create?
This question is inspired by In Paxos, can an Acceptor accept a different value after it has already accepted one?
I can think of two methods to deal with this.
The simplest approach would be to have the node that is missing +2 and +3 to go back and try to propose no-ops in those slots. If there were decisions there, the node will learn the data in the prepare round. Otherwise, no-ops will be decided.
Another approach would be to have an out-of-band re-learning process. This may be necessary anyway: how does a node catch up if it joins the system after the others?
Or you can use a combination of both. The leader can propose no-ops for any holes in its history, the others can use the re-learning process. This is how my paxos system works.

Algorithm for distributed messaging?

I have a distributed application across which I'd like to replicate a single, eventually consistent state. The data is suitable for a CRDT (http://pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-6956.pdf) which has the excellent property that each node, given the same set of messages, will deterministically converge to the same value without complicated consensus protocols.
However, I need another messaging/log layer that will ensure that each node actually sees every message, even in the face of adverse network conditions.
Specifically, I'm looking for an algorithm that has the following properties:
Works on an asynchronous network.
Nodes are only necessarily aware of their neighbors, not the whole network.
Nodes may be added or dropped at any time (that is, the network is not of a fixed size or topology).
The network can be acyclic (this can be a requirement, if necessary).
Is capable of bringing up to date a node that has become behind due to temporary network outage or dropped messages.
Is capable of bringing a new, empty node joining the cluster up to date.
There is not a hard limit on the time taken for the network to converge on a value (that is, for every node to recieve every message), but given no partitions it should be fairly quick (in fuzzy terms, a matter of seconds, not minutes).
Is bounded in size. Algorithms that keep the entire message history (which will grow boundlessly) are unsuitable.
Is anyone aware of an algorithm with these properties?

why should a producer write to odd number of servers in case of a distributed message queue

In a recent interview, i was asked to design a distributed message queue. I modeled it as a multi-partitioned system where each partition has a replica set with one primary and one or more secondaries for high availability. The writes from the producer are processed by the primary and are replicated synchronously, which means a message is not committed unless a quorum of the replica set has applied it. He then identified the potential availability problem when the primary of a replica set dies (which means a producer writing to that partition won't be able to write until a new primary is elected for the replica set) and asked me about the solution where the producer writes to the same message to multiple servers (favoring availability instead of consistency). He then asked me what would be the difference if the client wrote to 2 servers vs 3 servers, a question i failed to answer. In general, i thought it was more of an Even vs Odd question and I guessed it had something to do with quorums (i.e. majority) but failed to see how it would impact a consumer reading data. Needless to say, this question cost me the job and still continues to puzzle me to this day. I would appreciate any solutions and/or insights and/or suggestions for one.
Ok, this is what I understood from your question about the new system:
You won't have a primary replica anymore so you don't need to elect one and instead will work simply on a quorum based system to have a higher availability? - if that is correct than maybe this will give you some closure :) - otherwise feel free to correct me.
Assuming you read and write from / to multiple random nodes and those nodes don't replicate the data on their own, the solution lies in the principle of quorums. In simple cases that means that you need to write and read always at least to/from n/2 + 1 nodes. So if you would write to 3 nodes you could have up to 5 servers, while if you'd write to 2 nodes you could only have up to 3 servers.
The slightly more complicated quorum is based on the rules:
R + W > N
W > N / 2
(R - read quorum, W - write quorum, N - number of nodes)
This would give you some more variations for
from how many servers you need to read
how many servers you can have in general
From my understanding for the question, that is what I would have used to formulate an answer and I don't think that the difference between 2 and 3 has anything to do with even or odd numbers. Do you think this is the answer your interviewer was looking for or did I miss something?
Update:
To clarify as the thoughts in the comment are, which value would be accepted.
In the quorum as I've described it, you would accept the latest value. The can be determined with a simple logical clock. The quorums guarantee that you will retrieve at least one item with the latest information. And in case of a network partitioning or failure when you can't read the quorum, you will know that it's impossible guarantee retrieving the latest value.
On the other hand you suggested to read all items and accept the most common one. I'm not sure, this alone will guarantee to have always the latest item.

What is cyclic and acyclic communication?

So I've already searched if there was a question like this posted before, but I wasn't able to find the answer I liked.
I've been working with some PLCs and variable frequency drives lately and thought it was about time I finally found out what cyclic and non-cyclic communication is.
So correct me if I'm wrong, but when I think of cyclic data, I think of data that is continuously being updated and is able to be sent/sampled to other devices. With relation to what I'm doing, I'm thinking that the variable frequency drive is able to update information such as speed and frequency that can be sampled/read from a PLC. This is what I would consider cyclic communication, something that is always updating a certain type of information that can be sent as data.
So I might be completely wrong with this assumption, and that leaves me with the question of what exactly would be considered non-cyclic or acyclic communication.
Any help?
Forenote: This is mostly a programming based site, and while your question does have an answer within the contexts of programming, I happen to know that in your industrial application, the importance of cyclic vs acyclic tends to be very hardware/protocol specific, and is really more of a networking problem than a programming one.
Cyclic data is not simply "continuous" data. In industry, it refers to data delivered on a guaranteed (or at least highly predictable) schedule. If the data stream were to violate the schedule, it could have disastrous consequences (a VFD misses its shutdown command by a fraction of a second, and you lose your arm!).
Acyclic data is still reliable for machine control, it is just delivered in a less deterministic way (on the order of milliseconds, sometimes up to several seconds). When accessing a single VFD with a single PLC, you will probably never notice this bursting behavior, and in fact, you may perceive smoother and quicker data transmissions. From the hardware interface perspective, acyclic data transfer does not provide as strong of a guarantee about if or when one machine will respond to the request of another.
Both forms of data transfer deliver data at speeds much faster than humans can deal with, but in certain applications they will each have their own consequences.
Cyclic networks usually must take the form of master/slave, where only one device is allowed speak at a time, and answers are always returned, even if just to confirm that the message was received. Cyclic networks usually do not allow as many devices on the same wire, and often they will pass larger amounts of data at slower rates.
Acyclic networks might be thought of as a bit more choatic, but since they skip handshaking formalities, they can often cheat more devices onto the network and get higher speeds all at the same time. This comes at the cost of occasional data collisions/bottlenecks, and sometimes even, requests for critical data are simply ignored/lost with no indication of failure or success from the target ( in the case the sender will likely be sitting and waiting desperately for a message it will not get, and often then trigger process watchdogs that will shutdown the system).
From a programmer perspective, not much is different between these two transmission types.
What will usually dictate a situation,
how many devices are running on the wire (sometimes this forces the answer right away)
how sensitive/volatile is the data they want to share (how useful are messages if they are a little late)
how much data they might be required to send at any given time ( shifting demands on a network that already produces race conditions can be hard to anticipate/avoid if you don't see it coming before hand).
Hope that helps :)