How does one handle skipped event numbers with Paxos?

If we are running multi-paxos then a node may see:
Propose(N)
Accept!(N,Vn)
Accept!(N+1,Vm)
Accept!(N+4,Vo) // huh? where is +2, +3?
Accept!(N+5,Vp)
This may be because either:
There was a stable leader, but the network local to this node dropped or delayed +2 and +3.
There was an outage during which there were two attempts to propose, such that +2 and +3 were failed proposal rounds.
In general, operations on the distributed finite state machine won't commute, so a node should apply all operations in order. This implies that a node needs to be able to distinguish between the two cases. If it is failed proposal rounds, the node has no problem. If it is lost messages, the node should either wait until they turn up or try to recover the lost data (e.g. request a snapshot to reinitialise and catch up).
What are the options or strategies to handle this and what overhead do they create?
This question is inspired by "In Paxos, can an Acceptor accept a different value after it has already accepted one?"

I can think of two methods to deal with this.
The simplest approach would be to have the node that is missing +2 and +3 go back and try to propose no-ops in those slots. If there were decisions in those slots, the node will learn the values in the prepare round. Otherwise, no-ops will be decided.
Another approach would be to have an out-of-band re-learning process. This may be necessary anyway: how does a node catch up if it joins the system after the others?
Or you can use a combination of both. The leader can propose no-ops for any holes in its history, the others can use the re-learning process. This is how my paxos system works.
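A minimal sketch of the no-op approach in Python, assuming a hypothetical propose(slot, value) helper that runs a full Paxos round for one slot and returns whatever value was decided (these names are illustrative, not from any particular library):

NOOP = object()  # sentinel command that leaves the state machine unchanged

def fill_gaps(log, next_slot, propose):
    # log maps slot -> decided value; next_slot is the first slot not yet
    # seen. Propose a no-op in every hole so the log can be applied in
    # order. If a real value was already chosen in a slot, the prepare
    # phase reveals it and the round decides that value instead.
    for slot in range(min(log, default=next_slot), next_slot):
        if slot not in log:
            log[slot] = propose(slot, NOOP)
    return log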

Related

How does a proposer know its proposal is not approved by a quorum of acceptors?

I am reading "paxos" on wiki, and it reads:
"Rounds fail when multiple Proposers send conflicting Prepare messages, or when the Proposer does not receive a Quorum of responses (Promise or Accepted). In these cases, another round must be started with a higher proposal number."
But I don't understand how the proposer tells the difference between its proposal not being approved and the messages simply taking more time to arrive?
One of the tricky parts to understanding Paxos is that the original paper and most others, including the wiki, do not describe a full protocol capable of real-world use. They only focus on the algorithmic necessities. For example, they say that a proposer must choose a number "n" higher than any previously used number. But they say nothing about how to actually go about doing that, the kinds of failures that can happen, or how to resolve the situation if two proposers simultaneously try to use the same proposal number (as in both choosing n=2). That actually completely breaks the protocol and would lead to incorrect results but I'm not sure I've ever seen that specifically called out. I guess it's just supposed to be "obvious".
Specifically to your question, there's no perfect way to tell the difference using the raw algorithm. Practical implementations typically go the extra mile by sending a Nack message to the Proposer rather than just silently ignoring it. There are plenty of other tricks that can be used but all of them, including the nacks, come with varying downsides. Which approach is best generally depends on both the kind of application employing Paxos and the environment it's intended to run in.
If you're interested, I put together a much longer-winded description of Paxos that includes many of the issues practical implementations must address in addition to the core components. It covers this issue along with several others.
Specific to your question, it isn't possible for a proposer to distinguish between lost messages, delayed messages, crashed acceptors or stalled acceptors. In each case it gets no response. Typically an implementation will time out on getting less than a quorum of responses and resend the proposal, on the assumption that messages were dropped or acceptors are rebooting.
Often implementations add "nack" messages as negative acknowledgements, an optimisation to speed up recovery. The proposer only gets "nack" responses from reachable nodes that have made a higher promise. The "nack" can show both the highest promise and the highest instance known to be fixed. How this helps will be outlined below.
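For concreteness, a nack might carry the two fields just described; a sketch in Python, with assumed field names:

from dataclasses import dataclass

@dataclass
class Nack:
    slot: int                # the log instance the rejected message was for
    promised_number: int     # highest ballot number this acceptor has promised
    highest_fixed_slot: int  # highest log instance the acceptor knows is fixed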
I wrote an implementation of Paxos called TRex with some of these techniques sticking as closely as possible to the description of the algorithm in the paper Paxos Made Simple. I wrote up a description of the practical considerations of timeouts and nacks on a blog post.
One of the interesting techniques it uses is for a timed-out node to make the first proposal with a very low number. This will always get "nack" messages. Why? Consider a three-node cluster where one network link breaks between a stable proposer and one other node. The other node will time out and issue a prepare. If it issues a high prepare it will get a promise from the third node. This will interrupt the stable leader. You then have symmetry, where the two nodes that cannot message one another fight, the leadership swapping back and forth with no forward progress.
To avoid this, a timed-out node can start with a low prepare. It can then look at the "nack" messages to learn from the third node that there is a leader who is making progress: the highest instance known to be fixed in the nack will be greater than the local value. The timed-out node can then refrain from issuing a high prepare and instead ask the third node to send it the latest fixed and accepted values. With that enhancement, a timed-out node can distinguish between a stable proposer crashing and the connection failing. Such "nack"-based techniques don't affect the correctness of the implementation; they are only an optimisation to ensure fast failover and forward progress.
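A sketch of that decision logic, reusing the Nack shape above (all names are assumptions, not the TRex API). The node first issues a prepare with a deliberately low number, collects the nacks, and only escalates if nothing shows evidence of a live leader:

def after_low_prepare(nacks, my_highest_fixed, issue_high_prepare, request_catchup):
    # If any reachable node knows of a fixed slot beyond ours, a leader is
    # still making progress: our link to it is broken, not the leader itself.
    if any(n.highest_fixed_slot > my_highest_fixed for n in nacks):
        request_catchup()     # ask that node for the latest fixed values
    else:
        issue_high_prepare()  # no evidence of progress: try to lead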

In distributed systems, why do we need 2n+1 nodes to handle n failures?

I recently read that in order to handle the failure of n-nodes, the data has to be replicated on 2n+1 nodes. I could not understand the reasoning behind that. Could someone please explain?
This is the valid quorum configuration that requires the least number n of processes to tolerate f faults.
In detail, for fault tolerance you can never wait to read from or write to all processes, otherwise you'll block when even one of them crashes. You need to read and write from subsets.
Given that you're not writing and reading all of them, you have to be sure that (1) you read from at least one process that has the latest version of data and that (2) every two writes intersect, such that one of them aborts. These are the quorum rules.
Finally, having n = 2f+1 processes and writing to f+1 is the configuration where you need the least n for f. You might still obey the quorum rules with a larger write quorum and a smaller read quorum, but then you need more processes to ensure that writes never block waiting for failed processes.
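A quick check of that arithmetic in Python: with n = 2f+1 processes and read/write quorums of size f+1, any two quorums overlap in at least one process, and a quorum can still be formed with f processes down:

for f in range(1, 5):
    n = 2 * f + 1
    q = f + 1           # both read and write quorum size
    assert q + q > n    # any two quorums intersect in at least one process
    assert q <= n - f   # a quorum can still form with f processes failed
    print("f=%d: n=%d, quorum=%d" % (f, n, q))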
Ok, so think about it like this. A polynomial of degree n is defined uniquely by n+1 points. The proof for this is rather long and requires some knowledge of linear algebra, so I will just link it here. Thus, if you want to send a message, you can derive the polynomial that encodes it (optimally through some mutually agreed standard, so the person who receives the message will know what to do). But how many points do you send through your channel? If you know the channel will drop n packets and the receiver requires n+1 packets to read the message, you interpolate your polynomial using the n+1 points you want to send, calculate n additional points that lie on that polynomial, and send the whole set of 2n+1 points. That way the receiver will always be able to reconstruct your polynomial and read the message.
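To make the polynomial argument concrete, a small Python sketch using exact rational arithmetic: encode n+1 message values as points on a degree-n polynomial, send 2n+1 points, discard any n of them, and recover the message from the n+1 survivors by Lagrange interpolation:

from fractions import Fraction

def lagrange_eval(points, x):
    # Evaluate the unique polynomial passing through `points` at x.
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

message = [3, 1, 4]                          # n + 1 = 3 values, so n = 2
base = list(enumerate(message))              # points (0,3), (1,1), (2,4)
sent = [(x, lagrange_eval(base, x)) for x in range(5)]  # 2n + 1 = 5 points
survivors = sent[2:]                         # the channel dropped n = 2 points
recovered = [lagrange_eval(survivors, x) for x in range(3)]
assert recovered == message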

why should a producer write to odd number of servers in case of a distributed message queue

In a recent interview, I was asked to design a distributed message queue. I modeled it as a multi-partitioned system where each partition has a replica set with one primary and one or more secondaries for high availability. The writes from the producer are processed by the primary and are replicated synchronously, which means a message is not committed unless a quorum of the replica set has applied it. He then identified the potential availability problem when the primary of a replica set dies (which means a producer writing to that partition won't be able to write until a new primary is elected for the replica set) and asked me about the solution where the producer writes the same message to multiple servers (favoring availability over consistency). He then asked me what the difference would be if the client wrote to 2 servers vs 3 servers, a question I failed to answer. In general, I thought it was more of an even-vs-odd question and guessed it had something to do with quorums (i.e. majority), but failed to see how it would impact a consumer reading data. Needless to say, this question cost me the job and still continues to puzzle me to this day. I would appreciate any solutions and/or insights and/or suggestions.
Ok, this is what I understood from your question about the new system:
You won't have a primary replica anymore, so you don't need to elect one and will instead work on a simple quorum-based system for higher availability? If that is correct, then maybe this will give you some closure :) - otherwise feel free to correct me.
Assuming you read and write from/to multiple random nodes and those nodes don't replicate the data on their own, the solution lies in the principle of quorums. In simple cases that means you always need to write to, and read from, at least n/2 + 1 nodes. So if you write to 3 nodes you could have up to 5 servers, while if you write to 2 nodes you could only have up to 3 servers.
The slightly more complicated quorum is based on the rules:
R + W > N
W > N / 2
(R - read quorum, W - write quorum, N - number of nodes)
This would give you some more variation in:
how many servers you need to read from
how many servers you can have in general
From my understanding for the question, that is what I would have used to formulate an answer and I don't think that the difference between 2 and 3 has anything to do with even or odd numbers. Do you think this is the answer your interviewer was looking for or did I miss something?
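A few lines of Python enumerating the valid (R, W) pairs those two rules allow for a given cluster size N:

def valid_quorums(n):
    # All (R, W) pairs satisfying R + W > N and W > N / 2.
    return [(r, w) for r in range(1, n + 1) for w in range(1, n + 1)
            if r + w > n and 2 * w > n]

print(valid_quorums(3))   # includes (2, 2), the classic majority choice
print(valid_quorums(5))   # a larger W permits a smaller R, and vice versa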
Update:
To clarify, as asked in the comments: which value would be accepted?
In the quorum as I've described it, you would accept the latest value. This can be determined with a simple logical clock. The quorums guarantee that you will retrieve at least one item with the latest information. And in the case of a network partition or failure where you can't read a quorum, you will know that it's impossible to guarantee retrieving the latest value.
On the other hand, you suggested reading all items and accepting the most common one. I'm not sure this alone will guarantee that you always have the latest item.
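A minimal sketch of the "accept the latest value" approach: each stored copy carries a logical clock (a version counter bumped on every write), and the reader takes the highest-versioned copy among the replies, refusing to answer if it heard from fewer than R nodes. Names here are illustrative:

def read_latest(replies, read_quorum):
    # replies: list of (version, value) pairs from individual nodes
    if len(replies) < read_quorum:
        raise RuntimeError("quorum not reached; latest value not guaranteed")
    return max(replies)[1]  # tuple comparison orders by version first

print(read_latest([(3, "b"), (5, "c"), (4, "b")], read_quorum=3))  # -> "c"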

What is cyclic and acyclic communication?

So I've already searched if there was a question like this posted before, but I wasn't able to find the answer I liked.
I've been working with some PLCs and variable frequency drives lately and thought it was about time I finally found out what cyclic and non-cyclic communication is.
So correct me if I'm wrong, but when I think of cyclic data, I think of data that is continuously being updated and is able to be sent/sampled to other devices. With relation to what I'm doing, I'm thinking that the variable frequency drive is able to update information such as speed and frequency that can be sampled/read from a PLC. This is what I would consider cyclic communication, something that is always updating a certain type of information that can be sent as data.
So I might be completely wrong with this assumption, and that leaves me with the question of what exactly would be considered non-cyclic or acyclic communication.
Any help?
Forenote: This is mostly a programming-based site, and while your question does have an answer within the context of programming, I happen to know that in your industrial application the importance of cyclic vs acyclic tends to be very hardware/protocol-specific, and is really more of a networking problem than a programming one.
Cyclic data is not simply "continuous" data. In industry, it refers to data delivered on a guaranteed (or at least highly predictable) schedule. If the data stream were to violate the schedule, it could have disastrous consequences (a VFD misses its shutdown command by a fraction of a second, and you lose your arm!).
Acyclic data is still reliable for machine control, it is just delivered in a less deterministic way (on the order of milliseconds, sometimes up to several seconds). When accessing a single VFD with a single PLC, you will probably never notice this bursting behavior, and in fact, you may perceive smoother and quicker data transmissions. From the hardware interface perspective, acyclic data transfer does not provide as strong a guarantee about whether or when one machine will respond to the request of another.
Both forms of data transfer deliver data at speeds much faster than humans can deal with, but in certain applications they will each have their own consequences.
Cyclic networks usually must take the form of master/slave, where only one device is allowed to speak at a time, and answers are always returned, even if just to confirm that the message was received. Cyclic networks usually do not allow as many devices on the same wire, and often they will pass larger amounts of data at slower rates.
Acyclic networks might be thought of as a bit more chaotic, but since they skip handshaking formalities, they can often cheat more devices onto the network and get higher speeds all at the same time. This comes at the cost of occasional data collisions/bottlenecks, and sometimes requests for critical data are simply ignored/lost with no indication of failure or success from the target (in that case the sender will likely sit waiting desperately for a message it will not get, and often then trigger process watchdogs that shut down the system).
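As a toy contrast in Python (the function names are made up, not any fieldbus API): a cyclic master polls every slave on a fixed schedule, padding each scan to keep the cycle time, while an acyclic request goes out on demand and the caller must handle a reply that never arrives:

import time

CYCLE_TIME = 0.010  # 10 ms scan, a typical order of magnitude

def cyclic_scan(slaves, poll):
    # Poll every slave once per cycle; sleep out the remainder so each
    # device is serviced at a predictable interval.
    while True:
        start = time.monotonic()
        for slave in slaves:
            poll(slave)  # an answer is always expected
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, CYCLE_TIME - elapsed))

def acyclic_request(send, recv, request, timeout=1.0):
    # Send one on-demand request; the reply may be lost, so we time out.
    send(request)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        reply = recv()
        if reply is not None:
            return reply
    raise TimeoutError("no reply; acyclic transfer has no delivery guarantee")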
From a programmer perspective, not much is different between these two transmission types.
What will usually dictate the choice:
how many devices are running on the wire (sometimes this forces the answer right away)
how sensitive/volatile is the data they want to share (how useful are messages if they are a little late)
how much data they might be required to send at any given time (shifting demands on a network that already produces race conditions can be hard to anticipate/avoid if you don't see them coming beforehand).
Hope that helps :)

Redis mimic MASTER/MASTER? or something else?

I have been reading a lot of the posts on here and surfing the web, but maybe I am not asking the right question. I know that Redis is currently Master/slave until Cluster becomes available. However, I was wondering if someone can tell me how I would want to configure Redis logistically to meet my needs (or if its not the right tool).
Scenario:
We have 2 sites on opposite ends of the US. We want clients to be able to write at each site at a high volume. We then want each client to be able to perform reads at their site as well. However, we want the data from a write to be available at the sister site in < 50ms. Given that we have plenty of bandwidth, is there a way to configure Redis to meet our needs? Our writes' maximum size would be on the order of 5k, usually much less. The main point is: how can I have 2 masters that sync to one another, even if it is not supported by default?
The catch with Tom's answer is that you are not running any sort of cluster, you are just writing to two servers. This is a problem if you want to ensure consistency between them. Consider what happens when your client fails a write to the remote server. Do you undo the write to local? What happens to the application when you can't write to the remote server? What happens when you can't read from the local?
The second catch is the fundamental physics issue Joshua raises. For a round trip you are talking a theoretical minimum of 38ms, leaving a theoretical maximum processing time on both ends (of three systems) of 12ms. I'd say that expectation is a bit too much, and bandwidth has nothing to do with latency in this case. You could have a 10GB pipe and those timings still apply. That said, transferring 5k across the continent in 12ms is asking a lot as well. Are you sure you've got the connection capacity to transfer 5k of data in 50ms, let alone 12? I've been on private no-utilization circuits across the continent and seen ping times exceeding 50ms - and ping isn't transferring 5k of data.
How will you keep the two unrelated servers in-sync? If you truly need sub-50ms latency across the continent, the above theoretical best-case means you have 12ms to run synchronization algorithms. Even one query to check the data on the other server means you are outside the 50ms window. If the data is out of sync, how will you fix it? Given the above timings, I don't see how it is possible to synchronize in under 50ms.
I would recommend revisiting the fundamental design requirements. Specifically, why this requirement? Latency requirements of 50ms round trip across the continent are usually the sign of marketing or lack of attention to detail. I'd wager that if you analyze the requirements you'll find that this 50ms window is excessive and unnecessary. If it isn't, and data synchronization is actually important (likely), then someone will need to determine whether the significant extra effort to write synchronization code is worth it, or even possible, within the 50ms window. Cross-continent sub-50ms data sync is not a simple issue.
If you have no need for synchronization, why not simply run one server? You could use a slave on the other side of the continent for recovery-only purposes. Of course, that still means that best-case you have 12ms to get the data over there and back. I would not count on 50ms round trip operations+latency+5k/10k data transfer across the continent.
It's about 19ms at the speed of light to cross the US. <50ms is going to be hard to achieve.
http://www.wolframalpha.com/input/?i=new+york+to+los+angeles
This is probably best handled as part of your client - just have the client write to both nodes. Writes generally don't need to be synchronous, so sending the extra command shouldn't affect the performance you get from having a local node.
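A sketch of that dual-write client using redis-py, with placeholder hostnames. Note the consistency caveats raised above: if the remote write fails, the two sites diverge and nothing in this sketch repairs that.

import redis

local = redis.Redis(host="redis-east.example.com")   # node at our site
remote = redis.Redis(host="redis-west.example.com")  # node at the sister site

def dual_write(key, value):
    local.set(key, value)        # fast local write on the critical path
    try:
        remote.set(key, value)   # best-effort remote write in this sketch
    except redis.RedisError:
        pass  # a real system would queue a retry, alert, or undo locally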