Why should a producer write to an odd number of servers in the case of a distributed message queue - replication

In a recent interview, I was asked to design a distributed message queue. I modeled it as a multi-partitioned system where each partition has a replica set with one primary and one or more secondaries for high availability. Writes from the producer are processed by the primary and replicated synchronously, meaning a message is not committed unless a quorum of the replica set has applied it. The interviewer then pointed out the availability problem when the primary of a replica set dies (a producer writing to that partition cannot write until a new primary is elected) and asked about the alternative where the producer writes the same message to multiple servers directly (favoring availability over consistency). He then asked what the difference would be if the client wrote to 2 servers vs 3 servers, a question I failed to answer. In general, I thought it was more of an even-vs-odd question and guessed it had something to do with quorums (i.e. majority), but I failed to see how it would impact a consumer reading the data. Needless to say, this question cost me the job and continues to puzzle me to this day. I would appreciate any solutions, insights, or suggestions.

Ok, this is what I understood from your question about the new system:
You won't have a primary replica anymore, so you don't need to elect one, and instead will work simply on a quorum-based system to get higher availability? If that is correct, then maybe this will give you some closure :) - otherwise feel free to correct me.
Assuming you read from and write to multiple random nodes, and those nodes don't replicate the data on their own, the solution lies in the principle of quorums. In the simple case that means you always need to write to and read from at least n/2 + 1 nodes. So if you write to 3 nodes you can have up to 5 servers, while if you write to 2 nodes you can only have up to 3 servers.
The slightly more complicated quorum is based on the rules:
R + W > N
W > N / 2
(R - read quorum, W - write quorum, N - number of nodes)
This would give you some more variation in:
- how many servers you need to read from
- how many servers you can have in general
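As a small illustration (a Python sketch of just those two rules, using the numbers from above):

```python
def is_valid_quorum(n, w, r):
    """Check the two quorum rules: R + W > N and W > N / 2."""
    return (r + w > n) and (w > n / 2)

# Writing to and reading from 3 of 5 servers satisfies both rules,
# but the same quorum of 3 is not enough once you have 6 servers.
print(is_valid_quorum(n=5, w=3, r=3))  # True
print(is_valid_quorum(n=6, w=3, r=3))  # False
print(is_valid_quorum(n=3, w=2, r=2))  # True
```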
From my understanding of the question, that is what I would have used to formulate an answer, and I don't think the difference between 2 and 3 has anything to do with even or odd numbers. Do you think this is the answer your interviewer was looking for, or did I miss something?
Update:
To clarify, as asked in the comments: which value would be accepted?
In the quorum as I've described it, you would accept the latest value. That can be determined with a simple logical clock. The quorums guarantee that you will retrieve at least one item with the latest information. And in the case of a network partition or failure where you can't read a quorum, you will know that it's impossible to guarantee retrieving the latest value.
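For illustration, a quorum read along those lines might look roughly like this (the node objects and their read() method returning a (logical_clock, value) pair are assumptions made up for the sketch):

```python
def quorum_read(nodes, read_quorum):
    """Read from the nodes and return the value with the highest logical
    clock, provided at least `read_quorum` of them answered."""
    replies = []
    for node in nodes:
        try:
            replies.append(node.read())   # assumed to return (logical_clock, value)
        except ConnectionError:
            continue                      # node unreachable, try the rest
    if len(replies) < read_quorum:
        # Below the quorum we cannot guarantee we saw the latest write.
        raise RuntimeError("read quorum not reached")
    return max(replies, key=lambda reply: reply[0])[1]
```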
On the other hand, you suggested reading all items and accepting the most common one. I'm not sure this alone will guarantee that you always get the latest item.


Can zookeeper coordinate across clusters

Background:
We have a single cluster containing 2 app server nodes and 3 replicated DB nodes. Our application is a .NET app, deployed on Linux app servers.
We will move to a multi-cluster architecture with clusters on separate continents in the near future. Those clusters will replicate our existing cluster.
Question 1:
Can I use Zookeeper as a means to achieve consistency? Example: I would like to avoid an operation in NY and a similar operation in the EU occurring simultaneously and ending up inconsistent.
Everything I read about Zookeeper points to a single-cluster solution, and I would like to avoid implementing my own distributed locks.
Question 2:
Do you have a suggestion other than implementing Zookeeper?
Many thx
The reason I did not get any responses is, more than likely, that what I am asking for is impossible at worst, or will bring the clusters to their knees due to the synchronization demands at best.
I now believe that I will be better off allowing inconsistencies and dealing with them during an eventual-consistency sync step.

How does a proposer know its proposal is not approved by a quorum of acceptors?

I am reading "paxos" on wiki, and it reads:
"Rounds fail when multiple Proposers send conflicting Prepare messages, or when the Proposer does not receive a Quorum of responses (Promise or Accepted). In these cases, another round must be started with a higher proposal number."
But I don't understand how the proposer tells the difference between its proposal not being approved and the messages simply taking more time to transmit.
One of the tricky parts to understanding Paxos is that the original paper and most others, including the wiki, do not describe a full protocol capable of real-world use; they only focus on the algorithmic necessities. For example, they say that a proposer must choose a number "n" higher than any previously used number, but they say nothing about how to actually go about doing that, the kinds of failures that can happen, or how to resolve the situation if two proposers simultaneously try to use the same proposal number (as in both choosing n=2). That actually completely breaks the protocol and would lead to incorrect results, but I'm not sure I've ever seen it specifically called out. I guess it's just supposed to be "obvious".
Specifically to your question: there's no perfect way to tell the difference using the raw algorithm. Practical implementations typically go the extra mile by sending a Nack message to the proposer rather than just silently ignoring it. There are plenty of other tricks that can be used, but all of them, including the nacks, come with varying downsides. Which approach is best generally depends on both the kind of application employing Paxos and the environment it's intended to run in.
If you're interested, I put together a much longer-winded description of Paxos that includes many of the issues practical implementations must address in addition to the core components. It covers this issue along with several others.
Specific to your question, it isn't possible for a proposer to distinguish between lost messages, delayed messages, crashed acceptors, or stalled acceptors: in each case it gets no response. Typically an implementation will time out on getting less than a quorum of responses and resend the proposal, on the assumption that messages were dropped or acceptors are rebooting.
Often implementations add "nack" messages as negative acknowledgements, as an optimisation to speed up recovery. The proposer only gets "nack" responses from reachable nodes that have accepted a higher promise. The "nack" can carry both the highest promise and the highest instance known to be fixed. How this helps is outlined below.
I wrote an implementation of Paxos called TRex using some of these techniques, sticking as closely as possible to the description of the algorithm in the paper Paxos Made Simple. I wrote up a description of the practical considerations around timeouts and nacks in a blog post.
One of the interesting techniques it uses is for a timed-out node to make its first proposal with a very low number. This will always get "nack" messages. Why? Consider a three-node cluster where one network link breaks between a stable proposer and one other node. The other node will time out and issue a prepare. If it issues a high prepare, it will get a promise from the third node, which will interrupt the stable leader. You then have a symmetry where the two nodes that cannot message one another fight, with the leadership swapping back and forth and no forward progress.
To avoid this, a timed-out node can start with a low prepare. It can then look at the "nack" messages to learn from the third node that there is a leader making progress: the highest instance known to be fixed in the nack will be greater than its own local value. The timed-out node can then refrain from issuing a high prepare and instead ask the third node to send it the latest fixed and accepted values. With that enhancement, a timed-out node can distinguish between a stable proposer crashing and a connection failing. Such "nack"-based techniques don't affect the correctness of the implementation; they are only an optimisation to ensure fast failover and forward progress.
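As a rough illustration of the timeout/nack handling described above, here is a simplified sketch (this is not TRex; the Nack shape and the transport callbacks send_prepare / poll_message are invented for the example):

```python
import time
from dataclasses import dataclass

@dataclass
class Nack:
    highest_promise: int   # highest ballot number the acceptor has promised
    highest_fixed: int     # highest instance the acceptor knows is fixed

def run_prepare_round(send_prepare, poll_message, quorum_size,
                      proposal_number, local_highest_fixed, timeout=0.5):
    """One prepare round. Returns 'promised', 'behind' (a nack shows another
    leader is making progress, so back off and catch up), or 'timeout'
    (indistinguishable from lost messages; retry later with a higher number)."""
    send_prepare(proposal_number)
    promises, nacks = 0, []
    deadline = time.time() + timeout
    while time.time() < deadline and promises < quorum_size:
        msg = poll_message()            # returns "promise", a Nack, or None
        if msg == "promise":
            promises += 1
        elif isinstance(msg, Nack):
            nacks.append(msg)
    if promises >= quorum_size:
        return "promised"
    if any(nack.highest_fixed > local_highest_fixed for nack in nacks):
        return "behind"                 # follow the leader instead of competing
    return "timeout"                    # could be dropped messages or a dead quorum
```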

In distributed systems, why do we need 2n+1 nodes to handle n failures?

I recently read that in order to handle the failure of n nodes, the data has to be replicated across 2n+1 nodes. I could not understand the reasoning behind that. Could someone please explain?
This is the valid quorum configuration that requires the least number n of processes to tolerate f faults.
In detail, for fault tolerance you can never wait to read from or write to all processes, otherwise you'll block as soon as one of them crashes; you need to read and write using subsets.
Given that you're not writing to and reading from all of them, you have to be sure that (1) you read from at least one process that has the latest version of the data, and that (2) any two writes intersect, so that one of them aborts. These are the quorum rules.
Finally, having n = 2f+1 processes and writing to f+1 of them is the configuration that needs the least n for a given f. You can still obey the quorum rules with a larger write quorum and a smaller read quorum, but then you need more processes to ensure that writes never block waiting for failed processes.
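One way to convince yourself of the "least n" claim is to brute-force it; the sketch below just re-applies the two quorum rules and additionally requires that both quorums stay reachable with f processes down:

```python
def min_nodes(f):
    """Smallest n (with write/read quorum sizes) satisfying R + W > N and
    W > N / 2 while both quorums remain reachable after f crashes."""
    n = 1
    while True:
        for w in range(1, n + 1):
            for r in range(1, n + 1):
                rules_ok = (r + w > n) and (w > n / 2)
                survives_f = (w <= n - f) and (r <= n - f)
                if rules_ok and survives_f:
                    return n, w, r
        n += 1

print(min_nodes(1))  # (3, 2, 2)
print(min_nodes(2))  # (5, 3, 3) -- i.e. n = 2f + 1, quorums of f + 1
```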
Ok, so think about it like this. A polynomial of degree n is defined uniquely by n+1 points. The proof for this is rather long and requires some knowledge of linear algebra, so I will just link it here. Thus, if you want to send a message, you can derive the polynomial that encodes the message (ideally through some mutually agreed standard, so the person who receives the message will know what to do with it). But how many points do you send through your channel? If you know the channel will drop n packets and the receiver requires n+1 packets to read the message, you interpolate your polynomial through the n+1 points you want to send, calculate n additional points that lie on that polynomial, and send the whole set of 2n+1 points, so that the receiver will always be able to reconstruct your polynomial and read the message.
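If it helps, here is a toy version of that counting argument in Python with numpy (real erasure codes such as Reed-Solomon work over finite fields rather than floats, so this only illustrates the idea):

```python
import numpy as np

message = [3, 1, 4]              # "message" = coefficients of a degree-n polynomial
n = len(message) - 1             # here n = 2

xs = np.arange(2 * n + 1)        # 2n + 1 sample points to transmit
ys = np.polyval(message, xs)     # their values on the polynomial

# Pretend the channel dropped n of the points; any n + 1 survivors are enough
# to re-interpolate the degree-n polynomial and recover the coefficients.
survivors = [0, 2, 4]
recovered = np.polyfit(xs[survivors], ys[survivors], n)
print(np.round(recovered))       # [3. 1. 4.]
```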

Solandra Sharding: Insider Thoughts

Just got started on Solandra and was trying to understand the 2nd-level details of Solandra sharding.
AFAIK Solandra creates the configured number of shards (the "solandra.shards.at.once" property), where each shard holds up to "solandra.maximum.docs.per.shard" documents. At the next level it starts creating slots inside each shard, defined by "solandra.maximum.docs.per.shard" / "solandra.index.id.reserve.size".
What I understood from the data model of the SchemaInfo CF is that inside a particular shard there are slots owned by different physical nodes, and there is a race between nodes to claim these slots.
My questions are:
1. Does this mean that if I request a write on a particular Solr node, e.g. .....solandra/abc/dataimport?command=full-import, the request gets distributed to all possible nodes? Is this a distributed write? Because until that happens, how would other nodes compete for slots inside a particular shard? Ideally the code for writing a doc or a set of docs would be executed on a single physical JVM.
2. By sharding we tried to write some docs on a single physical node, but if writes land in slots owned by different physical nodes, what did we actually achieve, given that we again need to fetch results from different nodes? I understand that the write throughput is maximized.
3. Can we look into tuning these numbers: "solandra.maximum.docs.per.shard", "solandra.index.id.reserve.size", "solandra.shards.at.once"?
4. If I have just one shard and a replication factor of 5 in a single-DC, 6-node setup, I saw that the endpoints of this shard contain 5 endpoints as per the replication factor. But what happens to the 6th node? I saw through nodetool that the remaining 6th node doesn't really get any data. If I increase the replication factor to 6 while keeping the cluster up and then run repair, will that solve the problem, or is there a better way?
Overall, the shards.at.once param is used to control the parallelism of indexing: the higher that number, the more shards are written to at once. If you set it to one, you will always be writing to only one shard. Normally this should be set about 20% higher than the number of nodes in the cluster, so for a four-node cluster set it to five.
The higher the reserve size, the less coordination between the nodes is needed, so if you know you have lots of documents to write then raise this.
The higher docs.per.shard, the slower queries for a given shard will become. In general this should be 1-5M max.
To answer your points:
1. This will only import from one node, but it will index across many nodes, depending on shards.at.once.
2. I think the question is: should you write across all nodes? Yes.
3. Yes, see above.
4. If you increase shards.at.once, this will be populated quickly.

Redis: mimic MASTER/MASTER? Or something else?

I have been reading a lot of the posts on here and surfing the web, but maybe I am not asking the right question. I know that Redis is currently master/slave until Cluster becomes available. However, I was wondering if someone can tell me how I would want to configure Redis logistically to meet my needs (or if it's not the right tool).
Scenario:
We have 2 sites on opposite ends of the US. We want clients to be able to write at each site at high volume, and we want each client to be able to perform reads at their own site as well. However, we want data from a write to be available at the sister site in < 50ms. Given that we have plenty of bandwidth, is there a way to configure Redis to meet our needs? Our maximum write size would be on the order of 5k, usually much less. The main point is: how can I have 2 masters that are syncing to one another, even if that is not supported by default?
The catch with Tom's answer is that you are not running any sort of cluster; you are just writing to two servers. This is a problem if you want to ensure consistency between them. Consider what happens when your client fails a write to the remote server: do you undo the write to the local one? What happens to the application when you can't write to the remote server? What happens when you can't read from the local one?
The second catch is the fundamental physics issue Joshua raises. For a round trip you are talking a theoretical minimum of 38ms, leaving a theoretical maximum of 12ms of processing time across the systems at both ends (three systems in total). I'd say that expectation is a bit too much, and bandwidth has nothing to do with latency in this case: you could have a 10GB pipe and those timings would still stand. That said, transferring 5k across the continent in 12ms is asking a lot as well. Are you sure you've got the connection capacity to transfer 5k of data in 50ms, let alone 12? I've been on private no-utilization circuits across the continent and seen ping times exceeding 50ms - and ping isn't transferring 5k of data.
How will you keep the two unrelated servers in sync? If you truly need sub-50ms latency across the continent, the above theoretical best case means you have 12ms to run synchronization algorithms. Even one query to check the data on the other server puts you outside the 50ms window. If the data is out of sync, how will you fix it? Given the above timings, I don't see how it is possible to synchronize in under 50ms.
I would recommend revisiting the fundamental design requirements. Specifically, why this requirement? A 50ms round-trip latency requirement across the continent is usually a sign of marketing or a lack of attention to detail. I'd wager that if you analyze the requirements you'll find that the 50ms window is excessive and unnecessary. If it isn't, and data synchronization is actually important (likely), then someone will need to determine whether the significant extra effort to write synchronization code is worth it, or even possible within the 50ms window. Cross-continent sub-50ms data sync is not a simple problem.
If you have no need for synchronization, why not simply run one server? You could use a slave on the other side of the continent for recovery-only purposes. Of course, that still means that best-case you have 12ms to get the data over there and back. I would not count on 50ms covering the round trip, the operations, and a 5k/10k data transfer across the continent.
It's about 19ms at the speed of light to cross the US. <50ms is going to be hard to achieve.
http://www.wolframalpha.com/input/?i=new+york+to+los+angeles
This is probably best handled as part of your client - just have the client write to both nodes. Writes generally don't need to be synchronous, so sending the extra command shouldn't affect the performance you get from having a local node.
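A minimal sketch of that client-side approach with redis-py (the host names are placeholders, and the remote write is pushed onto a background thread so the local call does not pay the cross-country latency; a real client would queue and retry failed remote writes):

```python
import threading
import redis

local = redis.Redis(host="redis-east.example.com", port=6379)
remote = redis.Redis(host="redis-west.example.com", port=6379)

def dual_write(key, value):
    """Write to the local node synchronously, then replicate to the sister
    site without blocking the caller."""
    local.set(key, value)

    def replicate():
        try:
            remote.set(key, value)
        except redis.RedisError as exc:
            # Only logged here; a production client needs a retry queue.
            print(f"remote write failed for {key}: {exc}")

    threading.Thread(target=replicate, daemon=True).start()

dual_write("user:42:last_seen", "2011-08-01T12:00:00Z")
```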