TL;DR
If a replica node goes down and a new partition map is not yet available, will a read with consistency level = ALL fail?
Example:
Given this Aerospike cluster setup:
- 3 physical nodes: A, B, C
- Replicas = 2
- Read consistency level = ALL (reads consult both nodes holding the data)
And this sequence of events:
- A piece of data "DAT" is stored into two nodes, A and B
- Node B goes down.
- Immediately after B goes down, a read request ("request 1") is performed with consistency ALL.
- After ~1 second, a new partition map is generated. The cluster is now aware that B is gone.
- "DAT" now becomes replicated at node C (to preserve replicas=2).
- Another read request ("request 2") is performed with consistency ALL.
It is reasonable to say "request 2" will succeed.
Will "request 1" succeed? Will it:
a) Succeed because two reads were attempted, even if one node was down?
b) Fail because one node was down, meaning only 1 copy of "DAT" was available?
Request 1 and request 2 will both succeed. The behavior of the consistency level policies is described here: https://discuss.aerospike.com/t/understanding-consistency-level-overrides/711.
The gist for read/write consistency levels is that they only apply when there are multiple versions of a given partition within the cluster. If there is only one version of a given partition in the cluster then a read/write will only go to a single node regardless of the consistency level.
So given an Aerospike cluster of A, B, C where A is master and B is replica for partition 1.
Assume B fails and C is now replica for partition 1. Partition 1 receives a write and the partition key is changed.
Now B is restarted and returns to the cluster. Partition 1 on B will now be different from A and C.
A read arrives with consistency ALL to node A for a key on partition 1, and there are now 2 versions of that partition in the cluster. We will read the record from nodes A and B and return the latest version (not fail the read).
Time lapse
Migrations are now complete. For partition 1, A is master, B is replica, and C no longer has the partition.
A read arrives with consistency ALL to node A. Since there is only one version of partition 1, node A responds to the client without consulting node B.
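The timeline above can be sketched as a toy model in pure Python. The function name `read_consistency_all` and the `(version, record)` tuples are invented for illustration; this is not the Aerospike client API.

```python
# Toy model: each node holds a (partition_version, record) pair for
# partition 1. Illustrative only; not the Aerospike client API.

def read_consistency_all(partition_copies):
    """partition_copies: dict of node name -> (partition_version, record).
    With consistency ALL, every copy is consulted only when more than
    one partition *version* exists; otherwise the master answers alone."""
    versions = {version for version, _ in partition_copies.values()}
    if len(versions) == 1:
        # Single version: the master responds without consulting replicas.
        master = next(iter(partition_copies))
        return partition_copies[master][1]
    # Multiple versions: read every copy and return the latest one.
    return max(partition_copies.values())[1]

# B rejoined with a stale copy -> two versions exist, the latest wins.
stale = {"A": (2, "DAT-v2"), "B": (1, "DAT-v1")}
print(read_consistency_all(stale))    # DAT-v2

# After migrations complete there is one version; A answers alone.
settled = {"A": (2, "DAT-v2"), "B": (2, "DAT-v2")}
print(read_consistency_all(settled))  # DAT-v2
```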
Related
The Apache ZooKeeper documentation describes the steps for implementing a distributed lock:
Call create() with the sequence and ephemeral flags set.
Call getChildren() and check whether the znode created in step 1 has the lowest sequence number.
...
My question is: if leader A fails after step 1's create() (say the sequence number it produced is 0001), ZooKeeper must have failover logic to elect a new leader B. But how does ZooKeeper make sure that a later create() handled by new leader B issues the correct next sequence number (which should be 0002)? Otherwise, if new leader B still produces the old sequence number 0001, the exclusive-lock property is violated.
Does ZooKeeper achieve this by making sure a write (from the previous leader A) is replicated to a quorum of nodes before it replies to the client that the write succeeded? If that is the case, how does the failover process make sure it chooses a follower that has the latest updates from previous leader A?
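The lowest-sequence check in step 2 of the recipe can be sketched in pure Python. The znode names below are illustrative (the parent path and `lock-` prefix are chosen by the client); ZooKeeper itself zero-pads sequence numbers to 10 digits.

```python
# Step 2 of the lock recipe: parse the sequence suffix ZooKeeper appends
# to each sequential znode name and check whether our znode holds the
# lowest one among the children.

def holds_lock(my_znode, children):
    """Return True if my_znode has the lowest sequence among children."""
    seq = lambda name: int(name.rsplit("-", 1)[1])
    return seq(my_znode) == min(seq(child) for child in children)

children = ["lock-0000000002", "lock-0000000001", "lock-0000000003"]
print(holds_lock("lock-0000000001", children))  # True: lowest sequence
print(holds_lock("lock-0000000002", children))  # False: must wait
```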
I want to understand the behavior of Aerospike in its different consistency modes.
Consider an Aerospike cluster running with 3 nodes and replication factor 3.
AP mode is simple; the documentation says:
Aerospike will allow reads and writes in every sub-cluster.
And the maximum number of nodes that can go down is less than 3 (the replication factor).
For Aerospike strong consistency it says:
Note that the only successful writes are those made on replication-factor number of nodes. Every other write is unsuccessful
Does this really mean that no writes are allowed if the number of available nodes is less than the replication factor?
And then the same document says:
All writes are committed to every replica before the system returns success to the client. In case one of the replica writes fails, the master will ensure that the write is completed to the appropriate number of replicas within the cluster (or sub cluster in case the system has been compromised.)
What does "appropriate number of replicas" mean?
So if I lose one node from my 3-node cluster with strong consistency and replication factor 3, will I not be able to write data?
For Aerospike strong consistency it says:
Note that the only successful writes are those made on replication-factor number of nodes. Every other write is unsuccessful
Does this really mean that no writes are allowed if the number of available nodes is less than the replication factor?
Yes. If there are fewer than replication-factor nodes, then it is impossible to meet the user-specified replication factor.
All writes are committed to every replica before the system returns success to the client. In case one of the replica writes fails, the master will ensure that the write is completed to the appropriate number of replicas within the cluster (or sub cluster in case the system has been compromised.)
What does "appropriate number of replicas" mean?
It means replication-factor nodes must receive the write. When a node fails, a new node can be promoted to replica status until either the node returns or an operator registers a new roster (cluster membership list).
So if I lose one node from my 3-node cluster with strong consistency and replication factor 3, will I not be able to write data?
Yes, so having all nodes as replicas wouldn't be a very useful configuration. Replication-factor 3 allows up to 2 nodes to be down, but only if the remaining nodes are able to satisfy the replication factor. So for replication-factor 3 you would probably want to run with a minimum of 5 nodes.
You are correct: with 3 nodes and RF 3, losing one node means the cluster will not be able to accept write transactions, since it wouldn't be able to write the required number of copies (3 in this case).
"Appropriate number of replicas" means a number of replicas that matches the configured replication factor.
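The write-admission rule described above can be reduced to a one-line check. This is a simplified sketch with an invented function name; real Aerospike strong consistency also consults the roster and per-partition availability.

```python
# Strong-consistency write admission, simplified: a write can only
# succeed if at least replication-factor nodes are available to hold
# copies. Illustrative sketch, not the actual Aerospike logic.

def write_allowed(available_nodes, replication_factor):
    return available_nodes >= replication_factor

print(write_allowed(3, 3))  # True: all 3 nodes of a 3-node cluster up
print(write_allowed(2, 3))  # False: one node lost, cannot place 3 copies
print(write_allowed(4, 3))  # True: 5-node cluster tolerates one node down
```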
I'm learning Raft from scratch with the Raft paper, and I can't understand the leader election process. I read in 5.4.1 that a leader needs to have in its log all the committed entries of the cluster:
Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader.
Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries.
But later on, it is said that a candidate holds all the committed entries if its log is at least as up-to-date as any other log in the majority. The mechanism to determine up-to-dateness is comparing the index and term of the last entries: the log whose last entry has the higher term is more up-to-date.
Couldn't that lead to a situation in which a leader is elected without all previously committed entries? For instance:
In this case, if server 4 failed, server 2 could become leader, since it has an entry with a bigger term than the majority. But it wouldn't have in its log the two committed entries from term 2. Is that right? I am misunderstanding something, but I can't get what it is...
The question is, how did the logs get to that state in the first place? It's not possible.
So, it looks like:
* Server 2 is leader for term 1
* Server 1 is leader for term 2
* Server 2 (perhaps) is leader for term 3
* Server 4 is leader for term 4
But server 2 couldn't have been the leader for term 3, because it couldn't get votes: the last entry in its log would have been from term 1. If another server was leader for term 3, it must have written an entry for term 3 in its log if there's an entry from term 3 in server 2's log. But if there was another entry for term 3 in another server's log, a server with entries from term 2 could not have been elected, since there would only be two of those. Even if server 3 had entries from term 2 in its log, it couldn't have been elected at that position, because there would still be three other servers with entries from term 2 at higher indexes in the log.
So, I think you need to describe how the cluster got in a state in which server 2 could have won an election that would put an entry from term 3 in its log at index 4. It's important to note that the election protocol is not just about terms, it's also about indices. If two servers' last entries have the same term, the server with the greater last index is considered more up to date.
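The term-then-index rule in the last paragraph maps directly onto a tuple comparison. A minimal sketch in pure Python (the function name is illustrative):

```python
# Raft's "at least as up-to-date" check (paper section 5.4.1): compare
# the term of the last log entry first; if terms are equal, the longer
# log (greater last index) is more up to date. A voter grants its vote
# only if the candidate's log is at least as up-to-date as its own.

def at_least_as_up_to_date(candidate, voter):
    """Each argument is a (last_log_term, last_log_index) pair."""
    return candidate >= voter  # tuple comparison: term first, then index

# A voter at (term 2, index 5) rejects a candidate stuck at term 1,
# no matter how long that candidate's log is...
print(at_least_as_up_to_date((1, 9), (2, 5)))  # False
# ...but accepts one at the same term with a longer log.
print(at_least_as_up_to_date((2, 6), (2, 5)))  # True
```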
We've installed DataStax on five nodes, with search enabled on all five and a replication factor of 3. After adding 590 rows to a table, a select from node 1 retrieves 590 rows, but selecting from the other nodes returns anywhere from 570 to 585 rows.
I tried using CONSISTENCY QUORUM on cqlsh, but nothing changed. And solr_query is not supported on CONSISTENCY QUORUM.
Is there a way to ensure all data written to Cassandra is retrieved exactly as written?
As LHWizard mentioned, if you use Consistency levels such that (nodes_written + nodes_read) > RF, you will ensure immediate consistency.
In your case, you can try using a CONSISTENCY ALL on your read so that all nodes are checked before returning (this will be immediately consistent even with write CL of ONE). This should actually trigger a read repair on the inconsistent nodes and the missing data will be streamed to those nodes.
You're right that solr queries can only be read at CL ONE. If you need higher consistency requirements, you will need to raise the CL for the writes to achieve what you need.
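The (nodes_written + nodes_read) > RF rule mentioned above is simple arithmetic, sketched here in pure Python (function name is illustrative):

```python
# Immediate consistency holds when the write set and the read set are
# guaranteed to overlap in at least one replica, i.e. when
# nodes_written + nodes_read > replication_factor.

def immediately_consistent(write_nodes, read_nodes, replication_factor):
    return write_nodes + read_nodes > replication_factor

RF = 3
print(immediately_consistent(1, 3, RF))  # True: write ONE, read ALL
print(immediately_consistent(2, 2, RF))  # True: write QUORUM, read QUORUM
print(immediately_consistent(1, 1, RF))  # False: write ONE, read ONE
```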
What could be the reason for a spontaneous master object count drop? I added 3 more nodes to a cluster of 17 nodes, and suddenly there are 1 billion fewer records (as reported by the AMC UI).
Replica object counts go to zero as well.
Thanks!
Partitions are going to go through state transitions. Partitions that were shifted from master to replica will no longer report those objects in the master object counts. These partitions will also be in an "acting master" state until the "desync master" partitions receive the full partition. While acting master the namespace supervisor thread (NSUP) will not count them as master or replica objects. Also the node that previously was the replica for this partition will drop the partition to allow for potential incoming migrations.
There will also be partitions where the old master and replica nodes will both be displaced by new nodes. The old master will become an acting master until the new master receives the partition, at which point the old master will drop the partition and the new master will begin replicating to the new replica. You will periodically observe this in AMC as the number of active tx migrations increases.
There are other possible scenarios, but the main takeaway is that once migration settles, your object counts will return to normal.
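For a sense of how much data is in motion during these transitions: Aerospike divides each namespace into 4096 partitions, so growing the cluster changes each node's share of master partitions. A back-of-envelope sketch (actual assignment is determined by hashing, so per-node counts vary slightly):

```python
# Rough per-node master-partition share before and after growing an
# Aerospike cluster from 17 to 20 nodes. Back-of-envelope only.

PARTITIONS = 4096  # fixed partition count per Aerospike namespace

before = PARTITIONS / 17  # ~241 master partitions per node with 17 nodes
after = PARTITIONS / 20   # ~205 per node after adding 3 nodes

print(round(before), round(after))
```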