Raft election restrictions - replication

I'm learning Raft from scratch with the Raft paper, and I can't understand the leader election process. I read in 5.4.1 that a leader needs to have in its log all the committed entries of the cluster:
Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries.
But later on, it is said that a candidate holds all the committed entries if its log is at least as up-to-date as any other log in the majority. The mechanism to determine which log is more up-to-date is to compare the index and term of the last entries: the log whose last entry has the higher term is more up-to-date.
Couldn't that lead to a situation in which a leader was elected without all previous committed entries? For instance:
In this case, if server 4 failed, server 2 could become leader, since its last entry has a higher term than the last entries of the majority. But it wouldn't have in its log the two committed entries from term 2. Is that right? I am misunderstanding something, but I can't get what it is...

The question is, how did the logs get to that state in the first place? It's not possible.
So, it looks like:
* Server 2 is leader for term 1
* Server 1 is leader for term 2
* Server 2 (perhaps) is leader for term 3
* Server 4 is leader for term 4
But server 2 couldn't have been the leader for term 3, because it couldn't get votes given that the last entry in its log would have been from term 1. If another server was leader for term 3, it must have written an entry for term 3 in its log if there's an entry from term 3 in server 2's log. But if there was another entry for term 3 in another server's log, a server with entries from term 2 could not have been elected, since there would only be two of those. Even if server 3 had entries from term 2 in its log, it couldn't have been elected at that position, because there would still be three other servers with entries from term 2 at higher indexes in the log.
So, I think you need to describe how the cluster got into a state in which server 2 could have won an election that would put an entry from term 3 in its log at index 4. It's important to note that the election protocol is not just about terms; it's also about indices. If two servers' last entries have the same term, the server with the greater last index is considered more up-to-date.
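To make the voting restriction concrete, here is a minimal sketch (Python, with hypothetical names) of the comparison a voter performs before granting its vote:

    def candidate_is_up_to_date(cand_last_term, cand_last_index,
                                my_last_term, my_last_index):
        # The candidate's log must be at least as up-to-date as the voter's.
        if cand_last_term != my_last_term:
            return cand_last_term > my_last_term     # higher last term wins
        return cand_last_index >= my_last_index      # same term: longer log wins

A server grants its vote only when this check passes. Since a candidate needs votes from a majority, and every committed entry is present on at least one member of any majority, a candidate whose log is missing a committed entry cannot collect enough votes to win.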

Related

How does ZooKeeper internally achieve data consistency among leader and followers when the leader fails

The Apache ZooKeeper documentation describes the steps to implement a distributed lock:
Call create() with the sequence and ephemeral flags set.
Call getChildren(), check if the data created in step 1 has the "lowest sequence number"
...
My question is: if leader A failed after step 1's create() (let's say the sequence number it produced is 0001), ZooKeeper must have failover logic to elect a new leader B, but how does ZooKeeper make sure that a later create() handled by the new leader B will issue the correct sequence number (which should be 0002)? Otherwise it would violate the exclusive-lock property if the new leader B still produced the old sequence number 0001.
Does ZooKeeper achieve this by making sure a write (from the previous leader A) is replicated to a quorum of nodes before it replies to the client that the write succeeded? If that is the case, how does it make sure the failover process chooses a follower that has the latest updates from the previous leader A?
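For reference, steps 1-2 of the recipe might look like this with the kazoo Python client (the lock path is illustrative); note that create() returns the full znode path, including the sequence number the server assigned:

    from kazoo.client import KazooClient

    client = KazooClient(hosts="127.0.0.1:2181")
    client.start()

    # Step 1: create an ephemeral, sequential znode; the server appends a
    # monotonically increasing sequence number to the name.
    my_path = client.create("/locks/mylock/lock-",
                            ephemeral=True, sequence=True, makepath=True)

    # Step 2: list all contenders and check whether ours has the lowest number.
    children = sorted(client.get_children("/locks/mylock"))
    have_lock = my_path.rsplit("/", 1)[1] == children[0]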

Aerospike cluster behavior in different consistency modes?

I want to understand the behavior of Aerospike in its different consistency modes.
Consider an Aerospike cluster running with 3 nodes and replication factor 3.
AP mode is simple, and the documentation says:
Aerospike will allow reads and writes in every sub-cluster.
And the maximum number of nodes that can go down is < 3 (the replication factor).
For Aerospike strong consistency it says:
Note that the only successful writes are those made on replication-factor number of nodes. Every other write is unsuccessful
Does this really mean that no writes are allowed if available nodes < replication factor?
And then the same document says:
All writes are committed to every replica before the system returns success to the client. In case one of the replica writes fails, the master will ensure that the write is completed to the appropriate number of replicas within the cluster (or sub cluster in case the system has been compromised.)
What does "appropriate number of replicas" mean?
So if I lose one node from my 3-node cluster with strong consistency and replication factor 3, will I not be able to write data?
For Aerospike strong consistency it says:
Note that the only successful writes are those made on replication-factor number of nodes. Every other write is unsuccessful.
Does this really mean that no writes are allowed if available nodes < replication factor?
Yes. If there are fewer than replication-factor nodes, then it is impossible to meet the user-specified replication factor.
All writes are committed to every replica before the system returns success to the client. In case one of the replica writes fails, the master will ensure that the write is completed to the appropriate number of replicas within the cluster (or sub-cluster in case the system has been compromised).
What does "appropriate number of replicas" mean?
It means replication-factor nodes must receive the write. When a node fails, a new node can be promoted to replica status until either the node returns or an operator registers a new roster (cluster membership list).
So if I lose one node from my 3-node cluster with strong consistency and replication factor 3, will I not be able to write data?
Yes, so having all nodes as replicas wouldn't be a very useful configuration. Replication-factor 3 allows up to 2 nodes to be down, but only if the remaining nodes are able to satisfy the replication factor. So for replication-factor 3 you would probably want to run with a minimum of 5 nodes.
You are correct, with 3 nodes and RF 3, losing one node means the cluster will not be able to successfully take write transactions since it wouldn't be able to write the required number of copies (3 in this case).
Appropriate number of replicas means a number of replicas that would match the replication factor configured.
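To make the arithmetic explicit, here is a tiny sketch of the availability rule described above (plain Python, not the Aerospike API):

    def sc_writes_possible(nodes_up, replication_factor):
        # In strong-consistency mode a write succeeds only if it can be
        # committed to replication-factor nodes.
        return nodes_up >= replication_factor

    assert not sc_writes_possible(2, 3)  # 3-node cluster, RF 3, one node down
    assert sc_writes_possible(3, 3)      # 5-node cluster, RF 3, two nodes down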

Aerospike - Read (with consistency level ALL) when one replica is down

TL;DR
If a replica node goes down and a new partition map is not yet available, will a read with consistency level = ALL fail?
Example:
Given this Aerospike cluster setup:
- 3 physical nodes: A, B, C
- Replicas = 2
- Read consistency level = ALL (reads consult both nodes holding the data)
And this sequence of events:
- A piece of data "DAT" is stored into two nodes, A and B
- Node B goes down.
- Immediately after B goes down, a read request ("request 1") is performed with consistency ALL.
- After ~1 second, a new partition map is generated. The cluster is now aware that B is gone.
- "DAT" now becomes replicated at node C (to preserve replicas=2).
- Another read request ("request 2") is performed with consistency ALL.
It is reasonable to say "request 2" will succeed.
Will "request 1" succeed? Will it:
a) Succeed because two reads were attempted, even if one node was down?
b) Fail because one node was down, meaning only 1 copy of "DAT" was available?
Request 1 and request 2 will succeed. The behavior of the consistency level policies is described here: https://discuss.aerospike.com/t/understanding-consistency-level-overrides/711.
The gist for read/write consistency levels is that they only apply when there are multiple versions of a given partition within the cluster. If there is only one version of a given partition in the cluster then a read/write will only go to a single node regardless of the consistency level.
So given an Aerospike cluster of A, B, C where A is master and B is replica for partition 1.
Assume B fails and C is now replica for partition 1. Partition 1 receives a write and the partition key is changed.
Now B is restarted and returns to the cluster. Partition 1 on B will now be different from A and C.
A read arrives with consistency all to node A for a key on partition 1, and there are now 2 versions of that partition in the cluster. We will read the record from nodes A and B and return the latest version (not fail the read).
Time lapse
Migrations are now complete; for partition 1, A is master, B is replica, and C no longer has the partition.
A read arrives with consistency all to node A. Since there is only one version of partition 1, node A responds to the client without consulting node B.
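A toy model of that resolution logic (plain Python, not the Aerospike client API; the Record type is made up for illustration):

    from collections import namedtuple

    Record = namedtuple("Record", ["generation", "value"])

    def read_with_consistency_all(versions, key):
        # `versions` holds one dict per version of the partition in the cluster.
        if len(versions) == 1:
            return versions[0][key]  # single version: the master answers alone
        copies = [v[key] for v in versions if key in v]
        return max(copies, key=lambda r: r.generation)  # consult all, keep latest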

Global revision without locking

Given this set of rules, would it be possible to implement this in SQL?
Two transactions that don't modify the same rows should be able to run concurrently. No locks should occur (or at least their use should be minimized as much as possible).
Transactions can only read committed data.
A revision is defined as an integer value in the system.
A new transaction must be able to increment and query a new revision. This revision will be applied to every row that the transaction modifies.
No 2 transactions can share the same revision.
A transaction X that is committed before transaction Y must have a revision lower than the one assigned to transaction Y.
I want to use integer as the revision in order to optimize how I query all changes since a specific revision. Something like this:
SELECT * FROM [DummyTable] WHERE [DummyTable].[Revision] > clientRevision
My current solution uses an SQL table [GlobalRevision] with a single row [LastRevision] to keep the latest revision. All my transactions' isolation levels are set to Snapshot.
The problem with this solution is that the [GlobalRevision] table with the single row [LastRevision] becomes a point of contention. This is because I must increment the revision at the start of a transaction so that I can apply the new revision to the modified rows. This will keep a lock on the [LastRevision] row throughout the duration of the transaction, killing the concurrency. Even though two concurrent transactions modify totally different rows, they cannot be executed concurrently (Rule #1: Failed).
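For illustration, the contentious pattern described above might look like this (T-SQL issued through pyodbc; the connection string and table names are placeholders):

    import pyodbc

    conn = pyodbc.connect("DSN=mydb", autocommit=False)  # placeholder connection
    cur = conn.cursor()

    # At transaction start: bump and read the global revision in one statement.
    # The UPDATE takes an exclusive lock on the single row, and that lock is
    # held until COMMIT, so every concurrent writer queues up behind it.
    cur.execute("UPDATE dbo.GlobalRevision "
                "SET LastRevision = LastRevision + 1 "
                "OUTPUT inserted.LastRevision")
    revision = cur.fetchone()[0]

    # ... stamp `revision` onto every row the transaction modifies; the row
    # lock on [GlobalRevision] is held the whole time ...
    conn.commit()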
Is there any pattern in SQL to solve this kind of issue? One solution is to use GUIDs and keep a history of revisions (like git revisions), but this is less convenient than just having an integer that we can compare to see if one revision is newer than another.
UPDATE:
The business case for this is to create a BaaS system (Backend as a Service) with data synchronization between client and server. Here are some use cases for this kind of system:
Client while online modifies an asset, pushes the update to the server, server updates DB [this is where my question relates to], server sends update notifications to interested clients that synchronize their local data with the new changes.
Client connects to server, client requests a pull to the server, server finds all changes that were applied after client's revision and return them to the client, client applies the changes and sets its new revision.
...
As you can see, the global revision lets me put a revision on every change committed on the server, and from this revision I can determine what updates need to be sent to the clients depending on their specific revision.
This needs to scale to multiple thousands of users that can push updates in parallel and those changes must be synchronized to other connected users. So the longer it takes to execute a transaction, the longer it takes for other users to receive the change notifications.
I want to avoid as much as possible contention for this reason. I am not an expert in SQL so I just want to make sure there is not something I am missing that would let me do that easily.
Probably the easiest thing for you to try would be to use a SEQUENCE for your revision number, assuming you're at SQL 2012 or newer. This is a lighter-weight way of generating an auto-incrementing value that you can use as a revision ID per your rules. Acquiring them at scale should be far less subject to the contention issues you describe than using a full-fledged table.
You do need to know that you could end up with revision number gaps if a given transaction rolled back, because SEQUENCE values operate outside of transactional scope. From the article:
Sequence numbers are generated outside the scope of the current transaction. They are consumed whether the transaction using the sequence number is committed or rolled back.
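Putting the suggestion together, a minimal sketch (reusing the pyodbc connection from the sketch above; the sequence name is illustrative):

    # One-time setup: a sequence that hands out revision numbers.
    cur.execute("CREATE SEQUENCE dbo.RevisionSeq AS BIGINT START WITH 1")
    conn.commit()

    # Inside each transaction: draw the next revision without touching any
    # shared table row, then stamp it onto the rows being modified.
    cur.execute("SELECT NEXT VALUE FOR dbo.RevisionSeq")
    revision = cur.fetchone()[0]
    cur.execute("UPDATE dbo.DummyTable SET Revision = ? WHERE Id = ?",
                revision, 42)
    conn.commit()

Drawing a value from the sequence doesn't hold a lock for the remainder of the transaction the way the single-row UPDATE does, so concurrent writers no longer serialize on it.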
If you can relax the requirement for an integer revision number and settle for knowing what the data was at a given point in time, you might be able to use Change Data Capture, or, in SQL 2016, Temporal Tables. Both of these technologies allow you to "turn back time" and see what the data looked like at a known timestamp.

Performance when using Redis as query results storage for certain repetitive Neo4j queries

I am teaching myself Neo4j & Redis and am working on an app that makes use of both. Let's assume the data domain is publications (an imaginary domain; this is not a real-life project) with the following structures:
Label > Publication Node - Role {title, date}
Label > Publication Source - Role {source_name, url, id} i.e. PubMed, Scopus, etc.
Label > Author - Role {firstname, lastname, dob, email, etc.}
Label > Journal - Role {journal_name, etc.} i.e. Science, Nature
A publication is composed of title, list of authors, the journal, page numbers, doi id, and other relevant properties.
Let's assume I have 1,000 Publication nodes, 5,000 Author nodes, 7 Sources, and 6 Journals. The data is connected using relationships.
The app would provide a search screen with controls to allow searching of publications based on certain criteria. Every time the search screen is requested, the app should return the latest counts of publications broken down by facet (i.e. sources, years, A-Z author names, etc.).
These latest counts would give the person searching an idea of how many publications there are. See the screen mock for an example (numbers in brackets show the real count of nodes to which there are incoming relationships from publication nodes).
Each of those counts (i.e. Pubmed (15), Scopus (10), Immunology (95)) is the result of a Cypher query against Neo4j that counts the number of incoming relationships to those labels (assume about 10 Cypher queries have to be run to get those counts).
If 100 users request the search screen, that is 100 requests * 10 queries = 1,000 Cypher queries, most of the time fetching the same values, which only sometimes change.
My current solution:
1) On Express initialization, run those Cypher queries that retrieve the counts and store them as hashes in Redis.
2) For every request for the search screen, HGETALL each hash and return its values, which are nothing but the counts.
3) Every time a new publication is added, increment the counts in the relevant hashes by 1.
4) Every time a publication is deleted, query Neo4j and re-initialize Redis (rather than finding the specific hashes and decrementing them - need advice: should I use EXISTS for the hash, then HEXISTS for the field, and then decrement, or simply re-initialize Redis? Updates and deletions are rare actions, i.e. 22% of the time).
5) Every time a publication is updated, query Neo4j and re-initialize Redis (rather than figuring out which relationships were dropped and which hashes should be decremented and which incremented - I find re-initialization quicker. What is your advice?)
6) When searching, it is tricky to decide which node to start from. Let's assume the user selects the following options during search:
CrossRef (95) as source AND
Science (10) as journal AND
2014 (200) as year AND
A (66) as author start name AND
I guess from a Neo4j perspective, the node (x) with the least number of incoming relationships is a good starting point, because naturally the majority of publications that have no relationship to node x will be ignored. If I start the query from node "Science", I will have 10 candidate publication nodes against which to check the remainder of the search conditions; but if I start with the year, I have 200 candidate publication nodes to apply further search criteria to. If this is true, then rather than running another 10 Cypher queries to figure out the node with the fewest incoming relationships, I could refer to Redis for the latest count figures and then build my Cypher query. What is your advice?
If the above idea of starting the search from the node with the fewest incoming relationships is correct, then when the user fires the search request, I could SORT the Redis values for those hashes and construct a Cypher query starting from the node with the minimum count, followed by the next-lowest count, and so on.
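A minimal sketch of steps 1-3 and the count-based query start, using redis-py and the official neo4j driver (the facet names, Cypher patterns, and credentials are all illustrative):

    import redis
    from neo4j import GraphDatabase

    r = redis.Redis()
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "secret"))

    COUNT_QUERIES = {  # facet -> Cypher count query (illustrative patterns)
        "source": "MATCH (:Publication)-[:PUBLISHED_IN]->(s:Source) "
                  "RETURN s.source_name AS k, count(*) AS c",
        "journal": "MATCH (:Publication)-[:APPEARED_IN]->(j:Journal) "
                   "RETURN j.journal_name AS k, count(*) AS c",
    }

    def init_counts():
        # Step 1: run the count queries once, cache results as Redis hashes.
        with driver.session() as session:
            for facet, cypher in COUNT_QUERIES.items():
                for row in session.run(cypher):
                    r.hset("counts:" + facet, row["k"], row["c"])

    def search_screen_counts():
        # Step 2: serve the search screen straight from Redis.
        return {facet: r.hgetall("counts:" + facet) for facet in COUNT_QUERIES}

    def on_publication_added(source, journal):
        # Step 3: bump only the affected counters instead of re-querying Neo4j.
        r.hincrby("counts:source", source, 1)
        r.hincrby("counts:journal", journal, 1)

    def cheapest_facet(selected):
        # Step 6: among the user's selected (facet, value) pairs, pick the one
        # with the smallest cached count as the starting point for the query.
        return min(selected,
                   key=lambda fv: int(r.hget("counts:" + fv[0], fv[1]) or 0))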
I am new to Redis & Neo4j and their applications. Please advise if you think this is not going to work out in real life, with some explanation and suggestions as to which other solutions should be considered.
Thanks a million for reading this big question. I tried to be descriptive, which probably indicates that I myself am a bit confused.