How does newly elected leader apply entries in raft? - replication

Let's say you have 3 servers, S1, S2 and S3. S1 (the leader) replicates a log entry to S2 and S3, applies it, responds to the client, and then crashes. So we have
S1 1
S2 1
S3 1
Now when S2 becomes the leader (with the vote from S3) how will it apply the log? According to the Raft paper
If there exists an N such that N > commitIndex, a majority
of matchIndex[i] ≥ N, and log[N].term == currentTerm:
set commitIndex = N.
In the above case, the term of S2 (with commitIndex = 0) would be 2, while the term of the log entry would always be 1; hence, the last condition would never be satisfied. Am I missing something?

Every node has an event log with two core pointers: one marking the committed events and one marking the uncommitted events. The point of the Raft protocol is to replicate both of these pointers across the system.
|  0  1  2  3  |  4  5  6  7  8  9  |
   ^              ^
   Committed      Uncommitted
Every replication message a Follower receives from the Leader updates both of these pointers. The message has events to append to the log (updating the uncommitted pointer). It also has an index to update the committed pointer.
When a Follower receives this message and updates its committed pointer then it applies all the events that just moved from uncommitted to committed.
The committed pointer sent to the Followers is a copy of what the Leader has on its log. The Leader updates its committed pointer when it receives a quorum from the Followers, and applies all the events that moved from uncommitted to committed.
A newly-elected Leader first needs to ensure that its version of the log is replicated to the Followers, and as the new Leader receives quorum from the Followers it updates its committed pointer, replicates that pointer back to the Followers, and applies the events as above.
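To tie this back to the rule quoted in the question: the leader only counts replicas for entries from its current term, so an entry from an older term is never committed directly. It becomes committed indirectly once some entry from the new leader's own term (many implementations append a no-op at the start of the term for exactly this reason) is replicated to a majority. A minimal sketch of that commit rule, with illustrative field names that are not taken from any particular implementation:

import java.util.List;

class RaftLeader {
    static class LogEntry { final int term; LogEntry(int term) { this.term = term; } }

    List<LogEntry> log;   // log.get(n) is the entry at index n
    int[] matchIndex;     // per follower: highest index known to be replicated there
    int commitIndex;      // highest index known to be committed
    int currentTerm;
    int clusterSize;      // leader + followers

    void maybeAdvanceCommitIndex() {
        for (int n = log.size() - 1; n > commitIndex; n--) {
            // The last condition from the paper: only an index whose entry carries
            // the leader's current term may be counted.
            if (log.get(n).term != currentTerm) continue;

            int replicas = 1;                         // the leader's own copy
            for (int m : matchIndex) if (m >= n) replicas++;

            if (replicas > clusterSize / 2) {
                commitIndex = n;                      // everything up to n, including
                applyCommittedEntries();              // older-term entries, is now committed
                break;
            }
        }
    }

    void applyCommittedEntries() { /* apply entries up to commitIndex to the state machine */ }
}

In the 3-server example above, S2's entry from term 1 is committed (and then applied) as soon as S2 replicates its first term-2 entry to S3.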

Related

lock redis key when two processes are accessing the same key

Application 1 sets a value in Redis.
We have two instances of application 2 running, and we would like only one instance to read this value from Redis (note that application 2 takes around 30 seconds to 1 minute to process the data).
Can instance 1 of application 2 acquire a lock on the Redis key created by application 1, so that instance 2 of application 2 will not read it and do the same operation?
No, there's no concept of a record lock in Redis. If you need to achieve some sort of locking you have to use another set of data structures to mimic that behavior. For example:
List: You can use a list and then POP the item from the list or...
Redis Stream: Using a Redis Stream with a consumer group, so that each consumer in your group only sees a portion of the whole data that needs to be processed, and it guarantees you that when an item is delivered to one consumer it is not going to be delivered to another one.
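For example, with the list approach (the key name jobs is illustrative, not from the question), only one of the two blocked instances receives each element:

Application 1:
LPUSH jobs "value-for-x"

Both instances of application 2:
BLPOP jobs 0

BLPOP blocks until an element is available, and Redis hands each pushed element to exactly one blocked client, so the other instance never processes it.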

How to define TTL for redis streams?

I have two micro services and I need to implement reliable notifications between them. I thought about using redis streams -
serviceA will send a request to serviceB with an identifier X.
Once serviceB is done doing the work serviceA asked for, it'll create/add to a stream (the stream is specific for X) a new item to let it know it's done.
ServiceA can send multiple requests, each request may have a different identifier. So it'll block for new elements in different streams.
My question is how can I delete streams which are no longer needed, depending on their age. For example I'd like to have streams that were created over a day ago deleted. Is this possible?
If it's not, I'd love to hear any ideas you have as to how not to have unneeded streams in redis.
Thanks
There's no straightforward way to delete older entries based on TTL/age. You can use a combination of XTRIM/XDEL with other commands to trim the stream.
Let's see how we can use XTRIM
XTRIM stream MAXLEN ~ SIZE
XTRIM trims the stream to a given number of items, evicting older items (items with lower IDs) if needed.
You record the stream size every day, or periodically based on your delete policy, using the XLEN command and store it somewhere.
Run a periodic job that would call XTRIM as
XTRIM x-stream MAXLEN ~ (NEW_SIZE - PREVIOUS_SIZE)
For example, if yesterday the stream size was 500 and now it's 600, then we need to delete the 500 older entries, so we can just run
XTRIM x-stream MAXLEN ~ 100
You can use different policies for deletion for example daily, weekly, twice a week, etc.
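To make the periodic job concrete, here is a sketch of a single run against the x-stream from the example above (the key used to remember the previous size is an assumption, not part of the original answer):

XLEN x-stream                  # returns 600
GET stream:last-size           # "500", recorded by the previous run
XTRIM x-stream MAXLEN ~ 100    # 600 - 500: keep roughly the 100 newest entries
SET stream:last-size 100       # store the post-trim length so the next delta only counts new entries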
XDEL stream ID [ID...]
Removes the specified entries from a stream, and returns the number of entries deleted, that may be different from the number of IDs passed to the command in case certain IDs do not exist.
So what you can do is: whenever Service B consumes the event, the service itself can delete the stream entry, since Service B knows the stream ID. But this will not work as soon as you start using consumer groups. So I would say use a Redis set or hash to track the acknowledged stream IDs and run a periodic sweep job to clean up the stream.
For example
Service A sends a stream item with ID1 to service B
Service B acknowledges the stream item after consuming it by recording it in a map:
ack_stream = { ID1: true }; you can also track other data, e.g. a count, in the case of a consumer group.
A sweep job runs periodically, say daily at 1 AM, reads all the elements of ack_stream, and filters out the items that require deletion. Then you can call XDEL in batches with that set of stream IDs.
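A rough walk-through with raw commands (the key names x-stream and ack_stream and the entry ID are illustrative):

XADD x-stream * id "X" status "done"    # Service B publishes; Redis returns an ID such as 1526919030474-0
HSET ack_stream 1526919030474-0 1       # Service A records the entry as processed after consuming it

The nightly sweep then reads the acknowledged IDs and deletes them in batches:

HGETALL ack_stream                      # collect the acknowledged IDs
XDEL x-stream 1526919030474-0           # delete each of them from the stream
HDEL ack_stream 1526919030474-0         # and drop them from the tracking hash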

Preserving order of execution in case of an exception on ActiveMQ level

Is there an option at the ActiveMQ level to preserve the order of execution of messages in case of an exception? In other words, assume that message ID=1 contains info about an object, a student with ID=Student_1000, and this message failed and entered the DLQ for some reason, but the principal queue still contains message ID=2 and message ID=3 for the same student (ID=Student_1000). We should not allow those messages to be processed, because they contain info about the same object as message ID=1; ideally, they should be redirected directly to the DLQ to preserve the order of execution, because if we allow this processing we will lose the order of execution when performing an update.
Please note that I'm using message groups of Active MQ.
How to do that on Active MQ level?
Many thanks,
Rosy
Well, not really. But since the DLQ is by default shared, you would not have ordered messages there unless you configure individual DLQs.
Trying to rely on strict, 100% message order on queues to keep business logic simple is a bad idea, in my experience. That is, unless you have a single broker, a single producer, a single consumer and no DLQ handling (infinite redeliveries in the RedeliveryPolicy).
What you should do is to read the entire group in a single transaction. Roll it back or commit it as a group. It will require you to set the prefetch size accordingly. DLQ handling and reading is actually a client concern and not a broker level thing.
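A hedged sketch of what that looks like with a plain JMS client against ActiveMQ (the queue name, broker URL, prefetch value and receive timeout are all illustrative; routing a whole group to one consumer still relies on ActiveMQ's message-group dispatching via JMSXGroupID):

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class GroupReader {
    public static void main(String[] args) throws JMSException {
        // Raise the prefetch so the whole group can be dispatched to this consumer.
        ConnectionFactory factory = new ActiveMQConnectionFactory(
                "tcp://localhost:61616?jms.prefetchPolicy.queuePrefetch=100");
        Connection connection = factory.createConnection();
        connection.start();

        // Transacted session: the group is committed or rolled back as a whole.
        Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
        MessageConsumer consumer = session.createConsumer(session.createQueue("students"));
        try {
            Message message;
            while ((message = consumer.receive(1000)) != null) {
                process(message);   // throws if the update fails
            }
            session.commit();       // all messages of the group succeed together...
        } catch (Exception e) {
            session.rollback();     // ...or are redelivered together, keeping their order
        } finally {
            connection.close();
        }
    }

    static void process(Message message) { /* business logic */ }
}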

Cassandra: MigrationStage cannot keep up

We have 4 nodes, and running tpstats shows a big backlog for MigrationStage on all nodes; it is never able to reduce the queue over time. For example:
Pool Name           Active   Pending   Completed   Blocked   All time blocked
MigrationStage           1      3946       17766         0                  0
I don't see this going down ever and the other 3 servers have about 300 pending requests.
Is there a way to speed this up? Or is it possible to stop schema migration since most likely it's trying to migrate old keyspaces?
PS: I tried to drop keyspaces to reduce this (there are about 200 keyspaces). However, I always get a query timeout for that statement (SELECT works). I assume this backlog is also blocking some schema DDL statements.

Multiversion Timestamp-based concurrency control

In timestamp-based concurrency control, why do you have to reject a write by transaction T_i on element x_k if some transaction T_j with j > i has already read it?
As stated in the document.
If T_j is not planning to do any update at all, why is it necessary to be so restrictive on T_i's actions?
Assume that T_i occurs first and T_j comes second, and assume T_i also writes to x. The read by T_j (the second transaction) should fail because T_i is already using the value of x: T_i has the earlier timestamp, and if T_j used the last committed version of x it would be reading a value that becomes stale once T_i writes x.
You need to abort the writing transaction T_i during a read, a write, or at commit time, because of the potential for a stale value being used. If the writing transaction didn't abort and someone else read and used the old value, the database would not be serializable: you would get a different result than if you ran the transactions one after the other. This is what the quoted text means by timestamp order.
Two transactions reading the same value at the same time is dangerous when writes are involved: it gives an inaccurate view of the database and reveals a non-serializable order. If three transactions are running and all use x, then the serializable order is undefined. You need to enforce one read of x at a time, which forces the transactions into single file, each seeing the previous transaction's x: T_i, then T_j, then T_k, in order, each finishing before the next one starts.
Think about what could happen even if T_j were not to write: it would use a stale value that technically no longer exists in the database, and it would have ignored the outcome of T_i if T_i wrote.
If three transactions all read x and don't write x, then it is safe to run them at the same time. You would need to know in advance that all three transactions don't write to x.
As the Serializable Snapshot Isolation whitepaper attests, the dangerous structure is two consecutive read-write dependencies. But a read-write of x followed by a read of x is also dangerous, because the value read would be stale if both transactions run at the same time; the schedule needs to be serializable, so you abort the second read of x, as there is a younger transaction using x.
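A minimal sketch of the timestamp-ordering rules being discussed, for a single item x (the field names rts and wts are illustrative): a write by T_i is rejected when a younger transaction has already read or written x, and a read is rejected when a younger transaction has already overwritten x.

class TimestampOrderedItem {
    private long rts = 0;   // timestamp of the youngest transaction that read x
    private long wts = 0;   // timestamp of the youngest transaction that wrote x

    // W_i(x): reject if some transaction with a later timestamp already read or wrote x.
    synchronized boolean write(long ts) {
        if (ts < rts || ts < wts) return false;   // abort and restart T_i
        wts = ts;
        return true;
    }

    // R_i(x): reject if some transaction with a later timestamp already wrote x.
    synchronized boolean read(long ts) {
        if (ts < wts) return false;               // abort and restart T_i
        rts = Math.max(rts, ts);
        return true;
    }
}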
I wrote a multiversion concurrency implementation in a simulation. See the simulation runner. My simulation simulates 100 threads all trying to read and write two numbers, A and B. They want to increment the number by 1. We set A to 1 and B to 2 at the beginning of the simulation.
The desired outcome is that A and B should be 101 and 102 at the end of the simulation. This can only happen if there is locking or serialization due to multiversion concurrency control. If you didn't have concurrency control or locking, these numbers would be less than 101 and 102 due to data races.
When a thread reads A or B, we iterate over the versions of that key to see if there is a version whose timestamp is <= transaction.getTimestamp() and for which committed.get(key) == that version. If successful, we record the reading transaction as the read timestamp of that value: rts.put("A", transaction).
At commit time, we check whether rts.get("A").getTimestamp() != committingTransaction.getTimestamp(); if that is the case, we abort the transaction and try again.
We also check if someone committed since the transaction began - we don't want to overwrite their commit.
We also check, for each write, whether the other writing transaction is younger than us; if it is, we abort. The if statement is in a method called shouldRestart, which is called on reads, at commit time, and on all transactions that touched a value.
public boolean shouldRestart(Transaction transaction, Transaction peek) {
    // Decide whether the current transaction ("transaction") must abort and restart:
    // another transaction ("peek") that touched the same value wins the conflict when
    // it has precommitted and is older, has more attempts, or got ahead and has not been restarted.
    boolean defeated =
            ((peek.getTimestamp() < transaction.getTimestamp()
                    || transaction.getNumberOfAttempts() < peek.getNumberOfAttempts())
                    && peek.getPrecommit())
            || peek.getPrecommit()
                    && (peek.getTimestamp() > transaction.getTimestamp()
                            || (peek.getNumberOfAttempts() > transaction.getNumberOfAttempts()
                                    && peek.getPrecommit())
                                    && !peek.getRestart());
    return defeated;
}
See the code here. The or clause with && peek.getPrecommit() means that a younger transaction can abort if a later transaction gets ahead and that later transaction hasn't been restarted (aborted). Precommit occurs at the beginning of a commit.
During a read of a key we check the RTS to see whether it belongs to a transaction earlier than ours. If so, we abort the transaction and restart: someone is ahead of us in the queue and they need to commit.
On average, the system reaches 101 and 102 after fewer than around 300 transaction aborts, with many runs finishing well below 200 attempts.
EDIT: I changed the formula for calculating which transaction wins. If the other transaction is younger or has a higher number of attempts, the current transaction aborts. This reduces the number of attempts.
EDIT: the reason for the high abort counts was that a committing thread would be starved by reading threads that kept aborting and restarting because of it. I added a Thread.yield() when a read fails due to a transaction that is ahead; this reduces restart counts to under 200.
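A hypothetical sketch of that failure path (the read and restart methods are illustrative names, not the actual simulation code):

if (!transaction.read(key)) {   // an earlier transaction is ahead of us and must commit first
    Thread.yield();             // give the committing thread a chance to finish
    transaction.restart();      // then abort and retry, as described above
}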