When does a leader ACK the client in DistributedLog? - replication

I'm having difficulty understanding when a leader actually ACKs the client. Here is part of the DistributedLog documentation:
Each batched entry appended to a log segment will be assigned a
monotonically increasing entry id by the log segment writer. All the
entries are written asynchronously in a pipeline. The log segment
writer therefore updates an in-memory pointer, called LAP
(LastAddPushed), which is the entry id of the last batched entry
pushed to the log segment store by the writer. The entries could be
written out of order but are only acknowledged in entry id order. Along
with the successful acknowledgements, the log segment writer also updates
an in-memory pointer, called LAC (LastAddConfirmed). LAC is the entry
id of the last entry that has already been acknowledged by the writer. All the
entries written between LAC and LAP are unacknowledged data, which
is not visible to readers.
The readers can read entries up to LAC as those entries are known to
be durably replicated - and thereby can be safely read without the risk of
violating read ordering. The writer includes the current LAC in each
entry that it sends to BookKeeper. Therefore each subsequent entry
makes the records in the previous entry visible to the readers. LAC
updates can be piggybacked on the next entry that is written by the
writer. Since readers are strictly followers, they can leverage LAC to
read durable data from any of the replicas without the need for any
communication or coordination with the writer.
DL introduces one type of system record, which is called control
record - it acts as the commit request in a two-phase-commit algorithm.
If no application records arrive within the specified SLA, the writer
will generate a control record. Writing the control record
advances the LAC of the log stream. The control record is added
either immediately after receiving acknowledgements from writing a user
record or periodically if no application records are added. This is
configured as part of the writer's flushing policy. While control log
records are present in the physical log stream, they are not delivered
by the log readers to the application.
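If I sketch that bookkeeping myself (the class and method names below are my own illustration, not DistributedLog's actual API), I picture it roughly like this:

// Hypothetical sketch of the LAC/LAP bookkeeping described above; the class and
// method names are illustrative only, not DistributedLog's actual API.
class SegmentWriterSketch {
    private long lastAddPushed = -1;     // LAP: id of the last entry pushed to the segment store
    private long lastAddConfirmed = -1;  // LAC: id of the last entry acknowledged in order

    // Called for each batched entry handed to the segment store.
    long push(byte[] payload) {
        long entryId = ++lastAddPushed;
        // The current LAC is piggybacked on this entry, so reading this entry
        // makes everything up to lastAddConfirmed visible to readers.
        writeToSegmentStoreAsync(entryId, lastAddConfirmed, payload);
        return entryId;
    }

    // Called when the store acknowledges entryId; acks are applied in entry-id order.
    void onAcknowledged(long entryId) {
        if (entryId == lastAddConfirmed + 1) {
            lastAddConfirmed = entryId;  // advance LAC: entries <= LAC are durable and readable
        }
        // Entries between LAC and LAP are still unacknowledged and invisible to readers.
    }

    private void writeToSegmentStoreAsync(long entryId, long piggybackedLac, byte[] payload) {
        // Placeholder for the asynchronous, pipelined write to the log segment store.
    }
}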
Now consider the following scenario:
1) Leader publishes a message to BookKeeper.
2) Followers get the message, append it to their log and send an ACK to the leader.
3) Leader gets the confirmation from the followers, increments LAC and replies to the client that the message is committed.
4) NOW: the leader fails before it could piggyback to the followers that LAC has been incremented.
The question is: since the potential leader is not aware of the fact that LAC has been incremented, it becomes the new leader and truncates the log to the old LAC, which means we have lost an entry in the log that was confirmed by the previous leader.
As a result, the client has been told that the message was successfully written, but it has been lost.

Since the potential leader is not aware of the fact that LAC has been incremented, it becomes the new leader and truncates the log to the old LAC, which means we have lost an entry in the log that was confirmed by the previous leader.
There are several cases:
1) If the leader closes the log gracefully, it will seal the log segment it is writing to. The LAC will be advanced and will also be recorded as part of the log segment metadata (which is stored in the metadata store).
2) If the leader crashes and doesn't close the log gracefully, when a potential leader comes up it will go through a recovery process. The new leader will:
a) Attempt to seal the last log segment written by the previous leader. The seal process is done by the BookKeeper client and includes two parts: (a) it will fence the log segment; fencing enforces that no more writes can happen in this log segment. (b) it will then do a forward recovery from the last known LAC and recover entries that were written but not committed yet.
b) After recovering the last log segment, the new leader will open a new log segment to write entries.
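Conceptually, the recovery in step 2) boils down to something like the following sketch. All the names here are hypothetical placeholders for illustration, not the actual BookKeeper/DistributedLog API:

// Illustrative-only sketch of the new leader's recovery steps; all names are hypothetical.
interface LogSegmentHandle {
    void fence();                       // reject further writes from the old leader
    long lastKnownLAC();                // LAC recorded before the crash
    boolean hasEntry(long entryId);
    void recoverEntry(long entryId);    // re-replicate/confirm an uncommitted entry
    void sealAt(long entryId);          // mark the segment closed at this entry id
}

class RecoverySketch {
    void recoverAfterLeaderCrash(LogSegmentHandle lastSegment) {
        // (a) Fence the segment, then roll forward from the last known LAC to
        //     recover entries that were written but not yet confirmed.
        lastSegment.fence();
        long entryId = lastSegment.lastKnownLAC() + 1;
        while (lastSegment.hasEntry(entryId)) {
            lastSegment.recoverEntry(entryId);
            entryId++;
        }
        lastSegment.sealAt(entryId - 1);  // seal and record in the metadata store

        // (b) Open a fresh log segment for the new leader's own writes.
        // openNewLogSegment();
    }
}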
Hope this answers your question.
DistributedLog also has a paper published in ICDE 2017. You can get it from here.

Related

Rabbitmq :: Message is never removed from stream queue

I have created a stream queue in the RabbitMQ of my project and configured max-age to 1 minute. I sent a message to the queue, all the consumers consumed the message, but the message remains in the queue (I waited more than 1 minute) as "ready". My worry is about the accumulation of messages on the disk of the RabbitMQ instance.
So, my question is: are all the messages marked as "ready" stored on disk, even after all consumers have consumed them? If yes, how can I purge these messages (in this case, max-age is not doing it) from the disk of the RabbitMQ instance?
That is the design; see https://www.rabbitmq.com/streams.html#retention
Streams are implemented as an immutable append-only disk log. This means that the log will grow indefinitely until the disk runs out. To avoid this undesirable scenario it is possible to set a retention configuration per stream which will discard the oldest data in the log based on total log data size and/or age.
There are two parameters that control the retention of a stream. These can be combined. These are either set at declaration time using a queue argument or as a policy which can be dynamically updated. ...
max-age:
    valid units: Y, M, D, h, m, s
    e.g. 7D for a week
max-length-bytes:
    the max total size in bytes
NB: retention is evaluated on a per-segment basis, so there is one more parameter that comes into effect and that is the segment size of the stream. The stream will always leave at least one segment in place as long as the segment contains at least one message. When using broker-provided offset-tracking, offsets for each consumer are persisted in the stream itself as non-message data.
But I see what you mean.
I suggest you ask on the rabbitmq-users Google group where the RabbitMQ engineers hang out; they don't monitor SO closely.
Same problem here, the messages are never deleted.
The solution that I found:
It's not possible to avoid storing data on disk or to purge it, but it is possible to prevent excessive disk usage.
Add the argument x-stream-max-segment-size-bytes to the queue, decreasing the default segment size to a size that suits your needs. I set 1 MB, for example. More details: https://www.rabbitmq.com/streams.html#declaring
At least one segment file will always remain, so if you just send 1 message and wait, it will remain on disk forever. However, if you keep publishing, a new segment file gets created at some point and the retention process kicks in. Files that only contain messages older than the retention period will be deleted.
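Assuming the RabbitMQ Java client, declaring such a stream with both age-based retention and a smaller segment size could look roughly like this (the queue name, max-age and segment size are just example values):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.util.HashMap;
import java.util.Map;

public class DeclareStreamExample {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        Map<String, Object> arguments = new HashMap<>();
        arguments.put("x-queue-type", "stream");
        arguments.put("x-max-age", "1m");                            // retention by age (1 minute, as in the question)
        arguments.put("x-stream-max-segment-size-bytes", 1_000_000); // ~1 MB segments instead of the default
        // Retention is evaluated per segment, so smaller segments let old data be
        // deleted sooner; at least one segment always remains on disk.
        channel.queueDeclare("my-stream", true, false, false, arguments);

        channel.close();
        connection.close();
    }
}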

Best practice for cleaning up EntityStoppedManifest journal entries for permanently terminated actors?

In our actor system, using sharding and persistence, the concrete instances of one of our ReceivePersistentActor implementations are not re-used once they are terminated (passivated), as they represent client sessions identified by a GUID that is generated for each new session.
When a session ends, the ReceivePersistentActor is responsible for cleaning up its own persistence data and will call DeleteSnapshots and DeleteMessages, which works fine. Once these calls have been processed, the actor will call Context.Parent.Tell(new Passivate(PoisonPill.Instance)); to terminate.
After that, the event journal will still contain an EntityStoppedManifest entry ("CD"), as this is generated through the Passivate message.
Over time this will lead to many "CD" entries remaining in the event journal.
Is there a recommended approach for cleaning up such residue entries?
Maybe a separate Janitor actor that cleans up these entries manually?
Or is this even a design flaw on our end?
Looks like I came here too hastily, as those events have been mostly cleaned up by now automagically.
What might have caused those events to accumulate in such high numbers in the first place is that they had been generated during actor recovery instead of during normal operation. But this is just an assumption.

Mulesoft with Salesforce Streaming API using CDC

I am working on a Mule API flow testing out the Salesforce event streams. I have my connector set up and subscribed to a streaming channel.
This is working just fine when I create / update / delete contact records, the events come through and I process them by adding them to another database.
I am slightly confused with the replayId functionality. With the current setup, I can shut down the Mule app, create contacts in the org, and then when I bring the app back online, it resumes by adding data from where it left off. Perfect.
However, I am trying to simulate what would happen if the mule app crashed while processing the events.
I ran some Apex to create 100 random contact records. As soon as I saw it log the first flow in my app, I killed the Mule app. My assumption was that it would know where it left off when I resumed the app, just as if it had been offline prior to the contact creation, like in the previous test.
What I have noticed is that it only processes the few contacts that made it through before I shut the app down.
It appears that the events may be coming into the flow input so quickly that it has already reached the last replayId in the stream. However, since these records still haven't been added to my external database, I am losing those records. The stream did what it was supposed to do, but due to the batch of work the app is still processing, my 100 records are not being committed the way the replayId reflects.
How can I approach this so that I don't end up losing data in the event of a large stream of data prior to an app crash? I remember that with Kafka, you were able to commit the offset once the record was inserted into the database, so that it knew the last one you had officially processed. Is there such a concept in Mule where I can tell it where I have officially left off and committed to the DB?
Reliability at the protocol (CometD) level implies a number of properties. Chief among them is a transactional ACK(nowledgement) of the message having been received by the subscriber. CometD supports ACKs as an extension, but Salesforce's implementation of CometD doesn't support them. Even if it did, you'd still have issues... but the frequency/risk of loss might be lower.
In your case you have to engineer a solution that amounts to finding and replaying events that were not committed to your target database. You do this using custom code or wiring adapters in Mule. Replay ID values are not guaranteed to be contiguous for consecutive events, but they will be ordered: event A with a replay ID of 100 will be followed by event B with some higher replay ID, e.g. 200.
You will need to store a replay ID value in your DB. You can then use it on resubscription (after subscriber failure) to retrieve events from SF that are missing from your DB. This will only work if the failure window is small enough: the Salesforce event retention window is currently 24 hours for the standard platform event license; higher-level licenses allow for longer retention.
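One common pattern, sketched below with plain JDBC (the table and column names are placeholders, not anything Mule- or Salesforce-specific), is to write each record and its replay ID in the same database transaction, and then resubscribe from the highest replay ID found in the DB after a crash:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch: persist each event together with its replay ID in one
// transaction, so the last committed replay ID is always recoverable after a crash.
// Table and column names (contacts, payload, replay_id) are placeholders.
class ReplayCheckpointSketch {

    void handleEvent(Connection db, String contactJson, long replayId) throws SQLException {
        db.setAutoCommit(false);
        try (PreparedStatement insert = db.prepareStatement(
                "INSERT INTO contacts (payload, replay_id) VALUES (?, ?)")) {
            insert.setString(1, contactJson);
            insert.setLong(2, replayId);
            insert.executeUpdate();
            db.commit();           // record and replay ID checkpoint are committed together
        } catch (SQLException e) {
            db.rollback();         // neither the record nor the checkpoint is saved
            throw e;
        }
    }

    // On restart, resubscribe from the highest replay ID that actually reached the database.
    long lastCommittedReplayId(Connection db) throws SQLException {
        try (Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT MAX(replay_id) FROM contacts")) {
            rs.next();
            long max = rs.getLong(1);
            return rs.wasNull() ? -1L : max;   // no checkpoint yet: fall back to a default replay option
        }
    }
}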
Depending on the volume of data, frequency of events and other process parameters, you could get all of this out of the box with Heroku Connect. It does imply a Postgres DB on Heroku + licensing cost of HC and operational costs but most of our customers in similar circumstances find it worthwhile.

Which consistency model do I have?

I have implemented a replicated key/value store on top of Redis. I have passive replication in which all write and read requests are forwarded to the leader, which always returns the last value written for the key. The system uses a quorum, so it works even if some nodes are down or partitioned from the network. In that case, the values on those nodes are not consistent, but this does not prevent the system from returning the most recently updated value. Do I have an eventual consistency model or a strict one? Thanks
You mentioned that it is a quorum-based system, with one node as the leader. Read and write requests are always forwarded to the leader.
For the sake of simplicity, let's assume that there are 5 nodes in the system and one of them is the leader. The other 4 nodes are secondaries.
Typically, quorum-based systems work on consensus protocols. So if 3 out of the 5 nodes have the latest value, that is enough to always return the latest value.
This is how writes should work:
Leader first updates the key/value in its database
Forwards the request to the remaining 4 nodes, which are secondaries
The leader waits for at least 2 of the secondaries to acknowledge that they have stored the latest key/value in their database. It means that out of 5 nodes, at least 3 nodes have the latest updated value.
If the leader does not get a response from at least 2 of the secondary nodes within the specified time period (the request times out), then the leader returns a failure to the client and the client needs to retry.
So, a write request succeeds only if 3 out of 5 nodes have the latest value; at any point, 2 of the nodes may or may not have the latest value and may catch up later (a sketch of this is shown below).
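As a rough sketch (the names and the timeout below are assumptions, not your actual code), the leader-side write then amounts to counting acknowledgements until a majority is reached:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

// Schematic sketch of a leader waiting for a majority before acknowledging a write.
class QuorumWriteSketch {

    interface Secondary {
        CompletableFuture<Void> replicate(String key, String value);
    }

    private final List<Secondary> secondaries;   // e.g. 4 secondaries in a 5-node cluster

    QuorumWriteSketch(List<Secondary> secondaries) {
        this.secondaries = secondaries;
    }

    boolean write(String key, String value) {
        applyLocally(key, value);                        // 1) leader updates its own store first

        List<CompletableFuture<Void>> pending = secondaries.stream()
                .map(s -> s.replicate(key, value))       // 2) forward to all secondaries
                .collect(Collectors.toList());

        int needed = (secondaries.size() + 1) / 2;       // 2 secondary acks + leader = 3 of 5 nodes
        int acked = 0;
        for (CompletableFuture<Void> ack : pending) {
            try {
                ack.get(2, TimeUnit.SECONDS);            // 3) wait for this secondary's acknowledgement
                if (++acked >= needed) {
                    return true;                         // majority reached: report success to the client
                }
            } catch (Exception e) {
                // timed out or failed; this secondary can catch up later
            }
        }
        return false;                                    // 4) no quorum: report failure, client retries
    }

    private void applyLocally(String key, String value) {
        // placeholder for the leader's local write (e.g. to its Redis instance)
    }
}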
For reads, the leader (which has the latest key/value) always returns the response.
What happens when the leader machine is unable to serve requests due to some issue (e.g. a network error)?
Typically these systems have a leader election protocol for when the current leader is unable to serve requests due to some error.
The new leader will be chosen from among the secondaries that have the latest updates. So, the newly elected leader should have the latest state and should start serving read requests with the latest set of values.
Your system is strictly consistent.

Keep lock for multiple updates to a single row in SQL

I want to lock a certain single row in a MariaDB table for multiple updates, and would like to release the lock when I'm finished updating it. The scenario is that multiple machines could be requesting the lock. When one machine gets the lock, it has a batch of work that needs to update the row data multiple times. The other machine may not get a chance at the lock while the first machine is still busy with its batch of work. So if the first machine does two updates, the other machine may not get the lock between the two updates.
I've looked at transactions, but they do a rollback if my application crashes, whereas I want the updates to remain.
Does anyone have an idea on how to solve this kind of issue? Googling it does not produce any good hits, or my search terms might be wrong.
Edit:
Trying to clarify the use case a bit more:
This functionality is for event processing in a distributed system, where there are multiple concurrent consumers.
The events are parts of streams, and the events within a stream must be processed in the correct order, otherwise the system gets corrupted.
For all kinds of reasons the events of a single stream can end up on different consumers, in the wrong order, with a large delay, etc.; these are exceptional cases, not normal operating conditions.
The locking of rows helps make sure that different consumers are not working on the same event stream concurrently:
BEGIN;
SELECT counter, someAdditionalFields FROM streamCounters WHERE streamId=x FOR UPDATE;
# this locks the stream to this consumer
# This thread is processing event with eventId counter+1, this can be in the range of 0ms to a few seconds, certainly not minutes.
# The result of this work ends up in another table
UPDATE streamCounters SET counter=(counter+1) WHERE streamId=x;
# signifies that event was processed, the stream advances by 1
# This thread is processing event with eventId counter+2
UPDATE streamCounters SET counter=(counter+1) WHERE streamId=x;
# signifies that event was processed, the stream advances by 1
# ... until no more ready events available in stream
COMMIT; # release the lock on this stream
The reason I don't want this transaction to roll back in case of a crash is that processing the events means significant changes in other tables.
The changes in the other tables are done by a different part of the application; they provide me a function that I call, and what it does is not really under my control.
Let's say I want to process event 1 in a stream, so I lock the stream.
I call the provided function that processes event 1.
When it returns with success, I update the streamcounter from 0 to 1.
Now I want to process event 2, but I crash.
Now this transaction gets rolled back, the streamcounter goes back to 0, but the event was processed.
My streamcounter now does not represent how much work was done!