Aerospike Design | Request Flow Internals | Resources - aerospike

Where can I find information about the how flow of the read/write request in the cluster when fired from the client API?
In Aerospike configuration doc ( http://www.aerospike.com/docs/reference/configuration ), it's mentioned about transaction queues, service threads, transaction threads etc but they are not discussed in the architecture document. I want to understand how it works so that I can configure it accordingly.

From client to cluster node
In your application, a record's key is the 3-tuple (namespace, set, identifier). The key is passed to the client for all key-value methods (such as get and put).
The client then hashes the (set, identifier) portion of the key through RIPEMD-160, resulting in a 20B digest. This digest is the actual unique identifier of the record within the specified namespace of your Aerospike cluster. Each namespace has 4096 partitions, which are distributed across the nodes of the cluster.
The client uses 12 bits of the digest to determine the partition ID of this specific key. Using the partition map, the client looks up the node that owns the master partition corresponding to the partition ID. As the cluster grows, the cost of finding the correct node stays constant (O(1)) as it does not depended on the number of records or the number of nodes.
The client converts the operation and its data into an Aerospike wire protocol message, then uses an existing TCP connection from its pool (or creates a new one) to send the message to the correct node (the one holding this partition ID's master replica).
Service threads and transaction queues
When an operation message comes in as a NIC transmit/receive queue interrupt,
a service thread picks up the message from the NIC. What happens next depends on the namespace this operation is supposed to execute against. If it is an in-memory namespace, the service thread will perform all of the following steps. If it's a namespace whose data is stored on SSD, the service thread will place the operation on a transaction queue. One of the queue's transaction threads will perform the following steps.
Primary index lookup
Every record has a 64B metadata entry in the in-memory primary index. The primary-index is expressed as a collection of sprigs per-partition, with each sprig being implemented as a red-black tree.
The thread (either a transaction thread or the service thread, as mentioned above) finds the partition ID from the record's digest, and skips to the correct sprig of the partition.
Exist, Read, Update, Replace
If the operation is an exists, a read, an update or a replace, the thread acquires a record lock, during which other operations wait to access the specific sprig. This is a very short lived lock. The thread walks the red-black tree to find the entry with this digest. If the operation is an exists, and the metadata entry does exist, the thread will package the appropriate message and respond. For a read, the thread will use the pointer metadata to read the record from the namespace storage.
An update needs to read the record as described above, and then merge in the bin data. A replace is similar to an update, but it skips first reading the current record. If the namespace is in-memory the service thread will write the modified record to memory. If the namespace stores on SSD the merged record is placed in a streaming write buffer, pending a flush to the storage device. The metadata entry in the primary index is adjusted, updating its pointer to the new location of the record. Aerospike performs a copy-on-write for create/update/replace.
Updates and replaces also needs to be communicated to the replica(s) if the replication factor of the namespace is greater than 1. After the record locking process, the operation will also be parked in the RW Hash (Serializer), while the replica write completes. This is where other transactions on the same record will queue up until they hit the transaction pending limit (AKA a hot key). The replica write(s) is handled by a different thread (rw-receive), releasing the transaction or service thread to move on to the next operation. When the replica writes complete the RW Hash lock is released, and the rw-receive thread will package the reply message and send it back to the client.
Create and Delete
If the operation is a new record being written, or a record being deleted, the partition sprig needs to be modified.
Like update/replace, these operations acquire the record-level lock and will go through the RW hash. Because they add or remove a metadata entry from the red-black tree representing the sprig, they must also acquire the index tree reduction lock. This process also happens when the namespace supervisor thread finds expired records and remove them from the primary index. A create operation will add an element to the partition sprig.
If the namespace stores on SSD, the create will load the record into a streaming write buffer, pending a flush to SSD, and ahead of the replica write. It will update the metadata entry in the primary index, adjusting its pointer to the new block.
A delete removes the metadata entry from the partition sprig of the primary index.
Summary
exists/read grab the record-level lock, and hold it for the shortest amount of time. That's also the case for update/replace when replication factor is 1.
update/replace also grab the RW hash lock, when replication factor is higher than 1.
create/delete also grab the index tree reduction lock.
For in-memory namespaces the service thread does all the work up to potentially the point of replica writes.
For data on SSD namespaces the service thread throws the operation onto a transaction queue, after which one of its transaction threads handles things such as loading the record into a streaming write buffer for writes, up until the potential replica write.
The rw-receive thread deals with replica writes and returning the message after the update/replace/create/delete write operation.
Further reading
I've addressed key-value operations, but not batch, scan or query. The difference between batch-reads and single-key reads is easier to understand once you know how single-read works.
Durable deletes do not remove the metadata entry of this record from the primary index. Instead, those are a new write operation of a tombstone. There will be a new 64B entry in the primary index, and a 128B entry in the SSD for the record.
Performance optimizations with CPU pinning. See: auto-pin, service-threads, transaction-queues.

Service threads == transaction queues == number of cores in your CPU or use CPU pinning - auto-pin config parameter if available in your version and possible in your OS env.
transaction threads per queue-> 3 (default is 4, for objsize <1KB, non data-in-memory namespace, 3 is optimal)

Changes with server ver 4.7+, the transaction is now handled by the service thread itself. By default, number of service threads is now set to 5 x no. of cpu cores. Once a service thread picks a transaction from the socket buffer, it carries it through completion unless it ends up in the rwHash (e.g. writes for replicating). The transaction queue is still there (internally) but only relevant for transaction restarts when queued up in the rwHash. (Multiple pending transactions for the same digest).

Related

How does zookeeper internally achieve data consistency among leader and follower when leader fail

Apache Zookeeper documentation described steps about how to implement a distributed lock, steps are:
Call create() with the sequence and ephemeral flags set.
Call getChildren(), check if the data created in step 1 has the "lowest sequence number"
...
My question is: if leader A failed after step 1's create() (let's say, the sequence number it produced is 0001), Zookeeper must have failover logic to elect another new leader B, but how does Zookeeper make sure later the create() happened in new leader B will issue the correct sequence (which should be 0002)? otherwise it'll violate the exclusive lock property if if new leader B still produce the old sequence number 0001.
Does Zookeeper achieve this by making sure write (from the previous leader A) will replicated to a quorums of nodes before it replied to client that the write operation is success? If this is the case, how to make sure the failover process will choose a follower that has the latest update to previous leader A?

Aerospike cluster behavior in different consistency mode?

I want to understand the behavior of aerospike in different consistancy mode.
Consider a aerospike cluster running with 3 nodes and replication factor 3.
AP modes is simple and it says
Aerospike will allow reads and writes in every sub-cluster.
And Maximum no. of node which can go down < 3 (replication factor)
For aerospike strong consistency it says
Note that the only successful writes are those made on replication-factor number of nodes. Every other write is unsuccessful
Does this really means the no writes are allowed if available nodes < replication factor.
And then same document says
All writes are committed to every replica before the system returns success to the client. In case one of the replica writes fails, the master will ensure that the write is completed to the appropriate number of replicas within the cluster (or sub cluster in case the system has been compromised.)
what does appropriate number of replica means ?
So if I lose one node from my 3 node cluster with strong consistency and replication factor 3 , I will not be able to wright data ?
For aerospike strong consistency it says
Note that the only successful writes are those made on
replication-factor number of nodes. Every other write is unsuccessful
Does this really means the no writes are allowed if available nodes <
replication factor.
Yes, if there are fewer than replication-factor nodes then it is impossible to meet the user specified replication-factor.
All writes are committed to every replica before the system returns
success to the client. In case one of the replica writes fails, the
master will ensure that the write is completed to the appropriate
number of replicas within the cluster (or sub cluster in case the
system has been compromised.)
what does appropriate number of replica means ?
It means replication-factor nodes must receive the write. When a node fails, a new node can be promoted to replica status until either the node returns or an operator registers a new roster (cluster membership list).
So if I lose one node from my 3 node cluster with strong consistency
and replication factor 3 , I will not be able to wright data ?
Yes, so having all nodes a replicas wouldn't be a very useful configuration. Replication-factor 3 allows up to 2 nodes to be down, but only if the remaining nodes are able to satisfy the replication-factor. So for replication-factor 3 you would probably want to run with a minimum of 5 nodes.
You are correct, with 3 nodes and RF 3, losing one node means the cluster will not be able to successfully take write transactions since it wouldn't be able to write the required number of copies (3 in this case).
Appropriate number of replicas means a number of replicas that would match the replication factor configured.

Iterating over ChronicleMap results in Exception

I am using v3.10.1 of Chronicle Map. In my map, I have approximately 77K entries. I am trying to iterate through this map using entrySet() method. It doesn't iterate successfully and in between throws Chronicle specific exception. Here are the logs produced from Chronicle Map
016-09-17 06:39:15 [ERROR] n.o.c.m.i.CompiledMapIterationContext - Contexts locked on this segment:
net.openhft.chronicle.map.impl.CompiledMapIterationContext#205cd34b: used, segment 62, local state: UNLOCKED, read lock count: 0, update lock count: 0, write lock count: 0
Current thread contexts:
net.openhft.chronicle.map.impl.CompiledMapIterationContext#205cd34b: used, segment 62, local state: UNLOCKED, read lock count: 0, update lock count: 0, write lock count: 0
and Exception:
2016-09-17 06:39:15 [ERROR] akka.dispatch.TaskInvocation - Failed to acquire the lock in 60 seconds.
Possible reasons:
- The lock was not released by the previous holder. If you use contexts API,
for example map.queryContext(key), in a try-with-resources block.
- This Chronicle Map (or Set) instance is persisted to disk, and the previous
process (or one of parallel accessing processes) has crashed while holding
this lock. In this case you should use ChronicleMapBuilder.recoverPersistedTo() procedure
to access the Chronicle Map instance.
- A concurrent thread or process, currently holding this lock, spends
unexpectedly long time (more than 60 seconds) in
the context (try-with-resource block) or one of overridden interceptor
methods (or MapMethods, or MapEntryOperations, or MapRemoteOperations)
while performing an ordinary Map operation or replication. You should either
redesign your logic to spend less time in critical sections (recommended) or
acquire this lock with tryLock(time, timeUnit) method call, with sufficient
time specified.
- Segment(s) in your Chronicle Map are very large, and iteration over them
takes more than 60 seconds. In this case you should
acquire this lock with tryLock(time, timeUnit) method call, with longer
timeout specified.
- This is a dead lock. If you perform multi-key queries, ensure you acquire
segment locks in the order (ascending by segmentIndex()), you can find
an example here: https://github.com/OpenHFT/Chronicle-Map#multi-key-queries
java.lang.RuntimeException: Failed to acquire the lock in 60 seconds.
Possible reasons:
- The lock was not released by the previous holder. If you use contexts API,
for example map.queryContext(key), in a try-with-resources block.
- This Chronicle Map (or Set) instance is persisted to disk, and the previous
process (or one of parallel accessing processes) has crashed while holding
this lock. In this case you should use ChronicleMapBuilder.recoverPersistedTo() procedure
to access the Chronicle Map instance.
- A concurrent thread or process, currently holding this lock, spends
unexpectedly long time (more than 60 seconds) in
the context (try-with-resource block) or one of overridden interceptor
methods (or MapMethods, or MapEntryOperations, or MapRemoteOperations)
while performing an ordinary Map operation or replication. You should either
redesign your logic to spend less time in critical sections (recommended) or
acquire this lock with tryLock(time, timeUnit) method call, with sufficient
time specified.
- Segment(s) in your Chronicle Map are very large, and iteration over them
takes more than 60 seconds. In this case you should
acquire this lock with tryLock(time, timeUnit) method call, with longer
timeout specified.
- This is a dead lock. If you perform multi-key queries, ensure you acquire
segment locks in the order (ascending by segmentIndex()), you can find
an example here: https://github.com/OpenHFT/Chronicle-Map#multi-key-queries
at net.openhft.chronicle.hash.impl.BigSegmentHeader.deadLock(BigSegmentHeader.java:59)
at net.openhft.chronicle.hash.impl.BigSegmentHeader.updateLock(BigSegmentHeader.java:231)
at net.openhft.chronicle.map.impl.CompiledMapIterationContext$UpdateLock.lock(CompiledMapIterationContext.java:768)
at net.openhft.chronicle.map.impl.CompiledMapIterationContext.forEachSegmentEntryWhile(CompiledMapIterationContext.java:3810)
at net.openhft.chronicle.map.impl.CompiledMapIterationContext.forEachSegmentEntry(CompiledMapIterationContext.java:3816)
at net.openhft.chronicle.map.ChronicleMapIterator.fillEntryBuffer(ChronicleMapIterator.java:61)
at net.openhft.chronicle.map.ChronicleMapIterator.hasNext(ChronicleMapIterator.java:77)
at java.lang.Iterable.forEach(Iterable.java:74)
It is a single-threaded and persistent map.

Global revision without locking

Given this set of rules, would it be possible to implement this in SQL?
Two transactions that don't modify the same rows should be able to run concurrently. No locks should occur (or at least their use should be minimized as much as possible).
Transactions can only read committed data.
A revision is defined as an integer value in the system.
A new transaction must be able to increment and query a new revision. This revision will be applied to every rows that the transaction modifies.
No 2 transactions can share the same revision.
A transaction X that is committed before transaction Y must have a revision lower than the one assigned to transaction Y.
I want to use integer as the revision in order to optimize how I query all changes since a specific revision. Something like this:
SELECT * FROM [DummyTable] WHERE [DummyTable].[Revision] > clientRevision
My current solution uses an SQL table [GlobalRevision] with a single row [LastRevision] to keep the latest revision. All my transactions' isolation level are set to Snapshot.
The problem with this solution is that the [GlobalRevision] table with the single row [LastRevision] becomes a point of contention. This is because I must increment the revision at the start of a transaction so that I can apply the new revision to the modified rows. This will keep a lock on the [LastRevision] row throughout the duration of the transaction, killing the concurrency. Even though two concurrent transactions modify totally different rows, they cannot be executed concurrently (Rule #1: Failed).
Is there any pattern in SQL to solve this kind of issue? One solution is to use Guids and keep an history of revisions (like git revisions) but this is less easier than just having an integer that we can compare to see if a revision is newer than another one.
UPDATE:
The business case for this is to create a Baas system (Backend as a service) with data synchronization between client and server. Here are some use cases for this kind of system:
Client while online modifies an asset, pushes the update to the server, server updates DB [this is where my question relates to], server sends update notifications to interested clients that synchronize their local data with the new changes.
Client connects to server, client requests a pull to the server, server finds all changes that were applied after client's revision and return them to the client, client applies the changes and sets its new revision.
...
As you can see, the global revision lets me put a revision on every changes committed on the server and from this revision, I can determine what updates need to be sent to the clients depending on their specific revision.
This needs to scale to multiple thousands of users that can push updates in parallel and those changes must be synchronized to other connected users. So the longer it takes to execute a transaction, the longer it takes for other users to receive the change notifications.
I want to avoid as much as possible contention for this reason. I am not an expert in SQL so I just want to make sure there is not something I am missing that would let me do that easily.
Probably the easiest thing for you to try would be to use a SEQUENCE for your revision number, assuming you're at SQL 2012 or newer. This is a lighter-weight way of generating an auto-incrementing value that you can use as a revision ID per your rules. Acquiring them at scale should be far less subject to the contention issues you describe than using a full-fledged table.
You do need to know that you could end up with revision number gaps if a given transaction rolled back, because SEQUENCE values operate outside of transactional scope. From the article:
Sequence numbers are generated outside the scope of the current
transaction. They are consumed whether the transaction using the
sequence number is committed or rolled back.
If you can relax the requirement for an integer revision number and settle for knowing what the data was at a given point in time, you might be able to use Change Data Capture, or, in SQL 2016, Temporal Tables. Both of these technologies allow you to "turn back time" and see what the data looked like at a known timestamp.

Real time analytic processing system design

I am designing a system that should analyze large number of user transactions and produce aggregated measures (such as trends and etc).
The system should work fast, be robust and scalable.
System is java based (on Linux).
The data arrives from a system that generate log files (CSV based) of user transactions.
The system generates a file every minute and each file contains the transactions of different users (sorted by time), each file may contain thousands of users.
A sample data structure for a CSV file:
10:30:01,user 1,...
10:30:01,user 1,...
10:30:02,user 78,...
10:30:02,user 2,...
10:30:03,user 1,...
10:30:04,user 2,...
.
.
.
The system I am planning should process the files and perform some analysis in real-time.
It has to gather the input, send it to several algorithms and other systems and store computed results in a database. The database does not hold the actual input records but only high level aggregated analysis about the transactions. For example trends and etc.
The first algorithm I am planning to use requires for best operation at least 10 user records, if it can not find 10 records after 5 minutes, it should use what ever data available.
I would like to use Storm for the implementation, but I would prefer to leave this discussion in the design level as much as possible.
A list of system components:
A task that monitors incoming files every minute.
A task that read the file, parse it and make it available for other system components and algorithms.
A component to buffer 10 records for a user (no longer than 5 minutes), when 10 records are gathered, or 5 minute have passed, it is time to send the data to the algorithm for further processing.
Since the requirement is to supply at least 10 records for the algorithm, I thought of using Storm Field Grouping (which means the same task gets called for the same user) and track the collection of 10 user's records inside the task, of course I plan to have several of these tasks, each handles a portion of the users.
There are other components that work on a single transaction, for them I plan on creating other tasks that receive each transaction as it gets parsed (in parallel to other tasks).
I need your help with #3.
What are the best practice for designing such a component?
It is obvious that it needs to maintain the data for 10 records per users.
A key value map may help, Is it better to have the map managed in the task itself or using a distributed cache?
For example Redis a key value store (I never used it before).
Thanks for your help
I had worked with redis quite a bit. So, I'll comment on your thought of using redis
#3 has 3 requirements
Buffer per user
Buffer for 10 Tasks
Should Expire every 5 min
1. Buffer Per User:
Redis is just a key value store. Although it supports wide variety of datatypes, they are always values mapped to a STRING key. So, You should decide how to identify a user uniquely incase you need have per user buffer. Because In redis you will never get an error when you override a key new value. One solution might be check the existence before write.
2. Buffer for 10 Tasks: You obviously can implement a queue in redis. But restricting its size is left to you. Ex: Using LPUSH and LTRIM or Using LLEN to check the length and decide whether to trigger your process. The key associated with this queue should be the one you decided in part 1.
3. Buffer Expires in 5 min: This is a toughest task. In redis every key irrespective of underlying datatype it value has, can have an expiry. But the expiry process is silent. You won't get notified on expiry of any key. So, you will silently lose your buffer if you use this property. One work around for this is, having an index. Means, the index will map a timestamp to the keys who are all need to be expired at that timestamp value. Then in background you can read the index every minute and manually delete the key [after reading] out of redis and call your desired process with the buffer data. To have such an index you can look at Sorted Sets. Where timestamp will be your score and set member will be the keys [unique key per user decided in part 1 which maps to a queue] you wish to delete at that timestamp. You can do zrangebyscore to read all set members with specified timestamp
Overall:
Use Redis List to implement a queue.
Use LLEN to make sure you are not exceeding your 10 limit.
Whenever you create a new list make an entry into index [Sorted Set] with Score as Current Timestamp + 5 min and Value as the list's key.
When LLEN reaches 10, remember to read then remove the key from the index [sorted set] and from the db [delete the key->list]. Then trigger your process with data.
For every one min, generate current timestamp, read the index and for every key, read data then remove the key from db and trigger your process.
This might be my way to implement it. There might be some other better way to model your data in redis
For your requirements 1 & 2: [Apache Flume or Kafka]
For your requirement #3: [Esper Bolt inside Storm. In Redis for accomplishing this you will have to rewrite the Esper Logic.]