Master object count drops after resizing cluster (adding more nodes) - Aerospike

What could be the reason for a spontaneous master object count drop? I've added 3 more nodes to a cluster containing 17 nodes, and suddenly there are 1 billion fewer records (as reported by the AMC UI).
The replica object count goes to zero as well.
Thanks!

Partitions are going to go through state transitions. Partitions that were shifted from master to replica will no longer report their objects in the master object counts. These partitions will also be in an "acting master" state until the "desync master" partitions receive the full partition. While acting master, the namespace supervisor thread (NSUP) will not count them as master or replica objects. Also, the node that was previously the replica for such a partition will drop the partition to allow for potential incoming migrations.
There will also be partitions where the old master and replica nodes will both be displaced by new nodes. The old master will become an acting master until the new master receives the partition, at which point the old master will drop it and the new master will begin replicating to the new replica. You will periodically observe this in AMC as an increase in the number of active tx migrations.
There are other possible scenarios, but the main takeaway is that once migration settles, your object counts will return to normal.
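If you want to verify that migrations have settled before trusting the counts again, you can poll the info protocol from a client. Here is a minimal sketch using the Aerospike Java client; the exact statistic names returned for "namespace/<ns>" (e.g. migrate_tx_partitions_remaining, migrate_rx_partitions_remaining) vary by server version, so treat them as assumptions and check the metrics reference for your release.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Info;
import com.aerospike.client.cluster.Node;

// Sketch: print per-node migration statistics for one namespace; object counts
// are back to normal once the migrate_* "remaining" values reach zero.
public class MigrationCheck {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            for (Node node : client.getNodes()) {
                // "namespace/<ns>" returns semicolon-delimited key=value pairs.
                String stats = Info.request(node, "namespace/test");
                for (String pair : stats.split(";")) {
                    if (pair.startsWith("migrate_")) {
                        System.out.println(node.getName() + " " + pair);
                    }
                }
            }
        } finally {
            client.close();
        }
    }
}
```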

Related

Aerospike cluster behavior in different consistency modes?

I want to understand the behavior of Aerospike in different consistency modes.
Consider an Aerospike cluster running with 3 nodes and replication factor 3.
AP mode is simple, and the documentation says:
Aerospike will allow reads and writes in every sub-cluster.
And the maximum number of nodes that can go down is < 3 (the replication factor).
For Aerospike strong consistency it says:
Note that the only successful writes are those made on replication-factor number of nodes. Every other write is unsuccessful.
Does this really mean that no writes are allowed if available nodes < replication factor?
And then the same document says:
All writes are committed to every replica before the system returns success to the client. In case one of the replica writes fails, the master will ensure that the write is completed to the appropriate number of replicas within the cluster (or sub-cluster in case the system has been compromised).
What does "appropriate number of replicas" mean?
So if I lose one node from my 3-node cluster with strong consistency and replication factor 3, I will not be able to write data?
For Aerospike strong consistency it says:
Note that the only successful writes are those made on replication-factor number of nodes. Every other write is unsuccessful.
Does this really mean that no writes are allowed if available nodes < replication factor?
Yes, if there are fewer than replication-factor nodes then it is impossible to meet the user-specified replication factor.
All writes are committed to every replica before the system returns success to the client. In case one of the replica writes fails, the master will ensure that the write is completed to the appropriate number of replicas within the cluster (or sub-cluster in case the system has been compromised).
What does "appropriate number of replicas" mean?
It means replication-factor nodes must receive the write. When a node fails, a new node can be promoted to replica status until either the node returns or an operator registers a new roster (cluster membership list).
So if I lose one node from my 3-node cluster with strong consistency and replication factor 3, I will not be able to write data?
Yes, so having all nodes as replicas wouldn't be a very useful configuration. Replication factor 3 allows up to 2 nodes to be down, but only if the remaining nodes are able to satisfy the replication factor. So for replication factor 3 you would probably want to run with a minimum of 5 nodes.
You are correct: with 3 nodes and RF 3, losing one node means the cluster will not be able to successfully take write transactions, since it wouldn't be able to write the required number of copies (3 in this case).
"Appropriate number of replicas" means a number of replicas that matches the configured replication factor.
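To see what this looks like from the application side, here is a minimal sketch using the Aerospike Java client. It assumes a strong-consistency namespace (hypothetically named "sc_ns") and that the failure surfaces as PARTITION_UNAVAILABLE, which is how recent Java clients report partitions that cannot reach their roster replicas; treat both as assumptions for your specific versions.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.ResultCode;

// Sketch: with 3 nodes, RF 3 and one node down, a write to a strong-consistency
// namespace is rejected because replication-factor copies cannot be written.
// The namespace name and the expected result code are assumptions.
public class ScWriteSketch {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            Key key = new Key("sc_ns", "demo", "customer-1");  // hypothetical SC namespace
            client.put(null, key, new Bin("name", "Alice"));
            System.out.println("write committed on replication-factor nodes");
        } catch (AerospikeException ae) {
            if (ae.getResultCode() == ResultCode.PARTITION_UNAVAILABLE) {
                System.out.println("write rejected: fewer than replication-factor nodes available");
            } else {
                throw ae;
            }
        } finally {
            client.close();
        }
    }
}
```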

How are the replication conflicts resolved using a 3rd party?

I've been searching for specific information but couldn't find it; forgive me for being new at this.
I will try to replicate a Firebird DB using SymmetricDS. This is an ERP database, which in my mind will have 1 master and 2 slaves. The 2 slave servers will run locally, and local machines will connect to them as clients.
Say, for example, I am a client of local slave 1. I create a new customer, which automatically gets customer ID 100. At the same time a client of local slave (server) 2 creates a new customer and it gets the same customer ID. Now when these two slaves sync to the master, there will be a conflict.
I know this sounds quite noob; you know you can't hide it.
What would be the best approach to prevent this, rather than solving it afterwards?
I don't think there is one "best" approach. What works best depends on system-specific details... anyway, some options are:
UUID
Use a UUID as the customer ID. Since version 2.5, Firebird has some built-in support for generating and converting UUIDs.
Segmented generators
On each local slave, initialize the customer ID sequence so that the IDs it generates don't overlap with those of the other slaves. E.g. if you use 32-bit integers as the PK and need at most two slaves, you dedicate the top bit as the "slave ID": on the first slave you start the sequence from zero, while on the second you start it from 2147483648 (bin 1000 0000 0000 0000 0000 0000 0000 0000). See the ALTER SEQUENCE statement for how to set a sequence's starting value; a sketch of this scheme follows after the last option below.
ID server
You could have a service which generates IDs. Whenever a slave needs an ID for a customer, it requests it from this special service. To help with performance it probably makes sense to request new IDs in batches and cache them for later use.
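Following up on the "Segmented generators" option, here is a minimal illustrative sketch of computing a non-overlapping starting value per slave and emitting the corresponding Firebird DDL. The sequence name, slave count, and offsets are made up for the example; note that a value like 2147483648 only fits if the PK column is wider than a signed 32-bit INTEGER (e.g. BIGINT), otherwise you would split just the positive range between slaves.

```java
// Illustrative sketch: one reserved top bit splits the ID space between two slaves.
// CUSTOMER_ID_SEQ is a hypothetical sequence name.
public class SequenceSegments {
    public static void main(String[] args) {
        int slaveIdBits = 1;                        // top bit reserved as the "slave ID"
        long range = 1L << (32 - slaveIdBits);      // 2147483648 values per slave

        for (int slaveId = 0; slaveId < (1 << slaveIdBits); slaveId++) {
            long start = slaveId * range;           // slave 1 -> 0, slave 2 -> 2147483648
            System.out.printf("-- run on slave %d:%n", slaveId + 1);
            System.out.printf("ALTER SEQUENCE CUSTOMER_ID_SEQ RESTART WITH %d;%n", start);
        }
    }
}
```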
I suppose the system is legacy and you don't have the ability to change how it works. In a similar situation I solved this problem by letting each slave generate its own sequences. I added a write filter in SymmetricDS on the master node that intercepts each push from a slave and adds a unique prefix per slave. If data has to be synced back to the slaves, then after the data is routed to each slave, add a write filter on the SymmetricDS slave that strips the added prefix.
For example, say the maximum number of slaves is 99 and the target ID length is 10 digits. If slave 1 creates sequence value 198976, left-pad the sequence with zeros and add the slave ID as a prefix: (0)100198976. If slave 17 generated the same sequence value, the master node's filter would change it to 1700198976.
If the same data is changed on the master and has to be sent back to the slave that generated it, the write filter on the slave strips the first two digits (after left-padding with 0 in the case of one-digit slave IDs). Slave 1's value from the master, (0)100198976, becomes 198976 again; and slave 17's value from the master, 1700198976, becomes 198976.
If the whole width of the ID column is already used on the slaves, alter the column on the master, widening it to accommodate the extra digits of the slave IDs.
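A minimal sketch of the prefix/strip arithmetic described above, as it might look inside the two filters. The widths and method names are placeholders for illustration, not the actual SymmetricDS filter API.

```java
// Sketch of the slave-prefix scheme: pad the slave-generated value to a fixed
// width and prepend the slave ID on the master; strip it again on the way back.
public class SlavePrefixedIds {
    static final int ID_WIDTH = 10;                            // total ID length on the master
    static final int SLAVE_ID_WIDTH = 2;                       // up to 99 slaves
    static final int SEQ_WIDTH = ID_WIDTH - SLAVE_ID_WIDTH;    // digits kept for the slave's value

    // Master-side filter: 198976 from slave 17 becomes 1700198976.
    static long addSlavePrefix(int slaveId, long slaveValue) {
        String padded = String.format("%0" + SEQ_WIDTH + "d", slaveValue);  // "00198976"
        return Long.parseLong(slaveId + padded);
    }

    // Slave-side filter: 1700198976 becomes 198976 again.
    static long stripSlavePrefix(long prefixedValue) {
        long modulus = (long) Math.pow(10, SEQ_WIDTH);
        return prefixedValue % modulus;
    }

    public static void main(String[] args) {
        System.out.println(addSlavePrefix(1, 198976));      // 100198976
        System.out.println(addSlavePrefix(17, 198976));     // 1700198976
        System.out.println(stripSlavePrefix(1700198976L));  // 198976
    }
}
```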

Aerospike Design | Request Flow Internals | Resources

Where can I find information about how a read/write request flows through the cluster when fired from the client API?
The Aerospike configuration doc ( http://www.aerospike.com/docs/reference/configuration ) mentions transaction queues, service threads, transaction threads, etc., but they are not discussed in the architecture documentation. I want to understand how this works so that I can configure it accordingly.
From client to cluster node
In your application, a record's key is the 3-tuple (namespace, set, identifier). The key is passed to the client for all key-value methods (such as get and put).
The client then hashes the (set, identifier) portion of the key through RIPEMD-160, resulting in a 20B digest. This digest is the actual unique identifier of the record within the specified namespace of your Aerospike cluster. Each namespace has 4096 partitions, which are distributed across the nodes of the cluster.
The client uses 12 bits of the digest to determine the partition ID of this specific key. Using the partition map, the client looks up the node that owns the master partition corresponding to the partition ID. As the cluster grows, the cost of finding the correct node stays constant (O(1)), as it does not depend on the number of records or the number of nodes.
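As an illustration of the digest-to-partition mapping just described, here is a minimal sketch using the Java client's Key class. Taking the low 12 bits from the first two digest bytes mirrors the description above, but treat that byte layout as an assumption about client internals rather than a documented contract.

```java
import com.aerospike.client.Key;

// Sketch: a key's 20-byte RIPEMD-160 digest mapped to one of 4096 partitions.
public class PartitionIdSketch {
    static final int N_PARTITIONS = 4096;

    public static void main(String[] args) {
        // (namespace, set, identifier): the client hashes (set, identifier) into the digest.
        Key key = new Key("test", "demo", "user-42");
        byte[] digest = key.digest;

        // 12 bits of the digest select the partition ID (assumed layout for illustration).
        int partitionId = ((digest[0] & 0xFF) | ((digest[1] & 0xFF) << 8)) & (N_PARTITIONS - 1);
        System.out.printf("digest length=%d partitionId=%d%n", digest.length, partitionId);
    }
}
```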
The client converts the operation and its data into an Aerospike wire protocol message, then uses an existing TCP connection from its pool (or creates a new one) to send the message to the correct node (the one holding this partition ID's master replica).
Service threads and transaction queues
When an operation message comes in as a NIC transmit/receive queue interrupt, a service thread picks up the message from the NIC. What happens next depends on the namespace this operation is supposed to execute against. If it is an in-memory namespace, the service thread will perform all of the following steps. If it's a namespace whose data is stored on SSD, the service thread will place the operation on a transaction queue, and one of the queue's transaction threads will perform the following steps.
Primary index lookup
Every record has a 64B metadata entry in the in-memory primary index. The primary index is organized as a collection of sprigs per partition, with each sprig implemented as a red-black tree.
The thread (either a transaction thread or the service thread, as mentioned above) finds the partition ID from the record's digest, and skips to the correct sprig of the partition.
Exist, Read, Update, Replace
If the operation is an exists, a read, an update or a replace, the thread acquires a record lock, during which other operations wait to access the specific sprig. This is a very short lived lock. The thread walks the red-black tree to find the entry with this digest. If the operation is an exists, and the metadata entry does exist, the thread will package the appropriate message and respond. For a read, the thread will use the pointer metadata to read the record from the namespace storage.
An update needs to read the record as described above, and then merge in the bin data. A replace is similar to an update, but it skips first reading the current record. If the namespace is in-memory the service thread will write the modified record to memory. If the namespace stores on SSD the merged record is placed in a streaming write buffer, pending a flush to the storage device. The metadata entry in the primary index is adjusted, updating its pointer to the new location of the record. Aerospike performs a copy-on-write for create/update/replace.
Updates and replaces also need to be communicated to the replica(s) if the replication factor of the namespace is greater than 1. After the record locking process, the operation will also be parked in the RW Hash (Serializer) while the replica write completes. This is where other transactions on the same record will queue up until they hit the transaction pending limit (AKA a hot key). The replica write(s) are handled by a different thread (rw-receive), releasing the transaction or service thread to move on to the next operation. When the replica writes complete, the RW Hash lock is released, and the rw-receive thread will package the reply message and send it back to the client.
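From the client side, the transaction pending limit mentioned above shows up as a "hot key" error once too many operations on the same digest are parked in the RW Hash. A rough sketch with the Java client, which reports this as ResultCode.KEY_BUSY (the contention level needed to trigger it depends on server configuration, so treat this as illustrative):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.ResultCode;

// Sketch: hammer a single record from several threads; some writes may be
// rejected with KEY_BUSY once the transaction pending limit is exceeded.
public class HotKeySketch {
    public static void main(String[] args) throws InterruptedException {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key hotKey = new Key("test", "demo", "hot-record");

        Runnable writer = () -> {
            for (int i = 0; i < 1000; i++) {
                try {
                    client.put(null, hotKey, new Bin("value", i));
                } catch (AerospikeException ae) {
                    if (ae.getResultCode() == ResultCode.KEY_BUSY) {
                        System.out.println("hot key: transaction pending limit hit");
                    }
                }
            }
        };

        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(writer);
            threads[t].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        client.close();
    }
}
```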
Create and Delete
If the operation is a new record being written, or a record being deleted, the partition sprig needs to be modified.
Like update/replace, these operations acquire the record-level lock and will go through the RW hash. Because they add or remove a metadata entry from the red-black tree representing the sprig, they must also acquire the index tree reduction lock. This also happens when the namespace supervisor thread finds expired records and removes them from the primary index. A create operation will add an element to the partition sprig.
If the namespace stores on SSD, the create will load the record into a streaming write buffer, pending a flush to SSD, and ahead of the replica write. It will update the metadata entry in the primary index, adjusting its pointer to the new block.
A delete removes the metadata entry from the partition sprig of the primary index.
Summary
exists/read grab the record-level lock, and hold it for the shortest amount of time. That's also the case for update/replace when replication factor is 1.
update/replace also grab the RW hash lock, when replication factor is higher than 1.
create/delete also grab the index tree reduction lock.
For in-memory namespaces the service thread does all the work up to potentially the point of replica writes.
For data on SSD namespaces the service thread throws the operation onto a transaction queue, after which one of its transaction threads handles things such as loading the record into a streaming write buffer for writes, up until the potential replica write.
The rw-receive thread deals with replica writes and returning the message after the update/replace/create/delete write operation.
Further reading
I've addressed key-value operations, but not batch, scan or query. The difference between batch reads and single-key reads is easier to understand once you know how a single read works.
Durable deletes do not remove the metadata entry of the record from the primary index. Instead, a durable delete is a new write operation of a tombstone. There will be a new 64B entry in the primary index, and a 128B entry on the SSD for the record.
Performance optimizations with CPU pinning. See: auto-pin, service-threads, transaction-queues.
Set service-threads == transaction-queues == the number of cores in your CPU, or use CPU pinning (the auto-pin config parameter) if it is available in your version and possible in your OS environment.
Set transaction-threads-per-queue to 3 (the default is 4; for object sizes < 1KB on a non data-in-memory namespace, 3 is optimal).
This changes with server version 4.7+: the transaction is now handled by the service thread itself. By default, the number of service threads is now set to 5x the number of CPU cores. Once a service thread picks a transaction up from the socket buffer, it carries it through to completion, unless the transaction ends up in the rwHash (e.g. writes waiting for replication). The transaction queue still exists internally, but is only relevant for transaction restarts when transactions are queued up in the rwHash (multiple pending transactions for the same digest).

Understanding Cache Keys, Index, Partition and Affinity w.r.t reads and writes

I am new to Apache Ignite and come from a Data Warehousing background.
So pardon me if I try to relate to Ignite through DBMS jargon.
I have gone through forums but I am still unclear about some of the basics.
I also would like specific answers to the scenario I have posted later.
1.) CacheMode=PARTITIONED
a.) When a cache is declared as partitioned, does the data get equally partitioned across all nodes by default?
b.) Is there an option to provide a "partition key" based on which the data would be distributed across the nodes? Is this what we call the Affinity Key?
c.) How is partitioning different from affinity, and can a cache have both a partition and an affinity key?
2.) Affinity Concept
With an Affinity Key defined, when I load data (using loadCache()) into a partitioned cache, will the source rows be sent to the node they belong to, or to all the nodes in the cluster?
3.) If I create one index on the cache, does it by default become the partition/affinity key as well? In such a scenario, how is a partition different from an index?
SCENARIO DESCRIPTION
I want to load data from a persistent layer into a Staging Cache (assume ~2B records) using loadCache(). The cache resides on a 4-node cluster.
a.) How do I load the data such that each node has to process only 0.5B records? Is it by using the PARTITIONED cache mode and defining an Affinity Key?
Then I want to read transactions from the Staging Cache in TRANSACTIONAL atomicity mode, look up a Target Cache and do some operations.
b.) When I do the lookup on the Target Cache, how can I ensure that the lookup happens only on the node where the data resides, rather than on all the nodes on which the Target Cache resides? Would that be done using the AffinityKeyMapper API? If yes, how?
c.) Let's say I wanted to do a lookup on a key other than the Affinity Key column; can creating an index on the lookup column help? Would I end up scanning all nodes in that case?
Staging Cache
CustomerID
CustomerEmail
CustomerPhone
Target Cache
Seq_Num
CustomerID
CustomerEmail
CustomerPhone
StartDate
EndDate
This is answered on Apache Ignite users forum: http://apache-ignite-users.70518.x6.nabble.com/Understanding-Cache-Key-Indexes-Partition-and-Affinity-td11212.html
Ignite uses AffinityFunction [1] for data distribution. AF implements two mappings: key->partition and partition->node.
The key->partition mapping determines the partition a cache entry belongs to. It is not concerned with backups, but with data collocation/distribution over partitions.
Usually the entry key (actually its hash code) is used to calculate the partition an entry belongs to.
But you can use an AffinityKey [2], which would be used instead, to manage data collocation. See also the 'org.apache.ignite.cache.affinity.AffinityKey' javadoc.
The partition->node mapping determines the primary and backup nodes for a partition. It is not concerned with data collocation, but with backups and partition distribution among nodes.
Cache.loadCache just makes all nodes call the localLoadCache method, which in turn calls CacheStore.loadCache. So each grid node will load all the data from the cache store and then discard the data that is not local to that node.
The same data may reside on several nodes if you use backups. The AffinityKey should be a part of the entry key; if an AffinityKey mapping is configured, then the AffinityKey will be used instead of the entry key for the entry->partition mapping, and the AffinityKey will be passed to the AffinityFunction.
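For the scenario in the question, here is a sketch of collocating Target entries with their Staging entries by CustomerID using the @AffinityKeyMapped annotation. The class and field names follow the question's schema; everything else is illustrative.

```java
import java.io.Serializable;
import org.apache.ignite.cache.affinity.AffinityKeyMapped;

// Sketch: a composite key for the Target cache whose affinity is driven by customerId,
// so Target rows land on the same node as the Staging entry with the same CustomerID.
public class TargetKey implements Serializable {
    private final long seqNum;        // part of the entry key (uniqueness)

    @AffinityKeyMapped
    private final long customerId;    // used for the entry -> partition mapping

    public TargetKey(long seqNum, long customerId) {
        this.seqNum = seqNum;
        this.customerId = customerId;
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TargetKey)) return false;
        TargetKey k = (TargetKey) o;
        return seqNum == k.seqNum && customerId == k.customerId;
    }

    @Override public int hashCode() {
        return 31 * Long.hashCode(seqNum) + Long.hashCode(customerId);
    }
}
```

An equivalent approach is to use org.apache.ignite.cache.affinity.AffinityKey directly, e.g. new AffinityKey<>(seqNum, customerId), as mentioned above.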
Indexes always reside on the same node as the data.
a. To achieve this you should implement the CacheStore.loadCache method so that it loads data only for certain partitions. E.g. you can store a partitionID for each row in the database (see the sketch after these answers). However, if you change the AffinityFunction or the number of partitions, you should update the partitionIDs for the entries in the database as well.
The other way: if it is possible, you can load all the data on a single node and then add the other nodes to the grid. The data will be rebalanced over the nodes automatically.
b. The AffinityKey is always used if it is configured, as it should be part of the entry key. So the lookup will always happen on the node where the data resides.
c. I can't understand the question. Would you please clarify whether it is still relevant?
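A rough sketch of the partition-aware CacheStore.loadCache approach for (a), assuming each database row stores a partition_id column as suggested above. The table and column names, the value type, and the JDBC connection string are all illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import javax.cache.Cache;
import javax.cache.integration.CacheLoaderException;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.lang.IgniteBiInClosure;

// Sketch: each node loads only the rows whose stored partition_id belongs to
// that node's primary partitions, so ~2B rows split across the 4 nodes.
public class StagingCacheStore extends CacheStoreAdapter<Long, String> {
    private static final String CACHE_NAME = "StagingCache";
    private static final String JDBC_URL = "jdbc:h2:mem:demo";   // placeholder connection string

    @Override
    public void loadCache(IgniteBiInClosure<Long, String> clo, Object... args) {
        Ignite ignite = Ignition.localIgnite();
        int[] myParts = ignite.affinity(CACHE_NAME).primaryPartitions(ignite.cluster().localNode());

        try (Connection conn = DriverManager.getConnection(JDBC_URL);
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT customer_id, customer_email FROM staging WHERE partition_id = ?")) {
            for (int part : myParts) {
                ps.setInt(1, part);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next())
                        clo.apply(rs.getLong("customer_id"), rs.getString("customer_email"));
                }
            }
        }
        catch (Exception e) {
            throw new CacheLoaderException(e);
        }
    }

    // Read-through/write-through are not the point of this sketch.
    @Override public String load(Long key) { return null; }
    @Override public void write(Cache.Entry<? extends Long, ? extends String> entry) { /* no-op */ }
    @Override public void delete(Object key) { /* no-op */ }
}
```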

Aerospike - Read (with consistency level ALL) when one replica is down

TL;DR
If a replica node goes down and a new partition map is not yet available, will a read with consistency level = ALL fail?
Example:
Given this Aerospike cluster setup:
- 3 physical nodes: A, B, C
- Replicas = 2
- Read consistency level = ALL (reads consult both nodes holding the data)
And this sequence of events:
- A piece of data "DAT" is stored into two nodes, A and B
- Node B goes down.
- Immediately after B goes down, a read request ("request 1") is performed with consistency ALL.
- After ~1 second, a new partition map is generated. The cluster is now aware that B is gone.
- "DAT" now becomes replicated at node C (to preserve replicas=2).
- Another read request ("request 2") is performed with consistency ALL.
It is reasonable to say "request 2" will succeed.
Will "request 1" succeed? Will it:
a) Succeed because two reads were attempted, even if one node was down?
b) Fail because one node was down, meaning only 1 copy of "DAT" was available?
Request 1 and request 2 will succeed. The behavior of the consistency level policies is described here: https://discuss.aerospike.com/t/understanding-consistency-level-overrides/711.
The gist for read/write consistency levels is that they only apply when there are multiple versions of a given partition within the cluster. If there is only one version of a given partition in the cluster then a read/write will only go to a single node regardless of the consistency level.
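For reference, the per-read override in question looks like this with the older Aerospike Java client API, where the ConsistencyLevel enum is available (newer clients renamed these policies, so treat the exact field as version-dependent):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.ConsistencyLevel;
import com.aerospike.client.policy.Policy;

// Sketch: issuing a read with consistency level ALL. When only one version of
// the partition exists in the cluster, the server still consults a single node.
public class ConsistencyAllRead {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            Policy readPolicy = new Policy();
            readPolicy.consistencyLevel = ConsistencyLevel.CONSISTENCY_ALL;  // duplicate-resolve if versions differ

            Record record = client.get(readPolicy, new Key("test", "demo", "DAT"));
            System.out.println(record);
        } finally {
            client.close();
        }
    }
}
```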
So given an Aerospike cluster of A, B, C where A is master and B is replica for partition 1:
Assume B fails and C is now replica for partition 1. Partition 1 receives a write and the partition's version is changed.
Now B is restarted and returns to the cluster. Partition 1 on B will now be different from A and C.
A read arrives with consistency all to node A for a key on partition 1, and there are now 2 versions of that partition in the cluster. We will read the record from nodes A and B and return the latest version (not fail the read).
Time lapse
Migrations are now complete; for partition 1, A is master, B is replica, and C no longer has the partition.
A read arrives with consistency all to node A. Since there is only one version of partition 1, node A responds to the client without consulting node B.