Aerospike: How do the primary and secondary indexes work internally?

We are using Aerospike DB and were going through the documentation.
I could not find a good explanation of the algorithm behind the primary and secondary indexes.
The documentation says it uses some sort of distributed hash + B-tree.
Could someone please explain it?

The primary index is a mix of a distributed hash and distributed trees. It holds the metadata for every record in the Aerospike cluster.
Each namespace has 4096 partitions that are evenly distributed to the nodes of the cluster, by way of the partition map. Within the node, the primary index is an in-memory structure that indexes only the partitions assigned to the node.
The primary index has a hash table that leads to sprigs. Each sprig is a red-black tree that holds a portion of the metadata. The number of sprigs per-partition is configurable through partition-tree-sprigs.
Therefore, to find any record in the cluster, the client first uses the record's digest to find the correct node with one lookup against the partition map. Then, the node holding the master partition for the record will look up its metadata in the primary index. If this namespace stores data on SSD, the metadata includes the device, block ID and byte offset of the record, so it can be read with a single read operation. The records are stored contiguously, whether on disk or in memory.
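The lookup path described above can be sketched in a few lines of Python. This is a toy model, not Aerospike's implementation: the digest function is a stand-in (Aerospike itself uses a 20-byte RIPEMD-160 digest), the exact bit slices used for the partition and sprig are assumptions, the sprig count is a hypothetical value for partition-tree-sprigs, the partition map is a simple round-robin, and a plain dict stands in for each red-black tree.

```python
import hashlib

N_PARTITIONS = 4096          # fixed number of partitions per namespace
SPRIGS_PER_PARTITION = 256   # hypothetical value for partition-tree-sprigs


def record_digest(set_name: str, user_key: str) -> bytes:
    """Stand-in 20-byte digest; a real cluster uses RIPEMD-160, not BLAKE2b."""
    data = f"{set_name}:{user_key}".encode()
    return hashlib.blake2b(data, digest_size=20).digest()


def partition_id(digest: bytes) -> int:
    # Assumption: a small slice of the digest selects one of 4096 partitions.
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


def sprig_id(digest: bytes) -> int:
    # Assumption: a different slice of the digest selects the sprig.
    return int.from_bytes(digest[2:4], "little") % SPRIGS_PER_PARTITION


class Node:
    def __init__(self, name):
        self.name = name
        # partition id -> list of sprigs; each sprig maps digest -> metadata.
        # A dict stands in for the red-black tree of a real sprig.
        self.primary_index = {}

    def put_metadata(self, digest, metadata):
        sprigs = self.primary_index.setdefault(
            partition_id(digest),
            [dict() for _ in range(SPRIGS_PER_PARTITION)],
        )
        sprigs[sprig_id(digest)][digest] = metadata

    def get_metadata(self, digest):
        sprigs = self.primary_index.get(partition_id(digest))
        if sprigs is None:
            return None
        return sprigs[sprig_id(digest)].get(digest)


def build_partition_map(nodes):
    # Toy round-robin assignment; the real cluster computes its own map.
    return {pid: nodes[pid % len(nodes)] for pid in range(N_PARTITIONS)}


nodes = [Node("A"), Node("B"), Node("C")]
partition_map = build_partition_map(nodes)

digest = record_digest("users", "alice")
owner = partition_map[partition_id(digest)]        # one lookup to find the node
owner.put_metadata(digest, {"device": 0, "block_id": 17, "offset": 512})
location = owner.get_metadata(digest)              # one lookup in the primary index
```

The point of the sketch is that both steps are constant-time hops driven by the digest: the client never asks the cluster where a record lives, and the node only searches one small tree.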
The primary index is used for operations against a single record (identified by its key), or batch operations against multiple records (identified by a list of keys). It's also used by scans.
Secondary indexes are optional in-memory structures within each node of the cluster, that also only index the records of the partitions assigned to each node. They're used for query operations, which are intended to return many records based on a non-key predicate.
Because Aerospike is a distributed database, a query must go to all the nodes. The concurrency level (how many nodes are queried at a time) is controlled through a query policy in the client. Each node receiving the query has to look up the predicate's criteria in the appropriate secondary index. This returns zero or more records. At this point the optional predicate filter can be applied. The records found by the secondary index query are then streamed back to the client. See the documentation on managing indexes.
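As a rough illustration of that scatter-gather flow, here is a small in-memory sketch. It is not the Aerospike client API: the Node class, the put and query methods, and cluster_query are hypothetical names, and each node's secondary index is modeled as a dict from an indexed bin value to the digests of local records.

```python
from collections import defaultdict


class Node:
    def __init__(self, name):
        self.name = name
        self.records = {}                 # digest -> record (dict of bins)
        self.sindex = defaultdict(set)    # indexed bin value -> local digests

    def put(self, digest, record, indexed_bin):
        old = self.records.get(digest)
        if old is not None and indexed_bin in old:
            # Keep the secondary index in step with overwrites.
            self.sindex[old[indexed_bin]].discard(digest)
        self.records[digest] = record
        if indexed_bin in record:
            self.sindex[record[indexed_bin]].add(digest)

    def query(self, value, predicate=None):
        # Secondary index lookup first, then the optional predicate filter.
        for digest in sorted(self.sindex.get(value, ())):
            record = self.records[digest]
            if predicate is None or predicate(record):
                yield record


def cluster_query(nodes, value, predicate=None):
    # The query has to visit every node, since each one indexes only
    # the records of its own partitions.
    for node in nodes:
        yield from node.query(value, predicate)


a, b = Node("a"), Node("b")
a.put("k1", {"city": "oakland", "age": 30}, "city")
b.put("k2", {"city": "oakland", "age": 17}, "city")
b.put("k3", {"city": "paris", "age": 40}, "city")

adults_in_oakland = list(
    cluster_query([a, b], "oakland", lambda r: r["age"] >= 18)
)
```

This version visits the nodes one at a time; a real client can query several nodes concurrently, which is what the query policy controls.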

Related

Does deleting data from a database (using actual delete SQL query) cause huge problems in re-indexing of table data?

Does deleting data from a database (using actual delete SQL query) cause huge problems in re-indexing of table data (say tens of millions of data) thereby increasing system overhead and consuming more resource?
Most databases do not immediately delete the index nodes associated with deleted rows from the table. Depending on how duplicate index keys are handled, this may have no effect at all. For example, one scheme for duplicate-key index building is to have only a single B+Tree node for the key value, but have it point to a list of rows that contain that key value in the indexed column(s). In that case, deleting one or even many of the rows in the table does not affect the efficiency of the index tree at all until all of the rows with that key value have been deleted, at which time the key node will be flagged as deleted but not necessarily removed from the tree. Of course, in the case of a unique index key, any deletion will result in a node that is flagged as deleted. When that happens to many key values near each other on disk, the index tree may become inefficient.
One solution is to rebuild the index from scratch either by dropping it and recreating it or if the DBMS has the feature by a “reindex” command. Another solution used by some more advanced database systems is to track whenever a search of an index actually encounters a deleted node. If this happens so often that a configured threshold is exceeded then an automated thread will “clean” the index actually removing deleted nodes and possibly compressing mostly empty index pages or even rebalancing the index tree. The advantage of this “cleaner thread” feature is that inefficient indexes that are not often used, or in which the subtrees of the index containing deleted nodes are no longer accessed (imagine deleting out-dated rows in an index whose lead column is the date used to purge rows), do not take up resources to clean or rebuild them since they are not affecting performance.
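The "flag as deleted, clean later" idea can be sketched as follows. This is only an illustration of the threshold mechanism described above, not how any particular DBMS implements it: a dict stands in for the B+Tree, an empty row set plays the role of a node flagged as deleted, and the class name and threshold value are made up.

```python
class TombstoneIndex:
    def __init__(self, clean_threshold=3):
        # key -> set of row ids; an empty set means "flagged as deleted".
        self.nodes = {}
        self.clean_threshold = clean_threshold
        self.tombstone_hits = 0
        self.cleanups = 0

    def insert(self, key, row_id):
        self.nodes.setdefault(key, set()).add(row_id)

    def delete(self, key, row_id):
        rows = self.nodes.get(key)
        if rows is not None:
            rows.discard(row_id)      # the node stays, even when empty

    def search(self, key):
        rows = self.nodes.get(key)
        if rows is None:
            return set()
        if not rows:
            # The search ran into a deleted node: count it, and clean the
            # index once this has happened more often than the threshold.
            self.tombstone_hits += 1
            if self.tombstone_hits > self.clean_threshold:
                self.clean()
            return set()
        return set(rows)

    def clean(self):
        # Physically remove every node that is flagged as deleted.
        self.nodes = {k: v for k, v in self.nodes.items() if v}
        self.tombstone_hits = 0
        self.cleanups += 1


index = TombstoneIndex(clean_threshold=3)
index.insert("2019-01-01", 10)
index.insert("2019-01-02", 20)
index.delete("2019-01-01", 10)        # leaves a tombstone behind
```

An index whose tombstones are never searched never reaches the threshold, so it costs nothing to leave it alone, which is the advantage the answer points out.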

Dynamo DB single partition, Global Secondary indexes

Current Scenario
Datastore used: DynamoDB.
DB size: 15-20 MB
Problem: for storing data I am thinking to use a common hash as the partition key (and timestamp as sort key), so that the complete table is saved in a single partition only. This would give me undivided throughput for the table.
But I also intend to create GSIs for querying, so I was wondering whether it would be wrong to use GSIs for single partition. I can use Local SIs also.
Is this the wrong approach?
Under the hood, a GSI is basically just another DynamoDB table. It follows the same partitioning rules as the main table. Partitions in your primary table are not correlated with the partitions of your GSIs. So it doesn't matter whether your table has a single partition or not.
Using a single partition in DynamoDB is a bad architectural choice overall, but I would argue that for a 20 MB database it doesn't matter too much.
DynamoDB manages table partitioning for you automatically, adding new
partitions if necessary and distributing provisioned throughput
capacity evenly across them.
You can't control which partition an item goes to when the partition key values are different.
I guess what you are going to do is use the same partition key value for all the items, with a different sort key value (timestamp) for each. In this case, I believe the data will be stored in a single partition, though I didn't understand your point regarding undivided throughput.
If you want to keep all the items of the index in a single partition, I think an LSI (Local Secondary Index) would be best suited here. An LSI basically gives you an alternate sort key for the same partition key.
A local secondary index maintains an alternate sort key for a given
partition key value.
If your single-partition rule does not apply to the index and you want a different partition key, then you need a GSI.
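To make the LSI/GSI distinction concrete, here is a sketch of a table definition shaped like the parameters of the DynamoDB CreateTable operation. The table, attribute, and index names are all hypothetical; the only point is that the LSI reuses the table's partition key with a different sort key, while the GSI picks its own partition key.

```python
# All names below (events, pk, created_at, status, customer_id, by_status,
# by_customer) are made up for illustration.
table_definition = {
    "TableName": "events",
    "AttributeDefinitions": [
        {"AttributeName": "pk", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "N"},
        {"AttributeName": "status", "AttributeType": "S"},
        {"AttributeName": "customer_id", "AttributeType": "S"},
    ],
    # Base table: one common partition key value, timestamp as sort key.
    "KeySchema": [
        {"AttributeName": "pk", "KeyType": "HASH"},
        {"AttributeName": "created_at", "KeyType": "RANGE"},
    ],
    # LSI: same partition key, alternate sort key.
    "LocalSecondaryIndexes": [
        {
            "IndexName": "by_status",
            "KeySchema": [
                {"AttributeName": "pk", "KeyType": "HASH"},
                {"AttributeName": "status", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    # GSI: its own partition key, partitioned independently of the table.
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "by_customer",
            "KeySchema": [
                {"AttributeName": "customer_id", "KeyType": "HASH"},
                {"AttributeName": "created_at", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",
}


def hash_key(key_schema):
    """Return the partition (HASH) key attribute name of a key schema."""
    return next(k["AttributeName"] for k in key_schema if k["KeyType"] == "HASH")


table_hash = hash_key(table_definition["KeySchema"])
lsi_hash = hash_key(table_definition["LocalSecondaryIndexes"][0]["KeySchema"])
gsi_hash = hash_key(table_definition["GlobalSecondaryIndexes"][0]["KeySchema"])
```

With boto3 installed and credentials configured, a dict like this could be passed as `boto3.client("dynamodb").create_table(**table_definition)`; the sketch itself makes no AWS calls.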

Affinity Key in Aerospike

In the Aerospike documentation, it is mentioned that Aerospike has 4096 logical partitions, and that each key is hashed and eventually mapped to one of those 4096 partitions, which determines which node stores the data for that key.
However if we have two keys "A" and "AB" and we want to store them in the same node, is there a way?
In Redis it can be achieved by making the keys as "A" and "{A}B" that will make sure that the key "{A}B" will go to a node where "A" is hashed and stored.
In Apache Ignite, same can be done using "AffinityKey".
Does a similar idea exist in Aerospike?
Thanks
Aerospike was designed as a distributed database. Redis was designed to run on a single node, and lacks concepts such as data distribution, clustering, replication, failover, at least natively. I'm aware that you can use various application-side shenanigans to make it into an ad-hoc cluster.
Don't worry about the implementation details of Aerospike's data distribution. Those happen automatically between the client and cluster, and don't require you to do anything on the application side. Instead, think about your access patterns.
First, your Aerospike cluster will make sure the data is evenly distributed. Because work is directly proportional to data, you should make sure the nodes are homogeneous. You can then expect multi-node operations to wrap up in roughly the same amount of time on each node.
You can create a secondary index on the fields that you'll be querying often to enhance the speed of the query. Release 3.12 adds predicate filtering, allowing you to create more complex query predicates on top of the initial secondary index based filter (also see the Java client's PredExp class).
If you don't want to use secondary indexes (there are several valid reasons), you can create your own lookup using external records. In a set called country-school you can have a record for each country (keys such as 'india', 'luxembourg') with the value being a list containing the IDs of the schools in that country. You can get the list with a single get (or a batch-get if it's several records, such as india-1, india-2, ... , india-9999), then use the results to compose a batch-get operation for the schools. Batch reads return results in the order you asked for them, so you can get a large batch, check whether the last element is null, and if not, get another batch.
('ns1', 'country-school', 'us-california') => [ 1, 2, 3, 5, 8, 11, .. ]
Similarly, you can create permutations such as country-state-city (for example, us-california-oakland) with smaller lists. This costs some extra space, but gives you faster (key-value based) retrieval without spending memory on secondary indexes.
('ns1', 'country-school', 'us-california-oakland') => [ 1, 5, 42, .. ]
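The lookup-record pattern above can be sketched without a cluster. In this toy version a dict stands in for the database and batch_get stands in for a client batch read (it returns results in the order requested, with None for missing keys). The 'school' set name, the batch size, and the function names are assumptions made for the example.

```python
store = {}  # (namespace, set, key) -> value; stands in for the cluster


def batch_get(keys):
    # Results come back in the order requested; missing records are None.
    return [store.get(k) for k in keys]


def school_ids(country, batch_size=3):
    """Collect IDs from the lookup records country-1, country-2, ..."""
    ids = []
    start = 1
    while True:
        keys = [
            ("ns1", "country-school", f"{country}-{i}")
            for i in range(start, start + batch_size)
        ]
        for chunk in batch_get(keys):
            if chunk is None:        # ran past the last lookup record
                return ids
            ids.extend(chunk)
        start += batch_size          # last element was present: keep going


def schools_in(country):
    # Second hop: a batch read of the school records themselves.
    keys = [("ns1", "school", sid) for sid in school_ids(country)]
    return [record for record in batch_get(keys) if record is not None]


# Four lookup records for India, each holding a slice of the school IDs.
store[("ns1", "country-school", "india-1")] = [1, 2]
store[("ns1", "country-school", "india-2")] = [3]
store[("ns1", "country-school", "india-3")] = [5, 8]
store[("ns1", "country-school", "india-4")] = [11]
for sid in (1, 2, 3, 5, 8, 11):
    store[("ns1", "school", sid)] = {"id": sid, "country": "india"}

indian_schools = schools_in("india")
```

Every read here is a key-value lookup against the primary index, which is why this approach trades some extra storage for not needing a secondary index.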

Understanding Cache Keys, Index, Partition and Affinity w.r.t reads and writes

I am new to Apache Ignite and come from a Data Warehousing background.
So pardon me if I try to relate to Ignite through DBMS jargon.
I have gone through forums but I am still unclear about some of the basics.
I also would like specific answers to the scenario I have posted later.
1.) CacheMode=PARTITIONED
a.) When a cache is declared as partitioned, does the data get equally
partitioned across all nodes by default?
b.) Is there an option to provide a "partition key" based on which the data
would be distributed across the nodes? Is this what we call the Affinity
Key?
c.) How is partitioning different from affinity and can a cache have both
partition and affinity key?
2.) Affinity Concept
With an Affinity Key defined, when I load data (using loadCache()) into a partitioned cache, will the source rows be sent to the node they belong to or all the nodes on the cluster?
3.) If I create one index on the cache, does it by default become the partition/
affinity key as well? In such a scenario, how is a partition different from index?
SCENARIO DESCRIPTION
I want to load data from a persistent layer into a Staging Cache (assume ~2B) using loadCache(). The cache resides on a 4 node cluster.
a.) How to load data such that each node has to process only 0.5B records?
Is it by using Partitioned Cache mode and defining an Affinity Key?
Then I want to read transactions from the Staging Cache in TRANSACTIONAL atomicity mode, lookup a Target Cache and do some operations.
b.) When I do the lookup on Target Cache, how can I ensure that the lookup is happening only on the node where the data resides and not do lookup on all the nodes on which Target Cache resides?
Would that be using the AffinityKeyMapper API? If yes, how?
c.) Let's say I wanted to do a lookup on a key other than the Affinity Key column; can creating an index on the lookup column help? Would I end up scanning all nodes in that case?
Staging Cache: CustomerID, CustomerEmail, CustomerPhone
Target Cache: Seq_Num, CustomerID, CustomerEmail, CustomerPhone, StartDate, EndDate
This is answered on Apache Ignite users forum: http://apache-ignite-users.70518.x6.nabble.com/Understanding-Cache-Key-Indexes-Partition-and-Affinity-td11212.html
Ignite uses AffinityFunction [1] for data distribution. AF implements two mappings: key->partition and partition->node.
The Key->Partition mapping determines which partition an entry belongs to. It is not concerned with backups, only with data collocation and distribution across partitions.
Usually, the entry key (actually, its hash code) is used to calculate the partition an entry belongs to.
But you can use AffinityKey [2], which is then used instead to manage data collocation. See also the 'org.apache.ignite.cache.affinity.AffinityKey' javadoc.
The Partition->Node mapping determines the primary and backup nodes for a partition. It is not concerned with data collocation, only with backups and the distribution of partitions among nodes.
Cache.loadCache just makes all nodes call the localLoadCache method, which in turn calls CacheStore.loadCache. So each grid node will load all the data from the cache store and then discard the data that is not local to that node.
The same data may reside on several nodes if you use backups. AffinityKey should be part of the entry key, and if an AffinityKey mapping is configured, then AffinityKey will be used instead of the entry key for the entry->partition mapping
and AffinityKey will be passed to AffinityFunction.
Indexes always reside on the same node as the data.
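The two mappings can be sketched in Python to show why an affinity key collocates entries. This is a conceptual model only, not Ignite's API or its actual affinity function: the partition count, the CRC32 hash, and the round-robin node assignment are all stand-ins chosen for the example.

```python
import zlib

PARTITIONS = 1024  # hypothetical partition count


def partition_for(key, affinity_key=None):
    """Key->Partition: hash the affinity key if one is given, else the key."""
    basis = affinity_key if affinity_key is not None else key
    return zlib.crc32(repr(basis).encode()) % PARTITIONS


def nodes_for(partition, nodes, backups=1):
    """Partition->Node: primary first, then backups (toy round-robin)."""
    start = partition % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(backups + 1)]


nodes = ["node-1", "node-2", "node-3", "node-4"]

# Two different entry keys that share the affinity key "customer-42"
# land in the same partition, and therefore on the same primary node.
p_staging = partition_for(("staging", 1001), affinity_key="customer-42")
p_target = partition_for(("target", 7), affinity_key="customer-42")
owners = nodes_for(p_target, nodes, backups=1)
```

Because the first mapping ignores nodes and the second ignores keys, changing the cluster topology only changes which nodes own a partition, never which partition an entry belongs to.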
a. To achieve this, you should implement the CacheStore.loadCache method so that it loads data only for certain partitions. E.g. you can store a partitionID for each row in the database.
However, if you change the AF or the number of partitions, you should update the partitionID for the entries in the database as well.
Alternatively, if it is possible, you can load all the data on a single node and then add the other nodes to the grid. The data will be rebalanced across the nodes automatically.
b. AffinityKey is always used if it is configured, as it should be part of the entry key. So the lookup will always happen on the node where the data resides.
c. I can't understand the question. Would you please clarify, if it is still relevant?

AWS DynamoDB v2: Do I need secondary index for alternative queries?

I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.
Those messages would be later retrieved by the UUID.
In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.
I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somehow confused...
So should I just create a table with the UUID as the hash and messageTimestamp as the range key (together with a "message" attribute that would contain the actual message), and then not create any secondary indices? In the examples that I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.
And if I create the table with hash and range as described, and without a secondary index, then how would I query for all messages that are in a certain timerange regardless of their UUIDs?
DynamoDB introduced Global Secondary Index which would solve this problem.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
We've wrestled with this as well. The best solution we've come up with is to create a second table for storing the time series data. To do this:
1) Use the date plus "bucket" id for a hash key
You could just use the date, but then I'm guessing today's date would become a "hot" key, one that is written with excessive frequency. This can create a serious bottleneck, as the throughput available to a particular DynamoDB partition is equal to the total provisioned throughput divided by the number of partitions. That means if all your writes go to a single key (today's key) and you have a provisioned throughput of 20 writes per second, then with 20 partitions your effective throughput would be 1 write per second. Any requests beyond this would be throttled. Not a good situation.
The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.
2) Use the UUID for the range key
3) Query records by issuing queries for each day and bucket.
This may seem tedious, but it is more efficient than a full scan. Another possibility is to use Elastic Map Reduce jobs, but I have not tried that myself yet so cannot say how easy/effective it is to work with.
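Here is a minimal sketch of the three steps above, using an in-memory dict in place of the DynamoDB table. The "date#bucket" key format, the function names, and the dict-based table are assumptions for illustration; only the bucket count of 200 and the throughput arithmetic come from the text.

```python
import random
from datetime import date

N_BUCKETS = 200  # the upper limit mentioned above


def per_partition_throughput(total, partitions):
    # Provisioned throughput is split evenly across partitions.
    return total / partitions


def hash_key_for(day, bucket):
    # Assumed key format: ISO date, '#', bucket number.
    return f"{day.isoformat()}#{bucket}"


def write_key(day, rng=random):
    # 1) Spread the day's writes over N_BUCKETS hash keys.
    return hash_key_for(day, rng.randint(1, N_BUCKETS))


def keys_to_query(day):
    # 3) Reading a day back means one query per bucket.
    return [hash_key_for(day, b) for b in range(1, N_BUCKETS + 1)]


table = {}  # hash key -> {uuid (range key) -> message}


def put_message(day, uuid, message, rng=random):
    # 2) The UUID is the range key under the bucketed hash key.
    table.setdefault(write_key(day, rng), {})[uuid] = message


def messages_for(day):
    found = {}
    for key in keys_to_query(day):
        found.update(table.get(key, {}))
    return found


today = date(2014, 1, 15)
rng = random.Random(0)
for n in range(50):
    put_message(today, f"uuid-{n}", f"message {n}", rng)
```

Deleting old messages then becomes a matter of querying each day-and-bucket key for the days that have expired, rather than scanning the whole table.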
We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo:
Falling In and Out Of Love with Dynamo
-John
In short, you cannot. All DynamoDB queries MUST include the primary hash key. Optionally, you can also use the range key and/or a local secondary index. With the current DynamoDB functionality, you won't be able to use an LSI as an alternative to the primary index. You also cannot issue a query with only the range key (you can test this out easily in the AWS Console).
A (costly) workaround that I can think of is to issue a scan of the table, adding filters based on the timestamp value in order to find out which items to delete. Note that filtering will not reduce the consumed capacity of the scan, as it still reads the whole table.
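A toy model of why the filter does not make the scan cheaper: every item is read (and counted) before the filter decides whether to return it. The list-based table and the function name are hypothetical; the attribute names follow the question.

```python
def scan_with_filter(table, cutoff):
    """Return the UUIDs of expired items and the number of items read."""
    items_read = 0
    expired = []
    for item in table:
        items_read += 1                      # every item read is paid for
        if item["messageTimestamp"] < cutoff:
            expired.append(item["uuid"])     # the filter only trims the result
    return expired, items_read


table = [
    {"uuid": "a", "messageTimestamp": 100},
    {"uuid": "b", "messageTimestamp": 250},
    {"uuid": "c", "messageTimestamp": 50},
    {"uuid": "d", "messageTimestamp": 400},
]

expired, items_read = scan_with_filter(table, cutoff=200)
```

Here two items match, but all four were read, which is the cost the answer warns about.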