Understanding Cache Keys, Indexes, Partitions and Affinity w.r.t. Reads and Writes - Ignite

I am new to Apache Ignite and come from a Data Warehousing background.
So pardon me if I try to relate to Ignite through DBMS jargon.
I have gone through forums but I am still unclear about some of the basics.
I also would like specific answers to the scenario I have posted later.
1.) CacheMode=PARTITIONED
a.) When a cache is declared as partitioned, does the data get equally partitioned across all nodes by default?
b.) Is there an option to provide a "partition key" based on which the data would be distributed across the nodes? Is this what we call the Affinity Key?
c.) How is partitioning different from affinity, and can a cache have both a partition key and an affinity key?
2.) Affinity Concept
With an Affinity Key defined, when I load data (using loadCache()) into a partitioned cache, will the source rows be sent only to the node they belong to, or to all the nodes in the cluster?
3.) If I create one index on the cache, does it by default become the partition/affinity key as well? In such a scenario, how is a partition different from an index?
SCENARIO DESCRIPTION
I want to load data from a persistence layer into a Staging Cache (assume ~2B records) using loadCache(). The cache resides on a 4-node cluster.
a.) How do I load data such that each node has to process only 0.5B records? Is it by using PARTITIONED cache mode and defining an Affinity Key?
Then I want to read transactions from the Staging Cache in TRANSACTIONAL atomicity mode, look up a Target Cache and do some operations.
b.) When I do the lookup on the Target Cache, how can I ensure that the lookup happens only on the node where the data resides, and not on all the nodes on which the Target Cache resides? Would that be by using the AffinityKeyMapper API? If yes, how?
c.) Let's say I wanted to do a lookup on a key other than the Affinity Key column; can creating an index on the lookup column help? Would I end up scanning all nodes in that case?
Staging Cache
CustomerID
CustomerEmail
CustomerPhone
Target Cache
Seq_Num
CustomerID
CustomerEmail
CustomerPhone
StartDate
EndDate

This is answered on Apache Ignite users forum: http://apache-ignite-users.70518.x6.nabble.com/Understanding-Cache-Key-Indexes-Partition-and-Affinity-td11212.html
Ignite uses AffinityFunction [1] for data distribution. The AffinityFunction implements two mappings: key->partition and partition->node.
The key->partition mapping determines which partition an entry belongs to. It is not concerned with backups, only with data collocation and distribution over partitions.
Usually the entry key (actually its hash code) is used to calculate the partition an entry belongs to.
But you can use an AffinityKey [2], which would be used instead, to manage data collocation. See also the 'org.apache.ignite.cache.affinity.AffinityKey' javadoc.
The partition->node mapping determines the primary and backup nodes for a partition. It is not concerned with data collocation, only with backups and with partition distribution among nodes.
Cache.loadCache just makes all nodes call the localLoadCache method, which in turn calls CacheStore.loadCache. So each grid node will load all the data from the cache store and then discard the data that is not local to that node.
The same data may reside on several nodes if you use backups. The AffinityKey should be a part of the entry key, and if an AffinityKey mapping is configured, the AffinityKey will be used instead of the entry key for the entry->partition mapping and will be passed to the AffinityFunction.
Indexes always reside on the same node as the data.
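To make the collocation idea above concrete, here is a minimal sketch of declaring an affinity key on a custom key class (the class and field names are hypothetical, not taken from the question):

    import java.io.Serializable;
    import org.apache.ignite.cache.affinity.AffinityKeyMapped;

    // Hypothetical cache key: all entries with the same customerId map to the
    // same partition, because customerId is annotated as the affinity key.
    public class TransactionKey implements Serializable {
        private long txId;              // unique part of the key

        @AffinityKeyMapped
        private long customerId;        // used for the key -> partition mapping

        public TransactionKey(long txId, long customerId) {
            this.txId = txId;
            this.customerId = customerId;
        }
    }

Alternatively, the built-in org.apache.ignite.cache.affinity.AffinityKey wrapper can be used, e.g. cache.put(new AffinityKey<>(txId, customerId), value).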
a. To achieve this you should implement the CacheStore.loadCache method to load data only for certain partitions. E.g., you can store a partitionID for each row in the database, as sketched below.
However, if you change the AffinityFunction or the number of partitions, you should update the partitionID for the entries in the database as well.
The other way: if it is possible, you can load all the data on a single node and then add other nodes to the grid. The data will be rebalanced across the nodes automatically.
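A rough sketch of the first approach (the store class, cache name and query helper below are assumptions, not part of the answer): the CacheStore.loadCache implementation asks the affinity for the local node's primary partitions and loads only the rows whose precomputed partitionID matches.

    import java.util.Collections;
    import javax.cache.Cache;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.cache.affinity.Affinity;
    import org.apache.ignite.cache.store.CacheStoreAdapter;
    import org.apache.ignite.lang.IgniteBiInClosure;
    import org.apache.ignite.resources.IgniteInstanceResource;

    // Hypothetical partition-aware store. It assumes the database rows carry a
    // partition_id column precomputed with the same AffinityFunction the cache uses.
    public class PartitionAwareStore extends CacheStoreAdapter<Long, String> {
        @IgniteInstanceResource
        private Ignite ignite;

        @Override
        public void loadCache(IgniteBiInClosure<Long, String> clo, Object... args) {
            Affinity<Long> aff = ignite.affinity("stagingCache");
            int[] localParts = aff.primaryPartitions(ignite.cluster().localNode());

            // e.g. SELECT id, payload FROM staging WHERE partition_id IN (:localParts)
            for (Cache.Entry<Long, String> row : selectRowsForPartitions(localParts))
                clo.apply(row.getKey(), row.getValue());
        }

        // Assumed JDBC helper, omitted for brevity.
        private Iterable<Cache.Entry<Long, String>> selectRowsForPartitions(int[] parts) {
            return Collections.emptyList();
        }

        @Override public String load(Long key) { return null; }
        @Override public void write(Cache.Entry<? extends Long, ? extends String> entry) { }
        @Override public void delete(Object key) { }
    }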
b. The AffinityKey is always used if it is configured, since it must be part of the entry key. So the lookup will always happen on the node where the data resides (see the sketch after these answers).
c. I can't understand the question. Would you please clarify whether it is still relevant?
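To illustrate (b): besides relying on the affinity key embedded in the entry key, a collocated lookup can be made explicit with affinityRun, which ships the closure to the primary node for the given affinity key so the read stays local. The cache name and key below are placeholders.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class CollocatedLookup {
        public static void main(String[] args) {
            Ignite ignite = Ignition.start();

            long customerId = 42L;  // placeholder affinity key

            // The closure is executed on the node that owns the primary partition
            // for customerId, so the cache read below never leaves that node.
            ignite.compute().affinityRun("targetCache", customerId, () -> {
                IgniteCache<Long, String> target = Ignition.localIgnite().cache("targetCache");
                System.out.println(target.localPeek(customerId));
            });
        }
    }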

Related

Is there a way of shuffling partition data on Apache Ignite?

I've got a question that is related to data repartitioning.
Suppose there's a cache with a pre-defined affinity key. Assume I need to repartition data with a new affinity key. I'm wondering whether there is a way of shuffling partition data across all nodes by a new affinity key?
You need to repopulate the data in that case.
First, it's a static configuration and can't be changed on the fly.
Second, most likely you will need to clear the meta-information for that particular type, i.e. clean the work/binary_meta folder.
Last, once you have changed it, you won't be able to locate the existing data, since it will most likely now be stored in a different partition.
In other words, say you had a cache key with two fields A and B: K(A,B), where A is your affinity key, and your Key(1,2) was mapped to partition 5. In that case, to locate the value, Ignite will look for partition 5 on whichever node holds the primary copy of it. Suppose you later want B as the affinity key and reconfigure the cache accordingly. Key(1,2) might then map to partition 780, meaning that Ignite will never search partition 5 and won't be able to locate the previously stored data.
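As a concrete sketch of the K(A,B) example (names are illustrative only): the affinity key is baked into the key class, so switching it effectively defines a new key-to-partition mapping, and the data has to be repopulated, e.g. by streaming the old cache into a freshly configured one.

    import java.io.Serializable;
    import org.apache.ignite.cache.affinity.AffinityKeyMapped;

    // Original layout: A is the affinity key, so Key(1, 2) hashes to some partition, say 5.
    class KeyByA implements Serializable {
        @AffinityKeyMapped long a;
        long b;
    }

    // New layout: B is the affinity key. The same logical key (1, 2) now hashes to a
    // different partition, which is why the existing data can no longer be located
    // and must be reloaded (e.g. via IgniteDataStreamer into a new cache).
    class KeyByB implements Serializable {
        long a;
        @AffinityKeyMapped long b;
    }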

Best practice to do join in Apache Ignite

I have two large tables A and B, and I want to join these two tables on two columns, say, project_id and customer_id.
When I do the join in Apache Ignite, I find that the performance is very bad. After investigating, I think the problem is that the data resides on different nodes randomly.
When the join happens, data has to be transferred between nodes so that rows with the same project_id and customer_id from A and B end up on the same node.
For my case, I see two options:
1. Load the data into the Ignite cluster partitioned by A's and B's project_id and customer_id, so that there is no data transfer when doing the join. This solution can work but is not flexible.
2. Use only one node to hold all the data. This solution can work, but there is a memory limit for a single node (not too much data can be held by one node).
I would ask which solution would be a better choice, thanks!
The former (1.) is recommended. You should load the data in such a fashion that data for the same project_id and customer_id is on the same node in both tables.
This is called affinity collocation, and getting it right is paramount for good performance of Ignite queries (and sometimes for them to work at all).
Ignite will take care of affinity collocation for you if you set it up correctly, but there are a few caveats right away (see the sketch after this list):
The affinity key has to be a part of the primary key (not a value field).
The affinity key has to be a single field (so you have to choose between project_id and customer_id), a composite type (a nested POJO with its own implications), or maybe a synthetic value.
There is a possibility of uneven partition distribution. Imagine you have a single large customer (or project). When processing this customer, all nodes but a single one will be idle and unused.
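As a hedged sketch of option (1.) (the key classes and field names are assumptions, not from the question): both keys mark customer_id as the affinity key, so rows of A and B with the same customer_id land on the same node and the join can run locally. A synthetic field combining project_id and customer_id could be used instead if collocation on both columns is required.

    import java.io.Serializable;
    import org.apache.ignite.cache.affinity.AffinityKeyMapped;
    import org.apache.ignite.cache.query.annotations.QuerySqlField;

    // Hypothetical key for table A: collocated by customerId.
    class AKey implements Serializable {
        @QuerySqlField long projectId;
        @QuerySqlField @AffinityKeyMapped long customerId;
    }

    // Hypothetical key for table B: collocated the same way, so the join on
    // (project_id, customer_id) never needs to move rows between nodes.
    class BKey implements Serializable {
        @QuerySqlField long projectId;
        @QuerySqlField @AffinityKeyMapped long customerId;
    }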

Aerospike: How Primary & Secondary Index works internally

We are using Aerospike DB and was going through the documentation.
I could not find a good explanation of the algorithm behind how the primary and secondary indexes work.
The documentation says it uses some sort of distributed hash + B-tree.
Could someone please explain it?
The primary index is a mix of a distributed hash and distributed trees. It holds the metadata for every record in the Aerospike cluster.
Each namespace has 4096 partitions that are evenly distributed to the nodes of the cluster, by way of the partition map. Within the node, the primary index is an in-memory structure that indexes only the partitions assigned to the node.
The primary index has a hash table that leads to sprigs. Each sprig is a red-black tree that holds a portion of the metadata. The number of sprigs per-partition is configurable through partition-tree-sprigs.
Therefore, to find any record in the cluster, the client first uses the record's digest to find the correct node with one lookup against the partition map. Then, the node holding the master partition for the record will look up its metadata in the primary index. If this namespace stores data on SSD, the metadata includes the device, block ID and byte offset of the record, so it can be read with a single read operation. The records are stored contiguously, whether on disk or in memory.
The primary index is used for operations against a single record (identified by its key), or batch operations against multiple records (identified by a list of keys). It's also used by scans.
Secondary indexes are optional in-memory structures within each node of the cluster, that also only index the records of the partitions assigned to each node. They're used for query operations, which are intended to return many records based on a non-key predicate.
Because Aerospike is a distributed database, a query must go to all the nodes. The concurrency level (how many nodes are queried at a time) is controlled through a query policy in the client. Each node receiving the query has to look up the criteria of the predicate against the appropriate secondary index. This returns zero to many records. At this point the optional predicate filter can be applied. The records found by the secondary index query are then streamed back to the client. See the documentation on managing indexes.
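A hedged sketch with the Aerospike Java client (namespace, set, bin names and values are made up) contrasting the two paths: a primary-key read resolved through the partition map and primary index, and a secondary-index query that fans out to the nodes.

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Key;
    import com.aerospike.client.Record;
    import com.aerospike.client.query.Filter;
    import com.aerospike.client.query.RecordSet;
    import com.aerospike.client.query.Statement;

    public class AerospikeLookups {
        public static void main(String[] args) {
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

            // Primary-index path: the key's digest selects the partition (and thus the
            // node via the partition map); that node then resolves the record metadata
            // in its in-memory primary index.
            Key key = new Key("test", "users", "user-42");
            Record record = client.get(null, key);
            System.out.println(record);

            // Secondary-index path: the query goes to the nodes, and each node consults
            // its local secondary index on the "age" bin (assumed to be indexed).
            Statement stmt = new Statement();
            stmt.setNamespace("test");
            stmt.setSetName("users");
            stmt.setFilter(Filter.equal("age", 30));
            try (RecordSet rs = client.query(null, stmt)) {
                while (rs.next())
                    System.out.println(rs.getRecord());
            }

            client.close();
        }
    }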

Using JedisCluster to write to a partition in a Redis Cluster

I have a Redis Cluster. I am using JedisCluster client to connect to my Redis.
My application is a bit complex and I basically want to control which partition the data from my application goes to. For example, my application consists of sub-modules A, B and C. I want all data from sub-module A to go to partition 1, data from sub-module B to go to partition 2, and so on.
I am using JedisCluster, but I don't find any API to write to a particular partition on my cluster. I am assuming I will have the same partition names on all my Redis nodes, that which data goes to which node will be handled automatically, and that which partition the data goes to will be handled by me.
I tried going through the JedisCluster lib at
https://github.com/xetorthio/jedis/blob/b03d4231f4412c67063e356a7c3acf9bb7e62534/src/main/java/redis/clients/jedis/JedisCluster.java
but couldn't find anything. Please help?
Thanks in advance for the help.
That's not how Redis Cluster works. With Redis Cluster, each node (partition) handles a defined set of key slots. Writing a key to a master node that does not serve that key's slot results in rejection of the command.
From the Redis Cluster Spec:
Redis Cluster implements a concept called hash tags that can be used in order to force certain keys to be stored in the same node.
[...]
The key space is split into 16384 slots, effectively setting an upper limit for the cluster size of 16384 master nodes (however the suggested max size of nodes is in the order of ~ 1000 nodes).
Each master node in a cluster handles a subset of the 16384 hash slots.
You need to define at the cluster configuration level which master node exclusively serves a particular slot or set of slots. This configuration results in data locality.
The slot is calculated from the key. The good news is that you can enforce a particular slot for a key by using Key hash tags:
There is an exception for the computation of the hash slot that is used in order to implement hash tags. Hash tags are a way to ensure that multiple keys are allocated in the same hash slot. This is used in order to implement multi-key operations in Redis Cluster.
Example:
{user1000}.following
The content between {…} is used to calculate the slot. Key hash tags allow you to group keys into the same slot, and therefore onto the same node, giving you the same data locality for any keys that share the tag.
You can also go a step further by using known hash tags that map to specific slots (you'd need to either precalculate a table or see this Gist). By using known hash tags that map to a specific slot, you are able to select the slot, and so the master node, on which the data is located.
Everything else is handled by your Redis client.
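A minimal sketch with JedisCluster and hash tags (the seed address, tags and keys are placeholders): all keys sharing the {moduleA} tag hash to the same slot, and therefore to the same master, while the client routes each command automatically.

    import redis.clients.jedis.HostAndPort;
    import redis.clients.jedis.JedisCluster;

    public class HashTagExample {
        public static void main(String[] args) throws Exception {
            // Any reachable cluster node can serve as the seed node.
            JedisCluster cluster = new JedisCluster(new HostAndPort("127.0.0.1", 7000));

            // Only the text inside {...} is hashed, so both keys map to the same slot.
            cluster.set("{moduleA}:user:1", "a1");
            cluster.set("{moduleA}:user:2", "a2");

            // Keys tagged {moduleB} may hash to a different slot on another master.
            cluster.set("{moduleB}:order:1", "b1");

            System.out.println(cluster.get("{moduleA}:user:1"));
            cluster.close();
        }
    }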

Does HBase uses a primary index?

How does HBase perform a lookup and retrieve a record?
e.g., what is the equivalent in HBase for the RDBMS's B-Trees?
[EDIT]
I understand how HBase resolves the -ROOT- and .META. tables to find out which region holds the data. But how is the local lookup performed?
To better illustrate, here's an example:
I am starting a search (get or scan) for the record with key 77.
The HBase client figures out that the key is contained in the 50-100 region, held by RegionServer X.
The HBase client contacts RegionServer X to get the data.
How does RegionServer X find out the location of record 77?
Does the RegionServer use some kind of lookup table (e.g. like the RDBMS's B-trees) for the keys of a region?
Or does it need to read all the contents of the StoreFiles for records from 50 to 77?
TL;DR: it looks like HBase (like BigTable) uses a structure similar to a B+ tree to do the lookup, so the row key is the primary index (and, by default, the only index of any sort in HBase).
Long answer: From this Cloudera blog post about HBase write path, it looks like HBase operates the following way:
Each HBase table is hosted and managed by sets of servers which fall
into three categories:
One active master server
One or more backup master servers
Many region servers
Region servers contribute to handling the HBase tables. Because HBase
tables can be large, they are broken up into partitions called
regions. Each region server handles one or more of these regions.
There's some more detail in the next paragraph:
Since the row key is sorted, it is easy to determine which region
server manages which key. ... Each row key belongs to a specific
region which is served by a region server. So based on the put or
delete’s key, an HBase client can locate a proper region server. At
first, it locates the address of the region server hosting the -ROOT-
region from the ZooKeeper quorum. From the root region server, the
client finds out the location of the region server hosting the -META-
region. From the meta region server, then we finally locate the
actual region server which serves the requested region. This is a
three-step process, so the region location is cached to avoid this
expensive series of operations.
From another Cloudera blog post, it looks like the exact format used to store HBase data on the file system keeps changing, but the above mechanism for row key lookups should be more or less consistent.
This mechanism is very, very similar to Google BigTable's lookup (you will find the details in Section 5.1, starting at the end of page 4 of the PDF), which uses a three-level hierarchy to find the row key's location: Chubby -> Root tablet -> METADATA tablets -> actual tablet.
UPDATE: to answer the question about lookups inside a Region Server itself: I don't know for sure, but since the row keys are sorted and HBase knows the start and end keys, I suspect it uses a binary search or interpolation search, both of which are really fast - log(n) and log(log(n)) respectively. I don't think HBase would ever need to scan rows from the start row key to the one it needs to find, since searching over sorted keys is a well-known problem with several efficient solutions.
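For completeness, a hedged sketch of the client side of such a lookup with the HBase Java API (the table name and connection details are made up); the meta lookup and the in-region search described above happen transparently behind table.get().

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("my_table"))) {

                // The client resolves which region server hosts row key "77" via the
                // meta lookup described above (and caches it); that region server then
                // locates the row within its MemStore/StoreFiles locally.
                Get get = new Get(Bytes.toBytes("77"));
                Result result = table.get(get);
                System.out.println(result);
            }
        }
    }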