Affinity Key in Aerospike

In the Aerospike documentation, it is mentioned that Aerospike has 4096 logical partitions; each key is hashed and mapped to one of these partitions, which determines the node on which the data for that key is stored.
However, if we have two keys "A" and "AB" and we want to store them on the same node, is there a way?
In Redis this can be achieved by naming the keys "A" and "{A}B"; the hash tag ensures that "{A}B" goes to the node where "A" is hashed and stored.
In Apache Ignite, the same can be done using an AffinityKey.
Does a similar idea exist in Aerospike?
Thanks

Aerospike was designed as a distributed database. Redis was designed to run on a single node, and natively lacks concepts such as data distribution, clustering, replication, and failover. I'm aware that you can use various application-side shenanigans to turn it into an ad-hoc cluster.
Don't worry about the implementation details of Aerospike's data distribution. Those happen automatically between the client and cluster, and don't require you to do anything on the application side. Instead, think about your access patterns.
First, your Aerospike cluster will make sure the data is evenly distributed. Because work is directly proportional to data, you should make sure the nodes are homogeneous. You can then expect multi-node operations to wrap up in roughly the same amount of time on each node.
You can create a secondary index on the fields you'll be querying often to speed up those queries. Release 3.12 adds predicate filtering, allowing you to create more complex query predicates on top of the initial secondary-index filter (see also the Java client's PredExp class).
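For instance, here is a minimal sketch with the Aerospike Java client, assuming a namespace ns1, a set schools, and an existing secondary index on a country bin (all of these names are illustrative, not from the question):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class SchoolQuery {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Assumes a secondary index was already created on the 'country' bin,
        // e.g. via aql: CREATE INDEX country_idx ON ns1.schools (country) STRING
        Statement stmt = new Statement();
        stmt.setNamespace("ns1");    // hypothetical namespace
        stmt.setSetName("schools");  // hypothetical set
        stmt.setFilter(Filter.equal("country", "india"));

        RecordSet rs = client.query(null, stmt);
        try {
            while (rs.next()) {
                System.out.println(rs.getKey() + " -> " + rs.getRecord());
            }
        } finally {
            rs.close();
        }
        client.close();
    }
}
```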
If you don't want to use secondary indexes (there are several valid reasons not to), you can build your own lookup using external records. In a set called country-school you can have a record for each country (keys such as 'india', 'luxembourg') whose value is a list containing the IDs of the schools in that country. You can fetch the list with a single get (or a batch-get if it spans several records, such as india-1, india-2, ..., india-9999), then use the results to compose a batch-get operation for the schools. Batch reads return results in the order you asked for them, so you can request a large batch, check whether the last element is null, and if not, get another batch. (A sketch follows the examples below.)
('ns1', 'country-school', 'us-california') => [ 1, 2, 3, 5, 8, 11, .. ]
Similarly, you can create permutations such as country-state-city (for example, us-california-oakland) with smaller lists. This costs some extra space, but gives you faster (key-value based) retrieval without spending memory on secondary indexes.
('ns1', 'country-school', 'us-california-oakland') => [ 1, 5, 42, .. ]
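Here is a sketch of the whole pattern with the Java client; the bin name schools and the school set/key layout are assumptions for illustration:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import java.util.List;

public class CountryLookup {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // 1. Fetch the lookup record holding the school IDs for a region.
        Key lookupKey = new Key("ns1", "country-school", "us-california-oakland");
        Record lookup = client.get(null, lookupKey);
        List<?> schoolIds = (List<?>) lookup.getList("schools"); // hypothetical bin name

        // 2. Compose a batch read for the actual school records.
        Key[] keys = new Key[schoolIds.size()];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = new Key("ns1", "school", schoolIds.get(i).toString());
        }
        Record[] schools = client.get(null, keys); // results come back in request order

        for (Record r : schools) {
            if (r != null) {  // a null entry means that key was not found
                System.out.println(r);
            }
        }
        client.close();
    }
}
```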

Related

Did anyone write custom Affinity function?

I want all nodes in a cluster to have an equal data load. With the default affinity function this is not happening.
As of now, we have 3 nodes. We use the group ID as the affinity key, and we have 3 group IDs (1, 2 and 3), and we limit cache partitions to the group IDs. Overall, nodes = group IDs = cache partitions, so that each node has an equal number of partitions.
Would it be okay to write a custom affinity function? What would we lose by doing so? Did anyone write a custom affinity function?
The affinity function doesn't guarantee an even distribution across all nodes. It's statistical, and three values isn't really enough to make sure the data is "fairly" distributed.
So, yes, writing a new affinity function would work. The downsides are that you need to make it fast (it's called a lot) and that you'd be hard-coding it to your current node topology. What happens when you add a new node? What happens when a node fails? Also, you'd potentially be putting all your data into three partitions, which makes it harder to scale out (one of the main advantages of Ignite's architecture).
As an alternative, I'd look at your data model. Splitting your data into three chunks is too coarse for things to work automatically.
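To make that concrete, here is a hedged sketch (the cache name tasks and the key types are made up): keep the default rendezvous affinity with its many small partitions, and co-locate entries sharing a group ID via AffinityKey rather than a hand-rolled affinity function.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.AffinityKey;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

public class AffinityExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Many small partitions let the statistics even out across nodes
        // (1024 is the default for the rendezvous affinity function).
        CacheConfiguration<AffinityKey<Long>, String> cfg =
                new CacheConfiguration<>("tasks");
        cfg.setAffinity(new RendezvousAffinityFunction(false, 1024));

        IgniteCache<AffinityKey<Long>, String> cache = ignite.getOrCreateCache(cfg);

        // Entries with the same affinity key land in the same partition, so
        // data sharing a group ID stays together without hard-coding topology.
        long taskId = 42L;
        int groupId = 2;
        cache.put(new AffinityKey<>(taskId, groupId), "payload");
    }
}
```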

Which approach is better when using Redis?

I'm facing following problem:
I want to keep track of tasks given to users, and I want to store this state in Redis.
I can do:
1) create a list called "dispatched_tasks" holding many objects (username, task), or
2) create many (potentially thousands of) lists called dispatched_tasks:username, each usually holding a few objects (task).
Which approach is better? If I only thought of my own comfort, I would choose the second one, as from time to time I will have to search for a particular user's tasks, and the second approach gives me that for free.
But how about Redis? Which approach will be more performant?
Thanks for any help.
Redis supports different kinds of data structures, as shown in its documentation. There are different approaches you can take:
Scenario 1:
Using a list data type, your list will contain all the task/user combinations for your problem. However, accessing and deleting a task runs in O(n) time (the list has to be traversed to reach the element). This can have an impact on performance if your users have a lot of tasks.
Using sets:
Similar to lists, but you can add/delete/check for existence in O(1), and set elements are unique, so if you add a username/task pair that already exists, it won't be added again.
Scenario 2:
The data types do not change. The only difference is that there will be many more keys in Redis, which can increase the memory footprint.
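A small sketch of both approaches using the Jedis client (the key and member names are made up):

```java
import redis.clients.jedis.Jedis;
import java.util.Set;

public class DispatchedTasks {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Approach 1: one global set; members encode user and task together.
            jedis.sadd("dispatched_tasks", "alice:task-17");
            boolean pending = jedis.sismember("dispatched_tasks", "alice:task-17");

            // Approach 2: one small set per user; per-user lookups come for free.
            jedis.sadd("dispatched_tasks:alice", "task-17");
            Set<String> aliceTasks = jedis.smembers("dispatched_tasks:alice");

            System.out.println(pending + " " + aliceTasks);
        }
    }
}
```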
From the FAQ:
What is the maximum number of keys a single Redis instance can hold? And what is the max number of elements in a Hash, List, Set, Sorted Set?
Redis can handle up to 2^32 keys, and was tested in practice to handle at least 250 million keys per instance.
Every hash, list, set, and sorted set can hold 2^32 elements.
In other words your limit is likely the available memory in your system.
What's the Redis memory footprint?
To give you a few examples (all obtained using 64-bit instances):
* An empty instance uses ~3MB of memory.
* 1 million small keys -> string value pairs use ~85MB of memory.
* 1 million keys -> hash value, representing an object with 5 fields, use ~160MB of memory.
Testing your use case is trivial: use the redis-benchmark utility to generate random data sets, then check the space used with the INFO memory command.

Correct modeling in Redis for writing single entity but querying multiple

I'm trying to convert data that lives in a SQL DB to Redis, in order to gain much higher throughput (it's a very high-throughput workload). I'm aware of the downsides regarding persistence, storage costs, etc.
So, I have a table called "Users" with a few columns. Let's assume: ID, Name, Phone, Gender.
Around 90% of the requests are writes, updating a single row.
Around 10% of the requests are reads, fetching 20 rows per request.
I'm trying to get my head around the right modeling of this in order to get the max out of it.
If there were only updates - I would use Hashes.
But because of the 10% of Reads I'm afraid it won't be efficient.
Any suggestions?
Actually, the real question is whether you need to support partial updates.
Supposing partial updates are not required, you can store your record in a blob associated with a key (i.e. the string datatype). All write operations can be done in one roundtrip, since the record is always written at once. Several read operations can be done in one roundtrip as well, using the MGET command.
Now, supposing partial updates are required, you can store your record in a dictionary associated with a key (i.e. the hash datatype). All write operations can be done in one roundtrip (even partial ones). Several read operations can also be done in one roundtrip, provided the HGETALL commands are pipelined.
Pipelining several HGETALL commands is a bit more CPU-consuming than using MGET, but not that much. In terms of latency, it should not be significantly different, unless you execute hundreds of thousands of them per second on the Redis instance.
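A rough sketch of both variants with Jedis (the key names and the JSON encoding are assumptions, not part of the question):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;
import java.util.List;
import java.util.Map;

public class UserStore {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Variant A: whole record as a blob (no partial updates possible).
            jedis.set("user:1", "{\"name\":\"Ann\",\"phone\":\"555\",\"gender\":\"f\"}");
            jedis.set("user:2", "{\"name\":\"Bob\",\"phone\":\"556\",\"gender\":\"m\"}");
            List<String> blobs = jedis.mget("user:1", "user:2"); // one roundtrip

            // Variant B: record as a hash (partial updates possible).
            jedis.hset("user:h:1", "phone", "555-0199"); // update a single field
            jedis.hset("user:h:2", "phone", "555-0200");

            // Read several hashes in one roundtrip by pipelining HGETALL.
            Pipeline p = jedis.pipelined();
            Response<Map<String, String>> u1 = p.hgetAll("user:h:1");
            Response<Map<String, String>> u2 = p.hgetAll("user:h:2");
            p.sync();

            System.out.println(blobs + " " + u1.get() + " " + u2.get());
        }
    }
}
```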

Best way to model millions of exist checks in Aerospike?

Having grown out of Redis for some data structures, I'm looking at other solutions with good disk/SSD performance. I recently discovered Aerospike, which seems to excel in an SSD environment.
The most memory-hungry structures are about 100,000 Redis sets, each of which can contain up to 10,000 strings. Each string is between 10 and 30 characters.
These sets are mostly used for exists/uniqueness checks.
What would be the best way to model these? I generally see 2 options:
* model a Redis set as an Aerospike lset
* model each value in a set separately.
Besides this choice, the 100,000 Redis sets are used as a partitioning of the keys. For reasons of locality it would probably make sense to have a similar sort of partitioning/namespacing in Aerospike. However, I'm pretty sure the notion of a 'namespace' in Aerospike isn't meant for this sort of key partitioning. What would be a correct way (if any) to do this in Aerospike, or is it not needed?
Aerospike does its own partitioning for load balancing and high availability needs. A namespace is synonymous with a database in the traditional sense, NOT with a partition of data. Data in a namespace is partitioned and stored across the cluster. As a user, you need not worry about the placement of the data.
I would map a Redis set to an Aerospike lset (one to one). Aerospike should take care of data locality for the data in a given lset.
Yes, you should not be worrying about the locality of the data, as Aerospike does auto-sharding. This ensures equal balancing of data distribution and read/write load across all nodes of the cluster.
Putting the data in an lset has its advantages: it gives you functionality similar to Redis, so you do not need to write your own. But at the same time it has its disadvantages, so you should choose based on your requirements. All operations on a single lset are serialized, so if you expect reads/writes to a set to be parallelized, lset may not be the right fit for you. Also, the exists check on an lset actually reads the full record and returns true/false. Aerospike has an exists API for normal keys, which returns true/false based on the in-memory index and is much faster.
For this use case, you may not be able to segregate them into Aerospike 'sets'. You need 100,000 sets, but as of now Aerospike only supports 1024 sets.
Let me add a third option to your list. You can model the key itself to create virtual sets for you, as below:
if your actual key is key1 and you want it to go to set1, you can write the composite key as set1_key1;
when you want to check for the existence of key7 in set5, check for the existence of set5_key7.
If you go with this model, you are exploiting Aerospike's data distribution and load balancing to the fullest. The exists check will be the fastest, as there will be no I/O.
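A minimal sketch of the virtual-set idea with the Java client, assuming a namespace ns1 and a physical set members (both names illustrative):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

public class VirtualSets {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Write a member of virtual set 'set5' by prefixing the user key.
        Key member = new Key("ns1", "members", "set5_key7");
        client.put(null, member, new Bin("v", 1));

        // Existence check: answered from the in-memory primary index,
        // so no record data needs to be read from disk.
        boolean exists = client.exists(null, new Key("ns1", "members", "set5_key7"));
        System.out.println(exists);

        client.close();
    }
}
```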

Efficient Hashmap Use

What is the more efficient approach for using hashmaps?
A) Use multiple smaller hashmaps, or
B) store all objects in one giant hashmap?
(Assume that the hashing algorithm for the keys is fairly efficient, resulting in few collisions)
CLARIFICATION: Option A implies segregation by a property of the primary key -- i.e. no additional lookup is necessary to determine which actual hashmap to use. (For example, if the lookup keys are alphanumeric, Hashmap 1 stores the A's, Hashmap 2 stores the B's, and so on.)
Definitely B. The advantage of hash tables is that the average number of comparisons per lookup is independent of the size.
If you split your map into N smaller hashmaps, you will have to search half of them on average for each lookup. If the smaller hashmaps have the same load factor as the larger map would have had, you will increase the total number of comparisons by a factor of approximately N/2.
And if the smaller hashmaps have a smaller load factor, you are wasting memory.
All that is assuming you distribute the keys randomly between the smaller hashmaps. If you distribute them according to some function of the key (e.g. a string prefix) then what you have created is a trie, which is efficient for some applications (e.g. auto-complete in web forms.)
Are these maps used in logically distinct places? For instance, I wouldn't have one map containing users, cached query results, loggers, etc., just because the keys happen not to clash. However, I equally wouldn't split a single logical map into multiple maps.
Keep one hashmap for each logical mapping from key to value.
In addition to Jon's answer, there can be practical reasons why you would want to maintain separate hash tables.
If you have separate tables for different mappings you can 'clear' each of the mappings independently; e.g. by calling 'clear' or getting rid of the reference to the corresponding table.
If the separate tables hold mappings to cached entries, you can use different strategies to 'age' the respective entries.
If the application is multi-threaded, using separate tables may reduce lock contention, and may (for some processor architectures) increase processor memory cache hit ratios.
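A tiny sketch of that principle (User and Result are placeholder types):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LogicalMaps {
    // One map per logical mapping, not per key range.
    private final Map<String, User> usersById = new ConcurrentHashMap<>();
    private final Map<String, Result> cachedQueries = new ConcurrentHashMap<>();

    // The cache can be cleared or aged independently of the user map, and
    // the two maps never contend for the same internal locks.
    public void evictCache() {
        cachedQueries.clear();
    }

    static class User {}
    static class Result {}
}
```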