How to really deal with indexing in Redis and correctly implement indexes

I am moving some "live" data structures from MySQL to Redis. Using the StackExchange.Redis C# client, I'm writing (due to some very project-specific restrictions) my own micro-ORM code to store and retrieve object class entities from a Redis database.
I am storing C# objects as hash keys in Redis.
My general question is about indexing on fields other than the "primary key".
Ok, I've read all the theory of sets and sorted sets, and how to add and remove members from sets, and so on.
I've added some code to correctly create set keys which contain entities hash keys, so that I can lookup those objects by simple indexes or sorted indexes.
However I cannot find or figure out a good strategy for solving the following problems:
1. Index maintenance on expiration
I'd like to add expiration to some object (hash) keys, so that old entities get purged automatically by Redis. However, I cannot find a reliable way to update/purge the relevant indexes besides periodically running a background task that scans index set keys for expired members and removes them (notifications are not an option for me)
2. Index updating when some object fields change
In some cases I need to update only a small fraction of hash key values, not the whole entity. If the fields being updated are part of one or more index set keys, I cannot figure out the best way to properly update the set keys.
For example, let's say I need to store a "Session" entity whose primary key is its ID (simple numerical integer), and I need to add an index on the "Node" string field (Node being the reference to the server currently serving the session):
class Session {
    [RedisKey]
    public int ID { get; set; }
    public string RemoteIP { get; set; }
    [RedisSimpleIndex]
    public string Node { get; set; }
}
RedisKey and RedisSimpleIndex are attributes I use to extract via reflection which fields are used as primary key and which are used for indexing.
Let's suppose I have an instance of Session like this:
{ ID = 2, RemoteIP = "1.2.3.4", Node = "Server10" }
My routines are creating the following keys in Redis:
Hash key: "obj:Session:2"
Hash values: "ID" = "2", "RemoteIP" = "1.2.3.4", "Node" = "Server10"
Set key "idx:Session:Node:Server10"
Set members: "obj:Session:2"
which is fine for looking up all sessions on Server10.
However, if the very same session needs to be moved to a different server (e.g. Server8) and I want to update only the Node field in the hash, how can I update the indexes too?
The only way I have found so far is to SCAN all index keys matching the pattern idx:Session:Node:*, remove the member obj:Session:2 from each of them, and then create/update the index key for the new node (idx:Session:Node:Server8).
Moreover, the SCAN command is not available on the IDatabase or ITransaction interfaces, and in an HA clustered environment things get worse, since I need to determine which Redis server holds the relevant keys for this procedure to work.
Is there a better way to build/represent simple indexes in Redis? Is my approach wrong?

I'd like to add expiration to some object (hash) keys, so that old entities get purged automatically by Redis. However, I cannot find a reliable way to update/purge the relevant indexes besides periodically running a background task that scans index set keys for expired members and removes them (notifications are not an option for me)
You cannot expire individual KV pairs within a hash. This was discussed in issue #167; there don't appear to be any plans to change this.
I think you should be able to use keyspace notifications to subscribe to expire events. You would need some worker that subscribes to them and updates all relevant indices accordingly. However, you might get some inconsistent data: your worker might crash and leave stale indices behind. Also, the indices wouldn't be updated instantaneously, so you'd end up with a bit of stale data regardless.
Probably not the best idea, but you could also hack some custom indexing logic into expire.c. The code seems fairly straightforward. The C module API, by contrast, doesn't appear to provide any way to hook into the eviction logic.
Another option is to not rely on Redis for the expiration logic. You would still have a background job, but it would actually issue the corresponding DEL commands for expired KV pairs. This would also allow you to keep the indices 100% up to date via transactions.
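As a minimal sketch of that last option, assuming StackExchange.Redis and the question's key conventions (the expiry:Session sorted set and the reaper class are my own invention): record each entity's deadline as a Unix-time score, and let the background job delete the hash and fix the index in one MULTI/EXEC transaction.
using System;
using StackExchange.Redis;

public class SessionReaper
{
    private readonly IDatabase _db;

    public SessionReaper(IDatabase db) { _db = db; }

    public void ReapExpired()
    {
        long now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();

        // Every entity key whose recorded deadline has passed.
        foreach (RedisValue entityKey in _db.SortedSetRangeByScore("expiry:Session", 0, now))
        {
            // Read the indexed field before deleting, so we know which index set to fix.
            RedisValue node = _db.HashGet((string)entityKey, "Node");

            ITransaction tran = _db.CreateTransaction();
            tran.KeyDeleteAsync((string)entityKey);                         // DEL obj:Session:2
            if (!node.IsNull)
                tran.SetRemoveAsync("idx:Session:Node:" + node, entityKey); // fix the index
            tran.SortedSetRemoveAsync("expiry:Session", entityKey);         // drop the deadline entry
            tran.Execute();                                                 // MULTI/EXEC
        }
    }
}
Since the Node value could in principle change between the HashGet and the transaction, a guard such as tran.AddCondition(Condition.HashEqual((string)entityKey, "Node", node)) would make the cleanup strictly safe.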
In some cases I need to update only a small fraction of hash key values, not the whole entity. If the fields being updated are part of one or more index set keys, I cannot figure out the best way to properly update the set keys.
I'm not sure which Redis client you're using, but I found the following pattern to be quite useful in the past:
You have some form of "Updater" class for each hash. It has setters for all relevant fields that could be updated (setFirstName, setLastName, etc.).
When you set a field, you mark that particular field as "dirty" (e.g. via a separate boolean).
When you call "save", you update the indices for the fields that were marked as dirty.
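A rough sketch of that pattern (class and method names are mine, not from any particular library), applied to the question's Session/Node example:
using StackExchange.Redis;

public class SessionUpdater
{
    private readonly IDatabase _db;
    private readonly int _id;
    private string _node;
    private bool _nodeDirty;

    public SessionUpdater(IDatabase db, int id) { _db = db; _id = id; }

    public void SetNode(string node)
    {
        _node = node;
        _nodeDirty = true; // remember that this field (and its index) needs work on save
    }

    public void Save()
    {
        string entityKey = "obj:Session:" + _id;
        if (!_nodeDirty) return;

        // Read the old value so the stale index entry can be removed.
        string oldNode = _db.HashGet(entityKey, "Node");

        ITransaction tran = _db.CreateTransaction();
        if (oldNode != null)
            tran.SetRemoveAsync("idx:Session:Node:" + oldNode, entityKey); // SREM stale index entry
        tran.SetAddAsync("idx:Session:Node:" + _node, entityKey);          // SADD new index entry
        tran.HashSetAsync(entityKey, "Node", _node);                       // HSET the field itself
        tran.Execute();                                                    // MULTI/EXEC
        _nodeDirty = false;
    }
}
Note how Save repairs the index with an SREM/SADD pair inside the same transaction as the HSET; this is also one answer to the index-update question above, without any SCAN.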
The only way I found so far is to SCAN all index keys with pattern idx:Session:Node:* and remove from them any member obj:Session:2, then create/update the index key for the new node (idx:Session:Node:Server8).
This is cumbersome, but seems like the way to go. Sadly, I don't think there is a better solution for this. You might want to consider maintaining a separate set holding the keys of the index entries that reference each object, though; that way you'd avoid going over a bunch of keys that aren't relevant. A sketch follows.
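For example (a sketch; the idx:refs:* naming is my own invention, not from the question):
using StackExchange.Redis;

IDatabase db = ConnectionMultiplexer.Connect("localhost").GetDatabase();

// Whenever obj:Session:2 is added to an index, also record which
// index sets currently reference it:
db.SetAdd("idx:Session:Node:Server10", "obj:Session:2");
db.SetAdd("idx:refs:obj:Session:2", "idx:Session:Node:Server10");

// Later, to purge obj:Session:2 from every index without any SCAN:
foreach (RedisValue idxKey in db.SetMembers("idx:refs:obj:Session:2"))
    db.SetRemove((string)idxKey, "obj:Session:2");
db.KeyDelete("idx:refs:obj:Session:2");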
You might also want to check out an article on how to maintain those indices. As you already alluded to, there are basically two options: real-time updates using MULTI transactions, or batch jobs. Once you get into the territory of key expiration, you are more or less forced to use the batch approach.

Related

Redis + .NET 6 - Best data type for querying all entries and updating individual entries

I recently got to know Redis, integrated it into my project, and now I am facing the following use case.
My question in short:
Which data type can I use to get all entries sorted AND to be able to overwrite single entries?
My question in long:
I have a huge amount of point cloud models that I want to store and work with via Redis.
My point cloud model consists of three things:
Unique id (stays the same)
Point Cloud as a string (changes over time)
Priority as an integer (changes over time)
Basically, I would like to be able to do only two things with Redis. However, if I understand the documentation correctly, these two are seen as the strengths of two different data types, so I can't find a single type that exactly fits my use case. I hope, however, that I am wrong about this and that someone here can help me.
Use case:
Quickly get all models, already sorted
Overwrite/update a specific model
Sorted Sets:
Advantage:
Get all entries in sorted order.
My model property Priority can be used as the score, which determines the order.
Disadvantage:
No way to access a specific value via a key and overwrite it.
Hashes:
Advantage:
Overwrite a specific entry via Key > Field.
Get all entries via Key.
Disadvantage:
No sorting.
I would suggest just using two distinct data types:
a hash with all the properties of your model, with the exception of the priority;
a sorted set which allows to easily sort your collection and deal with the scores / priorities.
You could then link the two by storing each hash key (or a distinctive value that allows you to reconstruct the final hash key) as the related sorted set member.
For example:
> HSET point-cloud:123 foo bar baz suppiej
> ZADD point-clouds-by-priority 42 point-cloud:123
You will keep all the advantages you mentioned, with no disadvantages at all.
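For completeness, a small sketch of how the two structures combine, assuming StackExchange.Redis and the key names from the example above (the "cloud" field name is invented):
using StackExchange.Redis;

IDatabase db = ConnectionMultiplexer.Connect("localhost").GetDatabase();

// All models, already sorted by priority (ascending):
foreach (RedisValue member in db.SortedSetRangeByRank("point-clouds-by-priority", 0, -1))
{
    HashEntry[] model = db.HashGetAll((string)member); // e.g. "point-cloud:123"
    // ... work with the model ...
}

// Overwrite a single model's data and/or its priority:
db.HashSet("point-cloud:123", "cloud", "<new point cloud string>");
db.SortedSetAdd("point-clouds-by-priority", "point-cloud:123", 7); // ZADD overwrites the score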

Composite Primary Key equivalent in Redis

I'm new to NoSQL databases, so forgive my SQL mentality, but I'm looking to store data that can be 'queried' by either of two keys. Here's the structure:
{user_id, business_id, last_seen_ts, first_seen_ts}
where, if this were a SQL DB, I'd use user_id and business_id as a composite primary key. The sort of querying I'm looking for is:
1. 'get all where business_id = x'
2. 'get all where user_id = x'
Any tips? I don't think I can make a simple secondary index based on the two retrieval types above. I looked into commands like ZADD and ZRANGE, but there isn't really any sorting involved here.
My use case for Redis is to offload reads and writes from my SQL database while this program computes (storing intermediate state in Redis) what will eventually be written to the SQL DB.
Note: given the OP's self-proclaimed experience, this answer is intentionally simplified for educational purposes.
(one of) The first thing(s) you need to understand about Redis is that you design the data so every query will be what you're used to thinking of as access by primary key. It is convenient, in that sense, to imagine Redis' keyspace (the global dictionary) as something like this relational table:
CREATE TABLE redis (
    key VARCHAR(512MB) NOT NULL,
    value VARCHAR(512MB),
    PRIMARY KEY (key)
);
Note: in Redis, value can be more than just a String of course.
Keeping that in mind, and unlike other database models where normalizing data is the practice, you want your Redis data ready to handle both of your queries efficiently. That means you'll be saving the data twice: once under a primary key that allows searching for businesses by id, and again under one that allows querying by user id.
To answer the first query ("get all where business_id = x"), you want a key for each x that holds the relevant data (in Redis we use the colon, ':', as a separator by convention) - so for x=1 you'd probably call your key business:1, for x=a1b2c3 business:a1b2c3, and so forth.
Each such business:x key could be a Redis Set, where each member represents the rest of the tuple. So, if the data is something like:
{user_id: foo, business_id: bar, last_seen_ts: 987, first_seen_ts: 123}
You'd be storing it with Redis with something like:
SADD business:bar foo
Note: you can use any serialization you want; Set members are just Strings.
With this in place, answering the first query is just a matter of SMEMBERS business:bar (or SSCANing it for larger Sets).
If you've followed through, you already know how to serve the second query. First, use a Set for each user (e.g. user:foo) to which you SADD user:foo bar. Then SMEMBERS/SSCAN and you're almost home.
The last thing you'll need is another set of keys, but this time you can use Hashes. Each such Hash will store the additional information of the tuple, namely the timestamps. We can use a "Primary Key" made up of the business and the user ids (or vice versa) like so:
HMSET foo:bar first 123 last 987
After you've gotten the results from the 1st or 2nd query, you can fetch the contents of the relevant Hashes to complete the query (assuming that the queries return the timestamps as well).
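A sketch of the double-write and a lookup in StackExchange.Redis terms, reusing the answer's example values:
using StackExchange.Redis;

IDatabase db = ConnectionMultiplexer.Connect("localhost").GetDatabase();

// Write the tuple once per query path, plus one hash for its timestamps:
db.SetAdd("business:bar", "foo");   // query 1: all users seen at business "bar"
db.SetAdd("user:foo", "bar");       // query 2: all businesses seen by user "foo"
db.HashSet("foo:bar", new HashEntry[]
{
    new HashEntry("first", 123),
    new HashEntry("last", 987)
});

// "get all where business_id = bar", timestamps included:
foreach (RedisValue userId in db.SetMembers("business:bar"))
{
    HashEntry[] times = db.HashGetAll((string)userId + ":bar");
    // ... combine userId with its first/last timestamps ...
}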
The idiomatic way of doing this in Redis is to use a SET for each type of query you want to do.
In your case you would create:
a hash for each tuple (user_id, business_id, last_seen_ts, first_seen_ts)
a set with a name like user:<user_id>:business:<business_id>, to store the keys of the hashes for this user and this business (you add the IDs of the hashes with SADD)
Then, to get all data for a given user and business, get the SET content with SMEMBERS first, and then GET every HASH whose ID is in the SET.

Redis: how to use it similar to multi-tables

It seems that Redis has no entity corresponding to a "table" in a relational database.
For instance, I have to store:
(token, user_id)
(cart_id, token, [{product_id, count}])
If these two aren't stored separately, a lookup would search both, which would cause chaos.
By the way, (cart_id, token, [{product_id, count}]) is a shopping cart; how would one design such a data structure in Redis?
It seems that Redis has no entity corresponding to a "table" in a relational database.
Right, because it is not a relational database. It is a data structure server, which is very different and requires a different approach to be used well.
Ultimately to use Redis in the way it is intended you need to not think in relational terms, but think of the data structures you use in the code. More specifically, how do you need the data when you want to consume it? That will be the most likely way to store it in Redis.
In this case there are a few options, but the hash method works incredibly well for this one so I'll detail it here.
First, create a hash; call it users:to:tokens. Store the user id as the hash field and the token as the value. Next, create the inverse, a hash called tokens:to:users. You will probably want both of these - the ability to look one up from the other - and this foundation will provide that.
Next, for your carts. This, too, will be a hash: carts:cart_id. In this hash you have the product_id and the count.
Finally up is your third hash, token:to:cart, which builds an index of tokens to cart ids. I'd go a step further and add user:to:cart to be able to pull carts by user as well.
Now, as to whether to store the key name in the map or not, I tend to go with "no". By storing just the ID you can easily build the Redis cart key, avoid storing the key's full path in the data store, and save memory.
Indeed, if you can do so, use integers for all of your IDs. By using integers you can take advantage of Redis' integer storage optimizations to keep memory usage down. Hashes storing integers are quite efficient and very fast.
If needed, you can use Redis to build your IDs. You can use the INCR command to maintain a counter for each data type, such as userid:counter, cartid:counter, and tokenid:counter. Since INCR returns the new value, a single call both increments the counter and returns the new ID, and GET cartid:counter will always give you the largest ID if you want to quickly see how many carts have been created. Kinda neat, IMO.
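For instance (a sketch; counter and key names as above):
using StackExchange.Redis;

IDatabase db = ConnectionMultiplexer.Connect("localhost").GetDatabase();

// INCR is atomic and returns the new value, so this both allocates
// and fetches the next cart ID in a single round trip.
long newCartId = db.StringIncrement("cartid:counter");
string cartKey = "carts:" + newCartId; // e.g. "carts:42"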
Now, where it gets tricky is if you want to use expiration to automatically expire carts, as opposed to leaving them lying around until you want to clean things up. By setting an expiration on the cart hash (which has the product/count mapping), your carts will automatically expire. However, their references will still be hanging around in the token:to:cart hash. Removing them is a simple periodic task that iterates over the members of token:to:cart and does an EXISTS check on each cart's key; if it doesn't exist, delete the reference from the hash (see the sketch below).
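That periodic task could look something like this sketch (key names follow this answer; HashScan walks the hash incrementally instead of loading it all at once):
using StackExchange.Redis;

IDatabase db = ConnectionMultiplexer.Connect("localhost").GetDatabase();

// Walk the token -> cart index and drop entries whose cart hash
// has already expired (KeyExists is false once Redis purged it).
foreach (HashEntry entry in db.HashScan("token:to:cart"))
{
    string cartKey = "carts:" + entry.Value;
    if (!db.KeyExists(cartKey))
        db.HashDelete("token:to:cart", entry.Name); // stale reference, remove it
}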
Redis is a key-value store. From redis.io:
Redis is an open source (BSD licensed), in-memory data structure
store, used as database, cache and message broker. It supports data
structures such as strings, hashes, lists, sets, sorted sets with
range queries, bitmaps, hyperloglogs and geospatial indexes with
radius queries.
So if you want to store two different types (tokens and carts), you will need separate keys for the different data types. For example:
127.0.0.1:6379> hset tokens.token_id#123 user user123
(integer) 1
127.0.0.1:6379> hget tokens.token_id#123 user
"user123"
Where tokens is a namespace for tokens only. It is stored as a Redis hash:
Redis Hashes are maps between string fields and string values, so they
are the perfect data type to represent objects
To store lists I would do the following:
127.0.0.1:6379> hmset carts.cart_1 token token_id#123 cart_contents cart_contents_key1
OK
127.0.0.1:6379> hmget carts.cart_1 token cart_contents
1) "token_id#123"
2) "cart_contents_key1" # cart_contents is a list of receipts.
cart_contents is represented as a Redis list:
127.0.0.1:6379> rpush cart_contents.cart_contents_key1 receipt_key1
(integer) 1
127.0.0.1:6379> lrange cart_contents.cart_contents_key1 0 -1
1) "receipt_key1"
A receipt is a Redis hash for the tuple (product_id, count):
127.0.0.1:6379> hmset receipts.receipt_key1 product_id 43 count 2
OK
127.0.0.1:6379> hmget receipts.receipt_key1 product_id count
1) "43" # Your final product id.
2) "2"
But do you really need Redis in this case?

DynamoDB: When to use what PK type?

I am trying to read up on best practices on DynamoDB. I saw that DynamoDB has two PK types:
Hash Key
Hash and Range Key
From what I read, it appears the latter is like the former but supports sorting and indexing of a finite set of columns.
So my question is why ever use only a hash key without a range key? Is it a viable choice only when the table is not searched?
It'd also be great to have some general guidelines on when to use what key type. I've read several guides (including Amazon's own documentation on DynamoDB) but none of them appear to directly address this question.
Thanks
The choice of which key to use comes down to your use cases and data requirements for a particular scenario. For example, if you are storing User Session Data, it might not make much sense to use the Range Key, since each record can be referenced by a GUID and accessed directly with no grouping requirements. In general terms, once you know the Session Id you just get the specific item by querying on the key. Another example could be storing User Account or Profile data: each user has his own, and you will most likely access it directly (by User Id or something else).
However, if you are storing Order Items then the Range Key makes much more sense since you probably want to retrieve the items grouped by their Order.
In terms of the Data Model, the Hash Key allows you to uniquely identify a record from your table, and the Range Key can be optionally used to group and sort several records that are usually retrieved together. Example: If you are defining an Aggregate to store Order Items, the Order Id could be your Hash Key, and the OrderItemId the Range Key. Whenever you would like to search the Order Items from a particular Order, you just query by the Hash Key (Order Id), and you will get all your order items.
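For illustration, a sketch with the AWS SDK for .NET (the table, attribute names, and values are invented for this example):
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

class OrderItemsExample
{
    static async Task Main()
    {
        var client = new AmazonDynamoDBClient();

        // OrderId is the hash key, OrderItemId the range key.
        await client.CreateTableAsync(new CreateTableRequest
        {
            TableName = "OrderItems",
            AttributeDefinitions = new List<AttributeDefinition>
            {
                new AttributeDefinition("OrderId", ScalarAttributeType.S),
                new AttributeDefinition("OrderItemId", ScalarAttributeType.S)
            },
            KeySchema = new List<KeySchemaElement>
            {
                new KeySchemaElement("OrderId", KeyType.HASH),
                new KeySchemaElement("OrderItemId", KeyType.RANGE)
            },
            ProvisionedThroughput = new ProvisionedThroughput(5, 5)
        });

        // "All items from a particular order" is a Query on the hash key alone:
        var result = await client.QueryAsync(new QueryRequest
        {
            TableName = "OrderItems",
            KeyConditionExpression = "OrderId = :oid",
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                [":oid"] = new AttributeValue("order-123")
            }
        });
    }
}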
You can find below a formal definition for the use of these two keys:
"Composite Hash Key with Range Key allows the developer to create a
primary key that is the composite of two attributes, a 'hash
attribute' and a 'range attribute.' When querying against a composite
key, the hash attribute needs to be uniquely matched but a range
operation can be specified for the range attribute: e.g. all orders
from Werner in the past 24 hours, or all games played by an individual
player in the past 24 hours." [VOGELS]
So the Range Key adds a grouping capability to the Data Model; however, the use of these two keys also has an implication on the Storage Model:
"Dynamo uses consistent hashing to partition its key space across its
replicas and to ensure uniform load distribution. A uniform key
distribution can help us achieve uniform load distribution assuming
the access distribution of keys is not highly skewed."
[DDB-SOSP2007]
Not only does the Hash Key allow you to uniquely identify the record, it is also the mechanism for ensuring load distribution. The Range Key (when used) helps indicate the records that will mostly be retrieved together, and therefore the storage can also be optimized for such a need.
Choosing the correct keys to represent your data is one of the most critical aspects of your design process, and it directly impacts how well your application will perform and scale, and how much it will cost.
Footnotes:
The Data Model is the model through which we perceive and manipulate our data. It describes how we interact with the data in the database [FOWLER]. In other words, it is how you abstract your data: the way you group your entities, the attributes you choose as primary keys, etc.
The Storage Model describes how the database stores and manipulates the data internally [FOWLER]. Although you cannot control this directly, you can certainly optimize how the data is retrieved or written by knowing how the database works internally.

Read-only keys in RavenDb or... interopability between RavenDb and Azure Table Storage

I'm looking to host a number of configuration parameters for customers in a RavenDb database, while storing the numerous data points generated on a minute-by-minute basis for these parameters in Azure Table Storage. I need a basic way to connect RavenDb and ATS. Obviously, this connection is to be done via keys. My issue is that RavenDb uses forward slashes in all of its Id fields, while ATS pukes when a forward slash is used in either PartitionKey or RowKey.
My question is as follows: is it possible to have a read-only Id key in my RavenDb entities (no setters)? Such a key would return the value of a Guid-based key prepended with an "entity/" prefix. This way, I can store the Guid-based ID in the Raven entities as well and be able to compare ravenEntity.RootId (guid) to storageEntity.PartitionKey (string based on guid). I'm worried that even if my entities seem to persist to Raven and load back OK, I may have an issue with some more obscure functionality.
Are there other suggested or perhaps worked out approaches to handle such relationship?
So you just need something in your RavenDB model that does not change once created, which you can use to relate to the associated data you store in Azure Table Storage?
Well, assuming your RavenDB document id will not change (and it can't, because then it would be a different document), you could use a deterministic guid derived from the document id.
public class MyModel
{
    public string Id { get; set; }

    // Other stuff

    [JsonIgnore] // <--- This really doesn't need to be persisted to RavenDB
    public string AzureLookupKey
    {
        // No "entity/" prefix here: ATS rejects forward slashes in
        // PartitionKey/RowKey, so only the guid portion is exposed.
        get { return Utils.GetDeterministicGuid(this.Id).ToString("n"); }
    }
}
The GetDeterministicGuid(inputString) method could be implemented however you want, although the SO question How to Create Deterministic Guids has a good example right in the question, with other possibilities in the answer.
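One common implementation (an assumption on my part, mirroring the approach from that question) hashes the input string and uses the 16 hash bytes as the Guid:
using System;
using System.Security.Cryptography;
using System.Text;

public static class Utils
{
    // Deterministic: the same input string always yields the same Guid.
    // MD5 is used only because it conveniently produces exactly 16 bytes;
    // this is not a security-sensitive use.
    public static Guid GetDeterministicGuid(string input)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
            return new Guid(hash);
        }
    }
}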
Instead of using [JsonIgnore] to prevent RavenDB from serializing the property to the database, you could also make it a method.