Redis Key Structure - redis

When creating a key in Redis, I understand the convention of using ":" as a separator and treating the key like a URL structure.
But what if that structure itself contains key-value type combinations? Does one put the key in the structure?
Made-up Example:
Option A) "country:usa:manufacturer:ford:vehicle:f150:color" = black
or
Option B) "usa:ford:f150:color" = black
In some ways, I think that there is strength in the structure of Option A, but it also adds a lot of complexity to the key.
Thoughts?

While keeping in mind that your example is made up (do try to use an actual example, you'll get better answers), I would have to say neither.
I would go with an ID for the key, likely an int. Then I'd put each key/value pair in your Option A as a hash field and value.
For example:
HSET 1 country USA
HSET 1 manufacturer ford
And so on. Or you could use HMSET to set them all at once (since Redis 4.0, HSET itself accepts several field/value pairs).
Why? You get the benefit of keeping the fields as describing the data (which you lose in your option b), the memory advantages of hashes over strings, and reduced complexity on key structure, not to mention the memory benefits of a short integer as keyname versus a long string.
Further, you have a memory-cheap way to create indexes as integer sets. For example, a key called "country:1" could be a set of entry IDs, which then gives you a way to "pull all entries for country ID 1" (USA in the example). By using integers you get the benefit of being able to store these all in a very memory-efficient way, at the minor cost of a lookup table. This could even be done in Lua to avoid a network hop.
The greater the range of possible combinations and entries, the more valuable the memory savings are. If you've got millions or billions of them, you'll want to follow the integer-ID & lookup route. This would also set you up nicely if you ever need to shard data - either server side or client side.
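As a rough illustration of the integer-ID layout described above, here is a minimal Python sketch. It uses plain dicts and sets as stand-ins for Redis hashes and sets, so it runs without a server. The names `save_vehicle`, `entries_for_country`, and `country_ids` are hypothetical and only serve the example.

```python
# In-memory stand-ins for Redis hashes and sets; all names are hypothetical.
hashes = {}      # key -> {field: value}, like HSET / HGETALL
sets = {}        # key -> set of entry IDs, like SADD / SMEMBERS
country_ids = {"usa": 1}   # lookup table: country name -> small integer ID


def save_vehicle(entry_id, country, manufacturer, vehicle, color):
    """Store one entry as a hash under an integer key and index it by country."""
    hashes[entry_id] = {
        "country": country,
        "manufacturer": manufacturer,
        "vehicle": vehicle,
        "color": color,
    }
    # Equivalent of: SADD country:<country_id> <entry_id>
    index_key = f"country:{country_ids[country]}"
    sets.setdefault(index_key, set()).add(entry_id)


def entries_for_country(country):
    """Return the hashes of every entry indexed under the given country."""
    index_key = f"country:{country_ids[country]}"
    return [hashes[i] for i in sorted(sets.get(index_key, set()))]


save_vehicle(1, "usa", "ford", "f150", "black")
save_vehicle(2, "usa", "ford", "mustang", "red")
usa_colors = [entry["color"] for entry in entries_for_country("usa")]
```

The field names stay in the hash, so the data remains self-describing, while the keys and index members are short integers.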

Related

String vs Hash for string type? Hash will have only one key instead of many

For example, I see many people are doing something like the following:
> set data:1000 "some string 1"
> set data:1001 "some string 2"
But what about using a hash to minimize the number of keys?
> hset data 1000 "some string 1"
> hset data 1001 "some string 2"
In the second way, it will only create one data key instead of creating many keys in the first way.
Which way is recommended?
I have seen some people and tutorials doing hset data:10 01 xxx, but that is not related to my question. My question is simply which is recommended: set data:1001 xxx or hset data 1001 xxx.
And I don't plan to modify hash-max-zipmap-entries and hash-max-zipmap-value. That means the hash will eventually exceed both limits. With such a config, are the two ways the same, or which one is recommended?
Reasons to use strings:
you need per value timeouts
the values are semantically isolated
you're on cluster and want the values to be sharded over different nodes to spread load (sharding is based on the key)
Reasons to use hashes:
you want to be able to purge all of them at once (del/unlink), or have a timeout that impacts all of those values at once
you want to be able to enumerate them (prefer hscan/hgetall over scan/keys)
slightly better memory usage on the keys themselves
the values are semantically related
it is OK for all the values to be on the same node (whether single-server or cluster)
This all depends on the tradeoffs you want to support. In general, using hashes will have a smaller memory footprint than using simple keys. In fact, it is about an order of magnitude less memory. And access to hash values is constant time. So, if you are using redis simply as a key-value store, then hashes are way more efficient than simple keys.
However, you will need simple keys if you need to support per-key expiration, keyspace notifications, etc.
Just be careful to tune the values for hash-max-zipmap-entries and hash-max-zipmap-value in your redis.conf (renamed hash-max-ziplist-* in Redis 2.6 and hash-max-listpack-* in Redis 7.0) to ensure that hashes are stored compactly for your environment.
You can read all about the details in the memory optimization section of the documentation.

Out of Process in memory database table that supports queries for high speed caching

I have a SQL table that is accessed continually but changes very rarely.
The Table is partitioned by UserID and each user has many records in the table.
I want to save database resources and move this table closer to the application in some kind of memory cache.
In process caching is too memory intensive so it needs to be external to the application.
Key Value stores like Redis are proving inefficient due to the overhead of serializing and deserializing the table to and from Redis.
I am looking for something that can store this table (or partitions of data) in memory, but let me query only the information I need without serializing and deserializing large blocks of data for each read.
Is there anything that would provide Out of Process in memory database table that supports queries for high speed caching?
Searching has shown that Apache Ignite might be a possible option, but I am looking for more informed suggestions.
Since it's out-of-process, it has to do serialization and deserialization. Your real concern is how to reduce that serialization/deserialization work. If you use Redis' STRING type, you cannot reduce this work.
However, you can use a HASH to solve the problem by mapping your SQL table to hashes.
Suppose you have the following table: person: id(varchar), name(varchar), age(int). You can take the person id as the key, and name and age as fields. When you want to look up someone's name, you only need to get the name field (HGET person-id name); the other fields won't be deserialized.
Ignite is indeed a possible solution for you since you may optimize serialization/deserialization overhead by using internal binary representation for accessing objects' fields. You may refer to this documentation page for more information: https://apacheignite.readme.io/docs/binary-marshaller
Access overhead may also be reduced by disabling the copy-on-read option: https://apacheignite.readme.io/docs/performance-tips#section-do-not-copy-value-on-read
Data collocation by user id is also possible with Ignite: https://apacheignite.readme.io/docs/affinity-collocation
As @for_stack said, a hash is very suitable for your case.
You said that each user has many rows in the database, indexed by user_id and tag_id, so (user_id, tag_id) uniquely identifies one row. Every row is functionally dependent on this tuple, so you can use the tuple as the hash key.
For example, if you want to save the row (user_id, tag_id, username, age) with values ("123456", "FDSA", "gsz", 20) into Redis, you could do this:
HMSET 123456:FDSA username "gsz" age 20
When you want to query the username by user_id and tag_id, you could do this:
HGET 123456:FDSA username
So every hash key will be a combination of user_id and tag_id. If you want the key to be more human readable, you could add a prefix string such as "USERINFO", e.g. USERINFO:123456:FDSA.
But if you want to query with only a user_id and get all rows for that user_id, the method above is not enough.
You can build a secondary index in Redis for your hashes.
As described above, we use user_id:tag_id as the hash key because it uniquely points to one row. Suppose we now want to query all the rows for one user_id.
We can use a sorted set as a secondary index that records which hashes store the info for this user_id.
We could add this to the sorted set:
ZADD user_index 0 123456:FDSA
As above, we set the member to the hash key string and set the score to 0. The rule is that every score in this zset must be 0, so that members are ordered lexicographically and we can do range queries with ZRANGEBYLEX.
E.g. to get all rows for user_id 123456:
ZRANGEBYLEX user_index [123456 (123457
It will return all the hash keys whose prefix is 123456, and then we use each string as a hash key with HGET or HMGET to retrieve the information we want.
[ means inclusive, and ( means exclusive. Why 123457? Because it is the first string that sorts after every key starting with 123456. So when we want all rows for a user_id, we should form the upper bound by adding 1 to the ASCII value of the last character of the user_id string.
For more about lexicographical indexes, refer to the article mentioned above.
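To make the prefix-range idea concrete, here is a minimal Python sketch that imitates a zset with all scores equal by keeping a sorted list and using `bisect` for the range query. It is an in-memory stand-in, not Redis client code, and `save_row` and `keys_for_user` are hypothetical names. One deliberate variation from the range [123456 (123457: the bounds here include the ":" separator, because a plain prefix range would also match a longer ID such as 1234567. Python compares strings by code point, which matches Redis' byte-wise ordering only for ASCII keys like these.

```python
import bisect

hashes = {}        # hash key -> {field: value}
user_index = []    # sorted list of hash keys, standing in for a zset with equal scores


def save_row(user_id, tag_id, **fields):
    """Store a row as a hash keyed by user_id:tag_id and add the key to the index."""
    key = f"{user_id}:{tag_id}"
    hashes[key] = dict(fields)
    pos = bisect.bisect_left(user_index, key)
    if pos == len(user_index) or user_index[pos] != key:
        user_index.insert(pos, key)


def keys_for_user(user_id):
    """Return all hash keys for one user, like ZRANGEBYLEX with a prefix range."""
    low = f"{user_id}:"    # inclusive lower bound, "[123456:"
    high = f"{user_id};"   # exclusive upper bound; ';' is the character after ':'
    start = bisect.bisect_left(user_index, low)
    end = bisect.bisect_left(user_index, high)
    return user_index[start:end]


save_row("123456", "FDSA", username="gsz", age=20)
save_row("123456", "QWER", username="gsz", age=20)
save_row("1234567", "ZXCV", username="other", age=31)

user_keys = keys_for_user("123456")
usernames = [hashes[k]["username"] for k in user_keys]
```

Each returned key can then be used for a single-field lookup, which corresponds to HGET on the real hash.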
You can try Apache Mnemonic, which was started by Intel (http://incubator.apache.org/projects/mnemonic.html). It supports serdeless features.
For a read-dominant workload, the MySQL MEMORY engine should work fine (DML writes lock the whole table). This way you don't need to change your data retrieval logic.
Alternatively, if you're okay with changing data retrieval logic, then Redis is also an option. To add to what @GuangshengZuo has described, there's ReJSON, a dynamically loadable Redis module (for Redis 4+) which implements a document store on top of Redis. It can further relax requirements for marshalling big structures back and forth over the network.
With just 6 principles (which I collected here), it is very easy for a SQL-minded person to adapt to the Redis approach. Briefly they are:
The most important thing is that, don't be afraid to generate lots of key-value pairs. So feel free to store each row of the table in a different key.
Use Redis' hash map data type
Form key name from primary key values of the table by a separator (such as ":")
Store the remaining fields as a hash
When you want to query a single row, directly form the key and retrieve its results
When you want to query a range, use the wildcard "*" in your key pattern. But please be aware that scanning keys blocks other Redis operations, so use this method only if you really have to.
The link just gives a simple table example and how to model it in Redis. Following those 6 principles you can continue to think like you do for normal tables. (Of course without some not-so-relevant concepts as CRUD, constraints, relations, etc.)
Using a combination of Memcached and Redis on top of MySQL comes to mind.

How to implement a scalable, unordered collection in DynamoDB?

I am looking into implementing a scalable unordered collection of objects on top of Amazon DynamoDB. So far the following options have been considered:
(1) Use DynamoDB document data types (map, list) and use document path to access stand-alone items. This has one obvious drawback for collection being limited to 400KB of data, meaning perhaps 1..10K objects depending on their size. Less obvious drawback is that cost of insertion of a new object into such collection is going to be huge: Amazon specifies that the write capacity will be deducted based on the total item size, not just newly added object -- therefore ~400 capacity units for inserting 1KB object when approaching the size limit. So considering this ruled out?
(2) Using composite primary hash + range key, where primary hash remains the same for all objects in the collection, and range key is just something random or an atomic counter. Obvious drawback is that having identical hash key results in bad key distribution -- cardinality is low when there are collections with large number of objects. This means bad partitioning, and having a scale issue with all reads/writes on the same collection being stuck to one shard, becoming subject to 3000 reads / 1000 writes per second limitation of DynamoDB partition.
(3) Using global secondary index with secondary hash + range key, where hash key remains the same for all objects belonging to the same collection, and range key is just something random or an atomic counter. Similar to above, partitioning becomes poor for the GSI, and it will become a bottleneck with too many identical hashes draining all the provisioned capacity to the index rapidly. I didn't find how the GSI is implemented exactly, thus not sure how badly it suffers from low cardinality.
Question is, whether I could live with (2) or (3) and suffer from non-ideal key distribution, or is there another way of implementing collection that was overlooked, or perhaps I should at all consider looking into another nosql database engine.
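The write-cost concern for the document-type option can be checked with a little arithmetic. The sketch below assumes standard (non-transactional) writes billed at one write capacity unit per 1 KB of total item size, rounded up, as the question describes. The function name `write_capacity_units` is hypothetical.

```python
import math


def write_capacity_units(item_size_bytes):
    """Standard write cost: 1 unit per 1 KB of total item size, rounded up."""
    return max(1, math.ceil(item_size_bytes / 1024))


# Appending a 1 KB object to a document that is already 399 KB
# is billed on the resulting item size, not on the 1 KB that was added.
cost_inside_document = write_capacity_units(400 * 1024)

# Writing the same 1 KB object as its own item.
cost_as_separate_item = write_capacity_units(1 * 1024)
```

Under these assumptions the same insert costs 400 units inside the large document and 1 unit as a stand-alone item.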
This is a "shooting from the hip" answer, what you end up doing may depend on how much and what type of reading and writing you do.
Two things the dynamo docs encourage you to avoid are hot keys and, in general, scans. You noted that in cases (2) and (3), you end up with a hot key. If you expect this to scale (large collections), the hot key will probably hurt more and more, especially if this is a write-intensive application.
The docs on Query and Scan operations (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html) say that, for a query, "you must specify the hash key attribute name and value as an equality condition." So if you want to avoid scans, this might still force your hand and put you back into that hot key situation.
Maybe one route would be to embrace doing a scan operation, but just have one table devoted to your collection. Then you could just have a fully random (well distributed) hash key and do a scan every time. This assumes you always want everything from the collection (you didn't say). This will still hurt if you scale up to a large collection, but if you always want the full set back, you'll have to deal with that pain regardless. If you just want a subset, you can add a limit parameter. This would help performance, but you will always get back the same subset (or you can use the last evaluated key and keep going). The docs also mention parallel scans.
If you are using AWS, ElastiCache/Redis might be another route to try. The first pass might code up a lot faster/cleaner than situation (1) that you mentioned.

What is the conventional way to store objects in a sorted set in redis?

What is the most convenient/fast way to implement a sorted set in redis where the values are objects, not just strings.
Should I just store object id's in the sorted set and then query every one of them individually by its key or is there a way that I can store them directly in the sorted set, i.e. must the value be a string?
It depends on your needs. If you need to share this data with other zsets/structures and want to write the value only once for every change, you can put an ID as the zset value and add a hash to store the object. However, this implies making additional queries when you read data from the zset (one ZRANGE + n HGETALL for n values in the zset), but writing and synchronising the value between many structures is cheap (only updating the hash corresponding to the value).
But if it is "self-contained", with no or few accesses outside the zset, you can serialize your object to a chosen format (JSON, MessagePack, Kryo...) and then store it as the value of your zset entry. This way, you will have better performance when you read from the zset (only 1 query with O(log(N)+M), which is actually pretty good, probably the best you can get), but you may have to duplicate the value in other zsets/structures if you need to read or write this value elsewhere, which also implies maintaining synchronisation by hand on the value.
Redis has good documentation on performance of each command, so check what queries you would write and calculate the total cost, so that you can make a good comparison of these two options.
Also, don't forget that Redis comes with optimistic locking (WATCH), so if you need pessimistic locking (because of contention, for instance) you will have to do it by hand and/or using Lua scripts. If you need a lot of synchronisation, the first option seems better (lower read performance, but still good, and fewer queries and less complexity on writes), but if you have values that don't change a lot and memory space is not a problem, the second option will provide better performance on reads (you can duplicate the value in Redis and synchronize the copies periodically, for instance).
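The two options can be compared side by side in a small Python sketch. Dicts mapping member to score stand in for sorted sets, so this shows the shape of the reads and writes and is not Redis client code. All names (`top_by_id`, `top_by_blob`, the sample data) are hypothetical.

```python
import json

# Option 1: the sorted set holds IDs, the objects live in separate hashes.
scores_by_id = {"user:1": 50.0, "user:2": 75.0}            # member -> score
objects = {"user:1": {"name": "ann"}, "user:2": {"name": "bob"}}


def top_by_id(n):
    """One range query for the IDs, then one lookup per ID (ZRANGE + n HGETALL)."""
    ids = sorted(scores_by_id, key=scores_by_id.get, reverse=True)[:n]
    return [objects[i] for i in ids]


# Option 2: the sorted set holds the serialized object itself.
scores_by_blob = {
    json.dumps({"name": "ann"}, sort_keys=True): 50.0,
    json.dumps({"name": "bob"}, sort_keys=True): 75.0,
}


def top_by_blob(n):
    """A single range query returns everything; the client only deserializes."""
    blobs = sorted(scores_by_blob, key=scores_by_blob.get, reverse=True)[:n]
    return [json.loads(b) for b in blobs]


# Updating an object under option 1 touches only the hash.
objects["user:1"]["name"] = "anna"
# Under option 2 the member is the serialized value, so the old member has to be
# removed and the new one added with the same score.
old = json.dumps({"name": "ann"}, sort_keys=True)
scores_by_blob[json.dumps({"name": "anna"}, sort_keys=True)] = scores_by_blob.pop(old)
```

The update at the end is where the trade-off shows: with IDs the change is written once, while with serialized members every structure holding a copy has to be rewritten.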
Short answer: Yes, the value must be a string.
Longer answer: Redis strings are binary-safe, so you can serialize your object into any format of your choosing. Most people choose MsgPack or JSON because they are compact and serializers are available in just about any language.

Using key-value databases as a set with persistent indices

Since the below got a bit long, here's the tl;dr version: Is there an existing key/value best practice for fast key and value lookup, something like a hash-based set with persistent indices?
I'm interested in the world of key-value databases and have so far failed to figure out how one would efficiently implement the following use-case:
Assume we want to serialize some data and reference them somewhere else by a persistent, unique integer index. Thus e.g.: Key = unsigned int, Value = MyData.
The database should have fast key lookup and ensure that MyData is unique.
Now, when I insert a new value into the database, I could assign it a new index key, e.g. the current size of the database, or, to prevent clashes after removing items, I could keep some counter externally.
But how would I ensure that I do not insert the same MyData value into my database? So far, it looks to me as if this is not efficiently possible with key-value databases - is this correct? I.e. I do not want to iterate over the whole database just to ensure MyData value is not in there already...
What is the best practice to implement this, then?
For background: I work on KDevelop, where we use the above for our code analysis cache. We actually have a custom implementation of the above use-case [1]. Search for Bucket and ItemRepository if you are interested in the internals, and see [2] for an example usage of the ItemRepository.
But you will probably agree, that this code is quite hard to understand and thus hard to maintain. I want to compare its performance to alternative solutions which might result in simpler code - but only if it does not incur a severe performance penalty. Considering the hype around the performance of key-value storages such as OpenLDAP MDB, Kyoto Cabinet and LevelDB, this is where I wanted to start.
What we have in KDevelop - as far as I figured out - is basically a sort of hybrid on-disk/in-memory hash map which gets saved to disk periodically (which of course can result in major data corruption in case of crashes etc.). Items are stored in a location based on their hash value which then of course also allows relatively fast value lookups as long as the hash function is fast. The added twist is that you also get some sort of persistent database index which can be used to lookup the items quite efficiently.
So - long story short - how would one do that with a key/value database such as LevelDB, Kyoto Cabinet, OpenLDAP MDB - you name it?
Sounds like you want to do what OpenLDAP does with its Equality index. Perhaps this is the same as the OrientDB example, I didn't read it.
The main table is indexed by a monotonically increasing integer key (called the entryID), and stores the data value. The equality index is indexed by a hash of the value, and stores a list of entryIDs that match the hash. Since the hash might have collisions, just the existence of an entry in the equality index doesn't prove uniqueness or duplication. You still need to check the actual values.
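The two-table layout just described can be sketched in Python with dicts standing in for the database tables. This is an illustration of the idea only, not OpenLDAP's implementation. The truncated SHA-1 in `value_hash` is a hypothetical choice made so that collisions are at least conceivable, and `insert_unique` is a hypothetical name.

```python
import hashlib

main_table = {}       # entryID -> value (bytes)
equality_index = {}   # hash of value -> list of entryIDs sharing that hash
next_entry_id = 0


def value_hash(value):
    """Hypothetical hash choice; truncated so that collisions are possible."""
    return hashlib.sha1(value).digest()[:4]


def insert_unique(value):
    """Return the entryID for value, inserting it only if it is not stored yet."""
    global next_entry_id
    candidates = equality_index.setdefault(value_hash(value), [])
    for entry_id in candidates:
        # A hash match alone proves nothing; compare the stored value itself.
        if main_table[entry_id] == value:
            return entry_id
    entry_id = next_entry_id
    next_entry_id += 1
    main_table[entry_id] = value
    candidates.append(entry_id)
    return entry_id


first = insert_unique(b"struct Foo")
second = insert_unique(b"struct Bar")
again = insert_unique(b"struct Foo")
```

Inserting the same value twice returns the original entryID, and only the candidates that share a hash are compared, so no full scan is needed.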
A faster/simpler approach, if you're using MDB, BDB, or some other database that supports duplicate keys, is to just keep one table, using the hash as the key. In both MDB and BDB there is a GET_BOTH request which matches both the key and the data to perform a fetch. If it succeeds then you know for certain that the value already exists. Otherwise, it allows you to save whatever data values and not worry whether or not there are hash collisions.
A caveat here, in MDB using duplicate keys, the size of the values is limited to less than one half of a disk page.
Unless I'm missing something here - typically your hash algorithm is consistent and will provide the same key for the same data. Thus you should only need to look up the key to see if it already exists, or handle the (likely duplicate key) error the DB gives back to you.
afaik Key/Value DBs can and will enforce a unique Value constraint for you i.e. you will get an error if you try and save a value that already exists.
How big are your value strings?
I would just store them in a key and let the database do all the work.
Typical LevelDB style, which applies to most KV stores, would be to use a pair of keys, prefixed to indicate type
eg:
Key = 'i' + ID
Value = valueString
Key = 'v' + valueString
Value = ID
In a system that needs to allow for multiple identical valueStrings you would move the ID into the tail of the second key
Key = 'v' + valueString + ID
Value = empty
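A minimal sketch of the first (unique-value) layout above, using a Python dict as a stand-in for the key-value store. The helper names `put_unique`, `value_for`, and `id_for` are hypothetical. In a real store the two writes would presumably go into one atomic batch so the two directions cannot drift apart.

```python
store = {}   # stand-in for an ordered key-value store such as LevelDB


def put_unique(item_id, value_string):
    """Write both directions unless the value is already present; return its ID."""
    existing = store.get(b"v" + value_string)
    if existing is not None:
        return int(existing)
    store[b"i" + str(item_id).encode()] = value_string   # 'i' + ID -> value
    store[b"v" + value_string] = str(item_id).encode()   # 'v' + value -> ID
    return item_id


def value_for(item_id):
    """Look up a value by its persistent ID."""
    return store.get(b"i" + str(item_id).encode())


def id_for(value_string):
    """Look up the ID of a value, or None if the value is not stored."""
    raw = store.get(b"v" + value_string)
    return None if raw is None else int(raw)


first_id = put_unique(1, b"hello")
duplicate_id = put_unique(2, b"hello")   # value already stored, ID 1 is reused
```

Both lookups are single key reads, which is what makes the uniqueness check cheap.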