Does Redis keep all chars from prefixed kinds of keys?

I may have 100 million keys that are long but partially static, like:
someReal...LongStaticPrefix:12345
someReal...LongStaticPrefix:12
someReal...LongStaticPrefix:123456
Where only the last part of the key is dynamic, the rest is static.
Does Redis store each key in full, or does it create an internal alias or something like that?
Should I worry about storage or performance?
Or is it better if I make an internal alias for the keys to keep them short?

Redis does keep the whole key. This long prefix will impact your memory usage.
Given that Redis uses a hash table to store the keys, the performance impact is low. The hash table load factor is usually between 0.5 and 1, which means there are typically just one or two keys per hash slot. So the performance cost is the extra network payload for the long key, the extra time to hash it, and the longer comparisons against the one or two keys in the hash slot. It's likely negligible unless your prefix is really, really long.
Consider a shorter key prefix.
Before reaching for a hash structure (HSET), consider whether you are using Redis Cluster or may eventually need to: a single hash key cannot be sharded.
A minor optimization: make the static part a suffix (i.e., put the dynamic part first), so comparisons against the one or two keys in a hash slot's chain fail fast instead of scanning the shared prefix.
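If you want to quantify the memory overhead, the MEMORY USAGE command reports the bytes a key and its value occupy. A minimal sketch with Python and redis-py (the key names are made up for illustration):
import redis

r = redis.Redis()  # assumes a local Redis on the default port

# The same value stored under a long-prefix key and a short key
r.set("someRealLongStaticPrefix:12345", "value")
r.set("p:12345", "value")

# MEMORY USAGE counts the key name as part of the allocation
print(r.memory_usage("someRealLongStaticPrefix:12345"))
print(r.memory_usage("p:12345"))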

Related

Do I need to hash a Redis key before using *SET?

I'm under the impression that one should hash (e.g. with SHA3) their Redis key before adding data to it. (It might have even been with regard to memcache.) I don't remember why I have this impression or where it came from, but I can't find anything to validate (or refute) it. The reasoning was that the hash would help with even distribution across a cluster.
When using Redis (in either/both clustered and non-clustered modes), is it best practice to hash the key before calling SET? e.g. set(sha3("username:123"), "owlman123")
No, you shouldn't hash the key. Redis Cluster hashes the key itself for the purpose of choosing the node:
There are 16384 hash slots in Redis Cluster, and to compute what is the hash slot of a given key, we simply take the CRC16 of the key modulo 16384.
You can also use hash tags to control which keys share the same slot.
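To make the slot selection concrete, here is a Python sketch of the CRC16 (XMODEM variant, polynomial 0x1021, initial value 0) and hash-tag rules described in the cluster spec; it's an illustration, not the cluster client you would use in practice:
def crc16(data):
    # CRC16/XMODEM, as used by Redis Cluster
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = (crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def key_slot(key):
    # Hash-tag rule: if the key contains a non-empty {...} section,
    # only that section is hashed, so related keys can share a slot
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key) % 16384

print(key_slot(b"user:1000"))              # a slot in 0..16383
print(key_slot(b"{user:1000}.followers"))  # same slot: only user:1000 is hashed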
It might be a good idea if your keys are very long, as recommended in the official documentation:
A few other rules about keys:
Very long keys are not a good idea. For instance a key of 1024 bytes is a bad idea not only memory-wise, but also because the lookup of the key in the dataset may require several costly key-comparisons. Even when the task at hand is to match the existence of a large value, hashing it (for example with SHA1) is a better idea, especially from the perspective of memory and bandwidth.
source: https://redis.io/docs/data-types/tutorial/#keys
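As an illustration of that advice, one might hash a long value with SHA1 and use the fixed-size digest as the key. A sketch with Python and redis-py (the "seen:" prefix and URL are made up):
import hashlib
import redis

r = redis.Redis()

long_value = "https://example.com/some/extremely/long/path?with=many&query=params"

# Use the fixed-size SHA1 digest of the long value as the key
key = "seen:" + hashlib.sha1(long_value.encode()).hexdigest()
r.set(key, 1)
print(r.exists(key))  # 1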

Using HSET or SETBIT for storing 6 billion SHA256 hashes in Redis

Problem: I am looking to store 6 billion SHA256 hashes. I want to check whether a hash exists and, if so, perform an action. When it comes to storing the SHA256 hash (a 64-character hex string) just to check whether the key exists, I've come across two pairs of commands to use:
HSET/HEXISTS and GETBIT/SETBIT
I want to make sure I take the least amount of memory, but also want to make sure lookups are quick.
The use case is "check if a SHA256 hash exists".
The problem:
I want to understand how to store this data, as currently I see a 200% size increase going from text to Redis. I want to understand what the best sharding options using ziplist entries and ziplist values would be, and how to split the hash so the ziplist is used effectively.
I've tried setting the ziplist entries to 16^4 (65536) and the value to 60, based on a 4:60 split.
Any help understanding the options and techniques to make the footprint as small as possible while keeping lookups fast would be appreciated.
Thanks
A bit late to the party but you can just use plain Redis keys for this:
# Store a given SHA256 hash
> SET 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08 ""
OK
# Check whether a specific hash exists
> EXISTS 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
0
Both SET and EXISTS have a time complexity of O(1) for single keys.
Since Redis can handle a maximum of 2^32 keys, you should split your dataset across two or more Redis servers/clusters, depending also on the number of nodes and the total combined memory available to them.
I would also suggest using the binary sequence of your hashes instead of their textual representation, as that would save ~50% of the memory used for storing your keys in Redis.
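For instance, in Python the 64-character hex string can be packed into its 32 raw bytes before being used as the key (a sketch, assuming redis-py):
import redis

r = redis.Redis()

hex_hash = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"

# Redis keys are binary safe, so the 32 raw digest bytes can be the key,
# halving the key size compared to the 64-char hex form
raw = bytes.fromhex(hex_hash)
r.set(raw, "")
print(r.exists(raw))  # 1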

Key types supported by Redis

What are the different key types supported by Redis? The documentation mentions all the various types (strings, set, hashmap, etc.) of values supported by Redis, but I couldn't quite find the key type information.
From the Redis documentation (Data types intro):
Redis keys
Redis keys are binary safe, this means that you can use any binary sequence as a key, from a string like "foo" to the content of a JPEG file. The empty string is also a valid key. A few other rules about keys:

Very long keys are not a good idea. For instance a key of 1024 bytes is a bad idea not only memory-wise, but also because the lookup of the key in the dataset may require several costly key-comparisons. Even when the task at hand is to match the existence of a large value, hashing it (for example with SHA1) is a better idea, especially from the perspective of memory and bandwidth.

Very short keys are often not a good idea. There is little point in writing "u1000flw" as a key if you can instead write "user:1000:followers". The latter is more readable and the added space is minor compared to the space used by the key object itself and the value object. While short keys will obviously consume a bit less memory, your job is to find the right balance.

Try to stick with a schema. For instance "object-type:id" is a good idea, as in "user:1000". Dots or dashes are often used for multi-word fields, as in "comment:1234:reply.to" or "comment:1234:reply-to".

The maximum allowed key size is 512 MB.
In my experience, "any binary sequence" typically means a string, but I may not be familiar with languages where you can achieve this using other data types.
Keys in Redis are all strings, so it doesn't really matter what kind of value you pass into a client. Under the hood the RESP protocol is used, and it passes the value to the engine as a string.
Example:
ZADD some_key 1 some_value
some_key is always a string; even if you pass 3 as the key, it is handled as a string. This is true for every client.
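You can see this with any client; for example, with redis-py (a quick sketch):
import redis

r = redis.Redis()

# redis-py serializes the int 3 to the string "3" before sending it over RESP
r.set(3, "some_value")
print(r.get("3"))  # b'some_value' -- the same key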

How to implement a scalable, unordered collection in DynamoDB?

I am looking into implementing a scalable unordered collection of objects on top of Amazon DynamoDB. So far the following options have been considered:
Use DynamoDB document data types (map, list) and use document paths to access stand-alone items. This has one obvious drawback: a collection is limited to 400KB of data, meaning perhaps 1K-10K objects depending on their size. A less obvious drawback is that the cost of inserting a new object into such a collection is huge: Amazon specifies that the write capacity will be deducted based on the total item size, not just the newly added object, therefore ~400 capacity units for inserting a 1KB object when approaching the size limit. So I consider this ruled out.
Using a composite primary hash + range key, where the primary hash remains the same for all objects in the collection and the range key is just something random or an atomic counter. The obvious drawback is that an identical hash key results in bad key distribution: cardinality is low for collections with a large number of objects. This means bad partitioning, and a scaling issue, with all reads/writes on the same collection stuck on one shard and subject to the 3000 reads / 1000 writes per second limit of a DynamoDB partition.
Using a global secondary index with a secondary hash + range key, where the hash key remains the same for all objects belonging to the same collection and the range key is just something random or an atomic counter. Similar to the above, partitioning becomes poor for the GSI, and it will become a bottleneck, with too many identical hashes rapidly draining all the capacity provisioned for the index. I couldn't find how the GSI is implemented exactly, so I'm not sure how badly it suffers from low cardinality.
The question is whether I could live with (2) or (3) and suffer the non-ideal key distribution, whether there is another way of implementing the collection that I have overlooked, or whether I should consider another NoSQL database engine altogether.
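For reference, a sketch of what option (2) looks like with boto3 (the table and attribute names are made up; the hot-key concern is unchanged):
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table for option (2): all objects in a collection share one
# hash key; the range key distinguishes the items
dynamodb.create_table(
    TableName="collections",
    AttributeDefinitions=[
        {"AttributeName": "collection_id", "AttributeType": "S"},
        {"AttributeName": "object_id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "collection_id", "KeyType": "HASH"},
        {"AttributeName": "object_id", "KeyType": "RANGE"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)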
This is a "shooting from the hip" answer; what you end up doing may depend on how much and what type of reading and writing you do.
Two things the dynamo docs encourage you to avoid are hot keys and, in general, scans. You noted that in cases (2) and (3), you end up with a hot key. If you expect this to scale (large collections), the hot key will probably hurt more and more, especially if this is a write-intensive application.
The docs on Query and Scan operations (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html) say that, for a query, "you must specify the hash key attribute name and value as an equality condition." So if you want to avoid scans, this might still force your hand and put you back into that hot key situation.
Maybe one route would be to embrace doing a scan operation, but just have one table devoted to your collection. Then you could just have a fully random (well distributed) hash key and do a scan every time. This assumes you always want everything from the collection (you didn't say). This will still hurt if you scale up to a large collection, but if you always want the full set back, you'll have to deal with that pain regardless. If you just want a subset, you can add a limit parameter. This would help performance, but you will always get back the same subset (or you can use the last evaluated key and keep going). The docs also mention parallel scans.
If you are using AWS, ElastiCache/Redis might be another route to try. A first pass might code up a lot faster/cleaner than option (1) that you mentioned.
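If you go that route, a Redis set maps naturally onto an unordered collection. A minimal sketch with redis-py (the key name and members are made up):
import redis

r = redis.Redis()

# One set per collection; members are object ids (or serialized objects)
r.sadd("collection:42", "obj-1", "obj-2", "obj-3")
print(r.sismember("collection:42", "obj-2"))  # True
print(r.smembers("collection:42"))            # all members, unordered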

Using GUID (or similar) has performance penalty in Redis?

Does using a GUID or ulong key impact Redis DB performance?
Similar: Does name length impact performance in Redis?
This question is an old one, but other answers are a bit misleading. Eric's answer is totally unrelated to Redis. Pfreixes's answer is based on personal assumptions and is simply wrong.
In fact, it's fairly safe to use GUID keys (performance-wise) as even 300+ character keys don't affect performance significantly on O(1) operations. Check this benchmark: Does name length impact performance in Redis?.
A GUID typically has a length of 32-36 chars in its hex representation. As Evan Carrol noticed in the comments, Redis strings are binary safe, so you can use the binary value and reduce the key size down to 128 bits (16 bytes). Keys of that length won't hurt performance at all.
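In Python, for example, the uuid module exposes the raw 128-bit value directly (a sketch, assuming redis-py):
import uuid
import redis

r = redis.Redis()

guid = uuid.uuid4()

# guid.bytes is the 128-bit value as 16 raw bytes: half the size of the
# 32-char hex form, and valid as a binary-safe Redis key
r.set(guid.bytes, "owlman123")
print(r.exists(guid.bytes))  # 1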
Also, the documentation suggests using hash functions for really large keys: http://redis.io/topics/data-types-intro
Redis uses a hashing strategy to store all keys: every key is stored via a hash function, and all Redis performance around keys comes down to this function or something related to it.
The original key is also stored, to resolve future collisions between different keys. And yes, big keys can have an impact on memory handling and everything related to it: memory fragmentation, cache hits/misses, etc.