efficiency of hashing long string redis key before storing - redis

tldr version:
I have a long encoded json payload that I store on redis as a key. I would like to know if hashing the key before storing will improve lookup performance and which hashing algorithm is recommended (I'm considering md5/sha1).
P.S. I'm using Python.
other notes:
ttl for key is short (30 secs) hence I don't care about hash collision
I only need to check if key exists in redis
long story version:
I have a stream of transactions in json that are encoded in protobuf flowing to my application via a message queue at a high rate. I run worker nodes that read the data from the queue and process the data. However I realized that there were instances that duplicates were being sent.
my solution was to store the data in a "global" cache (redis) where my workers would check before attempting to process. as the flow rate is high, decoding the data and reading it is expensive hence i'm storing the strings whole.
transactions expire after 30s so i use a ttl of 30s.
therefore i'm wondering if hashing the strings before storing them would be a good idea, as i only need to check for existence
thanks for reading

You really don't need a cryptographic hash. You want the fastest non-cryptographic algorithm that is good at collision avoidance.
Here is a good discussion of various options: "Fastest hash for non-cryptographic uses?"
The Redis documentation discusses optimal key size here:
https://redis.io/topics/data-types-intro under the section "Redis keys"
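As a rough illustration of the idea in Python, the sketch below derives a short fixed-length key from the raw payload and records it with a 30 second TTL. The function names and the "txn:" prefix are made up for this example. It uses BLAKE2b from the standard library only because it needs no extra install; it is a cryptographic hash, so a non-cryptographic one from a third-party package (xxhash, for instance) may well be faster. The `client` is assumed to be a redis-py style object whose `set` accepts `nx` and `ex`.

```python
import hashlib


def dedupe_key(payload: bytes, prefix: str = "txn:") -> str:
    """Return a short, fixed-length Redis key derived from the raw payload."""
    # A 16-byte digest gives 32 hex characters, whatever the payload size.
    digest = hashlib.blake2b(payload, digest_size=16).hexdigest()
    return prefix + digest


def is_first_seen(client, payload: bytes, ttl_seconds: int = 30) -> bool:
    """Record the payload and report whether this caller saw it first.

    Assumes `client.set(name, value, nx=..., ex=...)` behaves like redis-py:
    it returns a truthy value when the key was set, and None when the key
    already existed.
    """
    return bool(client.set(dedupe_key(payload), b"1", nx=True, ex=ttl_seconds))
```

A worker would call `is_first_seen(client, raw_message)` and skip processing when it returns False. Doing the check and the write in one SET with NX avoids a race between two workers that both see the key as missing.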

Related

Fetch keys with matched pattern

I have a Spring boot application which is connected to Redis.
I want to perform a redis operation to fetch the keys which matches certain pattern.
I understand this can be achieved in multiple ways
Redis Template and Keys command : But it's not suitable for large data sets, as it may block the client (not the server) for a long time and also exhaust the server memory due to the response buffer size.
Redis Template and Scan command : The Redis docs recommend the Scan command over Keys, as it scans iteratively in smaller, faster operations and is easier on server resources.
Spring Data Redis Repository : Fetch all by creating a hash on the pattern in the Redis Entity.
But I am not sure which will give me the fastest overall performance when fetching all the matched keys under high load, and which would be recommended for my scenario.
Best Regards,
Saurav
Redis is single-threaded so traversing all keys on a large dataset (a large number of keys) may block the server for a long time (even several seconds). And so it is not advised to run 'Keys' in production at all.
The scan operation is built to run iteratively, but you should note that you might get the same key more than once, and that keys added or removed while the iteration is in progress may or may not be returned. Overall your system will stay more responsive with Scan, because each call does only a small amount of work.
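The iteration described above can be sketched as follows. The question is about Spring, so this Python version is only an illustration of the pattern, not of the Spring API. The function name is hypothetical, and `client` is assumed to be a redis-py style object exposing `scan_iter(match=..., count=...)`.

```python
def matching_keys(client, pattern: str, batch_hint: int = 500) -> set:
    """Collect the keys matching `pattern` using incremental SCAN calls."""
    seen = set()
    # scan_iter wraps the SCAN cursor loop; `count` is only a hint for how
    # much work the server does per call, not a limit on the result size.
    for key in client.scan_iter(match=pattern, count=batch_hint):
        # A set silently drops keys that SCAN reports more than once.
        seen.add(key)
    return seen
```

Collecting into a set handles the duplicates, but it holds every matched key in client memory, so for very large result sets it may be better to process keys as they arrive and make that processing idempotent.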

Is there an extra cost to cache misses on Redis

Is there an advantage to setting a default value for an entry that will be heavily queried in Redis or will querying for the unset key take the same time?
Given the keys are stored in a distributed hash, it will have to check that the key is not in the bucket before returning on a miss, which may be a bit slower than finding and stopping at a hit. Is the bucket sorted or linear? Does anything else make it slower either way?
Redis is set up in a cluster and has many million entries in this case.
I'm assuming you're just talking about strings & hashes here (so the only operations you care about are set/get, maybe hget/hset). From Redis' perspective, a cache hit and a cache miss have the same time complexity. If anything, a cache miss will be faster, because Redis will not have to transfer any data back over the socket to your app.

Redis: Dump db and delete dumped key / value pairs

I have multiple servers that all store set members in a shared Redis cache. When the cache fills up, I need to persist the data to disk to free up RAM. I then plan to parse the dumped data such that I will be able to combine all of the values that belong to a given key in MongoDB.
My first plan was to have each server process attempt an sadd operation. If the request fails because Redis has reached maxmemory, I planned to query for each of my set keys, and write each to disk.
However, I am wondering if there is a way to use one of the inbuilt persistence methods in Redis to write the Redis data to disk and delete the key/value pairs after writing. If this is possible I could just parse the rdb dump and work with the data in that fashion. I'd be grateful for any help others can offer on this question.
Redis' persistence is meant to be used for whatever's in the RAM. Put differently, you can't persist what ain't in RAM.
To answer your question: no, you can't use persistence to "offload" data from RAM.

Are smaller redis keys more efficient

I am working on implementing a page cache using Redis. It works, but I am currently using the URL as the Redis key. Hashing will of course cost me CPU time, but it will make the key smaller and less complicated. Will smaller, hashed Redis keys outperform a key that is based on a page URL?

Redis: Efficient cluster of servers for large key set

I have a very large set of keys, 200M keys, with small values, <100 bytes, to store and I'm trying to use Redis. The problem is such that I have 10 Redis DB to split the keys over, but currently I'm on a single server with those 10 Redis DB. By a Redis DB I mean using SELECT. From my calculations it looks like I'm going to blow out memory. I think I'll need over 4TB of memory for this case! What are my options? First, my calculation is based on 10000 keys with 100 byte values taking 220MB of RAM (this is from a table I found). So simply put (2*10^8 / 10^4) * 220MB = 4.4TB.
If my calculation looks correct, what are my options? I've read on different posts that Redis VM is no longer an option. Can I use a Redis cluster? This still appears to require too many servers to be practical. I understand I could switch to another DB, but I'd like that to be the last resort option.
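The estimate in the question is a straight linear extrapolation, which can be reproduced in a few lines of Python. The inputs are exactly the figures quoted above (including the 220MB per 10000 keys taken from the table the asker found, which is not verified here); the function and constant names are made up for this check.

```python
TOTAL_KEYS = 200_000_000   # 200M keys, from the question
SAMPLE_KEYS = 10_000       # sample size quoted from the table
SAMPLE_MB = 220            # memory quoted for that sample, in MB


def estimated_memory_tb(total_keys: int = TOTAL_KEYS,
                        sample_keys: int = SAMPLE_KEYS,
                        sample_mb: float = SAMPLE_MB) -> float:
    """Linearly extrapolate the sample's memory use to the full key count."""
    total_mb = (total_keys / sample_keys) * sample_mb
    return total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB


print(f"{estimated_memory_tb():.1f} TB")
```

The arithmetic holds for the stated inputs, so the result is only as good as the per-key figure it starts from; measuring memory use on a sample of the real data would give a firmer number.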
Firstly, using shared databases (i.e. the SELECT command) isn't a recommended practice since all of these databases are essentially managed by the same Redis process. It is preferable having 10 separate Redis processes (even on the same server) in order to avoid contention (more info here).
Next, there are ways to reduce the memory footprint of your database. You could, for example, perform client-side compression (see here) or consider other optimizations such as using Hashes to keep multiple values (as described here).
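One way the Hashes idea might look in Python is sketched below: instead of one top-level key per entry, entries are grouped into a fixed number of Redis hashes, with the original key used as the field name. The bucket count, the "bucket:" naming and the helper names are all assumptions made for this example, and `client` is assumed to be a redis-py style object with `hset(name, key, value)` and `hget(name, key)`.

```python
import zlib


def bucket_and_field(key: str, buckets: int = 100_000) -> tuple:
    """Map a logical key to (hash name, field name)."""
    # CRC32 is stable across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(key.encode("utf-8")) % buckets
    return f"bucket:{bucket}", key


def put(client, key: str, value: bytes) -> None:
    """Store `value` under `key` inside its bucket hash."""
    name, field = bucket_and_field(key)
    client.hset(name, field, value)


def get(client, key: str):
    """Fetch the value for `key`, or None if it is not stored."""
    name, field = bucket_and_field(key)
    return client.hget(name, field)
```

Any memory saving depends on each hash staying small enough for Redis to keep it in its compact encoding, so the bucket count has to be tuned against the total number of keys and the server's configuration. Per-entry expiry is also lost, since a TTL applies to the whole hash.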
That said, a Redis server is ultimately bound by the amount of RAM that the host provides. Once you've reached that limit you'll need to shard your database and use a Redis cluster. Since you're already using multiple databases this shouldn't pose a big challenge as your code should already be compatible with that to a degree. Sharding can be done in one of three approaches: client, proxy or Redis Cluster. Client-side sharding can be implemented in your code or by the Redis client that you're using (if the client library that you're using supports that). Redis Cluster (v3) is expected to be released in the very near future and already has a stable release candidate. As for proxy-based sharding, there are several open source solutions out there, including Twitter's twemproxy, Netflix's dynomite and codis. Additional information about sharding and partitioning can be found here.
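As a minimal sketch of the client-side approach, assuming a fixed list of already-connected clients (one per shard), the code below picks a shard by hashing the key. The helper names are hypothetical, and this is plain modulo sharding rather than whatever scheme a given client library or proxy actually uses.

```python
import zlib


def shard_index(key: str, shard_count: int) -> int:
    """Pick a shard number for `key` in the range [0, shard_count)."""
    # CRC32 gives the same answer in every process, so all clients agree.
    return zlib.crc32(key.encode("utf-8")) % shard_count


def client_for(key: str, clients: list):
    """Return the client (one per shard) responsible for `key`."""
    return clients[shard_index(key, len(clients))]
```

Usage would be along the lines of `client_for(key, clients).get(key)`. Modulo sharding remaps most keys when the number of shards changes, which is one reason to consider consistent hashing, a proxy, or Redis Cluster instead.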
Disclaimer: I work at Redis Labs. Lastly, AFAIK there's only one Redis-as-a-Service provider that already provides built-in support for clustering Redis. Redis Labs' Redis Cloud is a fully-managed service that can scale seamlessly to any required capacity. Our clusters support both the '{}' hashtag standard as well as sharding by RegEx - more about this can be found here.
You can use LMDB with Dynomite to store data beyond your memory capacity. LMDB uses both disk and memory to store data, and Dynomite makes LMDB distributed.
We have done a POC with this combo and they work nicely together.
For more information, please check out our open issue here:
https://github.com/Netflix/dynomite/issues/254