How does Redis's MGET with one key compare to GET in terms of performance? - redis

Obviously MGET outperforms GET for batch fetching but is there any pragmatic advantage to using GET instead of MGET when fetching just one key?
For example, when implementing a batch fetching system, is it worthwhile to special-case:
if (keys.length === 1) {
  results = [await redis.get(keys[0])];
} else {
  results = await redis.mget(keys);
}

If you only need to get one key, there's no need to call MGET, and GET should be a better choice.
With MGET, Redis replies with an array reply, which costs some more CPU time, i.e. appending the array length (in this case, 1) to the reply, and transmitting more data on the wire, i.e. the size of the array. Although these costs are normally negligible, less work is better.
With MGET, the client library normally returns an array, and you need to extract the reply from the array by index, which makes the client code less elegant.
MGET and GET behave differently if the key's type is NOT STRING. In this case, MGET returns a nil reply, while GET returns an error reply. Normally an error reply is better in this case, so that the client can distinguish a non-existent key from a key of the wrong type.
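To illustrate the wire-level difference described above, the two replies can be built by hand following the RESP protocol (a sketch: a bulk string is `$<len>\r\n<data>\r\n`, and an array reply prefixes its elements with `*<count>\r\n`):

```javascript
// RESP replies Redis would send for GET vs. MGET of one key whose value
// is "hello": a bulk string vs. a one-element array reply.
const value = 'hello';

// GET reply: a single bulk string.
const getReply = `$${value.length}\r\n${value}\r\n`;

// MGET reply: an array header, then the same bulk string.
const mgetReply = `*1\r\n$${value.length}\r\n${value}\r\n`;

console.log(getReply.length);  // 11 bytes
console.log(mgetReply.length); // 15 bytes: the array header *1\r\n adds 4 bytes
```

The 4-byte array header is the extra work and bandwidth the answer refers to - tiny, but nonzero.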
UPDATE
Since the OP updated the question, I've updated the answer:
when implementing a batch fetching system, is it worthwhile to special-case
In this case, I don't think it's worthwhile, unless your benchmark shows that GET is much faster than MGET and your code hits the if (keys.length === 1) branch frequently.
As I mentioned above, MGET and GET behave differently, so if you use both commands you may also need to handle errors differently, which makes the code more complex. Such a conditional branch also makes the code less elegant.
In a word: benchmark with your dataset and keep the code simple, unless the optimization significantly improves performance.

Related

Which one to use Hset or HMSet in Redis?

I am kind of new to Redis. I am just trying to store values in Redis using the HashSet method available in the StackExchange.Redis.StrongName assembly (let's say I have 4000 items). Is it better to store the 4000 items individually (using HSET), or should I pass a dictionary (using HMSET) so that only one Redis server call is required, albeit with a large payload? Which one is better?
Thanks in advance
HMSET has been deprecated as of Redis 4.0.0 in favor of using HSET with multiple key/value pairs:
https://redis.io/commands/hset
https://redis.io/commands/hmset
Performance will be O(n)
TL;DR A single call is "better" in terms of performance.
Taking into consideration #dizzyf's answer about HMSET's deprecation, the question becomes "should I use a lot of small calls instead of one big one?". Because there is overhead in processing every command, it is usually preferable to batch calls together to reduce that cost.
Some commands in Redis are variadic, a.k.a. of dynamic arity, meaning they can accept one or more values to eliminate multiple calls. That said, overloading a single call with a huge number of arguments is also not best practice - it typically leads to massive memory allocations and blocks the server from serving other requests while it is being processed.
I would approach this by dividing the values into constant-sized "batches" - 100 is a good start, but you can always tune it afterwards - and sending each such "batch" in a single HSET key field1 value1 ... field100 value100 call.
Pro tip: if your Redis client supports it, you can use pipelining to make everything more responsive ("better"?).
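To illustrate the batching approach above, here is a sketch in JavaScript (the splitIntoBatches helper and the field/value names are illustrative, not part of any client library):

```javascript
// Split a flat list of field/value pairs into constant-sized batches,
// each of which would become one HSET call (HSET key f1 v1 ... fN vN).
function splitIntoBatches(pairs, batchSize = 100) {
  const batches = [];
  for (let i = 0; i < pairs.length; i += batchSize) {
    batches.push(pairs.slice(i, i + batchSize));
  }
  return batches;
}

// 4000 items -> 40 batches of 100 field/value pairs each.
const pairs = Array.from({ length: 4000 }, (_, i) => [`field${i}`, `value${i}`]);
const batches = splitIntoBatches(pairs);
console.log(batches.length); // 40
// Each batch would then be sent as one variadic HSET (and, ideally, the
// batches pipelined together, as the pro tip above suggests).
```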

Redis PFADD to check an exists-in-set query

I have a requirement to process multiple records from a queue. But due to some external issues the items may sporadically occur multiple times.
I need to process items only once
What I planned to do is PFADD every record into Redis (as an md5sum) and then check whether that returns success. If it shows no increment, the record is a duplicate; otherwise, process the record.
This seems pretty straightforward, but I am getting too many false positives when using PFADD.
Is there a better way to do this ?
Being the probabilistic data structure that it is, Redis' HyperLogLog exhibits a 0.81% standard error. You can reduce (but never eliminate) the probability of false positives by using multiple HLLs, each counting the value of a different hash function of your record.
Also note that if you're using a single HLL there's no real need to hash the record - just PFADD as is.
Alternatively, use a Redis Set to keep all the identifiers/hashes/records and have 100%-accurate membership tests with SISMEMBER. This approach requires more (RAM) resources as you're storing each processed element, but unless your queue is really huge that shouldn't be a problem for a modest Redis instance. To keep memory consumption under control, switch between Sets according to the date and set an expiry on the Set keys (another approach is to use a single Sorted Set and manually remove old items from it by keeping their timestamp in the score).
In general in distributed systems you have to choose between processing items either :
at most once
at least once
Processing something exactly once would be convenient; however, this is generally impossible.
That being said there could be acceptable workarounds for your specific use case, and as you suggest storing the items already processed could be an acceptable solution.
Be aware though that PFADD uses HyperLogLog, which is fast and scales well but only approximates the count of items, so in this case I do not think it is what you want.
However if you are fine with having a small probability of errors, the most appropriate data structure here would be a Bloom filter (as described here for Redis), which can be implemented in a very memory-efficient way.
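To make the Bloom filter idea concrete, here is a minimal sketch in JavaScript (the BloomFilter class, its parameters, and the FNV-style hash are all illustrative; in production you would use a proven implementation such as the RedisBloom module):

```javascript
// Minimal Bloom filter sketch: k salted hash functions over a fixed bitset.
class BloomFilter {
  constructor(bits = 1024, hashes = 3) {
    this.bits = bits;
    this.hashes = hashes;
    this.bitset = new Uint8Array(Math.ceil(bits / 8));
  }
  // Simple FNV-1a-style hash, salted per hash-function index (illustrative).
  hash(item, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < item.length; i++) {
      h ^= item.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.bits;
  }
  add(item) {
    for (let s = 0; s < this.hashes; s++) {
      const pos = this.hash(item, s);
      this.bitset[pos >> 3] |= 1 << (pos & 7);
    }
  }
  mightContain(item) {
    for (let s = 0; s < this.hashes; s++) {
      const pos = this.hash(item, s);
      if (!(this.bitset[pos >> 3] & (1 << (pos & 7)))) return false; // definitely new
    }
    return true; // possibly seen before
  }
}

const seen = new BloomFilter();
console.log(seen.mightContain('record-1')); // false: nothing added yet
seen.add('record-1');
console.log(seen.mightContain('record-1')); // true: no false negatives
// Other records will test false with overwhelming probability, but false
// positives are possible - that is the Bloom filter trade-off.
```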
A simple, efficient, and recommended solution would be to use a plain Redis key (for instance a hash) storing a boolean-like value ("0"/"1" or "true"/"false"), for instance with HSET, or SET with the NX option. You could also put it under a namespace if you wish. It has the added benefit that keys can be expired.
It would save you from using a set (not the SET command, but rather SINTER, SUNION, etc.), which doesn't necessarily work well with Redis Cluster if you want to scale to more than one node. SISMEMBER is still fine though (but lacks some features that hashes have, such as time to live).
If you use a hash function, I would also advise you to pick one with fewer chances of collisions than md5 (a collision means that two different objects end up with the same hash).
An alternative to hashing would be to assign a UUID to every item when putting it in the queue (or a squuid if you want to have some time information).
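The SET ... NX dedup logic above can be sketched as follows, with an in-memory Map standing in for Redis (the setNX and processOnce helpers are illustrative, not real client-library calls):

```javascript
// Simulate SET key value NX: succeeds only the first time a key is set.
const store = new Map();
function setNX(key, value) {
  if (store.has(key)) return false; // key already exists: duplicate
  store.set(key, value);
  return true;
}

// Process each queue item only once, keyed by its identifier.
function processOnce(itemId, handler) {
  if (setNX(`dedup:${itemId}`, '1')) {
    handler(itemId);
    return true;
  }
  return false; // already processed, skip
}

const processed = [];
['a', 'b', 'a'].forEach(id => processOnce(id, x => processed.push(x)));
console.log(processed); // ['a', 'b'] - the second 'a' was skipped
```

With real Redis, an expiry on the dedup keys bounds memory use, as the answer notes.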

consistent hashing vs. rendezvous (HRW) hashing - what are the tradeoffs?

There is a lot available on the Net about consistent hashing, and implementations in several languages available. The Wikipedia entry for the topic references another algorithm with the same goals:
Rendezvous Hashing
This algorithm seems simpler and doesn't need the addition of replicas/virtual nodes around the ring to deal with uneven loading. As the article mentions, it appears to run in O(n), which would be an issue for large n, but it references a paper stating it can be structured to run in O(log n).
My question for people with experience in this area is, why would one choose consistent hashing over HRW, or the reverse? Are there use cases where one of these solutions is the better choice?
Many thanks.
Primarily, I would say the advantage of consistent hashing is when it comes to hotspots. Depending on the implementation, it's possible to manually modify the token ranges to deal with them.
With HRW, if you somehow end up with hotspots (e.g. caused by a poor choice of hashing algorithm), there isn't much you can do about it short of removing the hotspot and adding a new node, which should balance the requests out.
The big advantage of HRW is that when you add or remove nodes, you maintain an even distribution across everything. Consistent hashing resolves this by giving each node 200 or so virtual nodes, which also makes it difficult to manually manage ranges.
Speaking as someone who's just had to choose between the two approaches and who ultimately plumped for HRW hashing: my use case was a simple load-balancing one with absolutely no reassignment requirement - if a node died, it was perfectly OK to just choose a new one and start again. No rebalancing of existing data was required.
1) Consistent hashing requires a persistent map of the nodes and vnodes (or at least a sensible implementation does; you could rebuild all the objects on every request... but you really don't want to!). HRW does not (it's stateless). Nothing needs to be modified when machines join or leave the cluster, and there is no concurrency to worry about (beyond ensuring your clients have an accurate view of the cluster state, which is the same in both cases).
2) HRW is easier to explain and understand (and the code is shorter). For example, this is a complete HRW algorithm implemented in Riverbed Stingray TrafficScript. (Note there are better hash algorithms to choose than MD5 - it's overkill for this job.)
$nodes = pool.listActiveNodes("stingray_test");

# Get the key
$key = http.getFormParam("param");

$biggest_hash = "";
$node_selected = "";

foreach ($node in $nodes) {
   $hash_comparator = string.hashMD5($node . '-' . $key);
   # If the combined hash is the biggest we've seen, we have a candidate
   if ( $hash_comparator > $biggest_hash ) {
      $biggest_hash = $hash_comparator;
      $node_selected = $node;
   }
}
connection.setPersistenceNode( $node_selected );
3) HRW provides an even distribution when you lose or gain nodes (assuming you chose a sensible hash function). Consistent hashing doesn't guarantee that, but with enough vnodes it's probably not going to be an issue.
4) Consistent hashing may be faster - in normal operation it should be on the order of O(log N), where N is the number of nodes times the replica factor for vnodes. However, if you don't have a lot of nodes (I didn't), then HRW is probably going to be fast enough for you.
4.1) As you mentioned, Wikipedia notes that there is a way to do HRW in O(log N) time. I don't know how to do that! I'm happy with my O(N) time on 5 nodes...
In the end, the simplicity and the stateless nature of HRW made the choice for me....

What is the conventional way to store objects in a sorted set in redis?

What is the most convenient/fast way to implement a sorted set in redis where the values are objects, not just strings.
Should I just store object ids in the sorted set and then query each of them individually by key, or is there a way to store the objects directly in the sorted set - i.e., must the value be a string?
It depends on your needs. If you need to share this data with other zsets/structures and want to write the value only once per change, you can put an id as the zset member and add a hash to store the object. However, this implies making additional queries when you read from the zset (one ZRANGE + n HGETALL for n values in the zset), but writing and synchronising the value between many structures is cheap (only the hash corresponding to the value is updated).
But if it is "self-contained", with no or few accesses outside the zset, you can serialize your object to a chosen format (JSON, MessagePack, Kryo...) and store it as the zset member. This way you get better read performance from the zset (a single query with O(log(N)+M) complexity, which is about the best you can get), but you may have to duplicate the value in other zsets/structures if you need to read or write it elsewhere, which also implies maintaining synchronisation on the value by hand.
Redis has good documentation on the performance of each command, so check which queries you would write and calculate the total cost, so that you can make a good comparison of these two options.
Also, don't forget that Redis comes with optimistic locking, so if you need pessimistic locking (because of contention, for instance) you will have to do it by hand and/or with Lua scripts. If you need a lot of synchronisation, the first option seems better (lower read performance, but still good; fewer queries and less complexity on writes). But if your values don't change much and memory space is not a problem, the second option will give better read performance (you can duplicate the value in Redis and synchronise the copies periodically, for instance).
Short answer: yes, everything must be stored as a string.
Longer answer: you can serialize your object into any text-based format of your choosing. Most people choose MessagePack or JSON because they are very compact and serializers are available in just about any language.
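To make the serialize-into-the-member approach concrete, here is a sketch using JSON (the object shape, score, and key names are illustrative):

```javascript
// Store the whole object as the zset member, serialized to JSON.
const user = { id: 42, name: 'Ada' };
const member = JSON.stringify(user); // what you would pass as the ZADD member
// e.g. ZADD users-by-score 97 '{"id":42,"name":"Ada"}'

// Reading back: deserialize each member returned by ZRANGE.
const restored = JSON.parse(member);
console.log(restored.id);   // 42
console.log(restored.name); // Ada
```

Note that zset members must be unique, so two objects that serialize to the same string collapse into a single entry; updating an object means removing the old member and adding the new one.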

Merge Sort in a data store?

I'm trying to make a "friend stream" for the project I'm working on. I have individual users streams saved in Redis ZSETS. Something like:
key : { stream_id : time }
user1-stream: { 1:9931112, 3:93291, 9:9181273, ...}
user2-stream: { 4:4239191, 2:92919, 7:3293021, ...}
user3-stream: { 8:3299213, 5:97313, 6:7919921, ...}
...
user4-friends: [1,2,3]
Right now, to make user4's friend stream, I would call:
ZUNIONSTORE user4-friend-stream, [user1-stream, user2-stream, user3-stream]
However, ZUNIONSTORE is slow when you try to merge ZSETs totaling more than 1,000-2,000 elements.
I'd really love to have Redis do a merge sort on the ZSETS, and limit the results to a few hundred elements. Are there any off-the-shelf data stores that will do what I want? If not, is there any kind of framework for developing redis-like data stores?
I suppose I could just fork Redis and add the function I need, but I was hoping to avoid that.
People tend to think that a zset is just a skip list. This is wrong. It is a skip list (an ordered data structure) plus a non-ordered dictionary (implemented as a hash table). The semantics of a merge operation would have to be defined. For instance, how would you merge non-disjoint zsets whose common items do not have the same score?
To implement a merge algorithm for ZUNIONSTORE, you would have to get the items ordered (easy with the skip lists), merge them while building the output (which happens to be a zset as well: skiplist plus dictionary).
Because the cardinality of the result cannot be known at the start of the algorithm, I don't think it is possible to build this skiplist + dictionary in linear time. It will be O(n log n) at best. So the merge itself is linear, but building the output is not, which defeats the benefit of using a merge algorithm.
Now, if you want to implement a ZUNION (i.e. directly returning the result, not building the result as a zset), and limit the result to a given number of items, a merge algorithm makes sense.
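The ZUNION-with-limit idea can be sketched as a k-way merge over already-sorted streams that stops as soon as the limit is reached (a sketch; the mergeStreams name and data layout are illustrative, and the inputs are assumed disjoint, as with the per-user streams in the question):

```javascript
// k-way merge of descending-sorted [id, time] lists, assuming disjoint ids
// (as with per-user activity streams); stops as soon as `limit` items are out.
function mergeStreams(streams, limit) {
  const pos = streams.map(() => 0); // one cursor per stream
  const out = [];
  while (out.length < limit) {
    let best = -1;
    // Pick the stream whose current head has the largest timestamp.
    for (let i = 0; i < streams.length; i++) {
      if (pos[i] < streams[i].length &&
          (best === -1 || streams[i][pos[i]][1] > streams[best][pos[best]][1])) {
        best = i;
      }
    }
    if (best === -1) break; // all streams exhausted
    out.push(streams[best][pos[best]++]);
  }
  return out;
}

// The question's data, pre-sorted by time descending:
const user1 = [['1', 9931112], ['9', 9181273], ['3', 93291]];
const user2 = [['4', 4239191], ['7', 3293021], ['2', 92919]];
console.log(mergeStreams([user1, user2], 3).map(e => e[0])); // ['1', '9', '4']
```

Unlike ZUNIONSTORE, this touches only as many elements as the limit requires, which is exactly why the merge only pays off when the zsets are much larger than the limit.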
RDBMSs supporting merge joins can typically do it (though this is usually not very efficient, due to the cost of random I/Os). I'm not aware of a NoSQL store supporting similar capabilities.
To implement it in Redis, you could try a server-side Lua script, but it may be complex, and I think it will only be efficient if the zsets are much larger than the limit provided to the ZUNION. In that case, the limit on the number of items will offset the overhead of running interpreted Lua code.
The last possibility is to implement it in C in the Redis source code, which is not that difficult. The drawback is the burden of maintaining a patch for the Redis versions you use. Redis itself provides no framework for this, and the idea of defining Redis plugins (isolated from the Redis source code) has generally been rejected by the author.