Redis ZRANGEBYSCORE strange behaviour - redis

I'm using Redis 2.6 and have run into strange behavior with the ZRANGEBYSCORE command.
I have a sorted set with a few million elements.
Something like this:
10 marry
15 john
25 bob
...
Compare two queries:
ZRANGEBYSCORE longset 25 50 LIMIT 0 20 works like a charm; it takes milliseconds.
ZRANGEBYSCORE longset 25 50 hangs for minutes!
All the elements I'm interested in are within the first hundred of the set.
I think there's no need to scan elements with a score greater than 50,
because it is a SORTED set.
Please explain how Redis scans sorted sets and why there is such a big difference between these two queries.

One of the best things about Redis, IMO, is that you can check the time complexity of each command in the docs. The docs for ZRANGEBYSCORE specify:
Time complexity: O(log(N)+M) with N being the number of elements in the sorted set and M the number of elements being returned. If M is constant (e.g. always asking for the first 10 elements with LIMIT), you can consider it O(log(N)).
[...]
Keep in mind that if offset is large, the sorted set needs to be traversed for offset elements before getting to the elements to return, which can add up to O(N) time complexity.
This means that if you only need a certain number of items and specify LIMIT offset count with an offset at (or close to) 0, you can consider the command O(log(N)). If the number of returned items is high (or the offset is high), it is effectively O(N).
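A rough way to see the O(log(N)+M) behaviour is the Python sketch below. It models the sorted set as a sorted list and uses binary search to find the start of the range; this is an illustrative stand-in, not Redis's internal data structure, and the function name and the `visited` counter are assumptions made for the example.

```python
import bisect

def zrangebyscore(scores, members, lo, hi, offset=0, count=None):
    """Model of ZRANGEBYSCORE over parallel lists sorted by score.

    Returns (result, visited): the matching members and the number of
    entries that had to be walked after the initial seek.
    """
    i = bisect.bisect_left(scores, lo)  # O(log N) seek to the first score >= lo
    result = []
    visited = 0
    while i < len(scores) and scores[i] <= hi:
        visited += 1
        if visited > offset:  # entries before the offset are walked but skipped
            result.append(members[i])
            if count is not None and len(result) == count:
                break  # LIMIT reached, stop walking
        i += 1
    return result, visited

scores = list(range(200_000))
members = [f"m{s}" for s in scores]

limited, walked_limited = zrangebyscore(scores, members, 25, 100_000, offset=0, count=20)
full, walked_full = zrangebyscore(scores, members, 25, 100_000)
print(walked_limited, walked_full)
```

With LIMIT the walk stops after `count` entries; without it, every member whose score falls in the range is visited and returned, so the cost grows with the size of the result rather than with the position of the range in the set.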

Related

Hash tables Time Complexity Confusion

I just started learning about Hash Dictionaries. Currently we are implementing a hash dictionary with separate buckets that are made of chains (linked lists). The book posed this problem and I am having a lot of trouble figuring it out. Imagine we have an initial table size of 10 ie 10 buckets. If we want to know the time complexity for n insertions and a single lookup, how do we figure this out? (Assuming a pointer access is one unit of time).
It poses three scenarios:
A hash dictionary that does not resize, what is the time complexity for n insertions and 1 lookup?
A hash dictionary that resizes by 1 when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
A hash dictionary that resizes by doubling the table size when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
My initial thoughts had me really confused. I couldn't figure out how to know the length of a given chain for an insertion. Assuming a chain of length k, the for loop walks the whole chain, which costs k pointer accesses. In each iteration it also checks whether the current node's data equals the key being inserted (if the key exists, its value is overwritten), so that is 2k units of time if the key is not found and 2k+1 if it is. Then it does 5 pointer accesses to prepend the new element. So one insertion costs 2k+5 or 2k+1, and n insertions are O(kn) in the first scenario. A lookup seems to cost 2k+1 or 2k, so one lookup is O(k). I don't have a clue how to approach the other two scenarios. Some help would be great. Once again, to clarify: k isn't mentioned in the problem. The only facts given are the initial size of 10 and the information in the scenarios, so k can't appear in the time complexity of n insertions or 1 lookup.
If you have a hash dictionary, then insert, delete and search each take O(n) time for 1 key in the worst case, when every key lands in the same bucket. For n insertions that is O(n^2). It doesn't matter what the size of your table is.
|--------|
|element1| -> element2 -> element3 -> element4 -> element5
|--------|
| null |
|--------|
| null |
|--------|
| null |
|--------|
| null |
|--------|
Now for Average Case
Scenario one has a fixed number of slots (say m), so the load factor is n/m. Therefore one insert will be O(1+n/m): 1 for the hash function computation and n/m for the lookup.
For the 2nd and 3rd scenarios it should be O(1+n/(m+1)) and O(1+n/(2m)) respectively.
To resolve your confusion, ask yourself what the expected chain length is for a random set of keys. The answer is that we can't know it exactly.
That's where the load factor comes in to define the average case: we give each slot an equal probability of receiving a key, so chains form once the number of keys exceeds the slot count.
Imagine we have an initial table size of 10 ie 10 buckets. If we want to know the time complexity for n insertions and a single lookup, how do we figure this out?
When we talk about time complexity, we're looking at the steepness of the n-vs-time-for-operation curve as n approaches infinity. In the case above, you're saying there are only ten buckets, so - assuming the hash function scatters the insertions across the buckets with near-uniform distribution (as it should), n insertions will result in 10 lists of roughly n/10 elements.
During each insertion, you can hash to the correct bucket in O(1) time. Now - a crucial factor here is whether you want your hash table implementation to protect you against duplicate insertions.
If you simply trust there will be no duplicates, or the hash table is allowed to have duplicates (e.g. C++'s unordered_multiset), then the insertion itself can be done without inspecting the existing bucket content, at an accessible end of the bucket's list (i.e. using a head or tail pointer), also in O(1) time. That means the overall time per insertion is O(1), and the total time for n insertions is O(n).
If the implementation must identify and avoid duplicates, then for each insertion it has to search along the existing linked list, the size of which is related to n by a constant #buckets factor (1/10) and varies linearly during insertion from 1 to 1/10 of the final number of elements, so on average is n/2/10 which - removing constant factors - simplifies to n. In other words, each insertion is O(n).
Presumably the question intends to ask the time for a single lookup done after all elements are inserted: in that case you have the 10 linked lists of ~n/10 length, so the lookup will hash to one of those lists and then on average have to look half way along the list before finding the desired value: that's roughly n/20 elements searched, but as /20 is a constant factor it can be dropped, and we can say the average complexity is O(n).
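The fixed-size case can be checked with a small Python sketch. It is only a model under stated assumptions: integer keys (which CPython hashes to themselves, giving a uniform spread), chains as Python lists, and hypothetical helper names `build_fixed_table` and `lookup_cost`.

```python
def build_fixed_table(keys, buckets=10):
    """Separate chaining with a fixed number of buckets and no resizing."""
    table = [[] for _ in range(buckets)]
    for key in keys:
        table[hash(key) % buckets].append(key)  # append without a duplicate check
    return table

def lookup_cost(table, key):
    """Number of nodes inspected before `key` is found in its chain."""
    chain = table[hash(key) % len(table)]
    return chain.index(key) + 1

n = 1000
table = build_fixed_table(range(n))
chain_lengths = [len(chain) for chain in table]
average_cost = sum(lookup_cost(table, k) for k in range(n)) / n
print(chain_lengths)  # ten chains of about n/10 nodes each
print(average_cost)   # about n/20 nodes per successful lookup
```

Doubling n doubles both the chain lengths and the average lookup cost, which is the linear growth the O(n) bound describes.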
A hash dictionary that does not resize, what is the time complexity for n insertions and 1 lookup?
Well, we discussed that above with our hash table size stuck at 10.
A hash dictionary that resizes by 1 when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
Say the table has 100 buckets and 80 elements. You insert an 81st element and it resizes to 101 buckets; the load factor is then about .802. Should it immediately resize again, or wait until another insertion? Ignoring that, each resize operation involves visiting, rehashing (unless the elements or nodes cache the hash values), and "rewiring" the linked lists for all existing elements: that's O(s), where s is the size of the table at that point in time. You're doing that once or twice (depending on your answer to the "immediately resize again" question above) for s values from 1 to n, so s averages n/2, which simplifies to n. The insertion itself may or may not involve another iteration of the bucket's linked list (you could optimise to search while resizing). Regardless, the overall time complexity is O(n^2).
The lookup then takes O(1), because the resizing has kept the load factor below a constant amount (i.e. the average linked list length is very, very short, even ignoring the empty buckets).
A hash dictionary that resizes by doubling the table size when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
If you consider the resultant hash table with n elements inserted, about half the elements will have been inserted without needing to be rehashed, about a quarter will have been rehashed once, an eighth rehashed twice, a sixteenth rehashed 3 times, a 32nd rehashed 4 times, and so on. If you sum up that series (1/4 + 2/8 + 3/16 + 4/32 + 5/64 + 6/128 ...), it approaches 1 as n goes to infinity. In other words, the average amount of repeated rehashing/linking work done per element in the final table doesn't increase with n: it's constant. So the total time to insert is simply O(n). Then, because the load factor is kept below 0.8 (a constant rather than a function of n), the lookup time is O(1).
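The difference between growing by one bucket and doubling can be counted directly. The Python sketch below only tallies how many stored elements are moved by resizes; it assumes a resize is repeated until the load factor is back at or under the limit, and `rehash_work` is a hypothetical helper name, not part of any library.

```python
def rehash_work(n, grow, initial=10, max_load=0.8):
    """Count the elements moved by resizes while inserting n keys.

    `grow` maps the current bucket count to the next bucket count.
    """
    buckets, moved = initial, 0
    for size in range(1, n + 1):           # size = elements stored after this insert
        while size / buckets > max_load:   # resize until the load factor is acceptable
            buckets = grow(buckets)
            moved += size                  # every stored element is rehashed and relinked
    return moved

n = 20_000
by_one = rehash_work(n, lambda b: b + 1)
doubling = rehash_work(n, lambda b: b * 2)
print(by_one / n)    # grows in proportion to n, so total work is quadratic
print(doubling / n)  # stays below a small constant, so total work is linear
```

Running it for larger n shows the per-element figure for the grow-by-one table rising with n, while the figure for the doubling table stays bounded, matching O(n^2) and O(n) total insertion time respectively.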

How to limit count of items in Redis sorted sets

In my case I upload a lot of records to a Redis sorted set, but I only need to store the 10 highest-scored items. I have no way to influence the uploaded data (to sort and limit it before uploading).
Currently I just execute
ZREMRANGEBYRANK key 0 -11
after the upload finishes, but this approach doesn't look optimal (it's slow, and it would be better if Redis could handle it).
So does Redis provide something out of the box to limit the number of items in sorted sets?
No, Redis does not provide any such functionality apart from ZREMRANGEBYRANK.
There is a similar problem of keeping a Redis list at a constant size, say 10 elements, when elements are pushed from the left using LPUSH.
The solution lies in optimizing the pruning operation.
Truncate your sorted set once in a while, not every time.
Methods:
Run ZREMRANGEBYRANK with 1/5 probability on each insert, using a random integer.
Use a Redis pipeline or Lua scripting to do this; that even saves the two network calls that would otherwise happen on roughly every 5th insert.
This is optimal enough for the purpose mentioned.
Algorithm example:
ZADD key score1 member1
random_int = some random integer between 1 and 5
if random_int == 1: # trim the sorted set with 1/5 chance
    ZREMRANGEBYRANK key 0 -11
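The probabilistic trimming above can be tried out without a server. The following Python sketch emulates the sorted set with a plain dict; `zremrangebyrank_keep_top` and `add_with_lazy_trim` are hypothetical helpers that only mimic the effect of ZADD and ZREMRANGEBYRANK key 0 -11, and the seeded random generator is there just to make the run repeatable.

```python
import random

def zremrangebyrank_keep_top(zset, keep):
    """Emulate ZREMRANGEBYRANK key 0 -(keep+1): drop all but the `keep` highest scores."""
    if len(zset) > keep:
        # Sorted sets order by score, then by member; mirror that for ties.
        survivors = sorted(zset.items(), key=lambda kv: (kv[1], kv[0]))[-keep:]
        zset.clear()
        zset.update(survivors)

def add_with_lazy_trim(zset, member, score, keep=10, trim_chance=0.2, rng=random):
    zset[member] = score               # stands in for ZADD key score member
    if rng.random() < trim_chance:     # trim on roughly one insert in five
        zremrangebyrank_keep_top(zset, keep)

rng = random.Random(42)
zset, seen, peak = {}, {}, 0
for i in range(1000):
    member, score = f"item{i}", rng.randint(0, 10**6)
    seen[member] = score
    add_with_lazy_trim(zset, member, score, rng=rng)
    peak = max(peak, len(zset))

zremrangebyrank_keep_top(zset, 10)     # one final trim before reading the result
print(len(zset), peak)
```

The set temporarily grows past 10 members between trims, so a final trim (or a LIMIT on the read) is still needed before using the result. The members that survive are the same ones a trim after every insert would have kept.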

Redis: Is it still O(logN) to get top score member from sorted set?

My code needs to frequently get the top score member from a sorted set of Redis.
The time complexity for zrangebyscore is O(logN): http://redis.io/commands/zrangebyscore.
Since I only want to get the top score one, will Redis optimize it to return top score member in O(1) time?
If you're trying to get the top score so frequently that ZRANGE's complexity is an issue, cache the top score independently of the sorted set and you'll be able to get to it with O(1).
The Redis documentation doesn't describe such an optimization. The page you linked to for ZRANGEBYSCORE states (emphasis added):
Time complexity: O(log(N)+M) with N being the number of elements in
the sorted set and M the number of elements being returned. If M is
constant (e.g. always asking for the first 10 elements with LIMIT),
you can consider it O(log(N)).
Given this, it seems that the time complexity will not be O(1), unless of course your sorted set contains only one element. Rather, the time complexity will be dependent on the number of elements in the sorted set and will still be O(log(N)).
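The caching idea mentioned above could look roughly like the Python sketch below. It is a client-side model only: `TopScoreCache` is a hypothetical class, and it assumes scores are only ever added or increased. If members can be removed or their scores lowered, the cached value would have to be refreshed from the sorted set.

```python
class TopScoreCache:
    """Tracks the highest-scored member alongside writes to the sorted set."""

    def __init__(self):
        self.member = None
        self.score = float("-inf")

    def record(self, member, score):
        # Call this next to every write to the sorted set; O(1) per write.
        if score > self.score:
            self.member, self.score = member, score

    def top(self):
        # O(1) read, no round trip to the sorted set.
        return self.member, self.score

cache = TopScoreCache()
for member, score in [("a", 5), ("b", 9), ("c", 7)]:
    cache.record(member, score)
print(cache.top())
```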

Redis: fan out news feeds in list or sorted set?

I'm caching fan-out news feeds with Redis in the following way:
each feed activity is a key/value, like activity:id where the value is a JSON string of the data.
each news feed is currently a list, the key is feed:user:user_id and the list contains the keys of the relevant activities.
to retrieve a news feed I use for example: 'sort feed:user:user_id by nosort get * limit 0 40'
I'm considering changing the feed to a sorted set where the score is the activity's timestamp, this way the feed is always sorted by time.
I read http://arindam.quora.com/Redis-sorted-sets-and-lists-Pertaining-to-Newsfeed which recommends using lists because of the time complexity of sorted sets, but if I keep using lists I have to take care of the insert order myself:
inserting a past story requires iterating through the list to find the right index to push to (which can cause new problems in distributed environments).
Should I keep using lists or go for sorted sets?
Is there a way to retrieve the news feed from a sorted set in one step (like with the sort ... get * command for a list), or does it have to be ZRANGE followed by iterating through the results and getting each value?
Yes, sorted sets are very fast and powerful. They seem a much better match for your requirements than SORT operations. The time complexity is often misunderstood. O(log(N)) is very fast, and scales just fine. We use it for tens of millions of members in one sorted set. Retrieval and insertion are sub-millisecond.
Use ZRANGEBYSCORE key min max WITHSCORES [LIMIT offset count] to get your results.
Depending on how you store the timestamps as 'scores', ZREVRANGEBYSCORE might be better.
A small remark about the timestamps: Sorted set SCORES which don't need a decimal part should be using 15 digits or less. So the SCORE has to stay in the range -999999999999999 to 999999999999999. Note: These limits exist because Redis server actually stores the score (float) as a redis-string representation internally.
I therefore recommend this format, converted to Zulu Time: -20140313122802 for second-precision. You may add 1 digit for 100ms-precision, but no more if you want no loss in precision. It's still a float64 by the way, so loss of precision could be fine in some scenarios, but your case fits in the 'perfect precision' range, so that's what I recommend.
If your data expires within 10 years, you can also skip the three first digits (CCY of CCYY), to achieve .0001 second precision.
I suggest negative scores here, so you can use the simpler ZRANGEBYSCORE instead of the REV one. You can use -inf as the start score (minus infinity) and LIMIT 0 100 to get the top 100 results.
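As a small illustration of the negative-timestamp idea, the Python sketch below builds such a score and checks that it survives a round trip through a 64-bit float. The helper name `feed_score` and the sample dates are assumptions made for the example.

```python
from datetime import datetime, timezone

def feed_score(moment):
    """Negative CCYYMMDDhhmmss score in UTC, so newer activities sort first."""
    utc = moment.astimezone(timezone.utc)
    return -int(utc.strftime("%Y%m%d%H%M%S"))

older = feed_score(datetime(2014, 3, 13, 12, 28, 2, tzinfo=timezone.utc))
newer = feed_score(datetime(2014, 3, 14, 8, 0, 0, tzinfo=timezone.utc))

print(older, newer)
print(sorted([older, newer])[0] == newer)  # ascending order puts the newer activity first
print(int(float(older)) == older)          # 14 digits fit in a float64 without loss
```

Because the newest activity has the most negative score, a plain ascending range starting at -inf returns the feed newest first.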
Two sorted set members (or 'keys' but that's ambiguous since the sorted set is also a key in itself) may share a score, that's no problem, the results within an identical score are alphabetical.
Hope this helps, TW
Edit after chat
The OP wanted to collect data (using a ZSET) from different keys (GET/SET or HGET/HSET keys). JOIN can do that for you, ZRANGEBYSCORE can't.
The preferred way of doing this, is a simple Lua script. The Lua script is executed on the server. In the example below I use EVAL for simplicity, in production you would use SCRIPT EXISTS, SCRIPT LOAD and EVALSHA. Most client libraries have some bookkeeping logic built-in, so you don't upload the script each time.
Here's an example.lua:
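-- KEYS[1]: the sorted set; ARGV[1], ARGV[2]: min and max score; ARGV[3]: max number of members
local r={}
local zkey=KEYS[1]
local a=redis.call('zrangebyscore', zkey, ARGV[1], ARGV[2], 'withscores', 'limit', 0, ARGV[3])
-- a is a flat list: member1, score1, member2, score2, ...
for i=1,#a,2 do
r[i]=a[i+1] -- the score
r[i+1]=redis.call('get', a[i]) -- the value stored under the member's key
end
return r
You use it like this (raw example, not coded for performance). Note the double quotes around $(cat example.lua), so that the shell expands it:
redis-cli -p 14322 set activity:1 act1JSON
redis-cli -p 14322 set activity:2 act2JSON
redis-cli -p 14322 zadd feed 1 activity:1
redis-cli -p 14322 zadd feed 2 activity:2
redis-cli -p 14322 eval "$(cat example.lua)" 1 feed '-inf' '+inf' 100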
Result:
1) "1"
2) "act1JSON"
3) "2"
4) "act2JSON"

Calculating Top 10 Results in Storm

I am reading sentences from redis Server and counting the occurrence of each word. Now I want to calculate the top 10 words based on count. I have one Spout to read the sentences from Redis Server, one Bolt that breaks sentences into words and one Bolt that counts the words.
What should be my approach in finding Top 10 Words based on count?
Say you need the top ten for the last X minutes: configure your bolt to receive a tick tuple every X minutes, and keep counting words in the bolt until then. On encountering a tick tuple, emit the top ten items. You can keep the counters in an in-memory tree map (depending on the use case and data size).
Now say you need an all-time top 10 over a large data set: maintain the counts in a Redis data structure and emit the top 10 every Y seconds, depending on your needs.
For tick tuples, refer to Michael's blog: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
Write the word frequencies to Redis using a sorted set. For each word that Storm processes, increment that word's counter in Redis using ZINCRBY.
The members of a sorted set are ordered by score, so you can retrieve the top ten with ZREVRANGE like this:
ZREVRANGE myset 0 9 WITHSCORES
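To see what the ZINCRBY and ZREVRANGE combination computes, here is a minimal in-memory Python model. `WordScores` is a hypothetical class that imitates only the two commands used above (non-negative, inclusive ranks only); it is not a Redis client and leaves out the Storm wiring.

```python
from collections import defaultdict

class WordScores:
    """In-memory stand-in for a sorted set of word counts."""

    def __init__(self):
        self.scores = defaultdict(float)

    def zincrby(self, word, amount=1):
        # Like ZINCRBY: creates the member if it is missing, then adds to its score.
        self.scores[word] += amount
        return self.scores[word]

    def zrevrange(self, start, stop):
        # Like ZREVRANGE ... WITHSCORES: highest score first, `stop` inclusive.
        ordered = sorted(self.scores.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
        return ordered[start:stop + 1]

counts = WordScores()
for word in "the quick fox and the lazy dog and the cat".split():
    counts.zincrby(word)

top_two = counts.zrevrange(0, 1)
print(top_two)
print(len(counts.zrevrange(0, 9)))  # at most ten entries, fewer if there are fewer words
```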