I have read that the insertion time complexity of skip lists is O(log n) with very high probability, but O(n) in the worst case. However, the documentation for Redis ZADD at https://redis.io/commands/zadd says: O(log(N)) for each item added, where N is the number of elements in the sorted set.
If Redis uses skip lists, then shouldn't ZADD be O(n) in the worst case?
PS: Sorry, I posted the same question earlier but didn't get any response, so I deleted it and am posting it again.
Redis' skip list implementation is a modification of the one described in William Pugh's paper. So, in the worst case, the time complexity is O(n). The AVERAGE time complexity of ZADD is O(log(n)).
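To make the "O(log n) on average, O(n) worst case" point concrete, here is a minimal, simplified skip-list insert in Python. This is not Redis' actual C implementation; the constants are only modeled after Redis' zskiplist, which promotes nodes with probability 1/4 (the maximum-level constant has varied between versions). Because each node's level is chosen by coin flips, a well-balanced structure is only probabilistic: nothing prevents every node from ending up at level 1, which is the degenerate O(n) case.

```python
import random

class Node:
    def __init__(self, member, score, level):
        self.member = member
        self.score = score
        self.forward = [None] * level  # one forward pointer per level

class SkipList:
    MAX_LEVEL = 32   # modeled after Redis' zskiplist; the exact constant differs by version
    P = 0.25         # Redis promotes a node to the next level with probability 1/4

    def __init__(self):
        self.level = 1
        self.head = Node(None, float("-inf"), self.MAX_LEVEL)

    def random_level(self):
        # Expected height is O(log n), but a run of unlucky coin flips can
        # leave every node at level 1, which degrades search to O(n).
        lvl = 1
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, member, score):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        # Walk from the top level down, remembering the rightmost node per level.
        # This search is the part that is O(log n) on average, O(n) worst case.
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].score < score:
                node = node.forward[i]
            update[i] = node
        lvl = self.random_level()
        if lvl > self.level:
            self.level = lvl
        new = Node(member, score, lvl)
        for i in range(lvl):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new
```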
The documentation gives the time complexity of the HGETALL command as O(N), where N is the size of the hash. In many places where HGETALL comes up, users are warned about its time complexity, for example in this answer, without going into what HGETALL does under the hood and why the complexity is what it is. So why is this O(N)? Does it have something to do with how Redis stores hashes, is it networking, or is it just CPU-bound? HGET has a time complexity of O(1) and does not depend on size in any way, so could I just store my hash as one value, concatenated with some separator, to improve performance?
Redis stores a Hash as a hash table in memory. Getting a single entry from any hash table is, by its very nature, an O(1) operation. HGETALL has to get all of the entries in the hash table, one by one, so it's O(N). If you coded your own hash table and didn't use Redis, it would work the same way. This is just how hash tables work.
Serializing your hash to a single string and then saving that string will not save you anything. You're just replacing an O(N) operation on the backend with one in your own code.
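For illustration, here is a minimal sketch of both approaches, assuming the redis-py client, a locally running Redis server, and made-up key names. Either way, something has to touch all N fields:

```python
import redis

r = redis.Redis()

# Native hash: HSET each field, HGETALL walks every entry -> O(N) on the server.
r.hset("user:1", mapping={"name": "Ada", "lang": "en", "plan": "pro"})
fields = r.hgetall("user:1")  # dict of all fields

# "Concatenated" alternative: GET is O(1) on the server, but your code still
# does O(N) work to split the string back into fields.
r.set("user:1:flat", "name=Ada|lang=en|plan=pro")
flat = r.get("user:1:flat").decode()
fields2 = dict(pair.split("=", 1) for pair in flat.split("|"))
```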
The thing I always find missing in discussions of time complexity is that it's about scaling, not time. People talk about things being "slower" and "faster", but it's not about milliseconds. An O(1) operation is "constant time", which doesn't mean fast; it just means it always takes the same amount of time, every time. A function can be O(1) and still be slower than some other function that is O(N) with a billion entries.
In the case of Redis, HGETALL is really fast and O(N). Unless you have thousands of fields in your Hash, you probably don't need to worry about it.
A friend and I were discussing the design of a hash set that uses the mod function as its hashing function.
The time complexity of such an implementation appears to be O(N/k), where N is the total number of items stored in the set and k is the total number of buckets. This time complexity assumes that the items are distributed evenly among the buckets, so the average bucket size is N/k.
I confused myself because I believe the time complexity should be O(N), since time complexity is about worst-case performance. Here the worst case could be that all N items go into the same bucket and the value we are looking for is at the end of that bucket. Please help me here.
You're right that the worst case is all items going into one bucket. The items being evenly distributed is the best case. That said, O(N/k) is the same as O(N) if k is held constant, since constants can be neglected. I would not expect k to be part of the input to a lookup anyway. If k can vary, then it is different, but the worst case is still O(N).
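Here is a small Python sketch of the mod-based hash set under discussion, showing both the N/k average case and the all-in-one-bucket worst case (the class and names are just for illustration):

```python
class ModHashSet:
    def __init__(self, k):
        self.k = k
        self.buckets = [[] for _ in range(k)]

    def add(self, x):
        bucket = self.buckets[x % self.k]
        if x not in bucket:        # linear scan of one bucket
            bucket.append(x)

    def contains(self, x):
        return x in self.buckets[x % self.k]   # linear scan of one bucket

# Evenly spread keys: each bucket holds roughly N/k items (the average case).
good = ModHashSet(k=8)
for x in range(100):
    good.add(x)

# Keys 0, k, 2k, 3k, ... all land in bucket 0, so a lookup may scan all
# N items: the O(N) worst case described above.
bad = ModHashSet(k=8)
for x in range(0, 800, 8):
    bad.add(x)
```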
According to the documentation for the ZRANGEBYLEX command, if keys are stored in a sorted set with a score of zero, they can later be retrieved in lexicographical order, and the complexity of the ZRANGEBYLEX operation is O(log(N)+M), where N is the total number of elements and M is the size of the result set. The documentation has some information about string comparison, but says nothing about the structure in which the elements are stored.
But after some experiments and reading the source code, it looks like ZRANGEBYLEX performs a linear search, where every element in the ziplist is matched against the request. If so, the complexity is larger than described above, roughly O(N), because every element in the ziplist is scanned.
After debugging with gdb, it's clear that the ZRANGEBYLEX command is implemented in the genericZrangebylexCommand function. Control flow continues at eptr = zzlFirstInLexRange(zl,&range);, so the major work of retrieving elements is performed in the zzlFirstInLexRange function. All the naming and the subsequent control flow assume that the ziplist structure is used, and all comparisons with the input operands are done sequentially, element by element.
Inspecting memory after inserting well-known keys into the Redis store, it appears that the ZSET elements really are stored in a ziplist; a byte-by-byte comparison against a reference confirms it.
So the question is: how can the documentation be wrong and claim logarithmic complexity where the complexity is actually linear? Or does the ZRANGEBYLEX command work slightly differently? Thanks in advance.
how can the documentation be wrong and claim logarithmic complexity where the complexity is actually linear?
The documentation has been wrong on more than a few occasions, but it is an ongoing open source effort that you can contribute to via the repository (https://github.com/antirez/redis-doc).
Or does the ZRANGEBYLEX command work slightly differently?
Your conclusion is correct in the sense that Sorted Set search operations, whether lexicographical or not, exhibit linear time complexity when Ziplists are used for encoding them.
However.
Ziplists are an optimization that prefers CPU to memory, meaning it is meant for use on small sets (i.e. low N values). It is controlled via configuration (see the zset-max-ziplist-entries and zset-max-ziplist-value directives), and once the data grows above the specified thresholds the ziplist encoding is converted to a skip list.
Because ziplists are small (little Ns), their complexity can be assumed to be constant, i.e. O(1). On the other hand, due to their nature, skip lists exhibit logarithmic search time. IMO that means that the documentation's integrity remains intact, as it provides the worst case complexity.
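As a quick way to see the encoding switch for yourself, here is a sketch assuming redis-py and the default thresholds (zset-max-ziplist-entries = 128, zset-max-ziplist-value = 64); note that on Redis 7 and later the small encoding is reported as listpack rather than ziplist:

```python
import redis

r = redis.Redis()
r.delete("lexset")

# Below the entry threshold: the compact encoding is used.
for i in range(100):
    r.zadd("lexset", {f"key:{i:04d}": 0})
print(r.object("encoding", "lexset"))   # b'ziplist' (b'listpack' on Redis >= 7)

# Above the threshold: Redis converts the sorted set to the skiplist encoding,
# and range lookups become logarithmic.
for i in range(100, 200):
    r.zadd("lexset", {f"key:{i:04d}": 0})
print(r.object("encoding", "lexset"))   # b'skiplist'
```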
You are given an array of N integers. You are asked to find the largest element which appears an even number of times in the array. What is the time complexity of your algorithm? Can you do this without sorting the entire array?
You could do it in O(n log n) with a lookup table. For each element in the list, look it up in the table: if it is missing, insert a key-value pair with the key being the element and the value being the number of appearances (starting at one); if it is already present, increment the count. At the end, loop through the table in O(n) and look for the largest key with an even value.
In theory, for an ideal hash table, a lookup operation is O(1), so you can find and/or insert all n elements in O(n) time, making the total complexity O(n). In practice, however, you will have trouble with space allocation (you need much more space than the size of the data set) and with collisions (which is why you need that extra space). This makes O(1) lookups hard to achieve; in the worst case (though it is unlikely) a lookup can cost as much as O(n), making the total complexity O(n^2).
Instead, you can be safer with a tree-based table, that is, one where the keys are stored in a binary tree. Lookup and insertion operations are both O(log n) in this case, provided that the tree is balanced; there is a wide range of tree structures that help ensure this, e.g. red-black trees, AVL trees, splay trees, B-trees, etc. (Google is your friend). This makes the total complexity a guaranteed O(n log n).
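For reference, here is a minimal Python sketch of the counting approach described above, using a dict-based table (Counter), which gives expected O(n) total time; swapping in a tree-based map would give the guaranteed O(n log n) instead:

```python
from collections import Counter

def largest_even_count(nums):
    counts = Counter(nums)                      # one pass: count appearances
    evens = [x for x, c in counts.items() if c % 2 == 0]
    return max(evens) if evens else None        # final scan of the table

# 5 appears twice, 9 appears twice, 3 appears three times -> answer is 9.
print(largest_even_count([5, 3, 5, 9, 9, 3, 3]))
```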
I've been using SINTER, which computes the intersection of unordered integer sets. Is there a faster way of doing the intersection, given that I wouldn't mind sorting beforehand (or performing any other preprocessing)?
EDIT:
Found some info [here][1]
EDIT2:
Bounty for specific answer: is zinterstore faster than sinter? Benchmarking would be cool too.
Fast answer
In theory the intersection of lists has complexity O(N) where N is the cardinality of the smallest set.
Use a SET (SINTER/SINTERSTORE) if you have sparse data or need to keep RAM usage low (O(N*M)), and use a bitset (SETBIT/BITOP) in all other cases (O(N)), as in the link from your first edit.
BITSET
The Redis bit operations have complexity O(N), where N is the size of the smallest key. They also have excellent execution speed in practice because they are CPU-cache friendly (look at the bitops.c sources). So this can be the absolute winner if your data is not sparse or if memory is not important to you (the Redis documentation on strings has more details).
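A minimal sketch of the bitset approach, assuming redis-py and made-up key names: each set is represented as a bitmap where bit i is set iff the integer i is a member, and the intersection is a single BITOP AND on the server:

```python
import redis

r = redis.Redis()
r.delete("set:a", "set:b", "set:a_and_b")

for i in [1, 5, 9, 42]:
    r.setbit("set:a", i, 1)
for i in [5, 7, 42, 100]:
    r.setbit("set:b", i, 1)

# Server-side AND over the two bitmaps: O(N) in the length of the strings.
r.bitop("AND", "set:a_and_b", "set:a", "set:b")

members = [i for i in range(101) if r.getbit("set:a_and_b", i)]
print(members)  # [5, 42]
```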
ZSET vs SET (zinterstore vs sinter)
Do not use a ZSET (ZINTERSTORE) if you have a plain list of integers and want to intersect them. A sorted set in Redis is a more complex structure: its members are stored with either the ziplist or the skiplist internal encoding, and the skiplist encoding keeps the sorted scores in the skip list while the members themselves live in an additional structure. ZSET intersection is therefore always more complicated than SET intersection:
ZSET intersection: O(N * K) + O(M * log(M)) worst case with N being the smallest input sorted set, K being the number of input sorted sets and M being the number of elements in the resulting sorted set.
SET intersection: O(N * M) worst case, where N is the cardinality of the smallest set and M is the number of sets. This is essentially the theoretical minimum.
SET uses the dict or intset data structure to store its data, and in your case (unordered integer sets) intset would be used. Intset is the most memory-efficient structure in Redis, and it has the best read speed compared to ziplist (a specially encoded doubly linked list; more about this data structure's internals can be found elsewhere).
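For the bounty question, a rough benchmarking sketch along these lines (assuming redis-py, a local server, and made-up key names) can give you concrete numbers on your own data; per the reasoning above, with plain integer sets you should expect SINTER to come out ahead of ZINTERSTORE:

```python
import time
import redis

r = redis.Redis()
r.delete("s:a", "s:b", "z:a", "z:b", "z:out")

nums_a = list(range(0, 50_000))
nums_b = list(range(25_000, 75_000))

# Same data loaded both as plain sets and as zero-score sorted sets.
r.sadd("s:a", *nums_a)
r.sadd("s:b", *nums_b)
r.zadd("z:a", {str(n): 0 for n in nums_a})
r.zadd("z:b", {str(n): 0 for n in nums_b})

t0 = time.perf_counter()
r.sinter("s:a", "s:b")
t1 = time.perf_counter()
r.zinterstore("z:out", ["z:a", "z:b"])
t2 = time.perf_counter()

print(f"SINTER:      {t1 - t0:.4f}s")
print(f"ZINTERSTORE: {t2 - t1:.4f}s")
```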