Using REDIS sorted sets, what is the time complexity certain special operations? - redis

On REDIS documentation, it states that insert and update operations on sorted sets are O(log(n)).
On this question they specify more details about the underlying data structure, the skip list.
However there are a few special cases that depend on the REDIS implementation with which I'm not familiar.
adding at the head or tail of the sorted set will probably not be a O(log(n)) operation, but O(1), right? this question seems to agree with reservations.
updating the score of a member even if the ordering doesn't change is still O(log(n)) either because you take the element out and insert it again with the slightly different score, or because you have to check that the ordering doesn't change and so the difference is only in constant operations between insert or update score. right? I really hope I'm wrong in this case.
Any insights will be most welcome.

Note: skip lists will be used once the list grows above a certain size (max_ziplist_entries), below that size a zip list is used.
Re. 1st question - I believe that it would still be O(log(n)) since a skip list is a type of a binary tree so there's no assurance where the head/tail nodes are
Re. 2nd question - according to the source, changing the score is implemented with a removing and readding the member: https://github.com/antirez/redis/blob/209f266cc534471daa03501b2802f08e4fca4fe6/src/t_zset.c#L1233 & https://github.com/antirez/redis/blob/209f266cc534471daa03501b2802f08e4fca4fe6/src/t_zset.c#L1272

In Skip List, when you insert a new element in head or tail, you still need to update O(log n) levels. The previous head or tail can have O(log n) levels and each may have pointers which need to be updated.
Already answered by #itamar-haber

Related

Why is Hash Table insertion time complexity worst case is not N log N

Looking at the fundamental structure of hash table. We know that it resizes WRT load factor or some other deterministic parameter. I get that if the resizing limit is reached within an insertion we need to create a bigger hash table and insert everything there. Here is the thing which I don't get.
Let's consider a hash table where each bucket contains an AVL - balanced BST. If my hash function returns the same index for every key then I would store everything in the same AVL tree. I know that this hash function would be a really bad function and would not be used but I'm doing a worst case scenario here. So after some time let's say that resizing factor has been reached. So in order to resize I created a new hash table and tried to insert every old elements in my previous table. Since the hash function mapped everything back into one AVL tree, I would need to insert all the N elements into the same AVL. N insertion on an AVL tree is N logN. So why is the worst case of insertion for hash tables considered O(N)?
Here is the proof of adding N elements into Avl three is N logN:
Running time of adding N elements into an empty AVL tree
In short: it depends on how the bucket is implemented. With a linked list, it can be done in O(n) under certain conditions. For an implementation with AVL trees as buckets, this can indeed, wost case, result in O(n log n). In order to calculate the time complexity, the implementation of the buckets should be known.
Frequently a bucket is not implemented with an AVL tree, or a tree in general, but with a linked list. If there is a reference to the last entry of the list, appending can be done in O(1). Otherwise we can still reach O(1) by prepending the linked list (in that case the buckets store data in reversed insertion order).
The idea of using a linked list, is that a dictionary that uses a reasonable hashing function should result in few collisions. Frequently a bucket has zero, or one elements, and sometimes two or three, but not much more. In that case, a simple datastructure can be faster, since a simpler data structure usually requires less cycles per iteration.
Some hash tables use open addressing where buckets are not separated data structures, but in case the bucket is already taken, the next free bucket is used. In that case, a search will thus iterate over the used buckets until it has found a matching entry, or it has reached an empty bucket.
The Wikipedia article on Hash tables discusses how the buckets can be implemented.

Redis bitmap split key division strategy

I'm grabbing and archiving A LOT of data from the Federal Elections Commission public data source API which has a unique record identifier called "sub_id" that is a 19 digit integer.
I'd like to think of a memory efficient way to catalog which line items I've already archived and immediately redis bitmaps come to mind.
Reading the documentation on redis bitmaps indicates a maximum storage length of 2^32 (4294967296).
A 19 digit integer could theoretically range anywhere from 0000000000000000001 - 9999999999999999999. Now I know that the datasource in question does not actually have 99 quintillion records, so they are clearly sparsely populated and not sequential. Of the data I currently have on file the maximum ID is 4123120171499720404 and a minimum value of 1010320180036112531. (I can tell the ids a date based because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't sus out the rest of the pattern.)
If I wanted to store which line items I've already downloaded would I need 2328306436 different redis bitmaps? (9999999999999999999 / 4294967296 = 2328306436.54). I could probably work up a tiny algorithm determine given an 19 digit idea to divide by some constant to determine which split bitmap index to check.
There is no way this strategy seems tenable so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
A Bloom Filter such as RedisBloom will be an optimal solution (RedisBloom can even grow if you miscalculated your desired capacity).
After you BF.CREATE your filter, you pass to BF.ADD an 'item' to be inserted. This item can be as long as you want. The filter uses hash functions and modulus to fit it to the filter size. When you want to check if the item was already checked, call BF.EXISTS with the 'item'.
In short, what you describe here is a classic example for when a Bloom Filter is a great fit.
How many "items" are there? What is "A LOT"?
Anyway. A linear approach that uses a single bit to track each of the 10^19 potential items requires 1250 petabytes at least. This makes it impractical (atm) to store it in memory.
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the ids ids are not sequential and very spread, keep tracking of which one you processed using a bitmap is not the best option since it would waste lot of memory.
However, it is hard to point the best solution without knowing the how many distinct sub_ids your data set has. If you are talking about a few 10s of millions, a simple set in Redis may be enough.

Redis ZRANGEBYLEX command complexity

According documentation section for ZRANGEBYLEX command, there is following information. If store keys in ordered set with zero score, later keys can be retrieved with lexicographical order. And ZRANGEBYLEX operation complexity will be O(log(N)+M), where N is total elements count and M is result set size. Documentation has some information about string comparation, but tells nothing about structure, in which elements will be stored.
But after some experiments and reading source code, it's probably what ZRANGEBYLEX operation has a linear time search, when every element in ziplist will be matched against request. If so, complexity will be more larger than described above - about O(N), because every element in ziplist will be scanned.
After debugging with gdb, it's clean that ZRANGEBYLEX command is implemented in genericZrangebylexCommand function. Control flow continues at eptr = zzlFirstInLexRange(zl,&range);, so major work for element retrieving will be performed at zzlFirstInLexRange function. All namings and following control flow consider that ziplist structure is used, and all comparation with input operands are done sequentially element by element.
Inspecting memory with analysis after inserting well-known keys in redis store, it seems that ZSET elements are really stored in ziplist - byte-per-byte comparation with gauge confirm it.
So question - how can documentation be wrong and propagate logarithmic complexity where linear one appears? Or maybe ZRANGEBYLEX command works slightly different? Thanks in advance.
how can documentation be wrong and propagate logarithmic complexity where linear one appears?
The documentation has been wrong on more than a few occasions, but it is an ongoing open source effort that you can contribute to via the repository (https://github.com/antirez/redis-doc).
Or maybe ZRANGEBYLEX command works slightly different?
Your conclusion is correct in the sense that Sorted Set search operations, whether lexicographical or not, exhibit linear time complexity when Ziplists are used for encoding them.
However.
Ziplists are an optimization that prefers CPU to memory, meaning it is meant for use on small sets (i.e. low N values). It is controlled via configuration (see the zset-max-ziplist-entries and zset-max-ziplist-value directives), and once the data grows above the specified thresholds the ziplist encoding is converted to a skip list.
Because ziplists are small (little Ns), their complexity can be assumed to be constant, i.e. O(1). On the other hand, due to their nature, skip lists exhibit logarithmic search time. IMO that means that the documentation's integrity remains intact, as it provides the worst case complexity.

Design a highly optimized datastructure to perform three operations insert, delete and getRandom

I just had a software interview. One of the questions was to design any datastructure with three methods insert, delete and getRandom in a highly optimized way. The interviewer asked me to think of a combination of datastructures to design a new one. Insert can be designed anyway but for random and delete i need to get the position of specific element. He gave me a hint to think about the datastructure which takes minimum time for sorting.
Any answer or discussion is welcomed....
Let t be the type of the elements you want to store in the datastructure.
Have an extensible array elements containing all the elements in no particular order. Have a hashtable indices that maps elements of type t to their position in elements.
Inserting e means
add e at the end of elements (i.e. push_back), get its position i
insert the mapping (e,i) into `indices
deleting e means
find the position i of e in elements thanks to indices
overwrite e with the last element f of elements
update indices: remove the mapping (f,indices.size()) and insert (f,i)
drawing one element at random (leaving it in the datastructure, i.e. it's peek, not pop) is simply drawing an integer i in [0,elements.size()[ and returning elements[i].
Assuming the hashtable is well suited for your elements of type t, all three operations are O(1).
Be careful about the cases where there are 0 or 1 element in the datastructure.
A tree might work well here. Order log(n) insert and delete, and choose random could also be log(n): start at the root node and at each junction choose a child at random (weighted by the total number of leaf nodes per child) until you reach a leaf.
The data structure which takes the least time for sorting is sorted array.
get_random() is binary search, so O(log n).
insert() and delete() involve adding/removing the element in question and then resorting, which is O(n log n), e.g. horrendous.
I think his hint was poor. You may have been in a bad interview.
What I feel is that you can use some balaced version of tree like Red-Black trees. This will give O(log n) insertion and deletion time.
For getting random element, may be you can have a additional hash table to keep track of elements which are in the tree structure.
It might be Heap (data structure)

LIST alternative in redis

Redis.io
The main features of Redis Lists from the point of view of time
complexity is the support for constant time insertion and deletion of
elements near the head and tail, even with many millions of inserted
items. Accessing elements is very fast near the extremes of the list
but is slow if you try accessing the middle of a very big list, as it
is an O(N) operation.
what is the LIST alternative when the data is too high and writes are lesser than Reads
This is something I'd definitely benchmark before doing, but if you're really hitting a performance issue accessing items in the middle of the list, there are a couple of alternatives that really depend on your use case.
Don't make a list so big, age out/trim pieces that don't matter any more.
Memoize hot sections of the list. If a particular paginated range is being requested much more often than others, make that it's own list. Check if it exists already, and if it doesn't create a subset of your list in the paginated range.
Bucket your list from the beginning into "manageable sizes" (for whatever your definition of managable is). If a list is purely additive (no removal from the list), you could use the modulus index of an item as part of the key so that your list is stored in smaller buckets. Ex: key = "your_key_name_" + index % 100000