Find successor and predecessor in O(1) using a modified AVL tree

So I got this task: suggest an AVL tree with the capability of performing the following two operations in O(1):
Successor: takes O(1) and returns the successor.
Predecessor: takes O(1) and returns the predecessor.
I am kind of stuck after trying to think of a solution for this.
Any ideas?

I think it's not so much about modifying the tree or implementing complicated predecessor/successor-finding logic that can physically traverse the tree in constant time. It's simply a matter of adding succ and pred links to every node in the tree, using them in your predecessor(n) and successor(n) functions, and maintaining those links across insert and delete operations.
The only thing you have to be careful about, to ensure successor(n) and predecessor(n) are actually O(1), is that the cost of maintaining the succ/pred links does not exceed the complexity of the operations that require the updates. For example, because inserting a new node into a balanced tree is an O(lg n) operation, updating the successor/predecessor links after an insert cannot exceed O(lg n) (otherwise your subsequent lookups would effectively be paying for amortized work greater than O(1)).
The steps you have to take on each insert and delete operation to maintain the pred and succ links are actually pretty simple. Thankfully the complexity the AVL re-balance operation introduces doesn't really affect them: re-balancing changes the structure of the tree, but it doesn't change the tree's ordering, and therefore doesn't change predecessor/successor relations.
Finding the successor and predecessor of a given node in a balanced BST (which an AVL tree is guaranteed to be) is O(lg n), and the logic is no different from what you would do in a regular BST. So, after inserting a new node n into your AVL tree, simply find n's successor and predecessor nodes and update the appropriate links:
insert(n) {
    ...
    # normal AVL insert and rebalance logic
    ...
    predecessor, successor = findPredAndSucc(n)
    if (predecessor) predecessor.succ = n
    if (successor) successor.pred = n
    n.pred = predecessor
    n.succ = successor
}
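For reference, findPredAndSucc is just the standard O(h) walk from the root; a rough sketch in Python, assuming nothing more than key/left/right fields on each node:

def find_pred_and_succ(root, key):
    # Standard BST walk: the last node we stepped right from is the
    # predecessor candidate, the last node we stepped left from is the
    # successor candidate. O(h) = O(lg n) in an AVL tree.
    pred = succ = None
    node = root
    while node is not None:
        if key < node.key:
            succ = node
            node = node.left
        elif key > node.key:
            pred = node
            node = node.right
        else:
            # found the node itself: predecessor is the max of its left
            # subtree, successor is the min of its right subtree (if any)
            if node.left is not None:
                pred = node.left
                while pred.right is not None:
                    pred = pred.right
            if node.right is not None:
                succ = node.right
                while succ.left is not None:
                    succ = succ.left
            break
    return pred, succ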
When deleting n, you'll also have to find n's predecessor and successor and update links as in:
delete(n) {
    # save these values first so they aren't lost when n is deleted
    predecessor, successor = findPredAndSucc(n)
    ...
    # normal AVL delete and rebalance logic
    ...
    if (predecessor) predecessor.succ = successor
    if (successor) successor.pred = predecessor
}
I'm slightly conflating the node n and the value of n (if inserting/deleting by value you would have to traverse the tree to find n first), but you get the idea.
Following this approach, after every insert and delete each node has direct links to its predecessor and successor nodes. Because of the AVL tree's self-balancing property, the time spent finding the nodes whose links need updating after an insert or delete adds no extra complexity to those operations: finding the predecessor and successor nodes takes at most two O(h) = O(lg n) traversals, the same complexity as insert and delete themselves. The successor(n) and predecessor(n) functions then just follow the succ and pred links we set, so they are O(1) overall.
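With the links maintained this way, the query functions reduce to a single pointer read each; a minimal sketch (the pred/succ field names are simply whatever you chose for the links):

def successor(n):
    return n.succ     # None if n holds the maximum key

def predecessor(n):
    return n.pred     # None if n holds the minimum key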

Related

Deletion operation in Binary Search Tree: successor or predecessor

Delete is the most complex operation in a Binary Search Tree, since it needs to consider several possibilities:
The deleted node is a leaf node
The deleted node has only one child
The deleted node has both a left and a right child
The first two cases are easy. But for the third one, in the many books and documents I've read, the solution is: find the node with the minimum value in the right subtree, replace the deleted node's value with it, and then delete that node from the right subtree.
I can fully understand this solution.
In fact, the node with the minimum value in the right subtree is generally called the successor of the node. So the above solution replaces the deleted node's value with its successor's value, and then deletes the successor node from the subtree.
On the other hand, the predecessor of a node is the node with the maximum value in its left subtree.
I think replacing the deleted node with its predecessor should also work.
For instance, take the example used in the book "Data Structures and Algorithm Analysis in C".
If we want to delete node "2", we replace it with "3", which is "2"'s successor.
I think replacing "2" with "1", which is "2"'s predecessor, should also work. Right? But the books don't talk about it even a bit.
So is there any convention here? And if, after one deletion operation, there are two results that are both correct, how do we keep things consistent?
Edit:
An update based on what I've since learned about this issue. In fact, the book "Data Structures and Algorithm Analysis in C" does discuss it. In summary:
First, both methods (based on the successor or the predecessor) should work.
However, if you repeat O(n^2) insert/delete pairs on the tree and every deletion uses the successor, the tree becomes unbalanced, because the algorithm makes the left subtrees deeper than the right. The book illustrates the idea with two images.
It then introduces the concept of a balanced search tree, such as the AVL tree.
As far as the theory goes, your argument seems correct to me: one can take either the predecessor or the successor.
In practice, I would think the best decision is to keep the tree balanced, and to switch between the two options depending on which one keeps the depth lowest.
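To make that concrete, here is a rough sketch of just the two-child case (Python; delete_node is a hypothetical helper that handles the easy zero/one-child cases, and a real implementation would use stored heights, as an AVL tree does, rather than recomputing them). Choosing the replacement from the deeper side is a heuristic, not a balance guarantee like AVL rebalancing:

def height(n):
    # O(n) when computed like this; shown only to keep the sketch short
    return 0 if n is None else 1 + max(height(n.left), height(n.right))

def delete_two_child_case(node):
    if height(node.left) > height(node.right):
        repl = node.left                  # predecessor: max of the left subtree
        while repl.right is not None:
            repl = repl.right
    else:
        repl = node.right                 # successor: min of the right subtree
        while repl.left is not None:
            repl = repl.left
    node.key = repl.key                   # copy the replacement's value up
    delete_node(repl)                     # repl has at most one child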

Why is the worst-case time complexity of hash table insertion not O(N log N)?

Looking at the fundamental structure of a hash table, we know that it resizes with respect to the load factor or some other deterministic parameter. I get that if the resizing limit is reached during an insertion, we need to create a bigger hash table and insert everything into it. Here is the thing I don't get.
Let's consider a hash table where each bucket contains an AVL-balanced BST. If my hash function returns the same index for every key, then I would store everything in the same AVL tree. I know this hash function would be a really bad one and would never be used, but I'm considering a worst-case scenario here. So after some time, let's say the resizing factor has been reached. In order to resize, I create a new hash table and try to insert every old element from my previous table. Since the hash function maps everything back into one AVL tree, I would need to insert all N elements into the same AVL tree. N insertions into an AVL tree take O(N log N). So why is the worst case of insertion for hash tables considered O(N)?
Here is the proof that adding N elements into an AVL tree is O(N log N):
Running time of adding N elements into an empty AVL tree
In short: it depends on how the bucket is implemented. With a linked list, it can be done in O(n) under certain conditions. With AVL trees as buckets, this can indeed, in the worst case, result in O(n log n). In order to calculate the time complexity, the implementation of the buckets has to be known.
Frequently a bucket is not implemented with an AVL tree, or a tree in general, but with a linked list. If there is a reference to the last entry of the list, appending can be done in O(1). Otherwise we can still reach O(1) by prepending to the linked list (in that case the buckets store data in reverse insertion order).
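As a rough illustration of the prepend idea, here is a minimal sketch (it ignores duplicate keys, deletion and resizing, and doesn't correspond to any particular library):

class _Entry:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

class ChainedHashMap:
    def __init__(self, capacity=16):
        self.buckets = [None] * capacity

    def insert(self, key, value):
        i = hash(key) % len(self.buckets)
        # prepend to the bucket's linked list: O(1) no matter how long the
        # chain already is (a lookup in that chain is still O(chain length))
        self.buckets[i] = _Entry(key, value, self.buckets[i])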
The idea of using a linked list is that a dictionary with a reasonable hash function should produce few collisions. Frequently a bucket has zero or one elements, sometimes two or three, but not many more. In that case, a simple data structure can be faster, since a simpler data structure usually requires fewer cycles per iteration.
Some hash tables use open addressing, where buckets are not separate data structures; if a bucket is already taken, the next free bucket is used. In that case, a search iterates over the used buckets until it finds a matching entry or reaches an empty bucket.
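A minimal sketch of that probing loop (linear probing over a plain list whose slots are either None or (key, value) pairs; deletion tombstones and resizing are left out):

def open_addressing_find(table, key):
    n = len(table)
    i = hash(key) % n
    for _ in range(n):              # at most n probes
        slot = table[i]
        if slot is None:
            return None             # hit an empty bucket: key is not present
        if slot[0] == key:
            return slot[1]          # found a matching entry
        i = (i + 1) % n             # otherwise try the next bucket
    return None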
The Wikipedia article on Hash tables discusses how the buckets can be implemented.

Neo4j scalability and indexing

An argument in favor of graph dbms with native storage over relational dbms made by neo4j (also in the neo4j graph databases book) is that "index-free adjacency" is the most efficient means of processing data in a graph (due to the 'clustering' of the data/nodes in a graph-based model).
Based on some benchmarking I've performed, where 3 nodes are sequentially connected (A->B<-C) and, given the id of A, I query for C, the scalability is clearly O(n) when testing the same query on databases with 1M, 5M, 10M and 20M nodes. That is reasonable (with my limited understanding), considering I am not limiting my query to one node, so all nodes need to be checked for a match. HOWEVER, when I index the queried node property, the execution time for the same query is relatively constant.
Figure shows execution time by database node size before and after indexing. Orange plot is O(N) reference line, while the blue plot is the observed execution times.
Based on these results I'm trying to figure out where the advantage of index-free adjacency comes in. Is this advantageous when querying with a limit of 1 for deep(er) links? E.g. depth of 4 in A->B->C->D->E, and querying for E given A. Because in this case we know that there is only one match for A (hence no need to brute force through all the other nodes not part of this sub-network).
As this is highly dependent on the query, I'm listing an example of the Cypher query below for reference (where I'm matching entity labeled node with id of 1, and returning the associated node (B in the above example) and the secondary-linked node (C in the above example)):
MATCH (:entity{id:1})-[:LINK]->(result_assoc:assoc)<-[:LINK]-(result_entity:entity) RETURN result_entity, result_assoc
UPDATE / ADDITIONAL INFORMATION
This source states: "The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n)." This statement explains the O(n) results before indexing. I guess the O(1) performance after indexing is comparable to hash-list performance(?). I'm not sure why the complexity with any other index would be O(n log n), when even a hash list has a worst case of O(n).
From my understanding, the index-free aspect is only pertinent to adjacent nodes (that's why it's called index-free adjacency). What your plots demonstrate is that once you find A, the additional time to find C is negligible; the question of whether to use an index or not only concerns finding the initial queried node A.
Finding A without an index takes O(n), because it has to scan through all the nodes in the database; with an index, it's effectively like a hash list and takes O(1) (no clue why the book says O(n log n) either).
Beyond that, finding the adjacent nodes is not that hard for Neo4j, because they are linked to A, whereas in a relational model the linkage is not as explicit: a join, which is expensive, and then a scan/filter are required. So to truly see the advantage, one should compare the performance of graph DBs and relational DBs while varying the depth of the relations/links. It would also be interesting to see how a query performs when the number of neighbours of the entity nodes is increased (i.e. the graph becomes denser): does Neo4j rely on the graph never being too dense? Otherwise the problem of looking through the neighbours to find the right one repeats itself.

Searching an item in a balanced binary tree

If I have a balanced binary tree and I want to search for an item in it, will the big-O time complexity be O(n)? Does searching for an item in a binary tree, regardless of whether it's balanced or not, change the big-O time complexity from O(n)? I understand that if we have a balanced BST, then searching for an item is proportional to the height of the BST, so O(log n), but what about normal binary trees?
The O(log n) search time in a balanced BST is facilitated by two properties:
Elements in the tree are arranged by comparison
The tree is (approximately) balanced.
If you lose either of those properties, then you will no longer get O(log n) search time.
If you are searching a balanced binary tree that is not sorted (aka not a BST) for a specific value, then you will have to check every node in the tree to be guaranteed to find the value you are looking for, so it requires O(n) time.
For an unbalanced tree, it might help to visualize the worst case of being out of balance, in which every node has exactly one child except for the leaf: essentially a linked list. If you have a completely (or mostly) unbalanced BST, searching will take O(n) time, just like in a linked list.
If the unsorted binary tree is unbalanced, it still has n nodes and they are still unsorted, so it still takes O(n) time.
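To spell out the difference, here is a small sketch (assuming nothing more than key/left/right fields on each node):

def bst_search(node, key):
    # O(h) = O(log n) in a balanced BST: the ordering lets us discard
    # one whole subtree at every step
    while node is not None:
        if key == node.key:
            return node
        node = node.left if key < node.key else node.right
    return None

def unordered_tree_search(node, key):
    # O(n) in an unsorted binary tree, balanced or not: nothing tells us
    # which side the key is on, so we may have to visit every node
    if node is None:
        return None
    if node.key == key:
        return node
    return (unordered_tree_search(node.left, key)
            or unordered_tree_search(node.right, key))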

Improving scalability of the modified preorder tree traversal algorithm

I've been thinking about the modified preorder tree traversal algorithm for storing trees within a flat table (such as SQL).
One property I dislike about the standard approach is that to insert a node you have to touch (on average) N/2 of the nodes (everything with a left or right value higher than the insert point).
The implementations I've seen rely on sequentially numbered values. This leaves no room for updates.
This seems bad for concurrency and scaling. Imagine you have a tree rooted at the world, containing user groups for every account in a large system; it's extremely large, to the point that you must store subsets of the tree on different servers. Touching half of all the nodes to add a node at the bottom of the tree is bad.
Here is the idea I was considering. Basically leave room for inserts by partitioning the keyspace and dividing at each level.
Here's an example with Nmax = 64 (this would normally be the MAX_INT of your DB)
                 0:64
         ________|________
        /                 \
      1:31               32:63
     /    \             /     \
   2:14   15:30      33:47   48:62
Here, a node is added to the left half of the tree.
                 0:64
         ________|________
        /                 \
      1:31               32:63
    /   |    \          /     \
  2:11 11:20 21:30   33:47   48:62
The algorithm must be extended so that the insert and removal process recursively renumbers the left/right indexes of the affected subtree. Since querying for the immediate children of a node is complicated, I think it also makes sense to store the parent id in the table. The algorithm can then select the subtree (using left > p.left && right < p.right), and use node.id and node.parent to work through the list, subdividing the indexes.
This is more complex than just incrementing all the indexes to make room for the insert (or decrementing them for removal), but it has the potential to touch far fewer nodes (only descendants of the parent of the inserted/removed node).
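For illustration, the subdivision step could look something like this (a Python sketch only; it carves a parent's interval into equal child slices and won't reproduce the exact numbers in the diagrams above):

def subdivide(left, right, children):
    # Split the key range (left, right) reserved by a parent into
    # `children` evenly sized sub-ranges, leaving room for future inserts.
    span = (right - left) // children
    return [(left + i * span + 1, left + (i + 1) * span)
            for i in range(children)]

subdivide(0, 64, 2)   # -> [(1, 32), (33, 64)]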
My question(s) are basically:
Has this idea been formalized or implemented?
Is this the same as nested intervals?
I have heard of people doing this before, for the same reasons, yes.
Note that you do lose a couple of small advantages of the algorithm by doing this:
Normally, you can tell the number of descendants of a node by ((right - left + 1) div 2). This can occasionally be useful, e.g. if you're displaying a count in a treeview which should include the number of children found further down in the tree.
Following from the above, it's easy to select all leaf nodes -- WHERE (right = left + 1).
These are fairly minor advantages and may not be useful to you anyway, though for some usage patterns they're obviously handy.
That said, it does sound like materialized paths may be more useful to you, as suggested above.
I think you're better off looking at a different way of storing trees. If your tree is broad but not terribly deep (which seems likely for the case you suggested), you can store the complete list of ancestors up to the root against each node. That way, modifying a node doesn't require touching any nodes other than the node being modified.
You can split your table into two: the first is (node ID, node value), the second (node ID, child ID), which stores all the edges of the tree. Insertion and deletion then become O(tree depth) (you have to navigate to the element and fix what is below it).
The solution you propose looks like a B-tree. If you can estimate the total number of nodes in your tree, then you can choose the depth of the tree beforehand.