Improving scalability of the modified preorder tree traversal algorithm - sql

I've been thinking about the modified preorder tree traversal algorithm for storing trees within a flat table (such as SQL).
One property I dislike about the standard approach is that to insert a node you have to touch (on average) N/2 of the nodes (everything with left or right higher than the insert point).
The implementations I've seen rely on sequentially numbered values. This leaves no room for updates.
This seems bad for concurrency and scaling. Imagine a tree rooted at the world that contains user groups for every account in a large system: it is extremely large, to the point that you must store subsets of the tree on different servers. Touching half of all the nodes to add a node to the bottom of the tree is bad.
Here is the idea I was considering. Basically leave room for inserts by partitioning the keyspace and dividing at each level.
Here's an example with Nmax = 64 (this would normally be the MAX_INT of your DB)
                 0:64
         ________|________
        /                 \
      1:31               32:63
     /    \             /     \
  2:14   15:30       33:47   48:62
Here, a node is added to the left half of the tree.
                   0:64
           ________|________
          /                 \
        1:31               32:63
      /   |   \           /     \
  2:10  11:20  21:30   33:47   48:62
The algorithm must be extended so that the insert and removal processes recursively renumber the left/right indexes of the subtree. Since querying for the immediate children of a node is complicated, I think it makes sense to also store the parent id in the table. The algorithm can then select the subtree (using left > p.left && right < p.right), then use node.id and node.parent to work through the list, subdividing the indexes.
This is more complex than just incrementing all the indexes to make room for the insert (or decrementing for removal), but it has the potential to affect far fewer nodes (only descendants of the parent of the inserted/removed node).
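The subdivision step described above could be sketched roughly as follows. This is only an illustration under assumptions: the names subdivide and renumber are hypothetical, the tree is held in memory as a dict of child lists rather than selected from a table, and the even split it produces will not match the hand-drawn boundaries exactly.

```python
def subdivide(left, right, count):
    """Split the space strictly inside (left, right) into `count` disjoint intervals."""
    if count <= 0:
        return []
    first, last = left + 1, right - 1        # usable values inside the parent
    width = (last - first + 1) // count      # values available per child
    if width < 2:
        # Each child needs at least a distinct left and right value.
        raise ValueError("no room left; an ancestor would have to be renumbered")
    return [(first + i * width, first + (i + 1) * width - 1) for i in range(count)]


def renumber(children, node, left, right, out):
    """Recursively assign (left, right) to `node` and all of its descendants."""
    out[node] = (left, right)
    kids = children.get(node, [])
    for kid, (kid_left, kid_right) in zip(kids, subdivide(left, right, len(kids))):
        renumber(children, kid, kid_left, kid_right, out)
    return out


# Hypothetical tree: inserting "a3" under "a" only renumbers the subtree of "a".
children = {"root": ["a", "b"], "a": ["a1", "a2", "a3"]}
bounds = renumber(children, "root", 0, 64, {})
```

After an insert under a given parent, only renumber(children, parent, parent_left, parent_right, bounds) would need to run, so nodes outside that subtree keep their values.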
My question(s) are basically:
Has this idea been formalized or implemented?
Is this the same as nested intervals?

I have heard of people doing this before, for the same reasons, yes.
Note that you do lose a couple of small advantages of the algorithm by doing this:
Normally, you can tell the number of descendants of a node by ((right - left - 1) div 2). This can occasionally be useful, e.g. if you're displaying a count in a treeview that should include the children found further down in the tree.
Flowing from the above, it's easy to select all leaf nodes: WHERE (right = left + 1).
These are fairly minor advantages and may not be useful to you anyway, though for some usage patterns they're obviously handy.
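To make those two properties concrete, here is a small sketch with sequential numbering. The function names and the sample tree are made up for illustration; the descendant count excludes the node itself, which is why it uses (right - left - 1) // 2.

```python
def number_nested_set(children, root):
    """Assign sequential (left, right) values with a preorder walk."""
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1
        bounds[node] = (left, counter)

    visit(root)
    return bounds


def descendant_count(bounds, node):
    left, right = bounds[node]
    return (right - left - 1) // 2   # every descendant uses exactly two values


def is_leaf(bounds, node):
    left, right = bounds[node]
    return right == left + 1


children = {"root": ["a", "b"], "a": ["a1", "a2"]}
bounds = number_nested_set(children, "root")
```

With gaps left between the values, neither shortcut holds any more, which is the trade-off mentioned above.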
That said, it does sound like materialized paths may be more useful to you, as suggested in another answer.

I think you're better off looking at a different way of storing trees. If your tree is broad but not terribly deep (which seems likely for the case you suggested), you can store the complete list of ancestors up to the root against each node. That way, modifying a node doesn't require touching any nodes other than the node being modified.

You can split your table into two: the first is (node ID, node value), the second (node ID, child ID), which stores all the edges of the tree. Insertion and deletion then become O(tree depth) (you have to navigate to the element and fix what is below it).
The solution you propose looks like a B-tree. If you can estimate the total number of nodes in your tree, then you can choose the depth of the tree beforehand.

Related

Deletion operation in Binary Search Tree: successor or predecessor

Deletion is the most complex operation on a binary search tree, since it needs to consider several possibilities:
The deleted node is a leaf node
The deleted node has only one child
The deleted node has both a left and a right child
The first two cases are easy. But for the third one, the solution in the many books and documents I have read is: find the min value in the right subtree and replace the deleted node's value with it. Then delete that min node from the right subtree.
I can fully understand this solution.
In fact, the node with the min value in the right subtree is generally called the successor of the node. So the above solution replaces the deleted node's value with its successor's value and then deletes the successor node from the right subtree.
On the other hand, the predecessor of a node is the node with the max value in its left subtree.
I think replacing the deleted node with its predecessor should also work.
For instance, take the example used in the book "Data structure and algorithm analysis in C".
If we want to delete node "2", we replace it with "3", which is the successor of "2".
I think replacing "2" with "1", which is the predecessor of "2", can also work. Right? But the books didn't mention it at all.
So is there any convention here? And if both results of a deletion are correct, how do we stay consistent?
Edit:
An update based on what I have since learned about this issue. In fact, the book "Data structure and algorithm analysis in C" does discuss it. In summary:
First, both methods (based on successor or predecessor) should work.
If we perform O(n^2) insert/delete pairs on the tree, and every deletion is based on the successor, the tree becomes unbalanced, because the algorithm makes the left subtrees deeper than the right. The idea can be illustrated with the following two images:
The book then introduces the concept of balanced search trees, such as the AVL tree.
In theory, your argument seems correct to me: one can take either the predecessor or the successor.
In practice, I would think the best decision is to keep the tree balanced and switch between the two options depending on which yields the lower depth.
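Here is a minimal sketch of the two-children case with both choices, on a plain unbalanced BST. The use_successor flag is a hypothetical parameter added for illustration, and the keys follow the example discussed in the question (deleting "2" from a tree built from 6, 2, 8, 1, 4, 3).

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def delete(root, key, use_successor=True):
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key, use_successor)
    elif key > root.key:
        root.right = delete(root.right, key, use_successor)
    elif root.left is None:          # leaf, or only a right child
        return root.right
    elif root.right is None:         # only a left child
        return root.left
    elif use_successor:              # two children: take min of right subtree
        repl = root.right
        while repl.left:
            repl = repl.left
        root.key = repl.key
        root.right = delete(root.right, repl.key, use_successor)
    else:                            # two children: take max of left subtree
        repl = root.left
        while repl.right:
            repl = repl.right
        root.key = repl.key
        root.left = delete(root.left, repl.key, use_successor)
    return root


def inorder(root):
    return [] if root is None else inorder(root.left) + [root.key] + inorder(root.right)


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root
```

Both variants leave the same in-order sequence; only the shape of the tree differs, which is where the balancing concern comes from.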

Find Successor and Predecessor using a modification to AVL tree in O(1)

So I got this task to suggest an AVL tree capable of performing the following two operations in O(1):
Successor takes O(1) - return the successor.
Predecessor takes O(1) - return the predecessor.
I am kind of stuck after trying to think of a solution for this.
Any ideas?
I think it's not so much about modifying the tree or implementing any complicated predecessor/successor-finding logic to be able to physically traverse the tree to find successor/predecessor nodes in constant time as much as simply adding succ and pred links to every node in the tree, using them in your predecessor(n) and successor(n) functions, and maintaining these links over insert and delete operations.
All you have to be careful about with this approach to ensure successor(n) and predecessor(n) are actually O(1) is that the time complexity of maintaining succ/pred links does not exceed the complexity of the operations which require link updates. For example, because inserting a new node into a balanced tree is an O(lg n) operation, updating successor/predecessor links after insert cannot exceed O(lg n) (as otherwise your subsequent lookups would technically be doing amortized work greater than O(1)).
The steps you have to take on each insert and delete operation to maintain pred and succ links are actually pretty simple, and thankfully the complexities introduced to the AVL tree by the re-balance operation don't really affect them because while re-balance changes the structure of the tree, it doesn't affect the tree's ordering and therefore doesn't affect predecessor/successor relations.
Finding the successor and predecessor of a given node in a balanced BST (which an AVL tree is guaranteed to be) is O(lg n), and the logic does not differ from what you would do to find the predecessor and successor in a regular BST. Therefore, after inserting a new node n into your AVL tree, simply find n's successor and predecessor nodes and update the appropriate links:
insert(n) {
    ...
    # normal AVL insert and rebalance logic
    ...
    predecessor, successor = findPredAndSucc(n)
    if (predecessor) predecessor.succ = n
    if (successor) successor.pred = n
    n.pred = predecessor
    n.succ = successor
}
When deleting n, you'll also have to find n's predecessor and successor and update links as in:
delete(n) {
    # save these values first so they aren't lost when n is deleted
    predecessor, successor = findPredAndSucc(n)
    ...
    # normal AVL delete and rebalance logic
    ...
    if (predecessor) predecessor.succ = successor
    if (successor) successor.pred = predecessor
}
I'm slightly conflating the node n and the value of n (if inserting/deleting by value you would have to traverse the tree to find n first), but you get the idea.
Following this approach, after every insert and delete each node has direct links to its predecessor and successor nodes. Because of the AVL tree's self-balancing property, the time spent finding the relevant nodes to update these links after inserts and deletes doesn't add any complexity to those operations: finding the predecessor and successor nodes requires no more than 2 O(h) = O(lg n) traversals each, the same complexity as insert and delete themselves. So our successor(n) and predecessor(n) functions are O(1) overall if they just follow the succ and pred links we set.
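A runnable sketch of the link maintenance might look like this. It makes two simplifying assumptions that are not part of the answer: AVL rebalancing is left out entirely (rotations do not change the ordering, so the links would be unaffected), and the predecessor and successor are tracked during the insert descent instead of via a separate findPredAndSucc call. The class and function names are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None
        self.pred = self.succ = None     # the extra O(1) links


class LinkedBST:
    def __init__(self):
        self.root = None

    def insert(self, key):
        node = Node(key)
        if self.root is None:
            self.root = node
            return node
        pred = succ = None
        cur = self.root
        while True:
            if key < cur.key:
                succ = cur               # last node where we turned left
                if cur.left is None:
                    cur.left = node
                    break
                cur = cur.left
            else:
                pred = cur               # last node where we turned right
                if cur.right is None:
                    cur.right = node
                    break
                cur = cur.right
        # Splice the new node into the ordered pred/succ chain.
        node.pred, node.succ = pred, succ
        if pred:
            pred.succ = node
        if succ:
            succ.pred = node
        return node


def unlink(node):
    """Link update for delete; the structural removal itself is omitted here."""
    if node.pred:
        node.pred.succ = node.succ
    if node.succ:
        node.succ.pred = node.pred


tree = LinkedBST()
nodes = {k: tree.insert(k) for k in [5, 2, 8, 1, 4, 3]}
```

Here successor(n) is simply n.succ and predecessor(n) is n.pred.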

Neo4j scalability and indexing

An argument in favor of graph dbms with native storage over relational dbms made by neo4j (also in the neo4j graph databases book) is that "index-free adjacency" is the most efficient means of processing data in a graph (due to the 'clustering' of the data/nodes in a graph-based model).
Based on some benchmarking I've performed, where 3 nodes are sequentially connected (A->B<-C) and given the id of A I'm querying for C, the scalability is clearly O(n) when testing the same query on databases with 1M, 5M, 10M and 20M nodes - which is reasonable (with my limited understanding) considering I am not limiting my query to 1 node only hence all nodes need to be checked for matching. HOWEVER, when I index the queried node property, the execution time for the same query, is relatively constant.
Figure shows execution time by database node size before and after indexing. Orange plot is O(N) reference line, while the blue plot is the observed execution times.
Based on these results I'm trying to figure out where the advantage of index-free adjacency comes in. Is this advantageous when querying with a limit of 1 for deep(er) links? E.g. depth of 4 in A->B->C->D->E, and querying for E given A. Because in this case we know that there is only one match for A (hence no need to brute force through all the other nodes not part of this sub-network).
As this is highly dependent on the query, I'm listing an example of the Cypher query below for reference (where I'm matching entity labeled node with id of 1, and returning the associated node (B in the above example) and the secondary-linked node (C in the above example)):
MATCH (:entity{id:1})-[:LINK]->(result_assoc:assoc)<-[:LINK]-(result_entity:entity) RETURN result_entity, result_assoc
UPDATE / ADDITIONAL INFORMATION
This source states: "The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n).". This statement explains the O(n) results before indexing. I guess the O(1) performance after indexing is identical to a hash list performance(?). Not sure why using any other index the complexity is O(n log n) if even using a hash list the worst case is O(n).
From my understanding, the index-free aspect is only pertinent for adjacent nodes (that's why it's called index-free adjacency). What your plots are demonstrating, is that when you find A, the additional time to find C is negligible, and the question of whether to use an index or not, is only to find the initial queried node A.
To find A without an index takes O(n), because it has to scan through all the nodes in the database, but with an index it's effectively like a hash list and takes O(1) (no clue why the book says O(n log n) either).
Beyond that, finding the adjacent nodes is not that hard for Neo4j, because they are linked to A, whereas in the relational model (RM) the linkage is not as explicit, so a join, which is expensive, and then a scan/filter are required. So to truly see the advantage, one should compare the performance of graph DBs and relational DBs while varying the depth of the relations/links. It would also be interesting to see how a query performs when the number of neighbours of the entity nodes increases (i.e., the graph becomes denser): does Neo4j rely on the graphs never being too dense? Otherwise the problem of looking through the neighbours to find the right one repeats itself.
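As a toy illustration of that split (and not a model of how Neo4j actually stores data), the sketch below keeps direct references between nodes. Finding the start node is either a full scan or a dictionary lookup, while the two-hop traversal afterwards only follows references. All names here are hypothetical.

```python
class GraphNode:
    def __init__(self, label, node_id):
        self.label = label
        self.id = node_id
        self.out = []    # nodes this node links to
        self.inc = []    # nodes linking to this node


def link(src, dst):
    src.out.append(dst)
    dst.inc.append(src)


def find_by_scan(nodes, label, node_id):
    """No index: check every node, O(n) in the size of the database."""
    return next(n for n in nodes if n.label == label and n.id == node_id)


def two_hop(start):
    """Follow start -> assoc <- other using only the stored references."""
    return [(assoc, other)
            for assoc in start.out
            for other in assoc.inc
            if other is not start]


a, b, c = GraphNode("entity", 1), GraphNode("assoc", 10), GraphNode("entity", 2)
link(a, b)
link(c, b)
nodes = [a, b, c]
index = {(n.label, n.id): n for n in nodes}   # stands in for a property index
```

The cost of two_hop depends only on how many neighbours the visited nodes have, not on the total number of nodes, which matches the flat curve observed after indexing.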

Hierarchical Database Structure SQL Server

I have a hierarchical structure with varying levels.
Please see the structure below.
1. Parent 1
1.1 Child 1
1.2 Child 2
1.3 Child 3
1.3.1 Child 4
**1.3.2 Parent 2**
Looking at the tree above, a child can itself have a child that acts as a parent.
So how can I achieve this? Keep in mind that I want the whole tree without a for-each loop.
Thanks in advance.
Generally, two approaches may fit your needs.
Version #1: The most obvious (but slow) approach is to simply create a table holding each node and a reference (foreign key) to its parent. A parent of NULL indicates a root node.
The disadvantage of this approach is that you need either a loop (which you want to avoid) or an RDBMS that can define and execute recursive queries (usually with a CTE).
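A recursive query over such a parent-reference table could look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in so it can run anywhere; the table and column names are assumptions, and SQL Server writes the CTE as WITH tree AS (...) without the RECURSIVE keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES node(id),"   # NULL marks a root node
    " name TEXT)"
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [(1, None, "Parent 1"), (2, 1, "Child 1"), (3, 1, "Child 2"),
     (4, 1, "Child 3"), (5, 4, "Child 4"), (6, 4, "Parent 2")],
)

# Start at the roots, then repeatedly join children onto rows already found.
rows = conn.execute("""
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM node WHERE parent_id IS NULL
        UNION ALL
        SELECT n.id, n.name, t.depth + 1
        FROM node AS n JOIN tree AS t ON n.parent_id = t.id
    )
    SELECT id, name, depth FROM tree ORDER BY id
""").fetchall()
```

The whole tree comes back from one statement, with depth available for indentation in the UI.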
Version #2: The second approach would be the choice in the real world. Whereas the first solution can store unlimited depth, such scenarios usually don't occur in hierarchical trees.
Again you create a table with one row per node, but instead of a reference to the parent, you store the absolute path to that node within the tree in e.g. a VarChar column, just like the absolute path of a file in a filesystem. Here, the 'directory name' corresponds to e.g. the ID of the node.
Version #1 has the advantage of being very compact, but it takes quite an effort to prune the tree or retrieve a list of all nodes with their absolute path (RDBMSs are not very good at recursive structures). On the other hand, a lot of UI components expect exactly this structure to display the tree on screen. Questions like 'Which nodes are indirect children of node X' are both slow and quite difficult to answer.
Version #2 has the advantage of making tree manipulation (deletion, pruning, moving nodes and subtrees) very easy to implement. Also, the list you require is a simple SELECT. The question 'show all direct or indirect children of node X' is answered with a simple SELECT as well.
The caveat is the increased size due to redundant storage of paths and the limited depth of the tree that can be stored.
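For comparison, the path-based variant might be sketched as follows, again with sqlite3 as a stand-in and with assumed table and column names. The path format (node IDs separated by slashes, with a trailing slash) is one possible convention, not something prescribed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, path TEXT NOT NULL UNIQUE, name TEXT)"
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [(1, "/1/", "Parent 1"), (2, "/1/2/", "Child 1"), (3, "/1/3/", "Child 2"),
     (4, "/1/4/", "Child 3"), (5, "/1/4/5/", "Child 4"), (6, "/1/4/6/", "Parent 2")],
)


def descendants(conn, path):
    """All direct and indirect children: a prefix match on the stored path."""
    return conn.execute(
        "SELECT name FROM node WHERE path LIKE ? AND path <> ? ORDER BY path",
        (path + "%", path),
    ).fetchall()


def whole_tree(conn):
    """Ordering by path lists every parent before its children."""
    return [name for (name,) in conn.execute("SELECT name FROM node ORDER BY path")]
```

The trailing slash keeps a prefix such as /1/4/ from also matching an unrelated node like /1/40/.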

Is hierarchyid suitable for large trees with frequent insertions of leaf nodes?

We have a database that models a tree. This data can grow fairly huge, that is to say many, many millions of rows. (The primary key is actually a bigint, so I guess potentially we would like to support billions of rows, although this is probably never going to occur.)
A single node can have a very large number of direct children, more likely so the higher up in the hierarchy it is. We have no specified limit on the maximum depth of a leaf, i.e. how many nodes one would have to traverse to get to the root, but in practice this would probably not grow beyond a few hundred at the very most. Normally it would probably be below 20.
Insertions into this table are very frequent and need to be high performing. Inserted nodes are always leaf nodes, and always go after the last sibling. Nodes are never moved. Deletions are always made as entire subtrees. Finding subtrees is the other operation performed on this table. It does not have the same performance requirements, but of course we would like it to be as fast as possible.
Today this is modeled with the parent/child model, which is efficient for insertions but painfully slow for finding subtrees. When the table grows large, finding a subtree may take several minutes.
So I was thinking about converting this to use the new hierarchyid type in SQL Server. But I am having trouble figuring out whether this would be suitable. As I understand it, for the operations we perform in this scenario, such a tree would be a good idea. (Please correct me if I'm wrong here.)
But the documentation also states that the maximum size for a hierarchyid is 892 bytes. However, I cannot find any information about what this means in practice. How is the hierarchyid encoded? Will I run out of hierarchyids, and if so, when?
So I did some tests and came to somewhat of a conclusion regarding the limitations of hierarchyid:
If I run for example the following code:
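DECLARE @i BIGINT = 1
DECLARE @h hierarchyid = '/'
-- Append one more '1/' level per iteration until SQL Server raises an error
WHILE 1=1
BEGIN
    SET @h = @h.ToString() + '1/'
    PRINT CONVERT(nvarchar(max), @i)  -- number of levels created so far
    SET @i = @i+1
END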
I get to 1427 levels deep before I get an error. Since I am using the value 1 for each level, this ought to be the most compact tree, from which I conclude that I will never be able to create a tree with more than 1427 levels.
However, if I use for example 99999999999999 for each level (e.g. /99999999999999/99999999999999/99999999999999/...), the error occurs already at 118 levels deep. It also seems that 14 digits are the maximum for an id at each level, since it fails immediately if I use a 15 digit number.
So with this in mind, if I only use whole integer identifiers (i.e. don't insert nodes between other nodes etc.) I should be able to guarantee up to at least 100 levels deep in my scenario, and at no time will I be able to exceed much more than 1400 levels.
892 bytes does not sound like much, but the hierarchy id seems to be very efficient, space-wise. From http://technet.microsoft.com/en-us/library/bb677290.aspx:
The average number of bits that are required to represent a node in a tree with n nodes depends on the average fanout (the average number of children of a node). For small fanouts (0-7), the size is about 6*log_A(n) bits, where A is the average fanout. A node in an organizational hierarchy of 100,000 people with an average fanout of 6 levels takes about 38 bits. This is rounded up to 40 bits, or 5 bytes, for storage.
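The arithmetic in that quote can be checked with a few lines. The function name is made up, and the formula is only the quoted approximation for small fanouts, not a description of the actual encoding.

```python
import math


def approx_hierarchyid_bits(n_nodes, avg_fanout):
    """Quoted estimate for small fanouts: about 6 * log_A(n) bits."""
    return 6 * math.log(n_nodes, avg_fanout)


bits = approx_hierarchyid_bits(100_000, 6)   # the organizational example
storage_bytes = math.ceil(bits / 8)          # rounded up to whole bytes
```

For 100,000 nodes and a fanout of 6 this gives between 38 and 39 bits, which rounds up to the 5 bytes mentioned in the quote.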
The calculation given says it's only for small fanouts (0-7), which makes it hard to reason about bigger fanouts. You say 'up to a few hundred children at the most'. This (extreme) case does sound dangerous. I don't know the spec of hierarchyid, but the more nodes there are at any one level, the less depth you should be able to have in the tree within those 892 bytes.
I do see a risk here, as do you (hence the question). Do some tests. Evaluate the goals. What are you moving from? Why are you moving? Simplicity or performance?
This problem is a bad fit for SQL. Maybe you should consider other options for this part of the program?