Counting levels of binary tree - sql

I have a SQL table for a binary tree:
Id | Name | Parent_ID |
I want to calculate 5 levels of a complete binary tree for a node that is passed as a parameter to a stored procedure.

In the general case, a binary tree with n nodes will have at least 1 + floor(log_2(n)) levels. For example, you can fit 7 nodes on three levels, but 8 nodes will take at least four levels no matter what. There are particular kinds of binary trees for which you can put tighter constraints on the upper bound.
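The bound above can be checked with a few lines of Python. This is only a minimal sketch; the function names `min_levels` and `max_nodes` are made up for illustration and do not come from the question.

```python
def min_levels(n: int) -> int:
    """Smallest number of levels a binary tree with n nodes can have.

    For n >= 1 this equals 1 + floor(log2(n)), which is the bit length of n.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n.bit_length()


def max_nodes(levels: int) -> int:
    """Number of nodes in a binary tree whose `levels` levels are all full."""
    return 2 ** levels - 1


print(min_levels(7))  # 3
print(min_levels(8))  # 4
print(max_nodes(5))   # 31
```

So the 5 levels mentioned in the question can hold at most 31 nodes below and including the starting node.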

Related

B-Tree Definition Correct or not?

I am trying to understand the order of a B-Tree, but I am finding multiple answers. Here is what the lecturer has in the slides:
In a B-Tree of order n:
The root node has between 1 and 2n keys, e.g. 1 to 4
All other nodes have between n and 2n elements
Root node: can have 0 or 2 to 2n+1 children, e.g. 0 or 2 to 5
Non root node: can have 0 or n to 2n+1 children, e.g. 0 or 2 to 5
Some sources say that the order is the maximum number of children a nonleaf node can have.
Can someone please help me understand which is correct?
There are indeed different, incompatible definitions of the term "order" in the context of B-trees. This is also mentioned on Wikipedia:
The literature on B-trees is not uniform in its terminology.
Bayer and McCreight (1972), Comer (1979), and others define the order of B-tree as the minimum number of keys in a non-root node.
Folk & Zoellick (1992) point out that terminology is ambiguous because the maximum number of keys is not clear. An order 3 B-tree might hold a maximum of 6 keys or a maximum of 7 keys. Knuth (1998) avoids the problem by defining the order to be the maximum number of children (which is one more than the maximum number of keys).
So, we'll have to live with this difference in terminology and always check what the author means when they use this term.
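To make the difference concrete, here is a small sketch that computes the key and child limits of a non-root internal node under the two conventions quoted above. The function names are hypothetical, and the minimums assume the usual "at least half full" rule.

```python
from math import ceil


def bounds_order_as_min_keys(d: int) -> dict:
    """Order d = minimum number of keys in a non-root node (Bayer & McCreight)."""
    return {
        "min_keys": d,
        "max_keys": 2 * d,
        "min_children": d + 1,
        "max_children": 2 * d + 1,
    }


def bounds_order_as_max_children(m: int) -> dict:
    """Order m = maximum number of children of a node (Knuth)."""
    return {
        "min_keys": ceil(m / 2) - 1,
        "max_keys": m - 1,
        "min_children": ceil(m / 2),
        "max_children": m,
    }


# The same tree shape is "order 2" under one definition and "order 5" under the other.
print(bounds_order_as_min_keys(2) == bounds_order_as_max_children(5))  # True
```

With n = 2, the first definition gives 2 to 4 keys and 3 to 5 children per non-root internal node, which is the shape Knuth's definition calls order 5.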

Hierarchical Database Structure SQL Server

I have a different kind of hierarchical structure.
Please see the structure below.
1. Parent 1
1.1 Child 1
1.2 Child 2
1.3 Child 3
1.3.1 Child 4
**1.3.2 Parent 2**
Now look at the tree above: here a child can itself have a sub-child that is a PARENT.
So how can I achieve this? Keep in mind that I want the whole tree without a for-each loop.
Thanks in advance.
Generally, two approaches may fit your needs.
Version #1: The most obvious (but slow) approach is to simply create a table holding each node and a reference (foreign key) to its parent. A parent of NULL indicates a/the root node.
The disadvantage of this approach is that you either need a loop (which you want to avoid) or an RDBMS that can define and execute recursive queries (usually with a CTE).
Version #2: The second approach would be the choice in the real world. The first solution is able to store unlimited depth, but such scenarios usually don't occur in hierarchical trees.
Again you create a table with one row per node, but instead of having a reference to the parent, you store the absolute path to that node within the tree in e.g. a VarChar column, just like the absolute path of a file in a filesystem. Here, the 'directory name' corresponds to e.g. the ID of the node.
Version #1 has the advantage of being very compact, but it takes quite an effort to prune the tree or retrieve a list of all nodes with their absolute path (RDBMS are not very good at recursive structures). On the other hand, a lot of UI components expect exactly this structure to display the tree on screen. Questions like 'Which nodes are indirect children of node X' are both slow and quite difficult to answer.
Version #2 has the advantage of making it very easy to implement tree manipulation (deletion, pruning, moving nodes and subtrees). Also, the list you require is a simple SELECT. The question 'show all direct or indirect children of node X' is answered with a simple SELECT as well.
The caveat is the increased size due to redundant saving of paths and the limited depth of the possible tree to save.
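The two versions can be compared side by side in a small sketch. It uses Python's built-in sqlite3 module so it runs anywhere; the table names, column names and path format are assumptions made for illustration. SQLite spells the recursive CTE `WITH RECURSIVE`, whereas SQL Server uses plain `WITH`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Version #1: adjacency list, each row points at its parent (NULL = root).
    CREATE TABLE node_adj (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
    INSERT INTO node_adj VALUES
        (1, 'Parent 1', NULL), (2, 'Child 1', 1), (3, 'Child 2', 1),
        (4, 'Child 3', 1), (5, 'Child 4', 4), (6, 'Parent 2', 4);

    -- Version #2: materialized path built from the node IDs (assumed format).
    CREATE TABLE node_path (id INTEGER PRIMARY KEY, name TEXT, path TEXT);
    INSERT INTO node_path VALUES
        (1, 'Parent 1', '/1/'), (2, 'Child 1', '/1/2/'), (3, 'Child 2', '/1/3/'),
        (4, 'Child 3', '/1/4/'), (5, 'Child 4', '/1/4/5/'), (6, 'Parent 2', '/1/4/6/');
""")


def descendants_adj(node_id):
    """All direct and indirect children via a recursive CTE (Version #1)."""
    rows = conn.execute("""
        WITH RECURSIVE sub(id, name) AS (
            SELECT id, name FROM node_adj WHERE parent_id = ?
            UNION ALL
            SELECT n.id, n.name FROM node_adj n JOIN sub s ON n.parent_id = s.id
        )
        SELECT name FROM sub ORDER BY id
    """, (node_id,))
    return [r[0] for r in rows]


def descendants_path(node_id):
    """All direct and indirect children via a prefix match (Version #2)."""
    rows = conn.execute("""
        SELECT c.name
        FROM node_path c JOIN node_path p ON p.id = ?
        WHERE c.path LIKE p.path || '%' AND c.id <> p.id
        ORDER BY c.path
    """, (node_id,))
    return [r[0] for r in rows]


print(descendants_adj(4))   # ['Child 4', 'Parent 2']
print(descendants_path(4))  # ['Child 4', 'Parent 2']
```

Both functions return the same rows; the difference is that the path version needs no recursion, at the cost of storing the redundant path in every row.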

Is hierarchyid suitable for large trees with frequent insertions of leaf nodes?

We have a database that models a tree. This data can grow fairly huge, that is to say many, many millions of rows. (The primary key is actually a bigint, so I guess potentially we would like to support billions of rows, although this is probably never going to occur.)
A single node can have a very large number of direct children, more likely the higher up in the hierarchy it is. We have no specified limit to the actual maximum depth of a leaf, i.e. how many nodes one would have to traverse to get to the root, but in practice this would probably normally not grow beyond a few hundred at the very most. Normally it would probably be below 20.
Insertions into this table are very frequent and need to be high performing. Inserted nodes are always leaf nodes, and are always placed after the last sibling. Nodes are never moved. Deletions are always made as entire subtrees. Finding subtrees is the other operation performed on this table. It does not have the same performance requirements, but of course we would like it to be as fast as possible.
Today this is modeled with the parent/child model, which is efficient for insertions, but is painfully slow for finding subtrees. When the table grows large, this becomes extremely slow and finding a subtree may take several minutes.
So I was thinking about converting this to perhaps use the new hierarchyid type in SQL Server. But I am having trouble figuring out whether this would be suitable. As I understand it, for the operations we perform in this scenario, such a tree would be a good idea. (Please correct me if I'm wrong here.)
But the documentation also states that the maximum size for a hierarchyid is 892 bytes. However, I cannot find any information about what this means in practice. How is the hierarchyid encoded? Will I run out of hierarchyids, and if so, when?
So I did some tests and came to somewhat of a conclusion regarding the limitations of hierarchyid:
If I run for example the following code:
DECLARE @i BIGINT = 1
DECLARE @h hierarchyid = '/'
-- Append one more level ('1/') per iteration until the conversion fails
WHILE 1=1
BEGIN
    SET @h = @h.ToString() + '1/'
    PRINT CONVERT(nvarchar(max), @i)  -- depth reached so far
    SET @i = @i+1
END
I will get to 1427 levels deep before I get an error. Since I am using the value 1 for each level, this ought to be the most compact tree, from which I draw the conclusion that I will never be able to create a tree with more than 1427 levels.
However, if I use for example 99999999999999 for each level (e.g. /99999999999999/99999999999999/99999999999999/...), the error already occurs at 118 levels deep. It also seems that 14 digits is the maximum for an id at each level, since it fails immediately if I use a 15-digit number.
So with this in mind, if I only use whole integer identifiers (i.e. don't insert nodes between other nodes etc.) I should be able to guarantee up to at least 100 levels deep in my scenario, and at no time will I be able to exceed much more than 1400 levels.
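As a rough cross-check of these numbers, the observed depths can be turned into an average bit cost per level. This is only back-of-envelope arithmetic on the figures reported above (892 bytes, 1427 levels, 118 levels); it says nothing about how hierarchyid actually encodes each level, and the helper name is made up.

```python
MAX_BYTES = 892            # documented maximum size of a hierarchyid
MAX_BITS = MAX_BYTES * 8   # 7136 bits


def avg_bits_per_level(max_depth: int) -> float:
    """Upper bound on the average number of bits used per level."""
    return MAX_BITS / max_depth


# Deepest path reached with the value 1 at every level.
print(round(avg_bits_per_level(1427), 2))
# Deepest path reached with a 14-digit value at every level.
print(round(avg_bits_per_level(118), 2))
```

Under these assumptions a level holding the value 1 costs about 5 bits, while a level holding a 14-digit value costs about 60 bits, which fits the observation that large identifiers reduce the reachable depth by roughly a factor of twelve.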
892 bytes does not sound like much, but the hierarchy id seems to be very efficient, space-wise. From http://technet.microsoft.com/en-us/library/bb677290.aspx:
The average number of bits that are required to represent a node in a tree with n nodes depends on the average fanout (the average number of children of a node). For small fanouts (0-7), the size is about 6*log_A(n) bits, where A is the average fanout. A node in an organizational hierarchy of 100,000 people with an average fanout of 6 levels takes about 38 bits. This is rounded up to 40 bits, or 5 bytes, for storage.
The calculation given says it's only for small fanouts (0-7), which makes it hard to reason about bigger fanouts. You say 'up to a few hundred children at the most'. This (extreme) case does sound dangerous. I don't know the spec of hierarchyid, but the more nodes there are at any one level, the less depth you should be able to have in the tree within those 892 bytes.
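The quoted estimate is easy to reproduce. The sketch below just evaluates 6*log_A(n) and rounds up to whole bytes; the function names are invented for this example, and the formula is only stated by the documentation for small fanouts.

```python
from math import ceil, log


def estimated_bits(n: int, fanout: float) -> float:
    """Documented estimate for small fanouts (0-7): about 6 * log_A(n) bits."""
    return 6 * log(n, fanout)


def storage_bytes(bits: float) -> int:
    """Round a bit count up to whole bytes."""
    return ceil(bits / 8)


bits = estimated_bits(100_000, 6)
print(round(bits, 1))       # roughly 38.6, the "about 38 bits" from the quote
print(storage_bytes(bits))  # 5
```

This matches the 5 bytes in the quoted example, but it should not be extrapolated to fanouts in the hundreds.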
I do see a risk here, as do you (hence the question). Do some tests. Evaluate the goals. What are you moving from? Why are you moving? Simplicity or performance?
This problem is a bad fit for SQL. Maybe you should consider other options for this part of the program?

SQL Server 2008 Performance on nullable geography column with spatial index

I'm seeing some strange performance issues on SQL Server 2008 with a nullable geography column with a spatial index. Each null value is stored as a root node within the spatial index.
E.g. a table with 5 000 000 addresses where 4 000 000 have a coordinate stored.
Every time I query the index I have to scan through every root node, meaning I have to scan through 1 000 001 level 0 nodes. (1 root node for all the valid coordinates + 1M nulls)
I cannot find this mentioned in the documentation, and I cannot see why SQL allows this column to be nullable if the indexing is unable to handle it.
For now I have bypassed this by storing only the existing coordinates in a separate table, but I would like to know what is the best practice here?
EDIT: (case closed)
I got some help on the sql spatial msdn forum, and there is a blog post about this issue:
http://www.sqlskills.com/BLOGS/BOBB/post/Be-careful-with-EMPTYNULL-values-and-spatial-indexes.aspx
Also, the MSDN documentation does in fact mention this, but in a very sneaky manner.
NULL and empty instances are counted at level 0 but will not impact performance. Level 0 will have as many cells as NULL and empty instances at the base table. For geography indexes, level 0 will have as many cells as NULL and empty instances +1 cell, because the query sample is counted as 1
Nowhere in the text is it promised that nulls do not affect performance for geography.
Only geometry is supposed to be unaffected.
Just a follow-up note - this issue has been fixed in Sql Server Denali with the new AUTO_GRID indexes (which are now the default). NULL values will no longer be populated in the root index node.

Improving scalability of the modified preorder tree traversal algorithm

I've been thinking about the modified preorder tree traversal algorithm for storing trees within a flat table (such as SQL).
One property I dislike about the standard approach is that to insert a node you have to touch (on average) N/2 of the nodes (everything with left or right higher than the insert point).
The implementations I've seen rely on sequentially numbered values. This leaves no room for updates.
This seems bad for concurrency and scaling. Imagine you have a tree, rooted at the world, containing user groups for every account in a large system. It is extremely large, to the point that you must store subsets of the tree on different servers. Touching half of all the nodes to add a node to the bottom of the tree is bad.
Here is the idea I was considering. Basically leave room for inserts by partitioning the keyspace and dividing at each level.
Here's an example with Nmax = 64 (this would normally be the MAX_INT of your DB)
                0:64
         ________|________
        /                 \
      1:31               32:63
     /    \             /     \
  2:14   15:30      33:47    48:62
Here, a node is added to the left half of the tree.
                0:64
         ________|________
        /                 \
      1:31               32:63
    /  |   \            /     \
2:11 12:20 21:30    33:47    48:62
The algorithm must be extended for the insert and removal process to recursively renumber the left/right indexes for the subtree. Since querying for immediate children of a node is complicated, I think it makes sense to also store the parent id in the table. The algorithm can then select the subtree (using left > p.left && right < p.right), then use node.id and node.parent to work through the list, subdividing the indexes.
This is more complex than just incrementing all the indexes to make room for the insert (or decrementing for removal), but it has the potential to affect far fewer nodes (only descendants of the parent of the inserted/removed node).
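The subdivision step could look roughly like the sketch below. It is one possible way to split a parent's key range evenly among its children, not the exact numbering used in the diagrams above; the function name `subdivide` and the even-split policy are assumptions.

```python
def subdivide(left: int, right: int, k: int) -> list:
    """Split the open interval (left, right) into k disjoint child intervals.

    Each child gets a (left, right) pair that lies strictly inside the parent,
    so only this parent's children need renumbering when k changes.
    """
    lo, hi = left + 1, right - 1
    size = hi - lo + 1
    if k <= 0 or size < 2 * k:
        raise ValueError("not enough key space for k children")
    base, extra = divmod(size, k)
    intervals = []
    start = lo
    for i in range(k):
        # Spread any remainder over the first `extra` children.
        width = base + (1 if i < extra else 0)
        intervals.append((start, start + width - 1))
        start += width
    return intervals


# Re-splitting the node 1:31 when it goes from two children to three.
print(subdivide(1, 31, 2))  # [(2, 16), (17, 30)]
print(subdivide(1, 31, 3))  # [(2, 11), (12, 21), (22, 30)]
```

To renumber a whole subtree, the same function would be applied recursively to each child interval, which touches only the descendants of the affected parent.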
My question(s) are basically:
Has this idea been formalized or implemented?
Is this the same as nested intervals?
I have heard of people doing this before, for the same reasons, yes.
Note that you do lose a couple of small advantages of the algorithm by doing this:
Normally, you can tell the number of descendants of a node by ((right - left - 1) div 2). This can occasionally be useful, e.g. if you're displaying a count in a treeview which should include the number of children to be found further down in the tree.
Flowing from the above, it's easy to select out all leaf nodes -- WHERE (right = left + 1).
These are fairly minor advantages and may not be useful to you anyway, though for some usage patterns they're obviously handy.
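Both properties depend on the left/right values being numbered sequentially, which the following sketch illustrates on a tiny made-up tree. Here descendants are counted as (right - left - 1) div 2, which excludes the node itself; (right - left + 1) div 2 would give the size of the subtree including the node.

```python
def number_tree(children: dict, root: str) -> dict:
    """Assign sequential (left, right) values with a preorder walk."""
    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        left = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        bounds[node] = (left, counter)
        counter += 1

    visit(root)
    return bounds


def descendant_count(left: int, right: int) -> int:
    """Number of nodes strictly below a node in a gap-free numbering."""
    return (right - left - 1) // 2


# Hypothetical tree: root has children a and b; a has children a1 and a2.
children = {"root": ["a", "b"], "a": ["a1", "a2"]}
bounds = number_tree(children, "root")

print(bounds["root"])                     # (1, 10)
print(descendant_count(*bounds["root"]))  # 4
leaves = sorted(n for n, (l, r) in bounds.items() if r == l + 1)
print(leaves)                             # ['a1', 'a2', 'b']
```

Once gaps are left between the values, neither the count nor the leaf test holds any more, which is the trade-off described above.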
That said, it does sound like materialized paths may be more useful to you, as suggested above.
I think you're better off looking at a different way of storing trees. If your tree is broad but not terribly deep (which seems likely for the case you suggested), you can store the complete list of ancestors up to the root against each node. That way, modifying a node doesn't require touching any nodes other than the node being modified.
You can split your table into two: the first is (node ID, node value), the second (node ID, child ID), which stores all the edges of the tree. Insertion and deletion then become O(tree depth) (you have to navigate to the element and fix what is below it).
The solution you propose looks like a B-tree. If you can estimate the total number of nodes in your tree, then you can choose the depth of the tree beforehand.