B-Tree Definition Correct or not?

I am trying to understand the order of a B-Tree, but I am finding multiple answers. Here is what the lecturer has in the slides:
In a B-Tree of order n:
The root node has between 1 and 2n keys, e.g. 1 to 4
All other nodes have between n and 2n elements
Root node: can have 0 or 2 to 2n+1 children, e.g. 0 or 2 to 5
Non root node: can have 0 or n to 2n+1 children, e.g. 0 or 2 to 5
Some sources say that the order is the maximum number of children a nonleaf node can have.
Can someone please help me understand which is correct?

There are indeed different, incompatible definitions of the term "order" in the context of B-trees. This is also mentioned on Wikipedia:
The literature on B-trees is not uniform in its terminology.
Bayer and McCreight (1972), Comer (1979), and others define the order of B-tree as the minimum number of keys in a non-root node.
Folk & Zoellick (1992) points out that terminology is ambiguous because the maximum number of keys is not clear. An order 3 B-tree might hold a maximum of 6 keys or a maximum of 7 keys. Knuth (1998) avoids the problem by defining the order to be the maximum number of children (which is one more than the maximum number of keys).
So, we'll have to live with this difference in terminology and always check what the author means when they use this term.

Related

Counting levels of binary tree

I have a SQL table for a binary tree:
Id | Name | Parent_ID
I want to calculate 5 levels of a complete binary tree for a parameter node which is passed to a stored procedure.
In the general case, a binary tree with n nodes will have at least 1 + floor(log_2(n)) levels. For example, you can fit 7 nodes into three levels, but 8 nodes will take at least four levels no matter what. There are particular kinds of binary trees (e.g. balanced trees) for which you can put tighter constraints on the upper bound.
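For a quick sanity check of that formula, here is a tiny Python snippet (it just re-does the arithmetic, nothing specific to the SQL question):

import math

def min_levels(n):
    # Minimum number of levels a binary tree with n nodes can have
    return 1 + math.floor(math.log2(n))

print(min_levels(7))  # 3: seven nodes fit into three full levels
print(min_levels(8))  # 4: an eighth node forces a fourth level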

Neo4j scalability and indexing

An argument in favor of graph DBMSs with native storage over relational DBMSs, made by Neo4j (also in the Neo4j graph databases book), is that "index-free adjacency" is the most efficient means of processing data in a graph (due to the 'clustering' of the data/nodes in a graph-based model).
Based on some benchmarking I've performed, where 3 nodes are sequentially connected (A->B<-C) and, given the id of A, I'm querying for C, the scalability is clearly O(n) when testing the same query on databases with 1M, 5M, 10M and 20M nodes - which is reasonable (with my limited understanding), considering I am not limiting my query to a single node, hence all nodes need to be checked for a match. HOWEVER, when I index the queried node property, the execution time for the same query is relatively constant.
The figure shows execution time by database node count before and after indexing. The orange plot is an O(N) reference line, while the blue plot shows the observed execution times.
Based on these results I'm trying to figure out where the advantage of index-free adjacency comes in. Is this advantageous when querying with a limit of 1 for deep(er) links? E.g. depth of 4 in A->B->C->D->E, and querying for E given A. Because in this case we know that there is only one match for A (hence no need to brute force through all the other nodes not part of this sub-network).
As this is highly dependent on the query, I'm listing the Cypher query below for reference (I'm matching the entity-labeled node with id 1, and returning the associated node (B in the above example) and the secondary-linked node (C in the above example)):
MATCH (:entity{id:1})-[:LINK]->(result_assoc:assoc)<-[:LINK]-(result_entity:entity) RETURN result_entity, result_assoc
UPDATE / ADDITIONAL INFORMATION
This source states: "The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n)." This statement explains the O(n) results before indexing. I guess the O(1) performance after indexing is comparable to hash-list performance(?). I am not sure why the complexity with any other index would be O(n log n), when even a hash list has a worst case of O(n).
From my understanding, the index-free aspect is only pertinent for adjacent nodes (that's why it's called index-free adjacency). What your plots demonstrate is that once you have found A, the additional time to find C is negligible; the question of whether to use an index or not only concerns finding the initially queried node A.
Finding A without an index takes O(n), because the database has to scan through all the nodes, but with an index it is effectively a hash lookup and takes O(1) (no clue why the book says O(n log n) either).
Beyond that, finding the adjacent nodes is not hard for Neo4j, because they are linked directly to A, whereas in a relational model the linkage is not as explicit - a join, which is expensive, is required, followed by a scan/filter. So to truly see the advantage, one should compare the performance of graph databases and relational databases while varying the depth of the relations/links. It would also be interesting to see how a query performs when the number of neighbours of the entity nodes increases (i.e., the graph becomes denser) - does Neo4j rely on the graph never being too dense? Otherwise the problem of looking through the neighbours to find the right one repeats itself.
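To make that concrete, here is a minimal Python model of the two lookup strategies (purely illustrative, not Neo4j's actual storage or API): only the step of locating the start node benefits from an index; following relationships afterwards is plain pointer chasing.

# Toy graph: each node stores its properties and direct references to its neighbours.
nodes = [{"id": i, "links": []} for i in range(1_000_000)]

def find_by_id_scan(target_id):
    # No index: a full scan over all nodes, O(n)
    return next(node for node in nodes if node["id"] == target_id)

# With an index (here just a dict, i.e. a hash lookup), finding the start node is O(1) on average.
index = {node["id"]: node for node in nodes}

def find_by_id_indexed(target_id):
    return index[target_id]

def two_hop_neighbours(start):
    # Index-free adjacency: once the start node is in hand, reaching B and then C
    # only costs the number of neighbours visited - no other nodes are touched.
    return [c for b in start["links"] for c in b["links"]]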

Space Complexity of IDS Algorithm

Hello, I am trying to calculate the space complexity of the IDS (iterative deepening depth-first search) algorithm, but I cannot figure out how to do it. I can't really understand how the algorithm visits the nodes of a tree, so I can't work out how much space it needs. Can you help me?
As far as I have understood, IDS works like this: starting with a limit of 0, meaning the depth of the root node in a graph (or your starting point), it performs a depth-first search until it exhausts the nodes it finds within the sub-graph defined by the limit. It then increases the limit by one and performs a depth-first search from the same starting point, but on the now bigger sub-graph defined by the larger limit. This way, IDS manages to combine the benefits of DFS with those of BFS (breadth-first search). I hope this clears up a few things.
From Wikipedia:
The space complexity of IDDFS is O(bd), where b is the branching factor and d is the depth of shallowest goal. Since iterative deepening visits states multiple times, it may seem wasteful, but it turns out to be not so costly, since in a tree most of the nodes are in the bottom level, so it does not matter much if the upper levels are visited multiple times.[2]
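Here is a minimal Python sketch of iterative deepening (illustrative names, graph given as an adjacency dict): at any moment only the current path sits on the recursion stack, which is where the O(bd) space bound comes from.

def depth_limited_dfs(graph, node, goal, limit):
    # The recursion stack never holds more than the current path (at most limit+1 frames).
    if node == goal:
        return [node]
    if limit == 0:
        return None
    for child in graph.get(node, []):
        suffix = depth_limited_dfs(graph, child, goal, limit - 1)
        if suffix is not None:
            return [node] + suffix
    return None

def iterative_deepening(graph, start, goal, max_depth=50):
    # Repeat DFS with limits 0, 1, 2, ...; upper levels are revisited, but since
    # most nodes of a tree sit in the deepest level, the extra work is cheap.
    for limit in range(max_depth + 1):
        path = depth_limited_dfs(graph, start, goal, limit)
        if path is not None:
            return path
    return None

example = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
print(iterative_deepening(example, "A", "E"))  # ['A', 'C', 'E']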

Is hierarchyid suitable for large trees with frequent insertions of leaf nodes?

We have a database that models a tree. This data can grow fairly huge, that is to say many, many millions of rows. (The primary key is actually a bigint, so I guess potentially we would like to support billions of rows, although this will probably never occur.)
A single node can have a very large number of direct children, more likely the higher up in the hierarchy it is. We have no specified limit on the actual maximum depth of a leaf, i.e. how many nodes one would have to traverse to get to the root, but in practice this would probably not grow beyond a few hundred at the very most. Normally it would probably be below 20.
Insertions in this table are very frequent and need to be high-performing. Inserted nodes are always leaf nodes, and always after the last sibling. Nodes are never moved. Deletions are always made as entire subtrees. Finding subtrees is the other operation performed on this table. It does not have the same performance requirements, but of course we would like it to be as fast as possible.
Today this is modeled with the parent/child model, which is efficient for insertions, but is painfully slow for finding subtrees. When the table grows large, this becomes extremely slow and finding a subtree may take several minutes.
So I was thinking about converting this to perhaps use the new hierarchyid type in SQL Server. But I am having trouble figuring out whether this would be suitable. As I understand it, for the operations we perform in this scenario, such a tree would be a good idea. (Please correct me if I'm wrong here.)
But the documentation also states that the maximum size for a hierarchyid is 892 bytes. However, I cannot find any information about what this means in practice. How is the hierarchyid encoded? Will I run out of hierarchyids, and if so, when?
So I did some tests and came to somewhat of a conclusion regarding the limitations of hierarchyid:
If I run for example the following code:
DECLARE @i BIGINT = 1
DECLARE @h hierarchyid = '/'
WHILE 1=1
BEGIN
    -- append one more level to the path; the implicit string-to-hierarchyid
    -- conversion fails once the value exceeds the 892-byte limit
    SET @h = @h.ToString() + '1/'
    PRINT CONVERT(nvarchar(max), @i)  -- prints the current depth
    SET @i = @i + 1
END
I will get to 1427 levels deep before I get an error. Since I am using the value 1 for each level, this ought to be the most compact tree, from which I draw the conclusion that I will never be able to create a tree with more than 1427 levels.
However, if I use for example 99999999999999 for each level (e.g. /99999999999999/99999999999999/99999999999999/...), the error occurs already at 118 levels deep. It also seems that 14 digits is the maximum for an id at each level, since it fails immediately if I use a 15-digit number.
So with this in mind, if I only use whole integer identifiers (i.e. don't insert nodes between other nodes etc.), I should be able to guarantee at least 100 levels of depth in my scenario, and at no time will I exceed much more than 1400 levels.
892 bytes does not sound like much, but the hierarchyid seems to be very space-efficient. From http://technet.microsoft.com/en-us/library/bb677290.aspx:
The average number of bits that are required to represent a node in a tree with n nodes depends on the average fanout (the average number of children of a node). For small fanouts (0-7), the size is about 6*log_A(n) bits, where A is the average fanout. A node in an organizational hierarchy of 100,000 people with an average fanout of 6 levels takes about 38 bits. This is rounded up to 40 bits, or 5 bytes, for storage.
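For what it's worth, the quoted estimate is easy to reproduce; a small Python check (just re-doing the arithmetic from the documentation):

import math

def estimated_bits(n_nodes, avg_fanout):
    # ~6 * log_A(n) bits, per the documentation quoted above (small fanouts only)
    return 6 * math.log(n_nodes, avg_fanout)

bits = estimated_bits(100_000, 6)
print(round(bits, 1))        # ~38.6 bits, matching the "about 38 bits" figure
print(math.ceil(bits / 8))   # 5 bytes once rounded up for storage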
The calculation quoted above is only given for small fanouts (0-7), which makes it hard to reason about bigger fanouts. You say 'up to a few hundred children at the most'. This (extreme) case does sound dangerous. I don't know the details of the hierarchyid encoding, but the more nodes there are at any one level, the less depth you should be able to fit in the tree within those 892 bytes.
I do see a risk here, as do you (hence the question). Do some tests. Evaluate the goals. What are you moving from? Why are you moving? Simplicity or performance?
This problem is a bad fit for SQL. Maybe you should consider other options for this part of the program?

How to do "GROUP BY" mathematically?

I have a data structure of key-value pairs, and I want to implement "GROUP BY" on the values.
Both keys and values are strings.
So what I did was give every value (string) a unique prime number. Then for every key I stored the product of the prime numbers associated with the different values that the particular key has.
So if key "Anirudh" has values "x", "y", "z", then I also store the number M(key) = 2*3*5 = 30.
Later, if I want to group by a particular value, say "x", I just iterate over all the keys and divide M(key) by the prime number associated with "x". I then check whether the remainder is 0; if it is, that particular key is part of the group for value "x".
I know that this is a weird way to do it. Some people sort the key-value pairs (sorted by value). I could also have created another table (hash table) already grouped by values. So I want to know a better method than mine (there must be many). In my method, as the number of unique values for a particular key grows, the product of primes also grows (exponentially, at that).
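For concreteness, a minimal Python sketch of the scheme described above (toy data, illustrative names):

# Assign each distinct value a unique prime, and store per key the product of the
# primes of its values - the M(key) from the description above.
primes = iter([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])  # enough for this toy example
value_to_prime = {}

def prime_for(value):
    if value not in value_to_prime:
        value_to_prime[value] = next(primes)
    return value_to_prime[value]

data = {"Anirudh": ["x", "y", "z"], "Bob": ["y"], "Carol": ["x", "z"]}
m = {}
for key, values in data.items():
    product = 1
    for v in values:
        product *= prime_for(v)
    m[key] = product          # e.g. M("Anirudh") = 2*3*5 = 30

def group_by(value):
    p = prime_for(value)
    # O(n) scan: a key belongs to the group iff its product is divisible by p
    return [key for key, product in m.items() if product % p == 0]

print(group_by("x"))  # ['Anirudh', 'Carol']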
Your method will always take O(n) to find group members, because you have to iterate through all elements of the collection to find the elements belonging to the target group. Your method also risks overflowing common integer bounds (32, 64 bits) if you have many elements, since you are multiplying a potentially large number of primes together to form your key.
You will find it more efficient, and certainly more predictable, to use a bit mask to track group membership instead of this approach. If you have 16 groups, you can represent that with a 16-bit short using a bit mask. Using primes as you suggest, you would need an integer with enough bits to hold the number 32589158477190044730 (the first 16 primes multiplied together), which would require 65 bits.
Other approaches to grouping are also O(n) for the first iteration (after all, each element must be tested at least once for group membership). However, if you tend to repeat the same group checks, the other methods you refer to (e.g. keeping a list or hash table per target group) are much more efficient, because subsequent group-membership tests are O(1).
So to directly answer your question:
If there are multiple queries for group membership (repeating some groups), any solution that stores the groups (including the ones you suggest in your question) will perform better than your method.
If there are no repeat queries for group membership, there is no advantage to storing group membership.
Given that repeat queries seem likely based on your question:
Use a structure such as a list keyed off of a group ID to store group membership if you want to trade memory to get more speed.
Use a suitably wide bit array to store group membership if you want to trade speed to use less memory (a sketch of this approach follows below).
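A minimal Python sketch of the bit-mask idea (illustrative names; a Python int serves as an arbitrarily wide bit array):

# Give each distinct value a bit position instead of a prime.
value_to_bit = {"x": 0, "y": 1, "z": 2}

data = {"Anirudh": ["x", "y", "z"], "Bob": ["y"], "Carol": ["x", "z"]}

# Per key, OR together the bits of its values; 16 distinct values fit in a 16-bit mask.
masks = {}
for key, values in data.items():
    mask = 0
    for v in values:
        mask |= 1 << value_to_bit[v]   # "Anirudh" -> 0b111
    masks[key] = mask

def group_by(value):
    bit = 1 << value_to_bit[value]
    # Still an O(n) scan, but each test is a single AND instead of a division,
    # and the mask never outgrows (number of distinct values) bits.
    return [key for key, mask in masks.items() if mask & bit]

print(group_by("x"))  # ['Anirudh', 'Carol']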
I have no real idea what is being asked here, but this sounds similar to (though much more computationally expensive than) a bit vector or a sum of powers of 2. The first value is "1", the second is "2", the third is "4", and so on. If you get "7", you know it is "first" + "second" + "third".