Hello, I am trying to calculate the space complexity of the IDS (iterative deepening depth-first search) algorithm, but I cannot figure out how to do it. I don't really understand how the algorithm visits the nodes of a tree, so I can't work out how much space it needs. Can you help me?
As far as I understand, IDS works like this: starting with a limit of 0, meaning the depth of the root node in a graph (or your starting point), it performs a depth-first search until it exhausts the nodes it finds within the sub-graph defined by the limit. It then increases the limit by one and performs a depth-first search from the same starting point, but on the now bigger sub-graph defined by the larger limit. This way, IDS combines the benefits of DFS with those of BFS (breadth-first search). I hope this clears up a few things.
From Wikipedia:
The space complexity of IDDFS is O(bd), where b is the branching factor and d is the depth of shallowest goal. Since iterative deepening visits states multiple times, it may seem wasteful, but it turns out to be not so costly, since in a tree most of the nodes are in the bottom level, so it does not matter much if the upper levels are visited multiple times.[2]
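A minimal Python sketch of the procedure described above may make the memory use easier to see. The names (depth_limited_search, iddfs, stats) and the implicit binary tree used in the example are assumptions for illustration only. The point is that only the current root-to-node path lives on the call stack at any time, so the stack holds at most d frames, each tracking up to b children of one node.

```python
def depth_limited_search(node, goal, children, limit, depth, stats):
    """Depth-first search that never descends below `limit`."""
    # Record the deepest recursion level reached (a proxy for stack size).
    stats["max_depth"] = max(stats["max_depth"], depth)
    if node == goal:
        return True
    if depth == limit:
        return False
    return any(
        depth_limited_search(child, goal, children, limit, depth + 1, stats)
        for child in children(node)
    )


def iddfs(root, goal, children, max_limit=50):
    """Restart the depth-limited search with limit = 0, 1, 2, ...

    Returns (limit at which the goal was found, deepest recursion level seen).
    """
    stats = {"max_depth": 0}
    for limit in range(max_limit + 1):
        if depth_limited_search(root, goal, children, limit, 0, stats):
            return limit, stats["max_depth"]
    return None, stats["max_depth"]


def binary_children(n):
    # Hypothetical implicit tree: node n has children 2n+1 and 2n+2.
    return (2 * n + 1, 2 * n + 2)


found_at, deepest = iddfs(0, 9, binary_children)
print(found_at, deepest)
```

In this example the goal sits at depth 3, and the recursion never goes deeper than that, even though the upper levels are revisited on every restart.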
An argument made by Neo4j (also in the Neo4j Graph Databases book) in favor of graph DBMSs with native storage over relational DBMSs is that "index-free adjacency" is the most efficient means of processing data in a graph (due to the 'clustering' of the data/nodes in a graph-based model).
Based on some benchmarking I've performed, where 3 nodes are sequentially connected (A->B<-C) and given the id of A I'm querying for C, the scalability is clearly O(n) when testing the same query on databases with 1M, 5M, 10M and 20M nodes - which is reasonable (with my limited understanding) considering I am not limiting my query to 1 node only hence all nodes need to be checked for matching. HOWEVER, when I index the queried node property, the execution time for the same query, is relatively constant.
Figure shows execution time by database node size before and after indexing. Orange plot is O(N) reference line, while the blue plot is the observed execution times.
Based on these results I'm trying to figure out where the advantage of index-free adjacency comes in. Is this advantageous when querying with a limit of 1 for deep(er) links? E.g. depth of 4 in A->B->C->D->E, and querying for E given A. Because in this case we know that there is only one match for A (hence no need to brute force through all the other nodes not part of this sub-network).
As this is highly dependent on the query, I'm listing an example of the Cypher query below for reference (where I'm matching entity labeled node with id of 1, and returning the associated node (B in the above example) and the secondary-linked node (C in the above example)):
MATCH (:entity{id:1})-[:LINK]->(result_assoc:assoc)<-[:LINK]-(result_entity:entity) RETURN result_entity, result_assoc
UPDATE / ADDITIONAL INFORMATION
This source states: "The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n).". This statement explains the O(n) results before indexing. I guess the O(1) performance after indexing is identical to a hash list performance(?). Not sure why using any other index the complexity is O(n log n) if even using a hash list the worst case is O(n).
From my understanding, the index-free aspect is only pertinent for adjacent nodes (that's why it's called index-free adjacency). What your plots demonstrate is that once you have found A, the additional time to find C is negligible, and the question of whether or not to use an index only concerns finding the initially queried node A.
Finding A without an index takes O(n), because the database has to scan through all of its nodes, but with an index the lookup is effectively a hash lookup and takes O(1) (no clue why the book says O(n log n) either).
Beyond that, finding the adjacent nodes is not that hard for Neo4j, because they are linked directly to A, whereas in the relational model the linkage is not as explicit, so an expensive join followed by a scan/filter is required. So to truly see the advantage, one should compare the performance of graph DBs and relational DBs while varying the depth of the relations/links. It would also be interesting to see how a query performs when the number of neighbours of the entity nodes increases (i.e., the graph becomes denser): does Neo4j rely on graphs never being too dense? Otherwise the problem of looking through the neighbours to find the right one repeats itself.
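The distinction between locating the start node and following its links can be sketched with a toy in-memory model. This is only an illustration under assumed names (Node, link, find_start_by_scan, co_linked_entities); it does not describe how Neo4j actually stores or indexes data.

```python
class Node:
    def __init__(self, node_id, label):
        self.id = node_id
        self.label = label
        self.out = []  # direct references to nodes this node links to
        self.inc = []  # direct references to nodes that link to this node


def link(src, dst):
    """Create a LINK relationship src -> dst, stored on both endpoints."""
    src.out.append(dst)
    dst.inc.append(src)


def find_start_by_scan(nodes, node_id):
    """No index: check every node, so the cost grows with the database size."""
    return next((n for n in nodes if n.label == "entity" and n.id == node_id), None)


def co_linked_entities(start):
    """Follow stored references only; the cost depends on the neighbourhood size."""
    return [(other, assoc)
            for assoc in start.out
            for other in assoc.inc
            if other is not start]


a, b, c = Node(1, "entity"), Node(100, "assoc"), Node(2, "entity")
link(a, b)
link(c, b)
nodes = [a, b, c]

# With an index (here a plain dict), finding the start node is a hash lookup.
index = {(n.label, n.id): n for n in nodes}

start = index[("entity", 1)]
print([(e.id, s.id) for e, s in co_linked_entities(start)])
```

Adding millions of unrelated nodes to `nodes` would slow down `find_start_by_scan` but would not change the work done by `co_linked_entities`, which is consistent with the flat timings observed after indexing.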
The question goes like this:
Given an array of n elements where n^(2001/2002) elements are the same. Worst case time complexity of sorting the array (with RAM model assumptions) will be:
So, I thought of using a selection algorithm to find the n^(2001/2002)-th smallest element, call it P. This should take O(n). Next, I take every element which doesn't equal this element and put it in another array. In total I will have k = n - n^(2001/2002) elements. Sorting this array will cost O(k log k), which equals O(n log n). Finally, I will find the max element which is smaller than P and the min element which is bigger than P, and then I can sort the array.
All of it takes O(nlogn).
Note: if , then we can reduce the time to O(n).
I have two questions: is my analysis correct? Is there any way to reduce the time complexity? Also, what are the RAM model assumptions?
Thanks!
Your analysis is wrong - there is no guarantee that the n^(2001/2002)th-smallest element is actually one of the duplicates.
n^(2001/2002) duplicates simply don't constitute enough of the input to make things easier, at least in theory. Sorting the input is still at least as hard as sorting the n - n^(2001/2002) = Θ(n) other elements, and under standard comparison sort assumptions in the RAM model, that takes Ω(n*log(n)) worst-case time.
(For practical input sizes, n^(2001/2002) duplicates would be at least 98% of the input, so isolating the duplicates and sorting the rest would be both easy and highly efficient. This is one of those cases where the asymptotic analysis doesn't capture the behavior we care about in practice.)
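The gap between the asymptotic and the practical picture can be checked numerically. The fraction of the input taken up by the duplicates is n^(2001/2002) / n = n^(-1/2002); the helper name below is an assumption for illustration.

```python
def duplicate_fraction(n: float) -> float:
    """Fraction of an n-element input occupied by n**(2001/2002) duplicates."""
    return n ** (-1 / 2002)  # same as n**(2001/2002) / n


for n in (1e6, 1e9, 1e12):
    print(f"n = {n:.0e}: {duplicate_fraction(n):.4f}")
```

The fraction tends to 0 as n grows, which is why the asymptotic bound is unaffected, but it shrinks so slowly that it stays above 0.98 for the sizes printed here.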
The Redis documentation states that insert and update operations on sorted sets are O(log(n)).
On this question they specify more details about the underlying data structure, the skip list.
However, there are a few special cases that depend on the Redis implementation, with which I'm not familiar:
1. Adding at the head or tail of the sorted set will probably not be an O(log(n)) operation, but O(1), right? This question seems to agree, with reservations.
2. Updating the score of a member, even if the ordering doesn't change, is still O(log(n)), either because you take the element out and insert it again with the slightly different score, or because you have to check that the ordering doesn't change, so that insert and update differ only by constant-time operations. Right? I really hope I'm wrong in this case.
Any insights will be most welcome.
Note: skip lists will be used once the list grows above a certain size (max_ziplist_entries), below that size a zip list is used.
Re. 1st question - I believe that it would still be O(log(n)), since a skip list is a type of binary tree, so there's no assurance where the head/tail nodes are.
Re. 2nd question - according to the source, changing the score is implemented by removing and re-adding the member: https://github.com/antirez/redis/blob/209f266cc534471daa03501b2802f08e4fca4fe6/src/t_zset.c#L1233 & https://github.com/antirez/redis/blob/209f266cc534471daa03501b2802f08e4fca4fe6/src/t_zset.c#L1272
In a skip list, when you insert a new element at the head or tail, you still need to update O(log n) levels. The previous head or tail can have O(log n) levels, and each may have pointers which need to be updated.
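A simplified skip list shows why an insert at the head is not a single pointer change. This is a generic textbook-style sketch, not Redis's implementation (which also maintains backward pointers and spans); the class names and the explicit `level` argument are assumptions made so the example is deterministic.

```python
import random


class SkipNode:
    def __init__(self, score, level):
        self.score = score
        self.forward = [None] * level  # one next-pointer per level


class SkipList:
    MAX_LEVEL = 16

    def __init__(self, seed=0):
        self.head = SkipNode(float("-inf"), self.MAX_LEVEL)  # sentinel node
        self.level = 1
        self.rng = random.Random(seed)

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def insert(self, score, level=None):
        """Insert `score`; return how many predecessor pointers were rewired."""
        level = level or self._random_level()
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        # Walk down from the top level, remembering the predecessor per level.
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].score < score:
                node = node.forward[i]
            update[i] = node
        self.level = max(self.level, level)
        new = SkipNode(score, level)
        for i in range(level):  # one pointer splice per level of the new node
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new
        return level

    def scores(self):
        out, node = [], self.head.forward[0]
        while node is not None:
            out.append(node.score)
            node = node.forward[0]
        return out


sl = SkipList()
for s in (10, 20, 30):
    sl.insert(s, level=1)
rewired_head = sl.insert(5, level=4)   # new smallest element, four levels tall
rewired_tail = sl.insert(40, level=3)  # new largest element, three levels tall
print(rewired_head, rewired_tail, sl.scores())
```

Even when the new element lands at the very front, every level it participates in has to be spliced, and a tail insert additionally has to walk the list to find its predecessors.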
Already answered by @itamar-haber
We have a database that models a tree. This data can grow fairly huge, that is to say many, many millions of rows. (The primary key is actually a bigint, so I guess potentially we would like to support billions of rows, although this is probably never going to occur.)
A single node can have a very large amount of direct children, more likely the higher up in the hierarchy they are. We have no specified limit to the actual maximum depth of a leaf, i.e. how many nodes one would have to traverse to get to the root, but in practice this would probably normally not grow beyond a few hundred at the very most. Normally it would probably be below 20.
Insertions into this table are very frequent and need to be high performing. Inserted nodes are always leaf nodes, and are always placed after the last sibling. Nodes are never moved. Deletions are always made as entire subtrees. Finding subtrees is the other operation made on this table. It does not have the same performance requirements, but of course we would like it to be as fast as possible.
Today this is modeled with the parent/child model, which is efficient for insertions, but is painfully slow for finding subtrees. When the table grows large, this becomes extremely slow and finding a subtree may take several minutes.
So I was thinking about converting this to perhaps use the new hierarchyid type in SQL Server. But I am having trouble figuring out whether this would be suitable. As I understand it, for the operations we perform in this scenario, such a tree would be a good idea. (Please correct me if I'm wrong here.)
But the documentation also states that the maximum size for a hierarchyid is 892 bytes. However, I cannot find any information about what this means in practice. How is the hierarchyid encoded? Will I run out of hierarchyids, and if so, when?
So I did some tests and came to somewhat of a conclusion regarding the limitations of hierarchyid:
If I run for example the following code:
DECLARE @i BIGINT = 1
DECLARE @h hierarchyId = '/'
WHILE 1=1
BEGIN
SET @h = @h.ToString() + '1/'
PRINT CONVERT(nvarchar(max), @i)
SET @i = @i+1
END
I will get to 1427 levels deep before I get an error. Since I am using the value 1 for each level, this ought to be the most compact tree, from which I draw the conclusion that I will never be able to create a tree with more than 1427 levels.
However, if I use for example 99999999999999 for each level (e.g. /99999999999999/99999999999999/99999999999999/...), the error already occurs at 118 levels deep. It also seems that 14 digits is the maximum for an id at each level, since it fails immediately if I use a 15-digit number.
So with this in mind, if I only use whole integer identifiers (i.e. don't insert nodes between other nodes etc.) I should be able to guarantee up to at least 100 levels deep in my scenario, and at no time will I be able to exceed much more than 1400 levels.
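As a rough cross-check of these measurements, the observed depths can be turned into an average number of bits per level. This is only back-of-the-envelope arithmetic, assuming the error appears exactly when the 892 bytes are used up and ignoring any fixed overhead or padding in the encoding; the names are hypothetical.

```python
MAX_BYTES = 892
MAX_BITS = MAX_BYTES * 8  # 7136 bits in total


def bits_per_level(levels_reached: int) -> float:
    """Average bits consumed per level if the whole budget was spent."""
    return MAX_BITS / levels_reached


print(f"{bits_per_level(1427):.2f}")  # each level labelled 1
print(f"{bits_per_level(118):.2f}")   # each level labelled 99999999999999
```

Under that assumption, a level labelled 1 costs roughly 5 bits, while a 14-digit label costs roughly 60 bits, which matches the intuition that small ordinals are encoded very compactly.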
892 bytes does not sound like much, but the hierarchy id seems to be very efficient, space-wise. From http://technet.microsoft.com/en-us/library/bb677290.aspx:
The average number of bits that are required to represent a node in a tree with n nodes depends on the average fanout (the average number of children of a node). For small fanouts (0-7), the size is about 6*log_A(n) bits, where A is the average fanout. A node in an organizational hierarchy of 100,000 people with an average fanout of 6 levels takes about 38 bits. This is rounded up to 40 bits, or 5 bytes, for storage.
The calculation given says it applies only to small fanouts (0-7), which makes it hard to reason about bigger fanouts. You say 'up to a few hundred children at the most'. This (extreme) case does sound dangerous. I don't know the spec of hierarchyid, but the more nodes there are at any one level, the less depth you should be able to have in the tree within those 892 bytes.
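The figure in the quoted passage can be reproduced directly, reading the formula as 6 times the base-A logarithm of n. The function name is an assumption, and the estimate is only claimed by the documentation for small fanouts.

```python
import math


def approx_hierarchyid_bits(n_nodes: int, avg_fanout: float) -> float:
    """Documented estimate for small fanouts: 6 * log base avg_fanout of n_nodes."""
    return 6 * math.log(n_nodes, avg_fanout)


bits = approx_hierarchyid_bits(100_000, 6)
print(f"{bits:.1f} bits, stored in {math.ceil(bits / 8)} bytes")
```

For 100,000 nodes and a fanout of 6 this gives a little over 38 bits, which rounds up to 5 bytes as stated in the quote.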
I do see a risk here, as do you (hence the question). Do some tests. Evaluate the goals. What are you moving from? Why are you moving? Simplicity or performance?
This problem is a bad fit for SQL. Maybe you should consider other options for this part of the program?
I've been thinking about the modified preorder tree traversal algorithm for storing trees within a flat table (such as SQL).
One property I dislike about the standard approach is that to insert a node you have to touch (on average) N/2 of the nodes (everything with left or right higher than the insert point).
The implementations I've seen rely on sequentially numbered values. This leaves no room for updates.
This seems bad for concurrency and scaling. Imagine you have a tree rooted at the world containing user groups for every account in a large system, it's extremely large, to the point you must store subsets of the tree on different servers. Touching half of all the nodes to add a node to the bottom of the tree is bad.
Here is the idea I was considering. Basically leave room for inserts by partitioning the keyspace and dividing at each level.
Here's an example with Nmax = 64 (this would normally be the MAX_INT of your DB)
               0:64
        ________|________
       /                 \
     1:31               32:63
    /    \             /     \
 2:14   15:30       33:47   48:62
Here, a node is added to the left half of the tree.
                 0:64
          ________|________
         /                 \
       1:31               32:63
    /   |    \            /     \
 2:11 11:20 21:30      33:47   48:62
The algorithm must be extended so that the insert and removal processes recursively renumber the left/right indexes for the subtree. Since querying for immediate children of a node is complicated, I think it makes sense to also store the parent id in the table. The algorithm can then select the subtree (using left > p.left && right < p.right), then use node.id and node.parent to work through the list, subdividing the indexes.
This is more complex than just incrementing all the indexes to make room for the insert (or decrementing for removal), but it has the potential to affect far fewer nodes (only descendants of the parent of the inserted/removed node).
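The subdivision step could look roughly like the following. This is one possible scheme under assumed names (subdivide is hypothetical), splitting the parent's range evenly; the exact boundaries it produces differ slightly from the hand-drawn diagrams above.

```python
def subdivide(left: int, right: int, n_children: int):
    """Split the range strictly inside (left, right) into n_children disjoint intervals."""
    lo, hi = left + 1, right - 1          # children must nest strictly inside the parent
    width = (hi - lo + 1) // n_children   # key slots available to each child
    if width < 2:
        # Each child needs distinct left and right values.
        raise ValueError("no room left; a wider renumbering is needed")
    return [(lo + i * width, lo + (i + 1) * width - 1) for i in range(n_children)]


print(subdivide(0, 64, 2))  # two children of the root
print(subdivide(1, 31, 3))  # left subtree after a third child is added
```

Only the children of the affected parent (and, recursively, their subtrees) get new values, so an insert touches the descendants of one node rather than about half the table.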
My question(s) are basically:
Has this idea been formalized or implemented?
Is this the same as nested intervals?
I have heard of people doing this before, for the same reasons, yes.
Note that you do lose a couple of small advantages of the algorithm by doing this:
normally, you can tell the number of descendants of a node by ((right - left - 1) div 2). This can occasionally be useful, e.g. if you're displaying a count in a treeview which should include the number of children to be found further down in the tree.
Flowing from the above, it's easy to select out all leaf nodes -- WHERE (right = left + 1).
These are fairly minor advantages and may not be useful to you anyway, though for some usage patterns they're obviously handy.
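Both properties can be checked on a small tree numbered the standard, gap-free way. The helper name and the example tree are assumptions. With sequential numbering a node's descendant count works out to (right - left - 1) // 2; using + 1 instead counts the node itself as well.

```python
def number_tree(tree, root):
    """Assign sequential (left, right) values by a preorder walk."""
    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        left = counter
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        bounds[node] = (left, counter)
        counter += 1

    visit(root)
    return bounds


# Hypothetical example tree: root -> a, b ; a -> a1, a2
tree = {"root": ["a", "b"], "a": ["a1", "a2"]}
bounds = number_tree(tree, "root")

descendants = {n: (r - l - 1) // 2 for n, (l, r) in bounds.items()}
leaves = sorted(n for n, (l, r) in bounds.items() if r == l + 1)
print(bounds["root"], descendants["root"], leaves)
```

Once gaps are left between the values, neither shortcut holds any more, which is the trade-off described above.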
That said, it does sound like materialized paths may be more useful to you, as suggested above.
I think you're better off looking at a different way of storing trees. If your tree is broad but not terribly deep (which seems likely for the case you suggested), you can store the complete list of ancestors up to the root against each node. That way, modifying a node doesn't require touching any nodes other than the node being modified.
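A minimal in-memory sketch of the ancestor-list idea, with a dict standing in for the table and hypothetical function names: each node stores the full path of ids from the root down to itself.

```python
def insert_leaf(paths, parent_id, node_id):
    """Adding a leaf writes one new row; no existing row is modified."""
    paths[node_id] = paths[parent_id] + (node_id,)


def subtree(paths, node_id):
    """All nodes whose stored path starts with the path of node_id."""
    prefix = paths[node_id]
    return sorted(n for n, p in paths.items() if p[:len(prefix)] == prefix)


paths = {1: (1,)}          # node 1 is the root
insert_leaf(paths, 1, 2)
insert_leaf(paths, 1, 3)
insert_leaf(paths, 2, 4)

print(subtree(paths, 2))
print(len(paths[4]) - 1)   # depth of node 4
```

In a real table the prefix test would typically be an indexed range or prefix query on the stored path rather than a scan over every row.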
You can split your table into two: the first is (node ID, node value), the second (node ID, child ID), which stores all the edges of the tree. Insertion and deletion then become O(tree depth) (you have to navigate to the element and fix what is below it).
The solution you propose looks like a B-tree. If you can estimate the total number of nodes in your tree, then you can choose the depth of the tree beforehand.