Adjacency list tree - how to prevent circular references? - sql

I have an adjacency list in a database with ID and ParentID to represent a tree structure:
-a
--b
---c
-d
--e
Of course in a record the ParentID should never be the same as ID, but I also have to prevent circular references to prevent an endless loop. These circular references could in theory involve more than 2 records. ( a->b, b->c, c->a , etc.)
For each record I store the paths in a string column like this :
a a
b a/b
c a/b/c
d d
e d/e
My question is now :
when inserting/updating, is there a way to check if a circular reference would occur?
I should add that I know all about the nested set model, etc. I chose the adjacency method with stored paths because I find it much more intuitive. I got it working with triggers and a separate paths table, and it works like a charm, except for the possible circular references.

If you're storing the path like that, you could add a check that the new parent's path does not already contain the node's id.
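A minimal sketch of that check, assuming the paths use "/" as the separator as in the question. The function name would_create_cycle and the paths dictionary are made up for illustration; in practice the same comparison would live in the trigger.

```python
def would_create_cycle(node_id, new_parent_path, sep="/"):
    """Return True if putting node_id under a parent with this path makes a loop."""
    # Compare whole path segments so that id "a" does not match a segment like "ab".
    return node_id in new_parent_path.split(sep)


# Stored paths from the question: node id -> path from the root.
paths = {"a": "a", "b": "a/b", "c": "a/b/c", "d": "d", "e": "d/e"}

print(would_create_cycle("a", paths["c"]))  # True: c is already below a
print(would_create_cycle("c", paths["e"]))  # False: e is in a different branch
```

The same test also rejects ParentID = ID, because a node's own path always ends with its own id.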

If you are using Oracle you can implement a check for cycles using the CONNECT BY syntax. The count of nodes should be equal to the count of descendants from the root node.
CHECK (
(SELECT COUNT(*) Nodes
FROM Tree) =
(SELECT COUNT(*) Descendants
FROM Tree
START WITH parent_node IS NULL -- Root Node
CONNECT BY parent_node = PRIOR child_node))
Note, you will still need other checks to enforce the tree, i.e.:
Single root node with null.
Node can have exactly one parent.
You cannot create a check constraint with a subquery, so this will need to go to a view or trigger.
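If you are not on Oracle, the same count comparison can be sketched with a recursive CTE. This is only an illustration run against SQLite from Python; the Tree(child_node, parent_node) layout follows the query above, and the helper names make_tree and tree_is_acyclic are made up.

```python
import sqlite3


def make_tree(rows):
    """Build an in-memory Tree table from (child_node, parent_node) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Tree (child_node TEXT PRIMARY KEY, parent_node TEXT)")
    conn.executemany("INSERT INTO Tree VALUES (?, ?)", rows)
    return conn


def tree_is_acyclic(conn):
    """True when every row is reachable from a root (parent_node IS NULL)."""
    total = conn.execute("SELECT COUNT(*) FROM Tree").fetchone()[0]
    reachable = conn.execute(
        """
        WITH RECURSIVE descendants(child_node) AS (
            SELECT child_node FROM Tree WHERE parent_node IS NULL
            UNION  -- UNION rather than UNION ALL, so duplicate rows are dropped
            SELECT t.child_node
            FROM Tree AS t
            JOIN descendants AS d ON t.parent_node = d.child_node
        )
        SELECT COUNT(*) FROM descendants
        """
    ).fetchone()[0]
    return total == reachable


good = make_tree([("a", None), ("b", "a"), ("c", "b"), ("d", None), ("e", "d")])
bad = make_tree([("a", None), ("b", "c"), ("c", "b")])  # b and c point at each other

print(tree_is_acyclic(good))  # True
print(tree_is_acyclic(bad))   # False
```

Rows that sit in a loop are never reached from a root, so the two counts differ as soon as a cycle exists.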

Related

Can I convert this cursor and while loop to a set based solution?

I am currently writing a script where I have a tree as well as a set of known parent nodes (none of which are the root node) and a set of known child nodes. For each child node, I have to find a direct descendant of one of the parent nodes that is also a parent of the child node. For each child node, only one such value exists, but there could be any number of nodes between each child node and its corresponding target.
What I have now is a cursor that iterates through each child node and uses a while loop to travel up the tree until it finds a node with a parent in the set of parent nodes, and that is the match. My question is, can I solve this without a cursor or the while loop in a set-based way? I'm not a sql expert, but could not come up with a way to do this using merges or joins.
When working with apparently difficult tree problems, it is often useful to build an "ancestors table". This isn't just a SQL thing, it's a common tool used when dealing with hierarchies.
An ancestors table contains all the connections between the various nodes. So if you have a graph with root A, B as a child of A, and C as a child of B, your ancestors table contains a row for the connection from B to A, and a row for the connection from C to B, and a row for the connection from C to A, and then optionally a "root" row (from A to A with a length of zero).
Once you have such a table most problems become a lot easier to formulate. For example, your problem would turn into a fairly straightforward set of joins to do the following:
Find the set of rows R1(parent, child, length) in Ancestors where R1.parent is a KnownParent and the path length is 1 (this gives you the direct descendants of KnownParents), and then find the set of rows R2(parent, child) in Ancestors where R2.parent = R1.child and R2.child is a KnownChild.
Generating an ancestors table can be done with a recursive CTE, as mentioned by HABO. There's an existing Stack Overflow answer about that here
An ancestors table isn't the only way to answer this question, but it's such a useful thing to learn I suggest using one. You don't have to persist the ancestors of course, just join directly to the output of the recursive cte.
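Here is a rough sketch of that approach, run against SQLite from Python just so it is easy to try. The table names node, known_parent and known_child and the sample ids are assumptions made for the example, and the ancestors CTE leaves out the optional zero-length "root" rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (id TEXT PRIMARY KEY, parent_id TEXT);
    CREATE TABLE known_parent (id TEXT PRIMARY KEY);
    CREATE TABLE known_child (id TEXT PRIMARY KEY);
    INSERT INTO node VALUES
        ('R', NULL), ('P1', 'R'), ('P2', 'R'),
        ('X', 'P1'), ('Y', 'X'), ('C1', 'Y'),
        ('Z', 'P2'), ('C2', 'Z');
    INSERT INTO known_parent VALUES ('P1'), ('P2');
    INSERT INTO known_child VALUES ('C1'), ('C2');
    """
)

QUERY = """
WITH RECURSIVE ancestors(parent, child, length) AS (
    -- direct parent/child links
    SELECT parent_id, id, 1 FROM node WHERE parent_id IS NOT NULL
    UNION ALL
    -- extend each link one step further up the tree
    SELECT n.parent_id, a.child, a.length + 1
    FROM ancestors AS a
    JOIN node AS n ON n.id = a.parent
    WHERE n.parent_id IS NOT NULL
)
SELECT r2.child AS known_child, r1.child AS target
FROM ancestors AS r1
JOIN known_parent AS kp ON kp.id = r1.parent
JOIN ancestors AS r2 ON r2.parent = r1.child
JOIN known_child AS kc ON kc.id = r2.child
WHERE r1.length = 1
ORDER BY r2.child
"""

matches = conn.execute(QUERY).fetchall()
print(matches)  # [('C1', 'X'), ('C2', 'Z')]
```

Note that without the zero-length rows, a known child that is itself a direct child of a known parent produces no match.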

Recursive subdirectory SQL problem

This is a mental exercise that has been bothering me for a while. What strategy would you use to solve this sort of problem?
Let's consider the following simple database structure. We have directories, obviously a tree of them. Also we have content items, which always reside in some directories.
create table directory (
directoryId integer generated always as identity primary key,
parentId integer default null,
directoryName varchar(100)
);
create table content (
contentId integer generated always as identity primary key,
directory integer references directory(directoryId),
contentTitle varchar(100),
contentText varchar(32000)
);
Now let's assume that our directory tree is massive and the amount of content is massive. The solution must scale well.
The main problem: How to efficiently retrieve all content items that are found from the specified directory and its subdirectories?
The way I see it, SQL cannot easily be used to get all the directoryIds for a subselect. Am I correct?
One could solve this on the application side with a simple recursive loop. That might actually become very heavy, though, and require tricky caching, especially to guarantee reasonable first access times.
One could also perhaps build a materialized query table and add multi-dimensional indexes dynamically for it. Possible but an implementation mess. Too complex.
My favorite solution by far would probably be to add a new table like
create table subdirectories (
directoryId integer,
subdirectoryId integer,
constraint thekey primary key (directoryId,subdirectoryId)
)
and make sure I would always update it manually when directories are being moved/deleted/created. Thus I could always do a select with the directoryId and get all Ids for subdirectories, including as a subselect for more complex queries. I also like the fact that the rdbms is able to optimize the queries well.
What do you guys think?
In SQL Server 2005, PostgreSQL 8.4 and Oracle 11g:
WITH
-- uncomment the next line in PostgreSQL
-- RECURSIVE
q AS
(
SELECT directoryId
FROM directory
WHERE directoryId = 1
UNION ALL
SELECT d.directoryId
FROM q
JOIN directory d
ON d.parentId = q.directoryId
)
SELECT c.*
FROM q
JOIN content c
ON c.directory = q.directoryId
In Oracle before 11g:
SELECT c.*
FROM (
SELECT directoryId
FROM directory
START WITH
directoryId = 1
CONNECT BY
parentId = PRIOR directoryId
) q
JOIN content c
ON c.directory = q.directoryId
For PostgreSQL 8.3 and below see this article:
Hierarchical queries in PostgreSQL
For MySQL, see this article:
Hierarchical queries in MySQL
This is a standard -- and well understood -- "hard problem" in SQL.
All arc-node graph theory problems are hard because they involve transitive relationships.
There are standard solutions.
Loops using an explicit stack to manage a list of unvisited nodes of the tree.
Recursion. This is remarkably efficient. It doesn't "require tricky caching". It's really simple and really effective. The recursion stack is the list of unvisited nodes.
Creating a "transitive closure" of the directory tree.
SQL extensions to handle transitive relationships like a directory tree.
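To make the "transitive closure" option a bit more concrete, here is a small sketch of keeping the question's subdirectories table up to date when a directory is created. It runs on SQLite from Python purely for illustration. Storing a row for each directory pointing at itself is my assumption (it keeps the insert simple), and add_directory and subtree_ids are made-up helper names. Moves and deletes need similar bookkeeping and are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE directory (
        directoryId INTEGER PRIMARY KEY,
        parentId INTEGER,
        directoryName TEXT
    );
    CREATE TABLE subdirectories (
        directoryId INTEGER,
        subdirectoryId INTEGER,
        PRIMARY KEY (directoryId, subdirectoryId)
    );
    """
)


def add_directory(conn, directory_id, parent_id, name):
    conn.execute("INSERT INTO directory VALUES (?, ?, ?)", (directory_id, parent_id, name))
    # Assumption: every directory also lists itself as its own subdirectory.
    conn.execute("INSERT INTO subdirectories VALUES (?, ?)", (directory_id, directory_id))
    # Every directory that contains the parent now contains the new directory too.
    conn.execute(
        "INSERT INTO subdirectories (directoryId, subdirectoryId) "
        "SELECT directoryId, ? FROM subdirectories WHERE subdirectoryId = ?",
        (directory_id, parent_id),
    )


def subtree_ids(conn, directory_id):
    rows = conn.execute(
        "SELECT subdirectoryId FROM subdirectories "
        "WHERE directoryId = ? ORDER BY subdirectoryId",
        (directory_id,),
    )
    return [row[0] for row in rows]


add_directory(conn, 1, None, "root")
add_directory(conn, 2, 1, "docs")
add_directory(conn, 3, 2, "drafts")
add_directory(conn, 4, 1, "src")

print(subtree_ids(conn, 1))  # [1, 2, 3, 4]
print(subtree_ids(conn, 2))  # [2, 3]
```

With this table in place, fetching the content of a whole subtree is a plain join on subdirectoryId, with no recursion at query time.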

What data structure should I use to track dependency?

I have a bunch of tables in a relational database which, obviously, are dependent upon one another due to foreign key relationships. I want to build a dependency tree, traverse it, and output INSERT SQL statements. I need to output the SQL for the foreign key tables in my dependency tree first, because parent tables will depend on values from their foreign key identifier tables.
Does a binary tree, traversed in postorder, seem suitable for this task?
Take a look at the following:
Microsoft.SqlServer.Management.Smo.Server
Microsoft.SqlServer.Management.Smo.Database
Microsoft.SqlServer.Management.Smo.Scripter
Microsoft.SqlServer.Management.Smo.DependencyTree
Microsoft.SqlServer.Management.Smo.DependencyWalker
Microsoft.SqlServer.Management.Smo.DependencyCollection
Microsoft.SqlServer.Management.Smo.DependencyCollectionNode
There are examples on MSDN on how to use all this.
Essentially you want something like
Server server = new Server(SOURCESERVER);
Database database = server.Databases[SOURCEDATABASE];
Scripter sp = new Scripter(server);
...
UrnCollection col = new UrnCollection();
foreach (Table table in database.Tables)
{
col.Add(table.Urn);
}
....
DependencyTree tree = sp.DiscoverDependencies(col, DependencyType.Parents);
DependencyWalker walker = new DependencyWalker(server);
DependencyCollection depends = walker.WalkDependencies(tree);
//Iterate over each table in DB in dependent order...
foreach (DependencyCollectionNode dcn in depends)
...
If a table can be dependent on more than two tables, a binary tree will be insufficient.
Let table A be dependent on tables B, C and D. Then you would have to insert into B, C and D first, i.e. A should have three child nodes in your tree.
I think you need to use a more general tree structure which allows an arbitrary number of child nodes. Traversing this tree structure in post-order should yield the desired results, as you suggested.
Things will start to get messy when your dependency graph contains cycles and you need to defer constraint checking ;)
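If you are not tied to SMO, the ordering itself is just a topological sort of the dependency graph, which gives the same result as the post-order walk described above. A small sketch using Python's standard graphlib module (Python 3.9+); the table names are invented for the example, and insert_order is a made-up helper.

```python
from graphlib import CycleError, TopologicalSorter


def insert_order(depends_on):
    """depends_on maps each table to the set of tables it references."""
    # static_order() yields every table after all the tables it depends on.
    return list(TopologicalSorter(depends_on).static_order())


depends_on = {
    "order_line": {"orders", "product"},
    "orders": {"customer"},
    "product": set(),
    "customer": set(),
}

order = insert_order(depends_on)
print(order)  # customer and product come before orders, order_line comes last

try:
    insert_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected: constraint checking would have to be deferred")
```

A table can reference any number of other tables here, which is the point made above about a binary tree not being enough.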

How to find all nodes in a subtree in a recursive SQL query?

I have a table which defines a child-parent relationship between nodes:
CREATE TABLE node ( -- pseudo code alert
id INTEGER PRIMARY KEY,
parentID INTEGER -- should be a valid id.
)
If parentID always points to a valid existing node, then this will naturally define a tree structure.
If the parentID is NULL then we may assume that the node is a root node.
How would I:
Find all the nodes which are descendants of a given node?
Find all the nodes under a given node to a specific depth?
I would like to do each of these as a single SQL (I expect it would necessarily be recursive) or two mutually recursive queries.
I'm doing this in an ODBC context, so I can't rely on any vendor specific features.
Edit
No tables are written yet, so adding extra columns/tables is perfectly acceptable.
The tree will potentially be updated and added to quite often; auxiliary data structures/tables/columns would be possible, though they need to be kept up-to-date.
If you have any magic books you reach for for this kind of query, I'd like to know.
Many thanks.
This link provides a tutorial on both the Adjacency List Model (as described in the question), and the Nested Set Model. It is written as part of the documentation for MySQL.
What is not discussed in that article is insertion/deletion time and the maintenance cost of the two approaches. For example:
a dynamically grown tree using the Nested Set Model would seem to need some maintenance to maintain the nesting (e.g. renumbering all left and right set numbers)
removal of a node in the adjacency list model would require updates in at least one other row.
If you have any magic books you reach for for this kind of query, I'd like to know.
Celko's Trees and Hierarchies in SQL For Smarties
Store the entire "path" from the root node's ID in a separate column, being sure to use a separator at the beginning and end as well. E.g. let's say 1 is the parent of 5, which is the parent of 17, and your separator character is dash, you would store the value -1-5-17- in your path column.
Now to find all children of 5 you can simply select records where the path includes -5-
The separators at the ends are necessary so you don't need to worry about ID's that are at the leftmost or rightmost end of the field when you use LIKE.
As for your depth issue, if you add a depth column to your table indicating the current nesting depth, this becomes easy as well. You look up your starting node's depth and then you add x to it where x is the number of levels deep you want to search, and you filter out records with greater depth than that.
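A quick sketch of both queries, using the dash-separated path and a depth column as described. It runs on SQLite from Python only to keep it self-contained; the path and depth column names and the descendants helper are assumptions for the example. The node itself is excluded, since its own path also contains its id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parentID INTEGER, path TEXT, depth INTEGER)"
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?, ?)",
    [
        (1, None, "-1-", 0),
        (5, 1, "-1-5-", 1),
        (15, 1, "-1-15-", 1),
        (17, 5, "-1-5-17-", 2),
        (51, 15, "-1-15-51-", 2),
        (170, 17, "-1-5-17-170-", 3),
    ],
)


def descendants(conn, node_id, max_levels=None):
    """Ids below node_id, optionally limited to max_levels levels down."""
    sql = "SELECT id FROM node WHERE path LIKE ? AND id <> ?"
    params = [f"%-{node_id}-%", node_id]
    if max_levels is not None:
        # Look up the starting depth, then cap the depth of the results.
        start_depth = conn.execute(
            "SELECT depth FROM node WHERE id = ?", (node_id,)
        ).fetchone()[0]
        sql += " AND depth <= ?"
        params.append(start_depth + max_levels)
    sql += " ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]


print(descendants(conn, 5))                # [17, 170]
print(descendants(conn, 5, max_levels=1))  # [17]
```

Nodes 15 and 51 are not returned for node 5, which is exactly what the separators at both ends are for.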

Optimized SQL for tree structures

How would you get tree-structured data from a database with the best performance? For example, say you have a folder hierarchy in a database, where each folder row has ID, Name and ParentID columns.
Would you use a special algorithm to get all the data at once, minimizing the amount of database-calls and process it in code?
Or would you make many calls to the database and sort of get the structure done from the database directly?
Maybe there are different answers based on x amount of database-rows, hierarchy-depth or whatever?
Edit: I use Microsoft SQL Server, but answers out of other perspectives are interesting too.
It really depends on how you are going to access the tree.
One clever technique is to give every node a string id, where the parent's id is a predictable substring of the child. For example, the parent could be '01', and the children would be '0100', '0101', '0102', etc. This way you can select an entire subtree from the database at once with:
SELECT * FROM treedata WHERE id LIKE '0101%';
Because the criterion is an initial substring, an index on the ID column would speed the query.
Of all the ways to store a tree in an RDBMS, the most common are adjacency lists and nested sets. Nested sets are optimized for reads and can retrieve an entire tree in a single query. Adjacency lists are optimized for writes and can be added to with a simple query.
With adjacency lists, each node has a column that refers to the parent node or the child node (other links are possible). Using that, you can build the hierarchy based on parent-child relationships. Unfortunately, unless you restrict your tree's depth or your database supports recursive queries, you cannot pull the whole thing in one query, and reading it is usually slower than updating it.
With the nested set model the inverse is true: reading is fast and easy, but updates get complex because you must maintain the numbering system. The nested set model encodes both parentage and sort order by enumerating all of the nodes using a preorder-based numbering system.
I've used the nested set model, and while it is complex, for read-optimizing a large hierarchy it is worth it. Once you do a few exercises in drawing out the tree and numbering the nodes you should get the hang of it.
My research on this method started at this article: Managing Hierarchical Data in MySQL.
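To give an idea of what the read side looks like, here is a tiny nested set example. It runs on SQLite from Python just so it can be tried as-is; the folder table, the lft/rgt column names and the numbers are assumptions I made up by numbering a five-node tree in preorder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE folder (name TEXT PRIMARY KEY, lft INTEGER, rgt INTEGER)")
# root(1..10) contains A(2..7) and B(8..9); A contains A1(3..4) and A2(5..6).
conn.executemany(
    "INSERT INTO folder VALUES (?, ?, ?)",
    [("root", 1, 10), ("A", 2, 7), ("A1", 3, 4), ("A2", 5, 6), ("B", 8, 9)],
)


def subtree(conn, name):
    """The named folder and everything below it, in tree (preorder) order."""
    rows = conn.execute(
        """
        SELECT child.name
        FROM folder AS parent
        JOIN folder AS child ON child.lft BETWEEN parent.lft AND parent.rgt
        WHERE parent.name = ?
        ORDER BY child.lft
        """,
        (name,),
    )
    return [row[0] for row in rows]


print(subtree(conn, "A"))     # ['A', 'A1', 'A2']
print(subtree(conn, "root"))  # ['root', 'A', 'A1', 'A2', 'B']
```

The cost shows up on writes: inserting a new folder under A means shifting the lft/rgt numbers of every row to its right.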
In the product I work on we have some tree structures stored in SQL Server and use the technique mentioned above to store a node's hierarchy in the record, e.g.
tblTreeNode
TreeID = 1
TreeNodeID = 100
ParentTreeNodeID = 99
TreeNodeHierarchy = ".33.59.99.100."
[...] (actual data payload for node)
Maintaining the hierarchy is the tricky bit, of course, and makes use of triggers. But generating it on an insert/delete/move is never recursive, because the parent or child's hierarchy has all the information you need.
You can get all of a node's descendants (along with the node itself) like this:
SELECT * FROM tblTreeNode WHERE TreeNodeHierarchy LIKE '%.100.%'
Here's the insert trigger:
--Setup the top level if there is any
UPDATE T
SET T.TreeNodeHierarchy = '.' + CONVERT(nvarchar(10), T.TreeNodeID) + '.'
FROM tblTreeNode AS T
INNER JOIN inserted i ON T.TreeNodeID = i.TreeNodeID
WHERE (i.ParentTreeNodeID IS NULL) AND (i.TreeNodeHierarchy IS NULL)
WHILE EXISTS (SELECT * FROM tblTreeNode WHERE TreeNodeHierarchy IS NULL)
BEGIN
--Update those items that we have enough information to update - parent has text in Hierarchy
UPDATE CHILD
SET CHILD.TreeNodeHierarchy = PARENT.TreeNodeHierarchy + CONVERT(nvarchar(10),CHILD.TreeNodeID) + '.'
FROM tblTreeNode AS CHILD
INNER JOIN tblTreeNode AS PARENT ON CHILD.ParentTreeNodeID = PARENT.TreeNodeID
WHERE (CHILD.TreeNodeHierarchy IS NULL) AND (PARENT.TreeNodeHierarchy IS NOT NULL)
END
and here's the update trigger:
--Only want to do something if Parent IDs were changed
IF UPDATE(ParentTreeNodeID)
BEGIN
--Update the changed items to reflect their new parents
UPDATE CHILD
SET CHILD.TreeNodeHierarchy = CASE WHEN PARENT.TreeNodeID IS NULL THEN '.' + CONVERT(nvarchar,CHILD.TreeNodeID) + '.' ELSE PARENT.TreeNodeHierarchy + CONVERT(nvarchar, CHILD.TreeNodeID) + '.' END
FROM tblTreeNode AS CHILD
INNER JOIN inserted AS I ON CHILD.TreeNodeID = I.TreeNodeID
LEFT JOIN tblTreeNode AS PARENT ON CHILD.ParentTreeNodeID = PARENT.TreeNodeID
--Now update any sub items of the changed rows if any exist
IF EXISTS (
SELECT *
FROM tblTreeNode
INNER JOIN deleted ON tblTreeNode.ParentTreeNodeID = deleted.TreeNodeID
)
UPDATE CHILD
SET CHILD.TreeNodeHierarchy = NEWPARENT.TreeNodeHierarchy + RIGHT(CHILD.TreeNodeHierarchy, LEN(CHILD.TreeNodeHierarchy) - LEN(OLDPARENT.TreeNodeHierarchy))
FROM tblTreeNode AS CHILD
INNER JOIN deleted AS OLDPARENT ON CHILD.TreeNodeHierarchy LIKE (OLDPARENT.TreeNodeHierarchy + '%')
INNER JOIN tblTreeNode AS NEWPARENT ON OLDPARENT.TreeNodeID = NEWPARENT.TreeNodeID
END
One more bit, a check constraint to prevent a circular reference in tree nodes:
ALTER TABLE [dbo].[tblTreeNode] WITH NOCHECK ADD CONSTRAINT [CK_tblTreeNode_TreeNodeHierarchy] CHECK
((charindex(('.' + convert(nvarchar(10),[TreeNodeID]) + '.'),[TreeNodeHierarchy],(charindex(('.' + convert(nvarchar(10),[TreeNodeID]) + '.'),[TreeNodeHierarchy]) + 1)) = 0))
I would also recommend triggers to prevent more than one root node (null parent) per tree, and to keep related nodes from belonging to different TreeIDs (but those are a little more trivial than the above.)
You'll want to check for your particular case to see if this solution performs acceptably. Hope this helps!
Celko wrote about this (2000):
http://www.dbmsmag.com/9603d06.html
http://www.intelligententerprise.com/001020/celko1_1.jhtml;jsessionid=3DFR02341QLDEQSNDLRSKHSCJUNN2JVN?_requestid=32818
and other people asked:
Joining other tables in oracle tree queries
How to calculate the sum of values in a tree using SQL
How to store directory / hierarchy / tree structure in the database?
Performance of recursive stored procedures in MYSQL to get hierarchical data
What is the most efficient/elegant way to parse a flat table into a tree?
Finally, you could look at the Rails "acts_as_tree" (write-heavy) and "acts_as_nested_set" (read-heavy) plugins. I don't have a good link comparing them.
There are several common kinds of queries against a hierarchy. Most other kinds of queries are variations on these.
From a parent, find all children.
a. To a specific depth. For example, given my immediate parent, all children to a depth of 1 will be my siblings.
b. To the bottom of the tree.
From a child, find all parents.
a. To a specific depth. For example, my immediate parent is the set of parents to a depth of 1.
b. To an unlimited depth.
The (a) cases (a specific depth) are easier in SQL. The special case (depth=1) is trivial in SQL. A greater, but still fixed, depth is harder, but can be done via a finite number of joins. The (b) cases, with indefinite depth (to the top, to the bottom), are really hard.
If your tree is HUGE (millions of nodes) then you're in a world of hurt no matter what you try to do.
If your tree is under a million nodes, just fetch it all into memory and work on it there. Life is much simpler in an OO world. Simply fetch the rows and build the tree as the rows are returned.
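Something along these lines, as a rough sketch. The (id, parent_id, name) row shape is an assumption, the rows would normally come from a single SELECT over the whole table, and Node and build_tree are made-up names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    name: str
    children: list = field(default_factory=list)


def build_tree(rows):
    """Turn (id, parent_id, name) rows into linked Node objects; returns the roots."""
    rows = list(rows)
    # First pass: create every node, so rows may arrive in any order.
    nodes = {node_id: Node(node_id, name) for node_id, _, name in rows}
    roots = []
    # Second pass: hook each node up to its parent.
    for node_id, parent_id, _ in rows:
        if parent_id is None:
            roots.append(nodes[node_id])
        else:
            nodes[parent_id].children.append(nodes[node_id])
    return roots


rows = [(4, 3, "lib"), (1, None, "root"), (2, 1, "docs"), (3, 1, "src")]
roots = build_tree(rows)
print([child.name for child in roots[0].children])  # ['docs', 'src']
```

This is one query and two linear passes, regardless of how deep the tree is.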
If you have a Huge tree, you have two choices.
Recursive cursors to handle the unlimited fetching. This means the maintenance of the structure is O(1) -- just update a few nodes and you're done. However fetching is O(n*log(n)) because you have to open a cursor for each node with children.
Clever "heap numbering" algorithms can encode the parentage of each node. Once each node is properly numbered, a trivial SQL SELECT can be used for all four types of queries. Changes to the tree structure, however, require renumbering the nodes, making the cost of a change fairly high compared to the cost of retrieval.
If you have many trees in the database, and you will only ever get the whole tree out, I would store a tree ID (or root node ID) and a parent node ID for each node in the database, get all the nodes for a particular tree ID, and process in memory.
However, if you will be getting subtrees out, you can only get a subtree of a particular parent node ID, so you either need to store all parent nodes of each node to use the above method, or perform multiple SQL queries as you descend into the tree (hope there are no cycles in your tree!). You can reuse the same prepared statement for each level (assuming that the nodes are of the same type and are all stored in a single table) to avoid re-compiling the SQL, so it might not be slower; indeed, with database optimisations applied to the query, it could be preferable. You might want to run some tests to find out.
If you are only storing one tree, your question becomes one of querying subtrees only, and the second answer applies.
Google for "Materialized Path" or "Genetic Trees"...
In Oracle there is the SELECT ... CONNECT BY statement to retrieve trees.
I am a fan of the simple method of storing an ID associated with its parentID:
ID ParentID
1 null
2 null
3 1
4 2
... ...
It is easy to maintain, and very scalable.
This article is interesting as it shows some retrieval methods as well as a way to store the lineage as a derived column. The lineage provides a shortcut method to retrieve the hierarchy without too many joins.
Not going to work for all situations, but for example given a comment structure:
ID | ParentCommentID
You could also store TopCommentID which represents the top most comment:
ID | ParentCommentID | TopCommentID
Where the TopCommentID and ParentCommentID are null or 0 when it's the topmost comment. For child comments, ParentCommentID points to the comment above it, and TopCommentID points to the topmost parent.
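With that extra column, pulling a whole thread becomes a single non-recursive query. A small sketch on SQLite from Python; the comment table name, the sample rows and the thread_ids helper are made up, and it uses NULL rather than 0 for topmost comments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comment (ID INTEGER PRIMARY KEY, ParentCommentID INTEGER, TopCommentID INTEGER)"
)
conn.executemany(
    "INSERT INTO comment VALUES (?, ?, ?)",
    [
        (1, None, None),  # topmost comment
        (2, 1, 1),        # reply to 1
        (3, 2, 1),        # reply to 2, still in thread 1
        (4, None, None),  # another topmost comment
        (5, 4, 4),
    ],
)


def thread_ids(conn, top_id):
    """The topmost comment plus every comment anywhere below it."""
    rows = conn.execute(
        "SELECT ID FROM comment WHERE ID = ? OR TopCommentID = ? ORDER BY ID",
        (top_id, top_id),
    )
    return [row[0] for row in rows]


print(thread_ids(conn, 1))  # [1, 2, 3]
```

The nesting inside the thread still comes from ParentCommentID, but that can be rebuilt in memory once the rows are fetched.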