Listing all ancestor paths, LIKE vs rcte - sql

DROP TABLE IF EXISTS t;
CREATE TABLE t(
mypath varchar(100),
parent_path varchar(100)
);
INSERT INTO t VALUES ('a', NULL),('a/b', 'a'),('a/b/c', 'a/b');
-- Listing all parent paths
1) using LIKE, not making use of parent_path column:
SELECT a.mypath, b.mypath AS parent_path
FROM t a
JOIN t b ON a.mypath LIKE b.mypath + '%' AND a.mypath != b.mypath
2) Using a recursive cte, making use of parent_path column
WITH cte AS (
SELECT mypath, parent_path
FROM t
UNION ALL
SELECT a.mypath, b.parent_path
FROM cte a
JOIN t b ON a.parent_path = b.mypath
)
SELECT * FROM cte WHERE parent_path IS NOT NULL;
On large datasets, what are the pros and cons of each method, performance-wise?
Am I right to think the rCTE method should be faster?
Should LIKE be able to use an index, since there is only a trailing wildcard?
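For a quick check that both formulations return the same ancestor pairs, here is a minimal sketch using Python's sqlite3 module. It is not SQL Server: SQLite concatenates with || rather than +. The extra row 'ab' and the '/%' pattern are my own additions, to show that a bare trailing '%' would also match names that merely share a prefix. It says nothing about performance on large tables.

```python
import sqlite3


def ancestor_pairs():
    """Return (LIKE result, recursive CTE result) as sorted lists of pairs."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (mypath TEXT PRIMARY KEY, parent_path TEXT)")
    db.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [("a", None), ("a/b", "a"), ("a/b/c", "a/b"), ("ab", None)],
    )
    # 1) LIKE only; the '/' before '%' keeps 'ab' from counting as a child of 'a'.
    like_rows = db.execute("""
        SELECT a.mypath, b.mypath
        FROM t a
        JOIN t b ON a.mypath LIKE b.mypath || '/%'
    """).fetchall()
    # 2) Recursive CTE that follows parent_path one step at a time.
    cte_rows = db.execute("""
        WITH RECURSIVE cte(mypath, parent_path) AS (
            SELECT mypath, parent_path FROM t
            UNION ALL
            SELECT a.mypath, b.parent_path
            FROM cte a
            JOIN t b ON a.parent_path = b.mypath
        )
        SELECT mypath, parent_path FROM cte WHERE parent_path IS NOT NULL
    """).fetchall()
    return sorted(like_rows), sorted(cte_rows)
```

Both lists contain the same three pairs for this data, so the two methods can be compared on an equal footing once they are pointed at a realistic table.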

Related

Active Directory: Convert canonicalName node value from string to integer

Are there any methods available to convert the string text value contained within the AD canonicalName attribute to an incremented integer value? Or, does this need to be performed manually?
For example:
canonicalName (what I am getting) hierarchyNode (what I need)
\domain.com\ /1/
\domain.com\Corporate /1/1/
\domain.com\Corporate\Hr /1/1/1/
\domain.com\Corporate\Accounting /1/1/2/
\domain.com\Users\ /1/2/
\domain.com\Users\Sales /1/2/1/
\domain.com\Users\Whatever /1/2/2/
\domain.com\Security\ /1/3/
\domain.com\Security\Servers /1/3/1/
\domain.com\Security\Administrative /1/3/2/
\domain.com\Security\Executive /1/3/3/
I am extracting user objects into a SQL Server database for reporting purposes. The user objects are spread throughout multiple OUs in the forest. So, by identifying the highest node on the tree that contains users, I can then use the SQL Server IsDescendantOf() method to quickly retrieve users recursively without having to write 1 + n sub-selects.
For reference: https://learn.microsoft.com/en-us/sql/t-sql/data-types/hierarchyid-data-type-method-reference
UPDATE:
I am able to convert the canonicalName from string to integer (see below, using SQL Server 2014). However, this doesn't seem to solve my problem. I have only built the branches of the tree by stripping off the leaves, so that I can use IsDescendantOf() per tree branch. But now I cannot insert the leaves in batch, as it appears I need GetDescendant(), which appears to be built for handling inserts one at a time.
How can I build the Active Directory tree, which resembles file system paths, as a SQL hierarchy? All examples treat the hierarchy as an immediate parent/child relationship and use a recursive CTE to build from the root, which requires the parent/child relationship to already be known. In my case, the parent/child relationship is known only through the '/' delimiter.
-- Drop and re-create temp table(s) that are used by this procedure.
IF OBJECT_ID(N'Tempdb.dbo.#TEMP_TreeHierarchy', N'U') IS NOT NULL
BEGIN
DROP TABLE #TEMP_TreeHierarchy
END;
-- Drop and re-create temp table(s) that are used by this procedure.
IF OBJECT_ID(N'Tempdb.dbo.#TEMP_AdTreeHierarchyNodeNames', N'U') IS NOT NULL
BEGIN
DROP TABLE #TEMP_AdTreeHierarchyNodeNames
END;
-- CREATE TEMP TABLE(s)
CREATE TABLE #TEMP_TreeHierarchy(
TreeHierarchyKey INT IDENTITY(1,1) NOT NULL
,TreeHierarchyId hierarchyid NULL
,TreeHierarchyNodeLevel int NULL
,TreeHierarchyNode varchar(255) NULL
,TreeCanonicalName varchar(255) NOT NULL
PRIMARY KEY CLUSTERED
(
TreeCanonicalName ASC
))
CREATE TABLE #TEMP_AdTreeHierarchyNodeNames (
TreeCanonicalName VARCHAR(255) NOT NULL
,TreeHierarchyNodeLevel INT NOT NULL
,TreeHierarchyNodeName VARCHAR(255) NOT NULL
,IndexValueByLevel INT NULL
PRIMARY KEY CLUSTERED
(
TreeCanonicalName ASC
,TreeHierarchyNodeLevel ASC
,TreeHierarchyNodeName ASC
))
-- Step 1.) INSERT the DISTINCT list of CanonicalName values into #TEMP_TreeHierarchy.
-- Remove the reserved character '/' that has been escaped '\/'. Note: '/' is the delimiter.
-- Remove all of the leaves from the tree, leaving only the root and the branches/nodes.
;WITH CTE1 AS (SELECT CanonicalNameParseReserveChar = REPLACE(A.CanonicalName, '\/', '') -- Remove the reserved character '/' that has been escaped '\/'.
FROM dbo.AdObjects A
)
-- Remove CN from end of string in order to get the distinct list (i.e., remove all of the leaves from the tree, leaving only the root and the branches/nodes).
-- INSERT the records INTO #TEMP_TreeHierarchy
INSERT INTO #TEMP_TreeHierarchy (TreeCanonicalName)
SELECT DISTINCT
CanonicalNameTree = REVERSE(SUBSTRING(REVERSE(C1.CanonicalNameParseReserveChar), CHARINDEX('/', REVERSE(C1.CanonicalNameParseReserveChar), 0) + 1, LEN(C1.CanonicalNameParseReserveChar) - CHARINDEX('/', REVERSE(C1.CanonicalNameParseReserveChar), 0)))
FROM CTE1 C1
-- Step 2.) Get NodeLevel and NodeName (i.e., key/value pair).
-- Get the nodes for each entry by splitting out the '/' delimiter, which provides both the NodeLevel and NodeName.
-- This table will be used as scratch to build the HierarchyNodeByLvl,
-- which is where the heavy lifting of converting the canonicalName value from string to integer occurs.
-- Note: integer is required for the node name - string values are not allowed. Thus this bridge must be built dynamically.
-- Achieve dynamic result by using CROSS APPLY to convert a single delimited row into 1 + n rows, based on the number of nodes.
-- INSERT the key/value pair results INTO a temp table.
-- Use ROW_NUMBER() to identify each NodeLevel, which is the key.
-- Use the string contained between the delimiter, which is the value.
-- Combined, these create a unique identifier that will be used to roll-up the HierarchyNodeByLevel, which is a RECURSIVE key/value pair of NodeLevel and IndexValueByLevel.
-- The rolled-up value contained in HierarchyNodeByLevel is what the SQL Server hierarchyid::Parse() function requires in order to create the hierarchyid.
-- https://blog.sqlauthority.com/2015/04/21/sql-server-split-comma-separated-list-without-using-a-function/
INSERT INTO #TEMP_AdTreeHierarchyNodeNames (TreeCanonicalName, TreeHierarchyNodeLevel, TreeHierarchyNodeName)
SELECT TreeCanonicalName
,TreeHierarchyNodeLevel = ROW_NUMBER() OVER(PARTITION BY TreeCanonicalName ORDER BY TreeCanonicalName)
,TreeHierarchyNodeName = LTRIM(RTRIM(m.n.value('.[1]','VARCHAR(MAX)')))
FROM (SELECT TH.TreeCanonicalName
,x = CAST('<XMLRoot><RowData>' + REPLACE(TH.TreeCanonicalName,'/','</RowData><RowData>') + '</RowData></XMLRoot>' AS XML)
FROM #TEMP_TreeHierarchy TH
) SUB1
CROSS APPLY x.nodes('/XMLRoot/RowData')m(n)
-- Step 3.) Get the IndexValueByLevel RECURSIVE key/value pair
-- Get the DISTINCT list of TreeHierarchyNodeLevel, TreeHierarchyNodeName first
-- Use TreeHierarchyNodeLevel as the key
-- Use ROW_NUMBER() to identify each IndexValueByLevel, which is the value.
-- Since the IndexValueByLevel exists for each level, the value for each level must be concatenated together to create the final value that is stored in TreeHierarchyNode
;WITH CTE1 AS (SELECT DISTINCT TreeHierarchyNodeLevel, TreeHierarchyNodeName
FROM #TEMP_AdTreeHierarchyNodeNames
),
CTE2 AS (SELECT C1.*
,IndexValueByLevel = ROW_NUMBER() OVER(PARTITION BY C1.TreeHierarchyNodeLevel ORDER BY C1.TreeHierarchyNodeName)
FROM CTE1 C1
)
UPDATE TMP1
SET TMP1.IndexValueByLevel = C2.IndexValueByLevel
FROM #TEMP_AdTreeHierarchyNodeNames TMP1
INNER JOIN CTE2 C2
ON TMP1.TreeHierarchyNodeLevel = C2.TreeHierarchyNodeLevel
AND TMP1.TreeHierarchyNodeName = C2.TreeHierarchyNodeName
-- Step 4.) Build the TreeHierarchyNodeByLevel.
-- Use FOR XML to roll up all duplicate keys in order to concatenate their values into one string.
-- https://www.mssqltips.com/sqlservertip/2914/rolling-up-multiple-rows-into-a-single-row-and-column-for-sql-server-data/
;WITH CTE1 AS (SELECT DISTINCT TreeCanonicalName
,TreeHierarchyNodeByLevel =
(SELECT '/' + CAST(IndexValueByLevel AS VARCHAR(10))
FROM #TEMP_AdTreeHierarchyNodeNames TMP1
WHERE TMP1.TreeCanonicalName = TMP2.TreeCanonicalName
FOR XML PATH(''))
FROM #TEMP_AdTreeHierarchyNodeNames TMP2
),
CTE2 AS (SELECT C1.TreeCanonicalName
,C1.TreeHierarchyNodeByLevel
,TreeHierarchyNodeLevel = MAX(TMP1.TreeHierarchyNodeLevel)
FROM CTE1 C1
INNER JOIN #TEMP_AdTreeHierarchyNodeNames TMP1
ON TMP1.TreeCanonicalName = C1.TreeCanonicalName
GROUP BY C1.TreeCanonicalName, C1.TreeHierarchyNodeByLevel
)
UPDATE TH
SET TH.TreeHierarchyNodeLevel = C2.TreeHierarchyNodeLevel
,TH.TreeHierarchyNode = C2.TreeHierarchyNodeByLevel + '/'
,TH.TreeHierarchyId = hierarchyid::Parse(C2.TreeHierarchyNodeByLevel + '/')
FROM #TEMP_TreeHierarchy TH
INNER JOIN CTE2 C2
ON TH.TreeCanonicalName = C2.TreeCanonicalName
INSERT INTO AD.TreeHierarchy (EffectiveStartDate, EffectiveEndDate, TreeCanonicalName, TreeHierarchyNodeLevel, TreeHierarchyNode, TreeHierarchyId)
SELECT EffectiveStartDate = CAST(GETDATE() AS DATE)
,EffectiveEndDate = '12/31/9999'
,TH.TreeCanonicalName
,TH.TreeHierarchyNodeLevel
,TH.TreeHierarchyNode
,TH.TreeHierarchyId
FROM #TEMP_TreeHierarchy TH
ORDER BY TH.TreeHierarchyKey
---- For testing purposes only.
SELECT * FROM AD.TreeHierarchy TH
SELECT * FROM #TEMP_AdTreeHierarchyNodeNames
SELECT * FROM #TEMP_TreeHierarchy
-- Clean-up. DROP TEMP TABLE(s).
DROP TABLE #TEMP_TreeHierarchy
DROP TABLE #TEMP_AdTreeHierarchyNodeNames
This is where my thinking takes me
I gave you 9 levels, but the pattern is easy to see and expand
Without a proper sequence I defaulted to alphabetical by node.
It also supports multiple root nodes as well
Example
Select A.*
,Nodes = concat('/',dense_rank() over (Order By N1),'/'
,left(nullif(dense_rank() over (Partition By N1 Order By N2)-1,0),5)+'/'
,left(nullif(dense_rank() over (Partition By N1,N2 Order By N3)-1,0),5)+'/'
,left(nullif(dense_rank() over (Partition By N1,N2,N3 Order By N4)-1,0),5)+'/'
,left(nullif(dense_rank() over (Partition By N1,N2,N3,N4 Order By N5)-1,0),5)+'/'
,left(nullif(dense_rank() over (Partition By N1,N2,N3,N4,N5 Order By N6)-1,0),5)+'/'
,left(nullif(dense_rank() over (Partition By N1,N2,N3,N4,N5,N6 Order By N7)-1,0),5)+'/'
,left(nullif(dense_rank() over (Partition By N1,N2,N3,N4,N5,N6,N7 Order By N8)-1,0),5)+'/'
,left(nullif(dense_rank() over (Partition By N1,N2,N3,N4,N5,N6,N7,N8 Order By N9)-1,0),5)+'/'
)
From YourTable A
Cross Apply (
Select N1 = ltrim(rtrim(xDim.value('/x[1]','varchar(max)')))
,N2 = ltrim(rtrim(xDim.value('/x[2]','varchar(max)')))
,N3 = ltrim(rtrim(xDim.value('/x[3]','varchar(max)')))
,N4 = ltrim(rtrim(xDim.value('/x[4]','varchar(max)')))
,N5 = ltrim(rtrim(xDim.value('/x[5]','varchar(max)')))
,N6 = ltrim(rtrim(xDim.value('/x[6]','varchar(max)')))
,N7 = ltrim(rtrim(xDim.value('/x[7]','varchar(max)')))
,N8 = ltrim(rtrim(xDim.value('/x[8]','varchar(max)')))
,N9 = ltrim(rtrim(xDim.value('/x[9]','varchar(max)')))
From (Select Cast('<x>' + replace((Select replace(stuff([canonicalName],1,1,''),'\','§§Split§§') as [*] For XML Path('')),'§§Split§§','</x><x>')+'</x>' as xml) as xDim) as A
) B
Order By 1
Returns
canonicalName Nodes
\domain.com\ /1/
\domain.com\Corporate /1/1/
\domain.com\Corporate\Accounting /1/1/1/
\domain.com\Corporate\Hr /1/1/2/
\domain.com\Security\ /1/2/
\domain.com\Security\Administrative /1/2/1/
\domain.com\Security\Executive /1/2/2/
\domain.com\Security\Servers /1/2/3/
\domain.com\Users\ /1/3/
\domain.com\Users\Sales /1/3/1/
\domain.com\Users\Whatever /1/3/2/
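The same numbering idea can be sketched outside the database, which avoids hard-coding the number of levels. This is only an illustration: the function name hierarchy_nodes is hypothetical, siblings are ranked alphabetically (as in the query above, since no other sequence is available), and the resulting strings are in the '/1/2/' form that hierarchyid::Parse() expects.

```python
def hierarchy_nodes(canonical_names):
    """Map each backslash-delimited name to a '/1/2/'-style node string."""
    # Split every name into its non-empty path parts.
    split = {name: [p for p in name.split("\\") if p] for name in canonical_names}

    # For every parent prefix, collect the distinct child names seen under it.
    children = {}
    for parts in split.values():
        for i, part in enumerate(parts):
            children.setdefault(tuple(parts[:i]), set()).add(part)

    # Rank siblings alphabetically, starting at 1 (a dense rank per parent).
    rank = {
        parent: {name: n for n, name in enumerate(sorted(names), start=1)}
        for parent, names in children.items()
    }

    result = {}
    for name, parts in split.items():
        ids = [str(rank[tuple(parts[:i])][part]) for i, part in enumerate(parts)]
        result[name] = "/" + "".join(i + "/" for i in ids)
    return result
```

Because every prefix of every name is ranked, branches and leaves get consistent numbers in one pass, and the output could be bulk-loaded rather than inserted row by row.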

CTE WITH RECURSIVE Up and back? How do I get the whole tree from any node?

Given a very simplistic table like:
-- SQLite3
CREATE TABLE tst (
id INTEGER PRIMARY KEY AUTOINCREMENT,
parent_id INTEGER CHECK (parent_id <> id),
tag STRING NOT NULL,
FOREIGN KEY (parent_id) REFERENCES tst(id)
)
I can use WITH RECURSIVE (common table expressions) to go from any node up to the "root" of that tree or to traverse downward from a node to all of its children (along all branches). Here are the queries that seem to work for those two cases (respectively):
WITH RECURSIVE t(id, parent_id, tag) AS (
SELECT id, parent_id, tag FROM tst WHERE id=:mynode
UNION ALL
SELECT t2.id, t2.parent_id, t2.tag FROM tst AS t2
JOIN t ON t.parent_id = t2.id
) SELECT * FROM t
... and:
WITH RECURSIVE t(id, parent_id, tag) AS (
SELECT id, parent_id, tag FROM tst WHERE id=?
UNION ALL
SELECT t2.id, t2.parent_id, t2.tag FROM tst AS t2
JOIN t ON t.id = t2.parent_id
) SELECT * FROM t
(All I've done is reverse t.parent_id and t2.id from the first example into the other).
That works like a charm. But I'm trying to wrap my head around how I would start from any node and get the whole group of rows.
The obvious workaround would be to perform the first query, find the row where parent_id IS NULL then perform the second query on that. But I figure there must be a more elegant solution.
What is it?
I found that my earlier RCTE query worked but had two major flaws for my application.
I wasn't capturing the depth for each row; so I couldn't easily indent my entries to reflect the thread nesting level
My ORDER BY clause was completely off base ... so even if I indented each row according to its nesting depth the resulting "outline summary" would be completely wrong.
This slightly more complicated query seems to solve both of those problems:
WITH RECURSIVE tree (id, parent_id, tag, depth, path) AS (
SELECT id, parent_id, tag, 1 AS depth, '' AS path FROM tst WHERE id = (
WITH RECURSIVE t3 (id, parent_id) AS (
SELECT id, parent_id FROM tst WHERE id = :mynode
UNION ALL
SELECT t2.id, t2.parent_id FROM tst AS t2
JOIN t3 ON t3.parent_id=t2.id
) SELECT id FROM t3 WHERE parent_id IS NULL
) UNION ALL
SELECT t2.id, t2.parent_id, t2.tag, tree.depth+1,
path || '/' || CAST(t2.id AS VARCHAR) FROM tst AS t2
JOIN tree ON tree.id = t2.parent_id
) SELECT * FROM tree ORDER by path;
I'm adding the depth and path columns to the "tree" (virtual CTE) table, supplying initial values for those columns in my first SELECT with 1 AS depth, '' AS path (a new trick for me), and then updating them at every step of the recursion with tree.depth+1 and path || '/' || CAST(t2.id AS VARCHAR). Finally, I can use path in my ORDER BY and use depth in my app to prefix each line with an appropriate level of indentation.
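Here is a self-contained sketch of that depth-and-path technique with Python's sqlite3 module. The sample rows and the helper names make_db and whole_tree are made up for illustration. One deliberate change from the query above: each id is zero-padded with printf('%08d', ...) before it is appended to the path, because a plain text path would sort id 10 before id 2.

```python
import sqlite3

WHOLE_TREE = """
WITH RECURSIVE
  up(id, parent_id) AS (            -- walk from the start node up to its root
    SELECT id, parent_id FROM tst WHERE id = :mynode
    UNION ALL
    SELECT p.id, p.parent_id FROM tst AS p JOIN up ON up.parent_id = p.id
  ),
  tree(id, tag, depth, path) AS (   -- walk from that root down to every leaf
    SELECT t.id, t.tag, 1, printf('%08d', t.id)
    FROM tst AS t
    WHERE t.id = (SELECT id FROM up WHERE parent_id IS NULL)
    UNION ALL
    SELECT c.id, c.tag, tree.depth + 1,
           tree.path || '/' || printf('%08d', c.id)
    FROM tst AS c JOIN tree ON tree.id = c.parent_id
  )
SELECT id, tag, depth FROM tree ORDER BY path
"""


def make_db():
    """Build a small in-memory table with two separate trees."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE tst (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            parent_id INTEGER CHECK (parent_id <> id),
            tag TEXT NOT NULL,
            FOREIGN KEY (parent_id) REFERENCES tst(id)
        )""")
    db.executemany(
        "INSERT INTO tst (id, parent_id, tag) VALUES (?, ?, ?)",
        [(1, None, "root"), (2, 1, "a"), (3, 2, "a1"),
         (4, None, "other"), (10, 1, "b"), (11, 10, "b1")],
    )
    return db


def whole_tree(db, node_id):
    """Return (id, tag, depth) for the whole tree containing node_id."""
    return db.execute(WHOLE_TREE, {"mynode": node_id}).fetchall()
```

Calling whole_tree with any id from the first tree returns the same five rows in outline order, so the caller only needs depth to indent each line.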
To get this working for my application I can do something like:
#!python
for each in db.execute("SELECT id FROM tst WHERE parent_id IS NULL").fetchall():
    for row in db.execute(qry, each):
        print("%s\t%s%s" % (row[0], ' ' * row[3], row[2]))
... where qry is the query I've described above (actually adjusted to fetch only the columns of interest, but this example works even with * there). In practice I might use LIMIT and OFFSET to page through those results (as I already do for the flat list of results from the table that doesn't support any message threading).
Also, I'm aware that the CHECK I put in the table schema only prevents the most trivial form of circular tree. It seems like parent_id INTEGER CHECK (parent_id IS NULL OR parent_id < id) should work better: each chain of parent_id -> id links must be monotonically decreasing, so no cycle is possible. The FOREIGN KEY already enforces that property for INSERT statements, but the CHECK enforces it for UPDATE as well. (Technically I suppose I should use the "date" fields in my actual application, but I hope the surrogate key is sufficient.)
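To see that stricter CHECK in action, here is a small sketch using Python's sqlite3 module. The function name try_cycle and the two sample rows are only for illustration; the point is that an UPDATE which would make a row point at a later row is refused.

```python
import sqlite3


def try_cycle():
    """Try to create a 1 -> 2 -> 1 loop and report what the CHECK does."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE tst (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            parent_id INTEGER CHECK (parent_id IS NULL OR parent_id < id),
            tag TEXT NOT NULL,
            FOREIGN KEY (parent_id) REFERENCES tst(id)
        )""")
    db.execute("INSERT INTO tst (parent_id, tag) VALUES (NULL, 'root')")  # id 1
    db.execute("INSERT INTO tst (parent_id, tag) VALUES (1, 'child')")    # id 2
    try:
        # Row 1 would get parent 2, which violates parent_id < id.
        db.execute("UPDATE tst SET parent_id = 2 WHERE id = 1")
    except sqlite3.IntegrityError:
        return "rejected"
    return "accepted"
```

The trade-off is that a row can never be re-parented under a row created after it, which may or may not be acceptable for a given application.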
BTW: Shout out to: a_horse_with_no_name for this posting: https://dba.stackexchange.com/a/7150 ... which helped me figure out how to build the paths.
Okay, I've now spent some time with SQLFiddle: PostgreSQL: CTE and this seems to work:
-- SQL: Fetch whole tree
WITH RECURSIVE t (id, parent_id, tag) AS
(
SELECT id, parent_id, tag FROM tst WHERE id =
(
WITH RECURSIVE t3 (id, parent_id) AS
(
SELECT id, parent_id FROM tst WHERE id = :mynode
UNION ALL
SELECT t2.id, t2.parent_id FROM tst AS t2
JOIN t3 ON t3.parent_id=t2.id
) SELECT id FROM t3 WHERE parent_id IS NULL
) UNION ALL
SELECT t2.id, t2.parent_id, t2.tag FROM tst AS t2
JOIN t ON t.id = t2.parent_id
) SELECT * FROM t ORDER BY id;
... where :mynode is the ID of any node in a given tree of rows.
All I'm doing is taking the first query (walk up the tree), adding WHERE parent_id IS NULL to get just the root of that whole tree, and pasting it in as a subquery compared with = (an IN (...) expression would work as well).
That still looks a bit, ummm, ... involved. But it does seem to work.
Is there a better way?

Using bottom-up or top-down search to find the best path from hierarchical data

I have the following table structure,
Create Table Nodes
(
NodeID int,
ParentNodeID int, -- null if root node
NodeText nvarchar(100),
NodeCost int
)
Now I need to do a sort of complex thing. Find the cost of the best path (path which has the best cost) from this table.
There are two ways, I am not sure which one to use:
Bottom-up search: If I choose that, I need to add a isLeaf column to the table and use a CTE which does the job.
Top-down search: There is no need of isLeaf field, however hard to come up with the SQL statement which does the job.
What are the pros and cons of the given two alternative methods? Which is the better practice in terms of performance etc.?
I wrote this answer before noticing that the RDBMS is not specified. This is for SQL Server; the same or something similar should work in PostgreSQL and Oracle, but not in MySQL.
Anyway, either way (bottom-up or top-down) is fine, and in neither case do you have to add an isLeaf column, because you can find leaf nodes with a NOT IN or NOT EXISTS subquery. You need that information either way if you are only interested in paths that run from the root to a leaf.
Here is a sample query for top-down search:
;WITH rCTE AS
(
SELECT NodeID ,
ParentNodeID ,
CAST(NodeID AS NVARCHAR(MAX)) AS PathIDs,
CAST(NodeText AS NVARCHAR(MAX)) AS PathText,
NodeCost AS PathCost
FROM Nodes WHERE ParentNodeID IS NULL
UNION ALL
SELECT n.NodeID ,
n.ParentNodeID ,
r.PathIDs + '-' + CAST(n.NodeID AS NVARCHAR(10)) AS PathIDs,
r.PathText + '-' + n.NodeText AS PathText,
r.PathCost + n.NodeCost AS PathCost
FROM rCTE r
INNER JOIN dbo.Nodes n ON n.ParentNodeID = r.NodeID
)
SELECT PathIDs ,
PathText ,
PathCost
FROM rCTE r
WHERE NOT EXISTS (SELECT * FROM Nodes n WHERE r.NodeID = n.ParentNodeID)
ORDER BY PathCost
Here is example for Bottom-Up:
;WITH rCTE AS
(
SELECT NodeID ,
ParentNodeID ,
CAST(NodeID AS NVARCHAR(MAX)) AS PathIDs,
CAST(NodeText AS NVARCHAR(MAX)) AS PathText,
NodeCost AS PathCost
FROM Nodes r WHERE NOT EXISTS (SELECT * FROM Nodes n WHERE r.NodeID = n.ParentNodeID)
UNION ALL
SELECT n.NodeID ,
n.ParentNodeID ,
r.PathIDs + '-' + CAST(n.NodeID AS NVARCHAR(10)) AS PathIDs,
r.PathText + '-' + n.NodeText AS PathText,
r.PathCost + n.NodeCost AS PathCost
FROM rCTE r
INNER JOIN dbo.Nodes n ON r.ParentNodeID = n.NodeID
)
SELECT PathIDs ,
PathText ,
PathCost
FROM rCTE r
WHERE r.ParentNodeID IS NULL
ORDER BY PathCost
SQLFiddle DEMO - Top-Down
SQLFiddle DEMO - Bottom-Up
In these examples, the performance of both queries is the same: the execution plans split 50%-50% when the queries are run together.
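For readers without SQL Server at hand, the two directions can be tried with Python's sqlite3 module. This is a sketch under assumptions: the five sample nodes are invented, "best" is taken to mean lowest total cost (as the ORDER BY PathCost above implies), and the function names are hypothetical.

```python
import sqlite3

TOP_DOWN = """
WITH RECURSIVE r(NodeID, PathIDs, PathCost) AS (
    SELECT NodeID, CAST(NodeID AS TEXT), NodeCost
    FROM Nodes WHERE ParentNodeID IS NULL           -- start at the roots
    UNION ALL
    SELECT n.NodeID, r.PathIDs || '-' || n.NodeID, r.PathCost + n.NodeCost
    FROM r JOIN Nodes n ON n.ParentNodeID = r.NodeID
)
SELECT PathIDs, PathCost FROM r
WHERE NOT EXISTS (SELECT 1 FROM Nodes c WHERE c.ParentNodeID = r.NodeID)
ORDER BY PathCost
"""

BOTTOM_UP = """
WITH RECURSIVE r(ParentNodeID, PathIDs, PathCost) AS (
    SELECT ParentNodeID, CAST(NodeID AS TEXT), NodeCost
    FROM Nodes l                                    -- start at the leaves
    WHERE NOT EXISTS (SELECT 1 FROM Nodes c WHERE c.ParentNodeID = l.NodeID)
    UNION ALL
    SELECT n.ParentNodeID, r.PathIDs || '-' || n.NodeID, r.PathCost + n.NodeCost
    FROM r JOIN Nodes n ON n.NodeID = r.ParentNodeID
)
SELECT PathIDs, PathCost FROM r
WHERE ParentNodeID IS NULL
ORDER BY PathCost
"""


def make_nodes():
    """Two root-to-leaf paths: 1-3-4 (cost 4) and 1-2-5 (cost 16)."""
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE Nodes (
        NodeID INTEGER PRIMARY KEY, ParentNodeID INTEGER,
        NodeText TEXT, NodeCost INTEGER)""")
    db.executemany(
        "INSERT INTO Nodes VALUES (?, ?, ?, ?)",
        [(1, None, "root", 1), (2, 1, "a", 5), (3, 1, "b", 2),
         (4, 3, "c", 1), (5, 2, "d", 10)],
    )
    return db


def path_costs(db, query):
    """Run one of the two queries and return (PathIDs, PathCost) rows."""
    return db.execute(query).fetchall()
```

Both queries yield the same costs; only the order of ids inside PathIDs is reversed, because the bottom-up walk starts at the leaf.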

Prevent recursive CTE visiting nodes multiple times

Consider the following simple DAG:
1->2->3->4
And a table, #bar, describing this (I'm using SQL Server 2005):
parent_id child_id
1 2
2 3
3 4
//... other edges, not connected to the subgraph above
Now imagine that I have some other arbitrary criteria that select the first and last edges, i.e. 1->2 and 3->4. I want to use these to find the rest of my graph.
I can write a recursive CTE as follows (I'm using terminology from MSDN):
with foo(parent_id,child_id) as (
-- anchor member that happens to select first and last edges:
select parent_id,child_id from #bar where parent_id in (1,3)
union all
-- recursive member:
select #bar.* from #bar
join foo on #bar.parent_id = foo.child_id
)
select parent_id,child_id from foo
However, this results in edge 3->4 being selected twice:
parent_id child_id
1 2
3 4
2 3
3 4 // 2nd appearance!
How can I prevent the query from recursing into subgraphs that have already been described? I could achieve this if, in my "recursive member" part of the query, I could reference all data that has been retrieved by the recursive CTE so far (and supply a predicate indicating in the recursive member excluding nodes already visited). However, I think I can access data that was returned by the last iteration of the recursive member only.
This will not scale well when there is a lot of such repetition. Is there a way of preventing this unnecessary additional recursion?
Note that I could use "select distinct" in the last line of my statement to achieve the desired results, but this seems to be applied after all the (repeated) recursion is done, so I don't think this is an ideal solution.
Edit - hainstech suggests stopping recursion by adding a predicate to exclude recursing down paths that were explicitly in the starting set, i.e. recurse only where foo.child_id not in (1,3). That works for the case above only because it is simple: all the repeated sections begin within the anchor set of nodes. It doesn't solve the general case where they may not. For example, consider adding edges 1->4 and 4->5 to the above set. Edge 4->5 will be captured twice, even with the suggested predicate. :(
The CTE's are recursive.
When your CTE's have multiple initial conditions, that means they also have different recursion stacks, and there is no way to use information from one stack in another stack.
In your example, the recursion stacks will go as follows:
(1) - first IN condition
(1, 2)
(1, 2, 3)
(1, 2, 3, 4)
(1, 2, 3) - no more children
(1, 2) - no more children
(1) - no more children, going to second IN condition
(3) - second condition
(3, 4)
(3) - no more children, returning
As you can see, these recursion stacks do not intersect.
You could probably record the visited values in a temporary table, JOIN each value with the temp table, and not follow a value if it is found there, but SQL Server does not support this inside a recursive CTE.
So you just use SELECT DISTINCT.
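As a point of comparison only: SQL Server requires UNION ALL in a recursive CTE, so the following does not help on SQL Server 2005. In engines such as SQLite, writing UNION instead discards any row that the CTE has already produced, which stops the repeated walk. A minimal sketch with Python's sqlite3 module, using a plain table named bar in place of #bar and a hypothetical helper named walk:

```python
import sqlite3


def walk(set_operator):
    """Run the edge walk with either 'UNION ALL' or 'UNION'."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE bar (parent_id INTEGER, child_id INTEGER)")
    db.executemany("INSERT INTO bar VALUES (?, ?)", [(1, 2), (2, 3), (3, 4)])
    return db.execute(f"""
        WITH RECURSIVE foo(parent_id, child_id) AS (
            -- anchor member: first and last edges
            SELECT parent_id, child_id FROM bar WHERE parent_id IN (1, 3)
            {set_operator}
            -- recursive member
            SELECT bar.parent_id, bar.child_id
            FROM bar JOIN foo ON bar.parent_id = foo.child_id
        )
        SELECT parent_id, child_id FROM foo
    """).fetchall()
```

With UNION ALL the edge 3->4 comes back twice; with UNION it comes back once, and nothing below an already-seen edge is expanded again.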
This is the approach I used. It has been tested against several methods and was the most performant. It combines the temp table idea suggested by Quassnoi and the use of both distinct and a left join to eliminate redundant paths to the recursion. The level of the recursion is also included.
I left the failed CTE approach in the code so you could compare results.
If someone has a better idea, I'd love to know it.
create table #bar (unique_id int identity(10,10), parent_id int, child_id int)
insert #bar (parent_id, child_id)
SELECT 1,2 UNION ALL
SELECT 2,3 UNION ALL
SELECT 3,4 UNION ALL
SELECT 2,5 UNION ALL
SELECT 2,5 UNION ALL
SELECT 5,6
SET NOCOUNT ON
;with foo(unique_id, parent_id,child_id, ord, lvl) as (
-- anchor member that happens to select first and last edges:
select unique_id, parent_id, child_id, row_number() over(order by unique_id), 0
from #bar where parent_id in (1,3)
union all
-- recursive member:
select b.unique_id, b.parent_id, b.child_id, row_number() over(order by b.unique_id), foo.lvl+1
from #bar b
join foo on b.parent_id = foo.child_id
)
select unique_id, parent_id,child_id, ord, lvl from foo
/***********************************
Manual Recursion
***********************************/
Declare @lvl as int
Declare @rows as int
DECLARE @foo as Table(
unique_id int,
parent_id int,
child_id int,
ord int,
lvl int)
--Get anchor condition
INSERT @foo (unique_id, parent_id, child_id, ord, lvl)
select unique_id, parent_id, child_id, row_number() over(order by unique_id), 0
from #bar where parent_id in (1,3)
set @rows=@@ROWCOUNT
set @lvl=0
--Do recursion
WHILE @rows > 0
BEGIN
set @lvl = @lvl + 1
INSERT @foo (unique_id, parent_id, child_id, ord, lvl)
SELECT DISTINCT b.unique_id, b.parent_id, b.child_id, row_number() over(order by b.unique_id), @lvl
FROM #bar b
inner join @foo f on b.parent_id = f.child_id
--might be multiple paths to this recursion so eliminate duplicates
left join @foo dup on dup.unique_id = b.unique_id
WHERE f.lvl = @lvl-1 and dup.child_id is null
set @rows=@@ROWCOUNT
END
SELECT * from @foo
DROP TABLE #bar
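The manual recursion above is essentially a level-by-level (breadth-first) walk that remembers what it has already stored. The same idea in plain Python, as a rough sketch: the function name reachable_edges is hypothetical, and edges are identified by their (parent_id, child_id) pair rather than by a unique_id, so exact duplicate edges collapse into one.

```python
def reachable_edges(edges, start_parents):
    """Return {edge: level} for every edge reachable from the start set."""
    # Index edges by parent so each expansion step is a dictionary lookup.
    by_parent = {}
    for edge in edges:
        by_parent.setdefault(edge[0], []).append(edge)

    seen = {}
    frontier = [e for e in edges if e[0] in start_parents]
    level = 0
    while frontier:
        next_frontier = []
        for edge in frontier:
            if edge in seen:
                continue  # already visited: do not walk this subgraph again
            seen[edge] = level
            next_frontier.extend(by_parent.get(edge[1], []))
        frontier = next_frontier
        level += 1
    return seen
```

Each edge is recorded once, at the first level where it is reached, which is the behaviour the recursive CTE alone could not provide.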
Do you happen to know which of the two edges is on a deeper level in the tree? Because in that case, you could make edge 3->4 the anchor member and start walking up the tree until you find edge 1->2.
Something like this:
with foo(parent_id, child_id)
as
(
select parent_id, child_id
from #bar
where parent_id = 3
union all
select parent_id, child_id
from #bar b
inner join foo f on b.child_id = f.parent_id
where b.parent_id <> 1
)
select *
from foo
(I'm no expert on graphs, just exploring a bit)
The DISTINCT will guarantee each row is distinct, but it won't eliminate graph routes that don't end up in your last edge. Take this graph:
insert into #bar (parent_id,child_id) values (1,2)
insert into #bar (parent_id,child_id) values (1,5)
insert into #bar (parent_id,child_id) values (2,3)
insert into #bar (parent_id,child_id) values (2,6)
insert into #bar (parent_id,child_id) values (6,4)
The results of the query here include (1,5), which is not part of the route from the first edge (1,2) to the last edge (6,4).
You could try something like this, to find only routes that start with (1,2) and end with (6,4):
with foo(parent_id, child_id, route) as (
select parent_id, child_id,
cast(cast(parent_id as varchar) +
cast(child_id as varchar) as varchar(128))
from #bar
union all
select #bar.parent_id, #bar.child_id,
cast(route + cast(#bar.child_id as varchar) as varchar(128))
from #bar
join foo on #bar.parent_id = foo.child_id
)
select * from foo where route like '12%64'
Is this what you want to do?
create table #bar (parent_id int, child_id int)
insert #bar values (1,2)
insert #bar values (2,3)
insert #bar values (3,4)
declare @start_node table (parent_id int)
insert @start_node values (1)
insert @start_node values (3)
;with foo(parent_id,child_id) as (
select
parent_id
,child_id
from #bar where parent_id in (select parent_id from @start_node)
union all
select
#bar.*
from #bar
join foo on #bar.parent_id = foo.child_id
where foo.child_id not in (select parent_id from @start_node)
)
select parent_id,child_id from foo
Edit - @bacar - I don't think this is the temp table solution Quassnoi was proposing. I believe they were suggesting that you duplicate the entire contents of the recursive member during each recursion and use that as a join to prevent reprocessing (and that this is not supported in SQL Server 2005). My approach is supported, and the only change to your original is the predicate in the recursive member, which excludes recursing down paths that were explicitly in your starting set. I only added the table variable so that you would define the starting parent_ids in one location; you could just as easily have used this predicate with your original query:
where foo.child_id not in (1,3)
EDIT -- This doesn't work at all. This is a method to stop chasing triangle routes. It doesn't do what the OP wanted.
Or you can use a recursive token separated string.
I'm at home on my laptop ( no sql server ) so this might not be completely right but here goes.....
; WITH NodeNetwork AS (
-- Anchor Definition
SELECT
b.[parent_Id] AS [Parent_ID]
, b.[child_Id] AS [Child_ID]
, CAST(b.[Parent_Id] AS VARCHAR(MAX)) AS [NodePath]
FROM
#bar AS b
-- Recursive Definition
UNION ALL SELECT
b.[Parent_Id]
, b.[child_Id]
, CAST(nn.[NodePath] + '-' + CAST(b.[Parent_Id] AS VARCHAR(MAX)) AS VARCHAR(MAX))
FROM
NodeNetwork AS nn
JOIN #bar AS b ON b.[Parent_Id] = nn.[Child_ID]
WHERE
nn.[NodePath] NOT LIKE '%[-]' + CAST(b.[Parent_Id] AS VARCHAR(MAX)) + '%'
)
SELECT * FROM NodeNetwork
Or similar. Sorry It's late and I can't test it. I'll check on Monday morning. Credit for this must go to Peter Larsson (Peso)
The idea was generated here:
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=115290

SQL Server 2005 recursive query with loops in data - is it possible?

I've got a standard boss/subordinate employee table. I need to select a boss (specified by ID) and all his subordinates (and their subordinates, etc). Unfortunately the real world data has some loops in it (for example, both company owners have each other set as their boss). The simple recursive query with a CTE chokes on this (maximum recursion level of 100 exceeded). Can the employees still be selected? I don't care about the order in which they are selected, just that each of them is selected once.
Added: You want my query? Umm... OK... I thought it was pretty obvious, but here it is:
with
UserTbl as -- Selects an employee and his subordinates.
(
select a.[User_ID], a.[Manager_ID] from [User] a WHERE [User_ID] = @UserID
union all
select a.[User_ID], a.[Manager_ID] from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
)
select * from UserTbl
Added 2: Oh, in case it wasn't clear - this is a production system and I have to do a little upgrade (basically add a sort of report). Thus, I'd prefer not to modify the data if it can be avoided.
I know it has been a while, but I thought I should share my experience, as I tried every single solution. Here is a summary of my findings (and maybe this post?):
Adding a column with the current path did work but had a performance hit so not an option for me.
I could not find a way to do it using CTE.
I wrote a recursive SQL function which adds employeeIds to a table. To get around the circular referencing, there is a check to make sure no duplicate IDs are added to the table. The performance was average but was not desirable.
Having done all of that, I came up with the idea of dumping the whole subset of [eligible] employees to code (C#) and filter them there using a recursive method. Then I wrote the filtered list of employees to a datatable and export it to my stored procedure as a temp table. To my disbelief, this proved to be the fastest and most flexible method for both small and relatively large tables (I tried tables of up to 35,000 rows).
This will work for the initial recursive link, but might not work for longer loops:
DECLARE @Table TABLE(
ID INT,
PARENTID INT
)
INSERT INTO @Table (ID,PARENTID) SELECT 1, 2
INSERT INTO @Table (ID,PARENTID) SELECT 2, 1
INSERT INTO @Table (ID,PARENTID) SELECT 3, 1
INSERT INTO @Table (ID,PARENTID) SELECT 4, 3
INSERT INTO @Table (ID,PARENTID) SELECT 5, 2
SELECT * FROM @Table
DECLARE @ID INT
SELECT @ID = 1
;WITH boss (ID,PARENTID) AS (
SELECT ID,
PARENTID
FROM @Table
WHERE PARENTID = @ID
),
bossChild (ID,PARENTID) AS (
SELECT ID,
PARENTID
FROM boss
UNION ALL
SELECT t.ID,
t.PARENTID
FROM @Table t INNER JOIN
bossChild b ON t.PARENTID = b.ID
WHERE t.ID NOT IN (SELECT PARENTID FROM boss)
)
SELECT *
FROM bossChild
OPTION (MAXRECURSION 0)
What I would recommend is to use a WHILE loop and only insert links into a temp table if the ID does not already exist there, thus avoiding endless loops.
Not a generic solution, but might work for your case: in your select query modify this:
select a.[User_ID], a.[Manager_ID] from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
to become:
select a.[User_ID], a.[Manager_ID] from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
and a.[User_ID] <> @UserID
You don't have to do it recursively. It can be done in a WHILE loop. I guarantee it will be quicker; at least it has been for me every time I've timed the two techniques. This sounds inefficient, but it isn't, since the number of loop iterations equals the recursion depth. At each iteration you can check for looping and correct it where it happens. You can also put a constraint on the temporary table to raise an error if looping occurs, though you seem to prefer something that deals with looping more elegantly. You can also raise an error when the WHILE loop iterates past a certain number of levels (to catch an undetected loop; oh boy, it sometimes happens).
The trick is to insert repeatedly into a temporary table (which is primed with the root entries), including a column with the current iteration number, and doing an inner join between the most recent results in the temporary table and the child entries in the original table. Just break out of the loop when @@rowcount=0!
Simple eh?
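For what it's worth, here is a rough sketch of that WHILE loop, not the answerer's actual code. It assumes a [User] table with INT columns User_ID and Manager_ID, as in the question; the temp table #Found, the Lvl column, and the cap of 100 levels are made-up names and values for illustration.

```sql
-- Sketch only: [User](User_ID, Manager_ID) is assumed; #Found, Lvl and @MaxLevel are hypothetical.
DECLARE @UserID INT = 1;
DECLARE @Level INT = 0;
DECLARE @MaxLevel INT = 100;   -- safety cap to catch an undetected loop
DECLARE @Rows INT;

CREATE TABLE #Found (
    User_ID INT NOT NULL PRIMARY KEY,   -- the key raises an error if a user is ever inserted twice
    Manager_ID INT NULL,
    Lvl INT NOT NULL                    -- iteration number at which the row was found
);

-- Prime the temp table with the root entry.
INSERT INTO #Found (User_ID, Manager_ID, Lvl)
SELECT User_ID, Manager_ID, 0
FROM [User]
WHERE User_ID = @UserID;

SET @Rows = @@ROWCOUNT;

WHILE @Rows > 0
BEGIN
    SET @Level = @Level + 1;

    IF @Level > @MaxLevel
    BEGIN
        RAISERROR('Hierarchy deeper than %d levels; possible undetected loop.', 16, 1, @MaxLevel);
        BREAK;
    END;

    -- Join only the most recent results to their children,
    -- skipping any user that is already in the temp table.
    INSERT INTO #Found (User_ID, Manager_ID, Lvl)
    SELECT u.User_ID, u.Manager_ID, @Level
    FROM [User] AS u
    INNER JOIN #Found AS f
        ON u.Manager_ID = f.User_ID
       AND f.Lvl = @Level - 1
    WHERE NOT EXISTS (SELECT 1 FROM #Found AS seen WHERE seen.User_ID = u.User_ID);

    SET @Rows = @@ROWCOUNT;   -- captured immediately, because SET and IF reset @@ROWCOUNT
END;

SELECT User_ID, Manager_ID, Lvl
FROM #Found
ORDER BY Lvl, User_ID;

DROP TABLE #Found;
```

The row count is copied into @Rows right after each INSERT because later statements in the loop would overwrite @@ROWCOUNT. The NOT EXISTS check is what stops a loop in the data from being followed a second time.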
I know you asked this question a while ago, but here is a solution that may work for detecting infinite recursive loops. I generate a path and check in the CTE condition whether the USER ID is already in the path; if it is, the row won't be processed again. Hope this helps.
Jose
DECLARE @Table TABLE(
USER_ID INT,
MANAGER_ID INT )
INSERT INTO @Table (USER_ID,MANAGER_ID) SELECT 1, 2
INSERT INTO @Table (USER_ID,MANAGER_ID) SELECT 2, 1
INSERT INTO @Table (USER_ID,MANAGER_ID) SELECT 3, 1
INSERT INTO @Table (USER_ID,MANAGER_ID) SELECT 4, 3
INSERT INTO @Table (USER_ID,MANAGER_ID) SELECT 5, 2
DECLARE @UserID INT
SELECT @UserID = 1
;with
UserTbl as -- Selects an employee and his subordinates.
(
select
'/'+cast( a.USER_ID as varchar(max)) as [path],
a.[User_ID],
a.[Manager_ID]
from @Table a
where [User_ID] = @UserID
union all
select
b.[path] +'/'+ cast( a.USER_ID as varchar(max)) as [path],
a.[User_ID],
a.[Manager_ID]
from @Table a
inner join UserTbl b
on (a.[Manager_ID]=b.[User_ID])
-- Skip users already on the path; the trailing '/' lets the last path element match too.
where charindex('/'+cast( a.USER_ID as varchar(max))+'/', b.[path]+'/') = 0
)
select * from UserTbl
Basically, if you have loops like this in the data, you'll have to do the retrieval logic yourself.
You could use one CTE to get only subordinates and another to get bosses.
Another idea is to have a dummy row as the boss of both company owners, so they wouldn't be each other's bosses, which is ridiculous. This is my preferred option.
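As a small illustration of the dummy-row idea, here is a sketch using the sample data from the answer above, where users 1 and 2 manage each other. The ID 0 for the dummy boss is an arbitrary choice made up for this example.

```sql
-- Sample data with a loop: 1 reports to 2 and 2 reports to 1.
DECLARE @Table TABLE (USER_ID INT, MANAGER_ID INT);
INSERT INTO @Table (USER_ID, MANAGER_ID)
VALUES (1, 2), (2, 1), (3, 1), (4, 3), (5, 2);

-- Hypothetical dummy boss (ID 0) with no manager of its own.
INSERT INTO @Table (USER_ID, MANAGER_ID) VALUES (0, NULL);

-- Point both owners at the dummy row, which breaks the loop.
UPDATE @Table SET MANAGER_ID = 0 WHERE USER_ID IN (1, 2);

-- A plain recursive CTE now terminates without any loop checks.
WITH Org AS (
    SELECT USER_ID, MANAGER_ID
    FROM @Table
    WHERE USER_ID = 0
    UNION ALL
    SELECT t.USER_ID, t.MANAGER_ID
    FROM @Table AS t
    INNER JOIN Org AS o
        ON t.MANAGER_ID = o.USER_ID
)
SELECT USER_ID, MANAGER_ID
FROM Org;
```

The catch is that reports and joins then have to filter out the dummy row wherever a real person is expected.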
I can think of two approaches.
1) Produce more rows than you want, but include a check to make sure it does not recurse too deep. Then remove duplicate User records.
2) Use a string to hold the Users already visited. Like the not in subquery idea that didn't work.
Approach 1:
; with TooMuchHierarchy as (
select "User_ID"
, Manager_ID
, 0 as Depth
from "User"
WHERE "User_ID" = @UserID
union all
select U."User_ID"
, U.Manager_ID
, M.Depth + 1 as Depth
from TooMuchHierarchy M
inner join "User" U
on U.Manager_ID = M."user_id"
where Depth < 100) -- Warning MAGIC NUMBER!!
, AddMaxDepth as (
select "User_ID"
, Manager_id
, Depth
, max(depth) over (partition by "User_ID") as MaxDepth
from TooMuchHierarchy)
select "user_id", Manager_Id
from AddMaxDepth
where Depth = MaxDepth
The line where Depth < 100 is what keeps you from getting the max recursion error. Make this number smaller, and fewer records will be produced that need to be thrown away. Make it too small and employees won't be returned, so make sure it is at least as large as the depth of the org chart being stored. It is a bit of a maintenance nightmare as the company grows. If it needs to be bigger, then add option (maxrecursion ... number ...) to the whole thing to allow more recursion.
Approach 2:
; with Hierarchy as (
select "User_ID"
, Manager_ID
, '#' + cast("user_id" as varchar(max)) + '#' as user_id_list
from "User"
WHERE "User_ID" = @UserID
union all
select U."User_ID"
, U.Manager_ID
, M.user_id_list + '#' + cast(U."user_id" as varchar(max)) + '#' as user_id_list
from Hierarchy M
inner join "User" U
on U.Manager_ID = M."user_id"
where user_id_list not like '%#' + cast(U."User_id" as varchar(max)) + '#%')
select "user_id", Manager_Id
from Hierarchy
The preferable solution is to clean up the data and to make sure you do not have any loops in the future; that can be accomplished with a trigger or a UDF wrapped in a check constraint.
However, you can use a multi statement UDF as I demonstrated here: Avoiding infinite loops. Part One
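To make the trigger option a bit more concrete, here is a minimal sketch and not the linked article's code. It assumes a [User] table with INT columns User_ID and Manager_ID and no loops already in the data; the trigger name and the depth cap of 100 are made up for illustration.

```sql
CREATE TRIGGER trg_User_NoCycles   -- hypothetical name
ON [User]
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @CycleFound BIT = 0;

    -- Walk up the manager chain from every changed row.
    -- Arriving back at the starting user means the change created a loop.
    WITH Chain AS (
        SELECT i.User_ID AS StartID, i.Manager_ID AS CurrentID, 1 AS Depth
        FROM inserted AS i
        WHERE i.Manager_ID IS NOT NULL
        UNION ALL
        SELECT c.StartID, u.Manager_ID, c.Depth + 1
        FROM Chain AS c
        INNER JOIN [User] AS u
            ON u.User_ID = c.CurrentID
        WHERE c.CurrentID <> c.StartID   -- stop once the loop is closed
          AND u.Manager_ID IS NOT NULL   -- stop at the top of the hierarchy
          AND c.Depth < 100              -- assumed cap; deeper chains are not checked
    )
    SELECT @CycleFound = 1
    FROM Chain
    WHERE CurrentID = StartID;

    IF @CycleFound = 1
    BEGIN
        RAISERROR('Change rejected: it would create a loop in the manager hierarchy.', 16, 1);
        ROLLBACK TRANSACTION;
    END;
END;
```

Since every user has a single manager, any new loop has to pass through one of the changed rows, so starting the walk from the inserted pseudo-table should be enough as long as the existing data is clean.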
You can add a NOT IN() clause in the join to filter out the cycles.
This is the code I used on a project to chase up and down hierarchical relationship trees.
User defined function to capture subordinates:
CREATE FUNCTION fn_UserSubordinates(@User_ID INT)
RETURNS @SubordinateUsers TABLE (User_ID INT, Distance INT) AS BEGIN
IF @User_ID IS NULL
RETURN
INSERT INTO @SubordinateUsers (User_ID, Distance) VALUES ( @User_ID, 0)
DECLARE @Distance INT, @Finished BIT
SELECT @Distance = 1, @Finished = 0
WHILE @Finished = 0
BEGIN
INSERT INTO @SubordinateUsers
SELECT S.User_ID, @Distance
FROM Users AS S
JOIN @SubordinateUsers AS C
ON C.User_ID = S.Manager_ID
LEFT JOIN @SubordinateUsers AS C2
ON C2.User_ID = S.User_ID
WHERE C2.User_ID IS NULL
IF @@ROWCOUNT = 0
SET @Finished = 1
SET @Distance = @Distance + 1
END
RETURN
END
User defined function to capture managers:
CREATE FUNCTION fn_UserManagers(@User_ID INT)
RETURNS @User TABLE (User_ID INT, Distance INT) AS BEGIN
IF @User_ID IS NULL
RETURN
DECLARE @Manager_ID INT
SELECT @Manager_ID = Manager_ID
FROM Users WITH (NOLOCK)
WHERE User_ID = @User_ID
-- Recurse up the chain first, then add the current user at distance 0.
INSERT INTO @User (User_ID, Distance)
SELECT User_ID, Distance + 1
FROM dbo.fn_UserManagers(@Manager_ID)
INSERT INTO @User (User_ID, Distance) VALUES (@User_ID, 0)
RETURN
END
You need some method to prevent your recursive query from adding User IDs already in the set. However, as sub-queries and double mentions of the recursive table are not allowed (thank you van), you need another solution to remove the users already in the list.
The solution is to use EXCEPT to remove these rows. This should work according to the manual. Multiple recursive statements linked with union-type operators are allowed. Removing the users already in the list means that after a certain number of iterations the recursive result set returns empty and the recursion stops.
with UserTbl as -- Selects an employee and his subordinates.
(
select a.[User_ID], a.[Manager_ID] from [User] a WHERE [User_ID] = @UserID
union all
(
select a.[User_ID], a.[Manager_ID]
from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
where a.[User_ID] not in (select [User_ID] from UserTbl)
EXCEPT
select a.[User_ID], a.[Manager_ID] from UserTbl a
)
)
select * from UserTbl;
The other option is to hardcode a level variable that will stop the query after a fixed number of iterations or use the MAXRECURSION query option hint, but I guess that is not what you want.