Prevent recursive CTE visiting nodes multiple times - sql

Consider the following simple DAG:
1->2->3->4
And a table, #bar, describing this (I'm using SQL Server 2005):
parent_id child_id
1 2
2 3
3 4
//... other edges, not connected to the subgraph above
Now imagine that I have some other arbitrary criteria that select the first and last edges, i.e. 1->2 and 3->4. I want to use these to find the rest of my graph.
I can write a recursive CTE as follows (I'm using terminology from MSDN):
with foo(parent_id,child_id) as (
-- anchor member that happens to select first and last edges:
select parent_id,child_id from #bar where parent_id in (1,3)
union all
-- recursive member:
select #bar.* from #bar
join foo on #bar.parent_id = foo.child_id
)
select parent_id,child_id from foo
However, this results in edge 3->4 being selected twice:
parent_id child_id
1 2
3 4
2 3
3 4 // 2nd appearance!
How can I prevent the query from recursing into subgraphs that have already been described? I could achieve this if, in the "recursive member" part of the query, I could reference all the data retrieved by the recursive CTE so far (and supply a predicate in the recursive member that excludes nodes already visited). However, I think I can access only the data returned by the last iteration of the recursive member.
This will not scale well when there is a lot of such repetition. Is there a way of preventing this unnecessary additional recursion?
Note that I could use "select distinct" in the last line of my statement to achieve the desired results, but this seems to be applied after all the (repeated) recursion is done, so I don't think this is an ideal solution.
Edit - hainstech suggests stopping recursion by adding a predicate to exclude recursing down paths that were explicitly in the starting set, i.e. recurse only where foo.child_id not in (1,3). That works for the case above only because it is simple: all the repeated sections begin within the anchor set of nodes. It doesn't solve the general case where they may not. E.g., consider adding edges 1->4 and 4->5 to the above set. Edge 4->5 will be captured twice, even with the suggested predicate. :(

CTEs are recursive.
When your CTE has multiple initial conditions, that means it also has different recursion stacks, and there is no way to use information from one stack in another stack.
In your example, the recursion stacks will go as follows:
(1) - first IN condition
(1, 2)
(1, 2, 3)
(1, 2, 3, 4)
(1, 2, 3) - no more children
(1, 2) - no more children
(1) - no more children, going to second IN condition
(3) - second condition
(3, 4)
(3) - no more children, returning
As you can see, these recursion stacks do not intersect.
You could probably record the visited values in a temporary table, JOIN each value with the temp table, and not follow the value if it's found, but SQL Server does not support these things.
So you just use SELECT DISTINCT.
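To make the duplicate and the DISTINCT workaround concrete, here is a minimal sketch that reruns the question's query. It uses SQLite through Python's sqlite3 module as a stand-in for SQL Server 2005 (an assumption made only so the example is runnable), with an ordinary table named bar instead of the temp table #bar.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (parent_id INTEGER, child_id INTEGER)")
conn.executemany("INSERT INTO bar VALUES (?, ?)", [(1, 2), (2, 3), (3, 4)])

# Same shape as the CTE in the question: the anchor picks edges 1->2 and 3->4.
CTE = """
WITH RECURSIVE foo(parent_id, child_id) AS (
    SELECT parent_id, child_id FROM bar WHERE parent_id IN (1, 3)
    UNION ALL
    SELECT bar.parent_id, bar.child_id
    FROM bar JOIN foo ON bar.parent_id = foo.child_id
)
"""

all_rows = conn.execute(CTE + "SELECT parent_id, child_id FROM foo").fetchall()
distinct_rows = conn.execute(
    CTE + "SELECT DISTINCT parent_id, child_id FROM foo"
).fetchall()

print(sorted(all_rows))       # edge (3, 4) is listed twice
print(sorted(distinct_rows))  # each edge once
```

As the question points out, DISTINCT only cleans up the final result; the recursion still walks the 3->4 branch twice.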

This is the approach I used. It has been tested against several methods and was the most performant. It combines the temp table idea suggested by Quassnoi and the use of both distinct and a left join to eliminate redundant paths to the recursion. The level of the recursion is also included.
I left the failed CTE approach in the code so you could compare results.
If someone has a better idea, I'd love to know it.
create table #bar (unique_id int identity(10,10), parent_id int, child_id int)
insert #bar (parent_id, child_id)
SELECT 1,2 UNION ALL
SELECT 2,3 UNION ALL
SELECT 3,4 UNION ALL
SELECT 2,5 UNION ALL
SELECT 2,5 UNION ALL
SELECT 5,6
SET NOCOUNT ON
;with foo(unique_id, parent_id,child_id, ord, lvl) as (
-- anchor member that happens to select first and last edges:
select unique_id, parent_id, child_id, row_number() over(order by unique_id), 0
from #bar where parent_id in (1,3)
union all
-- recursive member:
select b.unique_id, b.parent_id, b.child_id, row_number() over(order by b.unique_id), foo.lvl+1
from #bar b
join foo on b.parent_id = foo.child_id
)
select unique_id, parent_id,child_id, ord, lvl from foo
/***********************************
Manual Recursion
***********************************/
Declare @lvl as int
Declare @rows as int
DECLARE @foo as Table(
unique_id int,
parent_id int,
child_id int,
ord int,
lvl int)
--Get anchor condition
INSERT @foo (unique_id, parent_id, child_id, ord, lvl)
select unique_id, parent_id, child_id, row_number() over(order by unique_id), 0
from #bar where parent_id in (1,3)
set @rows=@@ROWCOUNT
set @lvl=0
--Do recursion
WHILE @rows > 0
BEGIN
set @lvl = @lvl + 1
INSERT @foo (unique_id, parent_id, child_id, ord, lvl)
SELECT DISTINCT b.unique_id, b.parent_id, b.child_id, row_number() over(order by b.unique_id), @lvl
FROM #bar b
inner join @foo f on b.parent_id = f.child_id
--might be multiple paths to this recursion so eliminate duplicates
left join @foo dup on dup.unique_id = b.unique_id
WHERE f.lvl = @lvl-1 and dup.child_id is null
set @rows=@@ROWCOUNT
END
SELECT * from @foo
DROP TABLE #bar
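The idea behind the manual recursion (expand one level at a time and skip anything already collected) can be sketched outside the database too. This is only an illustration in plain Python, not a translation of the T-SQL above: the function name collect_edges is made up, and edges are identified by their (parent, child) pair rather than by a unique_id column, so exact duplicate rows collapse into one.

```python
def collect_edges(edges, start_parents):
    """Level-by-level expansion that never re-adds an edge already collected."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append((parent, child))

    visited = []   # edges in discovery order, each stored once
    seen = set()   # same edges, for fast membership checks
    frontier = [e for e in edges if e[0] in start_parents]  # anchor set
    while frontier:
        next_frontier = []
        for edge in frontier:
            if edge in seen:
                continue  # reached before via another path: do not expand again
            seen.add(edge)
            visited.append(edge)
            next_frontier.extend(children.get(edge[1], []))
        frontier = next_frontier
    return visited


# The harder case from the question's edit: extra edges 1->4 and 4->5.
edges = [(1, 2), (2, 3), (3, 4), (1, 4), (4, 5)]
result = collect_edges(edges, {1, 3})
print(result)
```

Here edge 4->5 is reachable through both 3->4 and 1->4, but it is collected and expanded only once.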

Do you happen to know which of the two edges is on a deeper level in the tree? Because in that case, you could make edge 3->4 the anchor member and start walking up the tree until you find edge 1->2.
Something like this:
with foo(parent_id, child_id)
as
(
select parent_id, child_id
from #bar
where parent_id = 3
union all
select b.parent_id, b.child_id
from #bar b
inner join foo f on b.child_id = f.parent_id
where b.parent_id <> 1
)
select *
from foo

(I'm no expert on graphs, just exploring a bit)
The DISTINCT will guarantee each row is distinct, but it won't eliminate graph routes that don't end up in your last edge. Take this graph:
insert into #bar (parent_id,child_id) values (1,2)
insert into #bar (parent_id,child_id) values (1,5)
insert into #bar (parent_id,child_id) values (2,3)
insert into #bar (parent_id,child_id) values (2,6)
insert into #bar (parent_id,child_id) values (6,4)
The results of the query here include (1,5), which is not part of the route from the first edge (1,2) to the last edge (6,4).
You could try something like this, to find only routes that start with (1,2) and end with (6,4):
with foo(parent_id, child_id, route) as (
select parent_id, child_id,
cast(cast(parent_id as varchar) +
cast(child_id as varchar) as varchar(128))
from #bar
union all
select #bar.parent_id, #bar.child_id,
cast(route + cast(#bar.child_id as varchar) as varchar(128))
from #bar
join foo on #bar.parent_id = foo.child_id
)
select * from foo where route like '12%64'

Is this what you want to do?
create table #bar (parent_id int, child_id int)
insert #bar values (1,2)
insert #bar values (2,3)
insert #bar values (3,4)
declare @start_node table (parent_id int)
insert @start_node values (1)
insert @start_node values (3)
;with foo(parent_id,child_id) as (
select
parent_id
,child_id
from #bar where parent_id in (select parent_id from @start_node)
union all
select
#bar.*
from #bar
join foo on #bar.parent_id = foo.child_id
where foo.child_id not in (select parent_id from @start_node)
)
select parent_id,child_id from foo
Edit - @bacar - I don't think this is the temp table solution Quassnoi was proposing. I believe they were suggesting that you basically duplicate the entire recursive member's contents during each recursion and use that as a join to prevent reprocessing (and that this is not supported in SQL Server 2005). My approach is supported, and the only change to your original is the predicate in the recursive member, which excludes recursing down paths that were explicitly in your starting set. I only added the table variable so that you would define the starting parent_ids in one location; you could just as easily have used this predicate with your original query:
where foo.child_id not in (1,3)

EDIT -- This doesn't work at all. This is a method to stop chasing triangle routes. It doesn't do what the OP wanted.
Or you can use a recursive token separated string.
I'm at home on my laptop ( no sql server ) so this might not be completely right but here goes.....
; WITH NodeNetwork AS (
-- Anchor Definition
SELECT
b.[parent_Id] AS [Parent_ID]
, b.[child_Id] AS [Child_ID]
, CAST(b.[Parent_Id] AS VARCHAR(MAX)) AS [NodePath]
FROM
#bar AS b
-- Recursive Definition
UNION ALL SELECT
b.[Parent_Id]
, b.[child_Id]
, CAST(nn.[NodePath] + '-' + CAST(b.[Parent_Id] AS VARCHAR(MAX)) AS VARCHAR(MAX))
FROM
NodeNetwork AS nn
JOIN #bar AS b ON b.[Parent_Id] = nn.[Child_ID]
WHERE
nn.[NodePath] NOT LIKE '%[-]' + CAST(b.[Parent_Id] AS VARCHAR(MAX)) + '%'
)
SELECT * FROM NodeNetwork
Or similar. Sorry It's late and I can't test it. I'll check on Monday morning. Credit for this must go to Peter Larsson (Peso)
The idea was generated here:
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=115290

Related

SQL Hierarchy - Resolve full path for all ancestors of a given node

I have a hierarchy described by an adjacency list. There is not necessarily a single root element, but I do have data to identify the leaf (terminal) items in the hierarchy. So, a hierarchy that looked like this ...
1
- 2
- - 4
- - - 7
- 3
- - 5
- - 6
8
- 9
... would be described by a table, like this. NOTE: I don't have the ability to change this format.
id parentid isleaf
--- -------- ------
1 null 0
2 1 0
3 1 0
4 2 0
5 3 1
6 3 1
7 4 1
8 null 0
9 8 1
here is the sample table definition and data:
CREATE TABLE [dbo].[HiearchyTest](
[id] [int] NOT NULL,
[parentid] [int] NULL,
[isleaf] [bit] NOT NULL
)
GO
INSERT [dbo].[HiearchyTest] ([id], [parentid], [isleaf]) VALUES (1, NULL, 0)
INSERT [dbo].[HiearchyTest] ([id], [parentid], [isleaf]) VALUES (2, 1, 0)
INSERT [dbo].[HiearchyTest] ([id], [parentid], [isleaf]) VALUES (3, 1, 0)
INSERT [dbo].[HiearchyTest] ([id], [parentid], [isleaf]) VALUES (4, 2, 0)
INSERT [dbo].[HiearchyTest] ([id], [parentid], [isleaf]) VALUES (5, 3, 1)
INSERT [dbo].[HiearchyTest] ([id], [parentid], [isleaf]) VALUES (6, 3, 1)
INSERT [dbo].[HiearchyTest] ([id], [parentid], [isleaf]) VALUES (7, 4, 1)
INSERT [dbo].[HiearchyTest] ([id], [parentid], [isleaf]) VALUES (8, NULL, 0)
INSERT [dbo].[HiearchyTest] ([id], [parentid], [isleaf]) VALUES (9, 8, 1)
GO
From this, I need to provide any id and get a list of all ancestors, including all descendants of each. So, if I provided the input of id = 6, I would expect the following:
id descendentid
-- ------------
1 1
1 3
1 6
3 3
3 6
6 6
id 6 just has itself
its parent, id 3, would have descendants of 3 and 6
its parent, id 1, would have descendants of 1, 3, and 6
I will be using this data to provide roll-up calculations at each level in the hierarchy. This works well, assuming I can get the dataset above.
I have accomplished this using two recursive CTEs: one to get the "terminal" item for each node in the hierarchy, then a second one where I get the full ancestry of my selected node (so, 6 resolves to 6, 3, 1) to walk up and get the full set. I'm hoping that I'm missing something and that this can be accomplished in one round. Here is the example double-recursion code:
declare @test int = 6;
with cte as (
-- leaf nodes
select id, parentid, id as terminalid
from HiearchyTest
where isleaf = 1
union all
-- walk up - preserve "terminal" item for all levels
select h.id, h.parentid, c.terminalid
from HiearchyTest as h
inner join
cte as c on h.id = c.parentid
)
, cte2 as (
-- get all ancestors of our test value
select id, parentid, id as descendentid
from cte
where terminalid = @test
union all
-- and walkup from each to complete the set
select h.id, h.parentid, c.descendentid
from HiearchyTest h
inner join cte2 as c on h.id = c.parentid
)
-- final selection - order by is just for readability of this example
select id, descendentid
from cte2
order by id, descendentid
Additional detail: the "real" hierarchy will be much larger than the example. It can technically have infinite depth, but realistically it would rarely go more than 10 levels deep.
In summary, my question is if I can accomplish this with a single recursive cte instead of having to recurse over the hierarchy twice.
Because your data is a tree structure, we can use the hierarchyid data type to meet your needs (despite your saying that you can't in the comments). First, the easy part - generating the hierarchyid with a recursive cte
with cte as (
select id, parentid,
cast(concat('/', id, '/') as varchar(max)) as [path]
from [dbo].[HiearchyTest]
where ParentID is null
union all
select child.id, child.parentid,
cast(concat(parent.[path], child.id, '/') as varchar(max))
from [dbo].[HiearchyTest] as child
join cte as parent
on child.parentid = parent.id
)
select id, parentid, cast([path] as hierarchyid) as [path]
into h
from cte;
Next, a little table-valued function I wrote:
create function dbo.GetAllAncestors(@h hierarchyid, @ReturnSelf bit)
returns table
as return
select @h.GetAncestor(n.n) as h
from dbo.Numbers as n
where n.n <= @h.GetLevel()
or (@ReturnSelf = 1 and n.n = 0)
union all
select @h
where @ReturnSelf = 1;
Armed with that, getting your desired result set isn't too bad:
declare @h hierarchyid;
set @h = (
select path
from h
where id = 6
);
with cte as (
select *
from h
where [path].IsDescendantOf(@h) = 1
or @h.IsDescendantOf([path]) = 1
)
select h.id as parent, c.id as descendentid
from cte as c
cross apply dbo.GetAllAncestors([path], 1) as a
join h
on a.h = h.[path]
order by h.id, c.id;
Of course, you're missing out on a lot of the benefit of using a hierarchyid by not persisting it (you'll either have to keep it up to date in the side table or generate it every time). But there you go.
Okay this has been bothering me since I have read the question and I just came back to think of it again..... Anyway, why do you need to recurse back down to get all of the descendants? You have asked for ancestors not descendants and your result set is not trying to get other siblings, grand children, etc.. It is getting a parent and a grand parent in this case. Your First cte gives you everything you need to know except when an ancestor id is also the parentid. So with a union all, a little magic to setup the originating ancestor, and you have everything you need without a second recursion.
declare @test int = 6;
with cte as (
-- leaf nodes
select id, parentid, id as terminalid
from HiearchyTest
where isleaf = 1
union all
-- walk up - preserve "terminal" item for all levels
select h.id, h.parentid, c.terminalid
from HiearchyTest as h
inner join
cte as c on h.id = c.parentid
)
, cteAncestors AS (
SELECT DISTINCT
id = IIF(parentid IS NULL, @test, id)
,parentid = IIF(parentid IS NULL,id,parentid)
FROM
cte
WHERE
terminalid = @test
UNION
SELECT DISTINCT
id
,parentid = id
FROM
cte
WHERE
terminalid = @test
)
SELECT
id = parentid
,DecendentId = id
FROM
cteAncestors
ORDER BY
id
,DecendentId
The result set from your first cte gives you your 2 ancestors and self, each related to its ancestor, except in the case of the originating ancestor, whose parentid is null. That null is a special case I will deal with in a minute.
Remember that at this point your query is producing ancestors, not descendants, but what it doesn't give you is self references, meaning grandparent = grandparent, parent = parent, self = self. All you have to do to get those is add a row for every id and make the parentid equal to the id, hence the union. Now your result set is almost totally shaped up.
Now the special case of the null parentid. The null parentid identifies the originating ancestor, meaning that ancestor has no other ancestor in your dataset. Here is how you can use that to your advantage: because you started your initial recursion at the leaf level, there is no direct tie from the id that you started with to the originating ancestor, but there is at every other level. Simply hijack that null parentid and flip the values around, and you now have an ancestor for your leaf.
Then in the end, if you want it to be a descendants table, switch the columns and you are finished. One last note: the DISTINCTs are there in case the id is repeated with an additional parentid, e.g. 6 | 3 and another record for 6 | 4.
I'm not sure if this performs better, or even produces the proper results in all cases, but you could capture a node list, then use xml functionality to parse it out and cross apply to the id list:
declare @test int = 6;
;WITH cte AS (SELECT id, parentid, CAST(id AS VARCHAR(MAX)) as IDlist
FROM HiearchyTest
WHERE isleaf = 1
UNION ALL
SELECT h.id, h.parentid , CAST(CONCAT(c.IDlist,',',h.id) AS VARCHAR(MAX))
FROM HiearchyTest as h
JOIN cte as c
ON h.id = c.parentid
)
,cte2 AS (SELECT *, CAST ('<M>' + REPLACE(IDlist, ',', '</M><M>') + '</M>' AS XML) AS Data
FROM cte
WHERE IDlist LIKE '%'+CAST(@test AS VARCHAR(50))+'%'
)
SELECT id,Split.a.value('.', 'VARCHAR(100)') AS descendentid
FROM cte2 a
CROSS APPLY Data.nodes ('/M') AS Split(a);
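For the single-recursion question, one more possibility (not taken from any of the answers above, so treat it as an untested idea for your real data) is to recurse only once, up from the chosen id while recording a level, and then self-join that chain so each ancestor is paired with every chain member at or below it. The sketch below runs it on SQLite through Python's sqlite3 purely so it is runnable; the names chain and lvl are made up. It lists only nodes on the path from the chosen id to its root, which is what the expected output for id = 6 shows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HiearchyTest (id INTEGER, parentid INTEGER, isleaf INTEGER)"
)
conn.executemany(
    "INSERT INTO HiearchyTest VALUES (?, ?, ?)",
    [(1, None, 0), (2, 1, 0), (3, 1, 0), (4, 2, 0), (5, 3, 1),
     (6, 3, 1), (7, 4, 1), (8, None, 0), (9, 8, 1)],
)

QUERY = """
WITH RECURSIVE chain(id, parentid, lvl) AS (
    -- start at the chosen node (level 0) and walk up to the root
    SELECT id, parentid, 0 FROM HiearchyTest WHERE id = :test
    UNION ALL
    SELECT h.id, h.parentid, c.lvl + 1
    FROM HiearchyTest AS h JOIN chain AS c ON h.id = c.parentid
)
-- pair every chain member with itself and everything below it in the chain
SELECT a.id, d.id AS descendentid
FROM chain AS a JOIN chain AS d ON d.lvl <= a.lvl
ORDER BY a.id, d.id
"""

pairs = conn.execute(QUERY, {"test": 6}).fetchall()
print(pairs)
```

The self-join is not recursive, so the hierarchy is walked only once.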

CTE WITH RECURSIVE Up and back? How do I get the whole tree from any node?

Given a very simplistic table like:
-- SQLite3
CREATE TABLE tst (
id INTEGER PRIMARY KEY AUTOINCREMENT,
parent_id INTEGER CHECK (parent_id <> id),
tag STRING NOT NULL,
FOREIGN KEY (parent_id) REFERENCES tst(id)
)
I can use WITH RECURSIVE (common table expressions) to go from any node up to the "root" of that tree or to traverse downward from a node to all of its children (along all branches). Here are the queries that seem to work for those two cases (respectively):
WITH RECURSIVE t(id, parent_id, tag) AS (
SELECT id, parent_id, tag FROM tst WHERE id=:mynode
UNION ALL
SELECT t2.id, t2.parent_id, t2.tag FROM tst AS t2
JOIN t ON t.parent_id = t2.id
) SELECT * FROM t
... and:
WITH RECURSIVE t(id, parent_id, tag) AS (
SELECT id, parent_id, tag FROM tst WHERE id=?
UNION ALL
SELECT t2.id, t2.parent_id, t2.tag FROM tst AS t2
JOIN t ON t.id = t2.parent_id
) SELECT * FROM t
(All I've done is reverse t.parent_id and t2.id from the first example into the other).
That works like a charm. But I'm trying to wrap my head around how I would start from any node and get the whole group of rows.
The obvious workaround would be to perform the first query, find the row where parent_id IS NULL then perform the second query on that. But I figure there must be a more elegant solution.
What is it?
I found that my earlier RCTE query worked but had two major flaws for my application.
I wasn't capturing the depth for each row; so I couldn't easily indent my entries to reflect the thread nesting level
My ORDER BY clause was completely off base ... so even if I indented each row according to its nesting depth the resulting "outline summary" would be completely wrong.
This slightly more complicated query seems to solve both of those problems:
WITH RECURSIVE tree (id, parent_id, tag, depth, path) AS (
SELECT id, parent_id, tag, 1 AS depth, '' AS path FROM tst WHERE id = (
WITH RECURSIVE t3 (id, parent_id) AS (
SELECT id, parent_id FROM tst WHERE id = :mynode
UNION ALL
SELECT t2.id, t2.parent_id FROM tst AS t2
JOIN t3 ON t3.parent_id=t2.id
) SELECT id FROM t3 WHERE parent_id IS NULL
) UNION ALL
SELECT t2.id, t2.parent_id, t2.tag, tree.depth+1,
path || '/' || CAST(t2.id AS VARCHAR) FROM tst AS t2
JOIN tree ON tree.id = t2.parent_id
) SELECT * FROM tree ORDER by path;
... SO doesn't seem to let me mark-up the contents of my code here ... but I'm adding the depth and path columns to the "tree" (CTE virtual) table, supplying initial values for those (virtual) columns in my first SELECT (using 1 AS depth, '' AS path; that's a new trick for me right there), and then modifying those at every step through the recursion with tree.depth+1, path || '/' || CAST(t2.id AS VARCHAR); then, finally, I can use path for my ORDER BY and use the depth in my app to prefix each line with an appropriate level of indentation.
To get this working for my application I can do something like:
#!python
for each in db.execute("SELECT id FROM tst WHERE parent_id IS NULL").fetchall():
for row in db.execute(qry, each):
print("%s\t%s%s" % (row[0], ' ' * row[3], row[2]))
... where qry is the query I've described above (actually adjusted to fetch only the columns of interest, but this example works even with * there). In practice I might use LIMIT and OFFSET to page through those results (as I already do for the flat list of results from the table that doesn't support any message threading).
Also, I'm aware that the CHECK I put on the table schema for this was only preventing the most trivial form of circular tree. It seems like parent_id INTEGER CHECK (parent_id IS NULL or parent_id < id) should work better. (Each chain of parent_id -> id links must be monotonically decreasing, so no cycle is possible. The FOREIGN KEY enforces that property for INSERT statements already, but this check enforces it for UPDATE as well.) Technically I suppose I should use the "date" fields in my actual application, but I hope the surrogate key is sufficient.
BTW: Shout out to: a_horse_with_no_name for this posting: https://dba.stackexchange.com/a/7150 ... which helped me figure out how to build the paths.
Okay, I've now spent some time with SQLFiddle: PostgreSQL: CTE and this seems to work:
-- SQL: Fetch whole tree
WITH RECURSIVE t (id, parent_id, tag) AS
(
SELECT id, parent_id, tag FROM tst WHERE id =
(
WITH RECURSIVE t3 (id, parent_id) AS
(
SELECT id, parent_id FROM tst WHERE id = :mynode
UNION ALL
SELECT t2.id, t2.parent_id FROM tst AS t2
JOIN t3 ON t3.parent_id=t2.id
) SELECT id FROM t3 WHERE parent_id IS NULL
) UNION ALL
SELECT t2.id, t2.parent_id, t2.tag FROM tst AS t2
JOIN t ON t.id = t2.parent_id
) SELECT * FROM t ORDER BY id;
... where :mynode is the ID of any node in a given tree of rows.
All I'm doing is taking the first query (walk up the tree), adding WHERE parent_id IS NULL to get just the parent of that whole tree, and pasting it into an IN (...) expression. (In this case I could also just use = rather than IN).
That still looks a bit, ummm, ... involved. But it does seem to work.
Is there a better way?
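One way to flatten the nesting a little is to list both CTEs in a single WITH RECURSIVE clause, so the "walk up" CTE is defined first and the "walk down" CTE just looks up its root. The sketch below is an assumption-laden rearrangement of the query above, run on SQLite via Python's sqlite3: the CTE names up and tree and the sample rows are made up, and the table is simplified (TEXT instead of STRING, no CHECK).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tst (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES tst(id),
        tag TEXT NOT NULL
    )
""")
# Two separate trees: 1 -> (2 -> 4, 3) and 5 -> 6.
conn.executemany(
    "INSERT INTO tst (id, parent_id, tag) VALUES (?, ?, ?)",
    [(1, None, "root"), (2, 1, "a"), (3, 1, "b"), (4, 2, "a1"),
     (5, None, "other root"), (6, 5, "c")],
)

WHOLE_TREE = """
WITH RECURSIVE
  up(id, parent_id) AS (
    -- walk from :mynode up to the root of its tree
    SELECT id, parent_id FROM tst WHERE id = :mynode
    UNION ALL
    SELECT t2.id, t2.parent_id FROM tst AS t2 JOIN up ON up.parent_id = t2.id
  ),
  tree(id, parent_id, tag, depth) AS (
    -- walk back down from that root, tracking depth
    SELECT id, parent_id, tag, 1 FROM tst
    WHERE id = (SELECT id FROM up WHERE parent_id IS NULL)
    UNION ALL
    SELECT t2.id, t2.parent_id, t2.tag, tree.depth + 1
    FROM tst AS t2 JOIN tree ON tree.id = t2.parent_id
  )
SELECT id, depth, tag FROM tree ORDER BY id
"""

rows = conn.execute(WHOLE_TREE, {"mynode": 4}).fetchall()
print(rows)
```

Starting from node 4 this returns every row of the first tree and nothing from the second. It is still two recursions, just written side by side instead of nested.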

sql: delete a subtree table(id, Parentid) , delete a item with all his children

I have a table like this
foo(id, parentId)
-- there is a FK constraint from parentId to id
and I need to delete an item with all his children and children of the children etc.
anybody knows how ?
AFAIK, SQL Server doesn't like cascade deletes for hierarchical relationships. So you could use either a CTE (as Oded mentioned) or a solution with a recursive trigger (somehow like this). But I suppose the CTE is easier.
See, here is solution using CTE:
CREATE PROC deleteFoo
@id bigint
as
WITH Nodes ([Id], [ParentId], [Level])
AS (
SELECT F.[Id], F.[ParentId], 0 AS [Level]
FROM [dbo].Foo F
WHERE F.[Id] = @id
UNION ALL
SELECT F.[Id], F.[ParentId], N.[Level] + 1
FROM [dbo].Foo F
INNER JOIN Nodes N ON N.[Id] = F.[ParentId]
)
DELETE
FROM Foo
WHERE [Id] IN (
SELECT TOP 100 PERCENT N.[Id]
FROM Nodes N
ORDER BY N.[Level] DESC
)
First we define a recursive CTE, and then we delete records from the [Foo] table, beginning from the deepest child records (highest Level, so the top node will be deleted last).
You can write a recursive CTE, the anchor would be the initial Id.
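A minimal runnable sketch of that idea, using SQLite through Python's sqlite3 rather than SQL Server (an assumption made only for the sake of a self-contained example). The helper name delete_subtree and the sample rows are made up; the anchor is the id to delete and the recursive member collects all of its descendants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (id INTEGER PRIMARY KEY, parentId INTEGER REFERENCES foo(id))"
)
# Tree: 1 -> (2 -> (3, 4), 5), plus an unrelated root 6.
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, None), (2, 1), (3, 2), (4, 2), (5, 1), (6, None)],
)


def delete_subtree(conn, node_id):
    """Delete node_id together with all of its descendants in one statement."""
    conn.execute(
        """
        WITH RECURSIVE nodes(id) AS (
            SELECT id FROM foo WHERE id = :id          -- anchor: the start node
            UNION ALL
            SELECT f.id FROM foo AS f JOIN nodes AS n ON f.parentId = n.id
        )
        DELETE FROM foo WHERE id IN (SELECT id FROM nodes)
        """,
        {"id": node_id},
    )


delete_subtree(conn, 2)
remaining = [r[0] for r in conn.execute("SELECT id FROM foo ORDER BY id")]
print(remaining)
```

Because the whole subtree goes in a single DELETE, no per-row ordering is needed in this sketch.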

Different result with * and explicit field list?

I was exploring another question, when I hit this behaviour in Sql Server 2005. This query would exhaust the maximum recursion:
with foo(parent_id,child_id) as (
select parent_id,child_id
from #bar where parent_id in (1,3)
union all
select #bar.* -- Line that changed
from #bar
join foo on #bar.parent_id = foo.child_id
)
select * from foo
But this would work fine:
with foo(parent_id,child_id) as (
select parent_id,child_id
from #bar where parent_id in (1,3)
union all
select #bar.parent_id, #bar.child_id -- Line that changed
from #bar
join foo on #bar.parent_id = foo.child_id
)
select * from foo
Is this a bug in Sql Server, or am I overlooking something?
Here's the table definition:
if object_id('tempdb..#bar') is not null
drop table #bar
create table #bar (
child_id int,
parent_id int
)
insert into #bar (parent_id,child_id) values (1,2)
insert into #bar (parent_id,child_id) values (1,5)
insert into #bar (parent_id,child_id) values (2,3)
insert into #bar (parent_id,child_id) values (2,6)
insert into #bar (parent_id,child_id) values (6,4)
Edit
I think I know what's going on, and it is a great example of why to avoid select * in the first place.
You defined your table with child_id first and then parent_id, but the CTE foo expects parent_id then child_id.
So essentially, when you say select #bar.* you're saying select child_id, parent_id, but you're putting that into parent_id, child_id. This results in an n-level recursive expression as you join back on yourself.
So this is not a bug in SQL.
Moral of the lesson: Avoid Select * and save yourself headaches.
I would say it was probably a deliberate choice by the programmers as a safety valve, because you have a union and all parts of a union must have the same number of fields. I never use select * in a union, so I can't say for sure. I tried on SQL Server 2008 and got this message:
Msg 205, Level 16, State 1, Line 1
All queries combined using a UNION, INTERSECT or EXCEPT operator must have an equal number of expressions in their target lists.
That seems to support my theory.
Select * is a poor technique to use in any production code, so I don't see why it would be a problem to simply specify the fields.
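The column-order effect is easy to see in isolation. This small sketch uses SQLite via Python's sqlite3 (a stand-in, not SQL Server) with a plain table named bar declared child_id first, as in the question, and compares what * returns with an explicit column list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same column order as the question's table: child_id first, then parent_id.
conn.execute("CREATE TABLE bar (child_id INTEGER, parent_id INTEGER)")
conn.execute("INSERT INTO bar (parent_id, child_id) VALUES (1, 2)")

star = conn.execute("SELECT * FROM bar")
star_columns = [c[0] for c in star.description]  # follows the table definition
star_row = star.fetchone()

explicit_row = conn.execute("SELECT parent_id, child_id FROM bar").fetchone()

print(star_columns, star_row)  # child_id comes first, so the row is (2, 1)
print(explicit_row)            # (1, 2)
```

A CTE declared as foo(parent_id, child_id) takes columns by position, so feeding it the * row silently swaps parent and child.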

Select products where the category belongs to any category in the hierarchy

I have a products table that contains a FK for a category, the Categories table is created in a way that each category can have a parent category, example:
Computers
Processors
Intel
Pentium
Core 2 Duo
AMD
Athlon
I need to make a select query that, if the selected category is Processors, returns products that are in Intel, Pentium, Core 2 Duo, AMD, etc.
I thought about creating some sort of "cache" that will store all the categories in the hierarchy for every category in the db and include the "IN" in the where clause. Is this the best solution?
The best solution for this is at the database design stage. Your categories table needs to be a Nested Set. The article Managing Hierarchical Data in MySQL is not that MySQL specific (despite the title), and gives a great overview of the different methods of storing a hierarchy in a database table.
Executive Summary:
Nested Sets
Selects are easy for any depth
Inserts and deletes are hard
Standard parent_id based hierarchy
Selects are based on inner joins (so get hairy fast)
Inserts and deletes are easy
So based on your example, if your hierarchy table was a nested set your query would look something like this:
SELECT * FROM products
INNER JOIN categories ON categories.id = products.category_id
WHERE categories.lft > 2 and categories.rgt < 11
the 2 and 11 are the left and right respectively of the Processors record.
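Here is a small runnable sketch of the nested-set lookup, using SQLite via Python's sqlite3. Everything in it is illustrative: the lft/rgt numbers are one possible numbering of the example hierarchy (Processors ends up as 2 and 13 here, not 2 and 11), and the product names are made up. It also uses BETWEEN so that products filed directly under Processors would be included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT, lft INTEGER, rgt INTEGER)"
)
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER)"
)
# Each category's (lft, rgt) interval encloses the intervals of its descendants.
conn.executemany("INSERT INTO categories VALUES (?, ?, ?, ?)", [
    (1, "Computers", 1, 14), (2, "Processors", 2, 13), (3, "Intel", 3, 8),
    (4, "Pentium", 4, 5), (5, "Core 2 Duo", 6, 7),
    (6, "AMD", 9, 12), (7, "Athlon", 10, 11),
])
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    (1, "Pentium 4", 4), (2, "E8400", 5), (3, "Athlon 64", 7), (4, "Desktop case", 1),
])

rows = conn.execute("""
    SELECT p.name
    FROM products AS p
    JOIN categories AS c ON c.id = p.category_id
    JOIN categories AS root ON root.name = 'Processors'
    WHERE c.lft BETWEEN root.lft AND root.rgt
    ORDER BY p.id
""").fetchall()
names = [r[0] for r in rows]
print(names)
```

Looking up the root row by name avoids hard-coding its lft and rgt values in the query.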
Looks like a job for a Common Table Expression.. something along the lines of:
with catCTE (catid, parentid)
as
(
select cat.catid, cat.catparentid from cat where cat.name = 'Processors'
UNION ALL
select cat.catid, cat.catparentid from cat inner join catCTE on cat.catparentid=catcte.catid
)
select distinct * from catCTE
That should select the category whose name is 'Processors' and any of its descendants; you should be able to use that in an IN clause to pull back the products.
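A sketch of how the CTE and the IN clause fit together, run on SQLite via Python's sqlite3 so that it is self-contained. The cat table follows the column names used in the CTE above; the products table, its catid column, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cat (catid INTEGER PRIMARY KEY, catparentid INTEGER, name TEXT)"
)
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, catid INTEGER)"
)
conn.executemany("INSERT INTO cat VALUES (?, ?, ?)", [
    (1, None, "Computers"), (2, 1, "Processors"), (3, 2, "Intel"), (4, 2, "AMD"),
    (5, 3, "Pentium"), (6, 3, "Core 2 Duo"), (7, 4, "Athlon"),
])
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    (1, "Pentium 4", 5), (2, "E8400", 6), (3, "Athlon 64", 7), (4, "Desktop case", 1),
])

rows = conn.execute("""
    WITH RECURSIVE catCTE(catid) AS (
        SELECT catid FROM cat WHERE name = :root       -- the selected category
        UNION ALL
        SELECT cat.catid
        FROM cat JOIN catCTE ON cat.catparentid = catCTE.catid
    )
    SELECT name FROM products
    WHERE catid IN (SELECT catid FROM catCTE)
    ORDER BY id
""", {"root": "Processors"}).fetchall()
names = [r[0] for r in rows]
print(names)
```

The product filed under Computers is left out because Computers is an ancestor of Processors, not a descendant.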
I have done similar things in the past, first querying for the category ids, then querying for the products "IN" those categories. Getting the categories is the hard bit, and you have a few options:
If the level of nesting of categories is known or you can find an upper bound: Build a horrible-looking SELECT with lots of JOINs. This is fast, but ugly and you need to set a limit on the levels of the hierarchy.
If you have a relatively small number of total categories, query them all (just ids, parents), collect the ids of the ones you care about, and do a SELECT....IN for the products. This was the appropriate option for me.
Query up/down the hierarchy using a series of SELECTs. Simple, but relatively slow.
I believe recent versions of SQLServer have some support for recursive queries, but haven't used them myself.
Stored procedures can help if you don't want to do this app-side.
What you want to find is the transitive closure of the category "parent" relation. I suppose there's no limitation to the category hierarchy depth, so you can't formulate a single SQL query which finds all categories. What I would do (in pseudocode) is this:
categoriesSet = {selected category}
new = categoriesSet
while new.size > 0:
new = select * from categories where parent in new and id not in categoriesSet
categoriesSet = categoriesSet+new
So just keep on querying for children until no more are found. This behaves well in terms of speed unless you have a degenerated hierarchy (say, 1000 categories, each a child of another), or a large number of total categories. In the second case, you could always work with temporary tables to keep data transfer between your app and the database small.
Maybe something like:
select *
from products
where products.category_id IN
(select c2.category_id
from categories c1 inner join categories c2 on c1.category_id = c2.parent_id
where c1.category = 'Processors'
group by c2.category_id)
[EDIT] If the category depth is greater than one this would form your innermost query. I suspect that you could design a stored procedure that would drill down in the table until the ids returned by the inner query did not have children -- probably better to have an attribute that marks a category as a terminal node in the hierarchy -- then perform the outer query on those ids.
CREATE TABLE #categories (id INT NOT NULL, parentId INT, [name] NVARCHAR(100))
INSERT INTO #categories
SELECT 1, NULL, 'Computers'
UNION
SELECT 2, 1, 'Processors'
UNION
SELECT 3, 2, 'Intel'
UNION
SELECT 4, 2, 'AMD'
UNION
SELECT 5, 3, 'Pentium'
UNION
SELECT 6, 3, 'Core 2 Duo'
UNION
SELECT 7, 4, 'Athlon'
SELECT *
FROM #categories
DECLARE @id INT
SET @id = 2
; WITH r(id, parentid, [name]) AS (
SELECT id, parentid, [name]
FROM #categories c
WHERE id = @id
UNION ALL
SELECT c.id, c.parentid, c.[name]
FROM #categories c JOIN r ON c.parentid=r.id
)
SELECT *
FROM products
WHERE p.productd IN
(SELECT id
FROM r)
DROP TABLE #categories
The last part of the example isn't actually working if you're running it straight like this. Just remove the select from the products and substitute with a simple SELECT * FROM r
This should recurse down all the 'child' categories starting from a given category.
DECLARE @startingCatagoryId int
DECLARE @current int
SET @startingCatagoryId = 13813 -- or whatever the CatagoryId is for 'Processors'
CREATE TABLE #CatagoriesToFindChildrenFor
(CatagoryId int)
CREATE TABLE #CatagoryTree
(CatagoryId int)
INSERT INTO #CatagoriesToFindChildrenFor VALUES (@startingCatagoryId)
WHILE (SELECT count(*) FROM #CatagoriesToFindChildrenFor) > 0
BEGIN
SET @current = (SELECT TOP 1 * FROM #CatagoriesToFindChildrenFor)
INSERT INTO #CatagoriesToFindChildrenFor
SELECT ID FROM Catagory WHERE ParentCatagoryId = @current AND Deleted = 0
INSERT INTO #CatagoryTree VALUES (@current)
DELETE #CatagoriesToFindChildrenFor WHERE CatagoryId = @current
END
SELECT * FROM #CatagoryTree ORDER BY CatagoryId
DROP TABLE #CatagoriesToFindChildrenFor
DROP TABLE #CatagoryTree
I like to use a stack temp table for hierarchical data.
here's a rough example -
-- create a categories table and fill it with 10 rows (with random parentIds)
CREATE TABLE Categories ( Id uniqueidentifier, ParentId uniqueidentifier )
GO
INSERT
INTO Categories
SELECT NEWID(),
NULL
GO
INSERT
INTO Categories
SELECT TOP(1)NEWID(),
Id
FROM Categories
ORDER BY Id
GO 9
DECLARE @lvl INT, -- holds onto the level as we move through the hierarchy
@Id Uniqueidentifier -- the id of the current item in the stack
SET @lvl = 1
CREATE TABLE #stack (item UNIQUEIDENTIFIER, [lvl] INT)
-- we will fill this table with the ids we want
CREATE TABLE #tmpCategories (Id UNIQUEIDENTIFIER)
-- for this example we’ll just select all the ids
-- if we want all the children of a specific parent we would include its id in
-- this where clause
INSERT INTO #stack SELECT Id, @lvl FROM Categories WHERE ParentId IS NULL
WHILE @lvl > 0
BEGIN -- begin 1
IF EXISTS ( SELECT * FROM #stack WHERE lvl = @lvl )
BEGIN -- begin 2
SELECT @Id = [item]
FROM #stack
WHERE lvl = @lvl
INSERT INTO #tmpCategories
SELECT @Id
DELETE FROM #stack
WHERE lvl = @lvl
AND item = @Id
INSERT INTO #stack
SELECT Id, @lvl + 1
FROM Categories
WHERE ParentId = @Id
IF @@ROWCOUNT > 0
BEGIN -- begin 3
SELECT @lvl = @lvl + 1
END -- end 3
END -- end 2
ELSE
SELECT @lvl = @lvl - 1
END -- end 1
DROP TABLE #stack
SELECT * FROM #tmpCategories
DROP TABLE #tmpCategories
DROP TABLE Categories
there is a good explanation here link text
My answer to another question from a couple days ago applies here... recursion in SQL
There are some methods in the book which I've linked which should cover your situation nicely.