SQL Server Recursive CTE for finding all the dependencies - sql

I have a table with the following data.
Road City
R1 C1
R2 C2
R3 C1
R3 C3
R4 C3
R4 C5
R5 C5
If R1 is the input I need to get R1, R3, R4 and R5 as the output. This is because R1 belongs to C1 and C1 has R3 and R3 also belongs to C3 which has R4 and similarly R5.
I was trying to make use of CTE recursion but not able to get it to work. I tried stored procedure recursive call but it goes only 30 levels deep.
with tmp1 as (
select ROAD, CITY, 1 as Level from table R1 WHERE ROAD = 1712
UNION ALL
select R2.ROAD, R2.CITY,Level + 1 as Level
from tmp1 INNER JOIN table R2 ON tmp1.CITY = R2.CITY and tmp1.ROAD <> R2.ROAD
)
select * from tmp1
OPTION (maxrecursion 0)
Any thoughts greatly appreciated!

A recursive CTE will not work without some way of breaking cycles. Other database vendors have specific features for disallowing a row to be added twice. Unless something was added in the latest releases, Microsoft SQL Server does not.
The following does not work, because the recursive clause is referring to the CTE twice. (Or it contains a subquery)
WITH recur AS (SELECT Road, City
FROM #Map
WHERE Road = #StartingRoad
--
UNION ALL
--
SELECT next.Road, next.City
FROM #Map next
INNER JOIN recur
ON (recur.City = next.City AND recur.Road <> next.Road)
OR (recur.City <> next.City AND recur.Road = next.Road)
WHERE NOT EXISTS (SELECT NULL
FROM recur test
WHERE test.Road = next.Road AND test.City = next.City))
SELECT *
FROM recur;
Msg 253, Level 16, State 1, Line 36
Recursive member of a common table expression 'recur' has multiple recursive references.
It is possible with a straight forward loop, which you could stick in a stored procedure:
DECLARE #Map TABLE (Road VARCHAR(2), City VARCHAR(2));
INSERT INTO #Map (Road, City)
VALUES ('R1', 'C1')
, ('R2', 'C2')
, ('R3', 'C1')
, ('R3', 'C3')
, ('R4', 'C3')
, ('R4', 'C5')
, ('R5', 'C5');
DECLARE #StartingRoad VARCHAR(2) = 'R1';
DECLARE #Results TABLE (Road VARCHAR(2), City VARCHAR(2));
INSERT INTO #Results (Road, City)
SELECT Road, City
FROM #Map
WHERE Road = #StartingRoad
WHILE (1=1) BEGIN
INSERT INTO #Results (Road, City)
SELECT next.Road, next.City
FROM #Map next
INNER JOIN #Results r
ON (r.City = next.City AND r.Road <> next.Road)
OR (r.City <> next.City AND r.Road = next.Road)
WHERE NOT EXISTS (SELECT NULL
FROM #Results test
WHERE test.Road = next.Road AND test.City = next.City);
IF ##ROWCOUNT = 0
BREAK;
END;
SELECT DISTINCT Road
FROM #Results

This might get you partially there. I am doing a partial Cartesian join to generate the city/road combinations, and artificially restricting the recursion to 20 levels. I can't help but think there is a better way.
WITH cte(road,
city,
connected_city,
connected_road)
AS (
SELECT a.road,
a.city,
b.city AS connected_city,
b.Road connected_road
FROM deleteme a
INNER JOIN deleteme b ON a.City = b.City
WHERE a.road <> b.road)
SELECT DISTINCT
road
FROM cte;
road
R1
R3
R4
R5

Related

Filter a join based on multiple rows

I'm trying to write a query that filters a join based on several rows in another table. Hard to put it into words, so I'll provide a cut-back simple example.
Parent
Child
P1
C1
P1
C2
P1
C3
P2
C1
P2
C2
P2
C4
P3
C1
P3
C3
P3
C5
Essentially all rows are stored in the same table, however there is a ParentID allowing one item to link to another (parent) row.
The stored procedure is taking a comma delimited list of "child" codes, and based on whatever is in this list, I need to provide a list of potential siblings.
For example, if the comma delimited list was empty, the returned rows should be C1, C2, C3, C4, C5. If the list is "C2", the returned rows would be C1, C3, C4, and if the list is 'C1, C2', then the only returned row would be c3, c4.
Sample query:
SELECT [S].[ID]
FROM utItem [P]
INNER JOIN utItem [C]
ON [P].[ID] = [C].[ParentID]
INNER JOIN
(
-- Encapsulated to simplify sample.
SELECT [ID]
FROM udfListToRows( #ChildList )
GROUP BY
[ID]
) [DT]
ON [DT].[ID] = [C].[ID]
/*
In the event where I passed in "C2", this would work, it would return C1, C3, C4.
However this falls apart the moment there is more than 1 value in #ChildList. If I pass in "C2, C3", it would return siblings for either. But I only want siblings of BOTH.
**/
INNER JOIN [utItem] [S]
ON [C].[ParentID] = [S].[ParentID]
AND [C].[ID] <> [S].[ID]
WHERE
#ChildList IS NOT NULL
GROUP BY
[S].[ID]
UNION ALL
-- In the event that no #ChildList values are provided, return a full list of possible children (e.g. 1,2,3,4,5).
SELECT [C].[ID]
FROM [utItem] [P]
INNER JOIN [utItem] [C]
ON [P].[ID] = [C].[ParentID]
WHERE
#ChildList IS NULL
GROUP BY
[C].[ID]
Firstly, you can split your data into a table variable for ease of use
DECLARE #input TABLE (NodeId varchar(2));
INSERT #input (NodeId)
SELECT [ID]
FROM udfListToRows( #ChildList ); -- or STRING_SPLIT or whatever
Assuming you already had your data in a proper table variable (rather than a comma-separated list) you can do this
DECLARE #totalCount int = (SELECT COUNT(*) FROM #input);
SELECT DISTINCT
t.Child
FROM (
SELECT
t.Parent,
t.Child,
i.NodeId,
COUNT(i.NodeId) OVER (PARTITION BY t.Parent) matches
FROM YourTable t
LEFT JOIN #input i ON i.NodeId = t.Child
) t
WHERE t.matches = #totalCount
AND t.NodeId IS NULL;
db<>fiddle
This is a kind of relational division
Left-join the input to the main table
Using a window function, calculate how many matches you get per Parent
There must be at least as many matches as there are inputs
We take the distinct Child, excluding the original inputs

Use Data of 1 table into another one dynamically

I have one table category_code having data like
SELECT Item, Code, Prefix from category_codes
Item Code Prefix
Bangles BL BL
Chains CH CH
Ear rings ER ER
Sets Set ST
Rings RING RG
Yellow GOld YG YG........
I have another table item_categories having data like
select code,name from item_categories
code name
AQ.TM.PN AQ.TM.PN
BL.YG.CH.ME.PN BL.YG.CH.ME.PN
BS.CZ.ST.YG.PN BS.CZ.ST.YG.PN
CR.YG CR.YG.......
i want to update item_categories.name column corresponding to category_code.item column like
code name
BL.YG.CH.ME.PN Bangles.Yellow Gold.Chains.. . . .
Please suggest good solution for that. Thanks in advance.
First, split the code into several rows, join with the category code and then, concat the result to update the table.
Here an example, based on the data you gave
create table #category_code (item varchar(max), code varchar(max), prefix varchar(max));
create table #item_categories (code varchar(max), name varchar(max));
insert into #category_code (item, code, prefix) values ('Bangles','BL','BL'),('Chains','CH','CH'),('Ear rings','ER','ER'), ('Sets','Set','ST'),('Rings','RING','RG'), ('Yellow gold','YG','YG');
insert into #item_categories (code, name) values ('AQ.TM,PN','AQ.TM.PN'),('BL.YG.CH.ME.PN','BL.YG.CH.ME.PN'),('BS.CZ.ST.YG.PN','BS.CZ.ST.YG.PN')
;with splitted as ( -- split the codes into individual code
select row_number() over (partition by ic.code order by ic.code) as id, ic.code, x.value, cc.item
from #item_categories ic
outer apply string_split(ic.code, '.') x -- SQL Server 2016+, otherwise, use another method to split the data
left join #category_code cc on cc.code = x.value -- some values are missing in you example, but can use an inner join
)
, joined as ( -- then joined them to concat the name
select id, convert(varchar(max),code) as code, convert(varchar(max),coalesce(item + ',','')) as Item
from splitted
where id = 1
union all
select s.id, convert(varchar(max), s.code), convert(varchar(max), j.item + coalesce(s.item + ',',''))
from splitted s
inner join joined j on j.id = s.id - 1 and j.code = s.code
)
update #item_categories
set name = substring (j.item ,1,case when len(j.item) > 1 then len(j.item)-1 else 0 end)
output deleted.name, inserted.name
from #item_categories i
inner join joined j on j.code = i.code
inner join (select code, max(id)maxid from joined group by code) mj on mj.code = j.code and mj.maxid = j.id

How do I pull all results form a table based on another tables many to one relationship?

Setup
Sorry for the poorly phrased question. I'm not sure how to phrase it better. Feel free to try your hand at it if you are more versed in sql phrasing.
I have 3 related tables.
Person => person_id, name, etc
Cases => case_id, person_id, incedent_date, etc
Files => file_id, case_id, file_path, etc
Problem
For a given case_id I want to pull all file_id's for the same person.
Requirements:
1 query.
without duplicates.
without using UNIQUE/DISTINCT flag.
without changing table structure.
e.g. Bob has 2 cases, auto and house.
He has 10 files on each case.
I have the case_id for auto.
I want the files for both auto and house (20 files).
My attempt
This returns all files for all cases.
SELECT
f.file_id AS id
FROM files f
LEFT JOIN Cases c1 ON f.case_id = c1.case_id
LEFT JOIN Cases c2 ON f.case_id = c2.case_id
WHERE (f.case_id = 3566 OR c1.person_id = c2.person_id)
AND f.active = 1
ORDER BY f.upload_date ASC
This returns files for only given case:
SELECT
f.file_id AS id
FROM files f
LEFT JOIN Cases c1 ON f.case_id = c1.case_id
LEFT JOIN Cases c2 ON f.case_id = c2.case_id
WHERE (f.case_id = 3566 OR (c1.case_id = 3566 AND c1.person_id = c2.person_id)
AND f.active = 1
ORDER BY f.upload_date ASC
This returns duplicate values and seems to pull only the given case:
SELECT
f.file_id AS id
FROM files f
LEFT JOIN Cases c1 ON f.case_id = c1.case_id
LEFT JOIN Cases c2 ON c1.person_id = c2.person_id
WHERE f.case_id = 3566
AND f.active = 1
ORDER BY f.upload_date ASC
I hope this is what you want.
Create table #Person (person_id int, name varchar(10))
Insert into #Person values (1,'Ajay')
Insert into #Person values (2,'Vijay')
Create table #Cases (case_id int, person_id int)
Insert into #Cases values (1,1)
Insert into #Cases values (2,1)
Insert into #Cases values (3,1)
Insert into #Cases values (4,2)
Create table #Files (file_id int, case_id int)
Insert into #Files values (1,1)
Insert into #Files values (2,1)
Insert into #Files values (3,1)
Insert into #Files values (4,2)
Insert into #Files values (5,4)
SELECT
f.file_id AS id
FROM #files f
LEFT JOIN #Cases c1 ON f.case_id = c1.case_id
LEFT JOIN #Cases c2 ON c1.person_id = c2.person_id and c2.case_id = 2
where c2.case_id is not null
--OR
SELECT *,
f.file_id AS id
FROM #files f
LEFT JOIN #Cases c1 ON f.case_id = c1.case_id
INNER JOIN #Cases c2 ON c1.person_id = c2.person_id and c2.case_id = 2

Group All Related Records in Many to Many Relationship, SQL graph connected components

Hopefully I'm missing a simple solution to this.
I have two tables. One contains a list of companies. The second contains a list of publishers. The mapping between the two is many to many. What I would like to do is bundle or group all of the companies in table A which have any relationship to a publisher in table B and vise versa.
The final result would look something like this (GROUPID is the key field). Row 1 and 2 are in the same group because they share the same company. Row 3 is in the same group because the publisher Y was already mapped over to company A. Row 4 is in the group because Company B was already mapped to group 1 through Publisher Y.
Said simply, any time there is any kind of shared relationship across Company and Publisher, that pair should be assigned to the same group.
ROW GROUPID Company Publisher
1 1 A Y
2 1 A X
3 1 B Y
4 1 B Z
5 2 C W
6 2 C P
7 2 D W
Fiddle
Update:
My bounty version: Given the table in the fiddle above of simply Company and Publisher pairs, populate the GROUPID field above. Think of it as creating a Family ID that encompasses all related parents/children.
SQL Server 2012
I thought about using recursive CTE, but, as far as I know, it's not possible in SQL Server to use UNION to connect anchor member and a recursive member of recursive CTE (I think it's possible to do in PostgreSQL), so it's not possible to eliminate duplicates.
declare #i int
with cte as (
select
GroupID,
row_number() over(order by Company) as rn
from Table1
)
update cte set GroupID = rn
select #i = ##rowcount
-- while some rows updated
while #i > 0
begin
update T1 set
GroupID = T2.GroupID
from Table1 as T1
inner join (
select T2.Company, min(T2.GroupID) as GroupID
from Table1 as T2
group by T2.Company
) as T2 on T2.Company = T1.Company
where T1.GroupID > T2.GroupID
select #i = ##rowcount
update T1 set
GroupID = T2.GroupID
from Table1 as T1
inner join (
select T2.Publisher, min(T2.GroupID) as GroupID
from Table1 as T2
group by T2.Publisher
) as T2 on T2.Publisher = T1.Publisher
where T1.GroupID > T2.GroupID
-- will be > 0 if any rows updated
select #i = #i + ##rowcount
end
;with cte as (
select
GroupID,
dense_rank() over(order by GroupID) as rn
from Table1
)
update cte set GroupID = rn
sql fiddle demo
I've also tried a breadth first search algorithm. I thought it could be faster (it's better in terms of complexity), so I'll provide a solution here. I've found that it's not faster than SQL approach, though:
declare #Company nvarchar(2), #Publisher nvarchar(2), #GroupID int
declare #Queue table (
Company nvarchar(2), Publisher nvarchar(2), ID int identity(1, 1),
primary key(Company, Publisher)
)
select #GroupID = 0
while 1 = 1
begin
select top 1 #Company = Company, #Publisher = Publisher
from Table1
where GroupID is null
if ##rowcount = 0 break
select #GroupID = #GroupID + 1
insert into #Queue(Company, Publisher)
select #Company, #Publisher
while 1 = 1
begin
select top 1 #Company = Company, #Publisher = Publisher
from #Queue
order by ID asc
if ##rowcount = 0 break
update Table1 set
GroupID = #GroupID
where Company = #Company and Publisher = #Publisher
delete from #Queue where Company = #Company and Publisher = #Publisher
;with cte as (
select Company, Publisher from Table1 where Company = #Company and GroupID is null
union all
select Company, Publisher from Table1 where Publisher = #Publisher and GroupID is null
)
insert into #Queue(Company, Publisher)
select distinct c.Company, c.Publisher
from cte as c
where not exists (select * from #Queue as q where q.Company = c.Company and q.Publisher = c.Publisher)
end
end
sql fiddle demo
I've tested my version and Gordon Linoff's to check how it's perform. It looks like CTE is much worse, I couldn't wait while it's complete on more than 1000 rows.
Here's sql fiddle demo with random data. My results were:
128 rows:
my RBAR solution: 190ms
my SQL solution: 27ms
Gordon Linoff's solution: 958ms
256 rows:
my RBAR solution: 560ms
my SQL solution: 1226ms
Gordon Linoff's solution: 45371ms
It's random data, so results may be not very consistent. I think timing could be changed by indexes, but don't think it could change a whole picture.
old version - using temporary table, just calculating GroupID without touching initial table:
declare #i int
-- creating table to gather all possible GroupID for each row
create table #Temp
(
Company varchar(1), Publisher varchar(1), GroupID varchar(1),
primary key (Company, Publisher, GroupID)
)
-- initializing it with data
insert into #Temp (Company, Publisher, GroupID)
select Company, Publisher, Company
from Table1
select #i = ##rowcount
-- while some rows inserted into #Temp
while #i > 0
begin
-- expand #Temp in both directions
;with cte as (
select
T2.Company, T1.Publisher,
T1.GroupID as GroupID1, T2.GroupID as GroupID2
from #Temp as T1
inner join #Temp as T2 on T2.Company = T1.Company
union
select
T1.Company, T2.Publisher,
T1.GroupID as GroupID1, T2.GroupID as GroupID2
from #Temp as T1
inner join #Temp as T2 on T2.Publisher = T1.Publisher
), cte2 as (
select
Company, Publisher,
case when GroupID1 < GroupID2 then GroupID1 else GroupID2 end as GroupID
from cte
)
insert into #Temp
select Company, Publisher, GroupID
from cte2
-- don't insert duplicates
except
select Company, Publisher, GroupID
from #Temp
-- will be > 0 if any row inserted
select #i = ##rowcount
end
select
Company, Publisher,
dense_rank() over(order by min(GroupID)) as GroupID
from #Temp
group by Company, Publisher
=> sql fiddle example
Your problem is a graph-walking problem of finding connected subgraphs. It is a little more challenging because your data structure has two types of nodes ("companies" and "pubishers") rather than one type.
You can solve this with a single recursive CTE. The logic is as follows.
First, convert the problem into a graph with only one type of node. I do this by making the nodes companies and the edges linkes between companies, using the publisher information. This is just a join:
select t1.company as node1, t2.company as node2
from table1 t1 join
table1 t2
on t1.publisher = t2.publisher
)
(For efficiency sake, you could also add t1.company <> t2.company but that is not strictly necessary.)
Now, this is a "simple" graph walking problem, where a recursive CTE is used to create all connections between two nodes. The recursive CTE walks through the graph using join. Along the way, it keeps a list of all nodes visited. In SQL Server, this needs to be stored in a string.
The code needs to ensure that it doesn't visit a node twice for a given path, because this can result in infinite recursion (and an error). If the above is called edges, the CTE that generates all pairs of connected nodes looks like:
cte as (
select e.node1, e.node2, cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
1 as level
from edges e
union all
select c.node1, e.node2, c.nodes+e.node2+'|', 1+c.level
from cte c join
edges e
on c.node2 = e.node1 and
c.nodes not like '|%'+e.node2+'%|'
)
Now, with this list of connected nodes, assign each node the minimum of all the nodes it is connected to, including itself. This serves as an identifier of connected subgraphs. That is, all companies connected to each other via the publishers will have the same minimum.
The final two steps are to enumerate this minimum (as the GroupId) and to join the GroupId back to the original data.
The full (and I might add tested) query looks like:
with edges as (
select t1.company as node1, t2.company as node2
from table1 t1 join
table1 t2
on t1.publisher = t2.publisher
),
cte as (
select e.node1, e.node2,
cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
1 as level
from edges e
union all
select c.node1, e.node2,
c.nodes+e.node2+'|',
1+c.level
from cte c join
edges e
on c.node2 = e.node1 and
c.nodes not like '|%'+e.node2+'%|'
),
nodes as (
select node1,
(case when min(node2) < node1 then min(node2) else node1 end
) as grp
from cte
group by node1
)
select t.company, t.publisher, grp.GroupId
from table1 t join
(select n.node1, dense_rank() over (order by grp) as GroupId
from nodes n
) grp
on t.company = grp.node1;
Note that this works on finding any connected subgraphs. It does not assume that any particular number of levels.
EDIT:
The question of performance for this is vexing. At a minimum, the above query will run better with an index on Publisher. Better yet is to take #MikaelEriksson's suggestion, and put the edges in a separate table.
Another question is whether you look for equivalency classes among the Companies or the Publishers. I took the approach of using Companies, because I think that has better "explanability" (my inclination to respond was based on numerous comments that this could not be done with CTEs).
I am guessing that you could get reasonable performance from this, although that requires more knowledge of your data and system than provided in the OP. It is quite likely, though, that the best performance will come from a multiple query approach.
Here is my solution SQL Fiddle
The nature of the relationships require looping as I figure.
Here is the SQL:
--drop TABLE Table1
CREATE TABLE Table1
([row] int identity (1,1),GroupID INT NULL,[Company] varchar(2), [Publisher] varchar(2))
;
INSERT INTO Table1
(Company, Publisher)
select
left(newid(), 2), left(newid(), 2)
declare #i int = 1
while #i < 8
begin
;with cte(Company, Publisher) as (
select
left(newid(), 2), left(newid(), 2)
from Table1
)
insert into Table1(Company, Publisher)
select distinct c.Company, c.Publisher
from cte as c
where not exists (select * from Table1 as t where t.Company = c.Company and t.Publisher = c.Publisher)
set #i = #i + 1
end;
CREATE NONCLUSTERED INDEX IX_Temp1 on Table1 (Company)
CREATE NONCLUSTERED INDEX IX_Temp2 on Table1 (Publisher)
declare #counter int=0
declare #row int=0
declare #lastnullcount int=0
declare #currentnullcount int=0
WHILE EXISTS (
SELECT *
FROM Table1
where GroupID is null
)
BEGIN
SET #counter=#counter+1
SET #lastnullcount =0
SELECT TOP 1
#row=[row]
FROM Table1
where GroupID is null
order by [row] asc
SELECT #currentnullcount=count(*) from table1 where groupid is null
WHILE #lastnullcount <> #currentnullcount
BEGIN
SELECT #lastnullcount=count(*)
from table1
where groupid is null
UPDATE Table1
SET GroupID=#counter
WHERE [row]=#row
UPDATE t2
SET t2.GroupID=#counter
FROM Table1 t1
INNER JOIN Table1 t2 on t1.Company=t2.Company
WHERE t1.GroupID=#counter
AND t2.GroupID IS NULL
UPDATE t2
SET t2.GroupID=#counter
FROM Table1 t1
INNER JOIN Table1 t2 on t1.publisher=t2.publisher
WHERE t1.GroupID=#counter
AND t2.GroupID IS NULL
SELECT #currentnullcount=count(*)
from table1
where groupid is null
END
END
SELECT * FROM Table1
Edit:
Added indexes as I would expect on the real table and be more in line with the other data sets Roman is using.
You are trying to find all of the connected components of your graph, which can only be done iteratively. If you know the maximum width of any connected component (i.e. the maximum number of links you will have to take from one company/publisher to another), you could in principle do it something like this:
SELECT
MIN(x2.groupID) AS groupID,
x1.Company,
x1.Publisher
FROM Table1 AS x1
INNER JOIN (
SELECT
MIN(x2.Company) AS groupID,
x1.Company,
x1.Publisher
FROM Table1 AS x1
INNER JOIN Table1 AS x2
ON x1.Publisher = x2.Publisher
GROUP BY
x1.Publisher,
x1.Company
) AS x2
ON x1.Company = x2.Company
GROUP BY
x1.Publisher,
x1.Company;
You have to keep nesting the subquery (alternating joins on Company and Publisher, and with the deepest subquery saying MIN(Company) rather than MIN(groupID)) to the maximum iteration depth.
I don't really recommend this, though; it would be cleaner to do this outside of SQL.
Disclaimer: I don't know anything about SQL Server 2012 (or any other version); it may have some kind of additional scripting ability to let you do this iteration dynamically.
This is a recursive solution, using XML:
with a as ( -- recursive result, containing shorter subsets and duplicates
select cast('<c>' + company + '</c>' as xml) as companies
,cast('<p>' + publisher + '</p>' as xml) as publishers
from Table1
union all
select a.companies.query('for $c in distinct-values((for $i in /c return string($i),
sql:column("t.company")))
order by $c
return <c>{$c}</c>')
,a.publishers.query('for $p in distinct-values((for $i in /p return string($i),
sql:column("t.publisher")))
order by $p
return <p>{$p}</p>')
from a join Table1 t
on ( a.companies.exist('/c[text() = sql:column("t.company")]') = 0
or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 0)
and ( a.companies.exist('/c[text() = sql:column("t.company")]') = 1
or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 1)
), b as ( -- remove the shorter versions from earlier steps of the recursion and the duplicates
select distinct -- distinct cannot work on xml types, hence cast to nvarchar
cast(companies as nvarchar) as companies
,cast(publishers as nvarchar) as publishers
,DENSE_RANK() over(order by cast(companies as nvarchar), cast(publishers as nvarchar)) as groupid
from a
where not exists (select 1 from a as s -- s is a proper subset of a
where (cast('<s>' + cast(s.companies as varchar)
+ '</s><a>' + cast(a.companies as varchar) + '</a>' as xml)
).value('if((count(/s/c) > count(/a/c))
and (some $s in /s/c/text() satisfies
(some $a in /a/c/text() satisfies $s = $a))
) then 1 else 0', 'int') = 1
)
and not exists (select 1 from a as s -- s is a proper subset of a
where (cast('<s>' + cast(s.publishers as nvarchar)
+ '</s><a>' + cast(a.publishers as nvarchar) + '</a>' as xml)
).value('if((count(/s/p) > count(/a/p))
and (some $s in /s/p/text() satisfies
(some $a in /a/p/text() satisfies $s = $a))
) then 1 else 0', 'int') = 1
)
), c as ( -- cast back to xml
select cast(companies as xml) as companies
,cast(publishers as xml) as publishers
,groupid
from b
)
select Co.company.value('(./text())[1]', 'varchar') as company
,Pu.publisher.value('(./text())[1]', 'varchar') as publisher
,c.groupid
from c
cross apply companies.nodes('/c') as Co(company)
cross apply publishers.nodes('/p') as Pu(publisher)
where exists(select 1 from Table1 t -- restrict to only the combinations that exist in the source
where t.company = Co.company.value('(./text())[1]', 'varchar')
and t.publisher = Pu.publisher.value('(./text())[1]', 'varchar')
)
The set of companies and the set of publishers are kept in XML fields in the intermediate steps, and there is some casting between xml and nvarchar necessary due to some limitations of SQL Server (like not being able to group or use distinct on XML columns.
Bit late to the challenge, and since SQLFiddle seems to be down ATM I'll have to guess your data-structures. Nevertheless, it seemed like a fun challenge (and it was =) so here's what I made from it :
Setup:
IF OBJECT_ID('t_link') IS NOT NULL DROP TABLE t_link
IF OBJECT_ID('t_company') IS NOT NULL DROP TABLE t_company
IF OBJECT_ID('t_publisher') IS NOT NULL DROP TABLE t_publisher
IF OBJECT_ID('tempdb..#link_A') IS NOT NULL DROP TABLE #link_A
IF OBJECT_ID('tempdb..#link_B') IS NOT NULL DROP TABLE #link_B
GO
CREATE TABLE t_company ( company_id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
company_name varchar(100) NOT NULL)
GO
CREATE TABLE t_publisher (publisher_id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
publisher_name varchar(100) NOT NULL)
CREATE TABLE t_link (company_id int NOT NULL FOREIGN KEY (company_id) REFERENCES t_company (company_id),
publisher_id int NOT NULL FOREIGN KEY (publisher_id) REFERENCES t_publisher (publisher_id),
PRIMARY KEY (company_id, publisher_id),
group_id int NULL
)
GO
-- example content
-- ROW GROUPID Company Publisher
--1 1 A Y
--2 1 A X
--3 1 B Y
--4 1 B Z
--5 2 C W
--6 2 C P
--7 2 D W
INSERT t_company (company_name) VALUES ('A'), ('B'), ('C'), ('D')
INSERT t_publisher (publisher_name) VALUES ('X'), ('Y'), ('Z'), ('W'), ('P')
INSERT t_link (company_id, publisher_id)
SELECT company_id, publisher_id
FROM t_company, t_publisher
WHERE (company_name = 'A' AND publisher_name = 'Y')
OR (company_name = 'A' AND publisher_name = 'X')
OR (company_name = 'B' AND publisher_name = 'Y')
OR (company_name = 'B' AND publisher_name = 'Z')
OR (company_name = 'C' AND publisher_name = 'W')
OR (company_name = 'C' AND publisher_name = 'P')
OR (company_name = 'D' AND publisher_name = 'W')
GO
/*
-- volume testing
TRUNCATE TABLE t_link
DELETE t_company
DELETE t_publisher
DECLARE #company_count int = 1000,
#publisher_count int = 450,
#links_count int = 800
INSERT t_company (company_name)
SELECT company_name = Convert(varchar(100), NewID())
FROM master.dbo.fn_int_list(1, #company_count)
UPDATE STATISTICS t_company
INSERT t_publisher (publisher_name)
SELECT publisher_name = Convert(varchar(100), NewID())
FROM master.dbo.fn_int_list(1, #publisher_count)
UPDATE STATISTICS t_publisher
-- Random links between the companies & publishers
DECLARE #count int
SELECT #count = 0
WHILE #count < #links_count
BEGIN
SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), company_id = company_id + 0
INTO #link_A
FROM t_company
ORDER BY NewID()
SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), publisher_id = publisher_id + 0
INTO #link_B
FROM t_publisher
ORDER BY NewID()
INSERT TOP (#links_count - #count) t_link (company_id, publisher_id)
SELECT A.company_id,
B.publisher_id
FROM #link_A A
JOIN #link_B B
ON A.row_id = B.row_id
WHERE NOT EXISTS ( SELECT *
FROM t_link old
WHERE old.company_id = A.company_id
AND old.publisher_id = B.publisher_id)
SELECT #count = #count + ##ROWCOUNT
DROP TABLE #link_A
DROP TABLE #link_B
END
*/
Actual grouping:
IF OBJECT_ID('tempdb..#links') IS NOT NULL DROP TABLE #links
GO
-- apply grouping
-- init
SELECT row_id = IDENTITY(int, 1, 1),
company_id,
publisher_id,
group_id = 0
INTO #links
FROM t_link
-- don't see an index that would be actually helpful here right-away, using row_id to avoid HEAP
CREATE CLUSTERED INDEX idx0 ON #links (row_id)
--CREATE INDEX idx1 ON #links (company_id)
--CREATE INDEX idx2 ON #links (publisher_id)
UPDATE #links
SET group_id = row_id
-- start grouping
WHILE ##ROWCOUNT > 0
BEGIN
UPDATE #links
SET group_id = new_group_id
FROM #links upd
CROSS APPLY (SELECT new_group_id = Min(group_id)
FROM #links new
WHERE new.company_id = upd.company_id
OR new.publisher_id = upd.publisher_id
) x
WHERE upd.group_id > new_group_id
-- select * from #links
END
-- remove 'holes'
UPDATE #links
SET group_id = (SELECT COUNT(DISTINCT o.group_id)
FROM #links o
WHERE o.group_id <= upd.group_id)
FROM #links upd
GO
UPDATE t_link
SET group_id = new.group_id
FROM t_link upd
LEFT OUTER JOIN #links new
ON new.company_id = upd.company_id
AND new.publisher_id = upd.publisher_id
GO
SELECT row = ROW_NUMBER() OVER (ORDER BY group_id, company_name, publisher_name),
l.group_id,
c.company_name, -- c.company_id,
p.publisher_name -- , p.publisher_id
from t_link l
JOIN t_company c
ON l.company_id = c.company_id
JOIN t_publisher p
ON p.publisher_id = l.publisher_id
ORDER BY 1
At first sight this approach hasn't been tried yet by anyone else, interesting to see how this can be done in a variety of ways... (preferred not to read them upfront as it would spoil the puzzle =)
Results look as expected (as far as I understand the requirements and the example) and performance isn't too shabby either although there is no real indication on the amount of records this should work on; not sure how it would scale but don't expect too many problems either...

How to optimize a Database Model for a M:N relationship

Edit 10-Apr-2013
In order to make myself clear I am adding another (simplified) example showing the principle of what I am trying to achieve:
T1 - PERSONHAS T2 - PRODUCTNEED
ANTON has WHEEL CAR need ENGINE
ANTON has ENGINE CAR need WHEEL
ANTON has NEEDLE SHIRT need NEEDLE
BERTA has NEEDLE SHIRT need THREAD
BERTA has THREAD JAM need FRUIT
BERTA has ENGINE JAM need SUGAR
Q3 - PERSONCANMAKE
ANTON canmake CAR
BERTA canmake SHIRT
Q4 - PERSONCANNOTMAKE
ANTON cannotmake SHIRT
ANTON cannotmake FRUIT
BERTA cannotmake CAR
BERTA cannotmake FRUIT
I have T1 and T2 and want to create queries for Q3 and Q4
End Edit 10-Apr-2013
Preface:
In order to create a product (P) I need to have certain generic capabilities (C - like a factory, supply, electricity, water, etc.)
A product manager defines all generic capabilities needed to create his/her product.
In a location (L) I have certain generic capabilities (C)
A location manager defines the capabilities his/her location is able to provide. This could be a clear YES, a clear NO, or the location manager does not list a certain capability at all.
DB Model:
I have created the following root entities
Location (PK: L) - values L1, L2, L3 // in real ca. 250 rows of L
Product (PK: P) - values P1, P2 // in real ca. 150 rows of P
Capability (PK: C) - values C1, C2, C3 // in real ca. 80 rows of C
and the following child (dependent) entities
ProductCapabilityAssignment:P, C (PK: P, C, FK: P, C)
P1 C1
P1 C2
P2 C1
P2 C3
LocationCapabilityAssignment: L, C, Status (Y/N) (PK: L, C, FK: L, C)
L1 C1 Y
L2 C1 Y
L2 C2 Y
L2 C3 N
L3 C1 Y
L3 C2 Y
L3 C3 Y
Task:
The task is to find out whether a certain product can be produced at a certain location, whereby all capabilities defined for the product must be present at that location. In order to answer this I couldn't help myself but to
create a Cartesian Product of Location and ProductCapabilityAssignment (CL_Cart) to ensure that for each location I am listing all possible products with their cpability needs
CREATE VIEW CL_Cart AS
SELECT L.L, PCA.P, PCA.C
FROM Location AS L, ProductCapabilityAssignment AS PCA;
create an outer join between CL_Cart and LocationCapabilityAssignment to match in all capabilities a location can provide
CREATE VIEW Can_Produce AS
SELECT X.L, X.P, X.C, LCA.Status
FROM CL_CArt AS X LEFT JOIN LocationCapabilityAssignment AS LCA ON (X.C = LCA.C) AND (X.L = LCA.L);
so that finaly I get
SELECT L, P, C, Status
FROM Can_Produce;
L1 P1 C1 Y
L1 P1 C2 NULL // C2 not listed for L1
L1 P2 C1 Y
L1 P2 C3 NULL // C3 not listed for L1
L2 P1 C1 Y
L2 P1 C2 Y
L2 P2 C1 Y
L2 P2 C3 N // C3 listed as "No" for L2
L3 P1 C1 Y
L3 P1 C2 Y
L3 P2 C1 Y
L3 P2 C3 Y
meaning that L1 cannot produce neither P1 nor P2, L2 can produce P1, and L3 can produce both P1, P2.
So I can query Can_Produce for a specific product/location and see what I have and what I don't have in terms of capabilities. I also can provide a shortcut overall YES/NO answer by examining Status="N" OR Status is NULL - if so the product cannot be produced.
Question:
For a relational database like MSSQL, MySQL, Oracle (not yet decided and beyond my influence) I am wondering if I have chosen the correct data model for this M:N relationship or if I could do any better. In particular I fear that with ca. 250 locations, 150 products and one product in average being defined by +/- 10 capabilities, so to say a Cartesian product of 375.000 rows, that performance will collapse due to huge memory consumption.
I would also really like to avoid stored procedures.
Any thoughts would be welcome.
--Environment Variables
Declare #Parent table (id int identity(1,1) primary key, Name varchar(20))
Declare #Components table (id int identity(1,1) primary key, Name varchar(20)) Insert into #Components (Name) values ('Engine'),('Wheel'),('Chassis'),('NEEDLE'),('THREAD'),('FRUIT'),('SUGAR')
Declare #Objects table (id int identity(1,1) primary key, Name varchar(20))
Declare #Template table (id int identity(1,1) primary key, Name varchar(20), ObjectID int, ComponentID int)
Insert into #Template (Name, ObjectID, ComponentID)
Select 'Vehicle', O.ID, C.ID from #Objects O, #Components C where O.Name = 'Car' and C.Name in ('Engine','Wheel','Chassis')union
Select 'Clothing', O.ID, C.ID from #Objects O, #Components C where O.Name = 'Shirt' and C.Name in ('Needle','Thread') union
Select 'Confectionary', O.ID, C.ID from #Objects O, #Components C where O.Name = 'JAM' and C.Name in ('FRUIT','SUGAR')
Declare #AvailableMaterials table (id int identity(1,1) primary key, TestType varchar(20), ParentID int, ComponentID int)
--Test Data
Insert into #AvailableMaterials (TestType,ParentID,ComponentID)
Select 'CompleteSet', P.ID, T.ComponentID from #Parent P, #Template T where P.Name = 'Driver' and T.Objectid = (Select ID from #Objects where Name = 'Car') union
Select 'CompleteSet', P.ID, T.ComponentID from #Parent P, #Template T where P.Name = 'Seamstress' and T.Objectid = (Select ID from #Objects where Name = 'Shirt') union
Select 'IncompleteSet', P.ID, T.ComponentID from #Parent P, #Template T where P.Name = 'Confectionarist' and T.Objectid = (Select ID from #Objects where Name = 'Jam')
and T.ComponentID not in (Select ID from #Components where Name = 'FRUIT')
--/*What sets are incomplete?*/
Select *
from #AvailableMaterials
where ID in (
Select SRCData.ID
from #AvailableMaterials SRCData cross apply (Select ObjectID from #Template T where ComponentID = SRCData.ComponentID) ObjectsMatchingComponents
inner join #Template T
on SRCData.ComponentID = T.ComponentID
and T.ObjectID = ObjectsMatchingComponents.ObjectID
cross apply (Select ObjectID, ComponentID from #Template FullTemplate where FullTemplate.ObjectID = T.ObjectID and FullTemplate.ComponentID not in (Select ComponentID from #AvailableMaterials SRC where SRC.ComponentID = FullTemplate.ComponentID)) FullTemplate
)
/*What sets are complete?*/
Select *
from #AvailableMaterials
where ID not in (
Select SRCData.ID
from #AvailableMaterials SRCData cross apply (Select ObjectID from #Template T where ComponentID = SRCData.ComponentID) ObjectsMatchingComponents
inner join #Template T
on SRCData.ComponentID = T.ComponentID
and T.ObjectID = ObjectsMatchingComponents.ObjectID
cross apply (Select ObjectID, ComponentID from #Template FullTemplate where FullTemplate.ObjectID = T.ObjectID and FullTemplate.ComponentID not in (Select ComponentID from #AvailableMaterials SRC where SRC.ComponentID = FullTemplate.ComponentID)) FullTemplate
)
Hi
This is the best I can come up with... It works on the premise that you have to know what the complete set is, to know what's missing. Once you have what's missing, you can tell the complete sets from the incomplete sets.
I doubt this solution will scale well, even if moved to #tables with indexing. Possibly though...
I too would be interested in seeing a cleaner approach. The above solution was developed in a SQL 2012 version. Note cross apply which limits the Cartesian effect somewhat.
Hope this helps.
I'm not sure what database you are using, but here is an example that would work in sql server - shouldn't require many changes to work in other databases...
WITH ProductLocation
AS
(
SELECT P.P,
P.Name as ProductName,
L.L,
L.Name as LocationName
FROM Product P
CROSS
JOIN Location L
),
ProductLocationCapability
AS
(
SELECT PL.P,
PL.ProductName,
PL.L,
PL.LocationName,
SUM(PC.C) AS RequiredCapabilities,
SUM(CASE WHEN LC.L IS NULL THEN 0 ELSE 1 END) AS FoundCapabilities
FROM ProductLocation PL
JOIN ProductCapabilityAssignment PC
ON PC.P = PL.P
LEFT
JOIN LocationCapabilityAssignment LC
ON LC.L = PL.L
AND LC.C = PC.C
GROUP BY PL.P, PL.ProductName, PL.L, PL.LocationName
)
SELECT PLC.P,
PLC.ProductName,
PLC.L,
PLC.LocationName,
CASE WHEN PLC.RequiredCapabilities = PLC.FoundCapabilities THEN 'Y' ELSE 'N' END AS CanMake
FROM ProductLocationCapability PLC
(Not sure if the field names are correct, I couldn't quite make sense of the schema description!)