Is there a good way to do this in SQL? - sql

I am trying to solve the following problem entirely in SQL (ANSI or TSQL, in Sybase ASE 12), without relying on cursors or loop-based row-by-row processing.
NOTE: I already created a solution that accomplishes the same goal in application layer (therefore please refrain from "answering" with "don't do this in SQL"), but as a matter of principle (and hopefully improved performance) I would like to know if there is an efficient (e.g. no cursors) pure SQL solution.
Setup:
I have a table T with the following 3 columns (all NOT NULL):
---- Table T -----------------------------
| item | tag | value |
| [int] | [varchar(10)] | [varchar(255)] |
The table has unique index on item, tag
Every tag has a form of a string "TAG##" where "##" is a number 1-99
Existing tags are not guaranteed to be contiguous, e.g. item 13 may have tags "TAG1", "TAG3", "TAG10".
TASK: I need to insert a bunch of new rows into the table from another table T_NEW, which only have items and values, and assign new tag to them so they don't violate unique index on item, tag.
Uniqueness of values is irrelevant (assume that item+value is always unique already).
---- Table T_NEW --------------------------
| item | tag | value |
| [int] | STARTS AS NULL | [varchar(255)] |
QUESTION: How can I assign new tags to all rows in table T_NEW, such that:
All item+tag combinations in a union of T and T_NEW are unique
Newly assigned tags should all be in the form "TAG##"
Newly assigned tags should ideally be the smallest available for a given item.
If it helps, you can assume that I already have a temp table #tags, with a "tag" column that contains 99 rows containing all the valid tags (TAG1..TAG99, one per row)

I started a fiddle that will get you the list of available "open" tags by item. It does this using the #tags (AllTags) and doing an outer-join-where-null. You could use that to insert new tags from T_New...
with T_openTags as (
select
items.item,
openTagName = a.tag
from
(select distinct item from T) items
cross join AllTags a
left outer join T on
items.item = T.item
and T.tag = a.tag
where
T.item is null
)
select * from T_openTags
or see this updated fiddle to do an update on T_New table. Essentially adds a row_number so we can pick the correct open tag to use in a single update statement. I padded the Tag names with a leading zero to simplify the sorting.
with T_openTags as (
select
items.item,
openTagName = a.tag,
rn = row_number() over(partition by items.item order by a.tag)
from
(select distinct item from T) items
cross join AllTags a
left outer join T on
items.item = T.item
and T.tag = a.tag
where
T.item is null
), T_New_numbered as (
select *,
rn = row_number() over(partition by item order by value)
from T_New
)
update tnn set tag = openTagName
from T_New_numbered tnn
inner join T_openTags tot on
tot.item = tnn.item
and tot.rn = tnn.rn
select * from T_New
updated fiddle with poor mans row_number replacement that only works with distinct T_New values

Try this:
DECLARE #T TABLE (ITEM INT, TAG VARCHAR(10), VALUE VARCHAR(255))
INSERT INTO #T VALUES
(1,'TAG1', '100'),
(2,'TAG2', '200')
DECLARE #T_NEW TABLE (ITEM INT, TAG VARCHAR(10), VALUE VARCHAR(255))
INSERT INTO #T_NEW VALUES
(3,NULL, '500'),
(4,NULL, '600')
INSERT INTO #T
SELECT
ITEM,
('TAG' + CONVERT(VARCHAR(20),ITEM)) AS TAG,
VALUE
FROM
#T_NEW
SELECT * FROM #T

OK, here's a correct solution, tested to work on Sybase (H/T: big thanks to #ypercube for providing a solid basis for it)
declare #c int
select #c = 1
WHILE (#c > 0)
BEGIN
UPDATE
t_new
SET
tag =
( SELECT min(tags.tag)
FROM #tags tags
LEFT JOIN t o
ON tags.tag = o.tag
AND o.item = t_new.item
LEFT JOIN t_new n3
ON tags.tag = n3.tag
AND n3.item = t_new.item
WHERE o.tag IS NULL
AND n3.tag IS NULL
)
WHERE tag IS NULL
-- and here's the main magic for only updating one item at a time
AND NOT EXISTS (SELECT 1 FROM t_new n2 WHERE t_new.value > n2.value
and n2.tag IS NULL and n2.item=t_new.item)
SELECT #c = ##rowcount
END

Inserting directly to t:
INSERT INTO t
(item, tag, value)
SELECT
item,
( SELECT MIN(tags.tag)
FROM #tags AS tags
LEFT JOIN t AS o
ON tags.tag = o.tag
AND o.item_id = n.item_id
WHERE o.tag IS NULL
) AS tag,
value
FROM
t_new AS n ;
Updating t_new:
UPDATE
t_new AS n
SET
tag =
( SELECT MIN(tags.tag)
FROM #tags AS tags
LEFT JOIN t AS o
ON tags.tag = o.tag
AND o.item_id = n.item_id
WHERE o.tag IS NULL
) ;
Correction
UPDATE
n
SET
n.tag = w.tag
FROM
( SELECT item_id,
tag,
ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY value) AS rn
FROM t_new
) AS n
JOIN
( SELECT di.item_id,
tags.tag,
ROW_NUMBER() OVER (PARTITION BY di.item_id ORDER BY tags.tag) AS rn
FROM
( SELECT DISTINCT item_id
FROM t_new
) AS di
CROSS JOIN
#tags AS tags
LEFT JOIN
t AS o
ON tags.tag = o.tag
AND o.item_id = di.item_id
WHERE o.tag IS NULL
) AS w
ON w.item_id = n.item_id
AND w.rn = n.rn ;

Related

Change SQL where clause based on a parameter?

I need to alter a query to do something like this (following is generic pseudo-code):
if (tag list contains all tags in the database) {
select every product regardless of tag, even products with null tag
}
else { //tag list is only a few tags long
select only the products that have a tag in the tag list
}
I have tried doing stuff like this, but it doesn't work:
SELECT p.Id
FROM Tags t
JOIN Products p ON p.TagId = t.Id
WHERE ((EXISTS(select Id from Tags EXCEPT select item from dbo.SplitString(#tagList,',')) AND p.TagId in (select item from dbo.SplitString(#tagList,',')))
OR (p.TagId in (select item from dbo.SplitString(#tagList,',')) or p.TagId is null))
This will take place inside of a large query with a large WHERE clause, so putting two slightly different queries in an IF ELSE statement is not ideal.
What should I do to get this working?
First things first: you should use properly normalized input parameters. Ideally this would be a Table-Valued parameter, however if you cannot do that then you could insert the split values into a table variable
DECLARE #tags TABLE (TagId int PRIMARY KEY);
INSERT #tags (TagId)
SELECT item
FROM dbo.SplitString(#tagList, ',');
Next, the easiest way is probably to just find out first whether all tags match, and store that in a variable.
DECLARE #isAllTags bit = CASE WHEN EXISTS(
SELECT t.Id
FROM Tags t
EXCEPT
SELECT tList.Id
FROM #tags tList
) THEN 0 ELSE 1 END;
SELECT p.Id
FROM Products p
WHERE #isAllTags = 1
OR EXISTS (SELECT 1
FROM #tags tList
WHERE tList.TagId = p.TagId);
You could merge these queries, but it's unlikely to be more performant.
You could even do it in a very set-based fashion, but it's probably going to be really slow
SELECT p.Id
FROM Products p
WHERE EXISTS (SELECT 1
FROM Tags t
LEFT JOIN #tags tList ON tList.TagId = t.Id
CROSS APPLY (VALUES (CASE WHEN p.TagId = tList.TagId THEN 1 END )) v(ProductMatch)
HAVING COUNT(t.Id) = COUNT(tList.TagId) -- all exist
OR COUNT(v.ProductMatch) > 0 -- at least one match
);
Try this, this might work.
SELECT p.Id
FROM
Products p LEFT JOIN
Tags t ON p.TagId = t.Id
WHERE
t.Id is null
OR
(t.id is not null and
t.Id in (SELECT value FROM STRING_SPLIT(#tagList, ',')))
I just tested - works

remove Union from query

I have query to get category and subcategories of same category. I used below query:
Select Id from category where name like '%events%' and deleted = 0 and published = 1
UNION
SELECT Id from category where parentcategoryid = (Select id from category where name like '%events%' and deleted = 0 and published = 1)
I do not want to use UNION, want to use Join only. But not getting how i can achieve. Below is the table structure. Please help me. thanks in Advance
The following might solve your problem:
DECLARE #t TABLE (id int, name varchar(20), parentcatid varchar(200))
insert into #t values(23,'Christmas', 34)
,(29,'Birthday', 34)
,(31,'New year', 34)
,(34,'Events', 0)
,(35,'gfrhrt', 0)
;WITH cte AS(
Select Id from #t
where name like '%events%' --and deleted = 0 and published = 1
),
cte2 AS (
SELECT a.id AS tId, cpa.id, cId2.id AS idPa
from #t a
LEFT JOIN cte AS cId ON cId.Id = a.Id
FULL OUTER JOIN #t AS cPa ON cPa.parentcatid = cId.Id
LEFT JOIN cte cId2 ON cId2.id = cPa.id
WHERE cId.Id IS NOT NULL OR cPa.Id IS NOT NULL
)
SELECT id
FROM cte2
WHERE tId IS NOT NULL
OR id = idPa
The idea is to get all required IDs within the cte and then get all categories, where either ID oder ParentID match the IDs from the cte. However, depending on the size of your table, you might want to add further filters to the cte.

Use Data of 1 table into another one dynamically

I have one table category_code having data like
SELECT Item, Code, Prefix from category_codes
Item Code Prefix
Bangles BL BL
Chains CH CH
Ear rings ER ER
Sets Set ST
Rings RING RG
Yellow GOld YG YG........
I have another table item_categories having data like
select code,name from item_categories
code name
AQ.TM.PN AQ.TM.PN
BL.YG.CH.ME.PN BL.YG.CH.ME.PN
BS.CZ.ST.YG.PN BS.CZ.ST.YG.PN
CR.YG CR.YG.......
i want to update item_categories.name column corresponding to category_code.item column like
code name
BL.YG.CH.ME.PN Bangles.Yellow Gold.Chains.. . . .
Please suggest good solution for that. Thanks in advance.
First, split the code into several rows, join with the category code and then, concat the result to update the table.
Here an example, based on the data you gave
create table #category_code (item varchar(max), code varchar(max), prefix varchar(max));
create table #item_categories (code varchar(max), name varchar(max));
insert into #category_code (item, code, prefix) values ('Bangles','BL','BL'),('Chains','CH','CH'),('Ear rings','ER','ER'), ('Sets','Set','ST'),('Rings','RING','RG'), ('Yellow gold','YG','YG');
insert into #item_categories (code, name) values ('AQ.TM,PN','AQ.TM.PN'),('BL.YG.CH.ME.PN','BL.YG.CH.ME.PN'),('BS.CZ.ST.YG.PN','BS.CZ.ST.YG.PN')
;with splitted as ( -- split the codes into individual code
select row_number() over (partition by ic.code order by ic.code) as id, ic.code, x.value, cc.item
from #item_categories ic
outer apply string_split(ic.code, '.') x -- SQL Server 2016+, otherwise, use another method to split the data
left join #category_code cc on cc.code = x.value -- some values are missing in you example, but can use an inner join
)
, joined as ( -- then joined them to concat the name
select id, convert(varchar(max),code) as code, convert(varchar(max),coalesce(item + ',','')) as Item
from splitted
where id = 1
union all
select s.id, convert(varchar(max), s.code), convert(varchar(max), j.item + coalesce(s.item + ',',''))
from splitted s
inner join joined j on j.id = s.id - 1 and j.code = s.code
)
update #item_categories
set name = substring (j.item ,1,case when len(j.item) > 1 then len(j.item)-1 else 0 end)
output deleted.name, inserted.name
from #item_categories i
inner join joined j on j.code = i.code
inner join (select code, max(id)maxid from joined group by code) mj on mj.code = j.code and mj.maxid = j.id

How to UPDATE pivoted table in SQL SERVER

I have flat table which I have to join using EAN attribute with my main table and update gid (id of my main table).
id attrib value gid
1 weight 10 NULL
1 ean 123123123112 NULL
1 color blue NULL
2 weight 5 NULL
2 ean 331231313123 NULL
I was trying to pivot ean rows into column, next join on ean both tables, and for this moment everything works great.
--update SideTable
--set gid = ab_id
select gid, ab_id
from SideTable
pivot (max (value) for attrib in ([EAN],[MPN])) as b
join MainTable as c
on c.ab_ean = b.EAN
where b.EAN !='' AND c.ab_archive = '0'
When I am selecting both id columns is okey, but when I am uncomment first lines and delete select whole table is set with first gid from my main table.
It have to set my main id into all attributes where ID where ean is matched from my main table.
I am sorry for my terrible english but I hope someone can help me, with that.
The reason your update does not work is that you don't have any link between your source and target for the update, although you reference sidetable in the FROM clause, this is effectively destroyed by the PIVOT function, leaving no link back to the instance of SideTable that you are updating. Since there is no link, all rows are updated with the same value, this will be the last value encountered in the FROM.
This can be demonstrated by running the following:
DECLARE #S TABLE (ID INT, Attrib VARCHAR(50), Value VARCHAR(50), gid INT);
INSERT #S
VALUES
(1, 'weight', '10', NULL), (1, 'ean', '123123123112', NULL), (1, 'color', 'blue', NULL),
(2, 'weight', '5', NULL), (2, 'ean', '331231313123', NULL);
SELECT s.*
FROM #S AS s
PIVOT (MAX(Value) FOR attrib IN ([EAN],[MPN])) AS pvt;
You clearly have a table aliased s in the FROM clause, however because you have used pivot you cannot use SELECT s*, you get the following error:
The column prefix 's' does not match with a table name or alias name used in the query.
You haven't provided sample data for your main table, but I am about 95% certain your PIVOT is not needed, I think you can get your update using just normal JOINs:
UPDATE s
SET gid = ab_id
FROM SideTable AS s
INNER JOIN SideTable AS ean
ON ean.ID = s.ID
AND ean.attrib = 'ean'
INNER JOIN MainTable AS m
ON m.ab_EAN = ean.Value
WHERE m.ab_archive = '0'
AND m.ab_EAN != '';
As per comment to the question, you need to use update + select statement.
A standard version looks like:
UPDATE
T
SET
T.col1 = OT.col1,
T.col2 = OT.col2
FROM
Some_Table T
INNER JOIN
Other_Table OT
ON
T.id = OT.id
WHERE
T.col3 = 'cool'
As to your needs:
update a
set a.gid = p.ab_id
from SideTable As a
Inner join (
select gid, ab_id
from SideTable
pivot (max (value) for attrib in ([EAN],[MPN])) as b
join MainTable as c
on c.ab_ean = b.EAN
where b.EAN !='' AND c.ab_archive = '0') p ON a.ean = p.EAN
try and break it down a bit more like this..
update SideTable
set SideTable.gid = p.ab_id
FROM
(
select gid, ab_id
from SideTable
pivot (max (value) for attrib in ([EAN],[MPN])) as b
join MainTable as c
on c.ab_ean = b.EAN
where b.EAN !='' AND c.ab_archive = '0'
) p
WHERE p.EAN = SideTable.EAN

Group All Related Records in Many to Many Relationship, SQL graph connected components

Hopefully I'm missing a simple solution to this.
I have two tables. One contains a list of companies. The second contains a list of publishers. The mapping between the two is many to many. What I would like to do is bundle or group all of the companies in table A which have any relationship to a publisher in table B and vise versa.
The final result would look something like this (GROUPID is the key field). Row 1 and 2 are in the same group because they share the same company. Row 3 is in the same group because the publisher Y was already mapped over to company A. Row 4 is in the group because Company B was already mapped to group 1 through Publisher Y.
Said simply, any time there is any kind of shared relationship across Company and Publisher, that pair should be assigned to the same group.
ROW GROUPID Company Publisher
1 1 A Y
2 1 A X
3 1 B Y
4 1 B Z
5 2 C W
6 2 C P
7 2 D W
Fiddle
Update:
My bounty version: Given the table in the fiddle above of simply Company and Publisher pairs, populate the GROUPID field above. Think of it as creating a Family ID that encompasses all related parents/children.
SQL Server 2012
I thought about using recursive CTE, but, as far as I know, it's not possible in SQL Server to use UNION to connect anchor member and a recursive member of recursive CTE (I think it's possible to do in PostgreSQL), so it's not possible to eliminate duplicates.
declare #i int
with cte as (
select
GroupID,
row_number() over(order by Company) as rn
from Table1
)
update cte set GroupID = rn
select #i = ##rowcount
-- while some rows updated
while #i > 0
begin
update T1 set
GroupID = T2.GroupID
from Table1 as T1
inner join (
select T2.Company, min(T2.GroupID) as GroupID
from Table1 as T2
group by T2.Company
) as T2 on T2.Company = T1.Company
where T1.GroupID > T2.GroupID
select #i = ##rowcount
update T1 set
GroupID = T2.GroupID
from Table1 as T1
inner join (
select T2.Publisher, min(T2.GroupID) as GroupID
from Table1 as T2
group by T2.Publisher
) as T2 on T2.Publisher = T1.Publisher
where T1.GroupID > T2.GroupID
-- will be > 0 if any rows updated
select #i = #i + ##rowcount
end
;with cte as (
select
GroupID,
dense_rank() over(order by GroupID) as rn
from Table1
)
update cte set GroupID = rn
sql fiddle demo
I've also tried a breadth first search algorithm. I thought it could be faster (it's better in terms of complexity), so I'll provide a solution here. I've found that it's not faster than SQL approach, though:
declare #Company nvarchar(2), #Publisher nvarchar(2), #GroupID int
declare #Queue table (
Company nvarchar(2), Publisher nvarchar(2), ID int identity(1, 1),
primary key(Company, Publisher)
)
select #GroupID = 0
while 1 = 1
begin
select top 1 #Company = Company, #Publisher = Publisher
from Table1
where GroupID is null
if ##rowcount = 0 break
select #GroupID = #GroupID + 1
insert into #Queue(Company, Publisher)
select #Company, #Publisher
while 1 = 1
begin
select top 1 #Company = Company, #Publisher = Publisher
from #Queue
order by ID asc
if ##rowcount = 0 break
update Table1 set
GroupID = #GroupID
where Company = #Company and Publisher = #Publisher
delete from #Queue where Company = #Company and Publisher = #Publisher
;with cte as (
select Company, Publisher from Table1 where Company = #Company and GroupID is null
union all
select Company, Publisher from Table1 where Publisher = #Publisher and GroupID is null
)
insert into #Queue(Company, Publisher)
select distinct c.Company, c.Publisher
from cte as c
where not exists (select * from #Queue as q where q.Company = c.Company and q.Publisher = c.Publisher)
end
end
sql fiddle demo
I've tested my version and Gordon Linoff's to check how it's perform. It looks like CTE is much worse, I couldn't wait while it's complete on more than 1000 rows.
Here's sql fiddle demo with random data. My results were:
128 rows:
my RBAR solution: 190ms
my SQL solution: 27ms
Gordon Linoff's solution: 958ms
256 rows:
my RBAR solution: 560ms
my SQL solution: 1226ms
Gordon Linoff's solution: 45371ms
It's random data, so results may be not very consistent. I think timing could be changed by indexes, but don't think it could change a whole picture.
old version - using temporary table, just calculating GroupID without touching initial table:
declare #i int
-- creating table to gather all possible GroupID for each row
create table #Temp
(
Company varchar(1), Publisher varchar(1), GroupID varchar(1),
primary key (Company, Publisher, GroupID)
)
-- initializing it with data
insert into #Temp (Company, Publisher, GroupID)
select Company, Publisher, Company
from Table1
select #i = ##rowcount
-- while some rows inserted into #Temp
while #i > 0
begin
-- expand #Temp in both directions
;with cte as (
select
T2.Company, T1.Publisher,
T1.GroupID as GroupID1, T2.GroupID as GroupID2
from #Temp as T1
inner join #Temp as T2 on T2.Company = T1.Company
union
select
T1.Company, T2.Publisher,
T1.GroupID as GroupID1, T2.GroupID as GroupID2
from #Temp as T1
inner join #Temp as T2 on T2.Publisher = T1.Publisher
), cte2 as (
select
Company, Publisher,
case when GroupID1 < GroupID2 then GroupID1 else GroupID2 end as GroupID
from cte
)
insert into #Temp
select Company, Publisher, GroupID
from cte2
-- don't insert duplicates
except
select Company, Publisher, GroupID
from #Temp
-- will be > 0 if any row inserted
select #i = ##rowcount
end
select
Company, Publisher,
dense_rank() over(order by min(GroupID)) as GroupID
from #Temp
group by Company, Publisher
=> sql fiddle example
Your problem is a graph-walking problem of finding connected subgraphs. It is a little more challenging because your data structure has two types of nodes ("companies" and "pubishers") rather than one type.
You can solve this with a single recursive CTE. The logic is as follows.
First, convert the problem into a graph with only one type of node. I do this by making the nodes companies and the edges linkes between companies, using the publisher information. This is just a join:
select t1.company as node1, t2.company as node2
from table1 t1 join
table1 t2
on t1.publisher = t2.publisher
)
(For efficiency sake, you could also add t1.company <> t2.company but that is not strictly necessary.)
Now, this is a "simple" graph walking problem, where a recursive CTE is used to create all connections between two nodes. The recursive CTE walks through the graph using join. Along the way, it keeps a list of all nodes visited. In SQL Server, this needs to be stored in a string.
The code needs to ensure that it doesn't visit a node twice for a given path, because this can result in infinite recursion (and an error). If the above is called edges, the CTE that generates all pairs of connected nodes looks like:
cte as (
select e.node1, e.node2, cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
1 as level
from edges e
union all
select c.node1, e.node2, c.nodes+e.node2+'|', 1+c.level
from cte c join
edges e
on c.node2 = e.node1 and
c.nodes not like '|%'+e.node2+'%|'
)
Now, with this list of connected nodes, assign each node the minimum of all the nodes it is connected to, including itself. This serves as an identifier of connected subgraphs. That is, all companies connected to each other via the publishers will have the same minimum.
The final two steps are to enumerate this minimum (as the GroupId) and to join the GroupId back to the original data.
The full (and I might add tested) query looks like:
with edges as (
select t1.company as node1, t2.company as node2
from table1 t1 join
table1 t2
on t1.publisher = t2.publisher
),
cte as (
select e.node1, e.node2,
cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
1 as level
from edges e
union all
select c.node1, e.node2,
c.nodes+e.node2+'|',
1+c.level
from cte c join
edges e
on c.node2 = e.node1 and
c.nodes not like '|%'+e.node2+'%|'
),
nodes as (
select node1,
(case when min(node2) < node1 then min(node2) else node1 end
) as grp
from cte
group by node1
)
select t.company, t.publisher, grp.GroupId
from table1 t join
(select n.node1, dense_rank() over (order by grp) as GroupId
from nodes n
) grp
on t.company = grp.node1;
Note that this works on finding any connected subgraphs. It does not assume that any particular number of levels.
EDIT:
The question of performance for this is vexing. At a minimum, the above query will run better with an index on Publisher. Better yet is to take #MikaelEriksson's suggestion, and put the edges in a separate table.
Another question is whether you look for equivalency classes among the Companies or the Publishers. I took the approach of using Companies, because I think that has better "explanability" (my inclination to respond was based on numerous comments that this could not be done with CTEs).
I am guessing that you could get reasonable performance from this, although that requires more knowledge of your data and system than provided in the OP. It is quite likely, though, that the best performance will come from a multiple query approach.
Here is my solution SQL Fiddle
The nature of the relationships require looping as I figure.
Here is the SQL:
--drop TABLE Table1
CREATE TABLE Table1
([row] int identity (1,1),GroupID INT NULL,[Company] varchar(2), [Publisher] varchar(2))
;
INSERT INTO Table1
(Company, Publisher)
select
left(newid(), 2), left(newid(), 2)
declare #i int = 1
while #i < 8
begin
;with cte(Company, Publisher) as (
select
left(newid(), 2), left(newid(), 2)
from Table1
)
insert into Table1(Company, Publisher)
select distinct c.Company, c.Publisher
from cte as c
where not exists (select * from Table1 as t where t.Company = c.Company and t.Publisher = c.Publisher)
set #i = #i + 1
end;
CREATE NONCLUSTERED INDEX IX_Temp1 on Table1 (Company)
CREATE NONCLUSTERED INDEX IX_Temp2 on Table1 (Publisher)
declare #counter int=0
declare #row int=0
declare #lastnullcount int=0
declare #currentnullcount int=0
WHILE EXISTS (
SELECT *
FROM Table1
where GroupID is null
)
BEGIN
SET #counter=#counter+1
SET #lastnullcount =0
SELECT TOP 1
#row=[row]
FROM Table1
where GroupID is null
order by [row] asc
SELECT #currentnullcount=count(*) from table1 where groupid is null
WHILE #lastnullcount <> #currentnullcount
BEGIN
SELECT #lastnullcount=count(*)
from table1
where groupid is null
UPDATE Table1
SET GroupID=#counter
WHERE [row]=#row
UPDATE t2
SET t2.GroupID=#counter
FROM Table1 t1
INNER JOIN Table1 t2 on t1.Company=t2.Company
WHERE t1.GroupID=#counter
AND t2.GroupID IS NULL
UPDATE t2
SET t2.GroupID=#counter
FROM Table1 t1
INNER JOIN Table1 t2 on t1.publisher=t2.publisher
WHERE t1.GroupID=#counter
AND t2.GroupID IS NULL
SELECT #currentnullcount=count(*)
from table1
where groupid is null
END
END
SELECT * FROM Table1
Edit:
Added indexes as I would expect on the real table and be more in line with the other data sets Roman is using.
You are trying to find all of the connected components of your graph, which can only be done iteratively. If you know the maximum width of any connected component (i.e. the maximum number of links you will have to take from one company/publisher to another), you could in principle do it something like this:
SELECT
MIN(x2.groupID) AS groupID,
x1.Company,
x1.Publisher
FROM Table1 AS x1
INNER JOIN (
SELECT
MIN(x2.Company) AS groupID,
x1.Company,
x1.Publisher
FROM Table1 AS x1
INNER JOIN Table1 AS x2
ON x1.Publisher = x2.Publisher
GROUP BY
x1.Publisher,
x1.Company
) AS x2
ON x1.Company = x2.Company
GROUP BY
x1.Publisher,
x1.Company;
You have to keep nesting the subquery (alternating joins on Company and Publisher, and with the deepest subquery saying MIN(Company) rather than MIN(groupID)) to the maximum iteration depth.
I don't really recommend this, though; it would be cleaner to do this outside of SQL.
Disclaimer: I don't know anything about SQL Server 2012 (or any other version); it may have some kind of additional scripting ability to let you do this iteration dynamically.
This is a recursive solution, using XML:
with a as ( -- recursive result, containing shorter subsets and duplicates
select cast('<c>' + company + '</c>' as xml) as companies
,cast('<p>' + publisher + '</p>' as xml) as publishers
from Table1
union all
select a.companies.query('for $c in distinct-values((for $i in /c return string($i),
sql:column("t.company")))
order by $c
return <c>{$c}</c>')
,a.publishers.query('for $p in distinct-values((for $i in /p return string($i),
sql:column("t.publisher")))
order by $p
return <p>{$p}</p>')
from a join Table1 t
on ( a.companies.exist('/c[text() = sql:column("t.company")]') = 0
or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 0)
and ( a.companies.exist('/c[text() = sql:column("t.company")]') = 1
or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 1)
), b as ( -- remove the shorter versions from earlier steps of the recursion and the duplicates
select distinct -- distinct cannot work on xml types, hence cast to nvarchar
cast(companies as nvarchar) as companies
,cast(publishers as nvarchar) as publishers
,DENSE_RANK() over(order by cast(companies as nvarchar), cast(publishers as nvarchar)) as groupid
from a
where not exists (select 1 from a as s -- s is a proper subset of a
where (cast('<s>' + cast(s.companies as varchar)
+ '</s><a>' + cast(a.companies as varchar) + '</a>' as xml)
).value('if((count(/s/c) > count(/a/c))
and (some $s in /s/c/text() satisfies
(some $a in /a/c/text() satisfies $s = $a))
) then 1 else 0', 'int') = 1
)
and not exists (select 1 from a as s -- s is a proper subset of a
where (cast('<s>' + cast(s.publishers as nvarchar)
+ '</s><a>' + cast(a.publishers as nvarchar) + '</a>' as xml)
).value('if((count(/s/p) > count(/a/p))
and (some $s in /s/p/text() satisfies
(some $a in /a/p/text() satisfies $s = $a))
) then 1 else 0', 'int') = 1
)
), c as ( -- cast back to xml
select cast(companies as xml) as companies
,cast(publishers as xml) as publishers
,groupid
from b
)
select Co.company.value('(./text())[1]', 'varchar') as company
,Pu.publisher.value('(./text())[1]', 'varchar') as publisher
,c.groupid
from c
cross apply companies.nodes('/c') as Co(company)
cross apply publishers.nodes('/p') as Pu(publisher)
where exists(select 1 from Table1 t -- restrict to only the combinations that exist in the source
where t.company = Co.company.value('(./text())[1]', 'varchar')
and t.publisher = Pu.publisher.value('(./text())[1]', 'varchar')
)
The set of companies and the set of publishers are kept in XML fields in the intermediate steps, and there is some casting between xml and nvarchar necessary due to some limitations of SQL Server (like not being able to group or use distinct on XML columns.
Bit late to the challenge, and since SQLFiddle seems to be down ATM I'll have to guess your data-structures. Nevertheless, it seemed like a fun challenge (and it was =) so here's what I made from it :
Setup:
IF OBJECT_ID('t_link') IS NOT NULL DROP TABLE t_link
IF OBJECT_ID('t_company') IS NOT NULL DROP TABLE t_company
IF OBJECT_ID('t_publisher') IS NOT NULL DROP TABLE t_publisher
IF OBJECT_ID('tempdb..#link_A') IS NOT NULL DROP TABLE #link_A
IF OBJECT_ID('tempdb..#link_B') IS NOT NULL DROP TABLE #link_B
GO
CREATE TABLE t_company ( company_id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
company_name varchar(100) NOT NULL)
GO
CREATE TABLE t_publisher (publisher_id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
publisher_name varchar(100) NOT NULL)
CREATE TABLE t_link (company_id int NOT NULL FOREIGN KEY (company_id) REFERENCES t_company (company_id),
publisher_id int NOT NULL FOREIGN KEY (publisher_id) REFERENCES t_publisher (publisher_id),
PRIMARY KEY (company_id, publisher_id),
group_id int NULL
)
GO
-- example content
-- ROW GROUPID Company Publisher
--1 1 A Y
--2 1 A X
--3 1 B Y
--4 1 B Z
--5 2 C W
--6 2 C P
--7 2 D W
INSERT t_company (company_name) VALUES ('A'), ('B'), ('C'), ('D')
INSERT t_publisher (publisher_name) VALUES ('X'), ('Y'), ('Z'), ('W'), ('P')
INSERT t_link (company_id, publisher_id)
SELECT company_id, publisher_id
FROM t_company, t_publisher
WHERE (company_name = 'A' AND publisher_name = 'Y')
OR (company_name = 'A' AND publisher_name = 'X')
OR (company_name = 'B' AND publisher_name = 'Y')
OR (company_name = 'B' AND publisher_name = 'Z')
OR (company_name = 'C' AND publisher_name = 'W')
OR (company_name = 'C' AND publisher_name = 'P')
OR (company_name = 'D' AND publisher_name = 'W')
GO
/*
-- volume testing
TRUNCATE TABLE t_link
DELETE t_company
DELETE t_publisher
DECLARE #company_count int = 1000,
#publisher_count int = 450,
#links_count int = 800
INSERT t_company (company_name)
SELECT company_name = Convert(varchar(100), NewID())
FROM master.dbo.fn_int_list(1, #company_count)
UPDATE STATISTICS t_company
INSERT t_publisher (publisher_name)
SELECT publisher_name = Convert(varchar(100), NewID())
FROM master.dbo.fn_int_list(1, #publisher_count)
UPDATE STATISTICS t_publisher
-- Random links between the companies & publishers
DECLARE #count int
SELECT #count = 0
WHILE #count < #links_count
BEGIN
SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), company_id = company_id + 0
INTO #link_A
FROM t_company
ORDER BY NewID()
SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), publisher_id = publisher_id + 0
INTO #link_B
FROM t_publisher
ORDER BY NewID()
INSERT TOP (#links_count - #count) t_link (company_id, publisher_id)
SELECT A.company_id,
B.publisher_id
FROM #link_A A
JOIN #link_B B
ON A.row_id = B.row_id
WHERE NOT EXISTS ( SELECT *
FROM t_link old
WHERE old.company_id = A.company_id
AND old.publisher_id = B.publisher_id)
SELECT #count = #count + ##ROWCOUNT
DROP TABLE #link_A
DROP TABLE #link_B
END
*/
Actual grouping:
IF OBJECT_ID('tempdb..#links') IS NOT NULL DROP TABLE #links
GO
-- apply grouping
-- init
SELECT row_id = IDENTITY(int, 1, 1),
company_id,
publisher_id,
group_id = 0
INTO #links
FROM t_link
-- don't see an index that would be actually helpful here right-away, using row_id to avoid HEAP
CREATE CLUSTERED INDEX idx0 ON #links (row_id)
--CREATE INDEX idx1 ON #links (company_id)
--CREATE INDEX idx2 ON #links (publisher_id)
UPDATE #links
SET group_id = row_id
-- start grouping
WHILE ##ROWCOUNT > 0
BEGIN
UPDATE #links
SET group_id = new_group_id
FROM #links upd
CROSS APPLY (SELECT new_group_id = Min(group_id)
FROM #links new
WHERE new.company_id = upd.company_id
OR new.publisher_id = upd.publisher_id
) x
WHERE upd.group_id > new_group_id
-- select * from #links
END
-- remove 'holes'
UPDATE #links
SET group_id = (SELECT COUNT(DISTINCT o.group_id)
FROM #links o
WHERE o.group_id <= upd.group_id)
FROM #links upd
GO
UPDATE t_link
SET group_id = new.group_id
FROM t_link upd
LEFT OUTER JOIN #links new
ON new.company_id = upd.company_id
AND new.publisher_id = upd.publisher_id
GO
SELECT row = ROW_NUMBER() OVER (ORDER BY group_id, company_name, publisher_name),
l.group_id,
c.company_name, -- c.company_id,
p.publisher_name -- , p.publisher_id
from t_link l
JOIN t_company c
ON l.company_id = c.company_id
JOIN t_publisher p
ON p.publisher_id = l.publisher_id
ORDER BY 1
At first sight this approach hasn't been tried yet by anyone else, interesting to see how this can be done in a variety of ways... (preferred not to read them upfront as it would spoil the puzzle =)
Results look as expected (as far as I understand the requirements and the example) and performance isn't too shabby either although there is no real indication on the amount of records this should work on; not sure how it would scale but don't expect too many problems either...