Related
Having some issues with a larger MSSQL DB I am administrating.
Have some functionallity I am trying to implement in a single table.
Table looks contains such columns:
ID MailID
010-123456 12345678
010-123456/1 NULL
010-123456/2 NULL
010-123456/3 NULL
Now, what I would like to accomplish is to set each Childs MaildID to the same as its Parents MailID.
A ParentID being an ID with same value as the Child (Id ID row) before the "/" delimiter.
Also The IDs should not contins any chars, just digits (execpt for '-' and '/').
So currently I have this solution:
Declare #MyTable table(Child varchar(MAX),Parent varchar(MAX),ParentMaildId varchar(MAX))
insert into #MyTable
select ID as Child, left(ID, charindex('/', ID)-1) AS Parent, MailID as MailID
FROM [thisismy].[dbo].[table]
where ID not like '%[a-z]%' and ID like '%/%' and MailId is not NULL
select * from #MyTable
--update t1
--SET t1.MailId = t2.ParentMaildId
--FROM [thisismy].[dbo].[table] AS t1
--INNER JOIN #IDAS t2 ON t1.ID = t2.Child
--WHERE t1.ID = t2.Child and t1.MailId is NULL
When I use the select statement to get MyTable fileld up here, it only returns a low subset of all the cases that is in my DB table, why?
I would like to have stored procedure that automatically updates the Mailid of the Child to the Parent (all recuired info being in the very same table).
Is there a better way to do this?
I think you want:
select m.*, p.mailid as imputed_parent
from #mytable m cross apply
(select top (1) p.*
from #mytable p
where m.id like p.id + '/%'
order by p.id desc
) p;
You can put this in an update using a correlated subquery:
update m
set mailid = (select top (1) p.mailid
from #mytable p
where m.id like p.id + '/%'
order by p.id desc
)
from #mytable m
where m.mailid is null;
Here is a db<>fiddle.
I'm trying to join three tables to pull back a list of distinct blog posts with associated assets (images etc) but I keep coming up a cropper. The three tablets are tblBlog, tblAssetLink and tblAssets. The Blog tablet hold the blog, the asset table holds the assets and the Assetlink table links the two together.
tblBlog.BID is the PK in blog, tblAssets.AID is the PK in Assets.
This query works but pulls back multiple posts for the same record. I've tried to use select distinct and group by and even union but as my knowledge is pretty poor with SQL - they all error.
I'd like to also discount any assets that are marked as deleted (tblAssets.Deleted = true) but not hide the associated Blog post (if that's not marked as deleted). If anyone can help - it would be much appreciated! Thanks.
Here's my query so far....
SELECT dbo.tblBlog.BID,
dbo.tblBlog.DateAdded,
dbo.tblBlog.PMonthName,
dbo.tblBlog.PDay,
dbo.tblBlog.Header,
dbo.tblBlog.AddedBy,
dbo.tblBlog.PContent,
dbo.tblBlog.Category,
dbo.tblBlog.Deleted,
dbo.tblBlog.Intro,
dbo.tblBlog.Tags,
dbo.tblAssets.Name,
dbo.tblAssets.Description,
dbo.tblAssets.Location,
dbo.tblAssets.Deleted AS Expr1,
dbo.tblAssetLink.Priority
FROM dbo.tblBlog
LEFT OUTER JOIN dbo.tblAssetLink
ON dbo.tblBlog.BID = dbo.tblAssetLink.BID
LEFT OUTER JOIN dbo.tblAssets
ON dbo.tblAssetLink.AID = dbo.tblAssets.AID
WHERE ( dbo.tblBlog.Deleted = 'False' )
ORDER BY dbo.tblAssetLink.Priority, tblBlog.DateAdded DESC
EDIT
Changed the Where and the order by....
Expected output:
tblBlog.BID = 123
tblBlog.DateAdded = 12/04/2015
tblBlog.Header = This is a header
tblBlog.AddedBy = Persons name
tblBlog.PContent = *text*
tblBlog.Category = Category name
tblBlog.Deleted = False
tblBlog.Intro = *text*
tblBlog.Tags = Tag, Tag, Tag
tblAssets.Name = some.jpg
tblAssets.Description = Asset desc
tblAssets.Location = Location name
tblAssets.Priority = True
Use OUTER APPLY:
DECLARE #b TABLE ( BID INT )
DECLARE #a TABLE ( AID INT )
DECLARE #ba TABLE
(
BID INT ,
AID INT ,
Priority INT
)
INSERT INTO #b
VALUES ( 1 ),
( 2 )
INSERT INTO #a
VALUES ( 1 ),
( 2 ),
( 3 ),
( 4 )
INSERT INTO #ba
VALUES ( 1, 1, 1 ),
( 1, 2, 2 ),
( 2, 1, 1 ),
( 2, 2, 2 )
SELECT *
FROM #b b
OUTER APPLY ( SELECT TOP 1
a.*
FROM #ba ba
JOIN #a a ON a.AID = ba.AID
WHERE ba.BID = b.BID
ORDER BY Priority
) o
Output:
BID AID
1 1
2 1
Something like:
SELECT b.BID ,
b.DateAdded ,
b.PMonthName ,
b.PDay ,
b.Header ,
b.AddedBy ,
b.PContent ,
b.Category ,
b.Deleted ,
b.Intro ,
b.Tags ,
o.Name ,
o.Description ,
o.Location ,
o.Deleted AS Expr1 ,
o.Priority
FROM dbo.tblBlog b
OUTER APPLY ( SELECT TOP 1
a.* ,
al.Priority
FROM dbo.tblAssetLink al
JOIN dbo.tblAssets a ON al.AID = a.AID
WHERE b.BID = al.BID
ORDER BY al.Priority
) o
WHERE b.Deleted = 'False'
You cannot join three tables unless they all have the same attribute. It would work if all tables had BID, but the second join is trying to join AID. Which wont work. They all have to have BID.
Based on your comments
i would like to get is just one asset per blog post (top one ordered
by Priority)
You can change your query as following. I suggest changing the join with dbo.tblAssetLink to filtered one, which contains only one (highest priority) link for every blog.
SELECT dbo.tblBlog.BID,
dbo.tblBlog.DateAdded,
dbo.tblBlog.PMonthName,
dbo.tblBlog.PDay,
dbo.tblBlog.Header,
dbo.tblBlog.AddedBy,
dbo.tblBlog.PContent,
dbo.tblBlog.Category,
dbo.tblBlog.Deleted,
dbo.tblBlog.Intro,
dbo.tblBlog.Tags,
dbo.tblAssets.Name,
dbo.tblAssets.Description,
dbo.tblAssets.Location,
dbo.tblAssets.Deleted AS Expr1,
dbo.tblAssetLink.Priority
FROM dbo.tblBlog
LEFT OUTER JOIN
(SELECT BID, AID,
ROW_NUMBER() OVER (PARTITION BY BID ORDER BY [Priority] DESC) as N
FROM dbo.tblAssetLink) AS filteredAssetLink
ON dbo.tblBlog.BID = filteredAssetLink.BID
LEFT OUTER JOIN dbo.tblAssets
ON filteredAssetLink.AID = dbo.tblAssets.AID
WHERE dbo.tblBlog.Deleted = 'False' AND filteredAssetLink.N = 1
ORDER BY tblBlog.DateAdded DESC
Hopefully I'm missing a simple solution to this.
I have two tables. One contains a list of companies. The second contains a list of publishers. The mapping between the two is many to many. What I would like to do is bundle or group all of the companies in table A which have any relationship to a publisher in table B and vise versa.
The final result would look something like this (GROUPID is the key field). Row 1 and 2 are in the same group because they share the same company. Row 3 is in the same group because the publisher Y was already mapped over to company A. Row 4 is in the group because Company B was already mapped to group 1 through Publisher Y.
Said simply, any time there is any kind of shared relationship across Company and Publisher, that pair should be assigned to the same group.
ROW GROUPID Company Publisher
1 1 A Y
2 1 A X
3 1 B Y
4 1 B Z
5 2 C W
6 2 C P
7 2 D W
Fiddle
Update:
My bounty version: Given the table in the fiddle above of simply Company and Publisher pairs, populate the GROUPID field above. Think of it as creating a Family ID that encompasses all related parents/children.
SQL Server 2012
I thought about using recursive CTE, but, as far as I know, it's not possible in SQL Server to use UNION to connect anchor member and a recursive member of recursive CTE (I think it's possible to do in PostgreSQL), so it's not possible to eliminate duplicates.
declare #i int
with cte as (
select
GroupID,
row_number() over(order by Company) as rn
from Table1
)
update cte set GroupID = rn
select #i = ##rowcount
-- while some rows updated
while #i > 0
begin
update T1 set
GroupID = T2.GroupID
from Table1 as T1
inner join (
select T2.Company, min(T2.GroupID) as GroupID
from Table1 as T2
group by T2.Company
) as T2 on T2.Company = T1.Company
where T1.GroupID > T2.GroupID
select #i = ##rowcount
update T1 set
GroupID = T2.GroupID
from Table1 as T1
inner join (
select T2.Publisher, min(T2.GroupID) as GroupID
from Table1 as T2
group by T2.Publisher
) as T2 on T2.Publisher = T1.Publisher
where T1.GroupID > T2.GroupID
-- will be > 0 if any rows updated
select #i = #i + ##rowcount
end
;with cte as (
select
GroupID,
dense_rank() over(order by GroupID) as rn
from Table1
)
update cte set GroupID = rn
sql fiddle demo
I've also tried a breadth first search algorithm. I thought it could be faster (it's better in terms of complexity), so I'll provide a solution here. I've found that it's not faster than SQL approach, though:
declare #Company nvarchar(2), #Publisher nvarchar(2), #GroupID int
declare #Queue table (
Company nvarchar(2), Publisher nvarchar(2), ID int identity(1, 1),
primary key(Company, Publisher)
)
select #GroupID = 0
while 1 = 1
begin
select top 1 #Company = Company, #Publisher = Publisher
from Table1
where GroupID is null
if ##rowcount = 0 break
select #GroupID = #GroupID + 1
insert into #Queue(Company, Publisher)
select #Company, #Publisher
while 1 = 1
begin
select top 1 #Company = Company, #Publisher = Publisher
from #Queue
order by ID asc
if ##rowcount = 0 break
update Table1 set
GroupID = #GroupID
where Company = #Company and Publisher = #Publisher
delete from #Queue where Company = #Company and Publisher = #Publisher
;with cte as (
select Company, Publisher from Table1 where Company = #Company and GroupID is null
union all
select Company, Publisher from Table1 where Publisher = #Publisher and GroupID is null
)
insert into #Queue(Company, Publisher)
select distinct c.Company, c.Publisher
from cte as c
where not exists (select * from #Queue as q where q.Company = c.Company and q.Publisher = c.Publisher)
end
end
sql fiddle demo
I've tested my version and Gordon Linoff's to check how it's perform. It looks like CTE is much worse, I couldn't wait while it's complete on more than 1000 rows.
Here's sql fiddle demo with random data. My results were:
128 rows:
my RBAR solution: 190ms
my SQL solution: 27ms
Gordon Linoff's solution: 958ms
256 rows:
my RBAR solution: 560ms
my SQL solution: 1226ms
Gordon Linoff's solution: 45371ms
It's random data, so results may be not very consistent. I think timing could be changed by indexes, but don't think it could change a whole picture.
old version - using temporary table, just calculating GroupID without touching initial table:
declare #i int
-- creating table to gather all possible GroupID for each row
create table #Temp
(
Company varchar(1), Publisher varchar(1), GroupID varchar(1),
primary key (Company, Publisher, GroupID)
)
-- initializing it with data
insert into #Temp (Company, Publisher, GroupID)
select Company, Publisher, Company
from Table1
select #i = ##rowcount
-- while some rows inserted into #Temp
while #i > 0
begin
-- expand #Temp in both directions
;with cte as (
select
T2.Company, T1.Publisher,
T1.GroupID as GroupID1, T2.GroupID as GroupID2
from #Temp as T1
inner join #Temp as T2 on T2.Company = T1.Company
union
select
T1.Company, T2.Publisher,
T1.GroupID as GroupID1, T2.GroupID as GroupID2
from #Temp as T1
inner join #Temp as T2 on T2.Publisher = T1.Publisher
), cte2 as (
select
Company, Publisher,
case when GroupID1 < GroupID2 then GroupID1 else GroupID2 end as GroupID
from cte
)
insert into #Temp
select Company, Publisher, GroupID
from cte2
-- don't insert duplicates
except
select Company, Publisher, GroupID
from #Temp
-- will be > 0 if any row inserted
select #i = ##rowcount
end
select
Company, Publisher,
dense_rank() over(order by min(GroupID)) as GroupID
from #Temp
group by Company, Publisher
=> sql fiddle example
Your problem is a graph-walking problem of finding connected subgraphs. It is a little more challenging because your data structure has two types of nodes ("companies" and "pubishers") rather than one type.
You can solve this with a single recursive CTE. The logic is as follows.
First, convert the problem into a graph with only one type of node. I do this by making the nodes companies and the edges linkes between companies, using the publisher information. This is just a join:
select t1.company as node1, t2.company as node2
from table1 t1 join
table1 t2
on t1.publisher = t2.publisher
)
(For efficiency sake, you could also add t1.company <> t2.company but that is not strictly necessary.)
Now, this is a "simple" graph walking problem, where a recursive CTE is used to create all connections between two nodes. The recursive CTE walks through the graph using join. Along the way, it keeps a list of all nodes visited. In SQL Server, this needs to be stored in a string.
The code needs to ensure that it doesn't visit a node twice for a given path, because this can result in infinite recursion (and an error). If the above is called edges, the CTE that generates all pairs of connected nodes looks like:
cte as (
select e.node1, e.node2, cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
1 as level
from edges e
union all
select c.node1, e.node2, c.nodes+e.node2+'|', 1+c.level
from cte c join
edges e
on c.node2 = e.node1 and
c.nodes not like '|%'+e.node2+'%|'
)
Now, with this list of connected nodes, assign each node the minimum of all the nodes it is connected to, including itself. This serves as an identifier of connected subgraphs. That is, all companies connected to each other via the publishers will have the same minimum.
The final two steps are to enumerate this minimum (as the GroupId) and to join the GroupId back to the original data.
The full (and I might add tested) query looks like:
with edges as (
select t1.company as node1, t2.company as node2
from table1 t1 join
table1 t2
on t1.publisher = t2.publisher
),
cte as (
select e.node1, e.node2,
cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
1 as level
from edges e
union all
select c.node1, e.node2,
c.nodes+e.node2+'|',
1+c.level
from cte c join
edges e
on c.node2 = e.node1 and
c.nodes not like '|%'+e.node2+'%|'
),
nodes as (
select node1,
(case when min(node2) < node1 then min(node2) else node1 end
) as grp
from cte
group by node1
)
select t.company, t.publisher, grp.GroupId
from table1 t join
(select n.node1, dense_rank() over (order by grp) as GroupId
from nodes n
) grp
on t.company = grp.node1;
Note that this works on finding any connected subgraphs. It does not assume that any particular number of levels.
EDIT:
The question of performance for this is vexing. At a minimum, the above query will run better with an index on Publisher. Better yet is to take #MikaelEriksson's suggestion, and put the edges in a separate table.
Another question is whether you look for equivalency classes among the Companies or the Publishers. I took the approach of using Companies, because I think that has better "explanability" (my inclination to respond was based on numerous comments that this could not be done with CTEs).
I am guessing that you could get reasonable performance from this, although that requires more knowledge of your data and system than provided in the OP. It is quite likely, though, that the best performance will come from a multiple query approach.
Here is my solution SQL Fiddle
The nature of the relationships require looping as I figure.
Here is the SQL:
--drop TABLE Table1
CREATE TABLE Table1
([row] int identity (1,1),GroupID INT NULL,[Company] varchar(2), [Publisher] varchar(2))
;
INSERT INTO Table1
(Company, Publisher)
select
left(newid(), 2), left(newid(), 2)
declare #i int = 1
while #i < 8
begin
;with cte(Company, Publisher) as (
select
left(newid(), 2), left(newid(), 2)
from Table1
)
insert into Table1(Company, Publisher)
select distinct c.Company, c.Publisher
from cte as c
where not exists (select * from Table1 as t where t.Company = c.Company and t.Publisher = c.Publisher)
set #i = #i + 1
end;
CREATE NONCLUSTERED INDEX IX_Temp1 on Table1 (Company)
CREATE NONCLUSTERED INDEX IX_Temp2 on Table1 (Publisher)
declare #counter int=0
declare #row int=0
declare #lastnullcount int=0
declare #currentnullcount int=0
WHILE EXISTS (
SELECT *
FROM Table1
where GroupID is null
)
BEGIN
SET #counter=#counter+1
SET #lastnullcount =0
SELECT TOP 1
#row=[row]
FROM Table1
where GroupID is null
order by [row] asc
SELECT #currentnullcount=count(*) from table1 where groupid is null
WHILE #lastnullcount <> #currentnullcount
BEGIN
SELECT #lastnullcount=count(*)
from table1
where groupid is null
UPDATE Table1
SET GroupID=#counter
WHERE [row]=#row
UPDATE t2
SET t2.GroupID=#counter
FROM Table1 t1
INNER JOIN Table1 t2 on t1.Company=t2.Company
WHERE t1.GroupID=#counter
AND t2.GroupID IS NULL
UPDATE t2
SET t2.GroupID=#counter
FROM Table1 t1
INNER JOIN Table1 t2 on t1.publisher=t2.publisher
WHERE t1.GroupID=#counter
AND t2.GroupID IS NULL
SELECT #currentnullcount=count(*)
from table1
where groupid is null
END
END
SELECT * FROM Table1
Edit:
Added indexes as I would expect on the real table and be more in line with the other data sets Roman is using.
You are trying to find all of the connected components of your graph, which can only be done iteratively. If you know the maximum width of any connected component (i.e. the maximum number of links you will have to take from one company/publisher to another), you could in principle do it something like this:
SELECT
MIN(x2.groupID) AS groupID,
x1.Company,
x1.Publisher
FROM Table1 AS x1
INNER JOIN (
SELECT
MIN(x2.Company) AS groupID,
x1.Company,
x1.Publisher
FROM Table1 AS x1
INNER JOIN Table1 AS x2
ON x1.Publisher = x2.Publisher
GROUP BY
x1.Publisher,
x1.Company
) AS x2
ON x1.Company = x2.Company
GROUP BY
x1.Publisher,
x1.Company;
You have to keep nesting the subquery (alternating joins on Company and Publisher, and with the deepest subquery saying MIN(Company) rather than MIN(groupID)) to the maximum iteration depth.
I don't really recommend this, though; it would be cleaner to do this outside of SQL.
Disclaimer: I don't know anything about SQL Server 2012 (or any other version); it may have some kind of additional scripting ability to let you do this iteration dynamically.
This is a recursive solution, using XML:
with a as ( -- recursive result, containing shorter subsets and duplicates
select cast('<c>' + company + '</c>' as xml) as companies
,cast('<p>' + publisher + '</p>' as xml) as publishers
from Table1
union all
select a.companies.query('for $c in distinct-values((for $i in /c return string($i),
sql:column("t.company")))
order by $c
return <c>{$c}</c>')
,a.publishers.query('for $p in distinct-values((for $i in /p return string($i),
sql:column("t.publisher")))
order by $p
return <p>{$p}</p>')
from a join Table1 t
on ( a.companies.exist('/c[text() = sql:column("t.company")]') = 0
or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 0)
and ( a.companies.exist('/c[text() = sql:column("t.company")]') = 1
or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 1)
), b as ( -- remove the shorter versions from earlier steps of the recursion and the duplicates
select distinct -- distinct cannot work on xml types, hence cast to nvarchar
cast(companies as nvarchar) as companies
,cast(publishers as nvarchar) as publishers
,DENSE_RANK() over(order by cast(companies as nvarchar), cast(publishers as nvarchar)) as groupid
from a
where not exists (select 1 from a as s -- s is a proper subset of a
where (cast('<s>' + cast(s.companies as varchar)
+ '</s><a>' + cast(a.companies as varchar) + '</a>' as xml)
).value('if((count(/s/c) > count(/a/c))
and (some $s in /s/c/text() satisfies
(some $a in /a/c/text() satisfies $s = $a))
) then 1 else 0', 'int') = 1
)
and not exists (select 1 from a as s -- s is a proper subset of a
where (cast('<s>' + cast(s.publishers as nvarchar)
+ '</s><a>' + cast(a.publishers as nvarchar) + '</a>' as xml)
).value('if((count(/s/p) > count(/a/p))
and (some $s in /s/p/text() satisfies
(some $a in /a/p/text() satisfies $s = $a))
) then 1 else 0', 'int') = 1
)
), c as ( -- cast back to xml
select cast(companies as xml) as companies
,cast(publishers as xml) as publishers
,groupid
from b
)
select Co.company.value('(./text())[1]', 'varchar') as company
,Pu.publisher.value('(./text())[1]', 'varchar') as publisher
,c.groupid
from c
cross apply companies.nodes('/c') as Co(company)
cross apply publishers.nodes('/p') as Pu(publisher)
where exists(select 1 from Table1 t -- restrict to only the combinations that exist in the source
where t.company = Co.company.value('(./text())[1]', 'varchar')
and t.publisher = Pu.publisher.value('(./text())[1]', 'varchar')
)
The set of companies and the set of publishers are kept in XML fields in the intermediate steps, and there is some casting between xml and nvarchar necessary due to some limitations of SQL Server (like not being able to group or use distinct on XML columns.
Bit late to the challenge, and since SQLFiddle seems to be down ATM I'll have to guess your data-structures. Nevertheless, it seemed like a fun challenge (and it was =) so here's what I made from it :
Setup:
IF OBJECT_ID('t_link') IS NOT NULL DROP TABLE t_link
IF OBJECT_ID('t_company') IS NOT NULL DROP TABLE t_company
IF OBJECT_ID('t_publisher') IS NOT NULL DROP TABLE t_publisher
IF OBJECT_ID('tempdb..#link_A') IS NOT NULL DROP TABLE #link_A
IF OBJECT_ID('tempdb..#link_B') IS NOT NULL DROP TABLE #link_B
GO
CREATE TABLE t_company ( company_id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
company_name varchar(100) NOT NULL)
GO
CREATE TABLE t_publisher (publisher_id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
publisher_name varchar(100) NOT NULL)
CREATE TABLE t_link (company_id int NOT NULL FOREIGN KEY (company_id) REFERENCES t_company (company_id),
publisher_id int NOT NULL FOREIGN KEY (publisher_id) REFERENCES t_publisher (publisher_id),
PRIMARY KEY (company_id, publisher_id),
group_id int NULL
)
GO
-- example content
-- ROW GROUPID Company Publisher
--1 1 A Y
--2 1 A X
--3 1 B Y
--4 1 B Z
--5 2 C W
--6 2 C P
--7 2 D W
INSERT t_company (company_name) VALUES ('A'), ('B'), ('C'), ('D')
INSERT t_publisher (publisher_name) VALUES ('X'), ('Y'), ('Z'), ('W'), ('P')
INSERT t_link (company_id, publisher_id)
SELECT company_id, publisher_id
FROM t_company, t_publisher
WHERE (company_name = 'A' AND publisher_name = 'Y')
OR (company_name = 'A' AND publisher_name = 'X')
OR (company_name = 'B' AND publisher_name = 'Y')
OR (company_name = 'B' AND publisher_name = 'Z')
OR (company_name = 'C' AND publisher_name = 'W')
OR (company_name = 'C' AND publisher_name = 'P')
OR (company_name = 'D' AND publisher_name = 'W')
GO
/*
-- volume testing
TRUNCATE TABLE t_link
DELETE t_company
DELETE t_publisher
DECLARE #company_count int = 1000,
#publisher_count int = 450,
#links_count int = 800
INSERT t_company (company_name)
SELECT company_name = Convert(varchar(100), NewID())
FROM master.dbo.fn_int_list(1, #company_count)
UPDATE STATISTICS t_company
INSERT t_publisher (publisher_name)
SELECT publisher_name = Convert(varchar(100), NewID())
FROM master.dbo.fn_int_list(1, #publisher_count)
UPDATE STATISTICS t_publisher
-- Random links between the companies & publishers
DECLARE #count int
SELECT #count = 0
WHILE #count < #links_count
BEGIN
SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), company_id = company_id + 0
INTO #link_A
FROM t_company
ORDER BY NewID()
SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), publisher_id = publisher_id + 0
INTO #link_B
FROM t_publisher
ORDER BY NewID()
INSERT TOP (#links_count - #count) t_link (company_id, publisher_id)
SELECT A.company_id,
B.publisher_id
FROM #link_A A
JOIN #link_B B
ON A.row_id = B.row_id
WHERE NOT EXISTS ( SELECT *
FROM t_link old
WHERE old.company_id = A.company_id
AND old.publisher_id = B.publisher_id)
SELECT #count = #count + ##ROWCOUNT
DROP TABLE #link_A
DROP TABLE #link_B
END
*/
Actual grouping:
IF OBJECT_ID('tempdb..#links') IS NOT NULL DROP TABLE #links
GO
-- apply grouping
-- init
SELECT row_id = IDENTITY(int, 1, 1),
company_id,
publisher_id,
group_id = 0
INTO #links
FROM t_link
-- don't see an index that would be actually helpful here right-away, using row_id to avoid HEAP
CREATE CLUSTERED INDEX idx0 ON #links (row_id)
--CREATE INDEX idx1 ON #links (company_id)
--CREATE INDEX idx2 ON #links (publisher_id)
UPDATE #links
SET group_id = row_id
-- start grouping
WHILE ##ROWCOUNT > 0
BEGIN
UPDATE #links
SET group_id = new_group_id
FROM #links upd
CROSS APPLY (SELECT new_group_id = Min(group_id)
FROM #links new
WHERE new.company_id = upd.company_id
OR new.publisher_id = upd.publisher_id
) x
WHERE upd.group_id > new_group_id
-- select * from #links
END
-- remove 'holes'
UPDATE #links
SET group_id = (SELECT COUNT(DISTINCT o.group_id)
FROM #links o
WHERE o.group_id <= upd.group_id)
FROM #links upd
GO
UPDATE t_link
SET group_id = new.group_id
FROM t_link upd
LEFT OUTER JOIN #links new
ON new.company_id = upd.company_id
AND new.publisher_id = upd.publisher_id
GO
SELECT row = ROW_NUMBER() OVER (ORDER BY group_id, company_name, publisher_name),
l.group_id,
c.company_name, -- c.company_id,
p.publisher_name -- , p.publisher_id
from t_link l
JOIN t_company c
ON l.company_id = c.company_id
JOIN t_publisher p
ON p.publisher_id = l.publisher_id
ORDER BY 1
At first sight this approach hasn't been tried yet by anyone else, interesting to see how this can be done in a variety of ways... (preferred not to read them upfront as it would spoil the puzzle =)
Results look as expected (as far as I understand the requirements and the example) and performance isn't too shabby either although there is no real indication on the amount of records this should work on; not sure how it would scale but don't expect too many problems either...
I am trying to solve the following problem entirely in SQL (ANSI or TSQL, in Sybase ASE 12), without relying on cursors or loop-based row-by-row processing.
NOTE: I already created a solution that accomplishes the same goal in application layer (therefore please refrain from "answering" with "don't do this in SQL"), but as a matter of principle (and hopefully improved performance) I would like to know if there is an efficient (e.g. no cursors) pure SQL solution.
Setup:
I have a table T with the following 3 columns (all NOT NULL):
---- Table T -----------------------------
| item | tag | value |
| [int] | [varchar(10)] | [varchar(255)] |
The table has unique index on item, tag
Every tag has a form of a string "TAG##" where "##" is a number 1-99
Existing tags are not guaranteed to be contiguous, e.g. item 13 may have tags "TAG1", "TAG3", "TAG10".
TASK: I need to insert a bunch of new rows into the table from another table T_NEW, which only have items and values, and assign new tag to them so they don't violate unique index on item, tag.
Uniqueness of values is irrelevant (assume that item+value is always unique already).
---- Table T_NEW --------------------------
| item | tag | value |
| [int] | STARTS AS NULL | [varchar(255)] |
QUESTION: How can I assign new tags to all rows in table T_NEW, such that:
All item+tag combinations in a union of T and T_NEW are unique
Newly assigned tags should all be in the form "TAG##"
Newly assigned tags should ideally be the smallest available for a given item.
If it helps, you can assume that I already have a temp table #tags, with a "tag" column that contains 99 rows containing all the valid tags (TAG1..TAG99, one per row)
I started a fiddle that will get you the list of available "open" tags by item. It does this using the #tags (AllTags) and doing an outer-join-where-null. You could use that to insert new tags from T_New...
with T_openTags as (
select
items.item,
openTagName = a.tag
from
(select distinct item from T) items
cross join AllTags a
left outer join T on
items.item = T.item
and T.tag = a.tag
where
T.item is null
)
select * from T_openTags
or see this updated fiddle to do an update on T_New table. Essentially adds a row_number so we can pick the correct open tag to use in a single update statement. I padded the Tag names with a leading zero to simplify the sorting.
with T_openTags as (
select
items.item,
openTagName = a.tag,
rn = row_number() over(partition by items.item order by a.tag)
from
(select distinct item from T) items
cross join AllTags a
left outer join T on
items.item = T.item
and T.tag = a.tag
where
T.item is null
), T_New_numbered as (
select *,
rn = row_number() over(partition by item order by value)
from T_New
)
update tnn set tag = openTagName
from T_New_numbered tnn
inner join T_openTags tot on
tot.item = tnn.item
and tot.rn = tnn.rn
select * from T_New
updated fiddle with poor mans row_number replacement that only works with distinct T_New values
Try this:
DECLARE #T TABLE (ITEM INT, TAG VARCHAR(10), VALUE VARCHAR(255))
INSERT INTO #T VALUES
(1,'TAG1', '100'),
(2,'TAG2', '200')
DECLARE #T_NEW TABLE (ITEM INT, TAG VARCHAR(10), VALUE VARCHAR(255))
INSERT INTO #T_NEW VALUES
(3,NULL, '500'),
(4,NULL, '600')
INSERT INTO #T
SELECT
ITEM,
('TAG' + CONVERT(VARCHAR(20),ITEM)) AS TAG,
VALUE
FROM
#T_NEW
SELECT * FROM #T
OK, here's a correct solution, tested to work on Sybase (H/T: big thanks to #ypercube for providing a solid basis for it)
declare #c int
select #c = 1
WHILE (#c > 0)
BEGIN
UPDATE
t_new
SET
tag =
( SELECT min(tags.tag)
FROM #tags tags
LEFT JOIN t o
ON tags.tag = o.tag
AND o.item = t_new.item
LEFT JOIN t_new n3
ON tags.tag = n3.tag
AND n3.item = t_new.item
WHERE o.tag IS NULL
AND n3.tag IS NULL
)
WHERE tag IS NULL
-- and here's the main magic for only updating one item at a time
AND NOT EXISTS (SELECT 1 FROM t_new n2 WHERE t_new.value > n2.value
and n2.tag IS NULL and n2.item=t_new.item)
SELECT #c = ##rowcount
END
Inserting directly to t:
INSERT INTO t
(item, tag, value)
SELECT
item,
( SELECT MIN(tags.tag)
FROM #tags AS tags
LEFT JOIN t AS o
ON tags.tag = o.tag
AND o.item_id = n.item_id
WHERE o.tag IS NULL
) AS tag,
value
FROM
t_new AS n ;
Updating t_new:
UPDATE
t_new AS n
SET
tag =
( SELECT MIN(tags.tag)
FROM #tags AS tags
LEFT JOIN t AS o
ON tags.tag = o.tag
AND o.item_id = n.item_id
WHERE o.tag IS NULL
) ;
Correction
UPDATE
n
SET
n.tag = w.tag
FROM
( SELECT item_id,
tag,
ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY value) AS rn
FROM t_new
) AS n
JOIN
( SELECT di.item_id,
tags.tag,
ROW_NUMBER() OVER (PARTITION BY di.item_id ORDER BY tags.tag) AS rn
FROM
( SELECT DISTINCT item_id
FROM t_new
) AS di
CROSS JOIN
#tags AS tags
LEFT JOIN
t AS o
ON tags.tag = o.tag
AND o.item_id = di.item_id
WHERE o.tag IS NULL
) AS w
ON w.item_id = n.item_id
AND w.rn = n.rn ;
I have three tables: videos, videos_categories, and categories.
The tables look like this:
videos: video_id, title, etc...
videos_categories: video_id, category_id
categories: category_id, name, etc...
In my app, I allow a user to multiselect categories. When they do so, I need to return all videos that are in every selected category.
I ended up with this:
SELECT * FROM videos WHERE video_id IN (
SELECT c1.video_id FROM videos_categories AS c1
JOIN c2.videos_categories AS c2
ON c1.video_id = c2.video_id
WHERE c1.category_id = 1 AND c2.category_id = 2
)
But for every category I add to the multiselect, I have to add a join to my inner select:
SELECT * FROM videos WHERE video_id IN (
SELECT c1.video_id FROM videos_categories AS c1
JOIN videos_categories AS c2
ON c1.video_id = c2.video_id
JOIN videos_categories AS c3
ON c2.video_id = c3.video_id
WHERE c1.category_id = 1 AND c2.category_id = 2 AND c3.category_id = 3
)
I can't help but feel this is the really wrong way to do this, but I'm blocked trying to see the proper way to go about it.
if this is a primary key:
videos_categories: video_id, category_id
then a GROUP BY and HAVING should work, try this:
SELECT
*
FROM videos
WHERE video_id IN (SELECT
video_id
FROM videos_categories
WHERE category_id IN (1,2,3)
GROUP BY video_id
HAVING COUNT(video_id)=3
)
Sounds similar to SQL searching for rows that contain multiple criteria
To avoid having to another join for each category (and hence changing the structure of the query), you can put the categories into a temp table and then join against that.
CREATE TEMPORARY TABLE query_categories(category_id int);
INSERT INTO query_categories(category_id) VALUES(1);
INSERT INTO query_categories(category_id) VALUES(2);
INSERT INTO query_categories(category_id) VALUES(3);
SELECT * FROM videos v WHERE video_id IN (
SELECT video_id FROM video_categories vc JOIN query_categories q ON vc.category_id = qc.category_id
GROUP BY video_id
HAVING COUNT(*) = 3
)
Although this is ugly in its own way, of course. You may want to skip the temp table and just say 'category_id IN (...)' in the subquery.
Here's a FOR XML PATH solution:
--Sample data
CREATE TABLE Video
(
VideoID int,
VideoName varchar(50)
)
CREATE TABLE Videos_Categories
(
VideoID int,
CategoryID int
)
INSERT Video(VideoID, VideoName)
SELECT 1, 'Indiana Jones'
UNION ALL
SELECT 2, 'Star Trek'
INSERT Videos_Categories(VideoID, CategoryID)
SELECT 1, 1
UNION ALL
SELECT 1, 2
UNION ALL
SELECT 1, 3
UNION ALL
SELECT 2, 1
GO
--The query
;WITH GroupedVideos
AS
(
SELECT v.*,
SUBSTRING(
(SELECT (', ') + CAST(vc.CategoryID AS varchar(20))
FROM Videos_Categories AS vc
WHERE vc.VideoID = v.VideoID
AND vc.CategoryID IN (1,2)
ORDER BY vc.CategoryID
FOR XML PATH('')), 3, 2000) AS CatList
FROM Video AS v
)
SELECT *
FROM GroupedVideos
WHERE CatList = '1, 2'
(Ignore everything below - I misread the question)
Try
WHERE c1.category_id IN (1,2,3)
or
...
FROM videos v
JOIN Vedeos_categories vc ON v.video_id = vc.video_id
WHERE vc.category_id IN (1,2,3)
Multiple joins aren't at all necessary.
Edit: to put the solutions in context (I realize it's not obvious):
SELECT *
FROM videos
WHERE video_id IN
( SELECT c1.video_id
FROM videos_categories AS c1
WHERE c1.category_id = IN (1,2,3))
or
SELECT *
FROM videos v
JOIN Vedeos_categories vc ON v.video_id = vc.video_id
WHERE vc.category_id IN (1,2,3)