Update table using random of multiple values in other table - sql

Consider this data:
CREATE TABLE #Data (DataID INT, Code VARCHAR(2), Prefix VARCHAR(3))
INSERT INTO #Data (DataID, Code)
VALUES (1, 'AA')
, (2, 'AA')
, (3, 'AA')
, (4, 'AA')
, (5, 'AA')
, (6, 'AA')
CREATE TABLE #Prefix (Code VARCHAR(2), Prefix VARCHAR(3))
INSERT INTO #Prefix (Code, Prefix)
VALUES ('AA', 'ABC')
, ('AA', 'DEF')
, ('AA', 'GHI')
, ('AA', 'JKL')
I want to set the Prefix value in #Data to be a random Prefix from #Prefix with a matching Code.
Using a straight inner join just results in one value being used:
UPDATE D
SET Prefix = P.Prefix
FROM #Data AS D
INNER JOIN #Prefix AS P ON D.Code = P.Code
From reading other questions on here, NEWID() is recommended as a way of randomly ordering something. Changing the join to:
SELECT TOP 1 subquery ordering by NEWID()
still only selects a single value (though random each time) for every row:
UPDATE D
SET Prefix = (SELECT TOP 1 P.Prefix FROM #Prefix AS P WHERE P.Code = D.Code ORDER BY NEWID())
FROM #Data AS D
So, I'm not sure how to get a random prefix for each data entry from a single update statement. I could probably do some kind of loop through the #Data table, but I avoid ever touching loops in SQL and I'm sure that would be slow. The actual application of this will be on tens of thousands of records, with hundreds of prefixes for dozens of codes.

This is how to do it:
UPDATE d SET Prefix = ca.Prefix
FROM #Data d
CROSS APPLY(SELECT TOP 1 Prefix
FROM #Prefix p
WHERE d.DataID = d.DataID AND p.Code = d.Code ORDER BY NEWID()) ca
Notice d.DataID = d.DataID. This is here to force Sql Server engine to reevaluate subquery for each row in #Data table.

Related

SQL query takes more than an hour to execute for 200k rows

I have two tables each with around 200,000 rows. I have run the query below and it still hasn't completed after running for more than an hour. What could be the explanation for this?
SELECT
dbo.[new].[colom1],
dbo.[new].[colom2],
dbo.[new].[colom3],
dbo.[new].[colom4],
dbo.[new].[Value] as 'nieuwe Value',
dbo.[old].[Value] as 'oude Value'
FROM dbo.[new]
JOIN dbo.[old]
ON dbo.[new].[colom1] = dbo.[old].[colom1]
and dbo.[new].[colom2] = dbo.[old].[colom2]
and dbo.[new].[colom3] = dbo.[old].[colom3]
and dbo.[new].[colom4] = dbo.[old].[colom4]
where dbo.[new].[Value] <> dbo.[old].[Value]
from comment;
It seems that for an equality join on a single column, the rows with NULL value in the join key are being filtered out, but this is not the case for joins on multiple columns.
As a result, the hash join complexity is changed from O(N) to O(N^2).
======================================================================
In that context I would like to recommend a great article written by Paul White on similar issues -
Hash Joins on Nullable Columns
======================================================================
I have generated a small simulation of this use-case and I encourage you to test your solutions.
create table mytab1 (c1 int null,c2 int null)
create table mytab2 (c1 int null,c2 int null)
;with t(n) as (select 1 union all select n+1 from t where n < 10)
insert into mytab1 select null,null from t t0,t t1,t t2,t t3,t t4
insert into mytab2 select null,null from mytab1
insert into mytab1 values (111,222);
insert into mytab2 values (111,222);
select * from mytab1 t1 join mytab2 t2 on t1.c1 = t2.c1 and t1.c2 = t2.c2
For the OP query we should remove rows with NULL values in any of the join key columns.
SELECT
dbo.[new].[colom1],
dbo.[new].[colom2],
dbo.[new].[colom3],
dbo.[new].[colom4],
dbo.[new].[Value] as 'nieuwe Value',
dbo.[old].[Value] as 'oude Value'
FROM dbo.[new]
JOIN dbo.[old]
ON dbo.[new].[colom1] = dbo.[old].[colom1]
and dbo.[new].[colom2] = dbo.[old].[colom2]
and dbo.[new].[colom3] = dbo.[old].[colom3]
and dbo.[new].[colom4] = dbo.[old].[colom4]
where dbo.[new].[Value] <> dbo.[old].[Value]
and dbo.[new].[colom1] is not null
and dbo.[new].[colom2] is not null
and dbo.[new].[colom3] is not null
and dbo.[new].[colom4] is not null
and dbo.[old].[colom1] is not null
and dbo.[old].[colom2] is not null
and dbo.[old].[colom3] is not null
and dbo.[old].[colom4] is not null
Using EXCEPT join, you only have to make the larger HASH join on those values that have changed, so much faster:
/*
create table [new] ( colom1 int, colom2 int, colom3 int, colom4 int, [value] int)
create table [old] ( colom1 int, colom2 int, colom3 int, colom4 int, [value] int)
insert old values (1,2,3,4,10)
insert old values (1,2,3,5,10)
insert old values (1,2,3,6,10)
insert old values (1,2,3,7,10)
insert old values (1,2,3,8,10)
insert old values (1,2,3,9,10)
insert new values (1,2,3,4,11)
insert new values (1,2,3,5,10)
insert new values (1,2,3,6,11)
insert new values (1,2,3,7,10)
insert new values (1,2,3,8,10)
insert new values (1,2,3,9,11)
*/
select n.colom1, n.colom2 , n.colom3, n.colom4, n.[value] as newvalue, o.value as oldvalue
from new n
inner join [old] o on n.colom1=o.colom1 and n.colom2=o.colom2 and n.colom3=o.colom3 and n.colom4=o.colom4
inner join
(
select colom1, colom2 , colom3, colom4, [value] from new
except
select colom1, colom2 , colom3, colom4, [value] from old
) i on n.colom1=i.colom1 and n.colom2=i.colom2 and n.colom3=i.colom3 and n.colom4=i.colom4

Query different type data in to a single row with the master record

I just created the following sample data to demonstrate what I am after.
I need to query the above table to get my information as 1 single row
i.e. like the following
I was planning to create two temp tables and insert the two different types of addresses seperately. Then, inner join them with the main company table. I am not sure, this is a good solution. I appreaciate if anyone share their thoughts or code to my problem.
Try this..
Select companyId,CompanyName,homesddress1
,homeaddress2,HomePostCode,OfficeAddress1,OfficeAddress2,OfficePostCode
From tblCompany a
Outer apply ( select address1 homesddress1, address2 homeaddress2,postcode HomePostCode
From tblAddress t
Where AddressType='home' and t.companyid=a.companyid)
Outer apply (select address1 OfficeAddress1, address2 Officeaddress2,postcode OfficePostCode
From tblAddress t2
Where AddressType='Office ' and t2.companyid=a.companyid)
You can do it with a simple select using two outer joins. Note that you need the joins to be outer because for some companies you may only have one address.
DECLARE #company TABLE (
CompanyId int,
CompanyName varchar(50)
)
DECLARE #companyAddress TABLE (
Id int,
AddressType varchar(10),
Address1 varchar(50),
Address2 varchar(50),
Postcode varchar(10),
CompanyId int
)
INSERT INTO #company VALUES (1, 'Test Company')
INSERT INTO #companyAddress VALUES (1, 'Home', '25 Street', 'City 1', 'BA3 1PE', 1)
INSERT INTO #companyAddress VALUES (2, 'Office', '25 Street', 'City 2', 'NA1 4TW', 1)
SELECT c.CompanyId, c.CompanyName,
h.Address1 AS HomeAddress1, h.Address2 AS HomeAddress2, h.Postcode AS HomePostcode,
o.Address1 AS OfficeAddress1, o.Address2 AS OfficeAddress2, o.Postcode AS OfficePostcode
FROM #company c
LEFT JOIN
#companyAddress h ON h.CompanyId = c.CompanyId AND h.AddressType = 'Home'
LEFT JOIN
#companyAddress o ON o.CompanyId = c.CompanyId AND o.AddressType = 'Office'
Here is an Almost Dynamic version. You just have to specify the field list in the final pivot
Declare #YourTable table (ID int,AddressType varchar(25),Address1 varchar(50),Address2 varchar(50),CompanyID int)
Insert Into #YourTable values
(1,'Home' ,'25 Street','City 1',1),
(2,'Office','10 Avenue','City 2',1)
Declare #XML xml = (Select * from #YourTable for XML RAW) --<<< Initial Query
;with cteBase as (
Select ID = R.value('#CompanyID','int') --<<< Key ID
,AddressType = R.value('#AddressType','varchar(50)')
,Item = R.value('#AddressType','varchar(50)')+Attr.value('local-name(.)','varchar(100)')
,Value = Attr.value('.','varchar(max)')
From #XML.nodes('/row') as A(R)
Cross Apply A.r.nodes('./#*[local-name(.)!="CompanyID"]') as B(Attr) --<<< Key ID
),cteDist as (Select Distinct ID,Item from cteBase
),cteComp as (
Select A.*,B.Value
From cteDist A
Cross Apply (Select Value=Stuff((Select Distinct ',' + Value
From cteBase
Where ID=A.ID
and Item=A.Item
For XML Path ('')),1,1,'') ) B
)
Select *
From (Select * From cteComp) as s
Pivot (max(value)
For Item in (HomeID,HomeAddressType,HomeAddress1,HomeAddress2,OfficeID,OfficeAddressType,OfficeAddress1,OfficeAddress2)) as pvt
Returns
ID HomeID HomeAddressType HomeAddress1 HomeAddress2 OfficeID OfficeAddressType OfficeAddress1 OfficeAddress2
1 1 Home 25 Street City 1 2 Office 10 Avenue City 2

Matching multiple key/value pairs in SQL

I have metadata stored in a key/value table in SQL Server. (I know key/value is bad, but this is free-form metadata supplied by users, so I can't turn the keys into columns.) Users need to be able to give me an arbitrary set of key/value pairs and have me return all DB objects that match all of those criteria.
For example:
Metadata:
Id Key Value
1 a p
1 b q
1 c r
2 a p
2 b p
3 c r
If the user says a=p and b=q, I should return object 1. (Not object 2, even though it also has a=p, because it has b=p.)
The metadata to match is in a table-valued sproc parameter with a simple key/value schema. The closest I have got is:
select * from [Objects] as o
where not exists (
select * from [Metadata] as m
join #data as n on (n.[Key] = m.[Key])
and n.[Value] != m.[Value]
and m.[Id] = o.[Id]
)
My "no rows exist that don't match" is an attempt to implement "all rows match" by forming its contrapositive. This does eliminate objects with mismatching metadata, but it also returns objects with no metadata at all, so no good.
Can anyone point me in the right direction? (Bonus points for performance as well as correctness.)
; WITH Metadata (Id, [Key], Value) AS -- Create sample data
(
SELECT 1, 'a', 'p' UNION ALL
SELECT 1, 'b', 'q' UNION ALL
SELECT 1, 'c', 'r' UNION ALL
SELECT 2, 'a', 'p' UNION ALL
SELECT 2, 'b', 'p' UNION ALL
SELECT 3, 'c', 'r'
),
data ([Key], Value) AS -- sample input
(
SELECT 'a', 'p' UNION ALL
SELECT 'b', 'q'
),
-- here onwards is the actual query
data2 AS
(
-- cnt is to count no of input rows
SELECT [Key], Value, cnt = COUNT(*) OVER()
FROM data
)
SELECT m.Id
FROM Metadata m
INNER JOIN data2 d ON m.[Key] = d.[Key] AND m.Value= d.Value
GROUP BY m.Id
HAVING COUNT(*) = MAX(d.cnt)
The following SQL query produces the result that you require.
SELECT *
FROM #Objects m
WHERE Id IN
(
-- Include objects that match the conditions:
SELECT m.Id
FROM #Metadata m
JOIN #data d ON m.[Key] = d.[Key] AND m.Value = d.Value
-- And discount those where there is other metadata not matching the conditions:
EXCEPT
SELECT m.Id
FROM #Metadata m
JOIN #data d ON m.[Key] = d.[Key] AND m.Value <> d.Value
)
Test schema and data I used:
-- Schema
DECLARE #Objects TABLE (Id int);
DECLARE #Metadata TABLE (Id int, [Key] char(1), Value char(2));
DECLARE #data TABLE ([Key] char(1), Value char(1));
-- Data
INSERT INTO #Metadata VALUES
(1, 'a', 'p'),
(1, 'b', 'q'),
(1, 'c', 'r'),
(2, 'a', 'p'),
(2, 'b', 'p'),
(3, 'c', 'r');
INSERT INTO #Objects VALUES
(1),
(2),
(3),
(4); -- Object with no metadata
INSERT INTO #data VALUES
('a','p'),
('b','q');

SQL return only distinct IDs from LEFT JOIN

I've inherited some fun SQL and am trying to figure out how to how to eliminate rows with duplicate IDs. Our indexes are stored in a somewhat columnar format and then we pivot all the rows into one with the values as different columns.
The below sample returns three rows of unique data, but the IDs are duplicated. I need just two rows with unique IDs (and the other columns that go along with it). I know I'll be losing some data, but I just need one matching row per ID to the query (first, top, oldest, newest, whatever).
I've tried using DISTINCT, GROUP BY, and ROW_NUMBER, but I keep getting the syntax wrong, or using them in the wrong place.
I'm also open to rewriting the query completely in a way that is reusable as I currently have to generate this on the fly (cardtypes and cardindexes are user defined) and would love to be able to create a stored procedure. Thanks in advance!
declare #cardtypes table ([ID] int, [Name] nvarchar(50))
declare #cards table ([ID] int, [CardTypeID] int, [Name] nvarchar(50))
declare #cardindexes table ([ID] int, [CardID] int, [IndexType] int, [StringVal] nvarchar(255), [DateVal] datetime)
INSERT INTO #cardtypes VALUES (1, 'Funny Cards')
INSERT INTO #cardtypes VALUES (2, 'Sad Cards')
INSERT INTO #cards VALUES (1, 1, 'Bunnies')
INSERT INTO #cards VALUES (2, 1, 'Dogs')
INSERT INTO #cards VALUES (3, 1, 'Cat')
INSERT INTO #cards VALUES (4, 1, 'Cat2')
INSERT INTO #cardindexes VALUES (1, 1, 1, 'Bunnies', null)
INSERT INTO #cardindexes VALUES (2, 1, 1, 'playing', null)
INSERT INTO #cardindexes VALUES (3, 1, 2, null, '2014-09-21')
INSERT INTO #cardindexes VALUES (4, 2, 1, 'Dogs', null)
INSERT INTO #cardindexes VALUES (5, 2, 1, 'playing', null)
INSERT INTO #cardindexes VALUES (6, 2, 1, 'poker', null)
INSERT INTO #cardindexes VALUES (7, 2, 2, null, '2014-09-22')
SELECT TOP(100)
[ID] = c.[ID],
[Name] = c.[Name],
[Keyword] = [colKeyword].[StringVal],
[DateAdded] = [colDateAdded].[DateVal]
FROM #cards AS c
LEFT JOIN #cardindexes AS [colKeyword] ON [colKeyword].[CardID] = c.ID AND [colKeyword].[IndexType] = 1
LEFT JOIN #cardindexes AS [colDateAdded] ON [colDateAdded].[CardID] = c.ID AND [colDateAdded].[IndexType] = 2
WHERE [colKeyword].[StringVal] LIKE 'p%' AND c.[CardTypeID] = 1
ORDER BY [DateAdded]
Edit:
While both solutions are valid, I ended up using the MAX() solution from #popovitsj as it was easier to implement. The issue of data coming from multiple rows doesn't really factor in for me as all rows are essentially part of the same record. I will most likely use both solutions depending on my needs.
Here's my updated query (as it didn't quite match the answer):
SELECT TOP(100)
[ID] = c.[ID],
[Name] = MAX(c.[Name]),
[Keyword] = MAX([colKeyword].[StringVal]),
[DateAdded] = MAX([colDateAdded].[DateVal])
FROM #cards AS c
LEFT JOIN #cardindexes AS [colKeyword] ON [colKeyword].[CardID] = c.ID AND [colKeyword].[IndexType] = 1
LEFT JOIN #cardindexes AS [colDateAdded] ON [colDateAdded].[CardID] = c.ID AND [colDateAdded].[IndexType] = 2
WHERE [colKeyword].[StringVal] LIKE 'p%' AND c.[CardTypeID] = 1
GROUP BY c.ID
ORDER BY [DateAdded]
You could use MAX or MIN to 'decide' on what to display for the other columns in the rows that are duplicate.
SELECT ID, MAX(Name), MAX(Keyword), MAX(DateAdded)
(...)
GROUP BY ID;
using row number windowed function along with a CTE will do this pretty well. For example:
;With preResult AS (
SELECT TOP(100)
[ID] = c.[ID],
[Name] = c.[Name],
[Keyword] = [colKeyword].[StringVal],
[DateAdded] = [colDateAdded].[DateVal],
ROW_NUMBER()OVER(PARTITION BY c.ID ORDER BY [colDateAdded].[DateVal]) rn
FROM #cards AS c
LEFT JOIN #cardindexes AS [colKeyword] ON [colKeyword].[CardID] = c.ID AND [colKeyword].[IndexType] = 1
LEFT JOIN #cardindexes AS [colDateAdded] ON [colDateAdded].[CardID] = c.ID AND [colDateAdded].[IndexType] = 2
WHERE [colKeyword].[StringVal] LIKE 'p%' AND c.[CardTypeID] = 1
ORDER BY [DateAdded]
)
SELECT * from preResult WHERE rn = 1

How can I write a better multiple join that matches multiple values across rows?

I'm trying to write a SQL statement that will allow me to select a series of articles from a table based on their keywords. What I've got so far is a token table, an article table, and a many-to-many table for tokens & articles:
tokens
rowid
token
token_article
token_rowid
article_rowid
articles
rowid
What I'm doing is taking a search query, splitting it up by spaces, then select all articles that contains those keywords. So far I've come up with this:
select * from
(select * from tokens
inner join token_article on
tokens.rowid = token_article.token_rowid and
token = 'ABC'
) as t1,
(select * from tokens
inner join token_article on
tokens.rowid = token_article.token_rowid and
token = 'DEF'
) as t2
where t1.article_rowid = t2.article_rowid and t2.article_rowid = articles.rowid
Which works but of course its doing a select on all articles that match ABC and all articles that DEF then selecting them.
Now I'm trying to figure out a better way. What I imagine in my mind that would work would be to select all the articles that match ABC and from those match any with DEF. This is what I imagine it to look like but does not work (receive error message "no such columns: tokens.rowid")
select * from
(select * from
(select * from tokens
inner join token_article on
tokens.rowid = token_article.token_rowid and
token = 'ABC'
)
inner join token_article on
tokens.rowid = token_article.token_rowid and
token = 'DEF'
)
Because there is more than one way to do this...this method uses GROUP BY and HAVING clauses. The query is looking for all articles that have either the ABC or DEF token, but then grouping by the article ID where the count of tokens for the article is equal to the number of tokens being queried.
Note that I've used MSSQL syntax here, but the concept should work in most SQL implementations.
Edit: I should point out that this has a fairly clean syntax as you add more tokens to the query. If you add more tokens, then you just need to modify the t.token_in criteria and adjust the HAVING COUNT(*) = x clause accordingly.
DECLARE #tokens TABLE
(
rowid INT NOT NULL,
token VARCHAR(255) NOT NULL
)
DECLARE #articles TABLE
(
rowid INT NOT NULL,
title VARCHAR(255) NOT NULL
)
DECLARE #token_article TABLE
(
token_rowid INT NOT NULL,
article_rowid INT NOT NULL
)
INSERT INTO #tokens VALUES (1, 'ABC'), (2, 'DEF')
INSERT INTO #articles VALUES (1, 'This is article 1.'), (2, 'This is article 2.'), (3, 'This is article 3.'), (4, 'This is article 4.'), (5, 'This is article 5.'), (6, 'This is article 6.')
INSERT INTO #token_article VALUES (1, 1), (2, 1), (1, 2), (2, 3), (1, 4), (2, 4), (1, 5), (1, 6)
-- Get the article IDs that have all of the tokens
-- Use this if you just want the IDs
SELECT a.rowid FROM #articles a
INNER JOIN #token_article ta ON a.rowid = ta.article_rowid
INNER JOIN #tokens t ON ta.token_rowid = t.rowid
WHERE t.token IN ('ABC', 'DEF')
GROUP BY a.rowid
HAVING COUNT(*) = 2 -- This should match the number of tokens
rowid
-----------
1
4
-- Get the articles themselves
-- Use this if you want the articles
SELECT * FROM #articles WHERE rowid IN (
SELECT a.rowid FROM #articles a
INNER JOIN #token_article ta ON a.rowid = ta.article_rowid
INNER JOIN #tokens t ON ta.token_rowid = t.rowid
WHERE t.token IN ('ABC', 'DEF')
GROUP BY a.rowid
HAVING COUNT(*) = 2 -- This should match the number of tokens
)
rowid title
----------- ------------------
1 This is article 1.
4 This is article 4.
Here is one way to do it. The script was tested in SQL Server 2012 database.
Script:
CREATE TABLE dbo.tokens
(
rowid INT NOT NULL IDENTITY
, token VARCHAR(10) NOT NULL
);
CREATE TABLE dbo.articles
(
rowid INT NOT NULL IDENTITY
, name VARCHAR(10) NOT NULL
);
CREATE TABLE dbo.token_article
(
token_rowid INT NOT NULL
, article_rowid INT NOT NULL
);
INSERT INTO dbo.tokens (token) VALUES
('ABC'),
('DEF');
INSERT INTO dbo.articles (name) VALUES
('Article 1'),
('Article 2'),
('Article 3');
INSERT INTO dbo.token_article (token_rowid, article_rowid) VALUES
(1, 2),
(2, 3),
(1, 3),
(1, 1),
(2, 2);
SELECT out1.rowid
, out1.token
, out1.token_rowid
, out1.article_rowid
, ta2.token_rowid
, ta2.article_rowid
, t2.rowid
, t2.token
FROM
(
SELECT t.rowid
, t.token
, ta1.token_rowid
, ta1.article_rowid
FROM dbo.tokens t
INNER JOIN dbo.token_article ta1
ON ta1.token_rowid = t.rowid
WHERE t.token = 'ABC'
) out1
INNER JOIN dbo.token_article ta2
ON ta2.article_rowid = out1.article_rowid
INNER JOIN dbo.tokens t2
ON t2.rowid = ta2.token_rowid
AND t2.token = 'DEF';
Output:
rowid token token_rowid article_rowid token_rowid article_rowid rowid token
----- ----- ----------- ------------- ----------- ------------- ----- -----
1 ABC 1 2 2 2 2 DEF
1 ABC 1 3 2 3 2 DEF