Matching multiple key/value pairs in SQL - sql

I have metadata stored in a key/value table in SQL Server. (I know key/value is bad, but this is free-form metadata supplied by users, so I can't turn the keys into columns.) Users need to be able to give me an arbitrary set of key/value pairs and have me return all DB objects that match all of those criteria.
For example:
Metadata:
Id Key Value
1 a p
1 b q
1 c r
2 a p
2 b p
3 c r
If the user says a=p and b=q, I should return object 1. (Not object 2, even though it also has a=p, because it has b=p.)
The metadata to match is in a table-valued sproc parameter with a simple key/value schema. The closest I have got is:
select * from [Objects] as o
where not exists (
select * from [Metadata] as m
join #data as n on (n.[Key] = m.[Key])
and n.[Value] != m.[Value]
and m.[Id] = o.[Id]
)
My "no rows exist that don't match" is an attempt to implement "all rows match" by forming its contrapositive. This does eliminate objects with mismatching metadata, but it also returns objects with no metadata at all, so no good.
Can anyone point me in the right direction? (Bonus points for performance as well as correctness.)

; WITH Metadata (Id, [Key], Value) AS -- Create sample data
(
SELECT 1, 'a', 'p' UNION ALL
SELECT 1, 'b', 'q' UNION ALL
SELECT 1, 'c', 'r' UNION ALL
SELECT 2, 'a', 'p' UNION ALL
SELECT 2, 'b', 'p' UNION ALL
SELECT 3, 'c', 'r'
),
data ([Key], Value) AS -- sample input
(
SELECT 'a', 'p' UNION ALL
SELECT 'b', 'q'
),
-- here onwards is the actual query
data2 AS
(
-- cnt is to count no of input rows
SELECT [Key], Value, cnt = COUNT(*) OVER()
FROM data
)
SELECT m.Id
FROM Metadata m
INNER JOIN data2 d ON m.[Key] = d.[Key] AND m.Value= d.Value
GROUP BY m.Id
HAVING COUNT(*) = MAX(d.cnt)
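The counting idea in the query above (an object matches when the number of criteria it satisfies equals the number of criteria supplied) can be tried outside SQL Server. Here is a minimal sketch using Python's built-in sqlite3 module on the sample data from the question. The plain table names `Metadata` and `data` stand in for the real table and the table-valued parameter, and the sketch assumes each key occurs at most once per object and at most once in the input. It uses a scalar subquery for the input count instead of `COUNT(*) OVER()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Metadata (Id INTEGER, "Key" TEXT, Value TEXT);
    INSERT INTO Metadata VALUES
        (1, 'a', 'p'), (1, 'b', 'q'), (1, 'c', 'r'),
        (2, 'a', 'p'), (2, 'b', 'p'),
        (3, 'c', 'r');
    -- Stand-in for the table-valued parameter holding the criteria
    CREATE TABLE data ("Key" TEXT, Value TEXT);
    INSERT INTO data VALUES ('a', 'p'), ('b', 'q');
""")

# Keep only objects whose number of satisfied criteria equals the
# total number of criteria (assumes no duplicate keys on either side).
matching_ids = [row[0] for row in conn.execute("""
    SELECT m.Id
    FROM Metadata AS m
    JOIN data AS d ON d."Key" = m."Key" AND d.Value = m.Value
    GROUP BY m.Id
    HAVING COUNT(*) = (SELECT COUNT(*) FROM data)
    ORDER BY m.Id
""")]

print(matching_ids)  # [1]
```

Objects with no metadata never appear in the join, so they cannot be returned by mistake.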

The following SQL query produces the result that you require.
SELECT *
FROM #Objects o
WHERE o.Id IN
(
-- Include objects that match the conditions:
SELECT m.Id
FROM #Metadata m
JOIN #data d ON m.[Key] = d.[Key] AND m.Value = d.Value
-- And discount those where there is other metadata not matching the conditions:
EXCEPT
SELECT m.Id
FROM #Metadata m
JOIN #data d ON m.[Key] = d.[Key] AND m.Value <> d.Value
)
Test schema and data I used:
-- Schema
CREATE TABLE #Objects (Id int);
CREATE TABLE #Metadata (Id int, [Key] char(1), Value char(2));
CREATE TABLE #data ([Key] char(1), Value char(1));
-- Data
INSERT INTO #Metadata VALUES
(1, 'a', 'p'),
(1, 'b', 'q'),
(1, 'c', 'r'),
(2, 'a', 'p'),
(2, 'b', 'p'),
(3, 'c', 'r');
INSERT INTO #Objects VALUES
(1),
(2),
(3),
(4); -- Object with no metadata
INSERT INTO #data VALUES
('a','p'),
('b','q');

Related

SQL Server query to extract all rows

I've two database tables, one called "Headers" and one called "Rows".
The structure is:
Header: IDPK | Description
Row: IDPK | IDPK_Header | Item_ID | Qty
I need a query that says: "Given a Header IDPK, find the other headers that have the same number of rows, with the same item IDs and quantities".
For example:
Header                Rows
IDPK  Description     IDPK  Item_ID  Qty
1     'Test1'         1     'A'      10
1     'Test1'         2     'B'      20
2     'Test2'         3     'A'      10
2     'Test2'         4     'B'      20
3     'Test3'         5     'A'      5
3     'Test3'         6     'B'      20
4     'Test4'         7     'A'      10
Header Test1 matches Test2, but not Test3 or Test4.
The problem is that the number of rows must be exactly the same. I tried the ALL operator, but without luck.
How can I write the query with an eye on performance? The two tables can be very large (~500.000 records).
Assuming there are no duplicates:
with r as (
select r.*, count(*) over (partition by idpk_header) as num_items
from rows r
)
select r1.idpk_header, r2.idpk_header
from r r1 join
r r2
on r2.item_id = r1.item_id and r2.qty = r1.qty and r2.num_items = r1.num_items
group by r1.idpk_header, r2.idpk_header, r1.num_items
having count(*) = r1.num_items;
Basically, this does a self-join on the items, so you only get matches. The on validates that the two have the same number of items. And the having guarantees that all match.
Note: This version returns each match of the header to itself. That is a nice check. You can of course filter this out in the on or a where clause.
If you do have duplicate items, you can simply replace r with:
select idpk_header, item_id, sum(qty) as qty,
count(*) over (partition by idpk_header) as num_items
from rows r
group by idpk_header, item_id;
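For anyone who wants to check the self-join logic without a SQL Server instance, here is a minimal sketch using Python's sqlite3 module and the sample rows from the question. The table name `item_rows` is a stand-in chosen for this sketch, the per-header count is computed with a correlated subquery rather than a window function, and self-matches are filtered out in the join condition. It assumes an item appears at most once per header.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_rows (IDPK INTEGER, IDPK_Header INTEGER,
                            Item_ID TEXT, Qty INTEGER);
    INSERT INTO item_rows VALUES
        (1, 1, 'A', 10), (2, 1, 'B', 20),
        (3, 2, 'A', 10), (4, 2, 'B', 20),
        (5, 3, 'A', 5),  (6, 3, 'B', 20),
        (7, 4, 'A', 10);
""")

pairs = conn.execute("""
    WITH r AS (
        SELECT x.IDPK_Header, x.Item_ID, x.Qty,
               (SELECT COUNT(*) FROM item_rows AS y
                 WHERE y.IDPK_Header = x.IDPK_Header) AS num_items
        FROM item_rows AS x
    )
    SELECT r1.IDPK_Header, r2.IDPK_Header
    FROM r AS r1
    JOIN r AS r2
      ON r2.Item_ID = r1.Item_ID
     AND r2.Qty = r1.Qty
     AND r2.num_items = r1.num_items        -- same number of rows
     AND r2.IDPK_Header <> r1.IDPK_Header   -- skip self-matches
    GROUP BY r1.IDPK_Header, r2.IDPK_Header, r1.num_items
    HAVING COUNT(*) = r1.num_items          -- every row found a partner
    ORDER BY r1.IDPK_Header, r2.IDPK_Header
""").fetchall()

print(pairs)  # [(1, 2), (2, 1)]
```

Each matching pair is reported in both directions. Adding `r1.IDPK_Header < r2.IDPK_Header` instead of `<>` would keep one row per pair.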
I would suggest using a FOR XML query to build the list of items per IDPK, and then searching for matching item lists and quantities. See the following example:
CREATE TABLE #Headers (
IDPK INT,
Description NVARCHAR(100)
)
CREATE TABLE #Rows (
IDPK INT,
ITEMID NVARCHAR(1),
Qty INT
)
INSERT INTO #Headers VALUES
(1, 'Test1'),
(2, 'Test2'),
(3, 'Test3'),
(4, 'Test4'),
(5, 'Test5')
INSERT INTO #Rows VALUES
(1, 'A', 10),
(1, 'B', 20),
(2, 'A', 10),
(2, 'B', 20),
(3, 'A', 5 ),
(3, 'B', 20),
(4, 'C', 10),
(5, 'A', 10),
(5, 'C', 20)
;
WITH cteHeaderRows AS(
SELECT IDPK
,ItemIDs=STUFF(
(
SELECT ',' + CAST(ITEMID AS VARCHAR(MAX))
FROM #Rows t2
WHERE t2.IDPK = t1.IDPK
ORDER BY ITEMID, QTY
FOR XML PATH('')
),1,1,''
)
,Qtys=STUFF(
(
SELECT ',' + CAST(Qty AS VARCHAR(MAX))
FROM #Rows t2
WHERE t2.IDPK = t1.IDPK
ORDER BY ITEMID, QTY
FOR XML PATH('')
),1,1,''
)
FROM #Rows t1
GROUP BY IDPK
),
cteFilter AS(
SELECT h1.IDPK AS IDPK1, h2.IDPK AS IDPK2
FROM cteHeaderRows h1
JOIN cteHeaderRows h2 ON h1.IDPK != h2.IDPK AND h1.ItemIDs = h2.ItemIDs AND h2.Qtys = h1.Qtys
)
SELECT DISTINCT h.IDPK, h.Description, r.ItemID, r.Qty
FROM #Headers h
JOIN cteFilter f ON f.IDPK1 = h.IDPK
JOIN #Rows r ON r.IDPK = f.IDPK1
ORDER BY 1,3,4

SQL logic for the Vlookup function in excel/ How to do a Vlookup in SQL

I have a table with two columns, OLD_VALUE and NEW_VALUE, and 5 rows: (A,B), (B,C), (C,D), (E,D), (D,F). I want to replace each row's NEW_VALUE with the newest value in its chain, much like a chained VLOOKUP in Excel. In this example A points to B, B points to C, C and E point to D, and D points to F. (D,F) is the last and newest pair because there are no more successions after it. The required result is therefore (OLD_VALUE,NEW_VALUE) -> (A,F), (B,F), (C,F), (D,F), (E,F): 5 rows, all with NEW_VALUE 'F'. The number of succession levels can range from 1 to x.
This is the table I have used for the script:
create table #t (old_value char(1), new_value char(1));
insert into #t values('A','B')
insert into #t values('B','C')
insert into #t values('C','D')
insert into #t values('E','D')
insert into #t values('D','F')
This needs to be done with a recursive CTE. First, you will need to define an anchor for the CTE. The anchor in this case should be the record with the latest value. This is how I define the anchor:
select old_value, new_value, 1 as level
from #t
where new_value NOT IN (select old_value from #t)
And here is the recursive CTE I used to locate the latest value for each row:
;with a as(
select old_value, new_value, 1 as level
from #t
where new_value NOT IN (select old_value from #t)
union all
select b.old_value, a.new_value, a.level + 1
from a INNER JOIN #t b ON a.old_value = b.new_value
)
select * from a
Results:
old_value new_value level
--------- --------- -----------
D F 1
C F 2
E F 2
B F 3
A F 4
(5 row(s) affected)
I think a recursive CTE like the following is what you're looking for (where the parent is the row whose second value does not exist as a first value elsewhere). If there's no parent(s) to anchor to, this would fail (e.g. if you had A->B, B->C, C->A, you'd get no result), but it should work for your case:
CREATE TABLE #T (val1 CHAR(1), val2 CHAR(2));
INSERT #T VALUES ('A', 'B'), ('B', 'C'), ('C', 'D'), ('E', 'D'), ('D', 'F');
WITH CTE AS
(
SELECT val1, val2
FROM #T AS T
WHERE NOT EXISTS (SELECT 1 FROM #T WHERE val1 = T.val2)
UNION ALL
SELECT T.val1, CTE.val2
FROM #T AS T
JOIN CTE
ON CTE.val1 = T.val2
)
SELECT *
FROM CTE;
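Both the normal case and the no-anchor caveat can be checked with a small sketch. The version below uses Python's sqlite3 module (SQLite spells the construct `WITH RECURSIVE`), and the table name `t` and the helper `resolve` are names chosen for this sketch only. It returns the resolved pairs for the sample data and an empty result for the cyclic A->B, B->C, C->A input.

```python
import sqlite3

RESOLVE_SQL = """
    WITH RECURSIVE chain(old_value, new_value) AS (
        -- Anchor: rows whose new value is never superseded
        SELECT old_value, new_value
        FROM t
        WHERE new_value NOT IN (SELECT old_value FROM t)
        UNION ALL
        -- Step: rows pointing at an already resolved old value
        SELECT b.old_value, chain.new_value
        FROM chain
        JOIN t AS b ON b.new_value = chain.old_value
    )
    SELECT old_value, new_value FROM chain ORDER BY old_value
"""

def resolve(pairs):
    """Return (old_value, final_value) for every row reachable from an anchor."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (old_value TEXT, new_value TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", pairs)
    return conn.execute(RESOLVE_SQL).fetchall()

resolved = resolve([("A", "B"), ("B", "C"), ("C", "D"), ("E", "D"), ("D", "F")])
cyclic = resolve([("A", "B"), ("B", "C"), ("C", "A")])

print(resolved)  # [('A', 'F'), ('B', 'F'), ('C', 'F'), ('D', 'F'), ('E', 'F')]
print(cyclic)    # []
```

The cyclic input yields nothing because no row qualifies as an anchor, so the recursion never starts.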

Reordering output to predefined sequence

I am trying to get output from a table sorted in a predefined sequence of 5 letters,
i.e. L > C > E > O > A.
I can't get the desired result with a plain ORDER BY. I am using SQL Server.
Can anyone suggest how to define a custom sequence inside a query,
so that I get my result in the order L > C > E > O > A?
Thanks in advance.
select * from your_table
order by case when some_column = 'L' then 1
when some_column = 'C' then 2
when some_column = 'E' then 3
when some_column = 'O' then 4
when some_column = 'A' then 5
end -- ascending, so L sorts first and A last
If you want to use those sorting criteria for two or more queries then you can create a table for this:
CREATE TABLE dbo.CustomSort (
Value VARCHAR(10) PRIMARY KEY,
SortOrder INT NOT NULL
);
GO
INSERT INTO dbo.CustomSort (Value, SortOrder) VALUES ('L', 1);
INSERT INTO dbo.CustomSort (Value, SortOrder) VALUES ('C', 2);
INSERT INTO dbo.CustomSort (Value, SortOrder) VALUES ('E', 3);
INSERT INTO dbo.CustomSort (Value, SortOrder) VALUES ('O', 4);
INSERT INTO dbo.CustomSort (Value, SortOrder) VALUES ('A', 5);
GO
and then you can join the source table (x in this example) with dbo.CustomSort table thus:
SELECT x.Col1
FROM
(
SELECT 'E' UNION ALL
SELECT 'C' UNION ALL
SELECT 'O'
) x(Col1) INNER JOIN dbo.CustomSort cs ON x.Col1 = cs.Value
ORDER BY cs.SortOrder
/*
Col1
----
C
E
O
*/
If you update the dbo.CustomSort table, all queries will then use the new sorting criteria.
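One detail the lookup-table approach leaves open is what happens to values that have no entry in the sort table: an INNER JOIN drops them. A minimal sketch of a variant that keeps them and sorts them last, written with Python's sqlite3 module. The table `items` and the extra value 'Z' are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CustomSort (Value TEXT PRIMARY KEY, SortOrder INTEGER NOT NULL);
    INSERT INTO CustomSort VALUES ('L', 1), ('C', 2), ('E', 3), ('O', 4), ('A', 5);
    -- Hypothetical source table; 'Z' has no entry in CustomSort
    CREATE TABLE items (Col1 TEXT);
    INSERT INTO items VALUES ('A'), ('E'), ('L'), ('O'), ('C'), ('Z');
""")

ordered = [row[0] for row in conn.execute("""
    SELECT i.Col1
    FROM items AS i
    LEFT JOIN CustomSort AS cs ON cs.Value = i.Col1
    ORDER BY CASE WHEN cs.SortOrder IS NULL THEN 1 ELSE 0 END,  -- unknown values last
             cs.SortOrder,
             i.Col1
""")]

print(ordered)  # ['L', 'C', 'E', 'O', 'A', 'Z']
```

The same LEFT JOIN and CASE expression should carry over to SQL Server unchanged, since neither relies on SQLite-specific syntax.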

Identifying duplicate GROUPS of data in SQL

My question is how to identify duplicate (repeating) 'groups' of data within an SQL table. I am using SQL Server 2005 at the moment so prefer solutions based on that or ansi-sql.
Here is a sample table and expected result (below) to base this question on:
create table #data (id nvarchar(10), fund nvarchar(1), xtype nvarchar(1))
insert into #data select 'Switch_1', 'A', 'S'
insert into #data select 'Switch_1', 'X', 'B'
insert into #data select 'Switch_1', 'Y', 'B'
insert into #data select 'Switch_1', 'Z', 'B'
insert into #data select 'Switch_2', 'A', 'S'
insert into #data select 'Switch_2', 'X', 'B'
insert into #data select 'Switch_2', 'Y', 'B'
insert into #data select 'Switch_2', 'Z', 'B'
insert into #data select 'Switch_3', 'C', 'S'
insert into #data select 'Switch_3', 'D', 'B'
insert into #data select 'Switch_4', 'C', 'S'
insert into #data select 'Switch_4', 'F', 'B'
(new data)
insert into #data select 'Switch_5', 'A', 'S'
insert into #data select 'Switch_5', 'X', 'B'
insert into #data select 'Switch_5', 'Y', 'B'
insert into #data select 'Switch_5', 'Z', 'B'
-- id fund xtype match
-- ---------- ---- ----- ---------
-- Switch_1 A S Match_1
-- Switch_1 X B Match_1
-- Switch_1 Y B Match_1
-- Switch_1 Z B Match_1
-- Switch_2 A S Match_1
-- Switch_2 X B Match_1
-- Switch_2 Y B Match_1
-- Switch_2 Z B Match_1
-- Switch_3 C S
-- Switch_3 D B
-- Switch_4 C S
-- Switch_4 F B
(new results)
-- Switch_5 A S Match_1
-- Switch_5 X B Match_1
-- Switch_5 Y B Match_1
-- Switch_5 Z B Match_1
I only want matches on an ALL or NOTHING basis (i.e. All records in the group match all records in another group - not a part match). Any match id can be used (I have used Match_1 above but can be numeric etc.)
Thanks for any help here.
(EDIT: I guess I should add that there could be any number of rows per group, not just the 2 or 4 shown in the sample above - and I'm also trying to avoid cursors)
(EDIT 2: I seem to have an issue when more than one match is found. In that case the SQL supplied returns duplicate records for Switch_1. I have updated the sample data accordingly. Not sure if Lieven is still following this. I'm also looking at the solution and will post here if I find a fix.)
The flow of execution is as follows:
q: Combine all funds and xtypes of one id into one string using a FOR XML PATH construction.
r: Select a ROW_NUMBER and the respective ids for matching groups.
Finally, select the results by LEFT JOINing #data and r.
SQL statement:
;WITH q AS (
SELECT DISTINCT d.id
, DuplicateData = STUFF((SELECT ', ' + fund + xtype FROM #data WHERE id = d.id ORDER BY fund, xtype FOR XML PATH('')), 1, 2, '')
FROM #data d
)
, r AS (
SELECT id1 = q1.id
, id2 = q2.id
, rn = ROW_NUMBER() OVER (ORDER BY q1.ID)
FROM q q1
INNER JOIN q q2 ON q1.DuplicateData = q2.DuplicateData AND q1.id < q2.id
)
SELECT id
, fund
, xtype
, match = 'Match_' + CAST(r.rn AS VARCHAR(32))
FROM #data d
LEFT OUTER JOIN r ON d.id IN (r.id1, r.id2)
Results
id fund xtype match
---------- ---- ----- --------------------------------------
Switch_1 A S Match_1
Switch_1 X B Match_1
Switch_1 Y B Match_1
Switch_1 Z B Match_1
Switch_2 A S Match_1
Switch_2 X B Match_1
Switch_2 Y B Match_1
Switch_2 Z B Match_1
Switch_3 C S NULL
Switch_3 D B NULL
Switch_4 C S NULL
Switch_4 F B NULL
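The duplicate rows mentioned in EDIT 2 come from numbering pairs of ids: once Switch_5 is added, Switch_1 takes part in more than one pair. One way around this is to label every id with a single representative of its group of identical ids. Below is a minimal sketch of that idea with Python's sqlite3 module, using the counting technique rather than string concatenation. The table name `data` and the choice of the smallest id as the group label are assumptions of this sketch, and it assumes a (fund, xtype) pair occurs at most once per id.

```python
import sqlite3

groups = {
    "Switch_1": ["AS", "XB", "YB", "ZB"],
    "Switch_2": ["AS", "XB", "YB", "ZB"],
    "Switch_3": ["CS", "DB"],
    "Switch_4": ["CS", "FB"],
    "Switch_5": ["AS", "XB", "YB", "ZB"],
}
rows = [(gid, fx[0], fx[1]) for gid, members in groups.items() for fx in members]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id TEXT, fund TEXT, xtype TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

match_of = dict(conn.execute("""
    WITH sizes AS (
        SELECT id, COUNT(*) AS n FROM data GROUP BY id
    ),
    pairs AS (
        -- id1 and id2 are identical when they have the same size and
        -- every row of id1 finds a partner in id2 (includes id1 = id2)
        SELECT a.id AS id1, b.id AS id2
        FROM data AS a
        JOIN data AS b ON b.fund = a.fund AND b.xtype = a.xtype
        JOIN sizes AS sa ON sa.id = a.id
        JOIN sizes AS sb ON sb.id = b.id
        WHERE sa.n = sb.n
        GROUP BY a.id, b.id, sa.n
        HAVING COUNT(*) = sa.n
    )
    -- Keep ids with at least one partner besides themselves and
    -- label the whole group with its smallest id
    SELECT id1, MIN(id2) AS match_id
    FROM pairs
    GROUP BY id1
    HAVING COUNT(*) > 1
"""))

print(match_of)
```

Here Switch_1, Switch_2 and Switch_5 all map to 'Switch_1', while Switch_3 and Switch_4 are absent because they only match themselves. Left-joining this result back to the data gives one row per original record.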
Here is another query for it:
create table #temp1 (
id varchar(10),
fund nvarchar(1),
xtype nvarchar(1)
)
insert into #temp1 select 'Switch_1', 'A', 'S'
insert into #temp1 select 'Switch_1', 'X', 'B'
insert into #temp1 select 'Switch_1', 'Y', 'B'
insert into #temp1 select 'Switch_1', 'Z', 'B'
insert into #temp1 select 'Switch_2', 'A', 'S'
insert into #temp1 select 'Switch_2', 'X', 'B'
insert into #temp1 select 'Switch_2', 'Y', 'B'
insert into #temp1 select 'Switch_2', 'Z', 'B'
insert into #temp1 select 'Switch_3', 'C', 'S'
insert into #temp1 select 'Switch_3', 'D', 'B'
insert into #temp1 select 'Switch_4', 'C', 'S'
insert into #temp1 select 'Switch_4', 'F', 'B'
select t1.*, case when t2.equal = t3.total then 'True' else 'False' end as 'Match' from #temp1 t1
left outer join (select m.id, count(m2.id) as 'equal' from #temp1 m
inner join #temp1 m2 on m.Id <> m2.Id and m.fund = m2.fund and m.xtype = m2.xtype
group by m.id) t2 on t1.id = t2.id
inner join (select m3.id, count(m3.fund) as 'total' from #temp1 m3 group by m3.id) t3 on t3.id = t1.id
drop table #temp1

Multiple parents tree (or digraph) implementation sql server 2005

I need to implement a multi-parented tree (or digraph) onto SQL Server 2005.
I've read several articles, but most of them use single-parent trees with a unique root, like the following one.
-My PC
-Drive C
-Documents and Settings
-Program Files
-Adobe
-Microsoft
-Folder X
-Drive D
-Folder Y
-Folder Z
In this one, everything derives from a root element (My PC).
In my case, a child could have more than 1 parent, like the following:
G A
\ /
B
/ \
X C
/ \
D E
\ /
F
So I have the following code:
create table #ObjectRelations
(
Id varchar(20),
NextId varchar(20)
)
insert into #ObjectRelations values ('G', 'B')
insert into #ObjectRelations values ('A', 'B')
insert into #ObjectRelations values ('B', 'C')
insert into #ObjectRelations values ('B', 'X')
insert into #ObjectRelations values ('C', 'E')
insert into #ObjectRelations values ('C', 'D')
insert into #ObjectRelations values ('E', 'F')
insert into #ObjectRelations values ('D', 'F')
declare @id varchar(20)
set @id = 'A';
WITH Objects (Id, NextId) AS
( -- This is the 'Anchor' or starting point of the recursive query
SELECT rel.Id,
rel.NextId
FROM #ObjectRelations rel
WHERE rel.Id = @id
UNION ALL -- This is the recursive portion of the query
SELECT rel.Id,
rel.NextId
FROM #ObjectRelations rel
INNER JOIN Objects -- Note the reference to CTE table name (Recursive Join)
ON rel.Id = Objects.NextId
)
SELECT o.*
FROM Objects o
drop table #ObjectRelations
Which returns the following SET:
Id NextId
-------------------- --------------------
A B
B C
B X
C E
C D
D F
E F
Expected result SET:
Id NextId
-------------------- --------------------
G B
A B
B C
B X
C E
C D
D F
E F
Note that the relation G->B is missing. The query asks for a starting object, and using A as the starting point ignores the G->B relationship. (A fixed starting object doesn't work for me anyway, because I don't know the root objects in advance.)
So this code doesn't work in my case. A starting object is obvious in a SINGLE-parent tree (it is always the root object), but a multi-parent tree can have more than one "root" object. In the example, G and A are the "root" objects, where a root is an object that doesn't have a parent (ancestor).
So I'm kind of stuck here... I need to modify the query so that it does NOT ask for a starting object and recursively traverses the entire tree.
I don't know if that's possible with the (Id, NextId) implementation... maybe I need to store it as a graph using some kind of incidence matrix, adjacency matrix or whatever (see http://willets.org/sqlgraphs.html).
Any help? What do you think guys?
Thank you very much for your time =)
Cheers!
Well, I finally came up with the following solution.
It's the way I found to support multi-root trees and also cycling digraphs.
create table #ObjectRelations
(
Id varchar(20),
NextId varchar(20)
)
/* Cycle */
/*
insert into #ObjectRelations values ('A', 'B')
insert into #ObjectRelations values ('B', 'C')
insert into #ObjectRelations values ('C', 'A')
*/
/* Multi root */
insert into #ObjectRelations values ('G', 'B')
insert into #ObjectRelations values ('A', 'B')
insert into #ObjectRelations values ('B', 'C')
insert into #ObjectRelations values ('B', 'X')
insert into #ObjectRelations values ('C', 'E')
insert into #ObjectRelations values ('C', 'D')
insert into #ObjectRelations values ('E', 'F')
insert into #ObjectRelations values ('D', 'F')
create table #startIds
(
Id varchar(20) primary key
)
;WITH
Ids (Id) AS
(
SELECT Id
FROM #ObjectRelations
),
NextIds (Id) AS
(
SELECT NextId
FROM #ObjectRelations
)
INSERT INTO #startIds
/* For a cyclic graph this select returns nothing, because every object has a predecessor */
SELECT DISTINCT
Ids.Id
FROM
Ids
LEFT JOIN
NextIds on Ids.Id = NextIds.Id
WHERE
NextIds.Id IS NULL
UNION
/* So just pick any one; how the starting object of a cycle is chosen doesn't matter for this problem */
SELECT TOP 1 Id FROM Ids
;WITH Objects (Id, NextId, [Level], Way) AS
( -- This is the 'Anchor' or starting point of the recursive query
SELECT rel.Id,
rel.NextId,
1,
CAST(rel.Id as VARCHAR(MAX))
FROM #ObjectRelations rel
WHERE rel.Id IN (SELECT Id FROM #startIds)
UNION ALL -- This is the recursive portion of the query
SELECT rel.Id,
rel.NextId,
[Level] + 1,
RecObjects.Way + ', ' + rel.Id
FROM #ObjectRelations rel
INNER JOIN Objects RecObjects -- Note the reference to CTE table name (Recursive Join)
ON rel.Id = RecObjects.NextId
WHERE RecObjects.Way NOT LIKE '%' + rel.Id + '%'
)
SELECT DISTINCT
Id,
NextId,
[Level]
FROM Objects
ORDER BY [Level]
drop table #ObjectRelations
Could be useful for somebody. It is for me =P
Thanks
If you want to use all root objects as starting objects, you should first update your data to include information about the root objects (and the leaves). You should add the following inserts:
insert into #ObjectRelations values (NULL, 'G')
insert into #ObjectRelations values (NULL, 'A')
insert into #ObjectRelations values ('X', NULL)
insert into #ObjectRelations values ('F', NULL)
Of course you could also write your anchor query in such a way that you select as root nodes the records that have an Id that does not occur as a NextId, but this is easier.
Next, modify your anchor query to look like this:
SELECT rel.Id,
rel.NextId
FROM #ObjectRelations rel
WHERE rel.Id IS NULL
If you run this query, you'll see that you get a lot of duplicates: many arcs occur multiple times. This is because the anchor query now returns two results, so the tree is traversed twice.
This can be fixed by changing your select statement to this (note the DISTINCT):
SELECT DISTINCT o.*
FROM Objects o
If you don't want to do the inserts suggested by Ronald, this would do:
WITH CTE_MultiParent (Id, NextId)
AS
(
-- Anchor: arcs leaving root objects, i.e. Ids that never occur as a NextId
SELECT Id, NextId FROM #ObjectRelations
WHERE Id NOT IN
(
SELECT DISTINCT NextId FROM #ObjectRelations
)
UNION ALL
-- Recursive step: follow each arc to the arcs leaving its target
SELECT ObjR.Id, ObjR.NextId FROM #ObjectRelations ObjR INNER JOIN CTE_MultiParent
ON CTE_MultiParent.NextId = ObjR.Id
)
SELECT DISTINCT * FROM CTE_MultiParent
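To check the root-detection idea against the expected result set from the question, here is a minimal sketch with Python's sqlite3 module. It treats every Id that never occurs as a NextId as a root and walks the graph from all roots at once. The table name `ObjectRelations` (without the temp-table prefix) is a stand-in for this sketch. Note one deliberate difference from the T-SQL above: SQLite allows `UNION` inside a recursive CTE, which discards arcs that were already visited, whereas SQL Server requires `UNION ALL` there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ObjectRelations (Id TEXT, NextId TEXT)")
conn.executemany("INSERT INTO ObjectRelations VALUES (?, ?)", [
    ("G", "B"), ("A", "B"), ("B", "C"), ("B", "X"),
    ("C", "E"), ("C", "D"), ("E", "F"), ("D", "F"),
])

edges = conn.execute("""
    WITH RECURSIVE walk(Id, NextId) AS (
        -- Anchor: arcs leaving the roots (Ids that are nobody's NextId)
        SELECT Id, NextId
        FROM ObjectRelations
        WHERE Id NOT IN (SELECT NextId FROM ObjectRelations)
        UNION  -- SQLite only: drops arcs that were already produced
        SELECT rel.Id, rel.NextId
        FROM ObjectRelations AS rel
        JOIN walk ON rel.Id = walk.NextId
    )
    SELECT Id, NextId FROM walk ORDER BY Id, NextId
""").fetchall()

print(edges)
```

All eight arcs come back, including G->B. The `NOT IN` test assumes NextId is never NULL; with NULLs present it would match nothing, and a `NOT EXISTS` subquery would be the safer form.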