Intersecting a dynamic number of tables - sql

I have two relations, a and b, with attributes given by
CREATE TABLE a (id int, b_id int)
CREATE TABLE b (id int)
for which I can assume that all pairs of values in a and all values in b are unique, and which will be based in an SQL Server 2016 database.
A given element of b defines a subset of a.id given by those elements for which the corresponding a.b_id is the given value, and my goal is to produce the intersection of all those subsets.
Say, for instance, that a contains the six values,
INSERT INTO a VALUES (1, 1), (1, 2), (1, 3), (2, 2), (3, 2), (3, 3)
Then the expected results would include the following:
b: (1), (2), (3). Expected result: (1)
b: (1), (2). Expected result: (1)
b: (2). Expected result: (1), (2), (3)
b: (2), (3). Expected result: (1), (3)
b: [Empty]. Expected result: (1), (2), (3)
Using uniqueness, for the case of non-empty b, this can be achieved through
SELECT a.id FROM a
JOIN b on a.b_id = b.id
GROUP BY a.id
HAVING COUNT(a.id) = (SELECT COUNT(*) FROM b)
but this feels clunky, given that SQL has an INTERSECT operator readily available, and were I to write the same query in, say, LINQ, I would simply aggregate intersections. It also fails to produce the desired result in the case of empty b without treating that as a special case.
So, the question becomes: Is there a more idiomatic way of performing the above query, which also works properly for trivial b?

What you are trying to do is called a relational division [1, 2].
CREATE TABLE a (id INT, b_id INT);
CREATE TABLE b (id INT);
INSERT INTO a VALUES
(1, 1), (1, 2), (1, 3),
(2, 2), (3, 2), (3, 3);
DECLARE #i INT = 0;
WHILE #i < 5 BEGIN
TRUNCATE TABLE b;
IF #i = 0 INSERT b VALUES (1), (2), (3);
IF #i = 1 INSERT b VALUES (1), (2);
IF #i = 2 INSERT b VALUES (2);
IF #i = 3 INSERT b VALUES (2), (3);
SELECT DISTINCT x.id
FROM a AS x
WHERE NOT EXISTS (
SELECT *
FROM b AS y
WHERE NOT EXISTS (
SELECT *
FROM a AS z
WHERE z.id = x.id AND z.b_id=y.id
)
)
SET #i = #i + 1;
END;
Test it online.

You could extend your approach to handle empty set case by using:
SELECT a.id
FROM a
LEFT JOIN b on a.b_id = b.id
GROUP BY a.id
HAVING COUNT(b.id) = (SELECT COUNT(*) FROM b);
DBFiddle Demo | DBFiddle Demo - all test cases
Extra: Transient Data (used in second demo)
EDIT:
Another approach to handle empty set and leave INNER JOIN:
SELECT a.id FROM a
JOIN b on a.b_id = b.id
GROUP BY a.id
HAVING COUNT(a.id) = (SELECT COUNT(*) FROM b)
UNION
SELECT a.id
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b);
DBFiddle Demo 3

Related

Find data by multiple Lookup table clauses

declare #Character table (id int, [name] varchar(12));
insert into #Character (id, [name])
values
(1, 'tom'),
(2, 'jerry'),
(3, 'dog');
declare #NameToCharacter table (id int, nameId int, characterId int);
insert into #NameToCharacter (id, nameId, characterId)
values
(1, 1, 1),
(2, 1, 3),
(3, 1, 2),
(4, 2, 1);
The Name Table has more than just 1,2,3 and the list to parse on is dynamic
NameTable
id | name
----------
1 foo
2 bar
3 steak
CharacterTable
id | name
---------
1 tom
2 jerry
3 dog
NameToCharacterTable
id | nameId | characterId
1 1 1
2 1 3
3 1 2
4 2 1
I am looking for a query that will return a character that has two names. For example
With the above data only "tom" will be returned.
SELECT *
FROM nameToCharacterTable
WHERE nameId in (1,2)
The in clause will return every row that has a 1 or a 3. I want to only return the rows that have both a 1 and a 3.
I am stumped I have tried everything I know and do not want to resort to dynamic SQL. Any help would be great
The 1,3 in this example will be a dynamic list of integers. for example it could be 1,3,4,5,.....
Filter out a count of how many times the Character appears in the CharacterToName table matching the list you are providing (which I have assumed you can convert into a table variable or temp table) e.g.
declare #Character table (id int, [name] varchar(12));
insert into #Character (id, [name])
values
(1, 'tom'),
(2, 'jerry'),
(3, 'dog');
declare #NameToCharacter table (id int, nameId int, characterId int);
insert into #NameToCharacter (id, nameId, characterId)
values
(1, 1, 1),
(2, 1, 3),
(3, 1, 2),
(4, 2, 1);
declare #RequiredNames table (nameId int);
insert into #RequiredNames (nameId)
values
(1),
(2);
select *
from #Character C
where (
select count(*)
from #NameToCharacter NC
where NC.characterId = c.id
and NC.nameId in (select nameId from #RequiredNames)
) = 2;
Returns:
id
name
1
tom
Note: Providing DDL+DML as shown here makes it much easier for people to assist you.
This is classic Relational Division With Remainder.
There are a number of different solutions. #DaleK has given you an excellent one: inner-join everything, then check that each set has the right amount. This is normally the fastest solution.
If you want to ensure it works with a dynamic amount of rows, just change the last line to
) = (SELECT COUNT(*) FROM #RequiredNames);
Two other common solutions exist.
Left-join and check that all rows were joined
SELECT *
FROM #Character c
WHERE EXISTS (SELECT 1
FROM #RequiredNames rn
LEFT JOIN #NameToCharacter nc ON nc.nameId = rn.nameId AND nc.characterId = c.id
HAVING COUNT(*) = COUNT(nc.nameId) -- all rows are joined
);
Double anti-join, in other words: there are no "required" that are "not in the set"
SELECT *
FROM #Character c
WHERE NOT EXISTS (SELECT 1
FROM #RequiredNames rn
WHERE NOT EXISTS (SELECT 1
FROM #NameToCharacter nc
WHERE nc.nameId = rn.nameId AND nc.characterId = c.id
)
);
A variation on the one from the other answer uses a windowed aggregate instead of a subquery. I don't think this is performant, but it may have uses in certain cases.
SELECT *
FROM #Character c
WHERE EXISTS (SELECT 1
FROM (
SELECT *, COUNT(*) OVER () AS cnt
FROM #RequiredNames
) rn
JOIN #NameToCharacter nc ON nc.nameId = rn.nameId AND nc.characterId = c.id
HAVING COUNT(*) = MIN(rn.cnt)
);
db<>fiddle

Find Groups that don't contain all records

I feel like I should be able to get this and I'm just having a brain fart. I've simplified the problem to the following example:
DECLARE #A TABLE (ID int);
DECLARE #B TABLE (GroupID char(1), ID int);
INSERT #A VALUES (1);
INSERT #A VALUES (2);
INSERT #A VALUES (3);
INSERT #B VALUES ('X', 1);
INSERT #B VALUES ('X', 2);
INSERT #B VALUES ('X', 3);
INSERT #B VALUES ('Y', 1);
INSERT #B VALUES ('Y', 2);
INSERT #B VALUES ('Z', 1);
INSERT #B VALUES ('Z', 2);
INSERT #B VALUES ('Z', 3);
INSERT #B VALUES ('Z', 4);
So table A contains a set of some records. Table B contains multiple copies of the set contained in A with Group IDs. But some of those groups may be missing one or more records of the set. I want to find the groups that are missing records. So in the above example, my results should be:
GroupID
-------
Y
But for some reason I just can't wrap my head around this, today. Any help would be appreciated.
Awesome use-case for relational division! (Here's a must-read blog post about it)
SELECT DISTINCT b1.GroupID
FROM #B b1
WHERE EXISTS (
SELECT 1
FROM #A a
WHERE NOT EXISTS (
SELECT 1
FROM #B b2
WHERE b1.GroupID = b2.GroupID
AND b2.ID = a.ID
)
);
How to read this?
I want all distinct GroupIDs in #B for which there is a record in #A for which there isn't a record in #B with the same #A.ID
In fact, this is the "remainder" of the relational division.
try this
SELECT GroupID ,COUNT(GroupID )
FROM #a INNER JOIN #b
ON #a.id=#b.id
GROUP BY GroupID
HAVING COUNT(GroupID )<(SELECT count(*) FROM #a)
This will give you all the combinations that are missing.
select FullList.*
from (select distinct a.ID,
b.GroupId
from #A a
cross join #B b) FullList
left join #B b
on FullList.ID = b.ID
and FullList.GroupID = b.GroupID
where b.ID is null
The answer to your question would just be the same but with the first line:
select distinct FullList.GroupID
This will give you all the combinations that are missing.
select FullList.*
from (select distinct a.ID,
b.GroupId
from #A a
cross join (select distinct db.GroupId from #B db) b
) as FullList
left join #B b
on FullList.ID = b.ID
and FullList.GroupID = b.GroupID
where b.ID is null
The answer to your question would just be the same but with the first line:
select distinct FullList.GroupID

project a sparse result at some level

I don't really know what to call this but it's not that hard to explain
Basically what I have is a result like this
Similarity ColumnA ColumnB ColumnC
1 SomeValue NULL SomeValue
2 NULL SomeB NULL
3 SomeValue NULL SomeC
4 SomeA NULL NULL
This result is created by matching a set of strings against another table. Each string also contains some values for these ColumnA..C which are the values I wan't to aggregate in some way.
Something like min/max works very well but I can't figure out how to get it to account for the highest similarity not just the min/max value. I don't really want the min/max, I want the first non-null value with the highest similarity.
Ideally the result would look like this
ColumnA ColumnB ColumnC
SomeA SomeB SomeC
I'd like be able to efficiently join in the temporary result to compute the rest and I've been exploring different options. Something which I've been considering is creating a SQL Server CLR aggregate the yields the "first" non-null value but I'm unsure if there's even such a thing as a first or last when running an aggregate on a result.
Okay, so I figured it out, I originally had trouble with the UPDATE FROM and JOIN not playing well together. I was counting on that the UPDATE would just occur multiple times and that would give me the correct results, however, there's no such guarantee from SQL Server (it's actually undefined behavior and alltough it appeared to work we'll have none of that) but since you can run UPDATE against a CTE I combined that with the OUTER APPLY to select the exactly 1 row to complement a missing value if possible.
Here's the whole thing with test data as well.
DECLARE #cost TABLE (
make nvarchar(100) not null,
model nvarchar(100),
a numeric(18,2),
b numeric(18,2)
);
INSERT #cost VALUES ('a%', null, 100, 2);
INSERT #cost VALUES ('a%', 'a%', 149, null);
INSERT #cost VALUES ('a%', 'ab', 349, null);
INSERT #cost VALUES ('b', null, null, 2.5);
INSERT #cost VALUES ('b', 'b%', 249, null);
INSERT #cost VALUES ('b', 'b', null, 3);
DECLARE #unit TABLE (
id int,
make nvarchar(100) not null,
model nvarchar(100)
);
INSERT #unit VALUES (1, 'a', null);
INSERT #unit VALUES (2, 'a', 'a');
INSERT #unit VALUES (3, 'a', 'ab');
INSERT #unit VALUES (4, 'b', null);
INSERT #unit VALUES (5, 'b', 'b');
DECLARE #tmp TABLE (
id int,
specificity int,
a numeric(18,2),
b numeric(18,2),
primary key(id, specificity)
);
INSERT #tmp
OUTPUT inserted.* --FOR DEBUGGING
SELECT
unit.id
, ROW_NUMBER() OVER (
PARTITION BY unit.id
ORDER BY cost.make DESC, cost.model DESC
) AS specificity
, cost.a
, cost.b
FROM #unit unit
INNER JOIN #cost cost ON unit.make LIKE cost.make
AND (cost.model IS NULL OR unit.model LIKE cost.model)
;
--fix the holes
WITH tmp AS (
SELECT *
FROM #tmp
WHERE specificity = 1
AND (a IS NULL OR b IS NULL) --where necessary
)
UPDATE tmp
SET
tmp.a = COALESCE(tmp.a, a.a)
, tmp.b = COALESCE(tmp.b, b.b)
OUTPUT inserted.* --FOR DEBUGGING
FROM tmp
OUTER APPLY (
SELECT TOP 1 a
FROM #tmp a
WHERE a.id = tmp.id
AND a.specificity > 1
AND a.a IS NOT NULL
ORDER BY a.specificity
) a
OUTER APPLY (
SELECT TOP 1 b
FROM #tmp b
WHERE b.id = tmp.id
AND b.specificity > 1
AND b.b IS NOT NULL
ORDER BY b.specificity
) b
;

How can I make one table that "overrides" another?

Suppose I have two tables:
CREATE TABLE A(
id INT PRIMARY KEY,
x INT,
y INT
)
CREATE TABLE B(
id INT PRIMARY KEY,
x INT,
y INT,
)
Table A contains data brought in from another vendor while table B is our data. For simplicity, I've made these tables be exactly the same in terms of schema, but table B would likely be a superset of table A (it would contain some columns that table A wouldn't in other words).
What I would like to do is create a view C with columns id, x, and y such that the values come from table B unless they're NULL in which case they would come from table A. For instance, suppose I had the following:
INSERT INTO A (id, x, y)
VALUES (1, 2, 3);
INSERT INTO B (id, x, y)
VALUES (1, NULL, NULL);
INSERT INTO A (id, x, y)
VALUES (2, 3, 4);
INSERT INTO B (id, x, y)
VALUES (2, 5, 6);
INSERT INTO A(id, x, y)
VALUES (3, 4, 5);
INSERT INTO B(id, x, y)
VALUES (3, 5, NULL);
So that if I select * from C, I'd get the following rows:
(1, 2, 3)
(2, 5, 6)
(3, 5, 5)
How could I create such a view?
You could join the tables together with a left join, and then select the right columns with a case:
select case when A.x is null then B.x else A.x end
, case when A.y is null then B.y else A.y end
from A
left join
B
on A.id = b.id
Try:
Create view C as
select B.ID,
coalesce(B.x,A.x) x,
coalesce(B.y,A.y) y
from B
left join A
on B.ID = A.ID

T-Sql Query - Get Unique Rows Across 2 Columns

I have a set of data, with columns x and y. This set contains rows where, for any 2 given values, A and B, there is a row with A and B in columns x and y respectivly and there will be a second row with B and A in columns x and y respectivly.
E.g
**Column X** **Column Y**
Row 1 A B
Row 2 B A
There are multiple pairs of data in
this set that follow this rule.
For every row with A, B in Columns
X and Y, there will always be a
row with B, A in X and Y
Columns X and Y are of type int
I need a T-Sql query that given a set with the rules above will return me either Row 1 or Row 2, but not both.
Either the answer is very difficult, or its so easy that I can't see the forest for the trees, either way it's driving me up the wall.
Add to your query the predicate,
where X < Y
and you can never get row two, but will always get row one.
(This assumes that when you wrote "two given values" you meant two distinct given values; if the two values can be the same, add the predicate where X <= Y (to get rid of all "reversed" rows where X > Y) and then add a distinct to your select list (to collapse any two rows where X == Y into one row).)
In reply to comments:
That is, if currently your query is select foo, x, y from sometable where foo < 3; change it to select foo, x, y from sometable where foo < 3 and x < y;, or for the the second case (where X and Y are not distinct values) select distinct foo, x, y from sometable where foo < 3 and x <= y;.
This should work.
Declare #t Table (PK Int Primary Key Identity(1, 1), A int, B int);
Insert into #t values (1, 2);
Insert into #t values (2, 1);
Insert into #t values (3, 4);
Insert into #t values (4, 3);
Insert into #t values (5, 6);
Insert into #t values (6, 5);
Declare #Table Table (ID Int Primary Key Identity(1, 1), PK Int, A Int, B Int);
Declare #Current Int;
Declare #A Int;
Insert Into #Table
Select PK, A, B
From #t;
Set #Current = 1;
While (#Current <= (Select Max(ID) From #Table) Begin
Select #A = A
From #Table
Where ID = #Current;
If (#A Is Not Null) Begin
Delete From #Table Where B = #A;
If ((Select COUNT(*) From #Table Where A = #A) > 1) Begin
Delete From #Table Where ID = #Current;
End
End
Set #A = Null;
Set #Current = #Current + 1;
End
Select a.*
From #tAs a
Inner Join #Table As b On a.PK = b.PK
SELECT O.X, O.Y
FROM myTable O
WHERE EXISTS (SELECT X, Y FROM myTable I WHERE I.X = O.Y AND I.Y = O.X)
I have not tried this. But, this should work.
To get the highest and lowest of each pair, you could use:
(X+Y+ABS(X-Y)) / 2 as High, (X+Y-ABS(X-Y)) / 2 as Low
So now use DISTINCT to get the pairs of them.
SELECT DISTINCT
(X+Y+ABS(X-Y)) / 2 as High, (X+Y-ABS(X-Y)) / 2 as Low
FROM YourTable