Is there a fundamental difference between INTERSECT and INNER JOIN? [duplicate] - sql

This question already has an answer here:
What's different between INTERSECT and JOIN?
(1 answer)
Closed 4 years ago.
I understand, that INNER JOIN is made for referenced keys and INTERSECT is not. But afaik in some cases, both of them can do the same thing. So, is there a difference (in performance or anything) between the following two expressions? And if there is, which one is better?
Expression 1:
SELECT id FROM customers
INNER JOIN orders ON customers.id = orders.customerID;
Expression 2:
SELECT id FROM customers
INTERSECT
SELECT customerID FROM orders

They are very different, even in your case.
The INNER JOIN will return duplicates, if id is duplicated in either table. INTERSECT removes duplicates. The INNER JOIN will never return NULL, but INTERSECT will return NULL.
The two are very different; INNER JOIN is an operator that generally matches on a limited set of columns and can return zero rows or more rows from either table. INTERSECT is a set-based operator that compares complete rows between two sets and can never return more rows than in the smaller table.

Try the following, for example:
CREATE TABLE #a (id INT)
CREATE TABLE #b (id INT)
INSERT INTO #a VALUES (1), (NULL), (2)
INSERT INTO #b VALUES (1), (NULL), (3), (1)
SELECT a.id FROM #a a
INNER JOIN #b b ON a.id = b.id
SELECT id FROM #a
INTERSECT
SELECT id FROM #b

Related

RIGHT JOIN in place of subselect - a genuine use case?

I have avoided RIGHT OUTER JOIN, since the same can be achieved using LEFT OUTER JOIN if you reorder the tables.
However, recently I have been working with the need to have large numbers of joins, and I often encounter a pattern where a series of INNER JOINs are LEFT JOINed to a sub select which itself contains many INNER JOINs:
SELECT *
FROM Tab_1 INNER JOIN Tab_2 INNER JOIN Tab_3...
LEFT JOIN (SELECT *
FROM Tab_4 INNER JOIN Tab_5 INNER JOIN Tab_6....
)...
The script is hard to read. I often encounter sub sub selects. Some are correlated sub-selects and performance across the board is not good (probably not only because of the way the scripts are written).
I could of tidy it up in several ways, such as using common table expressions, views, staging tables etc, but a single RIGHT JOIN could remove the need for the sub selects. In many cases, doing so would improve performance.
In the example below, is there a way to replicate the result given by the first two SELECT statements, but using only INNER and LEFT joins?
DECLARE #A TABLE (Id INT)
DECLARE #B TABLE (Id_A INT, Id_C INT)
DECLARE #C TABLE (Id INT)
INSERT #A VALUES (1),(2)
INSERT #B VALUES (1,10),(2,20),(1,20)
INSERT #C VALUES (10),(30)
-- Although we want to see all the rows in A, we only want to see rows in C that have a match in B, which must itself match A
SELECT A.Id, T.Id
FROM
#A AS A
LEFT JOIN ( SELECT *
FROM #B AS B
INNER JOIN #C AS C ON B.Id_C = C.Id) AS T ON A.Id = T.Id_A;
-- NB Right join as although B and C MUST match, we only want to see them if they also have a row in A - otherwise null.
SELECT A.Id, C.Id
FROM
#B AS B
INNER JOIN #C AS C ON B.Id_C = C.Id
RIGHT JOIN #A AS A ON B.Id_A = A.Id;
Would you rather see the long-winded sub-selects, or a RIGHT JOIN, assuming decent comments in each case?
All the articles I have ever read have said pretty much what I think about RIGHT JOINS, that they are unecessary and confusing. Is this case strong enough to break the cultural aversion?
As #jarlh wrote most people think LEFT to RIGHT as much more intuitive, so it's very confusing to see RIGHT joins in the code.
In this cases sometimes I found that SQL Server creates better query plans when I use OUTER APPLY in combination with WHERE EXISTS clauses, over your LEFT JOINs and inner INNER JOIN with WHERE EXISTS
The result is not much different of what you have in your first example:
SELECT A.Id, T.Id
FROM
#A AS A
OUTER APPLY (
SELECT C.Id FROM #C AS C
WHERE EXISTS (SELECT 1 FROM #B AS B WHERE A.Id = B.Id_a AND B.Id_C = C.Id) )T;
I have found an answer to this question in the old scripts that I was going through - I came across this syntax which performs the same function as the RIGHT JOIN example, using LEFT JOINs (or at least I think it does - it certainly gives the correct results in the example):
DECLARE #A TABLE (Id INT)
DECLARE #B TABLE (Id_A INT, Id_C INT)
DECLARE #C TABLE (Id INT)
INSERT #A VALUES (1),(2)
INSERT #B VALUES (1,10),(2,20),(1,20)
INSERT #C VALUES (10),(30)
SELECT
A.Id, C.Id
FROM
#A AS A
LEFT JOIN #B AS B
INNER JOIN #C AS C
ON C.Id = B.Id_C
ON B.Id_A = A.Id
I don't know if there is a name for this pattern, which I have not seen before in other places of work, but it seems to work like a "nested" join, allowing the LEFT JOIN to preserve rows from the later INNER JOIN.
EDIT: I have done some more research and apparently this is an ANSI SQL syntax for nesting joins, but... it does not seem to be very popular!
Descriptive Article
Relevant Stack Exchange Question and Answer

SQL Intersect VS Inner Join [duplicate]

This question already has an answer here:
What's different between INTERSECT and JOIN?
(1 answer)
Closed 4 years ago.
I understand, that INNER JOIN is made for referenced keys and INTERSECT is not. But afaik in some cases, both of them can do the same thing. So, is there a difference (in performance or anything) between the following two expressions? And if there is, which one is better?
Expression 1:
SELECT id FROM customers
INNER JOIN orders ON customers.id = orders.customerID;
Expression 2:
SELECT id FROM customers
INTERSECT
SELECT customerID FROM orders
They are very different, even in your case.
The INNER JOIN will return duplicates, if id is duplicated in either table. INTERSECT removes duplicates. The INNER JOIN will never return NULL, but INTERSECT will return NULL.
The two are very different; INNER JOIN is an operator that generally matches on a limited set of columns and can return zero rows or more rows from either table. INTERSECT is a set-based operator that compares complete rows between two sets and can never return more rows than in the smaller table.
Try the following, for example:
CREATE TABLE #a (id INT)
CREATE TABLE #b (id INT)
INSERT INTO #a VALUES (1), (NULL), (2)
INSERT INTO #b VALUES (1), (NULL), (3), (1)
SELECT a.id FROM #a a
INNER JOIN #b b ON a.id = b.id
SELECT id FROM #a
INTERSECT
SELECT id FROM #b

Not in or Not exist Query Very Slow for Large Data Sybase

I have a table A which is having around 50000 records and a table B which is having 50000 records as well.
sample data:
A B
1 1
2 2
3 null
4 null
I want to find records 3, 4 which are present in Table A but not in Table B.
I am using
select id from A where id NOT IN(select id from B)
I have also tried NOT Exist, but as the records are very large in number, it still takes a lot of time.
select id from A where NOT Exists(select id from B and B.id = A.id)
Left Outer Join cant be used to find the missing records as the id is not present in Table B.
Is there any way to make the Query Work Faster in Sybase itself?
Or Shifting the database to MongoDB is the solution?
I'm not sure why you are not prepare LEFT JOIN, I tried with the LEFT JOIN it returns your expected result.
Sample execution with the given data:
DECLARE #TableA TABLE (Id INT);
DECLARE #TableB TABLE (Id INT);
INSERT INTO #TableA (Id) VALUES (1), (2), (3), (4);
INSERT INTO #TableB (Id) VALUES (1), (2), (NULL), (NULL);
SELECT T1.Id
FROM #TableA T1
LEFT JOIN #TableB T2 ON T2.Id = T1.Id
WHERE T2.Id IS NULL
Result
3
4
In performance perspective, always try to avoid using inverse keywords like NOT IN, NOT EXISTS. Because to check the inverse items DBMS need to runs through all the available records and drop the inverse selection.
LEFT JOIN / IS NULL and NOT EXISTS are semantically equivalent, while NOT IN is not. These method differ in how they handle NULL values in table_right.
Therefore, You should go for LEFT JOIN to improve your sql performance.
select A.id from A LEFT JOIN B
on A.id = B.id
where B.id is null
order by A.id;

Is a scalar database function used in a join called once per distinct set of inputs or once per row?

If I have a sql statement like this:
select *
from tableA a
inner join tableB b on dbo.fn_something(a.ColX) = b.ColY
if you assume there are 5 rows in tableA with the same value for ColX will dbo.fn_something() be called with that value 5 times or just one time?
Clearly this is a trivial example, but I'm interested for the purposes of thinking about performance in a more complex scenario.
UPDATE
Thanks #DStanley, following from your answer I investigated further. Using SQL Profiler with the SP:StmtStarting event on the SQL below illustrates what happens. i.e. as you said: the function will be called once for each row in the join.
This has an extra join from the original question.
create table tableA
( id int )
create table tableB
( id_a int not null
, id_c int not null
)
create table tableC
( id int )
go
create function dbo.fn_something( #id int )
returns int
as
begin
return #id
end
go
-- add test data
-- 5 rows:
insert into tableA (id) values (1), (2), (3), (4), (5)
-- 5 rows:
insert into tableC (id) values (101), (102), (103), (104), (105)
-- 25 rows:
insert into tableB (id_a, id_c) select a.id, c.id from tableA a, tableC c
go
-- here dbo.fn_something() is called 25 times:
select *
from tableA a
inner join tableB b on a.id = b.id_a
inner join tableC c on c.id = dbo.fn_something(b.id_c)
-- here dbo.fn_something() is called just 5 times,
-- as the 'b.id_c < 102' happens to be applied first.
-- That's likely to depend on whether SQL thinks it's
-- faster to evaluate the '<' or the function.
select *
from tableA a
inner join tableB b on a.id = b.id_a
inner join tableC c on c.id = dbo.fn_something(b.id_c) and b.id_c < 102
go
drop table tableA ;
drop table tableB;
drop table tableC;
drop function dbo.fn_something;
go
It will be called for each row in a. I do not know of any optimization that would call the function just for unique inputs. If performance is an issue you could create a temp table with distinct input values and use thoce results in your join, but I would only do that it it was an issue - don't assume it's a problem and clutter your query unnecessarily.
If you declare your function as schema bound, it can be run one for each unique case. This requires that the function be deterministic and always has the same output for a given input.
CREATE FUNCTION dbo.fn_something (#id INT)
RETURNS INT
WITH SCHEMABINDING
AS
BEGIN
RETURN #id
END
GO

Simple SQL Server query across databases runs extremely slow R2

I have two databases on a local machine, connected to localhost. They both have roughly two million rows a piece. I was doing the following very simple join and it took over a minute to complete.
select distinct x.patid
from [i 3 sci study].dbo.clm_extract as x
left join [i 3 study].dbo.claims as y on y.patid=x.patid
where y.patid is null
When I looked at the execution plan I saw that the join showplan operator had this to say
Why is the actual number of rows so exorbitantly high compared to the actual number of rows in both tables?
The LEFT JOIN will match each row on the left with each row on the right, and then filter. Assuming patid is not unique in either table, the number of possible match combinations could get very high.
Try the following:
SET NOCOUNT ON;
GO
CREATE TABLE #t1 (Id INT NOT NULL);
CREATE TABLE #t2 (Id INT NOT NULL);
GO
INSERT #t1 (Id)
VALUES (1);
GO 100
INSERT #t2 (Id)
SELECT Id FROM #t1;
GO
Now look at the execution plan for the left join query form:
SELECT *
FROM #t1
LEFT OUTER JOIN #t2 ON #t1.Id = #t2.Id
WHERE #t2.Id IS NULL;
Looking at the execution plan, the hash join shows 10,000 actual rows (100 from #t1 x 100 from #t2). This shows the advantage of checking for existence (or a lack thereof) using any of the following T-SQL syntaxes:
SELECT #t1.Id
FROM #t1
WHERE NOT EXISTS (SELECT * FROM #t2 WHERE Id = #t1.Id);
-- #t2.Id must not contain any NULLs for this to be correct
SELECT #t1.Id
FROM #t1
WHERE Id NOT IN (SELECT #t2.Id FROM #t2);
-- Returns DISTINCT #t1 values
SELECT Id
FROM #t1
EXCEPT
SELECT Id
FROM #t2;
Checking for a lack of existence enables the engine to short circuit. This is due to the anti semi join. As soon as the first match is found, it moves on to the next record. For more details, see this blog post.