Are the SQL concepts LEFT OUTER JOIN and WHERE NOT EXISTS basically the same? - sql

Whats the difference between using a LEFT OUTER JOIN, rather than a sub-query that starts with a WHERE NOT EXISTS (...)?

No they are not the same thing, as they will not return the same rowset in the most simplistic use case.
The LEFT OUTER JOIN will return all rows from the left table, both where rows exist in the related table and where they does not. The WHERE NOT EXISTS() subquery will only return rows where the relationship is not met.
However, if you did a LEFT OUTER JOIN and looked for IS NULL on the foreign key column in the WHERE clause, you can make equivalent behavior to the WHERE NOT EXISTS.
For example this:
SELECT
t_main.*
FROM
t_main
LEFT OUTER JOIN t_related ON t_main.id = t_related.id
/* IS NULL in the WHERE clause */
WHERE t_related.id IS NULL
Is equivalent to this:
SELECT
t_main.*
FROM t_main
WHERE
NOT EXISTS (
SELECT t_related.id
FROM t_related
WHERE t_main.id = t_related.id
)
But this one is not equivalent:
It will return rows from t_main both having and not having related rows in t_related.
SELECT
t_main.*
FROM
t_main
LEFT OUTER JOIN t_related ON t_main.id = t_related.id
/* WHERE clause does not exclude NULL foreign keys */
Note This does not speak to how the queries are compiled and executed, which differs as well -- this only addresses a comparison of the rowsets they return.

As Michael already answered your question here is a quick sample to illustrate the difference:
Table A
Key Data
1 somedata1
2 somedata2
Table B
Key Data
1 data1
Left outer join:
SELECT *
FROM A
LEFT OUTER JOIN B
ON A.Key = B.Key
Result:
Key Data Key Data
1 somedata1 1
2 somedata2 null null
EXISTS use:
SELECT *
FROM A WHERE EXISTS ( SELECT B.Key FROM B WHERE A.Key = B.Key )
Not Exists In:
SELECT *
FROM A WHERE NOT EXISTS ( SELECT B.Key FROM B WHERE A.Key = B.Key )
Result:
Key Data
2 somedata2

Left outer join is more flexible than where not exists. You must use a left outer join if you want to return any of the columns from the child table. You can also use the left outer join to return records that match the parent table as well as all records in the parent table that have no match. Where not exists only lets you return the records with no match.
However in the case where they do return the equivalent rows and you do not need any of the columns in the right table, then where exists is likely to be the more performant choice (at least in SQL server, I don't know about other dbs).

I suspect the answer ultimately is, both are used (among other constructs) to perform the relational operation antijoin in SQL.

I suspect the OP wanted to know which construct is better when they are functionally the same (ie I want to see only rows where there is no match in the secondary table).
As such, WHERE NOT EXISTS will always be as quick or quicker, so is a good habit to get into.

Related

What's the purpose of a JOIN where no column from 2nd table is being used?

I am looking through some hive queries we are running as part of analytics on our hadoop cluster, but I am having trouble understanding one. This is the Hive QL query
SELECT
c_id, v_id, COUNT(DISTINCT(m_id)) AS participants,
cast(date_sub(current_date, ${window}) as string) as event_date
from (
select
a.c_id, a.v_id, a.user_id,
case
when c.id1 is not null and a.timestamp <= c.stitching_ts then c.id2 else a.m_id
end as m_id
from (
select * from first
where event_date <= cast(date_sub(current_date, ${window}) as string)
) a
join (
select * from second
) b on a.c_id = b.c_id
left join third c
on a.user_id = c.id1
) dx
group by c_id, v_id;
I have changed the names but otherwise this is the select statement being used to insert overwrite to another table.
Regarding the join
join (
select * from second
) b on a.c_id = b.c_id
b is not used anywhere except for join condition, so is this join serving any purpose at all?
Is it for making sure that this join only has entries where c_id is present in second table? Would a where IN condition be better if thats all this is doing.
Or I can just remove this join and it won't make any difference at all.
Thanks.
Join (any inner, left or right) can duplicate rows if join key in joined dataset is not unique. For example if a contains single row with c_id=1 and b contains two rows with c_id=1, the result will be two rows with a.c_id=1.
Join (inner) can filter rows if join key is absent in joined dataset. I believe this is what it meant to do.
If the goal is to get only rows with keys present in both datasets(filter) and you do not want duplication, and you do not use columns from joined dataset, then better use LEFT SEMI JOIN instead of JOIN, it will work as filter only even if there are duplicated keys in joined dataset:
left semi join (
select c_id from second
) b on a.c_id = b.c_id
This is much safer way to filter rows only which exist in both a and b and avoid unintended duplication.
You can replace join with WHERE IN/EXISTS, but it makes no difference, it is implemented as the same JOIN, check the EXPLAIN output and you will see the same query plan. Better use LEFT SEMI JOIN, it implements uncorrelated IN/EXISTS in efficient way.
If you prefer to move it to the WHERE:
WHERE a.c_id IN (select c_id from second)
or correlated EXISTS:
WHERE EXISTS (select 1 from second b where a.c_id=b.c_id)
But as I said, all of them are implemented internally using JOIN operator.

sql left join returns

I am trying to run a left join on 2 tables. I do not have a group by and the only where condition i have is on the second table. But, the returned rows are less than the first table. isn't the left join suppose to bring all the data from the first table?
Here is my SQL:
select *
from tbl_a A left join tbl_b B
ON
A.Cnumber=B.Cnumber
and A.CDNUmber=B.CDNumber
and abs(A.duration - B.Duration)<2
and substr(A.text,1,3)||substr(A.text,5,8)||substr(A.text,9,2)=substr(B.text,1,8)
where B.fixed = 'b580'
There are 140,000 records in table A but the result returned is less than 100,000 records. What is the problem and how can I solve it?
As soon as you put a condition in the WHERE clause that references the right table and doesn't accommodate the NULLs that will be produced when the join is unsuccessful, you've transformed it (effectively) back into an INNER JOIN.
Try:
where B.fixed = 'b580' OR B.fixed IS NULL
Or add this condition to the ON clause for the JOIN.
You should add the where clause to the join:
select *
from tbl_a A left join tbl_b B
ON
A.Cnumber=B.Cnumber
and A.CDNUmber=B.CDNumber
and abs(A.duration - B.Duration)<2
and substr(A.text,1,3)||substr(A.text,5,8)||substr(A.text,9,2)=substr(B.text,1,8)
and B.fixed = 'b580'
If you use where statemen all records where b is not existing will not returned.

What means "table A left outer join table B ON TRUE"?

I know conditions are used in table joining. But I met a specific situation and the SQL codes writes like "Table A join table B ON TRUE"
What will happen based on the "ON TRUE" condition? Is that just a total cross join without any condition selection?
Actually, the original expression is like:
Table A LEFT outer join table B on TRUE
Let's say A has m rows and B has n rows. Is there any conflict between "left outer join" and "on true"? Because it seems "on true" results a cross join.
From what I guess, the result will be m*n rows. So, it has no need to write "left outer join", just a "join" will give the same output, right?
Yes. That's the same thing as a CROSS JOIN.
In MySQL, we can omit the [optional] CROSS keyword. We can also omit the ON clause.
The condition in the ON clause is evaluated as a boolean, so we could also jave written something like ON 1=1.
UPDATE:
(The question was edited, to add another question about a LEFT [OUTER] JOIN b which is different than the original construct: a JOIN b)
The "LEFT [OUTER] JOIN" is slightly different, in that rows from the table on the left side will be returned even when there are no matching rows found in the table on the right side.
As noted, a CROSS JOIN between tables a (containing m rows) and table b containing n rows, absent any other predicates, will produce a resultset of m x n rows.
The LEFT [OUTER] JOIN will produce a different resultset in the special case where table b contains 0 rows.
CREATE TABLE a (i INT);
CREATE TABLE b (i INT);
INSERT INTO a VALUES (1),(2),(3);
SELECT a.i, b.i FROM a LEFT JOIN b ON TRUE ;
Note that the LEFT JOIN will returns rows from table a (a total of m rows) even when table b contains 0 rows.
A cross join produces a cartesian product between the two tables, returning all possible combinations of all rows. It has no on clause because you're just joining everything to everything.
Cross join does not combine the rows, if you have 100 rows in each table with 1 to 1 match, you get 10.000 results, Innerjoin will only return 100 rows in the same situation.
These 2 examples will return the same result:
Cross join
select * from table1 cross join table2 where table1.id = table2.fk_id
Inner join
select * from table1 join table2 on table1.id = table2.fk_id
Use the last method
The join syntax's general form:
SELECT *
FROM table_a
JOIN table_b ON condition
The condition is used to tell the database how to match rows from table_a to table_b, and would usually look like table_a.some_id = table_b.some_id.
If you just specify true, you will match every row from table_a with every row of table_b, so if table_a contains n rows and table_b contains m rows the result would have m*n rows.
Most(?) modern databases have a cleaner syntax for this, though:
SELECT *
FROM table_a
CROSS JOIN table_b
The difference between the pure cross join and left join (where the condition is forced to be always true, as when using ON TRUE) is that the result set for the left join will also have rows where the left table's rows appear next to a bunch of NULLs where the right table's columns would have been.

Need help with a sql query that has an inner and outer join

I really need help getting this query right. I can't share actual table and column names, but will try my best to layout the problem simply.
Assume the following tables. The tables and keys CANNOT be changed. Period. I don't care if you think it's a bad design, this question isn't a design question, it's on SQL syntax.
Table A - Primary key named id1
Table B - Contains two foreign keys, TableA.id1 and Foo.id2(ignore Foo, it doesn't matter for this)
Table C - Contains two foreign keys, TableA.id1 and Foo.id2, additional interesting
columns.
Constraints:
The SQL gets a set of id1s passed in as an argument.
It must return a list of Table C rows.
It must only return Table C rows where a Table B row exists with a matching TableA.id1 and Foo.id2 - There ARE rows in Table C that don't match Table B
A row MUST be returned for every id1 passed in, even if no Table C row exists.
At first I tried a Left Outer Join from Table A to Table B then an Inner Join to Table C. That violates the 4th rule above, as the Inner Join drops out those rows.
Next I tried two Left Outer joins. This is closer, but has the side effect of including rows that match the Table A join to Table B, but don't have a corresponding Table C entry, which isn't what I want.
So, here's what I came up with.
SELECT
a.id1,
c.*
FROM
TableB b
INNER JOIN
TableC c USING (id1,id2)
RIGHT OUTER JOIN
TableA a USING (id1)
WHERE
a.id1 in (x,y,z)
I'm a bit wary of a Right Outer Join, as the documentation I've read says it can be replaced with a Left Outer, but it doesn't appear so for this case. It also seems a bit rare, which is making other devs nervous, so I'm being cautious.
So, three questions in one.
Is this correct?
Did I use the Right Outer Join correctly?
Is there a cleaner way to achieve the same thing?
EDIT: DB is MySQL
You can rewrite it as a LEFT OUTER JOIN by using parentheses. In pseudo-SQL change this:
SELECT ...
FROM b
INNER JOIN c ON ...
RIGHT OUTER JOIN a ON ...
to this:
SELECT ...
FROM a
LEFT OUTER JOIN (
b INNER JOIN c ON ...
) ON ...
You can use an EXISTS clause, which sometimes works better
SELECT
a.id1,
c.*
FROM TableA a
LEFT JOIN TableC c
ON c.id1 = a.id1 AND EXISTS (
select *
from TableB b
where b.id1=c.id1 and b.id2=c.id2)
WHERE
a.id1 in (x,y,z)
As you have written it, it works because ANSI JOINs are always processed top to bottom. Since you need to test B against C before joining to A, it is about the only way to write it without introducing a subquery [(B x C) RIGHT JOIN A]. However, a bad query plan could perform all records in B and C (B x C) before right joining to A.
The EXISTS method efficiently uses the filter on A, then LEFT JOINs to C and for each C found, validates that it also exists in B (or discards).
Q's
Yes your query is correct
Yes
EXISTS should work better
Yeah, you need to start with TableA and then add tables B and C using joins. The only reason you even need TableA is to make sure you have a row for each parameter.
Select a.id1,c.*
From
TableA a
Left Join TableB b on a.id1=b.id1
Left Join TableC c on b.id1=c.id1 and b.id2=c.id2
Where a.id1 in (x,y,z)
You need to do OUTER joins all the way across, or rows that are missing in B will also cause data from A to be filtered out of the result set. By joining C to B (instead of directly to A) you are using B to filter. You could do it with a complicated EXISTS clause, but this is cleaner.

How to exclude rows that don't join with another table?

I have two tables, one has primary key other has it as a foreign key.
I want to pull data from the primary table, only if the secondary table does not have an entry containing it's key. Sort of an opposite of a simple inner join, which returns only rows that join together by that key.
SELECT <select_list>
FROM Table_A A
LEFT JOIN Table_B B
ON A.Key = B.Key
WHERE B.Key IS NULL
Full image of join
From aticle : http://www.codeproject.com/KB/database/Visual_SQL_Joins.aspx
SELECT
*
FROM
primarytable P
WHERE
NOT EXISTS (SELECT * FROM secondarytable S
WHERE
P.PKCol = S.FKCol)
Generally, (NOT) EXISTS is a better choice then (NOT) IN or (LEFT) JOIN
use a "not exists" left join:
SELECT p.*
FROM primary_table p LEFT JOIN second s ON p.ID = s.ID
WHERE s.ID IS NULL
Another solution is:
SELECT * FROM TABLE1 WHERE id NOT IN (SELECT id FROM TABLE2)
SELECT P.*
FROM primary_table P
LEFT JOIN secondary_table S on P.id = S.p_id
WHERE S.p_id IS NULL
If you want to select the columns from First Table "which are also present in Second table, then in this case you can also use EXCEPT. In this case, column names can be different as well but data type should be same.
Example:
select ID, FName
from FirstTable
EXCEPT
select ID, SName
from SecondTable
This was helpful to use in COGNOS because creating a SQL "Not in" statement in Cognos was allowed, but it took too long to run. I had manually coded table A to join to table B in in Cognos as A.key "not in" B.key, but the query was taking too long/not returning results after 5 minutes.
For anyone else that is looking for a "NOT IN" solution in Cognos, here is what I did. Create a Query that joins table A and B with a LEFT JOIN in Cognos by selecting link type: table A.Key has "0 to N" values in table B, then added a Filter (these correspond to Where Clauses) for: table B.Key is NULL.
Ran fast and like a charm.