Are left outer joins associative? - sql

It's easy to understand why left outer joins are not commutative, but I'm having some trouble understanding whether they are associative. Several online sources suggest that they are not, but I haven't managed to convince myself that this is the case.
Suppose we have three tables: A, B, and C.
Let A contain two columns, ID and B_ID, where ID is the primary key of table A and B_ID is a foreign key corresponding to the primary key of table B.
Let B contain two columns, ID and C_ID, where ID is the primary key of table B and C_ID is a foreign key corresponding to the primary key of table C.
Let C contain two columns, ID and VALUE, where ID is the primary key of table C and VALUE just contains some arbitrary values.
Then shouldn't (A left outer join B) left outer join C be equal to A left outer join (B left outer join C)?

In this thread, it is said that they are not associative: Is LEFT OUTER JOIN associative?
However, I've found a book online which states that OUTER JOINs are associative when the tables on the far left side and far right side have no attributes in common (here).
Here is a graphical presentation (MSPaint ftw):
Another way to look at it:
Since you said that table A joins with B, and B joins with C, then:
When you first join A and B, you are left with all records from A. Some of them have values from B. Now, for some of those rows for which you got value from B, you get values from C.
When you first join B and C, you end up with the whole table B, where some of the records have values from C. Now, you take all records from A and join some of them with rows from B joined with C. Here, again, you get all rows from A, but some of them have values from B, some of which have values from C.
I don't see any possibility where, under the conditions you described, there would be data loss depending on the sequence of LEFT joins.
Based on the data provided by Tilak in his answer (which is now deleted), I've built a simple test case:
CREATE TABLE atab (id NUMBER, val VARCHAR2(10));
CREATE TABLE btab (id NUMBER, val VARCHAR2(10));
CREATE TABLE ctab (id NUMBER, val VARCHAR2(10));
INSERT INTO atab VALUES (1, 'A1');
INSERT INTO atab VALUES (2, 'A2');
INSERT INTO atab VALUES (3, 'A3');
INSERT INTO btab VALUES (1, 'B1');
INSERT INTO btab VALUES (2, 'B2');
INSERT INTO btab VALUES (4, 'B4');
INSERT INTO ctab VALUES (1, 'C1');
INSERT INTO ctab VALUES (3, 'C3');
INSERT INTO ctab VALUES (5, 'C5');
SELECT ab.aid, ab.aval, ab.bval, c.val AS cval
FROM (
SELECT a.id AS aid, a.val AS aval, b.id AS bid, b.val AS bval
FROM atab a LEFT OUTER JOIN btab b ON (a.id = b.id)
) ab
LEFT OUTER JOIN ctab c ON (ab.bid = c.id)
ORDER BY ab.aid
;
AID AVAL BVAL CVAL
---------- ---------- ---------- ----------
1 A1 B1 C1
2 A2 B2
3 A3
SELECT a.id, a.val AS aval, bc.bval, bc.cval
FROM
atab a
LEFT OUTER JOIN (
SELECT b.id AS bid, b.val AS bval, c.id AS cid, c.val AS cval
FROM btab b LEFT OUTER JOIN ctab c ON (b.id = c.id)
) bc
ON (a.id = bc.bid)
ORDER BY a.id
;
ID AVAL BVAL CVAL
---------- ---------- ---------- ----------
1 A1 B1 C1
2 A2 B2
3 A3
In this particular example, both queries give the same result. I can't think of any other dataset that would make those queries return different results.
Check at SQLFiddle:
MySQL
Oracle
PostgreSQL
SQLServer

If you're assuming that you're JOINing on a foreign key, as your question seems to imply, then yes, I think OUTER JOIN is guaranteed to be associative, as covered by Przemyslaw Kruglej's answer.
However, given that you haven't actually specified the JOIN condition, the pedantically correct answer is that no, they're not guaranteed to be associative. There are two easy ways to violate associativity with perverse ON clauses.
1. One of the JOIN conditions involves columns from all 3 tables
This is a pretty cheap way to violate associativity, but strictly speaking nothing in your question forbade it. Using the column names suggested in your question, consider the following two queries:
-- This is legal
SELECT * FROM (A JOIN B ON A.b_id = B.id)
JOIN C ON (A.id = B.id) AND (B.id = C.id)
-- This is not legal
SELECT * FROM A
JOIN (B JOIN C ON (A.id = B.id) AND (B.id = C.id))
ON A.b_id = B.id
The bottom query isn't even a valid query, but the top one is. Clearly this violates associativity.
2. One of the JOIN conditions can be satisfied despite all fields from one table being NULL
This way, we can even have different numbers of rows in our result set depending upon the order of the JOINs. For example, let the condition for JOINing A on B be A.b_id = B.id, but the condition for JOINing B on C be B.id IS NULL.
Thus we get these two queries, with very different output:
SELECT * FROM (A LEFT OUTER JOIN B ON A.b_id = B.id)
LEFT OUTER JOIN C ON B.id IS NULL;
SELECT * FROM A
LEFT OUTER JOIN (B LEFT OUTER JOIN C ON B.id IS NULL)
ON A.b_id = B.id;
You can see this in action here: http://sqlfiddle.com/#!9/d59139/1
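In case the fiddle link goes stale, here is a minimal sketch of the second counterexample. It assumes SQLite through Python's sqlite3 module (not one of the engines discussed above), made-up sample rows, and no key constraints on the tables. The right-grouped join is written as a derived table with the hypothetical alias bc.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column names follow the question; the rows are invented for illustration.
    CREATE TABLE A (id INTEGER, b_id INTEGER);
    CREATE TABLE B (id INTEGER, c_id INTEGER);
    CREATE TABLE C (id INTEGER, value TEXT);
    INSERT INTO A VALUES (1, 10), (2, NULL);
    INSERT INTO B VALUES (10, 100);
    INSERT INTO C VALUES (100, 'x'), (200, 'y');
""")

# (A LEFT JOIN B) LEFT JOIN C: the A row without a B match has B.id NULL,
# so the ON clause "B.id IS NULL" pairs it with every row of C.
left_first = conn.execute("""
    SELECT A.id, B.id, C.id
    FROM A
    LEFT OUTER JOIN B ON A.b_id = B.id
    LEFT OUTER JOIN C ON B.id IS NULL
    ORDER BY A.id, C.id
""").fetchall()

# A LEFT JOIN (B LEFT JOIN C): inside the derived table B.id is never NULL,
# so no row of C is ever attached.
right_first = conn.execute("""
    SELECT A.id, bc.b_id, bc.c_id
    FROM A
    LEFT OUTER JOIN (
        SELECT B.id AS b_id, C.id AS c_id
        FROM B LEFT OUTER JOIN C ON B.id IS NULL
    ) AS bc ON A.b_id = bc.b_id
    ORDER BY A.id, bc.c_id
""").fetchall()

print(left_first)   # three rows
print(right_first)  # two rows
```

With this data the two groupings differ even in row count, which is enough to show that associativity does not hold for arbitrary ON clauses.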

In addition to the previous answers: The topic is nicely discussed in Michael M. David, Advanced ANSI SQL Data Modeling and Structure Processing, Artech House, 1999, pages 19--21. Pages available online.
I find it particularly noteworthy that he discusses that the table clauses (LEFT JOIN ...) and the join conditions (ON ...) have to be considered separately, so associativity could refer to either (re-arranging the table clauses or re-arranging the join conditions, i.e., the ON clauses). So the notion of associativity is not the same as for, e.g., addition of numbers; it has two dimensions.

Related

Best way to eliminate duplicates rows after multiple joins

I'll consider three simple tables. A, B are my entity tables and C is an intermediate table that creates a many-to-many relationship between A & B.
Schemas:
A: (id INTEGER PRIMARY KEY)
B: (id INTEGER PRIMARY KEY)
C: (
A_id INTEGER,
B_id INTEGER,
FOREIGN KEY(A_id) REFERENCES A(id),
FOREIGN KEY(B_id) REFERENCES B(id)
)
Now, consider the below query
SELECT
A.id
FROM A
LEFT OUTER JOIN C
ON (A.id = C.A_id)
LEFT OUTER JOIN B
ON (C.B_id = B.id)
WHERE ...;
This query would result in duplicate values of A.id, which is expected because C might have multiple rows associated with each row of A. My question is what's the best way to eliminate these duplicates and get the A records. I only need the A records.
I am aware of two ways,
-- Using DISTINCT
SELECT
DISTINCT(A.id), ...
FROM A
LEFT OUTER JOIN C
ON (A.id = C.A_id)
LEFT OUTER JOIN B
ON (C.B_id = B.id)
WHERE ...
ORDER BY A.id;
And
-- Or using A.id IN (above query)/ A.id = Any(above query)
SELECT
...
FROM A
WHERE A.id IN (
SELECT
A.id
FROM A
LEFT OUTER JOIN C
ON (A.id = C.A_id)
LEFT OUTER JOIN B
ON (C.B_id = B.id)
WHERE ...
);
I'm using PostgreSQL. I need to include all the tables for filtering, so leaving a table out of the join cannot be considered an improvement. I've analyzed both queries, but I still feel there might be a better way to do this (in terms of performance).
Any help is really appreciated!
I would suggest exists:
SELECT A.id
FROM A
WHERE EXISTS (SELECT 1
FROM C JOIN
B
ON C.B_id = B.id
WHERE A.id = C.A_id AND . . .
)
You can also try following query:
SELECT
a.* -- or whatever columns you need of a
FROM a
WHERE EXISTS(
SELECT 1
FROM c
WHERE c.a_id = a.id
)
Note that there is no need to join table b: the foreign key means that a row in c always has a matching row in b, and you do not need any information from that row or table.
Perhaps even cleaner might be:
SELECT DISTINCT
a.* -- or whatever columns you need of a
FROM a
LEFT JOIN c
ON c.a_id = a.id
You can look at the query plans and execution times using EXPLAIN ANALYZE <query>. Perhaps this gives you a hint about which one to use.
But be aware of caching: run both queries several times this way to get comparable results.
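To see the duplicates and both fixes side by side, here is a small sketch. It assumes SQLite via Python's sqlite3 rather than PostgreSQL, invented sample rows, and inner joins in place of the elided WHERE filter, so it says nothing about relative performance on your data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY);
    CREATE TABLE c (
        a_id INTEGER REFERENCES a(id),
        b_id INTEGER REFERENCES b(id)
    );
    -- Sample data (made up): a=1 is linked to two b rows, a=3 to none.
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (10), (20);
    INSERT INTO c VALUES (1, 10), (1, 20), (2, 10);
""")

def ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Plain join: one output row per matching c row, so a.id = 1 appears twice.
joined = ids("""
    SELECT a.id FROM a
    JOIN c ON c.a_id = a.id
    JOIN b ON b.id = c.b_id
    ORDER BY a.id
""")

# DISTINCT collapses the duplicates after the join has produced them.
distinct = ids("""
    SELECT DISTINCT a.id FROM a
    JOIN c ON c.a_id = a.id
    JOIN b ON b.id = c.b_id
    ORDER BY a.id
""")

# EXISTS never produces the duplicates in the first place.
exists = ids("""
    SELECT a.id FROM a
    WHERE EXISTS (
        SELECT 1 FROM c JOIN b ON b.id = c.b_id
        WHERE c.a_id = a.id
    )
    ORDER BY a.id
""")

print(joined, distinct, exists)
```

Both rewrites return the same set of ids here; which one is faster depends on the planner and the data, so the EXPLAIN ANALYZE advice still applies.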

Left join inside left join

I have a problem getting values from tables.
I need something like this:
A.Id a1
B.Id b1
C.Id c1
B.Id b2
C.Id c2
C.Id c3
C.Id c4
Table A and B are joined together and also table B and C.
Table A can have one/zero or more values from table B. Same situation is for values from table C.
I need to perform left join on table A over table B and inside that left join on table B over table C.
I tried with left join from table A and B, but don't know how to perform left join inside that left join.
Is that possible? What would syntax for that look like?
edit:
Data would look like this
ZZN1 P1 NULL
ZZN1 P2 NAB1
ZZN2 P3 NAB2
ZZN2 P3 NAB3
No need to nest the left joins; you can simply flatten them and let your RDBMS handle the logic.
Sample schema:
a(id)
b(id, aid) -- aid is a foreign key to a(id)
c(id, bid) -- bid is a foreign key to b(id)
Query:
select a.id, b.id, c.id
from a
left join b on b.aid = a.id
left join c on c.bid = b.id
If the first left join does not find a match, then the second one cannot match either, since the joining column b.id will be null. On the other hand, if the first left join succeeds, then the second one may or may not succeed, depending on whether the relevant bid is available in c.
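A quick way to check that behaviour is to run the flattened query on a tiny dataset. This is only a sketch: it assumes SQLite via Python's sqlite3, and the rows are invented so that one a row has no b, and one b row has no c.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, aid INTEGER REFERENCES a(id));
    CREATE TABLE c (id INTEGER PRIMARY KEY, bid INTEGER REFERENCES b(id));
    -- Sample data (made up for illustration).
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (10, 1), (11, 1);
    INSERT INTO c VALUES (100, 10), (101, 10);
""")

rows = conn.execute("""
    SELECT a.id, b.id, c.id
    FROM a
    LEFT JOIN b ON b.aid = a.id
    LEFT JOIN c ON c.bid = b.id
    ORDER BY a.id, b.id, c.id
""").fetchall()

for row in rows:
    print(row)
```

Every a row survives, b=11 appears with a NULL c, and a=2 appears with NULLs for both b and c.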
SELECT A.Name, B.Name , C.Name
FROM A
LEFT JOIN B ON A.id = B.id
LEFT JOIN C ON B.id = C.id

SQL Logic: When joining Child table B to Parent Table A on A.FID = B.ID

I have been wondering if the results would change in multi-join tables queries.
If you have parent Table A
A           B
ID | FID    FID
1  | 2      1
2  | 4      2
3  | 5      3
4  | 7      4
5  | 8      5
6  | NULL   6
7  | NULL   7
8  | NULL   8
does it matter which table column you specified in the WHERE clause?
For example, what is the difference between the two:
Select *
From Table A
Left Join B on A.FID = B.FID
WHERE A.FID IN (2,5,8)
Select *
From Table A
Left Join B on A.FID = B.FID
WHERE B.ID IN (2,5,8)
Thank you for the help!
EDIT:
Michael has answered my question, and I have tested it out:
'Actually, while your answer is a good one (and probably the one he's looking for), since both of his queries are essentially filtering on the primary key of B (A.FID, B.ID), they actually are logically identical (assuming that A.FID is a true foreign key constraint on B). That is, both queries filter out rows in which B.ID is not 2, 5 or 8.' – Michael L.
It is only different if Table B is the main table and you query based on B.FID, as in:
SELECT *
FROM B
LEFT JOIN A ON A.FID = B.FID
WHERE B.FID IN (2,5,8)
While this will be the same as having A as the main table:
SELECT *
FROM B
LEFT JOIN A ON A.FID = B.FID
WHERE A.FID IN (2,5,8)
Yes, it does.
When you use an OUTER JOIN, values from one of the tables may be NULL. So, the second query is equivalent to:
Select *
From Table A Inner Join
B
on A.FID = B.ID
WHERE B.ID IN (2, 5, 8);
because the NULL values are filtered out.
As a general rule with LEFT JOIN:
Filters on the first table belong in the WHERE clause.
Filters on the second and subsequent tables should go in the ON clause.
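The effect of that rule can be checked with a small sketch. It assumes SQLite via Python's sqlite3, a simplified schema where A.fid references B.id, and a handful of invented rows rather than the exact data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical schema: A.fid points at B.id.
    CREATE TABLE A (id INTEGER, fid INTEGER);
    CREATE TABLE B (id INTEGER);
    INSERT INTO A VALUES (1, 2), (2, 4), (3, NULL);
    INSERT INTO B VALUES (2), (4), (5);
""")

# Filter on the second table in WHERE: NULL-extended rows fail the test,
# so the LEFT JOIN behaves like an INNER JOIN.
where_filter = conn.execute("""
    SELECT A.id, B.id FROM A
    LEFT JOIN B ON A.fid = B.id
    WHERE B.id IN (2, 5)
    ORDER BY A.id
""").fetchall()

# Same filter in ON: every row of A is kept, unmatched ones get NULL for B.
on_filter = conn.execute("""
    SELECT A.id, B.id FROM A
    LEFT JOIN B ON A.fid = B.id AND B.id IN (2, 5)
    ORDER BY A.id
""").fetchall()

# Explicit INNER JOIN for comparison with the WHERE version.
inner = conn.execute("""
    SELECT A.id, B.id FROM A
    JOIN B ON A.fid = B.id
    WHERE B.id IN (2, 5)
    ORDER BY A.id
""").fetchall()

print(where_filter, on_filter, inner)
```

The WHERE version and the inner join return the same single row, while the ON version keeps all three rows of A.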

How to get values from tables A and C, joined by table B with default values from C when C has no key from A

here is my situation:
I have 3 tables:
A: (A_id, Name)
B: (B_id, A_id, Name)
C: (C_id, B_id, State)
What I want is to have the following result set:
A.A_id,A.Name, C.State
The complication is that I need State to have a default value when there is no B data to link.
In that case, I want
A.A_id, A.Name, 'Default_Value'
I don't know much advanced SQL, so any pointers are greatly appreciated.
select
coalesce(c.State, 'default value')
from
a
left join b on a.A_id = b.A_id
left join c on b.B_id = c.B_id
the best visual explanation of joins I've ever seen: A Visual Explanation of SQL Joins
COALESCE() returns the first of its parameters which isn't NULL
You could use ISNULL in the select
SELECT A.A_id,A.Name, ISNULL (C.State, 'Default_Value')
from A
left join b...
left join c...
SELECT A.A_id, A.Name, COALESCE(C.State, 'Default_Value')
FROM
A LEFT JOIN
(B INNER JOIN C ON C.B_id = B.B_id)
ON B.A_id = A.A_id
Some information on joins: What is the difference between "INNER JOIN" and "OUTER JOIN"?
What's happening here is that we are joining table B and C with an INNER JOIN where the respective B_id column is equal. The INNER specifies that results will be returned only when records exist in both tables that match the C.B_id = B.B_id condition.
The LEFT JOIN will join those combined values to table A if the matching condition exists, while still returning the records from table A if no match exists. That is, if nothing exists for the condition B.A_id = A.A_id, NULL values are returned for the columns from the right side of the join (the B and C join). We apply COALESCE so that if the queried column comes back NULL, it defaults to the specified value.
COALESCE has some added benefits when performing this function: http://msdn.microsoft.com/en-us/library/ms190349.aspx
One last thing, table B in your example is commonly known as a junction table (or join table, or bridge table)... http://en.wikipedia.org/wiki/Junction_table
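For anyone who wants to try the COALESCE approach without a server, here is a minimal sketch. It assumes SQLite via Python's sqlite3, uses the flattened LEFT JOIN form rather than the nested one, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (A_id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE B (B_id INTEGER PRIMARY KEY, A_id INTEGER, Name TEXT);
    CREATE TABLE C (C_id INTEGER PRIMARY KEY, B_id INTEGER, State TEXT);
    -- Sample data (made up): A=1 reaches C, A=2 has no B, A=3 has B but no C.
    INSERT INTO A VALUES (1, 'with state'), (2, 'no b'), (3, 'b without c');
    INSERT INTO B VALUES (10, 1, 'b1'), (11, 3, 'b2');
    INSERT INTO C VALUES (100, 10, 'active');
""")

rows = conn.execute("""
    SELECT A.A_id, A.Name, COALESCE(C.State, 'Default_Value') AS State
    FROM A
    LEFT JOIN B ON B.A_id = A.A_id
    LEFT JOIN C ON C.B_id = B.B_id
    ORDER BY A.A_id
""").fetchall()

for row in rows:
    print(row)
```

Rows of A that never reach a C row, whether because B or C is missing, come back with 'Default_Value' in the State column.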

SQL Insert into table A from table B based off table C

I have an empty table that I would like to fill with rows from a second table, based on a third table. I'll call them A, B, and C respectively.
Table C has ID numbers that match ID numbers for rows in Table B. For every ID in table C, I want to add the corresponding row from table B into Table A.
This is what I have, and I am getting an error saying that I cannot use the last statement.
INSERT INTO TABLEA
SELECT * FROM TABLEB
WHERE ID FROM TABLEB = ID FROM TABLEC;
DSNT408I SQLCODE = -199, ERROR: ILLEGAL USE OF KEYWORD FROM. TOKEN ( . AT
MICROSECONDS MICROSECOND SECONDS SECOND MINUTES MINUTE WAS EXPECTED
DSNT418I SQLSTATE = 42601 SQLSTATE RETURN CODE
Any help would be appreciated.
INSERT INTO TableA
SELECT B.*
FROM TableB AS B
JOIN TableC AS C ON B.ID = C.ID
Or possibly that will give you too many duplicates (if there are multiple rows in C that match a given row in B), in which case you might need:
INSERT INTO TableA
SELECT B.*
FROM TableB AS B
WHERE B.ID IN (SELECT C.ID FROM TableC AS C)
Or:
INSERT INTO TableA
SELECT DISTINCT B.*
FROM TableB AS B
JOIN TableC AS C ON B.ID = C.ID
Both of those give you one row in A for each row in B that matches one or more rows in C.
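The duplicate issue is easy to reproduce. The sketch below assumes SQLite via Python's sqlite3 instead of DB2, a hypothetical two-column layout for TableA and TableB, and invented rows where TableC lists ID 1 twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical column layout; only the ID column comes from the question.
    CREATE TABLE TableA (ID INTEGER, Name TEXT);
    CREATE TABLE TableB (ID INTEGER, Name TEXT);
    CREATE TABLE TableC (ID INTEGER);
    INSERT INTO TableB VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO TableC VALUES (1), (1), (3);
""")

# JOIN version: one inserted row per matching row in TableC.
conn.execute("""
    INSERT INTO TableA
    SELECT B.* FROM TableB AS B
    JOIN TableC AS C ON B.ID = C.ID
""")
join_rows = conn.execute("SELECT ID, Name FROM TableA ORDER BY ID").fetchall()

# Start over, then use IN: one inserted row per matching row in TableB.
conn.execute("DELETE FROM TableA")
conn.execute("""
    INSERT INTO TableA
    SELECT B.* FROM TableB AS B
    WHERE B.ID IN (SELECT C.ID FROM TableC AS C)
""")
in_rows = conn.execute("SELECT ID, Name FROM TableA ORDER BY ID").fetchall()

print(join_rows)
print(in_rows)
```

The JOIN version copies the row with ID 1 twice because TableC mentions it twice, while the IN version copies each matching row of TableB once.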
How would I add a WHEN clause to this? Let's say Table C has another column called VALUE, and I want to add all the ID numbers that have a value of 'x' or greater. How would I do that, I tried adding JOIN TableC AS C ON B.ID = C.ID AND C.VALUE > 5 but I still got all the values from TABLE C.
Working with the first query (fixing the others being left as an 'exercise for the reader'), then what I think you should be doing is just:
INSERT INTO TableA
SELECT B.*
FROM TableB AS B
JOIN TableC AS C ON B.ID = C.ID
WHERE C.Value > 5
The optimizer should translate that to an equivalent expression:
INSERT INTO TableA
SELECT B.*
FROM TableB AS B
JOIN TableC AS C ON B.ID = C.ID AND C.Value > 5
I'm not clear from your comment whether you somehow added a second reference to TableC in the one query, or you modified your query as shown in this second example. If you were not using LEFT JOIN anywhere, then adding the AND C.Value > 5 term to the ON clause or as a WHERE clause should have yielded the correct data.
When debugging this sort of problem, it is worth noting that this INSERT statement has a perfectly good SELECT statement in it that you can run on its own to review what is going to be added to TableA. You might want to augment the select-list to include (at least) C.ID and C.Value just to make sure nothing is going haywire.