Why does FULL JOIN order make a difference in these queries? - sql

I'm using PostgreSQL. Everything I read here suggests that in a query using nothing but full joins on a single column, the order of tables joined basically doesn't matter.
My intuition says this should also go for multiple columns, so long as every common column is listed in the query where possible (that is, wherever both joined tables have the column in common). But this is not the case, and I'm trying to figure out why.
Simplified to three tables a, b, and c.
Columns in table a: id, name_a
Columns in table b: id, id_x
Columns in table c: id, id_x
This query:
SELECT *
FROM a
FULL JOIN b USING(id)
FULL JOIN c USING(id, id_x);
returns a different number of rows than this one:
SELECT *
FROM a
FULL JOIN c USING(id)
FULL JOIN b USING(id, id_x);
What I want/expect is hard to articulate, but basically, I'd like a "complete" full merger. I want no null fields anywhere unless that is unavoidable.
For example, whenever there is a not-null id, I want the corresponding name column to always have the name_a and not be null. Instead, one of those example queries returns semi-redundant results, with one row having a name_a but no id, and another having an id but no name_a, rather than a single merged row.
When the joins are listed in the other order, I do get that desired result (but I'm not sure what other problems might occur, because future data is unknown).

Your queries are different.
In the first, you are doing a full join to b using a single column, id.
In the second, you are doing a full join to b using two columns.
Although the two queries could return the same results under some circumstances, there is no reason to think that the results would be comparable.

Argument order matters in OUTER JOINs, except that FULL NATURAL JOIN is symmetric. They return what an INNER JOIN (ON, USING or NATURAL) does but also the unmatched rows from the left (LEFT JOIN), right (RIGHT JOIN) or both (FULL JOIN) tables extended by NULLs.
USING returns the single shared value for each specified column in INNER JOIN rows; in NULL-extended rows another common column can have NULL in one table's version and a value in the other's.
Join order matters too. Even FULL NATURAL JOIN is not associative, since with multiple tables each pair of tables (either operand being an original or join result) can have a unique set of common columns, ie in general (A ⟗ B) ⟗ C ≠ A ⟗ (B ⟗ C).
There are a lot of special cases where certain additional identities hold. Eg FULL JOIN USING all common column names and OUTER JOIN ON equality of same-named columns are symmetric. Some cases involve CKs (candidate keys), FKs (foreign keys) and other constraints on arguments.
Your question doesn't make clear exactly what input conditions you are assuming or what output conditions you are seeking.
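
A small reproduction of the non-associativity described above, as a sketch only: the table and column names follow the question, but the three sample rows are made up for illustration. Everything runs inside a transaction that is rolled back, so nothing is left behind.

```sql
-- Hypothetical sample data; table and column names follow the question.
BEGIN;

CREATE TEMP TABLE a (id int, name_a text);
CREATE TEMP TABLE b (id int, id_x int);
CREATE TEMP TABLE c (id int, id_x int);

INSERT INTO a VALUES (1, 'one');
INSERT INTO b VALUES (1, 10);
INSERT INTO c VALUES (2, 10);

-- Order 1: (a FULL JOIN b) gives one row with id = 1, id_x = 10.
-- c's row (2, 10) matches nothing and is appended: 2 rows in total.
SELECT *
FROM a
FULL JOIN b USING (id)
FULL JOIN c USING (id, id_x);

-- Order 2: (a FULL JOIN c) gives (id = 1, id_x NULL) and (id = 2, id_x = 10).
-- b's row (1, 10) fails the id_x comparison against the NULL, so it is
-- appended as its own row with name_a NULL: 3 rows in total.
SELECT *
FROM a
FULL JOIN c USING (id)
FULL JOIN b USING (id, id_x);

ROLLBACK;
```

In the second query, id 1 appears twice: once with name_a but a NULL id_x, and once with id_x = 10 but a NULL name_a. That looks like the "semi-redundant" pair of rows the question describes.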

Related

Choosing between values using left join or union

I have a table A with columns ABK and ACK. Each row can have a value in either ABK or ACK but not in both at the same time
ABK and ACK are keys to be used to fetch more detailed information from tables B and C, respectively
B has columns named BK (key) and B1 and C has columns named CK (key) and C1
When fetching information from B and C, I want to select between B1 and C1 depending on which column in A (ABK or ACK) is NOT null
What would be better considering readability and performance:
1
select COALESCE(B.B1, C.C1) as X from A
left join B on A.ABK = B.BK
left join C on A.ACK = C.CK
OR
2
select B.B1 as X from A join B on A.ABK = B.BK
UNION
select C.C1 as X from A join C on A.ACK = C.CK
In other words should I do a left join with all the tables I want to use or do union?
I am guessing that readability wise the UNION is better, but I am not sure about performance
Also B and C do not overlap, i.e. there are no duplicates between B and C
I don't think the answer in the question pointed to as a duplicate of mine is correct for my case, since it focuses on the fact that there could be duplicates among tables B and C, but as stated B and C are mutually exclusive
Your two queries aren't necessarily accomplishing the same thing.
Using LEFT JOIN will create duplicate rows in the result if there are duplicate values in either table, whereas UNION (as opposed to UNION ALL) automatically removes duplicate rows from the result. Therefore, before thinking about performance, I would decide which method to use based on whether you are interested in preserving duplicates in your results.
See here for more info: Performance of two left joins versus union.
First, you can test the performance on your database and your data. You can also look at the execution plans.
Second, the queries are not equivalent, because the union removes duplicates.
That said, I would go for the first version using left join for both readability and performance. Readability is obviously a matter of taste. I think having all the logic in the select makes it more apparent what the query is doing.
More importantly, the first will start with table A and be able to use indexes for the additional lookups. There is no additional phase for removing duplicates. My guess is that two joins are faster than two joins plus duplicate removal.
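
To make the "not equivalent" point concrete, here is a minimal sketch with made-up rows; only the table and column names come from the question. Two rows of A point at the same row of B, which is enough to make the two versions disagree.

```sql
-- Hypothetical sample data; table and column names follow the question.
BEGIN;

CREATE TEMP TABLE a (abk int, ack int);
CREATE TEMP TABLE b (bk int PRIMARY KEY, b1 text);
CREATE TEMP TABLE c (ck int PRIMARY KEY, c1 text);

INSERT INTO a VALUES (1, NULL), (1, NULL), (NULL, 1);
INSERT INTO b VALUES (1, 'x');
INSERT INTO c VALUES (1, 'y');

-- Version 1: one output row per row of a, so 3 rows ('x' twice, 'y' once).
SELECT COALESCE(b.b1, c.c1) AS x
FROM a
LEFT JOIN b ON a.abk = b.bk
LEFT JOIN c ON a.ack = c.ck;

-- Version 2: UNION removes the duplicate 'x', so 2 rows.
SELECT b.b1 AS x FROM a JOIN b ON a.abk = b.bk
UNION
SELECT c.c1 AS x FROM a JOIN c ON a.ack = c.ck;

-- UNION ALL keeps the duplicates and returns 3 rows again.
SELECT b.b1 AS x FROM a JOIN b ON a.abk = b.bk
UNION ALL
SELECT c.c1 AS x FROM a JOIN c ON a.ack = c.ck;

ROLLBACK;
```

One more difference to keep in mind: the LEFT JOIN version also returns a row (with a NULL x) for any row of A that matches neither B nor C, while the inner joins in the UNION versions drop such rows.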

How to drop one join key when joining two tables

I have two tables. Both have a lot of columns. Now I have a common column called ID on which I would join.
Now since this variable ID is present in both the tables if I do simply this
select a.*,b.*
from table_a as a
left join table_b as b on a.id=b.id
This will give an error as id is duplicate (present in both the tables and getting included for both).
I don't want to write down separately each column of b in the select statement. I have lots of columns and that is a pain. Can I rename the ID column of b in the join statement itself similar to SAS data merge statements?
I am using Postgres.
Postgres would not give you an error for duplicate output column names, but some clients do. (Duplicate names are also not very useful.)
Either way, use the USING clause as join condition to fold the two join columns into one:
SELECT *
FROM tbl_a a
LEFT JOIN tbl_b b USING (id);
While you join the same table (self-join) there will be more duplicate column names. The query would make hardly any sense to begin with. This starts to make sense for different tables. Like you stated in your question to begin with: I have two tables ...
To avoid all duplicate column names, you have to list them in the SELECT clause explicitly - possibly dealing out column aliases to get both instances with different names.
Or you can use a NATURAL join - if that fits your unexplained use case:
SELECT *
FROM tbl_a a
NATURAL LEFT JOIN tbl_b b;
This joins on all columns that share the same name and folds those automatically - exactly the same as listing all common column names in a USING clause. You need to be aware of rules for possible NULL values ...
Details in the manual.
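
A quick way to see the folding for yourself, sketched with two throwaway tables (the val_a and val_b columns and the rows are invented for illustration):

```sql
-- Hypothetical sample tables; val_a / val_b are made-up column names.
BEGIN;

CREATE TEMP TABLE tbl_a (id int, val_a text);
CREATE TEMP TABLE tbl_b (id int, val_b text);

INSERT INTO tbl_a VALUES (1, 'a1'), (2, 'a2');
INSERT INTO tbl_b VALUES (1, 'b1');

-- ON keeps both join columns: output columns are id, val_a, id, val_b.
SELECT *
FROM tbl_a a
LEFT JOIN tbl_b b ON a.id = b.id;

-- USING folds them into one: output columns are id, val_a, val_b.
SELECT *
FROM tbl_a a
LEFT JOIN tbl_b b USING (id);

-- If both ids are wanted under distinct names, alias one explicitly.
SELECT a.*, b.id AS b_id, b.val_b
FROM tbl_a a
LEFT JOIN tbl_b b ON a.id = b.id;

ROLLBACK;
```

The last form still requires listing the remaining columns of the second table by hand, which is the trade-off the answer describes.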

Explanation of code for right excluding join?

I just found a great page with Venn diagrams of different joins and the code for executing them:
http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
I used the "Right Excluding Join" in my query (the page illustrates it with a Venn diagram), and here is the code:
SELECT subjects.subject
FROM sold_subjects
RIGHT JOIN subjects
ON sold_subjects.subject = subjects.subject
WHERE sold_subjects.subject IS NULL
I am asking for an explanation of what this code actually does, particularly what happens in the last row. I understand that we are joining the two relations where they have the same subject, but what happens when we set subjects for one of the relations to NULL in the last row?
First, what do JOIN and RIGHT JOIN do?
The JOIN gets information from two tables and joins them according to rules you specify in the ON or WHERE clauses.
The JOIN modifiers, such as LEFT, RIGHT, INNER and FULL, control the behavior your JOIN will have in case of unmatched records -- when no record in A matches a record in B according to the specified rules, and vice-versa.
To understand this part, take table A as being the left table and table B as being the right one. When you have multiple joins, the right table in each join is the one whose name is immediately right of the JOIN command.
e.g. FROM a1 LEFT JOIN ... LEFT JOIN b
The b table is the right one and whatever comes before is the left one.
This is a summary of the modifiers' behavior:
LEFT: preserves unmatched records in the left table, discards those in the right table;
RIGHT: preserves unmatched records in the right table, discards those in the left table;
INNER: preserves only the records that are matched, discards unmatched from both tables;
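FULL (or FULL OUTER): preserves all records, regardless of matches. OUTER on its own is not a join type; it is an optional keyword in LEFT OUTER, RIGHT OUTER and FULL OUTER.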
What is visually happening?
Imagine you have two simple tables with the same names as the ones in your query.
sold_subjects    subjects
subject          subject
1                1
2                4
3                5
4                6
When you RIGHT JOIN two tables, you create a third one that looks like this:
joined_table
sold_subjects.subject    subjects.subject
1                        1
4                        4
NULL                     5
NULL                     6
Please note that the subjects 2 and 3 are already gone in this subset.
When you add a WHERE clause with sold_subjects.subject IS NULL, you are only keeping the last two lines, where there was no match in sold_subjects.
The right join makes sure that you will keep all the records of the right table. If there is no match with the left table, then all the variables in the result originating from the left table will be null (because there is no match).
The where clause checks whether the value of lefttable.subject is null or not. If it's not null, then obviously the join succeeded. If it is null, then the join did not work, leaving this value blank. So this where clause will, per definition, return all the records of the right table that have no match in the left table, which is exactly what the venn diagram says!
This is a very common practice in SQL, and there are many use cases. For example: left table is sales, right table is customers, and you want to know all the customers without sales.
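
The walkthrough above can be run as-is. This sketch loads the same example values (1 to 4 sold, and 1, 4, 5, 6 as subjects) into temporary tables; the int column type is an assumption. The NOT EXISTS query at the end is just another common way to write the same anti-join, not something from the original page.

```sql
-- Sample values taken from the walkthrough; int columns are an assumption.
BEGIN;

CREATE TEMP TABLE sold_subjects (subject int);
CREATE TEMP TABLE subjects (subject int);

INSERT INTO sold_subjects VALUES (1), (2), (3), (4);
INSERT INTO subjects VALUES (1), (4), (5), (6);

-- Step 1: the plain RIGHT JOIN keeps every row of subjects.
SELECT sold_subjects.subject AS sold, subjects.subject AS subj
FROM sold_subjects
RIGHT JOIN subjects ON sold_subjects.subject = subjects.subject;

-- Step 2: the WHERE clause keeps only the unmatched rows: subjects 5 and 6.
SELECT subjects.subject
FROM sold_subjects
RIGHT JOIN subjects ON sold_subjects.subject = subjects.subject
WHERE sold_subjects.subject IS NULL;

-- The same result written with NOT EXISTS.
SELECT s.subject
FROM subjects s
WHERE NOT EXISTS (
    SELECT 1 FROM sold_subjects ss WHERE ss.subject = s.subject
);

ROLLBACK;
```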
RIGHT JOIN is shorthand for RIGHT OUTER JOIN.
Consider the excellent explanation in the fine manual:
LEFT OUTER JOIN returns all rows in the qualified Cartesian product
(i.e., all combined rows that pass its join condition), plus one copy
of each row in the left-hand table for which there was no right-hand
row that passed the join condition. This left-hand row is extended to
the full width of the joined table by inserting null values for the
right-hand columns. Note that only the JOIN clause's own condition is
considered while deciding which rows have matches. Outer conditions
are applied afterwards.
Conversely, RIGHT OUTER JOIN returns all the joined rows, plus one row
for each unmatched right-hand row (extended with nulls on the left).
This is just a notational convenience, since you could convert it to a
LEFT OUTER JOIN by switching the left and right tables.
Bold emphasis mine. Your query is just one way to exclude rows that are not present in another table, with a shiny buzz word attached ("Right Excluding JOIN"). There are others:
Select rows which are not present in other table
Now, for the tricky part - or where you deviate from the original:
But what happens when we set subjects for one of the relations to NULL in the last row?
Your query has:
WHERE sold_subjects.subject IS NULL
Where the original says:
WHERE A.Key IS NULL
Key is supposed to imply NOT NULL. The query simply does not work if either of the underlying table columns sold_subjects.subject or subjects.subject can be NULL. There would be no way to disambiguate how the row qualified:
subjects.subject IS NULL and no row with NULL in sold_subjects.subject
subjects.subject IS NULL and some row with NULL in sold_subjects.subject
subjects.subject IS NOT NULL but no matching row in sold_subjects
If one of the linking columns can be NULL, and you want to treat NULL values like they were actual values (which they are not), i.e. match NULL to NULL, you could substitute with an anti-join using the NULL-safe operator IS NOT DISTINCT FROM:
SELECT s.subject
FROM subjects s
LEFT JOIN sold_subjects ss ON ss.subject IS NOT DISTINCT FROM s.subject
WHERE ss.subject IS NULL;
Also with shorter syntax, using the more commonly used LEFT JOIN, but otherwise identical. IS NOT DISTINCT FROM is often slower than a simple =, so only use it where you need it. Typically, you join tables on key columns that are defined NOT NULL - implicitly (a PK column is NOT NULL automatically) or explicitly.
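
One thing to watch with the NULL-safe variant, as far as I can tell: if sold_subjects itself contains a NULL subject, a matched NULL-to-NULL row still has ss.subject IS NULL, so the WHERE test cannot tell it apart from an unmatched row. The sketch below uses made-up rows to show this, and a NOT EXISTS form that does not depend on the joined column being non-null.

```sql
-- Hypothetical sample data containing NULLs on both sides.
BEGIN;

CREATE TEMP TABLE subjects (subject int);
CREATE TEMP TABLE sold_subjects (subject int);

INSERT INTO subjects VALUES (1), (NULL), (5);
INSERT INTO sold_subjects VALUES (1), (NULL);

-- LEFT JOIN form: returns 5, but also the NULL subject, even though the
-- NULL row did find a NULL partner in sold_subjects.
SELECT s.subject
FROM subjects s
LEFT JOIN sold_subjects ss ON ss.subject IS NOT DISTINCT FROM s.subject
WHERE ss.subject IS NULL;

-- NOT EXISTS form: only 5 is returned, because the test is on the
-- existence of a partner row rather than on a column value.
SELECT s.subject
FROM subjects s
WHERE NOT EXISTS (
    SELECT 1
    FROM sold_subjects ss
    WHERE ss.subject IS NOT DISTINCT FROM s.subject
);

ROLLBACK;
```

Testing a column of sold_subjects that is declared NOT NULL (a primary key, say) in the WHERE clause of the LEFT JOIN form would be another way around this.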

Is the order of joining tables irrelevant as long as we choose proper join types?

Can we achieve desired results of joining tables by executing joins in whatever order? Suppose we want to left join two tables A and B (order AB). We can get the same results with right join of B and A (BA).
What about 3 tables ABC. Can we get whatever results by only changing order and joins types? For example A left join B inner join C. Can we get it with BAC order? What about if we have 4 or more tables?
Update.
The question Does the join order matter in SQL? is about the inner join type. Agreed that in that case the order of joins doesn't matter. The answer provided in that question does not answer my question, which is whether it is possible to get the desired results of joining tables with whatever original join types (join types here) by choosing whatever order of tables we like, achieving this goal only by manipulating join types.
In an inner join, the ordering of the tables in the join doesn't matter - the same rows will make up the result set regardless of the order they are in the join statement.
In either a left or right outer join, the order DOES matter. In A left join B, your result set will contain one row for every record in table A, irrespective of whether there is a matching row in table B. If there are non matching rows, this is likely to be a different result set to B left join A.
In a full outer join, the order again doesn't matter - rows will be produced for each row in each joined table no matter what their order.
Regarding A left join B vs B right join A - these will produce the same results. In simple cases with 2 tables, swapping the tables and changing the direction of the outer join will result in the same result set.
This will also apply to 3 or more tables if all of the outer joins are in the same direction - A left join B left join C will give the same set of results as C right join B right join A.
If you start mixing left and right joins, then you will need to start being more careful. There will almost always be a way to make an equivalent query with re-ordered tables, but at that point sub-queries or bracketing off expressions might be the best way to clarify what you are doing.
As another commenter states, using whatever makes your purpose most clear is usually the best option. The ordering of the tables in your query should make little or no difference performance wise, as the query optimiser should work this out (although the only way to be sure of this would be to check the execution plans for each option with your own queries and data).
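
A two-table sketch of the swap described above, using invented tables t_a and t_b: reversing the operands and flipping LEFT to RIGHT keeps the result, while reversing the operands alone does not.

```sql
-- Hypothetical sample tables t_a and t_b.
BEGIN;

CREATE TEMP TABLE t_a (id int);
CREATE TEMP TABLE t_b (id int);

INSERT INTO t_a VALUES (1), (2);
INSERT INTO t_b VALUES (2), (3);

-- A LEFT JOIN B: every row of t_a survives -> (1, NULL), (2, 2).
SELECT t_a.id AS a_id, t_b.id AS b_id
FROM t_a
LEFT JOIN t_b ON t_a.id = t_b.id;

-- B RIGHT JOIN A: same rows as above.
SELECT t_a.id AS a_id, t_b.id AS b_id
FROM t_b
RIGHT JOIN t_a ON t_a.id = t_b.id;

-- B LEFT JOIN A: now every row of t_b survives -> (2, 2), (NULL, 3).
SELECT t_a.id AS a_id, t_b.id AS b_id
FROM t_b
LEFT JOIN t_a ON t_a.id = t_b.id;

ROLLBACK;
```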

When to use SQL natural join instead of join .. on?

I'm studying SQL for a database exam and the way I've seen SQL written is the way it looks on this page:
http://en.wikipedia.org/wiki/Star_schema
That is, the join is written as JOIN <table name> ON <table attribute>, followed by the join condition for the selection. My course book and the exercises given to me by the academic institution, however, use only natural join in their examples. So when is it right to use natural join? Should natural join be used if the query can also be written using JOIN .. ON?
Thanks for any answer or comment
A natural join will find columns with the same name in both tables and add one column in the result for each pair found. The inner join lets you specify the comparison you want to make using any column.
IMO, the JOIN ON syntax is much more readable and maintainable than the natural join syntax. Natural joins are a leftover of some old standards, and I try to avoid them like the plague.
The JOIN keyword is used in an SQL statement to query data from two or more tables, based on a relationship between certain columns in these tables.
Different Joins
* JOIN: Return rows when there is at least one match in both tables
* LEFT JOIN: Return all rows from the left table, even if there are no matches in the right table
* RIGHT JOIN: Return all rows from the right table, even if there are no matches in the left table
* FULL JOIN: Return all rows from both tables, whether or not there is a match in the other table
INNER JOIN
http://www.w3schools.com/sql/sql_join_inner.asp
FULL JOIN
http://www.w3schools.com/sql/sql_join_full.asp
A natural join is said to be an abomination because it does not allow qualifying key columns, which makes it confusing: you never know which "common" columns are being used to join two tables simply by looking at the SQL statement.
A NATURAL JOIN matches on any shared column names between the tables, whereas an INNER JOIN only matches on the given ON condition.
The joins are often interchangeable and usually produce the same results. However, there are some important considerations to make:
* If a NATURAL JOIN finds no matching columns, it returns the cross product. This could produce disastrous results if the schema is modified. On the other hand, an INNER JOIN will return a 'column does not exist' error. This is much more fault tolerant.
* An INNER JOIN self-documents with its ON clause, resulting in a clearer query that describes the table schema to the reader.
* An INNER JOIN results in a maintainable and reusable query in which the column names can be swapped in and out with changes in the use case or table schema.
* The programmer can notice column name mis-matches (e.g. item_ID vs itemID) sooner if they are forced to define the ON predicate.
Otherwise, a NATURAL JOIN is still a good choice for a quick, ad-hoc query.
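
The cross-product pitfall is easy to reproduce. In this sketch the tables items and orders and their rows are invented; the column names item_id and itemid deliberately differ, in the spirit of the item_ID vs itemID example above.

```sql
-- Hypothetical tables whose key columns are spelled differently.
BEGIN;

CREATE TEMP TABLE items (item_id int, label text);
CREATE TEMP TABLE orders (order_id int, itemid int);

INSERT INTO items VALUES (1, 'pen'), (2, 'ink');
INSERT INTO orders VALUES (100, 1), (101, 2);

-- No column name is shared, so NATURAL JOIN silently degrades to a
-- cross product: 2 x 2 = 4 rows, with no error or warning.
SELECT *
FROM items
NATURAL JOIN orders;

-- The explicit condition states the intended relationship: 2 rows.
SELECT *
FROM items i
JOIN orders o ON i.item_id = o.itemid;

-- Naming a column that one side lacks fails loudly instead
-- (left commented out so the transaction is not aborted):
-- SELECT * FROM items JOIN orders USING (item_id);

ROLLBACK;
```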