JOIN on references - sql

Why doesn't SQL contain a keyword for specifying a join on columns in one table that reference columns in another table?
Like NATURAL in a join specifies that it's on columns with the same name.
I often find myself doing something like SELECT ... FROM a JOIN b ON a.b_id=b.id; where the b_id column in a is defined to reference the id column in b. That seems like an awful lot of typing for something quite natural?
Would such a feature be particularly hard to implement or undesirable for some reason that I haven't thought of?
I mostly know SQL from PostgreSQL, so if most other RDBMSes have such a feature, the question is just why PostgreSQL doesn't have it.

In fact, SQL does. It is the USING clause:
select . . .
from a join b using (id);
As the documentation explains:
The USING clause is a shorthand that allows you to take advantage of
the specific situation where both sides of the join use the same name
for the joining column(s). It takes a comma-separated list of the
shared column names and forms a join condition that includes an
equality comparison for each one. For example, joining T1 and T2 with
USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b =
T2.b.
It does still require that the names be the same in the two tables. But, because the names are explicitly specified in the join condition, this is a safe alternative to natural join (which I consider to be a bug waiting to happen).
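A minimal sketch of that equivalence, assuming SQLite (through Python's sqlite3 module) as a stand-in for PostgreSQL. The tables, the column names a_id, b_id and label, and the data are invented for illustration; the point is that USING only applies when the referencing and the referenced column share a name, here b_id on both sides.

```python
import sqlite3

# Hypothetical schema: the foreign key column and the referenced key are both called b_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE b (b_id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE a (a_id INTEGER PRIMARY KEY, b_id INTEGER REFERENCES b(b_id));
    INSERT INTO b VALUES (1, 'one'), (2, 'two');
    INSERT INTO a VALUES (10, 1), (11, 1), (12, 2);
""")

# Shorthand: the join column is named once.
using_rows = conn.execute(
    "SELECT a_id, label FROM a JOIN b USING (b_id) ORDER BY a_id"
).fetchall()

# Spelled-out form of the same condition.
on_rows = conn.execute(
    "SELECT a_id, label FROM a JOIN b ON a.b_id = b.b_id ORDER BY a_id"
).fetchall()

print(using_rows == on_rows)
```

With the naming from the question (a.b_id referencing b.id) there is no shared name, so only the ON form is available.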

Related

Why does FULL JOIN order make a difference in these queries?

I'm using PostgreSQL. Everything I read here suggests that in a query using nothing but full joins on a single column, the order of tables joined basically doesn't matter.
My intuition says this should also go for multiple columns, so long as every common column is listed in the query where possible (that is, wherever both joined tables have the column in common). But this is not the case, and I'm trying to figure out why.
Simplified to three tables a, b, and c.
Columns in table a: id, name_a
Columns in table b: id, id_x
Columns in table c: id, id_x
This query:
SELECT *
FROM a
FULL JOIN b USING(id)
FULL JOIN c USING(id, id_x);
returns a different number of rows than this one:
SELECT *
FROM a
FULL JOIN c USING(id)
FULL JOIN b USING(id, id_x);
What I want/expect is hard to articulate, but basically, I'd like a "complete" full merger. I want no null fields anywhere unless that is unavoidable.
For example, whenever there is a not-null id, I want the corresponding name column to always have the name_a and not be null. Instead, one of those example queries returns semi-redundant results, with one row having a name_a but no id, and another having an id but no name_a, rather than a single merged row.
When the joins are listed in the other order, I do get that desired result (but I'm not sure what other problems might occur, because future data is unknown).
Your queries are different.
In the first, you are doing a full join to b using a single column, id.
In the second, you are doing a full join to b using two columns.
Although the two queries could return the same results under some circumstances, there is no reason to think that the results would be comparable.
Argument order matters in OUTER JOINs, except that NATURAL FULL JOIN is symmetric. They return what an INNER JOIN (ON, USING or NATURAL) does, plus the unmatched rows from the left (LEFT JOIN), right (RIGHT JOIN) or both (FULL JOIN) tables, extended by NULLs.
USING returns the single shared value for each specified column in INNER JOIN rows; in NULL-extended rows another common column can have NULL in one table's version and a value in the other's.
Join order matters too. Even NATURAL FULL JOIN is not associative, since with multiple tables each pair of tables (either operand being an original or join result) can have a unique set of common columns, i.e. in general (A ⟗ B) ⟗ C ≠ A ⟗ (B ⟗ C).
There are a lot of special cases where certain additional identities hold. E.g. FULL JOIN USING all common column names and OUTER JOIN ON equality of same-named columns are symmetric. Some cases involve CKs (candidate keys), FKs (foreign keys) and other constraints on arguments.
Your question doesn't make clear exactly what input conditions you are assuming or what output conditions you are seeking.
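To make the order dependence concrete, here is a small sketch with invented data for the three tables from the question. It assumes SQLite through Python's sqlite3 module as a stand-in for PostgreSQL; SQLite only supports FULL JOIN from version 3.39, so the hypothetical helper row_count returns None on older builds instead of failing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name_a TEXT);
    CREATE TABLE b (id INTEGER, id_x INTEGER);
    CREATE TABLE c (id INTEGER, id_x INTEGER);
    INSERT INTO a VALUES (1, 'n1');
    INSERT INTO b VALUES (2, 10);
    INSERT INTO c VALUES (1, 10);
""")

def row_count(sql):
    """Number of rows returned, or None if this SQLite build lacks FULL JOIN."""
    try:
        return len(conn.execute(sql).fetchall())
    except sqlite3.OperationalError:
        return None

# b first: the row for id 1 already carries a NULL id_x when c arrives,
# so c's (1, 10) cannot match it and becomes a separate row.
b_first = row_count(
    "SELECT * FROM a FULL JOIN b USING (id) FULL JOIN c USING (id, id_x)"
)

# c first: a and c merge on id 1, then b's (2, 10) is appended unmatched.
c_first = row_count(
    "SELECT * FROM a FULL JOIN c USING (id) FULL JOIN b USING (id, id_x)"
)

print(b_first, c_first)
```

On a build with FULL JOIN support the two orders give 3 and 2 rows for this data. In the 3-row result, id 1 appears once with name_a and once without, which is the semi-redundant pattern described in the question.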

How is a natural join different from a JOIN ON clause?

So for a fun little lab my Professor assigned, he wants us to create our own queries using different join operations. The ones that I'm curious about are NATURAL JOIN and JOIN ON.
The textbook definition of a natural join - "returns all rows with matching values in the matching columns and eliminates duplicates columns." So, say I have two tables, Customers and Orders. I list all orders submitted by the customer with an id = 1 as follows:
Select Customers.Name
From Customers, Orders
Where Customers.ID = 1
AND Customers.ID = Orders.CID
I want to know how that is different from JOIN ON, which according to the textbook "returns rows that meet the indicated join condition, and typically includes an equality comparison of two expressed columns" i.e. a primary key of one table and a foreign key of another. So a JOIN ON clause essentially does the same thing as a natural join. It returns all rows with matching values according to the parameters specified in the ON clause.
Select Customers.Name
From Customers JOIN Orders ON Customers.ID = Orders.CID
Same results. Is the latter just an easier way to write a natural join, or is there something I'm missing here?
Kinda like how in JavaScript, I can say:
var array = new Array(1, 2, 3);
OR I could just use the quicker and easier literal, without the constructor:
var array = [1, 2, 3];
Edit: Didn't even realize that the natural join uses a JOIN keyword in the FROM clause, and omits the WHERE clause. That just shows how little I know about this language. I'll keep the error for the sake of tracking my own progress.
NATURAL JOIN is:
* always an equi-join
* always matches by equality of all of the same-named attributes
which in essence boils down to there being no way at all to specify the JOIN condition. So you can only specify T1 NATURAL JOIN T2 and that's it, SQL will derive the entire matching condition from just that.
JOIN ON is:
* not always an equi-join (you can also specify JOIN ON T1.A > T2.A)
* not always involving all attributes that correspond by name (if both tables have an attribute named A, you can still leave out ON T1.A = T2.A).
Your ID/CID example is not suitable for using NATURAL JOIN directly. You would have to rename attributes to get the column/attribute matching you want, by stating something like:
SELECT Customers.Name
FROM Customers
NATURAL JOIN
-- many RDBMSs require an alias on a subquery in FROM
(SELECT CID AS ID FROM Orders) AS o
(And as you stated in the question yourself, there is the thing about duplicate removal, which no other form of JOIN does in and of itself. It's an issue of scrupulous conformance to relational theory, which SQL as a whole doesn't exactly excel at, to put it mildly.)
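As a rough check of the rename approach, the sketch below runs both forms against invented Customers and Orders rows. It assumes SQLite through Python's sqlite3 module rather than any particular RDBMS; the OrderNo column, the alias o and the data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (OrderNo INTEGER PRIMARY KEY, CID INTEGER);
    INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Orders VALUES (100, 1), (101, 1), (102, 2);
""")

# Explicit condition: the column names may differ (ID vs CID).
join_on = conn.execute("""
    SELECT Customers.Name
    FROM Customers JOIN Orders ON Customers.ID = Orders.CID
    WHERE Customers.ID = 1
""").fetchall()

# NATURAL JOIN only sees equal names, so CID is renamed to ID first.
natural = conn.execute("""
    SELECT Customers.Name
    FROM Customers
    NATURAL JOIN (SELECT CID AS ID FROM Orders) AS o
    WHERE Customers.ID = 1
""").fetchall()

print(join_on, natural)
```

Both queries return one row per order placed by customer 1, so the name appears twice here.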

How to drop one join key when joining two tables

I have two tables. Both have a lot of columns. Now I have a common column called ID on which I would join.
Now since this variable ID is present in both the tables if I do simply this
select a.*,b.*
from table_a as a
left join table_b as b on a.id=b.id
This will give an error as id is duplicate (present in both the tables and getting included for both).
I don't want to write down separately each column of b in the select statement. I have lots of columns and that is a pain. Can I rename the ID column of b in the join statement itself similar to SAS data merge statements?
I am using Postgres.
Postgres would not give you an error for duplicate output column names, but some clients do. (Duplicate names are also not very useful.)
Either way, use the USING clause as join condition to fold the two join columns into one:
SELECT *
FROM tbl_a a
LEFT JOIN tbl_b b USING (id);
If you join a table to itself (self-join), there will be more duplicate column names, and the query would hardly make sense to begin with. It starts to make sense for different tables, as you stated in your question: I have two tables ...
To avoid all duplicate column names, you have to list them in the SELECT clause explicitly - possibly dealing out column aliases to get both instances with different names.
Or you can use a NATURAL join - if that fits your unexplained use case:
SELECT *
FROM tbl_a a
NATURAL LEFT JOIN tbl_b b;
This joins on all columns that share the same name and folds those automatically - exactly the same as listing all common column names in a USING clause. You need to be aware of rules for possible NULL values ...
Details in the manual.
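A quick way to see the folding is to compare the output column lists of the three join forms. This is only a sketch, assuming SQLite through Python's sqlite3 module in place of Postgres; the columns a_val and b_val and the helper column_names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_a (id INTEGER, a_val TEXT);
    CREATE TABLE tbl_b (id INTEGER, b_val TEXT);
    INSERT INTO tbl_a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO tbl_b VALUES (1, 'b1');
""")

def column_names(sql):
    """Output column names of a query, in order."""
    return [col[0] for col in conn.execute(sql).description]

# ON keeps both id columns in SELECT *.
on_cols = column_names("SELECT * FROM tbl_a a LEFT JOIN tbl_b b ON a.id = b.id")

# USING folds the join column into a single output column.
using_cols = column_names("SELECT * FROM tbl_a a LEFT JOIN tbl_b b USING (id)")

# NATURAL does the same for every shared column name.
natural_cols = column_names("SELECT * FROM tbl_a a NATURAL LEFT JOIN tbl_b b")

print(on_cols, using_cols, natural_cols)
```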

When to use SQL natural join instead of join .. on?

I'm studying SQL for a database exam and the way I've seen SQL is the way it looks on this page:
http://en.wikipedia.org/wiki/Star_schema
I.e. joins written the way JOIN <table name> ON <table attribute> and then the join condition for the selection. My course book and the exercises given to me by the academic institution, however, use only natural join in their examples. So when is it right to use natural join? Should natural join be used if the query can also be written using JOIN .. ON?
Thanks for any answer or comment
A natural join will find columns with the same name in both tables and add one column in the result for each pair found. The inner join lets you specify the comparison you want to make using any column.
IMO, the JOIN ON syntax is much more readable and maintainable than the natural join syntax. Natural joins are a leftover of some old standards, and I try to avoid them like the plague.
The JOIN keyword is used in an SQL statement to query data from two or more tables, based on a relationship between certain columns in these tables.
Different Joins
* JOIN: Return rows when there is at least one match in both tables
* LEFT JOIN: Return all rows from the left table, even if there are no matches in the right table
* RIGHT JOIN: Return all rows from the right table, even if there are no matches in the left table
* FULL JOIN: Return all rows from both tables, matched where possible, with NULLs on the side that has no match
INNER JOIN
http://www.w3schools.com/sql/sql_join_inner.asp
FULL JOIN
http://www.w3schools.com/sql/sql_join_full.asp
A natural join is said to be an abomination because it does not allow qualifying key columns, which makes it confusing: you never know which "common" columns are being used to join two tables simply by looking at the SQL statement.
A NATURAL JOIN matches on any shared column names between the tables, whereas an INNER JOIN only matches on the given ON condition.
The joins are often interchangeable and usually produce the same results. However, there are some important considerations to make:
* If a NATURAL JOIN finds no matching columns, it returns the cross product. This could produce disastrous results if the schema is modified. On the other hand, an INNER JOIN will return a 'column does not exist' error, so the problem surfaces immediately instead of silently.
* An INNER JOIN self-documents with its ON clause, resulting in a clearer query that describes the table schema to the reader.
* An INNER JOIN results in a maintainable and reusable query in which the column names can be swapped in and out with changes in the use case or table schema.
* The programmer can notice column name mismatches (e.g. item_ID vs itemID) sooner if they are forced to define the ON predicate.
Otherwise, a NATURAL JOIN is still a good choice for a quick, ad-hoc query.
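The cross-product case can be reproduced in a few lines. This is a sketch under assumptions: SQLite through Python's sqlite3 module instead of a production RDBMS, and two invented tables whose key columns are spelled item_id and itemid, so they share no column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id INTEGER, name TEXT);
    CREATE TABLE prices (itemid INTEGER, price REAL);  -- itemid, not item_id
    INSERT INTO items VALUES (1, 'pen'), (2, 'ink');
    INSERT INTO prices VALUES (1, 1.5), (2, 4.0);
""")

# No shared column name: NATURAL JOIN quietly returns every combination.
natural_count = conn.execute(
    "SELECT COUNT(*) FROM items NATURAL JOIN prices"
).fetchone()[0]

# Naming the join column makes the mismatch an error instead.
try:
    conn.execute("SELECT COUNT(*) FROM items JOIN prices USING (item_id)")
    explicit_error = None
except sqlite3.OperationalError as exc:
    explicit_error = str(exc)

print(natural_count, explicit_error)
```

Two rows on each side give four rows from the NATURAL JOIN, while the explicit join refuses to run.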

Is NATURAL (JOIN) considered harmful in production environment?

I am reading about NATURAL shorthand form for SQL joins and I see some traps:
* it just takes automatically all same named column-pairs (use USING to specify explicit column list)
* if some new column is added, then join output can be "unexpectedly" changed too, which may be not so obvious (even if you know how NATURAL works) in complicated structures
NATURAL JOIN syntax is an anti-pattern:
* the purpose of the query is less obvious
* the columns used by the application are not clear
* the columns used can change "unexpectedly"
The syntax goes against the modularity rule, about using strict typing whenever possible. Explicit is almost universally better.
Because of this, I don't recommend the syntax in any environment.
I also don't recommend mixing syntax (IE: using both NATURAL JOIN and explicit INNER/OUTER JOIN syntax) - keep a consistent codebase format.
These "traps", which seem to argue against natural joins, cut both ways. Suppose you add a new column to table A, fully expecting it to be used in joining with table B. If you know that every join of A and B is a natural join, then you're done. If every join explicitly uses USING, then you have to track them all down and change them. Miss one and there's a bug.
Use NATURAL joins when the semantics of the tables suggests that this is the right thing to do. Use explicit join criteria when you want to make sure the join is done in a specific way, regardless of how the table definitions might evolve.
One thing that completely destroys NATURAL for me is that most of my tables have an id column, which are obviously semantically all different. You could argue that having a user_id makes more sense than id, but then you end up writing things like user.user_id, a violation of DRY. Also, by the same logic, you would also have columns like user_first_name, user_last_name, user_age... (which also kind of makes sense in view that it would be different from, for example, session_age)... The horror.
I'll stick to my JOIN ... ON ..., thankyouverymuch. :)
I agree with the other posters that an explicit join should be used for reasons of clarity and also to easily allow a switch to an "OUTER" join should your requirements change.
However, most of your "traps" have nothing to do with joins but rather the evils of using "SELECT *" instead of explicitly naming the columns you require "SELECT a.col1, a.col2, b.col1, b.col2". These traps occur whenever a wildcard column list is used.
Adding an extra reason not listed in any of the answers above. In postgres (not sure if this is the case for other databases) if no column names are found in common between the two tables when using NATURAL JOIN then a CROSS JOIN is performed. This means that if you had an existing query and then you were to subsequently change one of the column names in a table, you would still get a set of rows returned from the query rather than an error. If instead you used the JOIN ... USING(...) syntax you would get an error if the joining column was no longer there.
The postgres documentation has a note to this effect:
Note: USING is reasonably safe from column changes in the joined relations since only the listed columns are combined. NATURAL is considerably more risky since any schema changes to either relation that cause a new matching column name to be present will cause the join to combine that new column as well.
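The risk described in that note can be sketched as follows. The example assumes SQLite through Python's sqlite3 module as a stand-in for postgres; the users and sessions tables, the created_at column and the helper counts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, name TEXT);
    CREATE TABLE sessions (session_id INTEGER, user_id INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO sessions VALUES (10, 1), (11, 2);
""")

NATURAL_SQL = "SELECT COUNT(*) FROM users NATURAL JOIN sessions"
USING_SQL = "SELECT COUNT(*) FROM users JOIN sessions USING (user_id)"

def counts():
    """Row counts of the NATURAL and the USING query, in that order."""
    return (
        conn.execute(NATURAL_SQL).fetchone()[0],
        conn.execute(USING_SQL).fetchone()[0],
    )

before = counts()

# Schema change: both tables gain an unrelated column with the same name.
conn.executescript("""
    ALTER TABLE users ADD COLUMN created_at TEXT;
    ALTER TABLE sessions ADD COLUMN created_at TEXT;
    UPDATE users SET created_at = '2020-01-01';
    UPDATE sessions SET created_at = '2024-06-01';
""")

after = counts()
print(before, after)
```

Before the change both queries return 2 rows. Afterwards the NATURAL JOIN also compares created_at and returns 0 rows, while the USING query still returns 2.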
Do you mean the syntax like this:
SELECT *
FROM t1, t2, t3
WHERE t1.id = t2.id
AND t2.id = t3.id
Versus this:
SELECT *
FROM t1
LEFT OUTER JOIN t2 ON t1.id = t2.id
LEFT OUTER JOIN t3 ON t2.id = t3.id
I prefer the 2nd syntax and also format it differently:
SELECT *
FROM T1
LEFT OUTER JOIN T2 ON T2.id = T1.id
LEFT OUTER JOIN T3 ON T3.id = T2.id
In this case, it is very clear what tables I am joining and what ON clause I am using to join them. With the first syntax it is just too easy to leave out a proper join condition and get a huge result set. I do this because I am prone to typos, and this is my insurance against that. Plus, it is visually easier to debug.
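To illustrate the "huge result set" risk, here is a sketch with three invented three-row tables, run on SQLite through Python's sqlite3 module (an assumption; the same SQL should behave alike elsewhere). The helper count is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER);
    CREATE TABLE t2 (id INTEGER);
    CREATE TABLE t3 (id INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1), (2), (3);
    INSERT INTO t3 VALUES (1), (2), (3);
""")

def count(sql):
    """Single COUNT(*) value of a query."""
    return conn.execute(sql).fetchone()[0]

# Comma syntax with the t3 condition forgotten: valid SQL, too many rows.
implicit_missing = count(
    "SELECT COUNT(*) FROM t1, t2, t3 WHERE t1.id = t2.id"
)

# Explicit syntax: each JOIN carries its own ON clause.
explicit = count("""
    SELECT COUNT(*)
    FROM t1
    LEFT OUTER JOIN t2 ON t2.id = t1.id
    LEFT OUTER JOIN t3 ON t3.id = t2.id
""")

print(implicit_missing, explicit)
```

The forgotten condition pairs every matched t1/t2 row with all of t3, giving 9 rows instead of 3, and nothing in the query flags it.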