I've just been reading up on NATURAL JOIN / USING - SQL92 features which are (sadly?) missing from SQL Server's current repertoire.
Has anyone come from a DBMS that supported these to SQL Server (or another non-supporting DBMS) - were they as useful as they sound, or a can of worms (which also sounds possible!)?
I never use NATURAL JOIN because I don't like the possibility that the join could do something I don't intend just because some column name exists in both tables.
I do use the USING join syntax occasionally, but just as often it turns out that I need a more complex join condition than USING can support, so I convert it to the equivalent ON syntax after all.
Would you consider a DBMS that was truly relational?:
in Tutorial D [a truly relational
language], the only “join” operator is
called JOIN, and it means “natural
join”... There should be no other kind
of join... Few people have had the
experience of using a proper
relational language. Of those who
have, I strongly suspect that none of
them ever complained about some
perceived inconvenience in pairing
columns according to their names
Source: "The Importance of Column Names" by Hugh Darwen
It's a matter of convenience. Not indispensable, but it should have its place, for example in interactive querying (every keystroke brings us closer to RSI, anyway), or some simple cases of hand written SQL even in production code (yes, I wrote that. And even seen JOIN USING in serious code, written by wise programmers other than myself. But, I'm digressing).
I found this question when looking for confirmation that SS is missing this feature, and I got it. I am only bewildered by the amount of hate against this syntax, which I attribute to the Sour Grapes Syndrome. I feel amused when being lectured with a patronising tone Sweets (read: syntactic sugar) is bad for your health. You don't need it anyway.
What is nice in the JOIN USING syntax, is that it works not just on column names, but also on column aliases, for example:
-- foreign key "order".customerId references (customer.id)
SELECT c.*, c.id as customerId, o.* from customer c
join "order" o using (customerId);
I don't agree with "Join using would be better, if only (...)". Or the argument, that you may need more complex conditions. From a different point of view, why use JOIN ON? Why not be pure, and move all conditions to the WHERE clause?
SELECT t1.*, t2.* from t1, t2 where t2.t1_id = t1.id;
I could now go mad and argue, how this is the cleanest way to express a join, and you can immediately start adding more conditions in the where clause, which you usually need anyway, blah blah blah...
So you shouldn't miss this particular syntax too dearly, but there's nothing to be happy about for not having it ("Phew, that was close. So good not to have JOIN USING. I was spared a lot of pain").
So, while I personally use JOIN ON 99% of the time, I feel no Schadenfreude when there is no JOIN USING or NATURAL JOIN.
I don't see the value of the USING or NATURAL syntax - as you've encountered, only ON is consistently implemented so it's best from a portability standpoint.
Being explicit is also better for maintenance, besides that the alternatives can be too limited to deal with situations. I'd also prefer my codebase be consistent.
Related
This question comes from my readings of C.J Date's SQL and Relational Theory: How to Write Accurate SQL Code and looking up about joins on the internet (which includes coming across multiple posts here on NATURAL JOINs (and about SQL Server's lack of support for it))
So here is my problem...
On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred).
On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction).
Is the reconciliation of these that:
Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS.
and / or
Good/Better table design should remove/minimise the problems that natural join creates.
?
a number of points regarding your question (even if I'm afraid I'm not really answering anything you asked),
"On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred)."
This seems to suggest that you interpret theory as if it proscribes against "other kinds" of joins ... That is not really true. Relational theory does not say "you cannot have antijoins", or "you should never use antijoins", or anything like that. What it DOES say, is that in the relational algebra, a set of primitive operators can be identified, in which natural join is the only "join-like" operator. All other "join-like" operators, can always be expressed equivalently in terms of the primitive operators defined. Cartesian product, for example, is a special case of a natural join (where the set of common attributes is empty), and if you want the cartesian product of two tables that do have an attribute name in common, you can address this using RENAME. Semijoin, for example, is the natural join of the first table with some projection on the second. Antijoin, for example (SEMIMINUS or NOT MATCHING in Date's book), is the relational difference between the first table and a SEMIJOIN of the two. etc. etc.
"On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction)."
Where are such things advised ? In the SQL standard ? I don't really think so. It is important to distinguish between the SQL language per se, which is defined by an ISO standard, and some (/any) particular implementation of that language, which is built by some particular vendor. If Microsoft advises its customers to not use NJ in SQL Server 200x, then that advice has a completely different meaning than an advice by someone to not ever use NJ in SQL altogether.
"Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS."
While it is true that SQL per se fails to faithfully comply with relational theory, that actually has very little to do with the question of NJ.
Whether an implementation gives good performance for invocations of NJ, is a characteristic of that implementation, not of the language, or of the "degree of trueness" of the 'R' in 'RDBMS'. It is very easy to build a TRDBMS that doesn't use SQL, and that gives ridiculous execution times for NJ. The SQL language per se has everything that is needed to support NJ. If an implementation supports NJ, then NJ will work in that implementation too. Whether it gives good performance, is a characteristic of that implementation, and poor performance of some particular implementation should not be "extrapolated" to other implementations, or be seen as a characteristic of the SQL language per se.
"Good/Better table design should remove/minimise the problems that natural join creates."
Problems that natural join creates ? Controlling the columns that appear in the arguments to a join is easily done by adding explicit projections (and renames if needed) on the columns you want. Much like you also want to avoid SELECT * as much as possible, for basically the same reason ...
First, the choice between theory and being practical is a fallacy. To quote Chris Date: "the truth is that theory--at least the theory I'm talking about here, which is relational theory--is most definitely very practical indeed".
Second, consider that natural join relies on attribute naming. Please (re)read the following sections of the Accurate SQL Code book:
6.12. The Reliance on Attribute Names. Salient quote:
The operators of the relational algebra… all rely heavily on attribute
naming.
3.9. Column Naming in SQL. Salient quote:
Strong recommendation: …if two columns in SQL represent "the same kind
of information," give them the same name wherever possible. (That's
why, for example, the two supplier number columns in the
suppliers-and-parts database are both called SNO and not, say, SNO in
one table and SNUM in the other.) Conversely, if two columns represent
different kinds of information, it's usually a good idea to give them
different names.
I'd like to address #kuru kuru pa's point (a good one too) about columns being added to a table over which you have no control, such as a "web service you're consuming." It seems to me that this problem is effectively mitigated using the strategy suggested by Date in section 3.9 (referenced above): quote:
For every base table, define a view identical to that base table except possibly for some column renaming.
Make sure the set of views so defined abides by the column naming discipline described above.
Operate in terms of those views instead of the underlying base tables.
Personally, I find the "natural join considered dangerous" attitude frustrating. Not wishing to sound self-righteous but my own naming convention, which follows the guidance of ISO 11179-5 Naming and identification principles, results in schema highly suited to natural join.
Sadly, natural join perhaps won't be supported anytime soon in the DBMS product I use professionally (SQL Server): the relevant feature request on Microsoft Connect
is currently closed as "won't fix" despite currently having a respectable +38 / -2 score
has been reopened and gained a respectable 46 / -2 score
(go vote for it now :)
The main problem with the NATURAL JOIN syntax in SQL is that it is typically too verbose.
In Tutorial D syntax I can very simply write a natural join as:
R{a,b,c} JOIN S{a,c,d};
But in SQL the SELECT statement needs either derived table subqueries or a WHERE clause and aliases to achieve the same thing. That's because a single "SELECT statement" is really a non-relational, compound operator in which the component operations always happen in a predetermined order. Projection comes after joins and columns in the result of a join don't necessarily have unique names.
E.g. the above query can be written in SQL as:
SELECT DISTINCT a, b, c, d
FROM
(SELECT a,b,c FROM R) R
NATURAL JOIN
(SELECT a,c,d FROM S) S;
or:
SELECT DISTINCT R.a, R.b, R.c, S.d
FROM R,S
WHERE R.a = S.a AND R.c = S.c;
People will likely prefer the latter version because it is shorter and "simpler".
Theory versus reality...
Natural joins are not practical.
There is no such thing as a pure (i.e. practice is idetical to theory) RDBMS, as far as I know.
I think Oracle and a few others actually support support natural joins -- TSQL doesn't.
Consider the world we live in -- chances of two tables each having a column with the same name is pretty high (like maybe [name] or [id] or [date], etc.). Maybe those chances are narrowed down a bit by grouping only those tables you might actually want to join. But regardless, without a careful examination of the table structure, you won't know if a "natural join" is a good idea or not. And even if it is, at that moment, it might not be in another year when the application gets an upgrade which adds columns to certain tables, etc., or the web service you're consuming adds fields you didn't know about, etc.
I think a "pure" system would have to be one you had 100% control over at a minimum, and then also, one that would have some good validation in the alter table / create table process that would warn / prevent you from creating a new column in some table that could be "naturally" joined to some other table you might not be intending it to be join-able to.
I guess bottom-line for me would be, valuing my sanity, wanting my applications to have maximum up-time, valuing quick/clean maintenance and upgrades, etc. -- good table design in this context means not using natural joins (ever).
Is there any support for natural joins in recent Microsoft SQL Server editions? Or is there a good alternative for making SQL Server work out the predicates that would have been in the ON clauses based on the referential integrity?
No, and thank the lucky stars
I can't believe that you'd want the engine to guess the JOIN for you
Related links:
SQL Server - lack of NATURAL JOIN / x JOIN y USING(field)
is NATURAL JOIN any better than SELECT FROM WHERE in terms of performance ?
Edit, to explain why
The JOIN (whether USING or ON) is clear and explicit
I should be able to name my columns for the entity stored in the table, without worrying about what a column is called in another table, without NATURAL JOIN side effects
Quoting Bill Karwin in this excellent answer:
I never use NATURAL JOIN because I don't like the possibility that the
join could do something I don't intend just because some column name
exists in both tables.
MS SQL does not support natural join, neither join using (). You have to explicitly write down all your attributes used in the join.
If the datamodel changes, you have to change all "natural join" written by hand and make sure your join condition is ok again.
I wouldn't expect to see it any time soon. A Connect suggestion from 2006 has very little info other than:
Thanks for your feedback. We will look into your request for one of the upcoming releases.
And has only received ~30 upvotes
use Full Outer join instead of Natural Join. It works for me.
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Is there something wrong with joins that don't use the JOIN keyword in SQL or MySQL?
Hi,
i'ave always retrieved data without joins...
but is there a benefit to one method over the other?
select * from a INNER JOIN b on a.a = b.b;
select a.*,b.* from a,b where a.a = b.b;
Thanks!
The first method using the INNER JOIN keyword is:
ANSI SQL standard
much cleaner and more expressive
Therefore, I always cringe when I see the second option used - it just bloats up your WHERE clause, and you can't really see at one glance how the tables are joined (on what fields).
Should you happen to forget one of the JOIN conditions in a long list of WHERE clause expressions, you suddenly get a messy cartesian product..... can't do that with the INNER JOIN keyword (you must express what field(s) to join on).
I'd say the biggest benefit is readability. The version with the explicitly named join types are much easier for me to comprehend.
You are using a different syntax for a JOIN, basically. As a matter of best practices, it is best to use the first syntax (explicit JOIN) because it is clearer what the intention of the query is and makes the code easier to maintain.
These are both joins. they are just two different syntactical representations for joins. The first one, (using the "Join" keyword, is the current ANSI Standard (as of 1992 I think).
In the case of inner joins only, the two differeent representations are functionally identical, but the latter ANSI SQL92 standard syntax is much moire readable, once you get used to it, because each individual join condition is associated with the pair of intermediate resultsets being joined together, In the older representation, the join conditions are all together, along with the overall queries' filter conditions, in the where clause, and it is not as clear which is which. This makes identifying bad join conditions (where for example, an unintended cartesian product will be generated) much more difficult.
But more important, perhaps, is that, when performing an outer Join, in certain scenarios, the older syntax is NOT equivilent, and in fact will generate the WRONG resultset.
You should transition to the newer syntax for all your queries.
You've always retrieved the data with joins. The second query is using old syntax, but in the background it is still join :)
This depends on the RDBMS but in the case of SQL server I understand that the utilizing the former syntax allows for better optimization. This is less of a SQL question and more of a vendor specific question.
You can also use the EXPLAIN (SQL Server: Query Execution Plan) type functions to help you understand if there is a difference. Each query is unique and I imagine that the stored statistics can (and will) alter the behavior.
I'm creating a query that will display information for a record which is derived from 8 tables. The developer who originally wrote the query used a combination of 'where this equals this' AND 'this equals this' to create the joins.
I have since changed the query to use INNER JOINS. I wondered if my approach was better than utilising a combination of WHERE operators.
On a measure of good practice, is a combination of INNER JOINS a good option or should I be employing a different technique.
From performance point of view, there will not be any difference... atleast not on prominent RDBMS like Sql-server / Oracle... These Database Engine, are capable of identifying both the patterns means the same and use the same execution plan for both...
In my humble opinion both are equally good practice, but you should maintain consistency and proper alignment... Somewhere I heard that where clause are generally used by oracle developers, and inner join by Sql-Server... Not very much sure about that... By now most of the coders are capable of understanding both types of queries...
I beleive that there is no performance difference between both :
ANSI Style
SELECT * FROM Table1 a JOIN TABLE2 b on a.id = b.id
Old Style
SELECT * FROM Table1 a , TABLE2 b
WHERE a.id = b.id
The ANSI style is newer and more readable and pretty, i do prefer the old style especially when it comes to join more than 4/5 Tables... maybe because im an Oracle Dev as it was said before o_O
Both will behave the same (as The King said). Personally I prefer the INNER/OUTER/LEFT JOIN syntax because its more intuitive/explicit. When you get into LEFT JOINs with join conditions in the WHERE clause then you have to get into using plus signs and then you have to remember which side to put them on. Urgghh.
It also seems (again as The King said) to be true that Oracle devs favour putting the join conditions into the WHERE clause.
-Jamiet
I've seen people writing the following query
SELECT column_name(s)
FROM table_name1
LEFT JOIN table_name2
ON table_name1.column_name=table_name2.column_name
written like this
SELECT column_name(s)
FROM table_name1
LEFT JOIN table_name2
ON table_name2.column_name =table_name1.column_name
does it actually make any difference?
There is no difference.
table_name1.column_name=table_name2.column_name
and
table_name2.column_name=table_name1.column_name
have the same meaning.
The 'left' part of the join is referring the the tables and not to the comparison -- I guess that was your implicit question?
Not under normal circumstances, no. It's imaginable that there could be some sufficiently complex query, much more complex than your examples, where the join optimizer would be so overwhelmed that this could matter, but if that's even possible you'd have to work pretty hard.
You can write both of those as USING (column_name), by the way. :)
I'd also recommend, for purposes of DIY analysis of issues like this, that you familiarize yourself with EXPLAIN.
It's likely to be a matter of personal taste; I happen to write joins like your second example but couldn't say I'm 100% consistent in doing it.
Also, back in the Bad Old Days when the query optimizers weren't as smart as they are now, you could find that re-ordering the operands in a join expression or WHERE clause could make a big difference; i.e., the optimizer would use an index if the operands were ordered one way but not when they were reversed.
I think I'm having an Oracle 7 flasback. Gotta lie down...