TABLE1 T1, TABLE2 T2 WHERE T1.Blah = T2.Blah - VS - INNER JOIN - sql

Provided that the tables could essentially be inner joined, since the where clause excludes all records that don't match, just exactly how bad is it to use the first of the following 2 query statement syntax styles:
SELECT {COLUMN LIST}
FROM TABLE1 t1, TABLE2 t2, TABLE3 t3, TABLE4 t4 (etc)
WHERE t1.uid = t2.foreignid
AND t2.uid = t3.foreignid
AND t3.uid = t4.foreignid
etc
instead of
SELECT {COLUMN LIST}
FROM TABLE1 t1
INNER JOIN TABLE2 t2 ON t1.uid = t2.foreignid
INNER JOIN TABLE3 t3 ON t2.uid = t3.foreignid
INNER JOIN TABLE4 t4 ON t3.uid = t4.foreignid
I'm not sure if this is limited to Microsoft SQL Server, or even a particular version, but my understanding is that the first scenario does a full outer join to make all possible correlations accessible.
I've used the first approach in the past to optimise queries that access two significantly large stores of data that each have peripheral tables joined to them, with the product of those joins coming together late in the query. By allowing each of the "larger" tables to join to their respective lookup tables, and only combining a specific subset of each of the larger tables, I found that there were notable speed improvements over introducing the large tables to each other prior to specific filtering.
Under normal (simple join) circumstances, would it not be far better to use the second scenario? I find it to be more easily readable and it seems like it'll be much faster.

INNER JOIN ON vs WHERE clause

Maybe the best way to answer this is to take a look at how the database handles the query internally. If you're on SQL Server, use Profiler to see how many reads etc. each query takes and the query plan to see what route is being taken through the data. Statistics, skewing etc. will also most likely play a role.

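The first query doesn't produce a full OUTER join (which is the union of both LEFT and RIGHT joins). Logically it is a cross join filtered by the WHERE clause, which returns the same rows as the INNER JOIN. Essentially, unless there are some [internal] SQL parser-specific optimizations, both queries are equal.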
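A quick way to convince yourself of this is to run both styles against the same data. The sketch below uses Python's built-in sqlite3 module; SQLite and the parent/child tables are my own assumptions for illustration, not the asker's schema. It shows that both queries return the same rows and that unmatched rows on either side are dropped, which a full outer join would not do.

```python
import sqlite3

# In-memory database with hypothetical tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child  (uid INTEGER PRIMARY KEY, foreignid INTEGER, label TEXT);
    INSERT INTO parent VALUES (1, 'a'), (2, 'b'), (3, 'no children');
    INSERT INTO child  VALUES (10, 1, 'x'), (11, 1, 'y'), (12, 2, 'z'), (13, 99, 'orphan');
""")

# Old style: comma-separated tables, join condition in WHERE.
implicit_rows = conn.execute("""
    SELECT p.name, c.label
    FROM parent p, child c
    WHERE p.uid = c.foreignid
    ORDER BY p.name, c.label
""").fetchall()

# Explicit style: INNER JOIN ... ON.
explicit_rows = conn.execute("""
    SELECT p.name, c.label
    FROM parent p
    INNER JOIN child c ON p.uid = c.foreignid
    ORDER BY p.name, c.label
""").fetchall()

print(implicit_rows == explicit_rows)
print(implicit_rows)  # neither 'no children' nor 'orphan' appears
```

Whether the two forms also get the same execution plan is up to the engine, so check the plan on your own server rather than relying on this.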

Personally I would never use the first syntax. It may be the same performance-wise, but it is harder to maintain and far more subject to accidental cross joins when things get complex. If you miss an ON condition, it will fail the syntax check; if you miss one of the WHERE conditions that is the equivalent of an ON condition, it will happily do a cross join. It is also a syntax that is 17 years out of date, for goodness' sake!
Further, the left and right join operators in the old syntax (*= and =*) are broken in SQL Server and do NOT always return the correct results (it can sometimes interpret the query as a cross join instead of an outer join); they have been deprecated and will not be usable at all in the next version. If you need to change one of the queries to use an outer join, then you can be looking at a major rewrite, as it is especially bad to try to mix the two kinds of syntax.
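To make the accidental cross join concrete, here is a small sketch using Python's built-in sqlite3 module (SQLite and the three tiny tables are assumptions for illustration). Dropping one WHERE condition from the comma-style query does not fail; it silently multiplies the row count. Note that whether a missing ON is rejected depends on the engine, so that half of the argument is not shown here.

```python
import sqlite3

# Hypothetical three-table chain: t1 <- t2 <- t3, three rows each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (uid INTEGER PRIMARY KEY);
    CREATE TABLE t2 (uid INTEGER PRIMARY KEY, foreignid INTEGER);
    CREATE TABLE t3 (uid INTEGER PRIMARY KEY, foreignid INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (10, 1), (11, 2), (12, 3);
    INSERT INTO t3 VALUES (100, 10), (101, 11), (102, 12);
""")

# All join conditions present.
complete_count = conn.execute("""
    SELECT COUNT(*)
    FROM t1, t2, t3
    WHERE t1.uid = t2.foreignid
      AND t2.uid = t3.foreignid
""").fetchone()[0]

# The t2/t3 condition was forgotten: every matched (t1, t2) pair
# is now combined with every row of t3.
missing_count = conn.execute("""
    SELECT COUNT(*)
    FROM t1, t2, t3
    WHERE t1.uid = t2.foreignid
""").fetchone()[0]

print(complete_count, missing_count)
```

With three rows per table the difference is 3 versus 9 rows; on real tables the same mistake can produce millions of rows.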

Related

Optimizing regular join SQL queries

How to optimize the speed of SQL queries looking like this:
select ... from TABLE
left join TABLE2 on TABLE2.COL2 = TABLE.COL
left join TABLE3 on TABLE3.COL2 = TABLE2.COL
etc.
I am asking from a SQL (precisely Postgres) point of view, e.g.: does the order of the joins matter? Do subqueries or CTE help? Does the type of join matter?
I am not asking from a database implementation point of view, e.g. indexes, tablespaces, configuration variables, etc.
In theory the order of the joins should not matter, since the built-in query optimizer should put the joins that reduce the volume of the result set the most before those that have less effect on the volume.
However, in practice I have learned that it is always best to help the performance as much as you can and put the more restrictive joins before the less restrictive ones.
So generally speaking, the less you rely on the query optimizer, the better the performance will be in the edge cases.
Here you can learn more about the query optimizer: http://www.postgresql.org/docs/9.1/static/runtime-config-query.html#RUNTIME-CONFIG-QUERY-GEQO
As a rule of thumb, using a join should be faster than a CTE or sub-queries, but this is just a rule and exceptions are still possible.
Also some of the problems need both joins and CTE.
This is the key question: does the type of join matter?
Yes it does! Actually this matters most of all! :)
Here you can see the idea behind the different join types: http://en.wikipedia.org/wiki/Join_(SQL)
For the left and right join these 2 statements are equal:
... table1 LEFT JOIN table2 ...
... table2 RIGHT JOIN table1 ...
Right and left outer joins are functionally equivalent. Neither provides any functionality that the other does not, so right and left outer joins may replace each other as long as the table order is switched.

Is NATURAL (JOIN) considered harmful in production environment?

I am reading about NATURAL shorthand form for SQL joins and I see some traps:
it just takes automatically all same named column-pairs (use USING to specify explicit column list)
if some new column is added, then join output can be "unexpectedly" changed too, which may be not so obvious (even if you know how NATURAL works) in complicated structures
NATURAL JOIN syntax is an anti-pattern:
The purpose of the query is less obvious;
the columns used by the application are not clear
the columns used can change "unexpectedly"
The syntax goes against the modularity rule, about using strict typing whenever possible. Explicit is almost universally better.
Because of this, I don't recommend the syntax in any environment.
I also don't recommend mixing syntax (i.e. using both NATURAL JOIN and explicit INNER/OUTER JOIN syntax) - keep a consistent codebase format.
These "traps", which seem to argue against natural joins, cut both ways. Suppose you add a new column to table A, fully expecting it to be used in joining with table B. If you know that every join of A and B is a natural join, then you're done. If every join explicitly uses USING, then you have to track them all down and change them. Miss one and there's a bug.
Use NATURAL joins when the semantics of the tables suggests that this is the right thing to do. Use explicit join criteria when you want to make sure the join is done in a specific way, regardless of how the table definitions might evolve.
One thing that completely destroys NATURAL for me is that most of my tables have an id column, which are obviously semantically all different. You could argue that having a user_id makes more sense than id, but then you end up writing things like user.user_id, a violation of DRY. Also, by the same logic, you would also have columns like user_first_name, user_last_name, user_age... (which also kind of makes sense in view that it would be different from, for example, session_age)... The horror.
I'll stick to my JOIN ... ON ..., thankyouverymuch. :)
I agree with the other posters that an explicit join should be used for reasons of clarity and also to easily allow a switch to an "OUTER" join should your requirements change.
However, most of your "traps" have nothing to do with joins but rather with the evils of using "SELECT *" instead of explicitly naming the columns you require, "SELECT a.col1, a.col2, b.col1, b.col2". These traps occur whenever a wildcard column list is used.
Adding an extra reason not listed in any of the answers above. In Postgres (not sure if this is the case for other databases), if no column names are found in common between the two tables when using NATURAL JOIN, then a CROSS JOIN is performed. This means that if you had an existing query and you were to subsequently change one of the column names in a table, you would still get a set of rows returned from the query rather than an error. If instead you used the JOIN ... USING(...) syntax, you would get an error if the joining column was no longer there.
The postgres documentation has a note to this effect:
Note: USING is reasonably safe from column changes in the joined relations since only the listed columns are combined. NATURAL is considerably more risky since any schema changes to either relation that cause a new matching column name to be present will cause the join to combine that new column as well.
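The behaviour described above is easy to reproduce. The sketch below uses Python's built-in sqlite3 module rather than Postgres (that substitution, and the table and column names, are my assumptions), but SQLite treats a NATURAL JOIN with no shared column names as a cross join in the same way. The "renamed" table stands in for a schema change where the join column got a new name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (key_col INTEGER, a_val TEXT);
    CREATE TABLE b (key_col INTEGER, b_val TEXT);
    -- Same data as b, but the join column has been renamed.
    CREATE TABLE b_renamed (b_key INTEGER, b_val TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO b VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO b_renamed VALUES (1, 'b1'), (2, 'b2');
""")

# Shared column name: NATURAL JOIN matches on key_col.
natural_before = conn.execute(
    "SELECT COUNT(*) FROM a NATURAL JOIN b").fetchone()[0]

# No shared column name: the query still runs, but as a cross join.
natural_after = conn.execute(
    "SELECT COUNT(*) FROM a NATURAL JOIN b_renamed").fetchone()[0]

# USING names the column explicitly, so the same change is an error.
try:
    conn.execute("SELECT COUNT(*) FROM a JOIN b_renamed USING (key_col)")
    using_failed = False
except sqlite3.OperationalError as exc:
    using_failed = True
    print("USING rejected the query:", exc)

print(natural_before, natural_after, using_failed)
```

The NATURAL JOIN quietly goes from 2 rows to 4, while the USING form refuses to run, which is the failure mode you would rather have.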
Do you mean the syntax like this:
SELECT *
FROM t1, t2, t3
WHERE t1.id = t2.id
AND t2.id = t3.id
Versus this:
SELECT *
FROM t1
LEFT OUTER JOIN t2 ON t1.id = t2.id
LEFT OUTER JOIN t3 ON t2.id = t3.id
I prefer the 2nd syntax and also format it differently:
SELECT *
FROM T1
LEFT OUTER JOIN T2 ON T2.id = T1.id
LEFT OUTER JOIN T3 ON T3.id = T2.id
In this case, it is very clear what tables I am joining and what ON clause I am using to join them. With the first syntax it is just too easy to leave out the proper join condition and get a huge result set. I do this because I am prone to typos, and this is my insurance against that. Plus, it is visually easier to debug.

Is there any difference between using innerjoin and writing all the tables directly in the from segment?

Do these two queries differ from each other?
Query 1:
SELECT * FROM Table1, Table2 WHERE Table1.Id = Table2.RefId
Query 2:
SELECT * FROM Table1 INNER JOIN Table2 ON Table1.Id = Table2.RefId
I analysed both methods and they clearly produced the same actual execution plans. Do you know any cases where using inner joins would work in a more efficient way? What is the real advantage of using inner joins rather than the approach of "Query 1"?
The two statements you have provided are functionally equivalent to one another.
The variation is caused by differing SQL syntax standards.
For a really exciting read, you can look up the various SQL standards by visiting the following Wikipedia link. On the right-hand side are references and links to the various dialects/standards of SQL.
http://en.wikipedia.org/wiki/SQL
These SQL statements are synonymous, though specifying the INNER JOIN is the preferred method and follows ISO format. I prefer it as well because it limits the plumbing of joining the tables from your where clause and makes the goal of your query clearer.
These will result in an identical query plan, but the INNER JOIN, OUTER JOIN, CROSS JOIN keywords are preferred because they add clarity to the code.
While you have the ability to specify join hints using the keywords in the FROM clause, you can do more complicated joins in the WHERE clause. But otherwise, there will be no difference in query plan.
I will also add that the first syntax is much more subject to inadvertent cross joins as the queries get complicated. Further, the left and right joins in this syntax do not work properly in SQL Server and should never be used. Mixing the syntax when you add a left join can also cause problems where the query does not correctly return the results. The syntax in the first example has been outdated for 17 years; I see no reason to ever use it.
Query 1 is considered an old syntax style and its use is discouraged. You will run into problems when you use LEFT and RIGHT joins with that syntax style. Also, on SQL Server you can have problems mixing those two different styles together in queries that use views of different formats.
I have found a significant difference using the LEFT OUTER JOINS and putting the conditions on the joined table in the ON clause rather than the WHERE clause. Once you put a condition on the joined table in the WHERE clause, you defeat the left outer join.
When I was using Oracle, I used the archaic (+) after the joined table (with all conditions, including join conditions, in the WHERE clause) because that's what I knew. When we became a SQL Server shop, I was forced to use LEFT OUTER JOINs, and I found they didn't work as before until I discovered this behavior. Here's an example:
select NC.*,
IsNull(F.STRING_VAL, 'NONE') as USER_ID,
CO.TOTAL_AMT_ORDERED
from customer_order CO
INNER JOIN VTG_CO_NET_CHANGE NC
ON NC.CUST_ORDER_ID=CO.ID
LEFT OUTER JOIN USER_DEF_FIELDS F
ON F.DOCUMENT_ID = CO.ID and
F.PROGRAM_ID='VMORDENT' and
F.ID='UDF-0000072' and
F.DOCUMENT_ID is not null
where NC.acct_year=2017
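The same effect can be reproduced on a cut-down, hypothetical version of that schema. The sketch below uses Python's built-in sqlite3 module; the orders and user_fields tables are assumptions that stand in for the real ones, and IFNULL is SQLite's counterpart to IsNull. It runs the query twice, once with the condition on the joined table in the ON clause and once in the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE user_fields (document_id INTEGER, program_id TEXT, string_val TEXT);
    INSERT INTO orders VALUES (1), (2), (3);
    -- Order 1 has the wanted field, order 2 only an unrelated one, order 3 none.
    INSERT INTO user_fields VALUES (1, 'VMORDENT', 'alice'), (2, 'OTHER', 'bob');
""")

# Condition in the ON clause: every order survives, missing fields become 'NONE'.
on_rows = conn.execute("""
    SELECT o.id, IFNULL(f.string_val, 'NONE')
    FROM orders o
    LEFT OUTER JOIN user_fields f
        ON f.document_id = o.id AND f.program_id = 'VMORDENT'
    ORDER BY o.id
""").fetchall()

# Same condition in the WHERE clause: rows where f.* is NULL are filtered out,
# so the outer join behaves like an inner join.
where_rows = conn.execute("""
    SELECT o.id, IFNULL(f.string_val, 'NONE')
    FROM orders o
    LEFT OUTER JOIN user_fields f ON f.document_id = o.id
    WHERE f.program_id = 'VMORDENT'
    ORDER BY o.id
""").fetchall()

print(on_rows)
print(where_rows)
```

The ON version keeps all three orders, while the WHERE version keeps only the one order that has the field.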

Mixing Left and right Joins? Why?

I'm doing some refactoring of legacy code I've found in a project. This is for MSSQL. The thing is, I can't understand why we're using mixed left and right joins and collating some of the joining conditions together.
My question is this: doesn't this create implicit inner joins in some places and implicit full joins in others?
I'm of the school that just about anything can be written using just left (and inner/full) or just right (and inner/full), but that's because I like to keep things simple where possible.
As an aside, we convert all this stuff to work on Oracle databases as well, so maybe there are some optimization rules that work differently with Oracle?
For instance, here's the FROM part of one of the queries:
FROM Table1
RIGHT OUTER JOIN Table2
ON Table1.T2FK = Table2.T2PK
LEFT OUTER JOIN Table3
RIGHT OUTER JOIN Table4
LEFT OUTER JOIN Table5
ON Table4.T3FK = Table5.T3FK
AND Table4.T2FK = Table5.T2FK
LEFT OUTER JOIN Table6
RIGHT OUTER JOIN Table7
ON Table6.T6PK = Table7.T6FK
LEFT OUTER JOIN Table8
RIGHT OUTER JOIN Table9
ON Table8.T8PK= Table9.T8FK
ON Table7.T9FK= Table9.T9PK
ON Table4.T7FK= Table7.T7PK
ON Table3.T3PK= Table4.T3PK
RIGHT OUTER JOIN ( SELECT *
FROM TableA
WHERE ( TableA.PK = #PK )
AND ( TableA.Date BETWEEN #StartDate
AND #EndDate )
) Table10
ON Table4.T4PK= Table10.T4FK
ON Table2.T2PK = Table4.T2PK
One thing I would do is make sure you know what results you are expecting before messing with this. Wouldn't want to "fix" it and have different results returned. Although honestly, with a query that poorly designed, I'm not sure that you are actually getting correct results right now.
To me this looks like something that someone did over time maybe even originally starting with inner joins, realizing they wouldn't work and changing to outer joins but not wanting to bother changing the order the tables were referenced in the query.
Of particular concern to me for maintenance purposes is to put the ON clauses next to the tables you are joining as well as converting all the joins to left joins rather than mixing right and left joins. Having the ON clause for table 4 and table 3 down next to table 9 makes no sense at all to me and should contribute to confusion as to what the query should actually return. You may also need to change the order of the joins in order to convert to all left joins. Personally I prefer to start with the main table that the others will join to (which appears to be table2) and then work down the food chain from there.
It could probably be converted to use all LEFT joins: I'd look at moving the right-hand table in each RIGHT join above all the existing LEFTs; you might then be able to turn every RIGHT join into a LEFT join. I'm not sure you'll get any FULL joins behind the scenes -- if the query looks like it is, it might be a quirk of this specific query rather than a SQL Server "rule": the query you've provided does seem to be mixing things up in a rather confusing way.
As for Oracle optimisation -- that's certainly possible. No experience of Oracle myself, but speaking to a friend who's knowledgeable in this area, Oracle (no idea what version) is/was fussy about the order of predicates. For example, with SQL Server you can write your WHERE clause so that columns are in any order and indexes will get used, but with Oracle you end up having to specify the columns in the order they appear in the index in order to get the best performance with the index. As stated -- no idea if this is the case with newer Oracle versions, but it was the case with older ones (apparently).
Whether this explains this particular construction, I can't say. It could simply be less-than-optimal code if it's changed over the years, and a clean-up is what it's begging for.
LEFT and RIGHT join are pure syntactic sugar.
Any LEFT JOIN can be transformed into a RIGHT JOIN merely by switching the sets.
Pre-9i Oracle used this construct:
WHERE table1.col(+) = table2.col
, (+) here denoting the nullable column, and LEFT and RIGHT joins could be emulated by mere switching:
WHERE table1.col = table2.col(+)
In MySQL, there is no FULL OUTER JOIN and it needs to be emulated.
Usually it is done this way:
SELECT *
FROM table1
LEFT JOIN
table2
ON table1.col = table2.col
UNION ALL
SELECT *
FROM table1
RIGHT JOIN
table2
ON table1.col = table2.col
WHERE table1.col IS NULL
, and it's more convenient to copy the JOIN and replace LEFT with RIGHT, than to swap the tables.
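For engines that lack RIGHT JOIN as well (older SQLite versions, for instance), the same emulation works by swapping the tables instead, which also illustrates that a RIGHT JOIN is just a LEFT JOIN with the sets switched. The sketch below uses Python's built-in sqlite3 module; that choice and the sample data are my assumptions. It names the columns explicitly so both halves of the UNION ALL line up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col INTEGER);
    CREATE TABLE table2 (col INTEGER);
    INSERT INTO table1 VALUES (1), (2);
    INSERT INTO table2 VALUES (2), (3);
""")

# First half: everything from table1, matched where possible.
# Second half: "table1 RIGHT JOIN table2" written as "table2 LEFT JOIN table1",
# restricted to the rows that had no partner in table1.
full_outer_rows = conn.execute("""
    SELECT t1.col, t2.col
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.col = t2.col
    UNION ALL
    SELECT t1.col, t2.col
    FROM table2 t2
    LEFT JOIN table1 t1 ON t1.col = t2.col
    WHERE t1.col IS NULL
""").fetchall()

print(full_outer_rows)
```

Each input row shows up exactly once: the matched pair, the table1-only row padded with NULL, and the table2-only row padded with NULL.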
Note that in SQL Server plans, Hash Left Semi Join and Hash Right Semi Join are different operators.
For the query like this:
SELECT *
FROM table1
WHERE table1.col IN
(
SELECT col
FROM table2
)
, Hash Match (Left Semi Join) hashes table1 and removes the matched elements from the hash table at runtime (so that they cannot match more than once).
Hash Match (Right Semi Join) hashes table2 and removes the duplicate elements from the hash table while building it.
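Both operators exist to guarantee the semi-join property: a table1 row is returned at most once, however many times its value appears in table2. The sketch below shows that property at the SQL level using Python's built-in sqlite3 module (SQLite and the sample data are assumptions; it says nothing about which physical operator SQL Server would pick).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col INTEGER);
    CREATE TABLE table2 (col INTEGER);
    INSERT INTO table1 VALUES (1), (2), (3);
    -- The value 1 appears three times on the table2 side.
    INSERT INTO table2 VALUES (1), (1), (1), (2);
""")

# Semi join: each table1 row appears at most once.
semi_rows = conn.execute("""
    SELECT col
    FROM table1
    WHERE table1.col IN (SELECT col FROM table2)
    ORDER BY col
""").fetchall()

# Plain inner join: table1 rows are repeated once per matching table2 row.
join_rows = conn.execute("""
    SELECT table1.col
    FROM table1
    INNER JOIN table2 ON table1.col = table2.col
    ORDER BY table1.col
""").fetchall()

print(semi_rows)
print(join_rows)
```

The IN query returns 1 and 2 once each, while the inner join returns 1 three times.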
I may be missing something here, but the only difference between LEFT and RIGHT joins is the order in which the source tables are written, and so having multiple LEFT joins or multiple RIGHT joins is no different to having a mix. The equivalent of FULL OUTERs could be achieved just as easily with all LEFT/RIGHT as with a mix, n'est-ce pas?
We have some LEFT OUTER JOINs and RIGHT OUTER JOINs in the same query. Typically such queries are large, have been around a long time, probably badly written in the first place and have received infrequent maintenance. I assume the RIGHT OUTER JOINs were introduced as a means of maintaining the query without taking on the inevitable risk when refactoring a query significantly.
I think most SQL coders are most comfortable with using all LEFT OUTER JOINs, probably because a FROM clause is read left-to-right in the English way.
The only time I use a RIGHT OUTER JOIN myself is when writing a new query based on an existing query (no need to reinvent the wheel) and I need to change an INNER JOIN to an OUTER JOIN. Rather than change the order of the JOINs in the FROM clause just to be able to use a LEFT OUTER JOIN, I would instead use a RIGHT OUTER JOIN and this would not bother me. This is quite rare though. If the original query had LEFT OUTER JOINs then I'd end up with a mix of LEFT and RIGHT OUTER JOINs, which again wouldn't bother me. Hasn't happened to me yet, though.
Note that for SQL products such as the Access database engine that do not support FULL OUTER JOIN, one workaround is to UNION a LEFT OUTER JOIN and a RIGHT OUTER JOIN in the same query.
The bottom line is that this is a very poorly formatted SQL statement and should be re-written. Many of the ON clauses are located far from their JOIN statements, which I am not sure is even valid SQL.
For clarity's sake, I would rewrite the query using all LEFT JOINs (rather than RIGHT), and place the ON clauses directly underneath their corresponding JOIN clauses. As it stands, this is a bit of a train wreck that obfuscates the purpose of the query, making errors during future modifications more likely.
doesn't this create implicit inner joins in some places and implicit full joins in others?
Perhaps you are assuming that because you don't see the ON clause directly after some joins, e.g., RIGHT OUTER JOIN Table4; it is there, but located further down, ON Table4.T7FK= Table7.T7PK. I don't see any implicit inner joins, which could occur if there was a WHERE clause like WHERE Table3.T3PK is not null.
The fact that you are asking questions like this is a testament to the opaqueness of the query.
To answer another portion of this question that hasn't been answered yet, the reason this query is formatted so oddly is that it was likely built using the Query Designer inside SQL Server Management Studio. The giveaway is the combined ON clauses that appear many lines after the table is mentioned. Essentially, tables get added in the build query window and the order is kept even if the way things are connected would favor moving a table up, so to speak, and keeping all the joins in one direction.

is it better to put more logic in your ON clause or should it only have the minimum necessary?

Given these two queries:
Select t1.id, t2.companyName
from table1 t1
INNER JOIN table2 t2 on t2.id = t1.fkId
WHERE t2.aField <> 'C'
OR:
Select t1.id, t2.companyName
from table1 t1
INNER JOIN table2 t2 on t2.id = t1.fkId and t2.aField <> 'C'
Is there a demonstrable difference between the two? Seems to me that the clause "t2.aField <> 'C'" will run on every row in t2 that meets the join criteria regardless. Am I incorrect?
Update: I did an "Include Actual Execution Plan" in SQL Server. The two queries were identical.
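For an INNER JOIN the two placements also return the same rows, which is easy to check on sample data. This sketch uses Python's built-in sqlite3 module; SQLite and the sample rows are assumptions, and it compares results only, not execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, companyName TEXT, aField TEXT);
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, fkId INTEGER);
    INSERT INTO table2 VALUES (1, 'Acme', 'A'), (2, 'Bolt', 'C'), (3, 'Crux', 'B');
    INSERT INTO table1 VALUES (10, 1), (11, 2), (12, 3), (13, 1);
""")

# Filter in the WHERE clause.
where_rows = conn.execute("""
    SELECT t1.id, t2.companyName
    FROM table1 t1
    INNER JOIN table2 t2 ON t2.id = t1.fkId
    WHERE t2.aField <> 'C'
    ORDER BY t1.id
""").fetchall()

# Same filter folded into the ON clause.
on_rows = conn.execute("""
    SELECT t1.id, t2.companyName
    FROM table1 t1
    INNER JOIN table2 t2 ON t2.id = t1.fkId AND t2.aField <> 'C'
    ORDER BY t1.id
""").fetchall()

print(where_rows == on_rows)
print(where_rows)
```

This equivalence holds for inner joins only; with an outer join the placement changes the result.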
I prefer to use the Join criteria for explaining how the tables are joined together.
So I would place the additional clause in the where section.
I hope (although I have no stats), that SQL Server would be clever enough to find the optimal query plan regardless of the syntax you use.
HOWEVER, if you have indexes which also have id, and aField in them, I would suggest placing them together in the inner join criteria.
It would be interesting to see the query plans in these 2 (or 3) scenarios, and see what happens. Nice question.
There is a difference. You should do an EXPLAIN PLAN for both of the selects and see it in detail.
As for a simpler explanation:
Logically, the WHERE clause gets applied only after the joining of the two tables, so it is evaluated for each row returned from the join and not necessarily for every row from table2.
Performance-wise, it's best to eliminate unwanted results early on so that there are fewer rows for joins, where clauses or other operations to deal with later on.
In the second example, there are 2 columns that have to be same for the rows to be joined together so it usually will give different results than the first one.
It depends.
SELECT
t1.foo,
t2.bar
FROM
table1 t1
LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId
WHERE
t2.SomeValue IS NULL
is different from
SELECT
t1.foo,
t2.bar
FROM
table1 t1
LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId AND t2.SomeValue IS NULL
It is different because the former keeps only the t1 rows that either have no match in t2 or whose matching t2 row has NULL in t2.SomeValue. The latter keeps every t1 row and only fills in the t2 columns when the matching t2 row has NULL in t2.SomeValue.
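Running both forms against a few rows makes the difference visible. The sketch below uses Python's built-in sqlite3 module; SQLite and the sample rows are assumptions chosen so that table1 has one row matching a non-NULL SomeValue, one matching a NULL SomeValue, and one with no match at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (SomeId INTEGER, foo TEXT);
    CREATE TABLE table2 (SomeId INTEGER, SomeValue TEXT, bar TEXT);
    INSERT INTO table1 VALUES (1, 'f1'), (2, 'f2'), (3, 'f3');
    INSERT INTO table2 VALUES (1, 'set', 'b1'), (2, NULL, 'b2');
""")

# IS NULL test in WHERE: applied after the outer join has padded with NULLs.
where_rows = conn.execute("""
    SELECT t1.foo, t2.bar
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId
    WHERE t2.SomeValue IS NULL
    ORDER BY t1.foo
""").fetchall()

# IS NULL test in ON: only decides which t2 rows count as a match.
on_rows = conn.execute("""
    SELECT t1.foo, t2.bar
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId AND t2.SomeValue IS NULL
    ORDER BY t1.foo
""").fetchall()

print(where_rows)
print(on_rows)
```

The WHERE form drops f1 entirely, while the ON form keeps f1 with a NULL in place of bar.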
Just use the ON clause for the join condition and the WHERE clause for the filter.
Unless moving the join condition to the where clause changes the meaning of the query (like in the left join example above), then it doesn't matter where you put them. SQL will re-arrange them, and as long as they are provably equivalent, you'll get the same query.
That being said, I think it's more of a logical / readability thing. I usually put anything that relates two tables in the join, and anything that filters in the where.
I'd prefer the first query. SQL Server will use the best join type for your query based on the indexes you have, and after that will apply the WHERE clause. But you can run both queries at the same time, look at the execution plans, compare and choose the fastest (adding indexes can help too).
Unless you are working on a single-user app or something similarly small that creates trivial load, the only consideration that means anything is how the server will process your query.
The answers that mention query plans give good advice.
In addition, run SET STATISTICS IO ON to get an idea of how many reads your query will generate (I especially love Azder's post).
Think of every DB server as a pump of data from disk to client. That pump goes faster if it performs only the IO needed to get the job done. If the data is in cache it will be even faster. But you don't want to be reading more than you need from disk - that will result in crowding useful data out of your cache for no good reason.