I'm refactoring some legacy code I've found in a project. This is for MSSQL. The thing is, I can't understand why we're using mixed left and right joins and collating some of the joining conditions together.
My question is this: doesn't this create implicit inner joins in some places and implicit full joins in others?
I'm of the school that just about anything can be written using just left (and inner/full) or just right (and inner/full) joins, but that's because I like to keep things simple where possible.
As an aside, we convert all this stuff to work on Oracle databases as well, so maybe there are some optimization rules that work differently with Oracle?
For instance, here's the FROM part of one of the queries:
FROM Table1
RIGHT OUTER JOIN Table2
ON Table1.T2FK = Table2.T2PK
LEFT OUTER JOIN Table3
RIGHT OUTER JOIN Table4
LEFT OUTER JOIN Table5
ON Table4.T3FK = Table5.T3FK
AND Table4.T2FK = Table5.T2FK
LEFT OUTER JOIN Table6
RIGHT OUTER JOIN Table7
ON Table6.T6PK = Table7.T6FK
LEFT OUTER JOIN Table8
RIGHT OUTER JOIN Table9
ON Table8.T8PK= Table9.T8FK
ON Table7.T9FK= Table9.T9PK
ON Table4.T7FK= Table7.T7PK
ON Table3.T3PK= Table4.T3PK
RIGHT OUTER JOIN ( SELECT *
FROM TableA
WHERE ( TableA.PK = #PK )
AND ( TableA.Date BETWEEN #StartDate
AND #EndDate )
) Table10
ON Table4.T4PK= Table10.T4FK
ON Table2.T2PK = Table4.T2PK
One thing I would do is make sure you know what results you are expecting before messing with this. Wouldn't want to "fix" it and have different results returned. Although honestly, with a query that poorly designed, I'm not sure that you are actually getting correct results right now.
To me this looks like something that someone built up over time, maybe originally starting with inner joins, realizing they wouldn't work, and changing to outer joins without bothering to change the order in which the tables were referenced in the query.
Of particular concern to me for maintenance purposes is to put the ON clauses next to the tables you are joining as well as converting all the joins to left joins rather than mixing right and left joins. Having the ON clause for table 4 and table 3 down next to table 9 makes no sense at all to me and should contribute to confusion as to what the query should actually return. You may also need to change the order of the joins in order to convert to all left joins. Personally I prefer to start with the main table that the others will join to (which appears to be table2) and then work down the food chain from there.
It could probably be converted to use all LEFT joins: I'd look at moving the right-hand table in each RIGHT join above all the existing LEFTs, and then you might be able to turn every RIGHT join into a LEFT join. I'm not sure you'll get any FULL joins behind the scenes. If the query looks like it does, it might be a quirk of this specific query rather than a SQL Server "rule": the query you've provided does seem to be mixing things up in a rather confusing way.
As for Oracle optimisation, that's certainly possible. I have no experience of Oracle myself, but according to a friend who's knowledgeable in this area, Oracle (no idea what version) is/was fussy about the order of predicates. For example, with SQL Server you can write your WHERE clause so that columns are in any order and indexes will still get used, but with Oracle you end up having to specify the columns in the order they appear in the index to get the best performance from it. As stated, I have no idea whether this is the case with newer Oracle versions, but it was the case with older ones (apparently).
Whether this explains this particular construction, I can't say. It could simply be less-than-optimal code that has changed over the years, and a clean-up is what it's begging for.
LEFT and RIGHT joins are pure syntactic sugar.
Any LEFT JOIN can be transformed into a RIGHT JOIN merely by switching the sets.
Pre-9i Oracle used this construct:
WHERE table1.col(+) = table2.col
, (+) here denoting the nullable column, and LEFT and RIGHT joins could be emulated by mere switching:
WHERE table1.col = table2.col(+)
In MySQL, there is no FULL OUTER JOIN and it needs to be emulated.
Usually it is done this way:
SELECT *
FROM table1
LEFT JOIN
table2
ON table1.col = table2.col
UNION ALL
SELECT *
FROM table1
RIGHT JOIN
table2
ON table1.col = table2.col
WHERE table1.col IS NULL
, and it's more convenient to copy the JOIN and replace LEFT with RIGHT, than to swap the tables.
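The emulation above can be tried end to end. The following is a minimal sketch that uses Python's built-in sqlite3 module instead of MySQL, with made-up table contents. Because older SQLite builds have no RIGHT JOIN, the second branch swaps the tables and uses a LEFT JOIN, which is the equivalent rewrite mentioned earlier.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col INTEGER, v1 TEXT);
    CREATE TABLE table2 (col INTEGER, v2 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO table2 VALUES (2, 'x'), (3, 'y');
""")

# Branch 1: every table1 row, with its table2 match if there is one.
# Branch 2: only the table2 rows that have no match in table1.
full_join_rows = conn.execute("""
    SELECT table1.col, table1.v1, table2.col, table2.v2
    FROM table1
    LEFT JOIN table2 ON table1.col = table2.col
    UNION ALL
    SELECT table1.col, table1.v1, table2.col, table2.v2
    FROM table2
    LEFT JOIN table1 ON table1.col = table2.col
    WHERE table1.col IS NULL
""").fetchall()

for row in full_join_rows:
    print(row)
```

The IS NULL filter in the second branch is what keeps the matched rows from being returned twice by the UNION ALL.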
Note that in SQL Server plans, Hash Left Semi Join and Hash Right Semi Join are different operators.
For the query like this:
SELECT *
FROM table1
WHERE table1.col IN
(
SELECT col
FROM table2
)
, Hash Match (Left Semi Join) hashes table1 and removes the matched elements from the hash table at run time (so that they cannot match more than once).
Hash Match (Right Semi Join) hashes table2 and removes the duplicate elements from the hash table while building it.
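Both operators exist to guarantee the same logical result: a row of table1 is returned at most once, however many matching rows table2 holds. The sketch below illustrates only that semantic difference between a semi join (IN) and a plain inner join. It uses Python's sqlite3 module with invented data and says nothing about SQL Server's physical hash operators.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col INTEGER);
    CREATE TABLE table2 (col INTEGER);
    INSERT INTO table1 VALUES (1), (2);
    INSERT INTO table2 VALUES (1), (1), (3);  -- value 1 appears twice
""")

# Semi join: each table1 row is tested for existence of a match.
semi_rows = conn.execute(
    "SELECT col FROM table1 WHERE col IN (SELECT col FROM table2)"
).fetchall()

# Inner join: each matching pair of rows produces an output row.
join_rows = conn.execute(
    "SELECT table1.col FROM table1 JOIN table2 ON table1.col = table2.col"
).fetchall()

print("semi join :", semi_rows)
print("inner join:", join_rows)
```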
I may be missing something here, but the only difference between LEFT and RIGHT joins is the order in which the source tables were written, so having multiple LEFT joins or multiple RIGHT joins is no different from having a mix. The equivalence to FULL OUTERs could be achieved just as easily with all LEFT/RIGHT as with a mix, n'est-ce pas?
We have some LEFT OUTER JOINs and RIGHT OUTER JOINs in the same query. Typically such queries are large, have been around a long time, probably badly written in the first place and have received infrequent maintenance. I assume the RIGHT OUTER JOINs were introduced as a means of maintaining the query without taking on the inevitable risk when refactoring a query significantly.
I think most SQL coders are most comfortable with using all LEFT OUTER JOINs, probably because a FROM clause is read left-to-right in the English way.
The only time I use a RIGHT OUTER JOIN myself is when writing a new query based on an existing query (no need to reinvent the wheel) and I need to change an INNER JOIN to an OUTER JOIN. Rather than change the order of the JOINs in the FROM clause just to be able to use a LEFT OUTER JOIN, I would instead use a RIGHT OUTER JOIN, and this would not bother me. This is quite rare though. If the original query had LEFT OUTER JOINs then I'd end up with a mix of LEFT and RIGHT OUTER JOINs, which again wouldn't bother me. It hasn't happened to me yet, though.
Note that for SQL products such as the Access database engine that do not support FULL OUTER JOIN, one workaround is to UNION a LEFT OUTER JOIN and a RIGHT OUTER JOIN in the same query.
The bottom line is that this is a very poorly formatted SQL statement and should be re-written. Many of the ON clauses are located far from their JOIN statements, which I am not sure is even valid SQL.
For clarity's sake, I would rewrite the query using all LEFT JOINs (rather than RIGHT), and place the ON clauses directly underneath their corresponding JOIN clauses. Otherwise, this is a bit of a train wreck that obfuscates the purpose of the query, making errors during future modifications more likely.
doesn't this create implicit inner joins in some places and implicit full joins in others?
Perhaps you are assuming that because you don't see an ON clause directly after some joins, e.g., RIGHT OUTER JOIN Table4. The ON clause is there, but it is located further down: ON Table4.T7FK= Table7.T7PK. I don't see any implicit inner joins, which could occur if there were a WHERE clause like WHERE Table3.T3PK is not null.
The fact that you are asking questions like this is a testament to the opaqueness of the query.
To answer another portion of this question that hasn't been answered yet: the reason this query is formatted so oddly is that it was likely built using the Query Designer inside SQL Management Studio. The giveaway is the combined ON clauses that appear many lines after the table is mentioned. Essentially, tables get added in the build query window and their order is kept, even if the way things are connected would favor moving a table up, so to speak, and keeping all the joins in one direction.
Just playing around with queries and examples to get a better understanding of joins. I'm noticing that in SQL Server 2008, the following two queries give the same results:
SELECT * FROM TableA
FULL OUTER JOIN TableB
ON TableA.name = TableB.name
SELECT * FROM TableA
FULL JOIN TableB
ON TableA.name = TableB.name
Are these performing exactly the same action to produce the same results, or would I run into different results in a more complicated example? Is this just interchangeable terminology?
Actually they are the same. LEFT OUTER JOIN is the same as LEFT JOIN, and RIGHT OUTER JOIN is the same as RIGHT JOIN. Spelling out OUTER simply makes the contrast with INNER JOIN more explicit.
See this Wikipedia article for details.
Microsoft® SQL Server™ 2000 uses these SQL-92 keywords for outer joins
specified in a FROM clause:
LEFT OUTER JOIN or LEFT JOIN
RIGHT OUTER JOIN or RIGHT JOIN
FULL OUTER JOIN or FULL JOIN
From MSDN
The full outer join or full join returns all rows from both tables, matching up the rows wherever a match can be made and placing NULLs in the places where no matching row exists.
It's true that some databases recognize the OUTER keyword. Some do not.
Where it is recognized, it is usually an optional keyword.
Almost always, FULL JOIN and FULL OUTER JOIN do exactly the same thing. (I can't think of an example where they do not. Can anyone else think of one?)
This may leave you wondering, "Why would it even be a keyword if it has no meaning?" The answer boils down to programming style.
In the old days, programmers strived to make their code as compact as possible. Every character meant longer processing time. We used 1, 2, and 3 letter variables. We used 2 digit years. We eliminated all unnecessary white space. Some people still program that way. It's not about processing time anymore. It's more about fast coding.
Modern programmers are learning to use more descriptive variables and put more remarks and documentation into their code. Using extra words like OUTER make sure that other people who read the code will have an easier time understanding it. There will be less ambiguity. This style is much more readable and kinder to the people in the future who will have to maintain that code.
JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN?
I'm guessing the size of the datasets on each side of the join may make LEFT vs RIGHT a hard call, but how do the others compare?
Also am I correct in assuming JOIN & INNER JOIN are one and the same? If not, how does this fit into the order/ranking.
Yes, JOIN and INNER JOIN are the same. In general the ranking is JOIN is fastest, followed closely by LEFT JOIN which is equivalent to RIGHT JOIN, and then followed very far in the distance by FULL JOIN.
But this ranking is so variable that it can be largely ignored. Your actual performance is highly dependent upon the size of the datasets, availability of proper indexes, and exact query plan chosen. One LEFT JOIN may be fast and the next INNER JOIN might be glacially slow.
That notwithstanding, I would advise avoiding FULL JOIN unless you absolutely need it. (At least in Oracle, which is where I've had bad experiences with it.)
INNER is an optional word when INNER JOIN is desired => so they are one and the same. This is the same as the word OUTER being optional in LEFT/RIGHT/FULL OUTER JOIN
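A quick way to convince yourself that the optional keywords change nothing is to run both spellings against the same data. This is a minimal sketch using Python's sqlite3 module with invented rows. It covers only the INNER and LEFT forms, because RIGHT and FULL joins are not available in older SQLite builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (name TEXT);
    CREATE TABLE TableB (name TEXT);
    INSERT INTO TableA VALUES ('ann'), ('bob');
    INSERT INTO TableB VALUES ('bob'), ('cat');
""")

def rows(join_keyword):
    """Run the same join with the given spelling of the join keyword."""
    sql = (
        "SELECT TableA.name, TableB.name FROM TableA "
        + join_keyword
        + " TableB ON TableA.name = TableB.name ORDER BY TableA.name"
    )
    return conn.execute(sql).fetchall()

inner_short, inner_long = rows("JOIN"), rows("INNER JOIN")
left_short, left_long = rows("LEFT JOIN"), rows("LEFT OUTER JOIN")

print(inner_short == inner_long, left_short == left_long)
```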
In terms of efficiency, it completely depends on what else is happening. If it is a LEFT JOIN with an IS NULL test on the right side (an anti-semi join), then it is very efficient and works like a NOT EXISTS clause.
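The anti-join pattern, a LEFT JOIN filtered down to the rows where the right side came back NULL, returns the same rows as NOT EXISTS. The sketch below checks that on made-up tables A and B using Python's sqlite3 module. It demonstrates the equivalence of results only, not the relative speed on any particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER);
    CREATE TABLE B (a_id INTEGER);
    INSERT INTO A VALUES (1), (2), (3);
    INSERT INTO B VALUES (1), (1), (3);  -- nothing references A.id = 2
""")

# LEFT JOIN, then keep only the A rows that found no partner in B.
anti_join = conn.execute("""
    SELECT A.id FROM A
    LEFT JOIN B ON B.a_id = A.id
    WHERE B.a_id IS NULL
    ORDER BY A.id
""").fetchall()

# The same question asked with NOT EXISTS.
not_exists = conn.execute("""
    SELECT A.id FROM A
    WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.a_id = A.id)
    ORDER BY A.id
""").fetchall()

print(anti_join, not_exists)
```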
Absent other factors, and considering only
SELECT .. FROM A X-JOIN B ON <condition>
1. If results need to be preserved from A, B or both, then efficiency is not a factor. You need a LEFT/RIGHT/FULL join because it provides the correct results.
2. If you need results that match on both sides, and not all data is available from either side, then the same applies: you need an INNER JOIN.
3. Only if the join is bound to find rows on both sides does a LEFT/RIGHT/FULL join become an option. In most cases, the INNER JOIN will be faster because it gives the optimizer the option to start from the smaller (or better indexed) table and hash match to the larger table.
I say "in most cases" in point 3 because different RDBMSs may optimize queries differently.
Ranking them for efficiency would be pointless, as they return different results. If you need a left join, an inner join won't do the job.
Efficiency in a join has more to do with the size of the tables, the indexing, and how the rest of the query is written than with whether it is an INNER, OUTER, CROSS or FULL JOIN. A CROSS JOIN on two small tables might be fast, but an INNER JOIN on two large tables with a WHERE clause that is not sargable would not be.
Provided that the tables could essentially be inner joined, since the where clause excludes all records that don't match, just exactly how bad is it to use the first of the following 2 query statement syntax styles:
SELECT {COLUMN LIST}
FROM TABLE1 t1, TABLE2 t2, TABLE3 t3, TABLE4 t4 (etc)
WHERE t1.uid = t2.foreignid
AND t2.uid = t3.foreignid
AND t3.uid = t4.foreignid
etc
instead of
SELECT {COLUMN LIST}
FROM TABLE1 t1
INNER JOIN TABLE2 t2 ON t1.uid = t2.foreignid
INNER JOIN TABLE3 t3 ON t2.uid = t3.foreignid
INNER JOIN TABLE4 t4 ON t3.uid = t4.foreignid
I'm not sure if this is limited to Microsoft SQL Server, or even a particular version, but my understanding is that the first scenario does a full outer join to make all possible correlations accessible.
I've used the first approach in the past to optimise queries that access two significantly large stores of data that each have peripheral tables joined to them, with the products of those joins coming together late in the query. By allowing each of the "larger" tables to join to its respective lookup tables, and only combining a specific subset of each of the larger tables, I found that there were notable speed improvements over introducing the large tables to each other prior to specific filtering.
Under normal (simple joins) circumstance, would it not be far better to use the second scenario? I find it to be more easily readable and it seems like it'll be much faster.
INNER JOIN ON vs WHERE clause
Maybe the best way to answer this is to take a look at how the database handles the query internally. If you're on SQL Server, use Profiler to see how many reads etc. each query takes and the query plan to see what route is being taken through the data. Statistics, skewing etc. will also most likely play a role.
The first query doesn't produce a full OUTER join (which is the union of both LEFT and RIGHT joins). Essentially, unless there are some internal, parser-specific optimizations, both queries are equal.
Personally I would never use the first syntax. It may be the same performance-wise, but it is harder to maintain and far more subject to accidental cross joins when things get complex. If you miss an ON condition, the query will fail the syntax check; if you miss one of the WHERE conditions that is the equivalent of an ON condition, it will happily do a cross join. It is also a syntax that is 17 years out of date, for goodness' sake!
Further, the left and right joins in the old syntax are broken in SQL Server and do NOT always return the correct results (it can sometimes interpret the query as a cross join instead of an outer join). They have been deprecated and will not be usable at all in the next version. If you need to change one of the queries to use an outer join, you can be looking at a major rewrite, as it is especially bad to try to mix the two kinds of syntax.
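The accidental cross join is easy to reproduce. The following sketch uses Python's sqlite3 module with three tiny, invented tables and drops one join condition from the comma-style query. Note that SQLite itself accepts a JOIN without an ON clause, so the syntax-check failure described above (SQL Server behaviour) is not shown here; only the comma-syntax hazard is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (uid INTEGER);
    CREATE TABLE t2 (uid INTEGER, foreignid INTEGER);
    CREATE TABLE t3 (uid INTEGER, foreignid INTEGER);
    INSERT INTO t1 VALUES (1), (2);
    INSERT INTO t2 VALUES (10, 1), (20, 2);
    INSERT INTO t3 VALUES (100, 10), (200, 20);
""")

# All join conditions present: one output row per t1 -> t2 -> t3 chain.
complete = conn.execute("""
    SELECT COUNT(*) FROM t1, t2, t3
    WHERE t1.uid = t2.foreignid
      AND t2.uid = t3.foreignid
""").fetchone()[0]

# The t2-to-t3 condition is forgotten. The query is still valid SQL,
# but every matched (t1, t2) pair is now combined with every t3 row.
forgotten = conn.execute("""
    SELECT COUNT(*) FROM t1, t2, t3
    WHERE t1.uid = t2.foreignid
""").fetchone()[0]

print("complete:", complete, "forgotten condition:", forgotten)
```

With realistic table sizes the second count grows multiplicatively, which is why the mistake tends to show up as a performance problem before anyone notices the wrong rows.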
Do these two queries differ from each other?
Query 1:
SELECT * FROM Table1, Table2 WHERE Table1.Id = Table2.RefId
Query 2:
SELECT * FROM Table1 INNER JOIN Table2 ON Table1.Id = Table2.RefId
I analysed both methods and they clearly produced the same actual execution plans. Do you know of any cases where using inner joins would be more efficient? What is the real advantage of using inner joins rather than the approach of "Query 1"?
The two statements you have provided are functionally equivalent to one another.
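As a small check of that equivalence, the sketch below runs both forms against the same invented data with Python's sqlite3 module and compares the rows. The `note` column and the sample values are assumptions added for illustration, and the comparison covers results only, not execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Id INTEGER);
    CREATE TABLE Table2 (RefId INTEGER, note TEXT);
    INSERT INTO Table1 VALUES (1), (2), (3);
    INSERT INTO Table2 VALUES (1, 'a'), (1, 'b'), (3, 'c'), (4, 'd');
""")

# Query 1: comma-separated tables, join condition in WHERE.
query1 = conn.execute("""
    SELECT * FROM Table1, Table2
    WHERE Table1.Id = Table2.RefId
    ORDER BY Table1.Id, Table2.note
""").fetchall()

# Query 2: explicit INNER JOIN, join condition in ON.
query2 = conn.execute("""
    SELECT * FROM Table1
    INNER JOIN Table2 ON Table1.Id = Table2.RefId
    ORDER BY Table1.Id, Table2.note
""").fetchall()

print(query1 == query2)
```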
The variation is caused by differing SQL syntax standards.
For a really exciting read, you can lookup the various SQL standards by visiting the following Wikipedia link. On the right hand side are references and links to the various dialects/standards of SQL.
http://en.wikipedia.org/wiki/SQL
These SQL statements are synonymous, though specifying the INNER JOIN is the preferred method and follows ISO format. I prefer it as well because it limits the plumbing of joining the tables from your where clause and makes the goal of your query clearer.
These will result in an identical query plan, but the INNER JOIN, OUTER JOIN, CROSS JOIN keywords are preferred because they add clarity to the code.
While you have the ability to specify join hints using the keywords in the FROM clause, you can do more complicated joins in the WHERE clause. But otherwise, there will be no difference in query plan.
I will also add that the first syntax is much more subject to inadvertent cross joins as the queries get complicated. Further the left and right joins in this syntax do not work properly in SQL server and should never be used. Mixing the syntax when you add a left join can also cause problems where the query does not correctly return the results. The syntax in the first example has been outdated for 17 years, I see no reason to ever use it.
Query 1 is considered an old syntax style and its use is discouraged. You will run into problems when you use LEFT and RIGHT joins with that syntax style. Also, on SQL Server you can have problems mixing those two different styles together in queries that use views of different formats.
I have found a significant difference using the LEFT OUTER JOINS and putting the conditions on the joined table in the ON clause rather than the WHERE clause. Once you put a condition on the joined table in the WHERE clause, you defeat the left outer join.
When I was using Oracle, I used the archaic (+) after the joined table (with all conditions, including join conditions, in the WHERE clause) because that's what I knew. When we became a SQL Server shop, I was forced to use LEFT OUTER JOINs, and I found they didn't work as before until I discovered this behavior. Here's an example:
select NC.*,
IsNull(F.STRING_VAL, 'NONE') as USER_ID,
CO.TOTAL_AMT_ORDERED
from customer_order CO
INNER JOIN VTG_CO_NET_CHANGE NC
ON NC.CUST_ORDER_ID=CO.ID
LEFT OUTER JOIN USER_DEF_FIELDS F
ON F.DOCUMENT_ID = CO.ID and
F.PROGRAM_ID='VMORDENT' and
F.ID='UDF-0000072' and
F.DOCUMENT_ID is not null
where NC.acct_year=2017
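The effect described above can be reproduced on a much smaller scale. The sketch below uses Python's sqlite3 module with two hypothetical tables, `orders` and `fields`, that loosely mirror the query above. It runs the same LEFT OUTER JOIN twice, once with the filter on the joined table in the ON clause and once with it in the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER);
    CREATE TABLE fields (document_id INTEGER, program_id TEXT, string_val TEXT);
    INSERT INTO orders VALUES (1), (2);
    INSERT INTO fields VALUES (1, 'VMORDENT', 'alice'), (2, 'OTHER', 'bob');
""")

# Filter in ON: order 2 has no qualifying field row, so it is kept
# with NULLs from the joined table (shown as 'NONE').
filter_in_on = conn.execute("""
    SELECT o.id, IFNULL(f.string_val, 'NONE')
    FROM orders o
    LEFT OUTER JOIN fields f
      ON f.document_id = o.id AND f.program_id = 'VMORDENT'
    ORDER BY o.id
""").fetchall()

# Filter in WHERE: the join keeps order 2, but the WHERE clause then
# discards it, so the query behaves like an inner join.
filter_in_where = conn.execute("""
    SELECT o.id, IFNULL(f.string_val, 'NONE')
    FROM orders o
    LEFT OUTER JOIN fields f
      ON f.document_id = o.id
    WHERE f.program_id = 'VMORDENT'
    ORDER BY o.id
""").fetchall()

print("ON   :", filter_in_on)
print("WHERE:", filter_in_where)
```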
Do any queries exist that require RIGHT JOIN, or can they always be re-written with LEFT JOIN?
And more specifically, how do you re-write this one without the right join (and I guess implicitly without any subqueries or other fanciness):
SELECT *
FROM t1
LEFT JOIN t2 ON t1.k2 = t2.k2
RIGHT JOIN t3 ON t3.k3 = t2.k3
You can always re-write them to get the same result set. However, sometimes the execution plan may be different in significant ways (performance), and sometimes a right join lets you express the query in a way that makes more sense.
Let me illustrate the performance difference. Programmers tend to think in terms of an SQL statement happening all at once. However, it's useful to keep a mental model that complicated queries happen in a series of steps where tables are typically joined in the order listed. So you may have a query like this:
SELECT * /* example: don't care what's returned */
FROM LargeTable L
LEFT JOIN MediumTable M ON M.L_ID=L.ID
LEFT JOIN SmallTable S ON S.M_ID=M.ID
WHERE ...
The server will normally start by applying anything it can from the WHERE clause to the first table listed (LargeTable, in this case), to reduce what it needs to load into memory. Then it will join the next table (MediumTable), and then the one after that (SmallTable), and so on.
What we want to do is use a strategy that accounts for the expected impact of each joined table on the results. In general you want to keep the result set as small as possible for as long as possible. Apply that principle to the example query above, and we see it's obviously much slower than it needs to be. It starts with the larger sets (tables) and works down. We want to begin with the smaller sets and work up. That means using SmallTable first, and the way to do that is via a RIGHT JOIN.
Another key here is that the server usually can't know which rows from SmallTable will be needed until the join is completed. Therefore it only matters if SmallTable is so much smaller than LargeTable that loading the entire SmallTable into memory is cheaper than whatever you would start with from LargeTable (which, being a large table, is probably well-indexed and probably filters on a field or three in the where clause).
It's important to also point out that in the vast majority of cases the optimizer will look at this and handle things in the most efficient way possible, and most of the time the optimizer is going to do a better job at this than you could.
But the optimizer isn't perfect. Sometimes you need to help it along: especially if one or more of your "tables" is a view (perhaps into a linked server!) or a nested select statement, for example. A nested sub-query is also a good case of where you might want to use a right join for expressive reasons: it lets you move the nested portion of the query around so you can group things better.
You can always use only left Joins...
SELECT * FROM t1
LEFT JOIN t2 ON t1.k2 = t2.k2
RIGHT JOIN t3 ON t3.k3 = t2.k3
is equivalent to:
SELECT * FROM t3
LEFT JOIN (t1 LEFT JOIN t2
ON t2.k2 = t1.k2)
ON t2.k3 = t3.k3
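The rewrite can be exercised on sample data. This is a minimal sketch using Python's sqlite3 module. The `id` column and the rows are invented for illustration, and the original mixed LEFT/RIGHT form is only run when the linked SQLite library is 3.39 or newer, since earlier versions have no RIGHT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, k2 INTEGER);
    CREATE TABLE t2 (k2 INTEGER, k3 INTEGER);
    CREATE TABLE t3 (k3 INTEGER);
    INSERT INTO t1 VALUES (1, 10), (2, 20);
    INSERT INTO t2 VALUES (10, 100);
    INSERT INTO t3 VALUES (100), (200);
""")

COLUMNS = "t1.id, t2.k2, t3.k3"

# LEFT-only form: t3 is the preserved table, and the t1/t2 join is
# nested so that it is evaluated as a unit before joining to t3.
left_only = conn.execute(f"""
    SELECT {COLUMNS}
    FROM t3
    LEFT JOIN (t1 LEFT JOIN t2 ON t2.k2 = t1.k2)
           ON t2.k3 = t3.k3
""").fetchall()

# Original mixed form, compared only where RIGHT JOIN is supported.
if sqlite3.sqlite_version_info >= (3, 39, 0):
    mixed = conn.execute(f"""
        SELECT {COLUMNS}
        FROM t1
        LEFT JOIN t2 ON t1.k2 = t2.k2
        RIGHT JOIN t3 ON t3.k3 = t2.k3
    """).fetchall()
    assert set(mixed) == set(left_only)

for row in left_only:
    print(row)
```

In both forms every t3 row survives, while the t1 row whose t2 match is missing drops out because it cannot satisfy the condition on t2.k3.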
In general I always try to use only Left Joins, as the table on the left in a Left Join is the one whose rows are ALL included in the output, and I like to think of it (the left side) as the "base" set I am performing the Cartesian product (join) against. So I like to have it first in the SQL.
It's a bit like asking if using greater-than is ever required. Use the one that better fits the task at hand.
Yes! all the time! (Have to admit, mostly used when you're strict as to which table you want to call first)
On this subject: here's a nice visual guide on joins.
You can always swap the table order to turn a RIGHT JOIN into a LEFT JOIN. Sometimes it's just more efficient to do it one way or the other.
There are many elements of many programming languages which are not strictly required to achieve the correct results but which permit one a) to express intent more clearly b) to boost performance. Examples include numbers, characters, loops, switches, classes, joins, types, filters, and thousands more.
I use LEFT JOINs about 99.999% of the time, but some of my dynamic code generation uses RIGHT JOINs, which means that the stuff outside the join doesn't need to be reversed.
I'd also like to add that the specific example you give I believe produces a cross join, and that is probably not your intention or even a good design.
i.e. I think it's effectively the same as:
SELECT *
FROM t1
CROSS JOIN t3
LEFT JOIN t2
ON t1.k2 = t2.k2
AND t3.k3 = t2.k3
And also, because it's a cross join, there's not a lot the optimizer is going to be able to do.