Where vs ON in outer join - sql

I am wondering how to have a better SQL performance when we decide whether to duplicate our criteria when it is already in Where clause.
My friend claimed it is up to DB engines but I am not so sure.
Regardless of DB engines, normally, the condition in Where clause should be executed first before join, but I assume it means inner join but not outer join. Because some conditions can only be executed AFTER outer join.
For example:
Select a.*, b.*
From A a
Left outer join B on a.id = b.id
Where b.id is NULL;
The condition in Where cannot be executed before outer join.
So, I assume the whole ON clause must be executed first before where clause, and it seems the ON clause will control the size of table B (or table A if we use right outer join) before outer join. That seems not related to DB engines to me.
And that raised my question: when we use outer join, should we always deplicate our criteria in ON Clause?
for example (I use a table to outer join with a shorter version of itself)
temp_series_installment & series_id > 18940000 vs temp_series_installment:
select sql_no_cache s.*, t.* from temp_series_installment s
left outer join temp_series_installment t on s.series_id = t.series_id and t.series_id > 18940000 and t.incomplete = 1
where t.incomplete = 1;
VS
select sql_no_cache s.*, t.* from temp_series_installment s
left outer join temp_series_installment t on s.series_id = t.series_id and t.series_id > 18940000
where t.incomplete = 1;
Edit: where t.incomplete = 1 performs the logic of: where t.series_id is not null
which is an inner join suggested by Gordon Linoff
But what I have been asking is: if it outer join a smaller table, it should have been faster right?
I tried to see if there is any performace difference in mysql:
But it is out of my expectation, why is the second one faster? I thought by outer joining a smaller table, the query will be faster.
My idea is from:
https://www.ibm.com/support/knowledgecenter/en/SSZLC2_8.0.0/com.ibm.commerce.developer.doc/refs/rsdperformanceworkspaces.htm
Section:
Push predicates into the OUTER JOIN clause whenever possible
Duplicate constant condition for different tables whenever possible

Regardless of DB engines, normally, the condition in Where clause should be executed first before join, but I assume it means inner join but not outer join. Because some conditions can only be executed AFTER outer join.
This is simply not true. SQL is a descriptive language. It does not specify how the query gets executed. It only specifies what the result set looks like. The SQL compiler/optimizer determines the actual processing steps to meet the requirements described by the query.
In terms of semantics, the FROM clause is the first clause that is "evaluated". Hence, FROM is logically processed before the WHERE clause.
The rest of your question is similarly misguided. Comparison logic in the where clause, such as:
from s left join
t
on s.series_id = t.series_id and t.series_id > 18940000
where t.incomplete = 1
turns the outer join into an inner join. Hence, the logic is different from what you think is going on.

As Gordon Lindolf pointed out it's not true, Your friend is plain wrong.
I want just to add developers like to think SQL like they think their language of trade (C++, VB, Java), but those are procedural/imperative languages.
When you code SQL you are in another paradigm. You are just describing a function to be applied to a dataset.
Let's get your own example:
Select a.*, b.*
From A a
Left outer join B on a.id = b.id
Where b.id is NULL;
If a.Id and b.Id are not null columns.
It's semantically equal to
Select a.*, null, ..., null
From A a
where not exists (select * from B b where b.Id = a.Id)
Now try to run those to queries and profile.
In most DBMS I can expect both queries to run in the exact same way.
It happens because the engine decides how to implement your "function" over the dataset.
Note the above example is the equivalent in set mathematics to:
Give me the set A minus the intersection between A and B.
Engines can decide how to implement your query because they have some tricks under its sleeve.
It has metrics about your tables, indexes, etc and can use it to, for example, "make a join" in a diferent order you wrote it.
IMHO engines today are really good at finding the best way to implement the function you describe and rarely needs query hints.
Of course you can end describing your funciton in a way too complicated, affecting how the engines decides to run it.
The art of better describing functions and sets and managins indexes is what we call query tunning.

Related

SQL Join query with filter condition : performance

I want to join a couple of tables but filter a particular type from 'A' table. Which is the better query from the following two queries (is there a better approach?) or is there no difference cause of query optimizer ?
When filter condition is given in 'WHERE' clause:
SELECT .. FROM A a JOIN B b ON a.id=b.id JOIN C c on a.id = c.id...<other joins>...WHERE a.col='SOME_VAL';
The filter condition is given inside 'ON':
SELECT .. FROM A a JOIN B b ON a.id=b.id AND a.col='SOME_VAL' JOIN C c on a.id = c.id...<other joins>
With INNER JOINs there is no resulting performance difference, which you can check using EXPLAIN (both forms of your query should yield the same plan).
The benefit of using the WHERE clause can often be readability. When combined with a multi-line layout. (I strongly recommend Not keeping all your SQL as one long line.)
SELECT
stuff
FROM
foo
INNER JOIN
bar
ON (join predicates here)
WHERE
(static filters here)
The reason it becomes easier to read (and maintain) is that the join predicates now expressly only describe the Relationship between the tables. The query still runs correctly without the WHERE clause, but returns a larger set.

SQL INNER JOIN implemented as implicit JOIN

Recently, I came across an SQL query which looked like this:
SELECT * FROM A, B WHERE A.NUM = B.NUM
To me, it seems as if this will return exactly the same as an INNER JOIN:
SELECT * FROM A INNER JOIN B ON A.NUM = B.NUM
Is there any sane reason why anyone would use a CROSS JOIN here? Edit: it seems as if most SQL applications will automatically use a INNER JOIN here.
The database is HSQLDB
The older syntax is a SQL antipattern. It should be replaced with an inner join anytime you see it. Part of why it is an antipattern is because it is impoosible to tell if a cross join was intended or not if the where clasues is ommitted. This causes many accidental cross joins espcially in complex queries. Further, in some databases (espcially Sql server) the implict outer joins do not work correctly and so people try to combine explicit and implict joins and get bad results without even realizing it. All in all it is a poor practice to even consider using an implict join.
Yes, your both statements will return the same result. Which one is to be used is a matter of taste. Every sane database system will use a join for both if possible, no sane optimizer will really use a cross product in the first case.
But note that your first syntax is not a cross join. It is just an implicit notation for a join which does not specify which kind of join to use. Instead, the optimizer must check the WHERE clauses to determine whether to use an inner join or a cross join: If an applicable join condition is found in the WHERE clause, this will result in an inner join. If no such clause is found it will result in a cross join. Since your first example specifies an applicable join condition (WHERE A.NUM = B.NUM) this results in an INNER JOIN and thus exactly equivalent to your second case.

Filtering using the JOIN instead of WHERE

In SQL (MSSQL, Oracle, etc., whatever), when joining tables, what is the gain from adding a filter to the JOIN statement instead of having it in the WHERE clause?
i.e.
SELECT * FROM X INNER JOIN Y ON X.A = Y.A WHERE X.B = 'SOMETHING'
versus
SELECT * FROM X INNER JOIN Y ON X.A = Y.A AND X.B = 'SOMETHING'
I realize that this does not work in all cases, but I've noticed that in some cases there appears to be a performance gain by putting the filter criteria in the JOIN statement. However, since it's a part of the JOIN statement, it can also cause it to behave a little strangely.
Thoughts?
For INNER JOIN queries, the performance characteristics of these filters will depend on many factors - the size of the tables, indexing, the selectivity of the query, and other factors specific to the RDBMS on which the query is executed.
In LEFT and RIGHT OUTER JOIN, the position of the filter matters much more than INNER JOIN, since affects whether it will be applied before (JOIN clause) or after (WHERE clause) the join is carried out.
I sometimes do this in queries that have a lot of joins because it localises all the information about the join in one part of the query rather than having some in the join condition and some in the where clause.
For an INNER JOIN, I would not expect a performance difference, but rather that the same plan would be used whether the filter was in the JOIN...ON clause or the WHERE clause. I personally prefer to use write the join criteria in the JOIN clause and the filtering in the WHERE clause- a sort of way to stick all the "parameters" to the SQL statement in the same place- this isn't necessarily sensible or well-thought-out. Conversely, some people like to have everything in the JOIN clause to keep everything together.
The situation with outer joins is different- there is a significant difference between "a LEFT OUTER JOIN b ON a.a_id=b.a_id AND b.type = 1" and "a LEFT OUTER JOIN b ON a.a_id=b.a_id WHERE b.type=1"- in fact the latter is implicitly forcing an inner join. This would be another reason to put all such conditions in the JOIN clause, for consistency.
These syntaxes are synonymous and are optimized to the same thing by most RDBMS.
I usually prefer this syntax:
SELECT *
FROM X
INNER JOIN
Y
ON X.A = Y.A
WHERE X.B = 'SOMETHING'
when B is not a part of the logical link between A and B, and this one:
SELECT *
FROM X
INNER JOIN
Y
ON X.A = Y.A
AND X.B = 'SOMETHING'
when it is.
Nothing, except clarity and meaning. Unless you have outer joins.
As a human (rather than an optimizer) myself, when maintaining a query I would look for a join condition in the JOIN clause and a search condition in the WHERE clause.
Of course, you need to strike a balance between performance issues and code maintenance issues. However, my first priority is good logical code in the first instance then optimize as necessary.

Mixing Left and right Joins? Why?

Doing some refactoring in some legacy code I've found in a project. This is for MSSQL. The thing is, i can't understand why we're using mixed left and right joins and collating some of the joining conditions together.
My question is this: doesn't this create implicit inner joins in some places and implicit full joins in others?
I'm of the school that just about anything can be written using just left (and inner/full) or just right (and inner/full) but that's because i like to keep things simple where possible.
As an aside, we convert all this stuff to work on oracle databases as well, so maybe there's some optimization rules that work differently with Ora?
For instance, here's the FROM part of one of the queries:
FROM Table1
RIGHT OUTER JOIN Table2
ON Table1.T2FK = Table2.T2PK
LEFT OUTER JOIN Table3
RIGHT OUTER JOIN Table4
LEFT OUTER JOIN Table5
ON Table4.T3FK = Table5.T3FK
AND Table4.T2FK = Table5.T2FK
LEFT OUTER JOIN Table6
RIGHT OUTER JOIN Table7
ON Table6.T6PK = Table7.T6FK
LEFT OUTER JOIN Table8
RIGHT OUTER JOIN Table9
ON Table8.T8PK= Table9.T8FK
ON Table7.T9FK= Table9.T9PK
ON Table4.T7FK= Table7.T7PK
ON Table3.T3PK= Table4.T3PK
RIGHT OUTER JOIN ( SELECT *
FROM TableA
WHERE ( TableA.PK = #PK )
AND ( TableA.Date BETWEEN #StartDate
AND #EndDate )
) Table10
ON Table4.T4PK= Table10.T4FK
ON Table2.T2PK = Table4.T2PK
One thing I would do is make sure you know what results you are expecting before messing with this. Wouldn't want to "fix" it and have different results returned. Although honestly, with a query that poorly designed, I'm not sure that you are actually getting correct results right now.
To me this looks like something that someone did over time maybe even originally starting with inner joins, realizing they wouldn't work and changing to outer joins but not wanting to bother changing the order the tables were referenced in the query.
Of particular concern to me for maintenance purposes is to put the ON clauses next to the tables you are joining as well as converting all the joins to left joins rather than mixing right and left joins. Having the ON clause for table 4 and table 3 down next to table 9 makes no sense at all to me and should contribute to confusion as to what the query should actually return. You may also need to change the order of the joins in order to convert to all left joins. Personally I prefer to start with the main table that the others will join to (which appears to be table2) and then work down the food chain from there.
It could probably be converted to use all LEFT joins: I'd be looking and moving the right-hand table in each RIGHT to be above all the existing LEFTs, then you might be able to then turn every RIGHT join into a LEFT join. I'm not sure you'll get any FULL joins behind the scenes -- if the query looks like it is, it might be a quirk of this specific query rather than a SQL Server "rule": that query you've provided does seem to be mixing it up in a rather confusing way.
As for Oracle optimisation -- that's certainly possible. No experience of Oracle myself, but speaking to a friend who's knowledgeable in this area, Oracle (no idea what version) is/was fussy about the order of predicates. For example, with SQL Server you can write your way clause so that columns are in any order and indexes will get used, but with Oracle you end up having to specify the columns in the order they appear in the index in order to get best performance with the index. As stated - no idea if this is the case with newer Oracle's, but was the case with older ones (apparently).
Whether this explains this particular construction, I can't say. It could simply be less-thean-optimal code if it's changed over the years and a clean-up is what it's begging for.
LEFT and RIGHT join are pure syntax sugar.
Any LEFT JOIN can be transformed into a RIGHT JOIN merely by switching the sets.
Pre-9i Oracle used this construct:
WHERE table1.col(+) = table2.col
, (+) here denoting the nullable column, and LEFT and RIGHT joins could be emulated by mere switching:
WHERE table1.col = table2.col(+)
In MySQL, there is no FULL OUTER JOIN and it needs to be emulated.
Ususally it is done this way:
SELECT *
FROM table1
LEFT JOIN
table2
ON table1.col = table2.col
UNION ALL
SELECT *
FROM table1
RIGHT JOIN
table2
ON table1.col = table2.col
WHERE table1.col IS NULL
, and it's more convenient to copy the JOIN and replace LEFT with RIGHT, than to swap the tables.
Note that in SQL Server plans, Hash Left Semi Join and Hash Right Semi Join are different operators.
For the query like this:
SELECT *
FROM table1
WHERE table1.col IN
(
SELECT col
FROM table2
)
, Hash Match (Left Semi Join) hashes table1 and removes the matched elements from the hash table in runtime (so that they cannot match more than one time).
Hash Match (Right Semi Join) hashes table2 and removes the duplicate elements from the hash table while building it.
I may be missing something here, but the only difference between LEFT and RIGHT joins is which order the source tables were written in, and so having multiple LEFT joins or multiple RIGHT joins is no different to having a mix. The equivalence to FULL OUTERs could be achieved just as easily with all LEFT/RIGHT than with a mix, n'est pas?
We have some LEFT OUTER JOINs and RIGHT OUTER JOINs in the same query. Typically such queries are large, have been around a long time, probably badly written in the first place and have received infrequent maintenance. I assume the RIGHT OUTER JOINs were introduced as a means of maintaining the query without taking on the inevitable risk when refactoring a query significantly.
I think most SQL coders are most confortable with using all LEFT OUTER JOINs, probably because a FROM clause is read left-to-right in the English way.
The only time I use a RIGHT OUTER JOIN myself is when when writing a new query based on an existing query (no need to reinvent the wheel) and I need to change an INNER JOIN to an OUTER JOIN. Rather than change the order of the JOINs in the FROM clause just to be able to use a LEFT OUTER JOIN I would instead use a RIGHT OUTER JOIN and this would not bother me. This is quite rare though. If the original query had LEFT OUTER JOINs then I'd end up with a mix of LEFT- and RIGHT OUTER JOINs, which again wouldn't bother me. Hasn't happened to me yet, though.
Note that for SQL products such as the Access database engine that do not support FULL OUTER JOIN, one workaround is to UNION a LEFT OUTER JOIN and a RIGHT OUTER JOIN in the same query.
The bottom line is that this is a very poorly formatted SQL statement and should be re-written. Many of the ON clauses are located far from their JOIN statements, which I am not sure is even valid SQL.
For clarity's sake, I would rewrite the query using all LEFT JOINS (rather than RIGHT), and locate the using statements underneath their corresponding JOIN clauses. Otherwise, this is a bit of a train wreck and is obfuscating the purpose of the query, making errors during future modifications more likely to occur.
doesn't this create implicit inner
joins in some places and implicit full
joins in others?
Perhaps you are assuming that because you don't see the ON clause for some joins, e.g., RIGHT OUTER JOIN Table4, but it is located down below, ON Table4.T7FK= Table7.T7PK. I don't see any implicit inner joins, which could occur if there was a WHERE clause like WHERE Table3.T3PK is not null.
The fact that you are asking questions like this is a testament to the opaqueness of the query.
To answer another portion of this question that hasn't been answered yet, the reason this query is formatted so oddly is that it's likely built using the Query Designer inside SQL Management Studio. The give away is the combined ON clauses that happen many lines after the table is mentioned. Essentially tables get added in the build query window and the order is kept even if that way things are connected would favor moving a table up, so to speak, and keeping all the joins a certain direction.

Is there something wrong with joins that don't use the JOIN keyword in SQL or MySQL?

When I started writing database queries I didn't know the JOIN keyword yet and naturally I just extended what I already knew and wrote queries like this:
SELECT a.someRow, b.someRow
FROM tableA AS a, tableB AS b
WHERE a.ID=b.ID AND b.ID= $someVar
Now that I know that this is the same as an INNER JOIN I find all these queries in my code and ask myself if I should rewrite them. Is there something smelly about them or are they just fine?
My answer summary: There is nothing wrong with this query BUT using the keywords will most probably make the code more readable/maintainable.
My conclusion: I will not change my old queries but I will correct my writing style and use the keywords in the future.
Filtering joins solely using WHERE can be extremely inefficient in some common scenarios. For example:
SELECT * FROM people p, companies c
WHERE p.companyID = c.id AND p.firstName = 'Daniel'
Most databases will execute this query quite literally, first taking the Cartesian product of the people and companies tables and then filtering by those which have matching companyID and id fields. While the fully-unconstrained product does not exist anywhere but in memory and then only for a moment, its calculation does take some time.
A better approach is to group the constraints with the JOINs where relevant. This is not only subjectively easier to read but also far more efficient. Thusly:
SELECT * FROM people p JOIN companies c ON p.companyID = c.id
WHERE p.firstName = 'Daniel'
It's a little longer, but the database is able to look at the ON clause and use it to compute the fully-constrained JOIN directly, rather than starting with everything and then limiting down. This is faster to compute (especially with large data sets and/or many-table joins) and requires less memory.
I change every query I see which uses the "comma JOIN" syntax. In my opinion, the only purpose for its existence is conciseness. Considering the performance impact, I don't think this is a compelling reason.
The more verbose INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN are from the ANSI SQL/92 syntax for joining. For me, this verbosity makes the join more clear to the developer/DBA of what the intent is with the join.
In SQL Server there are always query plans to check, a text output can be made as follows:
SET SHOWPLAN_ALL ON
GO
DECLARE #TABLE_A TABLE
(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Data VARCHAR(10) NOT NULL
)
INSERT INTO #TABLE_A
SELECT 'ABC' UNION
SELECT 'DEF' UNION
SELECT 'GHI' UNION
SELECT 'JKL'
DECLARE #TABLE_B TABLE
(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Data VARCHAR(10) NOT NULL
)
INSERT INTO #TABLE_B
SELECT 'ABC' UNION
SELECT 'DEF' UNION
SELECT 'GHI' UNION
SELECT 'JKL'
SELECT A.Data, B.Data
FROM
#TABLE_A AS A, #TABLE_B AS B
WHERE
A.ID = B.ID
SELECT A.Data, B.Data
FROM
#TABLE_A AS A
INNER JOIN #TABLE_B AS B ON A.ID = B.ID
Now I'll omit the plan for the table variable creates, the plan for both queries is identical though:
SELECT A.Data, B.Data FROM #TABLE_A AS A, #TABLE_B AS B WHERE A.ID = B.ID
|--Nested Loops(Inner Join, OUTER REFERENCES:([A].[ID]))
|--Clustered Index Scan(OBJECT:(#TABLE_A AS [A]))
|--Clustered Index Seek(OBJECT:(#TABLE_B AS [B]), SEEK:([B].[ID]=#TABLE_A.[ID] as [A].[ID]) ORDERED FORWARD)
SELECT A.Data, B.Data FROM #TABLE_A AS A INNER JOIN #TABLE_B AS B ON A.ID = B.ID
|--Nested Loops(Inner Join, OUTER REFERENCES:([A].[ID]))
|--Clustered Index Scan(OBJECT:(#TABLE_A AS [A]))
|--Clustered Index Seek(OBJECT:(#TABLE_B AS [B]), SEEK:([B].[ID]=#TABLE_A.[ID] as [A].[ID]) ORDERED FORWARD)
So, short answer - No need to rewrite, unless you spend a long time trying to read them each time you maintain them?
It's more of a syntax choice. I prefer grouping my join conditions with my joins, hence I use the INNER JOIN syntax
SELECT a.someRow, b.someRow
FROM tableA AS a
INNER JOIN tableB AS b
ON a.ID = b.ID
WHERE b.ID = ?
(? being a placeholder)
Nothing is wrong with the syntax in your example. The 'INNER JOIN' syntax is generally termed 'ANSI' syntax, and came after the style illustrated in your example. It exists to clarify the type/direction/constituents of the join, but is not generally functionally different than what you have.
Support for 'ANSI' joins is per-database platform, but it's more or less universal these days.
As a side note, one addition with the 'ANSI' syntax was the 'FULL OUTER JOIN' or 'FULL JOIN'.
Hope this helps.
In general:
Use the JOIN keyword to link (ie. "join") primary keys and foreign keys.
Use the WHERE clause to limit your result set to only the records you are interested in.
The one problem that can arise is when you try to mix the old "comma-style" join with SQL-92 joins in the same query, for example if you need one inner join and another outer join.
SELECT *
FROM table1 AS a, table2 AS b
LEFT OUTER JOIN table3 AS c ON a.column1 = c.column1
WHERE a.column2 = b.column2;
The problem is that recent SQL standards say that the JOIN is evaluated before the comma-join. So the reference to "a" in the ON clause gives an error, because the correlation name hasn't been defined yet as that ON clause is being evaluated. This is a very confusing error to get.
The solution is to not mix the two styles of joins. You can continue to use comma-style in your old code, but if you write a new query, convert all the joins to SQL-92 style.
SELECT *
FROM table1 AS a
INNER JOIN table2 AS b ON a.column2 = b.column2
LEFT OUTER JOIN table3 AS c ON a.column1 = c.column1;
Another thing to consider in the old join syntax is that is is very easy to get a cartesion join by accident since there is no on clause. If the Distinct keyword is in the query and it uses the old style joins, convert it to an ANSI standard join and see if you still need the distinct. If you are fixing accidental cartesion joins this way, you can improve performance tremendously by rewriting to specify the join and the join fields.
I avoid implicit joins; when the query is really large, they make the code hard to decipher
With explicit joins, and good formatting, the code is more readable and understandable without need for comments.
It also depends on whether you are just doing inner joins this way or outer joins as well. For instance, the MS SQL Server syntax for outer joins in the WHERE clause (=* and *=) can give different results than the OUTER JOIN syntax and is no longer supported (http://msdn.microsoft.com/en-us/library/ms178653(SQL.90).aspx) in SQL Server 2005.
And what about performances ???
As a matter of fact, performances is a very important problem in RDBMS.
So the question is what is the most performant... Using JOIN or having joined table in the WHERE clause ?
Because optimizer (or planer as they said in PG...) ordinary does a good job, the two execution plans are the same, so the performances while excuting the query will be the same...
But devil are hidden in some details....
All optimizers have a limited time or a limited amount of work to find the best plan... And when the limit is reached, the result is the best plan among all computed plans, and not the better of all possible plans !
Now the question is do I loose time when I use WHERE clause instead of JOINs for joining tables ?
And the answer is YES !
YES, because the relational engine use relational algebrae that knows only JOIN operator, not pseudo joins made in the WHERE clause. So the first thing that the optimizer do (in fact the parser or the algrebriser) is to rewrite the query... and this loose some chances to have the best of all plans !
I have seen this problem twice, in my long RDBMS career (40 years...)