SQL Server: does the order of full outer joins matter?

I have 4 full outer joins in my query and it's really slow. Does the order of the FULL OUTER JOINs make a difference to performance or to the result?
Let FULL OUTER JOIN = ⋈
My situation is: A ⋈ B ⋈ C ⋈ D
All joins are on a key k that is common to all of A, B, C, D.
Then:
Will changing the order of ⋈ joins make a difference to performance?
Will changing the order of ⋈ change the result?
I feel that it should not affect the result, but I am not sure whether it will affect performance.
Update:
Will SQL Server automatically rearrange the joins for better performance, assuming the result set is independent of the order?

No, rearranging the JOIN orders should not affect the performance. MSSQL (as with other DBMS) has a query optimizer whose job it is to find the most efficient query plan for any given query. Generally, these do a pretty good job - so you're unlikely to beat the optimizer easily.
That said, they do get it wrong occasionally. That's where reading an execution plan comes into play. You can add JOIN hints to tell MSSQL how to join your tables (at which point, ordering does matter). You'd generally order from smallest to largest table (though, with a FULL JOIN, it's not likely to matter very much) and follow the rules of thumb for join types.
Since you're doing FULL JOINS, you're basically reading the entirety of 4 tables off disk. That's likely to be very expensive. You may want to re-examine the problem, and see if it can be accomplished in a different way.

Will changing the order of ⋈ change the result ?
No, the order of the FULL JOIN does not matter, the result will be the same. Notice however, that you can't use something like this (the following may give different results depending on the order of joins):
SELECT
COALESCE(a.id, b.id, c.id, d.id) AS id, --- Key columns used in FULL JOIN
a.*, b.*, c.*, d.* --- other columns
FROM a
FULL JOIN b
ON b.id = a.id
FULL JOIN c
ON c.id = a.id
FULL JOIN d
ON d.id = a.id ;
You have to use something like this (no difference in results whatever the order of joins):
SELECT
COALESCE(a.id, b.id, c.id, d.id) AS id,
a.*, b.*, c.*, d.*
FROM a
FULL JOIN b
ON b.id = a.id
FULL JOIN c
ON c.id = COALESCE(a.id, b.id)
FULL JOIN d
ON d.id = COALESCE(a.id, b.id, c.id) ;
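To see why the join keys matter, here is a minimal sketch using Python's sqlite3 module rather than SQL Server (an assumption made only so the example is self-contained). The three one-row tables are hypothetical, and FULL JOIN needs SQLite 3.39 or newer, so the function returns None on older builds.

```python
import sqlite3


def full_join_demo():
    """Compare a FULL JOIN chain keyed on a.id alone with one keyed on COALESCE."""
    # Assumption: FULL JOIN is only available in SQLite 3.39+; bail out otherwise.
    if sqlite3.sqlite_version_info < (3, 39, 0):
        return None
    con = sqlite3.connect(":memory:")
    # Hypothetical data: key 1 exists only in a, key 2 exists in b and c.
    con.executescript("""
        CREATE TABLE a (id INTEGER);
        CREATE TABLE b (id INTEGER);
        CREATE TABLE c (id INTEGER);
        INSERT INTO a VALUES (1);
        INSERT INTO b VALUES (2);
        INSERT INTO c VALUES (2);
    """)
    # Every join is keyed on a.id, so c cannot match the row that came from b.
    on_a_only = con.execute("""
        SELECT COALESCE(a.id, b.id, c.id) AS id
        FROM a
        FULL JOIN b ON b.id = a.id
        FULL JOIN c ON c.id = a.id
    """).fetchall()
    # Each join is keyed on whichever earlier table supplied the key.
    on_coalesce = con.execute("""
        SELECT COALESCE(a.id, b.id, c.id) AS id
        FROM a
        FULL JOIN b ON b.id = a.id
        FULL JOIN c ON c.id = COALESCE(a.id, b.id)
    """).fetchall()
    con.close()
    return sorted(on_a_only), sorted(on_coalesce)


print(full_join_demo())
```

With this data the first query should list key 2 twice (once for the row from b, once for the row from c), while the COALESCE version lists it once.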
Will changing the order of ⋈ joins make a difference to performance?
Taking into consideration that the second and third joins have to be done on the COALESCE() of the columns and not the columns themselves, I think only testing with large enough tables will show if the indexes can be used effectively.

Changing the order of a FULL OUTER JOIN shouldn't affect performance or results. The only thing affected by the order is the default order of the columns produced by SELECT *. You may be having performance issues simply from joining multiple large tables. If there is no WHERE clause to limit the tables, you could be going through hundreds of thousands of rows.

Related

Where vs ON in outer join

I am wondering whether we get better SQL performance by duplicating a criterion in the ON clause when it is already in the WHERE clause.
My friend claimed it is up to DB engines but I am not so sure.
Regardless of DB engines, normally, the condition in Where clause should be executed first before join, but I assume it means inner join but not outer join. Because some conditions can only be executed AFTER outer join.
For example:
Select a.*, b.*
From A a
Left outer join B on a.id = b.id
Where b.id is NULL;
The condition in Where cannot be executed before outer join.
So, I assume the whole ON clause must be evaluated before the WHERE clause, and it seems the ON clause controls the size of table B (or table A if we use a right outer join) before the outer join. That does not seem related to the DB engine to me.
And that raises my question: when we use an outer join, should we always duplicate our criteria in the ON clause?
for example (I use a table to outer join with a shorter version of itself)
temp_series_installment & series_id > 18940000 vs temp_series_installment:
select sql_no_cache s.*, t.* from temp_series_installment s
left outer join temp_series_installment t on s.series_id = t.series_id and t.series_id > 18940000 and t.incomplete = 1
where t.incomplete = 1;
VS
select sql_no_cache s.*, t.* from temp_series_installment s
left outer join temp_series_installment t on s.series_id = t.series_id and t.series_id > 18940000
where t.incomplete = 1;
Edit: where t.incomplete = 1 has the same effect as where t.series_id is not null,
which makes this an inner join, as Gordon Linoff pointed out.
But what I am asking is: if the outer join is against a smaller table, shouldn't it be faster?
I tried to see if there is any performance difference in MySQL.
The result was not what I expected. Why is the second one faster? I thought that outer joining a smaller table would make the query faster.
My idea is from:
https://www.ibm.com/support/knowledgecenter/en/SSZLC2_8.0.0/com.ibm.commerce.developer.doc/refs/rsdperformanceworkspaces.htm
Section:
Push predicates into the OUTER JOIN clause whenever possible
Duplicate constant condition for different tables whenever possible
Regardless of DB engines, normally, the condition in Where clause should be executed first before join, but I assume it means inner join but not outer join. Because some conditions can only be executed AFTER outer join.
This is simply not true. SQL is a descriptive language. It does not specify how the query gets executed. It only specifies what the result set looks like. The SQL compiler/optimizer determines the actual processing steps to meet the requirements described by the query.
In terms of semantics, the FROM clause is the first clause that is "evaluated". Hence, FROM is logically processed before the WHERE clause.
The rest of your question is similarly misguided. Comparison logic in the where clause, such as:
from s left join
t
on s.series_id = t.series_id and t.series_id > 18940000
where t.incomplete = 1
turns the outer join into an inner join. Hence, the logic is different from what you think is going on.
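A minimal sketch of that effect, using Python's sqlite3 module instead of MySQL so it can run anywhere. The tiny tables s and t are hypothetical stand-ins for the question's table; only the shape of the query is taken from the question.

```python
import sqlite3


def where_vs_on():
    """Return rows for a LEFT JOIN filtered in WHERE, filtered in ON, and an INNER JOIN."""
    con = sqlite3.connect(":memory:")
    # Hypothetical miniature tables; s has three series, t knows about two of them.
    con.executescript("""
        CREATE TABLE s (series_id INTEGER);
        CREATE TABLE t (series_id INTEGER, incomplete INTEGER);
        INSERT INTO s VALUES (1), (2), (3);
        INSERT INTO t VALUES (1, 1), (2, 0);
    """)
    select = "SELECT s.series_id, t.incomplete FROM s "
    # Filter on t in WHERE: NULL-extended rows fail the test and are discarded.
    in_where = con.execute(
        select + "LEFT JOIN t ON s.series_id = t.series_id "
        "WHERE t.incomplete = 1 ORDER BY s.series_id").fetchall()
    # Filter on t in ON: every row of s survives, unmatched ones get NULLs.
    in_on = con.execute(
        select + "LEFT JOIN t ON s.series_id = t.series_id AND t.incomplete = 1 "
        "ORDER BY s.series_id").fetchall()
    # Plain inner join for comparison.
    inner = con.execute(
        select + "INNER JOIN t ON s.series_id = t.series_id "
        "WHERE t.incomplete = 1 ORDER BY s.series_id").fetchall()
    con.close()
    return in_where, in_on, inner


in_where, in_on, inner = where_vs_on()
print(in_where == inner)  # the WHERE filter made the LEFT JOIN behave like an INNER JOIN
print(in_on)
```

Only the version with the predicate in the ON clause keeps the rows of s that have no match in t.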
As Gordon Linoff pointed out, that's not true. Your friend is plain wrong.
I just want to add that developers like to think about SQL the way they think about their language of trade (C++, VB, Java), but those are procedural/imperative languages.
When you code SQL you are in another paradigm. You are just describing a function to be applied to a dataset.
Let's get your own example:
Select a.*, b.*
From A a
Left outer join B on a.id = b.id
Where b.id is NULL;
If a.Id and b.Id are NOT NULL columns,
it's semantically equal to
Select a.*, null, ..., null
From A a
where not exists (select * from B b where b.Id = a.Id)
Now try to run those two queries and profile them.
In most DBMSs I would expect both queries to run in exactly the same way.
That happens because the engine decides how to implement your "function" over the dataset.
Note the above example is the equivalent in set mathematics to:
Give me the set A minus the intersection between A and B.
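As a quick check of that equivalence, here is a small sketch with Python's sqlite3 module (an assumption; any engine would do). The tables and their contents are made up for illustration, and both id columns are declared NOT NULL as the argument requires.

```python
import sqlite3


def anti_join_demo():
    """Run the LEFT JOIN ... IS NULL form and the NOT EXISTS form on the same data."""
    con = sqlite3.connect(":memory:")
    # Hypothetical data: A = {1, 2, 3}, B = {2}.
    con.executescript("""
        CREATE TABLE A (id INTEGER NOT NULL);
        CREATE TABLE B (id INTEGER NOT NULL);
        INSERT INTO A VALUES (1), (2), (3);
        INSERT INTO B VALUES (2);
    """)
    left_join = con.execute("""
        SELECT a.id
        FROM A a
        LEFT OUTER JOIN B b ON a.id = b.id
        WHERE b.id IS NULL
        ORDER BY a.id
    """).fetchall()
    not_exists = con.execute("""
        SELECT a.id
        FROM A a
        WHERE NOT EXISTS (SELECT * FROM B b WHERE b.id = a.id)
        ORDER BY a.id
    """).fetchall()
    con.close()
    return left_join, not_exists


left_join, not_exists = anti_join_demo()
print(left_join == not_exists)
```

Both forms return the ids of A that have no partner in B, which is the set difference described above.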
Engines can decide how to implement your query because they have some tricks up their sleeves.
They have metrics about your tables, indexes, etc. and can use them to, for example, perform a join in a different order than the one you wrote.
IMHO engines today are really good at finding the best way to implement the function you describe and rarely need query hints.
Of course, you can end up describing your function in a way that is too complicated, which affects how the engine decides to run it.
The art of better describing functions and sets and managing indexes is what we call query tuning.

SQL JOIN: ON vs Equals

Is there any significant difference between the following?
SELECT a.name, b.name FROM a, b WHERE a.id = b.id AND a.id = 1
AND
SELECT a.name, b.name FROM a INNER JOIN b ON a.id = b.id WHERE a.id = 1
Do SO users have a preference of one over the other?
There is no difference, but the readability of the second is much better when you have a big multi-join query with extra where clauses for filtering.
Separating the join clauses and the filter clauses is a Good Thing :)
The former is ANSI 89 syntax, the latter is ANSI 92.
For that specific query there is no difference. However, with the former you lose the ability to separate a filter from a join condition in complex queries, and the syntax to specify LEFT vs RIGHT vs INNER is often confusing, especially if you have to go back and forth between different db vendors. There are also certain kinds of join that cannot be written with the old syntax.
In fact, the former syntax has been obsolete for more than 30 years now, and should not be used for new development.
There is no difference to the sql query engine.
For readability, the latter is much easier to read if you use linebreaks and indentation.
For INNER JOINs, it does not matter if you put "filters" and "joins" in the ON or the WHERE clause; the query optimizer should decide what to do first anyway (it may choose to do a filter first and a join later, or vice versa).
For OUTER JOINs, however, there is a difference: sometimes you'll want to put the condition in the ON clause, sometimes in the WHERE. Putting a condition in the WHERE clause for an OUTER JOIN can turn it into an INNER JOIN (because of how NULLs work).
For example, check the readability between the two following samples:
SELECT c.customer_no, o.order_no, a.article_no, r.price
FROM customer c, order o, orderrow r, article a
WHERE o.customer_id = c.customer_id
AND r.order_id = o.order_id
AND a.article_id = r.article_id
AND o.orderdate >= '2003-01-01'
AND o.orderdate < '2004-01-01'
AND c.customer_name LIKE 'A%'
ORDER BY r.price DESC
vs
SELECT c.customer_no, o.order_no, a.article_no, r.price
FROM customer c
INNER JOIN order o
ON o.customer_id = c.customer_id
AND o.orderdate >= '2003-01-01'
AND o.orderdate < '2004-01-01'
INNER JOIN orderrow r
ON r.order_id = o.order_id
INNER JOIN article a
ON a.article_id = r.article_id
WHERE c.customer_name LIKE 'A%'
ORDER BY r.price DESC
While you can perform most tasks using either, and in your case there is no difference whatsoever, I always use the second:
It's the current supported standard
It keeps joins in the FROM clause and filters in the WHERE clause
It makes more complex LEFT, RIGHT, FULL OUTER joins much easier
MSSQL Help is all based around that syntax therefore much easier to get help on your problem queries
While there is no difference technically, you need to be extra careful about doing joins using the first method. If you get it wrong by accident, you could end up doing a cartesian join between your a and b tables (a very long, memory & cpu intensive query - it will match each single row in a with all rows in b. Bad if a and b are large tables to begin with). Using an explicit INNER JOIN is both safer and easier to read.
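The accidental Cartesian join is easy to reproduce. The sketch below uses Python's sqlite3 module and two made-up tables (3 and 4 rows); it only counts rows, so it says nothing about how any particular engine would execute the queries.

```python
import sqlite3


def cartesian_demo():
    """Count rows for a comma join with and without its join predicate."""
    con = sqlite3.connect(":memory:")
    # Hypothetical tables: a has 3 rows, b has 4 rows, ids 1..3 match.
    con.executescript("""
        CREATE TABLE a (id INTEGER, name TEXT);
        CREATE TABLE b (id INTEGER, name TEXT);
        INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
        INSERT INTO b VALUES (1, 'b1'), (2, 'b2'), (3, 'b3'), (4, 'b4');
    """)
    # Join predicate forgotten: every row of a pairs with every row of b.
    forgotten = con.execute("SELECT COUNT(*) FROM a, b").fetchone()[0]
    # Comma join with the predicate in WHERE.
    comma = con.execute(
        "SELECT COUNT(*) FROM a, b WHERE a.id = b.id").fetchone()[0]
    # Explicit join with the predicate in ON.
    explicit = con.execute(
        "SELECT COUNT(*) FROM a INNER JOIN b ON a.id = b.id").fetchone()[0]
    con.close()
    return forgotten, comma, explicit


print(cartesian_demo())
```

The first count is the product of the two table sizes; with real tables of a few hundred thousand rows each, that product is what makes the mistake so expensive.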
No difference. I find the first format more readable and use the second format only when doing other types of joins (OUTER, LEFT INNER, etc).
The second form is SQL92 compliant syntax. This should mean that it is supported by all current and future database vendors. However, the truth is that the first form is so pervasive that it is also guaranteed to be around for longer than we care.
Otherwise they are same in all respects in how databases treat the two.

Does the order of tables referenced in the ON clause of the JOIN matter?

Does it matter which way I order the criteria in the ON clause for a JOIN?
select a.Name, b.Status from a
inner join b
on a.StatusID = b.ID
versus
select a.Name, b.Status from a
inner join b
on b.ID = a.StatusID
Is there any impact on performance? What if I had multiple criteria?
Is one order more maintainable than another?
JOIN order can be forced by putting the tables in the right order in the FROM clause:
MySQL has a special clause called STRAIGHT_JOIN which makes the order matter.
This will use an index on b.id:
SELECT a.Name, b.Status
FROM a
STRAIGHT_JOIN
b
ON b.ID = a.StatusID
And this will use an index on a.StatusID:
SELECT a.Name, b.Status
FROM b
STRAIGHT_JOIN
a
ON b.ID = a.StatusID
Oracle has a special hint ORDERED to enforce the JOIN order:
This will use an index on b.id or build a hash table on b:
SELECT /*+ ORDERED */
*
FROM a
JOIN b
ON b.ID = a.StatusID
And this will use an index on a.StatusID or build a hash table on a:
SELECT /*+ ORDERED */
*
FROM b
JOIN a
ON b.ID = a.StatusID
SQL Server has a hint called FORCE ORDER to do the same:
This will use an index on b.id or build a hash table on b:
SELECT *
FROM a
JOIN b
ON b.ID = a.StatusID
OPTION (FORCE ORDER)
And this will use an index on a.StatusID or build a hash table on a:
SELECT *
FROM b
JOIN a
ON b.ID = a.StatusID
OPTION (FORCE ORDER)
PostgreSQL guys, sorry. Your TODO list says:
Optimizer hints (not wanted)
Optimizer hints are used to work around problems in the optimizer. We would rather have the problems reported and fixed.
As for the order in the comparison, it doesn't matter in any RDBMS, AFAIK.
Though I personally always try to estimate which column will be searched for and put this column in the left (for it to seem like an lvalue).
See this answer for more detail.
No it does not.
What I do (for readability) is your 2nd example.
No. The database should be determining the best execution plan based on the entire criteria, not creating it by looking at each item in sequence. You can confirm this by requesting the execution plan for both queries, you'll see they are the same (you'll find that even vastly different queries, as long as they ultimately specify the same logic, are often compiled into the same execution plan).
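One way to try this yourself, sketched with Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN (an assumption; SQL Server, MySQL and Oracle each have their own plan viewers). The table contents are invented, and the helper names are hypothetical.

```python
import sqlite3


def on_operand_order():
    """Run the same join with the ON operands in both orders; return rows and plans."""
    con = sqlite3.connect(":memory:")
    # Hypothetical data shaped like the question's tables a and b.
    con.executescript("""
        CREATE TABLE a (Name TEXT, StatusID INTEGER);
        CREATE TABLE b (ID INTEGER PRIMARY KEY, Status TEXT);
        INSERT INTO b VALUES (1, 'open'), (2, 'closed');
        INSERT INTO a VALUES ('x', 1), ('y', 2), ('z', 1);
    """)
    base = "SELECT a.Name, b.Status FROM a INNER JOIN b ON "
    queries = [
        base + "a.StatusID = b.ID ORDER BY a.Name",
        base + "b.ID = a.StatusID ORDER BY a.Name",
    ]
    rows = [con.execute(q).fetchall() for q in queries]
    # The last column of each EXPLAIN QUERY PLAN row is the readable plan step.
    plans = [[r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + q)]
             for q in queries]
    con.close()
    return rows, plans


rows, plans = on_operand_order()
print(rows[0] == rows[1])
for plan in plans:
    print(plan)
```

The result sets are identical by construction; the printed plans let you compare what the optimizer chose for each spelling.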
No there is not. At the end of the day, you are really just evaluating whether a=b.
And as the symmetric property of equality states:
For any quantities a and b, if a = b, then b = a.
so whether you check for (12)*=12 or 12=(12)* makes logically no difference.
If values are equal, join, if not, don't. And whether you specify it as in your first example or the second, makes no difference.
As many have said: The order does not make a difference in result or performance.
What I want to point out though is that LINQ to SQL only allows the first case!
Eg, following example works well, ...
var result = from a in db.a
join b in db.b on a.StatusID equals b.ID
select new { Name = a.Name, Status = b.Status }
... while this will throw errors in Visual Studio:
var result = from a in db.a
join b in db.b on b.ID equals a.StatusID
select new { Name = a.Name, Status = b.Status }
Which throws these compiler errors:
CS1937: The name 'name' is not in scope on the left side of 'equals'. Consider swapping the expressions on either side of 'equals'.
CS1938: The name 'name' is not in scope on the right side of 'equals'. Consider swapping the expressions on either side of 'equals'.
Though not relevant in standard SQL coding, this might be a point to consider, when accustoming oneself to either one of those.
SqlServer contains an optimisation for situations far more complex than this.
If you have multiple criteria stuff is usually lazy evaluated (but I need to do a bit of research around edge cases if any.)
For readability I usually prefer
SELECT Name, Status FROM a
JOIN b
ON a.StatusID = b.ID
I think it makes better sense to reference the variables in the same order they were declared, but it's really a matter of personal taste.
The only reason I wouldn't use your second example:
select a.Name, b.Status
from a
inner join b
on b.ID = a.StatusID
Your user is more likely to come back and say 'Can I see all the a.name's even if they have no status records?' rather than 'Can I see all of b.status even if they don't have a name record?', so just to plan ahead for this example, I would use On a.StatusID = b.ID in anticipation of a LEFT Outer Join. This assumes you could have table 'a' record without 'b'.
Correction: It won't change the result.
This is probably a moot point since users never want to change their requirements.
nope, doesn't matter. but here's an example to help make your queries more readable (at least to me)
select a.*, b.*
from tableA a
inner join tableB b
on a.name=b.name
and a.type=b.type
each table reference is on a separate line, and each join criteria is on a separate line. the tabbing helps keep what belongs to what straight.
another thing i like to do is make my criteria in my on statements flow the same order as the table. so if a is first and then b, a will be on the left side and b on the right.
ERROR: ON clause references tables to its right (php sqlite 3.2)
Replace this
LEFT JOIN itm08 i8 ON i8.id= cdd01.idcmdds and i8.itm like '%ormit%'
LEFT JOIN comodidades cdd01 ON cdd01.id_registro = u.id_registro
With this
LEFT JOIN comodidades cdd01 ON cdd01.id_registro = u.id_registro
LEFT JOIN itm08 i8 ON i8.id= cdd01.idcmdds and i8.itm like '%ormit%'

SQL (any) Request for insight on a query optimization

I have a particularly slow query due to the vast amount of information being joined together. However I needed to add a where clause in the shape of id in (select id from table).
I want to know if there is any gain from the following, and more pressing, will it even give the desired results.
select a.* from a where a.id in (select id from b where b.id = a.id)
as an alternative to:
select a.* from a where a.id in (select id from b)
Update:
MySQL
Can't be more specific sorry
table a is effectively a join between 7 different tables.
use of * is for examples
Edit, b doesn't get selected
Your question was about the difference between these two:
select a.* from a where a.id in (select id from b where b.id = a.id)
select a.* from a where a.id in (select id from b)
The former is a correlated subquery. It may cause MySQL to execute the subquery for each row of a.
The latter is a non-correlated subquery. MySQL should be able to execute it once and cache the results for comparison against each row of a.
I would use the latter.
Both queries you list are the equivalent of:
select a.*
from a
inner join b on b.id = a.id
Almost all optimizers will execute them in the same way.
You could post a real execution plan, and someone here might give you a way to speed it up. It helps if you specify what database server you are using.
YMMV, but I've often found using EXISTS instead of IN makes queries run faster.
SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
Of course, without seeing the rest of the query and the context, this may not make the query any faster.
JOINing may be a more preferable option, but if a.id appears more than once in the id column of b, you would have to throw a DISTINCT in there, and you more than likely go backwards in terms of optimization.
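The duplicate issue is easy to see on a toy data set. This is a sketch with Python's sqlite3 module rather than MySQL (an assumption for portability), and the table contents are invented so that id 1 appears twice in b.

```python
import sqlite3


def semi_join_demo():
    """Compare IN, EXISTS, JOIN and JOIN + DISTINCT when b.id contains duplicates."""
    con = sqlite3.connect(":memory:")
    # Hypothetical data: id 1 appears twice in b, id 2 is missing from b.
    con.executescript("""
        CREATE TABLE a (id INTEGER);
        CREATE TABLE b (id INTEGER);
        INSERT INTO a VALUES (1), (2);
        INSERT INTO b VALUES (1), (1), (3);
    """)
    queries = {
        "in": "SELECT a.id FROM a WHERE a.id IN (SELECT id FROM b)",
        "exists": "SELECT a.id FROM a "
                  "WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)",
        "join": "SELECT a.id FROM a INNER JOIN b ON b.id = a.id",
        "join_distinct": "SELECT DISTINCT a.id FROM a INNER JOIN b ON b.id = a.id",
    }
    results = {name: sorted(con.execute(sql).fetchall())
               for name, sql in queries.items()}
    con.close()
    return results


print(semi_join_demo())
```

IN and EXISTS return each matching row of a once, while the plain JOIN repeats it for every matching row of b, so the JOIN form is only interchangeable when b.id is unique or a DISTINCT is added.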
I would never use a subquery like this. A join would be much faster.
select a.*
from a
join b on a.id = b.id
Of course, don't use select * either (especially never use it when doing a join, as at least one field is repeated); it wastes network resources to send unneeded data.
Have you looked at the execution plan?
How about
select a.*
from a
inner join b
on a.id = b.id
presumably the id fields are primary keys?
Select a.* from a
inner join (Select distinct id from b) c
on a.ID = c.id
I tried all 3 versions and they ran about the same. The execution plan was the same (inner join, IN (with and without where clause in subquery), Exists)
Since you are not selecting any other fields from B, I prefer to use the Where IN(Select...) Anyone would look at the query and know what you are trying to do (Only show in a if in b.).
your problem is most likely in the seven tables within "a"
make the FROM table contain the "a.id"
make the next join: inner join b on a.id = b.id
then join in the other six tables.
you really need to show the entire query, list all indexes, and approximate row counts of each table if you want real help

Is there something wrong with joins that don't use the JOIN keyword in SQL or MySQL?

When I started writing database queries I didn't know the JOIN keyword yet and naturally I just extended what I already knew and wrote queries like this:
SELECT a.someRow, b.someRow
FROM tableA AS a, tableB AS b
WHERE a.ID=b.ID AND b.ID= $someVar
Now that I know that this is the same as an INNER JOIN I find all these queries in my code and ask myself if I should rewrite them. Is there something smelly about them or are they just fine?
My answer summary: There is nothing wrong with this query BUT using the keywords will most probably make the code more readable/maintainable.
My conclusion: I will not change my old queries but I will correct my writing style and use the keywords in the future.
Filtering joins solely using WHERE can be extremely inefficient in some common scenarios. For example:
SELECT * FROM people p, companies c
WHERE p.companyID = c.id AND p.firstName = 'Daniel'
Most databases will execute this query quite literally, first taking the Cartesian product of the people and companies tables and then filtering by those which have matching companyID and id fields. While the fully-unconstrained product does not exist anywhere but in memory and then only for a moment, its calculation does take some time.
A better approach is to group the constraints with the JOINs where relevant. This is not only subjectively easier to read but also far more efficient. Thusly:
SELECT * FROM people p JOIN companies c ON p.companyID = c.id
WHERE p.firstName = 'Daniel'
It's a little longer, but the database is able to look at the ON clause and use it to compute the fully-constrained JOIN directly, rather than starting with everything and then limiting down. This is faster to compute (especially with large data sets and/or many-table joins) and requires less memory.
I change every query I see which uses the "comma JOIN" syntax. In my opinion, the only purpose for its existence is conciseness. Considering the performance impact, I don't think this is a compelling reason.
The more verbose INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN are from the ANSI SQL/92 syntax for joining. For me, this verbosity makes the join more clear to the developer/DBA of what the intent is with the join.
In SQL Server there are always query plans to check; a text output can be produced as follows:
SET SHOWPLAN_ALL ON
GO
DECLARE @TABLE_A TABLE
(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Data VARCHAR(10) NOT NULL
)
INSERT INTO @TABLE_A
SELECT 'ABC' UNION
SELECT 'DEF' UNION
SELECT 'GHI' UNION
SELECT 'JKL'
DECLARE @TABLE_B TABLE
(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Data VARCHAR(10) NOT NULL
)
INSERT INTO @TABLE_B
SELECT 'ABC' UNION
SELECT 'DEF' UNION
SELECT 'GHI' UNION
SELECT 'JKL'
SELECT A.Data, B.Data
FROM
@TABLE_A AS A, @TABLE_B AS B
WHERE
A.ID = B.ID
SELECT A.Data, B.Data
FROM
@TABLE_A AS A
INNER JOIN @TABLE_B AS B ON A.ID = B.ID
I'll omit the plan for the table variable creation; the plan for both queries is identical, though:
SELECT A.Data, B.Data FROM @TABLE_A AS A, @TABLE_B AS B WHERE A.ID = B.ID
|--Nested Loops(Inner Join, OUTER REFERENCES:([A].[ID]))
     |--Clustered Index Scan(OBJECT:(@TABLE_A AS [A]))
     |--Clustered Index Seek(OBJECT:(@TABLE_B AS [B]), SEEK:([B].[ID]=@TABLE_A.[ID] as [A].[ID]) ORDERED FORWARD)
SELECT A.Data, B.Data FROM @TABLE_A AS A INNER JOIN @TABLE_B AS B ON A.ID = B.ID
|--Nested Loops(Inner Join, OUTER REFERENCES:([A].[ID]))
     |--Clustered Index Scan(OBJECT:(@TABLE_A AS [A]))
     |--Clustered Index Seek(OBJECT:(@TABLE_B AS [B]), SEEK:([B].[ID]=@TABLE_A.[ID] as [A].[ID]) ORDERED FORWARD)
So, short answer - No need to rewrite, unless you spend a long time trying to read them each time you maintain them?
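If you don't have SQL Server at hand, the same comparison can be approximated with SQLite. The sketch below uses Python's sqlite3 module and EXPLAIN QUERY PLAN; the table names table_a and table_b are hypothetical equivalents of the table variables above, and SQLite's plan text will of course look different from SHOWPLAN output.

```python
import sqlite3


def compare_join_syntax():
    """Return (plan, rows) for the comma-join and the INNER JOIN spelling."""
    con = sqlite3.connect(":memory:")
    # Hypothetical tables mirroring the two table variables in the T-SQL script.
    con.executescript("""
        CREATE TABLE table_a (id INTEGER PRIMARY KEY, data TEXT NOT NULL);
        CREATE TABLE table_b (id INTEGER PRIMARY KEY, data TEXT NOT NULL);
        INSERT INTO table_a (data) VALUES ('ABC'), ('DEF'), ('GHI'), ('JKL');
        INSERT INTO table_b (data) VALUES ('ABC'), ('DEF'), ('GHI'), ('JKL');
    """)
    comma = ("SELECT a.data, b.data FROM table_a AS a, table_b AS b "
             "WHERE a.id = b.id")
    explicit = ("SELECT a.data, b.data FROM table_a AS a "
                "INNER JOIN table_b AS b ON a.id = b.id")

    def run(sql):
        # The last column of each EXPLAIN QUERY PLAN row is the readable step.
        plan = [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]
        rows = sorted(con.execute(sql).fetchall())
        return plan, rows

    result = run(comma), run(explicit)
    con.close()
    return result


(comma_plan, comma_rows), (join_plan, join_rows) = compare_join_syntax()
print(comma_plan == join_plan, comma_rows == join_rows)
```

SQLite folds the ON clause of an inner join into the WHERE clause before planning, so the two spellings should end up with the same plan there as well.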
It's more of a syntax choice. I prefer grouping my join conditions with my joins, hence I use the INNER JOIN syntax
SELECT a.someRow, b.someRow
FROM tableA AS a
INNER JOIN tableB AS b
ON a.ID = b.ID
WHERE b.ID = ?
(? being a placeholder)
Nothing is wrong with the syntax in your example. The 'INNER JOIN' syntax is generally termed 'ANSI' syntax, and came after the style illustrated in your example. It exists to clarify the type/direction/constituents of the join, but is not generally functionally different than what you have.
Support for 'ANSI' joins is per-database platform, but it's more or less universal these days.
As a side note, one addition with the 'ANSI' syntax was the 'FULL OUTER JOIN' or 'FULL JOIN'.
Hope this helps.
In general:
Use the JOIN keyword to link (ie. "join") primary keys and foreign keys.
Use the WHERE clause to limit your result set to only the records you are interested in.
The one problem that can arise is when you try to mix the old "comma-style" join with SQL-92 joins in the same query, for example if you need one inner join and another outer join.
SELECT *
FROM table1 AS a, table2 AS b
LEFT OUTER JOIN table3 AS c ON a.column1 = c.column1
WHERE a.column2 = b.column2;
The problem is that recent SQL standards say that the JOIN is evaluated before the comma-join. So the reference to "a" in the ON clause gives an error, because the correlation name hasn't been defined yet as that ON clause is being evaluated. This is a very confusing error to get.
The solution is to not mix the two styles of joins. You can continue to use comma-style in your old code, but if you write a new query, convert all the joins to SQL-92 style.
SELECT *
FROM table1 AS a
INNER JOIN table2 AS b ON a.column2 = b.column2
LEFT OUTER JOIN table3 AS c ON a.column1 = c.column1;
Another thing to consider with the old join syntax is that it is very easy to get a Cartesian join by accident, since there is no ON clause. If the DISTINCT keyword is in the query and it uses old-style joins, convert it to an ANSI standard join and see if you still need the DISTINCT. If you are fixing accidental Cartesian joins this way, you can improve performance tremendously by rewriting to specify the join and the join fields.
I avoid implicit joins; when the query is really large, they make the code hard to decipher
With explicit joins, and good formatting, the code is more readable and understandable without need for comments.
It also depends on whether you are just doing inner joins this way or outer joins as well. For instance, the MS SQL Server syntax for outer joins in the WHERE clause (=* and *=) can give different results than the OUTER JOIN syntax and is no longer supported (http://msdn.microsoft.com/en-us/library/ms178653(SQL.90).aspx) in SQL Server 2005.
And what about performance?
As a matter of fact, performance is a very important concern in an RDBMS.
So the question is which performs better: using JOIN, or joining tables in the WHERE clause?
Because the optimizer (or planner, as it is called in PG) ordinarily does a good job, the two execution plans are the same, so performance when executing the query will be the same.
But the devil is in the details.
All optimizers have a limited time or a limited amount of work in which to find the best plan. When the limit is reached, the result is the best plan among those computed so far, not the best of all possible plans!
Now the question is: do I lose time when I use the WHERE clause instead of JOINs for joining tables?
And the answer is YES!
YES, because the relational engine uses relational algebra, which knows only the JOIN operator, not pseudo-joins made in the WHERE clause. So the first thing the optimizer does (in fact the parser or the algebrizer) is rewrite the query, and this loses some chances of finding the best of all plans!
I have seen this problem twice in my long RDBMS career (40 years...)