Cross join behaviour (SQLServer 2008) - sql

I have been trying to track down a problem with a query I have. The query is actually generated by Hibernate from HQL, but the resulting SQL doesn't do what I expect. Modifying the SQL slightly produces the correct result, but I'm not sure why the modification should make any difference.
Original query (returns no rows)
select sched.id, max(txn.dttm), acc.id
from PaymentSchedulePeriod sched
cross join PaymentSchedulePayment pay
right outer join AccountTransaction txn on pay.accountTransactionFk=txn.id
right outer join Account acc on txn.accountFk=acc.id
where sched.accountFk=acc.id
group by sched.id, acc.id
Modified query - cross join replaced by a comma (implicit cross join)
Returns one row
select sched.id, max(txn.dttm), acc.id
from PaymentSchedulePeriod sched
,PaymentSchedulePayment pay
right outer join AccountTransaction txn on pay.accountTransactionFk=txn.id
right outer join Account acc on txn.accountFk=acc.id
where sched.accountFk=acc.id
group by sched.id, acc.id
My understanding, which may be incorrect, is that writing from Table1 a, Table2 b is the same as writing from Table1 a cross join Table2 b. So I don't understand why the queries return different results.
Is it something to do with the interaction between the cross join and the outer joins in the first query that causes this? I've looked at the query plans and the second query plan looks reasonable. The first one has no outer joins at all, which is strange.
This is on SQLServer 2008.

JOIN has a higher precedence than a COMMA, so your second statement is interpreted as (note the parens I added):
select sched.id, max(txn.dttm), acc.id
from PaymentSchedulePeriod sched
,(PaymentSchedulePayment pay
right outer join AccountTransaction txn on pay.accountTransactionFk=txn.id
right outer join Account acc on txn.accountFk=acc.id)
where sched.accountFk=acc.id
group by sched.id, acc.id
See also: JOIN precedence rules per SQL-99
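To make the two groupings concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the data (one account, one transaction, one schedule period, no payments) are assumptions made up for illustration, not the asker's schema. Because older SQLite builds have no RIGHT JOIN, and to avoid depending on how any particular parser treats the comma, each grouping is spelled out explicitly as an equivalent LEFT JOIN query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE acc   (id INTEGER PRIMARY KEY);
    CREATE TABLE txn   (id INTEGER PRIMARY KEY, account_fk INTEGER, dttm TEXT);
    CREATE TABLE pay   (id INTEGER PRIMARY KEY, txn_fk INTEGER);
    CREATE TABLE sched (id INTEGER PRIMARY KEY, account_fk INTEGER);
    -- Hypothetical data: one account, one transaction, one schedule period,
    -- and no payments at all.
    INSERT INTO acc   VALUES (1);
    INSERT INTO txn   VALUES (100, 1, '2010-01-01');
    INSERT INTO sched VALUES (10, 1);
""")

# Grouping of the explicit CROSS JOIN query:
#   ((sched CROSS JOIN pay) RIGHT JOIN txn) RIGHT JOIN acc
# rewritten with LEFT JOINs. sched x pay is empty here, so the sched
# columns are NULL on every surviving row and the WHERE rejects them.
explicit_cross_join = conn.execute("""
    SELECT sp.sched_id, MAX(txn.dttm), acc.id
    FROM acc
    LEFT JOIN txn ON txn.account_fk = acc.id
    LEFT JOIN (SELECT sched.id AS sched_id,
                      sched.account_fk AS sched_account_fk,
                      pay.txn_fk AS txn_fk
               FROM sched CROSS JOIN pay) AS sp
           ON sp.txn_fk = txn.id
    WHERE sp.sched_account_fk = acc.id
    GROUP BY sp.sched_id, acc.id
""").fetchall()

# Grouping of the comma query:
#   sched , ((pay RIGHT JOIN txn) RIGHT JOIN acc)
# The outer joins are resolved first, then crossed with sched.
comma_join = conn.execute("""
    SELECT sched.id, MAX(txn.dttm), acc.id
    FROM acc
    LEFT JOIN txn ON txn.account_fk = acc.id
    LEFT JOIN pay ON pay.txn_fk = txn.id
    CROSS JOIN sched
    WHERE sched.account_fk = acc.id
    GROUP BY sched.id, acc.id
""").fetchall()

print(explicit_cross_join)
print(comma_join)
```

With this data the first grouping returns no rows and the second returns one row, which mirrors the behaviour described in the question.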

Without looking at the actual data and query plans, I'd say (ok, guess) it has to do with the way the optimizer builds the query plans.
In the first, it is more or less explicitly told to "take the first table, cross join it with the second, then right join in the third, then right join in the fourth".
In the second, that cross join is (at least to my way of thinking) implicit. This is "old" SQL syntax from the days when all joins were performed in the WHERE clause, which--again, to my way of thinking--means that the database engine was free to work out on its own the order in which to process tables. Or, in other words, SQL is not being given a specific order in which to join tables. (With inner joins and cross joins, it makes no difference, but with outer joins, it can make a huge difference.)
...I prefer @Joe's answer (upvoted), as it's technically accurate. I'm tossing in my own anyway just for detail's sake.

Related

Where vs ON in outer join

I am wondering how to get better SQL performance when deciding whether to duplicate a criterion in the ON clause when it is already in the WHERE clause.
My friend claimed it is up to DB engines but I am not so sure.
Regardless of DB engines, normally, the condition in Where clause should be executed first before join, but I assume it means inner join but not outer join. Because some conditions can only be executed AFTER outer join.
For example:
Select a.*, b.*
From A a
Left outer join B b on a.id = b.id
Where b.id is NULL;
The condition in Where cannot be executed before outer join.
So, I assume the whole ON clause must be executed first before where clause, and it seems the ON clause will control the size of table B (or table A if we use right outer join) before outer join. That seems not related to DB engines to me.
And that raised my question: when we use an outer join, should we always duplicate our criteria in the ON clause?
for example (I use a table to outer join with a shorter version of itself)
temp_series_installment & series_id > 18940000 vs temp_series_installment:
select sql_no_cache s.*, t.* from temp_series_installment s
left outer join temp_series_installment t on s.series_id = t.series_id and t.series_id > 18940000 and t.incomplete = 1
where t.incomplete = 1;
VS
select sql_no_cache s.*, t.* from temp_series_installment s
left outer join temp_series_installment t on s.series_id = t.series_id and t.series_id > 18940000
where t.incomplete = 1;
Edit: where t.incomplete = 1 performs the logic of: where t.series_id is not null
which is an inner join suggested by Gordon Linoff
But what I have been asking is: if it outer joins a smaller table, it should be faster, right?
I tried to see if there is any performance difference in MySQL.
But the result was not what I expected: why is the second one faster? I thought that by outer joining a smaller table, the query would be faster.
My idea is from:
https://www.ibm.com/support/knowledgecenter/en/SSZLC2_8.0.0/com.ibm.commerce.developer.doc/refs/rsdperformanceworkspaces.htm
Section:
Push predicates into the OUTER JOIN clause whenever possible
Duplicate constant condition for different tables whenever possible
Regardless of DB engines, normally, the condition in Where clause should be executed first before join, but I assume it means inner join but not outer join. Because some conditions can only be executed AFTER outer join.
This is simply not true. SQL is a descriptive language. It does not specify how the query gets executed. It only specifies what the result set looks like. The SQL compiler/optimizer determines the actual processing steps to meet the requirements described by the query.
In terms of semantics, the FROM clause is the first clause that is "evaluated". Hence, FROM is logically processed before the WHERE clause.
The rest of your question is similarly misguided. Comparison logic in the where clause, such as:
from s left join
t
on s.series_id = t.series_id and t.series_id > 18940000
where t.incomplete = 1
turns the outer join into an inner join. Hence, the logic is different from what you think is going on.
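A small sketch of that last point, using Python's built-in sqlite3 module. The tables s and t and their rows are made up for illustration (they stand in for the asker's temp_series_installment). It compares a LEFT JOIN filtered in WHERE, an INNER JOIN with the same filter, and a LEFT JOIN with the filter moved into ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s (series_id INTEGER);
    CREATE TABLE t (series_id INTEGER, incomplete INTEGER);
    -- Hypothetical rows: series 3 has no match in t, series 2 is complete.
    INSERT INTO s VALUES (1), (2), (3);
    INSERT INTO t VALUES (1, 1), (2, 0);
""")

def run(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Filter on the outer-joined table in WHERE: NULL-extended rows are rejected.
left_with_where = run("""
    SELECT s.series_id, t.series_id
    FROM s LEFT JOIN t ON s.series_id = t.series_id
    WHERE t.incomplete = 1
    ORDER BY s.series_id
""")

# The same filter on an INNER JOIN.
inner_with_where = run("""
    SELECT s.series_id, t.series_id
    FROM s INNER JOIN t ON s.series_id = t.series_id
    WHERE t.incomplete = 1
    ORDER BY s.series_id
""")

# Filter moved into ON: every row of s is preserved.
left_with_on = run("""
    SELECT s.series_id, t.series_id
    FROM s LEFT JOIN t ON s.series_id = t.series_id AND t.incomplete = 1
    ORDER BY s.series_id
""")

print(left_with_where)
print(inner_with_where)
print(left_with_on)
```

The first two queries return the same rows, while the third keeps the unmatched rows of s with NULLs on the t side. So a predicate in ON and the same predicate in WHERE are different queries, not a tuning choice.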
As Gordon Linoff pointed out, it's not true. Your friend is plain wrong.
I just want to add that developers like to think about SQL the way they think about their language of trade (C++, VB, Java), but those are procedural/imperative languages.
When you code SQL you are in another paradigm. You are just describing a function to be applied to a dataset.
Let's get your own example:
Select a.*, b.*
From A a
Left outer join B b on a.id = b.id
Where b.id is NULL;
If a.Id and b.Id are NOT NULL columns, it's semantically equal to
Select a.*, null, ..., null
From A a
where not exists (select * from B b where b.Id = a.Id)
Now try to run those two queries and profile them.
In most DBMS I can expect both queries to run in the exact same way.
It happens because the engine decides how to implement your "function" over the dataset.
Note the above example is the equivalent in set mathematics to:
Give me the set A minus the intersection between A and B.
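As a quick check of that equivalence, here is a sketch using Python's built-in sqlite3 module. The tables A and B and their contents are assumed for illustration, with id declared NOT NULL as the answer requires. It only compares results; how any given engine plans the two forms is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER NOT NULL);
    CREATE TABLE B (id INTEGER NOT NULL);
    -- Hypothetical rows: only id 2 exists in B (twice, on purpose).
    INSERT INTO A VALUES (1), (2), (3);
    INSERT INTO B VALUES (2), (2);
""")

# Anti join written as an outer join plus an IS NULL test.
left_join_is_null = conn.execute("""
    SELECT a.id
    FROM A a LEFT OUTER JOIN B b ON a.id = b.id
    WHERE b.id IS NULL
    ORDER BY a.id
""").fetchall()

# The same anti join written with NOT EXISTS.
not_exists = conn.execute("""
    SELECT a.id
    FROM A a
    WHERE NOT EXISTS (SELECT * FROM B b WHERE b.id = a.id)
    ORDER BY a.id
""").fetchall()

print(left_join_is_null)
print(not_exists)
```

Both forms return the rows of A that have no partner in B.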
Engines can decide how to implement your query because they have some tricks up their sleeves.
They have metrics about your tables, indexes, etc. and can use them to, for example, perform a join in a different order than you wrote it.
IMHO engines today are really good at finding the best way to implement the function you describe and rarely need query hints.
Of course you can end up describing your function in a way that is too complicated, affecting how the engine decides to run it.
The art of better describing functions and sets and managing indexes is what we call query tuning.

SQL INNER JOIN implemented as implicit JOIN

Recently, I came across an SQL query which looked like this:
SELECT * FROM A, B WHERE A.NUM = B.NUM
To me, it seems as if this will return exactly the same as an INNER JOIN:
SELECT * FROM A INNER JOIN B ON A.NUM = B.NUM
Is there any sane reason why anyone would use a CROSS JOIN here? Edit: it seems as if most SQL applications will automatically use an INNER JOIN here.
The database is HSQLDB
The older syntax is a SQL antipattern. It should be replaced with an inner join any time you see it. Part of why it is an antipattern is that it is impossible to tell whether a cross join was intended if the WHERE clause is omitted. This causes many accidental cross joins, especially in complex queries. Further, in some databases (especially SQL Server) the implicit outer joins do not work correctly, so people try to combine explicit and implicit joins and get bad results without even realizing it. All in all, it is a poor practice to even consider using an implicit join.
Yes, both of your statements will return the same result. Which one is to be used is a matter of taste. Every sane database system will use a join for both if possible; no sane optimizer will really use a cross product in the first case.
But note that your first syntax is not a cross join. It is just an implicit notation for a join which does not specify which kind of join to use. Instead, the optimizer must check the WHERE clause to determine whether to use an inner join or a cross join: if an applicable join condition is found in the WHERE clause, this will result in an inner join. If no such condition is found, it will result in a cross join. Since your first example specifies an applicable join condition (WHERE A.NUM = B.NUM), this results in an INNER JOIN and is thus exactly equivalent to your second case.
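A minimal sketch of the points made in these answers, using Python's built-in sqlite3 module rather than HSQLDB. The contents of A and B are assumed for illustration. It shows the two notations returning the same rows, and what happens when the WHERE clause is left out of the comma form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (num INTEGER);
    CREATE TABLE B (num INTEGER);
    -- Hypothetical rows: only num = 2 appears in both tables.
    INSERT INTO A VALUES (1), (2);
    INSERT INTO B VALUES (2), (3);
""")

# Comma notation with the join condition in WHERE.
implicit_join = conn.execute(
    "SELECT A.num, B.num FROM A, B WHERE A.num = B.num"
).fetchall()

# Explicit INNER JOIN with the same condition in ON.
explicit_join = conn.execute(
    "SELECT A.num, B.num FROM A INNER JOIN B ON A.num = B.num"
).fetchall()

# Comma notation with the WHERE clause forgotten: a full cross product.
cross_product_count = conn.execute(
    "SELECT COUNT(*) FROM A, B"
).fetchone()[0]

print(implicit_join)
print(explicit_join)
print(cross_product_count)
```

The first two queries agree, while the third silently produces every combination of rows, which is the accidental cross join the first answer warns about.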

With SQL, what is the ranking of efficiency for each of the types of join

JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN?
I'm guessing the size of the datasets on each side of the join may make LEFT vs RIGHT a hard call, but how do the others compare.
Also am I correct in assuming JOIN & INNER JOIN are one and the same? If not, how does this fit into the order/ranking.
Yes, JOIN and INNER JOIN are the same. In general the ranking is JOIN is fastest, followed closely by LEFT JOIN which is equivalent to RIGHT JOIN, and then followed very far in the distance by FULL JOIN.
But this ranking is so variable that it can be largely ignored. Your actual performance is highly dependent upon the size of the datasets, availability of proper indexes, and exact query plan chosen. One LEFT JOIN may be fast and the next INNER JOIN might be glacially slow.
That notwithstanding, I would advise avoiding FULL JOIN unless you absolutely need it. (At least in Oracle, which is where I've had bad experiences with it.)
INNER is an optional word when INNER JOIN is desired => so they are one and the same. This is the same as the word OUTER being optional in LEFT/RIGHT/FULL OUTER JOIN
In terms of efficiency, it completely depends on what else is happening. If it is a LEFT JOIN with an IS NULL test on the right side (anti semi join), then it is very efficient and works like a NOT EXISTS clause.
Absent other factors, and considering only
SELECT .. FROM A X-JOIN B ON <condition>
1. If results need to be preserved from A, B or both, then efficiency is not a factor. You need a LEFT/RIGHT/FULL join because it provides the correct results.
2. If you need results that match on both sides, and not all data is available from either side, then, as above, efficiency is not a factor: you need an INNER JOIN.
3. Only if the join is bound to find rows on both sides does a LEFT/RIGHT/FULL join become an option. In most cases, the INNER JOIN will be faster because it gives the optimizer the option to start from the smaller (or better indexed) table and hash match to the larger table.
I say "in most cases" in point 3 because different RDBMSs may optimize queries differently.
Ranking them for efficiency would be pointless, as they return different results. If you need a left join, an inner join won't do the job.
Efficiency in a join has more to do with the size of the tables, the indexing, and how the rest of the query is written than with whether it is an INNER, OUTER, CROSS or FULL JOIN. A CROSS JOIN on two small tables might be fast, but an INNER JOIN on two large tables with a WHERE clause that is not sargable would not be.

How to do a full outer join without having full outer join available

Last week I was surprised to find out that Sybase 12 doesn't support full outer joins.
But it occurred to me that a full outer join should be the same as a left outer join unioned with a right outer join of the same sql.
Can anybody think of a reason this would not hold true?
UNION ALL the left join with the right join, but limit the right join to only rows that do not exist in the base table (return null on the join when they would not be null in the table if they existed).
For this code you will need to create two tables t1 and t2. t1 should have one column named c1 with five rows containing the values 1-5. t2 should also have a c1 column with five rows containing the values 2-6.
Full Outer Join:
select * from t1 full outer join t2 on t1.c1=t2.c1 order by 1, 2;
Full Outer Join Equivalent:
select t1.c1, t2.c1 from t1 left join t2 on t1.c1=t2.c1
union all
select t1.c1, t2.c1 from t1 right join t2 on t1.c1=t2.c1
where t1.c1 is null
order by 1, 2;
Note the where clause on the right joined select that limits the results to only those that would not be duplicates.
UNION-ing two OUTER JOIN statements should result in duplicate rows representing the data you'd get from an INNER JOIN. You'd have to probably do a SELECT DISTINCT on the data set produced by the UNION. Generally if you have to use a SELECT DISTINCT that means it's not a well-designed query (or so I've heard).
If you union them with UNION ALL, you'll get duplicates. If you just use UNION without the ALL, it will filter duplicates and therefore be equivalent to a full join (provided the join result contains no genuinely duplicated rows, which UNION would collapse as well), but the query will also be a lot more expensive because it has to perform a distinct sort.
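The three variants discussed in these answers can be compared side by side. This is a sketch using Python's built-in sqlite3 module with the t1 (values 1-5) and t2 (values 2-6) tables described above. Since older SQLite builds lack RIGHT JOIN, the right join is written here as the equivalent t2 LEFT JOIN t1. That substitution is mine, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER);
    CREATE TABLE t2 (c1 INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3), (4), (5);
    INSERT INTO t2 VALUES (2), (3), (4), (5), (6);
""")

left_part = "SELECT t1.c1, t2.c1 FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1"
# "t1 RIGHT JOIN t2" expressed as "t2 LEFT JOIN t1" (same rows, same columns).
right_part = "SELECT t1.c1, t2.c1 FROM t2 LEFT JOIN t1 ON t1.c1 = t2.c1"

# UNION ALL, with the right-hand half limited to rows that have no t1 match.
filtered_union_all = conn.execute(
    left_part + " UNION ALL " + right_part + " WHERE t1.c1 IS NULL"
).fetchall()

# UNION ALL without the filter: the matched rows appear twice.
plain_union_all = conn.execute(
    left_part + " UNION ALL " + right_part
).fetchall()

# UNION (distinct): the duplicates are removed by the set operation.
plain_union = conn.execute(
    left_part + " UNION " + right_part
).fetchall()

print(len(filtered_union_all), len(plain_union_all), len(plain_union))
```

With this data the filtered UNION ALL and the plain UNION both give the six rows a full outer join would produce, and the unfiltered UNION ALL gives ten.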
Well first, I don't know why you are using 12.x. It was EndOfLifed on 31 Dec 2009, after having been notified on 3 Apr 2007. 15.0.2 (first solid version) came out in Jan 2009. 15.5 is much better and was available 02 Dec 2009, so you are two major releases, and over at least 13 months, out of date.
ASE 12.5.4 has the new Join syntax. (you have not specified, you may be on 12.5.0.3, the release prior to that).
DB2 and Sybase did not implement FULL OUTER JOIN, for precisely the reason you have identified: it is covered by LEFT ... UNION ... RIGHT without ALL. It is not a case of "not supporting" a FOJ; it is a case of the keyword is missing.
And then you have the issue that Sybase and DB2 types would generally never use outer joins let alone FOJs, because their databases tend to be more normalised, etc.
Last, there is perfectly ordinary SQL you can use in any version of Sybase that will supply the function of FOJ, and will be distinctly faster on 12.x; only marginally faster on 15.x. It is kind of like the RANK() function: quite unnecessary if you can write a Subquery.
The second reason it does not need FULL OUTER, as some of the low end engines do, is because the new optimiser is extremely fast, and the query is fully normalised. Ie. it performs the LEFT and the RIGHT in a single pass.
Depending on your SARGs and DataType mismatches, etc., it may still have to sort-merge, but that too is streamed at all three levels: disk I/O subsystem; engine(s); and network handler. If your tables are partitioned, then it is additionally parallelised at that level.
If your server is not configured and your result set is very large, you may need to increase proc cache size and number of sort buffers. That's all.

Mixing Left and right Joins? Why?

I'm doing some refactoring of legacy code I've found in a project. This is for MSSQL. The thing is, I can't understand why we're using mixed left and right joins and collating some of the joining conditions together.
My question is this: doesn't this create implicit inner joins in some places and implicit full joins in others?
I'm of the school that just about anything can be written using just left (and inner/full) or just right (and inner/full), but that's because I like to keep things simple where possible.
As an aside, we convert all this stuff to work on oracle databases as well, so maybe there's some optimization rules that work differently with Ora?
For instance, here's the FROM part of one of the queries:
FROM Table1
RIGHT OUTER JOIN Table2
    ON Table1.T2FK = Table2.T2PK
LEFT OUTER JOIN Table3
    RIGHT OUTER JOIN Table4
        LEFT OUTER JOIN Table5
            ON Table4.T3FK = Table5.T3FK
            AND Table4.T2FK = Table5.T2FK
        LEFT OUTER JOIN Table6
            RIGHT OUTER JOIN Table7
                ON Table6.T6PK = Table7.T6FK
            LEFT OUTER JOIN Table8
                RIGHT OUTER JOIN Table9
                    ON Table8.T8PK= Table9.T8FK
                ON Table7.T9FK= Table9.T9PK
            ON Table4.T7FK= Table7.T7PK
        ON Table3.T3PK= Table4.T3PK
    RIGHT OUTER JOIN ( SELECT *
                       FROM TableA
                       WHERE ( TableA.PK = @PK )
                         AND ( TableA.Date BETWEEN @StartDate
                                               AND @EndDate )
                     ) Table10
        ON Table4.T4PK= Table10.T4FK
    ON Table2.T2PK = Table4.T2PK
One thing I would do is make sure you know what results you are expecting before messing with this. Wouldn't want to "fix" it and have different results returned. Although honestly, with a query that poorly designed, I'm not sure that you are actually getting correct results right now.
To me this looks like something that someone did over time maybe even originally starting with inner joins, realizing they wouldn't work and changing to outer joins but not wanting to bother changing the order the tables were referenced in the query.
Of particular importance to me for maintenance purposes is putting the ON clauses next to the tables you are joining, as well as converting all the joins to left joins rather than mixing right and left joins. Having the ON clause for table 4 and table 3 down next to table 9 makes no sense at all to me and can only add to the confusion as to what the query should actually return. You may also need to change the order of the joins in order to convert to all left joins. Personally I prefer to start with the main table that the others will join to (which appears to be table2) and then work down the food chain from there.
It could probably be converted to use all LEFT joins: I'd be looking at moving the right-hand table in each RIGHT join to be above all the existing LEFTs, and then you might be able to turn every RIGHT join into a LEFT join. I'm not sure you'll get any FULL joins behind the scenes -- if the query looks like it does, it might be a quirk of this specific query rather than a SQL Server "rule": the query you've provided does seem to be mixing it up in a rather confusing way.
As for Oracle optimisation -- that's certainly possible. No experience of Oracle myself, but speaking to a friend who's knowledgeable in this area, Oracle (no idea what version) is/was fussy about the order of predicates. For example, with SQL Server you can write your WHERE clause so that columns are in any order and indexes will get used, but with Oracle you end up having to specify the columns in the order they appear in the index in order to get the best performance with the index. As stated, no idea if this is the case with newer Oracle versions, but it was the case with older ones (apparently).
Whether this explains this particular construction, I can't say. It could simply be less-than-optimal code that has changed over the years, and a clean-up is what it's begging for.
LEFT and RIGHT join are pure syntax sugar.
Any LEFT JOIN can be transformed into a RIGHT JOIN merely by switching the sets.
Pre-9i Oracle used this construct:
WHERE table1.col(+) = table2.col
, (+) here denoting the nullable column, and LEFT and RIGHT joins could be emulated by mere switching:
WHERE table1.col = table2.col(+)
In MySQL, there is no FULL OUTER JOIN and it needs to be emulated.
Usually it is done this way:
SELECT *
FROM table1
LEFT JOIN
table2
ON table1.col = table2.col
UNION ALL
SELECT *
FROM table1
RIGHT JOIN
table2
ON table1.col = table2.col
WHERE table1.col IS NULL
, and it's more convenient to copy the JOIN and replace LEFT with RIGHT, than to swap the tables.
Note that in SQL Server plans, Hash Left Semi Join and Hash Right Semi Join are different operators.
For the query like this:
SELECT *
FROM table1
WHERE table1.col IN
(
SELECT col
FROM table2
)
, Hash Match (Left Semi Join) hashes table1 and removes the matched elements from the hash table in runtime (so that they cannot match more than one time).
Hash Match (Right Semi Join) hashes table2 and removes the duplicate elements from the hash table while building it.
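The logical property behind both operators, that a row of table1 is returned at most once however many matches it has in table2, can be seen by comparing IN with a plain join. This is a sketch using Python's built-in sqlite3 module with made-up rows. It says nothing about which physical operator any engine picks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col INTEGER);
    CREATE TABLE table2 (col INTEGER);
    -- Hypothetical rows: the value 1 appears twice in table2.
    INSERT INTO table1 VALUES (1), (2);
    INSERT INTO table2 VALUES (1), (1), (3);
""")

# Semi join: each table1 row is returned once if any match exists.
semi_join = conn.execute("""
    SELECT col
    FROM table1
    WHERE table1.col IN (SELECT col FROM table2)
""").fetchall()

# Ordinary inner join: one output row per matching pair.
inner_join = conn.execute("""
    SELECT table1.col
    FROM table1 JOIN table2 ON table1.col = table2.col
""").fetchall()

print(semi_join)
print(inner_join)
```

The IN query returns the matching table1 row once, while the inner join returns it once per duplicate in table2.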
I may be missing something here, but the only difference between LEFT and RIGHT joins is the order in which the source tables are written, and so having multiple LEFT joins or multiple RIGHT joins is no different from having a mix. The equivalence to FULL OUTERs could be achieved just as easily with all LEFT/RIGHT as with a mix, n'est-ce pas?
We have some LEFT OUTER JOINs and RIGHT OUTER JOINs in the same query. Typically such queries are large, have been around a long time, probably badly written in the first place and have received infrequent maintenance. I assume the RIGHT OUTER JOINs were introduced as a means of maintaining the query without taking on the inevitable risk when refactoring a query significantly.
I think most SQL coders are most comfortable with using all LEFT OUTER JOINs, probably because a FROM clause is read left-to-right in the English way.
The only time I use a RIGHT OUTER JOIN myself is when writing a new query based on an existing query (no need to reinvent the wheel) and I need to change an INNER JOIN to an OUTER JOIN. Rather than change the order of the JOINs in the FROM clause just to be able to use a LEFT OUTER JOIN, I would instead use a RIGHT OUTER JOIN, and this would not bother me. This is quite rare though. If the original query had LEFT OUTER JOINs then I'd end up with a mix of LEFT and RIGHT OUTER JOINs, which again wouldn't bother me. Hasn't happened to me yet, though.
Note that for SQL products such as the Access database engine that do not support FULL OUTER JOIN, one workaround is to UNION a LEFT OUTER JOIN and a RIGHT OUTER JOIN in the same query.
The bottom line is that this is a very poorly formatted SQL statement and should be re-written. Many of the ON clauses are located far from their JOIN statements, which I am not sure is even valid SQL.
For clarity's sake, I would rewrite the query using all LEFT JOINs (rather than RIGHT), and place the ON clauses directly underneath their corresponding JOIN clauses. Otherwise, this is a bit of a train wreck and is obfuscating the purpose of the query, making errors during future modifications more likely to occur.
doesn't this create implicit inner joins in some places and implicit full joins in others?
Perhaps you are assuming that because you don't see the ON clause next to some joins, e.g., RIGHT OUTER JOIN Table4; it is located further down, at ON Table4.T7FK= Table7.T7PK. I don't see any implicit inner joins, which could occur if there were a WHERE clause like WHERE Table3.T3PK is not null.
The fact that you are asking questions like this is a testament to the opaqueness of the query.
To answer another portion of this question that hasn't been answered yet: the reason this query is formatted so oddly is that it was likely built using the Query Designer inside SQL Management Studio. The giveaway is the combined ON clauses that appear many lines after the table is mentioned. Essentially, tables get added in the build query window and the order is kept, even if the way things are connected would favor moving a table up, so to speak, and keeping all the joins in a certain direction.