SQL is evaluated Right to Left or Left to Right? - sql

I have a SQL query like the one below:
A LEFT JOIN B LEFT JOIN C LEFT JOIN D
Say table A is a big table, whereas tables B, C, and D are small.
Will Spark execute the joins like this:
A is joined with B, and the result is then joined with C and then D,
or
will Spark optimize automatically, i.e. join B, C, and D first and then
join the result with A?
My question is: what is the order of execution or join evaluation? Does it go left to right or right to left?

Spark can optimize join order if it has access to information about the cardinalities of those joins.
For example, if those are parquet tables or cached dataframes, then it has estimates of the total row counts of the tables and can reorder the joins to make them less expensive. If a "table" is a jdbc dataframe, Spark may not have information on row counts.
The Spark query optimizer can also choose a different join type if it has statistics (e.g. it can broadcast all the smaller tables and run a broadcast hash join instead of a sort merge join).
If statistics aren't available, then it will just follow the order in the SQL query, i.e. from left to right.
Update:
I originally missed that all the joins in your query are OUTER joins (left is equivalent to left outer).
Normally outer joins can't be reordered, because this would change the result of the query. I said "normally" because sometimes the Spark optimizer can convert an outer join to an inner join (e.g. if you have a WHERE clause that filters out NULLs - see conversion logic here).
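The outer-to-inner conversion mentioned above can be checked on a toy dataset. This is a minimal sketch that uses SQLite through Python's sqlite3 module rather than Spark; the tables a and b and their columns are made up for illustration. It only shows why the rewrite is safe: a WHERE predicate that rejects NULLs on the right-hand table removes exactly the rows that distinguish a LEFT JOIN from an INNER JOIN.

```python
import sqlite3

# Hypothetical toy tables; names and values are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, score INTEGER);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (1, 10), (3, 30);
""")

# LEFT JOIN followed by a predicate that can never be true for a NULL b.score.
left_filtered = conn.execute("""
    SELECT a.id, b.score FROM a
    LEFT JOIN b ON a.id = b.id
    WHERE b.score > 0
    ORDER BY a.id
""").fetchall()

# The same query written as an INNER JOIN.
inner = conn.execute("""
    SELECT a.id, b.score FROM a
    INNER JOIN b ON a.id = b.id
    WHERE b.score > 0
    ORDER BY a.id
""").fetchall()

print(left_filtered)           # [(1, 10), (3, 30)]
print(left_filtered == inner)  # True
```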
For completeness, join reordering is driven by two different code paths, depending on whether Spark CBO is enabled (spark.sql.cbo.enabled first appeared in Spark 2.2 and is off by default). If spark.sql.cbo.enabled=true and spark.sql.cbo.joinReorder.enabled=true (also off by default), and statistics are available or have been collected manually through ANALYZE TABLE .. COMPUTE STATISTICS, then reordering is based on the estimated join cardinality mentioned above.
Proof that reordering only works for INNER JOINS is here (on example of CBO).
Update 2: Sample queries that show that reordering of outer joins produces different results, so outer joins are never reordered:
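As an illustration (these are not the answerer's original queries), the following sketch runs a chain of left joins in SQLite through Python's sqlite3 module, not in Spark. The tables a, b, c and their contents are assumptions. It compares the query as written with a version that starts from the small tables, which is what the question proposed.

```python
import sqlite3

# Hypothetical toy tables standing in for A (big) and B, C (small).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1);
    INSERT INTO c VALUES (1), (2);
""")

# As written: every row of a is preserved.
as_written = conn.execute("""
    SELECT a.id, b.id, c.id FROM a
    LEFT JOIN b ON a.id = b.id
    LEFT JOIN c ON b.id = c.id
    ORDER BY a.id
""").fetchall()

# Reordered: b and c are joined first, then a. Now b is the preserved table.
small_first = conn.execute("""
    SELECT a.id, b.id, c.id FROM b
    LEFT JOIN c ON b.id = c.id
    LEFT JOIN a ON a.id = b.id
    ORDER BY a.id
""").fetchall()

print(as_written)   # [(1, 1, 1), (2, None, None), (3, None, None)]
print(small_first)  # [(1, 1, 1)]
```

The two result sets differ, so an optimizer cannot swap the order freely without changing what the query means.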

The order of interpretation of joins does not matter for inner joins. However, it can matter for outer joins.
Your logic is equivalent to:
FROM ((A LEFT JOIN
       B
       ON . . .
      ) LEFT JOIN
      C
      ON . . .
     ) LEFT JOIN
     D
     ON . . .
The simplest way to think about chains of LEFT JOIN is that they keep all rows in the first table and columns from matching rows in the subsequent tables.
Note that this is the interpretation of the code. The SQL optimizer is free to rearrange the JOINs in any order to arrive at the same result set (although with outer joins this is generally less likely than with inner joins).

Related

Optimizing OUTER JOIN queries using filters from WHERE clause.(Query Planner)

I am writing a distributed SQL query planner (query engine). Data will be fetched from RDBMS (PostgreSQL) nodes, which involves network I/O.
I want to optimize JOIN queries.
The logical order of execution is:
1. Do the JOIN (using the ON clause).
2. Apply the WHERE clause to the joined result.
I was thinking about applying the filter (the part of the WHERE clause specific to a table) first, and then doing the join.
In what cases would that result in wrong results?
Example:
SELECT *
FROM tableA
LEFT JOIN tableB ON(tableA.col1 = tableB.col1)
LEFT JOIN tableC ON(tableB.col2 = tableC.col1)
WHERE tableA.colY < 100 AND tableB.colX > 50
Logical Execution:
joinResult = (tableA left join tableB ON() ) left join tableC ON()
Filter joinResult using given WHERE clause.
Proposed Execution:
filteredA = tableA WHERE tableA.colY < 100
filteredB = tableB WHERE tableB.colX > 50
Result = (filteredA left join filteredB ON(..))left join tableC ON(..)
Can I optimize any query like this? That is filtering the table first and then applying join above that.
Edit:
Some people are confused and are discussing this specific example. I am not asking about this specific example query; I am writing a query planner and I want to handle all types of queries.
Please note that, each of the tables is sharded and stored in different machines, and the current execution model is to fetch each of the tables and then do join locally. So if I apply the WHERE filter before fetching, it would be better.
This is actually a complex topic.
We can filter the table in some cases. We can also reorder outer joins and then push the filter quals inside.
I have been reading a research paper on this, but I haven't finished it yet (and may not).
So for now, those who are looking for answers could go through this research paper, particularly section 2.2: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.2531&rep=rep1&type=pdf
For now I'm relying on PostgreSQL's planner and taking its output and reconstructing the query for my requirements.
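The proposed execution from the question can be tried on a toy dataset. The sketch below uses SQLite through Python's sqlite3 module, keeps only tableA and tableB from the example, and fills them with made-up rows. It suggests that pushing the filter on the preserved side (tableA) is safe, while pushing the filter on the nullable side (tableB) below the LEFT JOIN changes the result.

```python
import sqlite3

# Made-up rows; only the table and column names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (col1 INTEGER, colY INTEGER);
    CREATE TABLE tableB (col1 INTEGER, col2 INTEGER, colX INTEGER);
    INSERT INTO tableA VALUES (1, 10), (2, 20);
    INSERT INTO tableB VALUES (1, 1, 60), (2, 2, 40);
""")

# Logical execution: join first, then apply the whole WHERE clause.
logical = conn.execute("""
    SELECT tableA.col1, tableB.colX FROM tableA
    LEFT JOIN tableB ON tableA.col1 = tableB.col1
    WHERE tableA.colY < 100 AND tableB.colX > 50
    ORDER BY tableA.col1
""").fetchall()

# Proposed execution: filter both tables first, then LEFT JOIN.
proposed = conn.execute("""
    SELECT fa.col1, fb.colX
    FROM (SELECT * FROM tableA WHERE colY < 100) AS fa
    LEFT JOIN (SELECT * FROM tableB WHERE colX > 50) AS fb
      ON fa.col1 = fb.col1
    ORDER BY fa.col1
""").fetchall()

# Pushing only the tableA filter and keeping the tableB filter on top.
only_a_pushed = conn.execute("""
    SELECT fa.col1, tableB.colX
    FROM (SELECT * FROM tableA WHERE colY < 100) AS fa
    LEFT JOIN tableB ON fa.col1 = tableB.col1
    WHERE tableB.colX > 50
    ORDER BY fa.col1
""").fetchall()

print(logical)        # [(1, 60)]
print(proposed)       # [(1, 60), (2, None)]
print(only_a_pushed)  # [(1, 60)]
```

In the proposed plan, the row with col1 = 2 survives with a NULL colX because its tableB match was filtered out before the join, whereas the original WHERE clause would have discarded it.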

difference in with/without "left join" and matching in "where" or "on"?

Is there any performance difference between the two SQL queries below? The first one has no left join and matches in the WHERE clause; the other uses a left join and matches in the ON clause.
I get exactly the same output from both queries, but I will be working with bigger tables soon (a couple of billion rows), so I don't want to have any performance issues. Thanks in advance ...
select a.customer_id
from table a, table b
where a.customer_id = b.customer_id
select a.customer_id
from table a
left join table b
on a.customer_id = b.customer_id
The two do different things and yes, there is a performance impact.
Your first example is a cross join with a filter which reduces it to an inner join (virtually all planners are smart enough to reduce this to an inner join but it is semantically a cross join and filter).
Your second is a left join which means that where the filter is not met, you will still get all records from table a.
This means that the planner has to assume all records from table a are relevant, and that correlating records from table b are relevant in your second example, but in your first example it knows that only correlated records are relevant (and therefore has more freedom in planning).
In a very small data set you will see no performance difference, but you may get different results. In a large data set, your left join can never perform better than your inner join and may perform worse.
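The difference in results is easy to reproduce. This is a minimal sketch using SQLite through Python's sqlite3 module; the table names table_a and table_b and their rows are assumptions standing in for the question's placeholder tables. It shows nothing about performance, only that the two queries are not equivalent once table a has a customer without a match in table b.

```python
import sqlite3

# Hypothetical tables; customer 2 exists only in table_a.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (customer_id INTEGER);
    CREATE TABLE table_b (customer_id INTEGER);
    INSERT INTO table_a VALUES (1), (2), (3);
    INSERT INTO table_b VALUES (1), (3);
""")

# Comma join plus WHERE: semantically a cross join with a filter.
comma_join = conn.execute("""
    SELECT a.customer_id
    FROM table_a a, table_b b
    WHERE a.customer_id = b.customer_id
    ORDER BY a.customer_id
""").fetchall()

# Explicit inner join: same rows as the comma join.
inner_join = conn.execute("""
    SELECT a.customer_id
    FROM table_a a
    INNER JOIN table_b b ON a.customer_id = b.customer_id
    ORDER BY a.customer_id
""").fetchall()

# Left join: keeps every row of table_a, matched or not.
left_join = conn.execute("""
    SELECT a.customer_id
    FROM table_a a
    LEFT JOIN table_b b ON a.customer_id = b.customer_id
    ORDER BY a.customer_id
""").fetchall()

print(comma_join)  # [(1,), (3,)]
print(left_join)   # [(1,), (2,), (3,)]
```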

Is the order of joining tables indifferent as long as we chose proper join types?

Can we achieve desired results of joining tables by executing joins in whatever order? Suppose we want to left join two tables A and B (order AB). We can get the same results with right join of B and A (BA).
What about 3 tables ABC. Can we get whatever results by only changing order and joins types? For example A left join B inner join C. Can we get it with BAC order? What about if we have 4 or more tables?
Update.
The question "Does the join order matter in SQL?" is about the inner join type. Agreed that the order of joins doesn't matter in that case. The answer provided there does not answer my question, which is whether it is possible to get the desired results of joining tables with whatever original join types (join types here) by choosing whatever order of tables we like, achieving this only by changing the join types.
In an inner join, the ordering of the tables in the join doesn't matter - the same rows will make up the result set regardless of the order they are in the join statement.
In either a left or right outer join, the order DOES matter. In A left join B, your result set will contain one row for every record in table A, irrespective of whether there is a matching row in table B. If there are non matching rows, this is likely to be a different result set to B left join A.
In a full outer join, the order again doesn't matter - rows will be produced for each row in each joined table no matter what their order.
Regarding A left join B vs B right join A - these will produce the same results. In simple cases with 2 tables, swapping the tables and changing the direction of the outer join will result in the same result set.
This will also apply to 3 or more tables if all of the outer joins are in the same direction - A left join B left join C will give the same set of results as C right join B right join A.
If you start mixing left and right joins, then you will need to start being more careful. There will almost always be a way to make an equivalent query with re-ordered tables, but at that point sub-queries or bracketing off expressions might be the best way to clarify what you are doing.
As another commenter states, using whatever makes your purpose most clear is usually the best option. The ordering of the tables in your query should make little or no difference performance wise, as the query optimiser should work this out (although the only way to be sure of this would be to check the execution plans for each option with your own queries and data).
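The question's "A left join B inner join C" case is one where bracketing matters. The sketch below uses SQLite through Python's sqlite3 module with made-up tables a, b, c, and writes the bracketed form as an explicit subquery, which is one possible way to express it. It compares the left-to-right reading with a version where B and C are joined first.

```python
import sqlite3

# Hypothetical toy tables; a has one row (id 2) with no match in b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (id INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1);
    INSERT INTO c VALUES (1);
""")

# Left to right: (a LEFT JOIN b) INNER JOIN c.
# The inner join on b.id discards the NULL-extended row for a.id = 2.
left_to_right = conn.execute("""
    SELECT a.id, b.id, c.id FROM a
    LEFT JOIN b ON a.id = b.id
    INNER JOIN c ON b.id = c.id
    ORDER BY a.id
""").fetchall()

# Bracketed: a LEFT JOIN (b INNER JOIN c), written as a subquery.
bracketed = conn.execute("""
    SELECT a.id, bc.b_id, bc.c_id FROM a
    LEFT JOIN (
        SELECT b.id AS b_id, c.id AS c_id
        FROM b INNER JOIN c ON b.id = c.id
    ) AS bc ON a.id = bc.b_id
    ORDER BY a.id
""").fetchall()

print(left_to_right)  # [(1, 1, 1)]
print(bracketed)      # [(1, 1, 1), (2, None, None)]
```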

Successive LEFT SQL joins back to original

I need to rewrite an SQL statement in a legacy FoxPro application and don't understand whether it is meaningful at all. The syntax is a bit unusual: it extracts data from a temporary table into the same temporary table (overwriting it) with some joins.
SELECT aa.*,b.spa_date FROM (ALIAS()) aa INNER JOIN jobs ON aa.seq=jobs.seq ;
LEFT JOIN job2 ON jobs.job_no=job2.rucjob;
left join jobs b on b.job_no=job2.job_no;
WHERE jobs.qty1<>0 INTO CURSOR (ALIAS())
Since only one field (spa_date) is added from the joined tables, is there any point in the two left joins, or am I missing something? Isn't it equivalent to
SELECT aa.*,jobs.spa_date FROM (ALIAS()) aa INNER JOIN jobs ON aa.seq=jobs.seq ;
WHERE jobs.qty1<>0 INTO CURSOR (ALIAS())
They are different because b.spa_date comes from the second left join (the jobs row aliased as b), not from the jobs row matched on aa.seq. Without both left joins you may get different rows.
You would need to know the intent of the original query and perhaps rewrite it to make more sense but I'd say the two queries are different.
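To make the difference concrete, here is a rough SQLite analogue run through Python's sqlite3 module. It is not FoxPro, the temporary table is modeled as a plain table named aa, and all rows are invented, including the assumption that job2 links one job (rucjob) to another (job_no).

```python
import sqlite3

# Invented data: job J1 is linked through job2 to job J2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE aa (seq INTEGER);
    CREATE TABLE jobs (seq INTEGER, job_no TEXT, spa_date TEXT, qty1 INTEGER);
    CREATE TABLE job2 (rucjob TEXT, job_no TEXT);
    INSERT INTO aa VALUES (1);
    INSERT INTO jobs VALUES (1, 'J1', '2020-01-01', 5),
                            (2, 'J2', '2021-06-30', 5);
    INSERT INTO job2 VALUES ('J1', 'J2');
""")

# Shape of the original query: spa_date is read from the linked job (alias b).
original = conn.execute("""
    SELECT aa.seq, b.spa_date FROM aa
    INNER JOIN jobs ON aa.seq = jobs.seq
    LEFT JOIN job2 ON jobs.job_no = job2.rucjob
    LEFT JOIN jobs b ON b.job_no = job2.job_no
    WHERE jobs.qty1 <> 0
""").fetchall()

# Shape of the proposed simplification: spa_date is read from the job itself.
simplified = conn.execute("""
    SELECT aa.seq, jobs.spa_date FROM aa
    INNER JOIN jobs ON aa.seq = jobs.seq
    WHERE jobs.qty1 <> 0
""").fetchall()

print(original)    # [(1, '2021-06-30')]
print(simplified)  # [(1, '2020-01-01')]
```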

With SQL, what is the ranking of efficiency for each of the types of join

JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN?
I'm guessing the size of the datasets on each side of the join may make LEFT vs RIGHT a hard call, but how do the others compare.
Also am I correct in assuming JOIN & INNER JOIN are one and the same? If not, how does this fit into the order/ranking.
Yes, JOIN and INNER JOIN are the same. In general the ranking is JOIN is fastest, followed closely by LEFT JOIN which is equivalent to RIGHT JOIN, and then followed very far in the distance by FULL JOIN.
But this ranking is so variable that it can be largely ignored. Your actual performance is highly dependent upon the size of the datasets, availability of proper indexes, and exact query plan chosen. One LEFT JOIN may be fast and the next INNER JOIN might be glacially slow.
That notwithstanding, I would advise avoiding FULL JOIN unless you absolutely need it. (At least in Oracle, which is where I've had bad experiences with it.)
INNER is an optional word when INNER JOIN is desired => so they are one and the same. This is the same as the word OUTER being optional in LEFT/RIGHT/FULL OUTER JOIN
In terms of efficiency, it completely depends on what else is happening. If it is a LEFT JOIN with an IS NULL test on the right side (anti-semi join), then it is very efficient and works like a NOT EXISTS clause.
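The anti-join pattern, a LEFT JOIN that keeps only the rows whose right side came back NULL, can be compared with NOT EXISTS on a toy dataset. This sketch uses SQLite through Python's sqlite3 module with made-up tables a and b. It only checks that the two forms return the same rows and says nothing about how a given RDBMS plans them.

```python
import sqlite3

# Hypothetical toy tables; id 2 has no match in b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (3);
""")

# Anti-join written as LEFT JOIN plus an IS NULL test on the right side.
left_is_null = conn.execute("""
    SELECT a.id FROM a
    LEFT JOIN b ON a.id = b.id
    WHERE b.id IS NULL
    ORDER BY a.id
""").fetchall()

# The same question asked with NOT EXISTS.
not_exists = conn.execute("""
    SELECT a.id FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
    ORDER BY a.id
""").fetchall()

print(left_is_null)  # [(2,)]
print(not_exists)    # [(2,)]
```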
Absent other factors, and considering only
SELECT .. FROM A X-JOIN B ON <condition>
1. If results need to be preserved from A, B, or both, then efficiency is not a factor. You need a LEFT/RIGHT/FULL join because it provides the correct results.
2. If you need results that match on both sides, and not all data is available from either side, then, as above, you need an INNER JOIN.
3. Only if the join is bound to find rows on both sides does a LEFT/RIGHT/FULL join become an option. In most cases, the INNER JOIN will be faster because it gives the optimizer the option to start from the smaller (or better indexed) table and hash match to the larger table.
I say "in most cases" in point 3 because different RDBMSs may optimize queries differently.
Ranking them for efficiency would be pointless, as they return different results. If you need a left join, an inner join won't do the job.
Efficiency in a join has more to do with the size of the tables, the indexing, and how the rest of the query is written than with whether it is an INNER, OUTER, CROSS, or FULL JOIN. A CROSS JOIN on two small tables might be fast, but an INNER JOIN on two large tables with a WHERE clause that is not sargable would not be.