Exactly when to use inner join or an alternative query - sql

SELECT .... FROM TABLE1 T1, TABLE2 T2, TABLE3 T3
WHERE T1.NAME = 'ABC' AND T1.ID = T2.COL_ID AND T2.COL1 = T3.COL2
vs
SELECT .... FROM TABLE1 T1
WHERE T1.NAME = 'ABC'
INNER JOIN TABLE2 T2 ON T1.ID = T2.COL_ID
INNER JOIN TABLE3 T3 ON T2.COL1 = T3.COL2
Two questions
In terms of performance, which will perform better and why?
If Option 2 has the better performance, when should I be using Option 1? (And vice versa, if Option 1 has the better performance.)

The second query is not correct. It should be:
SELECT .... FROM TABLE1 T1
INNER JOIN TABLE2 T2 ON T1.ID = T2.COL_ID
INNER JOIN TABLE3 T3 ON T2.COL1 = T3.COL2
WHERE T1.NAME = 'ABC'
This is the right way to write your join condition. The first form is accepted, but it is technically written as a cartesian product that the WHERE clause then filters. All modern databases handle both the first and the second form and interpret them the same way, so performance should be the same. Still, you should use the second one because it is more readable and gives you a single way to write joins, whether inner, left or full outer.

The answer is easy: Don't use comma-separated joins (first query). We used these in the 1980s for the lack of something better, but then in 1992 the new syntax (second query) was introduced [1], because the old syntax was error-prone (it was easier to forget to apply join criteria) and harder to maintain (was missing join criteria intended or not in a query?) and there was no standard syntax for outer joins.
[1] Oracle was a little late though featuring the new syntax. They introduced the new ANSI joins in Oracle 9i in 2001.
In terms of performance: There should be no difference in speed, because DBMS optimizers see that this is essentially the same query.
Your second query is syntactically incorrect by the way. The query's WHERE clause belongs after the complete FROM clause, i.e. after all the joins:
SELECT ....
FROM table1 t1
INNER JOIN table2 t2 ON t1.id = t2.col_id
INNER JOIN table3 t3 ON t2.col1 = t3.col2
WHERE t1.name = 'ABC';
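To see both points on a small scale, here is a minimal sketch using Python's built-in sqlite3 module. The table contents and the label column are made up for illustration, and SQLite only stands in for whichever DBMS you actually use. It shows that the two syntaxes return the same rows, and how a forgotten join condition in the comma syntax silently multiplies the result.

```python
import sqlite3

# Hypothetical sample data; only the column names used in the question are real.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT);
    CREATE TABLE table2 (col_id INTEGER, col1 INTEGER);
    CREATE TABLE table3 (col2 INTEGER, label TEXT);
    INSERT INTO table1 VALUES (1, 'ABC'), (2, 'XYZ');
    INSERT INTO table2 VALUES (1, 10), (1, 11), (2, 10);
    INSERT INTO table3 VALUES (10, 'ten'), (11, 'eleven');
""")

comma_sql = """
    SELECT t1.id, t2.col1, t3.label
    FROM table1 t1, table2 t2, table3 t3
    WHERE t1.name = 'ABC' AND t1.id = t2.col_id AND t2.col1 = t3.col2
    ORDER BY t2.col1
"""
join_sql = """
    SELECT t1.id, t2.col1, t3.label
    FROM table1 t1
    INNER JOIN table2 t2 ON t1.id = t2.col_id
    INNER JOIN table3 t3 ON t2.col1 = t3.col2
    WHERE t1.name = 'ABC'
    ORDER BY t2.col1
"""
# Comma syntax with the table2/table3 condition forgotten: each remaining
# table1/table2 pair is combined with every row of table3.
forgotten_sql = """
    SELECT t1.id, t2.col1, t3.label
    FROM table1 t1, table2 t2, table3 t3
    WHERE t1.name = 'ABC' AND t1.id = t2.col_id
"""

comma_rows = conn.execute(comma_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
forgotten_rows = conn.execute(forgotten_sql).fetchall()

print(comma_rows == join_rows)              # True
print(len(join_rows), len(forgotten_rows))  # 2 4
```

With the explicit syntax, leaving out an ON clause is rejected by most engines, so this kind of mistake is much harder to make by accident.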

Related

Left Join in hive not passing filters before doing a scan

I am using Hive 2.3.5
Spark version 2.3.3
When I run the following query on Hive, it fails, saying it is trying to scan too many partitions.
select t1.A, t2.B
from t1 left join t2 on t1.x = t2.x
where t1.x = 'abc'
vs when I run this, it works fine:
select t1.A, t2.B
from t1 left join t2 on t1.x = t2.x
where t1.x = 'abc'
and t2.x = 'abc'
Why do I need to pass the explicit filter (t2.x = 'abc') again on table t2 when I am already joining on t1.x = t2.x and filtering with where t1.x = 'abc'?
A normal (inner) join works fine without the additional filter, but a left join does not.
The optimizer cannot always push predicates down because it is not intelligent enough. Most probably the WHERE is being applied after the join, causing too many rows to be scanned.
Predicate pushdown (PPD) probably works fine with INNER JOIN. The EXPLAIN plan may give more information.
But apart from this, there is one more issue.
You are saying that INNER join works ok... Look:
Your two queries are completely different. The first one is a LEFT JOIN. If t2 does not contain rows where t2.x = 'abc', it will still return rows from t1.
The second one behaves differently. It is really an INNER JOIN, because the predicate t2.x = 'abc' in the WHERE clause does not allow NULLs and so filters out records that are not joined with t2. Check it: you are selecting only joined records, which is an INNER JOIN. If t2 does not contain rows where t2.x = 'abc', the second query will not return any rows.
Try adding one more join condition to the ON instead of WHERE, this will look more like LEFT JOIN:
select t1.A, t2.B
from t1 left join t2 on t1.x = t2.x and t2.x='abc'
where t1.x = 'abc'
I am not saying that this will resolve the issue with too many partitions scanned, I'm just saying that this will be real left join, not inner, and predicate will be applied to t2 before join.
Another way is using filtering in the subquery before join.
select t1.A, t2.B
from (select * from t1 where t1.x = 'abc') t1
left join (select * from t2 where t2.x = 'abc') t2
on t1.x = t2.x
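The difference between filtering t2 in the WHERE clause and in the ON clause can be checked with a small sketch. It uses Python's sqlite3 module and made-up rows, so it says nothing about Hive's partition pruning; it only illustrates the result semantics described above.

```python
import sqlite3

# Hypothetical data: t2 deliberately has no row with x = 'abc'.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x TEXT, a TEXT);
    CREATE TABLE t2 (x TEXT, b TEXT);
    INSERT INTO t1 VALUES ('abc', 'a1'), ('def', 'a2');
    INSERT INTO t2 VALUES ('def', 'b2');
""")

# Filter on t2 in WHERE: NULL-extended rows are discarded afterwards,
# so the query behaves like an inner join.
where_filter = conn.execute("""
    SELECT t1.a, t2.b
    FROM t1 LEFT JOIN t2 ON t1.x = t2.x
    WHERE t1.x = 'abc' AND t2.x = 'abc'
""").fetchall()

# Filter on t2 in ON: the unmatched t1 row survives with b = NULL.
on_filter = conn.execute("""
    SELECT t1.a, t2.b
    FROM t1 LEFT JOIN t2 ON t1.x = t2.x AND t2.x = 'abc'
    WHERE t1.x = 'abc'
""").fetchall()

print(where_filter)  # []
print(on_filter)     # [('a1', None)]
```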

SQL Filtering when joining

Bit of a novice question. I am running a query with left joins and wanted to know whether, in terms of performance, it makes a difference where you specify a filter. In the example below, the first query filters straight after the first join, and the second does all the joins and then filters:
Select t1.*,t2.* from t1 t1
left join t2 t2
on t1.key = t2.key
and t1.date < today
left join t3 t3
on t2.key2 = t3.key
vs
Select t1.*,t2.* from t1 t1
left join t2 t2
on t1.key = t2.key
left join t3 t3
on t2.key2 = t3.key
and t1.date < today
Learn what LEFT JOIN ON returns: INNER JOIN ON rows UNION ALL unmatched left table rows extended by NULLs. Always know what INNER JOIN you want as part of an OUTER JOIN.
In general your queries have different inner join & null-extended rows for the 1st left joins & then further differences due to more joining. Unless certain constraints hold, the 2 queries return different functions of their inputs. So comparing their performance seems moot.
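A minimal sketch of that point, using Python's sqlite3 module. The rows are invented, the columns k, k2 and dt are stand-ins for key, key2 and date, and a fixed literal replaces today, so treat it as an illustration rather than the asker's schema. The same condition placed in different ON clauses changes the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (k INTEGER, dt TEXT);
    CREATE TABLE t2 (k INTEGER, k2 INTEGER);
    CREATE TABLE t3 (k INTEGER);
    INSERT INTO t1 VALUES (1, '2020-01-01'), (2, '2999-01-01');
    INSERT INTO t2 VALUES (1, 100), (2, 200);
    INSERT INTO t3 VALUES (100), (200);
""")
params = {"today": "2024-01-01"}  # assumed fixed date standing in for "today"

# Date condition in the first ON: it decides whether t1 gets a t2 match.
filter_on_first_join = conn.execute("""
    SELECT t1.k, t2.k2
    FROM t1
    LEFT JOIN t2 ON t1.k = t2.k AND t1.dt < :today
    LEFT JOIN t3 ON t2.k2 = t3.k
    ORDER BY t1.k
""", params).fetchall()

# Date condition in the second ON: it only affects the t3 match.
filter_on_second_join = conn.execute("""
    SELECT t1.k, t2.k2
    FROM t1
    LEFT JOIN t2 ON t1.k = t2.k
    LEFT JOIN t3 ON t2.k2 = t3.k AND t1.dt < :today
    ORDER BY t1.k
""", params).fetchall()

print(filter_on_first_join)   # [(1, 100), (2, None)]
print(filter_on_second_join)  # [(1, 100), (2, 200)]
```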

Order of operation Inner Join and Where clause performance in SQL Server?

Is there any performance problem with writing the operations in a different order?
For example:
1. All inner joins first, then all where conditions later.
select * from
t1
inner join t2 on t1.t2Id = t2.Id
inner join t3 on t1.t3Id = t3.Id
inner join t4 on t2.t4Id = t4.Id
where
t1.Id in (1,2,3,4,5)
and t2.Id in (1,2,3,4,5,6,7)
and t3.Name like '%a'
2. Each table with its respective where condition first, then the inner joins.
select * from
(select * from t1 where t1.Id in (1,2,3,4,5)) a
inner join (select * from t2 where t2.Id in (1,2,3,4,5,6,7)) a1 on a.t2Id = a1.Id
inner join (select * from t3 where t3.Name like '%a') a2 on a.t3Id = a2.Id
inner join t4 on a1.t4Id = t4.Id
Can it affect query performance?
Also, does the order of the where conditions matter?
For example:
select * from t1
inner join t2 on t1.t2Id = t2.Id
where t1.t2Id in (1,2,3,4,5,6)
and t2.t3Id in (1,2,3,4,5)
A SQL query goes through three phases when it is run:
1. The query is parsed (and the various references are looked up).
2. The execution plan is created, with an optimization phase based on what the query needs to accomplish.
3. The query plan is executed.
As a result of the optimization, the way you write the query often has less effect on the performance than you might think. Lots of people have worked very hard on figuring out the best way to optimize queries -- and there are probably lots of things that you are not even aware of (such as different join algorithms, join ordering, pushing down expression evaluations, and so on).
For your examples, the SQL Server optimizer should produce the same execution plans. The engine is smart enough to realize that these are really doing the same thing.
Note: This is not true of all query engines. Some have pretty poor optimizers, and there would be differences in performance.
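One way to check this for a given engine is to compare both the results and the plans. The sketch below does so with Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN, purely as a stand-in: SQL Server has its own tooling for viewing execution plans, and the tables and rows here are invented. Only the equality of the results is asserted; the plans are printed for inspection and may or may not match in your engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (Id INTEGER PRIMARY KEY, t2Id INTEGER);
    CREATE TABLE t2 (Id INTEGER PRIMARY KEY, Name TEXT);
    INSERT INTO t2 VALUES (1, 'alpha'), (2, 'beta'), (9, 'gamma');
    INSERT INTO t1 VALUES (1, 1), (2, 2), (3, 9), (8, 1);
""")

join_then_filter = """
    SELECT t1.Id, t2.Name
    FROM t1 INNER JOIN t2 ON t1.t2Id = t2.Id
    WHERE t1.Id IN (1, 2, 3, 4, 5) AND t2.Id IN (1, 2, 3, 4, 5, 6, 7)
    ORDER BY t1.Id
"""
filter_then_join = """
    SELECT a.Id, a1.Name
    FROM (SELECT * FROM t1 WHERE t1.Id IN (1, 2, 3, 4, 5)) a
    INNER JOIN (SELECT * FROM t2 WHERE t2.Id IN (1, 2, 3, 4, 5, 6, 7)) a1
        ON a.t2Id = a1.Id
    ORDER BY a.Id
"""

rows_a = conn.execute(join_then_filter).fetchall()
rows_b = conn.execute(filter_then_join).fetchall()
print(rows_a == rows_b)  # True

# Print each plan so the two forms can be compared by eye.
for label, sql in (("join then filter", join_then_filter),
                   ("filter then join", filter_then_join)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("  ", step)
```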

Data Difference Between Where Clause and AND Clause

I tried these queries in our application. Each returned different result sets for me.
Query Set 1
SELECT *
FROM TABLE1 T1
LEFT OUTER JOIN TABLE2 T2 ON (T1.ID = T2.ID
AND T1.STATUS = 'A'
AND T2.STATUS = 'A')
INNER JOIN TABLE3 T3 ON (T2.ID = T3.ID)
WHERE T3.STATUS = 'A'
Query Set 2
SELECT *
FROM TABLE1 T1
LEFT OUTER JOIN TABLE2 T2 ON (T1.ID = T2.ID
AND T2.STATUS = 'A')
INNER JOIN TABLE3 T3 ON (T2.ID = T3.ID)
WHERE T3.STATUS = 'A'
AND T1.STATUS = 'A'
I couldn't find out why each query returns different outputs. Also please guide me about which approach is best when we use multiple joins (left, right, Inner) with Filtering clauses.
Thanks for any help
On the first
SELECT *
FROM TABLE1 T1
LEFT OUTER JOIN TABLE2 T2 ON (T1.ID = T2.ID
AND T1.STATUS = 'A'
AND T2.STATUS = 'A')
INNER JOIN TABLE3 T3 ON (T2.ID = T3.ID)
WHERE T3.STATUS = 'A'
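AND T1.STATUS = 'A' in the ON clause has zero effect on which T1 rows the left join returns.
It is a left join, so you are going to get all of T1, period. The condition only decides whether a T1 row gets a T2 match.
When you move AND T1.STATUS = 'A' to the WHERE clause, it is applied as a filter.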
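A minimal sketch of this effect, using Python's sqlite3 module with invented rows. It leaves out the TABLE3 join from the question in order to isolate the left join, so it is not a reproduction of the asker's full queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, status TEXT);
    CREATE TABLE table2 (id INTEGER, status TEXT);
    INSERT INTO table1 VALUES (1, 'A'), (2, 'B');
    INSERT INTO table2 VALUES (1, 'A'), (2, 'A');
""")

# T1.STATUS in ON: every table1 row is returned; row 2 just loses its match.
status_in_on = conn.execute("""
    SELECT t1.id, t2.id
    FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON (t1.id = t2.id
                                  AND t1.status = 'A'
                                  AND t2.status = 'A')
    ORDER BY t1.id
""").fetchall()

# T1.STATUS in WHERE: table1 rows with another status are removed.
status_in_where = conn.execute("""
    SELECT t1.id, t2.id
    FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON (t1.id = t2.id
                                  AND t2.status = 'A')
    WHERE t1.status = 'A'
    ORDER BY t1.id
""").fetchall()

print(status_in_on)     # [(1, 1), (2, None)]
print(status_in_where)  # [(1, 1)]
```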
The difference in your code is the T1.STATUS = 'A' location.
Query 1:
You join the T1 and T2 tables on all the common IDs and only if both T1.STATUS and T2.STATUS are 'A'.
Query 2:
You join the T1 and T2 tables on all the common IDS and only if T2.STATUS = 'A'
Basically, in query 1 the T1.STATUS condition is evaluated as part of the join.
In query 2 you join the tables first and then filter the already joined rows on T1.STATUS.
Also regarding the joins, left and inner joins return the same results only when every row of the left table has a match. Have a look here and here; I found both links really useful.
Finally, my personal preference is to use inner joins unless I definitely need all the rows from either left or right table. I believe it makes my queries simpler to read and maintain.
I hope that helps.
You are putting a filter on the right table of a left join, creating an inner join. Your first query will return fewer results, whilst your second will have NULLs against non-matching rows on the right table. This link helped me to understand joins much better, as I really struggled for a long time to grasp the concept. HERE
If my answer is not clear please ask me for a revision.

Rewrite SQL code SELECT block to simplify logic

I am trying to rewrite this block with simpler logic if this can be done. I am using it within a larger SELECT statement and I think IF I can simplify this block, I might be able to improve performance of my query.
proj_catg_type_id, proj_catg_id and proj_id are all PKs in their tables.
select t1.proj_catg_name
from table1 t1, table2 t2, table3 t3
where t2.proj_catg_type_id = t1.proj_catg_type_id
and t2.proj_catg_type_id = 213
and t3.proj_id = t2.proj_id
Without knowing the referential integrity rules and the logic behind the tables, it is difficult to give a 100% correct answer. But just by looking at this statement, the most simplified logic would be
select t1.proj_catg_name
from table1 t1
where t1.proj_catg_type_id = 213;
select t1.proj_catg_name
from table1 t1 inner join table2 t2
on t2.proj_catg_type_id=t1.proj_catg_type_id
where t2.proj_catg_type_id=213
and t3.proj_id=t2.proj_i
maybe? is t3 used outside this subselect?
If t3 is a table outside the select you showed, then this is a correlated subquery which you should not be using at all, ever! That turns your query into a row-by-agonizing-row cursor.
Use derived tables or joins to get the results.
You don't give me enough code to write a specific solution for your problem, but let me give you an example:
-- Correlated subquery: logically evaluated once per outer row
SELECT
field1
, field2
, (SELECT t3.field3
FROM table2 t2
JOIN table3 t3 ON t2.id = t3.id
WHERE t4.somefield = t2.somefield)
FROM table1 t1
JOIN table4 t4 ON t1.id = t4.id;
-- Derived table: the same lookup written as a join
SELECT
field1
, field2
, a.field3
FROM table1 t1
JOIN table4 t4
ON t1.id = t4.id
JOIN (SELECT t2.somefield, t3.field3
FROM table2 t2
JOIN table3 t3 ON t2.id = t3.id) a
ON t4.somefield = a.somefield;
The first query runs one row at a time, which is extremely slow. The second should give the same results but runs in a set-based fashion, which is much faster. It is important to make sure the derived table has an alias. You could also use a CTE.
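The sketch below checks the "same results" claim on invented data with Python's sqlite3 module. It assumes every table4 row matches exactly one row of the derived table; without that assumption the two forms can differ, because the scalar subquery yields NULL for a missing match while the inner join drops the row. It compares results only and does not measure speed.

```python
import sqlite3

# Hypothetical rows; table and column names follow the example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, field1 TEXT, field2 TEXT);
    CREATE TABLE table4 (id INTEGER, somefield TEXT);
    CREATE TABLE table2 (id INTEGER, somefield TEXT);
    CREATE TABLE table3 (id INTEGER, field3 TEXT);
    INSERT INTO table1 VALUES (1, 'f1a', 'f2a'), (2, 'f1b', 'f2b');
    INSERT INTO table4 VALUES (1, 's1'), (2, 's2');
    INSERT INTO table2 VALUES (10, 's1'), (20, 's2');
    INSERT INTO table3 VALUES (10, 'x'), (20, 'y');
""")

correlated = conn.execute("""
    SELECT t1.field1, t1.field2,
           (SELECT t3.field3
            FROM table2 t2
            JOIN table3 t3 ON t2.id = t3.id
            WHERE t4.somefield = t2.somefield) AS field3
    FROM table1 t1
    JOIN table4 t4 ON t1.id = t4.id
    ORDER BY t1.id
""").fetchall()

derived = conn.execute("""
    SELECT t1.field1, t1.field2, a.field3
    FROM table1 t1
    JOIN table4 t4 ON t1.id = t4.id
    JOIN (SELECT t2.somefield, t3.field3
          FROM table2 t2
          JOIN table3 t3 ON t2.id = t3.id) a
        ON t4.somefield = a.somefield
    ORDER BY t1.id
""").fetchall()

print(correlated == derived)  # True
print(derived)                # [('f1a', 'f2a', 'x'), ('f1b', 'f2b', 'y')]
```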