I tried these queries in our application. Each returned different result sets for me.
Query Set 1
SELECT *
FROM TABLE1 T1
LEFT OUTER JOIN TABLE2 T2 ON (T1.ID = T2.ID
AND T1.STATUS = 'A'
AND T2.STATUS = 'A')
INNER JOIN TABLE3 T3 ON (T2.ID = T3.ID)
WHERE T3.STATUS = 'A'
Query Set 2
SELECT *
FROM TABLE1 T1
LEFT OUTER JOIN TABLE2 T2 ON (T1.ID = T2.ID
AND T2.STATUS = 'A')
INNER JOIN TABLE3 T3 ON (T2.ID = T3.ID)
WHERE T3.STATUS = 'A'
AND T1.STATUS = 'A'
I couldn't work out why the two queries return different output. Also, please advise which approach is best when combining multiple joins (left, right, inner) with filtering clauses.
Thanks for any help
On the first
SELECT *
FROM TABLE1 T1
LEFT OUTER JOIN TABLE2 T2 ON (T1.ID = T2.ID
AND T1.STATUS = 'A'
AND T2.STATUS = 'A')
INNER JOIN TABLE3 T3 ON (T2.ID = T3.ID)
WHERE T3.STATUS = 'A'
AND T1.STATUS = 'A' in the ON clause does not filter T1
It is a left join - you are going to get all of T1, period; the condition only decides which T1 rows are matched to T2
When you move AND T1.STATUS = 'A' to the WHERE clause, it is applied as a filter
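To make the ON versus WHERE difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration, and TABLE3 is left out to keep the effect visible. Note that in the original queries the INNER JOIN to TABLE3 on T2.ID would additionally drop any row where T2 is NULL.

```python
import sqlite3

# In-memory database with made-up sample rows (assumption, not the asker's data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, status TEXT);
    CREATE TABLE table2 (id INTEGER, status TEXT);
    INSERT INTO table1 VALUES (1, 'A'), (2, 'B'), (3, 'A');
    INSERT INTO table2 VALUES (1, 'A'), (2, 'A');
""")

# T1.STATUS = 'A' inside ON: every T1 row is kept, but only 'A' rows get a T2 match.
on_rows = conn.execute("""
    SELECT t1.id, t1.status, t2.id
    FROM table1 t1
    LEFT OUTER JOIN table2 t2
      ON t1.id = t2.id AND t1.status = 'A' AND t2.status = 'A'
    ORDER BY t1.id
""").fetchall()

# T1.STATUS = 'A' inside WHERE: T1 rows with another status are removed.
where_rows = conn.execute("""
    SELECT t1.id, t1.status, t2.id
    FROM table1 t1
    LEFT OUTER JOIN table2 t2
      ON t1.id = t2.id AND t2.status = 'A'
    WHERE t1.status = 'A'
    ORDER BY t1.id
""").fetchall()

print(on_rows)     # [(1, 'A', 1), (2, 'B', None), (3, 'A', None)]
print(where_rows)  # [(1, 'A', 1), (3, 'A', None)]
```

The row with status 'B' survives in the first result with a NULL for T2 and disappears from the second.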
The difference in your code is the T1.STATUS = 'A' location.
Query 1:
You join the T1 and T2 tables on all the common IDs, and only where both T1.STATUS and T2.STATUS are 'A'.
Query 2:
You join the T1 and T2 tables on all the common IDs, and only where T2.STATUS = 'A'.
Basically, in query 1 the T1.STATUS check is part of the join condition, so it is evaluated while joining.
In query 2 you join the tables first and then filter the already joined rows on T1.STATUS.
Also regarding the joins, left and inner joins return the same results only when every row of the left table has a match. Have a look here and here; I found both links really useful.
Finally, my personal preference is to use inner joins unless I definitely need all the rows from either left or right table. I believe it makes my queries simpler to read and maintain.
I hope that helps.
You are putting a filter on the right table of a left join, which effectively creates an inner join. Your first query will return fewer results, whilst your second will have NULLs against non-matching rows on the right table. This link helped me to understand joins much better, as I really struggled for a long time to grasp the concept: HERE
If my answer is not clear please ask me for a revision.
Related
I am using Hive 2.3.5
Spark version 2.3.3
When I run the following query on Hive, it fails, saying it is trying to scan too many partitions.
select t1.A, t2.B
from t1 left join t2 on t1.x = t2.x
where t1.x = 'abc'
vs when I run this, it works fine:
select t1.A, t2.B
from t1 left join t2 on t1.x = t2.x
where t1.x = 'abc'
and t2.x = 'abc'
Why do I need to pass the explicit filter (t2.x = 'abc') again on table t2 when I am already joining on t1.x = t2.x where t1.x = 'abc'?
A normal join works fine without the additional filter, but a left join does not.
The optimizer cannot always push down predicates because it is not intelligent enough. Most probably the WHERE is being applied after the join, causing too many rows to be scanned.
Predicate pushdown (PPD) probably works fine with INNER JOIN. The EXPLAIN plan may give more information.
But apart from this, there is one more issue.
You are saying that INNER join works ok... Look:
Your two queries are completely different. The first one is a LEFT JOIN: if t2 does not contain rows where t2.x = 'abc', it will still return rows from t1.
The second one behaves differently. It is really an INNER JOIN, because the predicate t2.x = 'abc' in the WHERE clause does not allow NULLs and so filters out records that are not joined with t2. Check it: you are selecting only joined records, which is an INNER JOIN. If t2 does not contain rows where t2.x = 'abc', the second query will not return any rows.
Try adding one more join condition to the ON clause instead of the WHERE; this will behave like a real LEFT JOIN:
select t1.A, t2.B
from t1 left join t2 on t1.x = t2.x and t2.x='abc'
where t1.x = 'abc'
I am not saying that this will resolve the issue with too many partitions scanned, I'm just saying that this will be real left join, not inner, and predicate will be applied to t2 before join.
Another way is using filtering in the subquery before join.
select t1.A, t2.B
from (select * from t1 where t1.x = 'abc') t1
left join (select * from t2 where t2.x = 'abc') t2
on t1.x = t2.x
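The result semantics of the four variants can be checked with a small sketch. It uses Python's sqlite3 as a stand-in (an assumption: SQLite is not Hive, so this says nothing about partition pruning or the plan), with made-up rows where t2 has no 'abc' row.

```python
import sqlite3

# SQLite stands in for Hive here; only the returned rows are compared.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x TEXT, A TEXT);
    CREATE TABLE t2 (x TEXT, B TEXT);
    INSERT INTO t1 VALUES ('abc', 'a1');
    INSERT INTO t2 VALUES ('def', 'b1');  -- no 'abc' row in t2
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Filter on t1 only: a true left join, the t1 row is kept.
left_only = run("""
    select t1.A, t2.B from t1 left join t2 on t1.x = t2.x
    where t1.x = 'abc'
""")

# Extra t2 filter in WHERE: NULL-extended rows are rejected, like an inner join.
where_filter = run("""
    select t1.A, t2.B from t1 left join t2 on t1.x = t2.x
    where t1.x = 'abc' and t2.x = 'abc'
""")

# Extra t2 filter in ON: still a left join.
on_filter = run("""
    select t1.A, t2.B from t1 left join t2 on t1.x = t2.x and t2.x = 'abc'
    where t1.x = 'abc'
""")

# Filtering in subqueries before the join: also still a left join.
subquery_filter = run("""
    select t1.A, t2.B
    from (select * from t1 where t1.x = 'abc') t1
    left join (select * from t2 where t2.x = 'abc') t2
    on t1.x = t2.x
""")

print(left_only)        # [('a1', None)]
print(where_filter)     # []
print(on_filter)        # [('a1', None)]
print(subquery_filter)  # [('a1', None)]
```

Only the variant with t2.x = 'abc' in the WHERE clause loses the unmatched t1 row.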
I am currently trying to join a few tables together (and maybe 2 more if possible), but with how my query is written right now, I can't even see the results with 3 tables.
select t1.x,
t1.y,
t1.z,
t4.a,
t4.b,
t4.c,
t4.d
from t1
left join t2 on t1.id=t2.id
left join t3 on t2.id=t3.id
left join t4 on t1.id2=t4.id
where t1.date between 'x' and 'x'
and t1.city not in ('x')
and t3.column = x;
Is there a way to optimize this code to run faster and perhaps make it able to add more tables to it?
Thank you in advance!
Your query has some logic issues; fixing them might help with the speed.
t2 is joined to t1 when they have the same id value.
t3 is then pulled in if, and only if, there was a matching row in t2 and t3 has the same id value as t1 and t2.
Finally, in your where clause, the t3.column has to be x else it's filtered.
This means a row in t3 has to exist. Every t1 record that doesn't have a t2 record and a t3 record will be filtered out with that where. Thus you don't need a left join, you need an INNER join.
select t1.x,
t1.y,
t1.z,
t4.a,
t4.b,
t4.c,
t4.d
from t1
inner join t2 on t1.id=t2.id
inner join t3 on t2.id=t3.id
left join t4 on t1.id2=t4.id
where t1.date between 'x' and 'x'
and t1.city not in ('x')
and t3.column = x;
In some DBMSs you can move the t3.column condition into the join's ON clause, which can help filter out rows earlier in the plan.
select t1.x,
t1.y,
t1.z,
t4.a,
t4.b,
t4.c,
t4.d
from t1
inner join t2 on t1.id=t2.id
inner join t3 on t2.id=t3.id and t3.column = x
left join t4 on t1.id2=t4.id
where t1.date between 'x' and 'x'
and t1.city not in ('x');
My final advice is to take a close look at t2 to see if you really need it. Ask yourself: is there a reason a row has to exist in t2 in order for me to get the right results? If not, then because t1.id = t2.id and t2.id = t3.id imply t1.id = t3.id, you can join t3 directly to t1 and eliminate the t2 table completely.
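The claim that the WHERE filter on t3 makes the left joins behave as inner joins can be checked on toy data. This is a sketch with Python's sqlite3 under a few assumptions: the rows are made up, the date and city filters are omitted, and the placeholder column `t3.column` is renamed to the hypothetical `col` because COLUMN is a keyword in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, id2 INTEGER, x TEXT);
    CREATE TABLE t2 (id INTEGER);
    CREATE TABLE t3 (id INTEGER, col TEXT);   -- 'col' is a hypothetical name
    CREATE TABLE t4 (id INTEGER, a TEXT);
    INSERT INTO t1 VALUES (1, 10, 'p'), (2, 20, 'q'), (3, 30, 'r');
    INSERT INTO t2 VALUES (1), (2);
    INSERT INTO t3 VALUES (1, 'x'), (2, 'y');
    INSERT INTO t4 VALUES (10, 'a10');
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Original shape: left joins everywhere, filter on t3 in WHERE.
all_left = run("""
    select t1.x, t4.a from t1
    left join t2 on t1.id = t2.id
    left join t3 on t2.id = t3.id
    left join t4 on t1.id2 = t4.id
    where t3.col = 'x'
""")

# Inner joins for t2 and t3, filter still in WHERE.
inner_where = run("""
    select t1.x, t4.a from t1
    inner join t2 on t1.id = t2.id
    inner join t3 on t2.id = t3.id
    left join t4 on t1.id2 = t4.id
    where t3.col = 'x'
""")

# Inner joins with the t3 filter moved into the ON clause.
inner_on = run("""
    select t1.x, t4.a from t1
    inner join t2 on t1.id = t2.id
    inner join t3 on t2.id = t3.id and t3.col = 'x'
    left join t4 on t1.id2 = t4.id
""")

print(all_left, inner_where, inner_on)  # each is [('p', 'a10')]
```

All three forms return the same single row here, because any row without a matching t3 record fails the t3 filter anyway.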
A bit of a novice question: I am running a query with left joins and wanted to know whether where you specify a filter makes a difference in terms of performance. In the example below, the first query filters straight after the first join, and the second does all the joins and then filters:
Select t1.*,t2.* from t1 t1
left join t2 t2
on t1.key = t2.key
and t1.date < today
left join t3 t3
on t2.key2 = t3.key
vs
Select t1.*,t2.* from t1 t1
left join t2 t2
on t1.key = t2.key
left join t3 t3
on t2.key2 = t3.key
and t1.date < today
Learn what LEFT JOIN ON returns: INNER JOIN ON rows UNION ALL unmatched left table rows extended by NULLs. Always know what INNER JOIN you want as part of an OUTER JOIN.
In general your queries have different inner join & null-extended rows for the 1st left joins & then further differences due to more joining. Unless certain constraints hold, the 2 queries return different functions of their inputs. So comparing their performance seems moot.
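A quick way to see that the two queries are not interchangeable is to run both on a few rows. The sketch below uses Python's sqlite3 with made-up data. The column names `k`, `k2` and `d` and the literal cutoff date are hypothetical stand-ins for `key`, `key2`, `date` and `today`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (k INTEGER, d TEXT);
    CREATE TABLE t2 (k INTEGER, k2 INTEGER);
    CREATE TABLE t3 (k INTEGER);
    INSERT INTO t1 VALUES (1, '2023-06-01'), (2, '2024-06-01');
    INSERT INTO t2 VALUES (1, 100), (2, 200);
    INSERT INTO t3 VALUES (100), (200);
""")

# Date condition in the first join: a late t1 row gets no t2 (and so no t3).
filter_in_first_join = conn.execute("""
    SELECT t1.k, t2.k, t3.k FROM t1
    LEFT JOIN t2 ON t1.k = t2.k AND t1.d < '2024-01-01'
    LEFT JOIN t3 ON t2.k2 = t3.k
    ORDER BY t1.k
""").fetchall()

# Date condition in the second join: a late t1 row keeps t2 but loses t3.
filter_in_second_join = conn.execute("""
    SELECT t1.k, t2.k, t3.k FROM t1
    LEFT JOIN t2 ON t1.k = t2.k
    LEFT JOIN t3 ON t2.k2 = t3.k AND t1.d < '2024-01-01'
    ORDER BY t1.k
""").fetchall()

print(filter_in_first_join)   # [(1, 1, 100), (2, None, None)]
print(filter_in_second_join)  # [(1, 1, 100), (2, 2, None)]
```

The second t1 row comes back with different t2 columns in each query, so the two are different queries rather than two ways of writing the same one.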
I am just trying to understand the concept behind joining of 2 tables with an OR condition.
My requirement is: I need to join 2 tables, Table1 [colA, colB] and Table2 [colX, colY], on Table1.colA = Table2.colX, but if colA is NULL the condition should be Table1.colB = Table2.colY.
Do I need to join them separately and then do a union? Or is there a way I can do it in one join? Note that I have millions of records in both tables, it's a left join, and the tables reside in Hive. I don't have a reproducible example; I am just trying to understand the concept.
While I'm not familiar with HiveQL, in SQL Server this could be accomplished as follows:
SELECT *
FROM table1 t1
JOIN table2 t2
ON COALESCE(t1.cola, t1.colb) = CASE
WHEN t1.cola IS NULL THEN t2.coly
ELSE t2.colx
END
The logic should be fairly readable.
Translate your conditions directly:
SELECT *
FROM table1 t1 JOIN
table2 t2
ON (t1.cola = t2.colx) or
(t1.cola is null and t1.colb = t2.coly)
Usually, OR is a performance killer in joins. This would often be expressed using two separate left joins:
SELECT . . . , COALESCE(t2a.col, t2b.col) as col
FROM table1 t1 LEFT JOIN
table2 t2a
ON (t1.cola = t2a.colx) LEFT JOIN
table2 t2b
ON t1.cola is null and t1.colb = t2b.coly;
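As a sanity check, the OR form and the two-left-joins form can be compared on a handful of rows. This sketch uses Python's sqlite3 rather than Hive, and it makes several assumptions: the rows are made up, Table2 gets a hypothetical payload column `val`, each copy of table2 is referenced through its own alias (t2a, t2b), and the OR form is written as a LEFT JOIN to match the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (cola TEXT, colb TEXT);
    CREATE TABLE table2 (colx TEXT, coly TEXT, val TEXT);  -- 'val' is hypothetical
    INSERT INTO table1 VALUES ('a1', 'b1'), (NULL, 'b2'), ('a3', 'b3');
    INSERT INTO table2 VALUES ('a1', 'y1', 'v1'), ('x2', 'b2', 'v2');
""")

# Single left join with an OR condition.
or_rows = conn.execute("""
    SELECT t1.cola, t1.colb, t2.val
    FROM table1 t1
    LEFT JOIN table2 t2
      ON (t1.cola = t2.colx)
      OR (t1.cola IS NULL AND t1.colb = t2.coly)
    ORDER BY t1.colb
""").fetchall()

# Two left joins, one per branch, merged with COALESCE.
two_join_rows = conn.execute("""
    SELECT t1.cola, t1.colb, COALESCE(t2a.val, t2b.val) AS val
    FROM table1 t1
    LEFT JOIN table2 t2a ON t1.cola = t2a.colx
    LEFT JOIN table2 t2b ON t1.cola IS NULL AND t1.colb = t2b.coly
    ORDER BY t1.colb
""").fetchall()

print(or_rows)
# [('a1', 'b1', 'v1'), (None, 'b2', 'v2'), ('a3', 'b3', None)]
print(two_join_rows == or_rows)  # True
```

Whether the two-join form is actually faster on millions of rows depends on the engine and the data, so it is worth comparing the plans on the real tables.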
SELECT .... FROM TABLE1 T1, TABLE2 T2, TABLE3 T3
WHERE T1.NAME = 'ABC' AND T1.ID = T2.COL_ID AND T2.COL1 = T3.COL2
vs
SELECT .... FROM TABLE1 T1
WHERE T1.NAME = 'ABC'
INNER JOIN TABLE2 T2 ON T1.ID = T2.COL_ID
INNER JOIN TABLE3 T3 ON T2.COL1 = T3.COL2
Two questions
In terms of performance, which will perform better and why?
If Option 2 has the better performance, when should we use Option 1? (And vice versa if Option 1 has the better performance.)
The second query is not correct. It should be:
SELECT .... FROM TABLE1 T1
INNER JOIN TABLE2 T2 ON T1.ID = T2.COL_ID
INNER JOIN TABLE3 T3 ON T2.COL1 = T3.COL2
WHERE T1.NAME = 'ABC'
This is the right way to write your join condition. The 1st one is accepted, but technically it creates a Cartesian product that is then filtered. All modern databases deal perfectly well with both the 1st and 2nd queries and interpret them the same way; therefore, performance should be the same. But you should still use the second one, because it is more readable and gives you a single way to write a join, whether it is an inner, left or full outer join.
The answer is easy: Don't use comma-separated joins (first query). We used these in the 1980s for the lack of something better, but then in 1992 the new syntax (second query) was introduced [1], because the old syntax was error-prone (it was easier to forget to apply join criteria) and harder to maintain (was missing join criteria intended or not in a query?) and there was no standard syntax for outer joins.
[1] Oracle was a little late in adopting the new syntax, though. They introduced the new ANSI joins in Oracle 9i in 2001.
In terms of performance: There should be no difference in speed, because DBMS optimizers see that this is essentially the same query.
Your second query is syntactically incorrect by the way. The query's WHERE clause belongs after the complete FROM clause, i.e. after all the joins:
SELECT ....
FROM table1 t1
INNER JOIN table2 t2 ON t1.id = t2.col_id
INNER JOIN table3 t3 ON t2.col1 = t3.col2
WHERE t1.name = 'ABC';
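To see the points above in practice, here is a small sketch with Python's sqlite3 and made-up rows. It compares only the returned rows of the comma-separated and explicit-join forms, not their performance, and it also shows that SQLite rejects a WHERE clause placed before the joins. Other DBMSs may word the error differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT);
    CREATE TABLE table2 (col_id INTEGER, col1 INTEGER);
    CREATE TABLE table3 (col2 INTEGER);
    INSERT INTO table1 VALUES (1, 'ABC'), (2, 'XYZ');
    INSERT INTO table2 VALUES (1, 10), (2, 20);
    INSERT INTO table3 VALUES (10), (20);
""")

# Old comma-separated syntax: join criteria live in WHERE.
comma_rows = conn.execute("""
    SELECT t1.id, t2.col1, t3.col2
    FROM table1 t1, table2 t2, table3 t3
    WHERE t1.name = 'ABC' AND t1.id = t2.col_id AND t2.col1 = t3.col2
""").fetchall()

# Explicit join syntax: join criteria in ON, filter in WHERE after all joins.
join_rows = conn.execute("""
    SELECT t1.id, t2.col1, t3.col2
    FROM table1 t1
    INNER JOIN table2 t2 ON t1.id = t2.col_id
    INNER JOIN table3 t3 ON t2.col1 = t3.col2
    WHERE t1.name = 'ABC'
""").fetchall()

# WHERE placed before the joins is a syntax error.
try:
    conn.execute("""
        SELECT t1.id FROM table1 t1
        WHERE t1.name = 'ABC'
        INNER JOIN table2 t2 ON t1.id = t2.col_id
    """)
    misplaced_where_rejected = False
except sqlite3.OperationalError:
    misplaced_where_rejected = True

print(comma_rows)                # [(1, 10, 10)]
print(join_rows == comma_rows)   # True
print(misplaced_where_rejected)  # True
```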