How does SQL join work? - sql

I am trying to understand how joins work internally. What is the difference in the way the following two queries would run?
For example
(A)
Select *
FROM TABLE1
FULL JOIN TABLE2 ON TABLE1.ID = TABLE2.ID
FULL JOIN TABLE3 ON TABLE1.ID = TABLE3.ID
And
(B)
Select *
FROM TABLE1
FULL JOIN TABLE2 ON TABLE1.ID = TABLE2.ID
FULL JOIN TABLE3 ON TABLE2.ID = TABLE3.ID
Edit: I am talking about oracle here.
Consider some records that are present in table 2 and table 3 but not in table 1: query A would give two rows for such a record, but B would give only one row.

Your DBMS's optimiser will determine how best to perform the query. Usually this is done by "cost based optimisation", where a number of different query plans are considered and the most efficient one selected. If your two queries are logically identical, it is most likely that the optimiser will end up using the same query plan whichever way you write it. In fact, it would be a poor optimiser these days that produced different query plans based on such minor differences in the SQL.
However, full outer joins are a different matter (in Oracle at least), since the columns used in the join condition influence the result, i.e. the two queries are not interchangeable.
You can use AUTOTRACE in SQL Plus to see the different plans:
SQL> select *
2 from t1
3 full join t2 on t2.id = t1.id
4 full join t3 on t3.id = t2.id;
        ID         ID         ID
---------- ---------- ----------
                    1          1
1 row selected.
Execution Plan
----------------------------------------------------------
---------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 117 | 29 (11)|
| 1 | VIEW | | 3 | 117 | 29 (11)|
| 2 | UNION-ALL | | | | |
|* 3 | HASH JOIN OUTER | | 2 | 142 | 15 (14)|
| 4 | VIEW | | 2 | 90 | 11 (10)|
| 5 | UNION-ALL | | | | |
|* 6 | HASH JOIN OUTER | | 1 | 91 | 6 (17)|
| 7 | TABLE ACCESS FULL| T1 | 1 | 52 | 2 (0)|
| 8 | TABLE ACCESS FULL| T2 | 1 | 39 | 3 (0)|
|* 9 | HASH JOIN ANTI | | 1 | 26 | 6 (17)|
| 10 | TABLE ACCESS FULL| T2 | 1 | 13 | 3 (0)|
| 11 | TABLE ACCESS FULL| T1 | 1 | 13 | 2 (0)|
| 12 | TABLE ACCESS FULL | T3 | 1 | 26 | 3 (0)|
|* 13 | HASH JOIN ANTI | | 1 | 26 | 15 (14)|
| 14 | TABLE ACCESS FULL | T3 | 1 | 13 | 3 (0)|
| 15 | VIEW | | 2 | 26 | 11 (10)|
| 16 | UNION-ALL | | | | |
|* 17 | HASH JOIN OUTER | | 1 | 39 | 6 (17)|
| 18 | TABLE ACCESS FULL| T1 | 1 | 26 | 2 (0)|
| 19 | TABLE ACCESS FULL| T2 | 1 | 13 | 3 (0)|
|* 20 | HASH JOIN ANTI | | 1 | 26 | 6 (17)|
| 21 | TABLE ACCESS FULL| T2 | 1 | 13 | 3 (0)|
| 22 | TABLE ACCESS FULL| T1 | 1 | 13 | 2 (0)|
---------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("T3"."ID"(+)="T2"."ID")
6 - access("T2"."ID"(+)="T1"."ID")
9 - access("T2"."ID"="T1"."ID")
13 - access("T3"."ID"="T2"."ID")
17 - access("T2"."ID"(+)="T1"."ID")
20 - access("T2"."ID"="T1"."ID")
SQL> select *
2 from t1
3 full join t2 on t2.id = t1.id
4 full join t3 on t3.id = t1.id;
        ID         ID         ID
---------- ---------- ----------
                    1
                               1
2 rows selected.
Execution Plan
----------------------------------------------------------
---------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 117 | 29 (11)|
| 1 | VIEW | | 3 | 117 | 29 (11)|
| 2 | UNION-ALL | | | | |
|* 3 | HASH JOIN OUTER | | 2 | 142 | 15 (14)|
| 4 | VIEW | | 2 | 90 | 11 (10)|
| 5 | UNION-ALL | | | | |
|* 6 | HASH JOIN OUTER | | 1 | 91 | 6 (17)|
| 7 | TABLE ACCESS FULL| T1 | 1 | 52 | 2 (0)|
| 8 | TABLE ACCESS FULL| T2 | 1 | 39 | 3 (0)|
|* 9 | HASH JOIN ANTI | | 1 | 26 | 6 (17)|
| 10 | TABLE ACCESS FULL| T2 | 1 | 13 | 3 (0)|
| 11 | TABLE ACCESS FULL| T1 | 1 | 13 | 2 (0)|
| 12 | TABLE ACCESS FULL | T3 | 1 | 26 | 3 (0)|
|* 13 | HASH JOIN ANTI | | 1 | 26 | 15 (14)|
| 14 | TABLE ACCESS FULL | T3 | 1 | 13 | 3 (0)|
| 15 | VIEW | | 2 | 26 | 11 (10)|
| 16 | UNION-ALL | | | | |
|* 17 | HASH JOIN OUTER | | 1 | 39 | 6 (17)|
| 18 | TABLE ACCESS FULL| T1 | 1 | 26 | 2 (0)|
| 19 | TABLE ACCESS FULL| T2 | 1 | 13 | 3 (0)|
|* 20 | HASH JOIN ANTI | | 1 | 26 | 6 (17)|
| 21 | TABLE ACCESS FULL| T2 | 1 | 13 | 3 (0)|
| 22 | TABLE ACCESS FULL| T1 | 1 | 13 | 2 (0)|
---------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("T3"."ID"(+)="T1"."ID")
6 - access("T2"."ID"(+)="T1"."ID")
9 - access("T2"."ID"="T1"."ID")
13 - access("T3"."ID"="T1"."ID")
17 - access("T2"."ID"(+)="T1"."ID")
20 - access("T2"."ID"="T1"."ID")
In fact, the query plans are identical except for the predicate information: only the join columns in steps 3 and 13 differ.

Use EXPLAIN PLAN:
http://download.oracle.com/docs/cd/B10500_01/server.920/a96533/ex_plan.htm

You stated interest in "internals", and then asked an example that illustrates "semantics". I'm answering semantics.
Consider these tables.
Table1 : 1, 4, 6
Table2 : 2, 4, 5
Table3 : 3, 5, 6
Both examples perform the same join first, so I'll perform that here.
FirstResult = T1 FULL JOIN T2 : (T1, T2)
(1, null)
(4, 4)
(6, null)
(null, 2)
(null, 5)
Example (A)
FirstResult FULL JOIN T3 ON FirstItem : (T1, T2, T3)
(1, null, null)
(4, 4, null)
(6, null, 6) <----
(null, 2, null)
(null, 5, null) <----
(null, null, 3)
(null, null, 5) <----
Example (B)
FirstResult FULL JOIN T3 ON SecondItem : (T1, T2, T3)
(1, null, null)
(4, 4, null)
(6, null, null) <----
(null, 2, null)
(null, 5, 5) <----
(null, null, 3)
(null, null, 6) <----
This shows you logically how to produce the results from the joins.
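To check the walkthrough above by hand, here is a minimal Python sketch of a full outer join on single-column tables. The helper full_join and its key_index parameter are made up for illustration (None stands for NULL); it only reproduces the logical result and says nothing about how Oracle actually executes the query.

```python
def full_join(left_rows, right_values, key_index):
    """FULL OUTER JOIN of tuple rows with a single-column table.

    left_rows:    list of tuples, None stands for NULL
    right_values: ids of the table being joined in
    key_index:    which column of the left row the ON clause compares
    """
    result = []
    matched_right = set()
    for row in left_rows:
        key = row[key_index]
        # NULL never equals anything, so a None key matches nothing
        matches = [v for v in right_values if key is not None and v == key]
        if matches:
            for v in matches:
                result.append(row + (v,))
                matched_right.add(v)
        else:
            # keep the left row, pad the right side with NULL
            result.append(row + (None,))
    width = len(left_rows[0]) if left_rows else 0
    for v in right_values:
        if v not in matched_right:
            # keep the unmatched right row, pad the left side with NULLs
            result.append((None,) * width + (v,))
    return result


table1 = [1, 4, 6]
table2 = [2, 4, 5]
table3 = [3, 5, 6]

first_result = full_join([(v,) for v in table1], table2, key_index=0)
query_a = full_join(first_result, table3, key_index=0)  # ON TABLE1.ID = TABLE3.ID
query_b = full_join(first_result, table3, key_index=1)  # ON TABLE2.ID = TABLE3.ID

for name, rows in (("A", query_a), ("B", query_b)):
    print(name)
    for row in rows:
        print("   ", row)
```

Each result has seven rows, because a full join also keeps the Table3 values that found no partner. The rows that differ between A and B are the ones involving 5 and 6.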
For "internals", there's something called a query optimizer, which will produce these same results - but it will make implementation choices to do the computation/io fast. These choices include:
which tables to access first
look into a table using an index or table scan
which join implementation type to use (nested loop, merge, hash).
Also note: due to the optimizer making these choices, and changing these choices based on what it considers to be optimal - the order of the results can change. The default ordering of results is always "what is easiest". If you don't want the default ordering, you need to specify ordering in your query.
To see exactly what the optimizer will do with a query (at that moment, because it can change its mind), you need to view the execution plan.
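As a rough illustration of why the join implementation affects row order but not the rows themselves, here is a toy Python sketch of two of the strategies named above. The function names and the sample ids are invented for this example; real implementations are far more involved.

```python
def nested_loop_join(left, right):
    """Inner equi-join: compare every left row with every right row."""
    return [(l, r) for l in left for r in right if l == r]


def hash_join(build, probe):
    """Inner equi-join: build a hash table on one input, probe it with the other."""
    buckets = {}
    for b in build:
        buckets.setdefault(b, []).append(b)
    # output follows the order of the probe input, not the build input
    return [(b, p) for p in probe for b in buckets.get(p, [])]


left = [6, 4, 1]
right = [4, 6, 5]

nl_rows = nested_loop_join(left, right)
hj_rows = hash_join(left, right)

print(nl_rows)
print(hj_rows)
print(sorted(nl_rows) == sorted(hj_rows))
```

Both functions return the same pairs, but in a different order, which is the reason an ORDER BY is needed whenever the order matters.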

With query A, what you get includes entries in table1 with a corresponding entry in table3 but without corresponding entries in table2 (nulls for the t2 columns).
With query B you don't get those combined rows, because you only reach table3 through table2. If you don't have a corresponding entry in table2, table2.id will be null and will never match a table3.id.

sql from clause tables

I have the following query; its FROM clause left joins ga and then lists several other tables.
Should we use the LEFT JOIN keyword for all the tables after ga, or can we leave them as they are in the query? Are there any performance issues with this query?
query:
from
a@db_link st left join (Select a,b,c,d
from b@db_link where id = 'AD' and num = 4) ga
on st.compensationdate = ga.compensationdate
and st.salestransactionseq = ga.salestransactionseq ,
b@db_link ta,
c@db_link cr,
d@db_link crd_typ,
e@db_link evt_typ,
f@db_link disputes
where st.salestransactionseq = ta.salestransactionseq
and st.id = 'AD'
This is the query plan:
Plan hash value: 3767304471
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | Inst |
------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT REMOTE | | 1 | 661 | 342 (1)| 00:00:01 | | | |
| 1 | NESTED LOOPS | | 1 | 661 | 342 (1)| 00:00:01 | | | |
| 2 | NESTED LOOPS | | 1 | 661 | 342 (1)| 00:00:01 | | | |
| 3 | NESTED LOOPS | | 1 | 612 | 342 (1)| 00:00:01 | | | |
| 4 | NESTED LOOPS | | 1 | 564 | 342 (1)| 00:00:01 | | | |
|* 5 | HASH JOIN | | 1 | 549 | 342 (1)| 00:00:01 | | | |
| 6 | NESTED LOOPS | | 1 | 503 | 0 (0)| 00:00:01 | | | |
| 7 | NESTED LOOPS | | 1 | 503 | 0 (0)| 00:00:01 | | | |
| 8 | NESTED LOOPS | | 1 | 450 | 0 (0)| 00:00:01 | | | |
| 9 | NESTED LOOPS | | 1 | 407 | 0 (0)| 00:00:01 | | | |
| 10 | PARTITION RANGE SINGLE | | 1 | 217 | 0 (0)| 00:00:01 | 1357 | 1357 | |
|* 11 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| CS_SALESTRANSACTION | 1 | 217 | 0 (0)| 00:00:01 | 1357 | 1357 | PRD121 |
|* 12 | INDEX RANGE SCAN | CS_SALESTRANSACTION_PK | 1 | | 0 (0)| 00:00:01 | 1357 | 1357 | PRD121 |
| 13 | PARTITION RANGE SINGLE | | 1 | 190 | 0 (0)| 00:00:01 | 1356 | 1356 | |
| 14 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| CS_TRANSACTIONASSIGNMENT | 1 | 190 | 0 (0)| 00:00:01 | 1356 | 1356 | PRD121 |
|* 15 | INDEX RANGE SCAN | CS_TRANSACTIONASSIGNMENT_PK | 1 | | 0 (0)| 00:00:01 | 1356 | 1356 | PRD121 |
|* 16 | TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED | CS_GASALESTRANSACTION | 1 | 43 | 0 (0)| 00:00:01 | ROWID | ROWID | PRD121 |
|* 17 | INDEX RANGE SCAN | GASALESTRANSACTION_IDX | 3 | | 0 (0)| 00:00:01 | | | PRD121 |
| 18 | PARTITION RANGE SINGLE | | 1 | | 2 (0)| 00:00:01 | 8 | 8 | |
| 19 | PARTITION LIST ALL | | 1 | | 2 (0)| 00:00:01 | 1 | 268 | |
|* 20 | INDEX RANGE SCAN | OD_CREDIT_UTVALUE | 1 | | 2 (0)| 00:00:01 | 1347 | 1614 | PRD121 |
|* 21 | TABLE ACCESS BY LOCAL INDEX ROWID | CS_CREDIT | 1 | 53 | 3 (0)| 00:00:01 | 1 | 1 | PRD121 |
| 22 | TABLE ACCESS FULL | ADTV_FRS_DISPUTES | 27011 | 1213K| 341 (0)| 00:00:01 | | | PRD121 |
|* 23 | TABLE ACCESS BY INDEX ROWID | ADTV_FRS_CONTROL | 1 | 15 | 1 (0)| 00:00:01 | | | PRD121 |
|* 24 | INDEX UNIQUE SCAN | ADTV_FRS_CONTROL_PK | 1 | | 0 (0)| 00:00:01 | | | PRD121 |
| 25 | PARTITION LIST SINGLE | | 1 | 48 | 1 (0)| 00:00:01 | 2 | 2 | |
|* 26 | TABLE ACCESS BY LOCAL INDEX ROWID | CS_EVENTTYPE | 1 | 48 | 1 (0)| 00:00:01 | 2 | 2 | PRD121 |
|* 27 | INDEX UNIQUE SCAN | CS_EVENTTYPE_PK | 1 | | 0 (0)| 00:00:01 | 2 | 2 | PRD121 |
| 28 | PARTITION LIST SINGLE | | 1 | | 0 (0)| 00:00:01 | 2 | 2 | |
|* 29 | INDEX UNIQUE SCAN | CS_CREDITTYPE_PK | 1 | | 0 (0)| 00:00:01 | 2 | 2 | PRD121 |
|* 30 | TABLE ACCESS BY LOCAL INDEX ROWID | CS_CREDITTYPE | 1 | 49 | 1 (0)| 00:00:01 | 2 | 2 | PRD121 |
------------------------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("GENERICATTRIBUTE13"="A1"."ACTIVITY_ID" AND "LINENUMBER"="A1"."ITEM_ID")
11 - filter("SUBLINENUMBER"<>2)
12 - access("TENANTID"='ADTV' AND "PROCESSINGUNITSEQ"=3.82805968326498E16)
15 - access("TENANTID"='ADTV' AND "PROCESSINGUNITSEQ"=3.82805968326498E16 AND "COMPENSATIONDATE"="COMPENSATIONDATE" AND
"SALESTRANSACTIONSEQ"="SALESTRANSACTIONSEQ")
16 - filter("COMPENSATIONDATE"="COMPENSATIONDATE" AND "PAGENUMBER"=4)
17 - access("SALESTRANSACTIONSEQ"="SALESTRANSACTIONSEQ" AND "TENANTID"='ADTV')
20 - access("TENANTID"='ADTV' AND "PROCESSINGUNITSEQ"=3.82805968326498E16)
21 - filter("SALESTRANSACTIONSEQ"="SALESTRANSACTIONSEQ" AND "SALESORDERSEQ"="SALESORDERSEQ")
23 - filter(UPPER("A8"."STATUS")='NEW')
24 - access("A1"."CASE_NO"="A8"."CASE_NO")
26 - filter("EVENTTYPEID"='PROTECTIONPLAN CHARGEBACK' OR "EVENTTYPEID"='PROTECTIONPLAN CHARGEBACK-FRS' OR "EVENTTYPEID"='PROTECTIONPLAN
INCENTIVE' OR "EVENTTYPEID"='PROTECTIONPLAN INCENTIVE-FRS' OR "EVENTTYPEID"='PROTECTIONPLAN KICKER' OR "EVENTTYPEID"='PROTECTIONPLAN KICKER-FRS' OR
"EVENTTYPEID"='UNIVERSAL BILLER' OR "EVENTTYPEID"='UNIVERSAL BILLER-FRS' OR "EVENTTYPEID"='WORK ORDER' OR "EVENTTYPEID"='WORK ORDER-FRS')
27 - access("TENANTID"='ADTV' AND "EVENTTYPESEQ"="DATATYPESEQ" AND "REMOVEDATE"=TO_DATE(' 2200-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
29 - access("TENANTID"='ADTV' AND "CREDITTYPESEQ"="DATATYPESEQ" AND "REMOVEDATE"=TO_DATE(' 2200-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
30 - filter("CREDITTYPEID"="A1"."CREDIT_TYPE" OR "CREDITTYPEID" LIKE "A1"."CREDIT_TYPE"||'%FRS')
Note
-----
- fully remote statement
- dynamic statistics used: dynamic sampling (level=7)
should we use left join keyword for all other tables after ga table or we can use as it is in the query. Is there any performance issues with this query?
LEFT OUTER JOIN, to give it its full name, is a two part thing
OUTER JOIN is a special case where "if the join fails, permit the solid side table/resultset to exist in the output and fill the partial side with NULLs"
The LEFT is a direction to the database as to which side shall be considered "solid". All the rows from the solid side are present at least once.
Absent any parentheses or sub queries driving execution direction:
SELECT *
FROM
a
LEFT OUTER JOIN b ON ...
a is the left; a is thus the solid side. All rows from a will be present. Rows from b may be present or null if the join predicates matched no rows
Once this is done this whole resultset of "a and b, nulls, warts and all" will become "the left side" for subsequent joins
SELECT *
FROM
a
LEFT OUTER JOIN b
some_kind_of JOIN c
Is effectively the same as:
SELECT *
FROM
(
SELECT *
FROM
a
LEFT OUTER JOIN b
) newLeft
some_kind_of JOIN c
Remember, the OUTER specifier permits the join to fail and still keeps the declared solid side rows
Whether you can use INNER or LEFT/RIGHT OUTER to join c in depends on what you're joining it to
If you're joining it to, say, a column from a then it could be fine to use INNER or OUTER - you'd use whatever you'd use if b wasn't even in the picture.
Will the join from a to c fail sometimes and you still want the rows from a? Use an OUTER.
Will it never fail, or do you not want any rows that do fail? Use an INNER.
However, if you're joining it to a column that was provided by b then you are probably going to want to use some OUTER join, otherwise there will have been no point making the query do a left outer join b: the columns from b will definitely have a NULL where the join failed, and those are the rows you wanted to keep. If you then INNER JOIN c to some column from b that is NULL because the join failed, then the row will disappear from the output. Nothing is ever equal to NULL, so the INNER JOIN to c on the NULL in the column from b fails. In effect the INNER JOIN undoes all the good work the OUTER join to b did in keeping a's data.
Doing
a
LEFT JOIN b ON a.b_id = b.id
LEFT JOIN c ON b.c_id = c.id
allows those rows from a-join-b where b.c_id is null (because the join failed) to stay in the output (because it's an outer join to c, not an inner one).
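The difference is easy to see on a tiny data set. The sketch below uses Python's built-in sqlite3 module rather than Oracle, and the rows in a, b and c are invented for the example; the join semantics shown here are standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, b_id INTEGER);
    CREATE TABLE b (id INTEGER, c_id INTEGER);
    CREATE TABLE c (id INTEGER);
    INSERT INTO a VALUES (1, 10), (2, 20), (3, NULL);
    INSERT INTO b VALUES (10, 100), (20, NULL);
    INSERT INTO c VALUES (100);
""")

# outer join all the way: every row of a survives
left_then_left = conn.execute("""
    SELECT a.id, b.id, c.id
    FROM a
    LEFT JOIN b ON a.b_id = b.id
    LEFT JOIN c ON b.c_id = c.id
    ORDER BY a.id
""").fetchall()

# inner join on a column from b: rows where b.c_id is NULL disappear
left_then_inner = conn.execute("""
    SELECT a.id, b.id, c.id
    FROM a
    LEFT JOIN b ON a.b_id = b.id
    INNER JOIN c ON b.c_id = c.id
    ORDER BY a.id
""").fetchall()

print(left_then_left)
print(left_then_inner)
```

With the second LEFT JOIN all three rows of a are returned; with the INNER JOIN only the row that matched all the way through to c is left.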
Generally we inner join everything we can, then switch to left joining everything else because it makes the queries easier to follow. In that "if c is being inner joined to a" scenario we would perhaps:
a
INNER JOIN c on a.c_id = c.id
LEFT JOIN b on a.b_id = b.id
Rather than:
a
LEFT JOIN b on a.b_id = b.id
INNER JOIN c on a.c_id = c.id
If a table is being joined to a table that was left joined, left join it too. Avoid RIGHT join because it goes against the evaluation direction of SQL and makes things harder to reason about; any time you think about using a right join, turn it around and rewrite it as a left.
Don't forget to use sub queries too. If you want every a joined to b which is joined to c only if both b and c sides match, it's probably clearest to:
a
LEFT JOIN (
SELECT * FROM b INNER JOIN c ON b.c_id = c.id
) b_and_c
Try to see your SQL as developing some growing-wider-with-every-join resultset that, at every join, becomes the new left side
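Here is a small sketch of that subquery form, again with Python's sqlite3 module and invented rows. The ON clause that attaches b_and_c to a is my assumption, since the fragment above leaves it out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, b_id INTEGER);
    CREATE TABLE b (id INTEGER, c_id INTEGER);
    CREATE TABLE c (id INTEGER);
    INSERT INTO a VALUES (1, 10), (2, 20), (3, NULL);
    INSERT INTO b VALUES (10, 100), (20, NULL);
    INSERT INTO c VALUES (100);
""")

# b and c are inner joined first; the pair is then left joined to a as a unit
rows = conn.execute("""
    SELECT a.id, b_and_c.b_id, b_and_c.c_id
    FROM a
    LEFT JOIN (
        SELECT b.id AS b_id, c.id AS c_id
        FROM b INNER JOIN c ON b.c_id = c.id
    ) b_and_c ON a.b_id = b_and_c.b_id   -- assumed join condition
    ORDER BY a.id
""").fetchall()

print(rows)
```

Every row of a is kept, and the b and c columns are either both filled or both NULL: the row with a.id = 2 has a matching b row, but that b row has no c, so the pair is treated as missing.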

Why does the Oracle optimizer not eliminate the join in this case?

I have doubts about this case, but it is not clear to me why it behaves this way.
Consider the following SQL:
create table t1(tid int not null, t1 int not null);
create table t2(t2 int not null, tname varchar(30) null);
create unique index i_t2 on t2(t2);
create or replace view v_1 as
select t1.tid,t1.t1,max(t2.tname) as tname
from t1 left join t2
on t1.t1 = t2.t2
group by t1.tid,t1.t1;
then check the execution plan for select count(1) from v_1, the t2 is eliminated by the optimizer:
SQL> select count(1) from v_1;
Execution Plan
----------------------------------------------------------
Plan hash value: 3243658773
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 3 (34)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | VIEW | VM_NWVW_0 | 1 | | 3 (34)| 00:00:01 |
| 3 | HASH GROUP BY | | 1 | 26 | 3 (34)| 00:00:01 |
| 4 | TABLE ACCESS FULL| T1 | 1 | 26 | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------
but if the index i_t2 is dropped or recreated without unique attribute,
the table t2 is not eliminated in execution plan:
SQL> drop index i_t2;
Index dropped.
SQL> select count(1) from v_1;
Execution Plan
----------------------------------------------------------
Plan hash value: 2710188186
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 5 (20)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | VIEW | VM_NWVW_0 | 1 | | 5 (20)| 00:00:01 |
| 3 | HASH GROUP BY | | 1 | 39 | 5 (20)| 00:00:01 |
|* 4 | HASH JOIN OUTER | | 1 | 39 | 4 (0)| 00:00:01 |
| 5 | TABLE ACCESS FULL| T1 | 1 | 26 | 2 (0)| 00:00:01 |
| 6 | TABLE ACCESS FULL| T2 | 1 | 13 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
It seems that even when the index is removed,
the result of select count(1) from v_1 is still equal to
select count(1) from (select tid,t1 from t1 group by tid,t1)
Why does the optimizer not eliminate t2 in the second case?
Is there any principle or actual data example describing this?
Thanks :)
This is an optimization called join elimination. Because t2.t2 is unique, the optimizer knows that every row retrieved from t1 can match at most one row in t2. Since there is nothing projected from t2, there is no need to perform the join.
If you do
select tid, t1 from v_1;
you will see that we do not perform the join. However, if we project from t2, then the join is needed.
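A small sketch of why the uniqueness guarantee matters for the raw join (ignoring the GROUP BY in the view). It uses Python's sqlite3 module instead of Oracle, and the helper joined_row_count and its sample rows are invented for the illustration.

```python
import sqlite3


def joined_row_count(t2_rows):
    """Count the rows of t1 LEFT JOIN t2 for a given content of t2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (tid INTEGER NOT NULL, t1 INTEGER NOT NULL)")
    conn.execute("CREATE TABLE t2 (t2 INTEGER NOT NULL, tname TEXT)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(1, 10), (2, 20)])
    conn.executemany("INSERT INTO t2 VALUES (?, ?)", t2_rows)
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM t1 LEFT JOIN t2 ON t1.t1 = t2.t2"
    ).fetchone()
    conn.close()
    return count


# t2.t2 values are unique: the join cannot change the number of t1 rows
unique_keys = joined_row_count([(10, "x"), (20, "y")])

# t2.t2 contains a duplicate: one t1 row is now returned twice
duplicate_keys = joined_row_count([(10, "x"), (10, "z")])

print(unique_keys, duplicate_keys)
```

Without a unique index or constraint nothing tells the optimizer that the second situation cannot occur, so dropping the join is only safe when uniqueness is declared.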

Unexpected query results in Oracle db

We have Oracle 12.2.0.1.0 database. We create a simple table like this:
CREATE TABLE TABLE1 (DATE1 TIMESTAMP (6));
INSERT INTO TABLE1 VALUES (TIMESTAMP'2018-05-30 00:00:00');
INSERT INTO TABLE1 VALUES (TIMESTAMP'2018-05-30 00:00:00');
When we query with the following two select statements, we get different results. The first one returns two rows as expected, while the second one doesn't.
SELECT T1.*, NVL(T2.DATE1, TIMESTAMP'1900-01-01 00:00:00')
FROM TABLE1 T1
LEFT JOIN TABLE1 T2
ON 1 = 0
WHERE T1.DATE1 > NVL(T2.DATE1, TIMESTAMP'1900-01-01 00:00:00');
SELECT T1.*, NVL(T2.DATE1, TIMESTAMP'1900-01-01 00:00:00')
FROM TABLE1 T1
LEFT JOIN TABLE1 T2
ON T1.DATE1 || '---' = '-'
WHERE T1.DATE1 > NVL(T2.DATE1, TIMESTAMP'1900-01-01 00:00:00');
T1 and T2 are the same TABLE1. We are joining it on itself.
Please advise why that is so. Thanks.
It seems the optimizer gets confused with so many levels of obfuscating the join condition.
The first query results in the following execution plan:
SQL_ID 9k6g3m0xs31w7, child number 1
-------------------------------------
select t1.*, nvl(t2.date1, timestamp'1900-01-01 00:00:00') from table1
t1 left join table1 t2 on 1 = 0 where t1.date1 > nvl(t2.date1,
timestamp'1900-01-01 00:00:00')
Plan hash value: 963482612
-----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| A-Rows | A-Time | Buffers |
-----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 3 (100)| 2 |00:00:00.01 | 7 |
|* 1 | TABLE ACCESS FULL| TABLE1 | 1 | 2 | 26 | 3 (0)| 2 |00:00:00.01 | 7 |
-----------------------------------------------------------------------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SEL$F7AF7B7D / T1@SEL$1
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("T1"."DATE1">TIMESTAMP' 1900-01-01 00:00:00.000000000')
So the planner correctly sees that the self join is unnecessary and replaces the NVL() condition on the joined table with a condition on the column itself.
Apparently this "replacing" the condition does not work correctly in 12.2.
The second query results in the following plan:
SQL_ID 3twykk3kcyyxy, child number 1
-------------------------------------
select t1.*, nvl(t2.date1, timestamp'1900-01-01 00:00:00') from table1
t1 left join table1 t2 on t1.date1 || '---' = '-' where t1.date1 >
nvl(t2.date1, timestamp'1900-01-01 00:00:00')
Plan hash value: 736255932
----------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 8 (100)| 0 |00:00:00.01 | 7 | | | |
|* 1 | FILTER | | 1 | | | | 0 |00:00:00.01 | 7 | | | |
| 2 | MERGE JOIN OUTER | | 1 | 1 | 26 | 8 (25)| 2 |00:00:00.01 | 7 | | | |
| 3 | SORT JOIN | | 1 | 2 | 26 | 4 (25)| 2 |00:00:00.01 | 7 | 2048 | 2048 | 2048 (0)|
| 4 | TABLE ACCESS FULL | TABLE1 | 1 | 2 | 26 | 3 (0)| 2 |00:00:00.01 | 7 | | | |
|* 5 | SORT JOIN | | 2 | 2 | 26 | 4 (25)| 0 |00:00:00.01 | 0 | 1024 | 1024 | |
| 6 | VIEW | VW_LAT_C83A7ED5 | 2 | 2 | 26 | 3 (0)| 0 |00:00:00.01 | 0 | | | |
|* 7 | FILTER | | 2 | | | | 0 |00:00:00.01 | 0 | | | |
| 8 | TABLE ACCESS FULL| TABLE1 | 0 | 2 | 26 | 3 (0)| 0 |00:00:00.01 | 0 | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SEL$F7AF7B7D
4 - SEL$F7AF7B7D / T1@SEL$1
6 - SEL$BCD4421C / VW_LAT_AE9E49E8@SEL$AE9E49E8
7 - SEL$BCD4421C
8 - SEL$BCD4421C / T2@SEL$1
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("T1"."DATE1">NVL("ITEM_1",TIMESTAMP' 1900-01-01 00:00:00.000000000'))
5 - access(INTERNAL_FUNCTION("T1"."DATE1")>NVL("ITEM_1",TIMESTAMP' 1900-01-01 00:00:00.000000000'))
7 - filter(INTERNAL_FUNCTION("T1"."DATE1")||'---'='-')
So the optimizer replaced the reference to the table column with some ITEM_1 placeholder, and the step access(INTERNAL_FUNCTION("T1"."DATE1")>NVL("ITEM_1",TIMESTAMP' 1900-01-01 00:00:00.000000000')) messes things up.
With 12.1 the plan is essentially the same, the only difference is that the access() part is missing in the predicates, so I guess that replacement is somewhat buggy in 12.2 (to be precise my version is: 12.2.0.1.0)
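As a cross-check of what the result should be, the same two queries can be run on a different engine. This sketch uses Python's sqlite3 module, stores the timestamps as text and uses COALESCE in place of NVL, so it only confirms the expected logic and says nothing about Oracle's internals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (date1 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [("2018-05-30 00:00:00",), ("2018-05-30 00:00:00",)],
)

# same query shape as in the question; only the ON condition varies
query = """
    SELECT t1.date1, COALESCE(t2.date1, '1900-01-01 00:00:00')
    FROM table1 t1
    LEFT JOIN table1 t2 ON {condition}
    WHERE t1.date1 > COALESCE(t2.date1, '1900-01-01 00:00:00')
"""

never_true_literal = conn.execute(query.format(condition="1 = 0")).fetchall()
never_true_concat = conn.execute(
    query.format(condition="t1.date1 || '---' = '-'")
).fetchall()

print(len(never_true_literal), len(never_true_concat))
```

Both join conditions can never be true, so t2.date1 is always NULL and both queries return the two rows of t1 here, which supports the view that the empty result in 12.2 is wrong.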

SQL Multiple Minus vs Multiple Join Performance

I hope someone can explain the performance of joining multiple tables vs. using MINUS to eliminate records. I looked at a few other stack overflow questions but didn't see what I was looking for.
I thought these two queries would produce the same output, and I have always heard "use joins, use joins!", particularly from stackoverflow posts, that they were expected to be faster...
This is the first query I ran which I thought would be much slower, but it takes only a matter of minutes to run...
select some_id
from table1
MINUS
select some_id
from table2
where table2.value = 'some_value'
MINUS
select some_id
from table3
where table3.value = 'some_value'
group by some_id
This is the second query which I thought would be faster, but it has been running for over 3 hours now (with no end in sight?)
select some_id
from table1
join table2 on table1.id=table2.id
join table3 on table1.id=table3.id
where table2.value = 'some_value'
or table3.value = 'some_value'
group by some_id
I should note all 3 tables have > 1 Million records, up to 15 Million records each.
EDIT:
Sorry - I meant to let you know I was avoiding the use of NOT EXISTS in this question as a response, as I really am curious about just these two scenarios.
Try this version:
select some_id
from table1 t1
where not exists (select 1 from table2 t2 where t1.id = t2.id and t2.value = 'some_value') and
not exists (select 1 from table3 t3 where t1.id = t3.id and t3.value = 'some_value')
For best performance, you want indexes on table2(id, value) and table3(id, value).
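It is also worth checking that the two queries from the question even return the same ids. The sketch below runs them with Python's sqlite3 module on a few invented rows. Assumptions: EXCEPT stands in for Oracle's MINUS, a single some_id column is used as the join key, and the trailing GROUP BY of the MINUS query is left out because EXCEPT already returns distinct values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (some_id INTEGER);
    CREATE TABLE table2 (some_id INTEGER, value TEXT);
    CREATE TABLE table3 (some_id INTEGER, value TEXT);
    INSERT INTO table1 VALUES (1), (2), (3), (4);
    INSERT INTO table2 VALUES (1, 'some_value'), (2, 'other'), (3, 'other'), (4, 'other');
    INSERT INTO table3 VALUES (1, 'other'), (2, 'some_value'), (3, 'other'), (4, 'other');
""")

# removes every id that has 'some_value' in table2 or in table3
minus_ids = [r[0] for r in conn.execute("""
    SELECT some_id FROM table1
    EXCEPT
    SELECT some_id FROM table2 WHERE value = 'some_value'
    EXCEPT
    SELECT some_id FROM table3 WHERE value = 'some_value'
    ORDER BY some_id
""")]

# keeps exactly the ids that the query above removes
join_ids = [r[0] for r in conn.execute("""
    SELECT table1.some_id
    FROM table1
    JOIN table2 ON table1.some_id = table2.some_id
    JOIN table3 ON table1.some_id = table3.some_id
    WHERE table2.value = 'some_value' OR table3.value = 'some_value'
    GROUP BY table1.some_id
    ORDER BY table1.some_id
""")]

# anti-join form: both NOT EXISTS conditions must hold
not_exists_ids = [r[0] for r in conn.execute("""
    SELECT t1.some_id
    FROM table1 t1
    WHERE NOT EXISTS (SELECT 1 FROM table2 t2
                      WHERE t2.some_id = t1.some_id AND t2.value = 'some_value')
      AND NOT EXISTS (SELECT 1 FROM table3 t3
                      WHERE t3.some_id = t1.some_id AND t3.value = 'some_value')
    ORDER BY t1.some_id
""")]

print(minus_ids, join_ids, not_exists_ids)
```

On this data the MINUS form and the NOT EXISTS form agree, while the join returns the complementary ids, so the join is answering a different question and its run time is not directly comparable.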
First, make sure the indexes are in place.
Look at the plan: if it shows a full table scan, go ahead and create the indexes, otherwise the query is going to take a long, long time.
If you have PL/SQL Developer, paste the query into a SQL window and press F5 to get the explain plan.
Or you can do this:
SCOTT@research 17-APR-15> EXPLAIN PLAN FOR
2 select empno
3 from emp
4 MINUS
5 select empno
6 from empp
7 where empp.empno = '7839'
8 MINUS
9 select empno
10 from emppp
11 where emppp.empno = '7902'
12 group by empno
13 ;
Explained.
SCOTT@research 17-APR-15> SET LINESIZE 130
SCOTT@research 17-APR-15> SET PAGESIZE 0
SCOTT@research 17-APR-15> SELECT *
2 FROM TABLE(DBMS_XPLAN.DISPLAY);
Plan hash value: 4222598102
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 14 | 82 | 10 (90)| 00:00:01 |
| 1 | MINUS | | | | | |
| 2 | MINUS | | | | | |
| 3 | SORT UNIQUE NOSORT | | 14 | 56 | 2 (50)| 00:00:01 |
| 4 | INDEX FULL SCAN | PK_EMP | 14 | 56 | 1 (0)| 00:00:01 |
| 5 | SORT UNIQUE NOSORT | | 1 | 13 | 4 (25)| 00:00:01 |
|* 6 | TABLE ACCESS FULL | EMPP | 1 | 13 | 3 (0)| 00:00:01 |
| 7 | SORT UNIQUE NOSORT | | 1 | 13 | 4 (25)| 00:00:01 |
| 8 | SORT GROUP BY NOSORT| | 1 | 13 | 4 (25)| 00:00:01 |
|* 9 | TABLE ACCESS FULL | EMPPP | 1 | 13 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
6 - filter("EMPP"."EMPNO"=7839)
9 - filter("EMPPP"."EMPNO"=7902)
Note
-----
- dynamic sampling used for this statement (level=2)
26 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2137789089
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8168 | 16336 | 29 (0)| 00:00:01 |
| 1 | COLLECTION ITERATOR PICKLER FETCH| DISPLAY | 8168 | 16336 | 29 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------
or if you want to use autotrace then do,
set autotrace on explain
This is how it would look,
SCOTT@research 17-APR-15> select empno
2 from emp
3 MINUS
4 select empno
5 from empp
6 where empp.empno = '7839'
7 MINUS
8 select empno
9 from emppp
10 where emppp.empno = '7902'
11 group by empno
12 ;
EMPNO
----------
234
7499
7521
7566
7654
7698
7782
7788
7844
7876
7900
7934
12 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 4222598102
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 14 | 82 | 10 (90)| 00:00:01 |
| 1 | MINUS | | | | | |
| 2 | MINUS | | | | | |
| 3 | SORT UNIQUE NOSORT | | 14 | 56 | 2 (50)| 00:00:01 |
| 4 | INDEX FULL SCAN | PK_EMP | 14 | 56 | 1 (0)| 00:00:01 |
| 5 | SORT UNIQUE NOSORT | | 1 | 13 | 4 (25)| 00:00:01 |
|* 6 | TABLE ACCESS FULL | EMPP | 1 | 13 | 3 (0)| 00:00:01 |
| 7 | SORT UNIQUE NOSORT | | 1 | 13 | 4 (25)| 00:00:01 |
| 8 | SORT GROUP BY NOSORT| | 1 | 13 | 4 (25)| 00:00:01 |
|* 9 | TABLE ACCESS FULL | EMPPP | 1 | 13 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
6 - filter("EMPP"."EMPNO"=7839)
9 - filter("EMPPP"."EMPNO"=7902)
Note
-----
- dynamic sampling used for this statement (level=2)
SCOTT@research 17-APR-15>
SCOTT@research 17-APR-15> select emp.empno
2 from emp
3 join empp on emp.empno=empp.empno
4 join emppp on emp.empno=emppp.empno
5 where empp.empno = '7839'
6 or emppp.empno = '7902'
7 group by emp.empno
8 ;
EMPNO
----------
7839
7902
Execution Plan
----------------------------------------------------------
Plan hash value: 1435156579
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 30 | 8 (25)| 00:00:01 |
| 1 | HASH GROUP BY | | 1 | 30 | 8 (25)| 00:00:01 |
|* 2 | HASH JOIN | | 1 | 30 | 7 (15)| 00:00:01 |
| 3 | NESTED LOOPS | | 6 | 102 | 3 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL| EMPPP | 6 | 78 | 3 (0)| 00:00:01 |
|* 5 | INDEX UNIQUE SCAN| PK_EMP | 1 | 4 | 0 (0)| 00:00:01 |
| 6 | TABLE ACCESS FULL | EMPP | 10 | 130 | 3 (0)| 00:00:01 |
-------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("EMP"."EMPNO"="EMPP"."EMPNO")
filter("EMPP"."EMPNO"=7839 OR "EMPPP"."EMPNO"=7902)
5 - access("EMP"."EMPNO"="EMPPP"."EMPNO")
Note
-----
- dynamic sampling used for this statement (level=2)

SQL subquery and Joins giving same or different result (oracle)

I'm working on optimizing queries because of the huge amount of data we have in Oracle.
There is one query like this.
With subquery :
SELECT
STG.ID1,
STG.ID2
FROM (SELECT
DISTINCT
H1.ID1,
H2.ID2
FROM T_STGDV STG
INNER JOIN T_HUB1 H1 ON STG.BK1 = H1.BK1
INNER JOIN T_HUB2 H2 ON STG.BK2 = H2.BK2 ) STG
LEFT OUTER JOIN T_LINK L ON L.ID1 = STG.ID1 AND L.ID2 = STG.ID2
WHERE L.IDL IS NULL;
I'm doing this optimization :
SELECT
DISTINCT
H1.ID1,
H2.ID2
FROM T_STGDV STG
INNER JOIN T_HUB1 H1 ON STG.BK1 = H1.BK1
INNER JOIN T_HUB2 H2 ON STG.BK2 = H2.BK2
LEFT OUTER JOIN T_LINK L ON L.ID1 = H1.ID1 AND L.ID2 = H2.ID2
WHERE L.IDL IS NULL;
I want to know whether the result and the behavior will be the same.
I ran some tests and didn't find any difference, but maybe I missed a test case?
Any idea what the difference between those queries could be?
Thanks.
Some details: the explain plans for those test tables (the costs are not representative of the real tables).
The first query:
Plan hash value: 2680307749
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 65 | 11 (28)| 00:00:01 |
|* 1 | FILTER | | | | | |
|* 2 | HASH JOIN OUTER | | 1 | 65 | 11 (28)| 00:00:01 |
| 3 | VIEW | | 1 | 26 | 8 (25)| 00:00:01 |
| 4 | HASH UNIQUE | | 1 | 134 | 8 (25)| 00:00:01 |
|* 5 | HASH JOIN | | 1 | 134 | 7 (15)| 00:00:01 |
|* 6 | HASH JOIN | | 1 | 94 | 5 (20)| 00:00:01 |
| 7 | TABLE ACCESS FULL| T_STGDV | 1 | 54 | 2 (0)| 00:00:01 |
| 8 | TABLE ACCESS FULL| T_HUB1 | 2 | 80 | 2 (0)| 00:00:01 |
| 9 | TABLE ACCESS FULL | T_HUB2 | 2 | 80 | 2 (0)| 00:00:01 |
| 10 | TABLE ACCESS FULL | T_LINK | 3 | 117 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("L"."IDL" IS NULL)
2 - access("L"."ID2"(+)="STG"."ID2" AND "L"."ID1"(+)="STG"."ID1")
5 - access("STG"."BK2"="H2"."BK2")
6 - access("STG"."BK1"="H1"."BK1")
Note
-----
- dynamic sampling used for this statement (level=2)
the second query
Plan hash value: 2149614538
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 65 | 11 (28)| 00:00:01 |
| 1 | HASH UNIQUE | | 1 | 65 | 11 (28)| 00:00:01 |
|* 2 | FILTER | | | | | |
|* 3 | HASH JOIN OUTER | | 1 | 65 | 10 (20)| 00:00:01 |
| 4 | VIEW | | 1 | 26 | 7 (15)| 00:00:01 |
|* 5 | HASH JOIN | | 1 | 134 | 7 (15)| 00:00:01 |
|* 6 | HASH JOIN | | 1 | 94 | 5 (20)| 00:00:01 |
| 7 | TABLE ACCESS FULL| T_STGDV | 1 | 54 | 2 (0)| 00:00:01 |
| 8 | TABLE ACCESS FULL| T_HUB1 | 2 | 80 | 2 (0)| 00:00:01 |
| 9 | TABLE ACCESS FULL | T_HUB2 | 2 | 80 | 2 (0)| 00:00:01 |
| 10 | TABLE ACCESS FULL | T_LINK | 3 | 117 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("L"."IDL" IS NULL)
3 - access("L"."ID2"(+)="H2"."ID2" AND "L"."ID1"(+)="H1"."ID1")
5 - access("STG"."BK2"="H2"."BK2")
6 - access("STG"."BK1"="H1"."BK1")
Note
-----
- dynamic sampling used for this statement (level=2)
The queries look equivalent to me, because of the where clause.
Without the where clause they are not equivalent. Duplicates in t_link (relative to the join keys) would result in duplicate rows. However, you are looking for no matches, so this is not an issue. When there is no match, the two versions should be equivalent.
If you want to test them with your current dataset you can use minus.
query 1
MINUS
query 2
If any results are displayed, they are not the same.
You have to flip them around to try the other way too...
query 2
MINUS
query 1
If both tests return no records, the queries have the same effect on your current dataset.
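The two-way check can be scripted. Below is a minimal sketch using Python's sqlite3 module with invented rows for the four tables; EXCEPT stands in for Oracle's MINUS, the helper name difference is made up, and the inner alias of the first query is renamed to s to avoid the clash with the subquery alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_stgdv (bk1 TEXT, bk2 TEXT);
    CREATE TABLE t_hub1 (id1 INTEGER, bk1 TEXT);
    CREATE TABLE t_hub2 (id2 INTEGER, bk2 TEXT);
    CREATE TABLE t_link (idl INTEGER, id1 INTEGER, id2 INTEGER);
    INSERT INTO t_stgdv VALUES ('a', 'x'), ('a', 'x'), ('b', 'y'), ('a', 'y');
    INSERT INTO t_hub1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO t_hub2 VALUES (10, 'x'), (20, 'y');
    INSERT INTO t_link VALUES (100, 1, 10), (101, 1, 10);
""")

QUERY_1 = """
    SELECT stg.id1, stg.id2
    FROM (SELECT DISTINCT h1.id1, h2.id2
          FROM t_stgdv s
          INNER JOIN t_hub1 h1 ON s.bk1 = h1.bk1
          INNER JOIN t_hub2 h2 ON s.bk2 = h2.bk2) stg
    LEFT OUTER JOIN t_link l ON l.id1 = stg.id1 AND l.id2 = stg.id2
    WHERE l.idl IS NULL
"""

QUERY_2 = """
    SELECT DISTINCT h1.id1, h2.id2
    FROM t_stgdv s
    INNER JOIN t_hub1 h1 ON s.bk1 = h1.bk1
    INNER JOIN t_hub2 h2 ON s.bk2 = h2.bk2
    LEFT OUTER JOIN t_link l ON l.id1 = h1.id1 AND l.id2 = h2.id2
    WHERE l.idl IS NULL
"""


def difference(conn, first, second):
    """Rows returned by `first` that `second` does not return."""
    sql = f"SELECT * FROM ({first}) EXCEPT SELECT * FROM ({second})"
    return conn.execute(sql).fetchall()


only_in_1 = difference(conn, QUERY_1, QUERY_2)
only_in_2 = difference(conn, QUERY_2, QUERY_1)
print(only_in_1, only_in_2)
```

Both differences come back empty on this data, even though t_link contains a duplicated key pair. Keep in mind that MINUS/EXCEPT compares distinct rows, so it would not notice if one query returned the same row more often than the other.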
This might be the difference: look at these lines in your execution plans:
2 - access("L"."ID2"(+)="STG"."ID2" AND "L"."ID1"(+)="STG"."ID1")
and
3 - access("L"."ID2"(+)="H2"."ID2" AND "L"."ID1"(+)="H1"."ID1")
STG is a temporary result set created by Oracle for the duration of the query (the ambiguity between the T_STGDV alias and the subquery alias was reason enough on its own to rewrite the query). This temporary result is of course unindexed. After your refactoring, the Oracle optimiser joins T_LINK with H1 and H2 instead of the temporary result, which allows it to use the indexes built on those tables, thus giving you the 20x increase in speed.
After testing, they give the same result, and the second one is more efficient.