Efficient table comparison - sql

I know that there are the following ways to select values present in one table but not in other.
LEFT JOIN, NOT IN and NOT EXISTS
Which is the recommended option to use?
There probably isn't a universal answer, so I would appreciate the use cases where each is advisable.
(I am not looking for the syntax of the above options - just a comparison of the approaches)

In short, LEFT JOIN takes slightly more time than the other two, while NOT EXISTS and NOT IN took almost the same time.
I prefer LEFT JOIN when I need to use values from the other table in the select clause. Otherwise I prefer NOT EXISTS.
I suggest you replicate the test below on your own machine, as mine is a home machine with Oracle 12c and hardly anything else running. In a bigger environment the test may give more accurate results.
Test in detail:
To test this in practice, I create two tables: the first is filled with 1 million rows, and the second is populated from the first with a filter condition, so that some rows are not inserted into the second table.
--Create first table
create table test_data_left (empno integer, ename varchar2(10),CONSTRAINT tdl_pk primary key(empno));
--PL/SQL block to insert 1 million rows into test_data_left
declare v_max_empno integer;
BEGIN
-- start numbering after the highest existing empno (0 if the table is empty)
select coalesce(max(EMPNO),0) into v_max_empno from test_data_left;
FOR i IN 1..1000000 LOOP -- add 1 million rows
insert into test_data_left(empno,ename) values (
i+v_max_empno,
DBMS_RANDOM.string('U',TRUNC(DBMS_RANDOM.value(10,11)))
);
END LOOP;
END;
/
commit;
--Create second table and populate with some condition to block some rows from first table
create table test_data_right (empno integer, ename varchar2(10),CONSTRAINT tdr_pk primary key(empno));
insert into test_data_right (empno,ename)
select empno,ename from test_data_left
where ename not like 'JK%';
These are the queries I am using to get the data.
NOTE: I am not using t1.* in the select statements, as SQL Developer only displays the first 50 rows and you cannot run explain plan on it. Hence I am using count(*).
select count(*) from test_data_left t1 left join test_data_right t2 on
t1.empno=t2.empno where t2.empno is null;
select count(*) from test_data_left t1
where t1.empno not in (select empno from test_data_right);
select count(*) from test_data_left t1
where not exists (select 1 from test_data_right t2 where t1.empno=t2.empno);
To gather the statistics of the last query run, I used this command:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(null,null,'ALLSTATS LAST')) ;
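For the A-Rows and A-Time columns to be filled in, Oracle has to collect row-source statistics for the statement. The following is only a sketch of one way to do that, assuming SQL*Plus or a similar client and the test tables created above. The gather_plan_statistics hint is one option; setting statistics_level to ALL for the session is another.

```sql
-- In SQL*Plus, keep serveroutput off; otherwise the "last" statement
-- seen by DISPLAY_CURSOR may be the client's internal DBMS_OUTPUT call.
set serveroutput off

-- The hint asks Oracle to collect actual row-source statistics for this statement.
select /*+ gather_plan_statistics */ count(*)
from test_data_left t1
where not exists (select 1 from test_data_right t2 where t1.empno = t2.empno);

-- Show the plan with actual statistics for the statement just executed
-- (null, null = last statement run in this session).
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(null, null, 'ALLSTATS LAST'));
```

The DISPLAY_CURSOR call has to be the very next statement in the same session, otherwise it reports on whatever ran in between.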
To make sure that Oracle was not doing any funny business while calculating, I reset the database connection before running every query.
Below are the statistics after each query. I repeated the test in reverse order to give LEFT JOIN a fair chance.
As far as I can see, LEFT JOIN is the slowest, while NOT IN and NOT EXISTS are almost the same (based on a couple more iterations which I wasn't able to capture).
Note that all three queries produce the same plan (plan hash value 2082679279) with identical buffer counts, so the differences are small.
Iteration 1
LEFT JOIN
SQL_ID 0qz2qtza4yrr0, child number 0
-------------------------------------
select count(*) from test_data_left t1 left join test_data_right t2 on
t1.empno=t2.empno where t2.empno is null
Plan hash value: 2082679279
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.41 | 5012 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.41 | 5012 |
| 2 | NESTED LOOPS ANTI | | 1 | 1206K| 900K|00:00:01.32 | 5012 |
| 3 | INDEX FAST FULL SCAN| TDL_PK | 1 | 1206K| 1000K|00:00:00.22 | 1891 |
|* 4 | INDEX UNIQUE SCAN | TDR_PK | 1000K| 1 | 99865 |00:00:00.54 | 3121 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("T1"."EMPNO"="T2"."EMPNO")
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
NOT EXISTS
SQL_ID c498qdbzw5dxv, child number 0
-------------------------------------
select count(*) from test_data_left t1 where not exists (select 1 from
test_data_right t2 where t1.empno=t2.empno)
Plan hash value: 2082679279
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.27 | 5012 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.27 | 5012 |
| 2 | NESTED LOOPS ANTI | | 1 | 1206K| 900K|00:00:01.19 | 5012 |
| 3 | INDEX FAST FULL SCAN| TDL_PK | 1 | 1206K| 1000K|00:00:00.21 | 1891 |
|* 4 | INDEX UNIQUE SCAN | TDR_PK | 1000K| 1 | 99865 |00:00:00.49 | 3121 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("T1"."EMPNO"="T2"."EMPNO")
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
NOT IN
SQL_ID gwm775xqnufgm, child number 0
-------------------------------------
select count(*) from test_data_left t1 where t1.empno not in (select
empno from test_data_right)
Plan hash value: 2082679279
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.23 | 5012 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.23 | 5012 |
| 2 | NESTED LOOPS ANTI | | 1 | 1206K| 900K|00:00:01.15 | 5012 |
| 3 | INDEX FAST FULL SCAN| TDL_PK | 1 | 1206K| 1000K|00:00:00.19 | 1891 |
|* 4 | INDEX UNIQUE SCAN | TDR_PK | 1000K| 1 | 99865 |00:00:00.47 | 3121 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("T1"."EMPNO"="EMPNO")
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
Iteration 2
NOT IN
SQL_ID gwm775xqnufgm, child number 0
-------------------------------------
select count(*) from test_data_left t1 where t1.empno not in (select
empno from test_data_right)
Plan hash value: 2082679279
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.19 | 5012 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.19 | 5012 |
| 2 | NESTED LOOPS ANTI | | 1 | 1206K| 900K|00:00:01.11 | 5012 |
| 3 | INDEX FAST FULL SCAN| TDL_PK | 1 | 1206K| 1000K|00:00:00.19 | 1891 |
|* 4 | INDEX UNIQUE SCAN | TDR_PK | 1000K| 1 | 99865 |00:00:00.46 | 3121 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("T1"."EMPNO"="EMPNO")
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
NOT EXISTS
SQL_ID c498qdbzw5dxv, child number 0
-------------------------------------
select count(*) from test_data_left t1 where not exists (select 1 from
test_data_right t2 where t1.empno=t2.empno)
Plan hash value: 2082679279
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.19 | 5012 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.19 | 5012 |
| 2 | NESTED LOOPS ANTI | | 1 | 1206K| 900K|00:00:01.12 | 5012 |
| 3 | INDEX FAST FULL SCAN| TDL_PK | 1 | 1206K| 1000K|00:00:00.19 | 1891 |
|* 4 | INDEX UNIQUE SCAN | TDR_PK | 1000K| 1 | 99865 |00:00:00.46 | 3121 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("T1"."EMPNO"="T2"."EMPNO")
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
LEFT JOIN
SQL_ID 0qz2qtza4yrr0, child number 0
-------------------------------------
select count(*) from test_data_left t1 left join test_data_right t2 on
t1.empno=t2.empno where t2.empno is null
Plan hash value: 2082679279
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.33 | 5012 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.33 | 5012 |
| 2 | NESTED LOOPS ANTI | | 1 | 1206K| 900K|00:00:01.24 | 5012 |
| 3 | INDEX FAST FULL SCAN| TDL_PK | 1 | 1206K| 1000K|00:00:00.22 | 1891 |
|* 4 | INDEX UNIQUE SCAN | TDR_PK | 1000K| 1 | 99865 |00:00:00.50 | 3121 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("T1"."EMPNO"="T2"."EMPNO")
Note
-----
- dynamic statistics used: dynamic sampling (level=2)

This will return everything from table a for which there is no corresponding record in table b:
SELECT a.col FROM a WHERE a.col NOT IN (SELECT b.col from b)
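One caveat with NOT IN is worth illustrating: if the subquery can return a NULL, the NOT IN predicate never evaluates to true and the query returns no rows. A minimal sketch, assuming two throwaway tables named a and b as in the query above (the column types and sample values are made up for illustration):

```sql
-- Hypothetical sample tables mirroring the names used above
CREATE TABLE a (col INTEGER);
CREATE TABLE b (col INTEGER);

INSERT INTO a VALUES (1);
INSERT INTO a VALUES (2);
INSERT INTO b VALUES (1);
INSERT INTO b VALUES (NULL);

-- NOT IN: the comparison 2 <> NULL is unknown, so no row qualifies.
SELECT a.col FROM a WHERE a.col NOT IN (SELECT b.col FROM b);

-- NOT EXISTS: the NULL in b simply never matches, so this returns 2.
SELECT a.col FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.col = a.col);

-- NOT IN with NULLs filtered out of the subquery also returns 2.
SELECT a.col FROM a
WHERE a.col NOT IN (SELECT b.col FROM b WHERE b.col IS NOT NULL);
```

When the compared column is declared NOT NULL (for example a primary key, as in the test tables in the answer above), the two forms return the same rows.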

Try the query below:
select tabA.* from tabA left join tabB on tabA.id = tabB.tabA_id
where tabB.tabA_id is null
Hope it helps.

Related

Execution plan too expensive case when exists

I have the below query, but when I execute it runs forever.
WITH aux AS (
SELECT
contract,
contract_account,
business_partner,
payment_plan,
installation,
contract_status
FROM
reta.mv_integrated_md a
WHERE
contract_status IN (
'LIVE',
'FINAL'
)
), aux1 AS (
SELECT
a.*,
CASE
WHEN EXISTS (
SELECT
NULL
FROM
aux b
WHERE
b.business_partner = a.business_partner
AND b.installation = a.installation
AND b.payment_plan = 'BMW'
) THEN
'X'
END h
FROM
aux a
)
SELECT
*
FROM
aux1;
My execution plan shows a huge cost which I cannot locate. How could I optimize this query? I have tried some hints but none of them have worked :(
Plan hash value: 1662974027
----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 19M| 2000M| 825G (1)|999:59:59 | | |
|* 1 | VIEW | | 19M| 990M| 41331 (1)| 00:00:02 | | |
| 2 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FDA49C92_9A7BE8DE | 19M| 1066M| 41331 (1)| 00:00:02 | | |
| 3 | TEMP TABLE TRANSFORMATION | | | | | | | |
| 4 | LOAD AS SELECT | SYS_TEMP_0FDA49C92_9A7BE8DE | | | | | | |
| 5 | PARTITION RANGE SINGLE | | 18M| 974M| 759K (1)| 00:00:30 | 1 | 1 |
|* 6 | TABLE ACCESS STORAGE FULL| MV_INTEGRATED_MD | 18M| 974M| 759K (1)| 00:00:30 | 1 | 1 |
| 7 | VIEW | | 19M| 2000M| 41331 (1)| 00:00:02 | | |
| 8 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FDA49C92_9A7BE8DE | 19M| 1066M| 41331 (1)| 00:00:02 | | |
----------------------------------------------------------------------------------------------------------------------------
Kindly let me know if any additional information is needed.
Use window functions:
SELECT r.contract, r.contract_account, r.business_partner,
r.payment_plan, r.installation, r.contract_status,
MAX(CASE WHEN r.payment_plan = 'BMW' THEN 'X' END) OVER (PARTITION BY business_partner, installation) as h
FROM reta.mv_integrated_md@rbip r
WHERE r.contract_status IN ('LIVE', 'FINAL');
Not only is the query much simpler to write and read, but it should perform much better too.
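To see how the windowed MAX flags every row of a group, here is a small sketch on made-up data. The table contracts_demo and its rows are hypothetical and only stand in for the materialized view.

```sql
-- Hypothetical sample data; contracts_demo is not part of the original schema
CREATE TABLE contracts_demo (
  contract         INTEGER,
  business_partner INTEGER,
  installation     INTEGER,
  payment_plan     VARCHAR2(10)
);

INSERT INTO contracts_demo VALUES (1, 100, 10, 'BMW');
INSERT INTO contracts_demo VALUES (2, 100, 10, 'STD');
INSERT INTO contracts_demo VALUES (3, 200, 20, 'STD');

-- Each row sees the MAX over its (business_partner, installation) group,
-- so one 'BMW' row in the group is enough to mark all rows of that group.
SELECT c.contract,
       MAX(CASE WHEN c.payment_plan = 'BMW' THEN 'X' END)
         OVER (PARTITION BY c.business_partner, c.installation) AS h
FROM contracts_demo c
ORDER BY c.contract;
-- contracts 1 and 2 get h = 'X'; contract 3 gets h = NULL
```

One difference to be aware of: PARTITION BY puts rows with NULL keys into the same group, whereas the equality comparisons in the EXISTS version never match NULL keys. If business_partner or installation can be NULL, the two queries may flag different rows.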
The highest cost is due to the full table scan (FTS) on the table/MV MV_INTEGRATED_MD.
Try creating an index on contract_status and check whether it reduces the cost. Also, what is the size of this MV/table in blocks, and is it 10 percent or more of the total buffer cache size?
TABLE ACCESS STORAGE FULL| MV_INTEGRATED_MD | 18M| 974M| 759K (1)| 00:00:30 | 1 | 1
If you run your query with the /*+ gather_plan_statistics */ hint (I'm simulating it with a 1000-row table) you immediately see the problem:
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
-------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
-------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.01 | 9 | 5 |
|* 1 | VIEW | | 1000 | 1000 | 1000 |00:00:00.09 | 0 | 0 |
| 2 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6737_1A17DE13 | 1000 | 1000 | 500K|00:00:00.08 | 0 | 0 |
| 3 | TEMP TABLE TRANSFORMATION | | 1 | | 1000 |00:00:00.01 | 9 | 5 |
| 4 | LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6737_1A17DE13 | 1 | | 0 |00:00:00.01 | 8 | 5 |
|* 5 | TABLE ACCESS FULL | MV_INTEGRATED_MD | 1 | 1000 | 1000 |00:00:00.01 | 7 | 5 |
| 6 | VIEW | | 1 | 1000 | 1000 |00:00:00.01 | 0 | 0 |
| 7 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6737_1A17DE13 | 1 | 1000 | 1000 |00:00:00.01 | 0 | 0 |
-------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(("B"."BUSINESS_PARTNER"=:B1 AND "B"."INSTALLATION"=:B2 AND "B"."PAYMENT_PLAN"='BMW'))
5 - filter("CONTRACT_STATUS"='LIVE')
The problem is in line 2, where a full scan is executed in a loop for each row of the main table (see Starts = 1000).
Typically you want the EXISTS to be resolved with a semi join to preserve good performance, but here it seems that Oracle cannot rewrite it.
So you'll need to rewrite the query yourself.
Besides the excellent proposal of @GordonLinoff (which I'd start with), you may try an outer join as follows:
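with bmw as (
select distinct business_partner, installation
from mv_integrated_md
where payment_plan = 'BMW'
-- same status filter as the original aux CTE, so the result matches the EXISTS version
and contract_status in ('LIVE', 'FINAL'))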
SELECT
a.contract,
a.contract_account,
a.business_partner,
a.payment_plan,
a.installation,
a.contract_status,
case when b.business_partner is not null then 'X' end as h
FROM mv_integrated_md a
left outer join bmw b
on b.business_partner = a.business_partner and
b.installation = a.installation
WHERE a.contract_status IN ( 'LIVE', 'FINAL')
This will lead to two full scans, one deduplication and an outer join.

Why does the Oracle optimizer not eliminate the join in this case?

I have doubts about this case, and it is not clear to me why it behaves this way.
Consider the following SQL:
create table t1(tid int not null, t1 int not null);
create table t2(t2 int not null, tname varchar(30) null);
create unique index i_t2 on t2(t2);
create or replace view v_1 as
select t1.tid,t1.t1,max(t2.tname) as tname
from t1 left join t2
on t1.t1 = t2.t2
group by t1.tid,t1.t1;
Then check the execution plan for select count(1) from v_1; t2 is eliminated by the optimizer:
SQL> select count(1) from v_1;
Execution Plan
----------------------------------------------------------
Plan hash value: 3243658773
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 3 (34)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | VIEW | VM_NWVW_0 | 1 | | 3 (34)| 00:00:01 |
| 3 | HASH GROUP BY | | 1 | 26 | 3 (34)| 00:00:01 |
| 4 | TABLE ACCESS FULL| T1 | 1 | 26 | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------
But if the index i_t2 is dropped, or recreated without the unique attribute,
the table t2 is not eliminated in the execution plan:
SQL> drop index i_t2;
Index dropped.
SQL> select count(1) from v_1;
Execution Plan
----------------------------------------------------------
Plan hash value: 2710188186
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 5 (20)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | VIEW | VM_NWVW_0 | 1 | | 5 (20)| 00:00:01 |
| 3 | HASH GROUP BY | | 1 | 39 | 5 (20)| 00:00:01 |
|* 4 | HASH JOIN OUTER | | 1 | 39 | 4 (0)| 00:00:01 |
| 5 | TABLE ACCESS FULL| T1 | 1 | 26 | 2 (0)| 00:00:01 |
| 6 | TABLE ACCESS FULL| T2 | 1 | 13 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
It seems that even if the index is removed,
the result of select count(1) from v_1 is still equal to
select count(1) from (select tid,t1 from t1 group by tid,t1)
Why does the optimizer not eliminate t2 in the second case?
Is there any principle or actual data example describing this?
Thanks :)
This is an optimization called join elimination. Because t2.t2 is unique, the optimizer knows that every row retrieved from t1 can match at most one row in t2. Since nothing is projected from t2, there is no need to perform the join.
If you do
select tid, t1 from v_1;
you will see that we do not perform the join. However, if we project from t2, then the join is needed.
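A quick way to check this yourself is to compare the estimated plans of the two projections. This is only a sketch, assuming the tables, the unique index i_t2 and the view v_1 from the question exist in the current schema:

```sql
-- Nothing from t2 is projected: with the unique index in place,
-- T2 should not appear in the plan.
EXPLAIN PLAN FOR
select tid, t1 from v_1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- tname comes from t2, so the outer join has to be performed
-- and T2 should show up in the plan.
EXPLAIN PLAN FOR
select tid, t1, tname from v_1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

EXPLAIN PLAN only shows the estimated plan, which is enough here because the question is whether T2 appears in it at all.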

Two similar tables but different join performances

I am running the exact same join query using two different tables, but the first one (table A) times out whereas the second (table B) does not.
SELECT * FROM table_X
INNER JOIN table_A
ON table_A.point_origin = table_X.item_id
WHERE ROWNUM < 10;
SELECT * FROM table_X
INNER JOIN table_B
ON table_B.point_origin = table_X.item_id
WHERE ROWNUM < 10;
As far as I know, table A is a subset of table B. Neither table A nor table B has point_origin indexed.
(Edit for clarification: table A is only a subset of table B in terms of row identifiers, not in terms of exact column data.)
For what it's worth, I'm dealing with very large tables and item_id is indexed.
Is there anything else that would affect performance here or am I definitely wrong about some information provided?
Edit: Additional information per a comment below
table_A:
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9 | 4743 | 12 (0)| | |
|* 1 | COUNT STOPKEY | | | | | | |
| 2 | TABLE ACCESS BY INDEX ROWID| table_X | 1 | 227 | 1 (0)| | |
| 3 | NESTED LOOPS | | 11 | 5797 | 12 (0)| | |
| 4 | PARTITION RANGE ALL | | 10M| 2969M| 2 (0)| 1 | 4 |
| 5 | TABLE ACCESS FULL | table_A | 10M| 2969M| 2 (0)| 1 | 4 |
|* 6 | INDEX RANGE SCAN | table_X_IP_PK | 1 | | 1 (0)| | |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<10)
6 - access("table_A"."POINT_ORIGIN"="table_X"."ITEM_ID")
Note
-----
- 'PLAN_TABLE' is old version
table_B:
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9 | 3879 | 11 (0)|
|* 1 | COUNT STOPKEY | | | | |
| 2 | TABLE ACCESS BY INDEX ROWID| table_X | 1 | 227 | 1 (0)|
| 3 | NESTED LOOPS | | 10 | 4310 | 11 (0)|
| 4 | TABLE ACCESS FULL | table_B | 118M| 22G| 2 (0)|
|* 5 | INDEX RANGE SCAN | table_X_IP_PK | 1 | | 1 (0)|
-----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<10)
5 - access("table_B"."POINT_ORIGIN"="table_X"."ITEM_ID")
Note
-----
- 'PLAN_TABLE' is old version
It appears that table_a is partitioned and that the query scans its 4 partitions, while table_b is not partitioned and must be read in its entirety. The optimizer estimates that the 4 partitions of table_a hold 10 million rows while table_b has 118 million rows. You're using a nested loop, so you'd expect O(n) performance; based on these statistics, it would make sense for the second query to take ~11.8 times as long as the first (118M / 10M).
Are the optimizer's estimates accurate? The optimizer is only as good as the statistics you've given it and it is possible that one or both tables have stale statistics.
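To check whether the statistics are stale, you could look at the dictionary and regather if needed. A sketch under the assumption that both tables live in your own schema and are stored under the upper-case names TABLE_A and TABLE_B (adjust the names and use the ALL_/DBA_ views otherwise):

```sql
-- When were statistics last gathered, and does Oracle consider them stale?
SELECT table_name, partition_name, num_rows, last_analyzed, stale_stats
FROM   user_tab_statistics
WHERE  table_name IN ('TABLE_A', 'TABLE_B');

-- Regather statistics for both tables with default settings.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE_A');
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE_B');
END;
/
```

Comparing num_rows with an actual count(*) on each table tells you how far off the optimizer's estimates are.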

Unexpected query results in Oracle db

We have Oracle 12.2.0.1.0 database. We create a simple table like this:
CREATE TABLE TABLE1 (DATE1 TIMESTAMP (6));
INSERT INTO TABLE1 VALUES (TIMESTAMP'2018-05-30 00:00:00');
INSERT INTO TABLE1 VALUES (TIMESTAMP'2018-05-30 00:00:00');
When we query with the following two select statements, we get different results. The first one returns two rows as expected, while the second one doesn't.
SELECT T1.*, NVL(T2.DATE1, TIMESTAMP'1900-01-01 00:00:00')
FROM TABLE1 T1
LEFT JOIN TABLE1 T2
ON 1 = 0
WHERE T1.DATE1 > NVL(T2.DATE1, TIMESTAMP'1900-01-01 00:00:00');
SELECT T1.*, NVL(T2.DATE1, TIMESTAMP'1900-01-01 00:00:00')
FROM TABLE1 T1
LEFT JOIN TABLE1 T2
ON T1.DATE1 || '---' = '-'
WHERE T1.DATE1 > NVL(T2.DATE1, TIMESTAMP'1900-01-01 00:00:00');
T1 and T2 are the same TABLE1. We are joining it to itself.
Please advise why that is so. Thanks.
It seems the optimizer gets confused with so many levels of obfuscating the join condition.
The first query results in the following execution plan:
SQL_ID 9k6g3m0xs31w7, child number 1
-------------------------------------
select t1.*, nvl(t2.date1, timestamp'1900-01-01 00:00:00') from table1
t1 left join table1 t2 on 1 = 0 where t1.date1 > nvl(t2.date1,
timestamp'1900-01-01 00:00:00')
Plan hash value: 963482612
-----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| A-Rows | A-Time | Buffers |
-----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 3 (100)| 2 |00:00:00.01 | 7 |
|* 1 | TABLE ACCESS FULL| TABLE1 | 1 | 2 | 26 | 3 (0)| 2 |00:00:00.01 | 7 |
-----------------------------------------------------------------------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SEL$F7AF7B7D / T1@SEL$1
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("T1"."DATE1">TIMESTAMP' 1900-01-01 00:00:00.000000000')
So the planner correctly sees that the self join is unnecessary and replaces the NVL() condition on the joined table with a condition on the column itself.
Apparently this "replacing" the condition does not work correctly in 12.2.
The second query results in the following plan:
SQL_ID 3twykk3kcyyxy, child number 1
-------------------------------------
select t1.*, nvl(t2.date1, timestamp'1900-01-01 00:00:00') from table1
t1 left join table1 t2 on t1.date1 || '---' = '-' where t1.date1 >
nvl(t2.date1, timestamp'1900-01-01 00:00:00')
Plan hash value: 736255932
----------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 8 (100)| 0 |00:00:00.01 | 7 | | | |
|* 1 | FILTER | | 1 | | | | 0 |00:00:00.01 | 7 | | | |
| 2 | MERGE JOIN OUTER | | 1 | 1 | 26 | 8 (25)| 2 |00:00:00.01 | 7 | | | |
| 3 | SORT JOIN | | 1 | 2 | 26 | 4 (25)| 2 |00:00:00.01 | 7 | 2048 | 2048 | 2048 (0)|
| 4 | TABLE ACCESS FULL | TABLE1 | 1 | 2 | 26 | 3 (0)| 2 |00:00:00.01 | 7 | | | |
|* 5 | SORT JOIN | | 2 | 2 | 26 | 4 (25)| 0 |00:00:00.01 | 0 | 1024 | 1024 | |
| 6 | VIEW | VW_LAT_C83A7ED5 | 2 | 2 | 26 | 3 (0)| 0 |00:00:00.01 | 0 | | | |
|* 7 | FILTER | | 2 | | | | 0 |00:00:00.01 | 0 | | | |
| 8 | TABLE ACCESS FULL| TABLE1 | 0 | 2 | 26 | 3 (0)| 0 |00:00:00.01 | 0 | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SEL$F7AF7B7D
4 - SEL$F7AF7B7D / T1@SEL$1
6 - SEL$BCD4421C / VW_LAT_AE9E49E8@SEL$AE9E49E8
7 - SEL$BCD4421C
8 - SEL$BCD4421C / T2@SEL$1
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("T1"."DATE1">NVL("ITEM_1",TIMESTAMP' 1900-01-01 00:00:00.000000000'))
5 - access(INTERNAL_FUNCTION("T1"."DATE1")>NVL("ITEM_1",TIMESTAMP' 1900-01-01 00:00:00.000000000'))
7 - filter(INTERNAL_FUNCTION("T1"."DATE1")||'---'='-')
So the optimizer replaced the reference to the table column with an ITEM_1 placeholder, and the step access(INTERNAL_FUNCTION("T1"."DATE1")>NVL("ITEM_1",TIMESTAMP' 1900-01-01 00:00:00.000000000')) messes things up.
With 12.1 the plan is essentially the same; the only difference is that the access() part is missing in the predicates, so I guess that replacement is somewhat buggy in 12.2 (to be precise, my version is 12.2.0.1.0).

CTE vs subquery, which one is efficient?

SELECT col, (SELECT COUNT(*) FROM table) as total_count FROM table
This query executes subquery for every row, right?
Now if we have
;WITH CTE(total_count) AS (
SELECT COUNT(*) FROM table
)
SELECT col, (SELECT total_count FROM CTE) FROM table;
Will the second method be more efficient? Will the CTE execute COUNT(*) only once, with the SELECT then using it as a precomputed value? Or is COUNT(*) executed for each row in the second case as well?
For Oracle the surest way is to observe the behaviour of the statements with extended statistics.
To do so, first increase the statistics level to ALL:
alter session set statistics_level=all;
Then run both statements (fetching all rows) and find the SQL_ID of each statement.
Finally, display the statistics using the following statement (passing the proper SQL_ID):
select * from table(dbms_xplan.display_cursor('your SQL_ID here',null,'ALLSTATS LAST'));
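If you are unsure how to find the SQL_ID, one option is to look the statement up in the shared pool by a distinctive piece of its text. A sketch, assuming you have select privilege on v$sql and that the alias total_count is distinctive enough in your system:

```sql
-- Find the cursor by a fragment of its text; the second predicate
-- excludes this lookup query itself from the result.
SELECT sql_id, child_number, sql_text
FROM   v$sql
WHERE  sql_text LIKE '%total_count%'
AND    sql_text NOT LIKE '%v$sql%';
```

The sql_id from this result is the value to pass as the first argument of dbms_xplan.display_cursor; the child_number can be passed as the second argument instead of null if there are several child cursors.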
This gives for my test table
SQL_ID 5n0sdcu8347j9, child number 0
-------------------------------------
SELECT col, (SELECT COUNT(*) FROM t1) as total_count FROM t1
Plan hash value: 1306093980
-------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.01 | 351 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 338 |
| 2 | TABLE ACCESS FULL| T1 | 1 | 1061 | 1000 |00:00:00.01 | 338 |
| 3 | TABLE ACCESS FULL | T1 | 1 | 1061 | 1000 |00:00:00.01 | 351 |
-------------------------------------------------------------------------------------
and
SQL_ID fs0h660f08bj6, child number 0
-------------------------------------
WITH CTE(total_count) AS ( SELECT COUNT(*) FROM t1 ) SELECT col,
(SELECT total_count FROM CTE) FROM t1
Plan hash value: 1223456497
--------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.01 | 351 |
| 1 | VIEW | | 1 | 1 | 1 |00:00:00.01 | 338 |
| 2 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 338 |
| 3 | TABLE ACCESS FULL| T1 | 1 | 1061 | 1000 |00:00:00.01 | 338 |
| 4 | TABLE ACCESS FULL | T1 | 1 | 1061 | 1000 |00:00:00.01 | 351 |
--------------------------------------------------------------------------------------
So the plans are slightly different, but in both cases the full table scan for the count is started only once (column Starts = 1), so there is no real difference.
For comparison, I also ran a correlated subquery, which gives a completely different picture, with a high number of Starts for the full table scan:
SQL_ID cbvwd6pm6699m, child number 0
-------------------------------------
SELECT col, (SELECT COUNT(*) FROM t1 where col = a.col) as total_count
FROM t1 a
Plan hash value: 1306093980
-------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.01 | 351 |
| 1 | SORT AGGREGATE | | 1000 | 1 | 1000 |00:00:00.31 | 338K|
|* 2 | TABLE ACCESS FULL| T1 | 1000 | 11 | 1000 |00:00:00.31 | 338K|
| 3 | TABLE ACCESS FULL | T1 | 1 | 1061 | 1000 |00:00:00.01 | 351 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("COL"=:B1)
I believe that the query optimizer in both Oracle and SQL Server will recognize that the count query is not correlated, compute it once, and then use the cached result throughout the execution of the outer query.
Also, the CTE won't change anything as far as I know, since at execution time the code inside it will basically just be inlined into the actual outer query.
Here is a reference for Oracle which mentions that a non correlated subquery will be executed once and cached, except in cases where the outer query only has a few rows. In that case, it might not be cached because there isn't much of a penalty in executing the count subquery multiple times.