I'm having trouble optimizing an Oracle query after an upgrade to Oracle 11g and this problem is starting to drive me a little mad.
Note: this question has now been fully edited because I have more information after creating a simple test case. The original question is available here: https://stackoverflow.com/revisions/12304320/1.
The issue is that when joining two tables, one of which has a BETWEEN condition on a date column, bind peeking doesn't happen if the query also joins to a remote table.
Here is a test case to help reproduce the problem. First set up two source tables. The first is a list of dates, the first of each month, going back thirty years:
create table mike_temp_etl_control
as
select
add_months(trunc(sysdate, 'MM'), 1-row_count) as reporting_date
from (
select level as row_count
from dual
connect by level < 360
);
Then some data sourced from dba_objects:
create table mike_temp_dba_objects as
select owner, object_name, subobject_name, object_id, created
from dba_objects
union all
select owner, object_name, subobject_name, object_id, created
from dba_objects;
Then create an empty table to run the data into:
create table mike_temp_1
as
select
a.OWNER,
a.OBJECT_NAME,
a.SUBOBJECT_NAME,
a.OBJECT_ID,
a.CREATED,
b.REPORTING_DATE
from
mike_temp_dba_objects a
join mike_temp_etl_control b on (
b.reporting_date between add_months(a.created, -24) and a.created)
where 1=2;
Then run the code below. You may need to create a larger version of mike_temp_dba_objects to slow the query down (or use some other method to capture the execution plan). While the query is running, I get the execution plan for the session by running select *
from table(dbms_xplan.display_cursor(sql_id => 'xxxxxxxxxxx')) from a different session.
declare
pv_report_start_date date := date '2002-01-01';
v_report_end_date date := date '2012-07-01';
begin
INSERT /*+ APPEND */
INTO mike_temp_1
select
a.OWNER,
a.OBJECT_NAME,
a.SUBOBJECT_NAME,
a.OBJECT_ID,
a.CREATED,
b.REPORTING_DATE
from
mike_temp_dba_objects a
join mike_temp_etl_control b on (
b.reporting_date between add_months(a.created, -24) and a.created)
cross join dual@emirrl -- This line causes problems...
where
b.reporting_date between add_months(pv_report_start_date, -12) and v_report_end_date;
rollback;
end;
With a remote table in the query, the cardinality estimate for the mike_temp_etl_control table is completely wrong, and bind peeking doesn't seem to happen.
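One way to check this directly is to ask dbms_xplan for the peeked bind values; this is only a sketch, with the sql_id and child number as placeholders:

```sql
-- Sketch: show the bind values the optimizer actually peeked for a cursor.
-- Replace the sql_id and cursor_child_no with your own values.
select *
from   table(dbms_xplan.display_cursor(
           sql_id          => 'xxxxxxxxxxx',
           cursor_child_no => 0,
           format          => 'TYPICAL +PEEKED_BINDS'));
```

If the Peeked Binds section is missing from the output, the optimizer did not peek the binds for that child cursor.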
The execution plan for the query above is shown below:
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
---------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | | | 373 (100)|
| 1 | LOAD AS SELECT | | | | |
|* 2 | FILTER | | | | |
| 3 | MERGE JOIN | | 5 | 655 | 373 (21)|
| 4 | SORT JOIN | | 1096 | 130K| 370 (20)|
| 5 | MERGE JOIN CARTESIAN| | 1096 | 130K| 369 (20)|
| 6 | REMOTE | DUAL | 1 | | 2 (0)|
| 7 | BUFFER SORT | | 1096 | 130K| 367 (20)|
|* 8 | TABLE ACCESS FULL | MIKE_TEMP_DBA_OBJECTS | 1096 | 130K| 367 (20)|
|* 9 | FILTER | | | | |
|* 10 | SORT JOIN | | 2 | 18 | 3 (34)|
|* 11 | TABLE ACCESS FULL | MIKE_TEMP_ETL_CONTROL | 2 | 18 | 2 (0)|
---------------------------------------------------------------------------------------
If I then replace the remote dual with the local version I get the correct cardinality (139 instead of 2):
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | | | 10682 (100)|
| 1 | LOAD AS SELECT | | | | |
|* 2 | FILTER | | | | |
| 3 | MERGE JOIN | | 152K| 19M| 10682 (3)|
| 4 | SORT JOIN | | 438K| 51M| 10632 (2)|
| 5 | NESTED LOOPS | | 438K| 51M| 369 (20)|
| 6 | FAST DUAL | | 1 | | 2 (0)|
|* 7 | TABLE ACCESS FULL| MIKE_TEMP_DBA_OBJECTS | 438K| 51M| 367 (20)|
|* 8 | FILTER | | | | |
|* 9 | SORT JOIN | | 139 | 1251 | 3 (34)|
|* 10 | TABLE ACCESS FULL| MIKE_TEMP_ETL_CONTROL | 139 | 1251 | 2 (0)|
-------------------------------------------------------------------------------------
So I guess the question is: how can I get the correct cardinality estimated? Is this an Oracle bug, or is it expected behaviour?
I think you should look into dynamic sampling. It works differently in 11g, so it may be the reason for your troubles.
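For example, a table-level hint could be tried on the control table; this is only a sketch, and the sampling level of 4 is an arbitrary choice:

```sql
-- Sketch: force dynamic sampling on the control table ("b" is the alias
-- used in the query above; the level is an assumption, not a recommendation).
select /*+ dynamic_sampling(b 4) */
       a.owner, a.object_name, b.reporting_date
from   mike_temp_dba_objects a
join   mike_temp_etl_control b
       on b.reporting_date between add_months(a.created, -24) and a.created;
```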
Related
I am puzzled by this case, but it's not clear to me why it behaves this way.
Consider the following SQL:
create table t1(tid int not null, t1 int not null);
create table t2(t2 int not null, tname varchar(30) null);
create unique index i_t2 on t2(t2);
create or replace view v_1 as
select t1.tid,t1.t1,max(t2.tname) as tname
from t1 left join t2
on t1.t1 = t2.t2
group by t1.tid,t1.t1;
Then check the execution plan for select count(1) from v_1; table t2 is eliminated by the optimizer:
SQL> select count(1) from v_1;
Execution Plan
----------------------------------------------------------
Plan hash value: 3243658773
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 3 (34)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | VIEW | VM_NWVW_0 | 1 | | 3 (34)| 00:00:01 |
| 3 | HASH GROUP BY | | 1 | 26 | 3 (34)| 00:00:01 |
| 4 | TABLE ACCESS FULL| T1 | 1 | 26 | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------
But if the index i_t2 is dropped, or recreated without the unique attribute,
table t2 is not eliminated from the execution plan:
SQL> drop index i_t2;
Index dropped.
SQL> select count(1) from v_1;
Execution Plan
----------------------------------------------------------
Plan hash value: 2710188186
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 5 (20)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | | |
| 2 | VIEW | VM_NWVW_0 | 1 | | 5 (20)| 00:00:01 |
| 3 | HASH GROUP BY | | 1 | 39 | 5 (20)| 00:00:01 |
|* 4 | HASH JOIN OUTER | | 1 | 39 | 4 (0)| 00:00:01 |
| 5 | TABLE ACCESS FULL| T1 | 1 | 26 | 2 (0)| 00:00:01 |
| 6 | TABLE ACCESS FULL| T2 | 1 | 13 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
It seems that even with the index removed, the result of select count(1) from v_1 is still equal to
select count(1) from (select tid, t1 from t1 group by tid, t1)
Why does the optimizer not eliminate t2 in the second case?
Is there any principle, or an actual data example, describing this?
Thanks :)
This is an optimization called join elimination. Because t2.t2 is unique, the optimizer knows that every row retrieved from t1 can only ever retrieve one row from t2. Since nothing is projected from t2, there is no need to perform the join.
If you do
select tid, t1 from v_1;
you will see that we do not perform the join. However, if we project from t2, then the join is needed.
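Against the tables defined above, the contrast can be seen directly (a sketch):

```sql
-- Nothing projected from t2: the unique index on t2(t2) lets the
-- optimizer drop the join entirely.
select tid, t1 from v_1;

-- tname comes from t2, so the join (and the access to t2) comes back.
select tid, t1, tname from v_1;
```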
I would like to know whether INTERSECT or EXISTS has better performance in Oracle 11g. Suppose I have the two tables below.
Student_Master
STUDENT_ID NAME
---------- ------
STUD01 ALEX
STUD02 JAMES
STUD03 HANS
Student_Status
STUDENT_ID STATUS
---------- ------
STUD01 Fail
STUD02 Pass
STUD03 Pass
Which of the queries below will perform better, considering that the table Student_Status has more records than the table Student_Master?
SELECT STUDENT_ID FROM Student_Master
INTERSECT
SELECT STUDENT_ID FROM Student_Status
SELECT STUDENT_ID FROM Student_Master M
WHERE EXISTS
(SELECT STUDENT_ID FROM Student_Status S WHERE M.STUDENT_ID=S.STUDENT_ID)
A quick test would suggest the EXISTS option...
SELECT STUDENT_ID FROM Student_Master INTERSECT SELECT STUDENT_ID FROM
Student_Status
Plan hash value: 416197223
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 6 (100)| |
| 1 | INTERSECTION | | | | | |
| 2 | SORT UNIQUE | | 3 | 36 | 3 (34)| 00:00:01 |
| 3 | TABLE ACCESS FULL| STUDENT_MASTER | 3 | 36 | 2 (0)| 00:00:01 |
| 4 | SORT UNIQUE | | 3 | 36 | 3 (34)| 00:00:01 |
| 5 | TABLE ACCESS FULL| STUDENT_STATUS | 3 | 36 | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------
SELECT STUDENT_ID FROM Student_Master M WHERE EXISTS (SELECT STUDENT_ID
FROM Student_Status S WHERE M.STUDENT_ID=S.STUDENT_ID)
Plan hash value: 361045672
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 4 (100)| |
|* 1 | HASH JOIN SEMI | | 3 | 72 | 4 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| STUDENT_MASTER | 3 | 36 | 2 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL| STUDENT_STATUS | 3 | 36 | 2 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
The second form with the EXISTS operator always does a full scan on the Student_Master table, so I think you'll get better performance using INTERSECT. But you'll need to use an index on the master table.
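Note that both plans above full-scan unindexed tables, so the comparison may change once indexes exist. A sketch of the indexes (the names are made up for illustration):

```sql
-- Sketch: with STUDENT_ID indexed, the semi-join (and the sorts feeding
-- INTERSECT) can probe the indexes instead of full-scanning both tables.
create index ix_master_student_id on Student_Master (STUDENT_ID);
create index ix_status_student_id on Student_Status (STUDENT_ID);
```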
More info on EXIST can be found here: Ask TOM - IN & EXISTS
I have been trying to optimize a query by adding several indexes. I was successful with most of them, but there is this one query where the index I created is never used. The query looks something like this:
SELECT DISTINCT
c.containername
, hml.qty
, CAST(hml.txndate AS TIMESTAMP) txndate
, hml.comments
, e.fullname
, 'ICO' txn
FROM con c
JOIN hml hml ON c.containerid = hml.historyid
JOIN emp e ON hml.employeeid = e.employeeid
WHERE hml.comments LIKE '%ICO%'
AND hml.txndate >= to_date(:ddate, 'MM/DD/YYYY HH:MI:SS PM')
AND e.notes IN ('DBP', 'FTD', 'MH', 'SUPERVISOR')
ORDER BY CAST(hml.txndate AS TIMESTAMP);
Execution Plan
Plan hash value: 267728100
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 116 | 14616 | 248K (1)| 00:49:40 |
| 1 | SORT ORDER BY | | 116 | 14616 | 248K (1)| 00:49:40 |
| 2 | HASH UNIQUE | | 116 | 14616 | 248K (1)| 00:49:40 |
| 3 | NESTED LOOPS | | | | | |
| 4 | NESTED LOOPS | | 116 | 14616 | 248K (1)| 00:49:40 |
|* 5 | HASH JOIN | | 116 | 11252 | 248K (1)| 00:49:37 |
|* 6 | TABLE ACCESS FULL | EMPLOYEE | 8 | 304 | 25 (0)| 00:00:01 |
|* 7 | TABLE ACCESS BY INDEX ROWID| HISTORYMAINLINE | 38373 | 2210K| 248K (1)| 00:49:37 |
|* 8 | INDEX RANGE SCAN | HISTORYMAINLINEMX005 | 347K| | 1248 (1)| 00:00:15 |
|* 9 | INDEX UNIQUE SCAN | CONTAINER450 | 1 | | 1 (0)| 00:00:01 |
| 10 | TABLE ACCESS BY INDEX ROWID | CONTAINER | 1 | 29 | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("HML"."EMPLOYEEID"="E"."EMPLOYEEID")
6 - filter("E"."NOTES"='DBP' OR "E"."NOTES"='FTD' OR "E"."NOTES"='MH' OR
"E"."NOTES"='SUPERVISOR')
7 - filter("HML"."COMMENTS" IS NOT NULL AND "HML"."COMMENTS" LIKE '%ICO%')
8 - access("HML"."TXNDATE">=TO_DATE(:DDATE,'MM/DD/YYYY HH:MI:SS AM'))
9 - access("C"."CONTAINERID"="HML"."HISTORYID")
The cost is 248,283, the same as before I added the index. Additionally, when I change hml.comments LIKE '%ICO%' to hml.comments = '%ICO%', the cost improves dramatically (to only 14!) because the index I created is used. Unfortunately, I still have to use LIKE, since I don't know the position where the tag ICO appears in the comments column.
Is there a way the query can be improved? I am using Oracle 11g, and the hml table has 40,193,106 rows in total, by the way. Thanks in advance to all who take the time to answer.
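Since a B-tree index can never serve a leading-wildcard LIKE '%ICO%', one option worth considering, assuming Oracle Text is installed and using the table and column names from the query, is a CONTEXT index on the comments column. Note the caveats: CONTAINS matches the token ICO rather than an arbitrary substring, and a CONTEXT index must be synchronized after DML.

```sql
-- Sketch: an Oracle Text CONTEXT index supports token searches that a
-- B-tree index cannot serve for LIKE '%ICO%'. Index name is made up.
create index ix_hml_comments on hml (comments)
  indextype is ctxsys.context;

-- The predicate then becomes a CONTAINS query instead of LIKE:
select historyid, txndate, comments
from   hml
where  contains(comments, 'ICO') > 0;
```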
I'm working to optimize queries due to huge amount of data on Oracle.
There is one query like this.
With subquery :
SELECT
STG.ID1,
STG.ID2
FROM (SELECT
DISTINCT
H1.ID1,
H2.ID2
FROM T_STGDV STG
INNER JOIN T_HUB1 H1 ON STG.BK1 = H1.BK1
INNER JOIN T_HUB2 H2 ON STG.BK2 = H2.BK2 ) STG
LEFT OUTER JOIN T_LINK L ON L.ID1 = STG.ID1 AND L.ID2 = STG.ID2
WHERE L.IDL IS NULL;
I'm doing this optimization :
SELECT
DISTINCT
H1.ID1,
H2.ID2
FROM T_STGDV STG
INNER JOIN T_HUB1 H1 ON STG.BK1 = H1.BK1
INNER JOIN T_HUB2 H2 ON STG.BK2 = H2.BK2
LEFT OUTER JOIN T_LINK L ON L.ID1 = H1.ID1 AND L.ID2 = H2.ID2
WHERE L.IDL IS NULL;
I want to know if the result will be the same, and whether the behavior is the same.
I did some tests and didn't find a difference, but maybe I missed some test case?
Any idea what the difference between those queries could be?
Thanks.
Some details: the explain plans for the test tables (the costs are not representative of the real tables).
The first query:
Plan hash value: 2680307749
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 65 | 11 (28)| 00:00:01 |
|* 1 | FILTER | | | | | |
|* 2 | HASH JOIN OUTER | | 1 | 65 | 11 (28)| 00:00:01 |
| 3 | VIEW | | 1 | 26 | 8 (25)| 00:00:01 |
| 4 | HASH UNIQUE | | 1 | 134 | 8 (25)| 00:00:01 |
|* 5 | HASH JOIN | | 1 | 134 | 7 (15)| 00:00:01 |
|* 6 | HASH JOIN | | 1 | 94 | 5 (20)| 00:00:01 |
| 7 | TABLE ACCESS FULL| T_STGDV | 1 | 54 | 2 (0)| 00:00:01 |
| 8 | TABLE ACCESS FULL| T_HUB1 | 2 | 80 | 2 (0)| 00:00:01 |
| 9 | TABLE ACCESS FULL | T_HUB2 | 2 | 80 | 2 (0)| 00:00:01 |
| 10 | TABLE ACCESS FULL | T_LINK | 3 | 117 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("L"."IDL" IS NULL)
2 - access("L"."ID2"(+)="STG"."ID2" AND "L"."ID1"(+)="STG"."ID1")
5 - access("STG"."BK2"="H2"."BK2")
6 - access("STG"."BK1"="H1"."BK1")
Note
-----
- dynamic sampling used for this statement (level=2)
The second query:
Plan hash value: 2149614538
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 65 | 11 (28)| 00:00:01 |
| 1 | HASH UNIQUE | | 1 | 65 | 11 (28)| 00:00:01 |
|* 2 | FILTER | | | | | |
|* 3 | HASH JOIN OUTER | | 1 | 65 | 10 (20)| 00:00:01 |
| 4 | VIEW | | 1 | 26 | 7 (15)| 00:00:01 |
|* 5 | HASH JOIN | | 1 | 134 | 7 (15)| 00:00:01 |
|* 6 | HASH JOIN | | 1 | 94 | 5 (20)| 00:00:01 |
| 7 | TABLE ACCESS FULL| T_STGDV | 1 | 54 | 2 (0)| 00:00:01 |
| 8 | TABLE ACCESS FULL| T_HUB1 | 2 | 80 | 2 (0)| 00:00:01 |
| 9 | TABLE ACCESS FULL | T_HUB2 | 2 | 80 | 2 (0)| 00:00:01 |
| 10 | TABLE ACCESS FULL | T_LINK | 3 | 117 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("L"."IDL" IS NULL)
3 - access("L"."ID2"(+)="H2"."ID2" AND "L"."ID1"(+)="H1"."ID1")
5 - access("STG"."BK2"="H2"."BK2")
6 - access("STG"."BK1"="H1"."BK1")
Note
-----
- dynamic sampling used for this statement (level=2)
The queries look equivalent to me, because of the where clause.
Without the where clause they are not equivalent. Duplicates in t_link (relative to the join keys) would result in duplicate rows. However, you are looking for no matches, so this is not an issue. When there is no match, the two versions should be equivalent.
If you want to test them with your current dataset you can use minus.
query 1
MINUS
query 2
If any results are displayed, they are not the same.
You have to flip them around to try the other way too...
query 2
MINUS
query 1
If both tests return no records, the queries have the same effect on your current dataset.
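The two directional checks can also be combined into one statement; zero rows back means the queries agree on the current data. The with-clause bodies below are placeholders standing in for the two full query texts:

```sql
-- Sketch: symmetric difference of the two queries in one statement.
-- Replace the dummy selects with the actual query 1 and query 2 texts.
with q1 as (select 1 id1, 1 id2 from dual /* paste query 1 here */),
     q2 as (select 1 id1, 1 id2 from dual /* paste query 2 here */)
(select * from q1 minus select * from q2)
union all
(select * from q2 minus select * from q1);
```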
This might be the difference: look at these lines in your execution plans:
2 - access("L"."ID2"(+)="STG"."ID2" AND "L"."ID1"(+)="STG"."ID1")
and
3 - access("L"."ID2"(+)="H2"."ID2" AND "L"."ID1"(+)="H1"."ID1")
STG is an intermediate result set created by Oracle for the duration of the query (the ambiguity between the T_STGDV alias and the subquery alias was alone a reason to rewrite the query), and this intermediate result is of course unindexed. After your refactoring, the Oracle optimizer starts joining T_LINK with H1 and H2 instead of the intermediate result, which allows it to use the indexes built on those tables, giving you the 20x increase in speed.
After testing, they give the same result, and the second one is more efficient.
I have a query (below). The explain plan shows high CPU utilization, which has also caused downtime in our lab. Is it possible to tune this query further? What should I do to tune it?
FYI, mtr_main_a, mtr_main_b, and mtr_hist contain a huge number of records, maybe 10 million or more.
SELECT to_char(MAX(mdt), 'MM-DD-RRRR HH24:MI:SS')
FROM (
SELECT MAX(mod_date - 2 / 86400) mdt
FROM mtr_main_a
UNION
SELECT MAX(mod_date - 2 / 86400) mdt
FROM mtr_main_b
UNION
SELECT MAX(mod_date - 2 / 86400) mdt
FROM mtr_hist@batch_hist
)
/
Explain plan is as below
Execution Plan
----------------------------------------------------------
Plan hash value: 1573811822
-------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Inst |IN-OUT|
-------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 9 | | 79803 (1)| 00:18:38 | | |
| 1 | SORT AGGREGATE | | 1 | 9 | | | | | |
| 2 | VIEW | | 2 | 18 | | 79803 (1)| 00:18:38 | | |
| 3 | SORT UNIQUE | | 2 | 17 | 77M| 79803 (2)| 00:18:38 | | |
| 4 | UNION-ALL | | | | | | | | |
| 5 | SORT AGGREGATE | | 1 | 8 | | 79459 (1)| 00:18:33 | | |
| 6 | TABLE ACCESS FULL| MTR_MAIN_A | 5058K| 38M| | 67735 (1)| 00:15:49 | | |
| 7 | SORT AGGREGATE | | 1 | 9 | | 344 (1)| 00:00:05 | | |
| 8 | TABLE ACCESS FULL| MTR_MAIN_B | 1 | 9 | | 343 (1)| 00:00:05 | | |
| 9 | REMOTE | | | | | | | HISTB | R->S |
-------------------------------------------------------------------------------------------------------------
Remote SQL Information (identified by operation id):
----------------------------------------------------
9 - EXPLAIN PLAN SET STATEMENT_ID='PLUS10294704' INTO PLAN_TABLE#! FOR SELECT
MAX("A1"."MOD_DATE"-.00002314814814814814814814814814814814814815) FROM "MTR_HIST" "A1" (accessing
'HISTB' )
Thanks and Regards,
Chandra
You should be able to greatly improve the performance by putting an index on the mod_date columns and changing your query so that the subtraction is done at the end, after the maximum date has been determined:
SELECT to_char(MAX(mdt) - 2 / 86400, 'MM-DD-RRRR HH24:MI:SS')
FROM (
SELECT MAX(mod_date) mdt
FROM mtr_main_a
UNION
SELECT MAX(mod_date) mdt
FROM mtr_main_b
UNION
SELECT MAX(mod_date) mdt
FROM mtr_hist@batch_hist
)
This should get rid of the full table scans.
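A sketch of the supporting indexes (the names are made up, and the index for the remote table would have to be created on the remote database itself):

```sql
create index ix_mtr_main_a_mod_date on mtr_main_a (mod_date);
create index ix_mtr_main_b_mod_date on mtr_main_b (mod_date);
-- With these in place, MAX(mod_date) can be answered with an
-- INDEX FULL SCAN (MIN/MAX) instead of a full table scan.
```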
If you have indexes on the columns, how does this version work:
SELECT to_char((case when a.mdt > b.mdt and a.mdt > c.mdt then a.mdt
when b.mdt > c.mdt then b.mdt
else c.mdt
end) - 2 / 86400, 'MM-DD-RRRR HH24:MI:SS')
FROM (SELECT MAX(mod_date) mdt
FROM mtr_main_a
) a cross join
(SELECT MAX(mod_date) mdt
FROM mtr_main_b
) b cross join
(SELECT MAX(mod_date) mdt
FROM mtr_hist@batch_hist
) c
This is just a suggestion, if the union version doesn't work faster.
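As a variant of the same idea, GREATEST would be simpler than the CASE expression, with the caveat that it returns NULL if any operand is NULL (for example, if one of the tables is empty):

```sql
SELECT to_char(GREATEST(a.mdt, b.mdt, c.mdt) - 2 / 86400,
               'MM-DD-RRRR HH24:MI:SS')
FROM (SELECT MAX(mod_date) mdt FROM mtr_main_a) a
     CROSS JOIN
     (SELECT MAX(mod_date) mdt FROM mtr_main_b) b
     CROSS JOIN
     (SELECT MAX(mod_date) mdt FROM mtr_hist@batch_hist) c;
```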