Trying to optimize a *random* query in Oracle SQL

I need to optimize a procedure in Oracle SQL, mainly using indexes. This is the statement:
CREATE OR REPLACE PROCEDURE DEL_OBS(cuantos number) IS
begin
  FOR i IN (SELECT * FROM (SELECT * FROM observations ORDER BY DBMS_RANDOM.VALUE) WHERE ROWNUM <= cuantos)
  LOOP
    DELETE FROM OBSERVATIONS WHERE nplate = i.nplate AND odatetime = i.odatetime;
  END LOOP;
end del_obs;
My plan was to create an index related to ROWNUM, since that is what appears to be used to do the deletes. But I don't know if it is going to be worth it. The problem with this procedure is that its randomness causes a lot of consistent gets. Can anyone help me with this?? Thanks :)
Note: I cannot change the code, only make improvements afterwards

Use the ROWID pseudo-column to filter the rows:
CREATE OR REPLACE PROCEDURE DEL_OBS(
  cuantos number
)
IS
BEGIN
  DELETE FROM OBSERVATIONS
  WHERE ROWID IN (
    SELECT rid
    FROM (
      SELECT ROWID AS rid
      FROM observations
      ORDER BY DBMS_RANDOM.VALUE
    )
    WHERE ROWNUM <= cuantos
  );
END del_obs;
If you have an index on the table then it can use an index fast full scan:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE table_name ( id ) AS
SELECT LEVEL FROM DUAL CONNECT BY LEVEL <= 50000;
Query 1: No Index:
DELETE FROM table_name
WHERE ROWID IN (
  SELECT rid
  FROM (
    SELECT ROWID AS rid
    FROM table_name
    ORDER BY DBMS_RANDOM.VALUE
  )
  WHERE ROWNUM <= 10000
)
Execution Plan:
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
----------------------------------------------------------------------------------------
| 0 | DELETE STATEMENT | | 1 | 24 | 123 | 00:00:02 |
| 1 | DELETE | TABLE_NAME | | | | |
| 2 | NESTED LOOPS | | 1 | 24 | 123 | 00:00:02 |
| 3 | VIEW | VW_NSO_1 | 10000 | 120000 | 121 | 00:00:02 |
| 4 | SORT UNIQUE | | 1 | 120000 | | |
| * 5 | COUNT STOPKEY | | | | | |
| 6 | VIEW | | 19974 | 239688 | 121 | 00:00:02 |
| * 7 | SORT ORDER BY STOPKEY | | 19974 | 239688 | 121 | 00:00:02 |
| 8 | TABLE ACCESS FULL | TABLE_NAME | 19974 | 239688 | 25 | 00:00:01 |
| 9 | TABLE ACCESS BY USER ROWID | TABLE_NAME | 1 | 12 | 1 | 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 5 - filter(ROWNUM<=10000)
* 7 - filter(ROWNUM<=10000)
Query 2: Add an index:
ALTER TABLE table_name ADD CONSTRAINT tn__id__pk PRIMARY KEY ( id )
Query 3: With the index:
DELETE FROM table_name
WHERE ROWID IN (
  SELECT rid
  FROM (
    SELECT ROWID AS rid
    FROM table_name
    ORDER BY DBMS_RANDOM.VALUE
  )
  WHERE ROWNUM <= 10000
)
Execution Plan:
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
---------------------------------------------------------------------------------------
| 0 | DELETE STATEMENT | | 1 | 37 | 13 | 00:00:01 |
| 1 | DELETE | TABLE_NAME | | | | |
| 2 | NESTED LOOPS | | 1 | 37 | 13 | 00:00:01 |
| 3 | VIEW | VW_NSO_1 | 9968 | 119616 | 11 | 00:00:01 |
| 4 | SORT UNIQUE | | 1 | 119616 | | |
| * 5 | COUNT STOPKEY | | | | | |
| 6 | VIEW | | 9968 | 119616 | 11 | 00:00:01 |
| * 7 | SORT ORDER BY STOPKEY | | 9968 | 119616 | 11 | 00:00:01 |
| 8 | INDEX FAST FULL SCAN | TN__ID__PK | 9968 | 119616 | 9 | 00:00:01 |
| 9 | TABLE ACCESS BY USER ROWID | TABLE_NAME | 1 | 25 | 1 | 00:00:01 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 5 - filter(ROWNUM<=10000)
* 7 - filter(ROWNUM<=10000)
If you cannot do it in a single SQL statement using ROWID, then you can rewrite your existing procedure to use exactly the same query but with the FORALL statement:
CREATE OR REPLACE PROCEDURE DEL_OBS(cuantos number)
IS
  TYPE obs_tab_type IS TABLE OF observations%ROWTYPE;
  obs_tab obs_tab_type;
begin
  SELECT *
  BULK COLLECT INTO obs_tab
  FROM (
    SELECT * FROM observations ORDER BY DBMS_RANDOM.VALUE
  )
  WHERE ROWNUM <= cuantos;

  FORALL i IN 1 .. obs_tab.COUNT
    DELETE FROM OBSERVATIONS
    WHERE nplate = obs_tab(i).nplate
    AND odatetime = obs_tab(i).odatetime;
END del_obs;
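For reference, a quick way to exercise either version of the procedure (the row count here is just an example):
BEGIN
  del_obs(cuantos => 1000);
END;
/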

What you definitely need is an index on OBSERVATIONS to allow the DELETE to use an index access.
CREATE INDEX cuantos ON OBSERVATIONS(nplate, odatetime);
The execution of the procedure will lead to one FULL TABLE SCAN of the OBSERVATIONS table and to one INDEX ACCESS for each deleted record.
For a limited number of deleted records it will behave similarly to the set-based DELETE proposed in the other answer; for a larger number of deleted records the elapsed time will scale linearly with the number of deletes.
For a non-trivial number of deleted records you must assume that the index is not completely in the buffer pool, so lots of disk access will be required. You will end up with approximately 100 deleted rows per second.
In other words, deleting 100K rows will take about a quarter of an hour.
Deleting 1M rows will take about 2 3/4 hours.
You can see that at this scale the first part of the task, the FULL SCAN of your table, is negligible; it takes only a few minutes. The only way to get an acceptable response time in this case is to switch the logic to a single DELETE statement, as proposed in the other answers.
This behavior is also captured by the rule "row by row is slow by slow" (i.e. processing in a loop works fine, but only for a limited number of records).

You can do this using a single delete statement:
delete from observations o
where (o.nplate, o.odatetime) in (select nplate, odatetime
                                  from (select o2.nplate, o2.odatetime
                                        from observations o2
                                        order by DBMS_RANDOM.VALUE
                                       ) o2
                                  where rownum <= v_cuantos
                                 );
This is often faster than executing a separate query for each row being deleted.

Try this. It was tested on MS SQL Server, but I hope it will also work on Oracle. Please leave a remark with the result.
CREATE OR REPLACE PROCEDURE DEL_OBS(cuantos number) IS
begin
DELETE OBSERVATIONS FROM OBSERVATIONS
join (select * from OBSERVATIONS ORDER BY VALUE ) as i on
nplate=i.nplate AND
odatetime=i.odatetime AND
i.ROWNUM<=cuantos;
End DEL_OBS;

Since you say that nplate and odatetime are the primary key of observations, I am guessing the problem is here:
SELECT * FROM (
SELECT *
FROM observations
ORDER BY DBMS_RANDOM.VALUE)
WHERE ROWNUM<=cuantos;
There is no way to prevent that from performing a full scan of observations, plus a lot of sorting if that's a big table.
You need to change the code that runs. By far, the easiest way to change the code is to change the source code and recompile it.
However, there are ways to change the code that executes without changing the source code. Here are two:
(1) Use fine-grained access control (the DBMS_RLS package) to add a policy that detects whether you are in this procedure and, if so, adds a predicate to the observations table like this:
AND rowid IN
( SELECT obs_sample.rowid
FROM observations sample (0.05) obs_sample)
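A minimal sketch of what option (1) might look like, assuming an application context obs_ctx that the session sets before calling the procedure (the context, the policy function, and all names here are hypothetical):
CREATE OR REPLACE FUNCTION obs_sample_policy (
  p_schema IN VARCHAR2,
  p_object IN VARCHAR2
) RETURN VARCHAR2
IS
BEGIN
  -- Add the sampling predicate only for sessions flagged as running DEL_OBS
  -- (the obs_ctx context is an assumption and must be created and set elsewhere).
  IF SYS_CONTEXT('obs_ctx', 'in_del_obs') = 'Y' THEN
    RETURN 'rowid IN (SELECT s.rowid FROM observations SAMPLE (0.05) s)';
  END IF;
  RETURN NULL;  -- everyone else sees the table unfiltered
END;
/

BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => USER,
    object_name     => 'OBSERVATIONS',
    policy_name     => 'OBS_SAMPLE_POLICY',
    function_schema => USER,
    policy_function => 'OBS_SAMPLE_POLICY',
    statement_types => 'SELECT'
  );
END;
/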
(2) Use DBMS_ADVANCED_REWRITE to rewrite your query changing:
FROM observations
.. to ..
FROM observations SAMPLE (0.05)
Using the text of your query in the re-write policy should prevent it from affecting other queries against the observations table.
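Roughly, option (2) could look like the sketch below (the equivalence name is made up, and in practice the source statement must match the procedure's exact query text):
BEGIN
  SYS.DBMS_ADVANCED_REWRITE.DECLARE_REWRITE_EQUIVALENCE(
    name             => 'obs_random_sample_rw',  -- hypothetical name
    source_stmt      => 'SELECT * FROM observations ORDER BY DBMS_RANDOM.VALUE',
    destination_stmt => 'SELECT * FROM observations SAMPLE (0.05) ORDER BY DBMS_RANDOM.VALUE',
    validate         => FALSE,
    rewrite_mode     => 'TEXT_MATCH'
  );
END;
/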
Neither of these are easy (at all), but can be worth a try if you are really stuck.

Related

What's the best way to optimize a query searching 1000 rows in 50 million by date? (Oracle)

I have a table:
CREATE TABLE ard_signals
(id, val, str, date_val, dt);
This table records the values of all IDs' signals. There are around 950 unique IDs. At the moment, the table contains about 50 million rows with different values of these signals. Each ID can have only numeric values, string values, or date values.
I get the last value of each ID which, by condition, is less than an input date:
select id,
       max(val) keep (dense_rank last order by dt desc) as val,
       max(str) keep (dense_rank last order by dt desc) as str,
       max(date_val) keep (dense_rank last order by dt desc) as date_val,
       max(dt)
from ard_signals
where dt <= to_date(any_date)
group by id;
I have an index on ID. At the moment, the query takes about 30 seconds. Please help: what can be done to optimize this query?
[EXPLAIN PLAN screenshot: with dt index]
Example data (there are about 950-1000 rows like this):
ID   | VAL | STR | DATE_VAL | DT
920  | 0   |     |          | 20.07.2022 9:59:11
490  |     | yes |          | 20.07.2022 9:40:01
565  | 233 |     |          | 20.07.2022 9:32:03
32   | 1   |     |          | 20.07.2022 9:50:01
TL;DR You need your application to maintain a table of distinct id values.
So, you want the last record for each group (distinct id value) in your table, without doing a full table scan. It seems like it should be easy to tell Oracle: iterate through the distinct values for id and then do an index lookup to get the last dt value for each id and then give me that row. But looks are deceiving here -- it's not easy at all, if it is even possible.
Think about what an index on (id) (or (id, dt)) has. It has a bunch of leaf blocks and a structure to find the highest value of id in each block. But Oracle can only use the index to get all the distinct id values by reading every leaf block in the index. So, we might find a way to trade our TABLE FULL SCAN for an INDEX FAST FULL SCAN for a marginal benefit, but it's not really what we're hoping for.
What about partitioning? Can't we create ARD_SIGNALS with PARTITION BY LIST (id) AUTOMATIC and use that? Now the data dictionary is guaranteed to have a separate partition for each distinct id value.
But again - think about what Oracle knows (from DBA_TAB_PARTITIONS) -- it knows what the highest partition key value is in each partition. It is true: for a list partitioned table, that highest value is guaranteed to be the only distinct value in the partition. But I think using that guarantee requires special optimizations that Oracle's CBO does not seem to make (yet).
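For reference, the partitioning idea just described would look roughly like this (the column types are assumptions; automatic list partitioning needs Oracle 12.2 or later):
CREATE TABLE ard_signals (
  id       NUMBER,
  val      NUMBER,
  str      VARCHAR2(100),
  date_val DATE,
  dt       DATE
)
PARTITION BY LIST (id) AUTOMATIC
( PARTITION p_seed VALUES (0) );  -- Oracle adds one partition per new id value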
So, unfortunately, you are going to need to modify your application to keep a parent table for ARD_SIGNALS that has a (single) row for each distinct id.
Even then, it's kind of difficult to get what we want. Because, again, we want Oracle to iterate through the distinct id values, then use the index to find the one with the highest dt for that id .. and then stop. So, we're looking for an execution plan that makes use of the INDEX RANGE SCAN (MIN/MAX) operation.
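One hypothetical way to seed such a parent table, along with the (id, dt) index that the plans below appear to rely on (the names are assumptions based on those plans):
CREATE TABLE ard_ids AS SELECT DISTINCT id FROM ard_signals;
ALTER TABLE ard_ids ADD CONSTRAINT ard_ids_pk PRIMARY KEY (id);
CREATE INDEX ard_signals_n1 ON ard_signals (id, dt);
After this initial load, the application has to keep ard_ids in sync as new ids appear.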
I find that tricky with joins, but not so hard with scalar subqueries. So, assuming we named our parent table ARD_IDS, we can start with this:
SELECT /*+ NO_UNNEST(#ssq) */ i.id,
( SELECT /*+ QB_NAME(ssq) */ max(dt)
FROM ard_signals s
WHERE s.id = i.id
AND s.dt <= to_date(trunc(SYSDATE) + 2+ 10/86400) -- replace with your date variable
) max_dt
FROM ard_ids i;
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.01 | 22 |
| 1 | SORT AGGREGATE | | 1000 | 1 | 1000 |00:00:00.02 | 3021 |
| 2 | FIRST ROW | | 1000 | 1 | 1000 |00:00:00.01 | 3021 |
|* 3 | INDEX RANGE SCAN (MIN/MAX)| ARD_SIGNALS_N1 | 1000 | 1 | 1000 |00:00:00.01 | 3021 |
| 4 | TABLE ACCESS FULL | ARD_IDS | 1 | 1000 | 1000 |00:00:00.01 | 22 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("S"."ID"=:B1 AND "S"."DT"<=TO_DATE(TO_CHAR(TRUNC(SYSDATE#!)+2+.00011574074074074074
0740740740740740740741)))
Note the use of hints to keep Oracle from merging the scalar subquery into the rest of the query and losing our desired access path.
Next, it is a matter of using those (id, max(dt)) combinations to look up the rows from the table to get the other column values. I came up with this; improvements may be possible (especially if (id, dt) is not as selective as I am assuming it is):
with k AS (
select /*+ NO_UNNEST(#ssq) */ i.id, ( SELECT /*+ QB_NAME(ssq) */ max(dt) FROM ard_signals s WHERE s.id = i.id AND s.dt <= to_date(trunc(SYSDATE) + 2+ 10/86400) ) max_dt
from ard_ids i
)
SELECT k.id,
max(val) keep ( dense_rank first order by dt desc, s.rowid ) val,
max(str) keep ( dense_rank first order by dt desc, s.rowid ) str,
max(date_val) keep ( dense_rank first order by dt desc, s.rowid ) date_val,
max(dt) keep ( dense_rank first order by dt desc, s.rowid ) dt
FROM k
INNER JOIN ard_signals s ON s.id = k.id AND s.dt = k.max_dt
GROUP BY k.id;
--------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.04 | 7009 | | | |
| 1 | SORT GROUP BY | | 1 | 1000 | 1000 |00:00:00.04 | 7009 | 302K| 302K| 268K (0)|
| 2 | NESTED LOOPS | | 1 | 1005 | 1000 |00:00:00.04 | 7009 | | | |
| 3 | NESTED LOOPS | | 1 | 1005 | 1000 |00:00:00.03 | 6009 | | | |
| 4 | TABLE ACCESS FULL | ARD_IDS | 1 | 1000 | 1000 |00:00:00.01 | 3 | | | |
|* 5 | INDEX RANGE SCAN | ARD_SIGNALS_N1 | 1000 | 1 | 1000 |00:00:00.02 | 6006 | | | |
| 6 | SORT AGGREGATE | | 1000 | 1 | 1000 |00:00:00.02 | 3002 | | | |
| 7 | FIRST ROW | | 1000 | 1 | 1000 |00:00:00.01 | 3002 | | | |
|* 8 | INDEX RANGE SCAN (MIN/MAX) | ARD_SIGNALS_N1 | 1000 | 1 | 1000 |00:00:00.01 | 3002 | | | |
| 9 | TABLE ACCESS BY GLOBAL INDEX ROWID| ARD_SIGNALS | 1000 | 1 | 1000 |00:00:00.01 | 1000 | | | |
--------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("S"."ID"="I"."ID" AND "S"."DT"=)
8 - access("S"."ID"=:B1 AND "S"."DT"<=TO_DATE(TO_CHAR(TRUNC(SYSDATE#!)+2+.000115740740740740740740740740740740740741)))
... 7000 gets and 4/100ths of a second.
You want to see the latest entry per ID at a given time.
If you were usually interested in times near the beginning of the recordings, an index on the time could help limit the rows you have to work with. But I consider this highly unlikely.
It is much more likely that you are interested in the situation at more recent times. This means, for instance, that when looking for the latest entries until today, for one ID the latest entry may be found yesterday, while for another ID the latest entry may be from two years ago.
In my opinion, Oracle already chooses the best approach to deal with this: read the whole table sequentially.
If you have several CPUs at hand, parallel execution might speed up things:
select /*+ parallel(4) */ ...
If you really need this to be much faster, then you may want to consider an additional table with one row per ID, which gets a copy of the latest date and values via a trigger. I.e., you'd introduce redundancy for a gain in speed, as is done in data warehouses.
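A rough sketch of that trigger-maintained table (all names and column types here are assumptions, not tested code):
CREATE TABLE ard_latest (
  id       NUMBER PRIMARY KEY,
  val      NUMBER,
  str      VARCHAR2(100),
  date_val DATE,
  dt       DATE
);

CREATE OR REPLACE TRIGGER trg_ard_latest
AFTER INSERT ON ard_signals
FOR EACH ROW
BEGIN
  MERGE INTO ard_latest l
  USING (SELECT :new.id AS id FROM dual) src
  ON (l.id = src.id)
  WHEN MATCHED THEN UPDATE SET
      l.val      = :new.val,
      l.str      = :new.str,
      l.date_val = :new.date_val,
      l.dt       = :new.dt
    WHERE :new.dt >= l.dt  -- only ever move the "latest" row forward in time
  WHEN NOT MATCHED THEN INSERT (id, val, str, date_val, dt)
    VALUES (:new.id, :new.val, :new.str, :new.date_val, :new.dt);
END;
/
Note that queries "as of" an arbitrary past date still need the base table; the copy only speeds up lookups of the current latest values.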

Query Optimization - subselect in Left Join

I'm working on optimizing a SQL query, and I found a particular set of lines that appears to be killing my query's performance:
LEFT JOIN anothertable lastweek
    AND lastweek.date >= (SELECT MAX(table.date) - 7 max_date_lweek
                          FROM table table
                          WHERE table.id = lastweek.id)
    AND lastweek.date < (SELECT MAX(table.date) max_date_lweek
                         FROM table table
                         WHERE table.id = lastweek.id)
I'm working on a way of optimizing these lines, but I'm stumped. If anyone has any ideas, please let me know!
-----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1908654 | 145057704 | 720461 | 00:00:29 |
| * 1 | HASH JOIN RIGHT OUTER | | 1908654 | 145057704 | 720461 | 00:00:29 |
| 2 | VIEW | VW_DCL_880D8DA3 | 427487 | 7694766 | 716616 | 00:00:28 |
| * 3 | HASH JOIN | | 427487 | 39328804 | 716616 | 00:00:28 |
| 4 | VIEW | VW_SQ_2 | 7174144 | 193701888 | 278845 | 00:00:11 |
| 5 | HASH GROUP BY | | 7174144 | 294139904 | 278845 | 00:00:11 |
| 6 | TABLE ACCESS STORAGE FULL | TASK | 170994691 | 7010782331 | 65987 | 00:00:03 |
| * 7 | HASH JOIN | | 8549735 | 555732775 | 429294 | 00:00:17 |
| 8 | VIEW | VW_SQ_1 | 7174144 | 172179456 | 278845 | 00:00:11 |
| 9 | HASH GROUP BY | | 7174144 | 294139904 | 278845 | 00:00:11 |
| 10 | TABLE ACCESS STORAGE FULL | TASK | 170994691 | 7010782331 | 65987 | 00:00:03 |
| 11 | TABLE ACCESS STORAGE FULL | TASK | 170994691 | 7010782331 | 65987 | 00:00:03 |
| * 12 | TABLE ACCESS STORAGE FULL | TASK | 1908654 | 110701932 | 2520 | 00:00:01 |
-----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 1 - access("SYS_ID"(+)="TASK"."PARENT")
* 3 - access("ITEM_2"="TASK_LWEEK"."SYS_ID")
* 3 - filter("TASK_LWEEK"."SNAPSHOT_DATE"<"MAX_DATE_LWEEK")
* 7 - access("ITEM_1"="TASK_LWEEK"."SYS_ID")
* 7 - filter("TASK_LWEEK"."SNAPSHOT_DATE">=INTERNAL_FUNCTION("MAX_DATE_LWEEK"))
* 12 - storage("TASK"."CLOSED_AT" IS NULL OR "TASK"."CLOSED_AT">=TRUNC(SYSDATE#!)-15)
* 12 - filter("TASK"."CLOSED_AT" IS NULL OR "TASK"."CLOSED_AT">=TRUNC(SYSDATE#!)-15)
Well, you are not even showing the select. Since the select runs on Exadata (TABLE ACCESS STORAGE FULL), perhaps you need to ask yourself why you need to access the same table four times.
You access the main table TASK, with 170,994,691 rows (based on the CBO's estimate), four times (lines 6, 10, 11, 12). I don't know whether the statistics are up to date, or whether dynamic sampling kicked in due to a lack of good statistics.
A solution could be to use WITH to generate intermediate results that you need several times in your outer query:
with my_set as
(SELECT MAX(table.date) - 7 AS max_date_lweek,
        MAX(table.date) AS max_date,
        id
 FROM table
 GROUP BY id)
select
.......................
from ...
left join anothertable lastweek on ( ........ )
left join my_set on ( anothertable.id = my_set.id )
where
lastweek.date >= my_set.max_date_lweek
and
lastweek.date < my_set.max_date
Please take into account that you did not provide the query, so I am guessing a lot of things.
Since complete information is not available, I will suggest: you are using the same subquery twice, so why not use a CTE, such as
with CTE_example as (SELECT MAX(table.date) max_date,
                            MAX(table.date) - 7 max_date_lweek,
                            ID
                     FROM table
                     GROUP BY ID)
Looking at your explain plan, the only table being accessed is TASK. From that, I infer that the tables in your example: ANOTHERTABLE and TABLE are actually the same table and that, therefore, you are trying to get the last week of data that exists in that table for each id value.
If all that is true, it should be much faster to use an analytic function to get the max date value for each id and then limit based on that.
Here is an example of what I mean. Note I use "dte" instead of "date", to remove confusion with the reserved word "date".
LEFT JOIN ( SELECT lastweek.*,
max(dte) OVER ( PARTITION BY id ) max_date
FROM anothertable lastweek ) lastweek
ON 1=1 -- whatever other join conditions you have, seemingly omitted from your post
AND lastweek.dte >= lastweek.max_date - 7;
Again, this only works if I am correct in thinking that table and anothertable are actually the same table.

Sequence of query execution

When there is a correlated query, what is the sequence of execution?
Ex:
select
p.productNo,
(
select count(distinct concat(bom.detailpart,bom.groupname))
from dl_MBOM bom
where bom.DetailPart=p.ProductNo
) cnt1
from dm_product p
The execution plan will vary by database vendors. For Oracle, here is a similar query, and the corresponding execution plan.
select dname,
( select count( distinct job )
from emp e
where e.deptno = d.deptno
) x
from dept d
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 6 (100)| |
| 1 | SORT GROUP BY | | 1 | 11 | | |
|* 2 | TABLE ACCESS FULL| EMP | 5 | 55 | 2 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL | DEPT | 4 | 52 | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("E"."DEPTNO"=:B1)
While it seems likely that the DBMS reads dm_product record by record and, for each such record, looks up the value in dl_MBOM, this doesn't necessarily happen.
With an SQL query you tell the DBMS mainly what to do, not how to do it. If the DBMS thinks it better to build a join instead and work on this, it is free to do so.
Short answer: the sequence of execution is not determined. (You can, however, in many DBMS look at the query's execution plan to see how it is executed.)
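To illustrate the point about joins: instead of executing the subquery once per row, the optimizer is free to compute the same result with a plan shaped like an outer join, along the lines of this sketch (the outer join keeps departments without employees, whose count is then 0, matching the scalar subquery):
select d.dname,
       count(distinct e.job) as x
from dept d
left join emp e on e.deptno = d.deptno
group by d.dname;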

Distinct clause on large join

Following this question, suppose now that I've set up the indexes, and I want to return only a certain field, without duplicates:
Select distinct A.cod
from A join B
on A.id1=B.id1 and
A.id2=B.id2
where A.year=2016
and B.year=2016
the problem now is that I'm getting something like 150k cod values with only 1000 distinct ones, so my query is very inefficient.
Question: how can I improve that? i.e, how can I tell the DB, for every row on A, to stop joining that row as soon as a match is found?
Thank you in advance!
I'm basing my answer on your question:
how can I tell the DB, for every row on A, to stop joining that row as soon as a match is found?
With the EXISTS clause, once it sees a match it will stop and move on to the next record to be checked.
Adding the DISTINCT will filter out any duplicate CODs (in case there are any).
select DISTINCT cod
from A ax
where year = 2016
and exists ( select 1
from B bx
WHERE Ax.ID1 = Bx.ID1
AND Ax.ID2 = Bx.ID2
AND Ax.YEAR = Bx.YEAR);
EDIT: I was curious which solution (IN or EXISTS) would give me a better explain plan.
Create the 1st Table Definition
Create table A
(
ID1 number,
ID2 number,
cod varchar2(100),
year number
);
Insert 4000000 sequential numbers:
BEGIN
FOR i IN 1..4000000 loop
insert into A (id1, id2, cod, year)
values (i, i , i, i);
end loop;
END;
commit;
Create Table B and insert the same data into it:
Create table B
as
select *
from A;
Reinsert Data from Table A to make duplicates
insert into B
select *
from A
Build the indexes mentioned in the previous post (Index on join and where):
CREATE INDEX A_IDX ON A(year, id1, id2);
CREATE INDEX B_IDX ON B(year, id1, id2);
Update a bunch of rows to make it fetch multiple rows with the year 2016:
update B
set year = 2016
where rownum < 20000;
update A
set year = 2016
where rownum < 20000;
commit;
Check Explain plan using EXISTS
Plan hash value: 1052726981
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 44 | 7 (15)| 00:00:01 |
| 1 | HASH UNIQUE | | 1 | 44 | 7 (15)| 00:00:01 |
| 2 | NESTED LOOPS SEMI | | 1 | 44 | 6 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID| A | 1 | 26 | 4 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN | A_IDX | 1 | | 3 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | B_IDX | 2 | 36 | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("YEAR"=2016)
5 - access("BX"."YEAR"=2016 AND "AX"."ID1"="BX"."ID1" AND "AX"."ID2"="BX"."ID2")
filter("AX"."YEAR"="BX"."YEAR")
Check Explain plan using IN
Plan hash value: 3002464630
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 44 | 7 (15)| 00:00:01 |
| 1 | HASH UNIQUE | | 1 | 44 | 7 (15)| 00:00:01 |
| 2 | NESTED LOOPS | | 1 | 44 | 6 (0)| 00:00:01 |
| 3 | TABLE ACCESS BY INDEX ROWID| A | 1 | 26 | 4 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN | A_IDX | 1 | | 3 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | B_IDX | 1 | 18 | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("YEAR"=2016)
5 - access("YEAR"=2016 AND "ID1"="ID1" AND "ID2"="ID2")
Although my test case is limited, I'm guessing that both the IN and EXISTS clauses have nearly the same execution.
On the face of it, what you are actually trying to do should be done like this:
select distinct cod
from A
where year = 2016
and (id1, id2) in (select id1, id2 from B where year = 2016)
The subquery in the WHERE condition is a non-correlated query, so it will be evaluated only once. And the IN condition is evaluated using short-circuiting; instead of a complete join, it will search through the results of the subquery only until a match is found.
EDIT: As Migs Isip points out, there may be duplicate codes in the original table, so a "distinct" may still be needed. I edited my code to add it back after Migs posted his answer.
Not sure about your existing indexes but you can improve your query a bit by adding another JOIN condition like
Select distinct A.cod
from A join B
on A.id1=B.id1 and
A.id2=B.id2 and
A.year = B.year -- this one
where A.year=2016;

Oracle intermediate join table size

When we join more than 2 tables, Oracle (or, for that matter, any database) decides to join 2 tables and uses the result to join with the subsequent tables. Is there a way to identify the intermediate join size? I am particularly interested in Oracle. One solution I know of is to use Autotrace in SQL Developer, which has the column LAST_OUTPUT_ROWS. But for queries executed by PL/SQL and other means, does Oracle record the intermediate join size in some table?
I am asking this because we recently had a problem: someone dropped the statistics and failed to regenerate them, and when we traced through the query we found that Oracle formed an intermediate table of 180 million rows before arriving at the final result of 6 rows, and the query was quite slow.
Oracle can materialize the intermediate results of a table join in the temporary segment set for your session.
Since it's a one-off table that is deleted after the query is complete, its statistics are not stored.
However, you can estimate its size by building a plan for the query and looking at the ROWS column of the appropriate operation:
EXPLAIN PLAN FOR
WITH q AS
(
  SELECT /*+ MATERIALIZE */
         e1.value AS val1, e2.value AS val2
  FROM t_even e1, t_even e2
)
SELECT COUNT(*)
FROM q;

SELECT *
FROM TABLE(DBMS_XPLAN.display());
Plan hash value: 3705384459
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 43G (5)|999:59:59 |
| 1 | TEMP TABLE TRANSFORMATION | | | | | |
| 2 | LOAD AS SELECT | | | | | |
| 3 | MERGE JOIN CARTESIAN | | 100T| 909T| 42G (3)|999:59:59 |
| 4 | TABLE ACCESS FULL | T_ODD | 10M| 47M| 4206 (3)| 00:00:51 |
| 5 | BUFFER SORT | | 10M| 47M| 42G (3)|999:59:59 |
| 6 | TABLE ACCESS FULL | T_ODD | 10M| 47M| 4204 (3)| 00:00:51 |
| 7 | SORT AGGREGATE | | 1 | | | |
| 8 | VIEW | | 100T| | 1729M (62)|999:59:59 |
| 9 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6604_2660595 | 100T| 909T| 1729M (62)|999:59:59 |
---------------------------------------------------------------------------------------------------------
Here, the materialized table is called SYS_TEMP_0FD9D6604_2660595 and the estimated record count is 100T (100,000,000,000,000 records).
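If you need actual rather than estimated row counts (e.g. for a query already run from PL/SQL), one common approach, sketched here, is to collect runtime row-source statistics and read them back with DBMS_XPLAN.DISPLAY_CURSOR; the A-Rows column then shows the real intermediate sizes:
ALTER SESSION SET statistics_level = ALL;
-- ... run the query (or add the /*+ GATHER_PLAN_STATISTICS */ hint to it) ...
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));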