I have a table of employees. One of the columns is a varray() that contains multiple room #'s for their office. I'm looking for a simple query that will compare each employee to see if they share an office.
SELECT E1.Name, E2.Name
FROM Employee E1
JOIN Employee E2
ON E1.Room = E2.Room;
Something like this doesn't work because the Room column is a varray. I just need one value in the first varray to match with another in the second. Is there an easy way of doing this?
Assuming you refer to Oracle, the query of your choice could be either
select
E1.name as employee_1, E2.name as employee_2,
R1.column_value as the_matching_room
from employee E1
cross join table(E1.rooms) R1
join employee E2
on E2.emp_id > E1.emp_id
join table(E2.rooms) R2
on R2.column_value = R1.column_value
;
or (somewhat more efficient)
with rooms_unnested$ as (
select E.emp_id, E.name, R.column_value as room
from employee E
cross join table(E.rooms) R
)
select
E1.name as employee_1, E2.name as employee_2,
E1.room as the_matching_room
from rooms_unnested$ E1
join rooms_unnested$ E2
on E2.emp_id > E1.emp_id
and E2.room = E1.room
;
This one has the potential problem of performing the Cartesian join between the employee tables first and unnesting the collections only afterwards:
-----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1334324 | 5142484696 | 447202 | 00:00:18 |
| 1 | NESTED LOOPS | | 1334324 | 5142484696 | 447202 | 00:00:18 |
| 2 | NESTED LOOPS | | 16336 | 62926272 | 63 | 00:00:01 |
| 3 | NESTED LOOPS | | 2 | 7700 | 7 | 00:00:01 |
| 4 | TABLE ACCESS FULL | EMPLOYEE | 2 | 3850 | 3 | 00:00:01 |
| * 5 | TABLE ACCESS FULL | EMPLOYEE | 1 | 1925 | 2 | 00:00:01 |
| 6 | COLLECTION ITERATOR PICKLER FETCH | | 8168 | 16336 | 28 | 00:00:01 |
| * 7 | COLLECTION ITERATOR PICKLER FETCH | | 82 | 164 | 27 | 00:00:01 |
-----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 5 - filter("E2"."EMP_ID">"E1"."EMP_ID")
* 7 - filter(VALUE(KOKBF$)=VALUE(KOKBF$))
With the assumption that your "rooms" varrays may contain duplicates, there's one more tweak to do - making each employee's rooms distinct, which leads us to the (hopefully) final query...
with rooms_unnested$ as (
select distinct
E.emp_id, E.name, R.column_value as room
from employee E
cross join table(E.rooms) R
)
select
E1.name as employee_1, E2.name as employee_2,
E1.room as the_matching_room
from rooms_unnested$ E1
join rooms_unnested$ E2
on E2.emp_id > E1.emp_id
and E2.room = E1.room
;
... which also happens to resolve the "issue" with cartesians by unnesting the "rooms" varray first (and only once!) and equi-hash-joining afterwards:
---------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
---------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 120 | 65 | 00:00:01 |
| 1 | TEMP TABLE TRANSFORMATION | | | | | |
| 2 | LOAD AS SELECT (CURSOR DURATION MEMORY) | SYS_TEMP_0FD9D6699_11FF28DD | | | | |
| 3 | HASH UNIQUE | | 3 | 36 | 61 | 00:00:01 |
| 4 | NESTED LOOPS | | 16336 | 196032 | 59 | 00:00:01 |
| 5 | TABLE ACCESS FULL | EMPLOYEE | 2 | 20 | 3 | 00:00:01 |
| 6 | COLLECTION ITERATOR PICKLER FETCH | | 8168 | 16336 | 28 | 00:00:01 |
| * 7 | HASH JOIN | | 1 | 120 | 4 | 00:00:01 |
| 8 | VIEW | | 3 | 180 | 2 | 00:00:01 |
| 9 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6699_11FF28DD | 3 | 36 | 2 | 00:00:01 |
| 10 | VIEW | | 3 | 180 | 2 | 00:00:01 |
| 11 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6699_11FF28DD | 3 | 36 | 2 | 00:00:01 |
---------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 7 - access("E2"."ROOM"="E1"."ROOM")
* 7 - filter("E2"."EMP_ID">"E1"."EMP_ID")
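The "unnest once, make distinct, then equi-join on room" idea can also be sketched outside the database. This is only an illustration in Python, not a replacement for the SQL above; the sample data and the function name shared_rooms are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: emp_id -> (name, rooms); a rooms list may hold duplicates.
employees = {
    1: ("Alice", [101, 102, 101]),
    2: ("Bob", [102, 205]),
    3: ("Carol", [300]),
    4: ("Dave", [205, 101]),
}

def shared_rooms(employees):
    """Return (employee_1, employee_2, room) for every pair sharing a room."""
    # Step 1: unnest each collection once, keeping distinct (room, emp_id) pairs.
    occupants = defaultdict(set)
    for emp_id, (_name, rooms) in employees.items():
        for room in set(rooms):
            occupants[room].add(emp_id)
    # Step 2: equi-join on room. Emitting emp_id pairs in ascending order
    # mirrors the "E2.emp_id > E1.emp_id" condition of the SQL.
    result = []
    for room in sorted(occupants):
        for e1, e2 in combinations(sorted(occupants[room]), 2):
            result.append((employees[e1][0], employees[e2][0], room))
    return result

pairs = shared_rooms(employees)
```

Grouping by room first means each employee pair is only considered for rooms they actually share, which is the same reason the hash join on room is cheaper than the Cartesian join.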
I have a very slow query due to scanning through millions of records. The query searches for how many numbers are in a specific range.
I have 2 tables: numbers_in_ranges and person table
Create table numbers_in_ranges
( range_id number(9,0) ,
begin_range number(9,0),
end_range number(9,0)
) ;
Create table person
(
id integer,
a_number varchar(9),
first_name varchar(25),
last_name varchar(25)
);
Data for numbers_in_ranges
range_id| begin_range | end_range
--------|------------------------
101 | 100000000 | 200000000
102 | 210000000 | 290000000
103 | 350000000 | 459999999
104 | 461000000 | 569999999
106 | 241000000 | 241999999
etc.
Data for person
id | a_number | first_name | last_name
---|------------|------------|-----------
1 | 100000001 | Maria | Doe
2 | 100000999 | Emily | Davis
3 | 150000000 | Dave | Smith
4 | 461000000 | Jane | Jones
6 | 241000001 | John | Doe
7 | 100000002 | Maria | Doe
8 | 100009999 | Emily | Davis
9 | 150000010 | Dave | Smith
10 | 210000001 | Jane | Jones
11 | 210000010 | John | Doe
12 | 281000000 | Jane | Jones
13 | 241000000 | John | Doe
14 | 460000001 | Maria | Doe
15 | 500000999 | Emily | Davis
16 | 550000010 | Dave | Smith
17 | 461000010 | Jane | Jones
18 | 241000020 | John | Doe
etc.
We are getting the range data from a remote database via a database link and storing it in a materialized view.
The query
select nums.range_id, count(p.a_number) as a_count
from numbers_in_ranges nums
left join person p on to_number(p.a_number)
between nums.begin_range and nums.end_range
group by nums.range_id;
The result looks like
range_id| a_count
--------|------------------------
101 | 6
102 | 5
103 | 2
104 | 3
etc.
As I said, this query is very slow.
Here is the explain plan
Plan hash value: 3785994407
---------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9352 | 264K| | 42601 (31)| 00:00:02 | | | |
| 1 | PX COORDINATOR | | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,02 | P->S | QC (RAND) |
| 3 | HASH GROUP BY | | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,02 | PCWP | |
| 4 | PX RECEIVE | | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,02 | PCWP | |
| 5 | PX SEND HASH | :TQ10001 | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,01 | P->P | HASH |
| 6 | HASH GROUP BY | | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,01 | PCWP | |
| 7 | MERGE JOIN OUTER | | 2084M| 56G| | 37793 (23)| 00:00:02 | Q1,01 | PCWP | |
| 8 | SORT JOIN | | 9352 | 173K| | 3 (34)| 00:00:01 | Q1,01 | PCWP | |
| 9 | PX BLOCK ITERATOR | | 9352 | 173K| | 2 (0)| 00:00:01 | Q1,01 | PCWC | |
| 10 | MAT_VIEW ACCESS FULL | NUMBERS_IN_RANGES | 9352 | 173K| | 2 (0)| 00:00:01 | Q1,01 | PCWP | |
|* 11 | FILTER | | | | | | | Q1,01 | PCWP | |
|* 12 | SORT JOIN | | 89M| 850M| 2732M| 29681 (1)| 00:00:02 | Q1,01 | PCWP | |
| 13 | BUFFER SORT | | | | | | | Q1,01 | PCWC | |
| 14 | PX RECEIVE | | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,01 | PCWP | |
| 15 | PX SEND BROADCAST | :TQ10000 | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,00 | P->P | BROADCAST |
| 16 | PX BLOCK ITERATOR | | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,00 | PCWC | |
| 17 | INDEX FAST FULL SCAN| PERSON_AN_IDX | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,00 | PCWP | |
---------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
11 - filter("NUMS"."END_RANGE">=TO_NUMBER("P"."A_NUMBER"(+)))
12 - access("NUMS"."BEGIN_RANGE"<=TO_NUMBER("P"."A_NUMBER"(+)))
filter("NUMS"."BEGIN_RANGE"<=TO_NUMBER("P"."A_NUMBER"(+)))
Note
-----
- automatic DOP: Computed Degree of Parallelism is 16 because of degree limit
I tried to run the deltas for the month and then append them to the table, like:
if a new range_id is found, then insert
if an existing range_id is found, then update
So we don't have to scan the whole table.
But this solution didn't work because some ranges are updated, and splicing happens, for example:
We create a new range_id = 110 with ranges between 100110000 and 210000001
then range_id = 101 is spliced to 100000000 and 100110000
and range_id = 102 is spliced to 100110001 and 210000000 ;
Now I thought of creating a trigger for when a new range is created or updated to update that table; however, that is impossible since we are getting this data from a remote database that stores the data into a Materialized View, and we cannot put a trigger on a read-only materialized view.
My question: is there any other way to do this, or a way to optimize this query?
Thank you!
The issue is that Oracle broadcasts the entire set of person numbers (the 89M-row index scan in steps 15-17 of the plan) to every parallel server, which looks quite strange for this case.
However, since you only need to count rows and (it looks like) the intervals do not overlap, you can improve performance and avoid joining the two datasets with a trick: transform the data into an event stream in which each start and end value marks the beginning and end of a series, then count the events that fall inside each series. This lets you use match_recognize, which in this test is dramatically faster than the join.
The query will be:
with ranges_unpivot as (
/*Transform from_ ... to_... to the event-like structure*/
select
id
, val
, val_type
from ranges_table
unpivot(
val for val_type in (from_num as '01_START', to_num as '03_END')
)
union all
/*Append the rest of the data to the event stream*/
select
null,
id,
/*
This should be ordered between START mark and END mark
to process edge cases correctly
*/
'02_val'
from other_table
where id <= (select max(to_num) from ranges_table)
)
select /*+parallel(4) gather_plan_statistics*/ *
from ranges_unpivot
match_recognize (
order by val asc, val_type asc
measures
start_.id as range_id,
count(values_.val) as count_
pattern (start_ values_* end_)
define
start_ as val_type = '01_START',
values_ as val_type = '02_val',
end_ as val_type = '03_END'
)
which shows this time in the query plan:
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.33 |
Compared to join query:
select /*+gather_plan_statistics*/
rt.id as range_id,
count(ot.id) as count_
from ranges_table rt
left join other_table ot
on rt.from_num <= ot.id
and rt.to_num >= ot.id
group by rt.id
which shows:
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:13.84 |
See db<>fiddle.
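To make the event-stream idea concrete, here is a minimal sketch in Python of the same sweep that the match_recognize pattern performs. It assumes the ranges are inclusive and do not overlap; the function name count_in_ranges and the sample numbers are made up for the example.

```python
def count_in_ranges(ranges, values):
    """Count how many values fall into each range with a single sorted sweep.

    ranges: iterable of (range_id, begin, end), inclusive and non-overlapping (assumption).
    values: iterable of numbers.
    """
    # Tie-break order at equal positions: START, then VALUE, then END,
    # so that values equal to a boundary are counted (like '01_START' < '02_val' < '03_END').
    START, VALUE, END = 0, 1, 2
    events = []
    for range_id, begin, end in ranges:
        events.append((begin, START, range_id))
        events.append((end, END, range_id))
    for v in values:
        events.append((v, VALUE, None))
    events.sort(key=lambda e: (e[0], e[1]))

    counts = {}
    current = None  # id of the currently open range, if any
    for _pos, kind, range_id in events:
        if kind == START:
            current = range_id
            counts[range_id] = 0
        elif kind == VALUE:
            if current is not None:
                counts[current] += 1
        else:  # END closes the open range
            current = None
    return counts

# Hypothetical data, much smaller than the real tables.
ranges = [(101, 100, 200), (102, 210, 290), (103, 350, 459)]
values = [100, 150, 200, 205, 210, 289, 500]
counts = count_in_ranges(ranges, values)
```

The work is one sort plus one pass over the combined stream, instead of comparing every value against every range. If ranges can overlap, a single "current" range is no longer enough and this sketch does not apply as written.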
I have a table having data as shown below,
+-------+----------------+----------------+
| Id | HierUnitId | ObjectNumber |
+-------+----------------+----------------+
| 10 | 3599 | 1 |
| 10 | 3599 | 2 |
| 20 | 3599 | 3 |
| 20 | 3599 | 4 |
| 20 | 3599 | 1 |
| 30 | 3599 | 2 |
| 30 | 3599 | 3 |
+-------+----------------+----------------+
I have a select query
SELECT ID FROM TEST
FETCH NEXT :LIMIT ROWS ONLY
Now I want to limit the number of rows using the value of limit.
When the value of limit is 2, I want two distinct ids, i.e. up to 5 rows. However, from this query I get only two rows, both with 10 as the id. Can someone help me limit the rows by distinct id?
What I want is for the total number of distinct ids in the output to equal the limit.
Use the DENSE_RANK analytic function to number the rows based on the unique/distinct ID values and then filter on that:
SELECT id
FROM (
SELECT ID,
DENSE_RANK() OVER (ORDER BY id) AS rnk
FROM test
)
WHERE rnk <= 2;
Which, for the sample data:
CREATE TABLE test (Id, HierUnitId, ObjectNumber ) AS
SELECT 10, 3599, 1 FROM DUAL UNION ALL
SELECT 10, 3599, 2 FROM DUAL UNION ALL
SELECT 20, 3599, 3 FROM DUAL UNION ALL
SELECT 20, 3599, 4 FROM DUAL UNION ALL
SELECT 20, 3599, 1 FROM DUAL UNION ALL
SELECT 30, 3599, 2 FROM DUAL UNION ALL
SELECT 30, 3599, 3 FROM DUAL;
Outputs:
ID
10
10
20
20
20
db<>fiddle here
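For readers less familiar with DENSE_RANK, the following Python sketch shows what the ranking does on the sample rows. It is only an illustration of the logic; the function name limit_by_distinct_id is made up.

```python
def limit_by_distinct_id(rows, limit):
    """Keep every row whose id is among the `limit` smallest distinct ids."""
    # Dense rank: equal ids share one rank, and ranks have no gaps (1, 2, 3, ...).
    distinct_ids = sorted({row[0] for row in rows})
    rank_of = {id_: rank for rank, id_ in enumerate(distinct_ids, start=1)}
    return [row for row in rows if rank_of[row[0]] <= limit]

# (Id, HierUnitId, ObjectNumber) rows from the question.
rows = [
    (10, 3599, 1), (10, 3599, 2),
    (20, 3599, 3), (20, 3599, 4), (20, 3599, 1),
    (30, 3599, 2), (30, 3599, 3),
]
kept = limit_by_distinct_id(rows, 2)
```

Because the rank is attached to the id value rather than to the row, all rows of the first two ids survive the filter, however many rows each id has.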
As you said in the comment, you need to be able to define how many distinct ids should be shown. In that case I'd recommend finding those ids first (see the distinct_ids part) and fetching all the rows you need afterwards:
with distinct_ids as (
select distinct id
from test_data
order by id
fetch first :limit rows only)
select td.id
from test_data td
join distinct_ids di
on td.id = di.id
If you need some distinct IDs in no particular order, you can put fetch next ... into the subquery with the distinct keyword. An index on the ID column helps avoid two full table scans (I assume that ID cannot be null):
select /*+gather_plan_statistics*/
*
from t
where id in (
select distinct id
from t
where id is not null
fetch next 2 rows only
)
ID | HIERUNITID | OBJECTNUMBER
-: | ---------: | -----------:
1 | 3599 | 1
1 | 3599 | 2
2 | 3599 | 3
2 | 3599 | 4
2 | 3599 | 1
select *
from table(dbms_xplan.display_cursor(
format => 'ALL -PROJECTION -ALIAS ALLSTATS LAST'
))
| PLAN_TABLE_OUTPUT |
| :------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| SQL_ID 2sqqq53kpy5rj, child number 0 |
| ------------------------------------- |
| select /*+gather_plan_statistics*/ * from t where id in ( select |
| distinct id from t where id is not null fetch next 2 rows only ) |
| |
| Plan hash value: 534568331 |
| |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| | Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| | 0 | SELECT STATEMENT | | 1 | | | 5 (100)| | 5 |00:00:00.01 | 3 | | | | |
| | 1 | MERGE JOIN SEMI | | 1 | 5 | 115 | 5 (40)| 00:00:01 | 5 |00:00:00.01 | 3 | | | | |
| | 2 | TABLE ACCESS BY INDEX ROWID| T | 1 | 7 | 70 | 2 (0)| 00:00:01 | 6 |00:00:00.01 | 2 | | | | |
| | 3 | INDEX FULL SCAN | T_IX | 1 | 7 | | 1 (0)| 00:00:01 | 6 |00:00:00.01 | 1 | | | | |
| |* 4 | SORT UNIQUE | | 6 | 2 | 26 | 3 (67)| 00:00:01 | 5 |00:00:00.01 | 1 | 2048 | 2048 | 2048 (0)| |
| | 5 | VIEW | VW_NSO_1 | 1 | 2 | 26 | 2 (50)| 00:00:01 | 2 |00:00:00.01 | 1 | | | | |
| |* 6 | VIEW | | 1 | 2 | 20 | 2 (50)| 00:00:01 | 2 |00:00:00.01 | 1 | | | | |
| |* 7 | WINDOW NOSORT STOPKEY | | 1 | 3 | 39 | 2 (50)| 00:00:01 | 2 |00:00:00.01 | 1 | 73728 | 73728 | | |
| | 8 | VIEW | | 1 | 3 | 39 | 2 (50)| 00:00:01 | 2 |00:00:00.01 | 1 | | | | |
| | 9 | SORT UNIQUE NOSORT | | 1 | 3 | 9 | 2 (50)| 00:00:01 | 2 |00:00:00.01 | 1 | | | | |
| |* 10 | INDEX FULL SCAN | T_IX | 1 | 7 | 21 | 1 (0)| 00:00:01 | 6 |00:00:00.01 | 1 | | | | |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| |
| Predicate Information (identified by operation id): |
| --------------------------------------------------- |
| |
| 4 - access("ID"="ID") |
| filter("ID"="ID") |
| 6 - filter("from$_subquery$_004"."rowlimit_$$_rownumber"<=2) |
| 7 - filter(ROW_NUMBER() OVER ( ORDER BY NULL)<=2) |
| 10 - filter("ID" IS NOT NULL) |
| |
db<>fiddle here
I have the query below, but when I execute it, it runs forever.
WITH aux AS (
SELECT
contract,
contract_account,
business_partner,
payment_plan,
installation,
contract_status
FROM
reta.mv_integrated_md a
WHERE
contract_status IN (
'LIVE',
'FINAL'
)
), aux1 AS (
SELECT
a.*,
CASE
WHEN EXISTS (
SELECT
NULL
FROM
aux b
WHERE
b.business_partner = a.business_partner
AND b.installation = a.installation
AND b.payment_plan = 'BMW'
) THEN
'X'
END h
FROM
aux a
)
SELECT
*
FROM
aux1;
My execution plan shows a huge cost which I cannot locate. How could I optimize this query? I have tried some hints but none of them have worked :(
Plan hash value: 1662974027
----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 19M| 2000M| 825G (1)|999:59:59 | | |
|* 1 | VIEW | | 19M| 990M| 41331 (1)| 00:00:02 | | |
| 2 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FDA49C92_9A7BE8DE | 19M| 1066M| 41331 (1)| 00:00:02 | | |
| 3 | TEMP TABLE TRANSFORMATION | | | | | | | |
| 4 | LOAD AS SELECT | SYS_TEMP_0FDA49C92_9A7BE8DE | | | | | | |
| 5 | PARTITION RANGE SINGLE | | 18M| 974M| 759K (1)| 00:00:30 | 1 | 1 |
|* 6 | TABLE ACCESS STORAGE FULL| MV_INTEGRATED_MD | 18M| 974M| 759K (1)| 00:00:30 | 1 | 1 |
| 7 | VIEW | | 19M| 2000M| 41331 (1)| 00:00:02 | | |
| 8 | TABLE ACCESS STORAGE FULL | SYS_TEMP_0FDA49C92_9A7BE8DE | 19M| 1066M| 41331 (1)| 00:00:02 | | |
----------------------------------------------------------------------------------------------------------------------------
Kindly let me know if any additional information needed.
Use window functions:
SELECT r.contract, r.contract_account, r.business_partner,
r.payment_plan, r.installation, r.contract_status,
MAX(CASE WHEN r.payment_plan = 'BMW' THEN 'X' END) OVER (PARTITION BY business_partner, installation) as h
FROM reta.mv_integrated_md#rbip r
WHERE r.contract_status IN ('LIVE', 'FINAL');
Not only is the query much simpler to write and read, but it should perform much better too.
The highest cost is due to the FTS (full table scan) on the table/MV MV_INTEGRATED_MD.
Try creating an index on contract_status and check whether it reduces the cost. Also, what is the size of this MV/table in blocks, and is it 10 percent or more of the total buffer cache size?
TABLE ACCESS STORAGE FULL| MV_INTEGRATED_MD | 18M| 974M| 759K (1)| 00:00:30 | 1 | 1
If you run your query with the /*+ gather_plan_statistics */ hint (I'm simulating it with a 1000-row table), you immediately see the problem:
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
-------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
-------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.01 | 9 | 5 |
|* 1 | VIEW | | 1000 | 1000 | 1000 |00:00:00.09 | 0 | 0 |
| 2 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6737_1A17DE13 | 1000 | 1000 | 500K|00:00:00.08 | 0 | 0 |
| 3 | TEMP TABLE TRANSFORMATION | | 1 | | 1000 |00:00:00.01 | 9 | 5 |
| 4 | LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6737_1A17DE13 | 1 | | 0 |00:00:00.01 | 8 | 5 |
|* 5 | TABLE ACCESS FULL | MV_INTEGRATED_MD | 1 | 1000 | 1000 |00:00:00.01 | 7 | 5 |
| 6 | VIEW | | 1 | 1000 | 1000 |00:00:00.01 | 0 | 0 |
| 7 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6737_1A17DE13 | 1 | 1000 | 1000 |00:00:00.01 | 0 | 0 |
-------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(("B"."BUSINESS_PARTNER"=:B1 AND "B"."INSTALLATION"=:B2 AND "B"."PAYMENT_PLAN"='BMW'))
5 - filter("CONTRACT_STATUS"='LIVE')
The problem is in line 2, where a full scan is executed in a loop, once for each row of the main table (see Starts = 1000).
Typically you want to resolve the EXISTS with a semi join to preserve good performance, but here it seems that Oracle can not rewrite it.
So you'll need to rewrite the query yourself.
Despite the excellent proposal of @GordonLinoff (which I'd start with), you may try an outer join as follows:
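with bmw as (
select distinct business_partner, installation
from mv_integrated_md
where payment_plan = 'BMW'
-- same status filter as the aux CTE, so the flag matches the EXISTS version
and contract_status in ('LIVE', 'FINAL'))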
SELECT
a.contract,
a.contract_account,
a.business_partner,
a.payment_plan,
a.installation,
a.contract_status,
case when b.business_partner is not null then 'X' end as h
FROM mv_integrated_md a
left outer join bmw b
on b.business_partner = a.business_partner and
b.installation = a.installation
WHERE a.contract_status IN ( 'LIVE', 'FINAL')
This will lead to two full scans, one deduplication and an outer join.
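The difference between the per-row scan (Starts = 1000) and the join rewrite can be illustrated with a small Python sketch. The rows are hypothetical and assumed to be already filtered to the LIVE/FINAL statuses; the function names are made up for the example.

```python
def flag_correlated(rows):
    """Mimics the slow plan: one scan of `rows` per row."""
    scanned = 0
    flags = []
    for a in rows:
        hit = False
        for b in rows:  # the repeated full scan of the temp table
            scanned += 1
            if (b["business_partner"] == a["business_partner"]
                    and b["installation"] == a["installation"]
                    and b["payment_plan"] == "BMW"):
                hit = True
                break  # EXISTS stops at the first match
        flags.append("X" if hit else None)
    return flags, scanned

def flag_set_based(rows):
    """Mimics the rewrite: collect the distinct BMW keys once, then probe per row."""
    bmw = {(r["business_partner"], r["installation"])
           for r in rows if r["payment_plan"] == "BMW"}
    return ["X" if (r["business_partner"], r["installation"]) in bmw else None
            for r in rows]

# Hypothetical 1000-row data set.
rows = [{"business_partner": i % 100,
         "installation": i % 7,
         "payment_plan": "BMW" if i % 50 == 0 else "STD"}
        for i in range(1000)]

slow_flags, scanned = flag_correlated(rows)
fast_flags = flag_set_based(rows)
```

Both functions produce the same flags, but the first one touches far more rows than the table contains, which is the same effect that the A-Rows column of line 2 shows in the plan above.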
I'm using Oracle 11g and I would like to compare two tables to find records that match between them.
Example:
Table 1    Table 2
George     Micheal
Michael    Paul
The records "Micheal" and "Michael" match each other, so they are good records.
To see whether two records match, I use the Oracle function utl_match.edit_distance_similarity.
I tried the code below, but I have a performance problem (it is too slow):
SELECT *
FROM table1
JOIN table2
ON utl_match.edit_distance_similarity(table1.name, table2.name) > 75;
Is there a better solution?
Thank you
This is a hard problem. In general, it is going to result in nested loop joins and slowness. It might be possible to use SOUNDEX() to get "closish" matches and then the character distance function for final filtering. This may not work for your problem, but it might.
Although I am not a big fan of the function, you might find that soundex() works for your purposes.
The idea would be to add an index on this value:
create index idx_table1_soundexname on table1(soundex(name));
create index idx_table2_soundexname on table2(soundex(name));
Then you would query this as:
SELECT *
FROM table1 t1 JOIN
table2 t2
ON soundex(t1.name) = soundex(t2.name)
WHERE utl_match.edit_distance_similarity(t1.name, t2.name) > 75;
The idea is that Oracle will use the indexes to get names that are "close" and then the edit distance to get the better matches. This may not work for your problem. It is just an idea that might work.
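The "cheap key first, expensive comparison second" idea can be sketched in Python as follows. Note the assumptions: the soundex function is a plain implementation of American Soundex, the similarity function is a normalized Levenshtein score that only approximates what utl_match.edit_distance_similarity returns, and the threshold in the usage line is illustrative.

```python
from collections import defaultdict

_CODES = {ch: digit
          for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                 ("L", "4"), ("MN", "5"), ("R", "6"))
          for ch in letters}

def soundex(name):
    """American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels map to "" and reset `prev`
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Score from 0 to 100; an approximation, not UTL_MATCH itself."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)

def fuzzy_join(names1, names2, threshold):
    # Block on the Soundex key, then run the expensive comparison inside each block only.
    buckets = defaultdict(list)
    for n2 in names2:
        buckets[soundex(n2)].append(n2)
    return [(n1, n2)
            for n1 in names1
            for n2 in buckets.get(soundex(n1), [])
            if similarity(n1, n2) > threshold]

matches = fuzzy_join(["George", "Michael"], ["Micheal", "Paul"], threshold=70)
```

The trade-off is the same as in the SQL: pairs whose Soundex keys differ are never compared, so some true matches can be missed in exchange for far fewer comparisons.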
If the name values in table1 and table2 are highly redundant (many repeated names), this could be a solution:
-- Test data set
select count(*) from table1;
--> 10.000
select count(*) from table2;
--> 10.000
select count(distinct(name)) from table1;
--> ~ 2500
select count(distinct(name)) from table2;
--> ~ 2500
/* a) Join with function compare */
select table1.name, table2.name
from table1, table2
where utl_match.edit_distance_similarity(table1.name, table2.name) > 35
/*
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5000000 | 270000000 | 37364 | 00:09:21 |
| 1 | NESTED LOOPS | | 5000000 | 270000000 | 37364 | 00:09:21 |
| 2 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| * 3 | TABLE ACCESS FULL | TABLE2 | 500 | 13500 | 4 | 00:00:01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 3 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("TABLE1"."NAME","TABLE2"."NAME")>35)
Note
-----
- dynamic sampling used for this statement
*/
/* b) Join with function, only distinct values */
-- A Set of all existing names (in table1 and table2)
with names as
(select name from table1 union select name from table2),
-- Compare only once because utl_match.edit_distance_similarity(name1, name2) = utl_match.edit_distance_similarity(name2, name1)
table_cmp(name1, name2) as
(select n1.name, n2.name
from names n1
join names n2
on n1.name <= n2.name
and utl_match.edit_distance_similarity(n1.name, n2.name) > 35)
select t1.*, t2.*
from table_cmp c
join table1 t1
on t1.name = c.name1
join table2 t2
on t2.name = c.name2
union all
select t1.*, t2.*
from table_cmp c
join table1 t1
on t1.name = c.name2
join table2 t2
on t2.name = c.name1;
/*
--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
--------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 30469950 | 3290754600 | 2495 | 00:00:38 |
| 1 | TEMP TABLE TRANSFORMATION | | | | | |
| 2 | LOAD AS SELECT | SYS_TEMP_0FD9D663E_B39FC2B6 | | | | |
| 3 | SORT UNIQUE | | 20000 | 540000 | 12 | 00:00:01 |
| 4 | UNION-ALL | | | | | |
| 5 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| 6 | TABLE ACCESS FULL | TABLE2 | 10000 | 270000 | 5 | 00:00:01 |
| 7 | LOAD AS SELECT | SYS_TEMP_0FD9D663F_B39FC2B6 | | | | |
| 8 | MERGE JOIN | | 1000000 | 54000000 | 62 | 00:00:01 |
| 9 | SORT JOIN | | 20000 | 540000 | 3 | 00:00:01 |
| 10 | VIEW | | 20000 | 540000 | 2 | 00:00:01 |
| 11 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663E_B39FC2B6 | 20000 | 540000 | 2 | 00:00:01 |
| * 12 | FILTER | | | | | |
| * 13 | SORT JOIN | | 20000 | 540000 | 3 | 00:00:01 |
| 14 | VIEW | | 20000 | 540000 | 2 | 00:00:01 |
| 15 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663E_B39FC2B6 | 20000 | 540000 | 2 | 00:00:01 |
| 16 | UNION-ALL | | | | | |
| * 17 | HASH JOIN | | 15234975 | 1645377300 | 1248 | 00:00:19 |
| 18 | TABLE ACCESS FULL | TABLE2 | 10000 | 270000 | 5 | 00:00:01 |
| * 19 | HASH JOIN | | 3903201 | 316159281 | 1200 | 00:00:18 |
| 20 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| 21 | VIEW | | 1000000 | 54000000 | 1183 | 00:00:18 |
| 22 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663F_B39FC2B6 | 1000000 | 54000000 | 1183 | 00:00:18 |
| * 23 | HASH JOIN | | 15234975 | 1645377300 | 1248 | 00:00:19 |
| 24 | TABLE ACCESS FULL | TABLE2 | 10000 | 270000 | 5 | 00:00:01 |
| * 25 | HASH JOIN | | 3903201 | 316159281 | 1200 | 00:00:18 |
| 26 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| 27 | VIEW | | 1000000 | 54000000 | 1183 | 00:00:18 |
| 28 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663F_B39FC2B6 | 1000000 | 54000000 | 1183 | 00:00:18 |
--------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 12 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("N1"."NAME","N2"."NAME")>35)
* 13 - access("N1"."NAME"<="N2"."NAME")
* 13 - filter("N1"."NAME"<="N2"."NAME")
* 17 - access("T2"."NAME"="C"."NAME2")
* 19 - access("T1"."NAME"="C"."NAME1")
* 23 - access("T2"."NAME"="C"."NAME1")
* 25 - access("T1"."NAME"="C"."NAME2")
Note
-----
- dynamic sampling used for this statement
*/
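The core of variant b), comparing each unordered pair of distinct names only once and expanding back to rows afterwards, can be sketched in Python like this. It assumes the comparison is symmetric; the function name match_via_distinct and the stand-in predicate are made up for the example.

```python
from collections import Counter

def match_via_distinct(names1, names2, is_match):
    """Evaluate is_match once per unordered pair of distinct names.

    Returns ({(name_in_table1, name_in_table2): number_of_row_pairs}, comparisons).
    """
    c1, c2 = Counter(names1), Counter(names2)
    distinct = sorted(set(c1) | set(c2))
    comparisons = 0
    result = {}
    for i, a in enumerate(distinct):
        for b in distinct[i:]:  # a <= b: each unordered pair is visited once
            comparisons += 1
            if not is_match(a, b):
                continue
            # Expand in both directions; the set collapses (a, a) to a single entry.
            for n1, n2 in {(a, b), (b, a)}:
                if c1[n1] and c2[n2]:
                    result[(n1, n2)] = c1[n1] * c2[n2]
    return result, comparisons

# Hypothetical data and a cheap stand-in for the similarity test.
table1 = ["Ann", "Ann", "Bob"]
table2 = ["Anne", "Bob", "Bob"]
same_prefix = lambda x, y: x[:2] == y[:2]
row_pairs, comparisons = match_via_distinct(table1, table2, same_prefix)
```

With three distinct names the predicate runs 6 times here, against 9 for a row-by-row comparison of the two three-row tables; the saving grows with the amount of repetition in the name columns.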
In a simple join, I would like to limit the results of the first table. So I thought about doing this :
WITH events AS (SELECT event_id FROM risk_event WHERE status = 'ABC' AND rownum <= 20)
SELECT event_id
FROM events ev, attributes att
WHERE ev.event_id = att.risk_event_id
FOR UPDATE NOWAIT
The problem is that I get an ORA-02014: cannot select FOR UPDATE from view exception because of the combination of rownum <= 20 and FOR UPDATE NOWAIT.
I know that I can do it with an inner IN clause as well, but I'm wondering if there is a better way?
First select the rowid, and then query the table from which you selected that rowid.
DDL:
create table risk_event as select level as event, mod(level,20) as status from dual connect by level <=10000;
begin
dbms_stats.gather_table_stats(user,
'risk_event',
cascade => true,
estimate_percent => null,
method_opt => 'for all columns size 1');
end;
/
create table attributes as select * from risk_event;
begin
dbms_stats.gather_table_stats(user,
'attributes',
cascade => true,
estimate_percent => null,
method_opt => 'for all columns size 1');
end;
/
Code
WITH events AS (SELECT rowid as rd from risk_event WHERE status = 19 AND rownum <= 20)
SELECT ev.*
FROM risk_event ev, attributes att
WHERE ev.event = att.event and ev.rowid in(select rd from events)
FOR UPDATE NOWAIT
Plan
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 23 | 11 | 00:00:01 |
| 1 | FOR UPDATE | | | | | |
| 2 | BUFFER SORT | | | | | |
| * 3 | HASH JOIN | | 1 | 23 | 11 | 00:00:01 |
| 4 | NESTED LOOPS | | 1 | 19 | 4 | 00:00:01 |
| 5 | VIEW | VW_NSO_1 | 20 | 240 | 2 | 00:00:01 |
| 6 | SORT UNIQUE | | 1 | 240 | | |
| 7 | VIEW | | 20 | 240 | 2 | 00:00:01 |
| * 8 | COUNT STOPKEY | | | | | |
| * 9 | TABLE ACCESS FULL | RISK_EVENT | 20 | 140 | 2 | 00:00:01 |
| 10 | TABLE ACCESS BY USER ROWID | RISK_EVENT | 1 | 7 | 1 | 00:00:01 |
| 11 | TABLE ACCESS FULL | ATTRIBUTES | 10000 | 40000 | 7 | 00:00:01 |
-----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 3 - access("EV"."EVENT"="ATT"."EVENT")
* 8 - filter(ROWNUM<=20)
* 9 - filter("STATUS"=19)