Oracle TOP-N optimization - SQL

I'm having trouble optimizing a statement. The corresponding table (INTERVAL_TBL) contains about 11,000,000 rows, which causes that statement to take roughly 8 seconds on my test system. Even on a dedicated Oracle server (24 GB RAM, 17 GB DB size) it takes about 4-5 seconds.
SELECT
    ID, STATUS_ID, INTERVAL_ID, BEGIN_TS, END_TS, PT, DISPLAYTEXT, RSC
FROM
(
    SELECT
        INTERVAL_TBL.ID, INTERVAL_TBL.INTERVAL_ID, INTERVAL_TBL.STATUS_ID,
        INTERVAL_TBL.BEGIN_TS, INTERVAL_TBL.END_TS, INTERVAL_TBL.PT,
        ST_TBL.DISPLAYTEXT, ST_TBL.RSC,
        RANK() OVER (ORDER BY BEGIN_TS DESC) MY_RANK
    FROM INTERVAL_TBL
    INNER JOIN ST_TBL ON ST_TBL.STATUS_ID = INTERVAL_TBL.STATUS_ID
    WHERE ID = '<id>'
)
WHERE MY_RANK <= 10
First of all, I'd like to know if there is a way to optimize the statement (select the most recent rows ordered by BEGIN_TS).
Second, I'd like to know if someone can suggest an index based on the statement.
EDIT:
Explain Plan:
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 525K| 79M| 58469 (1)| 00:00:03 |
|* 1 | HASH JOIN | | 525K| 79M| 58469 (1)| 00:00:03 |
| 2 | TABLE ACCESS FULL| ST_TBL | 46 | 2438 | 3 (0)| 00:00:01 |
|* 3 | TABLE ACCESS FULL| INTERVAL_TBL | 525K| 52M| 58464 (1)| 00:00:03 |
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("ST_TBL"."STATUS_ID"="INTERVAL_TBL"."STATUS_ID")
3 - filter("INTERVAL_TBL"."ID"='aef830a6-275b-4713-90da-9135f3f91a32')
Rows in INTERVAL_TBL: 10,673,122
Rows in ST_TBL: 46
Rows in joined subset: 10,673,122
Rows in joined subset with filter on ID: 530,073
Ideally it would get down to a few milliseconds; that's what the statement takes in MS SQL Server with 10,000,000 rows.

First of all, I would perform the inner join to ST_TBL in the outer scope; I got a speed-up with this. It reduces the cost, IO cost, and bytes in my execution plan when comparing both variants.
I assume that the columns PT, DISPLAYTEXT and RSC are part of the table ST_TBL.
SELECT
    SRESULT.ID, SRESULT.STATUS_ID, SRESULT.INTERVAL_ID, SRESULT.BEGIN_TS, SRESULT.END_TS,
    ST_TBL.PT, ST_TBL.DISPLAYTEXT, ST_TBL.RSC
FROM
(
    SELECT
        ID, INTERVAL_ID, INTERVAL_TBL.STATUS_ID, BEGIN_TS, END_TS,
        RANK() OVER (ORDER BY BEGIN_TS DESC) MY_RANK
    FROM INTERVAL_TBL
    WHERE ID = '<id>'
) SRESULT
INNER JOIN ST_TBL ON ST_TBL.STATUS_ID = SRESULT.STATUS_ID
WHERE MY_RANK <= 10
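The question also asked for an index suggestion. A composite index whose leading column is the filter (ID) and whose second column is the sort key (BEGIN_TS) would let Oracle find the rows for one ID already in the right order, instead of full-scanning 10M+ rows and ranking ~530K of them. A hedged sketch (the index name is made up):

```sql
-- Filter column first, then the sort column (descending, to match the ORDER BY)
CREATE INDEX IX_INTERVAL_ID_BEGIN
    ON INTERVAL_TBL (ID, BEGIN_TS DESC);
```

With such an index in place, the inner query's WHERE ID = '<id>' combined with the descending sort can potentially be satisfied by an index range scan with a stopkey rather than a full table scan; verify with the execution plan on your data.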

Related

nested query with two tables takes more than 20 minutes

Here is our query, which took 1229.206 seconds to execute (returning 8,310,286 rows):
SELECT t_01.object_uid
FROM HashTable t_01
WHERE t_01.object_uid IN (SELECT t_02.puid
FROM ObjectTable t_02
WHERE (t_02.arev_category IN (48, 40)))
Plan hash value: 1560846306
------------------------------------------------------------------------------------------------
| Id | Operation | Name | E-Rows |E-Bytes| Cost (%CPU)|
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 780K(100)|
| 1 | NESTED LOOPS SEMI | | 7764K| 244M| 780K (1)|
| 2 | INDEX FULL SCAN | PIHashTable | 7764K| 111M| 4073 (1)|
|* 3 | TABLE ACCESS BY INDEX ROWID BATCHED| ObjectTable | 290M| 4986M| 1 (0)|
|* 4 | INDEX RANGE SCAN | PIOBJECTTABLE | 1 | | 1 (0)|
------------------------------------------------------------------------------------------------
table HashTable has 51154 blocks, last analyzed 2022/04/19 with 7764715 rows
index PIHashTable on HashTable (OBJECT_UID) last analyzed 2022/04/19 over 7764715 rows
table ObjectTable has 3327616 blocks, last analyzed 2022/05/02 with 290473386 rows
index PIPPOM_OBJECT on ObjectTable (PUID) last analyzed 2022/05/02 over 290473386 rows
Table ObjectTable has 290 million rows and HashTable has 7 million rows.
Any way to optimize this?
It's most likely going to take a while to return 8M+ rows. You should implement paging instead of returning all of the rows at once. It looks like you're using Oracle, so try this:
select *
from (
SELECT t_01.object_uid, row_number() over(order by t_01.object_uid) r
FROM HashTable t_01
WHERE t_01.object_uid IN (
SELECT t_02.puid FROM ObjectTable t_02 WHERE (t_02.arev_category IN (48,40))
)
)
where r between 1 and 1000;
Then you can get rows 1-1000, then 1001-2000, and so on. This has the added benefit of using less memory and CPU on the server per query, and it also returns data sooner, so you (or the user) can act on it while additional data is loaded in the background.
If you implement this as a join and put an index on t_02.arev_category it will be very quick.
SELECT t_01.object_uid
FROM HashTable t_01
JOIN ObjectTable t_02 ON t_02.arev_category IN (48, 40) and t_02.puid = t_01.object_uid
If the object table can contain the same puid for both categories 40 and 48 then do it like this:
SELECT t_01.object_uid
FROM HashTable t_01
JOIN (SELECT DISTINCT puid
FROM ObjectTable
WHERE arev_category IN (48, 40)
) t_02 ON t_02.puid = t_01.object_uid
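The index mentioned above is not spelled out in the answer. One possible shape (the index name is illustrative) puts the filter column first and includes the join key, so the category filter and the semi-join can be resolved from the index alone without touching the table:

```sql
CREATE INDEX IX_OBJ_CAT_PUID
    ON ObjectTable (arev_category, puid);
```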

Improve performance of NOT EXISTS in case of large tables

What I am trying to accomplish is getting rows from one table that do not match another table based on specific filters. The two tables are relatively huge so I am trying to filter them based on a certain time range.
The steps I went through so far.
Get the IDs from "T1" for the last 3 days
SELECT
id
FROM T1
WHERE STARTTIME BETWEEN '3 days ago' AND 'now';
Execution time is 4.5s.
Get the IDs from "T2" for the last 3 days
SELECT
id
FROM T2
WHERE STARTTIME BETWEEN '3 days ago' AND 'now';
Execution time is 2.5s.
Now I try to use NOT EXISTS to merge the results from both statements into one
SELECT
CID
FROM T1
WHERE STARTTIME BETWEEN '3 days ago' AND 'now'
AND NOT EXISTS (
SELECT NULL FROM T2
WHERE T1.ID = T2.ID
AND STARTTIME BETWEEN '3 days ago' AND 'now'
);
Execution time is 23s.
I also tried the INNER JOIN logic from this answer thinking it makes sense, but I got no results, so I cannot properly evaluate it.
Is there a better way to construct this statement that could possibly lead to a faster execution time?
19.01.2022 - Update based on comments
Expected result can contain any number of rows between 1 and 10 000
The used columns have the following indexes:
CREATE INDEX IX_T1_CSTARTTIME
ON T1 (CSTARTTIME ASC)
TABLESPACE MYHOSTNAME_DATA1;
CREATE INDEX IX_T2_CSTARTTIME
ON T2 (CSTARTTIME ASC)
TABLESPACE MYHOSTNAME_DATA2;
NOTE: Just noticed that the indexes are located on different table spaces, could this be a potential issue as well?
Following the excellent comments from Marmite Bomber here is the execution plan for the statement:
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 21773 | 2019K| | 1817K (1)| 00:01:12 |
|* 1 | HASH JOIN RIGHT ANTI| | 21773 | 2019K| 112M| 1817K (1)| 00:01:12 |
|* 2 | TABLE ACCESS FULL | T2 | 2100K| 88M| | 1292K (1)| 00:00:51 |
|* 3 | TABLE ACCESS FULL | T1 | 2177K| 105M| | 512K (1)| 00:00:21 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("T2"."ID"="T1"."ID")
2 - filter("STARTTIME">=1642336690000 AND "T2"."ID" IS NOT NULL
AND "STARTTIME"<=1642595934000)
3 - filter("STARTTIME">=1642336690000 AND
"STARTTIME"<=1642595934000)
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - (#keys=1; rowset=256) "T1"."ID"[CHARACTER,38]
2 - (rowset=256) "T2"."ID"[CHARACTER,38]
3 - (rowset=256) "ID"[CHARACTER,38]
Is there a better way to construct this statement that could possibly lead to a faster execution time?
Your basic responsibility is to write the SQL statement; the basic responsibility of Oracle is to come up with an execution plan.
If you are not satisfied (but you should know that combining two sources using NOT EXISTS will take longer than the sum of the times to extract the data from each source), your first step should be to verify the execution plan, not to try to rewrite the statement.
See some more details on how to proceed here:
EXPLAIN PLAN SET STATEMENT_ID = 'stmt1' into plan_table FOR
SELECT
PAD
FROM T1
WHERE STARTTIME BETWEEN date'2021-01-11' AND date'2021-01-13'
AND NOT EXISTS (
SELECT NULL FROM T2
WHERE T1.ID = T2.ID
AND STARTTIME BETWEEN date'2021-01-11' AND date'2021-01-13'
);
SELECT * FROM table(DBMS_XPLAN.DISPLAY('plan_table', 'stmt1','ALL'));
This is what you should see
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1999 | 150K| 10175 (1)| 00:00:01 |
|* 1 | HASH JOIN RIGHT ANTI| | 1999 | 150K| 10175 (1)| 00:00:01 |
|* 2 | TABLE ACCESS FULL | T2 | 2002 | 26026 | 4586 (1)| 00:00:01 |
|* 3 | TABLE ACCESS FULL | T1 | 4002 | 250K| 5589 (1)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("T1"."ID"="T2"."ID")
2 - filter("STARTTIME"<=TO_DATE(' 2021-01-13 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "STARTTIME">=TO_DATE(' 2021-01-11 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))
3 - filter("STARTTIME"<=TO_DATE(' 2021-01-13 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "STARTTIME">=TO_DATE(' 2021-01-11 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))
Note that the hash join (here ANTI, due to the NOT EXISTS) is the best way to join two large row sources. Note also that the plan does not use indexes. The reason is the same: to access a large amount of data, you do not want to go through an index.
This is contrary to the case of low-cardinality row sources (OLTP), where you would expect to see index access and a NESTED LOOPS ANTI.
Sometimes Oracle gets confused (e.g. by stale statistics) and decides to go the NESTED LOOPS way even for large data, which leads to a long elapsed time.
This should help you at least to decide if you have a problem or not.
Perhaps a simple MINUS operation will accomplish what you are looking for:
select id
from ( select id
from t1
where starttime between '3 days ago' and 'now'
MINUS
select id
from t2
where starttime between '3 days ago' and 'now'
);
for however you actually define starttime between '3 days ago' and 'now'. This uses your current queries as-is: the MINUS operation removes from the first result set the values that exist in the second and returns the remainder. See the MINUS demo here.
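Whichever form is used (NOT EXISTS or MINUS), both sides still have to be filtered by the time range, so a hash anti-join over two full scans may remain the best plan. If it is still too slow, composite indexes covering the range filter plus the join key would at least allow index-only access on both sides. A sketch, assuming the filter column really is STARTTIME (the DDL shown in the question uses CSTARTTIME, so adjust the names accordingly):

```sql
CREATE INDEX IX_T1_START_ID ON T1 (STARTTIME, ID);
CREATE INDEX IX_T2_START_ID ON T2 (STARTTIME, ID);
```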

rownum / fetch first n rows

select * from Schem.Customer
where cust='20' and cust_id >= '890127'
and rownum between 1 and 2 order by cust, cust_id;
Execution time appr 2 min 10 sec
select * from Schem.Customer where cust='20'
and cust_id >= '890127'
order by cust, cust_id fetch first 2 rows only ;
Execution time appr 00.069 ms
The execution time is a huge difference, but the results are the same. My team is not adopting the latter one; don't ask why.
So what is the difference between ROWNUM and FETCH FIRST 2 ROWS, and what should I do to improve the query or to convince anyone to adopt the faster form?
DBMS : DB2 LUW
Although both SQL statements end up giving the same result set, that only happens with your data; there is a good chance the result sets could differ. Let me explain why.
I will simplify your SQL to make it easier to understand:
SELECT * FROM customer
WHERE ROWNUM BETWEEN 1 AND 2;
In this SQL, you want only the first and second rows. That's fine. DB2 will optimize your query and never look at rows beyond the 2nd, because only the first 2 rows qualify.
Then you add ORDER BY clause:
SELECT * FROM customer
WHERE ROWNUM BETWEEN 1 AND 2
ORDER BY cust, cust_id;
In this case, DB2 first fetches 2 rows, then orders them by cust and cust_id, then sends them to the client (you). So far so good. But what if you want to order by cust and cust_id first, and then ask for the first 2 rows? There is a great difference between the two.
This is the simplified SQL for this case:
SELECT * FROM customer
ORDER BY cust, cust_id
FETCH FIRST 2 ROWS ONLY;
In this SQL, ALL rows qualify for the query, so DB2 fetches all of the rows, then sorts them, then sends the first 2 rows to the client.
In your case, both queries give the same results only because the first 2 rows fetched happen to already be ordered by cust and cust_id. It would not work if the first 2 physical rows had different cust and cust_id values.
A hint about this: FETCH FIRST n ROWS comes after the ORDER BY, which means DB2 orders the result and only then retrieves the first n rows.
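To make the point concrete: to get ROWNUM semantics that match FETCH FIRST, the ORDER BY must happen in an inline view before ROWNUM is applied. A sketch of the classic pattern (assuming your DB2 LUW accepts ROWNUM at all, i.e. Oracle compatibility is enabled):

```sql
SELECT *
FROM (
    SELECT *
    FROM Schem.Customer
    WHERE cust = '20' AND cust_id >= '890127'
    ORDER BY cust, cust_id    -- sort first, inside the inline view
)
WHERE ROWNUM <= 2;            -- then keep the top 2 of the sorted result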
Excellent answer here:
https://blog.dbi-services.com/oracle-rownum-vs-rownumber-and-12c-fetch-first/
Now the index range scan is chosen, with the right cardinality estimation.
So which solution is the best one? I prefer row_number() for several reasons:
I like analytic functions. They have larger possibilities, such as setting the limit as a percentage of the total number of rows, for example.
11g documentation for rownum says:
The ROW_NUMBER built-in SQL function provides superior support for ordering the results of a query
12c allows the ANSI syntax ORDER BY…FETCH FIRST…ROWS ONLY, which is translated to a row_number() predicate
12c documentation for rownum adds:
The row_limiting_clause of the SELECT statement provides superior support
rownum has first_rows_n issues as well
PLAN_TABLE_OUTPUT
SQL_ID 49m5a3f33cmd0, child number 0
-------------------------------------
select /*+ FIRST_ROWS(10) */ * from test where contract_id=500
order by start_validity fetch first 10 rows only
Plan hash value: 1912639229
--------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | Buffers |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 10 | 15 |
|* 1 | VIEW | | 1 | 10 | 10 | 15 |
|* 2 | WINDOW NOSORT STOPKEY | | 1 | 10 | 10 | 15 |
| 3 | TABLE ACCESS BY INDEX ROWID| TEST | 1 | 10 | 11 | 15 |
|* 4 | INDEX RANGE SCAN | TEST_PK | 1 | | 11 | 4 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber" <=10)
2 - filter(ROW_NUMBER() OVER ( ORDER BY "TEST"."START_VALIDITY") <=10 )
4 - access("CONTRACT_ID"=500)

SQL script runs VERY slowly with small change

I am relatively new to SQL. I have a script that used to run very quickly (<0.5 seconds) but runs very slowly (>120 seconds) if I add one change - and I can't see why this change makes such a difference. Any help would be hugely appreciated!
This is the script; it runs quickly if I do NOT include "tt2.bulk_cnt" in line 26:
with bulksum1 as
(
select t1.membercode,
t1.schemecode,
t1.transdate
from mina_raw2 t1
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
group by t1.membercode,
t1.schemecode,
t1.transdate
),
bulksum2 as
(
select t1.schemecode,
t1.transdate,
count(*) as bulk_cnt
from bulksum1 t1
group by t1.schemecode,
t1.transdate
having count(*) >= 10
),
results as
(
select t1.*, tt2.bulk_cnt
from mina_raw2 t1
inner join bulksum2 tt2
on t1.schemecode = tt2.schemecode and t1.transdate = tt2.transdate
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
)
select * from results
EDIT: I apologise for not putting enough detail in here previously - although I can use basic SQL code, I am a complete novice when it comes to databases.
Database: Oracle (I'm not sure which version, sorry)
Execution plans:
QUICK query:
Plan hash value: 1712123489
---------------------------------------------
| Id | Operation | Name |
---------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH JOIN | |
| 2 | VIEW | |
| 3 | FILTER | |
| 4 | HASH GROUP BY | |
| 5 | VIEW | VM_NWVW_0 |
| 6 | HASH GROUP BY | |
| 7 | TABLE ACCESS FULL| MINA_RAW2 |
| 8 | TABLE ACCESS FULL | MINA_RAW2 |
---------------------------------------------
SLOW query:
Plan hash value: 1298175315
--------------------------------------------
| Id | Operation | Name |
--------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | FILTER | |
| 2 | HASH GROUP BY | |
| 3 | HASH JOIN | |
| 4 | VIEW | VM_NWVW_0 |
| 5 | HASH GROUP BY | |
| 6 | TABLE ACCESS FULL| MINA_RAW2 |
| 7 | TABLE ACCESS FULL | MINA_RAW2 |
--------------------------------------------
A few observations, and then some things to do:
1) More information is needed. In particular, how many rows are there in the MINA_RAW2 table, what indexes exist on this table, and when was the last time it was analyzed? To determine the answers to these questions, run:
SELECT COUNT(*) FROM MINA_RAW2;
SELECT TABLE_NAME, LAST_ANALYZED, NUM_ROWS
FROM USER_TABLES
WHERE TABLE_NAME = 'MINA_RAW2';
From looking at the plan output it looks like the database is doing two FULL SCANs on MINA_RAW2 - it would be nice if this could be reduced to no more than one, and hopefully none. It's always tough to tell without very detailed information about the data in the table, but at first blush it appears that an index on TRANSACTIONTYPE might be helpful. If such an index doesn't exist you might want to consider adding it.
2) Assuming that the statistics are out-of-date (as in, old, nonexistent, or a significant amount of data (> 10%) has been added, deleted, or updated since the last analysis) run the following:
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(owner => 'YOUR-SCHEMA-NAME',
table_name => 'MINA_RAW2');
END;
substituting the correct schema name for "YOUR-SCHEMA-NAME" above. Remember to capitalize the schema name! If you don't know if you should or shouldn't gather statistics, err on the side of caution and do it. It shouldn't take much time.
3) Re-try your existing query after updating the table statistics. I think there's a fair chance that having up-to-date statistics in the database will solve your issues. If not:
4) This query is doing a GROUP BY on the results of a GROUP BY. This doesn't appear to be necessary as the initial GROUP BY doesn't do any grouping - instead, it appears this is being done to get the unique combinations of MEMBERCODE, SCHEMECODE, and TRANSDATE so that the count of the members by scheme and date can be determined. I think the whole query can be simplified to:
WITH cteWORKING_TRANS AS (SELECT *
FROM MINA_RAW2
WHERE TRANSACTIONTYPE IN ('RSP','SP','UNTV',
'ASTR','CN','TVIN',
'UCON','TRAS')),
cteBULKSUM AS (SELECT a.SCHEMECODE,
a.TRANSDATE,
COUNT(*) AS BULK_CNT
FROM (SELECT DISTINCT MEMBERCODE,
SCHEMECODE,
TRANSDATE
FROM cteWORKING_TRANS) a
GROUP BY a.SCHEMECODE,
a.TRANSDATE)
SELECT t.*, b.BULK_CNT
FROM cteWORKING_TRANS t
INNER JOIN cteBULKSUM b
ON b.SCHEMECODE = t.SCHEMECODE AND
b.TRANSDATE = t.TRANSDATE
I managed to remove an unnecessary subquery. Note that this syntax, with DISTINCT inside COUNT, may not work outside of PostgreSQL, or may not give the desired result; I know I've certainly used it there.
select t1.*, tt2.bulk_cnt
from mina_raw2 t1
inner join (select t2.schemecode,
t2.transdate,
count(DISTINCT membercode) as bulk_cnt
from mina_raw2 t2
where t2.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
group by t2.schemecode,
t2.transdate
having count(DISTINCT membercode) >= 10) tt2
on t1.schemecode = tt2.schemecode and t1.transdate = tt2.transdate
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
When you use those WITH queries instead of subqueries when you don't need to, you're kneecapping the query optimizer.

Performance of a sql query

I'm executing a command on a large table. It has around 7 million rows.
Command is like this:
select * from mytable;
Now I'm restricting the number of rows to around 3 million. I'm using this command:
select * from mytable where timest > add_months( sysdate, -12*4 )
I have an index on the timest column, but the costs are almost the same. I would expect them to decrease. What am I doing wrong?
Any clue?
Thank you in advance!
Here are the explain plans:
Using an index for 3 out of 7 million rows would most probably be even more expensive, so Oracle does a full table scan for both queries, which is IMO correct.
You may try a parallel FTS (full table scan) - it should be faster, BUT it will put your Oracle server under a higher load, so don't do it on heavily loaded multiuser DBs.
Here is an example:
select /*+full(t) parallel(t,4)*/ *
from mytable t
where timest > add_months( sysdate, -12*4 );
To select a very small number of records from a table, use an index. To select a non-trivial fraction, use partitioning.
In your case, effective access would be enabled with range partitioning on the timest column.
The big advantage is that only the relevant partitions are accessed.
Here is an example:
create table test(ts date, s varchar2(4000))
PARTITION BY RANGE (ts)
(PARTITION t1p1 VALUES LESS THAN (TO_DATE('2010-01-01', 'YYYY-MM-DD')),
PARTITION t1p2 VALUES LESS THAN (TO_DATE('2015-01-01', 'YYYY-MM-DD')),
PARTITION t1p4 VALUES LESS THAN (MAXVALUE)
);
Query
select * from test where ts < to_date('2009-01-01','yyyy-mm-dd');
will access only partition 1, i.e. only data from before '2010-01-01'.
See Pstart and Pstop in the execution plan:
-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5 | 10055 | 9 (0)| 00:00:01 | | |
| 1 | PARTITION RANGE SINGLE| | 5 | 10055 | 9 (0)| 00:00:01 | 1 | 1 |
|* 2 | TABLE ACCESS FULL | TEST | 5 | 10055 | 9 (0)| 00:00:01 | 1 | 1 |
-----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("TS"<TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
There are (at least) two problems.
add_months( sysdate, -12*4 ) is a function call, not just a constant, so the optimizer can't use the index here.
Choosing 3 million out of 7 million rows via an index is not a good idea anyway. Yes, you would walk down the index tree quickly, yet for each row you would still have to visit the table (because you need * = all columns). That means there is no sense in using this index.
Thus the index plays no role here.
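Whether or not the optimizer actually considers the index for this predicate is easy to check yourself; a minimal sketch using the standard EXPLAIN PLAN / DBMS_XPLAN workflow:

```sql
EXPLAIN PLAN FOR
SELECT * FROM mytable
WHERE timest > ADD_MONTHS(SYSDATE, -12*4);

-- show the plan Oracle chose (look for TABLE ACCESS FULL vs INDEX RANGE SCAN)
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```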