How should tables be indexed to optimise this Oracle SELECT query?

I've got the following query in Oracle10g:
select *
from DATA_TABLE DT,
LOOKUP_TABLE_A LTA,
LOOKUP_TABLE_B LTB
where DT.COL_A = LTA.COL_A (+)
and DT.COL_B = LTA.COL_B (+)
and LTA.COL_C = LTB.COL_C
and LTA.COL_B = LTB.COL_B
and ( DT.REF_TXT = :refTxt or DT.ALT_REF_TXT = :refTxt )
and DT.CREATED_DATE between :startDate and :endDate
I was wondering whether you have any hints for optimising the query.
Currently I've got the following indices:
IDX1 on DATA_TABLE (REF_TXT, CREATED_DATE)
IDX2 on DATA_TABLE (ALT_REF_TXT, CREATED_DATE)
LOOKUP_A_PK on LOOKUP_TABLE_A (COL_A, COL_B)
LOOKUP_A_IDX1 on LOOKUP_TABLE_A (COL_C, COL_B)
LOOKUP_B_PK on LOOKUP_TABLE_B (COL_C, COL_B)
Note, the LOOKUP tables are very small (<200 rows).
EDIT:
Explain plan:
Query Plan
SELECT STATEMENT Cost = 8
FILTER
NESTED LOOPS
NESTED LOOPS
TABLE ACCESS BY INDEX ROWID DATA_TABLE
BITMAP CONVERSION TO ROWIDS
BITMAP OR
BITMAP CONVERSION FROM ROWIDS
SORT ORDER BY
INDEX RANGE SCAN IDX1
BITMAP CONVERSION FROM ROWIDS
SORT ORDER BY
INDEX RANGE SCAN IDX2
TABLE ACCESS BY INDEX ROWID LOOKUP_TABLE_A
INDEX UNIQUE SCAN LOOKUP_A_PK
TABLE ACCESS BY INDEX ROWID LOOKUP_TABLE_B
INDEX UNIQUE SCAN LOOKUP_B_PK
EDIT2:
The data looks like this:
There will be tens of thousands of distinct REF_TXT values, with tens to hundreds of CREATED_DATE values for each. ALT_REF_TXT will mostly be NULL, but there will be hundreds to thousands of rows in which it differs from REF_TXT.
EDIT3: Fixed what ALT_REF_TXT actually contains.

The execution plan you're currently getting looks pretty good. There's no obvious improvement to be made.
As others have noted, you have some outer join indicators, but then you essentially prevent the outer join by requiring equality on other columns in the two outer tables. As you can see from the execution plan, no outer join is happening. If you don't want an outer join, remove the (+) operators; they're just confusing the issue. If you do want an outer join, rewrite the query as shown by @Dems.
If you're unhappy with the current performance, I would suggest running the query with the gather_plan_statistics hint, then using DBMS_XPLAN.DISPLAY_CURSOR(?,?,'ALLSTATS LAST') to view the actual execution statistics. This will show the elapsed time attributed to each step in the execution plan.
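As a rough sketch of that suggestion, the statement can be run once with the hint and the runtime plan fetched afterwards from the same session. This assumes the bind variables are supplied by the client, that the session is allowed to read the V$ views DBMS_XPLAN relies on, and, in SQL*Plus, that SERVEROUTPUT is off so that the query is still the last cursor executed.

```sql
-- Step 1: execute the query with row-source statistics collection enabled.
-- The query text is the one from the question; only the hint is new.
select /*+ gather_plan_statistics */ *
from DATA_TABLE DT,
     LOOKUP_TABLE_A LTA,
     LOOKUP_TABLE_B LTB
where DT.COL_A = LTA.COL_A (+)
  and DT.COL_B = LTA.COL_B (+)
  and LTA.COL_C = LTB.COL_C
  and LTA.COL_B = LTB.COL_B
  and ( DT.REF_TXT = :refTxt or DT.ALT_REF_TXT = :refTxt )
  and DT.CREATED_DATE between :startDate and :endDate;

-- Step 2: display the plan of the last statement run in this session.
-- NULL for sql_id and child number means "the last cursor executed".
select *
from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
```

In the output, the A-Rows and A-Time columns hold the actual row counts and elapsed time per step, which can be compared against the estimated E-Rows.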
You might get some benefit from converting one or both of the lookup tables into index-organized tables.
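For illustration only, an index-organized version of one lookup table could be declared roughly as follows. The table name LOOKUP_TABLE_B_IOT, the constraint name, the column datatypes and the DESCRIPTION column are all assumptions, since the question does not show the table definitions.

```sql
-- Hypothetical index-organized copy of LOOKUP_TABLE_B.
-- Datatypes and the DESCRIPTION column are assumed, not taken from the question.
create table LOOKUP_TABLE_B_IOT (
  COL_C        varchar2(10)  not null,
  COL_B        varchar2(10)  not null,
  DESCRIPTION  varchar2(100),
  -- An index-organized table must have a primary key; rows are stored in its index.
  constraint LOOKUP_B_IOT_PK primary key (COL_C, COL_B)
)
organization index;
```

Because the whole row lives in the primary key index, a lookup by (COL_C, COL_B) no longer needs a separate TABLE ACCESS BY INDEX ROWID step.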

Your 2 index range scans on IDX1 and IDX2 will produce at most 100 rows, so your BITMAP CONVERSION TO ROWIDS will produce at most 200 rows. And from there on, it's only indexed access by rowids, leading to a likely sub-second execution. So are you really experiencing performance problems? If so, how long does it take exactly?
If you are experiencing performance problems, then please follow Dave Costa's advice and get the real plan, because in that case it's likely that a different plan is being used at run time, possibly due to certain bind variable values or different optimizer environment settings.
Regards,
Rob.

This is one of those cases where it makes very little sense to try to optimize the DBMS performance without knowing what your data means.
Do you have many, many distinct CREATED_DATE values and a few rows in your DT for each date? If so you want an index on CREATED_DATE, as it will be the primary way for the DBMS to reject rows it doesn't want to process.
On the other hand, do you have only a handful of dates, and many distinct values of REF_TXT or ALT_REF_TXT? In that case you probably have the correct compound index choices.
The presence of OR in your query complicates things greatly, and throws most guesswork out the window. You must look at EXPLAIN PLAN to see what's going on.
If you have tens of millions of distinct REF_TXT and ALT_REF_TXT values, you may want to consider denormalizing this schema.
Edit.
Thanks for the additional info. Your explain plan contains no smoking guns that I can see. Some things to try next if you're not happy with performance yet.
Flip the order of the columns in your compound indexes on your data tables. Maybe that will get you simpler index range scans instead of all the bitmap monkey business.
Exchange your SELECT * for the names of the columns you actually need in the query resultset. That's good programming practice in any case, and it MAY allow the optimizer to avoid some work.
If things are still too slow, try recasting this as a UNION of two queries rather than using OR. That MAY allow the alt_ref_txt part of your query, which is made a little more complex by all the NULL values in that column, to be optimized separately.

This may be the query you want, using a more up-to-date syntax.
(And without inner joins breaking outer joins)
select
*
from
DATA_TABLE DT
left outer join
(
LOOKUP_TABLE_A LTA
inner join
LOOKUP_TABLE_B LTB
on LTA.COL_C = LTB.COL_C
and LTA.COL_B = LTB.COL_B
)
on DT.COL_A = LTA.COL_A
and DT.COL_B = LTA.COL_B
where
( DT.REF_TXT = :refTxt or DT.ALT_REF_TXT = :refTxt )
and DT.CREATED_DATE between :startDate and :endDate
INDEXes that I'd have are...
LOOKUP_TABLE_A (COL_A, COL_B)
LOOKUP_TABLE_B (COL_B, COL_C)
DATA_TABLE (REF_TXT, CREATED_DATE)
DATA_TABLE (ALT_REF_TXT, CREATED_DATE)
Note: The first condition in the WHERE clause above contains an OR that will likely defeat the use of indexes. In such cases I have seen performance benefits from UNIONing two queries together...
<your query>
where
DT.REF_TXT = :refTxt
and DT.CREATED_DATE between :startDate and :endDate
UNION
<your query>
where
DT.ALT_REF_TXT = :refTxt
and DT.CREATED_DATE between :startDate and :endDate
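Written out in full, the UNION form might look like the sketch below. The explicit column list is an assumption (only columns named in the question are used) and should be replaced by the columns actually needed.

```sql
-- Branch 1: rows matched through REF_TXT (can use the REF_TXT, CREATED_DATE index).
select DT.COL_A, DT.COL_B, DT.REF_TXT, DT.ALT_REF_TXT, DT.CREATED_DATE, LTB.COL_C
from DATA_TABLE DT
left outer join
     ( LOOKUP_TABLE_A LTA
       inner join LOOKUP_TABLE_B LTB
          on LTA.COL_C = LTB.COL_C
         and LTA.COL_B = LTB.COL_B )
  on DT.COL_A = LTA.COL_A
 and DT.COL_B = LTA.COL_B
where DT.REF_TXT = :refTxt
  and DT.CREATED_DATE between :startDate and :endDate
union
-- Branch 2: rows matched through ALT_REF_TXT (can use the ALT_REF_TXT, CREATED_DATE index).
select DT.COL_A, DT.COL_B, DT.REF_TXT, DT.ALT_REF_TXT, DT.CREATED_DATE, LTB.COL_C
from DATA_TABLE DT
left outer join
     ( LOOKUP_TABLE_A LTA
       inner join LOOKUP_TABLE_B LTB
          on LTA.COL_C = LTB.COL_C
         and LTA.COL_B = LTB.COL_B )
  on DT.COL_A = LTA.COL_A
 and DT.COL_B = LTA.COL_B
where DT.ALT_REF_TXT = :refTxt
  and DT.CREATED_DATE between :startDate and :endDate;
```

UNION (rather than UNION ALL) is used so that a row matching both branches is returned once. It de-duplicates over the selected columns, so include a key column if genuinely identical rows must be preserved.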

Provide the output of this query with "set autot trace", so we can see how many blocks it is pulling. The explain plan looks good; it should be very fast. If you need more, denormalize the lookup table info into DT. This violates third normal form, but it will make your query faster by eliminating the joins. In a situation where milliseconds count, everything is in buffers, and you need that query to run 1000 times/second, it can help by driving down the number of blocks looked at per row. It is the ultimate way to boost read performance, but it complicates your app (and ruins your lovely ER diagram).
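A possible SQL*Plus session for collecting that output is sketched below. The bind values are placeholders, the dates are passed as strings because SQL*Plus has no DATE bind type, and the session is assumed to have a plan table and the PLUSTRACE role.

```sql
-- Declare and fill the bind variables (values are placeholders).
variable refTxt    varchar2(100)
variable startDate varchar2(10)
variable endDate   varchar2(10)
exec :refTxt    := 'ABC123'
exec :startDate := '2011-01-01'
exec :endDate   := '2011-12-31'

-- TRACEONLY suppresses the result rows and prints the plan plus statistics.
set autotrace traceonly

select *
from DATA_TABLE DT,
     LOOKUP_TABLE_A LTA,
     LOOKUP_TABLE_B LTB
where DT.COL_A = LTA.COL_A (+)
  and DT.COL_B = LTA.COL_B (+)
  and LTA.COL_C = LTB.COL_C
  and LTA.COL_B = LTB.COL_B
  and ( DT.REF_TXT = :refTxt or DT.ALT_REF_TXT = :refTxt )
  and DT.CREATED_DATE between to_date(:startDate, 'YYYY-MM-DD')
                          and to_date(:endDate,   'YYYY-MM-DD');

set autotrace off
```

The "consistent gets" and "physical reads" figures in the statistics section are the block counts of interest here.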

Related

Hint for SQL Query Joined tables with million records

I have the query below, which takes on average more than 5 seconds to fetch the data in a transaction that the application triggers innumerable times. I am looking for a hint that could reduce the time this query takes every time it is fired. My constraints are that I cannot add any indexes or change any application settings for this query, so Oracle hints or restructuring the query are the only options I have. Please find my query below.
SELECT SUM(c.cash_flow_amount) FROM CM_CONTRACT_DETAIL a ,CM_CONTRACT b,CM_CONTRACT_CASHFLOW c
WHERE a.country_code = Ip_country_code
AND a.company_code = ip_company_code
AND a.dealer_bp_id = ip_bp_id
AND a.contract_start_date >= ip_start_date
AND a.contract_start_date <= ip_end_date
AND a.version_number = b.current_version
AND a.status_code IN ('00','10')
AND a.country_code = b.country_code
AND a.company_code = b.company_code
AND a.contract_number = b.contract_number
AND a.country_code = c.country_code
AND a.company_code = c.company_code
AND a.contract_number = c.contract_number
AND a.version_number = c.version_number
AND c.cash_flow_type_code IN ('07','13');
The thing to know about the tables is that they are all transactional tables whose data changes every day. Each holds between 1 lakh and 10 lakh (100,000 to 1,000,000) records.
This is the explain plan currently on the query:
Operation Object Name Rows Bytes Cost Object Node In/Out PStart PStop
SELECT STATEMENT Hint=RULE
SORT AGGREGATE
TABLE ACCESS BY INDEX ROWID CM_CONTRACT_CASHFLOW
NESTED LOOPS
NESTED LOOPS
TABLE ACCESS BY INDEX ROWID CM_CONTRACT_DETAIL
INDEX RANGE SCAN XIF760CT_CONTRACT_DETAIL
TABLE ACCESS BY INDEX ROWID CM_CONTRACT
INDEX UNIQUE SCAN XPKCM_CONTRACT
INDEX RANGE SCAN XPKCM_CONTRACT_CASHFLOW
Indexes on CM_CONTRACT_DETAIL:
XPKCM_CONTRACT_DETAIL is a composite unique index on country_code, company_code, contract_number and version_number
XIF760CT_CONTRACT_DETAIL is a non unique index on dealer_bp_id
Indexes on CM_CONTRACT:
XPKCM_CONTRACT is a composite unique index on country_code, company_code, contract_number
Indexes on CM_CONTRACT_CASHFLOW:
XPKCM_CONTRACT_CASHFLOW is a composite unique index on country_code, company_code, contract_number and version_number,supply_sequence_number, cash_flow_type_code,payment_date.
Could you please help improve this query? Please let me know if anything else about the tables is required. Statistics are not gathered on these tables either.

Your query plan says HINT=RULE. Why is that? Is this the standard setting in your dbms? Why not make use of the optimizer? You can use /*+CHOOSE*/ for that. This may be all that's needed. (Why are there no Stats on the tables, though?)
EDIT: The above was nonsense. By not gathering any statistics you prevent the optimizer from doing its work. It will always fall back to the good old rules, because it has no basis on which to calculate costs and find a better plan. It is strange that you voluntarily keep the DBMS from making your queries fast. You can of course use hints in your queries, but always be careful to check and adjust them when the table data changes significantly. Better to gather statistics and let the optimizer do this work. As to useful hints:
My feeling says: With that many criteria on CM_CONTRACT_DETAIL this should be the driving table. You can force that with /*+LEADING(a)*/. Maybe even use a full table scan on that table /*+FULL(a)*/, which you can still speed up with parallel execution: /*+PARALLEL(a,4)*/.
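Both routes can be sketched as follows. The procedure parameters from the question are written as bind variables here (an assumption, so the statement can run on its own), and the hints are a starting point to test against the real plan, not a guaranteed improvement.

```sql
-- Route 1: gather statistics so the cost-based optimizer has something to work with.
begin
  dbms_stats.gather_table_stats(ownname => user, tabname => 'CM_CONTRACT_DETAIL',   cascade => true);
  dbms_stats.gather_table_stats(ownname => user, tabname => 'CM_CONTRACT',          cascade => true);
  dbms_stats.gather_table_stats(ownname => user, tabname => 'CM_CONTRACT_CASHFLOW', cascade => true);
end;
/

-- Route 2: make CM_CONTRACT_DETAIL (alias a) the driving table via a hint.
-- To also try the full scan with parallelism, use instead:
--   /*+ LEADING(a) FULL(a) PARALLEL(a,4) */
SELECT /*+ LEADING(a) */ SUM(c.cash_flow_amount)
FROM CM_CONTRACT_DETAIL a, CM_CONTRACT b, CM_CONTRACT_CASHFLOW c
WHERE a.country_code = :ip_country_code
AND a.company_code = :ip_company_code
AND a.dealer_bp_id = :ip_bp_id
AND a.contract_start_date >= :ip_start_date
AND a.contract_start_date <= :ip_end_date
AND a.version_number = b.current_version
AND a.status_code IN ('00','10')
AND a.country_code = b.country_code
AND a.company_code = b.company_code
AND a.contract_number = b.contract_number
AND a.country_code = c.country_code
AND a.company_code = c.company_code
AND a.contract_number = c.contract_number
AND a.version_number = c.version_number
AND c.cash_flow_type_code IN ('07','13');
```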
Good luck :-)

Join Performance Issue in Oracle

We have 2 tables.
Table XYZ has over 189 M records.
Table ABC has only 1098 records.
Our join query is something like
select a.a, a.b, a.c
from xyz a , ABC r
where a.d = r.d
and a.sub not like '0%'
and ((a.eff_dat < sysdate) or (a.eff_date is null))
This is how our query is performing. Can it be optimised in any way to perform faster?
Apart from the NOT LIKE, can you suggest any other method?
In the explain plan I have seen that it iterates over the 189 M rows and checks each against the 1098 records, which takes more time.
I swapped the tables after the FROM keyword, but that did not work either.
I tried the LEADING hint, which also did not serve the purpose.
Also, the a.d column is indexed, and that index is also used in the hint.
Please suggest any methods for optimisation.

When you have multiple predicates on a table such as:
a.sub not like '0%'
and ((a.eff_dat < sysdate) or (a.eff_date is null))
... it is rather unlikely that the optimiser will accurately estimate the cardinality of the result set unless you use dynamic sampling, so check the explain plan to see whether:
Dynamic sampling is being invoked.
The cardinality estimates are correct.
If the predicates are not very selective -- if they do not eliminate something in the order of 90% of the rows in the table -- then it is unlikely that an index will be of help in finding the rows, and a full scan (with partition pruning if the table is partitioned in a way that supports that) is likely to be the best access path.
I'd be reasonably sure that if there is a foreign key between the tables (ie. that all values of a.d exist in r.d) then the best access path is going to be a full scan of XYZ with a hash join to ABC.
By the way, you mention hints but do not include them in the question. It's also unhelpful to hide the purpose of the tables with fake names, as the names often give valuable clues about the type of data and distribution of values within the data sets.
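One way to test the access path described above is to request it explicitly with hints, as in this sketch. The sampling level of 4 is an arbitrary choice, and the column is spelled eff_dat throughout (the question spells it both eff_dat and eff_date); compare the resulting plan and timings with the unhinted query before keeping any hint.

```sql
select /*+ dynamic_sampling(a 4)  -- sample XYZ at parse time to estimate the combined predicates
           leading(r a)           -- join order: small table ABC first
           use_hash(a)            -- hash join, built on ABC and probed with XYZ rows
           full(a)                -- read XYZ with a full scan instead of the index on d
       */
       a.a, a.b, a.c
from xyz a, ABC r
where a.d = r.d
  and a.sub not like '0%'
  and ((a.eff_dat < sysdate) or (a.eff_dat is null));
```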

It would seem most of the cost would be in the (presumed) full table scan on the large table. I would suggest rewriting your WHERE condition as follows:
SELECT * FROM XYZ A
WHERE SUBSTR(A.SUB, 1, 1) <> '0'
AND NVL(A.EFF_DAT, TO_DATE('01-01-0001', 'MM-DD-YYYY')) < SYSDATE ;
And then create a function-based index that includes all the relevant columns:
CREATE INDEX IX_XYZ1 ON
XYZ(NVL(EFF_DAT, TO_DATE('01-01-0001', 'MM-DD-YYYY')), SUB, D);
Make sure the new index is being picked up by the cost-based optimizer, by checking the execution plan.
LIKE, NOT LIKE and the OR operator are some of the worst things you can use in a WHERE condition.

Getting RID Lookup instead of Table Scan?

SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an In clause on an Id, and then also select other columns, the In is evaluated first, and then the Details column and other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SqlFiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (the IN clause eliminates 99% of the rows)
Otherwise the query plan shown in SQL Fiddle is identical, the only difference being that for the second query the RID Lookup in SQL Fiddle, is a Table Scan in production :(
I would like to understand what could cause this behavior. What kinds of things would you look at to determine why it is using a table scan here?
How can I influence it to use a RID Lookup there?
From looking at operation costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Detail column, then the performance of both queries is very close in production. It is only after adding other columns like Detail that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used an RID Lookup, I was surprised but slightly confused...
It doesn't have a clustered index because in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details though, and I can experiment with that more, but I would like to have an understanding of what is going on now before I start shooting in the dark with random indexes.

What if you changed your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID lookup nor a table scan would be needed, since your query could be satisfied from just the index itself....

The differences in the query plans will be dependent on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the 'Details' column is included. This is an almost sure sign that either the 'Details' column is not part of an index, or if it is part of an index, the data in that column is mostly unique such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over the index access, as it can take advantage of things like block reads to access the table records faster than perhaps a fragmented read of an index.
To influence the path that will be chosen by the optimiser, you would need to look at possible indexes that could be added/modified to make an index access more efficient, but this should be done with care as it can adversely affect other queries as well as possibly degrading insert performance.
The other important thing you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency that is appropriate to the rate of change of the frequency distribution in the table data.

If it's true that 99% of the rows would be omitted if it performed the query using the relevant index + RID then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignID in (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics, you could use a table hint - if you know the index that your plan should be using which contains the ID and ForeignID columns then stick that in your query as a hint and force SQL optimiser to use the index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
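A minimal sketch of both steps, refreshing the statistics and then forcing an index with a table hint, is shown below on the simpler first query. The index name IX_SupportContacts_ForeignId_Id is hypothetical and is created here only so that the hint has something to refer to.

```sql
-- Refresh the statistics first; stale statistics are the likeliest cause.
UPDATE STATISTICS SupportContacts WITH FULLSCAN;

-- Hypothetical index on the filter and ordering columns.
CREATE NONCLUSTERED INDEX IX_SupportContacts_ForeignId_Id
    ON SupportContacts (ForeignId, Id DESC);

-- The table hint forces the optimiser to use that index for this table reference.
SELECT [Id], [ForeignId], Details
FROM SupportContacts WITH (INDEX(IX_SupportContacts_ForeignId_Id))
WHERE ForeignId IN (1,2,3,5);
```

A hint like this overrides the optimiser even when the data distribution later changes, so it is usually worth re-checking the plan after refreshing statistics before leaving the hint in place.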
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);

Slow query with unexpected index scan

I have this query:
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
The biggest table here is RESULT, which contains 11.1M records. The other 2 tables have about 1M each.
This query runs slowly (more than 10 minutes) and returns about 800 records. The execution plan shows a clustered index scan (over its PRIMARY KEY, result.result_number, which doesn't actually take part in the query) over all 11M records.
RESULT.TEST_NUMBER is a clustered primary key.
If I change 2010-03-17 09:00 to 2010-03-17 10:00, I get about 40 records. It executes in 300ms, and the plan shows an index seek (over the result.test_number index).
If I replace * in the SELECT clause with result.test_number (covered by the index), then the first case becomes fast too. This points to HDD I/O issues, but doesn't explain the plan change.
So, any ideas?
UPDATE:
sampled_date is in table sample and covered by index.
other fields from this query: test.sample_number is covered by index and result.test_number too.
UPDATE 2:
Obviously SQL Server, for whatever reason, doesn't want to use the index.
I did a small experiment: I removed the INNER JOIN with result, selected all test.test_number values, and after that ran
SELECT * FROM RESULT WHERE TEST_NUMBER IN (...)
This, of course, works fast. But I cannot see what the difference is, or why the query optimizer chooses such an inappropriate way to select the data in the first case.
UPDATE 3:
After backing up the database and restoring it under a new name, both queries run fast as expected, even on much wider ranges...
So, are there any special commands to clean up or optimize, or whatever, that could be relevant to this? :-(

A couple of things to try:
Update statistics
Add hints to the query about what index to use (in SQL Server you might say WITH (INDEX(myindex)) after specifying a table)
EDIT: You noted that copying the database made it work, which tells me that the index statistics were out of date. You can update them with something like UPDATE STATISTICS mytable on a regular basis.
Use EXEC sp_updatestats to update the whole database.
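Put together, the two suggestions might look like the sketch below. The index name IX_result_test_number is hypothetical (the question only says that result.test_number is indexed), and FULLSCAN on the largest table is an assumption about how thorough the refresh should be.

```sql
-- Refresh statistics on the three tables involved.
UPDATE STATISTICS sample;
UPDATE STATISTICS test;
UPDATE STATISTICS result WITH FULLSCAN;

-- Alternatively, refresh every table in the current database:
-- EXEC sp_updatestats;

-- If the plan is still wrong, force the index on result.test_number with a table hint.
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result WITH (INDEX(IX_result_test_number))
        ON test.test_number = result.test_number
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00';
```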

The first thing I would do is specify the exact columns I want, and see if the problem persists. I doubt you would need all the columns from all three tables.
It sounds like it has trouble getting all the rows out of the result table. How big is a row? Look at how big all the data in the table is and divide it by the number of rows. Right click on the table -> properties..., Storage tab.

Try putting the WHERE clause into a subquery to force it to be evaluated first?
SELECT *
FROM
(SELECT * FROM sample
WHERE sampled_date
BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00') s
INNER JOIN test ON s.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
Or this might work better if you expect a small number of samples:
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
WHERE sample.sample_ID in (
SELECT sample_ID
FROM sample
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
)

If you do a SELECT *, you want all the data from the table. The data for the table is in the clustered index - the leaf nodes of the clustered index are the data pages.
So if you want all of those data pages anyway, and since you're joining 1 million rows to 11 million rows (1 out of 11 isn't very selective for SQL Server), using an index to find the rows, and then doing bookmark lookups into the actual data pages for each of those rows found, might just not be very efficient, and thus SQL Server uses the clustered index scan instead.
So to make a long story short: only select those columns you really need! You thus give SQL Server a chance to use an index, do a seek there, and find the necessary data.
If you only select three, four columns, then the chances that SQL Server will find and use an index that contains those columns are just so much higher than if you ask for all the data from all the tables involved.
Another option would be to try and find a way to express a subquery, using e.g. a Common Table Expression, that would grab data from the two smaller tables, and reduce that number of rows even more, and join the hopefully quite small result against the main table. If you have a small result set of only 40 or 800 results (rather than two tables with 1 million rows each), then SQL Server might be more inclined to use a Clustered Index Seek and do bookmark lookups on 40 or 800 rows, rather than doing a full Clustered Index Scan.
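A sketch of that Common Table Expression approach follows. The CTE name wanted_tests is made up for illustration, and only the columns of result are returned, on the assumption that those are the ones needed. SQL Server does not materialise a CTE, so the optimiser may still choose the same plan; check the actual plan after the rewrite.

```sql
-- Reduce the two smaller tables to the handful of test numbers in the time window...
WITH wanted_tests AS (
    SELECT test.test_number
    FROM sample
    INNER JOIN test ON sample.sample_number = test.sample_number
    WHERE sample.sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
)
-- ...then join that small set against the 11M-row result table.
SELECT result.*
FROM wanted_tests
INNER JOIN result ON result.test_number = wanted_tests.test_number;
```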

Creating Indexes for Group By Fields?

Do you need to create an index for fields of group by fields in an Oracle database?
For example:
select *
from some_table
where field_one is not null and field_two = ?
group by field_three, field_four, field_five
I was testing the indexes I created for the above and the only relevant index for this query is an index created for field_two. Other single-field or composite indexes created on any of the other fields will not be used for the above query. Does this sound correct?

It could be correct, but that would depend on how much data you have. Typically I would create an index for the columns I was using in a GROUP BY, but in your case the optimizer may have decided that after using the field_two index that there wouldn't be enough data returned to justify using the other index for the GROUP BY.

No, this can be incorrect.
If you have a large table, Oracle can prefer deriving the fields from the indexes rather than from the table, even if there is no single index that covers all values.
In the latest article in my blog, "NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle", there is a query in which Oracle does not use a full table scan but rather joins two indexes to get the column values:
SELECT l.id, l.value
FROM t_left l
WHERE NOT EXISTS
(
SELECT value
FROM t_right r
WHERE r.value = l.value
)
The plan is:
SELECT STATEMENT
HASH JOIN ANTI
VIEW , 20090917_anti.index$_join$_001
HASH JOIN
INDEX FAST FULL SCAN, 20090917_anti.PK_LEFT_ID
INDEX FAST FULL SCAN, 20090917_anti.IX_LEFT_VALUE
INDEX FAST FULL SCAN, 20090917_anti.IX_RIGHT_VALUE
As you can see, there is no TABLE SCAN on t_left here.
Instead, Oracle takes the indexes on id and value, joins them on rowid and gets the (id, value) pairs from the join result.
Now, to your query:
SELECT *
FROM some_table
WHERE field_one is not null and field_two = ?
GROUP BY
field_three, field_four, field_five
First, it will not compile, since you are selecting * from a table with a GROUP BY clause.
You need to replace * with expressions based on the grouping columns and aggregates of the non-grouping columns.
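For example, a version that does compile could select the grouping columns plus an aggregate. The COUNT(*) is just an illustrative aggregate and :field_two_value is a placeholder bind name, neither of which comes from the question.

```sql
-- Every selected expression is either a grouping column or an aggregate.
SELECT  field_three, field_four, field_five,
        COUNT(*) AS row_count
FROM    some_table
WHERE   field_one IS NOT NULL
  AND   field_two = :field_two_value
GROUP BY
        field_three, field_four, field_five;
```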
You will most probably benefit from the following index:
CREATE INDEX ix_sometable_23451 ON some_table (field_two, field_three, field_four, field_five, field_one)
This index contains everything the query needs: filtering on field_two, sorting on field_three, field_four, field_five (useful for GROUP BY), and checking that field_one is NOT NULL.

"Do you need to create an index for fields of group by fields in an Oracle database?"
No. You don't need to, in the sense that a query will run irrespective of whether any indexes exist or not. Indexes are provided to improve query performance.
It can, however, help; but I'd hesitate to add an index just to help one query, without thinking about the possible impact of the new index on the database.
"...the only relevant index for this query is an index created for field_two. Other single-field or composite indexes created on any of the other fields will not be used for the above query. Does this sound correct?"
Not always. Often a GROUP BY will require Oracle to perform a sort (but not always); and you can eliminate the sort operation by providing a suitable index on the column(s) to be sorted.
Whether you actually need to worry about the GROUP BY performance, however, is an important question for you to think about.