Wrong index being used when selecting top rows

Wrong index being used when selecting top rows - sql

I have a simple query, which selects top 200 rows ordered by one of the columns filtered by other indexed column. The confusion is why is that the query plan in PL/SQL Developer shows that this index is used only when I'm selecting all rows, e.g.:
SELECT * FROM
(
SELECT *
FROM cr_proposalsearch ps
WHERE UPPER(ps.customerpostcode) like 'MK3%'
ORDER BY ps.ProposalNumber DESC
)
WHERE ROWNUM <= 200
Plan shows that it uses index CR_PROPOSALSEARCH_I1, which is an index on two columns: PROPOSALNUMBER & UPPER(CUSTOMERNAME), this takes 0.985s to execute:
If I get rid of ROWNUM condition, the plan is what I expect and it executes in 0.343s:
Where index XIF25CR_PROPOSALSEARCH is on CR_PROPOSALSEARCH (UPPER(CUSTOMERPOSTCODE));
How come?
EDIT: I have gathered statistics on cr_proposalsearch table and both query plans now show that they use XIF25CR_PROPOSALSEARCH index.

Including the ROWNUM changes the optimizer's calculations about which is the more efficient path.
When you do a top-n query like this, it doesn't necessarily mean that Oracle will get all the rows, fully sort them, then return the top ones. The COUNT STOPKEY operation in the execution plan indicates that Oracle will only perform the underlying operations until it has found the number of rows you asked for.
The optimizer has calculated that the full query will acquire and sort 77K rows. If it used this plan for the top-n query, it would have to do a large sort of those rows to find the top 200 (it wouldn't necessarily have to fully sort them, as it wouldn't care about the exact order of rows past the top; but it would have to look over all of those rows).
The plan for the top-n query uses the other index to avoid having to sort at all. It considers each row in order, checks whether it matches the predicate, and if so returns it. When it's returned 200 rows, it's done. Its calculations have indicated that this will be more efficient for getting a small number of rows. (It may not be right, of course; you haven't said what the relative performance of these queries is.)
If the optimizer were to choose this plan when you ask for all rows, it would have to read through the entire index in descending order, getting each row from the table by ROWID as it goes to check against the predicate. This would result in a lot of extra I/O and inspecting many rows that would not be returned. So in this case, it decides that using the index on customerpostcode is more efficient.
If you gradually increase the number of rows to be returned from the top-n query, you will probably find a tipping point where the plan switches from the first to the second. Just from the costs of the two plans, I'd guess this might be around 1,200 rows.

If you are sure your stats are up to date and that the index is selective enough, you could tell oracle to use the index
SELECT *
FROM (SELECT /*+ index(ps XIF25CR_PROPOSALSEARCH) */ *
FROM cr_proposalsearch ps
WHERE UPPER (ps.customerpostcode) LIKE 'MK3%'
ORDER BY ps.proposalnumber DESC)
WHERE ROWNUM <= 200
(I would only recommend this approach as a last resort)
If I were doing this I would first tkprof the query to see actually how much work it is doing,
e.g: the cost of index range scans could be way off
forgot to mention....
You should check the actual cardinality:
SELECT count(*) FROM cr_proposalsearch ps WHERE UPPER(ps.customerpostcode) like 'MK3%'
and then compare it to the cardinality in the query plan.

You don't seem to have a perfectly fitting index. The index CR_PROPOSALSEARCH_I1 can be used to retrieve the rows in descending order of the attribute PROPOSALNUMBER. It's probably chosen because Oracle can avoid to retrieve all matching rows, sort them according to the ORDER BY clause and then discard all rows except the first ones.
Without the ROWNUM condition, Oracle uses the XIF25CR_PROPOSALSEARCH index (you didn't give any details about it) because it's probably rather selective regarding the WHERE clause. But it will require to sort the result afterwards. This is probably the more efficent plan based on the assumption that you'll retrieve all rows.
Since either index is a trade-off (one is better for sorting, the other one better for applying the WHERE clause), details such as ROWNUM determine which execution plan Oracle chooses.

This condition:
WHERE UPPER(ps.customerpostcode) like 'MK3%'
is not continuous, that is you cannot preserve a single ordered range for it.
So there are two ways to execute this query:
Order by number then filter on code.
Filter on code then order by number.
Method 1 is able to user an index on number which gives you linear execution time (top 100 rows would be selected 2 times faster than top 200, provided that number and code do not correlate).
Method 2 is able to use a range scan for coarse filtering on code (the range condition would be something like code >= 'MK3' AND code < 'MK4'), however, it requires a sort since the order of number cannot be preserved in a composite index.
The sort time depends on the number of top rows you are selecting too, but this dependency, unlike that for method 1, is not linear (you always need at least one range scan).
However, the filtering condition in method 2 is selective enough for the RANGE SCAN with a subsequent sort to be more efficient than a FULL SCAN for the whole table.
This means that there is a tipping point: for this condition: ROWNUM <= X there exists a value of X so that method 2 becomes more efficient when this value is exceeded.
Update:
If you are always searching on at least 3 first symbols, you can create an index like this:
SUBSTRING(UPPER(customerpostcode), 1, 3), proposalnumber
and use it in this query:
SELECT *
FROM (
SELECT *
FROM cr_proposalsearch ps
WHERE SUBSTRING(UPPER(customerpostcode, 1, 3)) = SUBSTRING(UPPER(:searchquery), 1, 3)
AND UPPER(ps.customerpostcode) LIKE UPPER(:searchquery) || '%'
ORDER BY
proposalNumber DESC
)
WHERE rownum <= 200
This way, the number order will be preserved separately for each set of codes sharing first 3 letters which will give you a more dense index scan.

Related

Strange behavior when doing where and order by query in postgres

Background: A large table, 50M+, all column in query is indexed.
when I do a query like this:
select * from table where A=? order by id DESC limit 10;
In statement, A, id are both indexed.
Now confusing things happen:
the more rows where returned, the less time whole sql cost
the less rows where returned, the more time whole sql cost
I have a guess here: postgres do the order by first, and then where , so it cost more time to find 10 row in the orderd index when target rowset is small(like find 10 particular sand on beach); opposite， if target rowset is large, it's easy to find the first 10.
Is it right? Or there are some other reason for this?
Final question: How to optimize this situation?

It can either use the index on A to apply the selectivity, then sort on "id" and apply the limit. Or it can read them already in order using the index on "id", then filter out the ones that meet the A condition until it finds 10 of them. It will choose the one it thinks is faster, and sometimes it makes the wrong choice.
If you had a multi-column index, on (A,id) it could use that one index to do both things, get the selectivity on A and still fetch the already in order by "id", at the same time.

Do you know PGAdmin? With "explain verbose" before your statement, you can check how the query is executed (meaning the order of the operators). Usually first happens the filter and only afterwards the sorting...

Index not used when Oracle SQL query has large number of selections in the IN(...) statement

I have a SELECT statment which a lot of selections in the IN(...) statement. The column for the IN statment has a nonunique index and is VARCHAR(50). When the number of elements in the IN statement goes over a certain threshold, the index is not used.
My select is structured like this
SELECT T.*, RANK() OVER
(PARTITION BY KEY_ID ORDER BY OBS_DATE ASC) AS XRANK
FROM "MY_TABLE" T WHERE KEY_ID IN('A','B','C')
But in reality there are a few hundred more elements in the IN statement and they are not called A, B, C.
If I reduce the number of items in my IN statement to 50 the index is used and the query takes 0.003s. 7k rows returned
If I double the items for my IN statment to 100, the index is not used and a full table scan is performed taking 0.4s to return 14k rows.
I'm not sure why the index is not used but I want to see what would happen if it was, so I tried to experiment I with a hint,
SELECT /*+ index(MY_TABLE,MY_INDEX) */ O.*, RANK() OVER ...blah blah
But the hint is ignored. When I run the explain plan it is not used and the query is the same speed.
Any advice would be appreciated, especially
Why is the index not being used when there is a higher number of elements in the IN statment
Why is the hint being ignored.
Thanks.

It's because of selectivity. If the query is not selective enough, then a full table scan is usually better than multiple index range scans (not unique).
If the number of values is low, the cost based optimizer will consider the cost of multiple index range scans together is still lower than the cost of a full table scan. If you add too many values, then the index range scan will surpass the cost of the full table scan.
Now, the costs are relative, and it depends on how the optimizer is configured (hints), but also depends on the histogram and table stats. Are they up to date?
By the way, you say 7k and 14k rows are selected. What percentage of the table does that represent? If it's too high the engine will go for the heap ignoring the index.
Having said all this, I think this is a really bad design. Instead of sending 100 parameters over the wire, can you produce those values from another table/select instead?

Simple SQL query that filters by geographic distance is very slow

Here's my query:
SELECT 1
FROM post po
WHERE ST_DWithin(po.geog, (SELECT geog FROM person WHERE person.person_id = $1), 20000 * 1609.34, false)
ORDER BY post_id DESC
LIMIT 5;
And here's the EXPLAIN ANALYZE:
I have an index on everything, so I'm not sure why this is slow. The first 5 posts when sorting by post_id DESC satisfy the clause, so shouldn't this return instantly?
I notice that if I replace the ST_DWithin call with an ST_Distance call instead, it runs instantly, like so:
SELECT 1
FROM post po
WHERE ST_Distance(po.geog, (SELECT geog FROM person WHERE person.person_id = $1)) < 20000 * 1609.34
ORDER BY post_id DESC
LIMIT 5;
That one runs in .15 milliseconds. So, simple solution is to just replace the ST_DWithin call with the ST_Distance call, no?
Well, unfortunately not, because it's not always the first 5 rows that match. Sometimes it has to scan deep within the table, so at that point ST_DWithin is better because it can use the geographic index, while ST_Distance cannot.
I think this may be a problem of postgres' query planner messing up? Like, for some reason it thinks it needs to do a scan of the whole table, despite the ORDER BY x LIMIT 5 clause being front and center? Not sure..

The distance you are using is almost the length of the equator, so you can expect (almost) all of your results to satisfy this clause.
As ST_DWithin makes use of a spatial index, the planner (wrongly) thinks it will be faster to use it to first filter out the rows. It then has to order (almost) all rows and at last will keep the first 5 ones.
When using st_distance, no spatial index can be used and the planner will pick a different plan, likely one relying on an index on post_id, which is blazing fast. But when the number of rows to be returned (the limit) increases, a different plan is used and the planner probably believe it would be again faster to compute the distance on all rows.

The first 5 posts when sorting by post_id DESC satisfy the clause, so shouldn't this return instantly?
This is a fact the system has no way of knowing ahead of time. It can't use unknown facts when planning the query. It thinks it will find only 10 rows. That means it thinks it would have to scan half the index on post_id before accumulating 5 rows (out of 10) which meet the geometry condition.
It actually finds 100,000 rows (an oddly round number). But it doesn't know that until after the fact.
If you were to first to a query on SELECT geog FROM person WHERE person.person_id = $1 and then write the result of that directly into your main query, rather than as a subquery, it might (or might not) do a better job of planning.

how does SELECT TOP works when no order by is specified?

The msdn documentation says that when we write
SELECT TOP(N) ..... ORDER BY [COLUMN]
We get top(n) rows that are sorted by column (asc or desc depending on what we choose)
But if we don't specify any order by, msdn says random as Gail Erickson pointed out here. As he points out it should be unspecified rather then random. But as
Thomas Lee points out there that
When TOP is used in conjunction with the ORDER BY clause, the result
set is limited to the first N number of ordered rows; otherwise, it
returns the first N number of rows ramdom
So, I ran this query on a table that doesn't have any indexes, first I ran this..
SELECT *
FROM
sys.objects so
WHERE
so.object_id NOT IN (SELECT si.object_id
FROM
sys.index_columns si)
AND so.type_desc = N'USER_TABLE'
And then in one of those tables, (in fact I tried the query below in all of those tables returned by above query) and I always got the same rows.
SELECT TOP (2) *
FROM
MstConfigSettings
This always returned the same 2 rows, and same is true for all other tables returned by query 1. Now the execution plans shows 3 steps..
As you can see there is no index look up, it's just a pure table scan, and
The Top shows actual no of rows to be 2, and so does the Table Scan; Which is not the case (there I many rows).
But when I run something like
SELECT TOP (2) *
FROM
MstConfigSettings
ORDER BY
DefaultItemId
The execution plan shows
and
So, when I don't apply ORDER BY the steps are different (there is no sort). But the question is how does this TOP works when there is no Sort and why and how does it always gives the same result?

There is no guarantee which two rows you get. It will just be the first two retrieved from the table scan.
The TOP iterator in the execution plan will stop requesting rows once two have been returned.
Likely for a scan of a heap this will be the first two rows in allocation order but this is not guaranteed. For example SQL Server might use the advanced scanning feature which means that your scan will read pages recently read from another concurrent scan.

Oracle ROWNUM performance

To query top-n rows in Oracle, it is general to use ROWNUM.
So the following query seems ok (gets most recent 5 payments):
select a.paydate, a.amount
from (
select t.paydate, t.amount
from payments t
where t.some_id = id
order by t.paydate desc
) a
where rownum <= 5;
But for very big tables, it is inefficient - for me it run for ~10 minutes.
So I tried other queries, and I ended up with this one which runs for less than a second:
select *
from (
select a.*, rownum
from (select t.paydate, t.amount
from payments t
where t.some_id = id
order by t.paydate desc) a
)
where rownum <= 5;
To find out what is happening, I looked execution plans for each query. For first query:
SELECT STATEMENT, GOAL = ALL_ROWS 7 5 175
COUNT STOPKEY
VIEW 7 5 175
TABLE ACCESS BY INDEX ROWID 7 316576866 6331537320
INDEX FULL SCAN DESCENDING 4 6
And for second:
SELECT STATEMENT, GOAL = ALL_ROWS 86 5 175
COUNT STOPKEY
VIEW 86 81 2835
COUNT
VIEW 86 81 1782
SORT ORDER BY 86 81 1620
TABLE ACCESS BY INDEX ROWID 85 81 1620
INDEX RANGE SCAN 4 81
Obviously, it is INDEX FULL SCAN DESCENDING that makes first query inefficient for big tables. But I can not really differentiate the logic of two queries by looking at them.
Could anyone explain me the logical differences between two queries in human language?
Thanks in advance!

First of all, as mentioned in Alex's comment, I'm not sure that your second version is 100% guaranteed to give you the right rows -- since the "middle" block of the query does not have an explicit order by, Oracle is under no obligation to pass the rows up to the outer query block in any specific order. However, there doesn't seem to be any particular reason that it would change the order that the rows are passed up from the innermost block, so in practice it will probably work.
And this is why Oracle chooses a different plan for the second query -- it is logically not able to apply the STOPKEY operation to the innermost query block.
I think in the first case, the optimizer is assuming that id values are well-distributed and, for any given value, there are likely to be some very recent transactions. Since it can see that it only needs to find the 5 most recent matches, it calculates that it appears to be more efficient to scan the rows in descending order of paydate using an index, lookup the corresponding id and other data from the table, and stop when it's found the first 5 matches. I suspect that you would see very different performance for this query depending on the specific id value that you use -- if the id has a lot of recent activity, the rows should be found very quickly, but if it does not, the index scan may have to do a lot more work.
In the second case, I believe it's not able to apply the STOPKEY optimization to the innermost block due to the extra layer of nesting. In that case, the index full scan would become much less attractive, since it would always need to scan the entire index. Therefore it chooses to do an index lookup on id (I'm assuming) followed by an actual sort on the date. If the given id value matches a small subset of rows, this is likely to be more efficient -- but if you give an id that has lots of rows spread throughout the entire table, I would expect it to become slower, since it will have to access and sort many rows.
So, I would guess that your tests have used id value(s) that have relatively few rows which are not very recent. If this would be a typical use case, then the second query is probably better for you (again, with the caveat that I'm not sure it is technically guaranteed to produce the correct result set). But if typical values would be more likely to have many matching rows and/or more likely to have 5 very recent rows, then the first query and plan might be better.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas