Oracle ROWNUM performance - sql

To fetch the top-n rows in Oracle, the usual approach is ROWNUM.
So the following query seems ok (gets most recent 5 payments):
select a.paydate, a.amount
from (
    select t.paydate, t.amount
    from payments t
    where t.some_id = id
    order by t.paydate desc
) a
where rownum <= 5;
But for very big tables it is inefficient - for me it ran for ~10 minutes.
So I tried other queries, and I ended up with this one which runs for less than a second:
select *
from (
    select a.*, rownum
    from (
        select t.paydate, t.amount
        from payments t
        where t.some_id = id
        order by t.paydate desc
    ) a
)
where rownum <= 5;
To find out what was happening, I looked at the execution plan for each query. For the first query:
SELECT STATEMENT, GOAL = ALL_ROWS 7 5 175
COUNT STOPKEY
VIEW 7 5 175
TABLE ACCESS BY INDEX ROWID 7 316576866 6331537320
INDEX FULL SCAN DESCENDING 4 6
And for second:
SELECT STATEMENT, GOAL = ALL_ROWS 86 5 175
COUNT STOPKEY
VIEW 86 81 2835
COUNT
VIEW 86 81 1782
SORT ORDER BY 86 81 1620
TABLE ACCESS BY INDEX ROWID 85 81 1620
INDEX RANGE SCAN 4 81
Obviously, it is the INDEX FULL SCAN DESCENDING that makes the first query inefficient for big tables. But I cannot really tell the logic of the two queries apart by looking at them.
Could anyone explain the logical difference between the two queries in plain language?
Thanks in advance!

First of all, as mentioned in Alex's comment, I'm not sure that your second version is 100% guaranteed to give you the right rows -- since the "middle" block of the query does not have an explicit order by, Oracle is under no obligation to pass the rows up to the outer query block in any specific order. However, there doesn't seem to be any particular reason that it would change the order that the rows are passed up from the innermost block, so in practice it will probably work.
And this is why Oracle chooses a different plan for the second query -- it is logically not able to apply the STOPKEY operation to the innermost query block.
I think in the first case, the optimizer is assuming that id values are well-distributed and that, for any given value, there are likely to be some very recent transactions. Since it can see that it only needs to find the 5 most recent matches, it calculates that it will be more efficient to scan the rows in descending order of paydate using an index, look up the corresponding id and other data from the table, and stop when it's found the first 5 matches. I suspect that you would see very different performance for this query depending on the specific id value that you use -- if the id has a lot of recent activity, the rows should be found very quickly, but if it does not, the index scan may have to do a lot more work.
In the second case, I believe it's not able to apply the STOPKEY optimization to the innermost block due to the extra layer of nesting. In that case, the index full scan would become much less attractive, since it would always need to scan the entire index. Therefore it chooses to do an index lookup on id (I'm assuming) followed by an actual sort on the date. If the given id value matches a small subset of rows, this is likely to be more efficient -- but if you give an id that has lots of rows spread throughout the entire table, I would expect it to become slower, since it will have to access and sort many rows.
So, I would guess that your tests have used id value(s) that have relatively few rows which are not very recent. If this would be a typical use case, then the second query is probably better for you (again, with the caveat that I'm not sure it is technically guaranteed to produce the correct result set). But if typical values would be more likely to have many matching rows and/or more likely to have 5 very recent rows, then the first query and plan might be better.
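If the ordering guarantee matters, a version that is correct by construction uses the ROW_NUMBER() analytic function instead of ROWNUM. A sketch against the tables from the question (:id is a bind variable standing in for the bare id above):

```sql
-- Number the rows within the ordered result, then filter on that number;
-- the numbering is tied to its ORDER BY, so the top 5 are guaranteed.
select paydate, amount
from (
    select t.paydate, t.amount,
           row_number() over (order by t.paydate desc) as rn
    from payments t
    where t.some_id = :id
)
where rn <= 5;
```

On Oracle 12c and later, the same intent can be written with the row-limiting clause (ORDER BY t.paydate DESC FETCH FIRST 5 ROWS ONLY), which the optimizer also treats as a top-n query.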

Related

SQL Top clause with order by clause

I am a bit new to SQL, and I want to write a query with a TOP clause and an ORDER BY clause.
For returning all the records, I wrote the query below:
select PatientName, PlanDate as Date, *
from OPLMLA21..Exams
order by PlanDate desc
And I need the top few elements from the same query, so I modified it to:
select top(5) PatientName, PlanDate as Date, *
from OPLMLA21..Exams
order by PlanDate desc
My understanding is that it will give the top 5 results from the previous query, but I see ambiguity there. I have attached a screenshot of the query results.
Maybe my understanding is wrong; I have read a lot but am not able to understand this. Please help me out.
I stated this in a comment, but to repeat:
 TOP (5) doesn't give the "top results" of the prior query though, no. It gives the top (first) rows from the dataset defined in the query it in is. If there are multiple rows that have the same "rank", then the row(s) returned for that rank are arbitrary. So, for example, for your query if you have 100 rows all with the same value for PlanDate, what 5 rows you get are completely arbitrary and could be different (including the order they are in) every time you run said query.
What I mean by arbitrary is that, effectively, SQL Server is free to choose whichever of the applicable rows to return. Sometimes this might be the same every time you run the query, but that is by luck more than anything. As your database gets larger, more users query the data, you involve joins, and things like locks, indexes, parallelism, etc. come into play, all of which affect the "order" in which SQL Server processes said data and therefore affect an ambiguous TOP clause.
Take the example data below:
ID | SomeDate
---|---------
1 |2020-01-01
2 |2020-01-01
3 |2020-01-01
4 |2020-01-01
5 |2020-01-01
6 |2020-01-02
Now, what would you expect if I ran a TOP (2) against that table with an ORDER BY clause of SomeDate DESC? Well, certainly you'd expect the "last" row (with an ID of 6) to be returned, but what about the next row? The other 5 rows all have the same value for SomeDate. Perhaps, because you're under the impression that data in a table is pre-sorted, you might expect the row with an ID of 5. What if I told you that there was a CLUSTERED INDEX on ID ASC? That might well mean that the row with an ID of 1 is returned. What if there is also an index on SomeDate DESC?
What if the table were 10,000 rows in size, and you also had a JOIN to another table, which also has a CLUSTERED INDEX, and some user were running a query that takes row locks while you run yours? What would you expect then?
Without your ORDER BY being specific enough to ensure that each row has a distinct ordering position, SQL Server will return other rows in an arbitrary order and when mixed with a TOP means the "top" rows will also be arbitrary.
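The fix is to make the ORDER BY deterministic by appending a unique tie-breaker column. A sketch against the sample data above (SomeTable is a hypothetical name for that table; ID is assumed unique):

```sql
-- SomeDate alone does not break ties between the five 2020-01-01 rows;
-- adding the unique ID column does, so the same two rows come back on
-- every run.
SELECT TOP (2) ID, SomeDate
FROM SomeTable
ORDER BY SomeDate DESC, ID DESC;
```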
Side note: I notice in your image (of what appears to be SSMS) that your "dates" are in the format yyyyMMdd. This strongly implies that you are storing a date value as a varchar or int type. This is a design flaw and needs to be fixed. There are 6 date and time data types, and 5 of them are far superior to using a string or numeric type to store the data.

Simple SQL query that filters by geographic distance is very slow

Here's my query:
SELECT 1
FROM post po
WHERE ST_DWithin(po.geog, (SELECT geog FROM person WHERE person.person_id = $1), 20000 * 1609.34, false)
ORDER BY post_id DESC
LIMIT 5;
And here's the EXPLAIN ANALYZE:
I have an index on everything, so I'm not sure why this is slow. The first 5 posts when sorting by post_id DESC satisfy the clause, so shouldn't this return instantly?
I notice that if I replace the ST_DWithin call with an ST_Distance call instead, it runs instantly, like so:
SELECT 1
FROM post po
WHERE ST_Distance(po.geog, (SELECT geog FROM person WHERE person.person_id = $1)) < 20000 * 1609.34
ORDER BY post_id DESC
LIMIT 5;
That one runs in .15 milliseconds. So, simple solution is to just replace the ST_DWithin call with the ST_Distance call, no?
Well, unfortunately not, because it's not always the first 5 rows that match. Sometimes it has to scan deep within the table, so at that point ST_DWithin is better because it can use the geographic index, while ST_Distance cannot.
I think this may be a problem of postgres' query planner messing up? Like, for some reason it thinks it needs to do a scan of the whole table, despite the ORDER BY x LIMIT 5 clause being front and center? Not sure..
The distance you are using is almost the length of the equator, so you can expect (almost) all of your results to satisfy this clause.
As ST_DWithin makes use of a spatial index, the planner (wrongly) thinks it will be faster to use it to filter the rows first. It then has to sort (almost) all rows and finally keep the first 5.
When using st_distance, no spatial index can be used, so the planner picks a different plan, likely one relying on the index on post_id, which is blazing fast. But when the number of rows to be returned (the limit) increases, a different plan is used, and the planner probably believes it would again be faster to compute the distance on all rows.
The first 5 posts when sorting by post_id DESC satisfy the clause, so shouldn't this return instantly?
This is a fact the system has no way of knowing ahead of time. It can't use unknown facts when planning the query. It thinks it will find only 10 rows. That means it thinks it would have to scan half the index on post_id before accumulating 5 rows (out of 10) which meet the geometry condition.
It actually finds 100,000 rows (an oddly round number). But it doesn't know that until after the fact.
If you were to first run SELECT geog FROM person WHERE person.person_id = $1 and then write the result directly into your main query, rather than as a subquery, it might (or might not) do a better job of planning.
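That suggestion could look like the following two-step sketch ($2 in the second statement is bound to the geog value returned by the first; whether the planner actually does better with the constant is something to verify with EXPLAIN ANALYZE):

```sql
-- Step 1: fetch the person's location separately.
SELECT geog FROM person WHERE person.person_id = $1;

-- Step 2: pass that value in as a plain parameter, so the planner sees
-- a constant instead of a scalar subquery when estimating rows.
SELECT 1
FROM post po
WHERE ST_DWithin(po.geog, $2, 20000 * 1609.34, false)
ORDER BY post_id DESC
LIMIT 5;
```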

How does SELECT TOP work when no ORDER BY is specified?

The msdn documentation says that when we write
SELECT TOP(N) ..... ORDER BY [COLUMN]
We get top(n) rows that are sorted by column (asc or desc depending on what we choose)
But if we don't specify any ORDER BY, msdn says random, as Gail Erickson pointed out here. As he points out, it should be unspecified rather than random. But as Thomas Lee points out there,
When TOP is used in conjunction with the ORDER BY clause, the result
set is limited to the first N number of ordered rows; otherwise, it
returns the first N number of rows random
So I experimented with tables that don't have any indexes. First, I ran this query to find them:
SELECT *
FROM sys.objects so
WHERE so.object_id NOT IN (SELECT si.object_id
                           FROM sys.index_columns si)
  AND so.type_desc = N'USER_TABLE'
And then, in one of those tables (in fact I tried the query below on all of the tables returned by the above query), I always got the same rows:
SELECT TOP (2) *
FROM MstConfigSettings
This always returned the same 2 rows, and the same is true for all the other tables returned by query 1. Now, the execution plan shows 3 steps:
As you can see, there is no index lookup; it's just a pure table scan. The Top shows the actual number of rows to be 2, and so does the Table Scan, which is not the case (there are many rows).
But when I run something like
SELECT TOP (2) *
FROM MstConfigSettings
ORDER BY DefaultItemId
The execution plan (see the attached screenshots) now includes a sort.
So, when I don't apply an ORDER BY the steps are different (there is no sort). But the question is: how does TOP work when there is no sort, and why and how does it always give the same result?
There is no guarantee which two rows you get. It will just be the first two retrieved from the table scan.
The TOP iterator in the execution plan will stop requesting rows once two have been returned.
Likely for a scan of a heap this will be the first two rows in allocation order but this is not guaranteed. For example SQL Server might use the advanced scanning feature which means that your scan will read pages recently read from another concurrent scan.

SELECT COUNT(*) with an ORDER BY

Will the following two queries be executed in the same way?
SELECT COUNT(*) from person ORDER BY last_name;
and
SELECT COUNT(*) from person;
Either way they should display the same results, so I was curious if the ORDER BY just gets ignored.
The reason I am asking is because I am displaying a paginated table where I will get 20 records at a time from the database and then firing a second query that counts the total number of records. I want to know if I should use the same criteria that the first query used, or if I should be removing all sorting from the criteria?
According to the execution plan, the two queries are different. For example, the query:
select count(*) from USER
Will give me:
INDEX (FAST FULL SCAN) 3.0 3 453812 3457 1 TPMDBO USER_PK FAST FULL SCAN INDEX (UNIQUE) ANALYZED
As you can see, we hit USER_PK which is the primary key of that table.
If I sort by a non-indexed column:
select count(*) from USER ORDER BY FIRSTNAME --No Index on FIRSTNAME
I'll get:
TABLE ACCESS (FULL) 19.0 19 1124488 3457 24199 1 TPMDBO USER FULL TABLE ANALYZED 1
Meaning we did a full table scan (MUCH higher node cost)
If I sort by the primary key (which is already indexed), Oracle is smart enough to use the index to do that sort:
INDEX (FAST FULL SCAN) 3.0 3 453812 3457 13828 1 TPMDBO USER_PK FAST FULL SCAN INDEX (UNIQUE) ANALYZED
Which looks very similar to the first execution plan.
So, the answer to your question is absolutely not - they are not the same. However, ordering by an index that Oracle is already seeking anyway will probably result in the same query plan.
Of course not. Unless last name is the primary key and you are already ordered by that.
The Oracle query optimizer actually does perform a sort for the first version (I verified this by looking at the explain plan), but since both queries only return one row, the performance difference will be very small.
EDIT:
Mike's answer is correct. The performance difference can possibly be significant.
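For the pagination scenario in the question, the practical upshot is to keep the ORDER BY only where it matters. A sketch in Oracle syntax against the person table from the question (page 2 with 20 rows per page; last_name as the sort key):

```sql
-- Page query: ordering is required here, so keep it.
select * from (
    select p.*, row_number() over (order by p.last_name) as rn
    from person p
)
where rn between 21 and 40;

-- Count query: ordering cannot change the count, so drop it.
select count(*) from person;
```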

Wrong index being used when selecting top rows

I have a simple query which selects the top 200 rows, ordered by one of the columns and filtered by another, indexed column. What confuses me is that the query plan in PL/SQL Developer shows this index being used only when I'm selecting all rows, e.g.:
SELECT * FROM
(
SELECT *
FROM cr_proposalsearch ps
WHERE UPPER(ps.customerpostcode) like 'MK3%'
ORDER BY ps.ProposalNumber DESC
)
WHERE ROWNUM <= 200
Plan shows that it uses index CR_PROPOSALSEARCH_I1, which is an index on two columns: PROPOSALNUMBER & UPPER(CUSTOMERNAME), this takes 0.985s to execute:
If I get rid of ROWNUM condition, the plan is what I expect and it executes in 0.343s:
Where index XIF25CR_PROPOSALSEARCH is on CR_PROPOSALSEARCH (UPPER(CUSTOMERPOSTCODE));
How come?
EDIT: I have gathered statistics on cr_proposalsearch table and both query plans now show that they use XIF25CR_PROPOSALSEARCH index.
Including the ROWNUM changes the optimizer's calculations about which is the more efficient path.
When you do a top-n query like this, it doesn't necessarily mean that Oracle will get all the rows, fully sort them, then return the top ones. The COUNT STOPKEY operation in the execution plan indicates that Oracle will only perform the underlying operations until it has found the number of rows you asked for.
The optimizer has calculated that the full query will acquire and sort 77K rows. If it used this plan for the top-n query, it would have to do a large sort of those rows to find the top 200 (it wouldn't necessarily have to fully sort them, as it wouldn't care about the exact order of rows past the top; but it would have to look over all of those rows).
The plan for the top-n query uses the other index to avoid having to sort at all. It considers each row in order, checks whether it matches the predicate, and if so returns it. When it's returned 200 rows, it's done. Its calculations have indicated that this will be more efficient for getting a small number of rows. (It may not be right, of course; you haven't said what the relative performance of these queries is.)
If the optimizer were to choose this plan when you ask for all rows, it would have to read through the entire index in descending order, getting each row from the table by ROWID as it goes to check against the predicate. This would result in a lot of extra I/O and inspecting many rows that would not be returned. So in this case, it decides that using the index on customerpostcode is more efficient.
If you gradually increase the number of rows to be returned from the top-n query, you will probably find a tipping point where the plan switches from the first to the second. Just from the costs of the two plans, I'd guess this might be around 1,200 rows.
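One way to look for that tipping point is to compare plans as the cutoff grows. A sketch using EXPLAIN PLAN and DBMS_XPLAN (vary the literal and watch for the plan to switch between the two indexes):

```sql
EXPLAIN PLAN FOR
SELECT * FROM (
    SELECT * FROM cr_proposalsearch ps
    WHERE UPPER(ps.customerpostcode) LIKE 'MK3%'
    ORDER BY ps.proposalnumber DESC
)
WHERE ROWNUM <= 1200;   -- try 200, 400, 800, 1200, ...

SELECT * FROM TABLE(dbms_xplan.display);
```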
If you are sure your stats are up to date and that the index is selective enough, you could tell oracle to use the index
SELECT *
FROM (SELECT /*+ index(ps XIF25CR_PROPOSALSEARCH) */ *
      FROM cr_proposalsearch ps
      WHERE UPPER (ps.customerpostcode) LIKE 'MK3%'
      ORDER BY ps.proposalnumber DESC)
WHERE ROWNUM <= 200
(I would only recommend this approach as a last resort)
If I were doing this, I would first tkprof the query to see how much work it is actually doing, e.g. the cost of the index range scans could be way off.
forgot to mention....
You should check the actual cardinality:
SELECT count(*) FROM cr_proposalsearch ps WHERE UPPER(ps.customerpostcode) like 'MK3%'
and then compare it to the cardinality in the query plan.
You don't seem to have a perfectly fitting index. The index CR_PROPOSALSEARCH_I1 can be used to retrieve the rows in descending order of the attribute PROPOSALNUMBER. It's probably chosen because Oracle can avoid retrieving all matching rows, sorting them according to the ORDER BY clause and then discarding all rows except the first ones.
Without the ROWNUM condition, Oracle uses the XIF25CR_PROPOSALSEARCH index (you didn't give any details about it) because it's probably rather selective regarding the WHERE clause. But it will require sorting the result afterwards. This is probably the more efficient plan under the assumption that you'll retrieve all rows.
Since either index is a trade-off (one is better for sorting, the other one better for applying the WHERE clause), details such as ROWNUM determine which execution plan Oracle chooses.
This condition:
WHERE UPPER(ps.customerpostcode) like 'MK3%'
is not continuous, that is you cannot preserve a single ordered range for it.
So there are two ways to execute this query:
Order by number then filter on code.
Filter on code then order by number.
Method 1 is able to use an index on number, which gives you linear execution time (the top 100 rows would be selected 2 times faster than the top 200, provided that number and code do not correlate).
Method 2 is able to use a range scan for coarse filtering on code (the range condition would be something like code >= 'MK3' AND code < 'MK4'), however, it requires a sort since the order of number cannot be preserved in a composite index.
The sort time depends on the number of top rows you are selecting too, but this dependency, unlike that for method 1, is not linear (you always need at least one range scan).
However, the filtering condition in method 2 is selective enough for the RANGE SCAN with a subsequent sort to be more efficient than a FULL SCAN for the whole table.
This means that there is a tipping point: for this condition: ROWNUM <= X there exists a value of X so that method 2 becomes more efficient when this value is exceeded.
Update:
If you are always searching on at least 3 first symbols, you can create an index like this:
SUBSTR(UPPER(customerpostcode), 1, 3), proposalnumber
and use it in this query:
SELECT *
FROM (
SELECT *
FROM cr_proposalsearch ps
WHERE SUBSTR(UPPER(customerpostcode), 1, 3) = SUBSTR(UPPER(:searchquery), 1, 3)
AND UPPER(ps.customerpostcode) LIKE UPPER(:searchquery) || '%'
ORDER BY
proposalNumber DESC
)
WHERE rownum <= 200
This way, the number order will be preserved separately for each set of codes sharing the first 3 letters, which gives you a denser index scan.
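In Oracle syntax that suggested index would be created along these lines (SUBSTR rather than SUBSTRING, since this thread is about Oracle; the index name is made up):

```sql
CREATE INDEX cr_proposalsearch_fx1
    ON cr_proposalsearch (SUBSTR(UPPER(customerpostcode), 1, 3),
                          proposalnumber);
```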