SQL Server : OFFSET FETCH performs scan while TOP WHERE performs seek? - sql

I got the following two queries. One is fast the other is slow.
The table has a clusted index on the Id column.
-- Slow, uses clustered index scan reading 100100 rows
SELECT *
FROM [dbo].[Foo]
ORDER BY Id
OFFSET 100000 ROWS FETCH FIRST 100 ROWS ONLY
-- Fast, uses clustered index seek reading 100 rows
SELECT TOP 100 *
FROM [dbo].[Foo]
WHERE Id > 100000
ORDER BY Id
The plans are identical except for one uses a scan the other a seek.
Can anyone explain why or is this simply how OFFSET works?
The table is very wide with a few NVARCHAR(100-200) and a single NVARCHAR(2500) column.

The two queries are not equivalent. Although you might assume that the ids have no gaps and start at 1, the database engine does not know that.
Indexes are organized to find particular values quickly. They generally do this by traversing a tree structure, and one which is generally balanced. You can read more about this in the documentation.
However, they are not organized to quickly get to the nth row in the table. Hence, the query needs to scan the table to count the number of rows.
That said, the index could do what you want if it kept the number of rows in each child. Do realize that this would complicate modifications to the table, because the entire hierarchy would need to be updated for each update, insert, and delete.

Related

Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
id BIGSERIAL PRIMARY KEY,
joincol VARCHAR
);
CREATE TABLE test2 (
joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
SELECT test1.id,
test1.joincol AS t1charcol,
test2.joincol AS t2charcol
FROM test1, test2
WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100ms. As far as I understand the execution plan, the runtime is independent of the rowcount, since Postgres iterates the rows one by one (starting at the highest id, using the index) until a join on a row is possible and immediately returns.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on rowcount), since the two tables are "joined completely", before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
On my real environment test1 contains only a hand full of rows (< 100), having unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use a Index Scan Backward on row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be, that could get out of hands quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table, regardless of index sort order, an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure, whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON "test" ("id");
The PK on test.id is implemented with a unique index on the column, that already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from actual use case to be meaningful.
In the test setup, each table has 100k rows, there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL
Your real case has 10M rows in table1 and < 100 rows in table2, every value in table1.joincol has a match in table2.joincol, both are defined NOT NULL, and table2.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on table2.joincol and a FK constraint t1.joincol --> t2.joincol.
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem, and good testcase.
I tested it in postgres 9.3 perhaps 13 is can it more more fast.
I used Occam's Razor and i excluded some possiblities
View (without view is slow to)
JOIN can filter some rows (unfortunatly in your test not, but more length md5 5-6 yes)
Other basic equivalent select statements not solve yout problem (inner query or exists)
I achieved to use just index, but because the tables isn't bigger than indexes it was not the solution.
I think
CREATE INDEX on "test" ("id");
is useless, because PK!
If you change this
CREATE INDEX on "test" ("joincol");
to this
CREATE INDEX ON TEST (joincol, id);
Than the second query use just indexes.
After you run this
REINDEX table test;
REINDEX table test2;
VACUUM ANALYZE test;
VACUUM ANALYZE test2;
you can achive some performance tuning. Because you created indexes before inserts.
I think the reason is the two aim of DB.
First aim optimalize just some row. So run Nested Loop. You can force it with limit x.
Second aim optimalize whole table. Run this query fast for whole table.
In this situation postgres optimalizer didn't notice that simple MAX can run with NESTED LOOP. Or perhaps postgres cannot use limit in aggregate clause (can run on whole partial select, what is filtered with query).
And this is very expensive. But you have possiblities to write there other aggregates, like SUM, MIN, AVG stb.
Perhaps can help you the Window functions too.

Finding the "next 25 rows" in Oracle SQL based on an indexed column

I have a large table (~200M rows) that is indexed on a numeric column, Z. There is also an index on the key column, K.
K Z
= ==========================================
1 0.6508784068583483336644518457703156855132
2 0.4078768075307567089075462518978907890789
3 0.5365440453204830852096396398565048002638
4 0.7573281573257782352853823856682368153782
What I need to be able to do is find the 25 records "surrounding" a given record. For instance, the "next" record starting at K=3 would be K=1, followed by K=4.
I have been lead by several sources (most notably this paper from some folks at Florida State University) that SQL like the following should work. It's not hard to imagine that scanning along the indexed column in ascending or descending order would be efficient.
select * from (
select *
from T
where Z >= [origin's Z value]
order by Z asc
) where rownum <= 25;
In theory, this should find the 25 "next" rows, and a similar variation would find the 25 "previous" rows. However, this can take minutes and the explain plan consistently contains a full table scan. A full table scan is simply too expensive for my purpose, but nothing I do seems to prompt the query optimizer to take advantage of the index (short, of course, of changing the ">=" above to an equals sign, which indicates that the index is present and operational). I have tried several hints to no avail (index, index_asc in several permutations).
Is what I am trying to do impossible? If I were trying to do this on a large data structure over which I had more control, I'd build a linked list on the indexed column's values and a tree to find the right entry point. Then traversing the list would be very inexpensive (yes I might have to run all over the disk to find the records I'm looking for, but I surely wouldn't have to scan the whole table).
I'll add in case it's important to my query that the database I'm using is running Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit.
I constructed a small test case with 10K rows. When I populated the table such that the Z values were already ordered, the exact query you gave tended to use the index. But when I populated it with random values, and refreshed the table statistics, it started doing full table scans, at least for some values of n larger than 25. So there is a tipping point at which the optimizer decides that the amount of work it will do to look up index entries then find the corresponding rows in the table is more than the amount of work to do a full scan. (It might be wrong in its estimate, of course, but that is what it has to go on.)
I noticed that you are using SELECT *, which means the query is returning both columns. This means that the actual table rows must be accessed, since neither index includes both columns. This might push the optimizer towards preferring a full table scan for a larger samples. If the query could be fulfilled from the index alone, it would be more likely to use the index.
One possibility is that you don't really need to return the values of K at all. If so, I'd suggest that you change both occurrences of SELECT * to SELECT z. In my test, this change caused a query that had been doing a full table scan to use an index scan instead (and not access the table itself at all).
If you do need to include K in the result, then you might try creating an index on (Z, K). This index could be used to satisfy the query without accessing the table.

Index Created but doesn't speed-up on retrieval process

I have created table as bellow
create table T1(num varchar2(20))
then I inserted 3 lac numbers in above table so now it looks like below
num
1
2
3
.
.
300000
Now if I do
select * from T1
then it takes 1min 15sec to completely fetch the records and as I created index on column num and if I use below query then it should be faster to fetch 3 lac records but it takes also 1min15sec for fetch the records
select * from T1 where num between '1' and '300000'
So how the index has improved my retrieval process?
The index does not improve the retrieval process when you are trying to fetch all rows.
The index makes it possible to find a subset of rows much more quickly.
An index can help if you want to retrieve a few rows from a large table. But since you retrieve all rows and since your index contains all the columns of your table, it won't speed up the query.
Furthermore, you don't tell us what tool you use to retrieve the data. I guess you use SQL Developer or Toad. So what you measure is the time it takes SQL Developer or Toad to store 300,000 rows in memory in such a way that they can be easily displayed on screen in a scrollable table. You aren't really measuring how long it takes to retrieve them.
To get a test of the effects of having an index in place you might want to try a query such as
SELECT *
FROM T1
WHERE NUM IN ('288888', '188888', '88888')
both with with the index in place, and again after removing the index. You should also collect statistics on the table prior to running the query with the index in place or you may still get a query which performs a full table scan. Share and enjoy.

Wrong index being used when selecting top rows

I have a simple query, which selects top 200 rows ordered by one of the columns filtered by other indexed column. The confusion is why is that the query plan in PL/SQL Developer shows that this index is used only when I'm selecting all rows, e.g.:
SELECT * FROM
(
SELECT *
FROM cr_proposalsearch ps
WHERE UPPER(ps.customerpostcode) like 'MK3%'
ORDER BY ps.ProposalNumber DESC
)
WHERE ROWNUM <= 200
Plan shows that it uses index CR_PROPOSALSEARCH_I1, which is an index on two columns: PROPOSALNUMBER & UPPER(CUSTOMERNAME), this takes 0.985s to execute:
If I get rid of ROWNUM condition, the plan is what I expect and it executes in 0.343s:
Where index XIF25CR_PROPOSALSEARCH is on CR_PROPOSALSEARCH (UPPER(CUSTOMERPOSTCODE));
How come?
EDIT: I have gathered statistics on cr_proposalsearch table and both query plans now show that they use XIF25CR_PROPOSALSEARCH index.
Including the ROWNUM changes the optimizer's calculations about which is the more efficient path.
When you do a top-n query like this, it doesn't necessarily mean that Oracle will get all the rows, fully sort them, then return the top ones. The COUNT STOPKEY operation in the execution plan indicates that Oracle will only perform the underlying operations until it has found the number of rows you asked for.
The optimizer has calculated that the full query will acquire and sort 77K rows. If it used this plan for the top-n query, it would have to do a large sort of those rows to find the top 200 (it wouldn't necessarily have to fully sort them, as it wouldn't care about the exact order of rows past the top; but it would have to look over all of those rows).
The plan for the top-n query uses the other index to avoid having to sort at all. It considers each row in order, checks whether it matches the predicate, and if so returns it. When it's returned 200 rows, it's done. Its calculations have indicated that this will be more efficient for getting a small number of rows. (It may not be right, of course; you haven't said what the relative performance of these queries is.)
If the optimizer were to choose this plan when you ask for all rows, it would have to read through the entire index in descending order, getting each row from the table by ROWID as it goes to check against the predicate. This would result in a lot of extra I/O and inspecting many rows that would not be returned. So in this case, it decides that using the index on customerpostcode is more efficient.
If you gradually increase the number of rows to be returned from the top-n query, you will probably find a tipping point where the plan switches from the first to the second. Just from the costs of the two plans, I'd guess this might be around 1,200 rows.
If you are sure your stats are up to date and that the index is selective enough, you could tell oracle to use the index
SELECT *
FROM (SELECT /*+ index(ps XIF25CR_PROPOSALSEARCH) */ *
FROM cr_proposalsearch ps
WHERE UPPER (ps.customerpostcode) LIKE 'MK3%'
ORDER BY ps.proposalnumber DESC)
WHERE ROWNUM <= 200
(I would only recommend this approach as a last resort)
If I were doing this I would first tkprof the query to see actually how much work it is doing,
e.g: the cost of index range scans could be way off
forgot to mention....
You should check the actual cardinality:
SELECT count(*) FROM cr_proposalsearch ps WHERE UPPER(ps.customerpostcode) like 'MK3%'
and then compare it to the cardinality in the query plan.
You don't seem to have a perfectly fitting index. The index CR_PROPOSALSEARCH_I1 can be used to retrieve the rows in descending order of the attribute PROPOSALNUMBER. It's probably chosen because Oracle can avoid to retrieve all matching rows, sort them according to the ORDER BY clause and then discard all rows except the first ones.
Without the ROWNUM condition, Oracle uses the XIF25CR_PROPOSALSEARCH index (you didn't give any details about it) because it's probably rather selective regarding the WHERE clause. But it will require to sort the result afterwards. This is probably the more efficent plan based on the assumption that you'll retrieve all rows.
Since either index is a trade-off (one is better for sorting, the other one better for applying the WHERE clause), details such as ROWNUM determine which execution plan Oracle chooses.
This condition:
WHERE UPPER(ps.customerpostcode) like 'MK3%'
is not continuous, that is you cannot preserve a single ordered range for it.
So there are two ways to execute this query:
Order by number then filter on code.
Filter on code then order by number.
Method 1 is able to user an index on number which gives you linear execution time (top 100 rows would be selected 2 times faster than top 200, provided that number and code do not correlate).
Method 2 is able to use a range scan for coarse filtering on code (the range condition would be something like code >= 'MK3' AND code < 'MK4'), however, it requires a sort since the order of number cannot be preserved in a composite index.
The sort time depends on the number of top rows you are selecting too, but this dependency, unlike that for method 1, is not linear (you always need at least one range scan).
However, the filtering condition in method 2 is selective enough for the RANGE SCAN with a subsequent sort to be more efficient than a FULL SCAN for the whole table.
This means that there is a tipping point: for this condition: ROWNUM <= X there exists a value of X so that method 2 becomes more efficient when this value is exceeded.
Update:
If you are always searching on at least 3 first symbols, you can create an index like this:
SUBSTRING(UPPER(customerpostcode), 1, 3), proposalnumber
and use it in this query:
SELECT *
FROM (
SELECT *
FROM cr_proposalsearch ps
WHERE SUBSTRING(UPPER(customerpostcode, 1, 3)) = SUBSTRING(UPPER(:searchquery), 1, 3)
AND UPPER(ps.customerpostcode) LIKE UPPER(:searchquery) || '%'
ORDER BY
proposalNumber DESC
)
WHERE rownum <= 200
This way, the number order will be preserved separately for each set of codes sharing first 3 letters which will give you a more dense index scan.

Index performance with WHERE clause in SQL

I'm reading about indexes in my database book and I was wondering if I was correct in my assumption that a WHERE clause with a non-constant expression in it will not use the index.
So if i have
SELECT * FROM statuses WHERE app_user_id % 10 = 0;
This would not use an index created on app_user_id. But
SELECT * FROM statuses WHERE app_user_id = 5;
would use the index on app_user_id.
Usually (there are other options) a database index is a B-Tree, which means that you can do range scans on it (including equality scans).
The condition app_user_id % 10 = 0 cannot be evaluated with a single range scan, which is why a database will probably not use an index.
It could still decide to use the index in another way, namely for a full scan: Reading the whole table takes more time than just reading the whole index. On the other hand, after reading the index you may still get back to the table, so the overall cost may end up being higher.
This is up to the database query optimizer to decide.
A few examples:
select app_user_id from t where app_user_id % 10 = 0
Here, you do not need the table at all, all necessary data is in the index. The database will most likely do a full index scan.
select count(*) from t where app_user_id % 10 = 0
Same. Full index scan.
select count(*) from t
Only if app_user_id is NOT NULL can this be done with the index (because NULL data is not in the index, at least on Oracle, at least on single column indexes, your database may handle this differently).
Some databases do not need to do access table or index for this, they maintain row counts in the metadata.
select * from t where app_user_id = 5
This is the classic scenario for an index. The database can look at the small section of the index tree, retrieve a small (just one if this was a unique or primary index) number of rowids and fetch those selectively from the table.
select * from t where app_user_id between 5 and 10
Another classic index case. Range scan in the tree returns a small number of rowids to fetch from the table.
select * from t where app_user_id between 5 and 10 order by app_user_id
Since index scans return ordered data, you even get the sorting for free.
select * from t where app_user_id between 5 and 1000000000
Maybe here you should not be using an index. It seems to match too many records. This is a case where having bind variables hide the range from the database could actually be detrimental.
select * from t where app_user_id between 5 and 1000000000
order by app_user_id
But here, since sorting would be very expensive (even taking up temporary swap disk space), maybe iterating in index order is good. Maybe.
select * from t where app_user_id % 10 = 0
This is difficult to decide. We need all columns, so ultimately the query needs to touch the table. The question is whether to go through an index first. The query returns approximately 10% of the whole table. That is probably too much for an index access path to be efficient. If the optimizer has reason to believe that the query returns much less than 10% of the table, an index scan followed by accessing the table might be good. Same if the table is very fragmented (lots of deleted rows eating up space).