Index performance with WHERE clause in SQL

Index performance with WHERE clause in SQL - sql

I'm reading about indexes in my database book and I was wondering if I was correct in my assumption that a WHERE clause with a non-constant expression in it will not use the index.
So if i have
SELECT * FROM statuses WHERE app_user_id % 10 = 0;
This would not use an index created on app_user_id. But
SELECT * FROM statuses WHERE app_user_id = 5;
would use the index on app_user_id.

Usually (there are other options) a database index is a B-Tree, which means that you can do range scans on it (including equality scans).
The condition app_user_id % 10 = 0 cannot be evaluated with a single range scan, which is why a database will probably not use an index.
It could still decide to use the index in another way, namely for a full scan: Reading the whole table takes more time than just reading the whole index. On the other hand, after reading the index you may still get back to the table, so the overall cost may end up being higher.
This is up to the database query optimizer to decide.
A few examples:
select app_user_id from t where app_user_id % 10 = 0
Here, you do not need the table at all, all necessary data is in the index. The database will most likely do a full index scan.
select count(*) from t where app_user_id % 10 = 0
Same. Full index scan.
select count(*) from t
Only if app_user_id is NOT NULL can this be done with the index (because NULL data is not in the index, at least on Oracle, at least on single column indexes, your database may handle this differently).
Some databases do not need to do access table or index for this, they maintain row counts in the metadata.
select * from t where app_user_id = 5
This is the classic scenario for an index. The database can look at the small section of the index tree, retrieve a small (just one if this was a unique or primary index) number of rowids and fetch those selectively from the table.
select * from t where app_user_id between 5 and 10
Another classic index case. Range scan in the tree returns a small number of rowids to fetch from the table.
select * from t where app_user_id between 5 and 10 order by app_user_id
Since index scans return ordered data, you even get the sorting for free.
select * from t where app_user_id between 5 and 1000000000
Maybe here you should not be using an index. It seems to match too many records. This is a case where having bind variables hide the range from the database could actually be detrimental.
select * from t where app_user_id between 5 and 1000000000
order by app_user_id
But here, since sorting would be very expensive (even taking up temporary swap disk space), maybe iterating in index order is good. Maybe.
select * from t where app_user_id % 10 = 0
This is difficult to decide. We need all columns, so ultimately the query needs to touch the table. The question is whether to go through an index first. The query returns approximately 10% of the whole table. That is probably too much for an index access path to be efficient. If the optimizer has reason to believe that the query returns much less than 10% of the table, an index scan followed by accessing the table might be good. Same if the table is very fragmented (lots of deleted rows eating up space).

Related

SQL Server : OFFSET FETCH performs scan while TOP WHERE performs seek?

I got the following two queries. One is fast the other is slow.
The table has a clusted index on the Id column.
-- Slow, uses clustered index scan reading 100100 rows
SELECT *
FROM [dbo].[Foo]
ORDER BY Id
OFFSET 100000 ROWS FETCH FIRST 100 ROWS ONLY
-- Fast, uses clustered index seek reading 100 rows
SELECT TOP 100 *
FROM [dbo].[Foo]
WHERE Id > 100000
ORDER BY Id
The plans are identical except for one uses a scan the other a seek.
Can anyone explain why or is this simply how OFFSET works?
The table is very wide with a few NVARCHAR(100-200) and a single NVARCHAR(2500) column.

The two queries are not equivalent. Although you might assume that the ids have no gaps and start at 1, the database engine does not know that.
Indexes are organized to find particular values quickly. They generally do this by traversing a tree structure, and one which is generally balanced. You can read more about this in the documentation.
However, they are not organized to quickly get to the nth row in the table. Hence, the query needs to scan the table to count the number of rows.
That said, the index could do what you want if it kept the number of rows in each child. Do realize that this would complicate modifications to the table, because the entire hierarchy would need to be updated for each update, insert, and delete.

Finding the "next 25 rows" in Oracle SQL based on an indexed column

I have a large table (~200M rows) that is indexed on a numeric column, Z. There is also an index on the key column, K.
K Z
= ==========================================
1 0.6508784068583483336644518457703156855132
2 0.4078768075307567089075462518978907890789
3 0.5365440453204830852096396398565048002638
4 0.7573281573257782352853823856682368153782
What I need to be able to do is find the 25 records "surrounding" a given record. For instance, the "next" record starting at K=3 would be K=1, followed by K=4.
I have been lead by several sources (most notably this paper from some folks at Florida State University) that SQL like the following should work. It's not hard to imagine that scanning along the indexed column in ascending or descending order would be efficient.
select * from (
select *
from T
where Z >= [origin's Z value]
order by Z asc
) where rownum <= 25;
In theory, this should find the 25 "next" rows, and a similar variation would find the 25 "previous" rows. However, this can take minutes and the explain plan consistently contains a full table scan. A full table scan is simply too expensive for my purpose, but nothing I do seems to prompt the query optimizer to take advantage of the index (short, of course, of changing the ">=" above to an equals sign, which indicates that the index is present and operational). I have tried several hints to no avail (index, index_asc in several permutations).
Is what I am trying to do impossible? If I were trying to do this on a large data structure over which I had more control, I'd build a linked list on the indexed column's values and a tree to find the right entry point. Then traversing the list would be very inexpensive (yes I might have to run all over the disk to find the records I'm looking for, but I surely wouldn't have to scan the whole table).
I'll add in case it's important to my query that the database I'm using is running Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit.

I constructed a small test case with 10K rows. When I populated the table such that the Z values were already ordered, the exact query you gave tended to use the index. But when I populated it with random values, and refreshed the table statistics, it started doing full table scans, at least for some values of n larger than 25. So there is a tipping point at which the optimizer decides that the amount of work it will do to look up index entries then find the corresponding rows in the table is more than the amount of work to do a full scan. (It might be wrong in its estimate, of course, but that is what it has to go on.)
I noticed that you are using SELECT *, which means the query is returning both columns. This means that the actual table rows must be accessed, since neither index includes both columns. This might push the optimizer towards preferring a full table scan for a larger samples. If the query could be fulfilled from the index alone, it would be more likely to use the index.
One possibility is that you don't really need to return the values of K at all. If so, I'd suggest that you change both occurrences of SELECT * to SELECT z. In my test, this change caused a query that had been doing a full table scan to use an index scan instead (and not access the table itself at all).
If you do need to include K in the result, then you might try creating an index on (Z, K). This index could be used to satisfy the query without accessing the table.

Why SQL Server index is not used?

Most of my SQL queries have WHERE rec_id <> 'D'; as for example:
select * from Table1 where Field1 = 'ABC' and rec_id <> 'D'
I added index on REC_ID. But when I run this query and look at the execution plan, the new index (REC_ID) is not used. The Execution plan shows Cost of 50% of nonClustered index Field1 and 50% RID Lookup (Heap) in Table1.
Why the index REC_ID not used?

For this query:
select *
from Table1
where Field1 = 'ABC' and rec_id <> 'D';
The best index is table1(Field1, rec_id).
However, your query may not be able to take advantage of an index. The goal of using an index for a where clause is to reduce the number of pages that need to be read. To understand the concept for non-clustered indexes on normal rows, you need some basic ideas:
Records are stored on pages.
Each page is 8,192 bytes (slightly fewer used for data) and can store some number of records.
The entire page is loaded into memory to read a record.
Say a record is about 80 bytes and there are 100 records on each page. If 10% of the records have Field1 = 'ABC', then there will be about ten on each page. That means that using the index would not (typically) save any page reads. If 1% of the records match, then there is about one on each page. The index still isn't helpful.
If only 0.01% of the records match (30 in your case), then only a fraction of the pages need to be read. This is the sweet spot for indexes, and where they are really helpful.
The number of matching records is called "selectivity". If the where clause is not very selective, then a non-clustered index will not be useful.
Sometimes, a clustered index can be helpful in this situation. However, clustered indexes may have more overhead for insert and certain update transactions. So, the choice of index needs to be based on the queries being processed and other ways that the table is used.

SQL Server uses many factors to decide which indices to use. It must have determined that using the index on Field1 would be more effective that using the index on rec_id - meaning that field1={value} defines a smaller set than rec_id <> {value} based on data dispersion, etc., so there are fewer records to compare against the other condition. Note that the actual value is usually irrelevant in determining which index to use.

Oracle: Full text search with condition

I've created an Oracle Text index like the following:
create index my_idx on my_table (text) indextype is ctxsys.context;
And I can then do the following:
select * from my_table where contains(text, '%blah%') > 0;
But lets say we have a have another column in this table, say group_id, and I wanted to do the following query instead:
select * from my_table where contains(text, '%blah%') > 0 and group_id = 43;
With the above index, Oracle will have to search for all items that contain 'blah', and then check all of their group_ids.
Ideally, I'd prefer to only search the items with group_id = 43, so I'd want an index like this:
create index my_idx on my_table (group_id, text) indextype is ctxsys.context;
Kind of like a normal index, so a separate text search can be done for each group_id.
Is there a way to do something like this in Oracle (I'm using 10g if that is important)?
Edit (clarification)
Consider a table with one million rows and the following two columns among others, A and B, both numeric. Lets say there are 500 different values of A and 2000 different values of B, and each row is unique.
Now lets consider select ... where A = x and B = y
An index on A and B separately as far as I can tell do an index search on B, which will return 500 different rows, and then do a join/scan on these rows. In any case, at least 500 rows have to be looked at (aside from the database being lucky and finding the required row early.
Whereas an index on (A,B) is much more effective, it finds the one row in one index search.
Putting separate indexes on group_id and the text I feel only leaves the query generator with two options.
(1) Use the group_id index, and scan all the resulting rows for the text.
(2) Use the text index, and scan all the resulting rows for the group_id.
(3) Use both indexes, and do a join.
Whereas I want:
(4) Use the (group_id, "text") index to find the text index under the particular group_id and scan that text index for the particular row/rows I need. No scanning and checking or joining required, much like when using an index on (A,B).

Oracle Text
1 - You can improve performance by creating the CONTEXT index with FILTER BY:
create index my_idx on my_table(text) indextype is ctxsys.context filter by group_id;
In my tests the filter by definitely improved the performance, but it was still slightly faster to just use a btree index on group_id.
2 - CTXCAT indexes use "sub-indexes", and seem to work similar to a multi-column index. This seems to be the option (4) you're looking for:
begin
ctx_ddl.create_index_set('my_table_index_set');
ctx_ddl.add_index('my_table_index_set', 'group_id');
end;
/
create index my_idx2 on my_table(text) indextype is ctxsys.ctxcat
parameters('index set my_table_index_set');
select * from my_table where catsearch(text, 'blah', 'group_id = 43') > 0
This is likely the fastest approach. Using the above query against 120MB of random text similar to your A and B scenario required only 18 consistent gets. But on the downside, creating the CTXCAT index took almost 11 minutes and used 1.8GB of space.
(Note: Oracle Text seems to work correctly here, but I'm not familiar with Text and I can't gaurentee this isn't an inappropriate use of these indexes like #NullUserException said.)
Multi-column indexes vs. index joins
For the situation you describe in your edit, normally there would not be a significant difference between using an index on (A,B) and joining separate indexes on A and B. I built some tests with data similar to what you described and an index join required only 7 consistent gets versus 2 consistent gets for the multi-column index.
The reason for this is because Oracle retrieves data in blocks. A block is usually 8K, and an index block is already sorted, so you can probably fit the 500 to 2000 values in a few blocks. If you're worried about performance, usually the IO to read and write blocks is the only thing that matters. Whether or not Oracle has to join together a few thousand rows is an inconsequential amount of CPU time.
However, this doesn't apply to Oracle Text indexes. You can join a CONTEXT index with a btree index (a "bitmap and"?), but the performance is poor.

I'd put an index on group_id and see if that's good enough. You don't say how many rows we're talking about or what performance you need.
Remember, the order in which the predicates are handled is not necessarily the order in which you wrote them in the query. Don't try to outsmart the optimizer unless you have a real reason to.

Short version: There's no need to do that. The query optimizer is smart enough to decide what's the best way to select your data. Just create a btree index on group_id, ie:
CREATE INDEX my_group_idx ON my_table (group_id);
Long version: I created a script (testperf.sql) that inserts 136 rows of dummy data.
DESC my_table;
Name Null Type
-------- -------- ---------
ID NOT NULL NUMBER(4)
GROUP_ID NUMBER(4)
TEXT CLOB
There is a btree index on group_id. To ensure the index will actually be used, run this as a dba user:
EXEC DBMS_STATS.GATHER_TABLE_STATS('<YOUR USER HERE>', 'MY_TABLE', cascade=>TRUE);
Here's how many rows each group_id has and the corresponding percentage:
GROUP_ID COUNT PCT
---------------------- ---------------------- ----------------------
1 1 1
2 2 1
3 4 3
4 8 6
5 16 12
6 32 24
7 64 47
8 9 7
Note that the query optimizer will use an index only if it thinks it's a good idea - that is, you are retrieving up to a certain percentage of rows. So, if you ask it for a query plan on:
SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM my_table WHERE group_id = 7;
You will see that for the first query, it will use the index, whereas for the second query, it will perform a full table scan, since there are too many rows for the index to be effective when group_id = 7.
Now, consider a different condition - WHERE group_id = Y AND text LIKE '%blah%' (since I am not very familiar with ctxsys.context).
SELECT * FROM my_table WHERE group_id = 1 AND text LIKE '%ipsum%';
Looking at the query plan, you will see that it will use the index on group_id. Note that the order of your conditions is not important:
SELECT * FROM my_table WHERE text LIKE '%ipsum%' AND group_id = 1;
Generates the same query plan. And if you try to run the same query on group_id = 7, you will see that it goes back to the full table scan:
SELECT * FROM my_table WHERE group_id = 7 AND text LIKE '%ipsum%';
Note that stats are gathered automatically by Oracle every day (it's scheduled to run every night and on weekends), to continually improve the effectiveness of the query optimizer. In short, Oracle does its best to optimize the optimizer, so you don't have to.

I do not have an Oracle instance at hand to test, and have not used the full-text indexing in Oracle, but I have generally had good performance with inline views, which might be an alternative to the sort of index you had in mind. Is the following syntax legit when contains() is involved?
This inline view gets you the PK values of the rows in group 43:
(
select T.pkcol
from T
where group = 43
)
If group has a normal index, and doesn't have low cardinality, fetching this set should be quick. Then you would inner join that set with T again:
select * from T
inner join
(
select T.pkcol
from T
where group = 43
) as MyGroup
on T.pkcol = MyGroup.pkcol
where contains(text, '%blah%') > 0
Hopefully the optimizer would be able to use the PK index to optimize the join and then appy the contains predicate only to the group 43 rows.

Wrong index being used when selecting top rows

I have a simple query, which selects top 200 rows ordered by one of the columns filtered by other indexed column. The confusion is why is that the query plan in PL/SQL Developer shows that this index is used only when I'm selecting all rows, e.g.:
SELECT * FROM
(
SELECT *
FROM cr_proposalsearch ps
WHERE UPPER(ps.customerpostcode) like 'MK3%'
ORDER BY ps.ProposalNumber DESC
)
WHERE ROWNUM <= 200
Plan shows that it uses index CR_PROPOSALSEARCH_I1, which is an index on two columns: PROPOSALNUMBER & UPPER(CUSTOMERNAME), this takes 0.985s to execute:
If I get rid of ROWNUM condition, the plan is what I expect and it executes in 0.343s:
Where index XIF25CR_PROPOSALSEARCH is on CR_PROPOSALSEARCH (UPPER(CUSTOMERPOSTCODE));
How come?
EDIT: I have gathered statistics on cr_proposalsearch table and both query plans now show that they use XIF25CR_PROPOSALSEARCH index.

Including the ROWNUM changes the optimizer's calculations about which is the more efficient path.
When you do a top-n query like this, it doesn't necessarily mean that Oracle will get all the rows, fully sort them, then return the top ones. The COUNT STOPKEY operation in the execution plan indicates that Oracle will only perform the underlying operations until it has found the number of rows you asked for.
The optimizer has calculated that the full query will acquire and sort 77K rows. If it used this plan for the top-n query, it would have to do a large sort of those rows to find the top 200 (it wouldn't necessarily have to fully sort them, as it wouldn't care about the exact order of rows past the top; but it would have to look over all of those rows).
The plan for the top-n query uses the other index to avoid having to sort at all. It considers each row in order, checks whether it matches the predicate, and if so returns it. When it's returned 200 rows, it's done. Its calculations have indicated that this will be more efficient for getting a small number of rows. (It may not be right, of course; you haven't said what the relative performance of these queries is.)
If the optimizer were to choose this plan when you ask for all rows, it would have to read through the entire index in descending order, getting each row from the table by ROWID as it goes to check against the predicate. This would result in a lot of extra I/O and inspecting many rows that would not be returned. So in this case, it decides that using the index on customerpostcode is more efficient.
If you gradually increase the number of rows to be returned from the top-n query, you will probably find a tipping point where the plan switches from the first to the second. Just from the costs of the two plans, I'd guess this might be around 1,200 rows.

If you are sure your stats are up to date and that the index is selective enough, you could tell oracle to use the index
SELECT *
FROM (SELECT /*+ index(ps XIF25CR_PROPOSALSEARCH) */ *
FROM cr_proposalsearch ps
WHERE UPPER (ps.customerpostcode) LIKE 'MK3%'
ORDER BY ps.proposalnumber DESC)
WHERE ROWNUM <= 200
(I would only recommend this approach as a last resort)
If I were doing this I would first tkprof the query to see actually how much work it is doing,
e.g: the cost of index range scans could be way off
forgot to mention....
You should check the actual cardinality:
SELECT count(*) FROM cr_proposalsearch ps WHERE UPPER(ps.customerpostcode) like 'MK3%'
and then compare it to the cardinality in the query plan.

You don't seem to have a perfectly fitting index. The index CR_PROPOSALSEARCH_I1 can be used to retrieve the rows in descending order of the attribute PROPOSALNUMBER. It's probably chosen because Oracle can avoid to retrieve all matching rows, sort them according to the ORDER BY clause and then discard all rows except the first ones.
Without the ROWNUM condition, Oracle uses the XIF25CR_PROPOSALSEARCH index (you didn't give any details about it) because it's probably rather selective regarding the WHERE clause. But it will require to sort the result afterwards. This is probably the more efficent plan based on the assumption that you'll retrieve all rows.
Since either index is a trade-off (one is better for sorting, the other one better for applying the WHERE clause), details such as ROWNUM determine which execution plan Oracle chooses.

This condition:
WHERE UPPER(ps.customerpostcode) like 'MK3%'
is not continuous, that is you cannot preserve a single ordered range for it.
So there are two ways to execute this query:
Order by number then filter on code.
Filter on code then order by number.
Method 1 is able to user an index on number which gives you linear execution time (top 100 rows would be selected 2 times faster than top 200, provided that number and code do not correlate).
Method 2 is able to use a range scan for coarse filtering on code (the range condition would be something like code >= 'MK3' AND code < 'MK4'), however, it requires a sort since the order of number cannot be preserved in a composite index.
The sort time depends on the number of top rows you are selecting too, but this dependency, unlike that for method 1, is not linear (you always need at least one range scan).
However, the filtering condition in method 2 is selective enough for the RANGE SCAN with a subsequent sort to be more efficient than a FULL SCAN for the whole table.
This means that there is a tipping point: for this condition: ROWNUM <= X there exists a value of X so that method 2 becomes more efficient when this value is exceeded.
Update:
If you are always searching on at least 3 first symbols, you can create an index like this:
SUBSTRING(UPPER(customerpostcode), 1, 3), proposalnumber
and use it in this query:
SELECT *
FROM (
SELECT *
FROM cr_proposalsearch ps
WHERE SUBSTRING(UPPER(customerpostcode, 1, 3)) = SUBSTRING(UPPER(:searchquery), 1, 3)
AND UPPER(ps.customerpostcode) LIKE UPPER(:searchquery) || '%'
ORDER BY
proposalNumber DESC
)
WHERE rownum <= 200
This way, the number order will be preserved separately for each set of codes sharing first 3 letters which will give you a more dense index scan.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas