Oracle indexes: "DISTINCT_KEYS" vs. "NUM_ROWS". Do I need a NONUNIQUE index?

I have a table with a lot of indexes. I noticed that on one of them "DISTINCT_KEYS" is almost the same as "NUM_ROWS". Is such an index needed?
Or maybe it is better to remove it, because:
it takes up space in the database;
it unnecessarily slows down index maintenance when adding data to the table.
What do you think? Will deleting this index slow down queries that filter on this column?

Is such an index needed?
All you can tell from statistics like DISTINCT_KEYS and NUM_ROWS (and other statistics, like histograms) is whether an index might be useful. An index is only truly "needed" if it is actually being used by queries in your system (see the ALTER INDEX ... MONITORING USAGE command).
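For example, a minimal sketch of usage monitoring (the index name is a placeholder):
-- Start recording whether the optimizer ever touches the index
ALTER INDEX idx_to_check MONITORING USAGE;
-- ...run a representative workload, then check the USED flag:
SELECT index_name, monitoring, used FROM v$object_usage WHERE index_name = 'IDX_TO_CHECK';
-- Stop recording when done
ALTER INDEX idx_to_check NOMONITORING USAGE;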
An index having DISTINCT_KEYS that is almost equal to NUM_ROWS certainly might be useful. In fact, it would be much more natural to suspect an index to be useless if DISTINCT_KEYS was a very low percentage of NUM_ROWS.
Suppose you have a query:
SELECT column_x
FROM table_y
WHERE column_z = :some_value
Suppose the index on column_z shows DISTINCT_KEYS = 999999 and NUM_ROWS = 1000000.
That means, on average, each distinct key has (very) slightly more than one row. That makes the index very selective and very useful. When our query runs, we will use the index to pull out only one row of the table very quickly.
Suppose, instead, the index on column_z shows DISTINCT_KEYS = 2 and NUM_ROWS = 1000000. Now each distinct key has an average of 500,000 rows. This index is worthless, because we would have to read half of the blocks in the index and then still probably wind up reading at least half of the blocks in the table (probably way more than half). Worse, these reads are all single-block reads. It would be way, way faster for Oracle to ignore the index and do a full table scan -- fewer blocks to read in total, and all the reads are multi-block reads (e.g., 8 at a time).
For completeness, I'll point out that an index with DISTINCT_KEYS = 2 and NUM_ROWS = 1000000 could still be useful if the data is very skewed. That is, for example, if one distinct key had 999,000 rows and the other distinct key had only 1,000 rows. The index would be useful for finding the rows of that other (smaller) distinct key. Oracle gathers histograms as part of its statistics to keep track of which columns have skewed data and, if so, how many rows there are for each distinct key. (Over-simplification).
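If you suspect that kind of skew, you can ask Oracle to gather a histogram on the column explicitly. A hedged sketch, reusing the placeholder table and column names from the query above:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'TABLE_Y',
    method_opt => 'FOR COLUMNS column_z SIZE 254');  -- up to 254 histogram buckets
END;
/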
TL;DR It's very likely a good index and no more likely to be "unneeded" than any other index in your system.

Related

SQL Server query is taking too much time

I have this simple query, but it is taking 1 minute for just 0.5M records, even though all columns mentioned in the SELECT are in a non-clustered index.
Both tables have approx. 1M records and approx. 200 columns each.
Is having lots of records in the table, or having a lot of indexes, causing the slowness?
SELECT catalog_items.id,
catalog_items.store_code,
catalog_items.web_code AS web_code,
catalog_items.name AS name,
catalog_items.name AS item_description,
catalog_items.image_thumnail AS image_thumnail,
catalog_items.purchase_description AS purchase_description,
catalog_items.sale_description AS sale_description,
catalog_items.taxable,
catalog_items.is_unique_item,
ISNULL(catalog_items.inventory_posting_flag, 'Y') AS inventory_posting_flag,
catalog_item_extensions.total_cost,
catalog_item_extensions.price
FROM catalog_items
LEFT OUTER JOIN catalog_item_extensions ON catalog_items.id = catalog_item_extensions.catalog_item_id
WHERE catalog_items.trans_flag = 'A';
Update: the execution plan shows a missing index, but the same index is already there. Why?
I'm not convinced that the plan is currently wrong, on the basis that you mention selecting 500k rows out of a table of 1m rows. Even with an index as suggested by others, the selectivity of that index is pretty weak from a tipping-point perspective (https://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/) - even with 200 columns, I wouldn't expect 500k out of 1m rows per table to result in index seeks with lookups; a full scan would be faster in the CBO's view.
The missing index question - if you look closely, it's not just suggesting an index on trans_flag; it's suggesting to index that field and then INCLUDE a number more. We can't see how many it's suggesting to include, but I would expect it to be all of the columns in the query - it's basically suggesting you create a covering index, as sketched below. Even in an NC index scan scenario, this would be faster to scan than the base table.
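A hedged sketch of such a covering index (the index name is made up; the INCLUDE list is assumed to be the remaining columns from the SELECT above):
CREATE NONCLUSTERED INDEX ix_catalog_items_covering -- hypothetical name
ON catalog_items (trans_flag)
INCLUDE (id, store_code, web_code, name, image_thumnail,
         purchase_description, sale_description, taxable,
         is_unique_item, inventory_posting_flag);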
We also have no information about the physical layout as yet: how the pages are constructed, the level of fragmentation, or even what kind of disks the data is on and the overall size. That image_thumnail field, for example, is suggestive of a large overall row size, which means we are dealing with off-page storage into LOB / SLOB.
In short - even with a query plan, there is no 'easy' answer here in my view.
For this query
select . . .
from catalog_items ci left outer join
catalog_item_extensions cie
on ci.id = cie.catalog_item_id
where ci.trans_flag = 'A'
I would recommend an index on catalog_items(trans_flag, id) and catalog_item_extensions(catalog_item_id).
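A sketch of those two indexes (index names are made up):
CREATE INDEX ix_catalog_items_trans_flag_id ON catalog_items (trans_flag, id);
CREATE INDEX ix_cie_catalog_item_id ON catalog_item_extensions (catalog_item_id);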

Why is the SQL Server index not used?

Most of my SQL queries have WHERE rec_id <> 'D'; for example:
select * from Table1 where Field1 = 'ABC' and rec_id <> 'D'
I added an index on REC_ID. But when I run this query and look at the execution plan, the new index (REC_ID) is not used. The execution plan shows a cost of 50% for the non-clustered index on Field1 and 50% for a RID lookup (heap) in Table1.
Why is the index on REC_ID not used?
For this query:
select *
from Table1
where Field1 = 'ABC' and rec_id <> 'D';
The best index is table1(Field1, rec_id).
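A sketch of that index (the name is made up):
CREATE NONCLUSTERED INDEX ix_table1_field1_recid ON Table1 (Field1, rec_id);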
However, your query may not be able to take advantage of an index. The goal of using an index for a where clause is to reduce the number of pages that need to be read. To understand the concept for non-clustered indexes on normal rows, you need some basic ideas:
Records are stored on pages.
Each page is 8,192 bytes (slightly less is usable for data) and can store some number of records.
The entire page is loaded into memory to read a record.
Say a record is about 80 bytes and there are 100 records on each page. If 10% of the records have Field1 = 'ABC', then there will be about ten on each page. That means that using the index would not (typically) save any page reads. If 1% of the records match, then there is about one on each page. The index still isn't helpful.
If only 0.01% of the records match (30 in your case), then only a fraction of the pages need to be read. This is the sweet spot for indexes, and where they are really helpful.
The proportion of matching records is called "selectivity". If the where clause is not very selective, then a non-clustered index will not be useful.
Sometimes, a clustered index can be helpful in this situation. However, clustered indexes may have more overhead for insert and certain update transactions. So, the choice of index needs to be based on the queries being processed and other ways that the table is used.
SQL Server uses many factors to decide which indices to use. It must have determined that using the index on Field1 would be more effective than using the index on rec_id - meaning that Field1 = {value} defines a smaller set than rec_id <> {value}, based on data dispersion, etc., so there are fewer records to compare against the other condition. Note that the actual value is usually irrelevant in determining which index to use.

resource busy while rebuilding an index

There is a table T with column a:
CREATE TABLE T (
id_t integer not null,
text varchar2(100),
a integer
)
/
ALTER TABLE T ADD CONSTRAINT PK_T PRIMARY KEY (ID_T)
/
Index was created like this:
CREATE INDEX IDX_T$A ON T(a);
Also there's such a check constraint:
ALTER TABLE T ADD CHECK (a is null or a = 1);
Most of the records in T have a null value of a, so a query using the index works really fast if the index is in a consistent state and its statistics are up to date.
But the problem is that the values of a in some rows change really frequently (some rows get a null value, some get 1), and I need to rebuild the index, let's say, every hour.
However, really often when the job tries to rebuild the index, it gets an exception:
ORA-00054: resource busy and acquire with NOWAIT specified
Can anybody help me with coping with this issue?
An index rebuild is not needed in most cases. Of course, newly created indexes are efficient, and their efficiency decreases over time, but this process stops after a while - it simply converges to some level.
If you really need to optimize indexes, try the less invasive DDL command ALTER INDEX ... SHRINK SPACE COMPACT.
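For the index in question, that would be:
-- Coalesces sparse index blocks in place; the freed space stays allocated to the index
ALTER INDEX IDX_T$A SHRINK SPACE COMPACT;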
PS: I would also recommend using a smaller block size (4K or 8K) for your tablespace storage.
Have you tried adding "ONLINE" to that index rebuild statement?
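For example:
-- The online option avoids holding the exclusive DML lock for the whole rebuild (an Enterprise Edition feature)
ALTER INDEX IDX_T$A REBUILD ONLINE;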
Edit: If an online rebuild is not available, then you might look at a fast-refresh-on-commit materialised view to store the rowids or primary keys of the rows that have a 1 in column A.
Start with a look at the documentation:
http://docs.oracle.com/cd/B28359_01/server.111/b28326/repmview.htm
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_6002.htm#SQLRF01302
You'd create a materialised view log on the table, and then a materialised view.
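An untested sketch of that approach, using the table from the question (the view name is made up):
-- The log must capture rowids so the view can be fast refreshed
CREATE MATERIALIZED VIEW LOG ON T WITH ROWID (a) INCLUDING NEW VALUES;
-- Store the rowids of the rows that currently have a = 1
CREATE MATERIALIZED VIEW MV_T_A_1
REFRESH FAST ON COMMIT
AS SELECT rowid AS rid, id_t FROM T WHERE a = 1;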
Think in particular about the resource requirements for this: changes to the master table require a change vector to be written to the materialised view log, which is effectively an additional insert for every change. Then the changes have to be propagated to another table (the materialised view storage table) with additional queries. It is by no means a low-impact option.
Rebuilding for Performance
Most Oracle experts are skeptical of frequently rebuilding indexes. For example, a quick glance at the presentation Rebuilding the Truth will show you that indexes do not behave in the naive way many people assume they do.
One of the relevant points in that presentation is "fully deleted blocks are recycled and are not generally problematic". If your values completely change, then your index should not grow infinitely large. Although your indexes are used in a non-typical way, that behavior is probably a good thing.
Here's a quick example. Create 1 million rows and index 100 of them.
--Create table, constraints, and index.
CREATE TABLE T
(
id_t integer primary key,
text varchar2(100),
a integer check (a is null or a = 1)
);
CREATE INDEX IDX_T$A ON T(a);
--Insert 1M rows, with 100 "1"s.
insert into t
select level, level, case when mod(level, 10000) = 0 then 1 else null end
from dual connect by level <= 1000000;
commit;
--Initial sizes:
select segment_name, bytes/1024/1024 MB
from dba_segments
where segment_name in ('T', 'IDX_T$A');
SEGMENT_NAME      MB
T                 19
IDX_T$A       0.0625
Now completely shuffle the index rows around 1000 times.
--Move the 1s around 1000 times. Takes about 6 minutes.
begin
for i in 9000 .. 10000 loop
update t
set a = case when mod(id_t, i) = 0 then 1 else null end
--Don't update if the value is the same
where nvl(a,-1) <> nvl(case when mod(id_t,i) = 0 then 1 else null end,-1);
commit;
end loop;
end;
/
The index segment size is still the same.
--The index size is the same.
select segment_name, bytes/1024/1024 MB
from dba_segments
where segment_name in ('T', 'IDX_T$A');
SEGMENT_NAME      MB
T                 19
IDX_T$A       0.0625
Rebuilding for Statistics
It's good to worry about the statistics of objects whose data changes so dramatically. But again, although your system is unusual, it may work fine with the default Oracle behavior. Although the rows indexed may completely change, the relevant statistics may stay the same. If there are always 100 rows indexed, the number of rows, blocks, and distinctness will stay the same.
Perhaps the clustering factor will significantly change, if the 100 rows shift from being completely random to being very close to each other. But even that may not matter. If there are millions of rows, but only 100 indexed, the optimizer's decision will probably be the same regardless of the clustering factor. Reading 1 block (awesome clustering factor) or reading 100 blocks (worst-case clustering factor) will still look much better than doing a full table scan of millions of rows.
But statistics are complicated; I'm surely over-simplifying things. If you need to keep your statistics a specific way, you may want to lock them. Unfortunately, you can't lock just an index, but you can lock the table and its dependent indexes.
begin
dbms_stats.lock_table_stats(ownname => user, tabname => 'T');
end;
/
Rebuilding anyway
If a rebuild is still necessary, @Robe Eleckers' idea to retry should work. Although, instead of catching the exception and retrying, it would be easier to set DDL_LOCK_TIMEOUT.
alter session set ddl_lock_timeout = 500;
The session will still need to get an exclusive lock on the table, but this will make it much easier to find the right window of opportunity.
Since the field in question has very low cardinality I would suggest using a bitmap index and skipping the rebuilds altogether.
CREATE BITMAP INDEX IDX_T$A ON T(a);
Note (as mentioned in the comments): transactional performance is very low for bitmap indexes, so this would only work well if there are very few overlapping transactions doing updates to the table.

Why is there a non-clustered index scan when counting all rows in a table?

As far as I understand it, each transaction sees its own version of the database, so the system cannot get the total number of rows from some counter and thus needs to scan an index. But I thought it would be the clustered index on the primary key, not the additional indexes. If I had more than one additional index, which one will be chosen, anyway?
When digging into the matter, I've noticed another strange thing. Suppose there are two identical tables, Articles and Articles2, each with three columns: Id, View_Count, and Title. The first has only a clustered PK-based index, while the second one has an additional non-clustered, non-unique index on view_count. The query SELECT COUNT(1) FROM Articles runs 2 times faster for the table with the additional index.
SQL Server will optimize your query - if it needs to count the rows in a table, it will choose the smallest possible set of data to do so.
So if you consider your clustered index - it contains the actual data pages - possibly several thousand bytes per row. To load all those bytes just to count the rows would be wasteful - even just in terms of disk I/O.
Therefore, if there is a non-clustered index that's not filtered or restricted in any way, SQL Server will pick that data structure to count, since the non-clustered index basically contains just the columns you've put into the NC index (plus the clustered index key) - much less data to load just to count the number of rows.
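You can see this for yourself with a sketch like the following (assuming the Articles table from the question; index id 1 is the clustered index):
SET STATISTICS IO ON;
SELECT COUNT(1) FROM Articles; -- the optimizer picks the smallest suitable index
SELECT COUNT(1) FROM Articles WITH (INDEX(1)); -- force the clustered index for comparison
-- Compare the 'logical reads' reported for the two statements.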

Index performance with WHERE clause in SQL

I'm reading about indexes in my database book and I was wondering if I was correct in my assumption that a WHERE clause with a non-constant expression in it will not use the index.
So if i have
SELECT * FROM statuses WHERE app_user_id % 10 = 0;
This would not use an index created on app_user_id. But
SELECT * FROM statuses WHERE app_user_id = 5;
would use the index on app_user_id.
Usually (there are other options) a database index is a B-Tree, which means that you can do range scans on it (including equality scans).
The condition app_user_id % 10 = 0 cannot be evaluated with a single range scan, which is why a database will probably not use an index.
It could still decide to use the index in another way, namely for a full scan: reading the whole index takes less time than reading the whole table. On the other hand, after reading the index, you may still have to go back to the table, so the overall cost may end up being higher.
This is up to the database query optimizer to decide.
A few examples:
select app_user_id from t where app_user_id % 10 = 0
Here, you do not need the table at all, all necessary data is in the index. The database will most likely do a full index scan.
select count(*) from t where app_user_id % 10 = 0
Same. Full index scan.
select count(*) from t
Only if app_user_id is NOT NULL can this be done with the index (because NULL data is not in the index - at least on Oracle, at least for single-column indexes; your database may handle this differently).
Some databases do not need to access the table or the index for this; they maintain row counts in the metadata.
select * from t where app_user_id = 5
This is the classic scenario for an index. The database can look at the small section of the index tree, retrieve a small (just one if this was a unique or primary index) number of rowids and fetch those selectively from the table.
select * from t where app_user_id between 5 and 10
Another classic index case. Range scan in the tree returns a small number of rowids to fetch from the table.
select * from t where app_user_id between 5 and 10 order by app_user_id
Since index scans return ordered data, you even get the sorting for free.
select * from t where app_user_id between 5 and 1000000000
Maybe here you should not be using an index. It seems to match too many records. This is a case where having bind variables hide the range from the database could actually be detrimental.
select * from t where app_user_id between 5 and 1000000000
order by app_user_id
But here, since sorting would be very expensive (even taking up temporary swap disk space), maybe iterating in index order is good. Maybe.
select * from t where app_user_id % 10 = 0
This is difficult to decide. We need all columns, so ultimately the query needs to touch the table. The question is whether to go through an index first. The query returns approximately 10% of the whole table. That is probably too much for an index access path to be efficient. If the optimizer has reason to believe that the query returns much less than 10% of the table, an index scan followed by accessing the table might be good. Same if the table is very fragmented (lots of deleted rows eating up space).