Using more than one index per table is dangerous? - sql

In a former company I worked at, the rule of thumb was that a table should have no more than one index (allowing the odd exception, and certain parent-tables holding references to nearly all other tables and thus are updated very frequently).
The idea being that often, indexes cost the same or more to uphold than they gain. Note that this question is different to indexed-view-vs-indexes-on-table as the motivation is not only reporting.
Is this true? Is this index-purism worth it?
In your career do you generally avoid using indexes?
What are the general large-scale recommendations regarding indexes?
Currently and at the last company we use SQL Server, so any product specific guidelines are welcome too.

You need to create exactly as many indexes as you need to create. No more, no less. It is as simple as that.
Everybody "knows" that an index will slow down DML statements on a table. But for some reason very few people actually bother to test just how "slow" it becomes in their context. Sometimes I get the impression that people think that adding another index will add several seconds to each inserted row, making it a game changing business tradeoff that some fictive hotshot user should decide in a board room.
I'd like to share an example that I just created on my 2 year old pc, using a standard MySQL installation. I know you tagged the question SQL Server, but the example should be easily converted. I insert 1,000,000 rows into three tables. One table without indexes, one table with one index and one table with nine indexes.
drop table numbers;
drop table one_million_rows;
drop table one_million_one_index;
drop table one_million_nine_index;
/*
|| Create a dummy table to assist in generating rows
*/
create table numbers(n int);
insert into numbers(n) values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
/*
|| Create a table consisting of 1,000,000 consecutive integers
*/
create table one_million_rows as
select d1.n + (d2.n * 10)
+ (d3.n * 100)
+ (d4.n * 1000)
+ (d5.n * 10000)
+ (d6.n * 100000) as n
from numbers d1
,numbers d2
,numbers d3
,numbers d4
,numbers d5
,numbers d6;
/*
|| Create an empty table with 9 integer columns.
|| One column will be indexed
*/
create table one_million_one_index(
c1 int, c2 int, c3 int
,c4 int, c5 int, c6 int
,c7 int, c8 int, c9 int
,index(c1)
);
/*
|| Create an empty table with 9 integer columns.
|| All nine columns will be indexed
*/
create table one_million_nine_index(
c1 int, c2 int, c3 int
,c4 int, c5 int, c6 int
,c7 int, c8 int, c9 int
,index(c1), index(c2), index(c3)
,index(c4), index(c5), index(c6)
,index(c7), index(c8), index(c9)
);
/*
|| Insert 1,000,000 rows in the table with one index
*/
insert into one_million_one_index(c1,c2,c3,c4,c5,c6,c7,c8,c9)
select n, n, n, n, n, n, n, n, n
from one_million_rows;
/*
|| Insert 1,000,000 rows in the table with nine indexes
*/
insert into one_million_nine_index(c1,c2,c3,c4,c5,c6,c7,c8,c9)
select n, n, n, n, n, n, n, n, n
from one_million_rows;
My timings are:
1m rows into table without indexes: 0,45 seconds
1m rows into table with 1 index: 1,5 seconds
1m rows into table with 9 indexes: 6,98 seconds
I'm better with SQL than statistics and math, but I'd like to think that:
Adding 8 indexes to my table, added (6,98-1,5) 5,48 seconds in total. Each index would then have contributed 0,685 seconds (5,48 / 8) for all 1,000,000 rows. That would mean that the added overhead per row per index would have been 0,000000685 seconds. SOMEBODY CALL THE BOARD OF DIRECTORS!
In conclusion, I'd like to say that the above test case doesn't prove a shit. It just shows that tonight, I was able to insert 1,000,000 consecutive integers into in a table in a single user environment. Your results will be different.

That is utterly ridiculous. First, you need multiple indexes in order to perfom correctly. For instance, if you have a primary key, you automatically have an index. that means you can't index anything else with the rule you described. So if you don't index foreign keys, joins will be slow and if you don't index fields used in the where clause, queries will still be slow. Yes you can have too many indexes as they do take extra time to insert and update and delete records, but no more than one is not dangerous, it is a requirement to have a system that performs well. And I have found that users tolerate a longer time to insert better than they tolerate a longer time to query.
Now the exception might be for a system that takes thousands of readings per second from some automated equipment. This is a database that generally doesn't have indexes to speed inserts. But usually these types of databases are also not used for reading, the data is transferred instead daily to a reporting database which is indexed.

Yes, definitely - too many indexes on a table can be worse than no indexes at all. However, I don't think there's any good in having the "at most one index per table" rule.
For SQL Server, my rule is:
index any foreign key fields - this helps JOINs and is beneficial to other queries, too
index any other fields when it makes sense, e.g. when lots of intensive queries can benefit from it
Finding the right mix of indices - weighing the pros of speeding up queries vs. the cons of additional overhead on INSERT, UPDATE, DELETE - is not an exact science - it's more about know-how, experience, measuring, measuring, and measuring again.
Any fixed rule is bound to be more contraproductive than anything else.....
The best content on indexing comes from Kimberly Tripp - the Queen of Indexing - see her blog posts here.

Unless you like very slow reads, you should have indexes. Don't go overboard, but don't be afraid of being liberal about them either. EVERY FK should be indexed. You're going to do a look up each of these columns on inserts to other tables to make sure the references are set. The index helps. As well as the fact that indexed columns are used often in joins and selects.
We have some tables that are inserted into rarely, with millions of records. Some of these tables are also quite wide. It's not uncommon for these tables to have 15+ indexes. Other tables with heavy inserting and low reads we might only have a handful of indexes- but one index per table is crazy.

Updating an index is once per insert (per index). Speed gain is for every select. So if you update infrequently and read often, then the extra work may be well worth it.
If you do different selects (meaning the columns you filter on are different), then maintaining an index for each type of query is very useful. Provided you have a limited set of columns that you query often.
But the usual advice holds: if you want to know which is fastest: profile!

You should of course be careful not to create too many indexes per table, but only ever using a single index per table is not a useful level.
How many indexes to use depends on how the table is used. A table that is updated often would generally have less indexes than one that is read much more often than it's updated.
We have some tables that are updated regularly by a job every two minutes, but they are read often by queries that vary a lot, so they have several indexes. One table for example have 24 indexes.

So much depends on your schema and the queries that you normally run. For example: if you normally need to select above 60% of the rows of your table, indexes won't help you and it will be cheaper to table scan than to index scan and then lookup rows. Focused queries that select a small number of rows in different parts of the table or which are used for joins in queries will probably benefit from indexes. The right index in the right place can make or break a feature.
Indexes take space so making too many indexes on a table can be counter productive for the same reasons listed above. Scanning 5 indexes and then performing row lookups may be much more expensive than simply table scanning.
Good design is the synthesis about about knowing when to normalise and when not to.
If you frequently join on a particular column, check the IO plan with the index and without. As a general rule I avoid tables with more than 20 columns. This is often a sign that the data should be normalised. More than about 5 indexes on a table and you may be using more space for the indexes than the main table, be sure that is worth it. These rules are only the lightest of guidance and so much depends on how the data will be used in queries and what your data update profile looks like.
Experiment with your query plans to see how your solution improves or degrades with an index.

Every table must have a PK, which is indexed of course (generally a clustered one), then every FK should be indexed as well.
Finally you may want to index fields on which you often sort on, if their data is well differenciated: for a field with only 5 possible values in a table with 1 million records, an index will not be of a great benefit.
I tend to be minimalistic with indexes, until the db starts beeing well filled, and ...slower. It is easy to identify the bottlenecks and add just the right the indexes at that point.

Optimizing the retrieval with indexes must be carefully designed to reflect actual query patterns. Surely, for a table with Primary Key, you will have at least one clustered index (that's how data is actually stored), then any additional indexes are taking advantage of the layout of the data (clustered index).
After analyzing queries that execute against the table, you want to design an index(s) that cover them. That may mean building one or more indexes but that heavily depends on the queries themselves. That decision cannot be made just by looking at column statistics only.
For tables where it's mostly inserts, i.e. ETL tables or something, then you should not create Primary Keys, or actually drop indexes and re-create them if data changes too quickly or drop/recreated entirely.
I personally would be scared to step into an environment that has a hard-coded rule of indexes per table ratio.

Related

index value or create new column for dimensions?

Main Tables have 5-25 columns and 1,000,000-100,000,000 rows, most of columns are FK to tables with less then 100 values like 'unique char(6)' and many views that perform calculations, joins and grouping need to show the user values like that original 'unique char(5)'.
In this case it is better to create tinyint identity(1,1) index column for each dimension table and join them to result of view or create and use indexes on original values like that char(6)?
You should test to see which works better.
However, it is quite reasonable to assume that dimension tables using tinyint would benefit performance, because they would significantly reduce the size of each row -- from 5 or 6 bytes to 1. This means that your large tables would fit on many fewer pages, reducing I/O time.
Balanced against this is the cost of the joins. However, the reference tables would easily fit into memory. And by making the lookup key the primary key on the table, the lookups would also be much faster.

SQL Batched Updates to large table with many indexes

We have a table with 18 columns, 7 of them bit columns, with over 100 million rows. It has 6 non-clustered indexes, 5 of which have the column I need to update.
The primary key (clustered) is a uniqueidentifier, called EntityID
I need to update one of the bit flags on this table using a different table which contains the values I need to sync. My manager asked me to write the update to run in batches since even the smallest of updates take a while due to all the indexes and the shear number of rows in the table. He also asked that the update run based on the EntityID sorted ASC, he mentioned something about reducing the number of pages being read.
I've written probably 5 different versions of a sorted batched update, and they work, but I'm interested to see if there is already a well polished template I could use to do this.
select 1
while(##rowcount > 0)
begin
update top (100000) t
set t.bit = s.bit
from table t
join tbls s
on s.EntityID = t.EntityID
and t.bit != s.bit
end
I would advise against sorting. Let the query optimizer do its thing.
If you have any t.bit is null I would do that separate as an or slows down the update.
I suggest you disable all the indexes, update, and then enable in indexes.
You would need to do some testing, and it really depends on if you can stop other queries during this time, but quite often it's much faster to
drop the indexes
do the inserts/updates
recreate the indexes

dictionary database, one table vs table for each char

I have a very simple database contains one table T
wOrig nvarchar(50), not null
wTran nvarchar(50), not null
The table has +50 million rows. I execute a simple query
select wTran where wOrig = 'myword'
The query takes about 40 sec to complete. I divided the table based on the first char of wOrig and the execution time is much smaller than before (based on each table new length).
Am I missing something here? Should not the database use more efficient way to do the search, like binary search?
My question What changes to the database options - based on this situation - could make the search more efficient in order to keep all the data in one table?
You should be using an index. For your query, you want an index on wTran(wOrig). Your query will be much faster:
create index idx_wTran_wOrig on wTran(wOrig);
Depending on considerations such as space and insert/update characteristics, a clustered index on (wOrig) or (wOrig, wTran) might be the best solution.

Oracle: How can I find tablespace fragmentation?

I've a JOIN beween two tables. It's really really slow and I can't find why.
The query takes hours in a PRODUCTION environment on a very big Client.
Can you ask me what you need to understand why it doesn't work well?
I can add indexes, partition the table, etc. It's Oracle 10g.
I expect a few thousand record. Because of the following condition:
f.eif_campo1 != c.fornitura AND and f.field29 = 'New'
Infact it should be always verified for all 18 million records
SELECT c.id_messaggio
,f.campo1
,c.f
FROM
flows c,
tab f
WHERE
f.field198 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field1 != c.ExampleF
and f.field29 = 'New'
and c.processtype in ('Example1')
and c.flag_ann = 'N';
Selectivity for the following record expressed as number of distinct values:
COUNT (DISTINCT extra_id) =>17*10^6,
COUNT (DISTINCT (extra_id || field20)) =>17*10^6,
COUNT (DISTINCT field198) =>36*10^6,
COUNT (DISTINCT (field19 || field20)) =>45*10^6,
COUNT (DISTINCT (field1)) =>18*10^6,
COUNT (DISTINCT (field20)) =>47
This is the execution plan [See large image][1]
![enter image description here][2]
Extra details:
I have relaxed one contition to see how many records are taken. 300 thousand.
![enter image description here][7]
--03:57 mins with parallel execution /*+ parallel(c 8) parallel(f 24) */
--395.358 rows
SELECT count(1)
FROM
flows c,
flet f
WHERE
f.field19 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field20 = 'ExampleF'
and c.process_type in ('ExampleP')
and c.flag_ann = 'N';
Your explain plan shows the following.
The database uses an index to retrieve rows from ENI_FLUSSI_HUB where
flh_tipo_processo_cod in ('VT','VOLTURA_ENI','CC')
It then winnows the rows
where flh_flag_ann = 'N'
This produces a result set which is used to access
rows from ETL_ELAB_INTERF_FLAT on the basis of f.idde_identif_dati_ext_id =
c.idde_identif_dati_ext_id
Finally those rows are filtered on the basis of the
remaining parts of the WHERE clause.
Now, the starting point is a good one if flh_tipo_processo_cod is a selective
column: that is, if it contains hundreds of different values, or if the values in
your list are relatively rare. It might even be a good path of the flag column
identifies relatively few columns with a value of 'N'. So you need to understand
both the distribution of your data - how many distinct values you have - and its
skew - which values appear very often or hardly at all. The overall
performance suggests that the distribution and/or skew of the
flh_tipo_processo_cod and flh_flag_ann columns is not good.
So what can you do? One approach is to follow Ben's suggestion, and use full
table scans. If you have an Enterprise Edition licence and plenty of CPU capacity
you could try parallel query to improve things. That might still be too slow, or it might be too disruptive for other users.
An alternative approach would be to use better indexes. A composite index on
eni_flussi_hub(flh_tipo_processo_cod,flh_flag_ann,idde_identif_dati_ext_id,
flh_fornitura,flh_id_messaggio) would avoid the need to read that table. Whether
this would be a new index or a replacement for ENI_FLK_IDX3 depends on the other
activity against the table. You might be able to benefit from index compression.
All the columns in the query projection are referenced in the WHERE clause. So
you could also use a composite index on the other table to avoid table reads. Agsin you need to understand the distribution and skew of the data. But you should probably lead with the least selective columns. Something like etl_elab_interf_flat(etl_elab_interf_flat,eif_campo200,dde_identif_dati_ext_id,eif_campo1,eif_campo198). Probably this is a new index. It's unlikely you would want to replace ETL_EIF_FK_IDX4 with this (especially if that really is an index on a foreign key constraint)..
Of course these are just guesses on my part. Tuning is a science and to do it properly requires lots of data. Use the Wait Interface to investigate where the database is spending its time. Use the 10053 event to understand why the Optimizer makes the choices it does. But above all, don't implement partitioning unless you really know the ramifications.
The simple answer seems to be your explain plan. You're accessing both tables by index rowid. Whilst to select a single row you cannot - to my knowledge - get faster, in your case you're selecting a lot more than a single row.
This means that for every single row you, you're going into both tables one row at a time, which when you're looking a significant proportion of a table or index is not what you want to do.
My suggestion would be to force a full scan of one or both of your tables. Try to use the smaller as a driver first:
SELECT /*+ full(c) */ c.flh_id_messaggio
, f.eif_campo1
, c.f
FROM flows c,
JOIN flet f
ON f.field19 = c.flh_id_messaggio
AND f.extra_id = c.extra_id
AND f.field1 <> c.f
WHERE ...
But you may have to change /*+ full(c) */ to /*+ full(c) full(f) */.
Your indexes seem to be separate column indexes as well. For this, and if possible, I would have indexes on:
flows of id_messaggio, extra_id, f
and on flet of field19, extra_id, field1.
This will only really matter if you do not use as full scan. Or, if you have everything you're returning and selecting is in one index.

SQL Index Performance

We have a table called table1 ...
(c1 int indentity,c2 datetime not null,c3 varchar(50) not null,
c4 varchar(50) not null,c5 int not null,c6 int ,c7 int)
on column c1 is primary key(clusterd Index)
on column c2 is index_2(Nonclusterd)
on column c3 is index_2(Nonclusterd)
on column c4 is index_2(Nonclusterd)
on column c5 is index_2(Nonclusterd)
It contains 10 million records. We have several procedures pointing to "table1" with different search criteria:
select from table1 where c1=blah..and c2= blah.. and c3=blah..
select from table1 where c2=blah..and c3= blah.. and c4=blah..
select from table1 where c1=blah..and c3= blah.. and c5=blah..
select from table1 where c1=blah..
select from table1 where c2=blah..
select from table1 where c3=blah..
select from table1 where c4=blah..
select from table1 where c5=blah..
What is the best way to create non-clustered index apart from above, or modify existing indexes to get good index performance and reduce the execution time?
And now to actually respond...
The trick here is that you have single-column lookups on any number of columns, as well as composite column lookups. You need to understand with what frequency the queries above are executing - for those that are run very seldom, you should probably exclude them from your indexing considerations.
You may be better off creating single NCIX's on each of the columns being queried. This would likely be the case if the number of rows being returned is very small, as the NCIX's would be able to handle the "single lookup" queries as well as the composite lookups. Alternatively, you could create single-column NCIX's in addition to covering composite indexes - again, the deciding factor being the frequency of execution and the number of results being returned.
This is somewhat tough to answer with just the information you have provided. There are other factors you need to weigh out.
For example:
How often is the table updated and what columns are updated frequently?
You'll be paying a cost on these updates due to index maintenance.
What is the cardinality of your different columns?
What queries are you executing most often and what columns appear in the where clause of those queries?
You need to first figure out what your parameters for acceptable performance are for each of your queries and work from their taking into account the things I have mentioned above.
With 10 million rows, partitioning your table could make a lot of sense here.
Have you thought about using the Full Text Search component of MSSQL. It might offer the performance that you are looking for?