[1] states:
"When data is deleted from a heap, the data on the page is not compressed (reclaimed). And should all of the rows of a heap page are deleted, often the entire page cannot be reclaimed"
"The ALTER INDEX rebuild and reorganize options cannot be used to defragment and reclaim space in a heap (but they can used to defragment non-clustered indexes on a heap).
If you want to defragment a heap in SQL Server 2005, you have three options:
1) create a clustered index on the heap, then drop the clustered index;
2) Use SELECT INTO to copy the old table to a new table; or
3) use BCP or SSIS to move the data from the old table to a new table.
In SQL Server 2008, the ALTER TABLE command has been changed so that it now has the ability to rebuild a heap"
Please explain:
What is the difference between compression, (de)fragmentation, reclaiming space, shrinkfile, and shrinkdatabase in MS SQL Server 2005?
What do shrinkfile and shrinkdatabase accomplish in MS SQL Server 2005?
Update:
The question was inspired by discussion in [2] - how to shrink database in MS SQL Server 2005?
Update2: #PerformanceDBA,
Congrats! You've gained over 500 reputation in just a week. This is remarkable!
Your diagram
Thanks, once more, for your time.
I shall ask later and not here.
Internals are not my primary preoccupation, nor the easiest one.
It is very succinct and generally does not invite any doubts or questions.
I'd prefer some tool, descriptions/instructions, or technique around which to develop my doubts, questions, and discussion.
Please see, for example, my questions:
http://www.sqlservercentral.com/Forums/Topic1013693-373-2.aspx#bm1014385
http://www.sqlservercentral.com/Forums/Topic1013975-373-1.aspx
They are basically duplicates of what I asked but cannot discuss on stackoverflow.com.
Update3: #PerformanceDBA,
thanks, once more. The main purpose of my questions was to determine how to resolve concrete questions (knowing what to rely on, and what to avoid) given the contradictory docs, articles, discussions, answers, etc., which you helped me to detect.
Currently I do not have further (unresolvable and blocking me) questions in this area.
[1]
Brad McGehee. Brad's Sure Guide to Indexes
(11 June 2009)
http://www.simple-talk.com/sql/database-administration/brads-sure-guide-to-indexes/
[2]
Answers and feedback to question
"Shrink a database below its initial size"
https://stackoverflow.com/questions/3543884/shrink-a-database-below-its-initial-size/3774639#3774639
No one touched this in over one month.
The answers to the first three are actually in the Diagram I Made for You, which you have not bothered to digest and ask questions about ... it is often used as a platform for discussion.
(That is a condensed version of my much more elaborate Sybase diagrams, which I have butchered for the MS context. There is a link at the bottom of that doc, if you want the full Sybase set.)
Therefore I am not going to spend much time on you either. And please do not ask for links to "reference sites", there ain't no such thing (what is available is non-technical rubbish), which is precisely why I have my own diagrams; there are very few people who understand MS SQL Internals.
reclaiming the space
That is the correct term. MS does not remove deleted rows from the page, or deleted pages from the extent. Reclaiming the space is an operation that goes through the Heap and removes the unused (a) rows and (b) pages. Of course that changes the RowIds, so all Nonclustered indices have to be rebuilt.
compression
In the context of the pasted text: same as Reclaiming space.
defragmentation
The operation of full-scale removal of unused space. There are three Levels:
I. Database (AllocationUnits), across all objects
II. Object (Extent & Page), Page Chains, Split Pages, Overflow Pages
III. Heap Only (No Clustered index), the subject of the post
shrinkfile
Quite a different operation: reducing the space allocated on a Device (File). This removes unused AllocationUnits (hence 'shrink') but it is not the same as de-fragmenting AllocationUnits.
shrinkdatabase
To do the same for a Database: all Allocations used by the database, across all Devices.
Response to Comments
The poster at SSC is clueless and does not address your question directly.
there is no such thing as a Clustered table (CREATE CLUSTERED TABLE fails)
there is such a thing as a Clustered index (CREATE CLUSTERED INDEX succeeds)
as per my diagrams, it is a single physical structure; the clustered index INCLUDES the rows and thus the Heap is eliminated
where there is no Clustered index, there are two physical structures: a Heap and a separate Nonclustered Index
Now before you go diving into them with DBCC, which is too low level, and clueless folks cannot identify, let alone explain, the whys and wherefores, you need to understand and confirm the above:
create a Table_CI (we are intending to add a CI, there is still no such thing as a Clustered Table)
add a unique clustered index to it, UC_PK
add a few rows
create a table Heap
add a unique Nonclustered index to it, NC_PK
add a few rows
SELECT * FROM sysindexes WHERE id = OBJECT_ID('Table_CI')
SELECT * FROM sysindexes WHERE id = OBJECT_ID('Heap')
note that each sysindexes entry is a complete, independent, data storage structure (look at the columns)
contemplate the output
compare with my diagram
compare with the rubbish in the universe
In future, I will not answer questions about the confused rubbish in the universe, and the incorrect and misinformed posts on other sites (I do not care if they are MS Certified Professionals, they have proved that they are incapable of inspecting their databases and determining the correct information)
There is a reason I have bothered to create accurate diagrams (the manuals, pictures, and all available info from MS, is all rubbish; no use looking for accurate info from the "authority", because the "authority" is technically bankrupt).
Even Gail finally gets around to it:
I suspect you'd benefit from more reading on overall architecture of indexes before fiddling with the low level internals.
Except there isn't any. None, at least, that is not confusing, non-technical, and inconsistent.
There is a reason I have bothered to create accurate diagrams.
Back to the DBCCs. Gail is simply incorrect. In a Clustered Index (which includes the rows), the single page contains rows. Yes, rows: that is the leaf level of the index. There is a B-tree; it lives in the top of the page, but it is so small that you can't see it. Look at the sysindexes output. The root and firstpage pointer IS pointing to the page; that is the root of the Clustered Index. When you dive into the ocean, you need to know what to look for, AND where to find it; otherwise you won't find what you are looking for, and you will get distracted by the flotsam and jetsam that you do find by accident.
Now look at the TWO SEPARATE STRUCTURES for the NCI and the Heap.
Oh, and MS has changed from using the OAM terminology to IAM where the data structure is an index. That introduces confusion. In terms of data structures (entries in sysindexes), they are all Objects; they might or might not be Indices. The point is, who cares: we know what it is, it is an ObjectAllocationMap. If you are looking at an NCI, gee, it is an IndexObjectAllocationMap; if you are looking at a Heap, it is a HeapObjectAllocMap. I will let you ponder what it is in the case of a CI. In chasing it down, or in using it (finding the pages that belong to the OBJECT), it does not matter: they are all Objects. When doing that, you need to know that some objects have a PageChain and others do not (another of your questions). CIs have them; NCIs and Heaps do not.
Gail Shaw: "I doubt these kinds of internals are documented anywhere. After all, we're using undocumented features. Definition of index depends who you ask and where you look.
ROTFLMAO. My sides hurt, I could not read the posts that followed it. These are supposed to be intelligent human beings ? Working in the IT world ? Definitions CHANGE ? What with the temperature or the time of day ? And that was SQL Server Central ? Not the backwoods ?
When MS stole SQL Server from Sybase, the documentation was rock solid. Of course, with each major release, they "rewrite" it, and the docs get weaker and more fluffy (recall our discussion in another post). Now we have cute pictures that make people feel good but are inaccurate, technically. Which is why earnest people like you have problems. The pictures do not even match the text in the manuals.
Anyway, DEFINITIONS do not change. That's the definition of definitions. They are true in any context. And Um, the um feature you are using is an ordinary, documented feature. Has been since 1987. Except MS lost it somewhere and no one can find it. You'll have to ask a Sybase Guru who was around in the old days, who remembers what exact data structures were in the code that they acquired. And if you are really lucky, he will be up to date with the differences that MoronSociety has introduced in 2000, 2005, 2008. He might even have a single accurate diagram that matches the output of sysindexes and DBCC on your box. If you find him, kiss his ring and shower him with gold. Lock up your daughters.
(not serious, my sides are killing me, the mirth is overflowing).
Now do you see why I will not answer questions about the confused rubbish in the universe ? There are just SO MANY morons out there in MoronSociety.
-----
Gail again:
"Scans:
An index scan is a complete read of all of the leaf pages in the index. When an index scan is done on the clustered index, it’s a table scan in all but name.
When an index scan is done by the query processor, it is always a full read of all of the leaf pages in the index, regardless of whether all of the rows are returned. It is never a partial scan.
A scan does not only involve reading the leaf levels of the index, the higher level pages are also read as part of the index scan."
There must be a reason she is named after fast wind. She writes "books" ? Yeah, fantasy novels. Hot air is for balloonists not IT professionals.
Complete and total drivel. The whole point of an Index Scan AND WHY IT IS PREFERABLE TO A TABLE SCAN, because it is trying to AVOID A TABLE SCAN, is that:
- the engine (executing the query tree) can go directly to the Index (Clustered or Nonclustered, at this point)
- navigate the B-Tree to find the place to start (which up to this point, is much the same as when it is getting a few rows, ie. not scanning)
- the B-Tree (from any good TECHNICAL diagram) is a few pages, containing many, many index entries per page, so it is very fast
- that's the root plus non-leaf levels
- until it finds a leaf-level entry that qualifies
- from that point on, it does a SCAN, sequentially, through the LEAF level of said index (fat blue arrow)
now for NCIs, if you remember your homework, that means the leaf level pages are full of index_leaf_level_entry + CI_key
so it is scanning sequentially across the NCI Leaf level (that's why there is a PageChain only at the leaf level of NCIs, so that it can navigate across)
but jumping all over the place on the HEAP, to get the data rows
but for a CI, the leaf level IS the data row (data pages, with only data rows, that's why you cannot see an "index" in them; the non-leaf-level CI pages are pure index pages containing index_entries only)
so when it SCANS the index leaf_level sequentially, using the PageChain, it is SCANNING the data sequentially, they are the same operation (fat green arrow)
no Heap
no jumping around
For comparison, then, a TABLE SCAN (MS Only):
- has no PageChain on the Heap
- has no choice, but to start at the beginning
- and read every data page
- of which many will be fragmented (contain unused space left by deleted or forwarded rows)
- and others will be completely empty
The whole intent is that the optimiser has already decided not to go for a table (heap) scan; that it could go for an Index scan instead (because it required LESS than the full range of data, and it could find the starting point of that data via some index). If you look at your SHOWPLAN, even for retrieving a single unique PK row, it says "INDEX SCAN". All that means is, it will navigate the B-Tree first, to find at least one row. And then it may scan the leaf level, until it finds an end point. If it is a covered query, it never goes to the data rows.
There is no substitute for a Clustered Index.
Related
Is there a way in PostgreSQL to mark a table as storing data that will never change after insertion, thus improving performance? One example where this could help is Index-Only Scans and Covering Indexes:
However, for seldom-changing data there is a way around this problem.
PostgreSQL tracks, for each page in a table's heap, whether all rows
stored in that page are old enough to be visible to all current and
future transactions. This information is stored in a bit in the
table's visibility map. An index-only scan, after finding a candidate
index entry, checks the visibility map bit for the corresponding heap
page. If it's set, the row is known visible and so the data can be
returned with no further work. If it's not set, the heap entry must be
visited to find out whether it's visible, so no performance advantage
is gained over a standard index scan.
If PostgreSQL knew that the data in a table never changes the heap entries would never have to be visited in index-only scans.
If you have data that truly never change, run VACUUM (FREEZE) on the table. If there is no concurrent long-running transaction, that will mark all blocks in the visibility map as “all-frozen” and “all-visible”. Then you will get index-only scans and anti-wraparound autovacuum won't have to do any work on the table.
This is a little hard to follow in the documentation, but basically you need to vacuum the table after it has been created. Then if there are no changes (and no locks), no worries.
The documentation does explain that vacuuming updates the visibility map:
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
. . .
To update the visibility map, which speeds up index-only scans.
Indexes make reads fast but writes slower. But why can't you have single writes and have the db add indexes asynchronously over time, caching the INSERT until it's indexed?
Is there any database like that?
Converting my comments to an answer:
indexes make read fast but write slower
That's an oversimplification and it's also misleading.
Indexes make data lookups faster because the DBMS doesn't need to do a table-scan to find rows matching a predicate (the WHERE part of a query). Indexes don't make "reads" any faster (that's entirely dependent on the characteristics of your disk IO) and when used improperly they can sometimes even make queries slower (for reasons I won't get into).
I want to stress that the additional cost of writing to a single index, or even multiple indexes, when executing a DML statement (INSERT/UPDATE/DELETE/MERGE/etc) is negligible, really! (In actuality: foreign-key constraints are a much bigger culprit - and I note you can practically eliminate the cost of foreign-key constraint checking by adding additional indexes!). Indexes are primarily implemented using B-trees (a B-tree is essentially like a binary-tree, except rather than each node having only 2 children it can have many children because each tree-node comes with unused space for all those child node pointers, so inserting into the middle of a B-tree won't require data to be moved-around on-disk unlike with other kinds of trees, like a heap-tree).
Consider this Q&A where a Postgres user (like yourself) reports inserting 10,000 rows into a table. Without an index it took 78ms, with an index it took 84ms: only about an 8% increase, which at that scale (6ms!) is so small it may as well be a rounding error or caused by IO scheduling. That should be proof enough that it isn't something you should worry about without actual hard data showing it's a problem for you and your application.
I assume you have this negative impression about indexes after reading an article like this one, which certainly gives the impression that "indexes are bad" - but while the points mentioned in that article are not wrong, there's a LOT of problems with that article so you shouldn't take it dogmatically. (I'll list my concerns with that article in the footer).
But why can't you have single writes and have db add indexes asynchronously with time
By this I assume you mean you'd like a DBMS to do a single-row INSERT by simply appending a new record to the end of a table, immediately returning, and then at some arbitrary later point having the DBMS's housekeeping system update the indexes.
The problem with that is that it breaks the A, C, and I parts of the A.C.I.D. model.
Indexes are used for more than just avoiding table-scans: they're also used to store copies of table data for the benefit of queries that use the index and need (for example) only a small subset of the table's data, which significantly reduces disk reads. For this reason, RDBMS (and ISO SQL) allow indexes to include non-indexed data using the INCLUDE clause.
Consider this scenario:
CREATE INDEX IX_Owners ON cars ( ownerId ) INCLUDE ( colour );
CREATE INDEX IX_Names ON people ( name ) INCLUDE ( personId, hairColour );
GO
SELECT
people.name,
people.hairColour,
cars.colour
FROM
cars
INNER JOIN people ON people.personId = cars.ownerId
WHERE
people.name LIKE 'Steve%'
The above query will not need to read either the cars or people tables on-disk. The DBMS will be able to fully answer the query using data only in the index - which is great because indexes tend to exist in a small number of pages on-disk which tend to be in proximal-locality which is good for performance because it means it will use sequential IO which scales much better than random IO.
The RDBMS will perform a string-prefix index-scan of the people.IX_Names index to get all of the personId (and hairColour) values, then it will look-up those personId values in the cars.IX_Owners index and be able to get the car.colour from the copy of the data inside the IX_Owners index without needing to read the tables directly.
Now, assume that another database client has just finished inserting a load of records into the cars and/or people tables, with a COMMIT TRANSACTION just for good measure, and that the RDBMS uses your idea of only updating indexes later whenever it feels like it. If that same database client re-runs the query from above, it would return stale data (i.e. wrong data) because the query uses the index, but the index is old.
In addition to using index tree nodes to store copies of table data to avoid non-proximal disk IO, many RDBMS also use index-trees to store entire copies - even multiple copies of table data, to enable other scenarios, such as columnar data storage and indexed-VIEWs - both of these features absolutely require that indexes are updated atomically with table data.
Is there any database like that?
Yes, they exist - but they're not widely used (or they're niche) because for the vast majority of applications it's entirely undesirable behaviour for the reasons described above.
There are distributed databases that are designed around eventual consistency, but clients (and entire application code) needs to be designed with that in-mind, and it's a huge PITA to have to redesign a data-centric application to support eventual-consistency which is why you only really see them being used in truly massive systems (like Facebook, Google, etc) where availability (uptime) is more important than users seeing stale-data for a few minutes.
Footnote:
Regarding this article: https://use-the-index-luke.com/sql/dml/insert
The number of indexes on a table is the most dominant factor for insert performance. The more indexes a table has, the slower the execution becomes. The insert statement is the only operation that cannot directly benefit from indexing because it has no where clause.
I disagree. I'd argue that foreign-key constraints (and triggers) are far more likely to have a larger detrimental effect on DML operations.
Adding a new row to a table involves several steps. First, the database must find a place to store the row. For a regular heap table—which has no particular row order—the database can take any table block that has enough free space. This is a very simple and quick process, mostly executed in main memory. All the database has to do afterwards is to add the new entry to the respective data block.
I agree with this.
If there are indexes on the table, the database must make sure the new entry is also found via these indexes. For this reason it has to add the new entry to each and every index on that table. The number of indexes is therefore a multiplier for the cost of an insert statement.
This is true, but I don't know if I agree that it's a "multiplier" of the cost of an insert.
For example, consider a table with hundreds of nvarchar(1000) columns and several int columns - and there's separate indexes for each int column (with no INCLUDE columns). If you're inserting 100x megabyte-sized rows all-at-once (using an INSERT INTO ... SELECT FROM statement) the cost of updating those int indexes is very likely to require much less IO than the table data.
Moreover, adding an entry to an index is much more expensive than inserting one into a heap structure because the database has to keep the index order and tree balance. That means the new entry cannot be written to any block—it belongs to a specific leaf node. Although the database uses the index tree itself to find the correct leaf node, it still has to read a few index blocks for the tree traversal.
I strongly disagree with this, especially the first sentence: "adding an entry to an index is much more expensive than inserting one into a heap structure".
Indexes in RDBMS today are invariably based on B-trees, not binary-trees or heap-trees. B-trees are essentially like binary-trees except each node has built-in space for dozens of child node pointers and B-trees are only rebalanced when a node fills its internal child pointer list, so a B-tree node insert will be considerably cheaper than the article is saying because each node will have plenty of empty space for a new insertion without needing to re-balance itself or any other relatively expensive operation (besides, DBMS can and do index maintenance separately and independently of any DML statement).
The article is correct about how the DBMS will need to traverse the B-tree to find the node to insert into, but index nodes are efficiently arranged on-disk, such as keeping related nodes in the same disk page, which minimizes index IO reads (assuming they aren't already loaded into memory first). If an index tree is too big to store in-memory the RDBMS can always keep a "meta-index" in-memory, so it could potentially find the correct B-tree node almost instantly without needing to traverse the B-tree from the root.
Once the correct leaf node has been identified, the database confirms that there is enough free space left in this node. If not, the database splits the leaf node and distributes the entries between the old and a new node. This process also affects the reference in the corresponding branch node as that must be duplicated as well. Needless to say, the branch node can run out of space as well so it might have to be split too. In the worst case, the database has to split all nodes up to the root node. This is the only case in which the tree gains an additional layer and grows in depth.
In practice this isn't a problem, because the RDBMS's index maintenance will ensure there's sufficient free space in each index node.
The index maintenance is, after all, the most expensive part of the insert operation. That is also visible in Figure 8.1, “Insert Performance by Number of Indexes”: the execution time is hardly visible if the table does not have any indexes. Nevertheless, adding a single index is enough to increase the execute time by a factor of a hundred. Each additional index slows the execution down further.
I feel the article is being dishonest by suggesting (implying? stating?) that index-maintenance happens with every DML. This is not true. This may have been the case with some early dBase-era databases, but this is certainly not the case with modern RDBMS like Postgres, MS SQL Server, Oracle and others.
Considering insert statements only, it would be best to avoid indexes entirely—this yields by far the best insert performance.
Again, this claim in the article is not wrong, but it's basically saying if you want a clean and tidy house you should get rid of all of your possessions. Indexes are a fact of life.
However tables without indexes are rather unrealistic in real world applications. You usually want to retrieve the stored data again so that you need indexes to improve query speed. Even write-only log tables often have a primary key and a respective index.
Indeed.
Nevertheless, the performance without indexes is so good that it can make sense to temporarily drop all indexes while loading large amounts of data—provided the indexes are not needed by any other SQL statements in the meantime. This can unleash a dramatic speed-up which is visible in the chart and is, in fact, a common practice in data warehouses.
Again, with modern RDBMS this isn't always necessary. If you do a batch insert then an RDBMS won't update indexes until after the table-data has finished being modified, as a batch index update is cheaper than many individual updates. Similarly I expect that multiple DML statements and queries inside an explicit BEGIN TRANSACTION may cause an index-update deferral, provided no subsequent query in the transaction relies on an updated index.
But my biggest issue with that article is that the author is making these bold claims about detrimental IO performance without providing any citations or even benchmarks they've run themselves. It's even more galling that they posted a bar-chart with arbitrary numbers on, again, without any citation or raw benchmark data and instructions for how to reproduce their results. Always demand citations and evidence from anything you read making claims: because the only claims anyone should accept without evidence are logical axioms - and a quantitative claim about database index IO cost is not a logical axiom :)
For PostgreSQL GIN indexes, there is the fastupdate feature. This stores new index entries in an unordered, unconsolidated area, waiting for some other process to file them away into the main index structure. But this doesn't directly match up with what you want. It is mostly designed so that the index updates are done in bulk (which can be more IO efficient), rather than in the background. Once the unconsolidated area gets large enough, a foreground process might take on the task of filing the entries away, and it can be hard to tune the settings in a way that gets this to always be done by a background process instead. And it only applies to GIN indexes. (With the btree_gin extension, you can create GIN indexes on regular scalar columns rather than the array-like columns they usually work with.) While waiting for the entries to be consolidated, every query will have to sequentially scan the unconsolidated buffer area, so delaying the updates for the sake of INSERT can come at a high cost for SELECTs.
There are more general techniques to do something like this, such as fractal tree indexes. But these are not implemented in PostgreSQL, and wherever they are implemented they seem to be proprietary.
I have a table which uses an auto-increment field (ID) as its primary key. The table is append-only and no rows will ever be deleted. The table has been designed to have a constant row size.
Hence, I expected to have O(1) access time using any value as ID, since it is easy to compute the exact position to seek to in the file (ID * row_size); unfortunately, that is not the case.
I'm using SQL Server.
Is it even possible?
Thanks
Hence, I expected to have O(1) access time using any value as ID since it is easy to compute exact position to seek in file (ID*row_size)
Ah. No. Autoincrement does not guarantee no holes, even without deletions. Holes = seek via index. Ergo: your assumption is wrong.
I guess the thing that matters to you is the performance.
Databases use indexes to access records which are written on the disk.
Usually this is done with B+ tree indexes, whose lookups are O(log_b n), where the branching factor b for internal nodes is typically between 100 and 200 (optimized to block size; see ref)
This is still strictly speaking logarithmic performance, but given decent number of records, let's say a few million, the leaf nodes can be reached in 3 to 4 steps and that, together with all the overhead for query planning, session initiation, locking, etc (that you would have anyway if you need multiuser, ACID compliant data management system) is certainly for all practical reasons comparable to constant time.
The good news is that an indexed read is O(log(n)), which for large values of n gets pretty close to O(1). That said, in this context O notation is not very useful, and actual timings are far more meaningful.
Even if it were possible to address rows directly, your query would still have to go through the client and server protocol stacks and carry out various lookups and memory allocations before it could give the result you want. It seems like you are expecting something that isn't even practical. What is the real problem here? Is SQL Server not fast enough for you? If so there are many options you can use to improve performance but directly seeking an address in a file is not one of them.
Not possible. SQL Server organizes data into a tree-like structure based on key and index values; an "index" in the DB sense is more like a reference book's index and not like an indexed data structure like an array or list. At best, you can get logarithmic performance when searching on an indexed value (PKs are generally treated as an index). Worst-case is a table scan for a non-indexed column, which is linear. Until the database gets very large, the seek time of a well-designed query against a well-designed table will pale in comparison to the time required to send it over the network or even a named pipe.
If you are talking about btrees, I wouldn't imagine that the additional overhead of a non clustered index (not counting stuff like full text search or other kind of string indexing) is even measurable, except for an extremely high volume high write scenario.
What kind of overhead are we actually talking about? Why would it be a bad idea to just index everything? Is this implementation specific? (in that case, I am mostly interested in answers around pg)
EDIT: To explain the reasoning behind this a bit more...
We are looking to specifically improve performance right now across the board, and one of the key things we are looking at is query performance. I have read the things mentioned here, that indexes will increase db size on disk and will slow down writes. The question came up today when one pair did some pre-emptive indexing on a new table, since we usually apply indexes in a more reactive way. Their argument was that they weren't indexing string fields, and they weren't doing clustered indexes, so the negative impact of possibly redundant indexes should be barely measurable.
Now, I am far from an expert in such things, and those arguments made a lot of sense to me based on what I understand.
Now, I am sure there are other reasons, or I am misunderstanding something. I know a redundant index will have a negative effect, what I want to know is how bad it will be (because it seems negligible). The whole indexing every field thing is a worst case scenario, but I figured if people could tell me what that will do to my db, it will help me understand the concerns around being conservative with indexing, or just throwing them out there when it has a possibility of helping things.
Random thoughts
Indexes benefit reads of course
You should index where you get the most bang for your buck
Most DBs are > 95% read (think about updates, FK checks, duplicate checks etc = reads)
"Everything" is pointless: most indexed need to be composite with includes
Define high volume we have 15-20 million new rows per day with indexes
Introduction to Indices
In short, an index, whether clustered or non-, adds extra "branches" to the "tree" in which data is stored by most current DBMSes. This makes finding values with a single unique combination of the index logarithmic-time instead of linear-time. This reduction in access time speeds up many common tasks the DB does; however, when performing tasks other than that, it can slow it down because the data must be accessed through the tree. Filtering based on non-indexed columns, for instance, requires the engine to iterate through the tree, and because the ratio of branch nodes (containing only pointers to somewhere else in the tree) to leaf nodes has been reduced, this will take longer than if the index were not present.
In addition, non-clustered indices separate data based on column values, but if those column values are not very unique across all table rows (like a flag indicating "yes" or "no"), then the index adds an extra level of complexity that doesn't actually help the search; in fact, it hinders it because in navigating from root to leaves of the tree, an extra branch is encountered.
I am sure the exact overhead is probably implementation specific, but off the top of my head some points:
Increased Disk Space requirements.
All writes (inserts, updates, deletes) cost more as all indexes must be updated.
Increased transaction locking overhead (all indexes must be updated within a transaction, leading to more locks being required, etc.).
Potentially increased complexity for the query optimizer (choosing which index is most likely to perform best; Also potential for one index to be chosen when another index would actually be better).
I understand that indexes should get updated automatically, but when that does not happen we need to reindex.
My questions are: (1) Why does this automatic update fail, or why does an index become bad?
(2) How do I programmatically know which table/index needs re-indexing at a given point in time?
Indexes' statistics may be updated automatically. I do not believe that the indexes themselves would be rebuilt automatically when needed (although there may be some administrative feature that allows such a thing to take place).
Indexes associated with tables which receive a lot of changes (new rows, updated rows and deleted rows) may become fragmented, and less efficient. Rebuilding the index then "repacks" it into a contiguous section of storage space, a bit akin to the way defragmentation of the file system makes file access faster...
Furthermore, indexes (on several DBMSes) have a FILL_FACTOR parameter, which determines how much extra space should be left in each node for growth. For example, if you expect a given table to grow by 20% next year, declaring a fill factor of around 80% should keep fragmentation of the index minimal during the first year (there may be some if that 20% of growth is not evenly distributed...).
In SQL Server, it is possible to query properties of an index that indicate its level of fragmentation, and hence its possible need for maintenance. This can be done by way of the interactive management console. It is also possible to do it programmatically, by way of sys.dm_db_index_physical_stats in MSSQL 2005 and above (maybe even older versions?).