SQL Server: What is the difference between Index Rebuilding and Index Reorganizing?

What is the difference between Index Rebuilding and Index Reorganizing?

Think about how the index is implemented. It's generally some kind of tree, like a B+ tree or B-tree. The index itself is created by looking at the keys in the data and building the tree so the table can be searched efficiently.
When you reorganize the index, you go through the existing index, cleaning up blocks for deleted records and so on. This could be done (and is, in some databases) when you make a deletion, but that imposes a performance penalty. Instead, you do it separately so it can run more or less in batch mode.
When you rebuild the index, you delete the existing tree and read all the records, building a new tree directly from the data. That gives you a new, and hopefully optimized, tree that may be better than the result of reorganizing the table; it also lets you regenerate the tree if it has somehow become corrupted.

REBUILD locks the table for the whole operation period (which may be hours or days if the table is large).
REORGANIZE doesn't lock the table.
Well, actually, it places some temporary locks on the pages it is currently working with, but they are removed as soon as the operation on each page is complete (which takes a fraction of a second for any given lock).
As @Andomar noted, there is an option to REBUILD an index online, which creates the new index and, when the operation is complete, simply replaces the old index with the new one.
This of course means you should have enough space to keep both the old and the new copy of the index.
REBUILD is also a DDL operation which changes the system tables, affects statistics, enables disabled indexes, etc.
REORGANIZE is a pure cleanup operation which leaves all system state as is.
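For reference, here is a minimal T-SQL sketch of the two commands (table and index names are placeholders; online rebuild is typically an Enterprise Edition feature):

-- Rebuild: drops and recreates the index; ONLINE = ON avoids holding long table locks
ALTER INDEX IX_MyIndex ON dbo.MyTable REBUILD WITH (ONLINE = ON);

-- Reorganize: defragments the leaf level in place and is always an online operation
ALTER INDEX IX_MyIndex ON dbo.MyTable REORGANIZE;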

There are a number of differences. Basically, rebuilding is a total rebuild of an index - it will build a new index and then drop the existing one, whereas reorganising will simply, well... reorganise the existing one.
This blog entry I came across a while back will explain it much better than I can. :)

Rebuild drops the current index and recreates a new one.
Reorganizing is like putting the house in order with what you already have.
It is good practice to use 30% fragmentation as the threshold to decide between rebuild and reorganize:
< 30%: reorganize; > 30%: rebuild.

"Reorganize index" is a process of cleaning, organizing, and defragmenting of "leaf level" of the B-tree (really, data pages).
Rebuilding of the index is changing the whole B-tree, recreating the index.
It’s recommended that index should be reorganized when index fragmentation is from 10% to 40%; if index fragmentation is great than 40%, it’s better to rebuild it.
Rebuilding of an index takes more resources, produce locks and slowing performance (if you choose to keep table online). So, you need to find right time for that process.
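As a sketch only (the thresholds are the ones from this answer, and the query assumes it is run in the target database), the decision can be driven from sys.dm_db_index_physical_stats:

SELECT  OBJECT_NAME(ips.object_id) AS table_name,
        i.name AS index_name,
        ips.avg_fragmentation_in_percent,
        CASE
            WHEN ips.avg_fragmentation_in_percent > 40 THEN 'REBUILD'
            WHEN ips.avg_fragmentation_in_percent >= 10 THEN 'REORGANIZE'
            ELSE 'no action'
        END AS suggested_action
FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN    sys.indexes AS i
    ON  i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE   ips.index_id > 0;   -- skip heaps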

In addition to the differences above (basically rebuild will create the index anew, and then "swap it in" for the existing one, rather than trying to fix the existing one), an important consideration is that a rebuild - even an Enterprise ONLINE rebuild - will interfere with snapshot isolation transactions.
TX1 starts snapshot transaction
TX1 reads from T
TX2 rebuilds index on T
TX2 rebuild complete
TX1 reads from T again:
Error 3961, Snapshot isolation transaction failed in database because the object accessed by the statement has been modified by a DDL statement in another concurrent transaction since the start of this transaction. It is disallowed because the metadata is not versioned. A concurrent update to metadata can lead to inconsistency if mixed with snapshot isolation.
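A minimal sketch of that sequence in T-SQL, assuming a database with ALLOW_SNAPSHOT_ISOLATION ON and a table T with an index IX_T (both placeholders):

-- Session 1: start a snapshot transaction and read
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT COUNT(*) FROM T;    -- first read succeeds

-- Session 2: rebuild the index while session 1 is still open
ALTER INDEX IX_T ON T REBUILD WITH (ONLINE = ON);

-- Session 1: read again inside the same transaction
SELECT COUNT(*) FROM T;    -- fails with error 3961
ROLLBACK;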

Rebuild index - rebuilds one or more indexes for a table in the specified database.
Reorganise index - Defragments clustered and secondary indexes of the specified table

Related

PostgreSQL - storing data that never changes

Is there a way in PostgreSQL to mark a table as storing data that will never change after insertion, thus improving performance? One example where this could help is Index-Only Scans and Covering Indexes:
However, for seldom-changing data there is a way around this problem. PostgreSQL tracks, for each page in a table's heap, whether all rows stored in that page are old enough to be visible to all current and future transactions. This information is stored in a bit in the table's visibility map. An index-only scan, after finding a candidate index entry, checks the visibility map bit for the corresponding heap page. If it's set, the row is known visible and so the data can be returned with no further work. If it's not set, the heap entry must be visited to find out whether it's visible, so no performance advantage is gained over a standard index scan.
If PostgreSQL knew that the data in a table never changes, the heap entries would never have to be visited in index-only scans.
If you have data that truly never change, run VACUUM (FREEZE) on the table. If there is no concurrent long-running transaction, that will mark all blocks in the visibility map as “all-frozen” and “all-visible”. Then you will get index-only scans and anti-wraparound autovacuum won't have to do any work on the table.
This is a little hard to follow in the documentation, but basically you need to vacuum the table after it has been created. Then if there are no changes (and no locks), no worries.
The documentation does explain that vacuuming updates the visibility map:
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
. . .
To update the visibility map, which speeds up index-only scans.
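A short sketch of the VACUUM (FREEZE) approach described above (the table name is a placeholder; the pg_visibility extension is optional and only used here to verify the result):

-- Freeze and mark all pages all-visible once the data is fully loaded
VACUUM (FREEZE, VERBOSE) my_archive_table;

-- Optional check that every page is all-visible and all-frozen
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT count(*) FILTER (WHERE NOT all_visible) AS pages_not_visible,
       count(*) FILTER (WHERE NOT all_frozen)  AS pages_not_frozen
FROM   pg_visibility_map('my_archive_table');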

Is it possible to do usual atomic INSERT operation but update Indexes asynchronously?

Indexes make reads fast but writes slower. But why can't you have single writes and have the db add indexes asynchronously with time, also caching the INSERT until it's indexed?
Is there any database like that?
Converting my comments to an answer:
indexes make reads fast but writes slower
That's an oversimplification and it's also misleading.
Indexes make data lookups faster because the DBMS doesn't need to do a table-scan to find rows matching a predicate (the WHERE part of a query). Indexes don't make "reads" any faster (that's entirely dependent on the characteristics of your disk IO) and when used improperly they can sometimes even make queries slower (for reasons I won't get into).
I want to stress that the additional cost of writing to a single index, or even multiple indexes, when executing a DML statement (INSERT/UPDATE/DELETE/MERGE/etc) is negligible, really! (In actuality: foreign-key constraints are a much bigger culprit - and I note you can practically eliminate the cost of foreign-key constraint checking by adding additional indexes!). Indexes are primarily implemented using B-trees (a B-tree is essentially like a binary-tree, except rather than each node having only 2 children it can have many children because each tree-node comes with unused space for all those child node pointers, so inserting into the middle of a B-tree won't require data to be moved-around on-disk unlike with other kinds of trees, like a heap-tree).
Consider this QA where a Postgres user (like yourself) reports inserting 10,000 rows into a table. Without an index it took 78ms, with an index it took 84ms, that's only a 7.5% increase, which at that scale (6ms!) is so small it may as well be a rounding error or caused by IO scheduling. That should be proof enough it shouldn't be something you should worry about without actual hard data showing it's a problem for you and your application.
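If you want to run that kind of comparison yourself, here is a rough PostgreSQL/psql sketch (the table and row counts are illustrative, not the ones from the linked QA; timings will vary with hardware):

CREATE TABLE insert_test (id integer, val text);
\timing on
-- 10,000 rows, no index
INSERT INTO insert_test SELECT g, md5(g::text) FROM generate_series(1, 10000) AS g;
-- add an index, empty the table, and insert the same rows again
CREATE INDEX idx_insert_test_val ON insert_test (val);
TRUNCATE insert_test;
INSERT INTO insert_test SELECT g, md5(g::text) FROM generate_series(1, 10000) AS g;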
I assume you have this negative impression about indexes after reading an article like this one, which certainly gives the impression that "indexes are bad" - but while the points mentioned in that article are not wrong, there's a LOT of problems with that article so you shouldn't take it dogmatically. (I'll list my concerns with that article in the footer).
But why can't you have single writes and have the db add indexes asynchronously with time
By this I assume you mean you'd like a DBMS to do a single-row INSERT by simply appending a new record to the end of a table and then immediately returning, and then at some arbitrary later point the DBMS' housekeeping system would update the indexes.
The problem with that is that it breaks the A, C, and I parts of the A.C.I.D. model.
Indexes are used for more than just avoiding table-scans: they're also used to store copies of table data for the benefit of queries that use the index and need only (for example) a small subset of the table's data, which significantly reduces disk reads. For this reason, RDBMS (and ISO SQL) allow indexes to include non-indexed data using the INCLUDE clause.
Consider this scenario:
CREATE INDEX IX_Owners ON cars ( ownerId ) INCLUDE ( colour );
CREATE INDEX IX_Names ON people ( name ) INCLUDE ( personId, hairColour );
GO
SELECT
    people.name,
    people.hairColour,
    cars.colour
FROM
    cars
    INNER JOIN people ON people.personId = cars.ownerId
WHERE
    people.name LIKE 'Steve%';
The above query will not need to read either the cars or people tables on-disk. The DBMS will be able to fully answer the query using data only in the index - which is great because indexes tend to exist in a small number of pages on-disk which tend to be in proximal-locality which is good for performance because it means it will use sequential IO which scales much better than random IO.
The RDBMS will perform a string-prefix index-scan of the people.IX_Names index to get all of the personId (and hairColour) values, then it will look-up those personId values in the cars.IX_Owners index and be able to get the car.colour from the copy of the data inside the IX_Owners index without needing to read the tables directly.
Now, assume that another database client has just finished inserting a load of records into the cars and/or people tables (with a COMMIT TRANSACTION just for good measure), and that the RDBMS uses your idea of only updating indexes later whenever it feels like it. If that same client then re-runs the query from above, it will get stale data (i.e. wrong data) back, because the query uses the index, but the index is out of date.
In addition to using index tree nodes to store copies of table data to avoid non-proximal disk IO, many RDBMS also use index-trees to store entire copies - even multiple copies of table data, to enable other scenarios, such as columnar data storage and indexed-VIEWs - both of these features absolutely require that indexes are updated atomically with table data.
Is there any database like that?
Yes, they exist - but they're not widely used (or they're niche) because for the vast majority of applications it's entirely undesirable behaviour for the reasons described above.
There are distributed databases that are designed around eventual consistency, but clients (and the entire application code) need to be designed with that in mind, and it's a huge PITA to have to redesign a data-centric application to support eventual consistency, which is why you only really see it used in truly massive systems (like Facebook, Google, etc.) where availability (uptime) is more important than users seeing stale data for a few minutes.
Footnote:
Regarding this article: https://use-the-index-luke.com/sql/dml/insert
The number of indexes on a table is the most dominant factor for insert performance. The more indexes a table has, the slower the execution becomes. The insert statement is the only operation that cannot directly benefit from indexing because it has no where clause.
I disagree. I'd argue that foreign-key constraints (and triggers) are far more likely to have a larger detrimental effect on DML operations.
Adding a new row to a table involves several steps. First, the database must find a place to store the row. For a regular heap table—which has no particular row order—the database can take any table block that has enough free space. This is a very simple and quick process, mostly executed in main memory. All the database has to do afterwards is to add the new entry to the respective data block.
I agree with this.
If there are indexes on the table, the database must make sure the new entry is also found via these indexes. For this reason it has to add the new entry to each and every index on that table. The number of indexes is therefore a multiplier for the cost of an insert statement.
This is true, but I don't know if I agree that it's a "multiplier" of the cost of an insert.
For example, consider a table with hundreds of nvarchar(1000) columns and several int columns - and there's separate indexes for each int column (with no INCLUDE columns). If you're inserting 100x megabyte-sized rows all-at-once (using an INSERT INTO ... SELECT FROM statement) the cost of updating those int indexes is very likely to require much less IO than the table data.
Moreover, adding an entry to an index is much more expensive than inserting one into a heap structure because the database has to keep the index order and tree balance. That means the new entry cannot be written to any block—it belongs to a specific leaf node. Although the database uses the index tree itself to find the correct leaf node, it still has to read a few index blocks for the tree traversal.
I strongly disagree with this, especially the first sentence: "adding an entry to an index is much more expensive than inserting one into a heap structure".
Indexes in RDBMS today are invariably based on B-trees, not binary-trees or heap-trees. B-trees are essentially like binary-trees except each node has built-in space for dozens of child node pointers and B-trees are only rebalanced when a node fills its internal child pointer list, so a B-tree node insert will be considerably cheaper than the article is saying because each node will have plenty of empty space for a new insertion without needing to re-balance itself or any other relatively expensive operation (besides, DBMS can and do index maintenance separately and independently of any DML statement).
The article is correct about how the DBMS will need to traverse the B-tree to find the node to insert into, but index nodes are efficiently arranged on-disk, such as keeping related nodes in the same disk page, which minimizes index IO reads (assuming they aren't already loaded into memory first). If an index tree is too big to store in-memory, the RDBMS can always keep a "meta-index" in memory so it could potentially find the correct B-tree node almost instantly without needing to traverse the B-tree from the root.
Once the correct leaf node has been identified, the database confirms that there is enough free space left in this node. If not, the database splits the leaf node and distributes the entries between the old and a new node. This process also affects the reference in the corresponding branch node as that must be duplicated as well. Needless to say, the branch node can run out of space as well so it might have to be split too. In the worst case, the database has to split all nodes up to the root node. This is the only case in which the tree gains an additional layer and grows in depth.
In practice this isn't a problem, because the RDBMS's index maintenance will ensure there's sufficient free space in each index node.
The index maintenance is, after all, the most expensive part of the insert operation. That is also visible in Figure 8.1, “Insert Performance by Number of Indexes”: the execution time is hardly visible if the table does not have any indexes. Nevertheless, adding a single index is enough to increase the execute time by a factor of a hundred. Each additional index slows the execution down further.
I feel the article is being dishonest by suggesting (implying? stating?) that index-maintenance happens with every DML. This is not true. This may have been the case with some early dBase-era databases, but this is certainly not the case with modern RDBMS like Postgres, MS SQL Server, Oracle and others.
Considering insert statements only, it would be best to avoid indexes entirely—this yields by far the best insert performance.
Again, this claim in the article is not wrong, but it's basically saying if you want a clean and tidy house you should get rid of all of your possessions. Indexes are a fact of life.
However tables without indexes are rather unrealistic in real world applications. You usually want to retrieve the stored data again so that you need indexes to improve query speed. Even write-only log tables often have a primary key and a respective index.
Indeed.
Nevertheless, the performance without indexes is so good that it can make sense to temporarily drop all indexes while loading large amounts of data—provided the indexes are not needed by any other SQL statements in the meantime. This can unleash a dramatic speed-up which is visible in the chart and is, in fact, a common practice in data warehouses.
Again, with modern RDBMS this isn't necessary. If you do a batch insert then a RDBMS won't update indexes until after the table-data has finished being modified, as a batch index update is cheaper than many individual updates. Similarly I expect that multiple DML statements and queries inside an explicit BEGIN TRANSACTION may cause an index-update deferral provided no subsequent query in the transaction relies on an updated index.
But my biggest issue with that article is that the author is making these bold claims about detrimental IO performance without providing any citations or even benchmarks they've run themselves. It's even more galling that they posted a bar-chart with arbitrary numbers on it, again without any citation or raw benchmark data and instructions for how to reproduce their results. Always demand citations and evidence from anything you read that makes claims: the only claims anyone should accept without evidence are logical axioms - and a quantitative claim about database index IO cost is not a logical axiom :)
For PostgreSQL GIN indexes, there is the fastupdate feature. This stores new index entries in an unordered, unconsolidated area, waiting for some other process to file them away into the main index structure. But this doesn't directly match up with what you want. It is mostly designed so that the index updates are done in bulk (which can be more IO-efficient), rather than in the background. Once the unconsolidated area gets large enough, a foreground process might take on the task of filing them away, and it can be hard to tune the settings so that this is always done by a background process instead of a foreground one. And it only applies to GIN indexes. (With the btree_gin extension, you can create GIN indexes on regular scalar columns rather than the array-like columns GIN usually works with.) While waiting for the entries to be consolidated, every query has to sequentially scan the unconsolidated buffer area, so delaying the updates for the sake of INSERT can come at a high cost for SELECTs.
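For illustration only (the table and column names are made up, and tags is assumed to be an array column), a GIN index with the fastupdate pending list enabled looks like this:

-- New entries go to an unconsolidated pending list before being merged into the main GIN structure
CREATE INDEX idx_events_tags ON events USING gin (tags)
    WITH (fastupdate = on, gin_pending_list_limit = 4096);   -- limit is in kB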
There are more general techniques to do something like this, such as fractal tree indexes. But these are not implemented in PostgreSQL, and wherever they are implemented they seem to be proprietary.

Which database operation is heaviest?

If I perform the CRUD operations on the same table, what is the heaviest operation in terms of performance?
People say DELETE and then INSERT is better than UPDATE in some cases - is this true? Is UPDATE then the heaviest operation?
Like all things in life, it depends.
SQL Server uses WAL (write ahead logging) to maintain ACID (Atomicity, Consistency, Isolation, Durability) properties.
An insert needs to log entries for data page and index page changes. If page splits occur, it takes longer. Then the data is written to the data file.
A delete marks the data and index pages for re-use. The data will still be there right after the operation.
An update is implemented as a delete plus an insert, and therefore generates double the log entries.
What can help inserts is pre-allocating the space in the data file before running the job. Auto growing the data files is expensive.
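For example, a sketch of pre-growing a data file before a large load (database name, logical file name, and size are placeholders):

-- Grow the data file once, up front, instead of relying on autogrow during the load
ALTER DATABASE MyDatabase
    MODIFY FILE (NAME = MyDatabase_Data, SIZE = 50GB);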
In summary, I would expect updates on average to be the most expensive operation.
I am by no means an expert on the storage engine.
Please check out http://www.sqlskills.com - Paul Randal's blog - and/or Kalen Delaney's SQL Server Internals book, http://sqlserverinternals.com/. These authors go in depth on all the cases that might happen.
It depends mostly on the foreign keys and indexes you have on the table. For deletion and insertion, every column that is a foreign key has to be checked for foreign key references, and every index containing that column has to be updated.
If you do DELETE and then INSERT, that checking and index maintenance happens twice. If it is a really large table, the index maintenance can take a very long time, and in that case UPDATE will be MUCH faster.
That is, of course, provided you have an index on the key you're searching on in the UPDATE statement and you are not updating that key itself.
For a small table with almost no indexes/foreign keys the operations run so fast that it's not a big issue.

How to create indexes faster

I have a table of about 60GB and I'm trying to create an index,
and it's very slow (it has been running for almost a day, and it's still going!).
I see that most of the time is spent on disk I/O (4 MB/s), and it doesn't use much memory or CPU.
I tried running 'pragma cache_size = 10000' and 'pragma page_size = 4000'
(after I created the table), and it still doesn't help.
How can I make the 'create index' run in reasonable time?
Creating an index on a database table is a one time operation and it can be expensive based on many factors ranging from how many fields and of what type are included in the index, the size of the data table that is to be indexed, the hardware of the machine the database is running on, and possibly even more.
To give a reasonable answer on speeding things up, we would need to know: the schema of the table; the definition of the index you are creating; whether (if your index enforces uniqueness) the data is actually unique; the hardware specs of your server; your disk speeds; how much space is available on the disks; whether you are using a RAID array and at what level; and how much RAM you have and how it is utilized, etc.
Now all that said, this might be faster but I have not tested it.
Make a structurally duplicate table of the table you wish to index.
Add the index to the new, empty table.
Copy the data from the old table to the new table in chunks.
Drop the old table.
My theory is that it will be less expensive to index the data as it is added than to dig through the data that is already there and add the index after the fact.
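A rough SQLite sketch of that approach (table, column, and chunk boundaries are placeholders; I have not benchmarked it either):

PRAGMA cache_size = 100000;   -- give SQLite a larger page cache for the copy
CREATE TABLE big_table_new (id INTEGER PRIMARY KEY, payload TEXT);
CREATE INDEX idx_big_table_new_payload ON big_table_new (payload);

-- Copy in chunks so each transaction stays a manageable size
BEGIN;
INSERT INTO big_table_new (id, payload)
    SELECT id, payload FROM big_table WHERE id BETWEEN 1 AND 1000000;
COMMIT;
-- ...repeat for the remaining id ranges, then:
-- DROP TABLE big_table;
-- ALTER TABLE big_table_new RENAME TO big_table;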
When you create the table, you should create the index at the same time. PS: you should make sure the index is designed properly, and you should not need to create the index at runtime.

When and why do I reindex a MSDE database

I understand that indexes should get updated automatically but when that does not happen we need to reindex.
My question is: (1) Why does this automatic update fail, or why does an index become bad?
(2) How do I programmatically know which table/index needs re-indexing at a given point in time?
Indexes' statistics may be updated automatically. I do not believe that the indexes themselves would be rebuilt automatically when needed (although there may be some administrative feature that allows such a thing to take place).
Indexes associated with tables that receive a lot of changes (new rows, updated rows, and deleted rows) may become fragmented and less efficient. Rebuilding the index then "repacks" it into a contiguous section of storage space, a bit akin to the way defragmentation of the file system makes file access faster...
Furthermore, indexes (on several DBMS) have a FILL_FACTOR parameter, which determines how much extra space should be left in each node for growth. For example, if you expect a given table to grow by 20% next year, declaring a fill factor of around 80% should keep fragmentation of the index minimal during the first year (there may be some if that 20% of growth is not evenly distributed...).
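In SQL Server the parameter is spelled FILLFACTOR; a sketch with placeholder names:

-- Leave roughly 20% free space in each leaf page for future inserts
ALTER INDEX IX_MyIndex ON dbo.MyTable REBUILD WITH (FILLFACTOR = 80);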
In SQL Server, it is possible to query properties of the index that indicate its level of fragmentation, and hence its possible need for maintenance. This can be done by way of the interactive management console. It is also possible to do this programmatically, by way of sys.dm_db_index_physical_stats in MSSQL 2005 and above (maybe even older versions?).
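A sketch of that programmatic check for a single table (run in the target database; the table name is a placeholder):

SELECT  i.name AS index_name,
        ips.avg_fragmentation_in_percent,
        ips.page_count
FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.MyTable'), NULL, NULL, 'LIMITED') AS ips
JOIN    sys.indexes AS i
    ON  i.object_id = ips.object_id AND i.index_id = ips.index_id
ORDER BY ips.avg_fragmentation_in_percent DESC;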