Index and Insert Operations - sql

I have one job with around 100K records to process. This job truncates the destination tables and then inserts all the records into these tables "one at a time", not as a batch insert.
I need to know how indexes take effect as these records are inserted. Will the cost of maintaining the indexes during the job outweigh the benefit of using them?
Are there any best practices or optimization hints in such situation?

This kind of question can only be answered on a case-by-case basis. However the following general considerations may be of help:
Unless some of the data for the inserts comes from additional lookups and the like, no index is useful during an INSERT (that is, for this very operation; indexes may of course be useful for other queries under other sessions/users).
[on the other hand...] The presence of indexes on a table slows down the INSERT (or more generally UPDATE or DELETE) operations
The order in which the new records are added may matter
Special consideration is needed if the table has a clustered index
Deciding whether to drop indexes (all of them or some of them) prior to the INSERT operation depends largely on the relative number of records (added vs. already in the table)
INSERT operations often introduce index fragmentation, which is in and of itself an additional incentive to drop the index(es) prior to the data load and then rebuild it (them).
In general, adding 100,000 records is "small potatoes" for MS-SQL. Barring a particular situation such as unusually wide records, or the presence of many (and possibly poorly defined) constraints of various kinds, SQL Server should handle this load in minutes rather than hours on almost any hardware configuration.

The answer to this question is very different depending on whether the indexes you're talking about are clustered or not. Clustered indexes force SQL Server to store data in sorted order, so if you try to insert a record that doesn't sort to the end of your clustered index, your insert can result in significant reshuffling of your data, as many of your records are moved to make room for your new record.
Nonclustered indexes don't have this problem; all the server has to do is keep track of where the new record is stored. So if your index is clustered (most clustered indexes are primary keys, but this is not required; run "sp_helpindex [TABLENAME]" to find out for sure), you would almost certainly be better off adding the index after all of your inserts are done.
As to the performance of inserts on nonclustered indexes, I can't actually tell you; in my experience, the slowdown hasn't been enough to worry about. The index overhead in this case would be vastly outweighed by the overhead of doing all your inserts one at a time.
Edit: Since you have the luxury of truncating your entire table, performance-wise, you're almost certainly better off dropping (or NOCHECKing) your indexes and constraints before doing all of your inserts, then adding them back in at the end.
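To get a feel for the "one at a time" overhead mentioned above, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made purely so the example is runnable; the table name `dest` and the row shape are hypothetical). It loads the same rows once with a commit after every row and once inside a single transaction. Absolute timings will differ by engine and storage, and an in-memory database understates the gap you would typically see on disk.

```python
import sqlite3
import time


def load_rows(rows, commit_each_row):
    """Insert rows into a fresh in-memory table; return (elapsed_seconds, row_count)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dest (id INTEGER PRIMARY KEY, payload TEXT)")
    start = time.perf_counter()
    if commit_each_row:
        # One statement and one commit per record, like the job in the question.
        for row in rows:
            conn.execute("INSERT INTO dest (id, payload) VALUES (?, ?)", row)
            conn.commit()
    else:
        # All rows inside a single transaction, committed once at the end.
        with conn:
            conn.executemany("INSERT INTO dest (id, payload) VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM dest").fetchone()[0]
    conn.close()
    return elapsed, count


rows = [(i, f"record-{i}") for i in range(10_000)]
slow_time, slow_count = load_rows(rows, commit_each_row=True)
fast_time, fast_count = load_rows(rows, commit_each_row=False)
print(f"commit per row: {slow_time:.3f}s, single transaction: {fast_time:.3f}s")
```

Both runs end with the same data; only the way the work is grouped changes, which is why it is worth measuring before blaming the indexes.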

The insert statement is the only operation that cannot directly benefit from indexing because it has no where clause.
The more indexes a table has, the slower execution becomes.
If there are indexes on the table, the database must make sure the new entry is also found via these indexes. For this reason it has to add the new entry to each and every index on that table. The number of indexes is therefore a multiplier for the cost of an insert statement.
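Whether the number of indexes really behaves like a multiplier is easy to measure rather than guess. The following is a rough sketch, again using Python's sqlite3 module as a stand-in engine (an assumption; the table `t`, its columns, and the row count are hypothetical). It inserts the same rows into tables with zero to four single-column indexes and reports the elapsed time for each. Treat the numbers as indicative for this engine only.

```python
import sqlite3
import time


def time_inserts(n_indexes, n_rows=20_000):
    """Insert n_rows into a table with n_indexes indexes.

    Returns (elapsed_seconds, index_count, row_count).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
    for col in "abcd"[:n_indexes]:
        conn.execute(f"CREATE INDEX ix_{col} ON t ({col})")
    # Columns b and c are scrambled so their index entries do not arrive in order.
    rows = [(i, (i * 7919) % n_rows, (i * 104729) % n_rows, n_rows - i)
            for i in range(n_rows)]
    start = time.perf_counter()
    with conn:
        conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    elapsed = time.perf_counter() - start
    index_count = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'index' AND tbl_name = 't'"
    ).fetchone()[0]
    row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, index_count, row_count


results = {k: time_inserts(k) for k in range(5)}
for k, (elapsed, _, _) in results.items():
    print(f"{k} indexes: {elapsed:.3f}s")
```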

Related

Is it possible to do usual atomic INSERT operation but update Indexes asynchronously?

indexes make read fast but write slower. But why can't you have single writes and have db add indexes asynchronously with time, also cache in the INSERT until it's indexed?
Is there any database like that?
Converting my comments to an answer:
indexes make read fast but write slower
That's an oversimplification and it's also misleading.
Indexes make data lookups faster because the DBMS doesn't need to do a table-scan to find rows matching a predicate (the WHERE part of a query). Indexes don't make "reads" any faster (that's entirely dependent on the characteristics of your disk IO) and when used improperly they can sometimes even make queries slower (for reasons I won't get into).
I want to stress that the additional cost of writing to a single index, or even multiple indexes, when executing a DML statement (INSERT/UPDATE/DELETE/MERGE/etc) is negligible, really! (In actuality: foreign-key constraints are a much bigger culprit - and I note you can practically eliminate the cost of foreign-key constraint checking by adding additional indexes!). Indexes are primarily implemented using B-trees (a B-tree is essentially like a binary-tree, except rather than each node having only 2 children it can have many children because each tree-node comes with unused space for all those child node pointers, so inserting into the middle of a B-tree won't require data to be moved-around on-disk unlike with other kinds of trees, like a heap-tree).
Consider this QA where a Postgres user (like yourself) reports inserting 10,000 rows into a table. Without an index it took 78ms, with an index it took 84ms. That's only about a 7.7% increase, which at that scale (6ms!) is so small it may as well be a rounding error or caused by IO scheduling. That should be proof enough that it isn't something you should worry about without actual hard data showing it's a problem for you and your application.
I assume you have this negative impression about indexes after reading an article like this one, which certainly gives the impression that "indexes are bad" - but while the points mentioned in that article are not wrong, there's a LOT of problems with that article so you shouldn't take it dogmatically. (I'll list my concerns with that article in the footer).
But why can't you have single writes and have db add indexes asynchronously with time
By this I assume you mean you'd like a DBMS to do a single-row INSERT by simply appending a new record to the end of a table and returning immediately, with the DBMS' housekeeping system updating the indexes at some arbitrary later point.
The problem with that is that it breaks the A, C, and I parts of the A.C.I.D. model.
Indexes are used for more than just avoiding table-scans: they're also used to store copies of table data for the benefit of queries that would use the index and which also need (for example) a small subset of the table's data. This significantly reduces disk reads. For this reason, many RDBMS (SQL Server and PostgreSQL, for example) allow indexes to include non-key columns using the INCLUDE clause.
Consider this scenario:
CREATE INDEX IX_Owners ON cars ( ownerId ) INCLUDE ( colour );
CREATE INDEX IX_Names ON people ( name ) INCLUDE ( personId, hairColour );
GO
SELECT
people.name,
people.hairColour,
cars.colour
FROM
cars
INNER JOIN people ON people.personId = cars.ownerId
WHERE
people.name LIKE 'Steve%'
The above query will not need to read either the cars or people tables on-disk. The DBMS will be able to fully answer the query using data only in the index - which is great because indexes tend to exist in a small number of pages on-disk which tend to be in proximal-locality which is good for performance because it means it will use sequential IO which scales much better than random IO.
The RDBMS will perform a string-prefix index-scan of the people.IX_Names index to get all of the personId (and hairColour) values, then it will look up those personId values in the cars.IX_Owners index and be able to get cars.colour from the copy of the data inside the IX_Owners index without needing to read the tables directly.
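The covering-index behaviour described above can be observed in a small sketch. It uses Python's sqlite3 module rather than SQL Server (an assumption for runnability). SQLite has no INCLUDE clause, so the extra column is appended to the index key instead, which plays the same covering role here. The `model` column is a hypothetical addition, used only to show a query the index does not cover.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cars (
        carId   INTEGER PRIMARY KEY,
        ownerId INTEGER,
        colour  TEXT,
        model   TEXT   -- hypothetical column that is NOT in the index
    );
    -- SQLite stand-in for: CREATE INDEX IX_Owners ON cars (ownerId) INCLUDE (colour)
    CREATE INDEX IX_Owners ON cars (ownerId, colour);
    INSERT INTO cars (ownerId, colour, model) VALUES (1, 'red', 'A'), (2, 'blue', 'B');
""")


def plan(sql, params=()):
    """Return the planner's description of how it would execute the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


# Every column the query needs is in the index, so the table is never read.
covered = plan("SELECT colour FROM cars WHERE ownerId = ?", (1,))
# 'model' is only stored in the table, so the index alone is not enough.
not_covered = plan("SELECT model FROM cars WHERE ownerId = ?", (1,))

print(covered)
print(not_covered)
```

The first plan should mention a covering index, while the second uses the same index only to locate rows and then has to visit the table.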
Now, assume that another database client has just finished inserting a load of records into the cars and/or people table, with a COMMIT TRANSACTION just for good measure, and that the RDBMS uses your idea of only updating indexes later whenever it feels like it. If that same database client then re-runs the query from above, it would get stale data (i.e. wrong data) because the query uses the index, but the index is old.
In addition to using index tree nodes to store copies of table data to avoid non-proximal disk IO, many RDBMS also use index-trees to store entire copies - even multiple copies of table data, to enable other scenarios, such as columnar data storage and indexed-VIEWs - both of these features absolutely require that indexes are updated atomically with table data.
Is there any database like that?
Yes, they exist - but they're not widely used (or they're niche) because for the vast majority of applications it's entirely undesirable behaviour for the reasons described above.
There are distributed databases that are designed around eventual consistency, but clients (and entire application code) need to be designed with that in mind, and it's a huge PITA to have to redesign a data-centric application to support eventual consistency, which is why you only really see them being used in truly massive systems (like Facebook, Google, etc) where availability (uptime) is more important than users seeing stale data for a few minutes.
Footnote:
Regarding this article: https://use-the-index-luke.com/sql/dml/insert
The number of indexes on a table is the most dominant factor for insert performance. The more indexes a table has, the slower the execution becomes. The insert statement is the only operation that cannot directly benefit from indexing because it has no where clause.
I disagree. I'd argue that foreign-key constraints (and triggers) are far more likely to have a larger detrimental effect on DML operations.
Adding a new row to a table involves several steps. First, the database must find a place to store the row. For a regular heap table—which has no particular row order—the database can take any table block that has enough free space. This is a very simple and quick process, mostly executed in main memory. All the database has to do afterwards is to add the new entry to the respective data block.
I agree with this.
If there are indexes on the table, the database must make sure the new entry is also found via these indexes. For this reason it has to add the new entry to each and every index on that table. The number of indexes is therefore a multiplier for the cost of an insert statement.
This is true, but I don't know if I agree that it's a "multiplier" of the cost of an insert.
For example, consider a table with hundreds of nvarchar(1000) columns and several int columns, with a separate index for each int column (and no INCLUDE columns). If you're inserting 100 megabyte-sized rows all at once (using an INSERT INTO ... SELECT FROM statement), updating those int indexes is very likely to require much less IO than writing the table data.
Moreover, adding an entry to an index is much more expensive than inserting one into a heap structure because the database has to keep the index order and tree balance. That means the new entry cannot be written to any block—it belongs to a specific leaf node. Although the database uses the index tree itself to find the correct leaf node, it still has to read a few index blocks for the tree traversal.
I strongly disagree with this, especially the first sentence: "adding an entry to an index is much more expensive than inserting one into a heap structure".
Indexes in RDBMS today are invariably based on B-trees, not binary-trees or heap-trees. B-trees are essentially like binary-trees except each node has built-in space for dozens of child node pointers and B-trees are only rebalanced when a node fills its internal child pointer list, so a B-tree node insert will be considerably cheaper than the article is saying because each node will have plenty of empty space for a new insertion without needing to re-balance itself or any other relatively expensive operation (besides, DBMS can and do index maintenance separately and independently of any DML statement).
The article is correct about how the DBMS will need to traverse the B-tree to find the node to insert into, but index nodes are efficiently arranged on-disk, such as keeping related nodes in the same disk page, which minimizes index IO reads (assuming they aren't already loaded into memory first). If an index tree is too big to store in-memory the RDBMS can always keep a "meta-index" in-memory so it could potentially instantly find the correct B-tree index without needing to traverse the B-tree from the root.
Once the correct leaf node has been identified, the database confirms that there is enough free space left in this node. If not, the database splits the leaf node and distributes the entries between the old and a new node. This process also affects the reference in the corresponding branch node as that must be duplicated as well. Needless to say, the branch node can run out of space as well so it might have to be split too. In the worst case, the database has to split all nodes up to the root node. This is the only case in which the tree gains an additional layer and grows in depth.
In practice this isn't a problem, because the RDBMS's index maintenance will ensure there's sufficient free space in each index node.
The index maintenance is, after all, the most expensive part of the insert operation. That is also visible in Figure 8.1, “Insert Performance by Number of Indexes”: the execution time is hardly visible if the table does not have any indexes. Nevertheless, adding a single index is enough to increase the execute time by a factor of a hundred. Each additional index slows the execution down further.
I feel the article is being dishonest by suggesting (implying? stating?) that index-maintenance happens with every DML. This is not true. This may have been the case with some early dBase-era databases, but this is certainly not the case with modern RDBMS like Postgres, MS SQL Server, Oracle and others.
Considering insert statements only, it would be best to avoid indexes entirely—this yields by far the best insert performance.
Again, this claim in the article is not wrong, but it's basically saying if you want a clean and tidy house you should get rid of all of your possessions. Indexes are a fact of life.
However tables without indexes are rather unrealistic in real world applications. You usually want to retrieve the stored data again so that you need indexes to improve query speed. Even write-only log tables often have a primary key and a respective index.
Indeed.
Nevertheless, the performance without indexes is so good that it can make sense to temporarily drop all indexes while loading large amounts of data—provided the indexes are not needed by any other SQL statements in the meantime. This can unleash a dramatic speed-up which is visible in the chart and is, in fact, a common practice in data warehouses.
Again, with modern RDBMS this isn't necessary. If you do a batch insert then a RDBMS won't update indexes until after the table-data has finished being modified, as a batch index update is cheaper than many individual updates. Similarly I expect that multiple DML statements and queries inside an explicit BEGIN TRANSACTION may cause an index-update deferral provided no subsequent query in the transaction relies on an updated index.
But my biggest issue with that article is that the author is making these bold claims about detrimental IO performance without providing any citations or even benchmarks they've run themselves. It's even more galling that they posted a bar-chart with arbitrary numbers on, again, without any citation or raw benchmark data and instructions for how to reproduce their results. Always demand citations and evidence from anything you read making claims: because the only claims anyone should accept without evidence are logical axioms - and a quantitative claim about database index IO cost is not a logical axiom :)
For PostgreSQL GIN indexes, there is the fastupdate feature. This stores new index entries in an unordered, unconsolidated area, waiting for some other process to file them away into the main index structure. But this doesn't directly match up with what you want. It is mostly designed so that the index updates are done in bulk (which can be more IO efficient), rather than in the background. Once the unconsolidated area gets large enough, a foreground process might take on the task of filing them away, and it can be hard to tune the settings so that this is always done by a background process instead of a foreground process. And it only applies to GIN indexes. (With the use of the btree_gin extension, you can create GIN indexes on regular scalar columns rather than the array-like columns it usually works with.) While waiting for the entries to be consolidated, every query will have to sequentially scan the unconsolidated buffer area, so delaying the updates for the sake of INSERT can come at a high cost for SELECTs.
There are more general techniques to do something like this, such as fractal tree indexes. But these are not implemented in PostgreSQL, and wherever they are implemented they seem to be proprietary.

SQL Server : large data import with clustered index

Performance-wise, does a clustered index help or not when bulk inserting hundreds of millions of rows in a table?
Later edit: after the INSERTs I have to put the database into production, so I will have to create one or more indexes.
A clustered index specifies that the data is ordered on the data pages.
When you are inserting data, the new data has to be sorted and compared to existing values. This is going to incur overhead.
The one exception is when you have an identity column -- that is being generated during the insert. Then the database knows that the new data goes "at the end" of the table.
Indexes are meant for speeding up retrieval (SELECT) of rows. They only have a negative effect on INSERT, DELETE, or UPDATE. And, in your case, if INSERT is the predominant operation to be performed in your system, don't go for indexes at all. Even in your Production system, assess the ratio between retrieval operations and insert/update operations, and if it turns out that retrieval is going to be dominant, then you can think of indexes.
Note: Whenever we define a Primary Key on a table, a basic index structure is already created for that table. So, without any specific need for retrieval optimization, there is no actual need to design and implement indexes.
You can know more here: https://www.geeksforgeeks.org/sql-indexes/

Why is only reading faster in an indexed table and not writing?

The data structure used for indexing in a DB table is B-Tree (default, out of B-Tree, R-Tree, Hash). Since look-ups, deletions, and insertions can all be done in logarithmic time in a B-Tree, why is only reading from an indexed table faster, while writing is slower?
Indexes are only used for speeding up SELECT statements. For INSERT, UPDATE, and DELETE your statements will be slower than normal due to the index needing to be updated as part of the statement.
I should maybe clarify on the UPDATE/DELETE point. It is true that the statements will be slowed down due to the change to the index added to the overhead, however the initial lookup part (WHERE) of the UPDATE and DELETE statement could be sped up due to the index. Basically any place a WHERE clause is used and you reference the indexed fields, the record selection part of that statement should see some increase.
Also, if an UPDATE statement does not alter any of the columns that are part of an index then you should not see any additional slowness as the index is not being updated.
Because indexes require additional disk space. Indexes increase the amount of data that needs to be logged and written to a database. Indexes reduce write performance. When a column covered by an index is updated, that index also must be updated. Similarly any deletes or insert requires updating the relevant indexes.
The disk space and write penalties of indexes are precisely why you need to be careful about creating indexes.
That said, updates to non-indexed columns can have their performance improved with indexes.
This:
UPDATE Table SET NonIndexedColumn = 'Value' WHERE IndexedKey = 'KeyValue'
Will be faster than this:
UPDATE Table SET IndexedColumn = 'Value' WHERE IndexedKey = 'KeyValue'
But the above two will likely both be faster than this in any reasonably sized table:
UPDATE Table SET NonIndexedColumn = 'Value' WHERE NonIndexedKey = 'KeyValue'
Deletes, especially single deletes, can similarly be faster even though the table and the indexes need to be updated. This is simply because the query engine can find the target row(s) faster. That is, it can be faster to read an index, find the row, and remove the row and update the index, instead of scanning the entire table for the correct rows and removing the relevant ones. However, even in this case there is going to be more data to write; it's just that the IO cost of scanning an entire table could be fairly high compared to an index.
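The difference between these statements comes from how the target rows are located, and a query plan makes that visible. Below is a minimal sketch using Python's sqlite3 module as a stand-in engine (an assumption; the table name `items` and the index name `IX_Key` are hypothetical, while the column names follow the statements above). It asks the planner how it would run an UPDATE and a DELETE filtered on an indexed key, and an UPDATE filtered on a non-indexed key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        id               INTEGER PRIMARY KEY,
        IndexedKey       TEXT,
        NonIndexedKey    TEXT,
        NonIndexedColumn TEXT
    );
    CREATE INDEX IX_Key ON items (IndexedKey);
""")


def plan(sql):
    """Return the planner's description of how it would execute the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


# WHERE on the indexed key: rows are located through the index.
indexed_update = plan(
    "UPDATE items SET NonIndexedColumn = 'Value' WHERE IndexedKey = 'KeyValue'")
indexed_delete = plan("DELETE FROM items WHERE IndexedKey = 'KeyValue'")
# WHERE on a non-indexed key: the whole table has to be scanned.
scanned_update = plan(
    "UPDATE items SET NonIndexedColumn = 'Value' WHERE NonIndexedKey = 'KeyValue'")

for label, text in [("indexed update", indexed_update),
                    ("indexed delete", indexed_delete),
                    ("scanned update", scanned_update)]:
    print(f"{label}: {text}")
```

The plans only show the lookup side; the write cost of maintaining the index on each change still applies, as the answer explains.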
Finally, in theory, a clustering key that spreads inserts across multiple disk pages can allow the system to support more concurrent inserts since inserts typically require page locks to function, but that is a somewhat uncommon situation and it may result in worse read performance due to fragmenting your clustered indexes.
INSERT and DELETE have to update every index for the table (and the heap if there's no clustered index), in order to maintain consistency. UPDATEs may get away with updating fewer indexes, depending on which columns have been affected by the update (because only those indices that index/include those columns have to be updated)
A SELECT, on the other hand, is only reading and so, if an index contains all columns required by the SELECT, only that index has to be accessed. We know that the data in that index is accurate precisely because the modification operations are required to maintain that consistency.

Does deleting rows from a table disrupt indexes?

I need to know whether deleting some rows (I am talking about SQL Server) from a table that has some indexes (clustered or non-clustered, in both situations) can damage the indexes. What happens to indexes when we delete rows? Which is better for performance: deleting rows from a table after processing them, or marking them as processed (given that we will need to reuse them around 20 more times)? Thanks for the answers.
I don't know what you mean by "damage". When you delete rows from the table, the index entries need to be deleted as well. This does not "damage" the index per se. At least, the index continues to be useful.
If you have lots of deletes, updates, and inserts, then over time the index will be fragmented. This does affect performance. At some point it becomes useful to re-build the index for performance purposes. You can read about this in the documentation.
I would not worry about rebuilding the indexes because of a handful of deletes. It takes a bit of work to really fragment an index.
My answer is YES.
An index is built on the data in the table, so in short, if data is deleted from the table, fragmentation levels rise.
A rise in fragmentation levels affects data retrieval in many ways.

Batch index updates?

I'm writing several hundred or potentially several thousand rows into a set of tables at a time, each of which is heavily indexed both internally and via indexed views.
Generally, the inserts are occurring where the rows inserted will be adjacent in the index.
I expect these inserts to be expensive, but they are really slow. I think part of the performance issue is that the indexes are being updated with each individual INSERT.
Is there a way to tell SQL Server to hold off on updating the indexes until I am finished with my batch of inserts so the index trees will only need to be updated once?
These are executed as separate statements due to needing to show the user a progress bar during save and log any individual issues, but are all coming from the same connection in C#. I can place them in a transaction if needed, though I'd prefer not to.
You are paying the cost of adding those rows to the index one way or another. Not updating the index during the insert would cause an issue with accuracy of concurrent statements - any query on that table that used any of the indexes would not "see" the new rows!
If speed is of the essence, and downtime after the insert isn't a major concern, you can:
Disable non-clustered indexes on the target table
Insert
Rebuild non-clustered indexes
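The three steps above can be sketched as follows. This uses Python's sqlite3 module, which has no way to disable an index, so the sketch drops and recreates it instead (an assumption standing in for SQL Server's `ALTER INDEX ... DISABLE` and `ALTER INDEX ... REBUILD`). The table `target` and the index `ix_target_value` are hypothetical names.

```python
import sqlite3


def bulk_load_with_index_rebuild(conn, rows):
    """Drop the secondary index, load all rows, then build the index once."""
    # Step 1: remove the index so inserts do not have to maintain it.
    conn.execute("DROP INDEX IF EXISTS ix_target_value")
    # Step 2: insert everything inside a single transaction.
    with conn:
        conn.executemany("INSERT INTO target (id, value) VALUES (?, ?)", rows)
    # Step 3: rebuild the index in one pass over the loaded data.
    conn.execute("CREATE INDEX ix_target_value ON target (value)")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE INDEX ix_target_value ON target (value)")

bulk_load_with_index_rebuild(conn, [(i, f"v{i % 100}") for i in range(5_000)])

row_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
index_names = [row[1] for row in conn.execute("PRAGMA index_list('target')")]
print(row_count, index_names)
```

Queries that rely on the index will be slow or unavailable between steps 1 and 3, which is the downtime trade-off the answer mentions.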
You probably should clarify some more about your table:
How wide is the table?
How many indexes?
How wide are the indexes?
If you have 20 indexes and each index has 5 fields, you are really updating 100 extra fields per row which can get expensive quickly.