I need to know that if we delete some rows (I am talking for sql server) from a table which has some indexes (clustered or non-clustered, for both situation) can give any damage to indexes or not? What happens to indexes when we delete rows? Which one is better for performance, deleting rows from a table after processing them, or mark up them as processed (When we will need to reuse them like 20 times more). Thanks for the answers.
I don't know what you mean by "damage". When you delete rows from the table, the index entries need to be deleted as well. This does not "damage" the index per se. At least, the index continues to be useful.
If you have lots of deletes, updates, and inserts, then over time the index will be fragmented. This does affect performance. At some point it becomes useful to re-build the index for performance purposes. You can read about this in the documentation.
I would not worry about rebuilding the indexes because of a handful of deletes. It takes a bit of work to really fragment an index.
My answer is YES.
Index is created on data in the tables and in short if data is deleted from the tables then the levels of fragmentation rise.
Rise in fragmentation levels effects the data retrieval in many ways.
i have a huge table (200mln records). about 70% is not need now (there is column ACTIVE in a table and those records have value N ). There are a lot of multi-column indexes but none of them includes that column. Will removing that 70% records improve SELECT (ACTIVE='Y') performance (because oracle has to read table blocks with no active records and then exclude them from final result)? Is shrink space necessary?
It's really impossible to say without knowing more about your queries.
At one extreme, access by primary key would only improve if the height of the supporting index was reduced, which would probably require deletion of the rows and then a rebuild of the index.
At the other extreme, if you're selecting nearly all active records then a full scan of the table with 70% of the rows removed (and the table shrunk) would take only 30% of the pre-deletion time.
There are many other considerations -- selecting a set of data and accessing the table via indexes, and needing to reject 99% of rows after reading the table because it turns out that there's a positive correlation between the required rows and an inactive status.
One way of dealing with this would be through list partitioning the table on the ACTIVE column. That would move inactive records to a partition that could be eliminated from many queries, with no need to index the column, and would keep the time for full scans of active records down.
If you really do not need these inactive records, why do you just not delete them instead of marking them inactive?
Edit: Furthermore, although indexing a column with a 70/30 split is not generally helpful, you could try a couple of other indexing tricks.
For example, if you have an indexed column which is frequently used in queries (client_id?) then you can add the active flag to that index. You could also construct a partial index:
create index my_table_active_clients
on my_table (case when active = 'Y' then client_id end);
... and then query on:
select ...
from ...
where (case when active = 'Y' then client_id end) = :client_id
This would keep the index smaller, and both indexing approaches would probably be helpful.
Another edit: A beneficial side effect of partitioning could be that it keeps the inactive records and active records "physically" apart, and every block read into memory from the "active" partition of course only has active records. This could have the effect of improving your cache efficiency.
Partitioning, putting the active='NO' records in a separate partition, might be a good option.
http://docs.oracle.com/cd/B19306_01/server.102/b14223/parpart.htm
Yes it will most likely. But depending on your access schema the increase will most likely not as big. Setting an index including the column would be a better solution for future IMHO.
Most probably no. Delete will not reduce the size of the table's segment. Additional maintenance might help. After the DELETE execute also:
ALTER TABLE <tablename> SHRINK SPACE COMPACT;
ALTER INDEX <indexname> SHRINK SPACE COMPACT; -- for every table's index
Alternatively you can use old school approach:
ALTER TABLE <tablename> MOVE;
ALTER INDEX <indexnamename> REBUILD;
When delting 70% of table also consider possible solution CTAS (create table as select). It will be much faster.
Indexing plays a vital role in SELECT query. The performance will drastically increase
if you use those indexed columns in the query. Ya deleting rows will enhance the performance
for sure somewhat but not drastically.
I have a table with 20 or so columns. I have approximately 7 non-clustered indexes in that table on the columns that users filter by more often. The active records (those that the users see on their screen) are no more than 700-800. Twice a day a batch job runs and inserts a few records in that table - maybe 30 - 100 - and may update the existing ones as well.
I have noticed that the indexes need rebuilding EVERY time that the batch operation completes. Their fragmentation level doesnt go from 0-1% step by step to say 50%. I have noticed that they go from 0-1% to approx. 99% after the batch operation completes. A zillion of selects can happen on this table between batch operations but i dont think that matters.
Is this normal? i dont think it is. what do you think its the problem here? The indexed columns are mostly strings and floats.
A few changes could easily change fragmentation levels.
An insert on a page can cause a page split
Rows can overflow
Rows can be moved (forward pointers)
You'll have quite wide rows too so your data density (rows per page) is lower. DML on existing rows will cause fragmentation quite quickly if the DML is distributed across many pages
I'm writing several hundred or potentially several thousand rows into a set of tables at a time, each of which is heavily indexed both internally and via indexed views.
Generally, the inserts are occurring where the rows inserted will be adjacent in the index.
I expect these inserts to be expensive, but they are really slow. I think part of the performance issue is that the indexes are being updated with each individual INSERT.
Is there a way to tell SQL Server to hold off on updating the indexes until I am finished with my batch of inserts so the index trees will only need to be updated once?
These are executed as separate statements due to needing to show the user a progress bar during save and log any individual issues, but are all coming from the same connection in C#. I can place them in a transaction if needed, though I'd prefer not to.
You are paying the cost of adding those rows to the index one way or another. Not updating the index during the insert would cause an issue with accuracy of concurrent statements - any query on that table that used any of the indexes would not "see" the new rows!
If speed is of the essence, and downtime after the insert isn't a major concern, you can:
Disable non-clustered indexes on the target table
Inert
Rebuild non-clustered indexes
You probably should clarify some more about your table:
How wide is the table?
How many indexes?
How wide are the indexes?
If you have 20 indexes and each index has 5 fields, you are really updating 100 extra fields per row which can get expensive quickly.
Deletes on sql server are sometimes slow and I've been often in need to optimize them in order to diminish the needed time.
I've been googleing a bit looking for tips on how to do that, and I've found diverse suggestions.
I'd like to know your favorite and most effective techinques to tame the delete beast, and how and why they work.
until now:
be sure foreign keys have indexes
be sure the where conditions are indexed
use of WITH ROWLOCK
destroy unused indexes, delete, rebuild the indexes
now, your turn.
The following article, Fast Ordered Delete Operations may be of interest to you.
Performing fast SQL Server delete operations
The solution focuses on utilising a view in order to simplify the execution plan produced for a batched delete operation. This is achieved by referencing the given table once, rather than twice which in turn reduces the amount of I/O required.
I have much more experience with Oracle, but very likely the same applies to SQL Server as well:
when deleting a large number of rows, issue a table lock, so the database doesn't have to do lots of row locks
if the table you delete from is referenced by other tables, make sure those other tables have indexes on the foreign key column(s) (otherwise the database will do a full table scan for each deleted row on the other table to ensure that deleting the row doesn't violate the foreign key constraint)
I wonder if it's time for garbage-collecting databases? You mark a row for deletion and the server deletes it later during a sweep. You wouldn't want this for every delete - because sometimes a row must go now - but it would be handy on occasion.
Summary of Answers through 2014-11-05
This answer is flagged as community wiki since this is an ever-evolving topic with a lot of nuances, but very few possible answers overall.
The first issue is you must ask yourself what scenario you're optimizing for? This is generally either performance with a single user on the db, or scale with many users on the db. Sometimes the answers are the exact opposite.
For single user optimization
Hint a TABLELOCK
Remove indexes not used in the delete then rebuild them afterward
Batch using something like SET ROWCOUNT 20000 (or whatever, depending on log space) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (##ROWCOUNT = 0)
If deleting a large % of table, just make a new one and delete the old table
Partition the rows to delete, then drop the parition. [Read more...]
For multi user optimization
Hint row locks
Use the clustered index
Design clustered index to minimize page re-organization if large blocks are deleted
Update "is_deleted" column, then do actual deletion later during a maintenance window
For general optimization
Be sure FKs have indexes on their source tables
Be sure WHERE clause has indexes
Identify the rows to delete in the WHERE clause with a view or derived table instead of referencing the table directly. [Read more...]
To be honest, deleting a million rows from a table scales just as badly as inserting or updating a million rows. It's the size of the rowset that's the problem, and there's not much you can do about that.
My suggestions:
Make sure that the table has a primary key and clustered index (this is vital for all operations).
Make sure that the clustered index is such that minimal page re-organisation would occur if a large block of rows were to be deleted.
Make sure that your selection criteria are SARGable.
Make sure that all your foreign key constraints are currently trusted.
(if the indexes are "unused", why are they there at all?)
One option I've used in the past is to do the work in batches. The crude way would be to use SET ROWCOUNT 20000 (or whatever) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (##ROWCOUNT = 0).
This might help reduce the impact upon other systems.
The problem is you haven't defined your conditions enough. I.e. what exactly are you optimizing?
For example, is the system down for nightly maintenance and no users are on the system? And are you deleting a large % of the database?
If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename. If deleting a small %, you likely want to batch things in as large batches as your log space allows. It entirely depends on your database, but dropping indexes for the duration of the rebuild may hurt or help -- if even possible due to being "offline".
If you're online, what's the likelihood your deletes are conflicting with user activity (and is user activity predominantly read, update, or what)? Or, are you trying to optimize for user experience or speed of getting your query done? If you're deleting from a table that's frequently updated by other users, you need to batch but with smaller batch sizes. Even if you do something like a table lock to enforce isolation, that doesn't do much good if your delete statement takes an hour.
When you define your conditions better, you can pick one of the other answers here. I like the link in Rob Sanders' post for batching things.
If you have lots of foreign key tables, start at the bottom of the chain and work up. The final delete will go faster and block less things if there are no child records to cascade delete (which I would NOT turn on if I had a large number fo child tables as it will kill performance).
Delete in batches.
If you have foreign key tables that are no longer being used (you'd be surprised how often production databses end up with old tables nobody will get rid of), get rid of them or at least break the FK/PK connection. No sense cheking a table for records if it isn't being used.
Don't delete - mark records as delted and then exclude marked records from all queries. This is best set up at the time of database design. A lot of people use this because it is also the best fastest way to get back records accidentlally deleted. But it is a lot of work to set up in an already existing system.
I'll add another one to this:
Make sure your transaction isolation level and database options are set appropriately. If your SQL server is set not to use row versioning, or you're using an isolation level on other queries where you will wait for the rows to be deleted, you could be setting yourself up for some very poor performance while the operation is happening.
On very large tables where you have a very specific set of criteria for deletes, you could also partition the table, switch out the partition, and then process the deletions.
The SQLCAT team has been using this technique on really really large volumes of data. I found some references to it here but I'll try and find something more definitive.
I think, the big trap with delete that kill the performance is that sql after each row deleted, it updates all the related indexes for any column in this row. what about delting all indexes before bulk delete?
There are deletes and then there are deletes. If you are aging out data as part of a trim job, you will hopefully be able to delete contiguous blocks of rows by clustered key. If you have to age out data from a high volume table that is not contiguous it is very very painful.
If it is true that UPDATES are faster than DELETES, you could add a status column called DELETED and filter on it in your selects. Then run a proc at night that does the actual deletes.
Do you have foreign keys with referential integrity activated?
Do you have triggers active?
Simplify any use of functions in your WHERE clause! Example:
DELETE FROM Claims
WHERE dbo.YearMonthGet(DataFileYearMonth) = dbo.YearMonthGet(#DataFileYearMonth)
This form of the WHERE clause required 8 minutes to delete 125,837 records.
The YearMonthGet function composed a date with the year and month from the input date and set day = 1. This was to ensure we deleted records based on year and month but not day of month.
I rewrote the WHERE clause to:
WHERE YEAR(DataFileYearMonth) = YEAR(#DataFileYearMonth)
AND MONTH(DataFileYearMonth) = MONTH(#DataFileYearMonth)
The result: The delete required about 38-44 seconds to delete those 125,837 records!