What's the best way to delete all data from a table? - sql

I have a SQLite table with 6 million rows.
Doing a DELETE FROM TABLE is quite slow;
Dropping the table and then re-creating it seems quicker.
I'm using this for a database import.
Would dropping the table be a better approach or is there a way to delete all data quickly?

One big difference is that DELETE FROM TABLE is DML and DROP TABLE is DDL. This is very important when it comes to db transactions. The result at the end may be the same, but these operations are very different.
If it's just performance you've to be aware of then it may be ok to drop and recreate the table. If you need transactions in your imports then you've to be aware that DDL is not covered and cannot be rollbacked for example.

Generally speaking the DROP TABLE would be a non-logged transaction. DELETE FROM would require a transient journal to log the records until the DELETE statement has been completed.

TRUNCATE TABLE is a lot faster.
The following except from an Oracle website explains why:
'Deletes' perform normal DML. That is, they take locks on rows, they generate redo (lots of it), and they require segments in the UNDO tablespace. Deletes clear records out of blocks carefully. If a mistake is made a rollback can be issued to restore the records prior to a commit. A delete does not relinquish segment space thus a table in which all records have been deleted retains all of its original blocks.
Truncates are DDL and, in a sense, cheat. A truncate moves the High Water Mark of the table back to zero. No row-level locks are taken, no redo or rollback is generated. All extents bar the initial are de-allocated from the table (if you have MINEXTENTS set to anything other than 1, then that number of extents is retained rather than just the initial). By re-positioning the high water mark, they prevent reading of any table data, so they have the same effect as a delete, but without all the overhead. Just one slight problem: a truncate is a DDL command, so you can't roll it back if you decide you made a mistake. (It's also true that you can't selectively truncate -no "WHERE" clause is permitted, unlike with deletes, of course).
By resetting the High Water Mark, the truncate prevents reading of any table's data, so they it has the same effect as a delete, but without the overhead. There is, however, one aspect of a Truncate that must be kept in mind. Because a Truncate is DDL it issues a COMMIT before it acts and another COMMIT afterward so no rollback of the transaction is possible.
Note that by default, TRUNCATE drops storage even if DROP STORAGE is not specified.

I don't think SQLite implements TRUNCATE but if it does it likely will be more performant than DELETE

Related

How to perform data archive in SQL Server with many tables?

Let's say I have a database with many tables in it. I want to perform data archiving on certain tables, that is create a same table with same structures (same constraint, indexes, columns, triggers, etc) as a new table and insert specific data into the new table from the old table.
Example, current table has data from 2008-2017 and I want to move only data from 2010-2017 into the new table. Then after that, I can delete the old table and rename the new table with naming conventions similar to old table.
How should I approach this?
For the sort of clone-rename-drop logic you're talking about, the basics are pretty straight forward. Really the only time this is a good idea is if you have a table with a large amount of data, which you can't afford down time or blocking on, and you only plan to do this one. The process looks something like this:
Insert all the data from your original table into the clone table
In a single transaction, sp_rename the original table from (for example) myTable to myTable_OLD (just something to distinguish it from the real table). Then sp_rename the clone table from (for example) myTable_CLONE to myTable
Drop myTable_OLD when you're happy everything has worked how you want. If it didn't work how you want, just sp_rename the objects back.
Couple considerations to think about if you go that route
Identity columns: If your table has any identities on it, you'll have to use identity_insert on then reseed the identity to pick up at where the old identity left off
Do you have the luxury of blocking the table while you do this? Generally if you need to do this sort of thing, the answer is no. What I find works well is to insert all the rows I need using (nolock), or however you need to do it so the impact of the select from the original table is mitigated. Then, after I've moved 99% of the data, I will then open a transaction, block the original table, insert just the new data that's come in since the bulk of the data movement, then do the sp_rename stuff
That way you don't lock anything for the bulk of the data movement, and you only block the table for the very last bit of data that came into the original table between your original insert and your sp_rename
How you determine what's come in "since you started" will depend on how your table is structured. If you have an identity or a datestamp column, you can probably just pick rows which came in after the max of those fields you moved over. If your table does NOT have something you can easily hook into, you might need to get creative.
Alternatives
A couple other alternatives that came to mind:
Table Partitioning:
This shards a single table across multiple partitions (which can be managed sort of like individual tables). You can, say, partition you data by year, then when you want to purge the trailing year of data, you "switch out" that partition to a special table which you can then truncate. All those operations are meta-data only, so they're super fast. This also works really well for huge amounts of data where deletes and all their pesky transaction logging aren't feasible
The downside to table partitioning is it's kind of a pain to set up and manage.
Batched Deletes:
If you're data isn't too big, you could just do batched deletes on the trailing end of your data. If you can find a way to get clustered index seeks for your deletes, they should be reasonably lightweight. As long as you're not accumulating data faster than you can get rid of it, the benefit of this kind of thing is you just run it semi-continuously and it just nibbles away at the trailing end of your data
Snapshot Isolation:
If deletes cause too much blocking, you can also set up something like snapshot isolation, which basically stores historical versions of rows in tempdb. Any query which sets isolation level read committed snapshot will then read those pre-change rows instead of contend for locks on the "real" table. You can then do batched deletes to your hearts content and know that any queries that hit the table will never get blocked by a delete (or any other DML operation) because they'll either read the pre-delete snapshot, or they'll read the post-delete snapshot. They won't wait for an in-process delete to figure out whether it's going to commit or rollback. This is not without its drawbacks as well unfortunately. For large data sets, it can put a big burden on tempdb and it too can be a little bit of a black box. It's also going to require buy-in from your DBAs.

Does TRUNCATE TABLE grows up transaction log?

I have read that one of the differences between DELETE and TRUNCATE TABLE in Sql is the TRUNCATE operation cannot be rolled back and no triggers will be fired (as written in this site for example) :
QUESTION:
Does this mean that when I TRUNcATE TABLE that is containing millions of records, I should not be effecting the transaction log file -that is transaction log file should not grow up in the time of truncating-, am I correct?
In MS SQL Server (Books Online)
Compared to the DELETE statement, TRUNCATE TABLE has the following advantages:
Less transaction log space is used.
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table data and records only the page deallocations in the transaction log.
Fewer locks are typically used.
When the DELETE statement is executed using a row lock, each row in the table is locked for deletion. TRUNCATE TABLE always locks the table (including a schema (SCH-M) lock) and page but not each row.
Without exception, zero pages are left in the table.
After a DELETE statement is executed, the table can still contain empty pages. For example, empty pages in a heap cannot be deallocated without at least an exclusive (LCK_M_X) table lock. If the delete operation does not use a table lock, the table (heap) will contain many empty pages. For indexes, the delete operation can leave empty pages behind, although these pages will be deallocated quickly by a background cleanup process.
TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes, and so on remain. To remove the table definition in addition to its data, use the DROP TABLE statement.
If the table contains an identity column, the counter for that column is reset to the seed value defined for the column. If no seed was defined, the default value 1 is used. To retain the identity counter, use DELETE instead.
From: http://msdn.microsoft.com/en-us/library/ms177570.aspx
To the original question:
Technically TRUNCATE is deallocating the data pages from the table, effectively removing all records from it. This action in theory can be rolled back until none of the data pages are being re-used. The information on the deallocated pages are not removed, they are still available in the data file. These deallocated pages in the data file can be re-used (allocated to another table for example) and the data on them can be overwritten.
The transaction log contains the list of pages (in SQL Server) being deallocated from the table during TRUNCATE, but this list is much shorter than the list of all records, therefore the transaction log will not grow to the same extent.
Depending on the implementation of transactions and TRUNCATE on different RDBMS it can be possible to do a rollback within a transaction. Sometimes it is possible to do a 'rollback' (restore the table) after the transaction is committed, if the data on the pages are still intact and all information is available, but that is some black magic and usually not supported directly by the RDBMS.
Does this mean that when I TRUNcATE TABLE that is containing millions of records, I should not be effecting the transaction log file -that is transaction log file should not grow up in the time of truncating-, am I correct?
Well you don't specify that actual server software but in all cases that I'm aware of that's correct.
DELETE effectively works row-by-row, deleting the record, firing any appropriate triggers, and adding a transaction to the log.
TRUNCATE just removes all of the data in one swoop, not significantly affecting the transaction log (certainly not enough to allow a rollback) and not executing triggers.

TSQL Snapshot Isolation

Using SQL2k5, I have a staging table that contains columns that will populate numerous other tables. For instance, a statement like this:
INSERT INTO [appTable1] ([colA], [colB])
SELECT [appTable1_colA], [appTable1_colB]
FROM [stageTable]
A trigger on [appTable1] will then populate the identity column values of the newly inserted rows back into [stageTable]; for this example, we'll say it's [stageTable].[appTable1_ID] which are then inserted into other tables as a FK. More similar statements follow like:
INSERT INTO [appTable2] ([colA], [colB], [colC], [appTable1_FK])
SELECT [appTable2_colA], [appTable2_colB], [appTable2_colC], [appTable1_ID]
FROM [stageTable]
This process continues through numerous tables like this. As you can see, I'm not including a WHERE clause on the SELECTs from the staging table as this table gets truncated at the end of the process. However, this leaves the possibility of another process adding records to this staging table in the middle of this transaction and those records would not contain the FKs previously populated. Would I want to issue this statement to prevent this?:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT
If this is the best solution, what are the downsides of doing it this way?
Can you add a batch id to your staging table, so that you can use it in where clauses to ensure that you are only working on the original batch of records? Any process that adds records to the staging table would have to use a new, unique batch id. This would be more efficient (and more robust) than depending on snapshot isolation, I think.
All Isolation levels, including snapshot, affect only reads. SELECTs from stageTable will not see uncommited inserts, nor it will block. I'm not sure that solves your problem of throwing everything into the stageTable without any regard for ownership. What happens when the transaction finally commits, the stageTable is left with all the intermediate results ready to be read by the next transaction? Perhaps you should use a temporary #stageTable that will ensure a natural isolation between concurent threads.
To understand the cost of using Snapshot isolation, read Row Versioning Resource Usage:
extra space consumed in tempdb
extra space consumed in each row in the data table
extra space consumed in BLOB storage for large fields

Optimizing Delete on SQL Server

Deletes on sql server are sometimes slow and I've been often in need to optimize them in order to diminish the needed time.
I've been googleing a bit looking for tips on how to do that, and I've found diverse suggestions.
I'd like to know your favorite and most effective techinques to tame the delete beast, and how and why they work.
until now:
be sure foreign keys have indexes
be sure the where conditions are indexed
use of WITH ROWLOCK
destroy unused indexes, delete, rebuild the indexes
now, your turn.
The following article, Fast Ordered Delete Operations may be of interest to you.
Performing fast SQL Server delete operations
The solution focuses on utilising a view in order to simplify the execution plan produced for a batched delete operation. This is achieved by referencing the given table once, rather than twice which in turn reduces the amount of I/O required.
I have much more experience with Oracle, but very likely the same applies to SQL Server as well:
when deleting a large number of rows, issue a table lock, so the database doesn't have to do lots of row locks
if the table you delete from is referenced by other tables, make sure those other tables have indexes on the foreign key column(s) (otherwise the database will do a full table scan for each deleted row on the other table to ensure that deleting the row doesn't violate the foreign key constraint)
I wonder if it's time for garbage-collecting databases? You mark a row for deletion and the server deletes it later during a sweep. You wouldn't want this for every delete - because sometimes a row must go now - but it would be handy on occasion.
Summary of Answers through 2014-11-05
This answer is flagged as community wiki since this is an ever-evolving topic with a lot of nuances, but very few possible answers overall.
The first issue is you must ask yourself what scenario you're optimizing for? This is generally either performance with a single user on the db, or scale with many users on the db. Sometimes the answers are the exact opposite.
For single user optimization
Hint a TABLELOCK
Remove indexes not used in the delete then rebuild them afterward
Batch using something like SET ROWCOUNT 20000 (or whatever, depending on log space) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (##ROWCOUNT = 0)
If deleting a large % of table, just make a new one and delete the old table
Partition the rows to delete, then drop the parition. [Read more...]
For multi user optimization
Hint row locks
Use the clustered index
Design clustered index to minimize page re-organization if large blocks are deleted
Update "is_deleted" column, then do actual deletion later during a maintenance window
For general optimization
Be sure FKs have indexes on their source tables
Be sure WHERE clause has indexes
Identify the rows to delete in the WHERE clause with a view or derived table instead of referencing the table directly. [Read more...]
To be honest, deleting a million rows from a table scales just as badly as inserting or updating a million rows. It's the size of the rowset that's the problem, and there's not much you can do about that.
My suggestions:
Make sure that the table has a primary key and clustered index (this is vital for all operations).
Make sure that the clustered index is such that minimal page re-organisation would occur if a large block of rows were to be deleted.
Make sure that your selection criteria are SARGable.
Make sure that all your foreign key constraints are currently trusted.
(if the indexes are "unused", why are they there at all?)
One option I've used in the past is to do the work in batches. The crude way would be to use SET ROWCOUNT 20000 (or whatever) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (##ROWCOUNT = 0).
This might help reduce the impact upon other systems.
The problem is you haven't defined your conditions enough. I.e. what exactly are you optimizing?
For example, is the system down for nightly maintenance and no users are on the system? And are you deleting a large % of the database?
If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename. If deleting a small %, you likely want to batch things in as large batches as your log space allows. It entirely depends on your database, but dropping indexes for the duration of the rebuild may hurt or help -- if even possible due to being "offline".
If you're online, what's the likelihood your deletes are conflicting with user activity (and is user activity predominantly read, update, or what)? Or, are you trying to optimize for user experience or speed of getting your query done? If you're deleting from a table that's frequently updated by other users, you need to batch but with smaller batch sizes. Even if you do something like a table lock to enforce isolation, that doesn't do much good if your delete statement takes an hour.
When you define your conditions better, you can pick one of the other answers here. I like the link in Rob Sanders' post for batching things.
If you have lots of foreign key tables, start at the bottom of the chain and work up. The final delete will go faster and block less things if there are no child records to cascade delete (which I would NOT turn on if I had a large number fo child tables as it will kill performance).
Delete in batches.
If you have foreign key tables that are no longer being used (you'd be surprised how often production databses end up with old tables nobody will get rid of), get rid of them or at least break the FK/PK connection. No sense cheking a table for records if it isn't being used.
Don't delete - mark records as delted and then exclude marked records from all queries. This is best set up at the time of database design. A lot of people use this because it is also the best fastest way to get back records accidentlally deleted. But it is a lot of work to set up in an already existing system.
I'll add another one to this:
Make sure your transaction isolation level and database options are set appropriately. If your SQL server is set not to use row versioning, or you're using an isolation level on other queries where you will wait for the rows to be deleted, you could be setting yourself up for some very poor performance while the operation is happening.
On very large tables where you have a very specific set of criteria for deletes, you could also partition the table, switch out the partition, and then process the deletions.
The SQLCAT team has been using this technique on really really large volumes of data. I found some references to it here but I'll try and find something more definitive.
I think, the big trap with delete that kill the performance is that sql after each row deleted, it updates all the related indexes for any column in this row. what about delting all indexes before bulk delete?
There are deletes and then there are deletes. If you are aging out data as part of a trim job, you will hopefully be able to delete contiguous blocks of rows by clustered key. If you have to age out data from a high volume table that is not contiguous it is very very painful.
If it is true that UPDATES are faster than DELETES, you could add a status column called DELETED and filter on it in your selects. Then run a proc at night that does the actual deletes.
Do you have foreign keys with referential integrity activated?
Do you have triggers active?
Simplify any use of functions in your WHERE clause! Example:
DELETE FROM Claims
WHERE dbo.YearMonthGet(DataFileYearMonth) = dbo.YearMonthGet(#DataFileYearMonth)
This form of the WHERE clause required 8 minutes to delete 125,837 records.
The YearMonthGet function composed a date with the year and month from the input date and set day = 1. This was to ensure we deleted records based on year and month but not day of month.
I rewrote the WHERE clause to:
WHERE YEAR(DataFileYearMonth) = YEAR(#DataFileYearMonth)
AND MONTH(DataFileYearMonth) = MONTH(#DataFileYearMonth)
The result: The delete required about 38-44 seconds to delete those 125,837 records!

Slow deletes from table with CLOB fields in Oracle 10g

I am encountering an issue where Oracle is very slow when I attempt to delete rows from a table which contains two CLOB fields. The table has millions of rows, no constraints, and the deletes are based on the Primary Key. I have rebuilt indexes and recomputed statistics, to no avail.
What can I do to improve the performance of deletes from this table?
Trace it, with waits enabled
http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14258/d_monitor.htm#i1003679
Find the trace file in the UDUMP directory. TKPROF it.
Look at the end and it will tell you what the database spent its time doing during that SQL. The following link is a good overview of how to analyze a performance issue.
http://www.method-r.com/downloads/doc_download/10-for-developers-making-friends-with-the-oracle-database-cary-millsap
With Oracle you have to consider the amount of redo you are generating when deleting a row. If the CLOB fields are very big, it may just take awhile for Oracle to delete them due to the amount of redo being written and there may not be much you can do.
A test you may perform is seeing if the delete takes a long time on a row, where both CLOB fields are set to null. If that's the case, then it may be the index updates taking a long time. If that is the case, you may need to investigate consolidating indexes if possible, if deletes occur very frequently.
If the table is a derived table, meaning, it can be rebuilt from other tables, you may look at the NOLOGGING option on the table. You can the rebuild the table from the source table, with minimal logging.
I hope this entry helps some, however some more details could help diagnose the issue.
Are there any child tables that reference this table from which are deleting? (You can do a select from user_constraints where r_constraint_name = primary key name on the table you are deleting from).
A delete can be slow if Oracle needs to look into another table to check there are no child records. Normal practice is to index all foreign keys on the child tables so this is not a problem.
Follow Gary's advice, perform the trace and post the TKPROF results here someone will be able to help further.
Your UNDO tablespace seems to be the bottleneck in this case.
Check how long it takes to make a ROLLBACK after you delete the data. If it takes time comparable to the time of the query itself (within 50%), then this certainly is the case.
When you perform a DML query, your data (both original and changed) are written into redo logs and then applied to the datafiles and to the UNDO tablespace.
Deleting millions of CLOB rows takes copying several hundreds of megabytes, if not gigabytes, to the UNDO tablespace, which takes tens of seconds itself.
What can you do about this?
Create a faster UNDO: put it onto a separate disk, make it less sparse (create a larger datafile).
Use ROLLBACK SEGMENTS instead of managed UNDO, assign a ROLLBACK SEGMENT for this very query and issue SET TRANSACTION USE ROLLBACK SEGMENT before running the query.
If it's not the case, i. e. ROLLBACK executes much faster that the query itself, then try to play with you REDO parameters:
Increase your REDO buffer size using LOG_BUFFER parameter.
Increate the size of your logfiles.
Create your logfiles on separate disks so that reading from a first datafile does not hinder writing to a second an so on.
Note that UNDO operations also generate REDO, so it's useful to do all this anyway.
NOLOGGING adviced before is useless, as it is applied only to certain set of operations listed here, DELETE not being one of those operations.
Deleted CLOBs do not end up in the UNDOTBS since they are versioned and retented in the LOB Segment. I think it will generate some LOBINDEX changes in the undo.
If you null or empty the LOBs before, did you actually measured that time with commit separate of the DELETE? If you issue thousands of deletes, do you use batch commits? Is the instance idle? Then AWR report should tell you what is going on.