Using SQL2k5, I have a staging table that contains columns that will populate numerous other tables. For instance, a statement like this:
INSERT INTO [appTable1] ([colA], [colB])
SELECT [appTable1_colA], [appTable1_colB]
FROM [stageTable]
A trigger on [appTable1] will then populate the identity column values of the newly inserted rows back into [stageTable]; for this example, we'll say it's [stageTable].[appTable1_ID], which is then inserted into other tables as a FK. More similar statements follow, like:
INSERT INTO [appTable2] ([colA], [colB], [colC], [appTable1_FK])
SELECT [appTable2_colA], [appTable2_colB], [appTable2_colC], [appTable1_ID]
FROM [stageTable]
This process continues through numerous tables like this. As you can see, I'm not including a WHERE clause on the SELECTs from the staging table, as this table gets truncated at the end of the process. However, this leaves the possibility of another process adding records to the staging table in the middle of this transaction, and those records would not contain the FKs previously populated. Would I want to issue this statement to prevent that?:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT
If this is the best solution, what are the downsides of doing it this way?
Can you add a batch id to your staging table, so that you can use it in where clauses to ensure that you are only working on the original batch of records? Any process that adds records to the staging table would have to use a new, unique batch id. This would be more efficient (and more robust) than depending on snapshot isolation, I think.
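A minimal sketch of the batch-id idea, assuming you add a [BatchID] column to the staging table; [BatchID], @batchId, and the literal values are hypothetical, not part of the original schema:

DECLARE @batchId uniqueidentifier;
SET @batchId = NEWID();

-- every process that loads the staging table stamps its own rows
INSERT INTO [stageTable] ([appTable1_colA], [appTable1_colB], [BatchID])
VALUES ('valueA', 'valueB', @batchId);

-- ...and every downstream statement filters on that batch
INSERT INTO [appTable1] ([colA], [colB])
SELECT [appTable1_colA], [appTable1_colB]
FROM [stageTable]
WHERE [BatchID] = @batchId;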
All isolation levels, including snapshot, affect only reads. SELECTs from stageTable will not see uncommitted inserts, nor will they block. I'm not sure that solves your problem of throwing everything into stageTable without any regard for ownership. What happens when the transaction finally commits and stageTable is left with all the intermediate results, ready to be read by the next transaction? Perhaps you should use a temporary #stageTable, which would give you natural isolation between concurrent threads.
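A minimal sketch of the per-connection #stageTable idea; the column list is illustrative only and would mirror the real stageTable definition:

CREATE TABLE #stageTable
(
    appTable1_colA varchar(50),
    appTable1_colB varchar(50),
    appTable1_ID   int NULL   -- filled in later, as in the original process
);

-- each loading process works only against its own #stageTable, so concurrent
-- loads cannot see each other's rows
INSERT INTO [appTable1] ([colA], [colB])
SELECT [appTable1_colA], [appTable1_colB]
FROM #stageTable;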
To understand the cost of using Snapshot isolation, read Row Versioning Resource Usage:
extra space consumed in tempdb
extra space consumed in each row in the data table
extra space consumed in BLOB storage for large fields
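If you do decide to go the snapshot route despite these costs, a minimal sketch of enabling and using it in SQL Server 2005 ([MyDatabase] is a placeholder name):

ALTER DATABASE [MyDatabase] SET ALLOW_SNAPSHOT_ISOLATION ON;

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
    -- reads inside this transaction see a consistent, versioned view of stageTable
    SELECT [appTable1_colA], [appTable1_colB]
    FROM [stageTable];
COMMIT TRANSACTION;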
Let's say I have a database with many tables in it. I want to perform data archiving on certain tables, that is, create a new table with the same structure (same constraints, indexes, columns, triggers, etc.) and insert specific data into the new table from the old table.
For example, the current table has data from 2008-2017 and I want to move only the data from 2010-2017 into the new table. Then, after that, I can delete the old table and rename the new table following a naming convention similar to the old table's.
How should I approach this?
For the sort of clone-rename-drop logic you're talking about, the basics are pretty straightforward. Really the only time this is a good idea is if you have a table with a large amount of data, which you can't afford downtime or blocking on, and you only plan to do this once. The process looks something like this (a sketch of the rename swap follows the list):
Insert all the data from your original table into the clone table
In a single transaction, sp_rename the original table from (for example) myTable to myTable_OLD (just something to distinguish it from the real table). Then sp_rename the clone table from (for example) myTable_CLONE to myTable
Drop myTable_OLD when you're happy everything has worked how you want. If it didn't work how you want, just sp_rename the objects back.
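A minimal sketch of the rename swap, assuming the clone has already been loaded; the names follow the myTable / myTable_CLONE / myTable_OLD examples above:

BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.myTable', 'myTable_OLD';
    EXEC sp_rename 'dbo.myTable_CLONE', 'myTable';
COMMIT TRANSACTION;

-- once you're happy everything worked:
-- DROP TABLE dbo.myTable_OLD;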
A couple of considerations to think about if you go that route:
Identity columns: If your table has any identities on it, you'll have to turn IDENTITY_INSERT on, then reseed the identity to pick up where the old identity left off
Do you have the luxury of blocking the table while you do this? Generally, if you need to do this sort of thing, the answer is no. What I find works well is to insert all the rows I need using (nolock), or however you need to do it so the impact of the select from the original table is mitigated. Then, after I've moved 99% of the data, I open a transaction, block the original table, insert just the new data that's come in since the bulk of the data movement, and then do the sp_rename stuff (a sketch of this catch-up step follows these considerations)
That way you don't lock anything for the bulk of the data movement, and you only block the table for the very last bit of data that came into the original table between your original insert and your sp_rename
How you determine what's come in "since you started" will depend on how your table is structured. If you have an identity or a datestamp column, you can probably just pick rows which came in after the max of those fields you moved over. If your table does NOT have something you can easily hook into, you might need to get creative.
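A minimal sketch combining the identity handling and the final catch-up described above; the ID column, the example table names, and @maxMovedId are all illustrative:

DECLARE @maxMovedId int;
SELECT @maxMovedId = MAX(ID) FROM dbo.myTable_CLONE;

BEGIN TRANSACTION;
    SET IDENTITY_INSERT dbo.myTable_CLONE ON;

    -- copy only the rows that arrived after the bulk copy, blocking the original table
    INSERT INTO dbo.myTable_CLONE (ID, colA, colB)
    SELECT ID, colA, colB
    FROM dbo.myTable WITH (TABLOCKX, HOLDLOCK)
    WHERE ID > @maxMovedId;

    SET IDENTITY_INSERT dbo.myTable_CLONE OFF;

    -- make sure the identity picks up where the original left off
    DBCC CHECKIDENT ('dbo.myTable_CLONE', RESEED);

    -- ...then do the sp_rename swap shown earlier, inside this same transaction
COMMIT TRANSACTION;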
Alternatives
A couple other alternatives that came to mind:
Table Partitioning:
This shards a single table across multiple partitions (which can be managed sort of like individual tables). You can, say, partition your data by year, then when you want to purge the trailing year of data, you "switch out" that partition to a special table, which you can then truncate (a sketch of the switch-out follows below). All those operations are metadata-only, so they're super fast. This also works really well for huge amounts of data where deletes and all their pesky transaction logging aren't feasible
The downside to table partitioning is it's kind of a pain to set up and manage.
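A minimal sketch of the switch-out step, assuming a partition scheme and an empty, identically structured switch target already exist; all object names and the partition number are placeholders:

-- move the oldest partition's rows out as a metadata-only operation
ALTER TABLE dbo.BigTable
SWITCH PARTITION 1 TO dbo.BigTable_SwitchTarget;

-- then purge them cheaply
TRUNCATE TABLE dbo.BigTable_SwitchTarget;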
Batched Deletes:
If your data isn't too big, you could just do batched deletes on the trailing end of your data. If you can find a way to get clustered index seeks for your deletes, they should be reasonably lightweight. As long as you're not accumulating data faster than you can get rid of it, the benefit of this kind of thing is you just run it semi-continuously and it nibbles away at the trailing end of your data
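A minimal sketch of that kind of nibbling delete; dbo.BigTable, its CreatedDate column, and the cutoff date are placeholders:

DECLARE @rows int;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    -- small batches keep each transaction (and its locking/logging) short
    DELETE TOP (5000)
    FROM dbo.BigTable
    WHERE CreatedDate < '20100101';

    SET @rows = @@ROWCOUNT;
END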
Snapshot Isolation:
If deletes cause too much blocking, you can also set up something like snapshot isolation, which basically stores historical versions of rows in tempdb. Any query running under read committed snapshot will then read those pre-change rows instead of contending for locks on the "real" table. You can then do batched deletes to your heart's content and know that any queries that hit the table will never get blocked by a delete (or any other DML operation), because they'll either read the pre-delete snapshot or they'll read the post-delete snapshot. They won't wait for an in-process delete to figure out whether it's going to commit or roll back. This is not without its drawbacks, unfortunately. For large data sets, it can put a big burden on tempdb, and it too can be a little bit of a black box. It's also going to require buy-in from your DBAs.
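A minimal sketch of turning on read committed snapshot at the database level; [MyDatabase] is a placeholder, and note the switch needs a moment with no other active connections (or WITH ROLLBACK IMMEDIATE):

ALTER DATABASE [MyDatabase] SET READ_COMMITTED_SNAPSHOT ON;
-- from here on, ordinary READ COMMITTED readers get the versioned rows
-- instead of blocking on in-flight deletes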
I have read that one of the differences between DELETE and TRUNCATE TABLE in SQL is that the TRUNCATE operation cannot be rolled back and no triggers will be fired (as written on this site, for example):
QUESTION:
Does this mean that when I TRUNCATE a TABLE that contains millions of records, I should not be affecting the transaction log file (that is, the transaction log file should not grow while truncating), am I correct?
In MS SQL Server (Books Online)
Compared to the DELETE statement, TRUNCATE TABLE has the following advantages:
Less transaction log space is used.
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table data and records only the page deallocations in the transaction log.
Fewer locks are typically used.
When the DELETE statement is executed using a row lock, each row in the table is locked for deletion. TRUNCATE TABLE always locks the table (including a schema (SCH-M) lock) and page but not each row.
Without exception, zero pages are left in the table.
After a DELETE statement is executed, the table can still contain empty pages. For example, empty pages in a heap cannot be deallocated without at least an exclusive (LCK_M_X) table lock. If the delete operation does not use a table lock, the table (heap) will contain many empty pages. For indexes, the delete operation can leave empty pages behind, although these pages will be deallocated quickly by a background cleanup process.
TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes, and so on remain. To remove the table definition in addition to its data, use the DROP TABLE statement.
If the table contains an identity column, the counter for that column is reset to the seed value defined for the column. If no seed was defined, the default value 1 is used. To retain the identity counter, use DELETE instead.
From: http://msdn.microsoft.com/en-us/library/ms177570.aspx
To the original question:
Technically, TRUNCATE deallocates the data pages from the table, effectively removing all records from it. In theory this action can be rolled back until none of the data pages have been re-used. The information on the deallocated pages is not removed; it is still available in the data file. These deallocated pages in the data file can be re-used (allocated to another table, for example) and the data on them can be overwritten.
The transaction log contains the list of pages (in SQL Server) being deallocated from the table during TRUNCATE, but this list is much shorter than the list of all records, therefore the transaction log will not grow to the same extent.
Depending on the implementation of transactions and TRUNCATE in different RDBMSs, it can be possible to do a rollback within a transaction. Sometimes it is possible to do a 'rollback' (restore the table) after the transaction is committed, if the data on the pages is still intact and all information is available, but that is some black magic and usually not supported directly by the RDBMS.
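To ground the rollback point: in SQL Server, for example, a TRUNCATE inside an explicit transaction can be rolled back, because the page deallocations are logged. A minimal sketch (dbo.SomeTable is a placeholder):

BEGIN TRANSACTION;
    TRUNCATE TABLE dbo.SomeTable;
ROLLBACK TRANSACTION;                -- the rows come back

SELECT COUNT(*) FROM dbo.SomeTable;  -- same count as before the truncate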
Does this mean that when I TRUNCATE a TABLE that contains millions of records, I should not be affecting the transaction log file (that is, the transaction log file should not grow while truncating), am I correct?
Well, you don't specify the actual server software, but in all cases that I'm aware of that's correct.
DELETE effectively works row-by-row, deleting the record, firing any appropriate triggers, and adding a transaction to the log.
TRUNCATE just removes all of the data in one swoop, logging only the page deallocations rather than individual rows, and not executing triggers.
I have some database tables that contain aggregated data. Their records (a few thousand per table) are recomputed periodically by an external .NET app, so the old data should be deleted and the new data inserted periodically. Update is not an option in this case.
Between the delete and the insert there is an intermediate period when the records' state is inconsistent (the old ones are deleted, the new ones are not in the table yet), so running a SELECT query in that state returns an incorrect result.
I use subsonic simplerepository to handle database features.
What is the best practice / pattern to workaround / handle this state?
Three options come to my mind:
Create a transaction with a lock on reads until it is done. This only works if processes are relatively fast. A few thousand records shouldn't be too bad if you transact/lock a table at a time -- if you lock the whole process, that could be costly! But if data is related, this is what you'd have to do
Write to temporary versions of the table, then drop old tables and rename temp tables.
Same as above, except bulk copy from temp tables (not necessarily SQL temporary tables; ancillary holding tables would suffice) into the correct tables, first deleting from the main table. You'd still want to use a transaction for this.
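A minimal sketch of that third option; dbo.AggregatedData, dbo.AggregatedData_Staging, and their columns are placeholder names:

BEGIN TRANSACTION;
    DELETE FROM dbo.AggregatedData;

    INSERT INTO dbo.AggregatedData (KeyCol, Amount)
    SELECT KeyCol, Amount
    FROM dbo.AggregatedData_Staging;
COMMIT TRANSACTION;
-- under the default READ COMMITTED level, readers block on the locks held here
-- instead of seeing the half-updated state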
I have a SQLite table with 6 million rows.
Doing a DELETE FROM TABLE is quite slow;
Dropping the table and then re-creating it seems quicker.
I'm using this for a database import.
Would dropping the table be a better approach or is there a way to delete all data quickly?
One big difference is that DELETE FROM TABLE is DML and DROP TABLE is DDL. This is very important when it comes to db transactions. The result at the end may be the same, but these operations are very different.
If it's just performance you have to be aware of, then it may be OK to drop and recreate the table. If you need transactions in your imports, then you have to be aware that DDL is not covered and cannot be rolled back, for example.
Generally speaking the DROP TABLE would be a non-logged transaction. DELETE FROM would require a transient journal to log the records until the DELETE statement has been completed.
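A minimal sketch of the drop-and-recreate path in SQLite; the import_data table and its columns are hypothetical:

DROP TABLE IF EXISTS import_data;

CREATE TABLE import_data (
    id    INTEGER PRIMARY KEY,
    value TEXT
);
-- the bulk INSERTs for the import follow, ideally wrapped in a single transaction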
TRUNCATE TABLE is a lot faster.
The following excerpt from an Oracle website explains why:
'Deletes' perform normal DML. That is, they take locks on rows, they generate redo (lots of it), and they require segments in the UNDO tablespace. Deletes clear records out of blocks carefully. If a mistake is made a rollback can be issued to restore the records prior to a commit. A delete does not relinquish segment space thus a table in which all records have been deleted retains all of its original blocks.
Truncates are DDL and, in a sense, cheat. A truncate moves the High Water Mark of the table back to zero. No row-level locks are taken, no redo or rollback is generated. All extents bar the initial are de-allocated from the table (if you have MINEXTENTS set to anything other than 1, then that number of extents is retained rather than just the initial). By re-positioning the high water mark, they prevent reading of any table data, so they have the same effect as a delete, but without all the overhead. Just one slight problem: a truncate is a DDL command, so you can't roll it back if you decide you made a mistake. (It's also true that you can't selectively truncate -no "WHERE" clause is permitted, unlike with deletes, of course).
By resetting the High Water Mark, the truncate prevents reading of any of the table's data, so it has the same effect as a delete, but without the overhead. There is, however, one aspect of a truncate that must be kept in mind. Because a truncate is DDL, it issues a COMMIT before it acts and another COMMIT afterward, so no rollback of the transaction is possible.
Note that by default, TRUNCATE drops storage even if DROP STORAGE is not specified.
I don't think SQLite implements TRUNCATE, but if it does it will likely be more performant than DELETE.
I am encountering an issue where Oracle is very slow when I attempt to delete rows from a table which contains two CLOB fields. The table has millions of rows, no constraints, and the deletes are based on the Primary Key. I have rebuilt indexes and recomputed statistics, to no avail.
What can I do to improve the performance of deletes from this table?
Trace it, with waits enabled
http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14258/d_monitor.htm#i1003679
Find the trace file in the UDUMP directory. TKPROF it.
Look at the end and it will tell you what the database spent its time doing during that SQL. The following link is a good overview of how to analyze a performance issue.
http://www.method-r.com/downloads/doc_download/10-for-developers-making-friends-with-the-oracle-database-cary-millsap
With Oracle you have to consider the amount of redo you are generating when deleting a row. If the CLOB fields are very big, it may just take awhile for Oracle to delete them due to the amount of redo being written and there may not be much you can do.
A test you may perform is seeing whether the delete still takes a long time on a row where both CLOB fields are set to NULL. If it does, then it may be the index updates taking a long time. If that is the case, you may need to investigate consolidating indexes if possible, if deletes occur very frequently.
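A minimal sketch of that test; my_table, clob1/clob2, and :pk_id are placeholder names:

-- empty the LOB columns first, in their own transaction
UPDATE my_table SET clob1 = NULL, clob2 = NULL WHERE pk_id = :pk_id;
COMMIT;

-- then time the delete on its own
DELETE FROM my_table WHERE pk_id = :pk_id;
COMMIT;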
If the table is a derived table, meaning it can be rebuilt from other tables, you may look at the NOLOGGING option on the table. You can then rebuild the table from the source table with minimal logging.
I hope this entry helps some; however, some more details could help diagnose the issue.
Are there any child tables that reference the table from which you are deleting? (You can do a select from user_constraints where r_constraint_name = the primary key name on the table you are deleting from.)
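A minimal sketch of that dictionary check; 'MY_TABLE' is a placeholder for the table you are deleting from:

SELECT c.table_name, c.constraint_name
FROM   user_constraints c
WHERE  c.r_constraint_name IN (
           SELECT constraint_name
           FROM   user_constraints
           WHERE  table_name = 'MY_TABLE'
           AND    constraint_type = 'P'
       );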
A delete can be slow if Oracle needs to look into another table to check there are no child records. Normal practice is to index all foreign keys on the child tables so this is not a problem.
Follow Gary's advice, perform the trace, and post the TKPROF results here; someone will be able to help further.
Your UNDO tablespace seems to be the bottleneck in this case.
Check how long it takes to make a ROLLBACK after you delete the data. If it takes time comparable to the time of the query itself (within 50%), then this certainly is the case.
When you perform a DML query, your data (both original and changed) are written into redo logs and then applied to the datafiles and to the UNDO tablespace.
Deleting millions of CLOB rows takes copying several hundreds of megabytes, if not gigabytes, to the UNDO tablespace, which takes tens of seconds itself.
What can you do about this?
Create a faster UNDO: put it onto a separate disk, make it less sparse (create a larger datafile).
Use ROLLBACK SEGMENTS instead of managed UNDO, assign a ROLLBACK SEGMENT for this very query and issue SET TRANSACTION USE ROLLBACK SEGMENT before running the query.
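A minimal sketch of pinning the delete to a dedicated rollback segment (this only applies with manual undo management); rbs_big, my_table, and :max_id are placeholders:

SET TRANSACTION USE ROLLBACK SEGMENT rbs_big;

DELETE FROM my_table WHERE pk_id < :max_id;
COMMIT;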
If that's not the case, i.e. the ROLLBACK executes much faster than the query itself, then try to play with your REDO parameters:
Increase your REDO buffer size using the LOG_BUFFER parameter.
Increase the size of your logfiles.
Create your logfiles on separate disks so that reading from the first datafile does not hinder writing to the second, and so on.
Note that UNDO operations also generate REDO, so it's useful to do all this anyway.
The NOLOGGING option advised above is useless, as it applies only to a certain set of operations listed here, and DELETE is not one of those operations.
Deleted CLOBs do not end up in the UNDO tablespace, since they are versioned and retained in the LOB segment. I think it will generate some LOBINDEX changes in the undo.
If you null or empty the LOBs first, did you actually measure that time, with its commit, separately from the DELETE? If you issue thousands of deletes, do you use batch commits? Is the instance idle? Then an AWR report should tell you what is going on.