Does TRUNCATE TABLE grow the transaction log? - sql

I have read that one of the differences between DELETE and TRUNCATE TABLE in SQL is that the TRUNCATE operation cannot be rolled back and no triggers will be fired (as written on this site, for example):
QUESTION:
Does this mean that when I TRUNCATE a table containing millions of records, I should not be affecting the transaction log file - that is, the transaction log file should not grow while the truncate runs? Am I correct?

In MS SQL Server (Books Online)
Compared to the DELETE statement, TRUNCATE TABLE has the following advantages:
Less transaction log space is used.
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table data and records only the page deallocations in the transaction log.
Fewer locks are typically used.
When the DELETE statement is executed using a row lock, each row in the table is locked for deletion. TRUNCATE TABLE always locks the table (including a schema (SCH-M) lock) and page but not each row.
Without exception, zero pages are left in the table.
After a DELETE statement is executed, the table can still contain empty pages. For example, empty pages in a heap cannot be deallocated without at least an exclusive (LCK_M_X) table lock. If the delete operation does not use a table lock, the table (heap) will contain many empty pages. For indexes, the delete operation can leave empty pages behind, although these pages will be deallocated quickly by a background cleanup process.
TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes, and so on remain. To remove the table definition in addition to its data, use the DROP TABLE statement.
If the table contains an identity column, the counter for that column is reset to the seed value defined for the column. If no seed was defined, the default value 1 is used. To retain the identity counter, use DELETE instead.
From: http://msdn.microsoft.com/en-us/library/ms177570.aspx
To the original question:
Technically TRUNCATE deallocates the data pages from the table, effectively removing all records from it. In theory this action can be rolled back until none of the data pages have been reused. The information on the deallocated pages is not removed; it is still present in the data file. These deallocated pages can be reused (allocated to another table, for example), at which point the data on them is overwritten.
The transaction log contains the list of pages being deallocated from the table during the TRUNCATE (in SQL Server), but this list is much shorter than a list of every record, so the transaction log will not grow to the same extent.
Depending on how a given RDBMS implements transactions and TRUNCATE, it can be possible to roll it back within a transaction. Sometimes it is even possible to do a 'rollback' (restore the table) after the transaction is committed, if the data on the pages is still intact and all information is available, but that is black magic and usually not supported directly by the RDBMS.
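As a minimal sketch of that SQL Server behaviour (the table name and data here are invented purely for the demo), a TRUNCATE issued inside an explicit transaction can still be rolled back:
-- hypothetical demo table
CREATE TABLE dbo.DemoRows (Id INT IDENTITY(1,1) PRIMARY KEY, Payload VARCHAR(50));
INSERT INTO dbo.DemoRows (Payload) VALUES ('a'), ('b'), ('c');

BEGIN TRANSACTION;
    TRUNCATE TABLE dbo.DemoRows;            -- only the page deallocations are logged
    SELECT COUNT(*) FROM dbo.DemoRows;      -- 0 inside the transaction
ROLLBACK TRANSACTION;                       -- undoes the deallocations

SELECT COUNT(*) FROM dbo.DemoRows;          -- 3 again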

Does this mean that when I TRUNCATE a table containing millions of records, I should not be affecting the transaction log file - that is, the transaction log file should not grow while the truncate runs? Am I correct?
Well, you don't specify the actual server software, but in all cases that I'm aware of, that's correct.
DELETE effectively works row by row, deleting the record, firing any appropriate triggers, and writing an entry to the transaction log for each row.
TRUNCATE just removes all of the data in one sweep, logging only the page deallocations rather than each individual row, and not executing triggers.

Related

Update via a temp table

So I have a rather large table (150 million rows) that data scrub queries run against nightly. These queries don't update a lot of records, but to find the records they need, they have to query that single table multiple times in sub-queries, which takes some time.
So, would it be better for me to do a normal update statement, or would it be better to put the few results I need in a temp table and then just do an update for those few rows, which would greatly reduce the locks during the update?
I'm unsure how an update statement's locks work when most of the time is spent querying. If it is only going to update 5 records but runs for half an hour, will it release a record that it updated in the first minute, or does it hold it until the end of the query?
Thanks
You need to use (and look into) the ROWLOCK table hint. You can use it with the update statement while updating in batches of 5000 rows or less. This will attempt to place row locks in the target table (or on index keys, if a covering index is present). If for some reason that fails, the lock will be escalated to a table lock.
From MSDN (as for reasons why lock escalation might occur):
When the Database Engine checks for possible escalations at every 1250 newly acquired locks, a lock escalation will occur if and only if a Transact-SQL statement has acquired at least 5,000 locks on a single reference of a table. Lock escalation is triggered when a Transact-SQL statement acquires at least 5,000 locks on a single reference of a table. For example, lock escalation is not triggered if a statement acquires 3,000 locks in one index and 3,000 locks in another index of the same table. Similarly, lock escalation is not triggered if a statement has a self join on a table, and each reference to the table only acquires 3,000 locks in the table.
Actually, there's more to read in this last article. You should have a look at the mixed lock type escalation section.
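A rough sketch of that batched-update pattern (table, column, and temp-table names are all hypothetical): collect the few target rows once, then update them in batches small enough to stay under the escalation threshold, requesting row locks with the ROWLOCK hint:
-- scan the big table once to find the handful of rows to fix
SELECT t.Id
INTO #ToFix
FROM dbo.BigTable AS t            -- hypothetical 150M-row table
WHERE t.SomeColumn = 'bad';       -- stand-in for the expensive scrub sub-queries

-- update in batches well under 5000 rows, requesting row locks
WHILE 1 = 1
BEGIN
    UPDATE TOP (4000) t
    SET t.SomeColumn = 'good'
    FROM dbo.BigTable AS t WITH (ROWLOCK)
    JOIN #ToFix AS f ON f.Id = t.Id
    WHERE t.SomeColumn <> 'good';  -- skip rows already fixed so the loop terminates

    IF @@ROWCOUNT = 0 BREAK;
END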

Reverting a database insertion with log files?

I am working on a program that is supposed to insert hundreds of rows to the database per run.
The problem is that once the inserted data is wrong, how can we recover from that run? Currently I only have a log file (in a format I created) which records the raw data that gets inserted (no metadata or primary keys). Is there a way we can create a log that the database itself can understand, so that when we want to undo the insertion we can feed that log file back to the database?
Or, if there is an alternative mechanism for undoing an operation from a program, kindly let me know. Thanks.
The fact that this is only hundreds of rows makes it susceptible to the great-grandmother of all undo mechanisms:
have a table importruns with a row for each run you do. I assume it has an integer auto-increment PK
add a field to your data table that carries the PK of the import run
for insert-only runs, you just need to DELETE FROM sometable WHERE importid=$whatever
If you also have replace/update imports, go one step further
for each data table have a corresponding table, that has one field more: superseededby
for each row you update/replace, place an original copy of the row in this table plus the import id in superseededby
to revert, you now also have to run INSERT INTO originaltable SELECT * FROM superseededtable WHERE superseededby=$whatever
You can clean up superseededtable for known-good imports to make sure storage doesn't grow without bound.
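A minimal sketch of the insert-only variant (table and column names are made up for the example; the answer's importid naming is kept):
-- one row per import run
CREATE TABLE importruns (
    id      INT IDENTITY(1,1) PRIMARY KEY,
    started DATETIME NOT NULL DEFAULT GETDATE()
);

-- the data table carries the id of the run that inserted each row
CREATE TABLE sometable (
    id       INT IDENTITY(1,1) PRIMARY KEY,
    payload  VARCHAR(100) NOT NULL,
    importid INT NOT NULL REFERENCES importruns(id)
);

-- reverting an insert-only run is then a single statement
DECLARE @badrun INT;
SET @badrun = 42;                 -- the id of the run to undo
DELETE FROM sometable WHERE importid = @badrun;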
You have several options, depending on when you notice the error.
If you know there is an error with the data, then you can use the transaction API to roll back the changes of the current transaction.
In case you only find out about the error later, you can create your own log. Create an identifier for each transaction (or run), and add a field to the relevant table where that id gets inserted. This allows you to identify exactly which transaction each row came from. You can also create a stored procedure that deletes rows for a given transaction id.
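For the first case (catching the error before the program commits), a rough sketch of the transaction route (the table names and the validation check are placeholders):
BEGIN TRANSACTION;

-- the hundreds of rows for this run
INSERT INTO dbo.TargetTable (ColA, ColB)
SELECT ColA, ColB FROM dbo.StagingRows;

-- placeholder sanity check: undo everything if the data looks wrong
IF EXISTS (SELECT 1 FROM dbo.TargetTable WHERE ColA IS NULL)
    ROLLBACK TRANSACTION;
ELSE
    COMMIT TRANSACTION;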

T-SQL Optimize DELETE of many records

I have a table that can grow to millions of records (50 million, for example). Every 20 minutes, records that are older than 20 minutes are deleted.
The problem is that if the table has so many records, such a deletion can take a lot of time and I want to make it faster.
I can not do "truncate table" because I want to remove only records that are older than 20 minutes. I suppose that when doing the "delete" and filtering the records that need to be deleted, the server is writing to a log file or something and this takes a lot of time?
Am I right? Is there a flag or option I can turn off to optimize the delete, and then turn back on afterwards?
To expand on the batch delete suggestion, I'd suggest you do this far more regularly (every 20 seconds perhaps) - batch deletions are easy:
WHILE 1 = 1
BEGIN
    -- delete a small batch of expired rows, then loop until nothing is left
    DELETE TOP (4000)
    FROM YOURTABLE
    WHERE YourIndexedDateColumn < DATEADD(MINUTE, -20, GETDATE())

    IF @@ROWCOUNT = 0
        BREAK
END
Your inserts may lag slightly whilst they wait for the locks to release but they should insert rather than error.
In regards to your table though, a table with this much traffic I'd expect to see on a very fast RAID 10 array, perhaps even partitioned - are your disks up to it? Are your transaction logs on different disks to your data files? They should be.
EDIT 1 - Response to your comment
To put a database into SIMPLE recovery:
ALTER DATABASE [DatabaseName] SET RECOVERY SIMPLE;
This basically turns off transaction logging on the given database, meaning that in the event of data loss you would lose all data since your last full backup. If you're OK with that, this should save a lot of time when running large transactions. (NOTE that while a transaction is running, logging still takes place even in SIMPLE - to enable rolling back that transaction.)
If there are tables within your database where you can't afford to lose data, you'll need to leave your database in FULL recovery mode (i.e. every transaction gets logged, and hopefully flushed to *.trn files by your server's maintenance plans). As I stated in my question though, there is nothing stopping you having two databases, one in FULL and one in SIMPLE. The FULL database would be for tables where you can't afford to lose any data (i.e. you could apply the transaction logs to restore data to a specific point in time) and the SIMPLE database would be for these massive high-traffic tables where you can allow data loss in the event of a failure.
All of this is relevant assuming you're creating full (*.bak) backups every night and flushing your log files to *.trn files every half hour or so.
In regards to your index question, it's imperative that your date column is indexed; if you check your execution plan and see a "TABLE SCAN", that is an indicator of a missing index.
Your date column, I presume, is a DATETIME with a constraint setting the DEFAULT to getdate()?
You may find that you get better performance by replacing that with a BIGINT in YYYYMMDDHHMMSS form and then applying a CLUSTERED index to that column - note however that you can only have one clustered index per table, so if that table already has one you'll need to use a non-clustered index. (In case you didn't know, a clustered index basically tells SQL to store the information in that order, meaning that when you delete rows older than 20 minutes SQL can literally delete them sequentially rather than hopping from page to page.)
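As a sketch (table and column names assumed), the index the execution plan is hinting at could be created like this; the clustered variant only applies if the table does not already have a clustered index:
-- non-clustered index on the datetime column used in the DELETE filter
CREATE NONCLUSTERED INDEX IX_YourTable_CreatedAt
    ON dbo.YourTable (CreatedAt);

-- alternative: cluster on the date column so old rows sit together
-- physically and can be removed sequentially (only one clustered
-- index is allowed per table)
CREATE CLUSTERED INDEX CIX_YourTable_CreatedAt
    ON dbo.YourTable (CreatedAt);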
The log problem is probably due to the number of records deleted in the transaction; to make things worse, the engine may be requesting a lock per record (or per page, which is not so bad).
The one big thing here is how you determine the records to be deleted. I'm assuming you use a datetime field; if so, make sure you have an index on that column, otherwise it's a sequential scan of the table that will really penalize your process.
There are two things you may do, depending on the concurrency of users and the timing of the delete:
If you can guarantee that no one is going to read or write while you delete, you can lock the table in exclusive mode, delete (this takes only one lock from the engine), and release the lock.
You can use batch deletes: you would make a script with a cursor that provides the rows you want to delete, and you begin a transaction and commit every X records (ideally 5000), so you keep the transactions short and don't take that many locks.
Take a look at the query plan for the delete process and see what it shows; a sequential scan of a big table is never good.
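For the first option, the exclusive lock can be requested explicitly with a table hint so the whole purge runs under a single lock; a sketch with assumed names:
BEGIN TRANSACTION;

-- TABLOCKX takes one exclusive table lock instead of many row/page locks;
-- only safe when nothing else needs to read or write during the purge
DELETE FROM dbo.YourTable WITH (TABLOCKX)
WHERE YourIndexedDateColumn < DATEADD(MINUTE, -20, GETDATE());

COMMIT TRANSACTION;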
Unfortunately for the purpose of this question, and fortunately for the sake of consistency and recoverability of databases in SQL Server, putting a database into Simple recovery mode DOES NOT disable logging.
Every transaction still gets logged before being committed to the data file(s); the only difference is that in Simple recovery mode the space in the log gets released (in most cases) right after the transaction is either rolled back or committed. This is not going to affect the performance of the DELETE statement one way or another.
I had a similar problem when I needed to delete more than 70% of the rows from a big table with 3 indexes and a lot of foreign keys.
For this scenario, I saved the rows I wanted in a temp table, truncated the original table and reinserted the rows, something like:
SELECT * INTO #tempuser FROM [User] WHERE [Status] >= 600;
TRUNCATE TABLE [User];
INSERT [User] SELECT * FROM #tempuser;
I learned this technique from this link, which explains:
DELETE is a fully logged operation, and can be rolled back if something goes wrong
TRUNCATE removes all rows from a table without logging the individual row deletions
In the article you can explore other strategies to resolve the delay in deleting many records; that one worked for me.
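If that save-truncate-reinsert pattern is used on a live table, it is worth wrapping the three statements in a single transaction so readers never observe an empty [User] table. Two caveats: TRUNCATE TABLE fails if other tables reference [User] through foreign keys (those constraints have to be dropped or disabled first), and if [User] has an identity column the re-insert needs SET IDENTITY_INSERT plus an explicit column list. A sketch under those assumptions:
BEGIN TRANSACTION;

SELECT * INTO #tempuser FROM [User] WHERE [Status] >= 600;

-- fails here if [User] is referenced by a FOREIGN KEY constraint
TRUNCATE TABLE [User];

-- if [User] has an identity column, use SET IDENTITY_INSERT [User] ON
-- and list the columns explicitly instead of SELECT *
INSERT [User] SELECT * FROM #tempuser;

COMMIT TRANSACTION;

DROP TABLE #tempuser;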

What's the best way to delete all data from a table?

I have a SQLite table with 6 million rows.
Doing a DELETE FROM TABLE is quite slow;
Dropping the table and then re-creating it seems quicker.
I'm using this for a database import.
Would dropping the table be a better approach or is there a way to delete all data quickly?
One big difference is that DELETE FROM TABLE is DML and DROP TABLE is DDL. This is very important when it comes to db transactions. The result at the end may be the same, but these operations are very different.
If it's just performance you have to be aware of, then it may be OK to drop and recreate the table. If you need transactions in your imports, then you have to be aware that DDL is not covered and cannot be rolled back, for example.
Generally speaking the DROP TABLE would be a non-logged transaction. DELETE FROM would require a transient journal to log the records until the DELETE statement has been completed.
TRUNCATE TABLE is a lot faster.
The following excerpt from an Oracle website explains why:
'Deletes' perform normal DML. That is, they take locks on rows, they generate redo (lots of it), and they require segments in the UNDO tablespace. Deletes clear records out of blocks carefully. If a mistake is made a rollback can be issued to restore the records prior to a commit. A delete does not relinquish segment space thus a table in which all records have been deleted retains all of its original blocks.
Truncates are DDL and, in a sense, cheat. A truncate moves the High Water Mark of the table back to zero. No row-level locks are taken, no redo or rollback is generated. All extents bar the initial are de-allocated from the table (if you have MINEXTENTS set to anything other than 1, then that number of extents is retained rather than just the initial). By re-positioning the high water mark, they prevent reading of any table data, so they have the same effect as a delete, but without all the overhead. Just one slight problem: a truncate is a DDL command, so you can't roll it back if you decide you made a mistake. (It's also true that you can't selectively truncate -no "WHERE" clause is permitted, unlike with deletes, of course).
By resetting the High Water Mark, the truncate prevents reading of any of the table's data, so it has the same effect as a delete, but without the overhead. There is, however, one aspect of a truncate that must be kept in mind. Because a truncate is DDL it issues a COMMIT before it acts and another COMMIT afterward, so no rollback of the transaction is possible.
Note that by default, TRUNCATE drops storage even if DROP STORAGE is not specified.
I don't think SQLite implements TRUNCATE, but if it does it will likely be more performant than DELETE.
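For SQLite specifically, a rough sketch of both options (the table definition is just an example); an unqualified DELETE with no WHERE clause lets SQLite apply its internal truncate optimization, which is usually far faster than a filtered delete:
-- option 1: delete everything; with no WHERE clause SQLite can wipe
-- the table's pages instead of visiting every row
DELETE FROM imports;

-- option 2: drop and recreate the table (DDL)
DROP TABLE IF EXISTS imports;
CREATE TABLE imports (
    id      INTEGER PRIMARY KEY,
    payload TEXT
);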

TSQL Snapshot Isolation

Using SQL2k5, I have a staging table that contains columns that will populate numerous other tables. For instance, a statement like this:
INSERT INTO [appTable1] ([colA], [colB])
SELECT [appTable1_colA], [appTable1_colB]
FROM [stageTable]
A trigger on [appTable1] will then populate the identity column values of the newly inserted rows back into [stageTable]; for this example, we'll say it's [stageTable].[appTable1_ID] which are then inserted into other tables as a FK. More similar statements follow like:
INSERT INTO [appTable2] ([colA], [colB], [colC], [appTable1_FK])
SELECT [appTable2_colA], [appTable2_colB], [appTable2_colC], [appTable1_ID]
FROM [stageTable]
This process continues through numerous tables like this. As you can see, I'm not including a WHERE clause on the SELECTs from the staging table as this table gets truncated at the end of the process. However, this leaves the possibility of another process adding records to this staging table in the middle of this transaction and those records would not contain the FKs previously populated. Would I want to issue this statement to prevent this?:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT
If this is the best solution, what are the downsides of doing it this way?
Can you add a batch id to your staging table, so that you can use it in where clauses to ensure that you are only working on the original batch of records? Any process that adds records to the staging table would have to use a new, unique batch id. This would be more efficient (and more robust) than depending on snapshot isolation, I think.
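A sketch of that batch-id idea (the batchId column and @batch variable are hypothetical additions; the stageTable/appTable1 names are from the question):
-- one-time schema change: stamp every staged row with a batch id
ALTER TABLE [stageTable] ADD [batchId] UNIQUEIDENTIFIER NULL;
GO

DECLARE @batch UNIQUEIDENTIFIER;
SET @batch = NEWID();

-- the loader claims its own rows
UPDATE [stageTable] SET [batchId] = @batch WHERE [batchId] IS NULL;

-- every later statement filters on that batch instead of taking the whole table
INSERT INTO [appTable1] ([colA], [colB])
SELECT [appTable1_colA], [appTable1_colB]
FROM [stageTable]
WHERE [batchId] = @batch;

-- and the final cleanup replaces the blanket truncate
DELETE FROM [stageTable] WHERE [batchId] = @batch;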
All isolation levels, including snapshot, affect only reads. SELECTs from stageTable will not see uncommitted inserts, nor will they block. I'm not sure that solves your problem of throwing everything into stageTable without any regard for ownership. What happens when the transaction finally commits and stageTable is left with all the intermediate results, ready to be read by the next transaction? Perhaps you should use a temporary #stageTable that would ensure natural isolation between concurrent threads.
To understand the cost of using Snapshot isolation, read Row Versioning Resource Usage:
extra space consumed in tempdb
extra space consumed in each row in the data table
extra space consumed in BLOB storage for large fields
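One more point in the same context: SET TRANSACTION ISOLATION LEVEL SNAPSHOT only works after row versioning has been enabled at the database level (database name assumed here), and that is when the tempdb and per-row overhead above starts being paid:
ALTER DATABASE [YourDatabase] SET ALLOW_SNAPSHOT_ISOLATION ON;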