Fastest way to update 120 Million records - sql

I need to initialize a new field with the value -1 in a 120 Million record table.
Update table
set int_field = -1;
I let it run for 5 hours before canceling it.
I tried running it with transaction level set to read uncommitted with the same results.
Recovery Model = Simple.
MS SQL Server 2005
Any advice on getting this done faster?

The only sane way to update a table of 120M records is with a SELECT statement that populates a second table. You have to take care when doing this. Instructions below.
Simple Case
For a table w/out a clustered index, during a time w/out concurrent DML:
SELECT *, new_col = 1 INTO clone.BaseTable FROM dbo.BaseTable
recreate indexes, constraints, etc on new table
switch old and new w/ ALTER SCHEMA ... TRANSFER.
drop old table
If you can't create a clone schema, a different table name in the same schema will do. Remember to rename all your constraints and triggers (if applicable) after the switch.
Non-simple Case
First, recreate your BaseTable with the same name under a different schema, eg clone.BaseTable. Using a separate schema will simplify the rename process later.
Include the clustered index, if applicable. Remember that primary keys and unique constraints may be clustered, but not necessarily so.
Include identity columns and computed columns, if applicable.
Include your new INT column, wherever it belongs.
Do not include any of the following:
triggers
foreign key constraints
non-clustered indexes/primary keys/unique constraints
check constraints or default constraints. Defaults don't make much of a difference, but we're trying to keep things minimal.
Then, test your insert w/ 1000 rows:
-- assuming an IDENTITY column in BaseTable
SET IDENTITY_INSERT clone.BaseTable ON
GO
INSERT clone.BaseTable WITH (TABLOCK) (Col1, Col2, Col3)
SELECT TOP 1000 Col1, Col2, Col3 = -1
FROM dbo.BaseTable
GO
SET IDENTITY_INSERT clone.BaseTable OFF
Examine the results. If everything appears in order:
truncate the clone table
make sure the database is in bulk-logged or simple recovery model
perform the full insert.
This will take a while, but not nearly as long as an update. Once it completes, check the data in the clone table to make sure everything is correct.
Then, recreate all non-clustered primary keys/unique constraints/indexes and foreign key constraints (in that order). Recreate default and check constraints, if applicable. Recreate all triggers. Recreate each constraint, index or trigger in a separate batch. eg:
ALTER TABLE clone.BaseTable ADD CONSTRAINT UQ_BaseTable UNIQUE (Col2)
GO
-- next constraint/index/trigger definition here
Finally, move dbo.BaseTable to a backup schema and clone.BaseTable to the dbo schema (or wherever your table is supposed to live).
-- -- perform first true-up operation here, if necessary
-- EXEC clone.BaseTable_TrueUp
-- GO
-- -- create a backup schema, if necessary
-- CREATE SCHEMA backup_20100914
-- GO
BEGIN TRY
BEGIN TRANSACTION
ALTER SCHEMA backup_20100914 TRANSFER dbo.BaseTable
-- -- perform second true-up operation here, if necessary
-- EXEC clone.BaseTable_TrueUp
ALTER SCHEMA dbo TRANSFER clone.BaseTable
COMMIT TRANSACTION
END TRY
BEGIN CATCH
SELECT ERROR_MESSAGE() -- add more info here if necessary
ROLLBACK TRANSACTION
END CATCH
GO
If you need to free-up disk space, you may drop your original table at this time, though it may be prudent to keep it around a while longer.
Needless to say, this is ideally an offline operation. If you have people modifying data while you perform this operation, you will have to perform a true-up operation with the schema switch. I recommend creating a trigger on dbo.BaseTable to log all DML to a separate table. Enable this trigger before you start the insert. Then in the same transaction that you perform the schema transfer, use the log table to perform a true-up. Test this first on a subset of the data! Deltas are easy to screw up.
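The DML-logging trigger described above could look roughly like the sketch below. The log table dbo.BaseTable_Delta, the trigger name, and the key column Id are hypothetical names chosen for illustration, and the sketch assumes the key value of a row is never updated. Treat it as a starting point, not a tested true-up solution.

```sql
-- Hypothetical log table: one row per changed key, with the kind of change.
CREATE TABLE dbo.BaseTable_Delta (
    DeltaId  int IDENTITY(1,1) PRIMARY KEY,
    Id       int NOT NULL,        -- assumed key column of dbo.BaseTable
    Action   char(1) NOT NULL,    -- 'I', 'U' or 'D'
    LoggedAt datetime NOT NULL DEFAULT GETDATE()
);
GO
CREATE TRIGGER dbo.trg_BaseTable_LogDml ON dbo.BaseTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Rows in "inserted" are new rows, or updated rows if they also appear in "deleted".
    INSERT dbo.BaseTable_Delta (Id, Action)
    SELECT i.Id,
           CASE WHEN EXISTS (SELECT 1 FROM deleted d WHERE d.Id = i.Id)
                THEN 'U' ELSE 'I' END
    FROM inserted i;

    -- Rows that appear only in "deleted" were removed.
    INSERT dbo.BaseTable_Delta (Id, Action)
    SELECT d.Id, 'D'
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted i WHERE i.Id = d.Id);
END
GO
```

The true-up step would then replay the logged keys against clone.BaseTable inside the same transaction as the schema transfer.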

If you have the disk space, you could use SELECT INTO and create a new table. It's minimally logged, so it would go much faster
select t.*, int_field = CAST(-1 as int)
into mytable_new
from mytable t
-- create your indexes and constraints
GO
exec sp_rename mytable, mytable_old
exec sp_rename mytable_new, mytable
drop table mytable_old
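SELECT INTO is only minimally logged under the simple or bulk-logged recovery model, so it may be worth confirming the current setting before starting. A minimal check against the sys.databases catalog view:

```sql
-- Shows the recovery model (SIMPLE, BULK_LOGGED or FULL) of the current database.
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = DB_NAME();
```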

I break the task up into smaller units. Test with different batch size intervals for your table, until you find an interval that performs optimally. Here is a sample that I have used in the past.
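declare @counter int
declare @numOfRecords int
declare @batchsize int
set @numOfRecords = (SELECT COUNT(*) AS NumberOfRecords FROM <TABLE> with(nolock))
set @counter = 0
set @batchsize = 2500
-- limit every following statement to @batchsize rows
set rowcount @batchsize
while @counter < (@numOfRecords/@batchsize) +1
begin
set @counter = @counter + 1
-- a newly added column is NULL, and NULL <> -1 is not true, so test for NULL explicitly
Update <TABLE> set int_field = -1 where int_field is null or int_field <> -1;
end
-- remove the row limit again
set rowcount 0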

If your int_field is indexed, remove the index before running the update. Then create your index again...
5 hours seem like a lot for 120 million recs.
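If dropping the index is inconvenient, disabling and rebuilding it should have a similar effect. A sketch, assuming a nonclustered index with the hypothetical name IX_MyTable_int_field on a table dbo.MyTable:

```sql
-- Hypothetical names: dbo.MyTable and IX_MyTable_int_field.
-- Only disable nonclustered indexes; disabling the clustered index
-- makes the table inaccessible until it is rebuilt.
ALTER INDEX IX_MyTable_int_field ON dbo.MyTable DISABLE;
GO
UPDATE dbo.MyTable SET int_field = -1;
GO
-- Rebuilding re-enables the index.
ALTER INDEX IX_MyTable_int_field ON dbo.MyTable REBUILD;
GO
```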

set rowcount 1000000
Update table set int_field = -1 where int_field is null or int_field<>-1
see how fast that takes, adjust and repeat as necessary

What I'd try first is
to drop all constraints, indexes, triggers and full text indexes first before you update.
If above wasn't performant enough, my next move would be
to create a CSV file with 120 million records and bulk import it using bcp.
Lastly, I'd create a new heap table (meaning table with no primary key) with no indexes on a different filegroup, populate it with -1. Partition the old table, and add the new partition using "switch".

When adding a new column ("initialize a new field") and setting a single value to each existing row, I use the following tactic:
ALTER TABLE MyTable
add NewColumn int not null
constraint MyTable_TemporaryDefault
default -1
ALTER TABLE MyTable
drop constraint MyTable_TemporaryDefault
If the column is nullable and you don't include a "declared" constraint, the column will be set to null for all rows.

declare @cnt bigint
set @cnt = 1
while @cnt*100<10000000
begin
UPDATE top(100) [Imp].[dbo].[tablename]
SET [col1] = xxxx
WHERE [col1] is null
print '@cnt: '+convert(varchar,@cnt)
set @cnt=@cnt+1
end

Sounds like an indexing problem, like Pabla Santa Cruz mentioned. Since your update is not conditional, you can DROP the column and RE-ADD it with a DEFAULT value.

In general, the recommendations are:
Remove or disable all indexes, triggers, and constraints on the table;
Commit more often (e.g. after every 1000 updated records);
Use select ... into.
For a particular case, choose the most appropriate solution or a combination of them.
Also bear in mind that an index can sometimes be useful, e.g. when you update a non-indexed column based on some condition.

If the table has an index which you can iterate over I would put update top(10000) statement in a while loop moving over the data. That would keep the transaction log slim and won't have such a huge impact on the disk system. Also, I would recommend to play with maxdop option (setting it closer to 1).
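One way to move over the data is to walk the clustered key in fixed-size ranges rather than relying on TOP alone. The sketch below is a variant of the idea, not the answerer's exact code. It assumes a table dbo.BigTable with an integer clustered key named id and a column int_field; all three names are hypothetical.

```sql
-- Hypothetical names: dbo.BigTable, id (clustered integer key), int_field.
DECLARE @batch bigint, @lo bigint, @max bigint;
SET @batch = 10000;

SELECT @lo = MIN(id), @max = MAX(id) FROM dbo.BigTable;

-- Each UPDATE runs as its own autocommit transaction, so the log can be
-- reused between batches under the simple recovery model.
WHILE @lo <= @max
BEGIN
    UPDATE dbo.BigTable
    SET int_field = -1
    WHERE id >= @lo AND id < @lo + @batch
    OPTION (MAXDOP 1);   -- keep each batch on a single scheduler

    SET @lo = @lo + @batch;
END
```

If the key values have large gaps, some iterations will touch few or no rows, so the batch size is an upper bound, not an exact count.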

Related

What is the process during re-naming and re-creating a MS-SQL table using stored procedure?

I have a table called myTable that receives continuous inserts. I will rename that table to myTable_Date and create a new table, myTable, through a stored procedure.
I want to know what will happen during re-naming and re-creating the table, will it drop any packet?
SQL Server has sp_rename built in if you just want to change the name of a table.
sp_rename myTable, myTable_Date
Would change the name from myTable to myTable_Date
But it only changes the name reference in sys.Objects so make sure any references are altered and read the documentation about it :)
See the Microsoft documentation for sp_rename.
When you rename the myTable to myTableDate, myTable won't exist anymore so if someone tries to insert something inside myTable it will fail.
When you create new myTable with the same name and columns everything will be fine and the insertion process will continue.
I suggest you to make a little script renaming the table and creating new one. Something like this:
sp_rename myTable, myTable_Date
GO
CREATE TABLE myTable(
-- Table definition
)
When you rename the table you will get warning like this: "Caution: Changing any part of an object name could break scripts and stored procedures." so you better create the new table fast.
Another option is to create a table exactly like myTable, insert all the data from myTable into it, and then delete that data from myTable. No renaming, no dropping, and the insertion process will not be interrupted.
I want to know what will happen during re-naming and re-creating the
table, will it drop any packet?
Inserts attempted after the table is renamed will err until the table is recreated. You can avoid that by executing the tasks in a transaction. Short term blocking will happen if an insert is attempted before the transaction is committed but no rows will be lost. For example:
CREATE PROC dbo.ReanmeMytableWithDate
AS
DECLARE @NewName sysname = 'mytable_' + CONVERT(nchar(8), SYSDATETIME(), 112);
SET XACT_ABORT ON;
BEGIN TRY;
BEGIN TRAN;
EXEC sp_rename N'dbo.mytable', @NewName;
CREATE TABLE dbo.mytable(
col1 int
);
COMMIT;
END TRY
BEGIN CATCH
THROW;
END CATCH;
GO
I don't know your use case for renaming tables like this but it seems table partitioning might be a better approach as @Damien_The_Unbeliever suggested. Although table partitioning previously required Enterprise Edition, the feature is available in Standard Edition beginning with SQL Server 2016 SP1 as well as Azure SQL Database.

Truncate Statement Taking Too much time

I have a table which has more than 1 million records. I have created a stored procedure to insert data into that table. Before inserting the data I need to truncate the table, but the truncate is taking too long.
I have read on some links that if a table is used by another person or some locks are applied then truncate takes too long time but here I am the only user and I have applied no locks on that.
Also no other transactions are open when I tried to truncate the table.
As my database is on SQL Azure I am not supposed to drop the indexes as it does not allow me to insert the data without an index.
Drop all the indexes from the table and then truncate. If you then need to insert data, insert it first and recreate the indexes afterwards.
When deleting from Azure you can get into all sorts of trouble, but truncate is almost always an issue of locking. If you can't fix that you can always do this trick when deleting from Azure.
declare @iDeleteCounter int =1
while @iDeleteCounter > 0
begin
begin transaction deletes;
with deleteTable as
(
select top 100000 * from mytable where mywhere
)
delete from deleteTable
commit transaction deletes
select @iDeleteCounter = count(1) from mytable where mywhere
print 'deleted 100000 from table'
end

Update ANSI_NULLS option in an existing table

In our database there is a table which is created with ANSI_NULLS OFF. Now we have created a view using this table. And we want to add a clustered index for this view.
While creating the clustered index, it shows an error saying the index can't be created because ANSI_NULLS is off for this particular table.
This table contains a large amount of data, so I want to change this option to ON without losing any data.
Is there any way to alter the table to modify this option? Please give your suggestions.
This was cross posted on Database Administrators so I might as well post my answer from there here too to help future searchers.
It can be done as a metadata only change (i.e. without migrating all the data to a new table) using ALTER TABLE ... SWITCH.
Example code below
/*Create table with option off*/
SET ANSI_NULLS OFF;
CREATE TABLE dbo.YourTable (X INT)
/*Add some data*/
INSERT INTO dbo.YourTable VALUES (1),(2),(3)
/*Confirm the bit is set to 0*/
SELECT uses_ansi_nulls, *
FROM sys.tables
WHERE object_id = object_id('dbo.YourTable')
GO
BEGIN TRY
BEGIN TRANSACTION;
/*Create new table with identical structure but option on*/
SET ANSI_NULLS ON;
CREATE TABLE dbo.YourTableNew (X INT)
/*Metadata only switch*/
ALTER TABLE dbo.YourTable SWITCH TO dbo.YourTableNew;
DROP TABLE dbo.YourTable;
EXECUTE sp_rename N'dbo.YourTableNew', N'YourTable','OBJECT';
/*Confirm the bit is set to 1*/
SELECT uses_ansi_nulls, *
FROM sys.tables
WHERE object_id = object_id('dbo.YourTable')
/*Data still there!*/
SELECT *
FROM dbo.YourTable
COMMIT TRANSACTION;
END TRY
BEGIN CATCH
IF XACT_STATE() <> 0
ROLLBACK TRANSACTION;
PRINT ERROR_MESSAGE();
END CATCH;
WARNING: when your table contains an IDENTITY column you need to reseed the IDENTITY value.
The SWITCH TO will reset the seed of the identity column and if you do not have a UNIQUE or PRIMARY KEY constraint on the identity (e.g. when using CLUSTERED COLUMNSTORE index in SQL 2014) you won't notice it right away.
You need to use DBCC CHECKIDENT ('dbo.YourTable', RESEED, [reseed value]) to correctly set the seed value again.
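A sketch of that reseed step, assuming the identity column has the hypothetical name IdCol and that the seed should continue from the highest value already in the table:

```sql
-- IdCol is a hypothetical name for the identity column.
DECLARE @maxId bigint;
SELECT @maxId = MAX(IdCol) FROM dbo.YourTable;

-- Report the current identity value without changing it.
DBCC CHECKIDENT ('dbo.YourTable', NORESEED);

-- Reseed so the next insert continues after the highest existing value.
-- Skipped for an empty table, where MAX returns NULL.
IF @maxId IS NOT NULL
    DBCC CHECKIDENT ('dbo.YourTable', RESEED, @maxId);
```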
Unfortunately, there is no way to do it without recreating the table. You need to create a new table with ANSI_NULLS ON and copy all the data into it.
It should be something like:
SET ANSI_NULLS ON;
CREATE TABLE new_MyTBL (
....
)
-- stop all processes changing your data at this point
SET IDENTITY_INSERT new_MyTBL ON
INSERT new_MyTBL (...) -- including IDENTITY field
SELECT ... -- including IDENTITY field
FROM MyTBL
SET IDENTITY_INSERT new_MyTBL OFF
-- alter/drop WITH SCHEMABINDING objects at this point
EXEC sp_rename @objname = 'MyTBL', @newname = 'old_MyTBL'
EXEC sp_rename @objname = 'new_MyTBL', @newname = 'MyTBL'
-- alter/create WITH SCHEMABINDING objects at this point
-- re-enable your processes
DROP TABLE old_MyTBL -- do that when you are sure that system works OK
If there are any dependent objects, they will work with the new table as soon as you rename it. But if some of them are WITH SCHEMABINDING you need to DROP and CREATE them manually.
I tried the SWITCH option recommended above but was unable to RESEED the identity. I could not find out why.
I used the following alternative approach instead:
1. Create a database snapshot for the database that contains the table.
2. Script the definition of the table you intend to update.
3. Delete the table that you intend to update (make sure the database snapshot was created successfully first).
4. In the script from step 2, change SET ANSI_NULLS from OFF to ON and run the updated script. The table is now recreated.
5. Populate data from the database snapshot into your table:
SET IDENTITY_INSERT TABLE_NAME ON
INSERT INTO TABLE_NAME (PK, col1, etc.)
SELECT PK, col1, etc.
FROM [Database_Snapshot].dbo.TABLE_NAME
SET IDENTITY_INSERT TABLE_NAME OFF
6. Migrate non-clustered indexes manually (get the script from the database snapshot).
Using the above:
I did not have to worry about constraints and keys since table/constraint names always remain the same (I do not need to rename anything)
I have a backup of my data (the snapshot) which I can rely on to double check that nothing is missing.
I do not need to reseed the identity
I realize deleting table may not always be straightforward if table is referenced in other tables. That was not the case for me in this instance.. I was lucky.

updlock vs for update cursor

I need to update a column of all rows of a table and I need to use UPDLOCK to do it.
For example:
UPDATE table (UPDLock)
SET column_name = '123'
Another alternative is to use a FOR UPDATE cursor and update each row. The advantage with the second approach is that the lock is not held till the end of the transaction and concurrent updates of the same rows can happen sooner. At the same time update cursors are said to have bad performance. Which is a better approach?
EDIT:
Assume the column is updated with a value that is derived from another column in the table. In other words, column_name = f(column_name_1)
You cannot give an UPDLOCK hint to a write operation, like UPDATE statement. It will be ignored, since all writes (INSERT/UPDATE/DELETE) take the same lock, an exclusive lock on the row being updated. You can quickly validate this yourself:
create table heap (a int);
go
insert into heap (a) values (1)
go
begin transaction
update heap
--with (UPDLOCK)
set a=2
select * from sys.dm_tran_locks
rollback
If you remove the comment -- on the with (UPDLOCK) you'll see that you get exactly the same locks (an X lock on the physical row). You can do the same experiment with a B-Tree instead of a heap:
create table btree (a int not null identity(1,1) primary key, b int)
go
insert into btree (b) values (1)
go
begin transaction
update btree
--with (UPDLOCK)
set b=2
select * from sys.dm_tran_locks
rollback
Again, the locks acquired will be identical with or w/o the hint (an exclusive lock on the row key).
Now back to your question, can this whole table update be done in batches? (since this is basically what you're asking). Yes, if the table has a primary key (to be precise what's required is a unique index to batch on, preferably the clustered index to avoid tipping point issues). Here is an example how:
create table btree (id int not null identity(1,1) primary key, b int, c int);
go
set nocount on;
insert into btree (b) values (rand()*1000);
go 1000
declare @id int = null, @rc int;
declare @inserted table (id int);
begin transaction;
-- first batch has no WHERE clause
with cte as (
select top(10) id, b, c
from btree
order by id)
update cte
set c = b+1
output INSERTED.id into @inserted (id);
set @rc = @@rowcount;
commit;
select @id = max(id) from @inserted;
delete from @inserted;
raiserror (N'Updated %d rows, up to id %d', 0,0,@rc, @id);
begin transaction;
while (1=1)
begin
-- update the next batch of 10 rows, now it has where clause
with cte as (
select top(10) id, b, c
from btree
where id > @id
order by id)
update cte
set c = b+1
output INSERTED.id into @inserted (id);
set @rc = @@rowcount;
if (0 = @rc)
break;
commit;
begin transaction;
select @id = max(id) from @inserted;
delete from @inserted;
raiserror (N'Updated %d rows, up to id %d', 0,0,@rc, @id);
end
commit
go
If your table doesn't have a unique clustered index then it becomes really tricky to do this, you would need to do the same thing a cursor has to do. While from a logical point of view the index is not required, not having it would cause each batch to do a whole-table-scan, which would be pretty much disastrous.
In case you wonder what happens if someone inserts a value behind the current @id, then the answer is very simple: exactly the same thing that would happen if someone inserts a value after the whole processing is complete.
Personally I think the single UPDATE will be much better. There are very few cases where a cursor will be better overall, regardless of concurrent activity. In fact the only one that comes to mind is a very complex running totals query - I don't think I've ever seen better overall performance from a cursor that is not read only, only SELECT queries. Of course, you have much better means of testing which is "a better approach" - you have your hardware, your schema, your data, and your usage patterns right in front of you. All you have to do is perform some tests.
That all said, what is the point in the first place of updating that column so that every single row has the same value? I suspect that if the value in that column has no bearing to the rest of the row, it can be stored elsewhere - perhaps a related table or a single-row table. Maybe the value in that column should be NULL (in which case you get it from the other table) unless it is overridden for a specific row. It seems to me like there is a better solution here than touching every single row in the table every time.

Copy one column to another for over a billion rows in SQL Server database

Database : SQL Server 2005
Problem : Copy values from one column to another column in the same table with a billion+ rows.
test_table (int id, bigint bigid)
Things tried 1: update query
update test_table set bigid = id
fills up the transaction log and rolls back due to lack of transaction log space.
Tried 2 - a procedure on following lines
set nocount on
declare @rowcount int, @rowsupdated bigint
set @rowcount = 1
set @rowsupdated = 0
set rowcount 500000
while @rowcount > 0
begin
update test_table set bigid = id where bigid is null
set @rowcount = @@rowcount
set @rowsupdated = @rowsupdated + @rowcount
end
print @rowsupdated
The above procedure starts slowing down as it proceeds.
Tried 3 - Creating a cursor for update.
generally discouraged in SQL Server documentation and this approach updates one row at a time which is too time consuming.
Is there an approach that can speed up the copying of values from one column to another. Basically I am looking for some 'magic' keyword or logic that will allow the update query to rip through the billion rows half a million at a time sequentially.
Any hints, pointers will be much appreciated.
I'm going to guess that you are closing in on the 2.1billion limit of an INT datatype on an artificial key for a column. Yes, that's a pain. Much easier to fix before the fact than after you've actually hit that limit and production is shut down while you are trying to fix it :)
Anyway, several of the ideas here will work. Let's talk about speed, efficiency, indexes, and log size, though.
Log Growth
The log blew up originally because it was trying to commit all 2b rows at once. The suggestions in other posts for "chunking it up" will work, but that may not totally resolve the log issue.
If the database is in SIMPLE mode, you'll be fine (the log will re-use itself after each batch). If the database is in FULL or BULK_LOGGED recovery mode, you'll have to run log backups frequently during the running of your operation so that SQL can re-use the log space. This might mean increasing the frequency of the backups during this time, or just monitoring the log usage while running.
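A minimal sketch of monitoring the log and backing it up between batches. The database name db1 and the backup path are placeholders, and the BACKUP LOG step only applies under the FULL or BULK_LOGGED recovery model:

```sql
-- Show log size and percent used for every database on the instance.
DBCC SQLPERF(LOGSPACE);

-- Back up the log so its space can be reused by the next batch.
-- Database name and file path are placeholders.
BACKUP LOG db1 TO DISK = N'D:\Backup\db1_log.trn';
```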
Indexes and Speed
ALL of the where bigid is null answers will slow down as the table is populated, because there is (presumably) no index on the new BIGID field. You could, (of course) just add an index on BIGID, but I'm not convinced that is the right answer.
The key (pun intended) is my assumption that the original ID field is probably the primary key, or the clustered index, or both. In that case, let's take advantage of that fact, and do a variation of Jess' idea:
declare @counter int
set @counter = 1
while @counter < 2000000000 --or whatever
begin
update test_table set bigid = id
where id between @counter and (@counter + 499999) --BETWEEN is inclusive
set @counter = @counter + 500000
end
This should be extremely fast, because of the existing indexes on ID.
The IS NULL check really wasn't necessary anyway, and neither is my (-1) on the interval. If we duplicate some rows between calls, that's not a big deal.
Use TOP in the UPDATE statement:
UPDATE TOP (@row_limit) dbo.test_table
SET bigid = id
WHERE bigid IS NULL
You could try to use something like SET ROWCOUNT and do batch updates:
SET ROWCOUNT 5000;
UPDATE dbo.test_table
SET bigid = id
WHERE bigid IS NULL
GO
and then repeat this as many times as you need to.
This way, you're avoiding the RBAR (row-by-agonizing-row) symptoms of cursors and while loops, and yet, you don't unnecessarily fill up your transaction log.
Of course, in between runs, you'd have to do backups (especially of your log) to keep its size within reasonable limits.
Is this a one time thing? If so, just do it by ranges:
declare @counter int
set @counter = 500000
while @counter < 2000000000 --or whatever your max id
begin
update test_table set bigid = id where id between (@counter - 500000) and @counter and bigid is null
set @counter = @counter + 500000
end
I didn't run this to try it, but if you can get it to update 500k at a time I think you're moving in the right direction.
set rowcount 500000
update tt1
set bigid = (SELECT tt2.id FROM test_table tt2 WHERE tt1.id = tt2.id)
from test_table tt1
where tt1.bigid IS NULL
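You can also try changing the recovery model to SIMPLE so that log space can be reused after each checkpoint. Note that the update itself is still fully logged, and switching away from FULL breaks the log backup chain until the next full or differential backup.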
ALTER DATABASE db1
SET RECOVERY SIMPLE
GO
update test_table
set bigid = id
GO
ALTER DATABASE db1
SET RECOVERY FULL
GO
First step, if there are any, would be to drop indexes before the operation. This is probably what is causing the speed degrade with time.
The other option, a little outside the box thinking...can you express the update in such a way that you could materialize the column values in a select? If you can do this then you could create what amounts to a NEW table using SELECT INTO which is a minimally logged operation (assuming in 2005 that you are set to a recovery model of SIMPLE or BULK LOGGED). This would be pretty fast and then you can drop the old table, rename this table to the old table name and recreate any indexes.
select id, CAST(id as bigint) bigid into test_table_temp from test_table
drop table test_table
exec sp_rename 'test_table_temp', 'test_table'
I second the
UPDATE TOP(X) statement
Also, if you're in a loop, add a WAITFOR delay or a COMMIT between batches, so other processes get some time to use the table instead of being blocked until all the updates are completed.
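Put together, a loop along those lines might look like the sketch below. The batch size and the one-second delay are arbitrary starting points, and the loop assumes id is never NULL (otherwise rows with a NULL id would keep bigid NULL and the loop would not terminate).

```sql
DECLARE @rows int;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    -- Each UPDATE commits on its own, so locks are released between batches.
    UPDATE TOP (10000) dbo.test_table
    SET bigid = id
    WHERE bigid IS NULL;

    -- Capture the row count immediately, before any other statement resets it.
    SET @rows = @@ROWCOUNT;

    -- Give other sessions a chance to use the table.
    WAITFOR DELAY '00:00:01';
END
```

As noted earlier in the thread, the WHERE bigid IS NULL filter may slow down as the table fills unless bigid is indexed, so a key-range loop may scale better on very large tables.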