I need to update a huge table (over 200 million records, 20+ columns).
I tried to update one column:
update Table1 set [Customer]=Null where [Customer]='-' or len([customer])=0
And it took over 2 hours.
I tried it on all columns and it's still running, for over 5 days now.
update Table1 set [Name]=Null where [Name]='-' or len([Name])=0
update Table1 set [Email]=Null where [Email]='-' or len([Email])=0
...
BTW, the table does not have any indexes or triggers, only data. The DB is not in use and the recovery model is simple.
Is there any more efficient way to update big tables?
I have a task on my dev databases to update some customer PII. Some of these tables have many rows and are quite large (on the order of 51 million+ rows and 25 GB in size). Every row in the table needs to be updated, so right now I'm just running a simple update statement without a where clause on a single column, and this query has, at the time of this post, been running for 35+ minutes. Is there a faster way to update large tables, or a better way to mask PII data?
Current query is just
update mytable
set mycolumn = 'some text'
I have millions of records in a table, with more accumulating each second. Whenever a record's status is marked COMPLETED, I want to move that record to a backup table.
I will run this job periodically, and I don't want the move operation to interrupt other operations in the DB.
I have tried the below query.
insert into Table_Bakup
select * from Table where batch_status ='COMPLETED'
and ID not in (select ID from Table_Bakup )
delete from Table
where ID in (select ID from Table_Bakup )
But the above query will hurt performance. Can anyone suggest how I can achieve this?
"Periodically" you mentioned sounds like a database job which is scheduled to run ... I don't know how often. Once a day? Every 2 hours? You decide.
Procedure might look like this:
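-- flag the rows that are complete right now (cb_to_be_moved is a new flag column)
update table set cb_to_be_moved = 1 where batch_status = 'COMPLETED';
-- copy exactly the flagged rows
insert into table_bakup (col1, col2, ...)
select col1, col2, ...
from table
where cb_to_be_moved = 1;
-- remove exactly the flagged rows
delete from table where cb_to_be_moved = 1;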
Why cb_to_be_moved? As there are millions of rows affected, other rows might be set to "completed" (and those transactions committed) between your insert and delete statements. The flag pins down exactly which rows are moved. Using IDs works too, but then you have to compare millions of rows in table with billions of rows in table_bakup, so this (cb_to_be_moved) approach might be faster.
Database trigger? Yes, but - imagine what happens when it fires for millions of rows ... database might choke. Or maybe not; try to test it.
I have a very large table people with 60M rows, indexed on id. I wish to populate a field newid for every record based on a lookup table id_conversion (1M rows), which contains id and newid and is indexed on id.
When I run
update people p set p.newid=(select l.newid from id_conversion l where l.id=p.id)
it runs for an hour or so and then I get an archiver error, ORA-00257.
Any suggestions for either running the update in sections or a better SQL command?
If your update statement hits every single row of the table, you are likely better off running a create table as select (CTAS) query. CTAS loads the new table through a direct-path insert, which generates almost no undo and, if the table is created nologging, very little redo. Logging the changes to all 60 million rows is likely the issue you're running into: ORA-00257 is raised when the archiver cannot write out the redo being generated. You can then drop the old table and rename the new table to the old table's name.
Something like:
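create table new_people as
select p.id,
       l.newid,
       p.col2,
       p.col3,
       p.col4,
       p.col5
from people p
-- left join keeps people rows with no match in id_conversion (newid is null),
-- which is what the original update would have produced
left join id_conversion l
  on p.id = l.id;
drop table people;
-- rebuild any constraints and indexes
-- from old people table to new people table
alter table new_people rename to people;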
For reference, read some of the tips here: http://www.dba-oracle.com/t_efficient_update_sql_dml_tips.htm
If you are basically creating a new table and not just updating some of the rows of a table it will likely prove the faster method.
I doubt you will be able to get this to run in seconds. Your query, as written, needs to update all 60 million rows.
My first advice is to add an index on id_conversion(id, newid), to make the subquery more efficient. If that doesn't help, then doing the update in batches might be the best way to go.
I should add: because you are updating all the rows, it might be faster to take the following approach:
Copy the data into a new table with the new values.
Truncate the original table.
Insert the new data into the old table.
Inserts into an empty table are generally faster than updating every row in place.
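The three steps above could be sketched in Oracle SQL roughly as follows. This is only a sketch under assumptions: people_stage is a hypothetical staging table name, and col2 and col3 stand in for the remaining columns of people. The left join keeps rows that have no match in id_conversion, as the original update would.

```sql
-- 1. Copy the data into a new table with the new values.
create table people_stage as
select p.id,
       l.newid,
       p.col2,
       p.col3
from people p
left join id_conversion l
  on p.id = l.id;

-- 2. Truncate the original table (DDL: commits implicitly, cannot be rolled back).
truncate table people;

-- 3. Insert the new data back; the APPEND hint requests a direct-path insert.
insert /*+ append */ into people (id, newid, col2, col3)
select id, newid, col2, col3
from people_stage;

commit;

drop table people_stage;
```

Because the truncate cannot be rolled back, it is worth checking the row count of the staging table before running step 2.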
In addition to the answers above, which will probably work better in this case, you should know about the MERGE statement
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_9016.htm
which updates one table based on another table and is often faster than an update driven by a select statement
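For the people / id_conversion case from the question, a MERGE might look like the sketch below. It assumes id is unique in id_conversion; if an id appears more than once, Oracle rejects the statement because it cannot get a stable set of rows.

```sql
merge into people p
using id_conversion l
on (p.id = l.id)
when matched then
  update set p.newid = l.newid;
```

Unlike the original correlated update, this only touches rows that have a match in id_conversion; rows without a match keep their current newid instead of being set to null.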
I have a few jobs that insert large data sets from a text file. The data is loaded via .NET's SqlBulkCopy.
Currently, I load all the data into a temp table and then insert it into the production table. This was an improvement over importing straight into production: the T-SQL insert from the temp table was a lot faster. Data is only loaded via this method; there are no other inserts or deletes.
However, I'm back to timeouts because of locks while the job is executing. The job consists of the following steps:
load data into temp table
start transaction
delete current and future dated rows
insert from temp table
commit
This happens once every hour. This portion takes 70 seconds. I need to get that to the smallest number possible.
The production table has about 20 million records and each import is about 70K rows. The table is not accessed at night, so I use this time to do all required maintenance (rebuild stats, index, etc.). Of the 70K rows added, ~4K are kept from day to day; that is, the table grows by 4K a day.
I'm thinking of a two-part solution:
The job will turn into a copy/rename job. I insert all current data into the temp table, create stats & index, rename tables, drop old table.
Create a history table to break out older data. The "current" table would have a rolling 6 months data, about 990K records. This would make the delete/insert table smaller and [hopefully] more performant. I would prefer not to do this; the table is well designed with the perfect indexes; queries are plenty fast. But eventually it might be required.
Edit: Using Windows 2003, SQL Server 2008
Thoughts? Other suggestions?
Well, one really sneaky way is to rename the current table to TableA and set up a second table, TableB, with the same structure and the same data. Then set up a view with the original table's name and exactly the fields in TableA. Now all your existing code will use the view instead of the current table. The view starts out looking at TableA.
In your load process, load to TableB. Then alter the view definition so that it looks at TableB. Your users are down for less than a second. Then load the same data to TableA, and store which table you should start with next time somewhere in a database table. Next time, load first to TableA, change the view to point to TableA, and then reload TableB.
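A rough T-SQL sketch of that swap, under assumptions: dbo.Imports is a hypothetical name for the existing production table, and col1 and col2 stand in for its real columns. GO is the batch separator used by SSMS and sqlcmd; it is needed because CREATE VIEW and ALTER VIEW must be the first statement in a batch.

```sql
-- One-time setup. dbo.Imports is a hypothetical name for the production table.
EXEC sp_rename 'dbo.Imports', 'TableA';
SELECT * INTO dbo.TableB FROM dbo.TableA;   -- copies data only; recreate indexes on TableB
GO
CREATE VIEW dbo.Imports AS
    SELECT col1, col2 FROM dbo.TableA;      -- col1, col2 stand in for the real columns
GO

-- Each load: fill the table the view is NOT reading, then repoint the view.
ALTER VIEW dbo.Imports AS
    SELECT col1, col2 FROM dbo.TableB;
GO
```

The next load would do the same with the roles of TableA and TableB reversed.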
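The answer should be that your queries that read from this table should use READ UNCOMMITTED, since your data load is the only place that changes data. Under READ UNCOMMITTED, SELECT queries do not take shared locks and are not blocked by the load's exclusive locks. The trade-off is dirty reads: a query that runs between the delete and the insert can see the table without the current rows.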
http://msdn.microsoft.com/en-us/library/ms173763.aspx
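As a small illustration, the isolation level can be set for the whole connection or per table with a hint. The table name dbo.Imports, the columns, and the predicate are hypothetical placeholders.

```sql
-- Applies to every statement that follows on this connection.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT col1, col2
FROM   dbo.Imports
WHERE  col1 = 42;

-- Per-table alternative with the same effect for this one query.
SELECT col1, col2
FROM   dbo.Imports WITH (NOLOCK)
WHERE  col1 = 42;
```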
You should look into using partitioned tables. The basic idea is that you can load all of your new data into a new table, then switch that table into the original as a new partition. The switch is a metadata-only operation, so it is orders of magnitude faster than inserting into the existing table.
Later on, you can merge multiple partitions into a single larger partition.
More information here: http://msdn.microsoft.com/en-us/library/ms191160.aspx
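A minimal sketch of the two operations, with all object names assumed for illustration (dbo.ImportStaging, dbo.Imports, the partition function pf_ImportDate, partition number 12, and the boundary value). The switch only succeeds if the staging table has the same structure and indexes as the target, lives on the same filegroup, carries a check constraint that matches the partition's range, and the target partition is empty. Note also that in SQL Server 2008, table partitioning is an Enterprise Edition feature.

```sql
-- Move the loaded staging table into partition 12 of the partitioned table.
ALTER TABLE dbo.ImportStaging
    SWITCH TO dbo.Imports PARTITION 12;

-- Later: remove a boundary so that two neighbouring partitions become one.
ALTER PARTITION FUNCTION pf_ImportDate()
    MERGE RANGE ('20100101');
```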
Get better hardware. Using 3 threads and 35,000-item batches, I import around 90,000 items per second using this approach.
Sorry, but at a point hardware decides insert speed. Important: SSD for the logs, mirrored ;)
Another trick you could use is have a delta table for the updates. You'd have 2 tables with a view over them to merge them. One table, TableBig, will hold the old data, the second table, TableDelta, will hold deltas that you add to rows in tableBig.
You create a view over them that adds them up. A simple example:
For instance, your old data in TableBig (20M rows, lots of indexes, etc.):
ID   Col1  Col2
123  5     10
124  6     20
And you want to update 1 row and add a new one so that afterwards the table looks like this:
ID   Col1  Col2
123  5     10    -- unchanged
124  8     30    -- this row is updated
125  9     60    -- this one added
Then in the TableDelta you insert these two rows:
ID   Col1  Col2
124  2     10    -- these will be added to give the right number
125  9     60    -- this row is new
and the view is
select ID,
       sum(col1) col1, -- the old value and delta added to give the correct value
       sum(col2) col2
from (
    select id, col1, col2 from TableBig
    union all
    select id, col1, col2 from TableDelta
) t -- SQL Server requires an alias on a derived table
group by ID
At night you can merge TableDelta into TableBig and index etc.
This way you can leave the big table alone completely during the day, and TableDelta will not have many rows, so overall query performance is very good. Getting the data from TableBig benefits from indexing, getting rows from TableDelta is no issue because it's small, and summing them up is cheap compared to looking for data on disk. Pumping data into TableDelta is very cheap because you can just insert at the end.
Regards Gert-Jan
Update for text columns:
You could try something similar, with two tables, but instead of adding, you would substitute. Like this:
select isnull(b.ID, o.ID) Id,
       isnull(o.Col1, b.Col1) Col1, -- the overwrite value wins when present
       isnull(o.Col2, b.Col2) Col2
from TableBig b
full join TableOverWrite o on b.ID = o.ID
The basic idea is the same: a big table with indexes and a small table for updates that doesn't need them.
Your solution seems sound to me. Another alternative to try would be to create the final data in a temp table, all outside of a transaction and then inside the transaction truncate the target table then load it from your temp table...something along those lines might be worth trying too.
I have an SQL Server 2005 database, and I tried putting indexes on the appropriate fields in order to speed up the DELETE of records from a table with millions of rows (big_table has only 3 columns), but now the DELETE execution time is even longer! (1 hour versus 13 min for example)
I have a relationship between two tables, and the column that I filter my DELETE by is in the other table. For example:
DELETE FROM big_table
WHERE big_table.id_product IN (
SELECT small_table.id_product FROM small_table
WHERE small_table.id_category = 1)
Btw, I've also tried:
DELETE FROM big_table
WHERE EXISTS
(SELECT 1 FROM small_table
WHERE small_table.id_product = big_table.id_product
AND small_table.id_category = 1)
and while it seems to run slightly faster than the first, it's still a lot slower with the indexes than without.
I created indexes on these fields:
big_table.id_product
small_table.id_product
small_table.id_category
My .ldf file grows a lot during the DELETE.
Why are my DELETE queries slower when I have indexes on my tables? I thought they were supposed to run faster.
UPDATE
Okay, consensus seems to be that indexes will slow down a huge DELETE because the index has to be updated. Although, I still don't understand why it can't DELETE all the rows at once and just update the index once at the end.
I was under the impression from some of my reading that indexes would speed up DELETE by making searches for fields in the WHERE clause faster.
Odetocode.com says:
"Indexes work just as well when searching for a record in DELETE and UPDATE commands as they do for SELECT statements."
But later in the article, it says that too many indexes can hurt performance.
Answers to Bob's questions:
55 million rows in table
42 million rows being deleted
Similar SELECT statement would not run (Exception of type 'System.OutOfMemoryException' was thrown)
I tried the following 2 queries:
SELECT * FROM big_table
WHERE big_table.id_product IN (
SELECT small_table.id_product FROM small_table
WHERE small_table.id_category = 1)
SELECT * FROM big_table
INNER JOIN small_table
ON small_table.id_product = big_table.id_product
WHERE small_table.id_category = 1
Both failed after running for 25 min with this error message from SQL Server 2005:
An error occurred while executing batch. Error message is: Exception of type 'System.OutOfMemoryException' was thrown.
The database server is an older dual core Xeon machine with 7.5 GB ram. It's my toy test database :) so it's not running anything else.
Do I need to do something special with my indexes after I CREATE them to make them work properly?
Indexes make lookups faster - like the index at the back of a book.
Operations that change the data (like a DELETE) are slower, as they involve manipulating the indexes. Consider the same index at the back of the book. You have more work to do if you add, remove or change pages because you have to also update the index.
I agree with Bob's comment above: if you are deleting large volumes of data from large tables, updating the indexes takes a while on top of deleting the data. It's the cost of doing business, though: every row you delete also has to be removed from each index.
With regard to the log file growth: if you aren't doing anything with your log files, you could switch to the simple recovery model, but I urge you to read up on the impact that might have on your IT department before you change.
If you need to do the delete in real time, it's often a good workaround to flag the data as inactive, either directly on the table or in another table, and exclude that data from queries; then come back later and delete the data when the users aren't staring at an hourglass. There is a second reason for doing it this way: if you are deleting lots of data out of the table (which is what I am supposing based on your log file issue), then you will likely want to run an index defrag to reorganise the index, and doing that out of hours is the way to go if you don't like users on the phone!
JohnB is deleting about 75% of the data. I think the following would have been a possible solution and probably one of the faster ones. Instead of deleting the data, create a new table and insert the data that you need to keep. Create the indexes on that new table after inserting the data. Now drop the old table and rename the new one to the same name as the old one.
The above of course assumes that sufficient disk space is available to temporarily store the duplicated data.
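For the big_table / small_table example from the question, the copy-and-swap could be sketched in T-SQL as below. The dbo schema, the name big_table_new, and the index name are assumptions. The NOT EXISTS predicate keeps exactly the rows that the question's DELETE would have left behind.

```sql
-- Copy only the rows to keep; SELECT ... INTO creates the new table.
SELECT b.*
INTO   dbo.big_table_new
FROM   dbo.big_table AS b
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.small_table AS s
                   WHERE  s.id_product = b.id_product
                     AND  s.id_category = 1);

-- Create the indexes only after the data is in place.
CREATE INDEX IX_big_table_id_product
    ON dbo.big_table_new (id_product);

-- Swap the tables.
DROP TABLE dbo.big_table;
EXEC sp_rename 'dbo.big_table_new', 'big_table';
```

SELECT ... INTO does not copy constraints, permissions, or triggers, so anything the old table had would need to be recreated after the rename.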
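Try something like this to avoid one huge delete. Deleting in batches keeps each transaction small, so the log space can be reused between batches (under the simple recovery model, or with log backups under full) instead of the log file growing: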
declare @continue bit
set @continue = 1
-- delete in batches of 10000 rows until no matching rows remain
while @continue = 1
begin
    set @continue = 0
    delete top (10000) u
    from <tablename> u WITH (READPAST)
    where <condition>
    if @@ROWCOUNT > 0
        set @continue = 1
end
You can also try the T-SQL extension to the DELETE syntax and check whether it improves performance:
DELETE FROM big_table
FROM big_table AS b
INNER JOIN small_table AS s ON (s.id_product = b.id_product)
WHERE s.id_category =1