Related
I have statements like this that are timing out:
DELETE FROM [table] WHERE [COL] IN ( '1', '2', '6', '12', '24', '7', '3', '5')
I tried doing one at a time like this:
DELETE FROM [table] WHERE [COL] IN ( '1' )
and so far it's at 22 minutes and still going.
The table has 260,000 rows in it and is four columns.
Does anyone have any ideas why this would be so slow and how to speed it up?
I do have a non-unique, non-clustered index on the [COL] that i'm doing the WHERE on.
I'm using SQL Server 2008 R2
update: I have no triggers on the table.
Things that can cause a delete to be slow:
deleting a lot of records
many indexes
missing indexes on foreign keys in child tables. (thank you to #CesarAlvaradoDiaz for mentioning this in the comments)
deadlocks and blocking
triggers
cascade delete (those ten parent records you are deleting could mean
millions of child records getting deleted)
Transaction log needing to grow
Many Foreign keys to check
So your choices are to find out what is blocking and fix it or run the deletes in off hours when they won't be interfering with the normal production load. You can run the delete in batches (useful if you have triggers, cascade delete, or a large number of records). You can drop and recreate the indexes (best if you can do that in off hours too).
Disable CONSTRAINT
ALTER TABLE [TableName] NOCHECK CONSTRAINT ALL;
Disable Index
ALTER INDEX ALL ON [TableName] DISABLE;
Rebuild Index
ALTER INDEX ALL ON [TableName] REBUILD;
Enable CONSTRAINT
ALTER TABLE [TableName] CHECK CONSTRAINT ALL;
Delete again
Deleting a lot of rows can be very slow. Try to delete a few at a time, like:
delete top (10) YourTable where col in ('1','2','3','4')
while ##rowcount > 0
begin
delete top (10) YourTable where col in ('1','2','3','4')
end
In my case the database statistics had become corrupt. The statement
delete from tablename where col1 = 'v1'
was taking 30 seconds even though there were no matching records but
delete from tablename where col1 = 'rubbish'
ran instantly
running
update statistics tablename
fixed the issue
If the table you are deleting from has BEFORE/AFTER DELETE triggers, something in there could be causing your delay.
Additionally, if you have foreign keys referencing that table, additional UPDATEs or DELETEs may be occurring.
Preventive Action
Check with the help of SQL Profiler for the root cause of this issue. There may be Triggers causing the delay in Execution. It can be anything. Don't forget to Select the Database Name and Object Name while Starting the Trace to exclude scanning unnecessary queries...
Database Name Filtering
Table/Stored Procedure/Trigger Name Filtering
Corrective Action
As you said your table contains 260,000 records...and IN Predicate contains six values. Now, each record is being search 260,000 times for each value in IN Predicate. Instead it should be the Inner Join like below...
Delete K From YourTable1 K
Inner Join YourTable2 T on T.id = K.id
Insert the IN Predicate values into a Temporary Table or Local Variable
It's possible that other tables have FK constraint to your [table].
So the DB needs to check these tables to maintain the referential integrity.
Even if you have all needed indexes corresponding these FKs, check their amount.
I had the situation when NHibernate incorrectly created duplicated FKs on the same columns, but with different names (which is allowed by SQL Server).
It has drastically slowed down running of the DELETE statement.
Check execution plan of this delete statement. Have a look if index seek is used. Also what is data type of col?
If you are using wrong data type, change update statement (like from '1' to 1 or N'1').
If index scan is used consider using some query hint..
If you're deleting all the records in the table rather than a select few it may be much faster to just drop and recreate the table.
Is [COL] really a character field that's holding numbers, or can you get rid of the single-quotes around the values? #Alex is right that IN is slower than =, so if you can do this, you'll be better off:
DELETE FROM [table] WHERE [COL] = '1'
But better still is using numbers rather than strings to find the rows (sql likes numbers):
DELETE FROM [table] WHERE [COL] = 1
Maybe try:
DELETE FROM [table] WHERE CAST([COL] AS INT) = 1
In either event, make sure you have an index on column [COL] to speed up the table scan.
I read this article it was really helpful for troubleshooting any kind of inconveniences
https://support.microsoft.com/en-us/kb/224453
this is a case of waitresource
KEY: 16:72057595075231744 (ab74b4daaf17)
-- First SQL Provider to find the SPID (Session ID)
-- Second Identify problem, check Status, Open_tran, Lastwaittype, waittype, and waittime
-- iMPORTANT Waitresource select * from sys.sysprocesses where spid = 57
select * from sys.databases where database_id=16
-- with Waitresource check this to obtain object id
select * from sys.partitions where hobt_id=72057595075231744
select * from sys.objects where object_id=2105058535
After inspecting an SSIS Package(due to a SQL Server executing commands really slow), that was set up in a client of ours about 5-4 years before the time of me writing this, I found out that there were the below tasks:
1) insert data from an XML file into a table called [Importbarcdes].
2) merge command on an another target table, using as source the above mentioned table.
3) "delete from [Importbarcodes]", to clear the table of the row that was inserted after the XML file was read by the task of the SSIS Package.
After a quick inspection all statements(SELECT, UPDATE, DELETE etc.) on the table ImportBarcodes that had only 1 row, took about 2 minutes to execute.
Extended Events showed a whole lot PAGEIOLATCH_EX wait notifications.
No indexes were present of the table and no triggers were registered.
Upon close inspection of the properties of the table, in the Storage Tab and under general section, the Data Space field showed more than 6 GIGABYTES of space allocated in pages.
What happened:
The query ran for a good portion of time each day for the last 4 years, inserting and deleting data in the table, leaving unused pagefiles behind with out freeing them up.
So, that was the main reason of the wait events that were captured by the Extended Events Session and the slowly executed commands upon the table.
Running ALTER TABLE ImportBarcodes REBUILD fixed the issue freeing up all the unused space. TRUNCATE TABLE ImportBarcodes did a similar thing, with the only difference of deleting all pagefiles and data.
Older topic but one still relevant.
Another issue occurs when an index has become fragmented to the extent of becoming more of a problem than a help. In such a case, the answer would be to rebuild or drop and recreate the index and issuing the delete statement again.
As an extension to Andomar's answer, above, I had a scenario where the first 700,000,000 records (of ~1.2 billion) processed very quickly, with chunks of 25,000 records processing per second (roughly). But, then it starting taking 15 minutes to do a batch of 25,000. I reduced the chunk size down to 5,000 records and it went back to its previous speed. I'm not certain what internal threshold I hit, but the fix was to reduce the number of records, further, to regain the speed.
open CMD and run this commands
NET STOP MSSQLSERVER
NET START MSSQLSERVER
this will restart the SQL Server instance.
try to run again after your delete command
I have this command in a batch script and run it from time to time if I'm encountering problems like this. A normal PC restart will not be the same so restarting the instance is the most effective way if you are encountering some issues with your sql server.
I've been having an odd problem with some queries that depend on a sub query. They run lightning fast, until I use a UNION statement in the sub query. Then they run endlessly, I've given after 10 minutes. The scenario I'm describing now isn't the original one I started with, but I think it cuts out a lot of possible problems yet yields the same problem. So even though it's a pointless query, bear with me!
I have a table:
tblUser - 100,000 rows
tblFavourites - 200,000 rows
If I execute:
SELECT COUNT(*)
FROM tblFavourites
WHERE userID NOT IN (SELECT uid FROM tblUser);
… then it runs in under a second. However, if I modify it so that the sub query has a UNION, it will run for at least 10 minutes (before I give up!)
SELECT COUNT(*)
FROM tblFavourites
WHERE userID NOT IN (SELECT uid FROM tblUser UNION SELECT uid FROM tblUser);
A pointless change, but it should yield the same result and I don't see why it should take any longer?
Putting the sub-query into a view and calling that instead has the same effect.
Any ideas why this would be? I'm using SQL Azure.
Problem solved. See my answer below.
UNION generate unique values, so the DBMS engine makes sorts.
You can use safely UNION ALL in this case.
UNION is really doing a DISTINCT on all fields in the combined data set. It filters out dupes in the final results.
Is Uid indexed? If not it may take a long time as the query engine:
Generates the first result set
Generates the second result set
Filters out all the dupes (which is half the records) in a hash table
If duplicates aren't a concern (and using IN means they won't be) then use UNION ALL which removes the expensive Sort/Filter step.
UNION's are usually implemented via temporary in-memory tables. You're essentially copying your tblUser two times into memory, WITH NO INDEX. Then every row in tblFavourites incur a complete table scan over 200,000 rows - that's 200Kx200K=40 billion double-row scans (because the query engine must get the uid from both table rows)
If your tblUser has an index on uid (which is definitely true because all tables in SQL Azure must have a clustered index), then each row in tblFavourites incurs a very fast index lookup, resulting in only 200Kxlog(100K) =200Kx17 = 200K row scans, each with 17 b-tree index comparisons (which is much faster than reading the uid from a row on a data page), so it should equate to roughly 200Kx(3-4) or 1 million double-row scans. I believe newer versions of SQL server may also build a temp hash table containing just the uid's, so essentially it gets down to 200K row scans (assuming hash table lookup to be trivial).
You should also generate your query plan to check.
Essentially, the non-UNION query runs around 500,000 times faster if tblUser has an index (should be on SQL Azure).
It turns out the problem was due to one of the indexes ... tblFavourites contained two foreign keys to the primary key (uid) in tblUser:
userId
otherUserId
both columns had the same definition and same indexes, but I discovered that swapping userId for otherUserId in the original query solved the problem.
I ran:
ALTER INDEX ALL ON tblFavourites REBUILD
... and the problem went away. The query now executes almost instantly.
I don't know too much about what goes on behind the scenes in Sql Server/Azure ... but I can only imagine that it was a damaged index or something? I update statistics frequently, but that had no effect.
Thanks!
---- UPDATE
The above was not fully correct. It did fix the problem for around 20 minutes, then it returned. I have been in touch with Microsoft support for several days and it seems the problem is to do with the tempDB. They are working on a solution at their end.
I just ran into this problem. I have about 1million rows to go through and then I realized that some of my IDs were in another table, so I unioned to get the same information in one "NOT EXISTS." I went from the query taking about 7 sec to processing only 5000 rows after a minute or so. This seemed to help. I absolutely hate the solution, but I've tried a multitude of things that all end up w/the same extremely slow execution plan. This one got me what I needed in about 18 sec.
DECLARE #PIDS TABLE ([PID] [INT] PRIMARY KEY)
INSERT INTO #PIDS SELECT DISTINCT [ID] FROM [STAGE_TABLE] WITH(NOLOCK)
INSERT INTO #PIDS SELECT DISTINCT [OTHERID] FROM [PRODUCTION_TABLE] WITH(NOLOCK)
WHERE NOT EXISTS(SELECT [PID] FROM #PIDS WHERE [PID] = [OTHERID]
SELECT (columns needed)
FROM [ORDER_HEADER] [OH] WITH(NOLOCK)
INNER JOIN #PIDS ON [OH].[SOME_ID] = [PID]
(And yes I tried "WHERE EXISTS IN..." for the final select... inner join was faster)
Please let me say again, I personaly feel this is really ugly, but I actually use this join twice in my proc, so it's going to save me time in the long run. Hope this helps.
Doesn't it make more sense to rephrase the questions from
"UserIds that aren't on the combined list of all the Ids that apper in this table and/or that table"
to
"UserIds that aren't on this table AND aren't on that table either
SELECT COUNT(*)
FROM tblFavourites
WHERE userID NOT IN (SELECT uid FROM tblUser)
AND userID NOT IN (SELECT uid FROM tblUser);
I have a table cats with 42,795,120 rows.
Apparently this is a lot of rows. So when I do:
/* owner_cats is a many-to-many join table */
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
the query times out :(
(edit: I need to increase my CommandTimeout value, default is only 30 seconds)
I can't use TRUNCATE TABLE cats because I don't want to blow away cats from other owners.
I'm using SQL Server 2005 with "Recovery model" set to "Simple."
So, I thought about doing something like this (executing this SQL from an application btw):
DELETE TOP (25) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
DELETE TOP(50) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
My question is: what is the threshold of the number of rows I can DELETE in SQL Server 2005?
Or, if my approach is not optimal, please suggest a better approach. Thanks.
This post didn't help me enough:
SQL Server Efficiently dropping a group of rows with millions and millions of rows
EDIT (8/6/2010):
Okay, I just realized after reading the above link again that I did not have indexes on these tables. Also, some of you have already pointed out that issue in the comments below. Keep in mind this is a fictitious schema, so even id_cat is not a PK, because in my real life schema, it's not a unique field.
I will put indexes on:
cats.id_cat
owner_cats.id_cat
owner_cats.id_owner
I guess I'm still getting the hang of this data warehousing, and obviously I need indexes on all the JOIN fields right?
However, it takes hours for me to do this batch load process. I'm already doing it as a SqlBulkCopy (in chunks, not 42 mil all at once). I have some indexes and PKs. I read the following posts which confirms my theory that the indexes are slowing down even a bulk copy:
SqlBulkCopy slow as molasses
What’s the fastest way to bulk insert a lot of data in SQL Server (C# client)
So I'm going to DROP my indexes before the copy and then re CREATE them when it's done.
Because of the long load times, it's going to take me awhile to test these suggestions. I'll report back with the results.
UPDATE (8/7/2010):
Tom suggested:
DELETE
FROM cats c
WHERE EXISTS (SELECT 1
FROM owner_cats o
WHERE o.id_cat = c.id_cat
AND o.id_owner = 1)
And still with no indexes, for 42 million rows, it took 13:21 min:sec versus 22:08 with the way described above. However, for 13 million rows, took him 2:13 versus 2:10 my old way. It's a neat idea, but I still need to use indexes!
Update (8/8/2010):
Something is terribly wrong! Now with the indexes on, my first delete query above took 1:9 hrs:min (yes an hour!) versus 22:08 min:sec and 13:21 min:sec versus 2:10 min:sec for 42 mil rows and 13 mil rows respectively. I'm going to try Tom's query with the indexes now, but this is heading in the wrong direction. Please help.
Update (8/9/2010):
Tom's delete took 1:06 hrs:min for 42 mil rows and 10:50 min:sec for 13 mil rows with indexes versus 13:21 min:sec and 2:13 min:sec respectively. Deletes are taking longer on my database when I use indexes by an order of magnitude! I think I know why, my database .mdf and .ldf grew from 3.5 GB to 40.6 GB during the first (42 mil) delete! What am I doing wrong?
Update (8/10/2010):
For lack of any other options, I have come up with what I feel is a lackluster solution (hopefully temporary):
Increase timeout for database connection to 1 hour (CommandTimeout=60000; default was 30 sec)
Use Tom's query: DELETE FROM WHERE EXISTS (SELECT 1 ...) because it performed a little faster
DROP all indexes and PKs before running delete statement (???)
Run DELETE statement
CREATE all indexes and PKs
Seems crazy, but at least it's faster than using TRUNCATE and starting over my load from the beginning with the first owner_id, because one of my owner_id takes 2:30 hrs:min to load versus 17:22 min:sec for the delete process I just described with 42 mil rows. (Note: if my load process throws an exception, I start over for that owner_id, but I don't want to blow away previous owner_id, so I don't want to TRUNCATE the owner_cats table, which is why I'm trying to use DELETE.)
Anymore help would still be appreciated :)
There is no practical threshold. It depends on what your command timeout is set to on your connection.
Keep in mind that the time it takes to delete all of these rows is contingent upon:
The time it takes to find the rows of interest
The time it takes to log the transaction in the transaction log
The time it takes to delete the index entries of interest
The time it takes to delete the actual rows of interest
The time it takes to wait for other processes to stop using the table so you can acquire what in this case will most likely be an exclusive table lock
The last point may often be the most significant. Do an sp_who2 command in another query window to make sure that there isn't lock contention going on, preventing your command from executing.
Improperly configured SQL Servers will do poorly at this type of query. Transaction logs which are too small and/or share the same disks as the data files will often incur severe performance penalties when working with large rows.
As for a solution, well, like all things, it depends. Is this something you intend to be doing often? Depending on how many rows you have left, the fastest way might be to rebuild the table as another name and then rename it and recreate its constraints, all inside a transaction. If this is just an ad-hoc thing, make sure your ADO CommandTimeout is set high enough and you can just bear the cost of this big delete.
If the delete will remove "a significant number" of rows from the table, this can be an alternative to a DELETE: put the records to keep somewhere else, truncate the original table, put back the 'keepers'. Something like:
SELECT *
INTO #cats_to_keep
FROM cats
WHERE cats.id_cat NOT IN ( -- note the NOT
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
TRUNCATE TABLE cats
INSERT INTO cats
SELECT * FROM #cats_to_keep
Have you tried no Subquery and use a join instead?
DELETE cats
FROM
cats c
INNER JOIN owner_cats oc
on c.id_cat = oc.id_cat
WHERE
id_owner =1
And if you have have you also tried different Join hints e.g.
DELETE cats
FROM
cats c
INNER HASH JOIN owner_cats oc
on c.id_cat = oc.id_cat
WHERE
id_owner =1
If you use an EXISTS rather than an IN, you should get much better performance. Try this:
DELETE
FROM cats c
WHERE EXISTS (SELECT 1
FROM owner_cats o
WHERE o.id_cat = c.id_cat
AND o.id_owner = 1)
There's no threshold as such - you can DELETE all the rows from any table given enough transaction log space - which is where your query is most likely falling over. If you're getting some results from your DELETE TOP (n) PERCENT FROM cats WHERE ... then you can wrap it in a loop as below:
SELECT 1
WHILE ##ROWCOUNT <> 0
BEGIN
DELETE TOP (somevalue) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
END
As others have mentioned, when you delete 42 million rows, the db has to log 42 million deletions against the database. Thus, the transaction log has to grow substantially. What you might try is to break up the delete into chunks. In the following query, I use the NTile ranking function to break up the rows into 100 buckets. If that is too slow, you can expand the number of buckets so that each delete is smaller. It will help tremendously if there is an index on owner_cats.id_owner, owner_cats.id_cats and cats.id_cat (which I assumed the primary key and numeric).
Declare #Cats Cursor
Declare #CatId int --assuming an integer PK here
Declare #Start int
Declare #End int
Declare #GroupCount int
Set #GroupCount = 100
Set #Cats = Cursor Fast_Forward For
With CatHerd As
(
Select cats.id_cat
, NTile(#GroupCount) Over ( Order By cats.id_cat ) As Grp
From cats
Join owner_cats
On owner_cats.id_cat = cats.id_cat
Where owner_cats.id_owner = 1
)
Select Grp, Min(id_cat) As MinCat, Max(id_cat) As MaxCat
From CatHerd
Group By Grp
Open #Cats
Fetch Next From #Cats Into #CatId, #Start, #End
While ##Fetch_Status = 0
Begin
Delete cats
Where id_cat Between #Start And #End
Fetch Next From #Cats Into #CatId, #Start, #End
End
Close #Cats
Deallocate #Cats
The notable catch with the above approach is that it is not transactional. Thus, if it fails on the 40th chunk, you will have deleted 40% of the rows and the other 60% will still exist.
Might be worth trying MERGE e.g.
MERGE INTO cats
USING owner_cats
ON cats.id_cat = owner_cats.id_cat
AND owner_cats.id_owner = 1
WHEN MATCHED THEN DELETE;
<Edit> (9/28/2011)
My answer performs basically the same way as Thomas' solution (Aug 6 '10). I missed it when I posted my answer because it he uses an actual CURSOR so I thought to myself "bad" because of the # of records involved. However, when I reread his answer just now I realize that the WAY he uses the cursor is actually "good". Very clever. I just voted up his answer and will probably use his approach in the future. If you don't understand why, take a look at it again. If you still can't see it, post a comment on this answer and I will come back and try to explain in detail. I decided to leave my answer because someone may have a DBA who refuses to let them use an actual CURSOR regardless of how "good" it is. :-)
</Edit>
I realize that this question is a year old but I recently had a similar situation. I was trying to do "bulk" updates to a large table with a join to a different table, also fairly large. The problem was that the join was resulting in so many "joined records" that it took too long to process and could have led to contention problems. Since this was a one-time update I came up with the following "hack." I created a WHILE LOOP that went through the table to be updated and picked 50,000 records to update at a time. It looked something like this:
DECLARE #RecId bigint
DECLARE #NumRecs bigint
SET #NumRecs = (SELECT MAX(Id) FROM [TableToUpdate])
SET #RecId = 1
WHILE #RecId < #NumRecs
BEGIN
UPDATE [TableToUpdate]
SET UpdatedOn = GETDATE(),
SomeColumn = t2.[ColumnInTable2]
FROM [TableToUpdate] t
INNER JOIN [Table2] t2 ON t2.Name = t.DBAName
AND ISNULL(t.PhoneNumber,'') = t2.PhoneNumber
AND ISNULL(t.FaxNumber, '') = t2.FaxNumber
LEFT JOIN [Address] d ON d.AddressId = t.DbaAddressId
AND ISNULL(d.Address1,'') = t2.DBAAddress1
AND ISNULL(d.[State],'') = t2.DBAState
AND ISNULL(d.PostalCode,'') = t2.DBAPostalCode
WHERE t.Id BETWEEN #RecId AND (#RecId + 49999)
SET #RecId = #RecId + 50000
END
Nothing fancy but it got the job done. Because it was only processing 50,000 records at a time, any locks that got created were short lived. Also, the optimizer realized that it did not have to do the entire table so it did a better job of picking an execution plan.
<Edit> (9/28/2011)
There is a HUGE caveat to the suggestion that has been mentioned here more than once and is posted all over the place around the web regarding copying the "good" records to a different table, doing a TRUNCATE (or DROP and reCREATE, or DROP and rename) and then repopulating the table.
You cannot do this if the table is the PK table in a PK-FK relationship (or other CONSTRAINT). Granted, you could DROP the relationship, do the clean up, and re-establish the relationship, but you would have to clean up the FK table, too. You can do that BEFORE re-establishing the relationship, which means more "down-time", or you can choose to not ENFORCE the CONSTRAINT on creation and clean up afterwards. I guess you could also clean up the FK table BEFORE you clean up the PK table. Bottom line is that you have to explicitly clean up the FK table, one way or the other.
My answer is a hybrid SET-based/quasi-CURSOR process. Another benefit of this method is that if the PK-FK relationship is setup to CASCADE DELETES you don't have to do the clean up I mention above because the server will take care of it for you. If your company/DBA discourage cascading deletes, you can ask that it be enabled only while this process is running and then disabled when it is finished. Depending on the permission levels of the account that runs the clean up, the ALTER statements to enable/disable cascading deletes can be tacked onto the beginning and the end of the SQL statement.
</Edit>
Bill Karwin's answer to another question applies to my situation also:
"If your DELETE is intended to eliminate a great majority of the rows in that table, one thing that people often do is copy just the rows you want to keep to a duplicate table, and then use DROP TABLE or TRUNCATE to wipe out the original table much more quickly."
Matt in this answer says it this way:
"If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename."
ammoQ in this answer (from the same question) recommends (paraphrased):
issue a table lock when deleting a large amount of rows
put indexes on any foreign key columns
I am trying a simple UPDATE table SET column1 = 0 on a table with about 3 million rows on Postegres 8.4 but it is taking forever to finish. It has been running for more than 10 min.
Before, I tried to run a VACUUM and ANALYZE commands on that table and I also tried to create some indexes (although I doubt this will make any difference in this case) but none seems to help.
Any other ideas?
Update:
This is the table structure:
CREATE TABLE myTable
(
id bigserial NOT NULL,
title text,
description text,
link text,
"type" character varying(255),
generalFreq real,
generalWeight real,
author_id bigint,
status_id bigint,
CONSTRAINT resources_pkey PRIMARY KEY (id),
CONSTRAINT author_pkey FOREIGN KEY (author_id)
REFERENCES users (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT c_unique_status_id UNIQUE (status_id)
);
I am trying to run UPDATE myTable SET generalFreq = 0;
I have to update tables of 1 or 2 billion rows with various values for each rows. Each run makes ~100 millions changes (10%).
My first try was to group them in transaction of 300K updates directly on a specific partition as Postgresql not always optimize prepared queries if you use partitions.
Transactions of bunch of "UPDATE myTable SET myField=value WHERE
myId=id" Gives 1,500 updates/sec. which means each run would
take at least 18 hours.
HOT updates solution as described here with FILLFACTOR=50. Gives
1,600 updates/sec. I use SSD's so it's a costly improvement as it
doubles the storage size.
Insert in a temporary table of updated value and merge them after
with UPDATE...FROM Gives 18,000 updates/sec. if I do a VACUUM
for each partition; 100,000 up/s otherwise. Cooool.Here is the
sequence of operations:
CREATE TEMP TABLE tempTable (id BIGINT NOT NULL, field(s) to be updated,
CONSTRAINT tempTable_pkey PRIMARY KEY (id));
Accumulate a bunch of updates in a buffer depending of available RAM
When it's filled, or need to change of table/partition, or completed:
COPY tempTable FROM buffer;
UPDATE myTable a SET field(s)=value(s) FROM tempTable b WHERE a.id=b.id;
COMMIT;
TRUNCATE TABLE tempTable;
VACUUM FULL ANALYZE myTable;
That means a run now takes 1.5h instead of 18h for 100 millions updates, vacuum included. To save time, it's not necessary to make a vacuum FULL at the end but even a fast regular vacuum is usefull to control your transaction ID on the database and not get unwanted autovacuum during rush hours.
Take a look at this answer: PostgreSQL slow on a large table with arrays and lots of updates
First start with a better FILLFACTOR, do a VACUUM FULL to force table rewrite and check the HOT-updates after your UPDATE-query:
SELECT n_tup_hot_upd, * FROM pg_stat_user_tables WHERE relname = 'myTable';
HOT updates are much faster when you have a lot of records to update. More information about HOT can be found in this article.
Ps. You need version 8.3 or better.
After waiting 35 min. for my UPDATE query to finish (and still didn't) I decided to try something different. So what I did was a command:
CREATE TABLE table2 AS
SELECT
all the fields of table1 except the one I wanted to update, 0 as theFieldToUpdate
from myTable
Then add indexes, then drop the old table and rename the new one to take its place. That took only 1.7 min. to process plus some extra time to recreate the indexes and constraints. But it did help! :)
Of course that did work only because nobody else was using the database. I would need to lock the table first if this was in a production environment.
Today I've spent many hours with similar issue. I've found a solution: drop all the constraints/indices before the update. No matter whether the column being updated is indexed or not, it seems like psql updates all the indices for all the updated rows. After the update is finished, add the constraints/indices back.
Try this (note that generalFreq starts as type REAL, and stays the same):
ALTER TABLE myTable ALTER COLUMN generalFreq TYPE REAL USING 0;
This will rewrite the table, similar to a DROP + CREATE, and rebuild all indices. But all in one command. Much faster (about 2x) and you don't have to deal with dependencies and recreating indexes and other stuff, though it does lock the table (access exclusive--i.e. full lock) for the duration. Or maybe that's what you want if you want everything else to queue up behind it. If you aren't updating "too many" rows this way is slower than just an update.
How are you running it? If you are looping each row and performing an update statement, you are running potentially millions of individual updates which is why it will perform incredibly slowly.
If you are running a single update statement for all records in one statement it would run a lot faster, and if this process is slow then it's probably down to your hardware more than anything else. 3 million is a lot of records.
The first thing I'd suggest (from https://dba.stackexchange.com/questions/118178/does-updating-a-row-with-the-same-value-actually-update-the-row) is to only update rows that "need" it, ex:
UPDATE myTable SET generalFreq = 0 where generalFreq != 0;
(might also need an index on generalFreq). Then you'll update fewer rows. Though not if the values are all non zero already, but updating fewer rows "can help" since otherwise it updates them and all indexes regardless of whether the value changed or not.
Another option: if the stars align in terms of defaults and not-null constraints, you can drop the old column and create another by just adjusting metadata, instant time.
In my tests I noticed that a big update, more than 200 000 rows, is slower than 2 updates of 100 000 rows, even with a temporary table.
My solution is to loop, in each loop create a temporary table of 200 000 rows, in this table I compute my values, then update my main table with the new values aso...
Every 2 000 000 rows, I manually "VACUUM ANALYSE mytable", I noticed that the auto vacuum doesn't do its job for such updates.
I need to update more than 1B+ rows on PostgreSQL table which contains some indexes. I am working on PostgreSQL 12 + SQLAlchemy + Python.
Inspired by the answers here, I wrote a temp table and UPDATE... FROM based updater to see if it makes a difference. The temp table is then fed from CSV generated by Python, and uploaded over the normal SQL client connection.
The speed-up naive approach using SQLAlchemy's bulk_update_mappings is 4x - 5x. Not an order of magnitude, but still considerable and in my case this means 1 day, not 1 week, of a batch job.
Below is the relevant Python code that does CREATE TEMPORARY TABLE, COPY FROM and UPDATE FROM. See the full example in this gist.
def bulk_load_psql_using_temp_table(
dbsession: Session,
data_as_dicts: List[dict],
):
"""Bulk update columns in PostgreSQL faster using temp table.
Works around speed issues on `bulk_update_mapping()` and PostgreSQL.
Your mileage and speed may vary, but it is going to be faster.
The observation was 3x ... 4x faster when doing UPDATEs
where one of the columns is indexed.
Contains hardcoded temp table creation and UPDATE FROM statements.
In our case we are bulk updating three columns.
- Create a temp table - if not created before
- Filling it from the in-memory CSV using COPY FROM
- Then performing UPDATE ... FROM on the actual table from the temp table
- Between the update chunks, clear the temp table using TRUNCATE
Why is it faster? I have did not get a clear answer from the sources I wa reading.
At least there should be
less data uploaded from the client to the server,
as CSV loading is more compact than bulk updates.
Further reading
- `About PSQL temp tables <https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-temporary-table/>`_
- `Naive bulk_update_mapping approach <https://stackoverflow.com/questions/36272316/using-bulk-update-mappings-in-sqlalchemy-to-update-multiple-rows-with-different>`_
- `Discussion on UPDATE ... FROM + temp table approach <https://stackoverflow.com/questions/3361291/slow-simple-update-query-on-postgresql-database-with-3-million-rows/24811058#24811058>_`.
:dbsession:
SQLAlchemy session.
Note that we open a separate connection for the bulk update.
:param data_as_dicts:
In bound data as it would be given to bulk_update_mapping
"""
# mem table created in sql
temp_table_name = "temp_bulk_temp_loader"
# the real table of which data we are filling
real_table_name = "swap"
# colums we need to copy
columns = ["id", "sync_event_id", "sync_reserve0", "sync_reserve1"]
# how our CSV fields are separated
delim = ";"
# Increase temp buffer size for updates
temp_buffer_size = "3000MB"
# Dump data to a local mem buffer using CSV writer.
# No header - this is specifically addressed in copy_from()
out = StringIO()
writer = csv.DictWriter(out, fieldnames=columns, delimiter=delim)
writer.writerows(data_as_dicts)
# Update data in alternative raw connection
engine = dbsession.bind
conn = engine.connect()
try:
# No rollbacks
conn.execution_options(isolation_level="AUTOCOMMIT")
# See https://blog.codacy.com/how-to-update-large-tables-in-postgresql/
conn.execute(f"""SET temp_buffers = "{temp_buffer_size}";""")
# Temp table is dropped at the end of the session
# https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-temporary-table/
# This must match data_as_dicts structure.
sql = f"""
CREATE TEMP TABLE IF NOT EXISTS {temp_table_name}
(
id int,
sync_event_id int,
sync_reserve0 bytea,
sync_reserve1 bytea
);
"""
conn.execute(sql)
# Clean any pending data in the temp table
# between update chunks.
# TODO: Not sure why this does not clear itself at conn.close()
# as I would expect based on the documentation.
sql = f"TRUNCATE {temp_table_name}"
conn.execute(sql)
# Load data from CSV to the temp table
# https://www.psycopg.org/docs/cursor.html
cursor = conn.connection.cursor()
out.seek(0)
cursor.copy_from(out, temp_table_name, sep=delim, columns=columns)
# Fill real table from the temp table
# This copies values from the temp table using
# UPDATE...FROM and matching by the row id.
sql = f"""
UPDATE {real_table_name}
SET
sync_event_id=b.sync_event_id,
sync_reserve0=b.sync_reserve0,
sync_reserve1=b.sync_reserve1
FROM {temp_table_name} AS b
WHERE {real_table_name}.id=b.id;
"""
res = conn.execute(sql)
logger.debug("Updated %d rows", res.rowcount)
finally:
conn.close()
I do an update millions rows incrementally in batches with minimal locks by one procedure loop_execute(). There is a progress of execution in percent and a prediction of the end work time!
try
UPDATE myTable SET generalFreq = 0.0;
Maybe it is a casting issue
How I can Delete 1.5 Millions Rows From SQL Server 2000, And how much time it will take to complete this task.
I dont want to delete all records from table.... I just want to delete all records which are fullfilling WHERE condition.
EDITED from a comment to an answer below.
"I fire the same query i.e. delete from table_name with Where Clause... Is it possible to Disable Indexing at the running Query, becuase Query is going on from past 20 hr.. Also help me out how i can disable Indexing.."
If (and only if) you want to delete all of the records in a table, you can use DROP TABLE or TRUNCATE TABLE.
DELETE removes one record at a time and records an entry in the transaction log for each deleted row.
TRUNCATE TABLE is much faster because it doesn't record the activity in the transaction log. It removes all rows from a table, but the table structure & its columns, constraints, indexes and so on remain. DROP TABLE would remove those.
Use caution if you decide to TRUNCATE. It's irreversible (unless you have a backup).
create a second table, inserting all rows from the first that you don't want deleting.
delete the first table
rename the second table to be the first
(or a variation on the above)
This can often be quicker than doing a delete of selected records from a big table.
You may want to try deleting in batches too. I just tested this on a table I have and the delete operation went from 13 seconds to 3 seconds.
While Exists(Select * From YourTable Where YourCondition = True)
Delete Top (100000)
From YourTable
Where YourCondition = True
I don't think you can use the TOP predicate if you are running SQL2000, but it works with SQL2005 and up. If you are using SQL2000, then you can use this syntax instead:
Set RowCount 100000
While Exists(Select * From YourTable Where YourCondition = True)
Delete
From YourTable
Where YourCondition = True
DELETE FROM table WHERE a=b;
When deleting that many rows you may want to disable the indexes so they don't get updated on every delete. Rewriting the indexes on every deletion will significantly slow down the whole process.
You'll want to disable these indexes before beginning your deletion or else there may be table locks already in place.
--Disable Index
ALTER INDEX [IX_MyIndex] ON MyTable.MyColumn DISABLE
--Enable Index
ALTER INDEX [IX_MyIndex] ON MyTable.MyColumn REBUILD
If you wish to remove all entries in a table you can use TRUNCATE.
Does the table you are deleting from have multiple foreign keys, or cascaded deletes or triggers? All of these will impact performance.
Depending on what you want to do and the transactional integrity, can you delete things in small batches e.g. if you are trying to delete 1.5 million records that is 1 years worth of data, can you do it 1 week at a time?
Delete from table where condition for those 1.5 million rows
The time depends.
On Oracle it is also possible to use
truncate table <table>
Not sure if that is standard SQL or available in SQL Server. It will however clear the whole table - but then it is quicker than "delete from " (it will also conduct a commit).
TRUNCATE will also ignore any referential integrity or triggers on the table. DELETE FROM ... WHERE will respect both. The time will depend on the indexing of your condition columns, your hardware, and any additional system load.
The delete SQL is exactly the same as a normal SQL delete
delete from table where [your condition ]
However if your worried about time then I'll assume your question is a little deeper than this. If your table is has a significant number of non-clustered indexes then in some circumstances it may be faster to drop all these indexes first and rebuild after the delete. This is unusual but in cases where your straightforward delete is vulnerable to timeout issues it may be helpful
CREATE TABLE new_table as select <data you want to keep> from old_table;
index new_table
grant on new table
add constraints on new_table
etc on new_table
drop table old_table
rename new_table to old_table;