SQL Server 2005 efficient delete

My issue at hand is that I need to remove about 60M records from a table without causing deadlocks with other processes that use this table. At this point I'm almost done removing the records using a while loop that only processes about 1M records at a time; however, it's taken all day.
Q1: What is the optimal way to remove large quantities of data from a table in MS SQL Server 2005, keeping the table online and with minimal impact on other resources that need to use this table?
Q2: Is there a way to implement individual row locking (rather than table locking) in SQL Server like they have in Oracle? (Note: answering this may answer Q1.)
A2: So, as @Remus Rusanu informed me, there is a way to do row-level locking with a delete.
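For reference, a minimal sketch of what that can look like (the table name, date column, and batch size are placeholders, not the actual schema): the ROWLOCK hint asks for row locks, and small batches make escalation to a table lock less likely, though the hint is only a request.
WHILE 1 = 1
BEGIN
    -- Delete in small batches so each transaction stays short; ROWLOCK is only a hint
    -- and SQL Server may still escalate if a single statement touches too many rows.
    DELETE TOP (5000) FROM dbo.BigTable WITH (ROWLOCK)
    WHERE CreatedDate <= '20080630';

    IF @@ROWCOUNT = 0 BREAK;
END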

See this thread; the original poster actually did some tests and posted the most efficient method. An MVP originally chimed in with an option to insert the data you want to retain into a temp table, truncate the original table, and then reinsert.

I just did something like that recently. I simply created a SQL Server job that ran every 10 minutes, deleting a million rows. Code follows:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
DELETE TOP (1000000) FROM BIG_TABLE WHERE CreatedDate <= '20080630';
The table in question had about 900 million rows to start with. I didn't notice any significant performance issues.

The most efficient way is to use partition switching; see Transferring Data Efficiently by Using Partition Switching. The drawback is that it requires planning ahead in how the partitions are deployed.
If partition switching is not available, then the answer depends on the actual table schema. You'd better post the actual schema (including all indexes and, most importantly, the clustered key definition) and the criteria that qualify the delete candidates.
As for Q2, SQL Server has had row-level locking since the mid-'90s, so I don't know what you're actually asking.
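Purely as a sketch, assuming a hypothetical dbo.BigTable that is already partitioned by date and an empty dbo.BigTable_Archive with an identical structure on the same filegroup, the switch itself is a metadata-only operation:
-- Instantly moves every row of partition 1 into the archive table without copying data,
-- so the source table stays online.
ALTER TABLE dbo.BigTable SWITCH PARTITION 1 TO dbo.BigTable_Archive;

-- The switched-out rows can then be dropped or processed without touching dbo.BigTable.
TRUNCATE TABLE dbo.BigTable_Archive;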

Related

Lock SQL Table/Row and using NOLOCK/READPAST [duplicate]

I am the end-user of a heavily updated Microsoft SQL Server DB containing dozens of tables with hundreds of millions of rows each.
A banking DB is a good example of what I am working with, with the exception that in my DB UPDATE statements are rarely used and INSERT statements are used frequently (once a row has entered a table, it rarely changes).
I, personally, am not using any UPDATE/INSERT statements, only SELECT statements (with complex WHERE/JOIN/CROSS/GROUP clauses).
I have some questions about locking and using NOLOCK/READPAST.
1. How can I know if a query I am using is locking only a row or the entire table?
For example, I noticed this query didn't block other users from inserting new data into the table:
SELECT *
FROM Table
while this query did:
SELECT COUNT(Date)
FROM Table
These are of course just examples, not the actual full queries I am using.
As I mentioned, rows rarely change, so locking a row isn't a concern for me, but locking a table is highly concerning.
2. I would like to know the risks of using NOLOCK/READPAST in my queries (to address any concern I might have about blocking updates to a table).
I searched about it a lot but could not find a full answer.
I don't care if by using NOLOCK/READPAST I might get stale data (again, the data rarely changes) or might miss some newly added data.
I did read in a couple of places that using NOLOCK might cause duplicate or corrupted data to be read; this is a problem for me.
3. What exactly is the difference between READPAST and NOLOCK? Which one is "safer" regarding the concerns mentioned above?
Thank you.
This is highly dependent on your server's settings. Generally speaking, you want to lock records even when you are just reading them, because you don't want data to change while you are reading it. This doesn't just affect updated records, but also inserts. You can learn more about read committed and snapshot isolation here:
https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/snapshot-isolation-in-sql-server
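For illustration only, this is roughly how row-versioning reads are switched on at the database level (the database name is hypothetical, and the change stores row versions in tempdb, so test it first):
-- Hypothetical database name; the ALTER needs no other active connections to take effect.
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;
-- Ordinary READ COMMITTED SELECTs then read the last committed version of each row
-- instead of blocking behind writers, without needing NOLOCK or READPAST hints.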
Both NOLOCK and READPAST should be avoided at all costs. There is a very small handful of scenarios where these make sense, but they are exceedingly rare. You are better off optimizing your query to perform better and reduce the number of records being locked and the time the records spend locked. One case where I can see NOLOCK being useful would be a log table that only has inserts, where your query doesn't join the data to other tables, AND a dirty record wouldn't cause problems.
NOLOCK doesn't lock records that it reads. The risk here is that records you are reading can literally change mid-read. This means you can begin reading a record and get some column values from before an update and some from after the update. If another transaction rolls back, you could end up reading records that were never actually committed to the database.
READPAST skips any rows that are locked. If another query runs and its criteria cause rows 1-25 of 100 to be locked while you are querying the same data, you are only going to see records 26-100. To your query, locked rows don't exist.
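A minimal sketch of how each hint is written, against a hypothetical table (not one from the question):
-- Dirty read: takes no shared locks and ignores exclusive locks, so it may return
-- uncommitted, duplicated, or missing rows while data is changing underneath it.
SELECT COUNT(*) FROM dbo.AccountTransactions WITH (NOLOCK);

-- Skips locked rows entirely: the count simply excludes any rows another
-- transaction currently holds incompatible locks on.
SELECT COUNT(*) FROM dbo.AccountTransactions WITH (READPAST);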
Great article with the details:
https://www.mssqltips.com/sqlservertip/4468/compare-sql-server-nolock-and-readpast-table-hints/
You would be far better served by spending time learning to optimize your queries to reduce the number of records they need to lock, and improving the performance so that the amount of time those locks exist is kept to a minimum.

Create a Historical Auditing Table

Currently we have an AuditLog table that holds over 11M records. Regardless of the indexes and statistics, any query referencing this table takes a long time. Most reports don't check for audit records past a year, but we would still like to keep these records. What's the best way to handle this?
I was thinking of keeping the AuditLog table to hold all records less than or equal to a year old, then moving any records greater than a year old to an AuditLogHistory table. Maybe just running a batch file every night to move these records over and then updating the indexes and statistics of the AuditLog table. Is this an okay way to complete this task? Or how else should I be storing older records?
The records brought back from the AuditLog table hit a linked server and check 6 different DBs to see if a certain member exists in them, based on a condition. I don't have access to make any changes to the linked-server DBs, so I can only optimize what I have, which is the AuditLog. Hitting the linked-server DBs accounts for over 90% of the query's cost, so I'm just trying to limit what I can.
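A minimal sketch of that nightly move, assuming AuditLogHistory has exactly the same columns as AuditLog and that a CreatedDate column drives the age check (both are assumptions, not details from the question):
WHILE 1 = 1
BEGIN
    -- Move rows older than one year in small batches; OUTPUT ... INTO writes each
    -- deleted row into the history table within the same statement.
    DELETE TOP (10000) FROM dbo.AuditLog
    OUTPUT DELETED.* INTO dbo.AuditLogHistory
    WHERE CreatedDate < DATEADD(YEAR, -1, GETDATE());

    IF @@ROWCOUNT = 0 BREAK;
END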
First, I find it hard to believe that you cannot optimize a query on a table with 11 million records. You should investigate the indexes that you have relative to the queries that are frequently run.
In any case, the answer to your question is "partitioning". You would partition by the date column and be sure to include this condition in all queries. That will reduce the amount of data and probably speed up the processing.
The documentation is a good place to start for learning about partitioning.
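A rough sketch of what date-based partitioning looks like in T-SQL; the boundary values, filegroup, and CreatedDate column are assumptions for illustration:
-- Yearly boundaries; RANGE RIGHT keeps each boundary date with the partition to its right.
CREATE PARTITION FUNCTION pf_AuditByYear (datetime)
    AS RANGE RIGHT FOR VALUES ('20130101', '20140101', '20150101');

-- Map every partition to PRIMARY for simplicity; separate filegroups are also possible.
CREATE PARTITION SCHEME ps_AuditByYear
    AS PARTITION pf_AuditByYear ALL TO ([PRIMARY]);

-- Building the clustered index on the scheme partitions the table itself, so queries
-- that filter on CreatedDate only touch the relevant partitions.
CREATE CLUSTERED INDEX CIX_AuditLog_CreatedDate
    ON dbo.AuditLog (CreatedDate)
    ON ps_AuditByYear (CreatedDate);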

Would a table lock speed up an update statement in Oracle 10g enterprise?

We have a fairly wide table BaseData with some 33 million rows in it. Then we have an update query that joins it to several other tables containing all kinds of parameters, some functions are applied, there is a GROUP BY on the original Id, and then the results are written back to the BaseData table in a few columns.
This process is very slow, so I'm looking into ways of speeding it up. Most of my experience is with SQL Server, so I don't know this kind of Oracle internals yet.
One thing I suspect is that during the update, Oracle creates versions of every row so any other readers can read the unaffected row. This, however, takes up considerable resources. Is there any way to have the update take a write lock on the table so it wouldn't create versions of every row?
Any other tips you guys have for large updates? We already broke it down into batches. Each batch is in a separate partition of the table, and then several updates are run in parallel. But it's still all much too slow.
The short answer is that no, in Oracle, taking an exclusive lock on a table won't prevent other sessions from reading it, or having to incur the work of generating a read-consistent view of the data. Similarly, in Oracle, you can't tell a session to enable "dirty reads."
Well, the first question is what's slow - is it all the work of joining and applying functions, or is it the writing back? How does a SELECT my_updated_resultset FROM BASEDATA JOIN... perform compared to your update statement? Have you verified that there's contention between the readers of BaseData and the update process? Also, is it too slow for the business, or just slower than you think it should be?
Another option to consider is to use partition exchange to perform your updates. The high level concept would be:
CREATE TABLE BASEDATA_XCHG as SELECT * FROM BASEDATA WHERE 1 = 0;
INSERT /*+ append */ INTO BASEDATA_XCHG SELECT my_updated_resultset FROM BASEDATA PARTITION (ONLY_ONE_PARTITION) JOIN...
Create all the required indexes and constraints on the BASEDATA_XCHG table.
ALTER TABLE BASEDATA EXCHANGE PARTITION ONLY_ONE_PARTITION WITH TABLE BASEDATA_XCHG;
If you're updating most of the rows in a partition of BASEDATA table, don't update them - create a new table and exchange it out. Tim Gorman has an excellent paper called "Scaling to Infinity" that covers this concept in greater depth; you may wish to check it out.
In addition to Adam's answer:
Run an EXPLAIN PLAN on your update statement and check the execution plan.
Chances are that adding indexes to support your joins and WHERE conditions can speed up the query.
Oracle uses undo segments for read consistency (along with SCNs, read more here)
I'm assuming these large batch processes are running on a staging area and not a "prod" instance that is being used by a lot of various processes. If you are updating 25% or more (rough figures) of some big table, it may be better to do a CTAS (create table as select...) than attempting updates. Your CTAS would contain the update logic for the new table. Once done, add indexes/grants/etc on new table and rename new to old. You can also add a parallel hint and nologging on the CTAS to potentially speed things up even more.
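A rough sketch of that CTAS approach in Oracle (the table, column, and join names are invented for illustration; the real "update logic" lives in the SELECT):
-- Build the replacement table with the new values already applied, in parallel and without redo logging.
CREATE TABLE BASEDATA_NEW PARALLEL 8 NOLOGGING AS
SELECT b.id,
       b.unchanged_col,
       p.new_value AS updated_col
FROM   BASEDATA b
JOIN   PARAM_TABLE p ON p.id = b.id;

-- Recreate indexes, constraints, and grants on BASEDATA_NEW, then swap the names.
ALTER TABLE BASEDATA RENAME TO BASEDATA_OLD;
ALTER TABLE BASEDATA_NEW RENAME TO BASEDATA;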

Update thousands of records in a DataSet to SQL Server

I have half a million records in a data set of which 50,000 are updated. Now I need to commit the updated records back to the SQL Server 2005 Database.
What is the best and most efficient way to do this, considering that such updates could be frequent (concurrency is not an issue, but performance is)?
I would use a Batch Update.
Also documented here.
I agree with David's answer, as that's what I use. However, there is an alternative approach you could take which is worth considering (all situations are different after all) - it's something I would consider in the future if I had another similar requirement.
You could bulk insert the updated records into a new table in the DB (using SqlBulkCopy) which is an extremely fast way of loading data into the db (example). Then run an UPDATE statement on your main table to pull in the updated values from this new table which you would drop at the end.
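A minimal sketch of the second half of that approach, assuming the changed rows have already been bulk-copied into a staging table (all object and column names here are hypothetical):
-- Pull the new values from the staging table into the main table by key, then drop the staging table.
UPDATE m
SET    m.Col1 = s.Col1,
       m.Col2 = s.Col2
FROM   dbo.MainTable AS m
JOIN   dbo.MainTable_Staging AS s ON s.Id = m.Id;

DROP TABLE dbo.MainTable_Staging;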
The batched update approach of using SqlDataAdapter allows you to easily deal with any errors on specific rows (e.g. you could tell it to continue in the event of an error with a specific updated row so it doesn't stop the whole process).

Best practice for archiving a huge table of over 1,000,000,000 rows

I'm using SQL Server 2005. There is an audit trail table containing over 1,000,000,000 rows. I'm planning to archive this table. When I make a simple SELECT with NOLOCK, I can still find blocking (probably I/O blocking with another process?). So are there any best practices for this kind of situation?
For a table that large you will be wanting to find some effective sharding/partitioning strategy. Archiving in this sense tends to be a form of partitioning but not a good one since you often want to query over the current and archive anyway. In the worst cases you end up with a SELECT over a UNION of the archive and current tables, which is worse than if you hadn't split them at all.
You will often do better by finding some other means to slice the data, say on a record type or something. But if you are going to split it by date make absolutely sure you won't query over the archive+current data set.
Also, SQL Server 2005+ doesn't enable MVCC by default. It can, however, if you enable what MS calls Snapshot Isolation. See Serializable vs. Snapshot Isolation Level.
The effect of not having this enabled is that an uncommitted INSERT or UPDATE will block a SELECT in another transaction until the first transaction commits or rolls back. That can cause unnecessary locks and limit your scalability.
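For illustration, enabling snapshot isolation looks roughly like this (the database name is hypothetical; row versions are kept in tempdb, so there is a cost):
ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

-- A reader can then opt in per session and no longer blocks behind uncommitted writes.
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;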
Create a backup of the database and restore it in the archive location.
Selecting 1 billion rows all at once is going to strain the server no matter how you do it.
Do it in batches instead, say 1000 rows at a time. The bcp tool does this automatically. Or use SSIS to copy the data into another database - it does pretty much the same thing.
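As a rough sketch of that batch approach (the table, column, and archive database names are made up, and Id is assumed to be an increasing unique key):
-- Copy rows into the archive database 1,000 at a time, keyed on Id so each pass
-- picks up where the previous one left off and each transaction stays short.
DECLARE @lastId bigint;
SET @lastId = 0;

WHILE 1 = 1
BEGIN
    INSERT INTO ArchiveDB.dbo.AuditTrail (Id, EventDate, Payload)
    SELECT TOP (1000) Id, EventDate, Payload
    FROM dbo.AuditTrail
    WHERE Id > @lastId
    ORDER BY Id;

    IF @@ROWCOUNT = 0 BREAK;

    SELECT @lastId = MAX(Id) FROM ArchiveDB.dbo.AuditTrail;
END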