Updating 100k records in one update query - sql

Is it possible, or recommended at all, to run one update query, that will update nearly 100k records at once?
If so, how can I do that? I am trying to pass an array to my stored proc, but it doesn't seem to work. This is my SP:
CREATE PROCEDURE [dbo].[UpdateAllClients]
@ClientIDs varchar(max)
AS
BEGIN
DECLARE @vSQL varchar(max)
SET @vSQL = 'UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN (' + @ClientIDs + ')';
EXEC(@vSQL);
END
I have no idea what's not working, but it's just not updating the relevant records.
Anyone?

The UPDATE is reading your @ClientIDs (a comma-separated value) as a single string. To illustrate, assume @ClientIDs = 1,2,3,4,5.
Your UPDATE command is interpreting it like this:
UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN ('1,2,3,4,5');
and not
UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN (1,2,3,4,5);
One suggestion is to use a subquery in your UPDATE, for example:
UPDATE Clients
SET LastUpdate = GETDATE()
WHERE ID IN
(
SELECT ID
FROM tableName
-- WHERE condition
)
Hope this makes sense.

A few notes to be aware of.
Big updates like this can lock up the target table. If a single statement takes more than about 5,000 row locks on the table, SQL Server will try to escalate them to a table lock, which would block other processes. Worth bearing in mind if this could cause an issue in your scenario. See: Lock Escalation
With a large number of rows to update like this, an approach I'd consider is (basic):
bulk insert the 100K Ids into a staging table (e.g. from .NET, use SqlBulkCopy)
update the target table, using a join onto the above staging table
drop the staging table
This gives you some more room to control the process, by breaking the workload up into chunks and doing it x rows at a time; a sketch follows below.
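A rough sketch of steps 2 and 3, assuming the IDs have already been bulk-copied into a staging table; the ClientIdsToUpdate name and the 10,000-row batch size are placeholders, and each chunk of IDs is removed from the staging table as it is processed:
-- IDs were bulk inserted into dbo.ClientIdsToUpdate (e.g. via SqlBulkCopy)
DECLARE @BatchSize int = 10000;
DECLARE @Batch TABLE (ID int PRIMARY KEY);

WHILE 1 = 1
BEGIN
    -- Pull the next chunk of IDs out of the staging table
    DELETE TOP (@BatchSize) FROM dbo.ClientIdsToUpdate
    OUTPUT deleted.ID INTO @Batch;

    IF @@ROWCOUNT = 0 BREAK;

    -- Update the target table via a join onto the chunk
    UPDATE c
    SET LastUpdate = GETDATE()
    FROM dbo.Clients c
    JOIN @Batch b ON b.ID = c.ID;

    DELETE FROM @Batch;
END

DROP TABLE dbo.ClientIdsToUpdate;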

There is a limit on the number of items you can pass to IN if you are supplying a literal list of values.
So, if you just want to update the whole table, skip the IN condition.
If not, put a subquery inside the IN. That should do the job.
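Sketched out, with ClientsToTouch standing in for whatever query identifies the rows (a made-up name):
-- Whole table: no IN needed at all
UPDATE Clients SET LastUpdate = GETDATE();

-- Subset: drive the IN with a query rather than a literal list of IDs
UPDATE Clients
SET LastUpdate = GETDATE()
WHERE ID IN (SELECT ClientID FROM ClientsToTouch);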

The database will very likely reject that SQL statement because it is too long.
When you need to update so many records at once, maybe your database schema isn't appropriate. Maybe the LastUpdate value should not be stored separately for each client, but only once globally, or once for a fixed group of clients?
But it's hard to recommend a good course of action without seeing the whole picture.

What version of SQL Server are you using? If it is 2008+, I would recommend using TVPs (table-valued parameters - http://msdn.microsoft.com/en-us/library/bb510489.aspx). The transfer of data will be faster (as opposed to building a huge string) and your query would look nicer:
update c
set lastupdate=getdate()
from clients c
join @mytvp t on c.Id = t.Id
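For illustration, a minimal end-to-end sketch (the type and procedure names here are made up):
-- One-time setup: a table type to carry the IDs
CREATE TYPE dbo.ClientIdList AS TABLE (Id int PRIMARY KEY);
GO

-- Procedure receiving the TVP (TVP parameters must be READONLY)
CREATE PROCEDURE dbo.UpdateClientsByIds
    @Ids dbo.ClientIdList READONLY
AS
BEGIN
    UPDATE c
    SET LastUpdate = GETDATE()
    FROM dbo.Clients c
    JOIN @Ids t ON c.Id = t.Id;
END
GO
From .NET, the 100k IDs can then be passed as a single SqlDbType.Structured parameter (a DataTable or IEnumerable<SqlDataRecord>) instead of a concatenated string.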

Each SQL statement on its own is a transaction. That means SQL Server is going to take locks for all of those rows, which can really hurt the performance of the table, so you generally don't want to update millions of rows in a single statement. One workaround is to set ROWCOUNT before the DML operation:
SET ROWCOUNT 100
UPDATE Clients SET LastUpdate=GETDATE()
WHERE ID IN (1,2,3,4,5)
SET ROWCOUNT 0
or from SQL Server 2008 you can parameterize the TOP keyword:
DECLARE @value int
SET @value = 100000
again:
UPDATE TOP (@value) Clients SET LastUpdate=GETDATE()
WHERE ID IN (1,2,3,4,5) -- the WHERE clause must also exclude rows that are already updated, otherwise this loop never ends
IF @@ROWCOUNT != 0 GOTO again
See how long the above query takes, then adjust the value of the variable accordingly. You need to break the task into smaller units, as suggested in the answers above.

Method 1:
Split @ClientIDs on the ',' delimiter,
put the values into an array and iterate over that array,
updating the Clients table for each ID (a set-based version of this is sketched after this list).
OR
Method 2:
Instead of taking @ClientIDs as a varchar, follow the steps below:
create a table type for the IDs and use a join.
For faster processing you can also create an index on ClientID.
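Method 2 is essentially the table-valued-parameter approach sketched earlier. For Method 1, if you are on SQL Server 2016 or later, you can let the built-in STRING_SPLIT function do the splitting and run one set-based update instead of iterating - a sketch:
-- Split the comma-separated list and update via a join, no loop needed
UPDATE c
SET LastUpdate = GETDATE()
FROM dbo.Clients c
JOIN STRING_SPLIT(@ClientIDs, ',') s ON c.ID = CAST(s.value AS int);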

Related

MS SQL Server 2008 - UPDATE a large database

I'd like to UPDATE just one column in a large TABLE.
What would be the most efficient way to do this?
SELECT * from TABLE WHERE status='N'
UPDATE TABLE set status='Y' where status='N'
I assume the table is very, very large.
Then maybe you should create a temporary or permanent filtered index on the table:
CREATE NONCLUSTERED INDEX Temp_Table_Status
ON dbname.dbo.Table(Status)
WHERE Status='N'
GO
Otherwise your query is correct:
UPDATE TABLE set status='Y' where status='N'
"Most efficient" is much like "most beautiful". It has no absolute meaning. How do you measure "efficient". IMO, by far the most efficient mechanism is to use a single update query. Your query should actually be written to avoid pointless updates:
update table set col = 'Y' where col <> 'Y';
The where clause is what makes it "most efficient". And note that you might need to account for NULL values in the where clause - know your data. Some might argue for batching the updates in order to save space. If you do this on a regular basis, then you should generally have sufficient space in the database and log to do it without attempting to pointlessly manage bits of disk space.
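For example, if the column is nullable, NULL rows won't match col <> 'Y', so a NULL-aware sketch of the same statement would be:
update table set col = 'Y' where col <> 'Y' or col is null;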

SQL Server 2 Table Sync Recursion

I'm using triggers to keep two identical tables in a single database in sync. One is for an internal proprietary system, the other is used to expose a subset of data to the outside world. I'm not able to use the same table for both.
I need updates, inserts, deletes in either of the tables to be applied to the other.
So far I'm using triggers on both tables instead of a scheduled stored procedure because I wanted immediate updates. The problem is, an update to table A fires the trigger to update table B, which fires a trigger to update table A, which fires the trigger to table B.... and so on.
What's the best way to stop the recursion?
One way is to check the data first to see whether it has actually changed, something like this:
-- @cempno's type should match jcempy.cempno; int is assumed here
DECLARE @cempno int, @count int
-- Note: this assumes single-row updates; a multi-row UPDATE would only pick up one cempno
SELECT @cempno = inserted.cempno FROM inserted
SELECT @count = COUNT(*)
FROM jcempy j INNER JOIN zhhjcempy z ON j.cempno = z.cempno AND j.cempno = @cempno
WHERE (j.ccostcode <> z.ccostcode)
OR (j.cimearnreg <> z.cimearnreg)
OR (j.cimearnot <> z.cimearnot)
OR (j.cimearndt <> z.cimearndt)
OR (j.cimearnl1 <> z.cimearnl1)
IF @count = 1
BEGIN
-- Update the record
END
Another way could be to use a third table that holds status flags to show which table first initiated the update, but I've a feeling managing that will be a problem once I have 100 users hammering away at the system.
Any ideas, or comments on what sort of performance penalties the data check will incur?
Thanks!
If you can't avoid triggers, look at @@NESTLEVEL within your trigger. It should always be the same number; if the number is bigger, just do nothing.
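A minimal sketch of that guard, assuming the triggers normally fire at nesting level 1; the trigger, table, and column names below are placeholders:
CREATE TRIGGER dbo.trg_TableA_Sync ON dbo.TableA
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- If this firing was caused by the other table's trigger, the nesting
    -- level will be higher than usual, so bail out to break the A -> B -> A cycle.
    IF @@NESTLEVEL > 1
        RETURN;

    -- ... apply the changes to dbo.TableB here ...
END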

Quickly update and do statistical normalization of data SQL

I have a SQL-Server table which needs to be updated via a SQLCLR function. An update to a single row will need to trigger table-wide update. I was wondering how to properly perform the update and normalize the table. The main problem I see is that it's linked to a website, so there could be multiple updates coming in at any time. The website will only see a few hundred visitors and after a short period of time will be closed (collecting data for research).
To get an idea of the SQLClr call:
DECLARE @ID INT = 501;
SELECT dbo.fn_ComputeJaccard(
RectangleTwo.MinX, RectangleTwo.MinY, RectangleTwo.MaxX, RectangleTwo.MaxY,
RectangleOne.MinX, RectangleOne.MinY, RectangleOne.MaxX, RectangleOne.MaxY) as RunningTotal
FROM PreProcessed RectangleOne
INNER JOIN PreProcessed RectangleTwo
ON RectangleTwo.ID <> RectangleOne.ID
WHERE dbo.fn_ComputeJaccard(
RectangleTwo.MinX, RectangleTwo.MinY, RectangleTwo.MaxX, RectangleTwo.MaxY,
RectangleOne.MinX, RectangleOne.MinY, RectangleOne.MaxX, RectangleOne.MaxY) > .97
AND RectangleTwo.ID = @ID
I would need to select this data into a temp table, normalize that table, and then update the original table with the values (newValue*.5 + oldValue*.9), then renormalize the whole table. I imagine this would take a while to process, so I'm looking for the most efficient way of doing that, plus a solution to the issue of multiple updates flying in.
Any advice you could give me would be great!
Thanks
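For what it's worth, a very rough sketch of the update-and-renormalize step described above; the Score column, the sum-based renormalization, and the use of a single transaction to serialize concurrent website requests are all assumptions, not part of the original schema:
DECLARE @ID int = 501;
DECLARE @Total float;

BEGIN TRANSACTION;

    -- 1. New similarity values for the affected row, into a temp table
    --    (the > .97 filter from the question can be added to the WHERE clause)
    SELECT RectangleOne.ID,
           dbo.fn_ComputeJaccard(
               RectangleTwo.MinX, RectangleTwo.MinY, RectangleTwo.MaxX, RectangleTwo.MaxY,
               RectangleOne.MinX, RectangleOne.MinY, RectangleOne.MaxX, RectangleOne.MaxY) AS NewScore
    INTO #NewScores
    FROM PreProcessed RectangleOne
    INNER JOIN PreProcessed RectangleTwo ON RectangleTwo.ID <> RectangleOne.ID
    WHERE RectangleTwo.ID = @ID;

    -- 2. Blend the new values into the existing ones (weights from the question)
    UPDATE p
    SET Score = n.NewScore * 0.5 + p.Score * 0.9
    FROM PreProcessed p
    JOIN #NewScores n ON n.ID = p.ID;

    -- 3. Renormalize the whole table (assumed: divide by the column total)
    SELECT @Total = SUM(Score) FROM PreProcessed;
    UPDATE PreProcessed SET Score = Score / @Total;

COMMIT TRANSACTION;

DROP TABLE #NewScores;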

Slow join on Inserted/Deleted trigger tables

We have a trigger that creates audit records for a table and joins the inserted and deleted tables to see if any columns have changed. The join has been working well for small sets, but now I'm updating about 1 million rows and it doesn't finish in days. I tried updating a select number of rows with different orders of magnitude and it's obvious this is exponential, which would make sense if the inserted/deleted tables are being scanned to do the join.
I tried creating an index but get the error:
Cannot find the object "inserted" because it does not exist or you do not have permissions.
Is there any way to make this any faster?
Inserting inserted and deleted into temporary tables indexed on the joining columns could well improve things, as inserted and deleted themselves are not indexed.
You can check @@ROWCOUNT inside the trigger so you only perform this logic above some threshold number of rows, though on SQL Server 2008 this might overstate the number somewhat if the trigger was fired as the result of a MERGE statement (it will return the total number of rows affected by all MERGE actions, not just the one relevant to that specific trigger).
In that case you can just do something like SELECT @NumRows = COUNT(*) FROM (SELECT TOP 10 * FROM INSERTED) T to see if the threshold is met.
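A sketch of how that could look inside the trigger; KeyCol stands in for the real joining column(s), and 10 is the threshold from the example above:
DECLARE @NumRows int;
SELECT @NumRows = COUNT(*) FROM (SELECT TOP 10 * FROM inserted) T;

IF @NumRows = 10  -- at least 10 rows affected: use indexed temp copies
BEGIN
    SELECT * INTO #ins FROM inserted;
    SELECT * INTO #del FROM deleted;

    CREATE CLUSTERED INDEX IX_ins ON #ins (KeyCol);
    CREATE CLUSTERED INDEX IX_del ON #del (KeyCol);

    -- ... audit logic joining #ins to #del instead of inserted to deleted ...
END
ELSE
BEGIN
    -- ... original audit logic joining inserted to deleted directly ...
END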
Addition
One other possibility you could experiment with is simply bypassing the trigger for these large updates. You could use SET CONTEXT_INFO to set a flag and check the value of this inside the trigger. You could then use OUTPUT inserted.*, deleted.* to get the "before" and "after" values for a row without needing to JOIN at all.
DECLARE @TriggerFlag varbinary(128)
SET @TriggerFlag = CAST('Disabled' AS varbinary(128))
SET CONTEXT_INFO @TriggerFlag
UPDATE YourTable
SET Bar = 'X'
OUTPUT inserted.*, deleted.* INTO #T
/*Reset the flag*/
SET CONTEXT_INFO 0x
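And the corresponding check inside the trigger might look something like this (the flag value must match whatever the batch sets):
-- At the top of the trigger body: skip the audit work for flagged batches
IF CONTEXT_INFO() = CAST('Disabled' AS varbinary(128))
    RETURN;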

What does this do?

Once in a while, I need to clear out the anonymous user profiles from the database. A colleague has suggested I use this procedure because it allows a little breathing space from time to time for other procedures to run.
WHILE EXISTS (SELECT * FROM aspnet_users WITH (NOLOCK)
WHERE userID IN (SELECT UserID FROM #AspnetUsersToDelete))
BEGIN
SET ROWCOUNT 1000
DELETE FROM aspnet_users WHERE userID IN (SELECT UserID FROM #AspnetUsersToDelete )
print 'aspnet_Users deleted: ' + CONVERT(varchar(255), @@ROWCOUNT)
SET ROWCOUNT 0
WAITFOR DELAY '00:00:01'
END
This is the first time I've seen the NOLOCK keyword used and the logic for the rowcount seems backwards to me. Does anyone else use a similar sort of technique for providing windows in long running procedures and is this the best way of doing things?
Any time I anticipate deleting a very large number of rows, I'll do something similar to this to keep transaction batch sizes reasonable.
For SQL Server 2005+, you could use DELETE TOP (1000)... instead of the SET ROWCOUNT statements. I usually do:
SELECT NULL; /* Fudge @@ROWCOUNT value for first time in loop */
WHILE (@@ROWCOUNT <> 0) BEGIN
DELETE TOP (1000)
...
END /* WHILE */
The SET ROWCOUNT 1000 means it will only process one thousand rows in the following statements (i.e., DELETE statement). SET ROWCOUNT 0 means each statement processes however many rows are relevant.
So basically, over all it deletes one thousand rows, waits a second, deletes another thousand, and continues that until there are no more to delete.
The WITH (NOLOCK) hint means the SELECT doesn't take shared locks and doesn't wait on other transactions' locks, so queries running simultaneously can still access the data. This can make the query a little faster. For more information about NOLOCK, consult the following link:
http://www.mollerus.net/tom/blog/2008/03/using_mssqls_nolock_for_faster_queries.html
(NOLOCK) allows dirty reads. Basically, there is a chance that if you are reading data out of the table while it is in the process of being updated, you could read the wrong data. You can also read data that has been modified by transactions that have not been committed yet as well as a slew of other problems.
Best practice is not to use NOLOCK unless you are reading from tables that really don't change (such as a table containing states) or from a data warehouse type DB that is not constantly updated.