T-SQL ways to avoid potentially updating the same row based on subquery results

I have a SQL Server table with records (raw emails) that need to be processed (build the email and send it) in a given order by an external process (mailer). It's not very resource-intensive but can take a while with all the parsing, SMTP overhead, etc.
To speed things up I can easily run multiple instances of the mailer process over multiple servers, but I worry that if two were to start at almost the same time they might still overlap a bit and send the same records.
Simplified for the question, my table looks something like this, with each record holding the data for one email.
queueItem
======================
queueItemID PK
...data...
processed bit
priority int
queuedStart datetime
rowLockName varchar
rowLockDate datetime
Batch 1 (Server 1)
starts at 12:00PM
lock/reserve the first 5000 rows (1-5000)
select the newly reserved rows
begin work
Batch 2 (Server 2)
starts at 12:15PM
lock/reserve the next 5000 rows (5001-10000)
select the newly reserved rows
begin work
To lock the rows I have been using the following:
declare @lockName varchar(36)
set @lockName = newid()

declare @batchsize int
set @batchsize = 5000

update queueItem
set rowLockName = @lockName,
    rowLockDate = getdate()
where queueitemID in (
    select top (@batchsize) queueitemID
    from queueItem
    where processed = 0
      and rowLockName is null
      and queuedStart <= getdate()
    order by priority, queueitemID
)
If I'm not mistaken, the query starts by executing the SELECT subquery and then locks the rows in preparation for the update; this is fast but not instantaneous.
My concern is that if I start two batches at nearly the same time (faster than the subquery runs), Batch 1's UPDATE might not be complete yet, so Batch 2's SELECT would still see the records as available and attempt (and succeed) to overwrite Batch 1's reservation (a sort of race condition?).
I have run some tests and so far haven't had them overlap, but is this a valid concern that will come back to haunt me at the worst possible time?
Perhaps there are better ways to write this query worth looking into, as I am by no means a T-SQL guru.
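For what it's worth, one alternative I've been considering is doing the reservation and the read-back in a single atomic statement with locking hints. This is only a sketch of what I think it would look like; the UPDLOCK/READPAST hints and the OUTPUT clause are my guesses, not something I've tested at scale:

declare @lockName varchar(36)
set @lockName = newid()

-- reserve up to 5000 eligible rows and return their IDs in one statement;
-- UPDLOCK should keep two batches from reserving the same rows, and
-- READPAST should let a second batch skip rows the first one is still locking
;with nextBatch as (
    select top (5000) rowLockName, rowLockDate, queueItemID
    from queueItem with (updlock, readpast, rowlock)
    where processed = 0
      and rowLockName is null
      and queuedStart <= getdate()
    order by priority, queueItemID
)
update nextBatch
set rowLockName = @lockName,
    rowLockDate = getdate()
output inserted.queueItemID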

Related

In sybase, how would I lock a stored procedure that is executing and alter the table that the stored procedure returns?

I have a table as follows:
id status
-- ------
1 pass
1 fail
1 pass
1 na
1 na
Also, I have a stored procedure that returns a table with the top 100 records having status 'na'. The stored procedure can be called by multiple nodes in an environment and I don't want them to fetch duplicate data. So, I want to lock the stored procedure while it is executing, set the status of the records it returns to 'In Progress', return that table, and then release the lock, so that different nodes don't fetch the same data. How would I accomplish this?
There is already a solution provided for a similar question in MS SQL, but it shows errors when used in Sybase.
Assuming Sybase ASE ...
The bigger issue you'll likely want to consider is whether you want a single process to lock the entire table while you're grabbing your top 100 rows, or if you want other processes to still access the table?
Another question is whether you'd like multiple processes to concurrently pull 100 rows from the table without blocking each other?
I'm going to assume that you a) don't want to lock the entire table and b) you may want to allow multiple processes to concurrently pull rows from the table.
1 - if possible, make sure the table is using datarows locking (default is usually allpages); this will reduce the granularity of locks to the row level (as opposed to page level for allpages); the table will need to be datarows if you want to allow multiple processes to concurrently find/update rows in the table
2 - make sure the lock escalation setting on the table is high enough to ensure a single process's 100 row update doesn't lock the table (sp_setpglockpromote for allpages, sp_setrowlockpromote for datarows); the key here is to make sure your update doesn't escalate to a table-level lock!
3 - when it comes time to grab your set of 100 rows you'll want to ... inside a transaction ... update the 100 rows with a status value that's unique to your session, select the associated id's, then update the status again to 'In Progress'
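Before getting to step 3, the setup in steps 1 and 2 might look roughly like this in ASE (the threshold values below are placeholders, not recommendations):

-- step 1: switch the table to datarows locking
alter table mytable lock datarows
go

-- step 2: raise the row-lock promotion thresholds so a 100-row update
-- doesn't escalate to a table-level lock (values are illustrative)
exec sp_setrowlockpromote "table", mytable, 300, 500, 90
go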
The gist of the operation looks like the following:
declare @mysession varchar(10)
select @mysession = convert(varchar(10), @@spid)   -- replace @@spid with anything that
                                                   -- uniquely identifies your session

set rowcount 100        -- limit the update to 100 rows

begin tran get_my_rows

-- start with an update so that we get exclusive access to the desired rows;
-- update the first 100 rows you find with your @@spid
update mytable
set status = @mysession     -- need to distinguish your locked rows from
                            -- other processes; if we used 'In Progress'
                            -- we wouldn't be able to distinguish between
                            -- rows updated earlier in the day or updated
                            -- by other/concurrent processes
from mytable readpast       -- 'readpast' allows your query to skip over
                            -- locks held by other processes but it only
                            -- works for datarows tables
where status = 'na'

-- select your reserved id's and send back to the client/calling process
select id
from mytable
where status = @mysession

-- update your rows with a status of 'In Progress'
update mytable
set status = 'In Progress'
where status = @mysession

commit          -- close out txn and release our locks

set rowcount 0  -- set back to default of 'unlimited' rows
Potential issues:
if your table is large and you don't have an index on status then your queries could take longer than necessary to run; by making sure lock escalation is high enough and you're using datarows locking (so the readpast works) you should see minimal blocking of other processes regardless of how long it takes to find the desired rows
with an index on the status column, consider that all of these updates are going to force a lot of index updates which is probably going to lead to some expensive deferred updates
if using datarows and your lock escalation is too low then an update could lock the entire table, which would cause another (concurrent) process to readpast the table lock and find no rows to process
if using allpages you won't be able to use readpast so concurrent processes will block on your locks (ie, they won't be able to read around your lock)
if you've got an index on status, and several concurrent processes locking different rows in the table, there could be a chance for deadlocks to occur (likely in the index tree of the index on the status column) which in turn would require your client/application to be coded to expect and address deadlocks
To think about:
if the table is relatively small such that table scanning isn't a big cost, you could drop any index on the status column and this should reduce the performance overhead of deferred updates (related to updating the indexes)
if you can work with a session-specific status value (eg, 'In Progress - @mysession') then you could eliminate the 2nd update statement (could come in handy if you're incurring deferred updates on an indexed status column)
if you have another column(s) in the table that you could use to uniquely identify your session's rows (eg, last_updated_by_spid = @@spid, last_updated_date = @mydate - where @mydate is initially set to getdate()) then your first update could set the status = 'In Progress', the select would use @@spid and @mydate for the where clause, and the second update would not be needed [NOTE: This is, effectively, the same thing Gordon is trying to address with his session column.]
assuming you can work with a session-specific status value, consider using something that will allow you to track, and fix, orphaned rows (eg, row status remains 'In Progress - @mysession' because the calling process died and never came back to (re)set the status)
if you can pass the id list back to the calling program as a single string of concatenated id values you could use the method I outline in this answer to append the id's into a @variable during the first update, allowing you to set status = 'In Progress' in the first update and also allowing you to eliminate the select and the second update
how would you tell which rows have been orphaned? you may want the ability to update a (small)datetime column with the getdate() of when you issued your update; then, if you would normally expect the status to be updated within, say, 5 minutes, you could have a monitoring process that looks for orphaned rows where status = 'In Progress' and it's been more than, say, 10 minutes since the last update (a rough sketch of such a monitor follows)
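A hedged sketch of that orphan monitor, assuming the last_updated_date column suggested above exists (the column name and the 10-minute threshold are illustrative):

-- reset rows that were reserved but never finished
update mytable
set status = 'na'
where status like 'In Progress%'
  and datediff(mi, last_updated_date, getdate()) > 10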
If the datarows, readpast, lock escalation settings and/or deadlock potential is too much, and you can live with brief table-level locks on the table, you could have the process obtain an exclusive table level lock before performing the update and select statements; the exclusive lock would need to be obtained within a user-defined transaction in order to 'hold' the lock for the duration of your work; a quick example:
begin tran get_my_rows
-- request an exclusive table lock; wait until it's granted
lock table mytable in exclusive mode
update ...
select ...
update ...
commit
I'm not 100% sure how to do this in Sybase. But, the idea is the following.
First, add a new column to the table that represents the session or connection used to change the data. You will use this column to provide isolation.
Then, update the rows:
update top (100) t
set status = 'in progress',
session = @session
where status = 'na'
order by ?; -- however you define the "top" records
Then, you can return or process the 100 ids that are "in progress" for the given connection.
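The follow-up read might look something like this (a sketch; the session column is the one proposed above):

select id
from mytable
where status = 'in progress'
  and session = @session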
Create another table, proc_lock, that has one row
When control enters the stored procedure, start a transaction and do a select for update on the row in proc_lock (see this link). If that doesn't work for Sybase, then you could try the technique from this answer to lock the row.
Before the procedure exits, make sure to commit the transaction.
This will ensure that only one user can execute the proc at a time. When a second user tries to execute the proc, it will block until the first user's lock on the proc_lock row is released (e.g., when the transaction is committed).
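A hedged sketch of that idea (the table and column names here are hypothetical):

-- one-time setup: a single-row lock table
create table proc_lock (lock_id int primary key, locked_by int null)
insert into proc_lock values (1, null)

-- inside the stored procedure:
begin tran
    -- taking an exclusive row lock serializes callers; a plain update
    -- works even where SELECT ... FOR UPDATE isn't available
    update proc_lock set locked_by = @@spid where lock_id = 1

    -- ... fetch the top 100 'na' rows and mark them 'In Progress' ...
commit   -- releases the lock so the next caller can proceed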

Running large queries in the background MS SQL

I am using MS SQL Server 2008.
I have a table which is constantly in use (data is always changing and being inserted into it).
It now contains ~70 million rows.
I am trying to run a simple query over the table with a stored procedure that will probably take a few days.
I need the table to remain usable. I executed the stored procedure, and after a while every simple select-by-identity query I try to run against the table stops responding (or runs so long that I cancel it).
What should I do?
here is how my stored procedure looks like:
SET NOCOUNT ON;
update SOMETABLE
set [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
where [some_col] = 243
even if I try it with this added to the WHERE clause (with AND):
ID_COL > 57000000 and ID_COL < 60000000 and
it still doesn't work.
BTW, SomeFunction does some simple math and looks up rows in another table that contains about 300k rows but never changes.
From my perspective your server has a serious performance problem. Even if we assume that none of the records in the query
select some_col from SOMETABLE with (nolock) where id_col between 57000000 and 57001000
was in memory, it shouldn't take 21 seconds to read the few pages sequentially from disk (your clustered index on the id_col should not be fragmented if it's an auto-identity and you didn't do something stupid like adding a "desc" to the index definition).
But if you can't/won't fix that, my advice would be to make the update in small packages like 100-1000 records at a time (depending on how much time the lookup function consumes). One update/transaction should take no more than 30 seconds.
You see, each update keeps an exclusive lock on all the records it modified until the transaction is complete. If you don't use an explicit transaction, each statement is executed in its own automatic transaction context, so the locks are released when the update statement is done.
But you can still run into deadlocks that way, depending on what the other processes do. If they modify more than one record at a time, too, or even if they gather and hold read locks on several rows, you can get deadlocks.
To avoid the deadlocks, your update statement needs to take a lock on all the records it will modify at once. The way to do this is to place the single update statement (with only the few rows limited by the id_col) in a serializable transaction like
IF @@TRANCOUNT > 0
    RETURN -- Error: You are in a transaction context already

SET NOCOUNT ON
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

-- Insert loop here to work "x" through the id range
BEGIN TRANSACTION
    UPDATE SOMETABLE
    SET [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
    WHERE [some_col] = 243 AND id_col BETWEEN x AND x + 500 -- or whatever keeps the update in the small timerange
COMMIT
-- Next loop

-- Get all new records added while you were running the loop. If these are too many you may have to paginate this also:
BEGIN TRANSACTION
    UPDATE SOMETABLE
    SET [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
    WHERE [some_col] = 243 AND id_col >= x
COMMIT
For each update this will take an update/exclusive key-range lock on the given records (but only them, because you limit the update through the clustered index key). It will wait for any other updates on the same records to finish, then get its lock (blocking other transactions, but still only for the given records), then update the records and release the lock.
The last extra statement is important, because it will take a key-range lock up to "infinity" and thus prevent even inserts at the end of the range while the update statement runs.
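For concreteness, the loop placeholder above might be filled in roughly like this (a sketch only; the range bounds, the 500-row step and the delay are assumptions, not part of the original procedure):

SET NOCOUNT ON
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

DECLARE @x INT = 57000000       -- start of the id range to process
DECLARE @maxId INT = 60000000   -- end of the id range

WHILE @x < @maxId
BEGIN
    BEGIN TRANSACTION
        UPDATE SOMETABLE
        SET [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
        WHERE [some_col] = 243 AND id_col BETWEEN @x AND @x + 500
    COMMIT

    SET @x = @x + 501           -- move on to the next slice
    WAITFOR DELAY '00:00:01'    -- optional breathing room for other sessions
END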

Performing SQL updates in single statements vs batches

I'm working with large databases and need advice on how to optimize my selects/updates. Here's an example:
create table Book (
BookID int,
Description nvarchar(max)
)
-- 8 million rows
create table #BookUpdates (
BookID int,
Description nvarchar(max)
)
-- 2 million rows
Let's assume there are 8 million Books and I have to update the description for 2 million of them.
Problem: the time to run these updates is very long. It occasionally causes blocking for the users who are also trying to run statements against the database. I've come up with a solution but want to know if there's a better one out there. I have to prepare one-off random updates like this a lot (for whatever reason).
-- normal update
update b set b.Description = bu.Description
from Book b
join #BookUpdates bu
on bu.BookID = b.BookID
-- batch update
while (@BookID < @MaxBookID)
begin
    update b set b.Description = bu.Description
    from Book b
    join #BookUpdates bu
        on bu.BookID = b.BookID
    where bu.BookID >= @BookID
      and bu.BookID < @BookID + 5000

    set @BookID = @BookID + 5000
end
The second update works a lot faster. I like this solution because I can print status updates on how much is left, and it doesn't cause performance issues for our customers.
Question: am I missing something important here? Indexes on the temp tables?
I updated the EXAMPLE tables so I don't get more normalization comments. Only 1 description per book :)
You can prevent blocking on the query side by using NOLOCK or READUNCOMMITTED hints on the SQL queries.
The real issue with performance is probably the accumulation of changes in the log. Your method of batching the changes in groups of 5,000 is quite reasonable. Because you are setting up the updates in a batch table, you might as well calculate the batch number in the table and then do the looping based on that.
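A sketch of what that might look like (the BatchNo column and the 5,000-row batch size are assumptions):

-- add a batch number to the staging table
ALTER TABLE #BookUpdates ADD BatchNo INT;

;WITH numbered AS (
    SELECT BatchNo,
           (ROW_NUMBER() OVER (ORDER BY BookID) - 1) / 5000 AS rn
    FROM #BookUpdates
)
UPDATE numbered SET BatchNo = rn;

-- then loop over BatchNo instead of BookID ranges, e.g.
-- WHILE @Batch <= (SELECT MAX(BatchNo) FROM #BookUpdates) ... join on bu.BatchNo = @Batch ...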
I would try your own suggestion first and index the temp table before you run the update:
CREATE INDEX IDX_BookID ON #BookUpdates(BookID)
Try it with the index and without the index and see what the impact on the runtime is. If you want to avoid impacting your users for this test, run it outside working hours (if you can) or copy Book to another temp table first and test against that.
Regardless, given the volume, I expect you will still cause blocking for other processes. If you are unable to schedule your updates at a time when no other processes are running against this table (which would be the ideal solution), your existing batch update appears to be a perfectly valid solution. Indexing the temp table will likely help with that too so you may be able to increase the batch size without causing blocking.

What does this do?

Once in a while, I need to clear out the anonymous user profiles from the database. A colleague has suggested I use this procedure because it allows a little breathing space from time to time for other procedures to run.
WHILE EXISTS (SELECT * FROM aspnet_users WITH (NOLOCK)
              WHERE userID IN (SELECT UserID FROM #AspnetUsersToDelete))
BEGIN
    SET ROWCOUNT 1000

    DELETE FROM aspnet_users WHERE userID IN (SELECT UserID FROM #AspnetUsersToDelete)
    print 'aspnet_Users deleted: ' + CONVERT(varchar(255), @@ROWCOUNT)

    SET ROWCOUNT 0

    WAITFOR DELAY '00:00:01'
END
This is the first time I've seen the NOLOCK keyword used and the logic for the rowcount seems backwards to me. Does anyone else use a similar sort of technique for providing windows in long running procedures and is this the best way of doing things?
Any time I anticipate deleting a very large number of rows, I'll do something similar to this to keep transaction batch sizes reasonable.
For SQL Server 2005+, you could use DELETE TOP (1000)... instead of the SET ROWCOUNT statements. I usually do:
SELECT NULL; /* Fudge @@ROWCOUNT value for first time in loop */
WHILE (@@ROWCOUNT <> 0) BEGIN
DELETE TOP (1000)
...
END /* WHILE */
The SET ROWCOUNT 1000 means it will only process one thousand rows in the following statements (i.e., DELETE statement). SET ROWCOUNT 0 means each statement processes however many rows are relevant.
So basically, over all it deletes one thousand rows, waits a second, deletes another thousand, and continues that until there are no more to delete.
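Applied to the question's tables, the whole loop might look roughly like this (capturing @@ROWCOUNT in a variable is my own precaution so the WAITFOR can't disturb it):

DECLARE @rows INT
SET @rows = 1               -- seed so the loop body runs at least once

WHILE (@rows <> 0)
BEGIN
    DELETE TOP (1000) FROM aspnet_users
    WHERE userID IN (SELECT UserID FROM #AspnetUsersToDelete)

    SET @rows = @@ROWCOUNT  -- capture immediately after the DELETE

    WAITFOR DELAY '00:00:01'  -- breathing room for other procedures
END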
The WITH (NOLOCK) hint means the SELECT does not take shared locks, so it neither blocks nor is blocked by other queries touching the same data. This can make your query a little faster. For more information about NOLOCK, consult the following link:
http://www.mollerus.net/tom/blog/2008/03/using_mssqls_nolock_for_faster_queries.html
(NOLOCK) allows dirty reads. Basically, there is a chance that if you are reading data out of the table while it is in the process of being updated, you could read the wrong data. You can also read data that has been modified by transactions that have not been committed yet as well as a slew of other problems.
Best practice is not to use NOLOCK unless you are reading from tables that really don't change (such as a table containing states) or from a data warehouse type DB that is not constantly updated.

sql table cell modified by multiple threads at the same time

If you have a table BankAccount with a column Amount, and the value of this column for a specific row can be modified by multiple threads at the same time, it could happen that the last one to set the value wins.
How do you usually handle this kind of situation?
UPDATE: I heard that in MSSQL there is an update lock, UPDLOCK, that locks the table or the row being updated; could I use this here somehow?
An update statement which references the current value would prevent overwriting. So, instead of doing something like
SELECT Amount FROM BankAccount WHERE account_id = 1
(it comes back as 350 and you want to subtract 50)...
UPDATE BankAccount SET Amount = 300 WHERE account_id = 1
do
UPDATE BankAccount SET Amount = Amount - 50 WHERE account_id = 1
You cannot have several threads modifying the same data at exactly the same time: the last one to set the value will always "win".
If the problem is that several threads read and set the value at almost the same time, and the reads and writes don't arrive in the right order, the solution is to use transactions:
start a transaction
read the value
set the new value
commit the transaction
This ensures the read and the write will be done consistently, and no other thread will be able to modify the data during the same transaction.
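In SQL Server terms (picking up the UPDLOCK hint mentioned in the question), a sketch of that read-modify-write pattern might look like this; the hint is my addition, since without it the plain read would not keep other writers out:

BEGIN TRANSACTION

    -- read the current value and hold an update lock on the row so no
    -- other session can change it between our read and our write
    DECLARE @Amount INT
    SELECT @Amount = Amount
    FROM BankAccount WITH (UPDLOCK, ROWLOCK)
    WHERE account_id = 1

    -- set the new value based on what we read
    UPDATE BankAccount
    SET Amount = @Amount - 50
    WHERE account_id = 1

COMMIT TRANSACTION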
Quoting the Wikipedia page about database transactions:
A database transaction comprises a unit of work performed within a database management system (or similar system) against a database, and treated in a coherent and reliable way independent of other transactions. Transactions in a database environment have two main purposes:
To provide reliable units of work that allow correct recovery from failures and keep a database consistent even in cases of system failure, when execution stops (completely or partially) and many operations upon a database remain uncompleted, with unclear status.
To provide isolation between programs accessing a database concurrently. Without isolation the programs' outcomes are typically erroneous.
You usually use transactions to overcome this.
Have a look at Database transaction.
You should have a database function/procedure which performs operations on the Amount. This function/procedure should return whether the operation succeeded or failed (for example, you want to take $1000 but the current Amount is only $550, so the operation cannot proceed).
Example in T-SQL:
UPDATE BankAccount SET Amount = Amount - 1000 WHERE BankAcountID = 12345 AND Amount >= 1000
RETURN @@ROWCOUNT
If the amount was changed, the return value will be 1, otherwise 0.
Now you can safely run this function/procedure (from several threads, too):
DECLARE @Result_01 int, @Result_02 int, @Result_03 int

EXEC @Result_01 = ChangeBankAccountAmount @BankAccountID = 12345, @ChangeAmount = 1000
EXEC @Result_02 = ChangeBankAccountAmount @BankAccountID = 12345, @ChangeAmount = 15
EXEC @Result_03 = ChangeBankAccountAmount @BankAccountID = 12345, @ChangeAmount = -2000, @MinAmount = 600
EDIT:
Whole procedure in T-SQL:
CREATE PROC ChangeBankAccountAmount
    @BankAccountID int,
    @ChangeAmount int,
    @MinAmount int = 0
AS BEGIN
    IF @ChangeAmount >= 0
        UPDATE BankAccount SET Amount = Amount + @ChangeAmount
        WHERE BankAcountID = @BankAccountID
    ELSE
        UPDATE BankAccount SET Amount = Amount + @ChangeAmount
        WHERE BankAcountID = @BankAccountID
          AND Amount + @ChangeAmount >= @MinAmount  -- balance after the change must stay above the minimum

    RETURN @@ROWCOUNT
END
Of course, the int datatype is not good for money; you should change it to the datatype used in your table.
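For instance, a decimal type is a more typical choice for money; the precision and scale here are just an illustration:

-- hypothetical: store balances as decimal instead of int
ALTER TABLE BankAccount ALTER COLUMN Amount DECIMAL(19, 4) NOT NULL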