basic SQL atomicity "UPDATE ... SET .. WHERE ..." - sql

I have a rather basic and general question about the atomicity of an "UPDATE ... SET ... WHERE ..." statement.
Given a table (without extra constraints):
+----+------+
| id | name |
+----+------+
| 1  | a    |
+----+------+
Now I execute the following 4 statements "at the same time" (concurrently):
UPDATE table SET name='b1' WHERE name='a'
UPDATE table SET name='b2' WHERE name='a'
UPDATE table SET name='b3' WHERE name='a'
UPDATE table SET name='b4' WHERE name='a'
Will only one of the UPDATE statements actually update the table?
Or is it possible that more than one UPDATE statement really updates the table?
Do I need an extra transaction or lock so that only one UPDATE writes values into the table?
thanks
[EDIT]
the 4 UPDATE statements are executed in parallel from different processes.
[EDIT] with PostgreSQL

One of these statements will lock the record (or the page, or the whole table, depending on your engine and locking granularity) and will be executed.
The others will wait for the resource to be freed.
When the lucky statement commits, the others will either reread the table and do nothing (if your transaction isolation level is READ COMMITTED) or fail to serialize the transaction (if the isolation level is SERIALIZABLE).
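For illustration, here is roughly how that plays out in two psql sessions under PostgreSQL's default READ COMMITTED level (a sketch only; the table name mytable is an assumption):
-- session 1:
BEGIN;
UPDATE mytable SET name = 'b1' WHERE name = 'a';  -- acquires a row lock
-- session 2 (while session 1 is still open):
UPDATE mytable SET name = 'b2' WHERE name = 'a';  -- blocks on session 1's lock
-- session 1:
COMMIT;                                           -- releases the lock
-- session 2 resumes: under READ COMMITTED it re-checks WHERE name = 'a',
-- finds 'b1' instead, and reports UPDATE 0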

If you ran these UPDATEs in one go, they would run in order, so the first would set the row to b1 and the other 3 would update nothing, as there would be no 'a' left to match.

There would be an implicit transaction around each, which would hold the other UPDATEs in a queue. Only the first one through would win in this case, as each subsequent update would no longer see a name 'a'.
Edit: I was assuming here that you were calling each UPDATE from separate processes. If they were called in a single script, they would be run consecutively in order of appearance.

Only one UPDATE statement can ever access a record at a time. Before it runs, it starts a transaction and locks the table (or, more correctly, the page the record is on, or even only the record itself; this depends on many factors). Then it makes the change and unlocks the table again, committing the transaction in the process.
Updates/deletes/inserts to a given record are single-threaded by their very nature; this is required to ensure table integrity. It does not mean that updates to many records (that are on different pages) cannot run in parallel.

With the statements you posted, only the first would change anything; the later updates would match no rows because after the first update 'a' can't be found. What are you trying to achieve with this?

Even if you don't specify BEGIN TRANSACTION and COMMIT TRANSACTION, a single SQL statement is always transactional. That means only one of the updates is allowed to modify the row at a time. The other queries will block on the update lock held by the first query.

I believe this may be what you are looking for (in T-SQL):
UPDATE [table]
SET [table].[name] = temp.new_name
FROM [table],
    (
        SELECT
            id,
            new_name = 'B' + cast(row_number() over (order by [name]) as varchar)
        FROM [table]
        WHERE [table].name = 'A'
    ) temp
WHERE [table].id = temp.id
There should be an equivalent function to row_number in the other SQL flavors.
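Since the question mentions PostgreSQL, a rough translation might look like this (an untested sketch; the table and column names are taken from the question):
UPDATE mytable t
SET    name = temp.new_name
FROM (
    SELECT id,
           'b' || row_number() OVER (ORDER BY name) AS new_name
    FROM   mytable
    WHERE  name = 'a'
) AS temp
WHERE t.id = temp.id;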

Related

Update and then select updated rows?

I have an application that selects rows with a particular status and then starts processing them. However, some long-running processing can cause a new instance of my program to be started, which selects the same rows again because the first instance hasn't had time to update the status yet. So I'm thinking of selecting my rows and then updating the status to something else so they cannot be selected again. I have done some searching and got the impression that the following should work, but it fails:
UPDATE table SET status = 5 WHERE status in
(SELECT TOP (10) * FROM table WHERE status = 1)
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
TLDR: Is it possible to both select and update rows at the same time? (The order doesn't really matter)
You can use an output clause to update the rows and then return them as if they were selected.
As for updating the top 100 rows only, a safe approach is to use a CTE to select the relevant rows first (I assumed that the id column can be used to order the rows):
with cte as (select top (100) * from mytable where status = 1 order by id)
update cte set status = 5 output inserted.*
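If the calling code needs the claimed rows back, one possible variation is to capture the OUTPUT into a table variable (a sketch; the table variable and column list are assumptions):
DECLARE @claimed TABLE (id INT);
WITH cte AS (
    SELECT TOP (100) *
    FROM mytable
    WHERE status = 1
    ORDER BY id
)
UPDATE cte
SET status = 5
OUTPUT inserted.id INTO @claimed;
SELECT id FROM @claimed;  -- the rows this caller reserved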
You can go directly for an UPDATE statement. It will take an exclusive lock on the row, so other concurrent transactions cannot read it.
More information on locks
Exclusive locks: Exclusive locks are used to lock data being modified by one transaction, thus preventing modifications by other concurrent transactions. You can read data held by an exclusive lock only by specifying a NOLOCK hint or using the read uncommitted isolation level. Because DML statements first need to read the data they want to modify, you'll always find exclusive locks accompanied by shared locks on that same data.
UPDATE TOP (10) table
SET status = 5
WHERE status = 1
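To also get the claimed ids back in the same statement, TOP can be combined with an OUTPUT clause (a sketch; mytable stands in for your table, and note that TOP in an UPDATE picks an arbitrary 10 matching rows since no ORDER BY is allowed here):
UPDATE TOP (10) mytable
SET status = 5
OUTPUT inserted.id
WHERE status = 1;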

In sybase, how would I lock a stored procedure that is executing and alter the table that the stored procedure returns?

I have a table as follows:
id status
-- ------
1 pass
1 fail
1 pass
1 na
1 na
Also, I have a stored procedure that returns a table with top 100 records having status as 'na'. The stored procedure can be called by multiple nodes in an environment and I don't want them to fetch duplicate data. So, I want to lock the stored procedure while it is executing and set the status of the records obtained from the stored procedure to 'In Progress' and return that table and then release the lock, so that different nodes don't fetch the same data. How would I accomplish this?
There is already a solution provided for a similar question in MS SQL, but it shows errors when used in Sybase.
Assuming Sybase ASE ...
The bigger issue you'll likely want to consider is whether you want a single process to lock the entire table while you're grabbing your top 100 rows, or if you want other processes to still access the table?
Another question is whether you'd like multiple processes to concurrently pull 100 rows from the table without blocking each other?
I'm going to assume that you a) don't want to lock the entire table and b) you may want to allow multiple processes to concurrently pull rows from the table.
1 - if possible, make sure the table is using datarows locking (default is usually allpages); this will reduce the granularity of locks to the row level (as opposed to page level for allpages); the table will need to be datarows if you want to allow multiple processes to concurrently find/update rows in the table
2 - make sure the lock escalation setting on the table is high enough to ensure a single process's 100 row update doesn't lock the table (sp_setpglockpromote for allpages, sp_setrowlockpromote for datarows); the key here is to make sure your update doesn't escalate to a table-level lock!
3 - when it comes time to grab your set of 100 rows you'll want to ... inside a transaction ... update the 100 rows with a status value that's unique to your session, select the associated id's, then update the status again to 'In Progress'
The gist of the operation looks like the following:
declare @mysession varchar(10)
select @mysession = convert(varchar(10), @@spid)  -- replace @@spid with anything that
                                                  -- uniquely identifies your session
set rowcount 100      -- limit the update to 100 rows
begin tran get_my_rows
-- start with an update so that we get exclusive access to the desired rows;
-- update the first 100 rows you find with your @@spid
update mytable
set    status = @mysession  -- need to distinguish your locked rows from
                            -- other processes; if we used 'In Progress'
                            -- we wouldn't be able to distinguish between
                            -- rows updated earlier in the day or updated
                            -- by other/concurrent processes
from   mytable readpast     -- 'readpast' allows your query to skip over
                            -- locks held by other processes but it only
                            -- works for datarows tables
where  status = 'na'
-- select your reserved id's and send back to the client/calling process
select id
from   mytable
where  status = @mysession
-- update your rows with a status of 'In Progress'
update mytable
set    status = 'In Progress'
where  status = @mysession
commit          -- close out txn and release our locks
set rowcount 0  -- set back to default of 'unlimited' rows
Potential issues:
if your table is large and you don't have an index on status then your queries could take longer than necessary to run; by making sure lock escalation is high enough and you're using datarows locking (so the readpast works) you should see minimal blocking of other processes regardless of how long it takes to find the desired rows
with an index on the status column, consider that all of these updates are going to force a lot of index updates which is probably going to lead to some expensive deferred updates
if using datarows and your lock escalation is too low then an update could lock the entire table, which would cause another (concurrent) process to readpast the table lock and find no rows to process
if using allpages you won't be able to use readpast so concurrent processes will block on your locks (ie, they won't be able to read around your lock)
if you've got an index on status, and several concurrent processes locking different rows in the table, there could be a chance for deadlocks to occur (likely in the index tree of the index on the status column) which in turn would require your client/application to be coded to expect and address deadlocks
To think about:
if the table is relatively small such that table scanning isn't a big cost, you could drop any index on the status column and this should reduce the performance overhead of deferred updates (related to updating the indexes)
if you can work with a session-specific status value (eg, 'In Progress - @mysession') then you could eliminate the 2nd update statement (could come in handy if you're incurring deferred updates on an indexed status column)
if you have another column(s) in the table that you could use to uniquely identify your session's rows (eg, last_updated_by_spid = @@spid, last_updated_date = @mydate - where @mydate is initially set to getdate()) then your first update could set status = 'In Progress', the select would use @@spid and @mydate for the where clause, and the second update would not be needed [NOTE: This is, effectively, the same thing Gordon is trying to address with his session column.]
assuming you can work with a session-specific status value, consider using something that will allow you to track, and fix, orphaned rows (eg, row status remains 'In Progress - @mysession' because the calling process died and never came back to (re)set the status)
if you can pass the id list back to the calling program as a single string of concatenated id values you could use the method I outline in this answer to append the id's into a @variable during the first update, allowing you to set status = 'In Progress' in the first update and also allowing you to eliminate the select and the second update
how would you tell which rows have been orphaned? you may want the ability to update a (small)datetime column with the getdate() of when you issued your update; then, if you would normally expect the status to be updated within, say, 5 minutes, you could have a monitoring process that looks for orphaned rows where status = 'In Progress' and it's been more than, say, 10 minutes since the last update (see the sketch below)
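For that orphan-detection idea, a monitoring query might look something like this (a sketch; the last_updated_date column is an assumption matching the point above):
select id
from   mytable
where  status like 'In Progress%'
  and  datediff(mi, last_updated_date, getdate()) > 10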
If the datarows, readpast, lock escalation settings and/or deadlock potential is too much, and you can live with brief table-level locks on the table, you could have the process obtain an exclusive table level lock before performing the update and select statements; the exclusive lock would need to be obtained within a user-defined transaction in order to 'hold' the lock for the duration of your work; a quick example:
begin tran get_my_rows
-- request an exclusive table lock; wait until it's granted
lock table mytable in exclusive mode
update ...
select ...
update ...
commit
I'm not 100% sure how to do this in Sybase. But, the idea is the following.
First, add a new column to the table that represents the session or connection used to change the data. You will use this column to provide isolation.
Then, update the rows:
update top (100) t
set status = 'in progress',
    session = @session
where status = 'na'
order by ?; -- however you define the "top" records
Then, you can return or process the 100 ids that are "in progress" for the given connection.
Create another table, proc_lock, that has one row
When control enters the stored procedure, start a transaction and do a select for update on the row in proc_lock (see this link). If that doesn't work for Sybase, then you could try the technique from this answer to lock the row.
Before the procedure exits, make sure to commit the transaction.
This will ensure that only one user can execute the proc at a time. When a second user tries to execute the proc, it will block until the first user's lock on the proc_lock row is released (e.g. when the transaction is committed).
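A minimal sketch of that flow in T-SQL-style syntax (the proc_lock table and its last_taken column are assumptions, and the exact locking statement may need adjusting for Sybase):
begin tran
    -- take an exclusive lock on the single lock row; a second caller blocks here
    update proc_lock set last_taken = getdate()

    -- ... body of the procedure: claim the top 100 'na' rows,
    --     set them to 'In Progress', and return them ...

commit  -- releasing the lock lets the next caller proceed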

Deadlock when using SELECT FOR UPDATE

I noticed that concurrent execution of simple and identical queries similar to
BEGIN;
SELECT files.data FROM files WHERE files.file_id = 123 LIMIT 1 FOR UPDATE;
UPDATE files SET ... WHERE files.file_id = 123;
COMMIT;
lead to a deadlock, which is surprising to me since it looks like such queries should not create a deadlock. Also, it usually takes only milliseconds to complete such a request. During such a deadlock situation, if I run:
SELECT blockeda.pid AS blocked_pid, blockeda.query AS blocked_query,
       blockinga.pid AS blocking_pid, blockinga.query AS blocking_query
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON (blockingl.transactionid = blockedl.transactionid
                                       AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
WHERE NOT blockedl.granted;
I see both of my identical select statements listed as blocked_pid and blocking_pid for the whole duration of the deadlock.
So my question is: Is it normal and expected for queries that try to select same row FOR UPDATE to cause deadlock? And if so, what is the best strategy to avoid deadlocking in this scenario?
Your commands contradict each other.
If files.file_id is defined UNIQUE (or PRIMARY KEY), you don't need LIMIT 1. And you don't need explicit locking at all. Just run the UPDATE, since only a single row is affected in the whole transaction, there cannot be a deadlock. (Unless there are side effects from triggers or rules or involved functions.)
If files.file_id is not UNIQUE (like it seems), then the UPDATE can affect multiple rows in arbitrary order and only one of them is locked, a recipe for deadlocks. The more immediate problem would then be that the query does not do what you seem to want to begin with.
The best solution depends on missing information. This would work:
UPDATE files
SET    ...
WHERE  primary_key_column = (
         SELECT primary_key_column
         FROM   files
         WHERE  file_id = 123
         LIMIT  1
         -- FOR UPDATE SKIP LOCKED
       );
No BEGIN; and COMMIT; needed for the single command, while default auto-commit is enabled.
You might want to add FOR UPDATE SKIP LOCKED (or FOR UPDATE NOWAIT) to either skip or report an error if the row is already locked.
And you probably want to add a WHERE clause that avoids processing the same row repeatedly.
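Putting both suggestions together might look like this (a sketch; the boolean column processed is an assumption standing in for whatever marks a row as already handled):
UPDATE files
SET    processed = true
WHERE  primary_key_column = (
         SELECT primary_key_column
         FROM   files
         WHERE  file_id = 123
           AND  NOT processed          -- avoid picking rows handled already
         LIMIT  1
         FOR UPDATE SKIP LOCKED        -- skip rows locked by concurrent transactions
       );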
More here:
Postgres UPDATE … LIMIT 1

Running large queries in the background MS SQL

I am using MS SQL Server 2008
I have a table which is constantly in use (data is always changing and being inserted into it);
it now contains ~70 million rows.
I am trying to run a simple query over the table with a stored procedure that will probably take a few days.
I need the table to remain usable. Now I have executed the stored procedure, and after a while every simple select-by-identity query that I try to run on the table stops responding or runs for so long that I kill it.
what should I do?
here is how my stored procedure looks like:
SET NOCOUNT ON;
update SOMETABLE
set
[some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
WHERE
[some_col] = 243
even if I try it with this added to the WHERE clause (with AND logic):
ID_COL > 57000000 and ID_COL < 60000000 and
it still doesn't work
BTW, SomeFunction does some simple mathematical operations and looks up rows in another table that contains about 300k items but is never changed.
From my perspective your server has a serious performance problem. Even if we assume that none of the records in the query
select some_col from SOMETABLE with (nolock) where id_col between 57000000 and 57001000
was in memory, it shouldn't take 21 seconds to read the few pages sequentially from disk (your clustered index on the id_col should not be fragmented if it's an auto-identity and you didn't do something stupid like adding a "desc" to the index definition).
But if you can't/won't fix that, my advice would be to make the update in small packages like 100-1000 records at a time (depending on how much time the lookup function consumes). One update/transaction should take no more than 30 seconds.
You see each update keeps an exclusive lock on all the records it modified until the transaction is complete. If you don't use an explicit transaction, each statement is executed in a single, automatic transaction context, so the locks get released when the update statement is done.
But you can still run into deadlocks that way, depending on what the other processes do. If they modify more than one record at a time, too, or even if they gather and hold read locks on several rows, you can get deadlocks.
To avoid the deadlocks, your update statement needs to take a lock on all the records it will modify at once. The way to do this is to place the single update statement (with only the few rows limited by the id_col) in a serializable transaction like
IF @@TRANCOUNT > 0
    -- Error: You are in a transaction context already

SET NOCOUNT ON
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

-- Insert loop here to work "x" through the id range
BEGIN TRANSACTION
    UPDATE SOMETABLE
    SET [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
    WHERE [some_col] = 243 AND id_col BETWEEN x AND x + 500 -- or whatever keeps the update in the small time range
COMMIT
-- Next loop

-- Get all new records added while you were running the loop. If these are too many you may have to paginate this also:
BEGIN TRANSACTION
    UPDATE SOMETABLE
    SET [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
    WHERE [some_col] = 243 AND id_col >= x
COMMIT
For each update this will take an update/exclusive key-range lock on the given records (but only on them, because you limit the update through the clustered index key). It will wait for any other updates on the same records to finish, then get its lock (causing blocking for all other transactions, but still only for the given records), then update the records and release the lock.
The last extra statement is important, because it will take a key range lock up to "infinity" and thus prevent even inserts on the end of the range while the update statement runs.
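One possible shape for the loop that the "-- Insert loop here" comment leaves open (a sketch; the id bounds and batch size are placeholders to adjust, not values from the question):
SET NOCOUNT ON
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

DECLARE @x BIGINT = 57000000,   -- start of the id range
        @max BIGINT = 60000000, -- end of the id range
        @batch INT = 500

WHILE @x <= @max
BEGIN
    BEGIN TRANSACTION
        UPDATE SOMETABLE
        SET [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
        WHERE [some_col] = 243 AND id_col BETWEEN @x AND @x + @batch
    COMMIT

    SET @x = @x + @batch + 1
END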

Updating 100k records in one update query

Is it possible, or recommended at all, to run one update query that will update nearly 100k records at once?
If so, how can I do that? I am trying to pass an array to my stored proc, but it doesn't seem to work. This is my SP:
CREATE PROCEDURE [dbo].[UpdateAllClients]
    @ClientIDs varchar(max)
AS
BEGIN
    DECLARE @vSQL varchar(max)
    SET @vSQL = 'UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN (' + @ClientIDs + ')';
    EXEC(@vSQL);
END
I have no idea what's not working, but it's just not updating the relevant rows.
Anyone?
The UPDATE is reading your @ClientIDs (a comma-separated value) as a whole. To illustrate it more, you are doing it like this:
assume @ClientIDs = 1,2,3,4,5
your UPDATE command is interpreting it like this
UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN ('1,2,3,4,5')
and not
UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN (1,2,3,4,5)
One suggestion for your question is to use a subquery in your UPDATE, for example:
UPDATE Clients
SET LastUpdate = GETDATE()
WHERE ID IN
(
    SELECT ID
    FROM tableName
    -- where condition
)
Hope this makes sense.
A few notes to be aware of.
Big updates like this can lock up the target table. If > 5000 rows are affected by the operation, the individual row locks will be promoted to a table lock, which would block other processes. Worth bearing in mind if this could cause an issue in your scenario. See: Lock Escalation
With a large number of rows to update like this, an approach I'd consider is (basic):
bulk insert the 100K Ids into a staging table (e.g. from .NET, use SqlBulkCopy)
update the target table, using a join onto the above staging table (see the sketch below)
drop the staging table
This gives you some more room for controlling the process, e.g. breaking the workload up into chunks and doing it x rows at a time.
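A rough sketch of step 2 from the list above (the staging table name #ClientIdsStaging is made up for illustration):
UPDATE c
SET    c.LastUpdate = GETDATE()
FROM   dbo.Clients c
JOIN   #ClientIdsStaging s ON s.ID = c.ID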
There is a limit on the number of items you can pass to IN if you are giving an array.
So, if you just want to update the whole table, skip the IN condition.
If not, specify a SQL query inside IN. That should do the job.
The database will very likely reject that SQL statement because it is too long.
When you need to update so many records at once, then maybe your database schema isn't appropriate. Maybe the LastUpdate datum should not be stored separately for each client but only once globally or only once for a constant group of clients?
But it's hard to recommend a good course of action without seeing the whole picture.
What version of SQL Server are you using? If it is 2008+ I would recommend using TVPs (table-valued parameters - http://msdn.microsoft.com/en-us/library/bb510489.aspx). The transfer of data will be faster (as opposed to building a huge string) and your query would look nicer:
update c
set lastupdate = getdate()
from clients c
join @mytvp t on c.Id = t.Id
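The plumbing that snippet assumes might be set up roughly like this (a sketch; the type and parameter names are made up for illustration):
CREATE TYPE dbo.ClientIdList AS TABLE (Id INT PRIMARY KEY);
GO
CREATE PROCEDURE dbo.UpdateAllClients
    @Ids dbo.ClientIdList READONLY
AS
BEGIN
    UPDATE c
    SET    c.LastUpdate = GETDATE()
    FROM   dbo.Clients c
    JOIN   @Ids t ON c.Id = t.Id;
END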
Each SQL statement on its own is a transaction. That means SQL Server is going to grab locks for all these millions of rows, which can really degrade the performance of the table. So you generally don't want to update a table that has millions of rows in a single statement, because it hurts performance. The workaround is to set ROWCOUNT before the DML operation:
set rowcount 100
UPDATE Clients SET LastUpdate=GETDATE()
WHERE ID IN (1,2,3,4,5)
set rowcount 0
or from SQL Server 2008 you can parameterize the TOP keyword:
Declare @value int
set @value = 100000
again:
UPDATE top (@value) Clients SET LastUpdate=GETDATE()
WHERE ID IN (1,2,3,4,5)
if @@rowcount != 0 goto again
See how long the above query takes, then adjust the value of the variable. You need to break the task into smaller units, as suggested in the answers above.
Method 1:
Split @ClientIDs on the ',' delimiter,
put the values in an array and iterate over that array,
updating the Clients table for each id (one way to do the split is sketched below).
OR
Method 2:
Instead of taking @ClientIDs as a varchar, follow the steps below:
create a table type for the ids and use a join.
For faster processing you can also create an index on ClientID.
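For Method 1, one way to do the split without dynamic SQL or a manual loop is STRING_SPLIT on SQL Server 2016+ (a sketch; the column types are assumptions):
UPDATE c
SET    c.LastUpdate = GETDATE()
FROM   Clients c
JOIN   STRING_SPLIT(@ClientIDs, ',') s ON c.ID = CAST(s.value AS INT)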