How can I read dirty values in SQL UPDATE statement WHERE clause - sql

Let's assume I have the following query in two separate SSMS query windows:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
BEGIN TRANSACTION
UPDATE dbo.Jobs
SET [status] = 'Running'
OUTPUT Inserted.*
WHERE [status] = 'Waiting'
--I'm NOT committing yet
--Commit Transaction
I run query window 1 (but do not commit), and then I run query window 2.
I want query window 2 to immediately update only the rows that were inserted after I started query 1 (all new records come in with a status of 'Waiting'). However, SQL Server waits for the first query to finish, because an UPDATE statement does not read dirty values in its WHERE clause (even when the session is set to READ UNCOMMITTED).
Is there a way to overcome this?
In my application I will have two (or more) processes running this query. I want process 2 to be able to pick up the rows that process 1 has not picked up; I don't want process 2 to have to wait until process 1 is finished.

What you are asking for is simply impossible.
Even at the lowest isolation level of READ UNCOMMITTED (aka NOLOCK), an X-Lock (exclusive) must be taken in order to make modifications. In other words, writes are always locked, even if the reads that fetched those rows were not locked.
So even though session 2 is running under READ UNCOMMITTED also, if it wants to do a modification it must also take an X-Lock, which is incompatible with the first X-Lock.
The solution here is to either do this in one session, or commit immediately. In any case, do not hold locks for any length of time, as it can cause massive blocking chains and even deadlocks.
If you want to just ignore all those rows which have been inserted, you could use the WITH (READPAST) hint.
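For example, a minimal sketch of the original statement with that hint (so session 2 simply skips rows that session 1 has locked rather than waiting on them):
UPDATE dbo.Jobs WITH (READPAST)
SET [status] = 'Running'
OUTPUT Inserted.*
WHERE [status] = 'Waiting'
Each session then picks up a disjoint set of 'Waiting' rows, which is usually what you want for a simple job queue.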
READ UNCOMMITTED as an isolation level or as a hint has huge issues.
It can cause anything from deadlocks to completely incorrect results. For example, you could read a row twice, or not at all, when by the logical definition of the schema there should have been exactly one row. You could read entire pages twice or not at all.
You can get deadlocks due to U-Locks not being taken in UPDATE and DELETE statements.
And you still take schema locks, so you can still get stuck behind a synchronous statistics update or an index rebuild.

Related

Understanding locks and query status in Snowflake (multiple updates to a single table)

While using the Python connector for Snowflake with queries of the form
UPDATE X.TABLEY SET STATUS = %(status)s, STATUS_DETAILS = %(status_details)s WHERE ID = %(entry_id)s
, sometimes I get the following message:
(snowflake.connector.errors.ProgrammingError) 000625 (57014): Statement 'X' has locked table 'XX' in transaction 1588294931722 and this lock has not yet been released.
and soon after that
Your statement 'X' was aborted because the number of waiters for this lock exceeds the 20 statements limit
This usually happens when multiple queries are trying to update a single table. What I don't understand is that the query history in Snowflake says the query finished successfully (Succeeded status), but in reality the update never happened, because the table was not altered.
So according to https://community.snowflake.com/s/article/how-to-resolve-blocked-queries I used
SELECT SYSTEM$ABORT_TRANSACTION(<transaction_id>);
to release the lock, but still nothing happened, and even with the Succeeded status the query seems not to have executed at all. So my question is, how does this really work, and how can a lock be released without losing the execution of the query? (Also, what happens to the other 20+ queries that are queued because of the lock? Sometimes it seems that when the lock is released, the next one takes the lock and has to be aborted as well.)
I would appreciate it if you could help me. Thanks!
Not sure if Sergio got an answer to this. The problem in this case is not with the table. Based on my experience with Snowflake, below is my understanding.
In Snowflake, every table operation also involves a change in the metadata table that keeps track of micro-partitions and their min/max values. This metadata table supports only 20 concurrent DML statements by default. So if a table is continuously being updated and hit at the same partition, there is a chance that this limit will be exceeded. In that case, we should look at redesigning the table update/insert logic. In one of our use cases, we increased the limit to 50 after speaking to the Snowflake support team.
UPDATE, DELETE, and MERGE cannot run concurrently on a single table; they will be serialized, as only one can take a lock on a table at a time. The others queue up in the "blocked" state until it is their turn to take the lock. There is a limit on the number of queries that can be waiting on a single lock.
If you see an update finish successfully but don't see the updated data in the table, then you are most likely not COMMITting your transactions. Make sure you run COMMIT after an update so that the new data is committed to the table and the lock is released.
Alternatively, you can make sure AUTOCOMMIT is enabled so that DML will commit automatically after completion. You can enable it with ALTER SESSION SET AUTOCOMMIT=TRUE; in any sessions that are going to run an UPDATE.
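A minimal sketch of the explicit-commit pattern (the status values and id here are just placeholders):
BEGIN;
UPDATE X.TABLEY SET STATUS = 'DONE', STATUS_DETAILS = 'ok' WHERE ID = 42;
COMMIT; -- releases the table lock so the next queued statement can take it
With AUTOCOMMIT enabled you can skip the explicit BEGIN/COMMIT for single-statement updates.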

SQL Server Update Locks

If you have the following SQL, is it possible that, if it is run multiple times by many different processes at exactly the same time, two or more processes may update the table?
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
UPDATE table
SET Column1 = 1
WHERE Column1 = 0
No other locks etc. are specified in the SQL, other than Read Uncommitted.
I'm trying to track down an issue, and I'm now clutching at straws...
Got this from MSDN.
Transactions running at the READ UNCOMMITTED level do not issue shared locks to prevent other transactions from modifying data read by the current transaction. READ UNCOMMITTED transactions are also not blocked by exclusive locks that would prevent the current transaction from reading rows that have been modified but not committed by other transactions. When this option is set, it is possible to read uncommitted modifications, which are called dirty reads. Values in the data can be changed and rows can appear or disappear in the data set before the end of the transaction. This option has the same effect as setting NOLOCK on all tables in all SELECT statements in a transaction. This is the least restrictive of the isolation levels.
So basically, this is equivalent to SQL Server's NOLOCK hint. It might result in dirty reads: if some process is updating 1000 records and has only updated 500 so far, another process reading that data could see it in an inconsistent state. It also lets an UPDATE run without being blocked by shared locks from SELECT queries, since those selects do not take shared locks.
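For reads, the hint form of the same thing would look like this (a sketch using the table and column names from the question; the hint only changes how rows are read, the UPDATE itself still takes exclusive locks):
SELECT * FROM [table] WITH (NOLOCK) WHERE Column1 = 0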
Hope this makes sense. For reference -- MSDN

SELECT during a lengthy UPDATE - What happens to SELECT for different Transaction Isolation Levels and SELECT WITH (NOLOCK)?

Given an UPDATE execution which takes 5 minutes or so, what happens when SELECT tries to retrieve data from the same table? For different Transaction Isolation Levels and SELECT WITH (NOLOCK), does SELECT wait for UPDATE? If not, does SELECT return old data (data before the UPDATE) or part of the currently inserted records (such as 50% of the records currently being inserted) ?
I found the following question, but it only describes what happens when you execute an UPDATE during a long SELECT.
SQL Server - does [SELECT] lock [UPDATE]?
I am using MS SQL Server 2012. Hopefully, this behaviour is consistent for different implementations.
This post by Gavin Draper explains it quite well and contains some example queries.
SQL Server Isolation Levels By Example
Isolation levels in SQL Server control the way locking works between
transactions.
SQL Server 2008 supports the following isolation levels
Read Uncommitted
Read Committed (The default)
Repeatable Read
Serializable
Snapshot
Before I run through each of these in detail you may want to create a
new database to run the examples, run the following script on the new
database to create the sample data. Note : You’ll also want to drop
the IsolationTests table and re-run this script before each example to
reset the data.
CREATE TABLE IsolationTests
(
Id INT IDENTITY,
Col1 INT,
Col2 INT,
Col3 INT
)
INSERT INTO IsolationTests(Col1,Col2,Col3)
SELECT 1,2,3
UNION ALL SELECT 1,2,3
UNION ALL SELECT 1,2,3
UNION ALL SELECT 1,2,3
UNION ALL SELECT 1,2,3
UNION ALL SELECT 1,2,3
UNION ALL SELECT 1,2,3
Also before we go any further it is important to understand these two terms:
Dirty Reads – This is when you read uncommitted data; when doing this there is no guarantee that the data read will ever be committed, meaning the data could well be bad.
Phantom Reads – This is when data that you are working with has been changed by another transaction since you first read it in. This means subsequent reads of this data in the same transaction could well be different.
Read Uncommitted
This is the lowest isolation level there is. Read uncommitted causes
no shared locks to be requested which allows you to read data that is
currently being modified in other transactions. It also allows other
transactions to modify data that you are reading.
As you can probably imagine this can cause some unexpected results in
a variety of different ways. For example data returned by the select
could be in a half way state if an update was running in another
transaction causing some of your rows to come back with the updated
values and some not to.
To see read uncommitted in action lets run Query1 in one tab of
Management Studio and then quickly run Query2 in another tab before
Query1 completes.
Query1
BEGIN TRAN
UPDATE IsolationTests SET Col1 = 2
--Simulate having some intensive processing here with a wait
WAITFOR DELAY '00:00:10'
ROLLBACK
Query2
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SELECT * FROM IsolationTests
Notice that Query2 will not wait for Query1 to finish, and more importantly Query2 returns dirty data. Remember Query1 rolls back all its changes, yet Query2 has returned that data anyway; this is because it didn't wait for the other transactions holding exclusive locks on this data, it just returned what was there at the time.
There is a syntactic shortcut for querying data using the read
uncommitted isolation level by using the NOLOCK table hint. You
could change the above Query2 to look like this and it would do the
exact same thing.
SELECT * FROM IsolationTests WITH(NOLOCK)
Read Committed
This is the default isolation level and means selects will only return committed data. Select statements issue shared lock requests against the data you're querying; this causes you to wait if another transaction already has an exclusive lock on that data. Once you have your shared lock, any other transaction trying to modify that data will request an exclusive lock and be made to wait until your Read Committed transaction finishes.
You can see an example of a read transaction waiting for a modify
transaction to complete before returning the data by running the
following Queries in separate tabs as you did with Read Uncommitted.
Query1
BEGIN TRAN
UPDATE IsolationTests SET Col1 = 2
--Simulate having some intensive processing here with a wait
WAITFOR DELAY '00:00:10'
ROLLBACK
Query2
SELECT * FROM IsolationTests
Notice how Query2 waited for the first transaction to complete before
returning and also how the data returned is the data we started off
with as Query1 did a rollback. The reason no isolation level was
specified is because Read Committed is the default isolation level for
SQL Server. If you want to check what isolation level you are running
under you can run DBCC useroptions. Remember isolation levels are
Connection/Transaction specific so different queries on the same
database are often run under different isolation levels.
Repeatable Read
This is similar to Read Committed but with the additional guarantee
that if you issue the same select twice in a transaction you will get
the same results both times. It does this by holding on to the shared
locks it obtains on the records it reads until the end of the
transaction. This means any transactions that try to modify these
records are forced to wait for the read transaction to complete.
As before run Query1 then while its running run Query2
Query1
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
BEGIN TRAN
SELECT * FROM IsolationTests
WAITFOR DELAY '00:00:10'
SELECT * FROM IsolationTests
ROLLBACK
Query2
UPDATE IsolationTests SET Col1 = -1
Notice that Query1 returns the same data for both selects even though you ran a query to modify the data before the second select ran. This is because the Update query was forced to wait for Query1 to finish, due to the shared locks that Query1 held until the end of its transaction because you specified Repeatable Read.
If you rerun the above Queries but change Query1 to Read Committed you
will notice the two selects return different data and that Query2 does
not wait for Query1 to finish.
One last thing to know about Repeatable Read is that the data can
change between 2 queries if more records are added. Repeatable Read
guarantees records queried by a previous select will not be changed or
deleted, it does not stop new records being inserted so it is still
very possible to get Phantom Reads at this isolation level.
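To see this for yourself (this demo is not in the original article), rerun Query1 above at REPEATABLE READ and, while it is waiting, run an insert in another tab:
INSERT INTO IsolationTests(Col1,Col2,Col3)
VALUES (100,100,100)
The insert is not blocked, and the second SELECT in Query1 returns one more row than the first, which is a phantom read.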
Serializable
This isolation level takes Repeatable Read and adds the guarantee that
no new data will be added eradicating the chance of getting Phantom
Reads. It does this by placing range locks on the queried data. This
causes any other transactions trying to modify or insert data touched
on by this transaction to wait until it has finished.
You know the drill by now run these queries side by side…
Query1
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRAN
SELECT * FROM IsolationTests
WAITFOR DELAY '00:00:10'
SELECT * FROM IsolationTests
ROLLBACK
Query2
INSERT INTO IsolationTests(Col1,Col2,Col3)
VALUES (100,100,100)
You’ll see that the insert in Query2 waits for Query1 to complete
before it runs eradicating the chance of a phantom read. If you change
the isolation level in Query1 to repeatable read, you’ll see the
insert no longer gets blocked and the two select statements in Query1
return a different amount of rows.
Snapshot
This provides the same guarantees as serializable. So what's the difference? Well, it's more in the way it works: using snapshot doesn't block other queries from inserting or updating the data touched by the snapshot transaction. Instead, row versioning is used, so when data is changed the old version is kept in tempdb and existing transactions see the version without the change. When all transactions that started before the changes are complete, the previous row version is removed from tempdb. This means that even if another transaction has made changes, you will always get the same results as you did the first time in that transaction.
So on the plus side you're not blocking anyone else from modifying the data whilst you run your transaction, but you're using extra resources on the SQL Server to hold multiple versions of your changes.
To use the snapshot isolation level you need to enable it on the
database by running the following command
ALTER DATABASE IsolationTests
SET ALLOW_SNAPSHOT_ISOLATION ON
If you rerun the examples from serializable but change the isolation
level to snapshot you will notice that you still get the same data
returned but Query2 no longer waits for Query1 to complete.
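For example, the serializable Query1 rerun under snapshot would look like this (a sketch; it assumes ALLOW_SNAPSHOT_ISOLATION has been turned on as above):
SET TRANSACTION ISOLATION LEVEL SNAPSHOT
BEGIN TRAN
SELECT * FROM IsolationTests
WAITFOR DELAY '00:00:10'
SELECT * FROM IsolationTests
ROLLBACK
Query2's insert completes immediately, yet both selects in Query1 return the same rows as they existed when the snapshot transaction first read the data.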
Summary
You should now have a good idea how each of the different isolation
levels work. You can see how the higher the level you use the less
concurrency you are offering and the more blocking you bring to the
table. You should always try to use the lowest isolation level you can
which is usually read committed.
READ UNCOMMITTED: The SELECT can read all kinds of nasty inconsistencies. Old rows, new rows, duplicate rows, missing rows. It can also totally error out with the famous "data movement" error.
READ COMMITTED: Will block without snapshot isolation. Will return the old state with snapshot isolation in perfect consistency.
REPEATABLE READ/SERIALIZABLE: Will block.
SNAPSHOT: Will not block; will return the old state in perfect consistency.
It sounds like you should read a few concurrency tutorials. I have written these brief facts to get you started. To really understand what's going on to the point that you can make predictions (that come true) you need to go deeper than an answer on Stack Overflow can provide.
Most of the time, you want to use SNAPSHOT for read-only transactions. It takes away all concurrency concerns. Be aware that it has a few drawbacks.
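If you want the READ COMMITTED behaviour described above (readers see the last committed version instead of blocking), the usual switch is the database-level option; a sketch, with a placeholder database name:
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON
For explicit SNAPSHOT transactions you also need ALLOW_SNAPSHOT_ISOLATION ON, as shown earlier.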

Check if table data has changed?

I am pulling the data from several tables and then passing the data to a long running process. I would like to be able to record what data was used for the process and then query the database to check if any of the tables have changed since the process was last run.
Is there a method of solving this problem that should work across all sql databases?
One possible solution that I've thought of is having a separate table that is only used for keeping track of whether the data has changed since the process was run. The table contains a "stale" flag. When I start running the process, stale is set to false. If any creation, update, or deletion occurs in any of the tables on which the operation depends, I set stale to true. Is this a valid solution? Are there better solutions?
One concern with my solution is situations like this:
One user starts inserting a new row into one of the tables. Stale gets set to true, but the new row has not actually been added yet. Another user has simultaneously started the long running process, pulling the data from the tables and setting the flag to false. The row is finally added. Now the data used for the process is out of date but the flag indicates it is not stale. Would transactions be able to solve this problem?
EDIT:
This is some SQL for my idea. Not sure if it works, but just to give you a better idea of what I was thinking:
-- First transaction reads the data and sets the flag to false
BEGIN TRANSACTION
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
UPDATE flag SET stale = 0
SELECT * FROM DATATABLE
COMMIT TRANSACTION
-- Second transaction updates the data and sets the flag to true
BEGIN TRANSACTION
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
UPDATE data SET val = 15 WHERE ID = 10
UPDATE flag SET stale = 1
COMMIT TRANSACTION
I do not have much experience with transactions or hand-writing SQL, so there are probably issues with this. From what I understand, two serializable transactions cannot be interleaved. Please correct me if I'm wrong.
Is there a way to accomplish this with only the first transaction? The process will be run rarely, but the updates to the data table will occur more frequently, so it would be nice to not lock up the data table when performing updates.
Also, is the SET TRANSACTION ISOLATION syntax specific to MS?
The stale flag will probably work, but a timestamp would be better since it provides more metadata about the age of the records which could be used to tune your queries, e.g., only pull data that is over 5 minutes old.
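For instance, a hypothetical tracking table keyed by table name could look like this (the exact date/time column type varies by database):
CREATE TABLE table_versions
(
table_name VARCHAR(128) PRIMARY KEY,
last_changed TIMESTAMP NOT NULL
)
-- every writer updates the row for the table it touched
UPDATE table_versions SET last_changed = CURRENT_TIMESTAMP WHERE table_name = 'DATATABLE'
The long-running process then records the time at which it pulled its data and later compares it against last_changed.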
To address your concern about inserting a row at the same time a query is run, transactions with an appropriate isolation level will help. For row inserts, updates, and selects, at least use a transaction with an isolation level that prevents dirty reads so that no other connections can see the updated data until the transaction is committed.
If you are strongly concerned about the case where an update happens at the same time as a record pull, you could use the REPEATABLE READ or even SERIALIZABLE isolation levels, but this will slow DB access down.
Your SQL Server sample should work. For other databases, here's an example that works in Postgres:
Transaction 1
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- run queries that update the tables, then set last_updated column
UPDATE sometable SET last_updated = now() WHERE id = 1;
COMMIT;
Transaction 2
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- select data from tables, then set last_queried column
UPDATE sometable SET last_queried = now() WHERE id = 1;
COMMIT;
If transaction 1 starts, and then transaction 2 starts before transaction 1 has completed, transaction 2 will block on the update and then throw an error when transaction 1 is committed. If transaction 2 starts first, and transaction 1 starts before it has finished, then transaction 1 will error. Your application code or process should be able to handle those errors.
Other databases use similar syntax - MySQL (with InnoDB plugin) requires you to set the isolation level before you start the transaction.
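For MySQL/InnoDB the same pattern would look roughly like this (a sketch reusing the hypothetical sometable/last_updated names from above):
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
-- update the tables, then record the time of the change
UPDATE sometable SET last_updated = NOW() WHERE id = 1;
COMMIT;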

Why deadlock occurs?

I use a small transaction which consists of two simple queries: select and update:
SELECT * FROM XYZ WHERE ABC = DEF
and
UPDATE XYZ SET ABC = 123
WHERE ABC = DEF
It quite often happens that the transaction is started by two threads, and depending on the isolation level a deadlock occurs (REPEATABLE READ, SERIALIZABLE). Both transactions try to read and update exactly the same row.
I'm wondering why this happens. What is the order of queries that leads to the deadlock? I've read a bit about locks (shared, exclusive) and how long locks last for each isolation level, but I still don't fully understand...
I've even prepared a simple test which always results in a deadlock. I looked at the results of the test in SSMS and SQL Server Profiler. I started the first query and then immediately the second.
First query:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
SELECT ...
WAITFOR DELAY '00:00:04'
UPDATE ...
COMMIT
Second query:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
SELECT ...
UPDATE ...
COMMIT
Now I'm not able to show you detailed logs, but it looks more or less like this (I've very likely missed Lock:Deadlock etc. somewhere):
(1) SQL:BatchStarting: First query
(2) SQL:BatchStarting: Second query
(3) Lock:timeout for second query
(4) Lock:timeout for first query
(5) Deadlock graph
If I understand locks well, in (1) the first query takes a shared lock (to execute the SELECT), then goes to sleep and keeps the shared lock until the end of the transaction. In (2) the second query also takes a shared lock (SELECT) but cannot take an exclusive lock (UPDATE) while there are shared locks on the same row, which results in the Lock:timeout. But I can't explain why the timeout for the second query occurs. Probably I don't understand the whole process well. Can anybody give a good explanation?
I haven't noticed deadlocks using ReadCommitted but I'm afraid they may occur.
What solution do you recommend?
A deadlock occurs when two or more tasks permanently block each other by each task having a lock on a resource which the other tasks are trying to lock
http://msdn.microsoft.com/en-us/library/ms177433.aspx
"But I can't explain why timeout for second query occurs."
Because the first query holds a shared lock, the UPDATE in the second query cannot get its exclusive lock and has to wait. The UPDATE in the first query is blocked in the same way by the second query's shared lock. So the first and second queries are both sleeping, each waiting for the other to wake up: that is a deadlock, which here shows up as the timeouts :-)
In MySQL it works better: the deadlock is detected immediately and one of the transactions is rolled back (you do not need to wait for a timeout :-)).
Also, in mysql, you can do the following to prevent deadlock:
select ... for update
which will put a write lock (i.e. an exclusive lock) in place right from the beginning of the transaction, and this way you avoid the deadlock situation! Perhaps you can do something similar in your database engine.
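In SQL Server a comparable approach (not mentioned in the original answer, but standard T-SQL) is to take an update lock on the initial SELECT, so the later UPDATE never has to convert a shared lock into an exclusive one:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
SELECT * FROM XYZ WITH (UPDLOCK) WHERE ABC = DEF
UPDATE XYZ SET ABC = 123 WHERE ABC = DEF
COMMIT
Only one session at a time can hold the update lock on a given row, so the second transaction simply waits instead of deadlocking.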
For MSSQL there is a mechanism to prevent deadlocks. What you need here is called the WITH NOLOCK hint.
In 99.99% of the cases of SELECT statements it's usable and there is no need to bundle the SELECT with the UPDATE. There is also no need to put a SELECT into a transaction. The only exception is when dirty reads are not allowed.
Changing your queries to this form would solve all your issues:
SELECT ...
FROM yourtable WITH (NOLOCK)
WHERE ...
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
UPDATE ...
COMMIT
It has been a long time since I last dealt with this, but I believe that the SELECT statement takes a read lock, which only prevents the data from being changed, so multiple queries can hold and share a read lock on the same data. The shared read lock is for read consistency: if you read the same row several times within your transaction, read consistency means you should always get the same result.
The UPDATE statement requires an exclusive lock, and hence the UPDATE has to wait for the read locks to be released.
Neither of the two transactions will release its locks, so the transactions fail.
Different database implementations have different strategies for how to deal with this: Sybase and MS SQL Server use lock escalation with a timeout (escalating from a read lock to a write lock), Oracle I believe (at some point) implemented read consistency through use of the rollback log, and MySQL has yet another strategy.