SQL Server locks - avoid insertion of duplicate entries

After reading a lot of articles and many answers related to the above subject, I am still wondering how the SQL Server database engine works in the following example:
Let's assume that we have a table named t3:
create table t3 (a int , b int);
create index test on t3 (a);
and a query as follows:
INSERT INTO T3
SELECT -86,-86
WHERE NOT EXISTS (SELECT 1 FROM t3 where t3.a=-86);
The query inserts a row into table t3 after verifying, based on column "a", that the row does not already exist.
Many articles and answers claim that with the above query a row can never be inserted twice.
For the execution of the above query, I assume that the database engine works as follows:
The subquery is executed first.
The database engine sets a shared (S) lock on a range.
The data is read.
The shared lock is released. According to MSDN, a shared lock is released as soon as the data has been read.
If the row does not exist, a new row is inserted into the table.
The new row is locked with an exclusive (X) lock.
Now consider the following scenario:
The above query is executed by process A (SPID 1).
The same query is executed by process B (SPID 2).
[SPID 1] The database engine sets a shared (S) lock.
[SPID 1] The subquery reads the data. No rows are returned.
[SPID 1] The shared lock is released.
[SPID 2] The database engine sets a shared (S) lock.
[SPID 2] The subquery reads the data. No rows are returned.
[SPID 2] The shared lock is released.
Both processes proceed with a row insertion (and we get a duplicate entry).
Am I missing something? Is the above query a correct way to avoid duplicate entries?
I believe the code below is a safe way to avoid duplicate entries, but I am wondering whether the above method is also correct.
begin tran
if not exists (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
begin
INSERT INTO T3
SELECT -86,-86
end
commit

If you just have a unique constraint on the column, you'll never have duplicates.
The technique you've outlined saves you from having to catch an error or an exception when the second ("simultaneous") operation would otherwise fail.
I'd like to add that relying on "outer" code (even T-SQL) to enforce your database consistency is not a great idea. In all cases, using declarative referential integrity at the table level is important for the database to ensure consistency and matching expectations, regardless of whether application code is written well or not. As in security, you need to utilize a strategy of defense in depth - constraints, unique indexes, triggers, stored procedures, and views can all assist in making a multi-layered approach to ensure the database presents a consistent and reliable interface to the application or system.
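As a minimal sketch of the constraint-first approach described above: the index name UX_t3_a is hypothetical, and THROW assumes SQL Server 2012 or later. The unique index guarantees that no duplicate can be stored, and the CATCH block treats a duplicate-key error as "the row is already there".

```sql
-- Hypothetical index name; t3 is the table from the question.
CREATE UNIQUE INDEX UX_t3_a ON t3 (a);

BEGIN TRY
    INSERT INTO t3 (a, b) VALUES (-86, -86);
END TRY
BEGIN CATCH
    -- 2601: duplicate key row in a unique index
    -- 2627: violation of a UNIQUE or PRIMARY KEY constraint
    IF ERROR_NUMBER() NOT IN (2601, 2627)
        THROW;  -- re-raise anything that is not a duplicate-key error
END CATCH;
```

With this in place, the NOT EXISTS check becomes an optimization that reduces how often the error path is taken, rather than the only line of defense.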

To keep locks between multiple statements, they have to be wrapped in a transaction. In your example:
IF NOT EXISTS (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
INSERT INTO T3 SELECT -86,-86
The update lock can be released before the insert is executed. This would work reliably:
begin transaction
IF NOT EXISTS (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
INSERT INTO T3 SELECT -86,-86
commit transaction
Single statements are always wrapped in a transaction, so this would work too:
INSERT INTO T3 SELECT -86,-86
WHERE NOT EXISTS (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
(This is assuming you have "implicit transactions" turned off, like the default SQL Server setting.)

Related

What are invisible rows in postgresql?

Reading this article about the EXPLAIN command, I came across the so-called invisible rows concept. To be more specific:
In a sequential scan, the executor needs:
to read all the blocks of relation foo
to check every row in each block to filter “unvisible” rows
Googling for the phrase "invisible row postgresql" and some related ones didn't give any useful results. So, what does the concept mean? Or is it an informal concept that is not standardized?
It's basically a consequence of MVCC and transactions. If you start a transaction then rows created by a different session will normally not be visible to your session until the transaction has run its course. This is to prevent the state of a transaction becoming inconsistent during its execution.
There are exceptions related to unique indexes and key columns, but it's relatively rare to encounter those, especially if all your primary keys are SERIAL.
Due to the transaction isolation not all tuples are visible to all the transactions. You should check the manual on MVCC. Also the source code is a great source on all more complicated concepts, this description seems to explain it well.
Invisible rows are rows that are not visible to a transaction (let's call it T1) when it starts.
A typical scenario is the following:
A transaction T2 starts its execution. T2 consists of the query
UPDATE users SET name = 'John' WHERE age < 18
Meanwhile, the transaction T1 (concurrently with T2) starts its execution, doing the following:
SELECT COUNT(*) FROM users WHERE name = 'John'
As you can easily see, if T1 ends before T2, its result will be a number X: the count of users whose name is John.
But if T1 ends after T2, the resulting value X might be different (it will be, if some rows satisfy the WHERE predicate of the UPDATE).
The same thing can happen in a JOIN: the resulting join relation may or may not contain the rows that satisfy the join predicate.
Think about the transaction T1
SELECT * FROM users u INNER JOIN infos i ON u.id = i.id;
And concurrently there's the execution of T2
UPDATE infos SET id = 9 WHERE id > 12
The physical implementation of the logical operator JOIN must handle these cases in order to produce the right result.
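To make the visibility rule concrete, here is a small two-session sketch for PostgreSQL at the default READ COMMITTED isolation level. The users table with name and age columns is assumed from the example above; xmin and xmax are PostgreSQL system columns that record the inserting and deleting transaction IDs of each row version.

```sql
-- Session A: insert a row but do not commit yet.
BEGIN;
INSERT INTO users (name, age) VALUES ('John', 17);

-- Session B: the new row version is physically in the table, but the executor
-- filters it out as invisible because its inserting transaction has not committed.
SELECT xmin, xmax, name, age FROM users WHERE name = 'John';

-- Session A: make the row visible to others.
COMMIT;

-- Session B: the same query now returns the row; xmin shows the ID of the
-- transaction that inserted it.
SELECT xmin, xmax, name, age FROM users WHERE name = 'John';
```

This is the check the article refers to: a sequential scan reads every row version in every block and discards those the current snapshot is not allowed to see.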

How to SELECT COUNT from tables currently being INSERT?

Consider a long-running INSERT statement against a table TABLE_A. I would like to see how far it has progressed.
I opened a new session (a new query window in SSMS) while the long-running statement was still in progress and ran the query
SELECT COUNT(1) FROM TABLE_A WITH (nolock)
hoping that it would return right away with the number of rows every time I ran it. However, even with (nolock), it only returned after the INSERT statement had completed.
What have I missed? Do I need to add (nolock) to the INSERT statement as well? Or is this not achievable?
(Edit)
OK, I have found what I missed. If you first use CREATE TABLE TABLE_A, then INSERT INTO TABLE_A, the SELECT COUNT will work. If you use SELECT * INTO TABLE_A FROM xxx, without first creating TABLE_A, then none of the following will work (not even sysindexes).
Short answer: You can't do this.
Longer answer: A single INSERT statement is an atomic operation. As such, the query has either inserted all the rows or has inserted none of them. Therefore you can't get a count of how far through it has progressed.
Even longer answer: Martin Smith has given you a way to achieve what you want. Whether you still want to do it that way is up to you of course. Personally I still prefer to insert in manageable batches if you really need to track progress of something like this. So I would rewrite the INSERT as multiple smaller statements. Depending on your implementation, that may be a trivial thing to do.
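The batching idea could look roughly like the sketch below. Everything except TABLE_A is hypothetical: dbo.SourceTable, the Id and Payload columns, and the batch size are placeholders, and the sketch assumes Id uniquely identifies a source row.

```sql
DECLARE @BatchSize int = 100000;  -- hypothetical batch size
DECLARE @Rows int = 1;

WHILE @Rows > 0
BEGIN
    -- Copy the next batch of rows that are not in the target yet.
    INSERT INTO dbo.TABLE_A (Id, Payload)
    SELECT TOP (@BatchSize) s.Id, s.Payload
    FROM dbo.SourceTable AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TABLE_A AS t WHERE t.Id = s.Id)
    ORDER BY s.Id;

    SET @Rows = @@ROWCOUNT;

    -- WITH NOWAIT flushes the message immediately, so progress shows up as it happens.
    RAISERROR('Inserted %d rows in this batch', 0, 1, @Rows) WITH NOWAIT;
END
```

Because each batch commits on its own, another session can count the committed rows at any time without dirty reads. The trade-off is that the load as a whole is no longer atomic.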
If you are using SQL Server 2016 the live query statistics feature can allow you to see the progress of the insert in real time.
The below screenshot was taken while inserting 10 million rows into a table with a clustered index and a single nonclustered index.
It shows that the insert was 88% complete on the clustered index and this will be followed by a sort operator to get the values into non clustered index key order before inserting into the NCI. This is a blocking operator and the sort cannot output any rows until all input rows are consumed so the operators to the left of this are 0% done.
With respect to your question on NOLOCK
It is trivial to test
Connection 1
USE tempdb
CREATE TABLE T2
(
X INT IDENTITY PRIMARY KEY,
F CHAR(8000)
);
WHILE NOT EXISTS(SELECT * FROM T2 WITH (NOLOCK))
LOOP:
SELECT COUNT(*) AS CountMethod FROM T2 WITH (NOLOCK);
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('T2');
RAISERROR ('Waiting for 10 seconds',0,1) WITH NOWAIT;
WAITFOR delay '00:00:10';
SELECT COUNT(*) AS CountMethod FROM T2 WITH (NOLOCK);
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('T2');
RAISERROR ('Waiting to drop table',0,1) WITH NOWAIT
DROP TABLE T2
Connection 2
use tempdb;
--Insert 2000 * 2000 = 4 million rows
WITH T
AS (SELECT TOP 2000 'x' AS x
FROM master..spt_values)
INSERT INTO T2
(F)
SELECT 'X'
FROM T v1
CROSS JOIN T v2
OPTION (MAXDOP 1)
Example Results - Showing row count increasing
SELECT queries with NOLOCK allow dirty reads. Despite the name, they do still take locks and can still be blocked: they need a SCH-S (schema stability) lock on the table (and on a heap they will also take a hobt lock).
The only thing incompatible with a SCH-S is a SCH-M (schema modification) lock. Presumably you also performed some DDL on the table in the same transaction (e.g. perhaps created it in the same tran)
For the use case of a large insert, where an approximate in flight result is fine, I generally just poll sysindexes as shown above to retrieve the count from metadata rather than actually counting the rows (non deprecated alternative DMVs are available)
When an insert has a wide update plan you can even see it inserting to the various indexes in turn that way.
If the table is created inside the inserting transaction this sysindexes query will still block though as the OBJECT_ID function won't return a result based on uncommitted data regardless of the isolation level in effect. It's sometimes possible to get around that by getting the object_id from sys.tables with nolock instead.
Use the query below to get the row count, within seconds, of any table that is large, locked, or currently being inserted into. Just replace TABLENAME with the name of the table you want to check.
SELECT
Total_Rows= SUM(st.row_count)
FROM
sys.dm_db_partition_stats st
WHERE
object_name(object_id) = 'TABLENAME' AND (index_id < 2)
For those who just need to see the record count while executing a long running INSERT script, I found you can see the current record count through SSMS by right clicking on the destination database table, -> Properties -> Storage, then view the "Row Count" value like so:
Close window and repeat to see the updated record count.

SELECT COUNT(SomeId) with INSERT later to same SomeId: Appropriate locking strategy?

I am using SQL Server 2012. I have a repeatable read transaction where I perform this query:
select count(SomeId)
from dbo.MyTable
where SomeId = @SomeId
SomeId is a column whose value may repeat in the table (think foreign key). However, SomeId is not a member of any index nor is it a foreign key.
Later in the transaction, I insert a record into dbo.MyTable with the same @SomeId, thus changing what the select count(SomeId) would return were I to run it again:
insert into dbo.MyTable (SomeId, ...)
values (@SomeId, ...)
Several threads in my application can execute this transaction at the same time. Because of this, I'm getting deadlocks on the insert statement. At first, I thought an updlock would be appropriate on the select statement, but I quickly realized that it wouldn't work because I'm not actually updating the rows selected by the select count(SomeId).
My question is this: is there any way to avoid a potentially expensive table lock? Is there any way to lock just rows that involve SomeId, even if they haven't been inserted yet (strange, I know)? I want to force other threads to wait while the original transaction completes its work but I don't want to lock rows unnecessarily.
EDIT
Here's what I'm trying to accomplish:
I only want to insert up to eight rows for a particular SomeId. There are several unrelated processes that can start one of these transactions, potentially at the same time. The select count detects whether there are already eight rows and causes the operation to fail for that process. If the count is less than eight, that same transaction performs additional work, then inserts a record at the end, thus effectively incrementing the count were the select count to be run again. I am hitting the deadlock on the insert statement.
If you have several processes that try to do the same thing and you don't want to end up with more records than some limit, you will need to actually prevent those processes from running at the same time.
One way would be to read the counts with exclusive lock:
select count(SomeId)
from dbo.MyTable with (xlock)
where SomeId = @SomeId
This way those records will be locked until the transaction completes.
You should create an index on the SomeId column though, because then the locks will most likely be held at the index level rather than on the whole table.
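A different way to serialize work per SomeId, without depending on which rows or index keys end up locked, is an application lock. The sketch below swaps in sp_getapplock for the xlock hint; it is not what the answer above proposes. It assumes SomeId is an int, that the remaining columns of dbo.MyTable are nullable or have defaults, and it uses a hypothetical resource name format and timeout.

```sql
DECLARE @SomeId int = 42;  -- example value
DECLARE @Resource nvarchar(255) = N'MyTable:SomeId:' + CAST(@SomeId AS nvarchar(20));
DECLARE @rc int;

BEGIN TRANSACTION;

-- Only one transaction at a time can hold the lock for a given SomeId.
-- The lock is released automatically at COMMIT or ROLLBACK.
EXEC @rc = sp_getapplock
     @Resource    = @Resource,
     @LockMode    = 'Exclusive',
     @LockOwner   = 'Transaction',
     @LockTimeout = 10000;  -- milliseconds

IF @rc < 0  -- negative return codes mean the lock was not granted
BEGIN
    ROLLBACK TRANSACTION;
    RAISERROR('Could not acquire the lock for this SomeId', 16, 1);
    RETURN;
END

IF (SELECT COUNT(*) FROM dbo.MyTable WHERE SomeId = @SomeId) < 8
BEGIN
    -- ... additional work goes here ...
    INSERT INTO dbo.MyTable (SomeId) VALUES (@SomeId);
END

COMMIT TRANSACTION;
```

Transactions for other SomeId values use a different resource name and are not blocked by this lock. This only works if every code path that inserts into the table takes the same application lock first.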

Avoid deadlock for concurrent delete

I have a table called Products with many columns. It is a temporary table used for reporting purposes. The data will be processed to this table concurrently by multiple user requests. There are separate stored procedures to make DML operations to this table.
Table Structure:
CREATE TABLE Products (
instance uniqueidentifier,
inserted datetime,
col1,
col2,
...
)
The inserted column will be populated with GETDATE() to contain the time when each row was inserted, and the instance column will contain the value from newid(). One user request will have one unique id but may have millions of rows. Below are the queries that will be executed concurrently and that are causing the deadlock. Please advise.
Query 1:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
DELETE P
FROM Products P WITH (NOLOCK)
WHERE instance = 'XXXX-xxx-xxx-xx'
Query 2:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
DELETE P
FROM Products P WITH (NOLOCK)
WHERE inserted <= DATEADD(hh, -10, GETDATE())
Note: The nonclustered index is created on instance column.
Please advise which lock I can use in this scenario.
Note: I was not able to add a primary key because it takes too long when I insert 10 million rows into the table (this is for one transaction; there are 20 concurrent transactions).
The report should be generated quickly. My procedure has 35 DML statements, of which around 15 are DELETE statements that filter on the instance column together with other columns (DELETE FROM table WHERE instance = @instance AND col1 = @col1).
(1) You should stop using read uncommitted isolation. Use at least read committed.
(2) There are a number of things you could try to avoid deadlocks, like ensuring your different transactions access database objects in the same order, etc. This would be worth a read - http://support.microsoft.com/kb/169960
(3) Disable lock escalation for your table (more granular locks so better concurrency, but more lock overhead):
ALTER TABLE Products SET (lock_escalation = disable)
(4) Disallow Page Locks, and allow Row Locks on your indexes (will mean you can't defrag indexes, but you can still rebuild them):
ALTER INDEX [<YourIndex>] ON Products SET (allow_row_locks = on, allow_page_locks = off)
First, there's no lock that you can take on these delete statements besides an exclusive lock. Your isolation level and NOLOCK hints are being ignored by SQL Server:
NOLOCK only applies to SELECT statements.
Two suggestions:
Change your non-clustered index on instance to a clustered index. BUT, only do this if you can change NEWID() to NEWSEQUENTIALID().
Second, instead of performing a delete to remove records older than 10 hours... consider implementing rolling partitions. This will remove any contention caused by the cleanup with your other delete operations.
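A rolling-partition cleanup might be sketched as follows. All object names and boundary values are hypothetical, only two of the table's columns are shown, and TRUNCATE TABLE ... WITH (PARTITIONS ...) assumes SQL Server 2016 or later with all indexes aligned to the partition scheme.

```sql
-- Partition by insertion time; each boundary starts a new time slice.
CREATE PARTITION FUNCTION pfProductsByTime (datetime)
    AS RANGE RIGHT FOR VALUES
    ('2024-01-01T00:00:00', '2024-01-01T10:00:00', '2024-01-01T20:00:00');

CREATE PARTITION SCHEME psProductsByTime
    AS PARTITION pfProductsByTime ALL TO ([PRIMARY]);

CREATE TABLE dbo.ProductsPartitioned (
    instance uniqueidentifier NOT NULL,
    inserted datetime NOT NULL
    -- remaining report columns omitted in this sketch
) ON psProductsByTime (inserted);

-- Cleanup: empty the oldest slice without row-by-row deletes.
TRUNCATE TABLE dbo.ProductsPartitioned WITH (PARTITIONS (1));

-- Slide the window: drop the oldest boundary and add a new one at the front.
ALTER PARTITION FUNCTION pfProductsByTime() MERGE RANGE ('2024-01-01T00:00:00');
ALTER PARTITION SCHEME psProductsByTime NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfProductsByTime() SPLIT RANGE ('2024-01-02T06:00:00');
```

The time-based cleanup then no longer competes for row locks with the per-instance deletes, because it never touches individual rows.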

Does "SELECT FOR UPDATE" prevent other connections inserting when the row is not present?

I'm interested in whether a SELECT FOR UPDATE query will lock a non-existent row.
Example
Table FooBar with two columns, foo and bar, foo has a unique index.
Issue query SELECT bar FROM FooBar WHERE foo = ? FOR UPDATE
If the first query returns zero rows, issue a query
INSERT INTO FooBar (foo, bar) values (?, ?)
Now is it possible that the INSERT would cause an index violation or does the SELECT FOR UPDATE prevent that?
Interested in behavior on SQLServer (2005/8), Oracle and MySQL.
MySQL
SELECT ... FOR UPDATE with UPDATE
Using transactions with InnoDB (auto-commit turned off), a SELECT ... FOR UPDATE allows one session to temporarily lock down a particular record (or records) so that no other session can update it. Then, within the same transaction, the session can actually perform an UPDATE on the same record and commit or roll back the transaction. This would allow you to lock down the record so no other session could update it while perhaps you do some other business logic.
This is accomplished with locking. InnoDB utilizes indexes for locking records, so locking an existing record seems easy--simply lock the index for that record.
SELECT ... FOR UPDATE with INSERT
However, to use SELECT ... FOR UPDATE with INSERT, how do you lock an index for a record that doesn't exist yet? If you are using the default isolation level of REPEATABLE READ, InnoDB will also utilize gap locks. As long as you know the id (or even range of ids) to lock, then InnoDB can lock the gap so no other record can be inserted in that gap until we're done with it.
If your id column were an auto-increment column, then SELECT ... FOR UPDATE with INSERT INTO would be problematic because you wouldn't know what the new id was until you inserted it. However, since you know the id that you wish to insert, SELECT ... FOR UPDATE with INSERT will work.
CAVEAT
At the default isolation level, SELECT ... FOR UPDATE on a non-existent record does not block other transactions. So, if two transactions both do a SELECT ... FOR UPDATE on the same non-existent index record, they'll both get the lock, and neither transaction will be able to insert the record. In fact, if they both try, a deadlock will be detected.
Therefore, if you don't want to deal with a deadlock, you might just do the following:
INSERT INTO ...
Start a transaction, and perform the INSERT. Do your business logic, and either commit or rollback the transaction. As soon as you do the INSERT on the non-existent record index on the first transaction, all other transactions will block if they attempt to INSERT a record with the same unique index. If the second transaction attempts to insert a record with the same index after the first transaction commits the insert, then it will get a "duplicate key" error. Handle accordingly.
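The insert-first approach can be illustrated with two sessions. The table definition is an assumption based on the question (foo has a unique index); the column types, key name, and values are hypothetical.

```sql
-- Assumed schema for the example.
CREATE TABLE FooBar (
    foo INT NOT NULL,
    bar VARCHAR(50),
    UNIQUE KEY uq_foo (foo)
) ENGINE=InnoDB;

-- Session 1: insert first, then do the business logic inside the transaction.
START TRANSACTION;
INSERT INTO FooBar (foo, bar) VALUES (1, 'first');

-- Session 2 (while session 1 is still open): this statement blocks,
-- because the same unique key is held by an uncommitted insert.
START TRANSACTION;
INSERT INTO FooBar (foo, bar) VALUES (1, 'second');

-- Session 1:
COMMIT;

-- Session 2: the blocked INSERT now fails with error 1062 (duplicate entry).
-- Handle it, for example by rolling back and reading the existing row.
ROLLBACK;
```

If session 1 had rolled back instead, the INSERT in session 2 would have succeeded once the lock was released.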
SELECT ... LOCK IN SHARE MODE
If you select with LOCK IN SHARE MODE before the INSERT, if a previous transaction has inserted that record but hasn't committed yet, the SELECT ... LOCK IN SHARE MODE will block until the previous transaction has completed.
So, to reduce the chance of duplicate key errors, especially if you hold the locks for a while performing business logic before committing or rolling back:
SELECT bar FROM FooBar WHERE foo = ? LOCK IN SHARE MODE
If no records returned, then
INSERT INTO FooBar (foo, bar) VALUES (?, ?)
In Oracle, SELECT ... FOR UPDATE has no effect on a non-existent row: the query simply returns no rows (in PL/SQL, a SELECT ... INTO that finds nothing raises a NO_DATA_FOUND exception). The INSERT statement will prevent duplicates of unique/primary key values. Any other transaction attempting to insert the same key values will block until the first transaction commits (at which time the blocked transaction will get a duplicate key error) or rolls back (at which time the blocked transaction continues).
On Oracle:
Session 1
create table t (id number);
alter table t add constraint pk primary key(id);
SELECT *
FROM t
WHERE id = 1
FOR UPDATE;
-- 0 rows returned
-- this creates row level lock on table, preventing others from locking table in exclusive mode
Session 2
SELECT *
FROM t
FOR UPDATE;
-- 0 rows returned
-- there are no problems with locking here
rollback; -- releases lock
INSERT INTO t
VALUES (1);
-- 1 row inserted without problems
I wrote a detailed analysis of this thing on SQL Server: Developing Modifications that Survive Concurrency
Anyway, you need to use SERIALIZABLE isolation level, and you really need to stress test.
SQL Server only has the FOR UPDATE as part of a cursor. And, it only applies to UPDATE statements that are associated with the current row in the cursor.
So, the FOR UPDATE has no relationship with INSERT. Therefore, I think your answer is that it's not applicable in SQL Server.
Now, it may be possible to simulate the FOR UPDATE behavior with transactions and locking strategies. But, that may be more than what you're looking for.
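One commonly used way to approximate this in SQL Server is to combine the UPDLOCK and HOLDLOCK table hints inside a transaction. This is a sketch, not a tested recipe: it assumes the FooBar table from the question with a unique index on foo, integer columns, and example values, and it should be stress tested as advised above.

```sql
DECLARE @foo int = 1, @bar int = 100;  -- example values
DECLARE @found int;

BEGIN TRANSACTION;

-- HOLDLOCK gives SERIALIZABLE semantics for this table reference, so a
-- key-range lock is taken on the unique index even when no row matches.
-- UPDLOCK makes competing sessions queue here instead of deadlocking later.
SELECT @found = COUNT(*)
FROM FooBar WITH (UPDLOCK, HOLDLOCK)
WHERE foo = @foo;

IF @found = 0
    INSERT INTO FooBar (foo, bar) VALUES (@foo, @bar);

COMMIT TRANSACTION;
```

Keeping the unique index on foo remains important: it is what lets the range lock stay narrow, and it still rejects duplicates from any code path that skips this pattern.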