I have a DB table with a field that must be unique. Let's say the table is called "Table1" and the unique field is called "Field1".
I plan on implementing this by performing a SELECT to see if any Table1 records exist where Field1 = @valueForField1, and only updating or inserting if no such records exist.
The problem is, how do I know there isn't a race condition here? If two users both click Save on the form that writes to Table1 (at almost the exact same time), and they have identical values for Field1, isn't it possible that the following would happen?
User1 makes a SQL call, which performs the SELECT and determines there are no existing records where Field1 = @valueForField1.
User1's process is preempted by User2's process, which also finds no records where Field1 = @valueForField1, and performs an insert.
User1's process is allowed to run again, and inserts a second record where Field1 = @valueForField1, violating the requirement that Field1 be unique.
How can I prevent this? I'm told that transactions are atomic, but then why do we need table locks too? I've never used a lock before and I don't know whether or not I need one in this case. What happens if a process tries to write to a locked table? Will it block and try again?
I'm using MS SQL 2008R2.
Add a unique constraint on the field. That way you won't have to SELECT; you will only have to insert. The first user will succeed, the second will fail.
On top of that, you may make the field auto-incremented so you won't have to worry about filling it in, or you may add a default value, again so you don't have to fill it in yourself.
Some options would be an auto-incremented INT field or a unique identifier.
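For the names used in the question (Table1, Field1), a minimal sketch of adding such a constraint in SQL Server might look like this (the constraint name is made up for illustration):
-- Enforce uniqueness of Field1 at the database level.
ALTER TABLE Table1
ADD CONSTRAINT UQ_Table1_Field1 UNIQUE (Field1);
With this in place, the second of two concurrent inserts with the same Field1 value fails with a unique constraint violation (error 2627 in SQL Server), which the application can catch and report to the user.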
You can add a unique constraint. Example from http://www.w3schools.com/sql/sql_unique.asp:
CREATE TABLE Persons
(
P_Id int NOT NULL UNIQUE
)
EDIT: Please also read Martin Smith's comment below.
jyparask has a good answer on how you can tackle this specific problem. However, I would like to elaborate on your confusion over locks, transactions, blocking, and retries. For the sake of simplicity, I'm going to assume transaction isolation level serializable.
Transactions are atomic. At this isolation level, the database guarantees that if you have two transactions, the effect is as if all operations in one transaction occurred completely before the next one started, no matter what kind of race conditions there are. Even if two users access the same row at the same time (multiple cores), there is no chance of a race condition, because the database will ensure that one of them waits or fails.
How does the database do this? With locks. When you select a row, SQL Server will lock the row, so that all other clients will block when requesting that row. Block means that their query is paused until that row is unlocked.
The database actually has a couple of things it can lock. It can lock the row, or the table, or somewhere in between. The database decides what it thinks is best, and it's usually pretty good at it.
There is never any retrying. The database will never retry a query for you; you need to explicitly tell it to retry a query. The reason is that the correct behavior is hard to define. Should a query retry with the exact same parameters? Or should something be modified? Is it still safe to retry the query? It's much safer for the database to simply throw an exception and let you handle it.
Let's address your example. Assuming you use transactions correctly and do the right query (Martin Smith linked to a few good solutions), then the database will create the right locks so that the race condition disappears. One user will succeed, and the other will fail. In this case, there is no blocking, and no retrying.
In the general case with transactions, however, there will be blocking, and you get to implement the retrying.
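As an illustration of what "the right query" can look like (a common SQL Server pattern, not necessarily the exact solutions Martin Smith linked, reusing the question's Table1/Field1 names), the existence check can take an update lock plus a range lock so the race condition cannot occur:
BEGIN TRAN;
-- UPDLOCK + HOLDLOCK make a concurrent transaction checking the same Field1 value wait
-- until this transaction commits, so only one of them can perform the INSERT.
IF NOT EXISTS (SELECT 1 FROM Table1 WITH (UPDLOCK, HOLDLOCK) WHERE Field1 = @valueForField1)
    INSERT INTO Table1 (Field1) VALUES (@valueForField1);
COMMIT TRAN;
Here the second transaction blocks on the SELECT rather than failing, which is the blocking behavior described above.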
Related
Let's say I have the following pseudocode:
SELECT count(*) FROM users WHERE email = 'bob@gmail.com'
>>>> MARKER A
if (count > 0) return;
else INSERT INTO users VALUES ('bob@gmail.com')
So essentially only insert the email if it doesn't exist already. I understand there's probably some sort of INSERT IF NOT EXISTS query I could use, but let's say we use this example.
So if the code above runs on thread A, and thread B actually inserts 'bob@gmail.com' into users at MARKER A, then thread A has "stale data" and will try to insert 'bob@gmail.com', thinking the count is still 0, but in fact it is now 1. This will error out since we have a unique index on the email.
What is the tool I should use to prevent this issue? From my reading about transactions, they basically make a set of operations atomic, so the code above will execute completely or not at all. It will NOT ensure the users table is locked against updates correct? So I can't just wrap the code above in a transaction and make it thread-safe?
Should I implement application-level locking? Should I ensure that when this operation occurs, it must acquire the lock to access the users table so that no other thread can make changes to it? I feel that locking the entire table is a performance hit I want to avoid.
Checking before inserting is a known anti-pattern in multi-threaded applications. Do not even try it.
The right way of doing it is letting the database take care of it. Add a UNIQUE constraint on the column, as in:
alter table users add constraint uq1 unique(email);
Just try to insert the row in the database. If it succeeds, all is good; if it fails, then some other thread has already inserted the row.
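A minimal sketch of that approach, assuming MySQL and the names from the pseudocode above:
-- Attempt the insert directly; the UNIQUE constraint is the only check needed.
INSERT INTO users (email) VALUES ('bob@gmail.com');
-- A duplicate-key error (error 1062 in MySQL) means another thread inserted
-- the same email first; catch that in application code and move on.
MySQL also offers INSERT IGNORE and INSERT ... ON DUPLICATE KEY UPDATE if you would rather not treat the duplicate as an error.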
Alternatively, you could issue a LOCK on the whole table. That would also work, but the performance of your application would become horrible.
I have got a table which has an id (primary key with auto increment), uid (a key referring to a user's id, for example) and something else which for my question won't matter.
I want to make, let's call it, different auto-increment keys on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will have a 1 because there were no previous entries with a value of 10 in uid. I will add a new one with uid 4 and its id will be 3 because there were already two entries with uid 4.
...Very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such a functionality, name it anyway, I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
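Along the lines of the manual's example, a sketch with made-up table and column names:
CREATE TABLE per_user_items (
  uid INT NOT NULL,
  id INT NOT NULL AUTO_INCREMENT,
  other_data VARCHAR(100),
  PRIMARY KEY (uid, id)  -- AUTO_INCREMENT on the second column of the key
) ENGINE=MyISAM;
-- Each uid gets its own id sequence starting at 1:
INSERT INTO per_user_items (uid, other_data) VALUES (10, 'first row for user 10');  -- id = 1
INSERT INTO per_user_items (uid, other_data) VALUES (4, 'another row for user 4');  -- id = MAX(id) + 1 within uid 4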
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id) + 1 for user 4. It gets 3.
Bill's session fires a trigger to compute MAX(id) + 1 for user 4. It also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table (see the sketch after this list). This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table: since you're trying to restrict INSERT, there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just that row). But locking the table causes access to the table to become serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
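A minimal sketch of the table-lock approach, assuming MySQL and made-up table/column names:
LOCK TABLES items WRITE;
-- With the whole table locked, no concurrent INSERT can race with this MAX():
SELECT COALESCE(MAX(id), 0) + 1 INTO @next_id FROM items WHERE userid = 4;
INSERT INTO items (userid, id, data) VALUES (4, @next_id, 'new row');
UNLOCK TABLES;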
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask about efficiency. Unless you are dealing with extreme volumes, storing an 8-byte DATETIME isn't much of an overhead compared to using, for example, a 4-byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than not MS/Oracle), the general logic is simple (a sketch follows the steps below)...
Start a transaction (often this is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
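As a sketch, one common variant of this in MySQL uses a BEFORE INSERT trigger to set the value rather than updating it afterwards (the table and column names follow the property_id/amendment_id suggestion above and are illustrative; the concurrency caveats below still apply):
DELIMITER //
CREATE TRIGGER set_amendment_id BEFORE INSERT ON amendments
FOR EACH ROW
BEGIN
  -- Only fill the value in if the caller didn't supply one.
  IF NEW.amendment_id IS NULL THEN
    SET NEW.amendment_id = (
      SELECT COALESCE(MAX(amendment_id), 0) + 1
      FROM amendments
      WHERE property_id = NEW.property_id
    );
  END IF;
END//
DELIMITER ;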
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control (a sketch follows the list below).
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
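A minimal sketch of such a procedure, again assuming MySQL and the same illustrative names. Note it does not by itself remove the race discussed earlier, so a primary key on (property_id, amendment_id) plus retrying on a duplicate-key error is still advisable:
DELIMITER //
CREATE PROCEDURE add_amendment(IN p_property_id INT, IN p_data VARCHAR(100))
BEGIN
  -- Compute the next per-property number and insert in one statement,
  -- and make this procedure the only write path to the table.
  INSERT INTO amendments (property_id, amendment_id, data)
  SELECT p_property_id, COALESCE(MAX(amendment_id), 0) + 1, p_data
  FROM amendments
  WHERE property_id = p_property_id;
END//
DELIMITER ;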
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.
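For example (a sketch with illustrative names), fetching the latest version of one entity is just a constrained query over the multi-part key:
-- Latest record for one entity, using the (entity_id, version_number) key.
SELECT *
FROM records
WHERE entity_id = 42
ORDER BY version_number DESC
LIMIT 1;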
OK so I have read a fair amount about SQL Server's locking stuff, but I'm struggling to understand it all.
What I want to achieve is thus:
I need to be able to lock a row when user A SELECTs it
If user B then tries to SELECT it, my winforms .net app needs to set all the controls on the relevant form to be disabled, so the user can't try and update. Also it would be nice if I could throw up a messagebox for user B, stating that user A is the person that is using that row.
So basically User B needs to be able to SELECT the data, but when they do so, they should also get a) whether the record is locked and b) who has it locked.
I know people are gonna say I should just let SQL Server deal with the locking, but I need User B to know that the record is in use as soon as they SELECT it, rather than finding out when they UPDATE - by which time they may have entered data into the form, giving me inconsistency.
Also any locks need to allow SELECTs to still happen - so when user B does his SELECT, rather than just being thrown an exception and receiving no/incomplete data, he should still get the data, and be able to view it, but just not be able to update it.
I'm guessing this is pretty basic stuff, but there's so much terminology involved with SQL Server's locking that I'm not familiar with that it makes reading about it pretty difficult at the moment.
Thanks
To create this type of 'application lock', you may want to use a table called Locks and insert key, userid, and table names into it.
When your select comes along, join into the Locks table and use the presence of this value to indicate the record is locked.
I would also recommend adding a 'RowVersion' column to your table you wish to protect. This field will assist in identifying if you are updating or querying a row that has changed since you last selected it.
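A sketch of the Locks-table idea; the table and column names here (Locks, MyTable, Id, @Id) are made up for illustration:
-- One row per checked-out record.
CREATE TABLE Locks (
    TableName varchar(128) NOT NULL,
    RecordId  int          NOT NULL,
    LockedBy  varchar(50)  NOT NULL,
    LockedAt  datetime     NOT NULL DEFAULT GETDATE(),
    CONSTRAINT PK_Locks PRIMARY KEY (TableName, RecordId)
);
-- User B still gets the data, plus whether it is locked and by whom.
SELECT t.*, l.LockedBy
FROM MyTable t
LEFT JOIN Locks l ON l.TableName = 'MyTable' AND l.RecordId = t.Id
WHERE t.Id = @Id;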
This isn't really what SQL Server locking is for - ideally you should only be keeping a transaction (and therefore a lock) open for the absolute minimum needed to complete an atomic operation against that database - you certainly shouldn't be holding locks while waiting for user input.
You would be better served keeping track of these sorts of locks yourself by (for example) adding a locked bit column to the table in question along with a locked_by varchar column to keep track of who has the row locked.
The first user should UPDATE the row to indicate that the row is locked and who has it locked:
UPDATE MyTable
SET locked = 1,
    locked_by = @me
WHERE id = @id   -- id / @id: the key of the row being locked (names illustrative)
  AND locked = 0
The locked = 0 check is there to protect against potential race conditions and make sure that you don't update a record that someone else has already locked.
This first user then does a SELECT to return the data and ensure that they did really manage to lock the row.
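On SQL Server you can combine the lock attempt and the read in one statement with the OUTPUT clause (a sketch, same illustrative column names as above):
-- Try to take the application lock and return the row in one go.
-- If no row comes back, someone else already holds the lock; do a plain SELECT
-- instead to show the data read-only along with locked_by for the message box.
UPDATE MyTable
SET locked = 1,
    locked_by = @me
OUTPUT inserted.*
WHERE id = @id
  AND locked = 0;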
We're very frustratingly getting deadlocks in MySQL. It isn't because of exceeding a lock timeout as the deadlocks happen instantly when they do happen. Here's the SQL code that is executing on 2 separate threads (with 2 separate connections from the connection pool) that produces a deadlock:
UPDATE Sequences SET Counter = LAST_INSERT_ID(Counter + 1) WHERE Sequence IS NULL
Sequences table has 2 columns: Sequence and Counter
The LAST_INSERT_ID allows us to retrieve this updated counter value as per MySQL's recommendation. That works perfect for us, but we get these deadlocks! Why are we getting them and how can we avoid them??
Thanks so much for any help with this.
EDIT: this is all in a transaction (required since I'm using Hibernate) and AUTO_INCREMENT doesn't make sense here. I should've been more clear. The Sequences table holds many sequences (in our case about 100 million of them). I need to increment a counter and retrieve that value. AUTO_INCREMENT plays no role in all of this, this has nothing to do with Ids or PRIMARY KEYs.
Wrap your sql statements in a transaction. If you aren't using a transaction you will get a race condition on LAST_INSERT_ID.
But really, you should have counter fields auto_increment, so you let mysql handle this.
Your third solution is to use LOCK TABLES to lock the sequence table so no other process can access it concurrently. This is probably the slowest solution, unless you are using InnoDB.
Deadlocks are a normal part of any transactional database, and can occur at any time. Generally, you are supposed to write your application code to handle them, as there is no surefire way to guarantee that you will never get a deadlock. That being said, there are situations that increase the likelihood of deadlocks occurring, such as the use of large transactions, and there are things you can do to mitigate their occurrence.
First thing, you should read this manual page to get a better understanding of how you can avoid them.
Second, if all you're doing is updating a counter, you should really, really, really be using an AUTO_INCREMENT column for Counter rather than relying on a "select then update" process, which as you have seen is a race condition that can produce deadlocks. Essentially, the AUTO_INCREMENT property of your table column will act as a counter for you.
Finally, I'm going to assume that you have that update statement inside a transaction, as this would produce frequent deadlocks. If you want to see it in action, try the experiment listed here. That's exactly what's happening with your code... two threads are attempting to update the same records at the same time before one of them is committed. Instant deadlock.
Your best solution is to figure out how to do it without a transaction, and AUTO_INCREMENT will let you do that.
No other SQL involved? Seems a bit unlikely to me.
The 'where sequence is null' probably causes a full table scan, causing read locks to be acquired on every row/page/... .
This becomes a problem if (your particular engine does not use MVCC and) there were an INSERT that preceded your update within the same transaction. That INSERT would have acquired an exclusive lock on some resource (row/page/...), which will cause the acquisition of a read lock by any other thread to go waiting. So two connections can first do their insert, causing each of them to have an exclusive lock on some small portion of the table, and then they both try to do your update, requiring each of them to be able to acquire a read lock on the entire table.
I managed to do this using a MyISAM table for the sequences.
I then have a function called getNextCounter that does the following (a sketch of the SQL follows these steps):
performs a SELECT sequence_value FROM sequences where sequence_name = 'test';
performs the update: UPDATE sequences SET sequence_value = LAST_INSERT_ID(last_retrieved_value + 1) WHERE sequence_name = 'test' and sequence_value = last retrieved value;
repeat both statements in a loop until the UPDATE actually matches a row, then retrieve the last insert id.
As it is a MyISAM table it won't be part of your transaction, so the operation won't cause any deadlocks.
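A sketch of the two statements from that loop (the sequences/sequence_name/sequence_value names come from the steps above; the loop itself lives in application code):
-- Step 1: read the current value for this sequence.
SELECT sequence_value FROM sequences WHERE sequence_name = 'test';
-- Step 2: optimistic update; it only matches a row if nobody changed the value in between.
UPDATE sequences
SET sequence_value = LAST_INSERT_ID(@last_retrieved_value + 1)
WHERE sequence_name = 'test'
  AND sequence_value = @last_retrieved_value;
-- If the UPDATE matched a row, fetch the new counter value:
SELECT LAST_INSERT_ID();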
I am developing a personal PHP/MySQL app, and I came across this particular scenario in my project:
I have various comment threads. This is handled by two tables - 'Comments' and 'Threads', with each comment in 'Comments' table having a 'thread_id' attribute indicating which thread the comment belongs to. When the user deletes a comment thread, currently I am doing two separate DELETE SQL queries:
First delete all the comments belonging to the thread in the 'Comments' table
Then, clearing the thread record from the 'Threads' table.
I also have another situation, where I need to insert data from a form into two separate tables.
Should I be using transactions for these kind of situations? If so, is it a general rule of thumb to use transactions whenever I need to perform such multiple SQL queries?
It depends on your actual needs. Transactions are a way of ensuring that all the data manipulation that forms a single unit of work either executes successfully as a whole or not at all, and that concurrent transactions are kept isolated from one another. If one of the queries fails for whatever reason, the whole transaction fails and the previous state is restored.
If you absolutely need to make sure that no thread will be deleted unless all of its comments have been deleted beforehand, go for transactions. If you need all the speed you can get, go for MyISAM (which does not support transactions).
Yes, it is a general rule of thumb to use transactions when doing multiple operations that are related. If you do switch to InnoDB (usually a good idea, but not always. We didn't really discuss any requirements besides transactions, so I won't comment more), I'd also suggest setting a constraint on Comments that points to Threads as it sounds like a comment must be assigned to a thread. Deleting the thread would then remove associated comments in a single atomic statement.
If you want ACID transactions, you want InnoDB. If having one DELETE succeed and the other fail means having to manually DELETE the failed attempt, I'd say that's a hardship better handled with the database. Those situations call for transactions.
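A sketch of the thread-deletion case wrapped in a transaction, assuming InnoDB and the question's table names (the @thread_id placeholder and the Threads key column id are guesses):
START TRANSACTION;
-- Both deletes succeed together or not at all.
DELETE FROM Comments WHERE thread_id = @thread_id;
DELETE FROM Threads WHERE id = @thread_id;
COMMIT;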
For the first part of your question I would recommend declaring thread_id as a foreign key in your comments table. This should reference the id column of the thread table. You can then set 'ON DELETE CASCADE'; this means that when the ID is removed from the thread table, all comments that reference that ID will also be deleted.
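A sketch of that foreign key, again assuming InnoDB and guessing Threads.id as the thread key:
ALTER TABLE Comments
ADD CONSTRAINT fk_comments_thread
FOREIGN KEY (thread_id) REFERENCES Threads (id)
ON DELETE CASCADE;
-- Now deleting a row from Threads removes its comments in the same atomic statement.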