SELECT ... FOR UPDATE and conflicting inserts - sql

I want to insert a row, but if a conflict occurs (example below) I'd like the database to lock the existing row so I can log its contents for debugging purposes. I am using READ_COMMITTED transaction isolation.
For example:
CREATE TABLE users(id BIGINT AUTO_INCREMENT, name VARCHAR(30),
count INT NOT NULL, PRIMARY KEY(id), UNIQUE(name));
Thread 1:
INSERT INTO users(username, count) VALUES('joe', 1000);
transaction.commit();
Thread 2:
// Insert fails due to conflict with above record
INSERT INTO users(username, count) VALUES('joe', 0);
// Get the conflicting row and log its properties
SELECT * FROM users WHERE username = 'joe';
If the conflicting row is not locked, it may be modified by the time I check it. The only workaround I found is invoking SELECT id FROM users WHERE username = 'joe' FOR UPDATE before the insert. Is it possible way to implement this without any overhead when a conflict does not occur?
UPDATE: I am not asking to avoid the conflict or the resulting SQLException. I am just asking for the conflicting row to get locked so I can look up what values triggered the conflict. Yes, I know that the conflicting record contains joe but I want to log all its other columns.

No it is not possible to eliminate the confict of a UNIQUE column
when using INSERT of rows with unique column(s).
Trying to write SQL that never has to deal with SQL Exceptions
is just wasted effort that always ends up creating SQL that fails
under some conditions.
Exception handling can't be avoided when dealing with real time
multi-threaded multi-user database servers, unless you
can afford to lock the table, do the update, and unlock the
table (which will create terrible performance when under
heavy load of many users)
The UNIQUE CONSTRAINT VIOLATION Exception will ALWAYS occur on the 2nd INSERT,
as the two INSERTs in your example could be widely separated in time
(e.g. by hours, days or weeks); Table or row locking won't change this.
This problem is one that should be solved at the GUI level anyway
as choosing a "user name" that may already be chosen by a previous
user, requires providing the "new" user with feedback like
"Sorry, that user name is already in use by another user", so
it would seen unlikely that handling the UNIQUE VIOLATION exception
can or should ever be "avoided".
In addition, there is no reason to SELECT ... FOR UPDATE, since
all you need to do is SELECT id WHERE name = newName and see if
you get a resulting id or null; (id == null) => user name not in use,
but even then two user could try to both get the "not in use" result
at the same time and one of the INSERTs could still fail.
When the UNIQUE exception is returned on the duplicate INSERT,
the second INSERT has failed and that record was not created,
so there is no "duplicate" record to lock and then read after
the UNIQUE exception is returned on the failed INSERT.

Wich version of SQL are you usign? I'm not sure if I understand correctly your question but I think you could do this in a trigger.
In the trigger, you can view the inserted value (your conflicting row) and log it, and make a rollback. Wich means that when you insert your row, when a conflict does not occur, you don't have to commit anything, and when a conflict occurs, the log is made and the row is not inserted.

No, most databases do not support that kind of operation.
You can do tricks like creating an explicit transaction
BEGIN TRANSACTION
IF EXIST(SELECT ...)
ROLLBACK
INSERT INTO...
COMMIT
But that isn't exactly what you want. The only to achieve what you're asking for is to use one of the B-TREE style libraries which are a lot more low-level.

There doesn't seem to be a portable way of doing this and looking at MVCC there is a strong indication that this cannot be implemented without a substantial performance impact.
So in conclusion: you're going to have to settle for knowing that a conflict occurred but have no way of being 100% sure of the cause (there is no thread-safe to verify).

Related

Postgres unique constraint performance, insert + fail on duplicate or check?

What is bigger performance hit on a postgres database when table has unique constraint:
Trying to insert and let it throw unique violation constraint error
Check if entry exist and not do insert if it does
I'm importing some data, and ORM is connecting some entries via many to many through connection table. It is not checking if connection exist, it just runs the query and fails with unique constraint when it exist.
Is it better to leave it like that, or to introduce a step where I would check if the entry exist and then do the insert if it doesn't?
I would assume that your check from 2. would be an extra statement, so it is probably more expensive. I cannot say for sure, since you were rather vague in your question.
Besides the second approach is suffering from a race condition: you can never guarantee that no conflicting row gets inserted by a concurrent session after you checked.
If you want to avoid the error, the best approach would be
INSERT INTO ... VALUES (...) ON CONFLICT DO NOTHING;
performance hit:
As unique constraint creates index on the specified column it will affect the rate of insertion and updation.And most abruptly, In batch operations where numbers of inserts and updates are very large.

First_or_create yet ERROR: duplicate key value violates unique constraint

I have the following code:
rating = user.recipe_ratings.where(:recipe_id => recipe.id).where(:delivery_id => delivery.id).first_or_create
Yet somehow we get occasional PG::Error: ERROR: duplicate key value violates unique constraint errors from this. I can't think of any reason that should happen, since the whole point of first_or_create is to prevent those.
Is this just a crazy race-condition? How can I solve this without a maddening series of begin...rescue blocks?
This seems to stem from a typical race condition for the "SELECT or INSERT" case.
Ruby seems to choose performance over safety in its implementation. Quoting "Ruby on Rails Guides":
The first_or_create method checks whether first returns nil or not. If
it does return nil, then create is called.
...
The SQL generated by this method looks like this:
SELECT * FROM clients WHERE (clients.first_name = 'Andy') LIMIT 1
BEGIN
INSERT INTO clients (created_at, first_name, locked, orders_count, updated_at)
VALUES ('2011-08-30 05:22:57', 'Andy', 0, NULL, '2011-08-30 05:22:57')
COMMIT
If that's the actual implementation (?), it seems completely open for race conditions. Another transaction can easily SELECT between the first transaction's SELECT and INSERT. And then try its own INSERT, which would raise the error you reported, since the first transaction has inserted the row in the meantime.
The time frame for a race condition could be drastically reduced with a data-modifying CTE. Even a safe version would not cost that much more. But I guess they have their reasons.
Compare this safe implementation:
Is SELECT or INSERT in a function prone to race conditions?
Rails 6 adds a new create_or_find_by method that alleviates a possible race condition, with a few drawbacks:
The underlying table must have the relevant columns defined with unique constraints.
A unique constraint violation may be triggered by only one, or at least less than all, of the given attributes. This means that the subsequent find_by! may fail to find a matching record, which will then raise an ActiveRecord::RecordNotFound exception, rather than a record with the given attributes.
While we avoid the race condition between SELECT -> INSERT from find_or_create_by, we actually have another race condition between INSERT -> SELECT, which can be triggered if a DELETE between those two statements is run by another client. But for most applications, that's a significantly less likely condition to hit.
It relies on exception handling to handle control flow, which may be marginally slower.
def create_or_find_by(attributes, &block)
transaction(requires_new: true) { create(attributes, &block) }
rescue ActiveRecord::RecordNotUnique
find_by!(attributes)
end
Using your example:
rating = user.recipe_ratings.create_or_find_by(
recipe_id: recipe.id,
delivery_id: delivery.id
)

Locking table in postgresql

I have a table named as 'games', which contains a column named as 'title', this column is unique, database used in PostgreSQL
I have a user input form that allows him to insert a new 'game' in 'games' table. The function that insert a new game checks if a previously entered 'game' with the same 'title' already exists, for this, I get the count of rows, with the same game 'title'.
I use transactions for this, the insert function at the start uses BEGIN, gets the row count, if row count is 0, inserts the new row and after process is completed, it COMMITS the changes.
The problem is that, there are chances that 2 games with the same title if submitted by the user at the same time, would be inserted twice, since I just get the count of rows to chk for duplicate records, and each of the transaction would be isolated from each other
I thought of locking the tables when getting the row count as:
LOCK TABLE games IN ACCESS EXCLUSIVE MODE;
SELECT count(id) FROM games WHERE games.title = 'new_game_title'
Which would lock the table for reading too (which means the other transaction would have to wait, until the current one is completed successfully). This would solve the problem, which is what I suspect. Is there a better way around this (avoiding duplicate games with the same title)
You should NOT need to lock your tables in this situation.
Instead, you can use one of the following approaches:
Define UNIQUE index for column that really must be unique. In this case, first transaction will succeed, and second will error out.
Define AFTER INSERT OR UPDATE OR DELETE trigger that will check your condition, and if it does not hold, it should RAISE error, which will abort offending transaction
In all these cases, your client code should be ready to properly handle possible failures (like failed transactions) that could be returned by executing your statements.
Using the highest transaction isolation(Serializable) you can achieve something similar to your actual question. But be aware that this may fail ERROR: could not serialize access due to concurrent update
I do not agree with the constraint approach entirely. You should have a constraint to protect data integrity, but relying on the constraint forces you to identify not only what error occurred, but which constraint caused the error. The trouble is not catching the error as some have discussed but identifying what caused the error and providing a human readable reason for the failure. Depending on which language your application is written in, this can be next to impossible. eg: telling the user "Game title [foo] already exists" instead of "game must have a price" for a separate constraint.
There is a single statement alternative to your two stage approach:
INSERT INTO games ( [column1], ... )
SELECT [value1], ...
WHERE NOT EXISTS ( SELECT x FROM games as g2 WHERE games.title = g2.title );
I want to be clear with this... this is not an alternative to having a unique constraint (which requires extra data for the index). You must have one to protect your data from corruption.

Concurrent insert of keys into a table

Probably a trivial question, but I want to get the best possible solution.
Problem:
I have two or more workers that insert keys into one or more tables. The problem arises when two or more workers try to insert the same key into one of those key tables at the same time.
Typical problem.
Worker A reads the table if a key exists (SELECT). There is no key.
Worker B reads the table if a key exists (SELECT). There is no key.
Worker A inserts the key.
Worker B inserts the key.
Worker A commits.
Worker B commits. Exception is throws as unique constraint is violated
The key tables are simple pairs. First column is autoincrement integer and the second is varchar key.
What is the best solution to such a concurrency problem? I believe it is a common problem. One way for sure is to handle the exceptions thrown, but somehow I don't believe this is the best way to tackle this.
The database I use is Firebird 2.5
EDIT:
Some additional info to make things clear.
Client side synchronization is not a good approach, because the inserts come from different processes (workers). And I could have workers across different machines someday, so even mutexes are a no-go.
The primary key and the first columns of such a table is autoincrement field. No problem there. The varchar field is the problem as it is something that the client inserts.
Typical such table is a table of users. For instance:
1 2056
2 1044
3 1896
4 5966
...
Each worker check if user "xxxx" exists and if not inserts it.
EDIT 2:
Just for the reference if somebody will go the same route. IB/FB return pair of error codes (I am using InterBase Express components). Checking for duplicate value violation look like this:
except
on E: EIBInterBaseError do
begin
if (E.SQLCode = -803) and (E.IBErrorCode = 335544349) then
begin
FKeysConnection.IBT.Rollback;
EnteredKeys := False;
end;
end;
end;
With Firebird you can use the following statement:
UPDATE OR INSERT INTO MY_TABLE (MY_KEY) VALUES (:MY_KEY) MATCHING (MY_KEY) RETURNING MY_ID
assuming there is a BEFORE INSERT trigger which will generate the MY_ID if a NULL value is being inserted.
Here is the documentation.
Update: The above statement will avoid exceptions and cause every statement to succeed. However, in case of many duplicate key values it will also cause many unnecessary updates.
This can be avoided by another approach: just handle the unique constraint exception on the client and ignore it. The details depend on which Delphi library you're using to work with Firebird but it should be possible to examine the SQLCode returned by the server and ignore only the specific case of unique constraint violation.
I do not know if something like this is avalible in Firebird but in SQL Server you can check when inserting the key.
insert into Table1 (KeyValue)
select 'NewKey'
where not exists (select *
from Table1
where KeyValue = 'NewKey')
First option - don't do it.
Don't do it; Unless the WORKERS are doing extraordinary amounts of work (we're talking about computers, so requiring 1 second per record qualifies as "extraordinary amount of work"), just use a single thread; Even better, do all the work in a stored procedure, you'd be amazed by the speedup gained by not transporting data over whatever protocol into your app.
Second option - Use a Queue
Make sure your worker threads don't all work on the same ID. Set up a Queue, push all the ID's that need processing into that queue, have each working thread Dequeue an ID from that Queue. This way you're guaranteed no two workers work on the same record at the same time. This might be difficult to implement if your workers are not all part of the same process.
Last resort
Set up an DB-based "Reservation" system so an Worker Thread can mark a Key for "work in process" so no two workers would work on the same Key. I'd set up a table like this:
CREATE TABLE KEY_RESERVATIONS (
KEY INTEGER NOT NULL, /* This is the KEY you'd be reserving */
RESERVED_UNTIL TIMESTAMP NOT NULL /* We don't want to keep reservations for ever in case of failure */
);
Each of your workers would use short transactions to work on that table: Select a candidate Key, one that's not in the KEY_RESERVATIONS table. Try to INSERT. Failed? Try an other KEY. Periodically delete all reserved key with old RESERVED_UNTIL timestamps. Make sure the transactions for working with KEY_RESERVATIONS are as short as possible, so that two threads both trying to reserve the same key at the same time would fail quickly.
This is what you have to deal with in an optimistic (or no-) locking scheme.
One way to avoid it is to put a pessimistic lock on the table around the whole select, insert, commit sequence.
However, that means you will have to deal with not being able to access the table (handle table-locked exceptions).
If by workers you mean threads in the same application instance instead of different users (application instances), you will need thread synchronization like kubal5003 says around the select-insert-commit sequence.
A combination of the two is needed if you have multiple users/application instances each with multiple threads.
Synchronize your threads to make it impossible to insert the same value or use a db side key generation method (I don't know Firebird so I don't even know if it's there, eg. in MsSQL Server there is identity column or GUIDs also solve the problem because it's unlikely to generate two identical ones)
You should not rely the client to generate the unique key, if there's possibility for duplicates.
Use triggers and generators (maybe with help of stored procedure) to create always unique keys.
More information about proper autoinc implementation in Firebird here: http://www.firebirdfaq.org/faq29/

Is there a disadvantage to blindly using INSERT in MySQL?

Often I want to add a value to a table or update the value if its key already exists. This can be accomplished in several ways, assuming a primary or unique key is set on the 'user_id' and 'pref_key' columns in the example:
1. Blind insert, update if receiving a duplicate key error:
// Try to insert as a new value
INSERT INTO my_prefs
(user_id, pref_key, pref_value)
VALUES (1234, 'show_help', 'true');
// If a duplicate-key error occurs run an update query
UPDATE my_prefs
SET pref_value = 'true'
WHERE user_id=1234 AND pref_key='show_help';
2. Check for existence, then select or update:
// Check for existence
SELECT COUNT(*)
FROM my_prefs
WHERE user_id=1234 AND pref_key='show_help';
// If count is zero, insert
INSERT INTO my_prefs
(user_id, pref_key, pref_value)
VALUES (1234, 'show_help', 'true');
// If count is one, update
UPDATE my_prefs
SET pref_value = 'true'
WHERE user_id=1234 AND pref_key='show_help';
The first way seems to be preferable as it will require only one query for new inserts and two for an update, where as the second way will always require two queries. Is there anything I'm missing though that would make it a bad idea to blindly insert?
have a look at the ON DUPLICATE KEY syntax in http://dev.mysql.com/doc/refman/5.0/en/insert-select.html
INSERT [LOW_PRIORITY | HIGH_PRIORITY] [IGNORE]
[INTO] tbl_name [(col_name,...)]
SELECT ...
[ ON DUPLICATE KEY UPDATE col_name=expr, ... ]
There is the third MySQL way, which would be the preferred one in that RDBMS
INSERT INTO my_prefs
(user_id, pref_key, pref_value)
VALUES (1234, 'show_help', 'true')
ON DUPLICATE KEY
UPDATE pref_value = 'true'
Personally I am never a fan of exception based programming (expecting an exception in the normal operation of an application) and to me the second example is much more readable/maintainable.
There are situations where this would make a difference (very tight loops for example) but I think there should be a good reason to write code like this rather than it being the default.
If you want to avoid "the exception" by perhaps inserting a doublette and you want to use standard SQL (and your programming language / database returns the count of the updated rows) then use the following "SQL" - commands (pseudo-code):
int i = SQL("UPDATE my_prefs ...");
if(i==0) {
SQL("INSERT INTO my_prefs ...");
}
This also takes in account that - for the most use cases - updates do occur more often than inserts.
Will there be concurrent INSERTs to these rows? DELETEs?
"ON DUPLICATE" sounds great (the behavior is just what you want) provided that you're not concerned about portability to non-MySQL databases.
The "blind insert" seems reasonable and robust provided that rows are never deleted. (If the INSERT case fails because the row exists, the UPDATE afterward should succeed because the row still exists. But this assumption is false if rows are deleted - you'd need retry logic then.) On other databases without "ON DUPLICATE", you might consider an optimization if you find latency to be bad: you could avoid a database round trip in the already-exists case by putting this logic in a stored procedure.
The "check for existence" is tricky to get right if there are concurrent INSERTs. Rows could be added between your SELECT and your UPDATE. Transactions won't even really help - I think even at isolation level "serializable", you'll see "could not serialize access due to concurrent update" errors occasionally (or whatever the MySQL equivalent error message is). You'll need retry logic, so I'd say the person above who suggests using this method to avoid "exception-based programming" is wrong, as is the person who suggests doing the UPDATE first for the same reason.
You may be able to use REPLACE instead, or if using a more current MySQL you get the option of using "INSERT ... ON DUPLICATE KEY UPDATE"
The fact that several people brought this up in quick succession says "always check the MySQL docs" when you have an issue, as they're decent and in many cases, leads directly to the solution.
The first way is the preferred way as far as I know.
In your DAO model you could have an id field.
If set to null / -1 / whatever, the data hasn't been persisted.
When you persist it (or retrieve from database), set it to the id value in the database.
Your persist method can check the ID and pass it onto the update() or add() implementation.
Flaws: Getting out of sync with the database, etc. I'm sure there are more, but I really should get some work done...
So long as you're using MySQL, you can use the ON DUPLICATE keyword. For example:
INSERT INTO my_prefs (user_id, pref_key, pref_value) VALUES (1234, 'show_help', 'true')
ON DUPLICATE KEY UPDATE (pref_key, pref_value) VALUES ('show_help', 'true');