First_or_create yet ERROR: duplicate key value violates unique constraint - ruby-on-rails-3

I have the following code:
rating = user.recipe_ratings.where(:recipe_id => recipe.id).where(:delivery_id => delivery.id).first_or_create
Yet somehow we get occasional PG::Error: ERROR: duplicate key value violates unique constraint errors from this. I can't think of any reason that should happen, since the whole point of first_or_create is to prevent those.
Is this just a crazy race-condition? How can I solve this without a maddening series of begin...rescue blocks?

This seems to stem from a typical race condition for the "SELECT or INSERT" case.
Ruby seems to choose performance over safety in its implementation. Quoting "Ruby on Rails Guides":
The first_or_create method checks whether first returns nil or not. If
it does return nil, then create is called.
...
The SQL generated by this method looks like this:
SELECT * FROM clients WHERE (clients.first_name = 'Andy') LIMIT 1
BEGIN
INSERT INTO clients (created_at, first_name, locked, orders_count, updated_at)
VALUES ('2011-08-30 05:22:57', 'Andy', 0, NULL, '2011-08-30 05:22:57')
COMMIT
If that's the actual implementation (?), it seems completely open for race conditions. Another transaction can easily SELECT between the first transaction's SELECT and INSERT. And then try its own INSERT, which would raise the error you reported, since the first transaction has inserted the row in the meantime.
The time frame for a race condition could be drastically reduced with a data-modifying CTE. Even a safe version would not cost that much more. But I guess they have their reasons.
Compare this safe implementation:
Is SELECT or INSERT in a function prone to race conditions?

Rails 6 adds a new create_or_find_by method that alleviates a possible race condition, with a few drawbacks:
The underlying table must have the relevant columns defined with unique constraints.
A unique constraint violation may be triggered by only one, or at least less than all, of the given attributes. This means that the subsequent find_by! may fail to find a matching record, which will then raise an ActiveRecord::RecordNotFound exception, rather than a record with the given attributes.
While we avoid the race condition between SELECT -> INSERT from find_or_create_by, we actually have another race condition between INSERT -> SELECT, which can be triggered if a DELETE between those two statements is run by another client. But for most applications, that's a significantly less likely condition to hit.
It relies on exception handling to handle control flow, which may be marginally slower.
def create_or_find_by(attributes, &block)
transaction(requires_new: true) { create(attributes, &block) }
rescue ActiveRecord::RecordNotUnique
find_by!(attributes)
end
Using your example:
rating = user.recipe_ratings.create_or_find_by(
recipe_id: recipe.id,
delivery_id: delivery.id
)

Related

Postgres unique constraint performance, insert + fail on duplicate or check?

What is bigger performance hit on a postgres database when table has unique constraint:
Trying to insert and let it throw unique violation constraint error
Check if entry exist and not do insert if it does
I'm importing some data, and ORM is connecting some entries via many to many through connection table. It is not checking if connection exist, it just runs the query and fails with unique constraint when it exist.
Is it better to leave it like that, or to introduce a step where I would check if the entry exist and then do the insert if it doesn't?
I would assume that your check from 2. would be an extra statement, so it is probably more expensive. I cannot say for sure, since you were rather vague in your question.
Besides the second approach is suffering from a race condition: you can never guarantee that no conflicting row gets inserted by a concurrent session after you checked.
If you want to avoid the error, the best approach would be
INSERT INTO ... VALUES (...) ON CONFLICT DO NOTHING;
performance hit:
As unique constraint creates index on the specified column it will affect the rate of insertion and updation.And most abruptly, In batch operations where numbers of inserts and updates are very large.

Locking table in postgresql

I have a table named as 'games', which contains a column named as 'title', this column is unique, database used in PostgreSQL
I have a user input form that allows him to insert a new 'game' in 'games' table. The function that insert a new game checks if a previously entered 'game' with the same 'title' already exists, for this, I get the count of rows, with the same game 'title'.
I use transactions for this, the insert function at the start uses BEGIN, gets the row count, if row count is 0, inserts the new row and after process is completed, it COMMITS the changes.
The problem is that, there are chances that 2 games with the same title if submitted by the user at the same time, would be inserted twice, since I just get the count of rows to chk for duplicate records, and each of the transaction would be isolated from each other
I thought of locking the tables when getting the row count as:
LOCK TABLE games IN ACCESS EXCLUSIVE MODE;
SELECT count(id) FROM games WHERE games.title = 'new_game_title'
Which would lock the table for reading too (which means the other transaction would have to wait, until the current one is completed successfully). This would solve the problem, which is what I suspect. Is there a better way around this (avoiding duplicate games with the same title)
You should NOT need to lock your tables in this situation.
Instead, you can use one of the following approaches:
Define UNIQUE index for column that really must be unique. In this case, first transaction will succeed, and second will error out.
Define AFTER INSERT OR UPDATE OR DELETE trigger that will check your condition, and if it does not hold, it should RAISE error, which will abort offending transaction
In all these cases, your client code should be ready to properly handle possible failures (like failed transactions) that could be returned by executing your statements.
Using the highest transaction isolation(Serializable) you can achieve something similar to your actual question. But be aware that this may fail ERROR: could not serialize access due to concurrent update
I do not agree with the constraint approach entirely. You should have a constraint to protect data integrity, but relying on the constraint forces you to identify not only what error occurred, but which constraint caused the error. The trouble is not catching the error as some have discussed but identifying what caused the error and providing a human readable reason for the failure. Depending on which language your application is written in, this can be next to impossible. eg: telling the user "Game title [foo] already exists" instead of "game must have a price" for a separate constraint.
There is a single statement alternative to your two stage approach:
INSERT INTO games ( [column1], ... )
SELECT [value1], ...
WHERE NOT EXISTS ( SELECT x FROM games as g2 WHERE games.title = g2.title );
I want to be clear with this... this is not an alternative to having a unique constraint (which requires extra data for the index). You must have one to protect your data from corruption.

SELECT ... FOR UPDATE and conflicting inserts

I want to insert a row, but if a conflict occurs (example below) I'd like the database to lock the existing row so I can log its contents for debugging purposes. I am using READ_COMMITTED transaction isolation.
For example:
CREATE TABLE users(id BIGINT AUTO_INCREMENT, name VARCHAR(30),
count INT NOT NULL, PRIMARY KEY(id), UNIQUE(name));
Thread 1:
INSERT INTO users(username, count) VALUES('joe', 1000);
transaction.commit();
Thread 2:
// Insert fails due to conflict with above record
INSERT INTO users(username, count) VALUES('joe', 0);
// Get the conflicting row and log its properties
SELECT * FROM users WHERE username = 'joe';
If the conflicting row is not locked, it may be modified by the time I check it. The only workaround I found is invoking SELECT id FROM users WHERE username = 'joe' FOR UPDATE before the insert. Is it possible way to implement this without any overhead when a conflict does not occur?
UPDATE: I am not asking to avoid the conflict or the resulting SQLException. I am just asking for the conflicting row to get locked so I can look up what values triggered the conflict. Yes, I know that the conflicting record contains joe but I want to log all its other columns.
No it is not possible to eliminate the confict of a UNIQUE column
when using INSERT of rows with unique column(s).
Trying to write SQL that never has to deal with SQL Exceptions
is just wasted effort that always ends up creating SQL that fails
under some conditions.
Exception handling can't be avoided when dealing with real time
multi-threaded multi-user database servers, unless you
can afford to lock the table, do the update, and unlock the
table (which will create terrible performance when under
heavy load of many users)
The UNIQUE CONSTRAINT VIOLATION Exception will ALWAYS occur on the 2nd INSERT,
as the two INSERTs in your example could be widely separated in time
(e.g. by hours, days or weeks); Table or row locking won't change this.
This problem is one that should be solved at the GUI level anyway
as choosing a "user name" that may already be chosen by a previous
user, requires providing the "new" user with feedback like
"Sorry, that user name is already in use by another user", so
it would seen unlikely that handling the UNIQUE VIOLATION exception
can or should ever be "avoided".
In addition, there is no reason to SELECT ... FOR UPDATE, since
all you need to do is SELECT id WHERE name = newName and see if
you get a resulting id or null; (id == null) => user name not in use,
but even then two user could try to both get the "not in use" result
at the same time and one of the INSERTs could still fail.
When the UNIQUE exception is returned on the duplicate INSERT,
the second INSERT has failed and that record was not created,
so there is no "duplicate" record to lock and then read after
the UNIQUE exception is returned on the failed INSERT.
Wich version of SQL are you usign? I'm not sure if I understand correctly your question but I think you could do this in a trigger.
In the trigger, you can view the inserted value (your conflicting row) and log it, and make a rollback. Wich means that when you insert your row, when a conflict does not occur, you don't have to commit anything, and when a conflict occurs, the log is made and the row is not inserted.
No, most databases do not support that kind of operation.
You can do tricks like creating an explicit transaction
BEGIN TRANSACTION
IF EXIST(SELECT ...)
ROLLBACK
INSERT INTO...
COMMIT
But that isn't exactly what you want. The only to achieve what you're asking for is to use one of the B-TREE style libraries which are a lot more low-level.
There doesn't seem to be a portable way of doing this and looking at MVCC there is a strong indication that this cannot be implemented without a substantial performance impact.
So in conclusion: you're going to have to settle for knowing that a conflict occurred but have no way of being 100% sure of the cause (there is no thread-safe to verify).

what is the difference between triggers, assertions and checks (in database)

Can anybody explain (or suggest a site or paper) the exact difference between triggers, assertions and checks, also describe where I should use them?
EDIT: I mean in database, not in any other systems or programing languages.
Triggers - a trigger is a piece of SQL to execute either before or after an update, insert, or delete in a database. An example of a trigger in plain English might be something like: before updating a customer record, save a copy of the current record. Which would look something like:
CREATE TRIGGER triggerName
AFTER UPDATE
INSERT INTO CustomerLog (blah, blah, blah)
SELECT blah, blah, blah FROM deleted
The difference between assertions and checks is a little more murky, many databases don't even support assertions.
Check Constraint - A check is a piece of SQL which makes sure a condition is satisfied before action can be taken on a record. In plain English this would be something like: All customers must have an account balance of at least $100 in their account. Which would look something like:
ALTER TABLE accounts
ADD CONSTRAINT CK_minimumBalance
CHECK (balance >= 100)
Any attempt to insert a value in the balance column of less than 100 would throw an error.
Assertions - An assertion is a piece of SQL which makes sure a condition is satisfied or it stops action being taken on a database object. It could mean locking out the whole table or even the whole database.
To make matters more confusing - a trigger could be used to enforce a check constraint and in some DBs can take the place of an assertion (by allowing you to run code un-related to the table being modified). A common mistake for beginners is to use a check constraint when a trigger is required or a trigger when a check constraint is required.
An example: All new customers opening an account must have a balance of $100; however, once the account is opened their balance can fall below that amount. In this case you have to use a trigger because you only want the condition evaluated when a new record is inserted.
In the SQL standard, both ASSERTIONS and CHECK CONSTRAINTS are what relational theory calls "constraints" : rules that the data actually contained in the database must comply with.
The difference between the two is that CHECK CONSTRAINTS are, in a sense, much "simpler" : they are rules that relate to one single row only, while ASSERTIONs can involve any number of other tables, or any number of other rows in the same table. That obviously makes it (much !) more complex for the DBMS builders to support it, and that is, in turn, the reason why they don't : they just don't know how to do it.
TRIGGERs are pieces of executable code of which it can be declared to the DBMS that those should be executed every time a certain kind of update operation (insert/delete/update) gets done on a certain table. Because triggers can raise exceptions, they are a MEANS for implementing the same thing as an ASSERTION. However, with triggers, it's still the programmer who has to do all the coding, and not make any mistake.
EDIT
Onedaywhen's comments re. ASSERTION/CHECK cnstr. are correct. The difference is way more subtle (AND confusing). The standard indeed allows subqueries in CHECK constraints. (Most products don't support it though, so my "relate to a single row" is true for most SQL products, but not for the standard.) So is there still a difference ? Yes there still is. More than one even.
First case : TABLE MEN (ID:INTEGER) and TABLE WOMEN(ID:INTEGER). Now imagine a rule to the effect that "no ID value can appear both in the MEN and in the WOMEN table". That's a single rule. The intent of ASSERTION is precisely that the database designer would state this single rule [and be done with it], and the DBMS would know how to deal with this [efficiently, of course] and how to enforce this rule, no matter what particular update gets done to the database. In the example, the DBMS would know that it has to do a check for this rule upon INSERT INTO MEN, and upon INSERT INTO WOMEN, but not upon DELETE FROM MEN/WOMEN, or INSERT INTO <anyothertable>.
But DBMS's aren't smart enough for doing all that. So what needs to be done ? The database designer must add TWO CHECK constraints to his database, one to the MEN table (checking newly inserted MEN ID's against the WOMEN table) and one to the WOMAN table (checking the other way round). There's your first difference : one rule, one ASSERTION, TWO CHECK constraints. CHECK constraints are a lower level of abstraction than ASSERTIONs, because they require the designer to do more thinking himself about (a) all the kinds of update that could potentially cause his ASSERTION to be violated, and (b) what particular check should be carried out for any of the specific "kinds of update" he found in (a). (Although I don't really like making "absolute" statements on what is still "WHAT" and what is "HOW", I'd summarize that CHECK constraints require more "HOW" thinking (procedural) by the database designer, whereas ASSERTIONs allow the database designer to focus exclusively on the "WHAT" (declarative).)
Second case (though I'm not entirely sure of this - so take with a grain of salt) : just your average RI rule. Of course you are used to specify this using some REFERENCES clause. But imagine that a REFERENCES clause was not available. A rule like "Every ORDER must be placed by a known CUSTOMER" is really just that, a rule, thus : a single ASSERTION. However, we all know that such a rule can always be violated in two ways : inserting an ORDER (in this example), and deleting a CUSTOMER. Now, in line with the foregoing MAN/WOMEN example, if we wanted to implement this single rule/ASSERTION using CHECK constraints, then we'd have to write a CHECK constraint that checks CUSTOMER existence upon insertions into ORDER, but what CHECK constraint could we write that does whatever is needed upon deletion from CUSTOMER ? They simply aren't designed for that purpose, as far as I can tell. There's your second difference : CHECK constraints are tied to INSERTs exclusively, ASSERTIONS can define rules that will also be checked upon DELETEs.
Third case : Imagine a table COMPOS (componentID:... percentage:INTEGER), and a rule to the effect that "the sum of all percentages must at all times be equal to 100". That's a single rule, and an ASSERTION is capable of specifying that. But try and imagine how you would go about enforcing such a rule with CHECK constraints ... If you have a valid table with, say, three nonzero rows adding up to a hundred, how would you apply any change to this table that could survive your CHECK constraint ? You can't delete or update(decrease) any row without having to add other replacing rows, or update the remaining rows, that sum up to the same percentage. Likewise for insert or update (increase). You'd need deferred constraint checking at the very least, and then what are you going to CHECK ? There's your third difference : CHECK constraints are targeted to individual rows, while ASSERTIONs can also define/express rules that "span" several rows (i.e. rules about aggregations of rows).
Assertions do not modify the data, they only check certain conditions
Triggers are more powerful because the can check conditions and also modify the data
Assertions are not linked to specific tables in the database and not linked to specific events
Triggers are linked to specific tables and specific events
A database constraint involves a condition that must be satisfied when the database is updated. In SQL, if a constraint condition evaluates to false then the update fails, the data remains unchanged and the DBMS generates an error.
Both CHECK and ASSERTION are database constraints defined by the SQL standards. An important distinction is that a CHECK is applied to a specific base table, whereas an ASSERTION is applied to the whole database. Consider a constraint that limits the combined rows in tables T1 and T2 to a total of 10 rows e.g.
CHECK (10 >= (
SELECT COUNT(*)
FROM (
SELECT *
FROM T1
UNION
SELECT *
FROM T2
) AS Tn
))
Assume the tables are empty. If this was applied as an ASSERTION only and a user tried to insert 11 rows into T1 then then the update would fail. The same would apply if the constraint was applied as a CHECK constraint to T1 only. However, if the constraint was applied as a CHECK constraint to T2 only the constraint would succeed because a statement targeting T1 does not cause the constraints applied to T1 to be tested.
Both an ASSERTION and a CHECK may be deferred (if declared as DEFERRABLE), allowing for data to temporarily violate the constraint condition, but only within a transaction.
ASSERTION and CHECK constraints involving subqueries are features outside of core Standard SQL and none of the major SQL products support these features. MS Access (not exactly an industrial-strength product) supports CHECK constraints involving subqueries but not deferrable constraints plus constraint testing is always performed on a row-by-row basis, the practical consequences being that the functionality is very limited.
In common with CHECK constraints, a trigger is applied to a specific table. Therefore, a trigger can be used to implement the same logic as a CHECK constraint but not an ASSERTION. A trigger is procedural code and, unlike constraints, the user must take far more responsibility for concerns such as performance and error handling. Most commercial SQL products support triggers (the aforementioned MS Access does not).
The expression should be true for trigger to fire, but check will be evaluated wherever the expression is false.

Is there a disadvantage to blindly using INSERT in MySQL?

Often I want to add a value to a table or update the value if its key already exists. This can be accomplished in several ways, assuming a primary or unique key is set on the 'user_id' and 'pref_key' columns in the example:
1. Blind insert, update if receiving a duplicate key error:
// Try to insert as a new value
INSERT INTO my_prefs
(user_id, pref_key, pref_value)
VALUES (1234, 'show_help', 'true');
// If a duplicate-key error occurs run an update query
UPDATE my_prefs
SET pref_value = 'true'
WHERE user_id=1234 AND pref_key='show_help';
2. Check for existence, then select or update:
// Check for existence
SELECT COUNT(*)
FROM my_prefs
WHERE user_id=1234 AND pref_key='show_help';
// If count is zero, insert
INSERT INTO my_prefs
(user_id, pref_key, pref_value)
VALUES (1234, 'show_help', 'true');
// If count is one, update
UPDATE my_prefs
SET pref_value = 'true'
WHERE user_id=1234 AND pref_key='show_help';
The first way seems to be preferable as it will require only one query for new inserts and two for an update, where as the second way will always require two queries. Is there anything I'm missing though that would make it a bad idea to blindly insert?
have a look at the ON DUPLICATE KEY syntax in http://dev.mysql.com/doc/refman/5.0/en/insert-select.html
INSERT [LOW_PRIORITY | HIGH_PRIORITY] [IGNORE]
[INTO] tbl_name [(col_name,...)]
SELECT ...
[ ON DUPLICATE KEY UPDATE col_name=expr, ... ]
There is the third MySQL way, which would be the preferred one in that RDBMS
INSERT INTO my_prefs
(user_id, pref_key, pref_value)
VALUES (1234, 'show_help', 'true')
ON DUPLICATE KEY
UPDATE pref_value = 'true'
Personally I am never a fan of exception based programming (expecting an exception in the normal operation of an application) and to me the second example is much more readable/maintainable.
There are situations where this would make a difference (very tight loops for example) but I think there should be a good reason to write code like this rather than it being the default.
If you want to avoid "the exception" by perhaps inserting a doublette and you want to use standard SQL (and your programming language / database returns the count of the updated rows) then use the following "SQL" - commands (pseudo-code):
int i = SQL("UPDATE my_prefs ...");
if(i==0) {
SQL("INSERT INTO my_prefs ...");
}
This also takes in account that - for the most use cases - updates do occur more often than inserts.
Will there be concurrent INSERTs to these rows? DELETEs?
"ON DUPLICATE" sounds great (the behavior is just what you want) provided that you're not concerned about portability to non-MySQL databases.
The "blind insert" seems reasonable and robust provided that rows are never deleted. (If the INSERT case fails because the row exists, the UPDATE afterward should succeed because the row still exists. But this assumption is false if rows are deleted - you'd need retry logic then.) On other databases without "ON DUPLICATE", you might consider an optimization if you find latency to be bad: you could avoid a database round trip in the already-exists case by putting this logic in a stored procedure.
The "check for existence" is tricky to get right if there are concurrent INSERTs. Rows could be added between your SELECT and your UPDATE. Transactions won't even really help - I think even at isolation level "serializable", you'll see "could not serialize access due to concurrent update" errors occasionally (or whatever the MySQL equivalent error message is). You'll need retry logic, so I'd say the person above who suggests using this method to avoid "exception-based programming" is wrong, as is the person who suggests doing the UPDATE first for the same reason.
You may be able to use REPLACE instead, or if using a more current MySQL you get the option of using "INSERT ... ON DUPLICATE KEY UPDATE"
The fact that several people brought this up in quick succession says "always check the MySQL docs" when you have an issue, as they're decent and in many cases, leads directly to the solution.
The first way is the preferred way as far as I know.
In your DAO model you could have an id field.
If set to null / -1 / whatever, the data hasn't been persisted.
When you persist it (or retrieve from database), set it to the id value in the database.
Your persist method can check the ID and pass it onto the update() or add() implementation.
Flaws: Getting out of sync with the database, etc. I'm sure there are more, but I really should get some work done...
So long as you're using MySQL, you can use the ON DUPLICATE keyword. For example:
INSERT INTO my_prefs (user_id, pref_key, pref_value) VALUES (1234, 'show_help', 'true')
ON DUPLICATE KEY UPDATE (pref_key, pref_value) VALUES ('show_help', 'true');