What is the bigger performance hit on a Postgres database when a table has a unique constraint:
1. Trying the insert and letting it throw a unique constraint violation error
2. Checking whether the entry exists and skipping the insert if it does
I'm importing some data, and the ORM is connecting some entries via a many-to-many relationship through a connection table. It does not check whether the connection exists; it just runs the query and fails with a unique constraint violation when it does.
Is it better to leave it like that, or to introduce a step where I check whether the entry exists and only insert if it doesn't?
I would assume that your check from option 2 would be an extra statement, so it is probably more expensive. I cannot say for sure, since you were rather vague in your question.
Besides, the second approach suffers from a race condition: you can never guarantee that no conflicting row gets inserted by a concurrent session after you checked.
If you want to avoid the error, the best approach would be
INSERT INTO ... VALUES (...) ON CONFLICT DO NOTHING;
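For the many-to-many case you describe, a minimal sketch might look like this (the item_tag connection table and its columns are made-up names, only for illustration):
-- hypothetical connection table for the many-to-many relationship
CREATE TABLE IF NOT EXISTS item_tag (
    item_id bigint NOT NULL,
    tag_id  bigint NOT NULL,
    PRIMARY KEY (item_id, tag_id)
);
-- a duplicate link is silently skipped instead of raising a unique violation
INSERT INTO item_tag (item_id, tag_id)
VALUES (1, 42)
ON CONFLICT (item_id, tag_id) DO NOTHING;
This keeps the import to a single statement per row and avoids both the error and the race condition of a separate existence check.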
Regarding the performance hit:
Since a unique constraint creates an index on the specified column(s), it will affect the rate of inserts and updates, most noticeably in batch operations where the number of inserts and updates is very large.
Related
I have a postgres database in which I'm refreshing data periodically. Most of the time it works, but sometimes I have issues with a unique index.
Minimal example
create table test_table (
id int
);
create unique index test_table_unique on test_table(id);
(I know, in this case it should be a primary key, but for the sake of example, please bear with me.)
Now, every hour, I do something like this:
begin;
delete from test_table;
insert into test_table (id) values (1), (2), (3)...
commit;
As I said, most of the time it will just work fine. However, sometimes postgres complains about a duplicate entry in the unique index.
error: duplicate key value violates unique constraint test_table_unique
detail: "Key (id)=(2) already exists."
My real database
In my actual table, I'm using JSON payloads, and the unique index is built on fields of that JSON payload. The table, index, and error detail are as follows:
create table if not exists source (
id serial primary key,
payload jsonb not null
);
create unique index if not exists source_index_and_id on source ((payload->>'_index'), (payload->>'_id'));
error details: "Key ((payload ->> '_index'::text), (payload ->> '_id'::text))=(companies, AC9860) already exists."
I'm confident there is no actual duplicate data. I'm deleting everything for a particular ->>_index, and the ->>_id is unique in my source data.
My understanding is that if I delete rows from a table, the indices will be updated before the next statements are executed. But it doesn't seem to be the case. I've found that it helps (not sure if it actually solves the issue) to commit the changes after the delete, and before the inserts.
begin;
delete...
commit;
begin;
insert...
commit;
What's happening here?
The only ways this could happen are:
the deleting transaction rolled back
concurrent transactions inserted new rows after you deleted the original ones (see the sketch below)
the inserting transaction inserts the same key twice
the inserting transaction is accidentally run before the deleting one
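If the second possibility (concurrent inserts) is the culprit, here is a minimal sketch, against the test_table example from the question, of one way to rule it out: take an exclusive lock for the duration of the refresh so no other session can insert between the delete and the insert. (This is an illustration, not something taken from your actual setup.)
begin;
-- EXCLUSIVE mode blocks concurrent INSERT/UPDATE/DELETE while still allowing reads
lock table test_table in exclusive mode;
delete from test_table;
insert into test_table (id) values (1), (2), (3);
commit;
Alternatively, adding ON CONFLICT DO NOTHING to the insert would let the refresh tolerate a row that appeared in the meantime instead of failing.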
PostgreSQL is not a truly relational DBMS and does not satisfy rule 7 of Codd's rules (high-level insert, update, and delete).
Contrary to other RDBMSs, PostgreSQL deletes rows one by one, and this lack of set-based execution can sometimes lead to phantom key violations.
In my paper comparing PostgreSQL to MS SQL Server, I made a test that demonstrates this (§ 7 – The hard way to update unique values).
From Impala documentation:
In most relational databases, if you try to insert a row that has already been inserted, the insertion will fail because the primary key would be duplicated.
Impala, however, will not fail the query. Instead, it will generate a warning, but continue to execute the remainder of the insert statement.
Why does Impala/Kudu act like that?
Please note that the insert won't update the value (there is an upsert command for that), it will just fail silently.
Is there a way to be aware that I'm inserting a duplicate primary key?
This is because Kudu itself will not throw any exception (it only raises a warning), and hence Impala will (rightly) assume the task succeeded.
As to why Kudu chose to do it this way we can only speculate.
This is just my opinion. Kudu (and Impala) is designed for analytical workloads rather than transactional workloads, which usually involve batch processing of large amounts of data. It would be undesirable for the application to fail because of a small number of records with duplicate keys.
Thus, the default behaviour inserts all records with non-duplicate keys and skips all the duplicate keys. This can be changed by using UPSERT, which replaces duplicates.
According to the Impala documentation:
If an INSERT statement attempts to insert a row with the same values for the primary key columns as an existing row, that row is discarded and the insert operation continues. When rows are discarded due to duplicate primary keys, the statement finishes with a warning, not an error. (This is a change from early releases of Kudu where the default was to return an error in such cases, and the syntax INSERT IGNORE was required to make the statement succeed. The IGNORE clause is no longer part of the INSERT syntax.)
For situations where you prefer to replace rows with duplicate primary key values, rather than discarding the new data, you can use the UPSERT statement instead of INSERT. UPSERT inserts rows that are entirely new, and for rows that match an existing primary key in the table, the non-primary-key columns are updated to reflect the values in the "upserted" data.
If you really want to store new rows, not replace existing ones, but cannot do so because of the primary key uniqueness constraint, consider recreating the table with additional columns included in the primary key.
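A minimal sketch of the difference, using a made-up Kudu table (the name and columns are only for illustration):
-- a Kudu table managed through Impala, with id as the primary key
CREATE TABLE events (id BIGINT PRIMARY KEY, payload STRING) STORED AS KUDU;
INSERT INTO events VALUES (1, 'first');
INSERT INTO events VALUES (1, 'second');   -- row discarded, statement finishes with a warning
UPSERT INTO events VALUES (1, 'second');   -- existing row kept, non-key column updated to 'second'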
Imagine I have this simple table:
Table Name: Table1
Columns: Col1 NUMBER (Primary Key)
Col2 NUMBER
If I insert a record into Table1 with no commit...
INSERT INTO Table1 (Col1, Col2) Values (100, 1234);
How does Oracle know that this next INSERT statement violates the PK constraint, since nothing has been committed to the database yet?
INSERT INTO Table1 (Col1, Col2) Values (100, 5678);
Where/how does Oracle manage the transactions so that it knows I'm violating the constraint when I haven't even committed the transaction yet?
Oracle creates an index to enforce the primary key constraint (a unique index by default). When Session A inserts the first row, the index structure is updated but the change is not committed. When Session B tries to insert the second row, the index maintenance operation notes that there is already a pending entry in the index with that particular key. Session B cannot acquire the latch that protects the shared index structure so it will block until Session A's transaction completes. At that point, Session B will either be able to acquire the latch and make its own modification to the index (because A rolled back) or it will note that the other entry has been committed and will throw a unique constraint violation (because A committed).
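As a rough two-session timeline of that behaviour (using the table from the question):
-- Session A
INSERT INTO Table1 (Col1, Col2) VALUES (100, 1234);   -- succeeds, not yet committed
-- Session B
INSERT INTO Table1 (Col1, Col2) VALUES (100, 5678);   -- blocks, waiting for Session A
-- Session A
COMMIT;
-- Session B now fails with ORA-00001 (unique constraint violated);
-- had Session A rolled back instead, Session B's insert would have succeeded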
It's because of the unique index that enforces the primary key constraint. Even though the insert into the data block is not yet committed, the attempt to add the duplicate entry into the index cannot succeed, even if it's done in another session.
Just because you haven't committed yet does not mean the first record hasn't been sent to the server. Oracle already knows about your intention to insert the first record. When you insert the second record, Oracle knows for sure there is no way this can ever succeed without a constraint violation, so it refuses.
If another user were to insert the second record, Oracle will accept it if the first record has not been committed yet. If the second user commits before you do, your commit will fail.
Unless a particular constraint is "deferred", it will be checked at the point of the statement execution. If it is deferred, it will be checked at the end of the transaction. I'm assuming you did not defer your PRIMARY KEY and that's why you get a violation even before you commit.
How this is really done is an implementation detail and may vary between different database systems and even versions of the same system. The application developer should probably not make too many assumptions about it. In Oracle's case, PRIMARY KEY uses the underlying index for performance reasons, while there are systems out there that do not even require an index (if you can live with the corresponding performance hit).
BTW, a deferrable Oracle PRIMARY KEY constraint relies on a non-unique index (vs non-deferrable PRIMARY KEY that uses a unique index).
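A minimal sketch of the deferrable variant (the table and constraint names are made up); the uniqueness check moves from statement time to commit time:
CREATE TABLE t (
    id NUMBER,
    CONSTRAINT t_pk PRIMARY KEY (id) DEFERRABLE INITIALLY DEFERRED
);
INSERT INTO t (id) VALUES (100);
INSERT INTO t (id) VALUES (100);   -- no error yet: the check is deferred
COMMIT;                            -- the unique violation is reported here instead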
--- EDIT ---
I just realized you didn't even commit the first INSERT. I think Justin's answer explains nicely how what is essentially a lock contention causes one of the transactions to stall.
I want to insert a row, but if a conflict occurs (example below) I'd like the database to lock the existing row so I can log its contents for debugging purposes. I am using READ_COMMITTED transaction isolation.
For example:
CREATE TABLE users(id BIGINT AUTO_INCREMENT, username VARCHAR(30),
count INT NOT NULL, PRIMARY KEY(id), UNIQUE(username));
Thread 1:
INSERT INTO users(username, count) VALUES('joe', 1000);
transaction.commit();
Thread 2:
// Insert fails due to conflict with above record
INSERT INTO users(username, count) VALUES('joe', 0);
// Get the conflicting row and log its properties
SELECT * FROM users WHERE username = 'joe';
If the conflicting row is not locked, it may be modified by the time I check it. The only workaround I found is invoking SELECT id FROM users WHERE username = 'joe' FOR UPDATE before the insert. Is there a way to implement this without any overhead when a conflict does not occur?
UPDATE: I am not asking to avoid the conflict or the resulting SQLException. I am just asking for the conflicting row to get locked so I can look up what values triggered the conflict. Yes, I know that the conflicting record contains joe but I want to log all its other columns.
No, it is not possible to eliminate the conflict on a UNIQUE column when using INSERT with unique column(s). Trying to write SQL that never has to deal with SQL exceptions is just wasted effort that always ends up creating SQL that fails under some conditions.
Exception handling can't be avoided when dealing with real-time, multi-threaded, multi-user database servers, unless you can afford to lock the table, do the update, and unlock the table (which will create terrible performance under heavy load from many users).
The UNIQUE CONSTRAINT VIOLATION exception will ALWAYS occur on the second INSERT, as the two INSERTs in your example could be widely separated in time (e.g. by hours, days or weeks); table or row locking won't change this.
This problem is one that should be solved at the GUI level anyway: choosing a "user name" that may already have been chosen by a previous user requires giving the "new" user feedback like "Sorry, that user name is already in use by another user", so it seems unlikely that handling the UNIQUE VIOLATION exception can or should ever be "avoided".
In addition, there is no reason to SELECT ... FOR UPDATE, since all you need to do is SELECT id WHERE username = newName and see if you get a resulting id or null; (id == null) means the user name is not in use. But even then, two users could both get the "not in use" result at the same time and one of the INSERTs could still fail.
When the UNIQUE exception is returned on the duplicate INSERT, the second INSERT has failed and that record was not created, so there is no "duplicate" record to lock and then read after the exception is returned on the failed INSERT.
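As a minimal sketch of the check-then-handle pattern described above, against the users table from the question (the duplicate-key error still has to be handled, because another session can insert 'joe' between the check and the insert):
-- check first; no FOR UPDATE is needed just to test existence
SELECT id FROM users WHERE username = 'joe';
-- if no row came back, attempt the insert, still prepared to catch the
-- unique violation raised if a concurrent session won the race
INSERT INTO users(username, count) VALUES('joe', 0);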
Which version of SQL are you using? I'm not sure I understand your question correctly, but I think you could do this in a trigger.
In the trigger, you can look at the value being inserted (your conflicting row), log it, and roll back. That means that when no conflict occurs, you don't have to do anything extra, and when a conflict occurs, the log entry is made and the row is not inserted.
No, most databases do not support that kind of operation.
You can do tricks like creating an explicit transaction
BEGIN TRANSACTION
IF EXISTS(SELECT ...)
ROLLBACK
INSERT INTO...
COMMIT
But that isn't exactly what you want. The only way to achieve what you're asking for is to use one of the B-tree style libraries, which are a lot more low-level.
There doesn't seem to be a portable way of doing this and looking at MVCC there is a strong indication that this cannot be implemented without a substantial performance impact.
So in conclusion: you're going to have to settle for knowing that a conflict occurred but have no way of being 100% sure of the cause (there is no thread-safe way to verify it).