From the Impala documentation:
In most relational databases, if you try to insert a row that has already been inserted, the insertion will fail because the primary key would be duplicated.
Impala, however, will not fail the query. Instead, it will generate a warning, but continue to execute the remainder of the insert statement.
Why does Impala/Kudu act like that?
Please note that the insert won't update the value (there is an upsert command for that), it will just fail silently.
Is there a way to be aware that I'm inserting a duplicate primary key?
This is because Kudu itself does not throw an exception (it only raises a warning), and hence Impala (rightly) assumes the task succeeded.
As to why Kudu chose to do it this way we can only speculate.
This is just my opinion. Kudu (and Impala) is designed for analytical workloads rather than transactional ones, and these usually involve batch processing of large amounts of data. It would be undesirable for the application to fail because of a small number of records with duplicate keys.
Thus the default behaviour inserts all records with non-duplicate keys and skips all records with duplicate keys. This can be changed by using UPSERT, which replaces the duplicates.
According to the Impala documentation:
If an INSERT statement attempts to insert a row with the same values for the primary key columns as an existing row, that row is discarded and the insert operation continues. When rows are discarded due to duplicate primary keys, the statement finishes with a warning, not an error. (This is a change from early releases of Kudu where the default was to return in error in such cases, and the syntax INSERT IGNORE was required to make the statement succeed. The IGNORE clause is no longer part of the INSERT syntax.)
For situations where you prefer to replace rows with duplicate primary key values, rather than discarding the new data, you can use the UPSERT statement instead of INSERT. UPSERT inserts rows that are entirely new, and for rows that match an existing primary key in the table, the non-primary-key columns are updated to reflect the values in the "upserted" data.
If you really want to store new rows, not replace existing ones, but cannot do so because of the primary key uniqueness constraint, consider recreating the table with additional columns included in the primary key.
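The discard-versus-replace behaviour described above can be tried out locally. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in, not Impala or Kudu: SQLite's INSERT OR IGNORE plays the role of an INSERT that discards duplicates, and SQLite's ON CONFLICT ... DO UPDATE (SQLite 3.24 or later) plays the role of UPSERT. The table name events and both helper functions are made up for illustration. Comparing the row count before and after the insert is one way for an application to notice that rows were discarded.

```python
import sqlite3

def insert_discarding_duplicates(conn, rows):
    """Insert rows, skipping duplicate primary keys; return how many were skipped."""
    before = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    conn.executemany("INSERT OR IGNORE INTO events (id, val) VALUES (?, ?)", rows)
    after = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return len(rows) - (after - before)

def upsert(conn, rows):
    """Insert new rows and overwrite val for rows whose id already exists."""
    conn.executemany(
        "INSERT INTO events (id, val) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET val = excluded.val",
        rows,
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO events VALUES (1, 'old')")

# id 1 already exists, so only id 2 is stored and one row is reported as skipped.
skipped = insert_discarding_duplicates(conn, [(1, "new"), (2, "fresh")])
value_after_insert = conn.execute("SELECT val FROM events WHERE id = 1").fetchone()[0]

# The upsert overwrites the existing row instead of discarding the new data.
upsert(conn, [(1, "new")])
value_after_upsert = conn.execute("SELECT val FROM events WHERE id = 1").fetchone()[0]
```

The count comparison is only reliable when no other writer touches the table between the two SELECT statements, so treat it as a single-writer illustration.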
Related
I have a postgres database in which I'm refreshing data periodically. Most of the time it works, but sometimes I have issues with a unique index.
Minimal example
create table test_table (
id int
);
create unique index test_table_unique on test_table(id);
(I know, in this case it should be a primary key, but for the sake of example, please bear with me.)
Now, every hour, I do something like this:
begin;
delete from test_table;
insert into test_table (id) values (1), (2), (3)...
commit;
As I said, most of the time it will just work fine. However, sometimes postgres complains about a duplicate entry in the unique index.
error: duplicate key value violates unique constraint test_table_unique
detail: "Key (id)=(2) already exists."
My real database
In my actual table, I'm using JSON payloads, and the unique index is made on fields of that JSON payload. In particular, the table definition and the error details are as follows:
create table if not exists source (
id serial primary key,
payload jsonb not null
);
create unique index if not exists source_index_and_id on source ((payload->>'_index'), (payload->>'_id'));
error details: "Key ((payload ->> '_index'::text), (payload ->> '_id'::text))=(companies, AC9860) already exists."
I'm confident there is no actual duplicate data. I'm deleting everything for a particular ->>_index, and the ->>_id is unique in my source data.
My understanding is that if I delete rows from a table, the indices will be updated before the next statements are executed. But it doesn't seem to be the case. I've found that it helps (not sure if it actually solves the issue) to commit the changes after the delete, and before the inserts.
begin;
delete...
commit;
begin;
insert...
commit;
What's happening here?
The only ways this could happen are:
- the deleting transaction rolled back
- concurrent transactions inserted new rows after you deleted the original ones
- the inserting transaction inserts the same key twice
- the inserting transaction is accidentally run before the deleting one
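One of these causes, the same key appearing twice in a single insert batch, can be ruled out on the client before anything is sent to the database. Below is a small sketch of such a check, assuming the batch is available as a list of dicts shaped like the JSON payloads in the question. The function name find_duplicate_keys and the sample ids other than AC9860 are hypothetical.

```python
from collections import Counter

def find_duplicate_keys(payloads):
    """Return the (_index, _id) pairs that occur more than once in the batch."""
    counts = Counter((p["_index"], p["_id"]) for p in payloads)
    return sorted(key for key, n in counts.items() if n > 1)

batch = [
    {"_index": "companies", "_id": "AC9860"},
    {"_index": "companies", "_id": "AC9861"},
    {"_index": "companies", "_id": "AC9860"},  # same key as the first row
]
duplicates = find_duplicate_keys(batch)
```

An empty result does not rule out the other causes in the list, such as a concurrent transaction inserting rows between the delete and the insert.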
PostgreSQL is not a truly relational DBMS: it does not comply with rule 7 of Codd's rules, which concerns set-based operations.
Unlike other RDBMSs, PostgreSQL deletes rows one by one, and this shortcoming sometimes leads to phantom key violations.
In my paper comparing PostgreSQL to MS SQL Server, I ran a test that demonstrates this (§ 7 – The hard way to updates unique values)
Which is the bigger performance hit on a Postgres database when a table has a unique constraint:
1. Trying to insert and letting it throw a unique constraint violation error
2. Checking whether the entry exists and skipping the insert if it does
I'm importing some data, and the ORM connects some entries via a many-to-many relation through a connection table. It does not check whether the connection exists; it just runs the query and fails with a unique constraint violation when it does.
Is it better to leave it like that, or to introduce a step where I check whether the entry exists and only do the insert if it doesn't?
I would assume that your check from 2. would be an extra statement, so it is probably more expensive. I cannot say for sure, since you were rather vague in your question.
Besides, the second approach suffers from a race condition: you can never guarantee that no conflicting row gets inserted by a concurrent session after you checked.
If you want to avoid the error, the best approach would be
INSERT INTO ... VALUES (...) ON CONFLICT DO NOTHING;
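To see the statement in action without a PostgreSQL server, here is a rough sketch using Python's built-in sqlite3 module, which accepts the same ON CONFLICT DO NOTHING clause from SQLite 3.24 onwards. The connection table article_tag, its columns, and the link helper are invented for the example; with a PostgreSQL driver the SQL would stay the same apart from the placeholder style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article_tag ("
    " article_id INTEGER NOT NULL,"
    " tag_id INTEGER NOT NULL,"
    " UNIQUE (article_id, tag_id))"
)

def link(conn, article_id, tag_id):
    """Insert the connection row; do nothing if it already exists."""
    conn.execute(
        "INSERT INTO article_tag (article_id, tag_id) VALUES (?, ?) "
        "ON CONFLICT DO NOTHING",
        (article_id, tag_id),
    )

link(conn, 1, 10)
link(conn, 1, 10)  # duplicate: skipped by the database, no exception raised
link(conn, 1, 11)
conn.commit()

rows = conn.execute(
    "SELECT article_id, tag_id FROM article_tag ORDER BY tag_id"
).fetchall()
```

Because the uniqueness check and the insert happen in one statement, there is no window between "check" and "insert" for a concurrent session to slip a conflicting row in.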
Performance hit:
Because a unique constraint creates an index on the specified column, it slows down inserts and updates, most noticeably in batch operations where the number of inserts and updates is very large.
Imagine I have this simple table:
Table Name: Table1
Columns: Col1 NUMBER (Primary Key)
Col2 NUMBER
If I insert a record into Table1 with no commit...
INSERT INTO Table1 (Col1, Col2) Values (100, 1234);
How does Oracle know that this next INSERT statement violates the PK constraint, since nothing has been committed to the database yet?
INSERT INTO Table1 (Col1, Col2) Values (100, 5678);
Where/how does Oracle manage the transactions so that it knows I'm violating the constraint when I haven't even committed the transaction yet.
Oracle creates an index to enforce the primary key constraint (a unique index by default). When Session A inserts the first row, the index structure is updated but the change is not committed. When Session B tries to insert the second row, the index maintenance operation notes that there is already a pending entry in the index with that particular key. Session B cannot acquire the latch that protects the shared index structure so it will block until Session A's transaction completes. At that point, Session B will either be able to acquire the latch and make its own modification to the index (because A rolled back) or it will note that the other entry has been committed and will throw a unique constraint violation (because A committed).
It's because of the unique index that enforces the primary key constraint. Even though the insert into the data block is not yet committed, the attempt to add the duplicate entry into the index cannot succeed, even if it's done in another session.
Just because you haven't done a commit yet does not mean the first record hasn't been sent to the server. Oracle already knows about your intention to insert the first record. When you insert the second record, Oracle knows for sure there is no way this will ever succeed without a constraint violation, so it refuses.
If another user were to insert the second record, Oracle will accept it if the first record has not been committed yet. If the second user commits before you do, your commit will fail.
Unless a particular constraint is "deferred", it will be checked at the point of the statement execution. If it is deferred, it will be checked at the end of the transaction. I'm assuming you did not defer your PRIMARY KEY and that's why you get a violation even before you commit.
How this is really done is an implementation detail and may vary between different database systems and even versions of the same system. The application developer should probably not make too many assumptions about it. In Oracle's case, PRIMARY KEY uses the underlying index for performance reasons, while there are systems out there that do not even require an index (if you can live with the corresponding performance hit).
BTW, a deferrable Oracle PRIMARY KEY constraint relies on a non-unique index (vs non-deferrable PRIMARY KEY that uses a unique index).
--- EDIT ---
I just realized you didn't even commit the first INSERT. I think Justin's answer explains nicely how what is essentially a lock contention causes one of the transactions to stall.
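The single-session case from the question (two inserts with the same key, no commit in between) can be reproduced in a few lines. This sketch uses Python's built-in sqlite3 module rather than Oracle, so it only illustrates the general point that a non-deferred primary key is checked when the statement executes, not at commit time. It says nothing about Oracle's internal locking between sessions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Col1 INTEGER PRIMARY KEY, Col2 INTEGER)")
conn.commit()

# First insert: sent to the database, but the transaction is still open.
conn.execute("INSERT INTO Table1 (Col1, Col2) VALUES (100, 1234)")
uncommitted = conn.in_transaction

# Second insert with the same key: rejected immediately, before any commit.
try:
    conn.execute("INSERT INTO Table1 (Col1, Col2) VALUES (100, 5678)")
    violation = None
except sqlite3.IntegrityError as exc:
    violation = str(exc)

# Rolling back discards the first, still uncommitted row as well.
conn.rollback()
remaining = conn.execute("SELECT COUNT(*) FROM Table1").fetchone()[0]
```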
I'm using a table Mail with auto-increment Id and Mail Address. The table is used in 4 other tables and it is mainly used to save storage (String is only saved once and not 4 times). I'm using INSERT OR IGNORE to just blindly add the mail addresses to the table and if it exists ignore the update. This approach is MUCH faster than checking the existence with SELECT ... and do an INSERT if needed.
Every INSERT OR IGNORE increments the auto-increment Id, whether the row is ignored or actually inserted. In one run I have approx. 500k data sets to process, so after every run the last auto-increment key has grown by 500k. I know there are 2^63-1 possible keys, so it would take a long time to use them all up.
I also tried INSERT OR REPLACE, but this will increment the Id of the dataset on every run of the command, so this is not a solution at all.
Is there a way to prevent this increase of auto-increment key on every INSERT OR IGNORE?
Table Mail Example (replaced with pseudo Addresses)
mIdMail mMail
"1" ""
"7" "mail1#example.com"
"15" "mail2#example.com"
"17" "mail3#example.com"
"19" "mail4#example.com"
"23" "mail5#example.com"
...
Insert Query (Using Java Lib: org.apache.commons.dbutils)
INSERT OR IGNORE
INTO MAIL
( mMail )
VALUES ( ? );
Table Definition
CREATE TABLE IF NOT EXISTS MAIL (
mIdMail INTEGER PRIMARY KEY AUTOINCREMENT,
mMail CHAR(90) UNIQUE
);
To get autoincrementing values without gaps, drop the AUTOINCREMENT keyword. (Yes, you get autoincrementing values even without it.)
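A quick way to check this suggestion is the sketch below, which uses Python's built-in sqlite3 module and the table definition from the question with the AUTOINCREMENT keyword removed. The sample addresses follow the pseudo-address style used in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MAIL ("
    " mIdMail INTEGER PRIMARY KEY,"  # no AUTOINCREMENT keyword
    " mMail CHAR(90) UNIQUE)"
)

addresses = [
    "mail1#example.com",
    "mail2#example.com",
    "mail1#example.com",  # duplicate: ignored because of the UNIQUE constraint
    "mail3#example.com",
]
conn.executemany(
    "INSERT OR IGNORE INTO MAIL (mMail) VALUES (?)",
    [(a,) for a in addresses],
)
conn.commit()

ids = [row[0] for row in conn.execute("SELECT mIdMail FROM MAIL ORDER BY mIdMail")]
```

Without AUTOINCREMENT, SQLite picks one more than the largest existing id for each new row, so an ignored insert leaves no gap. The trade-off is that the id of a deleted row can be handed out again later, which matters if other tables may still hold stale references to it.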
Auto-increment keys behave the way they do specifically because the database guarantees their behavior -- regardless of concurrent transactions and transaction failures.
Auto-increment keys have two guarantees:
They are increasing, so later inserts have larger values than earlier ones.
They are guaranteed to be unique.
The mechanism for allocating the keys does not guarantee no gaps. Why not? Because no-gaps would incur a lot more overhead on the database. Basically, each transaction on the table would need to be completely serialized (that is completed and committed) before the next one can take place. Generally, that is a really bad idea from a performance perspective.
Unfortunately, SQLite versions before 3.25 don't have the simplest solution, which is simply to call row_number() on the auto-incremented keys. You could try to implement a gapless auto-increment using triggers, at the cost of significantly slowing down your application.
My real suggestion is simply to live with the gaps. Accept them. Surrender. That is how the built-in method works, and for good reason. Now design the rest of the database/application keeping this in mind.
I had the same issue, and changing "INSERT OR IGNORE" into "INSERT OR FAIL" solved the problem, so now when it fails the id value doesn't increment.