I have a table where I'm recording if a user has viewed an object at least once, hence:
HasViewed
ObjectID number (FK to Object table)
UserId number (FK to Users table)
Both fields are NOT NULL and together form the Primary Key.
My question is, since I don't care how many times someone has viewed an object (after the first), I have two options for handling inserts.
Do a SELECT count(*) ... and if no records are found, insert a new record.
Always just insert a record, and if it throws a DUP_VAL_ON_INDEX exceptions (indicating that there already was such a record), just ignore it.
What's the downside of choosing the second option?
UPDATE:
I guess the best way to put it is : "Is the overhead caused by the exception worse than the overhead caused by the initial select?"
I would normally just insert and trap the DUP_VAL_ON_INDEX exception, as this is the simplest to code. This is more efficient than checking for existence before inserting. I don't consider doing this a "bad smell" (horrible phrase!) because the exception we handle is raised by Oracle - it's not like raising your own exceptions as a flow-control mechanism.
Thanks to Igor's comment I have now run two different benchamrks on this: (1) where all insert attempts except the first are duplicates, (2) where all inserts are not duplicates. Reality will lie somewhere between the two cases.
Note: tests performed on Oracle 10.2.0.3.0.
Case 1: Mostly duplicates
It seems that the most efficient approach (by a significant factor) is to check for existence WHILE inserting:
prompt 1) Check DUP_VAL_ON_INDEX
begin
for i in 1..1000 loop
begin
insert into hasviewed values(7782,20);
exception
when dup_val_on_index then
null;
end;
end loop
rollback;
end;
/
prompt 2) Test if row exists before inserting
declare
dummy integer;
begin
for i in 1..1000 loop
select count(*) into dummy
from hasviewed
where objectid=7782 and userid=20;
if dummy = 0 then
insert into hasviewed values(7782,20);
end if;
end loop;
rollback;
end;
/
prompt 3) Test if row exists while inserting
begin
for i in 1..1000 loop
insert into hasviewed
select 7782,20 from dual
where not exists (select null
from hasviewed
where objectid=7782 and userid=20);
end loop;
rollback;
end;
/
Results (after running once to avoid parsing overheads):
1) Check DUP_VAL_ON_INDEX
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.54
2) Test if row exists before inserting
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.59
3) Test if row exists while inserting
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.20
Case 2: no duplicates
prompt 1) Check DUP_VAL_ON_INDEX
begin
for i in 1..1000 loop
begin
insert into hasviewed values(7782,i);
exception
when dup_val_on_index then
null;
end;
end loop
rollback;
end;
/
prompt 2) Test if row exists before inserting
declare
dummy integer;
begin
for i in 1..1000 loop
select count(*) into dummy
from hasviewed
where objectid=7782 and userid=i;
if dummy = 0 then
insert into hasviewed values(7782,i);
end if;
end loop;
rollback;
end;
/
prompt 3) Test if row exists while inserting
begin
for i in 1..1000 loop
insert into hasviewed
select 7782,i from dual
where not exists (select null
from hasviewed
where objectid=7782 and userid=i);
end loop;
rollback;
end;
/
Results:
1) Check DUP_VAL_ON_INDEX
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.15
2) Test if row exists before inserting
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.76
3) Test if row exists while inserting
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.71
In this case DUP_VAL_ON_INDEX wins by a mile. Note the "select before insert" is the slowest in both cases.
So it appears that you should choose option 1 or 3 according to the relative likelihood of inserts being or not being duplicates.
I don't think there is a downside to your second option. I think it's a perfectly valid use of the named exception, plus it avoids the lookup overhead.
Try this?
SELECT 1
FROM TABLE
WHERE OBJECTID = 'PRON_172.JPG' AND
USERID='JCURRAN'
It should return 1, if there is one there, otherwise NULL.
In your case, it looks safe to ignore, but for performance, one should avoid exceptions on the common path. A question to ask, "How common will the exceptions be?"
Few enough to ignore? or so many another method should be used?
IMHO it is best to go with Option 2: Other than what is already been said, you should consider thread safety. If you go with option 1 and If multiple threads are executing your PL/SQL block then its possible that two or more threads fire select at the same time and at that time there is no record, this will end up leading all threads to insert and you will get unique constraint error.
Usually, exception handling is slower; however if it would happen only seldom, then you would avoid the overhead of the query.
I think it mainly depends on the frequency of the exception, but if performance is important, I would suggest some benchmarking with both approaches.
Generally speaking, treating common events as exception is a bad smell; for this reason you could see also from another point of view.
If it is an exception, then it should be treated as an exception - and your approach is correct.
If it is a common event, then you should try to explicitly handle it - and then checking if the record is already inserted.
Related
I create the trigger A1 so that an article with a certain type, that is 'Bert' cannot be added more than once and it can have only 1 in the stock.
However, although i create the trigger, i can still add an article with the type 'Bert'. Somehow, the count returns '0' but when i run the same sql statement, it returns the correct number. It also starts counting properly if I drop the trigger and re-add it. Any ideas what might be going wrong?
TRIGGER A1 BEFORE INSERT ON mytable
FOR EACH ROW
DECLARE
l_count NUMBER;
PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
SELECT COUNT(*) INTO l_count FROM mytable WHERE article = :new.article;
dbms_output.put_line('Count: ' || l_count);
IF l_count >0 THEN
IF(:new.TYPEB = 'Bert') THEN
dbms_output.put_line('article already exists!');
ROLLBACK;
END IF;
ELSIF (:new.TYPEB = 'Bert' AND :new.stock_count>1) THEN
dbms_output.put_line('stock cannot have more than 1 of this article with type Bert');
ROLLBACK;
END IF;
END;
This is the insert statement I use:
INSERT INTO mytable VALUES('Chip',1,9,1,'Bert');
A couple of points. First, you are misusing the autonomous transaction pragma. It is meant for separate transactions you need to commit or rollback independently of the main transaction. You are using it to rollback the main transaction -- and you never commit if there is no error.
And those "unforeseen consequences" someone mentioned? One of them is that your count always returns 0. So remove the pragma both because it is being misused and so the count will return a proper value.
Another thing is don't have commits or rollbacks within triggers. Raise an error and let the controlling code do what it needs to do. I know the rollbacks were because of the pragma. Just don't forget to remove them when you remove the pragma.
The following trigger works for me:
CREATE OR REPLACE TRIGGER trg_mytable_biu
BEFORE INSERT OR UPDATE ON mytable
FOR EACH ROW
WHEN (NEW.TYPEB = 'Bert') -- Don't even execute unless this is Bert
DECLARE
L_COUNT NUMBER;
BEGIN
SELECT COUNT(*) INTO L_COUNT
FROM MYTABLE
WHERE ARTICLE = :NEW.ARTICLE
AND TYPEB = :NEW.TYPEB;
IF L_COUNT > 0 THEN
RAISE_APPLICATION_ERROR( -20001, 'Bert already exists!' );
ELSIF :NEW.STOCK_COUNT > 1 THEN
RAISE_APPLICATION_ERROR( -20001, 'Can''t insert more than one Bert!' );
END IF;
END;
However, it's not a good idea for a trigger on a table to separately access that table. Usually the system won't even allow it -- this trigger won't execute at all if changed to "after". If it is allowed to execute, one can never be sure of the results obtained -- as you already found out. Actually, I'm a little surprised the trigger above works. I would feel uneasy using it in a real database.
The best option when a trigger must access the target table is to hide the table behind a view and write an "instead of" trigger on the view. That trigger can access the table all it wants.
You need to do an AFTER trigger, not a BEFORE trigger. Doing a count(*) "BEFORE" the insert occurs results in zero rows because the data hasn't been inserted yet.
(Using Oracle) I have a table with one column (myCol) which is the primary key.
When doing an insert, is it faster to check with a select statement before doing the insert or just write the insert and let error handling check it?
So would this be faster?
BEGIN
SELECT count(*) INTO v_count FROM myTbl WHERE myCol = v_newVal;
IF v_count = 0 THEN
INSERT INTO myTbl (myCol) VALUES (v_newVal);
END IF;
END;
or this?
BEGIN
INSERT INTO myTbl (myCol) VALUES (v_newVal);
EXCEPTION WHEN DUP_VAL_ON_INDEX THEN
null;
END;
Thanks!
In terms of performance, certainly the second option is better.
But, more importantly, the first option has a bug - it will fail in cases where more than one session tries to insert the same value at the same time - one of them will succeed, the other session will wait for the first to commit and then raise DUP_VAL_ON_INDEX (which you haven't handled).
I think you could try merge. Typically you should avoid count() from dup checking. There is also a hint you can use to ignore duplicates. This page gives a nice example of this.
http://guyharrison.squarespace.com/blog/2010/1/1/the-11gr2-ignore_row_on_dupkey_index-hint.html
I've got a nightly archive job that sweeps existing rows from a records table into records_archive:
INSERT INTO records_archive
SELECT * FROM records WHERE id > 555 AND id <= 556;
For reasons that are outside the scope of this question, there is a column -- master_guid -- in both records and records_archive that has a UNIQUE index, and I can't change that. Rebuilding the index into a non-unique one is off the table for reasons out of my control. So, in principle, every master_guid value is supposed to be unique. There are some bad client implementations out there that sometimes fail to generate unique enough GUIDs, though, which results in a collision when we attempt to insert a record into records_archive that with a master_guid value that already exists in records_archive.
I can't fix the clients, so I need to work around it. The way to do that is to catch the unique_violation exception, modify the GUID (adding some random characters to it), and attempt to re-insert.
I can't just wrap the above-mentioned INSERT query in a stored procedure and catch the unique_violation exception, because the whole query is one transaction. I need row-level access. So, I wrote a stored procedure to iterate over each row and catch a unique_violation exception for that row individually:
CREATE OR REPLACE FUNCTION archive_records(start_id bigint,
end_id bigint)
RETURNS void
AS $$
DECLARE
_r record;
BEGIN
FOR _r IN
SELECT * FROM records WHERE id > start_id AND id <= end_id
LOOP
BEGIN
INSERT INTO records_archive VALUES (_r.*);
EXCEPTION WHEN unique_violation THEN
-- Manipulate the _r.master_guid value, add some random
-- numbers to it or whatever, and attempt reinsertion.
END;
END LOOP;
END;
$$ LANGUAGE 'plpgsql';
The problem is that records (and records_archive) are insanely wide tables. I don't want to explicitly enumerate every single column to be copied from _r to records_archive, not only because I'm lazy, but because this stored procedure would become a dependency in any future column changes in those tables.
The problem I've got is that this doesn't work, syntactically or conceptually:
INSERT INTO records_archive VALUES (_r.*);
Neither do any of these:
INSERT INTO records_archive _r;
INSERT INTO records_archive _r.*;
Is there a way to pull this off? If not, is there a better way to accomplish what I'm trying to accomplish?
I would create a BEFORE INSERT Trigger on records_archive. Do a select on the records_archive for the NEW.master_guid, and if it exists, manipulate it to add your random numbers.
You'd probably want a loop around the check to ensure the modified master_guid still didn't exist before went ahead with the insertion.
I wrote a function to create posts for a simple blogging engine:
CREATE FUNCTION CreatePost(VARCHAR, TEXT, VARCHAR[])
RETURNS INTEGER AS $$
DECLARE
InsertedPostId INTEGER;
TagName VARCHAR;
BEGIN
INSERT INTO Posts (Title, Body)
VALUES ($1, $2)
RETURNING Id INTO InsertedPostId;
FOREACH TagName IN ARRAY $3 LOOP
DECLARE
InsertedTagId INTEGER;
BEGIN
-- I am concerned about this part.
BEGIN
INSERT INTO Tags (Name)
VALUES (TagName)
RETURNING Id INTO InsertedTagId;
EXCEPTION WHEN UNIQUE_VIOLATION THEN
SELECT INTO InsertedTagId Id
FROM Tags
WHERE Name = TagName
FETCH FIRST ROW ONLY;
END;
INSERT INTO Taggings (PostId, TagId)
VALUES (InsertedPostId, InsertedTagId);
END;
END LOOP;
RETURN InsertedPostId;
END;
$$ LANGUAGE 'plpgsql';
Is this prone to race conditions when multiple users delete tags and create posts at the same time?
Specifically, do transactions (and thus functions) prevent such race conditions from happening?
I'm using PostgreSQL 9.2.3.
It's the recurring problem of SELECT or INSERT under possible concurrent write load, related to (but different from) UPSERT (which is INSERT or UPDATE).
This PL/pgSQL function uses UPSERT (INSERT ... ON CONFLICT ..) to INSERT or SELECT a single row:
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
LANGUAGE plpgsql AS
$func$
BEGIN
SELECT tag_id -- only if row existed before
FROM tag
WHERE tag = _tag
INTO _tag_id;
IF NOT FOUND THEN
INSERT INTO tag AS t (tag)
VALUES (_tag)
ON CONFLICT (tag) DO NOTHING
RETURNING t.tag_id
INTO _tag_id;
END IF;
END
$func$;
There is still a tiny window for a race condition. To make absolutely sure we get an ID:
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
SELECT tag_id
FROM tag
WHERE tag = _tag
INTO _tag_id;
EXIT WHEN FOUND;
INSERT INTO tag AS t (tag)
VALUES (_tag)
ON CONFLICT (tag) DO NOTHING
RETURNING t.tag_id
INTO _tag_id;
EXIT WHEN FOUND;
END LOOP;
END
$func$;
db<>fiddle here
This keeps looping until either INSERT or SELECT succeeds.
Call:
SELECT f_tag_id('possibly_new_tag');
If subsequent commands in the same transaction rely on the existence of the row and it is actually possible that other transactions update or delete it concurrently, you can lock an existing row in the SELECT statement with FOR SHARE.
If the row gets inserted instead, it is locked (or not visible for other transactions) until the end of the transaction anyway.
Start with the common case (INSERT vs SELECT) to make it faster.
Related:
Get Id from a conditional INSERT
How to include excluded rows in RETURNING from INSERT ... ON CONFLICT
Related (pure SQL) solution to INSERT or SELECT multiple rows (a set) at once:
How to use RETURNING with ON CONFLICT in PostgreSQL?
What's wrong with this pure SQL solution?
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
LANGUAGE sql AS
$func$
WITH ins AS (
INSERT INTO tag AS t (tag)
VALUES (_tag)
ON CONFLICT (tag) DO NOTHING
RETURNING t.tag_id
)
SELECT tag_id FROM ins
UNION ALL
SELECT tag_id FROM tag WHERE tag = _tag
LIMIT 1;
$func$;
Not entirely wrong, but it fails to seal a loophole, like #FunctorSalad worked out. The function can come up with an empty result if a concurrent transaction tries to do the same at the same time. The manual:
All the statements are executed with the same snapshot
If a concurrent transaction inserts the same new tag a moment earlier, but hasn't committed, yet:
The UPSERT part comes up empty, after waiting for the concurrent transaction to finish. (If the concurrent transaction should roll back, it still inserts the new tag and returns a new ID.)
The SELECT part also comes up empty, because it's based on the same snapshot, where the new tag from the (yet uncommitted) concurrent transaction is not visible.
We get nothing. Not as intended. That's counter-intuitive to naive logic (and I got caught there), but that's how the MVCC model of Postgres works - has to work.
So do not use this if multiple transactions can try to insert the same tag at the same time. Or loop until you actually get a row. The loop will hardly ever be triggered in common work loads anyway.
Postgres 9.4 or older
Given this (slightly simplified) table:
CREATE table tag (
tag_id serial PRIMARY KEY
, tag text UNIQUE
);
An almost 100% secure function to insert new tag / select existing one, could look like this.
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT tag_id int)
LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
BEGIN
WITH sel AS (SELECT t.tag_id FROM tag t WHERE t.tag = _tag FOR SHARE)
, ins AS (INSERT INTO tag(tag)
SELECT _tag
WHERE NOT EXISTS (SELECT 1 FROM sel) -- only if not found
RETURNING tag.tag_id) -- qualified so no conflict with param
SELECT sel.tag_id FROM sel
UNION ALL
SELECT ins.tag_id FROM ins
INTO tag_id;
EXCEPTION WHEN UNIQUE_VIOLATION THEN -- insert in concurrent session?
RAISE NOTICE 'It actually happened!'; -- hardly ever happens
END;
EXIT WHEN tag_id IS NOT NULL; -- else keep looping
END LOOP;
END
$func$;
db<>fiddle here
Old sqlfiddle
Why not 100%? Consider the notes in the manual for the related UPSERT example:
https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE
Explanation
Try the SELECT first. This way you avoid the considerably more expensive exception handling 99.99% of the time.
Use a CTE to minimize the (already tiny) time slot for the race condition.
The time window between the SELECT and the INSERT within one query is super tiny. If you don't have heavy concurrent load, or if you can live with an exception once a year, you could just ignore the case and use the SQL statement, which is faster.
No need for FETCH FIRST ROW ONLY (= LIMIT 1). The tag name is obviously UNIQUE.
Remove FOR SHARE in my example if you don't usually have concurrent DELETE or UPDATE on the table tag. Costs a tiny bit of performance.
Never quote the language name: 'plpgsql'. plpgsql is an identifier. Quoting may cause problems and is only tolerated for backwards compatibility.
Don't use non-descriptive column names like id or name. When joining a couple of tables (which is what you do in a relational DB) you end up with multiple identical names and have to use aliases.
Built into your function
Using this function you could largely simplify your FOREACH LOOP to:
...
FOREACH TagName IN ARRAY $3
LOOP
INSERT INTO taggings (PostId, TagId)
VALUES (InsertedPostId, f_tag_id(TagName));
END LOOP;
...
Faster, though, as a single SQL statement with unnest():
INSERT INTO taggings (PostId, TagId)
SELECT InsertedPostId, f_tag_id(tag)
FROM unnest($3) tag;
Replaces the whole loop.
Alternative solution
This variant builds on the behavior of UNION ALL with a LIMIT clause: as soon as enough rows are found, the rest is never executed:
Way to try multiple SELECTs till a result is available?
Building on this, we can outsource the INSERT into a separate function. Only there we need exception handling. Just as safe as the first solution.
CREATE OR REPLACE FUNCTION f_insert_tag(_tag text, OUT tag_id int)
RETURNS int
LANGUAGE plpgsql AS
$func$
BEGIN
INSERT INTO tag(tag) VALUES (_tag) RETURNING tag.tag_id INTO tag_id;
EXCEPTION WHEN UNIQUE_VIOLATION THEN -- catch exception, NULL is returned
END
$func$;
Which is used in the main function:
CREATE OR REPLACE FUNCTION f_tag_id(_tag text, OUT _tag_id int)
LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
SELECT tag_id FROM tag WHERE tag = _tag
UNION ALL
SELECT f_insert_tag(_tag) -- only executed if tag not found
LIMIT 1 -- not strictly necessary, just to be clear
INTO _tag_id;
EXIT WHEN _tag_id IS NOT NULL; -- else keep looping
END LOOP;
END
$func$;
This is a bit cheaper if most of the calls only need SELECT, because the more expensive block with INSERT containing the EXCEPTION clause is rarely entered. The query is also simpler.
FOR SHARE is not possible here (not allowed in UNION query).
LIMIT 1 would not be necessary (tested in pg 9.4). Postgres derives LIMIT 1 from INTO _tag_id and only executes until the first row is found.
There's still something to watch out for even when using the ON CONFLICT clause introduced in Postgres 9.5. Using the same function and example table as in #Erwin Brandstetter's answer, if we do:
Session 1: begin;
Session 2: begin;
Session 1: select f_tag_id('a');
f_tag_id
----------
11
(1 row)
Session 2: select f_tag_id('a');
[Session 2 blocks]
Session 1: commit;
[Session 2 returns:]
f_tag_id
----------
NULL
(1 row)
So f_tag_id returned NULL in session 2, which would be impossible in a single-threaded world!
If we raise the transaction isolation level to repeatable read (or the stronger serializable), session 2 throws ERROR: could not serialize access due to concurrent update instead. So no "impossible" results at least, but unfortunately we now need to be prepared to retry the transaction.
Edit: With repeatable read or serializable, if session 1 inserts tag a, then session 2 inserts b, then session 1 tries to insert b and session 2 tries to insert a, one session detects a deadlock:
ERROR: deadlock detected
DETAIL: Process 14377 waits for ShareLock on transaction 1795501; blocked by process 14363.
Process 14363 waits for ShareLock on transaction 1795503; blocked by process 14377.
HINT: See server log for query details.
CONTEXT: while inserting index tuple (0,3) in relation "tag"
SQL function "f_tag_id" statement 1
After the session that received the deadlock error rolls back, the other session continues. So I guess we should treat deadlock just like serialization_failure and retry, in a situation like this?
Alternatively, insert the tags in a consistent order, but this is not easy if they don't all get added in one place.
I think there is a slight chance that when the tag already existed it might be deleted by another transaction after your transaction has found it. Using a SELECT FOR UPDATE should solve that.
I'm working on migration of data from a legacy system into our new app(running on Oracle Database, 10gR2). As part of the migration, I'm working on a script which inserts the data into tables that are used by the app.
The number of rows of data that are imported runs into thousands, and the source data is not clean (unexpected nulls in NOT NULL columns, etc). So while inserting data through the scripts, whenever such an exception occurs, the script ends abruptly, and the whole transaction is rolled back.
Is there a way, by which I can continue inserts of data for which the rows are clean?
Using NVL() or COALESCE() is not an option, as I'd like to log the rows causing the errors so that the data can be corrected for the next pass.
EDIT: My current procedure has an exception handler, I am logging the first row which causes the error. Would it be possible for inserts to continue without termination, because right now on the first handled exception, the procedure terminates execution.
If the data volumes were higher, row-by-row processing in PL/SQL would probably be too slow.
In those circumstances, you can use DML error logging, described here
CREATE TABLE raises (emp_id NUMBER, sal NUMBER
CONSTRAINT check_sal CHECK(sal > 8000));
EXECUTE DBMS_ERRLOG.CREATE_ERROR_LOG('raises', 'errlog');
INSERT INTO raises
SELECT employee_id, salary*1.1 FROM employees
WHERE commission_pct > .2
LOG ERRORS INTO errlog ('my_bad') REJECT LIMIT 10;
SELECT ORA_ERR_MESG$, ORA_ERR_TAG$, emp_id, sal FROM errlog;
ORA_ERR_MESG$ ORA_ERR_TAG$ EMP_ID SAL
--------------------------- -------------------- ------ -------
ORA-02290: check constraint my_bad 161 7700
(HR.SYS_C004266) violated
Using PLSQL you can perform each insert in its own transaction (COMMIT after each) and log or ignore errors with an exception handler that keeps going.
Try this:
for r_row in c_legacy_data loop
begin
insert into some_table(a, b, c, ...)
values (r_row.a, r_row.b, r_row.c, ...);
exception
when others then
null; /* or some extra logging */
end;
end loop;
DECLARE
cursor;
BEGIN
loop for each row in cursor
BEGIN -- subBlock begins
SAVEPOINT startTransaction; -- mark a savepoint
-- do whatever you have do here
COMMIT;
EXCEPTION
ROLLBACK TO startTransaction; -- undo changes
END; -- subBlock ends
end loop;
END;
If you use sqlldr you can specify to continue loading data, and all the 'bad' data will be skipped and logged in a separate file.