SQL - relying on server errors during INSERT

I'm working with PostgreSQL 9.1. Let's say I have a table where some columns have a UNIQUE constraint. The easiest example:
CREATE TABLE test (
value INTEGER NOT NULL UNIQUE
);
Now, when inserting some values, I have to separately handle the case where the values to be inserted are already in the table. I have two options:
Make a SELECT beforehand to ensure the values are not in the table, or:
Execute the INSERT and watch for any errors the server might return.
The application utilizing the PostgreSQL database is written in Ruby. Here's how I would code the second option:
require 'pg'
db = PG.connect(...)
begin
  db.exec('INSERT INTO test VALUES (66)')
rescue PG::UniqueViolation
  # ... the values are already in the table
else
  # ... the values were brand new
end
db.close
Here's my thinking: let's suppose we make a SELECT first, before inserting. The SQL engine would have to scan the rows and return any matching tuples. If there are none, we make an INSERT, which presumably makes yet another scan to see whether the UNIQUE constraint is about to be violated. So, in theory, the second option would speed the execution up by 50%. Is this how PostgreSQL would actually behave?
We're assuming there's no ambiguity when it comes to the exception itself (e.g. we only have one UNIQUE constraint).
Is it a common practice? Or are there any caveats to it? Are there any more alternatives?

It depends - if your application UI normally allows entering duplicate values, then it's strongly encouraged to check before inserting, because any error would invalidate the current transaction, consume sequence/serial values, fill the logs with error messages, etc.
But if your UI does not allow duplicates, and inserting a duplicate is only possible when somebody is using tricks (for example during vulnerability research) or is highly improbable, then I'd allow inserting without checking first.
As a unique constraint forces the creation of an index, this check is not slow - but it is definitely slightly slower than inserting and checking for rare errors. Postgres 9.5 will have INSERT ... ON CONFLICT DO NOTHING support, which will be both fast and safe. You'd check the number of rows inserted to detect duplicates.
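For reference, a minimal sketch of that 9.5 syntax, assuming the test table from the question:
INSERT INTO test (value)
VALUES (66)
ON CONFLICT (value) DO NOTHING;
-- the command tag (INSERT 0 1 vs INSERT 0 0) tells you whether a row was actually inserted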

You don't (and shouldn't) have to test before; you can test while inserting. Just add the test as a WHERE clause. The following insert inserts either zero or one tuple, depending on the existence of a row with the same value (and it certainly is not slower):
INSERT INTO test (value)
SELECT 55
WHERE NOT EXISTS (
    SELECT * FROM test
    WHERE value = 55
);
Though your error-driven approach may look elegant from the client side, from the database side it is a near-disaster: the current transaction is rolled back implicitly and all cursors (including prepared statements) are closed. (Thus: your application will have to rebuild the complete transaction, but without the error, and start again.)
Addition: when adding more than one row you can put the VALUES() into a CTE and refer to the CTE in the insert query:
WITH vvv(val) AS (
    VALUES (11),(22),(33),(44),(55),(66)
)
INSERT INTO test(value)
SELECT val FROM vvv
WHERE NOT EXISTS (
    SELECT *
    FROM test nx
    WHERE nx.value = vvv.val
);
-- SELECT * FROM test;

Related

Performance difference between Insert and Insert Where Not Exists

I know it's better to use INSERT WHERE NOT EXISTS than a plain INSERT, since the latter leads to duplicated records or unique key violation issues.
But with respect to performance, will it make any big difference?
INSERT WHERE NOT EXISTS internally triggers an extra SELECT statement to check whether the record already exists. In the case of large tables, which is recommended to use: INSERT or INSERT WHERE NOT EXISTS?
And could someone please explain the execution cost difference between the two?
Most Oracle IN clause queries involve a series of literal values, and when a table is present a standard join is better. In most cases the Oracle cost-based optimizer will create an identical execution plan for IN vs EXISTS, so there is no difference in query performance.
The EXISTS keyword evaluates to true or false, but the IN keyword will compare all values in the corresponding subquery column. If you are using the IN operator, the SQL engine will scan all records fetched by the inner query. On the other hand, if we are using EXISTS, the SQL engine will stop the scanning process as soon as it finds a match.
The EXISTS subquery is used when we want to display all rows where we have a matching column in both tables. In most cases, this type of subquery can be re-written with a standard join to improve performance.
The EXISTS clause is much faster than IN when the subquery result set is very large. Conversely, the IN clause is faster than EXISTS when the subquery result set is very small.
Also, the IN clause can't compare anything with NULL values, but the EXISTS clause can.
It's not a matter of "what's fastest" but a matter of "what's correct".
When you INSERT into a table (without any restriction) you simply add records to that table. If an identical record was already in there, this will then result in there being two such records. This may be fine or it may be an issue, depending on your needs (**).
When you add a WHERE NOT EXISTS() to your INSERT construction the system will only add records to the table that aren't there yet, thus avoiding the situation of ending up with multiple identical records.
(**: suppose you have a unique or primary key constraint on the target table; then the INSERT of a duplicate record will result in a UQ/PK violation error. If your question was "What's fastest: try to insert the row and, if there is such an error, simply ignore it, versus insert where not exists() and avoid the error?", then I can't give you a conclusive answer, but I'm fairly certain it will be a close call. What I can say, however, is that the WHERE NOT EXISTS() approach will look much nicer in code and (importantly!) it will also work for set-based operations; the try/catch approach will fail for the entire set even if only one record causes an issue.)
INSERT will check the inserted data against any existing schema constraints - PK, FK, unique indexes, NOT NULL and any other custom constraints, whatever the table schema demands. If those checks pass, the row is inserted and processing loops on to the next row.
INSERT WHERE NOT EXISTS, prior to the above checks, will compare the row's data against the data of the existing rows in the table. If even one column differs, the row counts as new, and it then moves on exactly as the plain INSERT above.
The performance impact mostly depends on:
1. the number of existing rows in the table
2. the size of each row
So as the table gets larger and the rows get bigger, the difference grows.
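If you want to measure the difference on your own data, a simple sketch (assuming PostgreSQL in keeping with the rest of this thread, and a placeholder table t with a unique column col) is to compare the plans and timings of both forms; note that EXPLAIN ANALYZE actually executes the statement, hence the rollbacks:
BEGIN;
EXPLAIN ANALYZE
INSERT INTO t (col) VALUES (123);
ROLLBACK;

BEGIN;
EXPLAIN ANALYZE
INSERT INTO t (col)
SELECT 123
WHERE NOT EXISTS (SELECT 1 FROM t WHERE col = 123);
ROLLBACK;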

Postgres, update and lock ordering

I'm working on Postgres 9.2.
There are 2 UPDATEs, each in their own transactions. One looks like:
UPDATE foo SET a=1 WHERE b IN (1,2,3,4);
The other is similar:
UPDATE foo SET a=2 WHERE b IN (1,2,3,4);
These could possibly run at the same time, and in reality have 500+ values in the 'IN' expression.
I'm sometimes seeing deadlocks. Is it true that the order of items in the 'IN' expression may not actually influence the true lock ordering?
There is no ORDER BY in the UPDATE command.
But there is for SELECT. Use row-level locking with the FOR UPDATE clause in a subquery:
UPDATE foo f
SET a = 1
FROM (
    SELECT b FROM foo
    WHERE b IN (1,2,3,4)
    ORDER BY b
    FOR UPDATE
) upd
WHERE f.b = upd.b;
Of course, b has to be UNIQUE or you need to add more expressions to the ORDER BY clause to make it unambiguous.
And you need to enforce the same order for all UPDATE, DELETE and SELECT .. FOR UPDATE statements on the table.
Related, with more details:
Postgres UPDATE … LIMIT 1
Avoiding PostgreSQL deadlocks when performing bulk update and delete operations
Optimizing concurrent updates in Postgres
Yes. I think the main issue here is that IN checks for membership in the specified set, but does not confer any sort of ordering on the UPDATE, which in turn means that no concrete ordering is conferred upon the lock ordering.
The WHERE clause in an UPDATE statement essentially behaves the same way it does in a SELECT. For example, I will often simulate an UPDATE using a SELECT to check what will be updated to see that it's what I expected.
With that in mind, the following example using SELECT demonstrates that IN does not in itself confer ordering:
Given this schema/data:
create table foo
(
id serial,
val text
);
insert into foo (val)
values ('one'), ('two'), ('three'), ('four');
The following queries:
select *
from foo
where id in (1,2,3,4);
select *
from foo
where id in (4,3,2,1);
yield the exact same results -- the rows in order from id 1-4.
Even that isn't guaranteed, since I did not use an ORDER BY in the select. Rather, without it, Postgres uses whatever order the server decides is fastest (see point 8 about ORDER BY in the Postgres SELECT doc). Given a fairly static table, it's often the same order in which it was inserted (as was the case here). However, there's nothing guaranteeing that, and if there's a lot of churn on the table (lots of dead tuples, rows removed, etc.), it's less likely to be the case.
I suspect that's what's happening here with your UPDATE. Sometimes -- if not even most of the time -- it may end up in numerical order if that's the same way the rows were inserted, but there's nothing to guarantee that, and the cases where you see the deadlocks are likely scenarios where the data has changed such that one update is ordered differently than the other.
sqlfiddle with the above code.
Possible fixes/workarounds:
In terms of what you could do about it, there are various options, depending on your requirements. You could explicitly take out a table lock on the table, although that would of course have the effect of serializing the updates there, which may prove to be too large a bottleneck.
Another option, which would still allow for concurrency, is to explicitly iterate over the items using dynamic SQL in, say, Python. That way, you'd have a set of one-row updates that always occur in the same order, and since you can ensure that consistent order, the normal Postgres locking should be able to handle the concurrency without deadlocking.
That won't perform as well as batch-updating in pure SQL, but it should solve the lock issue. One suggestion to bump up performance is to only COMMIT every so often, and not after every single row -- that saves a lot of overhead.
Another option would be to do the loop in a Postgres function written in PL/pgSQL. That function could then be called externally, in, say, Python, but the looping would be done (also explicitly) server-side, which may save on some overhead, since the looping and UPDATEs are done entirely server-side without having to go over the wire each loop iteration.
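A rough sketch of that last idea in PL/pgSQL (the function name is made up, and it assumes foo.a and foo.b are integers as in the question): each call locks and updates the rows one at a time, always in ascending order of b, so concurrent callers acquire the row locks in the same order.
CREATE OR REPLACE FUNCTION update_foo_in_order(_a int, _bs int[])
RETURNS void AS
$$
DECLARE
    _b int;
BEGIN
    -- iterate in a fixed, sorted order so concurrent callers lock rows consistently
    FOR _b IN SELECT unnest(_bs) ORDER BY 1 LOOP
        UPDATE foo SET a = _a WHERE b = _b;
    END LOOP;
END
$$ LANGUAGE plpgsql;

-- usage: SELECT update_foo_in_order(1, ARRAY[4,3,2,1]);  -- still updates in order 1,2,3,4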

Ignoring errors in concurrent insertions

I have a string vector data containing items that I want to insert into a table named foos. It's possible that some of the elements in data already exist in the table, so I must watch out for those.
The solution I'm using starts by transforming the data vector into virtual table old_and_new; it then builds virtual table old, which contains the elements already present in foos; then it constructs virtual table new with the elements that are really new. Finally, it inserts the new elements into table foos.
WITH old_and_new AS (SELECT unnest ($data :: text[]) AS foo),
old AS (SELECT foo FROM foos INNER JOIN old_and_new USING (foo)),
new AS (SELECT * FROM old_and_new EXCEPT SELECT * FROM old)
INSERT INTO foos (foo) SELECT foo FROM new
This works fine in a non-concurrent setting, but fails if concurrent threads
try to insert the same new element at the same time. I know I can solve this
by setting the isolation level to serializable, but that's very heavy-handed.
Is there some other way I can solve this problem? If only there was a way to
tell PostgreSQL that it was safe to ignore INSERT errors...
Is there some other way I can solve this problem?
There are plenty, but none are a panacea...
You can't lock for inserts the way you can do a SELECT FOR UPDATE, since the rows don't exist yet.
You can lock the entire table, but that's even heavier-handed than serializing your transactions.
You can use advisory locks, but be super wary about deadlocks. Sort new keys so as to obtain the locks in a consistent, predictable order; a sketch combining this with a DO-block loop appears after this list of options. (Someone more knowledgeable with PG's source code will hopefully chime in, but I'm guessing that the predicate locks used in the serializable isolation level amount to doing precisely that.)
In pure SQL you could also use a DO statement to loop through the rows one by one, and trap the errors as they occur:
http://www.postgresql.org/docs/9.2/static/sql-do.html
http://www.postgresql.org/docs/9.2/static/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING
Similarly, you could create a convoluted upsert function and call it once per piece of data...
If you're building $data at the app level, you could run the inserts one by one and ignore errors.
And I'm sure I forgot some additional options...
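To make the advisory-lock and DO-block options above concrete, here is a rough sketch (the hashtext() mapping of text keys to lock keys is just one possible choice, and a hash collision only means two unrelated keys share a lock): the transaction-scoped locks are taken in sorted key order, so concurrent sessions cannot deadlock on each other.
DO $$
DECLARE
    _foo text;
BEGIN
    FOR _foo IN
        SELECT DISTINCT foo
        FROM unnest('{apple,banana,apple,cherry}'::text[]) AS t(foo)
        ORDER BY foo
    LOOP
        -- serialize work on this key across sessions; released at commit/rollback
        PERFORM pg_advisory_xact_lock(hashtext(_foo));
        INSERT INTO foos (foo)
        SELECT _foo
        WHERE NOT EXISTS (SELECT 1 FROM foos WHERE foo = _foo);
    END LOOP;
END
$$;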
Whatever your course of action is (@Denis gave you quite a few options), this rewritten INSERT command will be much faster:
INSERT INTO foos (foo)
SELECT n.foo
FROM unnest ($data::text[]) AS n(foo)
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL
It also leaves a much smaller time frame for a possible race condition.
In fact, the time frame should be so small, that unique violations should only be popping up under heavy concurrent load or with huge arrays.
Dupes in the array?
That is, unless your problem is built in: do you have duplicates in the input array itself? In this case, transaction isolation is not going to help you. The enemy is within!
Consider this example / solution:
INSERT INTO foos (foo)
SELECT n.foo
FROM (SELECT DISTINCT foo FROM unnest('{foo,bar,foo,baz}'::text[]) AS foo) n
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL
I use DISTINCT in the subquery to eliminate the "sleeper agents", a.k.a. duplicates.
People tend to forget that the dupes may come within the import data.
Full automation
This function is one way to deal with concurrency for good. If a UNIQUE_VIOLATION occurs, the INSERT is just retried. The newly present rows are excluded from the new attempt automatically.
It does not take care of the opposite problem, that a row might have been deleted concurrently - this would not get re-inserted. One might argue, that this outcome is ok, since such a DELETE happened concurrently. If you want to prevent this, make use of SELECT ... FOR SHARE to protect rows from concurrent DELETE.
CREATE OR REPLACE FUNCTION f_insert_array(_data text[], OUT ins_ct int) AS
$func$
BEGIN
LOOP
   BEGIN
      INSERT INTO foos (foo)
      SELECT n.foo
      FROM (SELECT DISTINCT foo FROM unnest(_data) AS foo) n
      LEFT JOIN foos o USING (foo)
      WHERE o.foo IS NULL;

      GET DIAGNOSTICS ins_ct = ROW_COUNT;
      RETURN;

   EXCEPTION WHEN UNIQUE_VIOLATION THEN  -- foos.foo has a UNIQUE constraint
      RAISE NOTICE 'It actually happened!';  -- hardly ever happens
   END;
END LOOP;
END
$func$
LANGUAGE plpgsql;
I made the function return the count of inserted rows, which is completely optional.
-> SQLfiddle demo
I like both Erwin's and Denis' answers, but another approach might be this: have concurrent sessions perform the unnesting and load into a separate temporary table, optionally eliminating what duplicates they can against the target table; then have a single session select from this temporary table, resolve its internal duplicates in an appropriate manner, insert into the target table while checking again for existing values, and delete the consumed temporary-table records (in the same query, using a common table expression).
This would be more batch-oriented, in the style of a data warehouse extract-load-transform paradigm, but it would guarantee that no unique constraint issues need to be dealt with.
Other advantages/disadvantages apply, such as decoupling the final insert from the data gathering (a possible advantage), and needing to vacuum the temp table frequently (a possible disadvantage), which may not be relevant to Jon's case, but might be useful info to others in the same situation.
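A rough sketch of that batch pipeline, assuming a shared staging table named foo_stage (the name, and the split between many loader sessions and one consumer session, are assumptions; a true TEMP table would not be visible across sessions):
-- loader sessions: append pre-deduplicated candidates (run by many sessions)
INSERT INTO foo_stage (foo)
SELECT DISTINCT n.foo
FROM unnest($data::text[]) AS n(foo)
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL;

-- single consumer session: drain the staging table and insert what is still new
WITH drained AS (
    DELETE FROM foo_stage
    RETURNING foo
)
INSERT INTO foos (foo)
SELECT DISTINCT d.foo
FROM drained d
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL;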

Using a temporary table to replace a WHERE IN clause

I've got the user entering a list of values that I need to query for in a table. The list could be potentially very large, and the length isn't known at compile time. Rather than using WHERE ... IN (...) I was thinking it would be more efficient to use a temporary table and execute a join against it. I read this suggestion in another SO question (can't find it at the moment, but will edit when I do).
The gist is something like:
CREATE TEMP TABLE my_temp_table (name varchar(160) NOT NULL PRIMARY KEY);
INSERT INTO my_temp_table VALUES ('hello');
INSERT INTO my_temp_table VALUES ('world');
-- ... etc
SELECT f.* FROM foo f INNER JOIN my_temp_table t ON f.name = t.name;
DROP TABLE my_temp_table;
If I have two of these going at the same time, would I not get an error if Thread 2 tries to create the TEMP table after Thread 1?
Should I randomly generate a name for the TEMP table instead?
Or, if I wrap the whole thing in a transaction, will the naming conflict go away?
This is Postgresql 8.2.
Thanks!
There is no need to worry about the conflict.
The pg_temp schema is session-specific. If you have a concurrent statement in a separate session, it'll use a different schema (even if you see it as having the same name).
Two notes, however:
Every time you create temporary objects, entries for a temporary schema and for the objects themselves are added to the system catalogs. This can lead to catalog clutter if done frequently.
Thus, for small sets / frequent uses, it's usually better to stick to an IN or a WITH statement (both of which Postgres copes with quite well). It's also occasionally useful to "trick" the planner into using whichever plan you're seeking by using an immutable set-returning function.
In the event you decide to actually use temporary tables, it's usually better to index and analyze them once you've filled them up. Otherwise you're doing little more than writing a WITH statement.
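As a small sketch of that last note, using the table from the question (ON COMMIT DROP is optional and assumes the work happens inside one transaction):
BEGIN;
CREATE TEMP TABLE my_temp_table (
    name varchar(160) NOT NULL PRIMARY KEY  -- the PK already provides the index
) ON COMMIT DROP;
INSERT INTO my_temp_table (name) VALUES ('hello'), ('world');
ANALYZE my_temp_table;  -- give the planner up-to-date statistics for the join
SELECT f.* FROM foo f INNER JOIN my_temp_table t ON f.name = t.name;
COMMIT;  -- the temp table disappears here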
Consider using a WITH query instead: http://www.postgresql.org/docs/9.0/interactive/queries-with.html
It also creates a temporary table, which is destroyed when the query / transaction finishes, so I believe there should be no concurrency conflicts.
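For comparison, a sketch of the same lookup phrased as a WITH query (this requires a CTE-capable version, 8.4 or later; the list of values is inlined rather than stored anywhere):
WITH wanted(name) AS (
    VALUES ('hello'), ('world')
)
SELECT f.*
FROM foo f
INNER JOIN wanted w ON f.name = w.name;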

Fast check of existence of an entry in a SQL database

Nitpicker Question:
I'd like to have a function returning a boolean to check if a table has an entry or not. And I need to call this a lot, so some optimizing is needed.
I use MySQL for now, but this should be fairly basic...
So should I use
select id from table where a=b limit 1;
or
select count(*) as cnt from table where a=b;
or something completely different?
I think SELECT with LIMIT should stop after the first find, while count(*) needs to check all entries. So SELECT could be faster.
The simplest thing would be to run a few loops and test it, but my tests were not helpful. (My test system seemed to be in use by others too, which diluted my results.)
this "need" is often indicative of a situation where you are trying to INSERT or UPDATE. the two most common situations are bulk loading/updating of rows, or hit counting.
checking for existence of a row first can be avoided using the INSERT ... ON DUPLICATE KEY UPDATE statement. for a hit counter, just a single statement is needed. for bulk loading, load the data in to a temporary table, then use INSERT ... ON DUPLICATE KEY UPDATE using the temp table as the source.
but if you can't use this, then the fastest way will be select id from table where a=b limit 1; along with force index to make sure mysql looks ONLY at the index.
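A minimal sketch of the hit-counter case (the table and column names here are made up, and page_id is assumed to be the primary key):
-- MySQL: one statement, no prior existence check needed
INSERT INTO hits (page_id, cnt)
VALUES (42, 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;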
The LIMIT 1 will tell MySQL to stop searching after it finds one row. If there can be multiple rows that match the criteria, this is faster than count(*).
There are more ways to optimize this, but the exact approach would depend on the number of rows and the distribution of a and b. I'd go with the "where a=b" approach until you actually encounter performance issues. Databases are often so fast that most queries are no performance issue at all.
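If what you really want is a single boolean back, one further option not covered in the answers above is EXISTS, which also stops at the first match (your_table is a placeholder for the real table name):
SELECT EXISTS (SELECT 1 FROM your_table WHERE a = b) AS has_entry;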