I have a string vector data containing items that I want to insert into a table named foos. It's possible that some of the elements in data already exist in the table, so I must watch out for those.
The solution I'm using starts by transforming the data vector into the virtual table old_and_new; it then builds the virtual table old, which contains the elements already present in foos; next, it constructs the virtual table new with the elements that are genuinely new. Finally, it inserts the new elements into table foos.
WITH old_and_new AS (SELECT unnest ($data :: text[]) AS foo),
old AS (SELECT foo FROM foos INNER JOIN old_and_new USING (foo)),
new AS (SELECT * FROM old_and_new EXCEPT SELECT * FROM old)
INSERT INTO foos (foo) SELECT foo FROM new
This works fine in a non-concurrent setting, but fails if concurrent threads
try to insert the same new element at the same time. I know I can solve this
by setting the isolation level to serializable, but that's very heavy-handed.
Is there some other way I can solve this problem? If only there was a way to
tell PostgreSQL that it was safe to ignore INSERT errors...
Is there some other way I can solve this problem?
There are plenty, but none are a panacea...
You can't lock for inserts like you can do a select for update, since the rows don't exist yet.
You can lock the entire table, but that's even more heavy-handed than serializing your transactions.
You can use advisory locks, but be super wary about deadlocks. Sort new keys so as to obtain the locks in a consistent, predictable order. (Someone more knowledgeable with PG's source code will hopefully chime in, but I'm guessing that the predicate locks used in the serializable isolation level amount to doing precisely that.)
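A minimal sketch of the advisory-lock route, assuming you hash the text keys with hashtext() and take the transaction-scoped pg_advisory_xact_lock(); the ORDER BY in the subquery is what keeps the acquisition order consistent across sessions in practice:
-- Run this in the same transaction, just before the INSERT from the question.
-- hashtext() maps each key to an int; collisions only cause extra blocking, not errors.
SELECT pg_advisory_xact_lock(hashtext(foo))
FROM  (
   SELECT DISTINCT foo
   FROM   unnest($data::text[]) AS u(foo)
   ORDER  BY foo   -- consistent, predictable lock order
   ) k;
The locks are released automatically when the transaction ends.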
In pure SQL you could also use a DO statement to loop through the rows one by one and trap the errors as they occur:
http://www.postgresql.org/docs/9.2/static/sql-do.html
http://www.postgresql.org/docs/9.2/static/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING
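A minimal sketch of that DO-block approach (with a literal array standing in for $data), trapping unique_violation per row:
DO
$$
DECLARE
   _foo text;
BEGIN
   FOREACH _foo IN ARRAY '{a,b,c}'::text[]   -- substitute your $data here
   LOOP
      BEGIN
         INSERT INTO foos (foo) VALUES (_foo);
      EXCEPTION WHEN unique_violation THEN
         NULL;  -- row already exists (or was inserted concurrently): skip it
      END;
   END LOOP;
END
$$;
Each exception block opens a subtransaction, so this is best reserved for modest array sizes.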
Similarly, you could create a convoluted upsert function and call it once per piece of data...
If you're building $data at the app level, you could run the inserts one by one and ignore errors.
And I'm sure I forgot some additional options...
Whatever your course of action is (@Denis gave you quite a few options), this rewritten INSERT command will be much faster:
INSERT INTO foos (foo)
SELECT n.foo
FROM unnest ($data::text[]) AS n(foo)
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL
It also leaves a much smaller time frame for a possible race condition.
In fact, the time frame should be so small, that unique violations should only be popping up under heavy concurrent load or with huge arrays.
Dupes in the array?
That is, unless your problem is built in: do you have duplicates in the input array itself? In that case, transaction isolation is not going to help you. The enemy is within!
Consider this example / solution:
INSERT INTO foos (foo)
SELECT n.foo
FROM (SELECT DISTINCT foo FROM unnest('{foo,bar,foo,baz}'::text[]) AS foo) n
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL
I use DISTINCT in the subquery to eliminate the "sleeper agents", a.k.a. duplicates.
People tend to forget that the dupes may come from within the import data.
Full automation
This function is one way to deal with concurrency for good. If a UNIQUE_VIOLATION occurs, the INSERT is just retried. The newly present rows are excluded from the new attempt automatically.
It does not take care of the opposite problem, that a row might have been deleted concurrently - this would not get re-inserted. One might argue that this outcome is OK, since such a DELETE happened concurrently. If you want to prevent this, make use of SELECT ... FOR SHARE to protect rows from concurrent DELETE.
CREATE OR REPLACE FUNCTION f_insert_array(_data text[], OUT ins_ct int) AS
$func$
BEGIN
LOOP
   BEGIN
      INSERT INTO foos (foo)
      SELECT n.foo
      FROM  (SELECT DISTINCT foo FROM unnest(_data) AS foo) n
      LEFT  JOIN foos o USING (foo)
      WHERE o.foo IS NULL;

      GET DIAGNOSTICS ins_ct = ROW_COUNT;
      RETURN;

   EXCEPTION WHEN UNIQUE_VIOLATION THEN     -- foos.foo has a UNIQUE constraint
      RAISE NOTICE 'It actually happened!'; -- hardly ever happens
   END;
END LOOP;
END
$func$
LANGUAGE plpgsql;
I made the function return the count of inserted rows, which is completely optional.
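A call would then look like this; the returned value is the number of rows actually inserted:
SELECT f_insert_array('{foo,bar,foo,baz}'::text[]);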
-> SQLfiddle demo
I like both Erwin's and Denis' answers, but another approach might be to have concurrent sessions perform the unnesting and load into a separate temporary table, optionally eliminating what duplicates they can against the target table. Then a single session selects from this temporary table, resolves the temp table's internal duplicates in an appropriate manner, inserts into the target table while checking again for existing values, and deletes the selected temporary-table records (in the same query, using a common table expression).
This would be more batch oriented, in the style of a data warehouse extract-load-transform paradigm, but it would guarantee that no unique-constraint issues need to be dealt with.
Other advantages/disadvantages apply, such as decoupling the final insert from the data gathering (possible advantage), and needing to vacuum the temp table frequently (possible disadvantage), which may not be relevant to Jon's case, but might be useful info to others in the same situation.
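A rough sketch of that final single-session step, assuming a hypothetical staging table named staging_foos that the concurrent sessions have filled: the CTE deletes the staged rows, and the outer INSERT de-duplicates them and skips values that already exist in the target.
WITH picked AS (
   DELETE FROM staging_foos s
   RETURNING s.foo
)
INSERT INTO foos (foo)
SELECT DISTINCT p.foo
FROM   picked p
LEFT   JOIN foos o USING (foo)
WHERE  o.foo IS NULL;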
Related
I have the two tables below, on which I am performing a delete followed by an insert, but intermittently deadlocks are encountered.
Schedule.Assignments (Parent table)
Schedule.Schedules (Child table)
Intermittently, two types of deadlock occur on the schedule.Schedules table (child), although the operation is being performed on the schedule.Assignments table (parent). Both have the same deadlock graph, as shown below.
Deadlock between an INSERT and a DELETE statement on the schedule.Assignments table.
Deadlock between two instances of the same DELETE statement on the schedule.Assignments table.
Deadlock Graph1 : https://pastebin.com/raw/ZpQUrjBV
Deadlock Graph2 : https://pastebin.com/raw/DhnuyZ7a
StoredProc containing the insert and delete statements: https://pastebin.com/raw/6DNh2RxH
Query Execution Plan: PasteThePlan
[Edit]
Assignments Schema: Assignments Schema
Assignments Indexes: Assignments Indexes
Schedules Schema: Schedules Schema
Schedules Indexes: Schedules Indexes
What I am not able to understand is why the deadlock object shows as the child table, whereas the processes involved in the deadlock are running inserts/deletes on the parent table.
Please share your thoughts on how to solve these deadlocks.
It looks like your deadlocking is caused by a big table scan on Schedules. The scan happens in three different places in your procedure. What should happen instead is a simple Nested Loops/Index Seek on ParentId.
The reason you have a scan is that the join condition on ParentId compares an nvarchar(50) column to a bigint. I suggest you fix this by making ParentId a bigint.
ALTER TABLE schedule.Schedules
ALTER COLUMN ParentId bigint NULL;
You may need to drop and re-create indexes or constraints when you do this.
As a side point, although it appears that you have an index on schedule.Assignments (OldResourceRequestId), it is not unique. This is causing an Assert on the various subqueries to ensure only one row is returned, and may also be affecting query statistics/estimates.
I suggest you change it (if possible) to a unique index. If there are duplicates then you need to rethink these joins anyway, as you would get duplicate results or fail the Assert.
CREATE UNIQUE NONCLUSTERED INDEX [IX_Assignments_OldResourceRequestId] ON schedule.Assignments
(
    OldResourceRequestId ASC
)
WITH (DROP_EXISTING = ON, ONLINE = ON) ON [PRIMARY];
You should also take note of your IF statements. They are not indented, and because there is no BEGIN ... END, it is easy to miss that only the first statement afterwards is actually conditional. As mentioned in the other answer, the IF may not be necessary anyway.
There's not enough information to be sure, but I'll tell you where I'd start.
Eliminate the IF EXISTS test. Just go straight to the DELETE. If the value isn't there, the DELETE will be quick anyway. You're also not in a transaction, which leaves you open to the table changing between SELECT and DELETE.
Re-write the proprietary DELETE...JOIN as an ANSI DELETE using a WHERE EXISTS subquery; a generic sketch follows at the end of this answer. The proprietary version has problems whose details elude me right now. Better to write it in a standard way, and not invite problems.
You say "child" and "parent" tables. Does Schedules have a defined foreign key to Assignments? It should.
I'm not sure those changes will eliminate the problem, but IMO they'll make it less likely. You're increasing the atomicity by reducing the number of statements, and by eliminating branches you force each invocation of the procedure to execute the exact same logical sequence.
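Since the actual statement is only in the linked pastebin, here is just a generic sketch of that rewrite; the column SomeColumn, the variable @SomeValue and the key column Id are placeholders, and the shape is what matters:
-- Proprietary form:
-- DELETE a
-- FROM   schedule.Assignments AS a
--        JOIN schedule.Schedules AS s ON s.ParentId = a.Id
-- WHERE  s.SomeColumn = @SomeValue;

-- ANSI form:
DELETE FROM schedule.Assignments
WHERE EXISTS (
    SELECT 1
    FROM   schedule.Schedules AS s
    WHERE  s.ParentId = schedule.Assignments.Id
      AND  s.SomeColumn = @SomeValue
);
The exact predicates will differ; the point is that the rewritten form touches the target table once and expresses the child-table dependency as a plain EXISTS.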
I'm working with PostgreSQL 9.1. Let's say I have a table where some columns have UNIQUE constraint. The easiest example:
CREATE TABLE test (
value INTEGER NOT NULL UNIQUE
);
Now, when inserting some values, I have to separately handle the case where the values to be inserted are already in the table. I have two options:
Make a SELECT beforehand to ensure the values are not in the table, or:
Execute the INSERT and watch for any errors the server might return.
The application utilizing the PostgreSQL database is written in Ruby. Here's how I would code the second option:
require 'pg'
db = PG.connect(...)
begin
db.exec('INSERT INTO test VALUES (66)')
rescue PG::UniqueViolation
# ... the values are already in the table
else
# ... the values were brand new
end
db.close
Here's my thinking: let's suppose we make a SELECT first, before inserting. The SQL engine would have to scan the rows and return any matching tuples. If there are none, we make an INSERT, which presumably performs yet another scan to see whether the UNIQUE constraint is about to be violated. So, in theory, the second option would speed the execution up by 50%. Is this how PostgreSQL would actually behave?
We're assuming there's no ambiguity when it comes to the exception itself (e.g. we only have one UNIQUE constraint).
Is it a common practice? Or are there any caveats to it? Are there any more alternatives?
It depends. If your application UI normally allows entering duplicate values, then it's strongly encouraged to check before inserting, because any error would invalidate the current transaction, consume sequence/serial values, fill logs with error messages, etc.
But if your UI does not allow duplicates, and inserting a duplicate is only possible when somebody is using tricks (for example during vulnerability research) or is highly improbable, then I'd allow inserting without checking first.
As a unique constraint forces the creation of an index, this check is not slow, though it is definitely slightly slower than inserting and checking for rare errors. Postgres 9.5 will have ON CONFLICT DO NOTHING support, which will be both fast and safe; you'd check the number of rows inserted to detect duplicates.
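For reference, that 9.5 syntax would look like this against the test table from the question:
INSERT INTO test (value)
VALUES (66)
ON CONFLICT (value) DO NOTHING;
-- reported row count is 0 if the value already existed, 1 if it was inserted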
You don't (and shouldn't) have to test beforehand; you can test while inserting. Just add the test as a WHERE clause. The following INSERT inserts either zero or one tuple, depending on the existence of a row with the same value (and it certainly is not slower):
INSERT INTO test (value)
SELECT 55
WHERE NOT EXISTS (
SELECT * FROM test
WHERE value = 55
);
Though your error-driven approach may look elegant from the client side, from the database side it is a near-disaster: the current transaction is rolled back implicitly and all cursors (including prepared statements) are closed. (Thus your application will have to rebuild the complete transaction, this time without the error, and start again.)
Addition: when adding more than one row you can put the VALUES() into a CTE and refer to the CTE in the insert query:
WITH vvv(val) AS (
VALUES (11),(22),(33),(44),(55),(66)
)
INSERT INTO test(value)
SELECT val FROM vvv
WHERE NOT EXISTS (
SELECT *
FROM test nx
WHERE nx.value = vvv.val
);
-- SELECT * FROM test;
I'm working on Postgres 9.2.
There are 2 UPDATEs, each in their own transactions. One looks like:
UPDATE foo SET a=1 WHERE b IN (1,2,3,4);
The other is similar:
UPDATE foo SET a=2 WHERE b IN (1,2,3,4);
These could possibly run at the same time and in reality have 500+ in the 'IN' expression.
I'm sometimes seeing deadlocks. Is it true that the order of items in the 'IN' expression may not actually influence the true lock ordering?
There is no ORDER BY in the UPDATE command.
But there is for SELECT. Use row-level locking with the FOR UPDATE clause in a subquery:
UPDATE foo f
SET a = 1
FROM (
SELECT b FROM foo
WHERE b IN (1,2,3,4)
ORDER BY b
FOR UPDATE
) upd
WHERE f.b = upd.b;
Of course, b has to be UNIQUE or you need to add more expressions to the ORDER BY clause to make it unambiguous.
And you need to enforce the same order for all UPDATE, DELETE and SELECT .. FOR UPDATE statements on the table.
Related, with more details:
Postgres UPDATE … LIMIT 1
Avoiding PostgreSQL deadlocks when performing bulk update and delete operations
Optimizing concurrent updates in Postgres
Yes. I think the main issue here is that IN checks for membership in the specified set, but does not confer any ordering on the UPDATE, which in turn means that no concrete ordering is imposed on the lock acquisition.
The WHERE clause in an UPDATE statement essentially behaves the same way it does in a SELECT. For example, I will often simulate an UPDATE using a SELECT to check what will be updated to see that it's what I expected.
With that in mind, the following example using SELECT demonstrates that IN does not in itself confer ordering:
Given this schema/data:
create table foo
(
id serial,
val text
);
insert into foo (val)
values ('one'), ('two'), ('three'), ('four');
The following queries:
select *
from foo
where id in (1,2,3,4);
select *
from foo
where id in (4,3,2,1);
yield the exact same results -- the rows in order from id 1-4.
Even that isn't guaranteed, since I did not use an ORDER BY in the select. Rather, without it, Postgres uses whatever order the server decides is fastest (see point 8 about ORDER BY in the Postgres SELECT doc). Given a fairly static table, it's often the same order in which it was inserted (as was the case here). However, there's nothing guaranteeing that, and if there's a lot of churn on the table (lots of dead tuples, rows removed, etc.), it's less likely to be the case.
I suspect that's what's happening here with your UPDATE. Sometimes -- if not most of the time -- it may end up in numerical order if that's the same way the rows were inserted, but there's nothing to guarantee that, and the cases where you see the deadlocks are likely scenarios where the data has changed such that one update is ordered differently than the other.
sqlfiddle with the above code.
Possible fixes/workarounds:
In terms of what you could do about it, there are various options, depending on your requirements. You could explicitly take out a table lock on the table, although that would of course have the effect of serializing the updates there, which may prove to be too large a bottleneck.
Another option, which would still allow for concurrency, is to explicitly iterate over the items using dynamic SQL in, say, Python. That way, you'd have a set of one-row updates that always occur in the same order, and since you can ensure that consistent order, the normal Postgres locking should be able to handle the concurrency without deadlocking.
That won't perform as well as batch-updating in pure SQL, but it should solve the lock issue. One suggestion to bump up performance is to only COMMIT every so often, and not after every single row -- that saves a lot of overhead.
Another option would be to do the loop in a Postgres function written in PL/pgSQL. That function could then be called externally (from, say, Python), but the looping and the UPDATEs would happen entirely server-side, which saves the overhead of going over the wire on each loop iteration.
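A sketch of such a function, assuming integer columns and a hypothetical name update_foo_in_order; sorting the keys before looping is what gives every caller the same lock-acquisition order:
CREATE OR REPLACE FUNCTION update_foo_in_order(_a int, _bs int[])
  RETURNS void AS
$func$
DECLARE
   _b int;
BEGIN
   -- one-row UPDATEs, always in ascending order of b
   FOR _b IN SELECT DISTINCT b FROM unnest(_bs) AS b ORDER BY b
   LOOP
      UPDATE foo SET a = _a WHERE b = _b;
   END LOOP;
END
$func$ LANGUAGE plpgsql;
Called as SELECT update_foo_in_order(1, '{1,2,3,4}'::int[]);, everything still runs in a single transaction, so it is the consistent ordering, not intermediate commits, that prevents the deadlock.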
I've got the user entering a list of values that I need to query for in a table. The list could be potentially very large, and the length isn't known at compile time. Rather than using WHERE ... IN (...) I was thinking it would be more efficient to use a temporary table and execute a join against it. I read this suggestion in another SO question (can't find it at the moment, but will edit when I do).
The gist is something like:
CREATE TEMP TABLE my_temp_table (name varchar(160) NOT NULL PRIMARY KEY);
INSERT INTO my_temp_table VALUES ('hello');
INSERT INTO my_temp_table VALUES ('world');
//... etc
SELECT f.* FROM foo f INNER JOIN my_temp_table t ON f.name = t.name;
DROP TABLE my_temp_table;
If I have two of these going at the same time, would I not get an error if Thread 2 tries to create the TEMP table after Thread 1?
Should I randomly generate a name for the TEMP table instead?
Or, if I wrap the whole thing in a transaction, will the naming conflict go away?
This is Postgresql 8.2.
Thanks!
There is no need to worry about the conflict.
The pg_temp schema is session specific. If you've a concurrent statement in a separate session, it'll use a different schema (even if you see it as having the same name).
Two notes, however:
Every time you create temporary objects, entries for a temporary schema and for the objects themselves are written to the system catalogs. This can lead to catalog clutter if done frequently.
Thus, for small sets or frequent use, it's usually better to stick to an IN list or a WITH query (both of which Postgres copes with quite well). It's also occasionally useful to "trick" the planner into using whichever plan you're seeking by using an immutable set-returning function.
In the event you decide to actually use temporary tables, it's usually better to index and analyze them once you've filled them up. Otherwise you're doing little more than writing a WITH query.
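For the temp-table route from the question, that would look roughly like this; the PRIMARY KEY already provides the index, so ANALYZE is the main addition:
BEGIN;
CREATE TEMP TABLE my_temp_table (name varchar(160) NOT NULL PRIMARY KEY)
   ON COMMIT DROP;                             -- cleaned up automatically at COMMIT
INSERT INTO my_temp_table (name) VALUES ('hello'), ('world');
ANALYZE my_temp_table;                         -- give the planner real row estimates
SELECT f.* FROM foo f JOIN my_temp_table t ON f.name = t.name;
COMMIT;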
Consider using a WITH query instead: http://www.postgresql.org/docs/9.0/interactive/queries-with.html
It also creates a temporary table of sorts, which is destroyed when the query/transaction finishes, so I believe there should be no concurrency conflicts.
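A sketch of that, using a hypothetical CTE name wanted (note that WITH requires PostgreSQL 8.4 or later, so it is not available on the 8.2 mentioned in the question):
WITH wanted(name) AS (
   VALUES ('hello'), ('world')
)
SELECT f.*
FROM   foo f
JOIN   wanted w ON f.name = w.name;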
Nitpicker Question:
I'd like to have a function returning a boolean to check whether a table has an entry or not. And I need to call this a lot, so some optimizing is needed.
I use MySQL for now, but this should be fairly basic...
So should I use
select id from table where a=b limit 1;
or
select count(*) as cnt from table where a=b;
or something completely different?
I think SELECT with LIMIT should stop after the first find, while count(*) needs to check all entries, so SELECT could be faster.
The simplest thing would be to run a few loops and test it, but my tests were not helpful. (My test system seemed to be in use by others too, which diluted my results.)
This "need" is often indicative of a situation where you are trying to INSERT or UPDATE. The two most common situations are bulk loading/updating of rows, or hit counting.
Checking for the existence of a row first can be avoided using the INSERT ... ON DUPLICATE KEY UPDATE statement. For a hit counter, just a single statement is needed. For bulk loading, load the data into a temporary table, then use INSERT ... ON DUPLICATE KEY UPDATE with the temp table as the source.
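For the hit-counter case, a sketch with a hypothetical hits table that has a unique key on page_id:
INSERT INTO hits (page_id, hit_count)
VALUES (42, 1)
ON DUPLICATE KEY UPDATE hit_count = hit_count + 1;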
But if you can't use this, then the fastest way will be select id from table where a=b limit 1; along with FORCE INDEX to make sure MySQL looks ONLY at the index.
The LIMIT 1 will tell MySQL to stop searching after it finds one row. If there can be multiple rows that match the criteria, this is faster than count(*).
There are more ways to optimize this, but the exact nature would depend on the amount of rows and the spread of a and b. I'd go with the "where a=b" approach until you actually encounter performance issues. Databases are often so fast that most queries are no performance issue at all.
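If all you need is the boolean itself, a variation on the LIMIT 1 approach is to wrap the check in EXISTS, which also stops at the first matching row; my_table, a and b stand in for your real names here:
-- returns 1 if a matching row exists, 0 otherwise
SELECT EXISTS (SELECT 1 FROM my_table WHERE a = b) AS has_row;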