Does a SELECT query following an INSERT … ON CONFLICT DO NOTHING statement always find a row, given the default transaction isolation (read committed)?
I want to INSERT-or-SELECT a row in one table, then reference that row when inserting rows in a second table. Since RETURNING returns nothing for rows suppressed by ON CONFLICT DO NOTHING, I have so far used a simple CTE that should always give me the identity column value, even if the row already exists:
$id = query(
`WITH ins AS (
INSERT INTO object (scope, name)
VALUES ($1, $2)
ON CONFLICT (scope, name) DO NOTHING
RETURNING id
)
SELECT id FROM ins
UNION ALL
SELECT id FROM object WHERE scope = $1 AND name = $2
LIMIT 1;`,
[$scope, $name]
)
query(
`INSERT INTO object_member (object_id, key, value)
SELECT $1, UNNEST($2::text[]), UNNEST($3::int[]);`,
[$id, $keys, $values]
)
However, I learned that this CTE is not entirely safe under concurrent write load: both the upsert and the select can come up empty when a different transaction inserts the same row concurrently.
In the answers there (and also here) it is suggested to use another query to do the SELECT:
start a new command (in the same transaction), which then can see these conflicting rows from the previous query.
If I understand correctly, it would mean to do
$id = query(
`INSERT INTO object (scope, name)
VALUES ($1, $2)
ON CONFLICT (scope, name) DO NOTHING
RETURNING id;`,
[$scope, $name]
)
if not $id:
$id = query(
`SELECT id FROM object WHERE scope = $1 AND name = $2;`,
[$scope, $name]
)
query(
`INSERT INTO object_member (object_id, key, value)
SELECT $1, UNNEST($2::text[]), UNNEST($3::int[]);`,
[$id, $keys, $values]
)
or even shortened to
query(
`INSERT INTO object (scope, name)
VALUES ($1, $2)
ON CONFLICT (scope, name) DO NOTHING;`,
[$scope, $name]
)
query(
`INSERT INTO object_member (object_id, key, value)
SELECT (SELECT id FROM object WHERE scope = $1 AND name = $2), UNNEST($3::text[]), UNNEST($4::int[]);`,
[$scope, $name, $keys, $values]
)
I believe this would be enough to prevent that particular race condition (dubbed "concurrency issue 1" in this answer) - but I'm not 100% certain I haven't missed anything.
Also, what about "concurrency issue 2"? If I understand correctly, this is about another transaction deleting or updating the existing row in between the INSERT and SELECT statements - and it's more likely to happen when using multiple queries instead of the CTE approach. How exactly should I deal with that? I assume locking the SELECT with FOR KEY SHARE is necessary in the second code snippet - but would I also need that in the third snippet, where the id is used within the same query? If it helps to simplify the answer, let's assume an object can only be inserted or deleted, but never updated.
To make absolutely sure that the single row in the first table is there, and its ID returned, you could create a function like the one outlined here:
Is SELECT or INSERT in a function prone to race conditions?
To make sure the row also stays there for the duration of the transaction, just make sure it's locked. If you INSERT the row, it's locked anyway. If you SELECT an existing id, you have to lock it explicitly - just like you suggested. FOR KEY SHARE is strong enough for our purpose as long as there is a (non-partial, non-functional) UNIQUE index on (scope, name), which is safe to assume given your ON CONFLICT clause.
CREATE OR REPLACE FUNCTION f_object_id(_scope text, _name text, OUT _object_id int)
  LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
   SELECT id FROM object
   WHERE  scope = _scope
   AND    name = _name
   -- lock to prevent deletion in the tiny time frame before the next INSERT
   FOR    KEY SHARE
   INTO   _object_id;

   EXIT WHEN FOUND;

   INSERT INTO object AS o (scope, name)
   VALUES (_scope, _name)
   ON     CONFLICT (scope, name) DO NOTHING
   RETURNING o.id
   INTO   _object_id;

   EXIT WHEN FOUND;
END LOOP;
END
$func$;
You really only need to lock the row if it's conceivable that a concurrent transaction might DELETE it (you don't UPDATE) in the tiny time frame between the SELECT and the next INSERT statement.
Also, if you have a FOREIGN KEY constraint from object_member.object_id to object.id (which seems likely), referential integrity is guaranteed anyway. If you don't add the explicit lock and the row is deleted in between, you get a foreign key violation, and the INSERT into object_member is cancelled along with the whole transaction. Otherwise, the other transaction with the DELETE has to wait until your transaction is done, and is then cancelled by the same FK constraint since dependent rows now exist (unless it's defined to CASCADE ...). So by locking (or not) you decide whether to prevent the DELETE or the INSERT in this scenario.
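For illustration, such a foreign key (assumed here; the constraint name is chosen arbitrarily) could be declared as:
ALTER TABLE object_member
ADD CONSTRAINT object_member_object_id_fkey
FOREIGN KEY (object_id) REFERENCES object (id);  -- or with ON DELETE CASCADE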
Then your call boils down to just:
query(
`WITH o(id) AS (SELECT f_object_id($1, $2))
INSERT INTO object_member (object_id, key, value)
SELECT o.id, UNNEST($3::text[]), UNNEST($4::int[])
FROM o;`,
[$scope, $name, $keys, $values]
)
Since you obviously insert multiple rows into object_member, I moved f_object_id($1, $2) into a CTE to avoid repeated execution - which would work, but would be pointlessly expensive.
In Postgres 12 or later I would make that explicit by adding MATERIALIZED (since the INSERT is hidden in a function):
WITH o(id) AS MATERIALIZED (SELECT f_object_id($1, $2)) ...
Aside: For the multiple unnest() in the SELECT list, make sure you are on Postgres 10 or later. See:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
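To illustrate why the version matters (a minimal demo, not from the original answer): with set-returning functions of unequal result length in the SELECT list, Postgres 10+ pads the shorter one with NULLs, while older versions produce the least common multiple of the row counts.
SELECT unnest('{a,b,c}'::text[]) AS key
     , unnest('{1,2}'::int[])    AS value;
-- Postgres 10+ : 3 rows - (a,1), (b,2), (c,NULL)
-- Postgres 9.x : 6 rows (least common multiple of 3 and 2)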
Matters of detail
Will it make any difference (apart from execution time) to do this in the application logic with multiple queries in the same transaction?
Basically no. The only difference is performance. Well, that and shorter code and reliability. It's objectively more error-prone to go back and forth between database and client for each loop. But unless you have extremely competitive transactions, you would hardly ever loop anyway.
The other consideration is this: the matter is tricky, and most developers do not understand it. Encapsulated in a server-side function, it's less likely to be broken by the next application programmer (or yourself). You have to make sure that it's actually used, too. Either way, properly document the reasons you are doing it one way or another ...
I really wonder whether my second snippet is safe, or why not (given the quote about visibility in the SELECT after the INSERT).
Mostly safe, but not absolutely. While the next separate SELECT will see (now committed) rows of a transaction competing with the previous UPSERT, there is nothing to keep a third transaction from deleting the row again in the meantime. The row has not been locked, you have no way to lock it while it's not visible, and there is no generic predicate locking available in Postgres.
Consider this (T1, T2, T3 are concurrent transactions):
T2: BEGIN transaction
T1: BEGIN transaction
T2: INSERT object 666
T1: UPSERT object 666
    unique violation? -> wait for T2
T2: COMMIT
T1: unique violation -> NO ACTION
    finish statement
    can't return invisible object 666
T3: DELETE object 666 & COMMIT
T1: SELECT object 666 -> no row!
    BOOM!
Typically, it's extremely unlikely to ever happen. But it's possible. Hence the loop.
The other option is SERIALIZABLE transaction isolation. Typically more expensive, and you need to prepare for serialization failures. Catch 22.
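A minimal sketch of that alternative, with literals standing in for the parameters; the client has to be prepared to retry the whole transaction:
BEGIN ISOLATION LEVEL SERIALIZABLE;

INSERT INTO object (scope, name)
VALUES ('s1', 'n1')
ON CONFLICT (scope, name) DO NOTHING;

SELECT id FROM object WHERE scope = 's1' AND name = 'n1';

COMMIT;
-- on SQLSTATE 40001 (serialization_failure): ROLLBACK, then retry the transaction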
Related
Say we have this table:
create table if not exists template (
id serial primary key,
label text not null,
version integer not null default 1,
created_at timestamp not null default current_timestamp,
unique(label, version)
);
The logic is to insert a new record, incrementing the version value when a record with an equal label already exists. My first idea is to do something like this:
with v as (
select coalesce(max(version), 0) + 1 as new_version
from template t where label = 'label1'
)
insert into template (label, version)
values ('label1', (select new_version from v))
returning *;
Although it works, I'm pretty sure it wouldn't be safe in case of the simultaneous inserts. Am I right?
If I am, should I wrap this query in a transaction?
Or is there a better way to implement this kind of versioning?
Gap-less serial IDs per label are hard to come by. Your simple approach can easily fail with concurrent writes due to inherent race conditions. And "value-locking" is not generally implemented in Postgres.
But there is a way. Introduce a parent table label - if you don't already have one - and take a lock on the parent row. This keeps locking to a minimum and should avoid excessive costs from lock contention.
CREATE TABLE label (
label text PRIMARY KEY
);
CREATE TABLE version (
id serial PRIMARY KEY
, label text NOT NULL REFERENCES label
, version integer NOT NULL DEFAULT 1
, created_at timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP
, UNIQUE(label, version)
);
Then, in a single transaction:
BEGIN;

INSERT INTO label (label)
VALUES ('label1')
ON     CONFLICT (label) DO UPDATE
SET    label = NULL WHERE false  -- never executed, but still locks the row
RETURNING *;  -- optional

INSERT INTO version (label, version)
SELECT 'label1', coalesce(max(v.version), 0) + 1
FROM   version v
WHERE  v.label = 'label1'
RETURNING *;

COMMIT;
The first UPSERT inserts a new label if it's not there yet, or locks the row if it is. Either way, the transaction now holds a lock on that label, excluding concurrent writes.
The second INSERT adds a new version, or the first one if there are none yet.
You could also move the UPSERT into a CTE attached to the INSERT, thus making it a single command and hence always a single transaction implicitly. But the CTE is not needed per se.
This is safe under concurrent write load and works for all corner cases. You just have to make sure that all possibly competing write access takes the same route.
You might wrap this into a function. This ...
... ensures a single transaction
... simplifies the call, with a single mention of the label value
... allows you to revoke write privileges from the tables and grant them only to this function if desired, enforcing the right access pattern.
CREATE FUNCTION f_new_label(_label text)
  RETURNS TABLE (label text, version int)
  LANGUAGE sql STRICT AS
$func$
INSERT INTO label (label)
VALUES (_label)
ON     CONFLICT (label) DO UPDATE
SET    label = NULL WHERE false;  -- never executed, but still locks the row

INSERT INTO version AS v (label, version)
SELECT _label, coalesce(max(v1.version), 0) + 1
FROM   version v1
WHERE  v1.label = _label
RETURNING v.label, v.version;
$func$;
Call:
SELECT * FROM f_new_label('label1');
Related:
How to use RETURNING with ON CONFLICT in PostgreSQL?
UPDATE n:m relation in view as array (operations)
Yes, there could be collisions with simultaneous inserts.
Transactions could lead to locks, since you want to keep the state of the table until the insert with the new version number occurs. Apparently, Postgres would create a deadlock with multiple concurrent inserts.
You can use a BEFORE INSERT trigger, which would guarantee that every insert gets a higher version number, as it goes row by row (see the sketch below).
But you have to remember that CPUs and the SQL server can rearrange computation order, so the rule "first come, first served" may not apply.
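A minimal sketch of such a trigger (function and trigger names assumed); note that without additional locking, two concurrent inserts can still compute the same version, and one of them will then fail on the unique constraint:
CREATE FUNCTION trg_template_set_version()
RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
SELECT coalesce(max(t.version), 0) + 1
INTO   NEW.version
FROM   template t
WHERE  t.label = NEW.label;
RETURN NEW;
END
$$;

CREATE TRIGGER template_set_version
BEFORE INSERT ON template
FOR EACH ROW EXECUTE FUNCTION trg_template_set_version();  -- EXECUTE PROCEDURE before Postgres 11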
In Postgres, I have an after-insert trigger on a partitioned table that checks if the record is unique or not.
Currently, if the record is not unique, the trigger raises an exception.
This is not what I want because I just want to ignore this case and act as if the insertion actually happened.
Is it possible to configure the trigger to silently fail?
The surrounding transaction should not fail, but the row should not be inserted.
I have an after-insert trigger on a partitioned table that checks if the record is unique or not.
You could do this as part of your insert statement, using the on conflict do nothing clause. This is simpler and more efficient than using a trigger.
For this to work, you need a unique constraint on the column (or the tuple of columns) whose uniqueness you want to enforce. Assuming that this is column id, you would do:
insert into mytable(id, ...) -- enumerate the target columns here
values(id, ...) -- enumerate the values to insert here
on conflict(id) do nothing
Note that the conflict action do nothing does not actually require specifying a conflict target, so strictly speaking you could just write this as on conflict do nothing instead. But I find that it is always a good idea to specify the conflict target, so the scope is better defined.
If for some reason, you cannot have a unique index on the target column, then another option is to use the insert ... select syntax along with a not exists condition:
insert into mytable(id, ...)
select id, ...
from (values(id, ...)) t(id, ...)
where not exists (select 1 from mytable t1 where t1.id = t.id)
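A concrete illustration with hypothetical columns (id, name); note that without a unique constraint, this variant is itself racy under concurrent writes:
INSERT INTO mytable (id, name)
SELECT t.id, t.name
FROM  (VALUES (1, 'alice')) t(id, name)
WHERE NOT EXISTS (SELECT 1 FROM mytable t1 WHERE t1.id = t.id);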
I have a situation where I very frequently need to get a row from a table with a unique constraint, and if none exists then create it and return.
For example my table might be:
CREATE TABLE names(
id SERIAL PRIMARY KEY,
name TEXT,
CONSTRAINT names_name_key UNIQUE (name)
);
And it contains:
id | name
1 | bob
2 | alice
Then I'd like to:
INSERT INTO names(name) VALUES ('bob')
ON CONFLICT DO NOTHING RETURNING id;
Or perhaps:
INSERT INTO names(name) VALUES ('bob')
ON CONFLICT (name) DO NOTHING RETURNING id
and have it return bob's id 1. However, RETURNING only returns either inserted or updated rows. So, in the above example, it wouldn't return anything. In order to have it function as desired I would actually need to:
INSERT INTO names(name) VALUES ('bob')
ON CONFLICT ON CONSTRAINT names_name_key DO UPDATE
SET name = 'bob'
RETURNING id;
which seems kind of cumbersome. I guess my questions are:
What is the reasoning for not allowing the (my) desired behaviour?
Is there a more elegant way to do this?
It's the recurring problem of SELECT or INSERT, related to (but different from) an UPSERT. The new UPSERT functionality in Postgres 9.5 is still instrumental.
WITH ins AS (
INSERT INTO names(name)
VALUES ('bob')
ON CONFLICT ON CONSTRAINT names_name_key DO UPDATE
SET name = NULL
WHERE FALSE -- never executed, but locks the row
RETURNING id
)
SELECT id FROM ins
UNION ALL
SELECT id FROM names
WHERE name = 'bob' -- only executed if no INSERT
LIMIT 1;
This way you do not actually write a new row version without need.
I assume you are aware that in Postgres every UPDATE writes a new version of the row due to its MVCC model - even if name is set to the same value as before. This would make the operation more expensive, add to possible concurrency issues / lock contention in certain situations and bloat the table additionally.
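You can observe this with the system column ctid, which exposes the physical location of a row version (a small demo; the exact values will vary):
SELECT ctid, id FROM names WHERE name = 'bob';     -- e.g. (0,1)
UPDATE names SET name = 'bob' WHERE name = 'bob';  -- "empty" update
SELECT ctid, id FROM names WHERE name = 'bob';     -- a new physical row version, e.g. (0,3)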
However, there is still a tiny corner case for a race condition. Concurrent transactions may have added a conflicting row, which is not yet visible in the same statement. Then INSERT and SELECT come up empty.
Proper solution for single-row UPSERT:
Is SELECT or INSERT in a function prone to race conditions?
General solutions for bulk UPSERT:
How to use RETURNING with ON CONFLICT in PostgreSQL?
Without concurrent write load
If concurrent writes (from a different session) are not possible you don't need to lock the row and can simplify:
WITH ins AS (
INSERT INTO names(name)
VALUES ('bob')
ON CONFLICT ON CONSTRAINT names_name_key DO NOTHING -- no lock needed
RETURNING id
)
SELECT id FROM ins
UNION ALL
SELECT id FROM names
WHERE name = 'bob' -- only executed if no INSERT
LIMIT 1;
I've got this table (generated by Django):
CREATE TABLE feeds_person (
id serial PRIMARY KEY,
created timestamp with time zone NOT NULL,
modified timestamp with time zone NOT NULL,
name character varying(4000) NOT NULL,
url character varying(1000) NOT NULL,
email character varying(254) NOT NULL,
CONSTRAINT feeds_person_name_ad8c7469_uniq UNIQUE (name, url, email)
);
I'm trying to bulk insert a lot of data using INSERT with an ON CONFLICT clause.
The wrinkle is that I need to get the id back for all of the rows, whether they're already existing or not.
In other cases, I would do something like:
INSERT INTO feeds_person (created, modified, name, url, email)
VALUES blah blah blah
ON CONFLICT (name, url, email) DO UPDATE SET url = feeds_person.url
RETURNING id
Doing the UPDATE causes the statement to return the id of that row. Except, it doesn't work with this table. I think it doesn't work because I've got multiple fields unique together, whereas in other instances where I've used this method I've had just one unique field.
I get this error when trying to run the SQL through Django's cursor:
django.db.utils.ProgrammingError: ON CONFLICT DO UPDATE command cannot affect row a second time
HINT: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
How do I do the bulk insert with this table and get back the inserted and existing ids?
The error you get:
ON CONFLICT DO UPDATE command cannot affect row a second time
... indicates you are trying to upsert the same row more than once in a single command. In other words: you have dupes on (name, url, email) in your VALUES list. Fold duplicates (if that's an option) and the error goes away. This chooses an arbitrary row from each set of dupes:
INSERT INTO feeds_person (created, modified, name, url, email)
SELECT DISTINCT ON (name, url, email) *
FROM (
VALUES
('blah', 'blah', 'blah', 'blah', 'blah')
-- ... more rows
) AS v(created, modified, name, url, email) -- match column list
ON CONFLICT (name, url, email) DO UPDATE
SET url = feeds_person.url
RETURNING id;
Since we use a free-standing VALUES expression now, you have to add explicit type casts for non-default types. Like:
VALUES
(timestamptz '2016-03-12 02:47:56+01'
, timestamptz '2016-03-12 02:47:56+01'
, 'n3', 'u3', 'e3')
...
Your timestamptz columns need an explicit type cast, while the string types can operate with default text. (You could still cast to varchar(n) right away.)
If you want to have a say in which row to pick from each set of dupes, there are ways to do that:
Select first row in each GROUP BY group?
You are right, there is (currently) no way to use excluded columns in the RETURNING clause. I quote the Postgres Wiki:
Note that RETURNING does not make visible the "EXCLUDED.*" alias from the UPDATE (just the generic "TARGET.*" alias is visible there). Doing so is thought to create annoying ambiguity for the simple, common cases [30] for little to no benefit. At some point in the future, we may pursue a way of exposing if RETURNING-projected tuples were inserted and updated, but this probably doesn't need to make it into the first committed iteration of the feature [31].
However, you shouldn't be updating rows that are not supposed to be updated. Empty updates are almost as expensive as regular updates - and might have unintended side effects. You don't strictly need UPSERT to begin with, your case looks more like "SELECT or INSERT". Related:
Is SELECT or INSERT in a function prone to race conditions?
One cleaner way to insert a set of rows would be with data-modifying CTEs:
WITH val AS (
SELECT DISTINCT ON (name, url, email) *
FROM (
VALUES
(timestamptz '2016-1-1 0:0+1', timestamptz '2016-1-1 0:0+1', 'n', 'u', 'e')
, ('2016-03-12 02:47:56+01', '2016-03-12 02:47:56+01', 'n1', 'u3', 'e3')
-- more (type cast only needed in 1st row)
) v(created, modified, name, url, email)
)
, ins AS (
INSERT INTO feeds_person (created, modified, name, url, email)
SELECT created, modified, name, url, email FROM val
ON CONFLICT (name, url, email) DO NOTHING
RETURNING id, name, url, email
)
SELECT 'inserted' AS how, id FROM ins -- inserted
UNION ALL
SELECT 'selected' AS how, f.id -- not inserted
FROM val v
JOIN feeds_person f USING (name, url, email);
The added complexity should pay for big tables where INSERT is the rule and SELECT the exception.
Originally, I had added a NOT EXISTS predicate on the last SELECT to prevent duplicates in the result. But that was redundant. All CTEs of a single query see the same snapshot of the tables. The set returned with ON CONFLICT (name, url, email) DO NOTHING is mutually exclusive to the set returned after the INNER JOIN on the same columns.
Unfortunately this also opens a tiny window for a race condition. If ...
a concurrent transaction inserts conflicting rows
has not committed yet
but commits eventually
... some rows may be lost.
You might just run INSERT .. ON CONFLICT DO NOTHING, followed by a separate SELECT query for all rows - within the same transaction - to overcome this. Which, in turn, opens another tiny window for a race condition if concurrent transactions can commit writes to the table between INSERT and SELECT (in the default READ COMMITTED isolation level). That can be avoided with REPEATABLE READ transaction isolation (or stricter), or with a (possibly expensive or even unacceptable) write lock on the whole table. You can get any behavior you need, but there may be a price to pay.
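A single-row sketch of that INSERT-then-separate-SELECT pattern, with literal values standing in for the bulk data:
BEGIN;

INSERT INTO feeds_person (created, modified, name, url, email)
VALUES (now(), now(), 'n1', 'u1', 'e1')
ON CONFLICT (name, url, email) DO NOTHING;

SELECT id FROM feeds_person  -- separate command: sees rows committed in the meantime
WHERE (name, url, email) = ('n1', 'u1', 'e1');

COMMIT;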
Related:
How to use RETURNING with ON CONFLICT in PostgreSQL?
Return rows from INSERT with ON CONFLICT without needing to update
A very frequently asked question here is how to do an upsert, which is what MySQL calls INSERT ... ON DUPLICATE KEY UPDATE and the standard supports as part of the MERGE operation.
Given that PostgreSQL doesn't support it directly (before pg 9.5), how do you do this? Consider the following:
CREATE TABLE testtable (
id integer PRIMARY KEY,
somedata text NOT NULL
);
INSERT INTO testtable (id, somedata) VALUES
(1, 'fred'),
(2, 'bob');
Now imagine that you want to "upsert" the tuples (2, 'Joe'), (3, 'Alan'), so the new table contents would be:
(1, 'fred'),
(2, 'Joe'), -- Changed value of existing tuple
(3, 'Alan') -- Added new tuple
That's what people are talking about when discussing an upsert. Crucially, any approach must be safe in the presence of multiple transactions working on the same table - either by using explicit locking, or otherwise defending against the resulting race conditions.
This topic is discussed extensively at Insert, on duplicate update in PostgreSQL?, but that's about alternatives to the MySQL syntax, and it's grown a fair bit of unrelated detail over time. I'm working on definitive answers.
These techniques are also useful for "insert if not exists, otherwise do nothing", i.e. "insert ... on duplicate key ignore".
9.5 and newer:
PostgreSQL 9.5 and newer support INSERT ... ON CONFLICT (key) DO UPDATE (and ON CONFLICT (key) DO NOTHING), i.e. upsert.
Comparison with ON DUPLICATE KEY UPDATE.
Quick explanation.
For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.
Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
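Applied to the example in the question, the 9.5+ upsert is simply (a straightforward application of the syntax, not part of the original answer):
INSERT INTO testtable (id, somedata)
VALUES (2, 'Joe'), (3, 'Alan')
ON CONFLICT (id) DO UPDATE
SET somedata = EXCLUDED.somedata;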
The commit adding the feature is here and the discussion around its development is here.
If you're on 9.5 and don't need to be backward-compatible you can stop reading now.
9.4 and older:
PostgreSQL doesn't have any built-in UPSERT (or MERGE) facility, and doing it efficiently in the face of concurrent use is very difficult.
This article discusses the problem in useful detail.
In general you must choose between two options:
Individual insert/update operations in a retry loop; or
Locking the table and doing batch merge
Individual row retry loop
Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.
The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. It guards against lost updates and insert races, unlike most naive solutions. It will only work in READ COMMITTED mode and is only safe if it's the only thing you do in the transaction, though. The function won't work correctly if triggers or secondary unique keys cause unique violations.
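A sketch of that documented retry loop, adapted here to the question's testtable (the manual's version uses a different sample table):
CREATE FUNCTION merge_testtable(_id int, _somedata text)
RETURNS void
LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
   -- first try to update the key
   UPDATE testtable SET somedata = _somedata WHERE id = _id;
   IF FOUND THEN
      RETURN;
   END IF;
   -- not there, so try to insert;
   -- a concurrent insert of the same key raises unique_violation
   BEGIN
      INSERT INTO testtable (id, somedata) VALUES (_id, _somedata);
      RETURN;
   EXCEPTION WHEN unique_violation THEN
      -- do nothing, and loop to try the UPDATE again
   END;
END LOOP;
END
$func$;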
This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.
Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully INSERTs; the other gets a duplicate key error and does an UPDATE instead. The UPDATE blocks waiting for the INSERT to rollback or commit. When it rolls back, the UPDATE condition re-check matches zero rows, so even though the UPDATE commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.
Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:
-- THIS IS WRONG. DO NOT COPY IT. It's an EXAMPLE.
BEGIN;
UPDATE testtable
SET somedata = 'blah'
WHERE id = 2;
-- Remember, this is WRONG. Do NOT COPY IT.
INSERT INTO testtable (id, somedata)
SELECT 2, 'blah'
WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);
COMMIT;
then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is where both UPDATE at the same time, matching zero rows and continuing. Then they both do the EXISTS test, which happens before the INSERT. Both get zero rows, so both do the INSERT. One fails with a duplicate key error.
This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.
Please don't roll your own solution for this. Like with message queuing, it's probably wrong.
Bulk upsert with lock
Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.
In this case, you typically follow the following process:
CREATE a TEMPORARY table
COPY or bulk-insert the new data into the temp table
LOCK the target table IN EXCLUSIVE MODE. This permits other transactions to SELECT, but not make any changes to the table.
Do an UPDATE ... FROM of existing records using the values in the temp table;
Do an INSERT of rows that don't already exist in the target table;
COMMIT, releasing the lock.
For example, for the example given in the question, using multi-valued INSERT to populate the temp table:
BEGIN;
CREATE TEMPORARY TABLE newvals(id integer, somedata text);
INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');
LOCK TABLE testtable IN EXCLUSIVE MODE;
UPDATE testtable
SET somedata = newvals.somedata
FROM newvals
WHERE newvals.id = testtable.id;
INSERT INTO testtable
SELECT newvals.id, newvals.somedata
FROM newvals
LEFT OUTER JOIN testtable ON (testtable.id = newvals.id)
WHERE testtable.id IS NULL;
COMMIT;
Related reading
UPSERT wiki page
UPSERTisms in Postgres
Insert, on duplicate update in PostgreSQL?
http://petereisentraut.blogspot.com/2010/05/merge-syntax.html
Upsert with a transaction
Is SELECT or INSERT in a function prone to race conditions?
SQL MERGE on the PostgreSQL wiki
Most idiomatic way to implement UPSERT in Postgresql nowadays
What about MERGE?
SQL-standard MERGE actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.
It's a really useful OLAP statement for data merging, but it's not actually a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use MERGE for upserts, but it's actually wrong.
Other DBs:
INSERT ... ON DUPLICATE KEY UPDATE in MySQL
MERGE from MS SQL Server (but see above about MERGE problems)
MERGE from Oracle (but see above about MERGE problems)
Here are some examples for insert ... on conflict ... (pg 9.5+):
Insert, on conflict - do nothing.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict do nothing;
Insert, on conflict - do update, specify conflict target via column.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict(id)
do update set name = 'new_name', size = 3;
Insert, on conflict - do update, specify conflict target via constraint name.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict on constraint dummy_pkey
do update set name = 'new_name', size = 4;
I'd like to contribute another solution for the single-insertion problem with the pre-9.5 versions of PostgreSQL. The idea is simply to try to perform the insertion first, and in case the record already exists, to update it:
do $$
begin
insert into testtable(id, somedata) values(2,'Joe');
exception when unique_violation then
update testtable set somedata = 'Joe' where id = 2;
end $$;
Note that this solution can be applied only if rows of the table are never deleted.
I do not know about the efficiency of this solution, but it seems reasonable enough to me.
SQLAlchemy upsert for Postgres >=9.5
Since the large post above covers many different SQL approaches for Postgres versions (not only pre-9.5 as in the question), I would like to add how to do it in SQLAlchemy if you are using Postgres 9.5+. Instead of implementing your own upsert, you can also use SQLAlchemy's functions (which were added in SQLAlchemy 1.1). Personally, I would recommend using these if possible - not only for convenience, but also because it lets PostgreSQL handle any race conditions that might occur.
Cross-posting from another answer I gave yesterday (https://stackoverflow.com/a/44395983/2156909)
SQLAlchemy supports ON CONFLICT now with two methods on_conflict_do_update() and on_conflict_do_nothing():
Copying from the documentation:
from sqlalchemy.dialects.postgresql import insert
stmt = insert(my_table).values(user_email='a@b.com', data='inserted data')
stmt = stmt.on_conflict_do_update(
index_elements=[my_table.c.user_email],
index_where=my_table.c.user_email.like('%@gmail.com'),
set_=dict(data=stmt.excluded.data)
)
conn.execute(stmt)
http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=conflict#insert-on-conflict-upsert
MERGE in PostgreSQL v. 15
Since PostgreSQL v15, it is possible to use the MERGE command, which was actually presented as one of the main improvements of this new version.
It uses WHEN MATCHED / WHEN NOT MATCHED conditionals to choose the behaviour when there is an existing row matching the criteria.
It is even better than the standard UPSERT, as the new feature gives full control to INSERT, UPDATE or DELETE rows in bulk.
MERGE INTO customer_account ca
USING recent_transactions t
ON    t.customer_id = ca.customer_id
WHEN  MATCHED THEN
   UPDATE SET balance = balance + transaction_value
WHEN  NOT MATCHED THEN
   INSERT (customer_id, balance)
   VALUES (t.customer_id, t.transaction_value);
WITH upd AS (
   UPDATE testtable SET somedata = 'Joe' WHERE id = 2
   RETURNING id
)
, ins AS (
   SELECT 2, 'Joe'
   WHERE NOT EXISTS (SELECT * FROM upd)
)
INSERT INTO testtable (id, somedata)
SELECT * FROM ins;
Tested on Postgresql 9.3
Since this question was closed, I'm posting here how you do it using SQLAlchemy. Via recursion, it retries a bulk insert or update to combat race conditions and validation errors.
First the imports
import itertools as it
from functools import partial
from operator import itemgetter
from sqlalchemy.exc import IntegrityError
from app import session
from models import Posts
Now a couple helper functions
def chunk(content, chunksize=None):
"""Groups data into chunks each with (at most) `chunksize` items.
https://stackoverflow.com/a/22919323/408556
"""
if chunksize:
i = iter(content)
generator = (list(it.islice(i, chunksize)) for _ in it.count())
else:
generator = iter([content])
return it.takewhile(bool, generator)
def gen_resources(records):
"""Yields a dictionary if the record's id already exists, a row object
otherwise.
"""
ids = {item[0] for item in session.query(Posts.id)}
for record in records:
is_row = hasattr(record, 'to_dict')
if is_row and record.id in ids:
# It's a row but the id already exists, so we need to convert it
# to a dict that updates the existing record. Since it is duplicate,
# also yield True
yield record.to_dict(), True
elif is_row:
# It's a row and the id doesn't exist, so no conversion needed.
# Since it's not a duplicate, also yield False
yield record, False
elif record['id'] in ids:
# It's a dict and the id already exists, so no conversion needed.
# Since it is duplicate, also yield True
yield record, True
else:
# It's a dict and the id doesn't exist, so we need to convert it.
# Since it's not a duplicate, also yield False
yield Posts(**record), False
And finally the upsert function
def upsert(data, chunksize=None):
for records in chunk(data, chunksize):
resources = gen_resources(records)
sorted_resources = sorted(resources, key=itemgetter(1))
for dupe, group in it.groupby(sorted_resources, itemgetter(1)):
items = [g[0] for g in group]
if dupe:
_upsert = partial(session.bulk_update_mappings, Posts)
else:
_upsert = session.add_all
try:
_upsert(items)
session.commit()
except IntegrityError:
# A record was added or deleted after we checked, so retry
#
# modify accordingly by adding additional exceptions, e.g.,
# except (IntegrityError, ValidationError, ValueError)
session.rollback()
upsert(items)
except Exception as e:
# Some other error occurred so reduce chunksize to isolate the
# offending row(s)
session.rollback()
num_items = len(items)
if num_items > 1:
upsert(items, num_items // 2)
else:
print('Error adding record {}'.format(items[0]))
Here's how you use it
>>> data = [
... {'id': 1, 'text': 'updated post1'},
... {'id': 5, 'text': 'updated post5'},
... {'id': 1000, 'text': 'new post1000'}]
...
>>> upsert(data)
The advantage this has over bulk_save_objects is that it can handle relationships, error checking, etc. on insert (unlike bulk operations).