Postgres race condition involving subselect and foreign key - sql

We have 2 tables defined as follows
CREATE TABLE foo (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL UNIQUE
);
CREATE TABLE bar (
foo_id BIGINT UNIQUE,
foo_name TEXT NOT NULL UNIQUE REFERENCES foo (name)
);
I've noticed that when executing the following two queries concurrently
INSERT INTO foo (name) VALUES ('BAZ')
INSERT INTO bar (foo_name, foo_id) VALUES ('BAZ', (SELECT id FROM foo WHERE name = 'BAZ'))
it is possible under certain circumstances to end up inserting a row into bar where foo_id is NULL. The two queries are executed in different transactions, by two completely different processes.
How is this possible? I'd expect the second statement to either fail due to a foreign key violation (if the record in foo is not there), or succeed with a non-null value of foo_id (if it is).
What is causing this race condition? Is it due to the subselect, or is it due to the timing of when the foreign key constraint is checked?
We are using isolation level "read committed" and postgres version 10.3.
EDIT
I think the question was not particularly clear about what is confusing me. The question is about how and why two different states of the database are observed during the execution of a single statement. The subselect sees the record in foo as absent, whereas the FK check sees it as present. If it's just that there's no rule preventing this race condition, then that is an interesting question in itself - why would it not be possible to use transaction IDs to ensure that the same state of the database is observed for both?

The subselect in the INSERT INTO bar cannot see the new row concurrently inserted in foo because the latter is not committed yet.
But by the time that the query that checks the foreign key constraint is executed, the INSERT INTO foo has committed, so the foreign key constraint doesn't report an error.
A simple way to work around that is to use the REPEATABLE READ isolation level for the INSERT INTO bar. Then the foreign key check uses the same snapshot as the INSERT; it won't see the newly committed row, and a constraint violation error will be thrown.
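A minimal sketch of that workaround, using the same statements as the question:
BEGIN ISOLATION LEVEL REPEATABLE READ;
INSERT INTO bar (foo_name, foo_id)
VALUES ('BAZ', (SELECT id FROM foo WHERE name = 'BAZ'));
-- if the concurrent INSERT INTO foo committed after this transaction's
-- snapshot was taken, the FK check now fails instead of silently passing
COMMIT;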

Logic suggests that the ordering of the commands (including the sub-query), combined with when Postgres checks constraints (which is not necessarily immediate), could cause the issue. Therefore the following could happen:
The second command starts first
Its SELECT component runs and returns NULL
The first command starts and inserts its row
The second command inserts its row (with the name value and a NULL foo_id)
The FK reference check succeeds because the name now exists
Regarding deferrable constraints, see https://www.postgresql.org/docs/13/sql-set-constraints.html and https://begriffs.com/posts/2017-08-27-deferrable-sql-constraints.html
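For reference, this is what the deferrable variant from those links would look like for the table in the question (a sketch only; deferring moves the FK check to COMMIT, it does not change which snapshot the subselect uses):
CREATE TABLE bar (
foo_id BIGINT UNIQUE,
foo_name TEXT NOT NULL UNIQUE REFERENCES foo (name) DEFERRABLE INITIALLY IMMEDIATE
);
-- inside a transaction, postpone the check until COMMIT
SET CONSTRAINTS ALL DEFERRED;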
Suggested answers
Add a NOT NULL constraint on bar.foo_id, or include foo_id as part of the foreign key (see the sketch after this list)
Rewrite the two commands to run consecutively rather than simultaneously (if possible)
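A sketch of the first suggestion against the tables from the question (the constraint name is illustrative). With this in place the problematic insert fails instead of silently storing a NULL, and referencing foo by id validates the id itself:
ALTER TABLE bar ALTER COLUMN foo_id SET NOT NULL;
ALTER TABLE bar ADD CONSTRAINT bar_foo_id_fkey FOREIGN KEY (foo_id) REFERENCES foo (id);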

You do indeed have a race condition. Without some sort of locking or use of a transaction to sequence the events, there is no rule precluding the sequence
The sub select of the bar INSERT is performed, yielding NULL
The INSERT into foo
The INSERT into bar, which now does not have any FK violation, but does have a NULL.
Since of course this is the toy version of your real program, I can't recommend how best to fix it. If it makes sense to require these events in a particular sequence, then they can be in a transaction on a single thread. In some other situation, you might prohibit inserting directly into foo and bar (REVOKE permissions as necessary) and allow modifications only through a function/procedure, or through a view that has triggers (possibly rules).
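A rough sketch of the function-based variant mentioned above (the function name and app_role are illustrative, not from the question):
CREATE FUNCTION insert_foo_with_bar(p_name TEXT) RETURNS void
LANGUAGE plpgsql SECURITY DEFINER AS $func$
DECLARE
  v_id BIGINT;
BEGIN
  INSERT INTO foo (name) VALUES (p_name) RETURNING id INTO v_id;
  INSERT INTO bar (foo_name, foo_id) VALUES (p_name, v_id);
END;
$func$;
-- let the application role call only the function
REVOKE INSERT ON foo, bar FROM app_role;
GRANT EXECUTE ON FUNCTION insert_foo_with_bar(TEXT) TO app_role;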

An anonymous plpgsql block will help you avoid the race conditions (by making sure that the inserts run sequentially within the same transaction) without going deep into Postgres internals:
do language plpgsql
$$
declare
  v_foo_id bigint;
begin
  insert into foo (name) values ('BAZ') returning id into v_foo_id;
  insert into bar (foo_name, foo_id) values ('BAZ', v_foo_id);
end;
$$;
or using plain SQL with a CTE in order to avoid switching context to/from plpgsql:
with t(id) as
(
  insert into foo (name) values ('BAZ') returning id
)
insert into bar (foo_name, foo_id) values ('BAZ', (select id from t));
And, btw, are you sure that the two inserts in your example are executed in the same transaction and in the right order? If not, then the short answer to your question is "MVCC", since the second statement is not atomic.

This seems more likely to be a scenario where both queries executed one after another but the first transaction was not committed.
Process 1
INSERT INTO foo (name) VALUES ('BAZ')
The transaction is not committed, but Process 2 executes the next query
INSERT INTO bar (foo_name, foo_id) VALUES ('BAZ', (SELECT id FROM foo WHERE name = 'BAZ'))
In this case, the Process 2 query will wait until the Process 1 transaction is committed.
From the PostgreSQL docs:
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress).

Related

Get number of rows of an event already in a table before inserting a new dataset?

I have a table consisting of three columns:
ID of the participant
ID of the courseevent
Mark
For each courseevent, there are only 15 people allowed
How can I check this using Oracle?
The most important aspect of this problem is making it work in a multi-user environment.
Oracle only allows the READ COMMITTED and SERIALIZABLE isolation levels. There are no dirty reads, and no mechanism for "peeking" at uncommitted sessions.
Which means this statement
select courseevent, count(*)
from courseparticpants
group by courseevent;
will show you how many records have been committed. If you go on to insert a record you could still insert the sixteenth booking, if someone else commits their work in the interim. Conversely you may decide that the course is already full when in fact somebody is about to delete a row.
To control this you need to serialize access to the courseparticpants table, so that only one session may insert records into it at a time. There are various ways to do this but the safest is:
lock table courseparticpants in exclusive mode nowait;
If you fail to get the lock you know another session is already working on it. Otherwise you can run your count, insert a new booking and do whatever else is required with the confidence that your rule is not broken.
It is important not to hold the lock for too long, for obvious reasons: nobody else can do their work on the table. A slightly less obtrusive mechanism would be to lock the relevant record in the parent table; I didn't propose this first because I didn't want to make assumptions about your data model.
select whatever
from courseevents
where courseevent = :p1
for update nowait;
This would allow other sessions to book participants for other events.
Both these solutions entail writing a program unit - say in PL/SQL - to manage the transaction.
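A minimal PL/SQL sketch of the second (parent-row lock) approach; the courseevents/courseparticpants tables and the participantid/courseevent columns are assumptions based on the question and the snippets above, and the bind variables are placeholders:
DECLARE
  v_dummy NUMBER;
  v_count NUMBER;
BEGIN
  -- lock the parent row so only one session can book this event at a time
  SELECT 1 INTO v_dummy
  FROM courseevents
  WHERE courseevent = :p_event
  FOR UPDATE NOWAIT;

  SELECT COUNT(*) INTO v_count
  FROM courseparticpants
  WHERE courseevent = :p_event;

  IF v_count < 15 THEN
    INSERT INTO courseparticpants (participantid, courseevent)
    VALUES (:p_participant, :p_event);
    COMMIT;
  ELSE
    ROLLBACK;
    RAISE_APPLICATION_ERROR(-20001, 'Course event is already full');
  END IF;
END;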
"is there a possibility to solve this with constraints?"
No, Oracle does not allow queries in its CHECK constraints. Standard SQL has the concept of ASSERTIONS, but Oracle has not implemented them.
One possible solution would be to make participantid a count within courseevent, so you could enforce a check constraint
check ( participantid <= 15)
However, you would still need to do all the locking and stuff to get an accurate figure for the current number of participants so that your n+1 was correct.
select count(*)
from blah, blah, blah
gives you the number of existing records.
This shows participants in more than 15 events.
SELECT participant, COUNT(DISTINCT courseevent) F
FROM Table
GROUP BY participant
HAVING COUNT(DISTINCT courseevent) > 15
INSERT INTO MyTable(Col1, Col2, Col3)
SELECT 'Val1', 'Val2', 'Val3'
FROM DUAL
WHERE (SELECT COUNT(*) FROM MyTable WHERE condition) < 15;
Regular table constraints only consider individual rows in isolation, but your requirement is to consider a group of rows together.
Here is a rather complicated solution that uses materialized view constraints to implement the requirement.
You can think of this as defining a constraint on a column in a result set.
create table course_participants(
course varchar2(20) not null
,participant varchar2(20) not null
,constraint course_participants_pk primary key(course, participant)
);
-- Need this for fast refreshable mview
create materialized view log
on course_participants
with rowid(course, participant)
including new values;
-- A materialized view with a count of participants per course
create materialized view course_parts_max_mv
refresh fast on commit
as
select course
,count(*) as participants
from course_participants
group
by course;
-- This is where you perform the check.
-- I've used 2 participants to make the example easier
alter materialized view course_parts_max_mv
add constraint too_many_participants check(participants <= 2);
The above DDL creates a table and a materialized view. The materialized view will contain one row for each course along with the number of participants. The trick is that rather than declaring the constraint on the base table, we can now declare it on the materialized view.
-- One participant is ok!
insert into course_participants values('Oracle', 'Alfred');
commit;
-- Two participants are ok!
insert into course_participants values('Englis speling', 'Benjamin');
insert into course_participants values('Englis speling', 'Charles');
commit;
-- This will fail, because the count(*) for 'Economics' will return 3
insert into course_participants values('Economics', 'Alfred');
insert into course_participants values('Economics', 'Benjamin');
insert into course_participants values('Economics', 'Charles');
commit;
ORA-12008: error in materialized view refresh path
ORA-02290: check constraint (RNBN.TOO_MANY_PARTICIPANTS) violated
Note that the constraint is checked when you commit the transaction, so in the last example none of the participants will get registered.

SQL Oracle, inserting into tables without sequence

I am working with an Oracle 11gR2 (11.2) instance.
I'd like to know what I am exposing myself to by inserting rows into tables while generating the primary key values myself.
I would SELECT max(pk) FROM sometables;
and then use the next hundred values, for example, for my next 100 inserts.
Is this playing with fire?
The context is: I have a big number of inserts to do, split over several tables linked by foreign keys. I am trying to get good performance, and not use PL/SQL.
[EDIT] here a code sample that looks like what I'm dealing with:
QString query1 = "INSERT INTO table1 (pk1_id, val) VALUES (pk1_seq.nextval, ?)";
sqlQuery->prepare(query1);
sqlQuery->addBindValue(vec_of_values);
sqlQuery->execBatch();
QString query2 = "INSERT INTO table2 (pk2_id, another_val, pk1_pk1_id) VALUES (pk2_seq.nextval, ?, ?)";
sqlQuery->prepare(query2);
sqlQuery->addBindValue(vec_of_values);
// How do I get the primary keys (hundreds of them)
// from the first insert??
sqlQuery->addBindValue(vec_of_pk1);
sqlQuery->execBatch();
You are exposing yourself to slower performance, errors in your logic, and extra code to maintain. Oracle sequences are optimized for your specific purpose. For high DML operations you may also cache sequences:
ALTER SEQUENCE customers_seq CACHE 100;
Create a sequence for the master table(s)
Insert into the master table using your_sequence.nextval
Inserts into child (dependent) tables are done using your_sequence.currval
create table parent (id integer primary key not null);
create table child (id integer primary key not null, pid integer not null references parent(id));
create sequence parent_seq;
create sequence child_seq;
insert into parent (id) values (parent_seq.nextval);
insert into child (id, pid) values (child_seq.nextval, parent_seq.currval);
commit;
To explain why max(id) will not work reliably, consider the following scenario:
Transaction 1 retrieves max(id) + 1 (which yields, say, 42)
Transaction 1 inserts a new row with id = 42
Transaction 2 retrieves max(id) + 1 (which also yields 42, because transaction 1 is not yet committed)
Transaction 1 commits
Transaction 2 inserts a new row with id = 42
Transaction 2 tries to commit and gets a unique key violation
Now think about what happens when you have a lot of transactions doing this. You'll get a lot of errors. Additionally your inserts will be slower and slower, because the cost of calculating max(id) will increase with the size of the table.
Sequences are the only sane (i.e. correct, fast and scalable) way out of this problem.
Edit
If you are stuck with yet another ORM which can't cope with this kind of strategy (which is supported by nearly all DBMSs nowadays - even SQL Server has sequences now), then you should be able to do the following in your client code:
Retrieve the next PK value using select parent_seq.nextval from dual into a variable in your programming language (this is a fast, scalable and correct way to retrieve the PK value).
If you can run a select max(id) you can also run a select parent_seq.nextval from dual. In both cases just use the value obtained from that select statement.

Please recommend the best bulk-delete option

I'm using PostgreSQL 8.1.4. I have 3 tables: one core table (table1) and two dependent tables (table2, table3). I inserted 70000 records into table1 and the appropriate related records into the other 2 tables. Since I used CASCADE, I am able to delete the related records with DELETE FROM table1;
This works fine when there are only a few records in my current PostgreSQL version. With a huge volume of records, it tries to delete everything, but there is no sign of the deletion making any progress even after many hours, whereas a bulk import finishes in a few minutes. I would like the bulk delete to finish in a reasonable number of minutes.
I also tried TRUNCATE, like TRUNCATE table3, table2, table1; but there was no change in performance: it just takes more time with no sign of completion. From the net I found a few options, such as dropping all constraints and then recreating them, but no query seems to run successfully against table1 once it is loaded with more data.
Please recommend me the best solutions to delete all the records in minutes.
CREATE TABLE table1(
t1_id SERIAL PRIMARY KEY,
disp_name TEXT NOT NULL DEFAULT '',
last_updated TIMESTAMP NOT NULL DEFAULT current_timestamp,
UNIQUE(disp_name)
) WITHOUT OIDS;
CREATE UNIQUE INDEX disp_name_index on table1(upper(disp_name));
CREATE TABLE table2 (
t2_id SERIAL PRIMARY KEY,
t1_id INTEGER REFERENCES table1 ON DELETE CASCADE,
type TEXT
) WITHOUT OIDS;
CREATE TABLE table3 (
t3_id SERIAL PRIMARY KEY,
t1_id INTEGER REFERENCES table1 ON DELETE CASCADE,
config_key TEXT,
config_value TEXT
) WITHOUT OIDS;
You can create an index on the columns in the child tables which reference the parent table:
on table2 create an index on the t1_id column
on table3 create an index on the t1_id column
that should speed things up slightly.
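For the schema shown in the question, that would be something like (the index names are illustrative):
CREATE INDEX table2_t1_id_idx ON table2 (t1_id);
CREATE INDEX table3_t1_id_idx ON table3 (t1_id);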
And/or, don't bother with the ON DELETE CASCADE: write a delete stored procedure which deletes first from the child tables and then from the parent table. It may be faster than letting PostgreSQL do it for you.
In SQL, the TRUNCATE TABLE statement is a Data Definition Language (DDL) operation that marks the extents of a table for deallocation (empty for reuse). The result of this operation quickly removes all data from a table, typically bypassing a number of integrity enforcing mechanisms.
http://en.wikipedia.org/wiki/Truncate_(SQL)
So TRUNCATE should be very fast. In your case, it looks like you have a transaction which has been neither committed nor rolled back, and in that case your delete will never finish.
To solve this problem, you should check the active transactions in your database. The easiest way (at least under SQL Server it works) is to write "ROLLBACK COMMIT;" into the query window and execute it. If it executes without throwing an error, there was actually an active transaction. If there is no active transaction remaining, it will give you an error.
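Since the question is about PostgreSQL rather than SQL Server, a sketch of the equivalent check there is to query the lock and activity views (the column names below match old releases such as 8.1; newer versions rename procpid and current_query to pid and query):
SELECT l.pid, l.relation::regclass, l.mode, l.granted, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
ORDER BY l.granted;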
I would bet that you miss some indices on the database too.
If you issue the delete command from psql console, just hit Ctrl-C - the transaction will get interrupted and psql should inform you which query was being executed when you interrupted it.
Then use EXPLAIN to check why the query takes so long.
I had a similar situation recently and adding an index solved the problem.

Does "SELECT FOR UPDATE" prevent other connections inserting when the row is not present?

I'm interested in whether a SELECT FOR UPDATE query will lock a non-existent row.
Example
Table FooBar with two columns, foo and bar, foo has a unique index.
Issue query SELECT bar FROM FooBar WHERE foo = ? FOR UPDATE
If the first query returns zero rows, issue a query
INSERT INTO FooBar (foo, bar) values (?, ?)
Now is it possible that the INSERT would cause an index violation or does the SELECT FOR UPDATE prevent that?
Interested in behavior on SQLServer (2005/8), Oracle and MySQL.
MySQL
SELECT ... FOR UPDATE with UPDATE
Using transactions with InnoDB (auto-commit turned off), a SELECT ... FOR UPDATE allows one session to temporarily lock down a particular record (or records) so that no other session can update it. Then, within the same transaction, the session can actually perform an UPDATE on the same record and commit or roll back the transaction. This would allow you to lock down the record so no other session could update it while perhaps you do some other business logic.
This is accomplished with locking. InnoDB utilizes indexes for locking records, so locking an existing record seems easy--simply lock the index for that record.
SELECT ... FOR UPDATE with INSERT
However, to use SELECT ... FOR UPDATE with INSERT, how do you lock an index for a record that doesn't exist yet? If you are using the default isolation level of REPEATABLE READ, InnoDB will also utilize gap locks. As long as you know the id (or even range of ids) to lock, then InnoDB can lock the gap so no other record can be inserted in that gap until we're done with it.
If your id column were an auto-increment column, then SELECT ... FOR UPDATE with INSERT INTO would be problematic because you wouldn't know what the new id was until you inserted it. However, since you know the id that you wish to insert, SELECT ... FOR UPDATE with INSERT will work.
CAVEAT
On the default isolation level, SELECT ... FOR UPDATE on a non-existent record does not block other transactions. So, if two transactions both do a SELECT ... FOR UPDATE on the same non-existent index record, they'll both get the lock, and neither transaction will be able to update the record. In fact, if they try, a deadlock will be detected.
Therefore, if you don't want to deal with a deadlock, you might just do the following:
INSERT INTO ...
Start a transaction, and perform the INSERT. Do your business logic, and either commit or rollback the transaction. As soon as you do the INSERT on the non-existent record index on the first transaction, all other transactions will block if they attempt to INSERT a record with the same unique index. If the second transaction attempts to insert a record with the same index after the first transaction commits the insert, then it will get a "duplicate key" error. Handle accordingly.
SELECT ... LOCK IN SHARE MODE
If you select with LOCK IN SHARE MODE before the INSERT, if a previous transaction has inserted that record but hasn't committed yet, the SELECT ... LOCK IN SHARE MODE will block until the previous transaction has completed.
So to reduce the chance of duplicate key errors, especially if you hold the locks for a while while performing business logic before committing or rolling them back:
SELECT bar FROM FooBar WHERE foo = ? LOCK IN SHARE MODE
If no records returned, then
INSERT INTO FooBar (foo, bar) VALUES (?, ?)
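Putting those steps together, a sketch of the whole flow as one InnoDB transaction (the ? placeholders are the same as in the question):
START TRANSACTION;
SELECT bar FROM FooBar WHERE foo = ? LOCK IN SHARE MODE;
-- if no row came back, try the insert; a concurrent uncommitted insert of
-- the same foo makes this block, and a committed one raises a duplicate-key
-- error that the application must handle
INSERT INTO FooBar (foo, bar) VALUES (?, ?);
COMMIT;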
In Oracle, the SELECT ... FOR UPDATE has no effect on a non-existent row (the statement simply raises a No Data Found exception). The INSERT statement will prevent duplicates of unique/primary key values. Any other transactions attempting to insert the same key values will block until the first transaction commits (at which time the blocked transaction will get a duplicate key error) or rolls back (at which time the blocked transaction continues).
On Oracle:
Session 1
create table t (id number);
alter table t add constraint pk primary key(id);
SELECT *
FROM t
WHERE id = 1
FOR UPDATE;
-- 0 rows returned
-- this creates row level lock on table, preventing others from locking table in exclusive mode
Session 2
SELECT *
FROM t
FOR UPDATE;
-- 0 rows returned
-- there are no problems with locking here
rollback; -- releases lock
INSERT INTO t
VALUES (1);
-- 1 row inserted without problems
I wrote a detailed analysis of this thing on SQL Server: Developing Modifications that Survive Concurrency
Anyway, you need to use SERIALIZABLE isolation level, and you really need to stress test.
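A common sketch of that approach on SQL Server (not necessarily the exact technique from the linked article; variable types are placeholders) is to take a key-range lock before deciding whether to insert:
DECLARE @foo INT, @bar INT; -- set to the values being inserted
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
IF NOT EXISTS (SELECT 1 FROM FooBar WITH (UPDLOCK, HOLDLOCK) WHERE foo = @foo)
  INSERT INTO FooBar (foo, bar) VALUES (@foo, @bar);
COMMIT TRANSACTION;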
SQL Server only has the FOR UPDATE as part of a cursor. And, it only applies to UPDATE statements that are associated with the current row in the cursor.
So, the FOR UPDATE has no relationship with INSERT. Therefore, I think your answer is that it's not applicable in SQL Server.
Now, it may be possible to simulate the FOR UPDATE behavior with transactions and locking strategies. But, that may be more than what you're looking for.

ignore insert of rows that violate duplicate key index

I perform an insert as follows:
INSERT INTO foo (a,b,c)
SELECT x,y,z
FROM fubar
WHERE ...
However, if some of the rows that are being inserted violate the duplicate key index on foo, I want the database to ignore those rows, and not insert them and continue inserting the other rows.
The DB in question is Informix 11.5. Currently all that happens is that the DB is throwing an exception. If I try to handle the exception with:
ON EXCEPTION IN (-239)
END EXCEPTION WITH RESUME;
... it does not help because after the exception is caught, the entire insert is skipped.
I don't think informix supports INSERT IGNORE, or INSERT ... ON DUPLICATE KEY..., but feel free to correct me if I am wrong.
Use an IF statement and the EXISTS function to check for existing records. Or you can probably include that EXISTS check in the WHERE clause, like below:
INSERT INTO foo (a,b,c)
SELECT x,y,z
FROM fubar
WHERE (NOT EXISTS(SELECT a FROM foo WHERE ...))
If you want to know about all the errors (typically as a result of a data loading operation), consider using violations tables.
START VIOLATIONS TABLE FOR foo;
This will create a pair of tables foo_vio and foo_dia to contain information about rows that violate the integrity constraints on the table.
When you've had enough, you use:
STOP VIOLATIONS TABLE FOR foo;
You can clean up the diagnostic tables at your leisure. There are bells and whistles on the command to control which table is used, etc. (I should perhaps note that this assumes you are using IDS (IBM Informix Dynamic Server) and not, say, Informix SE or Informix OnLine.)
Violations tables are a heavy-duty option - suitable for loads and the like. They are not ordinarily used to protect run-of-the-mill SQL. For that, the protected INSERT (with SELECT and WHERE NOT EXISTS) is fairly effective - it requires the data to be in a table already, but temp tables are easy to create.
There are a couple of other options to consider.
From version 11.50 onwards, Informix supports the MERGE statement. This could be used to insert rows from fubar where the corresponding row in foo does not exist, and to update the rows in foo with the values from fubar where the corresponding row already exists in foo (the duplicate key problem).
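A sketch of what that MERGE might look like, assuming the duplicate key is foo.a matched against fubar.x and the remaining columns line up as in the earlier INSERT ... SELECT:
MERGE INTO foo AS f
USING (SELECT x, y, z FROM fubar WHERE ...) AS fb
ON f.a = fb.x
WHEN MATCHED THEN UPDATE SET f.b = fb.y, f.c = fb.z
WHEN NOT MATCHED THEN INSERT (a, b, c) VALUES (fb.x, fb.y, fb.z);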
Another way of looking at it is:
SELECT fubar.*
FROM fubar JOIN foo ON fubar.pk = foo.pk
INTO TEMP duplicate_entries;
DELETE FROM fubar WHERE pk IN (SELECT pk FROM duplicate_entries);
INSERT INTO foo SELECT * FROM fubar;
...process duplicate_entries
DROP TABLE duplicate_entries;
This cleans the source table (fubar) of the duplicate entries (assuming it is only the primary key that is duplicated) before trying to insert the data. The duplicate_entries table contains the rows in fubar with the duplicate keys - the ones that need special processing in some shape or form. Or you can simply delete and ignore those rows, though in my experience, that is seldom a good idea.
GROUP BY may be your friend here. To prevent duplicate rows from being inserted, use GROUP BY in your SELECT; this will collapse the duplicates into a single row. The only thing I would do is test to see if there are any performance issues. Also, make sure you include all of the columns you want to be unique in the GROUP BY, or you could exclude rows that are not duplicates.
INSERT INTO FOO (Name, Age, Gadget, Price)
SELECT Name, Age, Gadget, Price
FROM foobar
GROUP BY Name, Age, Gadget, Price
Where Name, Age, Gadget, Price form the primary key index (or unique key index).
The other possibility is to write the duplicated rows to an error table without the index and then resolve the duplicates before inserting them into the new table. Just need to add a having count(*) > 1 clause to the above.
I don't know about Informix, but with SQL Server, you can create an index, make it unique and then set a property to have it ignore duplicate keys so no error gets thrown on a duplicate. It's just ignored. Perhaps Informix has something similar.
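For reference, the SQL Server property mentioned above is IGNORE_DUP_KEY (shown here on a hypothetical index; it is not available in Informix):
CREATE UNIQUE INDEX ux_foo_a ON foo (a) WITH (IGNORE_DUP_KEY = ON);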