I have a table A which is referenced by a table B. That is to say, A's schema looks like this:
Table A
(
id int,
name varchar
)
While Table B's schema is:
Table B
(
id int,
a_id int,
val int
)
I have a piece of code that creates a record in table B. But in the case of a race condition (say, two parallel transactions), a condition in that block fails, and as a result two records are created in table B instead of one.
The transaction block looks very similar to this (in Rails):
ActiveRecord::Base.transaction do
# a here is an ActiveRecord Object of Model A
b = B.new(a_id: a.id, val: value) # value is negative
raise ActiveRecord::Rollback unless b.save
# this method calculates the sum of val's of all associated records b of a.
# i.e. find all records from B where b.a_id = a.id and find the sum of val
# column
sum = calculateSum(a)
# below condition fails in race conditions
raise ActiveRecord::Rollback if sum <= 0
end
One solution to this would be to keep a centralized hash of locks, keyed by A's id, and to wait (in my application) for the lock to be released before entering the block. This solution would definitely work, but I was wondering whether Postgres already provides a better one.
Edit: There is no constraint that an A should have only one B record; an A can have many Bs. It's just that the block of code I mentioned has a check that fails in the case of two parallel transactions.
The most general solution to concurrency issues like this is to put your whole block within a SERIALIZABLE transaction. Put simply, this guarantees that your transactions behave as if they had exclusive access to the database. The main downside is that you may trigger a serialisation failure at any point, even from a simple SELECT, and you should be prepared to retry the transaction when this happens. There is an example on the wiki which appears to be very similar to your case, and which should give you a better idea of how these transactions behave in practice.
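To make the retry requirement concrete, here is a minimal sketch in Python. `SerializationFailure` is a stand-in for whatever error class your driver actually raises (psycopg2, for instance, has its own), and the simulated transaction is purely illustrative:

```python
class SerializationFailure(Exception):
    """Stand-in for the driver's serialization-failure error class."""

def run_in_serializable_txn(work, max_retries=5):
    """Run `work` (a callable performing one transaction) and retry it
    whenever the database aborts it with a serialization failure."""
    for attempt in range(max_retries):
        try:
            return work()
        except SerializationFailure:
            continue  # transaction was rolled back; it is safe to retry
    raise RuntimeError("gave up after %d retries" % max_retries)

# Simulated transaction that fails the first two times it runs:
attempts = []
def txn():
    attempts.append(1)
    if len(attempts) < 3:
        raise SerializationFailure()
    return "committed"

print(run_in_serializable_txn(txn))  # -> committed
```

The key point is that the retry loop wraps the whole transaction, not a single statement; after a serialization failure everything the transaction did is rolled back.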
Other than that, I think you'll need to explicitly lock something. One possibility would be to lock the whole record in A via a SELECT FOR UPDATE statement, which will block competing processes in your application, as well as anything else trying to insert a referencing row in B. The drawback here is that you might block (or be blocked by) some unrelated operation, like an insert in a different referencing table, or an update of A itself.
A better approach might be to take out an advisory lock on A.id. This is basically equivalent to your centralised hash, but these locks have the advantage of being managed by Postgres, and automatically released on commit/rollback. The caveat is that, because you're taking out locks on arbitrary integers, you want to be sure that you don't collide with some other process which happens to be locking that same integer for some unrelated reason.
You can handle this by using the two-argument version of pg_advisory_xact_lock(), using one of the inputs to identify the type of lock. Rather than maintaining a bunch of lock-type constants somewhere on the client side, I find a useful strategy is to wrap the call for each lock type in its own function, and use that function's oid as the type identifier, e.g.:
CREATE FUNCTION lock_A_for_insert_into_B(a_id int) RETURNS void LANGUAGE sql AS $$
SELECT pg_advisory_xact_lock('lock_A_for_insert_into_B(int)'::regprocedure::int, a_id)
$$;
If I understand your dilemma, try executing inside a BEGIN...COMMIT block. For most operations, this takes the place of a lock. If the instructions fail, the db is unchanged. It is particularly useful for operations where multiple tables must change simultaneously.
You have a condition that will block? That's not how databases work. You don't do anything; they do it. Why is your app conditionally doing anything? The database ensures integrity; it'll be fine. A centralized hash of locks? I'm not sure what you're doing, but you're so far down the wrong rabbit hole that it's going to take a lot of cleverness to get you out.
You've got to backtrack. Fast.
CREATE TEMP TABLE a ( id_a int PRIMARY KEY, name text );
CREATE TEMP TABLE b ( id_b int PRIMARY KEY, id_a int REFERENCES a, val int );
WITH ti AS (
INSERT INTO a (id_a, name) VALUES (2,'foo')
RETURNING id_a
)
INSERT INTO b (id_b,id_a,val)
SELECT 1,ti.id_a,42
FROM ti;
Result:
TABLE a;
id_a | name
------+------
2 | foo
(1 row)
TABLE b;
id_b | id_a | val
------+------+-----
1 | 2 | 42
Related
We have 2 tables defined as follows
CREATE TABLE foo (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL UNIQUE
);
CREATE TABLE bar (
foo_id BIGINT UNIQUE,
foo_name TEXT NOT NULL UNIQUE REFERENCES foo (name)
);
I've noticed that when executing the following two queries concurrently
INSERT INTO foo (name) VALUES ('BAZ')
INSERT INTO bar (foo_name, foo_id) VALUES ('BAZ', (SELECT id FROM foo WHERE name = 'BAZ'))
it is possible under certain circumstances to end up inserting a row into bar where foo_id is NULL. The two queries are executed in different transactions, by two completely different processes.
How is this possible? I'd expect the second statement to either fail due to a foreign key violation (if the record in foo is not there), or succeed with a non-null value of foo_id (if it is).
What is causing this race condition? Is it due to the subselect, or is it due to the timing of when the foreign key constraint is checked?
We are using isolation level "read committed" and postgres version 10.3.
EDIT
I think the question was not particularly clear on what is confusing me. The question is about how and why two different states of the database are being observed during the execution of a single statement. The subselect observes the record in foo as absent, whereas the FK check sees it as present. If it's just that there's no rule preventing this race condition, then that is an interesting question in itself: why would it not be possible to use transaction IDs to ensure that the same state of the database is observed for both?
The subselect in the INSERT INTO bar cannot see the new row concurrently inserted in foo because the latter is not committed yet.
But by the time that the query that checks the foreign key constraint is executed, the INSERT INTO foo has committed, so the foreign key constraint doesn't report an error.
A simple way to work around that is to use the REPEATABLE READ isolation level for the INSERT INTO bar. Then the foreign key check uses the same snapshot as the INSERT; it won't see the newly committed row, and a constraint violation error will be thrown.
Logic suggests that the ordering of the commands (including the sub-query), combined with when Postgres checks the constraints (which is not necessarily immediate), could cause the issue. The sequence could be:
Have the second command start first
Have the SELECT component run and return NULL
First command starts and inserts row
Second command inserts the row (with the 'name' field and a NULL)
FK reference check is successful as 'name' exists
Regarding deferrable constraints, see https://www.postgresql.org/docs/13/sql-set-constraints.html and https://begriffs.com/posts/2017-08-27-deferrable-sql-constraints.html
Suggested answers
Have a NOT NULL constraint on bar.foo_id, or include foo_id as part of the foreign key
Rewrite the two commands to run consecutively rather than simultaneously (if possible)
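The deferrable-constraint idea mentioned above can be demonstrated portably. This Python sqlite3 sketch (schema simplified from the question) inserts the child row before its parent; with the check deferred, the FK is only validated at commit:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute("""CREATE TABLE bar (
    foo_id INTEGER REFERENCES foo(id) DEFERRABLE INITIALLY DEFERRED)""")

conn.execute("BEGIN")
# Child row first: with a deferred constraint this does NOT fail yet.
conn.execute("INSERT INTO bar (foo_id) VALUES (1)")
# Parent row arrives before COMMIT, so the check passes at commit time.
conn.execute("INSERT INTO foo (id, name) VALUES (1, 'BAZ')")
conn.execute("COMMIT")

print(conn.execute("SELECT foo_id FROM bar").fetchone())  # (1,)
```

Had the transaction committed without the parent row, the commit itself would have raised the FK violation; the same deferral semantics apply in Postgres with SET CONSTRAINTS.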
You do indeed have a race condition. Without some sort of locking or use of a transaction to sequence the events, there is no rule precluding the sequence
The sub select of the bar INSERT is performed, yielding NULL
The INSERT into foo
The INSERT into bar, which now does not have any FK violation, but does have a NULL.
Since of course this is the toy version of your real program, I can't recommend how best to fix it. If it makes sense to require these events in a particular sequence, then they can be in a transaction on a single thread. In some other situation, you might prohibit inserting directly into foo and bar (REVOKE permissions as necessary) and allow modifications only through a function/procedure, or through a view that has triggers (possibly rules).
An anonymous plpgsql block will help you avoid the race conditions (by making sure that the inserts run sequentially within the same transaction) without going deep into Postgres internals:
do language plpgsql
$$
declare
v_foo_id bigint;
begin
INSERT into foo (name) values ('BAZ') RETURNING id into v_foo_id;
INSERT into bar (foo_name, foo_id) values ('BAZ', v_foo_id);
end;
$$;
or using plain SQL with a CTE in order to avoid switching context to/from plpgsql:
with t(id) as
(
INSERT into foo (name) values ('BAZ') RETURNING id
)
INSERT into bar (foo_name, foo_id) values ('BAZ', (select id from t));
And, by the way, are you sure that the two inserts in your example are executed in the same transaction and in the right order? If not, then the short answer to your question is "MVCC", since the second statement is not atomic.
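The same sequencing can also be done client-side inside one transaction. A sketch using Python's sqlite3 (schema adapted from the question, with `lastrowid` standing in for `RETURNING id`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE foo (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL UNIQUE)""")
conn.execute("""CREATE TABLE bar (
    foo_id INTEGER UNIQUE,
    foo_name TEXT NOT NULL UNIQUE REFERENCES foo(name))""")

with conn:  # one transaction: both rows commit together or not at all
    cur = conn.execute("INSERT INTO foo (name) VALUES ('BAZ')")
    foo_id = cur.lastrowid  # id generated by the first insert
    conn.execute("INSERT INTO bar (foo_name, foo_id) VALUES (?, ?)",
                 ("BAZ", foo_id))

print(conn.execute("SELECT foo_id FROM bar").fetchone())  # (1,)
```

Because the id is captured from the insert itself rather than re-queried, there is no window in which another session's state can leak in.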
This seems more likely to be a scenario where both queries execute one after another but the first transaction is not yet committed.
Process 1
INSERT INTO foo (name) VALUES ('BAZ')
The transaction is not committed, but Process 2 executes the next query
INSERT INTO bar (foo_name, foo_id) VALUES ('BAZ', (SELECT id FROM foo WHERE name = 'BAZ'))
In this case, Process 2's query will wait until Process 1's transaction commits.
From the PostgreSQL docs:
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress).
I'd like to use the following statement to update a column of a single row:
UPDATE Test SET Column1 = Column1 & ~2 WHERE Id = 1
The above seems to work. Is this safe in SQL Server? I remember reading about possible deadlocks when using similar statements in a non-SQL Server DBMS (I think it was PostgreSQL).
Example of a table and corresponding stored procs:
CREATE TABLE Test (Id int IDENTITY(1,1) NOT NULL, Column1 int NOT NULL, CONSTRAINT PK_Test PRIMARY KEY (Id ASC))
GO
INSERT INTO Test (Column1) Values(255)
GO
-- this will always affect a single row only
UPDATE Test SET Column1 = Column1 & ~2 WHERE Id = 1
For the table structure you have shown both the UPDATE and the SELECT are standalone transactions and can use clustered index seeks to do their work without needing to read unnecessary rows and take unnecessary locks so I would not be particularly concerned about deadlocks with this procedure.
I would be more concerned about the fact that you don't have the UPDATE and SELECT inside the same transaction. So the X lock on the row will be released as soon as the update statement finishes and it will be possible for another transaction to change the column value (or even delete the whole row) before the SELECT is executed.
If you execute both statements inside the same transaction then I still wouldn't be concerned about deadlock potential as the exclusive lock is taken first (it would be a different matter if the SELECT happened before the UPDATE)
You can also address the concurrency issue by getting rid of the SELECT entirely and using the OUTPUT clause to return the post-update value to the client.
UPDATE Test SET Column1 = Column1 & ~2
OUTPUT INSERTED.Column1
WHERE Id = 1
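The update-then-read-inside-one-transaction point can be illustrated portably. A Python sqlite3 sketch (SQLite has no OUTPUT clause, so the SELECT simply runs in the same transaction as the UPDATE):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE Test (Id INTEGER PRIMARY KEY, Column1 INTEGER NOT NULL)")
conn.execute("INSERT INTO Test (Id, Column1) VALUES (1, 255)")

conn.execute("BEGIN")
conn.execute("UPDATE Test SET Column1 = Column1 & ~2 WHERE Id = 1")
# Reading inside the same transaction guarantees we see our own update,
# and no other writer can change the row between the UPDATE and the SELECT.
value = conn.execute("SELECT Column1 FROM Test WHERE Id = 1").fetchone()[0]
conn.execute("COMMIT")
print(value)  # 253
```

255 & ~2 clears bit 1, yielding 253; the point is that the read happens before the transaction releases its lock on the row.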
What do you mean "is it safe"?
Your Id is a unique identifier for each row. I would strongly encourage you to declare it as a primary key; at the very least, you should have an index on the column.
Without an index, you do have a potential issue with performance (and deadlocks) because SQL Server has to scan the entire table. But with the appropriate primary key declaration (or another index), then you are only updating a single row in a single table. If you have no triggers on the table, then there is not much going on that can interfere with other transactions.
I have a table Messages with columns ID (primary key, autoincrement) and Content (text).
I have a table Users with columns username (primary key, text) and Hash.
A message is sent by one sender (a user) to many recipients (users), and a recipient (user) can have many messages.
I created a table Messages_Recipients with two columns: MessageID (referring to the ID column of the Messages table) and Recipient (referring to the username column in the Users table). This table represents the many-to-many relation between recipients and messages.
So, the question I have is this. The ID of a new message will be created after it has been stored in the database. But how can I hold a reference to the MessageRow I just added in order to retrieve this new MessageID? I can always search the database for the last row added, of course, but that could return a different row in a multithreaded environment.
EDIT: As I understand it, for SQLite you can use SELECT last_insert_rowid(). But how do I call this statement from ADO.NET?
My Persistence code (messages and messagesRecipients are DataTables):
public void Persist(Message message)
{
pm_databaseDataSet.MessagesRow messagerow;
messagerow=messages.AddMessagesRow(message.Sender,
message.TimeSent.ToFileTime(),
message.Content,
message.TimeCreated.ToFileTime());
UpdateMessages();
var x = messagerow;//I hoped the messagerow would hold a
//reference to the new row in the Messages table, but it does not.
foreach (var recipient in message.Recipients)
{
var row = messagesRecipients.NewMessages_RecipientsRow();
row.Recipient = recipient;
//row.MessageID= How do I find this??
messagesRecipients.AddMessages_RecipientsRow(row);
UpdateMessagesRecipients();//method not shown
}
}
private void UpdateMessages()
{
messagesAdapter.Update(messages);
messagesAdapter.Fill(messages);
}
One other option is to look at the system table sqlite_sequence. Your SQLite database will have that table automatically if you created any table with an AUTOINCREMENT primary key. That table is how SQLite keeps track of the autoincrement field, so that it won't repeat the primary key even after you delete some rows or after an insert fails (read more about this here: http://www.sqlite.org/autoinc.html).
So with this table there is the added benefit that you can find out your newly inserted item's primary key even after you have inserted something else (in other tables, of course!). After making sure that your insert succeeded (otherwise you will get a false number), you simply need to do:
select seq from sqlite_sequence where name="table_name"
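A minimal sketch of that query from Python's sqlite3 module (the `items` table name is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sqlite_sequence only exists once some table uses AUTOINCREMENT.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
conn.execute("INSERT INTO items (name) VALUES ('a')")
conn.execute("INSERT INTO items (name) VALUES ('b')")

seq = conn.execute(
    "SELECT seq FROM sqlite_sequence WHERE name = ?", ("items",)
).fetchone()[0]
print(seq)  # 2  -- the highest autoincrement id handed out so far
```

Note the caveat from the answer still applies: seq reflects the table's counter, not necessarily *your* insert, if other sessions insert into the same table in between.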
With SQL Server you'd SELECT SCOPE_IDENTITY() to get the last identity value for the current process.
With SQlite, it looks like for an autoincrement you would do
SELECT last_insert_rowid()
immediately after your insert.
http://www.mail-archive.com/sqlite-users#sqlite.org/msg09429.html
In answer to your comment, to get this value you would want to use code like:
using (SQLiteConnection conn = new SQLiteConnection(connString))
{
string sql = "SELECT last_insert_rowid()";
SQLiteCommand cmd = new SQLiteCommand(sql, conn);
conn.Open();
// last_insert_rowid() returns a 64-bit value
long lastID = (long)cmd.ExecuteScalar();
}
I've had issues with using SELECT last_insert_rowid() in a multithreaded environment. If another thread inserts into another table that has an autoinc, last_insert_rowid will return the autoinc value from the new table.
Here's where they state that in the doco:
If a separate thread performs a new INSERT on the same database connection while the sqlite3_last_insert_rowid() function is running and thus changes the last insert rowid, then the value returned by sqlite3_last_insert_rowid() is unpredictable and might not equal either the old or the new last insert rowid.
That's from sqlite.org doco
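Note the quoted caveat is about threads sharing one database connection; with one connection per thread, each connection tracks its own last insert rowid. A Python sqlite3 sketch (file path and table names are illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
a = sqlite3.connect(path)
b = sqlite3.connect(path)

a.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")
a.execute("CREATE TABLE t2 (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")
a.commit()

cur_a = a.execute("INSERT INTO t1 (v) VALUES ('from a')")
a.commit()
cur_b = b.execute("INSERT INTO t2 (v) VALUES ('from b')")
b.commit()

# Each connection tracks its own last insert rowid; b's insert into t2
# does not disturb what connection a sees.
print(cur_a.lastrowid, cur_b.lastrowid)  # 1 1
print(a.execute("SELECT last_insert_rowid()").fetchone()[0])  # 1
```

So the usual fix for the issue described above is simply to stop sharing a single connection between threads.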
According to Android Sqlite get last insert row id there is another query:
SELECT rowid from your_table_name order by ROWID DESC limit 1
Sample code from #polyglot's solution:
SQLiteCommand sql_cmd = conn.CreateCommand(); // conn is an open SQLiteConnection
sql_cmd.CommandText = "select seq from sqlite_sequence where name='myTable';";
int newId = Convert.ToInt32(sql_cmd.ExecuteScalar());
sqlite3_last_insert_rowid() is unsafe in a multithreaded environment (and is documented as such by SQLite).
However, the good news is that you can play the odds; see below.
ID reservation is NOT implemented in SQLite. You can also avoid the auto-generated PK by using your own UNIQUE primary key, if you know something always distinct in your data.
Note:
See whether the RETURNING clause solves your issue:
https://www.sqlite.org/lang_returning.html
As this is only available in recent versions of SQLite and may have some overhead, you can instead rely on the fact that it's quite unlikely for another insertion to land in between your requests to SQLite.
Also, if you don't absolutely need to fetch SQLite's internal PK, consider designing your own predictable PK:
https://sqlite.org/withoutrowid.html
If you need a traditional AUTOINCREMENT PK, then yes, there is a small risk that the id you fetch belongs to another insertion. Small, but unacceptable.
A workaround is to call sqlite3_last_insert_rowid() twice:
#1 BEFORE the insert, then #2 AFTER the insert,
as in:
sqlite3_int64 IdLast = sqlite3_last_insert_rowid(m_db); // before: this id is already used
const int rc = sqlite3_exec(m_db, sql, NULL, NULL, &m_zErrMsg);
sqlite3_int64 IdEnd = sqlite3_last_insert_rowid(m_db); // after the insertion: most probably the right one
In the vast majority of cases IdEnd == IdLast + 1. This is the "happy path", and you can rely on IdEnd being the ID you are looking for.
Otherwise you need to do an extra SELECT, using criteria based on the range IdLast to IdEnd (any additional criteria you can add to the WHERE clause will help).
Use ROWID (which is an SQLite keyword) to SELECT the relevant id range.
"SELECT my_pk_id FROM Symbols WHERE ROWID > %lld AND ROWID <= %lld;", IdLast, IdEnd
// notice the > in ROWID > %lld: we already know that IdLast is NOT the one we are looking for.
As the second call to sqlite3_last_insert_rowid() is made right after the INSERT, this SELECT generally returns only 2 or 3 rows at most.
Then search the SELECT results for the data you inserted to find the proper id.
Performance improvement: as the call to sqlite3_last_insert_rowid() is much faster than the INSERT (even if a mutex may occasionally make this wrong, it is statistically true), bet on IdEnd being the right one and walk the SELECT results from the end. In nearly every case we tested, the last row does contain the ID you are looking for.
Performance improvement: if you have an additional UNIQUE key, add it to the WHERE clause to get only one row.
I experimented with 3 threads doing heavy insertions; it worked as expected. The preparation + DB handling take the vast majority of CPU cycles, and the odds of a mixed-up ID were in the range of 1 in 1000 insertions (situations where IdEnd > IdLast + 1).
So the penalty of an additional SELECT to resolve this is rather low.
In other words, sqlite3_last_insert_rowid() is beneficial for the vast majority of insertions and, with some care, can even be used safely in multithreaded code.
Caveat: the situation is slightly more awkward in transactional mode.
Also, SQLite doesn't explicitly guarantee that IDs are contiguous and increasing (unless AUTOINCREMENT is used). At least I didn't find documentation saying so, though a look at the SQLite source code suggests it.
The simplest method would be:
SELECT MAX(id) FROM yourTableName LIMIT 1;
if you are trying to grab this last id in order to affect another table, for example (if an invoice is added, THEN add the ItemsList for that invoice ID),
in that case use something like:
var cmd_result = cmd.ExecuteNonQuery(); // returns the number of affected rows
then use cmd_result to determine whether the previous query executed successfully, something like if (cmd_result > 0), followed by your SELECT MAX(id) FROM yourTableName LIMIT 1; just to make sure that you are not targeting the wrong row id in case the previous command did not add any rows.
In fact, the cmd_result > 0 condition is essential in case anything fails, especially if you are developing a serious application; you don't want your users waking up to find random items added to their invoices.
I recently came up with a solution to this problem that sacrifices some performance overhead to ensure you get the correct last inserted ID.
Let's say you have a table people. Add a column called random_bigint:
create table people (
id int primary key,
name text,
random_bigint int not null
);
Add a unique index on random_bigint:
create unique index people_random_bigint_idx
ON people(random_bigint);
In your application, generate a random bigint whenever you insert a record. There is a tiny possibility that a collision will occur, so you should handle that error.
My app is in Go and the code that generates a random bigint looks like this:
func RandomPositiveBigInt() (int64, error) {
nBig, err := rand.Int(rand.Reader, big.NewInt(9223372036854775807))
if err != nil {
return 0, err
}
return nBig.Int64(), nil
}
After you've inserted the record, query the table with a where filter on the random bigint value:
select id from people where random_bigint = <put random bigint here>
The unique index will add a small amount of overhead on the insertion. The id lookup, while very fast because of the index, will also add a little overhead.
However, this method will guarantee a correct last inserted ID.
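A sketch of the same scheme using Python's sqlite3 (the answer's app is Go; the table and index names follow the answer, and `insert_person` is illustrative):

```python
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE people (
    id INTEGER PRIMARY KEY,
    name TEXT,
    random_bigint INTEGER NOT NULL)""")
conn.execute("CREATE UNIQUE INDEX people_random_bigint_idx ON people(random_bigint)")

def insert_person(name):
    # A random 63-bit tag; the UNIQUE index turns the (tiny) chance of a
    # collision into an sqlite3.IntegrityError the caller can retry on.
    tag = secrets.randbelow(2**63)
    conn.execute("INSERT INTO people (name, random_bigint) VALUES (?, ?)",
                 (name, tag))
    # Look the row back up by its tag instead of trusting last-insert state.
    return conn.execute(
        "SELECT id FROM people WHERE random_bigint = ?", (tag,)
    ).fetchone()[0]

first = insert_person("alice")
second = insert_person("bob")
print(first, second)  # 1 2
```

The tag is private to the inserting session, so the lookup cannot be confused by concurrent inserts, which is the whole point of the scheme.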
I am working with an Oracle 11.2g instance.
I'd like to know what I am exposing myself to by inserting rows into tables while generating the primary key values myself.
I would run SELECT max(pk) FROM sometable;
and then use the next hundred values, for example, for my next 100 inserts.
Is this playing with fire?
The context is: I have a big number of inserts to do, split over several tables linked by foreign keys. I am trying to get good performance, without using PL/SQL.
[EDIT] here a code sample that looks like what I'm dealing with:
QString query1 = "INSERT INTO table1 (pk1_id, val) VALUES (pk1_seq.nextval, ?)";
sqlQuery->prepare(query1);
sqlQuery->addBindValue(vec_of_values);
sqlQuery->execBatch();
QString query2 = "INSERT INTO table2 (pk2_id, another_val, pk1_pk1_id) VALUES (pk2_seq.nextval, ?, ?)";
sqlQuery->prepare(query2);
sqlQuery->addBindValue(vec_of_values);
// How do I get the primary keys (hundreds of them)
// from the first insert??
sqlQuery->addBindValue(vec_of_pk1);
sqlQuery->execBatch();
You are exposing yourself to slower performance, errors in your logic, and extra code to maintain. Oracle sequences are optimized for your specific purpose. For high DML operations you may also cache sequences:
ALTER SEQUENCE customers_seq CACHE 100;
Create a sequence for the master table(s)
Insert into the master table using your_sequence.nextval
Inserts into child (dependent) tables are done using your_sequence.currval
create table parent (id integer primary key not null);
create table child (id integer primary key not null, pid integer not null references parent(id));
create sequence parent_seq;
create sequence child_seq;
insert into parent (id) values (parent_seq.nextval);
insert into child (id, pid) values (child_seq.nextval, parent_seq.currval);
commit;
To explain why max(id) will not work reliably, consider the following scenario:
Transaction 1 retrieves max(id) + 1 (yields, say, 42)
Transaction 1 inserts a new row with id = 42
Transaction 2 retrieves max(id) + 1 (also yields 42, because transaction 1 is not yet committed)
Transaction 1 commits
Transaction 2 inserts a new row with id = 42
Transaction 2 tries to commit and gets a unique key violation
Now think about what happens when you have a lot of transactions doing this. You'll get a lot of errors. Additionally your inserts will be slower and slower, because the cost of calculating max(id) will increase with the size of the table.
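The failing interleaving can be simulated in a single process. This Python sqlite3 sketch (hypothetical table `t`) performs both max(id) + 1 reads before either insert, mirroring the sequence above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO t (id) VALUES (41)")
conn.commit()

# Both "transactions" compute max(id) + 1 from the same committed snapshot:
next_id_txn1 = conn.execute("SELECT max(id) + 1 FROM t").fetchone()[0]
next_id_txn2 = conn.execute("SELECT max(id) + 1 FROM t").fetchone()[0]
assert next_id_txn1 == next_id_txn2 == 42  # both picked the same key

conn.execute("INSERT INTO t (id) VALUES (?)", (next_id_txn1,))  # txn 1 wins
try:
    conn.execute("INSERT INTO t (id) VALUES (?)", (next_id_txn2,))  # txn 2 loses
except sqlite3.IntegrityError as e:
    print("unique key violation:", e)
```

A real sequence (or SQLite's rowid allocation) hands out values atomically, which is exactly what this hand-rolled scheme lacks.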
Sequences are the only sane (i.e. correct, fast and scalable) way out of this problem.
Edit
If you are stuck with yet another ORM which can't cope with this kind of strategy (one supported by nearly all DBMSs nowadays; even SQL Server has sequences now), then you should be able to do the following in your client code:
Retrieve the next PK value using select parent_seq.nextval from dual into a variable in your programming language (this is a fast, scalable and correct way to retrieve the PK value).
If you can run a select max(id) you can also run a select parent_seq.nextval from dual. In both cases just use the value obtained from that select statement.
We have some persistent data in an application that is queried from a server and then stored in a database so we can keep track of additional information. Because we do not want to re-query while an object is in use in memory, we do a SELECT FOR UPDATE so that other threads wanting the same data are blocked.
I am not sure how select for update handles non-existing rows. If the row does not exist and another thread tries to do another select for update on the same row, will this thread be blocked until the other transaction finishes or will it also get an empty result set? If it does only get an empty result set is there any way to make it block as well, for example by inserting the missing row immediately?
EDIT:
Because there was a remark that we might lock too much, here are some more details on the concrete usage in our case. In reduced pseudocode, our program flow looks like this:
d = queue.fetch();
r = SELECT * FROM table WHERE key = d.key() FOR UPDATE;
if r.empty() then
r = get_data_from_somewhere_else();
new_r = process_stuff( r );
if Data was present then
update row to new_r
else
insert new_r
This code is run in multiple threads, and the data fetched from the queue might concern the same row in the database (hence the lock). If multiple threads are using data that needs the same row, then these threads need to be sequentialized (the order does not matter). However, this sequentialization fails if the row is not present, because we do not get a lock.
EDIT:
For now I have the following solution, which seems like an ugly hack to me.
select the data for update
if zero rows match then
insert some dummy data // this will block if multiple transactions try to insert
if insertion failed then
// somebody beat us at the race
select the data for update
do processing
if data was changed then
update the old or dummy data
else
rollback the whole transaction
However, I am neither 100% sure that this actually solves the problem, nor does the solution seem like good style. So if anybody has something more usable to offer, that would be great.
I am not sure how select for update handles non-existing rows.
It doesn't.
The best you can do is to use an advisory lock if you know something unique about the new row. (Use hashtext() if needed, and the table's oid to lock it.)
The next best thing is a table lock.
That being said, your question makes it sound like you're locking way more than you should. Only lock rows when you actually need to, i.e. write operations.
Example solution (I haven't found better :/)
Thread A:
BEGIN;
SELECT pg_advisory_xact_lock(42); -- database semaphore arbitrary ID
SELECT * FROM t WHERE id = 1;
DELETE FROM t WHERE id = 1;
INSERT INTO t (id, value) VALUES (1, 'thread A');
SELECT 1 FROM pg_sleep(10); -- only for race condition simulation
COMMIT;
Thread B:
BEGIN;
SELECT pg_advisory_xact_lock(42); -- database semaphore arbitrary ID
SELECT * FROM t WHERE id = 1;
DELETE FROM t WHERE id = 1;
INSERT INTO t (id, value) VALUES (1, 'thread B');
SELECT 1 FROM pg_sleep(10); -- only for race condition simulation
COMMIT;
This always forces the transactions to execute in the correct order.
Looking at the code added in the second edit, it looks right.
As for it looking like a hack, there's a couple options - basically it's all about moving the database logic to the database.
One is simply to put the whole select for update, if not exist then insert logic in a function, and do select get_object(key1,key2,etc) instead.
Alternatively, you could make an insert trigger that will ignore attempts to add an entry if it already exists, and simply do an insert before you do the select for update. This does have more potential to interfere with other code already in place, though.
(If I remember to, I'll edit and add example code later on when I'm in a position to check what I'm doing.)
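The insert-before-locking idea can be sketched with SQLite's INSERT OR IGNORE (the Postgres analogue would be INSERT ... ON CONFLICT DO NOTHING); `get_object` and the schema here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (key TEXT PRIMARY KEY, payload TEXT)")

def get_object(key):
    # Ensure the row exists first; if another session already inserted it,
    # OR IGNORE makes this a no-op instead of an error.
    conn.execute(
        "INSERT OR IGNORE INTO obj (key, payload) VALUES (?, 'placeholder')",
        (key,))
    # Now the row is guaranteed to exist, so a SELECT (or, in Postgres,
    # SELECT ... FOR UPDATE) always has something to lock.
    return conn.execute(
        "SELECT key, payload FROM obj WHERE key = ?", (key,)
    ).fetchone()

print(get_object("k1"))  # ('k1', 'placeholder')
conn.execute("UPDATE obj SET payload = 'real data' WHERE key = 'k1'")
print(get_object("k1"))  # ('k1', 'real data')
```

This removes the "lock a row that doesn't exist yet" problem: the row is created (or found) before any lock is attempted, so concurrent callers serialize on it.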