I am working with an Oracle 11g Release 2 (11.2) instance.
I'd like to know what I am exposing myself to by generating the primary key values myself when inserting rows into tables.
I would run SELECT max(pk) FROM sometables;
and then use, say, the next hundred values for my next 100 inserts.
Is this playing with fire?
The context is: I have a large number of inserts to do, split over several tables linked by foreign keys. I am trying to get good performance without using PL/SQL.
[EDIT] here is a code sample that looks like what I'm dealing with:
QString query1 = "INSERT INTO table1 (pk1_id, val) VALUES (pk1_seq.nextval, ?)";
sqlQuery->prepare(query1);
sqlQuery->addBindValue(vec_of_values);
sqlQuery->execBatch();
QString query2 = "INSERT INTO table2 (pk2_id, another_val, pk1_pk1_id) VALUES (pk2_seq.nextval, ?, ?)";
sqlQuery->prepare(query2);
sqlQuery->addBindValue(vec_of_values);
// How do I get the primary keys (hundreds of them)
// from the first insert??
sqlQuery->addBindValue(vec_of_pk1);
sqlQuery->execBatch();
You are exposing yourself to slower performance, errors in your logic, and extra code to maintain. Oracle sequences are optimized for exactly this purpose. For workloads with heavy DML you can also increase the sequence cache:
ALTER SEQUENCE customers_seq CACHE 100;
Create a sequence for the master table(s)
Insert into the master table using your_sequence.nextval
Inserts into child (dependent) tables are done using your_sequence.currval
create table parent (id integer primary key not null);
create table child (id integer primary key not null, pid integer not null references parent(id));
create sequence parent_seq;
create sequence child_seq;
insert into parent (id) values (parent_seq.nextval);
insert into child (id, pid) values (child_seq.nextval, parent_seq.currval);
commit;
To explain why max(id) will not work reliably, consider the following scenario:
Transaction 1 retrieves max(id) + 1 (yields, say 42)
Transaction 1 inserts a new row with id = 42
Transaction 2 retrieves max(id) + 1 (also yields 42, because transaction 1 is not yet committed)
Transaction 1 commits
Transaction 2 inserts a new row with id = 42
Transaction 2 tries to commit and gets a unique key violation
Now think about what happens when you have a lot of transactions doing this. You'll get a lot of errors. Additionally, your inserts will become slower and slower, because the cost of calculating max(id) grows with the size of the table.
Sequences are the only sane (i.e. correct, fast and scalable) way out of this problem.
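To make the scenario concrete, here is a minimal two-session sketch using the parent table defined above:

-- Session 1
SELECT MAX(id) + 1 FROM parent;        -- returns, say, 42
INSERT INTO parent (id) VALUES (42);   -- not committed yet

-- Session 2
SELECT MAX(id) + 1 FROM parent;        -- also returns 42 (session 1 is uncommitted)
INSERT INTO parent (id) VALUES (42);   -- blocks behind session 1's uncommitted row

-- Session 1
COMMIT;                                -- session 2 now fails with ORA-00001 (unique constraint violated)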
Edit
If you are stuck with yet another ORM which can't cope with this kind of strategy (which is supported by nearly all DBMS nowadays; even SQL Server has sequences now), then you should be able to do the following in your client code:
Retrieve the next PK value using select parent_seq.nextval from dual into a variable in your programming language (this is a fast, scalable and correct way to retrieve the PK value).
If you can run a select max(id) you can also run a select parent_seq.nextval from dual. In both cases just use the value obtained from that select statement.
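If you need a whole block of key values at once for a batch insert (as in the Qt code above), a widely used Oracle trick is to pull them in a single round trip. This is a sketch of that idiom rather than documented behaviour, so verify it on your version:

-- fetch 100 sequence values in one query (Oracle-specific idiom)
SELECT parent_seq.nextval FROM dual CONNECT BY level <= 100;

You can then bind those values explicitly into both the parent and the child batch inserts, which answers the "how do I get the primary keys from the first insert" question without any max(pk) arithmetic.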
Related
We have 2 tables defined as follows
CREATE TABLE foo (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL UNIQUE
);
CREATE TABLE bar (
foo_id BIGINT UNIQUE,
foo_name TEXT NOT NULL UNIQUE REFERENCES foo (name)
);
I've noticed that when executing the following two queries concurrently
INSERT INTO foo (name) VALUES ('BAZ')
INSERT INTO bar (foo_name, foo_id) VALUES ('BAZ', (SELECT id FROM foo WHERE name = 'BAZ'))
it is possible under certain circumstances to end up inserting a row into bar where foo_id is NULL. The two queries are executed in different transactions, by two completely different processes.
How is this possible? I'd expect the second statement to either fail due to a foreign key violation (if the record in foo is not there), or succeed with a non-null value of foo_id (if it is).
What is causing this race condition? Is it due to the subselect, or is it due to the timing of when the foreign key constraint is checked?
We are using isolation level "read committed" and postgres version 10.3.
EDIT
I think the question was not particularly clear on what is confusing me. The question is about how and why two different states of the database were observed during the execution of a single statement. The subselect observes the record in foo as absent, whereas the FK check sees it as present. If it's just that there's no rule preventing this race condition, then that is an interesting question in itself: why would it not be possible to use transaction ids to ensure that the same state of the database is observed for both?
The subselect in the INSERT INTO bar cannot see the new row concurrently inserted in foo because the latter is not committed yet.
But by the time that the query that checks the foreign key constraint is executed, the INSERT INTO foo has committed, so the foreign key constraint doesn't report an error.
A simple way to work around that is to use the REPEATABLE READ isolation level for the INSERT INTO bar. Then the foreign key check uses the same snapshot as the INSERT, it won't see the newly committed row, and a constraint violation error will be thrown.
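A minimal sketch of that workaround, using the tables from the question:

-- session performing the second insert
BEGIN ISOLATION LEVEL REPEATABLE READ;
INSERT INTO bar (foo_name, foo_id)
VALUES ('BAZ', (SELECT id FROM foo WHERE name = 'BAZ'));
COMMIT;
-- if the 'BAZ' row in foo is not visible in this snapshot, the statement now
-- raises a foreign key violation instead of silently inserting NULL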
Logic suggests that the ordering of the commands (including the sub-query), combined with when Postgres checks constraints (which is not necessarily immediate), could cause the issue. Therefore the following sequence is possible:
The second command starts first
Its SELECT component runs and returns NULL
The first command starts and inserts its row
The second command inserts its row (with the 'name' field and a NULL)
The FK reference check succeeds because 'name' now exists
Regarding deferrable constraints, see https://www.postgresql.org/docs/13/sql-set-constraints.html and https://begriffs.com/posts/2017-08-27-deferrable-sql-constraints.html
Suggested answers
Add a NOT NULL constraint on bar.foo_id, or include it as part of the foreign key check (see the sketch after this list)
Rewrite the two commands to run consecutively rather than simultaneously (if possible)
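A sketch of the first suggestion (PostgreSQL):

ALTER TABLE bar ALTER COLUMN foo_id SET NOT NULL;
-- the racing insert then fails with a not-null violation instead of storing NULL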
You do indeed have a race condition. Without some sort of locking or use of a transaction to sequence the events, there is no rule precluding the following sequence:
The sub select of the bar INSERT is performed, yielding NULL
The INSERT into foo
The INSERT into bar, which now does not have any FK violation, but does have a NULL.
Since of course this is the toy version of your real program, I can't recommend how best to fix it. If it makes sense to require these events in a particular sequence, then they can be in a transaction on a single thread. In some other situation, you might prohibit inserting directly into foo and bar (REVOKE permissions as necessary) and allow modifications only through a function/procedure, or through a view that has triggers (possibly rules).
An anonymous plpgsql block will help you avoid the race conditions (by making sure that the inserts run sequentially within the same transaction) without going deep into Postgres internals:
do language plpgsql
$$
declare
    v_foo_id bigint;
begin
    INSERT into foo (name) values ('BAZ') RETURNING id into v_foo_id;
    INSERT into bar (foo_name, foo_id) values ('BAZ', v_foo_id);
end;
$$;
or using plain SQL with a CTE in order to avoid switching context to/from plpgsql:
with t(id) as
(
INSERT into foo (name) values ('BAZ') RETURNING id
)
INSERT into bar (foo_name, foo_id) values ('BAZ', (select id from t));
And, by the way, are you sure that the two inserts in your example are executed in the same transaction in the right order? If not, then the short answer to your question is "MVCC", since the second statement is not atomic.
This seems more likely to be a scenario where both queries executed one after the other, but the first transaction had not yet committed.
Process 1
INSERT INTO foo (name) VALUES ('BAZ')
Transaction not committed, but Process 2 executes the next query
INSERT INTO bar (foo_name, foo_id) VALUES ('BAZ', (SELECT id FROM foo WHERE name = 'BAZ'))
In this case, the Process 2 query will wait until the Process 1 transaction is committed.
From the PostgreSQL documentation:
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress).
I'd like to use the following statement to update a column of a single row:
UPDATE Test SET Column1 = Column1 & ~2 WHERE Id = 1
The above seems to work. Is this safe in SQL Server? I remember reading about possible deadlocks when using similar statements in a non-SQL Server DBMS (I think it was related to PostgreSQL).
Example of a table and corresponding stored procs:
CREATE TABLE Test (Id int IDENTITY(1,1) NOT NULL, Column1 int NOT NULL, CONSTRAINT PK_Test PRIMARY KEY (Id ASC))
GO
INSERT INTO Test (Column1) Values(255)
GO
-- this will always affect a single row only
UPDATE Test SET Column1 = Column1 & ~2 WHERE Id = 1
For the table structure you have shown, both the UPDATE and the SELECT are standalone transactions and can use clustered index seeks to do their work without needing to read unnecessary rows and take unnecessary locks, so I would not be particularly concerned about deadlocks with this procedure.
I would be more concerned about the fact that you don't have the UPDATE and SELECT inside the same transaction. The X lock on the row will be released as soon as the update statement finishes, and it will be possible for another transaction to change the column value (or even delete the whole row) before the SELECT is executed.
If you execute both statements inside the same transaction, then I still wouldn't be concerned about deadlock potential, as the exclusive lock is taken first (it would be a different matter if the SELECT happened before the UPDATE).
You can also address the concurrency issue by getting rid of the SELECT entirely and using the OUTPUT clause to return the post-update value to the client.
UPDATE Test SET Column1 = Column1 & ~2
OUTPUT INSERTED.Column1
WHERE Id = 1
What do you mean "is it safe"?
Your id is a unique identifier for each row. I would strongly encourage you to declare it as a primary key. At the very least, you should have an index on the column.
Without an index, you do have a potential issue with performance (and deadlocks) because SQL Server has to scan the entire table. But with the appropriate primary key declaration (or another index), then you are only updating a single row in a single table. If you have no triggers on the table, then there is not much going on that can interfere with other transactions.
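A sketch of what that would look like if the table did not already declare a key (the DDL in the question does define PK_Test; the names below are hypothetical):

-- only needed if Id is not already a primary key
ALTER TABLE Test ADD CONSTRAINT PK_Test_Id PRIMARY KEY (Id);
-- or, at minimum, an index so that WHERE Id = 1 can seek instead of scanning the table
CREATE UNIQUE INDEX IX_Test_Id ON Test (Id);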
Assuming:
I am using REPEATABLE_READ or SERIALIZABLE transaction isolation (locks get retained every time I access a row)
We are talking about multiple threads accessing multiple tables simultaneously.
I have the following questions:
Is it possible for an INSERT operation to cause a deadlock? If so, please provide a detailed scenario demonstrating how a deadlock may occur (e.g. Thread 1 does this, Thread 2 does that, ..., deadlock).
For bonus points: answer the same question for all other operations (e.g. SELECT, UPDATE, DELETE).
UPDATE:
3. For super bonus points: how can I avoid a deadlock in the following scenario?
Given tables:
permissions[id BIGINT PRIMARY KEY]
companies[id BIGINT PRIMARY KEY, name VARCHAR(30), permission_id BIGINT NOT NULL, FOREIGN KEY (permission_id) REFERENCES permissions(id)]
I create a new Company as follows:
INSERT INTO permissions; -- Inserts permissions.id = 100
INSERT INTO companies (name, permission_id) VALUES ('Nintendo', 100); -- Inserts companies.id = 200
I delete a Company as follows:
SELECT permission_id FROM companies WHERE id = 200; -- returns permission_id = 100
DELETE FROM companies WHERE id = 200;
DELETE FROM permissions WHERE id = 100;
In the above example, the INSERT locking order is [permissions, companies] whereas the DELETE locking order is [companies, permissions]. Is there a way to fix this example for REPEATABLE_READ or SERIALIZABLE isolation?
Generally, all modifications can cause a deadlock and selects will not (more on that later). So:
No, you cannot ignore these.
You can somewhat ignore SELECT depending on your database and settings, but the others will give you deadlocks.
You don't even need multiple tables.
The best way to create a deadlock is to do the same thing in a different order.
SQL Server examples:
create table A
(
PK int primary key
)
Session 1:
begin transaction
insert into A values(1)
Session 2:
begin transaction
insert into A values(7)
Session 1:
delete from A where PK=7
Session 2:
delete from A where PK=1
You will get a deadlock. So that proved inserts & deletes can deadlock.
Updates are similar:
Session 1:
begin transaction
insert into A values(1)
insert into A values(2)
commit
begin transaction
update A set PK=7 where PK=1
Session 2:
begin transaction
update A set pk=9 where pk=2
update A set pk=8 where pk=1
Session 1:
update A set pk=9 where pk=2
Deadlock!
SELECT should never deadlock, but on some databases it will, because the locks it uses interfere with consistent reads. That's just crappy database engine design, though.
SQL Server will not lock on a SELECT if you use SNAPSHOT isolation. Oracle and, I think, Postgres will never lock on a SELECT (unless you use FOR UPDATE, which is explicitly reserving the rows for an update anyway).
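A minimal SQL Server sketch of reading under SNAPSHOT isolation (the database name is hypothetical):

ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;

-- in the reading session:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT * FROM A;   -- reads row versions, takes no shared locks
COMMIT;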
So basically I think you have a few incorrect assumptions. I think I've proved:
Updates can cause deadlocks
Deletes can cause deadlocks
Inserts can cause deadlocks
You do not need more than one table
You do need more than one session
You'll just have to take my word on SELECT ;) but it will depend on your DB and settings.
In addition to LoztInSpace's answer, inserts may cause deadlocks even without any deletes or updates present. All you need is a unique index and a reversed order of operations.
Example in Oracle :
create table t1 (id number);
create unique index t1_pk on t1 (id);
--thread 1 :
insert into t1 values(1);
--thread 2
insert into t1 values(2);
--thread 1 :
insert into t1 values(2);
--thread 2
insert into t1 values(1); -- deadlock !
Let us assume you have two relations A and B and two users X and Y. Table A is write-locked by user X and table B is write-locked by user Y. Then the following query will give you a deadlock if it is used by both users X and Y.
Select * from A,B
So clearly a SELECT operation can cause a deadlock if join operations involving more than one table are part of it. Usually INSERT and DELETE operations involve a single relation, so they may not cause a deadlock.
I need to periodically update a local cache with new additions to some DB table. The table rows contain an auto-increment sequential number (SN) field. The cache keeps this number too, so basically I just need to fetch all rows with SN larger than the highest I already have.
SELECT * FROM table where SN > <max_cached_SN>
However, the majority of the attempts will bring no data (I just need to make sure that I have an absolutely up-to-date local copy). So I wonder if this will be more efficient:
count = SELECT count(*) from table;
if (count > <cache_size>)
// fetch new rows as above
I suppose that selecting by an indexed numeric field is quite efficient, so I wonder whether using count has any benefit. On the other hand, this test/update will be done quite frequently and by many clients, so there is a motivation to optimize it.
this test/update will be done quite frequently and by many clients
this could lead to unexpected races competing to regenerate the cache
I would suggest:
upon a new addition to your table, add the newest id into a queue table
use something like crontab to trigger the cache generation by checking the queue table
once a new cache is generated, delete the id from the queue table
As you stress that the majority of the attempts will bring no data, the above will only trigger work when there is a new addition,
and the queue table concept can even be extended to cover updates and deletes.
I believe that
SELECT * FROM table where SN > <max_cached_SN>
will be faster, because SELECT COUNT(*) may require a table scan. Just for clarification: do you never delete rows from this table?
SELECT COUNT(*) may involve a scan (even a full scan), while SELECT ... WHERE SN > constant can effectively use an index on SN, and looking at very few index nodes may suffice. Don't count items if you don't need the exact total; it's expensive.
You don't need to use SELECT COUNT(*)
There are two solutions:
You can use a small helper table with one field that contains the last count of your table, create an AFTER INSERT trigger on your table, and increment that field in the trigger.
Or you can use a small helper table with one field that contains the last SN of your table that has been cached, create an AFTER INSERT trigger on your table, and update that field in the trigger (see the sketch below).
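A sketch of the second option, assuming MySQL (as in the answer below) and hypothetical table/column names:

CREATE TABLE cache_state (last_sn BIGINT NOT NULL);
INSERT INTO cache_state (last_sn) VALUES (0);

CREATE TRIGGER trg_track_last_sn AFTER INSERT ON source_table
FOR EACH ROW UPDATE cache_state SET last_sn = NEW.sn;

-- clients then compare their cached SN against this one-row table
-- instead of scanning or counting the big table:
SELECT last_sn FROM cache_state;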
not much to this really
drop table if exists foo;
create table foo
(
foo_id int unsigned not null auto_increment primary key
)
engine=innodb;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;
I perform an insert as follows:
INSERT INTO foo (a,b,c)
SELECT x,y,z
FROM fubar
WHERE ...
However, if some of the rows that are being inserted violate the duplicate key index on foo, I want the database to ignore those rows, and not insert them and continue inserting the other rows.
The DB in question is Informix 11.5. Currently all that happens is that the DB is throwing an exception. If I try to handle the exception with:
ON EXCEPTION IN (-239)
END EXCEPTION WITH RESUME;
... it does not help because after the exception is caught, the entire insert is skipped.
I don't think Informix supports INSERT IGNORE, or INSERT ... ON DUPLICATE KEY..., but feel free to correct me if I am wrong.
Use an IF statement and the EXISTS function to check for existing records. Or you can probably include that EXISTS check in the WHERE clause, as below:
INSERT INTO foo (a,b,c)
SELECT x,y,z
FROM fubar
WHERE (NOT EXISTS(SELECT a FROM foo WHERE ...))
If you want to know all about all the errors (typically as a result of a data-loading operation), consider using violations tables.
START VIOLATIONS TABLE FOR foo;
This will create a pair of tables foo_vio and foo_dia to contain information about rows that violate the integrity constraints on the table.
When you've had enough, you use:
STOP VIOLATIONS TABLE FOR foo;
You can clean up the diagnostic tables at your leisure. There are bells and whistles on the command to control which table is used, etc. (I should perhaps note that this assumes you are using IDS (IBM Informix Dynamic Server) and not, say, Informix SE or Informix OnLine.)
Violations tables are a heavy-duty option - suitable for loads and the like. They are not ordinarily used to protect run-of-the-mill SQL. For that, the protected INSERT (with SELECT and WHERE NOT EXISTS) is fairly effective - it requires the data to be in a table already, but temp tables are easy to create.
There are a couple of other options to consider.
From version 11.50 onwards, Informix supports the MERGE statement. This could be used to insert rows from fubar where the corresponding row in foo does not exist, and to update the rows in foo with the values from fubar where the corresponding row already exists in foo (the duplicate key problem).
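A sketch of what that MERGE could look like, assuming fubar.x maps to the key column foo.a (the join condition is an assumption):

MERGE INTO foo
USING fubar ON (foo.a = fubar.x)
WHEN MATCHED THEN
    UPDATE SET b = fubar.y, c = fubar.z
WHEN NOT MATCHED THEN
    INSERT (a, b, c) VALUES (fubar.x, fubar.y, fubar.z);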
Another way of looking at it is:
SELECT fubar.*
FROM fubar JOIN foo ON fubar.pk = foo.pk
INTO TEMP duplicate_entries;
DELETE FROM fubar WHERE pk IN (SELECT pk FROM duplicate_entries);
INSERT INTO foo SELECT * FROM fubar;
...process duplicate_entries
DROP TABLE duplicate_entries;
This cleans the source table (fubar) of the duplicate entries (assuming it is only the primary key that is duplicated) before trying to insert the data. The duplicate_entries table contains the rows in fubar with the duplicate keys - the ones that need special processing in some shape or form. Or you can simply delete and ignore those rows, though in my experience, that is seldom a good idea.
GROUP BY may be your friend here. To prevent duplicate rows from being inserted, use GROUP BY in your SELECT; this collapses the duplicates into a single row. The only thing I would do is test to see if there are any performance issues. Also, make sure you include all of the columns you want to be unique in the GROUP BY, or you could exclude rows that are not duplicates.
INSERT INTO FOO(Name, Age, Gadget, Price)
select Name, Age, Gadget, Price
from foobar
group by Name, Age, Gadget, Price
Where Name, Age, Gadget, Price form the primary key index (or unique key index).
The other possibility is to write the duplicated rows to an error table without the index and then resolve the duplicates before inserting them into the new table. You just need to add a HAVING COUNT(*) > 1 clause to the above.
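A sketch of that variant (the error table foo_dups is hypothetical and has no unique index):

INSERT INTO foo_dups (Name, Age, Gadget, Price)
SELECT Name, Age, Gadget, Price
FROM foobar
GROUP BY Name, Age, Gadget, Price
HAVING COUNT(*) > 1;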
I don't know about Informix, but with SQL Server, you can create an index, make it unique and then set a property to have it ignore duplicate keys so no error gets thrown on a duplicate. It's just ignored. Perhaps Informix has something similar.
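For reference, the SQL Server option looks roughly like this (index and column names are hypothetical):

CREATE UNIQUE INDEX ux_foo_key ON foo (a) WITH (IGNORE_DUP_KEY = ON);
-- duplicate rows are then discarded with a warning instead of raising an error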