ignore insert of rows that violate duplicate key index

I perform an insert as follows:
INSERT INTO foo (a,b,c)
SELECT x,y,z
FROM fubar
WHERE ...
However, if some of the rows that are being inserted violate the duplicate key index on foo, I want the database to ignore those rows, and not insert them and continue inserting the other rows.
The DB in question is Informix 11.5. Currently all that happens is that the DB is throwing an exception. If I try to handle the exception with:
ON EXCEPTION IN (-239)
END EXCEPTION WITH RESUME;
... it does not help because after the exception is caught, the entire insert is skipped.
I don't think Informix supports INSERT IGNORE, or INSERT ... ON DUPLICATE KEY..., but feel free to correct me if I am wrong.

Use an IF statement with EXISTS to check for existing records. Alternatively, you can put the EXISTS test in the WHERE clause, as below:
INSERT INTO foo (a,b,c)
SELECT x,y,z
FROM fubar
WHERE (NOT EXISTS(SELECT a FROM foo WHERE ...))
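To make the elided WHERE condition concrete, here is a minimal runnable sketch of the guarded insert. It uses SQLite through Python's sqlite3 module purely as a self-contained stand-in; the question is about Informix, whose syntax may differ in details. The choice of a as the unique key and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (a INTEGER PRIMARY KEY, b TEXT, c TEXT);
    CREATE TABLE fubar (x INTEGER, y TEXT, z TEXT);
    INSERT INTO foo VALUES (1, 'old', 'old');
    INSERT INTO fubar VALUES (1, 'dup', 'dup'), (2, 'new', 'new');
""")

# Copy only the rows whose key is not already present in foo,
# so the unique index on foo.a is never violated.
conn.execute("""
    INSERT INTO foo (a, b, c)
    SELECT x, y, z
    FROM fubar
    WHERE NOT EXISTS (SELECT 1 FROM foo WHERE foo.a = fubar.x)
""")
conn.commit()

rows = conn.execute("SELECT a, b, c FROM foo ORDER BY a").fetchall()
print(rows)
```

Note that this guard only compares fubar against rows already in foo; duplicate keys within fubar itself would need separate handling.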

Depending on whether you want to know all about all the errors (typically as a result of a data loading operation), consider using violations tables.
START VIOLATIONS TABLE FOR foo;
This will create a pair of tables foo_vio and foo_dia to contain information about rows that violate the integrity constraints on the table.
When you've had enough, you use:
STOP VIOLATIONS TABLE FOR foo;
You can clean up the diagnostic tables at your leisure. There are bells and whistles on the command to control which table is used, etc. (I should perhaps note that this assumes you are using IDS (IBM Informix Dynamic Server) and not, say, Informix SE or Informix OnLine.)
Violations tables are a heavy-duty option - suitable for loads and the like. They are not ordinarily used to protect run-of-the-mill SQL. For that, the protected INSERT (with SELECT and WHERE NOT EXISTS) is fairly effective - it requires the data to be in a table already, but temp tables are easy to create.

There are a couple of other options to consider.
From version 11.50 onwards, Informix supports the MERGE statement. This could be used to insert rows from fubar where the corresponding row in foo does not exist, and to update the rows in foo with the values from fubar where the corresponding row already exists in foo (the duplicate key problem).
Another way of looking at it is:
SELECT fubar.*
FROM fubar JOIN foo ON fubar.pk = foo.pk
INTO TEMP duplicate_entries;
DELETE FROM fubar WHERE pk IN (SELECT pk FROM duplicate_entries);
INSERT INTO foo SELECT * FROM fubar;
...process duplicate_entries
DROP TABLE duplicate_entries;
This cleans the source table (fubar) of the duplicate entries (assuming it is only the primary key that is duplicated) before trying to insert the data. The duplicate_entries table contains the rows in fubar with the duplicate keys - the ones that need special processing in some shape or form. Or you can simply delete and ignore those rows, though in my experience, that is seldom a good idea.

GROUP BY may be your friend here. To prevent duplicate rows from being inserted, use GROUP BY in your SELECT; this collapses the duplicates into a single row. I would test for performance issues. Also, make sure you include all of the columns that make up the unique key in the GROUP BY, or you could exclude rows that are not duplicates.
INSERT INTO FOO(Name, Age, Gadget, Price)
SELECT Name, Age, Gadget, Price
FROM foobar
GROUP BY Name, Age, Gadget, Price
Where Name, Age, Gadget, Price form the primary key index (or unique key index).
The other possibility is to write the duplicated rows to an error table without the index and then resolve the duplicates before inserting them into the new table. To do that, add a HAVING COUNT(*) > 1 clause to the query above.
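Both variants of the GROUP BY idea can be sketched as follows. This is an illustration in SQLite via Python's sqlite3 module, not Informix; the foo_errors table, its copies column and the sample data are hypothetical names introduced for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foobar (Name TEXT, Age INTEGER, Gadget TEXT, Price REAL);
    CREATE TABLE foo (
        Name TEXT, Age INTEGER, Gadget TEXT, Price REAL,
        PRIMARY KEY (Name, Age, Gadget, Price)
    );
    CREATE TABLE foo_errors (
        Name TEXT, Age INTEGER, Gadget TEXT, Price REAL, copies INTEGER
    );
    INSERT INTO foobar VALUES
        ('Ann', 30, 'phone', 99.0),
        ('Ann', 30, 'phone', 99.0),
        ('Bob', 41, 'lamp', 15.0);
""")

# Collapse exact duplicates into one row per key before inserting.
conn.execute("""
    INSERT INTO foo (Name, Age, Gadget, Price)
    SELECT Name, Age, Gadget, Price
    FROM foobar
    GROUP BY Name, Age, Gadget, Price
""")

# Record which keys occurred more than once, for later review.
conn.execute("""
    INSERT INTO foo_errors (Name, Age, Gadget, Price, copies)
    SELECT Name, Age, Gadget, Price, COUNT(*)
    FROM foobar
    GROUP BY Name, Age, Gadget, Price
    HAVING COUNT(*) > 1
""")

inserted = conn.execute("SELECT Name FROM foo ORDER BY Name").fetchall()
duplicated = conn.execute("SELECT Name, copies FROM foo_errors").fetchall()
print(inserted, duplicated)
```

This only removes duplicates within the source table. If the target may already contain some of the keys, the insert would still need a NOT EXISTS guard against the target.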

I don't know about Informix, but with SQL Server, you can create an index, make it unique and then set a property to have it ignore duplicate keys so no error gets thrown on a duplicate. It's just ignored. Perhaps Informix has something similar.

Related

Postgres race condition involving subselect and foreign key

We have 2 tables defined as follows
CREATE TABLE foo (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL UNIQUE
);
CREATE TABLE bar (
foo_id BIGINT UNIQUE,
foo_name TEXT NOT NULL UNIQUE REFERENCES foo (name)
);
I've noticed that when executing the following two queries concurrently
INSERT INTO foo (name) VALUES ('BAZ')
INSERT INTO bar (foo_name, foo_id) VALUES ('BAZ', (SELECT id FROM foo WHERE name = 'BAZ'))
it is possible under certain circumstances to end up inserting a row into bar where foo_id is NULL. The two queries are executed in different transactions, by two completely different processes.
How is this possible? I'd expect the second statement to either fail due to a foreign key violation (if the record in foo is not there), or succeed with a non-null value of foo_id (if it is).
What is causing this race condition? Is it due to the subselect, or is it due to the timing of when the foreign key constraint is checked?
We are using isolation level "read committed" and postgres version 10.3.
EDIT
I think the question was not particularly clear on what is confusing me. The question is about how and why two different states of the database were observed during the execution of a single statement. The subselect observes the record in foo as absent, whereas the FK check sees it as present. If it's just that there's no rule preventing this race condition, then this is an interesting question in itself: why would it not be possible to use transaction ids to ensure that the same state of the database is observed for both?

The subselect in the INSERT INTO bar cannot see the new row concurrently inserted in foo because the latter is not committed yet.
But by the time that the query that checks the foreign key constraint is executed, the INSERT INTO foo has committed, so the foreign key constraint doesn't report an error.
A simple way to work around that is to use the REPEATABLE READ isolation level for the INSERT INTO bar. Then the foreign key check uses the same snapshot as the INSERT, it won't see the newly committed row, and a constraint violation error will be thrown.

Logic suggests that the ordering of the commands (including the sub-query), combined with when Postgres checks constraints (which is not necessarily immediate), could cause the issue. You could therefore have this sequence:
1. The second command starts first.
2. Its SELECT component runs and returns NULL.
3. The first command starts and inserts its row.
4. The second command inserts its row (with the 'name' field and a NULL).
5. The FK reference check succeeds because 'name' exists.
Re deferrable constraints see https://www.postgresql.org/docs/13/sql-set-constraints.html and https://begriffs.com/posts/2017-08-27-deferrable-sql-constraints.html
Suggested answers
Have a not null check on BAR for Foo_Id, or included as part of foreign key checks
Rewrite the two commands to run consecutively rather than simultaneously (if possible)

You do indeed have a race condition. Without some sort of locking or use of a transaction to sequence the events, there is no rule precluding the following sequence:
1. The sub-select of the bar INSERT is performed, yielding NULL.
2. The INSERT into foo is performed.
3. The INSERT into bar is performed, which now has no FK violation but does have a NULL.
Since of course this is the toy version of your real program, I can't recommend how best to fix it. If it makes sense to require these events in a particular sequence, then they can be in a transaction on a single thread. In some other situation, you might prohibit inserting directly into foo and bar (REVOKE permissions as necessary) and allow modifications only through a function/procedure, or through a view that has triggers (possibly rules).

An anonymous plpgsql block will help you avoid the race conditions (by making sure that the inserts run sequentially within the same transaction) without going deep into Postgres internals:
do language plpgsql
$$
declare
v_foo_id bigint;
begin
INSERT into foo (name) values ('BAZ') RETURNING id into v_foo_id;
INSERT into bar (foo_name, foo_id) values ('BAZ', v_foo_id);
end;
$$;
or using plain SQL with a CTE in order to avoid switching context to/from plpgsql:
with t(id) as
(
INSERT into foo (name) values ('BAZ') RETURNING id
)
INSERT into bar (foo_name, foo_id) values ('BAZ', (select id from t));
And, btw, are you sure that the two inserts in your example are executed in the same transaction in the right order? If not then the short answer to your question is "MVCC" since the second statement is not atomic.

This seems more likely to be a scenario where both queries were executed one after another, but the first transaction had not yet committed.
Process 1
INSERT INTO foo (name) VALUES ('BAZ')
The transaction is not yet committed, but Process 2 executes the next query
INSERT INTO bar (foo_name, foo_id) VALUES ('BAZ', (SELECT id FROM foo WHERE name = 'BAZ'))
In this case, the process 2 query will wait until the process 1 transaction is committed.
From PostgreSQL doc :
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress).

Bulk Insert with Table Valued Parameter with duplicate rows

Need to insert multiple records into a SQL table. If there are duplicates (already inserted records) then I want to ignore them.
For sending multiple records from my code to SQL, I am using table valued parameter.
I was looking at two options.
Option 1: Make a get call to SQL table and check if there are duplicates and return the duplicate row key. Perform multiple insert with table valued parameter only for those not existing row keys into SQL table.
Option 2: Use table valued parameter and call bulk insert. In the SQL do the duplicate detection and ignore the duplicate rows.
The SQL that was implemented is as follows:
#tvpNewFMdata is the table valued parameter.
INSERT INTO
[dbo].[FMData]
(
[Id],
[Name],
[Path],
[CreatedDate],
[ModifiedDate]
)
SELECT
fm.Id, fm.Name, fm.Path, GETUTCDATE(), GETUTCDATE()
FROM
#tvpNewFMdata AS fm
WHERE
fm.Id NOT IN
(
SELECT
[Id]
FROM
[dbo].[FMData]
)
In the SQL approach, I do a select first to check whether the row exists, and only if it does not exist do I insert it.
I want a better perspective on which approach performs better. I would also like to understand whether the above query is well optimized.

Your code looks fine, although I might make some suggestions.
First, use default values for CreatedDate and ModifiedDate. That way, you don't need to set the values every time a row is inserted.
Second, I'm not a fan of NOT IN, preferring NOT EXISTS instead. I prefer NOT EXISTS because it works more intuitively when the subquery returns NULL values. However, I am guessing that Id is a primary key in FMData, so it could never be NULL.
Third, Id should have an index . . . which it would have as a primary key.
Fourth, the code is not thread safe, meaning that running the same code twice at the same time could generate an error. I'm guessing this is not a problem for this code, but if it is, you can investigate table locking hints.
Except for the presence of an index on Id, none of these comments address performance. Your code should be fine from a performance perspective.
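The NULL behaviour behind the second point can be seen in a small sketch. It uses SQLite via Python's sqlite3 module rather than SQL Server, and the tables existing and incoming are hypothetical, with a deliberately nullable Id column; the three-valued logic shown is standard SQL behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE existing (Id INTEGER);      -- nullable on purpose
    CREATE TABLE incoming (Id INTEGER);
    INSERT INTO existing VALUES (1), (NULL);
    INSERT INTO incoming VALUES (1), (2);
""")

# NOT IN: "2 NOT IN (1, NULL)" evaluates to NULL rather than true,
# so the row with Id 2 is filtered out as well.
not_in = conn.execute("""
    SELECT Id FROM incoming
    WHERE Id NOT IN (SELECT Id FROM existing)
""").fetchall()

# NOT EXISTS: the correlated comparison simply finds no match for 2.
not_exists = conn.execute("""
    SELECT Id FROM incoming AS i
    WHERE NOT EXISTS (SELECT 1 FROM existing AS e WHERE e.Id = i.Id)
""").fetchall()

print(not_in, not_exists)
```

With a NOT NULL primary key on the looked-up column, as guessed above, both forms return the same rows.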

Most efficient way to maintain a 'set' in SQL Server 2008?

I have ~2 million rows or so of data, each row with an artificial PK, and two Id fields (so: PK, ID1, ID2). I have a unique constraint (and index) on ID1+ID2.
I get two sorts of updates, both with a distinct ID1 per update.
100-1000 rows of all-new data (ID1 is new)
100-1000 rows of largely, but not necessarily completely overlapping data (ID1 already exists, maybe new ID1+ID2 pairs)
What's the most efficient way to maintain this 'set'? Here are the options as I see them:
1. Delete all the rows with ID1, insert all the new rows (yikes)
2. Query all the existing rows from the set of new data ID1+ID2, only insert the new rows
3. Insert all the new rows, ignore inserts that trigger unique constraint violations
Any thoughts?

If you're using SQL Server 2008 (or 2008 R2), you can look at the MERGE statement, something like:
MERGE INTO MyTable mt
USING NewRows nr
ON mt.ID1 = nr.ID1 and mt.ID2 = nr.ID2
WHEN NOT MATCHED THEN
INSERT (ID1,ID2,<more columns>) VALUES (nr.ID1,nr.ID2,<other columns>);

Not all of your listed solutions are functionally equivalent, so without more knowledge about what you want or need to accomplish, it's hard to say which is most appropriate.
1. You may lose data that you want or need to keep.
2. Based on the table schema that you mentioned, this should be reasonable.
3. This will only work if you perform each INSERT separately.
I'd suggest [2] based on the available info.

how to select the newly added rows in a table efficiently?

I need to periodically update a local cache with new additions to some DB table. The table rows contain an auto-increment sequential number (SN) field. The cache keeps this number too, so basically I just need to fetch all rows with SN larger than the highest I already have.
SELECT * FROM table where SN > <max_cached_SN>
However, the majority of the attempts will bring no data (I just need to make sure that I have an absolutely up-to-date local copy). So I wonder if this will be more efficient:
count = SELECT count(*) from table;
if (count > <cache_size>)
// fetch new rows as above
I suppose that selecting by an indexed numeric field is quite efficient, so I wonder whether using count has any benefit. On the other hand, this test/update will be done quite frequently and by many clients, so there is a motivation to optimize it.

You say "this test/update will be done quite frequently and by many clients". This could lead to an unexpected race condition during cache generation.
I would suggest:
1. When a row is added to your table, add the newest id to a queue table.
2. Use something like crontab to trigger cache generation by checking the queue table.
3. Once the new cache is generated, delete the id from the queue table.
Since, as you stress, the majority of the attempts will bring no data, the above only does work when there is a new addition. The queue table concept can even be extended to updates and deletes.

I believe that
SELECT * FROM table where SN > <max_cached_SN>
will be faster, because select count(*) may trigger a table scan. Just for clarification, do you never delete rows from this table?

SELECT COUNT(*) may involve a scan (even a full scan), while SELECT ... WHERE SN > constant can effectively use an index by SN, and looking at very few index nodes may suffice. Don't count items if you don't need the exact total, it's expensive.
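The range-query approach can be sketched as below, again with SQLite through Python's sqlite3 module as a stand-in for whatever database is in use. The events table, its payload column and the helper fetch_new_rows are hypothetical names for illustration; sn plays the role of the SN field from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (sn INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",), ("c",)])


def fetch_new_rows(conn, max_cached_sn):
    """Return rows added after max_cached_sn; an empty list means the cache is current."""
    return conn.execute(
        "SELECT sn, payload FROM events WHERE sn > ? ORDER BY sn",
        (max_cached_sn,),
    ).fetchall()


cache = fetch_new_rows(conn, 0)          # initial load of the cache
max_cached_sn = cache[-1][0]             # highest SN seen so far

nothing_new = fetch_new_rows(conn, max_cached_sn)   # the common, empty case

conn.execute("INSERT INTO events (payload) VALUES ('d')")
new_rows = fetch_new_rows(conn, max_cached_sn)
print(nothing_new, new_rows)
```

The empty result already answers the question "is there anything new?", so a separate COUNT(*) round trip adds work without adding information.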

You don't need to use SELECT COUNT(*)
There are two solutions:
1. Use a temp table with one field that holds the latest row count of your table, and create an AFTER INSERT trigger on your table that increments that field.
2. Use a temp table with one field that holds the latest SN of your table, and create an AFTER INSERT trigger on your table that updates that field.

not much to this really
drop table if exists foo;
create table foo
(
foo_id int unsigned not null auto_increment primary key
)
engine=innodb;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;

Improving performance of Sql Delete

We have a query to remove some rows from the table based on an id field (primary key). It is a pretty straightforward query:
delete all from OUR_TABLE where ID in (123, 345, ...)
The problem is that the number of ids can be huge (e.g. 70k), so the query takes a long time. Is there any way to optimize this?
(We are using sybase - if that matters).

There are two ways to make statements like this one perform:
1. Create a new table and copy all but the rows to delete. Swap the tables afterwards (alter table name ...). I suggest giving it a try even if it sounds stupid. Some databases are much faster at copying than at deleting.
2. Partition your tables. Create N tables and use a view to join them into one. Sort the rows into different tables grouped by the delete criterion. The idea is to drop a whole table instead of deleting individual rows.

Consider running this in batches. A loop running 1000 records at a time may be much faster than one query that does everything and in addition will not keep the table locked out to other users for as long at a stretch.
If you have cascade delete (and lots of foreign key tables affected) or triggers involved, you may need to run in even smaller batches. You'll have to experiment to see which is the best number for your situation. I've had tables where I had to delete in batches of 100 and others where 50000 worked (fortunate in that case as I was deleting a million records).
But in any event I would put the key values that I intend to delete into a temp table and delete from there.
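A rough sketch of the batching idea combined with a temp table of keys follows. It runs against SQLite via Python's sqlite3 module, not Sybase, so locking and performance characteristics will differ; the temp table name ids_to_delete, the batch size and the generated sample IDs are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE our_table (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO our_table (id) VALUES (?)", [(i,) for i in range(1, 10001)]
)

ids_to_delete = list(range(1, 7001))   # stand-in for the application's ID list
BATCH_SIZE = 1000                      # tune by experiment

# Stage the keys in a temp table instead of building a huge IN (...) list.
conn.execute("CREATE TEMP TABLE ids_to_delete (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO ids_to_delete (id) VALUES (?)", [(i,) for i in ids_to_delete]
)
conn.commit()

deleted = 0
while True:
    # Take the next batch of staged keys, lowest ids first.
    batch = [row[0] for row in conn.execute(
        "SELECT id FROM ids_to_delete ORDER BY id LIMIT ?", (BATCH_SIZE,))]
    if not batch:
        break
    lo, hi = batch[0], batch[-1]
    cur = conn.execute(
        "DELETE FROM our_table WHERE id IN "
        "(SELECT id FROM ids_to_delete WHERE id BETWEEN ? AND ?)", (lo, hi))
    deleted += cur.rowcount
    # Remove the processed keys so the next iteration sees the next batch.
    conn.execute("DELETE FROM ids_to_delete WHERE id BETWEEN ? AND ?", (lo, hi))
    conn.commit()   # one short transaction per batch

remaining = conn.execute("SELECT COUNT(*) FROM our_table").fetchone()[0]
print(deleted, remaining)
```

Committing after each batch is what keeps any single transaction, and the locks it holds, short.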

I'm wondering if parsing an IN clause with 70K items in it is a problem. Have you tried a temp table with a join instead?

Can Sybase handle 70K arguments in an IN clause? All databases I have worked with have some limit on the number of arguments in an IN clause. For example, Oracle has a limit of around 1000.
Can you use a subselect instead of a literal IN list? That will shorten the SQL, which may help with such a large number of values. Something like this:
DELETE FROM OUR_TABLE WHERE ID IN
(SELECT ID FROM somewhere WHERE some_condition)
Deleting a large number of records can be sped up with some interventions in the database, if the database model permits. Here are some strategies:
1. You can speed things up by dropping indexes, deleting the records and then recreating the indexes. This avoids rebalancing index trees while deleting records.
   - drop all indexes on the table
   - delete records
   - recreate indexes
2. If you have lots of relations to this table, try disabling constraints if you are absolutely sure that the delete command will not break any integrity constraint. The delete will go much faster because the database won't be checking integrity. Enable the constraints after the delete.
   - disable integrity constraints, disable check constraints
   - delete records
   - enable constraints
3. Disable triggers on the table, if you have any and if your business rules allow that. Delete the records, then enable the triggers.
4. Last, do as others suggested: make a copy of the table that contains the rows that are not to be deleted, then drop the original, rename the copy and recreate integrity constraints, if there are any.
I would try a combination of 1, 2 and 3. If that does not work, then 4. If everything is slow, I would look for a bigger box: more memory, faster disks.

Find out what is using up the performance!
In many cases you might use one of the solutions provided. But there might be others (based on Oracle knowledge, so things will be different on other databases. Edit: just saw that you mentioned sybase):
Do you have foreign keys on that table? Make sure the referring ids are indexed.
Do you have indexes on that table? Dropping them before the delete and recreating them afterwards might be faster.
Check the execution plan. Is it using an index where a full table scan might be faster? Or the other way round? HINTS might help.
Instead of a select into new_table as suggested above, a create table as select might be even faster.
But remember: Find out what is using up the performance first.
When you are using DDL statements make sure you understand and accept the consequences it might have on transactions and backups.

Try sorting the IDs you are passing into "in" in the same order as the table or index is stored. You may then get more hits on the disk cache.
Putting the IDs to be deleted into a temp table that has the IDs sorted in the same order as the main table may let the database do a simple scan over the main table.
You could try using more than one connection and splitting the work over the connections so as to use all the CPUs on the database server; however, think first about what locks will be taken out.

I also think that the temp table is likely the best solution.
If you were to do a "delete from .. where ID in (select id from ...)" it can still be slow with large queries, though. I thus suggest that you delete using a join - many people don't know about that functionality.
So, given this example table:
-- set up tables for this example
if exists (select id from sysobjects where name = 'OurTable' and type = 'U')
drop table OurTable
go
create table OurTable (ID integer primary key not null)
go
insert into OurTable (ID) values (1)
insert into OurTable (ID) values (2)
insert into OurTable (ID) values (3)
insert into OurTable (ID) values (4)
go
We can then write our delete code as follows:
create table #IDsToDelete (ID integer not null)
go
insert into #IDsToDelete (ID) values (2)
insert into #IDsToDelete (ID) values (3)
go
-- ... etc ...
-- Now do the delete - notice that we aren't using 'from'
-- in the usual place for this delete
delete OurTable from #IDsToDelete
where OurTable.ID = #IDsToDelete.ID
go
drop table #IDsToDelete
go
-- This returns only items 1 and 4
select * from OurTable order by ID
go

Does our_table have a reference on delete cascade?
