running insert on conflict when most queries will result in a conflict - sql

In my postgres database, I have a table that is used to map two fields together with a serial primary key, but I will rarely get new mappings. I may get data sent to me 60k times per day (or more) and only get new mappings a handful of times every month. I wrote a query with the on conflict clause:
insert into mapping_table (field1, field2)
values
(field1, field2)
on conflict (field1, field2) do nothing
This seems to work as expected! But I'm wondering if running the insert query tens of thousands of times per day, when a mapping rarely needs to be added, is problematic somehow. And if so, is there a better way to do it?
I'm also not sure if there's a problem with the serial primary key. As expected, the value auto increments even though there is a conflict. I don't have any logic that expects that the primary key won't have gaps, but the numbers will get large very fast, and I'm not sure if that could become a problem in the future.

There are 86,400 seconds in a day. So, 60k times per day is less than once per second.
Determining the conflict is a lookup in a unique index, and then perhaps some additional checking for locking and so on. Under most circumstances, this would be an acceptable load on the system.

on conflict seems like the right tool for this. It is built-in, probably fairly optimized, and implements the logic you want with straightforward syntax.
If you want to avoid "burning" sequence numbers though, an alternative is not exists, to check if a row exists before inserting it:
insert into mapping_table (field1, field2)
select v.field1, v.field2
from (values (field1, field2)) as v(field1, field2)
where not exists (
    select 1
    from mapping_table mp
    where mp.field1 = v.field1 and mp.field2 = v.field2
)
This query takes advantage of the existing unique index on (field1, field2). It will most likely not be faster than on conflict, though; if anything, slightly slower.
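As for the sequence itself: the gaps are harmless, and the only practical concern is eventually exhausting the 32-bit range of a plain serial. If that ever becomes a worry, the key column can be widened. A minimal sketch, assuming the column is named id and the sequence has the default serial name (both names hypothetical here):
-- assumes the primary key column is called "id" and the sequence has the default name;
-- this rewrites the table, so run it during a quiet period
alter table mapping_table alter column id type bigint;
alter sequence mapping_table_id_seq as bigint;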

Related

Postgres SQL UPSERT query too slow (performance issue)

I am trying to populate a Postgres table from another table with roughly 24 million records, but the query has become too slow, taking 9-10 hours. The update operation only updates 1-2 rows each second. I can't understand why it is so slow.
Current benchmark
Query = INSERT INTO .... SELECT FROM .... ON CONFLICT DO UPDATE
Source table has 24 million records
Destination already has 560 million records, with indexes, unique keys, and primary and foreign keys
Query (sample)
INSERT INTO destination_tbl (col1, col2 .... , col22, processed, updated_at)
SELECT col1, col2 .... , col22, false, null
FROM source_tbl
WHERE processed = false
ON CONFLICT (unique_cols...)
DO UPDATE
SET col1 = EXCLUDED.col1,
    ....
    col22 = EXCLUDED.col22,
    processed = false,
    updated_at = now()
The query performance you have mentioned does seem consistent with the query that you have. It is a plain insert query that uses INSERT ... ON CONFLICT, which is one of the ways to upsert data. For performance, however, it matters a lot whether you use ON CONFLICT DO NOTHING or a DO UPDATE clause.
Generally, with a DO NOTHING clause there won't be any dead tuples to clean up, whereas with a DO UPDATE clause every updated row leaves a dead tuple behind, and cleaning up those dead tuples takes time that is included in the total execution cost. INSERT ... ON CONFLICT always performs a read to determine the necessary writes, while a dedicated UPSERT statement (in databases that have one) writes without reading, making it faster; for tables with secondary indexes there is no performance difference between the two.
Check the factors above and see whether batch loads, or splitting the query into smaller pieces, are possible; that would reduce execution time. Setting an appropriate fillfactor should also help.
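As a rough illustration of batching, the upsert can be run in slices over the source table, committing between passes. A minimal sketch with invented names: an integer key src_id on source_tbl, and a two-column unique key (k1, k2) plus a payload column val on destination_tbl:
-- hypothetical column names; advance the src_id window on each pass and commit between batches
INSERT INTO destination_tbl (k1, k2, val, processed, updated_at)
SELECT k1, k2, val, false, null
FROM source_tbl
WHERE processed = false
  AND src_id >= 0 AND src_id < 100000
ON CONFLICT (k1, k2)
DO UPDATE SET val = EXCLUDED.val,
              processed = false,
              updated_at = now();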

Why does this query run faster? (UNIQUE vs NOT EXISTS)

I am trying to run a simple operation where I have a table called insert_base and another table called insert_values (both with attributes a, b), and I want to copy all the values in insert_values into insert_base while avoiding duplicates - that is, I do not want to insert a tuple that is already in insert_base, and I also do not want to insert the same tuple into insert_base twice. I am currently looking at two query methods to do this:
INSERT INTO insert_base SELECT DISTINCT * FROM insert_values IV
WHERE NOT EXISTS
(SELECT * FROM insert_base IB WHERE IV.a = IB.a AND IV.b = IB.b);
and another that involves using a UNIQUE constraint on insert_base(a,b):
INSERT INTO insert_base SELECT * FROM insert_values ON CONFLICT DO NOTHING;
However, every time I run this, the first query runs significantly faster (about 500 ms compared to 1.3 s), and I am not sure why that is the case. Do the two queries not do roughly the same thing? I am aware that the UNIQUE constraint just puts an index on (a,b), but would that not make it faster than the first method? In fact, when I run the first method with an index on (a,b), it actually runs slightly slower than without any index (or unique constraint), which confuses me even more.
Any help would be much appreciated. I am running all this in postgresql by the way.
Your second query does more work than the first. It attempts to insert every row from your insert_values table, then when it sometimes hits a conflict it abandons the insert.
Your first query's WHERE clause filters out the rows that can't be inserted before attempting to insert them.
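One way to confirm where the time goes on your own data is to compare the two plans:
-- note: EXPLAIN ANALYZE actually executes the inserts, so wrap this in BEGIN ... ROLLBACK to leave the tables unchanged
EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO insert_base
SELECT DISTINCT * FROM insert_values IV
WHERE NOT EXISTS
  (SELECT * FROM insert_base IB WHERE IV.a = IB.a AND IV.b = IB.b);

EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO insert_base SELECT * FROM insert_values ON CONFLICT DO NOTHING;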

Limiting the number of records in a SQLite DB

What I'm trying to implement here is a condition wherein a sqlite database holds only the most recent 1000 records. I have timestamps with each record.
One obviously inefficient approach that comes to mind right away is to check the total number of records and, if they exceed 1000, simply delete the ones that fall outside the window.
However, I would have to do this check with each INSERT, which makes things highly inefficient.
What could be a better logic? Can we do something with triggers?
Some related questions that follow the same logic I thought of are posted on SO:
Delete oldest records from database
SQL Query to delete records older than two years
You can use an implicit "rowid" column for that.
Assuming you don't delete rows manually in different ways:
DELETE FROM yourtable WHERE rowid < (last_row_id - 1000)
You can obtain the last rowid using the API function or as max(rowid).
If you don't need exactly 1000 records (e.g. you just want to clean up old records), it is not necessary to do this on each insert. Add a counter in your program and execute the cleanup, for instance, once every 100 inserts.
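A self-contained variant of that cleanup, using max(rowid) directly (yourtable is the placeholder name from above):
-- keeps roughly the newest 1000 rows, assuming rowids are monotonically increasing and not reused
DELETE FROM yourtable
WHERE rowid <= (SELECT MAX(rowid) FROM yourtable) - 1000;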
UPDATE:
Anyway, you pay performance either on each insert or on each select. So the choice depends on what you have more: INSERTs or SELECTs.
In case you don't have so many inserts that performance is a concern, you can use the following trigger to keep no more than 1000 records:
CREATE TRIGGER triggername AFTER INSERT ON tablename BEGIN
    DELETE FROM tablename WHERE timestamp < (
        SELECT MIN(timestamp) FROM (
            SELECT timestamp FROM tablename ORDER BY timestamp DESC LIMIT 1000
        )
    );
END;
Creating a unique index on the timestamp column should be a good idea too (in case it isn't the PK already). Also note that SQLite supports only FOR EACH ROW triggers, so when you bulk-insert many records it is worth temporarily dropping (and later recreating) the trigger.
If there are too many INSERTs, there isn't much you can do on the database side. You can make the trigger fire less often by adding a condition like AFTER INSERT WHEN NEW.rowid % 100 = 0 (see the sketch below). For selects, just use LIMIT 1000 (or create an appropriate view).
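A sketch of that throttled trigger, reusing the placeholder names from above:
CREATE TRIGGER cleanup_every_100 AFTER INSERT ON tablename
WHEN NEW.rowid % 100 = 0
BEGIN
    -- prune everything older than the 1000 newest timestamps
    DELETE FROM tablename WHERE timestamp < (
        SELECT MIN(timestamp) FROM (
            SELECT timestamp FROM tablename ORDER BY timestamp DESC LIMIT 1000
        )
    );
END;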
I can't predict how much faster that would be. The best way would be just to measure how much performance you gain in your particular case.

ignore insert of rows that violate duplicate key index

I perform an insert as follows:
INSERT INTO foo (a,b,c)
SELECT x,y,z
FROM fubar
WHERE ...
However, if some of the rows that are being inserted violate the duplicate key index on foo, I want the database to ignore those rows, and not insert them and continue inserting the other rows.
The DB in question is Informix 11.5. Currently all that happens is that the DB is throwing an exception. If I try to handle the exception with:
ON EXCEPTION IN (-239)
END EXCEPTION WITH RESUME;
... it does not help because after the exception is caught, the entire insert is skipped.
I don't think Informix supports INSERT IGNORE or INSERT ... ON DUPLICATE KEY ..., but feel free to correct me if I am wrong.
Use an IF statement and the EXISTS function to check for existing records. Or you can include that EXISTS check in the WHERE clause, like below:
INSERT INTO foo (a, b, c)
SELECT x, y, z
FROM fubar
WHERE NOT EXISTS (SELECT 1 FROM foo WHERE ...)  -- correlate on the duplicate-key columns, e.g. foo.a = fubar.x
Depending on whether you want to know all about all the errors (typically as a result of a data loading operation), consider using violations tables.
START VIOLATIONS TABLE FOR foo;
This will create a pair of tables foo_vio and foo_dia to contain information about rows that violate the integrity constraints on the table.
When you've had enough, you use:
STOP VIOLATIONS TABLE FOR foo;
You can clean up the diagnostic tables at your leisure. There are bells and whistles on the command to control which table is used, etc. (I should perhaps note that this assumes you are using IDS (IBM Informix Dynamic Server) and not, say, Informix SE or Informix OnLine.)
Violations tables are a heavy-duty option - suitable for loads and the like. They are not ordinarily used to protect run-of-the-mill SQL. For that, the protected INSERT (with SELECT and WHERE NOT EXISTS) is fairly effective - it requires the data to be in a table already, but temp tables are easy to create.
There are a couple of other options to consider.
From version 11.50 onwards, Informix supports the MERGE statement. This could be used to insert rows from fubar where the corresponding row in foo does not exist, and to update the rows in foo with the values from fubar where the corresponding row already exists in foo (the duplicate key problem).
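A minimal sketch of what that MERGE might look like, assuming (purely for illustration) that the duplicate key of foo is column a and that it corresponds to fubar.x:
MERGE INTO foo f
USING fubar fb ON f.a = fb.x
WHEN MATCHED THEN
    UPDATE SET b = fb.y, c = fb.z
WHEN NOT MATCHED THEN
    INSERT (a, b, c) VALUES (fb.x, fb.y, fb.z);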
Another way of looking at it is:
SELECT fubar.*
FROM fubar JOIN foo ON fubar.pk = foo.pk
INTO TEMP duplicate_entries;
DELETE FROM fubar WHERE pk IN (SELECT pk FROM duplicate_entries);
INSERT INTO foo SELECT * FROM fubar;
...process duplicate_entries
DROP TABLE duplicate_entries
This cleans the source table (fubar) of the duplicate entries (assuming it is only the primary key that is duplicated) before trying to insert the data. The duplicate_entries table contains the rows in fubar with the duplicate keys - the ones that need special processing in some shape or form. Or you can simply delete and ignore those rows, though in my experience, that is seldom a good idea.
GROUP BY may be your friend here. To prevent duplicate rows from being inserted, use GROUP BY in your SELECT; this will collapse the duplicates into a single row. The only thing I would do is test to see if there are any performance issues. Also, make sure you include all of the columns you want to be unique in the GROUP BY, or you could exclude rows that are not duplicates.
INSERT INTO FOO (Name, Age, Gadget, Price)
select Name, Age, Gadget, Price
from foobar
group by Name, Age, Gadget, Price
Where Name, Age, Gadget, Price form the primary key index (or unique key index).
The other possibility is to write the duplicated rows to an error table without the index and then resolve the duplicates before inserting them into the new table. You just need to add a HAVING COUNT(*) > 1 clause to the query above; see the sketch below.
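For instance, the duplicated key combinations could be captured like this (foo_dups is a hypothetical error table without the unique index):
INSERT INTO foo_dups (Name, Age, Gadget, Price)
select Name, Age, Gadget, Price
from foobar
group by Name, Age, Gadget, Price
having count(*) > 1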
I don't know about Informix, but with SQL Server, you can create an index, make it unique and then set a property to have it ignore duplicate keys so no error gets thrown on a duplicate. It's just ignored. Perhaps Informix has something similar.
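For reference, in SQL Server that property is IGNORE_DUP_KEY:
-- rows that would duplicate the key are silently skipped instead of raising an error
CREATE UNIQUE INDEX ux_foo_abc ON foo (a, b, c) WITH (IGNORE_DUP_KEY = ON);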

Improving performance of SQL Delete

We have a query to remove some rows from the table based on an id field (primary key). It is a pretty straightforward query:
delete from OUR_TABLE where ID in (123, 345, ...)
The problem is that the number of IDs can be huge (e.g. 70k), so the query takes a long time. Is there any way to optimize this?
(We are using sybase - if that matters).
There are two ways to make statements like this one perform:
Create a new table and copy all but the rows to delete, then swap the tables afterwards (alter table ... rename, or sp_rename, depending on the database); a rough sketch follows this list. I suggest giving it a try even if it sounds stupid. Some databases are much faster at copying than at deleting.
Partition your tables. Create N tables and use a view to join them into one. Sort the rows into different tables grouped by the delete criterion. The idea is to drop a whole table instead of deleting individual rows.
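A rough sketch of the copy-and-swap idea in Sybase-style T-SQL, assuming the IDs to delete are already in a temp table #ids_to_delete (hypothetical name) and that select into is allowed in the database:
-- copy everything we want to keep into a new table
select * into OUR_TABLE_new
from OUR_TABLE
where ID not in (select ID from #ids_to_delete)
-- then drop OUR_TABLE, rename OUR_TABLE_new (e.g. with sp_rename),
-- and recreate indexes, constraints, and permissions on the new table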
Consider running this in batches. A loop running 1000 records at a time may be much faster than one query that does everything and in addition will not keep the table locked out to other users for as long at a stretch.
If you have cascade delete (and lots of foreign key tables affected) or triggers involved, you may need to run in even smaller batches. You'll have to experiment to see which number is best for your situation. I've had tables where I had to delete in batches of 100 and others where 50,000 worked (fortunate in that case, as I was deleting a million records).
But in any event, I would put the key values I intend to delete into a temp table and delete based on that; see the sketch below.
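A rough sketch of such a batch loop in Sybase T-SQL, assuming the IDs to delete are already in a temp table #ids_to_delete (hypothetical name):
set rowcount 1000                -- limit each delete pass to 1000 rows
while 1 = 1
begin
    delete OUR_TABLE
    from OUR_TABLE, #ids_to_delete
    where OUR_TABLE.ID = #ids_to_delete.ID
    if @@rowcount = 0 break      -- nothing left to delete
end
set rowcount 0                   -- restore normal behaviour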
I'm wondering if parsing an IN clause with 70K items in it is a problem. Have you tried a temp table with a join instead?
Can Sybase handle 70K arguments in an IN clause? All databases I have worked with have some limit on the number of arguments in an IN clause. For example, Oracle has a limit of around 1,000.
Can you use a subselect instead of the IN list? That would shorten the SQL and might help with such a large number of values. Something like this:
DELETE FROM OUR_TABLE WHERE ID IN
(SELECT ID FROM somewhere WHERE some_condition)
Deleting a large number of records can be sped up with some interventions in the database, if the data model permits. Here are some strategies:
You can speed things up by dropping indexes, deleting the records, and recreating the indexes afterwards. This eliminates the rebalancing of index trees while the records are being deleted:
drop all indexes on table
delete records
recreate indexes
If you have lots of relations to this table, try disabling constraints if you are absolutely sure that the delete will not break any integrity constraint. The delete will go much faster because the database won't be checking integrity. Re-enable the constraints after the delete:
disable integrity constraints, disable check constraints
delete records
enable constraints
Disable triggers on the table, if you have any and if your business rules allow it. Delete the records, then enable the triggers again.
Last, do as others suggested - make a copy of the table containing the rows that are not to be deleted, then drop the original, rename the copy, and recreate the integrity constraints, if there are any.
I would try a combination of 1, 2, and 3. If that does not work, then 4. If everything is slow, I would look for a bigger box - more memory, faster disks.
Find out what is using up the performance!
In many cases you might use one of the solutions provided. But there might be others (this is based on Oracle knowledge, so things will be different on other databases; edit: just saw that you mentioned Sybase):
Do you have foreign keys on that table? Make sure the referring IDs are indexed.
Do you have indexes on that table? It might be that dropping them before the delete and recreating them afterwards is faster.
Check the execution plan. Is it using an index where a full table scan might be faster? Or the other way round? Hints might help.
Instead of a select into new_table as suggested above, a create table as select might be even faster.
But remember: Find out what is using up the performance first.
When you are using DDL statements, make sure you understand and accept the consequences they might have on transactions and backups.
Try sorting the IDs you are passing into IN in the same order as the table or index is stored in. You may then get more hits on the disk cache.
Putting the IDs to be deleted into a temp table that has the IDs sorted in the same order as the main table may let the database do a simple scan over the main table.
You could try using more than one connection and splitting the work over the connections, so as to use all the CPUs on the database server; however, think about what locks will be taken out, etc., first.
I also think that the temp table is likely the best solution.
If you were to do a "delete from .. where ID in (select id from ...)" it can still be slow with large queries, though. I thus suggest that you delete using a join - many people don't know about that functionality.
So, given this example table:
-- set up tables for this example
if exists (select id from sysobjects where name = 'OurTable' and type = 'U')
drop table OurTable
go
create table OurTable (ID integer primary key not null)
go
insert into OurTable (ID) values (1)
insert into OurTable (ID) values (2)
insert into OurTable (ID) values (3)
insert into OurTable (ID) values (4)
go
We can then write our delete code as follows:
create table #IDsToDelete (ID integer not null)
go
insert into #IDsToDelete (ID) values (2)
insert into #IDsToDelete (ID) values (3)
go
-- ... etc ...
-- Now do the delete - notice that we aren't using 'from'
-- in the usual place for this delete
delete OurTable from #IDsToDelete
where OurTable.ID = #IDsToDelete.ID
go
drop table #IDsToDelete
go
-- This returns only items 1 and 4
select * from OurTable order by ID
go
Does our_table have a reference on delete cascade?