Oracle set based insert vs set based merge performance - sql

We're using Oracle 11g at the moment without Enterprise (not an option unfortunately).
Let's say I have a table with a constant(Let's say 2000) rows of data. Let's call it data_source.
I want to insert some columns of this table into another table, data_dest. I'm using all the records from the source table.
In other words, I would like to insert this set
select data_source.col1, data_source.col2, ... data_source.colN
from data_source
Which would be faster in this case:
insert into data_dest
select data_source.col1, data_source.col2, ... data_source.colN
from data_source
OR
merge into data_dest dd
using data_source ds
on (dd.col1 = ds.col1) --Let's assume that this is a matching column names
when not matched
insert (col1,col2...)
values(ds.col1,ds.col2...)
EDIT 1:
We can assume there are no primary keys violations from the insert.
In other words we can assume that insert will successfully insert all of the rows and so will merge.

The insert is very likely faster because it does not require a join on the two tables.
That said, the two queries are not equivalent. Assuming that col1 is defined as the primary key, the insert will throw an error if data_source contains a value in col1 that is already in data_dest. Because the merge is comparing the data in the two tables, then only inserting only the rows that don't already exist, it won't ever throw a primary key violation.
An insert that would be equivalent to the merge would be:
INSERT INTO data_dest
SELECT data_source.col1, data_source.col2, ... data_source.colN
FROM data_source
WHERE NOT EXISTS
(SELECT *
FROM data_dest
WHERE data_source.col1 = data_dest.col1)
It's likely that the plan for this insert will be very similar (if not identical) to the plan for the merge and the performance would be indistinguishable.

Related

INSERT OR REPLACE multiple rows, but there is no unique or primary keys

Hi I'm running into the following problem on SQlite3
I have a simple table
CREATE TABLE TestTable (id INT, cnt INT);
There are some rows already in the table.
I have some data I want to be inserted into the table: {(id0, cnt0), (id1, cnt1)...}
I want to insert data into the table, on id conflict, update TestTable.cnt = TestTable.cnt + value.cnt
(values.cnt is cnt0, cnt1 ... basically my data to be inserted)
*** But the problem is, there is no primary or unique constraint on id, and I am not allowed to change it!
What I currently have :
In my program I loop through all the values
UPDATE TestTABLE SET count = count + value.cnt WHERE id = value.id;
if (sqlite3_changes() == 0)
INSERT INTO MyTable (id, cnt) values (value.id, value.cnt);
But the problem is, with a very large dataset, doing 2 queries for each data entry takes too long. I'm trying to bundle multiple entries together into one call.
Please let me know if you have questions about my description, thank you for helping!
If you are able to create temporary tables, then do the following. Although I don't show it here, I suggest wrapping all this in a transaction. This technique will likely increase efficiency even if you are also able to add a temporary unique index. (In that case you could use an UPSERT with source data in the temporary table.)
CREATE TEMP TABLE data(id INT, cnt INT);
Now insert the new data into the temporary table, whether by using the host-language data libraries or crafting an insert statement similar to
INSERT INTO data (id, cnt)
VALUES (1, 100),
(2, 200),
(5, 400),
(7, 500);
Now update all existing rows using the single UPDATE statement. SQLite does not have a convenient syntax for joining tables and/or providing a source query for an UPDATE statement. However, one can use nested statement to provide similar convenience:
UPDATE TestTable AS tt
SET cnt = cnt + ifnull((SELECT cnt FROM data WHERE data.id == tt.id), 0)
WHERE tt.id IN (SELECT id FROM data);
Note that the two nested queries are independent of each other. In fact, one could eliminate the WHERE clause altogether and get the same results for this simple case. The WHERE clause is simply to make it more efficient, only attempting to update matching id's. The other subquery in the SET clause also specifies a match on id, but alone it would still allow updates of rows that don't have a match, defaulting to a null value and being converted to 0 (by isnull() function) for a no-op. By the way, without the isnull() function, the sum would result in null and would overwrite non-null values.
Finally, insert only rows with non-existing id values:
INSERT INTO TestTable (id, cnt)
SELECT data.id, data.cnt
FROM data LEFT JOIN TestTable
ON data.id == TestTable.id
WHERE TestTable.id IS NULL;

What happens with duplicates when inserting multiple rows?

I am running a python script that inserts a large amount of data into a Postgres database, I use a single query to perform multiple row inserts:
INSERT INTO table (col1,col2) VALUES ('v1','v2'),('v3','v4') ... etc
I was wondering what would happen if it hits a duplicate key for the insert. Will it stop the entire query and throw an exception? Or will it merely ignore the insert of that specific row and move on?
The INSERT will just insert all rows and nothing special will happen, unless you have some kind of constraint disallowing duplicate / overlapping values (PRIMARY KEY, UNIQUE, CHECK or EXCLUDE constraint) - which you did not mention in your question. But that's what you are probably worried about.
Assuming a UNIQUE or PK constraint on (col1,col2), you are dealing with a textbook UPSERT situation. Many related questions and answers to find here.
Generally, if any constraint is violated, an exception is raised which (unless trapped in subtransaction like it's possible in a procedural server-side language like plpgsql) will roll back not only the statement, but the whole transaction.
Without concurrent writes
I.e.: No other transactions will try to write to the same table at the same time.
Exclude rows that are already in the table with WHERE NOT EXISTS ... or any other applicable technique:
Select rows which are not present in other table
And don't forget to remove duplicates within the inserted set as well, which would not be excluded by the semi-anti-join WHERE NOT EXISTS ...
One technique to deal with both at once would be EXCEPT:
INSERT INTO tbl (col1, col2)
VALUES
(text 'v1', text 'v2') -- explicit type cast may be needed in 1st row
, ('v3', 'v4')
, ('v3', 'v4') -- beware of dupes in source
EXCEPT SELECT col1, col2 FROM tbl;
EXCEPT without the key word ALL folds duplicate rows in the source. If you know there are no dupes, or you don't want to fold duplicates silently, use EXCEPT ALL (or one of the other techniques). See:
Using EXCEPT clause in PostgreSQL
Generally, if the target table is big, WHERE NOT EXISTS in combination with DISTINCT on the source will probably be faster:
INSERT INTO tbl (col1, col2)
SELECT *
FROM (
SELECT DISTINCT *
FROM (
VALUES
(text 'v1', text'v2')
, ('v3', 'v4')
, ('v3', 'v4') -- dupes in source
) t(c1, c2)
) t
WHERE NOT EXISTS (
SELECT FROM tbl
WHERE col1 = t.c1 AND col2 = t.c2
);
If there can be many dupes, it pays to fold them in the source first. Else use one subquery less.
Related:
Select rows which are not present in other table
With concurrent writes
Use the Postgres UPSERT implementation INSERT ... ON CONFLICT ... in Postgres 9.5 or later:
INSERT INTO tbl (col1,col2)
SELECT DISTINCT * -- still can't insert the same row more than once
FROM (
VALUES
(text 'v1', text 'v2')
, ('v3','v4')
, ('v3','v4') -- you still need to fold dupes in source!
) t(c1, c2)
ON CONFLICT DO NOTHING; -- ignores rows with *any* conflict!
Further reading:
How to use RETURNING with ON CONFLICT in PostgreSQL?
How do I insert a row which contains a foreign key?
Documentation:
The manual
The commit page
The Postgres Wiki page
Craig's reference answer for UPSERT problems:
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
Will it stop the entire query and throw an exception? Yes.
To avoid that, you can look on the following SO question here, which describes how to avoid Postgres from throwing an error for multiple inserts when some of the inserted keys already exist on the DB.
You should basically do this:
INSERT INTO DBtable
(id, field1)
SELECT 1, 'value'
WHERE
NOT EXISTS (
SELECT id FROM DBtable WHERE id = 1
);

Insert into table from a temp table without identity column ?

I need to insert into table with same column in temp table. without identity column of table which am inserting.
Remaining columns of inserting table is contained in temp table.
You need to be able to generate primary key values on your own. With IBM DB2, I am using sequences to get those values. It would be good to know which RDBMS you are using.
A statement matching your needs could look like that:
INSERT INTO MYSCHEMA.MYTABLE
select (NEXTVAL FOR MYSCHEMA.ID_SEQUENCE) as ID, T.*
from MYSCHEMA.MYTEMPTABLE T

Duplicate key error with PostgreSQL INSERT with subquery

There are some similar questions on StackOverflow, but they don't seem to exactly match my case. I am trying to bulk insert into a PostgreSQL table with composite unique constraints. I created a temporary table (temptable) without any constraints, and loaded the data (with possible some duplicate values) in it. So far, so good.
Now, I am trying to transfer the data to the actual table (realtable) with unique index. For this, I used an INSERT statement with a subquery:
INSERT INTO realtable
SELECT * FROM temptable WHERE NOT EXISTS (
SELECT 1 FROM realtable WHERE temptable.added_date = realtable.added_date
AND temptable.product_name = realtable.product_name
);
However, I am getting duplicate key errors:
ERROR: duplicate key value violates unique constraint "realtable_added_date_product_name_key"
SQL state: 23505
Detail: Key (added_date, product_name)=(20000103, TEST) already exists.
My question is, shouldn't the WHERE NOT EXISTS clause prevent this from happening? How can I fix it?
The NOT EXISTS clause only prevents rows from temptable conflicting with existing rows from realtable; it will not prevent multiple rows from temptable from conflicting with each other. This is because the SELECT is calculated once based on the initial state of realtable, not re-calculated after each row is inserted.
One solution would be to use a GROUP BY or DISTINCT ON in the SELECT query, to omit duplicates, e.g.
INSERT INTO realtable
SELECT DISTINCT ON (added_date, product_name) *
FROM temptable WHERE NOT EXISTS (
SELECT 1 FROM realtable WHERE temptable.added_date = realtable.added_date
AND temptable.product_name = realtable.product_name
)
ORDER BY ???; -- this ORDER BY will determine which of a set of duplicates is kept by the DISTINCT ON

Row number in Sybase tables

Sybase db tables do not have a concept of self updating row numbers. However , for one of the modules , I require the presence of rownumber corresponding to each row in the database such that max(Column) would always tell me the number of rows in the table.
I thought I'll introduce an int column and keep updating this column to keep track of the row number. However I'm having problems in updating this column in case of deletes. What sql should I use in delete trigger to update this column?
You can easily assign a unique number to each row by using an identity column. The identity can be a numeric or an integer (in ASE12+).
This will almost do what you require. There are certain circumstances in which you will get a gap in the identity sequence. (These are called "identity gaps", the best discussion on them is here). Also deletes will cause gaps in the sequence as you've identified.
Why do you need to use max(col) to get the number of rows in the table, when you could just use count(*)? If you're trying to get the last row from the table, then you can do
select * from table where column = (select max(column) from table).
Regarding the delete trigger to update a manually managed column, I think this would be a potential source of deadlocks, and many performance issues. Imagine you have 1 million rows in your table, and you delete row 1, that's 999999 rows you now have to update to subtract 1 from the id.
Delete trigger
CREATE TRIGGER tigger ON myTable FOR DELETE
AS
update myTable
set id = id - (select count(*) from deleted d where d.id < t.id)
from myTable t
To avoid locking problems
You could add an extra table (which joins to your primary table) like this:
CREATE TABLE rowCounter
(id int, -- foreign key to main table
rownum int)
... and use the rownum field from this table.
If you put the delete trigger on this table then you would hugely reduce the potential for locking problems.
Approximate solution?
Does the table need to keep its rownumbers up to date all the time?
If not, you could have a job which runs every minute or so, which checks for gaps in the rownum, and does an update.
Question: do the rownumbers have to reflect the order in which rows were inserted?
If not, you could do far fewer updates, but only updating the most recent rows, "moving" them into gaps.
Leave a comment if you would like me to post any SQL for these ideas.
I'm not sure why you would want to do this. You could experiment with using temporary tables and "select into" with an Identity column like below.
create table test
(
col1 int,
col2 varchar(3)
)
insert into test values (100, "abc")
insert into test values (111, "def")
insert into test values (222, "ghi")
insert into test values (300, "jkl")
insert into test values (400, "mno")
select rank = identity(10), col1 into #t1 from Test
select * from #t1
delete from test where col2="ghi"
select rank = identity(10), col1 into #t2 from Test
select * from #t2
drop table test
drop table #t1
drop table #t2
This would give you a dynamic id (of sorts)