SQL Insertion without duplication

Is there a specific command in SQL Server to INSERT a lot of rows with the condition that a row which already exists in the table is not duplicated during insertion?
Edit:
With SqlBulkCopy, I'd like to avoid the exception thrown because a row is already in the table.

You can use the MERGE command for this. Example usage:
CREATE TABLE #A(
[id] [int] NOT NULL PRIMARY KEY CLUSTERED,
[C] [varchar](200) NOT NULL)
MERGE #A AS target
USING (SELECT 3, 'C') AS source (id, C)
ON (target.id = source.id)
/*Uncomment for Upsert Semantics
WHEN MATCHED THEN
UPDATE SET C = source.C */
WHEN NOT MATCHED THEN
INSERT (id, C)
VALUES (source.id, source.C);
Edit: Since you say in your edit that this is for a bulk copy, you could also investigate the "Ignore Duplicate Keys" option (IGNORE_DUP_KEY = ON) on your index.
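For the bulk-copy case, one common pattern is to load into a key-less staging table first and then copy across only the rows that are missing. The sketch below illustrates that idea with Python's built-in sqlite3 module as a runnable stand-in, not with SQL Server or SqlBulkCopy; the table names `a` and `staging` and the helper `load_without_duplicates` are assumptions made for illustration.

```python
import sqlite3

def load_without_duplicates(conn, rows):
    """Stage the incoming rows, then copy only ids not already present in `a`."""
    cur = conn.cursor()
    # The staging table has no key, so the bulk load itself cannot fail on duplicates.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS staging (id INTEGER, c TEXT)")
    cur.execute("DELETE FROM staging")
    cur.executemany("INSERT INTO staging (id, c) VALUES (?, ?)", rows)
    # NOT EXISTS skips ids already in the target; GROUP BY folds duplicates in the batch.
    cur.execute(
        """
        INSERT INTO a (id, c)
        SELECT s.id, MIN(s.c)
        FROM staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.id = s.id)
        GROUP BY s.id
        """
    )
    conn.commit()

# Hypothetical sample data: id 3 already exists, id 4 appears twice in the batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER NOT NULL PRIMARY KEY, c TEXT NOT NULL)")
conn.execute("INSERT INTO a VALUES (3, 'C')")
load_without_duplicates(conn, [(3, 'C'), (4, 'D'), (4, 'D')])
```

In SQL Server the same shape would be a bulk copy into a staging table followed by the MERGE shown above with the staging table as its source.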

How to do it in T-SQL is discussed here (article is a bit old though)

Related

Updating SQL Server table with composite key

I have a SQL Server table with three columns; the first two columns are the primary key. I'm writing a stored procedure that updates the last two columns en masse. It works fine as long as there are no primary key violations, but when there is one it throws an error and stops executing.
How can I make it skip the offending row and continue updating the records that do not cause a primary key violation?
Is there a better way to approach this problem? I'm only doing a simple update with a WHERE clause such as column2 = somevalue AND column3 = somevalue.

In SQL Server you'd use MERGE to upsert (i.e. insert or update):
MERGE mytable
USING (SELECT 1 as key1, 2 as key2, 3 as col1, 4 as col2) AS src
ON (mytable.key1 = src.key1 AND mytable.key2 = src.key2)
WHEN MATCHED THEN
UPDATE SET col1 = src.col1, col2 = src.col2
WHEN NOT MATCHED THEN
INSERT (key1, key2, col1, col2) VALUES (src.key1, src.key2, src.col1, src.col2);

There is nothing inherently wrong with your question, despite the rather loud protestations. Your question is confusing, especially when you refer to columns by position. That is a big no-no. So, a script that demonstrates your problem is generally the best way to both demonstrate your problem and get useful suggestions.
The short answer to your question is: you can't. A statement either succeeds or fails as a whole. If you want to update each row individually and ignore certain errors, then you need to write your T-SQL to do that.
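As a rough illustration of what "update each row individually and ignore certain errors" can look like, here is a sketch using Python's built-in sqlite3 module rather than T-SQL. The table `x`, the row `(2, 1, 'frob')` and the helper name are made up for the example; in T-SQL the equivalent would be a loop or cursor with TRY...CATCH around each UPDATE.

```python
import sqlite3

def update_rows_individually(conn, changes):
    """Apply each (a, b, new_b, new_c) change on its own; skip the ones that collide."""
    skipped = []
    for a, b, new_b, new_c in changes:
        try:
            conn.execute(
                "UPDATE x SET b = ?, c = ? WHERE a = ? AND b = ?",
                (new_b, new_c, a, b),
            )
        except sqlite3.IntegrityError:
            # Primary key violation: only this one statement is undone, so carry on.
            skipped.append((a, b))
    conn.commit()
    return skipped

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE x (a INTEGER NOT NULL, b INTEGER NOT NULL, c TEXT NOT NULL, "
    "PRIMARY KEY (a, b))"
)
conn.executemany(
    "INSERT INTO x VALUES (?, ?, ?)",
    [(1, 1, 'test'), (1, 2, 'zork'), (2, 1, 'frob')],
)
# Moving (1, 1) to (1, 2) collides with an existing row; moving (2, 1) to (2, 2) does not.
skipped = update_rows_individually(conn, [(1, 1, 2, 'dork'), (2, 1, 2, 'dork')])
```

The caller gets back the keys that were skipped, which is usually worth logging, since a silently ignored violation tends to hide the kind of data problem discussed below.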
And despite the protests (again), there are situations where it is necessary to update columns that are part of the primary key. It is unusual - very unusual - but you should also be wary of any absolute statement about T-SQL. When you find yourself doing unusual things, you should review your schema (and your approach) because it is quite possible that there are better ways to accomplish your goal.
And in this case, I suggest that you SHOULD really think about what you are trying to accomplish. If you want to update a set of rows in a particular way and the statement fails, that means there is a flaw somewhere! Typically, this error implies that your update logic is not correct. Perhaps you assume something about your data that is not accurate? It is impossible to know from a distance. The error message will tell you what set of values caused the conflict, so that should give you sufficient information to investigate. As another tool, write a select statement that demonstrates your proposed update and look for the values in the error message. E.g.
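set nocount on;
-- composite primary key on (a, b)
create table #x (a smallint not null, b smallint not null, c varchar(10) not null, constraint xx primary key(a, b));
insert #x (a, b, c) values (1, 1, 'test'), (1, 2, 'zork');
select * from #x;
-- fails with a primary key violation: both rows would end up with key (1, 2)
update #x set b = 2, c = 'dork';
-- preview of the proposed update: look for repeated (a, new_b) pairs
select a, b, c, cast(2 as smallint) as new_b, 'dork' as new_c
from #x
order by a, new_b;
drop table #x;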
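The preview query above can be taken one step further by grouping on the proposed key, so that only the colliding values are returned. This is a sketch in Python with the built-in sqlite3 module standing in for SQL Server; the helper name `find_key_collisions` is an assumption, and the data is the same two rows as in the script.

```python
import sqlite3

def find_key_collisions(conn, new_b):
    """Return (a, new_b, count) for every proposed key that several rows would share."""
    return conn.execute(
        """
        SELECT a, ? AS new_b, COUNT(*) AS rows_sharing_key
        FROM x
        GROUP BY a            -- new_b is constant, so the proposed key is (a, new_b)
        HAVING COUNT(*) > 1
        ORDER BY a
        """,
        (new_b,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE x (a INTEGER NOT NULL, b INTEGER NOT NULL, c TEXT NOT NULL, "
    "PRIMARY KEY (a, b))"
)
conn.executemany("INSERT INTO x VALUES (?, ?, ?)", [(1, 1, 'test'), (1, 2, 'zork')])
collisions = find_key_collisions(conn, 2)
```

An empty result means the update would not create duplicate keys among the rows being changed; it does not check for collisions with rows outside the update.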

Oracle set based insert vs set based merge performance

We're using Oracle 11g at the moment without Enterprise (not an option unfortunately).
Let's say I have a table with a constant number of rows (say, 2000). Let's call it data_source.
I want to insert some columns of this table into another table, data_dest. I'm using all the records from the source table.
In other words, I would like to insert this set
select data_source.col1, data_source.col2, ... data_source.colN
from data_source
Which would be faster in this case:
insert into data_dest
select data_source.col1, data_source.col2, ... data_source.colN
from data_source
OR
merge into data_dest dd
using data_source ds
on (dd.col1 = ds.col1) -- let's assume these are the matching columns
when not matched then
insert (col1,col2...)
values(ds.col1,ds.col2...)
EDIT 1:
We can assume there are no primary keys violations from the insert.
In other words we can assume that insert will successfully insert all of the rows and so will merge.

The insert is very likely faster because it does not require a join on the two tables.
That said, the two queries are not equivalent. Assuming that col1 is defined as the primary key, the insert will throw an error if data_source contains a value in col1 that is already in data_dest. Because the merge compares the data in the two tables and then inserts only the rows that don't already exist, it won't throw a primary key violation (provided data_source itself contains no duplicate col1 values).
An insert that would be equivalent to the merge would be:
INSERT INTO data_dest
SELECT data_source.col1, data_source.col2, ... data_source.colN
FROM data_source
WHERE NOT EXISTS
(SELECT *
FROM data_dest
WHERE data_source.col1 = data_dest.col1)
It's likely that the plan for this insert will be very similar (if not identical) to the plan for the merge and the performance would be indistinguishable.
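To make the difference between the two statements concrete, here is a small sketch using Python's built-in sqlite3 module in place of Oracle. It says nothing about performance; it only shows that the plain insert and the guarded insert agree when there are no key conflicts and diverge when there are. The two-column tables and the helper names are assumptions for the example.

```python
import sqlite3

PLAIN_INSERT = "INSERT INTO data_dest SELECT col1, col2 FROM data_source"
GUARDED_INSERT = """
    INSERT INTO data_dest
    SELECT col1, col2 FROM data_source
    WHERE NOT EXISTS (SELECT 1 FROM data_dest WHERE data_dest.col1 = data_source.col1)
"""

def fresh_tables(dest_rows):
    """Build a new in-memory database with two source rows and the given target rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data_source (col1 INTEGER PRIMARY KEY, col2 TEXT)")
    conn.execute("CREATE TABLE data_dest (col1 INTEGER PRIMARY KEY, col2 TEXT)")
    conn.executemany("INSERT INTO data_source VALUES (?, ?)", [(1, 'a'), (2, 'b')])
    conn.executemany("INSERT INTO data_dest VALUES (?, ?)", dest_rows)
    return conn

def try_insert(sql, dest_rows):
    """Run one insert variant; return the target row count, or 'error' on a key violation."""
    conn = fresh_tables(dest_rows)
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return "error"
    return conn.execute("SELECT COUNT(*) FROM data_dest").fetchone()[0]
```

With an empty target both variants insert two rows; with `(2, 'old')` already in the target, the plain insert fails while the guarded one adds only the missing row.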

What happens with duplicates when inserting multiple rows?

I am running a Python script that inserts a large amount of data into a Postgres database. I use a single query to perform multiple row inserts:
INSERT INTO table (col1,col2) VALUES ('v1','v2'),('v3','v4') ... etc
I was wondering what would happen if it hits a duplicate key for the insert. Will it stop the entire query and throw an exception? Or will it merely ignore the insert of that specific row and move on?

The INSERT will just insert all rows and nothing special will happen, unless you have some kind of constraint disallowing duplicate / overlapping values (PRIMARY KEY, UNIQUE, CHECK or EXCLUDE constraint) - which you did not mention in your question. But that's what you are probably worried about.
Assuming a UNIQUE or PK constraint on (col1,col2), you are dealing with a textbook UPSERT situation. Many related questions and answers to find here.
Generally, if any constraint is violated, an exception is raised which (unless trapped in a subtransaction, as is possible in a procedural server-side language like PL/pgSQL) will roll back not only the statement, but the whole transaction.
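The all-or-nothing behaviour of a multi-row INSERT can be seen from a Python script with a few lines. The sketch below uses the built-in sqlite3 module as a stand-in for Postgres, so driver details differ: with psycopg2 the exception class would be a different one, and Postgres aborts the whole transaction rather than just the statement. The values `'v5', 'v6'` and the helper name are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 TEXT, col2 TEXT, PRIMARY KEY (col1, col2))")
conn.execute("INSERT INTO tbl VALUES ('v3', 'v4')")  # this row already exists
conn.commit()

def insert_batch(conn):
    """Try one multi-row INSERT; return True if it was committed, False if rolled back."""
    try:
        conn.execute(
            "INSERT INTO tbl (col1, col2) VALUES ('v1', 'v2'), ('v3', 'v4'), ('v5', 'v6')"
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The duplicate ('v3', 'v4') makes the whole statement fail, including ('v1', 'v2').
        conn.rollback()
        return False
```

After a failed call the table still holds only the original row, which is why the techniques below filter out existing rows before or during the insert.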
Without concurrent writes
I.e.: No other transactions will try to write to the same table at the same time.
Exclude rows that are already in the table with WHERE NOT EXISTS ... or any other applicable technique:
Select rows which are not present in other table
And don't forget to remove duplicates within the inserted set as well, which would not be excluded by the semi-anti-join WHERE NOT EXISTS ...
One technique to deal with both at once would be EXCEPT:
INSERT INTO tbl (col1, col2)
VALUES
(text 'v1', text 'v2') -- explicit type cast may be needed in 1st row
, ('v3', 'v4')
, ('v3', 'v4') -- beware of dupes in source
EXCEPT SELECT col1, col2 FROM tbl;
EXCEPT without the key word ALL folds duplicate rows in the source. If you know there are no dupes, or you don't want to fold duplicates silently, use EXCEPT ALL (or one of the other techniques). See:
Using EXCEPT clause in PostgreSQL
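Driven from a Python script, the EXCEPT approach might look like the sketch below. It uses the built-in sqlite3 module as a stand-in for Postgres and a temporary table named `incoming` (an assumption) to hold the batch. SQLite's EXCEPT also folds duplicates, but it has no EXCEPT ALL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 TEXT, col2 TEXT, PRIMARY KEY (col1, col2))")
conn.execute("INSERT INTO tbl VALUES ('v1', 'v2')")  # already present in the target

# Batch to insert: one row that already exists and one row that is listed twice.
conn.execute("CREATE TEMP TABLE incoming (col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO incoming VALUES (?, ?)",
    [('v1', 'v2'), ('v3', 'v4'), ('v3', 'v4')],
)

# EXCEPT removes rows already in tbl and folds the duplicate ('v3', 'v4') in one pass.
conn.execute(
    """
    INSERT INTO tbl (col1, col2)
    SELECT col1, col2 FROM incoming
    EXCEPT
    SELECT col1, col2 FROM tbl
    """
)
conn.commit()
rows = conn.execute("SELECT col1, col2 FROM tbl ORDER BY col1").fetchall()
```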
Generally, if the target table is big, WHERE NOT EXISTS in combination with DISTINCT on the source will probably be faster:
INSERT INTO tbl (col1, col2)
SELECT *
FROM (
SELECT DISTINCT *
FROM (
VALUES
(text 'v1', text 'v2')
, ('v3', 'v4')
, ('v3', 'v4') -- dupes in source
) t(c1, c2)
) t
WHERE NOT EXISTS (
SELECT FROM tbl
WHERE col1 = t.c1 AND col2 = t.c2
);
If there can be many dupes, it pays to fold them in the source first. Else use one subquery less.
Related:
Select rows which are not present in other table
With concurrent writes
Use the Postgres UPSERT implementation INSERT ... ON CONFLICT ... in Postgres 9.5 or later:
INSERT INTO tbl (col1,col2)
SELECT DISTINCT * -- still can't insert the same row more than once
FROM (
VALUES
(text 'v1', text 'v2')
, ('v3','v4')
, ('v3','v4') -- you still need to fold dupes in source!
) t(c1, c2)
ON CONFLICT DO NOTHING; -- ignores rows with *any* conflict!
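From Python, the same clause can be used with parameterized rows. The sketch below runs against the built-in sqlite3 module (which accepts ON CONFLICT DO NOTHING from SQLite 3.24 on) rather than Postgres; the helper name is an assumption. Note that `executemany` issues one INSERT per row, so a duplicate inside the batch simply conflicts with the row inserted just before it.

```python
import sqlite3

def insert_ignoring_conflicts(conn, rows):
    """Insert rows, silently skipping any that hit a key conflict; return how many were added."""
    before = conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]
    conn.executemany(
        "INSERT INTO tbl (col1, col2) VALUES (?, ?) ON CONFLICT DO NOTHING",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]
    return after - before

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 TEXT, col2 TEXT, PRIMARY KEY (col1, col2))")
conn.execute("INSERT INTO tbl VALUES ('v1', 'v2')")
added = insert_ignoring_conflicts(conn, [('v1', 'v2'), ('v3', 'v4'), ('v3', 'v4')])
```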
Further reading:
How to use RETURNING with ON CONFLICT in PostgreSQL?
How do I insert a row which contains a foreign key?
Documentation:
The manual
The commit page
The Postgres Wiki page
Craig's reference answer for UPSERT problems:
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?

Will it stop the entire query and throw an exception? Yes.
To avoid that, have a look at the SO question here, which describes how to keep Postgres from throwing an error for multi-row inserts when some of the inserted keys already exist in the database.
You should basically do this:
INSERT INTO DBtable
(id, field1)
SELECT 1, 'value'
WHERE
NOT EXISTS (
SELECT id FROM DBtable WHERE id = 1
);

MERGE Violation of PRIMARY KEY constraint

I have a SQL Server 2008 many-to-many relationship table (Assets) with two columns:
AssetId (PK, FK, uniqueidentifier, not null)
AssetCategoryId (PK, FK, int, not null)
In my project, I need to take rows from this table, and insert them into a replicated database periodically. So, I have two databases that are exactly the same (constraints included).
In order to "copy" from one database to the other, I use a MERGE statement with a temp table. I insert up to 50 records into the temp table, then merge the temp table with the Assets table I am copying into as follows:
CREATE TABLE #Assets (AssetId UniqueIdentifier, AssetCategoryId Int);
INSERT INTO #Assets (AssetId, AssetCategoryId) VALUES ('ed05bac3-7a92-46aa-8822-2d882b137597', 44), ('dc5e3082-e2eb-4bdf-a640-94e0f59411ed', 22) ... ;
MERGE INTO Assets WITH (HOLDLOCK) AS Target
USING #Assets AS Source
ON Target.AssetId = Source.AssetId AND Target.AssetCategoryId = Source.AssetCategoryId
WHEN MATCHED THEN
UPDATE SET ...
WHEN NOT MATCHED BY Target THEN
INSERT (AssetId,AssetCategoryId) VALUES (Source.AssetId,Source.AssetCategoryId);
This works great, for the most part. However, once in a while, I get the error:
Violation of PRIMARY KEY constraint 'PK_Assets'. Cannot insert
duplicate key in object 'dbo.Assets'. The duplicate key value is
(dc5e3082-e2eb-4bdf-a640-94e0f59411ed, 22). The statement has been
terminated.
When I check in the Assets table, no such record exists... so I am confused how I would be inserting a duplicate key.
Any idea what is going on here?
UPDATE
When testing, it runs successfully 6 times, inserting 300 rows. On the 7th try, it always gives the same error shown above. Furthermore, when I INSERT (dc5e3082-e2eb-4bdf-a640-94e0f59411ed, 22) by itself, it works fine. My test is then able to continue and insert the remaining rows with no errors.

You need to add a HOLDLOCK on your MERGE statement. Try the following:
MERGE INTO Assets WITH (HOLDLOCK) AS Target
...
This avoids the race condition that you are running into. See more info here
EDIT
Based on your update, the only other thing I can think of is that your temp table might have a duplicate record in it. Can you double check?
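One way to double check the temp table is to group it by the key columns and look for counts above one. The sketch below does this with Python's built-in sqlite3 module instead of SQL Server; the table is called `temp_assets` because SQLite does not use the `#` prefix, and the duplicated second row is invented to show what a hit looks like.

```python
import sqlite3

def duplicated_keys(conn):
    """Return (AssetId, AssetCategoryId, copies) for every key staged more than once."""
    return conn.execute(
        """
        SELECT AssetId, AssetCategoryId, COUNT(*) AS copies
        FROM temp_assets
        GROUP BY AssetId, AssetCategoryId
        HAVING COUNT(*) > 1
        """
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TEMP TABLE temp_assets (AssetId TEXT, AssetCategoryId INTEGER)")
conn.executemany(
    "INSERT INTO temp_assets VALUES (?, ?)",
    [
        ('ed05bac3-7a92-46aa-8822-2d882b137597', 44),
        ('dc5e3082-e2eb-4bdf-a640-94e0f59411ed', 22),
        ('dc5e3082-e2eb-4bdf-a640-94e0f59411ed', 22),  # hypothetical duplicate
    ],
)
dupes = duplicated_keys(conn)
```

If the source contains the same key twice and the target has neither copy, both rows take the NOT MATCHED branch, which would produce exactly this kind of primary key error even though the target held no such row beforehand.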

Does DB2 have an "insert or update" statement?

From my code (Java) I want to ensure that a row exists in the database (DB2) after my code is executed.
My code now does a select and if no result is returned it does an insert. I really don't like this code since it exposes me to concurrency issues when running in a multi-threaded environment.
What I would like to do is to put this logic in DB2 instead of in my Java code.
Does DB2 have an insert-or-update statement? Or anything like it that I can use?
For example:
insertupdate into mytable values ('myid')
Another way of doing it would probably be to always do the insert and catch "SQL-code -803 primary key already exists", but I would like to avoid that if possible.

Yes, DB2 has the MERGE statement, which will do an UPSERT (update or insert).
MERGE INTO target_table USING source_table ON match-condition
{WHEN [NOT] MATCHED
THEN [UPDATE SET ...|DELETE|INSERT VALUES ....|SIGNAL ...]}
[ELSE IGNORE]
See:
http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=/com.ibm.db2.udb.admin.doc/doc/r0010873.htm
https://www.ibm.com/support/knowledgecenter/en/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0010873.html
https://www.ibm.com/developerworks/community/blogs/SQLTips4DB2LUW/entry/merge?lang=en

I found this thread because I really needed a one-liner for DB2 INSERT OR UPDATE.
The following syntax seems to work, without requiring a separate temp table.
It works by using VALUES() to create a table structure. The SELECT * seems superfluous IMHO, but without it I get syntax errors.
MERGE INTO mytable AS mt USING (
SELECT * FROM TABLE (
VALUES
(123, 'text')
)
) AS vt(id, val) ON (mt.id = vt.id)
WHEN MATCHED THEN
UPDATE SET val = vt.val
WHEN NOT MATCHED THEN
INSERT (id, val) VALUES (vt.id, vt.val)
;
If you have to insert more than one row, the VALUES part can be repeated without having to duplicate the rest.
VALUES
(123, 'text'),
(456, 'more')
The result is a single statement that can INSERT OR UPDATE one or many rows presumably as an atomic operation.
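For comparison, here is what a single-statement insert-or-update looks like when driven from application code. This is only a sketch: it uses Python's built-in sqlite3 module and SQLite's ON CONFLICT ... DO UPDATE clause, not DB2's MERGE, and the helper name `upsert` is an assumption. The point it illustrates is the same: the existence check and the write happen inside one statement instead of a select followed by an insert.

```python
import sqlite3

def upsert(conn, rows):
    """Insert each (id, val) pair, or overwrite val when the id already exists."""
    conn.executemany(
        """
        INSERT INTO mytable (id, val) VALUES (?, ?)
        ON CONFLICT (id) DO UPDATE SET val = excluded.val
        """,
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, val TEXT)")
upsert(conn, [(123, 'text'), (456, 'more')])   # both rows are new: inserted
upsert(conn, [(123, 'changed')])               # id 123 exists: updated in place
```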

This response is to hopefully fully answer the query MrSimpleMind had in use-update-and-insert-in-same-query and to provide a working simple example of the DB2 MERGE statement with a scenario of inserting AND updating in one go (record with ID 2 is updated and record ID 3 inserted).
CREATE TABLE STAGE.TEST_TAB ( ID INTEGER, DATE DATE, STATUS VARCHAR(10) );
COMMIT;
INSERT INTO TEST_TAB VALUES (1, '2013-04-14', NULL), (2, '2013-04-15', NULL); COMMIT;
MERGE INTO TEST_TAB T USING (
SELECT
3 NEW_ID,
CURRENT_DATE NEW_DATE,
'NEW' NEW_STATUS
FROM
SYSIBM.DUAL
UNION ALL
SELECT
2 NEW_ID,
NULL NEW_DATE,
'OLD' NEW_STATUS
FROM
SYSIBM.DUAL
) AS S
ON
S.NEW_ID = T.ID
WHEN MATCHED THEN
UPDATE SET
(T.STATUS) = (S.NEW_STATUS)
WHEN NOT MATCHED THEN
INSERT
(T.ID, T.DATE, T.STATUS) VALUES (S.NEW_ID, S.NEW_DATE, S.NEW_STATUS);
COMMIT;

Another way is to execute these two queries. It's simpler than creating a MERGE statement:
update TABLE_NAME set FIELD_NAME=xxxxx where MyID=XXX;
INSERT INTO TABLE_NAME (MyField1,MyField2)
SELECT xxx, xxxxx FROM SYSIBM.SYSDUMMY1
WHERE NOT EXISTS(select 1 from TABLE_NAME where MyId=xxxx);
The first query just updates the field you need, if the MyId exists.
The second inserts the row into the db if MyId does not exist.
The result is that only one of the two queries actually changes data in your db.
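The two-statement approach can be wrapped in one transaction so that both statements succeed or fail together. The sketch below uses Python's built-in sqlite3 module as a stand-in for DB2 and JDBC; the column names `my_id` and `field_name` and the helper are assumptions. It does not by itself remove the concurrency concern raised in the question, since another session could still insert the same id between the two statements.

```python
import sqlite3

def update_or_insert(conn, my_id, value):
    """Run UPDATE then a guarded INSERT; return (rows_updated, rows_inserted)."""
    with conn:  # commits if both statements succeed, rolls back if either raises
        updated = conn.execute(
            "UPDATE table_name SET field_name = ? WHERE my_id = ?",
            (value, my_id),
        ).rowcount
        inserted = conn.execute(
            """
            INSERT INTO table_name (my_id, field_name)
            SELECT ?, ?
            WHERE NOT EXISTS (SELECT 1 FROM table_name WHERE my_id = ?)
            """,
            (my_id, value, my_id),
        ).rowcount
    return updated, inserted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (my_id INTEGER PRIMARY KEY, field_name TEXT)")
```

For any given id, exactly one of the two counts is 1 and the other is 0.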

I started with a Hibernate project, where Hibernate lets you call saveOrUpdate().
I converted that project into a JDBC project, and the problem was with save and update.
I wanted to save and update at the same time using JDBC.
So I did some research and came across ON DUPLICATE KEY UPDATE:
String sql = "INSERT INTO tblstudent (firstName, lastName, gender) VALUES (?, ?, ?) "
    + "ON DUPLICATE KEY UPDATE "
    + "firstName = VALUES(firstName), "
    + "lastName = VALUES(lastName), "
    + "gender = VALUES(gender)";
The issue with the above code was that it updated the primary key twice, which is consistent with the MySQL documentation:
The affected rows is just a return code. 1 row means you inserted, 2 means you updated, 0 means nothing happened.
So I introduced an explicit id column and incremented it by 1 myself. Now my code, not MySQL, was incrementing the value of id.
// four columns, so four placeholders
String sql = "INSERT INTO tblstudent (id, firstName, lastName, gender) VALUES (?, ?, ?, ?) "
    + "ON DUPLICATE KEY UPDATE "
    + "id = id + 1, "
    + "firstName = VALUES(firstName), "
    + "lastName = VALUES(lastName), "
    + "gender = VALUES(gender)";
The above code worked for me for both insert and update.
Hope it works for you as well.
