Avoid duplicates in INSERT INTO SELECT query in SQL Server - sql

I have the following two tables:
Table1
----------
ID Name
1 A
2 B
3 C
Table2
----------
ID Name
1 Z
I need to insert data from Table1 to Table2. I can use the following syntax:
INSERT INTO Table2(Id, Name) SELECT Id, Name FROM Table1
However, in my case, duplicate IDs might exist in Table2 (in my case, it's just "1") and I don't want to copy that again as that would throw an error.
I can write something like this:
IF NOT EXISTS(SELECT 1 FROM Table2 WHERE Id=1)
INSERT INTO Table2 (Id, name) SELECT Id, name FROM Table1
ELSE
INSERT INTO Table2 (Id, name) SELECT Id, name FROM Table1 WHERE Table1.Id<>1
Is there a better way to do this without using IF - ELSE? I want to avoid two INSERT INTO-SELECT statements based on some condition.

Using NOT EXISTS:
INSERT INTO TABLE_2
(id, name)
SELECT t1.id,
t1.name
FROM TABLE_1 t1
WHERE NOT EXISTS(SELECT id
FROM TABLE_2 t2
WHERE t2.id = t1.id)
Using NOT IN:
INSERT INTO TABLE_2
(id, name)
SELECT t1.id,
t1.name
FROM TABLE_1 t1
WHERE t1.id NOT IN (SELECT id
FROM TABLE_2)
Using LEFT JOIN/IS NULL:
INSERT INTO TABLE_2
(id, name)
SELECT t1.id,
t1.name
FROM TABLE_1 t1
LEFT JOIN TABLE_2 t2 ON t2.id = t1.id
WHERE t2.id IS NULL
Of the three options, LEFT JOIN/IS NULL is the least efficient. See this link for more details.
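As a quick check, the NOT EXISTS form can be tried against the sample data from the question. This is a minimal sketch using temporary tables; #Table1 and #Table2 are stand-ins, and the primary key on the target is an assumption based on the error the question describes:

```sql
-- Stand-in temp tables mirroring the question's sample data.
CREATE TABLE #Table1 (Id INT PRIMARY KEY, Name VARCHAR(10));
CREATE TABLE #Table2 (Id INT PRIMARY KEY, Name VARCHAR(10));

INSERT INTO #Table1 (Id, Name) VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO #Table2 (Id, Name) VALUES (1, 'Z');

-- Copy only the rows whose Id is not already present in the target.
INSERT INTO #Table2 (Id, Name)
SELECT t1.Id, t1.Name
FROM #Table1 AS t1
WHERE NOT EXISTS (SELECT 1 FROM #Table2 AS t2 WHERE t2.Id = t1.Id);

-- #Table2 now holds (1, 'Z'), (2, 'B'), (3, 'C'); the existing row is untouched.
SELECT Id, Name FROM #Table2 ORDER BY Id;

DROP TABLE #Table1;
DROP TABLE #Table2;
```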

In MySQL you can do this:
INSERT IGNORE INTO Table2(Id, Name) SELECT Id, Name FROM Table1
Does SQL Server have anything similar?

I just had a similar problem, the DISTINCT keyword works magic:
INSERT INTO Table2(Id, Name) SELECT DISTINCT Id, Name FROM Table1

I was facing the same problem recently.
Here's what worked for me in MS SQL Server 2017.
The primary key should be set on ID in Table 2.
The columns and column properties should, of course, be the same in both tables. The script below will work the first time you run it; the duplicate ID in Table 1 will not be inserted.
If you run it a second time, you will get a
Violation of PRIMARY KEY constraint error.
This is the code:
Insert into Table_2
Select distinct *
from Table_1
where table_1.ID >1

My solution for a similar issue was to ignore duplicates on the unique index, as suggested by IanC here, by creating the index with the option WITH IGNORE_DUP_KEY.
In backward-compatible syntax, WITH IGNORE_DUP_KEY is equivalent to WITH IGNORE_DUP_KEY = ON.
Ref.: index_option
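For illustration, the option might be applied as follows. This is only a sketch: the index name UX_Table2_Id is hypothetical, and it assumes Table2.Id does not already carry a primary key (a primary key or unique constraint can be declared with the same option instead).

```sql
-- Hypothetical index name; assumes no existing primary key on Table2.Id.
CREATE UNIQUE INDEX UX_Table2_Id
    ON Table2 (Id)
    WITH (IGNORE_DUP_KEY = ON);

-- Rows whose Id already exists are skipped with the warning
-- "Duplicate key was ignored." instead of failing the whole statement.
INSERT INTO Table2 (Id, Name)
SELECT Id, Name
FROM Table1;
```

The trade-off is that the option applies to every insert into the table, not just this one, so duplicates are silently dropped everywhere.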

In SQL Server you can create a unique index on the table covering the columns that need to be unique.

A little off topic, but if you want to migrate the data to a new table, and the possible duplicates are in the original table, and the column possibly duplicated is not an id, a GROUP BY will do:
INSERT INTO TABLE_2
(name)
SELECT t1.name
FROM TABLE_1 t1
GROUP BY t1.name

In my case, I had duplicate IDs in the source table, so none of the proposals worked. I don't care about performance, it's just done once.
To solve this I took the records one by one with a cursor to ignore the duplicates.
So here's the code example:
DECLARE @c1 AS VARCHAR(12);
DECLARE @c2 AS VARCHAR(250);
DECLARE @c3 AS VARCHAR(250);
DECLARE MY_cursor CURSOR STATIC FOR
Select
c1,
c2,
c3
from T2
where ....;
OPEN MY_cursor
FETCH NEXT FROM MY_cursor INTO @c1, @c2, @c3
WHILE @@FETCH_STATUS = 0
BEGIN
-- insert the row only if its key is not in the target yet
if (select count(1)
from T1
where a1 = @c1
and a2 = @c2
) = 0
INSERT INTO T1
values (@c1, @c2, @c3)
FETCH NEXT FROM MY_cursor INTO @c1, @c2, @c3
END
CLOSE MY_cursor
DEALLOCATE MY_cursor
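A set-based alternative may also cope with duplicates inside the source table. The following is a sketch, not a drop-in replacement: a3 is a hypothetical name for T1's third column, the source filter that the answer elides is left out, and ordering by c3 is an arbitrary choice of which duplicate to keep.

```sql
-- a3 is a hypothetical column name; add the elided source filter as needed.
WITH src AS (
    SELECT c1, c2, c3,
           ROW_NUMBER() OVER (PARTITION BY c1, c2 ORDER BY c3) AS rn
    FROM T2
)
INSERT INTO T1 (a1, a2, a3)
SELECT s.c1, s.c2, s.c3
FROM src AS s
WHERE s.rn = 1                        -- keep one row per (c1, c2) in the source
  AND NOT EXISTS (SELECT 1
                  FROM T1 AS t
                  WHERE t.a1 = s.c1
                    AND t.a2 = s.c2); -- skip keys already in the target
```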

I used a MERGE query to fill a table without duplications.
The problem I had was a two-column key in the tables (Code, Value),
and the EXISTS query was very slow.
The MERGE executed much faster (more than 100 times).
examples for MERGE query
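A minimal insert-only MERGE along those lines might look like this. TargetTable and SourceTable are hypothetical names; (Code, Value) is the two-column key described above. The sketch assumes the source itself contains no repeated (Code, Value) pairs, since repeated source rows that are missing from the target would each be inserted.

```sql
-- TargetTable and SourceTable are hypothetical names.
MERGE INTO TargetTable AS tgt
USING SourceTable AS src
    ON tgt.Code = src.Code
   AND tgt.Value = src.Value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Code, Value)
    VALUES (src.Code, src.Value);   -- MERGE must end with a semicolon
```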

For one table, it works perfectly to create a single unique index over multiple fields. A simple "INSERT IGNORE" (MySQL) will then skip rows in which ALL of the indexed fields (7 in this case) have the SAME values as an existing row.
Select the fields in the PMA Structure view and click Unique; a new combined index will be created.

A simple DELETE before the INSERT would suffice:
DELETE FROM Table2 WHERE Id IN (SELECT Id FROM Table1)
INSERT INTO Table2 (Id, name) SELECT Id, name FROM Table1
Swap Table1 and Table2 depending on which table's Id and Name pairing you want to preserve.
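If the DELETE succeeds but the INSERT then fails, Table2 is left without the deleted rows, so it may be worth running both statements as one unit. A minimal sketch, assuming SQL Server, unique Ids in Table1, and a subquery that can return several Ids (hence IN):

```sql
SET XACT_ABORT ON;  -- any run-time error rolls the whole transaction back

BEGIN TRANSACTION;

-- Remove target rows that would collide with incoming rows.
DELETE FROM Table2
WHERE Id IN (SELECT Id FROM Table1);

-- Copy everything from the source (assumes Ids in Table1 are unique).
INSERT INTO Table2 (Id, Name)
SELECT Id, Name
FROM Table1;

COMMIT TRANSACTION;
```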

Related

SQL migration script - insert into select with output ID

I am using PostgreSQL and Flyway to perform a data migration in an application. The idea is to move rows from one table to another and keep the link between the old and new table in the old table. So, let's say we have Table_1 with columns (id, name, user_id) and a new Table_2 with similar columns (id2, name2, user_id2).
Now, the first step will be to add a column to Table_1 that will store the id of its counterpart in new Table_2. So:
alter table Table_1 add column if not exists migrated_table_2_id int;
And now I would like to write an sql that will perform the migration of data from Table_1 to Table_2 and at the same time fill in the id values in the migrated_table_2_id column. So something like:
insert into Table_2 (name2, user_id2) select name, user_id from Table_1;
but also filling in migrated_table_2_id with the id of the newly created row in Table_2.
You can use a data-modifying CTE, assuming that name2, user_id2 or both in combination are unique:
with i as (
    insert into Table_2 (name2, user_id2)
    select name, user_id
    from Table_1
    returning id2, name2, user_id2
)
-- join on the rows returned by the insert; the new rows are not yet visible in Table_2 itself
update Table_1 t1
set migrated_table_2_id = i.id2
from i
where i.name2 = t1.name and i.user_id2 = t1.user_id;

Provide the values for a 'like' function from a specific column in a Table?

I am using SQL Server 2014 and I need a T-SQL query which uses the like function to run on a specific column (c1) of a Table (t1) to find out if it contains one of the codes from a list of codes found in the column (c2) of another Table (t2).
To simplify, here is the scenario and the expected output:
Table t1:
ID Notes
101 (free text in this column)
102 ...
... ...
115000 ...
Table t2 (list of more than 300 codes):
Code
FR110419
GB150619
...
DE111219
What I am looking for:
SELECT ID
FROM t1
WHERE t1.Notes like (SELECT Code FROM t2)
Since the like operator needs '%' to work, I am confused as to how to construct that line.
I have done some research on StackOverflow and the closest solution I have come across is for a mysql problem: how to use LIKE with column name
Any type of help will be most appreciated.
You seem to be looking for a JOIN:
SELECT ID
FROM t1
INNER JOIN t2 ON t1.Notes LIKE '%' + t2.Code + '%'
If different Codes might appear in the same Note, using an EXISTS condition with a correlated subquery is also an option, as it would avoid duplicating records in the output:
SELECT ID
FROM t1
WHERE EXISTS (
SELECT 1 FROM t2 WHERE t1.Notes LIKE '%' + t2.Code + '%'
)
You can use cross apply with charindex like this:
--Loading data
create table t1 (id varchar(10));
insert into t1 (id) values ('100100'),('200100'),('300100')
insert into t1 (id) values ('100200'),('200200'),('300200')
insert into t1 (id) values ('100300'),('200300'),('300300')
insert into t1 (id) values ('0100'),('0200'),('0300')
insert into t1 (id) values ('00010'),('00020'),('00030')
create table t2 (id varchar(10));
insert into t2 (id) values ('020'),('010')
select t.id
from t1 as t
cross apply t2 as t2
--where charindex(t2.id,t.id) > 0 -- like '%code%': wildcard on both sides
--where charindex(t2.id,t.id) = 1 -- like 'code%': wildcard at the end
where charindex(t2.id,t.id) = len(t.id)-len(t2.id)+1 -- like '%code': wildcard at the beginning (misses rows where the code also occurs earlier in the string)
The only caveat is that if the table is very big, this could be a slow solution.
Building on what's already been posted, you can create an indexed view to really speed things up.
Using CTE6's sample data...
--Loading data
create table t1 (id varchar(10));
insert into t1 (id) values ('100100'),('200100'),('300100')
insert into t1 (id) values ('100200'),('200200'),('300200')
insert into t1 (id) values ('100300'),('200300'),('300300')
insert into t1 (id) values ('0100'),('0200'),('0300')
insert into t1 (id) values ('00010'),('00020'),('00030')
create table t2 (id varchar(10));
insert into t2 (id) values ('020'),('010')
GO
-- The View
CREATE VIEW dbo.vw_t1t2 WITH SCHEMABINDING AS
SELECT t1 = t1.id, t2 = t2.id, cb = COUNT_BIG(*)
FROM dbo.t1 AS t1
CROSS JOIN dbo.t2 AS t2
WHERE CHARINDEX(t2.id,t1.id) > 0
GROUP BY t1.id, t2.id
GO
-- The index (may need to add something else to make UNIQUE)
CREATE UNIQUE CLUSTERED INDEX uq_cl_vwt1t2 ON dbo.vw_t1t2(t1,t2);
GO
This will perform very well for SELECT statements but could slow down data modifications against t1 and t2, so make sure to use the smallest data type possible and only include columns you are certain you need (VARCHAR(10) is good). I include COUNT_BIG(*) because it is required in indexed views that use GROUP BY.
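Reading from the view could then look like the sketch below. The NOEXPAND hint is included on the assumption that the optimizer might otherwise expand the view back into the base-table query (this depends on the SQL Server edition); the filter value '010' is just one of the sample codes.

```sql
-- Read the pre-computed matches straight from the view's clustered index.
SELECT v.t1
FROM dbo.vw_t1t2 AS v WITH (NOEXPAND)
WHERE v.t2 = '010';
```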

duplicating rows through multiple tables(new id/foreign key) with some column modifications

The basic concept is to duplicate rows in table1 where id is between, for example, 100 and 10000,
modify some of the column data, then insert them with a new id.
Table2 references table1.id with a foreign key, table3 references table2.id with a foreign key,
... and tableX references tableX-1.id with a foreign key.
I also have to modify some of the table2..tableX data.
I started to think about writing nested loops; for the first 3 tables it looks like this (in PL/SQL), and maybe it would work:
declare
table1_row table1%rowtype;
table2_row table2%rowtype;
table3_row table3%rowtype;
begin
for t1 in(select * from table1
where id between 100 and 10000)
loop
table1_row:=t1;
table1_row.id:=tableseq.nextval;
table1_row.col1:='asdf';
table1_row.col4:='xxx';
insert into table1 values table1_row;
for t2 in(select * from table2
where foreign_key_id =t1.id)
loop
table2_row:=t2;
table2_row.id:=tableseq.nextval;
table2_row.foreign_key_id:=table1_row.id;
table2_row.col3:='gfdgf';
insert into table2 values table2_row;
for t3 in(select * from table3
where foreign_key_id =t2.id)
loop
table3_row:=t3;
table3_row.id:=tableseq.nextval;
table3_row.foreign_key_id:=table2_row.id;
table3_row.col1:='gdfgdg';
insert into table3 values table3_row;
end loop;
end loop;
end loop;
end;
Any better solutions? With about 10-20 nested loops, it looks awful :(
Thanks in advance.
I believe you can use an insert statement with several subqueries to clean this up. Here is a simpler example but I believe you can extrapolate for your specific case:
insert into table1
(col1, col2, col3, col4, col5)
select 'asdf',
(select table2_data --whatever data from this table you want
from table2
where foreign_key_id =table1.id),
(select table3_data --whatever data from this table you want
from table3
where foreign_key_id =table1.id),
'xxx',
table1.col5
from table1
where table1.id between 100 and 10000
Note, your id column should be set up as an Auto_increment primary key so you shouldn't need it as part of your insert statement. Also, I added "table1.col5" as an example of how to use the same data from the existing row in your duplicated row (as I'm assuming some data you want to be duplicated).

Simple update statement so that all rows are assigned a different value

I'm trying to set a column in one table to a random foreign key for testing purposes.
I attempted using the below query
update table1 set table2Id = (select top 1 table2Id from table2 order by NEWID())
This will get one table2Id at random and assign it as the foreign key in table1 for each row.
It's almost what I want, but I want each row to get a different table2Id value.
I could do this by looping through the rows in table1, but I know there's a more concise way of doing it.
On a test table at my end, your original plan looks as follows.
It just calculates the result once, caches it in a spool, and then replays that result. You could try the following so that SQL Server sees the subquery as correlated and in need of re-evaluating for each outer row.
UPDATE table1
SET table2Id = (SELECT TOP 1 table2Id
FROM table2
ORDER BY Newid(),
table1.table1Id)
For me that gives this plan without the spool.
It is important, however, to correlate on a unique field from table1, so that even if a spool is added it must always be rebound rather than rewound (replaying the last result), as the correlation value will be different for each row.
If the tables are large this will be slow, as the work required is the product of the two tables' row counts (for each row in table1 it needs to do a full scan of table2).
I'm having another go at answering this, since my first answer was incomplete.
As there is no other way to join the two tables until you assign the table2_id you can use row_number to give a temporary key to both table1 and table2.
with
t1 as (
select row_number() over (order by table1_id) as row, table1_id
from table1 )
,
t2 as (
select row_number() over (order by NEWID()) as row, table2_id
from table2 )
update tgt
set table2_id = t2.table2_id
from table1 as tgt
inner join t1 on t1.table1_id = tgt.table1_id
inner join t2 on t2.row = t1.row
select * from table1
SQL Fiddle to test it out: http://sqlfiddle.com/#!6/bf414/12
Broke down and used a loop for it. This worked, although it was very slow.
Select *
Into #Temp
From table1
Declare @Id int
While (Select Count(*) From #Temp) > 0
Begin
Select Top 1 @Id = table1Id From #Temp
update table1 set table2Id = (select top 1 table2Id from table2 order by NEWID()) where table1Id = @Id
Delete #Temp Where table1Id = @Id
End
drop table #Temp
I'm going to assume MS SQL based on top 1:
update table1
set table2Id =
(select top 1 table2Id from table2 tablesample(1 percent))
(sorry, not tested)

How to remove duplicate records in a table?

I've got a table in a testing DB that someone apparently got a little too trigger-happy on when running INSERT scripts to set it up. The schema looks like this:
ID UNIQUEIDENTIFIER
TYPE_INT SMALLINT
SYSTEM_VALUE SMALLINT
NAME VARCHAR
MAPPED_VALUE VARCHAR
It's supposed to have a few dozen rows. It has about 200,000, most of which are duplicates in which TYPE_INT, SYSTEM_VALUE, NAME and MAPPED_VALUE are all identical and ID is not.
Now, I could probably make a script to clean this up that creates a temporary table in memory, uses INSERT .. SELECT DISTINCT to grab all the unique values, TRUNCATE the original table and then copy everything back. But is there a simpler way to do it, like a DELETE query with something special in the WHERE clause?
You don't give your table name, but I think something like this should work. It leaves just the record that happens to have the highest ID in each group of duplicates. You might want to test with the ROLLBACK in first!
BEGIN TRAN
DELETE <table_name>
FROM <table_name> T1
WHERE EXISTS(
SELECT * FROM <table_name> T2
WHERE
T1.TYPE_INT = T2.TYPE_INT AND
T1.SYSTEM_VALUE = T2.SYSTEM_VALUE AND
T1.NAME = T2.NAME AND
T1.MAPPED_VALUE = T2.MAPPED_VALUE AND
T2.ID > T1.ID
)
SELECT * FROM <table_name>
ROLLBACK
Here is a great article on that: Deleting duplicates, which basically uses this pattern:
WITH q AS
(
SELECT d.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS rn
FROM t_duplicate d
)
DELETE
FROM q
WHERE rn > 1
SELECT *
FROM t_duplicate
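Adapted to the schema in the question, the same pattern might look like the sketch below. dbo.MappedValues is a hypothetical table name (the question does not give one), and ordering by ID is an arbitrary way to pick which row of each group survives.

```sql
-- dbo.MappedValues is a hypothetical table name.
WITH q AS
(
    SELECT ID,
           ROW_NUMBER() OVER (PARTITION BY TYPE_INT, SYSTEM_VALUE, NAME, MAPPED_VALUE
                              ORDER BY ID) AS rn
    FROM dbo.MappedValues
)
DELETE FROM q
WHERE rn > 1;   -- keeps the first row of each group of identical values
```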
WITH Duplicates(ID , TYPE_INT, SYSTEM_VALUE, NAME, MAPPED_VALUE )
AS
(
SELECT Min(Id) AS ID, TYPE_INT, SYSTEM_VALUE, NAME, MAPPED_VALUE
FROM T1
GROUP BY TYPE_INT, SYSTEM_VALUE, NAME, MAPPED_VALUE
HAVING Count(Id) > 1
)
DELETE FROM T1
WHERE ID IN (
SELECT T1.Id
FROM T1
INNER JOIN Duplicates
ON T1.TYPE_INT = Duplicates.TYPE_INT
AND T1.SYSTEM_VALUE = Duplicates.SYSTEM_VALUE
AND T1.NAME = Duplicates.NAME
AND T1.MAPPED_VALUE = Duplicates.MAPPED_VALUE
AND T1.Id <> Duplicates.ID
)