Remove duplicates from a table - SQL

The database is PostgreSQL 8.3.
If I write:
SELECT field1, field2, field3, count(*)
FROM table1
GROUP BY field1, field2, field3 HAVING count(*) > 1;
I get some rows with a count over 1. How can I remove the duplicates? I still want to keep one row for each group rather than delete them all.
Example:
1-2-3
1-2-3
1-2-3
2-3-4
4-5-6
Should become :
1-2-3
2-3-4
4-5-6
The only answer I found is there, but I am wondering if I could do it without a hash column.
Warning
I do not have a PK with a unique number, so I can't use the min(...) technique. The PK is the three fields.

This is one of many reasons that all tables should have a primary key (not necessarily an ID number or IDENTITY, but a combination of one or more columns that uniquely identifies a row and which has its uniqueness enforced in the database).
Your best bet is something like this:
SELECT field1, field2, field3
INTO temp_table1
FROM table1
GROUP BY field1, field2, field3
HAVING COUNT(*) > 1;

DELETE T1
FROM table1 T1
INNER JOIN (SELECT field1, field2, field3
            FROM table1
            GROUP BY field1, field2, field3
            HAVING COUNT(*) > 1) SQ
    ON SQ.field1 = T1.field1
   AND SQ.field2 = T1.field2
   AND SQ.field3 = T1.field3;

INSERT INTO table1 (field1, field2, field3)
SELECT field1, field2, field3
FROM temp_table1;

DROP TABLE temp_table1
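The same save/delete/re-insert sequence can be sketched with Python's sqlite3 as a minimal stand-in for the T-SQL above (table name table1 is from the question; columns f1..f3 are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (f1 INT, f2 INT, f3 INT);
    INSERT INTO table1 VALUES (1,2,3),(1,2,3),(1,2,3),(2,3,4),(4,5,6);

    -- One representative row per duplicated group
    CREATE TEMP TABLE dupes AS
        SELECT f1, f2, f3 FROM table1
        GROUP BY f1, f2, f3 HAVING COUNT(*) > 1;

    -- Remove every row belonging to a duplicated group...
    DELETE FROM table1
    WHERE EXISTS (SELECT 1 FROM dupes d
                  WHERE d.f1 = table1.f1 AND d.f2 = table1.f2
                    AND d.f3 = table1.f3);

    -- ...then put a single copy of each back
    INSERT INTO table1 SELECT f1, f2, f3 FROM dupes;
""")
rows = sorted(conn.execute("SELECT * FROM table1"))
print(rows)  # [(1, 2, 3), (2, 3, 4), (4, 5, 6)]
```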

One possible answer is:
CREATE <temporary table> (<correct structure for table being cleaned>);
BEGIN WORK; -- if needed
INSERT INTO <temporary table> SELECT DISTINCT * FROM <source table>;
DELETE FROM <source table>;
INSERT INTO <source table> SELECT * FROM <temporary table>;
COMMIT WORK; -- needed
DROP TABLE <temporary table>;
I'm not sure whether the WORK keyword is needed on the transaction statements, or whether the explicit BEGIN is necessary in PostgreSQL, but the concept applies to any DBMS.
The only thing to beware of is referential constraints and in particular triggered delete operations. If those exist, this may prove less satisfactory.

This will use the OID (object ID), if the table was created with one:
DELETE FROM table1
WHERE OID NOT IN (SELECT MIN(OID)
                  FROM table1
                  GROUP BY field1, field2, field3)
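OIDs are typically disabled by default in PostgreSQL of that era, but SQLite's implicit rowid gives the same keep-the-MIN trick. A quick sketch via Python's sqlite3 (demo column names assumed):

```python
import sqlite3

# SQLite's implicit rowid plays the same role as the OID here: keep the row
# with the smallest rowid in each (f1, f2, f3) group and delete the rest.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (f1 INT, f2 INT, f3 INT)")
conn.executemany("INSERT INTO table1 VALUES (?,?,?)",
                 [(1,2,3),(1,2,3),(1,2,3),(2,3,4),(4,5,6)])
conn.execute("""
    DELETE FROM table1
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM table1
                        GROUP BY f1, f2, f3)
""")
rows = sorted(conn.execute("SELECT * FROM table1"))
print(rows)  # [(1, 2, 3), (2, 3, 4), (4, 5, 6)]
```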

Maybe I'm misunderstanding something, but I'd say:
SELECT DISTINCT field1, field2, field3 FROM table1
Too easy to be good? ^^

This is the simplest method I've found:
PostgreSQL syntax:
CREATE TABLE tmp AS SELECT DISTINCT * FROM table1;
TRUNCATE TABLE table1;
INSERT INTO table1 SELECT * FROM tmp;
DROP TABLE tmp;
T-SQL syntax:
SELECT DISTINCT * INTO #tmp FROM table1;
TRUNCATE TABLE table1;
INSERT INTO table1 SELECT * FROM #tmp;
DROP TABLE #tmp;
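For illustration, the same pattern runs under Python's sqlite3 (SQLite has no TRUNCATE, so DELETE stands in; table and column names are assumed for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (f1 INT, f2 INT, f3 INT);
    INSERT INTO table1 VALUES (1,2,3),(1,2,3),(2,3,4),(4,5,6),(4,5,6);

    CREATE TABLE tmp AS SELECT DISTINCT * FROM table1;
    DELETE FROM table1;               -- SQLite has no TRUNCATE
    INSERT INTO table1 SELECT * FROM tmp;
    DROP TABLE tmp;
""")
rows = sorted(conn.execute("SELECT * FROM table1"))
print(rows)  # [(1, 2, 3), (2, 3, 4), (4, 5, 6)]
```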

A good answer exists for this problem, but for SQL Server. It uses SQL Server's ROWCOUNT to good effect. I have never used PostgreSQL, so I don't know its equivalent of ROWCOUNT.

Using T-SQL here; I have no idea whether Postgres supports temp tables, but you could select into a temp table, then loop through, delete, and insert your results back into the original.
-- **Disclaimer** using TSQL
-- You could select your records into a temp table with a pk
Create Table #dupes
([id] int not null identity(1,1), f1 int, f2 int, f3 int)
Insert Into #dupes (f1,f2,f3) values (1,2,3)
Insert Into #dupes (f1,f2,f3) values (1,2,3)
Insert Into #dupes (f1,f2,f3) values (1,2,3)
Insert Into #dupes (f1,f2,f3) values (2,3,4)
Insert Into #dupes (f1,f2,f3) values (4,5,6)
Insert Into #dupes (f1,f2,f3) values (4,5,6)
Insert Into #dupes (f1,f2,f3) values (4,5,6)
Insert Into #dupes (f1,f2,f3) values (7,8,9)
Select f1,f2,f3 From #dupes
Declare @rowCount int
Declare @counter int
Set @counter = 1
Set @rowCount = (Select Count([id]) from #dupes)
while (@counter < @rowCount + 1)
Begin
Delete From #dupes
Where [Id] <>
(Select [id] From #dupes where [id]=@counter)
and
(
[f1] = (Select [f1] from #dupes where [id]=@counter)
and
[f2] = (Select [f2] from #dupes where [id]=@counter)
and
[f3] = (Select [f3] from #dupes where [id]=@counter)
)
Set @counter = @counter + 1
End
Select f1,f2,f3 From #dupes -- You could take these results and pump them back into your original table
Drop Table #dupes
Tested this on MS SQL Server 2000. I'm not familiar with Postgres's options, but maybe this will lead you in the right direction.

Related

SQL transform table with multiple columns to two tables with FK between them

I need to convert my table, which has several fields, into two other tables in Microsoft SQL Server, where one of the new tables has a row for each field of the first table.
Table1(Table1Id, Field1, Field2, Field3)
for each row in Table1 create
Table2a(Table2aId)
Table2b(Table2bId, Table2aId, Field1)
Table2b(Table2bId, Table2aId, Field2)
Table2b(Table2bId, Table2aId, Field3)
Details
I currently have the table
Table1
[dbo].[CommunityAssetTemplates]
,[CommunityId]
,[CommunityAssetTemplateId]
,[BaseHouseSpecsAssetId]
,[CommunityLogoAssetId]
,[CommunityMarketingMapAssetId]
,[CommunityPhotoAssetId]
,[CommunityVideoDraftAssetId]
,[CommunityVideoAssetId]
This was mostly a quick way to fulfill a business need before we fully implemented the new feature, where users can define multiple templates with different assets in them. So I made two new tables; the first just relates the second table to a Community.
Table2a
[dbo].[CommunityAssetDataTemplates]
,[CommunityAssetDataTemplateId]
,[CommunityAssetTemplateTypeId]
,[CommunityId]
Table2b
[dbo].[CommunityAssetTemplateFiles]
,[CommunityAssetTemplateFileId]
,[CommunityAssetDataTemplateId]
,[CommunityAssetId]
These two tables map together like so, each Table1 row creates 1 Table2a row and 6 Table2b rows
Table2a
[CommunityAssetDataTemplateId] Auto Increments
[CommunityAssetTemplateTypeId] = 1
[CommunityId] = Table1.CommunityId
Table2b - 1
[CommunityAssetTemplateFileId] Auto increments
,[CommunityAssetDataTemplateId] = Table2a.[CommunityAssetDataTemplateId]
,[CommunityAssetId] = Table1.[BaseHouseSpecsAssetId] (THIS CHANGES)
Table2b - 2
[CommunityAssetTemplateFileId] Auto increments
,[CommunityAssetDataTemplateId] = Table2a.[CommunityAssetDataTemplateId]
,[CommunityAssetId] = Table1.[CommunityLogoAssetId] (THIS CHANGES)
This continues for the remaining four AssetId fields of Table1.
Here is one way to accomplish this using CROSS APPLY to separate Field1, Field2, and Field3 columns into rows:
insert into Table2A (Table2Id)
select Table1Id from Table1
insert into Table2B(Table2Id, Field4)
select Table1Id, Field
from Table1
cross apply (values(Field1), (Field2), (Field3)) as ColumnsAsRows(Field)
Here is a sample:
declare @t1 table (Table1Id int identity(1,1), Field1 int, Field2 int, Field3 int)
declare @t2 table (Table2Id int primary key clustered)
declare @t3 table (Table3Id int identity(1,1) primary key clustered, Table2Id int, Field4 int)
insert into @t1 (Field1, Field2, Field3)
values (1, 2, 3), (4, 5, 6), (7, 8, 9)
select * from @t1
insert into @t2 (Table2Id)
select Table1Id from @t1
insert into @t3 (Table2Id, Field4)
select Table1Id, Field
from @t1
cross apply (values(Field1), (Field2), (Field3)) as ColumnsAsRows(Field)
select * from @t2
select * from @t3
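SQLite has no CROSS APPLY, but for comparison the same columns-to-rows move can be emulated with UNION ALL. A sketch using Python's sqlite3 with names borrowed from the sample above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Table1Id INTEGER PRIMARY KEY, Field1, Field2, Field3);
    INSERT INTO Table1 (Field1, Field2, Field3)
        VALUES (1,2,3), (4,5,6), (7,8,9);

    CREATE TABLE Table2B (Table3Id INTEGER PRIMARY KEY, Table2Id INT, Field4 INT);
    -- One SELECT per source column gives the same effect as CROSS APPLY (VALUES ...)
    INSERT INTO Table2B (Table2Id, Field4)
        SELECT Table1Id, Field1 FROM Table1
        UNION ALL SELECT Table1Id, Field2 FROM Table1
        UNION ALL SELECT Table1Id, Field3 FROM Table1;
""")
rows = sorted(conn.execute("SELECT Table2Id, Field4 FROM Table2B"))
print(rows)  # [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (2, 6), (3, 7), (3, 8), (3, 9)]
```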

SQL : Retrieve inserted row IDs array / table

I have the following statement:
INSERT INTO table1 (field1, field2)
SELECT f1, f2 FROM table2 -- returns 20 rows
After the insert, I need to know the array/table of IDs generated in table1.ID, which is an INT IDENTITY column.
Thanks in advance.
Use the OUTPUT clause (SQL2005 and up):
DECLARE @IDs TABLE(ID int)
INSERT INTO table1(Field1, Field2)
OUTPUT inserted.ID INTO @IDs(ID)
SELECT Field1, Field2 FROM table2
If you want to know exactly which rows from table2 generated which ID in table1 (and Field1 and Field2 aren't enough to identify that), you'll need to use MERGE:
DECLARE @IDs TABLE(Table1ID int, Table2ID int)
MERGE table1 AS T
USING table2 AS S
ON 1=0
WHEN NOT MATCHED THEN
INSERT (Field1, Field2) VALUES(S.Field1, S.Field2)
OUTPUT inserted.ID, S.ID INTO @IDs(Table1ID, Table2ID);
Use SCOPE_IDENTITY():
SELECT SCOPE_IDENTITY()
Note that this returns only the last identity value generated in the current scope, not all 20.
http://msdn.microsoft.com/en-us/library/ms190315(v=sql.105).aspx
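SQLite's analogue is last_insert_rowid(), which likewise yields a single value per statement. A quick sqlite3 sketch (table and column names assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, f1, f2);
    CREATE TABLE table2 (f1, f2);
    INSERT INTO table2 VALUES ('a', 1), ('b', 2), ('c', 3);
""")
conn.execute("INSERT INTO table1 (f1, f2) SELECT f1, f2 FROM table2")
# Like SCOPE_IDENTITY(), this reports only the most recent generated ID,
# not the full set produced by the multi-row INSERT...SELECT.
last_id = conn.execute("SELECT last_insert_rowid()").fetchone()[0]
print(last_id)  # 3
```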

Copy Rows with PK feedback loop

Given the following (Table1):
Id Field1 Field2 ...
-- ------ -------
NULL 1 2
NULL 3 4
...
I'd like to insert the values of Field1 and Field2 into a different table (Table2). Table2 has an auto increment integer primary key. I want to retrieve the new PKs from Table2 and update the Id column above (Table1).
I realize this is not conventional; it's not something I need to do regularly, simply a one-off for some migration work. I made some attempts using INSERT INTO, OUTPUT, and INSERTED.Id, but failed. The PKs that are "looped back" into Table1 must tie to the values of Field1/Field2 inserted.
You should just be able to do an insert, then a delete and re-insert.
create table t1
( id int, f1 int, f2 int);
create table t2
( id int primary key IDENTITY , f1 int, f2 int);
insert into t1 (id, f1, f2) values (null, 1, 2);
insert into t1 (id, f1, f2) values (null, 3, 4);
insert into t1 (id, f1, f2) values (null, 5, 6);
insert into t1 (id, f1, f2) values (null, 5, 6);
insert into t2 (f1, f2)
select f1, f2 from t1 where id is null;
delete t1
from t1 join t2 on (t1.f1 = t2.f1 and t1.f2 = t2.f2);
insert into t1
select id, f1, f2 from t2;
select * from t1;
See this example on SQLFiddle.
You will need some type of unique key to match your rows in each table. I've taken the liberty of adding a TempGuid column to each of your tables (which you can later drop):
-- Setup test data
declare @Table1 table (
Id int null
, Field1 int not null
, Field2 int not null
, TempGuid uniqueidentifier not null unique
)
insert into @Table1 (Field1, Field2, TempGuid) select 1, 2, newid()
insert into @Table1 (Field1, Field2, TempGuid) select 3, 4, newid()
declare @Table2 table (
Id int not null primary key identity(1, 1)
, Field1 int not null
, Field2 int not null
, TempGuid uniqueidentifier not null unique
)
-- Fill Table2
insert into @Table2 (Field1, Field2, TempGuid)
select Field1, Field2, TempGuid
from @Table1
-- Update Table1 with the identity values from Table2
update a
set a.Id = b.Id
from @Table1 a
join @Table2 b on a.TempGuid = b.TempGuid
-- Show results
select * from @Table1
OUTPUT would be workable if you already had a unique key on Table1 that you were inserting into Table2. You could also generate a temporary unique key (perhaps a GUID again) in a loop or cursor and process one row at a time, but that seems worse to me.
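As a hedged sketch of the TempGuid idea, here it is in Python's sqlite3 with the uuid module standing in for SQL Server's NEWID() (names follow the answer above):

```python
import sqlite3, uuid

# Tag each source row with a unique token, copy rows into the identity
# table, then join on the token to pull the generated IDs back.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Id INT, Field1 INT, Field2 INT, TempGuid TEXT UNIQUE);
    CREATE TABLE Table2 (Id INTEGER PRIMARY KEY, Field1 INT, Field2 INT,
                         TempGuid TEXT UNIQUE);
""")
for f1, f2 in [(1, 2), (3, 4)]:
    conn.execute("INSERT INTO Table1 VALUES (NULL, ?, ?, ?)",
                 (f1, f2, str(uuid.uuid4())))
conn.execute("""INSERT INTO Table2 (Field1, Field2, TempGuid)
                SELECT Field1, Field2, TempGuid FROM Table1""")
conn.execute("""UPDATE Table1
                SET Id = (SELECT b.Id FROM Table2 b
                          WHERE b.TempGuid = Table1.TempGuid)""")
rows = conn.execute("SELECT Id, Field1, Field2 FROM Table1 ORDER BY Id").fetchall()
print(rows)  # [(1, 1, 2), (2, 3, 4)]
```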
UPDATE
Here is the actual SQL to run on your tables:
-- Add TempGuid columns
alter table Table1 add TempGuid uniqueidentifier null
update Table1 set TempGuid = newid()
alter table Table2 add TempGuid uniqueidentifier not null
-- Fill Table2
insert into Table2 (Field1, Field2, TempGuid)
select Field1, Field2, TempGuid
from Table1
-- Update Table1 with the identity values from Table2
update a
set a.Id = b.Id
from Table1 a
join Table2 b on a.TempGuid = b.TempGuid
-- Remove TempGuid columns
alter table Table1 drop column TempGuid
alter table Table2 drop column TempGuid
Assuming you have full control over the schema definitions, add a foreign key to Table2 that references Table1's primary key.
Perform your data insert:
INSERT INTO Table2 (Field1, Field2, T1PK)
SELECT Field1, Field2, PK FROM Table1
Then backfill Table1:
UPDATE t1 SET Id = t2.PK
FROM Table1 t1 INNER JOIN Table2 t2 ON t2.T1PK = t1.PK
Then delete the extra column (T1PK) from Table2.
Edit:
Since there's no PK in Table1, just add one to Table1, use that, and then drop it at the end.
For example...
ALTER TABLE Table1 ADD T1PK UNIQUEIDENTIFIER CONSTRAINT Table1_PK PRIMARY KEY DEFAULT NEWID();
ALTER TABLE Table2 ADD T1PK UNIQUEIDENTIFIER NULL
INSERT INTO Table2 (Field1, Field2, T1PK)
SELECT Field1, Field2, T1PK FROM Table1
UPDATE t1 SET Id = t2.PK
FROM Table1 t1 INNER JOIN Table2 t2 ON t2.T1PK = t1.T1PK
ALTER TABLE Table1 DROP CONSTRAINT Table1_PK
ALTER TABLE Table1 DROP COLUMN T1PK
ALTER TABLE Table2 DROP COLUMN T1PK
This is not pretty, but should do as a one time effort.
create table tableA
(
id int,
field1 int,
field2 int
)
create table tableB
(
id int identity(1,1),
field1 int,
field2 int
)
insert into tableA select NULL, 1, 2
insert into tableA select NULL, 2, 3
declare @field1_value int;
declare @field2_value int;
declare @lastInsertedId int;
DECLARE tableA_cursor CURSOR FOR
select field1, field2 from tableA
OPEN tableA_cursor
FETCH NEXT FROM tableA_cursor INTO @field1_value, @field2_value
WHILE @@FETCH_STATUS = 0
BEGIN
insert into tableB (field1, field2) values (@field1_value, @field2_value)
set @lastInsertedId = SCOPE_IDENTITY()
update a
set id = @lastInsertedId
from tableA a
where field1 = @field1_value
and field2 = @field2_value
print @field1_value
FETCH NEXT FROM tableA_cursor
INTO @field1_value, @field2_value
END
CLOSE tableA_cursor
DEALLOCATE tableA_cursor
With not exists check:
declare @field1_value int;
declare @field2_value int;
declare @lastInsertedId int;
DECLARE tableA_cursor CURSOR FOR
select field1, field2 from tableA
OPEN tableA_cursor
FETCH NEXT FROM tableA_cursor INTO @field1_value, @field2_value
WHILE @@FETCH_STATUS = 0
BEGIN
IF NOT EXISTS
(
select * from tableB
where field1 = @field1_value
and field2 = @field2_value
)
BEGIN
insert into tableB (field1, field2)
values (@field1_value, @field2_value)
set @lastInsertedId = SCOPE_IDENTITY()
update a
set id = @lastInsertedId
from tableA a
where field1 = @field1_value
and field2 = @field2_value
END
FETCH NEXT FROM tableA_cursor
INTO @field1_value, @field2_value
END
CLOSE tableA_cursor
DEALLOCATE tableA_cursor
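The cursor's logic maps naturally onto a client-side loop. A minimal Python sqlite3 equivalent, with lastrowid standing in for SCOPE_IDENTITY() (names from the answer):

```python
import sqlite3

# Row-at-a-time: insert each (field1, field2) pair, grab the generated id,
# and write it back to the matching source row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INT, field1 INT, field2 INT);
    CREATE TABLE tableB (id INTEGER PRIMARY KEY, field1 INT, field2 INT);
    INSERT INTO tableA VALUES (NULL, 1, 2), (NULL, 2, 3);
""")
for f1, f2 in conn.execute("SELECT field1, field2 FROM tableA").fetchall():
    cur = conn.execute("INSERT INTO tableB (field1, field2) VALUES (?, ?)",
                       (f1, f2))
    conn.execute("UPDATE tableA SET id = ? WHERE field1 = ? AND field2 = ?",
                 (cur.lastrowid, f1, f2))
rows = conn.execute("SELECT * FROM tableA ORDER BY id").fetchall()
print(rows)  # [(1, 1, 2), (2, 2, 3)]
```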

SQL query construction issue

There are two tables:
Table1
field1 | field2
Table2
field1
'string1'
'string2'
I need to insert the concatenation of table2.field1's values into table1, so it looks like:
insert into table1(field1, field2) values (1, 'string1string2');
How can I do it? Is there any SQL-standard way to do it?
PS: string1 and string2 are values of the field1 column.
PPS: The main subtask of my question is: how can I get the result of a select query into one row? All the examples I've seen just concatenate two columns; none of them has the SELECT subquery return the concatenation of all values of the table2.field1 column.
There is no ANSI standard SQL way to do this.
But in MySQL you can use GROUP_CONCAT
insert into table1 ( field1, field2 )
select 1, group_concat(field1) from table2
In SQL Server 2005 and later you can use XML PATH,
insert into table1 ( field1, field2 )
select 1, (select field1 from table2
for xml path(''), type).value('.','nvarchar(max)')
In Oracle, you can refer to Stack Overflow question How can I combine multiple rows into a comma-delimited list in Oracle?.
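SQLite also ships a group_concat aggregate, so the MySQL pattern carries over almost verbatim. A quick sketch with Python's sqlite3 (note that the concatenation order is not guaranteed without an ORDER BY):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (field1 INT, field2 TEXT);
    CREATE TABLE table2 (field1 TEXT);
    INSERT INTO table2 VALUES ('string1'), ('string2');
    -- '' as the second argument gives plain concatenation, no separator
    INSERT INTO table1 (field1, field2)
        SELECT 1, group_concat(field1, '') FROM table2;
""")
val = conn.execute("SELECT field2 FROM table1").fetchone()[0]
print(val)  # e.g. 'string1string2' (order not guaranteed)
```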
INSERT INTO TABLE1 (FIELD1, FIELD2) VALUES (1, CONCAT('string1', 'string2'))
Try this:
insert into table1(field1, field2)
select table2.field1, table2.string1 || table2.string2 from table2;
You can add a where clause to the query to select only some entries from table2:
insert into table1(field1, field2)
select table2.field1, table2.string1 || table2.string2 from table2
where table2.field1 = 'whatever';
I'd try with
insert table1 select field1, string1+string2 from table2
tested with MSSQL Server 2008
create table #t1 (n int, s varchar(200))
create table #t2 (n int, s1 varchar(100), s2 varchar(100))
insert #t2 values (1, 'one', 'two') -- note: T-SQL allows omitting INTO
insert #t2 values (2, 'three', 'four')
insert #t1 select n, s1+s2 from #t2
select * from #t1
drop table #t1
drop table #t2
After the edit:
No. If you have no way to identify the rows in table2 and sort them the way you want, it is impossible. Remember that, in the absence of an ORDER BY in the SQL statement, rows can be returned in any order whatsoever.
Assuming this is SQL Server:
Insert into table1 (field1, field2)
select field1, string1 + string2
from table2
In Oracle you would do it as:
Insert into table1 (field1, field2)
select field1, string1 || string2
from table2

Delete multiple duplicate rows in table

I have multiple groups of duplicate rows in one table (3 records for one group, 2 for another, etc.).
Below is what I came up with to delete them, but I have to run the script once for each group of duplicates:
set rowcount 1
delete from Table
where code in (
select code from Table
group by code
having (count(code) > 1)
)
set rowcount 0
This works well to a degree. I need to run this for every group of duplicates, and then it only deletes 1 (which is all I need right now).
If you have a key column on the table, you can use it to uniquely identify the "distinct" rows in your table.
Just use a subquery to identify the list of IDs for unique rows, and then delete everything outside of this set. Something along the lines of:
create table #TempTable
(
ID int identity(1,1) not null primary key,
SomeData varchar(100) not null
)
insert into #TempTable(SomeData) values('someData1')
insert into #TempTable(SomeData) values('someData1')
insert into #TempTable(SomeData) values('someData2')
insert into #TempTable(SomeData) values('someData2')
insert into #TempTable(SomeData) values('someData2')
insert into #TempTable(SomeData) values('someData3')
insert into #TempTable(SomeData) values('someData4')
select * from #TempTable
--Records to be deleted
SELECT ID
FROM #TempTable
WHERE ID NOT IN
(
select MAX(ID)
from #TempTable
group by SomeData
)
--Delete them
DELETE
FROM #TempTable
WHERE ID NOT IN
(
select MAX(ID)
from #TempTable
group by SomeData
)
--Final Result Set
select * from #TempTable
drop table #TempTable;
Alternatively you could use a CTE for example:
WITH UniqueRecords AS
(
select MAX(ID) AS ID
from #TempTable
group by SomeData
)
DELETE A
FROM #TempTable A
LEFT outer join UniqueRecords B on
A.ID = B.ID
WHERE B.ID IS NULL
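Both forms keep the MAX(ID) row of each group. The same keep-one delete can be tried quickly under Python's sqlite3 (schema mirrors the #TempTable example above):

```python
import sqlite3

# Delete everything whose ID is not the MAX(ID) of its SomeData group.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TempTable (ID INTEGER PRIMARY KEY, SomeData TEXT NOT NULL);
    INSERT INTO TempTable (SomeData) VALUES
        ('someData1'),('someData1'),('someData2'),('someData2'),
        ('someData2'),('someData3'),('someData4');

    DELETE FROM TempTable
    WHERE ID NOT IN (SELECT MAX(ID) FROM TempTable GROUP BY SomeData);
""")
rows = conn.execute("SELECT ID, SomeData FROM TempTable ORDER BY ID").fetchall()
print(rows)  # [(2, 'someData1'), (5, 'someData2'), (6, 'someData3'), (7, 'someData4')]
```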
It is frequently more efficient to copy the unique rows into a temporary table, drop the source table, and rename the temporary table back.
I reused the definition and data of #TempTable, called SrcTable here instead, since it is impossible to rename a temporary table into a regular one:
create table SrcTable
(
ID int identity(1,1) not null primary key,
SomeData varchar(100) not null
)
insert into SrcTable(SomeData) values('someData1')
insert into SrcTable(SomeData) values('someData1')
insert into SrcTable(SomeData) values('someData2')
insert into SrcTable(SomeData) values('someData2')
insert into SrcTable(SomeData) values('someData2')
insert into SrcTable(SomeData) values('someData3')
insert into SrcTable(SomeData) values('someData4')
(The table definition and data above are by John Sansom, from the previous answer.)
-- cloning "unique" part
SELECT * INTO TempTable
FROM SrcTable --original table
WHERE id IN
(SELECT MAX(id) AS ID
FROM SrcTable
GROUP BY SomeData);
GO
DROP TABLE SrcTable
GO
EXEC sys.sp_rename 'TempTable', 'SrcTable'
You can alternatively use the ROW_NUMBER() function to filter out duplicates:
;WITH [CTE_DUPLICATES] AS
(
SELECT RN = ROW_NUMBER() OVER (PARTITION BY SomeData ORDER BY SomeData)
FROM #TempTable
)
DELETE FROM [CTE_DUPLICATES] WHERE RN > 1
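SQLite (3.25+) also has ROW_NUMBER(), though it cannot DELETE through a CTE, so the duplicate IDs are selected in a subquery instead. A sketch via Python's sqlite3:

```python
import sqlite3

# Number the rows within each SomeData group, then delete every row whose
# row number is greater than 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TempTable (ID INTEGER PRIMARY KEY, SomeData TEXT NOT NULL);
    INSERT INTO TempTable (SomeData) VALUES
        ('someData1'),('someData1'),('someData2'),('someData2'),('someData3');

    DELETE FROM TempTable WHERE ID IN (
        SELECT ID FROM (
            SELECT ID, ROW_NUMBER() OVER
                (PARTITION BY SomeData ORDER BY ID) AS rn
            FROM TempTable
        ) WHERE rn > 1
    );
""")
rows = conn.execute("SELECT SomeData FROM TempTable ORDER BY ID").fetchall()
print(rows)  # [('someData1',), ('someData2',), ('someData3',)]
```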
SET ROWCOUNT 1
DELETE a
FROM Table a
WHERE (SELECT COUNT(*) FROM Table b WHERE b.Code = a.Code) > 1
WHILE @@ROWCOUNT > 0
DELETE a
FROM Table a
WHERE (SELECT COUNT(*) FROM Table b WHERE b.Code = a.Code) > 1
SET ROWCOUNT 0
This will delete all duplicate rows, but you can add more columns to the comparison if you want to match on them as well.