Single Query to delete and display duplicate records - sql

One of the questions asked in an interview was:
One table has 100 records. 50 of them are duplicates. Is it possible, with a single query, to delete the duplicate records from the table and also select and display the remaining 50 records?
Is this possible in a single SQL query?
Thanks
SNA

With SQL Server you would use something like this:
DECLARE @Table TABLE (ID INTEGER, PossibleDuplicate INTEGER)
INSERT INTO @Table VALUES (1, 100)
INSERT INTO @Table VALUES (2, 100)
INSERT INTO @Table VALUES (3, 200)
INSERT INTO @Table VALUES (4, 200)
DELETE t
OUTPUT Deleted.*
FROM @Table t
INNER JOIN (
    SELECT ID = MAX(ID)
    FROM @Table
    GROUP BY PossibleDuplicate
    HAVING COUNT(*) > 1
) d ON d.ID = t.ID
The OUTPUT clause returns the records that get deleted.
Update:
The query above deletes the duplicates and gives you the rows that were deleted, not the rows that remain. If that is important to you (all in all, the remaining 50 rows should be identical to the 50 deleted rows), you could use SQL Server 2008's MERGE syntax to achieve this.
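A minimal sketch of that idea (the KeepID/DropID names are illustrative; like the MAX-based query above, it assumes at most one duplicate per group):
MERGE INTO @Table AS t
USING (
    SELECT KeepID = MIN(ID), DropID = MAX(ID)
    FROM @Table
    GROUP BY PossibleDuplicate
    HAVING COUNT(*) > 1
) d ON t.ID = d.DropID
WHEN MATCHED THEN DELETE
-- MERGE's OUTPUT clause may reference source columns, so each deleted row
-- can be reported alongside the ID of its surviving twin:
OUTPUT d.KeepID, Deleted.PossibleDuplicate;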

Lieven's answer is a good explanation of how to output the deleted rows. I'd like to add two things:
If you want to do something more with the output than just displaying it, you can specify OUTPUT INTO @Tbl (where @Tbl is a table variable you declare before the DELETE); there is a sketch of this after the next code block.
Using MAX, MIN, or any of the other aggregates can only handle one duplicate row per group. If it's possible for you to have many duplicates, the following SQL Server 2005+ code will handle that:
;WITH Duplicates AS
(
    SELECT
        ID,
        ROW_NUMBER() OVER (PARTITION BY DupeColumn ORDER BY ID) AS RowNum
    FROM MyTable
)
DELETE FROM MyTable
OUTPUT deleted.*
WHERE ID IN
(
    SELECT ID
    FROM Duplicates
    WHERE RowNum > 1
)
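As for the first point, a minimal sketch of OUTPUT INTO, reusing the MyTable/DupeColumn names above (the @Removed table variable is illustrative):
DECLARE @Removed TABLE (ID INT, DupeColumn INT);
;WITH Duplicates AS
(
    SELECT ID, ROW_NUMBER() OVER (PARTITION BY DupeColumn ORDER BY ID) AS RowNum
    FROM MyTable
)
DELETE FROM MyTable
OUTPUT deleted.ID, deleted.DupeColumn INTO @Removed (ID, DupeColumn)
WHERE ID IN (SELECT ID FROM Duplicates WHERE RowNum > 1);
-- the deleted rows are now available for auditing, archiving, or joining:
SELECT * FROM @Removed;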

It sounds unlikely, at least in ANSI SQL, since a DELETE only returns the count of deleted rows.


How to do batch insert when there is no identity column?

A table with a million rows, two columns.
code | name
xyz | product1
abc | Product 2
...
...
I want to do the insert in small batches (10000 rows) via an INSERT INTO ... SELECT query.
How can we do this when there is no identity key to create the batches?
You could use a LEFT OUTER JOIN in your SELECT statement to identify records that are not already in the target table, then use TOP to grab the first 10000 that the database finds. Something like:
INSERT INTO tableA
SELECT TOP 10000 tableB.code, tableB.name
FROM tableB LEFT OUTER JOIN tableA ON tableB.Code = tableA.Code
WHERE tableA.Code IS NULL;
Then run that over and over again until no more rows are inserted.
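If you want to script the repetition, a minimal driver loop might look like this (a sketch using the tableA/tableB names above):
DECLARE @rows INT;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    INSERT INTO tableA
    SELECT TOP 10000 tableB.code, tableB.name
    FROM tableB LEFT OUTER JOIN tableA ON tableB.Code = tableA.Code
    WHERE tableA.Code IS NULL;
    -- @@ROWCOUNT reaches 0 once every row of tableB is present in tableA
    SET @rows = @@ROWCOUNT;
END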
You could also use Windowing functions to batch like:
INSERT INTO tableA
SELECT code, name
FROM (
    SELECT code, name, ROW_NUMBER() OVER (ORDER BY name) AS rownum
    FROM tableB
) AS numbered
WHERE rownum BETWEEN 1 AND 10000;
Then just keep changing the BETWEEN range to get your next batch. Personally, if I had to do this, I would use the first method, since it's guaranteed to catch everything that isn't already in tableA.
Also, if there is any possibility that tableB will gain records during this batching process, then option 1 is definitely better. Essentially, option 2 determines the ROW_NUMBER() on the fly, so records newly inserted in the middle of the batching will cause other records to be missed.
If tableB is static, then option 2 may be faster, since the DB just has to sort and number the records instead of joining a huge table to a huge table and then grabbing 10000 records.
You can do pagination on the SELECT, pick the records by a batch/page size of, say, 10000 or whatever you need, and insert them into the target table. In the sample below you have to change the values of @Min and @Max for each iteration of the batch size you want; a loop that automates this follows the query.
INSERT INTO EmployeeNew
SELECT Name
FROM
(
    SELECT DENSE_RANK() OVER (ORDER BY EmployeeId) AS Rank, Name
    FROM Employee
) AS RankedEmployee
WHERE Rank >= @Min AND Rank < @Max
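A sketch of a loop that drives those iterations (it assumes the Employee/EmployeeNew names above and SQL Server 2005+ for DENSE_RANK):
DECLARE @Min INT, @Max INT, @BatchSize INT;
SET @Min = 1;
SET @BatchSize = 10000;
WHILE 1 = 1
BEGIN
    SET @Max = @Min + @BatchSize;
    INSERT INTO EmployeeNew
    SELECT Name
    FROM
    (
        SELECT DENSE_RANK() OVER (ORDER BY EmployeeId) AS Rank, Name
        FROM Employee
    ) AS RankedEmployee
    WHERE Rank >= @Min AND Rank < @Max;
    -- DENSE_RANK values are contiguous, so an empty batch means we are done
    IF @@ROWCOUNT = 0 BREAK;
    SET @Min = @Max;
END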

INSERT INTO from tableA to tableB with count()

Inserting rows from one table into another, I am trying to use count(*) to ensure that the line_no column in the OBJ_LINES table is set to 1, 2, 3, 4, and so on for every line added.
INSERT INTO OBJ_LINES(
id,
line_no)
SELECT 1,
       (select count(*) + 1 FROM OBJ_LINES WHERE id = 1)
FROM TEMPLATE_LINES tmp
WHERE tmp.id = 37;
(Sybase syntax)
If the TEMPLATE_LINES table holds more than one row, I get a duplicate-key error, as count() seems to be evaluated only once, not for every row found in the TEMPLATE_LINES table.
How can I write the SQL to set a 'dynamic' line_no depending on the current number of rows for a given id?
Sybase SQL Anywhere has a number(*) function that generates a sequential number for each row in the result set, so it is evaluated per row rather than once:
INSERT INTO OBJ_LINES(
id,
line_no)
SELECT 1,
       number(*)
FROM TEMPLATE_LINES tmp
WHERE tmp.id = 37;
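If OBJ_LINES may already hold rows for id = 1, a possible variant is to offset the generated numbers by the existing count (an untested sketch; number(*) has usage restrictions in SQL Anywhere, so verify it can be combined with a subquery this way):
INSERT INTO OBJ_LINES(
id,
line_no)
SELECT 1,
       number(*) + (SELECT count(*) FROM OBJ_LINES WHERE id = 1)
FROM TEMPLATE_LINES tmp
WHERE tmp.id = 37;
-- here the count subquery being evaluated once is exactly what we want:
-- it is a fixed offset, and number(*) supplies the per-row increment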

How can I delete one of two perfectly identical rows?

I am cleaning out a database table without a primary key (I know, I know, what were they thinking?). I cannot add a primary key, because there is a duplicate in the column that would become the key; the duplicate value comes from one of two rows that are in all respects identical. And I can't delete the offending row via a GUI (in this case MySQL Workbench, but I'm looking for a database-agnostic approach), because GUIs refuse to perform tasks on tables without primary keys (or at least a UQ NN column).
How can I delete one of the twins?
SET ROWCOUNT 1
DELETE FROM [table] WHERE ....
SET ROWCOUNT 0
This will only delete one of the two identical rows. (SET ROWCOUNT is SQL Server/Sybase syntax; on recent SQL Server versions its use with DML is deprecated, and DELETE TOP (1) is the recommended way to limit a DELETE.)
One option to solve your problem is to create a new table with the same schema, and then do:
INSERT INTO new_table (SELECT DISTINCT * FROM old_table)
and then just rename the tables.
You will of course need roughly as much free disk space as the original table occupies to do this!
It's not efficient, but it's incredibly simple.
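A minimal MySQL sketch of that flow (the table names are placeholders):
CREATE TABLE new_table LIKE old_table;
INSERT INTO new_table (SELECT DISTINCT * FROM old_table);
-- swap the tables in one atomic step, keeping the original as a backup
RENAME TABLE old_table TO old_table_backup,
             new_table TO old_table;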
Note that MySQL has its own extension of DELETE, which is DELETE ... LIMIT, which works in the usual way you'd expect from LIMIT: http://dev.mysql.com/doc/refman/5.0/en/delete.html
The MySQL-specific LIMIT row_count option to DELETE tells the server the maximum number of rows to be deleted before control is returned to the client. This can be used to ensure that a given DELETE statement does not take too much time. You can simply repeat the DELETE statement until the number of affected rows is less than the LIMIT value.
Therefore, you could use DELETE FROM some_table WHERE x = 'y' AND foo = 'bar' LIMIT 1; note that there isn't a simple way to say "delete everything except one" - just keep repeating the DELETE until the duplicates are gone.
DELETE TOP (1) works on Microsoft SQL Server (T-SQL).
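For example (a sketch; the WHERE clause is whatever identifies the duplicated rows):
DELETE TOP (1) FROM [table]
WHERE col1 = 'x' AND col2 = 'y';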
This can be accomplished using a CTE and the ROW_NUMBER() function, as below:
/* Sample Data */
CREATE TABLE #dupes (ID INT, DWCreated DATETIME2(3))
INSERT INTO #dupes (ID, DWCreated) SELECT 1, '2015-08-03 01:02:03.456'
INSERT INTO #dupes (ID, DWCreated) SELECT 2, '2014-08-03 01:02:03.456'
INSERT INTO #dupes (ID, DWCreated) SELECT 1, '2013-08-03 01:02:03.456'
/* Check sample data - returns three rows, with two rows for ID#1 */
SELECT * FROM #dupes
/* CTE to give each row that shares an ID a unique number */
;WITH toDelete AS
(
SELECT ID, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY DWCreated) AS RN
FROM #dupes
)
/* Delete any row that is not the first instance of an ID */
DELETE FROM toDelete WHERE RN > 1
/* Check the results: ID is now unique */
SELECT * FROM #dupes
/* Clean up */
DROP TABLE #dupes
Having a column to ORDER BY is handy, but not necessary unless you have a preference for which of the rows to delete. This will also handle all instances of duplicate records, rather than forcing you to delete one row at a time.
For PostgreSQL you can do this:
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id, ROW_NUMBER()
OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
column1, column2, column3 form the column set that has the duplicate values.
This works for PostgreSQL
DELETE FROM tablename WHERE id = 123 AND ctid IN (SELECT ctid FROM tablename WHERE id = 123 LIMIT 1)
Have you tried LIMIT 1? This will delete only one of the rows that match your DELETE query:
DELETE FROM `table_name` WHERE `column_name`='value' LIMIT 1;
In my case I could get the GUI to give me a string of the values of the row in question (alternatively, I could have done this by hand). On the suggestion of a colleague, in whose debt I remain, I used this to create an INSERT statement:
INSERT INTO some_table
VALUES ('ID1219243408800307444663', '2004-01-20 10:20:55', 'INFORMATION', 'admin', ...);
I tested the insert statement, so that I now had triplets. Finally, I ran a simple DELETE to remove all of them...
DELETE FROM some_table WHERE logid = 'ID1219243408800307444663';
followed by the INSERT one more time, leaving me with a single row, and the bright possibilities of a primary key.
In case you can add a column, do so:
ALTER TABLE yourtable ADD IDCOLUMN bigint NOT NULL IDENTITY (1, 1)
Then count rows grouped by your problem column, filtering for count > 1; this identifies your twins (or triplets, or whatever).
Then select the rows whose problem column matches one of the identified values and check their IDs in IDCOLUMN.
Finally, delete from your table where IDCOLUMN equals one of those IDs.
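A sketch of those steps (yourtable and problemcol are placeholders):
-- 1. find the duplicated values
SELECT problemcol, COUNT(*) AS cnt
FROM yourtable
GROUP BY problemcol
HAVING COUNT(*) > 1
-- 2. inspect the surrogate IDs behind one duplicated value
SELECT IDCOLUMN, problemcol
FROM yourtable
WHERE problemcol = 'a duplicated value from step 1'
-- 3. the twins now differ in IDCOLUMN, so one of them can be deleted by ID
DELETE FROM yourtable
WHERE IDCOLUMN = 42 -- one of the IDs from step 2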
You could use MAX(id), which was relevant in my case (this assumes each row has its own id and only the other columns are duplicated):
DELETE FROM [table] WHERE id IN
(SELECT MAX(id) FROM [table] GROUP BY col2, col3 HAVING COUNT(id) > 1)
Be sure to test your results first, and put a limiting condition in your HAVING clause. With such a huge delete query, you might want to back up your database first.
DELETE TOP (1) FROM tableName
WHERE --your conditions for filtering identical rows
I added a Guid column to the table and set it to generate a new id for each row. Then I could delete the rows using a GUI.
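A sketch of that idea in SQL Server syntax (the names are placeholders; when a NOT NULL column is added with a DEFAULT of NEWID(), each existing row receives its own value):
ALTER TABLE some_table ADD RowGuid uniqueidentifier NOT NULL DEFAULT NEWID()
-- the twins are now distinguishable; pick one GUID in the GUI or via SELECT
DELETE FROM some_table WHERE RowGuid = '...'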
In PostgreSQL there is an implicit column called ctid. See the wiki. So you are free to use the following:
WITH cte1 as(
SELECT unique_column, max( ctid ) as max_ctid
FROM table_1
GROUP BY unique_column
HAVING count(*) > 1
), cte2 as(
SELECT t.ctid as target_ctid
FROM table_1 t
JOIN cte1 USING( unique_column )
WHERE t.ctid != max_ctid
)
DELETE FROM table_1
WHERE ctid IN( SELECT target_ctid FROM cte2 )
I'm not sure how safe it is to use this when there is a possibility of concurrent updates. So one may find it sensible to run LOCK TABLE table_1 IN ACCESS EXCLUSIVE MODE; before actually doing the cleanup.
In case there are multiple duplicate rows to delete, and all fields are identical (no differing id, and the table has no primary key), one option is to save a single copy of the duplicated rows (with DISTINCT) in a new table, delete all the duplicate rows, and insert the saved rows back. This is helpful if the table is really big and the number of duplicate rows is small.
--- col1, col2 ... coln are the table columns that are relevant.
--- If not sure, add all columns of the table in the SELECT below and in the WHERE clauses later.
--- Make a copy of table T first, if possible, so you can roll back at any time.
--- Check @@ROWCOUNT after each statement to be sure it did what you wanted.
--- Use transactions and roll back in case there is an error.
--- First find all the rows that have identical duplicates; this statement could be
--- combined with the next one if you select all columns.
select col1, col2, --- other columns as needed
       count(*) as c
into temp_duplicate
from T
group by col1, col2
having count(*) > 1
--- Save each set of identical rows only once (DISTINCT).
select distinct T.*
into temp_insert
from T, temp_duplicate D
where T.col1 = D.col1
  and T.col2 = D.col2 --- and other columns if needed
--- Delete all the rows that are duplicated.
delete T
from T, temp_duplicate D
where T.col1 = D.col1
  and T.col2 = D.col2 --- and other columns if needed
--- Add the duplicate rows back, now only once.
insert into T select * from temp_insert
--- Drop the temp tables after you check that all is OK.
If, like me, you don't want to have to list out all the columns of the table, you can convert each row to JSONB and compare by that.
(NOTE: this is incredibly inefficient - be careful!)
SELECT to_jsonb(a.*), to_jsonb(b.*)
FROM mytable a
LEFT JOIN mytable b
    ON a.entry_date < b.entry_date
WHERE (SELECT NOT exists(
    SELECT
    FROM jsonb_each_text(to_jsonb(a.*) - 'unwanted_column') t1
    FULL OUTER JOIN jsonb_each_text(to_jsonb(b.*) - 'unwanted_column') t2 USING (key)
    WHERE t1.value <> t2.value OR t1.key IS NULL OR t2.key IS NULL
))
Suppose we want to delete duplicate records, keeping only one of each, from an Employee table defined as Employee(id, name, age):
delete from Employee
where id not in (select MAX(id)
                 from Employee
                 group by name, age
                );
You can use LIMIT 1.
This works perfectly for me with MySQL:
delete from `your_table` [where condition] limit 1;
DELETE FROM Table_Name
WHERE ID NOT IN
(
SELECT MAX(ID) AS MaxRecordID
FROM Table_Name
GROUP BY [FirstName],
[LastName],
[Country]
);

How to Get only Duplicate data from one table?

I have a table containing both duplicated values and single (unique) values.
But I want to get only the data with duplicated values, not the single ones.
SELECT columnWithDuplicates, count(*) FROM myTable
GROUP BY columnWithDuplicates HAVING (count(*) > 1);
The above query will show the duplicated values. And once you provide that to a business user, their next questions will be: What happened? How did these get there? Is there a pattern to the duplicates? What's often more informative is to see the whole rows containing those values, to help determine why there are duplicates.
-- this query finds all the values in T that
-- exist in the derived table D where D is the list of
-- all the values in columnWithDuplicates that occur more than once
SELECT DISTINCT
T.*
FROM
myTable T
INNER JOIN
(
-- this identifies the duplicated values
-- courtesy of Brian Roach
SELECT
columnWithDuplicates
, count(*) AS rowCount
FROM
myTable
GROUP BY
columnWithDuplicates
HAVING
(count(*) > 1)
) D
ON D.columnWithDuplicates = T.columnWithDuplicates

How to join two tables together with same number of rows by their order

I am using SQL 2000 and I would like to join two tables together based on their positions.
For example, consider the following two tables:
table1
-------
name
-------
'cat'
'dog'
'mouse'
table2
------
cost
------
23
13
25
I would now like to blindly join the two tables together as follows, based on their order rather than on matching columns (I can also guarantee that both tables have the same number of rows):
-------|-----
name |cost
-------|------
'cat' |23
'dog' |13
'mouse'|25
Is this possible in a T-SQL select?
This is NOT possible, since there's absolutely no guarantee regarding the order in which the rows will be selected.
There are a number of ways to achieve what you want (see other answers), provided you're lucky regarding the sort order, but none will work if you aren't, and you shouldn't rely on such queries.
Being forced to write this kind of query strongly smells of bad database design.
In SQL 2000 you will either have to run two forward-only cursors and insert into a temp table, or insert the values into two temp tables with an extra identity column and join them on the identity field.
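A sketch of the two-cursor variant (SQL 2000 compatible, using the table1/table2 names from the question):
CREATE TABLE #paired (name varchar(30), cost int)
DECLARE @name varchar(30), @cost int
DECLARE c1 CURSOR FAST_FORWARD FOR SELECT name FROM table1
DECLARE c2 CURSOR FAST_FORWARD FOR SELECT cost FROM table2
OPEN c1
OPEN c2
FETCH NEXT FROM c1 INTO @name
FETCH NEXT FROM c2 INTO @cost
-- both tables are guaranteed to have the same number of rows,
-- so checking one fetch status is enough
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO #paired (name, cost) VALUES (@name, @cost)
    FETCH NEXT FROM c1 INTO @name
    FETCH NEXT FROM c2 INTO @cost
END
CLOSE c1
DEALLOCATE c1
CLOSE c2
DEALLOCATE c2
SELECT * FROM #paired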
If your tables aren't too large, you could create two temp tables, select your content into them in a specific order, and then join them on the row number.
e.g.
CREATE TABLE #Temp_One (
[RowNum] [int] IDENTITY (1, 1) NOT NULL ,
[Description] [nvarchar] (50) NOT NULL
)
CREATE TABLE #Temp_Two (
[RowNum] [int] IDENTITY (1, 1) NOT NULL ,
[Description] [nvarchar] (50) NOT NULL
)
INSERT INTO #Temp_One
SELECT Your_Column FROM Your_Table_One ORDER BY Whatever
INSERT INTO #Temp_Two
SELECT Your_Column FROM Your_Table_Two ORDER BY Whatever
SELECT *
FROM #Temp_One a
LEFT OUTER JOIN #Temp_Two b
On a.RowNum = b.RowNum
Do you have anything that guarantees the ordering of each table?
As far as I know, SQL Server does not make any promises about the ordering of a result set unless the outer query has an ORDER BY clause.
In your case you need each table to be ordered in a deterministic manner for this to work.
Other than that, in SQL 2000, as answered before me, a temp table and two cursors seem like a good answer.
Update:
Someone mentioned inserting both tables into temp tables and said it would yield better performance. I am no SQL expert, so I defer to those who know on that front; since I had an up-vote, I thought you should investigate those performance considerations.
But in any case, if you do not have any more information in your tables than what you showed us, I'm not sure you can pull it off, ordering-wise.
You could alter both tables to have an auto_increment column, then join on that.
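A T-SQL sketch of that (identity columns play the auto_increment role; note that the values assigned by ALTER TABLE follow no guaranteed order, which is exactly the caveat raised elsewhere in this thread):
ALTER TABLE table1 ADD rn INT IDENTITY(1, 1)
ALTER TABLE table2 ADD rn INT IDENTITY(1, 1)
SELECT t1.name, t2.cost
FROM table1 t1
INNER JOIN table2 t2 ON t1.rn = t2.rn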
As others have told you, SQL has no intrinsic ordering; a table of rows is a set. Any ordering you get is arbitrary unless you add an ORDER BY clause.
So yes, there are ways you can do this, but all of them depend on the accidental ordering being what you hope it is. So do this once, and don't do it again unless you can come up with a way (auto_increments, natural keys, something) to ensure ordering.
Absolutely (on SQL 2005 and later). Use the following query, but make sure each ORDER BY clause is deterministic; otherwise the numbering of the rows can change between runs, which you don't want:
select A.name, B.cost
from
(
    select row_number() over (order by name) as rno, * from Table1
) A
join
(
    select row_number() over (order by cost) as rno, * from Table2
) B on A.rno = B.rno
The ORDER BY clauses can be modified according to your liking. The query produces unique row numbers for each row, which can be joined with the row numbers of the other table.
Consider using a rank (rownum in Oracle) to dynamically apply ordered unique numbers to each table. Simply join on the rank column and you should have what you need. See this Microsoft article on numbering rows.
It would be best to use ROW_NUMBER(), but that is only available in 2005 and 2008; this should work for 2000...
Try this:
create table table1 (name varchar(30))
insert into table1 (name) values ('cat')
insert into table1 (name) values ('dog')
insert into table1 (name) values ('mouse')
create table table2 (cost int)
insert into table2 (cost) values (23)
insert into table2 (cost) values (13)
insert into table2 (cost) values (25)
Select IDENTITY(int,1,1) AS RowNumber
, Name
INTO #Temp1
from table1
Select IDENTITY(int,1,1) AS RowNumber
, Cost
INTO #Temp2
from table2
select * from #Temp1
select * from #Temp2
SELECT
t1.Name, t2.Cost
FROM #Temp1 t1
LEFT OUTER JOIN #Temp2 t2 ON t1.RowNumber=t2.RowNumber
ORDER BY t1.RowNumber
Xynth - built-in row numbering is not available until SQL 2005, unfortunately, and the example given by Microsoft actually uses triangular joins - a horrific hidden performance hit if the tables get large. My preferred approach would be an insert into a pair of temp tables using the IDENTITY function, then joining on those, which is basically the same answer already given. I think the two-cursor approach sounds much heavier than it needs to be for this task.