SQL deleting one of two duplicate records? - sql

I have a DB in which every record exists twice: the two copies have different IDs, but the 2 columns holding the actual data are the same. Is there a good way to write a DELETE statement that selects the records where those 2 columns match but the ID differs, and deletes one of each pair (it doesn't matter which one)?
If you could, could you give me a code example?

DELETE FROM ...
WHERE id IN (SELECT MAX(id)
             FROM ...
             GROUP BY data1, data2
             HAVING COUNT(*) > 1)
The idea is to select the biggest id in each set of duplicate rows, by grouping the rows on the columns that are the same and making sure that the group contains multiple rows (the HAVING clause). The subquery must return only the id column, so the count goes in the HAVING clause rather than in the select list.

delete from your_table
where id not in
(
    select min(id)
    from your_table
    group by col1, col2
)
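As a rough illustration of the "keep the lowest id per group" delete, here is a minimal sketch using Python's built-in sqlite3 module. The table name your_table, the column names col1 and col2, and the sample rows are assumptions made for the demo, and other database engines may need slightly different syntax.

```python
import sqlite3

# In-memory database with a hypothetical table: id plus two data columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO your_table (id, col1, col2) VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "x"), (3, "b", "y"), (4, "b", "y"), (5, "c", "z")],
)

# Keep the lowest id in each (col1, col2) group and delete every other row.
conn.execute(
    """
    DELETE FROM your_table
    WHERE id NOT IN (
        SELECT MIN(id) FROM your_table GROUP BY col1, col2
    )
    """
)

remaining = conn.execute(
    "SELECT id, col1, col2 FROM your_table ORDER BY id"
).fetchall()
print(remaining)
```

Grouping on both data columns matters here: grouping on only one of them would treat rows as duplicates even when the other column differs.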

Related

Get unique records from table avoiding all duplicates based on two key columns

I have a table Trial_tb with columns p_id,t_number and rundate.
Sample values:
p_id|t_number|rundate
=====================
111 |333     |1/7/2016
111 |333     |1/1/2016
222 |888     |1/8/2016
222 |444     |1/2/2016
666 |888     |1/6/2016
555 |777     |1/5/2016
p_id and t_number are the key columns. I need to fetch rows such that the result has no record whose p_id/t_number combination is duplicated. For example, 111|333 is duplicated and hence not valid. The query should fetch everything other than the first two records.
I wrote the script below, but it fetches only the last record. :(
select rundate,p_id,t_number from
(
select rundate,p_id,t_number,
count(p_id) over (partition by p_id) PCnt,
count(t_number) over (partition by t_number) TCnt
from trialtb
)a
where a.PCnt=1 and a.TCnt=1
The having clause is ideal for this job. Having allows you to filter on aggregated records.
-- Finding unique combinations.
SELECT
p_id,
t_number
FROM
trialtb
GROUP BY
p_id,
t_number
HAVING
COUNT(*) = 1
;
This query returns combinations of p_id and t_number that occur only once.
If you want to include rundate you could add MAX(rundate) AS rundate to the select clause. Because you are only looking at unique occurrences the max or min would always be the same.
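To check the HAVING approach against the sample data from the question, here is a small sketch using Python's built-in sqlite3 module. The column types are assumptions, and rundate is stored as plain text just for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trialtb (p_id INTEGER, t_number INTEGER, rundate TEXT)")
conn.executemany(
    "INSERT INTO trialtb VALUES (?, ?, ?)",
    [
        (111, 333, "1/7/2016"),
        (111, 333, "1/1/2016"),
        (222, 888, "1/8/2016"),
        (222, 444, "1/2/2016"),
        (666, 888, "1/6/2016"),
        (555, 777, "1/5/2016"),
    ],
)

# Combinations of p_id and t_number that occur exactly once.
# MAX(rundate) is safe because each surviving group holds a single row.
rows = conn.execute(
    """
    SELECT p_id, t_number, MAX(rundate) AS rundate
    FROM trialtb
    GROUP BY p_id, t_number
    HAVING COUNT(*) = 1
    ORDER BY p_id, t_number
    """
).fetchall()
print(rows)
```

Only the duplicated 111|333 pair is dropped; the other four rows come back, including both 222 rows, because they differ in t_number.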
Do you mean:
select
p_id,t_number
from
trialtb
group by
p_id,t_number
having
count(*) = 1
or do you need the run date too?
select
p_id,t_number,max(rundate)
from
trialtb
group by
p_id,t_number
having
count(*) = 1
Seeing as you are only looking at items with one result, using max or min should work fine.

SQL Server Sum multiple rows into one - no temp table

I would like to see a most concise way to do what is outlined in this SO question: Sum values from multiple rows into one row
that is, combine multiple rows while summing a column.
But how to then delete the duplicates. In other words I have data like this:
Person Value
--------------
1 10
1 20
2 15
And I want to sum the values for any duplicates (on the Person col) into a single row and get rid of the other duplicates on the Person value. So my output would be:
Person Value
-------------
1 30
2 15
And I would like to do this without using a temp table. I think that I'll need to use OVER PARTITION BY but just not sure. Just trying to challenge myself in not doing it the temp table way. Working with SQL Server 2008 R2
Simply put, give me a concise stmt getting from my input to my output in the same table. So if my table name is People if I do a select * from People on it before the operation that I am asking in this question I get the first set above and then when I do a select * from People after the operation, I get the second set of data above.
Not sure why not using Temp table but here's one way to avoid it (tho imho this is an overkill):
UPDATE MyTable SET VALUE = (SELECT SUM(Value) FROM MyTable MT WHERE MT.Person = MyTable.Person);
WITH DUP_TABLE AS
(SELECT ROW_NUMBER()
OVER (PARTITION BY Person ORDER BY Person) As ROW_NO
FROM MyTable)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;
First query updates every duplicate person to the summary value. Second query removes duplicate persons.
Demo: http://sqlfiddle.com/#!3/db7aa/11
All you're asking for is a simple SUM() aggregate function and a GROUP BY
SELECT Person, SUM(Value)
FROM myTable
GROUP BY Person
The SUM() by itself would sum up the values in a column, but when you add a secondary column and GROUP BY it, SQL will show distinct values from the secondary column and perform the aggregate function by those distinct categories.
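The grouping behaviour described above can be tried out with the sample data from the question. This is only a sketch using Python's built-in sqlite3 module; the table name myTable follows the answer, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (Person INTEGER, Value INTEGER)")
conn.executemany(
    "INSERT INTO myTable (Person, Value) VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 15)],
)

# One output row per distinct Person, with that person's values summed.
rows = conn.execute(
    """
    SELECT Person, SUM(Value)
    FROM myTable
    GROUP BY Person
    ORDER BY Person
    """
).fetchall()
print(rows)
```

Note that this only produces the summed result set; it does not change the rows stored in the table.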

Delete multiple occurrences of the same ID # and code in a junction table

My problem is this: in this database the junction table contains some rows where the kha_id and the icd_fk are the same. While it's OK for a kha_id to appear in icd_junction more than once, each occurrence has to have a different icd_fk. I can run a query and get all of the ID#s and the codes which are listed more than once, but is there an industry-standard way of deleting all but one occurrence of each?
Example of what I have:
KHA_ID   ICD_FK
123456   V23
123456   V23
123456   V24
I need one of the rows with kha_id=123456 and ICD_FK=V23 taken out.
This:
DELETE j1
FROM ICD_Junction AS j1
WHERE EXISTS
( SELECT 1
FROM ICD_Junction AS j2
WHERE j2.KHA_ID = j1.KHA_ID
AND j2.ICD_FK = j1.ICD_FK
AND j2.ID < j1.ID
)
;
will delete, for each KHA_ID and ICD_FK, all but one relevant row of ICD_Junction. (Specifically, it will keep the one with the least ID, and delete the rest.)
Once you've run the above, you should fix whatever code caused the duplication, and add a unique constraint to prevent this from happening again.
(Disclaimer: Not tested, and it's been a while since I last used SQL Server.)
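Since the statement above is untested, here is a hedged sketch of the same idea run against SQLite through Python's built-in sqlite3 module. SQLite does not accept the `DELETE j1 FROM ... AS j1` form, so the outer table is referenced by its full name instead. The ID column, the column types, and the index name ux_icd_junction are assumptions made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ICD_Junction (ID INTEGER PRIMARY KEY, KHA_ID INTEGER, ICD_FK TEXT)"
)
conn.executemany(
    "INSERT INTO ICD_Junction (KHA_ID, ICD_FK) VALUES (?, ?)",
    [(123456, "V23"), (123456, "V23"), (123456, "V24")],
)

# Delete every row for which an identical (KHA_ID, ICD_FK) row with a lower ID exists.
conn.execute(
    """
    DELETE FROM ICD_Junction
    WHERE EXISTS (
        SELECT 1
        FROM ICD_Junction AS j2
        WHERE j2.KHA_ID = ICD_Junction.KHA_ID
          AND j2.ICD_FK = ICD_Junction.ICD_FK
          AND j2.ID < ICD_Junction.ID
    )
    """
)
remaining = conn.execute(
    "SELECT ID, KHA_ID, ICD_FK FROM ICD_Junction ORDER BY ID"
).fetchall()

# With the duplicates gone, a unique index stops them from coming back.
conn.execute("CREATE UNIQUE INDEX ux_icd_junction ON ICD_Junction (KHA_ID, ICD_FK)")
try:
    conn.execute("INSERT INTO ICD_Junction (KHA_ID, ICD_FK) VALUES (123456, 'V23')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(remaining, duplicate_rejected)
```

The unique index can only be created after the cleanup, because creating it while duplicates still exist fails.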
Edited to add: If I'm understanding your comment correctly, you also need help with the query to find duplicates? For that, you can write:
SELECT KHA_ID,
ICD_FK,
COUNT(1) -- the number of duplicates
FROM ICD_Junction
GROUP
BY KHA_ID,
ICD_FK
HAVING COUNT(1) > 1
;
The original question was about deleting, but the comment asked how to find the duplicates. Either of these will list them:
Select jDup.*
FROM ICD_Junction AS j
JOIN ICD_Junction AS jDup
On j.KHA_ID = jDup.KHA_ID
AND j.ICD_FK = jDup.ICD_FK
AND j.ID < jDup.ID
;
Select max(jDup.ID), min(jDup.ID), count(*), jDup.KHA_ID, jDup.ICD_FK
FROM ICD_Junction AS jDup
Group By jDup.KHA_ID, jDup.ICD_FK
Having Count(*) > 1
;
You want something that uses ROW_NUMBER() and partition by. The reason is that it will let you pick one row to keep from a table that doesn't have a unique id. Like if this was a pure intersection table with no identity, you could use a variation on this to delete all rows where RowID > 1, leaving you just the unique rows. And it works just as well when you do have a unique id, where you can choose to preserve the earliest id.
select * from (select KHA_ID, ICD_FK, ROW_NUMBER()
OVER(PARTITION BY KHA_ID, ICD_FK
ORDER BY ID ASC) AS RowID
from ICD_Junction ) ordered where RowID > 1
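One possible way to turn the ROW_NUMBER() select into an actual delete is sketched below, using Python's built-in sqlite3 module. It assumes an SQLite version with window-function support (3.25 or later) and a table that does have an ID column; the column types and sample rows are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ICD_Junction (ID INTEGER PRIMARY KEY, KHA_ID INTEGER, ICD_FK TEXT)"
)
conn.executemany(
    "INSERT INTO ICD_Junction (KHA_ID, ICD_FK) VALUES (?, ?)",
    [(123456, "V23"), (123456, "V23"), (123456, "V24")],
)

# Number the rows inside each (KHA_ID, ICD_FK) group by ascending ID,
# then delete everything except the first row of each group.
conn.execute(
    """
    DELETE FROM ICD_Junction
    WHERE ID IN (
        SELECT ID
        FROM (
            SELECT ID,
                   ROW_NUMBER() OVER (
                       PARTITION BY KHA_ID, ICD_FK
                       ORDER BY ID ASC
                   ) AS RowID
            FROM ICD_Junction
        ) AS ordered
        WHERE RowID > 1
    )
    """
)

remaining = conn.execute(
    "SELECT ID, KHA_ID, ICD_FK FROM ICD_Junction ORDER BY ID"
).fetchall()
print(remaining)
```

Changing the ORDER BY inside the OVER clause changes which row of each group survives, for example ORDER BY ID DESC to keep the newest.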

how to delete duplicates from a database table based on a certain field

I have a table that somehow got duplicated. I basically want to delete all records that are duplicates, where a duplicate is defined by a field in my table called SourceId. There should only be one record for each SourceId.
Is there any SQL that I can write that will delete every duplicate so I only have one record per SourceId?
Assuming you have a column ID that can tie-break the duplicate sourceid's, you can use this. Using min(id) causes it to keep just the min(id) per sourceid batch.
delete from tbl
where id NOT in
(
select min(id)
from tbl
group by sourceid
)
delete from tbl
where pk in (
    select i2.pk
    from tbl i1
    inner join tbl i2
    on i1.SourceId = i2.SourceId
    and i1.pk < i2.pk -- without this, every row matches itself and all rows are deleted
)
Good practice is to start with
select * from … and only once the result looks right replace it with delete from …
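The "select first, delete later" habit can be sketched as follows with Python's built-in sqlite3 module. The table name tbl, the id column, and the sample rows are assumptions; the point is only that the preview and the delete share exactly the same WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, SourceId TEXT)")
conn.executemany(
    "INSERT INTO tbl (id, SourceId) VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "B"), (4, "A"), (5, "C")],
)

# Shared predicate: rows that are not the lowest id for their SourceId.
predicate = "id NOT IN (SELECT MIN(id) FROM tbl GROUP BY SourceId)"

# Step 1: preview the rows that would be removed.
to_delete = conn.execute(
    f"SELECT id, SourceId FROM tbl WHERE {predicate} ORDER BY id"
).fetchall()

# Step 2: once the preview looks right, run the delete with the same predicate.
deleted = conn.execute(f"DELETE FROM tbl WHERE {predicate}").rowcount

remaining = conn.execute("SELECT id, SourceId FROM tbl ORDER BY id").fetchall()
print(to_delete, deleted, remaining)
```

If the preview returns every row in the table, the predicate is wrong and nothing has been lost yet.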

How to mark duplicates in an SQL query

I have an SQL query which looks at date-of-birth, last name and a soundex of first name to identify duplicates. The following query finds some 8,000 rows (which I assume means there are around 8,000 duplicate records).
select dob,last_name,soundex(first_name),count(*)
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
Almost all of the results have a count of 2, a few have a count of 3 where obviously the record existed twice in one of the two databases which were merged.
The next step I need to take is to mark one of the rows in each pair (it doesn't really matter which) with a duplicate flag, and to mark each row with the opposite row's key. Is there a way of doing this using SQL?
This should do what you are after, the UPDATE in one go.
UPDATE clients c
INNER JOIN
(
select dob,last_name,soundex(first_name) as first_name_soundex,MIN(id) as keep
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
) k
ON c.dob=k.dob AND c.last_name=k.last_name AND soundex(c.first_name)=k.first_name_soundex
SET duplicateid = NULLIF(k.keep, c.id),
hasduplicate = (k.keep = c.id)
It assumes you have 3 columns not stated in the question
id: primary key
duplicateid: points to the dup being kept
hasduplicate: boolean, marks the one to keep
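For experimenting with this marking scheme, here is a rough sketch using Python's built-in sqlite3 module. It is not the join-update itself: it uses two plain UPDATE statements with subqueries, which is a swapped-in technique that SQLite handles without join-update syntax. Because soundex() is not available in every SQLite build, a hypothetical name_key function (first letter only) stands in for it, and the sample rows are made up.

```python
import sqlite3


def name_key(first_name):
    # Hypothetical stand-in for soundex(): just the upper-cased first letter.
    return first_name[:1].upper()


conn = sqlite3.connect(":memory:")
conn.create_function("name_key", 1, name_key)
conn.execute(
    """
    CREATE TABLE clients (
        id INTEGER PRIMARY KEY,
        dob TEXT,
        last_name TEXT,
        first_name TEXT,
        duplicateid INTEGER,
        hasduplicate INTEGER DEFAULT 0
    )
    """
)
conn.executemany(
    "INSERT INTO clients (id, dob, last_name, first_name) VALUES (?, ?, ?, ?)",
    [
        (1, "1980-01-01", "Smith", "Jon"),
        (2, "1980-01-01", "Smith", "John"),
        (3, "1975-05-05", "Jones", "Mary"),
    ],
)

# Point every later duplicate at the lowest matching id; the kept row stays NULL.
conn.execute(
    """
    UPDATE clients
    SET duplicateid = (
        SELECT MIN(c2.id)
        FROM clients AS c2
        WHERE c2.dob = clients.dob
          AND c2.last_name = clients.last_name
          AND name_key(c2.first_name) = name_key(clients.first_name)
          AND c2.id < clients.id
    )
    """
)

# Flag the rows that other rows point at, i.e. the ones being kept.
conn.execute(
    """
    UPDATE clients
    SET hasduplicate = 1
    WHERE id IN (SELECT duplicateid FROM clients WHERE duplicateid IS NOT NULL)
    """
)

rows = conn.execute(
    "SELECT id, duplicateid, hasduplicate FROM clients ORDER BY id"
).fetchall()
print(rows)
```

In this sketch a group of three would leave both later rows pointing at the lowest id, which matches the meaning of duplicateid described above.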
Well, you could use SELECT DISTINCT, and then mark a single row as "not duplicate" -- then search for rows that are "not duplicate" to find the duplicate.
Here is a query that will give you not only the duplicates, but also the first id inserted (assuming Id is the sequential primary-key column) and the newest id.
Off the top of my head:
select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
You can use this in a JOIN to do your update
UPDATE Clients
SET OppositeRowId = DuplicateRows.NewestId
FROM
(
select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
) DuplicateRows
WHERE
DuplicateRows.OldestId = Clients.Id
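All of this assumes that each group has exactly one duplicate (a count of 2). For groups with more than two rows, such as the count-of-3 cases, only the oldest row gets marked and it points only at the newest, so you are going to have to try something different.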