Optimisation of SQL query for deleting duplicate items from a large table - sql

Could anyone please help me optimise one of my queries, which takes more than 20 minutes to run against 3 million rows?
Table Structure
-----------------------------------------------------------------------------------------
|id [INT Auto Inc]| name_id (uuid) | name (varchar)| city (varchar) | name_type(varchar)|
-----------------------------------------------------------------------------------------
Query
The purpose of the query is to eliminate duplicates; here a duplicate means a row having the same name_id and name as another row.
DELETE
FROM records
WHERE id NOT IN
    (SELECT DISTINCT ON (name_id, name) id
     FROM records);

I would write your delete using exists logic:
DELETE
FROM records r1
WHERE EXISTS (SELECT 1
              FROM records r2
              WHERE r2.name_id = r1.name_id AND
                    r2.name = r1.name AND
                    r2.id < r1.id);
This delete keeps, for each (name_id, name) group, the row having the smallest id value and removes the rest. To speed this up, you may try adding the following index:
CREATE INDEX idx ON records (name_id, name, id);

You probably already have a primary key on the identity column; you can then use it to exclude redundant rows by id in the following way:
WITH cte AS (
    SELECT MIN(id) AS id FROM records GROUP BY name_id, name
)
DELETE FROM records
WHERE NOT EXISTS (SELECT id FROM cte WHERE id = records.id);
Even without the index, this should run relatively fast, probably because the planner can use a merge join strategy.
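If you want to check that before deleting anything, a quick sketch (PostgreSQL, using the table and column names from the question) is to ask the planner for the plan of the same statement:
EXPLAIN
WITH cte AS (
    SELECT MIN(id) AS id FROM records GROUP BY name_id, name
)
DELETE FROM records
WHERE NOT EXISTS (SELECT id FROM cte WHERE id = records.id);
EXPLAIN without ANALYZE only plans the statement, so nothing is deleted; if the plan shows a nested loop rather than a hash or merge anti join, the (name_id, name, id) index suggested earlier is worth testing.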

Related

Delete multiple occurrences of the same ID # and code in a junction table

My problem is this: in this database the junction table contains some rows where the kha_id and the icd_fk are the same. While it's OK for kha_id to appear in icd_junction more than once, each occurrence has to have a different icd_fk. I can run a query and get all of the ID#s and the codes which are listed more than once, but is there an industry-standard way of going about deleting all but one occurrence of each?
Example of what I have:
KHA_ID: 123456 V23
123456 V23
123456 V24
I need one of the rows with KHA_ID=123456 and ICD_FK=V23 taken out.
This:
DELETE j1
FROM ICD_Junction AS j1
WHERE EXISTS
    ( SELECT 1
      FROM ICD_Junction AS j2
      WHERE j2.KHA_ID = j1.KHA_ID
        AND j2.ICD_FK = j1.ICD_FK
        AND j2.ID < j1.ID
    );
will delete, for each KHA_ID and ICD_FK, all but one relevant row of ICD_Junction. (Specifically, it will keep the one with the lowest ID and delete the rest.)
Once you've run the above, you should fix whatever code caused the duplication, and add a unique constraint to prevent this from happening again.
(Disclaimer: Not tested, and it's been a while since I last used SQL Server.)
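For the unique constraint mentioned above, a minimal sketch (SQL Server; the constraint name here is just an example) would be:
ALTER TABLE ICD_Junction
    ADD CONSTRAINT UQ_ICD_Junction_KHA_ICD UNIQUE (KHA_ID, ICD_FK);
This will fail if any duplicates remain, so run it only after the DELETE above has gone through.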
Edited to add: If I'm understanding your comment correctly, you also need help with the query to find duplicates? For that, you can write:
SELECT   KHA_ID,
         ICD_FK,
         COUNT(1)  -- the number of duplicates
FROM     ICD_Junction
GROUP BY KHA_ID,
         ICD_FK
HAVING   COUNT(1) > 1;
The original question asked how to delete, but the comment asked how to find. The first query below lists the duplicate rows themselves; the second summarises each duplicated pair:
SELECT jDup.*
FROM ICD_Junction AS j
JOIN ICD_Junction AS jDup
  ON j.KHA_ID = jDup.KHA_ID
 AND j.ICD_FK = jDup.ICD_FK
 AND j.ID < jDup.ID;

SELECT MAX(jDup.ID), MIN(jDup.ID), COUNT(*), jDup.KHA_ID, jDup.ICD_FK
FROM ICD_Junction AS jDup
GROUP BY jDup.KHA_ID, jDup.ICD_FK
HAVING COUNT(*) > 1;
You want something that uses ROW_NUMBER() and PARTITION BY. The reason is that it lets you pick one row to keep even from a table that doesn't have a unique id. If this were a pure intersection table with no identity column, you could use a variation on this to delete all rows where RowID > 1, leaving just the unique rows (sketched after the query below). It works just as well when you do have a unique id, where you can choose to preserve the earliest one.
SELECT *
FROM (SELECT KHA_ID, ICD_FK,
             ROW_NUMBER() OVER (PARTITION BY KHA_ID, ICD_FK
                                ORDER BY ID ASC) AS RowID
      FROM ICD_Junction) ordered
WHERE RowID > 1
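A sketch of the deletable variation described above (SQL Server 2005 or later; deleting through the CTE removes the underlying rows of ICD_Junction):
WITH numbered AS (
    SELECT KHA_ID, ICD_FK,
           ROW_NUMBER() OVER (PARTITION BY KHA_ID, ICD_FK
                              ORDER BY ID ASC) AS RowID
    FROM ICD_Junction
)
DELETE FROM numbered
WHERE RowID > 1;
If the table really had no unique id, ORDER BY (SELECT NULL) could stand in for ORDER BY ID ASC.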

How to find out the duplicate records

Using Sql Server 2000
I want to find the duplicate records in the table.
Table1
ID Transaction Value
001 020102 10
001 020103 20
001 020102 10 (Duplicate Records)
002 020102 10
002 020103 20
002 020102 10 (Duplicate Records)
...
...
Transaction and Value can repeat for different IDs, but not for the same ID...
Expected Output
Duplicate records are...
ID Transaction Value
001 020102 10
002 020102 10
...
...
How can I make a query to view the duplicate records?
Need query help.
You can use
SELECT ID, [Transaction], Value
FROM Table1
GROUP BY ID, [Transaction], Value
HAVING COUNT(ID) > 1
Select Id, [Transaction], Value, Count(Id)
from Table1
group by Id, [Transaction], Value
having count(Id) > 1
This query will also show you, alongside each entry, how many times that combination is repeated. If you don't need that, you can simply remove the Count(Id) column from the select clause.
Self join (with additional PK or Timestamp or...)
I can see that people have provided solutions with grouping, but no one has provided the self-join solution. The only problem is that you'd need some other row descriptor that is unique for each record, be it a primary key, a timestamp or anything else... Suppose the unique column's name is Uniq; this would be the solution:
select distinct r1.ID, r1.[Transaction], r1.Value
from Records r1
join Records r2
  on ((r2.ID = r1.ID) and
      (r2.[Transaction] = r1.[Transaction]) and
      (r2.Value = r1.Value) and
      (r2.Uniq != r1.Uniq))
The last join condition makes sure each row is joined not to itself but only to its other duplicates...
To find out which approach works best for you, check the execution plans and do some testing.
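One way to do that comparison on SQL Server is to return the estimated plan instead of executing the queries, for example (a sketch; GO is the batch separator used by Query Analyzer / SSMS, and each candidate query goes between the two SET statements):
SET SHOWPLAN_TEXT ON;
GO
SELECT ID, [Transaction], Value
FROM Table1
GROUP BY ID, [Transaction], Value
HAVING COUNT(*) > 1;
GO
SET SHOWPLAN_TEXT OFF;
GO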
You can do this:
SELECT ID, [Transaction], Value
FROM Table1
GROUP BY ID, [Transaction], Value
HAVING COUNT(*) > 1
To delete the duplicates, if you have no primary key then you need to select the distinct values into a separate table, delete everything from this one, then copy the distinct records back:
SELECT ID, [Transaction], Value
INTO #tmpDeduped
FROM Table1
GROUP BY ID, [Transaction], Value

DELETE FROM Table1

INSERT INTO Table1
SELECT * FROM #tmpDeduped

Last id value in a table. SQL Server

Is there a way to get the nth-from-last id value of a table without scanning it completely? (Just go to the end of the table and read the id value.)
table
id fieldvalue
1 2323
2 4645
3 556
... ...
100000000 1232
So, for example, here n = 100000000 (100 million).
--------------EDIT-----
So which one of the queries proposed would be more efficient?
SELECT MAX(id) FROM <tablename>
Assuming ID is the IDENTITY for the table, you could use SELECT IDENT_CURRENT('TABLE NAME').
One thing to note about this approach: if you have INSERTs that fail but still increment the IDENTITY counter, then you will get back a result that is higher than the result returned by SELECT MAX(id) FROM <tablename>.
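A small sketch that puts the two side by side (here 'records' is just an example table name):
SELECT IDENT_CURRENT('records')      AS last_identity_issued,
       (SELECT MAX(id) FROM records) AS max_id_present;
After a failed or rolled-back INSERT, last_identity_issued can be higher than max_id_present.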
You can also use the system catalog views to get the last values of all identity columns in the database:
SELECT OBJECT_NAME(object_id) + '.' + name AS col_name,
       last_value
FROM sys.identity_columns
ORDER BY last_value DESC
In the case where table1 rows are inserted first, and then rows that depend on ids from table1 are inserted into table2, you can use a SELECT:
INSERT INTO `table2` (`some_id`, `some_value`)
VALUES ((SELECT some_id
         FROM `table1`
         WHERE `other_key_1` = 'xxx'
           AND `other_key_2` = 'yyy'),
        'some value abc abc 123 123 ...');
Of course, this can work only if there are other identifiers that can uniquely identify rows in table1.
First of all, you want to access the table in DESCENDING order by ID.
Then you would select the TOP N records.
At this point, you want the last record of that set, which hopefully is obvious. Assuming that the id field is indexed, this would at most retrieve the last N records of the table and most likely would end up being optimised into a single-record fetch.
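A sketch of that approach (SQL Server syntax; 'records' and n = 100 are placeholders). The inner query grabs the last 100 ids and the outer query keeps the oldest of them, i.e. the 100th-from-last id:
SELECT TOP (1) id
FROM (SELECT TOP (100) id
      FROM records
      ORDER BY id DESC) AS last_n
ORDER BY id ASC;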
SELECT IDENT_CURRENT('Your Table Name') gives the last id of your table.

How to delete duplicates from a database table based on a certain field

I have a table that somehow got duplicated. I basically want to delete all records that are duplicates, where a duplicate is defined by a field in my table called SourceId. There should only be one record for each SourceId.
Is there any SQL I can write that will delete every duplicate, so I only have one record per SourceId?
Assuming you have a column id that can tie-break the duplicate SourceIds, you can use this. Using min(id) causes it to keep just the row with the lowest id per SourceId group.
delete from tbl
where id NOT in
(
select min(id)
from tbl
group by sourceid
)
delete from table
where pk in (
    select i2.pk
    from table i1
    inner join table i2
        on i1.SourceId = i2.SourceId
        and i1.pk < i2.pk  -- without this tie-break, every row would match itself and be deleted
)
Good practice is to start with
select * from … and only later replace it with delete from …
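A minimal illustration of that practice with the join-based delete above (same placeholder names): preview what would go, then switch the SELECT back into the DELETE.
select i2.pk, i2.SourceId
from table i1
inner join table i2
    on i1.SourceId = i2.SourceId
    and i1.pk < i2.pk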

How to delete duplicate rows with SQL?

I have a table with some rows in it. Every row has a date field. Right now, there may be duplicates of a date. I need to delete all the duplicates and keep only the row with the highest id. How is this possible using an SQL query?
Now:
date id
'07/07' 1
'07/07' 2
'07/07' 3
'07/05' 4
'07/05' 5
What I want:
date id
'07/07' 3
'07/05' 5
DELETE FROM table WHERE id NOT IN
(SELECT MAX(id) FROM table GROUP BY date);
I don't have comment rights, so here's my comment as an answer in case anyone comes across the same problem:
In SQLite3, there is an implicit numerical primary key called "rowid", so the same query would look like this:
DELETE FROM table WHERE rowid NOT IN
(SELECT MAX(rowid) FROM table GROUP BY date);
This will work with any table, even if it does not contain a primary key column called "id".
For MySQL, PostgreSQL and Oracle, a better way is a self join.
PostgreSQL:
DELETE FROM table t1 USING table t2 WHERE t1.date = t2.date AND t1.id < t2.id;
MySQL
DELETE FROM table
USING table, table as vtable
WHERE (table.id < vtable.id)
AND (table.date=vtable.date)
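Oracle is mentioned above but has no example; a sketch of the same idea there (my_table and dt are stand-ins, since table and date are reserved words in Oracle) that keeps the highest id per date:
DELETE FROM my_table t1
WHERE EXISTS (SELECT 1
              FROM my_table t2
              WHERE t2.dt = t1.dt
                AND t2.id > t1.id);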
Aggregate solutions (MAX with GROUP BY) are often noticeably slower than the self join on large tables.