Deleting completely identical duplicates from db - sql

We have a table in our db with copied data that has completely duplicated many rows. Because the id is also duplicated, there is nothing we can use to select just the duplicates. I tried using a LIMIT to delete only one, but Redshift gave a syntax error when trying to use LIMIT.
Any ideas how we can delete just one of two rows that have completely identical information?

Use select distinct to create a new table. Then either truncate & copy the data, or drop the original table and rename the new table to the original name:
create table t2 as select distinct * from t;
truncate t;
insert into t select * from t2;
drop table t2;
Add a column with unique values. identity(seed, step) looks interesting.
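For example, a minimal sketch of that idea, assuming the duplicated table is t with columns id and col1 (names hypothetical), and that the identity column is defined when a new table is created (Redshift restricts adding an IDENTITY column via ALTER TABLE):
create table t_with_rowid (
    row_id bigint identity(1, 1),
    id     int,
    col1   varchar(100)            -- ...plus the rest of the original columns
);
insert into t_with_rowid (id, col1)
select id, col1 from t;
-- keep one copy of each fully duplicated row
delete from t_with_rowid
where row_id not in (select min(row_id)
                     from t_with_rowid
                     group by id, col1);   -- group by every original column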

Remove duplicate SQL rows by looking at all columns

I have this table, where every column is a VARCHAR (or equivalent):
field001 field002 field003 field004 field005 .... field500
500 VARCHAR columns. No primary keys. And no column is guaranteed to be unique. So the only way to know for sure if two rows are the same is to compare the values of all columns.
(Yes, this should be in TheDailyWTF. No, it's not my fault. Bear with me here).
I inserted a duplicate set of rows by mistake, and I need to find them and remove them.
There are 12 million rows in this table, so I'd rather not recreate it.
However, I do know what rows were mistakenly inserted (I have the .sql file).
So I figured I'd create another table and load it with those. And then I'd do some sort of join that would compare all columns on both tables and then delete the rows that are equal from the first table. I tried a NATURAL JOIN as that looked promising, but nothing was returned.
What are my options?
I'm using Amazon Redshift (so PostgreSQL 8.4 if I recall), but I think this is a general SQL question.
You can treat the whole row as a single record in Postgres (and thus I think in Redshift).
The following works in Postgres, and will keep one of the duplicates:
delete from the_table
where ctid not in (select min(ctid)
                   from the_table
                   group by the_table); --<< Yes, the group by is correct!
This is going to be slow!
Grouping over so many columns and then deleting with a NOT IN will take quite some time. Especially if a lot of rows are going to be deleted.
If you want to delete all duplicate rows (not keeping any of them), you can use the following:
delete from the_table
where the_table in (select the_table
                    from the_table
                    group by the_table
                    having count(*) > 1);
You should be able to identify all the mistakenly inserted rows using CREATEXID. If you group by CREATEXID on your table as below and get the count, you should be able to see how many rows were inserted in your transaction and remove them with a DELETE command.
SELECT CREATEXID,COUNT(1)
FROM yourtable
GROUP BY 1;
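Assuming that hidden column can be referenced in a DELETE the same way (I have not verified this against current Redshift), the cleanup might then look like:
-- 1234567 stands in for the transaction id found with the query above
DELETE FROM yourtable
WHERE CREATEXID = 1234567;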
One simplistic solution is to recreate the table, e.g.
CREATE TABLE my_temp_table (
-- add column definitions here, just like the original table
);
INSERT INTO my_temp_table SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
or even
CREATE TABLE my_temp_table AS SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
It is a bit of a trick, but it may help.
Each row in the table carries the ID of the transaction in which the row was inserted/updated (see System Columns); it is the xmin column. Using it you can find the transaction ID in which you inserted the wrong data, then just delete those rows:
delete from my_table where xmin = <the_wrong_transaction_id>;
PS: Be careful and try it on some test table first.

How to delete all data then insert new data

I have a process that runs every 60 minutes. On one table I need to remove all data and then insert records from a different table. The problem is that it takes a long time to delete and reinsert the data, and I am afraid users will see the table while it has no data. Is there a way to refresh the data without users noticing?
If you want to remove all data from the table then use TRUNCATE TABLE instead of DELETE - it'll do it faster.
As for the insert, it is a bit hard to say because you did not give any details, but what you can try is:
Option 1 - Using temp table
create table table_temp as select * from original_table where rownum < 1;
-- insert into table_temp here
drop table original_table;
exec sp_rename 'table_temp', 'original_table';
Option 2 - Use 2 tables, "active-passive":
Have 2 tables for the data and a view to select over them. The view will join with a third table that specifies from which of the tables to select - kind of an "active-passive" concept.
To demonstrate concept:
with active_table as ( select 'table1_active' active_table )
select 1 data
where 'table1_active' in (select * from active_table)
union all
select 2
where 'table2_active' in (select * from active_table)
-- This returns only one record with the "1"
Are you truncating instead of deleting? A truncate (while logged) is much, much faster than a delete.
If you cannot truncate, try deleting 1,000-10,000 rows at a time (smaller log buildup and, when deleting large numbers of rows, a big increase in speed).
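A rough sketch of that batched delete, in SQL Server syntax since the thread mixes dialects (the table name is a placeholder):
-- Delete in chunks so the log stays small; loop until nothing is left.
WHILE 1 = 1
BEGIN
    DELETE TOP (10000) FROM original_table;
    IF @@ROWCOUNT = 0 BREAK;
END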
If you really want fast performance you can create a second table, fill it with data, and then drop the first table and rename the second table as the first table. You will lose all the permissions on the table when you do this so be sure to reapply the permissions to the renamed table.
If you are deleting all rows in a table, you can consider using a TRUNCATE statement against the table instead of a DELETE. It will speed up part of your process. Keep in mind that this will reset any identity seeds you may have on the table.
As suggested, you can wrap this process in a transaction and depending on how you set your transaction isolation level, you can control what your users will see if they query the data during the transaction.
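A minimal sketch of that wrapping, with hypothetical table names, and using DELETE rather than TRUNCATE because TRUNCATE's behaviour inside a transaction varies by engine:
BEGIN TRANSACTION;
    -- depending on isolation level, readers see either the old or the new data
    DELETE FROM target_table;
    INSERT INTO target_table
    SELECT * FROM source_table;
COMMIT;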
Make it sequence based: your copied-in records all have a series number (the same for all copied-in records), another table holds which sequence is active, and you always select via a join to this table. When you copy in new records they get a new sequence that is not yet active; when they are all copied in, the sequence table is updated to the new sequence, and the redundant sequence records are deleted at your leisure.
Example
Let's suppose your table has field SeriesNo added and table ActiveSeries has field SeriesNo.
All queries of your table:
SELECT *
FROM YourTable Y
JOIN ActiveSeries A
ON A.SeriesNo = Y.SeriesNo
then updating SeriesNo in ActiveSeries makes new series of records available instantly.
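For example (series number 42, the staging table name, and the col1/col2 columns are made up for illustration):
-- Load the new batch under a series number that is not yet active.
INSERT INTO YourTable (SeriesNo, col1, col2)
SELECT 42, col1, col2 FROM staging_table;

-- Flip the active series in one statement; readers switch instantly.
UPDATE ActiveSeries SET SeriesNo = 42;

-- Remove the now-redundant series at your leisure.
DELETE FROM YourTable WHERE SeriesNo <> 42;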
I would follow the approach below while I troubleshoot why the delete and reinsert are taking so long.
Create a new table (t1) which has the same data as the old table (maintable).
Now do your stuff on t1.
When your stuff is done, rename t1 to maintable.
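Roughly (syntax varies by database; the rename step was shown earlier with sp_rename for SQL Server):
-- Work on a copy while maintable keeps serving reads.
create table t1 as select * from maintable;

-- ... do the delete / reinsert against t1 here ...

-- Swap the copy into place; permissions usually need to be re-applied.
drop table maintable;
alter table t1 rename to maintable;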

Delete rows in a table that are not affected by update

I have two tables. One of them is a temporary table in which I copy the data from a big CSV file. After that I update my other table with the temporary table (see this answer: Copy a few of the columns of a csv file into a table).
When I update my temporary table once more with an (updated) CSV file (data from a grep in bash, with row numbers increasing per update), I want to delete the rows that are not affected by the update. That way the temp table could be smaller than a temp table with all the data.
First: Is it better to drop all data in the temp table, fill it with the whole updated CSV data, and after that update/insert the other table?
Second: Or to update the temp table in the first place?
So it is a matter of the size of the tables. I'm talking about 500k rows (with geometry columns).
An example:
table
1, NULL
2, NULL
temp table
1, hello
2, good morning
CSV
1, hello there
2, good morning
3, good evening
temp table
1, hello there
2, good morning
3, good evening
OR
temp table
1, hello there
3, good evening
So my question is how to update a table with a CSV file, insert new rows, update the old rows and delete the rows that were not affected by the update.
I see two possible solutions:
1. Apply the changes individually
The data is applied with a series of update/delete/insert statements like this:
-- get rid of deleted rows
delete from the_table
where not exists (select 1
                  from temp_table tt
                  where tt.id = the_table.id);

-- update changed data
update the_table
set ..
from temp_table src
where src.id = the_table.id;

-- insert new rows
insert into the_table
select ..
from temp_table src
where not exists (select 1
                  from the_table t2
                  where t2.id = src.id);
This is the required approach if other sources write to the target table and you don't want to overwrite that. Maybe you don't even want to delete "missing" rows then. Or update only a sub-set of the columns.
2. Flush and fill the table
If you never modify the data in the target table and you don't have foreign keys referencing that table, I would do a flush and fill of the real table:
truncate the_table;
copy the_table from '/path/to/data.csv' ...;
If you run the truncate and copy in a single transaction, the copy performance will improve because it minimizes the amount of WAL logging.
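For example (the exact COPY options depend on your file and PostgreSQL version):
begin;
truncate the_table;
copy the_table from '/path/to/data.csv' with (format csv);
commit;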
I don't have much experience with SQL (half a year), but maybe you can compare your tables using a MINUS clause? Using MINUS you can get the non-updated rows.
P.S. I'm talking about PL/SQL.
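In PostgreSQL the equivalent of Oracle's MINUS is EXCEPT. A rough sketch, assuming the tables have columns id and txt (names hypothetical):
-- Rows that exist in the current temp table but not in the newly loaded CSV data.
SELECT id, txt FROM temp_table
EXCEPT
SELECT id, txt FROM new_csv_table;

-- Those rows can then be removed from the temp table:
DELETE FROM temp_table
WHERE (id, txt) IN (SELECT id, txt FROM temp_table
                    EXCEPT
                    SELECT id, txt FROM new_csv_table);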

delete duplicate records from sql

In my table I have so many duplicate records
SELECT ENROLMENT_NO_DATE, COUNT(ENROLMENT_NO_DATE) AS NumOccurrences
FROM Import_Master GROUP BY ENROLMENT_NO_DATE HAVING ( COUNT(ENROLMENT_NO_DATE) > 1 )
I need to remove a duplicate record if it occurs a second time... I need to keep the first (or any one) record. How can I do that?
You can use CTE to perform this task:
;with cte as
(
select ENROLMENT_NO_DATE,
row_number() over(partition by ENROLMENT_NO_DATE order by ENROLMENT_NO_DATE) rn
from Import_Master
)
delete from cte where rn > 1
See SQL Fiddle with Demo
One method could be to create a secondary, temporary table
CREATE TABLE Import_Master_Deduped AS SELECT * FROM Import_Master WHERE FALSE;
This will create an empty table with identical structure to Import_Master. Now impose uniqueness on the new table with an index:
CREATE UNIQUE INDEX Import_Master_Ndx ON Import_Master_Deduped(ENROLMENT_NO_DATE);
Finally copy the table with duplicated records inside with INSERT IGNORE, so that duplicated records will not get inserted:
INSERT IGNORE INTO Import_Master_Deduped SELECT * FROM Import_Master;
At this point, after checking everything is OK, you can rename the two tables swapping their names (this will lose any old indexes), or TRUNCATE the Import_Master table and copy back the deduped records from the new table into the old.
In the second case, recreate the UNIQUE constraint on the old table to avoid further duplicates; in the first, recreate any old indexes on the new table.
Finally, you remove the table you don't need anymore.
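In MySQL (which the INSERT IGNORE suggests), the name swap can be done atomically; a sketch:
-- Swap names in one statement, then drop the old table once everything checks out.
RENAME TABLE Import_Master TO Import_Master_old,
             Import_Master_Deduped TO Import_Master;
DROP TABLE Import_Master_old;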

Changing table field to UNIQUE

I want to run the following sql command:
ALTER TABLE `my_table` ADD UNIQUE (
`ref_id` ,
`type`
);
The problem is that some of the data in the table would make this invalid, therefore altering the table fails.
Is there a clever way in MySQL to delete the duplicate rows?
SQL can, at best, handle this arbitrarily. To put it another way: this is your problem.
You have data that currently isn't unique. You want to make it unique. You need to decide how to handle the duplicates.
There are a variety of ways of handling this:
Modifying or deleting duplicate rows by hand if the numbers are sufficiently small;
Running statements to update or delete duplicates that meet certain criteria to get to a point where the exceptions can be dealt with on an individual basis;
Copying the data to a temporary table, emptying the original and using queries to repopulate the table; and
so on.
Note: these all require user intervention.
You could of course just copy the table to a temporary table, empty the original and copy in the rows just ignoring those that fail but I expect that won't give you the results that you really want.
If you don't care which row gets deleted, use IGNORE:
ALTER IGNORE TABLE `my_table` ADD UNIQUE (
`ref_id` ,
`type`
);
What you can do is add a temporary identity column to your table. With that you can write a query to identify and delete the duplicates (you can modify the query a little bit to make sure only one copy from each set of duplicate rows is retained).
Once this is done, drop the temporary column and add the unique constraint to your original columns.
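A MySQL-style sketch of that idea, assuming my_table has no existing AUTO_INCREMENT column (the helper column name tmp_id is made up):
-- Add a throwaway auto-increment column so duplicate rows can be told apart.
ALTER TABLE `my_table`
  ADD COLUMN `tmp_id` INT NOT NULL AUTO_INCREMENT,
  ADD KEY (`tmp_id`);

-- Keep the lowest tmp_id in each (ref_id, type) group and delete the rest.
DELETE t1 FROM `my_table` t1
JOIN `my_table` t2
  ON t1.`ref_id` = t2.`ref_id`
 AND t1.`type` = t2.`type`
 AND t1.`tmp_id` > t2.`tmp_id`;

-- Drop the helper column and add the unique constraint.
ALTER TABLE `my_table` DROP COLUMN `tmp_id`;
ALTER TABLE `my_table` ADD UNIQUE (`ref_id`, `type`);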
Hope this helps.
What I've done in the past is export the unique set of data, drop the table, recreate it with the unique columns and import the data.
It is often faster than trying to figure out how to delete the duplicate data.
There is a good KB article that provides a step-by-step approach to finding and removing rows that have duplicate values. It provides two approaches - a one-off approach for finding and removing a single row and a broader solution to solving this when many rows are involved.
http://support.microsoft.com/kb/139444
Here is a snippet I used to delete duplicate rows in one of the tables
BEGIN TRANSACTION
Select *,
rank() over (Partition by PolicyId, PlanSeqNum, BaseProductSeqNum,
CoInsrTypeCd, SupplierTypeSeqNum
order by CoInsrAmt desc) as MyRank
into #tmpTable
from PlanCoInsr
select distinct PolicyId,PlanSeqNum,BaseProductSeqNum,
SupplierTypeSeqNum, CoInsrTypeCd, CoInsrAmt
into #tmpTable2
from #tmpTable where MyRank=1
truncate table PlanCoInsr
insert into PlanCoInsr
select * from #tmpTable2
drop table #tmpTable
drop table #tmpTable2
COMMIT
This worked for me:
ALTER TABLE table_name ADD UNIQUE KEY field_name (field_name)
You will have to find some other field that is unique because deleting on ref_id and type alone will delete them all.
To get the duplicates:
select ref_id, type from my_table group by ref_id, type having count(*)>1
Xaprb has some clever tricks (maybe too clever): http://www.xaprb.com/blog/2007/02/06/how-to-delete-duplicate-rows-with-sql-part-2/