SQL Script for Bulk Deletion

I am attempting to write a SQL script to bulk-delete rows in a table, with input from a text file. I am just getting into SQL scripting.
Backstory: someone in my previous role set up a table without a primary key, and a program was designed to insert data into the table. However, the program never checked for duplicate entries first; it just went ahead and did the insert.
I am attempting to clean up the database.
First, I have run a query to see just how many rows are duplicates. There are roughly 7,000, so there is no way I am going to delete them one query at a time. [ID] should have been set up as a primary key.
Query to determine duplicates
SELECT [ID] FROM [testing].[dbo].[testingtable]
GROUP BY [ID]
HAVING COUNT(*) > 1
I can delete the duplicate rows by using the following query on an individual ID:
SET ROWCOUNT 1
DELETE FROM [testing].[dbo].[testingtable]
WHERE [ID] = SomeNumber
SET ROWCOUNT 0
I have a text file of all of the duplicate ID numbers. Is there a bulk-delete script I can create so that I can feed in all of the duplicate IDs from the text file? Or is there a more efficient way? Please point me in the right direction.

I don't understand why you have (or need) a text file of all duplicate IDs.
"There are roughly ~7,000 therefore, there is no way I am going to delete them one query at a time" - of course there is a way to delete them. Here we go:
If you just want to remove duplicates from your table, use this code:
WITH CTE AS
(
    SELECT [ID],
           RN = ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY (SELECT NULL))
    FROM [testing].[dbo].[testingtable]
)
DELETE FROM CTE WHERE RN > 1

If you want to remove a very high percentage of the rows, it can be faster to copy the rows you want to keep into a holding table, truncate the original, and re-insert:
SELECT col1, col2, ...
INTO #Holdingtable
FROM MyTable
WHERE ..opposite condition..
TRUNCATE TABLE MyTable
INSERT MyTable (col1, col2, ...)
SELECT col1, col2, ...
FROM #Holdingtable

Related

Redshift Delete Duplicate Rows

I want to delete duplicates from a Redshift table that are true duplicates. Below is an example of two rows that are true duplicates.
Since it is Redshift, there are no primary keys to the table. Any help is appreciated.
id | Col 1 | Col 2
---+-------+------
1  | Val 1 | Val 2
1  | Val 1 | Val 2
I tried using the window functions row_number() and rank(). Neither worked: when applying the DELETE command, SQL cannot differentiate between the two rows.
Trial 1:
The below command deletes both rows:
DELETE From test_table
where (id) IN
(
select *, row_number() over(partition by id) as rownumber from test_table where row_number != 1
);
Trial 2:
The below command retains both rows:
DELETE From test_table
where (id) IN
(
select *, rank() over(partition by id) as rownumber from test_table where row_number != 1
);
All the row values are identical, hence you are unable to delete specific rows in that table.
In that case I would recommend creating a dummy table and loading only the unique records into it.
Steps to follow:
create table dummy as select * from main_table where 1=2;
insert into dummy (col1, col2, ..., coln) select distinct col1, col2, ..., coln from main_table;
Verify the dummy table.
alter table main_table rename to main_table_bk;
alter table dummy rename to main_table;
After you complete your testing and verification, drop main_table_bk.
Hope it helps.
You cannot delete one without deleting the other, as they are identical. The way to do this is to:
make a temp table with one copy of each duplicate row
(within a transaction) delete all rows from the source table that match rows in the temp table
insert the temp table rows back into the source table (commit)
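A minimal sketch of those three steps, assuming the example table above is named test_table with columns (id, col1, col2) -- table and column names are illustrative:

```sql
BEGIN;

-- keep one copy of each fully duplicated row
CREATE TEMP TABLE dupes AS
SELECT id, col1, col2
FROM test_table
GROUP BY id, col1, col2
HAVING COUNT(*) > 1;

-- remove every copy of those rows from the source table
DELETE FROM test_table
USING dupes
WHERE test_table.id   = dupes.id
  AND test_table.col1 = dupes.col1
  AND test_table.col2 = dupes.col2;

-- put a single copy of each back
INSERT INTO test_table
SELECT * FROM dupes;

COMMIT;
```

The GROUP BY ... HAVING COUNT(*) > 1 on all columns is what guarantees the temp table holds exactly one copy of each true duplicate.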

How to import data to an empty SQL server table avoiding duplicates in the source data

I am trying to import data into an empty SQL server table, avoiding duplicates that exist in the source data.
Currently I am doing a bulk insert into a temp table, and then copying the data across using:
INSERT INTO Actual_table
SELECT * FROM Temp_table
So the Temp_table and Actual_table have the exact same structure; the only difference is that where the Actual_table has a PK field, I have set up the Temp_table with a UNIQUE constraint and set it to ignore duplicates:
UNIQUE NONCLUSTERED (Col1) WITH (IGNORE_DUP_KEY = ON)
In other words:
Actual_table
Col1 (PK) Col2
Temp_table
Col1 (Unique, ignore duplicates) Col2
The Actual_table is empty when we start this process, and the duplicates to be avoided are only on the PK field (not DISTINCT on the whole row, in other words).
I have no idea if this is the best way to achieve this, and comments/suggestions would be appreciated.
Just to flesh out my thoughts further:
Should I rather import straight into the actual table, adding the IGNORE_DUP_KEY constraint before importing and then removing it afterwards (is this even possible)?
Or should I not set up the Temp_table with the IGNORE_DUP_KEY constraint (which makes the bulk import faster), and instead tweak the copying-across code to ignore the duplicates? If this is a good idea, could someone please show me the syntax to achieve this?
I am using SQL server 2014.
If the table is initially empty, then you just remove the duplicates when you load:
INSERT INTO Actual_table
SELECT DISTINCT *
FROM Temp_table;
If you only want "distinctness" on a subset of columns, use ROW_NUMBER():
INSERT INTO Actual_table
SELECT <col1>, <col2>, . . .
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY col ORDER BY (SELECT NULL)) as seqnum
FROM Temp_table t
) t
WHERE seqnum = 1;

How do I delete one record matching some criteria in SQL? (Netezza)

I've got some duplicate records in a table because as it turns out Netezza does not support constraint checks on primary keys. That being said, I have some records where the information is the exact same and I want to delete just ONE of them. I've tried doing
delete from table_name where test_id=2025 limit 1
and also
delete from table_name where test_id=2025 rowsetlimit 1
However neither option works. I get an error saying
found 'limit'. Expecting a keyword
Is there any way to limit the records deleted by this query? I know I could just delete the record and reinsert it but that is a little tedious since I will have to do this multiple times.
Please note that this is not SQL Server or MySQL. This is for Netezza.
If it doesn't support either "DELETE TOP 1" or the "LIMIT" keyword, you may end up having to do one of the following:
1) Add some sort of auto-incrementing column (like an ID), making each row unique. I don't know if you can do that in Netezza after the table has been created, though.
2) Programmatically read the entire table with some programming language, eliminate the duplicates programmatically, then delete all the rows and insert them again. This might not be possible if they are referenced by other tables, in which case you might have to temporarily remove the constraint.
I hope that helps. Please let us know.
And for future reference; this is why I personally always create an auto-incrementing ID field, even if I don't think I'll ever use it. :)
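A sketch of the first idea: since you may not be able to add a column to the existing table, rebuild it with CREATE TABLE ... AS SELECT and a generated row number instead. Table and column names here are illustrative:

```sql
-- rebuild the table with a generated row number, so every row is unique
CREATE TABLE table_name_new AS
SELECT ROW_NUMBER() OVER (ORDER BY test_id) AS row_id, t.*
FROM table_name t;

-- now one specific copy of a duplicate can be targeted
DELETE FROM table_name_new
WHERE row_id = (SELECT MIN(row_id)
                FROM table_name_new
                WHERE test_id = 2025);
```

After verifying the new table, you could drop the original and rename table_name_new into its place.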
The below query works for deleting duplicates from a table:
DELETE FROM YOURTABLENAME
WHERE COLNAME1 = 'XYZ'
AND (COLNAME1, ROWID) NOT IN
(
    SELECT COLNAME1, MAX(ROWID)
    FROM YOURTABLENAME
    WHERE COLNAME1 = 'XYZ'
    GROUP BY COLNAME1
)
If the records are identical then you could do something like:
CREATE TABLE DUPES AS
SELECT col1, col2, col3, ..., coln FROM source_table WHERE test_id = 2025
GROUP BY 1, 2, 3, ..., n
DELETE FROM source_table WHERE test_id = 2025
INSERT INTO source_table SELECT * FROM DUPES
DROP TABLE DUPES
You could even create a sub-query to select all the test_ids HAVING COUNT(*) > 1 to automatically find the dupes in steps 1 and 3.
-- remove duplicates from the <<TableName>> table
delete from <<TableName>>
where rowid not in
(
select min(rowid) from <<TableName>>
group by (col1,col2,col3)
);
The GROUP BY 1, 2, 3, ..., n will eliminate the dupes on the insert into the temp table.
Is the use of rowid even allowed in Netezza? As far as my knowledge is concerned, I don't think this query will execute in Netezza.

SQL Constraint IGNORE_DUP_KEY on Update

I have a Constraint on a table with IGNORE_DUP_KEY. This allows bulk inserts to partially work where some records are dupes and some are not (only inserting the non-dupes). However, it does not allow updates to partially work, where I only want those records updated where dupes will not be created.
Does anyone know how I can support IGNORE_DUP_KEY when applying updates?
I am using MS SQL 2005
If I understand correctly, you want to do UPDATEs without specifying the necessary WHERE logic to avoid creating duplicates?
create table #t (col1 int not null, col2 int not null, primary key (col1, col2))
insert into #t
select 1, 1 union all
select 1, 2 union all
select 2, 3
-- you want to do just this...
update #t set col2 = 1
-- ... but you really need to do this
update #t set col2 = 1
where not exists (
select * from #t t2
where #t.col1 = t2.col1 and t2.col2 = 1
)
The main options that come to mind are:
Use a complete UPDATE statement to avoid creating duplicates
Use an INSTEAD OF UPDATE trigger to 'intercept' the UPDATE and only do UPDATEs that won't create a duplicate
Use a row-by-row processing technique such as cursors and wrap each UPDATE in TRY...CATCH... or whatever the language's equivalent is
I don't think anyone can tell you which one is best, because it depends on what you're trying to do and what environment you're working in. But because row-by-row processing could potentially produce some false positives, I would try to stick with a set-based approach.
I'm not sure what is really going on, but if you are inserting duplicates and updating Primary Keys as part of a bulk load process, then a staging table might be the solution for you. You create a table that you make sure is empty prior to the bulk load, then load it with the 100% raw data from the file, then process that data into your real tables (set based is best). You can do things like this to insert all rows that don't already exist:
INSERT INTO RealTable
(pk, col1, col2, col3)
SELECT
pk, col1, col2, col3
FROM StageTable s
WHERE NOT EXISTS (SELECT
1
FROM RealTable r
WHERE s.pk=r.pk
)
Preventing the duplicates in the first place is best. You could also do UPDATEs on your real table by joining in the staging table, etc. This will avoid the need to "work around" the constraints. When you work around the constraints, you usually create difficult-to-find bugs.
I have the feeling you should use the MERGE statement, and in the update part you should really not update the key you want to keep unique. That also means that you have to define that key as unique in your table (by setting a unique index or defining it as the primary key). Then any update or insert with a duplicate key will fail.
Edit: I think this link will help on that:
http://msdn.microsoft.com/en-us/library/bb522522.aspx
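A minimal sketch of that idea, reusing the staging-table names from the answer above (note that MERGE requires SQL Server 2008 or later; it is not available in the SQL Server 2005 the question mentions):

```sql
MERGE INTO RealTable AS target
USING StageTable AS source
    ON target.pk = source.pk
WHEN MATCHED THEN
    -- update non-key columns only; the unique key itself is never touched
    UPDATE SET target.col1 = source.col1,
               target.col2 = source.col2
WHEN NOT MATCHED THEN
    INSERT (pk, col1, col2)
    VALUES (source.pk, source.col1, source.col2);
```

Because the ON clause matches on the key and the UPDATE branch never assigns to it, the statement cannot turn an existing row into a duplicate.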

Row number in Sybase tables

Sybase DB tables do not have a concept of self-updating row numbers. However, for one of the modules, I require a row number corresponding to each row in the database, such that max(Column) would always tell me the number of rows in the table.
I thought I'd introduce an int column and keep updating it to keep track of the row number. However, I'm having problems updating this column in the case of deletes. What SQL should I use in the delete trigger to update this column?
You can easily assign a unique number to each row by using an identity column. The identity can be a numeric or an integer (in ASE12+).
This will almost do what you require. There are certain circumstances in which you will get a gap in the identity sequence. (These are called "identity gaps", the best discussion on them is here). Also deletes will cause gaps in the sequence as you've identified.
Why do you need to use max(col) to get the number of rows in the table, when you could just use count(*)? If you're trying to get the last row from the table, then you can do
select * from table where column = (select max(column) from table).
Regarding the delete trigger to update a manually managed column, I think this would be a potential source of deadlocks, and many performance issues. Imagine you have 1 million rows in your table, and you delete row 1, that's 999999 rows you now have to update to subtract 1 from the id.
Delete trigger
CREATE TRIGGER tigger ON myTable FOR DELETE
AS
update myTable
set id = id - (select count(*) from deleted d where d.id < t.id)
from myTable t
To avoid locking problems
You could add an extra table (which joins to your primary table) like this:
CREATE TABLE rowCounter
(id int, -- foreign key to main table
rownum int)
... and use the rownum field from this table.
If you put the delete trigger on this table then you would hugely reduce the potential for locking problems.
Approximate solution?
Does the table need to keep its rownumbers up to date all the time?
If not, you could have a job which runs every minute or so, which checks for gaps in the rownum, and does an update.
Question: do the rownumbers have to reflect the order in which rows were inserted?
If not, you could do far fewer updates, but only updating the most recent rows, "moving" them into gaps.
Leave a comment if you would like me to post any SQL for these ideas.
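As one sketch of the periodic renumbering job, using the rowCounter table above: regenerate a gap-free sequence into a temp table with Sybase's identity() function, then copy it back. Whether identity() strictly follows the ORDER BY is version-dependent, so treat this as illustrative only:

```sql
-- regenerate a gap-free sequence, ideally in current rownum order
select rn = identity(10), id
into #renum
from rowCounter
order by rownum

-- copy the fresh numbers back over the gapped ones
update rowCounter
set rownum = r.rn
from rowCounter, #renum r
where rowCounter.id = r.id

drop table #renum
```

Run on a schedule, this closes any gaps left by deletes without touching rows on every delete.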
I'm not sure why you would want to do this. You could experiment with using temporary tables and "select into" with an identity column, as below.
create table test
(
col1 int,
col2 varchar(3)
)
insert into test values (100, "abc")
insert into test values (111, "def")
insert into test values (222, "ghi")
insert into test values (300, "jkl")
insert into test values (400, "mno")
select rank = identity(10), col1 into #t1 from test
select * from #t1
delete from test where col2="ghi"
select rank = identity(10), col1 into #t2 from test
select * from #t2
drop table test
drop table #t1
drop table #t2
This would give you a dynamic ID (of sorts).