How to delete duplicate records (SQL) when there is no identification column?

It is a data warehouse project in which I load a table whose columns each refer to another table. Due to an error in the process, many duplicate records (approximately 13,000) were loaded, but they have no unique identifier, so the rows are exactly identical. Is there a way to delete only the duplicate copies so that I don't have to empty the table and repeat the loading process?

You can use row_number() and a CTE:
with cte as (
    select row_number() over (
        partition by col1, col2, ...
        order by (select null)
    ) as rn
    from mytable
)
delete from cte where rn > 1;
The window function guarantees that the same number will not be assigned twice within a partition; you need to enumerate all of the table's columns in the partition by clause.
If you are going to delete a significant part of the rows, then it might be simpler to empty and recreate the table:
create table tmptable as select distinct * from mytable;
truncate table mytable; -- back it up first!
insert into mytable select * from tmptable;
drop table tmptable;
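The empty-and-reload approach is portable enough to verify outside SQL Server. Here is a minimal sketch using Python's sqlite3 module (the table and column names are made up for the demo, and SQLite has no TRUNCATE, so a plain DELETE stands in for it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table mytable (col1 text, col2 int)")
con.executemany("insert into mytable values (?, ?)",
                [("a", 1), ("a", 1), ("b", 2), ("b", 2), ("b", 2)])

# Rebuild the table from SELECT DISTINCT, mirroring the tmptable approach
con.execute("create table tmptable as select distinct * from mytable")
con.execute("delete from mytable")  # SQLite has no TRUNCATE TABLE
con.execute("insert into mytable select * from tmptable")
con.execute("drop table tmptable")

rows = con.execute("select * from mytable order by col1").fetchall()
print(rows)  # [('a', 1), ('b', 2)]
```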

You can make use of row_number() to delete the duplicate rows by first partitioning them and then ordering by one of the columns within that partition.
You have to list all your columns in partition by if the records are completely identical.
WITH CTE1 AS (
    SELECT A.*,
           ROW_NUMBER() OVER (PARTITION BY CODDIMALUMNO, (OTHER COLUMNS)
                              ORDER BY CODDIMALUMNO) AS RN
    FROM TABLE1 A
)
DELETE FROM CTE1
WHERE RN > 1;

You can use row_number() and an updatable CTE:
with todelete as (
    select t.*,
           row_number() over (partition by . . . ) as seqnum
    from t
)
delete from todelete
where seqnum > 1;
The . . . is for the columns that define duplicates.

Related

CTE with DELETE - Alternative for SQL Data Warehouse

I would like to delete all rows in a table where the batchId (a running number) is older than the two most recent ones. I could probably do this in a SQL Database with the query:
WITH CTE AS (
    SELECT *,
           DENSE_RANK() OVER (ORDER BY BATCHID DESC) AS RN
    FROM MyTable
)
DELETE FROM CTE WHERE RN > 2;
But the same is not allowed in a SQL Data Warehouse per this. Looking for alternatives here.
You can try using a JOIN (note the ranking has to run across batches, so DENSE_RANK without a partition matches the intent):
delete d
from MyTable d
join (
    select batch_id,
           dense_rank() over (order by batch_id desc) as rn
    from MyTable
) a on d.batch_id = a.batch_id
where a.rn > 2;
Azure SQL Data Warehouse supports only a limited T-SQL surface area; neither CTEs in DELETE statements nor DELETEs with a FROM (join) clause are supported, and both yield the following error:
Msg 100029, Level 16, State 1, Line 1
A FROM clause is currently not supported in a DELETE statement.
It does however support sub-queries, so one way to write your statement is like this (DISTINCT matters here, since several rows can share the same BATCHID):
DELETE dbo.MyTable
WHERE BATCHID NOT IN ( SELECT DISTINCT TOP 2 BATCHID FROM dbo.MyTable ORDER BY BATCHID DESC );
This syntax is supported in Azure SQL Data Warehouse and I have tested it. I'm not sure how efficient it will be on billions of rows though. You could also consider partition switching.
If you are deleting a large portion of your table then it might make sense to use a CTAS to put the data you want to keep into a new table, eg something like this:
-- Keep the most recent two BATCHIDS
CREATE TABLE dbo.MyTable2
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH( BATCHID )
-- Add partition scheme here if required
)
AS
SELECT *
FROM dbo.MyTable
WHERE BATCHID In ( SELECT TOP 2 BATCHID FROM dbo.MyTable ORDER BY BATCHID DESC )
OPTION ( LABEL = 'CTAS : Keep top two BATCHIDs' );
GO
-- Rename or DROP old table
RENAME OBJECT dbo.MyTable TO MyTable_Old;
RENAME OBJECT dbo.MyTable2 TO MyTable;
GO
-- Optionally DROP MyTable_Old if everything has been successful
-- DROP TABLE MyTable_Old
This technique is described in more detail here.
You can try:
delete t from mytable t
where batchId < (select max(batchid) from mytable);
Oh, if you want to keep two, perhaps this will work (note the ORDER BY, without which the picked batchid would be arbitrary; LIMIT/OFFSET is MySQL syntax — in T-SQL use OFFSET 1 ROW FETCH NEXT 1 ROW ONLY):
delete t from mytable t
where batchId < (select batchid
                 from mytable
                 group by batchid
                 order by batchid desc
                 limit 1 offset 1
                );
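A quick way to sanity-check the keep-the-latest-two logic is with Python's sqlite3 module (SQLite uses LIMIT instead of TOP; the table and column names here are invented for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table MyTable (batchid integer, payload text)")
con.executemany("insert into MyTable values (?, ?)",
                [(1, "old"), (1, "old"), (2, "keep"), (3, "keep"), (3, "keep")])

# Delete every row whose batchid is not among the two most recent distinct ids
con.execute("""
    delete from MyTable
    where batchid not in (
        select distinct batchid from MyTable
        order by batchid desc
        limit 2
    )""")

remaining = con.execute(
    "select distinct batchid from MyTable order by batchid").fetchall()
print(remaining)  # [(2,), (3,)]
```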

T-SQL Delete half of duplicates with no primary key

I have a complex T-SQL stored procedure that compares data using temp tables, but when I return a single table at the end, I end up with duplicate rows. In these rows all columns are EXACTLY the same, and the table has no primary key. I need to delete exactly half of the duplicates, based on the number of times each row occurs. For example, if there are eight identical rows, I want to delete four of them.
There is no way to get rid of them through my stored procedure's filtering, because the data is literally duplicate information entered by the user, and I do require half of it.
I've done some research and testing, but it seems it's not possible to delete half of the duplicated rows. Is this possible? Or is there a way?
Here is one way, using a great feature of SQL Server, updatable CTEs:
with todelete as (
    select t.*,
           row_number() over (partition by col1, col2, col3, . . .
                              order by newid()) as seqnum
    from table t
)
delete from todelete
where seqnum % 2 = 0;
This will delete every other value.
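The modulo trick can be checked with sqlite3. Since SQLite cannot DELETE through a CTE, this adapted sketch (with invented table and column names) targets the hidden rowid instead, and uses rowid as the tie-breaking order in place of newid():

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t (col1 text, col2 int)")
con.executemany("insert into t values (?, ?)",
                [("a", 1)] * 8 + [("b", 2)] * 4)

# Number the rows within each duplicate group, then delete the even-numbered ones
con.execute("""
    delete from t where rowid in (
        select rowid from (
            select rowid,
                   row_number() over (partition by col1, col2
                                      order by rowid) as seqnum
            from t
        ) where seqnum % 2 = 0
    )""")

counts = con.execute(
    "select col1, count(*) from t group by col1 order by col1").fetchall()
print(counts)  # [('a', 4), ('b', 2)]
```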
Assuming SQL Server 2005+ (the modulo filter is what removes half of each group, rather than just one row):
;WITH CTE AS
(
    SELECT *,
           RN = ROW_NUMBER() OVER (PARTITION BY Col1, Col2, ...Coln ORDER BY Col1)
    FROM YourTempTableHere
)
DELETE FROM CTE
WHERE RN % 2 = 0;

Select Single and Duplicate Row and Return Multiple Columns

I'm currently working with my database in SQL Server. I have a table with 23 fields that contains both single and duplicate rows. How can I select both of them without getting any duplicate data?
I have tried this query:
SELECT Code, Stuff, and other fields....
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Code ORDER BY Code) AS RN
    FROM my_table
) t
WHERE RN = 1
The above code just returns the data from the duplicate rows, but I want the "single rows" returned as well.
This is the illustration.
Thank you for the help.
Could it be as simple as:
SELECT DISTINCT Code, Stuff FROM MyTable
Or, just add Stuff to the partition by clause:
PARTITION BY Code, Stuff ORDER BY Code
Try this; you may need to add Stuff and more fields in the PARTITION BY:
SELECT Code, Stuff
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Code, Stuff ORDER BY Code) AS RN
    FROM my_table
) t
WHERE RN = 1
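To confirm that partitioning by every column returns each distinct row exactly once (single rows included), here is a small sqlite3 sketch with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table my_table (Code text, Stuff text)")
con.executemany("insert into my_table values (?, ?)",
                [("A", "x"), ("A", "x"), ("A", "y"), ("B", "z")])

# One row per distinct (Code, Stuff) pair, duplicates collapsed
rows = con.execute("""
    select Code, Stuff from (
        select *,
               row_number() over (partition by Code, Stuff
                                  order by Code) as rn
        from my_table
    ) where rn = 1
    order by Code, Stuff""").fetchall()
print(rows)  # [('A', 'x'), ('A', 'y'), ('B', 'z')]
```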

How to delete duplicate record where PK is uniqueidentifier field

I want to know how to remove duplicate records where the PK is a uniqueidentifier. I have to delete records on the basis of duplicate values in a set of fields. We could build a temp table using ROW_NUMBER() and delete every record except row number one, but I wanted a one-liner query. Any suggestions?
You can use a CTE to do this; without seeing your table structure, here is the basic SQL:
;with cte as
(
    select *,
           row_number() over (partition by yourfields order by yourfields) rn
    from yourTable
)
delete from cte
where rn > 1;
In T-SQL, a one-liner self-join delete that keeps the smallest pk per group:
delete t
from yourTable t
join yourTable ta
  on ta.dup_field = t.dup_field
 and t.pk > ta.pk;
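The keep-the-smallest-PK idea can be demonstrated with sqlite3, using random UUID strings as stand-ins for uniqueidentifier values (table and column names are invented for the demo):

```python
import sqlite3
import uuid

con = sqlite3.connect(":memory:")
con.execute("create table t (pk text primary key, dup_field text)")
for v in ["a", "a", "a", "b"]:
    con.execute("insert into t values (?, ?)", (str(uuid.uuid4()), v))

# For each dup_field value, keep only the row with the smallest pk
con.execute("""
    delete from t
    where pk > (select min(pk) from t t2
                where t2.dup_field = t.dup_field)
""")

counts = con.execute(
    "select dup_field, count(*) from t group by dup_field order by dup_field"
).fetchall()
print(counts)  # [('a', 1), ('b', 1)]
```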

SQL Server - Inserting a specific set of rows from one table to another table

I have a table called table_one. (7 Mil) rows
I want to insert rows 0 - 1 million into a new table (table_two), and then insert rows 1 million - 2 million into the same table.
SET ROWCOUNT 1000000
How can this be achieved? Is there a way to specify range of rows to be inserted?
You can use row_number():
;with cte as (
    select *,
           row_number() over (order by some_field) as rn
    from table_one
)
insert into table_two ( fields )
select fields from cte
where rn <= 1000000
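The row_number() batching approach can be sketched with sqlite3 (a ten-row toy table stands in for the 7 million rows; names are invented for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table table_one (id integer, val text)")
con.executemany("insert into table_one values (?, ?)",
                [(i, f"row{i}") for i in range(10)])
con.execute("create table table_two (id integer, val text)")

# Copy the first batch: rows numbered 1-5 in id order
con.execute("""
    insert into table_two (id, val)
    select id, val from (
        select *, row_number() over (order by id) as rn
        from table_one
    ) where rn <= 5
""")

n = con.execute("select count(*) from table_two").fetchone()[0]
print(n)  # 5
```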
If you can get the start and end IDs in your old table, you can do something like this:
INSERT INTO NewTable (...)
SELECT ... FROM OldTable
WHERE OldTableID BETWEEN #StartID AND #EndID
If you don't already have a useful ID, use danihp's solution using ROW_NUMBER().
If you don't have a range of ids, you can generate them using row_number():
with toinsert as (
    select *, row_number() over (order by <whatever>) as rownum
    from OldTable
)
insert into NewTable(...)
select ... from toinsert
where rownum between 1 and 1000000
If you are interested in getting an exact number of rows, you might employ TOP:
insert into Table2
select top 1000000 *
from Table1
order by ID -- or newid() if you want random rows
You might be better off exporting the entire table in bulk-import format, splitting the resulting text file, and then bulk importing the seven or so pieces into the new tables.
Of course, there may be keys in the original table that make it possible to do this with SQL INSERT operations, but that requires information not provided in the question.