I'm using an Azure SQL Data Warehouse server and am trying to delete duplicate records. Normally I would do it with a CTE like this:
WITH cte AS (
    SELECT
        HashTagID,
        ROW_NUMBER() OVER (
            PARTITION BY HashTagID
            ORDER BY HashTagID
        ) AS row_num
    FROM [dbo].[ref_Tag]
)
DELETE FROM cte
WHERE row_num > 1;
But that fails with "Failed to generate query plan." because I can't use DELETE after a CTE on the DWH.
So I tried rewriting the statement like this:
DELETE dup
FROM (
    SELECT
        HashTagID,
        ROW_NUMBER() OVER (
            PARTITION BY HashTagID
            ORDER BY HashTagID
        ) AS row_num
    FROM [dbo].[ref_Tag]
) AS dup
WHERE row_num > 1;
But I get the same error.
The records are exactly the same, even the key, so I have to somehow count how many times the same record appears and then delete the second occurrence. I'd like to keep one of the duplicates.
Does anyone know how to write this statement so it works on Azure SQL Data Warehouse?
Thanks.
I don't think we can do it directly the way we do it through a CTE in SQL Server.
But you can try something like the below to achieve it. Here I used a one-column table; if you have more columns, you can GROUP BY them in the first query (a sketch of that variant follows the code below).
CREATE TABLE dbo.ref_tag_temp
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT DISTINCT HashTagID
FROM dbo.ref_Tag;

DELETE FROM [dbo].[ref_Tag];

INSERT INTO [dbo].[ref_Tag]
SELECT * FROM dbo.ref_tag_temp;

DROP TABLE dbo.ref_tag_temp;
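For example, if ref_Tag had more columns, the first statement could GROUP BY all of them so that only one copy of each identical row survives. A sketch of that variant, where TagName and CreatedDate are hypothetical column names used only for illustration:
CREATE TABLE dbo.ref_tag_temp
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
-- grouping by every column keeps exactly one copy of each identical row
SELECT HashTagID, TagName, CreatedDate
FROM dbo.ref_Tag
GROUP BY HashTagID, TagName, CreatedDate;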
Related
This is a data warehouse project in which I load a table where each column refers to another table. The problem is that, due to an error in the process, many duplicate records were loaded (approximately 13,000), but they do not have a unique identifier, so they are exactly the same. Is there a way to delete just the duplicate copies so that I don't have to empty the table and repeat the loading process?
You can use row_number() and a cte:
with cte as (
select row_number() over(
partition by col1, col2, ...
order by (select null)) rn
from mytable
)
delete from cte where rn > 1
The window function guarantees that the same number will not be assigned twice within a partition; you need to enumerate all columns in the PARTITION BY clause.
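Before running the delete, it can be worth checking how many groups of duplicates actually exist. A small sketch using the same placeholder columns (col1, col2 stand for the full set of columns that define a duplicate):
-- list each duplicated combination and how many copies it has
select col1, col2, count(*) as copies
from mytable
group by col1, col2
having count(*) > 1;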
If you are going to delete a significant part of the rows, then it might be simpler to empty and recreate the table:
create table tmptable as select distinct * from mytable;
truncate table mytable; -- back it up first!
insert into mytable select * from tmptable;
drop table tmptable;
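If the database is SQL Server, which does not have CREATE TABLE ... AS SELECT, the same pattern can be written with SELECT ... INTO; a sketch assuming the same table names:
select distinct * into tmptable from mytable;  -- one copy of each distinct row
truncate table mytable;                        -- back it up first!
insert into mytable select * from tmptable;
drop table tmptable;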
You can make use of ROW_NUMBER to delete the duplicate rows by first partitioning them and then ordering by one of the columns within that partition.
You have to list all your columns in PARTITION BY if the records are completely identical.
WITH CTE1 AS (
    SELECT A.*,
           ROW_NUMBER() OVER (PARTITION BY CODDIMALUMNO, (OTHER COLUMNS) ORDER BY CODDIMALUMNO) AS RN
    FROM TABLE1 A
)
DELETE FROM CTE1
WHERE RN > 1;
You can use row_number() and an updatable CTE:
with todelete as (
    select t.*,
           row_number() over (partition by . . . order by (select null)) as seqnum
    from t
)
delete from todelete
where seqnum > 1;
The . . . is for the columns that define duplicates.
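For illustration, here is the same statement with the placeholders filled in; CustomerID, OrderDate and Amount are made-up column names, not columns from the question:
with todelete as (
    select t.*,
           row_number() over (partition by CustomerID, OrderDate, Amount
                              order by (select null)) as seqnum
    from t
)
delete from todelete
where seqnum > 1;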
I would like to delete all rows in a table where the batchId (a running number) is older than the previous two. I could probably do this in a SQL Database with this query:
WITH CTE AS (
    SELECT *,
           DENSE_RANK() OVER (ORDER BY BATCHID DESC) AS RN
    FROM MyTable
)
DELETE FROM CTE WHERE RN > 2
But the same is not allowed in a SQL Data Warehouse per this. Looking for alternatives here.
You can try using a JOIN:
delete d
from MyTable d
join (
    SELECT BATCHID,
           RN = DENSE_RANK() OVER (ORDER BY BATCHID DESC)
    FROM MyTable
) A on d.BATCHID = A.BATCHID
where A.RN > 2
Azure SQL Data Warehouse only supports a limited T-SQL surface area; it does not support CTEs as the target of DELETE operations, nor DELETEs with FROM clauses, which yield the following error:
Msg 100029, Level 16, State 1, Line 1
A FROM clause is currently not supported in a DELETE statement.
It does, however, support sub-queries, so one way is to write your statement like this:
DELETE dbo.MyTable
WHERE BATCHID NOT IN ( SELECT DISTINCT TOP 2 BATCHID FROM dbo.MyTable ORDER BY BATCHID DESC );
This syntax is supported in Azure SQL Data Warehouse and I have tested it. I'm not sure how efficient it will be on billions of rows though. You could also consider partition switching (a rough sketch follows the CTAS example below).
If you are deleting a large portion of your table then it might make sense to use a CTAS to put the data you want to keep into a new table, e.g. something like this:
-- Keep the most recent two BATCHIDS
CREATE TABLE dbo.MyTable2
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH( BATCHID )
-- Add partition scheme here if required
)
AS
SELECT *
FROM dbo.MyTable
WHERE BATCHID IN ( SELECT DISTINCT TOP 2 BATCHID FROM dbo.MyTable ORDER BY BATCHID DESC )
OPTION ( LABEL = 'CTAS : Keep top two BATCHIDs' );
GO
-- Rename or DROP old table
RENAME OBJECT dbo.MyTable TO MyTable_Old;
RENAME OBJECT dbo.MyTable2 TO MyTable;
GO
-- Optionally DROP MyTable_Old if everything has been successful
-- DROP TABLE MyTable_Old
This technique is described in more detail here.
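As for partition switching, it only applies if MyTable is partitioned on BATCHID. A very rough sketch of the idea, where MyTable_Stage is a hypothetical staging table with an identical schema and partition scheme, and partition 1 is assumed to hold the batch you want to discard:
-- switching a partition out is a metadata-only operation
ALTER TABLE dbo.MyTable SWITCH PARTITION 1 TO dbo.MyTable_Stage PARTITION 1;
-- then throw the old rows away
TRUNCATE TABLE dbo.MyTable_Stage;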
You can try:
delete from mytable
where batchId < (select max(batchid) from mytable);
Oh, if you want to keep two, perhaps this will work:
delete from mytable
where batchId < (select min(batchid)
                 from (select distinct top 2 batchid
                       from mytable
                       order by batchid desc
                      ) b
                );
My issue is how to delete a row whose primary key value is duplicated. The other fields may or may not be duplicates. I am only interested in the primary key being duplicated, and I would like to retain the first instance while deleting the other duplicate entries.
For example,
I have 2 tables with the following data:
Table1:- Portfolio
Columns:- PortfolioID(PK), PortfolioName
Sample data :-
1, North America
2, Europe
3, Asia
Table2:- Account
Columns:- AccountID(PK), PortfolioID(FK), AccountName
Sample data :-
1,1,Quake
1,1,Wind
2,1,Fire
3,1,Quake
4,2,Flood
5,2,Wind
Let's say for PortfolioID = 1,
I am trying to delete row number 2 from the Account table, where AccountID 1 is repeated for PortfolioID = 1. I have tried using a CTE in which I use ROW_NUMBER and try to delete ROWNUMBER <> 1, but the query doesn't work: it deletes all the rows in the table.
The query I tried:
WITH CTE AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY [Account].[AccountID] ORDER BY [Account].[AccountID]) AS [ROWNUMBER],
[Account].[AccountID]
FROM [Account]
INNER JOIN [Portfolio] ON [Portfolio].[PortfolioID] = [Account].[PortfolioID]
WHERE [Portfolio].[PortfolioID] = 1
)
DELETE [Account]
FROM [CTE]
WHERE [ROWNUMBER] <> 1
Am I doing something wrong in the query? Thanks in advance for the help.
Firstly, if you define the AccountID column as the primary key in your database, that will help prevent these kinds of problems going forward (a sketch of the constraint follows the queries below).
Secondly, are you using SQL Server? Which version?
Assuming you are using a recent version of SQL Server that supports window functions, you can try something like this to delete any duplicates you have.
This will delete ALL copies of ALL duplicates:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY AccountID,PortfolioID)
FROM Account)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
This alternative script will keep one of the duplicates if that is what you prefer:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY AccountID,PortfolioID ORDER BY AccountID,PortfolioID) AS RN
FROM Account
)
DELETE FROM CTE WHERE RN<>1
Finally, if you want to only delete duplicates for Portfolio Id 1:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY AccountID,PortfolioID ORDER BY AccountID,PortfolioID) AS RN
FROM Account
Where PortfolioID = 1
)
DELETE FROM CTE WHERE RN<>1
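Once the duplicates are gone, the primary key mentioned at the start can be put in place so they cannot come back. A sketch assuming the column names from the question (PK_Account is just an illustrative constraint name):
-- will fail if Account still contains duplicate AccountID values
ALTER TABLE Account
    ADD CONSTRAINT PK_Account PRIMARY KEY (AccountID);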
A primary key column never supports duplicate entries.
Try the query below for the desired result, based on the given data/inputs.
;WITH CTE AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY a.[AccountID],a.PortfolioID ORDER BY a.[AccountID]) AS [ROWNUMBER],*
FROM [Account] a
WHERE a.[PortfolioID] = 1
)
DELETE
FROM [CTE]
WHERE [ROWNUMBER] > 1
In T-SQL I have a complex stored procedure that compares data using temp tables, but at the end, when I return a single table, I end up with duplicate rows. In these rows all columns are EXACTLY the same, and there is no primary key on this table. I need to delete only half of these, based on the number of times each row occurs. For example, if there are eight rows that all have the same values, I want to delete four of them.
There is no way to get rid of them through filtering in my stored procedure, because the data is literally duplicate information entered by the user, but I do require half of that information.
I've done some research on the subject and did some testing but it seems as if it's not possible to delete half of the duplicated rows. Is this not possible? Or is there a way?
Here is one way, using a great feature of SQL Server, updatable CTEs:
with todelete as (
    select t.*,
           row_number() over (partition by col1, col2, col3, . . .
                              order by newid()) as seqnum
    from yourtable t
)
delete from todelete
where seqnum % 2 = 0;
This will delete every other copy within each group of duplicates, i.e. exactly half when the count is even.
Assuming SQL Server 2005+:
;WITH CTE AS
(
    SELECT *,
           RN = ROW_NUMBER() OVER (PARTITION BY Col1, Col2,...Coln ORDER BY Col1)
    FROM YourTempTableHere
)
DELETE FROM CTE
WHERE RN % 2 = 0
I want to know how we can remove duplicate records where the PK is a uniqueidentifier.
I have to delete records on the basis of duplicate values in a set of fields. We could build a temp table using Row_Number() and then delete every record except row number one, but I wanted to build a one-liner query. Any suggestions?
You can use a CTE to do this; without seeing your table structure, here is the basic SQL:
;with cte as
(
select *, row_number() over(partition by yourfields order by yourfields) rn
from yourTable
)
delete
from cte
where rn > 1
Another one-liner, which keeps the row with the smallest pk within each group of matching dup_field values:
delete t
from yourTable t
join yourTable ta
  on ta.dup_field = t.dup_field
 and t.pk > ta.pk;