CTE with DELETE - Alternative for SQL Data Warehouse - sql

I would like to delete all rows in a table where the batchId (a running number) is older than the two most recent batches. I could probably do this in a SQL Database with the query:
WITH CTE AS(
SELECT
*,
DENSE_RANK() OVER(ORDER BY BATCHID DESC) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN>2
But the same is not allowed in a SQL Data Warehouse per this. Looking for alternatives here.

You can try using a JOIN:
DELETE d
FROM MyTable d
JOIN
(
SELECT DISTINCT
BATCHID,
RN = DENSE_RANK() OVER(ORDER BY BATCHID DESC)
FROM MyTable
) A ON d.BATCHID = A.BATCHID
WHERE A.RN > 2

Azure SQL Data Warehouse supports only a limited T-SQL surface area. CTEs for DELETE operations and DELETEs with FROM clauses are not supported and will yield the following error:
Msg 100029, Level 16, State 1, Line 1
A FROM clause is currently not supported in a DELETE statement.
It does, however, support sub-queries, so one way to write your statement is like this:
DELETE dbo.MyTable
WHERE BATCHID NOT IN ( SELECT DISTINCT TOP 2 BATCHID FROM dbo.MyTable ORDER BY BATCHID DESC );
This syntax is supported in Azure SQL Data Warehouse and I have tested it. I'm not sure how efficient it will be on billions of rows though. You could also consider partition switching.
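The sub-query delete can be checked on toy data first. The following is only a sketch, using Python's built-in sqlite3 as a stand-in for the warehouse: SQLite writes LIMIT 2 where T-SQL writes TOP 2, and the payload column and sample rows are made up for illustration. The DISTINCT in the sub-query is there so that two different batch ids are picked even when the newest batch has many rows.

```python
import sqlite3


def keep_latest_two_batches(conn):
    """Delete every row whose BATCHID is not one of the two highest ids."""
    conn.execute(
        """
        DELETE FROM MyTable
        WHERE BATCHID NOT IN (
            SELECT DISTINCT BATCHID
            FROM MyTable
            ORDER BY BATCHID DESC
            LIMIT 2          -- SQLite spelling of T-SQL's TOP 2
        )
        """
    )
    conn.commit()


# Hypothetical sample data: four batches, several rows per batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (BATCHID INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (4, "f"), (4, "g")],
)

keep_latest_two_batches(conn)
remaining = [r[0] for r in conn.execute(
    "SELECT DISTINCT BATCHID FROM MyTable ORDER BY BATCHID"
)]
print(remaining)
```

Only batches 3 and 4 should survive, with all of their rows intact.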
If you are deleting a large portion of your table then it might make sense to use a CTAS to put the data you want to keep into a new table, eg something like this:
-- Keep the most recent two BATCHIDS
CREATE TABLE dbo.MyTable2
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH( BATCHID )
-- Add partition scheme here if required
)
AS
SELECT *
FROM dbo.MyTable
WHERE BATCHID IN ( SELECT DISTINCT TOP 2 BATCHID FROM dbo.MyTable ORDER BY BATCHID DESC )
OPTION ( LABEL = 'CTAS : Keep top two BATCHIDs' );
GO
-- Rename or DROP old table
RENAME OBJECT dbo.MyTable TO MyTable_Old;
RENAME OBJECT dbo.MyTable2 TO MyTable;
GO
-- Optionally DROP MyTable_Old if everything has been successful
-- DROP TABLE MyTable_Old
This technique is described in more detail here.

You can try:
delete t from mytable t
where batchId < (select max(batchid) from mytable);
Oh, if you want to keep two, perhaps this will work:
delete t from mytable t
where batchId < (select batchid
from mytable
group by batchid
order by batchid desc
limit 1 offset 1
);

Related

How to delete duplicate records (SQL) when there is no identification column?

It is a datawarehouse project in which I load a table in which each column refers to another table. The problem is that due to an error in the process, many duplicate records were loaded (approximately 13,000) but they do not have a unique identifier, therefore they are exactly the same. Is there a way to delete only one of the duplicate records so that I don't have to delete everything and repeat the table loading process?
You can use row_number() and a cte:
with cte as (
select row_number() over(
partition by col1, col2, ...
order by (select null)) rn
from mytable
)
delete from cte where rn > 1
The window function guarantees that the same number will not be assigned twice within a partition. You need to enumerate all columns in the partition by clause.
If you are going to delete a significant part of the rows, then it might be simpler to empty and recreate the table:
create table tmptable as select distinct * from mytable;
truncate table mytable; -- back it up first!
insert into mytable select * from tmptable;
drop table tmptable;
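As a rough, runnable illustration of the empty-and-reload approach above, here is a sketch using Python's sqlite3. It rests on assumptions: SQLite has no TRUNCATE, so a plain DELETE stands in for it, and the helper name, table layout and sample rows are hypothetical.

```python
import sqlite3


def dedupe_by_rebuild(conn, table):
    """Replace the contents of `table` with its distinct rows.

    The table name is interpolated into the SQL, so pass only trusted
    identifiers.
    """
    conn.execute(f"CREATE TEMP TABLE tmp_distinct AS SELECT DISTINCT * FROM {table}")
    conn.execute(f"DELETE FROM {table}")  # stand-in for TRUNCATE TABLE
    conn.execute(f"INSERT INTO {table} SELECT * FROM tmp_distinct")
    conn.execute("DROP TABLE tmp_distinct")
    conn.commit()


# Hypothetical table with exact duplicate rows and no identifier column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 INTEGER, col2 TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (2, "b"), (2, "b"), (3, "c")],
)

dedupe_by_rebuild(conn, "mytable")
rows = sorted(conn.execute("SELECT col1, col2 FROM mytable").fetchall())
print(rows)
```

Each distinct row should appear exactly once after the reload.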
You can make use of row_number to delete the duplicate rows by first partitioning them and then ordering by one of the columns within that partition.
You have to list all your columns in partition by if records are completely identical.
WITH CTE1 AS (
SELECT A.*
, ROW_NUMBER() OVER (PARTITION BY CODDIMALUMNO, (OTHER COLUMNS) ORDER BY CODDIMALUMNO) RN
FROM TABLE1 A
)
DELETE FROM CTE1
WHERE RN > 1;
You can use row_number() and an updatable CTE:
with todelete as (
select t.*, row_number() over (partition by . . . order by (select null)) as seqnum
from t
)
delete from todelete
where seqnum > 1;
The . . . is for the columns that define duplicates.

Delete after CTE

I'm using an Azure DWH server and trying to delete duplicate records. Normally I would do it using a CTE like this:
WITH cte AS (
SELECT
HashTagID,
ROW_NUMBER() OVER (
PARTITION BY
HashTagID
ORDER BY
HashTagID
) row_num
FROM
[dbo].[ref_Tag]
)
Delete FROM cte
WHERE row_num > 1;
But that fails with "Failed to generate query plan." because I can't use DELETE after a CTE on the DWH.
So I tried rewriting the statement to this:
Delete dup from (SELECT
HashTagID,
ROW_NUMBER() OVER (
PARTITION BY
HashTagID
ORDER BY
HashTagID
) row_num
FROM
[dbo].[ref_Tag]) as dup WHERE row_num > 1
But I get the same error.
The records are exactly the same, even the key, so I have to somehow count how many times the same record appears and then delete the second occurrence. I'd like to keep one of the dupes.
Does anyone know how to write this statement to work on the Azure DWH?
Thanks.
I don't think we can do it directly the way we do it with a CTE in SQL Server.
But you can try something like the following to achieve it. Here I used a one-column table. If you have more columns, you can use GROUP BY in the first query.
Create Table dbo.ref_tag_temp
with (distribution = ROUND_ROBIN , clustered columnstore index)
as select distinct HashTagID from
dbo.ref_Tag;
delete from [dbo].[ref_Tag];
INSERT INTO [dbo].[ref_Tag]
select * from dbo.ref_tag_temp;
drop table dbo.ref_tag_temp;

How can I remove successive similar rows and keep only the most recent row?

I have this table
I want to remove successive similar rows and keep the most recent one.
So the result I want to have is something like this
Here is how I would do it:
;WITH cte AS (
SELECT valeur, date_reference, id, rownum = ROW_NUMBER() OVER (ORDER BY date_reference) FROM #temperatures
UNION ALL
SELECT NULL, NULL, NULL, (SELECT COUNT(*) FROM #temperatures) + 1
)
SELECT A.* FROM cte AS A INNER JOIN cte AS B ON A.rownum + 1 = B.rownum AND COALESCE(a.valeur, -459) != COALESCE(b.valeur, -459)
I am calling the table #temperatures. Use a CTE to assign a ROW_NUMBER to each record and to include an extra record with the last Row_Number (otherwise the last record will not be included in the following query). Then, SELECT from the CTE where the next ROW_NUMBER does not have the same valeur.
Now, if you want to DELETE from the original table, you can review this query's return to make sure you really want to delete all the records not in this return. Then, assuming historique_id is the primary key, DELETE FROM #temperatures WHERE historique_id NOT IN (SELECT historique_id FROM cte AS A....
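To preview which rows survive before deleting anything, the "compare each row with the next one" idea can be tried on a small sample. This is a sketch under assumptions, run through Python's sqlite3 (window functions need SQLite 3.25 or later). It is a variant rather than the exact query above: a LEFT JOIN replaces the extra UNION ALL row, and SQLite's NULL-safe IS NOT replaces the COALESCE sentinel. The table name temperatures and the sample values are hypothetical.

```python
import sqlite3


def last_of_each_run(conn):
    """Return rows whose next row (by date_reference) has a different valeur."""
    return conn.execute(
        """
        WITH cte AS (
            SELECT id, valeur, date_reference,
                   ROW_NUMBER() OVER (ORDER BY date_reference) AS rownum
            FROM temperatures
        )
        SELECT a.id, a.valeur, a.date_reference
        FROM cte AS a
        LEFT JOIN cte AS b ON b.rownum = a.rownum + 1
        WHERE b.rownum IS NULL           -- the very last row is always kept
           OR a.valeur IS NOT b.valeur   -- NULL-safe "differs from next row"
        ORDER BY a.date_reference
        """
    ).fetchall()


# Hypothetical sample: two runs of 10 separated by a single 12.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temperatures (id INTEGER PRIMARY KEY, valeur INTEGER, date_reference TEXT)"
)
conn.executemany(
    "INSERT INTO temperatures VALUES (?, ?, ?)",
    [
        (1, 10, "2020-01-01"),
        (2, 10, "2020-01-02"),
        (3, 12, "2020-01-03"),
        (4, 10, "2020-01-04"),
        (5, 10, "2020-01-05"),
    ],
)

kept_ids = [row[0] for row in last_of_each_run(conn)]
print(kept_ids)
```

The ids returned here are the ones to keep; everything else is a candidate for the DELETE ... NOT IN step.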
You can collect all the rows that you want to keep in a temp table, truncate your original table, and insert all the rows from the temp table back into your original table. This will be more efficient than just deleting rows if you have a lot of duplicates. Note that TRUNCATE TABLE has the following restrictions:
You cannot use TRUNCATE TABLE on tables that:
Are referenced by a FOREIGN KEY constraint. (You can truncate a table that has a foreign key that references itself.)
Participate in an indexed view.
Are published by using transactional replication or merge replication.
TRUNCATE TABLE cannot activate a trigger because the operation does not log individual row deletions. For more information, see CREATE TRIGGER (Transact-SQL).
In Azure SQL Data Warehouse and Parallel Data Warehouse:
TRUNCATE TABLE is not allowed within the EXPLAIN statement.
TRUNCATE TABLE cannot be run inside of a transaction.
You can find more information in following topics.
Truncate in SQL SERVER
Deleting Data in SQL Server with TRUNCATE vs DELETE commands
You can use this script to remove duplicate rows with a truncate-insert strategy:
CREATE TABLE #temp_hisorique(
code varchar(50),
code_trim varchar(50),
libelle varchar(50),
unite varchar(50),
valeur varchar(50),
date_reference datetime,
hisoriqueID int
)
GO
;WITH cte AS (
select *, row_number() over(partition by code, code_trim, libelle, unite, valeur order by date_reference desc) as rownum
from mytable
)
insert into #temp_hisorique(code, code_trim, libelle, unite, valeur, date_reference, hisoriqueID)
select code, code_trim, libelle, unite, valeur, date_reference, hisoriqueID
from cte
where rownum = 1
TRUNCATE TABLE mytable
insert into mytable(code, code_trim, libelle, unite, valeur, date_reference, hisoriqueID)
select code, code_trim, libelle, unite, valeur, date_reference, hisoriqueID
from #temp_hisorique
Or you can just remove the rows with a DELETE command and a join:
;WITH cte AS (
select *, row_number() over(partition by code, code_trim, libelle, unite, valeur order by date_reference desc) as rownum
from mytable
)
delete T
from mytable T
join cte on T.hisoriqueID = cte.hisoriqueID
where cte.rownum > 1

Deleting rows where the Primary key is duplicated - SQL

My issue is how do we delete a primary key row in case it is duplicated. The other fields may/may not be duplicates. I am interested only in the primary key being duplicated and would like to retain the first instance while deleting the other duplicate entries.
For example,
I have 2 tables with the following data:
Table1:- Portfolio
Columns:- PortfolioID(PK), PortfolioName
Sample data :-
1, North America
2, Europe
3, Asia
Table2:- Account
Columns:- AccountID(PK), PortfolioID(FK), AccountName
Sample data :-
1,1,Quake
1,1,Wind
2,1,Fire
3,1,Quake
4,2,Flood
5,2,Wind
Let's say for PortfolioID = 1,
I am trying to delete row number 2 from the Account table where the AccountID 1 is repeated for PortfolioID =1. I have tried using the CTE expression where I use the ROW_NUMBER statement and try to delete ROWNUMBER <> 1. But this query doesn't work as it deletes all the rows in the table.
The query I tried:
WITH CTE AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY [Account].[AccountID] ORDER BY [Account].[AccountID]) AS [ROWNUMBER],
[Account].[AccountID]
FROM [Account]
INNER JOIN [Portfolio] ON [Portfolio].[PortfolioID] = [Account]. [PortfolioID]
WHERE [Portfolio].[PortfolioID] = 1
)
DELETE [Account]
FROM [CTE]
WHERE [ROWNUMBER] <> 1
Am I doing something wrong in the query? Thanks in advance for the help.
Firstly, if you define the AccountID column as the primary key in your database, the database will reject duplicate values and prevent this kind of problem going forward.
Secondly, are you using Sql Server? Which version?
Assuming you are using Sql Server and a recent version which allows you to use windowing, you can try something like this to delete any duplicates that you have.
This will delete ALL copies of ALL duplicates:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY AccountID,PortfolioID)
FROM Account)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
This alternative script will keep one of the duplicates if that is what you prefer:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY AccountID,PortfolioID ORDER BY AccountID,PortfolioID) AS RN
FROM Account
)
DELETE FROM CTE WHERE RN<>1
Finally, if you want to only delete duplicates for Portfolio Id 1:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY AccountID,PortfolioID ORDER BY AccountID,PortfolioID) AS RN
FROM Account
Where PortfolioID = 1
)
DELETE FROM CTE WHERE RN<>1
A primary key column never allows duplicate entries.
Try the query below for the desired result based on the given data/inputs.
;WITH CTE AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY a.[AccountID],a.PortfolioID ORDER BY a.[AccountID]) AS [ROWNUMBER],*
FROM [Account] a
WHERE a.[PortfolioID] = 1
)
DELETE
FROM [CTE]
WHERE [ROWNUMBER] > 1

How to delete duplicate record where PK is uniqueidentifier field

I want to know the way we can remove duplicate records where PK is uniqueidentifier.
I have to delete records on the basis of duplicate values in a set of fields. One option is to load a temp table using Row_Number() and delete every record except row number one.
But I wanted to build one liner query. Any suggestion?
You can use a CTE to do this. Without seeing your table structure, here is the basic SQL:
;with cte as
(
select *, row_number() over(partition by yourfields order by yourfields) rn
from yourTable
)
delete
from cte
where rn > 1
delete from table t using table ta where ta.dup_field=t.dup_field and t.pk >ta.pk
;
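The one-liner above keeps, for each group of duplicates, the row with the smallest pk. A correlated EXISTS expresses the same idea without the USING clause. Below is a sketch run through Python's sqlite3, with assumptions stated up front: records is a hypothetical table name, dup_field and pk follow the one-liner, and the pk values are stored as text, so "smallest" means smallest in string order. Any consistent ordering of the keys works for picking a survivor.

```python
import sqlite3
import uuid


def delete_duplicates_keep_min_pk(conn):
    """For each dup_field value, keep only the row with the smallest pk."""
    conn.execute(
        """
        DELETE FROM records
        WHERE EXISTS (
            SELECT 1
            FROM records AS other
            WHERE other.dup_field = records.dup_field
              AND other.pk < records.pk   -- a smaller key exists, so this row goes
        )
        """
    )
    conn.commit()


# Hypothetical data: random UUID keys, three rows sharing dup_field 'x'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (pk TEXT PRIMARY KEY, dup_field TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(str(uuid.uuid4()), "x") for _ in range(3)] + [(str(uuid.uuid4()), "y")],
)

delete_duplicates_keep_min_pk(conn)
survivors = conn.execute(
    "SELECT dup_field, COUNT(*) FROM records GROUP BY dup_field ORDER BY dup_field"
).fetchall()
print(survivors)
```

One row per dup_field value should remain.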