BigQuery - Delete duplicate rows in tables by rank() - google-bigquery

We use an ETL process to synchronize data from Cloud Storage to BigQuery by simply appending the latest data to the table.
There may be updated records with the same attributes but a different processing timestamp, and we only want to keep the latest record in the table.
Since BigQuery has no primary key concept, we cannot perform an upsert, so we want to delete the redundant data by applying a ranking window function.
We are able to use the CREATE OR REPLACE method to recreate the table with the latest information. However, the table holds over 200 GB of records, so we would like to know whether we can simply delete the stale data instead.
Here is our sample table schema and data:
create table `project.dataset.sample`
(
name string,
process_timestamp timestamp not null,
amount int
)
PARTITION BY
TIMESTAMP_TRUNC(process_timestamp, DAY);
insert into `project.dataset.sample`
values
('Zoe', timestamp('2022-07-09 05:04:13.439780+00'),1 ),
('Zoe', timestamp('2022-07-09 10:53:13.330751+00'),2 ),
('Zoe', timestamp('2022-07-09 18:48:01.089188+00'),3 ),
('Zoe', timestamp('2022-07-10 11:06:01.053347+00'),4 ),
('Zoe', timestamp('2022-07-10 19:11:17.731549+00'),5 ),
('Tess', timestamp('2022-07-10 11:06:01.053347+00'),1 ),
('Tess', timestamp('2022-07-10 19:11:17.731549+00'),2 )
We expected two records to be left after executing the DELETE SQL;
however, it deleted all records...
DELETE
FROM `project.dataset.sample` ori
WHERE EXISTS (
WITH dedup as (select *,
rank() over(partition by name order by process_timestamp desc) as rank
from `project.dataset.sample`
)
SELECT * FROM dedup
WHERE ori.name = dedup.name and dedup.rank > 1);
Is there any method to achieve this requirement?

Fixed; the updated SQL is below:
DELETE
FROM `project.dataset.sample` ori
WHERE EXISTS (
WITH dedup as (select *,
rank() over(partition by name order by process_timestamp desc) as rank
from `project.dataset.sample`
)
SELECT * except(rank)
FROM dedup
WHERE ori.name = dedup.name
and ori.process_timestamp = dedup.process_timestamp
and dedup.rank > 1
);
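As a local cross-check of the corrected logic, here is a minimal sketch using Python's sqlite3 (an assumption for testing; SQLite needs version 3.25+ for window functions). SQLite's implicit rowid stands in for the (name, process_timestamp) correlation that BigQuery needs, since BigQuery has no row identifier:

```python
# Local sketch of the rank-based dedup (assumption: SQLite >= 3.25).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample (name TEXT, process_timestamp TEXT, amount INT)")
con.executemany("INSERT INTO sample VALUES (?, ?, ?)", [
    ("Zoe",  "2022-07-09 05:04:13", 1),
    ("Zoe",  "2022-07-09 10:53:13", 2),
    ("Zoe",  "2022-07-09 18:48:01", 3),
    ("Zoe",  "2022-07-10 11:06:01", 4),
    ("Zoe",  "2022-07-10 19:11:17", 5),
    ("Tess", "2022-07-10 11:06:01", 1),
    ("Tess", "2022-07-10 19:11:17", 2),
])

# Delete every row that is not the latest for its name.
con.execute("""
    DELETE FROM sample
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   RANK() OVER (PARTITION BY name
                                ORDER BY process_timestamp DESC) AS rnk
            FROM sample) AS d
        WHERE rnk > 1)
""")

survivors = sorted(con.execute("SELECT name, amount FROM sample"))
print(survivors)  # [('Tess', 2), ('Zoe', 5)]
```

The sketch confirms that exactly two records survive, one per name, matching the expectation stated above.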

Related

How to delete duplicate records (SQL) when there is no identification column?

It is a data warehouse project in which I load a table where each column references another table. Due to an error in the process, many duplicate records (approximately 13,000) were loaded, and they have no unique identifier, so they are exactly identical. Is there a way to delete only one copy of each duplicate record so that I don't have to delete everything and repeat the table load?
You can use row_number() and a cte:
with cte as (
select row_number() over(
partition by col1, col2, ...
order by (select null)) rn
from mytable
)
delete from cte where rn > 1
The window function guarantees that the same number is not assigned twice within a partition; you need to enumerate all columns in the partition by clause.
If you are going to delete a significant part of the rows, then it might be simpler to empty and recreate the table:
create table tmptable as select distinct * from mytable;
truncate table mytable; -- back it up first!
insert into mytable select * from tmptable;
drop table tmptable;
You can make use of row_number to delete the duplicate rows by first partitioning them and then ordering by one of the columns within that partition.
You have to list all your columns in partition by if records are completely identical.
WITH CTE1 AS (
SELECT A.*
, ROW_NUMBER() OVER (PARTITION BY CODDIMALUMNO, (OTHER COLUMNS) ORDER BY CODDIMALUMNO) RN
FROM TABLE1 A
)
DELETE FROM CTE1
WHERE RN > 1;
You can use row_number() and an updatable CTE:
with todelete as (
select t.*, row_number() over (partition by . . . ) as seqnum
from t
)
delete from todelete
where seqnum > 1;
The . . . is for the columns that define duplicates.
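Where the engine has no updatable CTE but does expose an implicit row identifier, the same keep-one-per-group idea can be written as a plain DELETE. The sketch below uses SQLite's rowid as an assumption (column names col1/col2 are illustrative):

```python
# Keep-one-row-per-duplicate-group without an updatable CTE, using
# SQLite's implicit rowid as the missing unique identifier.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (col1 TEXT, col2 INT)")
con.executemany("INSERT INTO mytable VALUES (?, ?)",
                [("a", 1), ("a", 1), ("a", 1), ("b", 2), ("b", 2)])

# Keep the lowest rowid in each (col1, col2) group, delete the rest.
con.execute("""
    DELETE FROM mytable
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM mytable
                        GROUP BY col1, col2)
""")

remaining = sorted(con.execute("SELECT * FROM mytable"))
print(remaining)  # [('a', 1), ('b', 2)]
```

As with the partition by clause above, every column that defines a duplicate must appear in the GROUP BY.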

CTE with DELETE - Alternative for SQL Data Warehouse

I would like to delete all rows in a table where the batchId (a running number) is older than the two most recent ones. I could probably do this in a SQL Database with the query:
WITH CTE AS(
SELECT
*,
DENSE_RANK() OVER(ORDER BY BATCHID DESC) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN>2
But the same is not allowed in SQL Data Warehouse per this. I'm looking for alternatives.
You can try using a JOIN:
delete d from MyTable d
join
(
SELECT
*,
RN = DENSE_RANK() OVER(ORDER BY BATCH_ID DESC)
FROM MyTable
) A on d.batch_id = A.batch_id where A.RN > 2
Azure SQL Data Warehouse only supports a limited T-SQL surface area; CTEs for DELETE operations and DELETEs with FROM clauses will yield the following error:
Msg 100029, Level 16, State 1, Line 1
A FROM clause is currently not supported in a DELETE statement.
It does however support sub-queries, so one way is to write your statement like this:
DELETE dbo.MyTable
WHERE BATCHID Not In ( SELECT DISTINCT TOP 2 BATCHID FROM dbo.MyTable ORDER BY BATCHID DESC );
This syntax is supported in Azure SQL Data Warehouse and I have tested it. I'm not sure how efficient it will be on billions of rows though. You could also consider partition switching.
If you are deleting a large portion of your table then it might make sense to use a CTAS to put the data you want to keep into a new table, eg something like this:
-- Keep the most recent two BATCHIDS
CREATE TABLE dbo.MyTable2
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH( BATCHID )
-- Add partition scheme here if required
)
AS
SELECT *
FROM dbo.MyTable
WHERE BATCHID In ( SELECT DISTINCT TOP 2 BATCHID FROM dbo.MyTable ORDER BY BATCHID DESC )
OPTION ( LABEL = 'CTAS : Keep top two BATCHIDs' );
GO
-- Rename or DROP old table
RENAME OBJECT dbo.MyTable TO MyTable_Old;
RENAME OBJECT dbo.MyTable2 TO MyTable;
GO
-- Optionally DROP MyTable_Old if everything has been successful
-- DROP TABLE MyTable_Old
This technique is described in more detail here.
You can try:
delete t from mytable t
where batchId < (select max(batchid) from mytable);
Oh, if you want to keep two, perhaps this will work:
delete t from mytable t
where batchId < (select batchid
from mytable
group by batchid
order by batchid desc
limit 1 offset 1
);
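The keep-the-two-newest-batches subquery form can be sanity-checked locally; the sketch below uses SQLite as an assumption, and shows why the DISTINCT matters once a batch spans more than one row:

```python
# Local sketch (SQLite assumed) of "delete all but the two newest batches".
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MyTable (batchId INT, payload TEXT)")
con.executemany("INSERT INTO MyTable VALUES (?, ?)",
                [(1, "x"), (1, "y"), (2, "x"), (3, "x"), (3, "y")])

# Keep only rows belonging to the two most recent batch ids. Without
# DISTINCT, the "top 2" could be two rows of the same batch.
con.execute("""
    DELETE FROM MyTable
    WHERE batchId NOT IN (SELECT DISTINCT batchId FROM MyTable
                          ORDER BY batchId DESC LIMIT 2)
""")

batches = [r[0] for r in con.execute(
    "SELECT DISTINCT batchId FROM MyTable ORDER BY batchId")]
print(batches)  # [2, 3]
```

Note that SQLite uses LIMIT where T-SQL uses TOP; the set logic is the same.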

Remove duplicates from table in bigquery

I found duplicates in my table by running the query below.
SELECT name, id, count(1) as count
FROM [myproject:dev.sample]
group by name, id
having count(1) > 1
Now I would like to remove these duplicates based on id and name using a DML statement, but it shows a '0 rows affected' message.
Am I missing something?
DELETE FROM PRD.GPBP WHERE
id not in(select id from [myproject:dev.sample] GROUP BY id) and
name not in (select name from [myproject:dev.sample] GROUP BY name)
I suggest you create a new table without the duplicates, drop your original table, and rename the new table to the original table name.
You can filter out the duplicates like below:
Create table new_table as
Select name, id, ...... , put your remaining 10 cols here
FROM(
SELECT *,
ROW_NUMBER() OVER(Partition by name , id Order by id) as rnk
FROM [myproject:dev.sample]
)a
WHERE rnk = 1;
Then drop the older table and rename new_table with old table name.
The query below (BigQuery Standard SQL) should be more optimal for de-duping in a case like yours:
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id
If you run it from within the UI, you can just set the Write Preference to Overwrite Table and you are done.
Or, if you want, you can use DML's INSERT into a new table and then copy it over the original one.
Meanwhile, the easiest way is as below (using DDL):
#standardSQL
CREATE OR REPLACE TABLE `myproject.dev.sample` AS
SELECT * FROM (
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id
)
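The GROUP BY name, id trick can be checked locally; the sketch below uses SQLite as an assumption, where a bare (non-aggregated) column in a GROUP BY behaves much like BigQuery's ANY_VALUE by returning a value from one arbitrary row of each group:

```python
# Local sketch (SQLite assumed) of one-surviving-row-per-(name, id).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample (name TEXT, id INT, payload TEXT)")
con.executemany("INSERT INTO sample VALUES (?, ?, ?)",
                [("a", 1, "p1"), ("a", 1, "p2"), ("b", 2, "p3")])

# SQLite's bare-column GROUP BY stands in for BigQuery's ANY_VALUE:
# one arbitrary row survives per (name, id) pair.
con.execute("""CREATE TABLE deduped AS
               SELECT name, id, payload
               FROM sample
               GROUP BY name, id""")

count = con.execute("SELECT COUNT(*) FROM deduped").fetchone()[0]
print(count)  # 2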

How can I remove successive similar rows and keep only the most recent row?

I have this table:
I want to remove successive similar rows and keep the most recent.
So the result I want to have is something like this:
Here is how I would do it:
;WITH cte AS (
SELECT valeur, date_reference, id, rownum = ROW_NUMBER() OVER (ORDER BY date_reference) FROM #temperatures
UNION ALL
SELECT NULL, NULL, NULL, (SELECT COUNT(*) FROM #temperatures) + 1
)
SELECT A.* FROM cte AS A INNER JOIN cte AS B ON A.rownum + 1 = B.rownum AND COALESCE(a.valeur, -459) != COALESCE(b.valeur, -459)
I am calling the table #temperatures. Use a CTE to assign a ROW_NUMBER to each record and to include an extra record with the last Row_Number (otherwise the last record will not be included in the following query). Then, SELECT from the CTE where the next ROW_NUMBER does not have the same valeur.
Now, if you want to DELETE from the original table, you can review this query's return to make sure you really want to delete all the records not in this return. Then, assuming historique_id is the primary key, DELETE FROM #temperatures WHERE historique_id NOT IN (SELECT historique_id FROM cte AS A....
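The same successive-duplicate rule can be phrased with a window function directly instead of a self-join; a sketch below swaps in LEAD() and uses SQLite as an assumption (3.25+ for window functions), keeping the most recent row of each run of equal valeur:

```python
# Sketch: drop a row when the NEXT row (by date) has the same valeur,
# so only the last row of each run survives. SQLite >= 3.25 assumed.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE temperatures
               (historique_id INT PRIMARY KEY, valeur REAL,
                date_reference TEXT)""")
con.executemany("INSERT INTO temperatures VALUES (?, ?, ?)", [
    (1, 20.0, "2022-01-01"),
    (2, 20.0, "2022-01-02"),   # end of the first run of 20.0 -> keep
    (3, 21.5, "2022-01-03"),
    (4, 21.5, "2022-01-04"),   # end of the 21.5 run -> keep
    (5, 20.0, "2022-01-05"),   # 20.0 repeats, but not successively -> keep
])

con.execute("""
    DELETE FROM temperatures
    WHERE historique_id IN (
        SELECT historique_id FROM (
            SELECT historique_id, valeur,
                   LEAD(valeur) OVER (ORDER BY date_reference) AS next_valeur
            FROM temperatures) AS x
        WHERE valeur = next_valeur)
""")

ids = [r[0] for r in con.execute(
    "SELECT historique_id FROM temperatures ORDER BY date_reference")]
print(ids)  # [2, 4, 5]
```

Because LEAD() returns NULL for the last row, the final row is never deleted, which removes the need for the sentinel record in the CTE answer above.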
You can collect all the rows that you want to keep in a temp table, truncate your original table, and insert all the rows from the temp table back into it. This will be more effective than just deleting rows in case you have "a lot of duplicates". Also, TRUNCATE TABLE has the following restrictions:
You cannot use TRUNCATE TABLE on tables that:
Are referenced by a FOREIGN KEY constraint. (You can truncate a table that has a foreign key that references itself.)
Participate in an indexed view.
Are published by using transactional replication or merge replication.
TRUNCATE TABLE cannot activate a trigger because the operation does not log individual row deletions. For more information, see CREATE TRIGGER (Transact-SQL).
In Azure SQL Data Warehouse and Parallel Data Warehouse:
TRUNCATE TABLE is not allowed within the EXPLAIN statement.
TRUNCATE TABLE cannot be run inside of a transaction.
You can find more information in following topics.
Truncate in SQL SERVER
Deleting Data in SQL Server with TRUNCATE vs DELETE commands
You can use this script to remove duplicate rows with a truncate-insert strategy:
CREATE TABLE #temp_hisorique(
code varchar(50),
code_trim varchar(50),
libelle varchar(50),
unite varchar(50),
valeur varchar(50),
date_reference datetime,
hisoriqueID int
)
GO
;WITH cte AS (
select *, row_number() over(partition by code, code_trim, libelle, unite, valeur order by date_reference desc) as rownum
from mytable
)
insert into #temp_hisorique(code, code_trim, libelle, unite, valeur, date_reference, hisoriqueID)
select code, code_trim, libelle, unite, valeur, date_reference, hisoriqueID
from cte
where rownum = 1
TRUNCATE TABLE mytable
insert into mytable(code, code_trim, libelle, unite, valeur, date_reference, hisoriqueID)
select code, code_trim, libelle, unite, valeur, date_reference, hisoriqueID
from #temp_hisorique
Or you can just remove the rows with a DELETE command with a join:
;WITH cte AS (
select *, row_number() over(partition by code, code_trim, libelle, unite, valeur order by date_reference desc) as rownum
from mytable
)
delete T
from mytable T
join cte on T.hisoriqueID = cte.hisoriqueID
where cte.rownum > 1

I have a SQL Server 2000 database. How do I remove duplicate rows from a table without a key field?

I also do not have the ability to create a temp table to move data to; most of the suggestions I have seen require either a unique ID or the ability to create a table to move unique rows into. I also can't add a key to the table.
My table structure (relevant columns) is:
Customer_code, carrier, rack, bin
Thanks.
;WITH x AS
(
SELECT id, gid, url, rn = ROW_NUMBER() OVER
(PARTITION BY gid, url ORDER BY id)
FROM dbo.table
)
SELECT id,gid,url FROM x WHERE rn = 1