Deduplicate a partitioned table in BigQuery - google-bigquery

I'm trying to deduplicate an ingestion-time partitioned table in BigQuery:
MERGE dataset.table_name targ
USING (
  SELECT * EXCEPT(row_number)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY uid ORDER BY _PARTITIONDATE DESC) row_number
    FROM dataset.table_name
  )
  WHERE row_number = 1
) src
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
Getting the following error:
Omitting INSERT target column list is unsupported for ingestion-time partitioned table dataset.table_name
CREATE OR REPLACE TABLE can't create partitioned tables.
Creating a partitioned table using DDL and then inserting requires defining schema inside the query, which I'm trying to avoid.
I'm looking for a simple universal query that I can apply to different tables with minimal adjustment. Like the one above.

You need to list the columns explicitly in the INSERT clause, and carry the partition pseudo-column through the source query:
MERGE dataset.table_name targ
USING (
  SELECT * EXCEPT(row_number)
  FROM (
    SELECT
      *,
      _PARTITIONTIME,
      ROW_NUMBER() OVER (PARTITION BY uid ORDER BY _PARTITIONTIME DESC) row_number
    FROM dataset.table_name
  )
  WHERE row_number = 1
) src
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED THEN INSERT (uid, _PARTITIONTIME, column1, column2, column3)
VALUES (uid, _PARTITIONTIME, column1, column2, column3)
But it's not universal.
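One way to get closer to universal is to generate the column list from INFORMATION_SCHEMA and splice it into the MERGE with dynamic SQL. The following is a sketch under that assumption, not a tested recipe; the partition pseudo-column still needs special handling, and uid (the dedup key) remains table-specific:
DECLARE cols STRING;

-- Build the explicit column list the INSERT clause needs,
-- skipping partition pseudo-columns in case they show up.
SET cols = (
  SELECT STRING_AGG(column_name, ', ' ORDER BY ordinal_position)
  FROM dataset.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'table_name'
    AND NOT STARTS_WITH(column_name, '_PARTITION')
);

EXECUTE IMMEDIATE FORMAT("""
MERGE dataset.table_name targ
USING (
  SELECT * EXCEPT(row_number)
  FROM (
    SELECT *, _PARTITIONTIME AS pt,
      ROW_NUMBER() OVER (PARTITION BY uid ORDER BY _PARTITIONTIME DESC) row_number
    FROM dataset.table_name
  )
  WHERE row_number = 1
) src
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED THEN INSERT (_PARTITIONTIME, %s) VALUES (pt, %s)
""", cols, cols);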

Related

Hive- Delete duplicate rows using ROW_NUMBER()

How to delete duplicates using row_number() without listing all the columns from the table? I have a Hive table with 50+ columns. If I want to delete duplicates based on 2 columns, below are the steps I followed:
Create a temp table:
CREATE TABLE temptable AS
SELECT * FROM
  (SELECT *, row_number() OVER (PARTITION BY col1, col2) AS rn FROM maintable) t
WHERE rn = 1;
Insert overwrite the main table from it:
INSERT OVERWRITE TABLE maintable SELECT * FROM temptable;
But here the insert fails because the new column rn is present in temptable; to avoid this column I would have to list all the rest of the columns.
And there is no DROP COLUMN option in Hive; there you would need ALTER TABLE ... REPLACE COLUMNS, which again means listing all the rest of the columns.
So any better idea for deleting duplicates in Hive based on 2 columns?
Spell out all the column names from the original table in the insert overwrite, since the query computes a new column. No temp table is needed for this:
INSERT OVERWRITE TABLE maintable
SELECT col1, col2, col3 -- ...col50
FROM (
  SELECT m.*,
         row_number() OVER (PARTITION BY col1, col2 ORDER BY col1, col2) AS rn
  FROM maintable m
) t
WHERE rn = 1;
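If typing out 50+ names is the real blocker, Hive's regex column specification can select every column except the helper column, so no listing is needed. A sketch, assuming your environment permits setting hive.support.quoted.identifiers to none:
SET hive.support.quoted.identifiers=none;

INSERT OVERWRITE TABLE maintable
SELECT `(rn)?+.+`   -- regex: all columns except rn
FROM (
  SELECT m.*,
         row_number() OVER (PARTITION BY col1, col2) AS rn
  FROM maintable m
) t
WHERE rn = 1;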

Create a table with duplicate values, and use a CTE (Common Table Expression) to delete those duplicate values

Create a table with duplicate values, and use a CTE (Common Table Expression) to delete those duplicate values.
Would someone please help me with how to start, because I really don't understand the question?
Assume the duplicate values can be anything.
For MS SQL Server, this would work:
;with cte as
(
    select *
        , row_number() over (
              partition by [columns], [which], [should], [be], [unique]
              order by [columns], [to], [select], [what's], [kept]
          ) NoOfThisDuplicate
    from [YourTable]
)
delete
from cte
where NoOfThisDuplicate > 1
SQL Fiddle Demo (based on this question: Deleting duplicate row that has earliest date).
Explanation
Create a CTE.
Populate it with all rows from the table we want to deduplicate.
Add a NoOfThisDuplicate column to that output.
Populate this value with the sequential number of this record within the group/partition of all records having the same values for columns [columns], [which], [should], [be], [unique].
The order of the numbering depends on the sort order of those records when sorted by columns [columns], [to], [select], [what's], [kept].
We delete all records returned by the CTE except the first of each group (i.e. all except those with NoOfThisDuplicate = 1).
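To make the placeholders concrete, here is the same pattern against a hypothetical Contacts table, keeping the newest row per Email:
;with cte as
(
    select *
        , row_number() over (
              partition by Email          -- rows sharing an Email count as duplicates
              order by CreatedDate desc   -- the newest row gets number 1 and survives
          ) NoOfThisDuplicate
    from Contacts
)
delete
from cte
where NoOfThisDuplicate > 1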
Oracle Setup:
CREATE TABLE test_data ( value ) AS
SELECT LEVEL FROM DUAL CONNECT BY LEVEL <= 10
UNION ALL
SELECT 2*LEVEL FROM DUAL CONNECT BY LEVEL <= 5;
Query 1:
This will select the values removing duplicates:
SELECT DISTINCT *
FROM test_data
But it does not use a CTE.
Query 2:
So, we can put it in a sub-query factoring clause (the name used in the Oracle documentation for what SQL Server calls a Common Table Expression):
WITH unique_values ( value ) AS (
SELECT DISTINCT *
FROM test_data
)
SELECT * FROM unique_values;
Query 3:
The sub-query factoring clause was pointless in the previous example ... so doing it a different way:
WITH row_numbers ( value, rn ) AS (
SELECT value, ROW_NUMBER() OVER ( PARTITION BY value ORDER BY ROWNUM ) AS rn
FROM test_data
)
SELECT value
FROM row_numbers
WHERE rn = 1;
This selects only the first instance of each value found.
Delete Query:
But that didn't delete the rows ...
DELETE FROM test_data
WHERE ROWID IN (
WITH row_numbers ( rid, rn ) AS (
SELECT ROWID, ROW_NUMBER() OVER ( PARTITION BY value ORDER BY ROWNUM ) AS rn
FROM test_data
)
SELECT rid
FROM row_numbers
WHERE rn > 1
);
Which uses the ROWID pseudocolumn to match rows for deletion.
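Given the setup above (values 1 through 10 plus second copies of 2, 4, 6, 8 and 10), a quick check that the delete worked:
-- Expect exactly 10 rows back, one per value.
SELECT value, COUNT(*) AS cnt
FROM test_data
GROUP BY value
ORDER BY value;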

How to Update Executed table result into the same table?

I have created a table tbl_Dist with columns District and DistCode. There were many duplicate values in the District column, so I filtered out the duplicates using this statement:
select distinct District from tbl_Dist;
It's done, but I am not getting how to write the results of the above executed query back into the table tbl_Dist.
You can do it as below:
-- Move temp table
SELECT DISTINCT District INTO TmpTable FROM tbl_Dist
-- Delete all data
DELETE FROM tbl_Dist
-- Insert data from temp table
INSERT INTO tbl_Dist
SELECT * FROM TmpTable
Updated
Firstly, run this query. You will get a temp table with the distinct data of the main table (tbl_Dist):
-- Move temp table
SELECT DISTINCT District INTO TmpTable FROM tbl_Dist
Then run the below query to delete all data
DELETE FROM tbl_Dist
Finally, run the below query to insert all distinct data back into the main table.
-- Insert data from temp table
INSERT INTO tbl_Dist
SELECT * FROM TmpTable
You need Delete, not Update:
;with cte as
(
Select row_number() over(partition by District order by (select null)) as rn,*
From yourtable
)
Delete from cte where Rn > 1
To check the records that will be deleted use this.
;with cte as
(
Select row_number() over(partition by District order by (select null)) as rn,*
From yourtable
)
Select * from cte where Rn > 1
If you want to keep this query around, you can put it in a view and then write the delete through that view; the underlying table will be updated.
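A minimal sketch of that idea; the view name is made up, and it assumes your SQL Server version accepts DML through a view that adds ROW_NUMBER (the CTE form above relies on the same mechanism):
create view vw_DistDupes as
select *, row_number() over (partition by District order by (select null)) as rn
from tbl_Dist;

-- The delete lands on tbl_Dist underneath the view.
delete from vw_DistDupes where rn > 1;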
try this script, which keeps the row with the smallest DistCode per District and deletes the rest:
DELETE t1
FROM tbl_Dist t1
JOIN tbl_Dist t2
  ON t2.District = t1.District
 AND t2.DistCode < t1.DistCode

Deleting duplicates rows from redshift

I am trying to delete some duplicate data in my redshift table.
Below is my query:
With duplicates As
(
  Select *, ROW_NUMBER() Over (PARTITION by record_indicator
                               Order by record_indicator) as Duplicate
  From table_name
)
delete from duplicates
Where Duplicate > 1;
This query is giving me an error.
Amazon Invalid operation: syntax error at or near "delete";
Not sure what the issue is, as the syntax for the with clause seems to be correct.
Has anybody faced this situation before?
Redshift being what it is (no enforced uniqueness for any column), Ziggy's 3rd option is probably best. Once we decide to go the temp table route, it is more efficient to swap things out whole. Deletes and inserts are expensive in Redshift.
begin;
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
commit;
If space isn't an issue, you can keep the old table around for a while and use the other methods described here to validate that the row count of the original, accounting for duplicates, matches the row count of the new table.
If you're doing constant loads to such a table you'll want to pause that process while this is going on.
If the number of duplicates is a small percentage of a large table, you might want to try copying distinct records of the duplicates to a temp table, then delete all records from the original that join with the temp. Then append the temp table back to the original. Make sure you vacuum the original table after (which you should be doing for large tables on a schedule anyway).
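For the vacuum step just mentioned, the plain commands are enough; run them against whichever table you rebuilt:
VACUUM table_name;
ANALYZE table_name;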
If you're dealing with a lot of data, it's not always possible or smart to recreate the whole table. It may be easier to locate and delete those rows:
-- First identify all the rows that are duplicate
CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;
-- Extract one copy of all the duplicate rows
CREATE TEMP TABLE new_sales(LIKE sales);
INSERT INTO new_sales
SELECT DISTINCT *
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Remove all rows that were duplicated (all copies).
DELETE FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Insert back in the single copies
INSERT INTO sales
SELECT *
FROM new_sales;
-- Cleanup
DROP TABLE duplicate_saleids;
DROP TABLE new_sales;
COMMIT;
Full article: https://elliot.land/post/removing-duplicate-data-in-redshift
That should have worked. Alternatively, you can do:
With
duplicates As (
Select *, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name)
delete from table_name
where id in (select id from duplicates Where Duplicate > 1);
or
delete from table_name
where id in (
select id
from (
Select id, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name) x
Where Duplicate > 1);
If you have no primary key, you can do the following:
BEGIN;
CREATE TEMP TABLE mydups ON COMMIT DROP AS
SELECT DISTINCT ON (record_indicator) *
FROM table_name
ORDER BY record_indicator --, other_optional_priority_field DESC
;
DELETE FROM table_name
WHERE record_indicator IN (
SELECT record_indicator FROM mydups);
INSERT INTO table_name SELECT * FROM mydups;
COMMIT;
This method preserves the permissions and the table definition of the original_table.
The most upvoted answer does not preserve either of them.
In a real-world production environment this is how you should do it, as it is the safest and easiest way to execute.
This will have DOWN TIME in PROD.
Create Table with unique rows
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
Backup the original_table
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
Truncate the original_table
TRUNCATE original_table;
Insert records from unique_table into original_table
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
To avoid DOWN TIME, run the queries below in a TRANSACTION, and instead of TRUNCATE use DELETE
BEGIN transaction;
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
DELETE FROM original_table;
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
END transaction;
Simple answer to this question:
First, create a temporary table from the main table, keeping only the rows with row_number = 1.
Second, delete from the main table all the rows on which we had duplicates.
Then insert the values of the temporary table into the main table.
Queries:
Temporary table:
select b.id, b.date, b.etl_createdon  -- list every column except rn here
into #temp_a
from (
  select a.*,
         row_number() over (partition by id order by etl_createdon desc) as rn
  from main_table a
  where a.id between 59 and 75 and a.date = '2018-05-24'
) b
where b.rn = 1;
Deleting all the affected rows from the main table:
delete from main_table
where id between 59 and 75 and date = '2018-05-24';
Inserting all values from the temp table back into the main table:
insert into main_table select * from #temp_a;
The following deletes, from 'tablename', every copy after the first within each group of rows sharing column1, column2, column3, deduplicating the table in place:
DELETE FROM tablename
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename
) t
WHERE t.rnum > 1);
From: Postgres administrative snippets.
Your query does not work because Redshift does not allow DELETE after the WITH clause. Only SELECT, UPDATE, and a few others are allowed (see WITH clause in the Redshift docs).
Solution (in my situation):
My table events contained duplicate rows; it has an id column that identifies the record (the duplicates share the same id). This column id is the same as your record_indicator.
Unfortunately I was unable to create a temporary table because I ran into the following error using SELECT DISTINCT:
ERROR: Intermediate result row exceeds database block size
But this worked like a charm:
CREATE TABLE temp as (
SELECT *,ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rownumber
FROM events
);
resulting in the temp table:
id | rownumber | ...
---+-----------+----
 1 |         1 | ...
 1 |         2 | ...
 2 |         1 | ...
 2 |         2 | ...
Now the duplicates can be deleted by removing the rows having rownumber larger than 1:
DELETE FROM temp WHERE rownumber > 1
After that, rename the tables and you're done.
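The rename step can look like this; a sketch, assuming the original table is named events and the helper column is dropped first:
BEGIN;
-- Drop the helper column so the rebuilt table matches the original shape.
ALTER TABLE temp DROP COLUMN rownumber;
ALTER TABLE events RENAME TO events_old;
ALTER TABLE temp RENAME TO events;
DROP TABLE events_old;
COMMIT;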
with duplicates as
(
  select a.*, row_number() over (partition by first_name, last_name, email
                                 order by first_name, last_name, email) as rn
  from contacts a
)
delete from contacts
where contact_id in (
  select contact_id from duplicates where rn > 1
)

Remove duplicated rows in sql

I want to remove duplicated rows in SQL. My table looks like this:
CREATE TABLE test_table
(
id Serial,
Date Date,
Time Time,
Open double precision,
High double precision,
Low double precision
);
DELETE FROM test_table
WHERE ctid IN (SELECT min(ctid)
FROM test_table
GROUP BY id
HAVING count(*) > 1);
With the above delete statement I am searching the hidden system column ctid for duplicated entries and deleting them. However, this does not work correctly: the query gets executed properly, but does not delete anything.
I appreciate your answer!
UPDATE
This is some sample data (without the generated id):
2013.11.07,12:43,1.35162,1.35162,1.35143,1.35144
2013.11.07,12:43,1.35162,1.35162,1.35143,1.35144
2013.11.07,12:44,1.35144,1.35144,1.35141,1.35142
2013.11.07,12:45,1.35143,1.35152,1.35143,1.35151
2013.11.07,12:46,1.35151,1.35152,1.35149,1.35152
Get out of the habit of using ctid, xid, etc. - they're not advertised for a reason.
One way of dealing with duplicate rows in one shot, depending on how recent your Postgres version is (data-modifying CTEs need 9.1 or later):
with unique_rows
as
(
select distinct on (id) *
from test_table
),
delete_rows
as
(
delete
from test_table
)
insert into test_table
select *
from unique_rows
;
Or break everything up into three steps and use a temp table:
create temp table unique_rows
as
select distinct on (id) *
from test_table
;
delete
from test_table
;
insert into test_table
select *
from unique_rows
;
Not sure if you can use row_number with partitions in PostgreSQL, but if so you can do this to find duplicates; you can add or subtract columns in the partition by to define what counts as a duplicate in the set:
WITH cte AS
(
  SELECT id, ROW_NUMBER() OVER (PARTITION BY Date, Time ORDER BY Date, Time) AS rown
  FROM test_table
)
delete from test_table
where id in (select id from cte where rown > 1);
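As with the SQL Server answer earlier in the thread, you can preview what the delete will touch by running the CTE with a select first:
WITH cte AS
(
  SELECT id, ROW_NUMBER() OVER (PARTITION BY Date, Time ORDER BY Date, Time) AS rown
  FROM test_table
)
SELECT * FROM cte WHERE rown > 1;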