Redshift Delete Duplicate Rows - sql

I want to delete duplicates from a Redshift table that are true duplicates. Below is an example of two rows that are true duplicates.
Since it is Redshift, there are no primary keys to the table. Any help is appreciated.
id | Col 1 | Col 2
1  | Val 1 | Val 2
1  | Val 1 | Val 2
I tried the window functions row_number() and rank(). Neither worked: when the DELETE is applied, the SQL command cannot differentiate between the two rows.
Trial 1:
The below command deletes both rows
DELETE From test_table
where (id) IN
(
select *, row_number() over (partition by id) as rownumber from test_table where row_number != 1
);
Trial 2:
The below command retains both rows.
DELETE From test_table
where (id) IN
(
select *, rank() over (partition by id) as rownumber from test_table where row_number != 1
);

All row values are identical, so you cannot delete one specific row in that table.
In that case I would recommend creating a dummy table and loading only the unique records into it.
Steps to follow (a consolidated sketch is shown after the list):
1. create table dummy as select * from main_table where 1=2
2. insert into dummy (col1, col2, ... coln) select distinct col1, col2, ... coln from main_table;
3. Verify the dummy table.
4. alter table main_table rename to main_table_bk
5. alter table dummy rename to main_table
6. After you complete your testing and verification, drop main_table_bk.
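Put together, and assuming main_table has columns col1, col2, col3 (the column list here is illustrative; substitute the real columns), the sequence might look like this:

-- create an empty copy of the table's structure
create table dummy as select * from main_table where 1=2;

-- load one copy of each distinct row
insert into dummy (col1, col2, col3)
select distinct col1, col2, col3 from main_table;

-- verify: the two counts should match
select count(*) from dummy;
select count(*) from (select distinct col1, col2, col3 from main_table) t;

-- swap the tables
alter table main_table rename to main_table_bk;
alter table dummy rename to main_table;

-- once testing and verification are complete:
-- drop table main_table_bk;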
Hope it will help.

You cannot delete one without deleting the other, as they are identical. The way to do this is to:
make a temp table with one copy of each duplicate row
(within a transaction) delete all rows from the source table that match rows in the temp table
insert the temp table rows back into the source table (commit)
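A minimal sketch of that pattern, assuming the test_table(id, col1, col2) layout from the question (column names are illustrative):

-- one copy of every row that appears more than once
create temp table dupes as
select id, col1, col2
from test_table
group by id, col1, col2
having count(*) > 1;

begin;

-- remove every copy of the duplicated rows
delete from test_table
using dupes
where test_table.id = dupes.id
  and test_table.col1 = dupes.col1
  and test_table.col2 = dupes.col2;

-- put a single copy of each back
insert into test_table
select id, col1, col2 from dupes;

commit;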

Related

BigQuery - remove duplicate rows

I have a table with duplicate rows (sometimes 2,3,4 duplicates) and I need to delete them by leaving only one row (they are all the same, no dates differences).
Is there another way than CREATE OR REPLACE as recommended by Google?
I've already tried with CTE, ROW_NUMBER() over partition, ... but haven't found a way for the moment
Let's say the table looks like this:
id | name
1  | test
1  | test
1  | test
1  | test
You can delete the duplicate information in a few steps, without using CREATE OR REPLACE.
I'm using this example data:
select * from `items`
You can follow these steps:
1. Insert one copy of the data you want to keep, marked with '--' (or whatever marker character you prefer).
insert into `items` (id, data)
select distinct id, concat(data,'--') from `items`
2. Delete all the data that is not marked, in this case everything that does not contain '--'.
delete from `items` where STRPOS(data,"--")=0;
3. Update the remaining data to strip off the marker, in this case '--'.
update `items` set data = substring(data,0,LENGTH(data)-2) where 1=1 ;
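After step 3, re-running the initial query should show a single, unmarked copy of each row:

select * from `items`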

SQL Script for Bulk Deletion

I am attempting to write a SQL Script to bulk delete rows in a table with input from a Text File. I am just getting into SQL Scripting.
Backstory: Someone in my previous role set up a table without a primary key, and a program was designed to insert data into the table. However, the program would never check for duplicate entries first and would just go ahead and do the insert.
I am attempting to clean up the database.
First, I have run a query to see just how many rows are duplicates. There are roughly 7,000; therefore, there is no way I am going to delete them one query at a time. [ID] should have been set up as a Primary Key.
Query to determine duplicates
SELECT [ID] FROM [testing].[dbo].[testingtable]
GROUP BY [ID]
HAVING COUNT(*) > 1
I can delete the duplicate rows by using the following query on an individual ID:
SET ROWCOUNT 1
DELETE FROM [testing].[dbo].[testingtable]
WHERE [ID] = SomeNumber
SET ROWCOUNT 0
I have a text file of all of the duplicate ID number entries, however, is there a bulk delete script that I can create so that I can feed in all of the ID duplicate numbers from the text file? Or is there a more efficient way. Please point me in the direction.
I don't understand why you have (or need) a text file of all duplicate IDs.
You say "there are roughly ~7,000, therefore there is no way I am going to delete them one query at a time". Of course there is a way to delete them; here we go:
If you just want to remove duplicates from your table, use this code:
WITH CTE AS (
    SELECT [ID],
           -- ROW_NUMBER requires an ORDER BY; any order works here since the duplicate rows are identical
           RN = ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY (SELECT NULL))
    FROM [testing].[dbo].[testingtable]
)
DELETE FROM CTE WHERE RN > 1
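After the delete, re-running the duplicate check from the question should return no rows:

SELECT [ID] FROM [testing].[dbo].[testingtable]
GROUP BY [ID]
HAVING COUNT(*) > 1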
if you want to remove a very high percentage of rows...
SELECT col1, col2, ...
INTO #Holdingtable
FROM MyTable
WHERE ..opposite condition..
TRUNCATE TABLE MyTable
INSERT MyTable (col1, col2, ...)
SELECT col1, col2, ...
FROM #Holdingtable

Duplicate key error with PostgreSQL INSERT with subquery

There are some similar questions on StackOverflow, but they don't seem to exactly match my case. I am trying to bulk insert into a PostgreSQL table with a composite unique constraint. I created a temporary table (temptable) without any constraints, and loaded the data (possibly with some duplicate values) into it. So far, so good.
Now, I am trying to transfer the data to the actual table (realtable) with unique index. For this, I used an INSERT statement with a subquery:
INSERT INTO realtable
SELECT * FROM temptable WHERE NOT EXISTS (
SELECT 1 FROM realtable WHERE temptable.added_date = realtable.added_date
AND temptable.product_name = realtable.product_name
);
However, I am getting duplicate key errors:
ERROR: duplicate key value violates unique constraint "realtable_added_date_product_name_key"
SQL state: 23505
Detail: Key (added_date, product_name)=(20000103, TEST) already exists.
My question is, shouldn't the WHERE NOT EXISTS clause prevent this from happening? How can I fix it?
The NOT EXISTS clause only prevents rows from temptable conflicting with existing rows from realtable; it will not prevent multiple rows from temptable from conflicting with each other. This is because the SELECT is calculated once based on the initial state of realtable, not re-calculated after each row is inserted.
One solution would be to use a GROUP BY or DISTINCT ON in the SELECT query, to omit duplicates, e.g.
INSERT INTO realtable
SELECT DISTINCT ON (added_date, product_name) *
FROM temptable WHERE NOT EXISTS (
SELECT 1 FROM realtable WHERE temptable.added_date = realtable.added_date
AND temptable.product_name = realtable.product_name
)
ORDER BY ???; -- this ORDER BY will determine which of a set of duplicates is kept by the DISTINCT ON
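The GROUP BY variant mentioned above looks similar. For illustration, assume temptable has just the two key columns plus a quantity column (the quantity column is made up here; swap in the real columns with appropriate aggregates):

INSERT INTO realtable (added_date, product_name, quantity)
SELECT added_date, product_name, MIN(quantity)  -- pick one value per duplicate group
FROM temptable
WHERE NOT EXISTS (
    SELECT 1 FROM realtable
    WHERE realtable.added_date = temptable.added_date
      AND realtable.product_name = temptable.product_name
)
GROUP BY added_date, product_name;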

Can I keep old keys linked to new keys when making a copy in SQL?

I am trying to copy a record in a table and change a few values with a stored procedure in SQL Server 2005. This is simple, but I also need to copy relationships in other tables with the new primary keys. As this proc is being used to batch copy records, I've found it difficult to store some relationship between old keys and new keys.
Right now, I am grabbing new keys from the batch insert using OUTPUT INTO.
ex:
INSERT INTO table
(column1, column2,...)
OUTPUT INSERTED.PrimaryKey INTO #TableVariable
SELECT column1, column2,...
Is there a way like this to easily get the old keys inserted at the same time I am inserting new keys (to ensure I have paired up the proper corresponding keys)?
I know cursors are an option, but I have never used them and have only heard them referenced in a horror story fashion. I'd much prefer to use OUTPUT INTO, or something like it.
If you need to track both old and new keys in your temp table, you need to cheat and use MERGE:
Data setup:
create table T (
ID int IDENTITY(5,7) not null,
Col1 varchar(10) not null
);
go
insert into T (Col1) values ('abc'),('def');
And the replacement for your INSERT statement:
declare @TV table (
Old_ID int not null,
New_ID int not null
);
merge into T t1
using (select ID,Col1 from T) t2
on 1 = 0
when not matched then insert (Col1) values (t2.Col1)
output t2.ID, inserted.ID into @TV;
And (actually needs to be in the same batch so that you can access the table variable):
select * from T;
select * from @TV;
Produces:
ID  Col1
5   abc
12  def
19  abc
26  def

Old_ID  New_ID
5       19
12      26
The reason you have to do this is because of an irritating limitation on the OUTPUT clause when used with INSERT - you can only access the inserted table, not any of the tables that might be part of a SELECT.
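To then copy the related rows in other tables (the original goal in the question), the Old_ID/New_ID map in @TV can drive the child-table inserts. The child table and column names below are hypothetical:

-- hypothetical child table ChildTable(ParentID, ChildCol); run in the same batch as the MERGE
INSERT INTO ChildTable (ParentID, ChildCol)
SELECT tv.New_ID, c.ChildCol
FROM ChildTable c
JOIN @TV AS tv ON tv.Old_ID = c.ParentID;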
Related - More explanation of the MERGE abuse
INSERT statements loading data into tables with an IDENTITY column are guaranteed to generate the values in the same order as the ORDER BY clause in the SELECT.
If you want the IDENTITY values to be assigned in a sequential fashion that follows the ordering in the ORDER BY clause, create a table that contains a column with the IDENTITY property and then run an INSERT .. SELECT ... ORDER BY query to populate this table.
From: The behavior of the IDENTITY function when used with SELECT INTO or INSERT .. SELECT queries that contain an ORDER BY clause
You can use this fact to match your old with your new identity values. First collect the list of primary keys that you intend to copy into a temporary table. You can also include your modified column values as well if needed:
select
PrimaryKey,
Col1
--Col2... etc
into #NewRecords
from Table
--where whatever...
Then do your INSERT with the OUTPUT clause to capture your new ids into the table variable:
declare @NewIds table (
    New_ID int not null
);
INSERT INTO Table
(Col1 /*,Col2... etc.*/)
OUTPUT INSERTED.PrimaryKey INTO @NewIds
SELECT Col1 /*,Col2... etc.*/
from #NewRecords
order by PrimaryKey
Because of the ORDER BY PrimaryKey clause, you are guaranteed that your New_ID numbers will be generated in the same order as the PrimaryKey values of the copied records. Now you can match them up by row numbers ordered by the ID values. The following query would give you the pairings:
select PrimaryKey, New_ID
from
(select PrimaryKey,
ROW_NUMBER() over (order by PrimaryKey) OldRow
from #NewRecords
) PrimaryKeys
join
(
select New_ID,
ROW_NUMBER() over (order by New_ID) NewRow
from @NewIds
) New_IDs
on OldRow = NewRow

Row number in Sybase tables

Sybase db tables do not have a concept of self-updating row numbers. However, for one of the modules, I require the presence of a row number corresponding to each row in the database, such that max(Column) would always tell me the number of rows in the table.
I thought I'll introduce an int column and keep updating this column to keep track of the row number. However I'm having problems in updating this column in case of deletes. What sql should I use in delete trigger to update this column?
You can easily assign a unique number to each row by using an identity column. The identity can be a numeric or an integer (in ASE12+).
This will almost do what you require. There are certain circumstances in which you will get a gap in the identity sequence. (These are called "identity gaps", the best discussion on them is here). Also deletes will cause gaps in the sequence as you've identified.
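For example, a minimal sketch of such a table (the names are illustrative, not from the question):

-- Sybase ASE: the identity property assigns each row a unique, increasing number
create table myTable (
    rownum  numeric(10,0) identity,
    col1    varchar(10) not null
)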
Why do you need to use max(col) to get the number of rows in the table, when you could just use count(*)? If you're trying to get the last row from the table, then you can do
select * from table where column = (select max(column) from table).
Regarding the delete trigger to update a manually managed column, I think this would be a potential source of deadlocks, and many performance issues. Imagine you have 1 million rows in your table, and you delete row 1, that's 999999 rows you now have to update to subtract 1 from the id.
Delete trigger
CREATE TRIGGER tigger ON myTable FOR DELETE
AS
update myTable
set id = id - (select count(*) from deleted d where d.id < t.id)
from myTable t
To avoid locking problems
You could add an extra table (which joins to your primary table) like this:
CREATE TABLE rowCounter
(id int, -- foreign key to main table
rownum int)
... and use the rownum field from this table.
If you put the delete trigger on this table then you would hugely reduce the potential for locking problems.
Approximate solution?
Does the table need to keep its rownumbers up to date all the time?
If not, you could have a job which runs every minute or so, which checks for gaps in the rownum, and does an update.
Question: do the rownumbers have to reflect the order in which rows were inserted?
If not, you could do far fewer updates, but only updating the most recent rows, "moving" them into gaps.
Leave a comment if you would like me to post any SQL for these ideas.
I'm not sure why you would want to do this. You could experiment with using temporary tables and "select into" with an Identity column like below.
create table test
(
col1 int,
col2 varchar(3)
)
insert into test values (100, "abc")
insert into test values (111, "def")
insert into test values (222, "ghi")
insert into test values (300, "jkl")
insert into test values (400, "mno")
select rank = identity(10), col1 into #t1 from test
select * from #t1
delete from test where col2="ghi"
select rank = identity(10), col1 into #t2 from test
select * from #t2
drop table test
drop table #t1
drop table #t2
This would give you a dynamic id (of sorts)