Hive: Removing Duplicate Rows from Table

I have a table which contains millions of records and all the records have duplicates. So I am trying to extract all the distinct rows in the table.
Here's the query I am using:
CREATE TABLE unique_table AS SELECT DISTINCT * FROM duplicates_table;
Is this an efficient way to do this, or is there a way to remove the duplicate rows without creating a new table?

You can use the same table:
INSERT OVERWRITE TABLE table_name SELECT DISTINCT * FROM table_name;
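For concreteness, a minimal sketch against a hypothetical orders table (the table and column names are placeholders, not from the question). The second statement is an equivalent formulation that spells out every column in a GROUP BY instead of using DISTINCT:
-- Rewrite the table in place, keeping only distinct rows.
INSERT OVERWRITE TABLE orders
SELECT DISTINCT * FROM orders;

-- Equivalent form: GROUP BY over the full column list.
INSERT OVERWRITE TABLE orders
SELECT order_id, customer_id, amount
FROM orders
GROUP BY order_id, customer_id, amount;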

Related

Deleting completely identical duplicates from db

We have a table in our db with copied data that has completely duplicated many rows. Because the id is also duplicated, there is nothing we can use to select just the duplicates. I tried using LIMIT to delete only one of them, but Redshift gave a syntax error when I tried to use LIMIT in a DELETE.
Any ideas how we can delete just one of two rows that have completely identical information?
Use select distinct to create a new table. Then either truncate & copy the data, or drop the original table and rename the new table to the original name:
create table t2 as select distinct * from t;
truncate t;
insert into t select * from t2;
drop table t2;
Or add a column with unique values; identity(seed, step) looks interesting.
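If you go the identity route, here is a hedged sketch for Redshift (table t and columns col1/col2 are placeholders; since ALTER TABLE may not accept adding an IDENTITY column, the sketch rebuilds the table):
-- Rebuild the table with a surrogate key so otherwise identical rows
-- become distinguishable.
create table t_with_id (
    row_id bigint identity(1, 1),
    col1   varchar(100),
    col2   varchar(100)
);

insert into t_with_id (col1, col2)
select col1, col2 from t;

-- Keep one row per duplicate group, delete the rest.
delete from t_with_id
where row_id not in (
    select min(row_id)
    from t_with_id
    group by col1, col2
);
-- Then rename t_with_id into place, or copy its rows back into t.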

Remove duplicate SQL rows by looking at all columns

I have this table, where every column is a VARCHAR (or equivalent):
field001 field002 field003 field004 field005 .... field500
500 VARCHAR columns. No primary keys. And no column is guaranteed to be unique. So the only way to know for sure if two rows are the same is to compare the values of all columns.
(Yes, this should be in TheDailyWTF. No, it's not my fault. Bear with me here).
I inserted a duplicate set of rows by mistake, and I need to find them and remove them.
There's 12 million rows on this table, so I'd rather not recreate it.
However, I do know what rows were mistakenly inserted (I have the .sql file).
So I figured I'd create another table and load it with those. And then I'd do some sort of join that would compare all columns on both tables and then delete the rows that are equal from the first table. I tried a NATURAL JOIN as that looked promising, but nothing was returned.
What are my options?
I'm using Amazon Redshift (so PostgreSQL 8.4 if I recall), but I think this is a general SQL question.
You can treat the whole row as a single record in Postgres (and thus I think in Redshift).
The following works in Postgres and will keep one row from each group of duplicates:
delete from the_table
where ctid not in (select min(ctid)
                   from the_table
                   group by the_table); --<< Yes, the group by is correct!
This is going to be slow!
Grouping over so many columns and then deleting with a NOT IN will take quite some time, especially if a lot of rows are going to be deleted.
If you want to delete all duplicate rows (not keeping any of them), you can use the following:
delete from the_table
where the_table in (select the_table
                    from the_table
                    group by the_table
                    having count(*) > 1);
You should be able to identify all the mistakenly inserted rows using CREATEXID. If you group by CREATEXID on your table as below and get the count, you should be able to see how many rows were inserted in your transaction, and then remove them with a DELETE command.
SELECT CREATEXID, COUNT(1)
FROM yourtable
GROUP BY 1;
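If that grouping points to a single offending transaction, the follow-up might look like this hedged sketch (the xid value below is a placeholder, not a real id):
DELETE FROM yourtable
WHERE CREATEXID = 1234567;  -- placeholder: substitute the xid reported by the query above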
One simplistic solution is to recreate the table, e.g.
CREATE TABLE my_temp_table (
-- add column definitions here, just like the original table
);
INSERT INTO my_temp_table SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
or even
CREATE TABLE my_temp_table AS SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
It is a bit of a trick, but it may help.
Each row in the table contains the ID of the transaction in which the row was inserted/updated (see System Columns); it is the xmin column. Using it, you can find the transaction ID in which you inserted the wrong data, and then just delete the rows with:
delete from my_table where xmin = <the_wrong_transaction_id>;
PS: Be careful and try it on some test table first.
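In that spirit, one cheap sanity check is to count the rows the delete would touch before running it (same placeholder as in the delete above):
select count(*) from my_table where xmin = <the_wrong_transaction_id>;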

delete duplicate records from sql

In my table I have many duplicate records, which I can find with this query:
SELECT ENROLMENT_NO_DATE, COUNT(ENROLMENT_NO_DATE) AS NumOccurrences
FROM Import_Master GROUP BY ENROLMENT_NO_DATE HAVING ( COUNT(ENROLMENT_NO_DATE) > 1 )
I need to remove a duplicate record if it occurs a second time, keeping the first (or any one) of the records. How can I do that?
You can use CTE to perform this task:
;with cte as
(
    select ENROLMENT_NO_DATE,
           row_number() over (partition by ENROLMENT_NO_DATE order by ENROLMENT_NO_DATE) rn
    from Import_Master
)
delete from cte where rn > 1
See SQL Fiddle with Demo
One method could be to create a secondary, temporary table
CREATE TABLE Import_Master_Deduped AS SELECT * FROM Import_Master WHERE FALSE;
This will create an empty table with identical structure to Import_Master. Now impose uniqueness on the new table with an index:
CREATE UNIQUE INDEX Import_Master_Ndx ON Import_Master_Deduped(ENROLMENT_NO_DATE);
Finally, copy the rows from the table with duplicates using INSERT IGNORE, so that duplicate records are not inserted:
INSERT IGNORE INTO Import_Master_Deduped SELECT * FROM Import_Master;
At this point, after checking everything is OK, you can rename the two tables swapping their names (this will lose any old indexes), or TRUNCATE the Import_Master table and copy back the deduped records from the new table into the old.
In the second case, recreate the UNIQUE constraint on the old table to avoid further duplicates; in the first, recreate any old indexes on the new table.
Finally, you remove the table you don't need anymore.
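A hedged sketch of the truncate-and-copy-back variant, in the same MySQL-flavored syntax as the INSERT IGNORE step above:
TRUNCATE TABLE Import_Master;
INSERT INTO Import_Master SELECT * FROM Import_Master_Deduped;
-- Recreate the unique index on Import_Master here so new duplicates are rejected.
DROP TABLE Import_Master_Deduped;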

How I can delete repeated rows from database? [duplicate]

Possible Duplicates:
Remove duplicates in large MySql table
Can I extract the extract records that are duplicated in sql?
How can I delete duplicate rows in a table
I need something to delete repeated rows from the database.
I found out how many rows are repeated in the table using this query:
SELECT GoodCode FROM Good_
and here is the distinct query: SELECT DISTINCT GoodCode FROM Good_
The second one returns fewer records. Please guide me on how I can delete the repeated rows from the first one.
Simple method:
SELECT DISTINCT *
INTO #TempGood
FROM Good_
TRUNCATE TABLE Good_
INSERT Good_
SELECT *
FROM #TempGood
DROP TABLE #TempGood
create table temptable as select distinct * from Good_;
drop table Good_;
rename temptable to Good_;

Query select a bulk of IDs from a table - SQL

I have a table which holds ~1M rows. My application has a list of ~100K IDs which belong to that table (the list is generated by the application layer).
Is there a common method for querying all of these IDs? ~100K SELECT queries? A temporary table that I insert the ~100K IDs into, and then a SELECT query that joins against the required table?
Thanks,
Doori Bar
You could do it in one query, something like
SELECT * FROM large_table WHERE id IN (...)
Insert a comma-separated list of IDs where I put the ...
Unfortunately, there is no easy way that I know of to parametrize this, so you need to be extra-super careful to avoid SQL injection vulnerabilities.
A temporary table which holds the 100k IDs seems like a good solution. Don't insert them one by one though; the INSERT ... VALUES syntax in MySQL accepts the insertion of multiple rows.
By the way, where do you get your 100k IDs, if not from the database? If they come from a preceding query, I'd suggest having that query fill the temporary table.
Edit: For a more portable way to insert multiple rows:
INSERT INTO mytable (col1, col2) SELECT 'foo', 0 UNION SELECT 'bar', 1
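For the temporary-table route, a hedged sketch in MySQL (table and column names are placeholders): load the IDs in multi-row batches, then join once.
CREATE TEMPORARY TABLE wanted_ids (id INT PRIMARY KEY);

-- Repeat in batches of a few thousand values per statement.
INSERT INTO wanted_ids (id) VALUES (1), (2), (3);

SELECT t.*
FROM large_table t
JOIN wanted_ids w ON w.id = t.id;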
Do those id's actually reference the table with 1M rows?
If so, you could use SELECT ids FROM <1M table>
where ids is the ID column and "1M table" is the name of the table which holds the 1M rows.
But I don't think I really understand your question...