This code I have finds duplicate rows in a table. H
SELECT position, name, count(*) as cnt
FROM team
GROUP BY position, name,
HAVING COUNT(*) > 1
How do I delete the duplicate rows that I have found in Hiveql?
Apart from distinct, you can use row_number for this in Hive. Explicit delete and update can only be performed on tables that support ACID. So insert overwrite is more universal.
insert overwrite table team
select position, name, other1, other2...
from (
select
*,
row_number() over(partition by position, name order by rand()) as rn
from team
) tmp
where rn = 1
;
Please try this.assuming id is primary key column
delete from team where id in (
select t1.id from team t1,
(SELECT position, name, count(*) as cnt ,max(id) as id1
FROM team
GROUP BY position, name,
HAVING COUNT(*) > 1) t2
where t1.position=t2.position
and t1.name=t2.name
and t1.id<>t2.id1)
This is an alternative way, since deletes are expensive in Hive
Create table Team_new
As
Select distinct <col1>, <col2>,...
from Team;
Drop table Team purge;
Alter table Team_new rename to Team;
This is assuming you don’t have an id column. If you have an id column then the 1st query would change slightly as
Create table Team_new
As
Select <col1>,<col2>,...,max(id) as id from Team
Group by <col1>,<col2>,... ;
Other queries (drop & alter post this) would remain the same as above.
Related
I found duplicates in my table by doing below query.
SELECT name, id, count(1) as count
FROM [myproject:dev.sample]
group by name, id
having count(1) > 1
Now i would like to remove these duplicates based on id and name by using DML statement but its showing '0 rows affected' message.
Am i missing something?
DELETE FROM PRD.GPBP WHERE
id not in(select id from [myproject:dev.sample] GROUP BY id) and
name not in (select name from [myproject:dev.sample] GROUP BY name)
I suggest, you create a new table without the duplicates. Drop your original table and rename the new table to original table.
You can find duplicates like below:
Create table new_table as
Select name, id, ...... , put our remaining 10 cols here
FROM(
SELECT *,
ROW_NUMBER() OVER(Partition by name , id Order by id) as rnk
FROM [myproject:dev.sample]
)a
WHERE rnk = 1;
Then drop the older table and rename new_table with old table name.
Below query (BigQuery Standard SQL) should be more optimal for de-duping like in your case
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id
If you run it from within UI - you can just set Write Preference to Overwrite Table and you are done
Or if you want you can use DML's INSERT to new table and then copy over original one
Meantime, the easiest way is as below (using DDL)
#standardSQL
CREATE OR REPLACE TABLE `myproject.dev.sample` AS
SELECT * FROM (
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id
)
I want to delete duplicate records without using ROW_NUMBER() function (SQL Server)
Example: Table with the following data:
name salary
-----------------
Husain 20000.00
Husain 20000.00
Husain 20000.00
Munavvar 50000.00
Munavvar 50000.00
After deleting the duplicate records
table should contains data like this:
name salary
-----------------
Husain 20000.00
Munavvar 50000.00
As the motivation for this question seems to be academic interest rather than practical use...
The table has no primary key but the undocumented pseudo column %%physloc%% can provide a substitute.
DELETE T1
FROM YourTable T1 WITH(TABLOCKX)
WHERE CAST(T1.%%physloc%% AS BIGINT)
NOT IN (SELECT MAX(CAST(%%physloc%% AS BIGINT))
FROM YourTable
GROUP BY Name, Salary)
In reality you should never use the above and just use row_number as it is more efficient and documented.
(Data Explorer Demo)
Another (academic) option, depending on what version of SQL server you're using:
;with CTE as (select lag(name) over (order by name) as name1
,lag(salary) over (order by name) as salary1
, *
from #table)
delete from cte where name = name1 and salary = salary1
You can use Common Table Expression combined with ROW_NUMBER() like this (This is the best way to delete duplicates):
WITH CTE AS(
SELECT t.name,t.salary
ROW_NUMBER() OVER(PARTITION BY t.name,t.salary ORDER BY (SELECT 1)) as rn
FROM YourTable t
)
DELETE FROM CTE WHERE RN > 1
ROW_NUMBER() will assign each group randomly ranking, only one will get the rank 1 , and every thing else will be deleted.
EDIT: I can suggest something else with out the use of ROW_NUMBER() :
SELECT distinct t.name,t.salart
INTO TEMP_FOR_UPDATE
FROM YourTable t;
TRUNCATE TABLE YourTable ;
INSERT INTO YourTable
SELECT * FROM TEMP_FOR_UPDATE;
DROP TEMP_FOR_UPDATE;
This will basically create a temp table containing distincted values from your table, truncate your table and re insert the distincted values into your table.
Select data from your table using group by name , salary (or
distinct).
Insert into temp table.
Delete data in original
Copy data from temp table to your original table
In oracle you can use as below
Delete from table
where rowid not in (select max(rowid) from test group by name, salary);
We have a table that has had the same data inserted into it twice by accident meaning most (but not all) rows appears twice in the table. Simply put, I'd like an SQL statement to delete one version of a row while keeping the other; I don't mind which version is deleted as they're identical.
Table structure is something like:
FID, unique_ID, COL3, COL4....
Unique_ID is the primary key, meaning each one appears only once.
FID is a key that is unique to each feature, so if it appears more than once then the duplicates should be deleted.
To select features that have duplicates would be:
select count(*) from TABLE GROUP by FID
Unfortunately I can't figure out how to go from that to a SQL delete statement that will delete extraneous rows leaving only one of each.
This sort of question has been asked before, and I've tried the create table with distinct, but how do I get all columns without naming them? This only gets the single column FID and itemising all the columns to keep gives an: ORA-00936: missing expression
CREATE TABLE secondtable NOLOGGING as select distinct FID from TABLE
If you don't care which row is retained
DELETE FROM your_table_name a
WHERE EXISTS( SELECT 1
FROM your_table_name b
WHERE a.fid = b.fid
AND a.unique_id < b.unique_id )
Once that's done, you'll want to add a constraint to the table that ensures that FID is unique.
Try this
DELETE FROM table_name A WHERE ROWID > (
SELECT min(rowid) FROM table_name B
WHERE A.FID = B.FID)
A suggestion
DELETE FROM x WHERE ROWID IN
(WITH y AS (SELECT xCOL, MIN(ROWID) FROM x GROUP BY xCOL HAVING COUNT(xCOL) > 1)
SELCT a.ROWID FROM x, y WHERE x.XCOL=y.XCOL and x.ROWIDy.ROWID)
Try with this.
DELETE FROM firsttable WHERE unique_ID NOT IN
(SELECT MAX(unique_ID) FROM firsttable GROUP BY FID)
EDIT:
One explanation:
SELECT MAX(unique_ID) FROM firsttable GROUP BY FID;
This sql statement will pick each maximum unique_ID row from each duplicate rows group. And delete statement will keep these maximum unique_ID rows and delete other rows of each duplicate group.
You can try this.
delete from tablename a
where a.logid, a.pointid, a.routeid) in (select logid, pointid, routeid from tablename
group by logid, pointid, routeid having count(*) > 1)
and rowid not in (select min(rowid) from tablename
group by logid, pointid, routeid having count(*) > 1)
I have a table that has a lot of duplicates in the Name column. I'd
like to only keep one row for each.
The following lists the duplicates, but I don't know how to delete the
duplicates and just keep one:
SELECT name FROM members GROUP BY name HAVING COUNT(*) > 1;
Thank you.
See the following question: Deleting duplicate rows from a table.
The adapted accepted answer from there (which is my answer, so no "theft" here...):
You can do it in a simple way assuming you have a unique ID field: you can delete all records that are the same except for the ID, but don't have "the minimum ID" for their name.
Example query:
DELETE FROM members
WHERE ID NOT IN
(
SELECT MIN(ID)
FROM members
GROUP BY name
)
In case you don't have a unique index, my recommendation is to simply add an auto-incremental unique index. Mainly because it's good design, but also because it will allow you to run the query above.
It would probably be easier to select the unique ones into a new table, drop the old table, then rename the temp table to replace it.
#create a table with same schema as members
CREATE TABLE tmp (...);
#insert the unique records
INSERT INTO tmp SELECT * FROM members GROUP BY name;
#swap it in
RENAME TABLE members TO members_old, tmp TO members;
#drop the old one
DROP TABLE members_old;
We have a huge database where deleting duplicates is part of the regular maintenance process. We use DISTINCT to select the unique records then write them into a TEMPORARY TABLE. After TRUNCATE we write back the TEMPORARY data into the TABLE.
That is one way of doing it and works as a STORED PROCEDURE.
If we want to see first which rows you are about to delete. Then delete them.
with MYCTE as (
SELECT DuplicateKey1
,DuplicateKey2 --optional
,count(*) X
FROM MyTable
group by DuplicateKey1, DuplicateKey2
having count(*) > 1
)
SELECT E.*
FROM MyTable E
JOIN MYCTE cte
ON E.DuplicateKey1=cte.DuplicateKey1
AND E.DuplicateKey2=cte.DuplicateKey2
ORDER BY E.DuplicateKey1, E.DuplicateKey2, CreatedAt
Full example at http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
You can join table with yourself by matched field and delete unmatching rows
DELETE t1 FROM table_name t1
LEFT JOIN tablename t2 ON t1.match_field = t2.match_field
WHERE t1.id <> t2.id;
delete dup row keep one
table has duplicate rows and may be some rows have no duplicate rows then it keep one rows if have duplicate or single in a table.
table has two column id and name if we have to remove duplicate name from table
and keep one. Its Work Fine at My end You have to Use this query.
DELETE FROM tablename
WHERE id NOT IN(
SELECT id FROM
(
SELECT MIN(id)AS id
FROM tablename
GROUP BY name HAVING
COUNT(*) > 1
)AS a )
AND id NOT IN(
(SELECT ids FROM
(
SELECT MIN(id)AS ids
FROM tablename
GROUP BY name HAVING
COUNT(*) =1
)AS a1
)
)
before delete table is below see the screenshot:
enter image description here
after delete table is below see the screenshot this query delete amit and akhil duplicate rows and keep one record (amit and akhil):
enter image description here
if you want to remove duplicate record from table.
CREATE TABLE tmp SELECT lastname, firstname, sex
FROM user_tbl;
GROUP BY (lastname, firstname);
DROP TABLE user_tbl;
ALTER TABLE tmp RENAME TO user_tbl;
show record
SELECT `page_url`,count(*) FROM wl_meta_tags GROUP BY page_url HAVING count(*) > 1
delete record
DELETE FROM wl_meta_tags
WHERE meta_id NOT IN( SELECT meta_id
FROM ( SELECT MIN(meta_id)AS meta_id FROM wl_meta_tags GROUP BY page_url HAVING COUNT(*) > 1 )AS a )
AND meta_id NOT IN( (SELECT ids FROM (
SELECT MIN(meta_id)AS ids FROM wl_meta_tags GROUP BY page_url HAVING COUNT(*) =1 )AS a1 ) )
Source url
WITH CTE AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY [emp_id] ORDER BY [emp_id]) AS Row, * FROM employee_salary
)
DELETE FROM CTE
WHERE ROW <> 1
I have the below table with the below records in it
create table employee
(
EmpId number,
EmpName varchar2(10),
EmpSSN varchar2(11)
);
insert into employee values(1, 'Jack', '555-55-5555');
insert into employee values (2, 'Joe', '555-56-5555');
insert into employee values (3, 'Fred', '555-57-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
insert into employee values (1, 'Jack', '555-55-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6 ,'Lisa', '555-70-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
I dont have any primary key in this table .But i have the above records in my table already.
I want to remove the duplicate records which has the same value in EmpId and EmpSSN fields.
Ex : Emp id 5
How can I frame a query to delete those duplicate records?
It is very simple. I tried in SQL Server 2008
DELETE SUB FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY EmpId, EmpName, EmpSSN ORDER BY EmpId) cnt
FROM Employee) SUB
WHERE SUB.cnt > 1
Add a Primary Key (code below)
Run the correct delete (code below)
Consider WHY you woudln't want to keep that primary key.
Assuming MSSQL or compatible:
ALTER TABLE Employee ADD EmployeeID int identity(1,1) PRIMARY KEY;
WHILE EXISTS (SELECT COUNT(*) FROM Employee GROUP BY EmpID, EmpSSN HAVING COUNT(*) > 1)
BEGIN
DELETE FROM Employee WHERE EmployeeID IN
(
SELECT MIN(EmployeeID) as [DeleteID]
FROM Employee
GROUP BY EmpID, EmpSSN
HAVING COUNT(*) > 1
)
END
Use the row number to differentiate between duplicate records. Keep the first row number for an EmpID/EmpSSN and delete the rest:
DELETE FROM Employee a
WHERE ROW_NUMBER() <> ( SELECT MIN( ROW_NUMBER() )
FROM Employee b
WHERE a.EmpID = b.EmpID
AND a.EmpSSN = b.EmpSSN )
With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by EmpID,EmpSSN Order by EmpID,EmpSSN) as Duplicate From Employee)
delete From duplicates
Where Duplicate > 1 ;
This will update Table and remove all duplicates from the Table!
select distinct * into newtablename from oldtablename
Now, the newtablename will have no duplicate records.
Simply change the table name(newtablename) by pressing F2 in object explorer in sql server.
Code
DELETE DUP
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY Clientid ORDER BY Clientid ) AS Val
FROM ClientMaster
) DUP
WHERE DUP.Val > 1
Explanation
Use an inner query to construct a view over the table which includes a field based on Row_Number(), partitioned by those columns you wish to be unique.
Delete from the results of this inner query, selecting anything which does not have a row number of 1; i.e. the duplicates; not the original.
The order by clause of the row_number window function is needed for a valid syntax; you can put any column name here. If you wish to change which of the results is treated as a duplicate (e.g. keep the earliest or most recent, etc), then the column(s) used here do matter; i.e. you want to specify the order such that the record you wish to keep will come first in the result.
You could create a temporary table #tempemployee containing a select distinct of your employee table.
Then delete from employee.
Then insert into employee select from #tempemployee.
Like Josh said - even if you know the duplicates, deleting them will be impossile since you cannot actually refer to a specific record if it is an exact duplicate of another record.
If you don't want to create a new primary key you can use the TOP command in SQL Server:
declare #ID int
while EXISTS(select count(*) from Employee group by EmpId having count(*)> 1)
begin
select top 1 #ID = EmpId
from Employee
group by EmpId
having count(*) > 1
DELETE TOP(1) FROM Employee WHERE EmpId = #ID
end
ITS easy use below query
WITH Dups AS
(
SELECT col1,col2,col3,
ROW_NUMBER() OVER(PARTITION BY col1,col2,col3 ORDER BY (SELECT 0)) AS rn
FROM mytable
)
DELETE FROM Dups WHERE rn > 1
delete sub from (select ROW_NUMBER() OVer(Partition by empid order by empid)cnt from employee)sub
where sub.cnt>1
I'm not an SQL expert so bear with me. I'm sure you'll get a better answer soon enough. Here's how you can find the duplicate records.
select t1.empid, t1.empssn, count(*)
from employee as t1
inner join employee as t2 on (t1.empid=t2.empid and t1.empssn = t2.empssn)
group by t1.empid, t1.empssn
having count(*) > 1
Deleting them will be more tricky because there is nothing in the data that you could use in a delete statement to differentiate the duplicates. I suspect the answer will involve row_number() or adding an identity column.
create unique clustered index Employee_idx
on Employee ( EmpId,EmpSSN )
with ignore_dup_key
You can drop the index if you don't need it.
no ID, no rowcount() or no temp table needed....
WHILE
(
SELECT COUNT(*)
FROM TBLEMP
WHERE EMPNO
IN (SELECT empno from tblemp group by empno having count(empno)>1)) > 1
DELETE top(1)
FROM TBLEMP
WHERE EMPNO IN (SELECT empno from tblemp group by empno having count(empno)>1)
there are two columns in the a table ID and name where names are repeating with different IDs so for that you may use this query:
.
.
DELETE FROM dbo.tbl1
WHERE id NOT IN (
Select MIN(Id) AS namecount FROM tbl1
GROUP BY Name
)
Having a database table without Primary Key is really and will say extremely BAD PRACTICE...so after you add one (ALTER TABLE)
Run this until you don't see any more duplicated records (that is the purpose of HAVING COUNT)
DELETE FROM [TABLE_NAME] WHERE [Id] IN
(
SELECT MAX([Id])
FROM [TABLE_NAME]
GROUP BY [TARGET_COLUMN]
HAVING COUNT(*) > 1
)
SELECT MAX([Id]),[TABLE_NAME], COUNT(*) AS dupeCount
FROM [TABLE_NAME]
GROUP BY [TABLE_NAME]
HAVING COUNT(*) > 1
MAX([Id]) will cause to delete latest records (ones added after first created) in case you want the opposite meaning that in case of requiring deleting first records and leave the last record inserted please use MIN([Id])
Let's think out of the box.
I don't delete from the table, I make a new table first, for safety.
I personally prefer do a
INSERT INTO new_table SELECT DISTINCT * FROM orig_table;
Now, new_table now should contains the expected data I want.
I can check new_table to ensure that.
Then I have 2 options to replace the orig_table
A. delete orig_table; rename new_table to orig_table
B. truncate orig_table; insert data from new_table to orig_table; delete new_table (Recommended: in case you have some trigger/something else linked to the original orig_table)
select t1.* from employee t1, employee t2 where t1.empid=t2.empid and t1.empname = t2.empname and t1.salary = t2.salary
group by t1.empid, t1.empname,t1.salary having count(*) > 1
delete from employee where rowid in (select rowid from (select rowid, name_count from (select rowid, count(emp_name) as name_count from employee group by emp_id, emp_name) where name_count>1))
DELETE FROM 'test'
USING 'test' , 'test' as vtable
WHERE test.id>vtable.id and test.common_column=vtable.common_column
Using this we can remove duplicate records
ALTER IGNORE TABLE test
ADD UNIQUE INDEX 'test' ('b');
# here 'b' is column name to uniqueness,
# here 'test' is index name.