This code I have finds duplicate rows in a table. H
SELECT position, name, count(*) as cnt
FROM team
GROUP BY position, name,
HAVING COUNT(*) > 1
How do I delete the duplicate rows that I have found in Hiveql?
Apart from distinct, you can use row_number for this in Hive. Explicit delete and update can only be performed on tables that support ACID. So insert overwrite is more universal.
insert overwrite table team
select position, name, other1, other2...
from (
select
*,
row_number() over(partition by position, name order by rand()) as rn
from team
) tmp
where rn = 1
;
Please try this.assuming id is primary key column
delete from team where id in (
select t1.id from team t1,
(SELECT position, name, count(*) as cnt ,max(id) as id1
FROM team
GROUP BY position, name,
HAVING COUNT(*) > 1) t2
where t1.position=t2.position
and t1.name=t2.name
and t1.id<>t2.id1)
This is an alternative way, since deletes are expensive in Hive
Create table Team_new
As
Select distinct <col1>, <col2>,...
from Team;
Drop table Team purge;
Alter table Team_new rename to Team;
This is assuming you don’t have an id column. If you have an id column then the 1st query would change slightly as
Create table Team_new
As
Select <col1>,<col2>,...,max(id) as id from Team
Group by <col1>,<col2>,... ;
Other queries (drop & alter post this) would remain the same as above.
I found duplicates in my table by doing below query.
SELECT name, id, count(1) as count
FROM [myproject:dev.sample]
group by name, id
having count(1) > 1
Now i would like to remove these duplicates based on id and name by using DML statement but its showing '0 rows affected' message.
Am i missing something?
DELETE FROM PRD.GPBP WHERE
id not in(select id from [myproject:dev.sample] GROUP BY id) and
name not in (select name from [myproject:dev.sample] GROUP BY name)
I suggest, you create a new table without the duplicates. Drop your original table and rename the new table to original table.
You can find duplicates like below:
Create table new_table as
Select name, id, ...... , put our remaining 10 cols here
FROM(
SELECT *,
ROW_NUMBER() OVER(Partition by name , id Order by id) as rnk
FROM [myproject:dev.sample]
)a
WHERE rnk = 1;
Then drop the older table and rename new_table with old table name.
Below query (BigQuery Standard SQL) should be more optimal for de-duping like in your case
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id
If you run it from within UI - you can just set Write Preference to Overwrite Table and you are done
Or if you want you can use DML's INSERT to new table and then copy over original one
Meantime, the easiest way is as below (using DDL)
#standardSQL
CREATE OR REPLACE TABLE `myproject.dev.sample` AS
SELECT * FROM (
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id
)
I have sample data like this which only have varchar values and it doesn't have ID column:
Name
Mohan
Mohan
Mohan
Mohan
Required output:
Name
Mohan
How can this be achieved using only a Correlated Sub query.
I have tried with below query:
DELETE table
WHERE name < ( SELECT MAX(name) FROM table t
WHERE t.name = name )
Maybe you can use the undocumented(!) %%physloc%% pseudo column, which describes the physical location of a row, to identify a row. For example, only keep the row with the minimal %%physloc%%.
DELETE FROM elbat
WHERE %%physloc%% <> (SELECT min(%%physloc%%)
FROM elbat t
WHERE t.name = elbat.name);
db<>fiddle
But, as already mentioned, %%physloc%% is undocumented and therefore subject to change or removal without further notice in future patches or releases. You query might break suddenly.
Without "CTE, ROW_NUM, Distinct, Group by, having count(*)" you can use UNION:
DECLARE #Data TABLE (Value varchar(50));
INSERT INTO #Data VALUES ('Name'), ('Mohan'), ('Mohan'), ('Mohan'), ('Mohan');
SELECT Value FROM #Data
UNION
SELECT Value FROM #Data
Besides sticky bit's solution using %%physloc%%, I don't see a way to do this via a correlated subquery when the table doesn't have a primary key.
But here's an alternative with a sub-query that comes near.
DELETE t FROM
(
select row_number() over (partition by name order by name) as rn
from [Table]
) t
WHERE rn > 1;
I have the below table with the below records in it
create table employee
(
EmpId number,
EmpName varchar2(10),
EmpSSN varchar2(11)
);
insert into employee values(1, 'Jack', '555-55-5555');
insert into employee values (2, 'Joe', '555-56-5555');
insert into employee values (3, 'Fred', '555-57-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
insert into employee values (1, 'Jack', '555-55-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6 ,'Lisa', '555-70-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
I dont have any primary key in this table .But i have the above records in my table already.
I want to remove the duplicate records which has the same value in EmpId and EmpSSN fields.
Ex : Emp id 5
How can I frame a query to delete those duplicate records?
It is very simple. I tried in SQL Server 2008
DELETE SUB FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY EmpId, EmpName, EmpSSN ORDER BY EmpId) cnt
FROM Employee) SUB
WHERE SUB.cnt > 1
Add a Primary Key (code below)
Run the correct delete (code below)
Consider WHY you woudln't want to keep that primary key.
Assuming MSSQL or compatible:
ALTER TABLE Employee ADD EmployeeID int identity(1,1) PRIMARY KEY;
WHILE EXISTS (SELECT COUNT(*) FROM Employee GROUP BY EmpID, EmpSSN HAVING COUNT(*) > 1)
BEGIN
DELETE FROM Employee WHERE EmployeeID IN
(
SELECT MIN(EmployeeID) as [DeleteID]
FROM Employee
GROUP BY EmpID, EmpSSN
HAVING COUNT(*) > 1
)
END
Use the row number to differentiate between duplicate records. Keep the first row number for an EmpID/EmpSSN and delete the rest:
DELETE FROM Employee a
WHERE ROW_NUMBER() <> ( SELECT MIN( ROW_NUMBER() )
FROM Employee b
WHERE a.EmpID = b.EmpID
AND a.EmpSSN = b.EmpSSN )
With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by EmpID,EmpSSN Order by EmpID,EmpSSN) as Duplicate From Employee)
delete From duplicates
Where Duplicate > 1 ;
This will update Table and remove all duplicates from the Table!
select distinct * into newtablename from oldtablename
Now, the newtablename will have no duplicate records.
Simply change the table name(newtablename) by pressing F2 in object explorer in sql server.
Code
DELETE DUP
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY Clientid ORDER BY Clientid ) AS Val
FROM ClientMaster
) DUP
WHERE DUP.Val > 1
Explanation
Use an inner query to construct a view over the table which includes a field based on Row_Number(), partitioned by those columns you wish to be unique.
Delete from the results of this inner query, selecting anything which does not have a row number of 1; i.e. the duplicates; not the original.
The order by clause of the row_number window function is needed for a valid syntax; you can put any column name here. If you wish to change which of the results is treated as a duplicate (e.g. keep the earliest or most recent, etc), then the column(s) used here do matter; i.e. you want to specify the order such that the record you wish to keep will come first in the result.
You could create a temporary table #tempemployee containing a select distinct of your employee table.
Then delete from employee.
Then insert into employee select from #tempemployee.
Like Josh said - even if you know the duplicates, deleting them will be impossile since you cannot actually refer to a specific record if it is an exact duplicate of another record.
If you don't want to create a new primary key you can use the TOP command in SQL Server:
declare #ID int
while EXISTS(select count(*) from Employee group by EmpId having count(*)> 1)
begin
select top 1 #ID = EmpId
from Employee
group by EmpId
having count(*) > 1
DELETE TOP(1) FROM Employee WHERE EmpId = #ID
end
ITS easy use below query
WITH Dups AS
(
SELECT col1,col2,col3,
ROW_NUMBER() OVER(PARTITION BY col1,col2,col3 ORDER BY (SELECT 0)) AS rn
FROM mytable
)
DELETE FROM Dups WHERE rn > 1
delete sub from (select ROW_NUMBER() OVer(Partition by empid order by empid)cnt from employee)sub
where sub.cnt>1
I'm not an SQL expert so bear with me. I'm sure you'll get a better answer soon enough. Here's how you can find the duplicate records.
select t1.empid, t1.empssn, count(*)
from employee as t1
inner join employee as t2 on (t1.empid=t2.empid and t1.empssn = t2.empssn)
group by t1.empid, t1.empssn
having count(*) > 1
Deleting them will be more tricky because there is nothing in the data that you could use in a delete statement to differentiate the duplicates. I suspect the answer will involve row_number() or adding an identity column.
create unique clustered index Employee_idx
on Employee ( EmpId,EmpSSN )
with ignore_dup_key
You can drop the index if you don't need it.
no ID, no rowcount() or no temp table needed....
WHILE
(
SELECT COUNT(*)
FROM TBLEMP
WHERE EMPNO
IN (SELECT empno from tblemp group by empno having count(empno)>1)) > 1
DELETE top(1)
FROM TBLEMP
WHERE EMPNO IN (SELECT empno from tblemp group by empno having count(empno)>1)
there are two columns in the a table ID and name where names are repeating with different IDs so for that you may use this query:
.
.
DELETE FROM dbo.tbl1
WHERE id NOT IN (
Select MIN(Id) AS namecount FROM tbl1
GROUP BY Name
)
Having a database table without Primary Key is really and will say extremely BAD PRACTICE...so after you add one (ALTER TABLE)
Run this until you don't see any more duplicated records (that is the purpose of HAVING COUNT)
DELETE FROM [TABLE_NAME] WHERE [Id] IN
(
SELECT MAX([Id])
FROM [TABLE_NAME]
GROUP BY [TARGET_COLUMN]
HAVING COUNT(*) > 1
)
SELECT MAX([Id]),[TABLE_NAME], COUNT(*) AS dupeCount
FROM [TABLE_NAME]
GROUP BY [TABLE_NAME]
HAVING COUNT(*) > 1
MAX([Id]) will cause to delete latest records (ones added after first created) in case you want the opposite meaning that in case of requiring deleting first records and leave the last record inserted please use MIN([Id])
Let's think out of the box.
I don't delete from the table, I make a new table first, for safety.
I personally prefer do a
INSERT INTO new_table SELECT DISTINCT * FROM orig_table;
Now, new_table now should contains the expected data I want.
I can check new_table to ensure that.
Then I have 2 options to replace the orig_table
A. delete orig_table; rename new_table to orig_table
B. truncate orig_table; insert data from new_table to orig_table; delete new_table (Recommended: in case you have some trigger/something else linked to the original orig_table)
select t1.* from employee t1, employee t2 where t1.empid=t2.empid and t1.empname = t2.empname and t1.salary = t2.salary
group by t1.empid, t1.empname,t1.salary having count(*) > 1
delete from employee where rowid in (select rowid from (select rowid, name_count from (select rowid, count(emp_name) as name_count from employee group by emp_id, emp_name) where name_count>1))
DELETE FROM 'test'
USING 'test' , 'test' as vtable
WHERE test.id>vtable.id and test.common_column=vtable.common_column
Using this we can remove duplicate records
ALTER IGNORE TABLE test
ADD UNIQUE INDEX 'test' ('b');
# here 'b' is column name to uniqueness,
# here 'test' is index name.
I need, if possible, a t-sql query that, returning the values from an arbitrary table, also returns a incremental integer column with value = 1 for the first row, 2 for the second, and so on.
This column does not actually resides in any table, and must be strictly incremental, because the ORDER BY clause could sort the rows of the table and I want the incremental row in perfect shape always.
The solution must run on SQL Server 2000
For SQL 2005 and up
SELECT ROW_NUMBER() OVER( ORDER BY SomeColumn ) AS 'rownumber',*
FROM YourTable
for 2000 you need to do something like this
SELECT IDENTITY(INT, 1,1) AS Rank ,VALUE
INTO #Ranks FROM YourTable WHERE 1=0
INSERT INTO #Ranks
SELECT SomeColumn FROM YourTable
ORDER BY SomeColumn
SELECT * FROM #Ranks
Order By Ranks
see also here Row Number
You can start with a custom number and increment from there, for example you want to add a cheque number for each payment you can do:
select #StartChequeNumber = 3446;
SELECT
((ROW_NUMBER() OVER(ORDER BY AnyColumn)) + #StartChequeNumber ) AS 'ChequeNumber'
,* FROM YourTable
will give the correct cheque number for each row.
Try ROW_NUMBER()
http://msdn.microsoft.com/en-us/library/ms186734.aspx
Example:
SELECT
col1,
col2,
ROW_NUMBER() OVER (ORDER BY col1) AS rownum
FROM tbl
It is ugly and performs badly, but technically this works on any table with at least one unique field AND works in SQL 2000.
SELECT (SELECT COUNT(*) FROM myTable T1 WHERE T1.UniqueField<=T2.UniqueField) as RowNum, T2.OtherField
FROM myTable T2
ORDER By T2.UniqueField
Note: If you use this approach and add a WHERE clause to the outer SELECT, you have to added it to the inner SELECT also if you want the numbers to be continuous.