Retrieve row of aggregate on primary key with GROUP BY - SQL

OK, so I know you can't select specific fields without an aggregate when a SQL query has a GROUP BY, but it seems to me that if you are aggregating on a primary key that is guaranteed to be unique, there should be a way to pull the other columns of that row along with it. Something like this:
SELECT MAX(id),
       foo,
       bar
FROM mytable
GROUP BY value1, value2
Since id is guaranteed to be unique, the MAX(id) row has exactly one value for foo and bar. Is there a way to write a query like this?
I have tried this, but MyTable in this case has millions of rows, and the run-time for this is unacceptable:
SELECT *
FROM mytable
WHERE id IN (SELECT MAX(id)
             FROM mytable
             GROUP BY value1, value2)
AND ...
Ideally I would like a solution that works at least as far back as SQL Server 2005, but if there are better solutions in the later versions I would like to hear them as well.

Make sure you have an index defined such as:
CREATE NONCLUSTERED INDEX idx_GroupBy ON mytable (value1, value2)
IN can tend to be slow if you have many rows. It may help to turn your IN into an INNER JOIN.
SELECT data.*
FROM mytable data
INNER JOIN
(
    SELECT id = MAX(id)
    FROM mytable
    GROUP BY value1, value2
) ids
    ON data.id = ids.id
Unfortunately, SQL Server does not have any feature that will do this any better.
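That said, on SQL Server 2005 and later a ROW_NUMBER() window function can often replace the MAX(id) subquery entirely. A sketch, untested, using the column names from the question (the derived-table alias t is arbitrary):
SELECT id, foo, bar
FROM (
    SELECT id, foo, bar,
           ROW_NUMBER() OVER (PARTITION BY value1, value2 ORDER BY id DESC) AS rn
    FROM mytable
) t
WHERE rn = 1
Each group's rows are numbered by descending id, so rn = 1 is the row that holds MAX(id) for that (value1, value2) pair.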

Try this:
SELECT foo,
       bar,
       MAX(ID) OVER (PARTITION BY Value1, Value2) AS MaxId
FROM myTable
CREATE NONCLUSTERED INDEX myIndex ON MyTable(Value1,Value2) INCLUDE(foo,bar)
This nonclustered index will cover the whole query, so it will not need to look up the base table at all.

Related

SQL duplicate deletion and retrieval [duplicate]

I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns and SELECT the MIN (or MAX) RowId as the row to keep. Then just delete everything that didn't match a kept RowId:
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
    SELECT MIN(RowId) AS RowId, Col1, Col2, Col3
    FROM MyTable
    GROUP BY Col1, Col2, Col3
) AS KeepRows
    ON MyTable.RowId = KeepRows.RowId
WHERE KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is:
; --Ensure that any immediately preceding statement is terminated with a semicolon
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                              ORDER BY (SELECT 0)) AS RN
    FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above, as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order, for example, you could use ORDER BY RowID DESC.
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be: no useful index on the partitioning columns, and relatively few groups with relatively many duplicates in each group.
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
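A minimal sketch of that keep-and-truncate approach, assuming the MyTable definition from the question and no foreign keys referencing the table (TRUNCATE is not allowed otherwise); untested:
SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
INTO #KeepRows                   -- one surviving row per duplicate group
FROM MyTable
GROUP BY Col1, Col2, Col3;

TRUNCATE TABLE MyTable;          -- minimally logged, resets the identity

SET IDENTITY_INSERT MyTable ON;  -- preserve the original RowID values
INSERT INTO MyTable (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3 FROM #KeepRows;
SET IDENTITY_INSERT MyTable OFF;

DROP TABLE #KeepRows;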
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from mytable t1, mytable t2
where t1.columnA = t2.columnA
  and t1.rowid > t2.rowid
Postgres:
delete
from mytable t1
using mytable t2
where t1.columnA = t2.columnA
  and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                                ORDER BY rowid DESC) AS [Row]
      FROM mytable) LU
WHERE [Row] > 1
This will delete the duplicate rows, keeping the first row of each group:
DELETE FROM Mytable
WHERE RowID NOT IN (
    SELECT MIN(RowID)
    FROM Mytable
    GROUP BY Col1, Col2, Col3
)
For reference, see http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server
I would prefer a CTE for deleting duplicate rows from a SQL Server table; I strongly recommend following this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
Without keeping the original:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery / HAVING COUNT(*) > 1 solution to the INNER JOIN because I found it easier to read, and it was very easy to turn into a SELECT statement to verify what would be deleted before you ran it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution, since it works under special circumstances: in my case the table with the duplicate values did not have a foreign key (because the values were duplicated from another DB).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep your changes!!
PS: when working on things like this I always use a transaction; this not only ensures everything is executed as a whole, but also allows me to test without risking anything. But of course you should take a backup anyway, just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
It deleted 1M rows in a little more than 30 seconds from a table of 2M (50% duplicates).
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey,
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn2
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. It is easy to grasp and seems to be effective for most similar problems. It is written for SQL Server, but the concept carries over.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID, ATTENDANCE_DATE)
This is the easiest way to delete duplicate records, keeping one row per title:
DELETE FROM tblemp WHERE id NOT IN
(
    SELECT MIN(id) FROM tblemp
    GROUP BY title
)
Use this:
WITH tblTemp AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY Name, Department ORDER BY Name) AS RowNumber, *
    FROM <table_name>
)
DELETE FROM tblTemp WHERE RowNumber > 1
Here is another good article on removing duplicates.
It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
It covers the temp table solution, and two MySQL examples.
In the future, are you going to prevent duplicates at the database level, or at the application level? I would suggest the database level, because your database should be responsible for maintaining its own integrity; developers will just cause problems ;)
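For example, a unique constraint on the duplicated columns would reject future duplicates at the source. A sketch, assuming the Col1/Col2/Col3 columns from the question and that the duplicates have already been removed (note the 900-byte index key limit discussed further down, which the varchar(2048) column can exceed):
ALTER TABLE MyTable ADD CONSTRAINT UQ_MyTable_Cols UNIQUE (Col1, Col2, Col3);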
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE RowID NOT IN
(SELECT
    (SELECT TOP 1 RowID FROM MyTable mt2
     WHERE mt2.Col1 = mt.Col1
       AND mt2.Col2 = mt.Col2
       AND mt2.Col3 = mt.Col3)
 FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
The other way is to create a new table with the same fields and a unique index, then move all the data from the old table to the new one. SQL Server will automatically ignore the duplicate values (there is also an option controlling what to do when a duplicate value appears: ignore, interrupt, and so on). So we end up with the same table, minus the duplicate rows. If you don't want the unique index, you can drop it after transferring the data.
Especially for larger tables, you may use DTS (an SSIS package to import/export data) to transfer all the data rapidly to your new uniquely indexed table. For 7 million rows it takes just a few minutes.
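A sketch of that transfer, assuming the MyTable definition from the question; untested. IGNORE_DUP_KEY = ON makes SQL Server silently skip rows that would violate the unique index instead of failing the whole insert (the 900-byte key limit mentioned elsewhere in this thread applies to the varchar(2048) column):
CREATE TABLE MyTableClean (
    RowID int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    Col1 varchar(20) NOT NULL,
    Col2 varchar(2048) NOT NULL,
    Col3 tinyint NOT NULL
);

CREATE UNIQUE NONCLUSTERED INDEX UQ_MyTableClean
ON MyTableClean (Col1, Col2, Col3)
WITH (IGNORE_DUP_KEY = ON);

-- duplicates are dropped with a "Duplicate key was ignored." warning
INSERT INTO MyTableClean (Col1, Col2, Col3)
SELECT Col1, Col2, Col3 FROM MyTable;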
Using the query below, we can delete duplicate records based on a single column or on multiple columns. This version treats the pair of columns empno and empname as the duplicate key; the table name is testing:
DELETE t
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY empno, empname ORDER BY empno)
             AS ItemNumber
      FROM testing) t
WHERE ItemNumber > 1
Create a new blank table with the same structure. Note that the two queries below use MySQL's nonstandard SELECT * with GROUP BY; SQL Server will reject them, so use a ROW_NUMBER-based variant there. Then execute a query like this:
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this:
DELETE A
FROM MyTable A, MyTable B
WHERE A.COL1 = B.COL1
  AND A.COL2 = B.COL2
  AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well, since it can be helpful and works in all SQL Server versions:
Pretty often there are only one or two duplicates, and the IDs and count of the duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005 an index key is limited to 900 bytes, and my varchar(2048) field blows past that.
I don't know how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
AFTER INSERT, UPDATE
AS
    DECLARE @cnt AS INT

    SELECT @cnt = COUNT(*)
    FROM stories
    INNER JOIN inserted
        ON (stories.story = inserted.story
            AND stories.story_id != inserted.story_id)

    IF @cnt > 0
    BEGIN
        RAISERROR('plagiarism detected', 16, 1)
        ROLLBACK TRANSACTION
    END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
DELETE FROM table_name T1
WHERE rowid > (
    SELECT MIN(rowid)
    FROM table_name T2
    WHERE T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
If you want to preview the rows you are about to remove, and keep control over which of the duplicate rows to keep, see http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1

SQL: How to remove duplicate rows (content the same, but sequence is different)

You can see that SKU1 has 2 rows, but the content of those 2 rows is actually the same; only the order of "b" and "c" differs.
What if I want to remove such duplicate rows, keeping one per SKU?
Oracle has LEAST/GREATEST functions that could handle this, but I am using SQL Server, so the instructions in the post below don't work for me:
How to remove duplicate rows in SQL
Is it only 2 columns where the order should not matter for the GROUP BY?
Then you could use IIF (or a CASE WHEN) to calculate the maximum and minimum values,
and use those calculated values in the GROUP BY.
For example:
select Name,
MAX(Val1) as Val1,
MIN(Val2) as Val2
from Table1
GROUP BY Name,
IIF(Val2 is null or Val1 < Val2, Val1, Val2),
IIF(Val1 is null or Val1 < Val2, Val2, Val1);
For the example records that would give the result:
Name Val1 Val2
SKU1 20 10
SKU2 20 10
Or, if you want to use a fancy XML trick:
select Name, max(Val1) as Val1, min(Val2) as Val2
from (
select *,
cast(
convert(XML,
concat('<n>',Val1,'</n><n>',Val2,'</n>')
).query('for $n in /n order by $n return string($n)'
) as varchar(6)) as SortedValues
from Table1
) q
group by Name, SortedValues;
The last method can be more useful when there are more columns involved.
To actually remove the duplicates?
Here's an example that uses a table variable to demonstrate:
declare #Table1 TABLE (Id int, Name varchar(20), Val1 int, Val2 int);
Insert Into #Table1 values
(1,'SKU1',10,20),
(2,'SKU1',20,10),
(3,'SKU1',12,15),
(4,'SKU2',10,null),
(5,'SKU2',null,10),
(6,'SKU2',10,20);
delete from #Table1
where Id in (
select Id
from (
select Id,
row_number() over (partition by Name,
IIF(Val2 is null or Val1 < Val2, Val1, Val2),
IIF(Val1 is null or Val1 < Val2, Val2, Val1)
order by Val1 desc, Val2 desc
) as rn
from #Table1
) q
where rn > 1
);
select * from #Table1;
You can use the MAX() and MIN() functions instead of Oracle's LEAST and GREATEST. I used the following steps and got the same result.
Create Table Transactions (Name varchar(255),Quantity1 int,Quantity2 int)
Insert Into Transactions values
('SKU1',10,20),
('SKU1',20,10),
('SKU2',10,20),
('SKU2',10,20)
Now I used the query below to get the result you want:
Select T1.Name,MAX(T1.Quantity1),MIN(T2.Quantity2) From Transactions T1
join Transactions T2
on T1.Name=T2.Name
group by T1.Name
greatest() can be simulated using a CASE expression
greatest(b,c) is the same as:
case
when b > c then b
else c
end
You can use this together with a distinct to remove your duplicates:
select distinct
a,
case when b > c then b else c end as x
from the_table
order by a;
Try %%physloc%% (an undocumented virtual column). It is the equivalent of Oracle's ROWID.
Find it
select *, %%physloc%% from [MyTable] where ...
Delete what you want
delete from [MyTable] where %%physloc%% = 0xDEADBEEF -- (your address)
Consider adding a Unique / Primary Key to prevent future occurrences.
SELECT * FROM abc WHERE (A='SKU1' AND B=20) OR (A='SKU2' AND B=10)
a b c
SKU1 20 10
SKU2 10 20
This may seem a little complicated at first, but we can also use PIVOT/UNPIVOT to obtain the result. Below is the query:
select *
from
(
    select *,
           'quantity' + cast(row_number() over (partition by name order by data) as nvarchar) cols
    from
    (
        select distinct name, data
        from (select * from transactions) s
        unpivot
        (
            data for cols in (quantity1, quantity2)
        ) u
    ) s
) s
pivot
(
    max(data) for cols in (quantity1, quantity2)
) p
From your question it is not clear whether you want to filter duplicates by row or duplicates by column. Let me describe both to make sure your question is addressed completely.
In Example 1, suppose we have rows that are complete duplicates.
To filter them, just add the keyword DISTINCT to your query, as follows:
SELECT DISTINCT * FROM myTable;
It filters out the duplicate rows and returns only the distinct ones.
Hence, you don't need a LEAST or GREATEST function in this case.
In Example 2, suppose the duplicates are within individual columns rather than whole rows.
Here, SELECT DISTINCT * from abc will still return all 4 rows.
If we regard only the first column in the filtering, it can be achieved by the following query:
select distinct t.Col1,
(select top 1 Col2 from myTable ts where t.Col1=ts.Col1) Col2,
(select top 1 Col3 from myTable ts where t.Col1=ts.Col1) Col3
from myTable t
It will pick the first matching value in each column, so the query returns one row per distinct value of Col1.
The difference between Example 1 and this example is that it eliminated just the duplicate occurrences of the values in Col1 of myTable and then returned the related values of the other columns - hence the results in Col1 and Col2 are different.
Note:
In this case you can't just join the table myTable, because then you would be forced to list the columns in the SELECT DISTINCT, which would return more rows than you want. Unfortunately, T-SQL does not offer something like SELECT DISTINCT ON(fieldname), i.e. you can't directly specify a distinct (single) field name.
You might have thought "why not use GROUP BY?" The answer: with GROUP BY you are forced to either specify all columns, which is a technical DISTINCT equivalent, or to use aggregate functions like MIN or MAX, which don't return what you want either.
A more advanced query (you might have seen it once before!) which has the same result is:
SELECT Col1, Col2, Col3
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1) AS RowNumber
FROM myTable
) t
WHERE RowNumber=1
This statement numbers each occurrence of a value in Col1 in the subquery and then takes the first of each set of duplicate rows - which effectively is a grouping by Col1 (but without the disadvantages of GROUP BY).
N.B. In the examples above, I am assuming a table definition like:
CREATE TABLE [dbo].[myTable](
[Col1] [nvarchar](max) NULL,
[Col2] [int] NULL,
[Col3] [int] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
For the examples above, we don't need to declare a primary key column. But generally speaking, you'll need a primary key in database tables to be able to reference rows efficiently.
If you want to permanently delete rows not needed, you should introduce a primary key, because then you can delete the rows not displayed easily as follows (i.e. it is the inverse filter of the advanced query mentioned above):
DELETE FROM [dbo].[myTable]
WHERE myPK NOT IN
    (SELECT myPK
     FROM (
         SELECT *, ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1) AS RowNumber
         FROM [dbo].[myTable]
     ) t
     WHERE RowNumber = 1)
This assumes you have added an integer primary key myPK which auto-increments (you can do that via the SQL Management Studio easily by using the designer).
Or you can execute the following query to add it to the existing table:
BEGIN TRANSACTION
GO
ALTER TABLE dbo.myTable ADD
myPK int NOT NULL IDENTITY (1, 1)
GO
ALTER TABLE dbo.myTable ADD CONSTRAINT
PK_myTable PRIMARY KEY CLUSTERED (myPK)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
ON [PRIMARY]
GO
ALTER TABLE dbo.myTable SET (LOCK_ESCALATION = TABLE)
GO
COMMIT
You can find some more examples on MSDN.

Returning not found values in query

I'm working on a project, and I have a list of several thousand records that I need to check. I need to provide the list of the ones NOT found, and provide the query I use to locate them so my superiors can check the work.
I'll admit I'm relatively new to SQL. I don't have access to create temporary tables, which is one way I had thought of to do this.
A basic idea of what I'm doing:
select t.column1, t.column2
from table1 t
where t.column1 in ('value1','value2','value3')
If value1 and value3 are in the database, but value2 is not, I need to display value2 and not the others.
I have tried ISNULL, embedding the query, and trying to select NOT values from the query.
I have searched for returning records not found in a query on Google and on this site, and still found nothing.
I have tried something similar:
First create a table which will contain all the values that you need to check.
Let's say:
create table table_values (val varchar(30));
Then try the NOT IN clause as below:
select * from table_values tv
where tv.val not in (select t.column1 from table1 t);
This will return the values that are missing from table1. (Note that NOT IN matches nothing if the subquery returns any NULLs, so filter those out if column1 is nullable.)
In SQL Server 2008, you can make derived tables using the syntax VALUES (...), (...), (...), e.g.
select v.value
from (
values ('value1'),('value2'),('value3')
) v(value)
left join table1 t on t.column1 = v.value
where t.column1 is null
Notes:
Each (...) after VALUES is a single row, and you can have multiple columns.
The v(value) after the derived table is the alias given to the table, and column name(s).
LEFT JOIN keeps the values from the derived table v even when the record doesn't exist in table1
Then, we keep only the records that cannot be matched, i.e. t.column1 is null
I've switched the first column in your select to the column from v. column2 is removed because it is always null
This solution might work in Oracle, where dual is a single-row, single-column table. You need some table against which you can select the desired values virtually!
WARNING: as I don't have access to a DB, I have never tested the query below.
SELECT tab_all.col_search, t.column1, t.column2
FROM
(
    Select 'value1' AS col_search from dual
    union all
    Select 'value2' from dual
    union all
    Select 'value3' from dual
) tab_all left join table1 t
    on col_search = t.column1
WHERE t.column1 is null;
I believe the SQL Server equivalent of Oracle's SELECT 'value1' FROM dual is simply SELECT 'value1' (no FROM clause is needed).
So try
SELECT tab_all.col_search, t.column1, t.column2
FROM
(
    Select 'value1' AS col_search
    union all
    Select 'value2'
    union all
    Select 'value3'
) tab_all left join table1 t
    on col_search = t.column1
WHERE t.column1 is null;
As I am not a SQL Server person, it might be that RichardTheKiwi's VALUES version above is the better equivalent of Oracle's select from dual.

How to find row after query?

I have a query, and I need the row number of each row that the query returns. I do not have a counter field. How do I do it?
Thanks in advance.
SELECT
    ROW_NUMBER() OVER (ORDER BY <your_field_here>) as RowNum,
    <the_rest_of_your_fields_here>
FROM
    <my_table>
If you have a primary key, you can use this method on SQL Server 2005 and up:
SELECT
ROW_NUMBER() OVER (ORDER BY PrimaryKeyField) as RowNumber,
Field1,
Field2
FROM
YourSourceTable
If you don't have a primary key, what you may have to do is duplicate your table into a memory table (or a temp table if it is very large) using a method like this:
DECLARE @NewTable table (
    RowNumber BIGINT IDENTITY(1,1) PRIMARY KEY,
    Field1 varchar(50),
    Field2 varchar(50)
)
INSERT INTO #NewTable
(Field1, Field2)
SELECT
Field1,
Field2
FROM
YourSourceTable
SELECT
RowNumber,
Field1,
Field2
FROM
#NewTable
The nice part about this, is that you will be able to detect identical rows if your source table does not have a primary key.
Also, at this point, I would suggest adding a primary key to every table you have, if they don't already have one.
create table t1 (N1 int ,N2 int)
insert into t1
select 200,300
insert into t1
select 200,300
insert into t1
select 300,400
insert into t1
select 400,400
.....
......
select row_number() over (order by [N1]) RowNumber,* from t1
I have a query, and I need the row number of each row that the query returns. I do not have a counter field. How do I do it?
If I understand your question correctly, you have some sort of query that returns a row, or a few rows (answers), and you want a row number associated with each answer.
As you said, you do not have any counter field, so what you need to do is the following:
decide on the ordering criteria
add the counter
run the query
It would really help if you provided more details to your question(s).
If you confirm that I understand your question correctly, I will add a code example