Finding duplicate rows in SQL Server - sql

I have a SQL Server database of organizations, and there are many duplicate rows. I want to run a select statement to grab all of these and the amount of dupes, but also return the ids that are associated with each organization.
A statement like:
SELECT orgName, COUNT(*) AS dupes
FROM organizations
GROUP BY orgName
HAVING (COUNT(*) > 1)
Will return something like
orgName | dupes
ABC Corp | 7
Foo Federation | 5
Widget Company | 2
But I'd also like to grab the IDs of them. Is there any way to do this? Maybe like a
orgName | dupeCount | id
ABC Corp | 1 | 34
ABC Corp | 2 | 5
...
Widget Company | 1 | 10
Widget Company | 2 | 2
The reason is that a separate table of users links to these organizations, and I would like to unify them (that is, remove the dupes so the users link to a single organization instead of the dupe orgs). I would like to do part of this manually so I don't screw anything up, but I still need a statement returning the IDs of all the dupe orgs so I can go through the list of users.

select o.orgName, oc.dupeCount, o.id
from organizations o
inner join (
SELECT orgName, COUNT(*) AS dupeCount
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1
) oc on o.orgName = oc.orgName

You can run the following query to find, for each duplicated name, the row with the highest id, and then delete those rows.
SELECT orgName, COUNT(*) AS dupes, MAX(ID) AS maxId
FROM organizations
GROUP BY orgName
HAVING (COUNT(*) > 1)
Each run returns only one id per name, so you'll have to repeat the query and the delete until it returns no rows.

You can do it like this:
SELECT
o.id, o.orgName, d.intCount
FROM (
SELECT orgName, COUNT(*) as intCount
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1
) AS d
INNER JOIN organizations o ON o.orgName = d.orgName
If you want to return just the records that can be deleted (leaving one of each), you can use:
SELECT
id, orgName
FROM (
SELECT
orgName, id,
ROW_NUMBER() OVER (PARTITION BY orgName ORDER BY id) AS intRow
FROM organizations
) AS d
WHERE intRow != 1
Edit: SQL Server 2000 doesn't have the ROW_NUMBER() function. Instead, you can use:
SELECT
o.id, o.orgName, d.intCount
FROM (
SELECT orgName, COUNT(*) as intCount, MIN(id) AS minId
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1
) AS d
INNER JOIN organizations o ON o.orgName = d.orgName
WHERE d.minId != o.id
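To try the two join-based queries above without touching a real database, here is a minimal sketch that runs them through Python's built-in sqlite3 module. SQLite is only a stand-in for SQL Server here, and the sample rows (including "Solo Ltd") are invented for illustration; the SQL itself is the same GROUP BY / HAVING join shown in the answer.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server; rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organizations (id INTEGER PRIMARY KEY, orgName TEXT)")
conn.executemany(
    "INSERT INTO organizations (id, orgName) VALUES (?, ?)",
    [(2, "Widget Company"), (5, "ABC Corp"), (7, "Solo Ltd"),
     (10, "Widget Company"), (34, "ABC Corp"), (40, "ABC Corp")],
)

# Every row whose name occurs more than once, with the size of its group.
all_dupes = conn.execute("""
    SELECT o.id, o.orgName, d.intCount
    FROM (
        SELECT orgName, COUNT(*) AS intCount
        FROM organizations
        GROUP BY orgName
        HAVING COUNT(*) > 1
    ) AS d
    INNER JOIN organizations o ON o.orgName = d.orgName
    ORDER BY o.orgName, o.id
""").fetchall()

# Same join, but skipping the lowest id per name: the rows that could be merged away.
removable = conn.execute("""
    SELECT o.id, o.orgName
    FROM (
        SELECT orgName, MIN(id) AS minId
        FROM organizations
        GROUP BY orgName
        HAVING COUNT(*) > 1
    ) AS d
    INNER JOIN organizations o ON o.orgName = d.orgName
    WHERE d.minId != o.id
    ORDER BY o.orgName, o.id
""").fetchall()

print(all_dupes)
print(removable)
```

With this sample data the first query lists all five duplicated rows and the second lists only ids 34, 40 and 10, so one row per name is left untouched.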

You can try this:
WITH CTE AS
(
SELECT *,RN=ROW_NUMBER() OVER (PARTITION BY orgName ORDER BY id) FROM organizations
)
select * from CTE where RN>1
go

The solution marked as correct didn't work for me, but I found this answer that worked just great: Get list of duplicate rows in MySql
SELECT n1.*
FROM myTable n1
INNER JOIN myTable n2
ON n2.repeatedCol = n1.repeatedCol
WHERE n1.id <> n2.id

If you want to delete duplicates:
WITH CTE AS(
SELECT orgName,id,
RN = ROW_NUMBER()OVER(PARTITION BY orgName ORDER BY Id)
FROM organizations
)
DELETE FROM CTE WHERE RN > 1

select * from [Employees]
To find duplicate records:
1) Using a CTE
with mycte
as
(
select Name,EmailId,ROW_NUMBER() over(partition by Name,EmailId order by id) as Duplicate from [Employees]
)
select * from mycte where Duplicate > 1
2) Using GROUP BY
select Name,EmailId,COUNT(*) as Duplicate from [Employees] group by Name,EmailId having COUNT(*) > 1

Select * from (Select orgName,id,
ROW_NUMBER() OVER(Partition By OrgName ORDER by id DESC) Rownum
From organizations )tbl Where Rownum>1
Rows with Rownum > 1 are the duplicate records in your table. PARTITION BY first groups the rows by orgName, and ROW_NUMBER() then numbers the rows within each group, here from the highest id down.
Every row numbered above 1 is therefore a duplicate that could be deleted, leaving the row with the highest id in each group.
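To see the numbering that PARTITION BY and ROW_NUMBER() produce, here is a small sketch using Python's sqlite3 module as a stand-in for SQL Server (it assumes the bundled SQLite is version 3.25 or later, which is when window functions were added). The sample rows are invented; only the query shape comes from the answer above.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server; rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organizations (id INTEGER PRIMARY KEY, orgName TEXT)")
conn.executemany(
    "INSERT INTO organizations (id, orgName) VALUES (?, ?)",
    [(2, "Widget Company"), (5, "ABC Corp"), (10, "Widget Company"),
     (34, "ABC Corp"), (40, "ABC Corp"), (41, "Solo Ltd")],
)

# Number the rows inside each orgName group, highest id first.
numbered_sql = """
    SELECT orgName, id,
           ROW_NUMBER() OVER (PARTITION BY orgName ORDER BY id DESC) AS Rownum
    FROM organizations
"""
numbered = conn.execute(numbered_sql + " ORDER BY orgName, Rownum").fetchall()

# Keep only rows numbered above 1: the duplicates that could be removed.
duplicates = conn.execute(
    "SELECT orgName, id FROM (" + numbered_sql + ") tbl "
    "WHERE Rownum > 1 ORDER BY orgName, id DESC"
).fetchall()

for row in numbered:
    print(row)
print(duplicates)
```

In this sample, ids 40, 41 and 10 get Rownum 1 and are kept, while ids 34, 5 and 2 come back as duplicates.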

select column_name, count(column_name)
from table_name
group by column_name
having count (column_name) > 1;
Src : https://stackoverflow.com/a/59242/1465252

select a.orgName,b.duplicate, a.id
from organizations a
inner join (
SELECT orgName, COUNT(*) AS duplicate
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1
) b on a.orgName = b.orgName
group by a.orgName, b.duplicate, a.id

select orgname, count(*) over (partition by orgname) as dupes, id
from organizations
where orgname in (
select orgname
from organizations
group by orgname
having (count(*) > 1)
)

There are several ways to select duplicate rows.
For my solutions, first consider this table as an example:
CREATE TABLE #Employee
(
ID INT,
FIRST_NAME NVARCHAR(100),
LAST_NAME NVARCHAR(300)
)
INSERT INTO #Employee VALUES ( 1, 'Ardalan', 'Shahgholi' );
INSERT INTO #Employee VALUES ( 2, 'name1', 'lname1' );
INSERT INTO #Employee VALUES ( 3, 'name2', 'lname2' );
INSERT INTO #Employee VALUES ( 2, 'name1', 'lname1' );
INSERT INTO #Employee VALUES ( 3, 'name2', 'lname2' );
INSERT INTO #Employee VALUES ( 4, 'name3', 'lname3' );
First solution :
SELECT DISTINCT *
FROM #Employee;
WITH #DeleteEmployee AS (
SELECT ROW_NUMBER()
OVER(PARTITION BY ID, First_Name, Last_Name ORDER BY ID) AS
RNUM
FROM #Employee
)
SELECT *
FROM #DeleteEmployee
WHERE RNUM > 1
SELECT DISTINCT *
FROM #Employee
Second solution: use an identity field
SELECT DISTINCT *
FROM #Employee;
ALTER TABLE #Employee ADD UNIQ_ID INT IDENTITY(1, 1)
SELECT *
FROM #Employee
WHERE UNIQ_ID < (
SELECT MAX(UNIQ_ID)
FROM #Employee a2
WHERE #Employee.ID = a2.ID
AND #Employee.FIRST_NAME = a2.FIRST_NAME
AND #Employee.LAST_NAME = a2.LAST_NAME
)
ALTER TABLE #Employee DROP COLUMN UNIQ_ID
SELECT DISTINCT *
FROM #Employee
At the end of either solution, run this command:
DROP TABLE #Employee

I think I know what you need.
I had to combine the other answers, and I think this gives the result the asker wanted:
select o.id,o.orgName, oc.dupeCount, oc.id,oc.orgName
from organizations o
inner join (
SELECT MAX(id) as id, orgName, COUNT(*) AS dupeCount
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1
) oc on o.orgName = oc.orgName
Taking the max id gives you, on each row, the id of the row being listed next to the id of one of its duplicates, which is what was asked for.
Ideally the output would be one row with the id, org name and duplicate count of the original, and another row with the id, org name and duplicate count of the duplicate.
The only downside is that you get it in this form instead:
id, name, dupeCount, dupe id, name
Hope it still helps.

Suppose we have a table 'Student' with 2 columns:
student_id int
student_name varchar
Records:
+------------+---------------------+
| student_id | student_name |
+------------+---------------------+
| 101 | usman |
| 101 | usman |
| 101 | usman |
| 102 | usmanyaqoob |
| 103 | muhammadusmanyaqoob |
| 103 | muhammadusmanyaqoob |
+------------+---------------------+
Now we want to see the duplicate records.
Use this query:
select student_name, student_id, count(*) c from student group by student_id, student_name having count(*) > 1;
+---------------------+------------+---+
| student_name | student_id | c |
+---------------------+------------+---+
| usman | 101 | 3 |
| muhammadusmanyaqoob | 103 | 2 |
+---------------------+------------+---+

I got a better option to get the duplicate records in a table
SELECT x.studid, y.stdname, y.dupecount
FROM student AS x INNER JOIN
(SELECT a.stdname, COUNT(*) AS dupecount
FROM student AS a INNER JOIN
studmisc AS b ON a.studid = b.studid
WHERE (a.studid LIKE '2018%') AND (b.studstatus = 4)
GROUP BY a.stdname
HAVING (COUNT(*) > 1)) AS y ON x.stdname = y.stdname INNER JOIN
studmisc AS z ON x.studid = z.studid
WHERE (x.studid LIKE '2018%') AND (z.studstatus = 4)
ORDER BY x.stdname
The result of the above query shows all the duplicate names with their unique student ids and the number of duplicate occurrences.

/*To get duplicate data in table */
SELECT COUNT(EmpCode),EmpCode FROM tbl_Employees WHERE Status=1
GROUP BY EmpCode HAVING COUNT(EmpCode) > 1

I use two methods to find duplicate rows.
The 1st method is the best-known one, using GROUP BY and HAVING.
The 2nd method uses a CTE (Common Table Expression).
As mentioned by @RedFilter, that way is also right. I often find the CTE method useful as well.
WITH TempOrg (orgName,RepeatCount)
AS
(
SELECT orgName,ROW_NUMBER() OVER(PARTITION by orgName ORDER BY orgName)
AS RepeatCount
FROM dbo.organizations
)
select t.*,e.id from organizations e
inner join TempOrg t on t.orgName= e.orgName
where t.RepeatCount>1
In the example above, ROW_NUMBER with PARTITION BY numbers the repeat occurrences of each orgName. The WHERE clause then keeps only the rows whose repeat count is greater than 1. The numbered rows are collected in the CTE and joined with the organizations table.
Source : CodoBee

Try
SELECT orgName, id, count(*) as dupes
FROM organizations
GROUP BY orgName, id
HAVING count(*) > 1;

Related

function that allows grouping of rows

I'm using SQL Server Management Studio 2012. I have a similar looking output from a query shown below. I want to eliminate someone from the query who has 2 contracts.
Select
Row_Number() over (partition by ID ORDER BY ContractDescription DESC) as [Row_Number],
ID,
Name,
ContractDescription,
Role
From table
Output
Row_Number ID Name Contract Description Role
1 1234 Mike FullTime Admin
2 1234 Mike Temp Manager
1 5678 Dave FullTime Admin
1 9785 Liz FullTime Admin
What I would like to see
Row_Number ID Name Contract Description Role
1 5678 Dave FullTime Admin
1 9785 Liz FullTime Admin
Is there a function rather than Row_Number that allows you to group rows together so I can then use something like 'where Row_Number not like 1 and 2'?
You can use HAVING as
SELECT ID,
MAX(Name) Name,
MAX(ContractDescription) ContractDescription,
MAX(Role) Role
FROM t
GROUP BY ID
HAVING COUNT(*) = 1;
Try this:
select * from (
Select
Count(*) over (partition by ID ) as [Row_Number],
Name,
ContractDescription,
Role
From table
)t where [Row_Number] = 1
You can check this option-
SELECT *
FROM table
WHERE ID IN
(
SELECT ID
FROM table
GROUP BY ID
HAVING COUNT(*) = 1
)
You can use a CTE to get all the ids of people who got only one contract and then just join the result of the CTE with your table.
;with cte as (
select
id
,COUNT(id) as no
from #tbl
group by id
having COUNT(id) = 1
)
select
t.id
,t.name
,t.ContractDescription
,t.role
from #tbl t
inner join cte
on t.id = cte.id
Basically, you need the records of people who have exactly one contract.
Just extend your script (my script is not tested):
;with CTE as
(
Select
Row_Number() over (partition by ID ORDER BY ContractDescription DESC) as [Row_Number],
ID,
Name,
ContractDescription,
Role
From table
)
select * from CTE c where [Row_Number]=1
and not exists(select 1 from CTE c1 where c.id=c1.id and c1.[Row_Number]>1 )
Is there a function rather than Row_Number that allows you to group
rows together so I can then use something like 'where Row_Number not
like 1 and 2'?
You can use a windowed COUNT(). The key is the OVER() clause.
;WITH WindowedCount AS
(
SELECT
T.*,
WindowCount = COUNT(1) OVER (PARTITION BY T.ID)
FROM
YourTable AS T
)
DELETE W FROM
WindowedCount AS W
WHERE
W.WindowCount > 1
The COUNT() counts the number of rows for each distinct ID, so if the same ID appears in 2 or more rows, those rows will be deleted.
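If you would rather look at the result before deleting anything, the same windowed COUNT() can drive a plain SELECT. The sketch below is a non-destructive variant, run through Python's sqlite3 module as a stand-in for SQL Server (SQLite 3.25 or later is assumed for window functions). The table name contracts is hypothetical; the rows are the four from the question.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server.
# "contracts" is a hypothetical table name; the rows mirror the question's output.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (ID INTEGER, Name TEXT, ContractDescription TEXT, Role TEXT)"
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?, ?)",
    [(1234, "Mike", "FullTime", "Admin"),
     (1234, "Mike", "Temp", "Manager"),
     (5678, "Dave", "FullTime", "Admin"),
     (9785, "Liz", "FullTime", "Admin")],
)

# COUNT(*) OVER (PARTITION BY ID) attaches the per-ID row count to every row,
# so the outer query can keep only people with exactly one contract.
single_contract = conn.execute("""
    SELECT ID, Name, ContractDescription, Role
    FROM (
        SELECT ID, Name, ContractDescription, Role,
               COUNT(*) OVER (PARTITION BY ID) AS WindowCount
        FROM contracts
    ) AS W
    WHERE W.WindowCount = 1
    ORDER BY ID
""").fetchall()

print(single_contract)
```

Filtering on WindowCount = 1 keeps Dave and Liz and drops both of Mike's rows, which matches the output the question asks for.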

Getting Person having interests in more number of subjects - Sql Server

Sorry for the title, as it is not very descriptive. I have 2 tables: one stores person data, and the other stores subject data along with the person interested in each subject. The two tables look like this:
Person
Id Name
1 Imad
2 Sumeet
3 Suresh
4 Navin
Subjects
Id PId Subject
1 1 DC
2 1 DS
3 3 DS
4 4 CA
PId is the person's Id.
I need to get all students who are interested in the maximum number of subjects, e.g. Imad here.
Here is my query
With c as
(
select Pid, count(Id) as 'Total' from subjects group by Pid
)
select Pid into #Temp from c where Total = (Select Max(Total) from c)
select * from Person where Id in (Select Pid from #Temp)
It gives me the desired output, but whenever this type of question is asked in an interview, I never get a good response from the interviewer, as they always expect a better solution. I am not confident in my SQL skills, so I think there must be a more efficient solution, which is why I posted it here.
Thanks
Simply order the data and take the top record with ties (this means that if several students have the same count, they will all appear in the result):
select top 1 with ties p.Id, p.Name
from Subjects s
join Person p on s.PId = p.Id
group by p.Id, p.Name
order by count(*) desc
You can try this:
;With c as
(
select Pid, count(Id) as 'Total' from Subjects group by Pid
)
select * from Person join c on c.Pid=Person.Id where c.total>1
Try this:
select * from person
where id in(
Select b.pid
from Subjects b
group by b.pid
having count(b.pid)>1
)
declare @t table (ID int,name varchar(10))
insert into @t (ID,name)values (1,'imad'),(2,'sumeet'),(3,'suresh'),(4,'navin')
declare @tt table (Id int,Pid int,Subject varchar(10))
insert into @tt (Id,Pid,subject)values (1,1,'DC'),(2,1,'DS'),(3,3,'DS'),(4,4,'CA')
select p.ID,P.name,ttt.Subject from (Select P.ID,P.name,P.Cnt from (
select t.ID,t.name,COUNT((t.ID))Cnt from @t t
INNER JOIN
@tt tt ON t.ID = tt.Pid
GROUP BY t.ID,t.name)P
GROUP BY P.Cnt,P.ID,p.name
HAVING(cnt) > 1)P
INNER JOIN @tt ttt ON Ttt.Pid = P.ID
The current solutions with TOP are specific to MS SQL Server. The following solution is based on Standard SQL's windowed aggregate functions, which most DBMSes support:
select Pid, Total
from
(
select Pid, count(Id) as Total
,rank() over (order by count(Id) desc) as rn
from subjects group by Pid
) as dt
where rn = 1
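As a quick portability check of the RANK() approach, here is a sketch that runs it on SQLite through Python's sqlite3 module (SQLite 3.25 or later is assumed). Two things differ from the answer and are my own assumptions: the counting is moved into an inner subquery so the ranking step is easier to read, and one extra row (Suresh taking 'CA') is added to the question's data so that a tie appears.

```python
import sqlite3

# In-memory SQLite database; data is the question's, plus one assumed row to create a tie.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Id INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Subjects (Id INTEGER PRIMARY KEY, PId INTEGER, Subject TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [(1, "Imad"), (2, "Sumeet"), (3, "Suresh"), (4, "Navin")],
)
conn.executemany(
    "INSERT INTO Subjects VALUES (?, ?, ?)",
    [(1, 1, "DC"), (2, 1, "DS"), (3, 3, "DS"), (4, 4, "CA"),
     (5, 3, "CA")],  # assumed extra row, not in the question
)

# Count subjects per person, rank the counts, and keep everyone ranked first.
top_people = conn.execute("""
    SELECT p.Id, p.Name, dt.Total
    FROM (
        SELECT PId, Total, RANK() OVER (ORDER BY Total DESC) AS rn
        FROM (
            SELECT PId, COUNT(Id) AS Total
            FROM Subjects
            GROUP BY PId
        ) AS counts
    ) AS dt
    JOIN Person p ON p.Id = dt.PId
    WHERE dt.rn = 1
    ORDER BY p.Id
""").fetchall()

print(top_people)
```

Because RANK() gives tied counts the same rank, both Imad and Suresh come back here, which is the same behaviour TOP 1 WITH TIES gives on SQL Server.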

Find duplicate ID's with different fields

I have a table that contains UserIDs and Departments. A UserID can belong to several departments, so only the combination of the two is unique.
However, I have been trying to write a query that finds the UserIDs that belong to only one of two departments (hr or customer).
SELECT UserId, Dept, COUNT(*) Total
FROM MyTable
GROUP BY UserID
HAVING COUNT(*) = 1
However this still brings back duplicates if a UserId has both departments I guess because the combo makes it a unique record.
What I get back is this
UserID | Department | Total
1 hr 1
2 customer 1
3 customer 1
1 customer 1
3 hr 1
But what I am trying to get back is this
UserID | Department | Total
2 customer 1
That is, any UserId belonging to both departments should be excluded; a UserId should be included only if it belongs to one or the other.
This should do the job
select t1.UserId, Dept, t2.Total
FROM MyTable t1
INNER JOIN
(
SELECT UserId, COUNT(*) Total
FROM MyTable
GROUP BY UserID
HAVING COUNT(*) = 1
) t2 on t1.UserId = t2.UserId
try
SELECT UserId, COUNT(distinct Dept) Total
FROM MyTable
GROUP BY UserID
HAVING COUNT(distinct Dept) = 1
You can try something like this
select UserId, Dept
FROM Mytable where UserId in
(
SELECT UserId
FROM MyTable
GROUP BY UserID
HAVING COUNT(*) = 1
)
If you add the Dept column to your GROUP BY, you will always retrieve all the combinations.
So you have to select all users with only one Dept first and then retrieve the additional information.
There may be syntax errors because I usually work with Oracle, but I think the concept is correct.
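To make the difference concrete, here is a small sketch using Python's sqlite3 module as a stand-in for SQL Server, loaded with the five rows from the question. It first groups by both columns (every combination comes back with a count of 1), then selects the users that have only one row and looks up their department.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server; rows are the question's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (UserId INTEGER, Dept TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [(1, "hr"), (2, "customer"), (3, "customer"), (1, "customer"), (3, "hr")],
)

# Grouping by UserId AND Dept: each combination is its own group, so nothing is filtered.
by_combination = conn.execute("""
    SELECT UserId, Dept, COUNT(*) AS Total
    FROM MyTable
    GROUP BY UserId, Dept
    ORDER BY UserId, Dept
""").fetchall()

# Grouping by UserId only, then fetching Dept for the users that have a single row.
single_dept_users = conn.execute("""
    SELECT UserId, Dept
    FROM MyTable
    WHERE UserId IN (
        SELECT UserId
        FROM MyTable
        GROUP BY UserId
        HAVING COUNT(*) = 1
    )
""").fetchall()

print(by_combination)
print(single_dept_users)
```

The first query returns all five combinations, while the second returns only user 2 in 'customer'.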
This will select users in 'hr' that do not have id's in 'customer' and then users in 'customer' that do not have ids in 'hr'.
I didn't include count on purpose as it can always only be one.
SELECT * FROM [MyTable] T WHERE Department = 'hr' AND NOT EXISTS (SELECT 1 FROM [MyTable] WHERE [MyTable].UserID =T.UserID AND Department = 'customer') UNION
SELECT * FROM [MyTable] T WHERE Department = 'customer' AND NOT EXISTS (SELECT 1 FROM [MyTable] WHERE [MyTable].UserID =T.UserID AND Department = 'hr')
I do not think the query in the question will run: Dept is not in the GROUP BY clause, and the count comparison cannot be done that way.
Coming to your issue:
Select userid, Dept
from #temp
where userid in (Select userid from (Select userid,count(*) 'C'
from #temp group by userid ) u where u.c=1)
I do not think you require count as it is always 1
The question as posted can't be right, because you can't use Dept in this query without aggregating or grouping it.
I don't understand why the author doesn't mention any compiler errors and instead shows results from this query.
But you can use this approach if you really want to get a result:
SELECT UserId, min(Dept) as Dept
FROM MyTable
GROUP BY UserID
HAVING COUNT(*) = 1
The problem is the COUNT (*) at the end.
Try this:
SELECT UserId, Dept, COUNT(*) Total
FROM MyTable
GROUP BY UserID
HAVING COUNT(UserId) = 1

Retrieve last entry for each

I have the following table for evaluating students:
StudentID | EvaluationStatusID| Date
1011010 | 1 |2013-11-07 20:31:51.000
1011020 | 1 |2013-11-08 13:23:51.000
1011010 | 2 |2013-11-08 20:31:51.000
1011020 | 3 |2013-11-09 20:31:51.000
The evaluation of a student goes through different stages: 'submitted', 'assessed', 'accepted', etc.
I need to get the LATEST record (by date) for each student in the form 'StudentID-EvaluationStatusID'.
So, for the above data I should get the following returned:
1011010-2
1011020-3
How do I get this in SQL Server 2008?
The simplest is using ranking functions like ROW_NUMBER.
WITH CTE AS
(
SELECT StudentID, EvaluationStatusID, Date,
RN = ROW_NUMBER() OVER (PARTITION BY StudentID
ORDER BY Date DESC)
FROM dbo.Student
)
SELECT StudentID, EvaluationStatusID, Date
FROM CTE WHERE RN = 1
SELECT CAST(StudentID AS varchar(20)) + '-' + CAST(EvaluationStatusID AS varchar(10))
FROM tblTable T
WHERE T.Date = (SELECT MAX(TT.Date)
FROM tblTable TT
WHERE TT.StudentID = T.StudentID
)
Try this:
With S1 as
(
select CAST(StudentID AS varchar(20)) + '-' + CAST(EvaluationStatusID AS varchar(10)) as Info,
ROW_NUMBER() OVER (PARTITION BY StudentID
ORDER BY [Date] DESC) as RC
from students
)
select Info from S1 where S1.RC = 1

SQL query to return only 1 record per group ID

I'm looking for a way to handle the following scenario. I have a database table that I need to return only one record for each "group id" that is contained within the table, furthermore the record that is selected within each group should be the oldest person in the household.
ID Group ID Name Age
1 134 John Bowers 37
2 134 Kerri Bowers 33
3 135 John Bowers 44
4 135 Shannon Bowers 42
So in the sample data provided above I would need ID 1 and 3 returned, as they are the oldest people within each group id.
This is being queried against a SQL Server 2005 database.
SELECT t.*
FROM (
SELECT DISTINCT groupid
FROM mytable
) mo
CROSS APPLY
(
SELECT TOP 1 *
FROM mytable mi
WHERE mi.groupid = mo.groupid
ORDER BY
age DESC
) t
or this:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY groupid ORDER BY age DESC) rn
FROM mytable
) x
WHERE x.rn = 1
This will return at most one record per group even in case of ties.
See this article in my blog for performance comparisons of both methods:
SQL Server: Selecting records holding group-wise maximum
Use:
SELECT DISTINCT
t.groupid,
t.name
FROM TABLE t
JOIN (SELECT t.groupid,
MAX(t.age) 'max_age'
FROM TABLE t
GROUP BY t.groupid) x ON x.groupid = t.groupid
AND x.max_age = t.age
So what if there's 2+ people with the same age for a group? It'd be better to store the birthdate rather than age - you can always calculate the age for presentation.
Try this (assuming Group is synonym for Household)
Select * From Table t
Where Age = (Select Max(Age)
From Table
Where GroupId = t.GroupId)
If there are two or more "oldest" people in some household (they are all the same age and there is no one older), then this will return all of them, not just one at random.
If this is an issue, then you need to add another subquery to return an arbitrary key value for one person in that set.
Select * From Table t
Where Id =
(Select Max(Id) From Table
Where GroupId = t.GroupId
And Age =
(Select Max(Age) From Table
Where GroupId = t.GroupId))
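To illustrate the tie problem, here is a sketch run through Python's sqlite3 module as a stand-in for SQL Server. The table name people is hypothetical, and the fifth row (Pat Bowers, also 44) is an assumption added to the question's data so that one household has two oldest members.

```python
import sqlite3

# In-memory SQLite database; "people" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (Id INTEGER PRIMARY KEY, GroupId INTEGER, Name TEXT, Age INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [(1, 134, "John Bowers", 37),
     (2, 134, "Kerri Bowers", 33),
     (3, 135, "John Bowers", 44),
     (4, 135, "Shannon Bowers", 42),
     (5, 135, "Pat Bowers", 44)],  # assumed extra row that ties with Id 3
)

# Matching on Max(Age) alone returns every person tied for oldest.
with_ties = conn.execute("""
    SELECT Id FROM people t
    WHERE Age = (SELECT MAX(Age) FROM people WHERE GroupId = t.GroupId)
    ORDER BY Id
""").fetchall()

# Adding Max(Id) among the oldest picks exactly one person per household.
one_per_group = conn.execute("""
    SELECT Id FROM people t
    WHERE Id = (
        SELECT MAX(Id) FROM people
        WHERE GroupId = t.GroupId
          AND Age = (SELECT MAX(Age) FROM people WHERE GroupId = t.GroupId)
    )
    ORDER BY Id
""").fetchall()

print(with_ties)
print(one_per_group)
```

With the tie in place, the first query returns three rows for two households, and the second returns one row each. Which of the tied people is kept is arbitrary (here, the one with the higher Id).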
SELECT GroupID, Name, Age
FROM table
INNER JOIN
(
SELECT GroupID, MAX(Age) AS OLDEST
FROM table
) AS OLDESTPEOPLE
ON
table.GroupID = OLDESTPEOPLE.GroupID
AND
table.Age = OLDESTPEOPLE.OLDEST
SELECT GroupID, Name, Age
FROM table
INNER JOIN
(
SELECT GroupID, MAX(Age) AS OLDEST
FROM table
GROUP BY GroupID
) AS OLDESTPEOPLE
ON
table.GroupID = OLDESTPEOPLE.GroupID
AND
table.Age = OLDESTPEOPLE.OLDEST