SQL Query to select duplicate registrations


Apologies in advance for what seems like such a simple question.
I have a table with several million rows of laboratory data and the following fields (amongst others):
Laboratory Reference Number
Forename
Surname
DOB
I need a query that returns every distinct combination of Laboratory Reference Number, Forename, Surname and DOB where the Laboratory Reference Number has more than one associated Forename, Surname and DOB,
i.e. a query to highlight where a Laboratory Reference Number has more than one candidate associated with it,
e.g.
12345, Bob, Smith, 30/03/1981
12345, Fred, Smith, 31/03/1981
Any help would be much appreciated.

SELECT * FROM TABLE WHERE REF IN
(SELECT REF FROM TABLE GROUP BY REF HAVING COUNT(*) > 1)
You could also use SELECT DISTINCT * if necessary

select RefNr
, Forename
, Surname
, DOB
from YourTable yt1
where exists
(
select *
from YourTable yt2
where yt1.RefNr = yt2.RefNr
and
(
yt1.Forename <> yt2.Forename
or yt1.Surname <> yt2.Surname
or yt1.DOB <> yt2.DOB
)
)
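As a quick check of the EXISTS query above, the sketch below runs it against an in-memory SQLite database from Python. The table and column names come from the answer; the sample rows (including the 99999 control rows), the ISO text dates, and the added DISTINCT and ORDER BY are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable (RefNr INTEGER, Forename TEXT, Surname TEXT, DOB TEXT)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?)",
    [
        (12345, "Bob", "Smith", "1981-03-30"),
        (12345, "Fred", "Smith", "1981-03-31"),
        (99999, "Ann", "Jones", "1975-01-01"),
        (99999, "Ann", "Jones", "1975-01-01"),  # same person twice: not a conflict
    ],
)

# Keep a row only if another row shares its RefNr but differs in some attribute.
conflicts = conn.execute(
    """
    SELECT DISTINCT RefNr, Forename, Surname, DOB
    FROM YourTable yt1
    WHERE EXISTS (
        SELECT 1
        FROM YourTable yt2
        WHERE yt1.RefNr = yt2.RefNr
          AND (yt1.Forename <> yt2.Forename
               OR yt1.Surname <> yt2.Surname
               OR yt1.DOB <> yt2.DOB)
    )
    ORDER BY Forename
    """
).fetchall()
print(conflicts)
```

On this sample data only the two 12345 rows come back. The IN ... GROUP BY ... HAVING COUNT(*) > 1 version would also return the 99999 rows, because it counts rows rather than distinct people. Note that <> never evaluates to true when either side is NULL, so rows that differ only by a missing value would not be flagged.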

Related

SQL Select column which is not used in select section of subquery which find duplicates

I am trying to find records in my database that have duplicated fields such as name, surname and type.
Example:
SELECT name, surname, type, COUNT(*)
FROM customers
GROUP BY name, surname
HAVING COUNT(*)>1
Query results:
Robb|Stark|1|2
Tyrion|Lannister|1|3
So the customer "Robb Stark" appears 2 times and "Tyrion Lannister" appears 3 times.
Now I want to know the ids of these records.
I found a similar problem described here:
Finding duplicate values in a SQL table
There is an answer there, but no example.

Use COUNT as an analytic function:
WITH cte AS (
SELECT *, COUNT(*) OVER (PARTITION BY name, surname) cnt
FROM customers
)
SELECT * -- return all columns
FROM cte
WHERE cnt > 1
ORDER BY name, surname;

The simplest way will be to use the EXISTS as follows:
SELECT t.*
FROM customers t
where exists
(select 1 from customers tt
where tt.name = t.name
and tt.surname = t.surname
and tt.id <> t.id)
Or use your original query in IN clause as follows:
select * from customers where (name, surname) in
(SELECT name, surname
FROM customers
GROUP BY name, surname
HAVING COUNT(*)>1)

If you want one row per group of duplicates, with the list of ids in a comma-separated string, you can just use string aggregation with your existing query:
SELECT name, surname, COUNT(*) as cnt,
STRING_AGG(id, ',') WITHIN GROUP (ORDER BY id) as all_ids
FROM customers
GROUP BY name, surname
HAVING COUNT(*) > 1
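The one-row-per-group idea can be tried out locally with a small sketch like the one below. It uses SQLite's GROUP_CONCAT as a stand-in for STRING_AGG (an assumption: the answer's WITHIN GROUP syntax is not available in SQLite), and the sample customers and their ids are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, surname TEXT, type INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (id, name, surname, type) VALUES (?, ?, ?, ?)",
    [
        (1, "Robb", "Stark", 1),
        (2, "Tyrion", "Lannister", 1),
        (3, "Robb", "Stark", 1),
        (4, "Jon", "Snow", 1),
        (5, "Tyrion", "Lannister", 1),
        (6, "Tyrion", "Lannister", 1),
    ],
)

# One row per duplicated (name, surname) pair, with its ids joined into a string.
rows = conn.execute(
    """
    SELECT name, surname, COUNT(*) AS cnt, GROUP_CONCAT(id, ',') AS all_ids
    FROM customers
    GROUP BY name, surname
    HAVING COUNT(*) > 1
    ORDER BY name, surname
    """
).fetchall()

# GROUP_CONCAT does not promise an order here, so sort the ids in Python.
duplicates = {
    (name, surname): sorted(int(i) for i in all_ids.split(","))
    for name, surname, cnt, all_ids in rows
}
print(duplicates)
```

Jon Snow appears once and is therefore left out; each remaining key maps to the ids of the rows that share that name and surname.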

What is the best way to select rows belonging to other IDs which have the exact same rows as an ID in question using SQL?

Suppose I have a database where I keep track of people and their hobbies, and there are two tables: People and Hobbies. Now if there exists a person named Tom from table People with the two hobbies 'fishing' and 'jogging' in table Hobbies, how can I check for other persons who have exactly these two hobbies? I want to exclude people who have, for instance, the hobbies Fishing, Jogging AND Gaming. I have tried the following:
select name
from people
where name IN(
select name_hobbyist
from hobby
where hobby IN(
select hobby
from hobby
where name_hobbyist =(
select name
from people
where name = 'Tom'
)
)
)
order by name asc
And it returns no rows.

Since you have the names of people in the table hobby you don't need the table people.
You can group by name_hobbyist and use the aggregate function string_agg() in the having clause to apply the condition:
select name_hobbyist
from hobby
where name_hobbyist <> 'Tom'
group by name_hobbyist
having string_agg(hobby, ',' order by hobby) = (
select string_agg(hobby, ',' order by hobby)
from hobby
where name_hobbyist = 'Tom'
)
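Where ordered string aggregation is not available, the same "exactly these hobbies" test can be sketched with counts instead. This is a different technique from the string_agg answer above, not a restatement of it. It assumes each (name_hobbyist, hobby) pair appears at most once; the sample people other than Tom are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hobby (name_hobbyist TEXT, hobby TEXT)")
conn.executemany(
    "INSERT INTO hobby VALUES (?, ?)",
    [
        ("Tom", "fishing"), ("Tom", "jogging"),
        ("Ann", "fishing"), ("Ann", "jogging"),                      # exact match
        ("Bob", "fishing"), ("Bob", "jogging"), ("Bob", "gaming"),   # one extra hobby
        ("Eve", "fishing"),                                          # one hobby missing
    ],
)

# A person matches when (a) they have as many hobbies as Tom and
# (b) every one of their hobbies is also one of Tom's.
matches = [
    row[0]
    for row in conn.execute(
        """
        SELECT h.name_hobbyist
        FROM hobby AS h
        LEFT JOIN hobby AS t
               ON t.name_hobbyist = 'Tom' AND t.hobby = h.hobby
        WHERE h.name_hobbyist <> 'Tom'
        GROUP BY h.name_hobbyist
        HAVING COUNT(*) = (SELECT COUNT(*) FROM hobby WHERE name_hobbyist = 'Tom')
           AND COUNT(t.hobby) = COUNT(*)
        ORDER BY h.name_hobbyist
        """
    )
]
print(matches)
```

COUNT(t.hobby) skips the NULLs produced by the LEFT JOIN, so it only equals COUNT(*) when none of the person's hobbies fall outside Tom's list.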

Match records together in SQL Server by multiple groupings

I have a scenario that I need to "match" records based on multiple attributes of a person. For instance, if a FirstName and LastName match, or a NickName and LastName match, those two scenarios should be grouped into one larger match. Here's the example data in SQLFiddle:
http://www.sqlfiddle.com/#!18/0ca91/7
I'm generating a match key from the record attributes. The result gives me two different match keys and three total records. I need a result that has only one match key generated, and eventually I'm going to group all three records into one golden record in a separate step. I cannot figure out a way to logically group these records together either by "group by" or by using DENSE_RANK to generate my match key. Any help would be greatly appreciated! Thanks!
CREATE TABLE Persons (
ID int,
FirstName varchar(255),
LastName varchar(255),
NickName varchar(255)
);
INSERT INTO Persons
SELECT 1 AS ID, 'NIKKI' AS FNAME, 'MADISON' AS LNAME, 'Nikki' AS NickName
UNION ALL
SELECT 2 AS ID, 'NICOLE' AS FNAME, 'MADISON' AS LNAME, 'NICOLE' AS NickName
UNION ALL
SELECT 3 AS ID, 'NICOLE' AS FNAME, 'MADISON' AS LNAME, 'Nikki' AS NickName;
SELECT
*
, DENSE_RANK() OVER (ORDER BY TRIM(LastName), TRIM(FirstName)) AS GroupKey
FROM Persons
Desired Result:
GroupKey
1
1
1
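One possible way to think about this requirement is as a connected-components problem: records that share FirstName + LastName or NickName + LastName are linked, and links chain together. The sketch below illustrates that idea with a small union-find in Python rather than T-SQL; it is not taken from the question. The case-insensitive, trimmed comparison is an assumption (the sample data mixes 'NIKKI' and 'Nikki'), and the helper names find, union and first_seen are hypothetical.

```python
persons = [
    (1, "NIKKI", "MADISON", "Nikki"),
    (2, "NICOLE", "MADISON", "NICOLE"),
    (3, "NICOLE", "MADISON", "Nikki"),
]

# Each record starts in its own group.
parent = {pid: pid for pid, *_ in persons}

def find(x):
    """Return the representative id of x's group."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
        x = parent[x]
    return x

def union(a, b):
    """Merge the groups of a and b, keeping the smaller id as representative."""
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)

# Link every record to the first record seen with the same match key.
first_seen = {}
for pid, first, last, nick in persons:
    last_key = last.strip().upper()  # assumption: case-insensitive comparison
    keys = (
        ("first", first.strip().upper(), last_key),
        ("nick", nick.strip().upper(), last_key),
    )
    for key in keys:
        if key in first_seen:
            union(pid, first_seen[key])
        else:
            first_seen[key] = pid

# Number the groups 1, 2, ... in order of their representative id.
roots = sorted({find(pid) for pid, *_ in persons})
group_key = {pid: roots.index(find(pid)) + 1 for pid, *_ in persons}
print(group_key)
```

Records 2 and 3 are linked through FirstName + LastName, and records 1 and 3 through NickName + LastName, so all three end up with the same key.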

Listing duplicated records using T SQL

I have a database that is used to record patient information for a small clinic. We use MS SQL Server 2008 as the backend. The patient table contains the following columns:
Id int identity(1,1),
FamilyName varchar(30),
FirstName varchar (20),
DOB datetime,
AddressLine1 varchar (50),
AddressLine2 varchar (50),
State varchar (20),
Postcode varchar (4),
NextOfKin varchar (20),
Homephone varchar (20),
Mobile varchar (20)
Occasionally the staff register a new patient, unaware that the patient already has a record in the system. We have ended up with several thousand duplicate records.
What I would like to do is present a list of patients who have duplicate records for the staff to merge during quiet time. We consider two records to be duplicates if they have exactly the same FamilyName, FirstName and DOB. At the moment I use a subquery to return the records, as follows:
SELECT FamilyName,
FirstName,
DOB,
AddressLine1,
AddressLine2,
State,
Postcode,
NextOfKin,
HomePhone,
Mobile
FROM
Patients AS p1
WHERE Id IN
(
SELECT Max(Id)
FROM Patients AS p2
GROUP BY
FamilyName,
FirstName,
DOB
HAVING COUNT(Id) > 1
)
This produces the result but the performance is terrible. Is there a better way to do it? The only requirement is that I need to show all the fields in the Patients table, as the users of the system want to view all the details before deciding whether to merge the records.

This will output every row which has a duplicate, based on firstname and lastname
SELECT DISTINCT t1.*
FROM Table AS t1
INNER JOIN Table AS t2
ON t1.firstname = t2.firstname
AND t1.lastname = t2.lastname
AND t1.id <> t2.id

I suggest you build an index on the 3 fields you use to detect duplicates,
then try this query:
with Duplicates as
(
select FamilyName, FirstName, DOB
from Patients
group by FamilyName, FirstName, DOB
having count(*) > 1
)
Select Patients.*
from Patients
inner join Duplicates
on Patients.FamilyName = Duplicates.FamilyName
And Patients.FirstName= Duplicates.FirstName
and Patients.DOB= Duplicates.DOB

WITH CTE
AS
(
SELECT Id, FamilyName, FirstName, DOB,
ROW_NUMBER() OVER(PARTITION BY FamilyName, FirstName, DOB ORDER BY Id) AS DuplicateCount
FROM PatientTable
)
select * from CTE where DuplicateCount > 1

If I were in your shoes, I'd do the following:
add indexes to FamilyName, FirstName and DOB
create a view for your subquery
modify the query as follows
Select p.* FROM Patients p INNER JOIN view_name v ON v.FirstName=p.Firstname AND ...

select FamilyName, FirstName, DOB
from Patients
group by FamilyName, FirstName, DOB
having count(*)>1
This will show all duplicates.
However, please consider names that are written differently but are similar. You might want to look into the topics 'data deduplication' and/or 'record linkage'. I solved the problem using a string similarity algorithm (modified Jaro/Winkler and Levenshtein).
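To give a rough idea of what string-similarity matching looks like, here is a minimal sketch of plain Levenshtein edit distance in Python. It is not the modified Jaro/Winkler approach the answer mentions, and the similar helper with its max_distance threshold of 1 is a hypothetical choice that would need tuning on real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


def similar(a: str, b: str, max_distance: int = 1) -> bool:
    """Hypothetical helper: treat names within max_distance edits as candidates."""
    return levenshtein(a.strip().lower(), b.strip().lower()) <= max_distance


print(levenshtein("kitten", "sitting"))
print(similar("Smith", "Smyth"), similar("Smith", "Jones"))
```

A check like this would typically be applied only to pairs that already agree on something cheap to compare, such as DOB, since comparing every name against every other name grows quadratically with the number of records.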

MySQL, return only rows where there are duplicates among two columns

I have a table in MySQL of contact information:
first name, last name, address, etc.
I would like to run a query on this table that will return only rows with first and last name combinations which appear in the table more than once.
I do not want to group the "duplicates" (which may only be duplicates of the first and last name, but not other information like address or birthdate) -
I want to return all the "duplicate" rows so I can look over the results and determine if they are dupes or not. This seemed like it would be a simple thing to do, but it has not been.
Every solution I can find either groups the dupes and gives me a count only (which is not useful for what I need to do with the results) or doesn't work at all.
Is this kind of logic even possible in a query? Or should I try to do this in Python or something?

You should be able to do this with the GROUP BY approach in a sub-query.
SELECT t.first_name, t.last_name, t.address
FROM your_table t
JOIN ( SELECT first_name, last_name
FROM your_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
) t2
ON ( t.first_name = t2.first_name AND t.last_name = t2.last_name )
The sub-query returns all names (first_name and last_name) that exist more than once, and the JOIN returns all records that match these names.

You could do it with a GROUP BY / HAVING and a sub-select. Something like:
SELECT t.*
FROM Table t INNER JOIN
(
SELECT FirstName, LastName
FROM Table
GROUP BY FirstName, LastName
HAVING COUNT(*) > 1
) Dups ON t.FirstName = Dups.FirstName
AND t.LastName = Dups.LastName

select * from people
join (select firstName, lastName
from people
group by firstName, lastName
having count(*) > 1
) dupe
using (firstName, lastName)
