Listing duplicated records using T-SQL

I have a database that is used to record patient information for a small clinic. We use MS SQL Server 2008 as the backend. The patient table contains the following columns:
Id int identity(1,1),
FamilyName varchar(30),
FirstName varchar (20),
DOB datetime,
AddressLine1 varchar (50),
AddressLine2 varchar (50),
State varchar (20),
Postcode varchar (4),
NextOfKin varchar (20),
Homephone varchar (20),
Mobile varchar (20)
Occasionally the staff register a new patient, unaware that the patient already has a record in the system. We have ended up with several thousand duplicated records.
What I would like to do is present a list of patients who have duplicated records for the staff to merge during quiet time. We consider two records to be duplicates if they have exactly the same FamilyName, FirstName and DOB. What I am doing at the moment is using a subquery to return the records as follows:
SELECT FamilyName,
FirstName,
DOB,
AddressLine1,
AddressLine2,
State,
Postcode,
NextOfKin,
HomePhone,
Mobile
FROM
Patients AS p1
WHERE Id IN
(
SELECT MAX(Id)
FROM Patients AS p2
GROUP BY
FamilyName,
FirstName,
DOB
HAVING COUNT(Id) > 1
)
This produces the result, but the performance is terrible. Is there a better way to do it? The only requirement is that I need to show all the fields in the Patients table, as the users of the system want to view all the details before deciding whether to merge the records or not.

This will output every row that has a duplicate, based on FamilyName, FirstName and DOB:
SELECT DISTINCT p1.*
FROM Patients AS p1
INNER JOIN Patients AS p2
ON p1.FamilyName = p2.FamilyName
AND p1.FirstName = p2.FirstName
AND p1.DOB = p2.DOB
AND p1.Id <> p2.Id

I suggest you build an index on the 3 fields you use to detect duplicates,
then try this query:
with Duplicates as
(
select FamilyName, FirstName, DOB
from Patients
group by FamilyName, FirstName, DOB
having count(*) > 1
)
Select Patients.*
from Patients
inner join Duplicates
on Patients.FamilyName = Duplicates.FamilyName
And Patients.FirstName= Duplicates.FirstName
and Patients.DOB= Duplicates.DOB
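The index mentioned at the top of this answer could look roughly like the sketch below. The index name IX_Patients_Name_DOB is made up for illustration, and the statement assumes the Patients table from the question already exists; treat it as a starting point rather than a tuned design.

```sql
-- Hypothetical index name. The key columns are the three columns
-- used to decide whether two patient records are duplicates.
CREATE NONCLUSTERED INDEX IX_Patients_Name_DOB
    ON Patients (FamilyName, FirstName, DOB);
```

With an index like this, the GROUP BY inside the Duplicates CTE should be able to read the narrow index instead of scanning the full table, which is usually where the time goes.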

WITH CTE
AS
(
SELECT Id, FamilyName, FirstName, DOB,
ROW_NUMBER() OVER(PARTITION BY FamilyName, FirstName, DOB ORDER BY Id) AS DuplicateCount
FROM Patients
)
SELECT * FROM CTE WHERE DuplicateCount > 1
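Note that filtering on a row number greater than 1 lists only the second and later copies in each group, not the first record. If the staff need to see every record of a duplicate group side by side, one possible variation is COUNT(*) OVER. The sketch below runs against a small table variable with made-up sample rows and only a subset of the question's columns:

```sql
-- Made-up sample data standing in for the Patients table.
DECLARE @Patients TABLE
(
    Id int,
    FamilyName varchar(30),
    FirstName varchar(20),
    DOB datetime
);

INSERT INTO @Patients (Id, FamilyName, FirstName, DOB) VALUES
    (1, 'Smith', 'Anna', '19800101'),
    (2, 'Smith', 'Anna', '19800101'),
    (3, 'Jones', 'Bob',  '19750615');

-- GroupSize is the number of rows sharing the same FamilyName, FirstName and DOB.
WITH Counted AS
(
    SELECT p.*,
           COUNT(*) OVER (PARTITION BY FamilyName, FirstName, DOB) AS GroupSize
    FROM @Patients AS p
)
SELECT *
FROM Counted
WHERE GroupSize > 1          -- keeps both Smith rows, drops the single Jones row
ORDER BY FamilyName, FirstName, DOB, Id;
```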

If I were in your shoes, I'd do the following:
add indexes on FamilyName, FirstName and DOB
create a view for your subquery
modify the query as follows:
Select p.* FROM Patients p INNER JOIN view_name v ON v.FirstName=p.Firstname AND ...
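As a rough sketch of what the view and the join could look like: the view name vw_DuplicatePatientKeys and the dbo schema are assumptions made for this example, and the Patients table from the question is assumed to exist.

```sql
-- Hypothetical view name; returns one row per duplicated name/DOB combination.
CREATE VIEW dbo.vw_DuplicatePatientKeys
AS
SELECT FamilyName, FirstName, DOB, COUNT(*) AS NumberOfDuplicates
FROM dbo.Patients
GROUP BY FamilyName, FirstName, DOB
HAVING COUNT(*) > 1;
GO

-- Join back to the base table to show every column of every duplicated record.
SELECT p.*, v.NumberOfDuplicates
FROM dbo.Patients AS p
INNER JOIN dbo.vw_DuplicatePatientKeys AS v
    ON  v.FamilyName = p.FamilyName
    AND v.FirstName  = p.FirstName
    AND v.DOB        = p.DOB
ORDER BY p.FamilyName, p.FirstName, p.DOB, p.Id;
```

GO is the batch separator used by tools such as SSMS and sqlcmd; CREATE VIEW has to be the first statement in its batch. Rows with NULL in any of the three columns will not match in the join.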

select FamilyName, FirstName, DOB
from Patients
group by FamilyName, FirstName, DOB
having count(*)>1
This will show all duplicated combinations of FamilyName, FirstName and DOB.
However, please consider names that are written differently but are similar. You might want to look up the topics 'data deduplication' and/or 'record linkage'. I solved the problem using string similarity algorithms (modified Jaro/Winkler and Levenshtein).
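SQL Server has no built-in Jaro/Winkler or Levenshtein function, so the sketch below swaps in the built-in SOUNDEX-based DIFFERENCE function purely to illustrate the idea of fuzzy matching. It is a much cruder measure than the algorithms named above, and the sample rows are made up.

```sql
-- Made-up sample data: rows 1 and 2 are probably the same person spelled differently.
DECLARE @Patients TABLE
(
    Id int,
    FamilyName varchar(30),
    FirstName varchar(20),
    DOB datetime
);

INSERT INTO @Patients (Id, FamilyName, FirstName, DOB) VALUES
    (1, 'Smith',  'Jon',  '19800101'),
    (2, 'Smyth',  'John', '19800101'),
    (3, 'Nguyen', 'Anh',  '19900520');

-- DIFFERENCE compares the SOUNDEX codes of two strings and returns 0 to 4,
-- where 4 means the codes are identical. a.Id < b.Id reports each pair once.
SELECT a.Id AS Id1, a.FamilyName AS FamilyName1, a.FirstName AS FirstName1,
       b.Id AS Id2, b.FamilyName AS FamilyName2, b.FirstName AS FirstName2,
       a.DOB
FROM @Patients AS a
INNER JOIN @Patients AS b
    ON  a.DOB = b.DOB
    AND a.Id < b.Id
    AND DIFFERENCE(a.FamilyName, b.FamilyName) = 4
    AND DIFFERENCE(a.FirstName,  b.FirstName)  = 4;
```

With this data the Smith/Smyth pair should be reported as a candidate. Expect false positives with SOUNDEX, so the output is a list for a human to review, not a list to merge automatically.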

Related

SQL Query to Obtain the Oldest People

I am trying to find the oldest customers in my database. I want just their full names and their ages, but my current results are outputting all customers and their ages (not just the oldest). What am I doing wrong here?
SELECT
LTRIM(CONCAT(' ' + Prefix, ' ' + FirstName,
' ' + MiddleName, ' ' + LastName, ', ' + Suffix)),
MAX(DATEDIFF(year, BirthDate, GETDATE()))
FROM
Customers
WHERE
BirthDate is not null
GROUP BY
Prefix, FirstName, MiddleName, LastName, Suffix
ORDER BY
MAX(DATEDIFF(year, BirthDate, GETDATE())) desc
Note that there seem to be multiple customers with the same oldest age.

You have not defined what you mean by "oldest customers", so here are a few options you could try.
To see a list of customers with the oldest at the top, use a simple query like this:
SELECT FirstName, LastName, Suffix, BirthDate
FROM Customers
WHERE BirthDate is not null
ORDER BY BirthDate ASC
To restrict the result to a number of rows, for example the 10 oldest, use TOP 10:
SELECT top 10
FirstName, LastName, Suffix, BirthDate
FROM Customers
WHERE BirthDate is not null
ORDER BY BirthDate ASC
To restrict the result to all customers born before a certain date, add a condition to the WHERE clause:
SELECT FirstName, LastName, Suffix, BirthDate
FROM Customers
WHERE BirthDate is not null
and BirthDate < '19920101'
ORDER BY BirthDate ASC

The first thing you need to do before you do anything else is define a unique numeric primary key on the Customers table.
ALTER TABLE Customers ADD Cust_Id int IDENTITY(1,1);
ALTER TABLE Customers ADD CONSTRAINT PK_Customers PRIMARY KEY (Cust_Id);
After you've done that, the following code will give you the "oldest customer (or customers) in your database".
With qry1 As (
SELECT Cust_Id,
DATEDIFF(year, BirthDate, GETDATE()) As Age
FROM Customers
WHERE BirthDate is not null
),
qry2 As (
SELECT Max(Age) As Max_Age
FROM qry1
)
SELECT Customers.Cust_Id,
Customers.Prefix,
Customers.FirstName,
Customers.MiddleName,
Customers.LastName,
Customers.Suffix,
Qry1.Age
FROM Customers
Inner Join Qry1 On Customers.Cust_Id = Qry1.Cust_Id
Inner Join Qry2 On Qry1.Age = Qry2.Max_Age
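If adding a key column is not an option, a shorter alternative that also returns every customer tied for the highest age is TOP (1) WITH TIES. This is only a sketch against a table variable with made-up rows, not the original Customers table:

```sql
-- Made-up sample data; the last row checks that NULL birth dates are skipped.
DECLARE @Customers TABLE
(
    FirstName varchar(20),
    LastName varchar(30),
    BirthDate date
);

INSERT INTO @Customers (FirstName, LastName, BirthDate) VALUES
    ('Ann', 'Older',   '19300105'),
    ('Ben', 'Elder',   '19301120'),
    ('Cy',  'Younger', '19800301'),
    ('Di',  'Unknown', NULL);

-- WITH TIES keeps every row whose ORDER BY value equals that of the first row,
-- so Ann and Ben (both born in 1930) are returned together.
SELECT TOP (1) WITH TIES
       FirstName, LastName, BirthDate,
       DATEDIFF(year, BirthDate, GETDATE()) AS Age
FROM @Customers
WHERE BirthDate IS NOT NULL
ORDER BY DATEDIFF(year, BirthDate, GETDATE()) DESC;
```

Keep in mind that DATEDIFF(year, ...) counts calendar year boundaries crossed, so it can be one higher than the age in completed years for people whose birthday has not yet occurred this year.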

SQL Query to select duplicate registrations

Hi, and sorry in advance for what seems such a simple question...
I have a table with some millions of rows of laboratory data and the following fields (amongst others):
Laboratory Reference Number
Forename
Surname
DOB
I need a query that will give me all of the distinct Laboratory Reference Numbers, forenames, surnames and DOBs where the Laboratory Reference Number has more than one associated forename, surname and DOB,
i.e. a query to highlight where a Laboratory Reference Number has duplicate candidates associated with it, e.g.
12345, Bob, Smith, 30/03/1981
12345, Fred, Smith, 31/03/1981
Any help would be much appreciated.

SELECT * FROM TABLE WHERE REF IN
(SELECT REF FROM TABLE GROUP BY REF HAVING COUNT(*) > 1)
You could also use SELECT DISTINCT * if necessary.

select RefNr
, Forename
, Surname
, DOB
from YourTable yt1
where exists
(
select *
from YourTable yt2
where yt1.RefNr = yt2.RefNr
and
(
yt1.Forename <> yt2.Forename
or yt1.Surname <> yt2.Surname
or yt1.DOB <> yt2.DOB
)
)

Deleting duplicates in a table based on a criteria only in SQL

Let's say I have a table with columns:
CustomerNumber
Lastname
Firstname
PurchaseDate
...and other columns that do not change anything in the question if they're not shown here.
In this table I could have many rows for the same customer with different purchase dates (I know, poorly designed... I'm only trying to fix an issue for reporting, not really trying to fix the root of the problem).
How, in SQL, can I keep one record per customer with the latest date, and delete the rest? A GROUP BY doesn't seem to work for my case.

;with a as
(
select row_number() over (partition by CustomerNumber, Lastname, Firstname order by PurchaseDate desc) rn
from <table>
)
delete from a where rn > 1

This worked for me (on DB2):
DELETE FROM my_table
WHERE (CustomerNumber, Lastname, Firstname, PurchaseDate)
NOT IN (
SELECT CustomerNumber, Lastname, Firstname, MAX(PurchaseDate)
FROM my_table
GROUP BY CustomerNumber, Lastname, FirstName
)

SELECT CustomerNumber, Lastname, Firstname, MAX(PurchaseDate) LatestPurchaseDate
FROM Table
GROUP BY CustomerNumber, Lastname, Firstname
The MAX will select the highest (latest) date and show that date for each unique combination of the GROUP BY columns.
EDIT: I had missed that you wanted to delete all records except the one with the latest purchase date.
WITH Keep AS
(
SELECT CustomerNumber, Lastname, Firstname, MAX(PurchaseDate) LatestPurchaseDate
FROM Table
GROUP BY CustomerNumber, Lastname, Firstname
)
DELETE FROM Table
WHERE NOT EXISTS
(
SELECT *
FROM Keep
WHERE Table.CustomerNumber = Keep.CustomerNumber
AND Table.Lastname = Keep.Lastname
AND Table.Firstname = Keep.Firstname
AND Table.PurchaseDate = Keep.LatestPurchaseDate
)

SELECT using GROUP BY and HAVING not returning records

I'm trying to select all records that have a duplicate value in the LASTNAME column. This is my code so far
If EXISTS( SELECT name FROM sysobjects WHERE name = 'USER_DUPLICATES' AND type = 'U' )
DROP TABLE USER_DUPLICATES
GO
CREATE TABLE USER_DUPLICATES
(
FIRSTNAME VARCHAR(MAX),
LASTNAME VARCHAR(MAX),
PHONE VARCHAR(MAX),
EMAIL VARCHAR(MAX),
TITLE VARCHAR(MAX),
LMU VARCHAR(MAX)
)
GO
INSERT INTO USER_DUPLICATES
(
FIRSTNAME,
LASTNAME,
PHONE,
EMAIL,
TITLE,
LMU
)
SELECT
FIRSTNAME,
LASTNAME,
PHONE,
EMAIL,
TITLE,
LMU
FROM TM_USER
GROUP BY
FIRSTNAME,
LASTNAME,
PHONE,
EMAIL,
TITLE,
LMU
HAVING COUNT(LASTNAME) > 1
It does not return any records. I changed the
HAVING COUNT(LASTNAME) > 1
to
HAVING COUNT(LASTNAME) > 0
and it returns all the records. I am also certain there are records with the same LASTNAME value. It is written using T-SQL on SQL Server.

Try this:
SELECT
a.FIRSTNAME,
a.LASTNAME,
a.PHONE,
a.EMAIL,
a.TITLE,
a.LMU
FROM TM_USER a
INNER JOIN
(
SELECT LASTNAME
FROM TM_USER
GROUP BY LASTNAME
HAVING COUNT(1) > 1
) b ON a.LASTNAME = b.LASTNAME

Your GROUP BY clause groups by all the columns in the list. Those columns probably define a distinct record, so each group has a count of 1.
You will need to do something like:
Select LASTNAME from TM_USER GROUP BY LASTNAME HAVING COUNT(LASTNAME) > 1

COUNT is computed per group, and a group is defined by all the grouping expressions together, not by LASTNAME alone.
To get the last names that occur more than once, use
SELECT LASTNAME FROM TM_USER GROUP BY LASTNAME HAVING COUNT(LASTNAME) > 1
If you group by several columns, you get the count for each unique combination of them, even if COUNT is computed over a single column.
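A small made-up example may make the difference easier to see. The table variable below stands in for TM_USER and keeps only three of its columns:

```sql
-- Made-up sample data: two users share the last name 'Lee'.
DECLARE @TM_USER TABLE
(
    FIRSTNAME varchar(50),
    LASTNAME varchar(50),
    EMAIL varchar(100)
);

INSERT INTO @TM_USER (FIRSTNAME, LASTNAME, EMAIL) VALUES
    ('Ann', 'Lee',  'ann@example.com'),
    ('Bob', 'Lee',  'bob@example.com'),
    ('Cat', 'Wong', 'cat@example.com');

-- Grouping by every column: each group holds exactly one row,
-- so no group passes the HAVING filter and nothing is returned.
SELECT FIRSTNAME, LASTNAME, EMAIL, COUNT(LASTNAME) AS Cnt
FROM @TM_USER
GROUP BY FIRSTNAME, LASTNAME, EMAIL
HAVING COUNT(LASTNAME) > 1;

-- Grouping by LASTNAME only: 'Lee' forms a group of two rows and is returned.
SELECT LASTNAME, COUNT(*) AS Cnt
FROM @TM_USER
GROUP BY LASTNAME
HAVING COUNT(*) > 1;
```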

MySQL, return only rows where there are duplicates among two columns

I have a table in MySQL of contact information: first name, last name, address, etc.
I would like to run a query on this table that will return only rows with first and last name combinations which appear in the table more than once.
I do not want to group the "duplicates" (which may only be duplicates of the first and last name, but not other information like address or birthdate). I want to return all the "duplicate" rows so I can look over the results and determine if they are dupes or not. This seemed like it would be a simple thing to do, but it has not been.
Every solution I can find either groups the dupes and gives me a count only (which is not useful for what I need to do with the results) or doesn't work at all.
Is this kind of logic even possible in a query? Should I try and do this in Python or something?

You should be able to do this with the GROUP BY approach in a sub-query.
SELECT t.first_name, t.last_name, t.address
FROM your_table t
JOIN ( SELECT first_name, last_name
FROM your_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
) t2
ON ( t.first_name = t2.first_name AND t.last_name = t2.last_name )
The sub-query returns all names (first_name and last_name) that exist more than once, and the JOIN returns all records that match these names.

You could do it with a GROUP BY / HAVING and a sub-select. Something like:
SELECT t.*
FROM Table t INNER JOIN
(
SELECT FirstName, LastName
FROM Table
GROUP BY FirstName, LastName
HAVING COUNT(*) > 1
) Dups ON t.FirstName = Dups.FirstName
AND t.LastName = Dups.LastName

select * from people
join (select firstName, lastName
from people
group by firstName, lastName
having count(*) > 1
) dupe
using (firstName, lastName)