Match records together in SQL Server by multiple groupings - sql

I have a scenario where I need to "match" records based on multiple attributes of a person. For instance, if a FirstName and LastName match, or a NickName and LastName match, those two scenarios should be grouped into one larger match. Here's the example data in SQLFiddle:
http://www.sqlfiddle.com/#!18/0ca91/7
I'm generating a match key from the record attributes. The result gives me two different match keys and three total records. I need a result that has only one match key generated, and eventually I'm going to group all three records into one golden record in a separate step. I cannot figure out a way to logically group these records together either by "group by" or by using DENSE_RANK to generate my match key. Any help would be greatly appreciated! Thanks!
CREATE TABLE Persons (
ID int,
FirstName varchar(255),
LastName varchar(255),
NickName varchar(255)
);
INSERT INTO Persons
SELECT 1 AS ID, 'NIKKI' AS FNAME, 'MADISON' AS LNAME, 'Nikki' AS NickName
UNION ALL
SELECT 2 AS ID, 'NICOLE' AS FNAME, 'MADISON' AS LNAME, 'NICOLE' AS NickName
UNION ALL
SELECT 3 AS ID, 'NICOLE' AS FNAME, 'MADISON' AS LNAME,'Nikki' AS NickName
SELECT
*
, DENSE_RANK() OVER (ORDER BY TRIM(LastName), TRIM(FirstName)) AS GroupKey
FROM Persons
Desired Result:
GroupKey
1
1
1
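One way to think about this: the two rules ("same FirstName and LastName" or "same NickName and LastName") define links between records, and the desired GroupKey is the connected component each record ends up in. A single DENSE_RANK cannot express that, because record 1 and record 2 only match transitively through record 3. Below is a minimal sketch of the idea in Python using union-find, outside the database. The function names `find` and `group_keys` are made up for illustration, and the comparison assumes case-insensitive matching (as with a typical SQL Server default collation).

```python
# Sample rows from the question: (ID, FirstName, LastName, NickName).
persons = [
    (1, "NIKKI", "MADISON", "Nikki"),
    (2, "NICOLE", "MADISON", "NICOLE"),
    (3, "NICOLE", "MADISON", "Nikki"),
]

def find(parent, x):
    # Follow parent links to the root, shortening the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_keys(rows):
    # Every record starts in its own group.
    parent = {row[0]: row[0] for row in rows}
    first_seen = {}  # match key -> first ID that produced it

    for pid, first, last, nick in rows:
        last_n = last.strip().upper()
        # Two separate match rules; the tag keeps a first name from
        # accidentally matching somebody else's nickname.
        keys = [
            ("first", first.strip().upper(), last_n),
            ("nick", nick.strip().upper(), last_n),
        ]
        for key in keys:
            if key in first_seen:
                # Same key seen before: merge the two groups.
                ra = find(parent, pid)
                rb = find(parent, first_seen[key])
                parent[max(ra, rb)] = min(ra, rb)
            else:
                first_seen[key] = pid

    # Number the groups 1, 2, 3, ... in order of their smallest ID.
    roots = sorted({find(parent, pid) for pid in parent})
    rank = {root: i + 1 for i, root in enumerate(roots)}
    return {pid: rank[find(parent, pid)] for pid in parent}

print(group_keys(persons))
```

For the three sample rows this puts every ID in group 1. Doing the same thing purely in T-SQL would typically need a recursive CTE or an iterative update loop over the links.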

Related

MariaDB: how to use groups, unions, and/or sum/count in a select query

I've been racking my brain for three days trying to puzzle this one out. I'm new to MariaDB and SQL in general. I managed to use UNION in a previous, similar situation, but it's not working in this one.
I have three tables as follows:
create table zipcode (zip int, city varchar(30))
create table student (id int, zip_fk int)
create table teacher (id int, zip_fk int)
I want to create a select query that will have the following fields: city, the number of students from the city, the number of teachers from the city, and the total number of students and teachers from the city. Essentially, the results should be grouped by city. I am at a complete loss.
Edit. The challenge I am facing is that the city field is located in a different table and is not a primary key or a foreign key. As such, I cannot directly use it. The primary key is zip which means I first have to derive students and teachers from their respective tables, then bring in the zipcode table to compare their zip with cities.
This is rather tricky. Here is one method using union all and group by:
select city, sum(student) as students, sum(teacher) as teachers, count(*) as total
from ((select z.city, 1 as student, 0 as teacher
from student s join
zipcode z
on s.zip_fk = z.zip
) union all
(select z.city, 0 as student, 1 as teacher
from teacher t join
zipcode z
on t.zip_fk = z.zip
)
) st
group by city;
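To check the shape of the result, here is a small sketch that runs the same union all / group by idea against an in-memory SQLite database from Python. SQLite only stands in for MariaDB here, the sample rows are made up, and the `total` column is simply the count of all rows per city (students plus teachers).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zipcode (zip INT, city VARCHAR(30));
    CREATE TABLE student (id INT, zip_fk INT);
    CREATE TABLE teacher (id INT, zip_fk INT);

    -- Made-up sample rows; two zip codes share one city.
    INSERT INTO zipcode VALUES
        (10001, 'Springfield'), (10002, 'Springfield'), (20001, 'Shelbyville');
    INSERT INTO student VALUES (1, 10001), (2, 10002), (3, 20001);
    INSERT INTO teacher VALUES (1, 10001), (2, 20001);
""")

query = """
    SELECT city,
           SUM(student) AS students,
           SUM(teacher) AS teachers,
           COUNT(*)     AS total
    FROM (SELECT z.city, 1 AS student, 0 AS teacher
          FROM student s JOIN zipcode z ON s.zip_fk = z.zip
          UNION ALL
          SELECT z.city, 0 AS student, 1 AS teacher
          FROM teacher t JOIN zipcode z ON t.zip_fk = z.zip) st
    GROUP BY city
    ORDER BY city
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each person contributes exactly one row to the derived table, so grouping by city works even though city is neither a primary nor a foreign key.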

Retrieving data with single occurrence of repeated data

SELECT *
FROM employee
GROUP BY first_name
HAVING count(first_name) >= 1;
How can I retrieve all rows and columns, with each set of duplicates appearing only once? I want to retrieve the entire table contents, but repeated data must appear only once. In the table, first_name and last_name are repeated twice, but the other columns differ.
Please help.
Try this SQL query:
SELECT * FROM EMPLOYEE WHERE FIRST_NAME NOT IN
(
SELECT FIRST_NAME FROM
(
SELECT ROW_NUMBER() OVER(PARTITION BY FIRST_NAME ORDER BY FIRST_NAME) RNK,FIRST_NAME FROM EMPLOYEE
)A WHERE A.RNK=2
)
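If the goal is instead to keep one row for each repeated name (rather than dropping repeated names altogether), a possible sketch is to pick the lowest id per name and join back to the table. This assumes the table has a unique `id` column, which the question does not show; the sample rows and the `city` column are made up, and SQLite is used from Python only so the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical layout: the real employee table may differ.
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name TEXT,
        city TEXT
    );
    INSERT INTO employee VALUES
        (1, 'Ann', 'Lee', 'Oslo'),
        (2, 'Ann', 'Lee', 'Bergen'),
        (3, 'Raj', 'Patel', 'Pune');
""")

query = """
    SELECT e.*
    FROM employee e
    JOIN (SELECT MIN(id) AS keep_id      -- one surviving row per name
          FROM employee
          GROUP BY first_name, last_name) k
      ON e.id = k.keep_id
    ORDER BY e.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Which duplicate survives is a choice: MIN(id) keeps the oldest row, MAX(id) would keep the newest.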

Listing duplicated records using T SQL

I have a database that is used to record patient information for a small clinic. We use MS SQL Server 2008 as the backend. The patient table contains the following columns:
Id int identity(1,1),
FamilyName varchar(30),
FirstName varchar (20),
DOB datetime,
AddressLine1 varchar (50),
AddressLine2 varchar (50),
State varchar (20),
Postcode varchar (4),
NextOfKin varchar (20),
Homephone varchar (20),
Mobile varchar (20)
Occasionally the staff register a new patient, unaware that the patient already has a record in the system. We have ended up with several thousand duplicate records.
What I would like to do is present a list of patients who have duplicate records, for the staff to merge during quiet time. We consider two records to be duplicates if they have exactly the same FamilyName, FirstName and DOB. At the moment I use a subquery to return the records, as follows:
SELECT FamilyName,
FirstName,
DOB,
AddressLine1,
AddressLine2,
State,
Postcode,
NextOfKin,
HomePhone,
Mobile
FROM
Patients AS p1
WHERE Id IN
(
SELECT Max(Id)
FROM Patients AS p2
GROUP BY
FamilyName,
FirstName,
DOB
HAVING COUNT(Id) > 1
)
This produces the result, but the performance is terrible. Is there a better way to do it? The only requirement is that I need to show all the fields in the Patients table, as the users of the system want to view all the details before deciding whether to merge the records.
This will output every row that has a duplicate, based on firstname and lastname:
SELECT DISTINCT t1.*
FROM Table AS t1
INNER JOIN Table AS t2
ON t1.firstname = t2.firstname
AND t1.lastname = t2.lastname
AND t1.id <> t2.id
I suggest you build an index on the 3 fields you use to detect duplicates,
then try this query:
with Duplicates as
(
select FamilyName, FirstName, DOB
from Patients
group by FamilyName, FirstName, DOB
having count(*) > 1
)
Select Patients.*
from Patients
inner join Duplicates
on Patients.FamilyName = Duplicates.FamilyName
And Patients.FirstName= Duplicates.FirstName
and Patients.DOB= Duplicates.DOB
WITH CTE
AS
(
SELECT Id, FamilyName, FirstName, DOB,
ROW_NUMBER() OVER(PARTITION BY FamilyName, FirstName ,DOB ORDER BY Id) AS DuplicateCount
FROM PatientTable
)
select * from CTE where DuplicateCount > 1
If I were in your shoes, I'd do the following:
add indexes to FamilyName, FirstName and DOB
create a view for your subquery
modify the query as follows:
Select p.* FROM Patients p INNER JOIN view_name v ON v.FirstName=p.Firstname AND ...
select FamilyName, FirstName, DOB
from Patients
group by FamilyName, FirstName, DOB
having count(*)>1
Will show all duplicates.
However, please consider names that are written differently but are similar. You might want to look into the topics 'data deduplication' and/or 'record linkage'. I solved the problem using string similarity algorithms (modified Jaro/Winkler and Levenshtein).
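To give a feel for what string similarity means here, below is a minimal sketch of the plain Levenshtein edit distance in Python. It is not the modified Jaro/Winkler variant mentioned above, and the `similar` helper with its `max_distance` threshold is a made-up illustration, not a recommended setting.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]

def similar(a: str, b: str, max_distance: int = 2) -> bool:
    # Hypothetical helper: treat names within a small edit distance
    # as candidate duplicates, ignoring case and surrounding spaces.
    return levenshtein(a.strip().lower(), b.strip().lower()) <= max_distance

print(levenshtein("Smith", "Smyth"))
print(similar("Jon", "John"))
```

In practice such a function would only flag candidate pairs for a person to review, since short names can be within a small distance of each other and still belong to different people.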

How to select distinct with ID on the result?

How can I select distinct rows from a table and still include the ID column in the result?
For example (this query produces an error):
SELECT ID,City,Street from (SELECT distinct City, Street from Location)
The Location table:
CREATE TABLE Location(
ID int identity not null,
City varchar(max) not null,
Street varchar(max) not null
)
The result should show the ID column along with the distinct City and Street values.
Is there a query that produces this result?
If you want, for instance, the lowest id for each unique combination, you can do:
select min(id), City, Street
from Location
group by City, Street
Generally, you have to tell the DB which id to take by using an aggregate function like min() or max().

SQL Query to select duplicate registrations

Hi, sorry in advance for what seems such a simple question...
I have a table with some millions of rows of laboratory data and the following fields (amongst others):
Laboratory Reference Number
Forename
Surname
DOB
I need a query that will give me all of the distinct laboratory Reference Numbers, forenames, surnames and DOBs where the laboratory Reference Number has more than one associated forename, surname and DOB,
i.e. a query to highlight where a laboratory Reference Number has duplicate candidates associated with it,
e.g.
12345, Bob, Smith, 30/03/1981
12345, Fred, Smith, 31/03/1981
Any help would be much appreciated.
SELECT * FROM TABLE WHERE REF IN
(SELECT REF FROM TABLE GROUP BY REF HAVING COUNT(*) > 1)
You could also use SELECT DISTINCT * if necessary
select RefNr
, Forename
, Surname
, DOB
from YourTable yt1
where exists
(
select *
from YourTable yt2
where yt1.RefNr = yt2.RefNr
and
(
yt1.Forename <> yt2.Forename
or yt1.Surname <> yt2.Surname
or yt1.DOB <> yt2.DOB
)
)
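The two answers behave differently when a reference number has rows that are exact copies of each other: the COUNT(*) > 1 version reports them, while the EXISTS version only reports reference numbers whose rows actually disagree. Here is a small sketch that shows this on made-up data, using SQLite from Python as a stand-in for the real database. The table and column names follow the second answer; the dates are stored as ISO text purely for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YourTable (RefNr INT, Forename TEXT, Surname TEXT, DOB TEXT);
    -- Made-up rows: 12345 has conflicting people, 67890 is an exact
    -- duplicate of itself, 11111 appears only once.
    INSERT INTO YourTable VALUES
        (12345, 'Bob',  'Smith', '1981-03-30'),
        (12345, 'Fred', 'Smith', '1981-03-31'),
        (67890, 'Ann',  'Jones', '1975-01-01'),
        (67890, 'Ann',  'Jones', '1975-01-01'),
        (11111, 'Eve',  'Brown', '1990-06-15');
""")

# Reference numbers that simply occur more than once.
repeated = conn.execute("""
    SELECT RefNr FROM YourTable
    GROUP BY RefNr HAVING COUNT(*) > 1
    ORDER BY RefNr
""").fetchall()

# Rows whose reference number is shared with a *different* person.
conflicts = conn.execute("""
    SELECT DISTINCT RefNr, Forename, Surname, DOB
    FROM YourTable yt1
    WHERE EXISTS (SELECT 1 FROM YourTable yt2
                  WHERE yt1.RefNr = yt2.RefNr
                    AND (yt1.Forename <> yt2.Forename
                         OR yt1.Surname <> yt2.Surname
                         OR yt1.DOB <> yt2.DOB))
    ORDER BY RefNr, Forename
""").fetchall()

print(repeated)
print(conflicts)
```

Note that `<>` never evaluates to true when either side is NULL, so rows with missing names or dates would need extra handling.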