Find potential duplicate names in database - sql

I have two tables in a SQL Server Database:
Table: People
Columns: ID, FirstName, LastName
Table: StandardNames
Columns: Nickname, StandardName
Sample Nicknames would be Rick, Rich, Richie when StandardName is Richard.
I would like to find duplicate contacts in my People table, treating any nickname as its standard name. For example, sometimes I have Rich Smith and other times Richard Smith in the People table. Is this possible? I realize it might take multiple joins to the same table, but I can't figure out how to start.

Firstly, you need to determine how many duplicates you have in your People table...
SELECT p.FirstName, COUNT(*)
FROM People AS p
INNER JOIN StandardNames AS sn
ON CHARINDEX(sn.Nickname, p.FirstName) > 0 OR
CHARINDEX(sn.Nickname, p.LastName) > 0
GROUP BY p.FirstName
HAVING COUNT(*) > 1
That's just to get an idea of which rows contain one of the Nicknames anywhere inside the FirstName or LastName columns (a substring match, like a wildcard search).
If you are happy with the items found then expand on the query to update the values.
Let's say you wanted to change the Firstname to be the Standardname...
UPDATE p2
SET p2.FirstName = a.StandardName
FROM
(SELECT p.ID, sn.StandardName
FROM People AS p
INNER JOIN StandardNames AS sn
ON CHARINDEX(sn.Nickname, p.FirstName) > 0 OR
CHARINDEX(sn.Nickname, p.LastName) > 0) AS a
INNER JOIN People AS p2 ON p2.ID = a.ID
So this will obviously find all the People IDs that have a match based on the query above, and it will update the People table by replacing the FirstName with the StandardName.
However, there are issues with this due to the limitation of your question.
The StandardNames table should have its own ID field. Every table should have an ID column as its primary key. That's just my view.
This is only going to work for data it matches using the CHARINDEX() function. What you really need is something that matches on a "sound" or similarity to the nicknames. Check out the SOUNDEX() function and apply your logic from there.
And this all assumes the IDs above are unique!
Good luck
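A small sketch of why the substring match needs care, run against SQLite through Python's sqlite3 module as a stand-in for SQL Server. The sample rows are made up, and SQLite's instr(string, substring) is used in place of CHARINDEX(substring, string); case sensitivity may also differ from your SQL Server collation. A substring join picks up rows that already hold the standard name, and unrelated last names, while an equality join on FirstName does not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (ID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE StandardNames (Nickname TEXT, StandardName TEXT);
-- Made-up sample data
INSERT INTO People VALUES
  (1, 'Rich', 'Smith'), (2, 'Richard', 'Smith'), (3, 'Ulrich', 'Richter');
INSERT INTO StandardNames VALUES ('Rich', 'Richard');
""")

# Substring match: the nickname may appear anywhere in either name column.
substring_ids = [row[0] for row in conn.execute("""
SELECT p.ID
FROM People AS p
JOIN StandardNames AS sn
  ON instr(p.FirstName, sn.Nickname) > 0
  OR instr(p.LastName, sn.Nickname) > 0
ORDER BY p.ID
""")]

# Exact match: the first name must equal the nickname.
exact_ids = [row[0] for row in conn.execute("""
SELECT p.ID
FROM People AS p
JOIN StandardNames AS sn ON p.FirstName = sn.Nickname
ORDER BY p.ID
""")]

print(substring_ids, exact_ids)
```

Here the substring version also matches "Richard Smith" and "Ulrich Richter", so it is worth reviewing the SELECT output before running any UPDATE.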

You could standardize the names by joining, and count the number of occurrences. Extracting the ID is a bit fiddly, but also quite possible. I'd suggest the following - use a case expression to find the contact with the standard name, and if you don't have one, just take the id of the first duplicate:
SELECT COALESCE(MIN(CASE FirstName WHEN StandardName THEN id END), MIN(id)),
StandardName,
LastName,
COUNT(*)
FROM People p
LEFT JOIN StandardNames s ON FirstName = Nickname
GROUP BY StandardName, LastName
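A minimal runnable sketch of this idea, using SQLite through Python's sqlite3 module as a stand-in for SQL Server. The sample rows are made up, and COALESCE(StandardName, FirstName) is an added assumption so that people whose first name has no nickname entry keep their own name instead of being grouped under NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (ID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE StandardNames (Nickname TEXT, StandardName TEXT);
-- Made-up sample data
INSERT INTO People VALUES
  (1, 'Rich', 'Smith'), (2, 'Richard', 'Smith'),
  (3, 'Richie', 'Jones'), (4, 'Anna', 'Smith');
INSERT INTO StandardNames VALUES
  ('Rick', 'Richard'), ('Rich', 'Richard'), ('Richie', 'Richard');
""")

# Normalize the first name, then report name pairs that occur more than once.
duplicates = conn.execute("""
SELECT COALESCE(s.StandardName, p.FirstName) AS NormalizedFirst,
       p.LastName,
       COUNT(*)  AS Occurrences,
       MIN(p.ID) AS FirstID
FROM People AS p
LEFT JOIN StandardNames AS s ON p.FirstName = s.Nickname
GROUP BY COALESCE(s.StandardName, p.FirstName), p.LastName
HAVING COUNT(*) > 1
""").fetchall()

print(duplicates)
```

With this data, "Rich Smith" and "Richard Smith" collapse into one group, while "Richie Jones" and "Anna Smith" are not reported because they occur only once.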

Related

How to select from more than one record of the same field in a JOIN

I have a table of letters. And table with authors and recipients called "people". I want to run a SQL command that returns a table or row with the letter and BOTH its author and its recipient.
Here's what I tried so far. If I need a table with letters and the author name I can use:
SELECT date, type, description, firstName, lastName
FROM letters
INNER JOIN people ON letters.authorId=people.ROWID
This gives me a table like the following:
1866-12-08 request need of women and children, list of needed supplies, FREEDMEN'S BUREAU A.M.L. Crawford
1867-01-18 request destitution of women and children in state and efforts to help M.H. Cruikshank
If I need recipient I can change the last line to:
INNER JOIN people ON letters.recipientId=people.ROWID
I want both the author and recipient. But I cannot seem to select more than one of the same field in a JOIN.
I have tried using the following.
SELECT date, type, description, firstName, lastName, firstName, lastName
FROM letters
INNER JOIN people ON letters.authorId=people.ROWID
INNER JOIN people ON letters.recipient=people.ROWID
This leads to the error:
Execution finished with errors.
Result: ambiguous column name: firstName
At line 1:
SELECT date, type, description, firstName, lastName, firstName, lastName
FROM letters
INNER JOIN people ON letters.authorId=people.ROWID
INNER JOIN people ON letters.recipient=people.ROWID
When you JOIN to the same table more than once, as you do to show senders' and recipients' details, you must give the tables aliases.
FROM letters
INNER JOIN people senders ON letters.authorId=senders.ROWID
INNER JOIN people recips ON letters.recipient=recips.ROWID
Then you can mention columns from senders and recips in SELECT, WHERE, and other clauses as if those were separate tables.
SELECT letters.date, letters.type, letters.description,
senders.firstName, senders.lastName,
recips.firstName, recips.lastName
...
Pro tip: In queries with JOINs, always qualify each column you mention with the table name or alias. Say, for example, letters.description instead of just description. It makes your query much easier to understand for the next person, or even your future self.
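A runnable sketch of the aliased double join, using Python's sqlite3 module. The table contents are made up for illustration (the two names come from the question, but pairing them as sender and recipient is an assumption), and the recipient column is assumed to be called recipientId:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (firstName TEXT, lastName TEXT);
CREATE TABLE letters (date TEXT, type TEXT, description TEXT,
                      authorId INTEGER, recipientId INTEGER);
-- Made-up sample data; the people rows get ROWID 1 and 2
INSERT INTO people (firstName, lastName)
VALUES ('A.M.L.', 'Crawford'), ('M.H.', 'Cruikshank');
INSERT INTO letters
VALUES ('1866-12-08', 'request', 'list of needed supplies', 1, 2);
""")

# The same table is joined twice under two different aliases, and the
# output columns are renamed so author and recipient stay distinguishable.
rows = conn.execute("""
SELECT letters.date,
       senders.firstName AS senderFirst, senders.lastName AS senderLast,
       recips.firstName  AS recipFirst,  recips.lastName  AS recipLast
FROM letters
INNER JOIN people AS senders ON letters.authorId = senders.ROWID
INNER JOIN people AS recips  ON letters.recipientId = recips.ROWID
""").fetchall()

print(rows)
```

Giving the selected columns their own aliases is optional, but it avoids two result columns both being named firstName.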

Compare 2 columns in 2 tables with DISTINCT value

I am now creating a reporting service with visual business intelligence.
I am trying to count how many users have been created under an org_id.
But the report consists of multiple org_ids, and I am having difficulty counting how many users have been created under each particular org_id.
TBL_USER
USER_ID
0001122
0001234
ABC9999
DEF4545
DEF7676
TBL_ORG
ORG_ID
000
ABC
DEF
EXPECTED OUTPUT
TBL_RESULT
USER_CREATED
000 - 2
ABC - 1
DEF - 2
In my understanding, I need a nested SELECT, but so far I have come to nothing.
SELECT COUNT(TBL_USER.USER_ID) AS Expr1
FROM TBL_USER INNER JOIN TBL_ORG
WHERE TBL_USER.USER_ID LIKE 'TBL_ORG.ORG_ID%')
This is totally wrong, but I hope it gives a clue.
It looks like the USER_ID value is the concatenation of your ORG_ID and something to make it unique. I'm assuming this is from a COTS product and nothing a human would have built.
Your desire is to find out how many entries there are by department. In SQL, when you read the word "by" in a requirement, that implies grouping. The action you want to take is to get a count, and the reserved word for that is COUNT. Unless you need something out of TBL_ORG, I see no need to join to it.
SELECT
LEFT(T.USER_ID, 3) AS USER_CREATED
, COUNT(1) AS GroupCount
FROM
TBL_USER AS T
GROUP BY
LEFT(T.USER_ID, 3)
Anything that isn't in an aggregate (COUNT, SUM, AVG, etc) must be in your GROUP BY.
SQLFiddle
I updated the fiddle to also show how you could link to TBL_ORG if you need an element from the row in that table.
-- Need to have the friendly name for an org
-- Now we need to do the join
SELECT
LEFT(T.USER_ID, 3) AS USER_CREATED
, O.SOMETHING_ELSE
, COUNT(1) AS GroupCount
FROM
TBL_USER AS T
-- inner join assumes there will always be a match
INNER JOIN
TBL_ORG AS O
-- Using a function on a column is a performance killer
ON O.ORG_ID = LEFT(T.USER_ID, 3)
GROUP BY
LEFT(T.USER_ID, 3)
, O.SOMETHING_ELSE;
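The grouping above can be tried out with Python's sqlite3 module and the sample data from the question. This is only a sketch: SQLite has no LEFT() function, so substr(USER_ID, 1, 3) stands in for it, and the second query is an assumed variant that starts from TBL_ORG and matches on a LIKE prefix, which is roughly what the question's attempt was reaching for:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBL_USER (USER_ID TEXT);
CREATE TABLE TBL_ORG (ORG_ID TEXT);
INSERT INTO TBL_USER VALUES
  ('0001122'), ('0001234'), ('ABC9999'), ('DEF4545'), ('DEF7676');
INSERT INTO TBL_ORG VALUES ('000'), ('ABC'), ('DEF');
""")

# Group on the three-character prefix; no join is needed.
by_prefix = conn.execute("""
SELECT substr(T.USER_ID, 1, 3) AS USER_CREATED, COUNT(*) AS GroupCount
FROM TBL_USER AS T
GROUP BY substr(T.USER_ID, 1, 3)
ORDER BY USER_CREATED
""").fetchall()

# Variant: start from TBL_ORG so every org is listed, matching users by prefix.
by_org = conn.execute("""
SELECT O.ORG_ID, COUNT(T.USER_ID) AS GroupCount
FROM TBL_ORG AS O
LEFT JOIN TBL_USER AS T ON T.USER_ID LIKE O.ORG_ID || '%'
GROUP BY O.ORG_ID
ORDER BY O.ORG_ID
""").fetchall()

print(by_prefix)
print(by_org)
```

The LEFT JOIN variant would also list an org with zero users, because COUNT(T.USER_ID) ignores the NULL produced when nothing matches.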

Inner join (or intersect) over three tables

I have a database with three tables named: NameAddressPhone, NameAddressAge, and AgeSex.
Table NameAddressPhone has columns name, address, and phone.
Table NameAddressAge has columns name, address, and age.
Table AgeSex has columns age and sex.
I'm trying to write a (SQLite) query to find the names, addresses, and ages such that the names and addresses appear in both NameAddressPhone and NameAddressAge, and such that the ages appear in both NameAddressAge and AgeSex. I'm able to get halfway there (i.e., with two tables) using inner join, but I only dabble in SQL and would appreciate some help from an expert in getting this right. I have seen solutions that appear to be similar, but don't quite follow their logic.
Thanks in advance.
Chris
I think you just want to join these together on their obvious keys:
select *
from NameAddressPhone nap join
NameAddressAge naa
on nap.name = naa.name and
nap.address = naa.address join
(select distinct age
from AgeSex asx
) asx
on asx.age = naa.age
This is selecting the distinct ages in the AgeSex to prevent the proliferation of rows. Presumably, one age could appear multiple times in that table, which would result in duplicate rows on output.
I am assuming your tables have the following layout
NameAddressPhone
================
Name
Address
Phone
NameAddressAge
==============
Name
Address
Age
AgeSex
======
Age
Sex
If I am understanding everything correctly, the solution might look kind of like this:
SELECT P.Name, P.Address, P.Phone, A.Age, S.Sex
FROM NameAddressPhone P
INNER JOIN NameAddressAge A ON P.Name = A.Name AND P.Address = A.Address
INNER JOIN AgeSex S ON A.Age = S.Age
Mind you, joining AgeSex could produce duplicate rows if there are multiple rows with the same age in AgeSex. There wouldn't be a way to distinguish 21 and Male from 21 and Female, for example.
I hope I can help and this is what you are looking for.
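To see the duplicate-row effect both answers mention, here is a small sketch with Python's sqlite3 module. The rows are invented for illustration; the point is only to compare a plain join on AgeSex with a join on its distinct ages:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NameAddressPhone (name TEXT, address TEXT, phone TEXT);
CREATE TABLE NameAddressAge (name TEXT, address TEXT, age INTEGER);
CREATE TABLE AgeSex (age INTEGER, sex TEXT);
-- Made-up sample data: age 21 appears twice in AgeSex
INSERT INTO NameAddressPhone VALUES
  ('Ann', '1 Elm St', '555-0100'), ('Bob', '2 Oak St', '555-0101');
INSERT INTO NameAddressAge VALUES
  ('Ann', '1 Elm St', 21), ('Bob', '2 Oak St', 40), ('Cy', '3 Ash St', 21);
INSERT INTO AgeSex VALUES (21, 'M'), (21, 'F');
""")

# Plain join: one output row per matching AgeSex row.
plain = conn.execute("""
SELECT nap.name, nap.address, naa.age
FROM NameAddressPhone AS nap
JOIN NameAddressAge AS naa
  ON nap.name = naa.name AND nap.address = naa.address
JOIN AgeSex AS asx ON asx.age = naa.age
""").fetchall()

# Join on the distinct ages: each name/address/age appears once.
deduped = conn.execute("""
SELECT nap.name, nap.address, naa.age
FROM NameAddressPhone AS nap
JOIN NameAddressAge AS naa
  ON nap.name = naa.name AND nap.address = naa.address
JOIN (SELECT DISTINCT age FROM AgeSex) AS asx ON asx.age = naa.age
""").fetchall()

print(plain)
print(deduped)
```

Bob drops out because age 40 is not in AgeSex, and Cy drops out because there is no matching phone row.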

Simple MySQL problem

I'm working on a MySQL database that contains persons. My problem is that, (I will simplify to make my point):
I have three tables:
Persons(id int, birthdate date)
PersonsLastNames(id int, lastname varchar(30))
PersonsFirstNames(id int, firstname varchar(30))
The id is the common key. There are separate tables for last names and first names because a single person can have many first names and many last names.
I want to make a query that returns all persons with, let's say, one last name. If I go with
select birthdate, lastname, firstname from Persons, PersonsLastNames,
PersonsFirstNames where Persons.id = PersonsLastNames.id and
Persons.id = PersonsFirstNames.id and lastName = 'Anderson'
I end up with a table like
1/1/1970 Anderson Steven //Person 1
1/1/1970 Anderson David //Still Person 1
2/2/1980 Smith Adam //Person 2
3/3/1990 Taylor Ed //Person 3
When presenting this, I would like to have
1/1/1970 Anderson Steven David
2/2/1980 Smith Adam [possibly null?]
3/3/1990 Taylor Ed [possibly null?]
How do I join the tables to introduce new columns in the result set if needed to hold several first names or last names for one person?
Does your application really need to handle unlimited first/last names per person? I don't know your specific needs, but that seems like it may be a little extreme. Regardless...
Since you can't really have a dynamic number of columns returned, you could do something like this:
SELECT birthdate, lastname, GROUP_CONCAT(firstname SEPARATOR '|') AS firstnames
FROM Persons, PersonsLastNames, PersonsFirstNames
WHERE Persons.id = PersonsLastNames.id
AND Persons.id = PersonsFirstNames.id
GROUP BY Persons.id
This would return one row per person that has a last name, with the (unlimited) first names joined by the GROUP_CONCAT function and separated by a pipe (|) symbol.
birthdate lastname firstnames
--- --- ---
1970-01-01 00:00:00 Anderson Steven|David
1980-02-02 00:00:00 Smith Adam
1990-03-03 00:00:00 Taylor Ed
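For experimenting without a MySQL server, roughly the same aggregation can be sketched with Python's sqlite3 module. Note the assumptions: SQLite spells it group_concat(value, separator) rather than MySQL's SEPARATOR keyword, the order of names inside the concatenated string is not guaranteed, and the rows are the sample data from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Persons (id INTEGER, birthdate TEXT);
CREATE TABLE PersonsLastNames (id INTEGER, lastname TEXT);
CREATE TABLE PersonsFirstNames (id INTEGER, firstname TEXT);
INSERT INTO Persons VALUES
  (1, '1970-01-01'), (2, '1980-02-02'), (3, '1990-03-03');
INSERT INTO PersonsLastNames VALUES
  (1, 'Anderson'), (2, 'Smith'), (3, 'Taylor');
INSERT INTO PersonsFirstNames VALUES
  (1, 'Steven'), (1, 'David'), (2, 'Adam'), (3, 'Ed');
""")

# One row per person and last name; first names are joined with '|'.
rows = conn.execute("""
SELECT p.birthdate, l.lastname, group_concat(f.firstname, '|') AS firstnames
FROM Persons AS p
JOIN PersonsLastNames AS l ON p.id = l.id
JOIN PersonsFirstNames AS f ON p.id = f.id
GROUP BY p.id, p.birthdate, l.lastname
ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
```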
SQL does not support a dynamic number of columns in the query select-list. You have to define exactly as many columns as you want (notwithstanding the * wildcard).
I recommend that you fetch the multiple names as rows, not columns. Then write some application code to loop over the result set and do whatever you want to do for presenting them.
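A sketch of that application-side approach, assuming Python with the sqlite3 module in place of a MySQL client and the sample rows from the question. The query returns one row per name combination, and the loop folds them into one record per person:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Persons (id INTEGER, birthdate TEXT);
CREATE TABLE PersonsLastNames (id INTEGER, lastname TEXT);
CREATE TABLE PersonsFirstNames (id INTEGER, firstname TEXT);
INSERT INTO Persons VALUES
  (1, '1970-01-01'), (2, '1980-02-02'), (3, '1990-03-03');
INSERT INTO PersonsLastNames VALUES
  (1, 'Anderson'), (2, 'Smith'), (3, 'Taylor');
INSERT INTO PersonsFirstNames VALUES
  (1, 'Steven'), (1, 'David'), (2, 'Adam'), (3, 'Ed');
""")

rows = conn.execute("""
SELECT p.id, p.birthdate, l.lastname, f.firstname
FROM Persons AS p
JOIN PersonsLastNames AS l ON p.id = l.id
JOIN PersonsFirstNames AS f ON p.id = f.id
ORDER BY p.id
""")

# Fold the flat rows into one record per person id.
people = {}
for person_id, birthdate, lastname, firstname in rows:
    person = people.setdefault(
        person_id, {"birthdate": birthdate, "lastnames": [], "firstnames": []}
    )
    # The join repeats each name once per combination, so skip repeats.
    if lastname not in person["lastnames"]:
        person["lastnames"].append(lastname)
    if firstname not in person["firstnames"]:
        person["firstnames"].append(firstname)

for person in people.values():
    print(person["birthdate"], person["lastnames"], person["firstnames"])
```

Each person ends up with as many first and last names as the data holds, without the query needing a fixed number of columns.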
The short answer is, you can't. You'll always have to pick a fixed number of columns. You can, however, greatly improve the syntax of your query by using the ON keyword. For example:
SELECT
birthdate,
firstName,
lastName
FROM
Persons
INNER JOIN PersonsLastNames
ON Persons.id = PersonsLastNames.id
INNER JOIN PersonsFirstNames
ON Persons.id = PersonsFirstNames.id
WHERE
lastName = 'Anderson'
GROUP BY
lastName, firstName
HAVING
count(lastName) = 1
Of course, my query includes a few extra provisions at the end so that only persons with only one last name specified would be grabbed, but you can always remove those.
Now, what you CAN do is choose a maximum number of these you'd like to retrieve and do something like this:
SELECT
birthdate,
lastName,
PersonsFirstNames.firstName,
IFNULL(p.firstName,''),
IFNULL(q.firstName,'')
FROM
Persons
INNER JOIN PersonsLastNames
ON Persons.id = PersonsLastNames.id
INNER JOIN PersonsFirstNames
ON Persons.id = PersonsFirstNames.id
LEFT JOIN PersonsFirstNames p
ON Persons.id = p.id
AND p.firstName <> PersonsFirstNames.firstName
LEFT JOIN PersonsFirstNames q
ON Persons.id = q.id
AND q.firstName <> PersonsFirstNames.firstName
AND q.firstName <> p.firstName
GROUP BY
lastName
But I really don't recommend that. The best bet is to retrieve multiple rows, and then iterate over them in whatever application you're using/developing.
Make sure you read up on your JOIN types (Left-vs-Inner), if you're not already familiar, before you start. Hope this helps.
EDIT: You also might want to consider, in that case, a slightly more complex GROUP BY clause, e.g.
GROUP BY
Persons.id, lastName
I think the closest thing you could do is to Group By Person.Id and then do string concatenation. Perhaps this post will help:
How to use GROUP BY to concatenate strings in MySQL?

complex sql query help needed

I'm not sure how to write this query in SQL. There are two tables:
**GroupRecords**
Id (int, primary key)
Name (nvarchar)
SchoolYear (datetime)
RecordDate (datetime)
IsUpdate (bit)
**People**
Id (int, primary key)
GroupRecordsId (int, foreign key to GroupRecords.Id)
Name (nvarchar)
Bio (nvarchar)
Location (nvarchar)
return a distinct list of people who belong to GroupRecords that have a SchoolYear of '2000'. In the returned list, people.name should be unique (no duplicate People.Name), in case of a duplication only the person who belong to the GroupRecords with the later RecordDate should be returned.
It would probably be better to write a stored procedure for this right?
This is untested, but it should do what is required in the question.
It selects all details about the person.
The subquery restricts each name to its latest RecordDate among the GroupRecords of the same school year, so when a name appears more than once only the row from the most recent record is returned.
SELECT
People.Id,
People.GroupRecordsId,
People.Name,
People.Bio,
People.Location
FROM
People
INNER JOIN GroupRecords ON GroupRecords.Id = People.GroupRecordsId
WHERE
GroupRecords.SchoolYear = '2000/1/1' AND
GroupRecords.RecordDate = (
SELECT
MAX(GR2.RecordDate)
FROM
People AS P2
INNER JOIN GroupRecords AS GR2 ON P2.GroupRecordsId = GR2.Id
WHERE
P2.Name = People.Name AND
GR2.SchoolYear = GroupRecords.SchoolYear
)
Select Distinct ID
From People
Where GroupRecordsID In
(Select Id From GroupRecords
Where SchoolYear = '2000/1/1')
This will produce a distinct list of those individuals in the 2000 class...
but I don't understand what you're getting at with the comment about duplicates... please elaborate...
It reads as though you're talking about when two different people happen to have the same name you don't want them both listed... Is that really what you want?
MySQL specific:
SELECT *
FROM `People`
LEFT JOIN `GroupRecords` ON `GroupRecordsId` = `GroupRecords`.`Id`
WHERE `GroupRecords`.`SchoolYear` = '2000/1/1'
GROUP BY `People`.`Name`
ORDER BY `GroupRecords`.`RecordDate` DESC
"people.name should be unique (no duplicate People.Name)"
Surely you mean no duplicate People.ID?
"in case of a duplication only the person who belong to the GroupRecords with the later RecordDate should be returned."
There's the rub — that's the bit that it's not obvious how to do in plain SQL. There are a number of approaches to the “For each X, select the row Y with maximum/minimum Z” question; which work and which perform better depend on which database software you're using.
http://kristiannielsen.livejournal.com/6745.html has some good discussion of some of the usual techniques for attacking this (in the context of MySQL, but widely applicable).
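One of the usual techniques, sketched here with Python's sqlite3 module, is a NOT EXISTS anti-join: keep a row only if no other row with the same name has a later RecordDate in the same school year. The sample rows, the ISO date strings, and the reduced column list are assumptions made for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GroupRecords (
  Id INTEGER PRIMARY KEY, Name TEXT, SchoolYear TEXT, RecordDate TEXT);
CREATE TABLE People (
  Id INTEGER PRIMARY KEY,
  GroupRecordsId INTEGER REFERENCES GroupRecords(Id),
  Name TEXT);
-- Made-up sample data; dates are ISO strings so they compare correctly
INSERT INTO GroupRecords VALUES
  (1, 'Class A', '2000-01-01', '2000-03-01'),
  (2, 'Class B', '2000-01-01', '2000-09-01'),
  (3, 'Class C', '2001-01-01', '2001-05-01');
INSERT INTO People VALUES
  (10, 1, 'Ann'), (11, 2, 'Ann'), (12, 1, 'Bob'), (13, 3, 'Ann');
""")

# For each name in school year 2000, keep the row whose group record
# has no later-dated rival for the same name and year.
latest = conn.execute("""
SELECT p.Id, p.Name, g.RecordDate
FROM People AS p
JOIN GroupRecords AS g ON g.Id = p.GroupRecordsId
WHERE g.SchoolYear = '2000-01-01'
  AND NOT EXISTS (
        SELECT 1
        FROM People AS p2
        JOIN GroupRecords AS g2 ON g2.Id = p2.GroupRecordsId
        WHERE p2.Name = p.Name
          AND g2.SchoolYear = g.SchoolYear
          AND g2.RecordDate > g.RecordDate)
ORDER BY p.Name
""").fetchall()

print(latest)
```

If two records for the same name share the latest RecordDate, both rows come back, so a tie-breaker (for example on People.Id) would be needed where that can happen.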