We've had a few cases of people entering in first names where last names should be and vice versa. So I'm trying to come up with a SQL search to match the swapped columns. For example, someone may have entered the record as first_name = Smith, last_name = John by accident. Later, another person may see that John Smith is not in the database and enter a new user as first_name = John, last_name = Smith, when in fact it is the same person.
I used this query to help narrow my search:
SELECT person_id, first_name, last_name
FROM people
WHERE first_name IN (
SELECT last_name FROM people
) AND last_name IN (
SELECT first_name FROM people
);
But if we have people named John Allen, Allen Smith, and Smith John, they would all be returned even though none of those are actually duplicates. In this case, it's actually good enough that I can see the duplicates in my particular data set, but I'm wondering if there's a more precise way to do this.
I would do a self join like this:
SELECT p1.person_id, p1.first_name, p1.last_name
FROM people p1
join people p2 on p1.first_name = p2.last_name and p1.last_name = p2.first_name
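To see the difference on the example from the question, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory table. The rows are made up for illustration; only the `people` column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (person_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
# Made-up rows: 1-3 are the false positives from the question, 3-4 are a real swap.
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, "John", "Allen"),
        (2, "Allen", "Smith"),
        (3, "Smith", "John"),
        (4, "John", "Smith"),
    ],
)

# The IN-based query from the question: each column is checked independently.
in_query = """
    SELECT person_id FROM people
    WHERE first_name IN (SELECT last_name FROM people)
      AND last_name IN (SELECT first_name FROM people)
    ORDER BY person_id
"""
# The self join: both columns must be swapped on the same pair of rows.
join_query = """
    SELECT p1.person_id
    FROM people p1
    JOIN people p2
      ON p1.first_name = p2.last_name
     AND p1.last_name = p2.first_name
    ORDER BY p1.person_id
"""
in_ids = [row[0] for row in conn.execute(in_query)]
join_ids = [row[0] for row in conn.execute(join_query)]
print(in_ids)    # [1, 2, 3, 4]
print(join_ids)  # [3, 4]
```

The IN version flags every row, while the self join only keeps the two rows that are swapped copies of each other.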
To also find typos on names I recommend this:
SELECT p1.person_id, p1.first_name, p1.last_name
FROM people p1
join people p2 on soundex(p1.first_name) = soundex(p2.last_name) and
soundex(p1.last_name) = soundex(p2.first_name)
soundex is a neat function that "hashes" words in a way that two words that sound the same get the same hash. This means Anne and Ann will have the same soundex. So if you had an Anne Smith and a Smith Ann the query above would find them as a match.
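If your database has no soundex function, the idea can be sketched in Python and registered with SQLite as a user-defined function. This is a simplified version of the classic American Soundex rules, not the exact implementation of any particular database; the function name `py_soundex` and the sample rows are assumptions for illustration.

```python
import sqlite3

# Letter groups of classic Soundex; vowels, Y, H and W get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def py_soundex(word):
    """Return a 4-character Soundex-style code (hypothetical helper)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]

conn = sqlite3.connect(":memory:")
conn.create_function("py_soundex", 1, py_soundex)
conn.execute(
    "CREATE TABLE people (person_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "Anne", "Smith"), (2, "Smith", "Ann"), (3, "John", "Allen")],
)
matched_ids = [row[0] for row in conn.execute("""
    SELECT p1.person_id
    FROM people p1
    JOIN people p2
      ON py_soundex(p1.first_name) = py_soundex(p2.last_name)
     AND py_soundex(p1.last_name) = py_soundex(p2.first_name)
    ORDER BY p1.person_id
""")]
print(py_soundex("Anne"), py_soundex("Ann"))  # A500 A500
print(matched_ids)                            # [1, 2]
```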
Interesting. This is a problem that I cover in Data Analysis Using SQL and Excel (note: I only very rarely mention books in my answers or comments).
The idea is to summarize the data to get a likelihood of a mismatch. So, look at the number of times a name appears as a first name and as a last name and then combine these. So:
with names as (
select first_name as name, 1.0 as isf, 0.0 as isl
from people
union all
select last_name, 0, 1
from people
),
nl as (
select name, sum(isf) as numf, sum(isl) as numl,
avg(isf) as p_f, avg(isl) as p_l
from names
group by name
)
select p.*
from people p join
nl nlf
on p.first_name = nlf.name join
nl nll
on p.last_name = nll.name
order by (coalesce(nlf.p_l, 0) + coalesce(nll.p_f, 0)) desc;
This orders the records by a measure of mismatch of the names: the sum of the probability of the first name being used as a last name and the probability of the last name being used as a first name.
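As a rough check of the scoring idea, here is a sketch that runs the same CTE against an in-memory SQLite table from Python. The rows are invented, and the sketch sorts in descending order (an assumption on my part) so that the most suspicious records come first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (person_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
# Made-up rows; person 4 has first and last name swapped.
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, "John", "Smith"),
        (2, "Mary", "Smith"),
        (3, "John", "Jones"),
        (4, "Smith", "John"),
        (5, "Mary", "Jones"),
    ],
)

ranked = conn.execute("""
    WITH names AS (
        SELECT first_name AS name, 1.0 AS isf, 0.0 AS isl FROM people
        UNION ALL
        SELECT last_name, 0, 1 FROM people
    ),
    nl AS (
        SELECT name, AVG(isf) AS p_f, AVG(isl) AS p_l
        FROM names
        GROUP BY name
    )
    SELECT p.person_id, nlf.p_l + nll.p_f AS mismatch
    FROM people p
    JOIN nl nlf ON p.first_name = nlf.name
    JOIN nl nll ON p.last_name = nll.name
    ORDER BY mismatch DESC, p.person_id
""").fetchall()

for person_id, mismatch in ranked:
    print(person_id, round(mismatch, 2))
```

With this data the swapped record (person 4) gets the highest score, and a record whose names are never seen in the other column (person 5) gets a score of 0.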
I'm sorry for asking these questions, but I am really new and studying Oracle SQL for my university, and I am having a hard time with a few things. Anyway, for this one, I am supposed to retrieve data from 2 different tables: 'STAFF' and 'BRANCH'. I want the output to display the staff name (SNAME), the start date (STARTDATE) and the area (AREA). SNAME and STARTDATE are from the table 'STAFF', and AREA is from the table 'BRANCH'. How do I access that? Also, I ONLY want to display the names and start dates of those who are in the STOKE area.
This is my code
SELECT SNAME, STARTDATE, BRANCH.AREA
FROM STAFF CROSS JOIN BRANCH
WHERE STAFF.BRANCHID = 20;
Note: The STAFF.BRANCHID = 20 is because the area 'STOKE' has a BRANCHID of 20.
This is what I get:
SNAME      STARTDATE AREA
---------- --------- -------------
SMITH      15-NOV-00 ECCLESHALL
JONES      02-MAR-01 ECCLESHALL
SONG       03-JAN-02 ECCLESHALL
SMITH      15-NOV-00 STOKE
SONG       03-JAN-02 STOKE
SMITH      15-NOV-00 STAFFORD
As you can see, the WHERE clause is not working because it outputs every area instead of stoke only.
I know I am supposed to use the JOIN function but I don't get which one and why, any useful links would be appreciated :)
Thank you
You need to use the INNER JOIN and proper join condition as follows:
SELECT S.SNAME, S.STARTDATE, B.AREA
FROM STAFF S INNER JOIN BRANCH B
ON S.BRANCHID = B.BRANCHID
WHERE B.AREA = 'STOKE';
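To make the difference concrete, here is a small sketch using Python's sqlite3 module rather than Oracle. The table contents are invented: only the column names and the fact that STOKE has BRANCHID 20 come from the question, and the other branch IDs and staff assignments are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BRANCH (BRANCHID INTEGER PRIMARY KEY, AREA TEXT);
    CREATE TABLE STAFF (SNAME TEXT, STARTDATE TEXT, BRANCHID INTEGER);
    -- Branch IDs 10 and 30 are made up; 20 = STOKE is from the question.
    INSERT INTO BRANCH VALUES (10, 'ECCLESHALL'), (20, 'STOKE'), (30, 'STAFFORD');
    INSERT INTO STAFF VALUES
        ('SMITH', '2000-11-15', 20),
        ('JONES', '2001-03-02', 10),
        ('SONG',  '2002-01-03', 20);
""")

# CROSS JOIN pairs every matching staff row with every branch row,
# so each STOKE employee shows up once per area.
cross_rows = conn.execute("""
    SELECT S.SNAME, B.AREA
    FROM STAFF S CROSS JOIN BRANCH B
    WHERE S.BRANCHID = 20
""").fetchall()

# INNER JOIN with a join condition pairs each staff row only with its own branch.
inner_rows = conn.execute("""
    SELECT S.SNAME, S.STARTDATE, B.AREA
    FROM STAFF S INNER JOIN BRANCH B
      ON S.BRANCHID = B.BRANCHID
    WHERE B.AREA = 'STOKE'
    ORDER BY S.SNAME
""").fetchall()

print(len(cross_rows))  # 6
print(inner_rows)
```

The WHERE clause in the cross join does filter the staff rows; the extra areas appear because nothing ties a staff row to a particular branch row.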
Looking through my notes for an introduction to databases, I stumbled upon a case that I do not understand (EXCEPT versus DISTINCT).
My notes say:
The two queries below have the same results, but this will not be the case in general.
First query:
Select c.first_name,c.last_name,c.email
FROM customers as c
WHERE c.country = 'Japan'
EXCEPT
Select c.first_name,c.last_name,c.email
FROM customers as c
WHERE c.last_name LIKE 'D%';
Second query:
Select DISTINCT c.first_name,c.last_name,c.email
FROM customers as c
WHERE c.country = 'Japan' AND NOT (c.last_name LIKE 'D%');
Could anyone give me some insight into the cases where the results would differ?
Number 1 selects first, last & email from customers who are from Japan and whose last names do not start with D.
Number 2 selects first, last & email, where no two records have all 3 fields the same, where the customers are from Singapore and their last names do not begin with D.
I suppose I can imagine a table where these would yield the same results, but I don't think it would ever appear except in very contrived circumstances.
Joe Smith jsmith#abc.com Japan
Joe Smith jsmith#abc.com Singapore
Would be one of them. Both queries would yield Joe Smith jsmith#abc.com. Another case would be if no-one was from either country or everyone's last name started with D, then they would both yield nothing.
None of this is tested, and the EXCEPT statement is something I've read about but never had occasion to use.
The first is looking at Japan, the second at Singapore, so I don't see why these would generally -- or specifically -- return the same data.
Even if the countries were the same you have another issue with NULL values. So, if your data looks like this:
first_name last_name email country
xxx NULL a Japan
Your first query would return the row. The second would not.
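The NULL case can be checked with a short sketch in Python's sqlite3 module (SQLite supports EXCEPT). The extra rows besides the NULL one are made up, and both queries use 'Japan' so that only the NULL handling differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (first_name TEXT, last_name TEXT, email TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        ("xxx", None, "a", "Japan"),   # the NULL last_name row from the answer
        ("Aki", "Sato", "b", "Japan"),  # made-up row
        ("Ken", "Doi", "c", "Japan"),   # made-up row, last name starts with D
    ],
)

except_rows = conn.execute("""
    SELECT c.first_name, c.last_name, c.email
    FROM customers AS c
    WHERE c.country = 'Japan'
    EXCEPT
    SELECT c.first_name, c.last_name, c.email
    FROM customers AS c
    WHERE c.last_name LIKE 'D%'
""").fetchall()

distinct_rows = conn.execute("""
    SELECT DISTINCT c.first_name, c.last_name, c.email
    FROM customers AS c
    WHERE c.country = 'Japan' AND NOT (c.last_name LIKE 'D%')
""").fetchall()

print(sorted(except_rows, key=repr))
print(distinct_rows)
```

For the NULL row, `last_name LIKE 'D%'` evaluates to NULL, and so does `NOT (NULL)`, so the WHERE clause of the second query drops it. In the first query the row is never part of the subtracted set, so it survives.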
I want to find how many modules a lecturer taught in a specific year, and I want to select the name of the lecturer and the number of modules for that lecturer.
The problem is that because I am selecting the name, I have to group by name to make it work. But what if there are two lecturers with the same name? Then SQL will merge them into one, and that would be the wrong output.
So what I really want to do is select the name but group by ID, which SQL is not allowing me to do. Is there a way around it?
Below are the tables:
Lecturer(lecturerID, lecturerName)
Teaches(lecturerID, moduleID, year)
This is my query so far:
SELECT l.lecturerName, COUNT(moduleID) AS NumOfModules
FROM Lecturer l , Teaches t
WHERE l.lecturerID = t.lecturerID
AND year = 2011
GROUP BY l.lecturerName --I want lecturerID here, but it doesn't run if I do that
SELECT a.lecturerName, b.NumOfModules
FROM Lecturer a,(
SELECT l.lecturerID, COUNT(moduleID) AS NumOfModules
FROM Lecturer l , Teaches t
WHERE l.lecturerID = t.lecturerID
AND year = 2011
GROUP BY l.lecturerID) b
WHERE a.lecturerID = b.lecturerID
You should probably just group by lecturerID and include it in the select column list. Otherwise, you're going to end up with two rows containing the same name with no way to distinguish between them.
You raise the problem of "wrong output" when grouping just by name but "undecipherable output" is just as big a problem. In other words, your desired output (grouping by ID but giving name):
lecturerName Module
------------ ------
Bob Smith 1
Bob Smith 2
is no better than your erroneous output (grouping by, and giving, name):
lecturerName Module
------------ ------
Bob Smith 3
since, while you now know that one of the lecturers taught two modules and the other taught one, you have no idea which is which.
The better output (grouping by ID and displaying both ID and name) would be:
lecturerId lecturerName Module
---------- ------------ ------
314159 Bob Smith 1
271828 Bob Smith 2
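A minimal sketch of that last variant, run against SQLite from Python, might look like this. The rows are invented to match the sample output above, and the module IDs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Lecturer (lecturerID INTEGER PRIMARY KEY, lecturerName TEXT);
    CREATE TABLE Teaches (lecturerID INTEGER, moduleID TEXT, year INTEGER);
    INSERT INTO Lecturer VALUES (314159, 'Bob Smith'), (271828, 'Bob Smith');
    -- Hypothetical module IDs; the 2010 row must not be counted.
    INSERT INTO Teaches VALUES
        (314159, 'M1', 2011),
        (271828, 'M2', 2011),
        (271828, 'M3', 2011),
        (271828, 'M4', 2010);
""")

# Group by both ID and name: the ID keeps same-named lecturers apart,
# and listing the name in GROUP BY makes it legal to select.
counts = conn.execute("""
    SELECT l.lecturerID, l.lecturerName, COUNT(t.moduleID) AS NumOfModules
    FROM Lecturer l
    JOIN Teaches t ON l.lecturerID = t.lecturerID
    WHERE t.year = 2011
    GROUP BY l.lecturerID, l.lecturerName
    ORDER BY l.lecturerID
""").fetchall()

print(counts)  # [(271828, 'Bob Smith', 2), (314159, 'Bob Smith', 1)]
```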
And, yes, I'm aware this doesn't answer your specific request but sometimes the right answer to "How do I do XYZZY?" is "Don't do XYZZY, it's a bad idea for these reasons ...".
Things like writing operating systems in COBOL, accounting packages in assembler, or anything in Pascal come to mind instantly :-)
You could subquery your count statement.
SELECT lecturername,
(SELECT Count(*)
FROM teaches t
WHERE t.lecturerid = l.lecturerid
AND t.year = 2011) AS NumOfModules
FROM lecturer l
Note that there are other ways of doing this. If you also want to eliminate the rows with no modules, you can try this:
SELECT *
FROM (SELECT lecturername,
(SELECT Count(*)
FROM teaches t
WHERE t.lecturerid = l.lecturerid
AND t.year = 2011) AS NumOfModules
FROM lecturer l) AS temp
WHERE temp.numofmodules > 0
I have a table of listings that has NAP fields and I wanted to find duplication within it - specifically where everything is the same except the house number (within 2 or 3 digits).
My table looks something like this:
Name  Housenumber  Streetname  Streettype  City      State  Zip
1     36           Smith       St          Norwalk   CT     6851
2     38           Smith       St          Norwalk   CT     6851
3     1            Kennedy     Ave         Campbell  CA     95008
4     4            Kennedy     Ave         Campbell  CA     95008
I was wondering how to set up a qry to find records like these.
I've tried a few things but can't figure out how to do it - any help would be appreciated.
Thanks
Are you looking for something that shows how many of these rows you have, like this?
SELECT
StreetName,
City,
State,
Zip,
COUNT(*)
FROM YourTable
GROUP BY StreetName, City, State, Zip
HAVING COUNT(*) > 1
Or maybe trying to find all of the rows that have the same street, city, state, and zip?
SELECT
A.HouseNumber,
A.StreetName,
A.City,
A.State,
A.Zip
FROM YourTable as A
INNER JOIN YourTable as B
ON A.StreetName = B.StreetName
AND A.City = B.City
AND A.State = B.State
AND A.Zip = B.Zip
AND A.HouseNumber <> B.HouseNumber
Here is one way to do it. You'll need a unique ID for the table to run this, as you wouldn't want to match a row against itself when it is the only one there. This will just spit out all the rows that have at least one near-duplicate.
Edit: Whoops, just realized that the comments say the street number is a varchar. In that case you could just run a cast on it. The OP never said anything in the original post about house numbers being varchar or containing both letters and numbers. As for letters in the street number field, I was a third-party shipping provider for 2 years and never saw one, with the exception of an apartment, which would be a different field. It's just as likely that someone put varchar there for some other reason (leading 0's), or for no reason. Of course there could be letters, but there is no way of knowing what's in the field without a response from the OP. With the cast to int, the query is the same except for this in each instance: Cast(mt.HouseNumber as int)
select *
from MyTable mt
where exists (select 1
from MyTable mt2
where mt.name = mt2.name
and mt.street = mt2.street
and mt.state = mt2.state
and mt.city = mt2.city
and mt2.HouseNumber between (mt.HouseNumber -3) and (mt.HouseNumber +3)
and mt.UID != mt2.UID
)
order by mt.state, mt.city, mt.street
;
Not sure how to run the -3/+3 comparison if there are letters involved, unless you know exactly where they are and can simply cut them out and then cast.
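Here is a small sketch of the same EXISTS idea against SQLite from Python, with the house number stored as text and cast to an integer. The table name `listings`, the `uid` column (standing in for the first column of the question's sample) and the fifth row are assumptions. It writes the range test as an absolute difference, which is equivalent to the BETWEEN -3 and +3 check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE listings (
        uid INTEGER PRIMARY KEY, housenumber TEXT, streetname TEXT,
        streettype TEXT, city TEXT, state TEXT, zip TEXT)
""")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "36", "Smith", "St", "Norwalk", "CT", "6851"),
        (2, "38", "Smith", "St", "Norwalk", "CT", "6851"),
        (3, "1", "Kennedy", "Ave", "Campbell", "CA", "95008"),
        (4, "4", "Kennedy", "Ave", "Campbell", "CA", "95008"),
        (5, "100", "Kennedy", "Ave", "Campbell", "CA", "95008"),  # made-up, too far away
    ],
)

near_ids = [row[0] for row in conn.execute("""
    SELECT l.uid
    FROM listings l
    WHERE EXISTS (
        SELECT 1
        FROM listings l2
        WHERE l.streetname = l2.streetname
          AND l.streettype = l2.streettype
          AND l.city = l2.city
          AND l.state = l2.state
          AND l.zip = l2.zip
          AND l.uid <> l2.uid
          AND ABS(CAST(l.housenumber AS INTEGER)
                  - CAST(l2.housenumber AS INTEGER)) <= 3
    )
    ORDER BY l.uid
""")]

print(near_ids)  # [1, 2, 3, 4]
```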
I have a relation Presidents(firstName, lastName, beginTerm, endTerm) that gives information about US Presidents. Attribute firstName is a string with the first name and, in some cases, one or more middle initials. Attribute lastName is a string with the last name of the president. For example, the previous president has firstName = 'George W.' and his father has firstName = 'George H.W.'; both have lastName = 'Bush'. The last 2 attributes, beginTerm and endTerm, are the years the president entered and left office, respectively. One subtlety is that Grover Cleveland served 2 noncontiguous terms. He appears in 2 tuples, one with the beginning and ending years of his first term and the other for the second term.
The question I have is below:
There are 2 pairs of presidents that were father and son. But there are a number of other pairs of presidents that shared a last name. Find all the last names belonging to 2 or more Presidents. Do not repeat a last name, and remember that the same person serving 2 different terms (e.g., Grover Cleveland) does not constitute a case of 2 presidents with the same last name.
I first thought the answer might be:
SELECT lastName
FROM Presidents
WHERE COUNT(lastName) > 2
EXCEPT lastName = 'Cleveland';
I'm not too sure if the COUNT() function can be used in the WHERE clause though.
Is this possible?
Thanks!
Use HAVING instead of WHERE when checking against Group functions.
SELECT lastName
FROM Presidents
WHERE lastName != 'Cleveland'
GROUP BY lastName
HAVING COUNT(lastName) > 1;
However, when solving SQL puzzles like this, you should never take the actual data into account. The query should work for all consistent data sets! I believe this is an actual solution to your problem:
SELECT DISTINCT p1.lastName
FROM Presidents p1, Presidents p2
WHERE p1.lastName = p2.lastName
AND p1.firstName != p2.firstName;
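The self join can be tried out with a short sketch in Python's sqlite3 module. The table holds only a handful of presidents, which is enough to exercise the Cleveland case. The second query is an alternative of my own, not from the answer above: it groups by last name and counts distinct first names, which should give the same result under the assumption that two different presidents never share both names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Presidents (firstName TEXT, lastName TEXT, beginTerm INTEGER, endTerm INTEGER)"
)
conn.executemany(
    "INSERT INTO Presidents VALUES (?, ?, ?, ?)",
    [
        ("John", "Adams", 1797, 1801),
        ("John Quincy", "Adams", 1825, 1829),
        ("Grover", "Cleveland", 1885, 1889),
        ("Grover", "Cleveland", 1893, 1897),
        ("George H.W.", "Bush", 1989, 1993),
        ("George W.", "Bush", 2001, 2009),
        ("Barack", "Obama", 2009, 2017),
    ],
)

# Self join: same last name, different first name. No name is hard-coded.
join_names = [row[0] for row in conn.execute("""
    SELECT DISTINCT p1.lastName
    FROM Presidents p1, Presidents p2
    WHERE p1.lastName = p2.lastName
      AND p1.firstName != p2.firstName
    ORDER BY p1.lastName
""")]

# Alternative sketch: count distinct first names per last name.
group_names = [row[0] for row in conn.execute("""
    SELECT lastName
    FROM Presidents
    GROUP BY lastName
    HAVING COUNT(DISTINCT firstName) > 1
    ORDER BY lastName
""")]

print(join_names)   # ['Adams', 'Bush']
print(group_names)  # ['Adams', 'Bush']
```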
You constrain on aggregates using HAVING, and you are also missing a GROUP BY.
SELECT lastName
FROM Presidents
WHERE lastName <> 'Cleveland'
GROUP BY lastName
HAVING COUNT(lastName) > 1;
Assuming there is an id field as well,
select presidents.id, presidents.lastname, count(*) differentguycount
from presidents left join
(select id, firstname, lastname, count(*) sameguycount
from presidents
group by id, firstname, lastname
having sameguycount > 1 ) temp on temp.id = presidents.id
where temp.firstname is null
group by presidents.id, presidents.lastname
having differentguycount > 1
As noted, the OP did not specify his database engine which could cause syntax errors. For example some databases might not allow you to use aliases in the having clause.