Eliminate rows with names that are slightly different - sql

I have a PostgreSQL table with a UUID, first name (fname) and phone:
uuid fname phone
1 JOHN 111
2 john 111
3 John 111
4 JOHN JAMES 111
5 Charles 222
6 Peter 222
7 James 222
8 Jimmy 222
9 Fred 333
10 Fred 333
11 Greg 333
I would like to keep only the phone + first name groups in which at least two names are similar. So, in this case, I would like to keep phone 111 with one of its names, and phone 333 with the name that repeats (Fred). Phone 222 would be eliminated, as none of its names are similar.
The resulting data would be:
fname phone
John 111
Fred 333
The problem I am having is when the name is similar but contains more names (as in John and John James) or when the name was mistyped (as in John and Jonh). I have tried the following:
SELECT
m1.phone,
m1.fname,
m1.uuid
FROM
master as m1
JOIN master as m2 on m1.uuid = m2.uuid
WHERE
m1.phone = m2.phone
and m1.fname ILIKE m2.fname
ORDER BY 1

The definition of similarity is a bit vague, but this works for the data you have in the question:
select m.*
from master m
where exists (select 1
from master m2
where m2.phone = m.phone and m2.uuid <> m.uuid and
(m.fname ilike '%' || m2.fname || '%' or
m2.fname ilike '%' || m.fname || '%'
)
);
Name matching is a complicated task and not well suited to SQL. However, you might want to look into Levenshtein distance and other string similarity metrics if this is a problem that you are facing.
Note: This keeps all names that match. If you want only one row per phone, you can use distinct on.
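The Levenshtein idea mentioned above can be sketched outside the database. This is a minimal illustration in Python, not the answerer's method: the helper names `levenshtein` and `similar`, the containment check, and the threshold of 1 edit are all assumptions chosen for the sample data. (In PostgreSQL itself, the fuzzystrmatch extension provides a `levenshtein()` function that could play the same role inside a query.)

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similar(a: str, b: str, max_distance: int = 1) -> bool:
    """Hypothetical rule: case-insensitive containment or a small edit distance."""
    a, b = a.lower(), b.lower()
    return a in b or b in a or levenshtein(a, b) <= max_distance


# Sample rows from the question: (uuid, fname, phone)
rows = [
    (1, "JOHN", "111"), (2, "john", "111"), (3, "John", "111"),
    (4, "JOHN JAMES", "111"), (5, "Charles", "222"), (6, "Peter", "222"),
    (7, "James", "222"), (8, "Jimmy", "222"), (9, "Fred", "333"),
    (10, "Fred", "333"), (11, "Greg", "333"),
]

by_phone = defaultdict(list)
for _uuid, fname, phone in rows:
    by_phone[phone].append(fname)

# Keep a phone when at least one pair of its names counts as similar.
kept_phones = sorted(
    phone for phone, names in by_phone.items()
    if any(similar(a, b) for a, b in combinations(names, 2))
)
print(kept_phones)  # ['111', '333']
```

Note that plain Levenshtein counts a transposition such as John/Jonh as two edits, so catching that typo would need a threshold of 2 or a transposition-aware metric.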

Related

Self JOIN to find the parent detail which matches with the row data -

I am trying to query in MS SQL and I can not resolve it. I have a table employees:
Id Name Surname FatherName MotherName WifeName Pincode isChild
-- ------- ------- ---------- ---------- -------- ------- -------
1 John Green James Sue null 101011 1
2 Michael Sloan Barry Lilly null 101011 1
3 Sally Green Andrew Molly Jemi 101011 1
4 Barry Sloan Soul Paul Lilly 101011 0
5 James Green Ned White Sue 101011 0
I want a query that selects the rows where a child's father name and mother name match another row's name and wife name. For example, for id = 1, John's father name (James) and mother name (Sue) match id 5, whose name is James and whose wife name is Sue. So my query should return (this is my expected result):
Id Name Surname FatherName MotherName WifeName Pincode isChild
-- ------- ------- ---------- ---------- -------- ------- -------
5 James Green Ned White Sue 101011 0
4 Barry Sloan Soul Paul Lilly 101011 0
I tried the query below, but it checks for James only. How can I change my query so that it checks all the names and returns the expected result?
select * FROM employees
where first_name like '%James%'
and wife_name like '%Sue%'
and pincode=101011;
Any tips on this will be really helpful. I am new to joins, need help on writing self join to get the result.
…
select *
from thetable as p -- the parent/father
where exists -- with one child at least
(
select *
from thetable as c
where c.fathername = p.name
and c.mothername = p.wifename
-- lastname?
)
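To check the EXISTS query above against the question's sample rows, here is a small sketch using Python's built-in sqlite3 module so that it runs without a server. The table name `employees` and the lowercase column names are assumptions based on the question's table; the query itself is the one shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        id INTEGER, name TEXT, surname TEXT, fathername TEXT,
        mothername TEXT, wifename TEXT, pincode INTEGER, ischild INTEGER
    )
""")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "John", "Green", "James", "Sue", None, 101011, 1),
        (2, "Michael", "Sloan", "Barry", "Lilly", None, 101011, 1),
        (3, "Sally", "Green", "Andrew", "Molly", "Jemi", 101011, 1),
        (4, "Barry", "Sloan", "Soul", "Paul", "Lilly", 101011, 0),
        (5, "James", "Green", "Ned", "White", "Sue", 101011, 0),
    ],
)

# p is the candidate parent; it is kept when at least one child row c
# names p as its father and p's wife as its mother.
parent_ids = [
    row[0]
    for row in conn.execute("""
        SELECT p.id
        FROM employees AS p
        WHERE EXISTS (
            SELECT 1
            FROM employees AS c
            WHERE c.fathername = p.name
              AND c.mothername = p.wifename
        )
        ORDER BY p.id
    """)
]
print(parent_ids)  # [4, 5]
```

Rows with a NULL wife name can never match, because a comparison with NULL is not true, which is why the children themselves drop out.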
Too long for a comment, but also not intended as a slam against what you are working with. Please take as constructive criticism.
The table design is VERY POOR, and correcting it should come first, before you get too deep into whatever you are working on. A more typical design would have a table of people. To get the relationships, you could go a couple of ways. One is to add two additional IDs to each person's record: FatherID and MotherID. These IDs join the child directly to the parent rows instead of matching against hard-coded strings. Take a surname like Smith or Jones: many instances of a "John Smith" may exist, with a lower probability that each also has a matching wife's name of Sue, Mary or whatever, but even that could lead to multiple possibilities. Yes, you are adding a PIN, but even a computer can generate a random PIN of 1234.
By having the IDs, there is NO ambiguity of who the relationship is with.
If the data were slightly altered to something like
Id Name Surname FatherID MotherID SpouseID
-- ------- ------- ---------- ---------- --------
1 John Green 5 6 null
2 Michael Sloan 4 3 null
3 Lilly Sloan null null 4
4 Barry Sloan null null 3
5 James Green 9 10 6
6 Sue Green 7 8 5
7 Bill Jones null null 8
8 Martha Jones null null 7
9 Brian Green null null 10
10 Beth Smith-Green null null 9
In this modified example, you can see right away that ID#1 John Green's parents are James (father, ID#5) and Sue (mother, ID#6). James is in turn a child of Brian (father, ID#9) and Beth (mother, ID#10). The scenario goes up to the grandparent level: James and Sue are each also children of their respective parents, Sue's parents having the Jones surname.
For Michael Sloan, the parents are #4 Barry and #3 Lilly.
I additionally added a spouse ID. This prevents people's names from being copied redundantly all over the table. You can then query on the child's parent IDs instead of relying on a hopeful name LIKE guess.
So, even though this does not solve your relatively simple query, fixing the underlying foundation of your database and its relations will, long-term, ease your querying in the future.
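As a rough sketch of how the ID-based design would be queried, the following uses Python's built-in sqlite3 module with the modified sample data. The table name `people` and the lowercase column names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE people (
        id INTEGER PRIMARY KEY, name TEXT, surname TEXT,
        fatherid INTEGER, motherid INTEGER, spouseid INTEGER
    )
""")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "John", "Green", 5, 6, None),
        (2, "Michael", "Sloan", 4, 3, None),
        (3, "Lilly", "Sloan", None, None, 4),
        (4, "Barry", "Sloan", None, None, 3),
        (5, "James", "Green", 9, 10, 6),
        (6, "Sue", "Green", 7, 8, 5),
        (7, "Bill", "Jones", None, None, 8),
        (8, "Martha", "Jones", None, None, 7),
        (9, "Brian", "Green", None, None, 10),
        (10, "Beth", "Smith-Green", None, None, 9),
    ],
)

# Join each child to both parents by ID; no string matching is involved.
family = conn.execute("""
    SELECT c.name, f.name, m.name
    FROM people AS c
    JOIN people AS f ON f.id = c.fatherid
    JOIN people AS m ON m.id = c.motherid
    ORDER BY c.id
""").fetchall()

for child, father, mother in family:
    print(f"{child}: father {father}, mother {mother}")
```

People without recorded parents (NULL FatherID or MotherID) drop out of the inner joins; a LEFT JOIN would keep them.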
Try this:
SELECT
    T2.*                                        -- T2 is the parent row
FROM employees T1                               -- T1 is the child row
JOIN employees T2 ON T2.Name = T1.FatherName
                 AND T2.WifeName = T1.MotherName

How can I select rows for specific cells from the same table?

I have a table with tenants and their addresses.
A tenant can have several addresses, and each address can appear several times (closed, open, modified).
The tenant first appears with an address (First), after which he can have several changes to that first address (closed, open, modified), or he can have other addresses (closed, open, modified).
How can I extract the closing date of the first address?
The problem comes with a twist: the street names are not always exactly the same as in the first address. They can contain St., Ave., etc.
The table looks like this:
id  Tenant  code    Street     Number  Date
--  ------  ------  ---------  ------  ----------
1   Alice   First   Abbey      5       01.01.2021
2   Alice   Modify  Abbey Ro.  5       02.01.2021
3   Alice   Open    Elm St     3       02.01.2021
4   Alice   Close   St. Abbey  5       05.01.2021
5   Bob     First   Fifth      10      01.02.2021
6   Bob     Open    Fifth Ave  222     01.02.2021
7   Bob     Close   Fifth Ave  222     05.02.2021
8   Bob     Close   Ave Fifth  10      06.02.2021
The expected result should be like this:
id  Tenant  CloseID  Street  Number  Date
--  ------  -------  ------  ------  ----------
1   Alice   4        Abbey   5       05.01.2021
2   Bob     8        Fifth   10      06.02.2021
I have a feeling the answer is there but I can't grab it :)
Thanks in advance for your support.
You need something that is equal in both records (First and Close).
For example, if you are sure that Number is always the same and the tenant is unique, you can filter on those:
SELECT * FROM tenant WHERE Number = 10 AND Tenant = 'Bob' AND code = 'Close'
But this is only an example; you cannot rely on it for your tenants, because you can easily run into bugs, for example if a tenant has more than one address with the same Number.
If you know where the extra word (St., Ave, etc.) appears in the Street, you can use % with LIKE.
Example:
7 | Bob | Close | Fifth Ave | 222 | 05.02.2021
sql:
SELECT * FROM tenant WHERE Number = 222 AND Street LIKE 'Fifth%' AND code = 'Close'
8 | Bob | Close | Ave Fifth | 10 | 06.02.2021
sql:
SELECT * FROM tenant WHERE Number = 10 AND Street LIKE '%Fifth' AND code = 'Close'
Otherwise you have to change the table, for example by adding an address-number field that indicates whether the address is the first, second, third, etc. for that person.
I would do it like this:
with cte as (
    select tenant, code, street, number, date,
           -- rn = 1 marks the latest row per tenant; assumes date is a real date column
           row_number() over (partition by tenant order by date desc) as rn
    from tenants
)
select * from cte where rn = 1;
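Neither answer above ties the Close row back to the First address explicitly. One possible way to do that, sketched here with Python's built-in sqlite3 module, is a self-join that requires the same tenant, the same Number, and a Close street that contains the First street name. This is an assumption about what "same address" means, it is not taken from either answer, and it would mis-match if two of a tenant's streets shared both a number and a name fragment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tenants (
        id INTEGER, tenant TEXT, code TEXT,
        street TEXT, number INTEGER, date TEXT
    )
""")
conn.executemany(
    "INSERT INTO tenants VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Alice", "First", "Abbey", 5, "01.01.2021"),
        (2, "Alice", "Modify", "Abbey Ro.", 5, "02.01.2021"),
        (3, "Alice", "Open", "Elm St", 3, "02.01.2021"),
        (4, "Alice", "Close", "St. Abbey", 5, "05.01.2021"),
        (5, "Bob", "First", "Fifth", 10, "01.02.2021"),
        (6, "Bob", "Open", "Fifth Ave", 222, "01.02.2021"),
        (7, "Bob", "Close", "Fifth Ave", 222, "05.02.2021"),
        (8, "Bob", "Close", "Ave Fifth", 10, "06.02.2021"),
    ],
)

# f = the tenant's First row, c = a Close row for the same address.
closed = conn.execute("""
    SELECT f.tenant, c.id AS close_id, f.street, f.number, c.date
    FROM tenants AS f
    JOIN tenants AS c
      ON c.tenant = f.tenant
     AND c.number = f.number
     AND c.code = 'Close'
     AND c.street LIKE '%' || f.street || '%'
    WHERE f.code = 'First'
    ORDER BY f.id
""").fetchall()

for row in closed:
    print(row)
# ('Alice', 4, 'Abbey', 5, '05.01.2021')
# ('Bob', 8, 'Fifth', 10, '06.02.2021')
```

Bob's Close row with id 7 is excluded because its Number (222) differs from the First address (10), even though its street also contains "Fifth".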

SQL Server group by? [duplicate]

I'm not sure how to word my question so perhaps an example would be best. I'm looking for a function or statement that would produce the following result from a single table. For each name, return the row with largest id.
ID NAME ADDRESS
1 JOHN DOE 123 FAKE ST.
2 JOHN DOE 321 MAIN ST.
3 JOHN DOE 333 2ND AVE.
4 MARY JANE 222 1ST. AVE
5 MARY JANE 444 POPLAR ST.
6 SUZY JO 999 8TH AVE.
DESIRED RESULT
3 JOHN DOE 333 2ND AVE.
5 MARY JANE 444 POPLAR ST.
6 SUZY JO 999 8TH AVE.
One option is to use the row_number window function, which assigns a row number to each row of the result set. You define the grouping and ordering in the over clause: in this case, partition by (group on) the name field and order by the id field descending. Finally, filter for rn = 1, which returns the row with the largest id in each group.
select *
from (
select *, row_number() over (partition by name order by id desc) rn
from yourtable
) t
where rn = 1
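The query above can be tried against the question's sample rows with Python's built-in sqlite3 module. This is only a sketch: it assumes SQLite 3.25 or newer (the first version with window functions), and it selects explicit columns instead of `*` so the helper column rn stays out of the output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (id INTEGER, name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [
        (1, "JOHN DOE", "123 FAKE ST."),
        (2, "JOHN DOE", "321 MAIN ST."),
        (3, "JOHN DOE", "333 2ND AVE."),
        (4, "MARY JANE", "222 1ST. AVE"),
        (5, "MARY JANE", "444 POPLAR ST."),
        (6, "SUZY JO", "999 8TH AVE."),
    ],
)

# Number the rows of each name from the highest id down, then keep number 1.
latest = conn.execute("""
    SELECT id, name, address
    FROM (
        SELECT *, row_number() OVER (PARTITION BY name ORDER BY id DESC) AS rn
        FROM yourtable
    ) AS t
    WHERE rn = 1
    ORDER BY id
""").fetchall()

for row in latest:
    print(row)
# (3, 'JOHN DOE', '333 2ND AVE.')
# (5, 'MARY JANE', '444 POPLAR ST.')
# (6, 'SUZY JO', '999 8TH AVE.')
```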

Coalesce records in SQL Server when repeated 2 times

I have records as follows in a table on SQL Server 2005:
fname lname address zip
xxx yyy UK 001
zzz yyy UK 001
aaa yyy UK 002
ddd jjj US 003
eee jjj US 003
I need to get the result in the following format
fname lname address zip
xxx,zzz yyy UK 001
ddd,eee jjj US 003
Basically, every pair of records that share the same address and zip (a count of exactly 2) should have their first names combined, separated by a comma.
OK, here is my approach, but it is not working and I am stuck right now:
select fname, lname, address, zip from table people
where address is not null
and zip is not null
group by address,zip
having count(address)=2 and count (zip)=2
order by address
-- Now to coalesce the records I am using
SELECT fname = COALESCE(fname + ', ', '') + ISNULL(fname, 'N/A'), fname, lname,streetname, housenumber
FROM people
WHERE address is not null and zip is not null
group by address,zip
having count(address)=2 and count (zip)=2
order by address
I don't think this is a duplicate because it doesn't require anything like group_concat(). The OP is specifically asking for two times, and you can get that like this:
select min(fname) + ',' + max(fname), lname, address, zip
from people t
group by lname, address, zip
having count(*) = 2;
Of course, a general answer with more matching rows can't be solved this way, but the question specifically says "zip 2 times".
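The min/max trick above can be checked against the question's rows with Python's built-in sqlite3 module. This is a sketch under one stated difference: SQLite concatenates strings with `||`, whereas the SQL Server answer uses `+`. The table name `people` follows the question's own attempt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (fname TEXT, lname TEXT, address TEXT, zip TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("xxx", "yyy", "UK", "001"),
        ("zzz", "yyy", "UK", "001"),
        ("aaa", "yyy", "UK", "002"),
        ("ddd", "jjj", "US", "003"),
        ("eee", "jjj", "US", "003"),
    ],
)

# With exactly two rows per group, min() and max() are the two first names.
pairs = conn.execute("""
    SELECT min(fname) || ',' || max(fname) AS fname, lname, address, zip
    FROM people
    GROUP BY lname, address, zip
    HAVING count(*) = 2
    ORDER BY zip
""").fetchall()

for row in pairs:
    print(row)
# ('xxx,zzz', 'yyy', 'UK', '001')
# ('ddd,eee', 'jjj', 'US', '003')
```

The row for zip 002 is dropped because its group has a single member.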

SQL statements without group by

If I wanted to find all values in a table that occur more than twice without using group by, how would I do that? I understand how to do this with group by and was curious how to do it without group by (EDIT: could you do this with join?).
For example, if I had last names in a certain zip code, and I wanted to find entries with this last name more than twice, how would I do this without group by in SQL statements?
I tried
select name, count() from population order by name asc having count() > 2;
but that doesn't do what I want it to. Any suggestions?
Since this is tagged only as sql, it seems a general solution is being sought. Since the SQL:2003 revision, it is fair to say that this can be solved with window functions:
SELECT name FROM (
    SELECT
        -- rn (not "rank", a keyword in some dialects) counts the rows of each name
        ROW_NUMBER() OVER (PARTITION BY name ORDER BY name) AS rn,
        name
    FROM population
) s
WHERE rn = 3
Anyway, the fact that it is possible to solve this without a GROUP BY doesn't mean that it should :)
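To see why filtering on the third row number works, here is a small sketch using Python's built-in sqlite3 module (assuming SQLite 3.25 or newer for window functions). The sample names are hypothetical and not from the question. Each name that occurs at least three times has exactly one row numbered 3, so it appears once in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (name TEXT)")
# Hypothetical data: Smith x3, Jones x2, Lee x1, Brown x4
sample = ["Smith"] * 3 + ["Jones"] * 2 + ["Lee"] + ["Brown"] * 4
conn.executemany("INSERT INTO population VALUES (?)", [(n,) for n in sample])

# No GROUP BY: number the rows within each name and keep the third one.
frequent = [
    row[0]
    for row in conn.execute("""
        SELECT name FROM (
            SELECT
                ROW_NUMBER() OVER (PARTITION BY name ORDER BY name) AS rn,
                name
            FROM population
        ) AS s
        WHERE rn = 3
        ORDER BY name
    """)
]
print(frequent)  # ['Brown', 'Smith']
```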
It seems a correlated subquery can work for you.
Assuming the data set given below:
Id Zip Lastname
--- ----- --------
101 12345 John
102 12345 John
103 12345 John
104 12345 Ram
105 12345 Kelly
106 12345 Kelly
107 45678 Krishna
108 45678 Krishna
109 45678 Krishna
110 45678 David
111 45678 David
Query
select * from test.population pop1
where 2 < (select count(*) from test.population pop2
           where pop1.Lastname = pop2.Lastname and pop1.Zip = pop2.Zip)
The output of above query is
Id Zip Lastname
--- ------ --------
101 12345 John
102 12345 John
103 12345 John
107 45678 Krishna
108 45678 Krishna
109 45678 Krishna
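The correlated subquery can be checked against this data set with Python's built-in sqlite3 module. This is a sketch only: the `test.` schema prefix is dropped because the in-memory database has a single schema, and the column is named `zip` to match the table header above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (id INTEGER, zip TEXT, lastname TEXT)")
conn.executemany(
    "INSERT INTO population VALUES (?, ?, ?)",
    [
        (101, "12345", "John"), (102, "12345", "John"), (103, "12345", "John"),
        (104, "12345", "Ram"), (105, "12345", "Kelly"), (106, "12345", "Kelly"),
        (107, "45678", "Krishna"), (108, "45678", "Krishna"),
        (109, "45678", "Krishna"), (110, "45678", "David"),
        (111, "45678", "David"),
    ],
)

# For each row, count the rows sharing its last name and zip; keep it if > 2.
matching_ids = [
    row[0]
    for row in conn.execute("""
        SELECT pop1.id
        FROM population AS pop1
        WHERE 2 < (
            SELECT count(*)
            FROM population AS pop2
            WHERE pop1.lastname = pop2.lastname
              AND pop1.zip = pop2.zip
        )
        ORDER BY pop1.id
    """)
]
print(matching_ids)  # [101, 102, 103, 107, 108, 109]
```

Unlike the window-function variant, this returns every row of each qualifying group, not one row per name.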