SQL join table on partial string - sql

I have two tables in a Postgres database:
Table A:
**Middle_name**
John
Joe
Fred
Jim Bob
Paul-John
Table B:
**Full_name**
Fred, Joe, Bobda
Jason, Fred, Anderson
Tom, John, Jefferson
Jackson, Jim Bob, Sager
Michael, Paul-John, Jensen
Sometimes the middle name is hyphenated or contains a space, but it never contains a comma. Hyphenated or double middle names are written identically in Table A and Table B.
I want to join the tables on Middle_name and Full_name. The difficult part is that the join has to check only the values between the commas in Full_name. Otherwise it might match the first name accidentally.
I've been using the query below but I just realized that there is nothing stopping it from matching the middle name to a first name accidentally.
SELECT Full_name, Middle_name
FROM B
JOIN A
ON POSITION(Middle_name IN Full_name)>0
I'm wondering how I can refactor this query to match only the middle name (assuming they all appear in the same format).

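Use split_part('Fred, Joe, Bobda', ',', 2), which returns the second comma-separated field. With ',' as the delimiter the result is ' Joe', including the leading space, so trim it before comparing:
SELECT Full_name, Middle_name
FROM B
JOIN A
ON trim(split_part(B.Full_name, ',', 2)) = A.Middle_name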

If there is always exactly one space after the comma, and everybody has a middle name like your sample data suggests, the space can just be part of the delimiter in split_part():
SELECT full_name, middle_name
FROM A
JOIN B ON split_part(B.full_name, ', ', 2) = A.middle_name;
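If the spacing around the commas is not guaranteed, trimming the extracted field is a more forgiving variant. The following is a minimal sketch for PostgreSQL; the inline rows stand in for the real tables, and the irregular spacing is invented for illustration.

```sql
-- Hypothetical inline data mirroring the question's sample rows.
WITH a(middle_name) AS (
    VALUES ('John'), ('Joe'), ('Fred'), ('Jim Bob'), ('Paul-John')
),
b(full_name) AS (
    VALUES ('Fred, Joe, Bobda'),
           ('Jason,Fred,  Anderson'),       -- no space after the first comma
           ('Jackson,  Jim Bob , Sager')    -- extra spaces around the middle name
)
SELECT b.full_name, a.middle_name
FROM b
-- Take the second comma-separated field and strip surrounding spaces.
JOIN a ON btrim(split_part(b.full_name, ',', 2)) = a.middle_name;
```

Because only the second field is compared, the first name 'Fred' in 'Fred, Joe, Bobda' cannot match the middle name 'Fred'.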
Related:
Split comma separated column data into additional columns

Related

How to compare 2 columns with similar data (eg. not same order, with comma)

I'm dealing with a dirty database whose data is not normalized. One column holds a name such as
"john kevin smith", which I need to compare with a column from another table that could hold "Kevin john smith", "smith, kevin john", or the same value as the original. Based on that comparison I need to figure out whether they point to the same record.
I'm trying to figure out how to do this with SQL Server 2012.
I have been testing with the JaroWinkler function without luck. I also tried the fnSplit function, but that doesn't seem to do the trick. I believe I might have to normalize the values first, then split them into words and compare the words with each other, but I'm drawing a blank on the process.
Any suggestions?
UPDATE:
with a split function and some replaces I'm able to "normalize" the columns, and receive the results in a table function like this:
now I just need to figure out how to compare with another set of values, as I'll get something similar when executing to the column I want to compare with.
For comparison you can just use the EXISTS clause.
Sample data
CREATE TABLE TEST(
VAL_A varchar(200),
VAL_B varchar(200)
);
INSERT INTO TEST (VAL_A, VAL_B) VALUES
('john kevin smith', 'Kevin john smith'),
('john kevin smith', 'Kevin, john smith'),
('Alpha beta gamma', 'beta delta alpha');
Query:
SELECT VAL_A, VAL_B
FROM TEST
-- A row passes when all words of one side are contained in the other side;
-- use AND instead of OR to require exactly the same set of words.
WHERE NOT EXISTS (
SELECT value FROM [dbo].[fn_Split](VAL_A, ' ')
EXCEPT
SELECT value FROM [dbo].[fn_Split](REPLACE(VAL_B, ',', ''), ' ')
)
OR
NOT EXISTS (
SELECT value FROM [dbo].[fn_Split](REPLACE(VAL_B, ',', ''), ' ')
EXCEPT
SELECT value FROM [dbo].[fn_Split](VAL_A, ' ')
)
This will return the matching rows.
+-------------------+-------------------+
| VAL_A | VAL_B |
+-------------------+-------------------+
| john kevin smith | Kevin john smith |
| john kevin smith | Kevin, john smith |
+-------------------+-------------------+
In the output, the alpha/beta/gamma/delta row does not appear because its words do not match. To get the non-matching rows instead, change both NOT EXISTS to EXISTS and the OR to AND.
You can use the joins if the other column is coming from another table. Also use the REPLACE statements accordingly.
You can use STRING_SPLIT(), which is available from SQL Server 2016 onward (so not on SQL Server 2012):
A table-valued function that splits a string into rows of substrings,
based on a specified separator character
SELECT a.Value
FROM STRING_SPLIT('john kevin smith', ' ') a
INNER JOIN STRING_SPLIT('Kevin john smith', ' ') b on a.Value = b.Value
Just an idea: find and replace all spaces with a comma for that field in each of the tables, and place the results of each table into a separate #temp table. Once you have done that, split the values with string_split (or fnSplit, depending on your SQL Server version) and order the parts alphabetically.
Now join these tables on the resulting column value and see where that gets you.

Matching pattern in SQL

I have two tables and need to match all records by name. The problem is that in one table name format is FirstName LastName, in another table - LastName FirstName, and I cannot split into separate columns because some records might have few first names or last names, so I don't know where first or last name ends or starts.
Eg. in first table I have John Erick Smith and need to join all records from another table where the name is Smith John Erick.
Any solution in SQL?
I think you can use string functions to get the piece of string (in 'John Erick Smith' type column) after the last space as a surname and stick it to front. Then you could compare the strings. That is assuming you don't have spaces in surnames.
Here is MSDN article on how to do it.
DECLARE @string nvarchar(max) = 'FirstName SecondName Surname'
-- Subtract 1 so the surname is taken without the space in front of it.
SELECT RIGHT(@string, NULLIF(charindex(' ', REVERSE(@string)),0) - 1) + ' ' +
REVERSE(RIGHT(REVERSE (@string), len(@string) - NULLIF(charindex(' ', REVERSE(@string)),0)))
Returns:
Surname FirstName SecondName
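The same rearrangement can be used directly in a join condition. The sketch below is T-SQL and assumes, as above, that surnames contain no spaces; the table variables and their rows are hypothetical stand-ins for the two real tables.

```sql
-- Hypothetical sample data: @t1 holds 'FirstName(s) LastName',
-- @t2 holds 'LastName FirstName(s)'.
DECLARE @t1 TABLE (name nvarchar(100));
DECLARE @t2 TABLE (name nvarchar(100));
INSERT INTO @t1 VALUES (N'John Erick Smith'), (N'Anna Lee');
INSERT INTO @t2 VALUES (N'Smith John Erick'), (N'Lee Anna'), (N'Jones Tom');

SELECT t1.name AS first_last, t2.name AS last_first
FROM @t1 AS t1
-- pos = distance of the last space from the end of the string; NULL if there is no space.
CROSS APPLY (SELECT NULLIF(CHARINDEX(' ', REVERSE(t1.name)), 0) AS pos) AS p
JOIN @t2 AS t2
  -- Move the last word to the front and compare with the other format.
  ON t2.name = RIGHT(t1.name, p.pos - 1) + ' ' + LEFT(t1.name, LEN(t1.name) - p.pos);
```

Single-word names produce a NULL position and therefore never match, which seems safer than guessing.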
First check whether you have other tables that store "FirstName" and "LastName" separately, which you could use instead of the combined "FirstName LastName" field. Oracle systems commonly have tables like this for persons or employees, so you may have something similar.
If the "LastName FirstName" data uses a comma as a separator, you can use a substring to separate the LastName from the FirstName.
Another alternative is to join on IDs (e.g. employee IDs), if they are available.

Need query to list rows where two columns are relatively the same with caveat in Postgresql

Most examples compare two or more columns for the same value between two tables. I need to compare two columns within a single table, relative to a third column, owner, which holds only three distinct values. These are 3 different first names.
I have seen table compares, but this is in one table and data is separated already by a column value. The table is thousands of rows. Owner column is only three different names.
So Joe, Sam and Jim have rows of names. When any of these two people have a common first and last name listed in the row, they will be listed as output with other column data like location and zip. So if Joe and Sam have two common names, the output will list:
Owner,firstname,lastname,location,zip
Joe,Bob,Smith,Dallas,37377
Sam,Bob,Smith,Dallas,37377
Jim had no Bob,Smith,Dallas,37377 so it is not listed in the group with Joe and Sam. In summation, the results will be a few hundred lines of these 3 owners grouped where a common name is found. I will need to use percentages in the query in case some one uses Bobby or Smiths. Therefore a name like Smithson will show up but I will adjust.
I had this all typed out and the page sees it as code so I apologize I had to abbreviate.
The most straightforward thing to use is a self-join, though an EXISTS may be quicker.
First, the self-join:
--note DISTINCT here because if Joe, Sam, and Jim all have records with the name we don't want duplicates
select DISTINCT f.owner, f.firstname, f.lastname, f.location, f.zip
from tablename f
join tablename s on f.firstname = s.firstname and f.lastname = s.lastname and f.owner <> s.owner
Then, with an EXISTS:
select f.owner, f.firstname, f.lastname, f.location, f.zip
from tablename f
WHERE EXISTS(select 1 from tablename s WHERE s.firstname = f.firstname and s.lastname = f.lastname and s.owner <> f.owner)
Of course, if instead of equality you want 'Smith' and 'Smithson' to match, you can replace each equality comparison with a prefix match in either direction, something like: (f.firstname ilike (s.firstname||'%') OR s.firstname ilike (f.firstname||'%')), and the same pattern for lastname. Any other comparison can be substituted the same way.
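Put together, the EXISTS form with prefix matching on both name columns might look like the sketch below (PostgreSQL). The inline rows are invented sample data standing in for the real table, and note that any % or _ inside a stored name would act as a wildcard here.

```sql
-- Hypothetical rows standing in for the real table.
WITH tablename(owner, firstname, lastname, location, zip) AS (
    VALUES ('Joe', 'Bob',   'Smith',    'Dallas', '37377'),
           ('Sam', 'Bobby', 'Smithson', 'Dallas', '37377'),
           ('Jim', 'Alice', 'Jones',    'Austin', '73301')
)
SELECT f.owner, f.firstname, f.lastname, f.location, f.zip
FROM tablename f
WHERE EXISTS (
    SELECT 1
    FROM tablename s
    WHERE s.owner <> f.owner
      -- Case-insensitive prefix match in either direction on each column.
      AND (f.firstname ILIKE (s.firstname || '%') OR s.firstname ILIKE (f.firstname || '%'))
      AND (f.lastname  ILIKE (s.lastname  || '%') OR s.lastname  ILIKE (f.lastname  || '%'))
);
```

With these rows, Joe's and Sam's entries are listed because 'Bob'/'Bobby' and 'Smith'/'Smithson' are prefixes of each other, while Jim's row has no counterpart.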

How can I compare two name strings that are formatted differently in SQL Server?

What would be the best approach for comparing the following set of strings in SQL Server?
Purpose: The main purpose of this Stored Procedure is to compare a set of input Names of Customers to names that Exist in the Customer database for a specific accounts. If there is a difference between the input name and the name in the Database this should trigger the updating of the Customer Database with the new name information.
Conditions:
Format of Input: FirstName [MiddleName] LastName
Format of Value in Database: LastName, FirstName MiddleName
The complication arises when names like this are presented,
Example:
Input: Dr. John A. Mc Donald
Database: Mc Donald, Dr. John A.
For last names that consist of 2 or more parts what logic would have to be put into place
to ensure that the lastname in the input is being compared to the lastname in the database and likewise for the first name and middle name.
I've thought about breaking the database values up into a temp HASH table since I know that everything before the ',' in the database is the last name. I could then check to see if the input contains the lastname and split out the FirstName [MiddleName] from it to perform another comparison to the database for the values that come after the ','.
There is a second part to this however. In the event that the input name has a completely New last name (i.e. if the name in the database is Mary Smith but the updated input name is now Mary Mc Donald). In this case comparing the database value of the last name before the ',' to the input name will result in no match which is correct, but at this point how does the code know where the last name even begins in the input value? How does it know that her Middle name isn't MC and her last name Donald?
Has anyone had to deal with a similar problem like this before? What solutions did you end up going with?
I greatly appreciate your input and ideas.
Thank you.
Realistically, it's extremely computationally difficult (if not impossible) to know if a name like "Mary Jane Evelyn Scott" is first-middle-last1-last2, first1-first2-middle-last, first1-first2-last1-last2, or some other combination... and that's not even getting into cultural considerations...
So personally, I would suggest a change in the data structure (and, correspondingly, the application's input fields). Instead of a single string for name, break it into several fields, e.g.:
FullName{
title, //i.e. Dr., Professor, etc.
firstName, //or given name
middleName, //doesn't exist in all countries!
lastName, //or surname
qualifiers //i.e. Sr., Jr., fils, D.D.S., PE, Ph.D., etc.
}
Then the user could choose that their first name is "Mary", their middle name is "Jane Evelyn", and their last name is "Scott".
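As a rough illustration of that structure in SQL Server, the fields could be separate columns, with the legacy "LastName, FirstName MiddleName" string derived as a computed column. The table name, column names, and sizes below are all hypothetical.

```sql
-- Hypothetical table; names and lengths are illustrative only.
CREATE TABLE CustomerName (
    CustomerId  int           NOT NULL PRIMARY KEY,
    Title       nvarchar(20)  NULL,      -- e.g. Dr., Professor
    FirstName   nvarchar(100) NOT NULL,  -- or given name
    MiddleName  nvarchar(100) NULL,      -- doesn't exist in all countries
    LastName    nvarchar(100) NOT NULL,  -- or surname
    Qualifiers  nvarchar(50)  NULL,      -- e.g. Sr., Jr., Ph.D.
    -- Derived legacy format; NULL parts drop out because NULL + text is NULL
    -- under the default CONCAT_NULL_YIELDS_NULL setting.
    DisplayName AS (LastName + N', '
                    + ISNULL(Title + N' ', N'')
                    + FirstName
                    + ISNULL(N' ' + MiddleName, N''))
);
```

For Title 'Dr.', FirstName 'John', MiddleName 'A.' and LastName 'Mc Donald', DisplayName comes out as 'Mc Donald, Dr. John A.', so the ambiguity about where the last name starts never arises.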
UPDATE
Based on your comments, if you must do this entirely in SQL, I'd do something like the following:
1. Build a table for all possible combinations of "lastname, firstname [middlename]" given an input string "firstname [middlename] lastname"
2. Run a query based on the join of your original data and all possible orderings.
So, step 1. would take the string "Dr. John A. Mc Donald" and create the table of values:
'Donald, Dr. John A. Mc'
'Mc Donald, Dr. John A.'
'A. Mc Donald, Dr. John'
'John A. Mc Donald, Dr.'
Then step 2. would search for all occurrences of any of those strings in the database.
Assuming MSSQL 2005 or later, step 1. can be achieved using some recursive CTE, and a modification of a method I've used to split CSV strings (found here) (SQL isn't the ideal language for this form of string manipulation...):
declare @str varchar(200)
set @str = 'Dr. John A. Mc Donald'
--Create a numbers table
select [Number] = identity(int)
into #Numbers
from sysobjects s1
cross join sysobjects s2
create unique clustered index Number_ind on #Numbers(Number) with IGNORE_DUP_KEY
;with nameParts as (
--Split the name string at the spaces.
select [ord] = row_number() over(order by Number),
[part] = substring(fn1, Number, charindex(' ', fn1+' ', Number) - Number)
from (select @str fn1) s
join #Numbers n on substring(' '+fn1, Number, 1) = ' '
where Number<=Len(fn1)+1
),
lastNames as (
--Build all possible lastName strings.
select [firstOrd]=ord, [lastOrd]=ord, [lastName]=cast(part as varchar(max))
from nameParts
where ord!=1 --remove the case where the whole string is the last name
UNION ALL
select firstOrd, p.ord, l.lastName+' '+p.part
from lastNames l
join nameParts p on l.lastOrd+1=p.ord
),
firstNames as (
--Build all possible firstName strings.
select [firstOrd]=ord, [lastOrd]=ord, [firstName]=cast(part as varchar(max))
from nameParts
where ord!=(select max(ord) from nameParts) --remove the case where the whole string is the first name
UNION ALL
select p.ord, f.lastOrd, p.part+' '+f.firstName
from firstNames f
join nameParts p on f.firstOrd-1 = p.ord
)
--Combine for all possible name strings.
select ln.lastName+', '+fn.firstName
from firstNames fn
join lastNames ln on fn.lastOrd+1=ln.firstOrd
where fn.firstOrd=1
and ln.lastOrd = (select max(ord) from nameParts)
drop table #Numbers
Having had my share of terrible experiences with data from third parties, I can say it is almost guaranteed that the input data will contain lots of garbage that does not follow the specified format.
When matching multipart string data like yours, I preprocessed both the input and our data into what I called a "normalized string" using the following method.
1. strip punctuation and other non-alphanumeric characters except spaces (leaving language-specific letters like "č" intact)
2. compact spaces (replace multiple spaces with a single one)
3. lower case
4. split into words
5. remove duplicates
6. sort alphabetically
7. join back into a string separated by dashes
Using your sample data, this function would produce:
Dr. John A. Mc Donald -> a-donald-dr-john-mc
Mc Donald, Dr. John A. -> a-donald-dr-john-mc
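The normalization steps can be sketched in T-SQL as follows. This is not the original function; it assumes SQL Server 2017 or later (for TRANSLATE and STRING_AGG), strips only periods and commas rather than all punctuation, and uses a hypothetical table variable for the input.

```sql
-- Hypothetical input rows taken from the question's example.
DECLARE @names TABLE (raw_name varchar(200));
INSERT INTO @names VALUES ('Dr. John A. Mc Donald'), ('Mc Donald, Dr. John A.');

SELECT n.raw_name, k.normalized
FROM @names AS n
CROSS APPLY (
    -- Sort the distinct words and join them with dashes.
    SELECT STRING_AGG(w.word, '-') WITHIN GROUP (ORDER BY w.word) AS normalized
    FROM (
        -- Turn '.' and ',' into spaces, split on spaces, drop empty pieces,
        -- lower-case, and remove duplicates.
        SELECT DISTINCT LOWER(s.value) AS word
        FROM STRING_SPLIT(TRANSLATE(n.raw_name, '.,', '  '), ' ') AS s
        WHERE s.value <> ''
    ) AS w
) AS k;
```

Both rows normalize to a-donald-dr-john-mc, so comparing the normalized column on each side gives the match.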
Unfortunately it's not 100% bulletproof; there are cases where degenerate inputs produce invalid matches.
Your name field is bad in the database. Redesign and get rid of it. If you have a first name, middle name, last name, prefix and suffix structure, you can have a computed field that has the structure you are using. But it is a very poor way to store data and your first priority should be to stop using it.
Since you have a common customer Id why aren't you matching on that instead of name?

Get words from sentence - SQL

Suppose I have a description column that contains
Column Description
------------------
I live in USA
I work as engineer
I have an other table containing the list of countries, since USA (country name) is mentioned in first row, I need that row.
In the second case there is no country name, so I don't need that row.
Can you please clarify
You may want to try something like the following:
SELECT cd.*
FROM column_description cd
JOIN countries c ON (INSTR(cd.description, c.country_name) > 0);
If you are using SQL Server, you should be able to use the CHARINDEX() function instead of INSTR(), which is available for MySQL and Oracle; note that CHARINDEX() takes the search string as its first argument. You can also use LIKE as other answers have suggested.
Test case:
CREATE TABLE column_description (description varchar(100));
CREATE TABLE countries (country_name varchar(100));
INSERT INTO column_description VALUES ('I live in USA');
INSERT INTO column_description VALUES ('I work as engineer');
INSERT INTO countries VALUES ('USA');
Result:
+---------------+
| description |
+---------------+
| I live in USA |
+---------------+
1 row in set (0.01 sec)
This is a really bad idea, to join on arbitrary text like this. It will be very slow and may not even work.. give it a shot:
select t1.description, c.*
from myTable t1
left join countries c on t1.description like CONCAT('%',c.countryCode,'%')
It's not entirely clear from your post, but I think you are asking to return all the rows in the table whose descriptions contain a country name. If that's the case you can just use the SQL LIKE operator, like the following.
select d.column_description
from description_table d
where exists (
-- keep the row if any country name occurs in the description
select 1
from country c
where d.column_description like concat('%', c.country_name, '%')
)
If not, I think your only other choice is Dan's post.
Enjoy !
All the suggestions so far seem to match partial words e.g. 'I AM USAIN BOLT' would match the country 'USA'. The question implies that matching should be done on whole words.
If the text consisted entirely of alphanumeric characters and each word was separated by a space character, you could use something like this
SELECT D1.description, C1.country_name
FROM Descriptions AS D1
LEFT OUTER JOIN Countries AS C1
ON ' ' + D1.description + ' '
LIKE '%' + ' ' + C1.country_name + ' ' + '%'
However, 'sentence' implies punctuation e.g. the above would fail to match 'I work in USA, Goa and Iran.' You need to delimit words before you can start matching them. Happily, there are already solutions to this problem e.g. full text search in SQL Server and the like. Why reinvent the wheel?
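Where full text search is not an option, one stopgap is to turn common punctuation into spaces before applying the same space-delimited LIKE. The sketch below is T-SQL and assumes SQL Server 2017 or later for TRANSLATE; the punctuation list is an arbitrary choice and the table variables hold invented rows.

```sql
-- Hypothetical sample data.
DECLARE @descriptions TABLE (description varchar(100));
DECLARE @countries TABLE (country_name varchar(100));
INSERT INTO @descriptions VALUES
    ('I work in USA, Goa and Iran.'),
    ('I AM USAIN BOLT'),
    ('I live in USA');
INSERT INTO @countries VALUES ('USA'), ('Iran');

SELECT d.description, c.country_name
FROM @descriptions AS d
JOIN @countries AS c
  -- Replace each of the six punctuation characters with a space,
  -- pad both ends, then look for the name as a whole word.
  ON ' ' + TRANSLATE(d.description, '.,;:!?', REPLICATE(' ', 6)) + ' '
     LIKE '% ' + c.country_name + ' %';
```

The first description now matches both 'USA' and 'Iran', while 'I AM USAIN BOLT' matches nothing. This still does not address alternative names for the same country.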
Another problem is that a single country can go by many names e.g. my country can legitimately be referred to as 'Britain', 'UK', 'GB' (according to my stackoverflow profile), 'England' (if you ask my kids) and 'The United Kingdom of Great Britain and Northern Ireland' (the latter is what it says on my passport and no, it won't fit in your NVARCHAR(50) column ;) to name but a few.