Fuzzy lookup in SQL to match names - sql

I am stuck on a problem where I need to populate historical data using a fuzzy match. I'm using SQL Server 2014 Developer Edition.
MainTbl.UNDERWRITER_CODE is where data needs to be populated in place of NULL. The data needs to come from the LKP table. The matching criterion is MainTbl.UNDERWRITER_NAME against LKP.UNDERWRITER_NAME.
Sample:
CREATE TABLE MainTbl(UNDERWRITER_CODE int, UNDERWRITER_NAME varchar(100))
INSERT INTO MainTbl VALUES
(NULL,'dylan.campbell'),
(NULL,'dylanadmin'),
(NULL,'dylanc'),
(002,'Dylan Campbell'),
(002,'dylan.campbell'),
(002,'dylanadmin'),
(NULL,'scott.noffsinger'),
(001,'Scott Noffsinger')
CREATE TABLE LKP(UNDERWRITER_CODE int, UNDERWRITER_NAME varchar(100))
INSERT INTO LKP VALUES
(002,'Dylan Campbell'),
(001,'Scott Noffsinger')
expected output:
2 dylan.campbell
2 dylanadmin
2 dylanc
2 Dylan Campbell
2 dylan.campbell
2 dylanadmin
1 scott.noffsinger
1 Scott Noffsinger

SQL is not really designed for such fuzzy string comparisons. However, SQL Server has a function called difference(), which works for your data:
select mt.*, l.*
from maintbl mt outer apply
     (select top (1) lkp.*
      from lkp
      order by difference(mt.underwriter_name, lkp.underwriter_name) desc
     ) l;
Here is a db<>fiddle.
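For context on why this works: DIFFERENCE() compares the SOUNDEX codes of the two strings and returns an integer from 0 (least similar) to 4 (most similar), so quite different names can end up with the same score. A minimal sketch, assuming the MainTbl and LKP sample tables from the question, that lists every pairing with its score so the ties can be inspected before trusting top (1):

```sql
-- List every MainTbl/LKP pairing with the SOUNDEX codes and the DIFFERENCE score.
SELECT mt.UNDERWRITER_NAME           AS main_name,
       lkp.UNDERWRITER_NAME          AS lkp_name,
       SOUNDEX(mt.UNDERWRITER_NAME)  AS main_soundex,
       SOUNDEX(lkp.UNDERWRITER_NAME) AS lkp_soundex,
       DIFFERENCE(mt.UNDERWRITER_NAME, lkp.UNDERWRITER_NAME) AS score  -- 0..4, higher is closer
FROM MainTbl AS mt
CROSS JOIN LKP AS lkp
ORDER BY mt.UNDERWRITER_NAME, score DESC;
```

If several lookup rows share the top score for a name, top (1) returns one of them arbitrarily unless the order by also has a tie-breaker.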

UPDATE T1
SET T1.UNDERWRITER_CODE = T2.UNDERWRITER_CODE
FROM MainTbl T1
INNER JOIN LKP T2
    ON T1.UNDERWRITER_NAME LIKE CONCAT('%',
                                       LEFT(LOWER(T2.UNDERWRITER_NAME),
                                            CHARINDEX(' ', LOWER(T2.UNDERWRITER_NAME)) - 1),
                                       '%')
Output
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=23a3a55cc1ab1741f6e70dd210db0471
Explanation
Step 1:
SELECT *
      ,CONCAT('%',
              LEFT(LOWER(T2.UNDERWRITER_NAME),
                   CHARINDEX(' ', LOWER(T2.UNDERWRITER_NAME)) - 1),
              '%') AS JOIN_COL
FROM LKP T2
Output of above Query
UNDERWRITER_CODE UNDERWRITER_NAME JOIN_COL
2 Dylan Campbell %dylan%
1 Scott Noffsinger %scott%
The JOIN_COL format above is then used in the join condition with the LIKE operator.
Step 2:
SELECT T2.UNDERWRITER_CODE, T1.UNDERWRITER_NAME
FROM MainTbl T1
INNER JOIN LKP T2
    ON T1.UNDERWRITER_NAME LIKE CONCAT('%',
                                       LEFT(LOWER(T2.UNDERWRITER_NAME),
                                            CHARINDEX(' ', LOWER(T2.UNDERWRITER_NAME)) - 1),
                                       '%')
Output of above query:
UNDERWRITER_CODE UNDERWRITER_NAME
2 dylan.campbell
2 dylanadmin
2 dylanc
2 Dylan Campbell
2 dylan.campbell
2 dylanadmin
1 scott.noffsinger
1 Scott Noffsinger
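A slightly more defensive variant of the same first-name LIKE match, offered as a sketch rather than a drop-in fix. It assumes the sample tables above. Appending a space before calling CHARINDEX is my addition, so that a lookup name without a space does not make LEFT fail with a negative length, and the WHERE clause limits the update to rows that still have no code:

```sql
UPDATE T1
SET T1.UNDERWRITER_CODE = T2.UNDERWRITER_CODE
FROM MainTbl AS T1
INNER JOIN LKP AS T2
    -- First word of the lookup name; the appended space guarantees CHARINDEX finds one.
    ON LOWER(T1.UNDERWRITER_NAME) LIKE
       '%' + LOWER(LEFT(T2.UNDERWRITER_NAME,
                        CHARINDEX(' ', T2.UNDERWRITER_NAME + ' ') - 1)) + '%'
WHERE T1.UNDERWRITER_CODE IS NULL;   -- leave already-populated rows alone
```

Two caveats remain: if two lookup names share a first name, a MainTbl row matches both and the UPDATE keeps one of the codes arbitrarily, and characters such as % or _ inside a name act as LIKE wildcards.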

First, "fuzzy lookup" is a little vague. A number of algorithms are used for fuzzy matching, including Levenshtein distance, longest common subsequence, and others referenced in the "See Also" section of the Wikipedia page on approximate string matching.
To rephrase what you are attempting to do: you are updating the UNDERWRITER_CODE column in MainTbl with the UNDERWRITER_CODE of the most similar UNDERWRITER_NAME in LKP. Fuzzy algorithms can be used to measure similarity. Note my post here. For the sample data you provided, we can use Phil Factor's T-SQL Levenshtein functions and match on the lowest Levenshtein value, like so:
SELECT TOP (1) WITH TIES
    UNDERWRITER_CODE_NULL = m.UNDERWRITER_CODE,
    LKP_UN = m.UNDERWRITER_NAME, l.UNDERWRITER_NAME, l.UNDERWRITER_CODE,
    MinLev = dbo.LEVENSHTEIN(m.UNDERWRITER_NAME, l.UNDERWRITER_NAME)
FROM dbo.MainTbl AS m
CROSS JOIN dbo.LKP AS l
WHERE m.UNDERWRITER_CODE IS NULL
ORDER BY ROW_NUMBER() OVER (PARTITION BY m.UNDERWRITER_NAME
                            ORDER BY dbo.LEVENSHTEIN(m.UNDERWRITER_NAME, l.UNDERWRITER_NAME))
Returns:
UNDERWRITER_CODE_NULL LKP_UN             UNDERWRITER_NAME   UNDERWRITER_CODE MinLev
--------------------- ------------------ ------------------ ---------------- -----------
NULL                  dylan.campbell     Dylan Campbell     2                1
NULL                  dylanadmin         Dylan Campbell     2                8
NULL                  dylanc             Dylan Campbell     2                8
NULL                  scott.noffsinger   Scott Noffsinger   1                1
We can use this logic to update UNDERWRITER_CODE like so:
WITH FuzzyCompare AS
(
    SELECT TOP (1) WITH TIES
        UNDERWRITER_CODE_NULL = m.UNDERWRITER_CODE,
        LKP_UN = m.UNDERWRITER_NAME, l.UNDERWRITER_NAME, l.UNDERWRITER_CODE,
        MinLev = dbo.LEVENSHTEIN(m.UNDERWRITER_NAME, l.UNDERWRITER_NAME)
    FROM dbo.MainTbl AS m
    CROSS JOIN dbo.LKP AS l
    WHERE m.UNDERWRITER_CODE IS NULL
    ORDER BY ROW_NUMBER() OVER (PARTITION BY m.UNDERWRITER_NAME
                                ORDER BY dbo.LEVENSHTEIN(m.UNDERWRITER_NAME, l.UNDERWRITER_NAME))
)
UPDATE fc
SET fc.UNDERWRITER_CODE_NULL = fc.UNDERWRITER_CODE
FROM FuzzyCompare AS fc
JOIN dbo.MainTbl AS m ON fc.UNDERWRITER_NAME = m.UNDERWRITER_NAME;
After this update, SELECT * FROM dbo.MainTbl returns:
UNDERWRITER_CODE UNDERWRITER_NAME
---------------- -------------------
2 dylan.campbell
2 dylanadmin
2 dylanc
2 Dylan Campbell
2 dylan.campbell
2 dylanadmin
1 scott.noffsinger
1 Scott Noffsinger
This should get you started. Depending on the amount and kind of data you are dealing with, you will need to be very selective about which algorithms you use. Do your homework and test, test, test!
Let me know if you have questions.
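One way to act on the "be selective" advice is to cap the distance you are willing to accept. The sketch below assumes dbo.LEVENSHTEIN is already installed as in the queries above and returns an integer distance. The @MaxDistance variable and its value of 8 are hypothetical (8 is simply the largest MinLev in the sample output) and should be tuned against real data:

```sql
DECLARE @MaxDistance int = 8;   -- hypothetical cut-off; tune against real data

UPDATE m
SET m.UNDERWRITER_CODE = best.UNDERWRITER_CODE
FROM dbo.MainTbl AS m
CROSS APPLY
(
    -- Closest lookup name for this row; the code is a deterministic tie-breaker.
    SELECT TOP (1)
           l.UNDERWRITER_CODE,
           dbo.LEVENSHTEIN(m.UNDERWRITER_NAME, l.UNDERWRITER_NAME) AS Lev
    FROM dbo.LKP AS l
    ORDER BY Lev, l.UNDERWRITER_CODE
) AS best
WHERE m.UNDERWRITER_CODE IS NULL
  AND best.Lev <= @MaxDistance;   -- rows with no close match stay NULL for review
```

Because the best match is computed per row, rows that repeat the same name are each updated, and anything beyond the cut-off is left NULL for manual review.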

Related

Sql Query: How to Base on the row name to display

I have the table data listed below:
name | score
andy | 1
leon | 2
aaron | 3
I want the output below: even though there is no data for jacky, his name should be listed with a score of 0.
aaron 3
andy 2
jacky 0
leon 2
You didn't specify your DBMS, but the following is 100% standard ANSI SQL:
select v.name, coalesce(t.score, 0) as score
from (
values ('andy'),('leon'),('aaron'),('jacky')
) as v(name)
left join your_table t on t.name = v.name;
The values clause builds a "virtual table" containing the names you are interested in. This is then used in a left join so that all names from the virtual table are returned, plus the existing scores from your (unnamed) table. For non-existing scores, NULL is returned, which coalesce() turns into 0.
If you only want to specify the missing names, you can use a UNION in the virtual table:
select v.name, coalesce(t.score, 0) as score
from (
    select t1.name
    from your_table t1
    union
    select *
    from (values ('jacky')) as x(name)
) as v(name)
left join your_table t on t.name = v.name;
I fixed the query and it now lists the data, but jacky is still missing; it only returns the rows shown below. The DBMS is SQL Server 2008.
data
name score scoredate
andy 1 2021-08-10 01:23:16
leon 2 2021-08-10 03:25:16
aaron 3 2021-08-10 06:25:16
andy 4 2021-08-10 11:25:16
leon 5 2021-08-10 13:25:16
result set
name | score
aaron | 1
andy | 5
leon | 7
select v.name as Name,
coalesce(sum(t.score),0) as Score
from (
values ('aaron'), ('andy'), ('jacky'), ('leon')
) as v(name)
left join Score t on t.name=v.name
where scoredate>='2021-08-10 00:00:00'
and scoredate<='2021-08-10 23:59:59'
group by v.name
order by v.name asc
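jacky disappears because the scoredate conditions sit in the WHERE clause: for a name with no matching row, t.scoredate is NULL, the comparison is not true, and the row that the LEFT JOIN preserved is filtered out again. A sketch of one fix, assuming the Score table and columns from the query above, is to move the date range into the join condition (the half-open range and the yyyymmdd literals are my choices, not part of the original):

```sql
SELECT v.name AS Name,
       COALESCE(SUM(t.score), 0) AS Score
FROM (
    VALUES ('aaron'), ('andy'), ('jacky'), ('leon')
) AS v(name)
LEFT JOIN Score AS t
    ON  t.name = v.name
    -- Filtering here keeps names that have no rows in the date range.
    AND t.scoredate >= '20210810'
    AND t.scoredate <  '20210811'
GROUP BY v.name
ORDER BY v.name ASC;
```

Names without any score on that day now come back with SUM() equal to NULL, which COALESCE turns into 0.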
Your question lacks a lot of information, such as where "Jacky" comes from. If you have a list of names that you know are not in the table, just use union all:
select name, score
from t
union all
select 'Jacky', 0;

SQL Server - Find similarities in column and write them into new column

I have a big table with data like this:
ID Title
-- ------------------------
1 01_SOMESTRING_038
2 01_SOMESTRING K5038
3 01_SOMESTRING-648
4 K-OTHERSTRING_T_73474
5 K-OTHERSTRING_T_ffk
6 ABC
7 DEF
And the task is now to find similarities in that column, and write that found similarity to a new column.
So the desired output would be like this:
ID Title Similarity
-- ------------------------ -----------------
1 01_SOMESTRING_038 01_SOMESTRING
2 01_SOMESTRING K5038 01_SOMESTRING
3 01_SOMESTRING-648 01_SOMESTRING
4 K-OTHERSTRING_T_73474 K-OTHERSTRING_T_
5 K-OTHERSTRING_T_ffk K-OTHERSTRING_T_
6 ABC NULL
7 DEF NULL
How can I achieve that in MS SQL Server 17?
Any help is much appreciated. Thanks!
EDIT: The strings are not only broken by delimiters such as "-" or "_".
To handle competing similarities, I would set a minimum length for the similarity, for instance 10.
Try the following, using a recursive CTE to split out the letters, then we can group them up to find the greatest match:
WITH TITLE_EXPAND AS (
    SELECT
        1 MatchLen
        ,CAST(SUBSTRING(Title,1,1) as NVARCHAR(255)) MatchString
        ,Title
        ,ID
    FROM
        [SourceDataTable]
    UNION ALL
    SELECT
        MatchLen + 1
        ,CAST(SUBSTRING(Title,1,MatchLen+1) AS NVARCHAR(255))
        ,Title
        ,ID
    FROM
        TITLE_EXPAND
    WHERE
        MatchLen < LEN(Title)
)
SELECT DISTINCT
    SDT.ID
    ,SDT.title
    ,FIRST_VALUE(MatchString) OVER (PARTITION BY SDT.ID ORDER BY SC.MatchLen DESC, SC.MatchCount DESC) Similarity
FROM
    [SourceDataTable] SDT
LEFT JOIN
    (SELECT
        *
        ,COUNT(*) OVER (PARTITION BY MatchString, MatchLen) MatchCount
    FROM
        TITLE_EXPAND) SC
ON
    SDT.ID = SC.ID
AND
    SC.MatchCount > 1
ORDER BY SDT.ID
Where SourceDataTable is your source table. The Similarity value will be the longest matched similar value.
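The question's edit asks for a minimum similarity length, which the query above does not enforce. Here is a sketch of one way to add it while keeping the same prefix-expansion idea. The @MinLen variable, the SHARED CTE name, and the MAXRECURSION hint are my additions (the default recursion limit of 100 would stop the query on titles longer than about 100 characters), and SourceDataTable is the same placeholder as above:

```sql
DECLARE @MinLen int = 10;   -- hypothetical minimum length for a similarity

WITH TITLE_EXPAND AS (
    -- Anchor: the first character of every title.
    SELECT ID, Title, 1 AS MatchLen,
           CAST(SUBSTRING(Title, 1, 1) AS NVARCHAR(255)) AS MatchString
    FROM SourceDataTable
    UNION ALL
    -- Recursive step: extend each prefix by one character.
    SELECT ID, Title, MatchLen + 1,
           CAST(SUBSTRING(Title, 1, MatchLen + 1) AS NVARCHAR(255))
    FROM TITLE_EXPAND
    WHERE MatchLen < LEN(Title)
),
SHARED AS (
    -- Count how many titles produce each prefix of a given length.
    SELECT ID, MatchLen, MatchString,
           COUNT(*) OVER (PARTITION BY MatchString, MatchLen) AS MatchCount
    FROM TITLE_EXPAND
    WHERE MatchLen >= @MinLen
)
SELECT SDT.ID, SDT.Title, best.MatchString AS Similarity
FROM SourceDataTable AS SDT
OUTER APPLY (
    -- Longest prefix shared with at least one other row; NULL when there is none.
    SELECT TOP (1) S.MatchString
    FROM SHARED AS S
    WHERE S.ID = SDT.ID
      AND S.MatchCount > 1
    ORDER BY S.MatchLen DESC
) AS best
ORDER BY SDT.ID
OPTION (MAXRECURSION 255);
```

Like the original, this only compares leading substrings, so it will not find a similarity that starts in the middle of a title.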

How to concatenate strings in Teradata

I use Teradata 14, with strtok and the other new functions, but I am not allowed to write my own functions.
In the following table each person has many skills. How can I concatenate those skills?
team person
1 Mike (swi)
1 Nick (dri)
1 Mike (coo)
2
3 Kate (swi)
3 Kate (coo)
3 Kate (dri)
3 Wend (fly)
4 Pete (jum)
Desired table is
team person
1 Mike (swi coo), Nick (dri),
2
3 Kate (swi coo dri), Wend(fly),
4 Pete (jum),
How can I concatenate the strings?
You can use recursive queries to do this without UDFs. The query below aggregates the skills; use a similar approach to get the end result.
CREATE Volatile Table TempTable1
as
(
SELECT
team
,substr(person,0,Index(trim(person),'(')) as name
,substr(person,Index(person,'(')+1,3) as skill
,Row_Number() Over(Partition by team,name order by skill) as rnk
from
MainTable)
WITH DATA
Primary Index(team,name)
ON COMMIT Preserve Rows;
CREATE VOLATILE TABLE temp_table2 (team,name)
as
(WITH RECURSIVE temp_table3 (team,name,skill,rnk,lev)
AS
(
SELECT team,name,cast(skill as varchar(1000)),rnk,1 as lev
from TempTable1
where rnk = 1
UNION ALL
SELECT t1.team,t1.name,t1.skill||','||t2.skill,t1.rnk,t2.lev+1
FROM
TempTable1 t1
Inner join
temp_table3 t2
on t1.team = t2.team
AND t1.name = t2.name
AND t1.rnk = t2.rnk + 1
)
SELECT team,name||'('||skill||')' as new_name
from temp_table3
qualify rank() over (partition by team,name order by lev desc) = 1)
WITH DATA
ON COMMIT PRESERVE ROWS;

Query to fetch records from 2 diff table into 2 columns

I have 2 tables like below:
1)
Engine
======
ID Title Unit Value
123 Hello Inch 50
555 Hii feet 60
2)
Fuel
=====
ID Title Value
123 test12 343
555 test5556 777
I want the select result in 2 columns for a given ID (the ID is the same in both tables):
Title -- This will get (Title + Unit) from the Engine table and only Title from the Fuel table.
Value -- This will get Value from both tables.
Result for ID = 123 is :
Title Value
Hello(Inch) 50
test12 343
Any suggestions on how I can get this in SQL Server 2008?
Based on your sample data and the desired result, it appears that you want to use a UNION ALL to get the data from both tables:
select title+'('+Unit+')' Title, value
from engine
where id = 123
union all
select title, value
from fuel
where id = 123
See SQL Fiddle with Demo
The result of the query is:
| TITLE | VALUE |
-----------------------
| Hello(Inch) | 50 |
| test12 | 343 |
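UNION ALL on its own does not guarantee that the Engine row comes back before the Fuel row. A small sketch, assuming the Engine and Fuel tables from the question, that parameterizes the ID and adds an explicit sort key (the @ID variable and the src column are names made up for illustration):

```sql
DECLARE @ID int = 123;   -- the ID to look up

SELECT u.Title, u.Value
FROM (
    -- src = 1 marks the Engine row, src = 2 the Fuel row.
    SELECT 1 AS src, Title + '(' + Unit + ')' AS Title, Value
    FROM Engine
    WHERE ID = @ID
    UNION ALL
    SELECT 2, Title, Value
    FROM Fuel
    WHERE ID = @ID
) AS u
ORDER BY u.src;
```

If Unit can be NULL, the concatenation yields NULL for the whole title; wrapping Unit in ISNULL or COALESCE avoids that if it matters for your data.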
Look at SQL JOINs: INNER JOIN, LEFT JOIN, etc.
Select
    e.ID, e.Title, e.Unit, e.Value, f.Title as FuelTitle, f.Value as FuelValue,
    e.Title + ' ' + e.Unit As TitleAndUnits
From Engine e
Join Fuel f
    On e.ID = f.ID
You can do this w/o a join but with join it may be more optimal depending on other factors in your case.
Example w/o join:
select concat(t1.c1, ' ', t1.c2, ' ', t2.c1) col1, concat(t1.c3, ' ', t2.c3) col2
from t1, t2
where t1.id = [ID] and t2.id = [ID]
You should probably have a look at something like Introduction to JOINs – Basic of JOINs and read up a little on JOINS
Join Fundamentals
SQL Server Join Example
SQL Joins
EDIT
Maybe then also look at
CASE (Transact-SQL)
+ (String Concatenation) (Transact-SQL)
CAST and CONVERT (Transact-SQL)

Use comma separated values in where clause and compare it with in clause

This is the edited question with full problem. Following is the table structure. (Only necessary columns are shown below.)
Table Name: tblQualificationMaster.
Qualiid QualiName
------- ---------
1 S.S.C
2 H.S.C
3 B.Sc
4 M.C.A
5 M.Sc(IT)
6 B.E
7 M.B.A
8 B.Com
9 M.E
10 C.S
12 M.Com
Table Name: tblAppResumeMaster.
AppId FirstName LastName TotalExpYears TotalExpMonths
----- --------- -------- ------------- --------------
1 Rahul Patel 7 0
2 Ritesh Shah 0 0
3 Ajay shah 7 6
4 Ram Prasad 7 6
5 Mohan Varma 5 0
6 Gaurav Kumar 8 0
Table Name: tblAppQualificationDetail. (For better reading I am writing comma separated value for all rows except first row but in my database all values are stored like for appid=1. i.e one row for each qualificationid.)
Appid QualiId
----- -------
1 1
1 2
1 3
1 4
2 1,2,3
3 1,2,6
4 1,2,3,5
5 1,2,3,4
6 1,2,6,9
Table Name: tblVacancyMaster
VacId Title Criteria Req.Exp KeySkills
----- -------------- -------- ------- ---------------
1 Programmer 4,5,6 4 .net,java,php
2 TL 4,5 3 .net,java,php
3 Project Mngr. 4,6,9 4 .net,java,php,sql
4 Java Developer 4,5,6 0 java,oracle,sql
5 Manager 7,9 7 bussiness management
6 Supervisior 3,8 3 marketing
7 PHP Developer 4,5 0 php,mysql,send
Now based on this detail I want to create view which should have following fields. (It is shown for VacId=1 but I need this for all vacancies so that I can fire where clause on this view like select * from view where VacId=3.)
AppId FirstName LastName QualiName QualiId TotalExp VacId VacTitle
----- --------- -------- --------- ------- -------- ----- ----------
1 Rahul Patel M.C.A 4 7 1 Programmer
3 Ajay Shah B.E. 6 7 1 Programmer
5 Mohan Verma M.C.A 4 5 1 Programmer
6 Gaurav Kumar B.E 6 8 1 Programmer
6 Gaurav Kumar M.E 9 8 1 Programmer
This view shows that AppId 1, 3, 5 and 6 are eligible for vacancy 1, but it shows a duplicate entry for app 6. How can I get unique records?
I may be wrong in my database design because this is my first project and I am still learning databases, so let me know and correct me if something goes against database standards.
My previous query
(Note: Earlier I was using one intermediate table tblVacancyCriteriaDetail which was having columns VacId and QualiId and my table tblVacancyMaster was not having column criteria)
select
ARM.AppId,
ARM.AppFirstName,
ARM.AppLastName,
ARM.AppMobileNo,
AQD.QualiId,
VacQualiDetail.QualiName,
ARM.AppEmailId1,
VacQualiDetail.VacID,
ARM.TotalExpYear,
VacQualiDetail.VacTitle,
VacQualiDetail.DeptId,
VacQualiDetail.CompId,
CM.CompName
from
tblAppResumeMaster ARM,
tblAppQualificationDetail AQD,
tblCompanyMaster CM,
(
select
VM.VacID,
VM.VacTitle,
VM.CompId,
VM.DeptId,
vcd.QualificationID,
QM.QualiName,
VM.RequiredExperience as Expe
from
tblVacancyCriteriaDetail VCD,
tblVacancyMaster VM,
tblQualificationMaster QM
where VCD.VacID=VM.VacID
and VCD.QualificationID=QM.QualificationId
and VM.Status=0
) as VacQualiDetail
where AQD.AppId=arm.AppId
and aqd.QualiId=VacQualiDetail.QualificationID
and ARM.TotalExpYear>=Expe
and cm.CompId=VacQualiDetail.CompId
create view vAppList as
select AppId,
FirstName,
LastName,
QualiName,
Qualiid,
TotalExpYears,
VacId,
Title
from (select ARM.AppId,
ARM.FirstName,
ARM.LastName,
QM.QualiName,
QM.Qualiid,
ARM.TotalExpYears,
VM.VacId,
VM.Title,
row_number() over(partition by ARM.AppId, VM.VacId order by QM.Qualiid) as rn
from tblAppResumeMaster as ARM
inner join tblAppQualificationDetail as AQD
on ARM.AppId = AQD.Appid
inner join tblQualificationMaster as QM
on AQD.QualiId = QM.Qualiid
inner join tblVacancyMaster as VM
on ','+VM.Criteria+',' like '%,'+cast(QM.Qualiid as varchar(10))+',%'
) as V
where V.rn = 1
The subquery will have duplicates when one applicant matches more than one qualification. In that case QualiName will have the value for the lowest Qualiid.
If you go back to using tblVacancyCriteriaDetail, which I think you should, the view would look like this:
create view vAppList as
select AppId,
FirstName,
LastName,
QualiName,
Qualiid,
TotalExpYears,
VacId,
Title
from (select ARM.AppId,
ARM.FirstName,
ARM.LastName,
QM.QualiName,
QM.Qualiid,
ARM.TotalExpYears,
VM.VacId,
VM.Title,
row_number() over(partition by ARM.AppId, VM.VacId order by QM.Qualiid) as rn
from tblAppResumeMaster as ARM
inner join tblAppQualificationDetail as AQD
on ARM.AppId = AQD.Appid
inner join tblQualificationMaster as QM
on AQD.QualiId = QM.Qualiid
inner join tblVacancyCriteriaDetail as VCD
on QM.Qualiid = VCD.QualiID
inner join tblVacancyMaster as VM
on VCD.VacId = VM.VacId
) as V
where V.rn = 1
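The comma-wrapping in the first view's join condition matters: without the surrounding commas, a pattern for 4 would also match 14 or 41. A self-contained sketch with made-up criteria strings (the values below are illustrative, not taken from the tables above):

```sql
SELECT c.Criteria,
       q.Qualiid,
       -- Naive match: finds the digits anywhere in the string.
       CASE WHEN c.Criteria LIKE '%' + CAST(q.Qualiid AS varchar(10)) + '%'
            THEN 1 ELSE 0 END AS NaiveMatch,
       -- Delimited match: the id must sit between two commas.
       CASE WHEN ',' + c.Criteria + ',' LIKE '%,' + CAST(q.Qualiid AS varchar(10)) + ',%'
            THEN 1 ELSE 0 END AS DelimitedMatch
FROM (VALUES ('4,5,6'), ('14,15')) AS c(Criteria)
CROSS JOIN (VALUES (4), (5), (1)) AS q(Qualiid)
ORDER BY c.Criteria, q.Qualiid;
```

For the pair ('14,15', 4) the naive column is 1 while the delimited column is 0. The predicate starts with a wildcard, so it cannot seek on an index over Criteria, which is one more reason to prefer the tblVacancyCriteriaDetail design.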
I have never worked with MS SQL Server, so I think the best way would be to use regex (try to find something about it in the SQL Server documentation).
But I think this should work:
select * from Table1 Where (',' + qualificationid + ',') like '%,6,%';
I assume that string concatenation is done using the + sign.
Revised:
Create a new function:
CREATE FUNCTION dbo.Split(@String varchar(8000), @Delimiter char(1))
returns @temptable TABLE (items varchar(8000))
as
begin
    declare @idx int
    declare @slice varchar(8000)
    select @idx = 1
    if len(@String) < 1 or @String is null return
    while @idx != 0
    begin
        -- position of the next delimiter, 0 when none is left
        set @idx = charindex(@Delimiter, @String)
        if @idx != 0
            set @slice = left(@String, @idx - 1)
        else
            set @slice = @String
        if (len(@slice) > 0)
            insert into @temptable(items) values(@slice)
        -- drop the consumed slice and its delimiter
        set @String = right(@String, len(@String) - @idx)
        if len(@String) = 0 break
    end
    return
end
Then you could use it with my previous answer:
SELECT a.*
FROM TableA a
WHERE a.ColumnID IN (SELECT s.items
                     FROM TableB b
                     CROSS APPLY dbo.Split(b.ColumnWithValues, ',') s)
Try using the COALESCE function to get your rows into one comma-separated value. This is a simple example:
-- @QualIDs starts as NULL so that COALESCE adds no leading separator
declare @QualIDs varchar(50)
select @QualIDs = COALESCE(@QualIDs + ', ', '') + CAST(Qualiid AS varchar(50))
from tblQualificationMaster
This returns all Qualiid values separated by commas; you can use the result in a WHERE clause or in a subquery.
To read more about COALESCE, go to http://msdn.microsoft.com/en-us/library/ms190349.aspx