Levenshtein distance with multiple comparisons - sql

Currently I am trying to create a "best match" query.
I came across this answer, but the main difference is that I have a table with more columns, and I need to compare 6 strings.
Is there a way to implement the Levenshtein distance algorithm with a query that involves this many comparisons? All the examples I've seen online involve a single comparison sort. Is there a better way of getting the best match in a query involving this many comparisons?
EDIT
Here is the table I am trying to query for the best match:
CustomerID | CustomerName | CompanyName | CompanyPhone | CompanyEmail | AddressL1 | PostalCode
1 | terbubbs | terbubbs incorporated | 1234567890 | terbubbs#gmail.com | 5 Main St | 06482
This "best match" query runs when a user submits an order request. They will enter data into identical fields, and I need to determine whether this user has submitted a request in the past.
Here are three possible requests:
Request | CustomerName | CompanyName | CompanyPhone | CompanyEmail | AddressL1 | PostalCode
1 | terrbubbs | terbubbs inc | 11234567890 | terbubbs#gmail.com | 7 Main St | 06482
2 | terribble | Terribble Incorporated | 1254643789 | terribble#gmail.com | 12 State St | 04422
3 | john doe | JD inc | 5468791313 | john#gmail.com | 12 Main St | 06482
Now, based on these three requests, I would want Request 1 to be the best match. Honestly, this is probably a terrible example. My point is that a user might submit an almost identical request apart from a few misspellings or grammar mistakes. I want to retrieve the most similar entry in the datatable if possible.
EDIT 2
I'm wondering if it is better to try and concatenate corresponding datatable column values into a formatted string and compare it to a formatted string of the request. Any thoughts?
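One way to picture the multi-column comparison is to compute a Levenshtein distance per field and add the six distances up, so the lowest total is the best match. The sketch below is only an illustration in Python rather than SQL: the function names, the case-insensitive comparison, and the unweighted sum are assumptions, not something taken from the linked answer.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# The stored row and the three requests from the question (CustomerID omitted).
stored = ("terbubbs", "terbubbs incorporated", "1234567890",
          "terbubbs#gmail.com", "5 Main St", "06482")
requests = [
    ("terrbubbs", "terbubbs inc", "11234567890",
     "terbubbs#gmail.com", "7 Main St", "06482"),
    ("terribble", "Terribble Incorporated", "1254643789",
     "terribble#gmail.com", "12 State St", "04422"),
    ("john doe", "JD inc", "5468791313",
     "john#gmail.com", "12 Main St", "06482"),
]


def total_distance(request, row):
    # Assumption: every field counts equally and case is ignored.
    return sum(levenshtein(r.lower(), s.lower()) for r, s in zip(request, row))


totals = [total_distance(req, stored) for req in requests]
best = min(range(len(requests)), key=totals.__getitem__)
print(totals, "-> best match is request", best + 1)
```

Summing per field keeps a typo in one column from shifting the alignment of every later column, which is a risk when the values are concatenated into one string first. In a real lookup the loop would run over stored rows for one incoming request rather than the other way round.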

Related

SQL different null values in different rows

I have a quick question regarding writing a SQL query to obtain a complete entry from two or more entries where the data is missing in different columns.
This is the example, suppose I have this table:
Client Id | Name | Email
1234 | John | (null)
1244 | (null) | john#example.com
Would it be possible to write a query that would return the following?
Client Id | Name | Email
1234 | John | john#example.com
I am finding this particularly hard because these are 2 entries in the same table.
I apologize if this is trivial. I am still learning SQL and wasn't able to come up with a solution for this. I tried looking online, but I suppose I couldn't phrase the question properly, so I couldn't find the answer I was after.
Many thanks in advance for the help!
Yes, but actually no.
It is possible to write a query that works with your example data.
But just under the assumption that the first part of the mail is always equal to the name.
SELECT clients.id, clients.name, bclients.email
FROM clients
JOIN clients bclients
  ON upper(clients.name) = upper(substring(bclients.email from 0 for position('#' in bclients.email)));
db<>fiddle
Explanation:
We join the table onto itself, to get the information into one row.
For this we first search for the position of the '#' in the email, then take the substring from the start (0) of the string for that many characters, which stops just before the # (the result of position).
To avoid case problems, the name and the substring are converted to uppercase for comparison.
(lowercase would work the same)
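For anyone who wants to try the self-join without a PostgreSQL instance, here is a rough equivalent run through Python's built-in sqlite3 module. It is a sketch under assumptions: the table and column names (clients, id, name, email) are guesses at the schema, and SQLite's substr/instr stand in for the substring ... from ... for and position calls above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO clients VALUES (?, ?, ?)",
    [(1234, "John", None), (1244, None, "john#example.com")],
)

# Self-join: match a row's name against the part of another row's email
# that comes before the '#' separator, ignoring case.
rows = conn.execute(
    """
    SELECT a.id, a.name, b.email
    FROM clients a
    JOIN clients b
      ON upper(a.name) = upper(substr(b.email, 1, instr(b.email, '#') - 1))
    """
).fetchall()
print(rows)
```

Rows with a NULL name or a NULL email never satisfy the join condition, so only the John / john#example.com pairing comes back.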
The design is flawed
How can a client have multiple ids and different kind of information about the same user at the same time?
I think you want to split the table between clients and users, so that a user can have multiple clients.
I recommend reading up on database normalization, as it provides the knowledge necessary for successful database design.

What are the cases whereby EXCEPT and DISTINCT are different?

Looking through my notes for Introduction to Databases, I have stumbled upon a case that I do not understand (the difference between EXCEPT and DISTINCT).
My notes say:
The two queries below have the same results, but this will not be the case in general.
First query:
SELECT c.first_name, c.last_name, c.email
FROM customers AS c
WHERE c.country = 'Japan'
EXCEPT
SELECT c.first_name, c.last_name, c.email
FROM customers AS c
WHERE c.last_name LIKE 'D%';
Second query:
SELECT DISTINCT c.first_name, c.last_name, c.email
FROM customers AS c
WHERE c.country = 'Japan' AND NOT (c.last_name LIKE 'D%');
Could anyone provide some insight into the cases where the results would differ?
Number 1 selects first, last & email from customers who are from Japan and whose last names do not start with D.
Number 2 selects first, last & email, where no two records have all 3 fields the same, where the customers are from Singapore and their last names do not begin with D.
I suppose I can imagine a table where these would yield the same results, but I don't think it would ever appear except in very contrived circumstances.
Joe Smith jsmith#abc.com Japan
Joe Smith jsmith#abc.com Singapore
Would be one of them. Both queries would yield Joe Smith jsmith#abc.com. Another case would be if no-one was from either country or everyone's last name started with D, then they would both yield nothing.
None of this is tested, and the EXCEPT statement is something I've read about but never had occasion to use.
The first is looking at Japan, the second at Singapore, so I don't see why these would generally -- or specifically -- return the same data.
Even if the countries were the same you have another issue with NULL values. So, if your data looks like this:
first_name | last_name | email | country
xxx | NULL | a | Japan
Your first query would return the row. The second would not.
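The NULL case can be checked quickly with Python's built-in sqlite3 module. This is a minimal sketch that assumes both queries filter on 'Japan', as in the question text, and uses a one-row customers table built from the example above. SQLite is only a stand-in here, though the three-valued logic it shows is standard SQL behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (first_name TEXT, last_name TEXT, email TEXT, country TEXT)"
)
conn.execute("INSERT INTO customers VALUES ('xxx', NULL, 'a', 'Japan')")

# EXCEPT: the second branch finds nothing (NULL LIKE 'D%' is not true),
# so nothing is subtracted and the Japan row survives.
except_rows = conn.execute(
    """
    SELECT c.first_name, c.last_name, c.email
    FROM customers AS c
    WHERE c.country = 'Japan'
    EXCEPT
    SELECT c.first_name, c.last_name, c.email
    FROM customers AS c
    WHERE c.last_name LIKE 'D%'
    """
).fetchall()

# DISTINCT + NOT LIKE: NOT (NULL LIKE 'D%') evaluates to NULL, which the
# WHERE clause treats as "not true", so the row is filtered out.
distinct_rows = conn.execute(
    """
    SELECT DISTINCT c.first_name, c.last_name, c.email
    FROM customers AS c
    WHERE c.country = 'Japan' AND NOT (c.last_name LIKE 'D%')
    """
).fetchall()

print(except_rows, distinct_rows)
```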

SQL Server : set all column aliases in a dynamic query

It's a bit of a long and convoluted story why I need to do this, but I will be getting a query string which I will then be executing with this code
EXECUTE sp_ExecuteSQL
I need to set the aliases of all the columns to "value". There could be a variable number of columns in the queries that are being passed in, and they could be all sorts of data types, for example
SELECT
Company, AddressNo, Address1, Town, County, Postcode
FROM Customers
SELECT
OrderNo, OrderType, CustomerNo, DeliveryNo, OrderDate
FROM Orders
Is this possible and relatively simple to do, or will I need to get the aliases included in the SQL queries? (It would be easier not to do this, if it can be avoided and done when we process the query.)
---Edit---
As an example, the output from the first query would be
Company   AddressNo Address1     Town   County   Postcode
--------- --------- ------------ ------ -------- --------
Dave Inc  12345     1 Main Road  Harlow Essex    HA1 1AA
AA Tyres  12234     5 Main Road  Epping Essex    EP1 1PP
I want it to be
value     value     value        value  value    value
--------- --------- ------------ ------ -------- --------
Dave Inc  12345     1 Main Road  Harlow Essex    HA1 1AA
AA Tyres  12234     5 Main Road  Epping Essex    EP1 1PP
So each of the columns has an alias of "value".
I could do this with
SELECT
Company AS 'value', AddressNo AS 'value', Address1 AS 'value', Town AS 'value', County AS 'value', Postcode AS 'value'
FROM Customers
but it would be better (it would save additional complexity in other steps in the process chain) if we didn't have to manually alias each column in the SQL we're feeding in to this section of the process.
Regarding the XY problem, this is a tiny section in a very large process chain, it would take pages to explain the whole process in detail - in essence, we're taking code out of our database triggers and putting it into a dynamic procedure; then we will have frontends that users will access to "edit" the SQL statements that are called by the triggers and these will then dynamically feed the results out into other systems. It works if we manually alias the SQL going in, but it would be neater if there was a way we could feed clean SQL into the process and then apply the aliases when the SQL is processed - it would keep us DRY, to start with.
I do not understand at all what you are trying to accomplish, but I believe the answer is no: there is no built-in way to globally predefine or override column aliases for ad hoc queries. You will need to code it yourself.
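If "code it yourself" ends up meaning a preprocessing step before the string reaches sp_ExecuteSQL, one very rough shape for it is a text rewrite of the select list. The Python sketch below is hypothetical throughout: alias_all_columns is an invented helper, and it assumes a single flat SELECT ... FROM ... whose column list contains plain names separated by commas (no functions, subqueries, or existing aliases). Anything more complex would need a real SQL parser.

```python
import re


def alias_all_columns(query: str, alias: str = "value") -> str:
    """Append the same alias to every column of a simple, flat SELECT list."""
    match = re.match(
        r"\s*SELECT\s+(.*?)\s+FROM\s+(.*)", query, flags=re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError("expected a query of the form SELECT ... FROM ...")
    # Assumption: commas only ever separate columns in the select list.
    columns = [col.strip() for col in match.group(1).split(",")]
    select_list = ", ".join(f"{col} AS '{alias}'" for col in columns)
    return f"SELECT {select_list} FROM {match.group(2).strip()}"


aliased = alias_all_columns(
    "SELECT\nCompany, AddressNo, Address1, Town, County, Postcode\nFROM Customers"
)
print(aliased)
```

The rewritten string is what would then be handed on for execution, which keeps the stored queries free of aliases.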

Combining almost identical rows into 1

I have a tricky problem that I wouldn't mind a bit of help on. I've made some progress using queries that I've found here and elsewhere, but I'm getting seriously stumped now.
I have a mailing list that has numerous near duplications that I'm trying to combine into one meaningful row, taking data such as this.
Title | Forename | Surname | Address1 | Postcode | Phone | Age | Income | Ownership | Gas
Mrs | D | Andrews | 122 Somewhere | BH10 | 123456 | 66-70 | | Homeowner |
Ms | Diane | Andrews | 122 Somewhere | BH10 | 123456 | | £25-40 | | EDF
and making one row along the lines of
Title | Forename | Surname | Address1 | Postcode | Phone | Age | Income | Ownership | Gas
Mrs | Diane | Andrews | 122 Somewhere | BH10 | 123456 | 66-70 | £25-40 | Homeowner | EDF
I have over 127 million records, most duplicated with a similar pattern, but no clear logic as was proven when I added an identity field. I also have over 90 columns to consider, so it's a bit of work!
There isn't a clear pattern to the data, so I'm thinking I may have a huge case statement to try to climb over.
Using the following code I can get a decent start on returning only the full name, but given the pattern of the data I am struggling to compare the fields across rows:
SELECT c1.*
FROM Mailing c1
JOIN Mailing c2
  ON c1.Telephone1 = c2.Telephone1
 AND c1.surname = c2.surname
WHERE len(c1.Forename) > len(c2.Forename)
  AND c2.over_18 <> ''
  AND c1.Telephone1 = '123456'
Has anyone got any pointers as to how I should progress please? I'm open to discussion and ideas...
I'm using SQL 2005 and apologies in advance if the tagging is all over the place!
Cheers,
Jon
Would it work by assuming that all persons with the same surname and phone number (Do all persons have a phone?) were the same person?
INSERT INTO newtable <fieldnames>
SELECT lastname,phone,max(field3),max(field4)....
FROM oldtable
GROUP BY lastname,phone
But that would collapse John Smith and Jack Smith living together into one person.
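To see what the GROUP BY / MAX idea does to the two sample rows from the question, here is a small sketch using Python's built-in sqlite3 module. The trimmed-down Mailing table, its column names, and the use of NULL for the empty cells are assumptions made for the demo. The real data is on SQL Server 2005 with 90-odd columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Mailing (Title TEXT, Forename TEXT, Surname TEXT, Phone TEXT,
                             Age TEXT, Income TEXT, Ownership TEXT, Gas TEXT)"""
)
conn.executemany(
    "INSERT INTO Mailing VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Mrs", "D", "Andrews", "123456", "66-70", None, "Homeowner", None),
        ("Ms", "Diane", "Andrews", "123456", None, "£25-40", None, "EDF"),
    ],
)

# MAX() skips NULLs, so each column keeps whichever row actually had a value.
# When both rows have a value, MAX() picks the one that sorts last.
merged = conn.execute(
    """
    SELECT MAX(Title), MAX(Forename), Surname, Phone,
           MAX(Age), MAX(Income), MAX(Ownership), MAX(Gas)
    FROM Mailing
    GROUP BY Surname, Phone
    """
).fetchall()
print(merged)
```

Note that the merged title comes out as 'Ms' rather than the 'Mrs' in the desired output, because 'Ms' sorts after 'Mrs'. Columns where both rows hold different non-empty values would need an explicit rule rather than MAX.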
Perhaps you should consider outsourcing it to a data-entry sweatshop somewhere, after you have preprocessed the data. :-)
And/or be prepared to take the flak for mistaken bundling.
Perhaps add something like "To improve our green footprint, we have merged x listings at your address together. If you would like separate mailings, please contact us."

SQL for Querying MSSQL with Double Metaphone for First/Last Name Combos

I am using Double-Metaphone for fuzzy searching within my database. I have a table of names, and both the first and last names have double metaphone entries already created (and updated, via a Trigger). In my application, I am allowing the user to search by Lastname and/or Firstname.
What is the best way to query the database, to get the best results from the Double-Metaphone indexes when dealing with both last AND first names ? Querying just based on lastname is easy - generate the DM tags and query the database. It's when querying by both first and last that I'd like to get some fine tuning.
The database layout is similar to the following:
tblName
FirstName
LastName
MetaPhoneFN1
MetaPhoneFN2
MetaPhoneLN1
MetaPhoneLN2
Application: [Lastname] [FirstName]
User inputs just a lastname, or a combination of lastname + [First initial, first name, part of first name].
Lastname: SMITH
FirstName: J or Jo or John or Johnathan
If I pass in "J" as the firstname - I'd like all name entries matching "J%".
If I pass in "JO" as the firstname - I'd like all name entries matching "JO%".
If I pass in "JOHN" or "JOHNATHAN" as the firstname - I'd like to use DM
or maybe also "JOHN%" ?
I'm really open to suggestions here, for the firstname. I want the results to be as good as possible and return what the user wants.
What is the best way to query the database for last + any of those combinations of first name ? Here's a sample of what I've gotten so far.. and I'm not completely thrilled with the results:
SELECT *
FROM tblName
WHERE
    -- There will always be a last name
    (MetaPhoneLN1 = @paramMetaPhoneLN1
     OR (CASE WHEN @paramMetaPhoneLN2 IS NOT NULL AND MetaPhoneLN2 = @paramMetaPhoneLN2 THEN 1
              WHEN @paramMetaPhoneLN2 IS NULL THEN 0
         END) = 1)
    -- Match Firstname 1
    AND (CASE WHEN @paramMetaPhoneFN1 IS NULL THEN 1
              WHEN @paramMetaPhoneFN1 IS NOT NULL AND MetaPhoneFN1 = @paramMetaPhoneFN1 THEN 1
              WHEN LEN(@paramMetaPhoneFN1) > 1 AND LEN(@paramMetaPhoneFN1) < 4 AND MetaPhoneFN1 LIKE @paramMetaPhoneFN1 + '%' THEN 1
              WHEN LEN(@paramMetaPhoneFN1) = 1 THEN 1
         END) = 1
    -- Match Firstname 2
    AND (CASE WHEN @paramMetaPhoneFN2 IS NULL THEN 1
              WHEN @paramMetaPhoneFN2 IS NOT NULL AND MetaPhoneFN2 = @paramMetaPhoneFN2 THEN 1
              WHEN LEN(@paramMetaPhoneFN2) > 1 AND LEN(@paramMetaPhoneFN2) < 4 AND MetaPhoneFN2 LIKE @paramMetaPhoneFN2 + '%' THEN 1
              WHEN LEN(@paramMetaPhoneFN2) = 1 THEN 1
              --ELSE 0
         END) = 1
    AND (CASE WHEN @paramFirstName IS NULL THEN 1
              WHEN FirstName LIKE @paramFirstName + '%' THEN 1
              --WHEN LEN(@paramMetaPhoneFN1) = 1 AND @paramFirstName IS NOT NULL AND LEN(@paramFirstName) > 1 AND FirstName LIKE @paramFirstName + '%' THEN 1
              --ELSE 1
         END) = 1
What I've tried to do is account for the different variations for firstname. My results however, aren't exactly what I would want.
I've been able to find lots of implementations of Double Metaphone in SQL/C#, etc. for /generating/ the Double-Metaphone values, but nothing on how to actually query the database effectively once you have those values.
SUMMARY:
When I search by both lastname and firstname -- I'd like to query the database for the Double Metaphone match only on Lastname, but I'd like a lot of flexibility when a firstname is also passed in.. first initial ? sounds like ? etc. I am open to suggestions and SQL examples!
UPDATE 1:
When I say that I'm not thrilled with the results.. what I'm saying is that I'm not sure how to formulate the Firstname part of the query, to maximize results. If I search for "WILL" - what results should be returned ? WILLIAM, WILL, WILBERT .. but not WALKER - though with what I have here, WALKER would be returned because WILL -> FL and WALKER IS [FLKR] but WILLIAM IS [FLM]. If I do only DM = DM then I wouldn't get WILLIAM even returned, which is why I'm doing a LIKE in the first place, if the DM length is < 4.
Basically, I'd like to know if anyone else has run into this issue, and see what solutions others have come up with.
First initial only - should show all firstnames starting with that initial
- Here's where I'm uncertain:
Partial name - should show all firstnames starting with the partial ? [how do you know if it's just a partial name ?!]
Full name - should use DM ?
It's up to you to decide your business rules on what to return, and what to consider using LIKE vs. DM (or both) on.
One thing you seem not to be considering, though, is the length of the DM value.
If I search for "WILL" - what results should be returned ? WILLIAM,
WILL, WILBERT .. but not WALKER - though with what I have here, WALKER
would be returned because WILL -> FL and WALKER IS [FLKR] but WILLIAM
IS [FLM]. If I do only DM = DM then I wouldn't get WILLIAM even
returned, which is why I'm doing a LIKE in the first place, if the DM
length is < 4.
So, for this case:
WILL -> FL and WALKER IS [FLKR] but WILLIAM IS [FLM]
Assuming you are OK with returning multiple matches with the best match at the top, you would order the results by the length of the stored matching DM value, ascending. So WILLIAM (FLM, 3 characters) would be suggested before WALKER (FLKR, 4 characters).
For the first names, again assuming you are OK with returning multiple possible matches, you could return exact string matches first (non-DM), followed by exact DM matches, followed by partial DM and LIKE matches ordered with the shortest DM matches first, then LIKE matches, and then the rest of the longer DM matches. This is often easiest done with a bunch of UNIONed queries.
You could also choose to rank the LIKE matches by how much the returned string length differs from the input string length (smaller difference = better match).
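A cut-down version of the tiered ordering described above might look like the following, run here through Python's built-in sqlite3 module instead of SQL Server. Everything in it is illustrative: the tier numbers, the dm_len column, and the three sample rows are assumptions, and the DM codes are simply the ones quoted in the question (WILL -> FL, WILLIAM -> FLM, WALKER -> FLKR), not freshly computed values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblName (FirstName TEXT, MetaPhoneFN1 TEXT)")
conn.executemany(
    "INSERT INTO tblName VALUES (?, ?)",
    [("WALKER", "FLKR"), ("WILLIAM", "FLM"), ("WILL", "FL")],
)

# Tier 1: exact string match. Tier 2: exact DM match.
# Tier 3: partial DM match, shortest stored DM code first.
ranked = conn.execute(
    """
    SELECT FirstName, 1 AS tier, 0 AS dm_len
    FROM tblName
    WHERE FirstName = :name
    UNION ALL
    SELECT FirstName, 2, LENGTH(MetaPhoneFN1)
    FROM tblName
    WHERE FirstName <> :name AND MetaPhoneFN1 = :dm
    UNION ALL
    SELECT FirstName, 3, LENGTH(MetaPhoneFN1)
    FROM tblName
    WHERE FirstName <> :name AND MetaPhoneFN1 <> :dm
      AND MetaPhoneFN1 LIKE :dm || '%'
    ORDER BY tier, dm_len
    """,
    {"name": "WILL", "dm": "FL"},
).fetchall()

names = [row[0] for row in ranked]
print(names)
```

WALKER is still returned, but it lands below WILLIAM because its stored code is longer. A cutoff on tier or dm_len could drop such rows entirely if that suits the business rules.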
The difficulty you are facing is that you are combining searching abbreviated names with phonetically similar names. Those two aims are sometimes opposing each other.
Just to throw you another complication, ;-), Bill is also an abbreviation of William.
My thoughts on this subject are that it's probably best to treat names that could be abbreviated or are abbreviations as a separate issue from the phonetic matching. Once you come up with a solution for the abbreviations, then feed the results through metaphone.