Is it possible to create indexes for a string/uuid-based primary key to enable fast search by similarity (e.g. noisy UUIDs)? - sql

I will give a concrete case for better comprehension.
I have some codes, coming from OCR, that I will call UUIDs here.
Of the, say, 25 characters, a few are misrecognized.
Is it possible to "index by similarity" the UUID column in a SQL database?
Will a SELECT ... LIKE statement already behave well, supposing only one character is wrong per UUID and I perform 25 queries?
[The noisy uuid is not going to be inserted, just SELECTed.]

I'm sorry, I don't know if there is a built-in function to do so, but what you are trying to compute is called the Levenshtein distance. Have a look at that:
Definition:
https://en.wikipedia.org/wiki/Levenshtein_distance
Using SQL:
https://lucidar.me/en/web-dev/levenshtein-distance-in-mysql/
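For illustration, a minimal sketch of how such a function would be used once installed (MySQL has no built-in levenshtein(); the lucidar article above shows a stored-function implementation; the table and column names here are hypothetical):
-- Assumes a levenshtein(s1, s2) stored function has been created,
-- e.g. following the lucidar.me article above.
SELECT uuid_code
FROM scanned_codes   -- hypothetical table holding the known-good codes
WHERE levenshtein(uuid_code, 'NOISY-UUID-FROM-OCR') <= 2;
Note that a predicate like this cannot be served by a regular B-tree index, so it scans the whole table, which is exactly the indexing limitation the question asks about.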

You should fix the data that goes into the database -- or at least have the original code and an imputed code.
If you need to keep the original code, then my suggestion would be a look-up table with the original code and imputed code. This table would be used for queries that want to filter by the actual code.
To give a concrete example: if I have a column with US state abbreviations and one of the codes was RA, I would not want to "automatically" figure out whether this is:
AR backwards (Arkansas)
RI (Rhode Island)
CA (California)
MA (Massachusetts)
PA (Pennsylvania)
VA (Virginia)
WA (Washington)
It seems like a manual effort would be required.
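To make that concrete, a minimal sketch of such a look-up table (all names hypothetical):
CREATE TABLE code_corrections (
    original_code VARCHAR(25) PRIMARY KEY,  -- code exactly as it came out of OCR
    imputed_code  VARCHAR(25) NOT NULL      -- manually verified code
);

-- Queries that want to filter by the actual code go through the look-up:
SELECT t.*
FROM main_table t
JOIN code_corrections c ON c.original_code = t.code
WHERE c.imputed_code = 'KNOWN-GOOD-CODE';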

Related

Fuzzy Logic lookup

Hello, as per the attached image, we are trying to update a 1.7 million row UK postcode table with insurance risk groups. There are several thousand new postcodes with no groupNumber, and these appear as NULL. We want to replace the NULLs with the value from the postcode in the row above.
We believe we should be using some sort of fuzzy logic, but need some help please.
Thanks
In a query, you can do:
select t.*,
       coalesce(groupnumber, lag(groupnumber) over (order by new_postcode)) as new_groupnumber
from t;
It is not clear if you want to actually change the data or just return the values in a query.
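If the data should actually be changed, one hedged sketch (SQL Server syntax; table and column names as used above) is an updatable CTE:
WITH filled AS (
    SELECT groupnumber,
           LAG(groupnumber) OVER (ORDER BY new_postcode) AS prev_groupnumber
    FROM t
)
UPDATE filled
SET groupnumber = prev_groupnumber
WHERE groupnumber IS NULL;
Note that LAG() only looks one row back, so a run of consecutive NULLs would need repeated passes or a different approach.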
Fuzzy logic is about writing rules in structured English. These rules can be understood by man and machine at the same time and are easily adjusted to further requirements. A rule for replacing the NULL postcode can be written in Forth:
OrigPostcode NULL = IF
    row_above OrigPostcode
    row_above groupnumber
THEN
The term "row_above" is a linguistic variable, and "above" means that the index is i-1 with 0.9 affiliation.
Instead of formulating the rule manually, it's also possible to use a technique called "learning from demonstration". That means a human operator does the copy-and-paste task first and the system recognizes the rule autonomously. This is usually done with neural networks and is described in open-access journals freely available on the Internet, provided by predatory publishing groups.

SSIS Fuzzy Grouping always returns the same result with different similarity thresholds

Can anyone tell me why my similarity is always 1?
My goal is for AAB and AAC to be put in the same group, for example.
Thanks
After I tried different source data, I got the result I needed.
I think for sample data it is better to use real-world examples.
Instead of AAA and AAC, maybe use a Name column with values like Sara vs. Saraa; then SSIS would say they are in the same group. However, I found that for Don vs. Done, it won't. So... it may not be a good idea to rely on it to catch records whose typo is a different letter?
*** Try to create more than one column to be your comparison column.

Optimising LIKE expressions that start with wildcards

I have a table in a SQL Server database with an address field (ex. 1 Farnham Road, Guildford, Surrey, GU2XFF) which I want to search with a wildcard before and after the search string.
SELECT *
FROM Table
WHERE Address_Field LIKE '%nham%'
I have around 2 million records in this table, and I'm finding that queries take anywhere from 5-10s, which isn't ideal. I believe this is because of the preceding wildcard.
I think I'm right in saying that no indexes will be used for seek operations because of the preceding wildcard.
Using full-text searching and CONTAINS isn't possible because I want to search for the latter parts of words (I know that you could replace the search string with Guil* in the query below and this would return results). Certainly, running the following returns no results:
SELECT *
FROM Table
WHERE CONTAINS(Address_Field, '"nham"')
Is there any way to optimise queries with preceding wildcards?
Here is one (not really recommended) solution.
Create a table AddressSubstrings. This table would have multiple rows per address, along with the primary key of the base table.
When you insert an address into table, insert substrings starting from each position. So, if you want to insert 'abcd', then you would insert:
abcd
bcd
cd
d
along with the unique id of the row in Table. (This can all be done using a trigger.)
Create an index on AddressSubstrings(AddressSubstring).
Then you can phrase your query as:
SELECT *
FROM Table t JOIN
     AddressSubstrings ads
     ON t.table_id = ads.table_id
WHERE ads.AddressSubstring LIKE 'nham%';
Now there will be a matching row starting with nham, so LIKE can make use of an index (and a full-text index also works).
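For completeness, a hedged sketch of the trigger mentioned above (SQL Server syntax, using the master..spt_values numbers trick; addresses longer than 2048 characters would need a proper numbers table):
CREATE TRIGGER trg_Table_Suffixes ON [Table]
AFTER INSERT AS
BEGIN
    SET NOCOUNT ON;
    -- insert one row per suffix of the address
    INSERT INTO AddressSubstrings (table_id, AddressSubstring)
    SELECT i.table_id,
           SUBSTRING(i.Address_Field, n.number, LEN(i.Address_Field))
    FROM inserted i
    JOIN master..spt_values n
      ON n.type = 'P'
     AND n.number BETWEEN 1 AND LEN(i.Address_Field);
END;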
If you are interested in the right way to handle this problem, a reasonable place to start is the Postgres documentation. It uses a method similar to the above, but based on n-grams. The only problem with n-grams for your particular problem is that they require rewriting the comparison as well as changing the storage.
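Concretely, the Postgres feature referred to is the pg_trgm extension, which lets a plain leading-wildcard LIKE use a trigram index (sketch, assuming a Postgres database and a hypothetical table name):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_address_trgm ON addresses USING gin (address_field gin_trgm_ops);
-- The planner can now satisfy the original query from the index:
SELECT * FROM addresses WHERE address_field LIKE '%nham%';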
I can't offer a complete solution to this difficult problem.
But if you're looking to create a suffix search capability, in which, for example, you'd be able to find the row containing HWilson with ilson and the row containing ABC123000654 with 654, here's a suggestion.
WHERE REVERSE(textcolumn) LIKE REVERSE('ilson') + '%'
Of course this isn't sargable the way I wrote it here. But many modern DBMSs, including recent versions of SQL Server, allow the definition, and indexing, of computed or virtual columns.
I've deployed this technique, to the delight of end users, in a health-care system with lots of record IDs like ABC123000654.
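As a sketch of that technique in SQL Server (column and index names hypothetical; REVERSE is deterministic, so the computed column is indexable):
ALTER TABLE [Table] ADD Address_Reversed AS REVERSE(Address_Field);
CREATE INDEX IX_Address_Reversed ON [Table] (Address_Reversed);

-- The suffix search then becomes sargable:
SELECT * FROM [Table] WHERE Address_Reversed LIKE REVERSE('ilson') + '%';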
Not without a serious preparation effort, hwilson1.
At the risk of repeating the obvious: any search path optimisation, leading to the decision whether an index is used, which type of join operator to use, etc. (independently of which DBMS we're talking about), works on equality (equal to) or range checking (greater-than and less-than).
With leading wildcards, you're out of luck.
The workaround is a serious preparation effort, as stated up front:
It would boil down to Vertica's text search feature, where that problem is solved. See here:
https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Tables/TextSearch/UsingTextSearch.htm
For any other database platform, including MS SQL, you'll have to do that manually.
In a nutshell: It relies on a primary key or unique identifier of the table whose text search you want to optimise.
You create an auxiliary table, whose primary key is the primary key of your base table, plus a sequence number, and a VARCHAR column that will contain a series of substrings of the base table's string you initially searched using wildcards. In an over-simplified way:
If your input table (just showing the columns that matter) is this:
id |the_search_col                            |other_col
42 |The Restaurant at the End of the Universe |Arthur Dent
43 |The Hitch-Hiker's Guide to the Galaxy     |Ford Prefect
Your auxiliary search table could contain:
id |seq|search_token
42 |  1|Restaurant
42 |  2|End
42 |  3|Universe
43 |  1|Hitch-Hiker
43 |  2|Guide
43 |  3|Galaxy
Normally, you suppress typical "fillers" like articles, prepositions, and apostrophe-s, and split into tokens separated by punctuation and white space. For your '%nham%' example, however, you'd probably need to talk to a linguist who has specialised in English morphology to find candidate token splits .... :-]
You could start with the same technique that I use when I un-pivot a horizontal series of measures without the PIVOT clause, like here:
Pivot sql convert rows to columns
Then use a combination of (probably nested) CHARINDEX() and SUBSTRING() calls, driven by the index you get from the CROSS JOIN with a series of integers as described in the post suggested above, and use that very index as the sequence for the auxiliary search table.
Lay an index on search_token and you'll have a very fast access path to a big table.
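In code, the over-simplified version of that auxiliary table and its access path might look like this (sketch; all names hypothetical):
CREATE TABLE search_tokens (
    id           INT          NOT NULL,  -- primary key of the base table
    seq          INT          NOT NULL,  -- token position within the string
    search_token VARCHAR(100) NOT NULL,
    PRIMARY KEY (id, seq)
);
CREATE INDEX idx_search_token ON search_tokens (search_token);

-- Find base rows whose text contains the token 'Galaxy':
SELECT b.*
FROM base_table b
JOIN search_tokens s ON s.id = b.id
WHERE s.search_token = 'Galaxy';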
Not a stroll in the park, I agree, but promising ...
Happy playing -
Marco the Sane

Finding strings that differ with at most one letter from a given string in SAS with PROC SQL

First some context. I am using proc sql in SAS, and need to fetch all the entries in a data set (with a couple of million entries) that have variable "Name" equal to (let's say) "Massachusetts". Of course, since the data was once manually entered by humans, close to all conceivable spelling errors occur ("Amssachusetts", "Kassachusetts" etc.).
I have found that few entries get more than two characters wrong, so the code
Name like "__ssachusetts" OR Name like "_a_sachusetts" OR ... OR Name like "Massachuset__"
would select the entries I am looking for. However, I am hoping that there must be a more convenient way to write
Name that differs by at most 2 characters from "Massachusetts";
Is there? Or is there some other strategy for fetching these entries? I tried searching both Stack Overflow and the web but was unsuccessful. I am also a relative beginner with both SQL and SAS.
Some additional information: The database is not in English (and the actual string is not "Massachusetts") so using SOUNDEX is not really feasible (if it ever were).
Thanks in advance.
(Edit: Improved the title)
SAS has built-in functions COMPGED and COMPLEV to compute distances between strings. Here is an example that shows how to select just those with a Levenshtein edit distance of less than or equal to 2.
data typo;
input name $20.;
datalines;
massachusetts
masachusets
mssachusetts
nassachusets
nassachussets
massachusett
;
proc sql;
select name from typo
where complev(name, "massachusetts") <= 2;
quit;
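For reference, COMPGED is used the same way but returns a weighted cost rather than a raw edit count; with its default costs a simple edit is typically 100, so a threshold of 200 is a hedged stand-in for "about two edits":
proc sql;
select name from typo
where compged(name, "massachusetts") <= 200;
quit;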
There are other string-distance algorithms, like the Hamming distance, that may work better (note that Hamming distance, unlike SOUNDEX, is not a phonetic algorithm; it simply counts the positions at which two equal-length strings differ).
You can search Google for an implementation of such an algorithm for your specific DB engine.
What you are looking for is "approximate string matching". For that, one can use a Levenshtein distance computing algorithm. I am not sure, but I hope this answer helps.
You could implement a stored function of this type (Oracle syntax; adapt to your RDBMS):
CREATE FUNCTION distance(one VARCHAR2, two VARCHAR2)
RETURN NUMBER
DETERMINISTIC
IS
BEGIN
  -- compute and return the edit distance here
  RETURN 0; -- placeholder
END distance;
And then use it in SQL:
SELECT * FROM table WHERE distance(name, 'Massachusetts') <= 2
Of course, these things tend to be quite slow...
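In Oracle specifically, you may not need to write the function at all: the built-in UTL_MATCH package already provides an edit distance (sketch; table name hypothetical):
SELECT * FROM my_table WHERE UTL_MATCH.EDIT_DISTANCE(name, 'Massachusetts') <= 2;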
I know this is four years too late, but since it might also give ideas to others who find this thread:
What you're considering is a layered semantic design: you would need to implement some conditional logic for these different text comparisons, using string-similarity measures like Jaro-Winkler for comparing text of differing lengths, and Hamming distance for strings of the same length where you suspect simple character transpositions. This is nothing new these days, with all of the various text-mining programs out there.
Here is a post which is very good in my view:
Jaro-Winkler string comparison function in SAS

Need Pattern for dynamic search of multiple sql tables

I'm looking for a pattern for performing a dynamic search on multiple tables.
I have no control over the legacy (and poorly designed) database table structure.
Consider a scenario similar to a resume search where a user may want to perform a search against any of the data in the resume and get back a list of resumes that match their search criteria. Any field can be searched at anytime and in combination with one or more other fields.
The actual sql query gets created dynamically depending on which fields are searched. Most solutions I've found involve complicated if blocks, but I can't help but think there must be a more elegant solution since this must be a solved problem by now.
Yeah, so I've started down the path of dynamically building the SQL in code. Seems godawful. If I really try to support the requested ability to query any combination of any field in any table, this is going to be one MASSIVE set of if statements. *shiver*
I believe I read that COALESCE only works if your data does not contain NULLs. Is that correct? If so, no go, since I have NULL values all over the place.
As far as I understand (and I'm also someone who has written against a horrible legacy database), there is no such thing as a dynamic WHERE clause. This has NOT been solved.
Personally, I prefer to generate my dynamic searches in code. It makes testing convenient. Note: when you create your SQL queries in code, don't concatenate in user input. Use your @variables!
The only alternative is to use the COALESCE operator. Let's say you have the following table:
Users
-----------
Name nvarchar(20)
Nickname nvarchar(10)
and you want to search optionally for name or nickname. The following query will do this:
SELECT Name, Nickname
FROM Users
WHERE
Name = COALESCE(@name, Name) AND
Nickname = COALESCE(@nick, Nickname)
If you don't want to search for something, just pass in a null. For example, passing in "brian" for @name and null for @nick results in the following query being evaluated:
SELECT Name, Nickname
FROM Users
WHERE
Name = 'brian' AND
Nickname = Nickname
The COALESCE turns the null into an identity comparison (Nickname = Nickname), which matches every row where the column is not NULL, so it effectively doesn't restrict the WHERE clause. (As noted above, this is also why the trick breaks down for columns that can themselves hold NULLs.)
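Given that caveat, a common NULL-safe variant of the same dynamic search (sketch) is:
SELECT Name, Nickname
FROM Users
WHERE (@name IS NULL OR Name = @name)
  AND (@nick IS NULL OR Nickname = @nick)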
Search and normalization can be at odds with each other. So probably the first thing to do is create some kind of "view" that exposes all the searchable fields as a single row, with a single key getting you back to the resume. Then you can put something like Lucene in front of that to give you a full-text index of those rows. The way that works is: you ask it for "x" in this view and it returns the key. It's a great solution, and it came recommended by Joel himself on the podcast within the first 2 months, IIRC.
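A minimal sketch of such a flattened "view" feeding the full-text indexer (all table and column names hypothetical; CONCAT_WS exists in, e.g., MySQL and SQL Server 2017+):
CREATE VIEW resume_search AS
SELECT r.resume_id,
       -- in practice, one-to-many tables would need aggregating into one blob per resume
       CONCAT_WS(' ', r.name, r.address, e.education_text, w.experience_text) AS search_blob
FROM resumes r
LEFT JOIN education e ON e.resume_id = r.resume_id
LEFT JOIN work_experience w ON w.resume_id = r.resume_id;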
What you need is something like SphinxSearch (for MySQL) or Apache Lucene.
As you said in your example, let's imagine a resume composed of several fields:
Name,
Address,
Education (this could be a table of its own), or
Work experience (this could grow into its own table where each row represents a previous job)
So searching for a word in all those fields with WHERE rapidly becomes a very long query with several JOINs.
Instead, you could change your frame of reference and think of the whole resume as what it is: a single document that you want to search.
This is what tools like Sphinx Search do. They create a FULL TEXT index of your "document", and then you can query Sphinx and it will tell you where in the database that record was found.
Really good search results.
Don't worry about these tools not being part of your RDBMS; it will save you a lot of headaches to use the appropriate model ("documents") rather than the incorrect one ("tables") for this application.