SSIS Fuzzy Grouping always returns the same result with different similarity thresholds - sql

Can anyone tell me why my similarity is always 1?
My goal is for values such as AAB and AAC to be placed in the same group.
Thanks

After trying different source data, I got the result I needed.
For sample data, it is better to use realistic, real-world values.
Instead of AAA and AAC, try a Name column with values like Sara vs Saraa; SSIS then says they are in the same group. However, I found that Don vs Done are not grouped together. So it may not be a good idea to rely on it to catch records whose typo is a different letter?
*** Try using more than one column as your comparison columns.

Related

Is it possible to create indexes for a string/uuid-based primary key to be able to fast search by similarity (e.g. noisy uuids)?

I will give the concrete case for better comprehension.
I have some codes that I will call here UUID coming from OCR.
From the, say, 25 characters, a few are misrecognized.
Is it possible to "index by similarity" the UUID column in a SQL database?
Will a SELECT ... LIKE statement already perform well, supposing only one character is wrong per UUID and I perform 25 queries?
[The noisy uuid is not going to be inserted, just SELECTed.]
I'm sorry, I don't know if there is a built-in function to do so, but what you are describing is usually measured with the Levenshtein distance. Have a look at these:
Definition :
https://en.wikipedia.org/wiki/Levenshtein_distance#:~:text=Informally%2C%20the%20Levenshtein%20distance%20between,considered%20this%20distance%20in%201965.
Using SQL :
https://lucidar.me/en/web-dev/levenshtein-distance-in-mysql/#:~:text=Informally%2C%20the%20Levenshtein%20distance%20between,not%20match%20exactly%20the%20fields.
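As a rough illustration of the Levenshtein idea applied to this case, here is a minimal Python sketch. The table name `codes`, the sample values, and the use of SQLite are all assumptions made for the example, not part of the question. It registers a Python function in SQLite and filters by distance. Note that no index is used here: every row is scanned and compared.

```python
import sqlite3

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)
conn.execute("CREATE TABLE codes (uuid TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO codes VALUES (?)",
                 [("A1B2C3D4E5",), ("Z9Y8X7W6V5",), ("A1B2C3D4F6",)])

noisy = "A1B2C3D4E6"  # hypothetical OCR reading with one wrong character
matches = conn.execute(
    "SELECT uuid, levenshtein(uuid, ?) AS d FROM codes "
    "WHERE levenshtein(uuid, ?) <= 1 ORDER BY d, uuid",
    (noisy, noisy),
).fetchall()
print(matches)
```

With this sample data two stored codes are each one edit away from the noisy reading, so a distance threshold alone cannot always pick a single winner.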
You should fix the data that goes into the database -- or at least have the original code and an imputed code.
If you need to keep the original code, then my suggestion would be a look-up table with the original code and imputed code. This table would be used for queries that want to filter by the actual code.
To give a concrete example, if I have a column with US state abbreviations and one of the codes was RA, I would not want to "automatically" figure out whether this is:
AR backwards (Arkansas)
RI (Rhode Island)
CA (California)
MA (Massachusetts)
PA (Pennsylvania)
VA (Virginia)
WA (Washington)
It seems like a manual effort would be required.

Fuzzy Logic lookup

Hello, as per the attached image, we are trying to update a 1.7 million row UK postcode table with insurance risk groups. There are several thousand new postcodes with no groupNumber, and these appear as NULL. We want to replace the NULLs with the value from the postcode in the row above.
We believe we should be using some sort of fuzzy logic, but need some help please.
Thanks
In a query, you can do:
select t.*,
coalesce(groupnumber, lag(groupnumber) over (order by new_postcode)) as new_groupnumber
from t;
It is not clear if you want to actually change the data or just return the values in a query.
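One caveat with a plain lag(): it looks back exactly one row, so if two new postcodes sit next to each other, the second one still sees a NULL. A minimal Python sketch of a "carry the last known value forward" pass follows. The function name, the tuple layout, and the sample postcodes are assumptions for illustration only.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, Optional[int]]  # (postcode, group number or None)

def fill_from_row_above(rows: List[Row]) -> List[Row]:
    """Sort by postcode and replace each missing group with the last known one."""
    filled: List[Row] = []
    last_seen: Optional[int] = None
    for postcode, group in sorted(rows, key=lambda r: r[0]):
        if group is not None:
            last_seen = group  # remember the most recent real value
        filled.append((postcode, group if group is not None else last_seen))
    return filled

# Hypothetical sample: two consecutive new postcodes without a group.
sample = [("AB1 0AA", 5), ("AB1 0AB", None), ("AB1 0AD", None), ("AB1 0AE", 7)]
result = fill_from_row_above(sample)
print(result)
```

A postcode that sorts before any row with a known group stays None, since there is nothing above it to copy.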
Fuzzy logic is about writing rules in structured English. These rules can be understood by man and machine at the same time and are easily adjusted to further requirements. A rule for replacing the NULL postcode can be written in Forth:
OrigPostcode NULL = IF
row_above OrigPostcode
row_above groupnumber
THEN
The term “row_above” is a linguistic variable. And above means that the index is i-1 with 0.9 affiliation.
Instead of formulating the rule manually, it's also possible to use a technique called “Learning from demonstration”. That means a human operator does the copy-and-paste task first and the system recognizes the rule autonomously. This is usually done with neural networks and is described in open access journals freely available on the Internet, provided by predatory publishing groups.

Finding strings that differ with at most one letter from a given string in SAS with PROC SQL

First some context. I am using proc sql in SAS, and need to fetch all the entries in a data set (with a couple of million entries) that have variable "Name" equal to (let's say) "Massachusetts". Of course, since the data was once manually entered by humans, close to all conceivable spelling errors occur ("Amssachusetts", "Kassachusetts" etc.).
I have found that few entries get more than two characters wrong, so the code
Name like "__ssachusetts" OR Name like "_a_sachusetts" OR ... OR Name like "Massachuset__"
would select the entries I am looking for. However, I am hoping that there must be a more convenient way to write
Name that differs by at most 2 characters from "Massachusetts";
Is there? Or is there some other strategy for fetching these entries? I tried searching both Stack Overflow and the web but was unsuccessful. I am also a relative beginner with both SQL and SAS.
Some additional information: The database is not in English (and the actual string is not "Massachusetts") so using SOUNDEX is not really feasible (if it ever were).
Thanks in advance.
(Edit: Improved the title)
SAS has built-in functions COMPGED and COMPLEV to compute distances between strings. Here is an example that shows how to select just those with a Levenshtein edit distance of less than or equal to 2.
data typo;
input name $20.;
datalines;
massachusetts
masachusets
mssachusetts
nassachusets
nassachussets
massachusett
;
proc sql;
select name from typo
where complev(name, "massachusetts") <= 2;
quit;
There are other string-distance measures, such as Hamming distance, that may work better when the strings being compared have the same length.
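For reference, Hamming distance counts the positions at which two strings of equal length differ, so it covers wrong or swapped letters but not missing or extra ones. A minimal Python sketch, with the function name and sample values chosen only for illustration:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

target = "massachusetts"
candidates = ["massachusetts", "kassachusetts", "amssachusetts", "masachusets"]

# Keep names of the same length that differ in at most two positions.
close = [name for name in candidates
         if len(name) == len(target) and hamming(name, target) <= 2]
print(close)
```

The last candidate has dropped letters, so its length differs and it is skipped; an edit distance such as Levenshtein would be needed to catch it.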
You can search on google for implementation of this algorithm for your specific DB engine.
What you are looking for is "approximate string matching". For that, one can use the Levenshtein distance algorithm. I am not sure, but I hope this answer helps.
You could implement a stored function of this type (Oracle syntax, transform to your RDBMS):
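CREATE FUNCTION distance(one VARCHAR2, two VARCHAR2) RETURN NUMBER
DETERMINISTIC
IS
BEGIN
-- do some comparison here; Oracle's built-in UTL_MATCH.EDIT_DISTANCE
-- (Levenshtein distance) is one option
RETURN UTL_MATCH.EDIT_DISTANCE(one, two);
END distance;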
And then use it in SQL:
SELECT * FROM table WHERE distance(name, 'Massachusetts') <= 2
Of course, these things tend to be quite slow...
I know this is four years too late but since it might also give ideas to others who are searching this thread:
What you're considering is a semantic layered design. You would need to implement some conditional logic for these different text comparisons, using measures such as Levenshtein distance or Jaro-Winkler similarity for comparing text of differing lengths, and Hamming distance for text of the same length where you expect only simple character substitutions. This is nothing new these days with all of the various text mining programs out there.
Here is a post which is very good in my view;
Jaro-Winkler string comparison function in SAS

How to get multi row data of one column to one row of one Column

I need to get data in multiple row of one column.
For example data from that format
ID Interest
Sports
Cooking
Movie
Reading
to that format
ID Interest
Sports,Cooking
Movie,Reading
I wonder whether we can do that in MS Access SQL. If anybody knows how, please help me with it.
Take a look at Allen Browne's approach: Concatenate values from related records
As for the normalization argument, I'm not suggesting you store concatenated values. But if you want to join them together for display purposes (like a report or form), I don't think you're violating the rules of normalization.
This is called de-normalizing data. It may be acceptable for final reporting. Apparently some experts believe it's good for something, as seen here.
(Mind you, kevchadder's question is right on.)
Have you looked into the SQL Pivot operation?
Take a look at this link:
http://technet.microsoft.com/en-us/library/ms177410.aspx
Just noticed you're using access. Take a look at this article:
http://www.blueclaw-db.com/accessquerysql/pivot_query.htm
This is not something you should do in SQL, and it's most likely not possible at all.
Merging the rows in your application code shouldn't be too hard.

MySQL: select the closest match?

I want to show the closest related item for a product. So say I am showing a product and the style number is SG-sfs35s. Is there a way to select whatever product's style number is closest to that?
Thanks.
EDIT: To answer your questions: I definitely want to keep the first 2 letters, as that is the manufacturer code, but for the part after the dash, just whatever matches closest. So, for example, SG-sfs35s would match SG-shs35s much more than SG-sht64s. I hope this makes sense. Whenever I do LIKE product_style_number, it only pulls the exact match.
There normally isn't a simple way to match product codes that are roughly similar.
A more SQL friendly solution is to create a new table that maps each product to all the products it is similar to.
This table would either need to be maintained manually, or a more sophisticated script can be executed periodically to update it.
If your product codes follow a consistent pattern (all the letters are the same for similar products, with only the numbers changing), then you should be able to use a regular expression to match the similar items. There are docs on this here...
It sounds like what you want is Levenshtein distance.
Unfortunately, there isn't a built-in Levenshtein function for MySQL, but some folks have come up with a user-defined function that does it (dead link).
You will probably want to do it as a stored procedure, as I expect that the algorithm may not be trivial.
For example, you may split the term at the -, so you have two parts. You do a LIKE query on each part and use that to make a decision.
You could just loop though, replacing the last character with "%" until you get at least one result, in your stored procedure.
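The loop described above can be sketched roughly as follows in Python with SQLite. The table name `products`, the column `style`, the `min_prefix` parameter, and the sample rows are assumptions for illustration; a MySQL stored procedure would follow the same shape. The code is assumed to contain no `%` or `_` characters, which LIKE would treat as wildcards.

```python
import sqlite3
from typing import List

def closest_by_prefix(conn: sqlite3.Connection, style: str,
                      min_prefix: int = 3) -> List[str]:
    """Shorten the code one character at a time until LIKE 'prefix%' finds other rows."""
    # Never cut below min_prefix characters, so the manufacturer part ("SG-") is kept.
    for keep in range(len(style) - 1, min_prefix - 1, -1):
        pattern = style[:keep] + "%"
        rows = conn.execute(
            "SELECT style FROM products "
            "WHERE style LIKE ? AND style <> ? ORDER BY style",
            (pattern, style),
        ).fetchall()
        if rows:
            return [r[0] for r in rows]
    return []

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (style TEXT)")
conn.executemany("INSERT INTO products VALUES (?)",
                 [("SG-sfs35s",), ("SG-shs35s",), ("SG-sht64s",), ("XY-sfs35s",)])

found = closest_by_prefix(conn, "SG-sfs35s")
print(found)
```

With this data both SG-shs35s and SG-sht64s come back together, because a prefix match only looks at the leading characters; ranking one above the other would still need a distance measure.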
Sounds like you need something like Lucene, though I'm not sure if that would be overkill for your situation. But it certainly would be able to do text searches and return the most similar ones first.
If you need something more simple I would try to start by searching with the full product code, then if that doesn't work try to use wildcards/remove some characters until you return a result.
JD Isaacks.
This situation of yours is very simple to solve.
It's not like you need to use artificial intelligence like Google does.
http://www.w3schools.com/sql/sql_wildcards.asp
Take a look at this manual at w3schools about wildcards to use with your SELECT code.
But also you will need to create a new table with 3 columns: LeftCode, RightCode and WildCard.
Example:
Rows on Table:
LeftCode = SG | RightCode = 35s | WildCard = SG-s_s35s
LeftCode = SG | RightCode = 64s | WildCard = SG-s_t64s
SQL Code
If the user typed the code that matches the row1 of the table:
SELECT * FROM PRODUCTS WHERE CODE LIKE "$WildCard";
Where $WildCard is the PHP variable containing the column 3 of the new table.
I hope I helped, even 4 years late...