TSQL Degrees of Difference in strings - sql

I am trying to compare 2 strings in a sql statement. Some of the string are almost identical and some are very different
For example:
The following 2 addresses are almost identical
22224 143RD AVE
222-24 143RD AVE.
While the next 2 are very different
6969 elmund street
6969 mamerth street
Is there a function to classify the degree of difference?

http://support.microsoft.com/kb/100365 lays out DIFFERENCE() and SOUNDEX(). I'm not sure how well these would work in your particular use case.

Related

Adding in missing Country Codes into a dataset (GDP Dataset)

I have downloaded a dataset which has countries, their codes and their GDP by year in 4 columns (5 if you include the unique row number far left). I noticed however that there are some missing codes for the country codes and was wondering if anyone could help me out and tell me how to get those codes and add them in , probably from a seperate dataset I imagine . You can see this isin the pictures I posted. Second pictures shows the missing country code data. Thanks.
.
Your country codes look like ISO 3166-1, which are only defined for countries and not for the larger entities such as « East Asia » and « Western Offshoots ».
You could roll your own for these entities, see ISO country codes glossary:
User-assigned codes - If users need code elements to represent country names not included in ISO 3166-1, the series of letters [...] AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ respectively, and the series of numbers 900 to 999 are available.
I think the easiest is to prefix them all with X so you know easily that they are your own codes. Then use the 2 next letters for initials:
East Asia: XEA
Western Offshoots: XWO
etc.

Exact String match using oracle CONTAINS for multiple values

I need to get exact name string match from multiple values using Contains function.
I used below query to get data which matches exactly JOHN SMITH OR MIKE DAVID but query is fetching all data which has JOHN SMITH, JOHN, SMITH, MIKE, DAVID, MIKE DAVID, JOHN..SMITH, JOHN/SMITH,....
where contains(names,'{JOHN SMITH} OR {MIKE DAVID}')>0
Note - I don't want to use multiple like in OR conditions.We need to pass around 200 to 300 values (names) to do match pattern.
Can anyone let me know how to get exact match from multiple values using CONTAINS?
Thanks
Anand
I don't use Oracle but i liked this question and i will try to answer it based on what i read on Oracle's documentation, regarding CONTAINS.
so i am currently reading about Stored Query Expression (SQE),
Stored Query Expression (SQE)
Use the SQE operator to call a stored query expression created with
the CTX_QUERY.STORE_SQE procedure.
Stored query expressions can be used for creating predefined bins for
organizing and categorizing documents or to perform iterative queries,
in which an initial query is refined using one or more additional
queries.
So perhaps you could pass all the names you want to query in such a way. For example:
begin
ctx_query.store_sqe('mynames', 'JOHN SMITH OR MIKE DAVID OR GEORGE HARRIS OR JOHN DOE');
end;
And then call it:
SELECT SCORE(1), nameid FROM mytable
WHERE CONTAINS(names, 'sqe(mynames)', 1)> 0
ORDER BY SCORE(1);
Hope this was helpful.

Fuzzy text searching in Oracle

I have a large Oracle DB table which contains street names for a whole country, which has 600000+ rows. In my application, I take an address string as input and want to check whether specific substrings of this address string matches one or many of the street names in the table, such that I can label that address substring as the name of a street.
Clearly, this should be a fuzzy text matching problem, there is only a small chance that the substring I query has an exact match with the street names in DB table. So there should be some kind of fuzzy text matching approach. I am trying to read the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm in which CONTAINS and CATSEARCH search operators are explained. But these seem to be used for more complex tasks like searching a match for the given string in documents. I just want to do that for a column of a table.
What do you suggest me in this case, does Oracle have support for such kind of fuzzy text matching queries?
UTL_MATCH contains methods for matching strings and comparing their similarity. The
edit distance, also known as the Levenshtein Distance, might be a good place to start. Since one string is a substring it may help to compare the edit distance
relative to the size of the strings.
--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
--Rank edit ratios.
select substring, address, edit_ratio
,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
from
(
--Calculate edit ratio - edit distance relative to string sizes.
select
substring,
address,
(length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
from
(
--Fake addreses (from http://names.igopaygo.com/street/north_american_address)
select '526 Burning Hill Big Beaver District of Columbia 20041' address from dual union all
select '5206 Hidden Rise Whitebead Michigan 48426' address from dual union all
select '2714 Noble Drive Milk River Michigan 48770' address from dual union all
select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
select '968 Iron Corner Wacker Arkansas 72793' address from dual
) addresses
cross join
(
--Address substrings.
select 'Michigan' substring from dual union all
select 'Not-So-Hidden Rise' substring from dual union all
select '123 Fake Street' substring from dual
)
order by substring, edit_ratio desc
)
)
where edit_ratio_rank = 1
order by substring, address;
These results are not great but hopefully this is at least a good starting point. It should work with any language. But you'll still probably want to combine this with some language- or locale- specific comparison rules.
SUBSTRING ADDRESS EDIT_RATIO
--------- ------- ----------
123 Fake Street 526 Burning Hill Big Beaver District of Columbia 20041 0.5333
Michigan 2714 Noble Drive Milk River Michigan 48770 1
Michigan 5206 Hidden Rise Whitebead Michigan 48426 1
Not-So-Hidden Rise 5206 Hidden Rise Whitebead Michigan 48426 0.5
You could make use of the SOUNDEX function available in Oracle databases. SOUNDEX computes a numeric signature of a text string. This can be used to find strings which sound similar and thus reduce the number of string comparisons.
Edited:
If SOUNDEX is not suitable for your local language, you can ask Google for a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.
Example: A Turkish SOUNDEX is promoted here.
To increase the matching quality, the street name spelling should be unified in a first step. This could be done by applying a set of rules:
Simplified example rules:
Convert all characters to lowercase
Remove "str." at the end of a name
Remove "drv." at the end of a name
Remove "place" at the end of a name
Remove "ave." at the end of a name
Sort names with multiple words alphabetically
Drop auxiliary words like "of", "and", "the", ...

Combining almost identical rows into 1

I have a tricky problem that I wouldn't mind a bit of help on, I've made some progress using queries that I've here and elsewhere, but am getting seriously stumped now.
I have a mailing list that has numerous near duplications that I'm trying to combine into one meaningful row, taking data such as this.
Title Forename Surname Address1 Postcode Phone Age Income Ownership Gas
Mrs D Andrews 122 Somewhere BH10 123456 66-70 Homeowner
Ms Diane Andrews 122 Somewhere BH10 123456 £25-40 EDF
and making one row along the lines of
Title Forename Surname Address1 Postcode Phone Age Income Ownership Gas
Mrs Diane Andrews 122 Somewhere BH10 123456 66-70 £25-40 Homeowner EDF
I have over 127 million records, most duplicated with a similar pattern, but no clear logic as was proven when I added an identity field. I also have over 90 columns to consider, so it's a bit of work!
There isn't a clear pattern to the data, so I'm thinking I may have a huge case statement to try to climb over.
Using the following code I can get a decent start on only returning the full name, but with the pattern of data - trying to compare the fields across rows is as follows.
SELECT c1.*
FROM
Mailing c1
JOIN
Mailingc2 ON c1.Telephone1 = c2.Telephone1 AND c1.surname = c2.surname
WHERE
len(c1.Forename) > len(c2.Forename)
AND c2.over_18 <> ''
AND c1.Telephone1 = '123456'
Has anyone got any pointers as to how I should progress please? I'm open to discussion and ideas...
I'm using SQL 2005 and apologies in advance if the tagging is all over the place!
Cheers,
Jon
Would it work by assuming that all persons with the same surname and phone number (Do all persons have a phone?) were the same person?
INSERT INTO newtable <fieldnames>
SELECT lastname,phone,max(field3),max(field4)....
FROM oldtable
GROUP BY lastname,phone
But that would collapse John Smith and Jack Smith living together into one person.
Perhaps you should consider outsourcing it to a data-entry sweatshop somewhere, adter you have preprocessed the data. :-)
And/or be prepared to take the flack for mistaken bundling.
Perhaps adding something like "To improve our green footprint, we have merged x listings on your adress together. If you would like separate mailings, please contact us"

Compare two addresses which are not in standard format

I have to compare addresses from two tables and get the Id if the address matches.
Each table has three columns Houseno, street, state
The address are not in standard format in either of the tables. There are approx. 50,000 rows, I need to scan through
At some places its Ave. Avenue Ave . Str Street, ST. Lane Ln. Place PL Cir CIRCLE.
Any combination with a dot or comma or spaces ,hypen.
I was thinking of combining all three What can be best way to do it in SQL or PLSQL for example
table1
HNO STR State
----- ----- -----
12 6th Ave NY
10 3rd Aven SD
12-11 Fouth St NJ
11 sixth Lane NY
A23 Main Parkway NY
A-21 124 th Str. VA
table2
id HNO STR state
-- ----- ----- -----
1 12 6 Ave. NY
13 10 3 Avenue SD
15 1121 Fouth Street NJ
33 23 9th Lane NY
24 X23 Main Cir. NY
34 A1 124th Street VA
There is no simple way to achieve what you want. There is a expensive software (google for "address standardization software") that can do this but rarely 100% automatic.
What this type of software does is to take the data, use complex heuristics to try to figure out the "official" address and then return that (sometimes with the confidence that the result is correct, sometimes a list of results sorted by confidence).
For a small percentage of the data, the software will simply not work and you'll have to fix that yourself.
Oracle has a built in package UTL_Match which has an edit_distance function (based on the Levenshtein algorithm, this is a measure of how many changes you would need to make to make one string the same as another). More info about this Package / Function can be found here: http://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm
You would need to make some decisions around whether to compare each column or concatenate and then compare and what a reasonable threshold is. For example, you may want to do a manual check on any with an edit distance of less than 8 on the concatenated values.
Let me know if you want any help with the syntax, the edit_distance function just takes 2 varchar2 args (the strings you want to compare) and returns a number.
This is not a perfect solution in that if you set the threshold high you will have a lot of manual checking to do to discard some, and if you set it too low you will miss some matches, but it may be about the best if you want a relatively simple solution.
The way we did this for one of our applications was to use a third party adddress normalization API(eg:Pitney Bowes),normalize each address(Address is a combination of Street Address,City ,State and Zip) and create a T-sql hash for that address.For the adress to compare do the same thing and compare the two hashes and if they match,we have a match
you can make a cursor where you do first a group by where house number and city =.
in a loop
you can separate a row with instr e substr considering chr(32).
After that you can try to consider to make a confront with substring where you have a number 6 = 6th , other case street = str.
good luck!