Fuzzy text searching in Oracle - sql

I have a large Oracle DB table containing street names for a whole country; it has 600,000+ rows. In my application, I take an address string as input and want to check whether specific substrings of this address string match one or more of the street names in the table, so that I can label that substring as the name of a street.
Clearly, this is a fuzzy text matching problem: there is only a small chance that the substring I query has an exact match among the street names in the table. I have been reading the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm, which explains the CONTAINS and CATSEARCH search operators. But these seem to be meant for more complex tasks, like searching for a given string in documents. I just want to do that for a column of a table.
What do you suggest in this case? Does Oracle support this kind of fuzzy text matching query?

UTL_MATCH contains methods for matching strings and comparing their similarity. The edit distance, also known as the Levenshtein distance, might be a good place to start. Since one string is a substring of the other, it may help to compare the edit distance relative to the size of the strings.
--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
--Rank edit ratios.
select substring, address, edit_ratio
,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
from
(
--Calculate edit ratio - edit distance relative to string sizes.
select
substring,
address,
(length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
from
(
--Fake addresses (from http://names.igopaygo.com/street/north_american_address)
select '526 Burning Hill Big Beaver District of Columbia 20041' address from dual union all
select '5206 Hidden Rise Whitebead Michigan 48426' address from dual union all
select '2714 Noble Drive Milk River Michigan 48770' address from dual union all
select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
select '968 Iron Corner Wacker Arkansas 72793' address from dual
) addresses
cross join
(
--Address substrings.
select 'Michigan' substring from dual union all
select 'Not-So-Hidden Rise' substring from dual union all
select '123 Fake Street' substring from dual
)
order by substring, edit_ratio desc
)
)
where edit_ratio_rank = 1
order by substring, address;
These results are not great but hopefully this is at least a good starting point. It should work with any language. But you'll still probably want to combine this with some language- or locale- specific comparison rules.
SUBSTRING           ADDRESS                                                 EDIT_RATIO
------------------  ------------------------------------------------------  ----------
123 Fake Street     526 Burning Hill Big Beaver District of Columbia 20041  0.5333
Michigan            2714 Noble Drive Milk River Michigan 48770              1
Michigan            5206 Hidden Rise Whitebead Michigan 48426               1
Not-So-Hidden Rise  5206 Hidden Rise Whitebead Michigan 48426               0.5

You could make use of the SOUNDEX function available in Oracle databases. SOUNDEX computes a short phonetic code for a text string (a letter followed by three digits). This can be used to find strings which sound similar and thus reduce the number of string comparisons.
Edited:
If SOUNDEX is not suitable for your local language, you can ask Google for a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.
Example: A Turkish SOUNDEX is promoted here.
To increase the matching quality, the street name spelling should be unified in a first step. This could be done by applying a set of rules:
Simplified example rules:
Convert all characters to lowercase
Remove "str." at the end of a name
Remove "drv." at the end of a name
Remove "place" at the end of a name
Remove "ave." at the end of a name
Sort names with multiple words alphabetically
Drop auxiliary words like "of", "and", "the", ...
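The simplified rules above can be sketched in a few lines of Python, outside the database. This is only an illustration: the function name, the suffix list and the auxiliary-word list are assumptions taken from the example rules, not a complete rule set for any real country.

```python
# Hypothetical rule tables, mirroring the simplified example rules.
SUFFIXES = {"str.", "drv.", "place", "ave."}
AUX_WORDS = {"of", "and", "the"}


def normalize_street(name):
    """Return a unified spelling of a street name for comparison."""
    words = name.lower().split()          # lowercase, split on whitespace
    if words and words[-1] in SUFFIXES:   # drop one trailing street-type word
        words = words[:-1]
    words = [w for w in words if w not in AUX_WORDS]  # drop auxiliary words
    return " ".join(sorted(words))        # sort the remaining words


print(normalize_street("Hill of the Burning Str."))  # burning hill
print(normalize_street("Burning Hill"))              # burning hill
```

Both spellings end up with the same key, so the phonetic or edit-distance comparison only has to deal with the differences that remain after unification.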

Related

BigQuery: grouping by similar strings for a large dataset

I have a table of invoice data with over 100k unique invoices and several thousand unique company names associated with them.
I'm trying to group these company names into more general groups to understand how many invoices they're responsible for, how often they receive them, etc.
Currently, I'm using the following code to identify unique company names:
SELECT DISTINCT(company_name)
FROM invoice_data
ORDER BY company_name
The problem is that this only gives me exact matches, when it's obvious that many string values in company_name are similar. For example: McDonalds Paddington, McDonlads Oxford Square, McDonalds Peckham, etc.
How can I make my GROUP BY statement more general?
Sometimes the issue isn't as simple as the example above; occasionally there is simply an extra space or PTY/LTD which throws off a GROUP BY match.
EDIT
To give an example of what I'm looking for, I'd be looking to turn the following:
company_name
----------------------
Jim's Pizza Paddington
Jim's Pizza Oxford
McDonald's Peckham
McDonald's Victoria
----------------------
And be able to group by their company name rather than exclusively with an exact string match.
Have you tried using the Soundex function?
SELECT
SOUNDEX(name) AS code,
MAX( name) AS sample_name,
count(name) as records
FROM ((
SELECT
"Jim's Pizza Paddington" AS name)
UNION ALL (
SELECT
"Jim's Pizza Oxford" AS name)
UNION ALL (
SELECT
"McDonald's Peckham" AS name)
UNION ALL (
SELECT
"McDonald's Victoria" AS name))
GROUP BY
1
ORDER BY
1
You can then use the SOUNDEX code to create groupings, with a split or similar function to pull out the part of the string that matches the name group, or use a window function to pull back one occurrence as the name string. It's not perfect, but it means you do not need to pull the data into other tools with advanced language recognition.
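To see why the grouping works, here is a rough Python sketch of the classic American Soundex rules applied to the same four names. It is an illustration only: it ignores every non-letter character, and BigQuery's SOUNDEX may differ in edge cases, so treat the helper names and that behaviour as assumptions.

```python
from collections import defaultdict

# Consonant classes of the classic Soundex scheme, numbered 1 to 6.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(text):
    """Return a 4-character Soundex code, or '' if text has no ASCII letters."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue              # h and w do not separate equal codes
        code = CODES.get(c, "")   # vowels and y map to ""
        if code and code != prev:
            result += code
        prev = code               # a vowel resets prev, so repeats count again
        if len(result) == 4:
            break
    return result.ljust(4, "0")


def group_by_soundex(names):
    """Group names that share the same Soundex code."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)


names = ["Jim's Pizza Paddington", "Jim's Pizza Oxford",
         "McDonald's Peckham", "McDonald's Victoria"]
for code, members in group_by_soundex(names).items():
    print(code, members)
```

Because the code is built from the first few consonants only, the location at the end of each name never reaches the code. The same property means that unrelated names with a similar beginning will also land in one group.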

How to extract 4 words before each word of a given list in sql

I have got a table with a column containing text (the column name is 'Text'). There are some acronyms in brackets, so I would like to extract them along with the five words appearing before them.
I have already extracted the rows that contain all the acronyms of my list using the like operator:
select Text from table
where Text like '%(NASA)%'
or Text like '%(NBA)%'
Instead of getting an output of the whole text in each row:
Text
He works for the National Aeronautics and Space Administration (NASA).
He played basketball for the National Basketball Association (NBA) from 2000 to 2002.
I would like to get the output of two columns one for the acronym and another for the meaning of the acronym (showing the five words prior to the acronym):
Acronym  Meaning
(NASA)   National Aeronautics and Space Administration
(NBA)    for the National Basketball Association
Without actually seeing your data, I will assume that all the acronyms follow the same pattern but you should be able to adapt the code with the correct logic if your strings are structured differently. In this case '(Acronym) meaning' is the structure which I'm going to work with.
-- Sample data in the assumed '(Acronym) meaning' layout.
select '(NASA) National Aeronautics and Space Administration' as text
into #temp1
union all
select '(FBI) Federal Bureau of Investigation' as text;

-- Acronym: the text between the brackets. Meaning: everything after ') '.
select SUBSTRING(text,CHARINDEX('(',text)+1 ,CHARINDEX(')',text)-CHARINDEX('(',text)-1) as Acronym,
SUBSTRING(text,CHARINDEX(')',text)+2 ,len(text)-CHARINDEX(')',text)+1) as meaning
from #temp1;
This code subsets the original string by using character positions in the string between the brackets for the acronym and then character positions starting after closed brackets for the meanings.
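If the text really has the layout shown in the question (meaning first, acronym in brackets afterwards), the extraction is easier with a regular expression in application code. The following Python sketch is an assumption about how that could look; the pattern and the function name are made up for illustration, and it only recognises acronyms written in capital letters.

```python
import re

# Up to five whitespace-separated words, followed by an acronym in brackets.
PATTERN = re.compile(r"((?:\S+\s+){0,5})\(([A-Z]+)\)")


def extract_acronyms(text):
    """Return (acronym, preceding words) pairs found in text."""
    return [
        ("(" + m.group(2) + ")", m.group(1).strip())
        for m in PATTERN.finditer(text)
    ]


rows = [
    "He works for the National Aeronautics and Space Administration (NASA).",
    "He played basketball for the National Basketball Association (NBA) from 2000 to 2002.",
]
for row in rows:
    print(extract_acronyms(row))
```

For the two sample rows this yields the five words before each acronym, including the leading "for the" in the NBA case, exactly as in the desired output of the question.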

Get following letter from given

I have a table with company names. Some companies have different locations and different legal names but they should be reported under the same Group Code. The Code is made up using the first five letters.
Company         GroupCode
DEEZER FRANCE   DEEZE
DEEZER SPAIN    DEEZE
DEEZER ALGERIA  DEEZE
So far so good. Now I’m adding a different company which starts with the same letters but should get a new Group Code.
A new code should be assigned if the company name does not contain a word which is part of a company name that already has a GroupCode. In this case, DEEZER is the key word which determines association with GroupCode DEEZE.
The rule is that the code should then use the first four letters + the fifth letter next in the alphabet. If this code also exists, then use the first four letters + the fifth letter next but one in the alphabet. The required result would look like:
Company         GroupCode  Status
DEEZER FRANCE   DEEZE      EXISTING
DEEZER SPAIN    DEEZE      EXISTING
DEEZER ALGERIA  DEEZE      EXISTING
DEEZEMBER       DEEZF      CREATED
DEEZEMAL        DEEZG      CREATED
So what I need to figure out is the next „unused“ letter. How can I achieve this with SQL Server 2008 R2?
Try this:
-- @companyname is the company name passed in as a parameter.
;with cte as
(select max(groupcode) maxcode
from yourtable
where left(groupcode,4) = left(@companyname,4))
insert into yourtable (company, groupcode, [status])
select @companyname,
case when maxcode is null then left(@companyname,4) + 'a' else left(maxcode,4) + char(ascii(right(maxcode,1))+1) end,
'created'
from cte
Assumption: Your input is taking the company name as a parameter from somewhere, presumably the front end.
The idea is to use ascii function to get the ASCII code of the last letter, increment it by 1 and go back to the corresponding character using char function.
Be warned, however, that this is definitely not the best solution. For instance, I have not implemented bounds checking to ensure range between A and Z. In fact, I would suggest that you handle this in application code rather than at DB level.
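If you do move this into application code, the rule from the question plus the missing bounds check could look roughly like the Python sketch below. The function name and the idea of passing the existing codes as a set are assumptions for illustration; the sketch also does not decide whether a company belongs to an existing group (the DEEZER key word check), it only finds the next unused letter.

```python
def next_group_code(company, existing_codes):
    """Return the first unused code: first four letters + fifth letter or later."""
    name = company.upper()
    head = name[:5]
    if len(head) < 5 or not (head.isascii() and head.isalpha()):
        raise ValueError("company name must start with five letters A-Z")
    prefix, fifth = head[:4], head[4]
    # Walk from the fifth letter towards Z until a free code is found.
    for code_point in range(ord(fifth), ord("Z") + 1):
        candidate = prefix + chr(code_point)
        if candidate not in existing_codes:
            return candidate
    raise ValueError("no unused letter left for prefix " + prefix)


codes = {"DEEZE"}
for company in ["DEEZEMBER", "DEEZEMAL"]:
    code = next_group_code(company, codes)
    codes.add(code)
    print(company, code)
```

With the sample data this assigns DEEZF and then DEEZG, and it raises an error instead of running past Z.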

Compare two addresses which are not in standard format

I have to compare addresses from two tables and get the Id if the address matches.
Each table has three columns Houseno, street, state
The addresses are not in a standard format in either of the tables. There are approx. 50,000 rows I need to scan through.
The street types appear in many variants: Ave. / Avenue / Ave ., Str / Street / ST., Lane / Ln., Place / PL, Cir / CIRCLE.
Any combination with a dot, comma, spaces or hyphen occurs.
I was thinking of combining all three columns. What would be the best way to do this in SQL or PL/SQL? For example:
table1
HNO    STR           State
-----  ------------  -----
12     6th Ave       NY
10     3rd Aven      SD
12-11  Fouth St      NJ
11     sixth Lane    NY
A23    Main Parkway  NY
A-21   124 th Str.   VA
table2
id  HNO   STR           state
--  ----  ------------  -----
1   12    6 Ave.        NY
13  10    3 Avenue      SD
15  1121  Fouth Street  NJ
33  23    9th Lane      NY
24  X23   Main Cir.     NY
34  A1    124th Street  VA
There is no simple way to achieve what you want. There is expensive software (search for "address standardization software") that can do this, but rarely 100% automatically.
What this type of software does is to take the data, use complex heuristics to try to figure out the "official" address and then return that (sometimes with the confidence that the result is correct, sometimes a list of results sorted by confidence).
For a small percentage of the data, the software will simply not work and you'll have to fix that yourself.
Oracle has a built in package UTL_Match which has an edit_distance function (based on the Levenshtein algorithm, this is a measure of how many changes you would need to make to make one string the same as another). More info about this Package / Function can be found here: http://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm
You would need to make some decisions around whether to compare each column or concatenate and then compare and what a reasonable threshold is. For example, you may want to do a manual check on any with an edit distance of less than 8 on the concatenated values.
Let me know if you want any help with the syntax, the edit_distance function just takes 2 varchar2 args (the strings you want to compare) and returns a number.
This is not a perfect solution in that if you set the threshold high you will have a lot of manual checking to do to discard some, and if you set it too low you will miss some matches, but it may be about the best if you want a relatively simple solution.
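To get a feeling for the threshold before writing the PL/SQL, you can experiment with the same measure outside the database. The Python sketch below is a plain Levenshtein implementation used as a stand-in for UTL_MATCH.EDIT_DISTANCE; the helper names are hypothetical, and the default threshold of 8 is just the example value mentioned above.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def is_candidate(addr1, addr2, threshold=8):
    """True if the concatenated addresses are close enough for a manual check."""
    return edit_distance(addr1.lower(), addr2.lower()) < threshold


print(edit_distance("12 6th Ave NY", "12 6 Ave. NY"))
print(is_candidate("12 6th Ave NY", "23 9th Lane NY"))
```

On short concatenated addresses like these, a threshold of 8 is very permissive, which illustrates the trade-off described above: the shorter the strings, the lower the threshold has to be.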
The way we did this for one of our applications was to use a third-party address normalization API (e.g. Pitney Bowes): normalize each address (an address being the combination of street address, city, state and zip) and create a T-SQL hash of the result. For the address you want to compare, do the same thing and compare the two hashes; if they match, the addresses match.
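The normalize-then-hash idea can be illustrated without the commercial API. In the Python sketch below, a tiny hand-written abbreviation table stands in for the real normalization service, so the table, the function names and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical stand-in for a real address normalization service.
ABBREVIATIONS = {
    "avenue": "ave", "aven": "ave",
    "street": "st", "str": "st",
    "lane": "ln", "place": "pl", "circle": "cir",
}


def normalize_address(hno, street, state):
    """Lowercase, strip punctuation and unify common street-type words."""
    text = "{} {} {}".format(hno, street, state).lower()
    text = re.sub(r"[.,\-]", " ", text)   # dots, commas, hyphens become spaces
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)


def address_hash(hno, street, state):
    """Hash of the normalized address; equal hashes mean equal normal forms."""
    normalized = normalize_address(hno, street, state)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


print(address_hash("10", "3rd Avenue", "SD") == address_hash("10", "3rd Ave.", "sd"))
print(address_hash("12", "6th Ave", "NY") == address_hash("12", "6 Ave.", "NY"))
```

The first comparison matches because both spellings normalize to the same string. The second does not, since "6th" and "6" are still different, which is the kind of case a real normalization service is paid to handle.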
You can write a cursor that first groups the rows where house number and city are equal.
In a loop, you can split each row into words with INSTR and SUBSTR, using chr(32) (the space character) as the separator.
After that, you can compare the substrings using rules such as 6 = 6th or street = str.
Good luck!

Sort Postcode for menu/list

I need to sort a list of UK postcodes into order.
Is there a simple way to do it?
UK postcodes are made up of letters and numbers.
For full info on the format, see:
http://en.wikipedia.org/wiki/UK_postcodes
My problem is that a simple alpha sort doesn't work, because each code starts with one or two letters immediately followed by a number of up to two digits, then a space, another number and then two letters, e.g. LS1 1AA or ls28 1AA. There is also another case: once the numbers in the first section exceed 99, it continues with 9A etc.
An alpha sort causes the 10s to immediately follow the 1:
...
LS1 9ZZ
LS10 1AA
...
LS2
I'm looking at creating a SQL function to convert the printable Postcode into a sortable postcode e.g. 'LS1 9ZZ' would become 'LS01 9ZZ', then use this function in the order by clause.
Has anybody done this or anything similar already?
You need to think of this as a tokenization issue so SW1A 1AA should tokenize to:
SW
1
A
1AA
(although you could break the inward part down into 1 and AA if you wanted to)
and G12 8QT should tokenize to:
G
12
(empty string)
8QT
Once you have broken the postcode down into those component parts, sorting should be easy enough. There is an exception with the GIR 0AA postcode, but you can just hardcode a test for that one.
edit: some more thoughts on tokenization
For the sample postcode SW1A 1AA, SW is the postcode area, 1A is the postcode district (which we'll break into two parts for sorting purposes), 1 is the postcode sector and AA is the unit postcode.
These are the valid postcode formats (source: Royal Mail PAF user guide page 8 - link at bottom of this page):
AN NAA
AAN NAA
ANN NAA
ANA NAA
AAA NAA (only for GIR 0AA code)
AANN NAA
AANA NAA
So a rough algorithm would be (assuming we want to separate the sector and unit postcode):
code = GIR 0AA? Tokenize to GI/R/ /0/AA (treating R as the district simplifies things)
code 5 characters long (not counting the space) e.g. G1 3AF? Tokenize to G/1/ /3/AF
code 6 characters long with 3rd character being a letter e.g. W1P 1HQ? Tokenize to W/1/P/1/HQ
code 6 characters long with 2nd character being a letter e.g. CR2 6XH? Tokenize to CR/2/ /6/XH
code 7 characters long with 4th character being a letter e.g. EC1A 1BB? Tokenize to EC/1/A/1/BB
otherwise e.g. TW14 2ZZ, tokenize to TW/14/ /2/ZZ
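As a sanity check, the rough algorithm above translates fairly directly into code. The Python sketch below follows the rules in the same order; the function names are made up, a missing district letter is returned as an empty string, and there is no validation, so it assumes the input already matches one of the listed formats.

```python
def tokenize_postcode(postcode):
    """Return (area, district number, district letter, sector, unit)."""
    code = postcode.upper().replace(" ", "")
    if code == "GIR0AA":                          # the one hardcoded exception
        return ("GI", "R", "", "0", "AA")
    outward, sector, unit = code[:-3], code[-3], code[-2:]
    if len(code) == 5:                            # AN NAA
        area, number, letter = outward[0], outward[1], ""
    elif len(code) == 6 and code[2].isalpha():    # ANA NAA
        area, number, letter = outward[0], outward[1], outward[2]
    elif len(code) == 6 and code[1].isalpha():    # AAN NAA
        area, number, letter = outward[:2], outward[2], ""
    elif len(code) == 7 and code[3].isalpha():    # AANA NAA
        area, number, letter = outward[:2], outward[2], outward[3]
    else:                                         # ANN NAA or AANN NAA
        area, number, letter = outward[:-2], outward[-2:], ""
    return (area, number, letter, sector, unit)


def sort_key(postcode):
    """Sort key that compares the district number numerically."""
    area, number, letter, sector, unit = tokenize_postcode(postcode)
    number_key = int(number) if number.isdigit() else -1  # GIR 0AA has no number
    return (area, number_key, letter, sector, unit)


print(sorted(["LS10 1AA", "LS2 1AA", "LS1 9ZZ"], key=sort_key))
```

Sorting with this key puts LS1 before LS2 and LS2 before LS10, which is the order a plain alpha sort gets wrong.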
If the purpose is to display a list of postcodes for the user to choose from then I would adopt Neil Butterworth's suggestion of storing a 'sortable' version of the postcode in the database. The easiest way to create a sortable version is to pad all entries to nine characters:
two characters for the area (right-pad if shorter)
two for the district number (left-pad if shorter)
one for the district letter (pad if missing)
space
one for the sector
two for the unit
and GIR 0AA is a slight exception again. If you pad with spaces then the sort order should be correct. Examples using # to represent a space:
W1#1AA => W##1##1AA
WC1#1AA => WC#1##1AA
W10#1AA => W#10##1AA
W1W#1AA => W##1W#1AA
GIR#0AA => GI#R##0AA
WC10#1AA => WC10##1AA
WC1W#1AA => WC#1W#1AA
You need to right-pad the area if it's too short: left-padding produces the wrong sort order. All of the single-letter areas - B, E, G, L, M, N, S, W - would sort before all of the two-letter areas - AB, AL, ..., ZE - if you left-padded.
The district number needs to be left-padded to ensure that the natural W1, W2, ..., W9, W10 order remains intact.
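The padding scheme can be sketched like this in Python. The function name and the regular expression are assumptions for illustration; the sketch expects a single space between the two halves of the postcode and pads with real spaces (the # in the examples above).

```python
import re

# Outward code: 1-2 area letters, 1-2 district digits, optional district letter.
OUTWARD = re.compile(r"^([A-Z]{1,2})(\d{1,2})([A-Z]?)$")


def sortable_postcode(postcode):
    """Pad a postcode to the nine-character sortable form described above."""
    outward, inward = postcode.upper().split()
    if outward == "GIR":                      # special case GIR 0AA
        area, number, letter = "GI", "R", ""
    else:
        match = OUTWARD.match(outward)
        if match is None:
            raise ValueError("unexpected postcode format: " + postcode)
        area, number, letter = match.groups()
    # Right-pad the area, left-pad the number, pad a missing letter.
    return "{:<2}{:>2}{:<1} {}".format(area, number, letter, inward)


for code in ["W1 1AA", "WC1 1AA", "W10 1AA", "W1W 1AA", "GIR 0AA"]:
    print(code, "=>", sortable_postcode(code).replace(" ", "#"))
```

The value returned by sortable_postcode can be stored in its own column and used directly in an ORDER BY, or used as a sort key in application code.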
I know this is a couple of years late, but I too have just experienced this problem.
I managed to overcome it with the following code, so I thought I would share, as I searched the internet and could not find anything!
mysql_query("SELECT SUBSTRING_INDEX(postcode,' ',1) as p1, SUBSTRING_INDEX(postcode,' ',-1) as p2 from `table` ORDER BY LENGTH(p1), p1, p2 ASC");
This code takes a full UK postcode and splits it in two at the space.
It then orders by the length of the first part, then by the first part itself, followed by the second part.
I'd be tempted to store the normalised postcode in the database along with the real postcode - that way you only do the string manipulation once, and you can use an index to help you with the sort.