Compare two addresses which are not in standard format - sql

I have to compare addresses from two tables and get the Id if the address matches.
Each table has three columns: Houseno, street, state.
The addresses are not in a standard format in either table, and there are approx. 50,000 rows I need to scan through.
In some places it's Ave., Avenue or Ave .; elsewhere Str, Street or ST.; Lane or Ln.; Place or PL; Cir or CIRCLE.
Any combination with a dot, comma, spaces or hyphen can occur.
I was thinking of combining all three columns. What would be the best way to do it in SQL or PL/SQL? For example:
table1
HNO    STR           State
-----  ------------  -----
12     6th Ave       NY
10     3rd Aven      SD
12-11  Fouth St      NJ
11     sixth Lane    NY
A23    Main Parkway  NY
A-21   124 th Str.   VA
table2
id  HNO   STR           state
--  ----  ------------  -----
1   12    6 Ave.        NY
13  10    3 Avenue      SD
15  1121  Fouth Street  NJ
33  23    9th Lane      NY
24  X23   Main Cir.     NY
34  A1    124th Street  VA

There is no simple way to achieve what you want. There is expensive software (search for "address standardization software") that can do this, but it is rarely 100% automatic.
What this type of software does is to take the data, use complex heuristics to try to figure out the "official" address and then return that (sometimes with the confidence that the result is correct, sometimes a list of results sorted by confidence).
For a small percentage of the data, the software will simply not work and you'll have to fix that yourself.

Oracle has a built-in package, UTL_MATCH, which has an edit_distance function (based on the Levenshtein algorithm; it measures how many single-character changes you would need to make one string the same as another). More info about this package and function can be found here: http://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm
You would need to make some decisions around whether to compare each column or concatenate and then compare and what a reasonable threshold is. For example, you may want to do a manual check on any with an edit distance of less than 8 on the concatenated values.
Let me know if you want any help with the syntax; the edit_distance function just takes two VARCHAR2 arguments (the strings you want to compare) and returns a number.
This is not a perfect solution: if you set the threshold too high you will have a lot of manual checking to do to discard false matches, and if you set it too low you will miss some matches. It may still be about the best option if you want a relatively simple solution.
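To make the threshold idea concrete, here is a minimal Python sketch of the same comparison done outside the database. It is not Oracle's implementation: `edit_distance` is a plain textbook Levenshtein, and `concatenated` and `needs_manual_check` are hypothetical helper names. The threshold of 8 is just the example figure from above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (ch_a != ch_b)
            current.append(min(previous[j] + 1,      # delete ch_a
                               current[j - 1] + 1,   # insert ch_b
                               substitution))        # replace (or keep)
        previous = current
    return previous[-1]


def concatenated(hno: str, street: str, state: str) -> str:
    """Join the three columns into one upper-cased comparison string."""
    return " ".join((hno, street, state)).upper()


def needs_manual_check(row1, row2, threshold: int = 8) -> bool:
    """True when the pair is close enough to be worth a manual look."""
    return edit_distance(concatenated(*row1), concatenated(*row2)) < threshold
```

For the first sample rows, `edit_distance(concatenated("12", "6th Ave", "NY"), concatenated("12", "6 Ave.", "NY"))` is 3 (drop "TH", add the dot). With strings this short, a fixed threshold of 8 is probably too generous, so a threshold relative to the string length may work better.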

The way we did this for one of our applications was to use a third-party address normalization API (e.g. Pitney Bowes), normalize each address (an address being the combination of street address, city, state and ZIP) and create a T-SQL hash of the normalized address. For the address to compare, do the same thing and compare the two hashes; if they match, we have a match.
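A rough Python sketch of the normalize-then-hash idea follows. The `normalize_address` function here is only a hypothetical stand-in for the third-party API (it just upper-cases and strips punctuation, which a real normalizer goes far beyond), and SHA-256 is an assumed choice of hash, not necessarily what the T-SQL version used.

```python
import hashlib
import re


def normalize_address(street: str, city: str, state: str, zip_code: str) -> str:
    """Hypothetical placeholder for a real normalization API call:
    upper-case each part and collapse punctuation and whitespace."""
    parts = (street, city, state, zip_code)
    cleaned = [re.sub(r"[^A-Z0-9]+", " ", part.upper()).strip() for part in parts]
    return "|".join(cleaned)


def address_hash(street: str, city: str, state: str, zip_code: str) -> str:
    """Hash of the normalized address; equal hashes mean equal normalized text."""
    normalized = normalize_address(street, city, state, zip_code)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Storing the hash next to each row means the comparison becomes a plain equality join on an indexable column. The quality of the match depends entirely on the normalization step.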

You can write a cursor that first groups the rows where house number and city are equal.
Then, in a loop, you can split each row into words with INSTR and SUBSTR, using chr(32) (the space character) as the separator.
After that you can compare the substrings, treating a number and its ordinal as equal (6 = 6th) and likewise a word and its abbreviation (street = str).
Good luck!
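The word-by-word comparison described above might look roughly like this in Python (rather than a PL/SQL cursor). The `SUFFIXES` table and the function names are illustrative assumptions, the abbreviation list is far from complete, and spelled-out ordinals such as "sixth" are not handled.

```python
import re

# Assumed, incomplete abbreviation table: every variant maps to one spelling.
SUFFIXES = {
    "ave": "avenue", "aven": "avenue", "avenue": "avenue",
    "st": "street", "str": "street", "street": "street",
    "ln": "lane", "lane": "lane",
    "pl": "place", "place": "place",
    "cir": "circle", "circle": "circle",
}
ORDINAL = re.compile(r"(\d+)(?:st|nd|rd|th)")


def street_tokens(street: str) -> list:
    """Split on anything that is not a letter or digit, then unify tokens."""
    tokens = []
    for raw in re.split(r"[^a-z0-9]+", street.lower()):
        if not raw:
            continue
        ordinal = ORDINAL.fullmatch(raw)
        if ordinal:
            tokens.append(ordinal.group(1))          # "6th" -> "6"
        elif raw in {"th", "nd", "rd"} and tokens and tokens[-1].isdigit():
            continue                                 # "124 th" -> "124"
        else:
            tokens.append(SUFFIXES.get(raw, raw))    # "str" -> "street"
    return tokens


def same_address(row1, row2) -> bool:
    """Rows are (house number, street, state); number and state must be equal."""
    hno1, street1, state1 = row1
    hno2, street2, state2 = row2
    return (hno1 == hno2
            and state1.upper() == state2.upper()
            and street_tokens(street1) == street_tokens(street2))
```

With the sample data, `("12", "6th Ave", "NY")` and `("12", "6 Ave.", "NY")` compare equal, while `("A23", "Main Parkway", "NY")` and `("X23", "Main Cir.", "NY")` do not.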

Related

SQL Server : set all column aliases in a dynamic query

It's a bit of a long and convoluted story why I need to do this, but I will be getting a query string which I will then be executing with this code
EXECUTE sp_ExecuteSQL
I need to set the aliases of all the columns to "value". There could be a variable number of columns in the queries that are being passed in, and they could be all sorts of data types, for example
SELECT
Company, AddressNo, Address1, Town, County, Postcode
FROM Customers
SELECT
OrderNo, OrderType, CustomerNo, DeliveryNo, OrderDate
FROM Orders
Is this possible and relatively simple to do, or will I need to get the aliases included in the SQL queries? (It would be easier not to do this, if it can be avoided and done when we process the query.)
---Edit---
As an example, the output from the first query would be
Company   AddressNo Address1     Town   County   Postcode
--------- --------- ------------ ------ -------- --------
Dave Inc  12345     1 Main Road  Harlow Essex    HA1 1AA
AA Tyres  12234     5 Main Road  Epping Essex    EP1 1PP
I want it to be
value     value     value        value  value    value
--------- --------- ------------ ------ -------- --------
Dave Inc  12345     1 Main Road  Harlow Essex    HA1 1AA
AA Tyres  12234     5 Main Road  Epping Essex    EP1 1PP
So each of the columns has an alias of "value".
I could do this with
SELECT
Company AS 'value', AddressNo AS 'value', Address1 AS 'value', Town AS 'value', County AS 'value', Postcode AS 'value'
FROM Customers
but it would be better (it would save additional complexity in other steps in the process chain) if we didn't have to manually alias each column in the SQL we're feeding in to this section of the process.
Regarding the XY problem: this is a tiny section in a very large process chain, and it would take pages to explain the whole process in detail. In essence, we're taking code out of our database triggers and putting it into a dynamic procedure; we will then have frontends that users will access to "edit" the SQL statements that are called by the triggers, and these will then dynamically feed the results out into other systems. It works if we manually alias the SQL going in, but it would be neater if there was a way we could feed clean SQL into the process and then apply the aliases when the SQL is processed. It would keep us DRY, to start with.
I do not understand at all what you are trying to accomplish, but I believe the answer is no: there is no built-in way to globally predefine or override column aliases for ad hoc queries. You will need to code it yourself.

Fuzzy text searching in Oracle

I have a large Oracle DB table which contains street names for a whole country and has 600000+ rows. In my application, I take an address string as input and want to check whether specific substrings of this address string match one or more of the street names in the table, so that I can label that address substring as the name of a street.
Clearly, this should be a fuzzy text matching problem, there is only a small chance that the substring I query has an exact match with the street names in DB table. So there should be some kind of fuzzy text matching approach. I am trying to read the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm in which CONTAINS and CATSEARCH search operators are explained. But these seem to be used for more complex tasks like searching a match for the given string in documents. I just want to do that for a column of a table.
What do you suggest in this case? Does Oracle support this kind of fuzzy text matching query?
UTL_MATCH contains methods for matching strings and comparing their similarity. The edit distance, also known as the Levenshtein distance, might be a good place to start. Since one string is a substring, it may help to compare the edit distance relative to the size of the strings.
--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
    --Rank edit ratios.
    select substring, address, edit_ratio
        ,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
    from
    (
        --Calculate edit ratio - edit distance relative to string sizes.
        select
            substring,
            address,
            (length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
        from
        (
            --Fake addresses (from http://names.igopaygo.com/street/north_american_address)
            select '526 Burning Hill Big Beaver District of Columbia 20041' address from dual union all
            select '5206 Hidden Rise Whitebead Michigan 48426' address from dual union all
            select '2714 Noble Drive Milk River Michigan 48770' address from dual union all
            select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
            select '968 Iron Corner Wacker Arkansas 72793' address from dual
        ) addresses
        cross join
        (
            --Address substrings.
            select 'Michigan' substring from dual union all
            select 'Not-So-Hidden Rise' substring from dual union all
            select '123 Fake Street' substring from dual
        )
        order by substring, edit_ratio desc
    )
)
where edit_ratio_rank = 1
order by substring, address;
These results are not great but hopefully this is at least a good starting point. It should work with any language. But you'll still probably want to combine this with some language- or locale- specific comparison rules.
SUBSTRING           ADDRESS                                                 EDIT_RATIO
------------------  ------------------------------------------------------  ----------
123 Fake Street     526 Burning Hill Big Beaver District of Columbia 20041  0.5333
Michigan            2714 Noble Drive Milk River Michigan 48770              1
Michigan            5206 Hidden Rise Whitebead Michigan 48426               1
Not-So-Hidden Rise  5206 Hidden Rise Whitebead Michigan 48426               0.5
You could make use of the SOUNDEX function available in Oracle databases. SOUNDEX computes a short phonetic signature of a text string (a letter followed by three digits). This can be used to find strings which sound similar and thus reduce the number of string comparisons.
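To show what such a signature looks like, here is a small Python sketch of the classic American Soundex rules. It is only an illustration: in the database you would call the built-in SOUNDEX, and this version may differ from Oracle's in edge cases.

```python
# Consonant groups of classic Soundex; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """First letter plus three digits, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue                      # H and W do not separate equal codes
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code                   # a vowel resets the previous code
    return (result + "000")[:4]           # pad or truncate to four characters
```

Names that sound alike get the same code, for example `soundex("Robert")` and `soundex("Rupert")` are both `"R163"`. Comparing codes first and running a more expensive comparison only on rows with equal codes is one way to cut down the number of string comparisons.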
Edited:
If SOUNDEX is not suitable for your local language, you can ask Google for a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.
Example: A Turkish SOUNDEX is promoted here.
To increase the matching quality, the street name spelling should be unified in a first step. This could be done by applying a set of rules:
Simplified example rules:
Convert all characters to lowercase
Remove "str." at the end of a name
Remove "drv." at the end of a name
Remove "place" at the end of a name
Remove "ave." at the end of a name
Sort names with multiple words alphabetically
Drop auxiliary words like "of", "and", "the", ...
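The simplified rules above could be sketched in Python like this. The suffix and auxiliary-word lists are just the examples from the list, not a complete rule set, and `unify_street_name` is a hypothetical name.

```python
import re

TRAILING = ("str", "drv", "place", "ave")   # suffixes removed at the end of a name
AUXILIARY = {"of", "and", "the"}            # words dropped anywhere in the name


def unify_street_name(name: str) -> str:
    """Apply the example rules: lowercase, strip a trailing suffix,
    drop auxiliary words, and sort the remaining words alphabetically."""
    words = re.findall(r"[a-z0-9]+", name.lower())   # also drops dots and commas
    if len(words) > 1 and words[-1] in TRAILING:
        words = words[:-1]
    words = [word for word in words if word not in AUXILIARY]
    return " ".join(sorted(words))
```

For example, `unify_street_name("Hill of the Burning Ave.")` and `unify_street_name("Burning Hill")` both give `"burning hill"`. The unified form could be stored in an extra column so the rules run once per row rather than once per query.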

Eliminate duplicate records/rows?

I'm trying to list the result of a multi-table query as one row with 2 columns. I have the correct data that I need; I merely need to trim it down to 1 line of results. In other words, I want to eliminate duplicate entries in the result. I'm using a value not shown here, school_id. Should I go with that as a distinct value? Can I do that without displaying the school_id?
SQL> select DISTINCT(school_name),Team_Name
2 from school, team
3 where team.team_name like '%B%'
4 AND school.school_id = team.school_id;
SCHOOL_NAME                                        TEAM_NAME
-------------------------------------------------- ----------
Lawrence Central High School                       Bears
Lawrence Central High School                       BEars
Lawrence Central High School                       BEARS
The problem, as I'm sure you know, is that "Bears" appears in 3 different cases here, and DISTINCT treats them as three different values. The simple fix is to apply UPPER or LOWER to Team_Name in the select list so that only 1 record is returned:
UPPER(Team_Name)
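A quick way to check the effect is a throwaway in-memory database. The sketch below uses Python's sqlite3 module instead of Oracle, so it only demonstrates the DISTINCT plus UPPER idea; the table contents are made up to mirror the output in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE school (school_id INTEGER PRIMARY KEY, school_name TEXT);
    CREATE TABLE team (school_id INTEGER, team_name TEXT);
    INSERT INTO school VALUES (1, 'Lawrence Central High School');
    INSERT INTO team VALUES (1, 'Bears');
    INSERT INTO team VALUES (1, 'BEars');
    INSERT INTO team VALUES (1, 'BEARS');
""")

# DISTINCT applies to the whole selected row, so folding the case of
# team_name collapses the three spellings into a single row.
rows = conn.execute("""
    SELECT DISTINCT school.school_name, UPPER(team.team_name) AS team_name
    FROM school
    JOIN team ON school.school_id = team.school_id
    WHERE UPPER(team.team_name) LIKE '%B%'
""").fetchall()

print(rows)
```

school_id is used only in the join condition, so it does not have to appear in the select list.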

TSQL Degrees of Difference in strings

I am trying to compare 2 strings in a SQL statement. Some of the strings are almost identical and some are very different.
For example:
The following 2 addresses are almost identical
22224 143RD AVE
222-24 143RD AVE.
While the next 2 are very different
6969 elmund street
6969 mamerth street
Is there a function to classify the degree of difference?
http://support.microsoft.com/kb/100365 lays out DIFFERENCE() and SOUNDEX(). I'm not sure how well these would work in your particular use case.
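If doing the comparison outside the database is an option, a similarity score is easy to sketch in Python with the standard library's difflib. This is a different technique from DIFFERENCE() and SOUNDEX() (it compares characters, not sounds). The `similarity` helper and its cleanup rule are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Score between 0.0 and 1.0; 1.0 means identical after light cleanup
    (lower-casing and removing everything except letters, digits and spaces)."""
    def clean(text: str) -> str:
        return re.sub(r"[^a-z0-9 ]", "", text.lower())
    return SequenceMatcher(None, clean(a), clean(b)).ratio()


almost_identical = similarity("22224 143RD AVE", "222-24 143RD AVE.")
very_different = similarity("6969 elmund street", "6969 mamerth street")
```

Here `almost_identical` is 1.0, because the hyphen and the dot are removed before comparing, and `very_different` scores lower. Where to draw the line between a match and a non-match still has to be tuned on real data.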

Sort Postcode for menu/list

I need to sort a list of UK postcodes in to order.
Is there a simple way to do it?
UK postcodes are made up of letters and numbers; for full info on the format, see:
http://en.wikipedia.org/wiki/UK_postcodes
But my problem is this: a simple alpha sort doesn't work, because each code starts with one or two letters, immediately followed by a number of up to two digits, then a space, another number and then two letters, e.g. LS1 1AA or ls28 1AA. There is also another case where, once the numbers in the first section exceed 99, it continues with 9A etc.
Alpha sort causes the 10s to immediately follow the 1s:
...
LS1 9ZZ
LS10 1AA
...
LS2
I'm looking at creating a SQL function to convert the printable Postcode into a sortable postcode e.g. 'LS1 9ZZ' would become 'LS01 9ZZ', then use this function in the order by clause.
Has anybody done this or anything similar already?
You need to think of this as a tokenization issue so SW1A 1AA should tokenize to:
SW
1
A
1AA
(although you could break the inward part down into 1 and AA if you wanted to)
and G12 8QT should tokenize to:
G
12
(empty string)
8QT
Once you have broken the postcode down into those component parts then sorting should be easy enough. There is an exception with the GIR 0AA postcode but you can just hardcode a test for that one
edit: some more thoughts on tokenization
For the sample postcode SW1A 1AA, SW is the postcode area, 1A is the postcode district (which we'll break into two parts for sorting purposes), 1 is the postcode sector and AA is the unit postcode.
These are the valid postcode formats (source: Royal Mail PAF user guide page 8 - link at bottom of this page):
AN NAA
AAN NAA
ANN NAA
ANA NAA
AAA NAA (only for GIR 0AA code)
AANN NAA
AANA NAA
So a rough algorithm would be (assuming we want to separate the sector and unit postcode):
code = GIR 0AA? Tokenize to GI/R/ /0/AA (treating R as the district simplifies things)
code 5 characters long (not counting the space), e.g. G1 3AF? Tokenize to G/1/ /3/AF
code 6 characters long with the 3rd character being a letter, e.g. W1P 1HQ? Tokenize to W/1/P/1/HQ
code 6 characters long with the 2nd character being a letter, e.g. CR2 6XH? Tokenize to CR/2/ /6/XH
code 7 characters long with the 4th character being a letter, e.g. EC1A 1BB? Tokenize to EC/1/A/1/BB
otherwise (the remaining ANN NAA and AANN NAA formats), e.g. W10 1AA or TW14 2ZZ, tokenize to W/10/ /1/AA or TW/14/ /2/ZZ
If the purpose is to display a list of postcodes for the user to choose from then I would adopt Neil Butterworth's suggestion of storing a 'sortable' version of the postcode in the database. The easiest way to create a sortable version is to pad all entries to nine characters:
two characters for the area (right-pad if shorter)
two for the district number (left-pad if shorter)
one for the district letter (pad if missing)
space
one for the sector
two for the unit
and GIR 0AA is a slight exception again. If you pad with spaces then the sort order should be correct. Examples using # to represent a space:
W1#1AA => W##1##1AA
WC1#1AA => WC#1##1AA
W10#1AA => W#10##1AA
W1W#1AA => W##1W#1AA
GIR#0AA => GI#R##0AA
WC10#1AA => WC10##1AA
WC1W#1AA => WC#1W#1AA
You need to right-pad the area if it's too short: left-padding produces the wrong sort order. All of the single letter areas - B, E, G, L, M, N, S, W - would sort before all of the two-letter areas - AB, AL, ..., ZE - if you left-padded
The district number needs to be left padded to ensure that the natural W1, W2, ..., W9, W10 order remains intact
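The padding scheme can be sketched in a few lines of Python. This version tokenizes with a regular expression instead of the length-based rules, which is an assumption made for brevity. It does not validate that the postcode really exists, and `sortable_postcode` is a hypothetical name.

```python
import re

# area (1-2 letters), district number (1-2 digits), optional district letter,
# optional space, sector digit, unit (2 letters)
POSTCODE = re.compile(r"^([A-Z]{1,2})(\d{1,2})([A-Z]?)\s*(\d)([A-Z]{2})$")


def sortable_postcode(postcode: str, pad: str = " ") -> str:
    """Return the nine-character padded form described above."""
    code = postcode.strip().upper()
    if code.replace(" ", "") == "GIR0AA":
        # Special case: treat R as the district.
        area, number, letter, sector, unit = "GI", "R", "", "0", "AA"
    else:
        match = POSTCODE.match(code)
        if match is None:
            raise ValueError("unrecognised postcode format: %r" % postcode)
        area, number, letter, sector, unit = match.groups()
    return (area.ljust(2, pad)        # right-pad the area
            + number.rjust(2, pad)    # left-pad the district number
            + (letter or pad)         # pad a missing district letter
            + pad                     # the separating space
            + sector + unit)


ordered = sorted(["LS10 1AA", "LS2 1AA", "LS1 9ZZ"], key=sortable_postcode)
```

With `pad="#"` the function reproduces the examples above, e.g. `sortable_postcode("W1W 1AA", "#")` gives `"W##1W#1AA"`, and `ordered` comes out as LS1 9ZZ, LS2 1AA, LS10 1AA.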
I know this is a couple of years late, but I too have just experienced this problem.
I have managed to overcome it with the following code, so I thought I would share it, as I searched the internet and could not find anything!
mysql_query("SELECT SUBSTRING_INDEX(postcode,' ',1) as p1, SUBSTRING_INDEX(postcode,' ',-1) as p2 from `table` ORDER BY LENGTH(p1), p1, p2 ASC");
This code will take a Full UK postcode and split it into 2.
It will then order by the first part of the postcode followed by the second.
I'd be tempted to store the normalised postcode in the database along with the real postcode - that way you only do the string manipulation once, and you can use an index to help you with the sort.