I'm using regexp_replace to standardize mailing addresses and I've encountered a situation I'm having trouble with.
Consider the following two addresses and what their result should be:
115 1/2 East 6th St -> 115 1/2 E 6th St
818 East St -> 818 East St
In the second address, "East" is the actual name of the street, not a directional indicator.
For my query, I've attempted
SELECT
regexp_replace(address, 'East[^ St]', 'E ')
but this fails to convert the first address to its proper format.
How can I write my regexp_replace such that the word East is converted to an 'E' in the first address, but leaves the word intact in the second address?
Your current pattern matches the literal text East followed by any single character that isn't space, S, or t. I'm assuming you probably meant to use a negative lookahead to make sure that "East" doesn't come before " St", but sadly Oracle doesn't support negative lookaheads. Instead, you'll need to make the REGEXP_REPLACE conditional:
CASE
WHEN address LIKE '%East%' AND address NOT LIKE '%East St%'
THEN REGEXP_REPLACE(address, your_pattern, your_replacement)
ELSE address
END
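As a minimal sketch of that conditional approach, assuming the directional always appears as ' East ' with a space on each side and that the street type is always 'St' at the end of the string (both are assumptions, and tbl is a hypothetical stand-in for your table):

```sql
-- Sample rows taken from the question; tbl is illustrative only.
with tbl(address) as (
  select '115 1/2 East 6th St' from dual union all
  select '818 East St' from dual
)
select address,
       case
         -- Rewrite only when East is a standalone word and not the street name itself.
         when address like '% East %' and address not like '% East St'
         then regexp_replace(address, ' East ', ' E ', 1, 1)  -- start at position 1, first occurrence only
         else address
       end as new_addr
from tbl;
```

With these two rows, the first address becomes '115 1/2 E 6th St' and the second is left unchanged. Addresses with anything after 'St' (a unit number, for example) would need a looser condition.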
This answers your question with REGEXP_REPLACE(). It looks for the string ' East' (the leading space avoids catching the case where 'East' is the end of another word, such as 'NorthEast') followed by a space, one or more characters, another space and the string 'St', all of which is remembered in a second group. If found, the match is replaced with ' E' followed by the second remembered group (the space, the one or more characters, the space and 'St'). This is needed because those characters are 'consumed' by the regex engine as it moves left to right through the string, so you need to put them back. Note I added a bunch of different test formats (always test for the unexpected too!):
SQL> with tbl(address) as (
select '115 1/2 East 6th St' from dual union
select '115 1/2 NorthEast 6th St' from dual union
select '115 1/2 East 146th St' from dual union
select '115 1/2 East North 1st St' from dual union
select '818 East Ave' from dual union
select '818 Woodward' from dual union
select '818 East St' from dual
)
select regexp_replace(address, '( East)( .+ St)', ' E\2') new_addr
from tbl;
NEW_ADDR
------------------------------------------------------------------------
115 1/2 E 146th St
115 1/2 E 6th St
115 1/2 E North 1st St
115 1/2 NorthEast 6th St
818 East Ave
818 East St
818 Woodward
7 rows selected.
Please see the SQLFiddle example:
http://sqlfiddle.com/#!4/abd6d/1
Here are a few example addresses:
MINNEAPOLIS MN 55450
MINNAPOLIS MN 55439-8136
BETHANY OK 73008
Hillsboro Oregon 97124
Not all of them are separated by spaces, but enough of them are that I think that is the approach I want to take.
I'm running Oracle 11g.
update:
this was how it was accomplished:
select bill_address4,
       Substr(bill_address4, 1, Instr(bill_address4, ',') - 1) "CITY EXMP ONE",
       regexp_substr(bill_address4, '[^,]+', 1, 1) "CITY EXMP TWO",
       Trim(regexp_substr(bill_address4, '[^,]+', 1, 2)) "STATE/ZIP",
       TRIM(regexp_substr(Trim(regexp_substr(bill_address4, '[^,]+', 1, 2)), '[^ ]+', 1, 1)) "STATE",
       TRIM(TRIM(regexp_substr(Trim(regexp_substr(bill_address4, '[^,]+', 1, 2)), '[^ ]+', 1, 2)) || ' ' ||
            TRIM(regexp_substr(Trim(regexp_substr(bill_address4, '[^,]+', 1, 2)), '[^ ]+', 1, 3)) || ' ' ||
            TRIM(regexp_substr(Trim(regexp_substr(bill_address4, '[^,]+', 1, 2)), '[^ ]+', 1, 4))) "ZIP"
from so_header
I do not think this is easily feasible in SQL. You will need to rearrange the raw data and the table schema by adding more columns.
The approach I recommend is to do the string manipulation in another programming language, for example C#.
See if such an approach helps.
TEMP CTE finds position of the first digit in citystatezip column
zip: that's the substring that starts from the zip_position
state: nested functions
substr selects everything up to the zip_position (e.g. "SNOHOMISH WA ")
trim removes trailing spaces
regexp_substr extracts the last word from that substring (e.g. "WA")
city: substring from the 1st character, up to position of the second space character starting from the back of the string (see instr's parameters)
For sample data you posted (LA added), that would look as follows:
SQL> with temp as
2 (select p.*,
3 regexp_instr(citystatezip, '\d') zip_position
4 from po_header p
5 )
6 select t.po_number, t.customer, t.citystatezip,
7 substr(t.citystatezip, t.zip_position) zip,
8 regexp_substr(trim(substr(t.citystatezip, 1, t.zip_position - 1)), '\w+$') state,
9 trim(substr(t.citystatezip, 1, instr(t.citystatezip, ' ', -1, 2))) city
10 from temp t;
 PO_NUMBER CUSTOME CITYSTATEZIP                   ZIP        STATE      CITY
---------- ------- ------------------------------ ---------- ---------- -----------
         1 John    SNOHOMISH WA 98290             98290      WA         SNOHOMISH
         2 Jen     MINNAPOLIS MN 55439-8136       55439-8136 MN         MINNAPOLIS
         3 Jillian BETHANY OK 73008               73008      OK         BETHANY
         4 Jordan  Hillsboro Oregon 97124         97124      Oregon     Hillsboro
         5 Scott   Los Angeles CA 12345           12345      CA         Los Angeles
SQL>
Is it perfect? Certainly not, but the final solution depends on much more sample data. Generally speaking, the data model is just wrong: you should have split that information into separate columns in the first place.
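If only the ZIP is needed, a shorter variant is to anchor a pattern at the end of the string. This is just a sketch, assuming the ZIP is always the last token and is either five digits or ZIP+4; the po_header CTE below is inline sample data, not your real table:

```sql
-- Inline sample data built from the addresses in the question.
with po_header (citystatezip) as (
  select 'MINNEAPOLIS MN 55450'     from dual union all
  select 'MINNAPOLIS MN 55439-8136' from dual union all
  select 'BETHANY OK 73008'         from dual union all
  select 'Hillsboro Oregon 97124'   from dual
)
select citystatezip,
       -- Five digits, optionally followed by a hyphen and four digits, at the end of the string.
       regexp_substr(citystatezip, '[0-9]{5}(-[0-9]{4})?$') as zip
from po_header;
```

Rows whose value does not end in a ZIP come back as NULL, which makes the exceptions easy to find and review.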
I have two tables:
a table with all country codes, like KZ, US, RU
a table tranzactions with terminal locations like
(Starbucks 1500 Broadway *Near Times Square US)
(CoffeBoom KZ Mendekulova district *Near Dostyk plaza)
and I want to select
country code number, code str, location terminal name
like
398 | KZ | CoffeBoom KZ Mendekulova district *Near Dostyk plaza
840 | US | Starbucks 1500 Broadway *Near Times Square US
I also want to exclude the case where the terminal location name only contains the code characters inside another word. For example, 'Gucci Moscow Redkzsuzin district RU' contains the characters of the country codes 'KZ' and 'UZ' inside a word, but I want to select only 'RU'.
You can try building a regular expression that incorporates the column code_str. The following attempts that: it builds an expression looking for the beginning of the string or a space, followed by the country code, followed by a space or end-of-string, and returns the matching rows. However, expect both false positives and false negatives, as you're searching free-form text. Any occurrence matching that pattern will be returned even if it is not actually a valid code, and valid ones can be missed as well. For example, it will not find the row:
982,'US', 'Starbucks 618 Miracle Mile, Chicago, IL, USA'
You may need to work out a better definition of what you are searching for.
with tranzactions (country_code_number , code_str , location_terminal_name) as
(select 398,'KZ', 'CoffeBoom KZ Mendekulova district *Near Dostyk plaza' from dual union all
select 840,'US', 'Starbucks 1500 Broadway *Near Times Square US' from dual union all
select 982,'US', 'Starbucks 618 Miracle Mile, Chicago, IL, USA' from dual
)
select * from tranzactions
where regexp_like(location_terminal_name, '(^| )' || code_str || '( |$)' );
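Since the question describes two separate tables, the same pattern can also serve as a join condition. The sketch below assumes hypothetical table and column names (countries, country_code_number, code_str, location_terminal_name) and adds RU with its ISO numeric code 643 purely for illustration:

```sql
-- Hypothetical two-table layout; names and sample rows are assumptions.
with countries (country_code_number, code_str) as (
  select 398, 'KZ' from dual union all
  select 840, 'US' from dual union all
  select 643, 'RU' from dual
),
tranzactions (location_terminal_name) as (
  select 'CoffeBoom KZ Mendekulova district *Near Dostyk plaza' from dual union all
  select 'Starbucks 1500 Broadway *Near Times Square US'        from dual union all
  select 'Gucci Moscow Redkzsuzin district RU'                  from dual
)
select c.country_code_number, c.code_str, t.location_terminal_name
from tranzactions t
join countries c
  -- The code must be a whole word: preceded by start-of-string or a space,
  -- and followed by a space or end-of-string.
  on regexp_like(t.location_terminal_name, '(^| )' || c.code_str || '( |$)');
```

The Gucci row joins only to RU, because the 'kz' inside 'Redkzsuzin' is neither delimited by spaces nor upper case. The false positive and false negative caveats still apply.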
I have a REGEXP expression that needs to accept a specific beginning letter, anything in between, and a specific ending letter; there might also be spaces after that ending letter (the value comes from the database).
When I run my expression it doesn't match on the ending letter, because the database has spaces after the name I am searching for.
WHERE REGEXP_LIKE (cname, UPPER('^[&p_name_beginning](.*?)[&p_name_ending$]'));
Output:
JIE DONG has bought 2 car(s) and has spent $151200
JAMES BARREDO has bought 1 car(s) and has spent $300145
JUAN MENDIOLA has bought 1 car(s) and has spent $75610.89
JASON HADDAD has bought 1 car(s) and has spent $157000
JOSE ANDRADE has bought 1 car(s) and has spent $151046
JORDAN PENNEY has bought 1 car(s) and has spent $85201.92
JUAN RODAS has bought 1 car(s) and has spent $105000
You will get better help if you specify what you are trying to do with sample before and after data, as well as showing what you have tried. I suspect you are trying to select a row where the first and last letters of the name match parameters you have been given. If you update the tag to show what database you are using, you will get a more targeted answer, but here's an Oracle solution to return the 3rd record that may help if my assumption is correct. It will give you a hint at any rate.
with tbl(str) as (
select 'JIE DONG has bought 2 car(s) and has spent $151200' from dual union all
select 'JAMES BARREDO has bought 1 car(s) and has spent $300145' from dual union all
select 'JUAN MENDIOLA has bought 1 car(s) and has spent $75610.89' from dual union all
select 'JASON HADDAD has bought 1 car(s) and has spent $157000' from dual union all
select 'JOSE ANDRADE has bought 1 car(s) and has spent $151046' from dual union all
select 'JORDAN PENNEY has bought 1 car(s) and has spent $85201.92' from dual union all
select 'JUAN RODAS has bought 1 car(s) and has spent $105000' from dual
)
select str
from tbl
where regexp_like(str, '^j\S+ \S+a .*$', 'i');
The regex reads as follows:
^ Anchor to the start of the line
j Match a 'j' (first letter of name)
\S+ Followed by one or more characters that are not spaces
<space> Then a space character
\S+ Followed by one or more characters that are not spaces
a Then the ending letter of the name
<space> Followed by a space character
.* Followed by zero or more of any characters
$ The end of the line
The 'i' means case-insensitive.
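If the column holds only the name padded with trailing spaces, as the question suggests, the padding can be allowed for explicitly. Note that in the original attempt the $ sits inside the square brackets, where it is a literal character rather than an end-of-line anchor. The following is a sketch with made-up sample rows and hard-coded letters J and A standing in for the two parameters:

```sql
-- cname values are invented; two of them carry trailing spaces on purpose.
with tbl(cname) as (
  select 'JUAN MENDIOLA   ' from dual union all
  select 'JIE DONG'         from dual union all
  select 'JOSE ANDRADE '    from dual
)
select cname
from tbl
-- Starts with J, ends with A, then any amount of trailing whitespace before end-of-string.
where regexp_like(cname, '^J.*A[[:space:]]*$');
```

Only the 'JUAN MENDIOLA' row matches here. With parameters, the pattern could be assembled by concatenation, for example '^' || p_begin || '.*' || p_end || '[[:space:]]*$', where p_begin and p_end are hypothetical names.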
I have a large Oracle DB table which contains street names for a whole country, which has 600000+ rows. In my application, I take an address string as input and want to check whether specific substrings of this address string matches one or many of the street names in the table, such that I can label that address substring as the name of a street.
Clearly, this should be a fuzzy text matching problem, there is only a small chance that the substring I query has an exact match with the street names in DB table. So there should be some kind of fuzzy text matching approach. I am trying to read the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm in which CONTAINS and CATSEARCH search operators are explained. But these seem to be used for more complex tasks like searching a match for the given string in documents. I just want to do that for a column of a table.
What do you suggest in this case? Does Oracle support this kind of fuzzy text matching query?
UTL_MATCH contains methods for matching strings and comparing their similarity. The edit distance, also known as the Levenshtein distance, might be a good place to start. Since one string is a substring, it may help to compare the edit distance relative to the size of the strings.
--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
--Rank edit ratios.
select substring, address, edit_ratio
,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
from
(
--Calculate edit ratio - edit distance relative to string sizes.
select
substring,
address,
(length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
from
(
--Fake addresses (from http://names.igopaygo.com/street/north_american_address)
select '526 Burning Hill Big Beaver District of Columbia 20041' address from dual union all
select '5206 Hidden Rise Whitebead Michigan 48426' address from dual union all
select '2714 Noble Drive Milk River Michigan 48770' address from dual union all
select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
select '968 Iron Corner Wacker Arkansas 72793' address from dual
) addresses
cross join
(
--Address substrings.
select 'Michigan' substring from dual union all
select 'Not-So-Hidden Rise' substring from dual union all
select '123 Fake Street' substring from dual
)
order by substring, edit_ratio desc
)
)
where edit_ratio_rank = 1
order by substring, address;
These results are not great but hopefully this is at least a good starting point. It should work with any language. But you'll still probably want to combine this with some language- or locale- specific comparison rules.
SUBSTRING          ADDRESS                                                EDIT_RATIO
------------------ ------------------------------------------------------ ----------
123 Fake Street    526 Burning Hill Big Beaver District of Columbia 20041     0.5333
Michigan           2714 Noble Drive Milk River Michigan 48770                      1
Michigan           5206 Hidden Rise Whitebead Michigan 48426                       1
Not-So-Hidden Rise 5206 Hidden Rise Whitebead Michigan 48426                     0.5
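To get a feel for the building blocks before tuning a ratio like the one above, it may help to call the UTL_MATCH functions directly on a pair of words. The word pair here is just a textbook example, not data from the question:

```sql
-- Compare two strings with three UTL_MATCH functions.
select utl_match.edit_distance('kitten', 'sitting')            as edit_dist,  -- 3 single-character edits
       utl_match.edit_distance_similarity('kitten', 'sitting') as edit_sim,   -- normalized score, 0 to 100
       utl_match.jaro_winkler_similarity('kitten', 'sitting')  as jw_sim      -- normalized score, 0 to 100
from dual;
```

The normalized variants save you from dividing by the string length yourself, though for substring matching a custom ratio like the one in the query above may still behave better.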
You could make use of the SOUNDEX function available in Oracle databases. SOUNDEX computes a short phonetic code for a text string (a letter followed by three digits). This can be used to find strings which sound similar and thus reduce the number of string comparisons.
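A small sketch of how that could be used as a pre-filter, with a made-up streets list standing in for the real table:

```sql
-- streets is hypothetical sample data.
with streets(name) as (
  select 'Smith' from dual union all
  select 'Smyth' from dual union all
  select 'Baker' from dual
)
select name, soundex(name) as code
from streets
-- Keep only names whose phonetic code equals that of the search term.
where soundex(name) = soundex('Smith');
```

This returns 'Smith' and 'Smyth' but not 'Baker'. On a large table, the code could be stored or indexed so that it is not recomputed for every row on every query.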
Edited:
If SOUNDEX is not suitable for your local language, you can ask Google for a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.
Example: A Turkish SOUNDEX is promoted here.
To increase the matching quality, the street name spelling should be unified in a first step. This could be done by applying a set of rules:
Simplified example rules:
Convert all characters to lowercase
Remove "str." at the end of a name
Remove "drv." at the end of a name
Remove "place" at the end of a name
Remove "ave." at the end of a name
Sort names with multiple words alphabetically
Drop auxiliary words like "of", "and", "the", ...
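Some of these rules can be sketched directly in SQL. The example below covers only the lowercase rule and the trailing-suffix rules; the sample names are invented, and sorting words or dropping auxiliary words is left out because it is awkward to express in a single expression:

```sql
-- streets is hypothetical sample data.
with streets(name) as (
  select 'Main Str.'        from dual union all
  select 'Park Place'       from dual union all
  select 'Placerville Road' from dual
)
select name,
       -- Lowercase first, then strip one of the listed suffixes when it is the last word.
       regexp_replace(lower(name), ' (str\.|drv\.|place|ave\.)$', '') as normalized
from streets;
```

The leading space in the pattern keeps 'Placerville Road' intact, while 'Main Str.' and 'Park Place' are reduced to 'main' and 'park'.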
I have an address column that contains the address, state and postcode. I would like to extract the address, suburb, state, and postcode into separate columns. How can I do this, given that the length of the address is variable? There is a ^ to separate the address and the "other" details. The state can be 2 or 3 characters long and the postcode is always 4 characters long.
PostalAddress                           TO BE   Address           Suburb         State   Postcode
28 Smith Avenue^MOOROOLBARK VIC 3138^           28 Smith Avenue   MOOROOLBARK    VIC     3138
16 Farr Street^HEYFIELD VIC 3858^               16 Farr Street    HEYFIELD       VIC     3858
17 Terry Road^LOWER PLENTY VIC 3093^            17 Terry Road     LOWER PLENTY   VIC     3093
String parsing in SQL is messy and tends to be brittle. I usually think it's best to do this sort of task outside of SQL altogether. That said, given the mini-spec above, it is possible to parse the data into the fields you want like so:
select
left(PostalAddress, charindex('^', PostalAddress) - 1) as street_address,
left(second_part, len(second_part) - charindex(' ', reverse(second_part))) as suburb,
ltrim(right(second_part, charindex(' ', reverse(second_part)))) as state, -- ltrim drops the separating space that right() picks up
reverse(substring(reverse(PostalAddress), 2, 4)) as postal_code
from (
select
PostalAddress,
rtrim(reverse(substring(reverse(PostalAddress), 6, len(PostalAddress) - charindex('^', PostalAddress) - 5))) as second_part
from Addresses
) as t1
Note that you'll need to substitute your table name for what I've called Addresses in the subquery above.
You can see this in action against your sample data here.
In my case I just needed to get a five-digit number from a string as a postcode.
Below is my code:
Select SUBSTRING([Column or string],patindex('%[0-9][0-9][0-9][0-9][0-9]%',[Column or string]),5) AS 'Postcode'
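One thing to watch for: PATINDEX returns 0 when no five-digit run exists, and SUBSTRING starting at position 0 then returns the leading characters of the string rather than nothing. A guarded variant might look like the sketch below (T-SQL, with made-up inline sample strings):

```sql
-- T-SQL; the VALUES rows are invented sample strings.
select s.txt,
       case
         -- Only extract when a run of five digits was actually found.
         when patindex('%[0-9][0-9][0-9][0-9][0-9]%', s.txt) > 0
         then substring(s.txt, patindex('%[0-9][0-9][0-9][0-9][0-9]%', s.txt), 5)
       end as Postcode
from (values ('MINNEAPOLIS MN 55450'),
             ('BETHANY OK')) as s(txt);
```

The first row yields '55450'; the second yields NULL because the CASE has no ELSE branch.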