Parse Dataframe string column into Street, City, State & Zip code - sql

I am trying to break up the following fixed string into several columns: street, city, state and zip code. Is it possible to do this in sqldf using the instr and substr functions?
Sample Address String. The difficult part is the NV and zip code parsing.
727 Wright Brothers Ln, Las Vegas, NV 89119, USA
I am able to parse the city/street information using sqldf/instr but unable to parse the final two values for state/zip code
parsed_tweetAddressdf <- sqldf("SELECT lon, lat, result, substr(result,0,instr(result,',')) AS street, substr(result,instr(result,',')+1,instr(result,',')-1) AS city from tweetAddressdf")

Here are some alternatives. They all use instr and substr as required by the question although the third also writes out the data and reads it back in (in addition to using instr and substr). Notes at the end point out that it is also easy to do this in plain R or using read.pattern in gsubfn.
1) Assume state, zip and country fields are fixed width
With only one sample record it is impossible to know what your general case is but if we assume that every record ends in SS ZZZZZ, USA where SS is the two letter state abbreviation and ZZZZZ is a 5 digit zip then this works:
DF <- data.frame(v = "727 Wright Brothers Ln, Las Vegas, NV 89119, USA")
library(sqldf)
sqldf("select
substr(v, 0, instr(v, ',')) street,
substr(v, instr(v, ',') + 2, length(v) - 16 - instr(v, ',')) city,
substr(v, -13, 2) state,
substr(v, -10, 5) zip
from DF")
giving:
                  street      city state   zip
1 727 Wright Brothers Ln Las Vegas    NV 89119
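For readers who want to try the arithmetic in (1) without R, here is a minimal sketch that runs the same instr/substr expressions through Python's built-in sqlite3 module (SQLite is also sqldf's default backend). The function name split_fixed_width is hypothetical, and street is written as substr(v, 1, instr(v, ',') - 1), which is an equivalent spelling of the substr(v, 0, ...) form used above.

```python
import sqlite3

def split_fixed_width(address):
    """Split an address that ends in 'SS ZZZZZ, USA' using SQLite substr/instr."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(
            """
            select
              substr(v, 1, instr(v, ',') - 1) as street,
              substr(v, instr(v, ',') + 2, length(v) - 16 - instr(v, ',')) as city,
              substr(v, -13, 2) as state,  -- negative start counts from the right
              substr(v, -10, 5) as zip
            from (select ? as v)
            """,
            (address,),
        ).fetchone()
    finally:
        conn.close()
```

Calling split_fixed_width on the sample address returns the four fields as a tuple. The same fixed-width assumption applies: any record that does not end in a two letter state, a five digit zip and ", USA" will be split incorrectly.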
2) Separate strictly based on commas (except state/zip)
This approach avoids certain assumptions in (1) at the expense of additional complication. It takes the first two comma separated fields, the 2 character state and everything after that to the next comma as the zip.
It uses a triple nested select. The innermost select denoted a parses the input string into: street and a.rest. The next one proceeding outward denoted b returns the street already parsed from a, and parses a.rest into city and the b.rest. The outermost one returns the street and city already parsed plus it returns the two state characters in b.rest and everything beyond them in b.rest to the next comma as zip.
library(sqldf)
sqldf("
select
street,
city,
substr(b.rest, 1, 2) state,
substr(b.rest, 4, instr(b.rest, ',') - 4) zip
from (
select
street,
substr(a.rest, 0, instr(a.rest, ',')) city,
substr(a.rest, instr(a.rest, ',') + 2) rest
from (select
substr(v, 0, instr(v, ',')) street,
substr(v, instr(v, ',') + 2) rest
from DF) a) b
")
giving:
                  street      city state   zip
1 727 Wright Brothers Ln Las Vegas    NV 89119
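The peel-off-one-field-at-a-time logic of (2) can be sketched outside SQL as well. The following Python version is only an illustration (split_by_commas is a hypothetical name): each partition call plays the role of one nested select, returning the field before the separator and the rest of the string.

```python
def split_by_commas(address):
    """Peel off street and city at the first two ', ' separators,
    then read a 2-character state and the zip up to the next comma."""
    street, _, rest = address.partition(", ")   # like the innermost select (a)
    city, _, rest = rest.partition(", ")        # like the middle select (b)
    state = rest[:2]                            # first two characters of b.rest
    zip_code = rest[3:].partition(",")[0]       # everything up to the next comma
    return street, city, state, zip_code
```

As in the SQL, nothing here checks that the state really is two letters; it only relies on the comma positions.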
3) read.csv.sql
If it's OK to write it out and read it back in then we can use read.csv.sql, a wrapper around sqldf. Although the question did not ask for it, this one also parses out the country:
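write.table(DF, "addresses.csv", row.names = FALSE, col.names = FALSE,
sep = ",", quote = FALSE)
# Every field after the first keeps the space that followed its comma,
# e.g. V3 is " NV 89119", so trim it or offset the substr positions.
read.csv.sql("addresses.csv", header = FALSE, sql =
"select V1 street,
trim(V2) city,
substr(V3, 2, 2) state,
substr(V3, 5) zip,
trim(V4) country
from file")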
giving:
                  street      city state   zip country
1 727 Wright Brothers Ln Las Vegas    NV 89119     USA
Note 1: This is also easy in plain R.
# strip.white = TRUE drops the space that follows each comma
dd <- read.table(text = as.character(DF$v), sep = ",", strip.white = TRUE,
col.names = c("street", "city", "state_zip", "country"))
transform(dd,
state = substring(state_zip, 1, 2),
zip = substring(state_zip, 4))[c(1, 2, 5, 6, 4)]
giving:
                  street      city state   zip country
1 727 Wright Brothers Ln Las Vegas    NV 89119     USA
Note 2: It is even easier using read.pattern from gsubfn:
library(gsubfn)
pat <- "(.*), (.*), (..) (.*), (.*)"
read.pattern(text = as.character(DF$v), pattern = pat,
col.names = c("street", "city", "state", "zip", "country"))
giving:
                  street      city state   zip country
1 727 Wright Brothers Ln Las Vegas    NV 89119     USA
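The pattern in Note 2 is an ordinary regular expression, so the same idea carries over to other languages. As a rough sketch, here it is with Python's re module; parse_address and the dictionary keys are hypothetical names chosen to mirror the col.names above.

```python
import re

# Same pattern as in Note 2: street, city, 2-character state, zip, country.
ADDRESS_PATTERN = re.compile(r"(.*), (.*), (..) (.*), (.*)")

def parse_address(text):
    """Return the five captured fields as a dict, or None if the text does not fit."""
    match = ADDRESS_PATTERN.fullmatch(text)
    if match is None:
        return None
    keys = ("street", "city", "state", "zip", "country")
    return dict(zip(keys, match.groups()))
```

Because the pattern is anchored to the whole string by fullmatch, an address with a different number of comma separated parts returns None instead of a partial parse.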

Related

How to remove multiple possible prefixes from a Name string in SQL Server

I need to correct Names of users by removing prefixes before I can process the names.
For example, my list of prefixes is:
am, auf, auf dem, aus der, d, da, de, de l’, del, de la, de le, di, do, dos, du,
im, la, le, mac, mc, mhac, mhíc, mhic giolla, mic, ni, ní, níc, o, ó,
ua, ui, uí, van, van de, van den, van der, vom, von, von dem, von den, von der
I want to remove any of these prefixes from the First Name if they are present.
For example - inputs:
Outputs:
I know I can take a brute force approach and do a replace 40 odd times, but was wondering if there is a better/smarter way to do this, given the list of names that need to be processed can be in the tens of thousands, daily.
Thank you
You could use apply:
select t.*, v.prefix_free_first_name
from t outer apply
(select top (1) left(t.first_name, len(t.first_name) - len(v.prefix) - 1) as prefix_free_first_name
from (values ('am'), ('auf'), . . .
) v(prefix)
where t.first_name like '% ' + v.prefix
order by len(v.prefix) desc
) v;
Note: This handles the situation where multiple prefixes match a name, such as "de le" and "le".
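To make the longest-match-first idea concrete, here is a small Python sketch of the same logic. It assumes, as the LIKE '% ' + v.prefix test in the query does, that the particle sits at the end of the first name after a space. The function name, the shortened PREFIXES list and the sample names in the usage note are all hypothetical.

```python
# Shortened, hypothetical subset of the prefix list from the question.
PREFIXES = ["de", "de la", "de le", "le", "van", "van der", "von"]

def strip_trailing_particle(first_name, prefixes=PREFIXES):
    """Remove the longest particle that ends first_name, plus the space before it."""
    # Longest first, so "de le" is tried before "le" (like ORDER BY len(prefix) DESC).
    for prefix in sorted(prefixes, key=len, reverse=True):
        tail = first_name[-(len(prefix) + 1):]
        if tail.lower() == (" " + prefix).lower():
            return first_name[: len(first_name) - len(prefix) - 1]
    # Unlike OUTER APPLY, which yields NULL here, return the name unchanged.
    return first_name
```

For example, strip_trailing_particle("Maria de le") gives "Maria" rather than "Maria de", because the longer particle is checked first.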

Replace characters to right of string in PostgreSQL

I am in need of solving a problem for my project.
I need to clean up an address field in PostgreSQL by removing everything to the right of a street name.
And I have found it here: PostgreSQL replace characters to right of string
SELECT regexp_replace('100 broadway street 100', '(.*)(Street).*', '\1\2', 'i');
However, I would like to replace '100 broadway street 100' more flexibly, like this:
SELECT regexp_replace('100 broadway street 100', '(.*)(Street OR Str. OR Ward OR W. OR Dist).*', '\1\2', 'i');
Can someone help me write the right syntax or have any other links I haven't found yet?
Input 1: "100 Alexandre de Rhodes Street, District 10, HCM City"
Input 2: "100 Quang Trung Str., District 10, HCM City"
Input 3: "123 Newton St., GV District, HCM City"
Output 1: "100 ABC Street, Ward 16"
Output 2: "100 Quang Trung Str."
etc.
That is, I need to remove everything that follows the street name.
I think you are looking for | operator like this
SELECT regexp_replace('100 broadway Dist 100', '(.*)(Street|Str|Ward|Dist).*', '\1\2', 'i');
Output
100 broadway Dist
Update based on comments
You can replace the leading .* with a single . so that the match starts at the first keyword rather than the last one:
SELECT regexp_replace('100 broadway Dist Str 100 Str abc Street',
'(.)(Street|Dist|Ward|Str).*', '\1\2', 'i');
Output
100 broadway Dist
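The difference between the two patterns is easier to see side by side. The sketch below uses Python's re module rather than PostgreSQL, so it only illustrates the greedy versus leftmost behaviour for these particular patterns; the function names are hypothetical.

```python
import re

KEYWORDS = r"(Street|Dist|Ward|Str)"

def cut_after_last_keyword(address):
    # Greedy (.*) grabs as much as it can, so the LAST keyword in the string wins.
    return re.sub(r"(.*)" + KEYWORDS + r".*", r"\1\2", address,
                  count=1, flags=re.IGNORECASE)

def cut_after_first_keyword(address):
    # A single (.) lets the leftmost match begin just before the FIRST keyword.
    return re.sub(r"(.)" + KEYWORDS + r".*", r"\1\2", address,
                  count=1, flags=re.IGNORECASE)
```

With the input '100 broadway Dist Str 100 Str abc Street', the first function leaves the string unchanged (the last keyword is already at the end), while the second returns '100 broadway Dist'. Note that the (.) form needs at least one character before the keyword, and that neither pattern checks word boundaries.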

Address String to Address Fields VIEW or SELECT

I have an address field that is a single line that looks like this:
Dr Robert Ruberry, West End Medical Practice, 38 Russell Street, South Brisbane 4101
I am wanting to write a view that will split that address into Name, Addr1, Addr2, Suburb, Postcode fields for reporting purposes.
I have been trying to use SUBSTRING and CHARINDEX like this, but it doesn't seem to split it correctly.
SUBSTRING([address_Field],CHARINDEX(',',[address_Field]),CHARINDEX(',',[address_Field]))
Can anyone help? TIA
Maybe this works for your requirement:
IF OBJECT_ID('tempdb..#test') IS NOT NULL
DROP TABLE #test
CREATE TABLE #test(id int, data varchar(100))
INSERT INTO #test VALUES (1,'Dr Robert Ruberry, West End Medical Practice, 38 Russell Street, South Brisbane 4101')
DECLARE @pivot varchar(8000)
DECLARE @select varchar(8000)
-- Build the pivot column list [col1],[col2],... with one entry per comma-separated token
SELECT
@pivot=coalesce(@pivot+',','')+'[col'+cast(number+1 as varchar(10))+']'
FROM
master..spt_values where type='p' and
number<=(SELECT max(len(data)-len(replace(data,',',''))) FROM #test)
-- Dynamic SQL: split each row on commas, number the tokens, then pivot them into columns
SELECT
@select='
select p.col1 As Name,p.col2 as Addr1,p.col3 as Addr3,p.col4 as Postcode
from (
select
id,substring(data, start+2, endPos-Start-2) as token,
''col''+cast(row_number() over(partition by id order by start) as varchar(10)) as n
from (
select
id, data, n as start, charindex('','',data,n+2) endPos
from (select number as n from master..spt_values where type=''p'') num
cross join
(
select
id, '','' + data +'','' as data
from
#test
) m
where n < len(data)-1
and substring(data,n+1,1) = '','') as data
) pvt
Pivot ( max(token)for n in ('+@pivot+'))p'
EXEC(@select)
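If the split can happen before the data reaches the database, the logic is much shorter. The following Python sketch assumes every address has exactly four comma separated parts and that the postcode is the last space separated token of the final part; split_address and the dictionary keys (taken from the field names in the question) are illustrative only.

```python
def split_address(line):
    """Split 'Name, Addr1, Addr2, Suburb Postcode' into five named fields."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 4:
        raise ValueError("expected exactly four comma separated parts")
    name, addr1, addr2, locality = parts
    # The postcode is assumed to be the last word of the final part.
    suburb, _, postcode = locality.rpartition(" ")
    return {"Name": name, "Addr1": addr1, "Addr2": addr2,
            "Suburb": suburb, "Postcode": postcode}
```

Addresses with fewer or more commas raise an error here instead of being silently misaligned, which is usually what you want when feeding a report.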
Here's a couple of options for you. If you're just looking for a quick answer, see this similar question that's already been answered:
T-SQL split string based on delimiter
If you want some more in depth knowledge of the various options, check this out:
http://sqlperformance.com/2012/07/t-sql-queries/split-strings
This answer does not specifically apply to SQL, but it does apply to street addresses.
If you are willing to take a dependency on a third-party, you could send street addresses to the SmartyStreets International Street API service.
To do this, you submit an HTTP GET request.
Your example address would look like this:
curl -v 'https://international-api.smartystreets.com/verify?
auth-id=YOUR+AUTH-ID+HERE&auth-token=YOUR+AUTH-TOKEN+HERE&
address1=Dr%20Robert%20Ruberry%2C%20West%20End%20
Medical%20Practice%2C%2038%20Russell%20Street%2C%20
South%20Brisbane%204101
&country=aus'
(Notice that the address is url encoded. The request is wrapped for readability.)
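Encoding the address by hand is error prone, so it is usually built programmatically. As a sketch only, this is how the query string above could be assembled with Python's standard library; the endpoint and parameter names are copied from the curl example, build_verify_url is a hypothetical helper, and no request is actually sent.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the curl example above.
BASE_URL = "https://international-api.smartystreets.com/verify"

def build_verify_url(address, country, auth_id, auth_token):
    """Return the GET URL with every parameter percent-encoded."""
    params = {
        "auth-id": auth_id,
        "auth-token": auth_token,
        "address1": address,
        "country": country,
    }
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    return BASE_URL + "?" + urlencode(params, quote_via=quote)
```

The resulting string can be passed to curl or any HTTP client in place of the hand-wrapped URL.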
The response would be a JSON string, separated into components that you can then insert into your database however you need to:
[
{
"organization": "Dr Robert Ruberry, West End Med.",
"address1": "Dr Robert Ruberry, West End Med.",
"address2": "Russell Street",
"address3": "38 Practice",
"address4": "South Brisbane QLD 4101",
"components": {
"administrative_area": "QLD",
"building": "Russell Street",
"country_iso_3": "AUS",
"locality": "South Brisbane",
"postal_code": "4101",
"postal_code_short": "4101",
"thoroughfare": "Practice",
"thoroughfare_name": "Practice",
"sub_building_number": "38"
},
"metadata": {},
"analysis": {
"verification_status": "Partial",
"address_precision": "Locality",
"max_address_precision": "DeliveryPoint"
}
}
]
An added benefit is that the service provides you with extra information about the validity of the address.
(Disclosure: I work at SmartyStreets.)

Oracle 11g : staging table

I need Oracle 11g commands to create staging table.
table: Streets
Input field: Name
output fields: Streets_Prefix
Streets_Name
Streets_Suffix
End users from a front end application fill information for only "Name" fields of "Streets" table as :
"AVE Mandela road South".
But in same table "Streets" other fields need to get parsed data from "Name" fields as :
Streets_Prefix : AVE
Streets_Name : Mandela road
Streets_Suffix : South
So the input and target table are the same ("Streets"), but the input and target fields are different. I need a command to create a staging table in which I can parse the "Name" field and then update "Streets_Prefix", "Streets_Name" and "Streets_Suffix".
You can use the below queries to get the Streets_Prefix, Streets_Name,Streets_Suffix populated from Name column of STREETS table
select
substr(name, 0, instr(name,' ',1)-1)
as Streets_Prefix
from STREETS ;
select
substr(name, instr(name, ' ')+1,instr(name, ' ', -1, 1) - instr(name, ' ') - 1)
as Streets_Name
from STREETS ;
select
substr(name, instr(name,' ',-1)+1)
as Streets_suffix
from STREETS ;
OUTPUT:
STREETS_PREFIX
AVE
star
ZEBRA
STREETS_NAME
Mandela road
Bangalore road
CROSSING road
STREETS_SUFFIX
South
East
NORTH
create table STREETS ( name varchar2(200)); --Considering you want only one column
For further reference : CREATE TABLE
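The three substr/instr expressions above boil down to "first word, last word, everything in between". A minimal Python sketch of that rule, useful for checking expected values before running the SQL, might look like this (split_street_name is a hypothetical name, and names with fewer than three words are rejected as an assumption of the sketch):

```python
def split_street_name(name):
    """Return (prefix, street name, suffix) from a space separated street string."""
    words = name.split()
    if len(words) < 3:
        raise ValueError("expected at least prefix, name and suffix")
    prefix = words[0]                 # text before the first space
    suffix = words[-1]                # text after the last space
    middle = " ".join(words[1:-1])    # everything in between
    return prefix, middle, suffix
```

One difference from the SQL: split() collapses repeated spaces, whereas the substr/instr version keeps them inside the middle part.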

matching tool in access

I am trying to have Access return addresses that match between two tables. Currently I am getting the results below, since my expression only matches on the leftmost 3 characters:
Expr1: ((Left([Sales].[ShipToAddress1],3)=Left([HPG ROSTER].[Address1],3)))
ShipToAddress1         Address1
10420 VISTA DEL SOL    10420 Vista Del Sol
10420 VISTA DEL SOL    10460 Vista Del Sol
10301 GATEWAY WEST     10301 Gateway West
3535 S. I35E           3535 S. I-35 East
3535 S. I35E           3537 South I-35
Is there a way to have Access match only the numbers from ShipToAddress1 against the numbers from Address1, instead of the left 3 characters?
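Use InStr to find the first space, then compare everything before it (the street number) instead of a fixed three characters.
I think this should work for you:
(Left([Sales].[ShipToAddress1], InStr(1, [Sales].[ShipToAddress1], ' ') - 1) = Left([HPG ROSTER].[Address1], InStr(1, [HPG ROSTER].[Address1], ' ') - 1))
This assumes every address contains a space; if it does not, InStr returns 0 and Left raises an error.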
NOTE: You'll also want to match on something like the zipcode.
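For checking which rows should match, the street-number comparison can be sketched in Python. This is only an illustration of the rule, not Access code; leading_number and same_street_number are hypothetical helper names.

```python
import re

def leading_number(address):
    """Return the run of digits at the start of the address, or None if there is none."""
    match = re.match(r"\s*(\d+)", address)
    return match.group(1) if match else None

def same_street_number(ship_to, roster):
    """True only when both addresses start with the same street number."""
    number = leading_number(ship_to)
    return number is not None and number == leading_number(roster)
```

On the sample rows, "10420 VISTA DEL SOL" matches "10420 Vista Del Sol" but not "10460 Vista Del Sol", which is the false positive that the Left(..., 3) comparison produced.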