What's the best way to parse an Address field using t-sql or SSIS? - sql

I have a data set that I import into a SQL table every night. One field is 'Address_3' and contains the City, State, Zip and Country fields. However, this data isn't standardized. How can I best parse the data that currently goes into one field into individual fields? Here are some examples of the data I might receive:
'INDIANAPOLIS, IN 46268 US'
'INDIANAPOLIS, IN 46268-1234 US'
'INDIANAPOLIS, IN 46268-1234'
'INDIANAPOLIS, IN 46268'
Thanks in advance!
David

I've done something similar (not in T-SQL) and I find it works best to start at the end of the string and work backwards.
Grab the rightmost element up to the first space or comma.
Is it a known country code? It's the country.
If not, is it all numeric (allowing a hyphen)? It's the zip code.
Otherwise, discard it.
Grab the second rightmost element up to the next space or comma
Is it a two alpha-character field? It's the state
Grab everything else preceding the last comma and call it the city.
You'll need to make some adjustments based on what your input data looks like but the basic idea is to start from the right, grab the elements you can easily classify and call everything else the city.
You can implement something like this by using the REVERSE function to make searching easier (in which case you'll be parsing the string from left to right instead of right to left like I said above), the PATINDEX or CHARINDEX functions to find spaces and commas, and the SUBSTRING function to pull the address apart based on the positions found by PATINDEX and CHARINDEX. You could use the ASCII function to determine if a character is numeric or not.
You tagged your question with the SSIS tag as well - it might be easier to implement the parsing in some VB script in SSIS rather than try to do it with T-SQL.
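The right-to-left idea described above can be sketched outside T-SQL as follows. This is a minimal Python illustration, not a definitive implementation: the function name `parse_city_state_zip`, the `KNOWN_COUNTRIES` set and the zip pattern are assumptions made for the example, and the country list is deliberately tiny because two-letter country codes can collide with state codes (for example CA or IN).

```python
import re

# Hypothetical list of accepted country codes; extend with care, because
# codes such as "CA" or "IN" are also valid US state abbreviations.
KNOWN_COUNTRIES = {"US", "USA"}

# Five digits, optionally followed by a hyphen and four more digits.
ZIP_RE = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def parse_city_state_zip(value):
    """Split 'CITY, ST ZIP COUNTRY' by peeling tokens off the right-hand end."""
    tokens = [t for t in re.split(r"[ ,]+", value.strip()) if t]
    country = zip_code = state = None

    # Rightmost token: is it a known country code?
    if tokens and tokens[-1].upper() in KNOWN_COUNTRIES:
        country = tokens.pop().upper()
    # Next token: does it look like a zip code?
    if tokens and ZIP_RE.fullmatch(tokens[-1]):
        zip_code = tokens.pop()
    # Next token: two letters are treated as the state.
    if tokens and len(tokens[-1]) == 2 and tokens[-1].isalpha():
        state = tokens.pop().upper()
    # Whatever is left is called the city.
    city = " ".join(tokens) or None
    return {"city": city, "state": state, "zip": zip_code, "country": country}


if __name__ == "__main__":
    for sample in (
        "INDIANAPOLIS, IN 46268 US",
        "INDIANAPOLIS, IN 46268-1234 US",
        "INDIANAPOLIS, IN 46268-1234",
        "INDIANAPOLIS, IN 46268",
    ):
        print(parse_city_state_zip(sample))
```

Anything the sketch cannot classify ends up in the city, which matches the "call everything else the city" rule; real data will likely need extra rules.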

By far the best way is to not reinvent the wheel and get an address parsing and standardization engine. Ideally, you would use a CASS certified engine which is what is approved by the Postal Service. However, there are free address parsers on the net these days and any of those would be more accurate and less frustrating than trying to parse the address yourself.
That said, I will say that address parsers and the Post Office work from bottom up (So, country, then zip code, then city, then state then address line 2 etc.).

In SSIS you can have 4 derived columns (city,state,zip,country).
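SUBSTRING(column, 1, FINDSTRING(column, ",", 1) - 1) -- city
SUBSTRING(column, FINDSTRING(column, " ", 1) + 1, FINDSTRING(column, " ", 2) - FINDSTRING(column, " ", 1) - 1) -- state
SUBSTRING(column, FINDSTRING(column, " ", 2) + 1, FINDSTRING(column, " ", 3) - FINDSTRING(column, " ", 2) - 1) -- zip
You can see the pattern above and continue accordingly. These expressions assume a single-word city and that each delimiter is present (FINDSTRING returns 0 when it is not), so this can get complicated. A Script Component may be a better way to pull out the pieces of text.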

something like this should help:
select substring(CityStateZip, 1,
case when charindex(',',reverse(CityStateZip)) = 0 then len(CityStateZip)
else len(CityStateZip) - charindex(',',reverse(CityStateZip)) end) as City,
LEFT(LTRIM(
SUBSTRING(CityStateZip, case when charindex(',',reverse(CityStateZip)) = 0 then len(CityStateZip) else
len(CityStateZip) - charindex(',',reverse(CityStateZip))+2 end, LEN(CityStateZip)))
,2) as State,
SUBSTRING(CityStateZip, case when charindex(' ',reverse(CityStateZip)) = 0 then len(CityStateZip) else
len(CityStateZip) - charindex(' ',reverse(CityStateZip))+2 end, LEN(CityStateZip)) as Zip
from YourAddressTable

Related

How to use Regex to lowercase catalogue values without any logic codes

For a loan domain we pass some catalogue values, e.g. whether a customer is a primary or secondary customer. I need to check the values irrespective of case (uppercase, lowercase, camel case). The software I am using accepts only regex codes, not Java or JS code (it uses a different scripting language). I am trying to convert using only regex but still get an error.
If catalogue_value ~"(/A-Z/)" then
Catalogue_value ~"/l"
Endif
I am still learning regex and figuring out the correct expressions to use.
Please tell me the correct format for using regex to change values to lowercase or uppercase.
If I understood your problem, you want to search without worrying about the case. For example, the data is Paul, and you want to find this record when searching by PAUL, paul, PaUl, etc.?
One common technique is to convert both sides to all upper or all lower case, without regex. For example, in JavaScript:
"Paul".toLowerCase() === "paUL".toLowerCase()
In SQL:
select case when LOWER('Paul') = LOWER('paUL') then 1 else 0 end

Reuse a previously processed value in a column

I hope someone can help me with this, as I'm struggling to get it to work as expected.
In my DB I have phone numbers in different formats (e.g. 15551234, 5551234, +15551234).
So I'm using the following CASE to clean this up, and it works great:
CASE
LEFT(DC.PhoneNumber0,1)
WHEN '+' then replace(DC.PhoneNumber0,'+1','')
WHEN 1 then right(DC.PhoneNumber0,len(DC.PhoneNumber0)-1)
ELSE DC.PhoneNumber0
End 'Transformed Number',
This returns the clean phone number I need to work with (5551234) in the 'Transformed Number' column.
I would now like to use another CASE that takes this cleaned number and extracts the area code to translate it into an understandable value (e.g. 201 => US - New Jersey).
I'm stuck writing the second CASE.
I tried something like the following, but it does NOT work with numbers that are already clean in my DB.
CASE
WHEN right(left(replace(DC.PhoneNumber0,'+',''),4),3) in (201, 1201) then 'US - New Jersey'
WHEN right(left(replace(DC.PhoneNumber0,'+',''),4),3) in (202, 1202) then 'US - Washington D.C.'
Ideally, I would like to reuse, the value that I just transformed in the previous CASE. Is there a way to do it?
Thanks in advance for your help.
You can CREATE FUNCTION containing your initial phone number parsing logic and then just reference the function twice. Your DBMS's CREATE FUNCTION syntax should have a clause declaring the function to be NOT VARIANT or DETERMINISTIC, so that it can be run only once per row regardless of how many times you invoke it.
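The "clean once, then reuse" idea can be illustrated with a small Python sketch standing in for the SQL function. The names `normalize_phone`, `describe_area` and the `AREA_CODES` table are hypothetical; the cleaning rules simply mirror the CASE expression from the question, and the sample numbers assume the stored values include the area code.

```python
# Hypothetical lookup table; only the two codes from the question are listed.
AREA_CODES = {
    "201": "US - New Jersey",
    "202": "US - Washington D.C.",
}


def normalize_phone(raw):
    """Strip a leading '+1' or '1', mirroring the CASE in the question."""
    number = raw.strip()
    if number.startswith("+1"):
        return number[2:]
    if number.startswith("1"):
        return number[1:]
    return number


def describe_area(raw):
    """Look up the area code on the cleaned number, not on the raw value."""
    cleaned = normalize_phone(raw)  # computed once, reused below
    return AREA_CODES.get(cleaned[:3], "Unknown")


if __name__ == "__main__":
    for sample in ("+12015551234", "12025551234", "2015551234"):
        print(sample, "->", normalize_phone(sample), "->", describe_area(sample))
```

Because the lookup always runs on the cleaned value, numbers that are already clean are handled the same way as those carrying a prefix.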

Using SQL - how do I match an exact number of characters?

My task is to validate existing data in an MSSQL database. I've got some SQL experience, but not enough, apparently. We have a zip code field that must be either 5 or 9 digits (US zip). What we are finding in the zip field are embedded spaces and other oddities that will be prevented in the future. I've searched enough to find the references for LIKE that leave me with this "novice approach":
ZIP NOT LIKE '[0-9][0-9][0-9][0-9][0-9]'
AND ZIP NOT LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
Is this really what I must code? Is there nothing similar to...?
ZIP NOT LIKE '[\d]{5}' AND ZIP NOT LIKE '[\d]{9}'
I will loathe validating longer fields! I suppose, ultimately, both code sequences will be equally efficient (or should be).
Thanks for your help
Unfortunately, LIKE does not support regular expressions, so there is no equivalent of \d. However, combining a length function with a numeric function may provide an acceptable result:
WHERE ISNUMERIC(ZIP) <> 1 OR LEN(ZIP) NOT IN(5,9)
I would not recommend it, however, because ISNUMERIC will return 1 for a +, - or valid currency symbol. The minus sign in particular may be prevalent in the data set, so I'd still favor your "novice" approach.
Another approach is to use:
ZIP LIKE '%[^0-9]%' OR LEN(ZIP) NOT IN(5,9)
which will find any row where ZIP contains a character that is not 0-9 (i.e. only 0-9 is allowed) or where the length is not 5 or 9.
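If the validation can also run outside the database, the "digits only, length 5 or 9" rule is short to express with a real regular expression. The following is a minimal Python sketch; the names `ZIP_PATTERN` and `is_valid_zip` are made up for the example.

```python
import re

# Exactly five digits, optionally followed by exactly four more (nine total).
# [0-9] is used instead of \d so that only ASCII digits are accepted.
ZIP_PATTERN = re.compile(r"[0-9]{5}(?:[0-9]{4})?")


def is_valid_zip(zip_code):
    """Return True only for strings of exactly 5 or 9 ASCII digits."""
    return ZIP_PATTERN.fullmatch(zip_code) is not None


if __name__ == "__main__":
    for sample in ("46268", "462681234", "4626 8", "46268-1234", "-4626"):
        print(repr(sample), is_valid_zip(sample))
```

Using `fullmatch` rather than `search` is what rejects embedded spaces and other stray characters anywhere in the value.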
There are few ways you could achieve that.
You can replace each [0-9] with _ (which matches any single character, so this checks the length only), like
ZIP NOT LIKE '_____'
USE LEN() so it's like
LEN(ZIP) NOT IN(5,9)
You are looking for LEN() (the SQL Server counterpart of LENGTH() in MySQL and Oracle):
select * from table WHERE LEN(ZIP)=5;
select * from table WHERE LEN(ZIP)=9;
To test for non-numeric values you can use ISNUMERIC():
WHERE ISNUMERIC(ZIP) <> 1

Custom ORDER BY to ignore 'the'

I'm trying to sort a list of titles, but currently there's a giant block of titles which start with 'The '. I'd like the 'The ' to be ignored, and the sort to work off the second word. Is that possible in SQL, or do I have to do custom work on the front end?
For example, current sorting:
Airplane
Children of Men
Full Metal Jacket
Pulp Fiction
The Fountain
The Great Escape
The Queen
Zardoz
Would be better sorted:
Airplane
Children of Men
The Fountain
Full Metal Jacket
The Great Escape
Pulp Fiction
The Queen
Zardoz
Almost as if the records were stored as 'Fountain, The', and the like. But I don't want to store them that way if I can, which is of course the crux of the problem.
Best is to have a computed column to do this, so that you can index the computed column and order by that. Otherwise, the sort will be a lot of work.
So then you can have your computed column as:
CASE WHEN title LIKE 'The %' THEN stuff(title,1,4,'') + ', The' ELSE title END
Edit: If STUFF isn't available in MySQL, then use RIGHT or SUBSTRING to remove the leading 4 characters. But still try to use a computed column if possible, so that indexing can be better. The same logic should be applicable to rip out "A " and "An ".
Rob
Something like:
ORDER BY IF(LEFT(title,2) = "A ",
SUBSTRING(title FROM 3),
IF(LEFT(title,3) = "An ",
SUBSTRING(title FROM 4),
IF(LEFT(title,4) = "The ",
SUBSTRING(title FROM 5),
title)))
But given the overhead of doing this more than a few times, you're really better off storing the title sort value in another column...
I think you could do something like
ORDER BY REPLACE(TITLE, 'The ', '')
although this would remove every occurrence of 'The ', not just a leading one. I don't think that would affect the sort order very much.
The best way to handle this would be to have a column that contains the value you want to use specifically for ordering output. Then you'd just have to use:
SELECT t.title
FROM MOVIES t
ORDER BY t.order_title
There are going to be various rules about what should and should not be used to order titles.
Based on your example, an alternative would be to use something like:
SELECT t.title
FROM MOVIES t
ORDER BY CASE WHEN t.title LIKE 'The %' THEN SUBSTR(t.title, 5) ELSE t.title END
You could use a CASE statement to contain the various rules.
You can certainly arrange to dynamically strip off 'The', though you'll soon find that you have to deal with 'A' and 'An' (except for the special case of titles like "A is for Alibi"). When "foreign" films enter the mix, you'll need to cope with "El" and "La" (except for that pesky edge case, "LA Story"). Then mix in some German films, and you'll need to cope with 'Der' and 'Die' (except for that pesky set of 'Die Hard' edge cases). See the pattern? You're headed down a path that keeps getting longer and more pitted with special cases.
The way forward that avoids an ever-growing set of special cases is to store the title as you want it displayed and also store the title as you want it sorted.
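To make the "separate sort value" idea concrete, here is a minimal Python sketch that derives a sort key from each title. The helper name `sort_title` and the `LEADING_ARTICLES` tuple are assumptions for illustration; the rule is deliberately naive and would mis-handle the edge cases mentioned above (such as "A is for Alibi"), which is exactly why a stored, hand-correctable sort column is attractive.

```python
# Hypothetical, English-only rule set; real catalogues need per-title overrides.
LEADING_ARTICLES = ("the ", "a ", "an ")


def sort_title(title):
    """Return a lowercase sort key with one leading article removed."""
    lowered = title.lower()
    for article in LEADING_ARTICLES:
        if lowered.startswith(article):
            return lowered[len(article):]
    return lowered


titles = [
    "Airplane", "Children of Men", "Full Metal Jacket", "Pulp Fiction",
    "The Fountain", "The Great Escape", "The Queen", "Zardoz",
]

# The display value stays untouched; only the ordering uses the derived key.
ordered = sorted(titles, key=sort_title)

if __name__ == "__main__":
    for title in ordered:
        print(title)
```

In a database, the same key would typically be computed once and stored (or exposed as an indexed computed column) rather than recalculated on every query.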
For SQLite
ORDER BY CASE WHEN LOWER(SUBSTR(title,1,4)) = 'the ' THEN SUBSTR(title,5) ELSE title END ASC
Ways that will only remove the first 'The ':
=SUBSTITUTE(A1,"The ","",1) or, more reliably:
=IF(LEFT(A1,4)="The ",RIGHT(A1,LEN(A1)-4),A1)
The second one checks whether the first four characters equal "The "; if so, it counts the characters in the cell and shows only the right-hand characters, excluding "The ".

T-SQL: checking for email format

I have a scenario where I need data integrity in the physical database. For example, I have a variable @email_address VARCHAR(200) and I want to check whether the value of @email_address is in email format. Does anyone have any idea how to check the format in T-SQL?
Many thanks!
I tested the following query with many different wrong and valid email addresses. It should do the job.
IF (
CHARINDEX(' ',LTRIM(RTRIM(@email_address))) = 0
AND LEFT(LTRIM(@email_address),1) <> '@'
AND RIGHT(RTRIM(@email_address),1) <> '.'
AND CHARINDEX('.',@email_address ,CHARINDEX('@',@email_address)) - CHARINDEX('@',@email_address ) > 1
AND LEN(LTRIM(RTRIM(@email_address ))) - LEN(REPLACE(LTRIM(RTRIM(@email_address)),'@','')) = 1
AND CHARINDEX('.',REVERSE(LTRIM(RTRIM(@email_address)))) >= 3
AND (CHARINDEX('.@',@email_address ) = 0 AND CHARINDEX('..',@email_address ) = 0)
)
print 'valid email address'
ELSE
print 'not valid'
It checks these conditions:
No embedded spaces
'@' can't be the first character of an email address
'.' can't be the last character of an email address
There must be a '.' somewhere after '@'
Exactly one '@' sign is allowed
Domain name should end with at least 2 character extension
can't have patterns like '.@' and '..'
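For comparison, the same rule set can be written as a short Python sketch, for instance to run the check in application code before the insert. The function name `looks_like_email` is hypothetical, and the sketch assumes the separator is '@'. Like the T-SQL version it is a plausibility check, not full standards-level validation.

```python
def looks_like_email(address):
    """Apply the simple format rules listed above; not a full validator."""
    s = address.strip()
    if " " in s:                      # no embedded spaces
        return False
    if s.count("@") != 1:             # exactly one '@'
        return False
    at = s.index("@")
    if at == 0:                       # '@' can't be the first character
        return False
    if s.endswith("."):               # '.' can't be the last character
        return False
    dot_after_at = s.find(".", at)    # -1 when there is no '.' after '@'
    if dot_after_at - at <= 1:        # need a '.' after '@', not directly after it
        return False
    if len(s) - s.rfind(".") - 1 < 2: # extension of at least 2 characters
        return False
    if ".@" in s or ".." in s:        # no '.@' or '..' patterns
        return False
    return True


if __name__ == "__main__":
    for sample in ("user@example.com", "@example.com", "user@example.c"):
        print(sample, looks_like_email(sample))
```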
AFAIK there is no good way to do this.
The email format standard is so complex that parsers have been known to run to thousands of lines of code. Even if you were to use a simpler form that would fail some obscure but valid addresses, you'd have to do it without regular expressions, which are not natively supported by T-SQL (again, I'm not 100% sure on that), leaving you with a simple fallback of something like:
LIKE '%_@_%_.__%'
..or similar.
My feeling, though, is that you generally shouldn't be doing this at the last possible moment (as you insert into a DB). You should be doing it at the first opportunity and/or at a common gateway (the controller which actually makes the SQL insert request), where incidentally you would have the advantage of regex, and possibly even a library which does the "real" validation for you.
If you use SQL 2005 or 2008 you might want to look at writing CLR stored procedures and using the .NET regex engine like this. If you're using SQL 2000 or earlier you can use the VBScript scripting engine's regular expressions like this. You could also use an extended stored procedure like this.
There is no easy way to do it in T-SQL, I am afraid. To validate all the varieties of email address allowed by RFC 2822 you will need to use a regular expression.
More info here.
You will need to define your scope, if you want to simplify it.