Column type and size for international country subdivisions (states, provinces, territories etc) - sql

I apologize if this is a duplication.
What column standardization would you use for storing international country subdivision data?
For example, if it was just US and Canada I believe all subdivisions have a 2-character abbreviation... which might lend to a Char(2)
This cannot possibly be sustainable internationally lest we presume there are only 1296 (A-Z, 0-9) subdivisions.
I've been unsuccessful locating an ISO list of these or even an indication of how to store them.
That's fine, I don't need to know them all now but I would like to know that there is a standard and what standard info to store as needed.
Thanks
EDIT: It appears that I can accomplish this using the ISO 3166-2 standard:
http://en.wikipedia.org/wiki/ISO_3166-2
Browsable as a dataset here:
http://www.commondatahub.com/live/geography/state_province_region/iso_3166_2_state_codes
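If ISO 3166-2 is the standard, the column size follows from the code format: a two-letter ISO 3166-1 country code, a hyphen, and one to three alphanumeric characters, so six characters at most. Below is a minimal Python sketch of a format check. `is_subdivision_code` is a hypothetical helper name, and it only checks the shape of a code, not whether the code is actually assigned.

```python
import re

# Shape of an ISO 3166-2 code: alpha-2 country code, hyphen, 1-3 alphanumerics.
_SUBDIVISION_RE = re.compile(r"[A-Z]{2}-[A-Z0-9]{1,3}")

MAX_CODE_LENGTH = 6  # e.g. "GB-ENG"; suggests VARCHAR(6) for the full code


def is_subdivision_code(code: str) -> bool:
    """Return True if `code` has the ISO 3166-2 shape (not a registry lookup)."""
    return _SUBDIVISION_RE.fullmatch(code) is not None
```

Storing the country code in its own column and only the part after the hyphen in the subdivision column would also work; that part is never longer than three characters.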

As far as I know there are no international standards because it's a national issue
Take the UK...
Are the subdivisions Wales, Scotland, England, Northern Ireland? No abbreviations.
Counties: is it "Cheshire" ("Ches.") or "Highlands and Islands" (no abbreviations)
Postal areas: Rutland is still a post county but not an official one
Your question arguably assumes a federal structure (as per Switzerland where I am) but this won't apply to many if not most countries. Carrying on with Switzerland, Kanton does not feature in postal addresses or post codes either.
If there is an ISO standard, then national or local pride will annoy punters as soon as it's on your web site.
Personally, I dislike wading through a "state" dropdown on a web site. It has no meaning for me in either UK (my nationality) or my residence (Switzerland).
You may be best to stick to states for the US and Canada, plus a "non US/Canada" option. Don't force or assume a sub-division.
Edit, Jun 2012.
I now live in Malta. I have neither state, county, nor Kanton. Please don't insist.
Any big cities in the UK don't normally mention county (England+Wales)/region (Scotland).

Just for example:
Llanfairpwllgwyngyllgogerychwyrndrobwyll-llantysiliogogogoch This is the name of a town in North Wales.
VARCHAR(100)
Abbreviations:
There are 2-letter and 3-letter country codes, which are used by the UN. You can use VARCHAR(2) for the 2-letter code and VARCHAR(3) for the 3-letter code.
E.G. Australia 2-letter, 3-letter and numeric code
AU AUS 036 Australia
It all depends on how you want to save data. If you want to save 3-letter + numeric code in one column then size will be according to that and if you want to save them separate then size will be different.
To be on safe side you can use VARCHAR(10).

Related

How to get Lucene scoring to account for words not specified in search terms?

There is probably a name for what I'm asking and it has something to do with Bayesian statistics.
I have a database of street addresses and I'm using Lucene to match user-entered addresses (if you need an analogy, pretend I work for Google Maps).
Given that both "West North Avenue" and "West North Shore Avenue" are valid street names, how can I get Lucene to score "2000 West North Avenue" higher than "1000 West North Shore Avenue" when searching for "1000^0.001 West North Avenue"?
The 1000^0.001 means, the number should be used to break a tie, but otherwise matching the street name is more important than matching the right number to the wrong street.
Unfortunately in this example, the 1000^0.001 causes the wrong match (North Shore) to get ahead of the correct one.
What scoring algorithm would enable Lucene to adjust the score downwards for failure to specify an indexed term in the search, with rare terms weighing more than common terms?
I would solve this by carefully tokenizing street names. For instance, you could do this:
extract the number and the street name to two different fields street_nb, street_nm. And index them separately.
now use two clauses for your query: one targeting street_nm as MUST, and the other targeting street_nb as SHOULD. That way you make sure the street name alone will match, and if the number matches too, even better.
you can do different things besides this, like using phrases to force a perfect match on the street name etc. Play around with the variants till it gives you good results.
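The two-field idea can be illustrated without Lucene. This is a plain-Python sketch, not Lucene's scoring model. `split_address` and `rank` are hypothetical helpers, and it assumes the street name must match exactly while the number only breaks ties.

```python
def split_address(address: str):
    """Split '1000 West North Avenue' into ('1000', 'west north avenue')."""
    number, _, name = address.strip().partition(" ")
    return number, name.lower()


def rank(query: str, candidates: list) -> list:
    """Keep candidates whose street name matches; number matches sort first."""
    q_number, q_name = split_address(query)
    scored = []
    for candidate in candidates:
        c_number, c_name = split_address(candidate)
        if c_name != q_name:  # required clause: the street name
            continue
        bonus = 1 if c_number == q_number else 0  # optional clause: the number
        scored.append((bonus, candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
    return [candidate for _, candidate in scored]
```

With this split, "1000 West North Shore Avenue" never outranks "2000 West North Avenue" for a "West North Avenue" query, because the house number cannot make up for a different street name.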

What are ways to match street addresses in SQL Server?

We have a column for street addresses:
123 Maple Rd.
321 1st Ave.
etc...
Is there any way to match these addresses to a given input? The input would be a street address, but it might not be in the same format. For example:
123 Maple Road
321 1st Avenue
Our first thought is to strip the input of all street terms (rd, st, ave, blvd, etc).
Obviously that won't match reliably all the time. Are there any other ways to try to match street addresses in SQL server?
We can use user defined functions, stored procs and regular old t-sql. We cannot use clr.
Rather than stripping out the things that can be variable, try to convert them to a "canonical form" that can be compared.
For example, replace 'rd' or 'rd.' with 'road' and 'st' or 'st.' with 'street' before comparing.
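A minimal sketch of that canonical-form idea in Python. The `CANONICAL` table is a small illustrative subset and `canonicalize` is a hypothetical helper name; the same replacements could be done in a T-SQL user-defined function.

```python
import re

# Illustrative subset; a real table would follow the postal authority's list.
CANONICAL = {"rd": "road", "st": "street", "ave": "avenue", "blvd": "boulevard"}


def canonicalize(address: str) -> str:
    """Lower-case, drop periods and commas, and expand known abbreviations."""
    words = re.sub(r"[.,]", " ", address.lower()).split()
    return " ".join(CANONICAL.get(word, word) for word in words)
```

Because the replacement works on whole words, "1st" is left alone. One caveat: "St" can also mean "Saint", so a blind word-level replacement will get some street names wrong.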
You may want to consider using the Levenshtein Distance algorithm.
You can create it as a user-defined function in SQL Server, where it will return the number of operations that need to be performed on String_A so that it becomes String_B. You can then compare the result of the Levenshtein Distance function against some fixed threshold, or against some value derived from the length of the strings.
You would simply use it as follows:
... WHERE dbo.LEVENSHTEIN(address_in_db, address_to_search) < 5;
As Mark Byers suggested, converting variable terms into canonical form will help if you use Levenshtein Distance.
Using Full-Text Search may be another option, especially since Levenshtein would normally require a full table scan. This decision may depend on how frequently you intend to do these queries.
You may want to check out the following Levenshtein Distance implementation for SQL Server:
Levenshtein Distance Algorithm: TSQL Implementation
Note: You would need to implement a MIN3 function for the above implementation. You can use the following:
-- Returns the smallest of three integers.
CREATE FUNCTION MIN3(@a int, @b int, @c int)
RETURNS int
AS
BEGIN
DECLARE @m INT
SET @m = @a
IF @b < @m SET @m = @b
IF @c < @m SET @m = @c
RETURN @m
END
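To show where a three-way minimum fits, here is a sketch of the standard dynamic-programming Levenshtein distance in Python. It is not the linked T-SQL implementation, just the same algorithm in compact form.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

The `min(...)` of three candidates is the step that MIN3 performs in the T-SQL version.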
You may also be interested in checking out the following articles:
Address Geocoding with Fuzzy String Matching [Uses Levenshtein Distance]
Stack Overflow - Strategies for finding duplicate mailing addresses
Merge/Purge and Duplicate Detection
In order to do proper street address matching, you need to get your addresses into a standardized form. Have a look at the USPS postal standards here (I'm assuming you're dealing with US addresses). It is by no means an easy process if you want to be able to deal with ALL types of US mail addresses. There is software available from companies like QAS and Satori Software that you can use to do the standardization for you. You'll need to export your addresses, run them through the software and then load the database with the updated addresses. There are also third party vendors that will perform the address standardization as well. It may be overkill for what you are trying to do but it's the best way to do it. If the addresses in your database are standardized you'll have a better chance of matching them (especially if you can standardize the input as well).
I think the first step for you is to better define how generous or not you're going to be regarding differing addresses. For example, which of these match and which don't:
123 Maple Street
123 Maple St
123 maple street
123 mpale street
123 maple
123. maple st
123 N maple street
123 maple ave
123 maple blvd
Are there both a Maple Street and a Maple Blvd in the same area? What about Oak Street vs Oak Blvd?
For example, where I live there are many streets/roads/blvds/aves that are all named Owasso. I live on Owasso Street, which connects to North Owasso Blvd, which connects to South Owasso Blvd. However, there is only one Victoria Ave.
Given that reality, you must either have a database of all road names, and look for the closest road (and deal with the number separately)
OR
Make a decision ahead of time about what you'll insist on and what you won't.
Address matching and deduplication is a messy business. Other posters are correct when they say that the addresses need to be standardized first to the local postal authority's standards (the USPS, for example, if it is a US address). Once the addresses are in standard format the rest is easy.
There are several third-party services which will flag duplicates in a list for you. Doing this solely with a MySQL subquery will not account for differences in address formats and standards. The USPS (for US address) has certain guidelines to make these standard, but only a handful of vendors are certified to perform such operations.
So, I would recommend the best answer for you is to export the table into a CSV file, for instance, and submit it to a capable list processor. One such is SmartyStreets' Bulk Address Validation Tool which will have it done for you in a few seconds to a few minutes automatically. It will flag duplicate rows with a new field called "Duplicate" and a value of Y in it.
Try standardizing and validating a couple of addresses here to get an idea for what the output will look like.
Full Disclosure: I work for SmartyStreets
Stripping out data is a bad idea. Many towns will have dozens of variations of the same street - Oak Street, Oak Road, Oak Lane, Oak Circle, Oak Court, Oak Avenue, etc... As mentioned above converting to the canonical USPS abbreviation is a better approach.
Fuzzy Lookups and Groupings Provide Powerful Data Cleansing Capabilities
You could try SOUNDEX to see if that gets you close. http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx
You may also check out the COMPGED function - https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2487-2018.pdf

Is there common street addresses database design for all addresses of the world? [closed]

I am a programmer and need a practical approach to storing street address structures of the world in a database. So which is the best and common database design for storing street addresses? It should be simple to use, fast to query and dynamic to store all street addresses of the world.
It is possible to represent addresses from lots of different countries in a standard set of fields. The basic idea of a named access route (thoroughfare) which the named or numbered buildings are located on is fairly standard, except in China sometimes. Other near universal concepts include: naming the settlement (city/town/village), which can be generically referred to as a locality; naming the region and assigning an alphanumeric postcode. Note that postcodes, also known as zip codes, are purely numeric only in some countries. You will need lots of fields if you really want to be generic.
The Universal Postal Union (UPU) provides address data for lots of countries in a standard format. Note that the UPU format holds all addresses (down to the available field precision) for a whole country; it is therefore relational. If storing customer addresses, where only a small fraction of all possible addresses will be stored, it's better to use a single table (or flat format) containing all fields and one address per row.
A reasonable format for storing addresses would be as follows:
Address Lines 1-4
Locality
Region
Postcode (or zipcode)
Country
Address lines 1-4 can hold components such as:
Building
Sub-Building
Premise number (house number)
Premise Range
Thoroughfare
Sub-Thoroughfare
Double-Dependent Locality
Sub-Locality
Frequently only 3 address lines are used, but this is often insufficient. It is of course possible to require more lines to represent all addresses in the official format, but commas can always be used as line separators, meaning the information can still be captured.
Usually analysis of the data would be performed by locality, region, postcode and country and these elements are fairly easy for users to understand when entering data. This is why these elements should be stored as separate fields. However, don't force users to supply postcode or region, they may not be used locally.
Locality can be unclear, particularly the distinction between map locality and postal-locality. The postal locality is the one deemed by a postal authority which may sometimes be a nearby large town. However, the postcode will usually resolve any problems or discrepancies there, to allow correct delivery even if the official post-locality is not used.
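The flat, one-address-per-row format described above could look roughly like this. The sketch uses Python's built-in `sqlite3` module so it can be run as-is. The column names are assumptions, and only `line1` and `country` are made mandatory, since region and postcode may not be used locally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        line1      TEXT NOT NULL,
        line2      TEXT,
        line3      TEXT,
        line4      TEXT,
        locality   TEXT,
        region     TEXT,           -- optional: not used everywhere
        postcode   TEXT,           -- text, not numeric; optional
        country    TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO address (line1, locality, region, postcode, country) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Via Eroi della Repubblica", "Tropea", "VV", "89861", "Italy"),
)
# Locality, region, postcode and country are separate columns, so analysis
# by any of them is a plain WHERE or GROUP BY.
rows = conn.execute(
    "SELECT locality, postcode FROM address WHERE country = ?", ("Italy",)
).fetchall()
```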
Have a look at Database Answers. Specifically, this covers many cases:
(All variable length character datatype)
AddressId
Line1
Line2
Line3
City
ZipOrPostcode
StateProvinceCounty
CountryId
OtherAddressDetails
Ask yourself what is the main purpose of storing this data? Do you intend to actually send mail to the person at the address? Track demographics, populations? Be able to ask callers for their correct address as part of some basic authentication/verification? All of the above? None of the above?
Depending on your actual need, you will determine either a) it doesn't really matter, and you can go for a free-text approach, or b) structured/specific fields for all countries, or c) country specific architecture.
Sometimes the closest you can get to a street address is the city.
I once had a project to put all the Secondary Schools in India in Google Maps. I wrote a spiffy program using the Google API and thought it would be quite easy.
Then I got the data from the client. Some school addresses were things like "Across from the market, next to the barber" or "Near old bus stand".
It made my task much harder since, unfortunately, the Google API does not support that format.
For international addresses, it is remarkably hard to find a way to format the information if it is broken down into fields. As a for instance, an Italian address uses:
<street address>
<zip> <town> <region>
<country>
Such as
Via Eroi della Repubblica
89861 Tropea VV
Italy
That is rather different from the order for US addresses - on the second line.
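One way to cope with the differing line order is to store the fields separately and keep a per-country layout for output. This is a rough Python sketch: `TEMPLATES` and `format_address` are hypothetical names, the Italian layout is the one shown above, and the US layout (city, state, zip on the second line) is an assumption for comparison.

```python
# Line templates per country; only two layouts are included here.
TEMPLATES = {
    "IT": ["{street}", "{postcode} {locality} {region}", "{country}"],
    "US": ["{street}", "{locality} {region} {postcode}", "{country}"],
}


def format_address(country_code: str, **fields: str) -> str:
    """Render the address fields using the layout for `country_code`."""
    return "\n".join(line.format(**fields) for line in TEMPLATES[country_code])
```

Calling `format_address("IT", street="Via Eroi della Repubblica", postcode="89861", locality="Tropea", region="VV", country="Italy")` reproduces the three lines of the Italian example.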
See also the SO questions:
How many address fields would you use for a UK database?
Do you break up addresses into street / city / state / zip?
How do you deal with duplicate street suffixes?
Best practices for storing postal addresses in a database (RDBMS)?
Also check out tag 'postal-code'.
Edit: Reverse order of region and town - per UPU
Maybe this is useful:
https://gist.github.com/259744
For a project I collected a table of information about all countries of the world, including ISO codes, top level domain, phone code, car sign, length and regex of zip.
Country names and comments unfortunately only in German...
Unlike other answers here, I believe it's possible to have a structured address database.
Just out of the hat, I can think of the following structure:
Country
Region (State / Province)
Locality (City / Municipality)
Sub-Locality (County / other sub-division of a locality)
Street
But how to query it fast enough?
One way I always think it can be accomplished is to ask for the ZIP Code (or Postal Code) which varies from country to country, but is solid within the country.
This way you can structure your data around the information provided by the postal offices around the world.
Depends on how free-form you are prepared to go with the fields. One free-form address field will obviously always do, but be of relatively little help narrowing down geography.
The problem you'll have is that there is too much variation in the level of geographical hierarchy across countries. Heck, some countries do not even have 'street addresses' everywhere.
I recommend you don't try to make it too clever.
Len Silverston of Universal Data Model fame recommends a separate hierarchy of GEOGRAPHIC BOUNDARIES and depending on how much free-formed-ness you're willing to accept either simple STREET ADDRESS LINEs or per-country derivatives.
No, absolutely not. If you compare the way US and Japanese addresses work, you'll see that it's not possible.
UPDATE:
On second thought, anything can be done, but there's a trade-off.
One approach is to model the problem with address and address_attribute tables, with a 1:m relationship between them, anything can be modeled. The address_attribute table would have a pk, a name, a value, and an fk that points back to its address parent's pk. It's almost like using a Map with name, value pairs.
The trade-off is having to do a JOIN every time you want an address. You also have to interrogate the names of the address_attributes to figure out what you're dealing with each time.
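A small sketch of the address / address_attribute model, again using Python's built-in `sqlite3` so it runs as-is. Table and column names follow the description above; `load_address` is a hypothetical helper and the attribute names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (id INTEGER PRIMARY KEY);
    CREATE TABLE address_attribute (
        id         INTEGER PRIMARY KEY,
        address_id INTEGER NOT NULL REFERENCES address(id),
        name       TEXT NOT NULL,
        value      TEXT NOT NULL
    );
    INSERT INTO address (id) VALUES (1);
    INSERT INTO address_attribute (address_id, name, value) VALUES
        (1, 'street', 'Via Eroi della Repubblica'),
        (1, 'postcode', '89861'),
        (1, 'town', 'Tropea');
""")


def load_address(address_id: int) -> dict:
    """Reassemble one address from its attribute rows (the JOIN trade-off)."""
    rows = conn.execute(
        "SELECT attr.name, attr.value "
        "FROM address AS a "
        "JOIN address_attribute AS attr ON attr.address_id = a.id "
        "WHERE a.id = ?",
        (address_id,),
    )
    return dict(rows)  # name -> value, much like a Map
```

The caller still has to know which attribute names to expect for a given country, which is the second half of the trade-off.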
Another approach would be to do more comprehensive research on how addresses are modeled around the world. In an object-oriented world you might have the western Address class (street1/street2/city/state/zip) and others for Japan, China, as many as needed to tile the address space. Then you'd have a master Address table and child tables to the other types with a 1:1 relationship between them.
How does Amazon or eBay do it? They ship internationally. Do they have locale-specific UI features? I've only used the US locale.
No, there is no standard addressing scheme. It usually varies from country to country.
Even the Universal Postal Union said in Addressing the world, an address for everyone that there is none. The best solution for this is to use the 2/3-letter country code standards known as ISO 3166 and treat everything else by each country's standards.
However, if you really are desperate to use easily accessible tools for your project, you can try Google Place API.
Your design should depend strongly on your purpose. Some people have posted how to structure data. So if you simply want to send s-mail to someone, it will do. Things begin to get complicated if you want to use this data for navigation. Car navigation will require additional structures to contain traffic info (e.g. one-way roads), while foot navigation will require a lot of additional data. Here is a small example: in my city, my neighborhood is near the park. Next to the park is a former airfield (in fact, one of the oldest in Europe) turned into an aviation museum. Next to the aviation museum is a business park. The street number for the museum is 39, while the business park numbers start with 39A. So it may seem that 39 and 39A are close, but it takes about a mile to walk from one to the other (and even longer if going by car).
This is just a small example taken from my city, I think you can probably find a lot of exceptions (especially in rural or wilder parts of every country).

Which parts of an address should be required?

Say I am storing addresses in a DB table, in this fairly common break down:
address_street_line_1,
address_street_line_2,
address_city,
address_state,
address_zip,
address_country_id
(Note: I have read the questions on splitting down further, street type, house number, etc. and for this application I think it would unnecessarily complicate things.)
To work best with international users, which of these fields should NOT be required?
I'm thinking this:
address_street_line_1 REQUIRED
address_city REQUIRED
address_country_id REQUIRED
Should I require state or zip?
Thanks!
Xavier
You can probably only require one field: country.
But what you should really be doing is making the logic dependent on country. Take a look at Address Formats by Country for a comprehensive list. That isn't just about required fields either. It's also about correct formatting. A US address might be:
8031 Main Street
Springfield OH 12345
USA
whereas in Switzerland:
Bodenstr. 173
8043 Zürich
Schweiz
Note: the street numbers and post codes are in the "reverse" order for Switzerland (compared to what English speaking countries use).
Also, your data types need to be broad enough to cover data used in other countries. Zip/post code should absolutely not be a numeric type. For example, "EC2R 8AH" is a valid UK postcode.
That goes back to this principle: if you don't perform arithmetic on it, it's not a numeric type. It's text.
Also, try not to call it Zip Code to end users. That's a US-only term. Pretty much everywhere else it's called a Postcode, Post code or Postal Code. Also note that the UK postal codes are alphanumeric and include a space.
Not all countries even use postal codes, for example they were rarely used in New Zealand prior to 2006 or so. I think Ireland doesn't use them at all.
If you're truly international, city-states such as Singapore don't actually need a City field.
In the user interface, you can (and perhaps should) make the postcode required for countries where you already know it's required, since that isn't likely to change. And, if you make the UI dynamic enough, you can call it "Zip code" if the selected country is the United States, "Postal code" for Canada, "Postcode" for the UK, etc.
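A sketch of that dynamic labelling in Python. The labels are the ones named above; the set of countries where the postcode is treated as required is an assumption for illustration, and `postcode_field` is a hypothetical helper.

```python
# Label shown next to the postcode input, keyed by ISO 3166-1 alpha-2 code.
POSTCODE_LABELS = {"US": "Zip code", "CA": "Postal code", "GB": "Postcode"}

# Countries where this sketch treats the postcode as mandatory (an assumption).
POSTCODE_REQUIRED = {"US", "CA", "GB"}


def postcode_field(country_code: str) -> dict:
    """Describe the postcode input for the selected country."""
    return {
        "label": POSTCODE_LABELS.get(country_code, "Postal code"),
        "required": country_code in POSTCODE_REQUIRED,
    }
```

Any country not listed falls back to an optional field labelled "Postal code", so an unknown country never blocks the user.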
How about making none required? If the user wants to be contacted they'll enter enough information. Or, enter a single text field and let them enter free form information. They know better than you what fields are required for postal deliveries to make it to their door.
I would say everything except street_line_2 and state. And think of 'zip' as a postal code rather than a zip code: as you can tell from the variety of formats across countries, this should have a pretty open format.
Even in the U.S., most of the address is not required. A large fraction of U.S. zip codes are allocated to various businesses and organizations - any mail to one of those zips will be delivered the same regardless of the rest of the address. For instance:
General Electric
Schenectady, NY 12345
Internal Revenue Service
Ogden, UT 84201-0027
The city and state are nice, but the mail will probably get delivered without.
The best way that I have found to solve this problem is by abstracting the logic in your application layer, and not the persistence layer. One of the cleanest/simplest ways I've seen this done is by passing the user's data in a value object (creating a common interface that's easy to validate against) to a validator with the current country code, which makes sure all the required attributes are set properly in the value object for that locale. Assuming it passes validation, pass the value object along to the persistence side of your application for storage.
The key here is the value object - you're creating a common interface that multiple pieces of your application can talk to, validate, and read/write from. You can then also use that same value object when displaying the address: have your persistence layer get the information, put it in the value object, pass it to a factory with the current locale which returns the desired address format, and send that output to the front end.
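A minimal sketch of the value object plus locale-aware validator, in Python. `Address`, `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the per-country rules are illustrative rather than authoritative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Address:
    """Value object passed between UI, validator, and persistence."""
    line1: Optional[str] = None
    line2: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    postcode: Optional[str] = None
    country: Optional[str] = None


# Hypothetical per-country rules; fall back to requiring only the country.
REQUIRED_FIELDS = {
    "US": ("line1", "city", "state", "postcode", "country"),
    "NZ": ("line1", "city", "country"),
}


def missing_fields(address: Address, country_code: str) -> list:
    """Return the names of required fields that are empty for this locale."""
    required = REQUIRED_FIELDS.get(country_code, ("country",))
    return [name for name in required if not getattr(address, name)]
```

An empty list means the address can be handed to the persistence layer. The table schema stays permissive, and the rules live in one place in the application.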
There are no states in New Zealand, so it should definitely be optional. So I think you have the right answer in your question.
If you are not going to do any specific lookup, like searching by postal code or by city, I'd say combine the whole address into a single field. This way you will support the different addresses from different countries.
You will also support address oddities.
If you fear that the requirements are going to change, you could store the address as an XML field. Modern databases like SQL Server 2005 and 2008 can have an index on an XML node inside an XML column as long as you are using a schema.
It all comes down to requirements. If the client needs to group the data inside a grid by country, then you need a country column.
Making fields required is always a tradeoff. If the person doesn't want to fill in the info then they won't -- they'll put in a period, or garbage to get past the "required field" nanny.
I only require street_address_1 in my apps. Also, for the US and many countries, you can buy the mapping between the postal/zip code and the canonical city/state. It's not expensive. (The mapping between individual street addresses and zip is much more expensive.)
For the US, see http://www.usps.com/ncsc/addressinfo/citystate.htm
If you're including an Ajax web interface, ask for the country first, then the post code. If in the US, then use Ajax to fetch and fill in the city/state for the user from the zip.
Some non-US countries, eg UK, can have 3 lines of street addresses if you're asking people to fill in their "preferred address" Eg:
Mirassou
High Street
Old Town
City, Bucks postal_code
(You can register a building's name with the post office as an alternative to its street number.)
Larry
Actually, city isn't even required in the US.
Many people have rural addresses on state and county roads. Publication 28 at the postal service web site has details. Different companies end up using the "city" field to store other information. This also applies to military base addresses.
Publication 28 link

Calculating person's time zone (GMT offset) based on phone number?

I've gotten a request to show a person's local time based on their phone number. I know our local GMT offset, so I can handle USA phones via a database table we have that links US zip_code to GMT offset (e.g. -5). But I've no clue how to convert non-US phone numbers or country names (these people obviously don't have a zip code).
If you care, my employer college wants to solicit our alumni for donations and do it during reasonable hours.
Sorry to all that I didn't clearly state that I was considering HOME phone numbers. So roaming isn't an issue. I'm looking for some reference table or Oracle application I can source this info from.
Florida has two time zones, but many countries only have one. You need this table: http://en.wikipedia.org/wiki/List_of_country_calling_codes . Parse the country code out of the phone number by looking for the 1 and the area code for NANPA countries (those countries using the same 1+area code allocation as the USA), 7 for Russia or Kazakhstan. If that doesn't match check to see whether the number starts with one of the 2 digit calling prefixes, and then the 3-digit ones.
Remember that the first few digits of the number may be the international dialing prefix, and are not properly part of the telephone number.
For countries that span more than one time zone, see if you can get allocation information from their national telecom regulator. For the USA and other NANPA countries, check out http://www.nanpa.com/ .
Of course your results will be far from perfect, but hopefully you will wake fewer customers from their night's sleep.
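The prefix lookup described above can be sketched in Python as follows. `CALLING_CODES` is a tiny illustrative subset of the real table and `split_calling_code` is a hypothetical helper. It assumes the international dialing prefix has already been replaced by "+".

```python
# Tiny illustrative subset of the country calling code table.
CALLING_CODES = {
    "1": "NANP (USA, Canada, ...)",
    "7": "Russia / Kazakhstan",
    "44": "United Kingdom",
    "47": "Norway",
    "356": "Malta",
}


def split_calling_code(number: str):
    """Split an international number into (calling code, national number).

    Expects the number with the international dialing prefix already
    replaced by '+', e.g. '+47 22 12 34 56'. Returns None if no code matches.
    """
    digits = "".join(ch for ch in number if ch.isdigit())
    for length in (1, 2, 3):  # 1-digit codes first, then 2, then 3
        code = digits[:length]
        if code in CALLING_CODES:
            return code, digits[length:]
    return None
```

For code "1" the next three digits are the area code, which is what has to be mapped to a time zone for NANPA countries.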
Local time is one thing but, if you have worldwide customers, there are also local habits to take into account.
People typically go to bed earlier in Norway than in Spain (and they are in the same time zone).
You might be able to get the phone company to feed you location data (this info should exist for land lines and must exist for cells) but expect to pay.
Some nations are easy, since they are in a single time zone. Look at Europe and add millions of people by just using the international dialing code: +47 for Norway etc.
Phone-number allocations are usually done by a national telecom authority, so you could probably get the information for free.
As you already know, this would only take into account the default time zone, since they might be anywhere on the planet at the time. Also, number allocation might not distinguish at all between time zones, so the approach is buggy but potentially useful for providing default settings.
Regards
Look in the phone book. Ours has quite a few pages mapping area codes onto countries/provinces/states. Then you have to map geographical locations onto time zones, but that is pretty straightforward.
Impossible. If I drive about 400 miles east (west coast of the US) then I'll break your algorithm by having a XXX number in a YYY timezone.
Now if this is a cell phone app, it does seem possible with something called NITZ.
I think Danie, Bortzmeyer, and others are overthinking the problem. It's not to maximize the calling window, it's to find an acceptable time.
Let's take the US and consider only the 4 major time zones. Say we define acceptable as from 10AM - 7PM. I doubt even the Norwegian Bachelor farmers go to bed before 7PM.
So if you know that the phone is in the US, don't make any call before 1PM Eastern. That way, whether they are in NYC or LA, it's still after 10AM. And no calls after 7PM Eastern. Who cares if it's mainland Florida or its panhandle, an hour behind? Dallas or El Paso: same state, but different time zones. For the US, filter for AK and HI. The only seriously difficult country is Russia -- lots-o-timezones.
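The arithmetic behind that window can be checked with a few lines of Python. The offsets are the standard-time UTC offsets of the four major US zones (daylight saving is ignored, which does not change the result as long as all zones shift together), and `shared_window` is a hypothetical helper.

```python
# Standard-time UTC offsets for the four major US zones (DST ignored here).
UTC_OFFSETS = {"Eastern": -5, "Central": -6, "Mountain": -7, "Pacific": -8}


def shared_window(local_start: int, local_end: int, caller_zone: str):
    """Hours (in the caller's zone) that are acceptable in every zone."""
    # Convert each zone's local window to UTC, then intersect the windows.
    start_utc = max(local_start - off for off in UTC_OFFSETS.values())
    end_utc = min(local_end - off for off in UTC_OFFSETS.values())
    if start_utc >= end_utc:
        return None  # no hour works everywhere
    caller_off = UTC_OFFSETS[caller_zone]
    return start_utc + caller_off, end_utc + caller_off
```

For a 10AM to 7PM local window (hours 10 and 19), a caller in the Eastern zone gets the hours 13 to 19, i.e. 1PM to 7PM.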