Address form fields for a Japanese address - ruby-on-rails-3

I am building a small application for an English-speaking client in Japan. As part of the app, users need to be able to enter their address. Unfortunately, I can't find any reference for how addresses are usually handled in an online form.
I know that there are different combinations of wards/prefectures/cities; do these all usually have their own field in a database? Is it standard for all of that to go into a general "city" type of field? What's the standard UI for this sort of thing?

The Universal Postal Union has compiled info on address formats in different countries. See also an unofficial guide to postal addresses.
But as a rule, internationalization of software typically means that for postal addresses, you avoid imposing any specific format. Instead, you would use a free-form text input area, of sufficient size. There are often many forms of addresses used in a country (and Japan is no exception), and normally you need not enforce any specific format – instead, you expect people to know their own address and how to enter it so that postal services can understand it.

It depends on what you have to do with the address.
If you have to:
check for obligatory fields,
validate fields, or
query for city, prefecture, postal code, etc.,
then you should use separate fields. UI: a form with text inputs (and maybe even menus).
Do not use more fields than necessary: if you have none of the needs above, just use a single text field (UI: textarea).

The first part of a Japanese address is easy: Todofuken will either be 2 or 3 characters, followed by either "都","道","府" or "県". Where it gets tricky is the rest of the address since smaller areas don't always divide their cities neatly.
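As a rough illustration of the prefecture rule above, here is a minimal Python sketch (the question is about Rails, but the regular expression carries over to Ruby almost unchanged). The function name split_prefecture is made up for this example, and it assumes the address really does start with the prefecture; an address that omits it can be split wrongly.

```python
import re

# Two or three characters followed by one of the four prefecture suffixes.
# The lazy quantifier matters: for "東京都府中市" a greedy match would take
# "東京都府" because "府" is also a valid suffix character.
PREFECTURE_RE = re.compile(r"^(.{2,3}?[都道府県])(.*)$")


def split_prefecture(address):
    """Return (prefecture, rest), or (None, address) if no prefecture is found.

    Assumption: the address begins with its prefecture.
    """
    match = PREFECTURE_RE.match(address.strip())
    if match is None:
        return None, address
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for sample in ["東京都府中市宮西町2-24", "北海道札幌市中央区", "神奈川県横浜市中区"]:
        print(split_prefecture(sample))
```

Everything after the prefecture is left as one string here, which matches the point above that the rest of the address does not divide neatly.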
What I've seen to make this easier is using the postal code to fill in the address. The bad news is that I haven't seen any of this in Ruby, but I have seen it in other languages, so hopefully this will help.
This site is only in Japanese, but maybe you can download the code and check it out:
http://www.kawa.net/works/ajax/ajaxzip2/ajaxzip2.html
There's also this add-in for Excel that converts addresses. The code may be helpful to you:
http://office.microsoft.com/ja-jp/excel-help/HP010077514.aspx
Hope this helps.
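To make the postal code idea concrete, here is a small Python sketch of the lookup step. It is not taken from either of the linked projects. POSTAL_TABLE, normalize_postal_code and lookup_address are hypothetical names, and the single row is sample data; a real application would load a complete postal code data set into a database table.

```python
import re
import unicodedata

# Hypothetical sample row: 7-digit code -> (prefecture, city/ward, town).
POSTAL_TABLE = {
    "1000001": ("東京都", "千代田区", "千代田"),
}


def normalize_postal_code(raw):
    """Reduce input such as '〒100-0001' or '１００－０００１' to seven ASCII digits.

    Returns None when the input does not contain exactly seven digits.
    """
    # NFKC folds full-width digits and hyphens to their ASCII forms.
    folded = unicodedata.normalize("NFKC", raw)
    digits = re.sub(r"[^0-9]", "", folded)
    return digits if len(digits) == 7 else None


def lookup_address(raw):
    """Return the address parts for a postal code, or None if unknown."""
    code = normalize_postal_code(raw)
    return POSTAL_TABLE.get(code) if code else None


if __name__ == "__main__":
    print(lookup_address("〒100-0001"))
```

The form can then pre-fill the prefecture and city fields from the result and leave only the block, building and room details for the user to type.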

Related

Custom model in Apache Open NLP

I am currently working with custom models which I am training for my own use case. My use case is to classify emails according to whether they are address change requests. If the address change request can be understood from a single sentence, it works without issues. But if the request has to be understood from multiple sentences, it does not work.
A few examples are given below:
Example 1 :- THIS IS WORKING
1.
a)training file :-
Guys I wish to <START:contactupdate> change my address <END> .
My new address is 68 Dorset Road, Coventry, West Midlands, CV1 4ED.
Please confirm once you are done.
Thanks.
b)Testing model with the below sentence :-
String input = "Guys I wish to change my address.My new address is 68 Dorset Road, Coventry, West Midlands, CV1 4ED.Please confirm once you are done. Thanks."; //Working
EXAMPLE 2 :- This is not working.
Let's say the address change request can only be deduced from multiple lines.
"My old address is no longer valid. Need to update it."
How do I train my model in this scenario? How do I specify the custom tags for the above?
Can you please help? I am stuck.
Many Thanks
What do you mean by "not working"? That the thing you want to retrieve is not retrieved? Or that the training crashes somewhere when the tags are spread out over multiple lines?
In general, the model you train in this procedure (MaxEnt by default) tries to detect common features of whatever you are training for. Typically these are named entities such as persons, organisations and locations, and in many languages these carry typical features (the prefix Mr./Mrs., the suffix Corp., the morpheme "street", respectively). The model can pick these up and apply them to new data, which leads to recognition of whatever it is you want to recognise.
What you are trying to do, however, is already fairly advanced NLP. The longer the phrase, the larger the possible variation, so it becomes harder to pick up commonalities. For your use case, people typically use parsing (constituency or dependency parsing) or other tools more sophisticated than this relatively flat pattern recognition, so you may want to look into those instead.
I don't know how much data you have from which to infer the different ways of expressing the desire to change an address in a customer database. If it is a reasonable amount (i.e. not just a couple of sentences), you may want to annotate it manually, parse the corpus, and apply machine learning to the parse trees/graphs of the sentences of interest. As mentioned, this is quite advanced NLP in my opinion, and not something with an out-of-the-box solution.
If I understand your question correctly, you are trying to categorize emails to find out whether each one is an address change request. But your model example looks like one for named entity recognition. In my opinion, it might be better to use the "Document Categorizer" feature of Apache OpenNLP.
You can provide different sample sentences for each category, for example "Address_change", "general_inquiry" and so on. This way you can add as many samples as you want, with many variations of sentences. Here is an easy and basic tutorial for document categorization training and usage.
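For the categorizer suggestion above, the training file is plain text with one document per line: the category name first, then whitespace, then the text. The Python sketch below only prepares such a file from labelled emails; it does not call OpenNLP itself. The sample emails and the helper name to_doccat_line are made up for illustration.

```python
def to_doccat_line(category, text):
    """Format one sample as '<category> <text>' on a single line."""
    if not category or any(ch.isspace() for ch in category):
        raise ValueError("category must be a single token without whitespace")
    # Collapse newlines and repeated spaces so a multi-sentence email
    # stays on one line and is treated as one document.
    return f"{category} {' '.join(text.split())}"


# Hypothetical labelled samples; real training needs many more per category.
SAMPLES = [
    ("Address_change", "My old address is no longer valid.\nNeed to update it."),
    ("Address_change", "Guys I wish to change my address. Please confirm once done."),
    ("general_inquiry", "Could you tell me your opening hours?"),
]

training_data = "\n".join(to_doccat_line(c, t) for c, t in SAMPLES)

if __name__ == "__main__":
    print(training_data)
```

Because the whole email is one document, the two-sentence request from Example 2 needs no inline tags at all; only the category label changes.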

Is there a "standard" set of telephone numbers to use for testing validation code?

I have seen many different questions and discussions about validation of telephone numbers in various formats and from various countries. Can anyone point me to a comprehensive set of test data that would exercise such validation code?
I'm not looking to rehash the whole discussion on how to validate a phone number. That has been covered elsewhere and in sufficient detail. This is more a case of looking for the input to pass to the "black box" code that does phone number validation.
Is there a generally accepted set of data in many different formats from various countries that is considered "comprehensive"? Are there sets of country specific formats available anywhere?
From my QA experience, I do not think you will find an international set of telephone numbers, because the phone number format depends on the region, the type (landline or cell/mobile), whether the number is internal or international, and so on.
So I would choose the number formats for validation according to the region you are targeting.
The second point is the functional spec of the application being developed, which usually describes in detail what types and formats of phone number are accepted.
The point is that applications differ from one another just as phone number formats do.
Hope this helps somewhat.
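Following the advice to pick formats per region, a small table-driven test set might look like the Python sketch below. The validator is only a stand-in for the "black box" and is deliberately simple; its pattern and the names NANP_RE, is_valid_nanp and run_cases are assumptions for this example. The sample numbers use the North American 555-0100 to 555-0199 range, which is reserved for fictional use.

```python
import re

# Assumption: a simplified North American pattern, used only as the
# stand-in validator under test. Real validators have more rules.
NANP_RE = re.compile(
    r"^(?:\+?1[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}$"
)


def is_valid_nanp(number):
    return NANP_RE.match(number.strip()) is not None


# (input, expected result) pairs for one region.
CASES = [
    ("202-555-0123", True),
    ("(202) 555-0123", True),
    ("+1 202 555 0123", True),
    ("2025550123", True),
    ("555-0123", False),        # missing area code
    ("202-555-012", False),     # too short
    ("202-555-01234", False),   # too long
    ("", False),
]


def run_cases(validator, cases):
    """Return the cases where the validator disagrees with the expectation."""
    return [(number, expected) for number, expected in cases
            if validator(number) != expected]


if __name__ == "__main__":
    print(run_cases(is_valid_nanp, CASES))
```

A second region gets its own table and its own validator, taken from the functional spec, rather than one list that tries to cover every country.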

OOP: Is it going too far to create a phone number object, or an address object?

Many things can have phone numbers and addresses. . . people, places, etc. You want phone numbers and addresses to have the same functionality, format and validation whether it is a phone number or address for a person or a place etc.
Is it going too far to create a phone number class and an address class, and use them in those objects that have phone numbers and addresses?
My question also applies to other properties that could be reusable across diverse objects.
Yes, you can go too far and this is borderline. I tend to draw the line at the point where it becomes cumbersome to treat things as more than a string, or another already defined class/type.
If you need to somehow manipulate phone numbers (by, for example, separating them into area code and other bits) or addresses (number, street, city, country and so forth) then, yes, consider making them objects.
I rarely do anything with phone numbers or addresses other than store and display them, in which case they're fine as strings without having to have their own dedicated class. For addresses, I don't even impose a separation based on parts (except maybe the zipcode), preferring free-format entry so as to not annoy those with addresses of a format I don't know about.
Going the reductio ad absurdum route, you could also objectify the characters that make up your phone number but that would be silly.
I think it would be perfectly acceptable. A well-designed class will allow you to reuse it in many different projects. If you have many projects that could use this sort of functionality, an object is a good way to make your code reusable and portable. A class can be extended to handle anything phone number or address related, which is hard to match with a set of functions or one-off code you rewrite over and over.
In the end it's your call. Personally, I think it would fall under good practice.
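For what it's worth, here is a minimal Python sketch of what a shared phone number class could look like when the number does need to be taken apart. The 10-digit rule, the class names and the display format are assumptions chosen for the example, not a recommendation for any particular country.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PhoneNumber:
    """Immutable value object; two numbers are equal if their digits match."""
    digits: str

    @classmethod
    def parse(cls, raw):
        # Assumption: exactly 10 digits once punctuation is removed.
        digits = re.sub(r"[^0-9]", "", raw)
        if len(digits) != 10:
            raise ValueError(f"expected 10 digits, got {len(digits)}")
        return cls(digits)

    @property
    def area_code(self):
        return self.digits[:3]

    def formatted(self):
        return f"({self.digits[:3]}) {self.digits[3:6]}-{self.digits[6:]}"


# Two unrelated owners reuse the same validation and formatting.
@dataclass
class Person:
    name: str
    phone: PhoneNumber


@dataclass
class Place:
    label: str
    phone: PhoneNumber


if __name__ == "__main__":
    ann = Person("Ann", PhoneNumber.parse("202-555-0123"))
    print(ann.phone.area_code, ann.phone.formatted())
```

If all you ever do is store and display the number, the plain string from the earlier answer stays simpler than this.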
You need an Entity class and an Address class.
An Entity can be a person, place, organisation, coffee shop and the like, whereas an Address can capture things like phone number, email ID and lat/long.
Keeping Entity and Address separate will help you across diverse objects.
A many-to-many relationship between Entity and Address would also help; loose coupling pays off in the long run.

Source of data for "official" country/region list

Lately we've started getting issues with outdated countries / regions list being presented to users of our web-application.
We currently have a few DB tables to store localized country names along with their regions (states). However as the planet goes, that list is in constant evolution and it's proving to be a pain to maintain as some regions are deleted, some merged - existing data needs to be updated all the time.
What are, if any exist, the best practices when it comes to dealing with multi-locale country/region lists?
Is there a place or a standard in place? I know of ISO 3166, but their list isn't exactly DB friendly ... plus it's not fully localized.
An ideal solution would simply allow us to "sync" to it, preferably in multiple languages. The solution would preferably be free or subscription-based, with a history of what changed so we could update our data (aka tblAddress).
Thanks!
geonames is pretty accurate in this respect, and they update regularly.
http://www.geonames.org/export/
There is no such thing. This is a political issue, which you can only solve in the context of your own application. Deciding to use ISO 3166 could be the easiest to defend. I know of issues with at least:
China/Taiwan
Israel/Palestine
China/Tibet
Greece/Macedonia
The ISO lists here are DB friendly, though they only include short names and codes.
This one looks very good: Multiple languages, update option, database independent file format for import, countries/regions/cities information, and some other features you might use or not.
And it's quite affordable if you need it for only one server.
You can try CLDR
http://cldr.unicode.org/
This set of data is maintained by the Unicode organization. It is updated regularly and the data is versioned so it is easy for you to manage the state of your list.
Hi! You can find a free dump of all countries with their respective continents at https://gist.github.com/kamermans/1441495. It is very easy to use: just download the dump and load it into your database.
Well, wait, do you just want an up-to-date list of countries? Or do you need to know that Country X has split into Country Y and Country Z? Because I'm not aware of any automated way to get the latter. Even updates to the ISO databases are distributed as PDFs (you're on your own for implementing the change)
The EU maintains data about Local Administrative Units (LAUs) which can be downloaded as hierarchical XLS files in several languages.
United Nations Statistics Division, Standard country or area codes for statistical use (M49).
Look for "Search and Download: Full View" on page left. That leads here.
Groups countries by continent, sub-continental region, Least Developed Countries, and so on.
If you cannot import the Excel version, note that the CSV has unquoted fields and a comma in one country name that will break your import ("Bonaire, Sint Eustatius and Saba"). Perhaps open it first in LibreOffice or similar, fix the broken country name and shift its remaining right-hand columns back into place. Then set all cells to type Text, save as CSV with [Edit Filter Settings] checked [x] in the Save As dialog, and make sure the string delimiter is set to the double quote ("), as it should be by default.
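The quoting problem described above is easy to reproduce. This Python sketch uses a simplified three-column row (code, name, alpha-3 code); the column layout is an assumption for illustration and not the real M49 file layout.

```python
import csv
import io

# Unquoted: the comma inside the country name creates an extra column.
unquoted_line = "535,Bonaire, Sint Eustatius and Saba,BES"
naive = unquoted_line.split(",")

# Quoted: the csv module keeps the name together as one field.
quoted_line = '535,"Bonaire, Sint Eustatius and Saba",BES'
row = next(csv.reader(io.StringIO(quoted_line)))

# Writing with QUOTE_ALL produces a file that re-imports safely.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(row)

if __name__ == "__main__":
    print(len(naive), naive)
    print(len(row), row)
    print(buffer.getvalue().strip())
```

Saving from LibreOffice with the double quote as string delimiter gives the same kind of fully quoted output as the last step.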

Extract Address Information from a Web Page

I need to take a web page and extract the address information from the page. Some are easier than others. I'm looking for a firefox plugin, windows app, or VB.NET code that will help me get this done.
Ideally I would like to have a web page on our admin (ASP.NET/VB.NET) where you enter a URL and it scrapes the page and returns a Dataset that I can put in a Grid.
If you know the format of the page (for instance, if they're all like that ashnha.com page) then it's fairly easy to write VB.NET code that does this:
Create a System.Net.WebRequest and read the response into a string.
Then create a System.Text.RegularExpressions.Regex and iterate over the collection of Matches between that and the string you just retrieved. For each match, create a new row in a DataTable.
The tough bit is writing the regex, which is a bit of a black art. See regexlib.com for loads of tools, books etc about regexes.
If the HTML format isn't well-defined enough for a regex, then you're probably going to have to rely on some amount of user intervention in order to identify which bits are the addresses...
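The fetch-then-regex steps above can be sketched as follows. This is Python rather than VB.NET, purely to show the shape of the approach. The HTML layout in the comment, the pattern ADDRESS_RE and the function names are assumptions; a real page needs its own pattern.

```python
import re
from urllib.request import urlopen

# Assumption: every address on the page looks like
#   <p class="address">street<br>city, ST 12345</p>
ADDRESS_RE = re.compile(
    r'<p class="address">\s*(?P<street>[^<]+?)\s*<br\s*/?>\s*'
    r'(?P<city>[^,<]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})\s*</p>'
)


def fetch(url):
    """Read the response body into a string."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_addresses(html):
    """One dict per match, the equivalent of one DataTable row each."""
    return [match.groupdict() for match in ADDRESS_RE.finditer(html)]


if __name__ == "__main__":
    sample = (
        '<p class="address">12 Main St<br>Springfield, IL 62701</p>'
        '<p class="address"> 9 Oak Ave <br/> Dover, DE 19901 </p>'
    )
    for row in extract_addresses(sample):
        print(row)
```

In practice you would pass fetch(url) into extract_addresses, and the list of dicts maps directly onto rows for a grid.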
What type of address information are you referring to?
There are a couple of Firefox plugins, Operator and Tails, that allow you to extract and view microformats from web pages.
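If the pages do carry address microformats, the same extraction can be done in code. The sketch below uses Python's standard html.parser and looks for the "adr" class and its usual property classes. It is a simplified reading of the markup (it assumes properties are not nested inside each other), and the class name AdrParser is made up for this example.

```python
from html.parser import HTMLParser

ADR_FIELDS = {"street-address", "locality", "region", "postal-code", "country-name"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class AdrParser(HTMLParser):
    """Collect one dict per element with class="adr"."""

    def __init__(self):
        super().__init__()
        self.addresses = []
        self._current = None   # dict being filled while inside an adr element
        self._field = None     # property class of the element being read
        self._depth = 0        # nesting depth inside the adr element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._current is None:
            if "adr" in classes:
                self._current, self._depth = {}, 1
            return
        self._depth += 1
        matched = ADR_FIELDS.intersection(classes)
        if matched:
            self._field = sorted(matched)[0]

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        self._field = None
        if self._depth == 0:
            self.addresses.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._field and data.strip():
            previous = self._current.get(self._field, "")
            self._current[self._field] = (previous + " " + data.strip()).strip()


def extract_adr(html):
    parser = AdrParser()
    parser.feed(html)
    parser.close()
    return parser.addresses
```

Pages without microformat markup return an empty list, which is the cue to fall back to one of the other approaches.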
Aza Raskin has talked about recognising when selected text is an address in his Firefox Proposal: A Better New Tab Screen. No code yet, but I mention it as there may be code in Firefox to do this in the future.
Alternatively, you could look at using the map command in Ubiquity, although you'd have to select the addresses yourself.
For general HTML screen scraping in VB.NET, check out HTML Agility Pack. Much easier than trying to Regex it (unless you happen to be a Regex ninja already!)
The page you mentioned in your answer would be easy to automate, as the addresses are in a consistent format.
But allowing users to point to any page is a much harder job. The data could be in any format at all. You could write something to dump all the text, guess how it is divided, try to recognise parts such as country and state names, telephone numbers etc., and then show your results in an interface that lets users complete missing sections, move the dividers, and identify the bits you missed or they didn't want.
It's not simple though, and making an interface that provides a big advantage over simply cutting and pasting into validated form fields would be quite an achievement I think - I'd be interested to know how you get on!
EDIT: Just noticed this other question that might cover quite a bit of what you want to do:
Parse usable Street Address, City, State, Zip from a string