Given a zipcode, how do I find the lat-long range that falls within the area of that zipcode? - latitude-longitude

So if I have the zipcode of, say, Manhattan, how do I find the lat-long range for that zipcode? Is there a free database I can use?

Check out http://www.boutell.com/zipcodes/.

Related

How to get Lucene scoring to account for words not specified in search terms?

There is probably a name for what I'm asking and it has something to do with Bayesian statistics.
I have a database of street addresses and I'm using Lucene to match user-entered addresses (if you need an analogy, pretend I work for Google Maps).
Given that both "West North Avenue" and "West North Shore Avenue" are valid street names, how can I get Lucene to score "2000 West North Avenue" higher than "1000 West North Shore Avenue" when searching for "1000^0.001 West North Avenue"?
The 1000^0.001 means the number should only be used to break a tie; otherwise, matching the street name is more important than matching the right number to the wrong street.
Unfortunately in this example, the 1000^0.001 causes the wrong match (North Shore) to get ahead of the correct one.
What scoring algorithm would enable Lucene to adjust the score downwards for failure to specify an indexed term in the search, with rare terms weighing more than common terms?
I would solve this by carefully tokenizing street names. For instance, you could do this:
Extract the number and the street name into two different fields, street_nb and street_nm, and index them separately.
Now use two clauses in your query: make the clause targeting street_nm a MUST and the one targeting street_nb a SHOULD. That way the street name has to match on its own, and if the number matches too, even better.
You can do other things besides this, like using a phrase query to force an exact match on the street name. Play around with the variants until you get good results.
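For illustration, here is a minimal sketch of that two-clause query using Lucene's classic (pre-5.x) BooleanQuery API; the field names street_nm and street_nb follow the split suggested above, and the lowercased terms assume a standard analyzer:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// The street name must match on its own...
BooleanQuery query = new BooleanQuery();
PhraseQuery name = new PhraseQuery();
name.add(new Term("street_nm", "west"));
name.add(new Term("street_nm", "north"));
name.add(new Term("street_nm", "avenue"));
query.add(name, BooleanClause.Occur.MUST);

// ...while a matching house number only nudges the score (tie-breaker).
TermQuery number = new TermQuery(new Term("street_nb", "1000"));
number.setBoost(0.001f);
query.add(number, BooleanClause.Occur.SHOULD);

The tiny boost on the SHOULD clause keeps the house number as a tie-breaker only, which is what the 1000^0.001 in the question is trying to achieve.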

How to use http://www.census.gov API to pull data

I am trying to query data from http://www.census.gov using their API.
I want to get the population of a particular city in the US, by using the city name and the US state code.
Given that I already have the key, what other parameters do I add to the URL below so that I can get the population?
http://api.census.gov/data/2010/sf1?key=<my key>
Any assistance will be greatly appreciated.
Judging from your query URI, you wish to access population data from the 2010 Census Summary File. You would add the GET parameters get and for to your query. Example:
http://api.census.gov/data/2010/sf1?key=b48301d897146e8f8efd9bef3c6eb1fcb864cf&get=P0010001&for=state:06
Population tables, as requested in the get parameter, are identified with a "P", and you can use the for parameter to narrow down your scope further. Examples of valid criteria formatted as URIs can be found here...
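If it helps, here is a minimal Java sketch (plain HttpURLConnection, no extra libraries) that requests the example above; YOUR_KEY is a placeholder and state:06 is California:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CensusFetch {
    public static void main(String[] args) throws Exception {
        String uri = "http://api.census.gov/data/2010/sf1"
                + "?key=YOUR_KEY"          // your API key
                + "&get=P0010001"          // total population table
                + "&for=state:06";         // 06 = California
        HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);      // JSON array of arrays; the first row is the header
        }
        in.close();
    }
}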
EDIT: It seems that for a finer-grained search, such as cities, you're going to need to use the government's cumbersome FIPS (Federal Information Processing Standard) codes (after converting lat/lon regions to their coding system)... I've found this resource that should be helpful, specifically points 5 through 7, but it seems fairly complex...
Another alternative I found is the USA Today census API. It seems that they mirror the Census Bureau's data, and they do have endpoints with city-level granularity... Check it out here...
No need to use the API; the data is available as CSV here: http://www.census.gov/popest/data/cities/totals/2012/SUB-EST2012.html

Column type and size for international country subdivisions (states, provinces, territories etc)

I apologize if this is a duplication.
What column standardization would you use for storing international country subdivision data?
For example, if it were just the US and Canada, I believe all subdivisions have a 2-character abbreviation... which might lend itself to a CHAR(2).
That cannot possibly be sustainable internationally, unless we presume there are at most 1296 (two characters drawn from A-Z, 0-9) subdivisions worldwide.
I've been unsuccessful locating an ISO list of these or even an indication of how to store them.
That's fine, I don't need to know them all now but I would like to know that there is a standard and what standard info to store as needed.
Thanks
EDIT: It appears that I can accomplish this using the ISO 3166-2 standard:
http://en.wikipedia.org/wiki/ISO_3166-2
Browsable as a dataset here:
http://www.commondatahub.com/live/geography/state_province_region/iso_3166_2_state_codes
As far as I know there is no international standard, because it's a national issue.
Take the UK...
Are the subdivisions Wales, Scotland, England, and Northern Ireland? They have no abbreviations.
Counties: is it "Cheshire" ("Ches.") or "Highlands and Islands" (no abbreviation)?
Postal areas: Rutland is still a postal county but not an official one.
Your question arguably assumes a federal structure (as in Switzerland, where I am), but this won't apply to many, if not most, countries. Staying with Switzerland, the Kanton does not feature in postal addresses or post codes either.
If there is an ISO standard, then national or local pride will annoy punters as soon as it's on your web site.
Personally, I dislike wading through a "state" dropdown on a web site. It has no meaning for me in either UK (my nationality) or my residence (Switzerland).
You may be best off sticking to states for the US and Canada plus a "non US/Canada" option. Don't force or assume a subdivision.
Edit, Jun 2012.
I now live in Malta. I have neither state, county, nor Kanton. Please don't insist.
Big cities in the UK don't normally mention the county (England and Wales) or region (Scotland).
Just for example:
Llanfairpwllgwyngyllgogerychwyrndrobwyll-llantysiliogogogoch is the name of a town in North Wales.
VARCHAR(100)
Abbreviations:
There are 2-letter and 3-letter country codes used by the UN. You can use VARCHAR(2) for the 2-letter code and VARCHAR(3) for the 3-letter code.
E.g. Australia's 2-letter, 3-letter and numeric codes:
AU AUS 036 Australia
It all depends on how you want to store the data. If you want to store the 3-letter and numeric codes in one column then the size will follow from that; if you want to store them separately then the sizes will be different.
To be on the safe side you can use VARCHAR(10).
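As a side note, if you are on the JVM you can sanity-check whatever you store against java.util.Locale, which already exposes the ISO alpha-2 and alpha-3 country codes; a small sketch:

import java.util.Locale;

public class CountryCodes {
    public static void main(String[] args) {
        Locale australia = new Locale("", "AU");          // country-only locale
        System.out.println(australia.getCountry());       // AU  (alpha-2)
        System.out.println(australia.getISO3Country());   // AUS (alpha-3)
        System.out.println(australia.getDisplayCountry(Locale.ENGLISH)); // Australia
    }
}

Subdivision codes from ISO 3166-2 are longer (country code, hyphen, up to three more characters, e.g. AU-NSW), so a CHAR(6) or a small VARCHAR covers those.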

Algorithm for almost similar values search

I have a Persons table in SQL Server 2008.
My goal is to find Persons who have almost similar addresses.
The address is described with columns state, town, street, house, apartment, postcode and phone.
Due to some specific differences in some states (not the US) and the human factor (mistakes in addresses, etc.), addresses are not filled in a consistent pattern.
The most common mistakes in addresses:
Case sensitivity
Someone wrote "apt.", another one "apartment" or "ap." (although addresses aren't written in English)
Spaces, dots, commas
Differences in writing street names, like "Dr. Jones str." or "Doctor Jones street" or "D. Jon. st." or "Dr Jones st", etc.
The main problem is that data isn't in the same pattern, so it's really difficult to find similar addresses.
Is there any algorithm for this kind of issue?
Thanks in advance.
UPDATE
As I mentioned, the address is separated into different columns. Should I generate a string concatenating the columns, or do your steps for each column?
I assume I shouldn't concatenate the columns, but if I compare columns separately, how should I organize it? Should I find similarities for each column and union them, intersect them, or something else?
Should I collect some statistics or use some kind of learning algorithm?
I suggest approaching it thus:
Create word-level n-grams (a trigram/4-gram might do it) from the various entries
Do a many x many comparison for string similarity and cluster the entries by string distance. Someone suggested Levenshtein; there are better ones for this kind of task: Jaro-Winkler distance and Smith-Waterman work better. A library such as SimMetrics would make life a lot easier.
Once you have clusters of n-grams, you can resolve the whole string using the constituent subgrams, i.e. D.Jones St => Davy Jones St. => DJones St.
It should not be too hard; this is an all-too-common problem.
Update: Based on your update above, here are the suggested steps
Concatenate your columns into a single string, perhaps by creating a db "view". For example:
create view vwAddress
as
select top 10000
state, town, street, house, apartment, postcode,
state + ' ' + town + ' ' + street + ' ' + house + ' ' + apartment + ' ' + postcode as Address
from ...
Write a separate application (say in Java or C#/VB.NET) and use an algorithm like Jaro-Winkler to estimate the string distance for the combined address, creating a many x many comparison, and write the results into a separate table:
address1 | address n | similarity
You can use SimMetrics to get the similarity thus:
JaroWinkler objJw = new JaroWinkler();
double sim = objJw.GetSimilarity(address1, addressN);
You could also trigram it so that an address such as "1 Jones Street, Sometown, SomeCountry" becomes "1 Jones Street", "Jones Street Sometown", and so on....
and compare the trigrams (or even 4-grams) for higher accuracy.
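A minimal sketch of that word-level trigram step in plain Java; the crude normalisation (lowercase, strip punctuation) is just an assumption for illustration:

import java.util.ArrayList;
import java.util.List;

public class WordNGrams {
    // Slide a 3-word window over the address and collect the shingles.
    static List<String> trigrams(String address) {
        String[] words = address.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+");
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 3 <= words.length; i++) {
            grams.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(trigrams("1 Jones Street, Sometown, SomeCountry"));
        // [1 jones street, jones street sometown, street sometown somecountry]
    }
}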
Finally, you can order by similarity to get a cluster of the most similar addresses and decide on an appropriate threshold. Not sure why you are stuck.
I would try to do the following:
split up the address into multiple words, getting rid of punctuation at the same time
check all the words for patterns that are typically written differently and replace them with a common name (e.g. replace apartment, ap., ... by apt, replace Doctor by Dr., ...)
put all the words back in one string alphabetically sorted
compare all the addresses using a fuzzy string comparison algorithm, e.g. Levenshtein (see the sketch after this list)
tweak the parameters of the Levenshtein algorithm (e.g. you want to allow more differences on longer strings)
finally do a manual check of the strings
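A rough sketch of those steps in Java: the abbreviation map is illustrative, the words are sorted so word order no longer matters, and the Levenshtein implementation is the plain dynamic-programming version:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class AddressCompare {
    // Illustrative dictionary of variants mapped to a canonical form.
    static final Map<String, String> CANON = new HashMap<String, String>();
    static {
        CANON.put("apartment", "apt");
        CANON.put("ap", "apt");
        CANON.put("doctor", "dr");
        CANON.put("street", "st");
        CANON.put("str", "st");
    }

    // Lowercase, strip punctuation, canonicalise each word, then sort the words.
    static String normalise(String address) {
        String[] words = address.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (CANON.containsKey(words[i])) {
                words[i] = CANON.get(words[i]);
            }
        }
        Arrays.sort(words);
        return String.join(" ", words);
    }

    // Plain dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String a = normalise("Dr. Jones str. 5, apartment 3");
        String b = normalise("Doctor Jones street 5, apt 3");
        System.out.println(levenshtein(a, b));   // 0 here: a small distance suggests the same address
    }
}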
Of course, the solution to keep your data 'in shape' is to have explicit fields for each of your characteristics in your database. Otherwise, you will end up doing this exercise every few months.
The main problem I see here is to exactly define equality.
Even if someone writes Jon. and another writes Jone., you will never be able to say whether they are the same (Jon = Jonathan, Joneson, Jonedoe, whatever ;)
I work in a firm where we have to handle exactly this problem. I'm afraid I have to tell you that this kind of checking of address lists for navigation systems is done "by hand" most of the time. Abbreviations are sometimes context dependent, and there are other things that make this difficult. Of course string replacement and the like is done with Python, but telling you the MEANING of such an abbreviation can only be done by a script in a few cases. ("St." can be "Saint" or "Street". How to decide? Impossible; this is human work.)
Another big problem is, as you said: is "DJones" a street or a person? Or both? Which one is meant here? Is this DJones the same as Dr Jones or the same as Don Jones? It's impossible to decide!
You can do some work with lists as presented in another answer here, but it will give you plenty of "false positives".
You have a postcode field!!!
So, why don't you just buy a postcode table for your country
and use that to clean up your street/town/region/province information?
I did a project like this in the last century. Basically it was a consolidation of two customer files after a merger, and it involved names and addresses from three different sources.
Firstly, as many posters have suggested, convert all the common words, abbreviations and spelling mistakes to a common form: "Apt.", "Apatment", etc. to "Apt".
Then look through the name and identify the first letter of the first name, plus the first surname. (Not that easy; consider "Dr. Med. Sir Henry de Baskerville Smythe".) But don't worry: where there are ambiguities, just take both! So if you're lucky you get HBASKERVILLE and HSMYTHE. Now get rid of all the vowels, as that's where most spelling variations occur, so now you have HBSKRVLL HSMTH.
You would also get these strings from "H. Baskerville", "Sir Henry Baskerville Smith" and unfortunately "Harold Smith", but we are talking fuzzy matching here!
Perform a similar exercise on the street, and apartment and postcode fields. But do not throw away the original data!
You now come to the interesting bit: first you compare each of the original strings and give, say, 50 points for each string that matches exactly. Then go through your "normalised" strings and give, say, 20 points for each one that matches exactly. Then go through all the strings and give, say, 5 points for each four-character-or-more substring they have in common. For each pair compared you will end up with some with scores > 150, which you can consider a certain match, some with scores less than 50, which you can consider not matched, and some in between which have some probability of matching.
You will need some more tweaking to improve this, adding various rules like "subtract 20 points for a surname of 'smith'". You really have to keep running and tweaking until you are happy with the resulting matches, but once you look at the results you get a pretty good feel for which score to consider a "match" and which are the false positives you need to get rid of.
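A rough sketch of that 50/20/5 scoring, assuming the raw and normalised fields of each record are held in parallel string arrays; the point values and the 4-character substring check come straight from the description above and need the same tuning:

public class MatchScore {
    // True if the two strings share any substring of at least minLen characters.
    static boolean shareSubstring(String a, String b, int minLen) {
        if (a == null || b == null) return false;
        for (int i = 0; i + minLen <= a.length(); i++) {
            if (b.contains(a.substring(i, i + minLen))) return true;
        }
        return false;
    }

    // origA/origB are the raw fields of the two records, normA/normB the normalised ones.
    static int score(String[] origA, String[] origB, String[] normA, String[] normB) {
        int score = 0;
        for (int i = 0; i < origA.length; i++) {
            if (origA[i].equalsIgnoreCase(origB[i])) score += 50;      // exact raw match
            if (normA[i].equals(normB[i])) score += 20;                // normalised match
            if (shareSubstring(normA[i], normB[i], 4)) score += 5;     // shared 4+ character substring
        }
        return score;   // > 150 certain match, < 50 no match, in between needs review
    }
}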
I think the amount of data could affect what approach works best for you.
I had a similar problem when indexing music from compilation albums with various artists. Sometimes the artist came first, sometimes the song name, with various separator styles.
What I did was to count the number of occurrences in other entries with the same value to make an educated guess whether it was the song name or the artist.
Perhaps you can use Soundex or a similar algorithm to find entries that are similar.
EDIT: (maybe I should clarify that I assumed that artist names were more likely to be more frequently reoccurring than song names.)
One important thing that you mention in the comments is that you are going to do this interactively.
This allows you to parse the user input and at the same time validate guesses on any abbreviations and correct a lot of mistakes (the way, for example, phone number entry works in some contact management systems: the system makes a best effort to parse and correct the country code, area code and number, but ultimately the user is presented with the guess and has the chance to correct the input).
If you want to do it really well, then keeping databases/dictionaries of postcodes, towns, streets, abbreviations and their variations can improve data validation and pre-processing.
So at least you would have a fully qualified address. If you can do this for all the input, you will have all the data categorized, and matches can then be strict on certain fields and less strict on others, with the matching score calculated according to the weights you assign.
After you have consistently pre-processed the input then n-grams should be able to find similar addresses.
Have you looked at SQL Server Integration Services for this? The Fuzzy Lookup component allows you to find 'Near matches': http://msdn.microsoft.com/en-us/library/ms137786.aspx
For new input, you could call the package from .NET code, passing the value row to be checked as a set of parameters; you'd probably need to persist the token index for this to be fast enough for user interaction, though.
There's an example of address matching here: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
I'm assuming that response time is not critical and that the problem is finding an existing address in a database, not merging duplicates. I'm also assuming the database contains a large number of addresses (say 3 million), rather than a number that could be cleaned up economically by hand or by Amazon's Mechanical Turk.
Pre-computation: identify address fragments with high information content.
Identify all the unique words used in each database field and count their occurrences.
Eliminate very common words and abbreviations. (Street, st., appt, apt, etc.)
When presented with an input address,
Identify the most unique word and search (Street LIKE '%Jones%') for existing addresses containing those words.
Use the pre-computed statistics to estimate how many addresses will be in the results set
If the estimated results set is too large, select the second-most unique word and combine it in the search (Street LIKE '%Jones%' AND Town LIKE '%Anytown%')
If the estimated results set is too small, select the second-most unique word and combine it in the search with OR (Street LIKE '%Aardvark%' OR Town LIKE '%Anytown%')
If the actual results set is too large or too small, repeat the query, adding further terms as before.
The idea is to find enough fragments with high information content in the address that can be searched for to give a reasonable number of alternatives, rather than to find the optimal match. For more tolerance to misspelling, trigrams, tetra-grams or Soundex codes could be used instead of words.
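A small sketch of that "rarest fragment first" idea; wordCounts stands in for the pre-computed occurrence statistics, maxWords is an arbitrary cap, and real code would use query parameters instead of concatenating LIKE patterns:

import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class RareWordSearch {
    // Build a WHERE clause from the least frequent words of the input address.
    static String buildWhere(List<String> inputWords, Map<String, Integer> wordCounts, int maxWords) {
        inputWords.sort(Comparator.comparingInt(w -> wordCounts.getOrDefault(w, 0)));
        StringBuilder where = new StringBuilder();
        for (int i = 0; i < Math.min(maxWords, inputWords.size()); i++) {
            if (i > 0) where.append(" AND ");
            where.append("address LIKE '%").append(inputWords.get(i)).append("%'");
        }
        return where.toString();   // e.g. address LIKE '%aardvark%' AND address LIKE '%anytown%'
    }
}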
Obviously if you have lists of actual states / towns / streets then some data clean-up could take place both in the database and in the search address. (I'm very surprised the Armenian postal service does not make such a list available, but I know that some postal services charge excessive amounts for this information.)
As a practical matter, most systems I see in use try to look up people's accounts by their phone number if possible: obviously whether that is a practical solution depends upon the nature of the data and its accuracy.
(Also consider the lateral-thinking approach: could you find a mail-order mail-list broker company which will clean up your database for you? They might even be willing to pay you for use of the addresses.)
I've found a great article.
By adding some DLLs as SQL user-defined functions, we can use the string comparison algorithms from the SimMetrics library.
Check it out:
http://anastasiosyal.com/archive/2009/01/11/18.aspx
The possibilities for such variations are countless, and even if such an algorithm exists, it can never be fool-proof. You can't have a spell checker for proper nouns, after all.
What you can do is provide a drop-down list of previously entered field values, so that users can select one if a particular name already exists.
It's better to have separate fields for each value, like apartment and so on.
You could throw all addresses at a web service like Google Maps (I don't know whether this one is suitable, though) and see whether they come up with identical GPS coordinates.
One method could be to apply the Levenshtein distance algorithm to the address fields. This will allow you to compare the strings for similarity.
Edit
After looking at the kinds of address differences you are dealing with, this may not be helpful after all.
Another idea is to use learning. For example you could learn, for each abbreviation and its place in the sentence, what the abbreviation means.
3 Jane Dr. -> Dr (in 3rd position (or last)) means Drive
Dr. Jones St -> Dr (in 1st position) means Doctor
You could, for example, use decision trees and have a user train the system. Probably a few examples of each use would be enough. You wouldn't try to classify single-letter abbreviations like the D. in D.Jones, which could be David Jones or Dr. Jones. But after a first level of translation you could look up a street index of the town and see whether you can expand the D. into a street name.
Again, you would run each address through the decision tree before storing it.
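As a stand-in for the learned rules, a hand-written positional lookup might look like the sketch below; the two rules are just the Dr./St. examples above, and a trained decision tree would populate this logic instead:

public class AbbrevExpander {
    // Expand an abbreviation based on where it sits in the address.
    static String expand(String word, int position, int lastPosition) {
        String w = word.replace(".", "").toLowerCase();
        if (w.equals("dr")) {
            return position == lastPosition ? "Drive" : "Doctor";
        }
        if (w.equals("st")) {
            return position == 0 ? "Saint" : "Street";
        }
        return word;   // unknown abbreviation: leave it alone
    }

    public static void main(String[] args) {
        System.out.println(expand("Dr.", 2, 2));   // "3 Jane Dr."   -> Drive
        System.out.println(expand("Dr.", 0, 2));   // "Dr. Jones St" -> Doctor
    }
}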
It feels like there should be some commercial products doing this out there.
A possibility is to have a dictionary table in the database that maps all the variants to the 'proper' version of the word:
*Value* | *Meaning*
Apt. | Apartment
Ap. | Apartment
St. | Street
Then you run each word through the dictionary before you compare.
Edit: this alone is too naive to be practical (see comment).

Which parts of an address should be required?

Say I am storing addresses in a DB table, in this fairly common break down:
address_street_line_1,
address_street_line_2,
address_city,
address_state,
address_zip,
address_country_id
(Note: I have read the questions on splitting down further, street type, house number, etc. and for this application I think it would unnecessarily complicate things.)
To work best with international users, which of these fields should NOT be required?
I'm thinking this:
address_street_line_1 REQUIRED
address_city REQUIRED
address_country_id REQUIRED
Should I require state or zip?
Thanks!
Xavier
You can probably only require one field: country.
But what you should really be doing is making the logic dependent on country. Take a look at Address Formats by Country for a comprehensive list. That isn't just about required fields either. It's also about correct formatting. A US address might be:
8031 Main Street
Springfield OH 12345
USA
whereas in Switzerland:
Bodenstr. 173
8043 Zürich
Schweiz
Note: the street numbers and post codes are in the "reverse" order for Switzerland (compared to what English speaking countries use).
Also, your data types need to be broad enough to cover data used in other countries. Zip/post code should absolutely not be a numeric type. For example, "EC2R 8AH" is a valid UK postcode.
That goes back to this principle: if you don't perform arithmetic on it, it's not a numeric type. It's text.
Also, try not to call it Zip Code to end users. That's a US-only term. Pretty much everywhere else it's called a Postcode, Post code or Postal Code. Also note that UK postal codes are alphanumeric and include a space.
Not all countries even use postal codes, for example they were rarely used in New Zealand prior to 2006 or so. I think Ireland doesn't use them at all.
If you're truly international, city-states such as Singapore don't actually need a City field.
In the user interface, you can (and perhaps should) make the postcode required for countries where you already know it's required, since that isn't likely to change. And, if you make the UI dynamic enough, you can call it "Zip code" if the selected country is the United States, "Postal code" for Canada, "Postcode" for the UK, etc.
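A tiny sketch of that country-dependent labelling and requirement logic; the three entries and the fallback are examples, not a complete list:

import java.util.HashMap;
import java.util.Map;

public class PostcodeLabel {
    static final Map<String, String> LABELS = new HashMap<String, String>();
    static {
        LABELS.put("US", "Zip code");
        LABELS.put("CA", "Postal code");
        LABELS.put("GB", "Postcode");
    }

    // Label to show next to the field for the selected country.
    static String labelFor(String countryCode) {
        return LABELS.getOrDefault(countryCode, "Postal code");
    }

    // Example only: require the field where you already know it is mandatory.
    static boolean postcodeRequired(String countryCode) {
        return "US".equals(countryCode) || "CA".equals(countryCode) || "GB".equals(countryCode);
    }
}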
How about making none required? If the user wants to be contacted they'll enter enough information. Or provide a single text field and let them enter free-form information. They know better than you what fields are required for postal deliveries to make it to their door.
I would say everything except street_line_2 and state, and think of 'zip' as a postal code rather than a zip code: as you can tell from the variety of formats based on the country of origin, this should have a pretty open format.
Even in the U.S., most of the address is not required. A large fraction of U.S. zip codes are allocated to various businesses and organizations; any mail to one of those zips will be delivered the same regardless of the rest of the address. For instance:
General Electric
Schenectady, NY 12345
Internal Revenue Service
Ogden, UT 84201-0027
The city and state are nice, but the mail will probably get delivered without them.
The best way that I have found to solve this problem is by abstracting the logic in your application layer, and not the persistence layer. One of the cleanest/simplest ways I've seen this done is by passing the user's data in a value object (creating a common interface that's easy to validate against) to a validator with the current country code, which makes sure all the required attributes are set properly in the value object for that locale. Assuming it passes validation, pass the value object along to the persistence side of your application for storage.
The key here is the value object - you're creating a common interface that multiple pieces of your application can talk to, validate, and read/write from. You can then also use that same value object when displaying the address: have your persistence layer get the information, put it in the value object, pass it to a factory with the current locale which returns the desired address format, and send that output to the front end.
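A sketch of that shape in Java; the Address value object, the validator interface and the US rules shown are illustrative assumptions rather than a complete rule set:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Address {                        // the shared value object
    String street1, street2, city, state, postcode, countryCode;
}

interface AddressValidator {
    List<String> validate(Address a);  // returns a list of error messages, empty if valid
}

class UsValidator implements AddressValidator {
    public List<String> validate(Address a) {
        List<String> errors = new ArrayList<String>();
        if (isBlank(a.street1)) errors.add("Street is required");
        if (isBlank(a.city)) errors.add("City is required");
        if (isBlank(a.state)) errors.add("State is required");
        if (a.postcode == null || !a.postcode.matches("\\d{5}(-\\d{4})?")) {
            errors.add("Zip code must look like 12345 or 12345-6789");
        }
        return errors;
    }
    private boolean isBlank(String s) { return s == null || s.trim().isEmpty(); }
}

class AddressValidation {
    static final Map<String, AddressValidator> VALIDATORS = new HashMap<String, AddressValidator>();
    static {
        VALIDATORS.put("US", new UsValidator());
        // register other countries here; unknown countries fall back to "no extra rules"
    }

    static List<String> validate(Address a) {
        AddressValidator v = VALIDATORS.get(a.countryCode);
        return v == null ? Collections.<String>emptyList() : v.validate(a);
    }
}

The persistence layer then only ever sees an Address that has already passed the rules for its own country.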
There are no states in New Zealand, so it should definitely be optional. So I think you have the right answer in your question.
If you are not going to do any specific lookup, like searching by postal code or by city, I'd say to combine the whole address into a single field. This way you will support the different address formats of different countries.
You will also support address oddities.
If you fear that the requirements are going to change, you could store the address as an XML field. Modern databases like SQL Server 2005 and 2008 can have an index on an XML node inside an XML column as long as you are using a schema.
It all comes down to requirements. If the client needs to group the data inside a grid by country, then you need a country column.
Making fields required is always a tradeoff. If the person doesn't want to fill in the info then they won't: they'll put in a period, or garbage, to get past the "required field" nanny.
I only require street_address_1 in my apps. Also, for the US and many countries, you can buy the mapping between the postal/zip code and the canonical city/state. It's not expensive. (The mapping between individual street addresses and zip is much more expensive.)
For the US, see http://www.usps.com/ncsc/addressinfo/citystate.htm
If you're including an Ajax web interface, ask for the country first, then the post code. If in the US, then use Ajax to fetch and fill in the city/state for the user from the zip.
Some non-US countries, e.g. the UK, can have 3 lines of street address if you're asking people to fill in their "preferred address". E.g.:
Mirassou (You can register a building's name with the post office
High Street as an alternative to its street number)
Old Town
City, Bucks postal_code
Larry
Actually, city isn't even required in the US.
Many people have rural addresses on state and county roads. Publication 28 at the postal service web site has details. Different companies end up using the "city" field to store other information. This also applies to military base addresses.
Publication 28 link