Is there a way to programmatically determine what School District a given physical street address in the US is tied to?
The US Census Bureau coordinates with local school districts in each state to track boundaries and any changes to them. The Census Bureau publishes this information every other year in its "TIGER" data:
http://www.census.gov/did/www/schooldistricts/data/boundaries.html
(A relatively easy way to read the data is using PostGIS under PostgreSQL)
But before all of that, you need to make sure you're working with a correct address. The reason is that the address may not even exist at all or may be formatted incorrectly such that you are unable to determine a match. Furthermore, if the ZIP Code is wrong or street information is misspelled, or the person accidentally reversed the "North" or "South" part of an address, e.g. 123 North Main Street, that could put the coordinate in the wrong school district. So as a precursor to running your addresses through PostGIS, you'll definitely want to look at using an address verification service to make sure you've got good data to start with. (Full disclosure: I'm the founder of SmartyStreets, we do address verification.)
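Once an address has been verified and geocoded, the district lookup itself is a point-in-polygon test (in PostGIS this is typically a query using ST_Contains against the boundary table). As a rough sketch of the idea in plain Python, assuming made-up rectangular boundaries rather than real Census data, and with hypothetical names point_in_polygon, find_district and DISTRICTS:

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: toggle 'inside' each time a ray going right from the point crosses an edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_district(point: Point, districts: Dict[str, List[Point]]) -> Optional[str]:
    """Return the name of the first district whose polygon contains the point, or None."""
    for name, polygon in districts.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Hypothetical boundaries for illustration only (not real school districts)
DISTRICTS = {
    "District A": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    "District B": [(4.0, 0.0), (8.0, 0.0), (8.0, 4.0), (4.0, 4.0)],
}
```

For example, find_district((1.0, 1.0), DISTRICTS) returns "District A". A coordinate that lands a few blocks off because of a bad address can fall on the wrong side of a boundary, which is why the verification step matters.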
For country and state, there are ISO codes. For cities, there are none.
Method 1:
Store in one column:
[Country ISO]-[State ISO]-[City Name]
Method 2:
Store in 3 separate columns.
Also, how to handle city names if there is no unique identifier?
First and foremost, use three separate columns to keep your data. If you want to create a unique identifier, the easiest way would be to assign a random 3-10 digit code, depending on the size of your data set. However, I would suggest concatenating [country-code]-[state-code]-[code] if you have a small data set and want some degree of human readability. Here, code can be several things. Here are some ideas:
a random ID, or even a database row ID
the license plate code for the city, if one exists
the phone area code of the city or of its center
a zip code, following the same logic
the latitude and longitude of the city center, rounded to a chosen precision
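A minimal sketch of the concatenated identifier, assuming the three parts are still stored in separate columns and the key is only derived from them. The function names and the city code "CHI" are made up for illustration:

```python
from typing import Tuple


def make_city_key(country_code: str, state_code: str, city_code: str) -> str:
    """Build a readable [country-code]-[state-code]-[code] identifier."""
    parts = [p.strip().upper() for p in (country_code, state_code, city_code)]
    if not all(parts):
        raise ValueError("country, state and city code are all required")
    return "-".join(parts)


def split_city_key(key: str) -> Tuple[str, str, str]:
    """Split a key back into its parts; the city code itself may contain hyphens."""
    country, state, city = key.split("-", 2)
    return country, state, city
```

For example, make_city_key("us", "il", "chi") gives "US-IL-CHI".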
Here are also more references that can be used:
ISO 3166 defines country codes. It also includes codes for subdivisions, which may be states or cities depending on the country.
As mentioned, IATA has both airport and city code lists, but they are hard to obtain.
The UN location list is worth mentioning, but it can be difficult to tell the levels apart: an airport code, a city code, and a borough code can all be on the same list. Still, each UN/LOCODE must be unique. (ICAO also assigns airport codes, which are similar to IATA's but not the same.)
There are several data sets, such as OpenTravelData or GeoNames, that can be used as a starting point but may require digging and converting. They provide unique codes for locations, and many others can be found.
Bonus:
I would suggest checking Schema.org's City schema and other Place schemas for a well-thought-out setup.
I have a task to store data about delivery destinations, i.e. the places where companies can ship a postal parcel.
The trivial way is to create a table
CompanyShippmentPlaces
id | country | city
There are some design issues:
What if a parcel needs to be delivered to a town or village, not a city? Does that mean altering the table?
What if a company needs to specify a part of a city, town, or village?
What if the destinations have the same name?
How I plan to use this data:
When the system gets an order, the order should be distributed across all companies. I must get all companies that can deliver this product.
This pushes me toward a NoSQL database, but I am not confident.
What do you think about that?
What if a parcel needs to be delivered to a town or village, not a city? Does that mean altering the table?
This would be solved by the solution of jaimish11.
Personally, I would rename "City" to "locality" (or something comparable) to generalize it.
What if a company needs to specify a part of a city, town, or village?
I think this is solved by the address lines.
What if the destinations have the same name?
Normally (as far as I know), each location in a country has its own PIN or ZIP code, so if the name is ambiguous, the postal service uses the code to identify the location. (To be sure, you should ask the postal service, at least in your country.)
When the system gets an order, the order should be distributed across all companies. I must get all companies that can deliver this product.
I would get all locations where the products are available and then pick, from that first selection, the location nearest to the city. Maybe you could save the nearest location for each city in your "city table".
The issue you're describing isn't actually an issue. No matter what database you use (SQL or NoSQL) you can simply have all the address fields you need such as:
Address Line 1
Address Line 2
Landmark
City
Pincode/ Zipcode
State
Country
This way, it won't matter whether it's city, town or village.
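As a rough sketch of how those fields could answer the "which companies can deliver here?" question, here is an in-memory SQLite example. The table name, column names, the use of "locality" instead of "City", and the sample rows are all assumptions for illustration:

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company_shipping_places (
    id             INTEGER PRIMARY KEY,
    company        TEXT NOT NULL,
    address_line_1 TEXT,
    address_line_2 TEXT,
    landmark       TEXT,
    locality       TEXT NOT NULL,  -- city, town or village
    postal_code    TEXT NOT NULL,
    state          TEXT,
    country        TEXT NOT NULL
);
""")

# Sample rows: two places share the name "Springfield" but differ by postal code
conn.executemany(
    "INSERT INTO company_shipping_places "
    "(company, locality, postal_code, state, country) VALUES (?, ?, ?, ?, ?)",
    [
        ("Acme", "Springfield", "62701", "IL", "US"),
        ("Bolt", "Springfield", "62701", "IL", "US"),
        ("Acme", "Springfield", "65802", "MO", "US"),
    ],
)


def companies_delivering_to(db: sqlite3.Connection, country: str, postal_code: str) -> List[str]:
    """Return companies serving a destination, matched by code rather than by name."""
    rows = db.execute(
        "SELECT DISTINCT company FROM company_shipping_places "
        "WHERE country = ? AND postal_code = ? ORDER BY company",
        (country, postal_code),
    ).fetchall()
    return [row[0] for row in rows]
```

Matching on country plus postal code keeps the two Springfields apart without any special handling for duplicate names.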
I have a database with country, state, city, and hotel tables. The country table contains multiple records for the same country: for example, Mexico appears as maxico, mxico, and mexico, and the USA appears as usa, united states of america, and america. Each of these duplicate countries has its own misspelled states, and the states have misspelled cities, but the hotels are unique. I want to assign each hotel to its correct city, state, and country; for example, a hotel in the city of Chicago, in the state of Illinois, in the USA. How can I fix this?
You could do an update if you know all the different incorrect variants:
update tbl
set country = 'Mexico'
where country in ('maxico', 'mxico')
Well, you can list all the values in the country column and check whether each value is right. If a value is wrong, just use an update statement to fix it, like below:
update my_table set country = 'Mexico' where country in ('maco', 'xico');
It depends on infrastructure you're running.
If you have access to ETL tools, they often have data quality capabilities, often backed by reference databases for correcting addresses. Those are often paid.
If you are a "private" developer, you might not want to use paid data, so you can look for open data sources, such as the Allegheny County addresses on https://catalog.data.gov.
You can use a multitude of algorithms and solutions, ranging from simple distances in word space to neural networks pre-trained to do just that.
This type of data problem is hard. There is no built-in simple way to determine the "right spelling". Many databases have one of two capabilities built in that can help -- either "soundex" algorithms or Levenshtein distance.
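To make the Levenshtein idea concrete, here is a small pure-Python sketch. The helper closest_name and the max_distance threshold of 2 are assumptions chosen for illustration, not a recommended setting:

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_name(value: str, known_names: List[str], max_distance: int = 2) -> Optional[str]:
    """Suggest the nearest known name, or None if nothing is close enough."""
    normalized = value.strip().lower()
    best = min(known_names, key=lambda name: levenshtein(normalized, name.lower()))
    if levenshtein(normalized, best.lower()) <= max_distance:
        return best
    return None
```

This catches typos such as "maxico", but not abbreviations or synonyms: closest_name("usa", ["Mexico", "United States"]) returns None, so a suggestion like this still needs a human to confirm it.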
What should you do? If you really want to fix this problem, create a table with the misspelled name and the correct value that you want. This table will need to be maintained manually, such as in a spreadsheet. Then use this table when importing data and use only the rectified value.
Better yet, set up a reference table with only the correct names. Create a second table with alternative names, which is maintained as above.
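A minimal sketch of the two-table layout using an in-memory SQLite database. The table names, column names, and alias rows are assumptions for illustration; in practice the alias rows would be loaded from the manually maintained list:

```python
import sqlite3
from typing import Optional


def build_reference_db() -> sqlite3.Connection:
    """Create the reference table of correct names and the table of alternative names."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE countries (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE country_aliases (
            alias      TEXT PRIMARY KEY,  -- stored lowercase
            country_id INTEGER NOT NULL REFERENCES countries(id)
        );
        INSERT INTO countries (id, name) VALUES (1, 'Mexico'), (2, 'United States');
        INSERT INTO country_aliases (alias, country_id) VALUES
            ('mexico', 1), ('maxico', 1), ('mxico', 1),
            ('united states', 2), ('usa', 2),
            ('united states of america', 2), ('america', 2);
    """)
    return db


def canonical_country(db: sqlite3.Connection, raw: str) -> Optional[str]:
    """Map a raw imported value to the correct name; None means 'needs manual review'."""
    row = db.execute(
        "SELECT c.name FROM country_aliases a "
        "JOIN countries c ON c.id = a.country_id "
        "WHERE a.alias = ?",
        (raw.strip().lower(),),
    ).fetchone()
    return row[0] if row else None
```

Values that return None during import can be queued for review and, once resolved, added as new alias rows.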
I have a database that contains 5-digit zip codes (e.g. 10001) and matching state abbreviations (e.g. NJ, NY, CA). I've found that some of the zip codes have multiple states (e.g. 10001 = NJ and 10001 = NY), which is wrong.
zip State
10001 NY
10001 NJ
10001 NY
10001 NY
... ...
Each State can have many zip codes, but each zip code should have only one state.
I'd like to find all the errors but can't seem to write a query to do so.
Any suggestions?
Just to point out the obvious:
Because ZIP codes are intended for efficient postal delivery, there are unusual cases where a ZIP code crosses state boundaries, such as a military facility spanning multiple states or remote areas of one state most easily serviced from an adjacent state. For example ZIP code 42223 spans Christian KY and Montgomery TN, and ZIP code 97635 spans Lake OR and Modoc CA.
http://en.wikipedia.org/wiki/ZIP_code
Be careful what you consider canonical data, and always trust someone providing you authentic data.
In this case, NJ ZIP codes start with 0 and NY ZIP codes start with 1, so 10001 would be wrong for NJ but accurate for NY, while a code starting with 0 (such as 00001 in this example) would fit NJ but be wrong for NY. See also http://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes
Also of note: with the 1000 three-digit prefix ranges in the previous link, you could accurately determine which of your zip codes fall outside the range for the state they are listed under ...
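The prefix check can be sketched as follows. The lookup table here is deliberately partial and holds only the two states discussed above; a real check would be filled in from the prefix list linked above. The names FIRST_DIGIT_BY_STATE and prefix_matches_state are made up for this example:

```python
from typing import Optional

# Partial table, assumption for illustration: only the states from this example
FIRST_DIGIT_BY_STATE = {
    "NJ": "0",
    "NY": "1",
}


def prefix_matches_state(zip_code: str, state: str) -> Optional[bool]:
    """True/False if the first ZIP digit fits the state, None if the state is not in the table."""
    expected = FIRST_DIGIT_BY_STATE.get(state.strip().upper())
    if expected is None:
        return None
    # ZIP codes stored as numbers lose leading zeros, so pad back to 5 digits
    padded = zip_code.strip().zfill(5)
    return padded[0] == expected
```

Given the cross-state exceptions mentioned earlier, a False result is better treated as a flag for review than as proof of an error.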
A different approach, but rather than just give you a count, this gives you the states involved.
SELECT zip, state
FROM dbo.table AS t
WHERE EXISTS
(
SELECT 1 FROM dbo.table
WHERE zip = t.zip AND state <> t.state
)
GROUP BY zip, state
ORDER BY zip, state;
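Once you've identified the duplicates and removed them, add a unique constraint so you're not doing this again next week, next month, etc. Note that a constraint on zip,state only blocks exact duplicate rows; to enforce one state per zip code, the constraint has to be on zip alone.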
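A small sketch of such a constraint in an in-memory SQLite database, assuming the rule being enforced is "one state per ZIP", so UNIQUE is placed on zip alone. The table name zip_state and the helper try_insert are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE zip_state (zip TEXT NOT NULL, state TEXT NOT NULL, UNIQUE (zip))"
)
conn.execute("INSERT INTO zip_state (zip, state) VALUES ('10001', 'NY')")
conn.commit()  # commit the cleaned data so a later rollback cannot undo it


def try_insert(db: sqlite3.Connection, zip_code: str, state: str) -> bool:
    """Insert a row; return False if the unique constraint rejects it."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO zip_state (zip, state) VALUES (?, ?)", (zip_code, state)
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, try_insert(conn, "10001", "NJ") is rejected while a new ZIP such as "07030" is accepted. If legitimate cross-state ZIP codes matter for your data, this constraint is too strict.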
I actually work with nationwide datasets on a daily basis and encounter this issue a lot. The State designator in the prefix of the ZIP code indicates the state that the Post Office is located in, not necessarily the extents of the delivery area. I was in contact with some higher-ups at USPS about some issues in the north-central part of the country and was told that the ZIP code program originally intended for the ZIP Codes to be confined by state boundaries, but in the early 80s they started to make exceptions in rural areas. There are cases where a house in North Dakota is 10 miles away from a Montana Post Office, but the nearest Post Office in their own state is located several counties away. This is why these exceptions are made. It makes sense on the ground level, but not on a data level. There are a lot of these exceptions in the USPS databases now. The most prolific areas (that I have found) are along the MT/ND and SD/ND borders.
This should give you what you need:
select zip,count(distinct state)
from TheTable
group by zip
having count(distinct state)>1
That will give you a list of each zip code for which more than one state exists in the table.
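To try the query without touching a production database, it can be run against the sample rows from the question in an in-memory SQLite database. The extra 07030 row and the helper name conflicting_zips are additions for illustration:

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TheTable (zip TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO TheTable (zip, state) VALUES (?, ?)",
    [
        ("10001", "NY"),
        ("10001", "NJ"),
        ("10001", "NY"),
        ("10001", "NY"),
        ("07030", "NJ"),  # a ZIP with a single state, which should not be reported
    ],
)


def conflicting_zips(db: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Return (zip, number of distinct states) for every ZIP listed under more than one state."""
    return db.execute(
        "SELECT zip, COUNT(DISTINCT state) "
        "FROM TheTable "
        "GROUP BY zip "
        "HAVING COUNT(DISTINCT state) > 1 "
        "ORDER BY zip"
    ).fetchall()
```

Here conflicting_zips(conn) reports only 10001, with two distinct states.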
What is the most flexible design for a table of physical addresses in some variety of SQL? I assume there is something better than { street address, building address, city, state/province/region, country, zipcode }. Also are there standard names for various components of addresses, especially international standard names? Further, what sort of provision should be made for the same physical location having multiple addresses? In what circumstances could this occur?
http://www.upu.int has the format standards for international addresses. Publication 28 at http://usps.com has the U.S. format standards.
The USPS wants the following unpunctuated address components concatenated on a single line:
house number
predirectional (N, SE, etc)
street
suffix (AVE, BLVD, etc)
postdirectional (SW, E, etc)
unit (APT, STE, etc)
apartment/suite number
Eg, 102 N MAIN ST SE APT B.
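If the components are stored separately, the single delivery line can be assembled in that order. This is only a sketch: the function name and parameters are made up, and stripping punctuation while keeping hyphens and slashes is an assumption rather than a full implementation of Publication 28:

```python
import re
from typing import Optional


def delivery_address_line(
    house_number: str,
    street: str,
    predirectional: Optional[str] = None,
    suffix: Optional[str] = None,
    postdirectional: Optional[str] = None,
    unit: Optional[str] = None,
    unit_number: Optional[str] = None,
) -> str:
    """Concatenate the components in USPS order, uppercased and unpunctuated."""
    ordered = [house_number, predirectional, street, suffix,
               postdirectional, unit, unit_number]
    cleaned = []
    for part in ordered:
        if not part:
            continue
        # Drop punctuation such as periods and commas; keep letters, digits, '/' and '-'
        text = re.sub(r"[^\w\s/-]", "", part)
        text = " ".join(text.split()).upper()
        if text:
            cleaned.append(text)
    return " ".join(cleaned)
```

Calling delivery_address_line("102", "Main", predirectional="N.", suffix="St.", postdirectional="SE", unit="Apt.", unit_number="B") reproduces the example line, 102 N MAIN ST SE APT B.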
If you keep the entire address line as a single field in your database, input and editing are easy, but searches can be more difficult (e.g., in SOUTH EAST LANE, is the street EAST, as in S EAST LN, or is it LANE, as in SE LANE ST?).
If you keep the address parsed into separate fields, searches for components like street name or apartments become easier, but you have to append everything together for output, you need CASS software to parse correctly, and PO boxes, rural route addresses, and APO/FPO addresses have special parsings.
A physical location with multiple addresses at that location is either a multiunit building, in which case letters/numbers after units like APT and STE designate the address, or it's a Commercial Mail Receiving Agency (eg, UPS store) and a maildrop/private mailbox number is appended (like 100 MAIN ST STE B PMB 102), or it's a business with one USPS delivery point and mail is routed after USPS delivery (which usually requires a separate mailstop field which the company might need but the USPS won't want on the address line).
A contact with more than one physical address is usually a business or person with a street address and a PO box. Note that it's common for each address to have a different ZIP code.
It's quite typical that one business transaction might have a shipping address and a billing address (again, with different ZIP codes). The information I keep for EACH address is:
name prefix (DR, MS, etc)
first name and initial
last name
name suffix (III, PHD, etc)
mail stop
company name
address (one line only per Pub 28 for USA)
city
state/province
ZIP/postal code
country
I typically print mail stops somewhere between the person's name and company because the country contains the state/ZIP which contains the city which contains the address which contains the company which contains the mail stop which contains the person. I use CASS software to validate and standardize addresses when entered or edited.
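The field list and print order above could be sketched like this. The class name, the field names, and the sample values in the usage note are hypothetical; the ordering runs from most specific (person) to least specific (country), with the mail stop between name and company:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MailingAddress:
    name_prefix: str = ""
    first_name: str = ""   # first name and initial
    last_name: str = ""
    name_suffix: str = ""
    mail_stop: str = ""
    company: str = ""
    address: str = ""      # single delivery address line
    city: str = ""
    state: str = ""
    postal_code: str = ""
    country: str = ""


def label_lines(a: MailingAddress) -> List[str]:
    """Return the printable lines, skipping any that are empty."""
    name = " ".join(p for p in (a.name_prefix, a.first_name, a.last_name, a.name_suffix) if p)
    city_line = " ".join(p for p in (a.city, a.state, a.postal_code) if p)
    lines = [name, a.mail_stop, a.company, a.address, city_line, a.country]
    return [line for line in lines if line]
```

For a record with first_name="JANE Q", last_name="DOE", mail_stop="MS 42", company="EXAMPLE CORP", address="100 MAIN ST STE B", city="SPRINGFIELD", state="IL" and postal_code="62701", this yields five lines starting with the person and ending with the city line.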
The most generic ADDRESS table I constructed included the following columns:
ADDRESS_1 - c/o line
ADDRESS_2 - RR/PO Box
ADDRESS_3 - suite/apt
ADDRESS_4 - street address
ADDRESS_TYPE_CODE - foreign key to ADDRESS_TYPE_CODES table
CITY
STATE_PROVINCE
POSTAL_CODE
COUNTRY
The ADDRESS_TYPE_CODES would be business, home, mailing/shipping, etc.
There's no way to know what address details you're going to get, so you have to make it flexible.
One possibility is, a text field could be used so that the address could be formatted in any way needed. However, then the text would need to be parsed for the parts of the address useful for providing links to maps and directions. Perhaps there should be a text full_address field and then a separate set of columns that will contain parsed (automatically or manually) address pieces useful for links to maps and directions. If the address could not be parsed, a flag would be set saying it couldn't be recognized. (Perhaps this means the address is in a country that the parser doesn't know the address format for, or that the address was entered improperly.)
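A rough sketch of that layout: the free-text address is always kept, parsed pieces are filled in when a parser recognizes the format, and a flag marks the rest. The regular expression below is a deliberately narrow assumption ("number street, city, ST 12345") and stands in for whatever real parser would be used; all names are hypothetical:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Narrow US-style pattern for illustration only; real addresses need a real parser
US_PATTERN = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+?)\s*,"
    r"\s*(?P<city>[^,]+?)\s*,"
    r"\s*(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})(?:-\d{4})?\s*$"
)


@dataclass
class StoredAddress:
    full_address: str                   # always kept exactly as entered
    street_number: Optional[str] = None
    street: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    postal_code: Optional[str] = None   # 5-digit part only in this sketch
    unparsed: bool = True               # flag: the parser could not recognize the address


def store_address(full_address: str) -> StoredAddress:
    """Keep the free text and fill in the parsed pieces when the pattern matches."""
    record = StoredAddress(full_address=full_address)
    match = US_PATTERN.match(full_address)
    if match:
        record.street_number = match.group("number")
        record.street = match.group("street")
        record.city = match.group("city")
        record.state = match.group("state").upper()
        record.postal_code = match.group("zip")
        record.unparsed = False
    return record
```

An address such as "10 Downing Street, London SW1A 2AA" does not match this pattern, so it is stored with unparsed set to True and can still be displayed from full_address.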
According to the United States Postal Service:
The components of the delivery address [line] are the primary address number, street name, secondary address identifier, and secondary address range.
Here's more than most people ever wanted to know about the delivery address line for United States addresses.
You would have to look up similar documents for any other countries that you're interested in.
Are you sure you need to be very flexible? I usually try to get the simplest design (i.e. the fewest columns) for the dataset I'm currently working with, then stay agile.