Designing a contacts database in SQL Server 2005 - sql-server-2005

I think this is a common thing to do... you have a database server, and you want to store customer contact information in it.
You need the person's name, their address, phone, etc.
What are best practices for storing addresses and phones? Assuming OLTP...
Multiple people may have the same phone number (such as wife and husband, or mother and daughter).
Multiple people share a household.
I read this: http://sqlcat.com/sqlcat/b/whitepapers/archive/2008/09/03/best-practices-for-semantic-data-modeling-for-performance-and-scalability.aspx
And that will work fine for the specific model mentioned, but I don't see how this model can be optimized short of denormalizing.
Ex:
Person table = person id, first name, last name, etc...
Address table = address id, address line 1, etc..
Phone table = phone id, phone number, etc...
So if I designed it like that whitepaper suggests, I'd have a personid in my address table and in my phone table. However, since multiple people may share the same address, that isn't feasible. One person may have multiple addresses or even no addresses. So it seems I'll need a person -> address mapping table as well as a mapping table for the phones, otherwise I'll denormalize both of those tables and let there be some duplicates in the unusual case of two people who share the same phone / address.
Anyway, my point in asking this question is because it seems difficult to find a 'best practices' for this type of thing, yet it seems like the type of thing which would come up in just about any type of application or database.

Normalizing addresses and phone numbers in a one-to-many relationship where a Contact may have many related Phone or Address entities makes perfect sense.
However, there is no need to normalize addresses and phone numbers in a many-to-many relationship in a contacts database, because those are not entities you have any interest in working with by themselves, on their own merits as unique entities. In fact, I would say that in your situation, normalizing them to that level is not a good design.
If you were modeling a business in real estate, rentals, or phone service, where you cared about properties and phone numbers even when no person was associated with them, then it could make sense to model them to this level. It is more work for someone to avoid duplicate addresses and phone numbers in the many-to-many design than it is for them to just enter the address again, and there is no real benefit to avoiding these duplicates. Plus, you'll end up with duplicates anyway (at least for addresses, unless you scrub them all real-time using post office routines), so who is going to go through and match up '123 Ascot Wy #5' to '123 Ascot Way Apt 5'? What value is there in that?
The usual reason for normalizing this deep doesn't apply. Let's say that you do create a PhoneNumber table and the PersonPhoneNumber table needed for the many-to-many relationship. You have three people using the same phone number and they are all properly linked to it. Now, one of them calls you up and tells you that he is changing his phone number. Are you sure you want to change the actual PhoneNumber record and update the numbers of the other two folks at the same time? What if they aren't moving with him? Soon you will find that your data is screwed up. You may as well normalize first names to the FirstName table and last names to the LastName table! Then when "Joey" grows up and changes his name to "Joe", all the other Joeys will get an automatic upgrade. But whoops... "Joe" already exists, as does the phone number that you are changing one of the three people above to... what an awful mess.
For another thing, will you use PhoneID as a surrogate key for the phone number? But phone numbers are one of the few things that actually are good as natural keys, they almost even demand being used as natural keys. Then your Phone table becomes meaningless because it doesn't encode any additional information about that phone number. It would just be a list of phone numbers, which are already present in the referencing table. Don't use a Phone table like that. If you want to find out whether two people share the same phone number, you just join on or group by the column! In my mind it approaches silliness to have a layer of abstraction where a phone number is linked to a monotonically-increasing PhoneID.
If you read A Universal Person and Organization Model you will see the perspective that phone numbers and addresses in fact aren't entities that need modeling to the level of a many-to-many relationship--they are more like "intelligent locators" that route messages to recipients. Why on earth would you force three different people's locator (a.k.a. phone number) to be identical? The locator helps to locate the person, not the physical phone that rings. You couldn't care less about the phone or who else might answer--you only care about the fact that once answered, the person of interest could possibly be reached.

Normalize.
Normalize until it hurts.
Normalize again until it is excruciating.
Then tune your queries; design your indices; and measure your performance; if at that point you have no other options, denormalize the bare minimum to meet performance options.
Remember that every denormalization that speeds performance on one query, by its nature degrades performance on (almost) every operation on tat table set. Only keep the denormalizaiton if measurement actually shows a noticeable performance improvement.
Remember that the more you normalize the smaller your indices are; the more index rows sit in cache, and the faster your database performs. Yes, a lot of very small tables get created - they are permanently in cache and thus almost free to access.

Related

Many-to-Many-to-Many Relation in SQL?

I am quite new to SQL and I am trying to make a few relations and come across the following.
I have 3 tables
- Persons
- PhoneNumbers
- PhoneNumberCategories
Persons and PhoneNumbers is self explanatory I think and PhoneNumberCategories I use to indicate whether a phonenumber is for the office, home, mobile, fax etc.
Now I made a 4th table to make this many-to-many-to-many relation. This table has 4 columns:
ID
PersonID
PhoneNumberID
PhoneNumberCategoryID
And I have set-up 3 relations here. One for all columns except the ID Column. It seems to work, but it just gives me a strange feeling. Can someone tell me if I am on the wright track or if I am going all wrong about this.
The reason I want this is as follows. Of course I need to link a person to a phone number. However, I may have a phone number that is the general office (This would be the row 'OfficeGeneral') number, and therefore linked to me because I work there. I also have a direct office (And this would be the row 'OfficeDirect') number. This is of course also linked to me. The general number however, is linked to all people in the office as 'OfficeGeneral'. Except for the receptionist, here it would be linked as 'OfficeDirect'. And this is the reason I came up with this many-to-many-to-many relation. I can not find much about it on the web. And that is reason enough to doubt if this is the correct way to go about it. Anyway, this is just an example. I would like to make sure that I am flexible and can catch as many exceptions as possible. I am sure once the database is in use that people will come with situations which I have not anticipated. People are good at that I have learned over the years.
Clarification in Response to Comment Below:
Person can have more than 1 PhoneNumber.
PhoneNumber can have more than 1 Person.
Person with PhoneNumber can have more than 1 PhoneNumberCategory (i.e. Private Phone used for both Phone and Fax.)
Several people can be linked to the same PhoneNumber with different PhoneNumberCategories. (i.e. OfficeMain for me is OfficDirect for receptionist.)
Looking forward to hear from you all.
Your model looks fine. Your description, perhaps, is not.
Person and PhoneNumber have a many to many relationship. This relationship, in turn, is 1-many with PhoneNumberCategory -- unless a given phone number could be in more than one category.
I would expect that Person/PhoneNumber would be declared unique in this table, so a person could only use a particular phone number once. However, that may not be how you are viewing the structure.

Normalization in plain English

I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the wikipedia post, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind.
Does anyone has a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms?
Say you go to a job interview and the person asks: Explain the concept of normalization and how would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife it would have been something like that:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the country they came from. Instead of holding the name of the country which can be as long as "Bosnia & Herzegovina" for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45. Now in the future, as often happens with Balkan countries, they split to two countries: Bosnia and Herzegovina, I will have to change it only in one place. well, sort of.
Now, to explain 2NF, I would have changed the example, and let's assume that we hold the list of countries every person visited.
Instead of holding a table like:
Person CountryVisited AnotherInformation D.O.B.
Faruz USA Blah Blah 1/1/2000
Faruz Canada Blah Blah 1/1/2000
I would have created three tables, one table with the list of countries, one table with the list of persons and another table to connect them both. That gives me the most freedom I can get changing person's information or country information. This enables me to "remove duplicate rows" as normalization expects.
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, then you are violating normalization which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts, like the information that Bob's address is "The belltower" is repeated twice, which is redundant, and makes it difficult to query and change data and (the worst) possible to introduce logical inconsistencies.
Eg. if Bob moves I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. E.g. if I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
This is what I ask interviewees:
Why don't we use a single table for an application instead of using multiple tables ?
The answer is ofcourse normalization. As already said, its to avoid redundancy and there by update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to which row in the user table it belongs to. No blank columns are needed, and each user can have as few or many phone numbers as necessary.
One side point to note about normalization: A fully normalized database is space efficient, but is not necessarily the most time efficient arrangement of data depending on use patterns.
Skipping around to multiple tables to look up all the pieces of info from their denormalized locations takes time. In high load situations (millions of rows per second flying around, thousands of concurrent clients, like say credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized tables.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note that said you had to
go shopping for ice cream without
normalization, you would then have
another note, saying you have to go
shopping for ice cream, just one in
each pocket.
Now, In real life, you would never do
this, so why do it in a database?
For the designing and implementing part, thats when you can move back to "the lingo" and keep it away from layman terms, but I suppose you could simplify. You would say what you needed to at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that tables primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and the fact that you cannot always have the best structure AND be in normal forms.
Normalization is a set of rules that used to design tables that connected through relationships.
It helps in avoiding repetitive entries, reducing required storage space, preventing the need to restructure existing tables to accommodate new data, increasing speed of queries.
First Normal Form: Data should be broken up in the smallest units. Tables should not contain repetitive groups of columns. Each row is identified with one or more primary key.
For example, There is a column named 'Name' in a 'Custom' table, it should be broken to 'First Name' and 'Last Name'. Also, 'Custom' should have a column named 'CustiomID' to identify a particular custom.
Second Normal Form: Each non-key column should be directly related to the entire primary key.
For example, if a 'Custom' table has a column named 'City', the city should has a separate table with primary key and city name defined, in the 'Custom' table, replace the 'City' column with 'CityID' and make 'CityID' the foreign key in the tale.
Third normal form: Each non-key column should not depend on other non-key columns.
For example, In an order table, the column 'Total' is dependent on 'Unit price' and 'quantity', so the 'Total' column should be removed.
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it like Bill Lumbergh. We then query the students and ask them what we will have problems with, when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough name for some, because I have what some people would consider 2 first names and some people call me Richard. If you were trying to search for my last name then it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems with easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that meaningful is based upon the audience who is going to be using the database as well. We, at our job will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FEDEX might need it separated out to easily pull up the apartment or suite of where they need to go when they are on the road and running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
Avoiding Blanks:
I use an analogy to explain to them why they should avoid blanks. I tell them that Access and most databases do not store blanks like Excel does. Excel does not care if you have nothing typed out in the cell and will not increase the file size, but Access will reserve that space until that point in time that you will actually use the field. So even if it is blank, then it will still be using up space and explain to them that it also slows their searches down as well.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype in values if I have to type in the same thing over and over again. This “fat-fingering” of data will lead to my queries not finding the correct data. We instead, will break the data out into a separate table and create a relationship using a primary and foreign key field. This way we are saving space because we are not typing the customer's name, address, etc multiple times and instead are just using the customer's ID number in a field for the customer. We then will discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will setup a drop-down list that will give you a list of customer, where you can select their name and it will fill in the customer’s ID for you. This will be a 1-to-many relationship, whereas 1 customer will have many different orders.
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid similar to this.
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects that they work on. First, if we have a new employee, then they will not have any projects, so we will be wasting all of those fields, second if an employee has been here a long time then they might have worked on 300 projects, so we would have to include 300 project fields. Those people that are new and only have 1 project will have 299 wasted project fields. This design is also flawed because I will have to search in each of the project fields to find all of the people that have worked on a certain project, because that project number could be in any of the project fields.
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarfication/ breaking it down in plain English. The wiki page did not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times but I have found a better overview of normalization from this article. It is a simple easy to understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of
efficiently organizing data in a
database. There are two goals of the
normalization process: eliminating
redundant data (for example, storing
the same data in more than one table)
and ensuring data dependencies make
sense (only storing related data in a
table). Both of these are worthy goals
as they reduce the amount of space a
database consumes and ensure that data
is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in table that will have records inserted but not updated, such as history logs or financial transactions.
References
When not to Normalize your SQL Database
Database Design Basics
+1 for the analogy of talking to your wife. I find talking to anyone without a tech mind needs some ease into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
When in a truly normalized database, I have found that in situations it's been easier to write queries that are slow because of bad join operations, bad indexing on the tables, and plain bad design on the tables themselves.
Bluntly, it's easier to write bad queries in high level normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out a few tables, without having to join to a ton of tables to get one data set.

sql database model for contact details

I need a db to store, i.e. user records. Regular stuff: name, e-mail, address, phone, fax and so on. The problem is, in this case there can be more than one phone number per user. And more than one e-mail. Even more than one address. And a lot more of more-than-one stuff.
One approach is to store everything in one table, e.g. serialized phones array in one phones column. Or comma separated phones in one phones column. But I really dont like this way, I'd rather do overcomplicated database to make programming logic simpler than the other way round.
Another one is individual table for phones, individual table for addresses and so on. Columns: id, customer_id, phone. customer_id references customers.id Now this seems like a real overkill, having around 10 tables just for storing contact details.
And yet another idea I came up is one additional table for contacts with columns like id, customer_id ( <-foreign key), key, value. Where key can be "phone" and value "+123 3435454", or key "e-mail" and value... you got the idea. So far I like this one best.
What would you suggest? What would be downsides of #3 method?
db I'm going to use is postgresql, but it doesnt really matter.
One table row per entity is correct. So one for emails, one for phone numbers etc is correct. it's not overkill: it's normalised database design.
Your option 3 can be done, but, say, what if you want to enforce a certain pattern for phone numbers and email addresses?
Some may say that method #3 is the best. Others will say that it is basically one of the most common anti-patterns in databases - namely EAV, which gets a lot of hate.
As for me - I know too little about your application to suggest solution. Generally - method #2 gives you the most functionality.
There is also method #4, and it's variant - method #5:
4 - use arrays of values - i.e. phone column, instead of being of TEXT database, is TEXT[], and then you can store many phones in it.
5 - since you're on PostgreSQL - use it. There is pretty cool hstore datatype in contrib, which you can use instead of arrays to add some semantics (like type of phone).
Some purists would suggest no 3 to be the way to go. In fact using that approach you could theoretically build a single table database!
However like gbn mentioned, this would cause problems with specific formats, and you would have to enforce data lengths etc on the client only.
I would go with your 2nd suggestion and use the approach of having different tables, with an address type, phone number type etc similar to that shown below.
id number
addrss_type varchar (home/contact/mail etc)
addrs_line1 varchar
addrs_line2 varchar
etc, etc
Downsides:
That violates first normal form, as you'll have many values for one field
As you said, that would be a nightmare to maintain
Seems bets option but could be normalized by using a domain table for that "key" value
Customer (Id, Name)
AdditionalData( Id, Name ) (phone, address, etc)
CustomerAdditionalData(Id, CustomerId, AdditionalDataId, Value)

Best way to model Customer <--> Address

Every Customer has a physical address and an optional mailing address. What is your preferred way to model this?
Option 1. Customer has foreign key to Address
Customer (id, phys_address_id, mail_address_id)
Address (id, street, city, etc.)
Option 2. Customer has one-to-many relationship to Address, which contains a field
to describe the address type
Customer (id)
Address (id, customer_id, address_type, street, city, etc.)
Option 3. Address information is de-normalized and stored in Customer
Customer (id, phys_street, phys_city, etc. mail_street, mail_city, etc.)
One of my overriding goals is to simplify the object-relational mappings, so I'm leaning towards the first approach. What are your thoughts?
I tend towards first approach for all the usual reasons of normalisation. This approach also makes it easier to perform data cleansing on mailing details.
If you are possibly going to allow multiple addresses (mail, residential, etc) or wish to be able to use effective dates, consider this approach
Customer (id, phys_address_id)
Cust_address_type (cust_id, mail_address_id, address_type, start_date, end_date)
Address (id, street, city, etc.)
One important fact you may need to consider (depending on your problem domain) is that people change addresses, and may want to let you know in advance of their address change; this is certainly true for utility companies, telcos, etc.
In this case you need to have a way to store multiple addresses for the customer with validity dates, so that the address can be set up in advance and automatically switch at the correct point. If this is a requirement, then a variation on (2) is the only sensible way to model it, e.g.
Customer (id, ...)
Address (id, customer_id, address_type, valid_from, valid_to)
On the other hand, if you don't need to cater for this (and you're sure you won't in the future) then probably (1) is simpler to manage because it's much easier to maintain data integrity as there's no issues with ensuring only one address of the same type exists, and the joins become simpler as they're only on one field.
So either (1) or (2) are fine depending on whether you need house-moves, but I'd steer clear of (3) because you're then repeating the definition of what an address is in the table, and you'll have to add multiple columns if you change what an address looks like. It's possibly slightly more performant, but to be honest when you're dealing with properly indexed joins in a relational database there isn't a lot to be gained, and it's likely to be slower in some scenarios where you don't need the address as the record size for a customer will be larger.
We are moving forward with a model like this:
Person (id, given_name, family_name, title, suffix, birth_date)
Address (id, culture_id, line1, line2, city, state, zipCode, province, postalCode)
AddressType (id, descriptiveName)
PersonAddress (person_id, address_id, addressType_id, activeDates)
Most may consider this excessive. However, an undeniable common theme amongst the apps we develop is that they will have some of these fundamental entities - People, Organizations, Addresses, Phone Numbers, etc.. - and they all want to combine them in different ways. So, we're building in some generalization up-front that we are 100% certain we have use cases for.
The Address table will follow a table-per-hierarchy inheritance scheme to differentiate addresses based on culture; so a United States address will have a state and zip field, but Canadian addresses will have a province and postal code.
We use a separate connecting table to "give" a person an address. This keeps our other entities - Person & Address - free from ties to other entities when our experience is this tends to complicate matters down the road. It also makes it far simpler to connect Address entities to many other types of entities (People, Organizations, etc.) and with different contextual information associated with the link (like activeDates in my example).
The second option would probably be the way I would go. And on the off-chance it would let users add additional address' (If you wanted to let them do that), that they could switch between at will for shipping and such.
I'd prefer #1. Good normalization and communicates intent clearly. This model also allows the same address object (row) to be used for both addresses, something I have found to be quite valuable. It's far too easy to get lost in duplicating this information too much.
When answering those kinds of questions I like to use the classifications of DDD. If it's a Entity it should have a separate ID, if it's a value object it should not.
Option 3 is too restrictive, and option 1 cannot be extended to allow for other address types without changing the schema.
Option 2 is clearly the most flexible and therefore the best choice.
In most code I write nowadays every customer has one and only one physical location. This is the legal entity beeing our business partner. Therefore I put street, city etc in the customer object/table. Often this is the possible simplest thing that works and it works.
When an additional mailing address is needed, I put it in a separate object/table to not clutter the customer object to much.
Earlier in my career I normalized like mad having an order referencing a customer which references a shipping address. This made things "clean" but slow and inelegant to use. Nowadays I use an order object which just contains all the address information. I actually consider this more natural since a customer might change his (default?) address, but the address of a shipment send in 2007 should always stay the same - even if the customer moves in 2008.
We currently implement the VerySimpleAddressProtocol in out project to standardize the fields used.
I'd go for the first option. In these situations I'm very weary of YAGNI (you aren't going to need it). I can't count the number of times I've looked at schemas that've had one-to-many tables "just incase" that are many years old. If you only need two, just use the first option; if the requirement changes in the future, change it then.
Like in many cases: It depends.
If your customers deal with multiple addresses then a to-many relationship would be appropriate. You could introduce a flag on address that signals if an address is for shipment or bill, etc. Or you store the different address types in different tables and have multiple to-one relationships on a customer.
In cases where you only need to know one address of a customer why would you model that to-many? A to-one relationship would satisfy your needs here.
Important: Denormalize only if you encounter performance issues.
I would go with option 1. If you want to, you could even modify it a little bit to keep an address history:
Customer (id, phys_address_id, mail_address_id)
Address (id, customer_id, start_dt, end_dt, street, city, etc.)
If the address changes, just end date the current address and add a new record in the Address table. The phys_address_id and mail_address_id always point to the current address.
That way you can keep a history of addresses, you could have multiple mailing addresses stored in the database (with the default in mail_address_id), and if the physical address and mailing address are identical you'll just point phys_address_id and mail_address_id at the same record.
Good thread. I have spent a while contemplating the most suitable schema and I have concluded that quentin-starin's solution is the best except I have added start_date and end_date fields to what would be his PersonAddress table. I have also decided to add notes, active and deleted.
deleted is for soft delete functionality as I think I do not want to lose trace of previous addresses simply by deleting the record from the junction table. I think that is quite wise and something others may want to consider. If not done this way, it could be left to revision of paper or electronic documents to try to trace address information (something best avoided).
notes I think of being something of a requirement but that might just be preference. I've spent time in backfill exercises verifying addresses in databases and some addresses can be very vague (such as rural addresses) that I think it is very useful to at least allow notes about that address to be held in the record address.
One thing i would like to hear opinions on is the unique indexing of the address table (again, referring to the table of the same name in quentin-starin's example. Do you think it should be unique index should be enforced (as a compound index presumably across all not-null/required fields)? This would seem sensible but it might still be hard to stop duplicate data regardless as postal/zip codes are not always unique to a single property. Even if the country, province and city fields are populated from reference data (which they are in my model), spelling differences in the address lines may not match up. The only way to best avoid this might be to run one or a number of DB queries from the incoming form fields to see if a possible duplicate has been found. Another safety measure would be give the user the option of selecting from address in the database already linked to that person and use that to auto-populate. I think this might be a case where you can only be sensible and take precautions to stop duplication but just accept it can (and probably will) happen sooner or later.
The other very important aspect of this for me is future editing of the address table records. Lets say you have 2 people both listed at: -
11 Whatever Street
Whatever City
Z1P C0D3
Should it not be considered dangerous to allow the same address table record to be assigned to different entities (person, company)? Then let's say the user realises one of these people lives at 111 Whatever Street and there is a typo. If you change that address, it will change it for both of the entities. I would like to avoid that. My suggestion would be to have the model in the MVC (in my case, PHP Yii2) look for existing address records when a new address is being created known to be related to that customer (SELECT * FROM address INNER JOIN personaddress ON personaddress.address_id = address.id WHERE personaddress.person_id = {current person being edited ID}) and provide the user the option of using that record instead (as was essentially suggested above).
I feel linking the same address to multiple different entities is just asking for trouble as it might be a case of refusing later editing of the address record (impractical) or risking that the future editing of the record may corrupt data related to other entities outside of the one who's address record is being edited.
I would love to hear people's thoughts.

Is this a good way to model address information in a relational database?

I'm wondering if this is a good design. I have a number of tables that require address information (e.g. street, post code/zip, country, fax, email). Sometimes the same address will be repeated multiple times. For example, an address may be stored against a supplier, and then on each purchase order sent to them. The supplier may then change their address and any subsequent purchase orders should have the new address. It's more complicated than this, but that's an example requirement.
Option 1
Put all the address columns as attributes on the various tables. Copy the details down from the supplier to the PO as it is created. Potentially store multiple copies of the
Option 2
Create a separate address table. Have a foreign key from the supplier and purchase order tables to the address table. Only allow insert and delete on the address table as updates could change more than you intend. Then I would have some scheduled task that deletes any rows from the address table that are no longer referenced by anything so unused rows were not left about. Perhaps also have a unique constraint on all the non-pk columns in the address table to stop duplicates as well.
I'm leaning towards option 2. Is there a better way?
EDIT: I must keep the address on the purchase order as it was when sent. Also, it's a bit more complicated that I suggested as there may be a delivery address and a billing address (there's also a bunch of other tables that have address information).
After a while, I will delete old purchase orders en-masse based on their date. It is after this that I was intending on garbage collecting any address records that are not referenced anymore by anything (otherwise it feels like I'm creating a leak).
I actually use this as one of my interview questions. The following is a good place to start:
Addresses
---------
AddressId (PK)
Street1
... (etc)
and
AddressTypes
------------
AddressTypeId
AddressTypeName
and
UserAddresses (substitute "Company", "Account", whatever for Users)
-------------
UserId
AddressTypeId
AddressId
This way, your addresses are totally unaware of how they are being used, and your entities (Users, Accounts) don't directly know anything about addresses either. It's all up to the linking tables you create (UserAddresses in this case, but you can do whatever fits your model).
One piece of somewhat contradictory advice for a potentially large database: go ahead and put a "primary" address directly on your entities (in the Users table in this case) along with a "HasMoreAddresses" field. It seems icky compared to just using the clean design above, but can simplify coding for typical use cases, and the denormalization can make a big difference for performance.
Option 2, without a doubt.
Some important things to keep in mind: it's an important aspect of design to indicate to the users when addresses are linked to one another. I.e. corporate address being the same as shipping address; if they want to change the shipping address, do they want to change the corporate address too, or do they want to specify a new loading dock? This sort of stuff, and the ability to present users with this information and to change things with this sort of granularity is VERY important. This is important, too, about the updates; give the user the granularity to "split" entries. Not that this sort of UI is easy to design; in point of fact, it's a bitch. But it's really important to do; anything less will almost certainly cause your users to get very frustrated and annoyed.
Also; I'd strongly recommend keeping around the old address data; don't run a process to clean it up. Unless you have a VERY busy database, your database software will be able to handle the excess data. Really. One common mistake I see about databases is attempting to overoptimize; you DO want to optimize the hell out of your queries, but you DON'T want to optimize your unused data out. (Again, if your database activity is VERY HIGH, you may need to have something does this, but it's almost a certainty that your database will work well with still having excess data around in the tables.) In most situations, it's actually more advantageous to simply let your database grow than it is to attempt to optimize it. (Deletion of sporadic data from your tables won't cause a significant reduction in the size of your database, and when it does... well, the reindexing that that causes can be a gigantic drain on the database.)
Do you want to keep a historical record of what address was originally on the purchase order?
If yes go with option 1, otherwise store it in the supplier table and link each purchase order to the supplier.
BTW: A sure sign of a poor DB design is the need for an automated job to keep the data "cleaned up" or in synch. Option 2 is likely a bad idea by that measure
I think I agree with JohnFx..
Another thing about (snail-)mail addresses, since you want to include country I assume you want to ship/mail internationally, please keep the address field mostly freeform text. It's really annoying having to make up an 5 digit zip code when Norway don't have zip-codes, we have 4 digit post-numbers.
The best fields would be:
Name/Company
Address (multiline textarea)
Country
This should be pretty global, if the US-postal system require zip-codes in a specific format, then include that too but make it optional unless USA is selected as country. Everyone know how to format the address in their country, so as long as you keep the linebreaks it should be okay...
Why would any of the rows on the address table become unused? Surely they would still be pointed at by the purchase order that used them?
It seems to me that stopping the duplicates should be the priority, thus negating the need for any cleanup.
In the case of orders, you would never want to update the address as the person (or company) address changed if the order has been sent. You meed the record of where the order was actually sent if there is an issue with the order.
The address table is a good idea. Make a unique constraint on it so that the same entity cannot have duplicate addresses. You may still get them as users may add another one instead of looking them up and if they spell things slightly differently (St. instead of Street) the unique constraint won't prevent that. Copy the data at the time the order is created to the order. This is one case where you want the multiple records because you need a historical record of what you sent where. Only allowing inserts and deletes to the table makes no sense to me as they aren't any safer than updates and involve more work for the database. An update is done in one call to the database. If an address changes in your idea, then you must first delete the old address and then insert the new one. Not only more calls to the databse but twice the chance of making a code error.
I've seen every system using option 1 get into data quality trouble. After 5 years 30% of all addresses will no longer be current.