Every Customer has a physical address and an optional mailing address. What is your preferred way to model this?
Option 1. Customer has foreign key to Address
Customer (id, phys_address_id, mail_address_id)
Address (id, street, city, etc.)
Option 2. Customer has one-to-many relationship to Address, which contains a field
to describe the address type
Customer (id)
Address (id, customer_id, address_type, street, city, etc.)
Option 3. Address information is de-normalized and stored in Customer
Customer (id, phys_street, phys_city, etc. mail_street, mail_city, etc.)
One of my overriding goals is to simplify the object-relational mappings, so I'm leaning towards the first approach. What are your thoughts?
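For concreteness, here is a minimal sketch of what Option 1 could look like. It uses Python's built-in sqlite3 module with an in-memory database purely so it can be run as-is; the sample rows and the COALESCE fallback to the physical address are assumptions added for illustration, not part of the question.

```python
import sqlite3

# In-memory SQLite stands in for whatever RDBMS is actually used (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE address (
    id     INTEGER PRIMARY KEY,
    street TEXT NOT NULL,
    city   TEXT NOT NULL
);
CREATE TABLE customer (
    id              INTEGER PRIMARY KEY,
    phys_address_id INTEGER NOT NULL REFERENCES address(id),  -- required
    mail_address_id INTEGER REFERENCES address(id)            -- optional
);
INSERT INTO address (id, street, city) VALUES
    (1, '1 Main St', 'Springfield'),
    (2, 'PO Box 9',  'Springfield');
INSERT INTO customer (id, phys_address_id, mail_address_id) VALUES
    (10, 1, 2),     -- has a separate mailing address
    (11, 1, NULL);  -- no mailing address
""")

# Use the mailing address when present, otherwise the physical one.
rows = conn.execute("""
    SELECT c.id, COALESCE(m.street, p.street) AS mail_to
    FROM customer c
    JOIN address p      ON p.id = c.phys_address_id
    LEFT JOIN address m ON m.id = c.mail_address_id
    ORDER BY c.id
""").fetchall()
print(rows)
```

The optional address is simply a nullable foreign key, which is part of why this option maps easily onto an ORM.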
I tend towards the first approach for all the usual reasons of normalisation. This approach also makes it easier to perform data cleansing on mailing details.
If you might allow multiple addresses (mail, residential, etc.) or want to be able to use effective dates, consider this approach:
Customer (id, phys_address_id)
Cust_address_type (cust_id, mail_address_id, address_type, start_date, end_date)
Address (id, street, city, etc.)
One important fact you may need to consider (depending on your problem domain) is that people change addresses, and may want to let you know in advance of their address change; this is certainly true for utility companies, telcos, etc.
In this case you need to have a way to store multiple addresses for the customer with validity dates, so that the address can be set up in advance and automatically switch at the correct point. If this is a requirement, then a variation on (2) is the only sensible way to model it, e.g.
Customer (id, ...)
Address (id, customer_id, address_type, valid_from, valid_to)
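A rough sketch of how the validity dates might be queried, using Python's sqlite3 module with an in-memory database. The street column, the ISO-formatted text dates, the convention that a NULL valid_to means "open-ended", and the address_on helper are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY);
CREATE TABLE address (
    id           INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customer(id),
    address_type TEXT NOT NULL,
    street       TEXT NOT NULL,   -- hypothetical column
    valid_from   TEXT NOT NULL,   -- ISO dates compare correctly as text
    valid_to     TEXT             -- NULL means open-ended (assumption)
);
INSERT INTO customer (id) VALUES (1);
INSERT INTO address (customer_id, address_type, street, valid_from, valid_to) VALUES
    (1, 'physical', '5 Old Rd',  '2020-01-01', '2024-06-30'),
    (1, 'physical', '8 New Ave', '2024-07-01', NULL);  -- entered ahead of the move
""")

def address_on(customer_id, address_type, day):
    """Return the street valid for the customer on the given ISO date, or None."""
    row = conn.execute("""
        SELECT street FROM address
        WHERE customer_id = ? AND address_type = ?
          AND valid_from <= ?
          AND (valid_to IS NULL OR valid_to >= ?)
    """, (customer_id, address_type, day, day)).fetchone()
    return row[0] if row else None

before_move = address_on(1, "physical", "2024-06-15")
after_move = address_on(1, "physical", "2024-07-15")
print(before_move, after_move)
```

Nothing in this sketch stops two rows of the same type from overlapping in time; that rule would still have to be enforced by the application or by a constraint.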
On the other hand, if you don't need to cater for this (and you're sure you won't in the future), then (1) is probably simpler to manage: it's much easier to maintain data integrity because there are no issues with ensuring that only one address of each type exists, and the joins become simpler because they're on a single field.
So either (1) or (2) is fine, depending on whether you need to handle house moves, but I'd steer clear of (3): you're then repeating the definition of an address within the table, and you'll have to add multiple columns if you change what an address looks like. It's possibly slightly more performant, but to be honest, when you're dealing with properly indexed joins in a relational database there isn't a lot to be gained, and it's likely to be slower in scenarios where you don't need the address, because the customer record will be larger.
We are moving forward with a model like this:
Person (id, given_name, family_name, title, suffix, birth_date)
Address (id, culture_id, line1, line2, city, state, zipCode, province, postalCode)
AddressType (id, descriptiveName)
PersonAddress (person_id, address_id, addressType_id, activeDates)
Most may consider this excessive. However, an undeniable common theme amongst the apps we develop is that they will have some of these fundamental entities (People, Organizations, Addresses, Phone Numbers, etc.) and they all want to combine them in different ways. So we're building in some generalization up front that we are 100% certain we have use cases for.
The Address table will follow a table-per-hierarchy inheritance scheme to differentiate addresses based on culture; so a United States address will have a state and zip field, but Canadian addresses will have a province and postal code.
We use a separate connecting table to "give" a person an address. This keeps our other entities - Person & Address - free from ties to other entities when our experience is this tends to complicate matters down the road. It also makes it far simpler to connect Address entities to many other types of entities (People, Organizations, etc.) and with different contextual information associated with the link (like activeDates in my example).
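As a rough illustration of the connecting table, here is a sketch using Python's sqlite3 module. The snake_case names, the split of activeDates into active_from/active_to, and the sample data are assumptions; the culture-specific Address columns are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person       (id INTEGER PRIMARY KEY, given_name TEXT, family_name TEXT);
CREATE TABLE address      (id INTEGER PRIMARY KEY, line1 TEXT, city TEXT);
CREATE TABLE address_type (id INTEGER PRIMARY KEY, descriptive_name TEXT UNIQUE);
CREATE TABLE person_address (
    person_id       INTEGER NOT NULL REFERENCES person(id),
    address_id      INTEGER NOT NULL REFERENCES address(id),
    address_type_id INTEGER NOT NULL REFERENCES address_type(id),
    active_from     TEXT NOT NULL,  -- assumption: activeDates as two columns
    active_to       TEXT,           -- NULL while the link is still active
    PRIMARY KEY (person_id, address_id, address_type_id, active_from)
);
INSERT INTO person       VALUES (1, 'Ada', 'Lovelace');
INSERT INTO address      VALUES (1, '12 Example St', 'London'), (2, 'PO Box 7', 'London');
INSERT INTO address_type VALUES (1, 'Physical'), (2, 'Mailing');
INSERT INTO person_address VALUES
    (1, 1, 1, '2020-01-01', NULL),
    (1, 2, 2, '2021-03-01', NULL);
""")

# All currently active addresses for one person, labelled by type.
current = conn.execute("""
    SELECT t.descriptive_name, a.line1
    FROM person_address pa
    JOIN address a      ON a.id = pa.address_id
    JOIN address_type t ON t.id = pa.address_type_id
    WHERE pa.person_id = ? AND pa.active_to IS NULL
    ORDER BY t.descriptive_name
""", (1,)).fetchall()
print(current)
```

Linking an Organization would need only another connecting table of the same shape; neither person nor address changes.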
The second option would probably be the way I would go. It would also let users add additional addresses (if you wanted to let them do that) that they could switch between at will for shipping and such.
I'd prefer #1. Good normalization and communicates intent clearly. This model also allows the same address object (row) to be used for both addresses, something I have found to be quite valuable. It's far too easy to get lost in duplicating this information too much.
When answering those kinds of questions I like to use the classifications of DDD. If it's an Entity it should have a separate ID; if it's a value object it should not.
Option 3 is too restrictive, and option 1 cannot be extended to allow for other address types without changing the schema.
Option 2 is clearly the most flexible and therefore the best choice.
In most code I write nowadays every customer has one and only one physical location. This is the legal entity being our business partner. Therefore I put street, city, etc. in the customer object/table. Often this is the simplest possible thing that works, and it works.
When an additional mailing address is needed, I put it in a separate object/table so as not to clutter the customer object too much.
Earlier in my career I normalized like mad, having an order reference a customer which references a shipping address. This made things "clean" but slow and inelegant to use. Nowadays I use an order object which just contains all the address information. I actually consider this more natural, since a customer might change his (default?) address, but the address of a shipment sent in 2007 should always stay the same, even if the customer moves in 2008.
We currently implement the VerySimpleAddressProtocol in our project to standardize the fields used.
I'd go for the first option. In these situations I keep YAGNI (you aren't going to need it) firmly in mind. I can't count the number of times I've looked at schemas, many years old, that have one-to-many tables "just in case". If you only need two, just use the first option; if the requirement changes in the future, change it then.
Like in many cases: It depends.
If your customers deal with multiple addresses then a to-many relationship would be appropriate. You could introduce a flag on address that signals if an address is for shipment or bill, etc. Or you store the different address types in different tables and have multiple to-one relationships on a customer.
In cases where you only need to know one address of a customer why would you model that to-many? A to-one relationship would satisfy your needs here.
Important: Denormalize only if you encounter performance issues.
I would go with option 1. If you want to, you could even modify it a little bit to keep an address history:
Customer (id, phys_address_id, mail_address_id)
Address (id, customer_id, start_dt, end_dt, street, city, etc.)
If the address changes, just end date the current address and add a new record in the Address table. The phys_address_id and mail_address_id always point to the current address.
That way you can keep a history of addresses, you could have multiple mailing addresses stored in the database (with the default in mail_address_id), and if the physical address and mailing address are identical you'll just point phys_address_id and mail_address_id at the same record.
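A minimal sketch of the "end date and re-point" step, using Python's sqlite3 module. The move_physical helper, the ISO text dates and the sample rows are assumptions; the sketch handles the physical address only and assumes the old row is not also the customer's mailing address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    start_dt    TEXT NOT NULL,
    end_dt      TEXT,             -- NULL while the address is current
    street      TEXT NOT NULL,
    city        TEXT NOT NULL
);
CREATE TABLE customer (
    id              INTEGER PRIMARY KEY,
    phys_address_id INTEGER REFERENCES address(id),
    mail_address_id INTEGER REFERENCES address(id)
);
INSERT INTO customer (id) VALUES (1);
INSERT INTO address VALUES (1, 1, '2020-01-01', NULL, '5 Old Rd', 'Leeds');
UPDATE customer SET phys_address_id = 1 WHERE id = 1;
""")

def move_physical(customer_id, street, city, day):
    """End-date the current physical address and point the customer at a new row."""
    with conn:  # one transaction: either every step applies or none does
        (old_id,) = conn.execute(
            "SELECT phys_address_id FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        conn.execute("UPDATE address SET end_dt = ? WHERE id = ?", (day, old_id))
        new_id = conn.execute(
            "INSERT INTO address (customer_id, start_dt, street, city) VALUES (?, ?, ?, ?)",
            (customer_id, day, street, city),
        ).lastrowid
        conn.execute(
            "UPDATE customer SET phys_address_id = ? WHERE id = ?", (new_id, customer_id)
        )
    return new_id

new_id = move_physical(1, "8 New Ave", "Leeds", "2024-07-01")
history = conn.execute(
    "SELECT street, end_dt FROM address WHERE customer_id = 1 ORDER BY id"
).fetchall()
print(history)
```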
Good thread. I have spent a while contemplating the most suitable schema and I have concluded that quentin-starin's solution is the best except I have added start_date and end_date fields to what would be his PersonAddress table. I have also decided to add notes, active and deleted.
deleted is for soft delete functionality as I think I do not want to lose trace of previous addresses simply by deleting the record from the junction table. I think that is quite wise and something others may want to consider. If not done this way, it could be left to revision of paper or electronic documents to try to trace address information (something best avoided).
I consider notes something of a requirement, but that might just be preference. I've spent time in backfill exercises verifying addresses in databases, and some addresses can be so vague (rural addresses, for example) that I think it is very useful to at least allow notes about the address to be held in the record.
One thing I would like to hear opinions on is the unique indexing of the address table (again, referring to the table of the same name in quentin-starin's example). Do you think a unique index should be enforced (presumably as a compound index across all not-null/required fields)? This would seem sensible, but it might still be hard to stop duplicate data, as postal/zip codes are not always unique to a single property. Even if the country, province and city fields are populated from reference data (which they are in my model), spelling differences in the address lines may prevent a match. The only way to avoid this might be to run one or more DB queries on the incoming form fields to see whether a possible duplicate exists. Another safety measure would be to give the user the option of selecting from addresses in the database already linked to that person and use that to auto-populate. I think this might be a case where you can only be sensible and take precautions to stop duplication, but just accept it can (and probably will) happen sooner or later.
The other very important aspect of this for me is future editing of the address table records. Let's say you have 2 people both listed at:
11 Whatever Street
Whatever City
Z1P C0D3
Should it not be considered dangerous to allow the same address table record to be assigned to different entities (person, company)? Then let's say the user realises one of these people lives at 111 Whatever Street and there is a typo. If you change that address, it will change it for both of the entities. I would like to avoid that. My suggestion would be to have the model in the MVC (in my case, PHP Yii2) look for existing address records when a new address is being created known to be related to that customer (SELECT * FROM address INNER JOIN personaddress ON personaddress.address_id = address.id WHERE personaddress.person_id = {current person being edited ID}) and provide the user the option of using that record instead (as was essentially suggested above).
I feel linking the same address to multiple different entities is just asking for trouble: you either have to refuse later editing of the address record (impractical) or risk that a future edit corrupts data for entities other than the one whose address is being edited.
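If shared rows are allowed anyway, one possible guard is a copy-on-write edit: update in place only when a single entity uses the row, otherwise copy the row and relink. This is a different technique from the lookup suggested above, sketched here with Python's sqlite3 module; the correct_line1 helper and the cut-down tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (id INTEGER PRIMARY KEY, line1 TEXT NOT NULL, city TEXT NOT NULL);
CREATE TABLE personaddress (
    person_id  INTEGER NOT NULL,
    address_id INTEGER NOT NULL REFERENCES address(id),
    PRIMARY KEY (person_id, address_id)
);
INSERT INTO address VALUES (1, '11 Whatever Street', 'Whatever City');
INSERT INTO personaddress VALUES (100, 1), (200, 1);  -- two people share row 1
""")

def correct_line1(person_id, address_id, new_line1):
    """Fix line1 for one person without changing anyone else's address."""
    with conn:  # commit on success, roll back on error
        (links,) = conn.execute(
            "SELECT COUNT(*) FROM personaddress WHERE address_id = ?", (address_id,)
        ).fetchone()
        if links <= 1:
            # Only this person uses the row, so an in-place update is safe.
            conn.execute(
                "UPDATE address SET line1 = ? WHERE id = ?", (new_line1, address_id)
            )
            return address_id
        # Shared row: copy it with the fix applied, then relink this person only.
        new_id = conn.execute(
            "INSERT INTO address (line1, city) SELECT ?, city FROM address WHERE id = ?",
            (new_line1, address_id),
        ).lastrowid
        conn.execute(
            "UPDATE personaddress SET address_id = ? WHERE person_id = ? AND address_id = ?",
            (new_id, person_id, address_id),
        )
        return new_id

new_id = correct_line1(100, 1, "111 Whatever Street")
rows = conn.execute("""
    SELECT pa.person_id, a.line1
    FROM personaddress pa
    JOIN address a ON a.id = pa.address_id
    ORDER BY pa.person_id
""").fetchall()
print(rows)
```

The trade-off is that the table accumulates near-duplicate rows, which is the duplication discussed earlier.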
I would love to hear people's thoughts.
Related
I'm not a DBA so I'm not familiar with the proper lingo, so maybe the title of the question could be a little misleading.
So, the thing. I have Members for a certain system, these members can be part of a demographic segment (any kind of segment: favorite color, gender, job, etc)
These are the tables
SegmentCategory
ID, Name, Description
SegmentCategory_segment
SegmentID, SegmentCategoryID
Segment
ID, Name, Description
MemberSegment
ID, MemberID, SegmentID
So the guy that designed the DB decided to go uber normalizing everything so he put the member's gender on a segment and not in the Member's table.
Is this OK? According to my logic, gender is a property of the Member, so it must be on its entity. But doing this means there would be duplicated data (the gender on the Member and gender as a segment). A trigger on the Member table could fix this, though (update the segment on a gender change).
Having to crawl 4 tables just to get a property from the member seems like over engineering to me.
My question is whether I'm right or not? If so, how could I propose the change to the DBA?
There isn't a blanket rule you can apply to database decisions like this. It depends on what applications/processes it is supporting. A database for reporting is much easier to work with when it is more de-normalized (in a well thought out way) than a more transactional database is.
You can have a customer record spread across 2 tables, for instance, if some data is accessed or updated more often than other parts. Say you only need one half of the data in 90% of your queries, but don't want to drag around the varchar(max) fields you have there for whatever reason.
Having said that, having a table with just a gender/memberid is on the far side of extreme. From my naive understanding of your situation I feel you just need a members table with views over top for your segments.
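To make the "members table with views over top" idea concrete, here is a small sketch using Python's sqlite3 module. The member table layout, the view name and the 'Gender: ' label format are assumptions, not the poster's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    gender TEXT             -- stored once, on the entity it describes
);
-- Hypothetical view that exposes gender in a (member, segment) shape,
-- so segment-style queries keep working without a second copy of the data.
CREATE VIEW member_gender_segment AS
    SELECT id AS member_id, 'Gender: ' || gender AS segment_name
    FROM member
    WHERE gender IS NOT NULL;
INSERT INTO member VALUES (1, 'Alex', 'F'), (2, 'Sam', 'M'), (3, 'Kim', NULL);
""")

segments = conn.execute(
    "SELECT member_id, segment_name FROM member_gender_segment ORDER BY member_id"
).fetchall()
print(segments)
```

Because the view is derived from the member table, there is no duplicate copy to keep in sync and no trigger is needed.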
As for the DBA, ultimately I imagine it will be them who will need to maintain the integrity of the data, so I would just approach them and say "hey, what do you think of this?" Hopefully they'll either see the merit or be able to give you reasons for their design decisions.
I think this is a common thing to do... you have a database server, and you want to store customer contact information in it.
You need the person's name, their address, phone, etc.
What are best practices for storing addresses and phones? Assuming OLTP...
Multiple people may have the same phone number (such as wife and husband, or mother and daughter).
Multiple people share a household.
I read this: http://sqlcat.com/sqlcat/b/whitepapers/archive/2008/09/03/best-practices-for-semantic-data-modeling-for-performance-and-scalability.aspx
And that will work fine for the specific model mentioned, but I don't see how this model can be optimized short of denormalizing.
Ex:
Person table = person id, first name, last name, etc...
Address table = address id, address line 1, etc..
Phone table = phone id, phone number, etc...
So if I designed it like that whitepaper suggests, I'd have a personid in my address table and in my phone table. However, since multiple people may share the same address, that isn't feasible. One person may have multiple addresses or even no addresses. So it seems I'll need a person -> address mapping table as well as a mapping table for the phones, otherwise I'll denormalize both of those tables and let there be some duplicates in the unusual case of two people who share the same phone / address.
Anyway, my point in asking this question is because it seems difficult to find a 'best practices' for this type of thing, yet it seems like the type of thing which would come up in just about any type of application or database.
Normalizing addresses and phone numbers in a one-to-many relationship where a Contact may have many related Phone or Address entities makes perfect sense.
However, there is no need to normalize addresses and phone numbers in a many-to-many relationship in a contacts database, because those are not entities you have any interest in working with by themselves, on their own merits as unique entities. In fact, I would say that in your situation, normalizing them to that level is not a good design.
If you were modeling a business in real estate, rentals, or phone service, where you cared about properties and phone numbers even when no person was associated with them, then it could make sense to model them to this level. It is more work for someone to avoid duplicate addresses and phone numbers in the many-to-many design than it is for them to just enter the address again, and there is no real benefit to avoiding these duplicates. Plus, you'll end up with duplicates anyway (at least for addresses, unless you scrub them all real-time using post office routines), so who is going to go through and match up '123 Ascot Wy #5' to '123 Ascot Way Apt 5'? What value is there in that?
The usual reason for normalizing this deep doesn't apply. Let's say that you do create a PhoneNumber table and the PersonPhoneNumber table needed for the many-to-many relationship. You have three people using the same phone number and they are all properly linked to it. Now, one of them calls you up and tells you that he is changing his phone number. Are you sure you want to change the actual PhoneNumber record and update the numbers of the other two folks at the same time? What if they aren't moving with him? Soon you will find that your data is screwed up. You may as well normalize first names to the FirstName table and last names to the LastName table! Then when "Joey" grows up and changes his name to "Joe", all the other Joeys will get an automatic upgrade. But whoops... "Joe" already exists, as does the phone number that you are changing one of the three people above to... what an awful mess.
For another thing, will you use PhoneID as a surrogate key for the phone number? But phone numbers are one of the few things that actually are good as natural keys, they almost even demand being used as natural keys. Then your Phone table becomes meaningless because it doesn't encode any additional information about that phone number. It would just be a list of phone numbers, which are already present in the referencing table. Don't use a Phone table like that. If you want to find out whether two people share the same phone number, you just join on or group by the column! In my mind it approaches silliness to have a layer of abstraction where a phone number is linked to a monotonically-increasing PhoneID.
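A quick sketch of the "just group by the column" point, using Python's sqlite3 module. The person_phone table and the sample numbers are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE person_phone (
    person_id    INTEGER NOT NULL REFERENCES person(id),
    phone_number TEXT NOT NULL,           -- the number itself, no PhoneID
    PRIMARY KEY (person_id, phone_number)
);
INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO person_phone VALUES
    (1, '555-0100'), (2, '555-0100'), (3, '555-0199');
""")

# Which numbers are shared by more than one person?
shared = conn.execute("""
    SELECT phone_number, COUNT(*) AS people
    FROM person_phone
    GROUP BY phone_number
    HAVING COUNT(*) > 1
""").fetchall()

# Bob changes his number; Ann's row is not affected.
conn.execute("UPDATE person_phone SET phone_number = '555-0142' WHERE person_id = 2")
numbers = dict(conn.execute("SELECT person_id, phone_number FROM person_phone"))
print(shared, numbers)
```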
If you read A Universal Person and Organization Model you will see the perspective that phone numbers and addresses in fact aren't entities that need modeling to the level of a many-to-many relationship--they are more like "intelligent locators" that route messages to recipients. Why on earth would you force three different people's locator (a.k.a. phone number) to be identical? The locator helps to locate the person, not the physical phone that rings. You couldn't care less about the phone or who else might answer--you only care about the fact that once answered, the person of interest could possibly be reached.
Normalize.
Normalize until it hurts.
Normalize again until it is excruciating.
Then tune your queries; design your indices; and measure your performance; if at that point you have no other options, denormalize the bare minimum to meet your performance targets.
Remember that every denormalization that speeds performance on one query by its nature degrades performance on (almost) every other operation on that table set. Only keep the denormalization if measurement actually shows a noticeable performance improvement.
Remember that the more you normalize the smaller your indices are; the more index rows sit in cache, and the faster your database performs. Yes, a lot of very small tables get created - they are permanently in cache and thus almost free to access.
Of course I realize that there's no one "right way" to design a SQL database, but I wanted to get some opinions on what is better or worse in my particular scenario.
Currently, I'm designing an order entry module (Windows .NET 4.0 application with SQL Server 2008) and I'm torn between two design decisions when it comes to data that can be applied in more than one spot. In this question I'll refer specifically to Addresses.
Addresses can be used by a variety of objects (orders, customers, employees, shipments, etc..) and they almost always contain the same data (Address1/2/3, City, State, Postal Code, Country, etc). I was originally going to include each of these fields as a column in each of the related tables (e.g. Orders will contain Address1/2/3, City, State, etc.. and Customers will also contain this same column layout). But a part of me wants to apply DRY/Normalization principles to this scenario, i.e. have a table called "Addresses" which is referenced via Foreign Key in the appropriate table.
CREATE TABLE DB.dbo.Addresses
(
    Id         INT          NOT NULL IDENTITY(1, 1) PRIMARY KEY CHECK (Id > 0),
    Address1   VARCHAR(120) NOT NULL,
    Address2   VARCHAR(120),
    Address3   VARCHAR(120),
    City       VARCHAR(100) NOT NULL,
    State      CHAR(2)      NOT NULL,
    Country    CHAR(2)      NOT NULL,
    PostalCode VARCHAR(16)  NOT NULL
)

CREATE TABLE DB.dbo.Orders
(
    Id      INT NOT NULL IDENTITY(1000, 1) PRIMARY KEY
            CHECK (Id >= 1000),  -- >= so the first identity value (1000) is accepted
    Address INT NOT NULL
            CONSTRAINT fk_Orders_Address FOREIGN KEY REFERENCES Addresses(Id)
            CHECK (Address > 0),
    -- other columns....
)

CREATE TABLE DB.dbo.Customers
(
    Id      INT NOT NULL IDENTITY(1000, 1) PRIMARY KEY
            CHECK (Id >= 1000),  -- >= so the first identity value (1000) is accepted
    Address INT NOT NULL
            CONSTRAINT fk_Customers_Address FOREIGN KEY REFERENCES Addresses(Id)
            CHECK (Address > 0),
    -- other columns....
)
From a design standpoint I like this approach because it creates a standard address format that is easily changeable, i.e. if I ever needed to add Address4 I would just add it in one place rather than to every table. However, I can see the number of JOINs required to build queries might get a little insane.
I guess I'm just wondering if any enterprise-level SQL architects out there have ever used this approach successfully, or if the number of JOINs that this creates would create a performance issue?
You're on the right track by breaking address out into its own table. I'd add a couple of additional suggestions.
Consider taking the Address FK columns out of the Customers/Orders tables and creating junction tables instead. In other words, treat Customers/Addresses and Orders/Addresses as many-to-many relationships in your design now so you can easily support multiple addresses in the future. Yes, this means introducing more tables and joins, but the flexibility you gain is well worth the effort.
Consider creating lookup tables for city, state and country entities. The city/state/country columns of the address table then consist of FKs pointing to these lookup tables. This allows you to guarantee consistent spellings across all addresses and gives you a place to store additional metadata (e.g., city population) if needed in the future.
I just have some cautions. For each of these, there's more than one way to fix the problem.
First, normalization doesn't mean "replace text with an id number".
Second, you don't have a key. I know, you have a column declared "PRIMARY KEY", but that's not enough.
insert into Addresses
(Address1, Address2, Address3, City, State, Country, PostalCode)
values
('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500'),
('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500'),
('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500'),
('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500');
select * from Addresses;
1;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
2;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
3;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
4;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
In the absence of any other constraints, your "primary key" identifies a row; it doesn't identify an address. Identifying a row is usually not good enough.
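One of several possible fixes is a uniqueness constraint over the columns that actually describe the address. The sketch below uses Python's sqlite3 module rather than SQL Server; the lowercase column names, the shortened column list and the choice of an empty string instead of NULL for the optional line are assumptions. The empty string matters in SQLite, which treats NULLs as distinct inside a UNIQUE constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE addresses (
    id          INTEGER PRIMARY KEY,       -- still only identifies the row
    address1    TEXT NOT NULL,
    address2    TEXT NOT NULL DEFAULT '',  -- '' instead of NULL so UNIQUE applies
    city        TEXT NOT NULL,
    state       TEXT NOT NULL,
    country     TEXT NOT NULL,
    postal_code TEXT NOT NULL,
    UNIQUE (address1, address2, city, state, country, postal_code)
)
""")

insert = """
    INSERT INTO addresses (address1, address2, city, state, country, postal_code)
    VALUES (?, ?, ?, ?, ?, ?)
"""
row = ("1600 Pennsylvania Avenue NW", "", "Washington", "DC", "US", "20500")

conn.execute(insert, row)
try:
    conn.execute(insert, row)  # same address a second time
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = conn.execute("SELECT COUNT(*) FROM addresses").fetchone()[0]
print(duplicate_rejected, count)
```

This only catches exact duplicates; differently spelled versions of the same address still get through.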
Third, "Address1", "Address2", and "Address3" aren't attributes of addresses. They're attributes of mailing labels. (Lines on a mailing label.) That distinction might not be important to you. It's really important to me.
Fourth, addresses have a lifetime. Between birth and death, they sometimes change. They change when streets get re-routed, buildings get divided, buildings get undivided, and sometimes (I'm pretty sure) when a city employee has a pint too many. Natural disasters can eliminate whole communities. Sometimes buildings get renumbered. In our database, which is tiny compared to most, about 1% per year change like that.
When an address dies, you have to do two things.
Make sure nobody uses that address to mail, ship, or whatever.
Make sure its death doesn't affect historical data.
When an address itself changes, you have to do two things.
Some data must reflect that change. Make sure it does.
Some data must not reflect that change. Make sure it doesn't.
Fifth, DRY doesn't apply to foreign keys. Their whole purpose is to be repeated. The only question is how wide a key? An id number is narrow, but requires a join. (10 id numbers might require 10 joins.) An address is wide, but requires no joins. (I'm talking here about a proper address, not a mailing label.)
That's all I can think of off the top of my head.
I think there is a problem you are not aware of: some of this data is time sensitive. You do not want your records to show that you shipped an order to 35 State St, Chicago IL, when you actually sent it to 10 King Street, Martinsburg WV, and the customer only moved two years after the order was shipped. So yes, build an address table, but make sure an order gets the address as it was at that moment in time: any change of address for someone like a customer should result in a new addressid rather than a change to the existing address row, which would break the history on an order.
You would want the addresses to be in a separate table only if they were entities in their own right. Entities have identity (meaning it matters if two objects pointed to the same address or to different ones), and they have their own lifecycle apart from other entities. If this was the case with your domain, I think it would be totally apparent and you wouldn't have a need to ask this question.
Cade's answer explains the mutability of addresses: something like a shipping address is part of an order and shouldn't be able to change out from under the order it belongs to. This shows that the shipping address doesn't have its own lifecycle. Handling it as if it were a separate entity can only lead to more opportunities for error.
"Normalization" specifically refers to removing redundancies from data so you don't have the same item represented in different places. Here the only redundancy is in the DDL, it's not in the data, so "normalization" is not relevant here. (JPA has the concept of embedded classes that can address the redundancy).
TLDR: Use a separate table if the address is truly an Entity, with its own distinct identity and its own lifecycle. Otherwise don't.
What you have to answer for yourself is the question whether the same address in everyday language is actually the same address in your database. If somebody "changes his address" (colloquially), he really links himself to another address. The address per se only changes when a street is renamed, a zip-code reform takes place or a nuke hits. And those are rare events (hopefully for the most part). There goes your main profit: change in one place for multiple rows (of multiple tables).
If you should actually change an address for that in your model - in the sense of an UPDATE on table address - that may or may not work for other rows that link to it. Also, in my experience, even the exact same address has to look different for different purposes. Understand the semantic differences and you will arrive at the right model that represents your real world best.
I have a number of databases where I use a common table of streets (which uses a table of cities (which uses a table of countries, ...)). In combination with a street number think of it as geocodes (lat/lon), not "street names". Addresses are not shared among different tables (or rows). Changes to street names and zip codes cascade, other changes don't.
You would normally normalise the data as far as possible, so use the table 'Addresses'.
You can then use views to de-normalise the data for querying. Views still use the underlying indexes and give you a simple way to reference the data, whilst leaving the underlying structure fully normalised.
The number of joins shouldn't be a major issue, index based joins aren't too much of an overhead.
It's fine to have a split out addresses table.
However, you have to avoid the temptation of allowing multiple rows to refer to the same address unless you have an appropriate system that lets the user decide whether and how an address change splits out a row for the new address. For example, you have the same address for billing and ship-to, and then a user says their address is changing. To start with, old orders might (should?) need their ship-to addresses retained, so you can't change it in place. But the user might also need to say that the change applies only to the ship-to address.
One should maintain some master tables for City, State and Country. This way one can avoid the different spellings for these entities which might end up with mapping same city with some different state/country.
One can simply map the CityId in the address table as foreign key as shown below, instead of having all the three fields separately (City, State and Country) as plain text in address table itself.
Address: {
CityId
// With other fields
}
City: {
CityId
StateId
// Other fields
}
State: {
StateId
CountryId
// Other fields
}
Country: {
CountryId
// Other fields
}
If one maintains all three ids (CityId, StateId and CountryId) in the address table, you still have to make joins against those tables in the end. Hence my suggestion would be to have only CityId and then retrieve the rest of the required information through joins with the above table structure.
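The join chain this implies might look like the following sketch, written with Python's sqlite3 module. The snake_case column names and the single sample row are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (country_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE state (
    state_id   INTEGER PRIMARY KEY,
    country_id INTEGER NOT NULL REFERENCES country(country_id),
    name       TEXT NOT NULL
);
CREATE TABLE city (
    city_id  INTEGER PRIMARY KEY,
    state_id INTEGER NOT NULL REFERENCES state(state_id),
    name     TEXT NOT NULL
);
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    line1      TEXT NOT NULL,
    city_id    INTEGER NOT NULL REFERENCES city(city_id)  -- the only location FK
);
INSERT INTO country VALUES (1, 'United States');
INSERT INTO state   VALUES (1, 1, 'Illinois');
INSERT INTO city    VALUES (1, 1, 'Chicago');
INSERT INTO address VALUES (1, '35 State St', 1);
""")

# State and country are derived by walking city -> state -> country.
full = conn.execute("""
    SELECT a.line1, ci.name, s.name, co.name
    FROM address a
    JOIN city ci    ON ci.city_id = a.city_id
    JOIN state s    ON s.state_id = ci.state_id
    JOIN country co ON co.country_id = s.country_id
    WHERE a.address_id = ?
""", (1,)).fetchone()
print(full)
```

Because the state and country are derived from the city, an address cannot end up pairing a city with the wrong state.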
I prefer to use an XREF table that contains a FK reference to the person/business table, a FK reference to the address table and, generally, a FK reference to a role table (HOME, OFFICE, etc) to delineate the actual type of address. I also include an ACTIVE flag to allow me to choose to ignore old address while preserving the ability to maintain an address history.
This approach allows me to maintain multiple addresses of varying types for each primary entity
I am busy creating a basic ecommerce website and would like to know what is the best of the following two options regarding the way I store the billing and delivery addresses. I am open to any other suggestions.
I can include the billing address and delivery address in the Order table:
order
-------
billing_name
billing_address
billing_state
shipping_name
shipping_address
shipping_state
Otherwise I can create another table that will just store addresses for orders:
order
-------
billing_address_id
shipping_address_id
order_address
-------
address_id
name
address
state
I would usually choose the second. This will let you have many different addresses for a customer of different types. But I would normally address this at the customer level first, then address the orders and invoices.
However, you may need to address the nature of your order workflow/business rules.
Once an order is completed, is it a document (like an invoice)? If so, then the address should be locked in at that time and cannot be altered; otherwise you may not be able to re-present the original document.
When a customer changes their billing address, does the billing address of an old order even matter anymore? In this case, the billing address does not even need to be linked from the order, only from the customer. If you were to re-present the orders for payment, you would present them to their current billing address.
Personally, I like neither of your solutions although the second solution is "righter" in terms of database theory. If you have repeating addresses you should store them once.
The problem comes in implementation. When an order is placed, you are going to have to make a decision whether you want to use an existing address, update an existing address (for instance, with a newly-added apartment number) or create a new address (the customer has moved, has a new summer address, whatever).
In order to do this, someone (an employee for direct or phone sales, the customer or the program for on-line sales) will have to make a decision as to whether you're performing an address update or address addition operation. It's very difficult to get users to make this kind of decision accurately. If an update is performed when an addition was really needed, you've corrupted your order history (the older orders point to the new address). If an addition is performed when an update was the correct choice, you've eliminated the value of the normalized structure.
In situations like this I've come, not entirely happily, to the conclusion that the best option is to store one or more addresses for the customer and then copy the address information into address fields in the order itself.
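A minimal sketch of this copy-on-order approach, using Python's built-in sqlite3 module. The table and column names (customer_address, orders, ship_name and so on) and the place_order helper are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Address book: one or more reusable addresses per customer.
    CREATE TABLE customer_address (
        address_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        name        TEXT NOT NULL,
        address     TEXT NOT NULL,
        state       TEXT NOT NULL
    );
    -- The order carries its own copy of the shipping address.
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL,
        ship_name    TEXT NOT NULL,
        ship_address TEXT NOT NULL,
        ship_state   TEXT NOT NULL
    );
""")


def place_order(conn, customer_id, address_id):
    """Create an order, copying the chosen address into the order row."""
    cur = conn.execute(
        "INSERT INTO orders (customer_id, ship_name, ship_address, ship_state) "
        "SELECT customer_id, name, address, state FROM customer_address "
        "WHERE address_id = ? AND customer_id = ?",
        (address_id, customer_id),
    )
    if cur.rowcount != 1:
        raise ValueError("address not found for this customer")
    return cur.lastrowid


conn.execute(
    "INSERT INTO customer_address VALUES (1, 10, 'A. Customer', '1 Main St', 'NY')"
)
order_id = place_order(conn, customer_id=10, address_id=1)

# A later edit to the address book does not touch the order's copy.
conn.execute("UPDATE customer_address SET address = '2 Oak Ave' WHERE address_id = 1")
shipped_to = conn.execute(
    "SELECT ship_address FROM orders WHERE order_id = ?", (order_id,)
).fetchone()[0]
print(shipped_to)
```

Because the order row holds its own copy, it no longer matters whether the user updates the address book entry or adds a new one; the order history stays as it was when the order was placed.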
If you choose your second option, you need to plan on writing a really good user interface to the address system to avoid the kind of problems I mentioned above. And remember that not only you, but every programmer who works on the project in the future is going to have to understand and agree on the management of that address table.
Pulling the addresses into separate tables is more normalized, but be careful. If you allow the addresses to be updated, you can lose track of where the order was originally intended to be billed to / shipped to.
Edit: Based on the new comment, where you are going to copy from an address book to an order_address table, it might keep your order table cleaner to do so, but if you are going to duplicate the data anyway, I would say copy it to the record it belongs to.
--
Both, denormalize the shipping, and keep them with the order. Storage is cheap, and it's easier than managing a bunch of extra data in the address table. But keep the addresses split out so customers don't have to enter them over again. If you denormalize, then you don't have to keep explicit records in your addresses table for them, and worry about doing soft deletes.
Don't underestimate the complexity of managing addresses. If I enter an address into your system, and associate it with my account, and then realize some part of it is wrong, then you need to remove the old one and create a new one. The removal can be a soft delete, but it needs to be removed. You can try to decide if I was putting in a new address, or dramatically changing the old one. Or you can only allow adding and removing addresses. But when operations happen with addresses, the previous orders need to maintain the data that was assigned to them in the first place. Editing an address that is already associated with an order would modify where the order says it was sent to, after it has already been sent. Make sure you think through these scenarios. There are several ways to solve the potential problems, but your decision is really based on how you want to handle these situations. If you denormalize and copy the address information to the order once it is placed, then editing addresses in the address table becomes less of an issue. Decide how to handle these situations, and your database schema just needs to support that. Either choice works.
I would keep the addresses in a separate table and reference them from the orders. I would include a "CurrentAddress" attribute, so that an end user can "delete" that address from their list of current addresses. The value would still exist in the table so previous orders could reference the address for historical purposes, but it would no longer be a selectable address at order time.
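A rough sketch of that flag in Python with sqlite3. The is_current column stands in for the "CurrentAddress" attribute; the table layout and the two helper functions are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE address (
        address_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        street      TEXT NOT NULL,
        is_current  INTEGER NOT NULL DEFAULT 1  -- the "CurrentAddress" flag
    );
    CREATE TABLE orders (
        order_id            INTEGER PRIMARY KEY,
        shipping_address_id INTEGER NOT NULL REFERENCES address (address_id)
    );
""")


def selectable_addresses(conn, customer_id):
    """Addresses the customer may pick at order time (current ones only)."""
    rows = conn.execute(
        "SELECT address_id FROM address "
        "WHERE customer_id = ? AND is_current = 1 ORDER BY address_id",
        (customer_id,),
    )
    return [row[0] for row in rows]


def remove_address(conn, address_id):
    """Soft delete: hide the address from the pick list but keep the row."""
    conn.execute(
        "UPDATE address SET is_current = 0 WHERE address_id = ?", (address_id,)
    )


conn.execute(
    "INSERT INTO address (address_id, customer_id, street) VALUES (1, 10, '1 Main St')"
)
conn.execute(
    "INSERT INTO address (address_id, customer_id, street) VALUES (2, 10, '2 Oak Ave')"
)
conn.execute("INSERT INTO orders (order_id, shipping_address_id) VALUES (100, 1)")

remove_address(conn, 1)
current = selectable_addresses(conn, 10)

# The old order can still resolve the address it was shipped to.
historical = conn.execute(
    "SELECT a.street FROM orders o JOIN address a "
    "ON a.address_id = o.shipping_address_id WHERE o.order_id = 100"
).fetchone()[0]
```

Note that the flag only covers removal. If rows in the address table can also be edited in place, old orders would still see the edited values.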
The second method has a couple advantages over the first. It is easier simply to have the addresses be the same, as they often will be, with less possibility of error. Also, if you ever save addresses within an account, the second method will give you an easier time. That said, you need to verify that a given address actually belongs to the same account as the order, which you might do by including a customer_id field in both the order and order_address tables, then including customer_id in the primary key of order_address and the foreign key from order to order_address.
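The ownership check described above might look roughly like this in Python with sqlite3. The column list follows the order and order_address tables from the question; adding customer_id to both tables, and the sample rows, are assumptions for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are off by default in SQLite
conn.executescript("""
    CREATE TABLE order_address (
        customer_id INTEGER NOT NULL,
        address_id  INTEGER NOT NULL,
        name        TEXT,
        address     TEXT,
        state       TEXT,
        PRIMARY KEY (customer_id, address_id)
    );
    CREATE TABLE orders (
        order_id            INTEGER PRIMARY KEY,
        customer_id         INTEGER NOT NULL,
        billing_address_id  INTEGER NOT NULL,
        shipping_address_id INTEGER NOT NULL,
        -- Both references include customer_id, so an order can only point
        -- at addresses owned by the same customer.
        FOREIGN KEY (customer_id, billing_address_id)
            REFERENCES order_address (customer_id, address_id),
        FOREIGN KEY (customer_id, shipping_address_id)
            REFERENCES order_address (customer_id, address_id)
    );
    INSERT INTO order_address VALUES (10, 1, 'A. Customer', '1 Main St', 'NY');
    INSERT INTO order_address VALUES (20, 2, 'B. Customer', '9 Elm Rd', 'CA');
""")

# Same address for billing and shipping, owned by customer 10: accepted.
conn.execute("INSERT INTO orders VALUES (100, 10, 1, 1)")

# Address 2 belongs to customer 20, so customer 10 cannot ship to it.
try:
    conn.execute("INSERT INTO orders VALUES (101, 10, 1, 2)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

With this layout the database itself rejects a mismatched order, rather than relying on every piece of application code to check ownership.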
It depends whether addresses are re-used.
If you have a "registered customer" table, you should definitely go for the option with "delivery_adress", "billing_adress", etc. tables, each record of them being linked to a customer.
A table with the billing and shipping address ids will serve better if you plan to register users on your site; then you can place the order in another table and use the ids you already have to correlate data between order info <=> billing/shipping addresses.
I'm wondering if this is a good design. I have a number of tables that require address information (e.g. street, post code/zip, country, fax, email). Sometimes the same address will be repeated multiple times. For example, an address may be stored against a supplier, and then on each purchase order sent to them. The supplier may then change their address and any subsequent purchase orders should have the new address. It's more complicated than this, but that's an example requirement.
Option 1
Put all the address columns as attributes on the various tables. Copy the details down from the supplier to the PO as it is created. Potentially store multiple copies of the same address.
Option 2
Create a separate address table. Have a foreign key from the supplier and purchase order tables to the address table. Only allow insert and delete on the address table as updates could change more than you intend. Then I would have some scheduled task that deletes any rows from the address table that are no longer referenced by anything so unused rows were not left about. Perhaps also have a unique constraint on all the non-pk columns in the address table to stop duplicates as well.
I'm leaning towards option 2. Is there a better way?
EDIT: I must keep the address on the purchase order as it was when sent. Also, it's a bit more complicated than I suggested as there may be a delivery address and a billing address (there's also a bunch of other tables that have address information).
After a while, I will delete old purchase orders en masse based on their date. It is after this that I was intending on garbage collecting any address records that are not referenced anymore by anything (otherwise it feels like I'm creating a leak).
I actually use this as one of my interview questions. The following is a good place to start:
Addresses
---------
AddressId (PK)
Street1
... (etc)
and
AddressTypes
------------
AddressTypeId
AddressTypeName
and
UserAddresses (substitute "Company", "Account", whatever for Users)
-------------
UserId
AddressTypeId
AddressId
This way, your addresses are totally unaware of how they are being used, and your entities (Users, Accounts) don't directly know anything about addresses either. It's all up to the linking tables you create (UserAddresses in this case, but you can do whatever fits your model).
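As a sketch, the three tables above could be created and queried like this with Python's sqlite3 module. The sample rows, the address_for helper, and the choice of (UserId, AddressTypeId) as the key of the linking table are assumptions, not part of the design as stated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Addresses (
        AddressId INTEGER PRIMARY KEY,
        Street1   TEXT NOT NULL
    );
    CREATE TABLE AddressTypes (
        AddressTypeId   INTEGER PRIMARY KEY,
        AddressTypeName TEXT NOT NULL UNIQUE
    );
    CREATE TABLE UserAddresses (
        UserId        INTEGER NOT NULL,
        AddressTypeId INTEGER NOT NULL REFERENCES AddressTypes (AddressTypeId),
        AddressId     INTEGER NOT NULL REFERENCES Addresses (AddressId),
        PRIMARY KEY (UserId, AddressTypeId)  -- assumption: one address per type
    );
    INSERT INTO Addresses VALUES (1, '1 Main St');
    INSERT INTO Addresses VALUES (2, 'PO Box 42');
    INSERT INTO AddressTypes VALUES (1, 'Physical');
    INSERT INTO AddressTypes VALUES (2, 'Mailing');
    INSERT INTO UserAddresses VALUES (7, 1, 1);
    INSERT INTO UserAddresses VALUES (7, 2, 2);
""")


def address_for(conn, user_id, type_name):
    """Look up a user's address of a given type through the linking table."""
    row = conn.execute(
        "SELECT a.Street1 FROM UserAddresses ua "
        "JOIN AddressTypes t ON t.AddressTypeId = ua.AddressTypeId "
        "JOIN Addresses a ON a.AddressId = ua.AddressId "
        "WHERE ua.UserId = ? AND t.AddressTypeName = ?",
        (user_id, type_name),
    ).fetchone()
    return row[0] if row else None


mailing = address_for(conn, 7, "Mailing")
billing = address_for(conn, 7, "Billing")  # no such type for this user
```

Adding a new kind of entity (say, accounts) would mean adding another linking table with the same shape, leaving Addresses untouched.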
One piece of somewhat contradictory advice for a potentially large database: go ahead and put a "primary" address directly on your entities (in the Users table in this case) along with a "HasMoreAddresses" field. It seems icky compared to just using the clean design above, but can simplify coding for typical use cases, and the denormalization can make a big difference for performance.
Option 2, without a doubt.
Some important things to keep in mind: it's an important aspect of design to indicate to the users when addresses are linked to one another. I.e. corporate address being the same as shipping address; if they want to change the shipping address, do they want to change the corporate address too, or do they want to specify a new loading dock? This sort of stuff, and the ability to present users with this information and to change things with this sort of granularity is VERY important. This is important, too, about the updates; give the user the granularity to "split" entries. Not that this sort of UI is easy to design; in point of fact, it's a bitch. But it's really important to do; anything less will almost certainly cause your users to get very frustrated and annoyed.
Also, I'd strongly recommend keeping around the old address data; don't run a process to clean it up. Unless you have a VERY busy database, your database software will be able to handle the excess data. Really. One common mistake I see about databases is attempting to overoptimize; you DO want to optimize the hell out of your queries, but you DON'T want to optimize your unused data out. (Again, if your database activity is VERY HIGH, you may need to have something that does this, but it's almost a certainty that your database will work well with still having excess data around in the tables.) In most situations, it's actually more advantageous to simply let your database grow than it is to attempt to optimize it. (Deletion of sporadic data from your tables won't cause a significant reduction in the size of your database, and when it does... well, the reindexing that that causes can be a gigantic drain on the database.)
Do you want to keep a historical record of what address was originally on the purchase order?
If yes go with option 1, otherwise store it in the supplier table and link each purchase order to the supplier.
BTW: A sure sign of a poor DB design is the need for an automated job to keep the data "cleaned up" or in synch. Option 2 is likely a bad idea by that measure.
I think I agree with JohnFx.
Another thing about (snail-)mail addresses: since you want to include country, I assume you want to ship/mail internationally, so please keep the address field mostly freeform text. It's really annoying having to make up a 5-digit zip code when Norway doesn't have zip codes; we have 4-digit post numbers.
The best fields would be:
Name/Company
Address (multiline textarea)
Country
This should be pretty global. If the US postal system requires zip codes in a specific format, then include that too, but make it optional unless USA is selected as the country. Everyone knows how to format the address in their country, so as long as you keep the line breaks it should be okay...
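A small Python sketch of that rule: three required free-form fields plus a postal code that is only required when the country is the USA. The PostalAddress class, the validate function and the accepted country spellings are made up for the example, and no format check is applied to the code itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PostalAddress:
    name: str                          # name or company
    address: str                       # free-form, multi-line text
    country: str
    postal_code: Optional[str] = None  # optional outside the USA


def validate(addr: PostalAddress) -> List[str]:
    """Return a list of problems; an empty list means the address is accepted."""
    errors = []
    if not addr.name.strip():
        errors.append("name is required")
    if not addr.address.strip():
        errors.append("address is required")
    if not addr.country.strip():
        errors.append("country is required")
    # Assumption: these spellings are how the USA appears in the country field.
    is_usa = addr.country.strip().upper() in {"US", "USA"}
    if is_usa and not (addr.postal_code or "").strip():
        errors.append("postal code is required for US addresses")
    return errors


norway = PostalAddress("Ola Nordmann", "Storgata 1\n0155 Oslo", "Norway")
usa_missing_zip = PostalAddress("A. Customer", "1 Main St\nSpringfield", "USA")

norway_errors = validate(norway)
usa_errors = validate(usa_missing_zip)
```

The line breaks in the address field are stored as entered, so the address can be printed the way the customer wrote it.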
Why would any of the rows on the address table become unused? Surely they would still be pointed at by the purchase order that used them?
It seems to me that stopping the duplicates should be the priority, thus negating the need for any cleanup.
In the case of orders, you would never want to update the address as the person (or company) address changed if the order has been sent. You need the record of where the order was actually sent if there is an issue with the order.
The address table is a good idea. Make a unique constraint on it so that the same entity cannot have duplicate addresses. You may still get them, as users may add another one instead of looking them up, and if they spell things slightly differently (St. instead of Street) the unique constraint won't prevent that. Copy the data at the time the order is created to the order. This is one case where you want the multiple records because you need a historical record of what you sent where. Only allowing inserts and deletes to the table makes no sense to me as they aren't any safer than updates and involve more work for the database. An update is done in one call to the database. If an address changes in your idea, then you must first delete the old address and then insert the new one. That means not only more calls to the database but twice the chance of making a code error.
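To illustrate the limit of the unique constraint mentioned above, here is a short sketch in Python with sqlite3. The address table, its columns and the try_add helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        entity_id  INTEGER NOT NULL,
        street     TEXT NOT NULL,
        city       TEXT NOT NULL,
        UNIQUE (entity_id, street, city)
    );
""")


def try_add(conn, entity_id, street, city):
    """Return True if the row was inserted, False if the constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO address (entity_id, street, city) VALUES (?, ?, ?)",
            (entity_id, street, city),
        )
        return True
    except sqlite3.IntegrityError:
        return False


first = try_add(conn, 1, "12 High Street", "Leeds")
exact_duplicate = try_add(conn, 1, "12 High Street", "Leeds")  # rejected
near_duplicate = try_add(conn, 1, "12 High St.", "Leeds")      # slips through
```

The exact repeat is rejected, but the "St." spelling is stored as a second row for the same entity, which is why the constraint alone cannot guarantee an address book free of duplicates.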
I've seen every system using option 1 get into data quality trouble. After 5 years 30% of all addresses will no longer be current.