Address book DB schema

I need to store contact information for users. I want to present this data on the page as an hCard and downloadable as a vCard. I'd also like to be able to search the database by phone number, email, etc.
What do you think is the best way to store this data? Since users could have multiple addresses, etc., complete normalization would be a mess. I'm thinking about using XML, but I'm not familiar with querying XML DB fields. Would I still be able to search for users by contact info?
I'm using SQL Server 2005, if that matters.

Consider two tables for People and their addresses:
People (pid, prefix, firstName, lastName, suffix, DOB, ... primaryAddressTag )
AddressBook (pid, tag, address1, address2, city, stateProv, postalCode, ... )
The Primary Key (that uniquely identifies each and every row) of People is pid. The PK of AddressBook is the composition of pid and tag (pid, tag).
Some example data:
People
1, Kirk
2, Spock
AddressBook
1, home, '123 Main Street', Iowa
1, work, 'USS Enterprise NCC-1701'
2, other, 'Mt. Selaya, Vulcan'
In this example, Kirk has two addresses: one 'home' and one 'work'. One of those two can (and should) be noted as a foreign key (like a cross-reference) in People in the primaryAddressTag column.
Spock has a single address with the tag 'other'. Since that is Spock's only address, the value 'other' ought to go in the primaryAddressTag column for pid=2.
This schema has the nice effect of preventing a person from duplicating any of their own addresses by accidentally reusing tags, while still allowing all other people to use any address tags they like.
Further, with FK references in primaryAddressTag, the database system itself will enforce the validity of the primary address tag (via something we database geeks call referential integrity) so that your -- or any -- application need not worry about it.
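For concreteness, here's a minimal sketch of that schema in T-SQL (the question mentions SQL Server 2005); the column types and lengths are my assumptions, and the composite foreign key at the end is what makes the database enforce the primary-address rule:

CREATE TABLE People (
    pid               INT         NOT NULL PRIMARY KEY,
    firstName         VARCHAR(50) NOT NULL,
    lastName          VARCHAR(50) NOT NULL,
    primaryAddressTag VARCHAR(10) NULL  -- NULL until an address exists; FKs ignore NULLs
);

CREATE TABLE AddressBook (
    pid      INT          NOT NULL REFERENCES People(pid),
    tag      VARCHAR(10)  NOT NULL,  -- 'home', 'work', 'other', ...
    address1 VARCHAR(120) NOT NULL,
    city     VARCHAR(100) NULL,
    PRIMARY KEY (pid, tag)  -- the same person cannot reuse a tag
);

-- primaryAddressTag must name one of this person's own addresses.
ALTER TABLE People
    ADD CONSTRAINT fk_People_PrimaryAddress
    FOREIGN KEY (pid, primaryAddressTag)
    REFERENCES AddressBook (pid, tag);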

Why would complete normalization "be a mess"? This is exactly the kind of thing that normalization makes less messy.

Don't be afraid of normalizing your data. Normalization, as John mentions, is the solution, not the problem. If you try to denormalize your data just to avoid a couple of joins, you're going to cause yourself serious trouble in the future. Trying to refactor this sort of data down the line, after you have a reasonably sized dataset, WILL NOT BE FUN.
I strongly suggest you check out Highrise from 37signals. It was recently recommended to me when I was looking for an online contact manager. It does so much right. Actually, my only objection so far with the service is that I think the paid versions are too expensive -- that's all.
As things stand today, I do not fit into a flat address profile. I have 4-5 e-mail addresses that I use regularly, 5 phone numbers, 3 addresses, several websites and IM profiles, all of which I would include in my contact profile. If you're starting to build a contact management system now and you're unencumbered by architectural limitations (think Gmail contacts being keyed to a single email address), then do your users a favor and make your contact structure as flexible (normalized) as possible.
Cheers, -D.

I'm aware of SQLite, but that doesn't really help - I'm talking about figuring out the best schema (regardless of the database) for storing this data.

Per John, I don't see what the problem with a classic normalised schema would be. You haven't given much information to go on, but you say that there's a one-to-many relationship between users and addresses, so I'd plump for a bog standard solution with a foreign key to the user in the address relation.

If you assume each user has one or more addresses, a telephone number, etc., you could have a Users table and an Addresses table (with its own primary key and a non-unique reference to Users), and the same for phone numbers. Allowing multiple rows with the same UserID foreign key makes querying 'all addresses for user X' quite simple, as the sketch below shows.
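A rough sketch of that layout in T-SQL (the table and column names here are my own, not from the question):

CREATE TABLE Users (
    UserID   INT IDENTITY(1,1) PRIMARY KEY,
    UserName VARCHAR(50) NOT NULL
);

CREATE TABLE Addresses (
    AddressID INT IDENTITY(1,1) PRIMARY KEY,
    UserID    INT NOT NULL REFERENCES Users(UserID),  -- non-unique: many rows per user
    Address1  VARCHAR(120) NOT NULL,
    City      VARCHAR(100) NULL
);

-- 'All addresses for user X' is then a one-table lookup:
SELECT * FROM Addresses WHERE UserID = 42;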

I don't have a script, but I do have some MySQL you can use. Before that, I should mention that there seem to be two logical approaches to storing vCards in SQL:
Store the whole card, let the database search the (possibly huge) text strings, and process them in another part of your code or even client side, e.g.
CREATE TABLE IF NOT EXISTS vcards (
  name_or_letter varchar(250) NOT NULL,
  vcard text NOT NULL,
  timestamp timestamp default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY (name_or_letter)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Probably easy to implement (depending on what you are doing with the data), though your searches are going to be slow if you have many entries.
This might work if it is just for you (though if it is any good, it is never just for you). You can then process the vCard client side or server side using some beautiful module that you share (or someone else shared with you).
I've watched vCard evolve and know that there is going to be some change at /some/ time in the future, so I use three tables.
The first is the card (this mostly links back to my existing tables; if you don't need that, yours can be a cut-down version).
The second is the card definitions (which seem to be called a profile in vCard speak).
The last is all the actual data for the cards.
Because I let DBIx::Class (yes, I'm one of those) do all of the database work, this three-table setup seems to work rather well for me (though obviously you can tighten up the types to match RFC 2426 more closely; for the most part each piece of data is just a text string).
The reason that I don't normalize the address out of the person is that I already have an address table in my database, and these three tables are just for non-user contact details.
CREATE TABLE `vCards` (
  `card_id` int(255) unsigned NOT NULL AUTO_INCREMENT,
  `card_peid` int(255) DEFAULT NULL COMMENT 'link back to user table',
  `card_acid` int(255) DEFAULT NULL COMMENT 'link back to account table',
  `card_language` varchar(5) DEFAULT NULL COMMENT 'en en_GB',
  `card_encoding` varchar(32) DEFAULT 'UTF-8' COMMENT 'why use anything else?',
  `card_created` datetime NOT NULL,
  `card_updated` datetime NOT NULL,
  PRIMARY KEY (`card_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='These are the contact cards';
create table vCard_profile (
  vcprofile_id int(255) unsigned auto_increment NOT NULL,
  vcprofile_version enum('rfc2426') DEFAULT "rfc2426" COMMENT "defaults to vCard 3.0",
  vcprofile_feature char(16) COMMENT "FN to CATEGORIES",
  vcprofile_type enum('text','bin') DEFAULT "text" COMMENT "if it is too large for vcd_value then use vcd_bin",
  PRIMARY KEY (`vcprofile_id`)
) COMMENT "These are the valid types of card entry";
INSERT INTO vCard_profile VALUES
  ('','rfc2426','FN','text'),('','rfc2426','N','text'),('','rfc2426','NICKNAME','text'),
  ('','rfc2426','PHOTO','bin'),('','rfc2426','BDAY','text'),('','rfc2426','ADR','text'),
  ('','rfc2426','LABEL','text'),('','rfc2426','TEL','text'),('','rfc2426','EMAIL','text'),
  ('','rfc2426','MAILER','text'),('','rfc2426','TZ','text'),('','rfc2426','GEO','text'),
  ('','rfc2426','TITLE','text'),('','rfc2426','ROLE','text'),('','rfc2426','LOGO','bin'),
  ('','rfc2426','AGENT','text'),('','rfc2426','ORG','text'),('','rfc2426','CATEGORIES','text'),
  ('','rfc2426','NOTE','text'),('','rfc2426','PRODID','text'),('','rfc2426','REV','text'),
  ('','rfc2426','SORT-STRING','text'),('','rfc2426','SOUND','bin'),('','rfc2426','UID','text'),
  ('','rfc2426','URL','text'),('','rfc2426','VERSION','text'),('','rfc2426','CLASS','text'),
  ('','rfc2426','KEY','bin');
create table vCard_data (
  vcd_id int(255) unsigned auto_increment NOT NULL,
  vcd_card_id int(255) NOT NULL,
  vcd_profile_id int(255) NOT NULL,
  vcd_prof_detail varchar(255) COMMENT "work,home,preferred,order for e.g. multiple email addresses",
  vcd_value varchar(255),
  vcd_bin blob COMMENT "for when varchar(255) is too small",
  PRIMARY KEY (`vcd_id`)
) COMMENT "The actual vCard data";
This isn't the best SQL but I hope that helps.
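For what it's worth, a card can be pulled back out of those three tables with a join along these lines (a sketch against the tables above; card 1 is just an example id):

SELECT p.vcprofile_feature, d.vcd_prof_detail, d.vcd_value, d.vcd_bin
FROM vCard_data d
JOIN vCard_profile p ON p.vcprofile_id = d.vcd_profile_id
WHERE d.vcd_card_id = 1
ORDER BY d.vcd_id;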


Adding an artificial primary key versus using a unique field

Recently I inherited a huge app from somebody who left the company.
This app uses a SQL Server DB.
The previous developer always defined an int-based primary key on tables. For example, even if the Users table has a unique UserName field, he always added an integer identity primary key.
This was done for every table, no matter whether other fields could be unique and define the primary key.
Do you see any benefit whatsoever in this? Using UserName as the primary key vs adding UserID (an identity column) and setting that as the primary key?
I feel like I have to add another element to my comments, which were starting to turn into an essay, so I think it is better to post it all as an answer instead.
Sometimes there are domain specific reasons why a candidate key is not a good candidate for joins (maybe people change user names so often that the required cascades start causing performance problems). But another reason to add an ever-increasing surrogate is to make it the clustered index. A static and ever-increasing clustered index alleviates a high-cost IO operation known as a page split. So even with a good natural candidate key, it can be useful to add a surrogate and cluster on that. Read this for further details.
But if you add such a surrogate, recognise that the surrogate is purely internal, it is there for performance reasons only. It does not guarantee the integrity of your data. It has no meaning in the model, unless it becomes part of the model. For example, if you are generating invoice numbers as an identity column, and sending those values out into the real world (on invoice documents/emails/etc), then it's not a surrogate, it's part of the model. It can be meaningfully referenced by the customer who received the invoice, for example.
One final thing that is typically left out of this discussion is one particular aspect of join performance. It is often said that the primary key should also be narrow, because it can make joins more performant, as well as reducing the size of non-clustered indexes. And that's true.
But a natural primary key can eliminate the need for a join in the first place.
Let's put all this together with an example:
create table Countries
(
countryCode char(2) not null primary key clustered,
countryName varchar(64) not null
);
insert Countries values
('AU', 'Australia'),
('FR', 'France');
create table TourLocations
(
tourLocationName varchar(64) not null,
tourLocationId int identity(1,1) unique clustered,
countryCode char(2) not null foreign key references Countries(countryCode),
primary key (countryCode, tourLocationName)
);
insert TourLocations (TourLocationName, countryCode) values
('Bondi Beach', 'AU'),
('Eiffel Tower', 'FR')
I did not add a surrogate key to Countries, because there aren't many rows and we're not going to be constantly inserting new rows. I already know what all the countries are, and they don't change very often.
On the TourLocations table I have added an identity and clustered on it. There could be very many tour locations, changing all the time.
But I still must have a natural key on TourLocations. Otherwise I could insert the same tour location name with the same country twice. Sure, the Id's will be different. But the Id's don't mean anything. As far as any real human is concerned, two tour locations with the same name and country code are completely indistinguishable. Do you intend to have actual users using the system? Then you've got a problem.
By putting the same country and location name in twice I haven't created two facts in my database. I have created the same fact twice! No good. The natural key is necessary. In this sense The Impaler's answer is strictly, necessarily, wrong. You cannot not have a natural key. If the natural key can't be defined as anything other than "every meaningful column in the table" (that is to say, excluding the surrogate), so be it.
OK, now let's investigate the claim that an int identity key is advantageous because it helps with joins. Well, in this case my char(2) country code is narrower than an int would have been.
But even if it wasn't (maybe we think we can get away with a tinyint), those country codes are meaningful to real people, which means a lot of the time I don't have to do the join at all.
Suppose I gave the results of this query to my users:
select countryCode, tourLocationName
from TourLocations
order by 1, 2;
Very many people will not need me to provide the countries.countryName column for them to know which country is represented by the code in each of those rows. I don't have to do the join.
When you're dealing with a specific business domain that becomes even more likely. Meaningful codes are understood by the domain users. They often don't need to see the long description columns from the key table. So in many cases no join is required to give the users all of the information they need.
If I had foreign keyed to an identity surrogate I would have to do the join, because the identity surrogate doesn't mean anything to anyone.
You are talking about the difference between synthetic and natural keys.
In my [very] personal opinion, I would recommend always using synthetic keys (and always calling the column id). The main problem is that natural keys are never truly unique: they are unique in theory, yes, but in the real world there are a myriad of unexpected and inexorable events that will make this false.
In database design:
Natural keys correspond to values present in the domain model. For example, UserName, SSN, VIN can be considered natural keys.
Synthetic keys are values not present in the domain model. They are just numeric/string/UUID values that have no relationship with the actual data. They serve only as unique identifiers for the rows.
I would say, stick to synthetic keys and sleep well at night. You never know what the Marketing Department will come up with on Monday, and suddenly "the username is not unique anymore".
Yes, having a dedicated int is a good thing for PK use.
You may have multiple alternate keys; that's OK too.
Two great reasons for it:
it is performant
it protects against key mutation (editing a name, etc.)
A username or any such unique field that holds meaningful data is subject to changes. A name may have been misspelled or you might want to edit a name to choose a better one, etc. etc.
Primary keys are used to identify records and, in conjunction with foreign keys, to connect records in different tables. They should never change. Therefore, it is better to use a meaningless int field as primary key.
By meaningless I mean that apart from being the primary key it has no meaning to the users.
An int identity column has other advantages over a text field as primary key.
It is generated by the database engine and is guaranteed to be unique in multi-user scenarios.
it is faster than a text column.
Text can have leading spaces, hidden characters and other oddities.
There are multiple kinds of text data types, multiple character sets and culture dependent behaviors resulting in text comparisons not always working as expected.
int primary keys generated in ascending order have a superior performance in conjunction with clustered primary keys (which is a SQL-Server specialty).
Note that I am talking from a database point of view. In the user interface, users will prefer identifying entries by name or e-mail address, etc.
But commands like SELECT, INSERT, UPDATE or DELETE will always identify records by the primary key.
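For example (table and column names assumed for illustration), users see the name while every statement keys on the id:

-- Look up the surrogate key once, by the human-friendly value...
SELECT UserId FROM Users WHERE UserName = 'jsmith';
-- ...then identify the record by its primary key from there on.
UPDATE Users SET UserName = 'jsmith2' WHERE UserId = 42;
DELETE FROM Comments WHERE UserId = 42;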
This subject is a bit like Gulliver's Travels and the wars fought over which end of the egg you're supposed to crack open to eat.
However, using the SAME "id" name for all tables, with autonumber? Yes, it is a LONG-established choice.
There are of course MANY different views on this subject, and many advantages and disadvantages.
Regardless of which choice one prefers (or even needs), this is a long-established concept in our industry. In fact SharePoint tables use "ID" and autonumber by default. So does MS Access, and there are probably more that do this.
The simple concept?
You can build your tables with the PK, and child tables with foreign keys.
At that point you set up your relationships between the tables.
Now, you might decide to add, say, some invoice number or whatever. Rules might mean that such an invoice number is not duplicated.
But WHY do we care if you have some "user" name, or some "invoice" number or whatever? Why should that fact affect your relational database model?
You mean if I don't have a user name, or don't have an invoice number, the whole database and its relationships don't work anymore? We don't care!!!
The concept of data, even required fields, or even a column having to be unique?
That has ZERO to do with a working relational data model.
And maybe you decide that the invoice number is not generated until, say, it is sent to the customer. So, the fact of some user name, invoice number or whatever? Don't care - you can have all kinds of business rules for those numbers, but they have ZERO to do with the fact that you designed a working relational data model based on so-called "surrogate" (sometimes called synthetic) keys.
So, once you build that data model - even with JUST the PK "id" and FKs (foreign keys) - you are NOW free to start adding columns and defining what type of data you are going to put in each table. But what you shove into each table has ZERO to do with that working relational data model. They are to be thought of as separate concepts.
So, if you want a user name, add that column to the table. If you don't want a user name, remove the column. The data you store in the table has ZERO to do with the automatic PK ID you are using - it is not really any different from, say, what area of memory the computer is going to allocate to load that data. The basic data operations of the system have nothing to do with the relationships, which simply exist once built. And the data columns you add after having built those relationships are up to you - but they will not, and should not, affect the operation of the database and the relationships you set up. Not only are these two concepts separate, but they free the developer from having to worry about the part that maintains the relationships as opposed to the data columns you add to such tables to store user data.
I mean, in JSON data, XML? We often have a master + child table relationship. We don't care how that relationship is maintained - only that it exists.
Thus yes, all tables have that PK "ID". Even better? In code, you NEVER have to guess what the PK id is - it is always the same!!!
So, the data and columns you put and toss into a table? Those columns and data have zero to do with the PK id. And while it is the database generating that PK? It could be a web service call to some monkeys living in a faraway jungle eating bananas, who give you a PK value based on how many bananas they have eaten. We just really don't care about that number - it is just an internal housekeeping number, one that we don't see or even care about in most code. And thus the number one rule for such automatic PK values?
You NEVER give that auto PK number any meaning from a user and application point of view.
In summary:
Yes, using a PK called "id" for all tables? Common, and in fact in SharePoint and many systems it is not only the default, but is in fact required for such systems to operate.
It's better to use userid. The user table is referenced by many other tables.
The referencing tables would contain the primary key of the user table as a foreign key.
It's better to use userid since it is an integer value:
it takes less space than the string values of username, and
searches by the database engine will be faster.
user(userid, username, name)
comments(commentid, comment, userid) would be better than
comments(commentid, comment, username)

REST APIs: Hiding surrogate keys and only exposing natural keys to the user

Suppose we're working with a database that has the following relation:
CREATE TABLE Author (
first_name VARCHAR NOT NULL,
last_name VARCHAR NOT NULL,
birth_date DATE NOT NULL,
death_date DATE,
biography TEXT,
UNIQUE(first_name, last_name, birth_date)
);
In the real world, it's highly, highly improbable that two authors with the same first and last name will have been born on the exact same day. So we consider the combination of an author's first name, last name, and birth date to be a natural key.
However, for the purposes of joining tables and creating foreign keys, this is a bit painful because it means we need to store these three pieces of information over and over and over in our tables. Maybe we have a BookAuthor relation associating Authors with Books.
So we create a serial ID (or a UUID if we wanted to) and treat it as the primary key:
CREATE TABLE Author (
id SERIAL PRIMARY KEY,
first_name VARCHAR NOT NULL,
last_name VARCHAR NOT NULL,
birth_date DATE NOT NULL,
death_date DATE,
biography TEXT,
UNIQUE(first_name, last_name, birth_date)
);
However, based on the reading I've done here on StackOverflow and other sites, it seems you never want to expose your primary key to the end user because doing so may allow them to ascertain information that they should not be able to. If this were a situation involving user accounts, for example, and we decided to use serial IDs (we shouldn't, but humor me), then someone could potentially figure out how many total users we have in our database, which isn't information that they should see. In this particular case, someone knowing how many total authors we have isn't really a big deal. But I want to follow best practices.
So, if I were to design a REST API for this database, would it be acceptable to use the natural key to uniquely identify resources (Authors) instead of using serial IDs? For example:
https://www.foo.bar/authors/?first=Terry&last=Pratchett&dob=19480424
My reasoning here is that there's no way for an end user to know what serial ID corresponds to this author in order to query for them.
As long as you guarantee this combination of data elements is a unique alternate key (it would be good to have a database unique constraint to enforce that) and use it consistently across all the API methods, it is perfectly acceptable to use it to identify resources in a REST API. There is no conceptual flaw in that.
The only minor issue is that if you change the natural key (updating the name or date of birth), cached data would be invalidated. Not a big deal; GET methods would still be idempotent.
If you choose to use a natural key as identifier and the data elements of this key are editable, keep in mind that you may want to redirect to a new URL when a PUT updates them and you want to keep displaying the same resource.
To address the different concerns you are looking at, I would suggest the following:
Generate a sequential id (a numeric sequence), if possible, which will be your primary key. This key is for internal app use (or db internal) and never leaves the backend.
A unique (non-numeric) id (UUID? ... or a hash from id+first_name+last_name+b_day) which will be used as the key for accessing your data through the API calls.
Optionally, you could still have the unique key: first_name+last_name+b_day.
This way you will have a simple sequential Id for your DB, but also an Id that does not expose relevant information about your system.
This may not be a perfect or complete solution, but it could be a good start.
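A sketch of that layout in PostgreSQL (the column names are assumptions; gen_random_uuid() is built in from Postgres 13, and available via the pgcrypto extension before that):

CREATE TABLE author (
    id         bigserial PRIMARY KEY,  -- internal, never leaves the backend
    public_id  uuid NOT NULL UNIQUE DEFAULT gen_random_uuid(),  -- exposed in API URLs
    first_name varchar NOT NULL,
    last_name  varchar NOT NULL,
    birth_date date NOT NULL,
    UNIQUE (first_name, last_name, birth_date)  -- the natural key is still enforced
);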
So, if I were to design a REST API for this database, would it be acceptable to use the natural key to uniquely identify resources (Authors) instead of using serial IDs?
The short answer to your question is No.
The longer answer is that there are actually two separate concerns here; you mention:
There's no way for an end user to know what serial ID corresponds to this author in order to query for them.
What this implies is that you want the end user to be able to search for an author.
This is different from the standard use-case for a GET route which explicitly requires that the requester knows the unique identifier (read this W3 description of the basic REST methods for more information).
I would recommend that you have two separate APIs, one for retrieving all details of a resource given its UUID, and a second for searching by some fields.

Best Practices when Choosing SQL Key Types

I am very new to SQL databases. I need to create a database for my internship, as our DA resigned suddenly.
The data is available but has not been entered into a database yet. I am trying to follow tutorials online but got stuck on what to choose for the different key types.
I hope to get feedback and guidance from more experienced folks.
Table columns:
entry id (unique)
entry timestamp
username (unique but can appear more than once if the same user inputs a new meal record)
email address
user first and last name
meals taken date
meal type
meal calories
meal duration
meal cost
meal location
user notes
For primary key = entry id
For candidate keys, I will pick username and entry ID. Are there other columns that I should select as candidate keys? Would email make more sense? But a username can be repeated if they input another meal record. Does that matter?
For compound Key =
email address + user first and last name?
record date + user name?
Are there other keys I need to classify?
According to online tutorials, these are the most basic keys I need to identify, but I am not sure if I am making the right choice. I appreciate any feedback.
Your data seems to contain multiple entities. Based on your simple description, I can identify:
users
meals
locations
Then there seems to be this thing called entries, which is a user eating (buying?) a meal at a location. This is a 3-way junction table among the entities.
This is a guess on what you are trying to represent. But it sounds like multiple tables.
I don't know what database you're using, so I'll use Postgres because it's free, follows the SQL standard, has good documentation, and is very powerful.
As Gordon said, you seem to have three things: users, meals, and locations. Three things means one table for each. This avoids storing redundant data. The whole topic is database normalization.
create table users (
id bigserial primary key,
username text not null unique,
email text not null unique,
first_name text not null,
last_name text not null
);
create table meals (
id bigserial primary key,
type text not null unique,
-- If calories vary by user, move this into user_meals.
calories integer not null
);
create table locations (
id bigserial primary key,
-- The specific information you have about a location will vary,
-- but this is a good start. I've allowed nulls because often
-- people don't have full information about their location.
name text,
address text,
city text,
province text,
country text,
postal_code text
);
You asked about compound keys. Don't bother. There are too many potential problems. Use a simple, unique, auto-incrementing big integer on every table.
Primary keys must be unique and unchanging. Usernames, names, email addresses... these can all change. Even if you think they won't, why bake the risk into your schema?
Foreign keys will be repeated all over the database and indexes many times. You want them to be small and simple and fixed size to not use up any more storage than necessary. Integers are small and simple and fixed size. Text is not.
Compound primary keys potentially leak information. A primary key refers to a row, and primary keys often show up in URLs and the like. If we were to use the user's email address or name, that risks leaking personally identifiable information.
I've chosen bigserial for the primary key. serial types are auto-incrementing so the database will take care of assigning each new row a primary key. bigserial uses a 64 bit integer. A regular integer can only hold about 2 to 4 billion entries. That sounds like a lot, but it isn't, and you do not want to run out of primary keys. bigserial can handle 9 quintillion. An extra 32 bits per row is worth it.
Some more notes.
Everything is not null unless we have a good reason otherwise. This will catch data entry mistakes. It makes the database easier to work with because you know the data will be there.
Similarly, anything which is supposed to be unique is declared unique so it is guaranteed.
There are no arbitrary limits on column size. You might see other examples like email varchar(64) or the like. This doesn't actually save you any space, it just puts limits on what is allowed in the database. There's no fundamental limit on how long a username or email address can be, that's a business rule. Business rules change all the time; they should not be hard coded into the schema. Enforcing business rules is the job of the thing inputting the data, or possibly triggers.
Now that we have these three tables, we can use them to record people's meals. To do that we need a fourth table to record a user having a meal at a location: a join table. A join table is what allows us to keep information about users, meals, and locations each in one canonical place. The users, locations, and meals are referred to by their IDs, known as foreign keys.
create table user_meals (
id bigserial primary key,
user_id bigint not null references users(id),
meal_id bigint not null references meals(id),
location_id bigint not null references locations(id),
-- Using a time range is more flexible than start + duration.
taken_at tstzrange not null,
-- This can store up to 99999.99 which seems reasonable.
cost numeric(7, 2) not null,
notes text,
-- When was this entry created?
created_at timestamp not null default current_timestamp,
-- When was it last updated?
updated_at timestamp not null default current_timestamp
);
Because we're using bigserial for everything's primary key, referring to other tables is simple: they're all bigint.
As before, everything is not null unless we have a good reason otherwise. Not every meal will have notes, for example.
Fairly standard created_at and updated_at fields are used to record when the entry was created or updated.
Imprecise numeric types such as float or double are avoided, especially for money. Instead the arbitrary precision type numeric is used to store money precisely.
numeric requires we give it a precision and scale; we must choose a limit, so I picked something unreasonably high in my currency. Your currency may vary.
Rather than storing a start time and meal duration, we take advantage of Postgres's range types to store a time range from when the meal started to when it ended. This allows us to use range functions and operators to make queries about the meal time much simpler. For example, where taken_at @> '2020-02-15 20:00'::timestamptz finds all meals which were being taken at 8pm on Feb 15th.
There's more to do, such as adding indexes for performance, hopefully this will get you started. If there's one take away it's this: don't try to cram everything into one table.
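To illustrate, a report over this schema might look like the following sketch ('alice' and the date range are made-up values):

SELECT u.username, m.type, l.name AS location, um.cost, um.taken_at
FROM user_meals um
JOIN users u     ON u.id = um.user_id
JOIN meals m     ON m.id = um.meal_id
JOIN locations l ON l.id = um.location_id
WHERE u.username = 'alice'
  AND um.taken_at && tstzrange('2020-02-01', '2020-03-01');  -- && is range overlap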
Try it out.

SQL Database Design Best Practice (Addresses)

Of course I realize that there's no one "right way" to design a SQL database, but I wanted to get some opinions on what is better or worse in my particular scenario.
Currently, I'm designing an order entry module (Windows .NET 4.0 application with SQL Server 2008) and I'm torn between two design decisions when it comes to data that can be applied in more than one spot. In this question I'll refer specifically to Addresses.
Addresses can be used by a variety of objects (orders, customers, employees, shipments, etc.) and they almost always contain the same data (Address1/2/3, City, State, Postal Code, Country, etc.). I was originally going to include each of these fields as a column in each of the related tables (e.g. Orders will contain Address1/2/3, City, State, etc., and Customers will also contain this same column layout). But a part of me wants to apply DRY/normalization principles to this scenario, i.e. have a table called "Addresses" which is referenced via foreign key in the appropriate tables.
CREATE TABLE DB.dbo.Addresses
(
Id INT
NOT NULL
IDENTITY(1, 1)
PRIMARY KEY
CHECK (Id > 0),
Address1 VARCHAR(120)
NOT NULL,
Address2 VARCHAR(120),
Address3 VARCHAR(120),
City VARCHAR(100)
NOT NULL,
State CHAR(2)
NOT NULL,
Country CHAR(2)
NOT NULL,
PostalCode VARCHAR(16)
NOT NULL
)
CREATE TABLE DB.dbo.Orders
(
Id INT
NOT NULL
IDENTITY(1000, 1)
PRIMARY KEY
CHECK (Id > 1000),
Address INT
CONSTRAINT fk_Orders_Address
FOREIGN KEY REFERENCES Addresses(Id)
CHECK (Address > 0)
NOT NULL,
-- other columns....
)
CREATE TABLE DB.dbo.Customers
(
Id INT
NOT NULL
IDENTITY(1000, 1)
PRIMARY KEY
CHECK (Id > 1000),
Address INT
CONSTRAINT fk_Customers_Address
FOREIGN KEY REFERENCES Addresses(Id)
CHECK (Address > 0)
NOT NULL,
-- other columns....
)
From a design standpoint I like this approach because it creates a standard address format that is easily changeable, i.e. if I ever needed to add Address4 I would just add it in one place rather than to every table. However, I can see the number of JOINs required to build queries might get a little insane.
I guess I'm just wondering if any enterprise-level SQL architects out there have ever used this approach successfully, or if the number of JOINs that this creates would create a performance issue?
You're on the right track by breaking address out into its own table. I'd add a couple of additional suggestions.
Consider taking the Address FK columns out of the Customers/Orders tables and creating junction tables instead. In other words, treat Customers/Addresses and Orders/Addresses as many-to-many relationships in your design now so you can easily support multiple addresses in the future. Yes, this means introducing more tables and joins, but the flexibility you gain is well worth the effort.
Consider creating lookup tables for city, state and country entities. The city/state/country columns of the address table then consist of FKs pointing to these lookup tables. This allows you to guarantee consistent spellings across all addresses and gives you a place to store additional metadata (e.g., city population) if needed in the future.
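A sketch of the junction-table idea above, against the question's tables (the table, column, and constraint names here are mine):

CREATE TABLE DB.dbo.CustomerAddresses
(
    CustomerId  INT NOT NULL
        CONSTRAINT fk_CustomerAddresses_Customer
        FOREIGN KEY REFERENCES Customers(Id),
    AddressId   INT NOT NULL
        CONSTRAINT fk_CustomerAddresses_Address
        FOREIGN KEY REFERENCES Addresses(Id),
    AddressType VARCHAR(20) NOT NULL,  -- e.g. 'Billing', 'Shipping'
    PRIMARY KEY (CustomerId, AddressId, AddressType)
)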
I just have some cautions. For each of these, there's more than one way to fix the problem.
First, normalization doesn't mean "replace text with an id number".
Second, you don't have a key. I know, you have a column declared "PRIMARY KEY", but that's not enough.
insert into Addresses
(Address1, Address2, Address3, City, State, Country, PostalCode)
values
('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500'),
('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500'),
('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500'),
('President Obama', '1600 Pennsylvania Avenue NW', NULL, 'Washington', 'DC', 'US', '20500');
select * from Addresses;
1;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
2;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
3;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
4;President Obama;1600 Pennsylvania Avenue NW;;Washington;DC;US;20500
In the absence of any other constraints, your "primary key" identifies a row; it doesn't identify an address. Identifying a row is usually not good enough.
Third, "Address1", "Address2", and "Address3" aren't attributes of addresses. They're attributes of mailing labels. (Lines on a mailing label.) That distinction might not be important to you. It's really important to me.
Fourth, addresses have a lifetime. Between birth and death, they sometimes change. They change when streets get re-routed, buildings get divided, buildings get undivided, and sometimes (I'm pretty sure) when a city employee has a pint too many. Natural disasters can eliminate whole communities. Sometimes buildings get renumbered. In our database, which is tiny compared to most, about 1% per year change like that.
When an address dies, you have to do two things.
Make sure nobody uses that address to mail, ship, or whatever.
Make sure its death doesn't affect historical data.
When an address itself changes, you have to do two things.
Some data must reflect that change. Make sure it does.
Some data must not reflect that change. Make sure it doesn't.
Fifth, DRY doesn't apply to foreign keys. Their whole purpose is to be repeated. The only question is how wide a key? An id number is narrow, but requires a join. (10 id numbers might require 10 joins.) An address is wide, but requires no joins. (I'm talking here about a proper address, not a mailing label.)
That's all I can think of off the top of my head.
I think there is a problem you are not aware of: some of this data is time-sensitive. You do not want your records to show you shipped an order to 35 State St, Chicago IL, when you actually sent it to 10 King Street, Martinsburg WV, but the customer moved two years after the order was shipped. So yes, build an address table to capture the address at that moment in time, as long as any change to the address for someone like a customer results in a new AddressId rather than a change to the current address, which would break the history on an order.
You would want the addresses to be in a separate table only if they were entities in their own right. Entities have identity (meaning it matters if two objects pointed to the same address or to different ones), and they have their own lifecycle apart from other entities. If this was the case with your domain, I think it would be totally apparent and you wouldn't have a need to ask this question.
Cade's answer explains the mutability of addresses, something like a shipping address is part of an order and shouldn't be able to change out from under the order it belongs to. This shows that the shipping address doesn't have its own lifecycle. Handling it as if it was a separate entity can only lead to more opportunities for error.
"Normalization" specifically refers to removing redundancies from data so you don't have the same item represented in different places. Here the only redundancy is in the DDL, it's not in the data, so "normalization" is not relevant here. (JPA has the concept of embedded classes that can address the redundancy).
TLDR: Use a separate table if the address is truly an Entity, with its own distinct identity and its own lifecycle. Otherwise don't.
What you have to answer for yourself is whether the same address in everyday language is actually the same address in your database. If somebody "changes his address" (colloquially), he really links himself to another address. The address per se only changes when a street is renamed, a zip-code reform takes place, or a nuke hits. And those are rare events (hopefully, for the most part). There goes your main benefit: change in one place for multiple rows (of multiple tables).
If you should actually change an address for that in your model - in the sense of an UPDATE on table address - that may or may not work for other rows that link to it. Also, in my experience, even the exact same address has to look different for different purposes. Understand the semantic differences and you will arrive at the right model that represents your real world best.
I have a number of databases where I use a common table of streets (which uses a table of cities (which uses a table of countries, ...)). In combination with a street number think of it as geocodes (lat/lon), not "street names". Addresses are not shared among different tables (or rows). Changes to street names and zip codes cascade, other changes don't.
You would normally normalise the data as far as possible, so use the 'Addresses' table.
You can use views to de-normalise the data afterwards; they use indexes and give you an easy way to access the data, whilst leaving the underlying structure fully normalised.
The number of joins shouldn't be a major issue; index-based joins aren't too much of an overhead.
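For instance, a de-normalising view over the question's tables might look like this sketch (the view name is mine):

CREATE VIEW dbo.OrderAddresses AS
SELECT o.Id AS OrderId,
       a.Address1, a.Address2, a.Address3,
       a.City, a.State, a.Country, a.PostalCode
FROM DB.dbo.Orders o
JOIN DB.dbo.Addresses a ON a.Id = o.Address;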
It's fine to have a split-out addresses table.
However, you have to avoid the temptation of letting multiple rows refer to the same address without an appropriate system for deciding whether and how an address change splits out a row for the new address. For example, say the same address is used for billing and ship-to, and a user says their address is changing. To start with, old orders might (should?) need their ship-to addresses retained, so you can't change the address in place. But the user might also need to say that the change applies only to the ship-to.
One should maintain master tables for City, State and Country. This way one can avoid different spellings for these entities, which might end up mapping the same city to a different state/country.
One can simply put CityId in the address table as a foreign key, as shown below, instead of having all three fields (City, State and Country) as plain text in the address table itself.
Address: {
CityId
// With other fields
}
City: {
CityId
StateId
// Other fields
}
State: {
StateId
CountryId
// Other fields
}
Country: {
CountryId
// Other fields
}
If one maintains all three ids (CityId, StateId and CountryId) in the address table, you still end up joining against those tables. Hence my suggestion is to store only CityId and then retrieve the rest of the required information through joins with the above table structure.
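The join implied above would look roughly like this (the display-name columns are assumptions, since the sketch only lists the ids):

SELECT a.*, c.CityName, s.StateName, co.CountryName
FROM Address a
JOIN City c     ON c.CityId = a.CityId
JOIN State s    ON s.StateId = c.StateId
JOIN Country co ON co.CountryId = s.CountryId;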
I prefer to use an XREF table that contains a FK reference to the person/business table, a FK reference to the address table and, generally, a FK reference to a role table (HOME, OFFICE, etc.) to delineate the actual type of address. I also include an ACTIVE flag so I can choose to ignore old addresses while preserving the ability to maintain an address history.
This approach allows me to maintain multiple addresses of varying types for each primary entity.
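A sketch of that XREF table in T-SQL (the referenced table names are placeholders for your person/business, address, and role tables):

CREATE TABLE PartyAddressXref
(
    PartyId   INT NOT NULL FOREIGN KEY REFERENCES Parties(Id),
    AddressId INT NOT NULL FOREIGN KEY REFERENCES Addresses(Id),
    RoleId    INT NOT NULL FOREIGN KEY REFERENCES AddressRoles(Id),  -- HOME, OFFICE, ...
    Active    BIT NOT NULL DEFAULT 1,  -- flag old addresses instead of deleting them
    PRIMARY KEY (PartyId, AddressId, RoleId)
)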

Use email address as primary key?

Is an email address a bad candidate for a primary key when compared to auto-incrementing numbers?
Our web application needs the email address to be unique in the system. So, I thought of using email address as primary key. However my colleague suggests that string comparison will be slower than integer comparison.
Is it a valid reason to not use email as primary key?
We are using PostgreSQL.
String comparison is slower than int comparison. However, this does not matter if you simply retrieve a user from the database using the e-mail address. It does matter if you have complex queries with multiple joins.
If you store information about users in multiple tables, the foreign keys to the users table will be the e-mail address. That means that you store the e-mail address multiple times.
I will also point out that email is a bad choice for a unique field; there are people and even small businesses that share an email address. And like phone numbers, emails can get re-used. Jsmith@somecompany.com can easily belong to John Smith one year and Julia Smith two years later.
Another problem with emails is that they change frequently. If you are joining to other tables with that as the key, then you will have to update the other tables as well which can be quite a performance hit when an entire client company changes their emails (which I have seen happen.)
the primary key should be unique and constant
email addresses change like the seasons. Useful as a secondary key for lookup, but a poor choice for the primary key.
Disadvantages of using an email address as a primary key:
Slower when doing joins.
Any other record with a posted foreign key now has a larger value, taking up more disk space. (Given the cost of disk space today, this is probably a trivial issue, except to the extent that the record now takes longer to read. See #1.)
An email address could change, which forces all records using it as a foreign key to be updated. As email addresses don't change all that often, the performance problem is probably minor. The bigger problem is that you have to make sure to provide for it. If you have to write the code, this is more work and introduces the possibility of bugs. If your database engine supports "on update cascade", it's a minor issue.
Advantages of using email address as a primary key:
You may be able to completely eliminate some joins. If all you need from the "master record" is the email address, then with an abstract integer key you would have to do a join to retrieve it. If the key is the email address, then you already have it and the join is unnecessary. Whether this helps you any depends on how often this situation comes up.
When you are doing ad hoc queries, it's easy for a human being to see what master record is being referenced. This can be a big help when trying to track down data problems.
You almost certainly will need an index on the email address anyway, so making it the primary key eliminates one index, thus improving the performance of inserts as they now have only one index to update instead of two.
In my humble opinion, it's not a slam-dunk either way. I tend to prefer to use natural keys when a practical one is available because they're just easier to work with, and the disadvantages tend to not really matter much in most cases.
No one seems to have mentioned a possible problem that email addresses could be considered private. If the email address is the primary key, a profile page URL most likely will look something like ..../Users/my@email.com. What if you don't want to expose the user's email address? You'd have to find some other way of identifying the user, possibly by a unique integer value to make URLs like ..../Users/1. Then you'd end up with a unique integer value after all.
It is pretty bad. Assume some e-mail provider goes out of business. Users will then want to change their e-mail. If you have used e-mail as primary key, all foreign keys for users will duplicate that e-mail, making it pretty damn hard to change ...
... and I haven't even started talking about performance considerations.
I don't know if that might be an issue in your setup, but depending on your RDBMS the values of a column might be case sensitive. The PostgreSQL docs say: "If you declare a column as UNIQUE or PRIMARY KEY, the implicitly generated index is case-sensitive". In other words, if you accept user input for a search in a table with email as primary key, and the user provides "John@Doe.com", you won't find "john@doe.com".
At the logical level, the email is the natural key.
At the physical level, given you are using a relational database, the natural key doesn't fit well as the primary key. The reason is mainly the performance issues mentioned by others.
For that reason, the design can be adapted. The natural key becomes the alternate key (UNIQUE, NOT NULL), and you use a surrogate/artificial/technical key as the primary key, which can be an auto-increment in your case.
systempuntoout asked,
What if someone wants to change his email address? Are you going to change all the foreign keys too?
That's what cascading is for.
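In PostgreSQL (which the question uses) that looks like this sketch; updating the e-mail in users rewrites the foreign keys automatically:

CREATE TABLE users (
    email text PRIMARY KEY
);

CREATE TABLE orders (
    order_id serial PRIMARY KEY,
    email    text NOT NULL REFERENCES users(email) ON UPDATE CASCADE
);

-- The cascade propagates the change to orders.email as well:
UPDATE users SET email = 'new@example.com' WHERE email = 'old@example.com';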
Another reason to use a numeric surrogate key as the primary key is related to how the indexing works in your platform. In MySQL's InnoDB, for example, all indexes in a table have the primary key pre-pended to them, so you want the PK to be as small as possible (for speed's and size's sakes). Also related to this, InnoDB is faster when the primary key is stored in sequence, and a string would not help there.
Another thing to take into consideration when using a string as an alternate key, is that using a hash of the actual string that you want might be faster, skipping things like upper and lower cases of some letters. (I actually landed here while looking for a reference to confirm what I just said; still looking...)
Yes, it is better if you use an integer instead. You can also set a unique constraint on your email column, like this:
CREATE TABLE myTable(
id integer primary key,
email text UNIQUE
);
Yes, it is a bad primary key because your users will want to update their email addresses.
Another reason why an integer primary key is better is when you refer to the email address in a different table. If the address itself is the primary key, then in the other table you have to use it as the key, so you store email addresses multiple times.
I am not too familiar with Postgres. Primary keys are a big topic. I've seen some excellent questions and answers on this site (stackoverflow.com).
I think you may get better performance by having a numeric primary key and using a UNIQUE INDEX on the email column. Emails tend to vary in length and may not be appropriate for a primary key index.
some reading here and here.
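A sketch of that combination in PostgreSQL; the lower() in the index is an extra touch to sidestep the case-sensitivity surprise mentioned in another answer:

CREATE TABLE users (
    user_id bigserial PRIMARY KEY,  -- numeric primary key for joins
    email   text NOT NULL
);

-- Unique index on the email column; lower() makes the uniqueness case-insensitive.
CREATE UNIQUE INDEX users_email_key ON users (lower(email));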
Personally, I do not use any meaningful information for the primary key when designing a database, because it is very likely that I might need to alter that information later. The sole reason I provide a primary key is that it is convenient for doing most SQL operations from the client side, and my choice for that has always been an auto-increment integer type.
I know this is a bit of a late entry, but I would like to add that people abandon email accounts and service providers recover the addresses, allowing another person to use them.
As @HLGEM pointed out, "Jsmith@somecompany.com can easily belong to John Smith one year and Julia Smith two years later." In that case, should John Smith want your service, you either have to refuse to use his email address or delete all your records pertaining to Julia Smith.
If you have to delete records and they relate to the financial history of the business, then depending on local law you could find yourself in hot water.
So I would never use data like email addresses, number plates, etc. as primary keys, because no matter how unique they seem, they are out of your control and can present some interesting challenges that you may not have time to deal with.
You may need to consider any applicable data regulation legislation. Email is personal information, and if your users are EU citizens, for instance, then under GDPR they can instruct you to delete their information from your records (remember this applies regardless of which country you are based in).
If you need to keep the record itself in the database for referential integrity or historical reasons such as audit, using a surrogate key would allow you to just NULL all the personal data fields. This obviously isn't as easy if their personal data is the primary key.
Your colleague is right: Use an autoincrementing integer for your primary key.
You can implement the email-uniqueness either at the application level, or you could mark your email address column as unique and add an index on that column.
Making the field unique will cost you a string comparison only when inserting into that table, not when performing joins and foreign key constraint checks.
Of course, you must note that adding any constraints to your application at the database level can cause your app to become inflexible. Always give due consideration before you make any field "unique" or "not null" just because your application needs it to be unique or non-empty.
Use a GUID as a primary key... that way you can generate it from your program when you do an INSERT and you don't need to get a response from the server to find out what the primary key is. It will also be unique across tables and databases, and you don't have to worry about what happens if you truncate the table some day and the auto-increment gets reset to 1.
You can boost performance by using an integer primary key.
You should use an integer primary key. If you need the email column to be unique, why not simply set a unique index on that column?
If you have a non-int value as the primary key, then insertions and retrievals will be very slow on large data sets.
The primary key should be a static attribute. Since email addresses are not static and can be shared by multiple candidates, it is not a good idea to use them as a primary key. Moreover, email addresses are usually strings longer than the unique id we would like to use (len(email_address) > len(unique_id)), so they require more space, and even worse they are stored multiple times as foreign keys. Consequently, this will degrade performance.
It depends on the table. If the rows in your table represent email addresses, then email is the best ID. If not, then email is not a good ID.
If it's simply a matter of requiring the email to be unique then you can just create a unique index with that column.
Email is a good unique index candidate, but not a good primary key; if it is the primary key, you will not be able to change a contact's email address, for example.
I think your join queries will be slower too.
Do not use an email address as a primary key. Keep email unique, but do not use it as the primary key; use a user id or username as the primary key.