Normalization in plain English - sql
I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the Wikipedia article, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind.
Does anyone have a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms?
Say you go to a job interview and the person asks: explain the concept of normalization and how you would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife, it would be something like this:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the countries they come from. Instead of holding the name of the country, which can be as long as "Bosnia & Herzegovina", for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45s. Now if in the future, as often happens with Balkan countries, it splits into two countries, Bosnia and Herzegovina, I will have to change it only in one place. Well, sort of.
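A minimal sketch of that idea in SQL (the table and column names are mine, purely illustrative):

CREATE TABLE countries (
    country_id   INT PRIMARY KEY,
    country_name VARCHAR(100) NOT NULL
);

CREATE TABLE persons (
    person_id  INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    country_id INT NOT NULL REFERENCES countries(country_id)
);

-- 'Bosnia & Herzegovina' is stored once; each person stores only #45.
INSERT INTO countries VALUES (45, 'Bosnia & Herzegovina');
INSERT INTO persons VALUES (1, 'Faruz', 45);

-- If the country's name changes, only one row changes:
UPDATE countries SET country_name = 'Bosnia' WHERE country_id = 45;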
Now, to explain 2NF, I would change the example: let's assume that we hold the list of countries every person has visited.
Instead of holding a table like:
Person | CountryVisited | AnotherInformation | D.O.B.
------------------------------------------------------
Faruz  | USA            | Blah Blah          | 1/1/2000
Faruz  | Canada         | Blah Blah          | 1/1/2000
I would create three tables: one with the list of countries, one with the list of persons, and a third to connect them both. That gives me the most freedom I can get for changing a person's information or a country's information. It also lets me "remove duplicate rows" as normalization expects.
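A rough sketch of those three tables (illustrative names, generic SQL):

CREATE TABLE persons (
    person_id INT PRIMARY KEY,
    name      VARCHAR(100),
    dob       DATE
);

CREATE TABLE countries (
    country_id   INT PRIMARY KEY,
    country_name VARCHAR(100)
);

-- The connecting table: one row per (person, country) visit,
-- so personal details and country names are never repeated.
CREATE TABLE visits (
    person_id  INT REFERENCES persons(person_id),
    country_id INT REFERENCES countries(country_id),
    PRIMARY KEY (person_id, country_id)
);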
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, you are violating normalization, which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
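For concreteness, here is how that design might look as DDL (a sketch; the column types are my guesses):

CREATE TABLE Friends (
    Id      INT PRIMARY KEY,
    Name    VARCHAR(100),
    Address VARCHAR(200)
);

CREATE TABLE Cats (
    Id      INT PRIMARY KEY,
    Name    VARCHAR(100),
    OwnerId INT NOT NULL REFERENCES Friends(Id)  -- each cat belongs to exactly one friend
);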
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts are repeated, such as Bob's address "The Belltower" appearing in two rows. That redundancy makes the data harder to query and change and, worst of all, makes it possible to introduce logical inconsistencies.
E.g. if Bob moves, I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. And if I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
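To make the two designs concrete, here is a sketch under each business rule (the names are illustrative):

-- Rule: exactly one address per customer. This single table is fine:
CREATE TABLE customers (
    customer_id      INT PRIMARY KEY,
    customer_name    VARCHAR(100),
    customer_address VARCHAR(200)
);

-- Rule: a customer may register several addresses. The address moves out:
CREATE TABLE customer_addresses (
    address_id  INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(customer_id),
    address     VARCHAR(200)
);
-- ...and the now-redundant column would be dropped:
-- ALTER TABLE customers DROP COLUMN customer_address;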
This is what I ask interviewees:
Why don't we use a single table for an application instead of using multiple tables?
The answer is of course normalization: as already said, it is to avoid redundancy and thereby update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to which row in the user table it belongs to. No blank columns are needed, and each user can have as few or many phone numbers as necessary.
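As a sketch (my names, not a prescription):

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    name    VARCHAR(100)
);

-- One row per number: no phonenumber1..phonenumber3 columns, no blanks.
CREATE TABLE phone_numbers (
    phone_id     INT PRIMARY KEY,
    user_id      INT NOT NULL REFERENCES users(user_id),
    phone_number VARCHAR(20) NOT NULL
);

-- All of user 7's numbers, however many there are:
SELECT phone_number FROM phone_numbers WHERE user_id = 7;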
One side point to note about normalization: A fully normalized database is space efficient, but is not necessarily the most time efficient arrangement of data depending on use patterns.
Skipping around to multiple tables to look up all the pieces of info from their denormalized locations takes time. In high load situations (millions of rows per second flying around, thousands of concurrent clients, like say credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized tables.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note that said you had to go shopping for ice cream, then without normalization you would end up with another note saying you have to go shopping for ice cream: one in each pocket.
Now, in real life, you would never do this, so why do it in a database?
For the designing and implementing part, that's when you can move back to "the lingo" and keep it away from layman's terms, but I suppose you could simplify. You would say what you needed to at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that table's primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and the fact that you cannot always have the best structure AND be in normal forms.
Normalization is a set of rules used to design tables that are connected through relationships.
It helps avoid repetitive entries, reduces required storage space, prevents the need to restructure existing tables to accommodate new data, and increases the speed of queries.
First Normal Form: Data should be broken up into the smallest meaningful units. Tables should not contain repetitive groups of columns. Each row should be identified by a primary key of one or more columns.
For example, if there is a column named 'Name' in a 'Customer' table, it should be broken into 'First Name' and 'Last Name'. Also, 'Customer' should have a column named 'CustomerID' to identify a particular customer.
Second Normal Form: Each non-key column should be directly related to the entire primary key.
For example, if a 'Customer' table has a column named 'City', the cities should get a separate table with a primary key and the city name defined; in the 'Customer' table, replace the 'City' column with a 'CityID' column and make 'CityID' the foreign key in the table.
Third normal form: Each non-key column should not depend on other non-key columns.
For example, in an order table, the column 'Total' is dependent on 'Unit Price' and 'Quantity' (both non-key columns), so the 'Total' column should be removed.
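A small sketch tying the three forms together, following this answer's framing (illustrative names):

-- 2NF: cities get their own table...
CREATE TABLE cities (
    city_id   INT PRIMARY KEY,
    city_name VARCHAR(100)
);

-- 1NF: Name is split into atomic columns, and a key is declared.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    first_name  VARCHAR(50),
    last_name   VARCHAR(50),
    city_id     INT REFERENCES cities(city_id)  -- ...referenced by key
);

-- 3NF: no stored Total; it depends on non-key columns, so compute it.
CREATE TABLE order_lines (
    order_id   INT,
    product_id INT,
    unit_price DECIMAL(10,2),
    quantity   INT,
    PRIMARY KEY (order_id, product_id)
);

SELECT order_id, SUM(unit_price * quantity) AS total
FROM order_lines
GROUP BY order_id;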
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it, like Bill Lumbergh. We then query the students and ask them what problems we will have when the first name and last name are all in one field. I use my name as an example, Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough one for some, because I have what some people would consider two first names, and some people call me Richard. If you were trying to search for my last name, it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that "meaningful" is based upon the audience who is going to be using the database as well. We, at our job, will not need a separate field for an apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FedEx might need it separated out to easily pull up the apartment or suite they need to go to when they are on the road, running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
Avoiding Blanks:
I use an analogy to explain why they should avoid blanks. I tell them that Access and most databases do not store blanks the way Excel does. Excel does not care if you have nothing typed in the cell and will not increase the file size, but Access will reserve that space until the point you actually use the field. So even if it is blank, it will still be using up space, and I explain that it slows their searches down as well.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype values if I have to type the same thing over and over again. This "fat-fingering" of data will lead to my queries not finding the correct data. Instead, we will break the data out into a separate table and create a relationship using primary and foreign key fields. This way we are saving space, because we are not typing the customer's name, address, etc. multiple times; instead we just use the customer's ID number in a field for the customer. We then will discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will set up a drop-down list that gives you a list of customers, where you can select their name and it will fill in the customer's ID for you. This will be a 1-to-many relationship, where 1 customer will have many different orders.
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid out similar to this.
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects that they work on. First, if we have a new employee, they will not have any projects yet, so all of those fields are wasted. Second, if an employee has been here a long time, they might have worked on 300 projects, so we would have to include 300 project fields; people who are new and have only 1 project would have 299 wasted project fields. This design is also flawed because I would have to search each of the project fields to find all of the people who have worked on a certain project, since that project number could be in any of the project fields.
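The usual fix, sketched in SQL (this assumes tblEmployees and tblProjects exist as laid out above; the junction table name is mine):

-- One row per assignment replaces the Project1..ProjectN columns.
CREATE TABLE tblEmployeeProjects (
    EmployeeID INT REFERENCES tblEmployees(EmployeeID),
    ProjectNum INT REFERENCES tblProjects(ProjectNum),
    PRIMARY KEY (EmployeeID, ProjectNum)
);

-- Everyone who worked on project 42: one indexed lookup,
-- not a search across 300 project columns.
SELECT e.First, e.Last
FROM tblEmployees e
JOIN tblEmployeeProjects ep ON ep.EmployeeID = e.EmployeeID
WHERE ep.ProjectNum = 42;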
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarification or breaking it down in plain English. The wiki page did not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times, but I have found a better overview of normalization in this article. It is a simple, easy-to-understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in tables that will have records inserted but not updated, such as history logs or financial transactions.
References
When not to Normalize your SQL Database
Database Design Basics
+1 for the analogy of talking to your wife. I find that anyone without a tech mind needs some easing into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
When working in a truly normalized database, I have found that in some situations it has been easy to end up with slow queries, because of bad join operations, bad indexing on the tables, and plain bad design of the tables themselves.
Bluntly, it's easier to write bad queries against highly normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out of a few tables, without having to join to a ton of tables to get one data set.
Related
How to structure SQL tables with one (non-composite) candidate key and all non-primary attributes?
I'm not very familiar with relational databases but here is my question.
I have some raw data that's collected as a result of a customer survey. For each customer who participated, there is only one record and that's uniquely identifiable by the CustomerId attribute. All other attributes I believe fall under the non-prime key description, as no other attribute depends on another, apart from the non-composite candidate key. Also, all columns are atomic, as in, none can be split into multiple columns. For example, the columns are like CustomerId (non-sequential), Race, Weight, Height, Salary, EducationLevel, JobFunction, NumberOfCars, NumberOfChildren, MaritalStatus, GeneralHealth, MentalHealth, and I have 100+ columns like this in total.
So, as far as I understand, we can't talk about any form of normalization for this kind of dataset, am I correct? However, given the excessive number of columns, if I wanted to split this monolithic table into tables with fewer columns, i.e. based on some categorisation of columns like demographics, health, employment etc., is there a specific name for such a structure/approach in the literature? All the tables are still going to be using the CustomerId as their primary key.
Yes, this is part of an assignment, and as part of a task it's required to fit this dataset into a relational DB, not a document DB, which I don't think would gain anything in this case anyway. So there is no direct question as such, but creating a table with 100+ columns doesn't feel right to me. Therefore, what I am trying to understand is how the theory approaches such blobs. Some concept names or potential ideas for further investigation would be appreciated, as I don't even know how to look this up.
In relational databases, putting all information in one table is not good usage. As you mentioned, grouping some columns into other tables and joining all tables with a master table works well. With this approach you can also manage one-to-many, many-to-one and many-to-many relationships, such as customers having more than one address or phone number.
Another approach is to make a table like customer_properties with columns like property_type and property_value, and store the data by rows:

customer_id | property_type  | property_value
----------------------------------------------
1           | num_of_child   | 3
1           | age            | 22
1           | marital_status | Single
...

But the first approach is more effective and by far the more common usage.
How to identify duplicate records using client name and address in SQL while both of them are in free text
I have a database with millions of client contacts. However, a lot of them are duplicated, and may I ask some hero from here to advise how to identify those duplicates using Oracle SQL, PL/SQL or Excel. Following is the data structure:

Client_Header
  id (integer, Primary Key)
  Client_First_Name (varchar2)
  Client_Last_Name (varchar2)
  Client_Date_Of_Birth (timestamp)

Client_Address
  Client_Id (Foreign Key ref Client_Header)
  Address_Line1 (varchar2)
  Address_Line2 (varchar2)
  Address_Line3 (varchar2)
  Suburb (varchar2)
  State (varchar2)
  Country (varchar2)

My challenge is that, other than Client_Date_Of_Birth and those key fields, all fields are free text only. For example, we have a client like the following:

Surname: Jones
First name: David
Client_Date_Of_Birth: 10/05/1975
Address: Unit 10 Floor 1, 20 Railway Parade, St Peter, NSW 2044

However, as those fields are free text, I have a lot of data issues, and the following link (jpeg file only) illustrates some of those issues: Sample of data issues.
Note: Other than those issues, sometimes we may miss the first name or last name of the client (but not both), and sometimes multiple problems can be found within the same record. Also, sometimes the address may simply be the name of a school, shopping centre etc. The system does not store any other id that can uniquely identify the client. I understand it is close to impossible to gather all duplicate records where the client address is a school or shopping centre. However, for other cases, is there any way to identify most of the duplication? Thank you for your help!
Not a pretty sight, and I'm afraid I don't have good news for you.
This is a common problem in databases, especially if the data entry personnel are insufficiently trained. One of the main objectives in data entry training is to make the problem well understood and show ways to avoid it. Something to keep in mind in the future.
Unfortunately, there isn't any "magic wand" that will clean your data for you. I'm sorry, but you have before you one of the most tedious tasks in database maintenance. You're going to have to basically remove the duplicates by hand, and the job requires more of an editor than a database administrator. If you have millions of records, of which perhaps a million are actually duplicates, I would estimate that it will take an expert working full time for at least two years, and probably longer, to clean up your problem: to do it in two years would require fixing 2000 records a day, with time off on weekends and two weeks of vacation.
In the end, the only sure way to remove all the duplicates is to compare all of them and remove them one at a time. But there are plenty of tricks you can use to get rid of blocks of them at once. Here are a few that I can think of with your data sample:
Change "Dave" to "David" in both first and last name fields. (Make sure that nobody actually has the last name "Dave.")
Change all instances of "Jones David" to "David Jones." (Make sure that there are no people named "Jones David".)
Change "1/F" to "Floor 1."
The idea is to focus on some of the fields, and in those fields get all of the duplicates to be exact duplicates. Once you have that done, you delete all the records with the target values in the fields, except the one with the primary key of the record that you want to keep (if your table isn't keyed, you'll have to find another way to do it, such as selecting the top record into a new table).
This technique speeds things up for records with a large number of duplicates. Where you have only a few duplicates, it's quicker to just identify them one by one. One way to do this quickly is to go into edit mode on a table, work with a particular field (for example, the postal code field in this case), and put a unique value in that field when you want to mark it for deletion (in this case, perhaps a single zero). Then you can periodically delete all the records with that value in the field.
You'll also need to sort the data in multiple ways to find the duplicates, which it appears you already know.
As for your notes, don't try to identify all the ways that the data is messed up. Once you identify one record as a duplicate of another, you don't care what's wrong with it, you just have to get rid of it. If you have two records and each contains data that you want to keep that the other one is missing, then you'll have to consolidate them and delete one of them. And then go on to the next, and the next, and the next...
Some years ago I had a similar task, and it took about one year to clean the data. What I did, in short:
Send the address to api.addressdoctor.com for validation and splitting into single fields (with maps.googleapis.com it is also possible).
Use a first name and last name match list to check the names (we used namepedia.org). A lot depends on the quality of this list; it should be based on the country of birth or of the first address. From the results we derived a probability of what kind of name each token is (first/last/company).
With this improved data, create some normalized and fuzzy attributes: normalized fields from the names and address, e.g. uppercased and reduced to alphanumeric characters.
At the end I would change the data model a little to improve data quality by design. I recommend adding pre-title, post-title, middle-name and post-name fields. You should also add the split address fields like street, street number, zip, location, longitude, latitude, etc. I would also change the relation between Client_Header and Client_Address with an extra address_id as primary key, but this depends on the requirements. And at the end I would add some constraints to prevent duplicated entries.
After all that, the deduplication is not hard: group all the normalized or fuzzy data together and create a dense_rank (I grouped by person, household, ...). Make a ranking over the attributes (I used data quality, data fill rate and transaction history for a score value). Finally it is your choice whether you just want to delete the duplicates and copy the corresponding data to the surviving client record, or virtually connect the data via Client_Id in an extra field.
For insert and update processes you should create PL/SQL functions that check whether a fuzzy last name (and first name) plus fuzzy address already exist. Split the name and address fields, check them with the address APIs, and match them against the name reference lists. If it is a single-tuple data entry, show the best results to the user and let him decide.
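A rough sketch of the grouping step in Oracle SQL (the column and view names are hypothetical; it assumes you have already built normalized fields such as norm_first_name, norm_last_name and norm_address as described above):

-- Identical normalized tuples receive the same group number.
SELECT id,
       norm_first_name,
       norm_last_name,
       norm_address,
       DENSE_RANK() OVER (
           ORDER BY norm_last_name, norm_first_name, norm_address
       ) AS person_group
FROM client_dedup_view;

-- Rows sharing a person_group are duplicate candidates; within each group,
-- score the records (data quality, fill rate, history) and keep the best one.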
Best way to add content (large list) to relational database
I apologize if this may seem like somewhat of a novice question (which it probably is), but I'm just introducing myself to the idea of relational databases and I'm struggling with this concept.
I have a database with roughly 75 fields which represent different characteristics of a 'user'. One of those fields represents the locations that user has been, and I'm wondering what the best way is to store the data so that it is easily retrievable and can be used later on (i.e. tracking a route on Google Maps, identifying if two users shared the same location, etc.). The problem is that some users may have 5 locations in total while others may have well over 100.
Is it best to store these locations in a text file named using the unique id of each user (one location on each line, or in a csv)? Or to create a separate table for each individual user connected to their unique id (that seems like overkill to me)? Or is there a way to store all of the locations directly in a single field in the original table?
I'm hoping that I'm missing a concept, or that there is a link to a tutorial that will help my understanding. If it helps, you can assume that the locations will be stored in order and will not be changed once stored. Also, these locations are static (I don't need to add any more locations later, and they can't be updated). Thank you for your time in helping me. I appreciate it!
Store the location data for the user in a separate table. The location table would link back to the user table by a common user_id. Keeping multiple locations for a particular user in the user table itself is not a good idea: you'll end up with denormalized data. You may want to read up on: Referential Integrity, Relational denormalization.
The most common way would be to have a separate table, something like:

USER_LOCATION
+---------+-------------+
| USER_ID | LOCATION_ID |
+---------+-------------+
|         |             |

If user 3 has 5 locations, there will be five rows containing user_id 3. However, if you say the order of locations matters, then an additional field specifying the ordinal position of the location within a user can be used.
The separate table approach is what we call normalized. If you store a location list as a comma-separated string of location ids, for example, it is trivial to maintain the order, but you lose the ability for the database to quickly answer the question "which users have been at location x?". Your data would be what we call denormalized.
You do have options, of course, but relational databases are pretty good at joining tables, and they are not overkill. They do look a little funny when you have ordering requirements, like the one you mention. But people use them all the time.
In a relational database you would use a mapping table. So you would have user, location and userlocation tables (user is a reserved word so you may wish to use a different name). This allows you to have a many-to-many relationship, i.e. many users can visit many locations. If you want to model a route as an ordered collection of locations then you will need to do more work. This site gives an example
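A sketch of those three tables (names adapted to avoid the reserved word, as the answer suggests; the location columns are my assumption):

CREATE TABLE app_user (
    user_id INT PRIMARY KEY,
    name    VARCHAR(100)
);

CREATE TABLE location (
    location_id INT PRIMARY KEY,
    latitude    DECIMAL(9,6),
    longitude   DECIMAL(9,6)
);

-- seq records the position of each stop, since the order matters here.
CREATE TABLE userlocation (
    user_id     INT REFERENCES app_user(user_id),
    location_id INT REFERENCES location(location_id),
    seq         INT NOT NULL,
    PRIMARY KEY (user_id, seq)
);

-- User 3's route, in order:
SELECT l.latitude, l.longitude
FROM userlocation ul
JOIN location l ON l.location_id = ul.location_id
WHERE ul.user_id = 3
ORDER BY ul.seq;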
One mysql table with many fields or many (hundreds of) tables with fewer fields?
I am designing a system for a client, where he is able to create data forms for various products he sells himself. The number of fields he will be using will not be more than 600-700 (worst case scenario), and will most likely be in the range of 400-500 (max).
I had 2 methods in mind for creating the database (using meta data):
a) Create a table for each product, which will hold only the fields necessary for that product. This results in hundreds of tables, but each with only the necessary fields.
b) Use one single table with all available form fields (anywhere from the current 300 to a max of 700), resulting in one table with MANY fields, of which only about 10% will be used for each product entry (a product should usually not use more than 50-80 fields).
Which solution is best, keeping in mind that table maintenance (creation, updates and changes) will be done using meta data, so I will not need to make changes to the table(s) manually? Thank you!
/**** UPDATE *****/ Just an update, even after this long time (and a lot of additional experience gathered), I need to mention that not normalizing your database is a terrible idea. What is more, a non-normalized database almost always (just always, from my experience) indicates a flawed application design as well.
I would have 3 tables:

product
  id
  name
  whatever else you need

field
  id
  field name
  anything else you might need

product_field
  id
  product_id
  field_id
  field value
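As DDL, that design might look like this (a sketch; the types are my guesses):

CREATE TABLE product (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE field (
    id         INT PRIMARY KEY,
    field_name VARCHAR(100)
);

CREATE TABLE product_field (
    id          INT PRIMARY KEY,
    product_id  INT REFERENCES product(id),
    field_id    INT REFERENCES field(id),
    field_value VARCHAR(400)
);

-- A product that uses 50 of 700 possible fields stores 50 rows,
-- not one wide row with 650 empty columns:
SELECT f.field_name, pf.field_value
FROM product_field pf
JOIN field f ON f.id = pf.field_id
WHERE pf.product_id = 7;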
Your key deciding factor is whether normalization is required. Even though you are only adding data using an application, you'll still need to cater for anomalies, e.g. what happens if someone's phone number changes, and they insert multiple rows over the lifetime of the application? Which row contains the correct phone number? As an example, you may find that you'll have repeating groups in your data, like one person with several phone numbers; rather than have three columns called "Phone1", "Phone2", "Phone3", you'd break that data into its own table. There are other issues in normalisation, such as transitive or non-key dependencies. These concepts will hopefully lead you to a database table design without modification anomalies, as you should hope for!
Pulegium's solution is a good way to go. You do not want to go with the one-table-per-product solution, because the structure of your database should not have to change when you insert or delete a product. Only the rows of one or many tables should be inserted or deleted, not the tables themselves.
While it's possible that it may be necessary, having that many fields for something as simple as a product list sounds to me like you probably have a flawed design. You need to analyze your potential table structures to ensure that each field contains no more than one piece of information (e.g., "2 hammers, 500 nails" in a single field is bad) and that each piece of information has no more than one field where it belongs (e.g., having phone1, phone2, phone3 fields is bad). Either of these situations indicates that you should move that information out into a separate, related table with a foreign key connecting it back to the original table. As pulegium has demonstrated, this technique can quickly break things down to three tables with only about a dozen fields total.
Is there ever a time where using a database 1:1 relationship makes sense?
I was thinking the other day about normalization, and it occurred to me that I cannot think of a time where there should be a 1:1 relationship in a database. Name:SSN? I'd have them in the same table. PersonID:AddressID? Again, same table. I can come up with a zillion examples of 1:many or many:many (with appropriate intermediate tables), but never a 1:1. Am I missing something obvious?
A 1:1 relationship typically indicates that you have partitioned a larger entity for some reason. Often it is because of performance reasons in the physical schema, but it can happen on the logical side as well, if a large chunk of the data is expected to be "unknown" at the same time (in which case you have a 1:0 or 1:1, but no more).
As an example of a logical partition: you have data about an employee, but there is a larger set of data that needs to be collected if and only if they select to have health coverage. I would keep the demographic data regarding health coverage in a different table, both to give easier security partitioning and to avoid hauling that data around in queries unrelated to insurance.
An example of a physical partition would be the same data being hosted on multiple servers. I may keep the health coverage demographic data in another state (where the HR office is, for example), and the primary database may only link to it via a linked server, avoiding replication of sensitive data to other locations yet making it available for (assuming here rare) queries that need it. Physical partitioning can be useful whenever you have queries that need consistent subsets of a larger entity.
One reason is database efficiency. Having a 1:1 relationship allows you to split up the fields which will be affected during a row/table lock. If table A has a ton of updates and table B has a ton of reads (or has a ton of updates from another application), then table A's locking won't affect what's going on in table B.
Others bring up a good point: security can also be a good reason, depending on how applications etc. are hitting the system. I would tend to take a different approach, but it can be an easy way of restricting access to certain data. It's really easy to just deny access to a certain table in a pinch. My blog entry about it.
Sparseness. The data relationship may be technically 1:1, but corresponding rows don't have to exist for every row. So if you have twenty million rows and there's some set of values that only exists for 0.5% of them, the space savings are vast if you push those columns out into a table that can be sparsely populated.
Most of the highly-ranked answers give very useful database tuning and optimization reasons for 1:1 relationships, but I want to focus on nothing but "in the wild" examples where 1:1 relationships naturally occur.
Please note one important characteristic of the database implementation of most of these examples: no historical information is retained about the 1:1 relationship. That is, these relationships are 1:1 at any given point in time. If the database designer wants to record changes in the relationship participants over time, then the relationships become 1:M or M:M; they lose their 1:1 nature. With that understood, here goes:
"Is-A" or supertype/subtype or inheritance/classification relationships: This category is when one entity is a specific type of another entity. For example, there could be an Employee entity with attributes that apply to all employees, and then different entities to indicate specific types of employee with attributes unique to that employee type, e.g. Doctor, Accountant, Pilot, etc. This design avoids multiple nulls, since many employees would not have the specialized attributes of a specific subtype. Other examples in this category could be Product as supertype with ManufacturingProduct and MaintenanceSupply as subtypes, or Animal as supertype with Dog and Cat as subtypes. Note that whenever you try to map an object-oriented inheritance hierarchy into a relational database (such as in an object-relational model), this is the kind of relationship that represents such scenarios.
"Boss" relationships, such as manager, chairperson, president, etc., where an organizational unit can have only one boss and one person can be boss of only one organizational unit. If those rules apply, then you have a 1:1 relationship, such as one manager of a department, one CEO of a company, etc. "Boss" relationships don't only apply to people: the same kind of relationship occurs if there is only one store as the headquarters of a company, or if only one city is the capital of a country, for example.
Some kinds of scarce resource allocation, e.g. one employee can be assigned only one company car at a time (e.g. one truck per trucker, one taxi per cab driver, etc.). A colleague gave me this example recently.
Marriage (at least in legal jurisdictions where polygamy is illegal): one person can be married to only one other person at a time. I got this example from a textbook that used it as an example of a 1:1 unary relationship when a company records marriages between its employees.
Matching reservations: when a unique reservation is made and then fulfilled as two separate entities. For example, a car rental system might record a reservation in one entity and then an actual rental in a separate entity. Although such a situation could alternatively be designed as one entity, it might make sense to separate the entities, since not all reservations are fulfilled and not all rentals require reservations, and both situations are very common.
I repeat the caveat I made earlier: most of these are 1:1 relationships only if no historical information is recorded. So, if an employee changes their role in an organization, or a manager takes responsibility of a different department, or an employee is reassigned a vehicle, or someone is widowed and remarries, then the relationship participants can change. If the database does not store any previous history about these 1:1 relationships, then they remain legitimate 1:1 relationships.
But if the database records historical information (such as adding start and end dates for each relationship), then they pretty much all turn into M:M relationships.
There are two notable exceptions to the historical note. First, some relationships change so rarely that historical information would normally not be stored. For example, most IS-A relationships (e.g. product type) are immutable; that is, they can never change. Thus the historical-record point is moot, and these would always be implemented as natural 1:1 relationships. Second, the reservation-rental relationship stores dates separately, since the reservation and the rental are independent events, each with its own dates. Since the entities have their own dates, rather than the 1:1 relationship itself having a start date, these remain 1:1 relationships even though historical information is stored.
Your question can be interpreted in several ways, because of the way you worded it. The responses show this.
There can definitely be 1:1 relationships between data items in the real world. No question about it. The "is a" relationship is generally one-to-one. A car is a vehicle. One car is one vehicle. One vehicle might be one car. Some vehicles are trucks, in which case one vehicle is not a car. Several answers address this interpretation.
But I think what you really are asking is: when 1:1 relationships exist, should tables ever be split? In other words, should you ever have two tables that contain exactly the same keys? In practice, most of us analyze only primary keys, and not other candidate keys, but that question is slightly different.
Normalization rules for 1NF, 2NF, and 3NF never require decomposing (splitting) a table into two tables with the same primary key. I haven't worked out whether putting a schema in BCNF, 4NF, or 5NF can ever result in two tables with the same keys. Off the top of my head, I'm going to guess that the answer is no.
There is a level of normalization called 6NF. The normalization rule for 6NF can definitely result in two tables with the same primary key. 6NF has the advantage over 5NF that NULLs can be completely avoided. This is important to some, but not all, database designers. I've never bothered to put a schema into 6NF. In 6NF, missing data can be represented by an omitted row, instead of a row with a NULL in some column.
There are reasons other than normalization for splitting tables. Sometimes split tables result in better performance. With some database engines, you can get the same performance benefits by partitioning the table instead of actually splitting it. This can have the advantage of keeping the logical design easy to understand, while giving the database engine the tools needed to speed things up.
I use them primarily for a few reasons. One is a significant difference in the rate of data change. Some of my tables may have audit trails where I track previous versions of records; if I only care to track previous versions of 5 out of 10 columns, splitting those 5 columns onto a separate table with an audit-trail mechanism on it is more efficient.
Also, I may have records (say, for an accounting app) that are write-only. You cannot change the dollar amounts or the account they were for; if you made a mistake, you need to create a corresponding record to write off the incorrect record, then create a correction entry. I have constraints on the table enforcing the fact that the records cannot be updated or deleted, but I may have a couple of attributes for that object that are malleable; those are kept in a separate table without the restriction on modification. Another time I do this is in medical record applications. There is data related to a visit that cannot be changed once it is signed off on, and other data related to a visit that can be changed after signoff. In that case I split the data and put a trigger on the locked table, rejecting updates to the locked table when signed off but allowing updates to the data the doctor is not signing off on.
Another poster commented on 1:1 not being normalized; I would disagree with that in some situations, especially subtyping. Say I have an employee table and the primary key is their SSN (it's an example; let's save the debate on whether this is a good key or not for another thread). The employees can be of different types, say temporary or permanent, and if they are permanent they have more fields to be filled out, like office phone number, which should only be non-null if type = 'Permanent'. In a third normal form database the column should depend only on the key, meaning the employee, but it actually depends on employee and type, so a 1:1 relationship is perfectly normal, and desirable, in this case. It also prevents overly sparse tables: I might have 10 columns that are normally filled but 20 additional columns only for certain types.
The most common scenario I can think of is when you have BLOBs. Let's say you want to store large images in a database (typically not the best way to store them, but sometimes the constraints make it more convenient). You would typically want the blob to be in a separate table to improve lookups of the non-blob data.
In terms of pure science, yes, they are useless. In real databases it's sometimes useful to keep a rarely used field in a separate table: to speed up queries using this and only this field; to avoid locks, etc.
Rather than using views to restrict access to fields, it sometimes makes sense to keep restricted fields in a separate table to which only certain users have access.
I can also think of situations where you have an OO model in which you use inheritance, and the inheritance tree has to be persisted to the DB. For instance, you have classes Bird and Fish, which both inherit from Animal.
In your DB you could have an 'Animal' table, which contains the common fields of the Animal class, and the Animal table has a one-to-one relationship with the Bird table, and a one-to-one relationship with the Fish table. In this case, you don't have to have one Animal table containing a lot of nullable columns to hold the Bird and Fish properties, where all columns that contain Fish data are set to NULL when the record represents a bird. Instead, you have a record in the Bird table that has a one-to-one relationship with the record in the Animal table.
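A sketch of that mapping (my names and columns; the shared primary key is what makes each relationship one-to-one):

CREATE TABLE animal (
    animal_id INT PRIMARY KEY,
    name      VARCHAR(100)   -- fields common to all animals
);

-- Each subtype row shares its key with exactly one animal row,
-- so bird-only columns never appear as NULLs on fish records.
CREATE TABLE bird (
    animal_id   INT PRIMARY KEY REFERENCES animal(animal_id),
    wingspan_cm INT
);

CREATE TABLE fish (
    animal_id   INT PRIMARY KEY REFERENCES animal(animal_id),
    max_depth_m INT
);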
1-1 relationships are also necessary if you have too much information. There is a record size limitation on each record in the table. Sometimes tables are split in two (with the most commonly queried information in the main table) just so that the record size will not be too large. Databases are also more efficient in querying if the tables are narrow.
In SQL it is impossible to enforce a 1:1 relationship between two tables that is mandatory on both sides (unless the tables are read-only). For most practical purposes a "1:1" relationship in SQL really means 1:0|1. The inability to support mandatory cardinality in referential constraints is one of SQL's serious limitations. "Deferrable" constraints don't really count because they are just a way of saying the constraint is not enforced some of the time.
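In concrete terms (a sketch): making the child table's primary key also a foreign key declares the 1:0|1 side, but nothing declarative forces the other direction:

CREATE TABLE a (id INT PRIMARY KEY);

-- Every row of b matches exactly one row of a (1:0|1)...
CREATE TABLE b (
    id INT PRIMARY KEY REFERENCES a(id)
);

-- ...but no constraint here can require that every row of a
-- also has a row in b; that is the mandatory side SQL cannot express.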
It's also a way to extend a table which is already in production with less (perceived) risk than a "real" database change. Seeing a 1:1 relationship in a legacy system is often a good indicator that fields were added after the initial design.
Most of the time, designs are thought to be 1:1 until someone asks "well, why can't it be 1:many"? Divorcing the concepts from one another prematurely is done in anticipation of this common scenario. Person and Address don't bind so tightly. A lot of people have multiple addresses. And so on... Usually two separate object spaces imply that one or both can be multiplied (x:many). If two objects were truly, truly 1:1, even philosophically, then it's more of an is-relationship. These two "objects" are actually parts of one whole object.
If you're using the data with one of the popular ORMs, you might want to break up a table into multiple tables to match your Object Hierarchy.
I have found that when I do a 1:1 relationship, it's totally for a systemic reason, not a relational reason. For instance, I've found that putting the reserved aspects of a user in one table and the user-editable fields in a different table makes writing the rules about permissions on those fields much, much easier. But you are correct: in theory, 1:1 relationships are completely contrived and are something of a rarity. Logically, however, they make the programs and optimizations that abstract the database easier.
Extended information that is only needed in certain scenarios; and legacy applications and programming languages (such as RPG) where the programs are compiled over the tables (so if a table changes, you have to recompile the program(s)). Tag-along files can also be useful in cases where you have to worry about table size.
Most frequently it is more of a physical than logical construction. It is commonly used to vertically partition a table to take advantage of splitting I/O across physical devices or other query optimizations associated with segregating less frequently accessed data or data that needs to be kept more secure than the rest of the attributes on the same object (SSN, Salary, etc). The only logical consideration that prescribes a 1-1 relationship is when certain attributes only apply to some of the entities. However, in most cases there is a better/more normalized way to model the data through entity extraction.
The best reason I can see for a 1:1 relationship is a supertype/subtype database design. I created a Real Estate MLS data structure based on this model. There were five different data feeds: Residential, Commercial, MultiFamily, Hotels and Land. I created a supertype called Property that contained data common to each of the five separate data feeds. This allowed for very fast "simple" searches across all datatypes. I created five separate subtypes that stored the unique data elements for each of the five data feeds. Each supertype record had a 1:1 relationship to the appropriate subtype record. If a customer wanted a detailed search, they had to select a super-sub type, for example PropertyResidential.
In my opinion, a 1:1 relationship maps class inheritance onto an RDBMS. There is a table A that contains the common attributes, i.e. the parent class state, and each inherited class's state is mapped onto the RDBMS with a table B in a 1:1 relationship with table A, containing the specialized attributes. Table A also contains a "type" field that represents the "casting" functionality. Bye, Mario
You can create a one-to-one relationship table if there is a significant performance benefit: you can put the rarely used fields into a separate table.
1:1 relationships don't really make sense if you're into normalization as anything that would be 1:1 would be kept in the same table. In the real world though, it's often different. You may want to break your data up to match your applications interface.
Possibly, if you have some kind of typed objects in your database. Say that in a table T1 you have the columns C1, C2, C3, with a one-to-one relation among them. It's OK; it's in normalized form. Now say that in a table T2 you have columns C1, C2, C3 as well (the names may differ, but the types and roles are the same), again with a one-to-one relation. It's OK for T2 for the same reasons as for T1.
In this case, however, I see a fit for a separate table T3 holding C1, C2, C3, with a one-to-one relation from T1 to T3 and from T2 to T3. I see a fit even more if there already exists another table with a one-to-many relation on C1, C2, C3, say from table A to multiple rows in table B. Then, instead of T3, you use B: you have a one-to-one relation from T1 to B, the same from T2 to B, and still the same one-to-many relation from A to B.
I believe normalization does not agree with this, and that may be an idea outside of it: identifying object types and moving objects of the same type to their own storage pool, using a one-to-one relation from some tables and a one-to-many relation from some other tables.
It can be useful for security purposes, although there are better ways to perform security checks. Imagine you create a key that can only open one door: if the key can open any other door, you should ring the alarm.
In essence, you can have a "CitizenTable" and a "VotingTable". Citizen one votes for candidate one, which is stored in the voting table. If citizen one appears in the voting table again, then there should be an alarm. Be advised, this is a one-to-one relationship because we are not referring to the candidate field, we are referring to the voting table and the citizen table.
Example:

Citizen Table
id = 1, citizen_name = "EvryBod"
id = 2, citizen_name = "Lesly"
id = 3, citizen_name = "Wasserman"

Candidate Table
id = 1, citizen_id = 1, candidate_name = "Bern Nie"
id = 2, citizen_id = 2, candidate_name = "Bern Nie"
id = 3, citizen_id = 3, candidate_name = "Hill Arry"

Then, if we see the voting table as so:

Voting Table
id = 1, citizen_id = 1, candidate_name = "Bern Nie"
id = 2, citizen_id = 2, candidate_name = "Bern Nie"
id = 3, citizen_id = 3, candidate_name = "Hill Arry"
id = 4, citizen_id = 3, candidate_name = "Hill Arry"
id = 5, citizen_id = 3, candidate_name = "Hill Arry"

We could say that citizen number 3 is a liar, pants on fire, who cheated Bern Nie. Just an example.
When you are dealing with a database from a third-party product, you probably don't want to alter their database, so as to prevent tight coupling; but you may have data that corresponds 1:1 with their data.
Anywhere two entirely independent entities share a one-to-one relationship. There must be lots of examples:
person <-> dentist (it's 1:N, so it's wrong!)
person <-> doctor (it's 1:N, so it's also wrong!)
person <-> spouse (it's 1:0|1, so it's mostly wrong!)
EDIT: Yes, those were pretty bad examples, particularly if I was always looking for a 1:1, not a 0 or 1 on either side. I guess my brain was misfiring. :-)
So I'll try again. It turns out, after a bit of thought, that the only way you can have two separate entities that must (as far as the software goes) be together all of the time is for them to exist together in a higher categorization. Then, if and only if you fall into a lower decomposition, the things are and should be separate, but at the higher level they can't live without each other. Context, then, is the key.
For a medical database you may want to store different information about specific regions of the body, keeping them as separate entities. In that case, a patient has just one head, and they need to have it, or they are not a patient. (They also have one heart, and a number of other necessary single organs.) If you're interested in tracking surgeries, for example, then each region should be a unique separate entity.
In a production/inventory system, if you're tracking the assembly of vehicles, then you certainly want to watch the engine's progress separately from the car body, yet there is a one-to-one relationship. A car must have an engine, and only one (or it wouldn't be a 'car' anymore). An engine belongs to only one car.
In each case you could produce the separate entities as one big record, but given the level of decomposition, that would be wrong. They are, in these specific contexts, truly independent entities, although they might not appear so at a higher level.
Paul.