Related
Lets say I have 2 tables:
Species
SpeciesId
SpeciesName
Animal
AnimalId
SpeciesId - foreign key
If you give the end-user ability to change SpeciesName, that means they can affect the species of all animals that reference the changed record (at least from user standpoint). This may be a bit of an extreme example, but how are situations like this usually handled? Put the responsibility on the end-user to know what they are doing? Disallow name change if it has been used before?
We are discussing this situation at work and I want to get input from some others. One of the solutions that was brought up was to remove the foreign key (e.g. put a text field for species in the Animal table). This doesn't seem right to me, because at what point do you draw the line of using foreign keys? To me it seems like more of a training issue to make sure admins understand the impact of the changes they make. I know it's an open-ended question and it may vary per scenario, but I'm just trying to get some general opinions.
This is a design decision that you have to make. You need to determine what is more important from a business perspective. Do you value historical accuracy or efficiently updating the information?
In your example, I would put less emphasis on history for the following reasons.
Only the most recent convention is significant. Assume an animal moves from one genus to another, it really doesn't provide any value to know what the old and now invalid genus was.
All animals of the same species should have the same species ID. You get this for free with Foreign Keys. Assume a tiger was added prior to a species name change. Then a different tiger was added after the species name change. Both tigers still belong to the same species.
Querying the database by ID will be easier and more reliable than using a string let alone delving into the dirty business of string parsing. You don't need to worry about character encoding, capitalization, white space, punctuation, etc. Assume that you would like to retrieve all animals of one or more species.
Put the responsibility on the end-user to know what they are doing?
You need to decide what the end user is able to update. If your end user is a biologist that is well informed about the scientific names of the species, he should be able to update this information. Otherwise, maybe it's a good idea to prevent the user from modifying this column at all, or only if this particular species has any animal associated with it.
One of the solutions that was brought up was to remove the foreign key
Don't do that. You will lose the ability to join the information from these tables. Imagine your table Species has a column "Continent", indicating if the species is found on America, Africa, Europe, Asia, etc. If you use the foreign key, you can ask questions like "What are all the animals that belong to an american species?" This will be impossible if you remove the foreign key.
So I have searched around but haven't found a satisfactory answer.
I have different types of locations, as stated in the title. Given a type of location (i.e. city), the less granular locations can be inferred. I.e. if you know you're in Oregon, it implies you're in the United States, which implies you're in North America.
We have Objects that reference locations, but the granularity is not all the same. Some items might point to neighborhoods, others are only known down to the city level, while some are only known to a region, etc.
There were two ways in which I thought of organizing the data, this is the way I am leaning towards:
Have a generic "Locations" table, with a location "type" and a "parent location" referencing itself. So there'd be an entry for United States of type country, and an entry for Oregon type state which references United States.
i.e.
You can then have the object reference the location off its primary key, and then other locations can be inferred. Does this make sense or is there a better way I could be organizing the data?
The other way I considered was with a different table for each location "type" but then the problem is having our objects referencing it, since the most granular type of location for an object isn't always the same.
If I were to slip other location types in later, for example counties in between Cities and Regions, might this present a problem? I'm thinking it would be no more a problem than with separate tables, but perhaps there's a better way I can keep track of things in a logical way.
This is a case of subclasses, often called subtypes. It's complicated by the fact that some subtypes are contained in other subtypes. The container issue is well handled by classical elementary relational database design.
The subclass issue requires a little explanation. What OOP calls "subclasses" goes by the name "ER Specialization" in ER modeling circles. This tells you how to diagram subclasses, but it doesn't tell you how to implement them.
It's worth mentioning two techniques for implementing subclasses in SQL tables. The first goes by the name "single Table Inheritance". The second goes by the name "Class Table Inheritance". In class table inheritance, you will have one generic table for "locations" with all the attributes that are common to all locations, regardless of type. In the "Cities" table you will have attributes that pertain to cities, but not to countries, etc. You will have other subclass tables for the other types of locations.
If you go this route, you should look up another technique, called "Shared Priomary Key". In this technique, the id field of the subclass tables all contain copies of the id field from the superclass table. This requires a little effort, but it's well worth it.
Shared primary key offers several advantages. It enforces the one-to-one nature of a subclass relationship. It makes joining specialized data with generalized data simple, easy, and fast. It keeps track of which items belong in which subclass, without an extra field.
In your case, there is yet another advantage. Other tables that reference a location by using a foreign key don't have to decide whether to reference the superclass table or the subclass table. A single foreign key that references the superclass table will also implicitly reference one of the subclass tables, although it isn't obvious which one.
This isn't perfect, but it's very, very good. Been there, done that.
For more information, you can google the techniques, or find relevant tags here in SO.
What about:
Countries:
Id,
Name.
Regions:
Id,
CityId,
Name.
Cities:
Id,
RegionId,
Name.
Neighborhoods:
Id,
CityId,
Name.
This for location types. But the main problem in your case is
but the granularity is not all the same.
For this:
Object:
Id,
Name,
LocationId,
Type.
Good question.
You should definitely go with your first option. If you look at any data modeling patterns book, they all choose that way.
Is this North America only, or global?
Issues:
Cities/Towns/Hamlets/Villages are children of Divisions (generic term for state/province), though not in, say, England, where they are children of Country (or is it County)
Postal Areas (postal codes, zip codes) are children of Divisions too, not county or city. Some cities reside entirely in zips, and some zips reside entirely in cities
Counties are children of Division too. Manhattan contains counties, whereas most counties contain cities.
I would read Hay's Enterprise Model Patterns if you are hoping for a global solution. It's on safari for cheap.
I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the wikipedia post, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind.
Does anyone has a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms?
Say you go to a job interview and the person asks: Explain the concept of normalization and how would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife it would have been something like that:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the country they came from. Instead of holding the name of the country which can be as long as "Bosnia & Herzegovina" for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45. Now in the future, as often happens with Balkan countries, they split to two countries: Bosnia and Herzegovina, I will have to change it only in one place. well, sort of.
Now, to explain 2NF, I would have changed the example, and let's assume that we hold the list of countries every person visited.
Instead of holding a table like:
Person CountryVisited AnotherInformation D.O.B.
Faruz USA Blah Blah 1/1/2000
Faruz Canada Blah Blah 1/1/2000
I would have created three tables, one table with the list of countries, one table with the list of persons and another table to connect them both. That gives me the most freedom I can get changing person's information or country information. This enables me to "remove duplicate rows" as normalization expects.
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, then you are violating normalization which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts, like the information that Bob's address is "The belltower" is repeated twice, which is redundant, and makes it difficult to query and change data and (the worst) possible to introduce logical inconsistencies.
Eg. if Bob moves I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. E.g. if I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
This is what I ask interviewees:
Why don't we use a single table for an application instead of using multiple tables ?
The answer is ofcourse normalization. As already said, its to avoid redundancy and there by update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to which row in the user table it belongs to. No blank columns are needed, and each user can have as few or many phone numbers as necessary.
One side point to note about normalization: A fully normalized database is space efficient, but is not necessarily the most time efficient arrangement of data depending on use patterns.
Skipping around to multiple tables to look up all the pieces of info from their denormalized locations takes time. In high load situations (millions of rows per second flying around, thousands of concurrent clients, like say credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized tables.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note that said you had to
go shopping for ice cream without
normalization, you would then have
another note, saying you have to go
shopping for ice cream, just one in
each pocket.
Now, In real life, you would never do
this, so why do it in a database?
For the designing and implementing part, thats when you can move back to "the lingo" and keep it away from layman terms, but I suppose you could simplify. You would say what you needed to at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that tables primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and the fact that you cannot always have the best structure AND be in normal forms.
Normalization is a set of rules that used to design tables that connected through relationships.
It helps in avoiding repetitive entries, reducing required storage space, preventing the need to restructure existing tables to accommodate new data, increasing speed of queries.
First Normal Form: Data should be broken up in the smallest units. Tables should not contain repetitive groups of columns. Each row is identified with one or more primary key.
For example, There is a column named 'Name' in a 'Custom' table, it should be broken to 'First Name' and 'Last Name'. Also, 'Custom' should have a column named 'CustiomID' to identify a particular custom.
Second Normal Form: Each non-key column should be directly related to the entire primary key.
For example, if a 'Custom' table has a column named 'City', the city should has a separate table with primary key and city name defined, in the 'Custom' table, replace the 'City' column with 'CityID' and make 'CityID' the foreign key in the tale.
Third normal form: Each non-key column should not depend on other non-key columns.
For example, In an order table, the column 'Total' is dependent on 'Unit price' and 'quantity', so the 'Total' column should be removed.
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it like Bill Lumbergh. We then query the students and ask them what we will have problems with, when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough name for some, because I have what some people would consider 2 first names and some people call me Richard. If you were trying to search for my last name then it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems with easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that meaningful is based upon the audience who is going to be using the database as well. We, at our job will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FEDEX might need it separated out to easily pull up the apartment or suite of where they need to go when they are on the road and running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
Avoiding Blanks:
I use an analogy to explain to them why they should avoid blanks. I tell them that Access and most databases do not store blanks like Excel does. Excel does not care if you have nothing typed out in the cell and will not increase the file size, but Access will reserve that space until that point in time that you will actually use the field. So even if it is blank, then it will still be using up space and explain to them that it also slows their searches down as well.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype in values if I have to type in the same thing over and over again. This “fat-fingering” of data will lead to my queries not finding the correct data. We instead, will break the data out into a separate table and create a relationship using a primary and foreign key field. This way we are saving space because we are not typing the customer's name, address, etc multiple times and instead are just using the customer's ID number in a field for the customer. We then will discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will setup a drop-down list that will give you a list of customer, where you can select their name and it will fill in the customer’s ID for you. This will be a 1-to-many relationship, whereas 1 customer will have many different orders.
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid similar to this.
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects that they work on. First, if we have a new employee, then they will not have any projects, so we will be wasting all of those fields, second if an employee has been here a long time then they might have worked on 300 projects, so we would have to include 300 project fields. Those people that are new and only have 1 project will have 299 wasted project fields. This design is also flawed because I will have to search in each of the project fields to find all of the people that have worked on a certain project, because that project number could be in any of the project fields.
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarfication/ breaking it down in plain English. The wiki page did not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times but I have found a better overview of normalization from this article. It is a simple easy to understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of
efficiently organizing data in a
database. There are two goals of the
normalization process: eliminating
redundant data (for example, storing
the same data in more than one table)
and ensuring data dependencies make
sense (only storing related data in a
table). Both of these are worthy goals
as they reduce the amount of space a
database consumes and ensure that data
is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in table that will have records inserted but not updated, such as history logs or financial transactions.
References
When not to Normalize your SQL Database
Database Design Basics
+1 for the analogy of talking to your wife. I find talking to anyone without a tech mind needs some ease into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
When in a truly normalized database, I have found that in situations it's been easier to write queries that are slow because of bad join operations, bad indexing on the tables, and plain bad design on the tables themselves.
Bluntly, it's easier to write bad queries in high level normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out a few tables, without having to join to a ton of tables to get one data set.
I was thinking the other day on normalization, and it occurred to me, I cannot think of a time where there should be a 1:1 relationship in a database.
Name:SSN? I'd have them in the same table.
PersonID:AddressID? Again, same table.
I can come up with a zillion examples of 1:many or many:many (with appropriate intermediate tables), but never a 1:1.
Am I missing something obvious?
A 1:1 relationship typically indicates that you have partitioned a larger entity for some reason. Often it is because of performance reasons in the physical schema, but it can happen in the logic side as well if a large chunk of the data is expected to be "unknown" at the same time (in which case you have a 1:0 or 1:1, but no more).
As an example of a logical partition: you have data about an employee, but there is a larger set of data that needs to be collected, if and only if they select to have health coverage. I would keep the demographic data regarding health coverage in a different table to both give easier security partitioning and to avoid hauling that data around in queries unrelated to insurance.
An example of a physical partition would be the same data being hosted on multiple servers. I may keep the health coverage demographic data in another state (where the HR office is, for example) and the primary database may only link to it via a linked server... avoiding replicating sensitive data to other locations, yet making it available for (assuming here rare) queries that need it.
Physical partitioning can be useful whenever you have queries that need consistent subsets of a larger entity.
One reason is database efficiency. Having a 1:1 relationship allows you to split up the fields which will be affected during a row/table lock. If table A has a ton of updates and table b has a ton of reads (or has a ton of updates from another application), then table A's locking won't affect what's going on in table B.
Others bring up a good point. Security can also be a good reason depending on how applications etc. are hitting the system. I would tend to take a different approach, but it can be an easy way of restricting access to certain data. It's really easy to just deny access to a certain table in a pinch.
My blog entry about it.
Sparseness. The data relationship may be technically 1:1, but corresponding rows don't have to exist for every row. So if you have twenty million rows and there's some set of values that only exists for 0.5% of them, the space savings are vast if you push those columns out into a table that can be sparsely populated.
Most of the highly-ranked answers give very useful database tuning and optimization reasons for 1:1 relationships, but I want to focus on nothing but "in the wild" examples where 1:1 relationships naturally occur.
Please note one important characteristic of the database implementation of most of these examples: no historical information is retained about the 1:1 relationship. That is, these relationships are 1:1 at any given point in time. If the database designer wants to record changes in the relationship participants over time, then the relationships become 1:M or M:M; they lose their 1:1 nature. With that understood, here goes:
"Is-A" or supertype/subtype or inheritance/classification relationships: This category is when one entity is a specific type of another entity. For example, there could be an Employee entity with attributes that apply to all employees, and then different entities to indicate specific types of employee with attributes unique to that employee type, e.g. Doctor, Accountant, Pilot, etc. This design avoids multiple nulls since many employees would not have the specialized attributes of a specific subtype. Other examples in this category could be Product as supertype, and ManufacturingProduct and MaintenanceSupply as subtypes; Animal as supertype and Dog and Cat as subtypes; etc. Note that whenever you try to map an object-oriented inheritance hierarchy into a relational database (such as in an object-relational model), this is the kind of relationship that represents such scenarios.
"Boss" relationships, such as manager, chairperson, president, etc., where an organizational unit can have only one boss, and one person can be boss of only one organizational unit. If those rules apply, then you have a 1:1 relationship, such as one manager of a department, one CEO of a company, etc. "Boss" relationships don't only apply to people. The same kind of relationship occurs if there is only one store as the headquarters of a company, or if only one city is the capital of a country, for example.
Some kinds of scarce resource allocation, e.g. one employee can be assigned only one company car at a time (e.g. one truck per trucker, one taxi per cab driver, etc.). A colleague gave me this example recently.
Marriage (at least in legal jurisdictions where polygamy is illegal): one person can be married to only one other person at a time. I got this example from a textbook that used this as an example of a 1:1 unary relationship when a company records marriages between its employees.
Matching reservations: when a unique reservation is made and then fulfilled as two separate entities. For example, a car rental system might record a reservation in one entity, and then an actual rental in a separate entity. Although such a situation could alternatively be designed as one entity, it might make sense to separate the entities since not all reservations are fulfilled, and not all rentals require reservations, and both situations are very common.
I repeat the caveat I made earlier that most of these are 1:1 relationships only if no historical information is recorded. So, if an employee changes their role in an organization, or a manager takes responsibility of a different department, or an employee is reassigned a vehicle, or someone is widowed and remarries, then the relationship participants can change. If the database does not store any previous history about these 1:1 relationships, then they remain legitimate 1:1 relationships. But if the database records historical information (such as adding start and end dates for each relationship), then they pretty much all turn into M:M relationships.
There are two notable exceptions to the historical note: First, some relationships change so rarely that historical information would normally not be stored. For example, most IS-A relationships (e.g. product type) are immutable; that is, they can never change. Thus, the historical record point is moot; these would always be implemented as natural 1:1 relationships. Second, the reservation-rental relationship store dates separately, since the reservation and the rental are independent events, each with their own dates. Since the entities have their own dates, rather than the 1:1 relationship itself having a start date, these would remain as 1:1 relationships even though historical information is stored.
Your question can be interpreted in several ways, because of the way you worded it. The responses show this.
There can definitely be 1:1 relationships between data items in the real world. No question about it. The "is a" relationship is generally one to one. A car is a vehicle.
One car is one vehicle. One vehicle might be one car. Some vehicles are trucks, in which case one vehicle is not a car. Several answers address this interpretation.
But I think what you really are asking is... when 1:1 relationships exist, should tables ever be split? In other words, should you ever have two tables that contain exactly the same keys? In practice, most of us analyze only primary keys, and not other candidate keys, but that question is slightly diferent.
Normalization rules for 1NF, 2NF, and 3NF never require decomposing (splitting) a table into two tables with the same primary key. I haven't worked out whether putting a schema in BCNF, 4NF, or 5NF can ever result in two tables with the same keys. Off the top of my head, I'm going to guess that the answer is no.
There is a level of normalization called 6NF. The normalization rule for 6NF can definitely result in two tables with the same primary key. 6NF has the advantage over 5NF that NULLS can be completely avoided. This is important to some, but not all, database designers. I've never bothered to put a schema into 6NF.
In 6NF missing data can be represent by an omitted row, instead of a row with a NULL in some column.
There are reasons other than normalization for splitting tables. Sometimes split tables result in better performance. With some database engines, you can get the same performance benefits by partitioning the table instead of actually splitting it. This can have the advantage of keeping the logical design easy to understand, while giving the database engine the tools needed to speed things up.
I use them primarily for a few reasons. One is significant difference in rate of data change. Some of my tables may have audit trails where I track previous versions of records, if I only care to track previous versions of 5 out of 10 columns splitting those 5 columns onto a separate table with an audit trail mechanism on it is more efficient. Also, I may have records (say for an accounting app) that are write only. You can not change the dollar amounts, or the account they were for, if you made a mistake then you need to make a corresponding record to write adjust off the incorrect record, then create a correction entry. I have constraints on the table enforcing the fact that they cannot be updated or deleted, but I may have a couple of attributes for that object that are malleable, those are kept in a separate table without the restriction on modification. Another time I do this is in medical record applications. There is data related to a visit that cannot be changed once it is signed off on, and other data related to a visit that can be changed after signoff. In that case I will split the data and put a trigger on the locked table rejecting updates to the locked table when signed off, but allowing updates to the data the doctor is not signing off on.
Another poster commented on 1:1 not being normalized, I would disagree with that in some situations, especially subtyping. Say I have an employee table and the primary key is their SSN (it's an example, let's save the debate on whether this is a good key or not for another thread). The employees can be of different types, say temporary or permanent and if they are permanent they have more fields to be filled out, like office phone number, which should only be not null if the type = 'Permanent'. In a 3rd normal form database the column should depend only on the key, meaning the employee, but it actually depends on employee and type, so a 1:1 relationship is perfectly normal, and desirable in this case. It also prevents overly sparse tables, if I have 10 columns that are normally filled, but 20 additional columns only for certain types.
The most common scenario I can think of is when you have BLOB's. Let's say you want to store large images in a database (typically, not the best way to store them, but sometimes the constraints make it more convenient). You would typically want the blob to be in a separate table to improve lookups of the non-blob data.
In terms of pure science, yes, they are useless.
In real databases it's sometimes useful to keep a rarely used field in a separate table: to speed up queries using this and only this field; to avoid locks, etc.
Rather than using views to restrict access to fields, it sometimes makes sense to keep restricted fields in a separate table to which only certain users have access.
I can also think of situations where you have an OO model in which you use inheritance, and the inheritance tree has to be persisted to the DB.
For instance, you have a class Bird and Fish which both inherit from Animal.
In your DB you could have an 'Animal' table, which contains the common fields of the Animal class, and the Animal table has a one-to-one relationship with the Bird table, and a one-to-one relationship with the Fish table.
In this case, you don't have to have one Animal table which contains a lot of nullable columns to hold the Bird and Fish-properties, where all columns that contain Fish-data are set to NULL when the record represents a bird.
Instead, you have a record in the Birds-table that has a one-to-one relationship with the record in the Animal table.
1-1 relationships are also necessary if you have too much information. There is a record size limitation on each record in the table. Sometimes tables are split in two (with the most commonly queried information in the main table) just so that the record size will not be too large. Databases are also more efficient in querying if the tables are narrow.
In SQL it is impossible to enforce a 1:1 relationship between two tables that is mandatory on both sides (unless the tables are read-only). For most practical purposes a "1:1" relationship in SQL really means 1:0|1.
The inability to support mandatory cardinality in referential constraints is one of SQL's serious limitations. "Deferrable" constraints don't really count because they are just a way of saying the constraint is not enforced some of the time.
It's also a way to extend a table which is already in production with less (perceived) risk than a "real" database change. Seeing a 1:1 relationship in a legacy system is often a good indicator that fields were added after the initial design.
Most of the time, designs are thought to be 1:1 until someone asks "well, why can't it be 1:many"? Divorcing the concepts from one another prematurely is done in anticipation of this common scenario. Person and Address don't bind so tightly. A lot of people have multiple addresses. And so on...
Usually two separate object spaces imply that one or both can be multiplied (x:many). If two objects were truly, truly 1:1, even philosophically, then it's more of an is-relationship. These two "objects" are actually parts of one whole object.
If you're using the data with one of the popular ORMs, you might want to break up a table into multiple tables to match your Object Hierarchy.
I have found that when I do a 1:1 relationship its totally for a systemic reason, not a relational reason.
For instance, I've found that putting the reserved aspects of a user in 1 table and putting the user editable fields of the user in a different table allows logically writing those rules about permissions on those fields much much easier.
But you are correct, in theory, 1:1 relationships are completely contrived, and are almost a phenomenon. However logically it allows the programs and optimizations abstracting the database easier.
extended information that is only needed in certain scenarios. in legacy applications and programming languages (such as RPG) where the programs are compiled over the tables (so if the table changes you have to recompile the program(s)). Tag along files can also be useful in cases where you have to worry about table size.
Most frequently it is more of a physical than logical construction. It is commonly used to vertically partition a table to take advantage of splitting I/O across physical devices or other query optimizations associated with segregating less frequently accessed data or data that needs to be kept more secure than the rest of the attributes on the same object (SSN, Salary, etc).
The only logical consideration that prescribes a 1-1 relationship is when certain attributes only apply to some of the entities. However, in most cases there is a better/more normalized way to model the data through entity extraction.
The best reason I can see for a 1:1 relationship is a SuperType SubType of database design. I created a Real Estate MLS data structure based on this model. There were five different data feeds; Residential, Commercial, MultiFamily, Hotels & Land.
I created a SuperType called property that contained data that was common to each of the five separate data feeds. This allowed for very fast "simple" searches across all datatypes.
I create five separate SubTypes that stored the unique data elements for each of the five data feeds. Each SuperType record had a 1:1 relationship to the appropriate SubType record.
If a customer wanted a detailed search they had to select a Super-Sub type for example PropertyResidential.
In my opinion a 1:1 relationship maps a class Inheritance on a RDBMS.
There is a table A that contains the common attributes, i.e. the partent class status
Each inherited class status is mapped on the RDBMS with a table B with a 1:1 relationship
to A table, containing the specialized attributes.
The table namend A contain also a "type" field that represents the "casting" functionality
Bye
Mario
You can create a one to one relationship table if there is any significant performance benefit. You can put the rarely used fields into separate table.
1:1 relationships don't really make sense if you're into normalization as anything that would be 1:1 would be kept in the same table.
In the real world though, it's often different. You may want to break your data up to match your applications interface.
Possibly if you have some kind of typed objects in your database.
Say in a table, T1, you have the columns C1, C2, C3… with a one to one relation. It's OK, it's in normalized form. Now say in a table T2, you have columns C1, C2, C3, … (the names may differ, but say the types and the role is the same) with a one to one relation too. It's OK for T2 for the same reasons as with T1.
In this case however, I see a fit for a separate table T3, holding C1, C2, C3… and a one to one relation from T1 to T3 and from T2 to T3. I even more see a fit if there exist another table, with which there already exist a one to multiple C1, C2, C3… say from table A to multiple rows in table B. Then, instead of T3, you use B, and have a one to one relation from T1 to B, the same for from T2 to B, and still the same one to multiple relation from A to B.
I believe normalization do not agree with this, and that may be an idea outside of it: identifying object types and move objects of a same type to their own storage pool, using a one to one relation from some tables, and a one to multiple relation from some other tables.
It is unnecessary great for security purposes but there better ways to perform security checks. Imagine, you create a key that can only open one door. If the key can open any other door, you should ring the alarm. In essence, you can have "CitizenTable" and "VotingTable". Citizen One vote for Candidate One which is stored in the Voting Table. If citizen one appear in the voting table again, then their should be an alarm. Be advice, this is a one to one relationship because we not refering to the candidate field, we are refering to the voting table and the citizen table.
Example:
Citizen Table
id = 1, citizen_name = "EvryBod"
id = 2, citizen_name = "Lesly"
id = 3, citizen_name = "Wasserman"
Candidate Table
id = 1, citizen_id = 1, candidate_name = "Bern Nie"
id = 2, citizen_id = 2, candidate_name = "Bern Nie"
id = 3, citizen_id = 3, candidate_name = "Hill Arry"
Then, if we see the voting table as so:
Voting Table
id = 1, citizen_id = 1, candidate_name = "Bern Nie"
id = 2, citizen_id = 2, candidate_name = "Bern Nie"
id = 3, citizen_id = 3, candidate_name = "Hill Arry"
id = 4, citizen_id = 3, candidate_name = "Hill Arry"
id = 5, citizen_id = 3, candidate_name = "Hill Arry"
We could say that citizen number 3 is a liar pants on fire who cheated Bern Nie. Just an example.
When you are dealing with a database from a third party product, then you probably don't want to alter their database as to prevent tight coupling. but you may have data that corresponds 1:1 with their data
Anywhere were two entirely independent entities share a one-to-one relationship. There must be lots of examples:
person <-> dentist (its 1:N, so its wrong!)
person <-> doctor (its 1:N, so it's also wrong!)
person <-> spouse (its 1:0|1, so its mostly wrong!)
EDIT: Yes, those were pretty bad examples, particularly if I was always looking for a 1:1, not a 0 or 1 on either side. I guess my brain was mis-firing :-)
So, I'll try again. It turns out, after a bit of thought, that the only way you can have two separate entities that must (as far as the software goes) be together all of the time is for them to exist together in higher categorization. Then, if and only if you fall into a lower decomposition, the things are and should be separate, but at the higher level they can't live without each other. Context, then is the key.
For a medical database you may want to store different information about specific regions of the body, keeping them as a separate entity. In that case, a patient has just one head, and they need to have it, or they are not a patient. (They also have one heart, and a number of other necessary single organs). If you're interested in tracking surgeries for example, then each region should be a unique separate entity.
In a production/inventory system, if you're tracking the assembly of vehicles, then you certainly want to watch the engine progress differently from the car body, yet there is a one to one relationship. A care must have an engine, and only one (or it wouldn't be a 'car' anymore). An engine belongs to only one car.
In each case you could produce the separate entities as one big record, but given the level of decomposition, that would be wrong. They are, in these specific contexts, truly independent entities, although they might not appear so at a higher level.
Paul.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Academia has it that table names should be the singular of the entity that they store attributes of.
I dislike any T-SQL that requires square brackets around names, but I have renamed a Users table to the singular, forever sentencing those using the table to sometimes have to use brackets.
My gut feel is that it is more correct to stay with the singular, but my gut feel is also that brackets indicate undesirables like column names with spaces in them etc.
Should I stay, or should I go?
I had same question, and after reading all answers here I definitely stay with SINGULAR, reasons:
Reason 1 (Concept). You can think of bag containing apples like "AppleBag", it doesn't matter if contains 0, 1 or a million apples, it is always the same bag. Tables are just that, containers, the table name must describe what it contains, not how much data it contains. Additionally, the plural concept is more about a spoken language one (actually to determine whether there is one or more).
Reason 2. (Convenience). it is easier come out with singular names, than with plural ones. Objects can have irregular plurals or not plural at all, but will always have a singular one (with few exceptions like News).
Customer
Order
User
Status
News
Reason 3. (Aesthetic and Order). Specially in master-detail scenarios, this reads better, aligns better by name, and have more logical order (Master first, Detail second):
1.Order
2.OrderDetail
Compared to:
1.OrderDetails
2.Orders
Reason 4 (Simplicity). Put all together, Table Names, Primary Keys, Relationships, Entity Classes... is better to be aware of only one name (singular) instead of two (singular class, plural table, singular field, singular-plural master-detail...)
Customer
Customer.CustomerID
CustomerAddress
public Class Customer {...}
SELECT FROM Customer WHERE CustomerID = 100
Once you know you are dealing with "Customer", you can be sure you will use the same word for all of your database interaction needs.
Reason 5. (Globalization). The world is getting smaller, you may have a team of different nationalities, not everybody has English as a native language. It would be easier for a non-native English language programmer to think of "Repository" than of "Repositories", or "Status" instead of "Statuses". Having singular names can lead to fewer errors caused by typos, save time by not having to think "is it Child or Children?", hence improving productivity.
Reason 6. (Why not?). It can even save you writing time, save you disk space, and even make your computer keyboard last longer!
SELECT Customer.CustomerName FROM Customer WHERE Customer.CustomerID = 100
SELECT Customers.CustomerName FROM Customers WHERE Customers.CustomerID = 103
You have saved 3 letters, 3 bytes, 3 extra keyboard hits :)
And finally, you can name those ones messing up with reserved names like:
User > LoginUser, AppUser, SystemUser, CMSUser,...
Or use the infamous square brackets [User]
I prefer to use the uninflected noun, which in English happens to be singular.
Inflecting the number of the table name causes orthographic problems (as many of the other answers show), but choosing to do so because tables usually contain multiple rows is also semantically full of holes. This is more obvious if we consider a language that inflects nouns based on case (as most do):
Since we're usually doing something with the rows, why not put the name in the accusative case? If we have a table that we write to more than we read, why not put the name in dative? It's a table of something, why not use the genitive? We wouldn't do this, because the table is defined as an abstract container that exists regardless of its state or usage. Inflecting the noun without a precise and absolute semantic reason is babbling.
Using the uninflected noun is simple, logical, regular and language-independent.
If you use Object Relational Mapping tools or will in the future I suggest Singular.
Some tools like LLBLGen can automatically correct plural names like Users to User without changing the table name itself. Why does this matter? Because when it's mapped you want it to look like User.Name instead of Users.Name or worse from some of my old databases tables naming tblUsers.strName which is just confusing in code.
My new rule of thumb is to judge how it will look once it's been converted into an object.
one table I've found that does not fit the new naming I use is UsersInRoles. But there will always be those few exceptions and even in this case it looks fine as UsersInRoles.Username.
Others have given pretty good answers as far as "standards" go, but I just wanted to add this... Is it possible that "User" (or "Users") is not actually a full description of the data held in the table? Not that you should get too crazy with table names and specificity, but perhaps something like "Widget_Users" (where "Widget" is the name of your application or website) would be more appropriate.
What convention requires that tables have singular names? I always thought it was plural names.
A user is added to the Users table.
This site agrees:
http://vyaskn.tripod.com/object_naming.htm#Tables
This site disagrees (but I disagree with it):
http://justinsomnia.org/writings/naming_conventions.html
As others have mentioned: these are just guidelines. Pick a convention that works for you and your company/project and stick with it. Switching between singular and plural or sometimes abbreviating words and sometimes not is much more aggravating.
How about this as a simple example:
SELECT Customer.Name, Customer.Address FROM Customer WHERE Customer.Name > "def"
vs.
SELECT Customers.Name, Customers.Address FROM Customers WHERE Customers.Name > "def"
The SQL in the latter is stranger sounding than the former.
I vote for singular.
I am of the firm belief that in an Entity Relation Diagram, the entity should be reflected with a singular name, similar to a class name being singular. Once instantiated, the name reflects its instance. So with databases, the entity when made into a table (a collection of entities or records) is plural. Entity, User is made into table Users. I would agree with others who suggested maybe the name User could be improved to Employee or something more applicable to your scenario.
This then makes more sense in a SQL statement because you are selecting from a group of records and if the table name is singular, it doesn't read well.
I stick with singular for table names and any programming entity.
The reason? The fact that there are irregular plurals in English like mouse ⇒ mice and sheep ⇒ sheep. Then, if I need a collection, i just use mouses or sheeps, and move on.
It really helps the plurality stand out, and I can easily and programatically determine what the collection of things would look like.
So, my rule is: everything is singular, every collection of things is singular with an s appended. Helps with ORMs too.
IMHO, table names should be plural like Customers.
Class names should be singular like Customer if it maps to a row in the Customers table.
Singular. I don't buy any argument involving which is most logical - every person thinks his own preference is most logical. No matter what you do it is a mess, just pick a convention and stick to it. We are trying to map a language with highly irregular grammar and semantics (normal spoken and written language) to a highly regular (SQL) grammar with very specific semantics.
My main argument is that I don't think of the tables as a set but as relations.
So, the AppUser relation tells which entities are AppUsers.
The AppUserGroup relation tells me which entities are AppUserGroups
The AppUser_AppUserGroup relation tells me how the AppUsers and AppUserGroups are related.
The AppUserGroup_AppUserGroup relation tells me how AppUserGroups and AppUserGroups are related (i.e. groups member of groups).
In other words, when I think about entities and how they are related I think of relations in singular, but of course, when I think of the entities in collections or sets, the collections or sets are plural.
In my code, then, and in the database schema, I use singular. In textual descriptions, I end up using plural for increased readability - then use fonts etc. to distinguish the table/relation name from the plural s.
I like to think of it as messy, but systematic - and this way there is always a systematically generated name for the relation I wish to express, which to me is very important.
I also would go with plurals, and with the aforementioned Users dilemma, we do take the square bracketing approach.
We do this to provide uniformity between both database architecture and application architecture, with the underlying understanding that the Users table is a collection of User values as much as a Users collection in a code artifact is a collection of User objects.
Having our data team and our developers speaking the same conceptual language (although not always the same object names) makes it easier to convey ideas between them.
I personaly prefer to use plural names to represent a set, it just "sounds" better to my relational mind.
At this exact moment i am using singular names to define a data model for my company, because most of the people at work feel more confortable with it.
Sometimes you just have to make life easier to everyone instead of imposing your personal preferences.
(that's how i ended up in this thread, to get a confirmation on what should be the "best practice" for naming tables)
After reading all the arguing in this thread, i reached one conclusion:
I like my pancakes with honey, no matter what everybody's favorite flavour is. But if i am cooking for other people, i will try to serve them something they like.
Singular. I'd call an array containing a bunch of user row representation objects 'users', but the table is 'the user table'. Thinking of the table as being nothing but the set of the rows it contains is wrong, IMO; the table is the metadata, and the set of rows is hierarchically attached to the table, it is not the table itself.
I use ORMs all the time, of course, and it helps that ORM code written with plural table names looks stupid.
I've actually always thought it was popular convention to use plural table names. Up until this point I've always used plural.
I can understand the argument for singular table names, but to me plural makes more sense. A table name usually describes what the table contains. In a normalized database, each table contains specific sets of data. Each row is an entity and the table contains many entities. Thus the plural form for the table name.
A table of cars would have the name cars and each row is a car. I'll admit that specifying the table along with the field in a table.field manner is the best practice and that having singular table names is more readable. However in the following two examples, the former makes more sense:
SELECT * FROM cars WHERE color='blue'
SELECT * FROM car WHERE color='blue'
Honestly, I will be rethinking my position on the matter, and I would rely on the actual conventions used by the organization I'm developing for. However, I think for my personal conventions, I'll stick with plural table names. To me it makes more sense.
I don't like plural table names because some nouns in English are not countable (water, soup, cash) or the meaning changes when you make it countable (chicken vs a chicken; meat vs bird).
I also dislike using abbreviations for table name or column name because doing so adds extra slope to the already steep learning curve.
Ironically, I might make User an exception and call it Users because of USER (Transac-SQL), because I too don't like using brackets around tables if I don't have to.
I also like to name all the ID columns as Id, not ChickenId or ChickensId (what do plural guys do about this?).
All this is because I don't have proper respect for the database systems, I am just reapplying one-trick-pony knowledge from OO naming conventions like Java's out of habit and laziness. I wish there were better IDE support for complicated SQL.
We run similar standards, when scripting we demand [ ] around names, and where appropriate schema qualifiers - primarily it hedges your bets against future name grabs by the SQL syntax.
SELECT [Name] FROM [dbo].[Customer] WHERE [Location] = 'WA'
This has saved our souls in the past - some of our database systems have run 10+ years from SQL 6.0 through SQL 2005 - way past their intended lifespans.
If we look at MS SQL Server's system tables, their names as assigned by Microsoft are in plural.
The Oracle's system tables are named in singular. Although a few of them are plural.
Oracle recommends plural for user-defined table names.
That doesn't make much sense that they recommend one thing and follow another.
That the architects at these two software giants have named their tables using different conventions, doesn't make much sense either... After all, what are these guys ... PhD's?
I do remember in academia, the recommendation was singular.
For example, when we say:
select OrderHeader.ID FROM OrderHeader WHERE OrderHeader.Reference = 'ABC123'
maybe b/c each ID is selected from a particular single row ...?
The system tables/views of the server itself (SYSCAT.TABLES, dbo.sysindexes, ALL_TABLES, information_schema.columns, etc.) are almost always plural. I guess for the sake of consistency I'd follow their lead.
I am a fan of singular table names as they make my ER diagrams using CASE syntax easier to read, but by reading these responses I'm getting the feeling it never caught on very well? I personally love it. There is a good overview with examples of how readable your models can be when you use singular table names, add action verbs to your relationships and form good sentences for every relationships. It's all a bit of overkill for a 20 table database but if you have a DB with hundreds of tables and a complex design how will your developers ever understand it without a good readable diagram?
http://www.aisintl.com/case/method.html
As for prefixing tables and views I absolutely hate that practice. Give a person no information at all before giving them possibly bad information. Anyone browsing a db for objects can quite easily tell a table from a view, but if I have a table named tblUsers that for some reason I decide to restructure in the future into two tables, with a view unifying them to keep from breaking old code I now have a view named tblUsers. At this point I am left with two unappealing options, leave a view named with a tbl prefix which may confuse some developers, or force another layer, either middle tier or application to be rewritten to reference my new structure or name viewUsers. That negates a large part of the value of views IMHO.
Tables: plural
Multiple users are listed in the users table.
Models: singular
A singular user can be selected from the users table.
Controllers: plural
http://myapp.com/users would list multiple users.
That's my take on it anyway.
I once used "Dude" for the User table - same short number of characters, no conflict with keywords, still a reference to a generic human. If I weren't concerned about the stuffy heads that might see the code, I would have kept it that way.
I've always used singular simply because that's what I was taught. However, while creating a new schema recently, for the first time in a long time, I actively decided to maintain this convention simply because... it's shorter. Adding an 's' to the end of every table name seems as useless to me as adding 'tbl_' in front of every one.
This may be a bit redundant, but I would suggest being cautious. Not necessarily that it's a bad thing to rename tables, but standardization is just that; a standard -- this database may already be "standardized", however badly :) -- I would suggest consistency to be a better goal given that this database already exists and presumably it consists of more than just 2 tables.
Unless you can standardize the entire database, or at least are planning to work towards that end, I suspect that table names are just the tip of the iceberg and concentrating on the task at hand, enduring the pain of poorly named objects, may be in your best interest --
Practical consistency sometimes is the best standard... :)
my2cents ---
As others have mentioned here, conventions should be a tool for adding to the ease of use and readability. Not as a shackle or a club to torture developers.
That said, my personal preference is to use singular names for both tables and columns. This probably comes from my programming background. Class names are generally singular unless they are some sort of collection. In my mind I am storing or reading individual records in the table in question, so singular makes sense to me.
This practice also allows me to reserve plural table names for those that store many-to-many relationships between my objects.
I try to avoid reserved words in my table and column names, as well. In the case in question here it makes more sense to go counter to the singular convention for Users to avoid the need to encapsulate a table that uses the reserved word of User.
I like using prefixes in a limited manner (tbl for table names, sp_ for proc names, etc), though many believe this adds clutter. I also prefer CamelBack names to underscores because I always end up hitting the + instead of _ when typing the name. Many others disagree.
Here is another good link for naming convention guidelines: http://www.xaprb.com/blog/2008/10/26/the-power-of-a-good-sql-naming-convention/
Remember that the most important factor in your convention is that it make sense to the people interacting with the database in question. There is no "One Ring to Rule Them All" when it comes to naming conventions.
Possible alternatives:
Rename the table SystemUser
Use brackets
Keep the plural table names.
IMO using brackets is technically the safest approach, though it is a bit cumbersome. IMO it's 6 of one, half-a-dozen of the other, and your solution really just boils down to personal/team preference.
My take is in semantics depending on how you define your container. For example, A "bag of apples" or simply "apples" or an "apple bag" or "apple".
Example:
a "college" table can contain 0 or more colleges
a table of "colleges" can contain 0 or more collegues
a "student" table can contain 0 or more students
a table of "students" can contain 0 or more students.
My conclusion is that either is fine but you have to define how you (or people interacting with it) are going to approach when referring to the tables; "a x table" or a "table of xs"
I think using the singular is what we were taught in university. But at the same time you could argue that unlike in object oriented programming, a table is not an instance of its records.
I think I'm tipping in favour of the singular at the moment because of plural irregularities in English. In German it's even worse due to no consistent plural forms - sometimes you cannot tell if a word is plural or not without the specifying article in front of it (der/die/das). And in Chinese languages there are no plural forms anyway.
I only use nouns for my table names that are spelled the same, whether singular or plural:
moose
fish
deer
aircraft
you
pants
shorts
eyeglasses
scissors
species
offspring
I did not see this clearly articulated in any of the previous answers. Many programmers have no formal definition in mind when working with tables. We often communicate intuitively in terms of of "records" or "rows". However, with some exceptions for denormalized relations, tables are usually designed so that the relation between the non-key attributes and the key constitutes a set theoretic function.
A function can be defined as a subset of a cross-product between two sets, in which each element of the set of keys occurs at most once in the mapping. Hence the terminology arising from from that perspective tends to be singular. One sees the same singular (or at least, non-plural) convention across other mathematical and computational theories involving functions (algebra and lambda calculus for instance).
I always thought that was a dumb convention. I use plural table names.
(I believe the rational behind that policy is that it make it easier for ORM code generators to produce object & collection classes, since it is easier to produce a plural name from a singular name than vice-versa)