A different (less redundant) approach to Normalization - sql

Let's say we are to normalize a database into 3rd normal form using the requirement:
I need a movie ticket registry program that can remember customers and
the tickets that they've purchased.
We might end up with a database like this:
ticket
id
movie_name
price
customer
id
first_name
However, when I look at this, for some reason it looks redundant. What if I were to break it up into even smaller pieces, like this:
name
id
name
customer
id
fk_name_id
ticket
id
fk_name_id
price
Would this be a good approach? Is there a name for this approach?

As Jordan says, the point of breaking data out into a separate table is to avoid redundant data.
As you apparently realize, we do NOT want to lay out our tables like this:
WRONG!!!
ticket
customer_name
movie_name
That would mean that the customer_name is repeated for every movie he watches, and the movie name is repeated for every person who watches that movie. Lots and lots of redundant names. If the user has to type them in every time, it's likely that sometimes he mis-spells a name or uses a variation on a name, like we find our table includes "Star Wars", "Star Wars IV", "Star Wars Episode IV", and "Stra Wars", all for the same movie. All sorts of problems.
By breaking the customer and the movie out into separate tables, we eliminate all the redundancy. Great. Celebrate.
But if we take your suggestion of making a "name" table that holds both customer names and movie names, did we eliminate any redundancy?
If a customer has the same name as a movie -- if we happen to have a customer named "Anna Karenina" or "John Carter" or whatever (or maybe someone named their kid "Batman Returns" for that matter) -- are you going to use the same record to store both? If no, then you have not saved any redundancy. You have just forced us to do an extra join every time we read the tables.
If you do use the same record, it's even worse. What if you create a record for customer "Anna Karenina" and you share the id/name record with the movie. Then Anna gets married and now her name is "Anna Smith". If you update the name record, you have not only changed the name of the customer, but also the title of the movie! This would be a very bad thing.
You could, of course, say that if you change the name, that instead of updating in place you create a new record for the new name. But then that defeats half the purpose of breaking the names out to a separate table. Suppose when we originally created the movie record we mistyped the name as "Anna Karina". Now someone points out our mistake and we fix it. But with the "make a new record every time there's a change" logic, we'd have to fix each ticket sale one by one.
I guess you could ask the user if this is a change for just the movie title, just the customer name, or both. But now we've added another level of complexity. And for what? Our program is more complex, our queries are more complex, and our user interface is more complex. In exchange, we get a tiny gain in saving disk space for the rare case where a customer coincidentally has the same name as a movie title.
Not worth it.

Your first approach is not correct. If you think about the problem, there are three entities:
Movie
Customer
Ticket
The connection between Movie and Customer is really the Ticket table, so this is an example of an association or junction table that has additional information.
I wouldn't think of the problem as "there is an entity 'name' and customers and movies both have names". The name is an attribute of other entities, it is not its own entity (at least in this case).

Jay's answer is excellent, and should be chosen as the correct one IMHO.
However I wanted to add: normalization does not mean "storing like data in a separate structure". That is absolutely not the intent of normalization, and this is a mistake made by a lot of inexperienced database modelers, especially when they have a programming (OOP) background.

Related

Database Design for Contacts

So I have 3 main entities. Airports, Customers, and Vendor.
Each of these will have multiple contacts I need to relate to each.
So they way I set it up currently.
I have the following tables..
Airport
Customer
Vendor
I then have one Contacts table and a xref for Airport, Customer, Vendor...
I am questioning that and was thinking a contacts table for each ..
Airport and AirportContacts
Customer and CustomerContacts
Vendor and VendorContacts
Any drawbacks to either of these designs?
To me, the deciding factor is duplication of entities vs "one version of the truth". If a single real-world person can be a contact for more than one of the other entities, then you don't want to store that single person in multiple contact tables, because then you have to maintain any changes to his/her properties in multiple places.
If you put the same "Joe Smith" in both AirportContacts and VendorContacts, then one day when you look and see his city is "Denver" in one table and "Boston" in another table, which one will you consider to be the truth?
But as someone mentioned in comments, if a contact can only be associated with one of the three other entities ("types" as you call them), then putting them in separate tables makes the most sense.
And there's yet a third scenario. Say "Joe Smith" can be a contact for both Airports and Vendors. But say that he has some properties, like his gender and age, which are the same regardless of which "type" he is being considered, but there might be some properties, like phone number, or position/job title, which could depend on the "type". Maybe he uses one phone in his capacity as an Airport Vendor, and a different phone as a Vendor Contact. Moreover, maybe there are some properties that apply to one type of contact that don't apply to the others. In these cases, I would look at some kind of hybrid approach where you keep common properties in a single Contact table, and "Type"-specific properties in their own type-related tables. These type-related tables would be bridge tables that have FKs back to the Contact table and to the main entity table of the "Type" they are related to (Vendor, Customer or Airport).
What I have so far ... Dont mind some of the data types.. just placed quick placeholders..

Many-to-Many-to-Many Relation in SQL?

I am quite new to SQL and I am trying to make a few relations and come across the following.
I have 3 tables
- Persons
- PhoneNumbers
- PhoneNumberCategories
Persons and PhoneNumbers is self explanatory I think and PhoneNumberCategories I use to indicate whether a phonenumber is for the office, home, mobile, fax etc.
Now I made a 4th table to make this many-to-many-to-many relation. This table has 4 columns:
ID
PersonID
PhoneNumberID
PhoneNumberCategoryID
And I have set-up 3 relations here. One for all columns except the ID Column. It seems to work, but it just gives me a strange feeling. Can someone tell me if I am on the wright track or if I am going all wrong about this.
The reason I want this is as follows. Of course I need to link a person to a phone number. However, I may have a phone number that is the general office (This would be the row 'OfficeGeneral') number, and therefore linked to me because I work there. I also have a direct office (And this would be the row 'OfficeDirect') number. This is of course also linked to me. The general number however, is linked to all people in the office as 'OfficeGeneral'. Except for the receptionist, here it would be linked as 'OfficeDirect'. And this is the reason I came up with this many-to-many-to-many relation. I can not find much about it on the web. And that is reason enough to doubt if this is the correct way to go about it. Anyway, this is just an example. I would like to make sure that I am flexible and can catch as many exceptions as possible. I am sure once the database is in use that people will come with situations which I have not anticipated. People are good at that I have learned over the years.
Clarification in Response to Comment Below:
Person can have more than 1 PhoneNumber.
PhoneNumber can have more than 1 Person.
Person with PhoneNumber can have more than 1 PhoneNumberCategory (i.e. Private Phone used for both Phone and Fax.)
Several people can be linked to the same PhoneNumber with different PhoneNumberCategories. (i.e. OfficeMain for me is OfficDirect for receptionist.)
Looking forward to hear from you all.
Your model looks fine. Your description, perhaps, is not.
Person and PhoneNumber have a many to many relationship. This relationship, in turn, is 1-many with PhoneNumberCategory -- unless a given phone number could be in more than one category.
I would expect that Person/PhoneNumber would be declared unique in this table, so a person could only use a particular phone number once. However, that may not be how you are viewing the structure.

Should I create a new entity for just one (string) field?

Suppose I have an entity, for example:
Person
-id
-name
-address
-phone
And then I want the person to have something that is a string and that may be repeated a lot of times for each person, for example the neighborhood, this is what I think I should do:
Person
-id
-name
-address
-phone
-idNeighborhood
and create a new table
Neighborhood
-id
-name
And of course, idNeighborhood is a foreign key to the id of Neighborhood.
Now, what I was thinking is that I will have to make JOINS every time (let's say I will use the neighborhood in 90% of the cases that I want to use some Person), so, is this wrong?:
Person
-id
-name
-address
-phone
-neighborhoodName
in which I will save the name of the neighborhood, but of course, will be repeated a lot of times (in the other case I would repeat a lot of IDs... so...)..
Also, in my particular case, Neighborhood will never grow, it will always have a name, that's why I think it's better to do this, but I am not really sure..
The only thing that I think of a disadvantage is that I can't make an index of neighborhood and then it will be slower, or not?
This is a question about what you want to model. You should model things in your database that are relevant to the problem you are trying to solve with your system.
Are you interested in neighbourhoods as an entity of their own?
Some reasons you would want to make a neighbourhood table:-
You end up adding additional attributes to the neighbourhood (say, city or state)
You end up having two neighbourhoods with the same name, that are actually different (so they need an identity beyond just their name)
You are worried about storage requirements and you have relatively few neighbourhoods and lots of people (the id will be smaller in size)
You want to control the master list of neighbourhoods that can be used (people only select from an existing neighbourhood and can't just enter any old thing)
Note that you could apply all this same logic to the name column as well. But you would only do that if your database was about people's names, the same way as you should make a neighbourhood column if modelling neighbourhoods will help with the problem you are solving.
It actually depends on your requirement.if you need to display neighbourhood's details like NAME, PROFESSION etc. In your app, the its better to use the first approach i.e create a entity with required attributes.
If you really do not need those details anywhere about your neighbourhood and you just need the name, you can go with your second approach

What to do if 2 (or more) relationship tables would have the same name?

So I know the convention for naming M-M relationship tables in SQL is to have something like so:
For tables User and Data the relationship table would be called
UserData
User_Data
or something similar (from here)
What happens then if you need to have multiple relationships between User and Data, representing each in its own table? I have a site I'm working on where I have two primary items and multiple independent M-M relationships between them. I know I could just use a single relationship table and have a field which determines the relationship type, but I'm not sure whether this is a good solution. Assuming I don't go that route, what naming convention should I follow to work around my original problem?
To make it more clear, say my site is an auction site (it isn't but the principle is similar). I have registered users and I have items, a user does not have to be registered to post an item but they do need to be to do anything else. I have table User which has info on registered users and Items which has info on posted items. Now a user can bid on an item, but they can also report a item (spam, etc.), both of these are M-M relationships. All that happens when either event occurs is that an email is generated, in my scenario I have no reason to keep track of the actual "report" or "bid" other than to know who bid/reported on what.
I think you should name tables after their function. Lets say we have Cars and People tables. Car has owners and car has assigned drivers. Driver can have more than one car. One of the tables you could call CarsDrivers, second CarsOwners.
EDIT
In your situation I think you should have two tables: AuctionsBids and AuctionsReports. I believe that report requires additional dictinary (spam, illegal item,...) and bid requires other parameters like price, bid date. So having two tables is justified. You will propably be more often accessing bids than reports. Sending email will be slightly more complicated then when this data is stored in one table, but it is not really a big problem.
I don't really see this as a true M-M mapping table. Those usually are JUST a mapping. From your example most of these will have additional information as well. For example, a table of bids, which would have a User and an Item, will probably have info on what the bid was, when it was placed, etc. I would call this table... wait for it... Bids.
For reporting items you might want what was offensive about it, when it was placed, etc. Call this table OffenseReports or something.
You can name tables whatever you want. I would just name them something that makes sense. I think the convention of naming them Table1Table2 is just because sometimes the relationships don't make alot of sense to an outside observer.
There's no official or unofficial convention on relations or tables names. You can name them as you want, the way you like.
If you have multiple user_data relationships with the same keys that makes absolutely no sense. If you have different keys, name the relation in a descriptive way like: stores_products_manufacturers or stores_products_paymentMethods
I think you're only confused because the join tables are currently simple. Once you add more information, I think it will be obvious that you should append a functional suffix. For example:
Table User
UserID
EmailAddress
Table Item
ItemID
ItemDescription
Table UserItem_SpamReport
UserID
ItemID
ReportDate
Table UserItem_Post
UserID -- can be (NULL, -1, '', ...)
ItemID
PostDate
Table UserItem_Bid
UserId
ItemId
BidDate
BidAmount
Then the relation will have a Role. For instance a stock has 2 companies associated: an issuer and a buyer. The relationship is defined by the role the parent and child play to each other.
You could either put each role in a separate table that you name with the role (IE Stock_Issuer, Stock_Buyer etc, both have a relationship one - many to company - stock)
The stock example is pretty fixed, so two tables would be fine. When there are multiple types of relations possible and you can't foresee them now, normalizing it into a relationtype column would seem the better option.
This also depends on the quality of the developers having to work with your model. The column approach is a bit more abstract... but if they don't get it maybe they'd better stay away from databases altogether..
Both will work fine I guess.
Good luck, GJ
GJ

Normalization in plain English

I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the wikipedia post, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind.
Does anyone has a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms?
Say you go to a job interview and the person asks: Explain the concept of normalization and how would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife it would have been something like that:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the country they came from. Instead of holding the name of the country which can be as long as "Bosnia & Herzegovina" for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45. Now in the future, as often happens with Balkan countries, they split to two countries: Bosnia and Herzegovina, I will have to change it only in one place. well, sort of.
Now, to explain 2NF, I would have changed the example, and let's assume that we hold the list of countries every person visited.
Instead of holding a table like:
Person CountryVisited AnotherInformation D.O.B.
Faruz USA Blah Blah 1/1/2000
Faruz Canada Blah Blah 1/1/2000
I would have created three tables, one table with the list of countries, one table with the list of persons and another table to connect them both. That gives me the most freedom I can get changing person's information or country information. This enables me to "remove duplicate rows" as normalization expects.
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, then you are violating normalization which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts, like the information that Bob's address is "The belltower" is repeated twice, which is redundant, and makes it difficult to query and change data and (the worst) possible to introduce logical inconsistencies.
Eg. if Bob moves I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. E.g. if I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
This is what I ask interviewees:
Why don't we use a single table for an application instead of using multiple tables ?
The answer is ofcourse normalization. As already said, its to avoid redundancy and there by update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to which row in the user table it belongs to. No blank columns are needed, and each user can have as few or many phone numbers as necessary.
One side point to note about normalization: A fully normalized database is space efficient, but is not necessarily the most time efficient arrangement of data depending on use patterns.
Skipping around to multiple tables to look up all the pieces of info from their denormalized locations takes time. In high load situations (millions of rows per second flying around, thousands of concurrent clients, like say credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized tables.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note that said you had to
go shopping for ice cream without
normalization, you would then have
another note, saying you have to go
shopping for ice cream, just one in
each pocket.
Now, In real life, you would never do
this, so why do it in a database?
For the designing and implementing part, thats when you can move back to "the lingo" and keep it away from layman terms, but I suppose you could simplify. You would say what you needed to at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that tables primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and the fact that you cannot always have the best structure AND be in normal forms.
Normalization is a set of rules that used to design tables that connected through relationships.
It helps in avoiding repetitive entries, reducing required storage space, preventing the need to restructure existing tables to accommodate new data, increasing speed of queries.
First Normal Form: Data should be broken up in the smallest units. Tables should not contain repetitive groups of columns. Each row is identified with one or more primary key.
For example, There is a column named 'Name' in a 'Custom' table, it should be broken to 'First Name' and 'Last Name'. Also, 'Custom' should have a column named 'CustiomID' to identify a particular custom.
Second Normal Form: Each non-key column should be directly related to the entire primary key.
For example, if a 'Custom' table has a column named 'City', the city should has a separate table with primary key and city name defined, in the 'Custom' table, replace the 'City' column with 'CityID' and make 'CityID' the foreign key in the tale.
Third normal form: Each non-key column should not depend on other non-key columns.
For example, In an order table, the column 'Total' is dependent on 'Unit price' and 'quantity', so the 'Total' column should be removed.
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it like Bill Lumbergh. We then query the students and ask them what we will have problems with, when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough name for some, because I have what some people would consider 2 first names and some people call me Richard. If you were trying to search for my last name then it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems with easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that meaningful is based upon the audience who is going to be using the database as well. We, at our job will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FEDEX might need it separated out to easily pull up the apartment or suite of where they need to go when they are on the road and running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
Avoiding Blanks:
I use an analogy to explain to them why they should avoid blanks. I tell them that Access and most databases do not store blanks like Excel does. Excel does not care if you have nothing typed out in the cell and will not increase the file size, but Access will reserve that space until that point in time that you will actually use the field. So even if it is blank, then it will still be using up space and explain to them that it also slows their searches down as well.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype in values if I have to type in the same thing over and over again. This “fat-fingering” of data will lead to my queries not finding the correct data. We instead, will break the data out into a separate table and create a relationship using a primary and foreign key field. This way we are saving space because we are not typing the customer's name, address, etc multiple times and instead are just using the customer's ID number in a field for the customer. We then will discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will setup a drop-down list that will give you a list of customer, where you can select their name and it will fill in the customer’s ID for you. This will be a 1-to-many relationship, whereas 1 customer will have many different orders.
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid similar to this.
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects that they work on. First, if we have a new employee, then they will not have any projects, so we will be wasting all of those fields, second if an employee has been here a long time then they might have worked on 300 projects, so we would have to include 300 project fields. Those people that are new and only have 1 project will have 299 wasted project fields. This design is also flawed because I will have to search in each of the project fields to find all of the people that have worked on a certain project, because that project number could be in any of the project fields.
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarfication/ breaking it down in plain English. The wiki page did not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times but I have found a better overview of normalization from this article. It is a simple easy to understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of
efficiently organizing data in a
database. There are two goals of the
normalization process: eliminating
redundant data (for example, storing
the same data in more than one table)
and ensuring data dependencies make
sense (only storing related data in a
table). Both of these are worthy goals
as they reduce the amount of space a
database consumes and ensure that data
is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in table that will have records inserted but not updated, such as history logs or financial transactions.
References
When not to Normalize your SQL Database
Database Design Basics
+1 for the analogy of talking to your wife. I find talking to anyone without a tech mind needs some ease into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
When in a truly normalized database, I have found that in situations it's been easier to write queries that are slow because of bad join operations, bad indexing on the tables, and plain bad design on the tables themselves.
Bluntly, it's easier to write bad queries in high level normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out a few tables, without having to join to a ton of tables to get one data set.