I have to specify that this is for a database assignment. I'm pretty good with SQL code but the diagram aspect of the assignment is killing me, I think that every step I take is wrong.
They have given us this scenario and these requirements:
A research team has asked you to create a database for a project on movie production
companies; the project aims to use machine learning, neural networks and other
methods to extract information about the situation of movie production companies in
Europe and the health of this sector for a set of specific countries, including the UK.
The data analytics application resulting from this project – which you DO NOT have to
develop; your job is to develop the central, server-side database that underpins it – has been commissioned by a research institute (which shall remain nameless), and it is
intended to be open source, and therefore available to anyone.
Basically, it is a machine learning application that would run on a database with the aim
to identify the correlation between different aspects of the sector, including funding
opportunities and development of new production companies or studios.
The database records every production company in Europe, including the name of the
company, the address, ZIP code, city, country, type of the company (e.g., non-profit
organisation), number of employees and net worth (calculated as total assets minus
total liabilities). Every production company has its name registered with one and only
one local government authority (for example, Companies House in the UK) on a specific
date; each company can have many shareholders. The authority typically requires
information about all the shareholders, including town of birth, mother’s maiden name,
father’s first name, their personal telephone number (only one), national insurance
number (each country in Europe has a similar unique ID), and passport number. Also,
the registration procedure has a cost associated with it (e.g., 12£ in the UK).
The database also records the employees’ data for each company: each employee is
assumed to work for a single production company. Due to the complex structure of
movie production companies and the need for various skills and professions,
employees are categorised into crew and staff. The crew consists of three main groups:
the actors, the director(s) and those who work on other jobs relevant to the filming
(producers, editors, production designers, costume designers, composer, etc.). All
other employees belong to the staff group, including those responsible for HR,
advertising, etc. Employees are identified by an employee ID, first name, last name and
an optional middle name, date of birth and start date. Also, each employee has their
contact details recorded, whether it is a single phone number or multiple, with a
description associated with each of them. Each employee has a single email address,
too.
Members of the crew are paid hourly, and this is recorded in the database as well as a
bonus that depends on their contract. Actors get a bonus for each day of work and
another bonus for each scene completed; directors get a bonus at the end of the
shooting; crew members that work in other jobs relevant to the filming get a bonus at
the end of the shooting, and they have their role recorded as well (e.g., producer or
costume designer).
Staff members have the monthly salary and the working hours (e.g., full time 9-5).
Furthermore, each staff member belongs to a specific department (e.g., advertising),
which is located in a given building at a given address (both recorded in the database).
The database records all movies from each production company. More specifically, for
each movie the following information is recorded: a universal unique movie code (similar to the ISBN for books), the title of the movie, the year and the first release date
(different release dates are not important and should NOT be recorded).
Also, the database records each member of the crew that is part of the movie, and the
role they have in the movie: each crew member can play a single role or multiple roles
in the same movie, and each role has a description associated with it. For example, in
each movie there can be a single protagonist or more than one, the same actor can play
one or several roles, or even have a cameo.
One of the aims of the project is to provide insights on the impact of funding and grants
within the movie industry. To this end, the database should be able to record all the
funding that each production company receives. This must include the name of the
grant, the funding body (e.g., the government of a given country or European Union
grants such as the ERDF), the maximum amount for that grant and the deadline to
submit a proposal.
Then, for each company the database must record the date of the application to a given
grant, the amount requested, the outcome (successful/unsuccessful).
A grant can be given to a single production company or shared among several. Finally,
once the database is ready, the project will run a set of machine learning algorithms to
perform high level data analysis based on the different grants and their corresponding
impact with the aim to investigate the impacts of such funding against a list of criteria.
No additional information is provided at this stage from the project.
In the spec, the requirements are numbered from 1 to 5, as the scenario was not given
at that time. The details of each requirement are provided in the following:
Each production company may have received one or multiple grants, and grants
can be shared by more than one company.
It is possible for each employee to have more than one telephone number. Each
telephone number has a description associated with it (e.g., personal, or work).
Each production company is registered only once but can have many shareholders.
Each employee can either be a member of the crew OR a staff member. Each crew
member can be an actor OR a director OR have another role. Each staff member
belongs to a department. No duplication of data is allowed.
Each crew member may be part of one or more movies in a single role or many.
Based on that I have created THIS DIAGRAM.
I think I have all the entities, attributes and relationships down, but I'm missing the keys. Keys can't be names, right? I will use the company entity as an example. Should I create new attributes like company_id to use as primary keys, or just underline the name attribute and use it as the primary key?
Also, please tell me if there's anything else wrong with the diagram.
Thanks a lot!
I created an ER diagram, but some entities don't have attributes that can be used as primary keys because they are names. I tried using them but I don't think it's right.
The problem with names as primary keys
In your diagram, several entities are identified by a name: Grant, Production Company, Shareholder (full name), Employee, Movie (Title). In theory you can use these as primary keys. However, this is bad practice:
names can change (e.g. departments and companies can be renamed, movies can have a temporary working title);
names are often not sufficient to distinguish entities (e.g. different people may share the same name, such as Adam Smith);
names can be spelled differently across sources of information, and are also easily misspelled;
although not really noticeable with modern RDBMS, names are more time consuming to search, and consume more memory when used as foreign keys.
How to choose a primary key?
You'd better use a primary key that guarantees uniqueness. You can then easily decide whether the same name corresponds to a different entity or not.
The next question that you'll face in your design is surrogate key vs. natural key:
When there's no other unique information, you'll have no choice but to use a surrogate key.
When there are other potentially unique attributes, you may choose to use either a natural key (e.g. company registration number, national insurance number together with a country code, movie code?) or a surrogate one.
Keep in mind that both have advantages and drawbacks, but the surrogate key is in general more robust, as natural keys sometimes turn out to be less stable than expected.
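To make the surrogate vs. natural key trade-off concrete, here is a minimal sketch using SQLite through Python's standard sqlite3 module. The table and column names (production_company, company_id, registration_no, country_code) and the sample rows are assumptions for illustration only; they are not taken from your diagram. The surrogate key identifies the row, while the natural candidate key is still protected by a UNIQUE constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE production_company (
    company_id      INTEGER PRIMARY KEY,   -- surrogate key (hypothetical name)
    registration_no TEXT NOT NULL,         -- natural key candidate (assumed attribute)
    country_code    TEXT NOT NULL,
    name            TEXT NOT NULL,         -- can change or repeat, so not a key
    UNIQUE (country_code, registration_no)
);
""")

insert = ("INSERT INTO production_company (registration_no, country_code, name) "
          "VALUES (?, ?, ?)")
# Two different companies may legitimately share a name.
conn.execute(insert, ("0001", "UK", "Acme Films"))
conn.execute(insert, ("0001", "FR", "Acme Films"))

# Renaming a company changes one row; anything referencing company_id is unaffected.
conn.execute("UPDATE production_company SET name = 'Acme Pictures' WHERE company_id = 1")

# The natural key is still enforced: a second UK registration '0001' is rejected.
try:
    conn.execute(insert, ("0001", "UK", "Another Studio"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute(
    "SELECT company_id, name FROM production_company ORDER BY company_id"
).fetchall()
print(duplicate_rejected, rows)
```

Other tables would then reference company_id, so a rename or a corrected registration number never has to cascade through foreign keys.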
Other remarks about your ERD
By the way, here are some issues and other remarks:
Works in relation does not relate Staff to anything else. From the name, it's obviously not a reflexive relation either. So this is a diagram error. department (name) and building should either be attached to a Department entity or be attributes of Staff.
In several cases you relate attributes to other attributes (actor-extra role, phone number-description). This is also a diagramming error. Either add the extra attribute to the same entity, or there's a missing relationship with a missing entity.
In one case you connect two entities directly, without a relationship between them (production company-application). This is an inconsistency that must also be corrected.
The following attributes are not real attributes but probably values of an unidentified attribute: producer, composer, actor, editor, xyzzy designer, advertising, HR, janitor.
Government authority is a misleading entity name: nowhere do you refer to data about the authority itself (name of the authority, e.g. "CNC", country of the government, ...). It's only information about the company's registration.
In your diagram you leave the hourly and monthly wage at the level of the Employee. This does not model accurately the requirements.
The link of the relation receive funding and the entity Application with the same attribute outcome seems very ambiguous.
In the name of the entities, stay consistent: either singular or plural. But mixing both will lead to lots of typos.
Better show cardinality in the link between the relation and the entity, than on the top of the relation: this avoids confusion about the direction of reading.
As a side remark, your question provides a wealth of interesting details that are not really needed to answer its core. In your next questions, better limit yourself to the information directly related to your issue ;-)
Research or not research, keep in mind that GDPR may apply and that it requires inter alia privacy by design (some information about the shareholder and the employee may require some additional thoughts).
I'm building a relational database that will act as a CRM for a travel company. I have removed tables and attributes to make this as simple as possible. Users will send quotes to customers.
A hotel can have many rooms (e.g. hotel 1 can have both a twin room and a triple room).
A room can have many hotels (e.g. both hotels 1 and 2 can have a twin room).
Let's say a customer has a group of 6.
A user could send this customer a quote for hotel 1 with either 3x twin rooms or 2x triple rooms.
A quote will need to contain the hotel and appropriate room type and room type quantities.
What's the best practice to connect table HOTEL_ROOM_JUNCTION to QUOTE, as the key is a multi-attribute, composite key?
Thank you
Noting the Relational Database tag.
Problem
There is a lack of precision in your declarations:
A hotel can have many rooms (e.g. hotel 1 can have both a twin room and a triple room).
A room can have many hotels (e.g. both hotels 1 and 2 can have a twin room).
I think you mean RoomType. From the rest of your declarations, the system you are implementing is for Quotations of rooms across all hotels, not a room booking system for each of the hotels. That is, you need to track RoomType, not Room, per Hotel.
The tables as given are not Relational tables, they do not have any of the requirements that make them Relational. When you start with stamping an id field on every file, it cripples the data analysis & data modelling exercise that is required to create a set of Relational tables. That is anti-Relational:
physical pointers such as record id are expressly prohibited in the Relational Model.
The Primary Key must be "made up from the data".
I appreciate that you have been schooled in that, due to the marketing and promotion of primitive methods as "relational".
For starters, each logical row (not physical record with a record id) must be unique.
The fields in each file should not be prefixed with the filename. In SQL (the data sub-language for the implementation of the Relational Model), the fully qualified address for a column is:
[server.][database.][owner.][table.]column
with defaults (obvious) for each element. If a column is ambiguous, simply prefix it with the table name.
Primary Keys are a special case. In order to avoid confusion (and now, to allow the new NATURAL JOIN), they should be the full name, in both the PK and FK locations. An id on every file would ensure buggy code.
Relational Data Model
If I address all those issues, and model the data according to the Relational Model, it would be:
Notation
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993.
My IDEF1X Introduction is essential reading for those who are new to the Relational Model, or its modelling method. Note that IDEF1X models are rich in detail and precision, showing all required details, whereas home-grown models have far less than that. Which means, the notation has to be understood.
Content
Relational Key
In order to make the logical rows unique, we need to make a Key from the data. The users know their data, they know what is unique and what is not. Usually they will have a ShortName for such things as Company; Hotel; Customer; etc.
If you do not communicate with the user, there is no chance of supplying the user's needs.
Hotel, UserName, Customer are ShortNames, which are unique, which therefore are the Primary Key. (More, later)
Relational Keys are composites, because they preserve the natural data hierarchies. Get used to it.
If you need the DDL for composite Keys, please ask.
Presuming that a Hotel may be a chain or franchise, we need a Location to make a specific hotel that has rooms unique.
The following are discrete Facts, and should not be mixed together (doing so will lead to complex constraints and horrendous SQL code):
HotelRoomType
that a Hotel.Location has a particular RoomType; and the Price
RoomTypeAvailable
that a Hotel.Location has one of those RoomTypes available on a particular Date; and the Number.
I presume there is a file from the hotels that you will be importing on a daily basis: this is the central table for that, with the constraints, of course.
Quote
that a User is providing a Quote that is requested by a single Customer, for a single TravelDate, for a single Hotel.Location. This allows separate Quotes for separate Hotel.Locations for a single TravelDate; Quotes for a Customer for more than one TravelDate; etc.
If you need multiple Hotel.Locations (and their RoomTypes) on a single Quote, let me know in the comments, and I will update the data model.
QuoteRoomType
that a Quote contains a line item which is a single RoomType in the single Hotel.Location that is available on the TravelDate.
Relational Integrity
A logical feature of the Relational Model, which is distinct from Referential Integrity, which is a physical feature in SQL. It is not possible to achieve this in a Record Filing System with record ids as "primary keys", not even an advanced and progressed one (after the various errors in the initial RFS have been corrected). Genuine logical Keys ("made up from the data") are required.
In RoomTypeAvailable, we have constrained:
RoomTypes to that which the Hotel.Location actually has (in HotelRoomType)
AND is actually available on Date.
In QuoteRoomType, we have constrained:
Hotel.Location to that which is in the Quote,
AND RoomTypes to that which is available in Hotel.Location (in HotelRoomType),
AND which is available on the TravelDate (RoomTypeAvailable.Date "maps to" QuoteRoomType.TravelDate).
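As a rough illustration of the composite Keys and constraints described above, the following sketch declares the four Facts in SQLite via Python's sqlite3 module. It is not the data model from the diagram: the table HotelLocation, the column AvailableDate (standing in for RoomTypeAvailable.Date), the Quote key (Customer, TravelDate, Hotel, Location), and all sample rows are assumptions, and Customer and UserName are left as plain columns to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE HotelLocation (              -- hypothetical parent for Hotel.Location
    Hotel TEXT NOT NULL, Location TEXT NOT NULL,
    PRIMARY KEY (Hotel, Location)
);
CREATE TABLE RoomType (
    RoomType TEXT NOT NULL PRIMARY KEY, Description TEXT NOT NULL
);
CREATE TABLE HotelRoomType (              -- a Hotel.Location has a RoomType, at a Price
    Hotel TEXT NOT NULL, Location TEXT NOT NULL, RoomType TEXT NOT NULL,
    Price NUMERIC NOT NULL,
    PRIMARY KEY (Hotel, Location, RoomType),
    FOREIGN KEY (Hotel, Location) REFERENCES HotelLocation (Hotel, Location),
    FOREIGN KEY (RoomType) REFERENCES RoomType (RoomType)
);
CREATE TABLE RoomTypeAvailable (          -- ... and has some available on a date
    Hotel TEXT NOT NULL, Location TEXT NOT NULL, RoomType TEXT NOT NULL,
    AvailableDate TEXT NOT NULL, NumRoom INTEGER NOT NULL,
    PRIMARY KEY (Hotel, Location, RoomType, AvailableDate),
    FOREIGN KEY (Hotel, Location, RoomType)
        REFERENCES HotelRoomType (Hotel, Location, RoomType)
);
CREATE TABLE Quote (                      -- key columns are an assumption
    Customer TEXT NOT NULL, TravelDate TEXT NOT NULL,
    Hotel TEXT NOT NULL, Location TEXT NOT NULL, UserName TEXT NOT NULL,
    PRIMARY KEY (Customer, TravelDate, Hotel, Location),
    FOREIGN KEY (Hotel, Location) REFERENCES HotelLocation (Hotel, Location)
);
CREATE TABLE QuoteRoomType (              -- one line item per RoomType on a Quote
    Customer TEXT NOT NULL, TravelDate TEXT NOT NULL,
    Hotel TEXT NOT NULL, Location TEXT NOT NULL, RoomType TEXT NOT NULL,
    NumRoom INTEGER NOT NULL,
    PRIMARY KEY (Customer, TravelDate, Hotel, Location, RoomType),
    FOREIGN KEY (Customer, TravelDate, Hotel, Location)
        REFERENCES Quote (Customer, TravelDate, Hotel, Location),
    -- TravelDate "maps to" the availability date, same Hotel.Location
    FOREIGN KEY (Hotel, Location, RoomType, TravelDate)
        REFERENCES RoomTypeAvailable (Hotel, Location, RoomType, AvailableDate)
);

-- Illustrative rows only.
INSERT INTO HotelLocation VALUES ('Grand', 'Paris');
INSERT INTO RoomType VALUES ('Twin', 'Two single beds'), ('Triple', 'Three single beds');
INSERT INTO HotelRoomType VALUES ('Grand', 'Paris', 'Twin', 120), ('Grand', 'Paris', 'Triple', 160);
INSERT INTO RoomTypeAvailable VALUES ('Grand', 'Paris', 'Twin', '2024-06-01', 5);
INSERT INTO Quote VALUES ('Smith', '2024-06-01', 'Grand', 'Paris', 'alice');
INSERT INTO QuoteRoomType VALUES ('Smith', '2024-06-01', 'Grand', 'Paris', 'Twin', 3);
""")

def try_insert(sql):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# The hotel has Triple rooms, but none are available on the travel date.
triple_accepted = try_insert(
    "INSERT INTO QuoteRoomType VALUES ('Smith', '2024-06-01', 'Grand', 'Paris', 'Triple', 2)")
line_items = conn.execute("SELECT COUNT(*) FROM QuoteRoomType").fetchone()[0]
print(triple_accepted, line_items)
```

Because the Key columns are carried down into QuoteRoomType, the two foreign keys together restrict a line item to the Quote's own Hotel.Location and to a RoomType that is available on the TravelDate, with no trigger or application code.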
1960's Record Filing System • Anti-Relational, Sold as "relational"
This section is relevant for those who prescribe a Record ID field as "primary key" in every file. And somehow think that that is "relational". Others can safely skip it.
For comparison, here is the set of files that one would come up with, if one followed the techniques and methods that are promoted and marketed by the Date; Darwen; Fagin; et al crowd, falsely proposed as "relational".
This is a "mature" or "advanced" model, the fourth or fifth iteration. It has a number of improvements over the initial RFS. The initial or second or third iteration would not be equivalent enough to offer a comparison:
the Facts that are required to support the system have been determined (as opposed to the initial model, the record perspective, which is oblivious to Facts).
the content of the records has been improved to prevent duplicates, to the extent possible given the record content (but it is still streets behind the uniqueness provided in a Relational data model)
Fails Relational
Nevertheless it has no Relational features, which are logical. It has only the physical features of SQL reference-ability. Just a few of the many failures, which the mob prescribes as "relational":
Duplicate rows (logical) are not prevented, because rows are not defined.
No Relational Integrity
which depends on Relational Keys. (Refer to the Relational Keys detailed above.)
Eg. QuoteRoomType is constrained to any RoomTypeAvailable.
It is not possible to constrain it to:
the HotelId that is referenced in the Quote only,
OR to RoomTypes that exist in the HotelId only,
OR to RoomTypesAvailable that are available on the TravelDate only.
One additional field, and one additional index, for the Record id on every file. That will have a marvellous effect on performance.
Horrendous navigation and query code.
No Relational Power
When two distal files need to be JOINed, each of the intermediate files must be additionally JOINed, something that is not required in a Relational database. That is because it breaks the Access Path Independence Rule, a concept that the razor gang have not understood in the fifty years since the advent of the RM. But they will come up with yet another abnormal "normal form", to add to their bag of seventeen thus far.
More, Not Fewer, Joins
Let’s look at what that means. We need a query to provide statistics for RoomTypes that have been quoted for the previous year, so that hotels can re-arrange their room types to suit the expected traffic.
Using the Relational data model (separate section above), we would code:
SELECT RoomType.RoomType,     -- Relational database
       Description,
       SUM( NumRoom )
    FROM RoomType
    JOIN QuoteRoomType ON RoomType.RoomType = QuoteRoomType.RoomType
    WHERE DATEPART( YY, TravelDate ) = DATEPART( YY, GETDATE() ) - 1
    GROUP BY RoomType.RoomType, Description
Using the Record Filing System data model, which is the result of following the advice of the Date; Darwen; Fagin; philipxy; AntC; et al gang, which is falsely marketed as "relational" (above), we would be forced to code:
SELECT RoomType,              -- Record Filing System
       Description,
       SUM( NumRoom )
    FROM RoomType
    JOIN HotelRoomType
        ON RoomType.RoomTypeId = HotelRoomType.RoomTypeId
    JOIN RoomTypeAvailable
        ON HotelRoomType.HotelRoomTypeId = RoomTypeAvailable.HotelRoomTypeId
    JOIN QuoteRoomType
        ON RoomTypeAvailable.RoomTypeAvailableId = QuoteRoomType.RoomTypeAvailableId
    JOIN Quote
        ON QuoteRoomType.QuoteId = Quote.QuoteId
    WHERE DATEPART( YY, TravelDate ) = DATEPART( YY, GETDATE() ) - 1
    GROUP BY RoomType, Description
Gotta love the QueryPlan for that, that the SQL platform will produce.
Re-arranging the order of the JOINs might improve the tortoise.
Resorting to moving fragments such as “partial FDs” or “MVDs” around, might improve it.
Perhaps deploying more “candies”, plus the required additional indices, all over the place, will help. But wait, that would be duplication on a mass scale, it would break Normalisation, there would be Update Anomalies everywhere one looks.
Note that that result set has no reliability; no credibility. Why ? Because, as already proved, the QuoteRoomType is not constrained to the Quote.Hotel (referenced by HotelId);
or to the Quote.TravelDate;
or to the RoomTypes available in QuoteHotel (referenced by HotelId).
Further, there may well be duplicates, because prevention can only be partially implemented. The result of which is unreliable result sets.
Simplicity vs Complexity
If you have the interest and the stamina, you can attempt to elevate the RFS by muddling through their "partial dependencies"; "transitive dependencies"; "candies"; "multi-valued dependencies"; etc, all of which are neither defined in, nor required in, the Relational Model. They are expressly for use in the Record Filing Systems of the last century.
First, the RFS paradigm (marketed as "relational") forces a record mindset, instead of a data-only mindset.
Second, it breaks everything down into fragments, instead of understanding the atoms; the Facts, in their full context (data hierarchies).
Third, it gives you a morass of complexity to handle the fragments, that have no relevance when handling atoms.
When you are done, all that complexity in the Record Filing System will still not be anywhere near the simplicity of the equivalent Relational data model: it will have:
No Relational Integrity (yes, yes, we have Declarative Referential Integrity, and that only for physical records, not for logical rows)
No Relational Power (multiple forced JOINs in every query)
No Relational Speed (those additional columns and indices have an effect).
And the navigation and query code will be horrendous, and prone to errors.
Please feel free to ask specific questions. Also, please supply clarifications as noted, and I will update the data model.
Since a specific room can only exist in one hotel, the table HOTEL_ROOM_JUNCTION is redundant. So the pk hotel_id is an fk in room, and the pk in room is a composite key of hotel_id and room_id.
If one quote can consist of several rooms, you need a connecting table between quote and room with fks quote_id, room_id and hotel_id, and those three will be the pk in that table. (As a rule of thumb, that kind of table will usually need a timestamp.)
(As a side note, I would name the tables QUOTES, ROOMS and HOTELS since they contain many.)
EDIT: I misread the question somewhat. To make my model do what the OP wants, I need to add ROOM_TYPES with pk room_type_id, which will be an fk (not null) in ROOMS but not part of the pk.
I am tasked to find anomalies within this relation. I had identified a few insertion, deletion and update anomalies.
Commission Percentage: the percentage of the total sales made by a salesperson that is paid as commission to that salesperson.
Year of Hire: the year the salesperson was first hired
Department Number: the number of the department where the salesperson works
Manager Name: name of the manager of the department
However, I am confused about one anomaly that I pulled out. Below is the statement:
There can not be a manager with the same name in the company as there is no primary identifier for the manager entity except for the name, which can be a duplicate within the company.
May I know how should I phrase the above statement and under which (update/deletion/insertion) anomaly should I include it in?
Thank you
May I request additional assistance below as well:
How would you change the current design and how does your new design address the problems you have identified with the current design.
My current design is splitting it into 3 relations:
Salesperson(salespersonNumber, salespersonName, commissionPercentage, YearOfHire, departmentNumber)
Product(productNumber, productName, unitPrice)
Manager(managerNumber, managerName, departmentNumber)
However, I am missing out quantity entity.
Quantity requires composite key of productNumber & salespersonNumber.
Should I make it in another relation by itself?
Quantity(productNumber, salespersonNumber)
Anomalies
When identifying (potential) anomalies, you're listing dependent attributes that are affected by the anomalies (you forgot Salesperson Name, btw). Specifically, you listed attributes that depended on a subset of the key (Salesperson Number, Product Number), thus violating 2NF. You're on the right track.
However, be careful not to confuse attributes with anomalies. An update anomaly would be if 1 of the 3 instances of Bilstein got changed. The (assumed) functional dependency Salesperson Name depends on Salesperson Number would be broken and the data would be inconsistent (Salesperson Number 437 would be associated with more than one name). Remember that normalization aims to eliminate redundant associations.
Identity
The problem with identifying managers by name indicates a poor modeling decision. As you stated, a company's set of managers isn't uniquely identified by name, so there's a mismatch between the logical data model and the world it models. This won't cause insert, update or delete anomalies as long as we use different values for different managers, but it will prevent convenient identification of managers. Possible improvements would be to use multiple attributes (abstract domains are often easily identified by a combination of attributes, but natural domains like people usually aren't, e.g. Manager Name, Birthdate would be more identifying but still not a good solution), turn the Manager Name into a surrogate key (e.g. Scott1, Scott2), or introduce a new surrogate key (e.g. a numeric ID).
Proposed improvement
Your proposed design normalizes the original table as well as addressing the identification problem. It's a good answer except for two issues: it doesn't include the association between Salesperson and Manager, and in your Quantity relation, you forgot to include the quantity as a dependent attribute.
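For illustration, one way to close both gaps is sketched below with SQLite via Python's sqlite3 module. It is only one possible reading: the Department relation is my own addition (hypothetical, assuming one manager per department), which moves departmentNumber out of Manager, and every sample row except salesperson 437 "Bilstein" is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Manager (
    managerNumber INTEGER PRIMARY KEY,     -- surrogate key: names may repeat
    managerName   TEXT NOT NULL
);
CREATE TABLE Department (                  -- hypothetical relation, not in the question
    departmentNumber INTEGER PRIMARY KEY,
    managerNumber    INTEGER NOT NULL REFERENCES Manager (managerNumber)
);
CREATE TABLE Salesperson (
    salespersonNumber    INTEGER PRIMARY KEY,
    salespersonName      TEXT NOT NULL,
    commissionPercentage NUMERIC NOT NULL,
    yearOfHire           INTEGER NOT NULL,
    departmentNumber     INTEGER NOT NULL REFERENCES Department (departmentNumber)
);
CREATE TABLE Product (
    productNumber INTEGER PRIMARY KEY,
    productName   TEXT NOT NULL,
    unitPrice     NUMERIC NOT NULL
);
CREATE TABLE Quantity (
    productNumber     INTEGER NOT NULL REFERENCES Product (productNumber),
    salespersonNumber INTEGER NOT NULL REFERENCES Salesperson (salespersonNumber),
    quantity          INTEGER NOT NULL,    -- the dependent attribute that was missing
    PRIMARY KEY (productNumber, salespersonNumber)
);

-- Illustrative rows; two managers share a name without any conflict.
INSERT INTO Manager VALUES (1, 'Scott'), (2, 'Scott');
INSERT INTO Department VALUES (10, 1), (20, 2);
INSERT INTO Salesperson VALUES (437, 'Bilstein', 5, 2001, 10);
INSERT INTO Product VALUES (100, 'Widget', 9.99), (200, 'Gadget', 19.99);
INSERT INTO Quantity VALUES (100, 437, 30), (200, 437, 12);
""")

# The name is stored once, so a rename cannot leave 437 with two names.
conn.execute(
    "UPDATE Salesperson SET salespersonName = 'Bilstein-Jones' WHERE salespersonNumber = 437")

# Salesperson -> Department -> Manager gives the association that was missing.
report = conn.execute("""
    SELECT s.salespersonName, m.managerNumber, SUM(q.quantity)
    FROM Quantity q
    JOIN Salesperson s ON s.salespersonNumber = q.salespersonNumber
    JOIN Department d  ON d.departmentNumber  = s.departmentNumber
    JOIN Manager m     ON m.managerNumber     = d.managerNumber
    GROUP BY s.salespersonName, m.managerNumber
""").fetchall()
print(report)
```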
Good job so far, hope this helps.
Two part architecture question:
I have employee, job title, and supervisor dimensions. I kind of wanted to keep them in one dimension and have something like site > supervisor > job title > employee. The problem is that these need to be SCD. That is, they have historical associations to relate to the facts. The fact tables have a requirement to be processed every five minutes (dashboard).
1) Should I have these in a single dimension with a surrogate key (or composite for that matter)? The keys/surrogate key would be composed of calendar_id - employee_id.
2) Have the fact tables maintain a reference to three different dimensions instead?
The requirement to process every 5 minutes (MOLAP SSIS ETL driven processing) makes me lean toward keeping the time/change in the facts, to ease the burden of processing the dimensions along with the fact tables.
I would design it as a single dimension, with the hierarchy you mentioned: site > supervisor > job title > employee.
Let's call this dimension EmployeeAssignment, because its granularity is not Employees, but any combination of site/supervisor/job title that an employee "adopts" during his/her career. (Feel free to come up with a better name).
I don't think you need a calendar_id key in this dimension: a surrogate key based on DISTINCT SiteID,SupervisorID,JobTitleID,EmployeeID would be enough. Adding a calendar_id key would be making the dimension do too much work: over and above slicing the actual facts, this would make the dimension answer questions like
"Where was employeeID 12345 (in the site/supervisor/job title network) on 1 January 2015?" and
"How many employees did supervisorID 98765 supervise on 1st January 2015?"
These questions IMHO are best addressed with a fact, not a dimension. One cube I've worked on addresses this with an EmployeeDay measure: sliced by dimensions "EmployeeAssignment" and Time, this simply has a 1 if the employee is in that "assignment" on that day.
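A small relational sketch of that idea, using SQLite via Python's sqlite3 module rather than SSIS/MOLAP: the dimension gets a surrogate key over the DISTINCT combinations, and a factless-style EmployeeDay fact answers the point-in-time questions. The staging table, the Dim/Fact table names, and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging_assignment (          -- hypothetical daily snapshot from the source
    DayDate TEXT, SiteID INTEGER, SupervisorID INTEGER,
    JobTitleID INTEGER, EmployeeID INTEGER
);
INSERT INTO staging_assignment VALUES
    ('2015-01-01', 1, 98765, 10, 12345),
    ('2015-01-01', 1, 98765, 10, 12346),
    ('2015-01-02', 1, 98765, 10, 12345),
    ('2015-01-02', 1, 55555, 11, 12346);   -- 12346 changes supervisor and job title

CREATE TABLE DimEmployeeAssignment (
    EmployeeAssignmentKey INTEGER PRIMARY KEY,   -- surrogate key
    SiteID INTEGER, SupervisorID INTEGER, JobTitleID INTEGER, EmployeeID INTEGER,
    UNIQUE (SiteID, SupervisorID, JobTitleID, EmployeeID)
);
INSERT INTO DimEmployeeAssignment (SiteID, SupervisorID, JobTitleID, EmployeeID)
SELECT DISTINCT SiteID, SupervisorID, JobTitleID, EmployeeID
FROM staging_assignment;

CREATE TABLE FactEmployeeDay (
    DayDate TEXT NOT NULL,
    EmployeeAssignmentKey INTEGER NOT NULL
        REFERENCES DimEmployeeAssignment (EmployeeAssignmentKey),
    EmployeeDay INTEGER NOT NULL DEFAULT 1,      -- 1 = employee held this assignment that day
    PRIMARY KEY (DayDate, EmployeeAssignmentKey)
);
INSERT INTO FactEmployeeDay (DayDate, EmployeeAssignmentKey)
SELECT s.DayDate, d.EmployeeAssignmentKey
FROM staging_assignment s
JOIN DimEmployeeAssignment d
  ON  d.SiteID = s.SiteID AND d.SupervisorID = s.SupervisorID
  AND d.JobTitleID = s.JobTitleID AND d.EmployeeID = s.EmployeeID;
""")

def headcount(supervisor_id, day):
    """How many employees did this supervisor supervise on the given day?"""
    return conn.execute("""
        SELECT COALESCE(SUM(f.EmployeeDay), 0)
        FROM FactEmployeeDay f
        JOIN DimEmployeeAssignment d
          ON d.EmployeeAssignmentKey = f.EmployeeAssignmentKey
        WHERE d.SupervisorID = ? AND f.DayDate = ?
    """, (supervisor_id, day)).fetchone()[0]

dim_rows = conn.execute("SELECT COUNT(*) FROM DimEmployeeAssignment").fetchone()[0]
print(dim_rows, headcount(98765, "2015-01-01"), headcount(98765, "2015-01-02"))
```

The dimension only grows when a new combination appears, which is why it can be reprocessed far less often than the facts.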
This EmployeeAssignment SCD is actually pretty slowly-changing, especially compared to your 5-minute fact update interval. Employees are not going to move about or get promoted every 5 minutes, so a reprocess of the dimension shouldn't be necessary more often than daily.
If I've misunderstood anything, let me know in the comments.
I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the wikipedia post, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind.
Does anyone have a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms?
Say you go to a job interview and the person asks: Explain the concept of normalization and how you would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife it would have been something like that:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the country they came from. Instead of holding the name of the country, which can be as long as "Bosnia & Herzegovina", for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45s. Now if in the future, as often happens with Balkan countries, it splits into two countries, Bosnia and Herzegovina, I will have to change it in only one place. Well, sort of.
Now, to explain 2NF, I would have changed the example, and let's assume that we hold the list of countries every person visited.
Instead of holding a table like:
Person | CountryVisited | AnotherInformation | D.O.B.
------------------------------------------------------
Faruz  | USA            | Blah Blah          | 1/1/2000
Faruz  | Canada         | Blah Blah          | 1/1/2000
I would have created three tables, one table with the list of countries, one table with the list of persons and another table to connect them both. That gives me the most freedom I can get changing person's information or country information. This enables me to "remove duplicate rows" as normalization expects.
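As a rough sketch of that three-table layout (SQLite through Python's sqlite3 module; the table and column names are my own choice, and the rename is just an illustrative update):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE country (
    country_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE
);
CREATE TABLE person (
    person_id           INTEGER PRIMARY KEY,
    name                TEXT NOT NULL,
    date_of_birth       TEXT,
    another_information TEXT
);
CREATE TABLE visit (                       -- connects persons to the countries they visited
    person_id  INTEGER NOT NULL REFERENCES person (person_id),
    country_id INTEGER NOT NULL REFERENCES country (country_id),
    PRIMARY KEY (person_id, country_id)
);
INSERT INTO country VALUES (1, 'USA'), (2, 'Canada');
INSERT INTO person  VALUES (1, 'Faruz', '2000-01-01', 'Blah Blah');
INSERT INTO visit   VALUES (1, 1), (1, 2);
""")

# Person details are stored once, and a country rename is a single-row update.
conn.execute("UPDATE country SET name = 'United States' WHERE country_id = 1")

visited = [row[0] for row in conn.execute("""
    SELECT c.name
    FROM visit v
    JOIN country c ON c.country_id = v.country_id
    WHERE v.person_id = 1
    ORDER BY c.name
""")]
print(visited)
```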
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, then you are violating normalization which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts, like the information that Bob's address is "The Belltower", are stored twice, which is redundant, makes it difficult to query and change data, and (worst of all) makes it possible to introduce logical inconsistencies.
E.g. if Bob moves I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. If I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
This is what I ask interviewees:
Why don't we use a single table for an application instead of using multiple tables?
The answer is of course normalization. As already said, it's to avoid redundancy and thereby update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to which row in the user table it belongs to. No blank columns are needed, and each user can have as few or many phone numbers as necessary.
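A minimal sketch of that phonenumber table, assuming SQLite via Python's sqlite3 module. The table names, column names and sample numbers are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE phonenumbers (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        number  TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    -- alice has four numbers, bob has one, carol has none: no schema change needed.
    INSERT INTO phonenumbers (user_id, number) VALUES
        (1, '555-0100'), (1, '555-0101'), (1, '555-0102'), (1, '555-0103'),
        (2, '555-0110');
""")

# LEFT JOIN keeps users without any phone number (their count is 0).
counts = dict(conn.execute("""
    SELECT users.name, COUNT(phonenumbers.id)
    FROM users LEFT JOIN phonenumbers ON phonenumbers.user_id = users.id
    GROUP BY users.id
"""))
print(counts)
```

Adding a fourth (or fortieth) number is just another row, and users with no number take up no space in phonenumbers.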
One side point to note about normalization: A fully normalized database is space efficient, but is not necessarily the most time efficient arrangement of data depending on use patterns.
Skipping around to multiple tables to look up all the pieces of info from their normalized locations takes time. In high load situations (millions of rows per second flying around, thousands of concurrent clients, like say credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized tables.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note that said you had to go shopping for ice cream without normalization, you would then have another note, saying you have to go shopping for ice cream, just one in each pocket.
Now, in real life, you would never do this, so why do it in a database?
For the designing and implementing part, that's when you can move back to "the lingo" and keep it away from layman's terms, but I suppose you could simplify. You would say what you needed to at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that table's primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and the fact that you cannot always have the best structure AND be in normal forms.
Normalization is a set of rules used to design tables that are connected through relationships.
It helps avoid repetitive entries, reduces the required storage space, reduces the need to restructure existing tables to accommodate new data, and can speed up some queries.
First Normal Form: Data should be broken up into the smallest units. Tables should not contain repeating groups of columns. Each row is identified by a primary key made up of one or more columns.
For example, if there is a column named 'Name' in a 'Custom' table, it should be broken up into 'First Name' and 'Last Name'. Also, 'Custom' should have a column named 'CustomID' to identify a particular customer.
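Second Normal Form: Each non-key column should be directly related to the entire primary key.
For example, if a 'Custom' table has a column named 'City', the city should have a separate table with a primary key and the city name defined. In the 'Custom' table, replace the 'City' column with 'CityID' and make 'CityID' a foreign key in the table. (Strictly speaking, second normal form is about columns that depend on only part of a composite key; moving the city into a lookup table is a related clean-up step.)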
Third normal form: Each non-key column should not depend on other non-key columns.
For example, In an order table, the column 'Total' is dependent on 'Unit price' and 'quantity', so the 'Total' column should be removed.
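One way to follow that advice is to compute the total in the query instead of storing it. This is only a sketch, assuming SQLite via Python's sqlite3 module; the order_lines table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No stored 'Total' column: it would depend on unit_price and quantity,
    -- which are non-key columns.
    CREATE TABLE order_lines (
        id         INTEGER PRIMARY KEY,
        product    TEXT NOT NULL,
        unit_price REAL NOT NULL,
        quantity   INTEGER NOT NULL
    );
    INSERT INTO order_lines (product, unit_price, quantity) VALUES
        ('widget', 2.5, 4),
        ('gadget', 10.0, 3);
""")

# The total is derived whenever it is needed, so it can never disagree
# with the price and quantity it is based on.
totals = dict(conn.execute(
    "SELECT product, unit_price * quantity FROM order_lines"))
print(totals)
```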
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it like Bill Lumbergh. We then query the students and ask them what we will have problems with, when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough name for some, because I have what some people would consider 2 first names and some people call me Richard. If you were trying to search for my last name then it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems with easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that meaningful is based upon the audience who is going to be using the database as well. We, at our job will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FEDEX might need it separated out to easily pull up the apartment or suite of where they need to go when they are on the road and running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
Avoiding Blanks:
I use an analogy to explain to them why they should avoid blanks. I tell them that Access and most databases do not store blanks like Excel does. Excel does not care if you have nothing typed in a cell and will not increase the file size, but Access will reserve that space until you actually use the field. So even if the field is blank, it still uses up space, and I explain to them that it also slows their searches down.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype values if I have to type in the same thing over and over again. This “fat-fingering” of data will lead to my queries not finding the correct data. Instead, we will break the data out into a separate table and create a relationship using a primary and foreign key field. This way we are saving space, because we are not typing the customer's name, address, etc. multiple times and instead are just using the customer's ID number in a field for the customer. We then will discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will set up a drop-down list that will give you a list of customers, where you can select their name and it will fill in the customer’s ID for you. This will be a 1-to-many relationship, where 1 customer will have many different orders.
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid out similar to this.
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects that they work on. First, if we have a new employee, then they will not have any projects, so we will be wasting all of those fields, second if an employee has been here a long time then they might have worked on 300 projects, so we would have to include 300 project fields. Those people that are new and only have 1 project will have 299 wasted project fields. This design is also flawed because I will have to search in each of the project fields to find all of the people that have worked on a certain project, because that project number could be in any of the project fields.
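The usual way out of the Project1, Project2, Project3 columns is a third "junction" table with one row per employee/project pairing. The sketch below assumes SQLite via Python's sqlite3 module; tblEmployeeProjects and the sample rows are hypothetical, only tblEmployees and tblProjects come from the layout above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE tblEmployees (
        EmployeeID INTEGER PRIMARY KEY,
        First      TEXT,
        Last       TEXT
    );
    CREATE TABLE tblProjects (
        ProjectNum  INTEGER PRIMARY KEY,
        ProjectName TEXT
    );
    -- Hypothetical junction table: one row per (employee, project) pairing.
    CREATE TABLE tblEmployeeProjects (
        EmployeeID INTEGER NOT NULL REFERENCES tblEmployees(EmployeeID),
        ProjectNum INTEGER NOT NULL REFERENCES tblProjects(ProjectNum),
        PRIMARY KEY (EmployeeID, ProjectNum)
    );
    INSERT INTO tblEmployees VALUES
        (1, 'Bill', 'Lumbergh'), (2, 'Jim', 'Richards'), (3, 'New', 'Hire');
    INSERT INTO tblProjects VALUES (10, 'Alpha'), (20, 'Beta');
    -- Employee 3 has no projects yet, so there is simply no row for them.
    INSERT INTO tblEmployeeProjects VALUES (1, 10), (1, 20), (2, 10);
""")

# Everyone who worked on project 10: one condition on one column,
# instead of searching Project1, Project2, Project3, ...
on_project_10 = [row[0] for row in conn.execute("""
    SELECT e.Last
    FROM tblEmployees e
    JOIN tblEmployeeProjects ep ON ep.EmployeeID = e.EmployeeID
    WHERE ep.ProjectNum = 10
    ORDER BY e.EmployeeID
""")]
print(on_project_10)
```

A new employee costs no wasted fields, and a 300-project veteran costs 300 short rows rather than 300 columns on everyone.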
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarification or breaking it down in plain English. The wiki page did not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times but I have found a better overview of normalization from this article. It is a simple easy to understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in tables that will have records inserted but not updated, such as history logs or financial transactions.
References
When not to Normalize your SQL Database
Database Design Basics
+1 for the analogy of talking to your wife. I find talking to anyone without a tech mind needs some ease into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
In a truly normalized database, I have found that in some situations it is easier to end up with slow queries because of bad join operations, bad indexing on the tables, and plain bad design of the tables themselves.
Bluntly, it's easier to write bad queries in high level normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out of a few tables, without having to join to a ton of tables to get one data set.