Single OR Grouped/Combined Items for Entries in a Database - sql

I have 2 Similar Types of Things that I want to point to in a certain field in a Database. One of them is a combination of 1 or more of the other.
How should I Design my Database in this kind of situation?
In my current example I have (Simple) Food Ingredients and (Combined) Food Dishes and I want Either One of these Things to be entries in a Meals/Eating table.
So a User can Either Eat a simple Food like an Apple OR a complex food like an Apple Pie that consists of 200g of Apples and 100g of Flour and 30g of Sugar etc. at one point in time in a Meal. I'm thinking something like this:
Ingredients |IID| |Name| |Calories|
Dishes |DID| |Name| (|Calories|???)
Food Data |DID| |IID| |Amount|
.
Users |UID| |FirstName| |LastName| etc.
Meals |UID| |DID| |Date/Time| |Amount|
I Find this really annoying tho because Every Single Ingredient would have to have Two (Basically Identical)Entries to start with: 1 in the Ingredients Table and 1 in the Dishes Table so it could be paired up in a Meal. Am I missing something Here? Is there a way around this?
Also I don't know if a Dish should have the Calories Listed in the Database. Having the Calories for a Dish in the database is rather Redundant because it could be Calculated when Making a Query (by summing up & calculating its respective ingredients). BUT this seems quite inefficient since this calculation would have to be done for every single query of a dish (and it would get worse by adding things like Macros/Nutritional Values/Price which I left out for clarity/simplicity here).
Also If I DO have Calories(and other things relating to food in general) for a Dish I could just have 1 single table in this scenario like:
Food |FID| |Name| |Calories| (|Simple[bool]|?)
Food Data |FID| |FID| |Amount|
This would Seem better in general. The Simple field would distinguish between Simple Ingredients or Dish which I think is worth putting in so you don't have to search in Food Data for every item.
BUT If I want to introduce Specific Dish-Only Data then I would have to make some Other Table like:
DISH DATA |FID| |TimetoCook| |Presentation| etc. (which seems pretty weird/unintuitive to me)
.
So the Question is: What is the BEST General Practice in this kind of scenario?
Is it generally better to do extra calculations when querying rather than have redundant data in these kinds of situations?
Is there something I'm missing that would make this simpler/better in general?

I'm not sure this can be answered as generally as you would like, because the semantics and the use of the database should be taken into account. Even in the simple/complex food context of your example, either of the approaches you describe (ingredients/dishes/food_data or food/food_data/dish_data) can be right, depending on the specifics.
Let me get this out of the way first: I wouldn't look for a third approach. Any other thing I can think of would be semantically obscure, hell to maintain or a nightmare to query.
So your first concern is the semantics of the database. Your first approach seems more natural; most people will easily see the semantic distinction between ingredients and dishes. It is also the only option if the "ingredient" entity has another reason of existence besides being part of a dish, e.g. for managing orders of raw ingredients. If you choose to go with the second approach you will have to make sure that a) it fits your data and b) you choose your table names very very carefully.
For the second approach to "fit your data" semantically, simple dishes must fully fit the description: "dishes that don't have the extra dish_data". The [Simple] flag is also acceptable as a property of dish, though a real need for it can be a hint that you're off base with this approach. But if ingredients and dishes only partially overlap, i.e. if you have ingredients that cannot be dishes, or if they have different properties in general, then you are definitely off base. If you find yourself in need of enforcing business rules that would prevent a customer from ordering a serving of "flour"; if you raise questions like what to put under "calories" for the "pickles" (would it be the calories per 100gr for pickles-as-an-ingredient, or the calories per serving for pickles-as-a-side-dish?); if you find you have fields like "measuring unit" that are meaningless for dishes, then you're dealing with two separate entities (ingredients and dishes), not one entity (dish) with two subcategories (simple and complex). If you are only going to duplicate a tiny bit of information between the two tables and save yourself a lot of trouble and ambiguity, by all means do that.
Your second concern is how the data will be used. Try to answer questions like: Are you going to be querying calories of dishes millions of times per second? Are the ingredients - and therefore the calories - of dishes going to stay the same for ever? Will your customer or cook ever need to query what a dish is made of?
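To make the trade-off concrete, here is a minimal sketch of the first approach (ingredients/dishes/food_data) in which a dish's calories are computed at query time instead of being stored. It uses Python's built-in sqlite3 module only so that it runs anywhere. The column names, the per-100g unit and the calorie figures are assumptions made up for illustration; only the 200g/100g/30g apple pie recipe comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ingredients (
    iid           INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    kcal_per_100g REAL NOT NULL          -- assumed unit, not from the question
);
CREATE TABLE dishes (
    did  INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE food_data (                 -- which ingredient goes into which dish
    did   INTEGER NOT NULL REFERENCES dishes(did),
    iid   INTEGER NOT NULL REFERENCES ingredients(iid),
    grams REAL NOT NULL,
    PRIMARY KEY (did, iid)
);
-- Calorie figures below are placeholders, not real nutrition data.
INSERT INTO ingredients VALUES (1, 'Apple', 50), (2, 'Flour', 350), (3, 'Sugar', 400);
INSERT INTO dishes VALUES (1, 'Apple Pie');
INSERT INTO food_data VALUES (1, 1, 200), (1, 2, 100), (1, 3, 30);
""")

# Derive each dish's calories from its ingredients instead of storing them.
dish_kcal = conn.execute("""
    SELECT d.name, SUM(i.kcal_per_100g * fd.grams / 100.0) AS kcal
    FROM dishes d
    JOIN food_data fd ON fd.did = d.did
    JOIN ingredients i ON i.iid = fd.iid
    GROUP BY d.did, d.name
""").fetchall()
print(dish_kcal)
```

If this aggregate ever turns out to be too slow for your query volume, that would be the point to consider caching the total on the dish row, at the price of keeping it in sync whenever a recipe changes.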
"Don't duplicate" and "don't store calculatable values" are two rules that are as hard as design rules come. Even such rules though should be, not really bent, just "critically adjusted" some times, if that makes sense.

This is a question of understanding the context of your data.
I imagine meals can be simple (unprocessed) or be complex and consist of other meals. If I were to generate a database for meals and their calorific value I would not separate them.
meal | calorific value per 100g | glycemic index
apple | 12345 | 34234
apple-pie | 3233 | 32334
Other table you would join it with could be a meal composition for a specific person.
2020-02-27|Johny Doe | Breakfast |apple | 300 g
2020-02-27|Johny Doe | Breakfast |sausage| 150 g
2020-02-27|Johny Doe | Breakfast |apple-juice | 500 g
By joining the two tables you would learn how many calories Johny Doe ate and perhaps what the glycemic index was...
Then... it is not yet an SQL question but a question of first understanding the process one would like to describe with SQL.
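As a rough sketch of that join, assuming a single food table keyed by name and a meal log that stores grams eaten (the table names, column names and calorie values here are invented for the example and are not real nutrition data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE food (
    name          TEXT PRIMARY KEY,
    kcal_per_100g REAL NOT NULL           -- placeholder values only
);
CREATE TABLE meal_log (
    eaten_on TEXT NOT NULL,
    person   TEXT NOT NULL,
    meal     TEXT NOT NULL,
    food     TEXT NOT NULL REFERENCES food(name),
    grams    REAL NOT NULL
);
INSERT INTO food VALUES ('apple', 50), ('sausage', 300), ('apple-juice', 40);
INSERT INTO meal_log VALUES
    ('2020-02-27', 'Johny Doe', 'Breakfast', 'apple',       300),
    ('2020-02-27', 'Johny Doe', 'Breakfast', 'sausage',     150),
    ('2020-02-27', 'Johny Doe', 'Breakfast', 'apple-juice', 500);
""")

# Join the log against the food table and total the calories per meal.
breakfast_kcal = conn.execute("""
    SELECT m.person, m.meal, SUM(f.kcal_per_100g * m.grams / 100.0) AS kcal
    FROM meal_log m
    JOIN food f ON f.name = m.food
    WHERE m.eaten_on = '2020-02-27'
    GROUP BY m.person, m.meal
""").fetchall()
print(breakfast_kcal)
```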

Related

SQL - Survey Data, Table Schema Design for looped survey questions

Suppose we have a survey where some of the questions are asked across multiple entities.
For example:
Car Brands = [Brand 1, Brand 2, Brand 3, Brand 4...]
This questions will be asked for each one of the car brands (looped).
Question Q01 = (Scale 1-10) Do you think [Car Brand] cars are reliable?
Question Q02 = (Scale 1-10) Do you think [Car Brand] cars are a good value?
...
I'm designing a schema that will power some web based analytic tools, so query performance is important.
The schema will be 3 tables: Records, Questions, Answers
I have two approaches for the answers table:
A) Table: Answers
QuestionId | AnswerValue | BrandOption
Q01 | 7 | 1
Q01 | 5 | 2
Q01 | 4 | 3
Q01 | 8 | 4
B) Table: Answers
QuestionId | AnswerValue
Q01-1 | 7
Q01-2 | 5
Q01-3 | 4
Q01-4 | 8
The queries can be either for one brand at a time or for all the brands, with equal priority for both queries.
Option A seems to give me some advantages if I ever need to do something like a group by, however if most of the queries are for a specific brand, then Option B seems to be more efficient.
Thoughts?
Option A is better, even if you don't see it right now.
Storing multiple values in a single database "cell" is a mistake any way you look at it (though unfortunately, a very common mistake) - not to mention it's a violation of the first normal form - which specifically states that each column can only contain a single atomic value in each row (though the original rule is using a different terminology).
The disadvantages are numerous and some of them are critical, including (but not limited to):
You lose the ability to use the proper data type - two ints stored together must be stored as a different data type than int.
You might lose the ability to verify your data is, in fact, correct, or that the different parts can be converted to the correct data type (most databases support check constraints nowadays but not all (Yes, MySql, I'm pointing my finger at you!))
You lose the ability to enforce uniqueness on each part of the data separately.
You can't use the different parts of the data as basis for foreign key constraints
The list goes on and on - but I think anyone should get the picture by now - a database column should be used to store a single value for each row - every time.
The first version is preferable in my opinion. It makes it easier to look for answers to different questions for a single brand and to the same question across brands.
Munging the question id seems like a poor substitute. For one thing, it precludes simple foreign key relationships to a questions table and to a brands table. I'm a big fan of explicit foreign key relationships.
Of course, to make this work, you will need a method to store "no brand" or "brand not relevant". One method is to use NULL for such answers.
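A small sketch of Option A may help show that both query patterns stay simple. The record_id column, the index name and the second respondent's answers are assumptions added for illustration; the first respondent's values are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE answers (
    record_id    INTEGER NOT NULL,       -- assumed link to the Records table
    question_id  TEXT    NOT NULL,
    brand_option INTEGER,                -- NULL when the question is not per brand
    answer_value INTEGER NOT NULL
);
-- A composite index serves both "one brand" and "all brands" lookups.
CREATE INDEX idx_answers_question_brand ON answers (question_id, brand_option);
INSERT INTO answers VALUES
    (1, 'Q01', 1, 7), (1, 'Q01', 2, 5), (1, 'Q01', 3, 4), (1, 'Q01', 4, 8),
    (2, 'Q01', 1, 9), (2, 'Q01', 2, 5);
""")

# Query for a single brand: a plain equality filter, no string parsing.
one_brand = conn.execute(
    "SELECT answer_value FROM answers "
    "WHERE question_id = ? AND brand_option = ? ORDER BY record_id",
    ("Q01", 1),
).fetchall()

# Query across all brands: a GROUP BY on the separate column.
per_brand_avg = conn.execute("""
    SELECT brand_option, AVG(answer_value)
    FROM answers
    WHERE question_id = 'Q01'
    GROUP BY brand_option
    ORDER BY brand_option
""").fetchall()
print(one_brand, per_brand_avg)
```

With Option B the second query would need to split 'Q01-1' apart before grouping, which is the kind of work the separate column avoids.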

Storing timetable availability in SQL table

I'm playing around with a database idea at the moment. It's likely not going to be deployed in any sort of fashion and is more of a learning experience.
It's meant to simplify the collection and handling of tutor information for a bunch of classes at the university I went to. I worked part time in an office that organised tutors for a handful of classes each semester.
I've got a number of questions, but the one that's causing me a problem at the moment is how I can store the availability of each tutor. I'm considering 3 options at the moment, and I'm looking for feedback on the pros and cons of each from a technical perspective.
Background:
Tutor information is stored in a "tutor" table (tutorID references this) and the previous availability must be able to be recalled. Tutor availability is discrete (hourly), and constant throughout a semester.
Option 1:
Table: Availability
+-----------+---------+-------+-------+---+---+---+----+---+
| avID (PK) | tutorID | year | sem | M | T | W | Th | F |
| | | (int) | (int) | (all strings) |
+-----------+---------+-------+-------+---+---+---+----+---+
In this table, availability is stored in a string (08,09,10,13,14 represents 8am, 9am, 10am, 1pm and 2pm).
Data could be reclaimed with
SELECT * FROM Availability WHERE tutorID=0001 AND year=2013 AND sem=1
And to see who's available
SELECT * FROM Availability WHERE year=2013 AND sem=1 AND M LIKE '%08%'
Option 2:
Table: Availability
+-----------+---------+-------+-------+--------------+
| avID (PK) | tutorID | year | sem | availability |
| | | (int) | (int) | (set) |
+-----------+---------+-------+-------+--------------+
In this layout, the availability column is stored as the SET datatype in mysql, with the options being every combination of Mon through Friday and every time from 8 till 4 (M08, M09... Th14, F16 etc etc). This works out to 45 acceptable values. This is the one that I'm currently leaning towards, but I don't know much about the SET datatype.
Data could be reclaimed with
SELECT * FROM Availability WHERE tutorID=0001 AND year=2013 AND sem=1
And to see who's available
SELECT * FROM Availability WHERE year=2013 AND sem=1
AND FIND_IN_SET('M09',availability) > 0
Option 3:
Table: Availability
+-----------+---------+-------+-------+-------+-------+
| avID (PK) | tutorID | year | sem | day | time |
| | | (int) | (int) | (int) | (int) |
+-----------+---------+-------+-------+-------+-------+
In this option, there is a single row for each tutor each year and each timeslot.
Data could be reclaimed with
SELECT * FROM Availability WHERE year=2013 AND sem=2 AND tutorID=0001
Availability with
SELECT * FROM Availability WHERE year=2013 AND sem=2 AND day=3 AND time=14
Anyway... Thanks for reading through all of that. Hopefully someone will be able to shed some light on this. I think that it basically will boil down to a best-practice type of question. Unless there's something that I've missed entirely!!
None of your listed options are normalized. Basically normalizing, and one of the main points and benefits of relational database technology, is avoiding the storage of redundant information.
Option 1
You were not clear about the requirement, but I'm assuming a tutor may be available more than one hour per day. That would make Option 1 awkward, or a poor fit because you would have to have multiple rows to cover multiple sessions in a single day. The other columns values would be duplicated across rows – that kind of repetition means a violation of normalization.
Also, choosing text as the data type for the start time is probably not optimal. If the sessions always start on the hour, then you are dealing with hour numbers. If dealing with numbers, store them as numbers (as a general rule). If the sessions may not always start on the hour, then you are dealing with time values. Same general rule, store them as a Time data type.
Choosing int as data type for year is probably not clear. Usually an academic year is something like "2013-2014".
Option 2
In Option 2, stuffing multiple points of data into a single field is definitely not normalized. While your query would work it has at least two shortcomings. One is performance; typically searching a multi-value field like that will be relatively slow. But more importantly, violating normalization almost always leads to painting yourself into a corner. What if you want to tie additional values to each of those time slots — you can't because you don't have access to each time slot when they are smashed together.
Option 3
In Option 3, you are getting closer to a normalized design. But notice how multiple fields will be repeated together (year and sem)? Again that kind of duplication is a flag for a violation of normalization.
Generalize
When designing, generally it is a good habit to broaden or generalize your thinking. For example, are sessions always forever going to start on the hour and last one hour? Not likely. So it may be smart to use a Time value rather than an hour number. Another example, "semester" – not all schools use semesters and even those that do (yours) may change. So it may be smart to generalize to "term" and not make assumptions related to semesters. On the other hand, don't over-generalize or else you can fall into a meaningless mess of a design or fall into analysis-paralysis.
Normalize
To normalize, look for the "things", the stuff that may take an action, or stuff that "owns" other stuff. We call these entities.
You've already identified the tutor as a separate entity. Good.
I see another: term (semester). That repeating of 'year' and 'sem' is the clue. Such repetition is avoided by moving those values into another table. That table is for the entity of 'term'. Another clue that separate table is correct is the idea that we may well want to tie other information to the 'term' table, such as the term's start date and length (or stop date). Such additional data certainly should not be repeated across all our 'availability' rows. Such data should be stored once in a single row in term table.
My Design
So my initial design would look like this diagram.
This relationship is Many-to-Many. Each tutor may be available in multiple terms, and each term may have multiple tutors. A many-to-many is a problem in a relational design, and is always resolved with a third "bridge" or "junction" table. Many-to-many and bridge tables are quite common in databases designed for business contexts.
Here, the bridge table between them, is availibility_. That bridge table is a child table to both, and carries each parent's primary key (a foreign key). Tip: when I place parents (blue here) higher vertically than children (orange here), and I notice the "bird body with raised wings" pattern of a parent on either side, then I recognize a many-to-many relationship exists between the parents.
By the way, there are times to violate normalization. We call that "to denormalize". Usually the goal is related to performance. But denormalize only after you have consulted with another experienced database designer, and when you have very good reasons, clearly know the price you are paying, and thoroughly document the violation for the edification of those who may later take your place.
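One possible reading of the tutor/term/availability design, sketched with Python's sqlite3 module. The column names, the sample tutors and the term start date are assumptions, not a reproduction of the original diagram: each availability row carries both parents' keys plus a weekday and a start time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tutor (
    tutor_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE term (
    term_id    INTEGER PRIMARY KEY,
    label      TEXT NOT NULL,
    start_date TEXT NOT NULL             -- stored once per term, not per slot
);
CREATE TABLE availability (              -- bridge table between tutor and term
    tutor_id    INTEGER NOT NULL REFERENCES tutor(tutor_id),
    term_id     INTEGER NOT NULL REFERENCES term(term_id),
    day_of_week INTEGER NOT NULL CHECK (day_of_week BETWEEN 1 AND 5),
    start_time  TEXT    NOT NULL,        -- 'HH:MM', a time rather than an hour number
    PRIMARY KEY (tutor_id, term_id, day_of_week, start_time)
);
-- Sample data is invented for the example.
INSERT INTO tutor VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO term VALUES (1, '2013 term 1', '2013-02-25');
INSERT INTO availability VALUES
    (1, 1, 1, '08:00'), (1, 1, 1, '09:00'),
    (2, 1, 1, '09:00'), (2, 1, 3, '14:00');
""")

# Who is available on Monday (day 1) at 09:00 in term 1?
monday_9am = conn.execute("""
    SELECT t.name
    FROM availability a
    JOIN tutor t ON t.tutor_id = a.tutor_id
    WHERE a.term_id = 1 AND a.day_of_week = 1 AND a.start_time = '09:00'
    ORDER BY t.name
""").fetchall()
print(monday_9am)
```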

Should I Merge This Database data into one table?

I want to store some product data in my database. At first I thought having a product table and product info table but not sure if I should just merge it all into one table.
Example
Coke - 355 ml
Product.Name = Coke
ProductInfo.Size = 355
ProductInfo.UnitType = ml
Coke - 1 Liter
Product.Name = Coke (would not be duplicated...just for illustration purposes)
ProductInfo.Size = 1
ProductInfo.UnitType = L
If I did this, then of course I would not be duplicating the "Name" twice. My plan then was I could find all sizes of the same product very easily as all I would have to do is look at the many side of the relationship for any given item.
Here is the problem though, all the data will be user driven and entered. Someone might write "Coke a Cola" instead of "Coke" and now that would be treated as 2 different products as when I go to look if a product has been entered called "Coke a Cola" but it won't know to check for "Coke" as well.
This leads me to having to do like partial matches to maybe try to find it but what happens if someone has some generic brand what would be "Cola" and that would get matched as well.
This gets me to think maybe there is no point in keeping the data separate, as it seems to me there is a good chance everything will end up being its own product anyway.
There's merit in both approaches. Keeping them separate, the table you're calling "Product", I'd call "Brand" instead, and "ProductInfo" is your actual "Product" table, containing the information about the actual sellable item of that brand (a 12oz can or liter bottle of Coke).
Alternately, you could further normalize it into Brand, Product (here being Coke Classic as maybe opposed to Diet Coke or Caffeine Free Coke) and UnitSize (can or bottle; these would apply not only to Coke Classic, but Diet Coke, Pepsi or Dr Pepper).
If you denormalize this data, you aren't duplicating much on the naming side of things, but you are duplicating quite a bit of unit of measure data. The question is whether it's more useful to ensure consistent branding of your product records (denormalizing means you'll need some other means to ensure your products have the same brand), or to avoid the joins between the two tables (there is a cost to joining, though it's typically small if you can join between indexed fields).
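For comparison, the separate-tables variant might look roughly like this. The table and column names are assumptions following the Brand/Product renaming suggested above; the two Coke sizes are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE brand (
    brand_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE        -- one row per brand name
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    brand_id   INTEGER NOT NULL REFERENCES brand(brand_id),
    size       REAL NOT NULL,
    unit_type  TEXT NOT NULL,
    UNIQUE (brand_id, size, unit_type)   -- the same size cannot be entered twice
);
INSERT INTO brand VALUES (1, 'Coke');
INSERT INTO product (brand_id, size, unit_type) VALUES (1, 355, 'ml'), (1, 1, 'L');
""")

# All sizes of one brand: follow the "many" side of the relationship.
coke_sizes = conn.execute("""
    SELECT b.name, p.size, p.unit_type
    FROM product p
    JOIN brand b ON b.brand_id = p.brand_id
    WHERE b.name = 'Coke'
    ORDER BY p.product_id
""").fetchall()
print(coke_sizes)
```

Note that the UNIQUE constraint on the brand name only stops exact duplicates; it does nothing about "Coke a Cola" versus "Coke", which still has to be handled at data entry.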
The only compelling reason to make a Header-Detail arrangement, with two tables, would be if Coke has attributes that are the same no matter the packaging. Right now, I don't see any attributes like that; so one table covers it. You might say, "But I might think of something in the future like that." That may be a reason to make two tables; but (unlike many kinds of change to a database schema) this may not be too difficult to break into two tables later, when you know there is a need.
I see the point about mistakes that result in nearly-matching records. I think that's not a consideration at this table level and you should address it as a part of record editing.
The best way to do this would be to have your product or item table in its own table with fields like ID, SKU number, short description, active, and so on… Then you have your “many” table hold the other item attributes which can be joined on ID; a one to many relationship. And to solve the user input issue, you have a combo box which is tied to inventory choices or item choices. This way you enforce data integrity. Well, that is how I have done it.
This post has some helpful links on DB design

Is this database structure sane, correct and normalized?

So, yesterday I asked 2 questions that pivoted around the same idea: Reorganizing a database that A- wasn't normalized and B- was a mess by virtue of my ignorance. I spent the better part of the day organizing my thoughts, reading up and working through some tests. Today I think I have a much better idea of how my DB should look and act, but I wanted to make sure I understood the core ideas of proper SQL DB design and normalization processes.
Originally I had ONE table called "Files" that held data about a file (it's URL, date uploaded, user ID of whomever uploaded it etc.) as well as a column called "grades" that represented the grade level you might use that file for. (FYI: These files are lesson plans for schools) I realized I'd violated Rule #1 about Normalization- I was storing my "grades" like this "1,2" or "2,6" or "3,5,6" in one column. This caused major headaches when trying to parse that data if I wanted to see JUST 3rd grade lessons or JUST 5th grade lessons.
What was suggested to me, and what became evident later, was that I have 3 tables:
files (data about the files, url etc.)
grades (a table of available grade levels. Likely 1-6 to start)
files_grades (a junction table)
This makes sense. I just want to make sure I understand what I'm doing before I do it. Let's say User A uploads File xyz and decides that it's good for grades 2 and 3.
I'd write ONE record to the "files" table with data about that file (kb size, url, description, name, primary key files_id). Let's say it gets id 345.
Because of the limited number of grade options, grades will likely be equivalent to their ID (i.e., Grade 1 is grades_id 1, Grade 2 is grades_id 2)
I'd then write TWO records to the "files_grade" junction table containing
files_grade_id, files_id, and grades_id i.e.
1,345,2
1,345,3
To represent the 2 grades that files_id 345 is good for. Then I wave my magic SELECT and JOIN wands and pull the data I need.
Does this make sense? Am I, again, misunderstanding the proper structure of a relational many-to-many database?
Problem 2 which just dawned on me: So, a Lesson can have Multiple "Grades". No problem, we just solved that (I hope!). But it could, in theory, have multiple "Schools" as well- Elementary, Middle, High. What do we do if a files entry has Grades 1,2 for Middle,High? This could very easily be solved by saying "One school per file, users!", but I like to throw this out there.
I'd then write TWO records to the "files_grade" junction table containing
files_grade_id, files_id, and grades_id i.e.
files_grade_id here is redundant, because the combination of files_id and grades_id is already unique (thus can be set as the primary key).
But it could, in theory, have multiple "Schools" as well- Elementary, Middle, High. What do we do if a files entry has Grades 1,2 for Middle,High?
Depending on your requirement, you can perhaps store those as "continuations" of the previous grades, e.g. 1-6 elementary, 7-9 middle, 10-12 high. Then you can make do without the grades table completely (since you can just store these numbers in the files_grade table).
From the sounds of it, it sounds pretty good. Just one thing, though: you don't really need to have an ID for the bridge table (FILES_GRADES), and if you do, you need to increment the ID.
You would have a two-part primary key: grade_id and file_id, the files_grade_id just complicates things, and would make for a bad index, since you'd never use it in a select.
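A minimal sketch of the junction table with a two-part primary key, using the file id 345 and grades 2 and 3 from the question. The trimmed-down column lists are an assumption; a real files table would carry more fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    files_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE grades (
    grades_id INTEGER PRIMARY KEY
);
CREATE TABLE files_grades (
    files_id  INTEGER NOT NULL REFERENCES files(files_id),
    grades_id INTEGER NOT NULL REFERENCES grades(grades_id),
    PRIMARY KEY (files_id, grades_id)    -- no separate surrogate id needed
);
INSERT INTO files VALUES (345, 'xyz');
INSERT INTO grades VALUES (1), (2), (3), (4), (5), (6);
INSERT INTO files_grades VALUES (345, 2), (345, 3);
""")

# Just the 3rd grade lessons.
grade3_files = conn.execute("""
    SELECT f.name
    FROM files f
    JOIN files_grades fg ON fg.files_id = f.files_id
    WHERE fg.grades_id = 3
""").fetchall()

# The composite key also rejects the same file/grade pair being stored twice.
try:
    conn.execute("INSERT INTO files_grades VALUES (345, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(grade3_files, duplicate_rejected)
```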
Since your first question has already been answered, I will take a stab at the second one.
There are multiple ways to do this, but one possibility is to add another table for the "Schools" and include it as part of the junction table - renaming the junction table of course to match the new design. So, you could have:
School Table:
-------------------------
SchoolId | School
-------------------------
1 | Elementary
2 | Middle
3 | High
-------------------------
Files_grades_school
------------------------------------
FileId | GradeId | SchoolId
------------------------------------
345 | 1 | 1
345 | 1 | 2
You will probably want to create multiple indexes based on your usage patterns.

Normalization in plain English

I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the wikipedia post, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind.
Does anyone have a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms?
Say you go to a job interview and the person asks: Explain the concept of normalization and how you would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife it would have been something like that:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the country they came from. Instead of holding the name of the country which can be as long as "Bosnia & Herzegovina" for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45. Now in the future, as often happens with Balkan countries, they split to two countries: Bosnia and Herzegovina, I will have to change it only in one place. well, sort of.
Now, to explain 2NF, I would have changed the example, and let's assume that we hold the list of countries every person visited.
Instead of holding a table like:
Person CountryVisited AnotherInformation D.O.B.
Faruz USA Blah Blah 1/1/2000
Faruz Canada Blah Blah 1/1/2000
I would have created three tables, one table with the list of countries, one table with the list of persons and another table to connect them both. That gives me the most freedom I can get changing person's information or country information. This enables me to "remove duplicate rows" as normalization expects.
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, then you are violating normalization which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts, like the information that Bob's address is "The belltower" is repeated twice, which is redundant, and makes it difficult to query and change data and (the worst) possible to introduce logical inconsistencies.
Eg. if Bob moves I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. E.g. if I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
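The friends-and-cats example can be sketched as follows to show the point about consistency: in the two-table design Bob's address lives in exactly one row, so a move is a single update. The new address is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friends (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    address TEXT NOT NULL
);
CREATE TABLE cats (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    owner_id INTEGER NOT NULL REFERENCES friends(id)
);
INSERT INTO friends VALUES (1, 'John', 'The Road 1'), (2, 'Bob', 'The Belltower');
INSERT INTO cats VALUES (1, 'Kitty', 1), (2, 'Edgar', 2), (3, 'Howard', 2);
""")

# Bob moves: one row changes, however many cats he owns.
conn.execute("UPDATE friends SET address = ? WHERE name = ?", ("The Lighthouse", "Bob"))

# Every cat now reports the same, consistent address for its owner.
bobs_cats = conn.execute("""
    SELECT c.name, f.address
    FROM cats c
    JOIN friends f ON f.id = c.owner_id
    WHERE f.name = 'Bob'
    ORDER BY c.id
""").fetchall()
print(bobs_cats)
```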
This is what I ask interviewees:
Why don't we use a single table for an application instead of using multiple tables ?
The answer is of course normalization. As already said, it's to avoid redundancy and thereby update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to which row in the user table it belongs to. No blank columns are needed, and each user can have as few or many phone numbers as necessary.
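A short sketch of that phone number table, with invented users and numbers. A user with four numbers and a user with none both fit without any schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE phone_numbers (
    user_id      INTEGER NOT NULL REFERENCES users(user_id),
    phone_number TEXT    NOT NULL,
    PRIMARY KEY (user_id, phone_number)
);
-- Sample data is made up for the example.
INSERT INTO users VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cy');
INSERT INTO phone_numbers VALUES
    (1, '555-0100'),
    (2, '555-0101'), (2, '555-0102'), (2, '555-0103'), (2, '555-0104');
""")

# LEFT JOIN keeps users who have no phone number at all.
numbers_per_user = conn.execute("""
    SELECT u.name, COUNT(p.phone_number)
    FROM users u
    LEFT JOIN phone_numbers p ON p.user_id = u.user_id
    GROUP BY u.user_id, u.name
    ORDER BY u.user_id
""").fetchall()
print(numbers_per_user)
```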
One side point to note about normalization: A fully normalized database is space efficient, but is not necessarily the most time efficient arrangement of data depending on use patterns.
Skipping around to multiple tables to look up all the pieces of info from their denormalized locations takes time. In high load situations (millions of rows per second flying around, thousands of concurrent clients, like say credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized tables.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note that said you had to go shopping for ice cream without normalization, you would then have another note, saying you have to go shopping for ice cream, just one in each pocket.
Now, In real life, you would never do this, so why do it in a database?
For the designing and implementing part, that's when you can move back to "the lingo" and keep it away from layman's terms, but I suppose you could simplify. You would say what you needed to at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that table's primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and the fact that you cannot always have the best structure AND be in normal forms.
Normalization is a set of rules used to design tables that are connected through relationships.
It helps in avoiding repetitive entries, reducing required storage space, preventing the need to restructure existing tables to accommodate new data, increasing speed of queries.
First Normal Form: Data should be broken up in the smallest units. Tables should not contain repetitive groups of columns. Each row is identified with one or more primary key.
For example, if there is a column named 'Name' in a 'Custom' table, it should be broken into 'First Name' and 'Last Name'. Also, 'Custom' should have a column named 'CustomID' to identify a particular customer.
Second Normal Form: Each non-key column should be directly related to the entire primary key.
For example, if a 'Custom' table has a column named 'City', the city should have a separate table with a primary key and city name defined; in the 'Custom' table, replace the 'City' column with 'CityID' and make 'CityID' a foreign key in the table.
Third normal form: Each non-key column should not depend on other non-key columns.
For example, in an order table, the column 'Total' is dependent on 'Unit price' and 'quantity', so the 'Total' column should be removed.
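A minimal sketch of the 'Total' example, using Python's built-in sqlite3 module with an in-memory database. The table and column names (order_lines, unit_price, quantity) are made up for illustration; the only point is that the total is computed in the query instead of being stored.

```python
import sqlite3

# In-memory database; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_lines ("
    " order_line_id INTEGER PRIMARY KEY,"
    " unit_price REAL NOT NULL,"
    " quantity INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO order_lines (unit_price, quantity) VALUES (?, ?)",
    [(2.5, 4), (10.0, 3)],
)


def line_totals(connection):
    """Return (order_line_id, total), deriving total at query time."""
    rows = connection.execute(
        "SELECT order_line_id, unit_price * quantity AS total"
        " FROM order_lines ORDER BY order_line_id"
    )
    return rows.fetchall()


totals = line_totals(conn)
```

Because no 'Total' column is stored, changing a quantity or a unit price can never leave a stale total behind.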
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it like Bill Lumbergh. We then query the students and ask them what we will have problems with, when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough name for some, because I have what some people would consider 2 first names and some people call me Richard. If you were trying to search for my last name then it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems with easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that meaningful is based upon the audience who is going to be using the database as well. We, at our job, will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FedEx might need it separated out to easily pull up the apartment or suite of where they need to go when they are on the road and running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
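To make the name example concrete, here is a small sketch using Python's sqlite3 module. The people table and its first_name/last_name columns are assumptions for illustration, not something from the course material. With the name split into two fields, both the search and the sort by last name need no wildcards.

```python
import sqlite3

# Hypothetical table with the name split into its smallest meaningful parts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (first_name TEXT NOT NULL, last_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO people (first_name, last_name) VALUES (?, ?)",
    [("Jim", "Richards"), ("Bill", "Lumbergh")],
)

# Sorting by last name is a plain ORDER BY.
by_last_name = conn.execute(
    "SELECT first_name, last_name FROM people ORDER BY last_name"
).fetchall()

# Searching by last name is an exact match, no wildcard needed.
richards = conn.execute(
    "SELECT first_name FROM people WHERE last_name = ?", ("Richards",)
).fetchall()
```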
Avoiding Blanks:
I use an analogy to explain to them why they should avoid blanks. I tell them that Access and most databases do not store blanks like Excel does. Excel does not care if you have nothing typed out in the cell and will not increase the file size, but Access will reserve that space until the point in time that you actually use the field. So even if it is blank, it will still be using up space, and I explain to them that it also slows their searches down.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype values if I have to type in the same thing over and over again. This “fat-fingering” of data will lead to my queries not finding the correct data. Instead, we will break the data out into a separate table and create a relationship using a primary and foreign key field. This way we are saving space because we are not typing the customer's name, address, etc. multiple times and instead are just using the customer's ID number in a field for the customer. We then will discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will set up a drop-down list that will give you a list of customers, where you can select their name and it will fill in the customer’s ID for you. This will be a 1-to-many relationship, where 1 customer will have many different orders.
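A rough sketch of that 1-to-many split, written with Python's sqlite3 module rather than Access. The customers and orders tables, their columns, and the sample address are all assumptions for illustration. Each order stores only the customer's ID, and the foreign key rejects an ID that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    address     TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date  TEXT NOT NULL
);
""")

# The customer's details are typed once...
conn.execute(
    "INSERT INTO customers (customer_id, name, address)"
    " VALUES (1, 'Bill Lumbergh', '1 Example Street')"
)
# ...and every order just points at the customer's ID.
conn.executemany(
    "INSERT INTO orders (customer_id, order_date) VALUES (?, ?)",
    [(1, "2024-01-05"), (1, "2024-02-10")],
)

# A join brings the name back whenever it is needed.
orders_for_customer = conn.execute(
    "SELECT c.name, o.order_date"
    " FROM customers c JOIN orders o ON o.customer_id = c.customer_id"
    " ORDER BY o.order_date"
).fetchall()
```

Trying to insert an order for a customer_id that is not in customers raises sqlite3.IntegrityError, which is the database-level version of guarding against fat-fingered values.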
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid out similar to this.
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects that they work on. First, if we have a new employee, then they will not have any projects, so we will be wasting all of those fields, second if an employee has been here a long time then they might have worked on 300 projects, so we would have to include 300 project fields. Those people that are new and only have 1 project will have 299 wasted project fields. This design is also flawed because I will have to search in each of the project fields to find all of the people that have worked on a certain project, because that project number could be in any of the project fields.
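One common way around the repeated Project1, Project2, Project3 fields is a third "junction" table with one row per employee/project pairing. The sketch below uses Python's sqlite3 module. tblEmployeeProjects, the sample rows, and the helper function are hypothetical additions; tblEmployees and tblProjects follow the layout above with only a few of their fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblEmployees (
    EmployeeID INTEGER PRIMARY KEY,
    First      TEXT NOT NULL,
    Last       TEXT NOT NULL
);
CREATE TABLE tblProjects (
    ProjectNum  INTEGER PRIMARY KEY,
    ProjectName TEXT NOT NULL
);
-- Hypothetical junction table: one row per (employee, project) pairing.
CREATE TABLE tblEmployeeProjects (
    EmployeeID INTEGER NOT NULL REFERENCES tblEmployees(EmployeeID),
    ProjectNum INTEGER NOT NULL REFERENCES tblProjects(ProjectNum),
    PRIMARY KEY (EmployeeID, ProjectNum)
);
INSERT INTO tblEmployees VALUES (1, 'Jim', 'Richards'), (2, 'Bill', 'Lumbergh');
INSERT INTO tblProjects VALUES (100, 'Alpha'), (200, 'Beta');
INSERT INTO tblEmployeeProjects VALUES (1, 100), (1, 200), (2, 100);
""")


def employees_on_project(connection, project_num):
    """Find everyone on a project by checking a single column."""
    rows = connection.execute(
        "SELECT e.First, e.Last"
        " FROM tblEmployees e"
        " JOIN tblEmployeeProjects ep ON ep.EmployeeID = e.EmployeeID"
        " WHERE ep.ProjectNum = ?"
        " ORDER BY e.EmployeeID",
        (project_num,),
    )
    return rows.fetchall()


on_alpha = employees_on_project(conn, 100)
```

A new employee simply has no rows in the junction table, and an employee with 300 projects has 300 rows, so there are no wasted fields either way.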
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarification or breaking it down in plain English. The wiki page did not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times but I have found a better overview of normalization from this article. It is a simple easy to understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of
efficiently organizing data in a
database. There are two goals of the
normalization process: eliminating
redundant data (for example, storing
the same data in more than one table)
and ensuring data dependencies make
sense (only storing related data in a
table). Both of these are worthy goals
as they reduce the amount of space a
database consumes and ensure that data
is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in tables that will have records inserted but not updated, such as history logs or financial transactions.
References
When not to Normalize your SQL Database
Database Design Basics
+1 for the analogy of talking to your wife. I find talking to anyone without a tech mind needs some ease into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
In a truly normalized database, I have found that in some situations it's been easier to write queries that are slow because of bad join operations, bad indexing on the tables, and plain bad design of the tables themselves.
Bluntly, it's easier to write bad queries in high level normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out of a few tables, without having to join to a ton of tables to get one data set.