Is it OK to have duplicated values in SQL?

I'm not a DBA, so I'm not familiar with the proper lingo; maybe the title of the question is a little misleading.
So, here's the thing. I have Members in a certain system, and these members can be part of demographic segments (any kind of segment: favorite color, gender, job, etc.).
These are the tables:
SegmentCategory: ID, Name, Description
SegmentCategory_segment: SegmentID, SegmentCategoryID
Segment: ID, Name, Description
MemberSegment: ID, MemberID, SegmentID
So the guy who designed the DB decided to uber-normalize everything, and put the member's gender in a segment rather than in the Member table.
Is this OK? To my mind, gender is a property of the Member, so it belongs on that entity. But then the data would be duplicated (gender on the Member and gender as a segment), although a trigger on the Member table could fix that (update the segment on a gender change).
Having to crawl four tables just to get a property of the member seems like over-engineering to me.
So, am I right or not? And if so, how could I propose the change to the DBA?

There isn't a blanket rule you can apply to database decisions like this; it depends on what applications and processes the database is supporting. A reporting database is much easier to work with when it is more denormalized (in a well-thought-out way) than a more transactional database.
You can have a customer record spread across 2 tables, for instance, if some data is accessed or updated more often than other parts. Say you only need one half of the data in 90% of your queries, but don't want to drag around the varchar(max) fields you keep there for whatever reason.
Having said that, a table with just a gender/memberid is on the far side of extreme. From my naive understanding of your situation, I feel you just need a members table with views on top for your segments.
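For illustration, here is a minimal sketch of that idea, assuming gender is stored as a column on the Member table (all names here are hypothetical):

-- Gender lives once, on the member row; the view exposes it in segment shape
CREATE VIEW MemberGenderSegment AS
SELECT MemberID,
       Gender AS SegmentValue
FROM   Member;

Queries that expect segment-shaped data can read the view, while the fact itself is stored in exactly one place.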
As for the DBA: ultimately, I imagine they will be the one maintaining the integrity of the data, so I would just approach them and say, "Hey, what do you think of this?" Hopefully they'll either see the merit or be able to give you the reasons for their design decisions.

Related

PK Query on M2M Relation

I was reading about Neo4j, a graph database, and how it compares to the relational model. Here is one thing it mentions about how to query an M2M join for the "Departments" associated with a single user:
I would think though, if I knew beforehand I'm just looking up a single row by PK and there are likely fewer than 5 departments for that user, I would write the query as follows:
SELECT name FROM department WHERE department_id IN (
    SELECT department_id FROM PersonDepartment WHERE user_id IN (
        SELECT pk FROM Person WHERE name = 'Alice' -- assume unique name
    )
)
I'm sure writing this in the more common 'join format' would be optimized by the RDBMS into something closer to the above, but I'm using the above just to show how the above query seems like it would take almost no time to execute, or am I wrong here? On the other hand, writing this in the more concise Cypher format, (p:Person {name:"Alice"})-[:BELONGS_TO]->(d:Department), is much simpler to read and write.
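For reference, the 'join format' mentioned above would look something like this (same hypothetical tables and columns as the nested query):

SELECT d.name
FROM   Person p
       JOIN PersonDepartment pd ON pd.user_id = p.pk
       JOIN department d ON d.department_id = pd.department_id
WHERE  p.name = 'Alice'  -- assume unique name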
Preliminary
To get some issues that confuse the problem out of the way, so that we can answer the question in a straightforward manner:
The text blurb in the graphic
It is completely dishonest: the typical Straw Man argument, used to demean what he is against and to elevate what he is for. He poses the Relational method as something that it is not (a 1960s Record Filing System with IDs as "Primary Keys"), and then takes it down.
Whoopee, he destroyed his own concoction, his Straw Man.
The Relational method remains, unaffected.
Nevertheless, the uneducated will be confused.
ID fields as a "Primary Key"
The Relational Model explicitly prohibits IDs, which are Physical. The PK must be "made up from the data", which is Logical.
Further, the file contains duplicate rows (IDs do not provide row uniqueness, which is demanded in the RM).
IDs complicate DML code and force more JOINs (which are not required in the equivalent Relational database), which the dear professor is counting on in his erection of the Straw Man.
The IDs need to be removed, and proper Relational Keys need to be implemented.
Relational Integrity, which is Logical (as distinct from Referential Integrity, which is Physical), is lost, indeed not possible.
Full detail in Relational schema for a book graph.
No one in their right mind is going to step through those three tables in that way, let alone prescribe it.
He is using procedural code, such as a CURSOR, which is anti-Relational and stupefyingly slow.
The RM and SQL are based on Set Theory, so use set verbs, such as SELECT, and select only what you need.
The proposition is a single set; a single SELECT fulfils it.
Questions
I would think though, if I knew beforehand I'm just looking up a single row by PK and there are likely fewer than 5 departments for that user, I would write the query as follows: ...
Definitely not, even with the IDs:
the population in each table is irrelevant (the PK has a unique index);
let us assume 10,000 Persons; 10,000 Departments; 16,000,000 PersonDepartments;
performance should never be considered when modelling, or when writing DML code
it should be considered only when some code performs badly, with a view to improving it.
Other than for the purpose of clarifying your question, that code can be dismissed.
I'm sure writing this in the more common 'join format' would be optimized by the RDBMS
Yes.
With a genuine SQL platform, it will do many things re optimisation, at many levels: parsing; determination of a Query Plan; consideration of Statistics; etc.
With the freeware "SQLs", it does a mickey-mouse version of that (at best), or none at all (at worst). Which is why performance is a consideration everywhere with them; but that is abnormal, sub-standard.
into something closer to the above
Definitely not. That nested form is a dog's breakfast. The platform will instead create a very elegant and optimised Query Plan, and then a hierarchic Query Tree (a run-time executable that can be shared).
but I'm using the above just to show how the above query seems like it would take almost no time to execute, or am I wrong here?
No, you are right, in the sense that either the horrible code example operating on an RFS, or the correct code operating on a Relational database, will execute in milliseconds: "almost no time".
Relational Data Model
If you wish to evaluate what he intended in his proposition (what departments does Alice work for), without the dishonesty of his Straw Man, using a Relational database (no IDs, proper Relational Keys), we need a data model.
All my models are rendered in IDEF1X, the one and only Standard for Relational data modelling. (ERD cannot be used.)
The IDEF1X Introduction is essential reading.
The code is simple.
SELECT NameFirst,
       DepartmentCode
FROM   Person P
       JOIN Employee E ON P.PersonNo = E.EmployeeNo
WHERE  NameFirst = 'Alice'
This code produces a more meaningful result set; it is still a single, simple SELECT.
SELECT NameLast,
       NameFirst,
       D.Name,
       EmploymentDate
FROM   Person P
       JOIN Employee E ON P.PersonNo = E.EmployeeNo
       JOIN Department D ON E.DepartmentCode = D.DepartmentCode
WHERE  NameFirst = 'Alice'
Comments
One question regarding the "no IDs, proper keys": doesn't the PersonNo act the same way as an auto-incrementing PK would, to identify a person?
Yes.
Except that AUTOINCREMENT/IDENTITY columns have horrendous maintenance problems, thus we do not allow them in Production, and thus we do not allow them in Development that is headed for Production.
The alternative for an INSERT is:
...
PersonNo = (
    SELECT MAX( PersonNo ) + 1
    FROM   Person
)
...
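A self-contained sketch of that pattern, with a hypothetical Person table (COALESCE added here to handle the empty-table case; under concurrency the MAX must be read at an appropriate isolation level):

CREATE TABLE Person (
    PersonNo  INT         NOT NULL PRIMARY KEY,
    NameLast  VARCHAR(30) NOT NULL,
    NameFirst VARCHAR(30) NOT NULL
);

-- Derive the next key value from the data itself, in the same transaction
INSERT INTO Person ( PersonNo, NameLast, NameFirst )
SELECT COALESCE( MAX( PersonNo ), 0 ) + 1, 'Smith', 'Alice'
FROM   Person;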
Of course, for high performance OLTP, there are other methods.
Never use the Oracle method, which is a file of records, each holding a next-sequential-number for some other file.
If we went with "the PK must be made up from the data" and there is no SS# or other uniquely-identifying person code, it'd then be just combining a bunch of attributes: FirstName + LastName + BirthPlace + Birthdate (or whatever combination would give enough granularity to guarantee uniqueness).
Yes.  That is answered in full detail in the IDEF1X Introduction, please read.
Short answer ...
this is a true surrogate, not a RecordID (which is falsely called a surrogate).
the only justification is when
the natural PK gets too long (here 7 columns and 120+ bytes), to be carried into subordinate tables as FKs,
and
the table is the top of a data hierarchy, which therefore does not suffer an Access Path Independence breach, as stipulated in Codd's Relational Model.  Which is true in the usage here.
Technically, a surrogate (or RecordID) is a breach of the Relational Key Normal Form. A properly placed surrogate does not breach the Access Path Independence Rule (there is nothing above the breach to be accessed), whereas a RecordID always does. The user does not see the surrogate or RecordID, because it is not data.
Separately, note that ( LastName, FirstName, Initial, Birthdate, BirthCountry, BirthState, BirthPlace ) is an established international convention (not quite a standard) for identifying a person.

DB design critique / suggestions

I am redesigning a pharmacy DB system and need input on whether the new design is optimal or requires tweaking.
Here's a snapshot of the old system.
As can be seen, the pharmacies table stores pharmacy information, along with its address and contact information. Pharmacies are grouped together for invoicing purposes (pharmacygroup) or for sales, advertising and other purposes (bannergroup). The invoice group may have a different physical address and different contact information.
Here's my new design. I have split the address out of both the pharmacy and pharmacygroup tables into a table of its own, and made a new table for contacts. There could be technical contacts, account contacts, owner contacts, etc., hence the contacttypes table. The pharmacy and the pharmacygroup can have separate contact info. I thought of making a single contact table with 'linktype' and 'linkid' columns to indicate whether it is a pharmacy contact or a pharmacy group contact, but I am not sure if this is the right approach. Is this a good design, or will it be costly in terms of data retrieval because of the number of joins? Another thing I noted is that in the old design they didn't create any foreign key constraints, although the pharmacy table had groupid and bannergroupid references to pharmacygroup and bannergroup, possibly to save time on data retrieval. Is this a good approach?
Your design looks good to me. I always prefer to have a couple of extra joins at the design stage over spending time reorganizing data after the system has gone into production. You never know in advance what kind of reports will be requested by management/sales/financial people, and a proper relational design will give you more freedom.
Also, you cannot blame just a couple of extra JOINs for your performance issues. You should always look at:
data volumes (and physical data layout),
transaction amount and density,
I/O, CPU, memory usage,
your RDBMS configuration,
SQL query quality.
In my view, JOINs will be at the bottom of this list.
As to RI (Referential Integrity) constraints: I've seen a couple of projects that ran without any primary/foreign keys for the sake of performance. The main excuse was: we have all the checks embedded in the application, and the application is the only source of changes in the system. On the other hand, they agreed that it was not known whether those systems were in a consistent state (in fact, analysis showed they were not).
I always stick to creating all possible keys/constraints at the design stage, as there will always be some "cowboys" around who will dig into your database and "adjust" the data as they see fit. Still, you might want to temporarily disable or even drop some constraints/indexes for bulk data manipulation, which is also an official recommendation.
If uncertain, create 2 test databases, one with and another without constraints. Load some data and compare query performance. I think it will be similar.
And here are my comments on your sketches; decisions are all yours.
You might want to create a common contacts table the same way you did for addresses, i.e. add contact_id, owner_contact_id, etc. columns to the target relations instead of referencing those relations from the contacts table;
As you have only one column in the contacttype table (and in case you do go with common contacts), it's better to move that single field elsewhere and avoid this table;
You seem to have a mixture of singular/plural names for your tables; better to stick to a common pattern here. I personally prefer singular;
In pharmacygroup your PK is named id, while all the other PKs follow the tableid pattern; it will be easier to write scripts later if you use a common pattern here;
In the addresses table you have fields with underscores, like street_name, while elsewhere you avoid underscores; consider making this consistent;
References are named inconsistently. Although this is not highly important, I do have a couple of systems where I have to rely on constraint names, so it's better to use some pattern here. I use the following one:
prefix p_, f_, c_, t_, u_ or i_ for primary keys, foreign keys, check constraints, triggers, unique and other indexes;
name of the table;
name of the column constraint/index/trigger refers to.
Why do I prefer naming tables in singular form? Because I always name the PK using the table_id pattern, and IMHO pharmacy_id looks better than pharmacies_id. I use this approach because I have a bunch of general-purpose scripts which rely on this pattern when performing data consistency checks prior to loading data into the main tables.
EDIT:
More on contacts.
You can use contact_id in all your tables, making it a primary contact, whatever this might mean in your application. Should you need more contacts to be there for some relations, then you can go with different prefixes, like owner_contact_id, sales_contact_id, etc.
In case you expect a huge number of contacts for some relations, like pharmacygroup, then you can add an extra table like this:
CREATE TABLE pharmacygroupcontact (
    contactid    int4 NOT NULL,   -- FK to the common contacts table
    groupid      int4 NOT NULL,   -- FK to pharmacygroup
    contact_desc text,
    PRIMARY KEY ( contactid, groupid )
);
It partially copies your initial groupcontacts, but consists of two FKs and a description.
Which approach is better I cannot tell, as I'm not aware of how the application is designed.
You have 2 contact tables; I would create one, then use linking tables to link groupcontacts and pharmacycontacts. I would definitely want to have the FK and PK relationships set up too.

Survey Data Model - How to avoid EAV and excessive denormalization?

My database skills are mediocre at best, and I have to design a data model for survey data. I have given this some thought, and right now I feel that I am stuck between some kind of EAV model and a design involving hundreds of tables, each with hundreds of columns (and thousands of records). There must be a better way to do this, and I hope that the wise folks on this forum can help me.
My question is: how should I model the answers to survey questions in an RDBMS? Using SQL Server is mandatory. So alternative data storage systems should be excluded from this discussion. (Sure, some should and will be evaluated, but not here please.) I don't need a solution for the entire data model, for now I'm only interested in the Answers part.
I have already searched various forums, but I couldn't really find a solution. If it has already been given elsewhere, please excuse me and provide me with a link so I can read it up.
Some assumptions about the data I have to deal with:
Each survey consists of 1 to n questionnaires
Each questionnaire consists of 100-2,000 questions (please ignore that 2,000 questions really sound like a lot to answer...)
Questions can be of various types: multiple-choice, free text, a number (like age, income, percentages, ...)
Each survey involves 10-200 countries (These are not the respondents. The respondents are actually people in the countries.)
Depending on the type of questionnaire, each questionnaire is answered by 100-20,000 respondents per country.
A country can adapt the questionnaires for a survey, i.e. add, remove or edit questions
The data for one country is gathered in a separate database in that country. There is no possibility for online integration from the start.
The data for all countries has to be integrated later. This means for example, if a country has deleted a question, that data must somehow be derived from what they sent in order to achieve a uniform design across all countries
I will have to write the integration and cleaning software, which will need to work with every country's data
In the end the data needs to be exported to flat files, one rectangular grid per country and questionnaire.
I have already discussed this topic with people from various backgrounds and have not come to a good solution yet. I mainly got two kinds of opinions.
The domain experts, who are used to working with flat files (spreadsheet-style) for data processing and analysis, vote for a denormalized structure with loads of tables and columns, as I described above (1 table per country and questionnaire). This sounds terrible to me: I learned that wide tables are to be avoided, it will be annoying to determine which columns are actually in a table when working with it, the database will become cluttered with hundreds of tables (or I would even need to set up multiple databases, each with a similar yet slightly different design), etc.
O-O programmers vote for a strongly "normalized" design, which would effectively lead to a central table containing all the answers from all respondents to all questions. This table would either need a column of type sql_variant, or multiple answer columns of different types to store answers of the different kinds (multiple choice, free text, ...). The former would essentially be an EAV model. I tend to follow Joe Celko here, who strongly discourages its use (he calls it OTLT, "One True Lookup Table"). The latter would imply that each row contains null cells for the non-applicable types by design.
Another alternative I could think of would be to create one table per answer type, i.e. one for multiple-choice questions, one for free-text questions, etc. That's not very generic, it would lead to a lot of UNIONs, I think, and I would have to add a table if a new answer type were invented.
Sorry for boring you with all this text and thank you for your input!
Cheers,
Alex
PS: I asked the same question here: http://www.eggheadcafe.com/community/aspnet/13/10242616/survey-data-model--how-to-avoid-eav-and-excessive-denormalization.aspx
I think this is completely feasible within a relational model. I've built a CDM (conceptual data model) to show how I would do this.
Outbound
It takes 4 entities to define a country's survey: some parent Survey, the Country, and a list of Questions. Your questions have an internal relationship, so when one country "edits" a question, you can track both the question asked by the country and the question it came from. The other thing you need is a Possible Answer entity/table: each question may have an associated list of possible answers (multiple choice, ranges, etc.). Those 4 should completely define the "OUTBOUND" side of this.
Inbound
The "INBOUND" side is just 2 new entities, The Respondent and the answer. The respondent is straightforward, just the demographics of that person if you know them and here you can include a relationship back to country. Each respondent answered the survey in a given country. (Person may be 1:n with Respondent if the person travels or has dual citizenship)
The answer is basic; either it is one of the choices listed in the list of Possible Answers or it is provided. Don't get all caught up in the fact that the answer may be a number, date, etc just yet. Either it's a FK or a string of characters.
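Putting the two sides together, here is a rough sketch of those six entities as tables (SQL Server syntax, since that is mandated; every name and type is an assumption for illustration, not part of the original answer):

CREATE TABLE Survey ( SurveyID int PRIMARY KEY, Title nvarchar(200) NOT NULL );

CREATE TABLE Country ( CountryID int PRIMARY KEY, Name nvarchar(100) NOT NULL );

CREATE TABLE Question (
    QuestionID       int PRIMARY KEY,
    SurveyID         int NOT NULL REFERENCES Survey (SurveyID),
    CountryID        int NOT NULL REFERENCES Country (CountryID),
    ParentQuestionID int NULL REFERENCES Question (QuestionID), -- the original that this country edited
    QuestionText     nvarchar(2000) NOT NULL
);

CREATE TABLE PossibleAnswer (
    PossibleAnswerID int PRIMARY KEY,
    QuestionID       int NOT NULL REFERENCES Question (QuestionID),
    AnswerText       nvarchar(400) NOT NULL
);

CREATE TABLE Respondent (
    RespondentID int PRIMARY KEY,
    CountryID    int NOT NULL REFERENCES Country (CountryID)
);

CREATE TABLE Answer (
    RespondentID     int NOT NULL REFERENCES Respondent (RespondentID),
    QuestionID       int NOT NULL REFERENCES Question (QuestionID),
    PossibleAnswerID int NULL REFERENCES PossibleAnswer (PossibleAnswerID), -- one of the listed choices...
    ProvidedAnswer   nvarchar(4000) NULL,                                   -- ...or a provided (free-text) answer
    PRIMARY KEY ( RespondentID, QuestionID )
);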
Reporting
A report is a join over all of these: you'll choose a country and a survey, and get back the list of questions and answers.
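Against the hypothetical tables sketched above, such a report might look like this (parameter names assumed):

SELECT q.QuestionText,
       COALESCE( pa.AnswerText, a.ProvidedAnswer ) AS AnswerText
FROM   Answer a
       JOIN Question q ON q.QuestionID = a.QuestionID
       LEFT JOIN PossibleAnswer pa ON pa.PossibleAnswerID = a.PossibleAnswerID
WHERE  q.SurveyID = @SurveyID
  AND  q.CountryID = @CountryID;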
Answer Complexity
It depends on where you want to do your calculations. If you used a Varchar2(4000) column for your user-provided answers, you could add an attribute to Question to describe the datatype of the answer (Q: Age? DT: Integer between 0 and 130). Then your integration layer can do the validation instead of the database enforcing it. Or you can have 4 columns, one each for number, date, character and CLOB, and your integration layer will determine which column to use. When you report those answers out, you'll just select all four columns with COALESCE().
Is this an EAV, because there's a slight ambiguity in the datatype of "Answer"?
No, it's not.
An EAV model breaks an Entity down into a list of attributes, like so:
Entity | Attribute | Value
--------------------------
1      | Fname     | Stephanie
1      | Lname     | Page
1      | Age       | 30
Because you see the Answer column of the survey schema holding both words and numbers, as the Value column does here, you think that defines an EAV. It does not. Just as adding 3 datatype columns to this model would not change it from being an EAV.
I soooo hate it when
I've had people tell me that the query I'm tuning has to go "as fast as possible". OK, so give me a billion dollars and 30 years. "Wait, a billion what?" "As much as" and "as fast as" aren't requirements. You can validate anything you want in a database: build a shedload of BEFORE triggers, and voila! Validation galore.
What's the datatype of an Age column? Or a Birthdate column? It depends on what your data source is. Some older records may have only a month and year, or just a year, or 'around' or 'circa' some year. You couldn't have just a number column and do 'as much validation as possible'; and NUMBER(2) may be BETTER validation than just NUMBER. So now you'd need NUMBER(1), NUMBER(2), NUMBER... to have "as much as".
Where I think you are getting tripped up
Think of this as a Conceptual Data Model, not a Physical one. In those terms, Survey is an entity. Is Question an entity, or just an attribute of Survey? If you built one table per survey, you would clearly be saying that Question is just an attribute of Survey, and storing questions vertically would make this an EAV. What this model shows is that Question is actually another entity. There is a relationship between questions, e.g. 'a country [can] edit questions': there was the original question and the edited one. Each question has a collection of possible answers. And the most important thing is that they are all questions. In an EAV I call fname, lname, bdate, age, major, salary, etc. (all very disparate things) just attributes. In this case we're not including the name of the agency who originated the survey, the date it was issued, the date it is due back, and so on, as questions.
Let me put this another way. You're FedEx. You want to store timestamps for certain events: each time a package enters or leaves a facility or vehicle. Time on the pickup truck, time off the truck and into the first facility, time out of that facility and onto a plane, etc. Do you store them horizontally? How do you know the number of hops in advance? If you store them vertically, does that automatically make it an EAV? And if so, why?
You're a weather company getting temps from stations around the country. Let's say the sensors are designed to send a reading when the temperature changes by +/- a full degree. If you store sensor_ID | timestamp | temp in a Reading table, is that an EAV? Each reading isn't an attribute of the sensor; the readings are themselves entities, which belong to a collection/series.
One thing that vertical storage of answers has in common with an EAV is the difficulty of analytic queries. Getting a list of all the people who answered TRUE to questions 5 and 10 but FALSE to 6 and 11 would be very difficult when done vertically. Maybe that's why you see this as an EAV. If you want to do that, you need a different storage layout; the relational storage of the questions and answers isn't the best reporting database. Going back to the FedEx example: it's not simple to do "transit time" reporting when the rows are vertical.
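To make that concrete, here is one way to express the "TRUE to 5 and 10, FALSE to 6 and 11" filter against a vertical Answer table like the one sketched earlier (names hypothetical); it needs a grouped self-filter rather than a simple row predicate:

SELECT RespondentID
FROM   Answer
WHERE  ( QuestionID IN (5, 10) AND ProvidedAnswer = 'TRUE'  )
   OR  ( QuestionID IN (6, 11) AND ProvidedAnswer = 'FALSE' )
GROUP  BY RespondentID
HAVING COUNT( DISTINCT QuestionID ) = 4;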
This sounds like you are wrestling with a common problem: how to use a hammer to fasten a screw.
Both alternatives you listed are bad, each for different reasons. But that's because you are trying to stuff your particular data model into a relational database system. A good approach would be to look beyond the relational database at some other database/storage systems, try a couple out, and find the best fit for your project.
I have tried the EAV model and gave up because it was far too complex, and I am afraid to try the multi-table model with a relational database system. The easiest solution I have found with a relational database is to store each complete response as a single CLOB, serialized into JSON or YAML (or something else lightweight), in a responses table:
create table responses (
    id               uuid primary key,
    questionnaire_id uuid references questionnaires (id),
    data             text
);
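For illustration, a hypothetical row, with the complete response serialized as JSON (the UUID values are made up):

insert into responses (id, questionnaire_id, data)
values ('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11',
        'b1ffcd99-9c0b-4ef8-bb6d-6bb9bd380a22',
        '{"q1": "yes", "q2": 42, "q3": "free text"}');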
If I were using SQL Server (Express will be OK), then I would do this:
A table with the list of questions, with flags for type (bit), a required flag (bit), the correct answer if one exists, etc.
A table with the list of countries
A table linking countries and questions (some countries may not get some questions)
A table for answers, with columns for the questions and an XML column for the optional questions, including those which are added
If you are not versed in shredding XML, then use sparse columns for all the optional questions. I do not recall the exact limit on the number of sparse columns in a table, but I believe it is above 30,000. SQL Server internally stores sparse columns as XML and will shred them when one selects the column; and yes, they can be indexed.
The diagram below was created with SQL Server. The column AL_A4 will hold the answer to QL_Id = 4 and is of type sparse. QL_Id 4 in the QuestionList table is not flagged as required, letting you know to make the corresponding column in AnswerList sparse.
Since countries will add questions, create QuestionListCustom, QuestiontoCountryCustom and AnswerListCustom tables, and add the information from the custom questions there.
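As a rough sketch of the AnswerList just described, with the optional answer held in a SPARSE column (column types assumed):

CREATE TABLE AnswerList (
    AL_Id int NOT NULL PRIMARY KEY,
    AL_A1 varchar(400) NOT NULL,     -- answer to a required question
    AL_A4 varchar(400) SPARSE NULL   -- answer to optional question QL_Id = 4
);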
I am sure there are other ways to design the storage; this is the way I would turn in the homework. If this is not homework, then you surely work for the UN.
Have you considered not reinventing the wheel? There are open source survey applications already built. Even if they don't meet your needs, download a few and check out their data models.

SQL design question - many tables or not?

15 ECTS credits' worth of database design down the drain... I really can't come up with the best design solution for my problem.
Which is this: basically, I'm making a tool that gathers a lot of information about the user. At most, the user would fill in 50 fields of data, ranging from simple checkboxes to text input. I'm designing the DB right now (with MySQL) and can't decide whether to use a single User table with all of those fields, or a table for each category of input.
One example would be "type of payment". This one has three options, and if I went the "table" way I would add a paymenttype table and give it binary fields for each payment type. Then I would need an id to identify which payment type the user has chosen, whereas if I use a single user table, the data would already be there.
The site will probably see a lot of users (TV, internet and radio marketing), so I'm concerned about which alternative would be best.
I'll be happy to provide more details if you need more to base a decision.
Thanks for reading.
Read this article "Database Normalization Basics", and come back here if you still have questions. It should help a lot.
The most fundamental idea behind these decisions, as you will see in this article, is that each table should represent one and only one "thing", and each field should relate directly and only to that thing.
In your payment types example, it probably makes sense to break it out into a separate table if you anticipate the need to store additional information about each payment type.
Create your "Type of Payment" table; there's no real question there. That's proper normalization and the power behind using relational databases. One of the many reasons to do so is the ability to update a Type of Payment record and not have to touch the related data in your users table. Your join between the two tables will allow your app to see the updated type of payment info by changing it in just the 1 place.
Regarding your other fields, they may not be as clear-cut. The question to ask yourself about each field is: "does this field relate only to a user, or does it have meaning and possible use in its own right?" If you can never imagine a field having meaning outside the context of a user, you're safe leaving it as a field on the user table; otherwise, create the primary key-foreign key relationship and put the information in its own table.
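As a sketch of that break-out in MySQL (the question's platform; all names here are hypothetical):

CREATE TABLE paymenttype (
    paymenttype_id INT PRIMARY KEY,
    name           VARCHAR(50) NOT NULL
);

CREATE TABLE app_user (
    user_id        INT PRIMARY KEY,
    name           VARCHAR(100) NOT NULL,
    paymenttype_id INT NOT NULL,
    FOREIGN KEY (paymenttype_id) REFERENCES paymenttype (paymenttype_id)
);

-- Renaming a payment type now touches one row, not every user row
UPDATE paymenttype SET name = 'Visa Debit' WHERE paymenttype_id = 1;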
If you are building a form with variable inputs, I wouldn't recommend building it as one table. This is inflexible and dirty.
Normalization is the key. However, if you end up with a key/value setup, or effectively a scalar-type implementation across many tables, and you can't cache:
a) the form definition from table data, and
b) the joined result of storage (either a caching view or otherwise),
c) or don't build in proper sharding,
then you may hit a performance boundary.
In this KVP setup, you might want to look at something like CouchDB or a less table-driven storage format.
You may also want to look at trickier setups such as serialized object storage and cache tables, if your internal data is heavily related to other data already in the database.
50 columns is a lot. Have you considered a table that stores values like a property sheet? This would only be useful if you didn't need to regularly query the values it contains.
INSERT INTO UserProperty (UserID, Name, Value)
VALUES (1, 'PaymentType', 'Visa');

INSERT INTO UserProperty (UserID, Name, Value)
VALUES (1, 'TrafficSource', 'TV');
I think I figured out a great way of solving this. Thanks to a friend of mine for suggesting this!
I have three tables: Field {IdField, FieldName, FieldType}, FieldInput {IdInput, IdField, IdUser} and User {IdUser, UserName, ... etc.}
This way it becomes very easy to see what a user has answered, the solution is somewhat scalable, and it provides a good overview. I will constrain the alternatives in another layer, farther away from the DB. I believe it's a tradeoff worth making.
Any suggestions or critiques of this solution?

Normalization in plain English

I understand the concept of database normalization, but I always have a hard time explaining it in plain English, especially for a job interview. I have read the Wikipedia post, but still find it hard to explain the concept to non-developers. "Design a database in a way that avoids duplicated data" is the first thing that comes to mind.
Does anyone have a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between the first, second and third normal forms?
Say you go to a job interview and the interviewer asks: explain the concept of normalization and how you would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife, it would be something like this:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the country they came from. Instead of holding the name of the country, which can be as long as "Bosnia & Herzegovina", for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45s. Now if, in the future, as often happens with Balkan countries, the country splits into two, Bosnia and Herzegovina, I will have to change it in only one place. Well, sort of.
Now, to explain 2NF, I would change the example. Let's assume that we hold a list of the countries every person has visited.
Instead of holding a table like:
Person | CountryVisited | AnotherInformation | D.O.B.
------------------------------------------------------
Faruz  | USA            | Blah Blah          | 1/1/2000
Faruz  | Canada         | Blah Blah          | 1/1/2000
I would instead create three tables: one with the list of countries, one with the list of persons, and another to connect them both. That gives me the most freedom I can get for changing a person's information or a country's information. This enables me to "remove duplicate rows" as normalization expects.
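A minimal sketch of those three tables (names hypothetical):

CREATE TABLE person (
    person_id INT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL,
    dob       DATE NOT NULL
);

CREATE TABLE country (
    country_id INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL
);

CREATE TABLE person_country_visited (
    person_id  INT NOT NULL,
    country_id INT NOT NULL,
    PRIMARY KEY (person_id, country_id),
    FOREIGN KEY (person_id)  REFERENCES person (person_id),
    FOREIGN KEY (country_id) REFERENCES country (country_id)
);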
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, you are violating normalization, which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat information if I want the data to be consistent. The problem with this design is that some facts, like the fact that Bob's address is "The Belltower", are repeated, which is redundant; this makes it difficult to query and change the data, and (worst of all) makes it possible to introduce logical inconsistencies.
E.g. if Bob moves, I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. And if I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
This is what I ask interviewees:
Why don't we use a single table for an application, instead of multiple tables?
The answer is, of course, normalization. As already said, it's to avoid redundancy and thereby update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to the row in the user table it belongs to. No blank columns are needed, and each user can have as few or as many phone numbers as necessary.
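A minimal sketch of that arrangement (names hypothetical):

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL
);

CREATE TABLE phonenumber (
    phonenumber_id INT PRIMARY KEY,
    user_id        INT NOT NULL,
    number         VARCHAR(20) NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);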
One side point to note about normalization: a fully normalized database is space-efficient, but it is not necessarily the most time-efficient arrangement of data, depending on use patterns.
Skipping around multiple tables to look up all the pieces of information from their normalized locations takes time. In high-load situations (millions of rows per second flying around, thousands of concurrent clients, like, say, credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized ones.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note saying you have to go shopping for ice cream, then without normalization you would have another note saying you have to go shopping for ice cream: one in each pocket.
Now, in real life, you would never do this, so why do it in a database?
For the designing and implementing part, that's when you can move back to "the lingo" and keep it away from layman's terms. But I suppose you could simplify: you would say what you needed to at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that table's primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and of the fact that you cannot always have the best structure AND stay in normal forms.
Normalization is a set of rules used to design tables that are connected through relationships.
It helps avoid repetitive entries, reduces required storage space, prevents the need to restructure existing tables to accommodate new data, and increases the speed of queries.
First normal form: data should be broken up into the smallest meaningful units. Tables should not contain repeating groups of columns. Each row is identified by one or more primary key columns.
For example, if there is a column named 'Name' in a 'Customer' table, it should be broken into 'First Name' and 'Last Name'. Also, 'Customer' should have a column named 'CustomerID' to identify a particular customer.
Second normal form: each non-key column should be directly related to the entire primary key.
For example, if a 'Customer' table has a column named 'City', the city should have a separate table of its own, with a primary key and city name defined; in the 'Customer' table, replace the 'City' column with a 'CityID' column and make 'CityID' the foreign key in the table.
Third normal form: each non-key column should not depend on other non-key columns.
For example, in an order table, the column 'Total' depends on 'Unit price' and 'Quantity', so the 'Total' column should be removed.
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it, like Bill Lumbergh. We then ask the students what problems we will have when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough one for some, because I have what some people would consider 2 first names, and some people call me Richard. If you were trying to search for my last name, it would be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems easily sorting the field by last name, because, again, my last name is buried at the end.
I then let them know that "meaningful" depends on the audience who is going to be using the database. At our job, we will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FedEx might need it separated out to easily pull up the apartment or suite they need when they are on the road, running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
Avoiding Blanks:
I use an analogy to explain why they should avoid blanks. I tell them that Access and most databases do not store blanks the way Excel does. Excel does not care if you have nothing typed in a cell and will not increase the file size, but Access will reserve that space until the point in time you actually use the field. So even if it is blank, it will still use up space, and I explain that it also slows their searches down.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information, and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype values if I have to type the same thing over and over again. This "fat-fingering" of data will lead to my queries not finding the correct data. Instead, we will break the data out into a separate table and create a relationship using primary and foreign key fields. This way we save space, because we are not typing the customer's name, address, etc. multiple times; instead we just use the customer's ID number in a field for the customer. We then discuss drop-down lists/combo boxes/lookup lists, or whatever else Microsoft wants to name them, later on. :) You as a user will not want to look up and type the customer's number each time in that customer field, so we set up a drop-down list that gives you a list of customers, where you can select their name and it will fill in the customer's ID for you. This is a 1-to-many relationship, whereby 1 customer will have many different orders.
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables: one that will hold employee information and one that will hold project information. The tables are laid out similar to this:
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects they work on. First, if we have a new employee, they will not have any projects, so we will be wasting all of those fields. Second, if an employee has been here a long time, they might have worked on 300 projects, so we would have to include 300 project fields; people who are new and have only 1 project would then have 299 wasted project fields. This design is also flawed because I would have to search each of the project fields to find all of the people who have worked on a certain project, since that project number could be in any of the project fields.
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarification/breaking it down in plain English. The wiki page does not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times, but I found a better overview of normalization in the article below. It is a simple, easy-to-understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals, as they reduce the amount of space a database consumes and ensure that data is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in tables that will have records inserted but not updated, such as history logs or financial transactions.
References
When not to Normalize your SQL Database
Database Design Basics
+1 for the analogy of talking to your wife. I find that anyone without a technical mind needs to be eased into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
In a truly normalized database, I have found situations where it was easy to write slow queries, because of bad join operations, bad indexing on the tables, and plain bad design of the tables themselves.
Bluntly, it's easier to write bad queries against highly normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out of a few tables, without having to join a ton of tables to get one data set.