Normalization in database with countries as columns - sql

This has been bugging me for a while. Consider a table with attributes like this: {ID, Value, Australia, India, France, Germany}, where ID is the primary key, Value is some text (say, a car model), and under each country attribute (Australia, India, etc.) is the number of cars manufactured for that value.
Intuitively I know that the correct way to model this is {ID, Value, Cars-Manufactured, Country}, but can someone tell me why this is correct in terms of database normalization? Which normal form does the first table fail to meet? Or is the first table correct too?

The rule it violates is "no repeating groups". This is one of the rules for first normal form.
A column for each country is a repeating group. The data under each column is the same data, just applicable in a different context. When there is only one value there -- like number of cars made in that country -- this may not be obvious, maybe it's even debatable. But suppose we need two pieces of information for each country, like number manufactured and number sold. Now the table has a set of paired columns: Australia_manufactured, Australia_sold, India_manufactured, India_sold, France_manufactured, France_sold, etc. You have a set of two columns repeated multiple times.
Someone could ask: what is the difference between multiple distinct fields and a repeating group? How is "India_manufactured, Australia_manufactured, France_manufactured" different from "number_manufactured, price, description"? The difference is that in the first case, the semantic meaning of the value is the same; all that differs is the context it applies to. In the second case, the semantic meaning is different. That is, it is hard to imagine a query or program (beyond a trivial "find the biggest value" or some such) that we would run today processing number_manufactured and then run tomorrow doing exactly the same processing on price. But we could easily imagine running the same processing today for India and tomorrow for Germany.
Of course there are times when it can be ambiguous. That's why database designers get paid the big bucks. :-)
Okay, that's the rule. Does the rule have practical value?
Let's consider scenario A, one table:
model (model_id, description, india_manufactured, australia_manufactured, france_manufactured)
Scenario B, two tables:
model (model_id, description)
production (model_id, country_code, manufactured)
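As concrete DDL, scenario B might look something like this (a sketch; the column types are assumptions, not part of the outline above):
create table model (
    model_id    int primary key,
    description varchar(100)
);
create table production (
    model_id     int references model(model_id),
    country_code char(2),
    manufactured int,
    primary key (model_id, country_code)  -- one row per model per country
);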
There are a number of reasons why scenario A sucks. Here's the biggest:
Queries are much simpler with Scenario B. We do not have to hard-code countries into our program or query. Write a query to accept a country code as a parameter and return the number of each model manufactured in that country. In scenario B, simple:
select description, manufactured
from model join production on model.model_id=production.model_id
where production.country_code=#country_code
Easy. Now do it with scenario A. Something like:
select description,
case when #country_code='IN' then india_manufactured
when #country_code='AU' then australia_manufactured
when #country_code='FR' then france_manufactured
else null
end as manufactured
from model
Or suppose we want the total produced in all countries. Scenario B:
select description, sum(manufactured)
from model
join production on model.model_id=production.model_id
group by description
Scenario A:
select description, india_manufactured+australia_manufactured+france_manufactured
from model
(Might be more complex if we have to allow for nulls.)
We'd likely have many, many such queries throughout the system. In real life, many would be much more complex than this, with multiple such messy case clauses or juggling of multiple columns. Now suppose we add another country. In scenario B, this is zero effort: we can add and delete countries all we like and the queries don't change. But in scenario A, we would have to find every query and change it. If we miss one, we won't get any compile errors or anything like that. We'll just mysteriously get incorrect results.
Oh, and by the way, it's likely that there will be times when we only want to process some of the countries. Like, say some of the countries have a VAT and some don't, or whatever. In scenario B, we add a column for this fact and test on it. That's just "join country on country.country_code=production.country_code and country.vat=1". In scenario A the programmer would almost surely end up hard-coding the list of specific countries in each query. Then someone comes along later and sees that query X processes India and France and query Y processes France and Germany and query Z processes Germany and Singapore and he might well have no idea why. Even if he knows, the list is hard-coded in every query so every update requires updating every query, changing code rather than changing data.
Oh, and by the way, suppose we come across a query that only processes three of the four countries. How do we know whether this is a mistake (someone forgot one of the countries when writing the query, or missed this query when a new country was added), or whether there is some reason why that country was excluded?

The second approach is better: you get better clarity in terms of the data, and you can avoid INSERT, DELETE and UPDATE anomalies.
Yes, with the second approach you will have more rows.
Basically, when you design a DB the normal approach is to go for 3NF.

Table COUNTRYANDCARS [MODEL (PK), AUSTRALIA, INDIA, FRANCE, GERMANY]
The above approach only works when the set of countries is fixed.
Table CARPRODUCTION [MODEL (PK), COUNTRY (PK), COUNT]
This one works in all cases.
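A sketch of the CARPRODUCTION table in DDL (the types are assumed), with the composite primary key spelled out:
create table carproduction (
    model   varchar(50),
    country varchar(50),
    count   int,  -- "count" may need quoting in some databases
    primary key (model, country)
);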

Related

How to structure SQL tables with one (non-composite) candidate key and all non-primary attributes?

I'm not very familiar with relational databases but here is my question.
I have some raw data that was collected as a result of a customer survey. For each customer who participated there is only one record, uniquely identifiable by the CustomerId attribute. All other attributes, I believe, fall under the non-prime attribute description, as no attribute depends on another apart from the non-composite candidate key. Also, all columns are atomic, as in, none can be split into multiple columns.
For example, the columns are like CustomerId(non-sequential), Race, Weight, Height, Salary, EducationLevel, JobFunction, NumberOfCars, NumberOfChildren, MaritalStatus, GeneralHealth, MentalHealth and I have 100+ columns like this in total.
So, as far as I understand we can't talk about any form of normalization for this kind of dataset, am I correct?
However, given the excessive number of columns, if I wanted to split this monolithic table into tables with fewer columns, i.e. based on some categorisation of columns like demographics, health, employment etc., is there a specific name for such a structure/approach in the literature? All the tables would still use CustomerId as their primary key.
Yes, this is part of an assignment and as part of a task, it's required to fit this dataset into a relational DB, not a document DB which I don't think would gain anything in this case anyway.
So, there is no direct question as such, as I worded it above, but creating a table with 100+ columns doesn't feel right to me. Therefore, what I am trying to understand is how the theory approaches such blobs. Some concept names or potential ideas for further investigation would be appreciated, as I don't even know how to look this up.
In relational databases, putting all information in a single table is not good usage.
As you mentioned, grouping some columns into other tables and joining all of them to a master table works well. With this approach you can also manage one-to-many, many-to-one and many-to-many relationships; for example, customers could have more than one address or phone number.
Another approach is to make a table like customer_properties, with columns like property_type and property_value, and store the data as rows.
But the first approach is more effective and the most common. The customer_properties approach would look like this:
customer_id | property_type  | property_value
----------------------------------------------
1           | num_of_child   | 3
1           | age            | 22
1           | marital_status | Single
...
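In DDL, that entity-attribute-value table might be sketched like this (names taken from the example above; types assumed):
create table customer_properties (
    customer_id    int,
    property_type  varchar(50),
    property_value varchar(100),
    primary key (customer_id, property_type)  -- one value per property per customer
);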

Adding Records and Decreasing Quantity

I'm starting programming in pl/sql and I have some problems. I'm making a database, which includes tables like:
TRAINS: ID_TRAIN, NAME, TYPE
CARS: ID_CAR, TYPE, SEATS, QUANTITY
TRAINS_W_CARS: ID_TRAIN (FK), ID_CAR (FK), QUANTITY
I want to make a procedure/function which will decrease the number of available cars when I add some to a train.
For example: I have 50 available 2nd-class cars, and when I add 5 of them to some train, this amount should decrease to 45.
To be honest, I don't really know how to go about it, because I have not done such complicated things so far.
As an alternative, consider modifying the nature of what you're storing in the first place.
Currently it seems like you're storing one record which is a count of 50 "cars", and one record which is a count of 0 "train cars". So you have to manually keep track of moving these counts from one record to another. This is not only error-prone but can get confusing fast, mostly because there's no actual record of which cars are where; there are just counts.
To update, you have to do the following:
Decrement the car count by 1.
Increment the train car count by 1.
Instead, consider something like the following...
The CARS table has 50 records, each representing an actual car.
The TRAINS table has 1 record, representing an actual train.
The CARS table has a FK to the TRAINS table. This identifies to which train that car belongs. This value is nullable for cars which aren't currently connected to a train.
In this scenario, you have to do the following:
Update the FK of the CARS record.
It's a single atomic operation which either succeeds or fails. No half-committed data, no missing cars, etc.
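As a sketch (the ID_TRAIN foreign-key column on CARS and the literal values here are assumptions), attaching five unassigned 2nd-class cars to train 1 could look like:
update cars
set id_train = 1
where id_car in (
    select id_car
    from cars
    where id_train is null      -- not yet connected to any train
      and type = '2 class'
      and rownum <= 5           -- Oracle: take five such cars
);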
Each record in your data should represent "a thing". Not counts and descriptions of things, but actual things. And in this scenario you also don't need that linking table, because "trains" and "cars" are inherently not many-to-many things. (How can a car simultaneously be connected to two trains?)
Always consider the real-world concept being modeled in your data. You have:
A train, which can have many cars
A car, which can have many seats
A seat
A passenger
etc.
Build the relationships based on the actual real-world objects, not on the screens and reports of data that you want to see. Those reports can be easily generated from the real-world data, but real-world data can't always be generated from flat reports. (For example, in your current setup you can look at a report of how many cars are in a train, but you don't know which ones.)
I don't know whether you are inserting rows into your table manually or from some UI button click, but you can do the following:
Add a before/after insert trigger to your CARS table, and in that trigger write whatever you have to modify in your TRAINS table.
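A minimal PL/SQL sketch of the trigger idea, assuming instead that the assignment is recorded by inserting into TRAINS_W_CARS and that availability is tracked in CARS.QUANTITY:
create or replace trigger trg_decrement_cars
after insert on trains_w_cars
for each row
begin
    -- subtract the newly assigned cars from the available stock
    update cars
    set quantity = quantity - :new.quantity
    where id_car = :new.id_car;
end;
/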

Is it OK to have duplicated values in SQL?

I'm not a DBA so I'm not familiar with the proper lingo, so maybe the title of the question could be a little misleading.
So, the thing. I have Members for a certain system, these members can be part of a demographic segment (any kind of segment: favorite color, gender, job, etc)
These are the tables
SegmentCategory
ID, Name, Description
SegmentCategory_segment
SegmentID, SegmentCategoryID
Segment
ID, Name, Description
MemberSegment
ID, MemberID, SegmentID
So the guy who designed the DB decided to go uber-normalizing everything, so he put the member's gender in a segment and not in the Member table.
Is this OK? According to my logic, gender is a property of the Member, so it should be on its entity. But doing it this way means there is duplicated data (the gender on the Member and gender as a segment), although a trigger on the Member table could fix this (update the segment on a gender change).
Having to crawl four tables just to get a property of the member seems like over-engineering to me.
My question is whether I'm right or not. If so, how could I propose the change to the DBA?
There isn't a blanket rule you can apply to database decisions like this. It depends on what applications/processes the database is supporting. A database for reporting is much easier to work with when it is more de-normalized (in a well-thought-out way) than a more transactional database is.
You can have a customer record spread across two tables, for instance, if some data is accessed or updated more often than other parts. Say you only need one half of the data in 90% of your queries, but don't want to drag around the varchar(max) fields you have there for whatever reason.
Having said that, having a table with just a gender/memberid is on the far side of extreme. From my naive understanding of your situation I feel you just need a members table with views over top for your segments.
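A sketch of the "views over the members table" idea (the table and column names here are assumed):
create view segment_gender as
select memberid, gender
from members;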
As for the DBA, ultimately I imagine it will be them who will be needing to maintain the integrity of the data, so I would just approach them and say "hey, what do you think of this?" Hopefully they'll either see the merit or be able to give you reasons to their design decisions.

Is there an appropriate way to show correlation or causation from a SQL query?

I'm curious if there's any way that Microsoft Access or SQL Server can provide a function to show correlation or causation from a SQL Select statement and its results.
This is basically the use case scenario:
Let's say you have two tables, studentCourseSurveys and onlineCourseReviews. On the surface these two tables are mostly unrelated, but they could be joined on the course name, for example "ENG 101".
studentCourseSurveys is a table that is intended to hold the data that students submit during their in-person course survey at the end of a semester. For example, on the last day of class students receive a form to fill out rating the instructor on things such as "exams were related to the actual lecture content" and "instructor was on time and prepared", and at the end they have a short-answer opportunity to give additional comments.
onlineCourseReviews is a table that is used by an internal department that conducts content reviews on the online component of courses. For example, this department has individual instructional designers who are assigned different courses in Blackboard to review. They review content, delivery, course structure, and so on and so forth. The course is then given its comments, score, etc.
As already mentioned, the tables would be mostly unrelated. But let's say someone wanted to show a correlation: that the results of the online course reviews somehow indicate that the overall course quality was better because of these reviews, and that this shows up in the students' survey responses. (Basically, a course that got an online course review score of 10 has course surveys where a great majority of the students rated it as excellent, the content as relevant, the teacher as prepared, etc., indicating that an improvement in the online course translated to improvements in the in-person class and overall class quality.)
This almost seems like a unique job for a statistician, but I'd like to know if it's possible to show this based on a query in Access or SQL Server. I know that you could easily join the two tables with a foreign key and then get the results of a survey and online course review in a single statement, but that doesn't really say anything. I would think that to show ANY kind of relationship you need to illustrate a trend over a given period of time.
Thank you.
I would suggest creating a query with three values: course name (e.g., ENG101), the student rating, and the online course rating. In SQL Server you can save the results as a .csv file. Do this, then open it in Excel and use the CORREL function to find the correlation coefficient between columns two and three. The closer the coefficient is to 1, the closer the two match: 1 means a perfect positive correlation, 0 means no relationship at all, and -1 means a perfect relationship in the polar opposite direction. (Excel's RSQ function gives R-squared, the square of this coefficient, which runs from 0 to 1.)
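If you would rather stay inside SQL Server, Pearson's r can be computed with plain aggregates. A sketch, assuming hypothetical column names (rating on the survey table, score on the review table, joined on course name, with non-zero variance in both); the 1.0 factors force floating-point arithmetic:
select
    ( count(*) * sum(1.0 * s.rating * r.score)
      - sum(1.0 * s.rating) * sum(1.0 * r.score) )
    / ( sqrt(count(*) * sum(1.0 * s.rating * s.rating) - power(sum(1.0 * s.rating), 2))
      * sqrt(count(*) * sum(1.0 * r.score * r.score) - power(sum(1.0 * r.score), 2)) )
    as pearson_r
from studentCourseSurveys s
join onlineCourseReviews r on r.courseName = s.courseName;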

Normalization in plain English

I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the wikipedia post, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind.
Does anyone have a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms?
Say you go to a job interview and the person asks: explain the concept of normalization and how you would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife it would be something like this:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the country they came from. Instead of holding the name of the country, which can be as long as "Bosnia & Herzegovina", for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45s. Now if in the future, as often happens with Balkan countries, it splits into two countries, Bosnia and Herzegovina, I will have to change it in only one place. Well, sort of.
Now, to explain 2NF, I would change the example and assume that we hold the list of countries every person has visited.
Instead of holding a table like:
Person | CountryVisited | AnotherInformation | D.O.B.
------------------------------------------------------
Faruz  | USA            | Blah Blah          | 1/1/2000
Faruz  | Canada         | Blah Blah          | 1/1/2000
I would have created three tables: one table with the list of countries, one table with the list of persons, and another table to connect them both. That gives me the most freedom I can get when changing a person's information or country information. This enables me to "remove duplicate rows" as normalization expects.
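A sketch of those three tables in DDL (names and types assumed):
create table person (
    person_id int primary key,
    name      varchar(100),
    dob       date
);
create table country (
    country_id int primary key,
    name       varchar(100)
);
create table country_visited (
    person_id  int references person(person_id),
    country_id int references country(country_id),
    primary key (person_id, country_id)  -- one row per person per country visited
);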
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, then you are violating normalization which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
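In DDL, that relationship might be declared like this (a sketch; the types are assumed):
create table friends (
    id      int primary key,
    name    varchar(50),
    address varchar(100)
);
create table cats (
    id      int primary key,
    name    varchar(50),
    ownerid int not null references friends(id)  -- every cat has exactly one owner
);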
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts, like the fact that Bob's address is "The Belltower", are repeated, which is redundant; this makes it difficult to query and change data and, worst of all, makes it possible to introduce logical inconsistencies.
E.g. if Bob moves, I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. If I make a typo in Bob's address in one of the rows, the database suddenly has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
This is what I ask interviewees:
Why don't we use a single table for an application instead of using multiple tables?
The answer is of course normalization. As already said, it's to avoid redundancy and thereby update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to which row in the user table it belongs to. No blank columns are needed, and each user can have as few or many phone numbers as necessary.
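For instance, a sketch of that phone-number table (names and types assumed):
create table phonenumber (
    phonenumber_id int primary key,
    user_id        int not null references users(user_id),  -- which user the number belongs to
    number         varchar(20) not null
);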
One side point to note about normalization: A fully normalized database is space efficient, but is not necessarily the most time efficient arrangement of data depending on use patterns.
Skipping around to multiple tables to look up all the pieces of info from their denormalized locations takes time. In high load situations (millions of rows per second flying around, thousands of concurrent clients, like say credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized tables.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note that said you had to go shopping for ice cream, then without normalization you would have another note saying you have to go shopping for ice cream: one in each pocket.
Now, in real life, you would never do this, so why do it in a database?
For the designing and implementing part, that's when you can move back to "the lingo" and keep it away from layman terms, but I suppose you could simplify. You would say what you needed to at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that table's primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and the fact that you cannot always have the best structure AND be in normal forms.
Normalization is a set of rules used to design tables that are connected through relationships.
It helps in avoiding repetitive entries, reducing required storage space, preventing the need to restructure existing tables to accommodate new data, and increasing the speed of queries.
First Normal Form: Data should be broken up into the smallest units. Tables should not contain repetitive groups of columns. Each row is identified by one or more primary key columns.
For example, there is a column named 'Name' in a 'Customer' table; it should be broken into 'First Name' and 'Last Name'. Also, 'Customer' should have a column named 'CustomerID' to identify a particular customer.
Second Normal Form: Each non-key column should be directly related to the entire primary key.
For example, if a 'Customer' table has a column named 'City', the city should have a separate table with a primary key and city name defined; in the 'Customer' table, replace the 'City' column with a 'CityID' column and make 'CityID' the foreign key in the table.
Third Normal Form: Each non-key column should not depend on other non-key columns.
For example, in an order table, the column 'Total' is dependent on 'Unit price' and 'Quantity', so the 'Total' column should be removed.
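A sketch of that City split in DDL (names and types assumed):
create table city (
    cityid int primary key,
    name   varchar(100)
);
create table customer (
    customerid int primary key,
    firstname  varchar(50),
    lastname   varchar(50),
    cityid     int references city(cityid)  -- foreign key replacing the City column
);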
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it, like Bill Lumbergh. We then query the students and ask them what problems we will have when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough one for some, because I have what some people would consider two first names, and some people call me Richard. If you were trying to search for my last name, it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that meaningful is based upon the audience who is going to be using the database as well. We, at our job will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FEDEX might need it separated out to easily pull up the apartment or suite of where they need to go when they are on the road and running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
Avoiding Blanks:
I use an analogy to explain to them why they should avoid blanks. I tell them that Access and most databases do not store blanks the way Excel does. Excel does not care if you have nothing typed in the cell and will not increase the file size, but Access will reserve that space until the point in time that you actually use the field. So even if it is blank, it will still be using up space, and I explain to them that it also slows their searches down.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype values if I have to type the same thing over and over again. This “fat-fingering” of data will lead to my queries not finding the correct data. Instead, we will break the data out into a separate table and create a relationship using a primary and foreign key field. This way we are saving space, because we are not typing the customer's name, address, etc. multiple times, and instead are just using the customer's ID number in a field for the customer. We then will discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will set up a drop-down list that will give you a list of customers, where you can select their name and it will fill in the customer's ID for you. This will be a 1-to-many relationship, where 1 customer will have many different orders.
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid similar to this.
(Table 1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)...
Project1
Project2
Project3
Etc.
**********************************
(Table 2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
...
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects that they work on. First, if we have a new employee, they will not have any projects, so we will be wasting all of those fields; second, if an employee has been here a long time, they might have worked on 300 projects, so we would have to include 300 project fields. Those people who are new and only have 1 project will have 299 wasted project fields. This design is also flawed because I will have to search in each of the project fields to find all of the people who have worked on a certain project, because that project number could be in any of the project fields.
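The usual fix, shown here as a sketch with assumed names and types, is a third, junction table in which each employee-project pairing is one row:
create table tblEmployeeProjects (
    employeeid int references tblEmployees(employeeid),
    projectnum int references tblProjects(projectnum),
    primary key (employeeid, projectnum)  -- one row per assignment
);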
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarification or breaking it down in plain English. The wiki page does not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times, but I have found a better overview of normalization in this article. It is a simple, easy-to-understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in tables that will have records inserted but not updated, such as history logs or financial transactions.
References
When not to Normalize your SQL Database
Database Design Basics
+1 for the analogy of talking to your wife. I find that anyone without a tech mind needs to be eased into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
In a truly normalized database, I have found situations where it was easy to end up with slow queries because of bad join operations, bad indexing on the tables, and plain bad design of the tables themselves.
Bluntly, it's easier to write bad queries against highly normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out of a few tables, without having to join to a ton of tables to get one data set.