What are Database Normalization and Functional Dependencies? - sql

I am reading a Database Normalization tutorial and I am having difficulty understanding the following:
Functional dependency says that if two tuples have the same values for attributes A1, A2,..., An, then those two tuples must also have the same values for attributes B1, B2, ..., Bn.
Functional dependency is represented by an arrow sign (→) that is, X→Y, where X functionally determines Y.
What do the above two refer to? What is meant by "Functionally Determines"?
I can have a tuple where A1, A2, A3 are same but B1, B2, B3 are different.

A functional dependency occurs when one attribute in a relation uniquely determines another attribute. This can be written A -> B which would be the same as stating "B is functionally dependent upon A."
In a table listing employee characteristics including Social Security Number (SSN) and name, it can be said that name is functionally dependent upon SSN (or SSN -> name) because an employee's name can be uniquely determined from their SSN. However, the reverse statement (name -> SSN) is not true because more than one employee can have the same name but different SSNs.
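To make "functionally determines" concrete, here is a minimal Python sketch that tests whether a dependency holds in a given set of rows. The function name `fd_holds` and the sample employees are made up for illustration. Keep in mind that sample data can only refute a dependency; whether it really holds is a statement about every row the table could ever contain.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every pair of rows that agrees on `lhs` also agrees on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # Remember the first rhs value seen for this lhs value; any later
        # disagreement means lhs does not determine rhs in this data.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample data: two different employees share a name.
employees = [
    {"ssn": "111-11-1111", "name": "Ann Lee"},
    {"ssn": "222-22-2222", "name": "Ann Lee"},
    {"ssn": "333-33-3333", "name": "Bob Ray"},
]

print(fd_holds(employees, ["ssn"], ["name"]))  # True
print(fd_holds(employees, ["name"], ["ssn"]))  # False
```

So a tuple pair where A1, A2, A3 are the same but B1, B2, B3 differ is exactly what the dependency forbids: if such rows are legitimate in your data, the dependency simply does not hold.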

Related

In which normal form are these FDs?

I've been trying to figure out the difference between the 2nd and 3rd Normal Form using this example. The definitions didn't do the trick for me...
These are the functional dependencies:
A is the candidate key. (A --> A,B,C,D)
FDs:
A --> CD
AC --> D
CD --> B
D --> B
My idea: it's in 1st and 2nd, but not in 3rd Normal form because A, the candidate key, doesn't consist of two or more columns. But B is transitively dependent on D. So it's not in 3rd.
Is that correct? Especially the argument that A consists of fewer than two columns?
First, let us see what 2NF and 3NF are. From the context of the question it is clear that 1NF is understood, so I will refer to it. If it is unclear as well, let me know, I will clarify that as well.
2NF: R is in second normal form, if and only if it is in first normal form and no non-prime attribute is dependent on any proper subset of any candidate key of the relation.
Non-prime attributes are attributes which are not part of any candidate key. So, if a non-prime attribute is determined by a functional dependency whose left-hand side is a proper subset of a candidate key, then the relation is not in 2NF.
For example, let's consider an invoices(number, year, age) table where (number, year) is a candidate key. age can be determined by the year alone, so the table is not in 2NF.
In your case, since the key consists of a single attribute, it has no non-empty proper subset, so assuming the table is in 1NF we can say it is in 2NF as well. However, it is in 3NF if and only if it is in 2NF and every non-prime attribute is non-transitively dependent on every key.
In your case, A is the key, but since
A -> D -> B
B is transitively dependent on A, so your table is not in 3NF. To achieve 3NF, you will need to create another table, which will be in relation with this one via D and will hold B. Possible solution:
T1(A, C, D)
T2(D, B)
Note that AC -> D and A -> CD add nothing new, since A is the candidate key and the candidate key determines everything else. If that's not the case, you will need to take a look at 1NF as well.
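The reasoning above can be checked mechanically with an attribute closure. The following Python sketch is only an illustration: the helper names `closure` and `violates_3nf` are my own, and it uses the usual test that a non-trivial dependency breaks 3NF when its left-hand side is not a superkey and it determines a non-prime attribute.

```python
def closure(attrs, fds):
    """Attribute closure: everything functionally determined by `attrs`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


ALL = set("ABCD")
FDS = [
    (set("A"), set("CD")),
    (set("AC"), set("D")),
    (set("CD"), set("B")),
    (set("D"), set("B")),
]
PRIME = set("A")  # A is the only candidate key, so the only prime attribute


def violates_3nf(lhs, rhs):
    """Non-trivial FD whose determinant is not a superkey and which
    determines at least one non-prime attribute."""
    new_attrs = rhs - lhs
    return bool(new_attrs) and closure(lhs, FDS) != ALL and bool(new_attrs - PRIME)


print(sorted(closure({"A"}, FDS)))  # ['A', 'B', 'C', 'D']
print([("".join(sorted(l)), "".join(sorted(r)))
       for l, r in FDS if violates_3nf(l, r)])  # [('CD', 'B'), ('D', 'B')]
```

Both offending dependencies have B on the right-hand side, which is why moving B into its own table keyed by D removes the violation.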

Not sure if this constitutes a transitive dependency

I am a bit stuck designing part of a database.
I have a table called Staff. It has different attributes:
StaffID
First Name
Last Name
Job Title
Department Number
Telephone Number
StaffID is the primary key in this table.
My issue however, is that it is possible to find any information based on the telephone number (i.e. each staff member has a different, unique telephone number).
For example, this means that the First Name or Job Title can be found when we have the Phone Number. However, Phone Number is not a primary key, StaffID is.
I am not sure whether this is a transitive dependency and should be fixed through 3NF by splitting up the table and having the Staff table without the Phone Number and another table with just StaffID and Telephone Number.
A transitive dependency occurs only if you have an indirect relationship between more than 2 attributes that are not part of the key.
In your example, as you explained, the StaffID is part of your dependency, which is fine because it's the primary key.
Also you can look at this question that shows what is wrong with a transitive dependency. It could help put things into perspective.
In your table, if you delete staff member, you delete all the information (rightly so because you don't need it). If you leave phone number in a different table and, for instance, delete entry only in Staff, you're left with a wild phone number. But if your Staff table allowed multiple entries for the same person (but different departments) then the situation would be different.
Other sites that helped me in the past:
https://www.thoughtco.com/transitive-dependency-1019760
https://beginnersbook.com/2015/04/transitive-dependency-in-dbms/
Funnily they always follow the book example : )
In design-theoretical terms, keys are implied by dependencies. If PhoneNumber→StaffID and if StaffID is known to be a key then we can infer that PhoneNumber is also a key. If that is the case then there is no violation of 3NF because the determinants are all keys. Note that the choice of StaffID as primary key is irrelevant here. Normalization treats all keys as equally significant.
In practical database design however, the question arises as to whether PhoneNumber really makes sense as a key. In other words, would you actually want to enforce dependencies like PhoneNumber→StaffID? If, after consideration, you decide that dependency is not applicable then you could discard that dependency (by not making PhoneNumber a key) and the table would still satisfy 3NF with respect to the set of dependencies you have left.
Here's a reason why a dependency like PhoneNumber→StaffID might not be a realistic choice: when I joined my present company I got a staff ID on my first day; I didn't get a phone number until two days later.
It is not, because there is no dependency between the phone number and the first or last name: if you know the name, you can't know the phone number. It is not the same as, for example, Model and Manufacturer: if you know the model is a Mustang then you know the manufacturer is Ford, and, going the other way, you know that Ford makes Mustangs.
With the columns you mentioned I would have separate tables for departments and job titles, because they do not depend on the PK StaffID. Think about it as removing potential redundancies, you can have five thousand people in there and have job title as a string repeated one thousand times, that is a signal that it needs its own table (2NF).
Transitive dependency means that you have a (set of) attribute(s) that are completely determined by going from a (set of) attribute(s) A -> B and then from B -> C, while you cannot go from B -> A.
In your case, you do indeed have (StaffId) -> (PhoneNumber) and also (PhoneNumber) -> (StaffId). This means you have A -> B and B -> A and hence at this step you can already rule out the transitive dependency.
If you like, you could say that PhoneNumber would be another candidate for PK.
As a background, the problem with transitive dependencies is this: Assume you have a table consisting of "Book Title" (primary key), "Author" and "Gender of Author". Then you certainly have a transitive dependency BT -> A, A -> GoA, hence BT -> GoA.
Now assume that one of your authors is "Andy Smith", Andy being a short name for Andreas. Andreas goes and changes gender, and is now Andrea. Obviously you do not need to change the name, "Andy" works just fine for "Andrea". But you do have to change the Gender. You have to do it for many entries in your table, i.e. for all books from that author.
In this case, you would fix the problem by creating a new table "Author", obviously, and then you'd have only one row for Andy.
Hope that clears it up. It is easy to see that in your example there is no constellation where you have to change many rows due to a phone number change. It's a simple 1:1 relationship between StaffId and PhoneNumber, no problems whatsoever. Both are candidate keys.
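For what it's worth, the book/author update problem can be sketched in a few lines of Python. The rows, names and helper functions below are invented for illustration only; the point is just how many places one fact has to be changed in each design.

```python
# Unnormalized: the author's gender is repeated on every book row.
books_flat = [
    {"title": "Book One", "author": "Andy Smith", "author_gender": "M"},
    {"title": "Book Two", "author": "Andy Smith", "author_gender": "M"},
    {"title": "Book Three", "author": "Kim Park", "author_gender": "F"},
]

# Decomposed: books reference the author, and gender is stored once per author.
books = [
    {"title": "Book One", "author": "Andy Smith"},
    {"title": "Book Two", "author": "Andy Smith"},
    {"title": "Book Three", "author": "Kim Park"},
]
authors = {"Andy Smith": {"gender": "M"}, "Kim Park": {"gender": "F"}}


def update_gender_flat(rows, author, gender):
    """Update every book row of the author; return how many rows were touched."""
    touched = 0
    for row in rows:
        if row["author"] == author:
            row["author_gender"] = gender
            touched += 1
    return touched


def update_gender_normalized(author_table, author, gender):
    """Update the single author row; return how many rows were touched."""
    author_table[author]["gender"] = gender
    return 1


flat_touched = update_gender_flat(books_flat, "Andy Smith", "F")
normalized_touched = update_gender_normalized(authors, "Andy Smith", "F")
print(flat_touched, normalized_touched)  # 2 1
```

In the flat design, missing one of the rows would leave the data contradicting itself. With StaffId and PhoneNumber there is only ever one row per staff member, so that risk does not arise.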

Identifying functional dependencies (FDs)

I am working with a table that has a composite primary key composed of two attributes (with a total of 10) in 1NF form.
In my situation a fully functional dependency involves the dependent relying on both attributes in my primary key.
A partial dependency relies on either one of the attributes from the primary key.
A transitive dependency involves two or more non-key attributes in a functional dependence where one of the non-key attributes is dependent on a key attribute from my primary key.
Pulling the transitive dependencies out of the table seems to be something you do during normalization, but my assignment requires us to identify all functional dependencies before we draw the dependency diagram (after which we normalize the tables). Parentheses identify the primary key attributes:
(Student ID), Student Name, Student Address, Student Major, (Course ID), Course Title, Instructor ID, Instructor Name, Instructor Office, Student_course_grade
Only one class is taught for each course ID.
Students may take up to 4 courses.
Each course may have a maximum of 25 students.
Each course is taught by only one Instructor.
Each student may have only one major.
From your question it seems that you do not have a clear understanding of basics.
Application relationships & situations
First you have to take what you were told about your application (including business rules) and identify the application relationships (aka associations) (aka relations, in the math sense of association). Each gets a (base) table (aka relation, in the math sense of associated tuples) variable. Such an application relationship can be characterized by a row membership criterion (aka meaning) (aka predicate) that is a statement template. Eg suppose criterion student [si] takes course [ct] has table variable TAKES. The parameters of the criterion are the columns of its table. We can use a table name with columns (like an SQL declaration) as a shorthand for the criterion. Eg TAKES(si,ct). A criterion plus a row makes a statement (aka proposition) about a situation. Eg row (17,'CS101') gives student 17 takes course 'CS101' ie TAKES(17,'CS101'). Rows that give a true statement go in the table and rows that make a false one stay out.
If we can rephrase a criterion as the AND/conjunction of two others then we only need the tables with those other criteria. This is because NATURAL JOIN is defined so that the NATURAL JOIN of two tables containing the rows making their criteria true returns the rows that make the AND/conjunction of their criteria true. So we can NATURAL JOIN the two tables to get back the original. (This is what normalization is doing by decomposing tables into components.)
/* rows where
student with id [si] has name [sn] and address [sa] and major [sm]
and takes course [ci] with title [ct]
from instructor with id [ii] and name [in] and office [io]
with grade [scg]
*/
T(si,sn,sa,sm,ci,ct,ii,in,io,scg)
/* rows where
student with id [si] has name [sn] and address [sa] and major [sm]
and takes course [ci] with grade [scg]
*/
SG(si,sn,sa,sm,ci,scg)
/* rows where
course [ci] with title [ct]
is taught by instructor with id [ii] and name [in] and office [io]
*/
CI(ci,ct,ii,in,io)
Now by the definition of NATURAL JOIN,
the rows where
SG(si,sn,sa,sm,ci,scg) AND CI(ci,ct,ii,in,io)
are the rows in SG NATURAL JOIN CI.
And since
T(si,sn,sa,sm,ci,ct,ii,in,io,scg)
when/iff
SG(si,sn,sa,sm,ci,scg) AND CI(ci,ct,ii,in,io),
ie since
the rows where
T(si,sn,sa,sm,ci,ct,ii,in,io,scg)
are the rows where
SG(si,sn,sa,sm,ci,scg) AND CI(ci,ct,ii,in,io),
we have T = SG NATURAL JOIN CI.
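As a rough illustration of the join argument, here is a small Python sketch that builds T from SG and CI and then projects it back. The sample rows and the helpers `natural_join` and `project` are made up, and I am assuming CI has only the columns its criterion mentions (ci, ct, ii, in, io), i.e. no grade column.

```python
def natural_join(left, right):
    """Join two lists of dict rows on all column names they share."""
    joined = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[c] == r[c] for c in shared):
                joined.append({**l, **r})
    return joined


def project(rows, columns):
    """Relational projection: keep only `columns`, dropping duplicate rows."""
    return {tuple(row[c] for c in columns) for row in rows}


SG_COLS = ("si", "sn", "sa", "sm", "ci", "scg")
CI_COLS = ("ci", "ct", "ii", "in", "io")

# Hypothetical rows that make the SG and CI criteria true.
sg = [
    {"si": 17, "sn": "Ada", "sa": "1 Elm St", "sm": "CS", "ci": "CS101", "scg": "A"},
    {"si": 17, "sn": "Ada", "sa": "1 Elm St", "sm": "CS", "ci": "MA201", "scg": "B"},
    {"si": 23, "sn": "Bo", "sa": "9 Oak Rd", "sm": "Math", "ci": "CS101", "scg": "C"},
]
ci = [
    {"ci": "CS101", "ct": "Intro to CS", "ii": 5, "in": "Dr. Kay", "io": "R12"},
    {"ci": "MA201", "ct": "Calculus", "ii": 9, "in": "Dr. Lin", "io": "R30"},
]

t = natural_join(sg, ci)  # the rows of the wide table T

print(len(t))  # 3
print(project(t, SG_COLS) == project(sg, SG_COLS))  # True
print(project(t, CI_COLS) == project(ci, CI_COLS))  # True
```

Projecting T onto either column set gives back the component table, and joining the components gives back T, which is what a decomposition during normalization relies on.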
Together the application relationships and the situations that can arise determine both the rules and constraints! They are just things that are true of every application situation or every database state (ie values of one or more base tables), which are a function of the criteria and the possible application situations.
Then we normalize to reduce redundancy. Normalization replaces a table variable by others whose predicates AND/conjoin together to the original's when this is beneficial.
The only time a rule can tell you something that you don't already know from the (putative) criteria and (putative) situations is when you don't really understand the criteria or what situations can turn up, and the a priori rules are clarifying something about that. A person giving you rules is already using application relationships that they assume you understand and they can only have determined that a rule holds by using them and all the application situations that can arise (albeit informally)!
(Unfortunately, many presentations of information modeling don't even mention application relationships. Eg: If someone says "there is a X:Y relationship" then they must already have in mind a particular binary application relationship between entities; knowing it and what application situations can arise, they are reporting that it has a certain cardinality in a certain direction. This will correspond to some application relationship, represented by (a projection of) a table using column sets that identify entities. Plus some presentations/methods call FKs "relationships"--confusing them with those relationships.)
Check out "fact-based" information modeling methods Object-Role Modeling or (its predecessor) NIAM.
FDs & CKs
Given the criterion for putting rows into or leaving them out of a table and all possible situations that can arise, only some values (sets of rows) can ever be in a table variable.
For every subset of columns you need to decide which other columns can only have one value for a given subrow value for those columns. When it can only have one we say that the subset of columns functionally determines that column. We say that there is a FD (functional dependency) columns->column. This is when we can express the table's predicate as "... AND column=F(columns)" for some function F. (F is represented by the projection of the table on the column & columns.) But every superset of that subset will also functionally determine it, so that cuts down on cases. Conversely, if a given set does not determine a column then no subset of the set does. Applying Armstrong's axioms gives all the FDs that hold when given FDs hold. (Algorithms & software are available to apply them & determine FD closures & covers.) Also, you may think in terms of column sets being unique; then all other columns are functionally dependent on that set. Such a set is called a superkey.
Only after you have determined the FDs can you determine the CKs (candidate keys)! A CK is a superkey that contains no smaller superkey. (That a CK and/or superkey is present is also a constraint.) We can pick a CK as PK (primary key). PKs have no other role in relational theory.
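To show the "FDs first, then CKs" order in practice, here is a brute-force Python sketch. The FDs are my own reading of the business rules in the question (student id determines name, address and major; course id determines title and instructor id; instructor id determines name and office; student plus course determines the grade), so treat them as assumptions rather than given facts. The column abbreviations follow the ones used above.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure: everything functionally determined by `attrs`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Smallest-first search for superkeys that contain no smaller key."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not irreducible
            if closure(candidate, fds) == attributes:
                keys.append(candidate)
    return keys


ATTRS = {"si", "sn", "sa", "sm", "ci", "ct", "ii", "in", "io", "scg"}
# Assumed FDs, inferred from the stated business rules.
FDS = [
    ({"si"}, {"sn", "sa", "sm"}),
    ({"ci"}, {"ct", "ii"}),
    ({"ii"}, {"in", "io"}),
    ({"si", "ci"}, {"scg"}),
]

print([sorted(key) for key in candidate_keys(ATTRS, FDS)])  # [['ci', 'si']]
```

The search is exponential in the number of columns, which is fine for ten attributes but not a general-purpose algorithm.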
"A partial dependency relies on either one of the attributes from the Primary key."
Don't use "involve" or "relies on" to give a definition. Say, "when" or "iff" ("if and only if").
Read a definition. A FD that holds is partial when/iff using a proper subset of the determinant gives a FD that holds with the same determined column; otherwise it is full. Note that this does not involve CKs. A relation is in 2NF when all non-prime attributes are fully functionally dependent on every CK.
"A transitive dependency involves two or more non-key attributes in a functional dependence where one of the non-key attributes is dependent on a key attribute (from my PK)."
Read a definition. S -> T is transitive when/iff there is an X where S -> X and X -> T and not (X -> S) and not (X = T). Note that this does not involve CKs. A relation is in 3NF when all non-prime attributes are non-transitively dependent on every CK.
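The two definitions can be turned into small checks. This Python sketch again uses FDs that I am inferring from the business rules (an assumption, not something stated in the question), and the function names are hypothetical. As a simplification, `is_transitive` only tries single columns as the intermediate X, so it can miss a dependency that is transitive only through a multi-column X.

```python
def closure(attrs, fds):
    """Attribute closure: everything functionally determined by `attrs`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_partial(lhs, attr, fds):
    """lhs -> attr is partial iff a proper subset of lhs also determines attr.
    Dropping one column at a time is enough, because every proper subset
    is contained in one of those."""
    return any(attr in closure(lhs - {a}, fds) for a in lhs)


def is_transitive(s, attr, fds):
    """S -> T is transitive iff some X has S -> X, X -> T, not (X -> S), X != T.
    Simplification: only single-column X are tried."""
    for x in closure(s, fds) - {attr}:
        x_closure = closure({x}, fds)
        if attr in x_closure and not s <= x_closure:
            return True
    return False


# Assumed FDs, inferred from the stated business rules.
FDS = [
    ({"si"}, {"sn", "sa", "sm"}),
    ({"ci"}, {"ct", "ii"}),
    ({"ii"}, {"in", "io"}),
    ({"si", "ci"}, {"scg"}),
]

print(is_partial({"si", "ci"}, "sn", FDS))   # True: si alone determines sn
print(is_partial({"si", "ci"}, "scg", FDS))  # False: the grade needs both
print(is_transitive({"ci"}, "in", FDS))      # True: ci -> ii -> in
print(is_transitive({"ci"}, "ct", FDS))      # False
```

Note that neither check mentions candidate keys; those only come in when you state 2NF and 3NF.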
"1NF" has no single meaning.
I am inferring a functional dependency that was not listed in your business rules. Namely that instructor ID determines instructor name.
If this is true, and if you have both instructor ID and instructor name in the Course table, then this is not in 3NF, because there is a transitive dependency between Course ID, Instructor ID, and Instructor Name.
Why is this harmful? Because duplicating the instructor name in each course an instructor teaches makes updating an instructor name difficult, and possible to do in an inconsistent manner. Inconsistent instructor name is just another bug you have to watch out for, and 3NF obviates the problem. The same argument could be made for Instructor office.

Many to Many Relationship

I have the following scenario:
Fact table A linked to Dimensions D1, D2, D3, D4, D5
Fact table B linked to Dimensions D1, D2, D3
I want D4 to be linked to Fact B. I can use Fact A for this. Fact A will be used as a Many-to-Many Relationship.
Is such an approach of using an existing fact as a M2M relationship good practice?
Also, in SSAS you do not specify which dimensions will be linked (when using M2M). Does this mean that I would have to link both D4 and D5? and what happens to D1,D2,D3? Are they linked again?
It is totally fine to have a many-to-many relationship table containing facts. In fact, for currency conversions, this is a standard case, where the exchange rate is the fact in the table relating time and possibly transaction currency to the target currency.
And you do configure the many-to-many relationship for every dimension: on the "Dimension Usage" tab of Cube Designer, you configure, e.g., in the row for dimension D4 and the column for measure group B that this relationship is via the many-to-many table A. If you configure the cell in the same column for the D5 dimension as "No Relationship" (i.e. gray), this dimension will not be related to the measures of measure group B.

Misconception of what superkey or Boyce Codd Normal form is

At 9:34 in this video the speaker says that all 3 functional dependencies are in Boyce Codd Normal Form. I don't believe it because clearly GPA can't determine the SSN, sName, address and all other attributes in the student table. Either I'm confused about the definition of Boyce Codd Normal Form or what a super key is? Does it only have to be able to uniquely identify certain attributes, not all attributes in the schema? For example GPA does determine priority (which is on the right side of the functional dependency) but not everything else.
For example if I had the relation R(A,B,C,D) and the FD A->B, would we say A is a superkey for B? But I thought a super key is for the whole table. To add to my confusion I know for BCNF it can be a (primary) key but you can only have one primary key for the table. Ugh my brain hurts.
"... the speaker says that all 3 functional dependencies are in Boyce Codd Normal Form."
To be in BC normal form is a property that can be had by RELATIONS (relation variables, more specifically, or relation schemas, if that term suits you better), not by functional dependencies. If you find someone talking so sloppily of normalization theory, leave and move on to more accurate explanations.
Whether or not a relation variable is indeed in BC normal form depends on which functional dependencies are supposed to hold in it. That is why it is utter nonsense to say that functional dependencies are or are not in BC normal form.
"I don't believe it because clearly GPA can't determine the SSN, sName, address and all other attributes in the student table. Either I'm confused about the definition of Boyce Codd Normal Form or what a super key is? Does it only have to be able to uniquely identify certain attributes, not all attributes in the schema?"
A candidate key is an irreducible set of attributes of the relation schema (there may be more than one such set) that is guaranteed to have unique combinations of attribute values in whatever relation values could validly appear in the relation variable in the database.
In your (A,B,C,D) example, if A->B is the only FD that holds, then the only candidate key is {A,C,D}.
"For example if I had the relation R(A,B,C,D) and the FDs A->B would we say A is a superkey for B"
It is sloppy and confusing to talk of A as being the "key" for B in such a case. People who pretend to be teaching others ought to know this, and people who don't, ought not engage in any teaching until they do know this. It would be better to talk of A as the "determinant" for B in such contexts. The term "key" in the context of relational database design has a very well-defined and precise meaning, and using the same term for other meanings merely confuses people. As evidenced by your question.
"but I thought a super key is for the whole table?"
Yes you thought right.
Back to your (A,B,C,D) example. If we were to split that design into (A,B) and (A,C,D), then we would have a relation schema -the (A,B) one- of which we can say that "{A} is a key" in that schema.
That is actually precisely what the FD A->B means : if you take the projection -of the relation value that would appear in the database in the (A,B,C,D) schema- over the attributes {A,B}, then you should be getting a relation in which no A value appears twice (if it did, then that A value would correspond to >1 distinct B value, meaning that A could not possibly be a determinant for B after all).
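The projection view of an FD is easy to try out on sample data. Below is a minimal Python sketch with made-up rows for the (A,B,C,D) schema; `project` and `determines` are hypothetical helper names. As always, a sample can only show that a dependency fails, not that it is guaranteed to hold.

```python
def project(rows, columns):
    """Relational projection: keep only `columns` and drop duplicate rows."""
    return {tuple(row[c] for c in columns) for row in rows}


def determines(rows, a, b):
    """A -> B holds in `rows` iff no A value appears twice in the projection on (A, B)."""
    projected = project(rows, (a, b))
    a_values = [pair[0] for pair in projected]
    return len(a_values) == len(set(a_values))


# Hypothetical relation value for the (A,B,C,D) schema.
r = [
    {"A": 1, "B": "x", "C": 10, "D": 100},
    {"A": 1, "B": "x", "C": 20, "D": 100},
    {"A": 2, "B": "y", "C": 10, "D": 200},
]

print(sorted(project(r, ("A", "B"))))  # [(1, 'x'), (2, 'y')]
print(determines(r, "A", "B"))         # True
print(determines(r, "C", "B"))         # False: C = 10 goes with both 'x' and 'y'
```

In the projection on (A, B) each A value appears once, so {A} behaves as a key of that projected relation, even though A is repeated in the full (A,B,C,D) rows and is not a key there.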
"To add to my confusion I know for BCNF it can be a (primary) key but ..."
Now you are being sloppy yourself. What does "it" refer to ?