Misconception of what superkey or Boyce Codd Normal form is - sql

At 9:34 in this video the speaker says that all 3 functional dependencies are in Boyce Codd Normal Form. I don't believe it because clearly GPA can't determine the SSN, sName, address and all other attributes in the student table. Either I'm confused about the definition of Boyce Codd Normal Form or what a super key is? Does it only have to be able to uniquly identify certain attributes, not all attributes in the schema? For example GPA does determine priority (which is on the right side of the functional dependency) but not everything else.
For example if I had the relation R(A,B,C,D) and the FDs A->B would we say A is a superkey for B but I thought a super key is for the whole table? To add to my confusion I know for BCNF it can be a (primary) key but you can only have on primary key for the table. Ugh my brain hurts.

"... the speaker says that all 3 functional dependencies are in Boyce Codd Normal Form."
To be in BC normal form is a property that can be had by RELATIONS (relation variables, more specifically, or relation schemas, if that term suits you better), not by functional dependencies. If you find someone talking so sloppily of normalization theory, leave and move onto more accurate explanations.
Whether or not a relation variable is indeed in BC normal form, depends on which functional dependencies are supposed to hold in it. That is why it is utter nonsense to say that functional dependencies are or are not in BC normal form.
"I don't believe it because clearly GPA can't determine the SSN, sName, address and all other attributes in the student table. Either I'm confused about the definition of Boyce Codd Normal Form or what a super key is? Does it only have to be able to uniquly identify certain attributes, not all attributes in the schema?"
An irreducible candidate key is that set (not necessarily unique) of attributes of the relation schema that is guaranteed to have unique combinations of attribute values in whatever relation values could validly appear in the relation variable in the database.
In your (A,B,C,D) example, if A->B is the only FD that holds, then the only candidate key is {A,C,D}.
"For example if I had the relation R(A,B,C,D) and the FDs A->B would we say A is a superkey for B"
It is sloppy and confusing to talk of A as being the "key" for B in such a case. People who pretend to be teaching others ought to know this, and people who don't, ought not engage in any teaching until they do know this. It would be better to talk of A as the "determinant" for B in such contexts. The term "key" in the context of relational database design has a very well-defined and precise meaning, and using the same term for other meanings merely confuses people. As evidenced by your question.
"but I thought a super key is for the whole table?"
Yes you thought right.
Back to your (A,B,C,D) example. If we were to split that design into (A,B) and (A,C,D), then we would have a relation schema -the (A,B) one- of which we can say that "{A} is a key" in that schema.
That is actually precisely what the FD A->B means : if you take the projection -of the relation value that would appear in the database in the (A,B,C,D) schema- over the attributes {A,B}, then you should be getting a relation in which no A value appears twice (if it did, then that A value would correspond to >1 distinct B value, meaning that A could not possibly be a determinant for B after all).
"To add to my confusion I know for BCNF it can be a (primary) key but ..."
Now you are being sloppy yourself. What does "it" refer to ?

Related

Cross Table Dependency/Constraint in SQL Database

Take the example that I have of a table called classes that holds university classes and a table called students that holds students. A class has many students and a student can only take one class. (1 to many relationship). If I had a column in classes that stored the total number of students a class has, this feels like it should violate 3NF. But the dependency is in a separate table. What is this dependency called? And can we say this is violating 3NF? Because in some sense it has all the problems of a 3NF violation. I was wondering if this was a related case.
TL;DR
But the dependency is in a separate table.
You mean there is a dependency (in the everyday sense) on another table. We say there is a constraint on the two tables. (They depend on each other.) In addition to the FK (foreign key) constraint that every students classes value is a classes class value.
What is this dependency called?
We can reasonably categorize the constraint as "inter-table". It is that classes equals SELECT class, SUM(student) AS total FROM classes LEFT JOIN students USING (class) GROUP BY class.
And can we say this is violating 3NF?
The constraint doesn't involve violating a NF. Moreover normalization applies only to a single table and its FDs (functional dependencies).
(A straightforward design is to have base students, base classes1 that is the original classes without total, and VIEW classes AS SELECT class, SUM(student) AS total FROM classes1 LEFT JOIN students USING (class) GROUP BY class.)
If I had a column in classes that stored the total number of students a class has, this feels like it should violate 3NF.
Whether a table is in a given NF (normal form) has nothing to do with any other tables. (We say a database is in a given NF when all its tables are.) Whether your design is nevertheless bad is another matter.
Since a class has just one total number of students, there is a FD (functional dependency) of total on class in classes, ie class functionally determines total.
We say that a set of columns functionally determines another set in a table when each subrow for the first always appears with the same subrow for the second. Normalization to higher NFs replaces a table by projections of it that join back ot it, per the FDs & JDs (join dependencies) that hold in it. There is redundancy in a database when two tables say the same thing about the business/application situation; but not all redundancy is bad. Learn proper information modeling & database design.
It may or may not violate a NF to have your class student count as a column in classes. What FDs violate a NF depends on all the FDs present and the NF. (And it only make sense to talk about a particular FD in a particular table violating a particular NF if you are talking about a particular part of a particular definition of that NF.)
(If a DBMS-calculated/computed/generated column violates a NF that would hold without it then that is not a problem, because it is controlled by the DBMS. You can think of the table as view of the table without the column.)
But the dependency is in a separate table.
When a sequence of database states cannot hold all the values possible per the the columns of tables we say constraints hold or the database is constrained. FDs (functional dependencies), MVDs (multi-valued dependencies), JDs (join dependencies), INDs (inclusion dependencies), EQDs (equality dependencies) and other "dependencies" (which technically are expressions given a context) are each associated with certain constraints. CKs (candidate keys), PKs (primary keys), superkeys (SQL PK & UNIQUE NOT NULL), FKs (foreign keys) (which technically are all column sets) & other notions are also each associated with certain constraints. But arbitrary conditions can hold on a sequence of database states.
SQL has a distinct but related notion of a constraint characterized by a name and an expression/condition (constraint in the above sense), declared by appropriate syntax. A state is constrained by column typing, PK, UNIQUE, NOT NULL & CHECK constraints. ASSERTION gives an arbitrary condition on a state but it is not supported by most DBMSs. CASCADES supports some inter-state inter-table constraints. SQL TRIGGERs enforce arbitrary constraints. Indexes also enforce constraints in a DBMS-specific way.
Because in some sense it has all the problems of a 3NF violation.
Your edits improved your question. Using the wrong words or using words in the wrong way at best states something that is not what we mean. But when what we write doesn't make sense it suggests that our problem, whatever else it involves, involves not knowing what the words mean. Forcing ourselves to use words correctly allows others to know what we really mean. Eg here maybe "... in the join of tables ... there would be a 3NF-violating FD ...". Even by explicitly saying that we are unsure we can communicate some of our vague groping without saying something that we don't mean. Eg your "this feels like ...". But it also leads us to clearly organize what we are faced with. This helps not only the problem we are working on but improves our problem solving.
It does not violate normalization, but it would be painful to maintain rather than doing a count in your query.
Note: Junction tables are for many-to-many.

When does repeating data get removed: before/after normalization?

Given this logical design:
R(a,b, c, d)
a is the only key. I can't underline it using this editor.
a->b
a->c
a->d
It's in BCNF because there are no composite keys, no transitive dependencies, and no partial key dependencies.
However, we still have repeating data across rows in the attributes b, c, and d.
Do we introduce surrogate keys and rewrite it this way:
R(a, bID, cID, dID)
R1(bID, b)
R2(cID, c)
R3(dID, d)
if so, does that happen before or after normalization?
The point of normalization is not to remove repetition. It is to remove inappropriate dependencies. If every non-key attribute is fully functionally dependent on the primary key (and nothing else) then it doesn't matter for purposes of normalization that from one row to another in a table that some column data may be the same. That sameness is incidental.
Here is the thing you have to think about when looking at repetition and deciding whether it is incidental or meaningful. Consider the case of an update to a non-key column.
In one scenario, let's say that the non-key column is a person's name. What happens in your system when someone changes their name? If the old value is "Doug" and the new value is "Bob" do you want every instance of "Doug" to be replaced by "Bob"? Maybe you do, but I'm guessing you probably don't. If you were to create surrogate keys and normalize out the non-key value to another table then you would be incorrectly changing values that you don't mean to change.
In another scenario, let's say the non-key column is a municipality name. What happens in your system when you change a municipality name? Let's say the old value is "New Berlin" and the new value is "Kitchener". Would you want every instance of "New Berlin" to become "Kitchener"? Maybe so. (perhaps not, it depends on your business rules) = If you do want to change every instance then what you've discovered is that the municipality name may not be fully functionally dependent on your primary key. In that case you should normalize it out to a new table.
You have asked when this should happen (before or after normalization). The answer is that it happens as part of the normalization. The act of moving data off into a separate relation in order to avoid a partial or transitive functional dependency is itself the act of normalizing your database schema. Is this part of 2NF or 3NF? It depends. If your non-key attribute is partially functionally dependent on the key then it's during 2NF. If it's transitively dependent (i.e. dependent on another non-key attribute or attributes) then it's during 3NF.
You should perform normalization as part of the logical modeling process as much as possible. When you get to the physical model you are more likely to introduce denormalization for one or another of some practical reasons. Denormalization (in transaction processing systems) is something you should generally do only when you find that you have to. 3NF or higher is a good stake in the ground for OLTP systems. Therefore, you will have built your logical and your physical schemas before you start denormalizing in most cases.

Identifying functional dependencies (FDs)

I am working with a table that has a composite primary key composed of two attributes (with a total of 10) in 1NF form.
In my situation a fully functional dependency involves the dependent relying on both attributes in my primary key.
A partial dependency relies on either one of the attributes from the primary key.
A transitive dependency involves two or more non-key attributes in a functional dependence where one of the non-key attributes is dependent on a key attribute from my primary key.
Pulling the transitive dependencies out of the table, seems do this after normalization, but my assignment requires us to identify all functional dependencies before we draw the dependency diagram (after which we normalize the tables). Parenthesis identify the primary key attributes:
(Student ID), Student Name, Student Address, Student Major, (Course ID), Course Title, Instructor ID, Instructor Name, Instructor Office, Student_course_grade
Only one class is taught for each course ID.
Students may take up to 4 courses.
Each course may have a maximum of 25 students.
Each course is taught by only one Instructor.
Each student may have only one major.
From your question it seems that you do not have a clear understanding of basics.
Application relationships & situations
First you have to take what you were told about your application (including business rules) and identify the application relationships (aka associations) (aka relations, in the math sense of association). Each gets a (base) table (aka relation, in the math sense of associated tuples) variable. Such an application relationship can be characterized by a row membership criterion (aka meaning) (aka predicate) that is a statement template. Eg suppose criterion student [si] takes course [ct] has table variable TAKES. The parameters of the criterion are the columns of its table. We can use a table name with columns (like an SQL declaration) as a shorthand for the criterion. Eg TAKES(si,ct). A criterion plus a row makes a statement (aka proposition) about a situation. Eg row (17,'CS101') gives student 17 takes course 'CS101' ie TAKES(17,'CS101'). Rows that give a true statement go in the table and rows that make a false one stay out.
If we can rephrase a criterion as the AND/conjunction of two others then we only need the tables with those other criteria. This is because NATURAL JOIN is defined so that the NATURAL JOIN of two tables containing the rows making their criteria true returns the rows that make the AND/conjunction of their criteria true. So we can NATURAL JOIN the two tables to get back the original. (This is what normalization is doing by decomposing tables into components.)
/* rows where
student with id [si] has name [sn] and address [sa] and major [sm]
and takes course [ci] with title [ct]
from instructor with id [ii] and name [in] and office [io]
with grade [scg]
*/
T(si,sn,sa,sm,ci,ct,ii,in,io,scg)
/* rows where
student with id [si] has name [sn] and address [sa] and major [sm]
and takes course [ci] with grade [scg]
*/
SG(si,sn,sa,sm,ci,scg)
/* rows where
course [ci] with title [ct]
is taught by instructor with id [ii] and name [in] and office [io]
*/
CI(ci,ct,ii,in,io,scg)
Now by the definition of NATURAL JOIN,
the rows where
SG(si,sn,sa,sm,ci,scg) AND CI(ci,ct,ii,in,io,scg)
are the rows in SG NATURAL JOIN CI.
And since
T(si,sn,sa,sm,ci,ct,ii,in,io,scg)
when/iff
SG(si,sn,sa,sm,ci,scg) AND CI(ci,ct,ii,in,io,scg),
ie since
the rows where
T(si,sn,sa,sm,ci,ct,ii,in,io,scg)
are the rows where
SG(si,sn,sa,sm,ci,scg) AND CI(ci,ct,ii,in,io,scg),
we have T = SG NATURAL JOIN CI.
Together the application relationships and situations that can arise determine both the rules and constraints! They are just things that are true of every application situation or every database state (ie values of one or more base tables) (which are are a function of the criteria and the possible application situations.)
Then we normalize to reduce redundancy. Normalization replaces a table variable by others whose predicates AND/conjoin together to the original's when this is beneficial.
The only time a rule can tell you something that you don't know already know from the (putative) criteria and (putative) situations is when you don't really understand the criteria or what situations can turn up, and the a priori rules are clarifying something about that. A person giving you rules is already using application relationships that they assume you understand and they can only have determined that a rule holds by using them and all the application situations that can arise (albeit informally)!
(Unfortunately, many presentations of information modeling don't even mention application relationships. Eg: If someone says "there is a X:Y relationship" then they must already have in mind a particular binary application relationship between entities; knowing it and what application situations can arise, they are reporting that it has a certain cardinality in a certain direction. This will correspond to some application relationship, represented by (a projection of) a table using column sets that identify entities. Plus some presentations/methods call FKs "relationships"--confusing them with those relationships.)
Check out "fact-based" information modeling methods Object-Role Modeling or (its predecessor) NIAM.
FDs & CKs
Given the criterion for putting rows into or leaving them out of a table and all possible situations that can arise, only some values (sets of rows) can ever be in a table variable.
For every subset of columns you need to decide which other columns can only have one value for a given subrow value for those columns. When it can only have one we say that the subset of columns functionally determines that column. We say that there is a FD (functional dependency) columns->column. This is when we can express the table's predicate as "... AND column=F(columns)" for some function F. (F is represented by the projection of the table on the column & columns.) But every superset of that subset will also functionally determine it, so that cuts down on cases. Conversely, if a given set does not determine a column then no subset of the set does. Applying Armstrong's axioms gives all the FDs that hold when given FDs hold. (Algorithms & software are available to apply them & determine FD closures & covers.) Also, you may think in terms of column sets being unique; then all other columns are functionally dependent on that set. Such a set is called a superkey.
Only after you have determined the FDs can you determine the CKs (candidate keys)! A CK is a superkey that contains no smaller superkey. (That a CK and/or superkey is present is also a constraint.) We can pick a CK as PK (primary key). PKs have no other role in relational theory.
A partial dependency relies on either one of the attributes from the
Primary key.
Don't use "involve" or "relies on" to give a definition. Say, "when" or "iff" ("if and only if").
Read a definition. A FD that holds is partial when/iff using a proper subset of the determinant gives a FD that holds with the same determined column; otherwise it is full. Note that this does not involve CKs. A relation is in 2NF when all non-prime attributes are fully functionally dependent on every CK.
A transitive dependency involves two or more non-key attributes in a
functional dependence where one of the non-key attributes is dependent
on a key attribute (from my PK).
Read a definition. S -> T is transitive when/iff there is an X where S -> X and X -> T and not (X -> S) and not (X = T). Note that this does not involve CKs. A relation is in 3NF when all non-prime attributes are non-transitively dependent on every CK.
"1NF" has no single meaning.
I am inferring a functional dependency that was not listed in your business rules. Namely that instructor ID determines instructor name.
If this is true, and if you have both instructor ID and instructor name in the Course table, then this is not in 3NF, because there is a transitive dependency between Course ID, Instructor ID, and Instructor Name.
Why is this harmful? Because duplicating the instructor name in each course an instructor teaches makes updating an instructor name difficult, and possible to do in an inconsistent manner. Inconsistent instructor name is just another bug you have to watch out for, and 3NF obviates the problem. The same argument could be made for Instructor office.

Changing foreign key meaning. How to handle?

Lets say I have 2 tables:
Species
SpeciesId
SpeciesName
Animal
AnimalId
SpeciesId - foreign key
If you give the end-user ability to change SpeciesName, that means they can affect the species of all animals that reference the changed record (at least from user standpoint). This may be a bit of an extreme example, but how are situations like this usually handled? Put the responsibility on the end-user to know what they are doing? Disallow name change if it has been used before?
We are discussing this situation at work and I want to get input from some others. One of the solutions that was brought up was to remove the foreign key (e.g. put a text field for species in the Animal table). This doesn't seem right to me, because at what point do you draw the line of using foreign keys? To me it seems like more of a training issue to make sure admins understand the impact of the changes they make. I know it's an open-ended question and it may vary per scenario, but I'm just trying to get some general opinions.
This is a design decision that you have to make. You need to determine what is more important from a business perspective. Do you value historical accuracy or efficiently updating the information?
In your example, I would put less emphasis on history for the following reasons.
Only the most recent convention is significant. Assume an animal moves from one genus to another, it really doesn't provide any value to know what the old and now invalid genus was.
All animals of the same species should have the same species ID. You get this for free with Foreign Keys. Assume a tiger was added prior to a species name change. Then a different tiger was added after the species name change. Both tigers still belong to the same species.
Querying the database by ID will be easier and more reliable than using a string let alone delving into the dirty business of string parsing. You don't need to worry about character encoding, capitalization, white space, punctuation, etc. Assume that you would like to retrieve all animals of one or more species.
Put the responsibility on the end-user to know what they are doing?
You need to decide what the end user is able to update. If your end user is a biologist that is well informed about the scientific names of the species, he should be able to update this information. Otherwise, maybe it's a good idea to prevent the user from modifying this column at all, or only if this particular species has any animal associated with it.
One of the solutions that was brought up was to remove the foreign key
Don't do that. You will lose the ability to join the information from these tables. Imagine your table Species has a column "Continent", indicating if the species is found on America, Africa, Europe, Asia, etc. If you use the foreign key, you can ask questions like "What are all the animals that belong to an american species?" This will be impossible if you remove the foreign key.

Compound primary key table with subtypes

Me and a database architect were having argument over if a table with a compound primary key with subtypes made sense relationally and if it was a good practice.
Say we have two tables Employee and Project. We create a composite table Employee_Project with a composite primary key back to Employee and Project.
Is there a valid way for Employee_Project to have subtypes? Or can you think of any scenario where a composite key table can have subtypes?
To me a composite key relationship is a 'Is A' relationship (Employee_Project is a Employee and a Project). Subtypes are also a 'Is A' relationship. So if you have a composite key with a subtype its two 'Is A' relationships in one sentence which makes me believe this is a bad practice.
Employee-project is a bit hard, but one can imagine something like this -- although I'm not much of a chemist.
Or something like this, which would require different legal forms (fields) for single person ownership vs joint (time-share).
Or like this, providing that different forms are needed for full time and temp.
Employee projects have subtypes if the candidate subtypes are
not utterly different, but
not exactly alike
That means that
Every employee project has some
attributes (columns) in common. So they're not utterly different.
Some employee projects have different
attributes than others. So they're not exactly alike.
The determination has to do with common and distinct attributes. It doesn't have anything to do with the number of columns in a candidate key. Do you have employee projects that are not utterly different, but not exactly alike?
The most common business supertype/subtype example concerns organizations and individuals. They're not utterly different.
Both have addresses.
Both have phone numbers.
Both can be plaintiffs and defendants
in court.
But they're not exactly alike.
Individuals can go to college.
Organizations can have a CEO.
Individuals can get married.
Individuals can have children.
Organizations (in the USA) can be liquidated.
So you can express individuals and organizations as subtypes of a supertype called, say, "Parties". The attributes all the subtypes have in common relate to the supertype.
Parties have addresses.
Parties have phone numbers.
Parties can be plaintiffs and defendants
in court.
Again, this has to do with attributes that are held in common, and attributes that are distinct. It has nothing to do with the number of columns in a candidate key.
To me a composite key relationship is
a 'Is A' relationship
(Employee_Project is a Employee and a
Project).
Database designers don't think that way. We think in terms of a table's predicate.
If an employee can have many projects and a project can have many employees it is a many-to-many join that RDBM's can only represent easily in one way (the way you have outlined above.) You can see in the ER diagram below (employee / departments is one of the classic many-to-many examples) that it does not have a separate ER component. The separate table is a leaky abstraction of RDBMS's (which is probably why you are having a hard time modeling it).
http://www.library.cornell.edu/elicensestudy/dlfdeliverables/fallforum2003/ERD_final.doc
Bridge Entities
When an instance of an entity may be related to multiple instances of another entity and vice versa, that is called a “many-to-many relationship.” In the example below, a supplier may provide many different products, and each type of product may be offered by many suppliers:
While this relationship model is perfectly valid, it cannot be translated directly into a relational database design. In a relational database, relationships are expressed by keys in a table column that point to the correct instance in the related table. A many-to-many relationship does not allow this relationship expression, because each record in each table might have to point to multiple records in the other table.
http://users.csc.calpoly.edu/~jdalbey/205/Lectures/ERD_image004.gif
Here they do not event bother with a separate box although they add in later (at this step it is a 'pure' ER diagram). It can also be explicitly represented with a box and a diamond superimposed on each other.