Maintaining subclass integrity in a relational database - sql

Let's say I have a table that represents a super class, students. And then I have N tables that represent subclasses of that object (athletes, musicians, etc). How can I express a constraint such that a student must be modeled in one (not more, not less) subclass?
Clarifications regarding comments:
This is being maintained manually, not through an ORM package.
The project this relates to sits atop SQL Server (but it would be nice to see a generic solution)
This may not have been the best example. There are a couple scenarios we can consider regarding subclassing, and I just happened to invent this student/athlete example.
A) In true object-oriented fashion, it's possible that the superclass can exist by itself and need not be modeled in any subclasses.
B) In real life, any object or student can have multiple roles.
C) The particular scenario I was trying to illustrate was requiring that every object be implemented in exactly one subclass. Think of the superclass as an abstract implementation, or just commonalities factored out of otherwise disparate object classes/instances.
Thanks to all for your input, especially Bill.

Each Student record will have a SubClass column (assume for the sake of argument it's a CHAR(1)). {A = Athlete, M=musician...}
Now create your Athlete and Musician tables. They should also have a SubClass column, but there should be a check constraint hard-coding the value for the type of table they represent. For example, you should put a default of 'A' and a CHECK constraint of 'A' for the SubClass column on the Athlete table.
Link your Musician and Athlete tables to the Student table using a COMPOSITE foreign key of StudentID AND Subclass. And you're done! Go enjoy a nice cup of coffee.
CREATE TABLE Student (
    StudentID INT NOT NULL IDENTITY PRIMARY KEY,
    SubClass CHAR(1) NOT NULL,
    Name VARCHAR(200) NOT NULL,
    CONSTRAINT UQ_Student UNIQUE (StudentID, SubClass)
);
CREATE TABLE Athlete (
    StudentID INT NOT NULL PRIMARY KEY,
    SubClass CHAR(1) NOT NULL DEFAULT 'A',
    Sport VARCHAR(200) NOT NULL,
    CONSTRAINT CHK_Jock CHECK (SubClass = 'A'),
    CONSTRAINT FK_Student_Athlete FOREIGN KEY (StudentID, SubClass) REFERENCES Student(StudentID, SubClass)
);
CREATE TABLE Musician (
    StudentID INT NOT NULL PRIMARY KEY,
    SubClass CHAR(1) NOT NULL DEFAULT 'M',
    Instrument VARCHAR(200) NOT NULL,
    CONSTRAINT CHK_Band_Nerd CHECK (SubClass = 'M'),
    CONSTRAINT FK_Student_Musician FOREIGN KEY (StudentID, SubClass) REFERENCES Student(StudentID, SubClass)
);
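To see the composite-key design turn away a student who shows up in the wrong (or a second) subclass table, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for SQL Server only so the example is self-contained; the column types are SQLite's, and the `attempt` helper and the sample rows are illustrative assumptions rather than part of the answer above.

```python
import sqlite3

# In-memory database; isolation_level=None means each statement autocommits.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.executescript("""
CREATE TABLE Student (
    StudentID INTEGER PRIMARY KEY,
    SubClass  TEXT NOT NULL,
    Name      TEXT NOT NULL,
    UNIQUE (StudentID, SubClass)          -- target of the composite FKs
);
CREATE TABLE Athlete (
    StudentID INTEGER PRIMARY KEY,
    SubClass  TEXT NOT NULL DEFAULT 'A' CHECK (SubClass = 'A'),
    Sport     TEXT NOT NULL,
    FOREIGN KEY (StudentID, SubClass) REFERENCES Student (StudentID, SubClass)
);
CREATE TABLE Musician (
    StudentID  INTEGER PRIMARY KEY,
    SubClass   TEXT NOT NULL DEFAULT 'M' CHECK (SubClass = 'M'),
    Instrument TEXT NOT NULL,
    FOREIGN KEY (StudentID, SubClass) REFERENCES Student (StudentID, SubClass)
);
INSERT INTO Student (StudentID, SubClass, Name) VALUES (1, 'A', 'Ann');
""")


def attempt(sql):
    """Run one statement and report whether a constraint rejected it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


results = [
    # Student 1 is typed 'A', so the athlete row matches (1, 'A').
    attempt("INSERT INTO Athlete (StudentID, Sport) VALUES (1, 'Rowing')"),
    # (1, 'M') does not exist in Student, so the composite FK fails.
    attempt("INSERT INTO Musician (StudentID, Instrument) VALUES (1, 'Cello')"),
    # Forcing SubClass = 'A' in Musician trips the CHECK constraint instead.
    attempt("INSERT INTO Musician (StudentID, SubClass, Instrument) "
            "VALUES (1, 'A', 'Cello')"),
]
print(results)
```

Note that this guarantees "at most one subclass", not "at least one": a Student row with no matching subclass row is still accepted.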

Here are a couple of possibilities. One is a CHECK in each table that the student_id does not appear in any of the other sister subtype tables. This is probably expensive and every time you need a new subtype, you need to modify the constraint in all the existing tables.
CREATE TABLE athletes (
student_id INT NOT NULL PRIMARY KEY,
FOREIGN KEY (student_id) REFERENCES students(student_id),
CHECK (student_id NOT IN (SELECT student_id FROM musicians
UNION SELECT student_id FROM slackers
UNION ...))
);
edit: @JackPDouglas correctly points out that the above form of CHECK constraint is not supported by Microsoft SQL Server. Nor, in fact, is it valid per the SQL-99 standard to reference another table (see http://kb.askmonty.org/v/constraint_type-check-constraint).
SQL-99 defines a metadata object for multi-table constraints. This is called an ASSERTION; however, I don't know of any RDBMS that implements assertions.
Probably a better way is to make the primary key in the students table a compound primary key whose second column denotes the subtype. Then restrict that column in each child table to a single value corresponding to the subtype represented by the table. edit: no need to make the PK a compound key in child tables.
CREATE TABLE athletes (
student_id INT NOT NULL PRIMARY KEY,
student_type CHAR(4) NOT NULL CHECK (student_type = 'ATHL'),
FOREIGN KEY (student_id, student_type) REFERENCES students(student_id, student_type)
);
Of course student_type could just as easily be an integer, I'm just showing it as a char for illustration purposes.
If you don't have support for CHECK constraints (e.g. MySQL before 8.0.16, which parses CHECK clauses but does not enforce them), then you can do something similar in a trigger.
I read your followup about making sure a row exists in some subclass table for every row in the superclass table. I don't think there's a practical way to do this with SQL metadata and constraints. The only option I can suggest to meet this requirement is to use Single-Table Inheritance. Otherwise you need to rely on application code to enforce it.
edit: JackPDouglas also suggests using a design based on Class Table Inheritance. See his example or my examples of the similar technique here or here or here.
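For comparison, here is a rough sketch of the Single-Table Inheritance option mentioned above, again run through Python's sqlite3 module so it is self-contained. Everything specific in it is an assumption made for illustration: the `sport` and `instrument` columns, the 'MUSC' type code, and the `attempt` helper. Because every student is a single row, a table-level CHECK can require exactly one subtype's attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE students (
    student_id   INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    student_type TEXT NOT NULL CHECK (student_type IN ('ATHL', 'MUSC')),
    sport        TEXT,   -- athlete-only attribute (hypothetical)
    instrument   TEXT,   -- musician-only attribute (hypothetical)
    -- Exactly one subtype: its columns are filled, the other's are empty.
    CHECK (
        (student_type = 'ATHL' AND sport IS NOT NULL AND instrument IS NULL)
     OR (student_type = 'MUSC' AND instrument IS NOT NULL AND sport IS NULL)
    )
);
""")


def attempt(row):
    """Insert one (name, type, sport, instrument) row; report the outcome."""
    try:
        conn.execute(
            "INSERT INTO students (name, student_type, sport, instrument) "
            "VALUES (?, ?, ?, ?)",
            row,
        )
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


results = [
    attempt(("Ann", "ATHL", "Rowing", None)),     # a complete athlete
    attempt(("Bob", "MUSC", None, "Cello")),      # a complete musician
    attempt(("Cy", "ATHL", None, None)),          # typed, but no subtype data
    attempt(("Dee", "ATHL", "Rowing", "Cello")),  # data for two subtypes
]
print(results)
```

The trade-off is that the CHECK has to be rewritten whenever a subtype is added, and subtype columns can no longer be declared NOT NULL individually.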

If you are interested in data modeling, in addition to object modeling, I suggest you look up "relational modeling generalization specialization" on the web.
There used to be some good resources out there that explain this kind of pattern quite well.
I hope those resources are still there.
Here's a simplified view of what I hope you'll find.
Before you begin designing a database, it's useful to come up with a conceptual data model that connects the values stored in the database back to the subject matter. Making a conceptual data model is really data analysis, not database design. Sometimes it's difficult to keep analysis and design separate.
One way of modeling data at the conceptual level is the Entity-Relationship (ER) model. There are well known patterns for modeling the specialization-generalization situation. Converting those ER patterns to SQL tables (called logical design) is pretty straightforward, although you do have to make some design choices.
The case you gave of a student having possibly several roles like musician probably doesn't illustrate the case you are interested in, if I read you right. You seem to be interested in the case where the subclasses are mutually exclusive. Perhaps the case where a vehicle might be an auto, a truck, or a motorcycle might be easier to discuss.
One difference you are likely to encounter is that the general table for the superclass doesn't really need the type code column. The type of a single superclass instance can be derived by the presence or absence of foreign keys in the various subclass tables. Whether it's smarter to include or omit the type code depends on how you intend to use the data.

Interesting problem. Of course the FK constraints are there for the subtables, so there has to be a student for those.
The main problem is trying to check as it is inserted. The student has to be inserted first so that you don't violate a FK constraint in a subtable so a trigger that does a check wouldn't work.
You could write an app that checks now and then if you are really concerned about this. I think the biggest fear though would be deletions. Someone could delete a subtable entry but not the student. You could have triggers to check when items are deleted from the subtables since that is probably the biggest problem.
I have a db with a table-per-subclass hierarchy like this as well. I use Hibernate and it's mapped properly, so it deletes everything automatically. If doing this by 'hand', then I would make sure to always delete the parent with proper cascades hehe :)
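The "check now and then" idea could look something like the following audit query, sketched here with Python's sqlite3 module. The table layout and sample data are assumptions made up for the example; the query lists every student whose number of subclass rows is not exactly one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students  (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE athletes  (student_id INTEGER PRIMARY KEY
                            REFERENCES students (student_id),
                        sport TEXT NOT NULL);
CREATE TABLE musicians (student_id INTEGER PRIMARY KEY
                            REFERENCES students (student_id),
                        instrument TEXT NOT NULL);
-- Sample data: student 1 is in both subclasses, student 3 is in none.
INSERT INTO students  VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO athletes  VALUES (1, 'Rowing');
INSERT INTO musicians VALUES (1, 'Cello'), (2, 'Oboe');
""")

# Each "IS NOT NULL" test yields 1 or 0 in SQLite, so the sum counts how many
# subclass tables hold a row for the student.
AUDIT_SQL = """
SELECT s.student_id,
       (a.student_id IS NOT NULL) + (m.student_id IS NOT NULL) AS subclass_rows
FROM students s
LEFT JOIN athletes  a ON a.student_id = s.student_id
LEFT JOIN musicians m ON m.student_id = s.student_id
WHERE (a.student_id IS NOT NULL) + (m.student_id IS NOT NULL) <> 1
ORDER BY s.student_id
"""

violations = conn.execute(AUDIT_SQL).fetchall()
print(violations)  # [(1, 2), (3, 0)]
```

A query like this only reports problems after the fact; it does not prevent them, which is the limitation the answer above is pointing at.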

Thanks, Bill. You got me thinking...
The superclass table has a subclass code column. Each of the subclass tables has a foreign key constraint, as well as one that dictates that the id exist with a subset of the superclass table (where code = athlete).
The only missing part here is that it's possible to model a superclass without a subclass. Even if you make the code column mandatory, it could just be an empty join. That can be fixed by adding a constraint that the superclass's ids exist in a union of the ids in the subclass tables. Insertion gets a little hairy with these two constraints if constraints are enforced in the middle of transactions. That or just don't worry about unsubclassed objects.
Edit: Bleh, such a good sounding idea... But impeded by the fact that subqueries that refer to other tables aren't supported. At least not in SQL Server.

"That can be fixed by adding a constraint that the superclass's ids exist in a union of the ids in the subclass tables."
Depending on how much intelligence you want to put into your schema (and how much MS SQL Server lets you put there), you wouldn't actually need to do a union of the subclass tables, since you know that, if the id exists in any subclass table, it must exist in the same subclass as the one identified by the subclass code column.

I would add a Check Constraint possibly.
Create the ForeignKeys as Nullable.
Add a Check to make sure they aren't both null and to make sure they aren't both set.
CONSTRAINT [CK_HasOneForiegnKey] CHECK ((FK_First IS NOT NULL OR FK_Second IS NOT NULL) AND NOT (FK_First IS NOT NULL AND FK_Second IS NOT NULL))
I am not sure but I believe this would allow you to set only one key at a time.
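One detail matters for a constraint like this: comparing a column to NULL with `!=` yields NULL rather than true or false, and a CHECK only rejects rows for which the expression is false. The sketch below, using Python's sqlite3 module, contrasts a `!= NULL` version with an `IS NOT NULL` version. The table names `broken` and `fixed` and the `attempt` helper are made up for the demonstration, and the referenced parent tables are omitted to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE broken (
    FK_First  INTEGER,
    FK_Second INTEGER,
    -- Every comparison with NULL is NULL, so this CHECK never evaluates to FALSE.
    CHECK ((FK_First != NULL OR FK_Second != NULL)
           AND NOT (FK_First != NULL AND FK_Second != NULL))
);
CREATE TABLE fixed (
    FK_First  INTEGER,
    FK_Second INTEGER,
    -- IS NOT NULL always yields TRUE or FALSE, so the rule is enforced.
    CHECK ((FK_First IS NOT NULL OR FK_Second IS NOT NULL)
           AND NOT (FK_First IS NOT NULL AND FK_Second IS NOT NULL))
);
""")


def attempt(table, first, second):
    """Return True if the row was accepted, False if the CHECK rejected it."""
    try:
        conn.execute(
            f"INSERT INTO {table} (FK_First, FK_Second) VALUES (?, ?)",
            (first, second),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Neither key set, only the first, only the second, both set.
cases = [(None, None), (1, None), (None, 2), (1, 2)]
broken_results = [attempt("broken", a, b) for a, b in cases]
fixed_results = [attempt("fixed", a, b) for a, b in cases]
print(broken_results)  # [True, True, True, True]
print(fixed_results)   # [False, True, True, False]
```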

Questionable SQL Relationship

I am going through a Pluralsight course that is currently building an MVC application using an Entity Framework code-first approach. I was confused about the database schema used for the project.
As you can see, the relationship between Securities and its related tables seems to be one-to-one, but the confusion comes when I realize there is no foreign key to relate the two sub-tables and that they appear to share the same primary key column.
The video before made the Securities model class abstract in order for the "Stock" and "MutualFund" model classes to inherit from it and contain all relating data. To me however, it seems that same thing could be done using a couple of foreign keys.
I guess my question is does this method of linking tables serve any useful purpose in SQL or EF? It seems to me in order to create a new record for one table, all tables would need a new record which is where I really get confused.
In ORM and EF terminology, this setup is referred to as the "Table per Type" inheritance paradigm, where there is a table per subclass, a base class table, and the primary key is shared between the subclasses and the base class.
e.g. In this case, Securities_Stock and Securities_MutualFund are two subclasses of the Securities base class / table (possibly abstract).
The relationship will be 0..1 (subclass) to 1 (base class) - i.e. only one of the records in Securities_MutualFund or Securities_Stock will exist for each base table Securities row.
There's also often a discriminator column on the base table to indicate which subclass table to join to, but that doesn't seem to be the case here.
It is also common to enforce referential integrity between the subclasses to the base table with a foreign key.
To answer your question, the reason why there's no FK between the two subclass instance tables is because each instance (with a unique Id) will only ever be in ONE of the sub class tables - it is NOT possible for the same Security to be both a mutual fund and a share.
You are right: in order for a new concrete Security record to be added, a row is needed in the base Securities table (it must be inserted first, as there are FKs from the subclass tables to the base table), and then a row is inserted into one of the subclass tables with the rest of the 'specific' data.
If a Foreign Key was added between Stock and Mutual Fund, it would be impossible to insert new rows into the tables.
The full pattern often looks like this:
CREATE TABLE BaseTable
(
    Id INT PRIMARY KEY, -- Can also be an identity column
    -- ... common columns here
    Discriminator CHAR(1) NOT NULL -- Type usually has a small range, so INT or CHAR are common; CHAR(1) is only an example
);
CREATE TABLE SubClassTable
(
    Id INT PRIMARY KEY, -- Not identity, must be manually inserted
    -- ... specialized subclass columns here
    FOREIGN KEY (Id) REFERENCES BaseTable(Id)
);

Is it possible to implement a TRUE one-to-one relation?

Consider the following model where a Customer should have one and only one Address and an Address should belong to one and only one Customer:
To implement it, as almost everybody in DB field says, Shared PK is the solution:
But I think it is a fake one-to-one relationship, because nothing in terms of database relationships actually prevents deleting any row in table Address. So truly, it is 1..[0..1], not 1..1.
Am I right? Is there any other way to implement a true 1..1 relation?
Update:
Why cascade delete is not a solution:
If we consider cascade delete as a solution we should put this on either of the tables. Let's say if a row is deleted from table Address, it causes corresponding row in table Customer to be deleted. it's okay but half of the solution. If a row in Customer is deleted, the corresponding row in Address should be deleted as well. This is the second half of the solution, and it obviously makes a cycle.
Besides my comment (you could implement DELETE CASCADE), I realize there is also the problem of insert: you have to insert Customer first and then Address.
So I think the best way, if you really want a 1:1, is to create a single table instead.
Customer
    CustomerID
    Name
    Address
    City
Sorry, is this meant to be a real-world database relationship? In all of the many databases I have ever built with customer data, there have always been real cases of either customers with multiple addresses, or more than one organisation at the same address.
I wouldn't want to lead you into a database modelling fallacy by suggesting anything different.
Yes, the "shared PK" idiom you show is for 1-to-0-or-1.
The straightforward way to have a true 1-to-1 correspondence is to have one table with Customer and Address as CKs (candidate keys). (Via UNIQUE NOT NULL and/or PRIMARY KEY.) You could offer the separate tables as views. Unfortunately typical DBMSs have restrictions on what you can do via the views, in particular re updating.
The relational way is to have separate CUSTOMER and ADDRESS tables and a third table/association/relationship with Customer and Address columns as CKs plus FKs on Customer to and from CUSTOMER and on Address to and from ADDRESS (or equivalent constraint(s)). Unfortunately most DBMSs needlessly won't let you declare cycles in FKs and you cannot impose the constraints without triggers/complexity. (Ultimately, if you want to have proper integrity in a typical SQL database you need to use triggers and complex idioms.)
Entity-oriented design methods unfortunately artificially distinguish between entities, associations and properties. Here is an example where if you consider the simplest design to simply be the one table with PKs then you don't want to always have to have distinct tables for each entity. Or if you consider the simplest design to be the three tables (or even two) with the PKs and FKs (or some other constraint(s) for 1-to-1) then unfortunately typical DBMSs just don't declaratively/ergonomically support that particular design situation.
(Straightforward relational design is to have values (that are sometimes used as ids) 1-to-1 with application things but then just have whatever relevant application relationships/associations/relations and corresponding/representing tables/relations as needed to describe your application situations.)
It's possible in principle to implement a true 1-1 data structure in some DBMSs. It's very difficult to add data or modify data in such a structure using standard SQL however. Standard SQL only permits one table to be updated at a time and therefore as soon as you insert a row into one or other table the intended constraint is broken.
Here are two examples. First using Tutorial D. Note that the comma between the two INSERT statements ensures that the 1-1 constraint is never broken:
VAR CUSTOMER REAL RELATION {
id INTEGER} KEY{id};
VAR ADDRESS REAL RELATION {
id INTEGER} KEY{id};
CONSTRAINT one_to_one (CUSTOMER{id} = ADDRESS{id});
INSERT CUSTOMER RELATION {
TUPLE {id 1234}
},
INSERT ADDRESS RELATION {
TUPLE {id 1234}
};
Now the same thing in SQL.
CREATE TABLE CUSTOMER (
id INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE ADDRESS (
id INTEGER NOT NULL PRIMARY KEY);
INSERT INTO CUSTOMER (id)
VALUES (1234);
INSERT INTO ADDRESS (id)
VALUES (1234);
ALTER TABLE CUSTOMER ADD CONSTRAINT one_to_one_1
FOREIGN KEY (id) REFERENCES ADDRESS (id);
ALTER TABLE ADDRESS ADD CONSTRAINT one_to_one_2
FOREIGN KEY (id) REFERENCES CUSTOMER (id);
The SQL version uses two foreign key constraints, which is the only kind of multi-table constraint supported by most SQL DBMSs. It requires two INSERT statements which means I could only insert a row before adding the constraints, not after.
A strict one-to-one constraint probably isn't very useful in practice but it's actually just a special case of something more important and interesting: join dependency. A join dependency is effectively an "at least one" constraint between tables rather than "exactly one". In the world outside databases it is common to encounter examples of business rules that ought to be implemented as join dependencies ("each customer must have AT LEAST ONE address", "each order must have AT LEAST ONE item in it"). In SQL DBMSs it's hard or impossible to implement join dependencies. The usual solution is simply to ignore such business rules, thus weakening the data integrity value of the database.
Yes, what you say is true, the dependent side of a 1:1 relationship may not exist -- if only for the time it takes to create the dependent entity after creating the independent entity. In fact, all relationships may have a zero on one side or the other. You can even turn the relationship into a 1:m by placing the FK of the address in the Customer row and making the field not null. You can still have addresses that aren't referenced by any customer.
At first glance, a m:n may look like an exception. The intersection entry is generally defined so that neither FK can be null. But there can be customers and addresses both that have no entry referring to them. So this is really a 0..m:0..n relationship.
What of it? Everyone I've ever worked with has understood that "one" (as in 1:1) or "many" (as in 1:m or m:n) means "no more than this." There is no "exactly this, no more or less." For example, we can design a 1:3 relationship on paper. We cannot strictly enforce it in any database. We have to use triggers, stored procedures and/or scheduled tasks to seek out and call our attention to deviations. Execute a stored procedure weekly, for instance, that will seek out and flag or delete any such orphaned addresses.
Think of it like a "frictionless surface." It exists only on paper.
I see this question as a conceptual misunderstanding. Relations are between different things. Things with a "true 1-to-1 relation" are by definition aspects or attributes of the same thing, and belong in the same table. No, of course a person and an address are not the same, but if they are inseparable, and must always be inserted, deleted, or otherwise acted upon as a unit, then as data they are "the same thing". This is exactly what is described here.
Yes, and it's actually quite easy: just put both entities in the same table!
OTOH, if you need to keep them in separate tables for some reason, then you need a key in one table referencing [1] a key in another, and vice versa. This, of course, represents a "chicken and egg" problem [2], which can be resolved by deferring the enforcement of FKs to the end of the transaction [3]. This works only on DBMSes that support deferred constraints (such as Oracle and PostgreSQL).
[1] Via a foreign key.
[2] Inserting a row in the first table is impossible because that would violate the referential integrity towards the second table, but inserting a row in the second table is impossible because that would violate the referential integrity towards the first table, etc. Ditto for deletion.
[3] So you simply insert both rows, and then check both FKs.
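As a rough illustration of the deferred-constraint approach, the sketch below uses SQLite through Python's sqlite3 module, since SQLite also accepts DEFERRABLE INITIALLY DEFERRED on foreign keys. The table and variable names are assumptions for the example. Both foreign keys are checked only at COMMIT, so the two rows can be inserted in either order inside one transaction, while a transaction that inserts only one of them fails to commit.

```python
import sqlite3

# isolation_level=None: no implicit transactions, so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.executescript("""
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    FOREIGN KEY (id) REFERENCES address (id) DEFERRABLE INITIALLY DEFERRED
);
CREATE TABLE address (
    id INTEGER PRIMARY KEY,
    FOREIGN KEY (id) REFERENCES customer (id) DEFERRABLE INITIALLY DEFERRED
);
""")

# Both rows in one transaction: each FK is satisfied by the time of COMMIT.
conn.execute("BEGIN")
conn.execute("INSERT INTO customer (id) VALUES (1234)")
conn.execute("INSERT INTO address (id) VALUES (1234)")
conn.execute("COMMIT")
pair_committed = True

# A customer without an address: the insert itself succeeds, the COMMIT does not.
conn.execute("BEGIN")
conn.execute("INSERT INTO customer (id) VALUES (5678)")
try:
    conn.execute("COMMIT")
    lone_rejected = False
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")  # the failed COMMIT leaves the transaction open
    lone_rejected = True

customer_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(pair_committed, lone_rejected, customer_count)  # True True 1
```

The same reasoning applies to deletion: both rows have to go in the same transaction, or the COMMIT fails.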

Primary key/foreign Key naming convention [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
In our dev group we have a raging debate regarding the naming convention for Primary and Foreign Keys. There are basically two schools of thought in our group:
1:
Primary Table (Employee)
Primary Key is called ID
Foreign table (Event)
Foreign key is called EmployeeID
or
2:
Primary Table (Employee)
Primary Key is called EmployeeID
Foreign table (Event)
Foreign key is called EmployeeID
I prefer not to duplicate the name of the table in any of the columns (So I prefer option 1 above). Conceptually, it is consistent with a lot of the recommended practices in other languages, where you don't use the name of the object in its property names. I think that naming the foreign key EmployeeID (or Employee_ID might be better) tells the reader that it is the ID column of the Employee Table.
Some others prefer option 2, where you name the primary key prefixed with the table name so that the column name is the same throughout the database. I see that point, but you now cannot visually distinguish a primary key from a foreign key.
Also, I think it's redundant to have the table name in the column name, because if you think of the table as an entity and a column as a property or attribute of that entity, you think of it as the ID attribute of the Employee, not the EmployeeID attribute of an employee. I don't go and ask my coworker what his PersonAge or PersonGender is. I ask him what his Age is.
So like I said, it's a raging debate and we go on and on and on about it. I'm interested to get some new perspectives.
If the two columns have the same name in both tables (convention #2), you can use the USING syntax in SQL to save some typing and some boilerplate noise:
SELECT name, address, amount
FROM employees JOIN payroll USING (employee_id)
Another argument in favor of convention #2 is that it's the way the relational model was designed.
"The significance of each column is partially conveyed by labeling it with the name of the corresponding domain."
It doesn't really matter. I've never run into a system where there is a real difference between choice 1 and choice 2.
Jeff Atwood had a great article a while back on this topic. Basically people debate and argue the most furiously those topics which they cannot be proven wrong on. Or from a different angle, those topics which can only be won through filibuster style endurance based last-man-standing arguments.
Pick one and tell them to focus on issues that actually impact your code.
EDIT: If you want to have fun, have them specify at length why their method is superior for recursive table references.
I think it depends on how your application is put together. If you use an ORM or design your tables to represent objects, then option 1 may be for you.
I like to code the database as its own layer. I control everything and the app just calls stored procedures. It is nice to have result sets with complete column names, especially when there are many tables joined and many columns returned. With this style of application, I like option 2. I really like to see column names match on joins. I've worked on old systems where they didn't match and it was a nightmare.
Have you considered the following?
Primary Table (Employee)
Primary Key is PK_Employee
Foreign table (Event)
Foreign key is called FK_Employee
Neither convention works in all cases, so why have one at all? Use Common sense...
e.g., for self-referencing table, when there are more than one FK column that self-references the same table's PK, you HAVE to violate both "standards", since the two FK columns can't be named the same... e.g., EmployeeTable with EmployeeId PK, SupervisorId FK, MentorId Fk, PartnerId FK, ...
I agree that there is little to choose between them. To me a much more significant thing about either standard is the "standard" part.
If people start 'doing their own thing' they should be strung up by their nethers. IMHO :)
If you are looking at application code, not just database queries, some things seem clear to me:
Table definitions usually directly map to a class that describes one object, so they should be singular. To describe a collection of an object, I usually append "Array" or "List" or "Collection" to the singular name, as it more clearly than use of plurals indicates not only that it is a collection, but what kind of a collection it is. In that view, I see a table name as not the name of the collection, but the name of the type of object of which it is a collection. A DBA who doesn't write application code might miss this point.
The data I deal with often uses "ID" for non-key identification purposes. To eliminate confusion between key "ID"s and non-key "ID"s, for the primary key name, we use "Key" (that's what it is, isn't it?) prefixed with the table name or an abbreviation of the table name. This prefixing (and I reserve this only for the primary key) makes the key name unique, which is especially important because we use variable names that are the same as the database column names, and most classes have a parent, identified by the name of the parent key. This also is needed to make sure that it is not a reserved keyword, which "Key" alone is. To facilitate keeping key variable names consistent, and to provide for programs that do natural joins, foreign keys have the same name as is used in the table in which they are the primary key. I have more than once encountered programs which work much better this way using natural joins. On this last point, I admit a problem with self-referencing tables, which I have used. In this case, I would make an exception to the foreign key naming rule. For example, I would use ManagerKey as a foreign key in the Employee table to point to another record in that table.
The convention we use where I work is pretty close to A, with the exception that we name tables in the plural form (i.e., "employees") and use underscores between the table and column name. The benefit of it is that to refer to a column, it's either "employees_id" or "employees.id", depending on how you want to access it. If you need to specify what table the column is coming from, "employees.employees_id" is definitely redundant.
I like convention #2 - in researching this topic, and finding this question before posting my own, I ran into the issue where:
I am selecting * from a table with a large number of columns and joining it to a second table that similarly has a large number of columns. Both tables have an "id" column as the primary key, and that means I have to specifically pick out every column (as far as I know) in order to make those two values unique in the result, i.e.:
SELECT table1.id AS parent_id, table2.id AS child_id
Though using convention #2 means I will still have some columns in the result with the same name, I can now specify which id I need (parent or child) and, as Steven Huwig suggested, the USING statement simplifies things further.
I've always used userId as a PK on one table and userId on another table as a FK. I'm seriously thinking about using userIdPK and userIdFK as names to identify one from the other. It will help me to identify PK and FK quickly when looking at the tables, and it seems like it will clear up code when using PHP/SQL to access data, making it easier to understand, especially when someone else looks at my code.
I use convention #2. I'm working with a legacy data model now where I don't know what "ID" stands for in a given table. Where's the harm in being verbose?
How about naming the foreign key
role_id
where role is the role the referenced entity has relative to the table at hand. This solves the issue of recursive references and multiple FKs to the same table.
In many cases it will be identical to the referenced table name. In these cases it becomes identical to one of your proposals.
In any case, having long arguments is a bad idea.
"Where in "employee INNER JOIN order ON order.employee_id = employee.id" is there a need for additional qualification?".
There is no need for additional qualification because the qualification I talked of is already there.
"the reason that a business user refers to Order ID or Employee ID is to provide context, but at a dabase level you already have context because you are refereing to the table".
Pray, tell me, if the column is named 'ID', then how is that "refereing [sic] to the table" done exactly, unless by qualifying this reference to the ID column exactly in the way I talked of ?

What does/should NULL mean along with FK relationships - Database

I was experiencing a hard time creating FK relationships in my relational SQL database and after a brief discussion at work, we realized that we have nullable columns which were most likely contributing to the problem. I have always viewed NULL as meaning unassigned, not specified, blank, etc. and have really never seen a problem with that.
The other developers I was speaking with felt that the only way to handle a relationship that may or may not exist between two entities is to create a table that joins the data from both entities...
It seems intuitive to me at least to say that for a column that contains an ID from another table, if that column is not null, then it must have an ID from the other table, but if it is NULL then that is OK and move on. It seems like this in itself is contradictory to what some say and suggest.
What is the best practice or correct way to handle situations where there could be a relationship between two tables and if a value is specified then it must be in the other table...
It's perfectly acceptable, and it means that, if that column has any value, its value must exist in another table. (I see other answers asserting otherwise, but I beg to differ.)
Think a table of Vehicles and Engines, and the Engines aren't installed in a Vehicle yet (so VehicleID is null). Or an Employee table with a Supervisor column and the CEO of the company.
Update: Per Solberg's request, here is an example of two tables that have a foreign key relationship showing that the foreign key field value can be null.
CREATE TABLE [dbo].[EngineTable](
[EngineID] [int] IDENTITY(1,1) NOT NULL,
[EngineCylinders] smallint NOT NULL,
CONSTRAINT [EngineTbl_PK] PRIMARY KEY NONCLUSTERED
(
[EngineID] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE [dbo].[CarTable](
[CarID] [int] IDENTITY(1,1) NOT NULL,
[Model] [varchar](32) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[EngineID] [int] NULL,
CONSTRAINT [PK_UnitList] PRIMARY KEY CLUSTERED
(
[CarID] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
ALTER TABLE [dbo].[CarTable] WITH CHECK ADD CONSTRAINT [FK_Engine_Car] FOREIGN KEY([EngineID])
REFERENCES [dbo].[EngineTable] ([EngineID])
Insert Into EngineTable (EngineCylinders) Values (4);
Insert Into EngineTable (EngineCylinders) Values (6);
Insert Into EngineTable (EngineCylinders) Values (6);
Insert Into EngineTable (EngineCylinders) Values (8);
-- Now some tests:
Insert Into CarTable (Model, EngineID) Values ('G35x', 3); -- References the third engine
Insert Into CarTable (Model, EngineID) Values ('Sienna', 13); -- Invalid FK reference - throws an error
Insert Into CarTable (Model) Values ('M'); -- Leaves null in the engine id field & does NOT throw an error
I think this debate is another byproduct of the object-relational impedance mismatch. Some DBA-types will pedantically say never allow null in a FK based on some deeper understanding of relational algebra semantics, but application developers will argue that it makes their domain layer more elegant.
The use cases for a "not yet established" relationship are valid, but with null FKs some find that it adds complexity to their queries by introducing more sophisticated features of SQL, specifically LEFT JOINs.
One common alternative solution I've seen is to introduce a "null row" or "sentinel row" into each table with pk=0 or pk=1 (based on what's supported by your RDBMS). This allows you to design a domain layer with "not yet established" relationships, but also avoid introducing LEFT JOINs as you're guaranteeing there will always be something to join against.
Of course, this approach requires diligence too, because you're basically trading off LEFT JOINs for having to check for the presence of your sentinel row in queries so you don't update or delete it. Whether or not the trade-offs are justified is another thing. I tend to agree that reinventing null just to avoid a fancier join seems a bit silly, but I have also worked in an environment where application developers don't win debates against DBAs.
Edits
I removed some of the "matter of fact" wording and tried to clarify what I meant by "failing" joins. @wcoenen's example is the reason that I've personally heard most often for avoiding null FKs. It's not that they fail as in "broken", but rather that they fail, some would argue, to adhere to the principle of least surprise.
Also, I turned this response into a wiki since I've essentially butchered it from its original state and borrowed from other posts.
I'm strongly supportive of the arguments for NULLs in foreign keys to indicate no-parent in an OLTP system, but in a decision support system it rarely works well. There the most appropriate practice is to use a special "Not Applicable" (or similar) value as the parent (in the dimension table) to which the child records (in the fact table) can link.
The reason for this is that the exploratory nature of drill-down/across etc. can lead to users not understanding how a metric can change when they have merely asked for more information on it. For example, where a finance data mart includes a mix of product sales and other sources of revenue, drilling down to "Type of Product" ought to classify non-product sale related data as such, rather than letting those numbers drop out of the report because there is no join from the fact table to the product dimension table.
The problem with allowing nulls in foreign key columns arises when the foreign key is composite. What does it mean if one of the two columns is null? Does the other column have to match anything in the referenced table? With simple (single-column) foreign key constraints, you can get away with nulls.
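To make the composite-key question concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in so it can be run as-is; the `parent` and `child` tables are hypothetical. In SQLite, a composite foreign key is not checked at all when any of its columns is NULL, and SQL Server's documentation describes the same skipping behaviour, but treat the cross-engine claim as something to verify on your own system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE parent (
    a INTEGER NOT NULL,
    b INTEGER NOT NULL,
    PRIMARY KEY (a, b)
);
CREATE TABLE child (
    child_id INTEGER PRIMARY KEY,
    a INTEGER NULL,
    b INTEGER NULL,
    FOREIGN KEY (a, b) REFERENCES parent (a, b)
);
INSERT INTO parent VALUES (1, 1);
""")

def try_child(a, b):
    """Return True if the child row is accepted, False if the FK rejects it."""
    try:
        conn.execute("INSERT INTO child (a, b) VALUES (?, ?)", (a, b))
        return True
    except sqlite3.IntegrityError:
        return False

results = {
    "full match (1, 1)": try_child(1, 1),       # matches the parent row
    "no match (9, 9)": try_child(9, 9),         # no such parent row
    "half null (9, NULL)": try_child(9, None),  # one NULL column, a = 9 matches nothing
}
print(results)
```

The half-null row is accepted even though its non-null column matches nothing, which is exactly the ambiguity described above.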
On the other hand, if the relationship between the two tables is conditional (both entities can exist in their own right, but may almost coincidentally be related) then it may be best to model that with a 'joining table': a table that contains a FK to the referenced table and another to the referencing table, and that has the combination of the two FKs as its own primary key.
As an example of a joining table, suppose your database has tables of clubs and people. Some of the people belong to some of the clubs. The joining table would be club_members and would contain an FK for the person referencing the 'people' table, and would contain another FK for the club that the person belongs to, and the combination of identifiers for person and club would be the primary key of the joining table. (Another name for joining table is 'association' or 'associative' table.)
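The clubs-and-people example could look roughly like this. It is a sketch only, run through Python's sqlite3 so it is self-contained; the column names (`person_id`, `club_id`, `name`) and the sample rows are assumptions, not anything from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE people (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE clubs  (club_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Joining table: both FKs are NOT NULL, and together they form the primary key.
CREATE TABLE club_members (
    person_id INTEGER NOT NULL REFERENCES people (person_id),
    club_id   INTEGER NOT NULL REFERENCES clubs (club_id),
    PRIMARY KEY (person_id, club_id)
);
INSERT INTO people VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO clubs  VALUES (10, 'Chess');
INSERT INTO club_members VALUES (1, 10);  -- Ann is a member; Bob simply has no row
""")

members = conn.execute("""
    SELECT p.name, c.name
    FROM club_members AS m
    JOIN people AS p ON p.person_id = m.person_id
    JOIN clubs  AS c ON c.club_id   = m.club_id
""").fetchall()

def try_insert(person_id, club_id):
    """Return True if the membership row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO club_members VALUES (?, ?)", (person_id, club_id))
        return True
    except sqlite3.IntegrityError:
        return False

print(members)
```

No column in this design ever needs to hold NULL: "Bob belongs to no club" is expressed by the absence of a row.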
I would lean toward a design that communicates the meaning of that column. A null could mean any number of things as far as the domain is concerned. Putting a value in the related table that says "Not Needed", or "Not Selected" at least communicates the purpose without having to ask a developer or consult a document.
Suppose you would need to generate a report of all customers. Each customer has a FK to a country and the country data needs to be included in the report. Now suppose you allow the FK to be null, and you do the following query:
SELECT * FROM customer, country WHERE customer.countryID = country.ID
Any customer where the country FK is null would be silently omitted from the report (you need to use LEFT JOIN instead to fix it). I find this unintuitive and surprising, so I don't like NULL FKs and avoid them in my database schemas. Instead I use sentinel values, e.g. a special "unknown country".
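A small runnable sketch of the omission described here, using Python's sqlite3 as a stand-in engine. The `name` columns and the sample rows are assumptions added for illustration; only `customer.countryID` and `country.ID` come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country  (ID INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE customer (ID INTEGER PRIMARY KEY, name TEXT NOT NULL,
                       countryID INTEGER NULL REFERENCES country (ID));
INSERT INTO country  VALUES (1, 'Belgium');
INSERT INTO customer VALUES (1, 'Alice', 1), (2, 'Bob', NULL);
""")

# Implicit inner join: the customer with a NULL countryID drops out.
inner_rows = conn.execute("""
    SELECT customer.name, country.name
    FROM customer, country
    WHERE customer.countryID = country.ID
    ORDER BY customer.ID
""").fetchall()

# LEFT JOIN keeps every customer and fills the country columns with NULL.
left_rows = conn.execute("""
    SELECT customer.name, country.name
    FROM customer LEFT JOIN country ON customer.countryID = country.ID
    ORDER BY customer.ID
""").fetchall()

print(inner_rows)
print(left_rows)
```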
CREATE TABLE [tree]
(
[id] int NOT NULL PRIMARY KEY, -- the self-referencing FK below needs a primary key or unique column to point at
[parent_id] int NULL
);
ALTER TABLE [tree] ADD CONSTRAINT [FK_tree_tree] FOREIGN KEY([parent_id])
REFERENCES [tree] ([id]);
There is nothing wrong with this! The root node will eternally have a NULL parent, and this is not a case of a "not yet established" relationship. No problem with joins here, either.
Having the root node point to itself as the parent to avoid the NULL FK, or any other creative workaround, means that the real world is no longer accurately modeled in the database.
The one potential issue that nobody mentioned is with index performance on columns that contain lots of NULL values. This per se has nothing to do with the foreign key question, though, but it can make joins perform poorly.
I do understand that if you are a DBA working with ultra-large databases that have hundreds of millions of rows, you would not want NULL foreign keys, because they would simply not perform. The truth is, though, that most developers will never work with such large databases in their lifetime, and today's databases can handle such a situation just fine with a few hundred thousand rows. To stress a (poor) metaphor, most of us do not drive F1 race cars, and the automatic transmission in my wife's Accord does what it needs to do just fine (or at least, it used to, until it broke a few weeks ago ...).
If you give NULL a business meaning, you are essentially redefining what NULL means in your domain, and you must document that for users and future developers. If there is a business reason for having NULL as a foreign key, then I would suggest you do as others have mentioned and add a record to join against, with a value along the lines of 'N/A' or 'Not Assigned'.
Also, there can be complications when NULL in your database takes on multiple meanings (a business meaning, an error, or a value that wasn't entered correctly), which can make issues more difficult to track down.
I don't see a problem with null values if the field can be empty. An abuse is allowing null values when there should be information in that field.
You got it right. For an FK a NULL means no value (meaning no relationship). If there is a value in an FK it has to match exactly one value in the PK that it references.
It is not necessarily bad design to permit this. If a relationship is one-to-many and optional, it's perfectly OK to add a nullable FK to the table on the many side, referencing the PK on the one side.
If a relationship is many-to-many it requires a table of its own, called a junction table. This table has two FKs, each referencing a PK in one of the tables being related. In this case an omitted relationship can be expressed by simply omitting an entire row from the junction table.
Some people design so as to avoid the necessity of permitting NULLs. These people will use a junction table for a many-to-one relationship, and omit a row, as above, when a relationship is omitted.
I don't follow this practice myself, but it does have certain benefits.
Even though it is clearly possible, I would have to ask: what is the problem with using a joining table, as per Jonathan Leffler's well-made point?
I came upon this question because I had exactly the same need but my design is now significantly "cleaner" with a joining table. My database diagram now clearly shows me that my field is optional which works well for me from a schema POV.
Then to simplify my queries, I just made a view LEFT JOINing the two tables together which gives the appearance of an optional join but actually uses the clearer database structure. Also using ISNULL(MyField, 'None') in my view I can provide the benefits of the "not present" additional row design but without the pain.
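Roughly what that view might look like, sketched with Python's sqlite3 so it runs without a server. The `car`, `engine` and `car_engine` tables are hypothetical, and `COALESCE` stands in for T-SQL's `ISNULL` because SQLite has no `ISNULL` function; on SQL Server either should work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE engine (engine_id INTEGER PRIMARY KEY, cylinders INTEGER NOT NULL);
CREATE TABLE car    (car_id    INTEGER PRIMARY KEY, model TEXT NOT NULL);
-- Joining table: a car with no engine simply has no row here.
CREATE TABLE car_engine (
    car_id    INTEGER PRIMARY KEY REFERENCES car (car_id),
    engine_id INTEGER NOT NULL REFERENCES engine (engine_id)
);
-- The view hides the LEFT JOINs and substitutes 'None' for a missing engine.
CREATE VIEW car_with_engine AS
    SELECT c.car_id, c.model,
           COALESCE(CAST(e.cylinders AS TEXT), 'None') AS cylinders
    FROM car AS c
    LEFT JOIN car_engine AS ce ON ce.car_id = c.car_id
    LEFT JOIN engine     AS e  ON e.engine_id = ce.engine_id;
INSERT INTO engine VALUES (1, 6);
INSERT INTO car VALUES (1, 'G35x'), (2, 'M');
INSERT INTO car_engine VALUES (1, 1);
""")

rows = conn.execute(
    "SELECT model, cylinders FROM car_with_engine ORDER BY car_id"
).fetchall()
print(rows)
```

Queries against the view see one row per car, with 'None' where no engine has been assigned, while the base tables contain no nullable foreign key.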
Given the points mentioned here, I'm with the DBAs on this one - why have a null column when you can have a more "solid" relationship made easier to use with a view? And for no real extra effort either.
The join table is the correct method.
Nulls in keys indicate bad database design.
A null value is not unassigned/empty/blank/etc, it is missing/unknown data.
Using nulls in a foreign key field does not mean "there's no relation", it means "I don't know if there's a relation or not" - which is clearly bad.

Subtyping database tables

I hear a lot about subtyping tables when designing a database, and I'm fully aware of the theory behind them. However, I have never actually seen table subtyping in action. How can you create subtypes of tables? I am using MS Access, and I'm looking for a way of doing it in SQL as well as through the GUI (Access 2003).
Cheers!
An easy example would be to have a Person table with a primary key and some columns in that table. Now you can create another table called Student that has a foreign key to the person table (its supertype). Now the student table has some columns which the supertype doesn't have like GPA, Major, etc. But the name, last name and such would be in the parent table. You can always access the student name back in the Person table through the foreign key in the Student table.
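As a rough sketch of the Person/Student example (run through Python's sqlite3 so it is self-contained; column names such as `PersonID`, `FirstName` and `LastName` are assumptions), the subtype table's primary key doubles as the foreign key to the supertype:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
-- Supertype: attributes common to every person.
CREATE TABLE Person (
    PersonID  INTEGER PRIMARY KEY,
    FirstName TEXT NOT NULL,
    LastName  TEXT NOT NULL
);
-- Subtype: attributes only students have; its key points back at Person.
CREATE TABLE Student (
    PersonID INTEGER PRIMARY KEY REFERENCES Person (PersonID),
    GPA      REAL,
    Major    TEXT
);
INSERT INTO Person  VALUES (1, 'Ada', 'Lovelace');
INSERT INTO Student VALUES (1, 3.9, 'Mathematics');
""")

# Reach the student's name through the foreign key.
student = conn.execute("""
    SELECT p.FirstName, p.LastName, s.Major
    FROM Student AS s
    JOIN Person  AS p ON p.PersonID = s.PersonID
""").fetchone()

def add_student_without_person():
    """A Student row with no matching Person row should be rejected."""
    try:
        conn.execute("INSERT INTO Student VALUES (2, 3.0, 'Physics')")
        return True
    except sqlite3.IntegrityError:
        return False

print(student)
```

In Access the same structure would be two tables with a one-to-one relationship drawn between the two `PersonID` columns.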
Anyways, just remember the following:
The hierarchy depicts the relationship between supertypes and subtypes
Supertypes have common attributes
Subtypes have unique attributes
Subtypes of tables is a conceptual thing in EER diagrams. I haven't seen an RDBMS (excluding object-relational DBMSs) that supports it directly. They are usually implemented in one of two ways:
A single table with a set of nullable columns for the properties of each subtype
A table for the base type properties, plus other tables, each with at most one row per base table row, that contain the subtype properties
The notion of table sub-types is useful when using an ORM to produce a class sub-type hierarchy that exactly models the domain.
A sub-type table will have a foreign key back to its parent, and that foreign key is also the sub-type table's primary key.
Keep in mind that in designing a bound application, as with an Access application, subtypes impose a heavy cost in terms of joins.
For instance, if you have a supertype table with three subtype tables and you need to display all three in a single form at once (and you need to show not just the supertype data), you end up with a choice of using three outer joins and Nz(), or you need a UNION ALL of three mutually exclusive SELECT statements (one for each subtype). Neither of these will be editable.
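The two query shapes can be sketched as follows. This is not Access SQL: it runs on Python's sqlite3, `COALESCE` plays the role of `Nz()`, and the `super`/`sub_a`/`sub_b`/`sub_c` tables with a single `detail` column are made up to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE super (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE sub_a (id INTEGER PRIMARY KEY REFERENCES super (id), detail TEXT NOT NULL);
CREATE TABLE sub_b (id INTEGER PRIMARY KEY REFERENCES super (id), detail TEXT NOT NULL);
CREATE TABLE sub_c (id INTEGER PRIMARY KEY REFERENCES super (id), detail TEXT NOT NULL);
INSERT INTO super VALUES (1, 'first'), (2, 'second'), (3, 'third');
INSERT INTO sub_a VALUES (1, 'a-detail');
INSERT INTO sub_b VALUES (2, 'b-detail');
INSERT INTO sub_c VALUES (3, 'c-detail');
""")

# Option 1: three outer joins, picking whichever subtype value is present.
outer_rows = conn.execute("""
    SELECT s.id, s.label, COALESCE(a.detail, b.detail, c.detail) AS detail
    FROM super AS s
    LEFT JOIN sub_a AS a ON a.id = s.id
    LEFT JOIN sub_b AS b ON b.id = s.id
    LEFT JOIN sub_c AS c ON c.id = s.id
    ORDER BY s.id
""").fetchall()

# Option 2: one mutually exclusive SELECT per subtype, glued with UNION ALL.
union_rows = conn.execute("""
    SELECT s.id, s.label, a.detail FROM super AS s JOIN sub_a AS a ON a.id = s.id
    UNION ALL
    SELECT s.id, s.label, b.detail FROM super AS s JOIN sub_b AS b ON b.id = s.id
    UNION ALL
    SELECT s.id, s.label, c.detail FROM super AS s JOIN sub_c AS c ON c.id = s.id
    ORDER BY 1
""").fetchall()

print(outer_rows == union_rows)
```

Both shapes produce the same rows when every supertype row has exactly one subtype row; with real subtypes that have different columns, each query grows accordingly.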
I was going to paste some SQL from the first major app where I worked with super/subtype tables, but looking at it, the SQL is so complicated it would just confuse people. That's not so much because my app was complicated, but it's because the nature of the problem is complex -- presenting the full set of data to the user, both super- and subtypes, is by its very nature complex. My conclusion from working with it was that I'd have been better off with only one subtype table.
That's not to say it's not useful in some circumstances, just that Access's bound forms don't necessarily make it easy to present this data to the user.
I have a similar problem I've been working on.
While looking for a repeatable pattern, I wanted to make sure I didn't abandon referential integrity, which meant that I wouldn't use a (TABLE_NAME, PK_ID) solution.
I finally settled on:
Base Type Table: CUSTOMER
Sub Type Tables: PERSON, BUSINESS, GOVT_ENTITY
I put nullable PERSON_ID, BUSINESS_ID and GOVT_ENTITY_ID fields in CUSTOMER, with foreign keys on each, and a check constraint that exactly one is not null. It's easy to add new sub types: you just need to add the nullable foreign key and modify the check constraint.
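One possible way to write that check constraint, sketched on Python's sqlite3 so it can be run directly. The `CUSTOMER_ID` column, the constraint name and the CASE-based counting are my assumptions about how it could be expressed, not the exact DDL used; the CASE expression is plain SQL and should carry over to SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE PERSON      (PERSON_ID      INTEGER PRIMARY KEY);
CREATE TABLE BUSINESS    (BUSINESS_ID    INTEGER PRIMARY KEY);
CREATE TABLE GOVT_ENTITY (GOVT_ENTITY_ID INTEGER PRIMARY KEY);
CREATE TABLE CUSTOMER (
    CUSTOMER_ID    INTEGER PRIMARY KEY,
    PERSON_ID      INTEGER NULL REFERENCES PERSON (PERSON_ID),
    BUSINESS_ID    INTEGER NULL REFERENCES BUSINESS (BUSINESS_ID),
    GOVT_ENTITY_ID INTEGER NULL REFERENCES GOVT_ENTITY (GOVT_ENTITY_ID),
    -- Count the non-null subtype keys; exactly one must be set.
    CONSTRAINT CK_CUSTOMER_ONE_SUBTYPE CHECK (
        (CASE WHEN PERSON_ID      IS NOT NULL THEN 1 ELSE 0 END)
      + (CASE WHEN BUSINESS_ID    IS NOT NULL THEN 1 ELSE 0 END)
      + (CASE WHEN GOVT_ENTITY_ID IS NOT NULL THEN 1 ELSE 0 END) = 1
    )
);
INSERT INTO PERSON   VALUES (1);
INSERT INTO BUSINESS VALUES (1);
""")

def try_customer(person_id, business_id, govt_entity_id):
    """Return True if the CUSTOMER row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO CUSTOMER (PERSON_ID, BUSINESS_ID, GOVT_ENTITY_ID) "
            "VALUES (?, ?, ?)",
            (person_id, business_id, govt_entity_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False

one_subtype  = try_customer(1, None, None)     # exactly one key set
two_subtypes = try_customer(1, 1, None)        # two keys set
no_subtype   = try_customer(None, None, None)  # no key set
print(one_subtype, two_subtypes, no_subtype)
```

Adding a fourth subtype means adding one nullable column and one more CASE term to the sum.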