Why specify primary/foreign key attributes in column names - sql

A couple of recent questions discuss strategies for naming columns, and I was rather surprised to discover the concept of embedding the notion of foreign and primary keys in column names. That is
select t1.col_a, t1.col_b, t2.col_z
from t1 inner join t2 on t1.id_foo_pk = t2.id_foo_fk
I have to confess I have never worked on any database system that uses this sort of scheme, and I'm wondering what the benefits are. The way I see it, once you've learnt the N principal tables of a system, you'll write several orders of magnitude more requests with those tables.
To become productive in development, you'll need to learn which tables are the important tables, and which are simple tributaries. You'll want to commit an good number of column names to memory. And one of the basic tasks is to join two tables together. To reduce the learning effort, the easiest thing to do is to ensure that the column name is the same in both tables:
select t1.col_a, t1.col_b, t2.col_z
from t1 inner join t2 on t1.id_foo = t2.id_foo
I posit that, as a developer, you don't need to be reminded that much about which columns are primary keys, which are foreign and which are nothing. It's easy enough to look at the schema if you're curious. When looking at a random
tx inner join ty on tx.id_bar = ty.id_bar
... is it all that important to know which one is the foreign key? Foreign keys are important only to the database engine itself, to allow it to ensure referential integrity and do the right thing during updates and deletes.
What problem is being solved here? (I know this is an invitation to discuss, and feel free to do so. But at the same time, I am looking for an answer, in that I may be genuinely missing something).

I agree with you. Putting this information in the column name smacks of the crappy Hungarian Notation idiocy of the early Windows days.

I agree with you that the foreign key column in a child table should have the same name as the primary key column in the parent table. Note that this permits syntax like the following:
SELECT * FROM foo JOIN bar USING (foo_id);
The USING keyword assumes that a column exists by the same name in both tables, and that you want an equi-join. It's nice to have this available as shorthand for the more verbose:
SELECT * FROM foo JOIN bar ON (foo.foo_id = bar.foo_id);
Note, however, there are cases when you can't name the foreign key the same as the primary key it references. For example, in a table that has a self-reference:
CREATE TABLE Employees (
emp_id INT PRIMARY KEY,
manager_id INT REFERENCES Employees(emp_id)
);
Also a table may have multiple foreign keys to the same parent table. It's useful to use the name of the column to describe the nature of the relationship:
CREATE TABLE Bugs (
...
reported_by INT REFERENCES Accounts(account_id),
assigned_to INT REFERENCES Accounts(account_id),
...
);
I don't like to include the name of the table in the column name. I also eschew the obligatory "id" as the name of the primary key column in every table.

I've espoused most of the ideas proposed here over the 20-ish years I've been developing with SQL databases, I'm embarrassed to say. Most of them delivered few or none of the expected benefits and were with hindsight, a pain in the neck.
Any time I've spent more than a few hours with a schema I've fairly rapidly become familiar with the most important tables and their columns. Once it got to a few months, I'd pretty much have the whole thing in my head.
Who is all this explanation for? Someone who only spends a few minutes with the design isn't going to be doing anything serious anyway. Someone who plans to work with it for a long time will learn it if you named your columns in Sanskrit.
Ignoring compound primary keys, I don't see why something as simple as "id" won't suffice for a primary key, and "_id" for foreign keys.
So a typical join condition becomes customer.id = order.customer_id.
Where more than one association between two tables exists, I'd be inclined to use the association rather than the table name, so perhaps "parent_id" and "child_id" rather than "parent_person_id" etc

I only use the tablename with an Id suffix for the primary key, e.g. CustomerId, and foreign keys referencing that from other tables would also be called CustomerId. When you reference in the application it becomes obvious the table from the object properties, e.g. Customer.TelephoneNumber, Customer.CustomerId, etc.

I used "fk_" on the front end of any foreign keys for a table mostly because it helped me keep it straight when developing the DB for a project at my shop. Having not done any DB work in the past, this did help me. In hindsight, perhaps I didn't need to do that but it was three characters tacked onto some column names so I didn't sweat it.
Being a newcomer to writing DB apps, I may have made a few decisions which would make a seasoned DB developer shudder, but I'm not sure the foreign key thing really is that big a deal. Again, I guess it is a difference in viewpoint on this issue and I'll certainly take what you've written and cogitate on it.
Have a good one!

I agree with you--I take a different approach that I have seen recommended in many corporate environments:
Name columns in the format TableNameFieldName, so if I had a Customer table and UserName was one of my fields, the field would be called CustomerUserName. That means that if I had another table called Invoice, and the customer's user name was a foreign key, I would call it InvoiceCustomerUserName, and when I referenced it, I would call it Invoice.CustomerUserName, which immediately tells me which table it's in.
Also, this naming helps you to keep track of the tables your columns are coming from when you're joiining.
I only use FK_ and PK_ in the ACTUAL names of the foreign and primary keys in the DBMS.

Related

Adding an artificial primary key versus using a unique field [duplicate]

This question already has answers here:
Surrogate vs. natural/business keys [closed]
(19 answers)
Why would one consider using Surrogate keys vs Natural with ON UPDATE CASCADE?
(1 answer)
Closed 7 months ago.
Recently I Inherited a huge app from somebody who left the company.
This app used a SQL server DB .
Now the developer always defines an int base primary key on tables. for example even if Users table has a unique UserName field , he always added an integer identity primary key.
This is done for every table no matter if other fields could be unique and define primary key.
Do you see any benefits whatsoever on this? using UserName as primary key vs adding UserID(identify column) and set that as primary key?
I feel like I have to add add another element to my comments, which started to produce an essay of comments, so I think it is better that I post it all as an answer instead.
Sometimes there are domain specific reasons why a candidate key is not a good candidate for joins (maybe people change user names so often that the required cascades start causing performance problems). But another reason to add an ever-increasing surrogate is to make it the clustered index. A static and ever-increasing clustered index alleviates a high-cost IO operation known as a page split. So even with a good natural candidate key, it can be useful to add a surrogate and cluster on that. Read this for further details.
But if you add such a surrogate, recognise that the surrogate is purely internal, it is there for performance reasons only. It does not guarantee the integrity of your data. It has no meaning in the model, unless it becomes part of the model. For example, if you are generating invoice numbers as an identity column, and sending those values out into the real world (on invoice documents/emails/etc), then it's not a surrogate, it's part of the model. It can be meaningfully referenced by the customer who received the invoice, for example.
One final thing that is typically left out of this discussion is one particular aspect of join performance. It is often said that the primary key should also be narrow, because it can make joins more performant, as well as reducing the size of non-clustered indexes. And that's true.
But a natural primary key can eliminate the need for a join in the first place.
Let's put all this together with an example:
create table Countries
(
countryCode char(2) not null primary key clustered,
countryName varchar(64) not null
);
insert Countries values
('AU', 'Australia'),
('FR', 'France');
create table TourLocations
(
tourLocationName varchar(64) not null,
tourLocationId int identity(1,1) unique clustered,
countryCode char(2) not null foreign key references Countries(countryCode),
primary key (countryCode, tourLocationName)
);
insert TourLocations (TourLocationName, countryCode) values
('Bondi Beach', 'AU'),
('Eiffel Tower', 'FR')
I did not add a surrogate key to Countries, because there aren't many rows and we're not going to be constantly inserting new rows. I already know what all the countries are, and they don't change very often.
On the TourLocations table I have added an identity and clustered on it. There could be very many tour locations, changing all the time.
But I still must have a natural key on TourLocations. Otherwise I could insert the same tour location name with the same country twice. Sure, the Id's will be different. But the Id's don't mean anything. As far as any real human is concerned, two tour locations with the same name and country code are completely indistinguishable. Do you intend to have actual users using the system? Then you've got a problem.
By putting the same country and location name in twice I haven't created two facts in my database. I have created the same fact twice! No good. The natural key is necessary. In this sense The Impaler's answer is strictly, necessarily, wrong. You cannot not have a natural key. If the natural key can't be defined as anything other than "every meaningful column in the table" (that is to say, excluding the surrogate), so be it.
OK, now let's investigate the claim that an int identity key is advantageous because it helps with joins. Well, in this case my char(2) country code is narrower than an int would have been.
But even if it wasn't (maybe we think we can get away with a tinyint), those country codes are meaningful to real people, which means a lot of the time I don't have to do the join at all.
Suppose I gave the results of this query to my users:
select countryCode, tourLocationName
from TourLocations
order by 1, 2;
Very many people will not need me to provide the countries.countryName column for them to know which country is represented by the code in each of those rows. I don't have to do the join.
When you're dealing with a specific business domain that becomes even more likely. Meaningful codes are understood by the domain users. They often don't need to see the long description columns from the key table. So in many cases no join is required to give the users all of the information they need.
If I had foreign keyed to an identity surrogate I would have to do the join, because the identity surrogate doesn't mean anything to anyone.
You are talking about the difference between synthetic and natural keys.
In my [very] personal opinion, I would recommend to always use synthetic keys (and always call it id). The main problem is that natural keys are never unique; they are unique in theory, yes, but in the real world there are a myriad of unexpected and inexorable events that will make this false.
In database design:
Natural keys correspond to values present in the domain model. For example, UserName, SSN, VIN can be considered natural keys.
Synthetic keys are values not present in the domain model. They are just numeric/string/UUID values that have no relationship with the actual data. They only serve as a unique identifiers for the rows.
I would say, stick to synthetic keys and sleep well at night. You never know what the Marketing Department will come up with on Monday, and suddenly "the username is not unique anymore".
Yes having a dedicated int is a good thing for PK use.
you may have multiple alternate keys, that's ok too.
two great reasons for it:
it is performant
it protects against key mutation ( editing a name etc. )
A username or any such unique field that holds meaningful data is subject to changes. A name may have been misspelled or you might want to edit a name to choose a better one, etc. etc.
Primary keys are used to identify records and, in conjunction with foreign keys, to connect records in different tables. They should never change. Therefore, it is better to use a meaningless int field as primary key.
By meaningless I mean that apart from being the primary key it has no meaning to the users.
An int identity column has other advantages over a text field as primary key.
It is generated by the database engine and is guaranteed to be unique in multi-user scenarios.
it is faster than a text column.
Text can have leading spaces, hidden characters and other oddities.
There are multiple kinds of text data types, multiple character sets and culture dependent behaviors resulting in text comparisons not always working as expected.
int primary keys generated in ascending order have a superior performance in conjunction with clustered primary keys (which is a SQL-Server specialty).
Note that I am talking from a database point of view. In the user interface, users will prefer identifying entries by name or e-mail address, etc.
But commands like SELECT, INSERT, UPDATE or DELETE will always identify records by the primary key.
This subject - quite much like gulivar travels and wars being fought over which end of the egg you supposed to crack open to eat.
However, using the SAME "id" name for all tables, and autonumber? Yes, it is LONG establihsed choice.
There are of course MANY different views on this subject, and many advantages and disavantages.
Regardless of which choice one perfers (or even needs), this is a long established concept in our industry. In fact SharePoint tables use "ID" and autonumber by defualt. So does ms-access, and there probably more that do this.
The simple concpet?
You can build your tables with the PK and child tables with forighen keys.
At that point you setup your relationships between the tables.
Now, you might decide to add say some invoice number or whatever. Rules might mean that such invoice number is not duplicated.
But, WHY do we care of you have some "user" name, or some "invoice" number or whatever. Why should that fact effect your relational database model?
You mean I don't have a user name, or don't have a invoice number, and the whole database and relatonships don't work anymore? We don't care!!!!
The concept of data, even required fields, or even a column having to be unique ?
That has ZERO to do with a working relational data model.
And maybe you decide that invoice number is not generated until say sent to the customer. So, the fact of some user name, invoice number or whatever? Don't care - you can have all kinds of business rules for those numbers, but they have ZERO do to do with the fact that you designed a working relational data model based on so called "surrogate" or sometime called synthetic keys.
So, once you build that data model - even with JUST the PK "id" and FK (forighen keys), you are NOW free to start adding columns and define what type of data you going to put in each table. but, what you shove into each table has ZERO to do with that working related data model. They are to be thought as seperate concpets.
So, if you have a user name - add that column to the table. If you don't want users name, remove the column. As such data you store in the table has ZERO to do with the automatic PK ID you using - it not really any different then say what area of memory the computer going to allocate to load that data. Basic data operations of the system is has nothing to do with having build database with relationships that simple exist. And the data columns you add after having built those relationships is up to you - but will not, and should not effect the operation of the database and relationships you built and setup. Not only are these two concepts separate, but they free the developer from having to worry about the part that maintains the relationships as opposed to data column you add to such tables to store user data.
I mean, in json data, xml? We often have a master + child table relationship. We don't care how that relationship is maintained - but only that it exists.
Thus yes, all tables have that pk "ID". Even better? in code, you NEVER have to guess what the PK id is - it always the same!!!
So, data and columns you put and toss into a table? Those columns and data have zero to do with the PK id, and while it is the database generating that PK? It could be a web service call to some monkeys living in a far away jungle eating banana's and they give you a PK value based on how many bananas they eaten. We just really don't' care about that number - it is just internal house keeping numbers - one that we don't see or even care about in most code. And thus the number one rule to such auto matic PK values?
You NEVER give that auto PK number any meaning from a user and applcation point of view.
In summary:
Yes, using a PK called "id" for all tables? Common, and in fact in SharePoint and many systems, it not only the default, but is in fact required for such systems to operate.
Its better to use userid. User table is referenced by many other tables.
The referenced table would contain the primary key of the user table as foreign key.
Its better to use userid since its integer value,
it takes less space than string values of username and
the searches by the database engine would be faster
user(userid, username, name)
comments(commentid, comment, userid) would be better than
comments(commentid, comment, username)

Why is a primary-foreign key relation required when we can join without it?

If we can get data from two tables without having primary and foreign key relation, then why we need this rule? Can you please explain me clearly, with suitable example?
It's a test database, don't mind the bad structure.
Tables' structure:
**
table - 'test1'
columns - id,lname,fname,dob
no primary and foreign key and also not unique(without any constraints)
**
**table - 'test2'
columns- id,native_city
again, no relations and no constraints**
I can still join these tables with same columns 'id',
so if there's no primary-foreign key, then what is the use of that?
The main reason for primary and foreign keys is to enforce data consistency.
A primary key enforces the consistency of uniqueness of values over one or more columns. If an ID column has a primary key then it is impossible to have two rows with the same ID value. Without that primary key, many rows could have the same ID value and you wouldn't be able to distinguish between them based on the ID value alone.
A foreign key enforces the consistency of data that points elsewhere. It ensures that the data which is pointed to actually exists. In a typical parent-child relationship, a foreign key ensures that every child always points at a parent and that the parent actually exists. Without the foreign key you could have "orphaned" children that point at a parent that doesn't exist.
You need two columns of the same type, one on each table, to JOIN on. Whether they're primary and foreign keys or not doesn't matter.
You don't need a FK, you can join arbitrary columns.
But having a foreign key ensures that the join will actually succeed in finding something.
Foreign key give you certain guarantees that would be extremely difficult and error prone to implement otherwise.
For example, if you don't have a foreign key, you might insert a detail record in the system and just after you checked that the matching master record is present somebody else deletes it. So in order to prevent this you need to lock the master table, when ever you modify the detail table (and vice versa). If you don't need/want that guarantee, screw the FKs.
Depending on your RDBMS a foreign key also might improve performance of select (but also degrades performance of updates, inserts and deletes)
I know its late to post, but I use the site for my own reference and so I wanted to put an answer here for myself to reference in the future too. I hope you (and others) find it helpful.
Lets pretend a bunch of super Einstein experts designed our database. Our super perfect database has 3 tables, and the following relationships defined between them:
TblA 1:M TblB
TblB 1:M TblC
Notice there is no relationship between TblA and TblC
In most scenarios such a simple database is easy to navigate but in commercial databases it is usually impossible to be able to tell at the design stage all the possible uses and combination of uses for data, tables, and even whole databases, especially as systems get built upon and other systems get integrated or switched around or out. This simple fact has spawned a whole industry built on top of databases called Business Intelligence. But I digress...
In the above case, the structure is so simple to understand that its easy to see you can join from TblA, through to B, and through to C and vice versa to get at what you need. It also very vaguely highlights some of the problems with doing it. Now expand this simple chain to 10 or 20 or 50 relationships long. Now all of a sudden you start to envision a need for exactly your scenario. In simple terms, a join from A to C or vice versa or A to F or B to Z or whatever as our system grows.
There are many ways this can indeed be done. The one mentioned above being the most popular, that is driving through all the links. The major problem is that its very slow. And gets progressively slower the more tables you add to the chain, the more those tables grow, and the further you want to go through it.
Solution 1: Look for a common link. It must be there if you taught of a reason to join A to C. If it is not obvious, create a relationship and then join on it. i.e. To join A through B through C there must be some commonality or your join would either produce zero results or a massive number or results (Cartesian product). If you know this commonality, simply add the needed columns to A and C and link them directly.
The rule for relationships is that they simply must have a reason to exist. Nothing more. If you can find a good reason to link from A to C then do it. But you must ensure your reason is not redundant (i.e. its already handled in some other way).
Now a word of warning. There are some pitfalls. But I don't do a good job of explaining them so I will refer you to my source instead of talking about it here. But remember, this is getting into some heavy stuff, so this video about fan and chasm traps is really only a starting point. You can join without relationships. But I advise watching this video first as this goes beyond what most people learn in college and well into the territory of the BI and SAP guys. These guys, while they can program, their day job is to specialise in exactly this kind of thing. How to get massive amounts of data to talk to each other and make sense.
This video is one of the better videos I have come across on the subject. And it's worth looking over some of his other videos. I learned a lot from him.
A primary key is not required. A foreign key is not required either. You can construct a query joining two tables on any column you wish as long as the datatypes either match or are converted to match. No relationship needs to explicitly exist.
To do this you use an outer join:
select tablea.code, tablea.name, tableb.location from tablea left outer join
tableb on tablea.code = tableb.code
join with out relation
SQL join

Why do I read so many negative opinions on using composite keys?

I was working on an Access database which loved auto-numbered identifiers. Every table used them except one, which used a key made up of the first name, last name and birthdate of a person. Anyways, people started running into a lot of problems with duplicates, as tables representing relationships could hold the same relationship twice or more. I decided to get around this by implementing composite keys for the relationship tables and I haven't had a problem with duplicates since.
So I was wondering what's the deal with the bad rep of composite keys in the Access world? I guess it's slightly more difficult to write a query, but at least you don't have to put in place tons of checks every time data is entered or even edited in the front end. Are they incredibly super inefficient or something?
A composite key works fine for a single table, but when you start to create relations between tables it can get a bit much.
Consider two tables Person and Event, and a many-to-many relations between them called Appointment.
If you have a composite key in the Person table made up of the first name, last name and birth date, and a compossite key in the Event table made up of place and name, you will get five fields in the Appointment table to identify the relation.
A condition to bind the relation will be quite long:
select Person,*, Event.*
from Person, Event, Appointment
where
Person.FirstName = Appointment.PersonFirstName and
Person.LastName = Appointment.PersonLastName and
Person.BirthDate = Appointment.PersonBirthDate and
Event.Place = Appointment.EventPlace and
Event.Name = Appointment.EventName`.
If you on the other hand have auto-numbered keys for the Person and Event tables, you only need two fields in the Appointment table to identify the relation, and the condition is a lot smaller:
select Person,*, Event.*
from Person, Event, Appointment
where
Person.Id = Appointment.PersonId and Event.Id = Appointment.EventId
If you only use pure self-written SQL to access your data, they are OK.
However, some ORMs, adapters etc. require having a single PK field to identify a record.
Also note that a composite primary key is almost invariably a natural key (there is hardly a point in creating a surrogate composite key, you can as well use a single-field one).
The most common usage of a composite primary key is a many-to-many link table.
When using the natural keys, you should ensure they are inherently unique and immutable, that is an entity is always identified by the same value of the key, once been reflected by the model, and only one entity can be identified by any value.
This it not so in your case.
First, a person can change their name and even the birthdate
Second, I can easily imagine two John Smiths born at the same day.
The former means that if a person changes their name, you will have to update it in each and every table that refers to persons; the latter means that the second John Smith will not be able to make it into your database.
For the case like yours, I would really consider adding a surrogate identifier to your model.
Unfortunately one reason for those negative opinions is probably ignorance. Too many people don't understand the concept of Candidate Keys properly. There are people who seem to think that every table needs only one key, that one key is sufficient for data integrity and that choosing that one key is all that matters.
I have often speculated that it would be a good thing to deprecate and phase out the use of the term "primary key" altogether. Doing that would focus database designers minds on the real issue: that a table should have as many keys as are necessary to ensure the correctness of the data and that some of those keys will probably be composite. Abolishing the primary key concept would do away with all those fatuous debates about what the primary key ought to be or not be.
If your RDBMS supports them and if you use them correctly (and consistently), unique keys on the composite PK should be sufficient to avoid duplicates. In SQL Server at least, you can also create FKs against a unique key instead of the PK, which can be useful.
The advantage of a single "id" column (or surrogate key) is that it can improve performance by making for a narrower key. Since this key may be carried to indexes on that table (as a pointer back to the physical row from the index row) and other tables as a FK column that can decrease space and improve performance. A lot of it depends on the specific architecture of your RDBMS though. I'm not familiar enough with Access to comment on that unfortunately.
As Quassnoi points out, some ORMs (and other third party applications, ETL solutions, etc.) don't have the capability to handle composite keys. Other than some ORMs though, most recent third party apps worth anything will support composite keys though. ORMs have been a little slower in adopting that in general though.
My personal preference for composite keys is that although a unique index can solve the problem of duplicates, I've yet to see a development shop that actually fully used them. Most developers get lazy about it. They throw on an auto-incrementing ID and move on. Then, six months down the road they pay me a lot of money to fix their duplicate data issues.
Another issue, is that auto-incrementing IDs aren't generally portable. Sure, you can move them around between systems, but since they have no actual basis in the real world it's impossible to determine one given everything else about an entity. This becomes a big deal in ETL.
PKs are a pretty important thing in the data modeling world and they generally deserve more thought then, "add an auto-incrementing ID" if you want your data to be consistent and clean.
Surrogate keys are also useful, but I prefer to use them when I have a known performance issue that I'm trying to deal with. Otherwise it's the classic problem of wasting time trying to solve a problem that you might not even have.
One last note... on cross-reference tables (or joining tables as some call them) it's a little silly (in my opinion) to add a surrogate key unless required by an ORM.
Composite Keys are not just composite primary keys, but composite foreign keys as well. What do I mean by that? I mean that each table that refers back to the original table needs a column for each column in the composite key.
Here's a simple example, using a generic student/class arrangement.
Person
FirstName
LastName
Address
Class
ClassName
InstructorFirstName
InstructorLastName
InstructorAddress
MeetingTime
StudentClass - a many to many join table
StudentFirstName
StudentLastName
StudentAddress
ClassName
InstructorFirstName
InstructorLastName
InstructorAddress
MeetingTime
You just went from having a 2-column many-to-many table using surrogate keys to having an 8-column many-to-many table using composite keys, because they have 3 and 5 column foreign keys. You can't really get rid of any of these fields, because then the records wouldn't be unique, since both students and instructors can have duplicate names. Heck, if you have two people from the same address with the same name, you're still in serious trouble.
Most of the answers given here don't seem to me to be given by people who work with Access on a regular basis, so I'll chime in from that perspective (though I'll be repeating what some of the others have said, just with some Access-specific comments).
I use surrogate a key only when there is no single-column candidate key. This means I have tables with surrogate PKs and with single-column natural PKs, but no composite keys (except in joins, where they are the composite of two FKs, surrogate or natural doesn't matter).
Jet/ACE clusters on the PK, and only on the PK. This has potential drawbacks and potential benefits (if you consider a random Autonumber as PK, for instance).
In my experience, the non-Null requirement for a composite PK makes most natural keys impossible without using potentially problematic default values. It likewise wrecks your unique index in Jet/ACE, so in an Access app (before 2010), you end up enforcing uniqueness in your application. Starting with A2010, table-level data macros (which work like triggers) can conceivably be used to move that logic into the database engine.
Composite keys can help you avoid joins, because they repeat data that with surrogate keys you'd have to get from the source table via a join. While joins can be expensive, it's mostly outer joins that are a performance drain, and it's only with non-required FKs that you'd get the full benefit of avoiding outer joins. But that much data repetition has always bothered me a lot, since it seems to go against everything we've ever been taught about normalization!
As I mentioned above, the only composite keys in my apps are in N:N join tables. I would never add a surrogate key to a join table except in the relatively rare case in which the join table is itself a parent to a related tables (e.g., Person/Company N:N record might have related JobTitles, i.e., multiple jobs within the same company). Rather than store the composite key in the child table, you'd store the surrogate key. I'd likely not make the surrogate key the PK, though -- I'd keep the composite PK on the pair of FK values. I would just add an Autonumber with a unique index for joining to the child table(s).
I'll add more as I think of it.
It complicates queries and maintenance. If you are really interested in this subject I'd recommend looking over the number of posts that already cover this. This will give you better info than any one response here.
https://stackoverflow.com/search?q=composite+primary+key
In the first place composite keys are bad for performance in joins. Further they are much worse for updating records as you have to update all the child records as well. Finally very few composite keys are actually really good keys. To be a good key it should be unique and not be subject to change. The example you gave as a composite key you used fails both tests. It is not unique (there are people with the same name born on the same day) and names change frequently causing much unnecessary updating of all the child tables.
As far as table with autogenrated keys casuing duplicates, that is mostly due to several factors:
the rest of the data in the table
can't be identified in any way as
unique
a design failure of forgetting to
create a unique index on the possible
composite key
Poor design of the user interface
which doesn't attempt to find
matching records or which allows data
entry when a pull down might be more
appropriate.
None of those are the fault of the surrogate key, they just indicate incompetent developers.
I think some coders see the complexity but want to avoid it, and most coders don't even think to look for the complexity at all.
Let's consider a common example of a table that had more than one candidate key: a Payroll table with columns employee_number, salary_amount, start_date and end_date.
The four candidate keys are as follows:
UNIQUE (employee_number, start_date); -- simple constraint
UNIQUE (employee_number, end_date); -- simple constraint
UNIQUE (employee_number, start_date, end_date); -- simple constraint
CHECK (
NOT EXISTS (
SELECT Calendar.day_date
FROM Calendar, Payroll AS P1
WHERE P1.start_date <= Calendar.day_date
AND Calendar.day_date < P1.end_date
GROUP
BY P1.employee_number, Calendar.day_date
)
); -- sequenced key i.e. no over-lapping periods for the same employee
Only one of those keys are required to be enforced i.e. the sequenced key. However, most coders wouldn't think to add such a key, let alone know how to code it in the first place. In fact, I would wager that most Access coders would add an incrementing autonumber column to the table, make the autonumber column the PRIMARY KEY, fail to add constraints for any of the candidate keys and will have convinced themselves that their table has a key!

Do link tables need a meaningless primary key field?

I am working on a couple of link tables and I got to thinking (Danger Will Robinson, Danger) what are the possible structures of a link table and what are their pro's and con's.
I came up with a few possible strictures for the link table:
Traditional 3 column model
id - auto-numbered PRIMARY
table1fk - foreign key
table2fk - foreign key
It's a classic, in most of the books, 'nuff said.
Indexed 3 column model
id - auto-numbered PRIMARY
table1fk - foreign key INDEX ('table1fk')
table2fk - foreign key INDEX ('table2fk')
In my own experience, the fields that you are querying against are not indexed in the traditional model. I have found that indexing the foreign key fields does improve performance as would be expected. Not a major change but a nice optimizing tweak.
Composite key 2 columns ADD PRIMARY KEY ('table1fk' , 'table2fk')
table1fk - foreign key
table2fk - foreign key
With this I use a composite key so that a record from table1 can only be linked to a record on table2 once. Because the key is composite I can add records (1,1), (1,2), (2,2) without any duplication errors.
Any potential problems with the composite key 2 columns option? Is there an indexing issue that this might cause? A performance hit? Anything that would disqualify this as a possible option?
I would use composite key, and no extra meaningless key.
I would not use a ORM system that enforces such rules on my db structure.
For true link tables, they typically do not exist as object entities in my object models. Thus the surrogate key is not ever used. The removable of an item from a collection results in a removal of an item from a link relationship where both foreign keys are known (Person.Siblings.Remove(Sibling) or Person.RemoveSibling(Sibling) which is appropriately translated at the data access layer as usp_Person_RemoveSibling(PersonID, SiblingID)).
As Mike mentioned, if it does become an actual entity in your object model, then it may merit an ID. However, even with addition of temporal factors like effective start and end dates of the relationship and things like that, it's not always clear. For instance, the collection may have an effective date associated at the aggregate level, so the relationship itself may still not become an entity with any exposed properties.
I'd like to add that you might very well need the table indexed both ways on the two foreign key columns.
If this is a true many-to-many join table, then dump unecessary id column (unless your ORM requires one. in that case you've got to decide whether your intellect is going to trump your practicality).
But I find that true join tables are pretty rare. It usually isn't long before I start wanting to put some other data in that table. Because of that I almost always model these join tables as entities from the beginning and stick an id in there.
Having a single column pk can help out alot in disaster recovery situation. So though while correct in theory that you only need the 2 foreign keys. In practice when the shit hits the fan you may want the single column key. I have never been in a situation where i was screwed because I had a single column identifier but I have been in ones where I was screwed because I didn't.
Composite PK and turn off clustering.
I have used composite key to prevent duplicate entry and let the database handle the exception. With a single key, you are rely on the front-end application to check the database for duplicate before adding a new record.
There is something called identifying and non-identifying relationship. With identifying relationships the FK is a part of the PK in the many-to-many table. For example, say we have tables Person, Company and a many-to-many table Employment. In an identifying relationship both fk PersonID and CompanyID are part of the pk, so we can not repeat PersonID, CompanyID combination.
TABLE Employment(PersonID int (PK,FK), CompanyID int (PK,FK))
Now, suppose we want to capture history of employment, so a person can leave a company, work somewhere else and return to the same company later. The relationship is non-identifying here, combination of PersonID, CompanyID can now repeat, so the table would look something like:
TABLE Employment(EmploymentID int (PK), PersonID int (FK), CompanyID int (FK),
FromDate datetime, ToDate datetime)
If you are using an ORM to get to/alter the data, some of them require a single-column primary key (Thank you Tom H for pointing this out) in order to function correctly (I believe Subsonic 2.x was this way, not sure about 3.x).
In my mind, having the primary key doesn't impact performance to any measurable degree, so I usually use it.
If you need to traverse the join table 'in both directions', that is starting with a table1fk or a table2fk key only, you might consider adding a second, reversed, composite index.
ADD KEY ('table2fk', 'table1fk')
The correct answer is:
Primary key is ('table1fk' , 'table2fk')
Another index on ('table2fk' , 'table1fk')
Because:
You don't need an index on table1fk or table2fk alone: the optimiser will use the PK
You'll most likely use the table "both" ways
Adding a surrogate key is only needed because of braindead ORMs
i've used both, the only benefit of using the first model (with uid) is that you can transport the identifier around as a number, whereas in some cases you would have to do some string concatenation with the composite key to transport it around.
i agree that not indexing the foreign keys is a bad idea whichever way you go.
I (almost) always use the additional single-column primary key. This generally makes it easier to build user interfaces, because when a user selects that particular linking entity I can identify with a single integer value rather than having to create and then parse compound identifiers.

Primary key/foreign Key naming convention [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
In our dev group we have a raging debate regarding the naming convention for Primary and Foreign Keys. There's basically two schools of thought in our group:
1:
Primary Table (Employee)
Primary Key is called ID
Foreign table (Event)
Foreign key is called EmployeeID
or
2:
Primary Table (Employee)
Primary Key is called EmployeeID
Foreign table (Event)
Foreign key is called EmployeeID
I prefer not to duplicate the name of the table in any of the columns (So I prefer option 1 above). Conceptually, it is consistent with a lot of the recommended practices in other languages, where you don't use the name of the object in its property names. I think that naming the foreign key EmployeeID (or Employee_ID might be better) tells the reader that it is the ID column of the Employee Table.
Some others prefer option 2 where you name the primary key prefixed with the table name so that the column name is the same throughout the database. I see that point, but you now can not visually distinguish a primary key from a foreign key.
Also, I think it's redundant to have the table name in the column name, because if you think of the table as an entity and a column as a property or attribute of that entity, you think of it as the ID attribute of the Employee, not the EmployeeID attribute of an employee. I don't go an ask my coworker what his PersonAge or PersonGender is. I ask him what his Age is.
So like I said, it's a raging debate and we go on and on and on about it. I'm interested to get some new perspectives.
If the two columns have the same name in both tables (convention #2), you can use the USING syntax in SQL to save some typing and some boilerplate noise:
SELECT name, address, amount
FROM employees JOIN payroll USING (employee_id)
Another argument in favor of convention #2 is that it's the way the relational model was designed.
The significance of each column is
partially conveyed by labeling it with
the name of the corresponding domain.
It doesn't really matter. I've never run into a system where there is a real difference between choice 1 and choice 2.
Jeff Atwood had a great article a while back on this topic. Basically people debate and argue the most furiously those topics which they cannot be proven wrong on. Or from a different angle, those topics which can only be won through filibuster style endurance based last-man-standing arguments.
Pick one and tell them to focus on issues that actually impact your code.
EDIT: If you want to have fun, have them specify at length why their method is superior for recursive table references.
I think it depends on your how you application is put together. If you use ORM or design your tables to represent objects then option 1 may be for you.
I like to code the database as its own layer. I control everything and the app just calls stored procedures. It is nice to have result sets with complete column names, especially when there are many tables joined and many columns returned. With this stype of application, I like option 2. I really like to see column names match on joins. I've worked on old systems where they didn't match and it was a nightmare,
Have you considered the following?
Primary Table (Employee)
Primary Key is PK_Employee
Foreign table (Event)
Foreign key is called FK_Employee
Neither convention works in all cases, so why have one at all? Use Common sense...
e.g., for self-referencing table, when there are more than one FK column that self-references the same table's PK, you HAVE to violate both "standards", since the two FK columns can't be named the same... e.g., EmployeeTable with EmployeeId PK, SupervisorId FK, MentorId Fk, PartnerId FK, ...
I agree that there is little to choose between them. To me a much more significant thing about either standard is the "standard" part.
If people start 'doing their own thing' they should be strung up by their nethers. IMHO :)
If you are looking at application code, not just database queries, some things seem clear to me:
Table definitions usually directly map to a class that describes one object, so they should be singular. To describe a collection of an object, I usually append "Array" or "List" or "Collection" to the singular name, as it more clearly than use of plurals indicates not only that it is a collection, but what kind of a collection it is. In that view, I see a table name as not the name of the collection, but the name of the type of object of which it is a collection. A DBA who doesn't write application code might miss this point.
The data I deal with often uses "ID" for non-key identification purposes. To eliminate confusion between key "ID"s and non-key "ID"s, for the primary key name, we use "Key" (that's what it is, isn't it?) prefixed with the table name or an abbreviation of the table name. This prefixing (and I reserve this only for the primary key) makes the key name unique, which is especially important because we use variable names that are the same as the database column names, and most classes have a parent, identified by the name of the parent key. This also is needed to make sure that it is not a reserved keyword, which "Key" alone is. To facilitate keeping key variable names consistent, and to provide for programs that do natural joins, foreign keys have the same name as is used in the table in which they are the primary key. I have more than once encountered programs which work much better this way using natural joins. On this last point, I admit a problem with self-referencing tables, which I have used. In this case, I would make an exception to the foreign key naming rule. For example, I would use ManagerKey as a foreign key in the Employee table to point to another record in that table.
The convention we use where I work is pretty close to A, with the exception that we name tables in the plural form (ie, "employees") and use underscores between the table and column name. The benefit of it is that to refer to a column, it's either "employees _ id" or "employees.id", depending on how you want to access it. If you need to specify what table the column is coming from, "employees.employees _ id" is definitely redundant.
I like convention #2 - in researching this topic, and finding this question before posting my own, I ran into the issue where:
I am selecting * from a table with a large number of columns and joining it to a second table that similarly has a large number of columns. Both tables have an "id" column as the primary key, and that means I have to specifically pick out every column (as far as I know) in order to make those two values unique in the result, i.e.:
SELECT table1.id AS parent_id, table2.id AS child_id
Though using convention #2 means I will still have some columns in the result with the same name, I can now specify which id I need (parent or child) and, as Steven Huwig suggested, the USING statement simplifies things further.
I've always used userId as a PK on one table and userId on another table as a FK. 'm seriously thinking about using userIdPK and userIdFK as names to identify one from the other. It will help me to identify PK and FK quickly when looking at the tables and it seems like it will clear up code when using PHP/SQL to access data making it easier to understand. Especially when someone else looks at my code.
I use convention #2. I'm working with a legacy data model now where I don't know what stands for in a given table. Where's the harm in being verbose?
How about naming the foreign key
role_id
where role is the role the referenced entity has relativ to the table at hand. This solves the issue of recursive reference and multiple fks to the same table.
In many cases will be identical to the referenced table name. In this cases it becomes identically to one of your proposals.
In any case havin long arguments is a bad idea
"Where in "employee INNER JOIN order ON order.employee_id = employee.id" is there a need for additional qualification?".
There is no need for additional qualification because the qualification I talked of is already there.
"the reason that a business user refers to Order ID or Employee ID is to provide context, but at a dabase level you already have context because you are refereing to the table".
Pray, tell me, if the column is named 'ID', then how is that "refereing [sic] to the table" done exactly, unless by qualifying this reference to the ID column exactly in the way I talked of ?