SQL many-to-many - sql

I need to build a simple forum (message board) as a school project. But I came across one problem. In the img above there are 2 tables: post and category, which have many-to-many relationship. I made a bridge table, which stores the postKey and categoryKey. Is it a bad practice to create a composite primary key from those 2 keys, or I need something like postCategoryKey? And what else should I improve?

On my opinion, there is no need for PostCategoryKey, due to it's only a relationship table and you won't accés it by postCategoryKey.
I would create the PK using the 2 others FK (postKey and categoryKey).
Hope it helps!

It depends, if you plan later on to add some extra metadata to postCategoryKey in a separate table, then it makes sense.
In your case - I'd go with a composite primary key and get rid of postCategoryKey

You will have to make postKey and categoryKey not null and create a unique constraint on them anyway. That makes them a key for the table, no matter whether you call this the "primary key" or not.
So, there are three options:
Leave this as described with NOT NULL and the unique constraint.
Declare the two columns the table's primary key.
Create an additional column postCategoryKey and make this the primary key.
The decision doesn't really matter. Some companies have a database style convention. In that case it's easy; just follow the company rules. Some people want every table to have a single-column primary key. If so, add that PK column. Some people want bridge tables to have a composite primary key to imediately show what identifies a row. My personal preference is the latter, but any method is as good as the other actually. Just stay consistent in your database.

Related

Should I use composite primary keys in this example?

I am not a SQL expert, so I defer to someone with more knowledge. So here is my question. I have designed a database where every table has an Id column (auto increment) that is the primary key. And I use this design without any issue - it makes sense to me I simply do referential integrity by way of this simple primary key since the Id columns of all tables uniquely identifies each row.
Some of my colleagues have suggested that I use composite primary keys, but I see no value in doing that. The purpose of a primary key is to enable referential integrity, and that is what it does.
For example, this is a toy example but it demonstrates my design:
tbl_Customers
-------------
Id (PK)
Code (VARCHAR)
Name (VARCHAR)
Surname (VARCHAR)
tbl_CustomerDetails
-----------------
Id (PK)
CustomerId (FK to tbl_Customers)
SomeDetails (VARCHAR)
This does not use a seperate 'linking' table, but it does not matter, it demonstrates my design.
Some of my colleagues noted that I should have a composite primary key on tbl_Customers to not only include Id as I do now, but also Code. They say that this will improve performance and that it will ensure that Code will not duplicate.
My counter argument is that if I want Code to not duplicate, I can create a UNIQUE INDEX on Code. And that, since my front-end only ever works with Ids and never allows for example searching (SELECTing) by Code, that there can not be a performance improvement. On my presentation layer, if I show for example Customers and I allow the user to select one to see the associated CustomerDetails, I will select the corresponding tbl_CustomerDetails rows on CustomerId where it matches the selected Id of the clicked customer.
What do you suggest? Am I correct or am I wrong? I am always willing to learn, and if I am wrong here I'd love to learn. But at the moment, I do not feel their arguments are valid. Which is why I am asking the community.
Thanks!
I would suggest to go with single column primary key instead on composite keys. The biggest drawback with composite key is that you require more than one value /columnto identify a row. If your application uses an O/RM (Object/Relation Mapping) layer, then you will have fits mapping these database rows to objects in a programming language. O/RM's are easiest to set up when every table has a single column primary key.
Programming aside,the major drawback of composite keys in general, and especially composite keys requiring this many columns, is all of this data needs to be specified and copied to child tables in order to set up proper relationships between tables which is wastage of space and it increase unnecessary complexity too.
The biggest headache I've run into with developers is they assume "uniqueness of data" equates "identifying a row in the database". This is rarely the case. I've found applications and databases to be much more maintainable and easy to build by defaulting to single column primary keys, and using composite keys as an exception to the rule, then enforcing data uniqueness by using unique constraints or indexes on those columns.
After reading your question and arguments I would like to say you are not wrong.
Since you have ID auto incremented which will always provides uniqueness to your row.
Now talking about code column, then if code should be unique then you can always have UNIQUE constraint for column which will not allow duplicate values for code and since you are doing it from front end so no need to add composite primary key with(ID,Code) but make sure you add UNIQUE constraint for code column.
You have already given explanation buddy and I believe you are totally right.
If you are going to make composite primary key then you have to consider two things here:
Composite PK on (ID,Code) will allow duplicate ID's and duplicate codes, it will not
allow duplicate combinations.
you have to add code column in tbl_CustomerDetails table as well if you are going
to link both tables.
In Summary I would like to say I don't feel that in this case Composite Primary Key is required.
If your question is, should you use a composite key in your example, the answer to that is a resounding NO! Your colleague's suggestion to add code as a composite key is not only unnecessary but will more than likely introduce problems for you down the road. Let me illustrate:
Let's say that you'd like to distinguish customers by code: All members are having code MEMB plus the Id number, all vendors have code VEND plus the Id number, and all customers have code CUST plus Id.
Among the "customers" are donors who don't purchase anything but give a contribution. You decide to make a distinction between donors and customers.
That means you'll have to change the code of some of your customers from CUST to DONOR plus Id. To make that change you will have to UPDATE EVERY INSTANCE of CUST that's a donor into DONOR. That could be a nightmare to say the least as you'll need to figure out every table that has that Id as a reference.
With your current set up, all you have to do is update the Code in ONE place and no more changes are needed. So you're right in your implementation.

What would it mean If I change the identifying relationship from this part of a database design to a non-identifying relationship?

I have a question regarding this database design. I am a bit unsure of the difference between identifying and non-identifying relationships in a database leading me to some puzzles in my head.
I have this database design: (kind of like a movie rental stores. "friend" are those who borrow the movie. "studio" is the production studios that collaborated in making the movie.)
I somewhat understand how it works. However, I was wondering what if I create a loan_id in the loan table, and use movie_id and friend_id as normal foreign keys?
Some of my questions are:
What are the advantages or disadvantages of the later approach?
A situation where the initial or later model is better?
Does the initial model enable a friend to borrow a movie more than once?
Any thorough explanation would be much appreciated.
The way you have all of your many-to-many tables (tables collaboration, loan, role), is called a composite primary key: Where two (or more) columns form a unique value.
When you have a composite pk, a lot of db designers prefer to create a surrogate primary key (like your proposed loan_id). I'm one of them. This post does a good job going through the arguments of why or why not: Composite primary keys versus unique object ID field.
My relatively simple reason for it, is composite keys tend to grow: Using the loan example, what happens if that movies loaned more than once? Using the composite approach, you would then have to add loan_date to the composite key.
What if you then wanted to track re-loans of some sort? You would then have to have a 2nd table carrying all the composite pk fields from the loan table (original_loan_movie_id, original_loan_friend_id, original_loan_date) just to refer to the original loan...
In the LOAN table, you'd need to guarantee the following columns are unique:
movie_id (replace with copy_id assuming there are multiple copies of a movie)
friend_id
loan_date
...because I, or anyone else, should be able to rent the same movie more than once. These are also the columns most likely to be searched on...
With that in mind, the idea of defining a column called loan_id as the primary key for the table to be redundant. ORMs have been mandating the use of non-composite primary keys to simplify queries...
But it Makes Queries Easier...
At first glance, it makes deleting or updating a specific loan/etc easier - until you realize that you need to know the applicable id value first. If you have to search for that id value based on a movie, user/friend, and date then you'd have been better off using the criteria directly in the first place.
But Composite keys are Complex...
In this example, a primary key constraint will ensure that the three columns--movie_id, friend_id and loan_date--will be unique and indexed (most DBs these days automatically index primary keys if the clustered index doesn't already exist) using the best index possible for the table.
The lone primary key approach means the loan_id is indexed with the best index for the table (SQL Server & MySQL call them clustered indexes, to Oracle they're all just indexes), and requires an additional composite unique constraint/index. Some databases might require additional indexing beyond the unique constraint... So this makes the data model more involved/complex, and for no benefit:
Some databases, like MySQL, put a limit on the amount of space you can use for indexes.
the primary key is getting the most ideal index yet the value has no relevance to the data in the table, so making use of the index related to the primary key will be seldom if ever.
Conclusion
I've yet to see a legitimate justification for a single column primary key over a composite primary key.

Do link tables need a meaningless primary key field?

I am working on a couple of link tables and I got to thinking (Danger Will Robinson, Danger) what are the possible structures of a link table and what are their pro's and con's.
I came up with a few possible strictures for the link table:
Traditional 3 column model
id - auto-numbered PRIMARY
table1fk - foreign key
table2fk - foreign key
It's a classic, in most of the books, 'nuff said.
Indexed 3 column model
id - auto-numbered PRIMARY
table1fk - foreign key INDEX ('table1fk')
table2fk - foreign key INDEX ('table2fk')
In my own experience, the fields that you are querying against are not indexed in the traditional model. I have found that indexing the foreign key fields does improve performance as would be expected. Not a major change but a nice optimizing tweak.
Composite key 2 columns ADD PRIMARY KEY ('table1fk' , 'table2fk')
table1fk - foreign key
table2fk - foreign key
With this I use a composite key so that a record from table1 can only be linked to a record on table2 once. Because the key is composite I can add records (1,1), (1,2), (2,2) without any duplication errors.
Any potential problems with the composite key 2 columns option? Is there an indexing issue that this might cause? A performance hit? Anything that would disqualify this as a possible option?
I would use composite key, and no extra meaningless key.
I would not use a ORM system that enforces such rules on my db structure.
For true link tables, they typically do not exist as object entities in my object models. Thus the surrogate key is not ever used. The removable of an item from a collection results in a removal of an item from a link relationship where both foreign keys are known (Person.Siblings.Remove(Sibling) or Person.RemoveSibling(Sibling) which is appropriately translated at the data access layer as usp_Person_RemoveSibling(PersonID, SiblingID)).
As Mike mentioned, if it does become an actual entity in your object model, then it may merit an ID. However, even with addition of temporal factors like effective start and end dates of the relationship and things like that, it's not always clear. For instance, the collection may have an effective date associated at the aggregate level, so the relationship itself may still not become an entity with any exposed properties.
I'd like to add that you might very well need the table indexed both ways on the two foreign key columns.
If this is a true many-to-many join table, then dump unecessary id column (unless your ORM requires one. in that case you've got to decide whether your intellect is going to trump your practicality).
But I find that true join tables are pretty rare. It usually isn't long before I start wanting to put some other data in that table. Because of that I almost always model these join tables as entities from the beginning and stick an id in there.
Having a single column pk can help out alot in disaster recovery situation. So though while correct in theory that you only need the 2 foreign keys. In practice when the shit hits the fan you may want the single column key. I have never been in a situation where i was screwed because I had a single column identifier but I have been in ones where I was screwed because I didn't.
Composite PK and turn off clustering.
I have used composite key to prevent duplicate entry and let the database handle the exception. With a single key, you are rely on the front-end application to check the database for duplicate before adding a new record.
There is something called identifying and non-identifying relationship. With identifying relationships the FK is a part of the PK in the many-to-many table. For example, say we have tables Person, Company and a many-to-many table Employment. In an identifying relationship both fk PersonID and CompanyID are part of the pk, so we can not repeat PersonID, CompanyID combination.
TABLE Employment(PersonID int (PK,FK), CompanyID int (PK,FK))
Now, suppose we want to capture history of employment, so a person can leave a company, work somewhere else and return to the same company later. The relationship is non-identifying here, combination of PersonID, CompanyID can now repeat, so the table would look something like:
TABLE Employment(EmploymentID int (PK), PersonID int (FK), CompanyID int (FK),
FromDate datetime, ToDate datetime)
If you are using an ORM to get to/alter the data, some of them require a single-column primary key (Thank you Tom H for pointing this out) in order to function correctly (I believe Subsonic 2.x was this way, not sure about 3.x).
In my mind, having the primary key doesn't impact performance to any measurable degree, so I usually use it.
If you need to traverse the join table 'in both directions', that is starting with a table1fk or a table2fk key only, you might consider adding a second, reversed, composite index.
ADD KEY ('table2fk', 'table1fk')
The correct answer is:
Primary key is ('table1fk' , 'table2fk')
Another index on ('table2fk' , 'table1fk')
Because:
You don't need an index on table1fk or table2fk alone: the optimiser will use the PK
You'll most likely use the table "both" ways
Adding a surrogate key is only needed because of braindead ORMs
i've used both, the only benefit of using the first model (with uid) is that you can transport the identifier around as a number, whereas in some cases you would have to do some string concatenation with the composite key to transport it around.
i agree that not indexing the foreign keys is a bad idea whichever way you go.
I (almost) always use the additional single-column primary key. This generally makes it easier to build user interfaces, because when a user selects that particular linking entity I can identify with a single integer value rather than having to create and then parse compound identifiers.

SQL: Do you need an auto-incremental primary key for Many-Many tables?

Say you have a Many-Many table between Artists and Fans. When it comes to designing the table, do you design the table like such:
ArtistFans
ArtistFanID (PK)
ArtistID (FK)
UserID (FK)
(ArtistID and UserID will then be contrained with a Unique Constraint
to prevent duplicate data)
Or do you build use a compound PK for the two relevant fields:
ArtistFans
ArtistID (PK)
UserID (PK)
(The need for the separate unique constraint is removed because of the
compound PK)
Are there are any advantages (maybe indexing?) for using the former schema?
ArtistFans
ArtistID (PK)
UserID (PK)
The use of an auto incremental PK has no advantages here, even if the parent tables have them.
I'd also create a "reverse PK" index automatically on (UserID, ArtistID) too: you will need it because you'll query the table by both columns.
Autonumber/ID columns have their place. You'd choose them to improve certain things after the normalisation process based on the physical platform. But not for link tables: if your braindead ORM insists, then change ORMs...
Edit, Oct 2012
It's important to note that you'd still need unique (UserID, ArtistID) and (ArtistID, UserID) indexes. Adding an auto increments just uses more space (in memory, not just on disk) that shouldn't be used
Assuming that you're already a devotee of the surrogate key (you're in good company), there's a case to be made for going all the way.
A key point that is sometimes forgotten is that relationships themselves can have properties. Often it's not enough to state that two things are related; you might have to describe the nature of that relationship. In other words, there's nothing special about a relationship table that says it can only have two columns.
If there's nothing special about these tables, why not treat it like every other table and use a surrogate key? If you do end up having to add properties to the table, you'll thank your lucky presentation layers that you don't have to pass around a compound key just to modify those properties.
I wouldn't even call this a rule of thumb, more of a something-to-consider. In my experience, some slim majority of relationships end up carrying around additional data, essentially becoming entities in themselves, worthy of a surrogate key.
The rub is that adding these keys after the fact can be a pain. Whether the cost of the additional column and index is worth the value of preempting this headache, that really depends on the project.
As for me, once bitten, twice shy – I go for the surrogate key out of the gate.
Even if you create an identity column, it doesn't have to be the primary key.
ArtistFans
ArtistFanId
ArtistId (PK)
UserId (PK)
Identity columns can be useful to relate this relation to other relations. For example, if there was a creator table which specified the person who created the artist-user relation, it could have a foreign key on ArtistFanId, instead of the composite ArtistId+UserId primary key.
Also, identity columns are required (or greatly improve the operation of) certain ORM packages.
I cannot think of any reason to use the first form you list. The compound primary key is fine, and having a separate, artificial primary key (along with the unique contraint you need on the foreign keys) will just take more time to compute and space to store.
The standard way is to use the composite primary key. Adding in a separate autoincrement key is just creating a substitute that is already there using what you have. Proper database normalization patterns would look down on using the autoincrement.
Funny how all answers favor variant 2, so I have to dissent and argue for variant 1 ;)
To answer the question in the title: no, you don't need it. But...
Having an auto-incremental or identity column in every table simplifies your data model so that you know that each of your tables always has a single PK column.
As a consequence, every relation (foreign key) from one table to another always consists of a single column for each table.
Further, if you happen to write some application framework for forms, lists, reports, logging etc you only have to deal with tables with a single PK column, which simplifies the complexity of your framework.
Also, an additional id PK column does not cost you very much in terms of disk space (except for billion-record-plus tables).
Of course, I need to mention one downside: in a grandparent-parent-child relation, child will lose its grandparent information and require a JOIN to retrieve it.
In my opinion, in pure SQL id column is not necessary and should not be used. But for ORM frameworks such as Hibernate, managing many-to-many relations is not simple with compound keys etc., especially if join table have extra columns.
So if I am going to use a ORM framework on the db, I prefer putting an auto-increment id column to that table and a unique constraint to the referencing columns together. And of course, not-null constraint if it is required.
Then I treat the table just like any other table in my project.

Should I use Natural Identity Columns without GUID?

Is it ok to define tables primary key with 2 foreign key column that in combination must be unique? And thus not add a new id column (Guid or int)?
Is there a negative to this?
Yes, it's completely OK. Why not? The downside of composite primary keys is that it can be long and it might be harder to identify a single row uniquely from the application perspective. However, for a couple integer columns (specially in junction tables), it's a good practice.
Natural versus Artificial primary keys is one of those issues that gets widely debated and IMHO the discussion only seems to see the positions harden.
In my opinion both work so long as the developer knows how to avoid the downside of both. A natural primary key (whether composite or single column) more nearly ensures that duplicate rows are not added to the DB. Whereas with artificial primary keys it is necessary to first check the record is unique (as opposed to the primary key, which being artificial will always be unique). One effective way to achieve this is to have a unique or candidate index on the fields that make the record unique (e.g. the fields that make a candidate for a primary key)
Meanwhile an artificial primary key makes for easy joining. Relations can be made with a single field to single field join. With a composite key the writer of the SQL statement must know how many fields to include in the join.
For some definitions of "ok", yes. As long as you never intend to add additional fields to this intersection table, that'll be fine. However, if you intend to have more fields, it's good practice to have an ID field. It's still fine, come to think of it, but can be more awkward.
Unless, of course, disk space is at a serious premium!
If you look into any database textbook, you will find such tables en masse. This is the default way to define n-to-m relations. For example:
article = (id, title, text)
author = (id, name)
article_author = (article_id, author_id)
Semantically, article_author is not a new entity so you might refrain from defining it as the primary key and instead create it as a normal index with UNIQUE constraint.
Yes, I agree, with "some definition of OK" it is OK. But the moment you decide to reference this composite primary key from somewhere (i.e. move it to a foreign key), it quickly bocomes NG (Not Good).