Many-to-many link table design: two foreign keys only or an additional primary key? - sql

This is undoubtedly a newbie question, but I haven't been able
to find a satisfactory answer.
When creating a link table for many-to-many relationships, is it better to
create a unique id or to use only the two foreign keys of the respective tables (a compound key?).
Looking at different diagrams of the Northwind database, for example, I've come across
both 'versions'.
That is: an OrderDetails table with only fkProductID and fkOrderID, and also versions
with an added OrderDetailsID.
What's the difference? (Does it also depend on the DB engine?)
What are the SQL (or Linq) advantages/disadvantages?
Thanks in advance for an explanation.
Tom

ORMs have been mandating the use of non-composite primary keys to simplify queries...
But it Makes Queries Easier...
At first glance, it makes deleting or updating a specific order/etc. easier, until you realize that you need to know the applicable id value first. If you have to search for that id value based on an order's specifics, then you'd have been better off using the criteria directly in the first place.
But Composite keys are Complex...
In this example, a primary key constraint will ensure that the two columns--fkProductID and fkOrderID--will be unique and indexed (most DBs these days automatically index primary keys if the clustered index doesn't already exist) using the best index possible for the table.
The lone primary key approach means the OrderDetailsID is indexed with the best index for the table (SQL Server & MySQL call them clustered indexes, to Oracle they're all just indexes), and requires an additional composite unique constraint/index. Some databases might require additional indexing beyond the unique constraint... So this makes the data model more involved/complex, and for no benefit:
Some databases, like MySQL, put a limit on the amount of space you can use for indexes.
The primary key gets the most ideal index, yet its value has no relevance to the data in the table, so the index on the primary key will seldom, if ever, be used.
Conclusion
I don't see the benefit in a single column primary key over a composite primary key. More work for additional overhead with no net benefit...
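The two designs discussed above can be compared side by side. The following is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in engine; the table and column names (order_details, order_id, product_id, quantity) are illustrative and not taken from any particular Northwind diagram. It shows that the composite key rejects a duplicate order/product pair by itself, while the surrogate-key design only does so once the extra UNIQUE constraint is added.

```python
import sqlite3

# Design A: the two foreign-key columns form the primary key.
COMPOSITE_DDL = """
CREATE TABLE order_details (
    order_id   INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
"""

# Design B: surrogate key plus the extra composite unique constraint.
SURROGATE_DDL = """
CREATE TABLE order_details (
    order_details_id INTEGER PRIMARY KEY,
    order_id   INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity   INTEGER NOT NULL,
    UNIQUE (order_id, product_id)
);
"""

# Design C: surrogate key only; nothing protects the pair.
SURROGATE_ONLY_DDL = """
CREATE TABLE order_details (
    order_details_id INTEGER PRIMARY KEY,
    order_id   INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity   INTEGER NOT NULL
);
"""


def rejects_duplicate_pair(ddl: str) -> bool:
    """Return True if inserting the same (order_id, product_id) twice fails."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    insert = (
        "INSERT INTO order_details (order_id, product_id, quantity) "
        "VALUES (?, ?, ?)"
    )
    conn.execute(insert, (1, 10, 2))
    try:
        conn.execute(insert, (1, 10, 5))
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.close()
    return False


for label, ddl in [("composite", COMPOSITE_DDL),
                   ("surrogate + unique", SURROGATE_DDL),
                   ("surrogate only", SURROGATE_ONLY_DDL)]:
    print(label, "rejects duplicates:", rejects_duplicate_pair(ddl))
```

Index storage and clustering differ between engines, so this sketch says nothing about the performance side of the argument; it only illustrates the constraint that each design needs.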

I'm used to using a primary key column, because the primary key uniquely identifies the record.
If you have cascade-update settings on table relations, the values of foreign keys can change between the "SELECT" and "UPDATE/DELETE" commands sent from the application.
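The cascade-update scenario can be reproduced in a small sketch. It assumes SQLite via Python's sqlite3, and a hypothetical schema in which the product key is a mutable text code (product_code) so that a cascading update has something to change; none of these names come from the question. The application reads a row, the product code is then renamed, and the application's later DELETE by the stale foreign-key values matches nothing, while a DELETE by the surrogate id still finds the row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE products (product_code TEXT PRIMARY KEY);
CREATE TABLE order_details (
    order_details_id INTEGER PRIMARY KEY,
    order_id     INTEGER NOT NULL,
    product_code TEXT NOT NULL
        REFERENCES products(product_code) ON UPDATE CASCADE,
    UNIQUE (order_id, product_code)
);
INSERT INTO products VALUES ('A1');
INSERT INTO order_details (order_id, product_code) VALUES (1, 'A1');
""")

# Step 1: the application SELECTs the row and remembers its values.
row_id, order_id, code = conn.execute(
    "SELECT order_details_id, order_id, product_code FROM order_details"
).fetchone()

# Step 2: meanwhile the product code changes and cascades to order_details.
conn.execute("UPDATE products SET product_code = 'A2' WHERE product_code = 'A1'")

# Step 3a: deleting by the remembered foreign-key values finds nothing.
deleted_by_stale_keys = conn.execute(
    "DELETE FROM order_details WHERE order_id = ? AND product_code = ?",
    (order_id, code),
).rowcount

# Step 3b: deleting by the surrogate id still works.
deleted_by_id = conn.execute(
    "DELETE FROM order_details WHERE order_details_id = ?", (row_id,)
).rowcount

print(deleted_by_stale_keys, deleted_by_id)
```

If the referenced keys are themselves immutable surrogate ids, as in the Northwind example, a cascading update has nothing to change and this concern largely disappears.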

Related

In PostgreSQL, what are tables with no primary key used for?

I've read a lot about this issue.. (also read this: Tables with no Primary Key)
It seems like there is no reason to use tables with no PK, so why does PostgreSQL allow it? Can you give an example of when it's a good idea not to define a PK?
I think the answer to your question lies in trying to understand what the drawbacks of having a Primary Key (PK) are in the first place.
One obvious 'drawback' (depending on how you see it) of maintaining a PK is that it adds overhead to every INSERT. So, to increase INSERT performance (for example, for a logging table that is only queried offline), I would remove all constraints, including the PK, where possible, which should improve write performance. You may argue that pure logging should be done outside the DB (in a NoSQL store such as Cassandra), but at least it's possible in PostgreSQL.
A primary key is a special form of a unique constraint. A unique constraint is always backed up by an index. And the disadvantage of an index is that it takes time to update. Tables with an index have lower update, delete and insert performance.
So if you have a table that has a lot of modifications, and few queries, you can improve performance by omitting the primary key.
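The claim that a primary key is backed by an index can be checked directly. This is a minimal sketch, assuming SQLite via Python's sqlite3 rather than PostgreSQL, with hypothetical table names (keyed_log, bare_log). It lists the indexes the engine created implicitly for a table with a primary key and for one without.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE keyed_log (event_id TEXT PRIMARY KEY, message TEXT);
CREATE TABLE bare_log  (event_id TEXT, message TEXT);
""")


def index_names(table: str) -> list:
    """Names of all indexes on the given table, including implicit ones."""
    # PRAGMA index_list returns one row per index; the name is column 1.
    return [row[1] for row in conn.execute(f"PRAGMA index_list({table})")]


keyed_indexes = index_names("keyed_log")  # one automatic index for the PK
bare_indexes = index_names("bare_log")    # no index to maintain on writes

print(keyed_indexes, bare_indexes)
```

Every insert into keyed_log also has to update that index, which is the write overhead the answers above refer to. How much it matters in practice depends on the engine and workload.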
AFAIK, the primary key is primarily needed for the relationships between tables as a foreign key. If you have a table that is not linked to anything you don't need a primary key. In Excel spreadsheets there're no primary keys but a spreadsheet is not a relational database.

What would it mean If I change the identifying relationship from this part of a database design to a non-identifying relationship?

I have a question regarding this database design. I am a bit unsure of the difference between identifying and non-identifying relationships in a database leading me to some puzzles in my head.
I have this database design: (kind of like a movie rental stores. "friend" are those who borrow the movie. "studio" is the production studios that collaborated in making the movie.)
I somewhat understand how it works. However, I was wondering what if I create a loan_id in the loan table, and use movie_id and friend_id as normal foreign keys?
Some of my questions are:
What are the advantages or disadvantages of the latter approach?
Is there a situation where the initial or the latter model is better?
Does the initial model enable a friend to borrow a movie more than once?
Any thorough explanation would be much appreciated.
The way you have keyed all of your many-to-many tables (collaboration, loan, role) is called a composite primary key: two (or more) columns together form a unique value.
When you have a composite PK, a lot of DB designers prefer to create a surrogate primary key (like your proposed loan_id). I'm one of them. This post does a good job going through the arguments for and against: Composite primary keys versus unique object ID field.
My relatively simple reason for it is that composite keys tend to grow. Using the loan example, what happens if that movie is loaned more than once? Using the composite approach, you would then have to add loan_date to the composite key.
What if you then wanted to track re-loans of some sort? You would then have to have a 2nd table carrying all the composite pk fields from the loan table (original_loan_movie_id, original_loan_friend_id, original_loan_date) just to refer to the original loan...
In the LOAN table, you'd need to guarantee the following columns are unique:
movie_id (replace with copy_id assuming there are multiple copies of a movie)
friend_id
loan_date
...because I, or anyone else, should be able to rent the same movie more than once. These are also the columns most likely to be searched on...
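The question of whether a friend can borrow the same movie twice comes down to which columns are in the key. Here is a minimal sketch, assuming SQLite via Python's sqlite3 and assuming the initial model keys the loan table on (movie_id, friend_id) as described above; the ids and dates are made up for illustration.

```python
import sqlite3

# Initial model: the pair (movie_id, friend_id) is the whole key.
TWO_COLUMN_KEY = """
CREATE TABLE loan (
    movie_id  INTEGER NOT NULL,
    friend_id INTEGER NOT NULL,
    loan_date TEXT    NOT NULL,
    PRIMARY KEY (movie_id, friend_id)
);
"""

# Grown key: loan_date is part of the key as well.
THREE_COLUMN_KEY = """
CREATE TABLE loan (
    movie_id  INTEGER NOT NULL,
    friend_id INTEGER NOT NULL,
    loan_date TEXT    NOT NULL,
    PRIMARY KEY (movie_id, friend_id, loan_date)
);
"""


def can_borrow_twice(ddl: str) -> bool:
    """Return True if the same friend can borrow the same movie on two dates."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    insert = "INSERT INTO loan (movie_id, friend_id, loan_date) VALUES (?, ?, ?)"
    conn.execute(insert, (7, 3, "2024-01-05"))
    try:
        conn.execute(insert, (7, 3, "2024-02-10"))
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()


print("two-column key allows a second loan:", can_borrow_twice(TWO_COLUMN_KEY))
print("three-column key allows a second loan:", can_borrow_twice(THREE_COLUMN_KEY))
```

With the two-column key the second loan is rejected, so the initial model does not allow repeat loans unless the key is widened or replaced.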
With that in mind, I find the idea of defining a column called loan_id as the primary key for the table to be redundant. ORMs have been mandating the use of non-composite primary keys to simplify queries...
But it Makes Queries Easier...
At first glance, it makes deleting or updating a specific loan/etc easier - until you realize that you need to know the applicable id value first. If you have to search for that id value based on a movie, user/friend, and date then you'd have been better off using the criteria directly in the first place.
But Composite keys are Complex...
In this example, a primary key constraint will ensure that the three columns--movie_id, friend_id and loan_date--will be unique and indexed (most DBs these days automatically index primary keys if the clustered index doesn't already exist) using the best index possible for the table.
The lone primary key approach means the loan_id is indexed with the best index for the table (SQL Server & MySQL call them clustered indexes, to Oracle they're all just indexes), and requires an additional composite unique constraint/index. Some databases might require additional indexing beyond the unique constraint... So this makes the data model more involved/complex, and for no benefit:
Some databases, like MySQL, put a limit on the amount of space you can use for indexes.
The primary key gets the most ideal index, yet its value has no relevance to the data in the table, so the index on the primary key will seldom, if ever, be used.
Conclusion
I've yet to see a legitimate justification for a single column primary key over a composite primary key.

SQL primary key - complex primary or string with concatenation?

I have a table with 16 columns. It will be the most frequently used table in a web application and it will contain a few hundred thousand rows. The database is created on SQL Server 2008.
My question is about the choice of primary key. What is quicker? I can use a composite primary key of two bigints, or I can use one varchar value, but then I will need to concatenate it afterwards.
There are many more factors you must consider:
data access prevalent pattern, how are you going to access the table?
how many non-clustered indexes?
frequency of updates
pattern of updates (sequential inserts, random)
pattern of deletes
All these factors, and especially the first two, should drive your choice of the clustered key. Note that the primary key and clustered key are different concepts, often confused. Read my answer on Should I design a table with a primary key of varchar or int? for a lengthier discussion on the criteria that drive a clustered key choice.
Without any information on your access patterns I can answer very briefly and concisely, and actually correctly: the narrower key is always quicker (for reasons of IO). However, this response bears absolutely no value. The only thing that will make your application faster is to choose a key that is going to be used by the query execution plans.
A primary key which does not rely on any underlying values (called a surrogate key) is a good choice. That way if the row changes, the ID doesn't have to, and any tables referring to it (Foreign Keys) will not need to change. I would choose an autonumber (i.e. IDENTITY) column for the primary key column.
In terms of performance, a shorter, integer based primary key is best.
You can still create your clustered index on multiple columns.
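Why not just a single INT auto-generated primary key? INT is 32-bit and signed in SQL Server, so an IDENTITY column starting at 1 can handle about 2.1 billion records (roughly 4.3 billion if the negative range is used as well).
CREATE TABLE Records (
    recordId INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- value generated by SQL Server
    ...
);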
A surrogate key might be a fine idea if there are foreign key relationships on this table. Using a surrogate will save tables that refer to it from having to duplicate all those columns in their tables.
Another important consideration is indexes on columns that you'll be using in WHERE clauses. Your performance will suffer if you don't. Make sure that you add appropriate indexes, over and above the primary key, to avoid table scans.
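The effect of an index on a WHERE column can be seen in the query plan. The following is a rough sketch, assuming SQLite via Python's sqlite3 as a stand-in (SQL Server's plans and tooling look different), with a hypothetical customer_ref column that is not part of the question. It compares the plan for the same query before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " record_id INTEGER PRIMARY KEY,"
    " customer_ref INTEGER,"
    " note TEXT)"
)


def plan(sql: str) -> str:
    """Return the human-readable plan steps SQLite reports for a query."""
    # The description of each step is the fourth column of the result.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[3] for row in rows)


query = "SELECT * FROM records WHERE customer_ref = 42"

# Without an index on customer_ref the engine scans the whole table.
plan_without_index = plan(query)

conn.execute("CREATE INDEX idx_records_customer_ref ON records (customer_ref)")

# With the index the engine searches the index instead.
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, so the output is best read for the scan-versus-index-search difference rather than the literal strings.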
What do you mean by quicker? If you need to search quicker, you can create an index on any column or use full-text search. The primary key just makes sure you do not have duplicate records.
The decision relies upon its use. If you are using the table mostly to save data and not to retrieve it, then use a simple key. If you are mostly querying the data and it is mostly static data where the key values will not change, your index strategy needs to optimize the data for the most frequent query that will be used. Personally, I like the idea of using GUIDs for the primary key and an int for the clustered index. That allows for easy data imports. But it really depends upon your needs.
Lots of variables you haven’t mentioned: whether the data in the two columns is “natural” and there is a benefit in identifying records by a logical ID, whether disclosure of the key via a UI poses a risk, and how important performance is (a few hundred thousand rows is pretty minimal).
If you’re not too fussy, go the auto number path for speed and simplicity. Also take a look at all the posts on the site about SQL primary key types. Heaps of info here.
Is it an ER model or a dimensional model? In an ER model, the two columns should be kept separate and should not be replaced by surrogates. The entire record could have a single surrogate for easy reference in URLs etc. This could be a hash of all parts of the composite key or an identity value.
In a dimensional model, they must also be kept separate, and they should all be surrogate keys.

SQL: what exactly do Primary Keys and Indexes do?

I've recently started developing my first serious application which uses a SQL database, and I'm using phpMyAdmin to set up the tables. There are a couple optional "features" I can give various columns, and I'm not entirely sure what they do:
Primary Key
Index
I know what a PK is for and how to use it, but I guess my question with regards to that is why does one need one - how is it different from merely setting a column to "Unique", other than the fact that you can only have one PK? Is it just to let the programmer know that this value uniquely identifies the record? Or does it have some special properties too?
I have no idea what "Index" does - in fact, the only times I've ever seen it in use are (1) that my primary keys seem to be indexed, and (2) I heard that indexing is somehow related to performance; that you want indexed columns, but not too many. How does one decide which columns to index, and what exactly does it do?
Edit: should one index columns one is likely to want to ORDER BY?
Thanks a lot,
Mala
Primary key is usually used to create a numerical 'id' for your records, and this id column is automatically incremented.
For example, if you have a books table with an id field, where the id is the primary key and is also set to auto_increment (under 'Extra' in phpMyAdmin), then when you first add a book to the table, its id will become '1'. The next book's id would automatically be '2', and so on. Normally, every table should have a primary key to help identify and find records easily.
Indexes are used when you need to retrieve certain information from a table regularly. For example, if you have a users table, and you will need to access the email column a lot, then you can add an index on email, and this will cause queries accessing the email to be faster.
However there are also downsides for adding unnecessary indexes, so add this only on the columns that really do need to be accessed more than the others. For example, UPDATE, DELETE and INSERT queries will be a little slower the more indexes you have, as MySQL needs to store extra information for each indexed column. More info can be found at this page.
Edit: Yes, columns that need to be used in ORDER BY a lot should have indexes, as well as those used in WHERE.
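The ORDER BY case can be illustrated with a query plan as well. This is a minimal sketch, assuming SQLite via Python's sqlite3 instead of MySQL (MySQL's EXPLAIN output is formatted differently), with a hypothetical published column on the books table. Without an index the engine has to sort the rows in a temporary structure; with an index on the sort column it can read the rows in index order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT,"
    " published INTEGER)"
)


def plan(sql: str) -> str:
    """Return the plan steps SQLite reports for a query, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[3] for row in rows)


query = "SELECT title, published FROM books ORDER BY published"

# No index on the sort column: SQLite reports a temporary B-tree for sorting.
plan_before = plan(query)

conn.execute("CREATE INDEX idx_books_published ON books (published)")

# With the index, the separate sort step is no longer needed.
plan_after = plan(query)

print(plan_before)
print(plan_after)
```

Whether an engine actually uses an index for sorting depends on its optimizer and on table statistics, so this is an illustration of the mechanism rather than a guarantee.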
The primary key is basically a unique, indexed column that acts as the "official" ID of rows in that table. Most importantly, it is generally used for foreign key relationships, i.e. if another table refers to a row in the first, it will contain a copy of that row's primary key.
Note that it's possible to have a composite primary key, i.e. one that consists of more than one column.
Indexes improve lookup times. They're usually tree-based, so that looking up a certain row via an index takes O(log(n)) time rather than scanning through the full table.
Generally, any column in a large table that is frequently used in WHERE, ORDER BY or (especially) JOIN clauses should have an index. Since the index needs to be updated for every INSERT, UPDATE or DELETE, it slows down those operations. If you have few writes and lots of reads, then index to your heart's content. If you have both lots of writes and lots of queries that would require indexes on many columns, then you have a big problem.
The difference between a primary key and a unique key is best explained through an example.
We have a table of users:
USER_ID number
NAME varchar(30)
EMAIL varchar(50)
In that table the USER_ID is the primary key. The NAME is not unique - there are a lot of John Smiths and Muhammed Khans in the world. The EMAIL is necessarily unique, otherwise the worldwide email system wouldn't work. So we put a unique constraint on EMAIL.
Why then do we need a separate primary key? Three reasons:
the numeric key is more efficient when used in foreign key relationships as it takes less space
the email can change (for example swapping provider) but the user is still the same; rippling a change of a primary key value throughout a schema is always a nightmare
it is always a bad idea to use sensitive or private information as a foreign key
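The users example can be sketched as runnable SQL. This is a minimal illustration, assuming SQLite via Python's sqlite3; the posts table and all sample values are hypothetical additions used only to show a foreign key in action. The numeric user_id is the primary key, email carries a unique constraint, and an email change touches one row in users while the referencing table stays untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,   -- stable numeric identifier
    name    TEXT NOT NULL,         -- not unique: many John Smiths
    email   TEXT NOT NULL UNIQUE   -- unique, but allowed to change
);
CREATE TABLE posts (
    post_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(user_id),
    title   TEXT NOT NULL
);
INSERT INTO users VALUES (1, 'John Smith', 'john@old.example');
INSERT INTO users VALUES (2, 'John Smith', 'jsmith@old.example');
INSERT INTO posts VALUES (100, 1, 'Hello');
""")

# The user swaps provider: only the users row changes, posts is untouched.
conn.execute("UPDATE users SET email = 'john@new.example' WHERE user_id = 1")

author_email = conn.execute(
    "SELECT u.email FROM posts p JOIN users u ON u.user_id = p.user_id "
    "WHERE p.post_id = 100"
).fetchone()[0]

# The unique constraint still keeps a second user from taking the same email.
try:
    conn.execute(
        "INSERT INTO users VALUES (3, 'Jane Doe', 'john@new.example')"
    )
    duplicate_email_rejected = False
except sqlite3.IntegrityError:
    duplicate_email_rejected = True

print(author_email, duplicate_email_rejected)
```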
In the relational model, any column or set of columns that is guaranteed to be both present and unique in the table can be called a candidate key to the table. "Present" means "NOT NULL". It's common practice in database design to designate one of the candidate keys as the primary key, and to use references to the primary key to refer to the entire row, or to the subject matter item that the row describes.
In SQL, a PRIMARY KEY constraint amounts to a NOT NULL constraint for each primary key column, and a UNIQUE constraint for all the primary key columns taken together. In practice many primary keys turn out to be single columns.
For most DBMS products, a PRIMARY KEY constraint will also result in an index being built on the primary key columns automatically. This speeds up the system's checking activity when new entries are made for the primary key, to make sure the new value doesn't duplicate an existing value. It also speeds up lookups based on the primary key value and joins between the primary key and a foreign key that references it. How much speed-up occurs depends on how the query optimizer works.
Originally, relational database designers looked for natural keys in the data as given. In recent years, the tendency has been to always create a column called ID, an integer as the first column and the primary key of every table. The autogenerate feature of the DBMS is used to ensure that this key will be unique. This tendency is documented in the "Oslo design standards". It isn't necessarily relational design, but it serves some immediate needs of the people who follow it. I do not recommend this practice, but I recognize that it is the prevalent practice.
An index is a data structure that allows for rapid access to a few rows in a table, based on a description of the columns of the table that are indexed. The index consists of copies of certain table columns, called index keys, interspersed with pointers to the table rows. The pointers are generally hidden from the DBMS users. Indexes work in tandem with the query optimizer. The user specifies in SQL what data is being sought, and the optimizer comes up with index strategies and other strategies for translating what is being sought into a strategy for finding it. There is some kind of organizing principle, such as sorting or hashing, that enables an index to be used for fast lookups, and certain other uses. This is all internal to the DBMS, once the database builder has created the index or declared the primary key.
Indexes can be built that have nothing to do with the primary key. A primary key can exist without an index, although this is generally a very bad idea.

Why use primary keys?

What are primary keys used for, aside from identifying a unique column in a table? Couldn't this be done by simply using an autoincrement constraint on a column? I understand that PK and FK are used to relate different tables, but can't this be done by just using a join?
Basically, what is the database doing to improve performance when joining using primary keys?
Mostly for referential integrity with foreign keys. When you have a PK, the database will also create an index behind the scenes, and this way you don't need table scans when looking up values.
RDBMS providers are usually optimized to work with tables that have primary keys. Most store statistics which helps optimize query plans. These statistics are very important to performance especially on larger tables and they are not going to work the same without primary keys, and you end up getting unpredictable query response times.
Most database best practices books suggest creating all tables with a primary key with no exceptions, it would be wise to follow this practice. Not many things say junior software dev more than one who builds a database without referential integrity!
Some PKs are simply an auto-incremented column. Also, you typically join USING the PK and FK. There has to be some relationship to do a join. Additionally, most DBMS automatically index PKs by default, which improves join performance as well as querying for a particular record based on ID.
You can join without a primary key within a query, however, you must have a primary key defined to enforce data integrity constraints, at least with SQL Server. (Foreign Keys, etc..)
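The integrity point can be made concrete with a foreign key that references a primary key. This is a minimal sketch, assuming SQLite via Python's sqlite3 rather than SQL Server, with hypothetical customers and orders tables. A join would work on any matching columns, but only the declared constraint stops an order from pointing at a customer that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
);
INSERT INTO customers VALUES (1, 'Ada');
""")


def try_insert_order(order_id: int, customer_id: int) -> bool:
    """Return True if the order was accepted, False if the FK check rejected it."""
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, customer_id))
        return True
    except sqlite3.IntegrityError:
        return False


valid_order_accepted = try_insert_order(10, 1)      # customer 1 exists
orphan_order_accepted = try_insert_order(11, 999)   # customer 999 does not

print(valid_order_accepted, orphan_order_accepted)
```

Most engines, SQLite included, also accept a foreign key that references a UNIQUE column instead of the primary key, so the primary key is the conventional target rather than the only possible one.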
Also, here is an interesting read for you on Primary Keys.
In Microsoft Access, if you have a linked table to, say, SQL Server, the source table must have a primary key in order for the linked table to be writeable. At least, that was the case with Access 2000 and SQL Server 6.5. It may be different with later versions.
Keys are about data integrity as well as identification. The uniqueness of a key is guaranteed by having a constraint in the database to keep out "bad" data that would otherwise violate the key. The fact that data integrity rules are guaranteed in that way is precisely what makes a key usable as an identifier. That goes for any key. One key per table by convention is called a "primary" key but that doesn't make other alternate keys any less important.
In practice we need to be able to enforce uniqueness rules against all types of data (not just numbers) to satisfy the demands of data quality and usability.