Why use primary keys? - primary-key

What are primary keys used aside from identifying a unique column in a table? Couldn't this be done by simply using an autoincrement constraint on a column? I understand that PK and FK are used to relate different tables, but can't this be done by just using join?
Basically what is the database doing to improve performance when joining using primary keys?

Mostly for referential integrity with foreign keys,, When you have a PK it will also create an index behind the scenes and this way you don't need table scans when looking up values

RDBMS providers are usually optimized to work with tables that have primary keys. Most store statistics which helps optimize query plans. These statistics are very important to performance especially on larger tables and they are not going to work the same without primary keys, and you end up getting unpredictable query response times.
Most database best practices books suggest creating all tables with a primary key with no exceptions, it would be wise to follow this practice. Not many things say junior software dev more than one who builds a database without referential integrity!

Some PKs are simply an auto-incremented column. Also, you typically join USING the PK and FK. There has to be some relationship to do a join. Additionally, most DBMS automatically index PKs by default, which improves join performance as well as querying for a particular record based on ID.

You can join without a primary key within a query, however, you must have a primary key defined to enforce data integrity constraints, at least with SQL Server. (Foreign Keys, etc..)
Also, here is an interesting read for you on Primary Keys.

In Microsoft Access, if you have a linked table to, say, SQL Server, the source table must have a primary key in order for the linked table to be writeable. At least, that was the case with Access 2000 and SQL Server 6.5. It may be different with later versions.

Keys are about data integrity as well as identification. The uniqueness of a key is guaranteed by having a constraint in the database to keep out "bad" data that would otherwise violate the key. The fact that data integrity rules are guaranteed in that way is precisely what makes a key usable as an identifier. That goes for any key. One key per table by convention is called a "primary" key but that doesn't make other alternate keys any less important.
In practice we need to be able to enforce uniqueness rules against all types of data (not just numbers) to satisfy the demands of data quality and usability.

Related

In PostgreSQL what tables with no primary key used for

I've read a lot about this issue.. (also read this: Tables with no Primary Key)
it seems like there is no reason to use tables with no PK. so why does PostgreSQL allows it? can you give an example when it's good idea to not indicate PK?
I think the answer to your question lies in trying to understand what are the drawbacks of having a Primary-Key (PK) in the first place.
One obvious 'drawback' (depending on how you see it) in maintaining a PK is that it has its own overhead during an INSERT. So, in order to increase INSERT performance (assuming for e.g. the sample case is a logging table, where Querying is done offline) I would remove all Constraints / PK if possible and definitely would increase table performance. You may argue that pure logging should be done outside the DB (in a noSQL DB such as Cassandra etc.) but then again at least its possible in PostgreSQL.
A primary key is a special form of a unique constraint. A unique constraint is always backed up by an index. And the disadvantage of an index is that it takes time to update. Tables with an index have lower update, delete and insert performance.
So if you have a table that has a lot of modifications, and few queries, you can improve performance by omitting the primary key.
AFAIK, the primary key is primarily needed for the relationships between tables as a foreign key. If you have a table that is not linked to anything you don't need a primary key. In Excel spreadsheets there're no primary keys but a spreadsheet is not a relational database.

Many-to-many link table design : two foreign keys only or an additional primary key?

this is undoubtedly a newbie question, but I haven't been able
to find a satisfactory answer.
When creating a link table for many-to-many relationships, is it better to
create a unique id or only use two foreign keys of the respective tables (compound key?).
Looking at different diagrams of the Northwind database for example, I've come across
both 'versions'.
That is: a OrderDetails table with fkProductID and fkOrderID and also versions
with an added OrderDetailsID.
What's the difference? (does it also depend on the DB engine?).
What are the SQL (or Linq) advantages/disadvantages?
Thanks in advance for an explanation.
Tom
ORMs have been mandating the use of non-composite primary keys to simplify queries...
But it Makes Queries Easier...
At first glance, it makes deleting or updating a specific order/etc easier - until you realize that you need to know the applicable id value first. If you have to search for that id value based on an orders specifics then you'd have been better off using the criteria directly in the first place.
But Composite keys are Complex...
In this example, a primary key constraint will ensure that the two columns--fkProductID and fkOrderID--will be unique and indexed (most DBs these days automatically index primary keys if the clustered index doesn't already exist) using the best index possible for the table.
The lone primary key approach means the OrderDetailsID is indexed with the best index for the table (SQL Server & MySQL call them clustered indexes, to Oracle they're all just indexes), and requires an additional composite unique constraint/index. Some databases might require additional indexing beyond the unique constraint... So this makes the data model more involved/complex, and for no benefit:
Some databases, like MySQL, put a limit on the amount of space you can use for indexes.
the primary key is getting the most ideal index yet the value has no relevance to the data in the table, so making use of the index related to the primary key will be seldom if ever.
Conclusion
I don't see the benefit in a single column primary key over a composite primary key. More work for additional overhead with no net benefit...
I'm used to use PrimaryKey column. It's because the primary key uniquely identify the record.
If you have a cascade-update settings on table relations, the values of foreign keys can be changed between "SELECT" and "UPDATE/DELETE" commands sent from application.

SQL primary key - complex primary or string with concatenation?

I have a table with 16 columns. It will be most frequently used table in web aplication and it will contain about few hundred tousand rows. Database is created on sql server 2008.
My question is choice for primary key. What is quicker? I can use complex primary key with two bigint-s or i can use one varchar value but i will need to concatenate it after?
There are many more factors you must consider:
data access prevalent pattern, how are you going to access the table?
how many non-clustered indexes?
frequency of updates
pattern of updates (sequential inserts, random)
pattern of deletes
All these factors, and specially the first two, should drive your choice of the clustered key. Note that the primary key and clustered key are different concepts, often confused. Read up my answer on Should I design a table with a primary key of varchar or int? for a lengthier discussion on the criteria that drive a clustered key choice.
Without any information on your access patterns I can answer very briefly and concise, and actually correct: the narrower key is always quicker (for reasons of IO). However, this response bares absolutely no value. The only thing that will make your application faster is to choose a key that is going to be used by the query execution plans.
A primary key which does not rely on any underlying values (called a surrogate key) is a good choice. That way if the row changes, the ID doesn't have to, and any tables referring to it (Foriegn Keys) will not need to change. I would choose an autonumber (i.e. IDENTITY) column for the primary key column.
In terms of performance, a shorter, integer based primary key is best.
You can still create your clustered index on multiple columns.
Why not just a single INT auto-generated primary key? INT is 32-bit, so it can handle over 4 billion records.
CREATE TABLE Records (
recordId INT NOT NULL PRIMARY KEY,
...
);
A surrogate key might be a fine idea if there are foreign key relationships on this table. Using a surrogate will save tables that refer to it from having to duplicate all those columns in their tables.
Another important consideration is indexes on columns that you'll be using in WHERE clauses. Your performance will suffer if you don't. Make sure that you add appropriate indexes, over and above the primary key, to avoid table scans.
What do you mean quicker? if you need to search quicker, you can create index for any column or create full text search. the primary key just make sure you do not have duplicated records.
The decision relies upon its use. If you are using the table to save data mostly and not retrieve it, then a simple key. If you are mostly querying the data and it is mostly static data where the key values will not change, your index strategy needs to optimize the data to the most frequent query that will be used. Personally, I like the idea of using GUIDs for the primary key and an int for the clustered index. That allows for easy data imports. But, it really depends upon your needs.
Lot’s of variables you haven’t mentioned; whether the data in the two columns is “natural” and there is a benefit in identifying records by a logical ID, if disclosure of the key via a UI poses a risk, how important performance is (a few hundred thousand rows is pretty minimal).
If you’re not too fussy, go the auto number path for speed and simplicity. Also take a look at all the posts on the site about SQL primary key types. Heaps of info here.
Is it a ER Model or Dimensional Model. In ER Model, they should be separate and should not be surrogated. The entire record could have a single surrogate for easy references in URLs etc. This could be a hash of all parts of the composite key or an Identity.
In Dimensional Model, also they must be separate and they all should be surrogated.

Is it OK not to use a Primary Key When I don't Need one

If I don't need a primary key should I not add one to the database?
You do need a primary key. You just don't know that yet.
A primary key uniquely identifies a row in your table.
The fact it's indexed and/or clustered is a physical implementation issue and unrelated to the logical design.
You need one for the table to make sense.
If you don't need a primary key then don't use one. I usually have the need for primary keys, so I usually use them. If you have related tables you probably want primary and foreign keys.
Yes, but only in the same sense that it's okay not to use a seatbelt if you're not planning to be in an accident. That is, it's a small price to pay for a big benefit when you need it, and even if you think you don't need it odds are you will in the future. The difference is you're a lot more likely to need a primary key than to get in a car accident.
You should also know that some database systems create a primary key for you if you don't, so you're not saving that much in terms of what's going on in the engine.
No, unless you can find an example of, "This database would work so much better if table_x didn't have a primary key."
You can make an arguement to never use a primary key, if performance, data integrity, and normalization are not required. Security and backup/restore capabilities may not be needed, but eventually, you put on your big-boy pants and join the real world of database implementation.
Yes, a table should ALWAYS have a primary key... unless you don't need to uniquely identify the records in it. (I like to make absolute statements and immediately contradict them)
When would you not need to uniquely identify the records in a table? Almost never. I have done this before though for things like audit log tables. Data that won't be updated or deleted, and wont be constrained in any way. Essentially structured logging.
A primary key will always help with query performance. So if you ever need to query using the "key" to a "foreign key", or used as lookup then yes, craete a foreign key.
I don't know. I have used a couple tables where there is just a single row and a single column. Will always only be a single row and a single column. There is no foreign key relationships.
Why would I put a primary key on that?
A primary key is mainly formally defined to aid referencial Integrity, however if the table is very small, or is unlikely to contain unique data then it's an un-necessary overhead.
Defining indexes on the table can normally be used to imply a primary key without formally declaring one.
However you should consider that defining the Primary key can be useful for Developers and Schema generation or SQL Dev tools, as having the meta data helps understanding, and some tools rely on this to correctly define the Primary/foreign key relationships in the model.
Well...
Each table in a relational DB needs a primary key. As already noted, a primary key is data that identies a record uniquely...
You might get away with not having an "ID" field, if you have a N-M table that joins 2 different tables, but you can uniquely identifiy the record by the values from both columns you join. (Composite primary key)
Having a table without an primary key is against the first normal form, and has nothing to do in a relational DB
You should always have a primary key, even if it's just on ID. Maybe NoSQL is what you're after instead (just asking)?
That depends very much on how sure you can be that you don't need one. If you have just the slightest bit of doubt, add one - you'll thank yourself later. An indicator being if the data you store could be related to other data in your DB at one point.
One use case I can think of is a logging kind-of table, in which you simply dump one entry after the other (to properly process them later). You probably won't need a primary key there, if you're storing enough data to filter out the relevant messages (like a date). Of course, it's questionable to use a RDBMS for this.

What are the down sides of using a composite/compound primary key?

What are the down sides of using a composite/compound primary key?
Could cause more problems for normalisation (2NF, "Note that when a 1NF table has no composite candidate keys (candidate keys consisting of more than one attribute), the table is automatically in 2NF")
More unnecessary data duplication. If your composite key consists of 3 columns, you will need to create the same 3 columns in every table, where it is used as a foreign key.
Generally avoidable with the help of surrogate keys (read about their advantages and disadvantages)
I can imagine a good scenario for composite key -- in a table representing a N:N relation, like Students - Classes, and the key in the intermediate table will be (StudentID, ClassID). But if you need to store more information about each pair (like a history of all marks of a student in a class) then you'll probably introduce a surrogate key.
There's nothing wrong with having a compound key per se, but a primary key should ideally be as small as possible (in terms of number of bytes required). If the primary key is long then this will cause non-clustered indexes to be bloated.
Bear in mind that the order of the columns in the primary key is important. The first column should be as selective as possible i.e. as 'unique' as possible. Searches on the first column will be able to seek, but searches just on the second column will have to scan, unless there is also a non-clustered index on the second column.
I think this is a specialisation of the synthetic key debate (whether to use meaningful keys or an arbitrary synthetic primary key). I come down almost completely on the synthetic key side of this debate for a number of reasons. These are a few of the more pertinent ones:
You have to keep dependent child
tables on the end of a foriegn key
up to date. If you change the the
value of one of the primary key
fields (which can happen - see
below) you have to somehow change
all of the dependent tables where
their PK value includes these
fields. This is a bit tricky
because changing key values will
invalidate FK relationships with
child tables so you may (depending
on the constraint validation options
available on your platform) have to
resort to tricks like copying the
record to a new one and deleting the
old records.
On a deep schema the keys can get
quite wide - I've seen 8 columns
once.
Changes in primary key values can be
troublesome to identify in ETL
processes loading off the system.
The example I once had occasion to
see was an MIS application
extracting from an insurance
underwriting system. On some
occasions a policy entry would be
re-used by the customer, changing
the policy identifier. This was a
part of the primary key of the
table. When this happens the
warehouse load is not aware of what
the old value was so it cannot match
the new data to it. The developer
had to go searching through audit
logs to identify the changed value.
Most of the issues with non-synthetic primary keys revolve around issues when PK values of records change. The most useful applications of non-synthetic values are where a database schema is intended to be used, such as an M.I.S. application where report writers are using the tables directly. In this case short values with fixed domains such as currency codes or dates might reasonably be placed directly on the table for convenience.
I would recommend a generated primary key in those cases with a unique not null constraint on the natural composite key.
If you use the natural key as primary then you will most likely have to reference both values in foreign key references to make sure you are identifying the correct record.
Take the example of a table with two candidate keys: one simple (single-column) and one compound (multi-column). Your question in that context seems to be, "What disadvantage may I suffer if I choose to promote one key to be 'primary' and I choose the compound key?"
First, consider whether you actually need to promote a key at all: "the very existence of the PRIMARY KEY in SQL seems to be an historical accident of some kind. According to author Chris Date the earliest incarnations of SQL didn't have any key constraints and PRIMARY KEY was only later addded to the SQL standards. The designers of the standard obviously took the term from E.F.Codd who invented it, even though Codd's original notion had been abandoned by that time! (Codd originally proposed that foreign keys must only reference one key - the primary key - but that idea was forgotten and ignored because it was widely recognised as a pointless limitation)." [source: David Portas' Blog: Down with Primary Keys?
Second, what criteria would you apply to choose which key in a table should be 'primary'?
In SQL, the choice of key PRIMARY KEY is arbitrary and product specific. In ACE/Jet (a.k.a. MS Access) the two main and often competing factors is whether you want to use PRIMARY KEY to favour clustering on disk or whether you want the columns comprising the key to appears as bold in the 'Relationships' picture in the MS Access user interface; I'm in the minority by thinking that index strategy trumps pretty picture :) In SQL Server, you can specify the clustered index independently of the PRIMARY KEY and there seems to be no product-specific advantage afforded. The only remaining advantage seems to be the fact you can omit the columns of the PRIMARY KEY when creating a foreign key in SQL DDL, being a SQL-92 Standard behaviour and anyhow doesn't seem such a big deal to me (perhaps another one of the things they added to the Standard because it was a feature already widespread in SQL products?) So, it's not a case of looking for drawbacks, rather, you should be looking to see what advantage, if any, your SQL product gives the PRIMARY KEY. Put another way, the only drawback to choosing the wrong key is that you may be missing out on a given advantage.
Third, are you rather alluding to using an artificial/synthetic/surrogate key to implement in your physical model a candidate key from your logical model because you are concerned there will be performance penalties if you use the natural key in foreign keys and table joins? That's an entirely different question and largely depends on your 'religious' stance on the issue of natural keys in SQL.
Need more specificity.
Taken too far, it can overcomplicate Inserts (Every key MUST exist) and documentation and your joined reads could be suspect if incomplete.
Sometimes it can indicate a flawed data model (is a composite key REALLY what's described by the data?)
I don't believe there is a performance cost...it just can go really wrong really easily.
when you se it on a diagram are less readable
when you use it on a query join are less
readable
when you use it on a foregein key
you have to add a check constraint
about all the attribute have to be
null or not null (if only one is
null the key is not checked)
usualy need more storage when use it
as foreign key
some tool doesn't manage composite
key
The main downside of using a compound primary key, is that you will confuse the hell out of typical ORM code generators.