Database design advice

I have a working SQLite database that holds information about video files. The current design is as pictured below. However, the boss has decided to make some changes.
The FileProperties table currently uses the file name as the primary key. However, the PK now must be a compound key of both fileName and (file) location, which makes more sense anyway.
If this is done, what would be the best way to reference this compound key as a foreign key in the other tables? One option is to create a separate table that holds an auto-incrementing primary key, fileName and location, and then use that PK as the foreign key reference in all the other tables.
The other option is to make fileName and location a composite key in the current FileProperties table and add a new auto-incrementing, unique field that the other tables can reference.
I haven't had much practical experience with designing databases, so any advice on my problem or my current design would be very welcome.

Absolutely use an auto-incrementing primary key. To ensure data integrity, create a unique index across the (filename,location) columns.
The following wiki article talks briefly about the pros and cons of a natural key. A natural key is a key taken directly from the data. In your case, that would be the composite key of (filename,location). In short, a natural key reduces physical space required by the data, at the cost of propagating changes to the key across all relations.
I (nearly) always have an auto-incrementing id on a table, even if there is a natural key available to be used.
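
A minimal sketch of that suggestion, using Python's built-in sqlite3 module since the question is about SQLite. The fileId column, the VideoInfo child table and the add_file helper are hypothetical names chosen for illustration; only FileProperties, fileName and location come from the question.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE FileProperties (
        fileId   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key (assumed name)
        fileName TEXT NOT NULL,
        location TEXT NOT NULL,
        UNIQUE (fileName, location)                  -- natural key stays enforced
    );
    -- Hypothetical child table: it references the single-column surrogate key.
    CREATE TABLE VideoInfo (
        fileId INTEGER NOT NULL REFERENCES FileProperties (fileId),
        codec  TEXT
    );
""")

def add_file(name, location):
    """Insert a file row and return its generated fileId."""
    cur = conn.execute(
        "INSERT INTO FileProperties (fileName, location) VALUES (?, ?)",
        (name, location),
    )
    return cur.lastrowid

first_id = add_file("clip.mp4", "/videos/a")
second_id = add_file("clip.mp4", "/videos/b")  # same name, other location: allowed

try:
    add_file("clip.mp4", "/videos/a")          # same (fileName, location) pair
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("INSERT INTO VideoInfo (fileId, codec) VALUES (?, ?)", (first_id, "h264"))
file_count = conn.execute("SELECT COUNT(*) FROM FileProperties").fetchone()[0]
```

The other tables only ever carry fileId, so a renamed or moved file touches one row in FileProperties and nothing else.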

Add auto-incremented FileId primary key.
Add unique constraint for Location + FileName.
Avoid using compound primary keys.

Related

Foreign Key referencing PK vs Foreign Key referencing a unique key

What is considered as a better/standard approach:
A foreign key referencing the primary key of another table (the PK is auto-increment numeric values).
A foreign key referencing the unique key of another table (the unique key column holds meaningful data rather than auto-generated values).
Are there any performance benefits of one approach over the other?
Ideally speaking, the unique key column should have been the PK too, but that is something that I cannot change.

I like the idea of using a unique key as a primary key, especially if you have to store the unique key data for other purposes anyway. It's sad that's something you cannot change, so I'm not sure where you're going with this question. But the only performance issues that come to mind would be the size of the keys, as some datatypes obviously use more storage than other datatypes which would eventually affect query performance. Either way should enforce referential integrity and prevent orphaned records.

Should every table have a primary key?

I read somewhere that every table should have a primary key to fulfill 1NF.
I have a tbl_friendship table.
There are 2 fields in the table: Owner and Friend.
Owner and Friend are both foreign keys to the auto-increment id field in tbl_user.
Should tbl_friendship have a primary key?
Should I create an auto-increment id field in tbl_friendship and make it the primary key?

Primary keys can apply to multiple columns! In your example, the primary key should be on both columns, for example (Owner, Friend), especially since Owner and Friend are foreign keys to a users table rather than, say, actual names. (Personally, my identity columns use the "Id" naming convention, so I would have (OwnerId, FriendId).)
Personally I believe every table should have a primary key, but you'll find others who disagree.
Here's an article I wrote on the topic of normal forms.
http://michaeljswart.com/2011/01/ridiculously-unnormalized-database-schemas-part-zero/

Yes every table should have a primary key.
Yes you should create surrogate key.. aka an auto increment pk field.
You should also make "Friend" an FK to that auto increment field.
If you think that you are going to "rekey" in the future you might want to look into using natural keys, which are fields that naturally identify your data. The key to this is while coding always use the natural identifiers, and then you create unique indexes on those natural keys. In the future if you have to re-key you can, because your ux guarantees your data is consistent.
I would only do this if you absolutely have to, because it increases complexity, in your code and data model.

It is not clear from your description, but are owner and friend foreign keys, and can there be only one relationship between any given pair? If so, the two foreign key columns are a perfect candidate for a natural primary key.
Another option is to use surrogate key (extra auto-incremented column as you suggested). Take a look here for an in-depth discussion.

A primary key can be something abstract as well. In this case, each tuple (owner, friend), e.g. ("Dave","Matt") can form a unique entry and therefore be your primary key. In that case, it would be useful not to use names, but keys referencing another table. If you guarantee, that these tuples can't have duplicates, you have a valid primary key.
For processing reasons it might be useful to introduce a special primary key, like an autoincrement field (e.g. in MySQL) or using a sequence with Oracle.

To comply with 1NF (and there is no complete agreement on what defines 1NF), yes, you should have a primary key identified on each table. This is necessary to provide for uniqueness of each record.
http://en.wikipedia.org/wiki/First_normal_form
In general, you can create a primary key in many ways, one of which is to have an auto-increment column, another is to have a column with GUIDs, another is to have two or more columns that will identify a row uniquely when taken together.

Your table will be much easier to manage in the long term if it has a primary key. At the very least, you need to uniquely identify each record in the table. The field that is used to uniquely identify each record might as well be the primary key.

Yes every table should have (at least one) key. Duplicating rows in any table is undesirable for lots of reasons so put the constraint on those two columns.
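
As a rough illustration of putting the key on the two columns, here is a sketch using Python's sqlite3 module. The table names follow the question; the OwnerId and FriendId column names follow the naming suggested in an earlier answer, and the sample users are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE tbl_user (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL
    );
    CREATE TABLE tbl_friendship (
        OwnerId  INTEGER NOT NULL REFERENCES tbl_user (id),
        FriendId INTEGER NOT NULL REFERENCES tbl_user (id),
        PRIMARY KEY (OwnerId, FriendId)   -- one row per (owner, friend) pair
    );
    INSERT INTO tbl_user (name) VALUES ('Dave'), ('Matt');  -- ids 1 and 2
""")

conn.execute("INSERT INTO tbl_friendship VALUES (1, 2)")
conn.execute("INSERT INTO tbl_friendship VALUES (2, 1)")  # reverse direction is a different pair

try:
    conn.execute("INSERT INTO tbl_friendship VALUES (1, 2)")  # repeated pair
    repeat_rejected = False
except sqlite3.IntegrityError:
    repeat_rejected = True

friendship_count = conn.execute("SELECT COUNT(*) FROM tbl_friendship").fetchone()[0]
```

Note that this key treats (1, 2) and (2, 1) as distinct rows; whether that is desirable depends on whether friendship is meant to be directional.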

Why we can't have more than one primary key?

I know there can't be more than one primary key in a table, but what is the technical reason?

Pulled directly from SO:
You can only have one primary key, but you can have multiple columns in your primary key.
You can also have Unique Indexes on your table, which will work a bit like a primary key in that they will enforce unique values, and will speed up querying of those values.
Primary in the context of Primary Key means that it's ranked first in importance. Therefore, there can only be one key. It's by definition.
It's also usually the key for which the index has the actual data attached to it, that is, the data is stored with the primary key index. Other indices contain only the data that's being indexed, and perhaps some Included Columns.

In fact E.F.Codd (the inventor of the Relational Database Model) [1] originated the term "primary key" to mean any number of keys of a relation - not just one. He made it clear that it was quite possible to have more than one such key. His suggestion was that the database designer could choose one key as a preferred identifier ("the primary key") - but in principle this was optional and such a choice was "arbitrary" (that was his word). Because all keys enjoy the same properties as each other there is no fundamental need to choose any one over another.
Later on [2] what Codd originally called primary keys became known as candidate keys and the one key singled out as the preferred one became known as the "primary" key. This was not really a fundamental shift however because a primary key means exactly the same as a candidate key. Since they are equivalent concepts it doesn't really mean anything important when we say there "must" only be one primary key. If you have more than one candidate key you could quite reasonably call more than one of them "primary" if you prefer because it doesn't make any logical or practical difference to the meaning and function of the database.
It has been argued (by me among others) that the idea of designating one key per table as "primary" is utterly superfluous and sometimes a positive hindrance to a good understanding of database design and data integrity issues. However, the concept is so entrenched we are probably stuck with it.
So the proper answer to your question is "convention" and "convenience". There is no good technical reason at all.
[1] A Relational Model of Data for Large Shared Data Banks (1970)
[2] E.g. in "Further Normalization of the Relational Data Base Model" (1971)

Well, it's called "primary" for a reason. As in, its the one key used to uniquely identify the record... and there "can be only one".
You could certainly mimic a second "primary" key by placing a unique index on one or more other fields, but for the purposes of your database server it's generally only necessary if your key isn't unique enough to cross database servers in a merge replication situation (i.e. multi-master).

PRIMARY KEY is usually equivalent to UNIQUE INDEX NOT NULL. So you can effectively have multiple "primary keys" on a single table.
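
A small sketch of that equivalence, using Python's sqlite3 module. The person table and its columns are hypothetical. It only shows that a NOT NULL UNIQUE column rejects duplicates and NULLs much like the declared primary key does; it does not claim the two are identical in every database product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id    INTEGER PRIMARY KEY,      -- the one key labelled "primary"
        ssn   TEXT NOT NULL UNIQUE,     -- a second candidate key
        email TEXT NOT NULL UNIQUE      -- a third candidate key
    )
""")

def try_insert(row):
    """Return True if the row was accepted, False on a constraint violation."""
    try:
        conn.execute("INSERT INTO person (id, ssn, email) VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert((1, "111", "a@example.com")),  # first row
    try_insert((1, "222", "b@example.com")),  # duplicate id
    try_insert((2, "111", "c@example.com")),  # duplicate ssn
    try_insert((3, "333", "a@example.com")),  # duplicate email
    try_insert((4, None, "d@example.com")),   # NULL in a NOT NULL UNIQUE column
]
```

Only the first insert is accepted; each of the other rows violates one of the three keys.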

The primary key is the key which uniquely identifies that record.
I'm not sure if you're asking if a) there can be a single primary key spanning multiple columns, or b) if you can have multiple keys which uniquely identify the record.
The first is possible, known as a composite primary key.
The second is possible also, but only one is called the primary key.

Because the "primary" in "primary key" denotes its, mmm, singularity(?).
But if you need more, you can define UNIQUE keys which have quite the same behaviour.

The technical reason is that there can be only one primary. Otherwise it wouldn't be called so.
However a primary key can include several columns - see 7.5.2. Multiple-Column Indexes

The primary key is the one (of possibly many) unique identifiers of a particular row in a table. The other unique identifiers, which were not designated as the primary one, are hence often referred to as secondary unique indexes.

A primary key allows us to uniquely identify each record in the table. A table has only one primary key, but that key can span 2 or more columns, in which case it is called a composite primary key. "When you define more than one column as your primary key on a table, it is called a composite primary key."

A primary key defines record uniqueness. To have two different measures of uniqueness can be problematic. For example, if you have primary keys A and B and you insert records where A is the same and B is different, then are those records the same or different? If you consider them different, then make your primary a composite of A and B. If you consider them the same record, then just use A or B as the primary key.

We can create more than one non-clustered index on a table, and these are typically made on non-primary-key columns used in JOIN, WHERE, and ORDER BY clauses.
A table can have only one clustered index, and that is on the primary key. So if we had two primary keys there would be ambiguity.
In referential integrity, too, there would be ambiguity in selecting one of the two primary keys.

Only one primary key is possible on a table because the primary key creates a clustered index on the table, which stores the data physically at the leaf nodes, ordered by the primary key column.
If we tried to create another primary key on that table, there would be a major problem with the data, because we cannot store the same table data in two different orders.

Database "key/ID" design ideas, Surrogate Key, Primary Key, etc

So I've seen several mentions of a surrogate key lately, and I'm not really sure what it is and how it differs from a primary key.
I always assumed that ID was my primary key in a table like this:
Users
ID, Guid
FirstName, Text
LastName, Text
SSN, Int
However, Wikipedia defines a surrogate key as "A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data."
According to Wikipedia, it looks like ID is my surrogate key, and my primary key might be SSN+ID? Is this right? Is that a bad table design?
Assuming that table design is sound, would something like this be bad, for a table where the data didn't have anything unique about it?
LogEntry
ID, Guid
LogEntryID, Int [sql identity field +1 every time]
LogType, Int
Message, Text

No, your ID can be both a surrogate key (which just means it's not "derived from application data", e.g. an artificial key), and it should be your primary key, too.
The primary key is used to uniquely and safely identify any row in your table. It has to be stable, unique, and NOT NULL - an "artificial" ID usually has those properties.
I would normally recommend against using "natural" or real data for primary keys - are you REALLY 150% sure it's NEVER going to change?? The Swiss equivalent of the SSN, for instance, changes each time a woman marries (or gets divorced) - hardly an ideal candidate. And it's not guaranteed to be unique, either......
To spare yourself all that grief, just use a surrogate (artificial) ID that is system-defined, unique, and never changes and never has any application meaning (other than being your unique ID).
Scott Ambler has a pretty good article here which has a "glossary" of all the various keys and what they mean - you'll find natural, surrogate, primary key and a few more.

First, a Surrogate key is a key that is artificially generated within the database, as a unique value for each row in a table, and which has no dependency whatsoever on any other attribute in the table.
Now, the phrase Primary Key is a red herring. Whether a key is primary or an alternate doesn't mean anything. What matters is what the key is used for. Keys can serve two functions which are fundamentally inconsistent with one another.
They are first and foremost there to ensure the integrity and consistency of your data! Each row in a table represents an instance of whatever entity that table is defined to hold data for. No Surrogate Key, by definition, can ever perform this function. Only a properly designed natural Key can do this. (If all you have is a surrogate key, you can always add another row with every other attributes exactly identical to an existing row, as long as you give it a different surrogate key value)
Secondly, they are there to serve as references (pointers) for the foreign keys in other tables which are children entities of an entity in the table with the Primary Key. A natural key (especially if it is a composite of multiple attributes) is not a good choice for this function because it would mean that A) the foreign keys in all the child tables would also have to be composite keys, making them very wide, and thereby decreasing performance of all constraint operations and of SQL joins, and B) if the value of the key changed in the main table, you would be required to do cascading updates on every table where the value was represented as a FK.
So the answer is simple... Always (wherever you care about data integrity/consistency) use a natural key and, where necessary, use both! When the natural key is a composite, or long, or not stable enough, add an alternate Surrogate key (as auto-incrementing integer for example) for use as targets of FKs in child tables. But at the risk of losing data consistency of your table, DO NOT remove the natural key from the main table.
To make this crystal clear let's make an example.
Say you have a table with Bank accounts in it... A natural Key might be the Bank Routing Number and the Account Number at the bank. To avoid using this twin composite key in every transaction record in the transactions table you might decide to put an artificially generated surrogate key on the BankAccount table which is just an integer. But you better keep the natural Key! If you didn't, if you did not also have the composite natural key, you could quite easily end up with two rows in the table as follows
id  BankRoutingNumber  BankAccountNumber  BankBalance
1   12345678932154     9876543210123      $123.12
2   12345678932154     9876543210123      ($3,291.62)
Now, which one is right?
To marc from comments below, What good does it do you to be able to "identify the row"?? No good at all, it seems to me, because what we need to be able to identify is which bank account the row represents! Identifying the row is only important for internal database technical functions, like joins in queries, or for FK constraint operations, which, if/when they are necessary, should be using a surrogate key anyway, not the natural key.
You are right in that a poor choice of a natural key, or sometimes even the best available choice of a natural key, may not be truly unique, or guaranteed to prevent duplicates. But any choice is better than no choice, as it will at least prevent duplicate rows for the same values in the attributes chosen as the natural key. These issues can be kept to a minimum by the appropriate choice of key attributes, but sometimes they are unavoidable and must be dealt with. But it is still better to do so than to allow incorrect, inaccurate or redundant data into the database.
As to "ease of use" If all you are using the natural key for is to constrain the insertion of duplicate rows, and you are using another, surrogate, key as the target for FK constraints, I do not see any ease of use issues of concern.
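
The bank account example can be sketched with Python's sqlite3 module as follows. Table and column names follow the example above; storing the balance as integer cents is an assumption made here to keep the sketch simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BankAccount (
        id                INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate, FK target
        BankRoutingNumber TEXT NOT NULL,
        BankAccountNumber TEXT NOT NULL,
        BankBalance       INTEGER NOT NULL,                   -- cents (assumption)
        UNIQUE (BankRoutingNumber, BankAccountNumber)         -- natural key is kept
    );
""")

insert_sql = (
    "INSERT INTO BankAccount (BankRoutingNumber, BankAccountNumber, BankBalance) "
    "VALUES (?, ?, ?)"
)
conn.execute(insert_sql, ("12345678932154", "9876543210123", 12312))

try:
    # Same routing and account number with a conflicting balance.
    conn.execute(insert_sql, ("12345678932154", "9876543210123", -329162))
    second_row_accepted = True
except sqlite3.IntegrityError:
    second_row_accepted = False

rows = conn.execute("SELECT id, BankBalance FROM BankAccount").fetchall()
```

With the UNIQUE constraint in place the conflicting second row is rejected; without it, both rows would be stored under different id values.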

Wow, you opened a can of worms with this question. Database purists will tell you never to use surrogate keys (like you have above). On the other hand, surrogate keys can have some tremendous benefits. I use them all the time.
In SQL Server, a surrogate key is typically an auto-increment Identity value that SQL Server generates for you. It has NO relationship to the actual data stored in the table. The opposite of this is a Natural key. An example might be Social Security number. This does have a relationship to the data stored in the table. There are benefits to natural keys, but, IMO, the benefits to using surrogate keys outweigh natural keys.
I noticed in your example, you have a GUID for a primary key. You generally want to stay away from GUIDs as primary keys. They are big, bulky and can often be inserted into your database in a random way, causing major fragmentation.
Randy

The reason that database purists get all up in arms about surrogate keys is because, if used improperly, they can allow data duplication, which is one of the evils that good database design is meant to banish.
For instance, suppose that I had a table of email addresses for a mailing list. I would want them to be unique, right? There's no point in having 2, 3, or n entries of the same email address. If I use email_address as my primary key ( which is a natural key -- it exists as data independently of the database structure you've created ), this will guarantee that I will never have a duplicate email address in my mailing list.
However, if I have a field called id as a surrogate key, then I can have any number of duplicate email addresses. This becomes bad if there are then 10 rows of the same email address, all with conflicting subscription information in other columns. Which one is correct, if any? There's no way to tell! After that point, your data integrity is borked. There's no way to fix the data but to go through the records one by one, asking people what subscription information is really correct, etc.
The reason why non-purists want it is because it makes it easy to use standardized code, because you can rely on referring to a single database row with an integer value. If you had a natural key of, say, the set ( client_id, email, category_id ), the programmer is going to hate coding around this instance! It kind of breaks the encapsulation of class-based coding, because it requires the programmer to have deep knowledge of table structure, and a delete method may have different code for each table. Yuck!
So obviously this example is over-simplified, but it illustrates the point.

Users Table
Using a Guid as a primary key for your Users table is perfect.
LogEntry table
Unless you plan to expose your LogEntry data to an external system or merge it with another database, I would simply use an incrementing int rather than a Guid as the primary key. It's easier to work with and will use slightly less space, which could be significant in a huge log stretching several years.

The primary key is whatever you make it. Whatever you define as the primary key is the primary key. Usually it's an integer ID field.
The surrogate key is also this ID field. It's a surrogate for the natural key, which defines uniqueness in terms of your application data.
The idea behind having an integer ID as the primary key (even if it doesn't really mean anything) is for indexing purposes. You would then probably define a natural key as a unique constraint on your table. This way you get the best of both worlds: fast indexing with your ID field, and each row still maintains its natural uniqueness.
That said, some people swear by just using a natural key.

There are actually three kinds of keys to talk about. The primary key is what is used to uniquely identify every row in a table. The surrogate key is an artificial key that is created with that property. A natural key is a primary key which is derived from the actual real life data.
In some cases the natural key may be unwieldy so a surrogate key may be created to be used as a foreign key, etc. For example, in a log or diary the PK might be the date, time, and the full text of the entry (if it is possible to add two entries at the exact same time). Obviously it would be a bad idea to use all of that every time that you wanted to identify a row, so you might make a "log id". It might be a sequential number (the most common) or it might be the date plus a sequential number (like 20091222001) or it might be something else. Some natural keys may work well as a primary key though, such as vehicle VIN numbers, student ID numbers (if they are not reused), or in the case of joining tables the PKs of the two tables being joined.
This is just an overview of table key selection. There's a lot to consider, although in most shops you'll find that they go with, "add an identity column to every table and that's our primary key". You then get all of the problems that go with that.
In your case I think that a LogEntryID for your log items seems reasonable. Is the ID an FK to the Users table? If not then I might question having both the ID and the LogEntryID in the same table as they are redundant. If it is, then I would change the name to user_id or something similar.

What are the down sides of using a composite/compound primary key?

Could cause more problems for normalisation (2NF, "Note that when a 1NF table has no composite candidate keys (candidate keys consisting of more than one attribute), the table is automatically in 2NF")
More unnecessary data duplication. If your composite key consists of 3 columns, you will need to create the same 3 columns in every table, where it is used as a foreign key.
Generally avoidable with the help of surrogate keys (read about their advantages and disadvantages)
I can imagine a good scenario for composite key -- in a table representing a N:N relation, like Students - Classes, and the key in the intermediate table will be (StudentID, ClassID). But if you need to store more information about each pair (like a history of all marks of a student in a class) then you'll probably introduce a surrogate key.

There's nothing wrong with having a compound key per se, but a primary key should ideally be as small as possible (in terms of number of bytes required). If the primary key is long then this will cause non-clustered indexes to be bloated.
Bear in mind that the order of the columns in the primary key is important. The first column should be as selective as possible i.e. as 'unique' as possible. Searches on the first column will be able to seek, but searches just on the second column will have to scan, unless there is also a non-clustered index on the second column.
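
One way to see the effect of column order is to ask the database for its query plan. The sketch below does this in SQLite through Python's sqlite3 module; the enrolment table is hypothetical (loosely based on the Students - Classes example in an earlier answer), and SQLite's planner and plan wording are not the same as the clustered-index engine this answer describes, so treat it as an illustration rather than a reproduction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enrolment (
        StudentID INTEGER NOT NULL,
        ClassID   INTEGER NOT NULL,
        mark      TEXT,
        PRIMARY KEY (StudentID, ClassID)   -- StudentID is the leading column
    )
""")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]

# Filter on the leading key column: the composite key's index can be searched.
by_first = plan("SELECT * FROM enrolment WHERE StudentID = 1")

# Filter on the second key column only: no index leads with ClassID.
by_second = plan("SELECT * FROM enrolment WHERE ClassID = 1")

print(by_first)
print(by_second)
```

The first plan should report an index search and the second a table scan; adding a separate index on ClassID would let the second query search as well.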

I think this is a specialisation of the synthetic key debate (whether to use meaningful keys or an arbitrary synthetic primary key). I come down almost completely on the synthetic key side of this debate for a number of reasons. These are a few of the more pertinent ones:
You have to keep dependent child tables on the end of a foreign key up to date. If you change the value of one of the primary key fields (which can happen - see below) you have to somehow change all of the dependent tables where their PK value includes these fields. This is a bit tricky because changing key values will invalidate FK relationships with child tables, so you may (depending on the constraint validation options available on your platform) have to resort to tricks like copying the record to a new one and deleting the old records.
On a deep schema the keys can get quite wide - I've seen 8 columns once.
Changes in primary key values can be troublesome to identify in ETL processes loading off the system. The example I once had occasion to see was an MIS application extracting from an insurance underwriting system. On some occasions a policy entry would be re-used by the customer, changing the policy identifier. This was a part of the primary key of the table. When this happens the warehouse load is not aware of what the old value was, so it cannot match the new data to it. The developer had to go searching through audit logs to identify the changed value.
Most of the issues with non-synthetic primary keys revolve around issues when PK values of records change. The most useful applications of non-synthetic values are where a database schema is intended to be used, such as an M.I.S. application where report writers are using the tables directly. In this case short values with fixed domains such as currency codes or dates might reasonably be placed directly on the table for convenience.

I would recommend a generated primary key in those cases with a unique not null constraint on the natural composite key.
If you use the natural key as primary then you will most likely have to reference both values in foreign key references to make sure you are identifying the correct record.
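
To illustrate the second point, here is a sketch of what a child table looks like when the natural composite key is the primary key. It uses Python's sqlite3 module; the parent and child table names and the fileName and location columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
    CREATE TABLE parent (
        fileName TEXT NOT NULL,
        location TEXT NOT NULL,
        PRIMARY KEY (fileName, location)
    );
    -- The child has to repeat every column of the parent's key.
    CREATE TABLE child (
        fileName TEXT NOT NULL,
        location TEXT NOT NULL,
        note     TEXT,
        FOREIGN KEY (fileName, location) REFERENCES parent (fileName, location)
    );
    INSERT INTO parent VALUES ('clip.mp4', '/videos/a');
""")

def add_child(file_name, location):
    """Return True if the child row was accepted, False if the FK rejected it."""
    try:
        conn.execute(
            "INSERT INTO child (fileName, location) VALUES (?, ?)",
            (file_name, location),
        )
        return True
    except sqlite3.IntegrityError:
        return False

matching = add_child("clip.mp4", "/videos/a")        # both values match a parent row
wrong_location = add_child("clip.mp4", "/videos/b")  # only one value matches
```

With a generated single-column key on the parent, the child would carry one integer column instead of repeating both values.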

Take the example of a table with two candidate keys: one simple (single-column) and one compound (multi-column). Your question in that context seems to be, "What disadvantage may I suffer if I choose to promote one key to be 'primary' and I choose the compound key?"
First, consider whether you actually need to promote a key at all: "the very existence of the PRIMARY KEY in SQL seems to be an historical accident of some kind. According to author Chris Date the earliest incarnations of SQL didn't have any key constraints and PRIMARY KEY was only later added to the SQL standards. The designers of the standard obviously took the term from E.F.Codd who invented it, even though Codd's original notion had been abandoned by that time! (Codd originally proposed that foreign keys must only reference one key - the primary key - but that idea was forgotten and ignored because it was widely recognised as a pointless limitation)." [source: David Portas' Blog: Down with Primary Keys?]
Second, what criteria would you apply to choose which key in a table should be 'primary'?
In SQL, the choice of which key to make the PRIMARY KEY is arbitrary and product specific. In ACE/Jet (a.k.a. MS Access) the two main and often competing factors are whether you want to use PRIMARY KEY to favour clustering on disk or whether you want the columns comprising the key to appear in bold in the 'Relationships' picture in the MS Access user interface; I'm in the minority by thinking that index strategy trumps pretty picture :) In SQL Server, you can specify the clustered index independently of the PRIMARY KEY and there seems to be no product-specific advantage afforded. The only remaining advantage seems to be the fact that you can omit the columns of the PRIMARY KEY when creating a foreign key in SQL DDL, which is SQL-92 Standard behaviour and anyhow doesn't seem such a big deal to me (perhaps another one of the things they added to the Standard because it was a feature already widespread in SQL products?) So, it's not a case of looking for drawbacks; rather, you should be looking to see what advantage, if any, your SQL product gives the PRIMARY KEY. Put another way, the only drawback to choosing the wrong key is that you may be missing out on a given advantage.
Third, are you rather alluding to using an artificial/synthetic/surrogate key to implement in your physical model a candidate key from your logical model because you are concerned there will be performance penalties if you use the natural key in foreign keys and table joins? That's an entirely different question and largely depends on your 'religious' stance on the issue of natural keys in SQL.

Need more specificity.
Taken too far, it can overcomplicate Inserts (Every key MUST exist) and documentation and your joined reads could be suspect if incomplete.
Sometimes it can indicate a flawed data model (is a composite key REALLY what's described by the data?)
I don't believe there is a performance cost...it just can go really wrong really easily.

When you see it on a diagram, it is less readable.
When you use it in a query join, it is less readable.
When you use it in a foreign key, you have to add a check constraint so that the attributes are either all null or all not null (if only one is null, the key is not checked).
It usually needs more storage when used as a foreign key.
Some tools don't handle composite keys.
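
The point about partially null composite foreign keys can be sketched in SQLite via Python's sqlite3 module. SQLite does not require a parent row when any child key column is NULL; other products may offer different match options, so this is an illustration under that assumption. All table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE parent (
        a TEXT NOT NULL,
        b TEXT NOT NULL,
        PRIMARY KEY (a, b)
    );
    -- Composite FK with nullable columns and no extra guard.
    CREATE TABLE child_unchecked (
        a TEXT,
        b TEXT,
        FOREIGN KEY (a, b) REFERENCES parent (a, b)
    );
    -- Same FK plus a CHECK: both columns NULL, or both NOT NULL.
    CREATE TABLE child_checked (
        a TEXT,
        b TEXT,
        FOREIGN KEY (a, b) REFERENCES parent (a, b),
        CHECK ((a IS NULL) = (b IS NULL))
    );
""")

def try_insert(table, a, b):
    """Return True if the row was accepted, False on a constraint violation."""
    try:
        # Table name comes from this script only, never from user input.
        conn.execute(f"INSERT INTO {table} (a, b) VALUES (?, ?)", (a, b))
        return True
    except sqlite3.IntegrityError:
        return False

# The parent table is empty, so no child row has a matching parent.
half_null_unchecked = try_insert("child_unchecked", "no-such-parent", None)
half_null_checked = try_insert("child_checked", "no-such-parent", None)
both_null_checked = try_insert("child_checked", None, None)
```

The half-null row slips into child_unchecked because the foreign key is skipped, while child_checked rejects it and still allows a fully null (absent) reference.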

The main downside of using a compound primary key, is that you will confuse the hell out of typical ORM code generators.
