This question already has answers here:
Surrogate vs. natural/business keys [closed]
(19 answers)
Why would one consider using Surrogate keys vs Natural with ON UPDATE CASCADE?
(1 answer)
Closed 7 months ago.
Recently I Inherited a huge app from somebody who left the company.
This app used a SQL server DB .
Now the developer always defines an int base primary key on tables. for example even if Users table has a unique UserName field , he always added an integer identity primary key.
This is done for every table no matter if other fields could be unique and define primary key.
Do you see any benefits whatsoever on this? using UserName as primary key vs adding UserID(identify column) and set that as primary key?
I feel like I have to add add another element to my comments, which started to produce an essay of comments, so I think it is better that I post it all as an answer instead.
Sometimes there are domain specific reasons why a candidate key is not a good candidate for joins (maybe people change user names so often that the required cascades start causing performance problems). But another reason to add an ever-increasing surrogate is to make it the clustered index. A static and ever-increasing clustered index alleviates a high-cost IO operation known as a page split. So even with a good natural candidate key, it can be useful to add a surrogate and cluster on that. Read this for further details.
But if you add such a surrogate, recognise that the surrogate is purely internal, it is there for performance reasons only. It does not guarantee the integrity of your data. It has no meaning in the model, unless it becomes part of the model. For example, if you are generating invoice numbers as an identity column, and sending those values out into the real world (on invoice documents/emails/etc), then it's not a surrogate, it's part of the model. It can be meaningfully referenced by the customer who received the invoice, for example.
One final thing that is typically left out of this discussion is one particular aspect of join performance. It is often said that the primary key should also be narrow, because it can make joins more performant, as well as reducing the size of non-clustered indexes. And that's true.
But a natural primary key can eliminate the need for a join in the first place.
Let's put all this together with an example:
create table Countries
(
countryCode char(2) not null primary key clustered,
countryName varchar(64) not null
);
insert Countries values
('AU', 'Australia'),
('FR', 'France');
create table TourLocations
(
tourLocationName varchar(64) not null,
tourLocationId int identity(1,1) unique clustered,
countryCode char(2) not null foreign key references Countries(countryCode),
primary key (countryCode, tourLocationName)
);
insert TourLocations (TourLocationName, countryCode) values
('Bondi Beach', 'AU'),
('Eiffel Tower', 'FR')
I did not add a surrogate key to Countries, because there aren't many rows and we're not going to be constantly inserting new rows. I already know what all the countries are, and they don't change very often.
On the TourLocations table I have added an identity and clustered on it. There could be very many tour locations, changing all the time.
But I still must have a natural key on TourLocations. Otherwise I could insert the same tour location name with the same country twice. Sure, the Id's will be different. But the Id's don't mean anything. As far as any real human is concerned, two tour locations with the same name and country code are completely indistinguishable. Do you intend to have actual users using the system? Then you've got a problem.
By putting the same country and location name in twice I haven't created two facts in my database. I have created the same fact twice! No good. The natural key is necessary. In this sense The Impaler's answer is strictly, necessarily, wrong. You cannot not have a natural key. If the natural key can't be defined as anything other than "every meaningful column in the table" (that is to say, excluding the surrogate), so be it.
OK, now let's investigate the claim that an int identity key is advantageous because it helps with joins. Well, in this case my char(2) country code is narrower than an int would have been.
But even if it wasn't (maybe we think we can get away with a tinyint), those country codes are meaningful to real people, which means a lot of the time I don't have to do the join at all.
Suppose I gave the results of this query to my users:
select countryCode, tourLocationName
from TourLocations
order by 1, 2;
Very many people will not need me to provide the countries.countryName column for them to know which country is represented by the code in each of those rows. I don't have to do the join.
When you're dealing with a specific business domain that becomes even more likely. Meaningful codes are understood by the domain users. They often don't need to see the long description columns from the key table. So in many cases no join is required to give the users all of the information they need.
If I had foreign keyed to an identity surrogate I would have to do the join, because the identity surrogate doesn't mean anything to anyone.
You are talking about the difference between synthetic and natural keys.
In my [very] personal opinion, I would recommend to always use synthetic keys (and always call it id). The main problem is that natural keys are never unique; they are unique in theory, yes, but in the real world there are a myriad of unexpected and inexorable events that will make this false.
In database design:
Natural keys correspond to values present in the domain model. For example, UserName, SSN, VIN can be considered natural keys.
Synthetic keys are values not present in the domain model. They are just numeric/string/UUID values that have no relationship with the actual data. They only serve as a unique identifiers for the rows.
I would say, stick to synthetic keys and sleep well at night. You never know what the Marketing Department will come up with on Monday, and suddenly "the username is not unique anymore".
Yes having a dedicated int is a good thing for PK use.
you may have multiple alternate keys, that's ok too.
two great reasons for it:
it is performant
it protects against key mutation ( editing a name etc. )
A username or any such unique field that holds meaningful data is subject to changes. A name may have been misspelled or you might want to edit a name to choose a better one, etc. etc.
Primary keys are used to identify records and, in conjunction with foreign keys, to connect records in different tables. They should never change. Therefore, it is better to use a meaningless int field as primary key.
By meaningless I mean that apart from being the primary key it has no meaning to the users.
An int identity column has other advantages over a text field as primary key.
It is generated by the database engine and is guaranteed to be unique in multi-user scenarios.
it is faster than a text column.
Text can have leading spaces, hidden characters and other oddities.
There are multiple kinds of text data types, multiple character sets and culture dependent behaviors resulting in text comparisons not always working as expected.
int primary keys generated in ascending order have a superior performance in conjunction with clustered primary keys (which is a SQL-Server specialty).
Note that I am talking from a database point of view. In the user interface, users will prefer identifying entries by name or e-mail address, etc.
But commands like SELECT, INSERT, UPDATE or DELETE will always identify records by the primary key.
This subject - quite much like gulivar travels and wars being fought over which end of the egg you supposed to crack open to eat.
However, using the SAME "id" name for all tables, and autonumber? Yes, it is LONG establihsed choice.
There are of course MANY different views on this subject, and many advantages and disavantages.
Regardless of which choice one perfers (or even needs), this is a long established concept in our industry. In fact SharePoint tables use "ID" and autonumber by defualt. So does ms-access, and there probably more that do this.
The simple concpet?
You can build your tables with the PK and child tables with forighen keys.
At that point you setup your relationships between the tables.
Now, you might decide to add say some invoice number or whatever. Rules might mean that such invoice number is not duplicated.
But, WHY do we care of you have some "user" name, or some "invoice" number or whatever. Why should that fact effect your relational database model?
You mean I don't have a user name, or don't have a invoice number, and the whole database and relatonships don't work anymore? We don't care!!!!
The concept of data, even required fields, or even a column having to be unique ?
That has ZERO to do with a working relational data model.
And maybe you decide that invoice number is not generated until say sent to the customer. So, the fact of some user name, invoice number or whatever? Don't care - you can have all kinds of business rules for those numbers, but they have ZERO do to do with the fact that you designed a working relational data model based on so called "surrogate" or sometime called synthetic keys.
So, once you build that data model - even with JUST the PK "id" and FK (forighen keys), you are NOW free to start adding columns and define what type of data you going to put in each table. but, what you shove into each table has ZERO to do with that working related data model. They are to be thought as seperate concpets.
So, if you have a user name - add that column to the table. If you don't want users name, remove the column. As such data you store in the table has ZERO to do with the automatic PK ID you using - it not really any different then say what area of memory the computer going to allocate to load that data. Basic data operations of the system is has nothing to do with having build database with relationships that simple exist. And the data columns you add after having built those relationships is up to you - but will not, and should not effect the operation of the database and relationships you built and setup. Not only are these two concepts separate, but they free the developer from having to worry about the part that maintains the relationships as opposed to data column you add to such tables to store user data.
I mean, in json data, xml? We often have a master + child table relationship. We don't care how that relationship is maintained - but only that it exists.
Thus yes, all tables have that pk "ID". Even better? in code, you NEVER have to guess what the PK id is - it always the same!!!
So, data and columns you put and toss into a table? Those columns and data have zero to do with the PK id, and while it is the database generating that PK? It could be a web service call to some monkeys living in a far away jungle eating banana's and they give you a PK value based on how many bananas they eaten. We just really don't' care about that number - it is just internal house keeping numbers - one that we don't see or even care about in most code. And thus the number one rule to such auto matic PK values?
You NEVER give that auto PK number any meaning from a user and applcation point of view.
In summary:
Yes, using a PK called "id" for all tables? Common, and in fact in SharePoint and many systems, it not only the default, but is in fact required for such systems to operate.
Its better to use userid. User table is referenced by many other tables.
The referenced table would contain the primary key of the user table as foreign key.
Its better to use userid since its integer value,
it takes less space than string values of username and
the searches by the database engine would be faster
user(userid, username, name)
comments(commentid, comment, userid) would be better than
comments(commentid, comment, username)
I have poor skills for db design and I need some help about relations setup.
So use case is:
User which can be Coach or Client.
Client can have many coaches and coaches can have many clients.
Coach can create many workouts for client.
Client can also have many workouts assigned from coach.
Workout have many sets, sets have many exercises and reps etc...
So I created design on image.
But what make me feel that I am doing everything wrong is double keys.
In this table client_has_coach_has_workout I have two keys referencing n:n table and it confuses me as when I need to import data it will always have both keys from n:n table.
Any help please.
This design can work if you want it to, but I would consider reversing the link between coach and client to user.
remove client_id from user
remove coach_id from user
Add user_id as a foreign key to client
Add user_id as a foreign key to coach
Ultimately yes you are correct that now there are composite keys, but this is not a problem because during importing you would always need to lookup both the client_id and the coach_Id anyway.
To simplify the model logic, you can add an arbitrary Id key column to client_has_coach then client_has_coach_has_workout only needs a single foreign key client_has_coach_id that links back to the client_has_coach table, this forces us to lookup the specific linking record from client_has_coach and helps us enforce it's existence.
If you do use a composite key, then we can generally skip the lookup for client_has_coach and assume that it already exists, but this can lead to orphaned rows or scenarios where there is no record in client_has_coach corresponding to a combination of keys in client_has_coach_has_workout.
Composite keys work well in models where we do not need to maintain integrity between client_has_coach and client_has_coach_has_workout. But your naming conventions suggests that this is not a desirable aspect in your model.
Either way, on import of data you would need to first lookup the coach record, then lookup the client record to check for or create the record in client_has_coach, you would then import corresponding rows into client_has_coach_has_workout.
I am designing a database for a system and I came up with the following three tables
My problem is that an Address can belong to either a Person or a Company (or other things in the future) So how do I model this?
I discarded putting the address information in both tables (Person
and Company) because of it would be repeated
I thought of adding two columns (PersonId and CompanyId) to the
Address table and keep one of them null, but then I will need to add
one column for every future relation like this that appears (for
example an asset can have an address where its located at)
The last option that occur to me was to create two columns, one
called Type and other Id, so a pair of values would represent a
single record in the target table, for example: Type=Person,Id=5 and
Type=Company,Id=9 this way I can Join the right table using the type
and it will only be two columns no matter how many tables relate to
this table. But I cannot have constraints which reduce data integrity
I don't know if I am designing this properly. I think this should be a common issue (I've faced it at least three times during this small design in objects like Contact information, etc...) But I could not find many information or examples that would resemble mine.
Thanks for any guidance that you can give me
There are several basic approaches you could take, depending on how much you want to future proof your system.
In general, Has-One relationships are modeled by a foreign key on the owning entity, pointing to the primary key on the owned entity. So you would have an AddressId on both Company and Person,which would be a foreign key to Address.Id. The complexity in your case is how to handle the fact that a person can have multiple addresses. If you are 100% sure that there will only ever be a home and work address, you could put two foreign key columns on Person, but this becomes a big problem if there's a third, fourth, fifth etc. address. The other option is to create a join table, PersonAddress, with three columns a PersonId an AddressId and a AddressType, to indicate whether its a home work or whatever address.
Suppose we're working with a database that has the following relation:
CREATE TABLE Author (
first_name VARCHAR NOT NULL,
last_name VARCHAR NOT NULL,
birth_date DATE NOT NULL,
death_date DATE,
biography TEXT,
UNIQUE(first_name, last_name, birth_date)
);
In the real world, it's highly, highly improbable that two authors with the same first and last name will have been born on the exact same day. So we consider the combination of an author's first name, last name, and birth date to be a natural key.
However, for the purposes of joining tables and creating foreign keys, this is a bit painful because it means we need to store these three pieces of information over and over and over in our tables. Maybe we have a BookAuthor relation associating Authors with Books.
So we create a serial ID (or a UUID if we wanted to) and treat it as the primary key:
CREATE TABLE Author (
id SERIAL PRIMARY KEY,
first_name VARCHAR NOT NULL,
last_name VARCHAR NOT NULL,
birth_date DATE NOT NULL,
death_date DATE,
biography TEXT,
UNIQUE(first_name, last_name, birth_date)
);
However, based on the reading I've done here on StackOverflow and other sites, it seems you never want to expose your primary key to the end user because doing so may allow them to ascertain information that they should not be able to. If this were a situation involving user accounts, for example, and we decided to use serial IDs (we shouldn't, but humor me), then someone could potentially figure out how many total users we have in our database, which isn't information that they should see. In this particular case, someone knowing how many total authors we have isn't really a big deal. But I want to follow best practices.
So, if I were to design a REST API for this database, would it be acceptable to use the natural key to uniquely identify resources (Authors) instead of using serial IDs? For example:
https://www.foo.bar/authors/?first=:Terry&last=Pratchett&dob=19480424
My reasoning here is that there's no way for an end user to know what serial ID corresponds to this author in order to query for them.
As long as you guarantee this combination of data elements is an unique alternate key (would be good having a database unique key to enforce that) and use it consistently across all the API methods it would be perfectly acceptable to use it to identify resources in a REST API. There is no conceptual flaw in that.
The only minor issue is if you change the natural key, updating name or date of birth, cached data would be invalidated. Not a big deal. GET methods would still be idempotent.
If you choose to use a natural key as identifier and the data elements of this key are editable, keep in mind that you may want to do a redirect to a new URL when you PUT updates and want to keep displaying the same resource.
To address the different concerns you are looking at, I would suggest the following:
Generate a sequential id (a numeric sequence), if possible, which will be your primary key. This key is for internal app use (or db internal) and never leaves the backend.
A unique (non-numeric) id (UUID?... or a hash from id+first_name+las_name+b_day) which will be used as the key for access your data through the API calls.
Optionally, you could still have the unique key: first_name+last_name+b_day.
This way you will have a simple sequential Id for your DB, but also an Id that does not expose relevant information about your system.
This may not be perfect, or complete, solution, but could be a good start.
So, if I were to design a REST API for this database, would it be acceptable to use the natural key to uniquely identify resources (Authors) instead of using serial IDs?
The short answer to your question is No.
The longer answer is that there are actually two separate concerns here; you mention:
There's no way for an end user to know what serial ID corresponds to this author in order to query for them.
What this implies is that you want the end user to be able to search for an author.
This is different from the standard use-case for a GET route which explicitly requires that the requester knows the unique identifier (read this W3 description of the basic REST methods for more information).
I would recommend that you have two separate APIs, one for retrieving all details of a resource given its UUID, and a second for searching by some fields.
I'm trying to design my database with very basic tables and I am confused on the CORRECT way to do it.
I've attached a picture of the main info, and I'm not quite sure how to link them. Meaning what should be a foreign key, or should some of these tables include of LIST<> of the other tables.
UPDATE TO TABLES
As per your requirements, You are right about the associative table
Client can have multiple accounts And Accounts can have multiple clients
Then, Many (Client) to Many (Account)
So, Create an associate table to break the many to many relationship first. Then join it that way
Account can have only one Manager
Which means One(Manager) to Many(Accounts)
So, add an attribute called ManagerID in Accounts
Account can have many traedetail
Which means One(Accounts) to Many(TradeDetails)
So, add an attribute called AccountID in TradeDetails
Depends on whether you are looking to have a normalized database or some other type of design paradigm. I recommend doing some reading on the concepts of database normalization and referential integrity.
What I would do is make tables that have a 1 to 1 relationship such as account/manager into a single table (unless you can think of a really good reason not to). Add Clientid as a foreign key to Account. Add AccountID as a foreign key to TradeDetail. You are basically setting up everything as 1 to many relationships where the table that has 1 record for the id has the field as a primary key and the table that has many has it as a foreign key.