Alternate names for a surrogate key/sequence number/ID column

I have a legacy table that has as a part of its natural key a column named <table_name>_IDENTIFIER and it seems like it would be confusing to create a surrogate key named <table_name>_ID or ID so I'm leaning towards naming it SURROGATE_KEY. All my other tables use the <table_name>_ID syntax. Any better suggestions?

Don't call it SURROGATE_KEY. That is meaningless in any other context. I'd stick with <table_name>_ID. Yes it's a little confusing. But, given your established convention, anything else would be confusing too.

I might suggest that you go with your standard: <table_name>_ID
Eventually, the legacy table will not be the driving force, and it will be the IDENTIFIER column that will look odd, which is what you want, as opposed to that - 'oh yeah, i need to use surrogate_key for that thing instead of id...' moment.
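A minimal sketch of that suggestion, run against SQLite through Python's sqlite3 module. The table name widget, the region_code column, and the sample values are all hypothetical; the only point is that the surrogate follows the <table_name>_ID convention while the legacy <table_name>_IDENTIFIER column keeps its place in the natural key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE widget (
        widget_id         INTEGER PRIMARY KEY,   -- surrogate key, same convention as the other tables
        widget_identifier TEXT NOT NULL,         -- legacy column, part of the natural key
        region_code       TEXT NOT NULL,         -- hypothetical second part of the natural key
        UNIQUE (widget_identifier, region_code)  -- the natural key is still enforced
    )
""")
conn.execute("INSERT INTO widget (widget_identifier, region_code) VALUES ('W-100', 'EU')")
conn.execute("INSERT INTO widget (widget_identifier, region_code) VALUES ('W-100', 'US')")

rows = conn.execute(
    "SELECT widget_id, widget_identifier, region_code FROM widget ORDER BY widget_id"
).fetchall()

# A repeated natural key is rejected even though the surrogate would be new.
try:
    conn.execute("INSERT INTO widget (widget_identifier, region_code) VALUES ('W-100', 'EU')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Other tables would reference widget_id only, so the oddly named legacy column never leaks into foreign keys.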

First, I would not include the table name in my columns. A column is an attribute which requires the context of the entity to which it belongs. Having a "name" for example without the context to which it belongs is of no use. You need to know it is a Person's name or a Company name etc. and you have that in the name of the entity itself. Thus, I would not prefix columns with the name of the table in which it is declared.
That leaves you with choices like "Id", "Key", "SurrogateKey", or perhaps "SystemId" which are all equally vague. At least "SurrogateKey" describes what it is which is a bonus. That name will make sense to a DBA but perhaps not a developer (although they should understand the concept). Of those choices, I'd be inclined to use "Id" and find a way to change <table_name>_Identifier to something more descriptive.

In the data modelling world, when you draw an ER model, a surrogate key named SURROGATE_KEY (or SURROGATE_ID) will cause painful side effects as soon as you create foreign key constraints.
In most data modelling tools, linking a parent to a child by dragging and dropping the primary key automatically creates an identically named column in the child, which produces duplicate column names.
To avoid that, as a rule of thumb, naming the surrogate key Table_name.Table_name_ID or Table_name._ID can be a good option.

Agreed . . . SURROGATE_ID is not recommended. What all the suggestions seem to be lacking is at the very heart of data management & data modelling best practices: establishing (& consistently using!) naming conventions & value domains. Suggestions:
1. If the database or programming protocol (like .NET, which abhors natural primary keys as I've been led to understand) requires a single, meaningless integer assigned as a primary key -- a surrogate key -- then create a value domain of "Id" & define it as data type integer with the description "surrogate primary key".
2. When naming attributes/columns, the ONLY columns using the domain "Id" would be surrogate (primary) key columns populated with assigned integer values. No other attributes/columns would be allowed to use the domain "Id", so it would be absolutely clear from the attribute/column name what kind of values are stored AND how those values are being used.

Which database structure to choose for a tagging system

I have a database structure as follows:
-pk id (AutoIncrement)

tbl_tags (1)    OR    tbl_tags (2)
-pk name              -pk id (AutoIncrement)
                      -name (unique)

-fk product_id
-fk tag_id
I have seen that most people choose structure tbl_tags (2). Could I choose tbl_tags (1) instead? The name is always unique, so I would like to make it the primary key. Does that have any downside?
If you make the tag name unique, you have to think about what you'll do if a name needs to be changed. For example, if I want to change "tag" to "tags".
If this is a primary key, then all the child records that refer to "tag" will also have to be updated so the constraint is valid. If you have a lot of rows referring to a given name, running this change is likely to be slow and introduce some blocking/contention into your application. Whereas if you use a surrogate primary key, you only have to update the unique name field, not all the child rows as well.
If you're certain that you'll never update a tag name then you could use it as the primary key. Beware of changing requirements however!
Natural keys generally make sense when using codes that are issued and managed by an external source (e.g. airport, currency and country codes). In these cases you can be sure that the natural key won't change and is guaranteed to be unique within the domain.
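The rename scenario above can be sketched as follows, again using SQLite through Python's sqlite3 module. The link table name tbl_products_tags and the sample rows are assumptions made for illustration; tbl_tags follows option (2) from the question. With a surrogate key, the rename touches one row in tbl_tags and none of the rows that reference it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_tags (
        id   INTEGER PRIMARY KEY,       -- surrogate key
        name TEXT NOT NULL UNIQUE       -- natural key kept unique
    );
    CREATE TABLE tbl_products_tags (    -- hypothetical name for the link table
        product_id INTEGER NOT NULL,
        tag_id     INTEGER NOT NULL REFERENCES tbl_tags (id),
        PRIMARY KEY (product_id, tag_id)
    );
    INSERT INTO tbl_tags (id, name) VALUES (1, 'tag');
    INSERT INTO tbl_products_tags (product_id, tag_id) VALUES (10, 1), (11, 1), (12, 1);
""")

# Renaming the tag updates exactly one row; the child rows keep pointing at id 1.
cur = conn.execute("UPDATE tbl_tags SET name = 'tags' WHERE name = 'tag'")
rows_touched = cur.rowcount

linked = conn.execute("""
    SELECT COUNT(*)
    FROM tbl_products_tags pt
    JOIN tbl_tags t ON t.id = pt.tag_id
    WHERE t.name = 'tags'
""").fetchone()[0]
```

With option (1), the same rename would have to rewrite the tag name in every one of those child rows as well.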
My understanding is there would be a marginal performance penalty to tbl_tags (1) in the context of a very large dataset when compared to option 2. In smaller datasets, probably not so much. The machine can process integers much more efficiently than strings.
In the bigger picture though, with modern processor speeds, the difference between the two might be negligible in all but the largest datasets.
Of course, I am speaking about relational databases here. The various flavors of NoSQL are a different animal.
Also, there is the matter of consistency. The other tables in your database all seem to be using (what I assume to be) an auto-incrementing integer ID. For that reason, I would use it on the tags table as well.
The use of auto-incrementing integer PK fields vs "Natural Keys" in designing a database is a long-standing debate. My understanding is academics largely prefer the "Natural Keys" concept, while in practice some form of generated unique key tends to be the norm.
Personally, I prefer to create generated keys which have no meaning to the end user, integers where possible. Unless I have missed something, index performance is significantly enhanced.

Why is prefixing column names with the table name a convention?

I have seen this convention in many databases, but it seems redundant to me. I have found a few answers that say it is to reduce confusion during complex joins, but this doesn't seem like a sufficient reason. If you are making complex joins, make aliases. Do joins really represent such a common task that we should make standard tasks like selects, inserts, and updates redundant?
I don't think there is actually a convention of prefixing column names with the table name.
As Philippe Grondier details, the 'proper' approach to data modelling is to first create a dictionary of data element names. Following the international standard ISO 11179 guidelines:
[Object] [Qualifier] Property RepresentationTerm
you end up with data elements that are fully qualified. Here the qualifier elements Object, Qualifier and sometimes Property are in combination what you consider to be the 'prefix'.
On implementation of the data model in SQL, the table name can provide the context, which leads the designer to drop the qualifying terms from the column name. I think this is the convention you prefer.**
In other words, in the convention you are questioning it is not that the table name has been prefixed to the column name, rather it is that the qualifying terms have been retained.
** Whether yours or any other is a good convention is subjective, and Stack Overflow is not the place for such discussion. However, I will mention in passing that retaining qualification terms does have practical benefits (as well as being theoretically sound), e.g. consider that SQL's NATURAL JOIN lends itself to columns that are named consistently throughout the schema.
It is true that such "developed column names" methods are widely used for column naming where, for example, Tbl_Person will have an id_Person primary key column and a personName text column.
Though it might seem quite painful at first to write 'developed' column names like "id_Person", "personName", "personAdress", etc., everything gets clearer when you have to write SELECTs on multiple tables, which is something that happens each time you open a form or a report.
There is also a theoretical/historical dimension to this "developed column names" method. Early relational database theories and methods (like MERISE) proposed, as a first step, building the so-called "data dictionary", i.e. the list of all data to be manipulated by the app/database.
This dictionary has to be established even before any "Entity-Relation" model is proposed. Data names/descriptions then have to be fully developed, to avoid confusion between 'similar' data entries such as "companyName" and "personName".
Thus, the "developed column names" convention reflects the fact that, at the data level, similar columns (such as the companyName and personName columns) are not as equivalent as they seem to be. Though both look like they are there to hold a name, one of them is made to hold a company name, while the other is made to hold a person's name!
This convention can then be considered as a way to reflect the exact meaning of each of the database's column, or to reflect the exact meaning of each entry in the data dictionary.
I've never seen the full table name prefixed, but usually at least an abbreviation. And you're exactly right, it's for simplicity in joins and the like. It's easier to write ur_id all the time than it is to write id sometimes and a table-qualified id other times, for example. It's not that uncommon to need to access more than one table at a time.
Join is part of a select, so that comparison doesn't hold.
That aside, I don't think you should prefix the field with the table name, except for primary keys. I like to give every table a surrogate key, which I rather name after the table. So the table 'Orders' will get an 'OrderId' PK. An order line will have a foreign key OrderId to point to the order. That way, the field names are the same across tables, and you can tell by the name, which data it presents. You could name the field just 'Id' in all tables, but you do have to read the alias to see which ID you mean. Some queries I wrote are over 400 lines. You don't want to rely on table aliases alone. A little context in the fieldname itself does help.
It's not a convention; some people do it, some people don't. More often I see an ID column prefixed with the table name, but no other columns. Some (all?) DBs also allow prefixing with the table name in queries, but it's neither required, nor part of the actual column name.
In addition to what others said, it also makes things simpler in the presence of identifying relationships (a.k.a. identifying FOREIGN KEYs).
An identifying relationship "migrates" the parent's primary key into a part of the child's primary key. The prefix ensures there will be no collision and you won't need to rename the migrated fields, even when there are multiple levels of identifying relationships.
Keeping the same name throughout the whole data model avoids any confusion as to what the field means and where it came from.
On the other hand, prefixing can take a toll on readability, so I usually take a compromise: prefix primary key fields but leave other fields unprefixed.
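A small sketch of two levels of identifying relationships, run in SQLite through Python's sqlite3 module. The building/wing/room schema is hypothetical and is not the example the answer originally showed. Each parent key column migrates into the child's primary key under the same name, so nothing has to be renamed on the way down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE building (
        building_id INTEGER PRIMARY KEY
    );
    CREATE TABLE wing (
        building_id INTEGER NOT NULL REFERENCES building (building_id),
        wing_no     INTEGER NOT NULL,
        PRIMARY KEY (building_id, wing_no)            -- parent key migrated into the PK
    );
    CREATE TABLE room (
        building_id INTEGER NOT NULL,
        wing_no     INTEGER NOT NULL,
        room_no     INTEGER NOT NULL,
        PRIMARY KEY (building_id, wing_no, room_no),  -- two levels of migrated keys
        FOREIGN KEY (building_id, wing_no) REFERENCES wing (building_id, wing_no)
    );
    INSERT INTO building VALUES (1);
    INSERT INTO wing VALUES (1, 2);
    INSERT INTO room VALUES (1, 2, 7);
""")

# Column names of the grandchild table, in declaration order.
room_columns = [row[1] for row in conn.execute("PRAGMA table_info(room)")]

# A room in a wing that does not exist violates the identifying foreign key.
try:
    conn.execute("INSERT INTO room VALUES (1, 99, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Had every table called its key simply id, the migrated columns in wing and room would have needed new names at each level.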
I dislike such naming conventions. It encourages sloth, specifically the use of unqualified references in queries. Use an alias for each table in your query and qualify each column reference with the appropriate alias.
The only such naming convention I like has to do with primary/foreign keys:
I like to name primary keys something clever, like id.
I like to prefix the names of foreign key columns with the name of the table containing the primary key.
It makes for much more legible SQL, IMHO. An example:
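-- parent table: the primary key is simply "id"
create table foo
(
  id int not null primary key
);
-- child table: the foreign key is prefixed with the parent table's name
create table bar
(
  id     int not null primary key ,
  foo_id int not null foreign key references foo (id)
);
select *
from foo foo
join bar bar on bar.foo_id = foo.id;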
This scheme falls down, of course, when you get to compound keys. But I like it. YMMV.

General SQL question about Primary Keys

I know this is pretty elementary but here it goes.
I would like to know how you know what columns are a primary key in a table that does not have a primary key? Is there a technique or something that I should read?
You need to take a look at your data structures.
A primary key must:
never be NULL (no exceptions)
reliably and uniquely identify each single row
and it helps if it's
small and easy to use
stable (doesn't change at all, or at least not often)
a single column (or at most two)
Check your data - which columns or set of columns can fulfill these requirements??
Once you have those potential primary keys (the "candidate keys") - think about how you will access the data, and what other data might need to be associated with this one entity in question - what would make sense as a foreign key? Do you want to reference your department by its name? Probably not a good idea, since the name could be misspelled, it might change over time etc. By the department's office location? Bad choice, too. But something like a unique "department ID" might be a good idea.
If you don't find any appropriate column(s) in your actual data that could serve as primary key and would make sense, it's a common practice to introduce a "surrogate key" - an extra column, often an INT (and often something like an "auto-increment" INT) that will serve as an artificial identifier for each row. If you do this, one common best practice is to never show that artificial key on any data screen - it has no meaning whatsoever to the users of your system - so don't even show it to them.
Checking these requirements, and a lot of experience, will help you find the right primary key.
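One way to run the "check your data" step is to test each candidate column set for NULLs and duplicates. The sketch below does this in SQLite through Python's sqlite3 module; the helper is_candidate_key, the department table, and its rows are all hypothetical. Passing the check only means the existing data does not contradict the choice, it does not prove the column will stay unique in the future.

```python
import sqlite3

def is_candidate_key(conn, table, columns):
    """Return True if `columns` hold no NULLs and no duplicate combinations in `table`.

    Table and column names are interpolated into the SQL text, so only pass
    trusted identifiers (this is an exploratory helper, not application code).
    """
    null_check = " OR ".join(f"{c} IS NULL" for c in columns)
    col_list = ", ".join(columns)
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {null_check}"
    ).fetchone()[0]
    dups = conn.execute(
        f"SELECT COUNT(*) FROM ("
        f"SELECT {col_list} FROM {table} GROUP BY {col_list} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    return nulls == 0 and dups == 0

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (name TEXT, office TEXT, dept_code TEXT);
    INSERT INTO department VALUES
        ('Sales', 'Building A', 'D01'),
        ('Sales', 'Building B', 'D02'),
        ('Legal', NULL,         'D03');
""")

name_ok = is_candidate_key(conn, "department", ["name"])         # 'Sales' appears twice
office_ok = is_candidate_key(conn, "department", ["office"])     # contains a NULL
code_ok = is_candidate_key(conn, "department", ["dept_code"])    # unique and never NULL
```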
It really depends on the data itself. You need to determine what fields can be used to identify the record uniquely.
In SQL Server it'll have a key next to it. It's typically ID or something with ID in it. It's also unique and typically increments. When you look at the table design in SQL Server Management Studio, you'll see it towards the top of the list of columns with the little key icon.
It's a unique identifier that distinguishes each record from the others. Kind of like how each person has an SSN.

Primary key/foreign Key naming convention

In our dev group we have a raging debate regarding the naming convention for Primary and Foreign Keys. There's basically two schools of thought in our group:
1. Primary Table (Employee)
   Primary Key is called ID
   Foreign table (Event)
   Foreign key is called EmployeeID
2. Primary Table (Employee)
   Primary Key is called EmployeeID
   Foreign table (Event)
   Foreign key is called EmployeeID
I prefer not to duplicate the name of the table in any of the columns (So I prefer option 1 above). Conceptually, it is consistent with a lot of the recommended practices in other languages, where you don't use the name of the object in its property names. I think that naming the foreign key EmployeeID (or Employee_ID might be better) tells the reader that it is the ID column of the Employee Table.
Some others prefer option 2 where you name the primary key prefixed with the table name so that the column name is the same throughout the database. I see that point, but you now can not visually distinguish a primary key from a foreign key.
Also, I think it's redundant to have the table name in the column name, because if you think of the table as an entity and a column as a property or attribute of that entity, you think of it as the ID attribute of the Employee, not the EmployeeID attribute of an employee. I don't go and ask my coworker what his PersonAge or PersonGender is. I ask him what his Age is.
So like I said, it's a raging debate and we go on and on and on about it. I'm interested to get some new perspectives.
If the two columns have the same name in both tables (convention #2), you can use the USING syntax in SQL to save some typing and some boilerplate noise:
SELECT name, address, amount
FROM employees JOIN payroll USING (employee_id)
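To see the query above run, here is a small sketch in SQLite through Python's sqlite3 module. The two table definitions and the sample row are assumptions; only the SELECT comes from the answer. The USING form returns the same rows as the spelled-out ON condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        address     TEXT
    );
    CREATE TABLE payroll (
        payroll_id  INTEGER PRIMARY KEY,
        employee_id INTEGER REFERENCES employees (employee_id),
        amount      INTEGER
    );
    INSERT INTO employees VALUES (1, 'Ada', '1 Main St');
    INSERT INTO payroll   VALUES (10, 1, 5000);
""")

# Convention #2: the key has the same name on both sides, so USING works.
using_rows = conn.execute("""
    SELECT name, address, amount
    FROM employees JOIN payroll USING (employee_id)
""").fetchall()

# The equivalent join with the condition written out.
on_rows = conn.execute("""
    SELECT name, address, amount
    FROM employees JOIN payroll ON payroll.employee_id = employees.employee_id
""").fetchall()
```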
Another argument in favor of convention #2 is that it's the way the relational model was designed.
"The significance of each column is partially conveyed by labeling it with the name of the corresponding domain."
It doesn't really matter. I've never run into a system where there is a real difference between choice 1 and choice 2.
Jeff Atwood had a great article a while back on this topic. Basically people debate and argue the most furiously those topics which they cannot be proven wrong on. Or from a different angle, those topics which can only be won through filibuster style endurance based last-man-standing arguments.
Pick one and tell them to focus on issues that actually impact your code.
EDIT: If you want to have fun, have them specify at length why their method is superior for recursive table references.
I think it depends on how your application is put together. If you use an ORM or design your tables to represent objects, then option 1 may be for you.
I like to code the database as its own layer. I control everything and the app just calls stored procedures. It is nice to have result sets with complete column names, especially when there are many tables joined and many columns returned. With this style of application, I like option 2. I really like to see column names match on joins. I've worked on old systems where they didn't match and it was a nightmare.
Have you considered the following?
Primary Table (Employee)
Primary Key is PK_Employee
Foreign table (Event)
Foreign key is called FK_Employee
Neither convention works in all cases, so why have one at all? Use Common sense...
e.g., for self-referencing table, when there are more than one FK column that self-references the same table's PK, you HAVE to violate both "standards", since the two FK columns can't be named the same... e.g., EmployeeTable with EmployeeId PK, SupervisorId FK, MentorId Fk, PartnerId FK, ...
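The self-referencing case can be sketched like this, in SQLite through Python's sqlite3 module. The column names come from the example above; the Name column and the sample rows are assumptions. Both foreign keys point at EmployeeId, so they have to be named after the role they play rather than after the referenced column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (
        EmployeeId   INTEGER PRIMARY KEY,
        Name         TEXT NOT NULL,
        SupervisorId INTEGER REFERENCES Employee (EmployeeId),  -- role-based FK name
        MentorId     INTEGER REFERENCES Employee (EmployeeId)   -- second FK to the same PK
    );
    INSERT INTO Employee VALUES (1, 'Ada',   NULL, NULL);
    INSERT INTO Employee VALUES (2, 'Grace', 1,    NULL);
    INSERT INTO Employee VALUES (3, 'Linus', 1,    2);
""")

# Each role needs its own alias of the same table.
rows = conn.execute("""
    SELECT e.Name, s.Name AS Supervisor, m.Name AS Mentor
    FROM Employee e
    LEFT JOIN Employee s ON s.EmployeeId = e.SupervisorId
    LEFT JOIN Employee m ON m.EmployeeId = e.MentorId
    ORDER BY e.EmployeeId
""").fetchall()
```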
I agree that there is little to choose between them. To me a much more significant thing about either standard is the "standard" part.
If people start 'doing their own thing' they should be strung up by their nethers. IMHO :)
If you are looking at application code, not just database queries, some things seem clear to me:
Table definitions usually directly map to a class that describes one object, so they should be singular. To describe a collection of an object, I usually append "Array" or "List" or "Collection" to the singular name, as it more clearly than use of plurals indicates not only that it is a collection, but what kind of a collection it is. In that view, I see a table name as not the name of the collection, but the name of the type of object of which it is a collection. A DBA who doesn't write application code might miss this point.
The data I deal with often uses "ID" for non-key identification purposes. To eliminate confusion between key "ID"s and non-key "ID"s, for the primary key name, we use "Key" (that's what it is, isn't it?) prefixed with the table name or an abbreviation of the table name. This prefixing (and I reserve this only for the primary key) makes the key name unique, which is especially important because we use variable names that are the same as the database column names, and most classes have a parent, identified by the name of the parent key. This also is needed to make sure that it is not a reserved keyword, which "Key" alone is. To facilitate keeping key variable names consistent, and to provide for programs that do natural joins, foreign keys have the same name as is used in the table in which they are the primary key. I have more than once encountered programs which work much better this way using natural joins. On this last point, I admit a problem with self-referencing tables, which I have used. In this case, I would make an exception to the foreign key naming rule. For example, I would use ManagerKey as a foreign key in the Employee table to point to another record in that table.
The convention we use where I work is pretty close to A, with the exception that we name tables in the plural form (i.e., "employees") and use underscores between the table and column name. The benefit of it is that to refer to a column, it's either "employees_id" or "employees.id", depending on how you want to access it. If you need to specify what table the column is coming from, "employees.employees_id" is definitely redundant.
I like convention #2 - in researching this topic, and finding this question before posting my own, I ran into the issue where:
I am selecting * from a table with a large number of columns and joining it to a second table that similarly has a large number of columns. Both tables have an "id" column as the primary key, and that means I have to specifically pick out every column (as far as I know) in order to make those two values unique in the result, i.e.:
SELECT parent.id AS parent_id, child.id AS child_id
Though using convention #2 means I will still have some columns in the result with the same name, I can now specify which id I need (parent or child) and, as Steven Huwig suggested, the USING statement simplifies things further.
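The name collision described above is easy to reproduce. This sketch uses SQLite through Python's sqlite3 module with hypothetical parent and child tables; other engines and client libraries may label duplicate columns differently, so treat the exact result-set names as specific to this setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE child  (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent (id),
        note      TEXT
    );
    INSERT INTO parent VALUES (1, 'p');
    INSERT INTO child  VALUES (7, 1, 'c');
""")

# SELECT * across both tables yields two result columns that are both called "id".
star = conn.execute("SELECT * FROM parent JOIN child ON child.parent_id = parent.id")
star_names = [d[0] for d in star.description]

# Aliasing each key makes the two values distinguishable by name.
aliased = conn.execute("""
    SELECT parent.id AS parent_id, child.id AS child_id
    FROM parent JOIN child ON child.parent_id = parent.id
""")
aliased_names = [d[0] for d in aliased.description]
aliased_row = aliased.fetchone()
```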
I've always used userId as a PK on one table and userId on another table as a FK. I'm seriously thinking about using userIdPK and userIdFK as names to tell one from the other. It will help me identify the PK and FK quickly when looking at the tables, and it seems like it will clear up code that uses PHP/SQL to access data, making it easier to understand. Especially when someone else looks at my code.
I use convention #2. I'm working with a legacy data model now where I don't know what stands for in a given table. Where's the harm in being verbose?
How about naming the foreign key
where role is the role the referenced entity has relative to the table at hand. This solves the issue of recursive references and multiple FKs to the same table.
In many cases role will be identical to the referenced table name. In these cases it becomes identical to one of your proposals.
In any case, having long arguments is a bad idea.
"Where in "employee INNER JOIN order ON order.employee_id = employee.id" is there a need for additional qualification?".
There is no need for additional qualification because the qualification I talked of is already there.
"the reason that a business user refers to Order ID or Employee ID is to provide context, but at a dabase level you already have context because you are refereing to the table".
Pray, tell me, if the column is named 'ID', then how is that "refereing [sic] to the table" done exactly, unless by qualifying this reference to the ID column exactly in the way I talked of ?

What should I consider when selecting a data type for my primary key?

When I am creating a new database table, what factors should I take into account for selecting the primary key's data type?
Sorry to do that, but I found that the answers I gave to related questions (you can check this and this) could apply to this one. I reshaped them a little bit...
You will find many posts dealing with this issue, and each choice you'll make has its pros and cons. Arguments for these usually refer to relational database theory and database performance.
On this subject, my point is very simple: surrogate primary keys ALWAYS work, while Natural keys MIGHT NOT ALWAYS work one of these days, and this for multiple reasons: field too short, rules change, etc.
To this point, you've guessed here that I am basically a member of the uniqueIdentifier/surrogate primary key team, and even if I appreciate and understand arguments such as the ones presented here, I am still looking for the case where "natural" key is better than surrogate ...
In addition to this, one of the most important but always forgotten arguments in favor of this basic rule is related to code normalization and productivity:
each time I create a table, shall I lose time
identifying its primary key and its physical characteristics (type, size)
remembering these characteristics each time I want to refer to it in my code?
explaining my PK choice to other developers in the team?
My answer is no to all of these questions:
I have no time to lose trying to identify "the best Natural Primary Key" when the surrogate option gives me a bullet-proof solution.
I do not want to remember that the Primary Key of my Table_whatever is a 10 characters long string when I write the code.
I don't want to lose my time negotiating the Natural Key length: "well if You need 10 why don't you take 12 to be on the safe side?". This "on the safe side" argument really annoys me: If you want to stay on the safe side, it means that you are really not far from the unsafe side! Choose surrogate: it's bullet-proof!
So I've been working for the last five years with a very basic rule: each table (let's call it 'myTable') has its first field called 'id_MyTable' which is of uniqueIdentifier type. Even if this table supports a "many-to-many" relation, where a field combination offers a very acceptable Primary Key, I prefer to create this 'id_myManyToManyTable' field being a uniqueIdentifier, just to stick to the rule, and because, finally, it does not hurt.
The major advantage is that you don't have to care anymore about the use of Primary Key and/or Foreign Key within your code. Once you have the table name, you know the PK name and type. Once you know which links are implemented in your data model, you'll know the name of available foreign keys in the table.
And if you still want to have your "Natural Key" somewhere in your table, I advise you to build it following a standard model such as
id_whatever, unique identifier, primary key
code_whatever, whateverTypeYouWant(whateverLengthYouEstimateTheRightOne), indexed
Where id_ is the prefix for primary key, and code_ is used for "natural" indexed field. Some would argue that the code_ field should be set as unique. This is true, and it can be easily managed either through DDL or external code. Note that many "natural" keys are calculated (invoice numbers), so they are already generated through code
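A rough sketch of that id_/code_ model, using SQLite through Python's sqlite3 module. The invoice table, the add_invoice helper, and the code format are hypothetical. SQLite has no uniqueIdentifier type, so the GUID is stored as text here, and the unique index on code_invoice is the DDL route mentioned above for keeping the "natural" key unique.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (
        id_invoice   TEXT PRIMARY KEY,   -- surrogate key, a GUID stored as text
        code_invoice TEXT NOT NULL       -- "natural" key, e.g. a generated invoice number
    );
    CREATE UNIQUE INDEX ux_invoice_code ON invoice (code_invoice);
""")

def add_invoice(code):
    """Insert an invoice under a freshly generated GUID and return that GUID."""
    new_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO invoice (id_invoice, code_invoice) VALUES (?, ?)",
        (new_id, code),
    )
    return new_id

first_id = add_invoice("INV-0001")

# The unique index rejects a second row with the same natural key.
try:
    add_invoice("INV-0001")
    duplicate_code_rejected = False
except sqlite3.IntegrityError:
    duplicate_code_rejected = True

invoice_count = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
```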
I am not sure that my rule is the best one. But it is a very efficient one! If everyone was applying it, we would for example avoid time lost answering to this kind of question!
If using a numeric key, make sure the datatype is going to be large enough to hold the number of rows you might expect the table to grow to.
If using a GUID, does the extra space needed to store the GUID need to be considered? Will coding against GUID PKs be a pain for developers or users of the application?
If using composite keys, are you sure that the combined columns will always be unique?
I don't really like what they teach in school, that is, using a 'natural key' (for example the ISBN in a book database) or even having a primary key made up of 2 or more fields. I would never do that. So here's my little advice:
Always have one dedicated column in every table for your primary key.
They should all have the same column name across all tables, e.g. "ID" or "GUID"
Use GUIDs when you can (if you don't need performance), otherwise incrementing INTs
Okay, I think I need to explain my choices a little bit.
Having a dedicated column named the same across all tables for your primary key just makes your SQL statements a lot easier to construct and easier for someone else (who might not be familiar with your database layout) to understand. Especially when you're doing lots of JOINs and things like that. You won't need to look up what the primary key for a specific table is; you already know, because it's the same everywhere.
GUIDs vs. INTs doesn't really matter that much most of the time. Unless you hit the performance cap of GUIDs or are doing database merges, you won't have major issues with one or the other. BUT there's a reason I prefer GUIDs. The global uniqueness of GUIDs might come in handy some day. Maybe you don't see a need for it now, but things like synchronizing parts of the database to a laptop / cell phone, or even finding data records without needing to know which table they're in, are great examples of the advantages GUIDs can provide. An integer only identifies a record within the context of one table, whereas a GUID identifies a record everywhere.
In most cases I use an identity int primary key, unless the scenario requires a lot of replication, in which case I may opt for a GUID.
I (almost) never used meaningful keys.
Unless you have an ultra-convenient natural key available, always use a synthetic (a.k.a. surrogate) key of a numeric type. Even if you do have a natural key available, you might want to consider using a synthetic key anyway and placing an additional unique index on your natural key. Consider what happened to higher-ed databases that used social security numbers as PKs when federal law changed: the costs of changing over to synthetic keys were enormous.
Also, I have to disagree with the practice of naming every primary key the same, e.g. "id". This makes queries harder to understand, not easier. Primary keys should be named after the table. For example employee.employee_id, affiliate.affiliate_id, user.user_id, and so on.
Do not use a floating point numeric type, since floating point numbers cannot be properly compared for equality.
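A short illustration of why floating point keys are risky, in plain Python. The values are arbitrary; the same rounding behaviour applies to binary floating point columns in a database, which is why integer (or exact decimal) types are the safer choice for anything compared by equality.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly,
# so a value computed one way may not equal the "same" value typed in.
float_sum = 0.1 + 0.2
float_equal = (float_sum == 0.3)

# Exact types do not have this problem.
decimal_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
int_equal = (1 + 2 == 3)
```

A lookup such as WHERE key = 0.3 could therefore miss a row whose key was stored as the result of 0.1 + 0.2.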
Where do you generate it? Incrementing numbers don't fit well for keys generated by the client.
Do you want a data-dependent or independent key (sometimes you could use an ID from business data, can't say if this is always useful or not)?
How well can this type be indexed by your DB?
I have used uniqueidentifiers (GUIDs) or incrementing integers so far.
Numbers that have meaning in the real world are usually a bad idea, because every so often the real world changes the rules about how those numbers are used, in particular to allow duplicates, and then you've got a real mess on your hands.
I'm partial to using a generated integer key. If you expect the database to grow very large, you can go with bigint.
Some people like to use guids. The pro there is that you can merge multiple instances of the database without altering any keys but the con is that performance can be affected.
For a "natural" key, whatever datatype suits the column(s). Artificial (surrogate) keys are usually integers.
It all depends.
a) Are you fine with unique sequential numbers as your primary key? If yes, an auto-incrementing integer (identity) column will suffice. Note that UniqueIdentifier is a GUID type, so choose it when you need globally unique rather than sequential values.
b) If your business demands an alphanumeric primary key, then you have to go for varchar or nvarchar.
These are the two options I could think of.
A great factor is how much data you're going to store. I work for a web analytics company, and we have LOADS of data. So a GUID primary key on our pageviews table would kill us, due to the size.
A rule of thumb: For high performance, you should be able to store your entire index in memory. Guids could easily break this!
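A back-of-the-envelope version of that rule of thumb, in Python. The row count is hypothetical, and the figures cover only the raw key bytes (4 for a 32-bit integer, 8 for a 64-bit integer, 16 for a GUID); real indexes add per-row and per-page overhead, so actual sizes will be larger.

```python
def key_bytes(rows, bytes_per_key):
    """Raw storage for the key values alone, ignoring all index overhead."""
    return rows * bytes_per_key

ROWS = 1_000_000_000  # hypothetical pageviews table with one billion rows
GIB = 2 ** 30

int_gib = key_bytes(ROWS, 4) / GIB      # 32-bit integer key
bigint_gib = key_bytes(ROWS, 8) / GIB   # 64-bit integer key
guid_gib = key_bytes(ROWS, 16) / GIB    # 128-bit GUID key
```

Under these assumptions the GUID keys alone need four times the memory of 32-bit integer keys, and every secondary index or foreign key that repeats the key pays the same multiple again.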
Use natural keys when they can be trusted. Some sources of natural keys can't be trusted. Years ago, the Social Security Administration used to occasionally mess up and assign the same SSN to two different people. They've probably fixed that by now.
You can probably trust VINs for vehicles, and ISBNs for books (but not for pamphlets, which may not have an ISBN).
If you use natural keys, the natural key will determine the datatype.
If you can't trust any natural keys, create a synthetic key. I prefer integers for this purpose. Leave enough room for reasonable expansion.
I usually go with a GUID column as the primary key for all tables (rowguid in MSSQL). Whatever could be a natural key I make a unique constraint. A typical example would be a product identification number that the user has to make up and ensure is unique. If I need a sequence, as in an invoice, I build a table to keep the last number and a stored procedure to ensure serialized access. Or a Sequence in Oracle :-) I hate the "social security number" sample for natural keys, as that number will never always be available in a registration process, resulting in the need for a scheme to generate dummy numbers.
I usually always use an integer, but here's an interesting perspective.
Whenever possible, try to use a primary key that is a natural key. For instance, if I had a table where I logged one record every day, the logdate would be a good primary key. Otherwise, if there is no natural key, just use int. If you think you will use more than 2 billion rows, use a bigint. Some people like to use GUIDs, which works well, as they are unique, and you will never run out of space. However, they are needlessly long, and hard to type in if you are just doing adhoc queries.
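The "2 billion rows" figure is the upper bound of a signed 32-bit integer. The arithmetic is sketched below in Python; the insert rate is a hypothetical number chosen only to show how the estimate works, and the helper assumes keys are handed out one per row with no gaps.

```python
INT_MAX = 2 ** 31 - 1      # largest signed 32-bit value
BIGINT_MAX = 2 ** 63 - 1   # largest signed 64-bit value

def years_until_exhausted(max_value, rows_per_day):
    """Rough time until an incrementing key reaches max_value at a steady insert rate."""
    return max_value / rows_per_day / 365

ROWS_PER_DAY = 1_000_000   # hypothetical insert rate

int_years = years_until_exhausted(INT_MAX, ROWS_PER_DAY)
bigint_years = years_until_exhausted(BIGINT_MAX, ROWS_PER_DAY)
```

At that rate a 32-bit key lasts a little under six years, while a 64-bit key is effectively inexhaustible.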