How best to explain which fields a user should join on? - sql

I need to explain to somebody how they can determine what fields from multiple tables/views they should join on. Any suggestions? I know how to do it but am having difficulty trying to explain it.
One of the issues they have is that they will take two fields from two tables that hold the same kind of value (zip code) and join on those, when in reality they should be joining on ID columns. When they choose the wrong column to join on, the number of records returned increases.
Should I work in PK and FK somewhere?

While it is indeed typical to join a PK to an FK, any conversation about JOIN clauses that only revolves around PKs and FKs is fairly limited.
For example I had this FROM clause in a recent SQL answer I gave
FROM
    YourTable firstNames
    LEFT JOIN YourTable lastNames
        ON firstNames.Name = lastNames.Name
        AND lastNames.NameType = 2
        AND firstNames.FrequencyPercent < lastNames.FrequencyPercent
The table referenced on each side of the join is the same table (a self join), and the join includes three conditions, one of which is an inequality. Furthermore, there would never be an FK here, because it is joining on a field that is, by design, not a candidate key.
Also, you don't even have to join one table to another. You can join inline queries to each other, which of course can't possibly have a key.
So in order to properly understand JOIN you just need to understand that it combines the records from two relations (tables, views, inline queries) where some conditions evaluate to true. This means you need to understand boolean logic and the database and the data in the database.
If your user is having a problem with a specific JOIN ask them to SELECT some rows from one table and also the other and then ask them under what conditions would you want to combine the rows.
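To make the "combine rows where some conditions evaluate to true" idea concrete, here is a rough sketch using Python's built-in sqlite3 module. The table `names` and its sample rows are made up for illustration, and it uses an inner join rather than the LEFT JOIN above so that it can be compared directly with a filtered cross product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for YourTable: no keys declared at all.
    CREATE TABLE names (name TEXT, name_type INTEGER, frequency_percent REAL);
    INSERT INTO names VALUES
        ('James', 1, 3.3), ('James', 2, 0.1),
        ('Scott', 1, 0.5), ('Scott', 2, 0.6);
""")

# Three conditions, one of them an inequality, on a self join.
condition = """
    f.name = l.name
    AND l.name_type = 2
    AND f.frequency_percent < l.frequency_percent
"""

# An inner JOIN ... ON <condition> ...
joined = conn.execute(f"""
    SELECT f.name, f.name_type, l.frequency_percent
    FROM names f JOIN names l ON {condition}
    ORDER BY 1, 2
""").fetchall()

# ... returns the same rows as pairing every row with every row
# and keeping only the pairs where the condition is true.
filtered = conn.execute(f"""
    SELECT f.name, f.name_type, l.frequency_percent
    FROM names f CROSS JOIN names l WHERE {condition}
    ORDER BY 1, 2
""").fetchall()

print(joined)    # [('Scott', 1, 0.6)]
print(filtered)  # [('Scott', 1, 0.6)]
```

Showing a user both queries side by side may help them see that the ON clause is just a boolean test applied to pairs of rows.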

You don't need to talk in terms of a primary key of a table but you should point to it and explain that it uniquely identifies a given row and that you must join to related tables using it or you could get duplicated results.
Give them examples of joining with it and joining without it.
An ER diagram showing all of the tables they use and their key relationships would help ensure that they always use the correct keys.
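A "with it and without it" demonstration could look something like the sketch below. It uses Python's built-in sqlite3 module, and the `person` and `address` tables and their rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: two people who happen to share a zip code.
    CREATE TABLE person  (person_id INTEGER PRIMARY KEY, name TEXT, zip TEXT);
    CREATE TABLE address (address_id INTEGER PRIMARY KEY,
                          person_id  INTEGER REFERENCES person(person_id),
                          street TEXT, zip TEXT);
    INSERT INTO person  VALUES (1, 'Ann', '10001'), (2, 'Bob', '10001');
    INSERT INTO address VALUES (10, 1, '1 Main St', '10001'),
                               (11, 2, '2 Oak Ave', '10001');
""")

# Joining on zip: every person matches every address in that zip code.
by_zip = conn.execute("""
    SELECT p.name, a.street
    FROM person p JOIN address a ON p.zip = a.zip
    ORDER BY p.name, a.street
""").fetchall()

# Joining on the key: each address matches exactly one person.
by_id = conn.execute("""
    SELECT p.name, a.street
    FROM person p JOIN address a ON p.person_id = a.person_id
    ORDER BY p.name, a.street
""").fetchall()

print(len(by_zip))  # 4 rows: Ann and Bob each paired with both addresses
print(by_id)        # [('Ann', '1 Main St'), ('Bob', '2 Oak Ave')]
```

The jump from two rows to four is the "duplicated results" effect, and it grows quickly as more rows share the same zip code.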

It sounds to me like neither you nor the person you are trying to help understands how this particular database is constructed, and perhaps neither of you really understands basic database fundamentals, like PKs and FKs. Most often a PK from one table is joined to an FK in another table.
Assuming the database has the proper PK's and FK's in place, it would probably help a great deal to generate an ER diagram. That would make the joining concept much easier to grasp.
Another approach you could take is to find someone who does understand these things and create some views for this person to use. This way he doesn't need to understand how to join the tables together.

A user shouldn't typically be doing joins. A user should have an interface that lets them get the data that they need in the way that they need it. If you don't have the developer resources to do that then you're going to be stuck with this problem of having to teach a user technical details. You also need to be very careful about what kind of damage the user can do. Do they have update rights on the data? I hope they don't accidentally do a DELETE FROM Table with no WHERE clause. Even if you restrict their permissions, a poorly written query can crush the database server or block resources causing problems for other users (and more work for you).
If you have no choice, then I think that you need to certainly teach them about primary and foreign keys, even if you don't call them that. Point out that the id on your table (or whatever your PK is) identifies a row. Then explain how the id appears in other tables to show the relationship. For example, "See, in the address table we have a person_id which tells us who that address belongs to."
After that, expect to spend a large portion of your time with that user as they make mistakes or come up with other things that they want to get from the database, but which they can't figure out how to get.

From theory, and ideally, you should define primary keys on all tables, and join tables using a primary key to the matching field or fields (foreign key) in the other table.
Even if the join fields are not defined as primary keys, you need to make sure that they uniquely identify the records in the table and that they are properly indexed.
For example, let's say the 'person' table has a SSN and a driver's license field. The SSN could be considered and flagged as the 'primary key', but if you join that table to a 'drivers' table which might not have the SSN, but does have the driver's license #, you could join them by the driver's license field (even if it's not flagged as primary key), but you need to make sure that the field is properly indexed in both tables.

...explain to somebody how they can determine what fields from multiple tables/views they should join on.
Simply put, look for the columns with values that match between the tables/views. Preferably, match exactly but some massaging might be necessary.
The existence of foreign key constraints would help to know what matches to what, but the constraint might not be directly to the table/view that is to be joined.
The existence of a primary key doesn't mean it is the criteria that is necessary for the query, so I would overlook this detail (depending on the audience).
I would recommend attacking the desired result set by starting with the columns desired, and working back from there. If there's more than one table's columns in the result set, focus first on the table whose columns should be returning distinct results and then gradually add joins, checking the result set after each JOIN addition to confirm the results are still the same. If they are not, you need to review the JOIN, or consider whether a JOIN is actually necessary versus IN or EXISTS.
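As a rough sketch of the JOIN versus EXISTS point, the example below (Python's built-in sqlite3 module; the `customer` and `orders` tables are made up) asks "which customers have placed an order?" both ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: Ann has two orders, Bob one, Cy none.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders   VALUES (100, 1, 5.0), (101, 1, 7.5), (102, 2, 3.0);
""")

# A JOIN returns one row per matching order, so Ann shows up twice.
with_join = conn.execute("""
    SELECT c.name
    FROM customer c JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.name
""").fetchall()

# EXISTS only tests for a match, so each customer appears at most once.
with_exists = conn.execute("""
    SELECT c.name
    FROM customer c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY c.name
""").fetchall()

print(with_join)    # [('Ann',), ('Ann',), ('Bob',)]
print(with_exists)  # [('Ann',), ('Bob',)]
```

If no columns from the second table are needed in the output, the EXISTS form keeps the row count of the first table intact.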

I did this when I first started out. It comes from thinking of joins as just linking tables together, so I linked at all possible points.
Once you think of joins as a way to combine AND filter the data it becomes easier to understand them.
Writing out your request as a sentence is helpful too, "I want to see all the times Table A interacted with Table B". Then build a query from that using only the ID, noting that if you wanted to know "All the times Table A was in the same zip code as Table B" then you would join by zip code.

Related

SQL Server database design with foreign keys

I have the following partial database design:
All the tables are dependent on each other so the table bvd_docflow_subdocuments is dependent on the table bdd_docflow_subsets
and the table bvd_docflow_subdocuments is dependent on bvd_docflow_subsets. So I thought I could be smart and use foreign keys on every table (with ON DELETE CASCADE). However, the FKs pile up the further down the tables I go.
The problem is that there is no point in the table bvd_docflow_documents having a reference to the docflow_documentset_id PK / FK. Is there a way (and maybe my design is crappy) for a table to have an FK relationship only with the table directly above it, and not with all the tables above it?
Edit:
More explanation:
In the bvd_docflow_subsets table, information is stored about objects used to create documents. There is a relation between that table and the bvd_docflow_subdocuments table, which stores master data about all the documents for a subset. docflow_subset_id is in both tables; this is the link between those two tables.
Going further down, we also have the table bvd_docflow_documents, which contains the actual document data. The link between bvd_docflow_documents and bvd_docflow_subdocuments is bvd_docflow_subdocument_id.
On every table I have a foreign key defined, so when data is removed from a table all the data linked to that data is also removed.
However when we look to the bvd_docflow_documents table it has all the foreign keys from the other tables (docflow_subset_id and docflow_documentset_id) and there is the problem. The only foreign key needed for that bvd_docflow_documents table is docflow_subdocument_id and no other.
Edit 2
I have changed my design further and removed information that I don't need after initial import of the data.
See the following link for the (total) database design:
https://sqldbm.com/Project/SQLServer/Share/_AUedvNutCEV2DGLJleUWA
The tables subsets, subdocuments and documents have a many-to-many relationship, so I thought a table in between those three, documents_subdocuments, is the way to go, where I define all the different keys for those tables.
I am not used to designing the database first and then building it. But there is a first time for everything, and I am trying to make a database that follows standards and uses the power of SQL Server the correct way.
I'll address the bottom-most table and ignore the rest for the most part.
But first some comments. Your schema is simply a model of a system. To provide feedback, one must understand this "system" and how it actually works to evaluate your model. In addition, it is important to understand your entities and your reasons for choosing them and modelling them in the specified manner. Without that understanding, all of this is guessing based on experience.
And another comment. Slapping an identity column into every table is just lazy modelling IMO. Others will disagree, but you need to also enforce all natural keys. Do you have natural keys? It is rare not to have any. Enforce those that do exist.
And one last comment. Stop the ridiculous pattern of prepending the column names with the table names. And you should really think long and hard about using very long table names. Given what you have, I sense you need a schema for your docflow stuff.
For the documents table, your current PK makes no sense. Again, you've slapped an identity column into the table. By itself, this column is a key for the table. The inclusion of any other columns does not make the key any more "unique" - that inclusion is logical nonsense. Following your pattern, you would designate the identity column as the primary key. But ...
According to your image, the documents table is related to one and only one subdocument. You added a foreign key to that table - which matches the image. You also added additional columns and foreign keys to the "higher" tables. So now a document "points" to a specific subdocument. It also points to a specific subset - which may have no relationship to the subdocument. The same thought applies to the other FK. I have a doubt that this is logically correct. So why do these columns (and related FKs) exist? Perhaps this is the result of premature optimization - which everyone knows is the root of all evil coding. Again, it is impossible to know if this is "right" or even "useful" for your model.
To answer your question "... is there a way", the answer is obviously yes. You remove the columns of which you complain. You added them - Why? Is this perhaps a problem with the tool you are using?
And some last comments. There is nothing special about "varchar(50)". Perhaps this is a placeholder that will be updated later. It may also be another sign of laziness. And generally speaking, columns with names like "type" and "code" tend to be foreign keys to "lookup" tables, because people like to add, modify, or remove these sorts of categorization values over time. I'm also concerned about the column name overlap among the tables. "Location" exists in multiple tables, as do action_code and action_id. And a column named "id" (action_id) suggests a lookup to another table - is it? Should it be? Is there a relationship between action_id and action_code? From a distance it is impossible to answer any of these questions.
But designing a database is more art than science. Sometimes you just need to create something, populate it with some sample data, and then determine if it works for your needs. Everyone will get something wrong in the first try. That is expected; that is how you learn. The most difficult part is actually completing your first attempt.

DB: advantages of relations

I always thought that relations between tables were needed to perform cross-table operations, such as a join. But I noticed that I can inner join two tables that are not linked at all (they don't have any foreign keys).
So, my questions:
Are there any differences (such as speed) between joining linked and non-linked tables?
What are the advantages/disadvantages of using relations between tables?
Thank you in advance.
The primary advantage is that foreign key constraints ensure the relational integrity of the data, i.e. they stop you from deleting something that has a related entry in another table.
You only get a performance advantage if you create an index on your FK.
The FK/PK relationship is a logical feature of the data that would exist even if it were not declared in a given database. You include FKs in a table precisely to establish these logical relationships and to make them visible in a way that makes useful inner joins possible. Declaring an FK as referencing a given PK has the advantage, as said in other answers, of preventing orphaned references, rows that reference a non existent PK.
Indexes can speed up joins. In a complicated query, the optimizer may have a lot of strategies to evaluate, and most of these will not use every available index. Good database systems have good optimizers. In most database systems, declaring a PK will create an index behind the scenes. Sometimes, but not always, creating an index on the FK with the same structure as the index on the PK will enable the optimizer to use a strategy called a merge-join. In certain circumstances a merge-join can be much faster than the alternatives.
When you join tables that are apparently unrelated, there are several cases.
One case is where you end up matching every row from table A with every row from table B. This is called a cartesian join. It takes a long time, and nearly always produces unintended results. One time in ten years I did an intentional cartesian join.
Another case is where both tables contain the same FK, and you match along those two FK. An example might be matching by ZIPCODE. Zipcodes are really FKs to some master zipcode table somewhere out there in post office land, even though most people who use zipcodes never realize that fact.
A third case is where there is a third table, a junction table, containing FKs that reference each of the two tables in question. This implements a many-to-many relationship. In this case, what you probably want to be doing is a three way join with two inner joins each of which has an FK/PK matchup as the join condition.
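To illustrate the first and third cases, here is a small sketch using Python's built-in sqlite3 module. The `student`, `course` and `enrollment` tables are hypothetical; `enrollment` plays the role of the junction table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical many-to-many model: students enroll in courses.
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(student_id),
        course_id  INTEGER REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student    VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO course     VALUES (10, 'SQL'), (20, 'Stats');
    INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")

# Cartesian join: every student paired with every course (2 x 2 rows),
# whether or not the student is actually enrolled.
cartesian = conn.execute("""
    SELECT s.name, c.title FROM student s CROSS JOIN course c
""").fetchall()

# Three-way join through the junction table: two inner joins,
# each with an FK/PK matchup as its join condition.
via_junction = conn.execute("""
    SELECT s.name, c.title
    FROM student s
    JOIN enrollment e ON e.student_id = s.student_id
    JOIN course c     ON c.course_id  = e.course_id
    ORDER BY s.name, c.title
""").fetchall()

print(len(cartesian))  # 4
print(via_junction)    # [('Ann', 'SQL'), ('Ann', 'Stats'), ('Bob', 'SQL')]
```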
Either I'm telling a lot that you already know, or you would benefit by going through a basic tutorial on relational databases.
In relational database terms a relation is (more or less) the data structure you have called a table - it is not something that exists "between" tables. An important advantage of the relational model is that there are no predefined links or other navigational structures that limit the way data can be joined or otherwise combined. You are free to join relations (tables) in a query however you like.
What you are asking about is actually called a foreign key constraint. A foreign key is a type of constraint that helps ensure data integrity by preventing inconsistent values being populated in the database.

SQL: Reference one to one-of-many

I'm having what some would call a rather strange problem/question.
Suppose I have a table, which may reference one (and only one) of many different other tables. How would I do that in the best way?? I'm looking for a solution which should work in a majority of databases (MS SQL, MySQL, PostgreSQL etc). The way I see it, there are a couple of different solutions (is any better than the other?):
1. Have one column for each possible reference. Only one of these columns may contain a value for any given row, all others are null. Allows for strict foreign keys, but it gets tedious when the number of "many" (possible referenced tables) gets large
2. Have a two column relationship, i.e. one column "describing" which table is referenced, and one referencing the instance (row in that table). Easily extended when the number of "many" (referenced tables) grows, though I can't perform single query lookup in a straightforward way (either left join all possible tables, or union multiple queries which joins towards one table each)
3. ??
Make sense? What's best practise (if any) in this case?
I specifically want to be able to query data from the referenced entity, without really knowing which of the tables are being referenced.
How would you do?
Both of these methods are suitable in any relational database, so you don't have to worry about that consideration. Both result in rather cumbersome queries. For the first method:
select . . .
from t
    left outer join ref1
        on t.ref1id = ref1.ref1id
    left outer join ref2
        on t.ref2id = ref2.ref2id . . .
For the second method:
select . . .
from t
    left outer join ref1
        on t.anyid = ref1.ref1id and anytype = 'ref1'
    left outer join ref2
        on t.anyid = ref2.ref2id and anytype = 'ref2' . . .
So, from the perspective of query simplicity, I don't see a major advantage for one versus the other. The second version has a small disadvantage -- when writing queries, you have to remember what the name is for the join. This might get lost over time. (Of course, you can use constraints or triggers to ensure that only a fixed set of values make it into the column.)
From the perspective of query performance, the first version has a major advantage. You can identify the column as a foreign key and the database can keep statistics on it. This can help the database choose the right join algorithm, for instance. The second method does not readily offer this possibility.
From the perspective of data size, the first version requires storing the id for each of the possible values. The second is more compact. From the perspective of maintainability, adding a new object type is hard with the first and easy with the second.
If you have a set of things that are similar to each other, then you can consider storing them in a single table. Attributes that are not appropriate can be NULLed out. You can even create views for the different flavors of the thing. One table may or may not be an option.
In other words, there is no right answer to this question. As with many aspects of database design, it depends on how the data is going to be used. Absent other information, I would probably first try to coerce the data into a single table. If that is just not reasonable, I would go with the first option if the number of tables can be counted on one hand, and the second if there are more tables.
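For what it's worth, the first method could be sketched roughly as follows, using Python's built-in sqlite3 module. The tables follow the t / ref1 / ref2 naming from the queries above, but the `label` column, the CHECK constraint and the sample rows are my own assumptions, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE ref1 (ref1id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE ref2 (ref2id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE t (
        id     INTEGER PRIMARY KEY,
        ref1id INTEGER REFERENCES ref1(ref1id),
        ref2id INTEGER REFERENCES ref2(ref2id),
        -- Assumed rule: exactly one of the reference columns is filled in.
        CHECK ((ref1id IS NOT NULL AND ref2id IS NULL)
            OR (ref1id IS NULL AND ref2id IS NOT NULL))
    );
    INSERT INTO ref1 VALUES (1, 'from ref1');
    INSERT INTO ref2 VALUES (1, 'from ref2');
    INSERT INTO t VALUES (1, 1, NULL), (2, NULL, 1);
""")

# Left join every candidate table; COALESCE picks whichever one matched,
# so the caller does not need to know which table is referenced.
resolved = conn.execute("""
    SELECT t.id, COALESCE(ref1.label, ref2.label)
    FROM t
    LEFT OUTER JOIN ref1 ON t.ref1id = ref1.ref1id
    LEFT OUTER JOIN ref2 ON t.ref2id = ref2.ref2id
    ORDER BY t.id
""").fetchall()
print(resolved)  # [(1, 'from ref1'), (2, 'from ref2')]

# A row that references both tables violates the CHECK constraint.
try:
    conn.execute("INSERT INTO t VALUES (3, 1, 1)")
    both_rejected = False
except sqlite3.IntegrityError:
    both_rejected = True
print(both_rejected)  # True
```

The CHECK expression gets longer with every table added, which is one way to see why this option is only comfortable for a handful of tables.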
1) This is legitimate for small number of static tables. If you anticipate a number of new tables might need to be added in the future, take a look at 3) below...
2) Please don't do that. You'd be forfeiting the declarative FOREIGN KEYs, which is one of the most important mechanisms for maintaining data integrity.
3) Use inheritance. More info in this post:
What is the best design for a database table that can be owned by two different resources, and therefore needs two different foreign keys?
You might also be interested in looking at:
Implementing comments and Likes in database
Multiple one to many relationship design
How to avoid multiple tables tables to relations M: M?
database table design thoughts
Relating two database tables (associating an employee with an activity)
How to structure table Activities in a database?

Why is a primary-foreign key relation required when we can join without it?

If we can get data from two tables without a primary and foreign key relation, then why do we need this rule? Can you please explain it to me clearly, with a suitable example?
It's a test database, don't mind the bad structure.
Tables' structure:
table - 'test1'
columns - id, lname, fname, dob
no primary and foreign key and also not unique (without any constraints)
table - 'test2'
columns - id, native_city
again, no relations and no constraints
I can still join these tables with same columns 'id',
so if there's no primary-foreign key, then what is the use of that?
The main reason for primary and foreign keys is to enforce data consistency.
A primary key enforces the consistency of uniqueness of values over one or more columns. If an ID column has a primary key then it is impossible to have two rows with the same ID value. Without that primary key, many rows could have the same ID value and you wouldn't be able to distinguish between them based on the ID value alone.
A foreign key enforces the consistency of data that points elsewhere. It ensures that the data which is pointed to actually exists. In a typical parent-child relationship, a foreign key ensures that every child always points at a parent and that the parent actually exists. Without the foreign key you could have "orphaned" children that point at a parent that doesn't exist.
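Here is a minimal sketch of both guarantees, using Python's built-in sqlite3 module. The `parent` and `child` tables and the `rejected` helper are made up for the example, and note that SQLite only enforces foreign keys once the pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child  (id INTEGER PRIMARY KEY,
                         parent_id INTEGER NOT NULL REFERENCES parent(id));
    INSERT INTO parent VALUES (1, 'p1');
    INSERT INTO child  VALUES (10, 1);
""")

def rejected(sql):
    """Hypothetical helper: True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Primary key: a second row with the same id is refused.
duplicate_pk_rejected = rejected("INSERT INTO parent VALUES (1, 'dup')")

# Foreign key: a child pointing at a parent that does not exist is refused ...
orphan_insert_rejected = rejected("INSERT INTO child VALUES (11, 999)")

# ... and so is deleting a parent that still has children.
parent_delete_rejected = rejected("DELETE FROM parent WHERE id = 1")

print(duplicate_pk_rejected, orphan_insert_rejected, parent_delete_rejected)
# True True True
```

Without the constraints, all three statements would succeed and any join between the two tables would still run; it would just quietly return duplicated or missing rows.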
You need two columns of the same type, one on each table, to JOIN on. Whether they're primary and foreign keys or not doesn't matter.
You don't need a FK, you can join arbitrary columns.
But having a foreign key ensures that the join will actually succeed in finding something.
Foreign keys give you certain guarantees that would be extremely difficult and error prone to implement otherwise.
For example, if you don't have a foreign key, you might insert a detail record in the system and, just after you checked that the matching master record is present, somebody else deletes it. So in order to prevent this you need to lock the master table whenever you modify the detail table (and vice versa). If you don't need/want that guarantee, screw the FKs.
Depending on your RDBMS, a foreign key might also improve the performance of selects (but it also degrades the performance of updates, inserts and deletes).
I know it's late to post, but I use the site for my own reference and so I wanted to put an answer here for myself to reference in the future too. I hope you (and others) find it helpful.
Let's pretend a bunch of super Einstein experts designed our database. Our super perfect database has 3 tables, and the following relationships defined between them:
TblA 1:M TblB
TblB 1:M TblC
Notice there is no relationship between TblA and TblC
In most scenarios such a simple database is easy to navigate but in commercial databases it is usually impossible to be able to tell at the design stage all the possible uses and combination of uses for data, tables, and even whole databases, especially as systems get built upon and other systems get integrated or switched around or out. This simple fact has spawned a whole industry built on top of databases called Business Intelligence. But I digress...
In the above case, the structure is so simple to understand that it's easy to see you can join from TblA, through to B, and through to C and vice versa to get at what you need. It also very vaguely highlights some of the problems with doing it. Now expand this simple chain to 10 or 20 or 50 relationships long. All of a sudden you start to envision a need for exactly your scenario: in simple terms, a join from A to C or vice versa, or A to F, or B to Z, or whatever, as our system grows.
There are many ways this can indeed be done. The one mentioned above is the most popular, that is, driving through all the links. The major problem is that it's very slow, and it gets progressively slower the more tables you add to the chain, the more those tables grow, and the further you want to go through it.
Solution 1: Look for a common link. It must be there if you thought of a reason to join A to C. If it is not obvious, create a relationship and then join on it. That is, to join A through B through C there must be some commonality, or your join would either produce zero results or a massive number of results (Cartesian product). If you know this commonality, simply add the needed columns to A and C and link them directly.
The rule for relationships is that they simply must have a reason to exist. Nothing more. If you can find a good reason to link from A to C then do it. But you must ensure your reason is not redundant (i.e. it's already handled in some other way).
Now a word of warning. There are some pitfalls. But I don't do a good job of explaining them so I will refer you to my source instead of talking about it here. But remember, this is getting into some heavy stuff, so this video about fan and chasm traps is really only a starting point. You can join without relationships. But I advise watching this video first as this goes beyond what most people learn in college and well into the territory of the BI and SAP guys. These guys, while they can program, their day job is to specialise in exactly this kind of thing. How to get massive amounts of data to talk to each other and make sense.
This video is one of the better videos I have come across on the subject. And it's worth looking over some of his other videos. I learned a lot from him.
A primary key is not required. A foreign key is not required either. You can construct a query joining two tables on any column you wish as long as the datatypes either match or are converted to match. No relationship needs to explicitly exist.
For example, using an outer join:
select tablea.code, tablea.name, tableb.location
from tablea
left outer join tableb on tablea.code = tableb.code
join with out relation
SQL join

Naming of ID columns in database tables

I was wondering about people's opinions on the naming of ID columns in database tables.
If I have a table called Invoices with a primary key of an identity column I would call that column InvoiceID so that I would not conflict with other tables and it's obvious what it is.
Where I am currently working, they have called all ID columns ID.
So they would do the following:
Select
i.ID
, il.ID
From
Invoices i
Left Join InvoiceLines il
on i.ID = il.InvoiceID
Now, I see a few problems here:
1. You would need to alias the columns on the select
2. ID = InvoiceID does not fit in my brain
3. If you did not alias the tables and referred to InvoiceID is it obvious what table it is on?
What are other people's thoughts on the topic?
I always preferred ID to TableName + ID for the id column and then TableName + ID for a foreign key. That way all tables have the same name for the id field and there isn't a redundant description. This seems simpler to me because all the tables have the same primary key field name.
As far as joining tables and not knowing which Id field belongs to which table, in my opinion the query should be written to handle this situation. Where I work, we always preface the fields we use in a statement with the table/table alias.
There's been a nerd fight about this very thing in my company of late. The advent of LINQ has made the redundant tablename+ID pattern even more obviously silly in my eyes. I think most reasonable people will say that if you're hand writing your SQL in such a manner that you have to specify table names to differentiate FKs, then using just the ID is not only a savings on typing, it also adds clarity to your SQL, in that you can clearly see which is the PK and which is the FK.
E.g.
FROM Employees e
LEFT JOIN Customers c ON e.ID = c.EmployeeID
tells me not only that the two are linked, but which is the PK and which is the FK. Whereas in the old style you're forced to either look or hope that they were named well.
ID is a SQL Antipattern.
See http://www.amazon.com/s/ref=nb_sb_ss_i_1_5?url=search-alias%3Dstripbooks&field-keywords=sql+antipatterns&sprefix=sql+a
If you have many tables with ID as the id you are making reporting that much more difficult. It obscures meaning and makes complex queries harder to read as well as requiring you to use aliases to differentiate on the report itself.
Further if someone is foolish enough to use a natural join in a database where they are available, you will join to the wrong records.
If you would like to use the USING syntax that some dbs allow, you cannot if you use ID.
If you use ID you can easily end up with a mistaken join if you happen to be copying the join syntax (don't tell me that no one ever does this!) and forget to change the alias in the join condition.
So you now have
select t1.field1, t2.field2, t3.field3
from table1 t1
join table2 t2 on t1.id = t2.table1id
join table3 t3 on t1.id = t3.table2id
when you meant
select t1.field1, t2.field2, t3.field3
from table1 t1
join table2 t2 on t1.id = t2.table1id
join table3 t3 on t2.id = t3.table2id
If you use tablenameID as the id field, this kind of accidental mistake is far less likely to happen and much easier to find.
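The natural join and USING points can be tried out with something like the sketch below. It uses Python's built-in sqlite3 module (SQLite supports both syntaxes), and all table names, column names and rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Convention A (hypothetical): every PK is called id.
    CREATE TABLE inv  (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE line (id INTEGER PRIMARY KEY, inv_id INTEGER, amount REAL);
    INSERT INTO inv  VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO line VALUES (1, 2, 9.0);  -- line 1 belongs to invoice 2 (Globex)

    -- Convention B (hypothetical): the PK carries the table name.
    CREATE TABLE invoices      (invoice_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE invoice_lines (invoice_line_id INTEGER PRIMARY KEY,
                                invoice_id INTEGER REFERENCES invoices(invoice_id),
                                amount REAL);
    INSERT INTO invoices      VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoice_lines VALUES (10, 1, 5.0), (11, 1, 2.5), (12, 2, 9.0);
""")

# NATURAL JOIN matches on the shared column name, which here is id,
# so line 1 gets attached to invoice 1 instead of invoice 2.
natural = conn.execute("""
    SELECT customer, amount FROM inv NATURAL JOIN line
""").fetchall()
print(natural)  # [('Acme', 9.0)]  -- wrong customer

# With matching PK/FK names, USING works and joins the intended rows.
using = conn.execute("""
    SELECT invoice_id, customer, amount
    FROM invoices JOIN invoice_lines USING (invoice_id)
    ORDER BY invoice_line_id
""").fetchall()
print(using)  # [(1, 'Acme', 5.0), (1, 'Acme', 2.5), (2, 'Globex', 9.0)]
```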
We use InvoiceID, not ID. It makes queries more readable -- when you see ID alone it could mean anything, especially when you alias the table to i.
I agree with Keven and a few other people here that the PK for a table should simply be Id and foreign keys list the OtherTable + Id.
However, I wish to add one reason which recently gave more weight to this argument.
In my current position we are employing the Entity Framework using POCO generation. Using the standard naming convention of Id for the PK allows for inheritance of a base POCO class with validation and such for tables which share a set of common column names. Using Tablename + Id as the PK for each of these tables destroys the ability to use a base class for these.
Just some food for thought.
It's not really important; you are likely to run into similar problems with all naming conventions.
But it is important to be consistent so you don't have to look at the table definitions every time you write a query.
My preference is also ID for primary key and TableNameID for foreign key. I also like to have a column "name" in most tables where I hold the user readable identifier (i.e. name :-)) of the entry. This structure offers great flexibility in the application itself, I can handle tables in mass, in the same way. This is a very powerful thing. Usually an OO software is built on top of the database, but the OO toolset cannot be applied because the db itself does not allow it. Having the columns id and name is still not very good, but it is a step.
Select
i.ID
, il.ID
From
Invoices i
Left Join InvoiceLines il
on i.ID = il.InvoiceID
Why can't I do this?
Select
Invoices.ID
, InvoiceLines.ID
From
Invoices
Left Join InvoiceLines
on Invoices.ID = InvoiceLines.InvoiceID
In my opinion this is very much readable and simple. Naming variables as i and il is a poor choice in general.
I just started working in a place that uses only "ID" (in the core tables, referenced by TableNameID in foreign keys), and have already found TWO production problems directly caused by it.
In one case the query used "... where ID in (SELECT ID FROM OtherTable ..." instead of "... where ID in (SELECT TransID FROM OtherTable ...".
Can anyone honestly say that wouldn't have been much easier to spot if full, consistent names were used where the wrong statement would have read "... where TransID in (SELECT OtherTableID from OtherTable ..."? I don't think so.
The other issue occurs when refactoring code. If you use a temp table whereas previously the query went off a core table then the old code reads "... dbo.MyFunction(t.ID) ..." and if that is not changed but "t" now refers to a temp table instead of the core table, you don't even get an error - just erroneous results.
If generating unnecessary errors is a goal (maybe some people don't have enough work?), then this kind of naming convention is great. Otherwise consistent naming is the way to go.
I personally prefer (as it has been stated above) the Table.ID for the PK and TableID for the FK. Even (please don't shoot me) Microsoft Access recommends this.
HOWEVER, I ALSO know for a fact that some generating tools favor the TableID for PK because they tend to link all column names that contain 'ID', INCLUDING ID!!!
Even the query designer in Microsoft SQL Server does this (and for each query you create, you end up ripping out all the unnecessary newly created relationships between tables on column ID).
THUS, as much as my internal OCD hates it, I roll with the TableID convention. Let's remember that it's called a Data BASE, as it will be the base for hopefully many, many applications to come. And all technologies should benefit from a well-normalized schema with clear descriptions.
It goes without saying that I DO draw the line when people start using TableName, TableDescription and such. In my opinion, conventions should do the following:
Table name: Pluralized. Ex. Employees
Table alias: Full table Name, singularized. Ex.
SELECT Employee.*, eMail.Address
FROM Employees AS Employee
LEFT JOIN eMails AS eMail ON Employee.eMailID = eMail.eMailID -- I would sure like to just have eMail.ID here... but oh well
[Update]
Also, there are some valid posts in this thread about duplicated columns because of the "kind of relationship" or role. For example, if a Store has an EmployeeID, that tells me squat. So I sometimes do something like Store.EmployeeID_Manager. Sure, it's a bit longer, but at least people won't go crazy trying to find a table named Manager, or wondering what EmployeeID is doing there. The query is where I would simplify it, as in:
SELECT EmployeeID_Manager as ManagerID FROM Store
For the sake of simplicity, most people name the column on the table ID. If it has a foreign key reference on another table, then they explicitly call it InvoiceID (to use your example). In the case of joins, you are aliasing the table anyway, so the explicit inv.ID is still simpler than inv.InvoiceID.
Coming at this from the perspective of a formal data dictionary, I would name the data element invoice_ID. Generally, a data element name will be unique in the data dictionary and ideally will have the same name throughout, though sometimes additional qualifying terms may be required based on context e.g. the data element named employee_ID could be used twice in the org chart and therefore qualified as supervisor_employee_ID and subordinate_employee_ID respectively.
Obviously, naming conventions are subjective and a matter of style. I've found the ISO/IEC 11179 guidelines to be a useful starting point.
For the DBMS, I see tables as collections of entities (except those that only ever contain one row e.g. config table, table of constants, etc) e.g. the table where my employee_ID is the key would be named Personnel. So straight away the TableNameID convention doesn't work for me.
I've seen the TableName.ID=PK TableNameID=FK style used on large data models and have to say I find it slightly confusing: I much prefer an identifier's name be the same throughout i.e. does not change name based on which table it happens to appear in. Something to note is the aforementioned style seems to be used in the shops which add an IDENTITY (auto-increment) column to every table while shunning natural and compound keys in foreign keys. Those shops tend not to have formal data dictionaries nor build from data models. Again, this is merely a question of style and one to which I don't personally subscribe. So ultimately, it's not for me.
All that said, I can see a case for sometimes dropping the qualifier from the column name when the table's name provides a context for doing so e.g. the element named employee_last_name may become simply last_name in the Personnel table. The rationale here is that the domain is 'people's last names' and is more likely to be UNIONed with last_name columns from other tables rather than be used as a foreign key in another table, but then again... I might just change my mind, sometimes you can never tell. That's the thing: data modelling is part art, part science.
My vote is for InvoiceID for the table ID. I also use the same naming convention when it's used as a foreign key and use intelligent alias names in the queries.
Select Invoice.InvoiceID, Lines.InvoiceLine, Customer.OrgName
From Invoices Invoice
Join InvoiceLines Lines on Lines.InvoiceID = Invoice.InvoiceID
Join Customers Customer on Customer.CustomerID = Invoice.CustomerID
Sure, it's longer than some other examples. But smile. This is for posterity and someday, some poor junior coder is going to have to alter your masterpiece. In this example there is no ambiguity and as additional tables get added to the query, you'll be grateful for the verbosity.
FWIW, our new standard (which changes, uh, I mean "evolves", with every new project) is:
Lower case database field names
Uppercase table names
Use underscores to separate words in the field name - convert these to Pascal case in code.
pk_ prefix means primary key
_id suffix means an integer, auto-increment ID
fk_ prefix means foreign key (no suffix necessary)
_VW suffix for views
is_ prefix for booleans
So, a table named NAMES might have the fields pk_name_id, first_name, last_name, is_alive, and fk_company and a view called LIVING_CUSTOMERS_VW, defined like:
SELECT first_name, last_name
FROM CONTACT.NAMES
WHERE (is_alive = 'True')
As others have said, though, just about any scheme will work as long as it is consistent and doesn't unnecessarily obfuscate your meanings.
There are lots of answers on this already, but I wanted to add two major things that I haven't seen above:
Customers coming to you for support.
Many times a customer or user or even a dev from another department has hit a snag and has contacted us saying they're having a problem doing an operation. We ask them what record they're having a problem with. Now, the data they see on the screen, e.g. a grid with customer name, number of orders, destination etc, is an aggregate of many tables. They say they're having trouble with id 83. There's no way to know what id that is, or which table it belongs to, if it's just called 'id'.
Namely, a row of data does not give any indication which table it is from. Unless you happen to know the schema of your database well, which is rarely the case on complex systems or non-greenfield systems you've been told to take over, you don't know what id=83 means even if you have more data like name, address, etc (which might not even be in the same table!).
This id could be coming from a grid, or it could be coming from an error in your API, or a faulty query dumping the error message to the screen, or to a log file.
Often a developer just dumps 'ID' into a column and forgets about it, and often DBs have many similar tables like Invoice, InvoiceGrouping, InvoicePlan and the ID could be for any of them. In frustration you look in the code to see which one it is, and see that they've called it Id on the model as well, so you then have to dig into how the model for the page was constructed. I cannot count how many times I've had to do this to figure out what an Id is. It's a lot. Sometimes you have to dig out a SPROC as well that just returns 'Id' as a header. Nightmare.
Log files are easier when it's clear what went wrong
Often SQL can give pretty crappy error messages. "Could not insert item with ID 83, column would be truncated" or something like that is very hard to debug. Often error messages are not very helpful, but usually the thing that broke will make a vague attempt to tell you what record was broken by just dumping out the primary key name and the value. If it's "ID" then it doesn't really help at all.
These are just two things that I didn't feel were mentioned in the other answers.
I also think that a lot of comments are 'if you program in X way then this isn't an issue', and I think the points above (and other points on this question) are valid specifically because of the way people program and because they don't have the time, energy, budget and foresight to program in perfect logging and error handling or change ingrained habits of quick SQL and code writing.
I definitely agree with including the table name in the ID field name, for exactly the reasons you give. Generally, this is the only field where I would include the table name.
I do hate the plain id name. I strongly prefer to always use invoice_id or a variant thereof. I always know which table is the authoritative table for the id when I need to, but with a plain id I can't tell which of these is correct:
SELECT * from Invoice inv, InvoiceLine inv_l where
inv_l.InvoiceID = inv.ID
SELECT * from Invoice inv, InvoiceLine inv_l where
inv_l.ID = inv.InvoiceLineID
SELECT * from Invoice inv, InvoiceLine inv_l where
inv_l.ID = inv.InvoiceID
SELECT * from Invoice inv, InvoiceLine inv_l where
inv_l.InvoiceLineID = inv.ID
What's worst of all is the mix you mention, totally confusing. I've had to work with a database where almost always it was foo_id except in one of the most used ids. That was total hell.
I think you can use anything for the "ID" as long as you're consistent. Including the table name is important too. I would suggest using a modeling tool like Erwin to enforce the naming conventions and standards, so that when writing queries it's easy to understand the relationships that may exist between tables.
What I mean by the first statement is, instead of ID you can use something else like 'recno'. So then this table would have a PK of invoice_recno and so on.
Cheers,
Ben
For the column name in the database, I'd use "InvoiceID".
If I copy the fields into an unnamed struct via LINQ, I may name it "ID" there, if it's the only ID in the structure.
If the column is NOT going to be used in a foreign key, so that it's only used to uniquely identify a row for editing or deletion, I'll name it "PK".
If you give each key a unique name, e.g. "invoices.invoice_id" instead of "invoices.id", then you can use the "natural join" and "using" operators with no worries. E.g.
SELECT * FROM invoices NATURAL JOIN invoice_lines
SELECT * FROM invoices JOIN invoice_lines USING (invoice_id)
instead of
SELECT * from invoices JOIN invoice_lines
ON invoices.id = invoice_lines.invoice_id
SQL is verbose enough without making it more verbose.
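To check that the two shorter forms really behave the same, here is a minimal sketch using Python's built-in sqlite3 module (SQLite supports both NATURAL JOIN and USING). The customer, item and invoice_line_id columns and the sample rows are invented for illustration; only invoices, invoice_lines and invoice_id come from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- invoice_id is the only column name the two tables share
    CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE invoice_lines (
        invoice_line_id INTEGER PRIMARY KEY,
        invoice_id INTEGER REFERENCES invoices (invoice_id),
        item TEXT
    );
    INSERT INTO invoices VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoice_lines VALUES
        (10, 1, 'bolts'), (11, 1, 'nuts'), (12, 2, 'gears');
""")

natural = conn.execute(
    "SELECT customer, item FROM invoices NATURAL JOIN invoice_lines "
    "ORDER BY invoice_line_id"
).fetchall()

using = conn.execute(
    "SELECT customer, item FROM invoices JOIN invoice_lines USING (invoice_id) "
    "ORDER BY invoice_line_id"
).fetchall()

print(natural)  # [('Acme', 'bolts'), ('Acme', 'nuts'), ('Globex', 'gears')]
print(using == natural)  # True
```

One thing to keep in mind: NATURAL JOIN matches on every column name the two tables share, so it stays safe only while the key is the sole shared name. USING names the column explicitly and does not have that risk.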
What I do to keep things consistent for myself (where a table has a single column primary key used as the ID) is to name the primary key of the table Table_pk. Anywhere I have a foreign key pointing to that table's primary key, I call the column PrimaryKeyTable_fk. That way, if I have a Customer_pk in my Customer table and a Customer_fk in my Order table, I know that the Order table is referring to an entry in the Customer table.
To me, this makes sense especially for joins where I think it reads easier.
SELECT *
FROM Customer AS c
INNER JOIN Order AS o ON c.Customer_pk = o.Customer_fk -- Order is a reserved word and may need quoting in your DBMS
I prefer DomainName || 'ID'. (i.e. DomainName + ID)
DomainName is often, but not always, the same as TableName.
The problem with ID all by itself is that it doesn't scale upwards. Once you have about 200 tables, each with a first column named ID, the data begins to look all alike. If you always qualify ID with the table name, that helps a little, but not that much.
DomainName & ID can be used to name foreign keys as well as primary keys. When foreign keys are named after the column that they reference, that can be of mnemonic assistance. Formally, tying the name of a foreign key to the key it references is not necessary, since the referential integrity constraint will establish the reference. But it's awfully handy when it comes to reading queries and updates.
Occasionally, DomainName || 'ID' can't be used, because there would be two columns in the same table with the same name. Example: Employees.EmployeeID and Employees.SupervisorID. In those cases, I use RoleName || 'ID', as in the example.
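A role-named key like SupervisorID shows its value in a self join. The sketch below uses Python's built-in sqlite3 module; the Name column and the three sample employees are made up, while Employees, EmployeeID and SupervisorID are taken from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (
        EmployeeID INTEGER PRIMARY KEY,
        Name TEXT,
        -- role name instead of a second EmployeeID column
        SupervisorID INTEGER REFERENCES Employees (EmployeeID)
    );
    INSERT INTO Employees VALUES (1, 'Ada', NULL), (2, 'Bob', 1), (3, 'Cy', 1);
""")

# The same table appears twice, so each copy gets an alias naming its role.
rows = conn.execute("""
    SELECT Employee.Name, Supervisor.Name
    FROM Employees AS Employee
    LEFT JOIN Employees AS Supervisor
        ON Supervisor.EmployeeID = Employee.SupervisorID
    ORDER BY Employee.EmployeeID
""").fetchall()

print(rows)  # [('Ada', None), ('Bob', 'Ada'), ('Cy', 'Ada')]
```

The LEFT JOIN keeps Ada in the result even though she has no supervisor, and the ON clause reads unambiguously because the two key columns have different names.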
Last but not least, I use natural keys rather than synthetic keys when possible. There are situations where natural keys are unavailable or untrustworthy, but there are plenty of situations where the natural key is the right choice. In those cases, I let the natural key take on the name it would naturally have. This name often doesn't even have the letters, 'ID' in it. Example: OrderNo where No is an abbreviation for "Number".
For each table I choose a three-letter shorthand (e.g. Employees => Emp).
That way a numeric autonumber primary key becomes nkEmp.
It is short, unique in the entire database and I know exactly its properties at a glance.
I keep the same names in SQL and all languages I use (mostly C#, Javascript, VB6).
See the Interakt site's naming conventions for a well thought out system of naming tables and columns. The method makes use of a suffix for each table (_prd for a product table, or _ctg for a category table) and appends that to each column in a given table. So the identity column for the products table would be id_prd and is therefore unique in the database.
They go one step further to help with understanding the foreign keys: the foreign key in the product table that refers to the category table would be idctg_prd, so that it is obvious to which table it belongs (_prd suffix) and to which table it refers (category).
Advantages are that there is no ambiguity with the identity columns in different tables, and that you can tell at a glance which columns a query is referring to by the column names.
You could use the following naming convention. It has its flaws but it solves your particular problems.
Use short (3-4 characters) nicknames for the table names, i.e. Invoice - inv, InvoiceLines - invl
Name the columns in the table using those nicknames, i.e. inv_id, invl_id
For the reference columns use invl_inv_id for the names.
This way you could say:
SELECT * FROM Invoice LEFT JOIN InvoiceLines ON inv_id = invl_inv_id
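As a quick check that the unqualified join above is unambiguous, here is a sketch using Python's built-in sqlite3 module. The inv_customer and invl_item columns and the sample rows are invented for illustration; the table names and the inv_id, invl_id and invl_inv_id columns follow the convention described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Invoice (inv_id INTEGER PRIMARY KEY, inv_customer TEXT);
    CREATE TABLE InvoiceLines (
        invl_id INTEGER PRIMARY KEY,
        invl_inv_id INTEGER REFERENCES Invoice (inv_id),
        invl_item TEXT
    );
    INSERT INTO Invoice VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO InvoiceLines VALUES (10, 1, 'bolts'), (11, 1, 'nuts');
""")

# Every column name is unique across both tables, so neither table
# qualifiers nor aliases are needed anywhere in the query.
rows = conn.execute("""
    SELECT inv_id, invl_id
    FROM Invoice
    LEFT JOIN InvoiceLines ON inv_id = invl_inv_id
    ORDER BY inv_id, invl_id
""").fetchall()

print(rows)  # [(1, 10), (1, 11), (2, None)]
```

Invoice 2 has no lines, so the LEFT JOIN returns it once with a NULL line id.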