How important are lookup tables? - sql

A lot of the applications I write make use of lookup tables, since that was just the way I was taught (normalization and such). The problem is that the queries I make are often more complicated because of this. They often look like this
get all posts that are still open
"SELECT * FROM posts WHERE status_id = (SELECT id FROM statuses WHERE name = 'open')"
Often times, the lookup tables themselves are very short. For instance, there may only be 3 or so different statuses. In this case, would it be okay to search for a certain type by using a constant or so in the application? Something like
get all posts that are still open
"SELECT * FROM posts WHERE status_id = ".Status::OPEN
Or, what if instead of using a foreign id, I set it as an enum and queried off of that?
Thanks.

The answer depends a little if you are limited to freeware such as PostGreSQL (not fully SQL compliant), or if you are thinking about SQL (ie. SQL compliant) and large databases.
In SQL compliant, Open Architecture databases, where there are many apps using one database, and many users using different report tools (not just the apps) to access the data, standards, normalisation, and open architecture requirements are important.
Despite the people who attempt to change the definition of "normalisation", etc. to suit their ever-changing purpose, Normalisation (the science) has not changed.
if you have data values such as {Open; Closed; etc} repeated in data tables, that is data duplication, a simple Normalisation error: if you those values change, you may have to update millions of rows, which is very limited design.
Such values should be Normalised into a Reference or Lookup table, with a short CHAR(2) PK:
O Open
C Closed
U [NotKnown]
The data values {Open;Closed;etc} are no longer duplicated in the millions of rows. It also saves space.
the second point is ease of change, if Closed were changed to Expired, again, one row needs to be changed, and that is reflected in the entire database; whereas in the un-normalised files, millions of rows need to be changed.
Adding new data values, eg. (H,HalfOpen) is then simply a matter of inserting one row.
in Open Architecture terms, the Lookup table is an ordinary table. It exists in the [SQL compliant] catalogue; as long as the FOREIGN KEY relation has been defined, the report tool can find that as well.
ENUM is a Non-SQL, do not use it. In SQL the "enum" is a Lookup table.
The next point relates to the meaningfulness of the key.
If the Key is meaningless to the user, fine, use an {INT;BIGINT;GUID;etc} or whatever is suitable; do not number them incrementally; allow "gaps".
But if the Key is meaningful to the user, do not use a meaningless number, use a meaningful Relational Key.
Now some people will get in to tangents regarding the permanence of PKs. That is a separate point. Yes, of course, always use a stable value for a PK (not "immutable", because no such thing exists, and a system-generated key does not provide row uniqueness).
{M,F} are unlikely to change
if you have used {0,1,2,4,6}, well don't change it, why would you want to. Those values were supposed to be meaningless, remember, only a meaningful Key need to be changed.
if you do use meaningful keys, use short alphabetic codes, that developers can readily understand (and infer the long description from). You will appreciate this only when you code SELECT and realise you do not have to JOIN every Lookup table. Power users too, appreciate it.
Since PKs are stable, particularly in Lookup tables, you can safely code:
WHERE status_code = 'O' -- Open
You do not have to JOIN the Lookup table and obtain the data value Open, as a developer, you are supposed to know what the Lookup PKs mean.
Last, if the database were large, and supported BI or DSS or OLAP functions in addition to OLTP (as properly Normalised databases can), then the Lookup table is actually a Dimension or Vector, in Dimension-Fact analyses. If it was not there, then it would have to be added in, to satisfy the requirements of that software, before such analyses can be mounted.
If you do that to your database from the outset, you will not have to upgrade it (and the code) later.
Your Example
SQL is a low-level language, thus it is cumbersome, especially when it comes to JOINs. That is what we have, so we need to just accept the encumbrance and deal with it. Your example code is fine. But simpler forms can do the same thing.
A report tool would generate:
SELECT p.*,
s.name
FROM posts p,
status s
WHERE p.status_id = s.status_id
AND p.status_id = 'O'
Another Exaple
For banking systems, where we use short codes which are meaningful (since they are meaningful, we do not change them with the seasons, we just add to them), given a Lookup table such as (carefully chosen, similar to ISO Country Codes):
Eq Equity
EqCS Equity/Common Share
OTC OverTheCounter
OF OTC/Future
Code such as this is common:
WHERE InstrumentTypeCode LIKE "Eq%"
And the users of the GUI would choose the value from a drop-down that displays
{Equity/Common Share;Over The Counter},
not {Eq;OTC;OF}, not {M;F;U}.
Without a lookup table, you can't do that, either in the apps, or in the report tool.

For look-up tables I use a sensible primary key -- usually just a CHAR(1) that makes sense in the domain with an additional Title (VARCHAR) field. This can maintain relationship enforcement while "keeping the SQL simple". The key to remember here is the look-up table does not "contain data". It contains identities. Some other identities might be time-zone names or assigned IOC country codes.
For instance gender:
ID Label
M Male
F Female
N Neutral
select * from people where gender = 'M'
Alternatively, an ORM could be used and manual SQL generation might never have to be done -- in this case the standard "int" surrogate key approach is fine because something else deals with it :-)
Happy coding.

Create a function for each lookup.
There is no easy way. You want performance and query simplicity. Ensure the following is maintained. You could create a SP_TestAppEnums to compare existing lookup values against the function and look for out of sync/zero returned.
CREATE FUNCTION [Enum_Post](#postname varchar(10))
RETURNS int
AS
BEGIN
DECLARE #postId int
SET #postId =
CASE #postname
WHEN 'Open' THEN 1
WHEN 'Closed' THEN 2
END
RETURN #postId
END
GO
/* Calling the function */
SELECT dbo.Enum_Post('Open')
SELECT dbo.Enum_Post('Closed')

Question is: do you need to include the lookup tables (domain tables 'round my neck of the woods) in your queries? Presumably, these sorts of tables are usually
pretty static in nature — the domain might get extended, but it probably won't get shortened.
their primary key values are pretty unlikely to change as well (e.g., the status_id for a status of 'open' is unlikely to suddenly get changed to something other than what it was created as).
If the above assumptions are correct, there's no real need to add all those extra tables to your joins just so your where clause can use a friend name instead of an id value. Just filter on status_id directly where you need to. I'd suspect the non-key attribute in the where clause ('name' in your example above) is more likely to get changes than the key attribute ('name' in your example above): you're more protected by referencing the desire key value(s) of the domain table in your join.
Domain tables serve
to limit the domain of the variable via a foreign key relationship,
to allow the domain to be expanded by adding data to the domain table,
to populate UI controls and the like with user-friendly information,
Naturally, you'd need to suck domain tables into your queries where you you actually required the non-key attributes from the domain table (e.g., descriptive name of the value).
YMMV: a lot depends on context and the nature of the problem space.

The answer is "whatever makes sense".
lookup tables involve joins or subqueries which are not always efficient. I make use of enums a lot to do this job. its efficient and fast

Where possible (and It is not always . . .), I use this rule of thumb: If I need to hard-code a value into my application (vs. let it remain a record in the database), and also store that vlue in my database, then something is amiss with my design. It's not ALWAYS true, but basically, whatever the value in question is, it either represents a piece of DATA, or a peice of PROGRAM LOGIC. It is a rare case that it is both.
NOT that you won't find yourself discovering which one it is halfway into the project. But as the others said above, there can be trade-offs either way. Just as we don't always acheive "perfect" normalization in a database design (for reason of performance, or simply because you CAN take thngs too far in pursuit of acedemic perfection . . .), we may make some concious choices about where we locate our "look-up" values.
Personally, though, I try to stand on my rule above. It is either DATA, or PROGRAM LOGIC, and rarely both. If it ends up as (or IN) a record in the databse, I try to keep it out of the Application code (except, of course, to retrieve it from the database . . .). If it is hardcoded in my application, I try to keep it out of my database.
In cases where I can't observe this rule, I DOCUMENT THE CODE with my reasoning, so three years later, some poor soul will be able to ficure out how it broke, if that happens.

The commenters have convinced me of the error of my ways. This answer and the discussion that went along with it, however, remain here for reference.
I think a constant is appropriate here, and a database table is not. As you design your application, you expect that table of statuses to never, ever change, since your application has hard-coded into it what those statuses mean, anyway. The point of a database is that the data within it will change. There are cases where the lines are fuzzy (e.g. "this data might change every few months or so…"), but this is not one of the fuzzy cases.
Statuses are a part of your application's logic; use constants to define them within the application. It's not only more strictly organized that way, but it will also allow your database interactions to be significantly speedier.

Related

Best practices for referencing natural and/or surrogate key values in code

I'm modifying some stored procedures that manage status changes when records are updated.
For example, if I have these two tables
Request(RequestID, StatusID)
Status(StatusID, StatusName)
I'm trying to determine the best to handle calling out the statuses in code.
Do I use StatusID or StatusName?
It's not guaranteed that StatusID will match between environments (DEV, PRE, PROD, etc).
Also, StatusName could be changed and I wouldn't want to have to alter code because I needed to change a StatusName.
I could create a 2nd unique column, which would sort of closely resemble StatusID.
I'd make sure this column was matched between regions, but that doesn't seem that clean either and sort of repetitive.
Can anyone suggest a cleaner, simpler way?
The difficulty of matching code to data can only partially be handled with a second column. When someone adds an item, what does this mean? If they re-use a known constant, what does it mean if you don't require this column to be unique?
Often times we will have user modifiable lookup tables, but they will have to be associated with a number of other flags indicating how to interpret the status - "IsTreatedAsExpired", "IsTreatedAsActive" or perhaps other tables which hold the statuses which are treated as certain things.
I think you really need to figure out the scope of what you want to allow with this table first. Because if you have a LOT of code references, you would be better off using a natural key which is in sync with your code on all installations. A possibility to handle this is to use negative numbers for unmovable codes (identity insert to add new unmovable codes) and then have your sequence only add positive ones. But again, this doesn't address the semantics of how your program would handle or use the user-entered extensions.
Again, it's hard to say without getting the full scope sorted out here.
From the information you've given, StatusID may have different values in different databases, presumably because your keys are generated automatically and are not specified by you. If so then obviously it's impossible to use StatusID consistently in your code anyway (without standardizing the values). Therefore the question becomes "is it acceptable/practical/desirable to hard-code StatusName values in my code?"
The obvious answer is yes, what's the alternative? If you have a certain status that represents 'ready' and you want to reference that in code then you must put something in your code that identifies the status unambiguously.
If you add a second key of some kind (as Carlos suggested) you still have the same basic problem that changing a natural key value is changing the identity of the status and therefore changes the meaning of your code. If you change the 'real' natural key (READY) without changing the second key (RDY) then your code will become more confusing and difficult to maintain.
If you do something more complex like extracting 'constants' or 'configuration parameters' into a configuration file or table or even writing a custom preprocessor to insert key values into your scripts at deployment time, you add lots of complexity for very little gain (unless you have other good reasons for doing it). I've seen this approach used, and it was a huge, unmaintainable mess.
In practice, StatusName is most likely to change because a) someone thinks another name would be 'more accurate' or 'look better', or b) you discover that it doesn't correctly represent your requirements. If you're forced to spend time on a) then just change the display name in your front end or reports and leave the database and code alone. If b) comes up then by definition your current data model and code are inaccurate and must be revised and possibly modified anyway. And when b) does happen, it often results in adding a new code, not changing the existing one (e.g. because someone defined a new process step that there is no existing code for).
And if you are open to changing your development and deployment practices there are other ways to look at this issue too, as others have suggested. Can you make your StatusID values the same everywhere? Technically it's possible, so what are the organizational reasons not to? Can you reduce the probability and impact of StatusName changes through change management and code reviews? Can you improve your requirements process to capture certain information more effectively?
Write a user defined function that accepts status name and gives out the status if wherever you are referring the status id
select * from resources where statusid = dbo.getStatusId("COMPLETED");
This would make sure that resolving the status id always happens within the function that you have defined
As a rule of thumb when you have id,value tables (Status, Result, Area, etc..) I usually add a third field that its the record's mnemonic value and always use that, neither the name or the id.
Now the mnemonic value is like a business key (well, it is a business key) in the sense that its a business value and does not depend on the database (for the id) or the way it displayed (the description) so for example for your status table you may have
StatusID,StatusName,StatusMnemo
1 ,COMPLETED ,COM
2 ,REJETED ,REJ
and so forth.
And in your queries you always join by statusId but you add a clause to join against the status table by StatusMnemo. This is a value that's independent across environments and remains constant.
Also in inserts, you always use statusid.
If you have statusID values that need special treatment then they should be the same across environments.
Why would you introduce a statusID that needs special treatment in Prod that has not gone thru Pre and Dev?
What I often do is start iden at 100 and use that for generic status that don't need special treatment.
Then DEV owns the space under 100 for special treatment using IDENTITY INSERT ON.
If deploy from DEV to PRE insert any records under 100.

Is it better to name the primary key column id or *_id?

I've been using Rails for a few years and I've grown used to the convention of naming the primary key column id. But I've run across lots of examples in SQL books that name the primary key column something like employee_id for an employees table or feed_id for a feeds table.
One advantage of the 2nd system seems to be that you can use USING() more to produce more concise SQL queries:
select feeds.title, items.title from items inner join feeds USING(feed_id);
As opposed to
select feeds.title, items.title from items inner join feeds on feeds.id = items.feed_id;
Which naming convention is better? Which is favored by experienced database administrators?
Also, is it better to pluralize the name of the table?
I always use the verbose form (i.e. 'employee_id' rather than 'id') as it is more descriptive. If you are joining more than one table and both have 'id' column you will have to alias 'id' if you need to SELECT both of the ids. Also, as you mentioned, you get the advantage of USING clause. In the grand scheme of things it isn't a huge factor one way or the other but the more verbose form gives you advantages.
Both options are valid but the purists will say use id as its name is specified by the table.
I use table_id because I find it to be more descriptive and makes debugging easier. It's more practical.
Re: Tablenames. Another hotly debated topic among database nerds but I say Singular.
Tablename_Id is my strong preference. When you do joins to Fks you know exactly what to join to what and don't make mistakes where you join to ID in table a when you meant tableb below is an example of how easy this is to do especially if you copy the on clause from somewhere else
FROM tablea a
JOIN tableb b
ON a.ID = b.tableaid
JOIN tablec c
ON a.ID = c.tablebid
In the case above, you really wanted to join to B.Id but forgot to change it from a when you copied. It will work and give you a resultset that isn't correct. If you use table_id instead, the query would fail the syntax check.
Another problem with using Id is when you are doing complex reports. Since the repport queries have to have fields with individual names, you can end up wasting time writing a bunch of aliases you wouldn't need if you had named the id with the tablename.
Now people who use ORMs don't write a lot of SQl but what they do write and what report writers write are generally complex, complicated statements. You need to design you database to make it easier to do those things than simple queries.
The use of ID as the name for the identifying field is considered a SQl antipattern. http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ref=sr_1_1?s=books&ie=UTF8&qid=1308929815&sr=1-1
This is user preference, but I always name the primary keys of my tables Id. I always name references of that Id in other tables as [SingularEntityName][Id] e.g.
Credentials
Id Password
Users
Id Name CredentialId
Descriptions
Id UserId
Keeps my references clean. However, just be consistant in your naming and it really shouldn't matter how you set up your schemas.
To open the can of worms again,
I'm willing to bet those who select tablename_id are older, more experienced programmers.
Those who use just id are younger.
Why ? because you learn redundancy and constancy is not always a bad thing.
the one thing I would add to the #1 answer, use the "_" helps make it easier to pick out the variable in code, in the table, etc... I do the same for foreign keys. TableName_FK Some will argue over that but it works for me and it's obvious what it is.
I have had to work on other's code many times over the years. Consistency is critical, obfuscation is worthless and meaningful variable names very helpful.
There are those who argue that verbosity makes code harder to read. I don't think that argument flies in today's world of objects.that.derive.from.some.microsoft.class.twenty.layers.deep.that.you.have.to.fully.reference.
BTW - as so many have said, it's your choice. Those folks who spend time arguing over coding syntax don't have enough work to do. Learn to be flexible and to use the standards of the workplace where you are employed. If you are lucky enough to set your own standards, then have at it. The fact your are wondering is great. But choose one and then be consistent (until you change jobs or decide you have a paradigm shift that means you want to change your style.)
You can often pick out what era someone started learning to code by their personal preferences and styles. Guys that write very tight, minmal, hard to read code, started back when memory was very limited (DOS) and probably wrote a lot of assembler, those that use Hungarian started back with the Win SDK, etc...
This discussion has been evolving for decades. The older I get, the more I document my code, the more meaningful I make my variable names, etc... because in a week I will have forgotten what I wrote and I need the road maps to make sense of it. Not so much that I'm forgetful, although that's part of the equation, but more so because I'm writing code on so many different projects.
it's entirely your choice. But personally I prefer the second one as I wouldn't need to look for table names in my code when I come across an id. I think tablename_id is better.
Another advantage to giving your primary keys names that are unique to that table is that it makes it easier to have a naming convention, when referring to those keys in different tables, that indicates the corresponding key.
For example, suppose everything in your alpha table begins alpha_, so that you have alpha_id as your primary key. In your beta table - where everything would begin beta_ - you would use beta_alpha_id to have a reference in that table to the keys in the alpha table.

Database design: Store data from paper forms in database

Database design question for y'all. I have a form (like, the paper kind) that has several entry points for data. This form has changed, and is expected to change over years. It is being turned into a computer app, so that we can, among other things, quit wasting paper. (And minor things, like have all the data in one central store that can be queried, etc.) I'd like to store all of the forms data in a database, and have it be pretty agnostic as to the changes.
Originally, I was just considering each field to be a string -- and I had a table something like this:
FormId int (FK)
FieldName nvarchar(64)
FieldValue nvarchar(128)
...something like that. It was actually a bit more 3NFy in that FieldName was in another table, associated with an artificial key, so that the field names weren't duplicated all over the place.
However, I'd like to extend this to numeric and drop-down data. I could just store numeric data as strings, but that seems like a pretty crappy idea. Same with drop downs.
I could stop using a table, and actually use columns on the main form table (the one that FormId above references), but that means adding a column for each new item as they come along, and older forms would just be null. (And, unless I stored it, I wouldn't know when that column was created. With the string table above, it's implicit.)
I could extend the table above to something like:
FormId int (FK)
FieldName nvarchar(64)
FieldValueType int -- enum as to which of the columns below are valid (or just let nulls imply that)
FieldValue nvarchar(128)
FieldValueInt int
Combos would have to be in a OTLT (one true lookup table), which I have reservations about, but perhaps it's needed here?
Any advice on StackOverflow? I'm using MSSQL, but this is really a more general question.
Use Nulls. Proper database design is a complicated subject; you may do well to pick up a good reference and do some research on the whole thing (I gather this is a good book on the topic). In general, it sounds like you would be well served by starting with a single table that encapsulates all the fields in your form, and then putting it through the normalization process. And yes, use nulls and do NOT use an int to enumerate which columns are set to valid values; that is exactly what nulls are for.
You could have a separate table for each datatype.
I.e. to fetch an entire form you'd do an N-way join using the form id where N is the number of distinct datatypes you support (+ perhaps extras depending on the info you want - e.g. dropdown values would probably be stored in another table / your fieldname lookup / etc.)
But the design should probably also depend on how you intend to use the data, which you've said nothing about. And it would also depend on just how fast the rate of change is for these forms . . .
By creating a table with a description of your forms, you are actually defining a metadata structure. That's daunting. You would need a lot of the infrastructure needed for proper table description. I think the vendors of your database system spent a lot of effort in doing all that.
At first I thought - what a nice idea! Build your own compatibility-aware table description system!
But then I thought - I'm too stupid to do that on my own. There must be a database system capable of doing that.
So I conclude, not being a db expert, define proper defaults for 'new fields' in new form versions. Handle the compatibility issue in your business logic.
I would strongly advise against having a "generic table" like you describe.
You are essentially reinventing the relational database, which is not a good idea: Queries and updates will be very painful with your structure, and you will not be able to use the more advanced features like foreign keys and triggers, should you need them.
Just make a table(s) with columns for the data fields, and if a form does not have a field, let it be null.
Or, probably even better, have a "base table" (field that are in every form), and give names/version numbers to updated forms, and have a new table for the new columns that this version adds, then use a synthetic PK to join these new tables to your base table.
I.e.:
base table: id(numeric,PK), name, birthday, town
addresstable1: street, number, postal code, country, base_table_id (foreign key)
addresstable2: po box no, po box code, base_table_id (FK)
and so on.
That way you avoid loads of null fields; your tables are not so wide (always desirable), and your records are implicitly versioned, because the list of tables that have a record belonging to a record in your base table tells you which fields the original form had, hence what kind of form was used originally.

What is the preferred way to store custom fields in a SQL database?

My friend is building a product to be used by different independent medical units.
The database stores a vast collection of measurements taken at different times, like the temperature, blood pressure, etc...
Let us assume these are held in a table called exams with columns temperature, pressure, etc... (as well as id, patient_id and timestamp). Most of the measurements are stored as floats, but some are of other types (strings, integers...)
While many of these measurements are handled by their product, it needs to allow the different medical units to record and process other custom measurements. A very nifty UI allows the administrator to edit these customs fields, specify their name, type, possible range of values, etc...
He is unsure as to how to store these custom fields.
He is leaning towards a separate table (say a table custom_exam_data with fields like exam_id, custom_field_id, float_value, string_value, ...)
I worry that this will make searching both more difficult to achieve and less efficient.
I am leaning towards modifying the exam table directly (while avoiding conflicts on column names with some scheme like prefixing all custom fields with an underscore or naming them custom_1, ...)
He worries about modifying the database dynamically and having different schemas for each medical unit.
Hopefully some people which more experience can weigh in on this issue.
Notes:
he is using Ruby on Rails but I think this question is pretty much framework agnostic, except from the fact that he is only looking for solutions in SQL databases only.
I simplified the problem a bit since the custom fields need to be available for more than one table, but I believe this doesn`t really impact the direction to take.
(added) A very generic reporting module will need to search, sort, generate stats, etc.. of this data, so it is required that this data be stored in the columns of the appropriate type
(added) User inputs will be filtered, for the standard fields as well as for the custom fields. For example, numbers will be checked within a given range (can't have a temperature of -12 or +444), etc... Thus, conversion to the appropriate SQL type is not a problem.
I've had to deal with this situation many times over the years, and I agree with your initial idea of modifying the DB tables directly, and using dynamic SQL to generate statements.
Creating string UserAttribute or Key/Value columns sounds appealing at first, but it leads to the inner-platform effect where you end up having to re-implement foreign keys, data types, constraints, transactions, validation, sorting, grouping, calculations, et al. inside your RDBMS. You may as well just use flat files and not SQL at all.
SQL Server provides INFORMATION_SCHEMA tables that let you create, query, and modify table schemas at runtime. This has full type checking, constraints, transactions, calculations, and everything you need already built-in, don't reinvent it.
It's strange that so many people come up with ad-hoc solutions for this when there's a well-documented pattern for it:
Entity-Attribute-Value (EAV) Model
Two alternatives are XML and Nested Sets. XML is easier to manage but generally slow. Nested Sets usually require some type of proprietary database extension to do without making a mess, like CLR types in SQL Server 2005+. They violate first-normal form, but are nevertheless the fastest-performing solution.
Microsoft Dynamics CRM achieves this by altering the database design each time a change is made. Nasty, I think.
I would say a better option would be to consider an attribute table. Even though these are often frowned upon, it gives you the flexibility you need, and you can always create views using dynamic SQL to pivot the data out again. Just make sure you always use LEFT JOINs and FKs when creating these views, so that the Query Optimizer can do its job better.
I have seen a use of your friend's idea in a commercial accounting package. The table was split into two, first contained fields solely defined by the system, second contained fields like USER_STRING1, USER_STRING2, USER_FLOAT1 etc. The tables were linked by identity value (when a record is inserted into the main table, a record with same identity is inserted into the second one). Each table that needed user fields was split like that.
Well, whenever I need to store some unknown type in a database field, I usually store it as String, serializing it as needed, and also store the type of the data.
This way, you can have any kind of data, working with any type of database.
I would be inclined to store the measurement in the database as a string (varchar) with another column identifying the measurement type. My reasoning is that it will presumably, come from the UI as a string and casting to any other datatype may introduce a corruption before the user input get's stored.
The downside is that when you go to filter result-sets by some measurement metric you will still have to perform a casting but at least the storage and persistence mechanism is not introducing corruption.
I can't tell you the best way but I can tell you how Drupal achieves a sort of schemaless structure while still using the standard RDBMSs available today.
The general idea is that there's a schema table with a list of fields. Each row really only has two columns, the 'table':String column and the 'column':String column. For each of these columns it actually defines a whole table with just an id and the actual data for that column.
The trick really is that when you are working with the data it's never more than one join away from the bundle table that lists all the possible columns so you end up not losing as much speed as you might otherwise think. This will also allow you to expand much farther than just a few medical companies unlike the custom_ prefix you were proposing.
MySQL is very fast at returning row data for short rows with few columns. In this way this scheme ends up fairly quick while allowing you lots of flexibility.
As to search, my suggestion would be to index the page content instead of the database content. Use Solr to parse through rendered pages and hold links to the actual page instead of trying to search through the database using clever SQL.
Define two new tables: custom_exam_schema and custom_exam_data.
custom_exam_data has an exam_id column, plus an additional column for every custom attribute.
custom_exam_schema would have a row to describe how to interpret each of the columns of the custom_exam_data table. It would have columns like name, type, minValue, maxValue, etc.
So, for example, to create a custom field to track the number of fingers a person has, you would add ('fingerCount', 'number', 0, 10) to custom_exam_schema and then add a column named fingerCount to the exam table.
Someone might say it's bad to change the database schema at run time, but I'd argue that configuring these custom fields is part of set up and won't happen too often. Still, this method lets you handle changes at any time and doesn't risk messing around with your core table schemas.
lets say that your friend's database has to store data values from multiple sources such as demogrphic values, diagnosis, interventions, physionomic values, physiologic exam values, hospitalisation values etc.
He might have as well to define choices, lets say his database is missing the race and the unit staff need the race of the patient (different races are more unlikely to get some diseases), they might want to use a drop down with several choices.
I would propose to use an other table that would have these choices or would you just use a "Custom_field_choices" table, which at some point is exactly the same but with a different name.
Considering that the database :
- needs to be flexible
- that data from multiple tables can be added and be customized
- that you might want to keep the integrity of the main structure of your database for distribution and uniformity purpose
- that data MUST have a limit and alarms and warnings
- that data must have units ( 10 kg or 10 pounds) ?
- that data can have a selection of choices
- that data can be with different rights (from simple user to admin)
- that these data might be needed to generate reports without modifying the code (automation)
- that these data might be needed to make cross reference analysis within the system without modifying the code
the custom table would be my solution, modifying each table would end up being too risky.
I would store those custom fields in a table where each record ( dataType, dataValue, dataUnit ) would use in one row. So there would be a relation oneToMany from one sample to the data. You can also create a table to record all the kind of cutsom types you would use. For example:
create table DataType
(
id int primary key,
name varchar(100) not null unique
description text,
uri varchar(255) //<-- can be used for an ONTOLOGY
)
create table DataRecord
(
id int primary key,
sample_id int not null,//<-- reference to the sample
dataType_id int not null, //<-- references DataType
value varchar(100),//<-- the value as string
unit varchar(50)//<-- g, mg/ml, etc... but it could also be a link to a table describing the units just like DataType
)

What are some of your most useful database standards?

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I have some ideas, some that I have accumulated over time, but I really want to know what makes things go smoothly for you when modeling database:
Table name matches Primary Key name and description key
Schemas are by functional area
Avoid composite primary keys where possible (use unique constraints)
Camel Case table names and field names
Do not prefix tables with tbl_, or procs with SP_ (no hungarian notation)
OLTP databases should be atleast in BCNF / 4NF
Name similarly targetted stored procs with the same prefix, for instance if you've got 3 stored procedures for Person. That way everything for person is grouped in one place and you can find them easily without having to look through all your procs to find them.
PersonUpdate
PersonDelete
PersonCreate
Do similar things for tables when you have groups of tables with related data. For instance:
InvoiceHeaders
InvoiceLines
InvoiceLineDetails
If you have the option of schemas within your database, use them. It's much nicer to see:
Invoice.Header
Invoice.Line.Items
Invoice.Line.Item.Details
Person.Update
Person.Delete
Person.Create
Don't use triggers unless there's no other reasonable approach to achieve that goal.
Give field names a meaningful prefix so you can tell what table they come from without someone needing to explain. That way when you see a field name referenced, you can easily tell which table it's from.
Use consistent data types for fields containing similar data, i.e. don't store phone number as numeric in one table and varchar in another. In fact, don't store it as numeric, if I come across a negative phone number I'll be mad.
Don't use spaces or other obscure characters in table/field names. They should be entirely alphanumeric - or if I had my druthers, entirely alphabetic with the exception of the underscore. I'm currently working on an inherited system where table and field names contain spaces, question marks and exclamation marks. Makes me want to kill the designer on a daily basis!
Don't use syntax keywords as object names it'll cause headaches trying to retrieve data from them. I hate having to wrap object names as [index] that's two needless chars I didn't need to type damn you!
One thing I haven't seen mentioned yet:
Never use database keywords as object names. You do not want to have to qualify them every time you use them
If you misspell something when you create it, fix it as soon as you notice it. Don't spend years having to remember that in this table UserName is really Usernmae. It's a whole lot easier to fix when there isn't much code written against it.
Never use implied joins (the comma syntax), always specify the joins.
Putting everybody's input together into one list.
Naming Standards
Schemas are named by functional area (Products, Orders, Shipping)
No Hungarian Notation: No type names in object names (no strFirstName)
Do not use registered keywords for object names
No spaces or any special characters in object names (Alphanumber + Underscore are the only things allowed)
Name objects in a natural way (FirstName instead of NameFirst)
Table name should match Primary Key Name and Description field (SalesType – SalesTypeId, SalesTypeDescription)
Do not prefix with tbl_ or sp_
Name code by object name (CustomerSearch, CustomerGetBalance)
CamelCase database object names
Column names should be singular
Table names may be plural
Give business names to all constraints (MustEnterFirstName)
Data Types
Use same variable type across tables (Zip code – numeric in one table and varchar in another is not a good idea)
Use nNVarChar for customer information (name, address(es)) etc. you never know when you may go multinational
In code
Keywords always in UPPERCASE
Never use implied joins (Comma syntax) - always use explicit INNER JOIN / OUTER JOIN
One JOIN per line
One WHERE clause per line
No loops – replace with set based logic
Use short forms of table names for aliases rather than A, B, C
Avoid triggers unless there is no recourse
Avoid cursors like the plague (read http://www.sqlservercentral.com/articles/T-SQL/66097/)
Documentation
Create database diagrams
Create a data dictionary
Normalization and Referential Integrity
Use single column primary keys as much as possible. Use unique constraints where required.
Referential integrity will be always enforced
Avoid ON DELETE CASCADE
OLTP must be at least 4NF
Evaluate every one-to-many relationship as a potential many-to-many relationship
Non user generated Primary Keys
Build Insert based models instead of update based
PK to FK must be same name (Employee.EmployeeId is the same field as EmployeeSalary.EmployeeId)
Except when there is a double join (Person.PersonId joins to PersonRelation.PersonId_Parent and PersonRelation.PersonId_Child)
Maintenance : run periodic scripts to find
Schema without table
Orphaned records
Tables without primary keys
Tables without indexes
Non-deterministic UDF
Backup, Backup, Backup
Be good
Be Consistent
Fix errors now
Read Joe Celko's SQL Programming Style (ISBN 978-0120887972)
My standards for Oracle are:
Keywords are always in UPPERCASE;
Database object names are always in lowercase;
Underscores will replace spaces (ie there won't be any camel case conventions that are common on, say, SQL Server);
Primary keys will pretty much always be named 'id';
Referential integrity will be enforced;
Integer values (including table ids) will generally always be NUMBER(19,0). The reason for this is that this will fit in a 64-bit signed integer thus allowing the Java long type to be used instead of the more awkward BigInteger;
Despite the misnomer of appending "_number" to some column names, the type of such columns will be VARCHAR2 not a number type. Number types are reserved for primary keys and columns you do arithmetic on;
I always use a technical primary keys; and
Each table will have its own sequence for key generation. The name of that sequence will be _seq.
With SQL Server, the only modification is to use camel case for database object names (ie PartyName instead of party_name).
Queries will tend to be written multi-line with one clause or condition per line:
SELECT field1, field2, field2
FROM tablename t1
JOIN tablename2 t2 ON t1.id = t2.tablename_id
WHERE t1.field1 = 'blah'
AND t2.field2 = 'foo'
If the SELECT clause is sufficiently long I'll split it out one field per line.
Name all constraints
don't forget to back up your databases on a regular basis.
Don't use type names in the field names. The older guys will remember the old MS standard of lpszFieldName and the stupidity that ensued.
Use descriptive field names That follow normal language conventions. For example "FirstName" instead of "NameFirst"
Each word in the field name is capitalized
No underscores
Do not use normal keywords such as "Index"
Do not prefix ANYTHING with the object type. For example we do NOT use tblCustomers or spCustomersGet. These don't allow for good sorting and provide zero value.
Use schemas to define separate areas of the database. Such as sales.Customers and hr.Employees. This will get rid of most of the prefixes people use.
Loops of any kind should be viewed with suspicion. There's usually a better set based way.
Use views for complicated joins.
Avoid complicated joins when possible. It may be more astheticaly pleasing to have a CustomerPhoneNumbers table; but honestly, how many phone numbers do we really need to store? Just add the fields to the Customers table. Your DB queries will be faster and it's much easier to understand.
If one table calls a field "EmployeeId" then EVERY SINGLE TABLE that references it should use that name. It doesn't need to be called CustomerServiceRepId just because it's in an extension table.
Almost all tables have the "s" ending. For example: Customers, Orders, etc. After all the table holds many records...
Evaluate your queries, indexes and foreign key relationships with an analysis tool. Even those that may be generated for you. You might be surprised.
Linking tables which support many to many relationships have both linked tables in the name. For example, SchoolsGrades. It's very easy to tell by the table name what it does.
Be CONSISTENT. If you start down one path with your conventions, don't change horses halfway unless you are willing to refactor all of the previous work. This should put the brakes on any "wouldn't it be great if.." ideas that end up causing confusion and vast amounts of rework.
Think before you type. Do you really need that table, field, sproc, or view? Are you sure it isn't covered somewhere else? Get concensus before adding it. And if for some reason you have to take it out, talk to your team first. I've been at places where the DBA's make daily breaking changes without regard for the devs. This isn't fun.
If a database is for a particular application, have a version table so that the database releases can be checked against the code releases (amongst other reasons).
I always try not to use the type in the field name - "sFirstName", "sLastName", or "iEmployeeID". While they match at first, if something changes, they'll be out of sync, and it's a huge headache to change those names later, since you have to change the dependant objects as well.
Intellisense and the GUI tools make it trivial to find out what type a column is, so I don't feel this is necessary.
The WITH clause really helps break queries down into manageable parts.
It also really helps for efficiency on the execution plans of the queries.
Ensure that every varchar/nvarchar choice is appropriate.
Ensure that every NULLable column choice is appropriate - avoid NULLable columns where possible - allowing NULL should be the justifiable position.
Regardless of any other rules you might use from the suggestions here, I would create a stored procedure in the database that can be run on a regular basis to determine system health for any rules or standards you do have (some of this is a little SQL-Server specific):
Look for orphaned records in any cases where the DBMS system's referential integrity cannot be used for some reason (in my system I have a table of processes and a table of tests - so my system_health SP looks for processes without tests, since I only have a one-way FK relationship)
Look for empty schemas
Look for tables without primary keys
Look for tables without any indexes
Look for database objects without documentation (we use SQL Server Extended properties to put the documentation in the database - this documentation can be as granular as the column).
Look for system-specific issues - tables which need to be archived, exceptions which are not part of normal monthly or daily processing, certain common column names with or without defaults (CreateDate, say).
Look for non-deterministic UDFs
Look for TODO comments to ensure that code in the DB does not somehow have untested or pre-release code.
All this can be automated to give you an overall picture of system health.
Everyone writes SQL queries (views, stored procedures, etc) in the same basic format. It really helps development/maintenance efforts down the road.
Consistent naming standards. Having everyone on the same page, using the same format (whether it be Camel Case, specific prefixes, etc..) helps in being able to maintain a system accurately.
A few likes and dislikes.
My opinion is prefixes are horrible in every aspect. I currently work on a system where the tables are prefixed, and the columns within the tables are prefixed with 2 letter table name acronyms, I waste at least 30 mins each day working on this database because the acronym isn't logical. If you want to denote something with a prefix use a schema owner instead.
Using NVarchar from the start of a project if there is even a slight hint that down the line the text data will need to support multi lingual chars. Upgrading large databases because of lack of forward planning and thinking is a pain and wastes time.
Splitting each condition within a where clause onto a new line for readability (in and not in statements wrapped in brackets and tabbed in.) I think this is the important standard for me.
I worked at one company where a standard was that comma's must always be placed at the start of a line when performing parameter or variable declarations. This apparently made it more readable however I found it a complete nightmare.
In addition to normalization to 3NF or BCNF (more about that in this question), I have found the following to be useful:
Name tables as plural nouns
Name columns as sigular
So a "People" table has a "PersonID" column.
There is nothing wrong with composite keys, so long as the rules of 3NF or BCNF still hold. In many cases (such as the "many-to-many" case) this is entirely desirable.
Avoid repeating the table name in the column names. peoplePersonID is better written as table.column anyway, and much more readable and therefore self-documenting. People.PersonID is better, to me at least.
ON DELETE CASCADE should be used very carefully.
Remember that NULL means one of two things: Either it's unknown or it's not applicable.
Remember also that NULLs have interesting affects on joins, so practice your LEFT, RIGHT, and FULL outer joins.
Some others (albeit small) comments to throw against the wall...
SQL Server database schemas can be useful for both organizing tables and stored procedures as well as controlling security.
Every transactional table should always track both who and when created the record as well as updated the record in separate columns. I've seen implementation that simply used "update date" which can lead to auditing challenges in the future.
Use GUID's for row identifiers for all rows for projects with offline/synchronization requirements.
Good database design and normalization.
Tables are named in the singular, lowercase, no underscores, no prefix
Fields also lowercase, no underscores, no prefix
Stored procedures prefixed with "st_" (sorts nicely)
Views that are treated like tables have no prefix
Views created for special reports, etc. have a "v" prefix
Indexed views created for performance have an "ixv" prefix
All indexes have purposeful names (no auto-naming)
Strongly prefer uniqueidentifier (with sequential increment) over int IDENTITY for surrogate keys
Don't artificially limit VARCHAR/NVARCHAR fields to 100 or 255. Give them room to breath. This isn't the 1980s, fields are not stored padded to their max length.
3NF minimum standard
Prefer joining tables to column-level foreign keys: many 1:m assumptions are challenged as a system grows over time.
Always use surrogate keys, not natural keys, as the primary key. All assumptions about "natural" keys (SSNs, usernames, phone numbers, internal codes, etc.) will eventually be challenged.
Tabular formatted SQL.
select a.field1, b.field2
from any_table a
inner join blah b on b.a_id = a.a_id
inner join yet_another y on y.longer_key = b.b_id
where a.field_3 > 7
and b.long_field_name < 2;
Part of this is to use uniformly long alias names (in the example, here, a, b, and y are all length 1).
With this kind of formatting, I can more quickly answer common questions like, "what table is aliased by 'a'?" and "which fields join table T into the query?" The structure doesn't take long to apply or to update, and I find that it saves a lot of time. We spend more time reading code than writing it.
Document everything; wiki type documentation is easy to setup and the software is free.
Make sure you understand the interface first and design the database second. Most of the time its a lot better to know how the data you are going to use needs to work and then engineer the database. Most bad DB design happens as things evolve not upfront.
Then define the database standard and version you are going to work to. Define standards for the code elements (views, functions etc), database naming; naming conventions for columns, tables; type conventions for columns; coding templates.
Spend time considering how you define types having standard database types for fields or bespoke types are a good thing to sort out upfront.
As part of your documentation include a list of don'ts as well as dos for the application which include your prefered hated functionality cursors, triggers.
Review it regularly.
13- Evaluate your queries
Thats true. Sometimes you don't get what you wanted.
For me, it's always useful to name the tables and fields with their exact content and (for us) in clear spanish and using Upper Camel Case, with no whitespaces:
User Name: NombreUsuario
First Last Name: ApellidoPaterno
Second Last Name: ApellidoMaterno
etc etc
Taking "database" to mean "SQL product", my answer is, "Too many to mention. You could write a whole book on the subject." Happily, someone has.
We use Joe Celko's SQL Programming Style (ISBN 978-0120887972): "this book is a collection of heuristics and rules, tips, and tricks that will help you improve SQL programming style and proficiency, and for formatting and writing portable, readable, maintainable SQL code."
Advantages of this approach is include:
the guy knows more about this kind of thing than me (is there another book on SQL heuristics?!);
the work has already been done e.g. I can give the book to someone on the team to read and refer to;
if someone doesn't like my coding style I can blame someone else;
I recently got a load of rep on SO by recommending another Celko book :)
In practice we do deviate from the prescriptions of The Book but surprisingly rarely.
In MS-SQL, I've always had objects owned by dbo., and I prefix calls to those objects with dbo.
Too many times I've seen our devs wonder why they can't call their objects that they inadvertainly owned.
Avoid silly abbreviation conventions, such as comprehensive dictionaries of abbreviations that actively encourage monstrosities like EMP_ID_CONV_FCTR_WTF_LOL_WAK_A_WAK_HU_HU. This rule is inspired a real set of guidelines I've seen before.
MVP Aaron Bertrand's
"My stored procedure "best practices" checklist"
Table name matches Primary Key name and description key
I have just recently, after years of agreeing with this, jumped ship, and now have an "ID" column on every table.
Yes I know, when linking tables it's abiguous! But so is linking ProductID to ProductID, so uhh, why the extra typing?
This:
SELECT p.Name, o.Quantity FROM Products p, Orders o WHERE o.ProductID = p.ID
Is slightly better than this:
SELECT p.Name, o.Quantity FROM Products p, Orders o WHERE o.ProductID = p.ProductID
Note that both will require table or alias prefixes. But not only am I typing slightly less (multiply that across dozens of tables with long descriptive names and it adds up fast in a data intensive application) but it also makes it easier to know which table is the parent table in every join, which, when joining 8-10 tables in a query, can help quite a bit.
I follow a lot of the same conventions as others here, but I wanted to say a few things that haven't been said yet.
Regardless of whether you like plural names or singular names for your tables, be consistent. Choose one or the other, but don't use both.
The primary key in a table has the same name as the table, with the suffix _PK. Foreign keys have their same name as their corresponding primary key, but with a suffix of _FK. For example, the Product table's primary key is called Product_PK; in the Order table the corresponding foreign key is Product_FK. I picked this habit up from another DBA friend of mine and so far I'm liking it.
Whenever I do an INSERT INTO...SELECT, I alias all the columns in the SELECT portion to match the names of the columns from the INSERT INTO portion to make it easier to maintain and see how things match up.
The most important standard is: don't have a database by default. I find too many developers grabbing a database for projects where life would have been much easier without one (at least yet). It is just a tool in the toolbox, and not every problem is a nail.
Inappropriate use of a database leads to anemic domain models, badly testable code and unneeded performance problems.
I agree with just about everything you have put there except for #5. I often use prefixes for tables and stored procedures because the systems that we develop have lots of different functional areas, so I will tend to prefix the tables and sprocs with an identifier that will allow for them to group nicely in Management Studio based on what area they belong to.
Example: cjso_Users, cjso_Roles, and then you have routing_Users, routing_Roles. This may sound like replication of data, but in reality the two different user/roles tables are for completely separate functions of the system (cjso would be for the customer-based ecommerce app while the routing would stand for employees and distributors who use the routing system).
I like our table naming convention:
People Table
PEO_PersonID
PEO_FirstName
...
Which helps make larger querys a bit more readable. and joins make a bit more sense:
Select * -- naughty!
From People
Join Orders on PEO_PersonID = ORD_PersonID
--...
i guess rather than what the naming convention is, is the consistency of the naming.