Should we use prefixes in our database table naming conventions? - sql

We are deciding the naming convention for tables, columns, procedures, etc. at our development team at work. The singular-plural table naming has already been decided, we are using singular. We are discussing whether to use a prefix for each table name or not. I would like to read suggestions about using a prefix or not, and why.
Does it provide any security at all (at least one more obstacle for a possible intruder)? I think it's generally more comfortable to name them with a prefix, in case we are using a table's name in the code, so to not confuse them with variables, attributes, etc. But I would like to read opinions from more experienced developers.

I find hungarian DB object prefixes to indicate their types rather annoying.
I've worked in places where every table name had to start with "tbl". In every case, the naming convention ended up eventually causing much pain when someone needed to make an otherwise minor change.
For example, if your convention is that tables start with "tbl" and views start with "v", thn what's the right thing to do when you decide to replace a table with some other things on the backend and provide a view for compatibility or even as the preferred interface? We ended up having views that started with "tbl".

I prefer prefixing tables and other database objects with a short name of the application or solution.
This helps in two potential situations which spring to mind:
You are less likely to get naming conflicts if you opt to use any third-party framework components which require tables in your application database (e.g. asp net membership provider).
If you are developing solutions for customers, they may be limited to a single database (especially if they are paying for external hosting), requiring them to store the database objects for multiple applications in a single database.

I don't see how any naming convention can improve security...
If an intruder have access to the database (with harmful permissions), they will certainly have permissions to list table names and select to see what they're used for.
But I think that truly confusing table names might indirectly worsen security.
It would make further development hard, thus reducing the chance security issues will be fixed, or it could even hide potential issues:
If a table named (for instance) 'sro235onsg43oij5' is full of randomly named coloumns with random strings and numbers, a new developer might just think it's random test data (unless he touches the code that interact with it), but if it was named 'userpasswords' or similar any developer who looks at the table would perhaps be shocked that the passwords is stored in plaintext.

Why not name the tables according to the guidelines you have in place for coding? Consider the table name a "class" and the columns a "property" or "field". This assists when using an ORM that can automatically infer table/column naming from class/member naming.
For instance, Castle ActiveRecord, declared like below assumes the names are the same as the member they are on.
[ActiveRecord]
public class Person
{
[PrimaryKey]
public Int32 Id { get; set; }
[Property]
public String Name { get; set; }
}

If you use SqlServer the good start would be to look at the sample databases provided for some guidance.

In the past, I've been opposed to using prefixes in table names and column names. However, when faced with the task of redesigning a system, having prefixes is invaluable for doing search and replace. For example, grepping for "tbl_product" will probably give you much more relevant results than grepping for "product".

If you're worried about mixing up your table names, employ a hungarian notation style system in your code. Perhaps "s" for string + "tn" for table name:
stnUsers = 'users';
stnPosts = 'posts';
Of course, the prefix is up to you, depending on how verbose you like your code... strtblUsers, strtblnmeUsers, thisisthenameofatableyouguysUsers...
Appending a prefix to table names does have some benefits, especially if you don't hardcode that prefix into the system, and allow it to change per installation. For one, you run less risk of conflicts with other components, as Ian said, and secondly, should you wish, you could have two or instances of your program running off the same database.

Related

Why prefix sql function names?

What is a scenario that exemplifies a good reason to use prefixes, such as fn_GetName, on function names in SQL Server? It would seem that it would be unnecessary since usually the context of its usage would make it clear that it's a function. I have not used any other language that has ever needed prefixes on functions, and I can't think of a good scenario that would show why SQL is any different.
My only thinking is that perhaps in older IDE's it was useful for grouping functions together when the database objects were all listed together, but modern IDE's already make it clear what is a function.
You are correct about older IDEs
As a DBA, trying to fix permissions using SQL Server Enterprise Manager (SQL Server 2000 and 7.0), it was complete bugger trying to view permissions. If you had ufn or usp or vw it became easier to group things together because of how the GUI presented the data.
Saying that, if you have SELECT * FROM Thing, what is Thing? A view or a table? It may work for you as a developer but it will crash when deployed because you don't grant permissions on the table itself: only views or procs. Surely a vwThing will keep your blood pressure down...?
If you use schemas, it becomes irrelevant. You can "namespace" your objects. For example, tables in "Data" and other objects per client in other schemas eg WebGUI
Edit:
Function. You have table valued and scalar functions. If you stick with the code "VerbNoun" concept, how do you know which is which without some other clue. (of course, this can't happen because object names are unique)
SELECT dbo.GetName() FROM MyTable
SELECT * FROM dbo.GetName()
If you use a plural to signify a table valued function, this is arguably worse
SELECT dbo.GetName() FROM MyTable
SELECT * FROM dbo.GetNames()
Whereas this is less ambiguous, albeit offensive to some folk ;-)
SELECT dbo.sfnGetName() FROM MyTable
SELECT * FROM dbo.tfnGetName()
With schemas. And no name ambiguity.
SELECT ScalarFN.GetName() FROM MyTable
SELECT * FROM TableFN.GetName()
Your "any other language" comment doesn't apply. SQL isn't structured like c#, Java, f#, Ada (OK, PL/SQL might be), VBA, whatever: there is no object or namespace hierarchy. No Object.DoStuff method stuff.
A prefix may be just the thing to keep you sane...
There's no need to prefix function names with fn_ any more than there's a need to prefix table names with t_ (a convention I have seen). This sort of systematic prefix tends to be used by people who are not yet comfortable with the language and who need the convention as an extra help to understanding the code.
Like all naming conventions, it hardly matters what the convention actually is. What really matter is to be consistent. So even if the convention is wrong, it is still important to stick to it for consistency. Yes, it may be argued that if the naming convention is wrong then it should be changed, but the effort implies a lot: rename all objects, change source code, all maintenance procedures, get the dev team committed to follow the new convention, have all the support and ops personnel follow the new rules etc etc. On a large org, the effort to change a well established naming convention is just simply overwhelming.
I don't know what your situation is, but you should consider carefully before proposing a naming convention change just for sake of 'pretty'. No matter how bad the existing naming convention in your org is, is far better to stick to it and keep the naming consistency than to ignore it and start your own.
Of course, more often than not, the naming convention is not only bad, is also not followed and names are inconsistent. In that case, sucks to be you...
What is a scenario that exemplifies a
good reason to use prefixes
THere is none. People do all kind of stuff because they always did so, and quite a number of bad habits are explained with ancient knowledge that is wrong for many years.
I'm not a fan of prefixes, but perhaps one advantage could be that these fn_ prefixes might make it easier to identify that a particular function is user-defined, rather than in-built.
We had many painful meetings at our company about this. IMO, we don't need prefixes on any object names. However, you could make an argument for putting them on views, if the name of the view might conflict with the underlying table name.
In any case, there is no SQL requirement that prefixes be used, and frankly, IMO, they have no real value.
As others have noticed, any scenario you define can easily be challenged, so there's no rule defining it as necessary; however, there's equally no rule saying it's unnecessary, either. Like all naming conventions, it's more important to have consistent usage rather than justification.
One (and the only) advantage I can think of is that it a consequently applied prefixing scheme could make using intellisense / autocomplete easier since related functionality is automagically grouped together.
Since I recently stumbled over this question, I'd like to add the following point in favour of prefixes: Imagine, you have some object id from some system table and you want to determine whether it's a function, proc, view etc. You could of course run a test for each type, it's much easier though to extract the prefix from the object name and then act upon that. This gets even easier when you use an underscore to separate the prefix from the name, e.g. usp_Foo instead of uspFoo. IMHO it's not just about stupid IDEs.

There has to be a better way to do localized database fields

So far there've been several questions regarding this, and they've all come down to the same answer: one table for the language-neutral data, 1-* to a table with the translations and an indexed language ID field.
This has several problems:
Twice as much CRUD.
Need for Ajax CRUD if you want a decently friendly web UI.
More than twice the validation -- you need to ensure that the relationship is 1-* rather than 0-*.
Collation differences between languages isn't accommodated.
Queries require joins.
If you want slugs in multiple languages, oh boy.
A lot of database people have worked on all sorts of theoretical and practical problems, but surprisingly few people work on this one.
I think what we need ultimately is:
A field type that'll store multiple versions of strings
Multiple indices for each such field, one for each language or variation, with the option to specify the correct collation mode
A standard ORM object for this crazy thing
UI elements
Overkill? Sure, maybe, but the whole problem is a real nightmare as it is. And it's not exactly an uncommon scenario.
We gotta try to convince server vendors to work on this.
Edit: By the way, this is my first time using the community wiki; hopefully I'm doing it right.
Edit 2: Something about my wording seems to have made people think that I'm attacking the very concept of DBMS. I'm not; I'm simply saying that built-in support for localization is a much-needed feature.
I probably shouldn't have mentioned performance; it's of course completely negligible most of the time. The focus of my concern is on the fact that this really stifles productivity.
I'll provide an example. Suppose I have a very trivial table for a decidedly trivial store:
Products (id, price, description, name, slug)
In EF/MVC, I'd throw this in the ORM designer, maybe encapsulate it in a repository, build a Products controller, and have actions for Index, Details, Create, Update, Edit and Delete. To identify a product in any of the items, I'd simply do a WHERE(slug = #slug). I'd make a view model for the create/edit actions, design the form control, and wire it up straight to the repository. Done and done. To access the details for a product, the user would go to /products/details/product-slug.
But then since the rest of the website is bilingual, I decide to change the products table accordingly.
Products (id, price)
ProductsText (productId, language, description, name, slug)
Hey, that's not so bad. Yeah, not yet. Then you write your relationships and your constraints, and then you write you write out all your properties in the view-model, and then you make a complete CRUD controller for the ProductsText data or use jQuery/Ajax to add create/update/edit buttons on your Products controller, and then you add validation logic to make sure the user enters at least the primary language, and then when you want to read data for the end-user pages you write another query to take join ProductsText.slug and ProductsText.language with Products... I probably missed something, but you get the idea.
The complexity of the program just explodes with boilerplate code once you have localization involved.
Of course, I don't expect the problem to be solved completely, and it's obviously just as much a UI problem as it is a database problem. But there's just so much that could be done to make all this easier. A "multistring" field type might be a really good start.
Edit 3: Anyone ever hear of SQL Server Modeling Services? It has some localization tools in it that could be a step in the right direction. Still CTP though.
-- Simulate the French locale with the SET LANGUAGE statement.
SET LANGUAGE French
select Id, CountryName,
[System.Globalization].[SessionsString](CountryName, 1) as CountryNameString
from [Location].[CountriesTable]
What is a localized database field?
Typically in applications we've worked in, the UI is localized. This is accomplished using a database, and we put all the translations (and potentially the master phrases) in the table with a locale-code and phraseid being the primary key. This is fairly straightforward, requires a single reusable set of stored procs and has good performance and the usage is well-understood. We often allow translation on the fly so that the app interface includes a translation feature where corrections can be made and other users will see them live - either rich forms applications or web forms applications (depending on caching - which is another key feature of UI localization)
As far as querying requiring joins - that's just a fact of life in a normalized relational database, and performance there is usually managed with a good normalized design and proper indexing.
In other "data", it has made little sense to localize except under direction of the application requirements. For instance, even though you may offer a product in multiple countries, the SKU and distributor may be different. This level of localization is very application specific and we often dealt with it as a separate database and there really isn't anything tying those individually country database together - many products were not available although there may have been equivalent products in the other countries.
If you are selling the same products around the world, then you kind of fall into the original scenario in a kind of multi-lingual CMS. This requires significant work besides the low-level database. For instance, if someone corrects the default product description, what flags the translators that the translations need to also be corrected? These questions are non-trivial. Although I can see where database vendors could assist with features, these are intrinsic difficulties of application requirements and design and not necessarily something the database can add features which will universally solve.
The collation issue is indeed a little awkward. Typically data is stored in nvarchar and you would not know the collation you wanted for retrieval at the time you wrote the stored proc, since the locale would be a parameter. This only affects collections retrieved which need to be ordered by content, not usually natural key and certainly not retrieval by key - it's not a large problem, but is one which cannot easily be handled without dynamic SQL (casting using the preferred collation from a table depending upon the location passed in, if you mix data from different locales, you would have to decide if you want to sort by locale first and then it may be difficult to pick a collation which might work properly within all locales in the same result set). You are probably going to want to use a Windows collation with such a wide variety of data.
Similarly with ORMs, we typically treated the composite unique key of locale/phraseid as the key to retrieve objects (we typically also had a surrogate identity primary key) - I know that traditional ORMs don't necessarily like this departure from retrieval by a meaningless surrogate key.
I've encountered all of these issues for localized CRM-style web sites. Not fun to design and optimize, but it can be done. My 2ยข worth:
1. Twice as much CRUD.
This depends on how your CRUD is designed. Any of my stored procedures or functions that can retrieve a possibly-localized field take a locale/culture code parameter. All of these fields are also NVARCHAR to avoid encoding issues.
2. Need for Ajax CRUD if you want a decently friendly web UI.
I suppose so, but this is application-dependent. Should defer to the "internal" CRUD (DRY principle).
3. More than twice the validation -- you need to ensure that the relationship is 1-* rather than 0-*.
This also assumes that all content is required in all supported locales, instead of using a fallback mechanism. For example, Microsoft's MSDN content is available in multiple locales, but some is in only one (generally this is US English, the "neutral" locale for Microsoft).
For a CRM-style system, any locale can be used for the initial content as long as the fallback uses that if the neutral content is not available.
4. Collation differences between languages isn't accommodated.
I find that it is easier to put all collation support at the UI/reporting layer. Multilingual-aware tables with collation/locale specified on a row-by-row basis would be a very nice-to-have feature but I wouldn't like to wait for it to become available...
5. Queries require joins.
Yes, definitely makes the query a bit more complicated :-) but no real way around that. Can get even more complicated if locale fallback is included (a "locale specificity" ranking field helps here).
6. If you want slugs in multiple languages, oh boy.
This is the reason that the .NET replacement parameters in the format string were designed to be indexed, not positional (printf(), etc. are positional). An English format may need replacements in 1, 2, 3 order, while the German equivalent uses 3, 1, 2.
To make life easier for localizers, whenever I create a .NET resource bundle I document the parameters including index, data type (including minimum and/or maximum string lengths), and a contextual description - context is important for determining text gender in some locales.
Plurality may also require multiple related resources as some locales need more than just "single" and "plural" (e.g. "0 files", "1 file", "2 files").
The same rules must apply to any localizable column in the database.
Well the answers are not that helpfull so far. I had the same problem on various projects I was doing in the past. And there was never a shortcut nor a solution out of the box that helpped me to solve this problem in a easy way. But your approach is going into the right direction and with a little work on your Data Access Layer you can actualy abstract all the burden that is caused by this requirement.
So for Metadata like Types, Categories, Countries etc. performance is not an issue since the whole stuff can be cached. For freetext entries it is a different story. You most probably can't cache them and they tend to be quite long.
You might already know those pages:
http://www.codeproject.com/KB/aspnet/LocalizedSamplePart2.aspx
http://www.sisulizer.com/online-help/DatabaseLocalization.shtml
Best-practices for localizing a SQL Server (2005/2008) database
In my experience I haven't commonly run into the problem where the data stored in the database has many language-dependent versions of the same text. Typically a developed application will have many language files for all the text that's more or less statically built into the application. Then we see database data for text users enter. While an application may be used by users with many different languages, the situation where users type the same text in multiple languages is not so common. Typically uses of an application will show the UI in their language and then enter and view data in their language.
For example, users of our application in the US vs in Netherlands or Saudi Arabia would see the UI in the language of their choice, but for any given installation, the data they enter will consistently be in their native language.
Obviously this doesn't apply to all cases. CRMs are an example where you would have the same text with multiple translations, like Wikipedia, but I think what I described above is the more common scenario.
"A lot of database people have worked on all sorts of theoretical and practical problems, but surprisingly few people work on this one."
That's because there is nothing to work on, from a theoretical perspective, in your example. The so-called "problems" you mention are, all of them, nothing more than a direct consequence of the fact that you are managing more data.
"Twice as much CRUD."
And why is that a problem ? I know of at least a few systems I built that had a lot more of that than your example.
"Need for Ajax CRUD if you want a decently friendly web UI."
Is that really so ? I don't know, but at any rate how data is handled in the presentation layer, is no concern of the DBMS, and if the programmer thinks it is too difficult/cumbersome, then don't blame the DBMS for that.
More than twice the validation -- you need to ensure that the relationship is 1-* rather than 0-*.
And why is that a problem ? If more business rules are stated, more validation is required.
"Collation differences between languages isn't accommodated."
How so ? What is the sense of collating English text with French ? Of English text with Ukrainian or Russian or Chinese ? Or did you mean something else ?
"Queries require joins."
And why is that a problem ?
"If you want slugs in multiple languages, oh boy."
In what context ? For what purpose ?
SELECT language,nllabel FROM ...
NATURAL JOIN (SELECT 'EN' as language UNION SELECT 'FR' as language)
Oh but wait, I forgot ... JOINs are also a problem.
"and it's obviously just as much a UI problem as it is a database problem."
I disagree that it is. When looking at your problem from a database angle, there are two things that might possibly be a small beginning of a solution :
the possibility to do full view updating (both through JOIN and through GROUP, for your case).
the possibility to have attributes of type 'table' inside database tables. You could then have the entire set of applicable localized names-stuff as a sinle attribute in a single row for your product/...
As for full view updating : don't hold your breath. You'll suffocate long before it has arrived.
As for nested tables : they might already exist, if anyone has them Oracle will, I don't really know, but I'm not really confident that this will really make life easier on the UI side of things.
Oh, and BTW : SQL is nowhere near "theoretically pure".

Getting rid of hard coded values when dealing with lookup tables and related business logic

Example case:
We're building a renting service, using SQL Server. Information about items that can be rented is stored in a table. Each item has a state that can be either "Available", "Rented" or "Broken". The different states reside in a lookup table.
ItemState table:
id name
1 'Available'
2 'Rented'
3 'Broken'
Adding to this we have a business rule which states that whenever an item is returned, it's state is changed from "Rented" to "Available".
This could be done with a an update statement like "update Items set state=1 where id=#itemid". In application code we might have an enum that maps to the ItemState id:s. However, these contain hard coded values that could lead to maintenance issues later on. Say if a developer were to change the set of states but forgot to fix the related business logic layer...
What good methods or alternate designs are there for dealing with this type of design issues?
Links to related articles are also appreciated in addition to direct answers.
In my experience this is a case where you actually have to hardcode, preferably by using an Enum which integer values match the id's of your lookup tables. I can't see nothing wrong with saying that "1" is always "Available" and so forth.
Most systems that I've seen hard code the lookup table values and live with it. That's because, in practice, code tables rarely change as much as you think they might. And if they ever do change, you generally need to re-compile any programs that rely on that DDL anyway.
That said, if you want to make the code maintainable (a laudable goal), the best approach would be to externalize the values into a properties file. Then you can edit this file later without having to re-code your entire app.
The limiting factor here is that your app depends for its own internal state on the value you get from the lookup table, so that implies a certain amount of coupling.
For lookups where the app doesn't rely on that code, (for instance, if your code table stores a list of two-letter state codes for use in an address drop-down), then you can lazily load the codes into an object and access them only when needed. But that won't work for what you're doing.
When you have your lookup tables as well as enums defined in the code, then you always have an issue with keeping them in sync. There is not much that can be done here. Both live effectively in two different worlds and are generally unaware of each other.
You may wish to reject using lookup tables and only let your business logic operate these values. In that case you miss the options of relying on referential integrity to back you ap on the data integrity.
The other option is to build up your application in that way that you never need these values in your code. That means moving part of your business logic to the database layer, meaning, putting them in stored procedures and triggers. This will also have the benefit of being agnostic to the client. Anyone can invoke SPs and get assured the data will be kept in the consistence state, consistent with your business logic rules as well.
You'll need to have some predefined value that never changes, be it an integer, a string or something else.
In your case, the numerical value of the state is the state's surrogate PRIMARY KEY which should never change in a well-designed database.
If you're concerned about the consistency, use a CHAR code: A, R or B.
However, you should stick to it as well as to a numerical code so that A always means Available etc.
You database structure should be documented as well as the code is.
The answer depends entirely on the language you're using: solutions for this are not the same in Java, PHP, Smalltalk or even Assembler...
But let me tell you something: while it's true hard coded values are not a great thing, there are times in which you do need them. And this one is pretty much one of them: you need to declare in your code your current knowledge of the business logic, which includes these hard coded states.
So, in this particular case, I would hard code those values.
Don't overdesign it. Before trying to come up with a solution to this problem, you need to figure out if it's even a problem. Can you think of any legit hypothetical scenario where you would change the values in the itemState table? Not just "What if someone changes this table?" but "Someone wants to change this table in X way for Y reason, what effect would that have?". You need to stay realistic.
New state? you add a row, but it doesn't affect the existing ones.
Removing a state? You have to remove the references to it in code anyway.
Changing the id of a state? There is no legit reason to do that.
Changing the name of a state? There is no legit reason to do that.
So there really should be no reason to worry about this. But if you must have this cleanly maintainable in the case of irrational people who randomly decide to change Available to 2 because it just fits their Feng Shui better, make sure all tables are generated via a script which reads these values from a configuration file, and then make sure all code reads constants from that same configuration file. Then you have one definition location and any time you want to change the value you modify that configuration file instead of the DB/code.
I think this is a common problem and a valid concern, that's why I googled and found this article in the first place.
What about creating a public static class to hold all the lookup values, but instead of hard-coding, we initialize these values when the application is loaded and use names to refer them?
In my application, we tried this, it worked. Also you can do some checking, e.g. the number of different possible values of a lookup in code should be the same as in db, if it's not, log/email/etc. But I don't want to manually code this for the status of 40+ biz entities.
Moreover, this can be part of the bigger problem of OR mapping. We're exposed with too much details of the persistence layer, and thus we have to take care of it. With technologies like Entity Framework, we don't need to worry about the "sync" part because it's automated, am I right?
Thanks!
I've used a similar method to what you're describing - a table in the database with values and descriptions (useful for reporting, etc.) and an enum in code. I've handled the synchronization with a comment in code saying something like "these values are taken from table X in database ABC" so that the programmer knows the database needs to be updated. To prevent changes from the database side without the corresponding changes in code I set permissions on the table so that only certain people (who hopefully remember they need to change the code as well) have access.
The values have to be hard-coded, which effectively means that they can't be changed in the database, which means that storing them in the database is redundant.
Therefore, hard-code them and don't have a lookup table in the database. Instead store the items state directly in the items table.
You can structure your database so that your application doesn't actually have to care about the codes themselves, but rather the business rules behind them.
I have done both of the following:
Do one or more of your codes have a certain characteristic, such as IsAvailable, that the application cares about? If so, add it as a flag column to the code table, where those that match are set to true (or your DB's equivalent), and those that don't are set to false.
Do you need to use a specific, single code under a certain condition? You can create a singleton table, named something like EnvironmentSettings, with a column such as ItemStateIdOnReturn that's a foreign key to the ItemState table.
If I wanted to avoid declaring an enum in the application, I would use #2 to address the example in the question.
Whether you take this approach depends on your application's priorities. This type of structure comes at the cost of additional development and lookup overhead. Plus, if every individual code comes with its own business rules, then it's not practical to create one new column per required code.
But, it may be worthwhile if you don't want to worry about synchronizing your application with the contents of a code table.

Naming database table fields

Exists any naming guideline for naming columns at Sql Server? I searched at MSDN, but dont found anything, just for .Net
There are lots of different conventions out there (and I'm sure other answers may make some specific suggestions) but I think that the most important thing is that you be consistent. If you are going to use a prefix for something, use it everywhere. If you are going to have a foreign key to another table, use the same column name everywhere. If you are going to separate words with underscores, do that everywhere.
In other words, if someone looks at a few tables, they should be able to extrapolate out and guess the names of other tables and columns. It will require less mental processing to remember what things are called.
There are many resources out there, but nothing that I have been able to truly pin down as a SQL Server specific set or anything published by Microsoft.
However, I really like this list.
Also, very important to NOT start out stored procedures with sp_
To be 100% honest though, the first part of my posted link is the most important. It must make sense for your organization, application, and implementation.
As always, google is your friend...
I find the following short list helpful:
Name tables as pluralnouns (or singular, but as a previous response stated, be consistent) for example "Customers", "Orders", "LineItems"
Stored procedures should be named without any prefixes such as "sp_" since SQL Server uses the "sp_" prefix to denote special meaning for system procedures.
Name columns as though as you would name attributes on a class (without using underscores)
Try not to use space characters in naming columns or database entities since you would have to escape all names with "[...]"
Many-to-many tables: for example "CustomerOrders"

Many-to-many relationship: use associative table or delimited values in a column?

Update 2009.04.24
The main point of my question is not developer confusion and what to do about it.
The point is to understand when delimited values are the right solution.
I've seen delimited data used in commercial product databases (Ektron lol).
SQL Server even has an XML datatype, so that could be used for the same purpose as delimited fields.
/end Update
The application I'm designing has some many-to-many relationships. In the past, I've often used associative tables to represent these in the database. This has caused some confusion to the developers.
Here's an example DB structure:
Document
---------------
ID (PK)
Title
CategoryIDs (varchar(4000))
Category
------------
ID (PK)
Title
There is a many-to-many relationship between Document and Category.
In this implementation, Document.CategoryIDs is a big pipe-delimited list of CategoryIDs.
To me, this is bad because it requires use of substring matching in queries -- which cannot make use of indexes. I think this will be slow and will not scale.
With that model, to get all Documents for a Category, you would need something like the following:
select * from documents where categoryids like '%|' + #targetCategoryId + '|%'
My solution is to create an associative table as follows:
Document_Category
-------------------------------
DocumentID (PK)
CategoryID (PK)
This is confusing to the developers. Is there some elegant alternate solution that I'm missing?
I'm assuming there will be thousands of rows in Document. Category may be like 40 rows or so. The primary concern is query performance. Am I over-engineering this?
Is there a case where it's preferred to store lists of IDs in database columns rather than pushing the data out to an associative table?
Consider also that we may need to create many-to-many relationships among documents. This would suggest an associative table Document_Document. Is that the preferred design or is it better to store the associated Document IDs in a single column?
Thanks.
This is confusing to the developers.
Get better developers. That is the right approach.
Your suggestion IS the elegant, powerful, best practice solution.
Since I don't think the other answers said the following strongly enough, I'm going to do it.
If your developers 1) can't understand how to model a many-to-many relationship in a relational database, and 2) strongly insist on storing your CategoryIDs as delimited character data,
Then they ought to immediately lose all database design privileges. At the very least, they need an actual experienced professional to join their team who has the authority to stop them from doing something this unwise and can give them the database design training they are completely lacking.
Last, you should not refer to them as "database developers" again until they are properly up to speed, as this is a slight to those of us who actually are competent developers & designers.
I hope this answer is very helpful to you.
Update
The main point of my question is not developer confusion and what to do about it.
The point is to understand when delimited values are the right solution.
Delimited values are the wrong solution except in extremely rare cases. When individual values will ever be queried/inserted/deleted/updated this proves it was the wrong decision, because you have to parse and touch all the other values just to work with the desired one. By doing this you're violating first (!!!) normal form (this phrase should sound to you like an unbelievably vile expletive). Using XML to do the same thing is wrong, too. Storing delimited values or multi-value XML in a column could make sense when it is treated as an indivisible and opaque "property bag" that is NOT queried on by the database but is always sent whole to another consumer (perhaps a web server or an EDI recipient).
This takes me back to my initial comment. Developers who think violating first normal form is a good idea are very inexperienced developers in my book.
I will grant there are some pretty sophisticated non-relational data storage implementations out there using text property bags (such as Facebook(?) and other multi-million user sites running on thousands of servers). Well, when your database, user base, and transactions per second are big enough to need that, you'll have the money to develop it. In the meantime, stick with best practice.
It's almost always a big mistake to use comma separated IDs.
RDBMS are designed to store relationships.
My solution is to create an
associative table as follows: This is
confusing to the developers
Really? this is database 101, if this is confusing to them then maybe they need to step away from their wizard generated code and learn some basic DB normalization.
What you propose is the right solution!!
The Document_Category table in your design is certainly the correct way to approach the problem. If it's possible, I would suggest that you educate the developers instead of coming up with a suboptimal solution (and taking a performance hit, and not having referential integrity).
Your other options may depend on the database you're using. For example, in SQL Server you can have an XML column that would allow you to store your array in a pre-defined schema and then do joins based on the contents of that field. Other database systems may have something similar.
The many-to-many mapping you are doing is fine and normalized. It also allows for other data to be added later if needed. For example, say you wanted to add a time that the category was added to the document.
I would suggest having a surrogate primary key on the document_category table as well. And a Unique(documentid, categoryid) constraint if that makes sense to do so.
Why are the developers confused?
The 'this is confusing to the developers' design means you have under-educated developers. It is the better relational database design - you should use it if at all possible.
If you really want to use the list structure, then use a DBMS that understands them. Examples of such databases would be the U2 (Unidata, Universe) DBMS, which are (or were, once upon a long time ago) based on the Pick DBMS. There are likely to be other similar DBMS providers.
This is the classic object-relational mapping problem. The developers are probably not stupid, just inexperienced or unaccustomed to doing things the right way. Shouting "3NF!" over and over again won't convince them of the right way.
I suggest you ask your developers to explain to you how they would get a count of documents by category using the pipe-delimited approach. It would be a nightmare, whereas the link table makes it quite simple.
The number one reason that my developers try this "comma-delimited values in a database column" approach is that they have a perception that adding a new table to address the need for multiple values will take too long to add to the data model and the database.
Most of them know that their work around is bad for all kinds of reasons, but they choose this suboptimal method because they just can. They can do this and maybe never get caught, or they will get caught much later in the project when it is too expensive and risky to fix it. Why do they do this? Because their performance is measured solely on speed and not on quality or compliance.
It could also be, as on one of my projects, that the developers had a table to put the multi values in but were under the impression that duplicating that data in the parent table would speed up performance. They were wrong and they were called out on it.
So while you do need an answer to how to handle these costly, risky, and business-confidence damaging tricks, you should also try to find the reason why the developers believe that taking this course of action is better in the short and the long run for the project and company. Then fix both the perception and the data structures.
Yes, it could just be laziness, malicious intent, or cluelessness, but I'm betting most of the time developers do this stuff because they are constantly being told "just get it done". We on the data model and database design sides need to ensure that we aren't sending the wrong message about how responsive we can be to requests to fulfill a business requirement for a new entity/table/piece of information.
We should also see that data people need to be constantly monitoring the "as-built" part of our data architectures.
Personally, I never authorize the use of comma delimited values in a relational database because it is actually faster to build a new table than it is to build a parsing routine to create, update, and manage multiple values in a column and deal with all the anomalies introduced because sometimes that data has embedded commas, too.
Bottom line, don't do comma delimited values, but find out why the developers want to do it and fix that problem.