T-SQL database design and tables - sql

I'd like to hear some opinions or discussion on a matter of database design. Me and my colleagues are developing a complex application in finance industry that is being installed in several countries.
Our contractors wanted us to keep a single application for all the countries so we naturally face the difficulties with different workflows in every one of them and try to make the application adjustable to satisfy various needs.
The issue I've encountered today was a request from the head of the IT department from the contractors side that we keep the database model in terms of tables and columns they consist of.
For examlpe, we got a table with different risks and we needed to add a flag column IsSomething (BIT NOT NULL ...). It fully qualifies to exists within the risk table according to the third normal form, no transitive dependency to the key, a non key value ...
BUT, the guy said that he wants to keep the tables as they are so we had to make a new table "riskinfo" and link the data 1:1 to the new column.
What is your opinion ?

We add columns to our tables that are referenced by a variety of apps all the time.
So long as the applications specifically reference the columns they want to use and you make sure the new fields are either nullable or have a sensible default defined so it doesn't interfere with inserts I don't see any real problem.
That said, if an app does a select * then proceeds to reference the columns by index rather than name you could produce issues in existing code. Personally I have confidence that nothing referencing our database does this because of our coding conventions (That and I suspect the code review process would lynch someone who tried it :P), but if you're not certain then there is at least some small risk to such a change.
In your actual scenario I'd go back to the contractor and give your reasons you don't think the change will cause any problems and ask the rationale behind their choice. Maybe they have some application-specific wisdom behind their suggestion, maybe just paranoia from dealing with other companies that change the database structure in ways that aren't backwards-compatible, or maybe it's just a policy at their company that got rubber-stamped long ago and nobody's challenged. Till you ask you never know.

This question is indeed subjective like what Binary Worrier commented. I do not have an answer nor any suggestion. Just sharing my 2 cents.
Do you know the rationale for those decisions? Sometimes good designs are compromised for the sake of not breaking currently working applications or simply for the fact that too much has been done based on the previous one. It could also be many other non-technical reasons.

Very often, the programming community is unreasonably concerned about the ripple effect that results from redefining tables. Usually, this is a result of failure to understand data independence, and failure to guard the data independence of their operations on the data. Occasionally, the original database designer is at fault.
Most object oriented programmers understand encapsulation better than I do. But these same experts typically don't understand squat about data independence. And anyone who has learned how to operate on an SQL database, but never learned the concept of data independence is dangerously ignorant. The superficial aspects of data independence can be learned in about five minutes. But to really learn it takes time and effort.
Other responders have mentioned queries that use "select *". A select with a wildcard is more data dependent than the same select that lists the names of all the columns in the table. This is just one example among dozens.
The thing is, both data independence and encapsulation pursue the same goal: containing the unintended consequences of a change in the model.
Here's how to keep your IT chief happy. Define a new table with a new name that contains all the columns from the old table, and also all the additional columns that are now necessary. Create a view, with the same name as the old table, that contains precisely the same columns, and in the same order, that the old table had. Typically, this view will show all the rows in the old table, and the old PK will still guarantee uniqueness.
Once in a while, this will fail to meet all of the IT chief's needs. And if the IT chief is really saying "I don't understand databases; so don't change anything" then you are up the creek until the IT chief changes or gets changed.

Related

What are the pros and cons of Anchor Modeling? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am currently trying to create a database where a very large percentage of the data is temporal. After reading through many techniques for doing this (most involving 6nf normalization) I ran into Anchor Modeling.
The schema that I was developing strongly resembled the Anchor Modeling model, especially since the use case (Temporal Data + Known Unknowns) is so similar, that I am tempted to embrace it fully.
The two biggest problem I am having is that I can find nothing detailing the negatives of this approach, and I cannot find any references to organizations that have used it in production for war-stories and gotchas that I need to be aware of.
I am wondering if anyone here is familiar enough with to briefly expound on some of the negatives (since the positives are very well advertized in research papers and their site), and any experiences with using it in a production environment.
In reference to the anchormodeling.com
Here are a few points I am aware of
The number of DB-objects is simply too large to maintain manually, so make sure that you use designer all the time to evolve the schema.
Currently, designer supports fully MS SQL Server, so if you have to port code all the time, you may want to wait until your target DB is fully supported. I know it has Oracle in dropdown box, but ...
Do not expect (nor demand) your developers to understand it, they have to access the model via 5NF views -- which is good. The thing is that tables are loaded via (instead-of-) triggers on views, which may (or may not) be a performance issue.
Expect that you may need to write some extra maintenance procedures (for each temporal attribute) which are not auto-generated (yet). For example, I often need a prune procedure for temporal attributes -- to delete same-value-records for the same ID on two consecutive time-events.
Generated views and queries-over-views resolve nicely, and so will probably anything that you write in the future. However, "other people" will be writing queries on views-over-views-over-views -- which does not always resolve nicely. So expect that you may need to police queries more than usual.
Having sad all that, I have recently used the approach to refactor a section of my warehouse, and it worked like a charm. Admittedly, warehouse does not have most of the problems outlined here.
I would suggest that it is imperative to create a demo-system and test, test, test ..., especially point No 3 -- loading via triggers.
With respect to point number 4 above. Restatement control is almost finished, such that you will be able to prevent two consecutive identical values over time.
And a general comment, joins are not necessarily a bad thing. Read: Why joins are a good thing.
One of the great benefits of 6NF in Anchor Modeling is non-destructive schema evolution. In other words, every previous version of the database model is available as a subset in the current model. Also, since changes are represented by extensions in the schema (new tables), upgrading a database is almost instantanous and can safely be done online (even in a production environment). This benefit would be lost in 5NF.
I haven't read any papers on it, but since it's based on 6NF, I'd expect it to suffer from whatever problems follow 6NF.
6NF requires each table consist of a candidate key and no more than one non-key column. So, in the worst case, you'll need nine joins to produce a 10-column result set. But you can also design a database that uses, say, 200 tables that are in 5NF, 30 that are in BCNF, and only 5 that are in 6NF. (I think that would no longer be Anchor Modeling per se, which seems to put all tables in 6NF, but I could be wrong about that.)
The Mythical Man-Month is still relevant here.
The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. The only question is whether to plan in advance to build a throwaway, or to promise to deliver the throwaway to customers.
Fred Brooks, Jr., in The Mythical Man-Month, p 116.
How cheaply can you build a prototype to test your expected worst case?
In this post I will present a large part of the real business that belong to databases. Database's solutions in this big business area can not be solved by using „Anchor modeling" , at all.
In the real business world this case is happening on a daily basis. That is the case when data entry person, enters a wrong data.
In real-world business, errors happen frequently at data entry level. It often happens that data entry generates large amounts of erroneous data. So this is a real and big problem. "Anchor modeling" can not solve this problem.
Anyone who uses the "Anchor Modeling" database can enter incorrect data. This is possible because the authors of "Anchor modeling" have written that the erroneous data can be deleted.
Let me explain this problem by the following example:
A profesor of mathematics gave the best grade to the student who had the worst grade. In this high school, professors enter grades in the corressponding database. This student gave money to the professor for this criminal service. The student managed to enroll at the university using this false grade.
After a summer holiday, the professor of mathematics returned to school. After deleting the wrong grade from the database, the professor entered the correct one in the database. In this school they use "Anchor Modeling" db. So the math profesor deleted false data as it is strictly suggested by authors of "Anchor modeling".
Now, this professor of mathematics who did this criminal act is clean, thanks to the software "Anchor modeling".
This example says that using "Anchor Modeling," you can do crime with data just by applying „Anchor modeling technology“
In section 5.4 the authors of „Anchor modeling“ wrote the following: „Delete statements are allowed only when applied to remove erroneous data.“ .
You can see this text at the paper „ An agile modeling technique using sixth normal form for structurally evolving data“ written by authors of „Anchor modeling“.
Please note that „Anchor modeling“ was presented at the 28th International Conference on Conceptual Modeling and won the best paper award?!
Authors of "Anchor Modeling" claim that their data model can maintain a history! However this example shoes that „Anchor modeling“ can not maintain the history at all.
As „Anchor modeling“ allows deletion of data, then "Anchor modeling" has all the operations with the data, that is: adding new data, deleting data and update. Update can be obtained by using two operations: first delete the data, then add new data.
This further means that Anchor modeling has no history, because it has data deletion and data update.
I would like to point out that in "Anchor modeling" each erroneous data "MUST" be deleted. In the "Anchor modeling" it is not possible to keep erroneous data and corrected data.
"Anchor modeling" can not maintain history of erroneous data.
In the first part of this post, I showed that by using "Anchor Modeling" anyone can do crime with data. This means "Anchor Modeling" runs the business of a company, right into a disaster.
I will give one example so that professionals can see on real and important example, how bad "anchor modeling" is.
Example
People who are professionals in the business of databases, know that there are thousands and thousands of international standards, which have been used successfully in databases as keys.
International standards:
All professionals know what is "VIN" for cars, "ISBN" for books, and thousands of other international standards.
National standards:
All countries have their own standards for passports, personal documents, bank cards, bar codes, etc
Local standards:
Many companies have their own standards. For example, when you pay something, you have an invoice with a standard key written and that key is written in the database, also.
All the above mentioned type of keys from this example can be checked by using a variety of institutions, police, customs, banks credit card, post office, etc. You can check many of these "keys" on the internet or by using a phone.
I believe that percent of these databases, which have entities with standard keys, and which I have presented in this example, is more than 95%.
For all the above cases the "anchor surrogate key" is nonsense. "Anchor modeling" exclusively uses "anchor-surrogate key"
In my solution, I use all the keys that are standard on a global or local level and are simple.
Vladimir Odrljin

What are the [dis]advantages of using a key/value table over nullable columns or separate tables? [duplicate]

This question already has answers here:
How to design a product table for many kinds of product where each product has many parameters
(4 answers)
Closed 1 year ago.
I'm upgrading a payment management system I created a while ago. It currently has one table for each payment type it can accept. It is limited to only being able to pay for one thing, which this upgrade is to alleviate. I've been asking for suggestions as to how I should design it, and I have these basic ideas to work from:
Have one table for each payment type, with a few common columns on each. (current design)
Coordinate all payments with a central table that takes on the common columns (unifying payment IDs regardless of type), and identifies another table and row ID that has columns specialized to that payment type.
Have one table for all payment types, and null the columns which are not used for any given type.
Use the central table idea, but store specialized columns in a key/value table.
My goals for this are: not ridiculously slow, self-documenting as much as possible, and maximizing flexibility while maintaining the other goals.
I don't like 1 very much because of the duplicate columns in each table. It reflects the payment type classes inheriting a base class that provides functionality for all payment types... ORM in reverse?
I'm leaning toward 2 the most, because it's just as "type safe" and self-documenting as the current design. But, as with 1, to add a new payment type, I need to add a new table.
I don't like 3 because of its "wasted space", and it's not immediately clear which columns are used for which payment types. Documentation can alleviate the pain of this somewhat, but my company's internal tools do not have an effective method for storing/finding technical documentation.
The argument I was given for 4 was that it would alleviate needing to change the database when adding a new payment method, but it suffers even worse than 3 does from the lack of explicitness. Currently, changing the database isn't a problem, but it could become a logistical nightmare if we decide to start letting customers keep their own database down the road.
So, of course I have my biases. Does anyone have any better ideas? Which design do you think fits best? What criteria should I base my decision on?
Note
This subject is being discussed, and this thread is being referenced in other threads, therefore I have given it a reasonable treatment, please bear with me. My intention is to provide understanding, so that you can make informed decisions, rather than simplistic ones based merely on labels. If you find it intense, read it in chunks, at your leisure; come back when you are hungry, and not before.
What, exactly, about EAV, is "Bad" ?
1 Introduction
There is a difference between EAV (Entity-Attribute-Value Model) done properly, and done badly, just as there is a difference between 3NF done properly and done badly. In our technical work, we need to be precise about exactly what works, and what does not; about what performs well, and what doesn't. Blanket statements are dangerous, misinform people, and thus hinder progress and universal understanding of the issues concerned.
I am not for or against anything, except poor implementations by unskilled workers, and misrepresenting the level of compliance to standards. And where I see misunderstanding, as here, I will attempt to address it.
Normalisation is also often misunderstood, so a word on that. Wikipedia and other free sources actually post completely nonsensical "definitions", that have no academic basis, that have vendor biases so as to validate their non-standard-compliant products. There is a Codd published his Twelve Rules. I implement a minimum of 5NF, which is more than enough for most requirements, so I will use that as a baseline. Simply put, assuming Third Normal Form is understood by the reader (at least that definition is not confused) ...
2 Fifth Normal Form
2.1 Definition
Fifth Normal Form is defined as:
every column has a 1::1 relation with the Primary Key, only
and to no other column, in the table, or in any other table
the result is no duplicated columns, anywhere; No Update Anomalies (no need for triggers or complex code to ensure that, when a column is updated, its duplicates are updated correctly).
it improves performance because (a) it affects less rows and (b) improves concurrency due to reduced locking
I make the distinction that, it is not that a database is Normalised to a particular NF or not; the database is simply Normalised. It is that each table is Normalised to a particular NF: some tables may only require 1NF, others 3NF, and yet others require 5NF.
2.2 Performance
There was a time when people thought that Normalisation did not provide performance, and they had to "denormalise for performance". Thank God that myth has been debunked, and most IT professionals today realise that Normalised databases perform better. The database vendors optimise for Normalised databases, not for denormalised file systems. The truth "denormalised" is, the database was NOT normalised in the first place (and it performed badly), it was unnormalised, and they did some further scrambling to improve performance. In order to be Denormalised, it has to be faithfully Normalised first, and that never took place. I have rewritten scores of such "denormalised for performance" databases, providing faithful Normalisation and nothing else, and they ran at least ten, and as much as a hundred times faster. In addition, they required only a fraction of the disk space. It is so pedestrian that I guarantee the exercise, in writing.
2.3 Limitation
The limitations, or rather the full extent of 5NF is:
it does not handle optional values, and Nulls have to be used (many designers disallow Nulls and use substitutes, but this has limitations if it not implemented properly and consistently)
you still need to change DDL in order to add or change columns (and there are more and more requirements to add columns that were not initially identified, after implementation; change control is onerous)
although providing the highest level of performance due to Normalisation (read: elimination of duplicates and confused relations), complex queries such as pivoting (producing a report of rows, or summaries of rows, expressed as columns) and "columnar access" as required for data warehouse operations, are difficult, and those operations only, do not perform well. Not that this is due only to the SQL skill level available, and not to the engine.
3 Sixth Normal Form
3.1 Definition
Sixth Normal Form is defined as:
the Relation (row) is the Primary Key plus at most one attribute (column)
It is known as the Irreducible Normal Form, the ultimate NF, because there is no further Normalisation that can be performed. Although it was discussed in academic circles in the mid nineties, it was formally declared only in 2003. For those who like denigrating the formality of the Relational Model, by confusing relations, relvars, "relationships", and the like: all that nonsense can be put to bed because formally, the above definition identifies the Irreducible Relation, sometimes called the Atomic Relation.
3.2 Progression
The increment that 6NF provides (that 5NF does not) is:
formal support for optional values, and thus, elimination of The Null Problem
a side effect is, columns can be added without DDL changes (more later)
effortless pivoting
simple and direct columnar access
it allows for (not in its vanilla form) an even greater level of performance in this department
Let me say that I (and others) were supplying enhanced 5NF tables 20 years ago, explicitly for pivoting, with no problem at all, and thus allowing (a) simple SQL to be used and (b) providing very high performance; it was nice to know that the academic giants of the industry had formally defined what we were doing. Overnight, my 5NF tables were renamed 6NF, without me lifting a finger. Second, we only did this where we needed it; again, it was the table, not the database, that was Normalised to 6NF.
3.3 SQL Limitation
It is a cumbersome language, particularly re joins, and doing anything moderately complex makes it very cumbersome. (It is a separate issue that most coders do not understand or use subqueries.) It supports the structures required for 5NF, but only just. For robust and stable implementations, one must implement additional standards, which may consist in part, of additional catalogue tables. The "use by" date for SQL had well and truly elapsed by the early nineties; it is totally devoid of any support for 6NF tables, and desperately in need of replacement. But that is all we have, so we need to just Deal With It.
For those of us who had been implementing standards and additional catalogue tables, it was not a serious effort to extend our catalogues to provide the capability required to support 6NF structures to standard: which columns belong to which tables, and in what order; mandatory/optional; display format; etc. Essentially a full MetaData catalogue, married to the SQL catalogue.
Note that each NF contains each previous NF within it, so 6NF contains 5NF. We did not break 5NF in order provide 6NF, we provided a progression from 5NF; and where SQL fell short we provided the catalogue. What this means is, basic constraints such as for Foreign Keys; and Value Domains which were provided via SQL Declarative Referential integrity; Datatypes; CHECKS; and RULES, at the 5NF level, remained intact, and these constraints were not subverted. The high quality and high performance of standard-compliant 5NF databases was not reduced in anyway by introducing 6NF.
3.4 Catalogue
It is important to shield the users (any report tool) and the developers, from having to deal with the jump from 5NF to 6NF (it is their job to be app coding geeks, it is my job to be the database geek). Even at 5NF, that was always a design goal for me: a properly Normalised database, with a minimal Data Directory, is in fact quite easy to use, and there was no way I was going to give that up. Keep in mind that due to normal maintenance and expansion, the 6NF structures change over time, new versions of the database are published at regular intervals. Without doubt, the SQL (already cumbersome at 5NF) required to construct a 5NF row from the 6NF tables, is even more cumbersome. Gratefully, that is completely unnecessary.
Since we already had our catalogue, which identified the full 6NF-DDL-that-SQL-does-not-provide, if you will, I wrote a small utility to read the catalogue and:
generate the 6NF table DDL.
generate 5NF VIEWS of the 6NF tables. This allowed the users to remain blissfully unaware of them, and gave them the same capability and performance as they had at 5NF
generate the full SQL (not a template, we have those separately) required to operate against the 6NF structures, which coders then use. They are released from the tedium and repetition which is otherwise demanded, and free to concentrate on the app logic.
I did not write an utility for Pivoting because the complexity present at 5NF is eliminated, and they are now dead simple to write, as with the 5NF-enhanced-for-pivoting. Besides, most report tools provide pivoting, so I only need to provide functions which comprise heavy churning of stats, which needs to be performed on the server before shipment to the client.
3.5 Performance
Everyone has their cross to bear; I happen to be obsessed with Performance. My 5NF databases performed well, so let me assure you that I ran far more benchmarks than were necessary, before placing anything in production. The 6NF database performed exactly the same as the 5NF database, no better, no worse. This is no surprise, because the only thing the 'complex" 6NF SQL does, that the 5NF SQL doesn't, is perform much more joins and subqueries.
You have to examine the myths.
Anyone who has benchmarked the issue (i.e examined the execution plans of queries) will know that Joins Cost Nothing, it is a compile-time resolution, they have no effect at execution time.
Yes, of course, the number of tables joined; the size of the tables being joined; whether indices can be used; the distribution of the keys being joined; etc, all cost something.
But the join itself costs nothing.
A query on five (larger) tables in a Unnormalised database is much slower than the equivalent query on ten (smaller) tables in the same database if it were Normalised. the point is, neither the four nor the nine Joins cost anything; they do not figure in the performance problem; the selected set on each Join does figure in it.
3.6 Benefit
Unrestricted columnar access. This is where 6NF really stands out. The straight columnar access was so fast that there was no need to export the data to a data warehouse in order to obtain speed from specialised DW structures.
My research into a few DWs, by no means complete, shows that they consistently store data by columns, as opposed to rows, which is exactly what 6NF does. I am conservative, so I am not about to make any declarations that 6NF will displace DWs, but in my case it eliminated the need for one.
It would not be fair to compare functions available in 6NF that were unavailable in 5NF (eg. Pivoting), which obviously ran much faster.
That was our first true 6NF database (with a full catalogue, etc; as opposed to the always 5NF with enhancements only as necessary; which later turned out to be 6NF), and the customer is very happy. Of course I was monitoring performance for some time after delivery, and I identified an even faster columnar access method for my next 6NF project. That, when I do it, might present a bit of competition for the DW market. The customer is not ready, and we do not fix that which is not broken.
3.7 What, Exactly, about 6NF, is "Bad" ?
Note that not everyone would approach the job with as much formality, structure, and adherence to standards. So it would be silly to conclude from our project, that all 6NF databases perform well, and are easy to maintain. It would be just as silly to conclude (from looking at the implementations of others) that all 6NF databases perform badly, are hard to maintain; disasters. As always, with any technical endeavour, the resulting performance and ease of maintenance are strictly dependent on formality, structure, and adherence to standards, in addition to the relevant skill set.
4 Entity Attribute Value
Disclosure: Experience. I have inspected a few of these, mostly hospital and medical systems. I have performed corrective assignments on two of them. The initial delivery by the overseas provider was quite adequate, although not great, but the extensions implemented by the local provider were a mess. But not nearly the disaster that people have posted about re EAV on this site. A few months intense work fixed them up nicely.
4.1 What It Is
It was obvious to me that the EAV implementations I have worked on are merely subsets of Sixth Normal Form. Those who implement EAV do so because they want some of the features of 6NF (eg. ability to add columns without DDL changes), but they do not have the academic knowledge to implement true 6NF, or the standards and structures to implement and administer it securely. Even the original provider did not know about 6NF, or that EAV was a subset of 6NF, but they readily agreed when I pointed it out to them. Because the structures required to provide EAV, and indeed 6NF, efficiently and effectively (catalogue; Views; automated code generation) are not formally identified in the EAV community, and are missing from most implementations, I classify EAV as the bastard son Sixth Normal Form.
4.2 What, Exactly, about EAV, is "Bad" ?
Going by the comments in this and other threads, yes, EAV done badly is a disaster. More important (a) they are so bad that the performance provided at 5NF (forget 6NF) is lost and (b) the ordinary isolation from the complexity has not been implemented (coders and users are "forced" to use cumbersome navigation). And if they did not implement a catalogue, all sorts of preventable errors will not have been prevented.
All that may well be true for bad (EAV or other) implementations, but it has nothing to do with 6NF or EAV. The two projects I worked had quite adequate performance (sure, it could be improved; but there was no bad performance due to EAV), and good isolation of complexity. Of course, they were nowhere near the quality or performance of my 5NF databases or my true 6NF database, but they were fair enough, given the level of understanding of the posted issues within the EAV community. They were not the disasters and sub-standard nonsense alleged to be EAV in these pages.
5 Nulls
There is a well-known and documented issue called The Null Problem. It is worthy of an essay by itself. For this post, suffice to say:
the problem is really the optional or missing value; here the consideration is table design such that there are no Nulls vs Nullable columns
actually it does not matter because, regardless of whether you use Nulls/No Nulls/6NF to exclude missing values, you will have to code for that, the problem precisely then, is handling missing values, which cannot be circumvented
except of course for pure 6NF, which eliminates the Null Problem
the coding to handle missing values remains
except, with automated generation of SQL code, heh heh
Nulls are bad news for performance, and many of us have decided decades ago not to allow Nulls in the database (Nulls in passed parameters and result sets, to indicate missing values, is fine)
which means a set of Null Substitutes and boolean columns to indicate missing values
Nulls cause otherwise fixed len columns to be variable len; variable len columns should never be used in indices, because a little 'unpacking' has to be performed on every access of every index entry, during traversal or dive.
6 Position
I am not a proponent of EAV or 6NF, I am a proponent of quality and standards. My position is:
Always, in all ways, do whatever you are doing to the highest standard that you are aware of.
Normalising to Third Normal Form is minimal for a Relational Database (5NF for me). DataTypes, Declarative referential Integrity, Transactions, Normalisation are all essential requirements of a database; if they are missing, it is not a database.
if you have to "denormalise for performance", you have made serious Normalisation errors, your design in not normalised. Period. Do not "denormalise", on the contrary, learn Normalisation and Normalise.
There is no need to do extra work. If your requirement can be fulfilled with 5NF, do not implement more. If you need Optional Values or ability to add columns without DDL changes or the complete elimination of the Null Problem, implement 6NF, only in those tables that need them.
If you do that, due only to the fact that SQL does not provide proper support for 6NF, you will need to implement:
a simple and effective catalogue (column mix-ups and data integrity loss are simply not acceptable)
5NF access for the 6NF tables, via VIEWS, to isolate the users (and developers) from the encumbered (not "complex") SQL
write or buy utilities, so that you can generate the cumbersome SQL to construct the 5NF rows from the 6NF tables, and avoid writing same
measure, monitor, diagnose, and improve. If you have a performance problem, you have made either (a) a Normalisation error or (b) a coding error. Period. Back up a few steps and fix it.
If you decide to go with EAV, recognise it for what it is, 6NF, and implement it properly, as above. If you do, you will have a successful project, guaranteed. If you do not, you will have a dog's breakfast, guaranteed.
6.1 There Ain't No Such Thing As A Free Lunch
That adage has been referred to, but actually it has been misused. The way it actually, deeply applies is as above: if you want the benefits of 6NF/EAV, you had better be willing too do the work required to obtain it (catalogue, standards). Of course, the corollary is, if you don't do the work, you won't get the benefit. There is no "loss" of Datatypes; value Domains; Foreign keys; Checks; Rules. Regarding performance, there is no performance penalty for 6NF/EAV, but there is always a substantial performance penalty for sub-standard work.
7 Specific Question
Finally. With due consideration to the context above, and that it is a small project with a small team, there is no question:
Do not use EAV (or 6NF for that matter)
Do not use Nulls or Nullable columns (unless you wish to subvert performance)
Do use a single Payment table for the common payment columns
and a child table for each PaymentType, each with its specific columns
All fully typecast and constrained.
What's this "another row_id" business ? Why do some of you stick an ID on everything that moves, without checking if it is a deer or an eagle ? No. The child is a dependent child. The Relation is 1::1. The PK of the child is the PK of the parent, the common Payment table. This is an ordinary Supertype-Subtype cluster, the Differentiator is PaymentTypeCode. Subtypes and supertypes are an ordinary part of the Relational Model, and fully catered for in the database, as well as in any good modelling tool.
Sure, people who have no knowledge of Relational databases think they invented it 30 years later, and give it funny new names. Or worse, they knowingly re-label it and claim it as their own. Until some poor sod, with a bit of education and professional pride, exposes the ignorance or the fraud. I do not know which one it is, but it is one of them; I am just stating facts, which are easy to confirm.
A. Responses to Comments
A.1 Attribution
I do not have personal or private or special definitions. All statements regarding the definition (such as imperatives) of:
Normalisation,
Normal Forms, and
the Relational Model.
.
refer to the many original texts By EF Codd and CJ Date (not available free on the web)
.
The latest being Temporal Data and The Relational Model by CJ Date, Hugh Darwen, Nikos A Lorentzos
.
and nothing but those texts
.
"I stand on the shoulders of giants"
.
The essence, the body, all statements regarding the implementation (eg. subjective, and first person) of the above are based on experience; implementing the above principles and concepts, as a commercial organisation (salaried consultant or running a consultancy), in large financial institutions in America and Australia, over 32 years.
This includes scores of large assignments correcting or replacing sub-standard or non-relational implementations.
.
The Null Problem vis-a-vis Sixth Normal Form
A freely available White Paper relating to the title (it does not define The Null Problem alone) can be found at:
http://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf.
.
A 'nutshell' definition of 6NF (meaningful to those experienced with the other NFs), can be found on p6
A.2 Supporting Evidence
As stated at the outset, the purpose of this post is to counter the misinformation that is rife in this community, as a service to the community.
Evidence supporting statements made re the implementation of the above principles, can be provided, if and when specific statements are identified; and to the same degree that the incorrect statements posted by others, to which this post is a response, is likewise evidenced. If there is going to be a bun fight, let's make sure the playing field is level
Here are a few docs that I can lay my hands on immediately.
a. Large Bank
This is the best example, as it was undertaken for explicitly the reasons in this post, and goals were realised. They had a budget for Sybase IQ (DW product) but the reports were so fast when we finished the project, they did not need it. The trade analytical stats were my 5NF plus pivoting extensions which turned out to be 6NF, described above. I think all the questions asked in the comments have been answered in the doc, except:
- number of rows:
- old database is unknown, but it can be extrapolated from the other stats
- new database = 20 tables over 100M, 4 tables over 10B.
b. Small Financial Institute Part A
Part B - The meat
Part C - Referenced Diagrams
Part D - Appendix, Audit of Indices Before/After (1 line per Index)
Note four docs; the fourth only for those who wish to inspect detailed Index changes. They were running a 3rd party app that could not be changed because the local supplier was out of business, plus 120% extensions which they could, but did not want to, change. We were called in because they upgraded to a new version of Sybase, which was much faster, which shifted the various performance thresholds, which caused large no of deadlocks. Here we Normalised absolutely everything in the server except the db model, with the goal (guaranteed beforehand) of eliminating deadlocks (sorry, I am not going to explain that here: people who argue about the "denormalisation" issue, will be in a pink fit about this one). It included a reversal of "splitting tables into an archive db for performance", which is the subject of another post (yes, the new single table performed faster than the two spilt ones). This exercise applies to MS SQL Server [insert rewrite version] as well.
c. Yale New Haven Hospital
That's Yale School of Medicine, their teaching hospital. This is a third-party app on top of Sybase. The problem with stats is, 80% of the time they were collecting snapshots at nominated test times only, but no consistent history, so there is no "before image" to compare our new consistent stats with. I do not know of any other company who can get Unix and Sybase internal stats on the same graphs, in an automated manner. Now the network is the threshold (which is a Good Thing).
Perhaps you should look this question
The accepted answer from Bill Karwin goes into specific arguments against the key/value table usually know as Entity Attribute Value (EAV)
.. Although many people seem to favor
EAV, I don't. It seems like the most
flexible solution, and therefore the
best. However, keep in mind the adage
TANSTAAFL. Here are some of the
disadvantages of EAV:
No way to make a column mandatory (equivalent of NOT NULL).
No way to use SQL data types to validate entries.
No way to ensure that attribute names are spelled consistently.
No way to put a foreign key on the values of any given attribute, e.g.
for a lookup table.
Fetching results in a conventional tabular layout is complex and
expensive, because to get attributes
from multiple rows you need to do
JOIN for each attribute.
The degree of flexibility EAV gives
you requires sacrifices in other
areas, probably making your code as
complex (or worse) than it would have
been to solve the original problem in
a more conventional way.
And in most cases, it's an unnecessary
to have that degree of flexibility.
In the OP's question about product
types, it's much simpler to create a
table per product type for
product-specific attributes, so you
have some consistent structure
enforced at least for entries of the
same product type.
I'd use EAV only if every row must
be permitted to potentially have a
distinct set of attributes. When you
have a finite set of product types,
EAV is overkill. Class Table
Inheritance would be my first choice.
My #1 principle is not to redesign something for no reason. So I would go with option 1 because that's your current design and it has a proven track record of working.
Spend the redesign time on new features instead.
If I were designing from scratch I would go with number two. It gives you the flexibility you need. However with number 1 already in place and working and this being soemting rather central to your whole app, i would probably be wary of making a major design change without a good idea of exactly what queries, stored procs, views, UDFs, reports, imports etc you would have to change. If it was something I could do with a relatively low risk (and agood testing alrady in place.) I might go for the change to solution 2 otherwise you might beintroducing new worse bugs.
Under no circumstances would I use an EAV table for something like this. They are horrible for querying and performance and the flexibility is way overrated (ask users if they prefer to be able to add new types 3-4 times a year without a program change at the cost of everyday performance).
At first sight, I would go for option 2 (or 3): when possible, generalize.
Option 4 is not very Relational I think, and will make your queries complex.
When confronted to those question, I generally confront those options with "use cases":
-how is design 2/3 behaving when do this or this operation ?

Storing multiple choice values in database

Say I offer user to check off languages she speaks and store it in a db. Important side note, I will not search db for any of those values, as I will have some separate search engine for search.
Now, the obvious way of storing these values is to create a table like
UserLanguages
(
UserID nvarchar(50),
LookupLanguageID int
)
but the site will be high load and we are trying to eliminate any overhead where possible, so in order to avoid joins with main member table when showing results on UI, I was thinking of storing languages for a user in the main table, having them comma separated, like "12,34,65"
Again, I don't search for them so I don't worry about having to do fulltext index on that column.
I don't really see any problems with this solution, but am I overlooking anything?
Thanks,
Andrey
Don't.
You don't search for them now
Data is useless to anything but this one situation
No data integrity (eg no FK)
You still have to change to "English,German" etc for display
"Give me all users who speak x" = FAIL
The list is actually a presentation issue
It's your system, though, and I look forward to answering the inevitable "help" questions later...
You might not be missing anything now, but when you're requirements change you might regret that decision. You should store it normalized like your first instinct suggested. That's the correct approach.
What you're suggesting is a classic premature optimization. You don't know yet whether that join will be a bottleneck, and so you don't know whether you're actually buying any performance improvement. Wait until you can profile the thing, and then you'll know whether that piece needs to be optimized.
If it does, I would consider a materialized view, or some other approach that pre-computes the answer using the normalized data to a cache that is not considered the book of record.
More generally, there are a lot of possible optimizations that could be done, if necessary, without compromising your design in the way you suggest.
This type of storage has almost ALWAYS come back to haunt me. For one, you are not even in first normal form. For another, some manager or the other will definitely come back and say.. "hey, now that we store this, can you write me a report on... "
I would suggest going with a normalized design. Put it in a separate table.
Problems:
You lose join capability (obviously).
You have to reparse the list on each page load / post back. Which results in more code client side.
You lose all pretenses of trying to keep database integrity. Just imagine if you decide to REMOVE a language later on... What's the sql going to be to fix all of your user profiles?
Assuming your various profile options are stored in a lookup table in the DB, you still have to run "30 queries" per profile page. If they aren't then you have to code deploy for each little change. bad, very bad.
Basing a design decision on something that "won't happen" is an absolute recipe for failure. Sure, the business people said they won't ever do that... Until they think of a reason they absolutely must do it. Today. Which will be promptly after you finish coding this.
As I stated in a comment, 30 queries for a low use page is nothing. Don't sweat it, and definitely don't optimize unless you know for darn sure it's necessary. Guess how many queries SO does for it's profile page?
I generally stay away at the solution you described, you asking for troubles when you store relational data in such fashion.
As alternative solution:
You could store as one bitmasked integer, for example:
0 - No selection
1 - English
2 - Spanish
4 - German
8 - French
16 - Russian
--and so on powers of 2
So if someone selected English and Russian the value would be 17, and you could easily query the values with Bitwise operators.
Premature optimization is the root of all evil.
EDIT: Apparently the context of my observation has been misconstrued by some - and hence the downvotes. So I will clarify.
Denormalizing your model to make things easier and/or 'more performant' - such as creating concatenated columns to represent business information (as in the OP case) - is what I refer to as a "premature optimization".
While there may be some extreme edge cases where there is no other way to get the necessary performance necessary for a particular problem domain - one should rarely assume this is the case. In general, such premature optimizations cause long-term grief because they are hard to undo - changing your data model once it is in production takes a lot more effort than when it initially deployed.
When designing a database, developers (and DBAs) should apply standard practices like normalization to ensure that their data model expresses the business information being collected and managed. I don't believe that proper use of data normalization is an "optimization" - it is a necessary practice. In my opinion, data modelers should always be on the lookout for models that could be restructured to (at least) third normal form (3NF).
If you're not querying against them, you don't lose anything by storing them in a form like your initial plan.
If you are, then storing them in the comma-delimited format will come back to haunt you, and I doubt that any speed savings would be significant, especially when you factor in the work required to translate them back.
You seem to be extremely worried about adding in a few extra lookup table joins. In my experience, the time it takes to actually transmit the HTML response and have the browser render it far exceed a few extra table joins. Especially if you are using indexes for your primary and foreign keys (as you should be). It's like you are planning a multi-day cross-country trip and you are worried about 1 extra 10 minute bathroom stop.
The lack of long-term flexibility and data integrity are not worth it for such a small optimization (which may not be necessary or even noticeable).
Nooooooooooooooooo!!!!!!!!
As stated very well in the above few posts.
If you want a contrary view to this debate, look at wordpress. Tables are chocked full of delimited data, and it's a great, simple platform.

Many-to-many relationship: use associative table or delimited values in a column?

Update 2009.04.24
The main point of my question is not developer confusion and what to do about it.
The point is to understand when delimited values are the right solution.
I've seen delimited data used in commercial product databases (Ektron lol).
SQL Server even has an XML datatype, so that could be used for the same purpose as delimited fields.
/end Update
The application I'm designing has some many-to-many relationships. In the past, I've often used associative tables to represent these in the database. This has caused some confusion to the developers.
Here's an example DB structure:
Document
---------------
ID (PK)
Title
CategoryIDs (varchar(4000))
Category
------------
ID (PK)
Title
There is a many-to-many relationship between Document and Category.
In this implementation, Document.CategoryIDs is a big pipe-delimited list of CategoryIDs.
To me, this is bad because it requires use of substring matching in queries -- which cannot make use of indexes. I think this will be slow and will not scale.
With that model, to get all Documents for a Category, you would need something like the following:
select * from documents where categoryids like '%|' + #targetCategoryId + '|%'
My solution is to create an associative table as follows:
Document_Category
-------------------------------
DocumentID (PK)
CategoryID (PK)
This is confusing to the developers. Is there some elegant alternate solution that I'm missing?
I'm assuming there will be thousands of rows in Document. Category may be like 40 rows or so. The primary concern is query performance. Am I over-engineering this?
Is there a case where it's preferred to store lists of IDs in database columns rather than pushing the data out to an associative table?
Consider also that we may need to create many-to-many relationships among documents. This would suggest an associative table Document_Document. Is that the preferred design or is it better to store the associated Document IDs in a single column?
Thanks.
This is confusing to the developers.
Get better developers. That is the right approach.
Your suggestion IS the elegant, powerful, best practice solution.
Since I don't think the other answers said the following strongly enough, I'm going to do it.
If your developers 1) can't understand how to model a many-to-many relationship in a relational database, and 2) strongly insist on storing your CategoryIDs as delimited character data,
Then they ought to immediately lose all database design privileges. At the very least, they need an actual experienced professional to join their team who has the authority to stop them from doing something this unwise and can give them the database design training they are completely lacking.
Last, you should not refer to them as "database developers" again until they are properly up to speed, as this is a slight to those of us who actually are competent developers & designers.
I hope this answer is very helpful to you.
Update
The main point of my question is not developer confusion and what to do about it.
The point is to understand when delimited values are the right solution.
Delimited values are the wrong solution except in extremely rare cases. When individual values will ever be queried/inserted/deleted/updated this proves it was the wrong decision, because you have to parse and touch all the other values just to work with the desired one. By doing this you're violating first (!!!) normal form (this phrase should sound to you like an unbelievably vile expletive). Using XML to do the same thing is wrong, too. Storing delimited values or multi-value XML in a column could make sense when it is treated as an indivisible and opaque "property bag" that is NOT queried on by the database but is always sent whole to another consumer (perhaps a web server or an EDI recipient).
This takes me back to my initial comment. Developers who think violating first normal form is a good idea are very inexperienced developers in my book.
I will grant there are some pretty sophisticated non-relational data storage implementations out there using text property bags (such as Facebook(?) and other multi-million user sites running on thousands of servers). Well, when your database, user base, and transactions per second are big enough to need that, you'll have the money to develop it. In the meantime, stick with best practice.
It's almost always a big mistake to use comma separated IDs.
RDBMS are designed to store relationships.
My solution is to create an
associative table as follows: This is
confusing to the developers
Really? this is database 101, if this is confusing to them then maybe they need to step away from their wizard generated code and learn some basic DB normalization.
What you propose is the right solution!!
The Document_Category table in your design is certainly the correct way to approach the problem. If it's possible, I would suggest that you educate the developers instead of coming up with a suboptimal solution (and taking a performance hit, and not having referential integrity).
Your other options may depend on the database you're using. For example, in SQL Server you can have an XML column that would allow you to store your array in a pre-defined schema and then do joins based on the contents of that field. Other database systems may have something similar.
The many-to-many mapping you are doing is fine and normalized. It also allows for other data to be added later if needed. For example, say you wanted to add a time that the category was added to the document.
I would suggest having a surrogate primary key on the document_category table as well. And a Unique(documentid, categoryid) constraint if that makes sense to do so.
Why are the developers confused?
The 'this is confusing to the developers' design means you have under-educated developers. It is the better relational database design - you should use it if at all possible.
If you really want to use the list structure, then use a DBMS that understands them. Examples of such databases would be the U2 (Unidata, Universe) DBMS, which are (or were, once upon a long time ago) based on the Pick DBMS. There are likely to be other similar DBMS providers.
This is the classic object-relational mapping problem. The developers are probably not stupid, just inexperienced or unaccustomed to doing things the right way. Shouting "3NF!" over and over again won't convince them of the right way.
I suggest you ask your developers to explain to you how they would get a count of documents by category using the pipe-delimited approach. It would be a nightmare, whereas the link table makes it quite simple.
The number one reason that my developers try this "comma-delimited values in a database column" approach is that they have a perception that adding a new table to address the need for multiple values will take too long to add to the data model and the database.
Most of them know that their work around is bad for all kinds of reasons, but they choose this suboptimal method because they just can. They can do this and maybe never get caught, or they will get caught much later in the project when it is too expensive and risky to fix it. Why do they do this? Because their performance is measured solely on speed and not on quality or compliance.
It could also be, as on one of my projects, that the developers had a table to put the multi values in but were under the impression that duplicating that data in the parent table would speed up performance. They were wrong and they were called out on it.
So while you do need an answer to how to handle these costly, risky, and business-confidence damaging tricks, you should also try to find the reason why the developers believe that taking this course of action is better in the short and the long run for the project and company. Then fix both the perception and the data structures.
Yes, it could just be laziness, malicious intent, or cluelessness, but I'm betting most of the time developers do this stuff because they are constantly being told "just get it done". We on the data model and database design sides need to ensure that we aren't sending the wrong message about how responsive we can be to requests to fulfill a business requirement for a new entity/table/piece of information.
We should also see that data people need to be constantly monitoring the "as-built" part of our data architectures.
Personally, I never authorize the use of comma delimited values in a relational database because it is actually faster to build a new table than it is to build a parsing routine to create, update, and manage multiple values in a column and deal with all the anomalies introduced because sometimes that data has embedded commas, too.
Bottom line, don't do comma delimited values, but find out why the developers want to do it and fix that problem.

Deciding on a database structure for pricing wizard

Option A
We are working on a small project that requires a pricing wizard for custom tables. (yes, actual custom tables- the kind you eat at. From here out I'll call them kitchen tables so we don't get confused) I came up with a model where each kitchen table part was a database table. So the database looked like this:
TableLineItem
-------------
ID
TableSizeID
TableEdgeWoodID
TableBaseID
Quantity
TableEdgeWoodID
---------------
ID
Name
MaterialUnitCost
LaborSetupHours
LaborWorkHours
Each part has to be able to calculate its price. Most of the calculations are very similar. I liked this structure because I can drag it right into the linq-to-sql designer, and have all of my classes generated. (Less code writing means less to maintain...) I then implement a calculate cost interface which just takes in the size of the table. I have written some tests and this functions pretty well. I added also added a table to filter parts in the UI based on previous selections. (You can't have a particular wood with a particular finish.) There some other one off exceptions in the model, and I have them hard coded. This model is very rigid, and changing requirements would change the datamodel. (For example, if all the tables suddenly need umbrellas.)
Option B:
After various meetings with my colleagues (which probably took more time than it should considering the size of this project), my colleagues decided they would prefer a more generic approach. Something like this:
Spec
----
SpecID
SpecTypeID
TableType_LookupID
Name
MaterialUnitCost
LaborSetupHours
LaborWorkHours
SpecType
--------
SpecTypeID
ParentSpecType_SpecTypeID
IsCustomerOption
IsRequiredCustomerOption
etc...
This is a much more generic approach that could be used to construct any product. (like, if they started selling chairs...) I think this would take longer time to implement, but would be more flexible in the future. (although I doubt we will revisit this.) Also you lose some referential integrity- you would need triggers to enforce that a table base cannot be set for a table wood.
Questions:
Which database structure do you prefer? Feel free to suggest your own.
What would be considered a best practice? If you have several similar database tables, do you create 1 database table with a type column, or several distinct tables? I suspect the answer begins with "It depends..."
What would an estimated time difference be in the two approaches (1 week, 1 day, 150% longer, etc)
Thanks in advance. Let me know if you have any questions so I can update this.
Having been caught out much more often than I should have by designing db structures that met my clients original specs but which turned out to be too rigid, I would always go for the more flexible approach, even though it takes more time to set up.
I don't have time for a complete answer right now, but I'll throw this out:
It's usually a bad idea to design a database based on the development tool that you're using to code against it.
You want to be generic to a point. Tables in a database should represent something and it is possible to make it too generic. For example, a table called "Things" is probably too generic.
It may be possible to make constraints that go beyond what you expect. Your example of a "table base" with a "table wood" didn't make sense to me, but if you can expand on a specific example someone might be able to help with that.
Finally, if this is a small application for a single store then your design is going to have much less impact on the project outcome than it would if you were designing for an application that would be heavily used and constantly changed. This goes back to the "too generic" comment above. It is possible to overdesign a system when its use will be minimal and well-defined. I hope that makes sense.
Given your comment below about the table bases and woods, you could set up a table called TableAttributes (or something similar) and each possible option would be of a particular table attribute type. You could then enforce that any given option is only used for the attribute to which it applies all through foreign keys.
There is a tendency to over-abstract with database schema design, because the cost of change can be high. Myself, I like table names that are fairly descriptive. I often equate schema design with OO design. E.g., you wouldn't normally create a class named Thing, you would probably call it Product, Furniture, Item, something that relates to your business.
In the schema you have provided there is a mix of the abstract (spec) and the specific (TableType_LookupID). I would tend to equalize the level of abstraction, so use entities like:
ProductGroup (for the case where you have a product that is a collection of other products)
Product
ProductType
ProductDetail
ProductDetailType
etc.
Here's what my experience would tell me:
Which database structure do you prefer? Without a doubt, I'd go for approach one. Go for the simplest setup that might work. If you add complexity, always ask yourself, what value will it have to the customer?
What would be considered a best practice? That does indeed depend, among others on the size of the project and the expected rate of change. As a general rule, generic tables are worth it when you expect the customer to be adding new types. For example, if your customer wants to be able to add a new "color" entity to the table, you'd need generic tables. You can't predict beforehand what they will add.
What would an estimated time difference be in the two approaches? Not knowing your business, skill, and environment, it's impossible to give a valid estimate. The approach that you are confident in coding will take the least time. Here, my guess would be approach #1 could be 5x-50x as fast. Generic tables are hard, both on the database and the client side.
Option B..
Generic is generally better than specific. Software already is doomed to fail or reach it's capacity by it's design for a certain set of tasks only. If you build something generic it will break less if abstracted with a realistic analysis of where it might head. As long as you stay away from over-abstraction and under-abstraction, it's probably the sweet spot.
In this case the adage "less code is more" would probably be drawn in that you wouldn't have to come back and re-write it again.