What are the pros and cons of Anchor Modeling? [closed] - sql

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am currently trying to create a database where a very large percentage of the data is temporal. After reading through many techniques for doing this (most involving 6nf normalization) I ran into Anchor Modeling.
The schema that I was developing strongly resembled the Anchor Modeling model, especially since the use case (Temporal Data + Known Unknowns) is so similar, that I am tempted to embrace it fully.
The two biggest problem I am having is that I can find nothing detailing the negatives of this approach, and I cannot find any references to organizations that have used it in production for war-stories and gotchas that I need to be aware of.
I am wondering if anyone here is familiar enough with to briefly expound on some of the negatives (since the positives are very well advertized in research papers and their site), and any experiences with using it in a production environment.

In reference to the anchormodeling.com
Here are a few points I am aware of
The number of DB-objects is simply too large to maintain manually, so make sure that you use designer all the time to evolve the schema.
Currently, designer supports fully MS SQL Server, so if you have to port code all the time, you may want to wait until your target DB is fully supported. I know it has Oracle in dropdown box, but ...
Do not expect (nor demand) your developers to understand it, they have to access the model via 5NF views -- which is good. The thing is that tables are loaded via (instead-of-) triggers on views, which may (or may not) be a performance issue.
Expect that you may need to write some extra maintenance procedures (for each temporal attribute) which are not auto-generated (yet). For example, I often need a prune procedure for temporal attributes -- to delete same-value-records for the same ID on two consecutive time-events.
Generated views and queries-over-views resolve nicely, and so will probably anything that you write in the future. However, "other people" will be writing queries on views-over-views-over-views -- which does not always resolve nicely. So expect that you may need to police queries more than usual.
Having sad all that, I have recently used the approach to refactor a section of my warehouse, and it worked like a charm. Admittedly, warehouse does not have most of the problems outlined here.
I would suggest that it is imperative to create a demo-system and test, test, test ..., especially point No 3 -- loading via triggers.

With respect to point number 4 above. Restatement control is almost finished, such that you will be able to prevent two consecutive identical values over time.
And a general comment, joins are not necessarily a bad thing. Read: Why joins are a good thing.
One of the great benefits of 6NF in Anchor Modeling is non-destructive schema evolution. In other words, every previous version of the database model is available as a subset in the current model. Also, since changes are represented by extensions in the schema (new tables), upgrading a database is almost instantanous and can safely be done online (even in a production environment). This benefit would be lost in 5NF.

I haven't read any papers on it, but since it's based on 6NF, I'd expect it to suffer from whatever problems follow 6NF.
6NF requires each table consist of a candidate key and no more than one non-key column. So, in the worst case, you'll need nine joins to produce a 10-column result set. But you can also design a database that uses, say, 200 tables that are in 5NF, 30 that are in BCNF, and only 5 that are in 6NF. (I think that would no longer be Anchor Modeling per se, which seems to put all tables in 6NF, but I could be wrong about that.)
The Mythical Man-Month is still relevant here.
The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. The only question is whether to plan in advance to build a throwaway, or to promise to deliver the throwaway to customers.
Fred Brooks, Jr., in The Mythical Man-Month, p 116.
How cheaply can you build a prototype to test your expected worst case?

In this post I will present a large part of the real business that belong to databases. Database's solutions in this big business area can not be solved by using „Anchor modeling" , at all.
In the real business world this case is happening on a daily basis. That is the case when data entry person, enters a wrong data.
In real-world business, errors happen frequently at data entry level. It often happens that data entry generates large amounts of erroneous data. So this is a real and big problem. "Anchor modeling" can not solve this problem.
Anyone who uses the "Anchor Modeling" database can enter incorrect data. This is possible because the authors of "Anchor modeling" have written that the erroneous data can be deleted.
Let me explain this problem by the following example:
A profesor of mathematics gave the best grade to the student who had the worst grade. In this high school, professors enter grades in the corressponding database. This student gave money to the professor for this criminal service. The student managed to enroll at the university using this false grade.
After a summer holiday, the professor of mathematics returned to school. After deleting the wrong grade from the database, the professor entered the correct one in the database. In this school they use "Anchor Modeling" db. So the math profesor deleted false data as it is strictly suggested by authors of "Anchor modeling".
Now, this professor of mathematics who did this criminal act is clean, thanks to the software "Anchor modeling".
This example says that using "Anchor Modeling," you can do crime with data just by applying „Anchor modeling technology“
In section 5.4 the authors of „Anchor modeling“ wrote the following: „Delete statements are allowed only when applied to remove erroneous data.“ .
You can see this text at the paper „ An agile modeling technique using sixth normal form for structurally evolving data“ written by authors of „Anchor modeling“.
Please note that „Anchor modeling“ was presented at the 28th International Conference on Conceptual Modeling and won the best paper award?!
Authors of "Anchor Modeling" claim that their data model can maintain a history! However this example shoes that „Anchor modeling“ can not maintain the history at all.
As „Anchor modeling“ allows deletion of data, then "Anchor modeling" has all the operations with the data, that is: adding new data, deleting data and update. Update can be obtained by using two operations: first delete the data, then add new data.
This further means that Anchor modeling has no history, because it has data deletion and data update.
I would like to point out that in "Anchor modeling" each erroneous data "MUST" be deleted. In the "Anchor modeling" it is not possible to keep erroneous data and corrected data.
"Anchor modeling" can not maintain history of erroneous data.
In the first part of this post, I showed that by using "Anchor Modeling" anyone can do crime with data. This means "Anchor Modeling" runs the business of a company, right into a disaster.

I will give one example so that professionals can see on real and important example, how bad "anchor modeling" is.
Example
People who are professionals in the business of databases, know that there are thousands and thousands of international standards, which have been used successfully in databases as keys.
International standards:
All professionals know what is "VIN" for cars, "ISBN" for books, and thousands of other international standards.
National standards:
All countries have their own standards for passports, personal documents, bank cards, bar codes, etc
Local standards:
Many companies have their own standards. For example, when you pay something, you have an invoice with a standard key written and that key is written in the database, also.
All the above mentioned type of keys from this example can be checked by using a variety of institutions, police, customs, banks credit card, post office, etc. You can check many of these "keys" on the internet or by using a phone.
I believe that percent of these databases, which have entities with standard keys, and which I have presented in this example, is more than 95%.
For all the above cases the "anchor surrogate key" is nonsense. "Anchor modeling" exclusively uses "anchor-surrogate key"
In my solution, I use all the keys that are standard on a global or local level and are simple.
Vladimir Odrljin

Related

Database normalization - who's right?

My professor (who claimed to have a firm understanding about systems development for many years) and I are arguing about the design of our database.
As an example:
My professor insists this design is right:
(list of columns)
Subject_ID
Description
Units_Lec
Units_Lab
Total_Units
etc...
Notice the total units column. He said that this column must be included.
I tried to explain that it is unnecessary, because if you want it, then just make a query by simply adding the two.
I showed him an example I found in a book, but he insists that I don't have to rely on books too much in making our system.
The same thing applies to similar cases as in this one:
student_ID
prelim_grade
midterm_grade
prefinal_grade
average
He wanted me to include the average! Anywhere I go, I can find myself reading articles that convince me that this is a violation of normalization. If I needed the average, I can easily compute the three grades. He enumerated some scenarios including ('Hey! What if the query has been accidentally deleted? What will you do? That is why you need to include it in your table!')
Do I need to reconstruct my database(which consists of about more than 40 tables) to comply with what he want? Am I wrong and just have overlooked these things?
Another thing is that he wanted to include the total amount in the payments table, which I believe is unnecessary. (Just compute the unit price of the product and the quantity.) He pointed out that we need that column for computing debits and/or credits that are critical for the overall system management, that it is needed for balancing transaction. Please tell me what you think.
You are absolutely correct! One of the rules of normalization is to reduce those attributes which can be easily deduced by using other attributes' values. ie, by performing some mathematical calculation. In your case, the total units column can be obtained by simply adding.
Tell your professor that having that particular column will show clear signs of transitive dependency and according to the 3rd normalization rule, its recommended to reduce those.
You are right when you say your solution is more normalized.
However, there is a thing called denormalization (google for it) which is about deliberately violating normalization rules to increase queries performance.
For instance you want to retrieve first five subjects (whatever the thing would be) ordered by decreasing number or total units.
You solution would require a full scan on two tables (subject and unit), joining the resultsets and sorting the output.
Your professor's solution would require just taking first five records from an index on total_units.
This of course comes at the price of increased maintenance cost (both in terms of computational resources and development).
I can't tell you who is "right" here: we know nothing about the project itself, data volumes, queries to be made etc. This is a decision which needs to be made for every project (and for some projects it may be a core decision).
The thing is that the professor does have a rationale for this requirement which may or may not be just.
Why he hasn't explained everything above to you himself, is another question.
In addition to redskins80's great answer I want to point out why this is a bad idea: Every time you need to update one of the source columns you need to update the calculated column as well. This is more work that can contain bugs easily (maybe 1 year later when a different programmer is altering the system).
Maybe you can use a computed column instead? That would be a workable middle-ground.
Edit: Denormalization has its place, but it is the last measure to take. It is like chemotherapy: The doctor injects you poison only to cure an even greater threat to your health. It is the last possible step.
Think it is important to add this because when you see the question the answer is not complete in my opinion. The original question has been answered well but there is a glitch here. So I take in account only the added question quoted below:
Another thing is that he wanted to include the total amount in the
payments table, which I believe is unnecessary(Just compute the unit
price of the product and the quantity.). He pointed out that we need
that column for computing debits and/or credits that are critical for
the overall system management, that it is needed for balancing
transaction. Please tell me what you think.
This edit is interesting. Based on the facts that this is a transactional system handling about money it has to be accountable. I take some basic terms: Transaction, product, price, amount.
In that sense it is very common or even required to denormalize. Why? Because you need it to be accountable. So when the transaction is registered that's it, it may never ever be modified. If you need to correct it then you make another transaction.
Now yes you can calculate for example product price * amount * taxes etc. That makes sense in normalization sense. But then you will need a complete lockdown of all related records. So take for example the products table: If you change the price before the transaction it should be taken into account when the transaction happens. But if the price changes afterwards it does not affect the transaction.
So it is not acceptable to just join transaction.product_id=products.id since that product might change. Example:
2012-01-01 price = 10
2012-01-05 price = 20
Transaction happens here, we sell 10 items so 10 * 20 = 200
2012-01-06 price = 22
Now we lookup the transaction at 2012-01-10, so we do:
SELECT
transactions.amount * products.price AS totalAmount
FROM transactions
INNER JOIN products on products.id=transactions.product_id
That would give 10 * 22 = 220 so it is not correct.
So you have 2 options:
Do not allow updates on the products table. So you make that table versioned, so for every record you add a new INSERT instead of update. So the transaction keeps pointing at the right version of the product.
Or you just add the fields to the transactions table. So add totalAmount to the transactions table and calculate it (in a database transaction) when the transaction is inserted and save it.
Yes, it is denormalized but it has a good reason, it makes it accountable. You just know and it's verified with transactions, locks etc. that the moment that transaction happened it related to the described product with the price = 20 etc.
Next to that, and that is just a nice thing of denormalization when you have to do that anyway, it is very easy to run reports. Total transaction amount of the month, year etc. It is all very easy to calculate.
Normalization has good things, for example no double storage, single point of edit etc. But in this case you just don't want that concept since that is not allowed and not preferred for a transactions log database.
See a transaction as a registration of something happened in real world. It happened, you wrote it down. Now you cannot change history, it was written as it was. Future won't change it, it happened.
If you want to implement the good, old, classic relational model, I think what you're doing is right.
In general, it's actually a matter of philosophy. Some systems, Oracle being an example, even allow you to give up the traditional, relational model in favor of objects, which (by being complex structures kept in tables) violate the 1st NF but give you the power of object-oriented model (you can use inheritance, override methods, etc.), which is pretty damn awesome in some cases. The language used is still SQL, only extended.
I know my answer drifts away from the subject (as we take into consideration a whole new database type) but I thought it's an interesting thing to share on the occasion of a pretty general question.
Database design for actual applications is hardly the question of what tables to make. Currently, there are countless possibilities when it comes to keeping and processing your data. There are relational systems we all know and love, object databases (like db4o), object-relational databases (not to be confused with object relational mapping, what I mean is tools like Oracle 11g with its objects), xml databases (take eXist), stream databases (like Esper) and the currently thriving noSQL databases (some insist they shouldn't be called databases) like MongoDB, Cassandra, CouchDB or Oracle NoSQL
In case of some of these, normalization loses its sense. Each model serves a completely different purpose. I think the term "database" has a much wider meaning than it used to.
When it comes to relational databases, I agree with you and not the professor (although I'm not sure if it's a good idea to oppose him to strongly).
Now, to the point. I think you might win him over by showing that you are open-minded and that you understand that there are many options to take into consideration (including his views) but that the situation requires you to normalize the data.
I know my answer is quite a stream of conscience for a stackoverflow post but I hope it's not received as lunatic babbling.
Good luck in the relational tug of war
You are talking about historical and financial data here. It is common to store some computations that will never change becasue that is the cost that was charged at the time. If you do the calc from product * price and the price changed 6 months after the transaction, then you have the incorrect value. Your professor is smart, listen to him. Further, if you do a lot of reporting off the database, you don't want to often calculate values that are not allowed to be changed without another record of data entry. Why perform calculations many times over the history of the application when you only need to do it once? That is wasteful of precious server resources.
The purpose of normalization is to eliminate redundancies so as to eliminate update anomalies, predominantly in transactional systems. Relational is still the best solution by far for transaction processing, DW, master data and many BI solutions. Most NOSQLs have low-integrity requirements. So you lose my tweet - annoying but not catastrophic. But to lose my million dollar stock trade is a big problem. The choice is not NOSQL vs. relational. NOSQL does certain things very well. But Relational is not going anywhere. It is still the best choice for transactional, update oriented solutions. The requirements for normalization can be loosened when the data is read-only or read-mostly. That's why redundancy is not such a huge problem in DW; there are no updates.

What are the [dis]advantages of using a key/value table over nullable columns or separate tables? [duplicate]

This question already has answers here:
How to design a product table for many kinds of product where each product has many parameters
(4 answers)
Closed 1 year ago.
I'm upgrading a payment management system I created a while ago. It currently has one table for each payment type it can accept. It is limited to only being able to pay for one thing, which this upgrade is to alleviate. I've been asking for suggestions as to how I should design it, and I have these basic ideas to work from:
Have one table for each payment type, with a few common columns on each. (current design)
Coordinate all payments with a central table that takes on the common columns (unifying payment IDs regardless of type), and identifies another table and row ID that has columns specialized to that payment type.
Have one table for all payment types, and null the columns which are not used for any given type.
Use the central table idea, but store specialized columns in a key/value table.
My goals for this are: not ridiculously slow, self-documenting as much as possible, and maximizing flexibility while maintaining the other goals.
I don't like 1 very much because of the duplicate columns in each table. It reflects the payment type classes inheriting a base class that provides functionality for all payment types... ORM in reverse?
I'm leaning toward 2 the most, because it's just as "type safe" and self-documenting as the current design. But, as with 1, to add a new payment type, I need to add a new table.
I don't like 3 because of its "wasted space", and it's not immediately clear which columns are used for which payment types. Documentation can alleviate the pain of this somewhat, but my company's internal tools do not have an effective method for storing/finding technical documentation.
The argument I was given for 4 was that it would alleviate needing to change the database when adding a new payment method, but it suffers even worse than 3 does from the lack of explicitness. Currently, changing the database isn't a problem, but it could become a logistical nightmare if we decide to start letting customers keep their own database down the road.
So, of course I have my biases. Does anyone have any better ideas? Which design do you think fits best? What criteria should I base my decision on?
Note
This subject is being discussed, and this thread is being referenced in other threads, therefore I have given it a reasonable treatment, please bear with me. My intention is to provide understanding, so that you can make informed decisions, rather than simplistic ones based merely on labels. If you find it intense, read it in chunks, at your leisure; come back when you are hungry, and not before.
What, exactly, about EAV, is "Bad" ?
1 Introduction
There is a difference between EAV (Entity-Attribute-Value Model) done properly, and done badly, just as there is a difference between 3NF done properly and done badly. In our technical work, we need to be precise about exactly what works, and what does not; about what performs well, and what doesn't. Blanket statements are dangerous, misinform people, and thus hinder progress and universal understanding of the issues concerned.
I am not for or against anything, except poor implementations by unskilled workers, and misrepresenting the level of compliance to standards. And where I see misunderstanding, as here, I will attempt to address it.
Normalisation is also often misunderstood, so a word on that. Wikipedia and other free sources actually post completely nonsensical "definitions", that have no academic basis, that have vendor biases so as to validate their non-standard-compliant products. There is a Codd published his Twelve Rules. I implement a minimum of 5NF, which is more than enough for most requirements, so I will use that as a baseline. Simply put, assuming Third Normal Form is understood by the reader (at least that definition is not confused) ...
2 Fifth Normal Form
2.1 Definition
Fifth Normal Form is defined as:
every column has a 1::1 relation with the Primary Key, only
and to no other column, in the table, or in any other table
the result is no duplicated columns, anywhere; No Update Anomalies (no need for triggers or complex code to ensure that, when a column is updated, its duplicates are updated correctly).
it improves performance because (a) it affects less rows and (b) improves concurrency due to reduced locking
I make the distinction that, it is not that a database is Normalised to a particular NF or not; the database is simply Normalised. It is that each table is Normalised to a particular NF: some tables may only require 1NF, others 3NF, and yet others require 5NF.
2.2 Performance
There was a time when people thought that Normalisation did not provide performance, and they had to "denormalise for performance". Thank God that myth has been debunked, and most IT professionals today realise that Normalised databases perform better. The database vendors optimise for Normalised databases, not for denormalised file systems. The truth "denormalised" is, the database was NOT normalised in the first place (and it performed badly), it was unnormalised, and they did some further scrambling to improve performance. In order to be Denormalised, it has to be faithfully Normalised first, and that never took place. I have rewritten scores of such "denormalised for performance" databases, providing faithful Normalisation and nothing else, and they ran at least ten, and as much as a hundred times faster. In addition, they required only a fraction of the disk space. It is so pedestrian that I guarantee the exercise, in writing.
2.3 Limitation
The limitations, or rather the full extent of 5NF is:
it does not handle optional values, and Nulls have to be used (many designers disallow Nulls and use substitutes, but this has limitations if it not implemented properly and consistently)
you still need to change DDL in order to add or change columns (and there are more and more requirements to add columns that were not initially identified, after implementation; change control is onerous)
although providing the highest level of performance due to Normalisation (read: elimination of duplicates and confused relations), complex queries such as pivoting (producing a report of rows, or summaries of rows, expressed as columns) and "columnar access" as required for data warehouse operations, are difficult, and those operations only, do not perform well. Not that this is due only to the SQL skill level available, and not to the engine.
3 Sixth Normal Form
3.1 Definition
Sixth Normal Form is defined as:
the Relation (row) is the Primary Key plus at most one attribute (column)
It is known as the Irreducible Normal Form, the ultimate NF, because there is no further Normalisation that can be performed. Although it was discussed in academic circles in the mid nineties, it was formally declared only in 2003. For those who like denigrating the formality of the Relational Model, by confusing relations, relvars, "relationships", and the like: all that nonsense can be put to bed because formally, the above definition identifies the Irreducible Relation, sometimes called the Atomic Relation.
3.2 Progression
The increment that 6NF provides (that 5NF does not) is:
formal support for optional values, and thus, elimination of The Null Problem
a side effect is, columns can be added without DDL changes (more later)
effortless pivoting
simple and direct columnar access
it allows for (not in its vanilla form) an even greater level of performance in this department
Let me say that I (and others) were supplying enhanced 5NF tables 20 years ago, explicitly for pivoting, with no problem at all, and thus allowing (a) simple SQL to be used and (b) providing very high performance; it was nice to know that the academic giants of the industry had formally defined what we were doing. Overnight, my 5NF tables were renamed 6NF, without me lifting a finger. Second, we only did this where we needed it; again, it was the table, not the database, that was Normalised to 6NF.
3.3 SQL Limitation
It is a cumbersome language, particularly re joins, and doing anything moderately complex makes it very cumbersome. (It is a separate issue that most coders do not understand or use subqueries.) It supports the structures required for 5NF, but only just. For robust and stable implementations, one must implement additional standards, which may consist in part, of additional catalogue tables. The "use by" date for SQL had well and truly elapsed by the early nineties; it is totally devoid of any support for 6NF tables, and desperately in need of replacement. But that is all we have, so we need to just Deal With It.
For those of us who had been implementing standards and additional catalogue tables, it was not a serious effort to extend our catalogues to provide the capability required to support 6NF structures to standard: which columns belong to which tables, and in what order; mandatory/optional; display format; etc. Essentially a full MetaData catalogue, married to the SQL catalogue.
Note that each NF contains each previous NF within it, so 6NF contains 5NF. We did not break 5NF in order provide 6NF, we provided a progression from 5NF; and where SQL fell short we provided the catalogue. What this means is, basic constraints such as for Foreign Keys; and Value Domains which were provided via SQL Declarative Referential integrity; Datatypes; CHECKS; and RULES, at the 5NF level, remained intact, and these constraints were not subverted. The high quality and high performance of standard-compliant 5NF databases was not reduced in anyway by introducing 6NF.
3.4 Catalogue
It is important to shield the users (any report tool) and the developers, from having to deal with the jump from 5NF to 6NF (it is their job to be app coding geeks, it is my job to be the database geek). Even at 5NF, that was always a design goal for me: a properly Normalised database, with a minimal Data Directory, is in fact quite easy to use, and there was no way I was going to give that up. Keep in mind that due to normal maintenance and expansion, the 6NF structures change over time, new versions of the database are published at regular intervals. Without doubt, the SQL (already cumbersome at 5NF) required to construct a 5NF row from the 6NF tables, is even more cumbersome. Gratefully, that is completely unnecessary.
Since we already had our catalogue, which identified the full 6NF-DDL-that-SQL-does-not-provide, if you will, I wrote a small utility to read the catalogue and:
generate the 6NF table DDL.
generate 5NF VIEWS of the 6NF tables. This allowed the users to remain blissfully unaware of them, and gave them the same capability and performance as they had at 5NF
generate the full SQL (not a template, we have those separately) required to operate against the 6NF structures, which coders then use. They are released from the tedium and repetition which is otherwise demanded, and free to concentrate on the app logic.
I did not write an utility for Pivoting because the complexity present at 5NF is eliminated, and they are now dead simple to write, as with the 5NF-enhanced-for-pivoting. Besides, most report tools provide pivoting, so I only need to provide functions which comprise heavy churning of stats, which needs to be performed on the server before shipment to the client.
3.5 Performance
Everyone has their cross to bear; I happen to be obsessed with Performance. My 5NF databases performed well, so let me assure you that I ran far more benchmarks than were necessary, before placing anything in production. The 6NF database performed exactly the same as the 5NF database, no better, no worse. This is no surprise, because the only thing the 'complex" 6NF SQL does, that the 5NF SQL doesn't, is perform much more joins and subqueries.
You have to examine the myths.
Anyone who has benchmarked the issue (i.e examined the execution plans of queries) will know that Joins Cost Nothing, it is a compile-time resolution, they have no effect at execution time.
Yes, of course, the number of tables joined; the size of the tables being joined; whether indices can be used; the distribution of the keys being joined; etc, all cost something.
But the join itself costs nothing.
A query on five (larger) tables in a Unnormalised database is much slower than the equivalent query on ten (smaller) tables in the same database if it were Normalised. the point is, neither the four nor the nine Joins cost anything; they do not figure in the performance problem; the selected set on each Join does figure in it.
3.6 Benefit
Unrestricted columnar access. This is where 6NF really stands out. The straight columnar access was so fast that there was no need to export the data to a data warehouse in order to obtain speed from specialised DW structures.
My research into a few DWs, by no means complete, shows that they consistently store data by columns, as opposed to rows, which is exactly what 6NF does. I am conservative, so I am not about to make any declarations that 6NF will displace DWs, but in my case it eliminated the need for one.
It would not be fair to compare functions available in 6NF that were unavailable in 5NF (eg. Pivoting), which obviously ran much faster.
That was our first true 6NF database (with a full catalogue, etc; as opposed to the always 5NF with enhancements only as necessary; which later turned out to be 6NF), and the customer is very happy. Of course I was monitoring performance for some time after delivery, and I identified an even faster columnar access method for my next 6NF project. That, when I do it, might present a bit of competition for the DW market. The customer is not ready, and we do not fix that which is not broken.
3.7 What, Exactly, about 6NF, is "Bad" ?
Note that not everyone would approach the job with as much formality, structure, and adherence to standards. So it would be silly to conclude from our project, that all 6NF databases perform well, and are easy to maintain. It would be just as silly to conclude (from looking at the implementations of others) that all 6NF databases perform badly, are hard to maintain; disasters. As always, with any technical endeavour, the resulting performance and ease of maintenance are strictly dependent on formality, structure, and adherence to standards, in addition to the relevant skill set.
4 Entity Attribute Value
Disclosure: Experience. I have inspected a few of these, mostly hospital and medical systems. I have performed corrective assignments on two of them. The initial delivery by the overseas provider was quite adequate, although not great, but the extensions implemented by the local provider were a mess. But not nearly the disaster that people have posted about re EAV on this site. A few months intense work fixed them up nicely.
4.1 What It Is
It was obvious to me that the EAV implementations I have worked on are merely subsets of Sixth Normal Form. Those who implement EAV do so because they want some of the features of 6NF (eg. ability to add columns without DDL changes), but they do not have the academic knowledge to implement true 6NF, or the standards and structures to implement and administer it securely. Even the original provider did not know about 6NF, or that EAV was a subset of 6NF, but they readily agreed when I pointed it out to them. Because the structures required to provide EAV, and indeed 6NF, efficiently and effectively (catalogue; Views; automated code generation) are not formally identified in the EAV community, and are missing from most implementations, I classify EAV as the bastard son Sixth Normal Form.
4.2 What, Exactly, about EAV, is "Bad" ?
Going by the comments in this and other threads, yes, EAV done badly is a disaster. More important (a) they are so bad that the performance provided at 5NF (forget 6NF) is lost and (b) the ordinary isolation from the complexity has not been implemented (coders and users are "forced" to use cumbersome navigation). And if they did not implement a catalogue, all sorts of preventable errors will not have been prevented.
All that may well be true for bad (EAV or other) implementations, but it has nothing to do with 6NF or EAV. The two projects I worked had quite adequate performance (sure, it could be improved; but there was no bad performance due to EAV), and good isolation of complexity. Of course, they were nowhere near the quality or performance of my 5NF databases or my true 6NF database, but they were fair enough, given the level of understanding of the posted issues within the EAV community. They were not the disasters and sub-standard nonsense alleged to be EAV in these pages.
5 Nulls
There is a well-known and documented issue called The Null Problem. It is worthy of an essay by itself. For this post, suffice to say:
the problem is really the optional or missing value; here the consideration is table design such that there are no Nulls vs Nullable columns
actually it does not matter because, regardless of whether you use Nulls/No Nulls/6NF to exclude missing values, you will have to code for that, the problem precisely then, is handling missing values, which cannot be circumvented
except of course for pure 6NF, which eliminates the Null Problem
the coding to handle missing values remains
except, with automated generation of SQL code, heh heh
Nulls are bad news for performance, and many of us have decided decades ago not to allow Nulls in the database (Nulls in passed parameters and result sets, to indicate missing values, is fine)
which means a set of Null Substitutes and boolean columns to indicate missing values
Nulls cause otherwise fixed len columns to be variable len; variable len columns should never be used in indices, because a little 'unpacking' has to be performed on every access of every index entry, during traversal or dive.
6 Position
I am not a proponent of EAV or 6NF, I am a proponent of quality and standards. My position is:
Always, in all ways, do whatever you are doing to the highest standard that you are aware of.
Normalising to Third Normal Form is minimal for a Relational Database (5NF for me). DataTypes, Declarative referential Integrity, Transactions, Normalisation are all essential requirements of a database; if they are missing, it is not a database.
if you have to "denormalise for performance", you have made serious Normalisation errors, your design in not normalised. Period. Do not "denormalise", on the contrary, learn Normalisation and Normalise.
There is no need to do extra work. If your requirement can be fulfilled with 5NF, do not implement more. If you need Optional Values or ability to add columns without DDL changes or the complete elimination of the Null Problem, implement 6NF, only in those tables that need them.
If you do that, due only to the fact that SQL does not provide proper support for 6NF, you will need to implement:
a simple and effective catalogue (column mix-ups and data integrity loss are simply not acceptable)
5NF access for the 6NF tables, via VIEWS, to isolate the users (and developers) from the encumbered (not "complex") SQL
write or buy utilities, so that you can generate the cumbersome SQL to construct the 5NF rows from the 6NF tables, and avoid writing same
measure, monitor, diagnose, and improve. If you have a performance problem, you have made either (a) a Normalisation error or (b) a coding error. Period. Back up a few steps and fix it.
If you decide to go with EAV, recognise it for what it is, 6NF, and implement it properly, as above. If you do, you will have a successful project, guaranteed. If you do not, you will have a dog's breakfast, guaranteed.
6.1 There Ain't No Such Thing As A Free Lunch
That adage has been referred to, but actually it has been misused. The way it actually, deeply applies is as above: if you want the benefits of 6NF/EAV, you had better be willing too do the work required to obtain it (catalogue, standards). Of course, the corollary is, if you don't do the work, you won't get the benefit. There is no "loss" of Datatypes; value Domains; Foreign keys; Checks; Rules. Regarding performance, there is no performance penalty for 6NF/EAV, but there is always a substantial performance penalty for sub-standard work.
7 Specific Question
Finally. With due consideration to the context above, and that it is a small project with a small team, there is no question:
Do not use EAV (or 6NF for that matter)
Do not use Nulls or Nullable columns (unless you wish to subvert performance)
Do use a single Payment table for the common payment columns
and a child table for each PaymentType, each with its specific columns
All fully typecast and constrained.
What's this "another row_id" business ? Why do some of you stick an ID on everything that moves, without checking if it is a deer or an eagle ? No. The child is a dependent child. The Relation is 1::1. The PK of the child is the PK of the parent, the common Payment table. This is an ordinary Supertype-Subtype cluster, the Differentiator is PaymentTypeCode. Subtypes and supertypes are an ordinary part of the Relational Model, and fully catered for in the database, as well as in any good modelling tool.
Sure, people who have no knowledge of Relational databases think they invented it 30 years later, and give it funny new names. Or worse, they knowingly re-label it and claim it as their own. Until some poor sod, with a bit of education and professional pride, exposes the ignorance or the fraud. I do not know which one it is, but it is one of them; I am just stating facts, which are easy to confirm.
A. Responses to Comments
A.1 Attribution
I do not have personal or private or special definitions. All statements regarding the definition (such as imperatives) of:
Normalisation,
Normal Forms, and
the Relational Model.
.
refer to the many original texts By EF Codd and CJ Date (not available free on the web)
.
The latest being Temporal Data and The Relational Model by CJ Date, Hugh Darwen, Nikos A Lorentzos
.
and nothing but those texts
.
"I stand on the shoulders of giants"
.
The essence, the body, all statements regarding the implementation (eg. subjective, and first person) of the above are based on experience; implementing the above principles and concepts, as a commercial organisation (salaried consultant or running a consultancy), in large financial institutions in America and Australia, over 32 years.
This includes scores of large assignments correcting or replacing sub-standard or non-relational implementations.
.
The Null Problem vis-a-vis Sixth Normal Form
A freely available White Paper relating to the title (it does not define The Null Problem alone) can be found at:
http://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf.
.
A 'nutshell' definition of 6NF (meaningful to those experienced with the other NFs), can be found on p6
A.2 Supporting Evidence
As stated at the outset, the purpose of this post is to counter the misinformation that is rife in this community, as a service to the community.
Evidence supporting statements made re the implementation of the above principles, can be provided, if and when specific statements are identified; and to the same degree that the incorrect statements posted by others, to which this post is a response, is likewise evidenced. If there is going to be a bun fight, let's make sure the playing field is level
Here are a few docs that I can lay my hands on immediately.
a. Large Bank
This is the best example, as it was undertaken for explicitly the reasons in this post, and goals were realised. They had a budget for Sybase IQ (DW product) but the reports were so fast when we finished the project, they did not need it. The trade analytical stats were my 5NF plus pivoting extensions which turned out to be 6NF, described above. I think all the questions asked in the comments have been answered in the doc, except:
- number of rows:
- old database is unknown, but it can be extrapolated from the other stats
- new database = 20 tables over 100M, 4 tables over 10B.
b. Small Financial Institute Part A
Part B - The meat
Part C - Referenced Diagrams
Part D - Appendix, Audit of Indices Before/After (1 line per Index)
Note four docs; the fourth only for those who wish to inspect detailed Index changes. They were running a 3rd party app that could not be changed because the local supplier was out of business, plus 120% extensions which they could, but did not want to, change. We were called in because they upgraded to a new version of Sybase, which was much faster, which shifted the various performance thresholds, which caused large no of deadlocks. Here we Normalised absolutely everything in the server except the db model, with the goal (guaranteed beforehand) of eliminating deadlocks (sorry, I am not going to explain that here: people who argue about the "denormalisation" issue, will be in a pink fit about this one). It included a reversal of "splitting tables into an archive db for performance", which is the subject of another post (yes, the new single table performed faster than the two spilt ones). This exercise applies to MS SQL Server [insert rewrite version] as well.
c. Yale New Haven Hospital
That's Yale School of Medicine, their teaching hospital. This is a third-party app on top of Sybase. The problem with stats is, 80% of the time they were collecting snapshots at nominated test times only, but no consistent history, so there is no "before image" to compare our new consistent stats with. I do not know of any other company who can get Unix and Sybase internal stats on the same graphs, in an automated manner. Now the network is the threshold (which is a Good Thing).
Perhaps you should look this question
The accepted answer from Bill Karwin goes into specific arguments against the key/value table usually know as Entity Attribute Value (EAV)
.. Although many people seem to favor
EAV, I don't. It seems like the most
flexible solution, and therefore the
best. However, keep in mind the adage
TANSTAAFL. Here are some of the
disadvantages of EAV:
No way to make a column mandatory (equivalent of NOT NULL).
No way to use SQL data types to validate entries.
No way to ensure that attribute names are spelled consistently.
No way to put a foreign key on the values of any given attribute, e.g.
for a lookup table.
Fetching results in a conventional tabular layout is complex and
expensive, because to get attributes
from multiple rows you need to do
JOIN for each attribute.
The degree of flexibility EAV gives
you requires sacrifices in other
areas, probably making your code as
complex (or worse) than it would have
been to solve the original problem in
a more conventional way.
And in most cases, it's an unnecessary
to have that degree of flexibility.
In the OP's question about product
types, it's much simpler to create a
table per product type for
product-specific attributes, so you
have some consistent structure
enforced at least for entries of the
same product type.
I'd use EAV only if every row must
be permitted to potentially have a
distinct set of attributes. When you
have a finite set of product types,
EAV is overkill. Class Table
Inheritance would be my first choice.
My #1 principle is not to redesign something for no reason. So I would go with option 1 because that's your current design and it has a proven track record of working.
Spend the redesign time on new features instead.
If I were designing from scratch I would go with number two. It gives you the flexibility you need. However with number 1 already in place and working and this being soemting rather central to your whole app, i would probably be wary of making a major design change without a good idea of exactly what queries, stored procs, views, UDFs, reports, imports etc you would have to change. If it was something I could do with a relatively low risk (and agood testing alrady in place.) I might go for the change to solution 2 otherwise you might beintroducing new worse bugs.
Under no circumstances would I use an EAV table for something like this. They are horrible for querying and performance and the flexibility is way overrated (ask users if they prefer to be able to add new types 3-4 times a year without a program change at the cost of everyday performance).
At first sight, I would go for option 2 (or 3): when possible, generalize.
Option 4 is not very Relational I think, and will make your queries complex.
When confronted to those question, I generally confront those options with "use cases":
-how is design 2/3 behaving when do this or this operation ?

Database Modification or start over?

So I'm currently working on rebuilding an existing website that is used internally at my company for project management, at heart it is a bug tracking utility that has some customer support and accounting operations linked into it.
Currently the database model is very repetitive, a good example of this is, currently a UserId is linked into a record (FK relationship into a user table that contains all the information about the user) and then all the information about the user also exists in the table.
I've been tasked with improving the website and the functionality of the model; however, I want to reduce the repetition of data in the website (is this normalization or is that the breaking apart of unlinked items into separate tables?). I'm not sure what the best method of doing this would be. I'm thinking of generating the creation scripts for the database and creating a new database project in VS to then modify the database, then generating some scripts to populate the new database model from the old database.
I plan on using the Entity Framework and ASP. NET MVC 2 to build the website as I think it provides the most flexible model moving forward for the modification and maintenance of the website.
The reason I ask all of this is because I'm very familiar with using databases and modifying existing ones to be used in applications and websites but I'm trying to discover the best way to build one.
I'm curious if there is any material on the best way to do this or if I should be using a different tool to do this with?
Edit: Providing more information on the model
There are 4 major areas that we have that are used:
Cases (Bugs, Features, Working Tasks, Etc)
2 .Tickets (Tech Support Events)
Errors (Errors Generated from our logging Library, Basically a stack trace with customer information)
License (Keeps track of each customers License allows modification to those licenses)
These are the Objects that are intermixed and used throughout the above 4 major areas.
Users (People who use the system)
Customers (People who use our software)
Stores (Places where our customers use our software)
Products (Our Software)
Relationships
Cases:
A Cases has to have a User, can have a Customer, Store, Error, Ticket and/or Product
Tickets
A Ticket has to have a User and a Customer, can have a Store, Error and/or Product
Errors:
A Error has to have a Product, Can Have a Case, Ticket, Store, and/or Product
Licenses:
A Licenses has to have a Product and Customer, can have a Store
Like I said very basic website, with a not super complex database, if done correctly.
Currently the database has no FK constraints, replication of lots of information across each table and lots of extra tables that are duplicates with different names.
E.g.
Each Case type has a separate table so there is a FeatureRequest, Bug, Tasks, Completed, etc table that all contain the same information.
Normalization is about storing data without redundancy or anomalies.
One example of an anomaly could be when attributes about a user in your main table are not in sync with the users table. Someone changes information about that user in one table without reflecting the changes in the redundant copy. The problem is that it's hard to know which change is the correct one.
Some people think that normalization is just about breaking apart tables into littler tables, because that's what they see as the most common type of change. But that's not the goal of normalization. It's just by coincidence that most mistakes of non-normalization involve stuffing too much data into one table where multiple tables would be correct.
It's hard to answer your question about whether to modify your database in-place or whether to create a whole new database and migrate to it.
What I would do in your case is to design a properly normalized database, and then examine the differences between that and your existing database. Imagine what you would have to do for each difference, to change your old database to the new one, versus a data migration. It could be that only a few changes are needed, only dropping the redundant columns. Or it could be that some major rework is needed. It's impossible to tell until you do the work of creating a normalized data model so you can compare.
The bigger task might be to adapt your application code that uses the database. One way to ease this transition is to create database views on top of the normalized database, which mimic your old non-normalized database. That way hopefully you don't have to rewrite every bit of code in your app all at once, you can keep some of it the same at least until you can refactor the code.
Also having a good set of regression tests in place is ideal, so you can be sure your app still does all the tasks it is supposed to do, as you refactor the database and the code that uses the database.
Re your comment: You mention that you're adding new functionality to the user model at the same time. I would find it too confusing to try to do this simultaneously with refactoring. Refactoring typically does not change functionality, it only changes implementation. But refactoring adds value because it makes the code easier to maintain or debug, improves efficiency, or prepares you to make future functionality changes more easily.
I would recommend that you bit the bullet and add your new user model features to the old non-normalized database. It's good to get the benefit of new features in the short term, and also you need to develop those features first to understand them well enough to account for them in your big refactoring project.
Here are some suggestions for resources to help you truly understand what normalization means:
SQL and Relational Theory by C. J. Date
A Simple Guide to Five Normal Forms in Relational Database Theory by William Kent
Database Normalization at Wikipedia and its sub-pages for each respective normal form
SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming by me, Bill Karwin. I wrote a chapter about database normalization that I hope explains it in plain English and with good examples.
Here are a couple of resources for managing changes to a database:
Refactoring Databases by Scott W. Ambler and Pramodkumar J. Sadalage
Agile Database Techniques: Effective Strategies for the Agile Software Developer by Scott W. Ambler
How long do you have, and how big is the database?
It's very difficult to answer this question black and white without being immersed in your environment and business case. It really doesn't seem like your limitation is technology wise, just to choose between solutions.
Re-creating is what programmers instinctively go for. However, in the "real world", sometimes we spend a lot of effort into something that isn't that used or wont last that long.
So food for thought. How long will it take you to re-do the database, how much will it cost. Will working with what's existent be sufficient for the functionality asked?

T-SQL database design and tables

I'd like to hear some opinions or discussion on a matter of database design. Me and my colleagues are developing a complex application in finance industry that is being installed in several countries.
Our contractors wanted us to keep a single application for all the countries so we naturally face the difficulties with different workflows in every one of them and try to make the application adjustable to satisfy various needs.
The issue I've encountered today was a request from the head of the IT department from the contractors side that we keep the database model in terms of tables and columns they consist of.
For examlpe, we got a table with different risks and we needed to add a flag column IsSomething (BIT NOT NULL ...). It fully qualifies to exists within the risk table according to the third normal form, no transitive dependency to the key, a non key value ...
BUT, the guy said that he wants to keep the tables as they are so we had to make a new table "riskinfo" and link the data 1:1 to the new column.
What is your opinion ?
We add columns to our tables that are referenced by a variety of apps all the time.
So long as the applications specifically reference the columns they want to use and you make sure the new fields are either nullable or have a sensible default defined so it doesn't interfere with inserts I don't see any real problem.
That said, if an app does a select * then proceeds to reference the columns by index rather than name you could produce issues in existing code. Personally I have confidence that nothing referencing our database does this because of our coding conventions (That and I suspect the code review process would lynch someone who tried it :P), but if you're not certain then there is at least some small risk to such a change.
In your actual scenario I'd go back to the contractor and give your reasons you don't think the change will cause any problems and ask the rationale behind their choice. Maybe they have some application-specific wisdom behind their suggestion, maybe just paranoia from dealing with other companies that change the database structure in ways that aren't backwards-compatible, or maybe it's just a policy at their company that got rubber-stamped long ago and nobody's challenged. Till you ask you never know.
This question is indeed subjective like what Binary Worrier commented. I do not have an answer nor any suggestion. Just sharing my 2 cents.
Do you know the rationale for those decisions? Sometimes good designs are compromised for the sake of not breaking currently working applications or simply for the fact that too much has been done based on the previous one. It could also be many other non-technical reasons.
Very often, the programming community is unreasonably concerned about the ripple effect that results from redefining tables. Usually, this is a result of failure to understand data independence, and failure to guard the data independence of their operations on the data. Occasionally, the original database designer is at fault.
Most object oriented programmers understand encapsulation better than I do. But these same experts typically don't understand squat about data independence. And anyone who has learned how to operate on an SQL database, but never learned the concept of data independence is dangerously ignorant. The superficial aspects of data independence can be learned in about five minutes. But to really learn it takes time and effort.
Other responders have mentioned queries that use "select *". A select with a wildcard is more data dependent than the same select that lists the names of all the columns in the table. This is just one example among dozens.
The thing is, both data independence and encapsulation pursue the same goal: containing the unintended consequences of a change in the model.
Here's how to keep your IT chief happy. Define a new table with a new name that contains all the columns from the old table, and also all the additional columns that are now necessary. Create a view, with the same name as the old table, that contains precisely the same columns, and in the same order, that the old table had. Typically, this view will show all the rows in the old table, and the old PK will still guarantee uniqueness.
Once in a while, this will fail to meet all of the IT chief's needs. And if the IT chief is really saying "I don't understand databases; so don't change anything" then you are up the creek until the IT chief changes or gets changed.

How to document a database [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
This post was edited and submitted for review 12 months ago and failed to reopen the post:
Original close reason(s) were not resolved
Improve this question
(Note: I realize this is close to How do you document your database structure? , but I don't think it's identical.)
I've started work at a place with a database with literally hundreds of tables and views, all with cryptic names with very few vowels, and no documentation. They also don't allow gratuitous changes to the database schema, nor can I touch any database except the test one on my own machine (which gets blown away and recreated regularly), so I can't add comments that would help anybody.
I tried using "Toad" to create an ER diagram, but after leaving it running for 48 hours straight it still hadn't produced anything visible and I needed my computer back. I was talking to some other recent hires and we all suggested that whenever we've puzzled out what a particular table or what some of its columns means, we should update it in the developers wiki.
So what's a good way to do this? Just list tables/views and their columns and fill them in as we go? The basic tools I've got to hand are Toad, Oracle's "SQL Developer", MS Office, and Visio.
In my experience, ER (or UML) diagrams aren't the most useful artifact - with a large number of tables, diagrams (especially reverse engineered ones) are often a big convoluted mess that nobody learns anything from.
For my money, some good human-readable documentation (perhaps supplemented with diagrams of smaller portions of the system) will give you the most mileage. This will include, for each table:
Descriptions of what the table means and how it's functionally used (in the UI, etc.)
Descriptions of what each attribute means, if it isn't obvious
Explanations of the relationships (foreign keys) from this table to others, and vice-versa
Explanations of additional constraints and / or triggers
Additional explanation of major views & procs that touch the table, if they're not well documented already
With all of the above, don't document for the sake of documenting - documentation that restates the obvious just gets in people's way. Instead, focus on the stuff that confused you at first, and spend a few minutes writing really clear, concise explanations. That'll help you think it through, and it'll massively help other developers who run into these tables for the first time.
As others have mentioned, there are a wide variety of tools to help you manage this, like Enterprise Architect, Red Gate SQL Doc, and the built-in tools from various vendors. But while tool support is helpful (and even critical, in bigger databases), doing the hard work of understanding and explaining the conceptual model of the database is the real win. From that perspective, you can even do it in a text file (though doing it in Wiki form would allow several people to collaborate on adding to that documentation incrementally - so, every time someone figures out something, they can add it to the growing body of documentation instantly).
One thing to consider is the COMMENT facility built into the DBMS. If you put comments on all of the tables and all of the columns in the DBMS itself, then your documentation will be inside the database system.
Using the COMMENT facility does not make any changes to the schema itself, it only adds data to the USER_TAB_COMMENTS catalog table.
In our team we came to useful approach to documenting legacy large Oracle and SQL Server databases. We use Dataedo for documenting database schema elements (data dictionary) and creating ERD diagrams. Dataedo comes with documentation repository so all your team can work on documenting and reading recent documentation online. And you don’t need to interfere with database (Oracle comments or SQL Server MS_Description).
First you import schema (all tables, views, stored procedures and functions – with triggers, foreign keys etc.). Then you define logical domains/modules and group all objects (drag & drop) into them to be able to analyze and work on smaller chunks of database. For each module you create an ERD diagram and write top level description. Then, as you discover meaning of tables and views write a short description for each. Do the same for each column. Dataedo enables you to add meaningful title for each object and column – it’s useful if object names are vague or invalid. Pro version enables you to describe foreign keys, unique keys/constraints and triggers – which is useful but not essential to understand a database.
You can access documentation through UI or you can export it to PDF or interactive HTML (the latter is available only in Pro version).
Described here is a continuous process rather than one time job. If your database changes (eg. new columns, views) you should sync your documentation on regular basis (couple clicks with Dataedo).
See sample documentation:
http://dataedo.com/download/Dataedo%20repository.pdf
Some guidelines on documentation process:
Diagrams:
Keep your diagrams small and readable – just include important tables, relations and columns – only the one that have any meaning to understand big picture – primary/business keys, important attributes and relations,
Use different color for key tables in a diagram,
You can have more than one diagram per module,
You can add diagram to description of most important tables/with most relations.
Descriptions:
Don’t document the obvious – don’t write description “Document date” for document.date column. If there’s nothing meaningful to add just leave it blank,
If objects stored in tables have types or statuses it’s good to list them in general description of a table,
Define format that is expected, eg. “mm/dd/yy” for a date that is stored in text field,
List all known/important values an it’s meaning, e.g. for status column could be something like this: “Document status: A – Active, C – Cancelled, D – Deleted”,
If there’s any API to a table – a view that should be used to read data and function/procedures to insert/update data – list it in the description of table,
Describe where does rows/columns’ values come from (procedure, form, interface etc.) ,
Use “[deprecated]” mark (or similar) for columns that should not be used (title column is useful for this, explain which field should be used instead in description field).
We use Enterprise Architect for our DB definitions. We include stored procedures, triggers, and all table definitions defined in UML. The three brilliant features of the program are:
Import UML Diagrams from an ODBC Connection.
Generate SQL Scripts (DDL) for the entire DB at once
Generate Custom Templated Documentation of your DB.
You can edit your class / table definitions within the UML tool, and generate a fully descriptive with pictures included document. The autogenerated document can be in multiple formats including MSWord. We have just less than 100 tables in our schema, and it's quite managable.
I've never been more impressed with any other tool in my 10+ years as a developer. EA supports Oracle, MySQL, SQL Server (multiple versions), PostGreSQL, Interbase, DB2, and Access in one fell swoop. Any time I've had problems, their forums have answered my problems promptly. Highly recommended!!
When DB changes come in, we make then in EA, generate the SQL, and check it into our version control (svn). We use Hudson for building, and it auto-builds the database from scripts when it sees you've modified the checked-in sql.
(Mostly stolen from another answer of mine)
This answer extends Kieveli's above, which I upvoted. If your version of EA supports Object Role Modeling (conceptual design, vs. logical design = ERD), reverse engineer to that and then fill out the model with the expressive richness it gives you.
The cheap and lighter-weight option is to download Visiomodeler for free from MS, and do the same with that.
The ORM (call it ORMDB) is the only tool I've ever found that supports and encourages database design conversations with non-IS stakeholders about BL objects and relationships.
Reality check - on the way to generating your DDL, it passes through a full-stop ERD phase where you can satisfy your questions about whether it does anything screwy. It doesn't. It will probably show you weaknesses in the ERD you designed yourself.
ORMDB is a classic case of the principle that the more conceptual the tool, the smaller the market. Girls just want to have fun, and programmers just want to code.
A wiki solution supports hyperlinks and collaborative editing, but a wiki is only as good as the people who keep it organized and up to date. You need someone to take ownership of the document project, regardless of what tool you use. That person may involve other knowledgeable people to fill in the details, but one person should be responsible for organizing the information.
If you can't use a tool to generate an ERD by reverse engineering, you'll have to design one by hand using TOAD or VISIO.
Any ERD with hundreds of objects is probably useless as a guide for developers, because it'll be unreadable with so many boxes and lines. In a database with so many objects, it's likely that there are "sub-systems" of a few dozen tables and views each. So you should make custom diagrams of these sub-systems, instead of expecting a tool to do it for you.
You can also design a pseudo-ERD, where groups of tables are represented by a single object in one diagram, and that group is expanded in another diagram.
A single ERD or set of ERD's are not sufficient to document a system of this complexity, any more than a class diagram would be adequate to document an OO system. You'll have to write a document, using the ERD's as illustrations. You need text descriptions of the meaning and use of each table, each column, and the relationships between tables (especially where such relationships are implicit instead of represented by referential integrity constraints).
All of this is a lot of work, but it will be worth it. If there's a clear and up-to-date place where the schema is documented, the whole team will benefit from it.
Since you have the luxury of working with fellow developers that are in the same boat, I would suggest asking them what they feel would convey the needed information, most easily. My company has over 100 tables, and my boss gave me an ERD for a specific set tables that all connect. So also, you might want to try breaking 1 massive ERD into a bunch of smaller, manageable, ERDs.
Well, a picture tells a thousand words so I would recommend creating ER diagrams where you can view the relationship between tables at a glance, something that is hard to do with a text-only description.
You don't have to do the whole database in one diagram, break it up into sections. We use Visual Paradigm at work but EA is a good alternative as is ERWIN, and no doubt there are lots of others that are just as good.
If you have the patience, then using html to document the tables and columns makes your documentation easier to access.
If describing your databases to your end users is your primary goal Ooluk Data Dictionary Manager can prove useful. It is a web-based multi-user software that allows you to attach descriptions to tables and columns and allows full text searches on those descriptions. It also allows you to logically group tables using labels and browse tables using those labels. Tables as well as columns can be tagged to find similar data items across your database/databases.
The software allows you to import metadata information such as table name, column name, column data type, foreign keys into its internal repository using an API. Support for JDBC data sources comes built-in and can be extended further as the API source is distributed under ASL 2.0. It is coded to read the COMMENTS/REMARKS from many RDBMSs.You can always manually override the imported information. The information you can store about tables and columns can be extended using custom fields.
The Data Dictionary Manager uses the "data object" and "attribute" terminology instead of table and column because it isn't designed specifically for relational databases.
Notes
If describing technical aspects of your database such as triggers,
indexes, statistics is important this software isn't the best option.
It is however possible to combine a technical solution with this
software using hyperlink custom fields.
The software doesn't produce an ERD
Disclosure: I work at the company that develops this product.