What are the [dis]advantages of using a key/value table over nullable columns or separate tables? [duplicate] - sql

This question already has answers here:
How to design a product table for many kinds of product where each product has many parameters
(4 answers)
Closed 1 year ago.
I'm upgrading a payment management system I created a while ago. It currently has one table for each payment type it can accept. It is limited to only being able to pay for one thing, which this upgrade is to alleviate. I've been asking for suggestions as to how I should design it, and I have these basic ideas to work from:
Have one table for each payment type, with a few common columns on each. (current design)
Coordinate all payments with a central table that takes on the common columns (unifying payment IDs regardless of type), and identifies another table and row ID that has columns specialized to that payment type.
Have one table for all payment types, and null the columns which are not used for any given type.
Use the central table idea, but store specialized columns in a key/value table.
My goals for this are: not ridiculously slow, self-documenting as much as possible, and maximizing flexibility while maintaining the other goals.
I don't like 1 very much because of the duplicate columns in each table. It reflects the payment type classes inheriting a base class that provides functionality for all payment types... ORM in reverse?
I'm leaning toward 2 the most, because it's just as "type safe" and self-documenting as the current design. But, as with 1, to add a new payment type, I need to add a new table.
I don't like 3 because of its "wasted space", and it's not immediately clear which columns are used for which payment types. Documentation can alleviate the pain of this somewhat, but my company's internal tools do not have an effective method for storing/finding technical documentation.
The argument I was given for 4 was that it would alleviate needing to change the database when adding a new payment method, but it suffers even worse than 3 does from the lack of explicitness. Currently, changing the database isn't a problem, but it could become a logistical nightmare if we decide to start letting customers keep their own database down the road.
So, of course I have my biases. Does anyone have any better ideas? Which design do you think fits best? What criteria should I base my decision on?

Note
This subject is being discussed, and this thread is being referenced in other threads, therefore I have given it a reasonable treatment, please bear with me. My intention is to provide understanding, so that you can make informed decisions, rather than simplistic ones based merely on labels. If you find it intense, read it in chunks, at your leisure; come back when you are hungry, and not before.
What, exactly, about EAV, is "Bad" ?
1 Introduction
There is a difference between EAV (Entity-Attribute-Value Model) done properly, and done badly, just as there is a difference between 3NF done properly and done badly. In our technical work, we need to be precise about exactly what works, and what does not; about what performs well, and what doesn't. Blanket statements are dangerous, misinform people, and thus hinder progress and universal understanding of the issues concerned.
I am not for or against anything, except poor implementations by unskilled workers, and misrepresenting the level of compliance to standards. And where I see misunderstanding, as here, I will attempt to address it.
Normalisation is also often misunderstood, so a word on that. Wikipedia and other free sources actually post completely nonsensical "definitions", that have no academic basis, that have vendor biases so as to validate their non-standard-compliant products. There is a Codd published his Twelve Rules. I implement a minimum of 5NF, which is more than enough for most requirements, so I will use that as a baseline. Simply put, assuming Third Normal Form is understood by the reader (at least that definition is not confused) ...
2 Fifth Normal Form
2.1 Definition
Fifth Normal Form is defined as:
every column has a 1::1 relation with the Primary Key, only
and to no other column, in the table, or in any other table
the result is no duplicated columns, anywhere; No Update Anomalies (no need for triggers or complex code to ensure that, when a column is updated, its duplicates are updated correctly).
it improves performance because (a) it affects less rows and (b) improves concurrency due to reduced locking
I make the distinction that, it is not that a database is Normalised to a particular NF or not; the database is simply Normalised. It is that each table is Normalised to a particular NF: some tables may only require 1NF, others 3NF, and yet others require 5NF.
2.2 Performance
There was a time when people thought that Normalisation did not provide performance, and they had to "denormalise for performance". Thank God that myth has been debunked, and most IT professionals today realise that Normalised databases perform better. The database vendors optimise for Normalised databases, not for denormalised file systems. The truth "denormalised" is, the database was NOT normalised in the first place (and it performed badly), it was unnormalised, and they did some further scrambling to improve performance. In order to be Denormalised, it has to be faithfully Normalised first, and that never took place. I have rewritten scores of such "denormalised for performance" databases, providing faithful Normalisation and nothing else, and they ran at least ten, and as much as a hundred times faster. In addition, they required only a fraction of the disk space. It is so pedestrian that I guarantee the exercise, in writing.
2.3 Limitation
The limitations, or rather the full extent of 5NF is:
it does not handle optional values, and Nulls have to be used (many designers disallow Nulls and use substitutes, but this has limitations if it not implemented properly and consistently)
you still need to change DDL in order to add or change columns (and there are more and more requirements to add columns that were not initially identified, after implementation; change control is onerous)
although providing the highest level of performance due to Normalisation (read: elimination of duplicates and confused relations), complex queries such as pivoting (producing a report of rows, or summaries of rows, expressed as columns) and "columnar access" as required for data warehouse operations, are difficult, and those operations only, do not perform well. Not that this is due only to the SQL skill level available, and not to the engine.
3 Sixth Normal Form
3.1 Definition
Sixth Normal Form is defined as:
the Relation (row) is the Primary Key plus at most one attribute (column)
It is known as the Irreducible Normal Form, the ultimate NF, because there is no further Normalisation that can be performed. Although it was discussed in academic circles in the mid nineties, it was formally declared only in 2003. For those who like denigrating the formality of the Relational Model, by confusing relations, relvars, "relationships", and the like: all that nonsense can be put to bed because formally, the above definition identifies the Irreducible Relation, sometimes called the Atomic Relation.
3.2 Progression
The increment that 6NF provides (that 5NF does not) is:
formal support for optional values, and thus, elimination of The Null Problem
a side effect is, columns can be added without DDL changes (more later)
effortless pivoting
simple and direct columnar access
it allows for (not in its vanilla form) an even greater level of performance in this department
Let me say that I (and others) were supplying enhanced 5NF tables 20 years ago, explicitly for pivoting, with no problem at all, and thus allowing (a) simple SQL to be used and (b) providing very high performance; it was nice to know that the academic giants of the industry had formally defined what we were doing. Overnight, my 5NF tables were renamed 6NF, without me lifting a finger. Second, we only did this where we needed it; again, it was the table, not the database, that was Normalised to 6NF.
3.3 SQL Limitation
It is a cumbersome language, particularly re joins, and doing anything moderately complex makes it very cumbersome. (It is a separate issue that most coders do not understand or use subqueries.) It supports the structures required for 5NF, but only just. For robust and stable implementations, one must implement additional standards, which may consist in part, of additional catalogue tables. The "use by" date for SQL had well and truly elapsed by the early nineties; it is totally devoid of any support for 6NF tables, and desperately in need of replacement. But that is all we have, so we need to just Deal With It.
For those of us who had been implementing standards and additional catalogue tables, it was not a serious effort to extend our catalogues to provide the capability required to support 6NF structures to standard: which columns belong to which tables, and in what order; mandatory/optional; display format; etc. Essentially a full MetaData catalogue, married to the SQL catalogue.
Note that each NF contains each previous NF within it, so 6NF contains 5NF. We did not break 5NF in order provide 6NF, we provided a progression from 5NF; and where SQL fell short we provided the catalogue. What this means is, basic constraints such as for Foreign Keys; and Value Domains which were provided via SQL Declarative Referential integrity; Datatypes; CHECKS; and RULES, at the 5NF level, remained intact, and these constraints were not subverted. The high quality and high performance of standard-compliant 5NF databases was not reduced in anyway by introducing 6NF.
3.4 Catalogue
It is important to shield the users (any report tool) and the developers, from having to deal with the jump from 5NF to 6NF (it is their job to be app coding geeks, it is my job to be the database geek). Even at 5NF, that was always a design goal for me: a properly Normalised database, with a minimal Data Directory, is in fact quite easy to use, and there was no way I was going to give that up. Keep in mind that due to normal maintenance and expansion, the 6NF structures change over time, new versions of the database are published at regular intervals. Without doubt, the SQL (already cumbersome at 5NF) required to construct a 5NF row from the 6NF tables, is even more cumbersome. Gratefully, that is completely unnecessary.
Since we already had our catalogue, which identified the full 6NF-DDL-that-SQL-does-not-provide, if you will, I wrote a small utility to read the catalogue and:
generate the 6NF table DDL.
generate 5NF VIEWS of the 6NF tables. This allowed the users to remain blissfully unaware of them, and gave them the same capability and performance as they had at 5NF
generate the full SQL (not a template, we have those separately) required to operate against the 6NF structures, which coders then use. They are released from the tedium and repetition which is otherwise demanded, and free to concentrate on the app logic.
I did not write an utility for Pivoting because the complexity present at 5NF is eliminated, and they are now dead simple to write, as with the 5NF-enhanced-for-pivoting. Besides, most report tools provide pivoting, so I only need to provide functions which comprise heavy churning of stats, which needs to be performed on the server before shipment to the client.
3.5 Performance
Everyone has their cross to bear; I happen to be obsessed with Performance. My 5NF databases performed well, so let me assure you that I ran far more benchmarks than were necessary, before placing anything in production. The 6NF database performed exactly the same as the 5NF database, no better, no worse. This is no surprise, because the only thing the 'complex" 6NF SQL does, that the 5NF SQL doesn't, is perform much more joins and subqueries.
You have to examine the myths.
Anyone who has benchmarked the issue (i.e examined the execution plans of queries) will know that Joins Cost Nothing, it is a compile-time resolution, they have no effect at execution time.
Yes, of course, the number of tables joined; the size of the tables being joined; whether indices can be used; the distribution of the keys being joined; etc, all cost something.
But the join itself costs nothing.
A query on five (larger) tables in a Unnormalised database is much slower than the equivalent query on ten (smaller) tables in the same database if it were Normalised. the point is, neither the four nor the nine Joins cost anything; they do not figure in the performance problem; the selected set on each Join does figure in it.
3.6 Benefit
Unrestricted columnar access. This is where 6NF really stands out. The straight columnar access was so fast that there was no need to export the data to a data warehouse in order to obtain speed from specialised DW structures.
My research into a few DWs, by no means complete, shows that they consistently store data by columns, as opposed to rows, which is exactly what 6NF does. I am conservative, so I am not about to make any declarations that 6NF will displace DWs, but in my case it eliminated the need for one.
It would not be fair to compare functions available in 6NF that were unavailable in 5NF (eg. Pivoting), which obviously ran much faster.
That was our first true 6NF database (with a full catalogue, etc; as opposed to the always 5NF with enhancements only as necessary; which later turned out to be 6NF), and the customer is very happy. Of course I was monitoring performance for some time after delivery, and I identified an even faster columnar access method for my next 6NF project. That, when I do it, might present a bit of competition for the DW market. The customer is not ready, and we do not fix that which is not broken.
3.7 What, Exactly, about 6NF, is "Bad" ?
Note that not everyone would approach the job with as much formality, structure, and adherence to standards. So it would be silly to conclude from our project, that all 6NF databases perform well, and are easy to maintain. It would be just as silly to conclude (from looking at the implementations of others) that all 6NF databases perform badly, are hard to maintain; disasters. As always, with any technical endeavour, the resulting performance and ease of maintenance are strictly dependent on formality, structure, and adherence to standards, in addition to the relevant skill set.
4 Entity Attribute Value
Disclosure: Experience. I have inspected a few of these, mostly hospital and medical systems. I have performed corrective assignments on two of them. The initial delivery by the overseas provider was quite adequate, although not great, but the extensions implemented by the local provider were a mess. But not nearly the disaster that people have posted about re EAV on this site. A few months intense work fixed them up nicely.
4.1 What It Is
It was obvious to me that the EAV implementations I have worked on are merely subsets of Sixth Normal Form. Those who implement EAV do so because they want some of the features of 6NF (eg. ability to add columns without DDL changes), but they do not have the academic knowledge to implement true 6NF, or the standards and structures to implement and administer it securely. Even the original provider did not know about 6NF, or that EAV was a subset of 6NF, but they readily agreed when I pointed it out to them. Because the structures required to provide EAV, and indeed 6NF, efficiently and effectively (catalogue; Views; automated code generation) are not formally identified in the EAV community, and are missing from most implementations, I classify EAV as the bastard son Sixth Normal Form.
4.2 What, Exactly, about EAV, is "Bad" ?
Going by the comments in this and other threads, yes, EAV done badly is a disaster. More important (a) they are so bad that the performance provided at 5NF (forget 6NF) is lost and (b) the ordinary isolation from the complexity has not been implemented (coders and users are "forced" to use cumbersome navigation). And if they did not implement a catalogue, all sorts of preventable errors will not have been prevented.
All that may well be true for bad (EAV or other) implementations, but it has nothing to do with 6NF or EAV. The two projects I worked had quite adequate performance (sure, it could be improved; but there was no bad performance due to EAV), and good isolation of complexity. Of course, they were nowhere near the quality or performance of my 5NF databases or my true 6NF database, but they were fair enough, given the level of understanding of the posted issues within the EAV community. They were not the disasters and sub-standard nonsense alleged to be EAV in these pages.
5 Nulls
There is a well-known and documented issue called The Null Problem. It is worthy of an essay by itself. For this post, suffice to say:
the problem is really the optional or missing value; here the consideration is table design such that there are no Nulls vs Nullable columns
actually it does not matter because, regardless of whether you use Nulls/No Nulls/6NF to exclude missing values, you will have to code for that, the problem precisely then, is handling missing values, which cannot be circumvented
except of course for pure 6NF, which eliminates the Null Problem
the coding to handle missing values remains
except, with automated generation of SQL code, heh heh
Nulls are bad news for performance, and many of us have decided decades ago not to allow Nulls in the database (Nulls in passed parameters and result sets, to indicate missing values, is fine)
which means a set of Null Substitutes and boolean columns to indicate missing values
Nulls cause otherwise fixed len columns to be variable len; variable len columns should never be used in indices, because a little 'unpacking' has to be performed on every access of every index entry, during traversal or dive.
6 Position
I am not a proponent of EAV or 6NF, I am a proponent of quality and standards. My position is:
Always, in all ways, do whatever you are doing to the highest standard that you are aware of.
Normalising to Third Normal Form is minimal for a Relational Database (5NF for me). DataTypes, Declarative referential Integrity, Transactions, Normalisation are all essential requirements of a database; if they are missing, it is not a database.
if you have to "denormalise for performance", you have made serious Normalisation errors, your design in not normalised. Period. Do not "denormalise", on the contrary, learn Normalisation and Normalise.
There is no need to do extra work. If your requirement can be fulfilled with 5NF, do not implement more. If you need Optional Values or ability to add columns without DDL changes or the complete elimination of the Null Problem, implement 6NF, only in those tables that need them.
If you do that, due only to the fact that SQL does not provide proper support for 6NF, you will need to implement:
a simple and effective catalogue (column mix-ups and data integrity loss are simply not acceptable)
5NF access for the 6NF tables, via VIEWS, to isolate the users (and developers) from the encumbered (not "complex") SQL
write or buy utilities, so that you can generate the cumbersome SQL to construct the 5NF rows from the 6NF tables, and avoid writing same
measure, monitor, diagnose, and improve. If you have a performance problem, you have made either (a) a Normalisation error or (b) a coding error. Period. Back up a few steps and fix it.
If you decide to go with EAV, recognise it for what it is, 6NF, and implement it properly, as above. If you do, you will have a successful project, guaranteed. If you do not, you will have a dog's breakfast, guaranteed.
6.1 There Ain't No Such Thing As A Free Lunch
That adage has been referred to, but actually it has been misused. The way it actually, deeply applies is as above: if you want the benefits of 6NF/EAV, you had better be willing too do the work required to obtain it (catalogue, standards). Of course, the corollary is, if you don't do the work, you won't get the benefit. There is no "loss" of Datatypes; value Domains; Foreign keys; Checks; Rules. Regarding performance, there is no performance penalty for 6NF/EAV, but there is always a substantial performance penalty for sub-standard work.
7 Specific Question
Finally. With due consideration to the context above, and that it is a small project with a small team, there is no question:
Do not use EAV (or 6NF for that matter)
Do not use Nulls or Nullable columns (unless you wish to subvert performance)
Do use a single Payment table for the common payment columns
and a child table for each PaymentType, each with its specific columns
All fully typecast and constrained.
What's this "another row_id" business ? Why do some of you stick an ID on everything that moves, without checking if it is a deer or an eagle ? No. The child is a dependent child. The Relation is 1::1. The PK of the child is the PK of the parent, the common Payment table. This is an ordinary Supertype-Subtype cluster, the Differentiator is PaymentTypeCode. Subtypes and supertypes are an ordinary part of the Relational Model, and fully catered for in the database, as well as in any good modelling tool.
Sure, people who have no knowledge of Relational databases think they invented it 30 years later, and give it funny new names. Or worse, they knowingly re-label it and claim it as their own. Until some poor sod, with a bit of education and professional pride, exposes the ignorance or the fraud. I do not know which one it is, but it is one of them; I am just stating facts, which are easy to confirm.
A. Responses to Comments
A.1 Attribution
I do not have personal or private or special definitions. All statements regarding the definition (such as imperatives) of:
Normalisation,
Normal Forms, and
the Relational Model.
.
refer to the many original texts By EF Codd and CJ Date (not available free on the web)
.
The latest being Temporal Data and The Relational Model by CJ Date, Hugh Darwen, Nikos A Lorentzos
.
and nothing but those texts
.
"I stand on the shoulders of giants"
.
The essence, the body, all statements regarding the implementation (eg. subjective, and first person) of the above are based on experience; implementing the above principles and concepts, as a commercial organisation (salaried consultant or running a consultancy), in large financial institutions in America and Australia, over 32 years.
This includes scores of large assignments correcting or replacing sub-standard or non-relational implementations.
.
The Null Problem vis-a-vis Sixth Normal Form
A freely available White Paper relating to the title (it does not define The Null Problem alone) can be found at:
http://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf.
.
A 'nutshell' definition of 6NF (meaningful to those experienced with the other NFs), can be found on p6
A.2 Supporting Evidence
As stated at the outset, the purpose of this post is to counter the misinformation that is rife in this community, as a service to the community.
Evidence supporting statements made re the implementation of the above principles, can be provided, if and when specific statements are identified; and to the same degree that the incorrect statements posted by others, to which this post is a response, is likewise evidenced. If there is going to be a bun fight, let's make sure the playing field is level
Here are a few docs that I can lay my hands on immediately.
a. Large Bank
This is the best example, as it was undertaken for explicitly the reasons in this post, and goals were realised. They had a budget for Sybase IQ (DW product) but the reports were so fast when we finished the project, they did not need it. The trade analytical stats were my 5NF plus pivoting extensions which turned out to be 6NF, described above. I think all the questions asked in the comments have been answered in the doc, except:
- number of rows:
- old database is unknown, but it can be extrapolated from the other stats
- new database = 20 tables over 100M, 4 tables over 10B.
b. Small Financial Institute Part A
Part B - The meat
Part C - Referenced Diagrams
Part D - Appendix, Audit of Indices Before/After (1 line per Index)
Note four docs; the fourth only for those who wish to inspect detailed Index changes. They were running a 3rd party app that could not be changed because the local supplier was out of business, plus 120% extensions which they could, but did not want to, change. We were called in because they upgraded to a new version of Sybase, which was much faster, which shifted the various performance thresholds, which caused large no of deadlocks. Here we Normalised absolutely everything in the server except the db model, with the goal (guaranteed beforehand) of eliminating deadlocks (sorry, I am not going to explain that here: people who argue about the "denormalisation" issue, will be in a pink fit about this one). It included a reversal of "splitting tables into an archive db for performance", which is the subject of another post (yes, the new single table performed faster than the two spilt ones). This exercise applies to MS SQL Server [insert rewrite version] as well.
c. Yale New Haven Hospital
That's Yale School of Medicine, their teaching hospital. This is a third-party app on top of Sybase. The problem with stats is, 80% of the time they were collecting snapshots at nominated test times only, but no consistent history, so there is no "before image" to compare our new consistent stats with. I do not know of any other company who can get Unix and Sybase internal stats on the same graphs, in an automated manner. Now the network is the threshold (which is a Good Thing).

Perhaps you should look this question
The accepted answer from Bill Karwin goes into specific arguments against the key/value table usually know as Entity Attribute Value (EAV)
.. Although many people seem to favor
EAV, I don't. It seems like the most
flexible solution, and therefore the
best. However, keep in mind the adage
TANSTAAFL. Here are some of the
disadvantages of EAV:
No way to make a column mandatory (equivalent of NOT NULL).
No way to use SQL data types to validate entries.
No way to ensure that attribute names are spelled consistently.
No way to put a foreign key on the values of any given attribute, e.g.
for a lookup table.
Fetching results in a conventional tabular layout is complex and
expensive, because to get attributes
from multiple rows you need to do
JOIN for each attribute.
The degree of flexibility EAV gives
you requires sacrifices in other
areas, probably making your code as
complex (or worse) than it would have
been to solve the original problem in
a more conventional way.
And in most cases, it's an unnecessary
to have that degree of flexibility.
In the OP's question about product
types, it's much simpler to create a
table per product type for
product-specific attributes, so you
have some consistent structure
enforced at least for entries of the
same product type.
I'd use EAV only if every row must
be permitted to potentially have a
distinct set of attributes. When you
have a finite set of product types,
EAV is overkill. Class Table
Inheritance would be my first choice.

My #1 principle is not to redesign something for no reason. So I would go with option 1 because that's your current design and it has a proven track record of working.
Spend the redesign time on new features instead.

If I were designing from scratch I would go with number two. It gives you the flexibility you need. However with number 1 already in place and working and this being soemting rather central to your whole app, i would probably be wary of making a major design change without a good idea of exactly what queries, stored procs, views, UDFs, reports, imports etc you would have to change. If it was something I could do with a relatively low risk (and agood testing alrady in place.) I might go for the change to solution 2 otherwise you might beintroducing new worse bugs.
Under no circumstances would I use an EAV table for something like this. They are horrible for querying and performance and the flexibility is way overrated (ask users if they prefer to be able to add new types 3-4 times a year without a program change at the cost of everyday performance).

At first sight, I would go for option 2 (or 3): when possible, generalize.
Option 4 is not very Relational I think, and will make your queries complex.
When confronted to those question, I generally confront those options with "use cases":
-how is design 2/3 behaving when do this or this operation ?

Related

Database normalization - who's right?

My professor (who claimed to have a firm understanding about systems development for many years) and I are arguing about the design of our database.
As an example:
My professor insists this design is right:
(list of columns)
Subject_ID
Description
Units_Lec
Units_Lab
Total_Units
etc...
Notice the total units column. He said that this column must be included.
I tried to explain that it is unnecessary, because if you want it, then just make a query by simply adding the two.
I showed him an example I found in a book, but he insists that I don't have to rely on books too much in making our system.
The same thing applies to similar cases as in this one:
student_ID
prelim_grade
midterm_grade
prefinal_grade
average
He wanted me to include the average! Anywhere I go, I can find myself reading articles that convince me that this is a violation of normalization. If I needed the average, I can easily compute the three grades. He enumerated some scenarios including ('Hey! What if the query has been accidentally deleted? What will you do? That is why you need to include it in your table!')
Do I need to reconstruct my database(which consists of about more than 40 tables) to comply with what he want? Am I wrong and just have overlooked these things?
Another thing is that he wanted to include the total amount in the payments table, which I believe is unnecessary. (Just compute the unit price of the product and the quantity.) He pointed out that we need that column for computing debits and/or credits that are critical for the overall system management, that it is needed for balancing transaction. Please tell me what you think.
You are absolutely correct! One of the rules of normalization is to reduce those attributes which can be easily deduced by using other attributes' values. ie, by performing some mathematical calculation. In your case, the total units column can be obtained by simply adding.
Tell your professor that having that particular column will show clear signs of transitive dependency and according to the 3rd normalization rule, its recommended to reduce those.
You are right when you say your solution is more normalized.
However, there is a thing called denormalization (google for it) which is about deliberately violating normalization rules to increase queries performance.
For instance you want to retrieve first five subjects (whatever the thing would be) ordered by decreasing number or total units.
You solution would require a full scan on two tables (subject and unit), joining the resultsets and sorting the output.
Your professor's solution would require just taking first five records from an index on total_units.
This of course comes at the price of increased maintenance cost (both in terms of computational resources and development).
I can't tell you who is "right" here: we know nothing about the project itself, data volumes, queries to be made etc. This is a decision which needs to be made for every project (and for some projects it may be a core decision).
The thing is that the professor does have a rationale for this requirement which may or may not be just.
Why he hasn't explained everything above to you himself, is another question.
In addition to redskins80's great answer I want to point out why this is a bad idea: Every time you need to update one of the source columns you need to update the calculated column as well. This is more work that can contain bugs easily (maybe 1 year later when a different programmer is altering the system).
Maybe you can use a computed column instead? That would be a workable middle-ground.
Edit: Denormalization has its place, but it is the last measure to take. It is like chemotherapy: The doctor injects you poison only to cure an even greater threat to your health. It is the last possible step.
Think it is important to add this because when you see the question the answer is not complete in my opinion. The original question has been answered well but there is a glitch here. So I take in account only the added question quoted below:
Another thing is that he wanted to include the total amount in the
payments table, which I believe is unnecessary(Just compute the unit
price of the product and the quantity.). He pointed out that we need
that column for computing debits and/or credits that are critical for
the overall system management, that it is needed for balancing
transaction. Please tell me what you think.
This edit is interesting. Based on the facts that this is a transactional system handling about money it has to be accountable. I take some basic terms: Transaction, product, price, amount.
In that sense it is very common or even required to denormalize. Why? Because you need it to be accountable. So when the transaction is registered that's it, it may never ever be modified. If you need to correct it then you make another transaction.
Now yes you can calculate for example product price * amount * taxes etc. That makes sense in normalization sense. But then you will need a complete lockdown of all related records. So take for example the products table: If you change the price before the transaction it should be taken into account when the transaction happens. But if the price changes afterwards it does not affect the transaction.
So it is not acceptable to just join transaction.product_id=products.id since that product might change. Example:
2012-01-01 price = 10
2012-01-05 price = 20
Transaction happens here, we sell 10 items so 10 * 20 = 200
2012-01-06 price = 22
Now we lookup the transaction at 2012-01-10, so we do:
SELECT
transactions.amount * products.price AS totalAmount
FROM transactions
INNER JOIN products on products.id=transactions.product_id
That would give 10 * 22 = 220 so it is not correct.
So you have 2 options:
Do not allow updates on the products table. So you make that table versioned, so for every record you add a new INSERT instead of update. So the transaction keeps pointing at the right version of the product.
Or you just add the fields to the transactions table. So add totalAmount to the transactions table and calculate it (in a database transaction) when the transaction is inserted and save it.
Yes, it is denormalized but it has a good reason, it makes it accountable. You just know and it's verified with transactions, locks etc. that the moment that transaction happened it related to the described product with the price = 20 etc.
Next to that, and that is just a nice thing of denormalization when you have to do that anyway, it is very easy to run reports. Total transaction amount of the month, year etc. It is all very easy to calculate.
Normalization has good things, for example no double storage, single point of edit etc. But in this case you just don't want that concept since that is not allowed and not preferred for a transactions log database.
See a transaction as a registration of something happened in real world. It happened, you wrote it down. Now you cannot change history, it was written as it was. Future won't change it, it happened.
If you want to implement the good, old, classic relational model, I think what you're doing is right.
In general, it's actually a matter of philosophy. Some systems, Oracle being an example, even allow you to give up the traditional, relational model in favor of objects, which (by being complex structures kept in tables) violate the 1st NF but give you the power of object-oriented model (you can use inheritance, override methods, etc.), which is pretty damn awesome in some cases. The language used is still SQL, only extended.
I know my answer drifts away from the subject (as we take into consideration a whole new database type) but I thought it's an interesting thing to share on the occasion of a pretty general question.
Database design for actual applications is hardly the question of what tables to make. Currently, there are countless possibilities when it comes to keeping and processing your data. There are relational systems we all know and love, object databases (like db4o), object-relational databases (not to be confused with object relational mapping, what I mean is tools like Oracle 11g with its objects), xml databases (take eXist), stream databases (like Esper) and the currently thriving noSQL databases (some insist they shouldn't be called databases) like MongoDB, Cassandra, CouchDB or Oracle NoSQL
In case of some of these, normalization loses its sense. Each model serves a completely different purpose. I think the term "database" has a much wider meaning than it used to.
When it comes to relational databases, I agree with you and not the professor (although I'm not sure if it's a good idea to oppose him to strongly).
Now, to the point. I think you might win him over by showing that you are open-minded and that you understand that there are many options to take into consideration (including his views) but that the situation requires you to normalize the data.
I know my answer is quite a stream of conscience for a stackoverflow post but I hope it's not received as lunatic babbling.
Good luck in the relational tug of war
You are talking about historical and financial data here. It is common to store some computations that will never change becasue that is the cost that was charged at the time. If you do the calc from product * price and the price changed 6 months after the transaction, then you have the incorrect value. Your professor is smart, listen to him. Further, if you do a lot of reporting off the database, you don't want to often calculate values that are not allowed to be changed without another record of data entry. Why perform calculations many times over the history of the application when you only need to do it once? That is wasteful of precious server resources.
The purpose of normalization is to eliminate redundancies so as to eliminate update anomalies, predominantly in transactional systems. Relational is still the best solution by far for transaction processing, DW, master data and many BI solutions. Most NOSQLs have low-integrity requirements. So you lose my tweet - annoying but not catastrophic. But to lose my million dollar stock trade is a big problem. The choice is not NOSQL vs. relational. NOSQL does certain things very well. But Relational is not going anywhere. It is still the best choice for transactional, update oriented solutions. The requirements for normalization can be loosened when the data is read-only or read-mostly. That's why redundancy is not such a huge problem in DW; there are no updates.

What are the pros and cons of Anchor Modeling? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am currently trying to create a database where a very large percentage of the data is temporal. After reading through many techniques for doing this (most involving 6nf normalization) I ran into Anchor Modeling.
The schema that I was developing strongly resembled the Anchor Modeling model, especially since the use case (Temporal Data + Known Unknowns) is so similar, that I am tempted to embrace it fully.
The two biggest problem I am having is that I can find nothing detailing the negatives of this approach, and I cannot find any references to organizations that have used it in production for war-stories and gotchas that I need to be aware of.
I am wondering if anyone here is familiar enough with to briefly expound on some of the negatives (since the positives are very well advertized in research papers and their site), and any experiences with using it in a production environment.
In reference to the anchormodeling.com
Here are a few points I am aware of
The number of DB-objects is simply too large to maintain manually, so make sure that you use designer all the time to evolve the schema.
Currently, designer supports fully MS SQL Server, so if you have to port code all the time, you may want to wait until your target DB is fully supported. I know it has Oracle in dropdown box, but ...
Do not expect (nor demand) your developers to understand it, they have to access the model via 5NF views -- which is good. The thing is that tables are loaded via (instead-of-) triggers on views, which may (or may not) be a performance issue.
Expect that you may need to write some extra maintenance procedures (for each temporal attribute) which are not auto-generated (yet). For example, I often need a prune procedure for temporal attributes -- to delete same-value-records for the same ID on two consecutive time-events.
Generated views and queries-over-views resolve nicely, and so will probably anything that you write in the future. However, "other people" will be writing queries on views-over-views-over-views -- which does not always resolve nicely. So expect that you may need to police queries more than usual.
Having sad all that, I have recently used the approach to refactor a section of my warehouse, and it worked like a charm. Admittedly, warehouse does not have most of the problems outlined here.
I would suggest that it is imperative to create a demo-system and test, test, test ..., especially point No 3 -- loading via triggers.
With respect to point number 4 above. Restatement control is almost finished, such that you will be able to prevent two consecutive identical values over time.
And a general comment, joins are not necessarily a bad thing. Read: Why joins are a good thing.
One of the great benefits of 6NF in Anchor Modeling is non-destructive schema evolution. In other words, every previous version of the database model is available as a subset in the current model. Also, since changes are represented by extensions in the schema (new tables), upgrading a database is almost instantanous and can safely be done online (even in a production environment). This benefit would be lost in 5NF.
I haven't read any papers on it, but since it's based on 6NF, I'd expect it to suffer from whatever problems follow 6NF.
6NF requires each table consist of a candidate key and no more than one non-key column. So, in the worst case, you'll need nine joins to produce a 10-column result set. But you can also design a database that uses, say, 200 tables that are in 5NF, 30 that are in BCNF, and only 5 that are in 6NF. (I think that would no longer be Anchor Modeling per se, which seems to put all tables in 6NF, but I could be wrong about that.)
The Mythical Man-Month is still relevant here.
The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. The only question is whether to plan in advance to build a throwaway, or to promise to deliver the throwaway to customers.
Fred Brooks, Jr., in The Mythical Man-Month, p 116.
How cheaply can you build a prototype to test your expected worst case?
In this post I will present a large part of the real business that belong to databases. Database's solutions in this big business area can not be solved by using „Anchor modeling" , at all.
In the real business world this case is happening on a daily basis. That is the case when data entry person, enters a wrong data.
In real-world business, errors happen frequently at data entry level. It often happens that data entry generates large amounts of erroneous data. So this is a real and big problem. "Anchor modeling" can not solve this problem.
Anyone who uses the "Anchor Modeling" database can enter incorrect data. This is possible because the authors of "Anchor modeling" have written that the erroneous data can be deleted.
Let me explain this problem by the following example:
A profesor of mathematics gave the best grade to the student who had the worst grade. In this high school, professors enter grades in the corressponding database. This student gave money to the professor for this criminal service. The student managed to enroll at the university using this false grade.
After a summer holiday, the professor of mathematics returned to school. After deleting the wrong grade from the database, the professor entered the correct one in the database. In this school they use "Anchor Modeling" db. So the math profesor deleted false data as it is strictly suggested by authors of "Anchor modeling".
Now, this professor of mathematics who did this criminal act is clean, thanks to the software "Anchor modeling".
This example says that using "Anchor Modeling," you can do crime with data just by applying „Anchor modeling technology“
In section 5.4 the authors of „Anchor modeling“ wrote the following: „Delete statements are allowed only when applied to remove erroneous data.“ .
You can see this text at the paper „ An agile modeling technique using sixth normal form for structurally evolving data“ written by authors of „Anchor modeling“.
Please note that „Anchor modeling“ was presented at the 28th International Conference on Conceptual Modeling and won the best paper award?!
Authors of "Anchor Modeling" claim that their data model can maintain a history! However this example shoes that „Anchor modeling“ can not maintain the history at all.
As „Anchor modeling“ allows deletion of data, then "Anchor modeling" has all the operations with the data, that is: adding new data, deleting data and update. Update can be obtained by using two operations: first delete the data, then add new data.
This further means that Anchor modeling has no history, because it has data deletion and data update.
I would like to point out that in "Anchor modeling" each erroneous data "MUST" be deleted. In the "Anchor modeling" it is not possible to keep erroneous data and corrected data.
"Anchor modeling" can not maintain history of erroneous data.
In the first part of this post, I showed that by using "Anchor Modeling" anyone can do crime with data. This means "Anchor Modeling" runs the business of a company, right into a disaster.
I will give one example so that professionals can see on real and important example, how bad "anchor modeling" is.
Example
People who are professionals in the business of databases, know that there are thousands and thousands of international standards, which have been used successfully in databases as keys.
International standards:
All professionals know what is "VIN" for cars, "ISBN" for books, and thousands of other international standards.
National standards:
All countries have their own standards for passports, personal documents, bank cards, bar codes, etc
Local standards:
Many companies have their own standards. For example, when you pay something, you have an invoice with a standard key written and that key is written in the database, also.
All the above mentioned type of keys from this example can be checked by using a variety of institutions, police, customs, banks credit card, post office, etc. You can check many of these "keys" on the internet or by using a phone.
I believe that percent of these databases, which have entities with standard keys, and which I have presented in this example, is more than 95%.
For all the above cases the "anchor surrogate key" is nonsense. "Anchor modeling" exclusively uses "anchor-surrogate key"
In my solution, I use all the keys that are standard on a global or local level and are simple.
Vladimir Odrljin

Hypertable v. HBase and BigTable v. SQL

As Hypertable and HBase seem to be the two major open source BigTable implementations, what are the major pros and cons between these two databases?
In addition, what are the major pros and cons between BigTable and SQL RDBMSes, and what significant differences can I expect between writing a project with a traditional RDBMS like Postgres and Hypertable?
At the risk of broadening your second question more than I should (I've never played with BigTable, but I've toyed with MongoDB and CouchDB)...
The most important difference, in so far as I've understood it anyway, is that RDBMS all use a row-based store, whereas NoSQL engines use a column-based store. The pros and cons mostly derives from this point.
http://en.wikipedia.org/wiki/Column-oriented_DBMS
The major consideration that I tend to keep in mind is ACID compliance: a NoSQL engine is eventually consistent, rather than always consistent. Think of it like a storage that behaves like a website's cache: the latter is normally valid and consistent, but occasionally slightly outdated/inconsistent.
There's no right or wrong here: for some use-cases (e.g. a search engine, a blog), slightly inconsistent is a very acceptable option; for others (e.g. a bank, a billing system) it is not. (I tend to work on stuff that needs atomicity.)
Then, there are plenty of performance considerations that break down to implementation details.
An immediate consequence of striving for eventual consistency is that integrity checks and so forth are typically done in the app rather than the data store (i.e. there are no triggers or stored procedures to speak of). Your data store ends up with less work to do, which results in obvious performance benefits.
A column-based store means that if you update a single column from your document, you only invalidate that column. A row-based store, by contrast, invalidates the entire row. Depending on how you typically update your data (i.e. just a few columns vs most of them), either approach can add up.
A flip side of a column-based store is that it makes joins trickier (from an implementation standpoint). In overly simplistic terms, think of it as having an EAV table per column; this works fine for a few tables. It's a different story if you need a big report that requires a dozen joins on sales or stocks (which a good RDBMS will handle just fine).
A more experienced user will hopefully chime in on NoSQL sharding and replication. On this I'd only feel comfortable enough to point out that Postgres has built-in replication features since 9.0 and is quite good at dealing with queries that span multiple partitions.
Anyway... To cut a very long story short: unless you already know that you'll need to instantly scale to petabytes and gazillions of requests in multitudes of data centers in your next project, I think the only consideration that you should have in mind when picking an SQL or NoSQL implementation is whether you absolutely need ACID compliance or not.
Lastly, if your main interest lies in trying a new toy, consider trying a graph-oriented database instead. These potentially combine the benefits of row- and column-based stores.

T-SQL database design and tables

I'd like to hear some opinions or discussion on a matter of database design. Me and my colleagues are developing a complex application in finance industry that is being installed in several countries.
Our contractors wanted us to keep a single application for all the countries so we naturally face the difficulties with different workflows in every one of them and try to make the application adjustable to satisfy various needs.
The issue I've encountered today was a request from the head of the IT department from the contractors side that we keep the database model in terms of tables and columns they consist of.
For examlpe, we got a table with different risks and we needed to add a flag column IsSomething (BIT NOT NULL ...). It fully qualifies to exists within the risk table according to the third normal form, no transitive dependency to the key, a non key value ...
BUT, the guy said that he wants to keep the tables as they are so we had to make a new table "riskinfo" and link the data 1:1 to the new column.
What is your opinion ?
We add columns to our tables that are referenced by a variety of apps all the time.
So long as the applications specifically reference the columns they want to use and you make sure the new fields are either nullable or have a sensible default defined so it doesn't interfere with inserts I don't see any real problem.
That said, if an app does a select * then proceeds to reference the columns by index rather than name you could produce issues in existing code. Personally I have confidence that nothing referencing our database does this because of our coding conventions (That and I suspect the code review process would lynch someone who tried it :P), but if you're not certain then there is at least some small risk to such a change.
In your actual scenario I'd go back to the contractor and give your reasons you don't think the change will cause any problems and ask the rationale behind their choice. Maybe they have some application-specific wisdom behind their suggestion, maybe just paranoia from dealing with other companies that change the database structure in ways that aren't backwards-compatible, or maybe it's just a policy at their company that got rubber-stamped long ago and nobody's challenged. Till you ask you never know.
This question is indeed subjective like what Binary Worrier commented. I do not have an answer nor any suggestion. Just sharing my 2 cents.
Do you know the rationale for those decisions? Sometimes good designs are compromised for the sake of not breaking currently working applications or simply for the fact that too much has been done based on the previous one. It could also be many other non-technical reasons.
Very often, the programming community is unreasonably concerned about the ripple effect that results from redefining tables. Usually, this is a result of failure to understand data independence, and failure to guard the data independence of their operations on the data. Occasionally, the original database designer is at fault.
Most object oriented programmers understand encapsulation better than I do. But these same experts typically don't understand squat about data independence. And anyone who has learned how to operate on an SQL database, but never learned the concept of data independence is dangerously ignorant. The superficial aspects of data independence can be learned in about five minutes. But to really learn it takes time and effort.
Other responders have mentioned queries that use "select *". A select with a wildcard is more data dependent than the same select that lists the names of all the columns in the table. This is just one example among dozens.
The thing is, both data independence and encapsulation pursue the same goal: containing the unintended consequences of a change in the model.
Here's how to keep your IT chief happy. Define a new table with a new name that contains all the columns from the old table, and also all the additional columns that are now necessary. Create a view, with the same name as the old table, that contains precisely the same columns, and in the same order, that the old table had. Typically, this view will show all the rows in the old table, and the old PK will still guarantee uniqueness.
Once in a while, this will fail to meet all of the IT chief's needs. And if the IT chief is really saying "I don't understand databases; so don't change anything" then you are up the creek until the IT chief changes or gets changed.

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLlite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLlite files are entire databases which our user would like to query across. There are two ways (we see) of implementing this, one, to create a single database and a backend in Python that appends tables from each database to the master database, and two, querying across separate databases' tables and unifying the results in Python.
Both methods run into trouble when the black box produces alters its table structures, say for example renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables, in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, a new column is going to have to be determined whether it is a fact or a dimensional attribute, and then if it is a dimensional attribute, which dimension table it is best assigned to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating somethings like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually more easy to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick and dirty solution not a 'generic' solution because generic solutions like the entity attribute value model often have a bad performance. Don't do client side joining (unifying the results) inside your Python code because that is very slow. Use SQL for joining, it is designed for that purpose. Users can also make their own reports with all kind of reporting tools that generate sql statements. You don't have to do everything in your app, just start with solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes you can define views for backward compatibility that keeps your app functioning.
Maybe the scientific software will add a lot of new features and maybe it will change its datamodel because of those new features..? That is possible but then you will have to change your application anyways to take profit from those new features.
It sounds to me as if your problem isn't really about MySQL or SQLlite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal wth it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it was up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe, some design like star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.