Database design: how to avoid serialization when data structure is not static - sql

I've recently been confronted with the need to design a database. Since this is my first time, I thought I'd better ask for some advice to make sure I'm building on solid foundations.
Goal
I'd like to store objects (POD structures best thought of as multi-maps) in an
SQL database for storage and querying. The objects' contents as well as their 'structure' are continuously modified. The database will be accessed intensively through both queries and updates.
Use Case
First, each object should have a unique identifier.
Second, different types of objects exist. For example, ObjectA is an instance of ClassA. ClassA can have attributes A1, A2, A3, etc. As a result, ObjectA can (but isn't required to; NULL is allowed) have values for these attributes. However, each of these attributes may have more than one value, i.e., ObjectA.A1="foo" and ObjectA.A1="bar" are both possible. The number of attributes of ClassA can change. For simplicity's sake, attributes can only be added, not removed.
Third, attributes are not specific to one class, i.e., objects of ClassB can also have attributes A1, A2, etc. Thus ObjectB.A1="foo" is also possible. I'm not sure whether this changes anything, but I have a feeling it might in a design where each attribute corresponds to a table.
Finally, the following pseudo-queries and actions must be supported (a rough SQL sketch of these follows the list):
Get all the objects of type ClassA with attribute A1 equal to "bar".
Get all the attributes of ObjectB.
Add an attribute A4 to objects of type ClassA.
Add an object of type ClassC which has attributes A1="foobar", A2="bar".
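For concreteness, here is roughly how those pseudo-queries might read in SQL over a generic object/attribute/value layout (a sketch only; every table and column name below is illustrative, not a prescribed design):

CREATE TABLE class      (class_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE object     (object_id INTEGER PRIMARY KEY, class_id INTEGER NOT NULL REFERENCES class(class_id));
CREATE TABLE attribute  (attr_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- multi-map: several rows per (object, attribute) pair are allowed
CREATE TABLE attr_value (
    object_id INTEGER NOT NULL REFERENCES object(object_id),
    attr_id   INTEGER NOT NULL REFERENCES attribute(attr_id),
    value     TEXT
);

-- "Get all the objects of type ClassA with attribute A1 equal to 'bar'"
SELECT o.object_id
FROM object o
JOIN class c      ON c.class_id  = o.class_id
JOIN attr_value v ON v.object_id = o.object_id
JOIN attribute a  ON a.attr_id   = v.attr_id
WHERE c.name = 'ClassA' AND a.name = 'A1' AND v.value = 'bar';

-- "Get all the attributes of ObjectB" (:object_b is a bound parameter)
SELECT a.name, v.value
FROM attr_value v
JOIN attribute a ON a.attr_id = v.attr_id
WHERE v.object_id = :object_b;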
Limitations
First, I want to avoid serializing the data, so multiple values in a single column are out of the question. The database should be normalized and the data structures should be atomic. The database will be queried very frequently, so I cannot afford to waste time implementing a complex query mechanism; I would end up re-inventing the wheel (and probably a square one at that).
Second, I cannot use any prior knowledge of an object's internal structure, as this only becomes available at run-time. For example, in the use case above, the attributes are not known beforehand. So while I have thought of a design where each attribute is a table, I cannot figure out how to get all the attributes of an object in such a set-up.
Environment
I'm using SQLite 3.7, C++.
Question
What would be an appropriate, flexible database design that meets the requirements of the described problem?
Any help, pointers or tips leading to useful insights or a solid design are very welcome.
Thanks!
ps: I have only basic theoretical familiarity and limited practical experience with relational databases, certainly no prior professional experience. I have been reading up on the subject the past week and have grasped some of the concepts which I think will be relevant to my case (normalization, foreign keys, etc), but I'm still going through my book at this moment.

If this is your first time out, and your project is as significant as it seems, you might want to invest the time and effort to learn the fundamentals from the ground up. CJ Date and many other authors have books and online tutorials that can take you through the fundamentals. They are excellent works.
There are some fields within IT that are dominated by almost complete adhocracy. Not so database design. To begin with, EF Codd laid the groundwork on a very solid mathematical basis some 42 years ago, and the basic model has held up very well over time. There has been progress, but almost no backtracking. And very little change for the sake of change.
SQL has likewise enjoyed a lot of stability over its long lifespan.
Next, trial and error in database design can be enormously costly. There are dozens of cases where unfortunate choices made by newbies have ended up costing millions in data investments that didn't pan out.
Trial and error has its place. Tips and tricks have their place. Answers on SO have their place. But so does formal learning.

Related

Organizing interconnected objects

This is a generic question; I don't know whether it belongs on Programmers or Stack Overflow.
I'm writing a little simulation. Without going very deep into its details, consider that many kinds of entities are involved. They correspond to objects, since I'm using an OOP language.
There are Guys that inhabit the simulated world
There are Maps
A map has many Lots, that are pieces of land with some characteristics
There are Tribes (guys belong to tribes)
There is a generic class called Position to locate the elements
There are Bots in control of tribes that move guys around
There is a World that represents the simulated world
and so on.
If the simulated world were laid down as a database, the objects would be tables with lots of references, but in memory I have to use a different strategy. So, for example, a Tribe has an array of Guys as a property; the World has an array of Bots, of Tribes, and of Maps. A Map has a Dictionary whose key is a Position and whose value is a Lot. A Guy has a Position that is where he stands.
The way I lay down such connections is pretty much arbitrary. For example, I could have an array of Guys in the World, or an Array of guys per Lot (the guys standing on a piece of land), or an array of Guys per Bot (with the Guys controlled by the bot).
Doing so, I also have to pass around a lot of objects. For example, a Bot must have information about the Map and opponent Guys to decide how to move its Guys.
As said, in a database I'd have a Guys table connected to the Lots table (indicating its position), to the Tribe table (indicating which Tribe it belongs to) and so it would also be easy to query "All the guys in Position [1, 5]". "All the Guys of Tribe 123". "All the Guys controlled by Bot B standing on the Lot b34 not belonging to the Tribe 456" and so on.
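Those queries really are one-liners in SQL; a quick sketch with hypothetical table and column names (guy, lot, x, y, etc. are all assumptions for illustration):

-- hypothetical schema: guy(guy_id, tribe_id, bot_id, lot_id), lot(lot_id, x, y)
SELECT g.* FROM guy g JOIN lot l ON l.lot_id = g.lot_id WHERE l.x = 1 AND l.y = 5;
SELECT * FROM guy WHERE tribe_id = 123;
SELECT g.* FROM guy g
WHERE g.bot_id = 'B' AND g.lot_id = 'b34' AND g.tribe_id <> 456;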
I've worked with APIs where, to get the simplest information, you had to make an instance of the CustomerContextCollection and pass it to CustomerQueryFactory to get back a CustomerInPlaceQuery to... When people criticize OOP and cite verbose abstractions that soon smell ridiculous, that's what I mean. I want to avoid such things and having to rely on deep abstractions and (anti-pattern) abstract contexts.
The question is: what is the preferred, clean way to manage entities and collections of entities that are deeply linked in multiple ways?
It depends on your definition of "clean". In my case, I define clean as: I can implement desired behavior in an obvious, efficient manner.
Building OOP software is not a data modeling exercise. I'd suggest stepping back a little. What does each one of those objects actually do? What methods are you going to implement?
Just because "guys are in a lot" doesn't mean that the lot object needs a collection of guys; it only needs one if there are operations on a lot that affect all the guys in it. And even then, it doesn't necessarily need a collection of guys - it needs a way to get the guys in the lot. This may be an internally stored collection, but it could also be a simple method that calls back into the world to find guys matching a criteria. The implementation of that lookup should be transparent to anyone.
From the tenor of your questions, it seems like you're thinking of this from a "how do I generate reports" perspective. Step back and think of the behaviors you're trying to implement first.
Another thing I find extremely valuable is to differentiate between Entities and Values. Entities are objects where identity matters - you may have two guys, both named "Chris", but they are two different objects and remain distinct despite having the same "key". Values, on the other hand, act like ints. From your above list, Position sounds a lot like a value - Position(0,0) is Position(0,0) regardless of which chunk of memory (identity) those bits are stored in. The distinction has a big effect on how you compare and store values vs. entities. For example, your Guy objects (entities) would store their Position as a simple member variable.
I've found a great reference for how to think about such things is Eric Evans's "Domain Driven Design" book. He's focused on business systems, but I've found the discussions very valuable for thinking about how to build OO systems in general.
I would say that no 'true' answer exists to your core question -- a best way to manage collections of entities that are linked in multiple ways. It really depends on the kind of application (simulation) - here are some thoughts:
Is execution time important?
If this is the case, there is really no way around analyzing how your simulator will iterate over (query) the objects from the pool: sketch out the basic simulation loop and check what kinds of events will require iterating over what kinds of model entities (I assume you are developing a discrete-event simulation?). Then you should organize the data structures in a way that optimizes the most frequent/time-consuming events (as opposed to "laying down the connections arbitrarily"). Additionally, you may want to use special data structures (such as k-d trees) to organize entities with properties that you need to query often (e.g., position data). For some typical problems, e.g. collision detection, there is also a whole range of approaches for solving them efficiently (so look for suitable libraries/frameworks, e.g. for multi-agent simulation).
How flexible do you want to make it?
If you really want to make it super-flexible and really don't want to decide on the hierarchy of the model entities, why not just use an in-memory database? As you already said, databases are easily applicable to your problem (and you can easily save the model state, which may also be useful).
How clean is clean enough?
If you want to be absolutely sure that the rest of your simulator is not affected by the design choices you make with regard to your model representation, hide it behind an interface (say, ModelWorld) which defines methods for all the types of queries your simulator may invoke (this is orthogonal to the second point and may help with the first point, i.e. figuring out what kind of access pattern your simulator exhibits). This allows you to change implementations easily, without affecting any other parts of the simulator code.

What are the [dis]advantages of using a key/value table over nullable columns or separate tables? [duplicate]

This question already has answers here:
How to design a product table for many kinds of product where each product has many parameters
(4 answers)
Closed 1 year ago.
I'm upgrading a payment management system I created a while ago. It currently has one table for each payment type it can accept. It is limited to only being able to pay for one thing, which this upgrade is to alleviate. I've been asking for suggestions as to how I should design it, and I have these basic ideas to work from (option 4 is sketched in SQL just after the list):
Have one table for each payment type, with a few common columns on each. (current design)
Coordinate all payments with a central table that takes on the common columns (unifying payment IDs regardless of type), and identifies another table and row ID that has columns specialized to that payment type.
Have one table for all payment types, and null the columns which are not used for any given type.
Use the central table idea, but store specialized columns in a key/value table.
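For reference, option 4 might look roughly like this (a sketch only; all names are made up):

-- central table with the common columns
CREATE TABLE payment (
    payment_id   INT PRIMARY KEY,
    payment_type VARCHAR(20)   NOT NULL,
    amount       DECIMAL(10,2) NOT NULL
);
-- one key/value row per specialized "column"
CREATE TABLE payment_attribute (
    payment_id INT         NOT NULL REFERENCES payment (payment_id),
    name       VARCHAR(64) NOT NULL,
    value      VARCHAR(255),           -- everything is a string; no per-type checks
    PRIMARY KEY (payment_id, name)
);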
My goals for this are: not ridiculously slow, self-documenting as much as possible, and maximizing flexibility while maintaining the other goals.
I don't like 1 very much because of the duplicate columns in each table. It reflects the payment type classes inheriting a base class that provides functionality for all payment types... ORM in reverse?
I'm leaning toward 2 the most, because it's just as "type safe" and self-documenting as the current design. But, as with 1, to add a new payment type, I need to add a new table.
I don't like 3 because of its "wasted space", and it's not immediately clear which columns are used for which payment types. Documentation can alleviate the pain of this somewhat, but my company's internal tools do not have an effective method for storing/finding technical documentation.
The argument I was given for 4 was that it would alleviate needing to change the database when adding a new payment method, but it suffers even worse than 3 does from the lack of explicitness. Currently, changing the database isn't a problem, but it could become a logistical nightmare if we decide to start letting customers keep their own database down the road.
So, of course I have my biases. Does anyone have any better ideas? Which design do you think fits best? What criteria should I base my decision on?
Note
This subject is being discussed, and this thread is being referenced in other threads; I have therefore given it a reasonable treatment. Please bear with me. My intention is to provide understanding, so that you can make informed decisions, rather than simplistic ones based merely on labels. If you find it intense, read it in chunks, at your leisure; come back when you are hungry, and not before.
What, exactly, about EAV, is "Bad" ?
1 Introduction
There is a difference between EAV (Entity-Attribute-Value Model) done properly, and done badly, just as there is a difference between 3NF done properly and done badly. In our technical work, we need to be precise about exactly what works, and what does not; about what performs well, and what doesn't. Blanket statements are dangerous, misinform people, and thus hinder progress and universal understanding of the issues concerned.
I am not for or against anything, except poor implementations by unskilled workers, and misrepresenting the level of compliance to standards. And where I see misunderstanding, as here, I will attempt to address it.
Normalisation is also often misunderstood, so a word on that. Wikipedia and other free sources actually post completely nonsensical "definitions" that have no academic basis, and that carry vendor biases so as to validate their non-standard-compliant products. The authoritative definitions are in Codd's published work. I implement a minimum of 5NF, which is more than enough for most requirements, so I will use that as a baseline. Simply put, assuming Third Normal Form is understood by the reader (at least that definition is not confused) ...
2 Fifth Normal Form
2.1 Definition
Fifth Normal Form is defined as:
every column has a 1::1 relation with the Primary Key, only
and to no other column, in the table, or in any other table
the result is no duplicated columns, anywhere; No Update Anomalies (no need for triggers or complex code to ensure that, when a column is updated, its duplicates are updated correctly).
it improves performance because (a) it affects fewer rows and (b) improves concurrency due to reduced locking
I make the distinction that, it is not that a database is Normalised to a particular NF or not; the database is simply Normalised. It is that each table is Normalised to a particular NF: some tables may only require 1NF, others 3NF, and yet others require 5NF.
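A small illustration of the duplicated-column point (my example, not part of the original answer): storing a customer's name on every order row gives that column a 1::1 relation with the customer, not with the order's Primary Key, so every rename touches many rows (an Update Anomaly).

-- violates the rule: customer_name depends on customer_id, not on order_id
CREATE TABLE orders_bad (
    order_id      INT PRIMARY KEY,
    customer_id   INT NOT NULL,
    customer_name VARCHAR(100)
);
-- corrected: the name is stored once, keyed by its own Primary Key
CREATE TABLE customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customer (customer_id)
);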
2.2 Performance
There was a time when people thought that Normalisation did not provide performance, and they had to "denormalise for performance". Thank God that myth has been debunked, and most IT professionals today realise that Normalised databases perform better. The database vendors optimise for Normalised databases, not for denormalised file systems. The truth about most "denormalised" databases is that they were NOT normalised in the first place (and they performed badly); they were unnormalised, and someone did some further scrambling to improve performance. In order to be Denormalised, a database has to be faithfully Normalised first, and that never took place. I have rewritten scores of such "denormalised for performance" databases, providing faithful Normalisation and nothing else, and they ran at least ten, and as much as a hundred, times faster. In addition, they required only a fraction of the disk space. It is so pedestrian that I guarantee the exercise, in writing.
2.3 Limitation
The limitations, or rather the full extent of 5NF is:
it does not handle optional values, and Nulls have to be used (many designers disallow Nulls and use substitutes, but this has limitations if it is not implemented properly and consistently)
you still need to change DDL in order to add or change columns (and there are more and more requirements to add columns that were not initially identified, after implementation; change control is onerous)
although providing the highest level of performance due to Normalisation (read: elimination of duplicates and confused relations), complex queries such as pivoting (producing a report of rows, or summaries of rows, expressed as columns) and "columnar access" as required for data warehouse operations are difficult, and those operations only do not perform well. Note that this is due only to the SQL skill level available, and not to the engine.
3 Sixth Normal Form
3.1 Definition
Sixth Normal Form is defined as:
the Relation (row) is the Primary Key plus at most one attribute (column)
It is known as the Irreducible Normal Form, the ultimate NF, because there is no further Normalisation that can be performed. Although it was discussed in academic circles in the mid nineties, it was formally declared only in 2003. For those who like denigrating the formality of the Relational Model, by confusing relations, relvars, "relationships", and the like: all that nonsense can be put to bed because formally, the above definition identifies the Irreducible Relation, sometimes called the Atomic Relation.
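To make the definition concrete, here is a hypothetical decomposition (my sketch, not the author's): a table with two optional columns splits into one table per attribute, and the absence of a row replaces a Null.

-- 5NF: optional attributes force Nullable columns
CREATE TABLE person_5nf (
    person_id  INT PRIMARY KEY,
    name       VARCHAR(60) NOT NULL,
    birth_date DATE,                  -- Nullable
    salary     DECIMAL(10,2)          -- Nullable
);
-- 6NF: the Primary Key plus at most one attribute per table; no Nulls anywhere
CREATE TABLE person        (person_id INT PRIMARY KEY);
CREATE TABLE person_name   (person_id INT PRIMARY KEY REFERENCES person, name VARCHAR(60) NOT NULL);
CREATE TABLE person_birth  (person_id INT PRIMARY KEY REFERENCES person, birth_date DATE NOT NULL);
CREATE TABLE person_salary (person_id INT PRIMARY KEY REFERENCES person, salary DECIMAL(10,2) NOT NULL);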
3.2 Progression
The increment that 6NF provides (that 5NF does not) is:
formal support for optional values, and thus, elimination of The Null Problem
a side effect is, columns can be added without DDL changes (more later)
effortless pivoting
simple and direct columnar access
it allows for (not in its vanilla form) an even greater level of performance in this department
Let me say that I (and others) were supplying enhanced 5NF tables 20 years ago, explicitly for pivoting, with no problem at all, and thus allowing (a) simple SQL to be used and (b) providing very high performance; it was nice to know that the academic giants of the industry had formally defined what we were doing. Overnight, my 5NF tables were renamed 6NF, without me lifting a finger. Second, we only did this where we needed it; again, it was the table, not the database, that was Normalised to 6NF.
3.3 SQL Limitation
SQL is a cumbersome language, particularly re joins, and doing anything moderately complex makes it very cumbersome. (It is a separate issue that most coders do not understand or use subqueries.) It supports the structures required for 5NF, but only just. For robust and stable implementations, one must implement additional standards, which may consist, in part, of additional catalogue tables. The "use by" date for SQL had well and truly elapsed by the early nineties; it is totally devoid of any support for 6NF tables, and desperately in need of replacement. But that is all we have, so we need to just Deal With It.
For those of us who had been implementing standards and additional catalogue tables, it was not a serious effort to extend our catalogues to provide the capability required to support 6NF structures to standard: which columns belong to which tables, and in what order; mandatory/optional; display format; etc. Essentially a full MetaData catalogue, married to the SQL catalogue.
Note that each NF contains each previous NF within it, so 6NF contains 5NF. We did not break 5NF in order to provide 6NF; we provided a progression from 5NF, and where SQL fell short we provided the catalogue. What this means is, basic constraints such as Foreign Keys, and Value Domains which were provided via SQL Declarative Referential Integrity, Datatypes, CHECKs, and RULEs at the 5NF level, remained intact; these constraints were not subverted. The high quality and high performance of standard-compliant 5NF databases was not reduced in any way by introducing 6NF.
3.4 Catalogue
It is important to shield the users (any report tool) and the developers, from having to deal with the jump from 5NF to 6NF (it is their job to be app coding geeks, it is my job to be the database geek). Even at 5NF, that was always a design goal for me: a properly Normalised database, with a minimal Data Directory, is in fact quite easy to use, and there was no way I was going to give that up. Keep in mind that due to normal maintenance and expansion, the 6NF structures change over time, new versions of the database are published at regular intervals. Without doubt, the SQL (already cumbersome at 5NF) required to construct a 5NF row from the 6NF tables, is even more cumbersome. Gratefully, that is completely unnecessary.
Since we already had our catalogue, which identified the full 6NF-DDL-that-SQL-does-not-provide, if you will, I wrote a small utility to read the catalogue and (the second item is sketched after the list):
generate the 6NF table DDL.
generate 5NF VIEWS of the 6NF tables. This allowed the users to remain blissfully unaware of them, and gave them the same capability and performance as they had at 5NF
generate the full SQL (not a template, we have those separately) required to operate against the 6NF structures, which coders then use. They are released from the tedium and repetition which is otherwise demanded, and free to concentrate on the app logic.
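That second item is what keeps everyone working at the 5NF level; a generated view might look roughly like this (my sketch, reusing the hypothetical person tables from section 3.1):

CREATE VIEW person_v AS
SELECT p.person_id,
       n.name,
       b.birth_date,
       s.salary
FROM person p
LEFT JOIN person_name   n ON n.person_id = p.person_id
LEFT JOIN person_birth  b ON b.person_id = p.person_id
LEFT JOIN person_salary s ON s.person_id = p.person_id;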
I did not write a utility for Pivoting, because the complexity present at 5NF is eliminated, and pivots are now dead simple to write, as with the 5NF-enhanced-for-pivoting tables. Besides, most report tools provide pivoting, so I only need to provide functions which comprise heavy churning of stats, which needs to be performed on the server before shipment to the client.
3.5 Performance
Everyone has their cross to bear; I happen to be obsessed with Performance. My 5NF databases performed well, so let me assure you that I ran far more benchmarks than were necessary before placing anything in production. The 6NF database performed exactly the same as the 5NF database, no better, no worse. This is no surprise, because the only thing the "complex" 6NF SQL does, that the 5NF SQL doesn't, is perform many more joins and subqueries.
You have to examine the myths.
Anyone who has benchmarked the issue (i.e., examined the execution plans of queries) will know that Joins Cost Nothing; it is a compile-time resolution, and they have no effect at execution time.
Yes, of course, the number of tables joined; the size of the tables being joined; whether indices can be used; the distribution of the keys being joined; etc, all cost something.
But the join itself costs nothing.
A query on five (larger) tables in an Unnormalised database is much slower than the equivalent query on ten (smaller) tables in the same database if it were Normalised. The point is, neither the four nor the nine Joins cost anything; they do not figure in the performance problem; the selected set on each Join does figure in it.
3.6 Benefit
Unrestricted columnar access. This is where 6NF really stands out. The straight columnar access was so fast that there was no need to export the data to a data warehouse in order to obtain speed from specialised DW structures.
My research into a few DWs, by no means complete, shows that they consistently store data by columns, as opposed to rows, which is exactly what 6NF does. I am conservative, so I am not about to make any declarations that 6NF will displace DWs, but in my case it eliminated the need for one.
It would not be fair to compare functions available in 6NF that were unavailable in 5NF (eg. Pivoting), which obviously ran much faster.
That was our first true 6NF database (with a full catalogue, etc., as opposed to the usual 5NF with enhancements only as necessary, which later turned out to be 6NF), and the customer is very happy. Of course I was monitoring performance for some time after delivery, and I identified an even faster columnar access method for my next 6NF project. That, when I do it, might present a bit of competition for the DW market. The customer is not ready, and we do not fix that which is not broken.
3.7 What, Exactly, about 6NF, is "Bad" ?
Note that not everyone would approach the job with as much formality, structure, and adherence to standards. So it would be silly to conclude from our project, that all 6NF databases perform well, and are easy to maintain. It would be just as silly to conclude (from looking at the implementations of others) that all 6NF databases perform badly, are hard to maintain; disasters. As always, with any technical endeavour, the resulting performance and ease of maintenance are strictly dependent on formality, structure, and adherence to standards, in addition to the relevant skill set.
4 Entity Attribute Value
Disclosure: Experience. I have inspected a few of these, mostly hospital and medical systems. I have performed corrective assignments on two of them. The initial delivery by the overseas provider was quite adequate, although not great, but the extensions implemented by the local provider were a mess. But not nearly the disaster that people have posted about re EAV on this site. A few months intense work fixed them up nicely.
4.1 What It Is
It was obvious to me that the EAV implementations I have worked on are merely subsets of Sixth Normal Form. Those who implement EAV do so because they want some of the features of 6NF (eg. ability to add columns without DDL changes), but they do not have the academic knowledge to implement true 6NF, or the standards and structures to implement and administer it securely. Even the original provider did not know about 6NF, or that EAV was a subset of 6NF, but they readily agreed when I pointed it out to them. Because the structures required to provide EAV, and indeed 6NF, efficiently and effectively (catalogue; Views; automated code generation) are not formally identified in the EAV community, and are missing from most implementations, I classify EAV as the bastard son of Sixth Normal Form.
4.2 What, Exactly, about EAV, is "Bad" ?
Going by the comments in this and other threads, yes, EAV done badly is a disaster. More importantly, (a) such implementations are so bad that the performance provided at 5NF (forget 6NF) is lost, and (b) the ordinary isolation from the complexity has not been implemented (coders and users are "forced" to use cumbersome navigation). And if they did not implement a catalogue, all sorts of preventable errors will not have been prevented.
All that may well be true for bad (EAV or other) implementations, but it has nothing to do with 6NF or EAV. The two projects I worked on had quite adequate performance (sure, it could be improved; but there was no bad performance due to EAV), and good isolation of complexity. Of course, they were nowhere near the quality or performance of my 5NF databases or my true 6NF database, but they were fair enough, given the level of understanding of the posted issues within the EAV community. They were not the disasters and sub-standard nonsense alleged to be EAV in these pages.
5 Nulls
There is a well-known and documented issue called The Null Problem. It is worthy of an essay by itself. For this post, suffice to say:
the problem is really the optional or missing value; here the consideration is table design such that there are no Nulls vs Nullable columns
actually it does not matter, because regardless of whether you use Nulls/No Nulls/6NF to exclude missing values, you will have to code for that; the problem, precisely, is handling missing values, which cannot be circumvented
except of course for pure 6NF, which eliminates the Null Problem
the coding to handle missing values remains
except, with automated generation of SQL code, heh heh
Nulls are bad news for performance, and many of us have decided decades ago not to allow Nulls in the database (Nulls in passed parameters and result sets, to indicate missing values, is fine)
which means a set of Null Substitutes and boolean columns to indicate missing values (a small sketch follows this list)
Nulls cause otherwise fixed len columns to be variable len; variable len columns should never be used in indices, because a little 'unpacking' has to be performed on every access of every index entry, during traversal or dive.
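A sketch of that substitute pattern (my illustration; the sentinel date is arbitrary):

CREATE TABLE employee (
    employee_id      INT     PRIMARY KEY,
    termination_date DATE    NOT NULL DEFAULT '9999-12-31',  -- Null Substitute: column stays fixed len and indexable
    is_terminated    CHAR(1) NOT NULL DEFAULT 'N'            -- boolean column flags the missing value
);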
6 Position
I am not a proponent of EAV or 6NF, I am a proponent of quality and standards. My position is:
Always, in all ways, do whatever you are doing to the highest standard that you are aware of.
Normalising to Third Normal Form is minimal for a Relational Database (5NF for me). DataTypes, Declarative referential Integrity, Transactions, Normalisation are all essential requirements of a database; if they are missing, it is not a database.
if you have to "denormalise for performance", you have made serious Normalisation errors; your design is not normalised. Period. Do not "denormalise"; on the contrary, learn Normalisation and Normalise.
There is no need to do extra work. If your requirement can be fulfilled with 5NF, do not implement more. If you need Optional Values or ability to add columns without DDL changes or the complete elimination of the Null Problem, implement 6NF, only in those tables that need them.
If you do that, due only to the fact that SQL does not provide proper support for 6NF, you will need to implement:
a simple and effective catalogue (column mix-ups and data integrity loss are simply not acceptable)
5NF access for the 6NF tables, via VIEWS, to isolate the users (and developers) from the encumbered (not "complex") SQL
write or buy utilities, so that you can generate the cumbersome SQL to construct the 5NF rows from the 6NF tables, and avoid writing same
measure, monitor, diagnose, and improve. If you have a performance problem, you have made either (a) a Normalisation error or (b) a coding error. Period. Back up a few steps and fix it.
If you decide to go with EAV, recognise it for what it is, 6NF, and implement it properly, as above. If you do, you will have a successful project, guaranteed. If you do not, you will have a dog's breakfast, guaranteed.
6.1 There Ain't No Such Thing As A Free Lunch
That adage has been referred to, but actually it has been misused. The way it actually, deeply, applies is as above: if you want the benefits of 6NF/EAV, you had better be willing to do the work required to obtain it (catalogue, standards). Of course, the corollary is, if you don't do the work, you won't get the benefit. There is no "loss" of Datatypes; value Domains; Foreign Keys; Checks; Rules. Regarding performance, there is no performance penalty for 6NF/EAV, but there is always a substantial performance penalty for sub-standard work.
7 Specific Question
Finally. With due consideration to the context above, and that it is a small project with a small team, there is no question:
Do not use EAV (or 6NF for that matter)
Do not use Nulls or Nullable columns (unless you wish to subvert performance)
Do use a single Payment table for the common payment columns
and a child table for each PaymentType, each with its specific columns
All fully typecast and constrained.
What's this "another row_id" business ? Why do some of you stick an ID on everything that moves, without checking if it is a deer or an eagle ? No. The child is a dependent child. The Relation is 1::1. The PK of the child is the PK of the parent, the common Payment table. This is an ordinary Supertype-Subtype cluster, the Differentiator is PaymentTypeCode. Subtypes and supertypes are an ordinary part of the Relational Model, and fully catered for in the database, as well as in any good modelling tool.
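A minimal sketch of such a Supertype-Subtype cluster (the column names are mine; the structure, child PK equals parent PK with a Differentiator, is as described):

CREATE TABLE payment (                   -- Supertype: the common columns
    payment_id   INT           PRIMARY KEY,
    payment_type CHAR(2)       NOT NULL, -- the Differentiator
    amount       DECIMAL(10,2) NOT NULL
);
CREATE TABLE payment_credit_card (       -- one Subtype table per PaymentType
    payment_id  INT PRIMARY KEY REFERENCES payment (payment_id), -- child PK = parent PK, 1::1
    card_number VARCHAR(19) NOT NULL,
    expiry_ym   CHAR(7)     NOT NULL
);
CREATE TABLE payment_cheque (
    payment_id    INT PRIMARY KEY REFERENCES payment (payment_id),
    cheque_number VARCHAR(32) NOT NULL,
    bank_routing  VARCHAR(16) NOT NULL
);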
Sure, people who have no knowledge of Relational databases think they invented it 30 years later, and give it funny new names. Or worse, they knowingly re-label it and claim it as their own. Until some poor sod, with a bit of education and professional pride, exposes the ignorance or the fraud. I do not know which one it is, but it is one of them; I am just stating facts, which are easy to confirm.
A. Responses to Comments
A.1 Attribution
I do not have personal or private or special definitions. All statements regarding the definition (such as imperatives) of Normalisation, Normal Forms, and the Relational Model refer to the many original texts by EF Codd and CJ Date (not available free on the web), the latest being Temporal Data and the Relational Model by CJ Date, Hugh Darwen, and Nikos A Lorentzos, and nothing but those texts. "I stand on the shoulders of giants."
The essence, the body, of all statements regarding the implementation (eg. subjective, and first person) of the above is based on experience: implementing the above principles and concepts, as a commercial organisation (salaried consultant or running a consultancy), in large financial institutions in America and Australia, over 32 years. This includes scores of large assignments correcting or replacing sub-standard or non-relational implementations.
The Null Problem vis-a-vis Sixth Normal Form: a freely available White Paper relating to the title (it does not define The Null Problem alone) can be found at http://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf. A 'nutshell' definition of 6NF (meaningful to those experienced with the other NFs) can be found on p6.
A.2 Supporting Evidence
As stated at the outset, the purpose of this post is to counter the misinformation that is rife in this community, as a service to the community.
Evidence supporting statements made re the implementation of the above principles can be provided, if and when specific statements are identified, and to the same degree that the incorrect statements posted by others, to which this post is a response, are likewise evidenced. If there is going to be a bun fight, let's make sure the playing field is level.
Here are a few docs that I can lay my hands on immediately.
a. Large Bank
This is the best example, as it was undertaken explicitly for the reasons in this post, and the goals were realised. They had a budget for Sybase IQ (a DW product), but the reports were so fast when we finished the project that they did not need it. The trade analytical stats were my 5NF plus pivoting extensions which turned out to be 6NF, described above. I think all the questions asked in the comments have been answered in the doc, except the number of rows:
- old database: unknown, but it can be extrapolated from the other stats
- new database: 20 tables over 100M rows; 4 tables over 10B rows.
b. Small Financial Institute Part A
Part B - The meat
Part C - Referenced Diagrams
Part D - Appendix, Audit of Indices Before/After (1 line per Index)
Note four docs; the fourth is only for those who wish to inspect detailed Index changes. They were running a 3rd party app that could not be changed because the local supplier was out of business, plus 120% extensions which they could, but did not want to, change. We were called in because they upgraded to a new version of Sybase, which was much faster, which shifted the various performance thresholds, which caused a large number of deadlocks. Here we Normalised absolutely everything in the server except the db model, with the goal (guaranteed beforehand) of eliminating deadlocks (sorry, I am not going to explain that here: people who argue about the "denormalisation" issue will be in a pink fit about this one). It included a reversal of "splitting tables into an archive db for performance", which is the subject of another post (yes, the new single table performed faster than the two split ones). This exercise applies to MS SQL Server [insert rewrite version] as well.
c. Yale New Haven Hospital
That's Yale School of Medicine, their teaching hospital. This is a third-party app on top of Sybase. The problem with stats is, 80% of the time they were collecting snapshots at nominated test times only, with no consistent history, so there is no "before image" to compare our new consistent stats with. I do not know of any other company who can get Unix and Sybase internal stats on the same graphs, in an automated manner. Now the network is the threshold (which is a Good Thing).
Perhaps you should look at this question. The accepted answer from Bill Karwin goes into specific arguments against the key/value table, usually known as Entity-Attribute-Value (EAV); the JOIN-per-attribute cost he mentions is sketched just after the quote:
.. Although many people seem to favor EAV, I don't. It seems like the most flexible solution, and therefore the best. However, keep in mind the adage TANSTAAFL. Here are some of the disadvantages of EAV:
No way to make a column mandatory (equivalent of NOT NULL).
No way to use SQL data types to validate entries.
No way to ensure that attribute names are spelled consistently.
No way to put a foreign key on the values of any given attribute, e.g. for a lookup table.
Fetching results in a conventional tabular layout is complex and expensive, because to get attributes from multiple rows you need to do a JOIN for each attribute.
The degree of flexibility EAV gives you requires sacrifices in other areas, probably making your code as complex (or worse) than it would have been to solve the original problem in a more conventional way.
And in most cases, it's unnecessary to have that degree of flexibility. In the OP's question about product types, it's much simpler to create a table per product type for product-specific attributes, so you have some consistent structure enforced at least for entries of the same product type.
I'd use EAV only if every row must be permitted to potentially have a distinct set of attributes. When you have a finite set of product types, EAV is overkill. Class Table Inheritance would be my first choice.
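To make that JOIN-per-attribute point concrete, a tabular fetch from a hypothetical EAV table looks something like this (a sketch; product_attr and its columns are assumptions):

-- product_attr(product_id, attr, value): one extra join per attribute returned
SELECT p.product_id,
       n.value  AS name,
       pr.value AS price
FROM product p
LEFT JOIN product_attr n  ON n.product_id  = p.product_id AND n.attr  = 'name'
LEFT JOIN product_attr pr ON pr.product_id = p.product_id AND pr.attr = 'price';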
My #1 principle is not to redesign something for no reason. So I would go with option 1 because that's your current design and it has a proven track record of working.
Spend the redesign time on new features instead.
If I were designing from scratch I would go with number 2. It gives you the flexibility you need. However, with number 1 already in place and working, and this being something rather central to your whole app, I would probably be wary of making a major design change without a good idea of exactly what queries, stored procs, views, UDFs, reports, imports, etc. you would have to change. If it was something I could do with relatively low risk (and good testing already in place), I might go for the change to solution 2; otherwise you might be introducing new, worse bugs.
Under no circumstances would I use an EAV table for something like this. They are horrible for querying and performance and the flexibility is way overrated (ask users if they prefer to be able to add new types 3-4 times a year without a program change at the cost of everyday performance).
At first sight, I would go for option 2 (or 3): when possible, generalize.
Option 4 is not very Relational I think, and will make your queries complex.
When confronted with such questions, I generally test the options against "use cases": how does design 2/3 behave when I do this or that operation?

T-SQL database design and tables

I'd like to hear some opinions or discussion on a matter of database design. My colleagues and I are developing a complex application in the finance industry that is being installed in several countries.
Our contractors wanted us to keep a single application for all the countries so we naturally face the difficulties with different workflows in every one of them and try to make the application adjustable to satisfy various needs.
The issue I've encountered today was a request from the head of the IT department on the contractors' side that we keep the database model as it is, in terms of the tables and the columns they consist of.
For example, we got a table with different risks and we needed to add a flag column IsSomething (BIT NOT NULL ...). It fully qualifies to exist within the risk table according to the third normal form: no transitive dependency to the key, a non-key value ...
BUT, the guy said that he wants to keep the tables as they are, so we had to make a new table "riskinfo" holding the new column, linked 1:1 to the risk table.
What is your opinion ?
We add columns to our tables that are referenced by a variety of apps all the time.
So long as the applications specifically reference the columns they want to use, and you make sure the new fields are either nullable or have a sensible default defined so they don't interfere with inserts, I don't see any real problem.
That said, if an app does a select * then proceeds to reference the columns by index rather than name you could produce issues in existing code. Personally I have confidence that nothing referencing our database does this because of our coding conventions (That and I suspect the code review process would lynch someone who tried it :P), but if you're not certain then there is at least some small risk to such a change.
In your actual scenario I'd go back to the contractor, give your reasons why you don't think the change will cause any problems, and ask the rationale behind their choice. Maybe they have some application-specific wisdom behind their suggestion, maybe just paranoia from dealing with other companies that change the database structure in ways that aren't backwards-compatible, or maybe it's just a policy at their company that got rubber-stamped long ago and nobody's challenged since. Until you ask, you never know.
This question is indeed subjective, as Binary Worrier commented. I do not have an answer nor any suggestion; I'm just sharing my 2 cents.
Do you know the rationale for those decisions? Sometimes good designs are compromised for the sake of not breaking currently working applications or simply for the fact that too much has been done based on the previous one. It could also be many other non-technical reasons.
Very often, the programming community is unreasonably concerned about the ripple effect that results from redefining tables. Usually, this is a result of failure to understand data independence, and failure to guard the data independence of their operations on the data. Occasionally, the original database designer is at fault.
Most object oriented programmers understand encapsulation better than I do. But these same experts typically don't understand squat about data independence. And anyone who has learned how to operate on an SQL database, but never learned the concept of data independence is dangerously ignorant. The superficial aspects of data independence can be learned in about five minutes. But to really learn it takes time and effort.
Other responders have mentioned queries that use "select *". A select with a wildcard is more data dependent than the same select that lists the names of all the columns in the table. This is just one example among dozens.
The thing is, both data independence and encapsulation pursue the same goal: containing the unintended consequences of a change in the model.
Here's how to keep your IT chief happy. Define a new table with a new name that contains all the columns from the old table, and also all the additional columns that are now necessary. Create a view, with the same name as the old table, that contains precisely the same columns, and in the same order, that the old table had. Typically, this view will show all the rows in the old table, and the old PK will still guarantee uniqueness.
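In SQL, that trick looks roughly like this (hypothetical names, loosely matching the risk example above):

-- new table, new name, with the old columns plus the new one
CREATE TABLE risk_v2 (
    risk_id     INT PRIMARY KEY,
    description VARCHAR(200) NOT NULL,
    IsSomething BIT NOT NULL DEFAULT 0    -- the added flag
);
-- a view with the old table's name and the exact old column list and order;
-- existing code that references "risk" keeps working untouched
CREATE VIEW risk AS
SELECT risk_id, description
FROM risk_v2;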
Once in a while, this will fail to meet all of the IT chief's needs. And if the IT chief is really saying "I don't understand databases; so don't change anything" then you are up the creek until the IT chief changes or gets changed.

How can an object-oriented programmer get his/her head around database-driven programming?

I have been programming in C# and Java for a little over a year and have a decent grasp of object oriented programming, but my new side project requires a database-driven model. I'm using C# and Linq which seems to be a very powerful tool but I'm having trouble with designing a database around my object oriented approach.
My two main question are:
How do I deal with inheritance in my database?
Let's say I'm building a staff rostering application and I have an abstract class, Event. From Event I derive abstract classes ShiftEvent and StaffEvent. I then have concrete classes Shift (derived from ShiftEvent) and StaffTimeOff (derived from StaffEvent). There are other derived classes, but for the sake of argument these are enough.
Should I have a separate table for ShiftEvents and StaffEvents? Maybe I should have separate tables for each concrete class? Both of these approaches seem like they would give me problems when interacting with the database. Another approach could be to have one Event table, and this table would have nullable columns for every type of data in any of my concrete classes. All of these approaches feel like they could impede extensibility down the road. More than likely there is a third approach that I have not considered.
My second question:
How do I deal with collections and one-to-many relationships in an object oriented way?
Let's say I have a Products class and a Categories class. Each instance of Categories would contain one or more products, but the products themselves should have no knowledge of categories. If I want to implement this in a database, then each product would need a category ID which maps to the categories table. But this introduces more coupling than I would prefer from an OO point of view. The products shouldn't even know that the categories exist, much less have a data field containing a category ID! Is there a better way?
Linq to SQL using a table per class solution:
http://blogs.microsoft.co.il/blogs/bursteg/archive/2007/10/01/linq-to-sql-inheritance.aspx
Other solutions (such as my favorite, LLBLGen) allow other models. Personally, I like the single table solution with a discriminator column, but that is probably because we often query across the inheritance hierarchy and thus see it as the normal query, whereas querying a specific type only requires a "where" change.
All said and done, I personally feel that mapping OO into tables is putting the cart before the horse. There have been continual claims that the impedance mismatch between OO and relations has been solved... and there have been plenty of OO specific databases. None of them have unseated the powerful simplicity of the relation.
Instead, I tend to design the database with the application in mind, map those tables to entities, and build from there. Some see this as a loss of OO in the design process, but in my mind the data layer shouldn't reach high enough into your application to affect the design of the higher-order systems just because you used a relational model for storage.
I had the opposite problem: how to get my head around OO after years of database design. Come to that, a decade earlier I had the problem of getting my head around SQL after years of "structured" flat-file programming. There are just enough similarities between class and data entity decomposition to mislead you into thinking that they're equivalent. They aren't.
I tend to agree with the view that once you're committed to a relational database for storage then you should design a normalised model and compromise your object model where unavoidable. This is because you're more constrained by the DBMS than you are with your own code - building a compromised data model is more likely to cause you pain.
That said, in the examples given, you have choices: if ShiftEvent and StaffEvent are mostly similar in terms of attributes and are often processed together as Events, then I'd be inclined to implement a single Events table with a type column. Single-table views can be an effective way to separate out the sub-classes and on most db platforms are updatable. If the classes are more different in terms of attributes, then a table for each might be more appropriate. I don't think I like the three-table idea: "has one or none" relationships are seldom necessary in relational design. Anyway, you can always create an Event view as the union of the two tables.
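With the table-per-subclass option, for instance, that union view could be as simple as (a sketch; the table and column names are assumed):

CREATE VIEW event AS
SELECT event_id, title, start_time, end_time, 'SHIFT' AS event_type FROM shift_event
UNION ALL
SELECT event_id, title, start_time, end_time, 'STAFF' AS event_type FROM staff_event;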
As to Product and Category, if one Category can have many Products, but not vice versa, then the normal relational way to represent this is for the product to contain a category id. Yes, it's coupling, but it's only data coupling, and it's not a mortal sin. The column should probably be indexed, so that it's efficient to retrieve all products for a category. If you're really horrified by the notion then pretend it's a many-to-many relationship and use a separate ProductCategorisation table. It's not that big a deal, although it implies a potential relationship that doesn't really exist and might mislead someone coming to the app in future.
In my opinion, these paradigms (the Relational Model and OOP) apply to different domains, making it difficult (and pointless) to try to create a mapping between them.
The Relational Model is about representing facts (such as "A is a person"), i.e. intangible things that have the property of being "unique". It doesn't make sense to talk about several "instances" of the same fact - there is just the fact.
Object Oriented Programming is a programming paradigm detailing a way to construct computer programs to fulfill certain criteria (re-use, polymorphism, information hiding...). An object is typically a metaphor for some tangible thing - a car, an engine, a manager or a person etc. Tangible things are not facts - there may be two distinct objects with identical state without them being the same object (hence the difference between equals and == in Java, for example).
Spring and similar tools provide access to relational data programmatically, so that the facts can be represented by objects in the program. This does not mean that OOP and the Relational Model are the same, or should be confused with each other. Use the Relational Model to design databases (collections of facts) and OOP to design computer programs.
TL;DR version (Object-Relational impedance mismatch distilled):
Facts = the recipe on your fridge.
Objects = the content of your fridge.
Frameworks such as
Hibernate http://www.hibernate.org/
JPA http://java.sun.com/developer/technicalArticles/J2EE/jpa/
can help you to smoothly solve this problem of inheritance. e.g. http://www.java-tips.org/java-ee-tips/enterprise-java-beans/inheritance-and-the-java-persistenc.html
I also got to understand database design, SQL, and particularly the data centered world view before tackling the object oriented approach. The object-relational-impedance-mismatch still baffles me.
The closest thing I've found to getting a handle on it is this: looking at objects not from an object-oriented programming perspective, or even from an object-oriented design perspective, but from an object-oriented analysis perspective. The best book on OOA that I got was written in the early 90s by Peter Coad.
On the database side, the best model to compare with OOA is not the relational model of data, but the Entity-Relationship (ER) model. An ER model is not really relational, and it doesn't specify the logical design. Many relational apologists think that is ER's weakness, but it is actually its strength. ER is best used not for database design but for requirements analysis of a database, otherwise known as data analysis.
ER data analysis and OOA are surprisingly compatible with each other. ER, in turn is fairly compatible with relational data modeling and hence to SQL database design. OOA is, of course, compatible with OOD and hence to OOP.
This may seem like the long way around. But if you keep things abstract enough, you won't waste too much time on the analysis models, and you'll find it surprisingly easy to overcome the impedance mismatch.
The biggest thing to get over in terms of learning database design is this: data linkages like the foreign key to primary key linkage you objected to in your question are not horrible at all. They are the essence of tying related data together.
There is a phenomenon in pre database and pre object oriented systems called the ripple effect. The ripple effect is where a seemingly trivial change to a large system ends up causing consequent required changes all over the entire system.
OOP contains the ripple effect primarily through encapsulation and information hiding.
Relational data modeling overcomes the ripple effect primarily through physical data independence and logical data independence.
On the surface, these two seem like fundamentally contradictory modes of thinking. Eventually, you'll learn how to use both of them to good advantage.
My guess off the top of my head:
On the topic of inheritance I would suggest having 3 tables: Event, ShiftEvent and StaffEvent. Event has the common data elements, much as it was originally defined.
The last one can go the other way, I think. You could have a table with category ID and product ID and no other columns, where for a given category ID this returns the products, but the product does not need to know about the category as part of how it describes itself.
The big question: how can you get your head around it? It just takes practice. You try implementing a database design, run into problems with your design, you refactor and remember for next time what worked and what didn't.
To answer your specific questions... this is a little bit of opinion thrown in, as in "how I would do it", not taking into account performance needs and such. I always start fully normalized and go from there based on real-world testing:
Table Event
EventID
Title
StartDateTime
EndDateTime
Table ShiftEvent
ShiftEventID
EventID
ShiftSpecificProperty1
...
Table Product
ProductID
Name
Table Category
CategoryID
Name
Table CategoryProduct
CategoryID
ProductID
Also, reiterating what Pierre said: an ORM tool like Hibernate makes dealing with the friction between relational structures and OO structures much nicer.
There are several possibilities in order to map an inheritance tree to a relational model.
NHibernate, for instance, supports the 'table per class hierarchy', 'table per subclass', and 'table per concrete class' strategies:
http://www.hibernate.org/hib_docs/nhibernate/html/inheritance.html
For your second question:
You can create a 1:n relation in your DB, where the Products table of course has a foreign key to the Categories table.
However, this does not mean that your Product class needs to have a reference to the Category instance to which it belongs.
You can create a Category class, which contains a set or list of products, and you can create a product class, which has no notion of the Category to which it belongs.
Again, you can easily do this using (N)Hibernate:
http://www.hibernate.org/hib_docs/reference/en/html/collections.html
Sounds like you are discovering the Object-Relational Impedance Mismatch.
The products shouldn't even know that the categories exist, much less have a data field containing a category ID!
I disagree here; I would think that instead of supplying a category id, you let your ORM do it for you. Then in code you would have something like (borrowing from NHib's and Castle's ActiveRecord):
class Category
{
    [HasMany]
    public IList<Product> Products { get; set; }
    // ...
}

class Product
{
    [BelongsTo]
    public Category ParentCategory { get; set; }
}
Then if you wanted to see which category the product is in, you'd just do something simple like:
Product.ParentCategory
I think you can set up the ORMs differently, but either way, for the inheritance question, I ask: why do you care? Either go about it with objects and forget about the database, or do it a different way. Might seem silly, but unless you really, really can't have a bunch of tables, or don't want a single table for some reason, why would you care about the database? For instance, I have the same setup with a few inheriting objects, and I just go about my business. I haven't looked at the actual database yet, as it doesn't concern me. The underlying SQL is what concerns me, and the correct data coming back.
If you have to care about the database then you're going to need to either modify your objects or come up with a custom way of doing things.
I guess a bit of pragmatism would be good here. Mappings between objects and tables always have a bit of strangeness here and there. Here's what I do:
I use Ibatis to talk to my database (Java to Oracle). Whenever I have an inheritance structure where I want a subclass to be stored in the database, I use a "discriminator". This is a trick where you have one table for all the Classes (Types), with all the fields which you could possibly want to store. There is one extra column in the table, containing a string which is used by Ibatis to see which type of object it needs to return.
It looks funny in the database, and sometimes can get you into trouble with relations to fields which are not in all Classes, but 80% of the time this is a good solution.
Regarding your relation between category and product, I would add a CategoryID column to the product, because that would make life really easy, both SQL-wise and mapping-wise. If you're really set on doing the "theoretically correct thing", you can consider an extra table which has only two columns, connecting the categories and their products. It will work, but generally this construction is only used when you need many-to-many relations. The queries below illustrate the difference.
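A sketch of both variants (names assumed; 42 is a placeholder category ID):

-- Variant 1: direct CategoryID column on the product row.
SELECT Name FROM Product WHERE CategoryID = 42;

-- Variant 2: separate two-column link table; the same listing needs a join.
SELECT p.Name
FROM Product p
JOIN CategoryProduct cp ON cp.ProductID = p.ProductID
WHERE cp.CategoryID = 42;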
Try to keep it as simple as possible. Having a "academic solution" is nice, but generally means a bit of overkill and is harder to refactor because it is too abstract (like hiding the relations between Category and Product).
I hope this helps.

Deciding on a database structure for pricing wizard

Option A:
We are working on a small project that requires a pricing wizard for custom tables. (Yes, actual custom tables - the kind you eat at. From here on out I'll call them kitchen tables so we don't get confused.) I came up with a model where each kitchen table part was a database table. So the database looked like this:
TableLineItem
-------------
ID
TableSizeID
TableEdgeWoodID
TableBaseID
Quantity
TableEdgeWood
-------------
ID
Name
MaterialUnitCost
LaborSetupHours
LaborWorkHours
Each part has to be able to calculate its price. Most of the calculations are very similar. I liked this structure because I can drag it right into the linq-to-sql designer and have all of my classes generated. (Less code writing means less to maintain...) I then implement a calculate-cost interface which just takes in the size of the table. I have written some tests and this functions pretty well. I also added a table to filter parts in the UI based on previous selections. (You can't have a particular wood with a particular finish.) There are some other one-off exceptions in the model, and I have them hard-coded. This model is very rigid, and changing requirements would change the data model. (For example, if all the tables suddenly need umbrellas.)
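For reference, here is roughly what Option A looks like as DDL (a sketch; column types are my assumptions, and TableSize/TableBase simply repeat the TableEdgeWood pattern):

CREATE TABLE TableSize (
    ID    INTEGER PRIMARY KEY,
    Name  TEXT NOT NULL
    -- size-specific columns would go here
);

CREATE TABLE TableEdgeWood (
    ID                INTEGER PRIMARY KEY,
    Name              TEXT NOT NULL,
    MaterialUnitCost  REAL NOT NULL,
    LaborSetupHours   REAL NOT NULL,
    LaborWorkHours    REAL NOT NULL
);

CREATE TABLE TableBase (
    ID                INTEGER PRIMARY KEY,
    Name              TEXT NOT NULL,
    MaterialUnitCost  REAL NOT NULL,
    LaborSetupHours   REAL NOT NULL,
    LaborWorkHours    REAL NOT NULL
);

-- Every FK points at a specific part table, so the database itself
-- guarantees a line item can't mix up part kinds: this is the
-- referential integrity Option B gives up.
CREATE TABLE TableLineItem (
    ID               INTEGER PRIMARY KEY,
    TableSizeID      INTEGER NOT NULL REFERENCES TableSize(ID),
    TableEdgeWoodID  INTEGER NOT NULL REFERENCES TableEdgeWood(ID),
    TableBaseID      INTEGER NOT NULL REFERENCES TableBase(ID),
    Quantity         INTEGER NOT NULL
);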
Option B:
After various meetings (which probably took more time than they should have, considering the size of this project), my colleagues decided they would prefer a more generic approach. Something like this:
Spec
----
SpecID
SpecTypeID
TableType_LookupID
Name
MaterialUnitCost
LaborSetupHours
LaborWorkHours
SpecType
--------
SpecTypeID
ParentSpecType_SpecTypeID
IsCustomerOption
IsRequiredCustomerOption
etc...
This is a much more generic approach that could be used to construct any product (like, if they started selling chairs...). I think this would take longer to implement, but it would be more flexible in the future (although I doubt we will revisit this). Also, you lose some referential integrity - you would need triggers to enforce that a table base cannot be set where a table wood belongs. One possible trigger is sketched below.
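For example, with the Spec/SpecType tables above plus a hypothetical selection table, a SQLite-style trigger could reject a spec of the wrong type (everything here beyond Spec itself is assumed for illustration):

-- Hypothetical table recording which spec was chosen for which slot;
-- ExpectedSpecTypeID says what kind of spec the slot should hold.
CREATE TABLE SpecSelection (
    SelectionID         INTEGER PRIMARY KEY,
    ExpectedSpecTypeID  INTEGER NOT NULL,
    SpecID              INTEGER NOT NULL REFERENCES Spec(SpecID)
);

-- Reject, say, a table base chosen where a table wood belongs.
CREATE TRIGGER SpecSelection_TypeCheck
BEFORE INSERT ON SpecSelection
FOR EACH ROW
WHEN (SELECT SpecTypeID FROM Spec WHERE SpecID = NEW.SpecID)
     <> NEW.ExpectedSpecTypeID
BEGIN
    SELECT RAISE(ABORT, 'Spec has the wrong SpecType for this slot');
END;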
Questions:
Which database structure do you prefer? Feel free to suggest your own.
What would be considered a best practice? If you have several similar database tables, do you create one database table with a type column, or several distinct tables? I suspect the answer begins with "It depends..."
What would the estimated time difference be between the two approaches (1 week, 1 day, 150% longer, etc.)?
Thanks in advance. Let me know if you have any questions so I can update this.
Having been caught out far more often than I should have by designing db structures that met my clients' original specs but turned out to be too rigid, I would always go for the more flexible approach, even though it takes more time to set up.
I don't have time for a complete answer right now, but I'll throw this out:
It's usually a bad idea to design a database based on the development tool that you're using to code against it.
You want to be generic to a point. Tables in a database should represent something and it is possible to make it too generic. For example, a table called "Things" is probably too generic.
It may be possible to make constraints that go beyond what you expect. Your example of a "table base" with a "table wood" didn't make sense to me, but if you can expand on a specific example someone might be able to help with that.
Finally, if this is a small application for a single store then your design is going to have much less impact on the project outcome than it would if you were designing for an application that would be heavily used and constantly changed. This goes back to the "too generic" comment above. It is possible to overdesign a system when its use will be minimal and well-defined. I hope that makes sense.
Given your comment below about the table bases and woods, you could set up a table called TableAttributes (or something similar), and each possible option would be of a particular table attribute type. You could then enforce, purely through foreign keys, that any given option is only used for the attribute to which it applies. A sketch follows.
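One way to realize that idea (all names here are assumed): carry the attribute type into the option's key, and make selections reference both columns, so the pairing is enforced declaratively rather than with triggers:

CREATE TABLE TableAttribute (
    AttributeID  INTEGER PRIMARY KEY,
    Name         TEXT NOT NULL    -- e.g. 'Base', 'EdgeWood'
);

CREATE TABLE AttributeOption (
    OptionID     INTEGER NOT NULL,
    AttributeID  INTEGER NOT NULL REFERENCES TableAttribute(AttributeID),
    Name         TEXT NOT NULL,
    PRIMARY KEY (OptionID, AttributeID)
);

-- The composite FK means a selection can only pair an option with the
-- attribute it was defined for: a base can never fill the wood slot.
CREATE TABLE LineItemSelection (
    LineItemID   INTEGER NOT NULL,
    AttributeID  INTEGER NOT NULL,
    OptionID     INTEGER NOT NULL,
    PRIMARY KEY (LineItemID, AttributeID),
    FOREIGN KEY (OptionID, AttributeID)
        REFERENCES AttributeOption(OptionID, AttributeID)
);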
There is a tendency to over-abstract with database schema design, because the cost of change can be high. Myself, I like table names that are fairly descriptive. I often equate schema design with OO design. E.g., you wouldn't normally create a class named Thing, you would probably call it Product, Furniture, Item, something that relates to your business.
In the schema you have provided there is a mix of the abstract (Spec) and the specific (TableType_LookupID). I would tend to equalize the level of abstraction, using entities like:
ProductGroup (for the case where you have a product that is a collection of other products)
Product
ProductType
ProductDetail
ProductDetailType
etc.
Here's what my experience would tell me:
Which database structure do you prefer? Without a doubt, I'd go for approach one. Go for the simplest setup that might work. If you add complexity, always ask yourself: what value will it have for the customer?
What would be considered a best practice? That does indeed depend, among other things, on the size of the project and the expected rate of change. As a general rule, generic tables are worth it when you expect the customer to be adding new types. For example, if your customer wants to be able to add a new "color" entity to the table, you'd need generic tables; you can't predict beforehand what they will add.
What would the estimated time difference be between the two approaches? Not knowing your business, skill, and environment, it's impossible to give a valid estimate. The approach that you are confident in coding will take the least time. Here, my guess would be that approach #1 could be 5x-50x as fast. Generic tables are hard, both on the database side and on the client side.
Option B.
Generic is generally better than specific. Software is already doomed to fail, or to reach its capacity, because it is designed for a certain set of tasks only. If you build something generic, it will break less, provided the abstraction rests on a realistic analysis of where the system might head. As long as you stay away from both over-abstraction and under-abstraction, you're probably in the sweet spot.
In this case, the adage "less code is more" probably applies, in that you wouldn't have to come back and rewrite it later.