Implementing a Flexible Relationship in an RDBMS -- What really are the tradeoffs? - sql

I have a bunch of products with a bunch of different possible attributes for each product. E.g. Product A has a name, size, color, shape. Product B has a name, calories, sugar, etc. One way to solve this is like:
1) Create tables
Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)
This allows for maximum flexibility, but I have heard a lot of people recommend against this although I am not sure why. I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.
Everyone who explains to me why this is bad does so in the context of a completely flexible relational design where you never create real tables beyond a few basic initial tables (e.g. object, attribute, object_attribute)-- which I think we all can agree is bad. But this is a much more limited and contained version of that (only Products, not every object in the system), so I don't think it is fair to group these two architectures together.
What issues have you encountered (from experience or in theory) that make this design so bad?
2) Another way to solve this is to create a Product table with a bunch of columns like Size, Color, Shape, Weight, Sugar, etc and then include some extra columns at the end to give us some flexibility. This will create generally sparse rows filled mostly with NULLs. People tend to like this approach, but my question is how many columns can you have before this approach loses its performance benefits? If you have 200 columns, I imagine this is no longer a smart move, but what about 100 columns? 50 columns? 25 columns?
3) The final approach I know about is to store all of the attributes as a blob (JSON perhaps) in a single column of the Products table. I like this approach but it doesn't feel right. Queries are hard. And if you want to be able to easily change the name of an attribute later, you either have to parse every record individually or have them keyed in your blob by some id. If you go the id path then you will need another table Attributes and things start to look like approach #1 from above except you won't be able to join the attribute_id with your blob, so I hope you didn't want to query anything by attribute name.
What I like about this approach though is you can query one product and in your code you can easily access all the properties it has -- fast. And if you delete a product, you won't have to clean up other tables -- easy to stay consistent.
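To make the tradeoff in approach #3 concrete, here is a minimal sketch using Python's built-in sqlite3 and json modules. The table layout, the attrs column, and the helper names load_product and find_by_attribute are made up for illustration, and the sketch assumes the database offers no JSON functions of its own, so any filter on an attribute has to parse every row in application code.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " attrs TEXT NOT NULL)"  # hypothetical column holding the JSON blob
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        (1, "Product A", json.dumps({"size": "M", "color": "blue", "shape": "round"})),
        (2, "Product B", json.dumps({"calories": 120, "sugar": 9})),
    ],
)


def load_product(product_id):
    """Fetch one row and get every attribute of the product at once."""
    row = conn.execute(
        "SELECT name, attrs FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    if row is None:
        return None
    name, attrs = row
    return {"name": name, **json.loads(attrs)}


def find_by_attribute(key, value):
    """Filter by attribute: every blob must be read and parsed in code."""
    matches = []
    for product_id, attrs in conn.execute("SELECT id, attrs FROM products ORDER BY id"):
        if json.loads(attrs).get(key) == value:
            matches.append(product_id)
    return matches


product_a = load_product(1)                     # the easy, fast part
blue_ids = find_by_attribute("color", "blue")   # the hard part: a full scan
```

Reading one product is a single-row lookup, while the search touches the whole table, which is the "queries are hard" side of this approach.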
4) I have read some things about being able to index strongly typed xml formats in some RDBMSs, but I honestly don't know much about this approach.
I am stuck. I feel like approach #1 is the best bet, but everything I read says that way stinks. What is the right way to think about this problem to be able to decide what is the best method for a given situation? More ideas than what I have listed are obviously welcomed!

You can probably find a great deal about this topic by doing a Google search on "entity attribute value antipattern".
One of the issues with this approach is that you end up mixing meta-data with actual data. Your "attribute" has to now tell the database what exactly is held in the "value" column. This can make it very difficult to handle this data in front-ends, reporting software, etc.
Second, you're going to have a very hard time actually enforcing any data integrity in the database. When your product has an attribute of "weight" what's to stop someone from putting "22 inches" in the value? Or a completely non-numeric value? You might say, "Well, my application will handle that." Then you need to change your application every time that you want to add a new attribute because the application needs to know how to handle it. If you're going to go through all of that work, just add a new column.
Third, how do you enforce that a given product has all of the attributes that it needs? In a row you can make a column NOT NULL, and a value is then required to get that row into the database. You can't enforce that in the EAV model.
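The second and third points can be seen in a few lines. This is only a sketch using Python's sqlite3 module; the weight_kg column, the CHECK expression, and the try_insert helper are assumptions chosen for the example (SQLite is loosely typed, so the CHECK uses typeof to reject text).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conventional design: the column itself carries the rules.
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    weight_kg  REAL NOT NULL
               CHECK (typeof(weight_kg) IN ('integer', 'real') AND weight_kg > 0)
);
-- EAV design: every value is just a string.
CREATE TABLE product_attribute (
    product_id   INTEGER NOT NULL,
    attribute_id INTEGER NOT NULL,
    value        TEXT,
    PRIMARY KEY (product_id, attribute_id)
);
""")


def try_insert(sql, params):
    """Return True if the database accepted the row, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


typed_ok = try_insert(
    "INSERT INTO product (name, weight_kg) VALUES (?, ?)", ("Widget", 2.5))
typed_garbage = try_insert(
    "INSERT INTO product (name, weight_kg) VALUES (?, ?)", ("Widget", "22 inches"))
typed_missing = try_insert(
    "INSERT INTO product (name) VALUES (?)", ("Gadget",))
# Attribute 1 is assumed to mean "weight"; nothing in the table knows that.
eav_garbage = try_insert(
    "INSERT INTO product_attribute VALUES (?, ?, ?)", (1, 1, "22 inches"))
```

The typed table rejects both the bad value and the missing one, while the EAV table accepts "22 inches" without complaint, and a product with no weight row at all is indistinguishable from a valid one.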
Fourth, this kind of a model usually leads to a lot of confusion. People aren't sure what "attributes" are supported, or they duplicate an attribute, or they forget to handle an attribute when creating a report. As an example, if I have an attribute for "Weight(kg)" and another attribute for "Weight(lbs)" and someone asks me, "What's the heaviest product in your database?" I'd better remember that I need to check both attributes.
Fifth, this model usually also leads to laziness. Hey, there's no reason to actually do any analysis of the products that our system can handle, because whatever comes along we'll just add some attributes. In my experience, companies are much better off doing the analysis required to create a good database design rather than fall back on an antipattern like this. You'll learn things about the database, the application, and likely the business as well.
Sixth, it might take a LOT of joins to get a single row of data for a given product. You can return the attributes as separate rows, but now you have to come up with customized list boxes to list those products, etc. Similarly, writing search queries against this model can be very difficult and in both of these situations you're likely to have performance issues.
These are just a few of the problems which I've encountered over the years. I'm sure that there are others.
What the correct solution is for your system depends a lot on the specifics of your business and application. Rather than a sparse row, you might consider using subtype tables if your products fall into a few categories that share common attributes.
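As a rough illustration of the subtype-table idea, here is a sketch using Python's sqlite3 module. The categories (apparel, food), table names, and columns are invented for the example, based on the attributes mentioned in the question; a real design would come out of analysing the actual products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
-- Attributes shared by every product live in the supertype table.
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    product_type TEXT NOT NULL CHECK (product_type IN ('apparel', 'food'))
);
-- One table per category, each with real, typed, required columns.
CREATE TABLE apparel_product (
    product_id INTEGER PRIMARY KEY
               REFERENCES product (product_id) ON DELETE CASCADE,
    size  TEXT NOT NULL,
    color TEXT NOT NULL
);
CREATE TABLE food_product (
    product_id INTEGER PRIMARY KEY
               REFERENCES product (product_id) ON DELETE CASCADE,
    calories INTEGER NOT NULL CHECK (calories >= 0),
    sugar_g  REAL    NOT NULL CHECK (sugar_g >= 0)
);
INSERT INTO product VALUES (1, 'Product A', 'apparel'), (2, 'Product B', 'food');
INSERT INTO apparel_product VALUES (1, 'M', 'blue');
INSERT INTO food_product VALUES (2, 120, 9.0);
""")

# One join returns a complete, typed row for a food product.
food_rows = conn.execute("""
    SELECT p.name, f.calories, f.sugar_g
    FROM product p
    JOIN food_product f ON f.product_id = p.product_id
""").fetchall()

# Deleting the product removes its subtype row through the cascade.
with conn:
    conn.execute("DELETE FROM product WHERE product_id = 1")
remaining_apparel = conn.execute(
    "SELECT COUNT(*) FROM apparel_product").fetchone()[0]
```

Each category keeps NOT NULL and CHECK constraints on its own columns, and the cascade means deleting a product needs no manual cleanup.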

There are many problems with flexible data models but the first one that is likely to bite you is the fact that queries get unwieldy very quickly. For example, if you wanted to get the Size attribute for every product, the query is relatively easy.
SELECT p.name   AS product_name,
       pa.value AS product_size
  FROM product p
       -- pick the single attribute row we care about, then look up its value
       JOIN attribute a ON (a.name = 'size')
       LEFT OUTER JOIN product_attribute pa ON (p.product_id = pa.product_id AND
                                                pa.attribute_id = a.attribute_id)
If you want to get the size and some other attribute like color, things get trickier
SELECT p.name         AS product_name,
       pa_size.value  AS product_size,
       pa_color.value AS product_color
  FROM product p
       JOIN attribute a_size  ON (a_size.name = 'size')
       JOIN attribute a_color ON (a_color.name = 'color')
       LEFT OUTER JOIN product_attribute pa_size  ON (p.product_id = pa_size.product_id AND
                                                      pa_size.attribute_id = a_size.attribute_id)
       LEFT OUTER JOIN product_attribute pa_color ON (p.product_id = pa_color.product_id AND
                                                      pa_color.attribute_id = a_color.attribute_id)
Very quickly, when you start wanting to grab 10 attributes or write complex searches (show me products where the color is blue and the size is medium), the queries start to get very complicated both for developers to write and maintain and for the database optimizer to generate the query plan for. If you're joining 30 tables together, the optimizer would have to prune the tree of plans it considers very, very quickly to be able to generate a query plan in a reasonable time frame. That tends to lead the optimizer to discard promising paths too early and to generate less than optimal paths for many of your queries.
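For a sense of scale, here is the "color is blue and size is medium" search as a runnable sketch with Python's sqlite3 module. The sample rows and the product_flat comparison table are made up for the example; the EAV table and column names follow the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE attribute (attribute_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE product_attribute (
    product_id   INTEGER NOT NULL REFERENCES product (product_id),
    attribute_id INTEGER NOT NULL REFERENCES attribute (attribute_id),
    value        TEXT,
    PRIMARY KEY (product_id, attribute_id)
);
INSERT INTO product VALUES (1, 'shirt'), (2, 'mug'), (3, 'hat');
INSERT INTO attribute VALUES (1, 'size'), (2, 'color');
INSERT INTO product_attribute VALUES
    (1, 1, 'medium'), (1, 2, 'blue'),
    (2, 2, 'blue'),
    (3, 1, 'medium'), (3, 2, 'red');

-- The same data with ordinary columns, for comparison only.
CREATE TABLE product_flat (
    product_id INTEGER PRIMARY KEY, name TEXT NOT NULL, size TEXT, color TEXT);
INSERT INTO product_flat VALUES
    (1, 'shirt', 'medium', 'blue'), (2, 'mug', NULL, 'blue'), (3, 'hat', 'medium', 'red');
""")

# EAV: two extra joins per attribute in the search condition.
EAV_SEARCH = """
SELECT p.name
FROM product p
JOIN product_attribute pa_size  ON pa_size.product_id = p.product_id
JOIN attribute a_size           ON a_size.attribute_id = pa_size.attribute_id
                               AND a_size.name = 'size'
JOIN product_attribute pa_color ON pa_color.product_id = p.product_id
JOIN attribute a_color          ON a_color.attribute_id = pa_color.attribute_id
                               AND a_color.name = 'color'
WHERE pa_size.value = 'medium'
  AND pa_color.value = 'blue'
ORDER BY p.name
"""

# Ordinary columns: the search is just a WHERE clause.
FLAT_SEARCH = """
SELECT name FROM product_flat
WHERE size = 'medium' AND color = 'blue'
ORDER BY name
"""

eav_matches = [name for (name,) in conn.execute(EAV_SEARCH)]
flat_matches = [name for (name,) in conn.execute(FLAT_SEARCH)]
```

Both queries return the same product, but the EAV version already joins five tables for two conditions, and each additional attribute adds two more.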
This, in turn, means that you very quickly get to a point where new development is bottlenecked because developers can't get their queries right or developers can't get their queries to return quickly enough. Whatever time you saved up front by not gathering the requirements to determine what the valid attributes are quickly gets used up with the 47th iteration of "Why can't I get the data I want out of this putrid data model?"
Beyond this cost to developers, you end up creating a lot of costs for the organization as a whole.
No query tool is going to handle this sort of data model well. So all the users that can currently fire up their favorite query tool and run some reports out of your database are now stuck waiting for developers to write their reports and do their extracts for them.
Data quality becomes very hard to enforce. It becomes very hard to check conditions that involve multiple attributes (e.g. if a product's size is Medium then the weight must be between 1 and 10 pounds; if a product's height is specified then a width is required as well), so people don't make those checks. They don't write the reports to identify where these sorts of rules are violated. So the data ends up being a bit bucket of data that downstream processes decide they can't use because it isn't sufficiently complete.
You're moving too much of the initial requirements discussion off into the future when understanding the core entities will likely lead to a much better design overall. If you can't agree on a set of attributes that the first version of the product needs to support, you don't really understand what that version is supposed to do. Even if you successfully code a very generic application, that means that it is going to require a lot of time to configure once you've built it (because someone will have to figure out what attributes it supports at that point). And then you'll discover when the application is being configured that you missed a ton of requirements that only became clear when the attributes were defined-- you can't know that width is required if height is specified if you don't know whether they're going to store height or width in the first place.
In the worst case, the response to this problem during configuration is to immediately determine that you need to provide a flexible way to specify business rules and to specify workflows so that the people configuring the application can quickly code their business rules when they add new attributes and so that they can control the flow of the application by grouping attributes together or skipping certain pages (e.g. have a page where make & model are required if the product type is car, skip that page if not). But in order to do that, you're going to end up building an entire development environment. And you're going to push the job of actually coding the application to the folks that are configuring the product. Unless you happen to be really good at building development environments, and unless the people configuring the product are really developers, this doesn't end well.

"I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design."
No, we wouldn't. Here's why.
You started with this.
Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)
Let's drop the id numbers, so we can see what's really going on. (Longer column names for clarity.)
Products (product_name)
Attributes (attribute_name)
Product_Attributes (product_name, attribute_name, value as string)
And translating that to teams and players . . .
Teams (team_name)
Players (player_name)
Team_Players (team_name, player_name, value as string)
So for sample data we might have
Team                 Player            Value
-------------------  ----------------  -----
St. Louis Cardinals  Boggs, Mitchell   ?
St. Louis Cardinals  Carpenter, Chris  ?
St. Louis Cardinals  Franklin, Ryan    ?
St. Louis Cardinals  Garcia, Jaime     ?
What on earth belongs in place of the question marks? Let's say we want to record number of games played. Now the sample data looks like this.
Team                 Player            Value
-------------------  ----------------  -----
St. Louis Cardinals  Boggs, Mitchell   23
St. Louis Cardinals  Carpenter, Chris  15
St. Louis Cardinals  Franklin, Ryan    19
St. Louis Cardinals  Garcia, Jaime     14
Want to store batting average, too? You can't. Not only can you not store batting average along with games played, you can't tell by looking at the database whether Mitch Boggs played in 23 games, had 23 hits, scored 23 runs, had 23 "at bats", had 23 singles, or struck out 23 times.
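A small sketch of that "you can't", using Python's sqlite3 module. It assumes the link table is keyed on (team_name, player_name), which the column lists above imply but do not state, and the batting average value is a made-up placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE team_players (
    team_name   TEXT NOT NULL,
    player_name TEXT NOT NULL,
    value       TEXT,
    PRIMARY KEY (team_name, player_name)  -- assumed key: one row per team/player pair
)""")

# Games played goes in first...
conn.execute(
    "INSERT INTO team_players VALUES (?, ?, ?)",
    ("St. Louis Cardinals", "Boggs, Mitchell", "23"),
)

# ...and a second fact about the same pair has nowhere to go.
try:
    conn.execute(
        "INSERT INTO team_players VALUES (?, ?, ?)",
        ("St. Louis Cardinals", "Boggs, Mitchell", "0.250"),  # placeholder value
    )
    second_value_stored = True
except sqlite3.IntegrityError:
    second_value_stored = False

row_count = conn.execute("SELECT COUNT(*) FROM team_players").fetchone()[0]
```

The second insert is rejected by the key, and nothing in the one row that remains says what "23" measures.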

The reason why this approach is so bad is that you don't know how many times you have to join to the table to get all the attributes. Plus joining to the same table 20 times tends to create a performance block of massive proportions. I am assuming that Products will be at the heart of your system and thus be a critical place for performance.
Now you say that the product attributes will be drastically different. I disagree. There will be many attributes that are common to a large number of your products: things like price, units, size, color, dimensions, weight. Those should be in the product table as common properties. These are also the ones that the user is most likely to be searching for when picking a product.
Other properties are useful as a description but not for most anything else (they won't be searched on or put into the order details). Put those in a description or notes field.
Finally you are left with the few attributes which might be different. But how different are they? If they are common to a particular type of product (books have these attributes, cameras have those), then a related table for that type of product might work well.
Once you have done your job and figured all this out, then add the flexibility of an EAV table if you still need one. The steps above should cover 98+% of the real requirements.
(Also it's kind of hard to design the order details table if you don't know the attribute fields you need to record for the order - you can't rely on the products table for that)
(oh and I agree wholeheartedly with what @Tom H is saying as well.)

Related

SQL table design: one or multiple line per entity? [duplicate]

I was wondering if you have a website with a dozen different types of listings (Shops, Restaurants, Clubs, Hotels, Events) that require different fields, is there a benefit of creating a table with columns defined like so
Example Shop:
shop_id | name | X | Y | city | district | area | metro | station | address | phone | email | website | opening_hours
Or a more abstract approach similar to this:
object_id | name
---------------
1 | Messy Joe's
2 | Bate's Motel

type_id | name
---------------
1 | hotel
2 | restaurant

object_id | type_id
---------------
1 | 2
2 | 1

field_id | name | field_type
---------------
1 | address | text
2 | opening_hours | date
3 | speciality | text

type_id | field_id
---------------
1 | 1
1 | 2
2 | 1
2 | 3

object_id | field_id | value
---------------
1 | 1 | 1st street....
1 | 3 | English Cuisine
Of course it can be more abstract if value's are predefined (Example: specialties could have their own list)
If I take the abstract approach it can be very flexible, but queries will be more complex with a lot of joins.
But I don't know if this affects the performance, executing these 'more complex' queries.
I would be interested to know what are the up and downsides of both methods. I can just imagine for myself, but I don't have the experience to confirm this.
Certain issues need to be clarified and resolved before we can enter into a reasonable discussion.
Pre-requisite Resolution
Labels
In a profession that demands precision, it is important that we use precise labels, to avoid confusion, and so that we can communicate without having to use long-winded descriptions and qualifiers.
What you have posted as FixedTables, is Unnormalised. Fair enough, it may be an attempt at Third Normal form, but in fact it is a flat file, Unnormalised (not "denormalised"). What you have posted as AbstractTables is, to be precise, Entity-Attribute-Value, which is almost, but not quite, Sixth Normal form, and is therefore more Normalised than 3NF. Assuming it is done correctly, of course.
The Unnormalised flat file is not "denormalised". It is chock full of duplication (nothing has been done to remove repeating groups and duplicate columns or to resolve dependencies) and Nulls, it is a performance hog in many ways, and prevents concurrency.
In order to be Denormalised, it has to first be Normalised, and then the Normalisation backed off a little for some good reason. Since it is not Normalised in the first place, it cannot be Denormalised. It is simply Unnormalised.
It cannot be said to be denormalised "for performance", because being a performance hog, it is the very antithesis of performance. Well, they need a justification for the lack of formalised design, and "for performance" is it. Even the smallest formal scrutiny exposes the misrepresentation (but very few people can provide it, so it remains hidden, until they get an outsider to address, you guessed it, the massive performance problem).
Normalised structures perform far better than Unnormalised structures. More normalised structures (EAV/6NF) perform better than less normalised structures (3NF/5NF).
I am agreeing with the thrust of OMG Ponies, but not their labels and definitions
rather than saying 'don't "denormalise" unless you have to', I am saying, 'Normalise faithfully, period' and 'if there is a performance problem, you have not Normalised correctly'.
Wikipedia
The entries for Normal Forms and Normalisation offer definitions that are incorrect; they confuse the Normal Forms; they are lacking regarding the process of Normalisation; and they give equal weight to absurd or questionable NFs which have been debunked long ago. The result is, Wikipedia adds to an already confused and rarely understood subject. So don't waste your time.
However, in order to progress, without that reference posing a hindrance, let me say this.
The definition of 3NF is stable, and has not changed.
There is a lot of confusion of the NFs between 3NF and 5NF. The truth is that this is an area that progressed over the last 15 years; and many orgs, academics as well as vendors with their products with limitations, jumped to create a new "Normal Form" to validate their offerings. All serving commercial interests and academically unsound. 3NF in its original untampered state intended and guaranteed certain attributes.
The sum total is, 5NF is today, what 3NF was intended to be 15 years ago, and you can skip the commercial banter and the twelve or so "special" (commercial and pseudo-academic) NFs in-between, some of which are identified in Wikipedia, and even that in confusing terms.
Fifth Normal Form
Since you have been able to understand and implement the EAV in your post, you will have no problem understanding the following. Of course a true Relational Model is pre-requisite, strong keys, etc. Fifth Normal Form is, since we are skipping the Fourth:
Third Normal Form
which in simple definitive terms is, every non-key column in every table has a 1::1 relationship to the Primary Key of the table,
and to no other non-key columns
Zero data duplication (the result, if Normalisation is progressed diligently; not achieved by intelligence or experience alone, or by working toward it as a goal without the formal process)
no Update Anomalies (when you update a column somewhere, you do not have to update the same column located somewhere else; the column exists in one and only one place).
If you understand the above, 4NF, BCNF, and all the silly "NFs" can be dismissed, they are required for physicalised Record Filing Systems, as promoted by academics, quite foreign to the Relational Model (Codd).
Sixth Normal Form
The purpose is elimination of missing data (attribute columns), aka elimination of Nulls
This is the one true solution to the Null Problem (also called Handling Missing Values), and the result is a database without Nulls. (It can be done at 5NF with standards and Null substitutes but that is not optimal.) How you interpret and display the missing values is another story.
Technically, it is not a true Normal Form, because it does not have 5NF as a pre-requisite, but it has a value.
EAV vs Sixth Normal Form
All the databases I have written, except one, are pure 5NF. I have worked with (administered, fixed up, enhanced) a couple of EAV databases, and I have implemented many true 6NF databases. EAV is a loose implementation of 6NF, often done by people who do not have a good grasp on Normalisation and the NFs, but who can see the value in, and need the flexibility of, EAV. You are a perfect example.
The difference is this: because it is loose, and because implementers do not have a reference (6NF) to be faithful to, they only implement what they need, and they write it all in code; that ends up being an inconsistent model.
Whereas, a pure 6NF implementation does have a pure academic reference point, and thus it is usually tighter, and consistent. Typically this shows up in two visible elements:
6NF has a catalogue to contain metadata, and everything is defined in metadata, not code. EAV does not have one, everything is in code (implementers keep track of the objects and attributes). Obviously a catalogue eases the addition of columns, navigation, and allows utilities to be formed.
6NF when understood, provides the true solution to The Null Problem. EAV implementers, since they are absent the 6NF context, handle missing data in code, inconsistently, or worse, allow Nulls in the database. 6NF implementers disallow Nulls, and handle missing Data consistently and elegantly, without requiring code constructs (for Null handling; you still have to code for missing data of course).
Eg. For 6NF databases with a catalogue, I have a set of procs that will [re]generate the SQL required to perform all SELECTs, and I provide Views in 5NF for all users, so they do not need to know or understand the underlying 6NF structure. They are driven off the catalogue. Thus changes are easy and automated. EAV types do that manually, due to the absence of the catalogue.
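The general idea of generating SQL from a catalogue can be sketched as follows. This is not the procs described above, only a small illustration using Python's sqlite3 module: the attribute table stands in for the catalogue, the names product_view, quote_ident and regenerate_product_view are hypothetical, and missing values simply surface as NULL in the view, which a stricter implementation would handle differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Stand-in for the catalogue: one row of metadata per attribute.
CREATE TABLE attribute (attribute_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE product_attribute (
    product_id   INTEGER NOT NULL REFERENCES product (product_id),
    attribute_id INTEGER NOT NULL REFERENCES attribute (attribute_id),
    value        TEXT NOT NULL,
    PRIMARY KEY (product_id, attribute_id)
);
INSERT INTO product VALUES (1, 'shirt'), (2, 'mug');
INSERT INTO attribute VALUES (1, 'size'), (2, 'color');
INSERT INTO product_attribute VALUES (1, 1, 'medium'), (1, 2, 'blue'), (2, 2, 'white');
""")


def quote_ident(name):
    """Quote a catalogue name so it is safe to use as a column alias."""
    return '"' + name.replace('"', '""') + '"'


def regenerate_product_view(conn):
    """Rebuild product_view with one column per attribute in the catalogue."""
    catalogue = conn.execute(
        "SELECT attribute_id, name FROM attribute ORDER BY attribute_id").fetchall()
    columns = ["p.product_id", "p.name"]
    joins = []
    for attribute_id, name in catalogue:
        alias = f"pa_{int(attribute_id)}"
        columns.append(f"{alias}.value AS {quote_ident(name)}")
        joins.append(
            f"LEFT JOIN product_attribute {alias}"
            f" ON {alias}.product_id = p.product_id"
            f" AND {alias}.attribute_id = {int(attribute_id)}")
    sql = ("CREATE VIEW product_view AS SELECT " + ", ".join(columns)
           + " FROM product p " + " ".join(joins))
    conn.execute("DROP VIEW IF EXISTS product_view")
    conn.execute(sql)
    return sql


regenerate_product_view(conn)
rows = conn.execute(
    "SELECT name, size, color FROM product_view ORDER BY product_id").fetchall()

# Adding an attribute is a catalogue insert plus a regeneration; no hand-written SQL.
conn.execute("INSERT INTO attribute VALUES (3, 'shape')")
regenerate_product_view(conn)
view_columns = [d[0] for d in conn.execute("SELECT * FROM product_view").description]
```

Users query product_view as if it were an ordinary wide table, and the joins behind it are never written by hand.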
Discussion
Now, we can start the discussion.
"Of course it can be more abstract if
value's are predefined (Example:
specialities could have their own
list)"
Sure. But do not get too "abstract". Maintain consistency and implement such lists in the same EAV (or 6NF) manner as other lists.
"If I take the abstract approach it
can be very flexible, but queries will
be more complex with a lot of joins.
But I don't know if this affects the
performance, executing these 'more
complex' queries."
Joins are pedestrian in relational databases. The problem is not the database, the problem is that SQL is cumbersome when handling joins, especially compound keys.
EAV and 6NF databases have more Joins, which are just as pedestrian, no more, no less. If you have to code each SELECT manually, sure, the cumbersome gets really cumbersome.
The entire problem can be eliminated by (a) going with 6NF over EAV and (b) implementing a catalogue, from which you can (c) generate all the basic SQL. Eliminates an entire class of errors as well.
It is a common myth that Joins somehow have a cost. Totally false.
The join is implemented at compile time, there is nothing of substance to 'cost' CPU cycles.
The issue is the size of tables being joined, not the cost of the Join between those same tables.
Joining two tables with millions of rows each, on a correct PK⇢FK relation, each of which has the appropriate indices (Unique on the parent [PK] side; Unique on the Child side [PK = parent FK + something]), is instantaneous.
Where the Child index is not unique, but at least the leading columns are valid, it is slower; where there is no useful index, of course it is very slow.
None of it has to do with Join cost.
Where many rows are returned, the bottleneck will be the network and the disk layout; not the join processing.
Therefore you can get as "complex" as you like, there is no cost, SQL can handle it.
"I would be interested to know what are the up and downsides of both methods. I can just imagine for myself, but I don't have the experience to confirm this."
5NF (or 3NF for those who have not made the progression) is the easiest and best, in terms of implementation; ease of use (developers as well as users); and maintenance.
The drawback is, every time you add a column, you have to change the database structure (table DDL). That is fine in some cases, but not in most cases, due to change control in place, quite onerous.
Second, you have to change existing code (code handling the new column does not count, because that is an imperative): where good standards are implemented, that is minimised; where they are absent, the scope is unpredictable.
EAV (which is what you have posted), allows columns to be added without DDL changes. That is the single reason people choose it. (code handling the new column does not count, because that is an imperative). If implemented well, it will not affect existing code; if not, it will.
But you need EAV-capable developers.
When EAV is implemented badly, it is abominable, a worse mess than 5NF done badly, but not any worse than Unnormalised which is what most databases out there are (misrepresented as "denormalised for performance").
Of course, it is even more important (than in 5NF/3NF) to hold a strong Transaction context, because the columns are far more distributed.
Likewise, it is essential to retain Declarative Referential Integrity: the messes I have seen were due in large part to the developers removing DRI because it became "too hard to maintain", the result was, as you can imagine, one mother of a data heap with duplicate 3NF/5NF rows and columns all over the place. And inconsistent Null handling.
There is no difference in performance, assuming that the server has been reasonably configured for the intended purpose. (Ok, there are specific optimisations that are possible only in 6NF, which are not possible in other NFs, but I think that is outside the scope of this thread.) And again, EAV done badly can cause unnecessary bottlenecks, no more so than Unnormalised.
Of course, if you go with EAV, I am recommending more formality; buy the full quid; go with 6NF; implement a catalogue; utilities to produce SQL; Views; handle Missing Data consistently; eliminate Nulls altogether. This reduces your vulnerability to the quality of your developers; they can forget about the EAV/6NF esoteric issues, use Views, and concentrate on the app logic.
In your question, you have presented at least two major issues at the same time. Those two issues are E-A-V and gen-spec.
First, let's talk about E-A-V. Your last table (object_id, field_id, value) is essentially an E-A-V. There is an upside to E-A-V and a downside to E-A-V. The upside is that the structure is so generic that it can accommodate almost any body of data describing almost any subject matter. That means that you can proceed to design and implementation with no data analysis and no understanding of the subject matter, and not worry about wrong assumptions. The downside is that at retrieval time, you have to do the data analysis that you skipped over before building the database, in order to come up with queries that mean anything. This is much more serious than just retrieval efficiency. But you are also going to have terrible problems with retrieval efficiency. There are only two ways to learn about this pitfall: live through it or read about it from those who have. I recommend the reading.
Second, you have a gen-spec case. Your table (object_id, type_id) captures a gen-spec (generalization-specialization) pattern, along with the related tables. If I had to generalize between hotels and restaurants, I might call it something like "public accommodations" or "venues". But I'm not sure I understand your case, and you may be driving for something even more general than those two names suggest. After all, you've included "events" in your list, and an event is not a type of venue in my mind.
I've referred other people to readings on gen-spec and the relational model in previous responses.
When two tables are very similar, when should they be combined?
But I hesitate to send you off in the same direction, because it's not clear to me that you want to come up with a relational model of the data before building your database. A relational model of a body of data and an E-A-V model of the same data are almost totally at odds with each other. It seems to me you have to make that choice before you even explore how to express gen-spec in the relational model of data.
When you start to require a large number of different entities (or even before...), a nosql solution would be vastly simpler than either choice.
Just store each entity/record with the exact fields you require.
{
"id": 1,
"type":"Restaurant",
"name":"Messy Joe",
"address":"1 Main St.",
"tags":["asian","fusion","casual"]
}
The "abstract" approach is better known as "Normalization", looks like 3rd Normal Form (3NF).
The other one is called "Denormalized", and can be a valid performance option... when you've encountered speed issues using the Normalized approach, not before.
How do you have the listings represented in code? I'd guess Listing as a supertype, with Shop, Restaurant, etc. as subtypes?
Assuming so, this is a case of how to map subtypes to a relational database. There are generally three choices:
Option 1: single table per subtype, with common attributes repeated in each table (name, id, etc).
Option 2: single table for all objects (your single table approach)
Option 3: table for the supertype and one for each subtype
There's no universally correct solution. My preference is generally to start with option 3; it provides an intuitive structure to work with, is pretty well normalised and can easily be extended. It means a single join for retrieving each instance - but RDBMS are well optimised for doing joins so it doesn't really cause performance problems in practice.
Option 1 can be more performant for queries (no joins) but causes problems if other tables need to refer to all supertype instances (proliferation of foreign keys).
Option 2 appears at first sight to be the most performant, but there are 2 caveats: (1) It's not resilient to change. If you add a new subtype (and so different attributes) you'll need to change the table structure and migrate it. (2) It can be less efficient than it seems. Because the table population is sparse, some DBs don't store it particularly efficiently. As a consequence it can be less efficient than option 3 - since the query engine can do joins faster than it can search bloated sparse table spaces.
Which to choose really comes down to knowing details of your problem. I'd suggest reading up a bit on the options: this article is a good place to start.
hth

Optimize Schema for JOIN across large but finite group of tables

I have some flexibility here so I'm looking for some advice before I lock things down. I also have a couple of ways of solving this problem but I'm looking for advice on the most efficient way of doing this. Since the specifics of my data types are a bit obscure I'll use a more understandable object metaphor.
Right now I have two main tables, and a large but finite number of additional tables. The following business logic applies.
Each specific animal table has a unique field assigned to it, something like "snout diameter" for a pig, or "whiskers" for a cat. There is also another field
The animal table has a field marking the animal's "Role".
There can be multiple animals in a cage.
Animals are linked to Cages by FK constraints. The specific Animal tables are linked to the animal table by FK constraints.
Cages
Animal
Cat
dog
Pig, etc
The main question being asked is, what's in a cage? I also need to be able to search through all the cages as quickly as possible and get all the info for animals that fall into the role "tasty". Sometimes a pig will be "tasty", other times it could be a cat. Depending on the type of animal that's "tasty" I need to display its specific info.
What's the most efficient schema design, or SQL statement to find this info?
My first attempt at this had Only Cages and then a bunch of "SpecificAnimal" tables. This seemed like a bad idea because I would have to do a join across 10+ tables to figure out what was in a cage.
I then moved common attributes to the Animal Table, this allowed me to easily see what animals were in a cage, although this still required searching across the specific tables to get all the data.
I contemplated storing the specific attributes into some form of CSV string (but I'm not that desperate yet)
Of course I could go EAV, but that also seems inefficient since there really are a finite number of animals.
Am I being too worried? Should I just bite the bullet and accept joins across 10 tables? I'm just worried about performance. Any ideas or design patterns that can be recommended? Suffering from information overload and a head cold. Help please.
It's really hard to answer 'what is the best schema' questions, because they always involve tradeoffs. Part of that means that to accurately trade off one design against another, you have to have measurements (of speed, for example) to base your decision on. (This is probably not the answer you were looking for).
For what it's worth, 10 joins is not a massive number, and depending on the number of animals and cages in your system you might never notice a speed issue. Further, if there really is one 'main query', then you can use materialised views to make at least that query fast to answer.
Finally, some overarching advice: go for a clean data model until you have hard numbers to dictate that you 'muddy' the design.
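To make the "clean data model" suggestion concrete, here is a minimal sketch of the Cages / Animal / specific-animal layout from the question, run through Python's built-in sqlite3 module. The column names (label, species, whiskers, snout_diameter) and the sample rows are assumptions made for illustration, and only two specific tables are shown; a real schema would have one LEFT JOIN per specific table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cages  (id INTEGER PRIMARY KEY, label TEXT);
-- Common attributes live on Animal; species and role are assumed columns.
CREATE TABLE Animal (id INTEGER PRIMARY KEY,
                     cage_id INTEGER REFERENCES Cages(id),
                     species TEXT, role TEXT);
-- Each specific table shares the Animal primary key.
CREATE TABLE Cat (animal_id INTEGER PRIMARY KEY REFERENCES Animal(id), whiskers INTEGER);
CREATE TABLE Pig (animal_id INTEGER PRIMARY KEY REFERENCES Animal(id), snout_diameter REAL);

INSERT INTO Cages  VALUES (1, 'north'), (2, 'south');
INSERT INTO Animal VALUES (1, 1, 'cat', 'pet'), (2, 1, 'pig', 'tasty'), (3, 2, 'cat', 'tasty');
INSERT INTO Cat    VALUES (1, 12), (3, 9);
INSERT INTO Pig    VALUES (2, 4.5);
""")

# Filter on the shared table first, then pull in whichever specific row exists.
tasty = conn.execute("""
    SELECT c.label, a.species, cat.whiskers, pig.snout_diameter
    FROM Animal AS a
    JOIN Cages AS c        ON c.id = a.cage_id
    LEFT JOIN Cat AS cat   ON cat.animal_id = a.id
    LEFT JOIN Pig AS pig   ON pig.animal_id = a.id
    WHERE a.role = 'tasty'
    ORDER BY a.id
""").fetchall()

for row in tasty:
    print(row)
```

Columns belonging to a specific table that does not apply to a given animal come back as NULL (None in Python), so the display code can pick the non-empty ones. Whether this is fast enough at your data volumes is exactly the thing to measure.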

Is precalculation denormalization? If not, what is (in simple terms)?

I'm attempting to understand denormalization in databases, but almost all the articles Google has spat out are aimed at advanced DB administrators. I have a fair amount of knowledge about MySQL and MSSQL, but I can't really grasp this.
The only example I can think of when speed was an issue was when doing calculations on about 2,500,000 rows in two tables at a place I used to intern at. As you can guess, calculating that much on demand took forever and froze the dev server I was on for a few minutes. So right before I left my supervisor wanted me to write a calculation table that would hold all the precalculated values, and would be updated about every hour or so (this was an internal site that wasn't used often). However, I never got to finish it because I left.
Would this be an example of denormalization? If so, is this a good example of it or does it go much farther? If not, then what is it in simple terms?
Say you had an Excel file with 2 worksheets you want to use to store family contact details. On the first worksheet, you have names of your contacts with their cell phone numbers. On the second worksheet, you have mailing addresses for each family with their landline phone numbers.
Now you want to print Christmas card labels to all of your family contacts listing all of the names but only one label per mailing address.
You need a way to link the two normalized sets. All the data in the 2 sets you have is normalized. It's 'atomic,' representing one 'atom,' or piece of information that can't be broken down. None of it is repeated.
In a denormalized view of the 2 sets, you'd have one list of all contacts with the mailing addresses repeated multiple times (cousin Alan lives with Uncle Bob at the same address, so it's listed on both Alan and Bob's rows.)
At this point, you want to introduce a Household ID in both sets to link them. Each mailing address has one householdID, each contact has a householdID value that can be repeated (cousin Alan and Uncle Bob, living in the same household, have the same householdID.)
Now say we're at work and we need to track zillions of contacts and households. Keeping the data normalized is great for maintenance purposes, because we only want to store contact and household details in one place. When we update an address, we're updating it for all the related contacts. Unfortunately, for performance reasons, when we ask the server to join the two related sets, it takes forever.
Therefore, some developer comes along and creates one denormalized table with all the zillions of rows, one for each contact with the household details repeated. Performance improves, and space considerations are tossed right out the window, as we now need space for 3 zillion rows instead of just 2.
Make sense?
I would call that aggregation, not denormalization (if it is quantity of orders, for example, SUM(Orders) per day...). This is what OLAP is used for. Denormalization would be, for example, instead of having a PhoneType table and the PhoneTypeID in the Contact table, you would just have the PhoneType in the Contact table, thus eliminating 1 join.
You could also of course use indexed/materialized views to hold the aggregated values... but now you will slow down your updates, deletes and inserts.
Triggers are also another way to accomplish this.
In an overly simplified form I would describe de-normalisation as reducing the number of tables used to represent the same data.
Customers and addresses are often kept in different tables to allow the concept of one customer having multiple addresses. (Work, Home, Current Address, Previous Address, etc)
The same could be said to apply to surnames, and other properties, but only the current surname may ever be of concern. As such, one might normalise all the way to having a Customer table and a Surname table, with foreign key relationships, etc. But then denormalise this by merging the two tables together.
The benefit of "normalise until it hurts" is that it forces one to consider a pure and (hopefully) complete representation of the data and possible behaviours and relationships.
The benefit of "de-normalise until it works" is to reduce certain maintenance and/or processing overheads, but sticking to the same basic model as derived by working out a normalised model.
In the "Surname" example, by denormalising one is able to add an index to the customers based on their Surname and Date of Birth. Without de-normalising the Surname and DoB are in different tables and the composite index is not possible.
Denormalizing can be beneficial, and the example you provided is an instance of this. Calculating these values dynamically is expensive, so instead you create a table that holds each calculated value along with an id referencing the other table.
The data is redundant as it can be derived from another table but due to production requirements this is a better design in the functional sense.
Curious to see what others have to say on this topic because I know my sql professor would cringe at the term denormalize but it has practicality.
Normal form would reject this table, as it is fully derivable from existing data. However, for performance reasons data of this type is commonly found. For example inventory counts are typically carried, but are derivable from the transactions that created them.
For smaller faster sets a view can be used to derive the aggregate. This provides the user the data they need (the aggregated value) rather than forcing them to aggregate it themselves. Oracle (and others?) have introduced materialized views to do what your manager was suggesting. This can be updated on various schedules.
If update volumes permit, triggers could be used to emulate a materialized view using a table. This may reduce the cost of maintaining the aggregated value. If not it would spread the overhead over a greater period of time. It does however, add the risk of creating a deadlock condition.
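As a rough illustration of the trigger approach, the sketch below keeps a running inventory count in a separate table as transactions are inserted. It uses SQLite through Python's sqlite3 module; the table names (stock_moves, stock_totals) and the sample data are assumptions, and only INSERT is handled. UPDATE and DELETE on the detail table would need their own triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Detail rows: the normalized source of truth (hypothetical table).
CREATE TABLE stock_moves  (item TEXT NOT NULL, qty INTEGER NOT NULL);
-- Derived, redundant totals: the "emulated materialized view".
CREATE TABLE stock_totals (item TEXT PRIMARY KEY, on_hand INTEGER NOT NULL);

CREATE TRIGGER stock_moves_ai AFTER INSERT ON stock_moves
BEGIN
    -- Make sure a totals row exists, then add the new quantity to it.
    INSERT OR IGNORE INTO stock_totals (item, on_hand) VALUES (NEW.item, 0);
    UPDATE stock_totals SET on_hand = on_hand + NEW.qty WHERE item = NEW.item;
END;
""")

conn.executemany(
    "INSERT INTO stock_moves VALUES (?, ?)",
    [("pen", 10), ("pen", -3), ("ink", 5)],
)

# The maintained totals should always match what can be derived on demand.
totals = dict(conn.execute("SELECT item, on_hand FROM stock_totals"))
derived = dict(conn.execute("SELECT item, SUM(qty) FROM stock_moves GROUP BY item"))
print(totals, derived)
```

Reads of stock_totals are now a single-row lookup, at the price of extra work on every insert, which is the tradeoff described above.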
OLAP takes this simple case to more of an extreme interest in aggregates. Analysts are interested in aggregated values, not the details. However, if the aggregated value is interesting, they may look at the details. Starting from normal form is still a good practice.

Best pattern for storing (product) attributes in SQL Server

We are starting a new project where we need to store product and many product attributes in a database. The technology stack is MS SQL 2008 and Entity Framework 4.0 / LINQ for data access.
The products (and Products Table) are pretty straightforward (a SKU, manufacturer, price, etc..). However there are also many attributes to store with each product (think industrial widgets). These may range from color to certification(s) to pipe size. Every product may have different attributes, and some may have multiples of the same attribute (Ex: Certifications).
The current proposal is that we will basically have a name/value pair table with a FK back to the product ID in each row.
An example of the attributes Table may look like this:
ProdID AttributeName AttributeValue
123 Color Blue
123 FittingSize 1.25
123 Certification AS1111
123 Certification EE2212
123 Certification FM.3
456 Pipe 11
678 Color Red
999 Certification AE1111
...
Note: Attribute name would likely come from a lookup table or enum.
So the main question here is: Is this the best pattern for doing something like this? How will the performance be? Queries will be based on a JOIN of the product and attributes table, and generally need many WHEREs to filter on specific attributes - the most common search will be to find a product based on a set of known/desired attributes.
If anyone has any suggestions or a better pattern for this type of data, please let me know.
Thanks!
-Ed
You are about to re-invent the dreaded EAV model, Entity-Attribute-Value. This is notorious for having problems in real-life, for various reasons, many covered by Dave's answer.
Luckily, the SQL Customer Advisory Team (SQLCAT) has a whitepaper on the topic, Best Practices for Semantic Data Modeling for Performance and Scalability. I highly recommend this paper. Unfortunately, it does not offer a panacea, a cookie cutter solution, since the problem has no solution. Instead, you'll learn how to find the balance between a fixed queryable schema and a flexible EAV structure, a balance that works for your specific case:
Semantic data models can be very complex and until semantic databases are commonly available, the challenge remains to find the optimal balance between the pure object model and the pure relational model for each application. The key to success is to understand the issues, make the necessary mitigations for those issues, and then test, test, and test. Scalability testing is a critical success factor if you are going to find that optimal design.
This is going to be problematic for a couple of reasons:
Your entity queries will be much harder to write. Transforming the results of those queries into something resembling a ViewModel when it comes time for presentation is going to be painful because it will involve a pivot for each product.
Understanding what your datatypes will be is going to be tough when it comes time to read certain types of data. Are you planning on storing this as strings? For example, DateTimes hold more data than the default .ToString() implementation writes to the string. You're also going to have issues if you try to store floating-point values.
Your objects' data integrity is at risk. There will be a temptation to put properties which should be just attributes of your main product tables in this "bucket o' data". Maybe the design will be semi-sane to begin with, but I guarantee you that after a certain amount of time, folks will start to just throw properties in the bag. It'll then be very tough to keep your objects' integrity with such a loosely defined structure.
Your indexes will most likely be suboptimal. Again think of a property which should be on your product table. Instead of being able to index on just one column, you will now be forced to make a potentially very large composite index on your "type" table.
Since you're apparently planning to throw out proper datatypes and use strings, the performance of range queries for numeric data will likely be poor.
Your table will get big, slowing backups and queries. Instead of an integer being 4 bytes, you're going to have to store far more for an integer of any size.
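Two of the points above can be seen in a small experiment. The sketch below loads the sample rows from the question into SQLite via Python's sqlite3 module (the table name ProductAttributes is an assumption; the question only calls it "the attributes table"). It shows what a "find products with all of these attributes" query looks like, and how a range filter on a string-typed value silently goes wrong.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ProductAttributes "
    "(ProdID INTEGER, AttributeName TEXT, AttributeValue TEXT)"
)
conn.executemany("INSERT INTO ProductAttributes VALUES (?, ?, ?)", [
    (123, "Color", "Blue"), (123, "FittingSize", "1.25"),
    (123, "Certification", "AS1111"), (123, "Certification", "EE2212"),
    (123, "Certification", "FM.3"), (456, "Pipe", "11"),
    (678, "Color", "Red"), (999, "Certification", "AE1111"),
])

# Products that are Blue AND certified AS1111: every extra criterion adds
# another OR branch and bumps the HAVING count.
matches = conn.execute("""
    SELECT ProdID
    FROM ProductAttributes
    WHERE (AttributeName = 'Color'         AND AttributeValue = 'Blue')
       OR (AttributeName = 'Certification' AND AttributeValue = 'AS1111')
    GROUP BY ProdID
    HAVING COUNT(DISTINCT AttributeName) = 2
""").fetchall()

# "Pipe greater than 9" compared as text: '11' sorts before '9'.
as_text = conn.execute("""
    SELECT ProdID FROM ProductAttributes
    WHERE AttributeName = 'Pipe' AND AttributeValue > '9'
""").fetchall()

# The same filter with an explicit cast gives the intended answer,
# but the cast has to be applied to every candidate row.
as_number = conn.execute("""
    SELECT ProdID FROM ProductAttributes
    WHERE AttributeName = 'Pipe' AND CAST(AttributeValue AS REAL) > 9
""").fetchall()

print(matches, as_text, as_number)
```

The text comparison misses product 456 entirely, while the cast version finds it. On most engines a cast like this also prevents a plain index on the value column from being used for the range.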
Better to normalize the table in a more "traditional" way using "IS-A" relationships. For example, you might have Pipes, which are a type of Product, but have a couple more attributes. You might have Stoves, which are a type of product, but have a couple more attributes still.
If you really have a generic database and all sorts of other properties which aren't going to be subject to data integrity rules, you very well may want to consider storing data in an XML column. It's hard to tell you what the correct design choice is unless I know a lot more about your business.
IMO this is a design antipattern. The siren song of this idea has lured many a developer onto the rocks of an unmaintainable application.
I know it is an old one - however there might be other readers...
I have seen the approach of balancing EAV against modeled attributes. Well - it is still EAV. "EAV's are like drugs" is pretty much true. So what about thinking it through once more - and let's be aggressive really:
I still like the supertype approach, where a lot of tables use the same primary key from a key generator. Let's reuse this one. So what about creating a new table for each set of attributes - all having the primary key from the same key generator? E.g. you would have a table with the fields "color,pipe", another table "fittingsize,pipe", and so on. The requirement "volatility of attributes" screams for a carefully (automatically) maintained data dictionary anyway.
This approach is fully normalized and can be fully automated. You can check whether a specific attribute set has already been materialized as a table by hashing attribute name clusters, e.g. crc32(lower('color~fittingsize~pipe')), where the attribute names need to be sorted alphabetically. Of course this requires having the hash in the data dictionary. Based on the data dictionary each object can be searched (using 'UNION'), especially if the data dictionary itself is a table. Having the data dictionary as a table also allows you to use its primary (surrogate) key as the basis for unique table names, to end up with tables like 'attributes1', 'attributes2', ... Most databases nowadays support some billion tables - so we are sort of safe on that end as well. You could even have a product catalogue with very common attributes that references the extended attribute tables.
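The hashing step can be sketched in a few lines of Python. This is only an illustration of the idea: the function names are made up, and the in-memory dict stands in for the data dictionary table. Because crc32 is only 32 bits wide, the sketch also stores the joined names and checks them, since two different attribute sets could in principle hash to the same value.

```python
import zlib

def attribute_set_key(attribute_names):
    """Return the canonical 'a~b~c' string and its crc32 for a set of names."""
    # Lower-case, de-duplicate and sort so that order and case do not matter.
    normalized = sorted({name.strip().lower() for name in attribute_names})
    joined = "~".join(normalized)
    return joined, zlib.crc32(joined.encode("utf-8"))

# Stand-in for the data dictionary table: crc32 -> (surrogate id, joined names).
data_dictionary = {}

def table_for(attribute_names):
    """Look up (or register) the table name that holds this attribute set."""
    joined, digest = attribute_set_key(attribute_names)
    if digest not in data_dictionary:
        data_dictionary[digest] = (len(data_dictionary) + 1, joined)
    surrogate_id, stored = data_dictionary[digest]
    if stored != joined:
        raise ValueError("crc32 collision between %r and %r" % (stored, joined))
    return "attributes%d" % surrogate_id

print(table_for(["Pipe", "Color"]))        # first set seen
print(table_for(["color", "pipe"]))        # same set, different order and case
print(table_for(["FittingSize", "Pipe"]))  # a new set
```

The first two calls resolve to the same table name and the third gets a new one, which is the "has this set been materialized already?" check described above.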
An open issue is 1:n data sets. I am afraid you need to sort them out in separate tables. However, this very much depends on your data presentation and querying strategy. Should they always be presented as a comma-separated string attached to the product, or do you want to e.g. be able to query for all products of a certain Certification?
Before you flame this approach please consider this: It is meant only for use cases where you have a very high volatility of attributes - in quantity and quality. It also presupposes that you cannot know most of the attributes at the point in time when the solution is created. So do not discuss this in a context where you can model your attributes upfront, which would enable you to balance trade-offs much better.
In short, you cannot go all one route. If you use an EAV like your example you will have a myriad of problems like those outlined by the other posters not the least of which will be performance and data integrity. Let me reiterate, that using an EAV as the core of your solution will fail when you get to reporting and analysis. However, as you have also stated, you might have hundreds of attributes that change regularly.
The solution, IMO, is a hybrid. For common attributes, use columns/standard schema. For additional, arbitrary attributes, use an EAV. However, the rule with the EAV data is that you can never, ever, under any circumstances, write a query that includes a sort or filter on an attribute. I.e., you can never write Where AttributeName = 'Foo'. The EAV portion of the schema represents a bag of data that is merely there for tracking purposes. In fact, I have seen many people implement this solution by using Xml for the EAV portion. The moment someone does want to search, filter, sort or place an EAV value in a specific spot on a report, that attribute must be elevated to a top level column in the products table.
The key to this hybrid approach is discipline. It will seem simple enough to add a filter, sort or put an attribute in a specific spot somewhere on a report especially when you get pressure from management. You must resist this temptation. Once you go down the dark path... If you do not think that you can maintain that level of discipline in your development team, then I would not use an EAV. As I've mentioned before, EAV's are like drugs: in small quantities and used under the right circumstances they can be beneficial. Too much will kill you.
Rather than have a name-value table, create the usual Product table structure containing all the common attributes, and add an XML column for the attributes that vary by product.
I have used this structure before and it worked quite well.
As @Dave Markle mentions, the name-value approach can lead to a world of pain.

Dealing with "hypernormalized" data

My employer, a small office supply company, is switching suppliers and I am looking through their electronic content to come up with a robust database schema; our previous schema was pretty much just thrown together without any thought at all, and it's pretty much led to an unbearable data model with corrupt, inconsistent information.
The new supplier's data is much better than the old one's, but their data is what I would call hypernormalized. For example, their product category structure has 5 levels: Master Department, Department, Class, Subclass, Product Block. In addition the product block content has the long description, search terms and image names for products (the idea is that a product block contains a product and all variations - e.g. a particular pen might come in black, blue or red ink; all of these items are essentially the same thing, so they apply to a single product block). In the data I've been given, this is expressed as the products table (I say "table" but it's a flat file with the data) having a reference to the product block's unique ID.
I am trying to come up with a robust schema to accommodate the data I'm provided with, since I'll need to load it relatively soon, and the data they've given me doesn't seem to match the type of data they provide for demonstration on their sample website (http://www.iteminfo.com). In any event, I'm not looking to reuse their presentation structure so it's a moot point, but I was browsing the site to get some ideas of how to structure things.
What I'm unsure of is whether or not I should keep the data in this format, or for example consolidate Master/Department/Class/Subclass into a single "Categories" table, using a self-referencing relationship, and link that to a product block (product block should be kept separate as it's not a "category" as such, but a group of related products for a given category). Currently, the product blocks table references the subclass table, so this would change to "category_id" if I consolidate them together.
I am probably going to be creating an e-commerce storefront making use of this data with Ruby on Rails (or that's my plan, at any rate) so I'm trying to avoid getting snagged later on or having a bloated application - maybe I'm giving it too much thought but I'd rather be safe than sorry; our previous data was a real mess and cost the company tens of thousands of dollars in lost sales due to inconsistent and inaccurate data. Also I am going to break from the Rails conventions a little by making sure that my database is robust and enforces constraints (I plan on doing it at the application level, too), so that's something I need to consider as well.
How would you tackle a situation like this? Keep in mind that I have the data to be loaded already in flat files that mimic a table structure (I have documentation saying which columns are which and what references are set up); I'm trying to decide if I should keep them as normalized as they currently are, or if I should look to consolidate; I need to be aware of how each method will affect the way I program the site using Rails since if I do consolidate, there will be essentially 4 "levels" of categories in a single table, but that definitely seems more manageable than separate tables for each level, since apart from Subclass (which directly links to product blocks) they don't do anything except show the next level of category under them. I'm always at a loss for the "best" way to handle data like this - I know of the saying "Normalize until it hurts, then denormalize until it works" but I've never really had to implement it until now.
I would prefer the "hypernormalized" approach over a denormal data model. The self referencing table you mentioned might reduce the number of tables down and simplify life in some ways, but in general this type of relationship can be tricky to deal with. Hierarchical queries become a pain, as does mapping an object model to this (if you decide to go that route).
A couple of extra joins is not going to hurt and will keep the application more maintainable. Unless performance degrades due to the excessive number of joins, I would opt to leave things like they are. As an added bonus if any of these levels of tables needed additional functionality added, you will not run into issues because you merged them all into the self referencing table.
I totally disagree with the criticisms about self-referencing table structures for parent-child hierarchies. The linked list structure makes UI and business layer programming easier and more maintainable in most cases, since linked lists and trees are the natural way to represent this data in languages that the UI and business layers would typically be implemented in.
The criticism about the difficulty of maintaining data integrity constraints on these structures is perfectly valid, though the simple solution is to use a closure table that hosts the harder check constraints. The closure table is easily maintained with triggers.
The tradeoff is a little extra complexity in the DB (closure table and triggers) for a lot less complexity in UI and business layer code.
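A minimal sketch of the closure-table idea, using SQLite through Python's sqlite3 module. The table and column names and the category names are assumptions for illustration, and the trigger only covers inserts; moving or deleting a category would need further triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Self-referencing hierarchy: the "linked list" structure.
CREATE TABLE categories (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES categories(id),
    name      TEXT NOT NULL
);
-- Closure table: one row per (ancestor, descendant) pair, with the distance.
CREATE TABLE category_closure (
    ancestor_id   INTEGER NOT NULL,
    descendant_id INTEGER NOT NULL,
    depth         INTEGER NOT NULL,
    PRIMARY KEY (ancestor_id, descendant_id)
);
CREATE TRIGGER categories_ai AFTER INSERT ON categories
BEGIN
    -- Every node is its own ancestor at depth 0 ...
    INSERT INTO category_closure VALUES (NEW.id, NEW.id, 0);
    -- ... and inherits all of its parent's ancestors, one level further away.
    INSERT INTO category_closure
    SELECT ancestor_id, NEW.id, depth + 1
    FROM category_closure
    WHERE descendant_id = NEW.parent_id;
END;
""")

conn.executemany("INSERT INTO categories VALUES (?, ?, ?)", [
    (1, None, "Office Supplies"),
    (2, 1, "Writing"),
    (3, 2, "Pens"),
    (4, 3, "Ballpoint"),
])

# All descendants of "Writing", with no recursion at query time.
descendants = [row[0] for row in conn.execute(
    "SELECT descendant_id FROM category_closure "
    "WHERE ancestor_id = ? AND depth > 0 ORDER BY depth", (2,))]

# The level of a node is its distance from the root plus one; a rule such as
# "products attach only to level-4 categories" can be checked against this.
level_of_4 = conn.execute(
    "SELECT depth + 1 FROM category_closure "
    "WHERE ancestor_id = 1 AND descendant_id = 4").fetchone()[0]

print(descendants, level_of_4)
```

The application still reads and writes the simple parent_id column, while constraint checks and subtree queries go against the closure table.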
If I understand correctly, you want to take their separate tables and turn them into a hierarchy that's kept in a single table with a self-referencing FK.
This is generally a more flexible approach (for example, if you want to add a fifth level), BUT SQL and relational data models don't tend to work well with linked lists like this, even with new syntax like MS SQL Server's CTEs. Admittedly, CTEs make it much better though.
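For reference, this is roughly what walking a self-referencing table with a recursive CTE looks like. The sketch runs on SQLite through Python's sqlite3 module, so it uses WITH RECURSIVE; SQL Server writes the same query with a plain WITH. The table layout and sample rows are assumptions, with category names borrowed from the levels in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES categories(id),
    name      TEXT NOT NULL
);
INSERT INTO categories VALUES
    (1, NULL, 'Master Department'),
    (2, 1,    'Department'),
    (3, 2,    'Class'),
    (4, 3,    'Subclass'),
    (5, 1,    'Other Department');
""")

# Start at one node and repeatedly join children onto the rows found so far.
rows = conn.execute("""
    WITH RECURSIVE subtree(id, name, level) AS (
        SELECT id, name, 1 FROM categories WHERE id = ?
        UNION ALL
        SELECT c.id, c.name, s.level + 1
        FROM categories AS c
        JOIN subtree AS s ON c.parent_id = s.id
    )
    SELECT id, name, level FROM subtree ORDER BY level, id
""", (1,)).fetchall()

for row in rows:
    print(row)
```

The computed level column is what you would inspect to enforce rules about depth, which is the part that stays awkward to express as a declarative constraint.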
It can be difficult and costly to enforce things, like that a product must always be on the fourth level of the hierarchy, etc.
If you do decide to do it this way, then definitely check out Joe Celko's SQL for Smarties, which I believe has a section or two on modeling and working with hierarchies in SQL or better yet get his book that is devoted to the subject (Joe Celko's Trees and Hierarchies in SQL for Smarties).
Normalization implies data integrity, that is: each normal form reduces the number of situations where your data is inconsistent.
As a rule, denormalization has a goal of faster querying, but leads to increased space, increased DML time, and, last but not least, increased efforts to make data consistent.
One usually writes code faster (writes faster, not the code faster) and the code is less prone to errors if the data is normalized.
Self-referencing tables almost always turn out to be much worse to query and perform worse than normalized tables. Don't do it. It may look to you to be more elegant, but it is not and is a very poor database design technique. Personally, the structure you described sounds just fine to me, not hypernormalized. A properly normalized database (with foreign key constraints as well as default values, triggers (if needed for complex rules) and data validation constraints) is also far likelier to have consistent and accurate data. I agree about having the database enforce the rules; likely this is part of why the last application had bad data, because the rules were not enforced in the proper place and people were able to easily get around them. Not that the application shouldn't check as well (no point even sending an invalid date, for instance, for the database to fail on insert). Since you are redesigning, I would put more time and effort into designing the necessary constraints and choosing the correct data types (do not store dates as string data, for instance), than in trying to make the perfectly ordinary normalized structure look more elegant.
I would bring it in as close to their model as possible (and if at all possible, I would get files which match their schema - not a flattened version). If you bring the data directly into your model, what happens if data they send starts to break assumptions in the transformation to your internal application's model?
Better to bring their data in, run sanity checks and check that assumptions are not violated. Then if you do have an application-specific model, transform it into that for optimal use by your application.
Don't denormalize. Trying to achieve a good schema design by denormalizing is like trying to get to San Francisco by driving away from New York. It doesn't tell you which way to go.
In your situation, you want to figure out what a normalized schema would look like. You can base that largely on the source schema, but you need to learn what the functional dependencies (FD) in the data are. Neither the source schema nor the flattened files are guaranteed to reveal all the FDs to you.
Once you know what a normalized schema would look like, you now need to figure out how to design a schema that meets your needs. If that schema is somewhat less than fully normalized, so be it. But be prepared for difficulties in programming the transformation between the data in the flattened files and the data in your designed schema.
You said that the previous schema at your company cost tens of thousands of dollars due to inconsistency and inaccuracy. The more normalized your schema is, the more protected you are from internal inconsistency. This leaves you free to be more vigilant about inaccuracy. Consistent data that's consistently wrong can be as misleading as inconsistent data.
Is your storefront (or whatever it is you're building, not quite clear on that) always going to be using data from this supplier? Might you ever change suppliers or add additional different suppliers?
If so, design a general schema that meets your needs, and map the vendor data to it. Personally I'd rather suffer the (incredibly minor) 'pain' of a self-referencing Category (hierarchical) table than maintain four (apparently semi-useless) levels of Category variants and then next year find out they've added a 5th, or introduced a product line with only three...
For me, the real question is: what fits the model better?
It's like comparing a Tuple and a List.
Tuples are a fixed size and are heterogeneous -- they are "hypernormalized".
Lists are an arbitrary size and are homogeneous.
I use a Tuple when I need a Tuple and a List when I need a List; they fundamentally serve different purposes.
In this case, since the product structure is already well defined (and I assume not likely to change) then I would stick with the "Tuple approach". The real power/use of a List (or recursive table pattern) is when you need it to expand to an arbitrary depth, such as for a BOM or a genealogy tree.
I use both approaches in some of my databases depending upon the need. However, there is also the "hidden cost" of a recursive pattern, which is that not all ORMs (not sure about AR) support it well. Many modern DBs have support for "join-throughs" (Oracle), hierarchy IDs (SQL Server) or other recursive patterns. Another approach is to use a set-based hierarchy (which generally relies on triggers/maintenance). In any case, if the ORM used does not support recursive queries well, then there may be the extra "cost" of using the DB features directly -- either in terms of manual query/view generation or management such as triggers. If you don't use a funky ORM, or simply use a logic separator such as iBatis, then this issue may not even apply.
As far as performance, on new Oracle or SQL Server (and likely others) RDBMS, it ought to be very comparable so that would be the least of my worries: but check out the solutions available for your RDBMS and portability concerns.
Everybody who recommends against introducing a hierarchy in the database is considering only the option of a self-referencing table. This is not the only way to model a hierarchy in the database.
You may use a different approach, that provides you with easier and faster querying without using recursive queries.
Let's say you have a big set of nodes (categories) in your hierarchy:
Set1 = (Node1 Node2 Node3...)
Any node in this set can also be another set by itself, that contains other nodes or nested sets:
Node1=(Node2 Node3=(Node4 Node5=(Node6) Node7))
Now, how can we model that? Let's give each node two attributes that set the boundaries of the nodes it contains:
Node = { Id: int, Min: int, Max: int }
To model our hierarchy, we just assign those min/max values accordingly:
Node1 = { Id = 1, Min = 1, Max = 10 }
Node2 = { Id = 2, Min = 2, Max = 2 }
Node3 = { Id = 3, Min = 3, Max = 9 }
Node4 = { Id = 4, Min = 4, Max = 4 }
Node5 = { Id = 5, Min = 5, Max = 7 }
Node6 = { Id = 6, Min = 6, Max = 6 }
Node7 = { Id = 7, Min = 8, Max = 8 }
Now, to query all nodes under the Set/Node5:
select n.* from Nodes as n, Nodes as s
where s.Id = 5 and s.Min < n.Min and n.Max < s.Max
The only resource-consuming operation would be if you want to insert a new node, or move some node within the hierarchy, as many records will be affected, but this is fine, as the hierarchy itself does not change very often.
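The example above can be tried out directly. The sketch below loads the seven nodes into SQLite through Python's sqlite3 module and wraps the same Min/Max containment test in a small helper (the function name descendants is made up for illustration). The column names are quoted only to keep them clearly apart from the MIN and MAX aggregate functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Nodes (Id INTEGER PRIMARY KEY, '
    '"Min" INTEGER NOT NULL, "Max" INTEGER NOT NULL)'
)
# The Min/Max values from the example above.
conn.executemany("INSERT INTO Nodes VALUES (?, ?, ?)", [
    (1, 1, 10), (2, 2, 2), (3, 3, 9), (4, 4, 4),
    (5, 5, 7), (6, 6, 6), (7, 8, 8),
])

def descendants(node_id):
    """Ids of all nodes whose Min/Max range lies strictly inside the given node's."""
    rows = conn.execute('''
        SELECT n.Id
        FROM Nodes AS n
        JOIN Nodes AS s ON s."Min" < n."Min" AND n."Max" < s."Max"
        WHERE s.Id = ?
        ORDER BY n."Min"
    ''', (node_id,))
    return [row[0] for row in rows]

print(descendants(5))
print(descendants(3))
print(descendants(1))
```

Each call is a single non-recursive range comparison regardless of how deep the subtree is, which is the querying advantage claimed for this model.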