"Canonical" approach for mapping custom queries to hierarchical entities with user-defined key/value pairs - sql

In almost every SQL-based database application I have worked on so far, sooner or later the following three-faceted requirement has popped up:
There is some entity, linked in a hierarchical fashion (i.e. the tuples form a tree structure).
Users must be able to define any number of custom attributes with values for the tuples, and these values are inherited/overridden towards the leaves of the tree structure. ("Dumb" attributes usually suffice. That is, no uniqueness constraints, no foreign keys, only one value per attribute, ...)
Users must be able to run arbitrary queries on this data (i.e. custom boolean expressions, based upon filters for the values of the user-defined attributes that are linked with AND/OR).
Storing the data, roughly matching the first two bullets above, is quite straightforward:
The hierarchy is built up by giving the respective table a parent column. This column will be null for root nodes, and a pointer to the ID of the parent node for all other nodes.
The user-defined attributes are stored according to the entity-attribute-value pattern.
While there are numerous resources that suggest using a different approach, especially for the latter point (e.g. answers here, here, or here), I have not usually been in a position to move away from a traditional static relational database schema. Hence, let's simply assume the above as a given. Also, I could hardly ever rely on the specifics of a particular DBMS; the more usual case was systems that were supposed to work with MS SQL Server, Oracle, and possibly others as backends, without requiring two significantly different product versions.
Solving the third item, however, is always problematic (even without considering the hierarchical inheritance of attribute values). The number of joins depends on the number of distinct attributes referenced in the boolean expression. Alternatively, the number of joins can be reduced somewhat by determining the maximum number of distinct attributes referenced in any one branch of the custom boolean expression, which may save joins but makes the resulting queries, and the code used to generate them, even less intelligible and maintainable. For instance,
a = 5 or (b = 8 and c = 9)
can be answered with two joins to the attribute-value table.
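For illustration, here is one way such a query might be generated. This is a minimal sketch, assuming a hypothetical entity table and an EAV table attr_value(entity_id, attr_name, attr_value) with string-typed values; the two joins are shared by both branches of the OR:

SELECT DISTINCT e.id
FROM entity e
JOIN attr_value av1 ON av1.entity_id = e.id
JOIN attr_value av2 ON av2.entity_id = e.id
WHERE (av1.attr_name = 'a' AND av1.attr_value = '5')
   OR (av1.attr_name = 'b' AND av1.attr_value = '8'
       AND av2.attr_name = 'c' AND av2.attr_value = '9');

The DISTINCT is needed because joining the attribute rows multiplies rows per entity; the straightforward alternative of one join per referenced attribute (three here) is easier to generate but grows with the size of the expression.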
I have always been able to do this "somehow", but as this appears to be a fairly ubiquitous situation, I am looking for the "canonical" way to generate SQL queries in this situation. Is there a "standard pattern" to follow here?

Be careful not to fall prey to the inner-platform effect. It is a complicated problem, and SQL itself is designed to handle the complexities. Generate DDL to add and remove columns as needed, and generate simple SELECT statements for queries. Store each tuple type (distinct set of attributes) as its own table.
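As a rough sketch of what that generation could look like (table and column names here are invented, and ALTER TABLE syntax differs slightly between SQL Server and Oracle):

-- one generated table per tuple type, created when the user defines its attributes
CREATE TABLE product_attrs (
    product_id INT NOT NULL PRIMARY KEY,
    color      VARCHAR(50),
    price      DECIMAL(10, 2)
);

-- the user adds a new attribute: generate DDL instead of inserting EAV rows
ALTER TABLE product_attrs ADD weight_kg DECIMAL(8, 3);

-- user-defined filters then compile to plain WHERE clauses
SELECT product_id
FROM product_attrs
WHERE color = 'RED' OR (price < 10 AND weight_kg < 1);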
With regards to inheritance, I recommend handling it in the application or DAL, and only storing the non-inherited values. On retrieval, read all parent rows to calculate the functional values. If you do need to access "functional" values from SQL, use an indexed view or triggers to maintain them separate from storage.
Hierarchies can be represented as you describe, but a simple "Parent" column can make it difficult to query beyond a single level. Look at hierarchyid on SQL Server or CONNECT BY on Oracle.
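If you have to stay with the plain Parent column, a recursive CTE is the reasonably portable way to walk the tree (supported by SQL Server, and by Oracle from 11gR2 on). A rough sketch, assuming a hypothetical nodes(id, parent) table and a subtree rooted at id 42:

WITH subtree (id, parent) AS (
    SELECT id, parent FROM nodes WHERE id = 42
    UNION ALL
    SELECT n.id, n.parent
    FROM nodes n
    JOIN subtree s ON n.parent = s.id
)
SELECT id FROM subtree;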
Avoiding EAV stores allows you to:
Use indexes and statistics where needed
Keep efficient storage (ints stored as ints, money stored as money)
Keep understandable queries (SELECT * FROM vwProducts WHERE Color = 'RED' ORDER BY Price ASC)
If you want an EAV system because you have too many attributes (>1024 per type) or they are not somewhat statically defined (many changes per hour), I would avoid using a relational database in the first place. Use an EAV (NoSQL) database server instead.
tl;dr: If you have a schema, use DDL to tell the server about it. If you don't, use a more appropriate server.

Related

Are one-to-one related tables good for distributed SQL databases?

Suppose I have a User table, and other tables (e.g. UserSettings, UserStatistics) that have a one-to-one relationship with a user.
Since SQL databases don't store complex structs in table fields (some allow JSON fields with an undefined format), is it OK to just add such tables, allowing individual (complex) data to be stored for each user? Will it hurt performance by requiring more joins per query?
And in the distributed-database case, will those (connected) tables be saved randomly on different nodes, causing redundant requests between them and decreasing efficiency?
1:1 joins can definitely add overhead, especially in a distributed database. Using a JSON or other schema-less column is one way to avoid that, but there are others.
The simplest approach is a "wide table": instead of creating a new table UserSettings with columns a,b,c, add columns setting_a, setting_b, setting_c to your User table. You can still treat them as separate objects when using an ORM, it'll just need a little extra code.
Some databases (like CockroachDB which you've tagged in your question) let you subdivide a wide table into "column families". This tends to let you get the best of both worlds: the database knows to store rows for the same user on the same node, but also to let them be updated independently.
The main downside of using JSON columns is that they're harder to query efficiently. If you want all users with a certain setting, or want to know just one setting for a user, you're going to take at least a minor performance hit, either because the database has to parse a JSON column to figure that out, or because you have to fetch the entire blob and do it in your app. If they're more convenient for other reasons, though, you can work around this by adding inverted indexes on your JSON columns, or expression indexes on the specific values you're interested in. Indexes can have a similar cost to 1:1 joins, but you can mitigate that in CockroachDB by using the STORING keyword to tell the DB to write a copy of all the user columns to the index.
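To make the column-family and STORING ideas concrete, here is a rough sketch against CockroachDB (the table and settings are invented; double-check the syntax for your version):

CREATE TABLE users (
    id         INT PRIMARY KEY,
    name       STRING,
    setting_a  STRING,
    setting_b  STRING,
    stats_json JSONB,
    FAMILY core     (id, name),
    FAMILY settings (setting_a, setting_b),
    FAMILY stats    (stats_json)
);

-- inverted index so the JSON blob can be searched without scanning every row
CREATE INVERTED INDEX users_stats_idx ON users (stats_json);

-- an index on one setting that also stores name, avoiding the lookup back to the primary row
CREATE INDEX users_by_setting_a ON users (setting_a) STORING (name);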

Structuring Relational databases - combining similar and related tables

I am used to seeing relational databases where distinct entities are stored in different tables (simple example: Country, State, City). Recently I have been seeing more cases where distinct but similar entities are bundled into the same table, combined with different views. I suppose this can economize on tables and data access programs (maybe at the expense of clarity and flexibility). Re-reading the definition of normalized databases, I don't think this breaks any rules, but it seems less intuitive and a throwback to old mainframe "Miscellaneous" tables where you put anything that was forgotten in the design stage. See the two examples below: multi-table solution vs. single-table solution. Is this phenomenon part of a data or programming design pattern, and does it have a name?
If you have small dedicated tables, then the database can easily cache the ones it needs in memory.
If you take what would otherwise be small tables and cram them together into one, the database doesn't know which entries are important to cache and which aren't.
More importantly, there is more opportunity for errors because you can inadvertently type in the wrong type code and end up joining to something irrelevant, with no RI or typechecking to warn you. If you use small dedicated tables then you can specify RI constraints.
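For example, a minimal sketch of the dedicated-table version (names are invented); the REFERENCES clause is exactly the RI check you give up with a shared lookup table:

CREATE TABLE order_status (
    status_code VARCHAR(20) PRIMARY KEY,
    description VARCHAR(200) NOT NULL
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    status_code VARCHAR(20) NOT NULL REFERENCES order_status (status_code)
);

-- a typo'd status code is now rejected at INSERT time instead of silently joining to nothing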
Thinking back to a place where I saw the single monster-lookup-table pattern done, I think the attraction was that developers can add more kinds of entries without needing DBA intervention to create more tables. There were a lot of developers and only a few DBAs and this was how the DBAs avoided getting sucked into having to create dedicated lookup tables every time a new type of lookup entry was introduced. (Apparently granting create table rights in dev was not acceptable for the DBAs there.)
This seems like a workaround for environments where database schema changes are hard to come by. But another consideration is it may be easier to internationalize if all your entries are in one table.
And the pattern has an established name, it's called the One True Lookup Table. The linked article calls it out as an antipattern, and lists more downfalls of this technique. Here is the bulleted list from the article:
It makes the SQL look ugly.
Many statements will require multiple joins to the lookup table. The extra join columns make the statements look bigger and scarier. There will be the same number of joins when using separate lookup tables, but those joins will be simpler.
Multiple references to the same table can make it hard to determine what is happening in the execution plan, as you will see those repeated references there, and have to refer to the predicates to understand the context of table reference. If you were using separate lookup tables, it would be clear which table you were referring to at any point of the execution plan.
You can't foreign key to this type of table. Technically you can if you are willing to put both columns (lookup_type_code and lookup_key) in the table, but you won't because it is ugly. This means there is a good chance your data integrity will be compromised over time. It's really easy to foreign key to individual lookup tables, and therefore protect your data.
It's hard to control the contents of the table. It's a shared resource, so check constraints and triggers are problematic. If you need users to have different privileges, depending on which lookup they are dealing with, things are going to get messy. That would be really easy with separate lookup tables.
If you need to make a change for one reference type, like extending the size of the key or value, it affects all reference data. Using separate lookup tables isolates the change.
Over time, many reference tables take on additional data. To model that you would need to either split out that reference data from this shared lookup table, or start adding optional columns to cope with the "one-off" issues. A change like this is really simple for separate lookup tables.
Data types matter. You should always use the correct data type, as it will reduce the number of data type conversions needed. Implicit data type conversions are bugs waiting to happen!
Performance can be a problem with the OTLT approach, as it's hard for the optimizer to make sound judgements about the data. The optimizer cares about cardinality, but that is hard to estimate when you are dealing with a large number of rows, most of which are irrelevant in any one specific context. The optimizer also cares about high/low values, but these are not relevant to any one lookup; they are shared. We've also mentioned that you probably won't foreign key to this data, which further reduces the amount of information the optimizer has when making its decision. You may also have artificially made columns optional that are actually mandatory: a key must have a value, but which column? I think you get the message.
I think that if you only need a name dictionary (for spellchecking or something like that), the second approach is good enough. Otherwise, if the objects have some additional specific fields, the second approach is very bad.

Storing multiple entity orderings in a database

I have a database, with one table containing a set of entities - let's say films. I want users on the front end to be able to create and save multiple orderings of these films.
How could I sensibly store these orderings? What should I be thinking about (for making the database schema nice, or for performance, etc.)?
In this particular case, I'm only expecting about 100 films in the database, and probably fewer than 10 saved orderings, so performance is unlikely to be a major issue, although in theory it could get much larger.
My ideas so far, without restricting for niceness, include:
1. Having a table for orderings, storing ordering id, ordering name, and a JSON.stringified version of the film ids.
2. Having a table storing ordering id and ordering name, and a separate table storing, somehow (a linked list?), several sets of actual orders - by storing as you might a single order, but adding an order id. (This second table might then contain a record for every pair of film and order...)
Your ideas basically cover it, omitting some implementation-specific options such as Postgres' array field support which is for practical purposes a different way to accomplish #1.
#1 is compact and easy to set up, but lacks referential integrity: if you delete a film, you have to ensure it is removed from all lists yourself. #2 involves more work and a slightly more complex structure, but the payoff is that the relationships are tracked by the database; taking the example of deleting a film, you have the option to prevent deletion of any film on a list, or of cascading the deletion to remove the entry from all lists.
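A minimal sketch of #2 using a position column rather than a linked list (the films table and the column names are assumptions for the example):

CREATE TABLE ordering (
    ordering_id   INT PRIMARY KEY,
    ordering_name VARCHAR(100) NOT NULL
);

CREATE TABLE ordering_item (
    ordering_id INT NOT NULL REFERENCES ordering (ordering_id) ON DELETE CASCADE,
    film_id     INT NOT NULL REFERENCES films (film_id) ON DELETE CASCADE,
    position    INT NOT NULL,
    PRIMARY KEY (ordering_id, film_id),
    UNIQUE (ordering_id, position)
);

-- read one saved ordering back, in order
SELECT film_id FROM ordering_item WHERE ordering_id = 1 ORDER BY position;

Here deleting a film simply removes it from every saved ordering; change the film_id reference to a plain REFERENCES (no CASCADE) if you prefer to block the deletion instead.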
Each has its advantages and disadvantages. Which is appropriate really comes down to your specific use case.

How is a SQL table different from an array of structs? (In terms of usage, not implementation)

Here's a simple example of an SQL table:
CREATE TABLE persons
(
id INTEGER,
name VARCHAR(255),
height DOUBLE
);
Since I haven't used SQL very much, I haven't yet learned to think in its terms. Effectively, my brain translates the above into this:
struct Person
{
int id;
string name;
double height;
Person(int id_, const char* name_, double height_)
:id(id_),name(name_),height(height_)
{}
};
Person persons[64];
Then, inserting some elements, in SQL:
INSERT INTO persons (id, name, height) VALUES (1234, 'Frank', 5.125);
INSERT INTO persons (id, name, height) VALUES (5678, 'Jesse', 6.333);
...and how I'm thinking of it:
persons[0] = Person(1234, "Frank", 5.125);
persons[1] = Person(5678, "Jesse", 6.333);
I've read that SQL can be thought of as two major parts: data manipulation and data definition. I'm more concerned about organizing my data in the first place, as opposed to querying and modifying it. There, the distinctions of SQL are immediately obvious. To me, it seems like the subtleties of how data can and should be structured in SQL is a more obscure topic. Where does the array-of-structs analogy I'm automatically drawing for myself break down?
To give a concrete example, let's say that I want each entry in my persons table (or each of my Person objects) to contain a field denoting the names of that person's children (actual fruit-of-your-loins children, not hierarchical data structure children). In reality, these would probably be cross-table references (or pointers to objects), but let's keep things simple and make this field contain zero or more names. In my C++ example, I'd modify the declaration like so:
vector<string> namesOfChildren;
...and do something like this:
persons[0].namesOfChildren.push_back("John");
persons[0].namesOfChildren.push_back("Jane");
But, from what I can tell, the typical usage of SQL doesn't mirror this approach. If I'm wrong and there's a simple, straightforward solution, great. If not, I'm sure a SQL novice like myself could benefit greatly from a little cogitation on the subject of how databases of SQL tables are meant to be used in contrast to bare, generic data structures.
To me, it seems like the subtleties of how data can and should be structured in SQL is a more obscure topic.
It's called "data(base) modeling" and is somewhere between an engineering discipline and an art (like much of computer programming). If you are really interested in the topic, take a look at the ERwin Methods Guide.
Where does the array-of-structs analogy I'm automatically drawing for myself break down?
At persistency, concurrency, consistency and scalability.
Persistency: The table is automatically saved to the permanent storage. It'll stay there and survive reboots (not that a real database server will reboot much) until you explicitly delete it or there is a catastrophic hardware failure. DBMSes have well-oiled backup procedures that should help in the latter case.
Concurrency: Tables are meant to be accessed and (need be) modified by many clients concurrently. Mechanisms such as locking and multi-version concurrency control are employed to ensure clients will not "step on each other's toes".
Consistency: You can define certain constraints (such as uniqueness, foreign keys or checks) and the DBMS will make sure they are never broken. Furthermore, this can often be done in a declarative manner, minimizing the chance of errors (a small sketch follows these four points). On top of that, everything you do in a database is transactional, so you reap the benefits of atomicity, consistency, isolation and durability (aka "ACID"). In a nutshell, the database will defend itself from bad data.
Scalability: A well designed database schema can grow well beyond the confines of the available RAM, and still keep good performance, using techniques such as indexing, partitioning, clustering etc... Furthermore, SQL is declarative and set-based, which means that the DBMS has the latitude to pick the optimal "query execution plan" for the data at hand, auto-parallelize the query, cache the results in hope they will be reused etc... without changing the meaning of the query.
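As a small example of the declarative constraints mentioned under "Consistency", here is a version of the persons table from the question with a few rules added (the rules themselves are invented for illustration):

CREATE TABLE persons
(
    id     INTEGER PRIMARY KEY,            -- no two persons share an id
    name   VARCHAR(255) NOT NULL,          -- a person must have a name
    height DOUBLE CHECK (height > 0)       -- the DBMS rejects nonsensical heights
);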
Your analogy to the array of structs is not bad ... for the beginning.
After this beginning the differences start in relation to organizing data.
Database people love their "Normal Forms" laws. We do not have these laws in C++ or similar programming languages. Organizing data in tables according to these laws helps database engines do their magic (queries, joins) better, i.e. keep databases compact, crunch millions of rows in fractions of a second, and allow multiple requests concurrently. They are not absolute laws: the 1NF (1st Normal Form) is followed in 99.9999% of cases, but the bigger the number (2NF, 3NF, ...), the more often DB designers allow themselves to deviate from them.
Description of normal forms can be found for example here.
I will try to illustrate differences on your example.
In your example the fields of your struct correspond to the columns of the database table. Adding a vector of names as a new field of the struct would correspond to adding a comma-separated list of names to a new column of your table. This violates 1NF, which demands that one cell hold one value, not a list of values. To normalize your data you will need two arrays: one of Person structs, and a new one of structs for Child. While in C++ we can use plain pointers to link each child to its parent, in SQL we must use the mechanism of the key. You already added an id field to the Person struct; now we need to add a ParentId field to the Child struct so that the database engine can find the parent. The ParentId column is called a foreign key. Another way to satisfy 1NF, instead of creating the new table/struct for children, is to switch to child-centric thinking and have just one table with a record per child, which includes all the information about that child's parent. The information about the parent will obviously be repeated in as many records as the parent has children.
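In SQL that might look roughly like the following sketch (assuming persons.id has been made the primary key; the children table and its columns are invented for the example):

CREATE TABLE children
(
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL REFERENCES persons (id),   -- the foreign key standing in for a C++ pointer
    name      VARCHAR(255) NOT NULL
);

-- "persons[0].namesOfChildren" becomes a query
SELECT c.name FROM children c WHERE c.parent_id = 1234;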
Note (this is also considered part of 1NF) that while in an array of structs we always know the order of the elements, in databases it is up to the engine in what order to store the records. It is just a mathematical unordered set of records; the engine can re-sort it in internal storage for various optimizations as it likes. When you retrieve the records from the database with a SELECT statement, if you care about the order, you need to provide an ORDER BY clause.
2NF is about removing repetitions from your records. Imagine your Person struct also had place-of-work-related fields, say the name of the company and the company address. If many Persons in your dataset work for the same company, you would repeat the address of the company in their records. We probably wouldn't allow those repetitions in C++ either, but nevertheless extracting them into a separate table is what 2NF demands. Strictly speaking, even if there are no repetitions and all your Persons work in different places, 2NF still requires extracting the data about workplaces into a separate table, because it requires that one table represent one entity.
3NF is about removing transitive dependency and is considered kind of optional, so I will not describe it here. See link above.
Another feature of databases quite different from conventional programming of data structures in C++ is the database index. Simplifying, an index is just a copy of a column (or columns), i.e. a vertical slice, kept in a separate structure where the values are stored in their natural order and each entry retains a reference to the whole record. So, in your example, to create an index by height you would create another array of 64 elements of the new
struct HeightIndexElem
{
double height;
Person* pFullRecord;
};
and sort them by height in this array. This allows the DB engine to automatically optimize certain queries; the database engine itself decides when to use a particular index. In C++ we usually create maps (Dictionaries in C#) to speed up finding an element by a certain characteristic, but we must use these maps ourselves - there is no automatic aspect.
There are major differences:-
SQL tables are persistent -- (English translation: written to disk)
They are transactional -- (really written to disk)
They can be an arbitrary size -- (tables of several hundred million rows are quite common)
They support relational algebra -- (Joins with other tables, filtering etc.)
Relational Algebra is provable -- For a given SELECT statement there is only one possible correct answer.
The biggest differences are that when you "UPDATE" and "COMMIT" you know your data is saved in the database and will be there until you decide to "DELETE" it. When you update a structure within an array, it's gone when the machine is switched off.
The other big difference is scale. The size of a modern DBMS is only limited by your hard disk budget.
[I really like farfareast's answer from an academic stand point, but I feel the need to add a more practically oriented answer too.]
SQL tables themselves are "bare, generic data structures" as you call C++'s structures. They are only different data structures: a table is always an array of (fixed size) structs and the only pointers you can use are foreign keys.
For example, when you are adding a vector<string> to your struct, you are already using pointers internally, as strings are only a "fancy" way of writing char*. This would already require a second table in SQL (using a secondary index column to keep the elements in order). Of course there are things like PostgreSQL's arrays that can help in this specific case, but those are "only" shortcuts for similar hand-writeable constructs.
The real difference in data structures and algorithms comes from the fact that you can easily add declarations of index structures. Say you know you need to always access Persons in the order of their height. In C++ you'd use some kind of tree or sorted list to keep them in order; there is an STL container for that. The downside is that when you need to access them in a different order (say by name), you'll have to add a second tree and duplicate the data, or start using pointers to Persons. If you add a Person, you need to update all containers, and so on. This becomes cumbersome, and soon you'll be on the front page of The Daily WTF. SQL tables, on the other hand, can have attached indexes which automatically keep up with new and changed data. Of course their maintenance must also be paid for in performance, but managing them basically comes down to deciding which indexes your access patterns require - something needed in every case - and declaring them. In contrast to having to rewrite large parts of an application, this is a much more favorable situation.
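The declaration itself is a one-liner; assuming the persons table from the question, something like:

CREATE INDEX ix_persons_height ON persons (height);
CREATE INDEX ix_persons_name   ON persons (name);

-- the optimizer can now choose to serve either of these from an index,
-- without any change to application code
SELECT * FROM persons ORDER BY height;
SELECT * FROM persons WHERE name = 'Frank';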

Dealing with "hypernormalized" data

My employer, a small office supply company, is switching suppliers and I am looking through their electronic content to come up with a robust database schema; our previous schema was pretty much just thrown together without any thought at all, and it's pretty much led to an unbearable data model with corrupt, inconsistent information.
The new supplier's data is much better than the old one's, but their data is what I would call hypernormalized. For example, their product category structure has 5 levels: Master Department, Department, Class, Subclass, Product Block. In addition the product block content has the long description, search terms and image names for products (the idea is that a product block contains a product and all variations - e.g. a particular pen might come in black, blue or red ink; all of these items are essentially the same thing, so they apply to a single product block). In the data I've been given, this is expressed as the products table (I say "table" but it's a flat file with the data) having a reference to the product block's unique ID.
I am trying to come up with a robust schema to accommodate the data I'm provided with, since I'll need to load it relatively soon, and the data they've given me doesn't seem to match the type of data they provide for demonstration on their sample website (http://www.iteminfo.com). In any event, I'm not looking to reuse their presentation structure so it's a moot point, but I was browsing the site to get some ideas of how to structure things.
What I'm unsure of is whether or not I should keep the data in this format, or for example consolidate Master/Department/Class/Subclass into a single "Categories" table, using a self-referencing relationship, and link that to a product block (product block should be kept separate as it's not a "category" as such, but a group of related products for a given category). Currently, the product blocks table references the subclass table, so this would change to "category_id" if I consolidate them together.
I am probably going to be creating an e-commerce storefront making use of this data with Ruby on Rails (or that's my plan, at any rate) so I'm trying to avoid getting snagged later on or having a bloated application - maybe I'm giving it too much thought but I'd rather be safe than sorry; our previous data was a real mess and cost the company tens of thousands of dollars in lost sales due to inconsistent and inaccurate data. Also I am going to break from the Rails conventions a little by making sure that my database is robust and enforces constraints (I plan on doing it at the application level, too), so that's something I need to consider as well.
How would you tackle a situation like this? Keep in mind that I have the data to be loaded already in flat files that mimic a table structure (I have documentation saying which columns are which and what references are set up). I'm trying to decide whether I should keep them as normalized as they currently are, or whether I should consolidate. I need to be aware of how each method will affect the way I program the site using Rails: if I do consolidate, there will be essentially 4 "levels" of categories in a single table, but that definitely seems more manageable than separate tables for each level, since apart from Subclass (which directly links to product blocks) they don't do anything except show the next level of category under them. I'm always at a loss for the "best" way to handle data like this - I know the saying "Normalize until it hurts, then denormalize until it works", but I've never really had to implement it until now.
I would prefer the "hypernormalized" approach over a denormal data model. The self referencing table you mentioned might reduce the number of tables down and simplify life in some ways, but in general this type of relationship can be tricky to deal with. Hierarchical queries become a pain, as does mapping an object model to this (if you decide to go that route).
A couple of extra joins is not going to hurt and will keep the application more maintainable. Unless performance degrades due to the excessive number of joins, I would opt to leave things like they are. As an added bonus if any of these levels of tables needed additional functionality added, you will not run into issues because you merged them all into the self referencing table.
I totally disagree with the criticisms about self-referencing table structures for parent-child hierarchies. The linked list structure makes UI and business layer programming easier and more maintainable in most cases, since linked lists and trees are the natural way to represent this data in languages that the UI and business layers would typically be implemented in.
The criticism about the difficulty of maintaining data integrity constraints on these structures is perfectly valid, though the simple solution is to use a closure table that hosts the harder check constraints. The closure table is easily maintained with triggers.
The tradeoff is a little extra complexity in the DB (closure table and triggers) for a lot less complexity in UI and business layer code.
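For reference, a closure table for such a category hierarchy might look roughly like this (names are invented, and the triggers that keep it in sync with the adjacency column are omitted):

CREATE TABLE category (
    category_id INT PRIMARY KEY,
    parent_id   INT NULL REFERENCES category (category_id),
    name        VARCHAR(100) NOT NULL
);

-- one row per (ancestor, descendant) pair, including depth 0 for the node itself
CREATE TABLE category_closure (
    ancestor_id   INT NOT NULL REFERENCES category (category_id),
    descendant_id INT NOT NULL REFERENCES category (category_id),
    depth         INT NOT NULL,
    PRIMARY KEY (ancestor_id, descendant_id)
);

-- all descendants of category 42, no recursion needed
SELECT c.*
FROM category_closure cc
JOIN category c ON c.category_id = cc.descendant_id
WHERE cc.ancestor_id = 42 AND cc.depth > 0;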
If I understand correctly, you want to take their separate tables and turn them into a hierarchy that's kept in a single table with a self-referencing FK.
This is generally a more flexible approach (for example, if you want to add a fifth level), BUT SQL and relational data models don't tend to work well with linked lists like this, even with newer syntax like MS SQL Server's CTEs. Admittedly, CTEs make it much better though.
It can be difficult and costly to enforce things, like that a product must always be on the fourth level of the hierarchy, etc.
If you do decide to do it this way, then definitely check out Joe Celko's SQL for Smarties, which I believe has a section or two on modeling and working with hierarchies in SQL or better yet get his book that is devoted to the subject (Joe Celko's Trees and Hierarchies in SQL for Smarties).
Normalization implies data integrity, that is: each normal form reduces the number of situations where your data is inconsistent.
As a rule, denormalization has a goal of faster querying, but leads to increased space, increased DML time, and, last but not least, increased efforts to make data consistent.
One usually writes code faster (writes it faster, not faster code) and the code is less prone to errors if the data is normalized.
Self-referencing tables almost always turn out to be much harder to query and perform worse than normalized tables. Don't do it. It may look more elegant to you, but it is not, and it is a very poor database design technique. Personally, the structure you described sounds just fine to me, not hypernormalized. A properly normalized database (with foreign key constraints as well as default values, triggers (if needed for complex rules) and data validation constraints) is also far likelier to have consistent and accurate data. I agree about having the database enforce the rules; likely this is part of why the last application had bad data - the rules were not enforced in the proper place and people were able to easily get around them. Not that the application shouldn't check as well (no point in even sending an invalid date, for instance, for the database to fail on insert). Since you are redesigning, I would put more time and effort into designing the necessary constraints and choosing the correct data types (do not store dates as string data, for instance) than into trying to make the perfectly ordinary normalized structure look more elegant.
I would bring it in as close to their model as possible (and if at all possible, I would get files which match their schema - not a flattened version). If you bring the data directly into your model, what happens if data they send starts to break assumptions in the transformation to your internal application's model?
Better to bring their data in, run sanity checks and check that assumptions are not violated. Then if you do have an application-specific model, transform it into that for optimal use by your application.
Don't denormalize. Trying to achieve a good schema design by denormalizing is like trying to get to San Francisco by driving away from New York. It doesn't tell you which way to go.
In your situation, you want to figure out what a normalized schema would look like. You can base that largely on the source schema, but you need to learn what the functional dependencies (FDs) in the data are. Neither the source schema nor the flattened files are guaranteed to reveal all the FDs to you.
Once you know what a normalized schema would look like, you now need to figure out how to design a schema that meets your needs. If that schema is somewhat less than fully normalized, so be it. But be prepared for difficulties in programming the transformation between the data in the flattened files and the data in your designed schema.
You said that previous schemas at your company cost tens of thousands of dollars due to inconsistency and inaccuracy. The more normalized your schema is, the more protected you are from internal inconsistency. This leaves you free to be more vigilant about inaccuracy. Consistent data that's consistently wrong can be as misleading as inconsistent data.
Is your storefront (or whatever it is you're building; not quite clear on that) always going to be using data from this supplier? Might you ever change suppliers or add additional, different suppliers?
If so, design a general schema that meets your needs, and map the vendor data to it. Personally, I'd rather suffer the (incredibly minor) 'pain' of a self-referencing Category (hierarchical) table than maintain four (apparently semi-useless) levels of Category variants, and then next year find out they've added a 5th, or introduced a product line with only three...
For me, the real question is: what fits the model better?
It's like comparing a Tuple and a List.
Tuples are a fixed size and are heterogeneous -- they are "hypernormalized".
Lists are an arbitrary size and are homogeneous.
I use a Tuple when I need a Tuple and a List when I need a List; they fundamentally serve different purposes.
In this case, since the product structure is already well defined (and I assume not likely to change) then I would stick with the "Tuple approach". The real power/use of a List (or recursive table pattern) is when you need it to expand to an arbitrary depth, such as for a BOM or a genealogy tree.
I use both approaches in some of my databases, depending on the need. However, there is also the "hidden cost" of a recursive pattern, which is that not all ORMs (not sure about AR) support it well. Many modern DBs have support for "join-throughs" (Oracle), hierarchy IDs (SQL Server) or other recursive patterns. Another approach is to use a set-based hierarchy (which generally relies on triggers/maintenance). In any case, if the ORM used does not support recursive queries well, then there may be the extra "cost" of using the DB features directly - either in terms of manual query/view generation or management such as triggers. If you don't use a funky ORM, or simply use a logic separator such as iBatis, then this issue may not even apply.
As far as performance, on new Oracle or SQL Server (and likely others) RDBMS, it ought to be very comparable so that would be the least of my worries: but check out the solutions available for your RDBMS and portability concerns.
Everybody who recommends against introducing a hierarchy in the database is considering only the option of a self-referencing table. That is not the only way to model a hierarchy in the database.
You may use a different approach that gives you easier and faster querying, without recursive queries.
Let's say you have a big set of nodes (categories) in your hierarchy:
Set1 = (Node1 Node2 Node3...)
Any node in this set can itself be another set that contains other nodes or nested sets:
Node1=(Node2 Node3=(Node4 Node5=(Node6) Node7))
Now, how can we model that? Let's have each node carry two attributes that set the boundaries of the nodes it contains:
Node = { Id: int, Min: int, Max: int }
To model our hierarchy, we just assign those min/max values accordingly:
Node1 = { Id = 1, Min = 1, Max = 10 }
Node2 = { Id = 2, Min = 2, Max = 2 }
Node3 = { Id = 3, Min = 3, Max = 9 }
Node4 = { Id = 4, Min = 4, Max = 4 }
Node5 = { Id = 5, Min = 5, Max = 7 }
Node6 = { Id = 6, Min = 6, Max = 6 }
Node7 = { Id = 7, Min = 8, Max = 8 }
Now, to query all nodes under the Set/Node5:
select n.* from Nodes as n, Nodes as s
where s.Id = 5 and s.Min < n.Min and n.Max < s.Max
The only resource-consuming operation would be if you want to insert a new node, or move some node within the hierarchy, as many records will be affected, but this is fine, as the hierarchy itself does not change very often.
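For completeness, the Nodes set above might be declared roughly like this (names follow the example; depending on the DBMS you may need to quote Min and Max):

CREATE TABLE Nodes (
    Id   INT PRIMARY KEY,
    Min  INT NOT NULL,
    Max  INT NOT NULL,
    Name VARCHAR(100),
    CHECK (Min <= Max)
);

-- the range comparison in the SELECT above is what this index makes cheap
CREATE INDEX ix_nodes_min_max ON Nodes (Min, Max);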