Database design: should nested objects have their own table? - sql

I'm somewhat new to database design, so I'd like some pointers on how best to lay my current tables out.
I have a table Jobs that holds various jobs. Users can create Subjobs. A Subjob has a Job as a parent. A Subjob has all the same properties as a Job, but some of them are read-only, whereas they are all read/write for a Job. A Job can have many Subjobs. At the moment, there may only be one layer of subjobs, but I'd like the flexibility to allow for infinite nesting of Subjobs in the future. The objects will be interacted with through a MVC web app.
I've considered two options for layout:
Jobs and Subjobs each have their own table.
This seems like "good design" because I don't introduce columns in Job with the sole purpose of nesting with itself.
It's a bit of a pain for coding the web app, since a Job and Subjob would have to have two separate Controllers/sets of Views, despite them being identical in properties.
It makes less sense from a design perspective if infinite nesting is introduced.
Jobs and Subjobs are on the same table. Jobs are just given a nullable parent_job_id property that is non-null if it is a Subjob.
Makes sense for infinite nesting.
Less of a pain for coding the web app.
A weird nesting property is introduced to the Job table that has nothing to do with the actual properties of a Job.
Any advice on how to handle this? Are there additional design patterns I haven't considered? I'm using Entity Framework 6 Code First, if that matters.

The first option is fine if you were not describing a hierarchy with multiple levels, but you are. The pattern you are describing is commonly known as an Adjacency List which is stored as you describe your second option.
Some other options for storing a hierarchy are:
Nested sets (more complicated implementation but potentially faster queries without recursion).
Materialized Path (Stores a character representation of the hierarchy path, e.g. like a file storage system)
Modifications / Helper tables for Adjacency List (Flat table, bridge table)
Custom implementations like HierarchyId
Hierarchy Reference:
StackOverflow - What are the options for storing hierarchical data in a relational-database
Louis Davidson - Presentations - How to Optimize a Hierarchy In SQL Server - Presentations & Demo Code

Related

Is it better to use entity-arrtibute-value model over storing various different product in single description text column? [duplicate]

It is safe to say that the EAV/CR database model is bad. That said,
Question: What database model, technique, or pattern should be used to deal with "classes" of attributes describing e-commerce products which can be changed at run time?
In a good E-commerce database, you will store classes of options (like TV resolution then have a resolution for each TV, but the next product may not be a TV and not have "TV resolution"). How do you store them, search efficiently, and allow your users to setup product types with variable fields describing their products? If the search engine finds that customers typically search for TVs based on console depth, you could add console depth to your fields, then add a single depth for each tv product type at run time.
There is a nice common feature among good e-commerce apps where they show a set of products, then have "drill down" side menus where you can see "TV Resolution" as a header, and the top five most common TV Resolutions for the found set. You click one and it only shows TVs of that resolution, allowing you to further drill down by selecting other categories on the side menu. These options would be the dynamic product attributes added at run time.
Further discussion:
So long story short, are there any links out on the Internet or model descriptions that could "academically" fix the following setup? I thank Noel Kennedy for suggesting a category table, but the need may be greater than that. I describe it a different way below, trying to highlight the significance. I may need a viewpoint correction to solve the problem, or I may need to go deeper in to the EAV/CR.
Love the positive response to the EAV/CR model. My fellow developers all say what Jeffrey Kemp touched on below: "new entities must be modeled and designed by a professional" (taken out of context, read his response below). The problem is:
entities add and remove attributes weekly (search keywords dictate future attributes)
new entities arrive weekly (products are assembled from parts)
old entities go away weekly (archived, less popular, seasonal)
The customer wants to add attributes to the products for two reasons:
department / keyword search / comparison chart between like products
consumer product configuration before checkout
The attributes must have significance, not just a keyword search. If they want to compare all cakes that have a "whipped cream frosting", they can click cakes, click birthday theme, click whipped cream frosting, then check all cakes that are interesting knowing they all have whipped cream frosting. This is not specific to cakes, just an example.
There's a few general pros and cons I can think of, there are situations where one is better than the other:
Option 1, EAV Model:
Pro: less time to design and develop a simple application
Pro: new entities easy to add (might even
be added by users?)
Pro: "generic" interface components
Con: complex code required to validate simple data types
Con: much more complex SQL for simple
reports
Con: complex reports can become almost
impossible
Con: poor performance for large data sets
Option 2, Modelling each entity separately:
Con: more time required to gather
requirements and design
Con: new entities must be modelled and
designed by a professional
Con: custom interface components for each
entity
Pro: data type constraints and validation simple to implement
Pro: SQL is easy to write, easy to
understand and debug
Pro: even the most complex reports are relatively simple
Pro: best performance for large data sets
Option 3, Combination (model entities "properly", but add "extensions" for custom attributes for some/all entities)
Pro/Con: more time required to gather requirements and design than option 1 but perhaps not as much as option 2 *
Con: new entities must be modelled and designed by a professional
Pro: new attributes might be easily added later on
Con: complex code required to validate simple data types (for the custom attributes)
Con: custom interface components still required, but generic interface components may be possible for the custom attributes
Con: SQL becomes complex as soon as any custom attribute is included in a report
Con: good performance generally, unless you start need to search by or report by the custom attributes
* I'm not sure if Option 3 would necessarily save any time in the design phase.
Personally I would lean toward option 2, and avoid EAV wherever possible. However, for some scenarios the users need the flexibility that comes with EAV; but this comes with a great cost.
It is safe to say that the EAV/CR database model is bad.
No, it's not. It's just that they're an inefficient usage of relational databases. A purely key/value store works great with this model.
Now, to your real question: How to store various attributes and keep them searchable?
Just use EAV. In your case it would be a single extra table. index it on both attribute name and value, most RDBMs would use prefix-compression to on the attribute name repetitions, making it really fast and compact.
EAV/CR gets ugly when you use it to replace 'real' fields. As with every tool, overusing it is 'bad', and gives it a bad image.
// At this point, I'd like to take a moment to speak to you about the Magento/Adobe PSD format.
// Magento/PSD is not a good ecommerce platform/format. Magento/PSD is not even a bad ecommerce platform/format. Calling it such would be an
// insult to other bad ecommerce platform/formats, such as Zencart or OsCommerce. No, Magento/PSD is an abysmal ecommerce platform/format. Having
// worked on this code for several weeks now, my hate for Magento/PSD has grown to a raging fire
// that burns with the fierce passion of a million suns.
http://code.google.com/p/xee/source/browse/trunk/XeePhotoshopLoader.m?spec=svn28&r=11#107
The internal models are wacky at best, like someone put the schema into a boggle game, sealed that and put it in a paint shacker...
Real world: I'm working on a midware fulfilment app and here are one the queries to get address information.
CREATE OR REPLACE VIEW sales_flat_addresses AS
SELECT sales_order_entity.parent_id AS order_id,
sales_order_entity.entity_id,
CONCAT(CONCAT(UCASE(MID(sales_order_entity_varchar.value,1,1)),MID(sales_order_entity_varchar.value,2)), "Address") as type,
GROUP_CONCAT(
CONCAT( eav_attribute.attribute_code," ::::: ", sales_order_entity_varchar.value )
ORDER BY sales_order_entity_varchar.value DESC
SEPARATOR '!!!!!'
) as data
FROM sales_order_entity
INNER JOIN sales_order_entity_varchar ON sales_order_entity_varchar.entity_id = sales_order_entity.entity_id
INNER JOIN eav_attribute ON eav_attribute.attribute_id = sales_order_entity_varchar.attribute_id
AND sales_order_entity.entity_type_id =12
GROUP BY sales_order_entity.entity_id
ORDER BY eav_attribute.attribute_code = 'address_type'
Exacts address information for an order, lazily
--
Summary: Only use Magento if:
You are being given large sacks of money
You must
Enjoy pain
I'm surprised nobody mentioned NoSQL databases.
I've never practiced NoSQL in a production context (just tested MongoDB and was impressed) but the whole point of NoSQL is being able to save items with varying attributes in the same "document".
Where performance is not a major requirement, as in an ETL type of application, EAV has another distinct advantage: differential saves.
I've implemented a number of applications where an over-arching requirement was the ability to see the history of a domain object from its first "version" to it's current state. If that domain object has a large number of attributes, that means each change requires a new row be inserted into it's corresponding table (not an update because the history would be lost, but an insert). Let's say this domain object is a Person, and I have 500k Persons to track with an average of 100+ changes over the Persons life-cycle to various attributes. Couple that with the fact that rare is the application that has only 1 major domain object and you'll quickly surmize that the size of the database would quickly grow out of control.
An easy solution is to save only the differential changes to the major domain objects rather than repeatedly saving redundant information.
All models change over time to reflect new business needs. Period. Using EAV is but one of the tools in our box to use; but it should never be automatically classified as "bad".
I'm struggling with the same issue. It may be interesting for you to check out the following discussion on two existing ecommerce solutions: Magento (EAV) and Joomla (regular relational structure):
https://forum.virtuemart.net/index.php?topic=58686.0
It seems, that Magento's EAV performance is a real showstopper.
That's why I'm leaning towards a normalized structure. To overcome the lack of flexibility I'm thinking about adding some separate data dictionary in the future (XML or separate DB tables) that could be edited, and based on that, application code for displaying and comparing product categories with new attributes set would be generated, together with SQL scripts.
Such architecture seems to be the sweetspot in this case - flexible and performant at the same time.
The problem could be frequent use of ALTER TABLE in live environment. I'm using Postgres, so its MVCC and transactional DDL will hopefully ease the pain.
I still vote for modeling at the lowest-meaningful atomic-level for EAV. Let standards, technologies and applications that gear toward certain user community to decide content models, repetition needs of attributes, grains, etc.
If it's just about the product catalog attributes and hence validation requirements for those attributes are rather limited, the only real downside to EAV is query performance and even that is only a problem when your query deals with multiple "things" (products) with attributes, the performance for the query "give me all attributes for the product with id 234" while not optimal is still plenty fast.
One solution is to use the SQL database / EAV model only for the admin / edit side of the product catalog and have some process that denormalizes the products into something that makes it searchable. Since you already have attributes and hence it's rather likely that you want faceting, this something could be Solr or ElasticSearch. This approach avoids basically all downsides to the EAV model and the added complexity is limited to serializing a complete product to JSON on update.
EAV has many drawbacks:
Performance degradation over time
Once the amount of data in the application grows beyond a certain size, the retrieval and manipulation of that data is likely to become less and less efficient.
The SQL queries are very complex and difficult to write.
Data Integrity problems.
You can't define foreign keys for all the fields needed.
You have to define and maintain your own metadata.
I have a slightly different problem: instead of many attributes with sparse values (which is possibly a good reason to use EAV), I want to store something more like a spreadsheet. The columns in the sheet can change, but within a sheet all cells will contain data (not sparse).
I made a small set of tests to benchmark two designs: one using EAV, and the other using a Postgres ARRAY to store cell data.
EAV
Array
Both schemas have indexes on appropriate columns, and the indexes are used by the planner.
It turned out the array-based schema was an order of magnitude faster for both inserts and queries. From quick tests, it seemed that both scaled linearly. The tests aren't very thorough, though. Suggestions and forks welcome - they're under an MIT licence.

Is there any situation in which you would use an Object table and an ObjectInstance table in a relational database

Is there any situation in which you would create a table called Node and another table called NodeInstance if you were designing the DB schema to serve as the persistence and CRUD layer for an entity called Node using a relational database?
This question has been posted to try to dissuade my colleagues from making costly mistakes during the design of the database schema whose purpose is to serve as the storage, persistence and CRUD data-layer for the backend of an iPad app I am working on and also to avoid creating bugs and issues which will be a maintenance nightmare in the future. Because I am under NDA, I cannot post any details regarding the exact nature of the project, except to say that the main entity that we are creating the CRUD layer on the server for, is called a Node. Therefore, I have recommended to my colleagues who are working on the backend to create a class called Node to represent the Node object and to insert rows in the Node table in the relational DB for the Create operation, with a new row representing every instance of the Node which is created using the client app, in true ORM fashion.
However, for some reason, my colleagues seem to think that having 1 table for persisting the Node objects is the wrong approach and the right approach is to create a Node table and a NodeInstance table and that maintaining 2 tables in parallel to manage the persistence of the Node entity is more efficient / performant. Since I am a bit of an ORM nerd and also a DB-schema geek, I have been trying to figure out if there is any planet in the known universe where using a 2 table approach to persist and perform CRUD on 1 entity, would be a good idea but in all scenarios, it seems that this adds code complexity, not to mention that it necessitates unnecessary SQL joins, multiple queries to maintain data-integrity, issues with atomic transactions, and concurrency issues. However, if there is anyone on stackoverflow that can help me understand why my colleagues approach is sane, then I would like to have an open mind and try to entertain the notion that my colleagues are right. However, at the time of writing this, I am convinced that my colleagues do not have a complete understanding of ORM and are therefore, thinking of using 2 tables for persisting 1 entity. Please do shine some wisdom on this matter, my respected stackoverflow peers and experts.

Using QAbstractItemModel to represent data in a SQL database

I am trying to create a QTreeView to display data from a SQL database. This is a large database, so simply loading the data into a QStandardItemModel seems prohibitive.
None of Qt's pre-built SQL model classes are sufficient for the task. Therefore it seems necessary to subclass QAbstractItemModel.
In the first place, I can find no examples where this is done, so I am wondering whether it is the correct approach.
Implementing QAbstractItemModel::data is pretty straightforward. I am uncertain how to implement QAbstractItemModel::parent.
Qt's "Simple Tree Model Example" example would be informative, but in that example the tree structure is represented in memory with the TreeItem class. I could copy that, but if I am going to duplicate the database structure, it would be just as easy to use QStandardItemModel. If I need to maintain a separate data structure (in addition to the database and the QAbstractItemModel subclass) to represent the tree structure, is there any advantage to subclassing QAbstractItemModel over just using a QStandardItemModel?
The challenge in the tree structure is to always be able to identify a model index's parent (i.e., overloading the parent() method). In the Simple Tree example, this is done by storing the three structure in a separate data structure. For large SQL queries this is impractical. For the right database structure, you might be able to calculate the proper parent node given the child, but that is not a guarantee. The only alternative I can imagine is passing a quint32 to QAbstractItemModel::createIndex which encodes the item's parent.
One performance consideration that might be useful. After giving up on sublcassing QAbstractItemModel, I tried populating a QStandardItemModel from the database. I loaded about 1200 items into the model, and four child items to each item with two separate database calls. This took about 3 seconds on a 2009 laptop. That is faster than I had been expecting. (And there would be performance gains if I used a single query instead of repeated queries.)
In the end I went another route: having several QTableViews in a the GUI, with signals and slots to show different aspects of the data. My code is much simpler, and the proper functionality is in place, so this feels like the "right" solution.

SQLite structure advice

I have a book structure with Chapter, Subchapter, Section, Subsection, Article and unknown number of subarticles, sub-subarticles, sub-sub-subarticles etc.
What's the best way to structure this?
One table with child-parent relationships, multiple tables?
Thank you.
To determine whether there are seperate tables or one-big-table involved, you should take a close look at each item - chapter, subchapter, etc. - and decide if they carry different attributes from the others. Does a chapter carry something different from a sub-chapter?
If so, then you're looking at seperate tables for Chapter, SubChapter, Section, SubSection, Article. Article still feels hierarchical to me with your sub- sub-sub- sub-sub-sub- etc.
If not, then maybe it is one big table with parent/child, but it looks like you may be talking about 'names' for the depth of the hierarchy which leans me toward seperate tables again.
Also consider how you'll query and what you'll be searching for.
There are a couple of methods to save a tree structure in a relational database. The most commonly used are using parent pointers and nested sets.
The first has a very easy data structure, namely a pointer to the respective parent element on each object), and is thus easy to implement. On the downside it is not easy to make some queries on it as the tree can not be fully traversed. You would need a self-join per layer.
The nested set is easier to query (when you have understood how it works) but is harder to update. Many writes require additional updates to other objects ion the tree which might make it harder to be transitionally save.
A third variant is that of the materialized path which I personally consider a good compromise between the former two.
That said, if you want to store arbitrary size trees (e.g,. for sections, sub-sections, sub-sub-sections, ...) you should use one of the mentioned tree implementations. If you have a very limited maximum depth (e.g max 3 layers) you could get away with creating an explicit data structure. But as things always get more complex than initially though, I'd advise you to use a real tree implementation.

Complex taxonomy ORM mapping - looking for suggestions

In my project (ASP.NET MVC + NHibernate) I have all my entities, lets say Documents, described by set of custom metadata. Metadata is contained in a structure that can have multiple tags, categories etc. These terms have the most importance for users seeking the document they want, so it has an impact on views as well as underlying data structures, database querying etc.
From view side of application, what interests me the most are the string values for the terms. Ideally I would like to operate directly on the collections of strings like that:
class MetadataAsSeenInViews
{
public IList<string> Categories;
public IList<string> Tags;
// etc.
}
From model perspective, I could use the same structure, do the simplest-possible ORM mapping and use it in queries like "fetch all documents with metadata exactly like this".
But that kind of structure could turn out useless if the application needs to perform complex database queries like "fetch all documents, for which at least one of categories is IN (cat1, cat2, ..., catN) OR at least one of tags is IN (tag1, ..., tagN)". In that case, for performance reasons, we would probably use numeric keys for categories and tags.
So one can imagine a structure opposite to MetadataAsSeenInViews that operates on numeric keys and provide complex mappings of integers to strings and other way round. But that solution doesn't really satisfy me for several reasons:
it smells like single responsibility violation, as we're dealing with database-specific issues when just wanting to describe Document business object
database keys are leaking through all layers
it adds unnecessary complexity in views
and I believe it doesn't take advantage of what can good ORM do
Ideally I would like to have:
single, as simple as possible metadata structure (ideally like the one at the top) in my whole application
complex querying issues addressed only in the database layer (meaning DB + ORM + at less as possible additional code for data layer)
Do you have any ideas how to structure the code and do the ORM mappings to be as elegant, as effective and as performant as it is possible?
I have found that it is problematic to use domain entities directly in the views. To help decouple things I apply two different techniques.
Most importantly I'm using separate ViewModel classes to pass data to views. When the data corresponds nicely with a domain model entity, AutoMapper can ease the pain of copying data between them, but otherwise a bit of manual wiring is needed. Seems like a lot of work in the beginning but really helps out once the project starts growing, and is especially important if you haven't just designed the database from scratch. I'm also using an intermediate service layer to obtain ViewModels in order to keep the controllers lean and to be able to reuse the logic.
The second option is mostly for performance reasons, but I usually end up creating custom repositories for fetching data that spans entities. That is, I create a custom class to hold the data I'm interested in, and then write custom LINQ (or whatever) to project the result into that. This can often dramatically increase performance over just fetching entities and applying the projection after the data has been retrieved.
Let me know if I haven't been elaborate enough.
The solution I've finally implemented don't fully satisfy me, but it'll do by now.
I've divided my Tags/Categories into "real entities", mapped in NHibernate as separate entities and "references", mapped as components depending from entities they describe.
So in my C# code I have two separate classes - TagEntity and TagReference which both carry the same information, looking from domain perspective. TagEntity knows database id and is managed by NHibernate sessions, whereas TagReference carries only the tag name as string so it is quite handy to use in the whole application and if needed it is still easily convertible to TagEntity using static lookup dictionary.
That entity/reference separation allows me to query the database in more efficient way, joining two tables only, like select from articles join articles_tags ... where articles_tags.tag_id = X without joining the tags table, which will be joined too when doing simple fully-object-oriented NHibernate queries.