I am trying to figure out the best way to model a set of "classes" in my system. Note that I'm not talking about OO classes, but classes of responses (in a survey). So the model goes like this:
A Class can be defined with three different types of data:
A Class of Coded Responses (where a coded response consists of a string label and an integer value)
A Class of Numeric Responses (defined as a set of intervals where each interval ranges from a min to a max value)
A Class of String Responses (defined as a set of regular expression patterns)
Right now we have: Class table (to define unique classes) and a ClassCoded, ClassNumeric and ClassString table (all with a ClassID as a foreign key to Class table).
My problem is that right now a Class could technically be both Coded and Numeric under this system. Is there any way to define the set of tables so that a Class can be of only one type?
There are two main ways to handle subtypes: either use sparse columns, adding a column for every possible property (preferably with check constraints to make sure only one type has values), or create a table for the supertype and then three tables for the subtypes, each with a foreign key back to the supertype table. Then add a check constraint to ensure that only one of the three possible type columns is not null.
Personally, I decide which of the two implementations to use based on how similar the subtypes are. If 90% of the columns are shared, I use the sparse columns approach; if very little information is shared, I use the multiple tables approach.
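For the multiple tables approach, one commonly used way to get the database itself to reject a Class that is both Coded and Numeric is to put a type column on the supertype and repeat it in each subtype's foreign key. This is only a sketch, not necessarily what the answer above had in mind: it uses SQLite through Python so that it runs as-is, and the ClassType, Label, Value, MinValue and MaxValue columns are names I made up (only Class, ClassCoded, ClassNumeric and ClassID come from the question).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

conn.executescript("""
CREATE TABLE Class (
    ClassID   INTEGER PRIMARY KEY,
    ClassType TEXT NOT NULL CHECK (ClassType IN ('coded', 'numeric', 'string')),
    UNIQUE (ClassID, ClassType)  -- target of the composite foreign keys below
);
CREATE TABLE ClassCoded (
    ClassID   INTEGER NOT NULL,
    ClassType TEXT NOT NULL DEFAULT 'coded' CHECK (ClassType = 'coded'),
    Label     TEXT NOT NULL,     -- hypothetical column
    Value     INTEGER NOT NULL,  -- hypothetical column
    FOREIGN KEY (ClassID, ClassType) REFERENCES Class (ClassID, ClassType)
);
CREATE TABLE ClassNumeric (
    ClassID   INTEGER NOT NULL,
    ClassType TEXT NOT NULL DEFAULT 'numeric' CHECK (ClassType = 'numeric'),
    MinValue  REAL NOT NULL,     -- hypothetical column
    MaxValue  REAL NOT NULL,     -- hypothetical column
    CHECK (MinValue <= MaxValue),
    FOREIGN KEY (ClassID, ClassType) REFERENCES Class (ClassID, ClassType)
);
""")

# Class 1 is declared as 'coded', so coded rows are accepted.
conn.execute("INSERT INTO Class (ClassID, ClassType) VALUES (1, 'coded')")
conn.execute("INSERT INTO ClassCoded (ClassID, Label, Value) VALUES (1, 'Yes', 1)")

# A numeric interval for the same class would need a (1, 'numeric') parent row,
# which cannot exist because ClassID is the primary key of Class.
try:
    conn.execute("INSERT INTO ClassNumeric (ClassID, MinValue, MaxValue) VALUES (1, 0, 10)")
    numeric_row_accepted = True
except sqlite3.IntegrityError:
    numeric_row_accepted = False

print(numeric_row_accepted)
```

ClassString would follow the same shape. The cost is one redundant type column per subtype table.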
Relational databases don't handle this elegantly. Simplest way is to define columns for all different types of data, and only fill the appropriate ones.
I don't understand what the issue is. This is just mixin inheritance. Why can't a Class just have an entry in both ClassCoded and ClassNumeric?
The enforcement of business rules isn't going to be done in the DB anyways, so you can easily enforce these constraints in the business layer code with special rules for Classes that have entries in both these tables.
We have a database schema that involves a heavy amount of polymorphism/inheritance/subtyping. Certain foreign keys such as a comment subject, or a permission target, point to a large number of tables, and certain properties like globally unique codes are shared by multiple tables.
The way we have this structured is via table-per-class inheritance, where the parent and child tables share a UUID primary key. UUIDs were chosen so that tables could be merged and separated to adjust to business needs without worrying about collisions.
Currently there is no explicit discriminator column on the parent types, but I would like to add one for a couple of reasons. It would rule out the possibility of erroneously pointing two different concrete subtypes to the same parent type. It would also frequently save joins, as often the child-specific data is not needed, only the knowledge of which subtype matches a given id.
I can think of a few possible approaches:
(1) Just use a plain text field that stores the name of the concrete subtype table.
(2) Do the same as above but with a custom Enum type that lists the possible tables.
(3) Use an "id" field that points to a lookup table.
(2) seems like it would be a nicer option than (1), however it has the rather large downside of not allowing me to drop a value from the Enum without a ton of migration pain. This is particularly painful if this discriminator column shows up on a lot of tables, which it likely will.
(3) is often used for "enums that need to change", however it'd require hardcoding UUID/int id values into the DDL in order to properly handle the foreign keys on the subtypes, which seems like a bit of a deal breaker.
This has me leaning towards (1), but I was wondering if there was a better option. Perhaps even a text type optimized for frequently repeated identifiers with very limited character sets.
In the light of your explanations, I would use the first option.
If you use short strings, that shouldn't waste noticeable performance and storage space (the space taken up by a short string is one byte more than the string itself). You'd have to use check constraints to constrain the strings that can be used, but that's not a big hassle.
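A minimal sketch of the first option with a check constraint, to show how little is involved. It uses SQLite via Python only so that it runs anywhere; the table name subject and the subtype names article, photo and video are invented for the example, and the UUID is stored as plain text.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE subject (             -- hypothetical parent table
    id      TEXT PRIMARY KEY,      -- UUID stored as text
    subtype TEXT NOT NULL          -- name of the concrete subtype table
            CHECK (subtype IN ('article', 'photo', 'video'))
)
""")

subject_id = str(uuid.uuid4())
conn.execute("INSERT INTO subject (id, subtype) VALUES (?, ?)", (subject_id, "photo"))

# A value outside the allowed list is rejected by the check constraint.
try:
    conn.execute(
        "INSERT INTO subject (id, subtype) VALUES (?, ?)",
        (str(uuid.uuid4()), "podcast"),
    )
    unknown_subtype_accepted = True
except sqlite3.IntegrityError:
    unknown_subtype_accepted = False

# The subtype is known from the parent row alone, with no join to a child table.
subtype = conn.execute(
    "SELECT subtype FROM subject WHERE id = ?", (subject_id,)
).fetchone()[0]
print(unknown_subtype_accepted, subtype)
```

Adding or dropping an allowed subtype then means replacing the check constraint on that table, rather than altering a shared Enum type.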
As you can see below, the User is able to make a Call, the Operator will log it, writing the time (error on my part, Column2), his own ID and the ID of the caller. The Operator is also able to create a Solution, by generating a Solution ID and describing the solution.
Note that nothing differentiates the User from the Operator in terms of attributes. Indeed they both inherit their ID from the Person entity.
So I have two questions.
First, as you can see, the Call entity has two attributes which are the same column (ID for User and Operator), but will always represent two different people (i.e. a User will never be an Operator). Is this the correct notation for such a thing?
Secondly, I am not sure about having User and Operator as separate entities, because no attribute distinguishes them from one another, only their ability to do something or not (a User can't create a solution). This would mean that they don't have attributes apart from the ones they inherit. Is this correct, or should the two entities be merged under the Person entity?
Thanks in advance.
It's valid to create subtypes with distinct relationships and/or constraints, even if they have no distinct attributes. You'll be able to use referential integrity to ensure that Operator IDs and User IDs don't get mixed up in the Call table, and it's possible to enforce mutual exclusion between the IDs in the User and Operator tables.
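To make the referential integrity point concrete, here is a rough sketch in SQLite via Python. The plural table names, the name and logged_at columns, and the sample rows are my own assumptions rather than anything taken from your diagram. The subtype tables carry nothing but the inherited ID, yet they still let the Call table distinguish callers from operators. Mutual exclusion between the two subtype tables is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

conn.executescript("""
CREATE TABLE persons   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Subtypes with no attributes of their own, only the inherited ID.
CREATE TABLE users     (id INTEGER PRIMARY KEY REFERENCES persons (id));
CREATE TABLE operators (id INTEGER PRIMARY KEY REFERENCES persons (id));
CREATE TABLE calls (
    id          INTEGER PRIMARY KEY,
    logged_at   TEXT NOT NULL,
    user_id     INTEGER NOT NULL REFERENCES users (id),
    operator_id INTEGER NOT NULL REFERENCES operators (id)
);
INSERT INTO persons (id, name) VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO users (id) VALUES (1);      -- Alice is a User
INSERT INTO operators (id) VALUES (2);  -- Bob is an Operator
""")


def log_call(user_id, operator_id):
    """Try to record a call; return True if the database accepted it."""
    try:
        conn.execute(
            "INSERT INTO calls (logged_at, user_id, operator_id) "
            "VALUES ('2024-01-01 09:00', ?, ?)",
            (user_id, operator_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


valid_call_accepted = log_call(1, 2)    # User calls, Operator logs
swapped_call_accepted = log_call(2, 1)  # IDs mixed up: rejected by the FKs
print(valid_call_accepted, swapped_call_accepted)
```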
As far as notation is concerned, I would show the ID in the User and Operator tables, and use Crow's foot lines to represent the FK constraints between the tables. If I wanted to make the subtyping explicit, I would rather show that on an EER diagram using Chen's notation than on a table diagram.
In my example, I have a watch, which indicates that a user wants notifications about events on a different item, say a group or an organization.
I see two ways to do this:
Have a groupwatch resource, with a groupwatch table, with id,user,group (group FK to group resource and table); and an orgwatch resource, with an orgwatch table, with id,user,organization (org FK to organization resource and table)
Have a generic watch resource, with a watch table, with id,user,type,typeid. type is one of group or organization, and typeid is the ID of the group or organization being watched.
Since both of them are watches, it seems a waste to have two different tables and resources to watch 2 different objects. It gets worse if I start watching 4, 5, 6, 20, 50 different types of resources.
On the other hand, a foreign key relationship appears impossible if I just have a generic typeid, which means that my database (if relational) and my framework (activerecord or anything else) cannot enforce it correctly.
How do I best implement this type of "association to different types of record/table for each record in my table"?
UPDATE:
Are my only choices for doing this:
separate tables/resources for each watch type, which enables the database to enforce relational integrity and do joins
single table for all watches, but I will have to enforce relational integrity and do joins at the app level?
If you add a new type of resource once every six months, you may want to define your tables in such a way that adding new resources involves changing data definitions. If you add a new resource type every week, you may want to make your data definitions stay the same when you add new types. There's a downside to either choice.
If you do choose to define tables in such a way that the types are visible in the table structure, there are two patterns often used with type/subtype (aka class/subclass) situations.
One pattern has been called "single table inheritance". Put data about all the types in a single table, and leave some columns NULL wherever they do not apply.
Another pattern has been called "class table inheritance". Define one table for the superclass, with all the data that is common to all the types. Then define tables for each subtype (subclass) to contain class specific data. Make the primary key of the subtype tables a duplicate of the primary key in the supertype table, and also declare it as a foreign key that references the primary key of the supertype table. It's going to be up to the app, at insert time, to replicate the value of the primary key in the supertype table over in the subtype table.
I like Fowler's treatment of these two patterns.
http://martinfowler.com/eaaCatalog/classTableInheritance.html
http://www.martinfowler.com/eaaCatalog/singleTableInheritance.html
This matter of sharing primary keys has a few beneficial effects.
First, it enforces the one-to-one nature of the IS-A relationships.
Second, it makes it easy to find out whether a given entry belongs to a desired subtype, by just joining with the subtype table. You don't really need an extra type field.
Third, it speeds up the joins, because of the index that gets built when you declare a primary key.
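Here is a small sketch of class table inheritance with a shared primary key, applied to the watch example. It runs on SQLite via Python. The table and column names (watch, group_watch, org_watch, user_id and so on) are assumptions for illustration, and the group and organization tables that group_id and org_id would reference are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

conn.executescript("""
CREATE TABLE watch (                -- supertype: data common to every watch
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL
);
CREATE TABLE group_watch (          -- subtype: same key, plus group-specific data
    id       INTEGER PRIMARY KEY REFERENCES watch (id),
    group_id INTEGER NOT NULL
);
CREATE TABLE org_watch (
    id     INTEGER PRIMARY KEY REFERENCES watch (id),
    org_id INTEGER NOT NULL
);
""")

# The app copies the supertype's key into the subtype row at insert time.
watch_id = conn.execute("INSERT INTO watch (user_id) VALUES (42)").lastrowid
conn.execute("INSERT INTO group_watch (id, group_id) VALUES (?, 7)", (watch_id,))

org_watch_id = conn.execute("INSERT INTO watch (user_id) VALUES (42)").lastrowid
conn.execute("INSERT INTO org_watch (id, org_id) VALUES (?, 3)", (org_watch_id,))

# Joining with a subtype table returns only the watches of that subtype.
group_rows = conn.execute(
    "SELECT w.id, w.user_id, g.group_id "
    "FROM watch w JOIN group_watch g ON g.id = w.id"
).fetchall()

# The shared primary key keeps the relationship one-to-one:
# a second group_watch row for the same watch is rejected.
try:
    conn.execute("INSERT INTO group_watch (id, group_id) VALUES (?, 8)", (watch_id,))
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

print(group_rows, duplicate_accepted)
```

Note that nothing in this sketch stops the same watch from appearing in both group_watch and org_watch; that would need an additional constraint.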
If you want a structure that can adapt to new attributes without changing data definitions, you can look into E-A-V design. Be careful, though. Sometimes this results in data that is nearly impossible to use, because the logical structure is so obscure. I usually think of E-A-V as an anti-pattern for this reason, although there are some who really like the results they get from it.
In a nutshell, here's the situation: we have a database that is used to build a hierarchy of "locations". (Example: Street Address > Building 1 > First Floor > Room).
Each of the locations is stored in a table. Each location can be of a different "type". The types are defined in another table. (We use the types of locations to restrict what locations can be added to a location).
Here's the quandary we are facing: We need to be able to store different types of information for different types of locations. For example, a location type of "building" may need to have the address stored, whereas a location type of "room" may need to have dimensions or paint color stored.
Obviously we could create a table for each location type we define to hold the properties required for the particular location type and then use application logic to query the appropriate table to pull in a particular location's additional information. Is there a more elegant or practical way to accomplish this relationally in the database without having to rely on application logic?
Thanks!
The relationally pure way to do this would be to implement your initial suggestion i.e. have a separate table such as BUILDING_LOCATIONS that holds the attributes that are only applicable to this type of location. A different [TYPE]_LOCATIONS table would be created for each type of location that has its own attributes. In this way you could use standard database constraint functionality to ensure the integrity of the data in these tables.
Another method would be to add a series of nullable columns to the LOCATIONS table such as BUILDING_ADDRESS and ROOM_DIMENSIONS. This is not relationally pure as it means null values can exist in this table. However, you can still use standard database functionality to ensure the integrity of the data. It can be a bit more convoluted if certain values are mandatory in certain situations. Also, if there are many types of location with many differing attributes the number of columns in the table can become unwieldy.
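As a rough illustration of how convoluted the nullable-column variant can get, here is a sketch in SQLite via Python where BUILDING_ADDRESS is mandatory for buildings and forbidden elsewhere. The LOCATION_TYPE column, its three allowed values and the exact rules are assumptions made for the example; in the real schema the types live in their own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE LOCATIONS (
    ID               INTEGER PRIMARY KEY,
    LOCATION_TYPE    TEXT NOT NULL,
    BUILDING_ADDRESS TEXT,   -- only meaningful for buildings
    ROOM_DIMENSIONS  TEXT,   -- only meaningful for rooms
    CHECK (
        (LOCATION_TYPE = 'building'
             AND BUILDING_ADDRESS IS NOT NULL AND ROOM_DIMENSIONS IS NULL)
        OR (LOCATION_TYPE = 'room' AND BUILDING_ADDRESS IS NULL)
        OR (LOCATION_TYPE = 'floor'
             AND BUILDING_ADDRESS IS NULL AND ROOM_DIMENSIONS IS NULL)
    )
)
""")


def try_insert(location_type, address, dimensions):
    """Return True if the row satisfies the table's check constraint."""
    try:
        conn.execute(
            "INSERT INTO LOCATIONS (LOCATION_TYPE, BUILDING_ADDRESS, ROOM_DIMENSIONS) "
            "VALUES (?, ?, ?)",
            (location_type, address, dimensions),
        )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert("building", "1 Main St", None),  # valid building
    try_insert("room", None, "4m x 5m"),        # valid room
    try_insert("room", "1 Main St", None),      # a room must not carry an address
    try_insert("building", None, None),         # a building needs an address
]
print(results)
```

Every new location type or type-specific column means revisiting that one CHECK expression, which is where the unwieldiness comes from.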
Another method is the Entity-Attribute-Value model. Generally, this is to be avoided if at all possible. It is not relationally pure, as your column values are now no longer defined over domains, and it is extremely difficult, if not impossible, to ensure the integrity of the data. Any real attempt to do so will require a lot of bespoke coding (which needs to be carefully implemented to cater for things like concurrency control that database constraints give you for free) as you cannot use standard database constraints. However, if you are just interested in storing values for information and not doing anything with them you could use this method.
The EAV method does have a danger that because it appears so easy to add attributes to an entity, it becomes the default way of doing so. It is then used to add attributes for which vital processing is dependent and, because you cannot ensure the integrity of the data using this method, you find the values being used are meaningless and the whole logical basis for the processing is destroyed.
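The integrity problem can be seen in a few lines. The following is only a sketch in SQLite via Python, with made-up table and attribute names (location_attribute, room_location, width_m): the EAV table happily stores a meaningless width, while a dedicated subtype table with an ordinary constraint refuses it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location_attribute (   -- EAV: one row per (entity, attribute)
    location_id INTEGER NOT NULL,
    attribute   TEXT NOT NULL,
    value       TEXT,               -- every value shares one untyped column
    PRIMARY KEY (location_id, attribute)
);
CREATE TABLE room_location (        -- dedicated table for the 'room' type
    location_id INTEGER PRIMARY KEY,
    width_m     REAL NOT NULL
                CHECK (typeof(width_m) IN ('integer', 'real') AND width_m > 0)
);
""")


def accepted(sql, params):
    """Return True if the statement runs without an integrity error."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# The EAV table has no way of knowing that a width must be a positive number.
eav_accepted = accepted(
    "INSERT INTO location_attribute (location_id, attribute, value) VALUES (?, ?, ?)",
    (1, "width_m", "blue"),
)

# The typed column rejects the same nonsense value.
typed_accepted = accepted(
    "INSERT INTO room_location (location_id, width_m) VALUES (?, ?)",
    (1, "blue"),
)
print(eav_accepted, typed_accepted)
```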
I am designing a new laboratory database with MANY types of my main entities.
The table for each entity will hold fields common to ALL types of that entity (entity_id, created_on, created_by, etc). I will then use concrete inheritance (separate table for each unique set of attributes) to store all remaining fields.
I believe that this is the best design for the standard types of data which come through the laboratory daily. However, we often have special samples that are accompanied by specific values the originator wants stored.
Question: How should I model special (non-standard) types of entities?
Option 1: Use entity-value for special fields
One table (entity_id, attribute_name, numerical_value) would hold all data for any special entity.
+ Fewer tables.
- Cannot enforce requiring a particular attribute.
- Must convert (pivot) rows to columns which is inefficient.
Option 2: Strict concrete inheritance.
Create separate table for each separate special case.
+ Follows in accordance with all other rules
- Overhead of many tables with only a few rows.
Option 3: Concrete inheritance with special tables under a different user.
Put all special tables under a different user.
+ Keeps all special and standard tables separate.
+ Easier to search for common standard table in a list without searching through all special tables.
- Overhead of many tables with only a few rows.
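For reference, the pivot mentioned under Option 1 looks roughly like this. It is a sketch in SQLite via Python; the table name special_value and the attribute names ph and temperature_c are invented, and only the three columns come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE special_value (        -- hypothetical name for the Option 1 table
    entity_id       INTEGER NOT NULL,
    attribute_name  TEXT NOT NULL,
    numerical_value REAL,
    PRIMARY KEY (entity_id, attribute_name)
)
""")
conn.executemany(
    "INSERT INTO special_value VALUES (?, ?, ?)",
    [(1, "ph", 7.2), (1, "temperature_c", 21.5), (2, "ph", 6.8)],
)

# One CASE expression per attribute turns rows back into columns.
# Every attribute has to be listed by name in the query.
pivoted = conn.execute("""
    SELECT entity_id,
           MAX(CASE WHEN attribute_name = 'ph' THEN numerical_value END)
               AS ph,
           MAX(CASE WHEN attribute_name = 'temperature_c' THEN numerical_value END)
               AS temperature_c
    FROM special_value
    GROUP BY entity_id
    ORDER BY entity_id
""").fetchall()
print(pivoted)
```

Entity 2 has no temperature row, so its temperature_c comes back as NULL; the table cannot require that the row exists.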
Actually the design you described (common table plus subtype-specific tables) is called Class Table Inheritance.
Concrete Table Inheritance would have all the common attributes duplicated in the subtype tables, and you'd have no supertype table as you do now.
I'm strongly against EAV. I consider it an SQL antipattern. It may seem like an elegant solution because it requires fewer tables, but you're setting yourself up for a lot of headache later. You identified a couple of the disadvantages, but there are many others. IMHO, EAV is used appropriately only if you absolutely must not create a new table when you introduce a new subtype, or if you have an unbounded number of subtypes (e.g. users can define new attributes ad hoc).
You have many subtypes, but still a finite number of them, so if I were doing this project I'd stick with Class Table Inheritance. You may have few rows of each subtype, but at least you have some assurance that all rows in each subtype have the same columns, you can use NOT NULL if you need to, you can use SQL data types, you can use referential integrity constraints, etc. From a relational perspective, it's a better design than EAV.
One more option that you didn't mention is called Serialized LOB. That is, add a BLOB column for a semi-structured collection of custom attributes. Store XML, YAML, JSON, or your own DSL in that column. You won't be able to parse individual attributes out of that BLOB easily with SQL, you'll have to fetch the whole BLOB back into your application and extract individual attributes in code. So in some ways it's less convenient. But if that satisfies your usage of the data, then there's nothing wrong with that.
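A minimal sketch of the Serialized LOB idea, using JSON in a text column on SQLite via Python. The sample table, the custom_attributes column and the attribute names are all assumptions for illustration; entity_id and created_by follow the common fields you listed.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sample (
    entity_id         INTEGER PRIMARY KEY,
    created_by        TEXT NOT NULL,
    custom_attributes TEXT            -- serialized JSON, opaque to the schema
)
""")

# Hypothetical special values the originator wants stored.
attrs = {"originator_ref": "X-17", "storage_temp_c": -80}
conn.execute(
    "INSERT INTO sample (entity_id, created_by, custom_attributes) VALUES (?, ?, ?)",
    (1, "lab_tech", json.dumps(attrs)),
)

# The whole document is fetched and parsed in application code.
raw = conn.execute(
    "SELECT custom_attributes FROM sample WHERE entity_id = ?", (1,)
).fetchone()[0]
loaded = json.loads(raw)
storage_temp = loaded["storage_temp_c"]
print(loaded, storage_temp)
```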
I think it depends mostly on how you want to use this data.
First of all, I don't really see the benefit of option 3 over option 2. I think separating the special tables in another schema will make your application harder to maintain, especially if later on commonalities are found between 'special values'.
As another option I would say:
- Store the special values in an XML fragment (or blob). Most databases have the ability to query XML structures these days, so without the need for many extra tables, you would keep your flexibility for a small performance hit.
If you put all the special values in one table, you get a very sparse table. Most normal DBMSes cannot handle this very well, but there are some implementations that specialize in this. You could benefit from that.
Do you often need to query the key-value pairs? If you basically access that table through its entry_id, I think having a key-value table is not a bad design. An extra index on the key column might even help you when you do need to query for special values. If you build an application layer on top of your database, the key-value table will map onto a Map or Hash structure, which can also easily be used.
It also depends on the different types of values you want to store. If there are many different types, that need to be easily accessed (instead of being serialized/deserialized to XML/Character-String) you might want to store the type in a separate column, but that will usually lead to a very complicated design.
Hope this helps (a little bit).
-Maarten
http://en.wikipedia.org/wiki/Entity-Attribute-Value_model
Suggest you read about the problems with entity value tables before deciding to use them.
Oracle can deal with sparsely filled tables quite well. I think you can use an approach similar to the one Salesforce uses. They use tables with a lot of columns and create columns when needed. You can index those columns much better than you can an EAV model.
So it is flexible, but it performs better than an EAV model.
Read: Ask Tom 1, Ask Tom 2, High Scalability and Salesforce.
The "Option 1" pattern is also called the "Universal Relation". At first look it seems like a shortcut that avoids potentially difficult data modeling. It trades effortless data modeling for not being able to do a simple select, update or delete without dramatically more effort than it would take on a more usual-looking data model with multiple tables.