I am using ZF2 and Doctrine to develop a project that manages calendar events. All of the events share a common set of data elements, but unique types of events share their own unique sets of type-specific data elements. For example, all of the events include common elements such as eventID, eventName and eventDate. In addition to those common elements, events that are “meetings” will have additional elements like agenda, minutes or attendees that are specific to meetings, events that are “training” will have additional elements that are specific to training, events that are “conferences” will have additional unique elements, and so on.
The project’s index view will want to be able to list all events, but will not require any data beyond the “common” data set. If an event belongs to a certain type, it WILL NOT also belong to another type; so each event will only be associated with one supplemental data set. Some event types might have more than one router: the router sales/meetings[/action] will want to access the same or identical entities, fieldsets and forms as the router marketing/meetings[/action].
In my database I will have a table named events to index all of the events and record common data, and I’ll have a collection of associated tables like events_meetings, events_training, events_conferences, and so on to record type-specific data.
I’m contemplating a number of different solutions for the project:
Solution one: a single module, a single events entity and fieldset, and entities and fieldsets for each of the type-specific data element groups. This solution would require that the events entity have multiple OneToMany elements: one for each of the associated data sets. I don’t know whether this is possible; and even if it is possible, I don’t know whether it is a good idea.
Solution two: a single module and duplicate copies of the common “events” entity and fieldset: events_1 is linked to the events table and is associated with the events_meetings entity; events_2 is identical to events_1 except for the OneToMany element and it is associated with the events_training entity; and events_3, events_4, events_5, and so on are each associated with their own supplemental data set entities. I can see this working, but it requires a lot of nearly identical copies of the common data entity.
Solution three: multiple modules, each with a single events entity and fieldset, and a single associated events_foo entity and fieldset. This is perhaps the cleanest solution, although it seems to create a lot of identical code.
Solution four: reconfigure the data schema so that all of the supplemental data could be stored in a single table. For example, rather than having an events_meetings table that has a single row for each meeting event and a column for agenda, a column for minutes and a column for attendees, it’s possible to create an events_alt_data table that has a different row for each element and columns such as eventID, elementType, elementTitle and eventValue. Wordpress does something like this for unique data, but in my project the supplemental data sets are where the majority of the data will be stored and I’m concerned that it may affect performance as the data grows. This solution will also require some creative coding to deal with the conditional nature of data elements and how to validate and set options for data that could be any type or length.
Any advice?
Single Table Inheritance is the way to go.
Related
In a nutshell here's the situation, we have a database that is used to build a hierarchy of "locations". (Example: Street Address > Building 1 > First Floor > Room).
Each of the locations are stored in a table. Each location can be of a different "type". The types are defined in another table. (We use the types of locations to restrict what locations can be added to a location).
Here's the quandary we are facing: We need to be able to store different types of information for different types of locations. For example, a location type of "building" may need to have the address stored where as a location type of "room" may need to have dimensions or paint color stored.
Obviously we could create a table for each location type we define to hold the properties required for the particular location type and then use application logic to query the appropriate table to pull in a particular location's additional information. Is there a more elegant or practical way to accomplish this relationally in the database without having to rely on application logic?
Thanks!
The relationally pure way to do this would be to implement your initial suggestion i.e. have a separate table such as BUILDING_LOCATIONS that holds the attributes that are only applicable to this type of location. A different [TYPE]_LOCATIONS table would be created for each type of location that has its own attributes. In this way you could use standard database constraint functionality to ensure the integrity of the data in these tables.
Another method would be to add a series of nullable columns to the LOCATIONS table such as BUILDING_ADDRESS and ROOM_DIMENSIONS. This is not relationally pure as it means null values can exist in this table. However, you can still use standard database functionality to ensure the integrity of the data. It can be a bit more convoluted if certain values are mandatory in certain situations. Also, if there are many types of location with many differing attributes the number of columns in the table can become unwieldy.
Another method is the Entity-Attribute-Value model. Generally, this is to be avoided if at all possible. It is not relationally pure, as your column values are now no longer defined over domains, and it is extremely difficult, if not impossible, to ensure the integrity of the data. Any real attempt to do so will require a lot of bespoke coding (which needs to be carefully implemented to cater for things like concurrency control that database constraints give you for free) as you cannot use standard database constraints. However, if you are just interested in storing values for information and not doing anything with them you could use this method.
The EAV method does have a danger that because it appears so easy to add attributes to an entity, it becomes the default way of doing so. It is then used to add attributes for which vital processing is dependent and, because you cannot ensure the integrity of the data using this method, you find the values being used are meaningless and the whole logical basis for the processing is destroyed.
I am about to embark on a project for work that is very outside my normal scope of duties. As a SQL DBA, my initial inclination was to approach the project using a SQL database but the more I learn about NoSQL, the more I believe that it might be the better option. I was hoping that I could use this question to describe the project at a high level to get some feedback on the pros and cons of using each option.
The project is relatively straightforward. I have a set of objects that have various attributes. Some of these attributes are common to all objects whereas some are common only to a subset of the objects. What I am tasked with building is a service where the user chooses a series of filters that are based on the attributes of an object and then is returned a list of objects that matches all^ of the filters. When the user selects a filter, he or she may be filtering on a common or subset attribute but that is abstracted on the front end.
^ There is a chance, depending on user feedback, that the list of objects may match only some of the filters and the quality of the match will be displayed to the user through a score that indicates how many of the criteria were matched.
After watching this talk by Martin Folwler (http://www.youtube.com/watch?v=qI_g07C_Q5I), it would seem that a document-style NoSQL database should suit my needs but given that I have no experience with this approach, it is also possible that I am missing something obvious.
Some additional information - The database will initially have about 5,000 objects with each object containing 10 to 50 attributes but the number of objects will definitely grow over time and the number of attributes could grow depending on user feedback. In addition, I am hoping to have the ability to make rapid changes to the product as I get user feedback so flexibility is very important.
Any feedback would be very much appreciated and I would be happy to provide more information if I have left anything critical out of my discussion. Thanks.
This problem can be solved in by using two separate pieces of technology. The first is to use a relatively well designed database schema with a modern RDBMS. By modeling the application using the usual principles of normalization, you'll get really good response out of storage for individual CRUD statements.
Searching this schema, as you've surmised, is going to be a nightmare at scale. Don't do it. Instead look into using Solr/Lucene as your full text search engine. Solr's support for dynamic fields means you can add new properties to your documents/objects on the fly and immediately have the ability to search inside your data if you have designed your Solr schema correctly.
I'm not an expert in NoSQL, so I will not be advocating it. However, I have few points that can help you address your questions regarding the relational database structure.
First thing that I see right away is, you are talking about inheritance (at least conceptually). Your objects inherit from each-other, thus you have additional attributes for derived objects. Say you are adding a new type of object, first thing you need to do (conceptually) is to find a base/super (parent) object type for it, that has subset of the attributes and you are adding on top of them (extending base object type).
Once you get used to thinking like said above, next thing is about inheritance mapping patterns for relational databases. I'll steal terms from Martin Fowler to describe it here.
You can hold inheritance chain in the database by following one of the 3 ways:
1 - Single table inheritance: Whole inheritance chain is in one table. So, all new types of objects go into the same table.
Advantages: your search query has only one table to search, and it must be faster than a join for example.
Disadvantages: table grows faster than with option 2 for example; you have to add a type column that says what type of object is the row; some rows have empty columns because they belong to other types of objects.
2 - Concrete table inheritance: Separate table for each new type of object.
Advantages: if search affects only one type, you search only one table at a time; each table grows slower than in option 1 for example.
Disadvantages: you need to use union of queries if searching several types at the same time.
3 - Class table inheritance: One table for the base type object with its attributes only, additional tables with additional attributes for each child object type. So, child tables refer to the base table with PK/FK relations.
Advantages: all types are present in one table so easy to search all together using common attributes.
Disadvantages: base table grows fast because it contains part of child tables too; you need to use join to search all types of objects with all attributes.
Which one to choose?
It's a trade-off obviously. If you expect to have many types of objects added, I would go with Concrete table inheritance that gives reasonable query and scaling options. Class table inheritance seems to be not very friendly with fast queries and scalability. Single table inheritance seems to work with small number of types better.
Your call, my friend!
May as well make this an answer. I should comment that I'm not strong in NoSQL, so I tend to lean towards SQL.
I'd do this as a three table set. You will see it referred to as entity value pair logic on the web...it's a way of handling multiple dynamic attributes for items. Lets say you have a bunch of products and each one has a few attributes.
Prd 1 - a,b,c
Prd 2 - a,d,e,f
Prd 3 - a,b,d,g
Prd 4 - a,c,d,e,f
So here are 4 products and 6 attributes...same theory will work for hundreds of products and thousands of attributes. Standard way of holding this in one table requires the product info along with 6 columns to store the data (in this setup at least one third of them are null). New attribute added means altering the table to add another column to it and coming up with a script to populate existing or just leaving it null for all existing. Not the most fun, can be a head ache.
The alternative to this is a name value pair setup. You want a 'header' table to hold the common values amoungst your products (like name, or price...things that all rpoducts always have). In our example above, you will notice that attribute 'a' is being used on each record...this does mean attribute a can be a part of the header table as well. We'll call the key column here 'header_id'.
Second table is a reference table that is simply going to store the attributes that can be assigned to each product and assign an ID to it. We'll call the table attribute with atrr_id for a key. Rather straight forwards, each attribute above will be one row.
Quick example:
attr_id, attribute_name, notes
1,b, the length of time the product takes to install
2,c, spare part required
etc...
It's just a list of all of your attributes and what that attribute means. In the future, you will be adding a row to this table to open up a new attribute for each header.
Final table is a mapping table that actually holds the info. You will have your product id, the attribute id, and then the value. Normally called the detail table:
prd1, b, 5 mins
prd1, c, needs spare jack
prd2, d, 'misc text'
prd3, b, 15 mins
See how the data is stored as product key, value label, value? Any future product added can have any combination of any attributes stored in this table. Adding new attributes is adding a new line to the attribute table and then populating the details table as needed.
I beleive there is a wiki for it too... http://en.wikipedia.org/wiki/Entity-attribute-value_model
After this, it's simply figuring out the best methodology to pivot out your data (I'd recommend Postgres as an opensource db option here)
My application has one table called 'events' and each event has approx 30 standard fields, but also user defined fields that could be any name or type, in an 'eventdata' table. Users can define these event data tables, by specifying x number of fields (either text/double/datetime/boolean) and the names of these fields. This 'eventdata' (table) can be different for each 'event'.
My current approach is to create a lookup table for the definitions. So if i need to query all 'event' and 'eventdata' per record, i do so in a M-D relaitionship using two queries (i.e. select * from events, then for each record in 'events', select * from 'some table').
Is there a better approach to doing this? I have implemented this so far, but most of my queries require two distinct calls to the DB - i cannot simply join my master 'events' table with different 'eventdata' tables for each record in in 'events'.
I guess my main question is: can i join my master table with different detail tables for each record?
E.g.
SELECT E.*, E.Tablename
FROM events E
LEFT JOIN 'E.tablename' T ON E._ID = T.ID
If not, is there a better way to design my database considering i have no idea on how many user defined fields there may be and what type they will be.
There are four ways of handling this.
Add several additional fields named "Custom1", "Custom2", "Custom3", etc. These should have a datatype of varchar(?) or similiar
Add a field to hold the unstructured data (like an XML column).
Create a table of name /value pairs which are associated with some type of template. Let them manage the template. You'll have to use pivot tables or similiar to get the data out.
Use a database like MongoDB or another NoSql style product to store this.
The above said, The first one has the advantage of being fast but limits the number of custom fields to the number you defined. Older main frame type applications work this way. SalesForce CRM used to.
The second option means that each record can have it's own custom fields. However, depending on your database there are definite challenges here. Tried this, don't recommend it.
The third one is generally harder to code for but allows for extreme flexibility. SalesForce and other applications have gone this route; including a couple I'm responsible for. The downside is that Microsoft apparently acquired a patent on doing things this way and is in the process of suing a few companies over it. Personally, I think that's bullcrap; but whatever. Point is, use at your own risk.
The fourth option is interesting. We've played with it a bit and the performance is great while coding is pretty darn simple. This might be your best bet for the unstructured data.
Those type of joins won't work because you will need to pivot the eventdata table to make it columns instead of rows. Therefore it depends on which database technology you are using.
Here is an example with MySQL: How to pivot a MySQL entity-attribute-value schema
My approach would be to avoid using a different table for each event, if that's possible.
I would use something like:
Event (EventId, ..., ...)
EventColumnType (EventColumnTypeId, EventTypeId, ColumnName)
EventColumnData (EventColumnTypeId, Data)
You are them limited to the type of data you can store (everything would have to be strings, for example), but you the number of events and columns are unrestricted.
What I'm getting from your description is you have an event table, and then a separate EventData table for each and every event.
Rather than that, why not have a single EventCustomFields table that contains a foreign key to the event table, a field Name (event+field being the PK) and a field value.
Sure it's not the best. You'd be stuck serializing the value or storing everything as a string. And you'd still be stuck doing two queries, one for the event table and one to get it's custom fields, but at least you wouldn't have a new table for every event in the system (yuck x10)
Another, (arguably worse) option is to serialize the custom fields into a single column of the and then deserialize when you need. So your query would be something like
Select E.*, C.*
From events E, customFields C
Where E.ID = C.ID
Is it possible to just impose a limit on your users? I know the tables underneath Sharepoint 2007 had a bunch of columns for custom data that were just named like CustomString1, CustomDate2, etc. That may end up easier than some of the approaches above, where everything is in one column (though that's an approach I've taken as well), and I would think it would scale up better.
The answer to your main question is: no. You can't have different rows in the result set with different columns. The result set is kind of like a table, so each row has to have the same columns. You can fake it with padding and dummy columns, but that's probably not much better.
You could try defining a fixed event data table, with (say) ten of each type of column. Then you'd store the usage metadata in a separate table and just read that in at system startup. The metadata would tell you that event type "foo" has a field "name" mapped to column string0 in the event data table, a field named "reporter" mapped to column string1, and a field named "reportDate" mapped to column date0. It's ugly and wastes space, but it's reasonably flexible. If you're in charge of the database, you can even define a view on the table so to the client it looks like a "normal" table. If the clients create their own tables and just stick the table name in the event record, then obviously this won't fly.
If you're really hardcore you can write a database procedure to query the table structures and serialize everything to a lilst of key/type/value tuples and return that in one long string as the last column, but that's probably not much handier than what you're doing now.
I have to add functionality to an existing application and I've run into a data situation that I'm not sure how to model. I am being restricted to the creation of new tables and code. If I need to alter the existing structure I think my client may reject the proposal.. although if its the only way to get it right this is what I will have to do.
I have an Item table that can me link to any number of tables, and these tables may increase over time. The Item can only me linked to one other table, but the record in the other table may have many items linked to it.
Examples of the tables/entities being linked to are Person, Vehicle, Building, Office. These are all separate tables.
Example of Items are Pen, Stapler, Cushion, Tyre, A4 Paper, Plastic Bag, Poster, Decoration"
For instance a Poster may be allocated to a Person or Office or Building. In the future if they add a Conference Room table it may also be added to that.
My intital thoughts are:
Item
{
ID,
Name
}
LinkedItem
{
ItemID,
LinkedToTableName,
LinkedToID
}
The LinkedToTableName field will then allow me to identify the correct table to link to in my code.
I'm not overly happy with this solution, but I can't quite think of anything else. Please help! :)
Thanks!
It is not a good practice to store table names as column values. This is a bad hack.
There are two standard ways of doing what you are trying to do. The first is called single-table inheritance. This is easily understood by ORM tools but trades off some normalization. The idea is, that all of these entities - Person, Vehicle, whatever - are stored in the same table, often with several unused columns per entry, along with a discriminator field that identifies what type the entity is.
The discriminator field is usually an integer type, that is mapped to some enumeration in your code. It may also be a foreign key to some lookup table in your database, identifying which numbers correspond to which types (not table names, just descriptions).
The other way to do this is multiple-table inheritance, which is better for your database but not as easy to map in code. You do this by having a base table which defines some common properties of all the objects - perhaps just an ID and a name - and all of your "specific" tables (Person etc.) use the base ID as a unique foreign key (usually also the primary key).
In the first case, the exclusivity is implicit, since all entities are in one table. In the second case, the relationship is between the Item and the base entity ID, which also guarantees uniqueness.
Note that with multiple-table inheritance, you have a different problem - you can't guarantee that a base ID is used by exactly one inheritance table. It could be used by several, or not used at all. That is why multiple-table inheritance schemes usually also have a discriminator column, to identify which table is "expected." Again, this discriminator doesn't hold a table name, it holds a lookup value which the consumer may (or may not) use to determine which other table to join to.
Multiple-table inheritance is a closer match to your current schema, so I would recommend going with that unless you need to use this with Linq to SQL or a similar ORM.
See here for a good detailed tutorial: Implementing Table Inheritance in SQL Server.
Find something common to Person, Vehicle, Building, Office. For the lack of a better term I have used Entity. Then implement super-type/sub-type relationship between the Entity and its sub-types. Note that the EntityID is a PK and a FK in all sub-type tables. Now, you can link the Item table to the Entity (owner).
In this model, one item can belong to only one Entity; one Entity can have (own) many items.
your link table is ok.
the trouble you will have is that you will need to generate dynamic sql at runtime. parameterized sql does not typically allow the objects inthe FROM list to be parameters.
i fyou want to avoid this, you may be able to denormalize a little - say by creating a table to hold the id (assuming the ids are unique across the other tables) and the type_id representing which table is the source, and a generated description - e.g. the name value from the inital record.
you would trigger the creation of this denormalized list when the base info is modified, and you could use that for generalized queries - and then resort to your dynamic queries when needed at runtime.
I want to create a product catalog that allows for intricate details on each of the product types in the catalog. The product types have vastly different data associated with them; some with only generic data, some with a few extra fields of data, some with many fields that are specific to that product type. I need to easily add new product types to the system and respect their configuration, and I'd love tips on how to design the data model for these products as well as how to handle persistence and retrieval.
Some products will be very generic and I plan to use a common UI for editing those products. The products that have extensible configuration associated with them will get new views (and controllers) created for their editing. I expect all custom products to have their own model defined but to share a common base class. The base class would represent the generic product that has no custom fields.
Example products that need to be handled:
Generic product
Description
Light Bulb
Description
Type (with an enum of florescent, incandescent, halogen, led)
Wattage
Style (enum of flood, spot, etc.)
Refrigerator
Description
Make
Model
Style (with an enum in the domain model)
Water Filter information
Part number
Description
I expect to use MEF for discovering what product types are available in the system. I plan to create assemblies that contain product type models, views, and controllers, drop those assemblies into the bin, and have the application discover the new product types, and show them in the navigation.
Using SQL Server 2008, what would be the best way to store products of these various types, allowing for new types to be added without having to grow the database schema?
When retrieving data from the database, what's the best way to translate these polymorphic entities into their correct domain models?
Updates and Clarifications
To avoid the Inner Platform Effect, if there is a database table for every product type (to store the products of that type), then I still need a way to retrieve all products that spans product types. How would that be achieved?
I talked with Nikhilk in more detail about his SharePoint reference. Specifically, he was talking about this: http://msdn.microsoft.com/en-us/library/ms998711.aspx. It actually seems pretty attractive. No need to parse XML; and there is some indexing that could be done allowing for simple and fast queries over the data. For instance, I could say "find all 75-watt light bulbs" by knowing that the first int column in the row is the wattage when the row represents a light bulb. Something (NHibernate?) in the app tier would define the mapping from the product type to the userdata schema.
Voted down the schema that has the Property Table because this could lead to lots of rows per product. This could lead to index difficulties, plus all queries would have to essentially pivot the data.
Use a Sharepoint-style UserData table, that has a set of string columns, a set of int columns, etc. and a Type column.
Then you have a list of types table that specifies the schema for each type - its properties, and the specific columns they map to in the UserData table.
With things like Azure and other utility computing storage you don't even need to define a table. Every store object is basically a dictionary.
I think you need to go with a data model like --
Product Table
ProductId (PK)
ProductName
Details
Property Table
PropertyId (PK)
ProductId (FK)
ParentPropertyId (FK - Self referenced to categorize properties)
PropertyName
PropertyValue
PropertyValueTypeId
Property Value Lookup Table
PropertyValueLookupId (PK)
PropertyId (FK)
LookupValue
And then have a dynamic view based on this. You could use the PropertyValueTypeId coloumn to identify the type, using a convention, like (0- string, 1-integer, 2-float, 3-image etc) - But ultimately you can store everything untyped only. You could also use this column to select the control template to render the corresponding property to the user.
You can use the Value lookup table to keep lookups for a specific property (so that user can choose it from a list)
Summarizing lets look at the options under consideration for storing product information:
1) some xml format in the database
2) similar to the post above about having x number of type defined columns (sharepoint approach)
3) via generic table with name and type definitions stored in lookup table and values in secondary table with columns id, propertyid, value (similar to #2 however this approach would provide unlimited property information
4) some hybrid of the above option where product table would have x common columns (for storage of properties common with all products) with y user defined columns (this could be m of integer type and n of varchar types). This may be taking the best of #2 and a normalzied structure as if you knew all the properties of all products. You would be getting the best sql performance for the properties that you use the most (probably those that are common across all products) while still allowing custom columns for specific properties with each product.
Are there other options? In my opinion I would consider 4 above as the best hybrid of the combinations.
dave
Put as much of the shared anticipated structure in traditional normalized 3NF model, then augment with XML columns as appropriate.
I don't see MEF (or any other ORM) being able to do all this transparently.
I think you should avoid the Inner Platform Effect and actually build tables for your specialized entities. You'll be writing specific code to manage them so why not have proper backing tables too?
It will make your deployment slightly harder - drop in an assembly and run a script - but it will probably save you a lot of pain in the long run.
Jeff,
we currently use a XML field in the Products table to handle all product-specific data. So our Products table has a few common fields that all products share, an XML which contains whatever a particular product needs additionally, and a few computed fields that grab into the XML and surface some of the frequently queried fields as "virtual" fields on the Products table (e.g. "Style" would be set to whatever the current product defines, or NULL, if the product doesn't have a Style property).
So far, we've been quite flexible with that approach - if you create some decent XSD schemas for your XML, you can even create C# proxy classes for these fields.
Works nicely for us - joining the best of both the relational and XML worlds.
Marc