It is safe to say that the EAV/CR database model is bad. That said,
Question: What database model, technique, or pattern should be used to deal with "classes" of attributes describing e-commerce products which can be changed at run time?
In a good e-commerce database, you will store classes of options (like TV resolution, with a resolution for each TV, though the next product may not be a TV and will not have a "TV resolution"). How do you store them, search them efficiently, and allow your users to set up product types with variable fields describing their products? If the search engine finds that customers typically search for TVs based on console depth, you could add console depth to your fields, then add a single depth for each TV product type at run time.
There is a nice common feature among good e-commerce apps where they show a set of products, then have "drill down" side menus where you can see "TV Resolution" as a header, and the top five most common TV Resolutions for the found set. You click one and it only shows TVs of that resolution, allowing you to further drill down by selecting other categories on the side menu. These options would be the dynamic product attributes added at run time.
Further discussion:
So, long story short, are there any links out on the Internet or model descriptions that could "academically" fix the following setup? I thank Noel Kennedy for suggesting a category table, but the need may be greater than that. I describe it a different way below, trying to highlight the significance. I may need a viewpoint correction to solve the problem, or I may need to go deeper into EAV/CR.
Love the positive response to the EAV/CR model. My fellow developers all say what Jeffrey Kemp touched on below: "new entities must be modeled and designed by a professional" (taken out of context, read his response below). The problem is:
entities add and remove attributes weekly (search keywords dictate future attributes)
new entities arrive weekly (products are assembled from parts)
old entities go away weekly (archived, less popular, seasonal)
The customer wants to add attributes to the products for two reasons:
department / keyword search / comparison chart between like products
consumer product configuration before checkout
The attributes must have significance, not just a keyword search. If they want to compare all cakes that have a "whipped cream frosting", they can click cakes, click birthday theme, click whipped cream frosting, then check all cakes that are interesting knowing they all have whipped cream frosting. This is not specific to cakes, just an example.
There are a few general pros and cons I can think of; there are situations where one option is better than the other:
Option 1, EAV Model:
Pro: less time to design and develop a simple application
Pro: new entities easy to add (might even be added by users?)
Pro: "generic" interface components
Con: complex code required to validate simple data types
Con: much more complex SQL for simple reports
Con: complex reports can become almost impossible
Con: poor performance for large data sets
Option 2, Modelling each entity separately:
Con: more time required to gather requirements and design
Con: new entities must be modelled and designed by a professional
Con: custom interface components for each entity
Pro: data type constraints and validation simple to implement
Pro: SQL is easy to write, easy to understand and debug
Pro: even the most complex reports are relatively simple
Pro: best performance for large data sets
Option 3, Combination (model entities "properly", but add "extensions" for custom attributes for some/all entities)
Pro/Con: more time required to gather requirements and design than option 1 but perhaps not as much as option 2 *
Con: new entities must be modelled and designed by a professional
Pro: new attributes might be easily added later on
Con: complex code required to validate simple data types (for the custom attributes)
Con: custom interface components still required, but generic interface components may be possible for the custom attributes
Con: SQL becomes complex as soon as any custom attribute is included in a report
Con: good performance generally, unless you start needing to search by or report by the custom attributes
* I'm not sure if Option 3 would necessarily save any time in the design phase.
Personally I would lean toward option 2, and avoid EAV wherever possible. However, for some scenarios the users need the flexibility that comes with EAV; but this comes with a great cost.
It is safe to say that the EAV/CR database model is bad.
No, it's not. It's just that it's an inefficient use of relational databases. A purely key/value store works great with this model.
Now, to your real question: How to store various attributes and keep them searchable?
Just use EAV. In your case it would be a single extra table. Index it on both attribute name and value; most RDBMSs would use prefix compression on the repeated attribute names, making it really fast and compact.
EAV/CR gets ugly when you use it to replace 'real' fields. As with every tool, overusing it is 'bad', and gives it a bad image.
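As a rough illustration of the "single extra table" idea, here is a minimal sketch using Python's built-in sqlite3 module. The table, column, and function names (product_attribute, facet_counts, products_with) and the sample rows are assumptions made up for this example, and SQLite stands in for whatever RDBMS you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- The single extra EAV table: one row per (product, attribute) pair.
CREATE TABLE product_attribute (
    product_id INTEGER NOT NULL REFERENCES product(id),
    name       TEXT NOT NULL,
    value      TEXT NOT NULL,
    PRIMARY KEY (product_id, name)
);
-- Index on attribute name and value.
CREATE INDEX idx_attr_name_value ON product_attribute (name, value);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "TV A"), (2, "TV B"), (3, "TV C"), (4, "Cake")])
conn.executemany("INSERT INTO product_attribute VALUES (?, ?, ?)", [
    (1, "resolution", "1080p"), (2, "resolution", "1080p"),
    (3, "resolution", "4K"), (4, "frosting", "whipped cream"),
])

def facet_counts(conn, attribute, limit=5):
    """Most common values of one attribute, for a drill-down side menu."""
    return conn.execute(
        "SELECT value, COUNT(*) AS n FROM product_attribute "
        "WHERE name = ? GROUP BY value ORDER BY n DESC, value LIMIT ?",
        (attribute, limit)).fetchall()

def products_with(conn, attribute, value):
    """Names of products that have the given attribute value."""
    rows = conn.execute(
        "SELECT p.name FROM product p "
        "JOIN product_attribute a ON a.product_id = p.id "
        "WHERE a.name = ? AND a.value = ? ORDER BY p.name",
        (attribute, value)).fetchall()
    return [row[0] for row in rows]
```

Here facet_counts would feed the side menu headers, and products_with would run when the user clicks one of the values. Adding a new attribute at run time is just inserting rows with a new name.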
// At this point, I'd like to take a moment to speak to you about the Magento/Adobe PSD format.
// Magento/PSD is not a good ecommerce platform/format. Magento/PSD is not even a bad ecommerce platform/format. Calling it such would be an
// insult to other bad ecommerce platform/formats, such as Zencart or OsCommerce. No, Magento/PSD is an abysmal ecommerce platform/format. Having
// worked on this code for several weeks now, my hate for Magento/PSD has grown to a raging fire
// that burns with the fierce passion of a million suns.
http://code.google.com/p/xee/source/browse/trunk/XeePhotoshopLoader.m?spec=svn28&r=11#107
The internal models are wacky at best, like someone put the schema into a Boggle game, sealed that and put it in a paint shaker...
Real world: I'm working on a middleware fulfilment app, and here is one of the queries to get address information.
CREATE OR REPLACE VIEW sales_flat_addresses AS
SELECT sales_order_entity.parent_id AS order_id,
       sales_order_entity.entity_id,
       CONCAT(CONCAT(UCASE(MID(sales_order_entity_varchar.value,1,1)),MID(sales_order_entity_varchar.value,2)), "Address") as type,
       GROUP_CONCAT(
           CONCAT( eav_attribute.attribute_code," ::::: ", sales_order_entity_varchar.value )
           ORDER BY sales_order_entity_varchar.value DESC
           SEPARATOR '!!!!!'
       ) as data
FROM sales_order_entity
INNER JOIN sales_order_entity_varchar ON sales_order_entity_varchar.entity_id = sales_order_entity.entity_id
INNER JOIN eav_attribute ON eav_attribute.attribute_id = sales_order_entity_varchar.attribute_id
    AND sales_order_entity.entity_type_id =12
GROUP BY sales_order_entity.entity_id
ORDER BY eav_attribute.attribute_code = 'address_type'
Extracts address information for an order, lazily
--
Summary: Only use Magento if:
You are being given large sacks of money
You must
You enjoy pain
I'm surprised nobody mentioned NoSQL databases.
I've never practiced NoSQL in a production context (just tested MongoDB and was impressed) but the whole point of NoSQL is being able to save items with varying attributes in the same "document".
Where performance is not a major requirement, as in an ETL type of application, EAV has another distinct advantage: differential saves.
I've implemented a number of applications where an over-arching requirement was the ability to see the history of a domain object from its first "version" to its current state. If that domain object has a large number of attributes, each change requires a new row to be inserted into its corresponding table (not an update, because the history would be lost, but an insert). Let's say this domain object is a Person, and I have 500k Persons to track, with an average of 100+ changes to various attributes over each Person's life cycle. Couple that with the fact that rare is the application that has only one major domain object, and you'll quickly surmise that the size of the database would grow out of control.
An easy solution is to save only the differential changes to the major domain objects rather than repeatedly saving redundant information.
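A minimal sketch of such differential saves, using Python's sqlite3 module. The person_change table and the save and current_state helpers are hypothetical names chosen for illustration, and the sketch assumes attributes are only ever added or changed, never removed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per changed attribute per version, instead of one full row per version.
conn.execute("""
CREATE TABLE person_change (
    person_id INTEGER NOT NULL,
    version   INTEGER NOT NULL,
    attribute TEXT NOT NULL,
    value     TEXT,
    PRIMARY KEY (person_id, version, attribute)
)""")

def current_state(conn, person_id, as_of=None):
    """Replay changes in version order; later values overwrite earlier ones."""
    sql = "SELECT attribute, value FROM person_change WHERE person_id = ?"
    params = [person_id]
    if as_of is not None:
        sql += " AND version <= ?"
        params.append(as_of)
    sql += " ORDER BY version"
    state = {}
    for attribute, value in conn.execute(sql, params):
        state[attribute] = value
    return state

def save(conn, person_id, new_state):
    """Insert rows only for attributes whose value differs from the current state."""
    old = current_state(conn, person_id)
    changed = {k: v for k, v in new_state.items() if old.get(k) != v}
    if not changed:
        return 0
    (last,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM person_change WHERE person_id = ?",
        (person_id,)).fetchone()
    conn.executemany(
        "INSERT INTO person_change VALUES (?, ?, ?, ?)",
        [(person_id, last + 1, k, v) for k, v in changed.items()])
    return len(changed)
```

Calling current_state with as_of set to an earlier version reconstructs the object as it was at that point, so the full history stays available without storing the unchanged attributes again.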
All models change over time to reflect new business needs. Period. Using EAV is but one of the tools in our box to use; but it should never be automatically classified as "bad".
I'm struggling with the same issue. It may be interesting for you to check out the following discussion on two existing ecommerce solutions: Magento (EAV) and Joomla (regular relational structure):
https://forum.virtuemart.net/index.php?topic=58686.0
It seems that Magento's EAV performance is a real showstopper.
That's why I'm leaning towards a normalized structure. To overcome the lack of flexibility I'm thinking about adding some separate data dictionary in the future (XML or separate DB tables) that could be edited, and based on that, application code for displaying and comparing product categories with new attributes set would be generated, together with SQL scripts.
Such an architecture seems to be the sweet spot in this case: flexible and performant at the same time.
The problem could be frequent use of ALTER TABLE in a live environment. I'm using Postgres, so its MVCC and transactional DDL will hopefully ease the pain.
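One way the data dictionary idea might look, as a sketch only: a dictionary of tables and columns is compared against the live schema, and ALTER TABLE statements are generated for whatever is missing. The dictionary layout and function names are assumptions for illustration, and SQLite (with its PRAGMA table_info) stands in for Postgres here, where you would query the catalog instead.

```python
import sqlite3

# Hypothetical data dictionary: table name -> {column name: SQL type}.
DATA_DICTIONARY = {
    "tv": {"resolution": "TEXT", "console_depth_cm": "REAL"},
}

def existing_columns(conn, table):
    """Column names currently present in the table (SQLite-specific PRAGMA)."""
    return {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}

def migration_statements(conn, dictionary):
    """ALTER TABLE statements for dictionary columns missing from the schema."""
    statements = []
    for table, columns in dictionary.items():
        present = existing_columns(conn, table)
        for column, sql_type in columns.items():
            if column not in present:
                statements.append(
                    f'ALTER TABLE "{table}" ADD COLUMN "{column}" {sql_type}')
    return statements

def migrate(conn, dictionary):
    """Run the pending statements and return them for logging or review."""
    statements = migration_statements(conn, dictionary)
    for statement in statements:
        conn.execute(statement)
    return statements
```

The names in the dictionary are interpolated into SQL, so this only makes sense when the dictionary is edited by trusted administrators and validated first.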
I still vote for modeling at the lowest meaningful atomic level for EAV. Let the standards, technologies and applications geared toward a particular user community decide content models, how attributes repeat, granularity, etc.
If it's just about product catalog attributes, and the validation requirements for those attributes are therefore rather limited, the only real downside to EAV is query performance. Even that is only a problem when your query deals with multiple "things" (products) with attributes; the query "give me all attributes for the product with id 234", while not optimal, is still plenty fast.
One solution is to use the SQL database / EAV model only for the admin / edit side of the product catalog and have some process that denormalizes the products into something that makes it searchable. Since you already have attributes and hence it's rather likely that you want faceting, this something could be Solr or ElasticSearch. This approach avoids basically all downsides to the EAV model and the added complexity is limited to serializing a complete product to JSON on update.
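The denormalization step could be sketched roughly like this. The function names and the document layout (attributes nested under an "attributes" key) are assumptions for illustration; the actual call that pushes the documents to Solr or ElasticSearch is left out.

```python
import json
from collections import defaultdict

def denormalize(products, attribute_rows):
    """Fold (product_id, name, value) EAV rows into one document per product."""
    attributes = defaultdict(dict)
    for product_id, name, value in attribute_rows:
        attributes[product_id][name] = value
    return [
        {"id": product_id, "name": product_name,
         "attributes": dict(attributes.get(product_id, {}))}
        for product_id, product_name in products
    ]

def to_json_lines(documents):
    """One JSON document per line, ready to be handed to a search indexer."""
    return "\n".join(json.dumps(doc, sort_keys=True) for doc in documents)
```

Nesting the dynamic attributes under their own key keeps a user-defined attribute called "name" or "id" from overwriting the fixed product fields.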
EAV has many drawbacks:
Performance degradation over time
Once the amount of data in the application grows beyond a certain size, the retrieval and manipulation of that data is likely to become less and less efficient.
The SQL queries are very complex and difficult to write.
Data Integrity problems.
You can't define foreign keys for all the fields needed.
You have to define and maintain your own metadata.
I have a slightly different problem: instead of many attributes with sparse values (which is possibly a good reason to use EAV), I want to store something more like a spreadsheet. The columns in the sheet can change, but within a sheet all cells will contain data (not sparse).
I made a small set of tests to benchmark two designs: one using EAV, and the other using a Postgres ARRAY to store cell data.
EAV
Array
Both schemas have indexes on appropriate columns, and the indexes are used by the planner.
It turned out the array-based schema was an order of magnitude faster for both inserts and queries. From quick tests, it seemed that both scaled linearly. The tests aren't very thorough, though. Suggestions and forks welcome - they're under an MIT licence.
In various locations of the Social Tables API documentation (and within the application UI), I have seen reference to the following terms: Venues, Bookable Spaces, and Room Diagrams. Can you please provide an explanation for what each of these are?
Sure!
Bookable Rooms are the actual floorplans of the rooms you are planning in. For instance, if you are planning an event in a large Hotel in the Grand Ballroom, the Ballroom floorplan is represented in our system by a bookable room.
Venue is a generic term, but if you're referring to a venue id: sometimes it refers to a bookable room, and in these cases it is prepended with an S. But Social Tables events can also be planned with a PDF or image background if we don't have a floorplan CAD of the space you're planning in.
Diagrams are the actual end product of using our main diagramming product. Think of it as a document with a collection of tables, chairs and other items.
Hope that helps.
I'm using a SQL database (MariaDB) to store information about "widgets" and "products".
Widgets have metadata associated with them, along with some relational data such as build team, builders, and an image. Builders can be on any number of teams, and any team can build any widget. Currently this data is normalized into separate tables and mapping/associative tables.
Products also have metadata; description, what it was used for, when it was delivered, etc.
The products and widgets tables seem to be a good fit for a NoSQL solution. Perhaps having to de-normalize builders and build teams.
Where the data does not seem to be a good fit for NoSQL is the relationship between products and widgets. Widgets can be associated/mapped to 0 or more products. Each widget mapped to a product provides a capability to that product (Widget A may provide a web service, Widget B may provide locomotion, etc.).
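For reference, the relationship described above is what an associative table expresses in a relational schema. This is a minimal sketch using Python's sqlite3 module; the table and column names and the sample rows are assumptions for illustration, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE widget  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Associative table: a widget may map to zero or more products,
-- and each mapping records the capability the widget provides.
CREATE TABLE widget_product (
    widget_id  INTEGER NOT NULL REFERENCES widget(id),
    product_id INTEGER NOT NULL REFERENCES product(id),
    capability TEXT NOT NULL,
    PRIMARY KEY (widget_id, product_id)
);
INSERT INTO widget  VALUES (1, 'Widget A'), (2, 'Widget B'), (3, 'Widget C');
INSERT INTO product VALUES (10, 'Product X');
INSERT INTO widget_product VALUES (1, 10, 'web service'), (2, 10, 'locomotion');
""")

def capabilities(conn, product_name):
    """(widget, capability) pairs for one product."""
    return conn.execute(
        "SELECT w.name, wp.capability FROM product p "
        "JOIN widget_product wp ON wp.product_id = p.id "
        "JOIN widget w ON w.id = wp.widget_id "
        "WHERE p.name = ? ORDER BY w.name", (product_name,)).fetchall()

def unmapped_widgets(conn):
    """Widgets that are currently mapped to no product at all."""
    rows = conn.execute(
        "SELECT w.name FROM widget w "
        "LEFT JOIN widget_product wp ON wp.widget_id = w.id "
        "WHERE wp.widget_id IS NULL ORDER BY w.name").fetchall()
    return [row[0] for row in rows]
```

With this shape, re-mapping after a change to the product suite touches only rows in widget_product; the widget and product metadata stay as they are.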
Every once in a while the suite of products changes and the widgets are then re-mapped to the new set of products. The data is being used with Business Intelligence (BI) software (Jaspersoft Studio) to generate reports.
The data is not large. It is for our internal team's use and to generate reports for other teams as they request them. So I'm not worried about ACID compliance or write locks, vertical or horizontal scaling, 24x7 availability, and those sorts of things. My primary concern is flexibility as the data changes (the metadata captured about the widgets and the set of products the widgets are applied to).
Based on my research, NoSQL should be avoided if your data is at all relational. The articles I found were rather old, though, and I'm wondering whether this is still the case.
When the suite of products changes, it is painful to re-map the widgets to the new suite. Notionally, it seems that a NoSQL solution could help ease that pain, but I'm not sure how.
SQL is better with set operations and relations; it will also be faster at filtering the sets you are working with.
NoSQL has greater flexibility of functions and can do some single-row operations faster.
If you need to use NoSQL, I would recommend having two environments: SQL for housing all the data, and NoSQL to do the calculations for you. To feed the data into NoSQL, best practice would be to create table-valued functions or views.
My mind is all over the place and I am not sure how to go about starting this UML diagram. This is the problem:
The library consists of a lot of publications in several types of media – books, periodicals (also called magazines), newspapers, audio, and video. Each publication falls into a particular genre – fiction, nonfiction, self-help, or performance – and target age – children, teen, adult, or restricted (which means adult only). Each publication also includes a unique ISBN identifier, which is just text. Design an object-oriented application that manages these publications as objects. We'll want to know the title, author, copyright year, genre, media, target age, and ISBN for each of our publications. We'll also want to know if the publication is checked in or checked out to a customer, and if checked out to a customer, who that customer is (their name and telephone number). It's OK to enter the customer information each time a publication is checked out.
Each publication object should be able to print its contents and its check out status something like this:
“The Firm” by John Grisham, 1991 (adult fiction book) ISBN: 0440245923
Checked out to Mike Williams (817-272-3785)
We'll need a simple console application with 5 operations: (1) Create a new publication, (2) List all publications created in the system, (3) Check out a publication to a patron, recording their name and phone number, (4) Check in a publication that was previously checked out, and (5) some short, basic documentation on how to use the system. (Persistence is NOT required. Each time your program is run, it may start without publications.)
You should design this system in UML first, creating (at least) a basic Use Case diagram, an Activity diagram for at least action (3) above, and a class diagram.
For the Use Case diagram, determine all the use cases and actors, then draw the diagram. Just google what such a diagram looks like and create yours.
For the Activity diagram, determine what steps these use cases would consist of and show them on a diagram.
For the Class diagram, identify all the candidate classes and think about which ones you would keep or discard. You can also determine the attributes of each class. Then draw a diagram.
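To make the class identification step more concrete, here is one possible set of candidate classes sketched in Python. This is only an assumption about how the classes might fall out (three enumerations, a Patron, and a Publication); your own diagram may reasonably differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Genre(Enum):
    FICTION = "fiction"
    NONFICTION = "nonfiction"
    SELF_HELP = "self-help"
    PERFORMANCE = "performance"

class Media(Enum):
    BOOK = "book"
    PERIODICAL = "periodical"
    NEWSPAPER = "newspaper"
    AUDIO = "audio"
    VIDEO = "video"

class Age(Enum):
    CHILDREN = "children"
    TEEN = "teen"
    ADULT = "adult"
    RESTRICTED = "restricted"

@dataclass
class Patron:
    name: str
    phone: str

@dataclass
class Publication:
    title: str
    author: str
    copyright_year: int
    genre: Genre
    media: Media
    target_age: Age
    isbn: str
    patron: Optional[Patron] = None  # None means the publication is checked in

    def check_out(self, patron: Patron) -> None:
        if self.patron is not None:
            raise ValueError("publication is already checked out")
        self.patron = patron

    def check_in(self) -> None:
        if self.patron is None:
            raise ValueError("publication is not checked out")
        self.patron = None

    def __str__(self) -> str:
        text = (f'"{self.title}" by {self.author}, {self.copyright_year} '
                f"({self.target_age.value} {self.genre.value} "
                f"{self.media.value}) ISBN: {self.isbn}")
        if self.patron is not None:
            text += f"\nChecked out to {self.patron.name} ({self.patron.phone})"
        return text
```

Each enumeration, Patron, and Publication would become a box in the class diagram, with Publication holding an optional (0..1) association to Patron.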
I'm using Bonita Studio and I would like to create a task that is shared between two pools, and another that is shared between two lanes of a pool.
I'm trying to draw all the processes in a web dev company in order to optimize them, and a visual aid is probably the best way to show the modifications. With this goal in mind, I need to represent meetings along with the different people participating in them.
Am I using the right notation model? If so how do I show these meetings? If not what is a better notation model to show a process with different people working together on different tasks with meetings?
In BPMN, pools identify processes rather than objects or subjects. You cannot split a single task between two pools.
You also cannot split it between two pool lanes. Actually, lanes have no predefined semantics; they can represent anything you want: companies, people, divisions, etc.
So, I'd propose the following. If you have "Software developer", "Designer", and "Product Manager" lanes and want all of them to participate in a meeting, add one more lane and name it "All project participants". Then put the meeting task there.
I think that this won't violate either the syntax or the semantics of BPMN.
Although you're using Bonita, I'd advise you to visit http://elearning.bizagi.com for a nice online video course dedicated to BPMN diagram design. It is based on the BizAgi modeler, but 1) this modeler is freeware and 2) There is One BPMN Standard to rule them all. :)
A task done by many persons has no one responsible for it.
Just decide who is responsible for establishing and facilitating such a meeting, and make an "Organize Meeting" task in his pool.
You can attach a note listing whom he should invite, a checklist, and the expected outcomes.