Related
It is safe to say that the EAV/CR database model is bad. That said,
Question: What database model, technique, or pattern should be used to deal with "classes" of attributes describing e-commerce products which can be changed at run time?
In a good E-commerce database, you will store classes of options (like TV resolution then have a resolution for each TV, but the next product may not be a TV and not have "TV resolution"). How do you store them, search efficiently, and allow your users to setup product types with variable fields describing their products? If the search engine finds that customers typically search for TVs based on console depth, you could add console depth to your fields, then add a single depth for each tv product type at run time.
There is a nice common feature among good e-commerce apps where they show a set of products, then have "drill down" side menus where you can see "TV Resolution" as a header, and the top five most common TV Resolutions for the found set. You click one and it only shows TVs of that resolution, allowing you to further drill down by selecting other categories on the side menu. These options would be the dynamic product attributes added at run time.
Further discussion:
So long story short, are there any links out on the Internet or model descriptions that could "academically" fix the following setup? I thank Noel Kennedy for suggesting a category table, but the need may be greater than that. I describe it a different way below, trying to highlight the significance. I may need a viewpoint correction to solve the problem, or I may need to go deeper in to the EAV/CR.
Love the positive response to the EAV/CR model. My fellow developers all say what Jeffrey Kemp touched on below: "new entities must be modeled and designed by a professional" (taken out of context, read his response below). The problem is:
entities add and remove attributes weekly (search keywords dictate future attributes)
new entities arrive weekly (products are assembled from parts)
old entities go away weekly (archived, less popular, seasonal)
The customer wants to add attributes to the products for two reasons:
department / keyword search / comparison chart between like products
consumer product configuration before checkout
The attributes must have significance, not just a keyword search. If they want to compare all cakes that have a "whipped cream frosting", they can click cakes, click birthday theme, click whipped cream frosting, then check all cakes that are interesting knowing they all have whipped cream frosting. This is not specific to cakes, just an example.
There's a few general pros and cons I can think of, there are situations where one is better than the other:
Option 1, EAV Model:
Pro: less time to design and develop a simple application
Pro: new entities easy to add (might even
be added by users?)
Pro: "generic" interface components
Con: complex code required to validate simple data types
Con: much more complex SQL for simple
reports
Con: complex reports can become almost
impossible
Con: poor performance for large data sets
Option 2, Modelling each entity separately:
Con: more time required to gather
requirements and design
Con: new entities must be modelled and
designed by a professional
Con: custom interface components for each
entity
Pro: data type constraints and validation simple to implement
Pro: SQL is easy to write, easy to
understand and debug
Pro: even the most complex reports are relatively simple
Pro: best performance for large data sets
Option 3, Combination (model entities "properly", but add "extensions" for custom attributes for some/all entities)
Pro/Con: more time required to gather requirements and design than option 1 but perhaps not as much as option 2 *
Con: new entities must be modelled and designed by a professional
Pro: new attributes might be easily added later on
Con: complex code required to validate simple data types (for the custom attributes)
Con: custom interface components still required, but generic interface components may be possible for the custom attributes
Con: SQL becomes complex as soon as any custom attribute is included in a report
Con: good performance generally, unless you start need to search by or report by the custom attributes
* I'm not sure if Option 3 would necessarily save any time in the design phase.
Personally I would lean toward option 2, and avoid EAV wherever possible. However, for some scenarios the users need the flexibility that comes with EAV; but this comes with a great cost.
It is safe to say that the EAV/CR database model is bad.
No, it's not. It's just that they're an inefficient usage of relational databases. A purely key/value store works great with this model.
Now, to your real question: How to store various attributes and keep them searchable?
Just use EAV. In your case it would be a single extra table. index it on both attribute name and value, most RDBMs would use prefix-compression to on the attribute name repetitions, making it really fast and compact.
EAV/CR gets ugly when you use it to replace 'real' fields. As with every tool, overusing it is 'bad', and gives it a bad image.
// At this point, I'd like to take a moment to speak to you about the Magento/Adobe PSD format.
// Magento/PSD is not a good ecommerce platform/format. Magento/PSD is not even a bad ecommerce platform/format. Calling it such would be an
// insult to other bad ecommerce platform/formats, such as Zencart or OsCommerce. No, Magento/PSD is an abysmal ecommerce platform/format. Having
// worked on this code for several weeks now, my hate for Magento/PSD has grown to a raging fire
// that burns with the fierce passion of a million suns.
http://code.google.com/p/xee/source/browse/trunk/XeePhotoshopLoader.m?spec=svn28&r=11#107
The internal models are wacky at best, like someone put the schema into a boggle game, sealed that and put it in a paint shacker...
Real world: I'm working on a midware fulfilment app and here are one the queries to get address information.
CREATE OR REPLACE VIEW sales_flat_addresses AS
SELECT sales_order_entity.parent_id AS order_id,
sales_order_entity.entity_id,
CONCAT(CONCAT(UCASE(MID(sales_order_entity_varchar.value,1,1)),MID(sales_order_entity_varchar.value,2)), "Address") as type,
GROUP_CONCAT(
CONCAT( eav_attribute.attribute_code," ::::: ", sales_order_entity_varchar.value )
ORDER BY sales_order_entity_varchar.value DESC
SEPARATOR '!!!!!'
) as data
FROM sales_order_entity
INNER JOIN sales_order_entity_varchar ON sales_order_entity_varchar.entity_id = sales_order_entity.entity_id
INNER JOIN eav_attribute ON eav_attribute.attribute_id = sales_order_entity_varchar.attribute_id
AND sales_order_entity.entity_type_id =12
GROUP BY sales_order_entity.entity_id
ORDER BY eav_attribute.attribute_code = 'address_type'
Exacts address information for an order, lazily
--
Summary: Only use Magento if:
You are being given large sacks of money
You must
Enjoy pain
I'm surprised nobody mentioned NoSQL databases.
I've never practiced NoSQL in a production context (just tested MongoDB and was impressed) but the whole point of NoSQL is being able to save items with varying attributes in the same "document".
Where performance is not a major requirement, as in an ETL type of application, EAV has another distinct advantage: differential saves.
I've implemented a number of applications where an over-arching requirement was the ability to see the history of a domain object from its first "version" to it's current state. If that domain object has a large number of attributes, that means each change requires a new row be inserted into it's corresponding table (not an update because the history would be lost, but an insert). Let's say this domain object is a Person, and I have 500k Persons to track with an average of 100+ changes over the Persons life-cycle to various attributes. Couple that with the fact that rare is the application that has only 1 major domain object and you'll quickly surmize that the size of the database would quickly grow out of control.
An easy solution is to save only the differential changes to the major domain objects rather than repeatedly saving redundant information.
All models change over time to reflect new business needs. Period. Using EAV is but one of the tools in our box to use; but it should never be automatically classified as "bad".
I'm struggling with the same issue. It may be interesting for you to check out the following discussion on two existing ecommerce solutions: Magento (EAV) and Joomla (regular relational structure):
https://forum.virtuemart.net/index.php?topic=58686.0
It seems, that Magento's EAV performance is a real showstopper.
That's why I'm leaning towards a normalized structure. To overcome the lack of flexibility I'm thinking about adding some separate data dictionary in the future (XML or separate DB tables) that could be edited, and based on that, application code for displaying and comparing product categories with new attributes set would be generated, together with SQL scripts.
Such architecture seems to be the sweetspot in this case - flexible and performant at the same time.
The problem could be frequent use of ALTER TABLE in live environment. I'm using Postgres, so its MVCC and transactional DDL will hopefully ease the pain.
I still vote for modeling at the lowest-meaningful atomic-level for EAV. Let standards, technologies and applications that gear toward certain user community to decide content models, repetition needs of attributes, grains, etc.
If it's just about the product catalog attributes and hence validation requirements for those attributes are rather limited, the only real downside to EAV is query performance and even that is only a problem when your query deals with multiple "things" (products) with attributes, the performance for the query "give me all attributes for the product with id 234" while not optimal is still plenty fast.
One solution is to use the SQL database / EAV model only for the admin / edit side of the product catalog and have some process that denormalizes the products into something that makes it searchable. Since you already have attributes and hence it's rather likely that you want faceting, this something could be Solr or ElasticSearch. This approach avoids basically all downsides to the EAV model and the added complexity is limited to serializing a complete product to JSON on update.
EAV has many drawbacks:
Performance degradation over time
Once the amount of data in the application grows beyond a certain size, the retrieval and manipulation of that data is likely to become less and less efficient.
The SQL queries are very complex and difficult to write.
Data Integrity problems.
You can't define foreign keys for all the fields needed.
You have to define and maintain your own metadata.
I have a slightly different problem: instead of many attributes with sparse values (which is possibly a good reason to use EAV), I want to store something more like a spreadsheet. The columns in the sheet can change, but within a sheet all cells will contain data (not sparse).
I made a small set of tests to benchmark two designs: one using EAV, and the other using a Postgres ARRAY to store cell data.
EAV
Array
Both schemas have indexes on appropriate columns, and the indexes are used by the planner.
It turned out the array-based schema was an order of magnitude faster for both inserts and queries. From quick tests, it seemed that both scaled linearly. The tests aren't very thorough, though. Suggestions and forks welcome - they're under an MIT licence.
The company I work for has started a new initiative in HL7 where we are trading both v2X and v3 (CDA specifically) messages. I am at the point where I am able to accept, validate and acknowledge the messages we are receiving from our trading partners and have started to create a data model for the backend storage of said messages. After a lot of consideration and research I am at a loss for the best way to approach this in MS SQL Server 2008 R2.
Currently my idea is to essentially load the data into a data warehouse directly from my integration engine (BizTalk) and foregoing a backing, normalized operational database. I have set up the database for v2X messages according to the v2.7 specs as all versions of HL7 v2 are backward compatible (I can store any previous versions in the same database). My initial design has a table for each segment which will tie back to a header table with a guid I am generating and storing at run time. The biggest issue with this approach is the amount of columns in each table and it's something I have no experience with. For instance the PV1 segment has 569 columns in order to accommodate all possible data. In addition to this I need to make all columns varchar and make them big enough to house any possible customization scenario from our vendors. I am planning on using varchar(1024) to achieve this. A lot of these columns (the majority probably) would be NULL so I would use SPARSE columns. This screams bad design to me but fully normalizing these tables would require a ton of work in both BizTalk and SQL server and I'm not sure what I would gain from doing so. I'm trying to be pragmatic since I have a deadline.
If fully normalized, I would essentially have to create stored procs that would have a ton of parameters OR split these messages to the nth degree to do individual loads into the smaller subtables and make sure they all correlate back to the original guid. I would also want to maintain ACID processing which could get tricky and cause a lot of overhead in BizTalk. I suppose a 3rd option would be to use nHapi to create objects out of the messages I could tie into with Entity Framework but nHapi seems like a dead project and I have no experience with Entity Framework as of right now.
I'm basically at a loss and need help from some industry professionals who have experience with HL7 data modeling. Is it worth the extra effort to fully normalize the tables? Will performance on the SQL side be abysmal if I use these denormalized segment tables with hundreds of columns (most of which will be NULL for each row)? I'm not a DBA so I'm trying to understand the pitfalls of each approach. I've also looked at RIMBAA but the HL7 RIM seems like a foreign language to me as an HL7 newbie and translating v2 messages to the RIM would probably take far longer than I have to complete this project. I'm hoping I'm overthinking this and there is a simpler solution staring me in the face. Hopefully this question isn't too open ended.
HL7 is not a "tight" standard inputs and expected outputs vary depending on the system you are talking to. In this case the adding in a broker such as Mirth, Rhaposdy or BizTalk is a very good idea.
What ever solution you employ make sure you can cope with "non standard" input and output as you will soon find things vary. On the HL7 versions 2X and 3 be aware that very few hospitals have the version 3 most still run 2X.
I have been down the road of working with a database that tried to follow the HL7 structure, it can work however it will take time and effort. Given that you have a tight dead line maybe break out the bits of the data you will need to search on and have fields (e.g. PID segment 3 is the patient id would be useful to have) the rest can go in your varchar. Also if you are not indexing on the column you could use varchar(max).
As for your Guids in the database, this can work fine, but be careful not to cluster any indexes using the Guid as this will fragment your data. Do your research here and if in doubt go for identity columns instead.
I'll recommend the entity framework too, excellent ORM, well worth learning.
So my overall advice. Go for a hybrid for now, breaking out what you need. Expect it to evolve over time breaking out the pieces of HL7 into their own areas as needed. Do write a generic HL7 parser (not too difficult I've done it a couple of times) and keep it flexible. But most of all expect the HL7 to vary in structure don't treat the specification as 100% truth you will get variations.
I can only comment on the CDA (and some very limited HL7v2) side of things based on personal experience.
We receive and send CDA documents wrapped in HL7v3 wrappers from external vendors (as well as internal systems -- see below). The wrappers contain the metadata for things like sending/receiving systems/dates and other high-level data. The very limited message metadata is stripped and stored in the message data repository. Inside the wrapper, is the actual CDA, which is then taken and stored as XML datatype in the SQL database.
Using this model we can then search at the metadata level, but also narrow it down based on the CDA using Xpath queries. It makes the database much simpler...I can't even imagine creating columns based on the CDA schema.
As for making clients follow the CDA schema, as a part of the project we've created an implementation guide which clients must follow if they want to have their messages accepted.
Using the implementation guide + schematron + BizTalk and XSD validation, we only accept messages which follow the CDA schema. We then check some data fields using schematron validation and reject if any of those fail. This is relayed to the sender using an HL7v3 message back to them with the specific error message and/or fields that are invalid. This is a point at which a message will be stored in the database.
This is all done in BizTalk/SQL Server. And since the CDA schema is very much pre-defined by the HL7 group, you can make the consumers of this system follow the schema. This is unlike what I've seen with HL7v2 where it seems people just bend the schema as needed.
For the HL7v2 side of things, I'm 99% certain that "we" (as in, "my company") are storing the messages much in the same way. Except since since the HL7v2 schema is so open, we're not validating and just accepting/storing all messages. An HL7v2 parser has been written to parse the HL7v2 using the variations of schemas we know about.
In my project's case, we are sending HL7v2 from our HCIS --> Mirth --> BizTalk which then follows the Implementation guide + CDA Schema along with an XSLT transform to map the HL7v2 to CDA THEN submits it to the OTHER BizTalk CDA Submission service as though it was an external vendor.
That's a ton of reading right now, so please ask questions, as I'd like to talk about it.
In most cases it's a waste of time to try to create a normalized relational data model to persist HL7 V2 or V3 data. I would recommend just storing entire messages or documents as single XML column values. Then query using SQLXML and/or XQuery. All modern relational databases support this now.
Modeling on HL7 can be a pain.
I would do the following;
use the standards described in HL7 for staging tables, that way even if you have varchar(1024) and they are null it does not hurt you
create your actual table to be populated from the staging table as per the standards that you have enforced or will enforce.
This means that you have 500+ columns from the message but only 10 or 50 make sense, you will need to model only your 50. Yes, this has a lopside, tomorrow you want to make more meaning then it will increase from 50 to 75, the historical messages will not have information; which is fine but you will need to factor into the design.
I would under no circumstances attempt to model anything using the HL7 v3 RIM. The reason is that this schema is very generic, deferring much of the metadata to the message itself. Are you familiar with an EAV table? The RIM is like that.
On the other hand, HL7 v2 should be a fairly simple basis for a DB schema. You can create tables around segment types, and columns around field names.
I think the problem of pulling in everything kills the project and you should not do it. Typically, HL7 v2 messages carry a small subset of the whole, so it would be an utter waste to build out the whole thing, and it would be very confusing.
Further, the version of v2 you model would impact your schemas dramatically, with later versions, more and more fields become repeating fields, and your join relationships would change.
I recommend that you put a stake in the sand and start with v2.4 which is pretty easy yet still more complicated than most interfaces actually in use. Focus on a few segments and a few fields. MSH and PID first.
Add an EAV table to capture what may come in that you don't yet have in your tables. You can then look at what comes into this table over time and use it to decide what to build next. Your EAV could look like this MSG_ID, SEGMENT, SET_ID, FIELD_NAME, FIELD VALUE. Just store the unparsed HL7 contents of the field value.
I'm trying to build a WCF service that gets info from another service via XML. The XML usually has 4 elements ranging from int, string to DateTime. I want to build the service dynamically so that when it gets an XML, it stores it in the Database. I don't want to hardcode the types and element names in the code. If there is a change, I want it to dynamically add it to the database What is the best way of doing this? Is this a good practice? Or should i stick with just using Entity Framework and having a set model for the database and hardcode the element names and types?
Thanks
:)
As usual, "it depends...".
If the nature of the data you're receiving is that it has no fixed schema, or a schema that changes frequently as a part of its business logic/domain logic, it's a good idea to design a solution to manage that. Storing data whose schema you don't know at design time is a long-running debate on StackOverflow - look for "property bag", "entity attribute value" or EAV to see various questions and answers.
The thing you give up with this approach is - at the web service layer - the ability to check that the data you're receiving meets the agreed interface contract; this in turn helps to avoid all kinds of exciting bugs - what should your system do if it receives invalid XML? Without an agreed schema/dtd, you have to build all kinds of other checks into your code.
At the database level, you usually end up giving up aspects of the relational model (and therefore the power of SQL) to store your data without "traditional" row-and-column relationships. That often makes queries harder, and may sacrifice standards compliance (e.g. by using vendor specific extensions).
If the data changes because of technical reasons - i.e. it's not the nature of the data to change, it's just the technology chain that makes you worry about this - I'd suggest you instead build in the concept of versioning, and have "strongly typed" services/data with different versions. Whilst this seems more work, the benefits of relying on schema validation, the relational database model, and simplicity usually make it a good trade-off...
Put the xml into the database as xml. Many modern databases support XML directly. If not, just store it as text.
Soliciting feedback/thoughts on a pattern or best practice to address a situation that I have seen a few times over the years, yet I haven't found any one solution that addresses it the way I'd like.
Here is the background.
Company has 3 applications supporting 3 separate "lines of business" that are very much related to each other. Two of the applications are literally copy/paste from the original. The applications need to be able to grow at different rates and have slightly different functionality. The main differences in functionality come from the data entry fields. The differences essentially fall into one of the following categories:
One instance has a few fields
that the other does not.
String field has a max length of 200 in one
instance, but 50 in another.
Lookup/Reference fields have
different underlying values (i.e.
same table structures, but coming
from different databases).
A field is defined as a user supplied,
free text, value in one instance,
but a lookup/reference in another.
The problem is that there are other applications within the company that need to consume data from these three separate applications, but ideally, talk to them in a core/centralized manner (i.e. through a central service rather than 3 separate services). My question is how to handle, in particular, item D above. I am thinking a "lowest common denominator" approach might be the only way. For example:
<SomeFieldName>
<Code></Code> <!-- would store a FK ref value if instance used lookup, otherwise would be empty or nonexistent-->
<Text></Text> <!-- would store the text from the lookup if instance used lookup, would store user supplied text if not-->
</SomeFieldName>
Other thoughts/ideas on this?
TIA!
So are the differences strictly from a Datamodel view or are there functional business / behavioral differences at the application level.
If the later is the case then I would definetly go down the path you appear to be heading down with SOA. Now how you impliment your SOA just depends upon your architecture needs. What I would look at for design would be into some various patterns. Its hard to say for sure which one(s) would meet the needs with out more information / example on how the behavioral / functional differences are being used. From off of the top of my head tho with what you have described I would probably start off looking at a Strategy pattern in my initial design.
Definetly prototype this using TDD so that you can determine if your heading down the right path.
How about: extend your LCD approach, put a facade in front of these systems. devise a normalised form of the data which (if populated with enough data) can be transformed to any of the specific instances. [Heading towards an ESB here.]
Then you have the problem, how does a client know what "enough" is? Some kind of meta-data may be needed so that you can present a suiatble UI. So extend the services to provide an operation to deliver the meta data.
It is safe to say that the EAV/CR database model is bad. That said,
Question: What database model, technique, or pattern should be used to deal with "classes" of attributes describing e-commerce products which can be changed at run time?
In a good E-commerce database, you will store classes of options (like TV resolution then have a resolution for each TV, but the next product may not be a TV and not have "TV resolution"). How do you store them, search efficiently, and allow your users to setup product types with variable fields describing their products? If the search engine finds that customers typically search for TVs based on console depth, you could add console depth to your fields, then add a single depth for each tv product type at run time.
There is a nice common feature among good e-commerce apps where they show a set of products, then have "drill down" side menus where you can see "TV Resolution" as a header, and the top five most common TV Resolutions for the found set. You click one and it only shows TVs of that resolution, allowing you to further drill down by selecting other categories on the side menu. These options would be the dynamic product attributes added at run time.
Further discussion:
So long story short, are there any links out on the Internet or model descriptions that could "academically" fix the following setup? I thank Noel Kennedy for suggesting a category table, but the need may be greater than that. I describe it a different way below, trying to highlight the significance. I may need a viewpoint correction to solve the problem, or I may need to go deeper in to the EAV/CR.
Love the positive response to the EAV/CR model. My fellow developers all say what Jeffrey Kemp touched on below: "new entities must be modeled and designed by a professional" (taken out of context, read his response below). The problem is:
entities add and remove attributes weekly (search keywords dictate future attributes)
new entities arrive weekly (products are assembled from parts)
old entities go away weekly (archived, less popular, seasonal)
The customer wants to add attributes to the products for two reasons:
department / keyword search / comparison chart between like products
consumer product configuration before checkout
The attributes must have significance, not just a keyword search. If they want to compare all cakes that have a "whipped cream frosting", they can click cakes, click birthday theme, click whipped cream frosting, then check all cakes that are interesting knowing they all have whipped cream frosting. This is not specific to cakes, just an example.
There's a few general pros and cons I can think of, there are situations where one is better than the other:
Option 1, EAV Model:
Pro: less time to design and develop a simple application
Pro: new entities easy to add (might even
be added by users?)
Pro: "generic" interface components
Con: complex code required to validate simple data types
Con: much more complex SQL for simple
reports
Con: complex reports can become almost
impossible
Con: poor performance for large data sets
Option 2, Modelling each entity separately:
Con: more time required to gather
requirements and design
Con: new entities must be modelled and
designed by a professional
Con: custom interface components for each
entity
Pro: data type constraints and validation simple to implement
Pro: SQL is easy to write, easy to
understand and debug
Pro: even the most complex reports are relatively simple
Pro: best performance for large data sets
Option 3, Combination (model entities "properly", but add "extensions" for custom attributes for some/all entities)
Pro/Con: more time required to gather requirements and design than option 1 but perhaps not as much as option 2 *
Con: new entities must be modelled and designed by a professional
Pro: new attributes might be easily added later on
Con: complex code required to validate simple data types (for the custom attributes)
Con: custom interface components still required, but generic interface components may be possible for the custom attributes
Con: SQL becomes complex as soon as any custom attribute is included in a report
Con: good performance generally, unless you start need to search by or report by the custom attributes
* I'm not sure if Option 3 would necessarily save any time in the design phase.
Personally I would lean toward option 2, and avoid EAV wherever possible. However, for some scenarios the users need the flexibility that comes with EAV; but this comes with a great cost.
It is safe to say that the EAV/CR database model is bad.
No, it's not. It's just that they're an inefficient usage of relational databases. A purely key/value store works great with this model.
Now, to your real question: How to store various attributes and keep them searchable?
Just use EAV. In your case it would be a single extra table. index it on both attribute name and value, most RDBMs would use prefix-compression to on the attribute name repetitions, making it really fast and compact.
EAV/CR gets ugly when you use it to replace 'real' fields. As with every tool, overusing it is 'bad', and gives it a bad image.
// At this point, I'd like to take a moment to speak to you about the Magento/Adobe PSD format.
// Magento/PSD is not a good ecommerce platform/format. Magento/PSD is not even a bad ecommerce platform/format. Calling it such would be an
// insult to other bad ecommerce platform/formats, such as Zencart or OsCommerce. No, Magento/PSD is an abysmal ecommerce platform/format. Having
// worked on this code for several weeks now, my hate for Magento/PSD has grown to a raging fire
// that burns with the fierce passion of a million suns.
http://code.google.com/p/xee/source/browse/trunk/XeePhotoshopLoader.m?spec=svn28&r=11#107
The internal models are wacky at best, like someone put the schema into a boggle game, sealed that and put it in a paint shacker...
Real world: I'm working on a midware fulfilment app and here are one the queries to get address information.
CREATE OR REPLACE VIEW sales_flat_addresses AS
SELECT sales_order_entity.parent_id AS order_id,
sales_order_entity.entity_id,
CONCAT(CONCAT(UCASE(MID(sales_order_entity_varchar.value,1,1)),MID(sales_order_entity_varchar.value,2)), "Address") as type,
GROUP_CONCAT(
CONCAT( eav_attribute.attribute_code," ::::: ", sales_order_entity_varchar.value )
ORDER BY sales_order_entity_varchar.value DESC
SEPARATOR '!!!!!'
) as data
FROM sales_order_entity
INNER JOIN sales_order_entity_varchar ON sales_order_entity_varchar.entity_id = sales_order_entity.entity_id
INNER JOIN eav_attribute ON eav_attribute.attribute_id = sales_order_entity_varchar.attribute_id
AND sales_order_entity.entity_type_id =12
GROUP BY sales_order_entity.entity_id
ORDER BY eav_attribute.attribute_code = 'address_type'
Exacts address information for an order, lazily
--
Summary: Only use Magento if:
You are being given large sacks of money
You must
Enjoy pain
I'm surprised nobody mentioned NoSQL databases.
I've never practiced NoSQL in a production context (just tested MongoDB and was impressed) but the whole point of NoSQL is being able to save items with varying attributes in the same "document".
Where performance is not a major requirement, as in an ETL type of application, EAV has another distinct advantage: differential saves.
I've implemented a number of applications where an over-arching requirement was the ability to see the history of a domain object from its first "version" to it's current state. If that domain object has a large number of attributes, that means each change requires a new row be inserted into it's corresponding table (not an update because the history would be lost, but an insert). Let's say this domain object is a Person, and I have 500k Persons to track with an average of 100+ changes over the Persons life-cycle to various attributes. Couple that with the fact that rare is the application that has only 1 major domain object and you'll quickly surmize that the size of the database would quickly grow out of control.
An easy solution is to save only the differential changes to the major domain objects rather than repeatedly saving redundant information.
All models change over time to reflect new business needs. Period. Using EAV is but one of the tools in our box to use; but it should never be automatically classified as "bad".
I'm struggling with the same issue. It may be interesting for you to check out the following discussion on two existing ecommerce solutions: Magento (EAV) and Joomla (regular relational structure):
https://forum.virtuemart.net/index.php?topic=58686.0
It seems, that Magento's EAV performance is a real showstopper.
That's why I'm leaning towards a normalized structure. To overcome the lack of flexibility I'm thinking about adding some separate data dictionary in the future (XML or separate DB tables) that could be edited, and based on that, application code for displaying and comparing product categories with new attributes set would be generated, together with SQL scripts.
Such architecture seems to be the sweetspot in this case - flexible and performant at the same time.
The problem could be frequent use of ALTER TABLE in live environment. I'm using Postgres, so its MVCC and transactional DDL will hopefully ease the pain.
I still vote for modeling at the lowest-meaningful atomic-level for EAV. Let standards, technologies and applications that gear toward certain user community to decide content models, repetition needs of attributes, grains, etc.
If it's just about the product catalog attributes and hence validation requirements for those attributes are rather limited, the only real downside to EAV is query performance and even that is only a problem when your query deals with multiple "things" (products) with attributes, the performance for the query "give me all attributes for the product with id 234" while not optimal is still plenty fast.
One solution is to use the SQL database / EAV model only for the admin / edit side of the product catalog and have some process that denormalizes the products into something that makes it searchable. Since you already have attributes and hence it's rather likely that you want faceting, this something could be Solr or ElasticSearch. This approach avoids basically all downsides to the EAV model and the added complexity is limited to serializing a complete product to JSON on update.
EAV has many drawbacks:
Performance degradation over time
Once the amount of data in the application grows beyond a certain size, the retrieval and manipulation of that data is likely to become less and less efficient.
The SQL queries are very complex and difficult to write.
Data Integrity problems.
You can't define foreign keys for all the fields needed.
You have to define and maintain your own metadata.
I have a slightly different problem: instead of many attributes with sparse values (which is possibly a good reason to use EAV), I want to store something more like a spreadsheet. The columns in the sheet can change, but within a sheet all cells will contain data (not sparse).
I made a small set of tests to benchmark two designs: one using EAV, and the other using a Postgres ARRAY to store cell data.
EAV
Array
Both schemas have indexes on appropriate columns, and the indexes are used by the planner.
It turned out the array-based schema was an order of magnitude faster for both inserts and queries. From quick tests, it seemed that both scaled linearly. The tests aren't very thorough, though. Suggestions and forks welcome - they're under an MIT licence.