Why magento store EAV data across attribute type? - sql

In magento system, there are 5 tables to store EAV data across attribute type. Is it an effective performance choice to do it? When I make a sql query, I still need to use UNION clause to get the whole data set. If I use one mixed table to store EAV data according by the only one data type(varchar, or sql_variant in sql server2008), what will I encounter performance issue in future?

Is it an effective performance choice to do it?
The Magento developers chose to use an EAV structure because it performs well under high volumes of data. A flat table structure would be suitable for a small setup, but as it scales it becomes less and less efficient.
When I make a sql query, I still need to use UNION clause to get the whole data set.
You should try in every possible case to avoid direct SQL queries on a Magento database. They have Setup models that you can use for installing new data, of which you either utilise the Magento models that already exist and the methods that the Setup models have to create/modify the Magento core config data or variables, or use the underlying Zend framework's ORM if you need to to create new tables, etc.
In terms of the EAV part of the database specifically, the way it is setup is complicated if you attack it from the SQL point-of-view, which is why Magento models exist so that it can be all wrapped up in PHP ORM. Again, avoid SQL queries if you can.
If you have to make direct queries, you wouldn't be creating UNION queries but joins onto those tables, and you'd use the eav_attribute table as a pivot table to provide you with both the attribute_id (primary key) and the source table which the value will exist in.
By using direct SQL queries you also lose the fallback system that Magento implements where store or website level values can exist, and the Magento models will select them if you ask for them at a store level. If you want to do this manually with SQL then the queries become more complicated as you need to look for those values and if they aren't found, revert to the default (global scope) value.
If I use one mixed table to store EAV data according by the only one data type(varchar, or sql_variant in sql server2008), what will I encounter performance issue in future?
As mentioned before, it depends on the expected scale of your database. You will notice that there are plenty of flat tables in a standard Magento database, and that EAV structures only apply to the parts that Magento developers have decided may increase drastically in volume (customers, catalog etc).
If you want to implement a custom module and you think that it also has the potential to grow quickly over time then you can implement your own EAV tables for it. The Magento model scaffolds support this, and there is plenty of resource online about how to set them up.
If your tables are likely to remain (relatively) small, then by all means go for a flat table approach. If it's a custom module and you notice rapid growth, you can always convert it later before it becomes a bottleneck.

Related

SQL database design with some user defined fields

I'm developing a database schema to handle collection of data and later, reporting on this data.
After a requirements discussion, it seems that either an entity-attribute-value (EAV) solution, or a flat table solution would be alright - since the data is somewhat sparse but not highly sparse.
However, user defined fields will become a must in the future, but I understand that querying and optimizing an RDBMS with EAV tables can become complex.
I've taken a look at the discussion here, and I was thinking an option similar to option 1 would be possible. For example, have a number of set fields, then a number of spare fields that users can define the labels of.
In terms of reporting, is there any downside in using this approach rather than using EAV?
You will regret EAV, especially when it comes to reporting
Make sure you're aware of existing data model patterns before you try anything: Ready to use database model patterns
Familiarize yourself with Table Inheritance: How can you represent inheritance in a database?
Consider allowing users to modify their own schemas: https://martinfowler.com/bliki/UserDefinedField.html
EAV is almost always a really bad idea. If you still need custom fields after trying the above, use a blob type (like JSON or XML) with indexing: http://backchannel.org/blog/friendfeed-schemaless-mysql . Postgres's binary jsonb is fast and allows indexing/querying

How Does NoSQL Scale Out Exactly?

I use SQL Server 2012. I have a database sharded across physical tiers by User ID. In my app User is an aggregate root (i.e., nothing about Users comes from or goes into my repository without the entire User coming or going). Everything about one particular User exists on one particular machine.
Is my system any less scalable than one that employs NoSQL? Or, can someone explain how NoSQL systems scale out across servers exactly? Wouldn't they have to shard in a similar manner to what I'm doing? We've all read that NoSQL enables scalability but at the moment I don't see how, say, MongoDB would benefit my architecture.
MongoDB allows you to scale in two ways: sharding and replication. I think you can do both in MS SQL Server.
What usually is different is the data model:
In a relational database, you typically have multiple tables that reference each other. Theoretically, you can do something similar with MongoDB by using multiple collections, however this is not the way it's typically done. Instead, in MongoDB, you tend to store all the data that belongs together in the same collection. So typically you have less collections than tables in a database. This will in many times result in more redundancy (data is copied). You can try to do that in a relational database, but it's not quite so easy (there will be less tables, each having more columns).
MongoDB collections are more flexible than tables in that you don't need to define the data model up front (the exact list of columns / properties, the data types). This allows you to change the data model without having to alter the tables - the disadvantage is that you need to take this into account in the application (you can't rely on all rows / documents having the same structure). I'm not sure if you can do that in MS SQL Server.
In MongoDB, each document is a Json object, so it's a tree and not a flat table. This allows more flexibility in the data model. For example, in an application I'm developing (Apache Jackrabbit Oak / MongoMK), for each property (column) we can store multiple values; one value for each revision. Doing that in a relational database is possible, but quite tricky.

Is it possible to use Nhibernate with partition of an object over several tables?

We are having a system that gather large quantities of data each month and performes rather advanced calculations that increase the database even more. Since we have the requirements from the customer that we need to store data for fast access three years back and that we must be able to access older data (up to ten years), this however can be low performance and requires some work. We want to avoid performance issues where the database and its tables grows out of proportion.
After discussing using SQL Enterprise (VERY costly and full of traps since we haven't gotten the know-how) and since our system have so many tables that referenses each other we are leaning towards creating some kind of history tables to which we move data in a monthly fashion and rewrite the select queries that we have based on parameters to search either in the regular table or in the history or both depending on the situation.
Since we also are using NHibernate for mapping I was wondering if it is possbible to create a mapping file that handles this by itself (almost) using some sort of polymorfism or inheritance in which each object is stored in different tables based on parameters?
I know this sounds complicated and strange and that there is other methods to perform this but I this question I would rather have people answering the question asked and not give other sugestions to use instead.
As far as I know NHibernate can't do that (each class can be mapped to one table/view )but you can use SQL Queries or StoredProcedures (depends on the version of NHibernate that you are using) to populate mapped objects.
In your case you can have a combined view created by making unions of different tables Then you can use a SQL query to populated your entity.
There's also another solution that you create a summary object for your queries that uses that view ,therefore you can use both HQL and criteria to query this object.
Short answer "no". I would not create views as you mention a lot of joining.
Personally I would create summary tables and map to these directly using a stateless session or a very least mutable=false on the class definition. Think of these summary tables as denormalised data for report only. The only drawback is if historic data changes on a regular basis then the summary tables also needs changing. If historical data never changes then this should be simple to achieve.
I would also most probably store these summary tables on another catalog rather than adding to the size of the current system.
Its not a quick win this one I am afraid.

Database model refactoring for combining multiple tables of heterogeneous data in SQL Server?

I took over the task of re-developing a database of scientific data which is used by a web interface, where the original author had taken a 'table-per-dataset' approach which didn't scale well and is now fairly difficult to manage with more than 200 tables that have been created. I've spent quite a bit of time trying to figure out how to wrangle the thing, but the datasets contain heterogeneous values, so it is not reasonably possible to combine them into one table with a set schema for column definitions.
I've explored the possibility of EAV, XML columns, and ended up attempting to go with a table with many sparse columns since the database is running on SQL Server 2008. The DBAs are having some issues with my recently created sparse columns causing some havoc with their backup scripts, so I'm left wondering again if there isn't a better way to do this. I know that EAV does not lead to decent performance, and my experiments with XML data types also demonstrated poor performance, probably thanks to the large number of records in some of the tables.
Here's the summary:
Around 200 tables, most of which have a few columns containing floats and small strings
Some tables have as many as 15,000 records
Table schemas are not consistent, as the columns depended on the number of samples in the original experimental data.
SQL Server 2008
I'll be treating most of this data as legacy in the new version I'm developing, but I still need to be able to display it and query it- and I'd rather not have to do so by dynamically specifying the table name in my stored procedures as it would be with the current multi-table approach. Any suggestions?
I would suggest that the first step is looking to rationalise the data through views; attempt to consolidate similar data sets into logical pools through views.
You could then look at refactoring the code to look at the views, and see if the web platform operates effectively. From there you could decided whether or not the view structure is beneficial and if so, look to physically rationalising the data into a new table.
The benefit of using views in this manner is you should be able to squeak a little performance out of indexes on the views, and it should also give you a better handle on the data (that said, since you are dev'ing the new version, it would suggest you are perfectly capable of understanding the problem domain).
With 200 tables as simple raw data sets, and considering you believe your version will be taking over, I would probably go through the prototype exercise of seeing if you can't write the views to be identically named to what your final table names will be in V2, that way you can also backtest if your new database structure is in fact going to work.
Finally, a word to the wise, when someone has built a database in the way you've described, without looking at the data, and really knowing the problem set; they did it for a reason. Either it was bad design, or there was a cause for what now appears on the surface to be bad design; you raise consistency as an issue - look to wrap the data and see how consistent you can make it.
Good luck!

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLlite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLlite files are entire databases which our user would like to query across. There are two ways (we see) of implementing this, one, to create a single database and a backend in Python that appends tables from each database to the master database, and two, querying across separate databases' tables and unifying the results in Python.
Both methods run into trouble when the black box produces alters its table structures, say for example renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables, in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, a new column is going to have to be determined whether it is a fact or a dimensional attribute, and then if it is a dimensional attribute, which dimension table it is best assigned to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating somethings like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually more easy to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick and dirty solution not a 'generic' solution because generic solutions like the entity attribute value model often have a bad performance. Don't do client side joining (unifying the results) inside your Python code because that is very slow. Use SQL for joining, it is designed for that purpose. Users can also make their own reports with all kind of reporting tools that generate sql statements. You don't have to do everything in your app, just start with solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes you can define views for backward compatibility that keeps your app functioning.
Maybe the scientific software will add a lot of new features and maybe it will change its datamodel because of those new features..? That is possible but then you will have to change your application anyways to take profit from those new features.
It sounds to me as if your problem isn't really about MySQL or SQLlite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal wth it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it was up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe, some design like star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.