When to use NoSQL vs SQL for BI data - sql

I'm using data stored within a SQL database (mariadDB) to store information about "widgets" and "products".
Widgets have metadata associated with them and some relational data associated. Like build team, builders, and an image. Builders can be on any number of teams and any team can build any widget. Currently this data is normalized into separate tables and mapping mapping/associative tables.
Products also have metadata; description, what it was used for, when it was delivered, etc.
The products and widgets tables seem to be a good fit for a NoSQL solution. Perhaps having to de-normalize builders and build teams.
Where the data does not seem to be a good fit for NoSQL is the relationship between products and widgets. Widgets can be associated/mapped to 0 or more products. Each widget mapped to a product provides a capability to that product (Widget A may provide a web service, Widget B may provide locomotion, etc.).
Every once in a while the suite of products changes and the widgets are then re-mapped to the new set of products. The data is being used with Business Intelligence (BI) software (Jaspersoft Studio) to generate reports.
The data is not large. It is for our internal team use and to generate reports for other teams as they request them. So I'm not worried about ACID compliance or write locks, vertical or horizontal scaling, 24x7 availability, and those sorts of thing. My primary concern is flexibility as the data changes (meta data captured about the widgets and the set of products the widgets are applied to).
Based on my research NoSQL should be avoided if your data is at all relational. These Articles were rather old and I'm wondering if this is still the case?
When the suite of products changes it is painful to re-map the widgets to the new suite. Notionally, it seems that NoSQL solution could help ease that pain. But I'm not sure how.

SQL is better with set operations and relations, it will also be faster in filtering the sets you are working with.
NoSQL has greater flexibility of functions and can do some single row operations faster.
I would recommend having two environments if you need to use NoSQL, the SQL for housing all the data and NoSQL to do calculations for you. To feed the data into NoSQL best practice would be to create table valued functions or views.

Related

Is it better to use entity-arrtibute-value model over storing various different product in single description text column? [duplicate]

It is safe to say that the EAV/CR database model is bad. That said,
Question: What database model, technique, or pattern should be used to deal with "classes" of attributes describing e-commerce products which can be changed at run time?
In a good E-commerce database, you will store classes of options (like TV resolution then have a resolution for each TV, but the next product may not be a TV and not have "TV resolution"). How do you store them, search efficiently, and allow your users to setup product types with variable fields describing their products? If the search engine finds that customers typically search for TVs based on console depth, you could add console depth to your fields, then add a single depth for each tv product type at run time.
There is a nice common feature among good e-commerce apps where they show a set of products, then have "drill down" side menus where you can see "TV Resolution" as a header, and the top five most common TV Resolutions for the found set. You click one and it only shows TVs of that resolution, allowing you to further drill down by selecting other categories on the side menu. These options would be the dynamic product attributes added at run time.
Further discussion:
So long story short, are there any links out on the Internet or model descriptions that could "academically" fix the following setup? I thank Noel Kennedy for suggesting a category table, but the need may be greater than that. I describe it a different way below, trying to highlight the significance. I may need a viewpoint correction to solve the problem, or I may need to go deeper in to the EAV/CR.
Love the positive response to the EAV/CR model. My fellow developers all say what Jeffrey Kemp touched on below: "new entities must be modeled and designed by a professional" (taken out of context, read his response below). The problem is:
entities add and remove attributes weekly (search keywords dictate future attributes)
new entities arrive weekly (products are assembled from parts)
old entities go away weekly (archived, less popular, seasonal)
The customer wants to add attributes to the products for two reasons:
department / keyword search / comparison chart between like products
consumer product configuration before checkout
The attributes must have significance, not just a keyword search. If they want to compare all cakes that have a "whipped cream frosting", they can click cakes, click birthday theme, click whipped cream frosting, then check all cakes that are interesting knowing they all have whipped cream frosting. This is not specific to cakes, just an example.
There's a few general pros and cons I can think of, there are situations where one is better than the other:
Option 1, EAV Model:
Pro: less time to design and develop a simple application
Pro: new entities easy to add (might even
be added by users?)
Pro: "generic" interface components
Con: complex code required to validate simple data types
Con: much more complex SQL for simple
reports
Con: complex reports can become almost
impossible
Con: poor performance for large data sets
Option 2, Modelling each entity separately:
Con: more time required to gather
requirements and design
Con: new entities must be modelled and
designed by a professional
Con: custom interface components for each
entity
Pro: data type constraints and validation simple to implement
Pro: SQL is easy to write, easy to
understand and debug
Pro: even the most complex reports are relatively simple
Pro: best performance for large data sets
Option 3, Combination (model entities "properly", but add "extensions" for custom attributes for some/all entities)
Pro/Con: more time required to gather requirements and design than option 1 but perhaps not as much as option 2 *
Con: new entities must be modelled and designed by a professional
Pro: new attributes might be easily added later on
Con: complex code required to validate simple data types (for the custom attributes)
Con: custom interface components still required, but generic interface components may be possible for the custom attributes
Con: SQL becomes complex as soon as any custom attribute is included in a report
Con: good performance generally, unless you start need to search by or report by the custom attributes
* I'm not sure if Option 3 would necessarily save any time in the design phase.
Personally I would lean toward option 2, and avoid EAV wherever possible. However, for some scenarios the users need the flexibility that comes with EAV; but this comes with a great cost.
It is safe to say that the EAV/CR database model is bad.
No, it's not. It's just that they're an inefficient usage of relational databases. A purely key/value store works great with this model.
Now, to your real question: How to store various attributes and keep them searchable?
Just use EAV. In your case it would be a single extra table. index it on both attribute name and value, most RDBMs would use prefix-compression to on the attribute name repetitions, making it really fast and compact.
EAV/CR gets ugly when you use it to replace 'real' fields. As with every tool, overusing it is 'bad', and gives it a bad image.
// At this point, I'd like to take a moment to speak to you about the Magento/Adobe PSD format.
// Magento/PSD is not a good ecommerce platform/format. Magento/PSD is not even a bad ecommerce platform/format. Calling it such would be an
// insult to other bad ecommerce platform/formats, such as Zencart or OsCommerce. No, Magento/PSD is an abysmal ecommerce platform/format. Having
// worked on this code for several weeks now, my hate for Magento/PSD has grown to a raging fire
// that burns with the fierce passion of a million suns.
http://code.google.com/p/xee/source/browse/trunk/XeePhotoshopLoader.m?spec=svn28&r=11#107
The internal models are wacky at best, like someone put the schema into a boggle game, sealed that and put it in a paint shacker...
Real world: I'm working on a midware fulfilment app and here are one the queries to get address information.
CREATE OR REPLACE VIEW sales_flat_addresses AS
SELECT sales_order_entity.parent_id AS order_id,
sales_order_entity.entity_id,
CONCAT(CONCAT(UCASE(MID(sales_order_entity_varchar.value,1,1)),MID(sales_order_entity_varchar.value,2)), "Address") as type,
GROUP_CONCAT(
CONCAT( eav_attribute.attribute_code," ::::: ", sales_order_entity_varchar.value )
ORDER BY sales_order_entity_varchar.value DESC
SEPARATOR '!!!!!'
) as data
FROM sales_order_entity
INNER JOIN sales_order_entity_varchar ON sales_order_entity_varchar.entity_id = sales_order_entity.entity_id
INNER JOIN eav_attribute ON eav_attribute.attribute_id = sales_order_entity_varchar.attribute_id
AND sales_order_entity.entity_type_id =12
GROUP BY sales_order_entity.entity_id
ORDER BY eav_attribute.attribute_code = 'address_type'
Exacts address information for an order, lazily
--
Summary: Only use Magento if:
You are being given large sacks of money
You must
Enjoy pain
I'm surprised nobody mentioned NoSQL databases.
I've never practiced NoSQL in a production context (just tested MongoDB and was impressed) but the whole point of NoSQL is being able to save items with varying attributes in the same "document".
Where performance is not a major requirement, as in an ETL type of application, EAV has another distinct advantage: differential saves.
I've implemented a number of applications where an over-arching requirement was the ability to see the history of a domain object from its first "version" to it's current state. If that domain object has a large number of attributes, that means each change requires a new row be inserted into it's corresponding table (not an update because the history would be lost, but an insert). Let's say this domain object is a Person, and I have 500k Persons to track with an average of 100+ changes over the Persons life-cycle to various attributes. Couple that with the fact that rare is the application that has only 1 major domain object and you'll quickly surmize that the size of the database would quickly grow out of control.
An easy solution is to save only the differential changes to the major domain objects rather than repeatedly saving redundant information.
All models change over time to reflect new business needs. Period. Using EAV is but one of the tools in our box to use; but it should never be automatically classified as "bad".
I'm struggling with the same issue. It may be interesting for you to check out the following discussion on two existing ecommerce solutions: Magento (EAV) and Joomla (regular relational structure):
https://forum.virtuemart.net/index.php?topic=58686.0
It seems, that Magento's EAV performance is a real showstopper.
That's why I'm leaning towards a normalized structure. To overcome the lack of flexibility I'm thinking about adding some separate data dictionary in the future (XML or separate DB tables) that could be edited, and based on that, application code for displaying and comparing product categories with new attributes set would be generated, together with SQL scripts.
Such architecture seems to be the sweetspot in this case - flexible and performant at the same time.
The problem could be frequent use of ALTER TABLE in live environment. I'm using Postgres, so its MVCC and transactional DDL will hopefully ease the pain.
I still vote for modeling at the lowest-meaningful atomic-level for EAV. Let standards, technologies and applications that gear toward certain user community to decide content models, repetition needs of attributes, grains, etc.
If it's just about the product catalog attributes and hence validation requirements for those attributes are rather limited, the only real downside to EAV is query performance and even that is only a problem when your query deals with multiple "things" (products) with attributes, the performance for the query "give me all attributes for the product with id 234" while not optimal is still plenty fast.
One solution is to use the SQL database / EAV model only for the admin / edit side of the product catalog and have some process that denormalizes the products into something that makes it searchable. Since you already have attributes and hence it's rather likely that you want faceting, this something could be Solr or ElasticSearch. This approach avoids basically all downsides to the EAV model and the added complexity is limited to serializing a complete product to JSON on update.
EAV has many drawbacks:
Performance degradation over time
Once the amount of data in the application grows beyond a certain size, the retrieval and manipulation of that data is likely to become less and less efficient.
The SQL queries are very complex and difficult to write.
Data Integrity problems.
You can't define foreign keys for all the fields needed.
You have to define and maintain your own metadata.
I have a slightly different problem: instead of many attributes with sparse values (which is possibly a good reason to use EAV), I want to store something more like a spreadsheet. The columns in the sheet can change, but within a sheet all cells will contain data (not sparse).
I made a small set of tests to benchmark two designs: one using EAV, and the other using a Postgres ARRAY to store cell data.
EAV
Array
Both schemas have indexes on appropriate columns, and the indexes are used by the planner.
It turned out the array-based schema was an order of magnitude faster for both inserts and queries. From quick tests, it seemed that both scaled linearly. The tests aren't very thorough, though. Suggestions and forks welcome - they're under an MIT licence.

Multiple Datamarts Architecture / Modeling on Snowflake cloud datawarehouse

Context :
Let's suppose we have multiple datamarts (Ex : HR, Accounting, Marketing ...) and all of them use the Star Schema as dimensional modeling (Kimball approach ) .
Question :
Since Snowflake cloud data warehouse architecture eliminate the need to spin off separate physical data marts / databases in order to maintain performance. So, what's the best approach to build the multiple datamarts on Snowflake ?
Create database for each datamart ? create one database (EDW )with multiple schema and each schema refer to a datamart ?
Thank you !
Ron is correct - the answer depends on a few things:
If there are conformed dimensions, then one database and schema might be the way to go
If they are completely non-integrated data marts I would go with separate schemas or even separate databases. They are all logical containers in Snowflake (rather than physical) with full role based access control available to segregate users.
So really - how do you do it today? Does that work for you or are there things you need or want to do that you cannot do today with your current physical setup. How is security set up with your BI tools? Do they reference a database name or just a schema name? If you can, minimize changes to your data pipeline and reporting so you have fewer things that might need refactoring (at least for your first POC or migration).
One thing to note is that with Snowflake you have the ability to easily do cross database joins (i.e., database.schema.table) - all you need is SELECT access, so even if you separate the marts by database oyu can still do cross mart reporting if needed.
Hope that helps.
There is no specific need to separate star schemas at all.
If you're using shared / conformed dimensions across your marts, separation would actually be an anti-pattern.
If your issue is simplifying the segregation of users, schema per mart works well.
All of the approaches you've suggested (DB/mart, DW/schema,...) will work, I'm just not clear on the need.
The goal of having separate data marts is more related to governance, to keep data organized and where it is expected to be found (i.e. sales transactions in the "sales data mart"), and less related to performance issues.
The advantage of having a single database acting as a data warehouse is that all your data for analytics will be stored in one place, making it more accessible and easier to find. In this case, you can use schemas to implement (logically) separate data marts. You can also use schemas within a database to keep development data separate from production data, for each data mart.
Snowflake is different from traditional relational databases; given its technical architecture, it has no issues with joining large tables between different databases/schemas so you can certainly build different data marts in separate databases and join their facts or dimensions with some other Snowflake database/data mart.
In you specific case, if you have a large number of data marts (e.g. 10 or more) and you're not using Snowflake for much more than data warehouseing, I think the best path would be to implement each data mart in its own database and use schemas to manage prod/dev data within each schema. This will help keep data organized, as opposed to quickly reaching a point where you'll have hundreds of tables (every data mart, and its dev/prod versions) in one database, which won't be a great development or maintenance experience.
But, from a performance perspective, there's really no noticeable difference.

Data Aggregation - Daily SQL Script vs Data Warehouse

Pardon me if this has already been asked (I know very little about Data Warehouse/BI and have yet to master the keywords).
I have a table that grow by more then 100 000 rows per day, each row having a timestamp and multiple information about an item (dimensions, weight,color,etc). Individual data can be useful for roughly a month after this period we are only interested in aggregations. I have a dedicated software that allow a more detailed visualisation of individual rows and mainly use PowerPivot for my reporting needs.
I could come up with an SQL query that would fill a new table daily:
In which I would have a row for each hour/item/batch and I would summarize the information (sum/average/stddev/etc.)
Within a day my script would be up and running and I could use powerpivot against this new table. All this while staying where I'm comfortable: plain old SQL.
From the few information I gathered reading about DataWarehouse and BI, what I'm about to do sounds a lot like creating dimensions and facts. My question therefore: is it worthwhile to investigate further in that direction (BI) or since my problem is relatively simple I would do better staying in a relational database.
N.B. Reports that are being produced are usually linked against another database to produce more meaningful informations. Task that is very well accomplished by Powerpivot.
Datawarehouses are normally implemented in relational databases, so your existing skills will still be usable.
Given that you have expressed an interest in the dimension/fact table approach to datawarehousing, the canonical books on this approach are usually considered to be:
The Date Warehouse Toolkit (Kimball, Ross)
The Date Warehouse Lifecycle Toolkit (Kimball, Ross, Thornthwaite, Mundy, Becker)
(The former has more of a technical focus, while the latter approaches the subject from a wider lifecycle management viewpoint.)
Implementing DWHs can be time-consuming, so it may be worth continuing with your existing approach even if you decide to build a DWH.
Good news: it sounds like you already have a data warehouse. "Data warehouse" is a very generic term, with no real formal definition - it pretty much means whatever you want it to.
Commonly accepted characteristics are:
Data warehouses do not run on the operational databases
Data warehouses schemas are optimized for querying, not for "normal form" compliance
Data warehouses are populated by "Extract, Transform, Load" proceses (ETL).
It sounds like you're already doing all of that. If there are no business requirements to change, I'd leave it as it is. If your business users are asking to create their own queries, using different levels of aggregation, filtering, or granularit, a star schema may be the way to go.
The most effective solutions are those which are simple, adequate to meet existing needsand stay within available skillsets.
I agree that this approach works well for your situation an if it provides the reports and information you need then its worth starting this way. If you need more complex functionality later then you can go for more complex BI

Entity Attribute Value Database vs. strict Relational Model Ecommerce

It is safe to say that the EAV/CR database model is bad. That said,
Question: What database model, technique, or pattern should be used to deal with "classes" of attributes describing e-commerce products which can be changed at run time?
In a good E-commerce database, you will store classes of options (like TV resolution then have a resolution for each TV, but the next product may not be a TV and not have "TV resolution"). How do you store them, search efficiently, and allow your users to setup product types with variable fields describing their products? If the search engine finds that customers typically search for TVs based on console depth, you could add console depth to your fields, then add a single depth for each tv product type at run time.
There is a nice common feature among good e-commerce apps where they show a set of products, then have "drill down" side menus where you can see "TV Resolution" as a header, and the top five most common TV Resolutions for the found set. You click one and it only shows TVs of that resolution, allowing you to further drill down by selecting other categories on the side menu. These options would be the dynamic product attributes added at run time.
Further discussion:
So long story short, are there any links out on the Internet or model descriptions that could "academically" fix the following setup? I thank Noel Kennedy for suggesting a category table, but the need may be greater than that. I describe it a different way below, trying to highlight the significance. I may need a viewpoint correction to solve the problem, or I may need to go deeper in to the EAV/CR.
Love the positive response to the EAV/CR model. My fellow developers all say what Jeffrey Kemp touched on below: "new entities must be modeled and designed by a professional" (taken out of context, read his response below). The problem is:
entities add and remove attributes weekly (search keywords dictate future attributes)
new entities arrive weekly (products are assembled from parts)
old entities go away weekly (archived, less popular, seasonal)
The customer wants to add attributes to the products for two reasons:
department / keyword search / comparison chart between like products
consumer product configuration before checkout
The attributes must have significance, not just a keyword search. If they want to compare all cakes that have a "whipped cream frosting", they can click cakes, click birthday theme, click whipped cream frosting, then check all cakes that are interesting knowing they all have whipped cream frosting. This is not specific to cakes, just an example.
There's a few general pros and cons I can think of, there are situations where one is better than the other:
Option 1, EAV Model:
Pro: less time to design and develop a simple application
Pro: new entities easy to add (might even
be added by users?)
Pro: "generic" interface components
Con: complex code required to validate simple data types
Con: much more complex SQL for simple
reports
Con: complex reports can become almost
impossible
Con: poor performance for large data sets
Option 2, Modelling each entity separately:
Con: more time required to gather
requirements and design
Con: new entities must be modelled and
designed by a professional
Con: custom interface components for each
entity
Pro: data type constraints and validation simple to implement
Pro: SQL is easy to write, easy to
understand and debug
Pro: even the most complex reports are relatively simple
Pro: best performance for large data sets
Option 3, Combination (model entities "properly", but add "extensions" for custom attributes for some/all entities)
Pro/Con: more time required to gather requirements and design than option 1 but perhaps not as much as option 2 *
Con: new entities must be modelled and designed by a professional
Pro: new attributes might be easily added later on
Con: complex code required to validate simple data types (for the custom attributes)
Con: custom interface components still required, but generic interface components may be possible for the custom attributes
Con: SQL becomes complex as soon as any custom attribute is included in a report
Con: good performance generally, unless you start need to search by or report by the custom attributes
* I'm not sure if Option 3 would necessarily save any time in the design phase.
Personally I would lean toward option 2, and avoid EAV wherever possible. However, for some scenarios the users need the flexibility that comes with EAV; but this comes with a great cost.
It is safe to say that the EAV/CR database model is bad.
No, it's not. It's just that they're an inefficient usage of relational databases. A purely key/value store works great with this model.
Now, to your real question: How to store various attributes and keep them searchable?
Just use EAV. In your case it would be a single extra table. index it on both attribute name and value, most RDBMs would use prefix-compression to on the attribute name repetitions, making it really fast and compact.
EAV/CR gets ugly when you use it to replace 'real' fields. As with every tool, overusing it is 'bad', and gives it a bad image.
// At this point, I'd like to take a moment to speak to you about the Magento/Adobe PSD format.
// Magento/PSD is not a good ecommerce platform/format. Magento/PSD is not even a bad ecommerce platform/format. Calling it such would be an
// insult to other bad ecommerce platform/formats, such as Zencart or OsCommerce. No, Magento/PSD is an abysmal ecommerce platform/format. Having
// worked on this code for several weeks now, my hate for Magento/PSD has grown to a raging fire
// that burns with the fierce passion of a million suns.
http://code.google.com/p/xee/source/browse/trunk/XeePhotoshopLoader.m?spec=svn28&r=11#107
The internal models are wacky at best, like someone put the schema into a boggle game, sealed that and put it in a paint shacker...
Real world: I'm working on a midware fulfilment app and here are one the queries to get address information.
CREATE OR REPLACE VIEW sales_flat_addresses AS
SELECT sales_order_entity.parent_id AS order_id,
sales_order_entity.entity_id,
CONCAT(CONCAT(UCASE(MID(sales_order_entity_varchar.value,1,1)),MID(sales_order_entity_varchar.value,2)), "Address") as type,
GROUP_CONCAT(
CONCAT( eav_attribute.attribute_code," ::::: ", sales_order_entity_varchar.value )
ORDER BY sales_order_entity_varchar.value DESC
SEPARATOR '!!!!!'
) as data
FROM sales_order_entity
INNER JOIN sales_order_entity_varchar ON sales_order_entity_varchar.entity_id = sales_order_entity.entity_id
INNER JOIN eav_attribute ON eav_attribute.attribute_id = sales_order_entity_varchar.attribute_id
AND sales_order_entity.entity_type_id =12
GROUP BY sales_order_entity.entity_id
ORDER BY eav_attribute.attribute_code = 'address_type'
Exacts address information for an order, lazily
--
Summary: Only use Magento if:
You are being given large sacks of money
You must
Enjoy pain
I'm surprised nobody mentioned NoSQL databases.
I've never practiced NoSQL in a production context (just tested MongoDB and was impressed) but the whole point of NoSQL is being able to save items with varying attributes in the same "document".
Where performance is not a major requirement, as in an ETL type of application, EAV has another distinct advantage: differential saves.
I've implemented a number of applications where an over-arching requirement was the ability to see the history of a domain object from its first "version" to it's current state. If that domain object has a large number of attributes, that means each change requires a new row be inserted into it's corresponding table (not an update because the history would be lost, but an insert). Let's say this domain object is a Person, and I have 500k Persons to track with an average of 100+ changes over the Persons life-cycle to various attributes. Couple that with the fact that rare is the application that has only 1 major domain object and you'll quickly surmize that the size of the database would quickly grow out of control.
An easy solution is to save only the differential changes to the major domain objects rather than repeatedly saving redundant information.
All models change over time to reflect new business needs. Period. Using EAV is but one of the tools in our box to use; but it should never be automatically classified as "bad".
I'm struggling with the same issue. It may be interesting for you to check out the following discussion on two existing ecommerce solutions: Magento (EAV) and Joomla (regular relational structure):
https://forum.virtuemart.net/index.php?topic=58686.0
It seems, that Magento's EAV performance is a real showstopper.
That's why I'm leaning towards a normalized structure. To overcome the lack of flexibility I'm thinking about adding some separate data dictionary in the future (XML or separate DB tables) that could be edited, and based on that, application code for displaying and comparing product categories with new attributes set would be generated, together with SQL scripts.
Such architecture seems to be the sweetspot in this case - flexible and performant at the same time.
The problem could be frequent use of ALTER TABLE in live environment. I'm using Postgres, so its MVCC and transactional DDL will hopefully ease the pain.
I still vote for modeling at the lowest-meaningful atomic-level for EAV. Let standards, technologies and applications that gear toward certain user community to decide content models, repetition needs of attributes, grains, etc.
If it's just about the product catalog attributes and hence validation requirements for those attributes are rather limited, the only real downside to EAV is query performance and even that is only a problem when your query deals with multiple "things" (products) with attributes, the performance for the query "give me all attributes for the product with id 234" while not optimal is still plenty fast.
One solution is to use the SQL database / EAV model only for the admin / edit side of the product catalog and have some process that denormalizes the products into something that makes it searchable. Since you already have attributes and hence it's rather likely that you want faceting, this something could be Solr or ElasticSearch. This approach avoids basically all downsides to the EAV model and the added complexity is limited to serializing a complete product to JSON on update.
EAV has many drawbacks:
Performance degradation over time
Once the amount of data in the application grows beyond a certain size, the retrieval and manipulation of that data is likely to become less and less efficient.
The SQL queries are very complex and difficult to write.
Data Integrity problems.
You can't define foreign keys for all the fields needed.
You have to define and maintain your own metadata.
I have a slightly different problem: instead of many attributes with sparse values (which is possibly a good reason to use EAV), I want to store something more like a spreadsheet. The columns in the sheet can change, but within a sheet all cells will contain data (not sparse).
I made a small set of tests to benchmark two designs: one using EAV, and the other using a Postgres ARRAY to store cell data.
EAV
Array
Both schemas have indexes on appropriate columns, and the indexes are used by the planner.
It turned out the array-based schema was an order of magnitude faster for both inserts and queries. From quick tests, it seemed that both scaled linearly. The tests aren't very thorough, though. Suggestions and forks welcome - they're under an MIT licence.

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLlite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLlite files are entire databases which our user would like to query across. There are two ways (we see) of implementing this, one, to create a single database and a backend in Python that appends tables from each database to the master database, and two, querying across separate databases' tables and unifying the results in Python.
Both methods run into trouble when the black box produces alters its table structures, say for example renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables, in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, a new column is going to have to be determined whether it is a fact or a dimensional attribute, and then if it is a dimensional attribute, which dimension table it is best assigned to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating somethings like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually more easy to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick and dirty solution not a 'generic' solution because generic solutions like the entity attribute value model often have a bad performance. Don't do client side joining (unifying the results) inside your Python code because that is very slow. Use SQL for joining, it is designed for that purpose. Users can also make their own reports with all kind of reporting tools that generate sql statements. You don't have to do everything in your app, just start with solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes you can define views for backward compatibility that keeps your app functioning.
Maybe the scientific software will add a lot of new features and maybe it will change its datamodel because of those new features..? That is possible but then you will have to change your application anyways to take profit from those new features.
It sounds to me as if your problem isn't really about MySQL or SQLlite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal wth it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it was up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe, some design like star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.