HL7 v2.x and v3 data modeling - SQL

The company I work for has started a new initiative in HL7 where we are trading both v2.x and v3 (CDA, specifically) messages. I am at the point where I am able to accept, validate and acknowledge the messages we are receiving from our trading partners, and I have started to create a data model for the backend storage of said messages. After a lot of consideration and research I am at a loss for the best way to approach this in MS SQL Server 2008 R2.
Currently my idea is essentially to load the data into a data warehouse directly from my integration engine (BizTalk), forgoing a backing, normalized operational database. I have set up the database for v2.x messages according to the v2.7 specs; since all versions of HL7 v2 are backward compatible, I can store any previous version in the same database. My initial design has a table for each segment, each tying back to a header table via a GUID I generate and store at run time.

The biggest issue with this approach is the number of columns in each table, and it's something I have no experience with. For instance, the PV1 segment table has 569 columns in order to accommodate all possible data. In addition, I need to make all columns varchar and make them big enough to handle any possible customization scenario from our vendors; I am planning on using varchar(1024) to achieve this. A lot of these columns (probably the majority) would be NULL, so I would use SPARSE columns. This screams bad design to me, but fully normalizing these tables would require a ton of work in both BizTalk and SQL Server, and I'm not sure what I would gain from doing so. I'm trying to be pragmatic since I have a deadline.
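For illustration, here is a cut-down sketch of what one of these segment tables would look like under my current plan (names abbreviated; the real PV1 table has hundreds of columns):

CREATE TABLE dbo.MessageHeader (
    MessageGuid UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    ReceivedUtc DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    RawMessage VARCHAR(MAX) NULL
);

CREATE TABLE dbo.PV1 (
    MessageGuid UNIQUEIDENTIFIER NOT NULL REFERENCES dbo.MessageHeader (MessageGuid),
    SetId VARCHAR(1024) SPARSE NULL,                      -- PV1-1
    PatientClass VARCHAR(1024) SPARSE NULL,               -- PV1-2
    AssignedPatientLocation VARCHAR(1024) SPARSE NULL,    -- PV1-3
    AdmissionType VARCHAR(1024) SPARSE NULL               -- PV1-4
    -- ...hundreds more sparse varchar columns for the remaining fields/components...
);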
If fully normalized, I would essentially have to create stored procs with a ton of parameters, OR split these messages to the nth degree, do individual loads into the smaller subtables, and make sure they all correlate back to the original GUID. I would also want to maintain ACID processing, which could get tricky and cause a lot of overhead in BizTalk. I suppose a third option would be to use nHapi to create objects out of the messages that I could tie into with Entity Framework, but nHapi seems like a dead project and I have no experience with Entity Framework right now.
I'm basically at a loss and need help from some industry professionals who have experience with HL7 data modeling. Is it worth the extra effort to fully normalize the tables? Will performance on the SQL side be abysmal if I use these denormalized segment tables with hundreds of columns (most of which will be NULL for each row)? I'm not a DBA, so I'm trying to understand the pitfalls of each approach. I've also looked at RIMBAA, but the HL7 RIM seems like a foreign language to me as an HL7 newbie, and translating v2 messages to the RIM would probably take far longer than I have to complete this project. I'm hoping I'm overthinking this and there is a simpler solution staring me in the face. Hopefully this question isn't too open-ended.

HL7 is not a "tight" standard: inputs and expected outputs vary depending on the system you are talking to. In this case, adding a broker such as Mirth, Rhapsody or BizTalk is a very good idea.
Whatever solution you employ, make sure you can cope with "non-standard" input and output, as you will soon find that things vary. Regarding HL7 versions 2.x and 3, be aware that very few hospitals have version 3; most still run 2.x.
I have been down the road of working with a database that tried to follow the HL7 structure. It can work, but it will take time and effort. Given that you have a tight deadline, maybe break out the bits of the data you will need to search on into their own fields (e.g. PID-3, the patient ID, would be useful to have) and let the rest go into your varchar column. Also, if you are not indexing on the column, you could use varchar(max).
As for your GUIDs in the database, they can work fine, but be careful not to cluster any indexes on the GUID, as this will fragment your data. Do your research here, and if in doubt go for identity columns instead.
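As a rough sketch of that hybrid idea (table and column names here are only examples, not anything mandated):

CREATE TABLE dbo.Hl7Message (
    MessageId INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,   -- identity as the clustered key
    MessageGuid UNIQUEIDENTIFIER NOT NULL UNIQUE NONCLUSTERED,    -- keep the GUID, but don't cluster on it
    PatientId VARCHAR(250) NULL,      -- broken out from PID-3 for searching
    MessageType VARCHAR(15) NULL,     -- broken out from MSH-9
    RawMessage VARCHAR(MAX) NOT NULL  -- everything else, unparsed and unindexed
);

CREATE NONCLUSTERED INDEX IX_Hl7Message_PatientId ON dbo.Hl7Message (PatientId);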
I'll recommend Entity Framework too: an excellent ORM, well worth learning.
So, my overall advice: go for a hybrid for now, breaking out what you need. Expect it to evolve over time, breaking out more pieces of HL7 into their own areas as needed. Do write a generic HL7 parser (not too difficult, I've done it a couple of times) and keep it flexible. But most of all, expect the HL7 to vary in structure; don't treat the specification as 100% truth, because you will get variations.

I can only comment on the CDA (and some very limited HL7v2) side of things based on personal experience.
We receive and send CDA documents wrapped in HL7v3 wrappers from external vendors (as well as internal systems -- see below). The wrappers contain the metadata for things like sending/receiving systems, dates and other high-level data. That very limited message metadata is stripped out and stored in the message data repository. Inside the wrapper is the actual CDA, which is then taken and stored as the XML datatype in the SQL database.
Using this model we can search at the metadata level, but also narrow things down based on the CDA itself using XPath queries. It makes the database much simpler... I can't even imagine creating columns based on the CDA schema.
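A simplified sketch of what that looks like on the SQL Server side (the metadata columns and the XPath here are illustrative only; real CDA elements live in the urn:hl7-org:v3 namespace):

CREATE TABLE dbo.CdaDocument (
    DocumentId INT IDENTITY(1,1) PRIMARY KEY,
    SendingSystem VARCHAR(100) NOT NULL,   -- stripped from the HL7v3 wrapper
    ReceivedUtc DATETIME2 NOT NULL,
    Cda XML NOT NULL                       -- the CDA itself, stored as the xml datatype
);

-- Search on wrapper metadata, then narrow down inside the CDA with XQuery.
WITH XMLNAMESPACES (DEFAULT 'urn:hl7-org:v3')
SELECT DocumentId,
       Cda.value('(/ClinicalDocument/title/text())[1]', 'varchar(200)') AS DocumentTitle
FROM dbo.CdaDocument
WHERE SendingSystem = 'SomeVendor'
  AND Cda.exist('/ClinicalDocument[code/@code = "34133-9"]') = 1;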
As for making clients follow the CDA schema, as a part of the project we've created an implementation guide which clients must follow if they want to have their messages accepted.
Using the implementation guide + schematron + BizTalk and XSD validation, we only accept messages which follow the CDA schema. We then check some data fields using schematron validation and reject if any of those fail. This is relayed to the sender using an HL7v3 message back to them with the specific error message and/or fields that are invalid. This is a point at which a message will be stored in the database.
This is all done in BizTalk/SQL Server. And since the CDA schema is very much pre-defined by the HL7 group, you can make the consumers of this system follow the schema. This is unlike what I've seen with HL7v2 where it seems people just bend the schema as needed.
For the HL7v2 side of things, I'm 99% certain that "we" (as in, "my company") are storing the messages much in the same way. Except, since the HL7v2 schema is so open, we're not validating and are just accepting/storing all messages. An HL7v2 parser has been written to parse the HL7v2 using the variations of the schemas we know about.
In my project's case, we are sending HL7v2 from our HCIS --> Mirth --> BizTalk, which applies the implementation guide + CDA schema along with an XSLT transform to map the HL7v2 to CDA, THEN submits it to the OTHER BizTalk CDA submission service as though it were an external vendor.
That's a ton of reading right now, so please ask questions, as I'd like to talk about it.

In most cases it's a waste of time to try to create a normalized relational data model to persist HL7 V2 or V3 data. I would recommend just storing entire messages or documents as single XML column values. Then query using SQLXML and/or XQuery. All modern relational databases support this now.

Modeling on HL7 can be a pain.
I would do the following:
Use the segment layouts described in the HL7 standard for staging tables; that way, even if you have varchar(1024) columns and most of them are NULL, it does not hurt you.
Create your actual tables, populated from the staging tables, according to the standards that you have enforced or will enforce (a sketch follows below).
This means that although you may have 500+ columns coming from the message, only 10 or 50 make sense to you, and you need to model only those 50. Yes, there is a downside: if tomorrow you want to extract more meaning, the count may grow from 50 to 75, and the historical messages will not have that information. That is fine, but you will need to factor it into the design.
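A minimal sketch of that staging-to-model flow (table names, the chosen fields and the date conversion are just an example; a real load needs validation):

-- Wide, all-varchar staging table filled straight from the integration engine.
CREATE TABLE dbo.PID_Staging (
    MessageGuid UNIQUEIDENTIFIER NOT NULL,
    PID_3 VARCHAR(1024) NULL,   -- patient identifier list
    PID_5 VARCHAR(1024) NULL,   -- patient name
    PID_7 VARCHAR(1024) NULL    -- date/time of birth
    -- ...remaining PID fields, all nullable varchar...
);

-- The "actual" table holds only the fields that mean something to you.
CREATE TABLE dbo.Patient (
    MessageGuid UNIQUEIDENTIFIER NOT NULL,
    PatientId VARCHAR(250) NOT NULL,
    PatientName VARCHAR(250) NULL,
    DateOfBirth DATE NULL
);

INSERT INTO dbo.Patient (MessageGuid, PatientId, PatientName, DateOfBirth)
SELECT MessageGuid,
       PID_3,
       PID_5,
       CONVERT(DATE, LEFT(PID_7, 8), 112)   -- HL7 DTM 'YYYYMMDD...'; real code should validate first
FROM dbo.PID_Staging
WHERE PID_3 IS NOT NULL;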

I would under no circumstances attempt to model anything using the HL7 v3 RIM. The reason is that this schema is very generic, deferring much of the metadata to the message itself. Are you familiar with an EAV table? The RIM is like that.
On the other hand, HL7 v2 should be a fairly simple basis for a DB schema. You can create tables around segment types, and columns around field names.
I think the problem of pulling in everything kills the project, and you should not do it. Typically, HL7 v2 messages carry a small subset of the whole, so it would be an utter waste to build out the whole thing, and it would be very confusing.
Further, the version of v2 you model would impact your schemas dramatically: with later versions, more and more fields become repeating fields, and your join relationships would change.
I recommend that you put a stake in the sand and start with v2.4 which is pretty easy yet still more complicated than most interfaces actually in use. Focus on a few segments and a few fields. MSH and PID first.
Add an EAV table to capture whatever may come in that you don't yet have in your tables. You can then look at what accumulates in this table over time and use it to decide what to build next. Your EAV could look like this: MSG_ID, SEGMENT, SET_ID, FIELD_NAME, FIELD_VALUE. Just store the unparsed HL7 contents of the field value.
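A sketch of that table in T-SQL (the key and type choices are mine, not part of the suggestion):

CREATE TABLE dbo.Hl7FieldOverflow (
    MSG_ID INT NOT NULL,             -- points at your message header row
    SEGMENT VARCHAR(3) NOT NULL,     -- e.g. 'PV1', 'OBX'
    SET_ID INT NOT NULL,             -- which occurrence of the segment within the message
    FIELD_NAME VARCHAR(20) NOT NULL, -- e.g. 'PV1-2'
    FIELD_VALUE VARCHAR(MAX) NULL,   -- raw, unparsed HL7 field contents
    CONSTRAINT PK_Hl7FieldOverflow PRIMARY KEY (MSG_ID, SEGMENT, SET_ID, FIELD_NAME)
);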

Related

Migrating RMS to RDB

We're approaching the migration of legacy OpenVMS RMS files into a relational database (both MS SQL 2012 and Oracle 10g are available).
I wonder if there are:
Tools to retrieve schema of indexed files
Tools to parse indexed files
Tools to deal with custom RMS data formats (zoned decimals etc)
as a bundle/API/Library
Perhaps I should change the approach?
There are several tools available, notably through ODBC vendors (I work for one: Attunity).
1 >> Tools to retrieve schema of indexed files
Please clarify: are you looking for just the record/column layout and indexes within the files, or also for relationships between files?
1a) How are the files currently being used? Cobol, Basic, Fortran programs? Datatrieve?
They will be using some data definition method, so you want a tool which can exploit that.
Connx and Attunity Connect can 'import' CDD definitions, BASIC MAP files and Cobol copybooks. Variants are typically covered as well. I have written many a (perl/awk) script to convert special definitions to XML.
1b) Analyze/RMS, or a program calling RMS XABs, can get the available index information. Attunity Connect will know how to map those onto the fields from 1a).
1c) There is no formal, stored relationship between (indexed) files on OpenVMS. That's all in the program logic. However, a modestly smart Perl/Awk/DCL script can often generate a table of likely foreign/primary keys by looking at field name and datatype matches.
How many files / layouts / gigabytes are we talking about?
2 >> Tools to parse indexed files
Please clarify. Once the structure is known (question 1), the parsing is done by reading using that structure, right? You never, ever want to understand the indexed file internals. Just tell RMS to fetch records.
3 >> Tools to deal with custom RMS data formats (zoned decimals etc) as a bundle/API/Library
Again, please clarify. Once the structure is known, just use the 'right' tool to read using that structure, and surely it will honor the detailed data definitions.
(I know it is quite simple to write one yourself, just thought there would be something in the industry)
Famous last words... 'quite simple'. Entire companies have been built, and thrive, doing just that for general cases. I admit that for specific cases it can be relatively straightforward, but 'the devil is in the details'.
In the Attunity Connect case we have a UDT (User Defined Data Type) to handle the 'odd' cases, often involving DATES. Dates in integers, in strings, as units since xxx are all available out of the box, but, for example, some have -1 meaning 'some high date', which needs some help to be stored in a DB.
All the databases have some bulk load tool (BCP, SQL*Loader).
As long as you can deliver data conforming to what those expect (tabular, comma-separated, quoted-or-not, escapes-or-not) you should be in good shape.
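On the SQL Server side, for example, the load end of that pipeline can be as small as this BULK INSERT sketch (path, table and delimiters are placeholders):

BULK INSERT dbo.CustomerStaging
FROM 'D:\extracts\customer.csv'
WITH (
    FIELDTERMINATOR = ',',   -- comma-separated
    ROWTERMINATOR = '\n',
    FIRSTROW = 2             -- skip the header row
);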
The EGH tool Vselect may be a handy, high-performance way to bulk read indexed files, filter and format some of the data, and spit out sequential files for the DB loaders. It can read RMS indexed files faster than RMS can! (It has its own metadata language though!)
Attunity offers full access and replication services.
They include CDC (change data capture) to not only load the data, but also keep it up to date in near-real-time. That's useful for 'evolution' versus 'revolution'.
Check out Attunity 'Replicate'. Once you have a data dictionary, just point to the tables desired (include/exclude filters), point to a target DB and click to replicate. Of course there are options for (global or per-table) transformations (like combining AREA-CODE+EXCHANGE+NUMBER into a single phone number, or adding a modified-date column).
Will this be a single big switch conversion, or is there desire to migrate the data and keep the old systems alive for days, months, years perhaps, all along keeping the data in close sync?
Hope this helps some,
Hein van den Heuvel.
OP: Perhaps I should change the approach? Probably.
You might consider finding data migration vendors, some of which likely have off-the-shelf solutions, if not as a COTS tool then more likely packaged as a service (I don't think this is a big market).
What this won't help you with is what I think of as the much bigger problem with the application code: who is going to change all the code that is making RMS calls into the corresponding code that makes relational DB calls? How will the entity ("Joe Programmer", or some tool) know where the data migrated to, so that it can write the correct call? What are you going to do about the fact that the data representation is likely to change?
Ideally you'd like an automated migration tool that will move the data itself (and therefore knows the data layouts and the representation changes) and will make the corresponding code changes. You can look for these kinds of vendors, too.

Normalisation and multi-valued fields

I'm having a problem with my students using multi-valued fields in Access and getting confused about normalisation as a result.
Here is what I can make out. Given a 1-to-many relationship, e.g.
Articles          Comments
--------          --------
artID {PK}        commID {PK}
text              text
                  artID {FK}
Access makes it possible to store this information into what appears to be one table, something like
Articles
--------
artID{PK}
text
comment
+ value
"value" referring to multiple comment values for the comment "column", which access actually stores as a separate table. The specifics of how the values are stored - table, its PK and FK - is completely hidden, but it is possible to query the multi-valued field, e.g. in the example above with the query
INSERT INTO Articles ( [comment].Value )
VALUES ('thank you')
WHERE artID = 1;
But the query doesn't quite reveal the underlying structure of the hidden table implementing the multi-valued field.
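For comparison, the explicit design I would normally teach looks roughly like this (names are illustrative), where adding a comment is an ordinary INSERT with no WHERE clause oddity:

CREATE TABLE Comments (
    commID AUTOINCREMENT PRIMARY KEY,
    artID LONG NOT NULL,
    commentText TEXT(255),
    CONSTRAINT FK_Comments_Articles FOREIGN KEY (artID) REFERENCES Articles (artID)
);

INSERT INTO Comments (artID, commentText)
VALUES (1, 'thank you');

SELECT Articles.artID, Articles.[text], Comments.commentText
FROM Articles INNER JOIN Comments ON Articles.artID = Comments.artID
WHERE Articles.artID = 1;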
Given this (disaster, in my view) - my problem is how to help newcomers to database design and normalisation understand what Access is offering them, why it may not be helpful, and that it is not a reason to ignore the basics of the relational model. More specifically:
Are there better ways, besides queries as above, to reveal the structure behind multi-valued fields?
Are there good examples of where the multi-valued field is not good enough, and shows the advantage of normalising explicitly?
Are there straightforward ways to obtain the multi-select visual output of Access multi-values, but based on separate, explicit tables?
Thanks!
I cannot give you advice on using this feature, because I have never used it; however, I can give you reasons not to use it.
I want to have full control over what I'm doing. This is not the case with multi-valued fields, so I don't use them.
This feature is not expandable. What if you want to add a date field to your comments, for instance?
It is sometimes necessary to upsize an Access (backend) database to a "big" database (SQL Server, Oracle). These databases don't offer such a feature. It is often the customer who decides which database has to be used. Recently I had to migrate an Access application (frontend) using an Oracle backend to a SQL Server backend because my client decided to drop his Oracle server. It is therefore a good idea to restrict yourself to using only common features.
For common tasks like editing lookup tables I created generic forms. My existing solutions will not work with multi-valued fields.
I have a (self-made) tool that synchronizes changes in the structure of the database on my developer’s site with the database on the client’s site. This tool cannot deal with multi-valued fields.
I have tools for the security management that can grant SELECT, INSERT, UPDATE and DELETE rights on tables or revoke them. Again, the management tool does not work with multi-valued fields.
Having a separate table for the comments allows you to quickly inspect all the comments (by opening the table). You cannot do this with multi-valued fields.
You will not see the 1 to n relation between the articles and the comments in a database diagram.
With a separate table you can choose whether you want to cascade deletes to the details table or not. If you don't, you will not be able to delete an article as long as there are comments attached to it. This can be desirable, if you want to protect the comments from being deleted inadvertently.
It is important to realize the difference between physical and logical relationships. Today the whole internet and web services (SOAP) rely very much on data formats that are multi-value in nature.
When you represent multi-value data with a relational database (such as Access), then behind the scenes you are using a traditional (and legitimate) relation. I cannot stress enough that, as such, the use of multi-value columns in Access is in fact a LEGITIMATE relational model.
The fact that the table is not exposed does not negate this. In fact, if you represent an invoice (master record, and repeating details) as an XML data cube, then we see the following:
1) you can build and represent that invoice with a relational database like Access
2) such a relational data model that is normalized can ALSO be represented as a SINGLE xml string.
3) deleting the XML record (or string) means that cascade delete of the child rows (invoice details) MUST occur.
So while it is true that multi-value fields were added to Access to deal with SharePoint, it is MOST important to realize that such data can be mapped to a relational database (if you could not do this, then Access could not consume that XML data using relational database tables, as ACCESS CURRENTLY DOES RIGHT NOW).
And with the web, XML and SharePoint, the need to consume, manage and utilize such data is not only widespread, but is in fact a basic staple of the internet.
As more and more data becomes complex in nature, we find the requirement for multi-value data exploding in use. Anyone who has used that so-called "fad" the internet is thus relying on and using data that is in fact VERY OFTEN XML and multi-value (complex) in nature.
As long as the logical (not physical) relational data model is kept, then the use of multi-value columns to represent such data is possible, and this is exactly what Access is doing (it is mapping the relational data model to a complex model). Note that the complex (xml) data model does NOT necessarily have to be relational in nature. However, if you ARE going to map such data to Access, then the complex multi-value model MUST CONFORM TO A RELATIONAL data model.
This is EXACTLY what is occurring in Access.
The fact that such a correct and legitimate relational model is not exposed is of little issue here. Are we to suggest that because Excel does not expose the binary codes used, users will never learn about computers? Or perhaps we all must program in assembler so we all correctly learn how computers work.
At the end of the day, who cares, and why does this matter? The fact that people drive automatic cars today does not toss out the concept that they are using different gears to operate that car. The idea that we shut down all of society because someone is going to drive an automatic car or, in this case, use complex data would be galactically stupid on our part.
So keep in mind that extensions to SQL do exist in Access to query the multi-value data, but as well pointed out here, those underlying tables are not exposed. However, as noted, exposing such tables would STILL REQUIRE one not to change or mess with cascade delete, since that feature is required TO MAINTAIN AN INTERSECTION OF FEATURES and a CORRECT mathematical relational model between the complex data model (xml) and that of using two related tables to represent such data.
In other words, you can use related tables to represent the complex data model IF YOU REMOVE the ability of users to play with the referential integrity options. The RI options MUST remain as set in those hidden tables, or else such data will not be able to make the trip BACK to the XML or complex data model from which it was consumed.
As noted, expecting users to be taught how gasoline reacts with oxygen in order to drive a car, or forcing word-processor users to learn a relational model and have the underlying tables exposed, makes little sense here.
However, the points made here in regards to such tables being exposed are legitimate concerns.
The REAL problem is that SQL Server, Oracle etc. cannot consume or represent that complex data, WHILE ACCESS CAN CONSUME such data.
As noted, the complex-data ship sailed LONG ago! XML, SOAP and the basic technologies of the internet are based on this complex data model.
In effect, the fact that SQL Server, Oracle and most databases cannot consume this multi-value data, or represent it without users having to create and model such data in a relational fashion, is a BIG shortcoming of SQL Server etc.
Access stands alone in this ability to consume this data.
So, for anyone who used a smartphone, iPad or the web, you are using basic technologies that are built around using complex data, something that Access now allows.
It is likely that the rest of the industry will have to follow suit, given that more and more data is complex in nature. If the database industry does not change, then the mainstream traditional relational database system will NOT be the resting place of such data.
A trend away from storing data in related tables is occurring at a rapid pace right now, and products like SharePoint, or even Google Docs, are proof of this concept. So Access is only reacting to market pressures, and it is likely that other database vendors will have to follow suit or simply give up on being part of the "fad" called the internet.
XML and complex data structures are a staple and a fact of our industry right now – this is not something we should all run away from, but in fact embrace.
Albert D. Kallal (Access MVP)
Edmonton, Alberta Canada
kallal#msn.com
The technical discussion is interesting. I think the real problem lies in student understanding. Because it is available in Access, students will use it, and initially it will probably provide a simple solution to some design problems. The negatives will occur later when they try to use the data. Maybe a simple example demonstrating the problems would persuade some students to avoid using multi-valued fields? Maybe an example of storing the data in another, more usable format would help?
Good luck !
Peter Bullard
MS Access does a great job of simplifying database management and abstracting away a lot of complexity. This, however, makes learning DBMS concepts a bit difficult. Have you tried using other 'standard' DBMS tools like MySQL (or even SQLite)? From a learning perspective they may be better.
I know this post is old. But it's not quite the same as every other post I've seen on this topic. This one has someone making a good case for using multi-valued fields...
As someone who is still trying very hard to get their head around Access, I find the discussion for and against using multi-valued fields incredibly frustrating.
I'm trying to sort through it all, but if everyone is so against them, what is the alternative method? It seems that in every search result I find, everyone is either telling you how to use multi-valued fields and controls or telling you how horrible a mistake they are. Many people refer to an alternative to them, but nobody says "Here's an example". I'm here to learn about these things. And while I know this is a simpler concept for a lot of people in these forums, I could really use some examples to look at.
I'm at a point where I have to decide which way to go. It would be wonderful to compare examples of using Multi Valued Fields and alternatives and using a control to select multiple values.
Or am I wrong and the functionality of a combobox where you can select multiple items is only available through Access?
I want to address the last of your questions first. There is a way of providing a visual presentation of a parent child relationship. It's called subforms. If you get help about subforms in Access, it will explain the concept.
I have used subforms in a project where I wanted to display the transaction header in a form and the transaction details in a subform. There is nothing to hinder this construct even when the data is stored in two normalized tables.
Of course, this affects the screen, not the database. That's the whole point. Normalization is relevant to storage and retrieval, not to other uses of data.

WCF Service to store data dynamically from XML to Database

I'm trying to build a WCF service that gets info from another service via XML. The XML usually has 4 elements, ranging from int and string to DateTime. I want to build the service dynamically so that when it gets an XML message, it stores it in the database. I don't want to hardcode the types and element names in the code. If there is a change, I want it to dynamically add it to the database. What is the best way of doing this? Is this a good practice? Or should I stick with just using Entity Framework, having a set model for the database, and hardcoding the element names and types?
Thanks
:)
As usual, "it depends...".
If the nature of the data you're receiving is that it has no fixed schema, or a schema that changes frequently as a part of its business logic/domain logic, it's a good idea to design a solution to manage that. Storing data whose schema you don't know at design time is a long-running debate on StackOverflow - look for "property bag", "entity attribute value" or EAV to see various questions and answers.
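As a rough sketch of the property-bag idea in SQL Server (the table, the sample payload and all names are invented for illustration), the incoming XML can be shredded into name/value rows without knowing its schema up front:

CREATE TABLE dbo.InboundElement (
    MessageId INT NOT NULL,
    ElementName NVARCHAR(128) NOT NULL,
    ElementValue NVARCHAR(MAX) NULL
);

DECLARE @payload XML = N'<record><PatientId>42</PatientId><VisitDate>2013-05-01</VisitDate></record>';

INSERT INTO dbo.InboundElement (MessageId, ElementName, ElementValue)
SELECT 1,
       n.value('local-name(.)', 'nvarchar(128)'),   -- element name discovered at run time
       n.value('(./text())[1]', 'nvarchar(max)')    -- element value kept as text
FROM @payload.nodes('/*/*') AS t(n);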
The thing you give up with this approach is - at the web service layer - the ability to check that the data you're receiving meets the agreed interface contract; this in turn helps to avoid all kinds of exciting bugs - what should your system do if it receives invalid XML? Without an agreed schema/dtd, you have to build all kinds of other checks into your code.
At the database level, you usually end up giving up aspects of the relational model (and therefore the power of SQL) to store your data without "traditional" row-and-column relationships. That often makes queries harder, and may sacrifice standards compliance (e.g. by using vendor specific extensions).
If the data changes because of technical reasons - i.e. it's not the nature of the data to change, it's just the technology chain that makes you worry about this - I'd suggest you instead build in the concept of versioning, and have "strongly typed" services/data with different versions. Whilst this seems more work, the benefits of relying on schema validation, the relational database model, and simplicity usually make it a good trade-off...
Put the xml into the database as xml. Many modern databases support XML directly. If not, just store it as text.

I need advice choosing a NoSQL database for a project with a lot of minute, related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about hundreds of public transit agencies: their routes, stations, times, and other related information. I will be getting my information from here and from the Google Code wiki page with similar info. There is a lot of data, and it's partitioned into multiple CSV-formatted text files. These can be huge, some ranging from 80-100 MB of data.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their geospatial support, which can really optimize getting small datasets. But I also need to be sure to link all the stops on a route, because I will be propagating information along a transit route for that line. In this case I have found that I could benefit from a graph DB like Neo4j or OrientDB, but from what I know, neither has geospatial support, nor am I 100% sure that a graph DB is what I need.
The perfect solution might not exist, but I come here asking for help in finding the best possible one for my situation. I know I will possibly have to work around limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.
I have also been suggested to splinter the data into multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits your problem 100%. My advice here is to go for some geospatial module on top of Neo4j or OrientDB, although there are some other free and open-source implementations.
I think the best one right now, with all the geospatial features implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
BTW, talking about splitting: if the amount of data/queries will be high, I strongly recommend you spread the load and think about the model in those terms. Surely you can do something.
I've used Mongo's GeoSpatial features and can offer some guidance if you need help with a C# or javascript implementation - I would recommend it to start because it's super easy to use. I'm learning all about Neo4j right now and I am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node go out and grab a list of movies from the related profile.
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important - for instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo that may be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but that is not relevant to the point)...on the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if I find that pulling an entity from the database (in other words, in the context of a relational database, querying the data required to generate an entity or a list of entities that fulfills the requested parameters) does not require significant processing (multiple joins, for instance).
Do you require ACID compliance (aside: if you have a graph problem, you can leverage Neo4j for this)? There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention - Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user based customization in terms of targeting offers, advertising, etc..
NoSQL only exists because MySQL users assume that all databases share MySQL's performance problems when the database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.
http://postgis.refractions.net/
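As a sketch of the kind of query that makes PostGIS attractive here (table and column names are mine, not anything GTFS mandates):

-- Stops stored as geography points (WGS 84), spatially indexed.
CREATE TABLE stops (
    stop_id TEXT PRIMARY KEY,
    stop_name TEXT NOT NULL,
    geom GEOGRAPHY(Point, 4326) NOT NULL
);
CREATE INDEX stops_geom_idx ON stops USING GIST (geom);

-- All stops within 500 metres of a GPS position.
SELECT stop_id, stop_name
FROM stops
WHERE ST_DWithin(
          geom,
          ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326)::geography,
          500);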

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLite files are entire databases which our user would like to query across. There are two ways (we see) of implementing this: one, create a single master database and a backend in Python that appends tables from each database to the master; two, query across the separate databases' tables and unify the results in Python.
Both methods run into trouble when the black box alters its table structures, say for example renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. A last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming).
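To make option two a bit more concrete, here is roughly what querying across the separate files could look like at the SQLite level (file and table names are placeholders):

ATTACH DATABASE 'run_2013_01.sqlite' AS run1;
ATTACH DATABASE 'run_2013_02.sqlite' AS run2;

-- Union the same logical table out of each black-box file.
SELECT 'run1' AS source, * FROM run1.measurements
UNION ALL
SELECT 'run2' AS source, * FROM run2.measurements;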
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables, in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, a new column is going to have to be determined to be either a fact or a dimensional attribute, and then, if it is a dimensional attribute, which dimension table it is best assigned to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating something like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually more easy to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick-and-dirty solution, not a 'generic' one, because generic solutions like the entity-attribute-value model often have bad performance. Don't do client-side joining (unifying the results) inside your Python code, because that is very slow. Use SQL for joining; it is designed for that purpose. Users can also make their own reports with all kinds of reporting tools that generate SQL statements. You don't have to do everything in your app; just start with solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes, you can define views for backward compatibility that keep your app functioning.
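For example (hypothetical names), if the black box renames a column, a thin view can preserve the old contract for your application:

-- The black box renamed sample_val to measurement_value; the view keeps the old name alive.
CREATE VIEW app_samples AS
SELECT sample_id,
       measurement_value AS sample_val,
       recorded_at
FROM blackbox_samples;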
Maybe the scientific software will add a lot of new features, and maybe it will change its data model because of those new features? That is possible, but then you will have to change your application anyway to take advantage of those new features.
It sounds to me as if your problem isn't really about MySQL or SQLite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal with it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it were up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe some design like a star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.