EAV vs Serialized Object vs SQL with XPath?

I'm trying to implement a badge system. The badges are based on users' metadata, which are subject to change.
This metadata is variable and set on the fly.
Examples of metadata:
commentCount
hasCompletedProfile
isActiveMember
etc. Later, I might want to add a hasGravatar metadata field; for this reason, I can't easily design and normalize a table up front.
This data, while an important part of the application, is not 'sensitive'; almost all of the metadata could be re-computed, which means data integrity is not a hard constraint.
Currently, I know of three options, even though I haven't used any of them.
EAV
Serialized Objects
XML field (I read somewhere that it is possible to store XML in a column and use XPath or something similar to query the data)
All of these options seem to have pros & cons, but since I've never experimented with any of them, I don't really know which one to pick.
Do you have any feedback or advice?
I'm currently working with Zend Framework & Doctrine 2 on a MySQL server.

XML and serialized objects are both very similar, as you would likely be using one column to store this arbitrary data. That quickly becomes messy and hard to work with in SQL WHERE clauses (though some DBMSs have XPath support).
EAV, on the other hand, gives you a separate row for every key => value pair, which you can easily extract with a JOIN or subquery. The major downside is that it can become a performance hit if you store a lot of data this way. Another drawback is that, to keep things simple, you would store all keys/values as text in the db. You could create an EAV table per type, but that isn't practically needed in most languages, since what you fetch comes out as a string and can be converted on the application side anyway. For simply storing user configuration/properties, EAV should be perfectly fine.
So you might have a table user_metadata with the following fields:
metadata_id INTEGER
user_id INTEGER
key CHAR
value CHAR
You could then fetch this data all at once for a user:
SELECT * FROM user_metadata WHERE user_id = $user_id
Or you could fetch individual metadata along with your user data
SELECT user.*, meta_gravatar.value AS hasGravatar
FROM user
LEFT JOIN user_metadata AS meta_gravatar
ON meta_gravatar.user_id = user.user_id AND meta_gravatar.key = 'hasGravatar'
WHERE user.user_id = $user_id
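To keep the lookups above cheap, the user/key columns should be indexed; a minimal sketch, assuming MySQL (the index name is made up, and the unique constraint assumes one value per key per user):
CREATE UNIQUE INDEX idx_user_metadata_user_key
ON user_metadata (user_id, `key`); -- `key` needs backticks because KEY is a reserved word in MySQL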

EAV: It is complicated and slow. It is an example of how not to use an SQL database. You cannot have a useful index on properties in EAV, and you need some nontrivial logic to get the data from the database into your business-logic objects. Your SQL queries also become difficult to optimize.
Serialized objects: Serialization often depends on the language or platform. There is no way to index a property or to search on anything, but it is a simple way to store data of undefined structure.
XML field: Using a standardized representation is better than ad-hoc serialization. There may also be support for such data structures in your SQL server.
JSON field: The same as an XML field; however, JSON supports primitive data types (int, bool, null) and is faster and easier to parse and serialize than XML. Some SQL servers provide some support for it as well (a sketch follows below).
All three serialization approaches share the same disadvantage: no indices on the properties. In most applications this is acceptable, because the data is not processed by the database anyway; it is simply a blob to the application. The good thing is that this blob does not complicate the database schema and operations.
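As a minimal sketch of the JSON-column variant, assuming MySQL 5.7+ (the table and column names are made up for illustration):
CREATE TABLE user_meta (
  user_id INT PRIMARY KEY,
  meta    JSON NOT NULL
);
INSERT INTO user_meta (user_id, meta)
VALUES (1, '{"commentCount": 42, "hasCompletedProfile": true, "hasGravatar": false}');
-- extract one property; note that a plain index on meta cannot be used for this predicate
SELECT user_id, JSON_EXTRACT(meta, '$.hasGravatar') AS hasGravatar
FROM user_meta
WHERE user_id = 1;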
There is one more way to implement such an EAV alternative: a plain old SQL table. If a new property requires a change in the application code anyway, then you can add the SQL column as well. If you have a user interface and application logic to define properties at run time, you can teach your application to issue ALTER TABLE queries. Then you simply add or remove columns as you need. In the end, it will be much easier and more effective than implementing EAV, as long as you have a good query builder.
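A minimal sketch of that "plain old table" approach, assuming MySQL and a made-up user table; adding the hasGravatar property from the question becomes a one-off DDL statement issued by the application:
-- executed once, when the new property is defined at run time
ALTER TABLE user ADD COLUMN hasGravatar TINYINT(1) NOT NULL DEFAULT 0;
-- afterwards it is an ordinary, indexable column
UPDATE user SET hasGravatar = 1 WHERE user_id = 42;
SELECT user_id FROM user WHERE hasGravatar = 1;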

Related

Dynamic Database/Key - Value/Entity - Key Value Dilemma

I have been programming relational databases for many years, but now have come across an unusual and tricky problem:
I am building an application that needs to let entities be defined very quickly and easily (by the user). Instances of these entities could then be created, updated, deleted, etc.
There are two options I can think of.
Option 1 - Dynamically created tables
The first option is to write an engine to dynamically generate the tables and insert the data into them. However, this would become very tricky, as every query would also need to be dynamic, or at least rely on dynamically created stored procedures, etc.
Option 2 - Entity - Key - Value Pattern
This is the only realistic option I can think of, where I have a 5-table structure:
EntityTypes
EntityTypeID int
EntityTypeName nvarchar(50)
Entities
EntityID int
EntityTypeID int
FieldTypes
FieldTypeID int
FieldTypeName nvarchar(50)
SQLtype int
FieldValues
EntityID int
FieldID int
Value nvarchar(MAX)
Fields
FieldID int
FieldName nvarchar(50)
FieldTypeID int
The "FieldValues" table would work a little like a datawarehouse fact table, and all my inserts/updates would work by filling a "Key/Value" table valued parameter and passing this to a SPROC (to avoid multiple inserts/updates).
All the tables would be heavily indexed, and I would end up doing many self joins to obtain the data.
I have read a lot about how bad Key/Value databases are, but for this problem it still seems to be the best.
Now my questions!
Can anyone suggest another approach or pattern other than these two options?
Would option two be feasible for medium sized datasets (1 million rows max)?
Are there further optimizations for option 2 I could use?
Any direction and advice much appreciated!
Personally I would just use a "noSQL" (key/value) database like MongoDB.
But if you need to use a relational database, option 2 is the way to go. A good example of that kind of model is the Alfresco Data Dictionary (Alfresco is an enterprise content management system). Its design is similar to what you describe, although they have multiple columns for field values (one for every simple type available in the database). If you add a good cache system to that (for example Ehcache), it should work fine.
As others have suggested NoSQL: I'm going to say that, in my opinion, schemaless databases really are best suited to use cases with no schema.
From the description, and the schema you came up with, it looks like your case is not in fact "no schema", but rather it seems to be "user-defined schema".
In fact, the schema you came up with looks very similar to the internal meta-schema of a relational database. (You're sort of building a relational database on top of a relational database, which in my experience is not a good idea: this "meta-database" will have at least twice the overhead and complexity for any basic operation, tables will get very large, which doesn't scale well, the data will be difficult to query and update, problems will be difficult to debug, and so on.)
For use-cases like that, you probably want DDL: Data Definition Language.
You didn't say which SQL database you're using, but most SQL databases (such as MySQL, PostgreSQL and MS-SQL) support some dialect of DDL extensions to SQL syntax, which let you manipulate the actual schema.
I've done this successfully for use-cases like yours in the past. It works well for cases where the schema rarely changes, and the data volumes are relatively low for each user. (For high volumes or frequent schema updates, you might want schemaless or some other type of NoSQL database.)
You might need some tables on the side for additional field information that doesn't fit in the SQL schema - you may want to duplicate some schema information there as well, as it can be difficult or inefficient to read back from the actual schema.
Ensuring atomic updates to your field information tables and the schema probably requires transactions, which may not be supported by your database engine - PostgreSQL at least does support transactional schema updates.
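A minimal sketch of that in PostgreSQL (the table and column names are made up); the DDL and the bookkeeping row either both commit or both roll back:
BEGIN;
-- add the user-defined field to the real schema
ALTER TABLE customer_data ADD COLUMN favorite_color TEXT;
-- record extra information about the field in the same transaction
INSERT INTO field_info (table_name, column_name, label)
VALUES ('customer_data', 'favorite_color', 'Favorite color');
COMMIT;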
You have to be vigilant when it comes to security - you don't want to open yourself up to users creating, storing or deleting things they're not supposed to.
If it suits your use-case, consider using not only separate tables, but separate databases, which can also be created and destroyed on demand using DDL. This could be applicable if each customer has ownership of data collections that can't, shouldn't, or don't need to be queried across customers. (Arguably, these are rare - typically, you want at least analytics or something across customers, but there are cases where each customer "owns" an isolated, hosted wiki, shop or CMS/DMS of some sort.)
(I saw in your comment that you already decided on NoSQL, so just posting this option here for completeness.)
It sounds like this might be a solution in search of a problem. Is there any chance your domain can be refactored? If not - there's still hope.
Your scalability for option 2 will depend a lot on the width of the custom objects. How many fields can be created dynamically? 1 million entities when each entity has 100 fields could be a drag... Efficient indexing could make performance bearable.
For another option - you could have one data table with a few string fields, a few double fields, and a few integer fields. For example, a table with String1, String2, String3, Int1, Int2, Int3. A second table would have rows that define a user object and map your "CustomObjectName" => String1, and so on. A stored procedure reading INFORMATION_SCHEMA and some dynamic SQL would be able to read the schema table and return a strongly typed recordset...
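A minimal sketch of that layout (all names are made up; T-SQL flavored, since the answer mentions stored procedures and INFORMATION_SCHEMA):
CREATE TABLE GenericData (
    EntityID INT PRIMARY KEY,
    String1  NVARCHAR(255),
    String2  NVARCHAR(255),
    Int1     INT,
    Int2     INT,
    Float1   FLOAT
);
-- maps the logical fields of a user-defined object to the physical columns
CREATE TABLE FieldMap (
    EntityTypeName NVARCHAR(50),
    LogicalField   NVARCHAR(50),
    PhysicalColumn NVARCHAR(20)  -- e.g. 'CustomObjectName' -> 'String1'
);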
Yet another option (for recent versions of SQL Server) would be to store a row with an id, a type name, and an XML field that contains an XML document with the object data. In MS SQL Server this can be queried directly, and maybe even validated against a schema.
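A minimal sketch of that XML-column option, assuming SQL Server's xml data type (the table and element names are made up):
CREATE TABLE CustomObjects (
    ObjectID INT PRIMARY KEY,
    TypeName NVARCHAR(50),
    Data     XML
);
-- pull one property out of the stored document with an XQuery path
SELECT ObjectID,
       Data.value('(/Object/City)[1]', 'NVARCHAR(100)') AS City
FROM CustomObjects
WHERE TypeName = 'Address';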
Personally, I would take the time to define as many attributes as you can rather than use EAV for everything. Surely you know some of the attributes. Then you only need EAV for the things that are truly client-specific.
But if it all must be EAV, then a NoSQL database is the way to go. Or you can use a relational database for some of the data and a NoSQL database for the rest.

Most efficient method for persisting complex types with variable schemas in SQL

What I'm doing
I am creating an SQL table that will provide the back-end storage mechanism for complex-typed objects. I am trying to determine how to accomplish this with the best performance. I need to be able to query on each individual simple type value of the complex type (e.g. the String value of a City in an Address complex type).
I was originally thinking that I could store the complex type values in one record as an XML, but now I am concerned about the search performance of this design. I need to be able to create variable schemas on the fly without changing anything about the database access layer.
Where I'm at now
Right now I am thinking of creating the following tables.
TABLE: Schemas
COLUMN NAME DATA TYPE
SchemaId uniqueidentifier
Xsd xml //contains the schema for the document of the given complex type
DeserializeType varchar(200) //The Full Type name of the C# class to which the document deserializes.
TABLE: Documents
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
SchemaId uniqueidentifier
TABLE: Values //The DocumentId+ValueXPath function as a PK
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
ValueXPath varchar(250)
Value text
From these tables, when performing queries, I would do a series of self-joins on the Values table. When I want to get the entire object by its DocumentId, I would have a generic script for creating a view that mimics a denormalized data table of the complex type.
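A minimal sketch of the kind of self-join I have in mind, using the Address example further below (@documentId stands in for a parameter; [Values] is bracketed because VALUES is a reserved word):
SELECT d.DocumentId,
       city.Value  AS City,
       state.Value AS State
FROM Documents AS d
LEFT JOIN [Values] AS city
       ON city.DocumentId = d.DocumentId AND city.ValueXPath = '/Address/City'
LEFT JOIN [Values] AS state
       ON state.DocumentId = d.DocumentId AND state.ValueXPath = '/Address/State'
WHERE d.DocumentId = @documentId;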
What I want to know
I believe there are better ways to accomplish what I am trying to do, but I am a little too ignorant about the relative performance benefits of different SQL techniques. Specifically, I don't know the performance cost of:
1 - comparing the value of a text field versus that of a varchar field
2 - different kind of joins versus nested queries
3 - getting a view versus an xml document from the sql db
4 - doing other things that I don't even know I don't know would be affecting my query, but which I am experienced enough to know must exist
I would appreciate any information or resources about these performance issues in sql as well as a recommendation for how to approach this general issue in a more efficient way.
For example,
Here's an example of what I am currently planning on doing.
I have a C# class Address which looks like
public class Address{
string Line1 {get;set;}
string Line2 {get;set;}
string City {get;set;}
string State {get;set;}
string Zip {get;set;}
}
An instance is constructed from new Address{Line1="17 Mulberry Street", Line2="Apt C", City="New York", State="NY", Zip="10001"}
Its XML value would look like:
<Address>
<Line1>17 Mulberry Street</Line1>
<Line2>Apt C</Line2>
<City>New York</City>
<State>NY</State>
<Zip>10001</Zip>
</Address>
Using the db schema from above, I would have a single record in the Schemas table with an XSD definition of the Address XML schema. The Address instance would have a uniqueidentifier (the PK of the Documents table), and its Documents row's SchemaId would point to that Address record in the Schemas table. There would then be five records in the Values table to represent this Address.
They would look like:
DocumentId ValueXPath Value
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line1 17 Mulberry Street
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line2 Apt C
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/City New York
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/State NY
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Zip 10001
Just Added a Bounty...
My objective is to obtain the resources I need in order to give my application a data access layer that is fully searchable and has a data-schema generated from the application layer that does not require direct database configuration (i.e. creating a new SQL table) in order to add a new aggregate root to the domain model.
I am open to the possibility of using .NET compatible technologies other than SQL, but I will require that any such suggestions be adequately substantiated in order to be considered.
How about looking for a solution at the architectural level? I was also breaking my head on complex graphs and performance until I discovered CQRS.
[start evangelist mode]
You can go document-based or relational as storage. Even both! (Event Sourcing)
Nice separation of concerns: Read Model vs Write Model
Have your cake and eat it too!
Ok, there is an initial learning / technical curve to get over ;)
[end evangelist mode]
As you stated: "I need to be able to create variable schemas on the fly without changing anything about the database access layer." The key benefit is that your read model can be very fast since it's made for reading. If you add Event Sourcing to the mix, you can drop and rebuild your Read Model to whatever schema you want... even "online".
There are some nice open-source frameworks out there, like nServiceBus, which save lots of time and technical challenges. It all depends on how far you want to take these concepts and what you're willing/able to spend time on. You can even start with just the basics if you follow Greg Young's approach. See the info in the links below.
See
CQRS Examples and Screencasts
CQRS Questions
Intro (Also see the video)
Somehow, what you want sounds like a painful thing to do in SQL. Basically, you should treat the inside of a text field as opaque when querying an SQL database. Text fields were not made for efficient queries.
If you just want to store serialized objects in a text field, that is fine. But do not try to build queries that look inside the text field to find objects.
Your idea sounds like you want to perform some joins, XML parsing, and XPath application to get to a value. This doesn't strike me as the most efficient thing to do.
So, my advice:
Either just store serialized objects in the db, and do nothing more than load them and perform all other operations in memory
Or, if you need to query complex data structures, you may really want to look into document stores/databases like CouchDB or MongoDB; you can also check Wikipedia on the subject. There are even databases specifically designed for storing XML, even though I personally don't like them very much.
Addendum, per your explanations above
Simply put, don't go over the top with this thing:
If you just want to persist C#/.NET objects, just use the XML Serialization already built into the framework, a single table and be done with it.
If you, for some reason, need to store complex XML, use a dedicated XML store
If you have a fixed database schema, but it is too complex for efficient queries, use a Document Store in memory where you keep a denormalized version of your data for faster queries (or just simplify your database schema)
If you don't really need a fixed schema, use just a Document Store, and forget about having any "schema definition" at all
As for your solution, yes, it could work somehow. As could a plain SQL schema if you set it up right. But for applying an XPath, you'll probably parse the whole XML document each time you access a record, which wouldn't be very efficient to begin with.
If you want to check out Document databases, there are .NET drivers for CouchDB and MongoDB. The eXist XML database offers a number of Web protocols, and you can probably create a client class easily with VisualStudio's point-and-shoot interface. Or just google for someone who already did.
I need to be able to create variable schemas on the fly without changing anything about the database access layer.
You are re-implementing the RDBMS within an RDBMS. The DB can do this already - that is what DDL statements like CREATE TABLE and CREATE SCHEMA are for.
I suggest you look into "schemas" and SQL security. There is no reason with the correct security setup you cannot allow your users to create their own tables to store document attributes in, or even generate them automatically.
Edit:
Slightly longer answer, if you don't have full requirements immediately, I would store the data as XML data type, and query them using XPath queries. This will be OK for occasional queries over smallish numbers of rows (fewer than a few thousand, certainly).
Also, your RDBMS may support indexes over XML, which may be another way of solving your problem. CREATE XML INDEX in SqlServer 2008 for example.
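A minimal sketch of that in SQL Server (the table and column names are made up); a primary XML index requires a clustered primary key on the table:
CREATE TABLE Docs (
    DocId INT IDENTITY PRIMARY KEY,
    Body  XML NOT NULL
);
-- index the XML column so XPath/XQuery predicates can use it
CREATE PRIMARY XML INDEX IX_Docs_Body ON Docs (Body);
-- example query that can benefit from the index
SELECT DocId
FROM Docs
WHERE Body.exist('/Address[Zip = "10001"]') = 1;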
However, for frequent queries you can use triggers or materialized views to create copies of the relevant data in table format, so more intensive reports can be sped up by querying the breakout tables.
I don't know your requirements, but if you are responsible for creating the reports/queries yourself, this may be an approach to use. If you need to enable users to create their own reports that's a bigger mountain to climb.
I guess what I am saying is: "are you sure you need to do this, and that XML can't just do the job?"
In part, it will depend on your DB engine. You're using SQL Server, aren't you?
Answering your topics:
1 - Comparing the value of a text field versus a varchar field: if you're comparing two db fields, varchar fields are smarter. Nvarchar(max) stores data in Unicode using 2*l+2 bytes, where "l" is the length. For performance, you will need to consider how much larger the tables will be when selecting the best way to index (or not index) your table fields. See the topic.
2 - Sometimes nested queries are easy to create and execute, and they can also reduce query time. But, depending on the complexity, it may be better to use some kind of join instead. The best way is to try both. Execute each query two or more times, because the DB engine "compiles" a query on the first execution, so subsequent runs are considerably faster. Measure the times for different parameters and choose the best option (see the sketch after this list).
"Sometimes you can rewrite a subquery to use JOIN and achieve better performance. The advantage of creating a JOIN is that you can evaluate tables in a different order from that defined by the query. The advantage of using a subquery is that it is frequently not necessary to scan all rows from the subquery to evaluate the subquery expression. For example, an EXISTS subquery can return TRUE upon seeing the first qualifying row." - link
3 - There's not much information in this question, but if you can get the XML document directly from the table, that would be a better idea than a view. Again, it depends on the view and the document.
4 - Other issues concern the total number of records expected in your table and the indexing of the columns, for which you need to consider sorting, joining, filtering, PKs and FKs. Each situation could demand a different approach. My suggestion is to invest some time reading about how your database engine and queries work in relation to your system.
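As a small illustration of the trade-off in point 2 (reusing the Documents/Values design from the question; which form wins depends on the data and the optimizer):
-- subquery form: can stop at the first matching [Values] row per document
SELECT d.DocumentId
FROM Documents AS d
WHERE EXISTS (
    SELECT 1
    FROM [Values] AS v
    WHERE v.DocumentId = d.DocumentId
      AND v.ValueXPath = '/Address/State'
      AND CAST(v.Value AS NVARCHAR(100)) = 'NY'  -- Value is declared as text, so cast before comparing
);
-- join form: lets the optimizer choose the table order
SELECT DISTINCT d.DocumentId
FROM Documents AS d
JOIN [Values] AS v
  ON v.DocumentId = d.DocumentId
WHERE v.ValueXPath = '/Address/State'
  AND CAST(v.Value AS NVARCHAR(100)) = 'NY';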
I hope I've helped.
Interesting question.
I think you may be asking the wrong question here. Broadly speaking, as long as you have a FULLTEXT index on your text field, queries will be fast. Much faster than varchar if you have to use wild cards, for instance.
However, if I were you, I'd concentrate on the actual queries you're going to be running. Do you need boolean operators? Wildcards? Numerical comparisons? That's where I think you will encounter the real performance worries.
I would imagine you would need queries like:
"find all addresses in the states of New York, New Jersey and Pennsylvania"
"find all addresses between house numbers 1 and 100 on Mulberry Street"
"find all addresses where the zipcode is missing, and the city is New York"
At a high level, the solution you propose is to store your XML somewhere, and then de-normalize that XML into name/value pairs for querying.
Name/value pairs have a long and proud history, but become unwieldy in complex query situations, because you're not using the built-in optimizations and concepts of the relational database model.
Some refinements I'd recommend are to look at the domain model and at least see if you can factor out separate data types into the "value" column; you might end up with "textValue", "moneyValue", "integerValue" and "dateValue". In the example you give, you might factor "address 1" into "housenumber" (as an integer) and "streetname".
Having said all this - I don't think there's a better solution other than completely changing tack to a document-focused database.

Schema less SQL database table - practical compromise

This question is an attempt to find a practical solution for this question.
I need a semi-schemaless design for my SQL database. However, I can limit the flexibility to shoehorn it into the SQL paradigm. Moving to schemaless databases might be an option in the future, but right now I'm stuck with SQL.
I have a table in a SQL database (let's call it Foo). When a row is added to it, it needs to be able to store an arbitrary number of "meta" fields with it. An example would be the ability to attach arbitrary metadata like tags, collaborators, etc. All the fields are optional, but the problem is that they're of different types. Some might be numeric, some might be textual, etc.
A simple design linking Foo to a table of OptionalValues with fields like name, value_type, value_string, value_int, value_date, etc. seems straightforward, although it descends into the whole EAV model which Alex mentions in that last answer, and it looks quite wasteful. Also, I imagine queries against this table will be quite slow once it grows. I don't expect to search or sort by anything in this table, though. All I need is that when I get a row out of Foo, these extra attributes are obtainable as well.
Are there any best practices for implementing this kind of a setup in a SQL database or am I simply looking at the whole thing wrongly?
Add a string column "Metafields" to your table "Foo" and store your metadata there as an XML or JSON string.
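A minimal sketch of that, assuming MySQL (the metadata keys are made up; on MySQL 5.7+ you could declare the column as JSON instead of TEXT):
ALTER TABLE Foo ADD COLUMN Metafields TEXT;
UPDATE Foo
SET Metafields = '{"tags": ["urgent", "review"], "collaborators": 3}'
WHERE id = 42;
-- the application fetches the row and decodes Metafields itself
SELECT id, Metafields FROM Foo WHERE id = 42;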

Would keeping XML data inside an SQL table be an architectural misconception?

I've got an SQL table that I use to keep product data. Some products have other data attached to them (e.g. books have a page count and cover type; movies have a running time; etc.).
I could use a separate table in SQL to keep those, keeping (name, value) pairs.
I can also just keep XML-packed data in a single field in a table. It's not a normalized approach, but it seems more natural to me.
I did a similar thing in a shopping basket application. We needed to attach meta data to the products without creating too much of a schema, which would have restricted the format of the meta-data in the future. We kept the meta-data as XML.
The only reason I would not do it is if you're going to end up performing queries on the data. Just make sure you won't have some daft person wanting reports by Publisher meta-data or something (which has happened to me) and you should be fine.
If you were intending to use XML as a way of not properly defining database tables, that would indeed be an architectural cop-out. I'm not sure about your scenario; it seems dangerously close to that. But key-value pairs are probably worse.
The best thing is to use a specialist XML datatype, if your database has one. In addition to RageZ's list, Oracle has had an XMLType for ten years now (since 9i). The advantage of using XMLType is two-fold. It announces to the casual observer that the documents in this column are XML. It also gives you access to built-in functionality, such as validation with XML Schemas, should you want it. Other features could prove handy if you subsequently have to start referring to the contents of the XML. For instance, Oracle's XDB supports an XML index type which can dramatically improve the performance of XPath queries.
It depends!
If you expect the 'shape' of your products to vary greatly then XML is a good way to go. [If you are using SQL Server you can index an XML field.]
I don't think it's an architectural misconception. Just make sure you don't want to use that data in a query, because it's going to be complex.
Plus, recent RDBMSs have functions to handle XML (MSSQL, Postgres, MySQL), so you would still be able to use that data.

Dynamic Database Schema

What is a recommended architecture for providing storage for a dynamic logical database schema?
To clarify: Where a system is required to provide storage for a model whose schema may be extended or altered by its users once in production, what are some good technologies, database models or storage engines that will allow this?
A few possibilities to illustrate:
Creating/altering database objects via dynamically generated DDL
Creating tables with large numbers of sparse physical columns and using only those required for the 'overlaid' logical schema
Creating a 'long, narrow' table that stores dynamic column values as rows that then need to be pivoted to create a 'short, wide' rowset containing all the values for a specific entity
Using a BigTable/SimpleDB PropertyBag type system
Any answers based on real world experience would be greatly appreciated
What you are proposing is not new. Plenty of people have tried it... most have found that they chase "infinite" flexibility and instead end up with much, much less than that. It's the "roach motel" of database designs -- data goes in, but it's almost impossible to get it out. Try and conceptualize writing the code for ANY sort of constraint and you'll see what I mean.
The end result typically is a system that is MUCH more difficult to debug and maintain, and that is full of data consistency problems. This is not always the case, but more often than not, that is how it ends up. Mostly because the programmer(s) don't see this train wreck coming and fail to code defensively against it. Also, it often ends up being the case that the "infinite" flexibility really isn't that necessary; it's a very bad "smell" when the dev team gets a spec that says "Gosh, I have no clue what sort of data they are going to put here, so let 'em put WHATEVER"... and the end users are just fine having pre-defined attribute types that they can use (code up a generic phone #, and let them create any # of them -- this is trivial in a nicely normalized system and maintains flexibility and integrity!)
If you have a very good development team and are intimately aware of the problems you'll have to overcome with this design, you can successfully code up a well designed, not terribly buggy system. Most of the time.
Why start out with the odds stacked so much against you, though?
Don't believe me? Google "One True Lookup Table" or "single table design". Some good results:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
http://thedailywtf.com/Comments/Tom_Kyte_on_The_Ultimate_Extensibility.aspx?pg=3
http://www.dbazine.com/ofinterest/oi-articles/celko22
http://thedailywtf.com/Comments/The_Inner-Platform_Effect.aspx?pg=2
A strongly typed xml field in MSSQL has worked for us.
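A minimal sketch of a strongly typed XML column in SQL Server (the schema collection and element names are made up); documents that don't validate against the XSD are rejected on insert/update:
CREATE XML SCHEMA COLLECTION ProductMetaSchema AS
N'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="meta">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="pages" type="xsd:int" minOccurs="0"/>
          <xsd:element name="coverType" type="xsd:string" minOccurs="0"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:schema>';
CREATE TABLE Product (
    ProductId INT PRIMARY KEY,
    Meta      XML(ProductMetaSchema)  -- strongly typed: validated against the collection
);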
Like some others have said, don't do this unless you have no other choice. One case where this is required is if you are selling an off-the-shelf product that must allow users to record custom data. My company's product falls into this category.
If you do need to allow your customers to do this, here are a few tips:
- Create a robust administrative tool to perform the schema changes, and do not allow these changes to be made any other way.
- Make it an administrative feature; don't allow normal users to access it.
- Log every detail about every schema change. This will help you debug problems, and it will also give you CYA data if a customer does something stupid.
If you can do those things successfully (especially the first one), then any of the architectures you mentioned will work. My preference is to dynamically change the database objects, because that allows you to take advantage of your DBMS's query features when you access the data stored in the custom fields. The other three options require you load large chunks of data and then do most of your data processing in code.
I have a similar requirement and decided to use the schema-less MongoDB.
MongoDB (from "humongous") is an open source, scalable, high-performance, schema-free, document-oriented database written in the C++ programming language. (Wikipedia)
Highlights:
has rich query functionality (maybe the closest to SQL DBs)
production ready (foursquare, sourceforge use it)
Downsides (stuff you need to understand so you can use Mongo correctly):
no transactions (it does have atomic operations, but only on single documents)
this stuff here: http://ethangunderson.com/blog/two-reasons-to-not-use-mongodb/
durability... mostly ACID-related stuff
I did it once in a real project:
The database consisted of one table with one field, which was an array of 50. It had a 'word' index set on it. All the data was typeless, so the 'word index' worked as expected. Numeric fields were represented as characters, and the actual sorting was done on the client side. (It is still possible to have several array fields, one per data type, if needed.)
The logical data schema for the logical tables was held within the same database, with a different table row 'type' (the first array element). It also supported simple versioning in copy-on-write style using the same 'type' field.
Advantages:
You can rearrange and add/delete your columns dynamically, with no need for a dump/reload of the database. Any new column's data may be set to an initial value in (virtually) zero time.
Fragmentation is minimal, since all records and tables are the same size; sometimes this gives better performance.
All table schema is virtual. Any logical schema structure is possible (even recursive, or object-oriented).
It is good for "write-once, read-mostly, no-delete/mark-as-deleted" data (most Web apps actually are like that).
Disadvantages:
Indexing only by full words, no abbreviations,
Complex queries are possible, but with slight performance degradation.
It depends on whether your preferred database system supports arrays and word indexes (it was implemented in the PROGRESS RDBMS).
Relational model is only in programmer's mind (i.e. only at run-time).
And now I'm thinking the next step could be - to implement such a database on the file system level. That might be relatively easy.
The whole point of having a relational DB is keeping your data safe and consistent. The moment you allow users to alter the schema, there goes your data integrity...
If your need is to store heterogeneous data, as in a CMS scenario for example, I would suggest storing XML validated by an XSD in a row. Of course you lose performance and easy search capabilities, but it's a good trade-off IMHO.
Since it's 2016, forget XML! Use JSON to store the non-relational data bag, with an appropriately typed column as backend. You shouldn't normally need to query by value inside the bag, which will be slow even though many contemporary SQL databases understand JSON natively.
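A minimal sketch of that in PostgreSQL (the table and key names are made up); a GIN index keeps containment queries tolerable if you do occasionally need to look inside the bag:
CREATE TABLE content_item (
    id    SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    bag   JSONB NOT NULL DEFAULT '{}'
);
CREATE INDEX content_item_bag_idx ON content_item USING GIN (bag);
INSERT INTO content_item (title, bag)
VALUES ('Hello', '{"tags": ["news"], "isFeatured": true}');
-- containment query; can use the GIN index
SELECT id, title FROM content_item WHERE bag @> '{"isFeatured": true}';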
Sounds to me like what you really want is some sort of "meta-schema", a database schema which is capable of describing a flexible schema for storing the actual data. Dynamic schema changes are touchy and not something you want to mess with, especially not if users are allowed to make the change.
You're not going to find a database which is more suited to this task than any other, so your best bet is just to select one based on other criteria. For example, what platform are you using to host the DB? What language is the app written in? etc
To clarify what I mean by "meta-schema":
CREATE TABLE data (
id INTEGER NOT NULL AUTO_INCREMENT,
`key` VARCHAR(255), -- backticks because KEY is a reserved word in MySQL
data TEXT,
PRIMARY KEY (id)
);
This is a very simple example; you would likely have something more specific to your needs (and hopefully a little easier to work with), but it does serve to illustrate my point. You should consider the database schema itself to be immutable at the application level; any structural changes should be reflected in the data (that is, the instantiation of that schema).
I know that the models indicated in the question are used in production systems all over. A rather large one is in use at a large university/teaching institution that I work for. They specifically use the long, narrow table approach to map data gathered by many varied data acquisition systems.
Also, Google recently released their internal data interchange format, Protocol Buffers, as open source via their code site. A database system modeled on this approach would be quite interesting.
Check the following:
Entity-attribute-value model
Google Protocol Buffer
Create 2 databases
DB1 contains static tables, and represents the "real" state of the data.
DB2 is free for users to do with as they wish - they (or you) will have to write code to populate their odd-shaped tables from DB1.
The EAV approach, I believe, is the best approach, but it comes with a heavy cost.
I know it's an old topic, but I guess it never loses relevance.
I'm developing something like that right now.
Here is my approach.
I use a server setup with MySQL, Apache, PHP, and Zend Framework 2 as the application framework, but it should work just as well with any other setup.
Here is a simple implementation guide; you can evolve it further yourself from this.
You would need to implement your own query language interpreter, because the effective SQL would be too complicated.
Example:
select id, password from user where email_address = "xyz@xyz.com"
The physical database layout:
Table 'specs': (should be cached in your data access layer)
id: int
parent_id: int
name: varchar(255)
Table 'items':
id: int
parent_id: int
spec_id: int
data: varchar(20000)
Contents of table 'specs':
1, 0, 'user'
2, 1, 'email_address'
3, 1, 'password'
Contents of table 'items':
1, 0, 1, ''
2, 1, 2, 'xyz@xyz.com'
3, 1, 3, 'my password'
The translation of the example in our own query language:
select id, password from user where email_address = "xyz@xyz.com"
to standard SQL would look like this:
select
parent_id, -- user id
data -- password
from
items
where
spec_id = 3 -- make sure this is a 'password' item
and
parent_id in
( -- get the 'user' item to which this 'password' item belongs
select
id
from
items
where
spec_id = 1 -- make sure this is a 'user' item
and
id in
( -- fetch all item id's with the desired 'email_address' child item
select
parent_id -- id of the parent item of the 'email_address' item
from
items
where
spec_id = 2 -- make sure this is a 'email_address' item
and
data = "xyz#xyz.com" -- with the desired data value
)
)
You will need to have the specs table cached in an associative array or hashtable or something similar to get the spec_id's from the spec names. Otherwise you would need to insert some more SQL overhead to get the spec_id's from the names, like in this snippet:
Bad example, don't use this, avoid this, cache the specs table instead!
select
parent_id,
data
from
items
where
spec_id = (select id from specs where name = "password")
and
parent_id in (
select
id
from
items
where
spec_id = (select id from specs where name = "user")
and
id in (
select
parent_id
from
items
where
spec_id = (select id from specs where name = "email_address")
and
data = "xyz#xyz.com"
)
)
I hope you get the idea and can determine for yourself whether that approach is feasible for you.
Enjoy! :-)
Over at the c2.com wiki, the idea of "Dynamic Relational" was explored. You DON'T need a DBA: columns and tables are Create-On-Write, unless you start adding constraints to make it act more like a traditional RDBMS: as a project matures, you can incrementally "lock it down".
Conceptually you can think of each row as an XML statement. For example, an employee record could be represented as:
<employee lastname="Li" firstname="Joe" salary="120000" id="318"/>
This does not imply it has to be implemented as XML, it's just a handy conceptualization. If you ask for a non-existing column, such as "SELECT madeUpColumn ...", it's treated as blank or null (unless added constraints forbid such). And it's possible to use SQL, although one has to be careful about comparisons because of the implied type model. But other than type handling, users of a Dynamic Relational system would feel right at home because they can leverage most of their existing RDBMS knowledge. Now, if somebody would just build it...
In the past I've chosen option C -- Creating a 'long, narrow' table that stores dynamic column values as rows that then need to be pivoted to create a 'short, wide' rowset containing all the values for a specific entity.. However, I was using an ORM, and that REALLY made things painful. I can't think of how you'd do it in, say, LinqToSql. I guess I'd have to create a Hashtable to reference the fields.
@Skliwz: I'm guessing he's more interested in allowing users to create user-defined fields.
ElasticSearch. You should consider it especially if you're dealing with datasets that you can partition by date, you can use JSON for your data, and are not fixed on using SQL for retrieving the data.
ES infers your schema for any new JSON fields you send, either automatically, with hints, or manually; you can define/change the mapping with one HTTP command ("mappings").
Although it does not support SQL, it has some great lookup capabilities and even aggregations.
I know this is a super old post, and much has changed in the last 11 years, but I thought I would add this as it might be helpful to future readers. One of the reasons my co-founders and I created HarperDB is to natively accomplish a dynamic schema in a single, unduplicated data set while providing full indexing capability. You can read more about it here:
https://harperdb.io/blog/dynamic-schema-the-harperdb-way/
SQL already provides a way to change your schema: the ALTER command.
Simply have a table that lists the fields that users are not allowed to change, and write a nice interface for ALTER.