What is a recommended architecture for providing storage for a dynamic logical database schema?
To clarify: Where a system is required to provide storage for a model whose schema may be extended or altered by its users once in production, what are some good technologies, database models or storage engines that will allow this?
A few possibilities to illustrate:
Creating/altering database objects via dynamically generated DML
Creating tables with large numbers of sparse physical columns and using only those required for the 'overlaid' logical schema
Creating a 'long, narrow' table that stores dynamic column values as rows that then need to be pivoted to create a 'short, wide' rowset containing all the values for a specific entity
Using a BigTable/SimpleDB PropertyBag type system
Any answers based on real-world experience would be greatly appreciated.
What you are proposing is not new. Plenty of people have tried it... most have found that they chase "infinite" flexibility and instead end up with much, much less than that. It's the "roach motel" of database designs -- data goes in, but it's almost impossible to get it out. Try and conceptualize writing the code for ANY sort of constraint and you'll see what I mean.
The end result is typically a system that is MUCH more difficult to debug and maintain, and one that is full of data consistency problems. This is not always the case, but more often than not, that is how it ends up, mostly because the programmer(s) don't see this train wreck coming and fail to code defensively against it. It also often turns out that the "infinite" flexibility really isn't that necessary; it's a very bad "smell" when the dev team gets a spec that says "Gosh I have no clue what sort of data they are going to put here, so let 'em put WHATEVER"... and the end users are just fine having pre-defined attribute types that they can use (code up a generic phone #, and let them create any # of them -- this is trivial in a nicely normalized system and maintains flexibility and integrity!)
If you have a very good development team and are intimately aware of the problems you'll have to overcome with this design, you can successfully code up a well designed, not terribly buggy system. Most of the time.
Why start out with the odds stacked so much against you, though?
Don't believe me? Google "One True Lookup Table" or "single table design". Some good results:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
http://thedailywtf.com/Comments/Tom_Kyte_on_The_Ultimate_Extensibility.aspx?pg=3
http://www.dbazine.com/ofinterest/oi-articles/celko22
http://thedailywtf.com/Comments/The_Inner-Platform_Effect.aspx?pg=2
A strongly typed XML field in MSSQL has worked for us.
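For anyone curious what that looks like, here is a minimal sketch in SQL Server; the schema collection, table, and column names are invented for illustration, not taken from the answer above:

CREATE XML SCHEMA COLLECTION dbo.CustomFieldSchema AS
N'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Extra">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Field" type="xsd:string" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>';

CREATE TABLE CustomData (
    Id INT IDENTITY PRIMARY KEY,
    Payload XML(DOCUMENT dbo.CustomFieldSchema)  -- rows are rejected unless they validate against the XSD
);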
Like some others have said, don't do this unless you have no other choice. One case where this is required is if you are selling an off-the-shelf product that must allow users to record custom data. My company's product falls into this category.
If you do need to allow your customers to do this, here are a few tips:
- Create a robust administrative tool to perform the schema changes, and do not allow these changes to be made any other way.
- Make it an administrative feature; don't allow normal users to access it.
- Log every detail about every schema change. This will help you debug problems, and it will also give you CYA data if a customer does something stupid.
If you can do those things successfully (especially the first one), then any of the architectures you mentioned will work. My preference is to dynamically change the database objects, because that allows you to take advantage of your DBMS's query features when you access the data stored in the custom fields. The other three options require you load large chunks of data and then do most of your data processing in code.
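To make the preferred approach concrete, here is a rough, hedged sketch of the kind of statements such an administrative tool might generate and log when a customer adds a custom field (all table and column names here are invented):

-- DDL generated by the admin tool on the customer's behalf
ALTER TABLE customer_orders
    ADD custom_priority_code VARCHAR(20) NULL;

-- Log every detail of the schema change (see the tips above)
INSERT INTO schema_change_log (changed_at, changed_by, ddl_statement)
VALUES (CURRENT_TIMESTAMP, 'admin_tool',
        'ALTER TABLE customer_orders ADD custom_priority_code VARCHAR(20) NULL');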
I have a similar requirement and decided to use the schema-less MongoDB.
MongoDB (from "humongous") is an open source, scalable, high-performance, schema-free, document-oriented database written in the C++ programming language. (Wikipedia)
Highlights:
has rich query functionality (maybe the closest to SQL DBs)
production-ready (Foursquare and SourceForge use it)
Downsides (stuff you need to understand so you can use Mongo correctly):
no transactions (individual document operations are atomic, but there are no multi-document transactions)
this stuff here: http://ethangunderson.com/blog/two-reasons-to-not-use-mongodb/
durability concerns, mostly ACID-related stuff
I did it once in a real project:
The database consisted of one table with one field which was an array of 50. It had a 'word' index set on it. All the data was typeless, so the 'word index' worked as expected. Numeric fields were represented as characters, and the actual sorting was done on the client side. (It is still possible to have several array fields for each data type if needed.)
The logical data schema for logical tables was held within the same database with a different table row 'type' (the first array element). It also supported simple versioning in copy-on-write style using the same 'type' field.
Advantages:
You can rearrange and add/delete your columns dynamically, with no need to dump/reload the database. Any new column's data can be set to an initial value in (virtually) zero time.
Fragmentation is minimal, since all records and tables are the same size; this sometimes gives better performance.
The entire table schema is virtual. Any logical schema structure is possible (even recursive or object-oriented).
It is good for "write-once, read-mostly, no-delete/mark-as-deleted" data (most Web apps actually are like that).
Disadvantages:
Indexing works only on full words, with no abbreviations.
Complex queries are possible, but with slight performance degradation.
It depends on whether your preferred database system supports arrays and word indexes (it was implemented in the PROGRESS RDBMS).
The relational model exists only in the programmer's mind (i.e. only at run time).
And now I'm thinking the next step could be to implement such a database at the file-system level. That might be relatively easy.
The whole point of having a relational DB is keeping your data safe and consistent. The moment you allow users to alter the schema, there goes your data integrity...
If your need is to store heterogeneous data, for example like a CMS scenario, I would suggest storing XML validated by an XSD in a row. Of course you lose performance and easy search capabilities, but it's a good trade off IMHO.
Since it's 2016, forget XML! Use JSON to store the non-relational data bag, with an appropriately typed column as backend. You shouldn't normally need to query by value inside the bag, which will be slow even though many contemporary SQL databases understand JSON natively.
Sounds to me like what you really want is some sort of "meta-schema", a database schema which is capable of describing a flexible schema for storing the actual data. Dynamic schema changes are touchy and not something you want to mess with, especially not if users are allowed to make the change.
You're not going to find a database which is more suited to this task than any other, so your best bet is just to select one based on other criteria. For example, what platform are you using to host the DB? What language is the app written in? etc
To clarify what I mean by "meta-schema":
CREATE TABLE data (
    id INTEGER NOT NULL AUTO_INCREMENT,
    `key` VARCHAR(255),
    data TEXT,
    PRIMARY KEY (id)
);
This is a very simple example; you would likely have something more specific to your needs (and hopefully a little easier to work with), but it does serve to illustrate my point. You should consider the database schema itself to be immutable at the application level; any structural changes should be reflected in the data (that is, the instantiation of that schema).
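As a purely illustrative usage sketch of that table (the keys and values are made up), two "custom fields" for one logical entity become two rows, and reading the entity back means filtering on the key instead of selecting named columns:

INSERT INTO data (`key`, data) VALUES ('employee.318.phone', '555-0100');
INSERT INTO data (`key`, data) VALUES ('employee.318.badge_color', 'blue');

SELECT `key`, data
FROM data
WHERE `key` LIKE 'employee.318.%';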
I know that models indicated in the question are used in production systems all over. A rather large one is in use at a large university/teaching institution that I work for. They specifically use the long narrow table approach to map data gathered by many varied data acquisition systems.
Also, Google recently released their internal data sharing protocol, protocol buffer, as open source via their code site. A database system modeled on this approach would be quite interesting.
Check the following:
Entity-attribute-value model
Google Protocol Buffer
Create 2 databases
DB1 contains static tables, and represents the "real" state of the data.
DB2 is free for users to do with as they wish - they (or you) will have to write code to populate their odd-shaped tables from DB1.
The EAV approach, I believe, is the best approach, but it comes with a heavy cost.
I know it's an old topic, but I guess it never loses relevance.
I'm developing something like that right now.
Here is my approach.
I use a server setup with MySQL, Apache, PHP, and Zend Framework 2 as the application framework, but it should work just as well with any other setup.
Here is a simple implementation guide, you can evolve it yourself further from this.
You would need to implement your own query language interpreter, because the effective SQL would be too complicated.
Example:
select id, password from user where email_address = "xyz#xyz.com"
The physical database layout:
Table 'specs': (should be cached in your data access layer)
id: int
parent_id: int
name: varchar(255)
Table 'items':
id: int
parent_id: int
spec_id: int
data: varchar(20000)
Contents of table 'specs':
1, 0, 'user'
2, 1, 'email_address'
3, 1, 'password'
Contents of table 'items':
1, 0, 1, ''
2, 1, 2, 'xyz#xyz.com'
3, 1, 3, 'my password'
The translation of the example in our own query language:
select id, password from user where email_address = "xyz#xyz.com"
to standard SQL would look like this:
select
parent_id, -- user id
data -- password
from
items
where
spec_id = 3 -- make sure this is a 'password' item
and
parent_id in
( -- get the 'user' item to which this 'password' item belongs
select
id
from
items
where
spec_id = 1 -- make sure this is a 'user' item
and
id in
( -- fetch all item id's with the desired 'email_address' child item
select
parent_id -- id of the parent item of the 'email_address' item
from
items
where
spec_id = 2 -- make sure this is a 'email_address' item
and
data = "xyz#xyz.com" -- with the desired data value
)
)
You will need to have the specs table cached in an associative array or hashtable or something similar to get the spec_id's from the spec names. Otherwise you would need to insert some more SQL overhead to get the spec_id's from the names, like in this snippet:
Bad example, don't use this, avoid this, cache the specs table instead!
select
parent_id,
data
from
items
where
spec_id = (select id from specs where name = "password")
and
parent_id in (
select
id
from
items
where
spec_id = (select id from specs where name = "user")
and
id in (
select
parent_id
from
items
where
spec_id = (select id from specs where name = "email_address")
and
data = "xyz#xyz.com"
)
)
I hope you get the idea and can determine for yourself whether that approach is feasible for you.
Enjoy! :-)
Over at the c2.com wiki, the idea of "Dynamic Relational" was explored. You DON'T need a DBA: columns and tables are Create-On-Write, unless you start adding constraints to make it act more like a traditional RDBMS: as a project matures, you can incrementally "lock it down".
Conceptually you can think of each row as an XML statement. For example, an employee record could be represented as:
<employee lastname="Li" firstname="Joe" salary="120000" id="318"/>
This does not imply it has to be implemented as XML, it's just a handy conceptualization. If you ask for a non-existing column, such as "SELECT madeUpColumn ...", it's treated as blank or null (unless added constraints forbid such). And it's possible to use SQL, although one has to be careful about comparisons because of the implied type model. But other than type handling, users of a Dynamic Relational system would feel right at home because they can leverage most of their existing RDBMS knowledge. Now, if somebody would just build it...
In the past I've chosen the third option -- creating a 'long, narrow' table that stores dynamic column values as rows that then need to be pivoted to create a 'short, wide' rowset containing all the values for a specific entity. However, I was using an ORM, and that REALLY made things painful. I can't think of how you'd do it in, say, LinqToSql. I guess I'd have to create a Hashtable to reference the fields.
@Skliwz: I'm guessing he's more interested in allowing users to create user-defined fields.
ElasticSearch. You should consider it especially if you're dealing with datasets that you can partition by date, you can use JSON for your data, and are not fixed on using SQL for retrieving the data.
ES infers the schema from any new JSON fields you send, either automatically or with hints, and you can define/change the mapping manually with one HTTP command ("mappings").
Although it does not support SQL, it has some great lookup capabilities and even aggregations.
I know this is a super old post, and much has changed in the last 11 years, but I thought I would add this as it might be helpful to future readers. One of the reasons my co-founders and I created HarperDB was to natively accomplish a dynamic schema in a single, unduplicated data set while providing full indexing capability. You can read more about it here:
https://harperdb.io/blog/dynamic-schema-the-harperdb-way/
SQL already provides a way to change your schema: the ALTER command.
Simply have a table that lists the fields that users are not allowed to change, and write a nice interface for ALTER.
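A hedged sketch of what that might look like (names invented): keep a small table listing the protected columns, have the interface check it, and then issue ordinary DDL on the user's behalf.

CREATE TABLE protected_columns (
    table_name  VARCHAR(128) NOT NULL,
    column_name VARCHAR(128) NOT NULL,
    PRIMARY KEY (table_name, column_name)
);

-- Statement the interface would generate after checking the list above
ALTER TABLE contacts ADD twitter_handle VARCHAR(100) NULL;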
I'm fairly new to using postgres and would like to hear some opinions before I take the time to set this up. Not that it matters what framework I'm using, but to better explain the scenario, I'm using django with many custom models (essentially post types) that require many fields which equates to many many columns in each db table.
I know postgres is well optimized for this type of scenario but in my case, I'm planning for this to be a large scale application and when it gets larger, I'll be making many calls to the database which could negatively affect the performance.
I honestly have no idea if this would be a viable option but could setting up my database to store the data in a json format in a single column be a good solution?
for example, let's say I have a model called 'laptop' and rather than have individual columns for specs like:
brand:
model:
screen_size:
ram_capacity:
processor_type:
processor_speed:
hard_drive_capacity:
graphics_card:
drive: ....
and end up having 30 to 40 columns, would it be viable to have it set up like:
id(primary_key):
brand:
model:
specs(json object): {
brand:
model:
screen_size:
ram_capacity:
processor_type:
processor_speed:
hard_drive_capacity:
graphics_card:
drive:
}
I'm just using the laptop model as an example but obviously there are millions of potential use cases. Just wanted to know the pros and cons of this potential set up and any insight is appreciated.
Is it viable? Yes. Is it a good idea? Generally no.
Why not? JSON is powerful, but it has some shortcomings:
JSON repeats the headers, so the size of the data is much larger using JSON than without.
JSON's types don't have the same flexibility as Postgres types. For instance, it is missing dates.
It is harder to enforce constraints on JSON objects.
JSON is overly flexible, so there is little validation of the data going into a JSON object (other than the JSON format).
Note that JSONB overcomes some of these limitations.
So, if you have columns that you know are going to be part of each entity, storing them as separate columns is generally the best approach.
When do you use JSON? You use it when you have optional columns that would not be shared among most of the rows in the table. For instance, different manufacturers might have specs specific to that manufacturer and that would be suitable for a JSON object.
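A minimal sketch of that hybrid layout in PostgreSQL, with made-up column names based on the laptop example: the columns shared by every row stay relational, and only the optional, manufacturer-specific extras go into a JSONB bag.

CREATE TABLE laptop (
    id              bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    brand           text NOT NULL,
    model           text NOT NULL,
    ram_capacity_gb integer,
    extra_specs     jsonb    -- optional, manufacturer-specific attributes
);

-- Querying inside the bag is possible, just slower than querying a real column
SELECT brand, model
FROM laptop
WHERE extra_specs ->> 'graphics_card' = 'RTX 4060';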
@Gordon Linoff explains many of the reasons why that is a bad idea. One question I would also ask is how you plan to look up items in the database if all the information is packed into JSON. You will find it difficult to search for the items with a particular processor or with a particular processor speed, for example.
You have not really thought about the database structure very well. If there is a lot of overlap between your products, you not only waste space but will suffer from insert and deletion anomalies. For example, you include processor type and processor speed in the record, whereas a properly normalised database would look more like:
CREATE TABLE manufacturers (
    uid INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE processors (
    uid INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    model TEXT NOT NULL,
    manufacturer INTEGER REFERENCES manufacturers (uid),
    speed INTEGER NOT NULL,
    cache INTEGER
);
CREATE TABLE models (
    name TEXT NOT NULL,
    processor INTEGER REFERENCES processors (uid)
);
i.e. it would be normalised. This reduces data loads considerably but also helps prevent update anomalies. For example, with your scheme there has to be a model with an AMD xyz processor in order to record details of that processor.
If you really do not want to do database normalisation, then you might be interested in the HSTORE type, which is a set of key/value pairs you can keep in a column. This is not as good as a properly normalised database, but it makes it possible to do queries on the keys and values.
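A short, hedged example of hstore in PostgreSQL (table and key names invented), showing that the keys and values in the column remain queryable:

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE laptop_specs (
    model text PRIMARY KEY,
    specs hstore
);

INSERT INTO laptop_specs
VALUES ('example-13', 'screen_size => 13.3, processor_type => i7');

SELECT model
FROM laptop_specs
WHERE specs -> 'processor_type' = 'i7';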
I have been programming relational databases for many years, but now I have come across an unusual and tricky problem:
I am building an application that needs entities that can be very quickly and easily defined by the user. Instances of these entities could then be created, updated, deleted, etc.
There are two options I can think of.
Option 1 - Dynamically created tables
The first option is to write an engine to dynamically generate the tables, and insert the data into these. However, this would become very tricky, as every query would also need to be dynamic, or at least dynamically created stored procedures etc.
Option 2 - Entity - Key - Value Pattern
This is the only realistic option I can think of, where I have a 5-table structure:
EntityTypes
EntityTypeID int
EntityTypeName nvarchar(50)
Entities
EntityID int
EntityTypeID int
FieldTypes
FieldTypeID int
FieldTypeName nvarchar(50)
SQLtype int
FieldValues
EntityID int
FieldID int
Value nvarchar(MAX)
Fields
FieldID int
FieldName nvarchar(50)
FieldTypeID int
The "FieldValues" table would work a little like a datawarehouse fact table, and all my inserts/updates would work by filling a "Key/Value" table valued parameter and passing this to a SPROC (to avoid multiple inserts/updates).
All the tables would be heavily indexed, and I would end up doing many self joins to obtain the data.
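For illustration, one of those pivoting self-join queries against the tables above could look roughly like this (the field names 'FirstName' and 'LastName' and the entity type id are purely hypothetical):

SELECT  e.EntityID,
        MAX(CASE WHEN f.FieldName = 'FirstName' THEN fv.Value END) AS FirstName,
        MAX(CASE WHEN f.FieldName = 'LastName'  THEN fv.Value END) AS LastName
FROM    Entities e
JOIN    FieldValues fv ON fv.EntityID = e.EntityID
JOIN    Fields f       ON f.FieldID   = fv.FieldID
WHERE   e.EntityTypeID = 1        -- hypothetical 'Person' entity type
GROUP BY e.EntityID;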
I have read a lot about how bad Key/Value databases are, but for this problem it still seems to be the best.
Now my questions!
Can anyone suggest another approach or pattern other than these two options?
Would option two be feasible for medium sized datasets (1 million rows max)?
Are there further optimizations for option 2 I could use?
Any direction and advice much appreciated!
Personally I would just use a "noSQL" (key/value) database like MongoDB.
But if you need to use a relational database, option 2 is the way to go. A good example of that kind of model is the Alfresco Data Dictionary (Alfresco is an enterprise content management system). Its design is similar to what you describe, although they have multiple columns for field values (one for every simple type available in the database). If you add a good cache system to that (for example Ehcache), it should work fine.
As others have suggested NoSQL, I'm going to say that, in my opinion, schemaless databases really are best suited for use cases with no schema.
From the description, and the schema you came up with, it looks like your case is not in fact "no schema", but rather it seems to be "user-defined schema".
In fact, the schema you came up with looks very similar to the internal meta-schema of a relational database. (You're sort of building a relational database on top of a relational database, which in my experience is not a good idea, as this "meta-database" will have at least twice the overhead and complexity for any basic operation - tables will get very large, which doesn't scale well, and the data will be difficult to query and update, problems will be difficult to debug, and so on.)
For use-cases like that, you probably want DDL: Data Definition Language.
You didn't say which SQL database you're using, but most SQL databases (such as MySQL, PostgreSQL and MS-SQL) support some dialect of DDL extensions to SQL syntax, which let you manipulate the actual schema.
I've done this successfully for use-cases like yours in the past. It works well for cases where the schema rarely changes, and the data volumes are relatively low for each user. (For high volumes or frequent schema updates, you might want schemaless or some other type of NoSQL database.)
You might need some tables on the side for additional field information that doesn't fit in the SQL schema; you may want to duplicate some schema information there as well, as it can be difficult or inefficient to read back from the actual schema.
Ensuring atomic updates to your field information tables and the schema probably requires transactions, which may not be supported by your database engine - PostgreSQL at least does support transactional schema updates.
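For example, in PostgreSQL the schema change and the side-table bookkeeping can be committed (or rolled back) as one unit; a hedged sketch, with a hypothetical custom_field_info side table:

BEGIN;
ALTER TABLE customer_data ADD COLUMN loyalty_tier text;
INSERT INTO custom_field_info (table_name, column_name, label)
VALUES ('customer_data', 'loyalty_tier', 'Loyalty tier');
COMMIT;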
You have to be vigilant when it comes to security - you don't want to open yourself up to users creating, storing or deleting things they're not supposed to.
If it suits your use-case, consider using not only separate tables, but separate databases, which can also by created and destroyed on demand using DDL. This could be applicable if each customer has ownership of data collections that can't, shouldn't, or don't need to be queried across customers. (Arguably, these are rare - typically, you want at least analytics or something across customers, but there are cases where each customer "owns" an isolated, hosted wiki, shop or CMS/DMS of some sort.)
(I saw in your comment that you already decided on NoSQL, so just posting this option here for completeness.)
It sounds like this might be a solution in search of a problem. Is there any chance your domain can be refactored? If not, there's still hope.
Your scalability for option 2 will depend a lot on the width of the custom objects. How many fields can be created dynamically? 1 million entities when each entity has 100 fields could be a drag... Efficient indexing could make performance bearable.
For another option - you could have one data table that has a few string fields, a few double fields, and a few integer fields. For example, a table with String1, String2, String3, Int1, Int2, Int3. A second table would have rows that define a user object and map your "CustomObjectName" => String1, and such. A stored procedure reading INFORMATION_SCHEMA and some dynamic SQL would be able to read the schema table and return a strongly typed recordset...
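A hedged sketch of that layout (all names invented): a generic data table with a few columns of each type, plus a mapping table that records which logical field lives in which physical column.

CREATE TABLE GenericData (
    ObjectId   INT IDENTITY PRIMARY KEY,
    ObjectType VARCHAR(100) NOT NULL,
    String1 NVARCHAR(400), String2 NVARCHAR(400), String3 NVARCHAR(400),
    Int1 INT, Int2 INT, Int3 INT,
    Double1 FLOAT, Double2 FLOAT, Double3 FLOAT
);

CREATE TABLE ObjectFieldMap (
    ObjectType    VARCHAR(100) NOT NULL,
    FieldName     VARCHAR(100) NOT NULL,   -- e.g. 'CustomObjectName'
    StorageColumn VARCHAR(20)  NOT NULL,   -- e.g. 'String1'
    PRIMARY KEY (ObjectType, FieldName)
);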
Yet another option (for recent versions of SQL Server) would be to store a row with an id, a type name, and an XML field that contains an XML document holding the object data. In MS SQL Server this can be queried against directly, and maybe even validated against a schema.
Personally, I would take the time to define as many attributes as you can rather than use EAV for everything. Surely you know some of the attributes. Then you only need EAV for the things that are truly client-specific.
But if it all must be EAV, then a NoSQL database is the way to go. Or you can use a relational database for some things and a NoSQL database for the rest.
What I'm doing
I am creating an SQL table that will provide the back-end storage mechanism for complex-typed objects. I am trying to determine how to accomplish this with the best performance. I need to be able to query on each individual simple type value of the complex type (e.g. the String value of a City in an Address complex type).
I was originally thinking that I could store the complex type values in one record as an XML, but now I am concerned about the search performance of this design. I need to be able to create variable schemas on the fly without changing anything about the database access layer.
Where I'm at now
Right now I am thinking to create the following tables.
TABLE: Schemas
COLUMN NAME DATA TYPE
SchemaId uniqueidentifier
Xsd xml //contains the schema for the document of the given complex type
DeserializeType varchar(200) //The Full Type name of the C# class to which the document deserializes.
TABLE: Documents
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
SchemaId uniqueidentifier
TABLE: Values //The DocumentId+ValueXPath function as a PK
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
ValueXPath varchar(250)
Value text
From these tables, when performing queries I would do a series of self-joins on the Values table. When I want to get the entire object by the DocumentId, I would have a generic script for creating a view that mimics a denormalized data table of the complex type.
What I want to know
I believe there are better ways to accomplish what I am trying to, but I am a little too ignorant about the relative performance benefits of different SQL techniques. Specifically I don't know the performance cost of:
1 - comparing the value of a text field versus of a varchar field.
2 - different kind of joins versus nested queries
3 - getting a view versus an xml document from the sql db
4 - doing some other things that I don't even know would be affecting my query, but am experienced enough to know exist
I would appreciate any information or resources about these performance issues in sql as well as a recommendation for how to approach this general issue in a more efficient way.
For Example,
Here's an example of what I am currently planning on doing.
I have a C# class Address which looks like
public class Address{
string Line1 {get;set;}
string Line2 {get;set;}
string City {get;set;}
string State {get;set;}
string Zip {get;set;}
}
An instance is constructed from new Address{Line1="17 Mulberry Street", Line2="Apt C", City="New York", State="NY", Zip="10001"}
its XML value would look like:
<Address>
<Line1>17 Mulberry Street</Line1>
<Line2>Apt C</Line2>
<City>New York</City>
<State>NY</State>
<Zip>10001</Zip>
</Address>
Using the db-schema from above I would have a single record in the Schemas table with an XSD definition of the address xml schema. This instance would have a uniqueidentifier (PK of the Documents table) which is assigned to the SchemaId of the Address record in the Schemas table. There would then be five records in the Values table to represent this Address.
They would look like:
DocumentId ValueXPath Value
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line1 17 Mulberry Street
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line2 Apt C
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/City New York
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/State NY
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Zip 10001
Just Added a Bounty...
My objective is to obtain the resources I need in order to give my application a data access layer that is fully searchable and has a data-schema generated from the application layer that does not require direct database configuration (i.e. creating a new SQL table) in order to add a new aggregate root to the domain model.
I am open to the possibility of using .NET compatible technologies other than SQL, but I will require that any such suggestions be adequately substantiated in order to be considered.
How about looking for a solution at the architectural level? I was also breaking my head on complex graphs and performance until I discovered CQRS.
[start evangelist mode]
You can go document-based or relational as storage. Even both! (Event Sourcing)
Nice separation of concerns: Read Model vs Write Model
Have your cake and eat it too!
Ok, there is an initial learning / technical curve to get over ;)
[end evangelist mode]
As you stated: "I need to be able to create variable schemas on the fly without changing anything about the database access layer." The key benefit is that your read model can be very fast since it's made for reading. If you add Event Sourcing to the mix, you can drop and rebuild your Read Model to whatever schema you want... even "online".
There are some nice open-source frameworks out there like NServiceBus which save lots of time and technical challenges. It all depends on how far you want to take these concepts and what you're willing or able to spend time on. You can even start with just the basics if you follow Greg Young's approach. See the info in the links below.
See
CQRS Examples and Screencasts
CQRS Questions
Intro (Also see the video)
Somehow what you want sounds like a painful thing to do in SQL. Basically, you should treat the inside of a text field as opaque when querying an SQL database. Text fields were not made for efficient queries.
If you just want to store serialized objects in a text field, that is fine. But do not try to build queries that look inside the text field to find objects.
Your idea sounds like you want to perform some joins, XML parsing, and XPath application to get to a value. This doesn't strike me as the most efficient thing to do.
So, my advice:
Either just store serialized objects in the db, and do nothing more than load them and perform all other operations in memory
Or, if you need to query complex data structures, you may really want to look into document stores/databases like CouchDB or MongoDB; you can also check Wikipedia on the subject. There are even databases specifically designed for storing XML, even though I personally don't like them very much.
Addendum, per your explanations above
Simply put, don't go over the top with this thing:
If you just want to persist C#/.NET objects, just use the XML Serialization already built into the framework, a single table and be done with it.
If you, for some reason, need to store complex XML, use a dedicated XML store
If you have a fixed database schema, but it is too complex for efficient queries, use a Document Store in memory where you keep a denormalized version of your data for faster queries (or just simplify your database schema)
If you don't really need a fixed schema, use just a Document Store, and forget about having any "schema definition" at all
As for your solution, yes, it could work somehow. As could a plain SQL schema if you set it up right. But for applying an XPath, you'll probably parse the whole XML document each time you access a record, which wouldn't be very efficient to begin with.
If you want to check out Document databases, there are .NET drivers for CouchDB and MongoDB. The eXist XML database offers a number of Web protocols, and you can probably create a client class easily with VisualStudio's point-and-shoot interface. Or just google for someone who already did.
I need to be able to create variable schemas on the fly without changing anything about the database access layer.
You are re-implementing the RDBMS within an RDBMS. The DB can do this already - that is what the DDL statements like create table and create schema are for....
I suggest you look into "schemas" and SQL security. There is no reason with the correct security setup you cannot allow your users to create their own tables to store document attributes in, or even generate them automatically.
Edit:
Slightly longer answer: if you don't have full requirements immediately, I would store the data as an XML data type and query it using XPath queries. This will be OK for occasional queries over smallish numbers of rows (fewer than a few thousand, certainly).
Also, your RDBMS may support indexes over XML, which may be another way of solving your problem. CREATE XML INDEX in SqlServer 2008 for example.
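For reference, a hedged sketch of what that can look like in SQL Server (the table and column names are hypothetical; a primary XML index also requires a clustered primary key on the table):

CREATE PRIMARY XML INDEX PXML_Documents_Data
    ON Documents (DocumentData);

-- Optional secondary index to speed up path-based (XPath-style) lookups
CREATE XML INDEX IXML_Documents_Data_Path
    ON Documents (DocumentData)
    USING XML INDEX PXML_Documents_Data
    FOR PATH;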
However, for frequent queries, you can use triggers or materialized views to create copies of the relevant data in table format, so more intensive reports can be sped up by querying the breakout tables.
I don't know your requirements, but if you are responsible for creating the reports/queries yourself, this may be an approach to use. If you need to enable users to create their own reports that's a bigger mountain to climb.
I guess what I am saying is "are you sure you need to do this, and can't XML just do the job?"
In part, it will depend on your DB engine. You're using SQL Server, aren't you?
Answering your topics:
1 - Comparing the value of a text field versus that of a varchar field: if you're comparing two db fields, varchar fields are smarter. Nvarchar(max) stores data in Unicode with 2*l+2 bytes, where "l" is the length. For performance, you will need to consider how much larger the tables will be when selecting the best way to index (or not) your table fields. See the topic.
2 - Sometimes nested queries are easily created and executed, and they can also serve as a way to reduce query time. But, depending on the complexity, it can be better to use different kinds of joins. The best way is to try both. Execute each query two or more times, since the DB engine "compiles" a query on first execution and subsequent runs are considerably faster. Measure the times for different parameters and choose the best option.
"Sometimes you can rewrite a subquery to use JOIN and achieve better performance. The advantage of creating a JOIN is that you can evaluate tables in a different order from that defined by the query. The advantage of using a subquery is that it is frequently not necessary to scan all rows from the subquery to evaluate the subquery expression. For example, an EXISTS subquery can return TRUE upon seeing the first qualifying row." - link
3 - There's not much information in this question, but if you will get the XML document directly from the table, that would be a good idea instead of a view. Again, it will depend on the view and the document.
4 - Other issues concern the total records expected for your table and the indexing of the columns, for which you need to consider sorting, joining, filtering, PKs and FKs. Each situation could demand different approaches. My suggestion is to invest some time reading about how your database engine and queries work, and relating that to your system.
I hope I've helped.
Interesting question.
I think you may be asking the wrong question here. Broadly speaking, as long as you have a FULLTEXT index on your text field, queries will be fast. Much faster than varchar if you have to use wild cards, for instance.
However, if I were you, I'd concentrate on the actual queries you're going to be running. Do you need boolean operators? Wildcards? Numerical comparisons? That's where I think you will encounter the real performance worries.
I would imagine you would need queries like:
"find all addresses in the states of New York, New Jersey and Pennsylvania"
"find all addresses between house numbers 1 and 100 on Mulberry Street"
"find all addresses where the zipcode is missing, and the city is New York"
At a high level, the solution you propose is to store your XML somewhere, and then de-normalize that XML into name/value pairs for querying.
Name/value pairs have a long and proud history, but become unwieldy in complex query situations, because you're not using the built-in optimizations and concepts of the relational database model.
Some refinements I'd recommend is to look at the domain model, and at least see if you can factor out separate data types into the "value" column; you might end up with "textValue", "moneyValue", "integerValue" and "dateValue". In the example you give, you might factor "address 1" into "housenumber" (as an integer) and "streetname".
Having said all this - I don't think there's a better solution other than completely changing tack to a document-focused database.
This question is an attempt to find a practical solution for this question.
I need a semi-schemaless design for my SQL database. However, I can limit the flexibility enough to shoehorn it into the SQL paradigm. Moving to schemaless databases might be an option in the future, but right now I'm stuck with SQL.
I have a table in a SQL database (let's call it Foo). When a row is added to it, it needs to be able to store an arbitrary number of "meta" fields with it. An example would be the ability to attach arbitrary metadata like tags, collaborators, etc. All the fields are optional, but the problem is that they're of different types. Some might be numeric, some might be textual, etc.
A simple design linking Foo to a table of OptionalValues with fields like name, value_type, value_string, value_int, value_date etc. seems direct although it descends into the whole EAV model which Alex mentions on that last answer and it looks quite wasteful. Also, I imagine queries out of this when it grows will be quite slow. I don't expect to search or sort by anything in this table though. All I need is that when I get a row out of Foo, these extra attributes should be obtainable as well.
Are there any best practices for implementing this kind of a setup in a SQL database or am I simply looking at the whole thing wrongly?
Add a string column "Metafields" to your table "Foo" and store your metadata there as an XML or JSON string.
What is the point (if any) in having a table in a database with only one row?
Note: I'm not talking about the possibility of having only one row in a table, but when a developer deliberately makes a table that is intended to always have exactly one row.
Edit:
The sales tax example is a good one.
I've just observed in some code I'm reviewing three different tables that contain three different kinds of certificates (a la SSL), each having exactly one row. I don't understand why this isn't made into one large table; I assume I'm missing something.
I've seen something like this when a developer was asked to create a configuration table to store name-value pairs of data that needs to persist without being changed often. He ended up creating a one-row table with a column for each configuration variable. I wouldn't say it's a good idea, but I can certainly see why the developer did it given his instructions. Needless to say it didn't pass review.
I've just observed in some code I'm reviewing three different tables that contain three different kinds of certificates (a la SSL), each having exactly one row. I don't understand why this isn't made into one row; I assume I'm missing something.
This doesn't sound like good design, unless there are some important details you don't know about. If there are three pieces of information that have the same constraints, the same use and the same structure, they should be stored in the same table, 99% of the time. That's a big part of what tables are for fundamentally.
For some things you only need one row - typically system configuration data. For example, "current sales tax rate". This might change in the future and so shouldn't be hardcoded, but you'll typically only ever need one at any given time. This kind of data needs to be in the database so that queries can use it in computations.
It's not necessarily a bad idea.
What if you had some global state (say, a boolean) that you wanted to store somewhere? And you wanted your stored procedures to easily access this state?
You could create a table with a primary key whose value range was limited to exactly one value.
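A minimal sketch of that idea (names are illustrative): a CHECK constraint on the primary key means the table can never hold more than one row.

CREATE TABLE global_state (
    id        INT PRIMARY KEY CHECK (id = 1),
    is_paused INT NOT NULL            -- the global flag (0/1)
);

INSERT INTO global_state (id, is_paused) VALUES (1, 0);  -- the only row there will ever be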
A single row is like a singleton class. Purpose: to control or manage some other process.
A single-row table could act as a critical section or as a deterministic automaton (a kind of dispatcher based on row values).
A single row is useful in a table like COMPANY_DESCRIPTION, to obtain consistent data about that company. Useful on company letters and addressing.
A single row is useful to hold a current value like the VAT rate, a date or a time, and so on.
It can sometimes be useful to emulate some features the database system doesn't provide. I'm thinking of sequences in MySQL, for instance.
If your database is your application, then it probably makes sense for storing configuration data that might be required by stored procedures implementing business logic.
If you have an application that could use the file system to store information, then I don't think there is an advantage to using the database over an XML or flat file, except maybe that most developers are now far more well versed in using SQL to store and retrieve data than accessing the file system.
What is the point (if any) in having a table in a database with only one row?
A relational database stores things as relations: a tuples of data satisfying some relation.
Like, this one: "a VAT of this many percent is in effect in my country now".
If only one tuple satisfies this relation, then yes, it will be the only one in the table.
SQL cannot store variables: it can store a set consisting of 1 element, this is a one-row table.
Also, SQL is a set based language, and for some operations you need a fake set of only one row, like, to select a constant expression.
You cannot just SELECT out of nothing in Oracle, you need a FROM clause.
Oracle has a pseudotable, dual, which contains only one row and only one column.
Once, long time ago, it used to have two rows (hence the name dual), but lost its second row somewhere on its way to version 7.
MySQL has this pseudotable too, but MySQL is able to do selects without FROM clause. Still, it's useful when you need an empty rowset: SELECT 1 FROM dual WHERE NULL
I've just observed in some code I'm reviewing three different tables that contain three different kinds of certificates (a la SSL), each having exactly one row. I don't understand why this isn't made into one large table; I assume I'm missing something.
It may be a kind of "have it all or lose it all" scenario, when all three certificates are needed at once:
SELECT *
FROM ssl1
CROSS JOIN
ssl2
CROSS JOIN
ssl3
If any of the certificates is missing, the whole query returns nothing.
A table with a single row can be used to store application level settings that are shared across all database users. 'Maximum Allowed Users' for example.
Funny... I asked myself the same question. If you just want to store some simple value and your ONLY method of storage is an SQL server, that's pretty much what you have to do. If I have to do this, I usually end up creating a table with several columns and one row. I've seen a couple commercial products do this as well.
We have used a single-row table in the past (not often). In our case, this table was used to store system-wide configuration values that were updatable via a web interface. We could have gone the route of a simple name/value table, but the end client preferred a single row. I personally would have preferred the latter, but it really is up to preference, especially if this table will never have any sort of relationship with another table.
I really cannot figure out why this would be the best solution. It seems more efficient to just have some kind of config file that contains the data that would be in the table's one row. Connecting to the database and querying the one row would be more costly. However, if this is going to be some kind of config for the database logic, then it would make a little more sense, depending on the type of database you are using.
I use the totally awesome rails-settings plugin for this http://github.com/Squeegy/rails-settings/tree/master
It's really easy to set up and provides a nice syntax:
Settings.admin_password = 'supersecret'
Settings.date_format = '%m %d, %Y'
Settings.cocktails = ['Martini', 'Screwdriver', 'White Russian']
Settings.foo = 123
Want a list of all the settings?
Settings.all # returns {'admin_password' => 'super_secret', 'date_format' => '%m %d, %Y'}
Set defaults for certain settings of your app. This will cause the defined settings to return with the Specified value even if they are not in the database. Make a new file in config/initializers/settings.rb with the following:
Settings.defaults[:some_setting] = 'footastic'
A use for this might be to store the current version of the database.
If one were storing database versions for schema changes it would need to reside within the database itself.
I currently analyse the schema and update accordingly but am thinking of moving to versioning. Unless someone has a better idea.
I use VB.NET and SQL Express.
Unless there are insert constraints on the table or a timestamp for versioning, this sounds like a bad idea.
There was a table set up like this in a project I inherited. It was for configuration data, and the reason that was given was that it made for very simple queries:
SELECT WidgetSize FROM ConfigTable
SELECT FooLength FROM ConfigTable
Okay fine. We converted to a generalized configuration table:
ID Name IntValue StringValue TextValue
This has served our purposes well.
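For anyone picturing it, a hedged sketch of that generalized table and the lookups it leads to (column sizes are arbitrary):

CREATE TABLE ConfigTable (
    ID          INT PRIMARY KEY,
    Name        VARCHAR(100) NOT NULL UNIQUE,
    IntValue    INT NULL,
    StringValue VARCHAR(400) NULL,
    TextValue   TEXT NULL
);

-- The old one-column-per-setting queries become one-row-per-setting lookups
SELECT IntValue FROM ConfigTable WHERE Name = 'WidgetSize';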
CREATE TABLE VERSION (VERSION_STRING VARCHAR2(20 BYTE))
?
I used a single datum in a SQLite database as a counter in a dynamic web page. That's the simplest way I can think of to make it thread-safe (or process-safe to be precise). But I am not sure whether it's a good idea.
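A minimal sketch of that counter pattern in SQLite (table name invented): a single-row table bumped with one atomic UPDATE, so concurrent requests cannot race a read-modify-write.

CREATE TABLE hit_counter (
    id   INTEGER PRIMARY KEY CHECK (id = 1),
    hits INTEGER NOT NULL
);
INSERT INTO hit_counter (id, hits) VALUES (1, 0);

UPDATE hit_counter SET hits = hits + 1 WHERE id = 1;
SELECT hits FROM hit_counter WHERE id = 1;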
I think the best way to deal with these scenarios is, rather than using a database at all, to use a configuration file (which is usually XML) or make your own configuration file that is read during start-up of the application. It only takes a few minutes to write the code to read the file in.
The advantage here is that there is no chance of accidentally adding additional values for the same XML variable, and it's great for testing because you don't need to write a lot of code to test different inputs; just make a simple change to the text value and re-run the application.