I've got an SQL table that I use to keep product data. Some products have other attached data to them (be it: books have number of pages, cover type; movies have their time-length; etc).
I could use a separate table in SQL to keep those, keeping (name, value) pairs.
I can also just keep an XML-packed data in a single field in a table. It's not a normalized approach, but seems more-natural for me.
I did a similar thing in a shopping basket application. We needed to attach meta data to the products without creating too much of a schema, which would have restricted the format of the meta-data in the future. We kept the meta-data as XML.
The only reason I would not do it is if you're going to end up performing queries on the data. Just make sure you won't have some daft person wanting reports by Publisher meta-data or something (which has happened to me) and you should be fine.
If you were intending to use XML as a way of not properly defining database tables that would indeed an architectural cop-out. I'm not sure about your scenario, it seems dangerously close to that. But key-value pairs are probably worse.
The best thing is to use a specialist XML datatype, if your database has one. In addition to RageZ's list, Oracle as had an XMLType for ten years now (since 9i). The advantage of using XMLType is two-fold. It announces to the casual observer that the documents in this column are XML. It also gives you access to built-in functionality, such as validation with XML Schemas, should you want it. Other features could prove handy if you subsequently have to start referring to the contents of the XML. For instance, Oracle's XDB supports an XML index type which can dramatically improve the performance of XPath queries.
It depends!
If you expect the 'shape' of your products to vary greatly then XML is a good way to go. [If you are using SQL Server you can index an XML field.]
I don't htink it's an architectural misconception. Just make sure you don't want to use those data in a query because it's gonna be complex.
Plus recent RDBM have function to handle XML (MSSQL, Postgres, Mysql) so you would still be able to use those data.
Related
I have multiple products which each of them may have different arrtibutes then the other products for example laptop vs t-shirt.
One of the solutions that may come to mind is to have text "specs" column in "products" table and store the products specs in it as text key/value pairs like
for example "label:laptop, RAM:8gb".
What is wrong with this approach? Why I can not find any web article that recommend it ? I mean it is not that hard to come to one's mind.
What I see on the internet are two ways to solve this problem :
1- use EAV model
2- use json
Why not just text key/value pairs as I mentioned
In SQL, a string in a primitive type and it should be used to store only a single value. That is how SQL works best -- single values in columns, rows devoted to a single entity or relationship between two tables.
Here are some reasons why you do not want to do this:
Databases have poor string processing functionality.
Queries using spec cannot (in general) be optimized using indexes or partitioning.
The strings have a lot of redundancy, because names are repeated over and over (admittedly, JSON and XML also have this "feature").
You cannot validate the data for each spec using built-in SQL functionality.
You cannot enforce the presence of particular values.
The one time when this is totally acceptable is when you don't care what is in the string -- it is there only to be returned for the application or user.
Why are you reluctant to use the solutions you mention in your question?
Text pairs (and even JSON blobs) are fine for storage and display, so long as you don't need to search on the product specifications. Searching against unstructured data in most SQL databases is slow and unreliable. That's why for variant data the EAV model typically gets used.
You can learn more about the structure by studying Normal Forms.
SQL assumes attributes are stored individually in columns, and the value in that column is to be treated as a whole value. Support for searching for rows based on some substring of a value in a column is awkward in SQL.
So you can use a string to combine all your product specs, but don't expect SQL expressions to be efficient or elegant if you want to search for one specific product spec by name or value inside that string.
If you store specs in that manner, then just use the column as a "black box." Fetch the whole product spec string from the database, after selecting the product using normal columns. Then you can explode the product spec string in your application code. You may use JSON, XML, or your own custom format. Whatever you can manipulate in application code is valid, and it's up to you.
You may also like a couple of related answers of mine:
How to design a product table for many kinds of product where each product has many parameters
Is storing a delimited list in a database column really that bad? (many of the disadvantages of using a comma-separated list are just as applicable to JSON or XML, or any other format of semi-structured data in a single column.)
After seeing some of the crazy ways developers use JSON columns in questions on Stack Overflow, I'm beginning to change my opinion that JSON or any other document-in-a-column formats are not a good idea for any relational database. They may be tolerable if you follow the "black box" principle I mention above, but too many developers then extend that and expect to query individual sub-fields within the JSON as if they are normal SQL columns. Don't do it!
I am looking at a cloud based solution which will give people the ability to enter information which is stored in a SQL database.
The benefits of my application will be that people can also change what type of information is stored (i.e an administrator would be able to add/remove certain attributes to change what data people can store).
Doing this in a relational database does work but it means the administrator would be changing the actual structure of the database which has so many risks and issues and I really don't want to go down this route.
I have thought about using XML, so one table contains two tables for example:
Template Data
columns (ID, XML) - This will contain the "Default Templates/Structure" of what people will enter which will is used when the users enter data and submit
Data Table
columns (ID, XML) - This will contain the actual data using the XML template of my first column but store the actual data in it
Does this sound like it would work and could I hit potential performance issues? A lot of the data will be searchable and could potentially have a LOT of records in the database. - I guess I could look at storing the searchable data in separate fields that the administrator can't modify.
Thanks
It is possible and if you do it a little smart it is feasible.
Contrary to Justings wroong answer you are not stuck with string manipulation and search.... if you actuall care to read the documentation.
SQL Server added a XML field type a long time ago.
This takes XML (only) and decomposes it internally and has an indexing mechanism (cech http://technet.microsoft.com/en-us/library/ms191497.aspx for details).
Queries then look like:
SELECT
EventID, EventTime,
AnnouncementValue = t1.EventXML.value('(/Event/Announcement/Value)[1]', 'decimal(10,2)'),
AnnouncementDate = t1.EventXML.value('(/Event/Announcement/Date)[1]', 'date')
FROM
dbo.T1
WHERE
t1.EventXML.exist('/Event/Indicator/Name[text() = "GDP"]') = 1
(copied from How to query xml column in tsql)
How far it gets you depends - this is heavier on the database and may have limitations, but it is a far cry from the alternative of storing strings and saying good bye to any indexing.
You can actually even add xml schemata so the data has to conform to some specific schema.
This is possible but data retrieval will suffer if you will query data based on the values in the XML string. If you will use this you're stuck with using a LIKE filter which is not recommended for searching a table with too many rows. If you will always read data using the ID column only I think this would be great.
On the other hand, if you will separate the data in the XML to several columns, you can refine the way you query data based on multiple columns. This will speed up your searches most especially if the columns are indexed.
I'm trying to implement a badge system, the badge are based on user's metadata which are subject to change.
Those metadata are variable, and are set on the fly.
Example of metadata :
commentCount
hasCompletedProfile
isActiveMember
etc. Later, I would want to add hasGravatar metadata, for this reason, I can't easily design and normalize a table.
Those data, while they are an important part of the application are not 'sensible', almost all those metadata could be re-computed that means that the integrity of the data is not part of constraints.
Currently, I know three options, even if I didn't know any of them.
EAV
Serialized Objects
XML Field (I read somewhere that it is possible to store XML in a column, and use XPATH or something to query data)
All of these options look to have pro & cons but since I've never experimented with them, I don't really know which.
Do you have any feedbacks or advices?
I'm currently working with Zend Framework & Doctrine 2 with a MySQL server
XML and Serialized Objects are both very similar as you would likely be using 1 column to store this arbitrary data. This quickly becomes very messy and difficult to easily distinguish in SQL WHERE clauses (though some DBMS have XPath support)
EAV on the otherhand will provide a separate row for every Key => Value pair you have, which you can easily extract out with a JOIN or subquery. The major downfall is that it can be a performance hit if you have a lot of data in here. Another drawback is that to keep things simple you would store all keys/values as text in the db. You could create an EAV table for every type, but it's not practically needed in most languages as what you fetch comes out as a string or can be converted there anyway. Simply storing user configuration/properties should be perfectly fine for EAV.
So you might have a table user_metadata with 3 fields:
metadata_id INTEGER
user_id INTEGER
key CHAR
value CHAR
You could then fetch this data all at once for a user:
SELECT * FROM user_metadata WHERE metadata_user_id = $user_id
Or you could fetch individual metadata along with your user data
SELECT user.*, meta_gravatar.value AS hasGravatar
FROM user
LEFT JOIN user_metadata AS meta_gravatar
ON meta_gravatar.user_id = user.user_id AND meta_gravatar.key = 'hasGravatar'
WHERE user.user_id = $user_id
EAV: It is complicated and slow. It is an example how not to use an SQL database. You cannot have an index on properties in EAV and you need some nontrivial logic to get the data from database into business logic objects. Also your SQL queries become difficult to optimize.
Serialized objects: Serialization often depends on language or platform. There is no way of having an index on some property or search anything, but it is a simple way to store data of undefined structure.
XML field: Use of a standardized representation is better than serialization. Also, there may be support for such data structures in you SQL server.
JSON field: The same as XML field, however, JSON supports primitive data types (int, bool, null) and it is faster and easier to parse and serialize than XML. Some SQL servers provide some support for it as well.
All the three ways of serialization share the same disadvantage: No indices on the properties. In most applications, it is acceptable because the data are not processed by the database anyway, so they are simply a blob for the application. The good thing is, that this blob does not complicate the database schema and operations.
There is one more way to implement such an EAV alternative: plain old SQL table. If a new property requires some change in the application code, then you can add the SQL column as well. If you have user interface and application logic to define properties at run-time, you can teach your application to use ALTER TABLE queries. Then you simply add or remove columns as you need. In the end, it will be much easier and more effective than implementing EAV, as long as you have a good query builder.
What I'm doing
I am creating an SQL table that will provide the back-end storage mechanism for complex-typed objects. I am trying to determine how to accomplish this with the best performance. I need to be able to query on each individual simple type value of the complex type (e.g. the String value of a City in an Address complex type).
I was originally thinking that I could store the complex type values in one record as an XML, but now I am concerned about the search performance of this design. I need to be able to create variable schemas on the fly without changing anything about the database access layer.
Where I'm at now
Right now I am thinking to create the following tables.
TABLE: Schemas
COLUMN NAME DATA TYPE
SchemaId uniqueidentifier
Xsd xml //contains the schema for the document of the given complex type
DeserializeType varchar(200) //The Full Type name of the C# class to which the document deserializes.
TABLE: Documents
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
SchemaId uniqueidentifier
TABLE: Values //The DocumentId+ValueXPath function as a PK
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
ValueXPath varchar(250)
Value text
from these tables, when performing queries I would do a series of self-joins on the value table. When I want to get the entire object by the DocumentId, I would have a generic script for creating a view mimics a denormalized datatable of the complex-type.
What I want to know
I believe there are better ways to accomplish what I am trying to, but I am a little too ignorant about the relative performance benefits of different SQL techniques. Specifically I don't know the performance cost of:
1 - comparing the value of a text field versus of a varchar field.
2 - different kind of joins versus nested queries
3 - getting a view versus an xml document from the sql db
4 - doing some other things that I don't even know I don't know would be affecting my query but, I am experienced enough to know exist
I would appreciate any information or resources about these performance issues in sql as well as a recommendation for how to approach this general issue in a more efficient way.
For Example,
Here's an example of what I am currently planning on doing.
I have a C# class Address which looks like
public class Address{
string Line1 {get;set;}
string Line2 {get;set;}
string City {get;set;}
string State {get;set;}
string Zip {get;set;
}
An instance is constructed from new Address{Line1="17 Mulberry Street", Line2="Apt C", City="New York", State="NY", Zip="10001"}
its XML value would be look like.
<Address>
<Line1>17 Mulberry Street</Line1>
<Line2>Apt C</Line2>
<City>New York</City>
<State>NY</State>
<Zip>10001</Zip>
</Address>
Using the db-schema from above I would have a single record in the Schemas table with an XSD definition of the address xml schema. This instance would have a uniqueidentifier (PK of the Documents table) which is assigned to the SchemaId of the Address record in the Schemas table. There would then be five records in the Values table to represent this Address.
They would look like:
DocumentId ValueXPath Value
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line1 17 Mulberry Street
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line2 Apt C
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/City New York
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/State NY
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Zip 10001
Just Added a Bounty...
My objective is to obtain the resources I need in order to give my application a data access layer that is fully searchable and has a data-schema generated from the application layer that does not require direct database configuration (i.e. creating a new SQL table) in order to add a new aggregate root to the domain model.
I am open to the possibility of using .NET compatible technologies other than SQL, but I will require that any such suggestions be adequately substantiated in order to be considered.
How about looking for a solution at the architectural level? I was also breaking my head on complex graphs and performance until I discovered CQRS.
[start evangelist mode]
You can go document-based or relational as storage. Even both! (Event Sourcing)
Nice separation of concerns: Read Model vs Write Model
Have your cake and eat it too!
Ok, there is an initial learning / technical curve to get over ;)
[end evangelist mode]
As you stated: "I need to be able to create variable schemas on the fly without changing anything about the database access layer." The key benefit is that your read model can be very fast since it's made for reading. If you add Event Sourcing to the mix, you can drop and rebuild your Read Model to whatever schema you want... even "online".
There are some nice opensource frameworks out there like nServiceBus which saves lots of time and technical challenges. All depends on how far you want to take these concepts what you're willing/can spend time on. You can even start with just basics if you follow Greg Young's approach. See the info in the links below.
See
CQRS Examples and Screencasts
CQRS Questions
Intro (Also see the video)
Somehow what you want sounds like a painful thing to do in SQL. Basically, you should treat the inside of a text field as opaque as when querying an SQL database. Text fields were not made for efficient queries.
If you just want to store serialized objects in a text field, that is fine. But do not try to build queries that look inside the text field to find objects.
Your idea sounds like you want to perform some joins, XML parsing, and XPath application to get to a value. This doesn't strike me as the most efficient thing to do.
So, my advise:
Either just store serialized objects in the db, and do nothing more than load them and perform all other operations in memory
Or, if you need to query complex data structures, you may really want to look into document stores/databases like CouchDB or MongoDB; you can also check Wikipedia on the subject. There are even databases specifically designed for storing XML, even though I personally don't like them very much.
Addendum, per your explanations above
Simply put, don't go over the top with this thing:
If you just want to persist C#/.NET objects, just use the XML Serialization already built into the framework, a single table and be done with it.
If you, for some reason, need to store complex XML, use a dedicated XML store
If you have a fixed database schema, but it is too complex for efficient queries, use a Document Store in memory where you keep a denormalized version of your data for faster queries (or just simplify your database schema)
If you don't really need a fixed schema, use just a Document Store, and forget about having any "schema definition" at all
As for your solution, yes, it could work somehow. As could a plain SQL schema if you set it up right. But for applying an XPath, you'll probably parse the whole XML document each time you access a record, which wouldn't be very efficient to begin with.
If you want to check out Document databases, there are .NET drivers for CouchDB and MongoDB. The eXist XML database offers a number of Web protocols, and you can probably create a client class easily with VisualStudio's point-and-shoot interface. Or just google for someone who already did.
I need to be able to create variable
schemas on the fly without changing
anything about the database access
layer.
You are re-implementing the RDBMS within an RDBMS. The DB can do this already - that is what the DDL statements like create table and create schema are for....
I suggest you look into "schemas" and SQL security. There is no reason with the correct security setup you cannot allow your users to create their own tables to store document attributes in, or even generate them automatically.
Edit:
Slightly longer answer, if you don't have full requirements immediately, I would store the data as XML data type, and query them using XPath queries. This will be OK for occasional queries over smallish numbers of rows (fewer than a few thousand, certainly).
Also, your RDBMS may support indexes over XML, which may be another way of solving your problem. CREATE XML INDEX in SqlServer 2008 for example.
However for frequent queries, you can use triggers or materialized views to create copies of relevant data in table format, so more intensive reports can be speeded up by querying the breakout tables.
I don't know your requirements, but if you are responsible for creating the reports/queries yourself, this may be an approach to use. If you need to enable users to create their own reports that's a bigger mountain to climb.
I guess what i am saying is "are you sure you need to do this and XML can't just do the job".
In part, it will depend of your DB Engine. You're using SQL Server, don't you?
Answering your topics:
1 - Comparing the value of a text field versus of a varchar field: if you're comparing two db fields, varchar fields are smarter. Nvarchar(max) stores data in unicode with 2*l+2 bytes, where "l" is the lengh. For performance issues, you will need consider how much larger tables will be, for selecting the best way to index (or not) your table fields. See the topic.
2 - Sometimes nested queries are easily created and executed, also serving as a way to reduce query time. But, depending of the complexity, would be better to use different kind of joins. The best way is try to do in both ways. Execute two or more times each query, for the DB engine "compiles" a query on first executing, then the subsequent are quite faster. Measure the times for different parameters and choose the best option.
"Sometimes you can rewrite a subquery to use JOIN and achieve better performance. The advantage of creating a JOIN is that you can evaluate tables in a different order from that defined by the query. The advantage of using a subquery is that it is frequently not necessary to scan all rows from the subquery to evaluate the subquery expression. For example, an EXISTS subquery can return TRUE upon seeing the first qualifying row." - link
3- There's no much information in this question, but if you will get the xml document directly from the table, would be a good idea insted a view. Again, it will depends of the view and the document.
4- Other issues is about the total records expected for your table; the indexing of the columns, in wich you need to consider sorting, joining, filtering, PK's and FK's. Each situation could demmand different aproaches. My sugestion is to invest some time reading about your database engine and queries functioning and relating to your system.
I hope I've helped.
Interesting question.
I think you may be asking the wrong question here. Broadly speaking, as long as you have a FULLTEXT index on your text field, queries will be fast. Much faster than varchar if you have to use wild cards, for instance.
However, if I were you, I'd concentrate on the actual queries you're going to be running. Do you need boolean operators? Wildcards? Numerical comparisons? That's where I think you will encounter the real performance worries.
I would imagine you would need queries like:
"find all addresses in the states of New York, New Jersey and Pennsylvania"
"find all addresses between house numbers 1 and 100 on Mulberry Street"
"find all addresses where the zipcode is missing, and the city is New York"
At a high level, the solution you propose is to store your XML somewhere, and then de-normalize that XML into name/value pairs for querying.
Name/value pairs have a long and proud history, but become unwieldy in complex query situations, because you're not using the built-in optimizations and concepts of the relational database model.
Some refinements I'd recommend is to look at the domain model, and at least see if you can factor out separate data types into the "value" column; you might end up with "textValue", "moneyValue", "integerValue" and "dateValue". In the example you give, you might factor "address 1" into "housenumber" (as an integer) and "streetname".
Having said all this - I don't think there's a better solution other than completely changing tack to a document-focused database.
This question is an attempt to find a practical solution for this question.
I need a semi-schema less design for my SQL database. However, I can limit the flexibility to shoehorn it into the entire SQL paradigm. Moving to schema less databases might be an option in the future but right now, I' stuck with SQL.
I have a table in a SQL database (let's call it Foo). When an row is added to this, it needs to be able to store an arbitrary number of "meta" fields with this. An example would be the ability to attach arbitrary metadata like tags, collaborators etc. All the fields are optional but the problem is that they're of different types. Some might be numeric, some might be textual etc.
A simple design linking Foo to a table of OptionalValues with fields like name, value_type, value_string, value_int, value_date etc. seems direct although it descends into the whole EAV model which Alex mentions on that last answer and it looks quite wasteful. Also, I imagine queries out of this when it grows will be quite slow. I don't expect to search or sort by anything in this table though. All I need is that when I get a row out of Foo, these extra attributes should be obtainable as well.
Are there any best practices for implementing this kind of a setup in a SQL database or am I simply looking at the whole thing wrongly?
Add a string column "Metafields" to your table "Foo" and store your metadata there as an XML or JSON string.