Should I Put An XML Blob Into A Separate Table?

I'm designing a transactional table which will have a lot of records. It will have a lot of reads and writes.
There will be one point at which the user uploads an XML file, and I store this in a database column of type XML.
For a given transactional record, this XML will not be needed as often as everything else. It will probably only get read a couple of times, and will usually just get inserted and not updated.
I'm wondering whether there is any advantage in storing this XML field in a separate table. Then, I can just join to it only when I need it. The only advantage that I perceive is that the individual records on the "main" table will take up less space. But, if my table is properly indexed, does that really even matter?
I suspect that I'm overthinking this and being premature with my optimization. Should I just leave the XML field on the main table?
One sample XML file I have is 12KB. I don't expect it to get much larger than that. I'm not sure if SQL Server's XML data type would store the information more efficiently than that.
To clarify, it is a one-to-one relationship. There will be one XML blob for every transaction. There won't be one XML blob for multiple transactions. And every transaction should eventually get an XML blob, even if it's not immediate.
Thanks,
Tedderz

The answer is that there is no need for you to modify or otherwise compromise your Logical data design to accommodate this Physical storage consideration.
This is because in SQL Server, XML is a "Large Value Type", and you can control whether these are Physically stored in-row or out-of-row through the use of the 'large value types out of row' option in the sp_tableoption system procedure, like so:
EXEC sys.sp_tableoption N'MyTable', 'large value types out of row', 'ON'
If you leave it OFF, then XML values less than 8000 bytes will be stored in-row. If you set it to ON, then all XML values (and [N]Varchar(MAX) columns) will be stored out of the table in a separate area. (This is all explained in detail here: http://technet.microsoft.com/en-us/library/ms189087(SQL.105).aspx)
Which setting to choose is hard to say, but generally: if you expect to retrieve or modify this column a lot, then I'd recommend keeping it in-row. Otherwise, store it out of row.

If your XML is rather large, and there are quite a few use cases where you don't need that information in your queries - then it could make sense to put it into a separate table - even if there's a 1:1 relationship in place.
The motivation here is this: if your "base" table is smaller, e.g. doesn't contain the XML blob, and you often query your table without needing to retrieve the XML, then this smaller row size can lead to much better performance on the base table (since more rows fit on a page, and thus SQL Server would need to load fewer pages to satisfy some of your queries).
Also: if that XML only exists in a small number of cases (e.g. only 10-20% of your rows actually have an XML blob), that might also be a factor that would work in favor of "outsourcing" the XML blob to a separate table.
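For illustration, a 1:1 split along these lines might look like the sketch below. All table and column names are hypothetical, and whether it actually helps depends on row sizes and query patterns, so treat it as a starting point to measure rather than a recommendation.

```sql
-- Hypothetical names throughout; adjust to your own schema.
CREATE TABLE dbo.TransactionRecord
(
    TransactionId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_TransactionRecord PRIMARY KEY CLUSTERED,
    CreatedAt     datetime2(0)  NOT NULL,
    Amount        decimal(18,2) NOT NULL
);

-- The primary key is also the foreign key, so each transaction
-- can have at most one XML row (a 1:1 relationship).
CREATE TABLE dbo.TransactionXml
(
    TransactionId int NOT NULL
        CONSTRAINT PK_TransactionXml PRIMARY KEY CLUSTERED
        CONSTRAINT FK_TransactionXml_TransactionRecord
            REFERENCES dbo.TransactionRecord (TransactionId)
            ON DELETE CASCADE,
    Payload       xml NOT NULL
);

-- Frequent queries read only the narrow base table.
SELECT TransactionId, CreatedAt, Amount
FROM dbo.TransactionRecord
WHERE CreatedAt >= '20240101';

-- Join to the side table only when the XML is actually needed.
-- LEFT JOIN because the XML may not have been uploaded yet.
SELECT t.TransactionId, x.Payload
FROM dbo.TransactionRecord AS t
LEFT JOIN dbo.TransactionXml AS x
    ON x.TransactionId = t.TransactionId
WHERE t.TransactionId = 42;
```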

No, you should not. If there is a one-to-one relationship, it belongs in the same table. Joins are expensive.

Related

XML vs Relational Database

I am looking at a cloud based solution which will give people the ability to enter information which is stored in a SQL database.
The benefits of my application will be that people can also change what type of information is stored (i.e an administrator would be able to add/remove certain attributes to change what data people can store).
Doing this in a relational database does work but it means the administrator would be changing the actual structure of the database which has so many risks and issues and I really don't want to go down this route.
I have thought about using XML, so the database would contain two tables, for example:
Template Data
columns (ID, XML) - This will contain the "Default Templates/Structure" of what people will enter, which is used when the users enter and submit data
Data Table
columns (ID, XML) - This will contain the actual data, stored as XML that follows the template from the first table
Does this sound like it would work and could I hit potential performance issues? A lot of the data will be searchable and could potentially have a LOT of records in the database. - I guess I could look at storing the searchable data in separate fields that the administrator can't modify.
Thanks

It is possible and if you do it a little smart it is feasible.
Contrary to Justin's wrong answer, you are not stuck with string manipulation and search... if you actually care to read the documentation.
SQL Server added an XML field type a long time ago.
This takes XML (only), decomposes it internally, and has an indexing mechanism (see http://technet.microsoft.com/en-us/library/ms191497.aspx for details).
Queries then look like:
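SELECT
    t1.EventID,
    t1.EventTime,
    -- value() extracts a single scalar from the XML, cast to the given SQL type
    AnnouncementValue = t1.EventXML.value('(/Event/Announcement/Value)[1]', 'decimal(10,2)'),
    AnnouncementDate = t1.EventXML.value('(/Event/Announcement/Date)[1]', 'date')
FROM
    dbo.T1 AS t1
WHERE
    -- exist() returns 1 when the XPath matches at least one node
    t1.EventXML.exist('/Event/Indicator/Name[text() = "GDP"]') = 1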
(copied from How to query xml column in tsql)
How far it gets you depends - this is heavier on the database and may have limitations, but it is a far cry from the alternative of storing strings and saying good bye to any indexing.
You can actually even add xml schemata so the data has to conform to some specific schema.
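As a rough sketch of the indexing mechanism mentioned above, an XML column can be given a primary XML index and, optionally, secondary ones. The index names here are made up, and the example assumes dbo.T1 already has a clustered primary key, which SQL Server requires before a primary XML index can be created.

```sql
-- Assumption: dbo.T1 has a clustered primary key (required for XML indexes).
-- Index names are hypothetical.
CREATE PRIMARY XML INDEX PXML_T1_EventXML
    ON dbo.T1 (EventXML);

-- Optional secondary index intended to help path-based predicates such as exist().
CREATE XML INDEX IXML_T1_EventXML_Path
    ON dbo.T1 (EventXML)
    USING XML INDEX PXML_T1_EventXML
    FOR PATH;
```

XML indexes add storage and write overhead, so it is worth comparing query plans with and without them before committing.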

This is possible, but data retrieval will suffer if you query data based on the values in the XML string. If you do this, you're stuck with using a LIKE filter, which is not recommended for searching a table with many rows. If you will always read data using the ID column only, I think this would be great.
On the other hand, if you separate the data in the XML into several columns, you can refine the way you query data based on multiple columns. This will speed up your searches, especially if the columns are indexed.

Adding information, xml column or new table?

We want to extend our database to create Multilanguage support but we are unsure how to do this.
Our database looks like this:
ID – Name – Description – (a lot of irrelevant columns)
Option 1 is to add an xml column to the table, in this column we can store the information we need like this:
<translation>
  <language value='en'>
    <Name value='' />
    <Description value='' />
  </language>
  <language value='fr'>
    <Name value='' />
    <Description value='' />
  </language>
</translation>
Does the trick and the advantage is that when I delete the row, I also delete the translations.
Option 2 is to add an extra table, it’s easy to create a table to store the information in, but it requires inner joins when getting the information and more effort to delete rows when the original row is deleted.
What is the preferred option in this case? Or are there other good solutions for this?

I'd recommend the "relational" approach, i.e. separate translation table(s). Consider doing it like this:
This model has some nice properties:
For each multi-lingual table, create a separate translation table. This way, you can use the fields appropriate for that particular table, and the translation cannot be "misconnected" to the wrong table.
The existence of the LANGUAGE table and the associated FOREIGN KEYs, ensures that a translation cannot exist for non-existent language, unlike the XML.
ON DELETE CASCADE referential action will ensure no "orphaned" translation can be left behind when a language is removed, unlike the XML.
While XML may be faster in simpler cases, I suspect JOIN is more scalable when the number of languages grows (see footnote 1). In any case, measure the difference and decide for yourself if it's significant enough.
Separate fields such as NAME and DESCRIPTION may be easier to index. With XML, you'd probably need a DBMS with special support for XML, or possibly some sort of full-text index.
Fields such as NAME and DESCRIPTION will likely be just regular VARCHARs. OTOH, putting them together may produce XML too large for a regular VARCHAR, forcing you to use a CLOB/BLOB, which may have its own performance complications.
If your DBMS supports clustering (see below), the whole translation table can be stored in a single B-Tree. XML has a lot of redundant data (opening and closing tags), likely making it larger and less cache-friendly than the B-Tree (even when we count-in all the associated overheads).
You'll notice that the model above uses identifying relationships and the resulting PK: {LANGUAGE_ID, TABLEx_ID} can be used for clustering (so the translations that belong to the same language are stored physically close together in the database). As long as you have a few predominant (or "hot") languages, this should be OK - the caching is done at the database page level, so avoiding mixing "hot" and "cold" data in the same page avoids caching "cold" data (and making the cache "smaller").
OTOH, if you routinely need to query for many languages, consider flipping the clustering key order to: {TABLEx_ID, LANGUAGE_ID}, so all the translations of the same row are stored physically close together in the database. Once you retrieve one translation, other translations of the same row are probably already cached. Or, if you want to extract multiple translations in the single query, you could do it with less I/O.
Footnote 1: We can JOIN just to the translation in the desired language. With XML, you must load (and parse) the whole XML before deciding to use only the small portion of it that pertains to the desired language. Whenever you add a new language (and the associated translations to the XML), it slows down the processing of existing rows even if you rarely use the new language.
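One possible reading of the model described above, sketched in T-SQL. The table ITEM stands in for "TABLEx", and every name, type, and length here is an assumption made for illustration rather than the answer's original diagram.

```sql
-- All names and column sizes are hypothetical.
-- LANGUAGE is bracketed to avoid any clash with the LANGUAGE keyword.
CREATE TABLE dbo.[LANGUAGE]
(
    LANGUAGE_ID char(2) NOT NULL
        CONSTRAINT PK_LANGUAGE PRIMARY KEY
);

CREATE TABLE dbo.ITEM
(
    ITEM_ID int NOT NULL
        CONSTRAINT PK_ITEM PRIMARY KEY
);

CREATE TABLE dbo.ITEM_TRANSLATION
(
    LANGUAGE_ID char(2) NOT NULL
        CONSTRAINT FK_ITEM_TRANSLATION_LANGUAGE
            REFERENCES dbo.[LANGUAGE] (LANGUAGE_ID)
            ON DELETE CASCADE,   -- removing a language removes its translations
    ITEM_ID     int NOT NULL
        CONSTRAINT FK_ITEM_TRANSLATION_ITEM
            REFERENCES dbo.ITEM (ITEM_ID)
            ON DELETE CASCADE,   -- removing an item removes its translations
    NAME        nvarchar(100)  NOT NULL,
    DESCRIPTION nvarchar(1000) NULL,
    -- Clustering on {LANGUAGE_ID, ITEM_ID} keeps one language's rows together.
    -- Flip the column order if you usually read all languages of one item.
    CONSTRAINT PK_ITEM_TRANSLATION
        PRIMARY KEY CLUSTERED (LANGUAGE_ID, ITEM_ID)
);

-- Join only to the translation in the desired language.
SELECT i.ITEM_ID, t.NAME, t.DESCRIPTION
FROM dbo.ITEM AS i
JOIN dbo.ITEM_TRANSLATION AS t
    ON t.ITEM_ID = i.ITEM_ID
   AND t.LANGUAGE_ID = 'fr';
```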

How to handle SQL Server XML stored procedure parameters when table valued parameters (TVP's) are unavailable?

When you don't have SQL Server 2008 to play with (TVP's), the advantage of passing in a XML parameter into a SPROC is that if your parameter requirements change, you don't have to recompile/etc. your app to comply.
I'm of the notion that keeping the data as XML in a table field isn't the best idea, that the sproc should then parse the incoming XML and populate the relevant fields in the table. However, we can also easily do SELECT / filter queries on XML contained within a field in a table.
What kind of latency is introduced in parsing out the XML and populating the appropriate fields, and is anything gained by doing this?
In a high-traffic environment, which is the best policy?

It depends. This is a very broad question. Usually a lot of processing is required for parsing the XML (without XML indexes in this case). Is your high-traffic workload insert-heavy or select-heavy?
There are potentially a lot of gains in parsing into a table, if they are applicable. If you do not need to query it and you are storing XML to later return it in the same form, then there is less to gain.
It definitely depends on the rest of the data in the table and how much time you have for inserting/updating vs selecting and other maintenance. A good medium may be to insert the XML with an index.
There is no best policy that I am aware of but I would aim to parse it at insert time and store it in separate fields with the benefits of referential integrity, indexes, faster queries and reduced storage costs, to name a few.
You can use query plans and wait stats when comparing different approaches.
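A minimal sketch of the "parse at insert time" approach, using the xml type's nodes() and value() methods. The procedure name, the target table dbo.OrderLine, and the XML shape are all assumptions made for the example; the target table is assumed to exist already.

```sql
-- Hypothetical procedure; assumes a table dbo.OrderLine (OrderId, Sku, Quantity) exists.
CREATE PROCEDURE dbo.SaveOrderLines
    @Lines xml
AS
BEGIN
    SET NOCOUNT ON;

    -- nodes() yields one row per <Line> element;
    -- value() pulls each attribute out as a typed column.
    INSERT INTO dbo.OrderLine (OrderId, Sku, Quantity)
    SELECT
        n.line.value('@orderId', 'int'),
        n.line.value('@sku', 'varchar(50)'),
        n.line.value('@qty', 'int')
    FROM @Lines.nodes('/Lines/Line') AS n(line);
END;
GO

-- Example call with an assumed XML shape.
EXEC dbo.SaveOrderLines
    @Lines = N'<Lines><Line orderId="1" sku="ABC" qty="2" /><Line orderId="1" sku="XYZ" qty="5" /></Lines>';
```

Because the XML shape is interpreted inside the procedure, new attributes can be handled by changing the procedure alone, which is the flexibility the question describes.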

MySQL Table with TEXT column

I've been working on a database and I have to deal with a TEXT field.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table(putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
Some research revealed this, suggesting that
Separate text/blobs from metadata, don't put text/blobs in results if you don't need them.
However, I am not familiar with the definition of "metadata" being used here.
So I wonder if there are any relevant advantages in putting a TEXT column in a table of its own. What are the potential problems of having it with the rest of the fields? And potential problems of keeping it in a separated table?
This table(without the TEXT field) is supposed to be searched(SELECTed) rather frequently. Is "premature optimization considered evil" important here? (If there really is a penalty in TEXT columns, how relevant is it, considering it is fairly easy to change this later if needed).
Besides, are there any good links on this topic? (Perhaps stackoverflow questions&answers? I've tried to search this topic but I only found TEXT vs VARCHAR discussions)

Yep, it seems you've misinterpreted the meaning of the sentence. What it says is that you should only do a SELECT including a TEXT field if you really need the contents of that field. This is because TEXT/BLOB columns can contain huge amounts of data which would need to be delivered to your application - this takes time and of course resources.
Best wishes,
Fabian

This is probably premature optimisation. Performance tuning MySQL is really tricky and can only be done with real performance data for your application. I've seen plenty of attempts to second guess what makes MySQL slow without real data and the result each time has been a messy schema and complex code which will actually make performance tuning harder later on.
Start with a simple, normalised schema, then when something proves too slow, add complexity only where and if needed.
As others have pointed out, the quote you mentioned is more applicable to query results than to the schema definition. In any case, your choice of storage engine would affect the validity of the advice.
If you do find yourself needing to add the complexity of moving TEXT/BLOB columns to a separate table, then it's probably worth considering the option of moving them out of the database altogether. Often file storage has advantages over database storage especially if you don't do any relational queries on the contents of the TEXT/BLOB column.
Basically, get some data before taking any MySQL tuning advice you get on the Internet, including this!

The data for a TEXT column is already stored separately. Whenever you SELECT * from a table with text column(s), each row in the result-set requires a lookup into the text storage area. This coupled with the very real possibility of huge amounts of data would be a big overhead to your system.
Moving the column to another table simply requires an additional lookup, one into the secondary table, and the normal one into the text storage area.
The only time that moving TEXT columns into another table will offer any benefit is if there is a tendency to select all columns from tables. This merely introduces a second bad practice to compensate for the first. It should go without saying that two wrongs are not the same as three lefts.

The concern is that a large text field—like way over 8,192 bytes—will cause excessive paging and/or file i/o during complex queries on unindexed fields. In such cases, it's better to migrate the large field to another table and replace it with the new table's row id or index (which would then be metadata since it doesn't actually contain data).
The disadvantages are:
a) More complicated schema
b) If the large field is usually inspected or retrieved, there is no advantage
c) Ensuring data consistency is more complicated and a potential source of database malaise.

There might be some good reasons to separate a text field out of your table definition. For instance, if you are using an ORM that loads the complete record no matter what, you might want to create a properties table to hold the text field so it doesn't load all the time. However, if you control the code 100%, then for simplicity leave the field on the table and only select it when you need it, to cut down on data transfer and reading time.

Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table(putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
You probably saw this, from the MySQL manual
http://dev.mysql.com/doc/refman/5.5/en/optimize-character.html
If a table contains string columns such as name and address, but many queries do not retrieve those columns, consider splitting the string columns into a separate table and using join queries with a foreign key when necessary. When MySQL retrieves any value from a row, it reads a data block containing all the columns of that row (and possibly other adjacent rows). Keeping each row small, with only the most frequently used columns, allows more rows to fit in each data block. Such compact tables reduce disk I/O and memory usage for common queries.
Which indeed tells you that in MySQL you are discouraged from keeping TEXT data (and BLOB, as written elsewhere) in frequently searched tables.

do i need a separate table for nvarchar(max) descriptions

At one of my previous companies we used to have a separate table where we stored long descriptions in a text type column. I think this was done because of the limitations that come with the text type.
I'm now designing the tables for the existing application that I am working on, and this question comes to my mind. I am leaning towards storing the long description of my items on the same item table in a varchar(max) column. I understand that I cannot index this column, but that is OK as I will not be doing searches on these columns.
So far I cannot see any reason to separate this column to another table.
Can you please tell me if I am missing something, or whether storing my descriptions on the same table in a varchar(max) column is a good approach? Thanks!

Keep the fields in the table where they belong. Since SQL Server 2005 the engine got a lot smarter in regard to large data types and even variable length short data types. The old TEXT, NTEXT and IMAGE types are deprecated. The new types with MAX length are their replacement. With SQL 2005 each partition has 3 types of underlying allocation units: one for rows, one for LOBs and one for row-overflow. The MAX types are stored in the LOB allocation unit, so in effect the engine is managing for you a separate table to store large objects. The row overflow unit is for in-row variable length data that after an update would no longer fit in the page, so it is 'overflown' into a separate unit.
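To see the allocation units described above for a given table, a query along these lines against the catalog views can help. The table name dbo.Item is a placeholder; the join conditions follow the documented relationship between sys.allocation_units.container_id and sys.partitions.

```sql
-- dbo.Item is a hypothetical table name; substitute your own.
SELECT
    p.index_id,
    p.partition_number,
    au.type_desc,      -- IN_ROW_DATA, LOB_DATA or ROW_OVERFLOW_DATA
    au.total_pages,
    au.used_pages
FROM sys.partitions AS p
JOIN sys.allocation_units AS au
    -- in-row (1) and row-overflow (3) units are keyed by hobt_id,
    -- LOB units (2) by partition_id
    ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
    OR (au.type = 2 AND au.container_id = p.partition_id)
WHERE p.object_id = OBJECT_ID(N'dbo.Item');
```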
See Table and Index Organization.

It depends on how often you use them, but yes, you may want them in a separate table. Before you make the decision, you'll want to read up on SQL file paging, page splits, and the details of "how" SQL stores the data.
The short answer is that varchar(max) can definitely cause a decrease in performance where those field lengths change a lot, due to an increase in page splits, which are expensive operations.
