Do I need a separate table for nvarchar(max) descriptions? - SQL

At a previous company we had a separate table where we stored long descriptions in a text-type column. I think this was done because of the limitations that come with the text type.
I'm now designing the tables for the application that I am working on and this question comes to mind. I am leaning towards storing the long description of my items in the same item table, in a varchar(max) column. I understand that I cannot index this column, but that is OK as I will not be searching on these columns.
So far I cannot see any reason to separate this column to another table.
Can you please give me input on whether I am missing something, or whether storing my descriptions in the same table in a varchar(max) column is a good approach? Thanks!

Keep the fields in the table where they belong. Since SQL Server 2005 the engine has become a lot smarter about large data types, and even about variable-length short data types. The old TEXT, NTEXT and IMAGE types are deprecated; the new MAX-length types are their replacement. Since SQL Server 2005, each partition has three types of underlying allocation units: one for rows, one for LOBs and one for row-overflow data. The MAX types are stored in the LOB allocation unit, so in effect the engine is already managing a separate table for you in which to store large objects. The row-overflow unit is for in-row variable-length data that no longer fits in the page after an update, so it is 'overflown' into a separate unit.
See Table and Index Organization.
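If you want to see these allocation units for yourself, a query along these lines lists them per table (the table name 'Item' is only a placeholder for your own item table):

SELECT o.name AS table_name,
       au.type_desc,      -- IN_ROW_DATA, LOB_DATA or ROW_OVERFLOW_DATA
       au.total_pages,
       au.used_pages
FROM sys.objects AS o
JOIN sys.partitions AS p
    ON p.object_id = o.object_id
JOIN sys.allocation_units AS au
    ON au.container_id = CASE WHEN au.type IN (1, 3) THEN p.hobt_id ELSE p.partition_id END
WHERE o.name = 'Item';

A table with a varchar(max) column will typically show a LOB_DATA allocation unit alongside the IN_ROW_DATA one, which is the "separate table" the engine manages for you.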

It depends on how often you use them, but yes, you may want them in a separate table. Before you make the decision, you'll want to read up on SQL Server pages, page splits, and the details of how SQL Server stores the data.
The short answer is that varchar(max) can definitely cause a decrease in performance where those field lengths change a lot, due to an increase in page splits, which are expensive operations.
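If you want to check whether page splits are actually hurting you, index fragmentation is the usual symptom; a query roughly like the following reports it (the table name dbo.Items is hypothetical):

SELECT ips.index_id,
       ips.avg_fragmentation_in_percent,   -- high values often follow heavy page splitting
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Items'), NULL, NULL, 'LIMITED') AS ips;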

Related

Sparse column size limitation workaround

I'm using SQL Server 2014. I'm creating multiple tables, always with more than 500 columns, and the exact number of columns varies.
So I created sparse columns, so that I could be sure there wouldn't be a problem if the number of columns exceeded 1,024. Now there is a new problem:
Cannot create a row that has sparse data of size 8710 which is greater
than the allowable maximum sparse data size of 8023.
I know SQL Server allows only 8 KB of data in a row; I need to know what the workaround for this is. If I have to plan a move to NoSQL (MongoDB), how much impact will it have on converting my stored procedures?
Maximum number of columns in an ordinary table is 1024. Maximum number of columns in a wide (sparse) table is 30,000. Sparse columns are usually used when you have a lot of columns, but most of them are NULL.
In any case, there is a limit of 8,060 bytes per row, so sparse columns won't help.
Often, having a thousand columns in a table indicates that there are problems with the database design and normalisation.
If you are sure that you need these thousand values as columns, not as rows in a related table, then the only workaround that comes to mind is to partition the table vertically.
For example, you have a Table1 with column ID (which is the primary key) and 1,000 other columns. Split it into Table1 and Table2. Each will have the same ID as its primary key and 500 columns each. The tables would be linked 1:1 using a foreign key constraint.
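A rough sketch of that vertical split (column names and types are only placeholders) could look like this:

CREATE TABLE dbo.Table1 (
    ID int NOT NULL PRIMARY KEY,
    Col001 varchar(50) NULL,
    -- ... columns up to Col500 ...
    Col500 varchar(50) NULL
);

CREATE TABLE dbo.Table2 (
    ID int NOT NULL PRIMARY KEY
        CONSTRAINT FK_Table2_Table1 FOREIGN KEY REFERENCES dbo.Table1 (ID),
    Col501 varchar(50) NULL,
    -- ... columns up to Col1000 ...
    Col1000 varchar(50) NULL
);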
The data types that are used, and how much of the data in a row is NULL, determine the effectiveness of sparse columns. If all of the fields in a table are populated there is actually more overhead in storing those rows, and that overhead will cause you to hit the maximum page size faster. If that is the case then don't use sparse columns.
See how many columns you can convert from fixed-length to variable-length data types (varchar, nvarchar, varbinary). This might buy you some additional space in the page, as variable-length fields can be pushed into overflow pages, but they carry an overhead of 24 bytes for the pointer into the overflow page. I suspect you were thinking that sparse columns were going to let you store 30,000 columns... that only applies to a wide table where most of the columns are NULL.
MongoDB will not be your answer...at least not without a lot of refactoring. You will not be able to leverage your existing stored procedures. It might be the best fit for you but there are many things to consider when moving to MongoDB. Your data access layer will need to be rebuilt unless you just happen to be persisting your data in the relational structure as JSON documents :). I assume that is not the case.
I am assuming that you have wide tables and they are densely populated...based on that assumption here is my recommendation.
Partition the table as Vladimir suggested, but create a view that joins all these tables together to make it look like one table. Now you have the same structure as you did before. Then add an INSTEAD OF trigger to the view to update the underlying tables. This way you can get what you want without having to do major refactoring of your code. There is code you need to write for the trigger, but my experience has been that it's easy to write, and most of the time I didn't write it by hand but created a script to generate it for all the views, since it was repetitive.
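A minimal sketch of the view-plus-trigger idea, assuming the Table1/Table2 split described above and purely illustrative column names:

CREATE VIEW dbo.BigTable_All
AS
SELECT t1.ID,
       t1.Col001,   -- ... columns from the first half ...
       t2.Col501    -- ... columns from the second half ...
FROM dbo.Table1 AS t1
JOIN dbo.Table2 AS t2 ON t2.ID = t1.ID;
GO

CREATE TRIGGER dbo.trg_BigTable_All_Update
ON dbo.BigTable_All
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Route each updated column to the table that actually holds it
    UPDATE t1 SET Col001 = i.Col001
    FROM dbo.Table1 AS t1 JOIN inserted AS i ON i.ID = t1.ID;

    UPDATE t2 SET Col501 = i.Col501
    FROM dbo.Table2 AS t2 JOIN inserted AS i ON i.ID = t2.ID;
END;

Queries and updates can then keep targeting dbo.BigTable_All as if the monolithic table still existed.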

SQL Datatype Varchar(max) Vs Varchar(8000) [duplicate]

Is there any problem with making all your Sql Server 2008 string columns varchar(max)? My allowable string sizes are managed by the application. The database should just persist what I give it. Will I take a performance hit by declaring all string columns to be of type varchar(max) in Sql Server 2008, no matter what the size of the data that actually goes into them?
By using VARCHAR(MAX) you are basically telling SQL Server "store the values in this field how you see best", and SQL Server will then choose whether to store values as a regular VARCHAR or as a LOB (large object). In general, if the values stored are less than 8,000 bytes, SQL Server will treat them as a regular VARCHAR type.
If the values stored are too large, then the column is allowed to spill off the page into LOB pages, exactly as it does for the other LOB types (text, ntext and image). If this happens then additional page reads are required to read the data stored in those extra pages (i.e. there is a performance penalty), but it only happens if the values stored are too large.
In fact, under SQL Server 2008 or later, data can overflow onto additional pages even with the bounded variable-length data types (e.g. VARCHAR(3000)); however, those pages are called row-overflow data pages and are treated slightly differently.
Short version: from a storage perspective there is no disadvantage of using VARCHAR(MAX) over VARCHAR(N) for some N.
(Note that this also applies to the other variable-length field types NVARCHAR and VARBINARY)
FYI - You can't create indexes on VARCHAR(MAX) columns
For one thing, index keys cannot be over 900 bytes wide, so you can probably never create an index on it anyway. If your data is less than 900 bytes, use varchar(900).
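To illustrate with a hypothetical table: the bounded column can be an index key column, the MAX column cannot.

CREATE TABLE dbo.Demo (
    ID int IDENTITY PRIMARY KEY,
    BigText varchar(max) NULL,
    ShortText varchar(900) NULL
);

-- Fails: varchar(max) is invalid for use as a key column in an index
-- CREATE INDEX IX_Demo_BigText ON dbo.Demo (BigText);

CREATE INDEX IX_Demo_ShortText ON dbo.Demo (ShortText);   -- works, key is within the 900-byte limit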
This is one downside: it gives really bad searching performance, and you cannot put unique constraints on it.
Simon Sabin wrote a post on this some time back. I don't have the time to grab it now, but you should search for it, because he comes up with the conclusion that you shouldn't use varchar(max) by default.
Edited: Simon's got a few posts about varchar(max). The links in the comments below show this quite nicely. I think the most significant one is http://sqlblogcasts.com/blogs/simons/archive/2009/07/11/String-concatenation-with-max-types-stops-plan-caching.aspx, which talks about the effect of varchar(max) on plan caching. The general principle is to be careful. If you don't need it to be max, then don't use max - if you need more than 8000 characters, then sure... go for it.
For this question specifically a few points I don't see mentioned.
On 2005/2008/2008 R2 if a LOB column is included in an index this will block online index rebuilds.
On 2012 the online index rebuild restriction is lifted but LOB columns cannot participate in the new functionality Adding NOT NULL Columns as an Online Operation.
Locks can be held for longer on rows containing columns of this data type.
A couple of other reasons are covered in my answer as to why not varchar(8000) everywhere.
Your queries may end up requesting huge memory grants not justified by the size of data.
On tables with triggers it can prevent an optimisation where versioning tags are not added.
I asked a similar question earlier and got some interesting replies; check it out here.
There was one site where a guy talked about the detriment of using wide columns; however, if your data lengths are limited by the application, my testing disproved it.
The fact that you can't create indexes on these columns means I wouldn't use them all the time (personally I wouldn't use them much at all, but I'm a bit of a purist in that regard).
However, if you know there isn't much stored in them, I don't think they are that bad.
If you do any sorting on a recordset containing a varchar(max) column (or any wide char or varchar column), you could suffer performance penalties. These could be resolved (if required) by indexes, but you can't put indexes on varchar(max).
If you want to future-proof your columns, why not just size them to something reasonable, e.g. make a name column 255 characters instead of max... that kind of thing.
There is another reason to avoid using varchar(max) on all columns. For the same reason we use check constraints (to avoid filling tables with junk caused by errant software or user entries), we would want to guard against any faulty process that adds much more data than intended. For example, if someone or something tried to add 3,000 bytes into a City field, we would know for certain that something is amiss and would want to stop the process dead in its tracks to debug it at the earliest possible point. We would also know that a 3000-byte city name could not possibly be valid and would mess up reports and such if we tried to use it.
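As a sketch of that idea (table, column and constraint names are made up), a bounded column plus a check constraint rejects the junk at the database boundary instead of letting it reach your reports:

CREATE TABLE dbo.Customer (
    CustomerID int IDENTITY PRIMARY KEY,
    City varchar(100) NOT NULL,
    CONSTRAINT CK_Customer_City_NotBlank CHECK (LEN(City) > 0)
);

-- A faulty process trying to stuff 3,000 bytes into City fails immediately
-- with a truncation error rather than silently storing garbage:
-- INSERT INTO dbo.Customer (City) VALUES (REPLICATE('x', 3000));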
Ideally, you should only allow what you need. Meaning if you're certain a particular column (say a username column) is never going to be more than 20 characters long, using a VARCHAR(20) vs. a VARCHAR(MAX) lets the database optimize queries and data structures.
From MSDN:
http://msdn.microsoft.com/en-us/library/ms176089.aspx
Variable-length, non-Unicode character data. n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes.
Are you really ever going to come close to 2^31-1 bytes for these columns?

Normalizing an extremely big table

I face the following issue. I have an extremely big table, inherited from the people who previously worked on the project. The table is in MS SQL Server.
The table has the following properties:
it has about 300 columns. All of them have the "text" type, but some of them should eventually represent other types (for example, integer or datetime), so one has to convert these text values into the appropriate types before using them
the table has more than 100 million rows. The space for the table will soon reach 1 terabyte
the table does not have any indices
the table does not have any implemented mechanisms of partitioning.
As you may guess, it is impossible to run any reasonable query against this table. Currently people only insert new records into the table; nobody uses it. So I need to restructure it. I plan to create a new structure and refill it with the data from the old table. Obviously, I will implement partitioning, but that is not the only thing to be done.
One of the most important features of the table is that the fields that are purely textual (i.e. they don't have to be converted into another type) usually have frequently repeated values, so the actual variety of values in a given column is in the range of 5-30 different values. This suggests normalization: for every such textual column I will create an additional table with the list of all the different values that may appear in that column, create a (tinyint) primary key in this additional table, and then use an appropriate foreign key in the original table instead of keeping the text values there. I will then put an index on this foreign key column. The number of columns to be processed this way is about 100.
It raises the following questions:
would this normalization really increase the speed of the queries imposing conditions on some of those 100 fields? If we forget about the size needed to keep those columns, would there be any increase in performance due to the substitution of the initial text columns with tinyint columns? If I do not do any normalization and simply put an index on those initial text columns, will the performance be the same as for the index on the planned tinyint column?
If I do the described normalization, then building a view showing the text values will require joining my main table with some 100 additional tables. A positive point is that those joins will be on "primary key" = "foreign key" pairs, but still quite a large number of tables have to be joined. The question is: will the performance of queries made against this view be no worse than the performance of queries against the initial non-normalized table? Will the SQL Server optimizer really be able to optimize the query in a way that takes advantage of the normalization?
Sorry for such a long text.
Thanks for every comment!
PS
I created a related question regarding joining 100 tables:
Joining 100 tables
You'll find other benefits to normalizing the data besides the speed of queries running against it... such as size and maintainability, which alone should justify normalizing it...
However, it will also likely improve the speed of queries: currently a single row containing 300 text columns is massive, and is almost certainly past the 8,060-byte limit for storing the row on a data page... so it is instead being stored in the ROW_OVERFLOW_DATA or LOB_DATA allocation units.
By reducing the size of each row through normalization, such as replacing redundant text data with a TINYINT foreign key, and by also removing columns that aren't dependent on this large table's primary key into another table, the data should no longer overflow, and you'll also be able to store more rows per page.
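A minimal sketch of that lookup-table approach (all object names here are hypothetical) might look like this:

CREATE TABLE dbo.StopCategory (
    StopCategoryID tinyint NOT NULL PRIMARY KEY,
    StopCategoryName varchar(100) NOT NULL UNIQUE
);

-- In the big table, replace the repeated text value with the tinyint key
ALTER TABLE dbo.BigTable ADD StopCategoryID tinyint NULL
    CONSTRAINT FK_BigTable_StopCategory REFERENCES dbo.StopCategory (StopCategoryID);

CREATE INDEX IX_BigTable_StopCategoryID ON dbo.BigTable (StopCategoryID);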
As far as the overhead added by performing JOIN to get the normalized data... if you properly index your tables, this shouldn't add a substantial amount of overhead. However, if it does add an unacceptable overhead, you can then selectively de-normalize the data as necessary.
Whether this is worth the effort depends on how long the values are. If the values are, say, state abbreviations (2 characters) or country codes (3 characters), the resulting table would be even larger than the existing one. Remember, you need to include the primary key of the reference table. That would typically be an integer and occupy four bytes.
There are other good reasons to do this. Having reference tables with valid lists of values maintains database consistency. The reference tables can be used both to validate inputs and for reporting purposes. Additional information can be included, such as a "long name" or something like that.
Also, SQL Server will spill variable-length columns such as varchar over onto additional pages; it does not do this for fixed-length types. You only have 300 columns, but eventually your record data might get close to the 8 KB limit for data on a single page.
And, if you decide to go ahead, I would suggest that you look for "themes" in the columns. There may be groups of columns that can be grouped together . . . detailed stop code and stop category, short business name and full business name. You are going down the path of modelling the data (a good thing). But be cautious about doing things at a very low level (managing 100 reference tables) versus identifying a reasonable set of entities and relationships.
1) The system is currently having to do a full table scan on very significant amounts of data, leading to the performance issues. There are many aspects of optimisation which could improve this performance. The conversion of columns to the correct data types would not only significantly improve performance by reducing the size of each record, but would allow data to be made correct. If querying on a column, you're currently looking at the text being compared to the text in the field. With just indexing, this could be improved, but changing to a lookup would allow the ID value to be looked up from a table small enough to keep in memory and then use this to scan just integer values, which is a much quicker process.
2) If data is normalised to 3rd normal form or the like, then you can see instances where performance suffers in the name of data integrity. This is mostly a problem if the engine cannot work out how to restrict the rows without projecting the data first. If this does occur, though, the execution plan can identify it and the query can be amended to reduce the likelihood of it happening.
Another point to note is that it sounds like if the database was properly structured it may be able to be cached in memory because the amount of data would be greatly reduced. If this is the case, then the performance would be greatly improved.
The quick way to improve performance would probably be to add indexes. However, this would further increase the overall database size, and doesn't address the issue of storing duplicate data and possible data integrity issues.
There are some other changes which can be made - if a lot of the data is not always needed, then this can be separated off into a related table and only looked up as needed. Fields that are not used for lookups to other tables are particular candidates for this, as the joins can then be on a much smaller table, while preserving a fairly simple structure that just looks up the additional data when you've identified the data you actually need. This is obviously not a properly normalised structure, but may be a quick and dirty way to improve performance (after adding indexing).
Construct, in your head and on paper, a normalized database structure
Construct the database (with indexes)
De-construct that monolith. Things will not look so bad. I would guess that A LOT (I MEAN A LOT) of data is repeated
Create SQL insert statements to insert the data into the database
Go to the persons that constructed that nightmare in the first place with a shotgun. Have fun.

Should I Put An XML Blob Into A Separate Table?

I'm designing a transactional table which will have a lot of records. It will have a lot of reads and writes.
There will be one point at which the user uploads an XML file, and I store this in a database column of type XML.
For a given transactional record, this XML will not be needed as often as everything else. It will probably only get read a couple of times, and will usually just get inserted and not updated.
I'm wondering whether there is any advantage in storing this XML field in a separate table. Then, I can just join to it only when I need it. The only advantage that I perceive is that the individual records on the "main" table will take up less space. But, if my table is properly indexed, does that really even matter?
I suspect that I'm overthinking this and being premature with my optimization. Should I just leave the XML field on the main table?
One sample XML file I have is 12KB. I don't expect it to get much larger than that. I'm not sure if SQL Server's XML data type would store the information more efficiently than that.
To clarify, it is a one-to-one relationship. There will be one XML blob for every transaction. There won't be one XML blob for multiple transactions. And every transaction should eventually get an XML blob, even if it's not immediate.
Thanks,
Tedderz
The answer is that there is no need for you to modify or otherwise compromise your logical data design to accommodate this physical storage consideration.
This is because in SQL Server, XML is a "Large Value Type", and you can control whether these are Physically stored in-row or out-of-row through the use of the 'large value types out of row' option in the sp_tableoption system procedure, like so:
EXEC sys.sp_tableoption N'MyTable', 'large value types out of row', 'ON'
If you leave it OFF, then XML values less than 8000 bytes will be stored in-row. If you set it to ON, then all XML values (and [N]Varchar(MAX) columns) will be stored out of the table in a separate area. (This is all explained in detail here: http://technet.microsoft.com/en-us/library/ms189087(SQL.105).aspx)
The question of which way to set it is hard to answer, but generally: if you expect to retrieve or modify this column a lot, then I'd recommend keeping it in-row; otherwise, store it out of row.
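If you want to confirm which way a table is currently set, the option is exposed in sys.tables (reusing the hypothetical 'MyTable' from the example above):

SELECT name, large_value_types_out_of_row
FROM sys.tables
WHERE name = 'MyTable';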
If your XML is rather large, and there are quite a few use cases where you don't need that information in your queries - then it could make sense to put it into a separate table - even if there's a 1:1 relationship in place.
The motivation here is this: if your "base" table is smaller, e.g. doesn't contain the XML blob, and you often query your table without needing to retrieve the XML, then this smaller row size can lead to much better performance on the base table (since more rows fit on a page, and thus SQL Server would need to load fewer pages to satisfy some of your queries).
Also: if that XML only exists in a small number of cases (e.g. only 10-20% of your rows actually have an XML blob), that might also be a factor that would work in favor of "outsourcing" the XML blob to a separate table.
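As a sketch of that layout (all names are illustrative only), the 1:1 side table could look like this:

CREATE TABLE dbo.Txn (
    TxnID int IDENTITY PRIMARY KEY,
    CreatedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
    -- ... the frequently read transactional columns ...
);

CREATE TABLE dbo.TxnXml (
    TxnID int NOT NULL PRIMARY KEY
        CONSTRAINT FK_TxnXml_Txn REFERENCES dbo.Txn (TxnID),
    Payload xml NOT NULL
);

-- Most queries touch only dbo.Txn; join dbo.TxnXml only when the payload is actually needed:
-- SELECT t.TxnID, x.Payload FROM dbo.Txn AS t JOIN dbo.TxnXml AS x ON x.TxnID = t.TxnID WHERE t.TxnID = @id;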
No, you should not. If there is a one-to-one relationship, it belongs in the same table. Joins are expensive.

MySQL Table with TEXT column

I've been working on a database and I have to deal with a TEXT field.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
Some research revealed this, suggesting that
Separate text/blobs from metadata, don't put text/blobs in results if you don't need them.
However, I am not familiar with the definition of "metadata" being used here.
So I wonder if there are any relevant advantages in putting a TEXT column in a table of its own. What are the potential problems of having it with the rest of the fields? And potential problems of keeping it in a separated table?
This table (without the TEXT field) is supposed to be searched (SELECTed) rather frequently. Does "premature optimization considered evil" apply here? (If there really is a penalty for TEXT columns, how relevant is it, considering it is fairly easy to change this later if needed?)
Besides, are there any good links on this topic? (Perhaps stackoverflow questions&answers? I've tried to search this topic but I only found TEXT vs VARCHAR discussions)
Yep, it seems you've misinterpreted the meaning of the sentence. What it says is that you should only do a SELECT including a TEXT field if you really need the contents of that field. This is because TEXT/BLOB columns can contain huge amounts of data which would need to be delivered to your application - this takes time and of course resources.
Best wishes,
Fabian
This is probably premature optimisation. Performance tuning MySQL is really tricky and can only be done with real performance data for your application. I've seen plenty of attempts to second guess what makes MySQL slow without real data and the result each time has been a messy schema and complex code which will actually make performance tuning harder later on.
Start with a normalised simple schema, then when something proves too slow add a complexity only where/if needed.
As others have pointed out the quote you mentioned is more applicable to query results than the schema definition, in any case your choice of storage engine would affect the validity of the advice anyway.
If you do find yourself needing to add the complexity of moving TEXT/BLOB columns to a separate table, then it's probably worth considering the option of moving them out of the database altogether. Often file storage has advantages over database storage especially if you don't do any relational queries on the contents of the TEXT/BLOB column.
Basically, get some data before taking any MySQL tuning advice you get on the Internet, including this!
The data for a TEXT column is already stored separately. Whenever you SELECT * from a table with text column(s), each row in the result-set requires a lookup into the text storage area. This coupled with the very real possibility of huge amounts of data would be a big overhead to your system.
Moving the column to another table simply requires an additional lookup, one into the secondary table, and the normal one into the text storage area.
The only time that moving TEXT columns into another table will offer any benefit is if there is a tendency to always select all columns from tables. This is merely introducing a second bad practice to compensate for the first. It should go without saying that two wrongs are not the same as three lefts.
The concern is that a large text field—like way over 8,192 bytes—will cause excessive paging and/or file i/o during complex queries on unindexed fields. In such cases, it's better to migrate the large field to another table and replace it with the new table's row id or index (which would then be metadata since it doesn't actually contain data).
The disadvantages are:
a) More complicated schema
b) If the large field is usually inspected or retrieved, there is no advantage
c) Ensuring data consistency is more complicated and a potential source of database malaise.
There might be some good reasons to separate a text field out of your table definition. For instance, if you are using an ORM that loads the complete record no matter what, you might want to create a properties table to hold the text field so it doesn't load all the time. However, if you control the code 100%, then for simplicity leave the field on the table and only select it when you need it, to cut down on data transfer and reading time.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
You probably saw this, from the MySQL manual
http://dev.mysql.com/doc/refman/5.5/en/optimize-character.html
If a table contains string columns such as name and address, but many queries do not retrieve those columns, consider splitting the string columns into a separate table and using join queries with a foreign key when necessary. When MySQL retrieves any value from a row, it reads a data block containing all the columns of that row (and possibly other adjacent rows). Keeping each row small, with only the most frequently used columns, allows more rows to fit in each data block. Such compact tables reduce disk I/O and memory usage for common queries.
This is indeed telling you that in MySQL you are discouraged from keeping TEXT data (and BLOB, as noted elsewhere) in tables that are searched frequently, when many of those queries do not need the TEXT column.
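As an illustrative MySQL sketch of what the manual describes (table and column names are invented):

CREATE TABLE article (
    article_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    created_at DATETIME NOT NULL
) ENGINE=InnoDB;

CREATE TABLE article_body (
    article_id INT UNSIGNED NOT NULL PRIMARY KEY,
    body TEXT NOT NULL,
    CONSTRAINT fk_article_body FOREIGN KEY (article_id) REFERENCES article (article_id)
) ENGINE=InnoDB;

-- Frequent listing queries never touch the TEXT column:
-- SELECT article_id, title FROM article ORDER BY created_at DESC LIMIT 20;
-- Fetch the body only when a single article is displayed:
-- SELECT body FROM article_body WHERE article_id = 42;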