Why use shorter VARCHAR(n) fields?

It is frequently advised to choose database field sizes to be as narrow as possible. I am wondering to what degree this applies to SQL Server 2005 VARCHAR columns: Storing 10-letter English words in a VARCHAR(255) field will not take up more storage than in a VARCHAR(10) field.
Are there other reasons to restrict the size of VARCHAR fields to stick as closely as possible to the size of the data? I'm thinking of
Performance: Is there an advantage to using a smaller n when selecting, filtering and sorting on the data?
Memory, including on the application side (C++)?
Style/validation: How important do you consider restricting column size to force nonsensical data imports to fail (such as 200-character surnames)?
Anything else?
Background: I help data integrators with the design of data flows into a database-backed system. They have to use an API that restricts their choice of data types. For character data, only VARCHAR(n) with n <= 255 is available; CHAR, NCHAR, NVARCHAR and TEXT are not. We're trying to lay down some "good practices" rules, and the question has come up if there is a real detriment to using VARCHAR(255) even for data where real maximum sizes will never exceed 30 bytes or so.
Typical data volumes for one table are 1-10 million records with up to 150 attributes. Query performance (SELECT, frequently with extensive WHERE clauses) and application-side retrieval performance are paramount.

Data Integrity - By far the most important reason. If you create a column called Surname that is 255 characters, you will likely get more than surnames. You'll get first name, last name, middle name. You'll get their favorite pet. You'll get "Alice in the Accounting Department with the Triangle hair". In short, you will make it easy for users to use the column as a notes/surname column. You want the cap to impede users who try to put something other than a surname into that column. If you have a column that calls for a specific length (e.g. a US tax identifier is nine characters) but the column is varchar(255), other developers will wonder what is going on, and you will likely get crap data as well.
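As a rough illustration (a minimal sketch; the dbo.Person table and its columns are made up, not taken from the question), a tight column size combined with a CHECK constraint rejects obviously bad data at insert time:
CREATE TABLE dbo.Person (
    PersonId INT IDENTITY(1,1) PRIMARY KEY,
    Surname  VARCHAR(40) NOT NULL,  -- wide enough for real surnames, too narrow for "notes"
    TaxId    VARCHAR(9)  NOT NULL,
    CONSTRAINT CK_Person_TaxId CHECK (LEN(TaxId) = 9 AND TaxId NOT LIKE '%[^0-9]%')
);

-- Fails with "String or binary data would be truncated": a 200-character
-- "surname" never makes it into the table.
INSERT INTO dbo.Person (Surname, TaxId)
VALUES (REPLICATE('x', 200), '123456789');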
Indexing and row limits. In SQL Server a row is limited to 8060 bytes IIRC. Lots of fat non-varchar(max) columns holding lots of data can quickly exceed that limit. In addition, index keys have a 900-byte cap on width IIRC. So, if you wanted to index on your surname column and some others that contain lots of data, you could exceed this limit.
Reporting and external systems. As a report designer you must assume that if a column is declared with a max length of 255, it could have 255 characters. If the user can do it, they will do it. Thus, to say, "It probably won't have more than 30 characters." is not even remotely the same as "It cannot have more than 30 characters." Never rely on the former. As a report designer, you have to work around the possibilities that users will enter a bunch of data into a column. That either means truncating the values (and if that is the case why have the additional space available?) or using CanGrow to make a lovely mess of a report. Either way, you make it harder on other developers to understand the intent of the column if the column size is so far out of whack with the actual data being stored.

I think that the biggest issue is data validation. If you allow 255 characters for a surname, you WILL get a surname that's 200+ characters in your database.
Another reason is that if you allow the database to hold 255 characters you now have to account for that possibility in every system that touches your database. For example, if you exported to a fixed-width column file all of your columns would have to be 255 characters wide, which could be pretty annoying or even problematic. That's just one example where it could cause a problem.

One good reason is validation.
(for example) In Holland a social security number is always 9 characters long; if you don't allow more, a longer value can never occur.
If you allowed more and for some unknown reason a 10-character value appeared, you would need to add checks (which you otherwise wouldn't) to verify that it is 9 characters long.

1) Readability & Support
A database developer could look at a field called StateCode with a length of varchar(2) and get a good idea of what kind of data that field holds, without even looking at the contents.
2) Reporting
When your data has no length constraint, you are expecting the developer to enforce that the column data is all similar in length. When reporting on that data, if the developer has failed to keep the column data consistent, the report of that data will be inconsistent and look funny.
3) SQL Server Data Storage
SQL Server stores data on 8k "pages" and from a performance standpoint it is ideal to be as efficient as possible and store as much data as possible on a page.
If your database is designed to store every string column as varchar(255), "bad" data could slip into one of those fields (for example a state name might slip into a StateCode field that is meant to be 2 characters long), and cause unnecessary & inefficient page and index splits.

The other thing is that a single row of data is limited to 8060 bytes, and SQL Server uses the max length of varchar fields to determine this.
Reference: http://msdn.microsoft.com/en-us/library/ms143432.aspx
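One way to sanity-check how far declared widths drift from reality is to compare the catalogue's declared lengths with the longest values actually stored. A minimal sketch (the dbo.Customer table and Surname column are hypothetical, for illustration only):
-- Declared byte lengths of all varchar columns in the table
SELECT c.name AS column_name, c.max_length AS declared_bytes
FROM sys.columns AS c
JOIN sys.types AS t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.Customer')
  AND t.name = 'varchar';

-- Longest value actually stored in one of those columns
SELECT MAX(LEN(Surname)) AS longest_surname
FROM dbo.Customer;
If a column is declared varchar(255) but nothing stored is longer than 30 characters, it is a candidate for narrowing.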

Related

SQL Query Performance with an nvarchar(500) where the MAX(LEN(column)) < 30 [duplicate]

Is there an advantage to varchar(500) over varchar(8000)?

I've read up on this on MSDN forums and here and I'm still not clear. I think this is correct: Varchar(max) will be stored as a text datatype, so that has drawbacks. So let's say your field will reliably be under 8000 characters. Like a BusinessName field in my database table. In reality, a business name will probably always be under (pulling a number outta my hat) 500 characters. It seems like plenty of varchar fields that I run across fall well under the 8k character count.
So should I make that field a varchar(500) instead of varchar(8000)? From what I understand of SQL there's no difference between those two. So, to make life easy, I'd want to define all my varchar fields as varchar(8000). Does that have any drawbacks?
Related: Size of varchar columns (I didn't feel like this one answered my question).
One example where this can make a difference is that it can prevent a performance optimization that avoids adding row versioning information to tables with after triggers.
This is covered by Paul White here
The actual size of the data stored is immaterial – it is the potential size that matters.
Similarly, if using memory-optimised tables, since SQL Server 2016 it has been possible to use LOB columns or combinations of column widths that could potentially exceed the in-row limit, but with a penalty.
(Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.
This can have a large negative effect on memory consumption and performance
Another case where over-declaring column widths can make a big difference is if the table will ever be processed using SSIS. The memory allocated for variable-length (non-BLOB) columns is fixed for each row in an execution tree and is sized per the columns' declared maximum length, which can lead to inefficient usage of memory buffers (example). Whilst the SSIS package developer can declare a smaller column size than the source, this analysis is best done up front and enforced there.
Back in the SQL Server engine itself, a similar case is that when calculating the memory grant to allocate for SORT operations, SQL Server assumes that varchar(x) columns will on average consume x/2 bytes.
If most of your varchar columns are fuller than that, this can lead to the sort operations spilling to tempdb.
In your case, if your varchar columns are declared as 8000 bytes but actually have contents much shorter than that, your query will be allocated memory that it doesn't require, which is obviously inefficient and can lead to waits for memory grants.
This is covered in Part 2 of SQL Workshops Webcast 1 downloadable from here or see below.
use tempdb;

CREATE TABLE T(
    id       INT IDENTITY(1,1) PRIMARY KEY,
    number   INT,
    name8000 VARCHAR(8000),
    name500  VARCHAR(500));

INSERT INTO T
    (number, name8000, name500)
SELECT number, name, name /*<-- Same contents in both cols*/
FROM master..spt_values;

/* Both queries sort identical data, but the grant for the second one is
   estimated from the declared width (8000/2 bytes per value rather than
   500/2), so it requests far more memory. Compare the memory grants in the
   actual execution plans. */
SELECT id, name500
FROM T
ORDER BY number;

SELECT id, name8000
FROM T
ORDER BY number;
From a processing standpoint, it will not make a difference to use varchar(8000) vs varchar(500). It's more of a "good practice" kind of thing to define a maximum length that a field should hold and make your varchar that length. It's something that can be used to assist with data validation. For instance, making a state abbreviation be 2 characters or a postal/zip code as 5 or 9 characters. This used to be a more important distinction for when your data interacted with other systems or user interfaces where field length was critical (e.g. a mainframe flat file dataset), but nowadays I think it's more habit than anything else.
There are some disadvantages to large columns that are a bit less obvious and might catch you a little later:
All the columns you use in an INDEX key must not exceed 900 bytes in total (see the sketch after this list)
All the columns in an ORDER BY clause may not exceed 8060 bytes. This is a bit difficult to grasp since this only applies to some columns (see SQL 2008 R2 Row size limit exceeded for details)
If the total row size exceeds 8060 bytes, you get a "page spill" for that row. This might affect performance (a page is an allocation unit in SQL Server and is fixed at 8000 bytes plus some overhead; exceeding this will not be severe, but it's noticeable and you should try to avoid it if you easily can)
Many other internal data structures, buffers and, last but not least, your own variables and table variables all need to mirror these sizes. With excessive sizes, excessive memory allocation can affect performance
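A minimal sketch of the index key limit (table and index names are made up; this applies on versions where the nonclustered key cap is 900 bytes, raised to 1700 in SQL Server 2016): the index is created with only a warning, but DML that actually exceeds the cap fails.
CREATE TABLE dbo.WideKey (Surname VARCHAR(1000));

-- Warning: maximum key length is 900 bytes; the index is still created
CREATE INDEX IX_WideKey_Surname ON dbo.WideKey (Surname);

-- Fails at run time: the index entry would be 950 bytes, over the 900-byte cap
INSERT INTO dbo.WideKey (Surname) VALUES (REPLICATE('x', 950));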
As a general rule, try to be conservative with the column width. If it becomes a problem, you can easily expand it to fit the needs. If you notice memory issues later, shrinking a wide column may become impossible without losing data and you won't know where to begin.
In your example of the business names, think about where you get to display them. Is there really space for 500 characters?? If not, there is little point in storing them as such. http://en.wikipedia.org/wiki/List_of_companies_of_the_United_States lists some company names and the max is about 50 characters. So I'd use 100 for the column max. Maybe more like 80.
Apart from best practices (BBlake's answer):
You get warnings about the maximum row size (8060 bytes) and index key width (900 bytes) with DDL
DML will die if you exceed these limits
ANSI_PADDING ON is the default, so you could end up storing a whole load of trailing whitespace
Ideally, you'd want to go smaller than that, down to a reasonably sized length (500 isn't reasonably sized), and make sure that client-side validation catches data that is going to be too large and sends a useful error.
While the varchar isn't actually going to reserve space in the database for the unused space, I recall versions of SQL Server having a snit about database rows being wider than some number of bytes (do not recall the exact count) and actually throwing out whatever data didn't fit. A certain number of those bytes were reserved for things internal to SQL Server.

Does varchar result in performance hit due to data fragmentation?

How are varchar columns handled internally by a database engine?
For a column defined as char(100), the DBMS allocates 100 contiguous bytes on the disk. However, for a column defined as varchar(100), that presumably isn't the case, since the whole point of varchar is to not allocate any more space than required to store the actual data value stored in the column. So, when a user updates a database row containing an empty varchar(100) column to a value consisting of 80 characters for instance, where does the space for that 80 characters get allocated from?
It seems that varchar columns must result in a fair amount of fragmentation of the actual database rows, at least in scenarios where column values are initially inserted as blank or NULL, and then updated later with actual values. Does this fragmentation result in degraded performance on database queries, as opposed to using char type values, where the space for the columns stored in the rows is allocated contiguously? Obviously using varchar results in less disk space than using char, but is there a performance hit when optimizing for query performance, especially for columns whose values are frequently updated after the initial insert?
You make a lot of assumptions in your question that aren't necessarily true.
The type of a column in any DBMS tells you nothing at all about the nature of the storage of that data unless the documentation clearly tells you how the data is stored. If that's not stated, you don't know how it is stored and the DBMS is free to change the storage mechanism from release to release.
In fact some databases store CHAR fields internally as VARCHAR, while others make a decision about how to store the column based on the declared size of the column. Some databases store VARCHAR with the other columns, some with BLOB data, and some implement other storage. Some databases always rewrite the entire row when a column is updated, others don't. Some pad VARCHARs to allow for limited future updating without relocating the storage.
The DBMS is responsible for figuring out how to store the data and return it to you in a speedy and consistent fashion. It always amazes me how many people try to out-think the database, generally in advance of detecting any performance problem.
The data structures used inside a database engine are far more complex than you are giving them credit for! Yes, there are issues of fragmentation, and issues where updating a varchar with a large value can cause a performance hit, however it's difficult to explain/understand what the implications of those issues are without a fuller understanding of the data structures involved.
For MS SQL Server you might want to start with understanding pages - the fundamental unit of storage (see http://msdn.microsoft.com/en-us/library/ms190969.aspx)
In terms of the performance implications of fixed vs variable-length storage types, there are a number of points to consider:
Using variable length columns can improve performance as it allows more rows to fit on a single page, meaning fewer reads
Using variable length columns requires special offset values, and the maintenance of these values requires a slight overhead, however this extra overhead is generally negligible.
Another potential cost is the cost of increasing the size of a column when the page containing that row is nearly full
As you can see, the situation is rather complex - generally speaking, however, you can trust the database engine to be pretty good at dealing with variable-length data types, and they should be the data type of choice when there may be a significant variance in the length of data held in a column.
At this point I'm also going to recommend the excellent book "Microsoft SQL Server 2008 Internals" for some more insight into how complex things like this really get!
The answer will depend on the specific DBMS. For Oracle, it is certainly possible to end up with fragmentation in the form of "chained rows", and that incurs a performance penalty. However, you can mitigate that by pre-allocating some empty space in the table blocks to allow for some expansion due to updates. On the other hand, CHAR columns will typically make the table much bigger, which has its own impact on performance. CHAR also has other issues, such as blank-padded comparisons, which mean that, in Oracle, use of the CHAR datatype is almost never a good idea.
Your question is too general because different database engines will have different behavior. If you really need to know this, I suggest that you set up a benchmark to write a large number of records and time it. You would want enough records to take at least an hour to write.
As you suggested, it would be interesting to see what happens if you insert all the records with an empty string ("") and then update them to have 100 characters that are reasonably random, not just 100 Xs.
If you try this with SQLITE and see no significant difference, then I think it unlikely that the larger database servers, with all the analysis and tuning that goes on, would be worse than SQLITE.
This is going to be completely database specific.
I do know that in Oracle, the database will reserve a certain percentage of each block for future updates (the PCTFREE parameter). For example, if PCTFREE is set to 25%, then a block will only be used for new data until it is 75% full. By doing that, room is left for rows to grow. If a row grows such that the 25% reserved space is completely used up, then you do end up with chained rows and a performance penalty. If you find that a table has a large number of chained rows, you can tune the PCTFREE for that table. If you have a table which will never have any updates at all, a PCTFREE of zero would make sense.
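A minimal Oracle sketch of that idea (the tables and columns are made up for illustration):
-- Leave 25% of each block free so existing rows can grow on UPDATE
CREATE TABLE order_notes (
    order_id NUMBER PRIMARY KEY,
    note     VARCHAR2(4000)
) PCTFREE 25;

-- A table that is only ever inserted into needs no growth room
CREATE TABLE audit_log (
    log_id  NUMBER PRIMARY KEY,
    message VARCHAR2(4000)
) PCTFREE 0;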
In SQL Server varchar (except varchar(MAX)) is generally stored together with the rest of the row's data (on the same page if the row's data is < 8KB, and on the same extent if it is < 64KB). Only the large data types such as TEXT, NTEXT, IMAGE, VARCHAR(MAX), NVARCHAR(MAX), XML and VARBINARY(MAX) are stored separately.

How much more inefficient are text (blobs) than varchar/nvarchar's?

We're doing a lot of large, but straightforward forms for a fairly big project (about 600 users using it throughout the day - that's big for me at least ;-) ).
The forms have a lot of question/answer type sections, so it's natural for some people to type a sentence, while others type a novel. How beneficial would it be to put a character limit on some of these fields really?
(Please include references or citations, if necessary/possible - Thanks!)
If you have no limitations on the data size, then why worry? This doesn't sound like a mission-critical project, even with 600 users and several thousand records. Use CLOB/BLOB and be done with it. I have doubts as to whether you would see any major gains in limiting sizes and risking data loss. That said, you should lay out such boundaries before implementation.
Usually varchar is best for storing values that you wish to use logically and perform "whole value" comparisons against. Text is for unstructured data. If your project is a survey result with unstructured text, use CLOB/BLOB
Semi-Reference: I work with hundreds of thousands of call center records sometimes where we use a CLOB to store the dialog between employees and customers.
I say, focus on the needs of the users and only worry about database performance issues when/if those issues arise. Ask yourself "will my users benefit if I limit the amount of data they can enter".
I keep a great gapingvoid cartoon on my wall that says "it's not what the software does. it's what the user does".
You don't mention which SQL server you are using.
If you are using MySQL there are definite speed advantages to using fixed-length fields to keep the table in static format; however, if you have any variable-width fields the table will switch to dynamic format and you lose the benefit of specifying the length of the field.
http://dev.mysql.com/doc/refman/5.0/en/static-format.html
http://dev.mysql.com/doc/refman/5.0/en/dynamic-format.html
Microsoft SQL Server has similar performance gains when you use fixed-length columns. With fixed-length columns the server knows exactly what the offset and length of the data in the row is. With variable-length columns the server knows the offset but has to store the actual length of the data as a preceding 2-byte counter. This has a couple of implications that are covered in this interesting article, which discusses performance as a function of disk space and the advantages of variable-length columns.
If you are using SQL Server 2005 or newer you can take advantage of varchar(max). This column type has the same 2GB storage capacity as BLOBs, but the data is stored in-row with the table's data pages when it fits, instead of always in a separate store. So you get the large size advantage, only use 8K in your pages at a time, quick access for the DB engine, and the same query semantics that work with other column types work with varchar(max).
In the end, specifying a max length on a variable column mainly lets you constrain the growth size of your database. Once you use variable-length columns you lose the advantage of fixed-size rows, and varchar(max) will perform the same as varchar(10) when holding the same amount of data.
blob and text/ntext are stored outside of the row context, with only a reference to the object stored in the row, resulting in a smaller row size, which will improve performance on clustered indexes.
However, because text/ntext are not stored with the row, data retrieval takes longer, and these fields cannot be used in most comparison statements.
from: http://www.making-the-web.com/2008/03/24/saving-bytes-efficient-data-storage-mysql-part-1/
There are a few variations of the TEXT and BLOB types which affect size; they are:
Type                    Maximum length    Storage
TINYBLOB, TINYTEXT      255               Length + 1 bytes
BLOB, TEXT              65,535            Length + 2 bytes
MEDIUMBLOB, MEDIUMTEXT  16,777,215        Length + 3 bytes
LONGBLOB, LONGTEXT      4,294,967,295     Length + 4 bytes

Disadvantage of choosing large MAX value for varchar or varbinary

What's the disadvantage of choosing a large value for max when creating a varchar or varbinary column?
I'm using MS SQL but I assume this would be relevant to other dbs as well.
Thanks
That depends on whether it is ever reasonable to store a large amount of data in the particular column.
If you declare a column that would never properly store much data (e.g. an employee first name as a VARCHAR(1000)), you end up with a variety of problems:
Many if not most client APIs (e.g. ODBC drivers, JDBC drivers, etc.) allocate memory buffers on the client that are large enough to store the maximum size of a particular column. So even though the database only has to store the actual data, you may substantially increase the amount of memory the client application uses.
You lose the ability to drive data validation rules (or impart information about the data) from the table definition. If the database allows 1000 character first names, every application that interacts with the database will probably end up having its own rules for how large an employee name can be. If this is not mitigated by putting a stored procedure layer between all applications and the tables, this generally leads to various applications having various rules.
Murphy's Law states that if you allow 1000 characters, someone will eventually store 1000 characters in the column, or at least a value large enough to cause errors in one or more applications (i.e. no one checked to see whether every application's employee name field could display 1000 characters).
Depends on the RDBMS. IIRC, MySQL allocates a 2-byte overhead for varchars > 255 characters (to track the varchar length). MSSQL <= 2000 would allow you to declare a row size > 8060 bytes, but would fail if you tried to INSERT or UPDATE a row that actually exceeded 8060 bytes. SQL 2005[1] allows the insert, but will allocate a new page for the overflow and leave a pointer behind. This, obviously, impacts performance.
[1] varchar(max) is somewhat of a special case, but it will also allocate an overflow page if the length of the field is > 8000 or the row > 8060. This is with MSSQL defaults, and behavior can change with the 'large value types out of row' table option.
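For reference, that option is set per table with sp_tableoption; a minimal sketch (the table name is made up):
-- Force varchar(max)/nvarchar(max)/varbinary(max) values to be stored
-- off-row, keeping only a 16-byte pointer in the data row
EXEC sp_tableoption 'dbo.Documents', 'large value types out of row', 1;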
You could be adding a risk of breaking your application if large data got in somehow (like from an external interface) and your app isn't designed to handle it.
As a good design, you should always limit the size of the fields to a realistic value.