I've always been bothered by the need for max lengths on SQL string columns. There is some data for which there is no true max length. For example, let's say you have a field to store someone's first name, and you make it NVARCHAR(50). It's always possible (although highly unlikely) that someone has a name longer than 50 chars.
Would it be feasible to change the field's max length on the fly? What I mean is: when you do an INSERT/UPDATE, you check whether the person's name is longer than 50 characters, and ALTER the table if necessary before performing the INSERT/UPDATE. (Or perhaps, catch the truncation error and then perform the ALTER.)
Would the ALTER be a slow operation if the table had a lot of data in it?
Let's say you alter the column to be NVARCHAR(100). Would a SELECT from this table be slower than if you'd made it NVARCHAR(100) from the beginning?
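A minimal sketch of what I mean, with hypothetical table and column names:

```sql
-- Hypothetical sketch: widen the column on the fly before the INSERT
-- if the incoming value no longer fits the declared length.
DECLARE @name NVARCHAR(4000) = N'Some unusually long first name ...';

IF LEN(@name) > 50
    ALTER TABLE dbo.Person ALTER COLUMN FirstName NVARCHAR(100);

INSERT INTO dbo.Person (FirstName) VALUES (@name);
```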
From the use of nvarchar(), I am guessing that you are using SQL Server.
In SQL Server, I usually just make such names varchar(255) (or nvarchar(255)). This is actually a weird anachronism of affinity to powers of 2 (255 = largest 8-bit unsigned value). But it works well in practice.
You have at least two considerations in SQL Server. First, the maximum "not-maximum" length of a string is 8000 data bytes, which is varchar(8000) or nvarchar(4000). These are reasonable maximum lengths as well.
The second consideration is the length of a key in an index. The maximum length is 900 bytes. If you have an index on the field, then you don't want the field longer than this. If you have composite indexes, then 255 seems like a reasonable length.
The key point, however, is that changing the length of a field could have effects you are not thinking of, such as on the index. You don't want to change the structure of a table lightly. Instead, just make the field oversized to begin with, and forget about the problem entirely.
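As a rough illustration of those limits (hypothetical table): an nvarchar(255) key uses at most 510 bytes, comfortably under the 900-byte index key cap, whereas an nvarchar(500) key could reach 1,000 bytes:

```sql
-- Hypothetical table: nvarchar(255) stays under the 900-byte index key
-- limit (255 characters * 2 bytes = 510 bytes at most).
CREATE TABLE dbo.People (
    PersonId  INT IDENTITY(1,1) PRIMARY KEY,
    FirstName NVARCHAR(255) NOT NULL
);

CREATE INDEX IX_People_FirstName ON dbo.People (FirstName);

-- An NVARCHAR(500) key column (up to 1,000 bytes) would instead draw a
-- warning at CREATE INDEX time, and inserts whose key values actually
-- exceed 900 bytes would fail.
```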
I've read up on this on MSDN forums and here and I'm still not clear. I think this is correct: Varchar(max) will be stored as a text datatype, so that has drawbacks. So let's say your field will reliably be under 8000 characters. Like a BusinessName field in my database table. In reality, a business name will probably always be under (pulling a number outta my hat) 500 characters. It seems like plenty of varchar fields that I run across fall well under the 8k character count.
So should I make that field a varchar(500) instead of varchar(8000)? From what I understand of SQL there's no difference between those two. So, to make life easy, I'd want to define all my varchar fields as varchar(8000). Does that have any drawbacks?
Related: Size of varchar columns (I didn't feel like this one answered my question).
One example where this can make a difference: over-declared column widths can disqualify a table from a performance optimization that avoids adding row-versioning information to tables with AFTER triggers.
This is covered by Paul White here
The actual size of the data stored is immaterial – it is the potential size that matters.
Similarly, if you are using memory-optimized tables, since 2016 it has been possible to use LOB columns, or combinations of column widths that could potentially exceed the in-row limit, but with a penalty:
(Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes the largest variable-length column(s) off-row. Again, it does not depend on the amount of data you store there.
This can have a large negative effect on memory consumption and performance.
Another case where over-declaring column widths can make a big difference is if the table will ever be processed using SSIS. The memory allocated for variable-length (non-BLOB) columns is fixed for each row in an execution tree, and is per the columns' declared maximum length, which can lead to inefficient usage of memory buffers (example). While the SSIS package developer can declare a smaller column size than the source, this analysis is best done up front and enforced there.
Back in the SQL Server engine itself, a similar case is that when calculating the memory grant to allocate for SORT operations, SQL Server assumes that varchar(x) columns will on average consume x/2 bytes.
If most of your varchar columns are fuller than that, this can lead to the sort operations spilling to tempdb.
In your case, if your varchar columns are declared as 8000 bytes but actually hold contents much smaller than that, your query will be allocated memory that it doesn't require, which is obviously inefficient and can lead to waits for memory grants.
This is covered in Part 2 of SQL Workshops Webcast 1 downloadable from here or see below.
use tempdb;

CREATE TABLE T (
    id       INT IDENTITY(1,1) PRIMARY KEY,
    number   INT,
    name8000 VARCHAR(8000),
    name500  VARCHAR(500)
);

INSERT INTO T (number, name8000, name500)
SELECT number, name, name /*<-- Same contents in both cols */
FROM master..spt_values;

SELECT id, name500
FROM T
ORDER BY number;

SELECT id, name8000
FROM T
ORDER BY number;
From a processing standpoint, it will not make a difference to use varchar(8000) vs varchar(500). It's more of a "good practice" kind of thing to define a maximum length that a field should hold and make your varchar that length. It's something that can be used to assist with data validation. For instance, making a state abbreviation be 2 characters or a postal/zip code as 5 or 9 characters. This used to be a more important distinction for when your data interacted with other systems or user interfaces where field length was critical (e.g. a mainframe flat file dataset), but nowadays I think it's more habit than anything else.
There are some disadvantages to large columns that are a bit less obvious and might catch you a little later:
The columns used in an index key must not exceed 900 bytes in total.
The columns in an ORDER BY clause may not exceed 8060 bytes. This is a bit difficult to grasp, since it only applies to some columns. See SQL 2008 R2 Row size limit exceeded for details.
If the total row size exceeds 8060 bytes, you get a "page spill" (row overflow) for that row. This might affect performance. (A page is the allocation unit in SQL Server and is fixed at 8 KB, of which roughly 8,060 bytes are usable for a row. Exceeding this will not be severe, but it's noticeable, and you should try to avoid it if you easily can.)
Many other internal data structures, buffers and, last but not least, your own variables and table variables all need to mirror these sizes. With excessive sizes, excessive memory allocation can affect performance.
As a general rule, try to be conservative with column widths. If a width becomes a problem, you can easily expand it to fit the needs. But if you notice memory issues later, shrinking a wide column may become impossible without losing data, and you won't know where to begin.
In your example of the business names, think about where you will display them. Is there really space for 500 characters? If not, there is little point in storing them as such. http://en.wikipedia.org/wiki/List_of_companies_of_the_United_States lists some company names, and the maximum is about 50 characters. So I'd use 100 for the column max, or maybe more like 80.
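A quick sketch of the row-size limit in action (hypothetical table): two varchar(8000) columns give a potential row size far above 8,060 bytes, and a row that actually exceeds the limit is handled by pushing data off-row, at a cost:

```sql
-- Hypothetical: the potential row size here is ~16,000 bytes, well over
-- the 8,060-byte in-row limit.  Depending on the SQL Server version this
-- raises a warning at DDL time.
CREATE TABLE dbo.WideRow (
    WideRowId INT IDENTITY(1,1) PRIMARY KEY,
    a VARCHAR(8000),
    b VARCHAR(8000)
);

-- This row holds ~16,000 bytes of character data, so SQL Server moves
-- variable-length data to row-overflow pages, with extra I/O to read it.
INSERT INTO dbo.WideRow (a, b)
VALUES (REPLICATE('x', 8000), REPLICATE('y', 8000));
```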
Apart from best practices (BBlake's answer):
You get warnings about the maximum row size (8060 bytes) and index width (900 bytes) at DDL time.
DML will fail if you exceed these limits.
ANSI PADDING ON is the default, so you could end up storing a whole load of whitespace.
Ideally you'd want to go smaller than that, down to a reasonably sized length (500 isn't reasonably sized), and make sure the client-side validation catches data that is going to be too large and sends a useful error.
While a varchar isn't actually going to reserve storage for its unused length, I recall versions of SQL Server having a snit about rows being wider than some number of bytes (I don't recall the exact count) and actually throwing out whatever data didn't fit. A certain number of those bytes were reserved for things internal to SQL Server.
Consider I have a VARCHAR(MAX) column. If I change it to VARCHAR(500), will Microsoft SQL Server decrease the size claimed by the table?
If you have any relevant link, just comment it and I'll check it out.
Update:
I've tested the following two cases with the table:
ALTER the column size
Create a new table and import the data from the old table
Initial table size
ALTER TABLE table_transaction ALTER COLUMN column_name VARCHAR(500)
After the ALTER COLUMN, the table size increased.
Create a new table with the new column size and import the data from the old table.
I've taken care of the indexes in the new table.
Why did the table size increase in the ALTER COLUMN case? Ideally the table size should decrease.
After performing defragmentation on the PK in the original table, the size decreased by a few MB. However, that's not nearly as effective as creating a new table.
When you change a varchar(n) column to varchar(MAX), or vice versa, SQL Server will update every row in the table. This will temporarily increase the table size until you rebuild the clustered index or execute DBCC CLEANTABLE.
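A sketch of reclaiming that space after the change, using the table and column names from the question (the PK index name is hypothetical):

```sql
-- After the ALTER COLUMN, the table is temporarily inflated:
ALTER TABLE dbo.table_transaction
    ALTER COLUMN column_name VARCHAR(500);

-- Either rebuild the clustered index (PK name is hypothetical) ...
ALTER INDEX PK_table_transaction ON dbo.table_transaction REBUILD;

-- ... or reclaim space left by dropped/altered variable-length columns
-- in the current database directly:
DBCC CLEANTABLE (0, 'dbo.table_transaction');
```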
For ongoing space requirements of a varchar(MAX) column, the space will be the same as varchar(n) as long as the value remains in-row. However, if the value exceeds 8000 bytes, it will be stored on separate LOB page(s) dedicated to the value. This will increase space requirements and require extra I/O when a query needs the value.
A good rule of thumb is to use MAX types only if the value may exceed 8000 bytes, and otherwise specify a proper maximum length for the domain of the data stored.
According to the documentation, there is no difference in the storage of strings:
varchar [ ( n | max ) ]
Variable-length, non-Unicode string data. n defines the string length and can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes (2 GB). The storage size is the actual length of the data entered + 2 bytes.
As I read this, the storage size is the actual length plus two bytes regardless of whether you use n or max.
I am suspicious about this. I would expect the length of varchar(max) to occupy four bytes. And there might be additional overhead for storing off-page references (if they exist). However, the documentation is pretty clear on this point.
Whether changing the data type changes the size of the field depends on the data already stored. You can have several situations.
If all the values are NULL, then there will be no changes at all. The values are not being stored.
If all the values are less than 20 bytes, then -- according to the documentation -- there would be no change. I have a nagging suspicion that you might save 2 bytes per value, but I can't find a reference to it and don't have SQL Server on hand today to check.
If values exceed 20 bytes but remain on the page, then you will save space because the values will change.
If the values go off-page, then you will save the header information as well as truncating the data (thank you Dan for pointing this out).
Our production server is SQL Server 2005, and we have a very large table of 103 million records. We want to increase the length of one particular field from varchar(20) to varchar(30). Though I said it's just a metadata change, since it's an increase in the column length, my manager says he doesn't want to alter such a huge table. Please advise on the best option. I am thinking of creating a new column and updating it with the old column's values.
I looked at many blogs; some say the ALTER will have an impact and some say it will not.
As you said, it is a metadata-only operation and this is the way to go. Prove to your manager (and to yourself!) through testing that you are right.
You should test any advice first, unless from a SQL Server MVP who might actually know the details of what happens.
However, changing the varchar length from 20 to 30 does not affect the layout of any existing data in the table. That is, the layout of the two types is exactly the same, which means the data does not have to change when you alter the table.
This offers optimism that the change would be "easy".
The data page does contain some information about types -- at least the length of the type in the record. I don't know if this includes the maximum length of a character type. It is possible that the data pages would need to be changed.
This is a bit of pessimism.
Almost any other change will require changes to every record and/or data page. For instance, changing from int to bigint is moving from a 4-byte field to an 8-byte field. All the records are affected by this change in data layout. Big change.
Changing from varchar() to either nvarchar() or char() would have the same impact.
On the other hand, changing a field from being NULLABLE to NOT NULLABLE (or vice versa) would not affect the record storage on each page. But, that information is stored on the page in the NULLABLE flags array, so all the pages would need to be updated.
So, there is some possibility that the change would not cause any data to be rewritten. But test on a smaller table to see what happens.
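A sketch of the kind of test I mean (hypothetical table; compare the page count before and after the ALTER — a metadata-only change should leave it untouched):

```sql
-- Hypothetical smoke test for a metadata-only ALTER COLUMN.
CREATE TABLE dbo.AlterTest (
    Id  INT IDENTITY(1,1) PRIMARY KEY,
    Val VARCHAR(20) NOT NULL
);

INSERT INTO dbo.AlterTest (Val)
SELECT LEFT(name, 20) FROM master..spt_values WHERE name IS NOT NULL;

-- Page count before:
SELECT used_page_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('dbo.AlterTest');

ALTER TABLE dbo.AlterTest ALTER COLUMN Val VARCHAR(30) NOT NULL;

-- Page count after: if it is unchanged, no data pages were rewritten.
SELECT used_page_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('dbo.AlterTest');
```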
It is frequently advised to choose database field sizes to be as narrow as possible. I am wondering to what degree this applies to SQL Server 2005 VARCHAR columns: Storing 10-letter English words in a VARCHAR(255) field will not take up more storage than in a VARCHAR(10) field.
Are there other reasons to restrict the size of VARCHAR fields to stick as closely as possible to the size of the data? I'm thinking of
Performance: Is there an advantage to using a smaller n when selecting, filtering and sorting on the data?
Memory, including on the application side (C++)?
Style/validation: How important do you consider restricting column size to force nonsensical data imports to fail (such as 200-character surnames)?
Anything else?
Background: I help data integrators with the design of data flows into a database-backed system. They have to use an API that restricts their choice of data types. For character data, only VARCHAR(n) with n <= 255 is available; CHAR, NCHAR, NVARCHAR and TEXT are not. We're trying to lay down some "good practices" rules, and the question has come up if there is a real detriment to using VARCHAR(255) even for data where real maximum sizes will never exceed 30 bytes or so.
Typical data volumes for one table are 1-10 million records with up to 150 attributes. Query performance (SELECT, with frequently extensive WHERE clauses) and application-side retrieval performance are paramount.
Data Integrity - By far the most important reason. If you create a column called Surname that is 255 characters, you will likely get more than surnames. You'll get first name, last name, middle name. You'll get their favorite pet. You'll get "Alice in the Accounting Department with the Triangle hair". In short, you will make it easy for users to use the column as a notes/surname column. You want the cap to impede the users that try to put something other than a surname into that column. If you have a column that calls for a specific length (e.g. a US tax identifier is nine characters) but the column is varchar(255), other developers will wonder what is going on, and you will likely get crap data as well.
Indexing and row limits. In SQL Server you have a limit of 8060 bytes IIRC. Lots of fat non-varchar(max) columns with lots of data can quickly exceed that limit. In addition, indexes have a 900-byte cap on width IIRC. So, if you wanted to index on your surname column and some others that contain lots of data, you could exceed this limit.
Reporting and external systems. As a report designer you must assume that if a column is declared with a max length of 255, it could have 255 characters. If the user can do it, they will do it. Thus, to say, "It probably won't have more than 30 characters." is not even remotely the same as "It cannot have more than 30 characters." Never rely on the former. As a report designer, you have to work around the possibilities that users will enter a bunch of data into a column. That either means truncating the values (and if that is the case why have the additional space available?) or using CanGrow to make a lovely mess of a report. Either way, you make it harder on other developers to understand the intent of the column if the column size is so far out of whack with the actual data being stored.
I think that the biggest issue is data validation. If you allow 255 characters for a surname, you WILL get a surname that's 200+ characters in your database.
Another reason is that if you allow the database to hold 255 characters you now have to account for that possibility in every system that touches your database. For example, if you exported to a fixed-width column file all of your columns would have to be 255 characters wide, which could be pretty annoying or even problematic. That's just one example where it could cause a problem.
One good reason is validation.
(For example) In Holland a social security number is always 9 characters long; if you don't allow more, an invalid longer value can never occur.
If you allowed more and, for some unknown reason, a 10-character value showed up, you would need to add checks (which you otherwise wouldn't) to verify that every value is 9 characters long.
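A sketch of enforcing that in the schema itself (hypothetical table), so the width does the validation instead of application-side checks:

```sql
-- Hypothetical: a Dutch social security number is always 9 digits,
-- so the column width alone rejects anything longer, and the check
-- constraint rejects non-digit characters.
CREATE TABLE dbo.Citizen (
    CitizenId INT IDENTITY(1,1) PRIMARY KEY,
    Ssn       CHAR(9) NOT NULL,
    CONSTRAINT CK_Citizen_Ssn CHECK (Ssn NOT LIKE '%[^0-9]%')
);

-- Fails with a truncation error: the value is 10 characters long.
-- INSERT INTO dbo.Citizen (Ssn) VALUES ('0123456789');
```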
1) Readability & Support
A database developer could look at a field called StateCode with a length of varchar(2) and get a good idea of what kind of data that field holds, without even looking at the contents.
2) Reporting
When your data has no length constraint, you are expecting the developer to enforce that the column data is all similar in length. When reporting on that data, if the developer has failed to keep the column data consistent, the reports built on that data will be inconsistent and look funny.
3) SQL Server Data Storage
SQL Server stores data on 8k "pages" and from a performance standpoint it is ideal to be as efficient as possible and store as much data as possible on a page.
If your database is designed to store every string column as varchar(255), "bad" data could slip into one of those fields (for example, a state name might slip into a StateCode field that is meant to be 2 characters long), and cause unnecessary and inefficient page and index splits.
The other thing is that a single row of data is limited to 8060 bytes, and SQL Server uses the max length of varchar fields to determine this.
Reference: http://msdn.microsoft.com/en-us/library/ms143432.aspx