At what point does it become more efficient to use a text field than an nvarchar field in SQL Server? - sql

How long does an nvarchar field need to be before it is better to use a text field in SQL Server? What are the general indications for using one or the other for textual content that may or may not be queried?

From what I understand, the TEXT datatype should never be used in SQL 2005+. You should start using VARCHAR(MAX) instead.
See this question about VARCHAR(MAX) vs. TEXT.
UPDATE (per comment):
This blog does a good job at explaining the advantages. Taken from it:
But the pain from using the type text comes in when trying to query against it. For example, grouping by a text type is not possible.
Another downside to using text types is increased disk IO due to the fact each record now points to a blob (or file).
So basically, VARCHAR(MAX) keeps the data with the record, and gives you the ability to treat it like other VARCHAR types, like using GROUP BY and string functions (LEN, CHARINDEX, etc.).
For TEXT, you almost always have to convert it to VARCHAR to use functions against it.
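To make that concrete, here is a small hypothetical sketch (the dbo.Notes table and its columns are made up for illustration): string functions apply directly to a VARCHAR(MAX) column, while a TEXT column has to be converted first.

-- NoteVarchar is VARCHAR(MAX); NoteText is the legacy TEXT type
SELECT LEN(NoteVarchar)                    AS varchar_len,   -- works directly
       CHARINDEX('error', NoteVarchar)     AS varchar_pos,   -- works directly
       LEN(CAST(NoteText AS VARCHAR(MAX))) AS text_len       -- TEXT must be converted before string functions apply
FROM dbo.Notes;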
But back to the root of your question regarding efficiency, I don't think it's ever more efficient to use TEXT vs. VARCHAR(MAX). Looking at this MSDN article (search for "data types"), TEXT is deprecated, and should be replaced with VARCHAR(MAX).

First of all, don't use text at all. MSDN says:
ntext, text, and image data types will be removed in a future version of Microsoft SQL Server. Avoid using these data types in new development work, and plan to modify applications that currently use them. Use nvarchar(max), varchar(max), and varbinary(max) instead.
varchar(max) is what you might need.
If you compare varchar(n) vs varchar(max), these are technically two different datatypes (stored differently):
varchar(n) value is always stored inside of the row. This means it cannot be larger than the maximum row size, and a row cannot be larger than the page size, which is 8K.
varchar(max) is stored outside the row. The row holds a pointer to a separate BLOB page. However, under certain conditions varchar(max) can store data as a regular in-row value; obviously it must then fit within the row size.
So if your row is potentially greater than 8K, you have to use varchar(max). If not, using varchar(n) will likely be preferable, as it is faster to retrieve in-row data than data from a separate page.
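As a rough sketch of how that plays out (table and column names are placeholders, not from the question): varchar(n) always lives in the row, while varchar(max) values are kept in-row only when they fit, unless you push them off-row with the table option shown below.

CREATE TABLE dbo.Documents (
    Title VARCHAR(200),   -- always stored in-row, counts toward the 8K row-size limit
    Body  VARCHAR(MAX)    -- small values stay in-row by default; large values move to separate LOB pages
);
-- Optional: force varchar(max) values off-row regardless of size
EXEC sys.sp_tableoption 'dbo.Documents', 'large value types out of row', 1;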
MSDN says:
Use varchar(max) when the sizes of the column data entries vary considerably, and the size might exceed 8,000 bytes.

The main advantage of VARCHAR over TEXT is that you can run string manipulations and string functions on it. With VARCHAR(max), you basically have an awesome large (unrestricted) variable that you can manipulate however you want.


What is the best way to store a large amount of text in a table in SQL server?
Is varchar(max) reliable?
In SQL 2005 and higher, VARCHAR(MAX) is indeed the preferred method. The TEXT type is still available, but primarily for backward compatibility with SQL 2000 and lower.
I like using VARCHAR(MAX) (or actually NVARCHAR) because it works like a standard VARCHAR field. Since its introduction, I use it rather than TEXT fields whenever possible.
Varchar(max) is available only in SQL 2005 or later. This will store up to 2GB and can be treated as a regular varchar. Before SQL 2005, use the "text" type.
According to the text found here, varbinary(max) is the way to go. You'll be able to store approximately 2GB of data.
Split the text into chunks that your database can actually handle, and put the split-up text in another table. Use the id from the text_chunk table as text_chunk_id in your original table. You might want another column in your original table to keep text that fits within your largest text data type.
CREATE TABLE text_chunk (
    id             INT           NOT NULL,  -- referenced as text_chunk_id from the original table
    chunk_sequence INT           NOT NULL,
    chunk_text     NVARCHAR(MAX) NOT NULL   -- "text" is a (deprecated) type name, so a clearer column name is used
);
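A hedged sketch of how the chunk table might be used, assuming the id value 42 is what the original table stores in its text_chunk_id column:

-- Store one piece of a long document as chunk #1
INSERT INTO text_chunk (id, chunk_sequence, chunk_text)
VALUES (42, 1, N'first portion of the long text...');

-- Reassemble by reading the chunks back in order
SELECT chunk_text FROM text_chunk WHERE id = 42 ORDER BY chunk_sequence;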
In a BLOB
BLOBs are very large variable binary or character data, typically documents (.txt, .doc) and pictures (.jpeg, .gif, .bmp), which can be stored in a database. In SQL Server, BLOBs can use the text, ntext, or image data types; for character data you can use the text type:
text
Variable-length non-Unicode data, stored in the code page of the server, with a maximum length of 2^31 - 1 (2,147,483,647) characters.
Depending on your situation, a design alternative to consider is saving them as .txt file to server and save the file path to your database.
Use nvarchar(max) to store the whole chat conversation thread in a single record. Each individual text message (or block) is identified in the content text by inserting markers.
Example:
{{UserId: Date and time}}<Chat Text>.
At display time the UI should be intelligent enough to understand these markers and render them correctly. This way one record should suffice for a single conversation, as long as the size limit is not reached.
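A minimal sketch of the idea, with a made-up table and marker values (nothing here is prescribed beyond the marker format above):

CREATE TABLE dbo.Conversation (
    ConversationId INT           NOT NULL PRIMARY KEY,
    Content        NVARCHAR(MAX) NOT NULL
);

-- Append one message block to the single record for this conversation
UPDATE dbo.Conversation
SET Content = Content + N'{{1017: 2009-06-01 14:32}}Sounds good, see you then.'
WHERE ConversationId = 1;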

How are you supposed to choose K when creating a VARCHAR(K) column? Or are you?

This is something I've never understood. Let's say I want a column that is storing an email address. I think,
"Ok, email addresses are usually no more than 15 characters, but I'll
say 50 max characters just to play it safe."
and making that column VARCHAR(50). Of course, then this means that I have to create extra code, possibly both client- and server-side validation of entries to that column.
That brings up the question of why not just use NVARCHAR all the time, except in those rare circumstances where the logic of my application dictates a fixed or maximum length. From what I understand, if I create a VARCHAR(50) and none of the entries are more than 25 characters, that does not mean that 50% of the space is wasted, as the database knows how to optimize everything.
Again, that brings up the question of why not just use NVARCHAR.
nvarchar itself has nothing to do with "unlimited length of string", since it is just the Unicode version of varchar. At present there is no reason to use varchar (except for some backward compatibility issues), and nvarchar should be preferred.
So I'm supposing you're asking why not use nvarchar(max) everywhere, which is almost unlimited (2 GB of storage), instead of specifying nvarchar(n) for concrete columns.
There are many reasons to use nvarchar(n) instead of nvarchar(max).
For example, if your column needs to be included in an index, it can't be nvarchar(max).
Also, nvarchar(max) data is internally stored differently than nvarchar(n), and sometimes that can affect performance.
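For example, a quick sketch of the index restriction (the dbo.Articles table is hypothetical): nvarchar(max) cannot be an index key column, though it can still be carried along as an INCLUDEd column.

CREATE TABLE dbo.Articles (
    Title NVARCHAR(200),
    Body  NVARCHAR(MAX)
);

CREATE INDEX IX_Articles_Title ON dbo.Articles (Title);        -- fine: nvarchar(200) can be a key column
-- CREATE INDEX IX_Articles_Body ON dbo.Articles (Body);       -- fails: max types are invalid as index key columns
CREATE INDEX IX_Articles_Title_Incl ON dbo.Articles (Title) INCLUDE (Body);  -- allowed: max types can be INCLUDEd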

Best data type for storing strings in SQL Server?

What's the best data type to be used when storing strings, like a first name? I've seen varchar and nvarchar both used. Which one is better? Does it matter?
I've also heard that the best length to use is 255, but I don't know why. Is there a specific length that is preferred for strings?
nvarchar stores Unicode character data, which is required if you plan to store non-English names. If it's a web application, I highly recommend using nvarchar even if you don't plan on being international. The downside is that it consumes twice as much space: 16 bits per character for nvarchar versus 8 bits per character for varchar.
What's the best data type to be used when storing strings, like a first name? I've seen varchar and nvarchar both used. Which one is better? Does it matter?
See What is the difference between nchar(10) and varchar(10) in MSSQL?
If you need non-ASCII characters, you have to use nchar/nvarchar. If you don't, then you may want to use char/varchar to save space.
Note that this issue is specific to MS SQL Server, which doesn't have good support for UTF-8. In other SQL implementations that do, you can use Unicode strings with no extra space requirements (for English).
EDIT: Since this answer was originally written, SQL Server 2019 (15.x) finally introduced UTF-8 support. You may want to consider using it as your default database text encoding.
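For instance, a hedged sketch using one of the UTF-8 collations introduced in SQL Server 2019 (the table and collation choice are illustrative, not a recommendation for your schema):

CREATE TABLE dbo.People (
    FirstName VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8  -- varchar, but stores Unicode as UTF-8
);

INSERT INTO dbo.People (FirstName) VALUES (N'José');  -- non-ASCII text survives; ASCII characters still take 1 byte each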
I've also heard that the best length to use is 255, but I don't know why.
See Is there a good reason I see VARCHAR(255) used so often (as opposed to another length)?
Is there a specific length that is preferred for strings?
If your data has a well-defined maximum limit (e.g., 17 characters for a VIN), then use that.
OTOH, if the limit is arbitrary, then choose a generous maximum size to avoid rejecting valid data. In SQL Server, you may want to consider the 900-byte maximum size of index keys.
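As an illustrative sketch of the 900-byte consideration (the table is made up): nvarchar uses 2 bytes per character, so nvarchar(450) is the largest nvarchar that can still serve as a clustered index key.

CREATE TABLE dbo.Customer (
    Email NVARCHAR(450) NOT NULL   -- 450 * 2 = 900 bytes, the clustered index key limit
);

CREATE CLUSTERED INDEX IX_Customer_Email ON dbo.Customer (Email);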
nvarchar means you can store Unicode characters in it. There is a 2 GB limit for nvarchar(max). If the field length is more than 4000 characters, an overflow page is used. Smaller fields mean one page can hold more rows, which increases query performance.
Generally, for small strings use nvarchar(n), which supports Unicode characters. The string is compressed when used with row or page compression (at least one of which is generally desirable).
Large strings need nvarchar(max), which Unicode compression does not support.
For special-case scenarios when your data set never uses Unicode characters, varchar(n) and varchar(max) restrict the string to one byte per character.
If you know the max length (n) is less than 256, SQL Server only needs 1 byte to store the string length. This reduces storage space by about half a percent compared to a string type whose max length is just over 255.

What problems can an NVARCHAR(3000) cause

I have an already large table that my clients are asking for me to extend the length of the notes field. The notes field is already an NVARCHAR(1000) and I am being asked to expand it to 3000. The long term solution is to move notes out of the table and create a notes table that uses an NVARCHAR(max) field that is only joined in when necessary. My question is about the short term. Knowing that this field will be moved out in the future what problems could I have if I just increase the field to an NVARCHAR(3000) for now?
text and ntext are deprecated in favor of varchar(max) and nvarchar(max). So nvarchar(3000) should be fine.
You probably already know this, but just make sure that the increased length doesn't drive your total record length over 8000. I'm pretty sure that still applies for 2005/2008.
You should be fine with nvarchar(3000) for the interim solution. You can go up to a maximum of nvarchar(4000). And as posted earlier by km.srd.myopenid.com, make sure that the entire length of your row doesn't exceed 8000 (remember that nvarchar is 2x the size of a regular varchar - which is why you can only have nvarchar(4000), but you can have varchar(8000)).
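The interim change itself is a one-liner; a sketch with placeholder table and column names (adjust NULL/NOT NULL to match your current definition):

ALTER TABLE dbo.ClientRecords ALTER COLUMN Notes NVARCHAR(3000) NULL;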
I would suggest changing the column to NTEXT. You will have virtually no limit on the amount of data and the data is not stored with the rest of the row data. This helps keep you from hitting the maximum row size limit.
The only drawback is that you can only perform "LIKE" searches on that column and you cannot index it. However, if it's a notes field, my guess is that you are not doing any searching on it at all.
You may also experience more slowness, as your data pages may get split up to accommodate the larger field. You can create a structure that allows a record of more than 8060 bytes by doing this, but be aware that if you try to add a data record that actually contains more than that, you will have a problem.

What are the use cases for selecting CHAR over VARCHAR in SQL?

I realize that CHAR is recommended if all my values are fixed-width. But, so what? Why not just pick VARCHAR for all text fields just to be safe.
The general rule is to pick CHAR if all rows will have close to the same length. Pick VARCHAR (or NVARCHAR) when the length varies significantly. CHAR may also be a bit faster because all the rows are of the same length.
It varies by DB implementation, but generally, VARCHAR (or NVARCHAR) uses one or two more bytes of storage (for length or termination) in addition to the actual data. So (assuming you are using a one-byte character set) storing the word "FooBar":
CHAR(6) = 6 bytes (no overhead)
VARCHAR(100) = 8 bytes (2 bytes of overhead)
CHAR(10) = 10 bytes (4 bytes of waste)
The bottom line is CHAR can be faster and more space-efficient for data of relatively the same length (within two characters length difference).
Note: Microsoft SQL has 2 bytes of overhead for a VARCHAR. This may vary from DB to DB, but generally, there is at least 1 byte of overhead needed to indicate length or EOL on a VARCHAR.
As was pointed out by Gaven in the comments: things change when it comes to multi-byte character sets, and that is a case where VARCHAR becomes a much better choice.
A note about the declared length of the VARCHAR: because it stores the length of the actual content, you don't waste space on unused length. So storing 6 characters in VARCHAR(6), VARCHAR(100), or VARCHAR(MAX) uses the same amount of storage. Read more about the differences when using VARCHAR(MAX). You declare a maximum size in VARCHAR to limit how much is stored.
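You can see both effects with DATALENGTH, which reports the bytes actually used (the variables here are just for illustration; the roughly 2 bytes of varchar length overhead are not included in what DATALENGTH reports):

DECLARE @c CHAR(10)     = 'FooBar';
DECLARE @v VARCHAR(100) = 'FooBar';

SELECT DATALENGTH(@c) AS char_bytes,     -- 10: padded out to the declared length
       DATALENGTH(@v) AS varchar_bytes;  -- 6: only the actual content is stored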
In the comments AlwaysLearning pointed out that the Microsoft Transact-SQL docs seem to say the opposite. I would suggest that is an error or at least the docs are unclear.
If you're working with me and you're working with Oracle, I would probably make you use varchar in almost every circumstance. The assumption that char uses less processing power than varchar may be true...for now...but database engines get better over time, and this sort of general rule has the makings of a future "myth".
Another thing: I have never seen a performance problem because someone decided to go with varchar. You will make much better use of your time writing good code (fewer calls to the database) and efficient SQL (how do indexes work, how does the optimizer make decisions, why is exists usually faster than in...).
Final thought: I have seen all sorts of problems with use of CHAR, people looking for '' when they should be looking for ' ', or people looking for 'FOO' when they should be looking for 'FOO (bunch of spaces here)', or people not trimming the trailing blanks, or bugs with Powerbuilder adding up to 2000 blanks to the value it returns from an Oracle procedure.
In addition to performance benefits, CHAR can be used to indicate that all values should be the same length, e.g., a column for U.S. state abbreviations.
Char is a little bit faster, so if you have a column that you KNOW will be a certain length, use char. For example, storing (M)ale/(F)emale/(U)nknown for gender, or 2 characters for a US state.
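A tiny sketch of that pattern (hypothetical columns):

CREATE TABLE dbo.Person (
    Gender    CHAR(1),       -- always exactly one character: 'M', 'F', or 'U'
    StateCode CHAR(2),       -- always exactly two characters, e.g. 'MD'
    City      VARCHAR(100)   -- length varies, so varchar
);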
Does NChar or Char perform better that their var alternatives?
Great question. The simple answer is yes in certain situations. Let's see if this can be explained.
Obviously we all know that if I create a table with a column of varchar(255) (let's call this column myColumn) and insert a million rows but put only a few characters into myColumn for each row, the table will be much smaller (overall number of data pages needed by the storage engine) than if I had created myColumn as char(255). Anytime I do an operation (DML) on that table and request a lot of rows, it will be faster when myColumn is varchar because I don't have to move around all those "extra" spaces at the end. Move, as in when SQL Server does internal sorts such as during a distinct or union operation, or if it chooses a merge during its query plan, etc. Move could also mean the time it takes to get the data from the server to my local pc or to another computer or wherever it is going to be consumed.
But there is some overhead in using varchar. SQL Server has to use a two-byte indicator (overhead) on each row to know how many bytes that particular row's myColumn has in it. It's not the extra 2 bytes that presents the problem, it's the having to "decode" the length of the data in myColumn on every row.
In my experience it makes the most sense to use char instead of varchar on columns that will be joined to in queries. For example, the primary key of a table, or some other column that will be indexed. CustomerNumber on a demographic table, or CodeID on a decode table, or perhaps OrderNumber on an order table. By using char, the query engine can more quickly perform the join because it can do straight pointer arithmetic (deterministically) rather than having to move its pointers a variable number of bytes as it reads the pages. I know I might have lost you on that last sentence. Joins in SQL Server are based around the idea of "predicates." A predicate is a condition. For example myColumn = 1, or OrderNumber < 500.
So if SQL Server is performing a DML statement, and the predicates, or "keys" being joined on are a fixed length (char), the query engine doesn't have to do as much work to match rows from one table to rows from another table. It won't have to find out how long the data is in the row and then walk down the string to find the end. All that takes time.
Now bear in mind this can easily be poorly implemented. I have seen char used for primary key fields in online systems. The width must be kept small i.e. char(15) or something reasonable. And it works best in online systems because you are usually only retrieving or upserting a small number of rows, so having to "rtrim" those trailing spaces you'll get in the result set is a trivial task as opposed to having to join millions of rows from one table to millions of rows on another table.
Another reason CHAR makes sense over varchar on online systems is that it reduces page splits. By using char, you are essentially "reserving" (and wasting) that space so if a user comes along later and puts more data into that column SQL has already allocated space for it and in it goes.
Another reason to use CHAR is similar to the second reason. If a programmer or user does a "batch" update to millions of rows, adding some sentence to a note field for example, you won't get a call from your DBA in the middle of the night wondering why their drives are full. In other words, it leads to more predictable growth of the size of a database.
So those are 3 ways an online (OLTP) system can benefit from char over varchar. I hardly ever use char in a warehouse/analysis/OLAP scenario because usually you have SO much data that all those char columns can add up to lots of wasted space.
Keep in mind that char can make your database much larger but most backup tools have data compression so your backups tend to be about the same size as if you had used varchar. For example LiteSpeed or RedGate SQL Backup.
Another use is in views created for exporting data to a fixed width file. Let's say I have to export some data to a flat file to be read by a mainframe. It is fixed width (not delimited). I like to store the data in my "staging" table as varchar (thus consuming less space in my database) and then use a view to CAST everything to its char equivalent, with the length corresponding to the fixed width for that column. For example:
create table tblStagingTable (
    pkID BIGINT IDENTITY(1,1),
    CustomerFirstName varchar(30),
    CustomerLastName varchar(30),
    CustomerCityStateZip varchar(100),
    CustomerCurrentBalance money )

insert into tblStagingTable
    (CustomerFirstName, CustomerLastName, CustomerCityStateZip, CustomerCurrentBalance)
values
    ('Joe', 'Blow', '123 Main St Washington, MD 12345', 123.45)
create view vwStagingTable AS
SELECT CustomerFirstName = CAST(CustomerFirstName as CHAR(30)),
       CustomerLastName = CAST(CustomerLastName as CHAR(30)),
       CustomerCityStateZip = CAST(CustomerCityStateZip as CHAR(100)),
       CustomerCurrentBalance = CAST(CAST(CustomerCurrentBalance as NUMERIC(9,2)) AS CHAR(10))
FROM tblStagingTable
SELECT * from vwStagingTable
This is cool because internally my data takes up less space since it's using varchar. But when I use DTS or SSIS or even just a cut and paste from SSMS to Notepad, I can use the view and get the right number of trailing spaces. In DTS we used to have a feature called, damn, I forget, I think it was called "suggest columns" or something. In SSIS you can't do that anymore; you have to tediously define the flat file connection manager. But since you have your view set up, SSIS can know the width of each column, and it can save a lot of time when building your data flow tasks.
So bottom line... use varchar. There are a very small number of reasons to use char, and they are only about performance. If you have a system with hundreds of millions of rows you will see a noticeable difference if the predicates are deterministic (char), but for most systems using char is simply wasting space.
Hope that helps.
Jeff
There are performance benefits, but here is one that has not been mentioned: row migration. With char, you reserve the entire space in advance. So let's say you have a char(1000) and you store 10 characters: you will use up all 1000 characters of space. In a varchar2(1000), you will only use 10 characters. The problem comes when you modify the data. Let's say you update the column to now contain 900 characters. It is possible that the space to expand the varchar is not available in the current block. In that case, the DB engine must migrate the row to another block, and leave a pointer in the original block to the new row in the new block. To read this data, the DB engine will now have to read 2 blocks.
No one can unequivocally say that varchar or char is better. There is a space versus time tradeoff, and consideration of whether the data will be updated, especially if there is a good chance that it will grow.
There is a difference between early performance optimization and using a best-practice type of rule. If you are creating new tables where you will always have a fixed-length field, it makes sense to use CHAR; you should be using it in that case. This isn't early optimization, but rather implementing a rule of thumb (or best practice).
i.e. - If you have a 2 letter state field, use CHAR(2). If you have a field with the actual state names, use VARCHAR.
I would choose varchar unless the column stores fixed value like US state code -- which is always 2 chars long and the list of valid US states code doesn't change often :).
In every other case, even like storing hashed password (which is fixed length), I would choose varchar.
Why? A char type column is always padded with spaces, which makes the following comparison, for a column my_column defined as char(5) with the value 'ABC' inside:
my_column = 'ABC' -- my_column stores 'ABC  ', which is different from 'ABC'
false.
This feature could lead to many irritating bugs during development and makes testing harder.
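The padding itself is easy to demonstrate (a throwaway table variable, just for illustration); whether a plain = comparison trips over it depends on your database and padding settings, but the trailing spaces definitely show up in lengths and string operations:

DECLARE @t TABLE (my_column CHAR(5));
INSERT INTO @t VALUES ('ABC');

SELECT DATALENGTH(my_column) AS bytes_stored,   -- 5, not 3: the value is padded with trailing spaces
       '[' + my_column + ']' AS concatenated    -- '[ABC  ]': the padding leaks into string operations
FROM @t;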
CHAR takes up less storage space than VARCHAR if all your data values in that field are the same length. Now perhaps in 2009 an 800 GB database is, for all intents and purposes, the same as an 810 GB one if you converted the VARCHARs to CHARs, but for short strings (1 or 2 characters), CHAR is still an industry "best practice", I would say.
Now if you look at the wide variety of data types most databases provide even for integers alone (bit, tiny, int, bigint), there ARE reasons to choose one over the other. Simply choosing bigint every time is actually being a bit ignorant of the purposes and uses of the field. If a field simply represents a person's age in years, a bigint is overkill. Now it's not necessarily "wrong", but it's not efficient.
But it's an interesting argument, and as databases improve over time, it could be argued that CHAR vs VARCHAR does get less relevant.
I would NEVER use chars. I've had this debate with many people and they always bring up the tired cliché that char is faster. Well I say, how much faster? What are we talking about here, milliseconds, seconds, and if so, how many? You're telling me that because someone claims it's a few milliseconds faster, we should introduce tons of hard-to-fix bugs into the system?
So here are some issues you will run into:
Every field will be padded, so you end up with code forever that has RTRIMS everywhere. This is also a huge disk space waste for the longer fields.
Now let's say you have the quintessential example of a char field of just one character, but the field is optional. If somebody passes an empty string to that field it becomes one space. So when another application/process queries it, they get one single space if they don't use rtrim. We've had XML documents, files, and other programs display just one space in optional fields and break things.
So now you have to ensure that you're passing nulls and not empty strings to the char field. But that's NOT the correct use of null. Here is the proper use of null. Let's say you get a file from a vendor:
Name|Gender|City
Bob||Los Angeles
If gender is not specified, then you enter Bob, empty string, and Los Angeles into the table. Now let's say you get the file and its format changes, and gender is no longer included but was in the past.
Name|City
Bob|Seattle
Well, now that gender is not included, I would use null. Varchars support this without issues.
Char on the other hand is different. You always have to send null. If you ever send empty string, you will end up with a field that has spaces in it.
I could go on and on with all the bugs I've had to fix from chars in about 20 years of development.
I stand by Jim McKeeth's comment.
Also, indexing and full table scans are faster if your table has only CHAR columns. Basically the optimizer will be able to predict how big each record is if it only has CHAR columns, while it needs to check the size value of every VARCHAR column.
Besides if you update a VARCHAR column to a size larger than its previous content you may force the database to rebuild its indexes (because you forced the database to physically move the record on disk). While with CHAR columns that'll never happen.
But you probably won't care about the performance hit unless your table is huge.
Remember Knuth's wise words: premature optimization is the root of all evil.
Many people have pointed out that if you know the exact length of the value, using CHAR has some benefits. But while storing US states as CHAR(2) is great today, when you get the message from sales that 'We have just made our first sale to Australia', you are in a world of pain. I always tend to overestimate how long I think fields will need to be rather than making an 'exact' guess to cover for future events. VARCHAR will give me more flexibility in this area.
I think in your case there is probably no reason not to pick Varchar. It gives you flexibility and, as has been mentioned by a number of respondents, performance is such now that, except in very specific circumstances, we mere mortals (as opposed to Google DBAs) will not notice the difference.
An interesting thing worth noting when it comes to DB types is that SQLite (a popular mini database with pretty impressive performance) stores everything in the database as a string and applies types on the fly.
I always use VarChar and usually make it much bigger than I might strictly need, e.g. 50 for FirstName, as you say, why not, just to be safe.
It's the classic space versus performance tradeoff.
In MS SQL 2005, Varchar (or NVarchar for languages requiring two bytes per character, i.e. Chinese) is variable length. If you add to the row after it has been written to the hard disk, it will locate the data in a non-contiguous location to the original row and lead to fragmentation of your data files. This will affect performance.
So, if space is not an issue, then Char is better for performance, but if you want to keep the database size down, then varchars are better.
Fragmentation. Char reserves space and VarChar does not. Page split can be required to accommodate update to varchar.
There is some small processing overhead in calculating the actual needed size for a column value and allocating the space for a Varchar, so if you are definitely sure how long the value will always be, it is better to use Char and avoid the hit.
When using varchar values, SQL Server needs an additional 2 bytes per row to store some info about that column, whereas if you use char it doesn't need that
so unless you
Using CHAR (NCHAR) and VARCHAR (NVARCHAR) brings differences in the way the database server stores the data. The first one introduces trailing blanks; I have encountered problems when using it with the LIKE operator in SQL Server functions. So I make it safe by using VARCHAR (NVARCHAR) all the time.
For example, if we have a table TEST(ID INT, Status CHAR(1)), and you write a function to list all the records with some specific value like the following:
CREATE FUNCTION List(@Status AS CHAR(1) = '')
RETURNS TABLE
AS
RETURN
(
    SELECT * FROM TEST
    WHERE Status LIKE '%' + @Status + '%'
)
In this function we expect that when we pass the default parameter the function will return all the rows, but in fact it does not. Changing the @Status data type to VARCHAR fixes the issue.
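For illustration (assuming the function was created in the dbo schema), calling it with the default shows the difference:

SELECT * FROM dbo.List(DEFAULT);
-- With @Status as CHAR(1), the default '' is padded to ' ', the predicate becomes LIKE '% %',
-- and only rows whose Status contains a space match. With VARCHAR(1) the default stays '',
-- the predicate is LIKE '%%', and every row is returned.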
In some SQL databases, VARCHAR will be padded out to its maximum size in order to optimize the offsets; this is to speed up full table scans and index access.
Because of this, you do not get any space savings by using a VARCHAR(200) compared to a CHAR(200).