Does nvarchar always take twice as much space as varchar? - sql

Nvarchar is used to store unicode data which is used to store multilingual data. If you don't end up storing unicode does it still take up the same space?

YES.
See MSDN Books Online on NCHAR and NVARCHAR.
NCHAR:
The storage size is two times n bytes.
NVARCHAR
The storage size, in bytes, is two
times the number of characters entered
+ 2 bytes

Sort of. Not all unicode characters use two bytes. Utf-8, for example, is still just one byte per character a lot of the time, but rarely you may need 4 bytes per character. What nvarchar will do is allocate two bytes per character.

Related

Why should I use char instead of varchar? [duplicate]

This question already has answers here:
What are the use cases for selecting CHAR over VARCHAR in SQL?
(19 answers)
Closed 8 years ago.
Char and varchar are datatypes in SQL, as they are in many other languages(So this question could be multi-language).
From what I understand, the difference is that if I declared a Char as Char(20) it would allocate 20 (bytes/bits) [Could someone clarify this bit too? For now, I'll use bytes.]. Then if I only used 16 bytes, I would still have four allocated to that field. (A waste of 4 bytes of memory.)
However, if I declared a varchar as varchar(20) and only used 16 bytes, it would only allocate 16 bytes.
Surely this is better? Why would anyone choose char? Is it foe legacy reasons, or is there something I'm missing?
Prefer VARCHAR.
In olden days of tight storage, it mattered for space. Nowadays, disk storage is cheap, but RAM and IO are still precious. VARCHAR is IO and cache friendly; it allows you to more densely pack the db buffer cache with data rather than wasted literal "space" space, and for the same reason, space padding imposes an IO overhead.
The upside to CHAR() used to be reduced row chaining on frequently updated records. When you update a field and the value is larger than previously allocated, the record may chain. This is manageable, however; databases often support a "percent free" setting on your table storage attributes that tells the DB how much extra space to preallocate per row for growth.
VARCHAR is almost always preferable because space padding requires you to be aware of it and code differently. Different databases handle it differently. With VARCHAR you know your field holds only exactly what you store in it.
I haven't designed a schema in over a decade with CHAR.
FROM Specification
char[(n)]
Fixed-length non-Unicode character data with length of n bytes. n must
be a value from 1 through 8,000. Storage size is n bytes. The SQL-92
synonym for char is character.
So Char(20) will allocate fixed 20 Bytes space to hold the data.
Usage:
For example if you have a column named Gender and you want to assign values like only M for Male (OR) F for female and you are sure that the field/column are non-null column . In such case, it's much better to define it as CHAR(1) instead like
Gender CHAR(1) not null
Also, varchar types carries extra overhead of 2 bytes as stated in document . The storage size is the actual length of the data entered + 2 bytes.
In case of char that's not the case.

Number of real character allowed in SQL Server Varchar(max) or Text data type

I need to know the maximum number of characters I can put into a varchar(max) or text field using Sql Server. In this page I have found that the maximum number of bytes for storage is 2GB (2^31 - 1). Since I suppose, according to this page and other I've searched, the Unicode character is 2 byte sized, I conclude that I have to divide the total byte size for the Unicode character size, which does not give an integer result. Any sugestions where I am failing? Why does the page say the maximum string length is 2^31 - 1 instead of 2^31?
From SQL Server 2012 Help:
Variable-length, non-Unicode string data. ndefines the string length and can be a value from 1 through
8,000. maxindicates that the maximum storage size is 2^31-1 bytes (2 GB). The storage size is the actual
length of the data entered + 2 bytes. The ISO synonyms for varcharare char varyingor character
varying.

Why historically do people use 255 not 256 for database field magnitudes?

You often see database fields set to have a magnitude of 255 characters, what is the traditional / historic reason why? I assume it's something to do with paging / memory limits, and performance but the distinction between 255 and 256 has always confused me.
varchar(255)
Considering this is a capacity or magnitude, not an indexer, why is 255 preferred over 256? Is a byte reserved for some purpose (terminator or null or something)?
Presumably varchar(0) is a nonsense (has zero capacity)? In which case 2^8 of space should be 256 surely?
Are there other magnitudes that provide performance benefits? For example is varchar(512) less performant than varchar(511) or varchar(510)?
Is this value the same for all relations databases, old and new?
disclaimer - I'm a developer not a DBA, I use field sizes and types that suit my business logic where that is known, but I'd like to know the historic reason for this preference, even if it's no longer relevant (but even more if it still is relevant).
Edit:
Thanks for the answers, there seems to be some concensus that a byte is used to store size, but this doesn't settle the matter definitively in my mind.
If the meta data (string length) is stored in the same contiguous memory/disk, it makes some sense. 1 byte of metadata and 255 bytes of string data, would suit each other very nicely, and fit into 256 contiguous bytes of storage, which presumably is neat and tidy.
But...If the metadata (string length) is stored separately from the actual string data (in a master table perhaps), then to constrain the length of string's data by one byte, just because it's easier to store only a 1 byte integer of metadata seems a bit odd.
In both cases, it would seem to be a subtlety that probably depends on the DB implementation. The practice of using 255 seems pretty widespread, so someone somewhere must have argued a good case for it in the beginning, can anyone remember what that case was/is? Programmers won't adopt any new practice without a reason, and this must have been new once.
With a maximum length of 255 characters, the DBMS can choose to use a single byte to indicate the length of the data in the field. If the limit were 256 or greater, two bytes would be needed.
A value of length zero is certainly valid for varchar data (unless constrained otherwise). Most systems treat such an empty string as distinct from NULL, but some systems (notably Oracle) treat an empty string identically to NULL. For systems where an empty string is not NULL, an additional bit somewhere in the row would be needed to indicate whether the value should be considered NULL or not.
As you note, this is a historical optimisation and is probably not relevant to most systems today.
255 was the varchar limit in mySQL4 and earlier.
Also 255 chars + Null terminator = 256
Or 1 byte length descriptor gives a possible range 0-255 chars
255 is the largest numerical value that can be stored in a single-byte unsigned integer (assuming 8-bit bytes) - hence, applications which store the length of a string for some purpose would prefer 255 over 256 because it means they only have to allocate 1 byte for the "size" variable.
From MySQL Manual:
Data Type:
VARCHAR(M), VARBINARY(M)
Storage Required:
L + 1 bytes if column values require 0 – 255 bytes, L + 2 bytes if values may require more than 255 bytes
Understand and make choice.
255 is the maximum value of a 8 bit integer : 11111111 = 255.
Are there other magnitudes that provide performance benefits? For example is varchar(512) less performant than varchar(511) or varchar(510)?
Recollected the fundamentals of the bits/bytes storage, it requires one byte to store integers below 256 and two bytes for any integer between 256 and 65536.
Hence, it requires same space (two bytes) to store 511 or 512 or for that matter 65535....
Thus it is clear that the this argument mentioned in the discussion above is N/A for varchar(512) or varchar(511).
A maximum length of 255 allows the database engine to use only 1 byte to store the length of each field. You are correct that 1 byte of space allows you to store 2^8=256 distinct values for the length of the string.
But if you allow the field to store zero-length text strings, you need to be able to store zero in the length. So you can allow 256 distinct length values, starting at zero: 0-255.
It used to be that all strings required a NUL terminator, or "backslash-zero". Updated databases don't have that. It was "255 characters of text" with a "\0" added automatically at the end so the system knew where the string ended. If you said VARCHAR(256), it would end up being 257 and then you'd be in the next register for one character. Wasteful. That's why everything was VARCHAR(255) and VARCHAR(31). Out of habit the 255 seems to have stuck around but the 31's became 32's and the 511's became 512's. That part is weird. It's hard to make myself write VARCHAR(256).
Often varchars are implemented as pascal strings: holding the actual length in the byte #0. The length was therefore bound to 255. (Value of a byte varies from 0 to 255.)
8 bits unsigned = 256 bytes
255 characters + byte 0 for length
I think this might answer your question. Looks like it was the max limit of varchar in earlier systems. I took it off another stackoverflow question.
It's hard to know what the longest postal address is, of course, which is why many people choose a long VARCHAR that is certainly longer than any address. And 255 is customary because it may have been the maximum length of a VARCHAR in some databases in the dawn of time (as well as PostgreSQL until more recently).
Are there disadvantages to using a generic varchar(255) for all text-based fields?
Data is saved in memory in binary system and 0 and 1 are binary digits. Largest binary number that can fit in 1 byte (8-bits) is 11111111 which converts to decimal 255.

UTF-8 vs ASCII Text

Why does sql database use UTF-8 Encoding? do they both use 8-bit to store a character?
UTF-8 is used to support a large range of characters. In UTF-8, up to 4 bytes can be used to represent a single character.
Joel has written an article on this subject that you may want to refer to
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
For "normal" characters, only 8 bits are used. For characters that do not fit in 8 bits more bits can be used. This makes UTF-8 is a variable length encoding.
Wikipedia has a good article on UTF-8.
ASCII only defines 128 character. So only 7 bits. But is normally stored with 8 bits/character. RS232 (old serial communication) can be used with bytes of 7 bits.
ASCII can only represent a limited number of characters at one time. It isn't very useful to represent any language that isn't based on a Latin character set. However, UTF-8 which is an encoding standard for UCS-4 (Unicode) can represent almost any language. It does this by chaining multiple bytes together to represent one character (or glyph to be more correct).
A more sophisticated encoding increases the index access time drastically. It's something to think about, when encountering performance problems in writing or reading from an database.

[My]SQL VARCHAR Size and Null-Termination

Disclaimer: I'm very new to SQL and databases in general.
I need to create a field that will store a maximum of 32 characters of text data. Does "VARCHAR(32)" mean that I have exactly 32 characters for my data? Do I need to reserve an extra character for null-termination?
I conducted a simple test and it seems that this is a WYSIWYG buffer. However, I wanted to get a concrete answer from people who actually know what they're doing.
I have a C[++] background, so this question is raising alarm bells in my head.
Yes, you have 32 characters at your disposal. SQL does not concern itself with nul terminated strings like some programming languages do.
Your VARCHAR specification size is the max size of your data, so in this case, 32 characters. However, VARCHARS are a dynamic field, so the actual physical storage used is only the size of your data, plus one or two bytes.
If you put a 10-character string into a VARCHAR(32), the physical storage will be 11 or 12 bytes (the manual will tell you the exact formula).
However, when MySQL is dealing with result sets (ie. after a SELECT), 32 bytes will be allocated in memory for that field for every record.